Introduction to Cloud Operations Suite

About this codelab

Last updated Aug 4, 2023

Written by Romin Irani

1. Introduction

Last Updated: 2023-07-28

What is Google Cloud Operations Suite?

Google Cloud Operations Suite is a platform where you can monitor, troubleshoot, and improve application performance in your Google Cloud environment. Its key pillars include Cloud Monitoring, Cloud Logging and Cloud Trace.

Check out this video to get a high level overview of Google Cloud Operations.

What you'll build

In this codelab, you're going to deploy a sample API on Google Cloud. You will then explore and configure multiple features in Cloud Monitoring vis-a-vis the API.

What you'll learn

Use of Google Cloud's Cloud Shell to deploy a sample application to Cloud Run.
Use of Google Cloud Monitoring features like Dashboards, Alerts, Uptime Checks, SLI/SLO Monitoring and more.

What you'll need

A recent version of Chrome (74 or later)
A Google Cloud Account and Google Cloud Project

2. Setup and Requirements

Self-paced environment setup

If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign in to the Google Cloud Platform console (console.cloud.google.com) and create a new project.

The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can update it at any time.
The Project ID must be unique across all Google Cloud projects and is immutable (cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference the Project ID (it is typically identified as PROJECT_ID). If you don't like the generated ID, you may generate another random one. Alternatively, you can try your own and see if it's available. It cannot be changed after this step and will remain for the duration of the project.
For your information, there is a third value, a Project Number which some APIs use. Learn more about all three of these values in the documentation.

Caution: A project ID must be globally unique and cannot be used by anyone else after you've selected it. You are the only user of that ID. Even if a project is deleted, the ID can never be used again.

Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab shouldn't cost much, if anything at all. To shut down resources so you don't incur billing beyond this tutorial, you can delete the resources you created or delete the whole project. New users of Google Cloud are eligible for the $300 USD Free Trial program.

Google Cloud Shell Setup

While Google Cloud can be operated remotely from your laptop, in this codelab we will be using Google Cloud Shell, a command line environment running in the Cloud.

To activate Cloud Shell from the Cloud Console, simply click Activate Cloud Shell (it should only take a few moments to provision and connect to the environment).

If you've never started Cloud Shell before, you're presented with an intermediate screen (below the fold) describing what it is. If that's the case, click Continue (and you won't ever see it again). Here's what that one-time screen looks like:

It should only take a few moments to provision and connect to Cloud Shell.

This virtual machine is loaded with all the development tools you need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with simply a browser or your Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.

Run the following command in Cloud Shell to confirm that you are authenticated:

gcloud auth list

Command output

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If, for some reason, the project is not set, simply issue the following command:

gcloud config set project <PROJECT_ID>

Cloud Shell also sets some environment variables by default, which may be useful as you run future commands.

echo $GOOGLE_CLOUD_PROJECT

Command output

<PROJECT_ID>

Sample Applications

We've put everything you need for this project into a Git repo. The repo contains a couple of sample applications and you can choose to use any of them for this exercise.

Git repo link: https://github.com/rominirani/cloud-code-sample-repository

3. Deploy the API application

What is the sample application or API about?

Our application is a simple Inventory API application that exposes a REST API endpoint with a couple of operations: one to list the inventory items and one to get the inventory count for a specific item.

Once we deploy the API and assuming that it is hosted at https://<somehost>, we can access the API endpoints as follows:

https://<somehost>/inventory

This will list all the product items with their on-hand inventory levels.

https://<somehost>/inventory/{productid}

This will provide a single record with the productid and on-hand inventory level for that product.

The response data returned is in JSON format.

Sample Data and API Request/Response

To keep things simple, the application is not backed by a database. It contains 3 sample product ids and their on-hand inventory levels.

Product Id	On-Hand Inventory Level
I-1	10
I-2	20
I-3	30

Sample API Request and Response are shown below:

API Request	API Response
https://<somehost>/inventory	[ { "I-1": 10, "I-2": 20, "I-3": 30 }]
https://<somehost>/inventory/I-1	{ "productid": "I-1", "qty": 10}
https://<somehost>/inventory/I-2	{ "productid": "I-2", "qty": 20}
https://<somehost>/inventory/I-200	{ "productid": "I-200", "qty": -1}
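
The lookup behaviour in the table above can be sketched in a few lines of Python. This is only an illustration of the documented request/response contract, not the code from the repository: the names INVENTORY, list_inventory and get_inventory are hypothetical, and the real application may be organised differently.

```python
import json

# Sample data copied from the table above; the real application may store it differently.
INVENTORY = {"I-1": 10, "I-2": 20, "I-3": 30}


def list_inventory():
    """Mimic GET /inventory: a one-element list holding all on-hand levels."""
    return [dict(INVENTORY)]


def get_inventory(productid):
    """Mimic GET /inventory/{productid}; unknown product ids get a qty of -1."""
    return {"productid": productid, "qty": INVENTORY.get(productid, -1)}


if __name__ == "__main__":
    print(json.dumps(list_inventory()))
    print(json.dumps(get_inventory("I-1")))
    print(json.dumps(get_inventory("I-200")))
```

Returning a qty of -1 for an unknown product id, rather than an error status, mirrors the sample response for I-200.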

Clone the Repository

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Google Cloud Shell, a command line environment running in the Cloud.

From the GCP Console click the Cloud Shell icon on the top right toolbar:

It should only take a few moments to provision and connect to the environment. When it is finished, you should see something like this:

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory, and runs on Google Cloud, greatly enhancing network performance and authentication. All of your work in this lab can be done with simply a browser.

Setup gcloud

In Cloud Shell, set your project ID and save it as the PROJECT_ID variable.

PROJECT_ID=[YOUR-PROJECT-ID]
gcloud config set project $PROJECT_ID

Now, execute the following command:

$ git clone https://github.com/rominirani/cloud-code-sample-repository.git

This will create a folder titled cloud-code-sample-repository in the current directory.

(Optional) Run the application on Cloud Shell

You can run the application locally by following these steps:

From the terminal, navigate to the Python version of the API via the following command:

$ cd cloud-code-sample-repository
$ cd python-flask-api

In the terminal, run the following command to start the Python server locally. (At the time of writing, Cloud Shell comes with Python 3.9.x installed and we will use the default version. If you plan to run it locally on your laptop, you can use Python 3.8+.)

$ python app.py

This will start a server on port 8080 and you can test it out locally via the Web Preview feature of Cloud Shell. Click the Web Preview button as shown below:

Click on Preview on port 8080.

This will open a browser window. You will see a 404 Error and that is fine. Modify the URL so that it has just /inventory after the host name.

For example, on my machine it looks like this:

https://8080-cs-557561579860-default.cs-asia-southeast1-yelo.cloudshell.dev/inventory

This will display the list of inventory items as explained earlier:

You can stop the server now by going to the Terminal and pressing Ctrl-C.

Deploy the application

We will now deploy this API application to Cloud Run. The process involves using the gcloud command line client to deploy the code to Cloud Run.

From the terminal, give the following gcloud command:

$ gcloud run deploy --source .

This will ask you multiple questions (if asked to authorize, please go ahead) and some of the points are mentioned below. You may or may not get all the questions, depending on the configuration and if you have already enabled specific APIs in your Google Cloud project.

Service name (python-flask-api): Either go with this default or choose something like my-inventory-api
API [run.googleapis.com] not enabled on project [project-number]. Would you like to enable and retry (this will take a few minutes)? (y/N)? Y
Please specify a region: Choose a region of your choice by giving a number.
API [artifactregistry.googleapis.com] not enabled on project [project-number]. Would you like to enable and retry (this will take a few minutes)? (y/N)? Y
Deploying from source requires an Artifact Registry Docker repository to store built containers. A repository named [cloud-run-source-deploy] in region [us-west1] will be created.

Do you want to continue (Y/n)? Y

Allow unauthenticated invocations to [my-inventory-api] (y/N)? Y

Eventually, this will kick off the process to take your source code, containerize it, push it to the Artifact Registry and then deploy the Cloud Run service and revision. Be patient through this process (it can take 3-4 minutes); when it completes, the Service URL is shown to you.

A sample run is shown below:

Test the application

Now that we have deployed the application to Cloud Run, you can access the API application as follows:

Note down the Service URL from the previous step. For example, on my setup it is shown as https://my-inventory-api-bt2r5243dq-uw.a.run.app. Let's call this <SERVICE_URL>.
Open a browser and access the following 3 URLs for the API endpoints:
<SERVICE_URL>/inventory
<SERVICE_URL>/inventory/I-1
<SERVICE_URL>/inventory/I-100

The responses should match the sample API requests and responses shown in the earlier section.
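
If you prefer to exercise the endpoints from a script rather than a browser, a minimal sketch using only the Python standard library could look like this. The helper names endpoint_urls and fetch_json are hypothetical, and the sketch assumes the service allows unauthenticated invocations, as chosen during deployment.

```python
import json
import urllib.request


def endpoint_urls(service_url, product_ids):
    """Return the list endpoint followed by one URL per product id."""
    base = service_url.rstrip("/")
    return [f"{base}/inventory"] + [f"{base}/inventory/{pid}" for pid in product_ids]


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To use it, pass your own Service URL to endpoint_urls together with the product ids I-1 and I-100, then call fetch_json on each returned URL and compare the results with the sample responses.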

Get Service Details from Cloud Run

We deployed our API Service to Cloud Run, a serverless compute environment. We can visit the Cloud Run service via Google Cloud console at any point in time.

From the main menu, navigate to Cloud Run. This will display the list of services that you have running in Cloud Run. You should see the service that you just deployed. Depending on the name that you selected, you should see something like this:

Click on the Service name to view the details. The sample details are shown below:

Notice the URL, which is nothing but the service URL that you can punch into the browser and access the Inventory API that we just deployed. Feel free to look at Metrics and other details.

Let's jump in and start with Google Cloud Operations Suite now.

4. Setup a Dashboard

One of the convenient features that Cloud Monitoring provides is Out-of-the-Box (OOTB) dashboards across multiple resources in Google Cloud. This makes the initial setup of Dashboards with standard metrics a quick and convenient process.

Let us look at how to do that for the API Service that we just deployed to Cloud Run.

Custom Dashboard for our Service

Since we have deployed our API service to Cloud Run, let us check out how to set up Dashboards that can help to visualize various metrics, some of which include the service latency.

First up, from the console, visit Monitoring → Overview as shown below:

The Overview shows multiple things that you would have configured in Monitoring like Dashboards, Alerting, Uptime checks, etc.

For now, let us click on Dashboards from the side main menu. This will bring us to the following screen:

Click on SAMPLE LIBRARY. This will display the list of Out-of-the-Box (OOTB) Dashboards that are available in Google Cloud, across multiple resources. Specifically, scroll down the list and select Google Cloud Run as shown below.

This will display a list of standard dashboards that are available for Google Cloud Run. We are interested in that since we have deployed our service on Cloud Run.

You will see one Dashboard for Cloud Run Monitoring. Click on the PREVIEW link to view the list of standard charts (metrics) that are available for Cloud Run Monitoring. Simply click on IMPORT SAMPLE DASHBOARD to import all these charts into a custom dashboard. This will present a Dashboard screen with a prefilled name as shown below:

You can navigate back by clicking on the Left Arrow, which is to the left of the Dashboard name at the top left. This will lead to the list of Dashboards, in which you should be able to see the new Dashboard that you just created.

Click on that Dashboard link and you can monitor multiple metrics that are available out of the box. These metrics include Latency, Request Count, Container metrics and more.

You can also mark any Dashboard as a favorite by selecting the star icon as shown below:

This will add the Dashboard to the Overview screen of Monitoring and it becomes an easy way to navigate to frequently used dashboards.

Fantastic! You've just added a Custom Dashboard for monitoring your Cloud Run services. Well done!

5. Uptime checks

In this section, we are going to set up an uptime check for our API Service that we have deployed. A public uptime check can issue requests from multiple locations throughout the world to publicly available URLs or Google Cloud resources to see whether the resource responds.

The resource in this case is going to be the API Service that we have deployed to Cloud Run. The URL will be a specific endpoint that the API Service exposes to indicate the health of the service.

In the sample API service code, we have exposed an endpoint /healthy that returns the string value "All Izz Well". So all we need to do is define an uptime check that hits <SERVICE_URL>/healthy and checks whether the string "All Izz Well" is returned.
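
Conceptually, the check we are about to configure boils down to a small predicate on the HTTP response. The sketch below is only an illustration of that idea and not how Cloud Monitoring implements its probes; the function name is_healthy is hypothetical, and it assumes a "contains" style match on the response body together with an HTTP 200 status.

```python
EXPECTED_BODY = "All Izz Well"


def is_healthy(status_code, body, expected=EXPECTED_BODY):
    """Pass only when the response is HTTP 200 and contains the expected text."""
    return status_code == 200 and expected in body
```

A response of 200 with the body "All Izz Well" passes, while an error status or a different body fails the check.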

Create a Notification Channel

Before we create the uptime check, it is important to first configure notification channels. A notification channel is a medium over which you will be alerted if there is an incident/issue with any of our monitored resources. An example of a notification channel is Email and you will receive emails in case there is an Alert, etc.

For now, we are going to configure an Email Notification Channel and configure it with our email address, so that we can get notified in case of any alerts which our system will raise and which we will configure.

To create a Notification Channel, follow these steps:

Go to Monitoring → Alerting from the main menu in Google Cloud Console, as shown below:

This will display a page with Alerts, Policies and more. For now, you will see a link at the top titled EDIT NOTIFICATION CHANNELS. Click on that.

This will display a list of various Notification Channels as shown below:

Locate the Email section and click on ADD NEW for that row. This will bring up the Email Configuration details as shown below:

Put in your email address and a Display Name as shown below. Click on SAVE.

This completes the creation of the Email Notification Channel. Let's go ahead and configure the uptime check now.

Creating an uptime check

Go to Monitoring → Uptime checks from the main menu in Google Cloud Console. At the top you will see the CREATE UPTIME CHECK link. Click on that.

This brings up a series of steps that you will need to complete to configure the uptime check.

The first step is to set up the Target details i.e. information on the Cloud Run service that we have deployed. A filled out form is shown below:

The different values can be selected as follows:

Protocol: HTTPS
Resource Type: Select Cloud Run Service. Notice the other resources that it supports; you could set Uptime checks on them too.
Cloud Run Service: Select my-inventory-api or the specific name that you have for the Cloud Run service.
Path: /healthy, since that endpoint returns the string "All Izz Well" and we want to check for that.

Click CONTINUE to move to the next step. The next step is the Response Validation step as shown below:

You can see that we are enabling the check for "Content Matching" and then setting up that the response returned by the /healthy endpoint will be "All Izz Well". Click on CONTINUE to move to the next step where we will configure the Alert and which notification channel we should be alerted on, if the Uptime check fails.

In this step, give a name to the Alert. I have chosen it as Inventory API Uptime Check failure, but you can choose your name. The important thing here is to select the correct notification channel from the list that you configured earlier.

Click on REVIEW for the final step to review the Uptime check that we have configured.

In this final step, give a name to the Uptime check (e.g. Inventory API Uptime Check) and then you can also test out if the check is configured correctly. Click on the TEST button for that.

Go ahead and complete the process (click on the CREATE button on the left). Google Cloud will instruct the uptime check probes configured across different regions to ping the URL and these responses will be collected. Visit the Monitoring → Uptime checks section after a few minutes and you should ideally see all the green signals that indicate that the URL was reachable from the different probes.

If any of the probes fails for a period of time (which is configurable), you will get an Alert Notification on the email channel that we configured.

This completes our section on setting up an Uptime check. Well done!

6. Metrics Explorer

Cloud Monitoring exposes thousands of standard metrics from multiple Google Cloud products. These metrics are available for you to investigate, query, convert to Charts, add to Dashboards, raise Alerts on and more.

Our goal in this section is:

Understand how you can look at various metrics and then we will investigate a specific metric (latency) for our API service.
Convert that metric into a Chart and custom Dashboard that we can then use to visualize the metric anytime.

Explore Latency Metric for Inventory API Service

Go to Monitoring → Metrics Explorer from the main menu in Google Cloud Console. This will take you to the Metrics Explorer screen. Click on SELECT A METRIC. You can now navigate several active resources that have metrics generated.

Since we are dealing with Cloud Run services, click on Cloud Run Revision, then the category and specific metric titled Request Latency as shown below:

Click on Apply. This will display the Request Latency in a chart. You can change the Widget Type to a Line Chart from the Display settings on the right as shown below:

This will display the Latency Chart as shown below:

Create Chart and custom Dashboard

Let us go ahead and save this Chart. Click on Save Chart and use the details as shown below:

Keep in mind that we are creating a new dashboard instead of saving the chart in an existing dashboard. Click on the SAVE button. This will add the newly created dashboard to our list of dashboards as shown below:

Click on the specific dashboard that we created to view the details.

This completes the section on investigating various metrics via Metrics Explorer and how we can create our custom dashboards.

7. Cloud Logging

In this section, we are going to explore Cloud Logging. Cloud Logging comes with a Logs Explorer interface that helps you navigate and dive into logs generated by various Google Services and your own applications.

In this section, we will learn about Logs Explorer and simulate a few log messages that we can then search for and convert into metrics, via a feature called Log-based metrics.

Logs Explorer

You can visit the Logs Explorer via Logging → Logs Explorer from the main Google Cloud console as shown below:

This will display a log interface where you can select or deselect various Resources (Project, Google Cloud Resource, service names, etc.) along with Log levels to filter the log messages as needed.

Shown above is the list of logs for the Cloud Run Revision i.e. Cloud Run services that we have deployed. You will see several requests that are Uptime checks hitting the /healthy endpoint that we have configured.

Search for Warnings

Simulate a few invalid requests to the Inventory Service by providing product ids other than I-1, I-2 and I-3. For example, an incorrect request is:

<SERVICE_URL>/inventory/I-999

We will now search for all the WARNINGs that have been generated by our API, when an incorrect Product Id is provided in the Query.

In the Query Box, insert the following query parameters:

resource.type="cloud_run_revision"

textPayload =~ "Received inventory request for incorrect productid"

It should look something like this:

Click on Run Query. This will then show you all the requests that have come in and which have this issue.
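
To make the semantics of this query concrete: the two lines are combined with an implicit AND, and the =~ operator performs a regular expression match on the text payload. The sketch below applies the same filter to log entries represented as plain Python dictionaries. The function name matches_query and the sample entries are hypothetical, and real log entries carry many more fields.

```python
import re

# Same pattern as in the query; re.search finds it anywhere in the payload.
PATTERN = re.compile(r"Received inventory request for incorrect productid")


def matches_query(entry):
    """Return True when both conditions of the query hold for a log entry."""
    resource_type = entry.get("resource", {}).get("type")
    text = entry.get("textPayload", "")
    return resource_type == "cloud_run_revision" and PATTERN.search(text) is not None


# Hypothetical entries for illustration only.
sample_entries = [
    {"resource": {"type": "cloud_run_revision"},
     "textPayload": "Received inventory request for incorrect productid: I-999"},
    {"resource": {"type": "cloud_run_revision"},
     "textPayload": "GET /healthy"},
    {"resource": {"type": "gce_instance"},
     "textPayload": "Received inventory request for incorrect productid: I-999"},
]
matching = [entry for entry in sample_entries if matches_query(entry)]
```

Only the first sample entry satisfies both conditions, which is the same narrowing effect that the query has in Logs Explorer.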

Log-based Metrics

Let's create a Custom Log Metric to track these errors. We would like to understand if there is a significant number of calls happening with wrong Product Ids.

To convert the above to an error metric, click on the Create Metric button that you see in the Logs Explorer.

This will bring up the form to create the metric definition. Go with a Counter Metric and enter the details for the Metric Name (inventory_lookup_errors) and Description as shown below and click on Create Metric.

This will create the counter metric and you should see a message as displayed below:

Visit the Logging → Logs-based Metrics from the main menu and you should see the custom metric that we defined in the list of User-defined metrics as given below:

At the end of this entry, you will find three vertical dots, click on them to see the operations that you can perform on this custom metric. The list should be similar to the one that you are seeing below. Click on the View in Metrics Explorer option.

This should lead us to the Metrics Explorer that we learnt about in the previous section, except that it is now prepopulated for us.

Click on Save Chart. Use the following values for the Save Chart options:

This will create a new Dashboard in which you can see the Inventory Search errors, and it will be available in the list of Dashboards.

Great! You have now created a custom metric from your logs and converted it into a chart in a custom dashboard. This will help us track the number of calls that use incorrect product ids.

8. Alert Policies

In this section, we will use the custom metric that we created and monitor its data for a threshold i.e. if the number of errors goes beyond a certain threshold, we will raise an alert. In other words, we are going to set up an alert policy.

Create an Alert Policy

Let us go to the Inventory Search Dashboard. This will bring up the chart that we created to note the Inventory Lookup Errors as shown below:

This will bring up the current metric data. Let us first edit the metric as shown below (Click on the Edit button):

This will bring up the metric details. We are going to convert the chart from showing us the rate of errors to a sum i.e. number of errors. The field to change is shown below:

Click on APPLY in the top right corner and we will be back at our Metrics screen, but this time we will see the total number of errors in the alignment period instead of the rate of errors.

We are going to create an Alert Policy that can notify us in case the number of errors goes beyond a threshold. Click on the 3 dots at the top right corner of the chart and from the list of options, as shown above, click on Convert to alert chart.

You should see a screen as shown below:

Click on Next. This will bring up a Threshold value that we can set. The sample threshold that we have taken here is 5, but you can choose a value as per your preference.

Click on NEXT to bring up the Notifications form.

We have selected the Notification Channel as the Email channel that we created earlier. You may fill up the other details like Documentation (which will be provided as part of the Alert that gets raised). Click on NEXT to see the summary and complete the process.

Once you create this Alert Policy, it will be visible in the list of Alert Policies as shown below. You can get to the list of Alert Policies, by going to Monitoring → Alerting. Scan for the Policies section in the page to see the list of policies that we have configured so far.

Great! You have now configured a custom Alert Policy that will notify you in case of an increased number of errors while looking up the Inventory API.
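
The idea behind this policy (sum the errors in each alignment period, then compare the sum with a threshold) can be sketched as follows. This is a simplified illustration rather than Cloud Monitoring's evaluation logic: the function names are hypothetical, timestamps are plain seconds, the alignment period is assumed to be 60 seconds, and the condition is assumed to fire when the count is strictly above the threshold.

```python
from collections import Counter


def errors_per_window(timestamps, window_seconds=60):
    """Count error events per alignment window, keyed by window start time."""
    counts = Counter()
    for ts in timestamps:
        counts[int(ts // window_seconds) * window_seconds] += 1
    return dict(counts)


def windows_over_threshold(timestamps, threshold=5, window_seconds=60):
    """Return the start times of windows whose error count exceeds the threshold."""
    counts = errors_per_window(timestamps, window_seconds)
    return sorted(start for start, n in counts.items() if n > threshold)


# Six errors in the first minute, two in the second.
error_times = [0, 5, 10, 20, 30, 59, 61, 62]
breaches = windows_over_threshold(error_times)
```

With the sample threshold of 5, only the first window (six errors) breaches; the second window (two errors) does not.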

9. Service Monitoring (Optional)

In this section, we are going to set up SLIs/SLOs for our services as per Site Reliability Engineering (SRE) principles. You will notice that Cloud Monitoring makes this easier by auto-discovering services that you have deployed in Cloud Run and automatically computing key SLIs like Availability and Latency, along with Error Budget calculations.

Let's go ahead and set up the Latency SLO for our API Service.

Setting up Latency SLO for Inventory Service

Click on Monitoring → Services from the main menu in Cloud Console. This will bring up the list of services that have been configured for Service Monitoring.

Currently, we do not have any services that have been set up for SLI/SLO Monitoring, so the list is empty. Click on the DEFINE SERVICE link at the top to define or identify a service first.

This will auto discover services that are a candidate for SLO Monitoring. It is able to discover Cloud Run services and hence our Inventory API service deployed to Cloud Run will be visible in the list.

The display name that you see might be different and will depend on what you chose at the time of deploying the service to Cloud Run. Click on the SUBMIT button. This will bring up the screen shown below:

You can click on CREATE SLO. This will now allow you to select from the SLIs that are automatically calculated for you.

We choose Latency SLI as a start. Click on CONTINUE. Next up you see a screen that shows you the current performance of this service and what the typical latency has been.

We put in a value for the Threshold, i.e. 300ms, which is what we want to achieve. You can choose a different value if you want, but keep in mind that it will affect the error budget accordingly. Click on CONTINUE.

We now set the SLO (Target and Measurement window) as shown below:

This means that we are selecting the Measurement window as a Rolling type window and measuring it across 7 days. Similarly for the target, we have chosen a goal of 90%. What we are trying to say here is that 90% of the requests to the API service should complete within 300ms and this should be measured across 7 days.

Click on Continue. This brings up the summary screen, which you can confirm by clicking on the UPDATE SLO button.

This saves your SLO definition and the Error Budget is computed for you automatically.
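
To get a feel for the arithmetic behind a latency SLO and its error budget, here is a small sketch. It is an assumption about one common way to express these numbers, not a description of how Cloud Monitoring computes them: the function name slo_report is hypothetical, a request counts as "good" when its latency is at or below the threshold, and the remaining budget is expressed as a fraction of the allowed bad requests.

```python
def slo_report(latencies_ms, threshold_ms=300, target=0.90):
    """Summarise SLO compliance for a batch of request latencies."""
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("need at least one request")
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    bad = total - good
    # With a 90% target, 10% of the requests are allowed to be slow.
    allowed_bad = (1 - target) * total
    return {
        "compliance": good / total,
        "met": good / total >= target,
        "error_budget_remaining": 1 - bad / allowed_bad if allowed_bad else 0.0,
    }


# 20 requests in the window: 19 fast ones and 1 slow one.
report = slo_report([100] * 19 + [500])
```

In this example, 95% of the requests are within 300ms, so the 90% target is met. Two slow requests out of 20 would have been allowed and one occurred, so about half of the error budget remains.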

A few things that you can try:

Exercise the API via multiple calls and see the performance of the service and how it affects the remaining Error Budget.
Modify the source code to introduce some additional delay (sleep) randomly in some calls. This will push up the latency for a number of calls and it should adversely affect the Error Budget.
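
For the second suggestion, one possible way to inject a random delay is sketched below. The helper maybe_delay and its default values are hypothetical and are not part of the sample repository; you would call it at the start of a request handler. The rng and sleep parameters exist only so that the behaviour can be controlled when experimenting.

```python
import random
import time


def maybe_delay(probability=0.2, delay_seconds=0.5, rng=random.random, sleep=time.sleep):
    """Sleep for delay_seconds on roughly `probability` of calls; report whether it slept."""
    if rng() < probability:
        sleep(delay_seconds)
        return True
    return False
```

With the defaults, roughly one call in five would take an extra 500ms, which is above the 300ms threshold chosen earlier and should therefore consume the Error Budget.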

10. Congratulations

Congratulations, you've successfully deployed a sample application to Google Cloud and learnt about using Google Cloud Operations Suite to monitor the health of the application!

What we've covered

Deploying a Service to Google Cloud Run.
Setting up a Dashboard for Google Cloud Run Service.
Uptime checks.
Setting up Custom log metrics and Dashboard/Chart based on it.
Exploring Metrics Explorer and setting up Dashboard/Chart.
Setting up Alert Policies.
Setting up SLI/SLO for Service Monitoring in Google Cloud.

Note: If you have executed the codelab using your own account and Google Cloud project, the resources allocated may continue to incur a billing charge. So delete the Project and resources once you are done with the lab.

What's next?

Check out this Cloud Skills Boost Quest to learn more about Google Cloud Operations Suite.

Google Cloud Operations Suite