Instrument for better performance in your app in Go (part 1: trace)

1. Introduction

505827108874614d.png

Last Updated: 2022-07-15

Observability of the application

Observability and OpenTelemetry

Observability is the term used to describe an attribute of a system. A system with observability allows teams to actively debug their system. In that context, the three pillars of observability (logs, metrics, and traces) are the fundamental instrumentation a system needs in order to acquire observability.

OpenTelemetry is a set of specifications, libraries, and agents that accelerate the instrumentation and export of the telemetry data (logs, metrics, and traces) that observability requires. OpenTelemetry is an open standard and a community-driven project under the CNCF. By using the libraries that the project and its ecosystem provide, developers can instrument their applications in a vendor-neutral way and across multiple architectures.

In addition to the three pillars, continuous profiling is another key component of observability, and its user base is expanding across the industry. Cloud Profiler is one of the pioneering offerings and provides an easy interface to drill down into performance metrics across application call stacks.

This codelab is part 1 of the series and covers instrumenting distributed traces in microservices with OpenTelemetry and Cloud Trace. Part 2 will cover continuous profiling with Cloud Profiler.

Distributed Trace

Among logs, metrics, and traces, a trace is the telemetry that tells you the latency of a specific part of a process in the system. Especially in the era of microservices, distributed tracing is a strong driver for finding latency bottlenecks in the overall distributed system.

When analyzing distributed traces, trace data visualization is the key to grasping overall system latencies at a glance. In distributed tracing, the set of calls made to process a single request to the system entrypoint is handled in the form of a Trace containing multiple Spans.

A Span represents an individual unit of work done in a distributed system, recording start and stop times. Spans often have hierarchical relationships with each other: in the picture below, all smaller spans are child spans of the large /messages span, and they are assembled into one Trace that shows the path of work through a system.

A trace
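
As a rough illustration (not part of the codelab code), this parent-child relationship is expressed in Go with OpenTelemetry simply by passing along the context returned when the parent span is started; the tracer and span names below, such as auth.check, are placeholders:

import (
        "context"

        "go.opentelemetry.io/otel"
)

func parentChildExample(ctx context.Context) {
        tr := otel.Tracer("example")
        // Start the parent span; the returned context carries its span context.
        ctx, parent := tr.Start(ctx, "/messages")
        defer parent.End()

        // Any span started from that context becomes a child of "/messages".
        _, child := tr.Start(ctx, "auth.check")
        defer child.End()
}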

Google Cloud Trace is one option for a distributed trace backend, and it is well integrated with other products in Google Cloud.

What you'll build

In this codelab, you're going to instrument trace information in the services of the "Shakespeare application" (a.k.a. Shakesapp), which runs on a Google Kubernetes Engine cluster. The architecture of Shakesapp is as follows:

44e243182ced442f.png

  • The loadgen sends a query string to the client over HTTP
  • The client passes the query from the loadgen to the server over gRPC
  • The server accepts the query from the client, fetches all of Shakespeare's works in text format from Google Cloud Storage, searches for the lines that contain the query, and returns the number of matched lines to the client

You will instrument the trace information across the request. After that, you will embed a profiler agent in the server and investigate the bottleneck.

What you'll learn

  • How to get started with the OpenTelemetry Trace libraries in a Go project
  • How to create a span with the library
  • How to propagate span contexts across the wire between app components
  • How to send trace data to Cloud Trace
  • How to analyze the trace on Cloud Trace

This codelab explains how to instrument your microservices. To make it easy to understand, this example only contains 3 components (load generator, client, and server), but you can apply the same process explained in this codelab to larger and more complex systems.

What you'll need

  • A basic knowledge of Go
  • A basic knowledge of Kubernetes

2. Setup and Requirements

Self-paced environment setup

If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign in to the Google Cloud Platform console (console.cloud.google.com) and create a new project.

If you already have a project, click on the project selection pull down menu in the upper left of the console:

7a32e5469db69e9.png

and click the 'NEW PROJECT' button in the resulting dialog to create a new project:

7136b3ee36ebaf89.png

If you don't already have a project, you should see a dialog like this to create your first one:

870a3cbd6541ee86.png

The subsequent project creation dialog allows you to enter the details of your new project:

affdc444517ba805.png

Remember the project ID, which is a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!). It will be referred to later in this codelab as PROJECT_ID.

Next, if you haven't already done so, you'll need to enable billing in the Developers Console in order to use Google Cloud resources and enable the Cloud Trace API.

15d0ef27a8fbab27.png

Running through this codelab shouldn't cost you more than a few dollars, but it could be more if you decide to use more resources or if you leave them running (see the "cleanup" section at the end of this document). Pricing for Google Cloud Trace, Google Kubernetes Engine, and Google Artifact Registry is noted in the official documentation.

New users of Google Cloud Platform are eligible for a $300 free trial, which should make this codelab entirely free of charge.

Google Cloud Shell Setup

While Google Cloud and Google Cloud Trace can be operated remotely from your laptop, in this codelab we will be using Google Cloud Shell, a command line environment running in the Cloud.

This Debian-based virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. This means that all you will need for this codelab is a browser (yes, it works on a Chromebook).

To activate Cloud Shell from the Cloud Console, simply click Activate Cloud Shell (it should only take a few moments to provision and connect to the environment).

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID.

gcloud auth list

Command output

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)
gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If, for some reason, the project is not set, simply issue the following command:

gcloud config set project <PROJECT_ID>

Looking for your PROJECT_ID? Check out what ID you used in the setup steps or look it up in the Cloud Console dashboard.

Cloud Shell also sets some environment variables by default, which may be useful as you run future commands.

echo $GOOGLE_CLOUD_PROJECT

Command output

<PROJECT_ID>

Finally, set the default zone and project configuration.

gcloud config set compute/zone us-central1-f

You can choose a variety of different zones. For more information, see Regions & Zones.

Go language setup

In this codelab, we use Go for all source code. Run the following command in Cloud Shell and confirm that the Go version is 1.17 or later.

go version

Command output

go version go1.18.3 linux/amd64

Set up a Google Kubernetes Engine cluster

In this codelab, you will run a cluster of microservices on Google Kubernetes Engine (GKE). The workflow of this codelab is as follows:

  1. Download the baseline project into Cloud Shell
  2. Build microservices into containers
  3. Upload containers onto Google Artifact Registry (GAR)
  4. Deploy containers onto GKE
  5. Modify the source code of services for trace instrumentation
  6. Return to step 2

Enable Kubernetes Engine

First, we set up the Kubernetes cluster on GKE where Shakesapp will run, so we need to enable the GKE API. Navigate to the "Kubernetes Engine" menu and press the ENABLE button.

548cfd95bc6d344d.png

Now you are ready to create a Kubernetes cluster.

Create Kubernetes cluster

In Cloud Shell, run the following command to create a Kubernetes cluster. Please confirm that the zone value is within the region you will use for Artifact Registry repository creation. Change the zone value from us-central1-f if your repository region does not cover that zone.

gcloud container clusters create otel-trace-codelab2 \
--zone us-central1-f \
--release-channel rapid \
--preemptible \
--enable-autoscaling \
--max-nodes 8 \
--no-enable-ip-alias \
--scopes cloud-platform

Command output

Note: Your Pod address range (`--cluster-ipv4-cidr`) can accommodate at most 1008 node(s).
Creating cluster otel-trace-codelab2 in us-central1-f... Cluster is being health-checked (master is healthy)...done.     
Created [https://container.googleapis.com/v1/projects/development-215403/zones/us-central1-f/clusters/otel-trace-codelab2].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1-f/otel-trace-codelab2?project=development-215403
kubeconfig entry generated for otel-trace-codelab2.
NAME: otel-trace-codelab2
LOCATION: us-central1-f
MASTER_VERSION: 1.23.6-gke.1501
MASTER_IP: 104.154.76.89
MACHINE_TYPE: e2-medium
NODE_VERSION: 1.23.6-gke.1501
NUM_NODES: 3
STATUS: RUNNING

Artifact Registry and skaffold setup

Now we have a Kubernetes cluster ready for deployment. Next, we prepare a container registry to push and deploy containers. For these steps, we need to set up Artifact Registry (GAR) and configure skaffold to use it.

Artifact Registry setup

Navigate to the "Artifact Registry" menu and press the ENABLE button.

45e384b87f7cf0db.png

After some moments, you will see the repository browser of GAR. Click the "CREATE REPOSITORY" button and enter the name of the repository.

d6a70f4cb4ebcbe3.png

In this codelab, the new repository is named trace-codelab. The artifact format is "Docker" and the location type is "Region". Choose a region close to the one you set as the Google Compute Engine default zone. For example, this codelab chose "us-central1-f" above, so here we choose "us-central1 (Iowa)". Then click the "CREATE" button.

9c2d1ce65258ef70.png

Now you see "trace-codelab" on the repository browser.

7a3c1f47346bea15.png

We will come back here later to check the registry path.

Skaffold setup

Skaffold is a handy tool when you build microservices that run on Kubernetes. It handles the workflow of building, pushing, and deploying application containers with a small set of commands. Skaffold uses the default Docker registry as its container registry, so you need to configure skaffold so that it pushes containers to GAR instead.

Open Cloud Shell again and confirm that skaffold is installed. (Cloud Shell installs skaffold by default.) Run the following command to see the skaffold version.

skaffold version

Command output

v1.38.0

Now you can register the default repository for skaffold to use. To obtain the registry path, navigate to the Artifact Registry dashboard and click the name of the repository you just set up in the previous step.

7a3c1f47346bea15.png

Then you will see the breadcrumb trail at the top of the page. Click the copy icon to copy the registry path to the clipboard.

e0f2ae2144880b8b.png

When you click the copy button, a dialog appears at the bottom of the browser with a message like:

"us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab" has been copied

Go back to Cloud Shell. Run the skaffold config set default-repo command with the value you just copied from the dashboard.

skaffold config set default-repo us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab

Command output

set value default-repo to us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab for context gke_stackdriver-sandbox-3438851889_us-central1-b_stackdriver-sandbox

You also need to add the registry to your Docker configuration. Run the following command:

gcloud auth configure-docker us-central1-docker.pkg.dev --quiet

Command output

{
  "credHelpers": {
    "gcr.io": "gcloud",
    "us.gcr.io": "gcloud",
    "eu.gcr.io": "gcloud",
    "asia.gcr.io": "gcloud",
    "staging-k8s.gcr.io": "gcloud",
    "marketplace.gcr.io": "gcloud",
    "us-central1-docker.pkg.dev": "gcloud"
  }
}
Adding credentials for: us-central1-docker.pkg.dev

Now you are ready for the next step: building and deploying the microservices onto GKE.

Summary

In this step, you set up your codelab environment:

  • Set up Cloud Shell
  • Created an Artifact Registry repository as the container registry
  • Set up skaffold to use the container registry
  • Created a Kubernetes cluster where the codelab microservices run

Next up

In the next step, you will build, push, and deploy your microservices onto the cluster.

3. Build, push and deploy the microservices

Download the codelab material

In the previous step, you set up all the prerequisites for this codelab. Now you are ready to run all the microservices on top of them. The codelab material is hosted on GitHub, so download it to the Cloud Shell environment with the following git command.

cd ~
git clone https://github.com/ymotongpoo/opentelemetry-trace-codelab-go.git
cd opentelemetry-trace-codelab-go

The directory structure of the project is as follows:

.
├── README.md
├── step0
│   ├── manifests
│   ├── proto
│   ├── skaffold.yaml
│   └── src
├── step1
│   ├── manifests
│   ├── proto
│   ├── skaffold.yaml
│   └── src
├── step2
│   ├── manifests
│   ├── proto
│   ├── skaffold.yaml
│   └── src
├── step3
│   ├── manifests
│   ├── proto
│   ├── skaffold.yaml
│   └── src
├── step4
│   ├── manifests
│   ├── proto
│   ├── skaffold.yaml
│   └── src
├── step5
│   ├── manifests
│   ├── proto
│   ├── skaffold.yaml
│   └── src
└── step6
    ├── manifests
    ├── proto
    ├── skaffold.yaml
    └── src
  • manifests: Kubernetes manifest files
  • proto: proto definition for the communication between client and server
  • src: directories for the source code of each service
  • skaffold.yaml: Configuration file for skaffold

In this codelab, you will update the source code located under the step0 folder. You can also refer to the source code in the step[1-6] folders for the answers to the following steps. (Part 1 covers step0 to step4, and Part 2 covers step5 and step6.)

Run skaffold command

Finally, you are ready to build, push, and deploy everything onto the Kubernetes cluster you have just created. This sounds like multiple steps, but skaffold actually does everything for you. Let's try it with the following command:

cd step0
skaffold dev

As soon as you run the command, you will see the log output of docker build and can confirm that the images are successfully pushed to the registry.

Command output

...
---> Running in c39b3ea8692b
 ---> 90932a583ab6
Successfully built 90932a583ab6
Successfully tagged us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab/serverservice:step1
The push refers to repository [us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab/serverservice]
cc8f5a05df4a: Preparing
5bf719419ee2: Preparing
2901929ad341: Preparing
88d9943798ba: Preparing
b0fdf826a39a: Preparing
3c9c1e0b1647: Preparing
f3427ce9393d: Preparing
14a1ca976738: Preparing
f3427ce9393d: Waiting
14a1ca976738: Waiting
3c9c1e0b1647: Waiting
b0fdf826a39a: Layer already exists
88d9943798ba: Layer already exists
f3427ce9393d: Layer already exists
3c9c1e0b1647: Layer already exists
14a1ca976738: Layer already exists
2901929ad341: Pushed
5bf719419ee2: Pushed
cc8f5a05df4a: Pushed
step1: digest: sha256:8acdbe3a453001f120fb22c11c4f6d64c2451347732f4f271d746c2e4d193bbe size: 2001

After the push of all service containers, Kubernetes deployments start automatically.

Command output

sha256:b71fce0a96cea08075dc20758ae561cf78c83ff656b04d211ffa00cedb77edf8 size: 1997
Tags used in deployment:
 - serverservice -> us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab/serverservice:step4@sha256:8acdbe3a453001f120fb22c11c4f6d64c2451347732f4f271d746c2e4d193bbe
 - clientservice -> us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab/clientservice:step4@sha256:b71fce0a96cea08075dc20758ae561cf78c83ff656b04d211ffa00cedb77edf8
 - loadgen -> us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab/loadgen:step4@sha256:eea2e5bc8463ecf886f958a86906cab896e9e2e380a0eb143deaeaca40f7888a
Starting deploy...
 - deployment.apps/clientservice created
 - service/clientservice created
 - deployment.apps/loadgen created
 - deployment.apps/serverservice created
 - service/serverservice created

After the deployment, you'll see the actual application logs emitted to stdout in each container like this:

Command output

[client] 2022/07/14 06:33:15 {"match_count":3040}
[loadgen] 2022/07/14 06:33:15 query 'love': matched 3040
[client] 2022/07/14 06:33:15 {"match_count":3040}
[loadgen] 2022/07/14 06:33:15 query 'love': matched 3040
[client] 2022/07/14 06:33:16 {"match_count":3040}
[loadgen] 2022/07/14 06:33:16 query 'love': matched 3040
[client] 2022/07/14 06:33:19 {"match_count":463}
[loadgen] 2022/07/14 06:33:19 query 'tear': matched 463
[loadgen] 2022/07/14 06:33:20 query 'world': matched 728
[client] 2022/07/14 06:33:20 {"match_count":728}
[client] 2022/07/14 06:33:22 {"match_count":463}
[loadgen] 2022/07/14 06:33:22 query 'tear': matched 463

Note that at this point, you don't see any messages from the server. OK, now you are finally ready to start instrumenting your application with OpenTelemetry for distributed tracing of the services.

Before you start instrumenting the services, please shut down your cluster with Ctrl-C.

Command output

...
[client] 2022/07/14 06:34:57 {"match_count":1}
[loadgen] 2022/07/14 06:34:57 query 'what's past is prologue': matched 1
^CCleaning up...
 - W0714 06:34:58.464305   28078 gcp.go:120] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.25+; use gcloud instead.
 - To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
 - deployment.apps "clientservice" deleted
 - service "clientservice" deleted
 - deployment.apps "loadgen" deleted
 - deployment.apps "serverservice" deleted
 - service "serverservice" deleted

Summary

In this step, you have prepared the codelab material in your environment and confirmed skaffold runs as expected.

Next up

In the next step, you will modify the source code of loadgen service to instrument the trace information.

4. Instrumentation for HTTP

Concept of trace instrumentation and propagation

Before editing the source code, let me briefly explain how distributed traces work, using a simple diagram.

6be42e353b9bfd1d.png

In this example, we instrument the code to export Trace and Span information to Cloud Trace and propagate trace context across the request from the loadgen service to the server service.

Applications need to send trace metadata such as the Trace ID and Span ID in order for Cloud Trace to assemble all spans that share the same Trace ID into one trace. The application also needs to propagate the trace context (the combination of the Trace ID and the Span ID of the parent span) when requesting downstream services, so that they know which trace context they are handling; a minimal sketch of what this propagation looks like on the wire follows the list below.

OpenTelemetry helps you:

  • to generate unique Trace IDs and Span IDs
  • to export Trace IDs and Span IDs to the backend
  • to propagate trace contexts to other services
  • to embed extra metadata that helps in analyzing traces
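
To make the propagation above concrete, here is a minimal, self-contained sketch (not part of the codelab source) of how the W3C Trace Context propagator used later in this codelab writes and reads the traceparent HTTP header; the IDs are arbitrary example values:

package main

import (
        "context"
        "fmt"
        "net/http"

        "go.opentelemetry.io/otel/propagation"
        "go.opentelemetry.io/otel/trace"
)

func main() {
        // Build a span context by hand just to show what gets propagated.
        // These IDs are arbitrary example values, not real recorded spans.
        traceID, _ := trace.TraceIDFromHex("4bf92f3577b34da6a3ce929d0e0e4736")
        spanID, _ := trace.SpanIDFromHex("00f067aa0ba902b7")
        sc := trace.NewSpanContext(trace.SpanContextConfig{
                TraceID:    traceID,
                SpanID:     spanID,
                TraceFlags: trace.FlagsSampled,
        })
        ctx := trace.ContextWithSpanContext(context.Background(), sc)

        // The W3C Trace Context propagator writes the context into HTTP headers.
        prop := propagation.TraceContext{}
        header := http.Header{}
        prop.Inject(ctx, propagation.HeaderCarrier(header))
        fmt.Println(header.Get("traceparent"))
        // prints something like: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

        // A downstream service extracts the same context from the incoming headers.
        ctx2 := prop.Extract(context.Background(), propagation.HeaderCarrier(header))
        fmt.Println(trace.SpanContextFromContext(ctx2).TraceID())
}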

Components in OpenTelemetry Trace

b01f7bb90188db0d.png

The process for instrumenting application traces with OpenTelemetry is as follows:

  1. Create an exporter
  2. Create a TracerProvider bound to the exporter in step 1 and set it globally
  3. Set the TextMapPropagator to define the propagation method
  4. Get a Tracer from the TracerProvider
  5. Generate a Span from the Tracer

For now, you do not need to understand the detailed properties of each component, but the most important things to remember are (a compact sketch putting them together follows this list):

  • the exporter is pluggable into the TracerProvider
  • the TracerProvider holds all configuration regarding trace sampling and export
  • all spans in a trace are generated through a Tracer object
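
Here is the compact sketch mentioned above, putting the five steps together in one minimal program. It uses the same stdout exporter that the following sections use; the tracer and span names are placeholders:

package main

import (
        "context"
        "log"

        "go.opentelemetry.io/otel"
        stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
        "go.opentelemetry.io/otel/propagation"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
        // 1. Create an exporter (stdout here; Cloud Trace later in this codelab).
        exporter, err := stdout.New(stdout.WithPrettyPrint())
        if err != nil {
                log.Fatal(err)
        }

        // 2. Create a TracerProvider bound to the exporter and set it globally.
        tp := sdktrace.NewTracerProvider(
                sdktrace.WithSampler(sdktrace.AlwaysSample()),
                sdktrace.WithBatcher(exporter),
        )
        otel.SetTracerProvider(tp)
        defer tp.Shutdown(context.Background())

        // 3. Set the TextMapPropagator to define how trace contexts cross the wire.
        otel.SetTextMapPropagator(propagation.TraceContext{})

        // 4. Get a Tracer from the global TracerProvider.
        tr := otel.Tracer("example")

        // 5. Generate a Span from the Tracer.
        _, span := tr.Start(context.Background(), "example.work")
        span.End()
}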

With this understanding, let's move on to the actual coding work.

Instrument first span

Instrument load generator service

Open the Cloud Shell Editor by pressing the button at the top right of Cloud Shell. Open step0/src/loadgen/main.go from the explorer in the left pane and find the main function.

step0/src/loadgen/main.go

func main() {
        ...
        for range t.C {
                log.Printf("simulating client requests, round %d", i)
                if err := run(numWorkers, numConcurrency); err != nil {
                        log.Printf("aborted round with error: %v", err)
                }
                log.Printf("simulated %d requests", numWorkers)
                if numRounds != 0 && i > numRounds {
                        break
                }
                i++
        }
}

In the main function, you see a loop that calls the function run. In the current implementation, the section has two log lines that record the beginning and end of the function call. Now let's instrument Span information to track the latency of the function call.

First, as noted in the previous section, let's set up the overall configuration for OpenTelemetry. Add the OpenTelemetry packages as follows:

step0/src/loadgen/main.go

import (
        "context" // step1. add packages
        "encoding/json"
        "fmt"
        "io"
        "log"
        "math/rand"
        "net/http"
        "net/url"
        "time"
        // step1. add packages
        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
        "go.opentelemetry.io/otel/propagation"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        semconv "go.opentelemetry.io/otel/semconv/v1.10.0"
        "go.opentelemetry.io/otel/trace"
        // step1. end add packages
)

For readability, we create a setup function called initTracer and call it in the main function.

step0/src/loadgen/main.go

// step1. add OpenTelemetry initialization function
func initTracer() (*sdktrace.TracerProvider, error) {
        // create a stdout exporter to show collected spans out to stdout.
        exporter, err := stdout.New(stdout.WithPrettyPrint())
        if err != nil {
                return nil, err
        }

        // for the demonstration, we use the AlwaysSample sampler to record all spans.
        // do not use this option in production.
        tp := sdktrace.NewTracerProvider(
                sdktrace.WithSampler(sdktrace.AlwaysSample()),
                sdktrace.WithBatcher(exporter),
        )
        otel.SetTracerProvider(tp)
        otel.SetTextMapPropagator(propagation.TraceContext{})
        return tp, nil
}

You may notice that the setup procedure here follows the steps described in the previous section. In this implementation, we use a stdout exporter that exports all trace information to stdout in a structured format.

Then call it from the main function: call initTracer() and make sure that TracerProvider.Shutdown() is called when the application exits.

step0/src/loadgen/main.go

func main() {
        // step1. setup OpenTelemetry
        tp, err := initTracer()
        if err != nil {
                log.Fatalf("failed to initialize TracerProvider: %v", err)
        }
        defer func() {
                if err := tp.Shutdown(context.Background()); err != nil {
                        log.Fatalf("error shutting down TracerProvider: %v", err)
                }
        }()
        // step1. end setup

        log.Printf("starting worder with %d workers in %d concurrency", numWorkers, numConcurrency)
        log.Printf("number of rounds: %d (0 is inifinite)", numRounds)
        ...

Once you finish the setup, you need to create a Span with a unique Trace ID and Span ID. OpenTelemetry provides handy libraries for this. Add extra packages to instrument the HTTP client.

step0/src/loadgen/main.go

import (
        "context"
        "encoding/json"
        "fmt"
        "io"
        "log"
        "math/rand"
        "net/http"
        "net/http/httptrace" // step1. add packages
        "net/url"
        "time"
        // step1. add packages
        "go.opentelemetry.io/contrib/instrumentation/net/http/httptrace/otelhttptrace"
        "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
        // step1. end add packages
        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
        "go.opentelemetry.io/otel/propagation"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        semconv "go.opentelemetry.io/otel/semconv/v1.10.0"
        "go.opentelemetry.io/otel/trace"
)

Because the load generator calls the client service over HTTP with net/http in the runQuery function, we use the contrib packages for net/http and enable instrumentation with the otelhttptrace and otelhttp extensions.

First, add a package-level global variable httpClient so that HTTP requests are made via the instrumented client.

step0/src/loadgen/main.go

var httpClient = http.Client{
        Transport: otelhttp.NewTransport(http.DefaultTransport),
}

Next, add instrumentation in the runQuery function to create a custom span with OpenTelemetry, together with the span auto-generated by the instrumented HTTP client. What you will do is:

  1. Get a Tracer from the global TracerProvider with otel.Tracer()
  2. Create a root span with the Tracer.Start() method
  3. End the root span at an arbitrary point (in this case, at the end of the runQuery function)

step0/src/loadgen/main.go

        reqURL.RawQuery = v.Encode()
        // step1. replace http.Get() with custom client call
        // resp, err := http.Get(reqURL.String())

        // step1. instrument trace
        ctx := context.Background()
        tr := otel.Tracer("loadgen")
        ctx, span := tr.Start(ctx, "query.request", trace.WithAttributes(
                semconv.TelemetrySDKLanguageGo,
                semconv.ServiceNameKey.String("loadgen.runQuery"),
                attribute.Key("query").String(s),
        ))
        defer span.End()
        ctx = httptrace.WithClientTrace(ctx, otelhttptrace.NewClientTrace(ctx))
        req, err := http.NewRequestWithContext(ctx, "GET", reqURL.String(), nil)
        if err != nil {
                return -1, fmt.Errorf("error creating HTTP request object: %v", err)
        }
        resp, err := httpClient.Do(req)
        // step1. end instrumentation
        if err != nil {
                return -1, fmt.Errorf("error sending request to %v: %v", reqURL.String(), err)
        }

Now you are done with the instrumentation in loadgen (the HTTP client application). Make sure to update your go.mod and go.sum with the go mod command.

go mod tidy

Instrument client service

In the previous section, we instrumented the part enclosed in the red rectangle in the drawing below, that is, span information in the load generator service. Now we need to instrument the client service in a similar way. The difference is that the client service needs to extract the trace context propagated from the load generator service in the HTTP headers and use it to generate its own Spans.

bcaccd06691269f8.png
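
Under the hood, that extraction is just the reverse of the injection done on the loadgen side. The following is a rough sketch (not code you will write in this codelab) of what the otelhttp handler introduced below does for you; the handler name is a placeholder:

import (
        "net/http"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/propagation"
)

// extractExample sketches what otelhttp.NewHandler does automatically:
// it extracts the propagated trace context from the incoming request headers
// and starts a server-side span as a child of the remote (loadgen) span.
func extractExample(w http.ResponseWriter, r *http.Request) {
        ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
        ctx, span := otel.Tracer("client").Start(ctx, "client.handler")
        defer span.End()
        _ = ctx // handle the request with this ctx so downstream calls join the same trace
}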

Open the Cloud Shell Editor and add the required packages as we did for the load generator service.

step0/src/client/main.go

import (
        "context"
        "encoding/json"
        "fmt"
        "io"
        "log"
        "net/http"
        "net/url"
        "os"
        "time"

        "opentelemetry-trace-codelab-go/client/shakesapp"
        // step1. add new import
        "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
        "go.opentelemetry.io/otel/propagation"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        "go.opentelemetry.io/otel/trace"
        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        // step1. end new import
)

Again, we need to set up OpenTelemetry. Just copy and paste the initTracer function from loadgen and call it in the client service's main function as well.

step0/src/client/main.go

// step1. add OpenTelemetry initialization function
func initTracer() (*sdktrace.TracerProvider, error) {
        // create a stdout exporter to show collected spans out to stdout.
        exporter, err := stdout.New(stdout.WithPrettyPrint())
        if err != nil {
                return nil, err
        }

        // for the demonstration, we use the AlwaysSample sampler to record all spans.
        // do not use this option in production.
        tp := sdktrace.NewTracerProvider(
                sdktrace.WithSampler(sdktrace.AlwaysSample()),
                sdktrace.WithBatcher(exporter),
        )
        otel.SetTracerProvider(tp)
        otel.SetTextMapPropagator(propagation.TraceContext{})
        return tp, nil
}

Now it's time to instrument spans. Because the client service needs to accept HTTP requests from the loadgen service, it needs to instrument the handler. The HTTP server in the client service is implemented with net/http, so you can use the otelhttp package just as we did in the loadgen.

First, we replace the handler registration with the otelhttp handler. In the main function, find the lines where the HTTP handler is registered with http.HandleFunc().

step0/src/client/main.go

        // step1. change handler to intercept OpenTelemetry related headers
        // http.HandleFunc("/", svc.handler)
        otelHandler := otelhttp.NewHandler(http.HandlerFunc(svc.handler), "client.handler")
        http.Handle("/", otelHandler)
        // step1. end intercepter setting
        http.HandleFunc("/_genki", svc.health)

Then, we instrument the actual span inside the handler. Find func (*clientService) handler(), and add span instrumentation with trace.SpanFromContext().

step0/src/client/main.go

func (cs *clientService) handler(w http.ResponseWriter, r *http.Request) {
        ...
        ctx := r.Context()
        ctx, cancel := context.WithCancel(ctx)
        defer cancel()
        // step1. instrument trace
        span := trace.SpanFromContext(ctx)
        defer span.End()
        // step1. end instrument
        ...

With this instrumentation, you get spans that cover the handler method from beginning to end. To make the spans easier to analyze, add an extra attribute that stores the match count of the query. Right before the log line, add the following code.

step0/src/client/main.go

func (cs *clientService) handler(w http.ResponseWriter, r *http.Request) {
        ...
        // step1. add span specific attribute
        span.SetAttributes(attribute.Key("matched").Int64(resp.MatchCount))
        // step1. end adding attribute
        log.Println(string(ret))
        ...

With all the instrumentation above, you have completed the trace instrumentation between loadgen and client. Let's see how it works. Run the code with skaffold again.

skaffold dev

After the services have been running on the GKE cluster for a while, you will see a huge amount of log messages like this:

Command output

[loadgen] {
[loadgen]       "Name": "query.request",
[loadgen]       "SpanContext": {
[loadgen]               "TraceID": "cfa22247a542beeb55a3434392d46b89",
[loadgen]               "SpanID": "18b06404b10c418b",
[loadgen]               "TraceFlags": "01",
[loadgen]               "TraceState": "",
[loadgen]               "Remote": false
[loadgen]       },
[loadgen]       "Parent": {
[loadgen]               "TraceID": "00000000000000000000000000000000",
[loadgen]               "SpanID": "0000000000000000",
[loadgen]               "TraceFlags": "00",
[loadgen]               "TraceState": "",
[loadgen]               "Remote": false
[loadgen]       },
[loadgen]       "SpanKind": 1,
[loadgen]       "StartTime": "2022-07-14T13:13:36.686751087Z",
[loadgen]       "EndTime": "2022-07-14T13:14:31.849601964Z",
[loadgen]       "Attributes": [
[loadgen]               {
[loadgen]                       "Key": "telemetry.sdk.language",
[loadgen]                       "Value": {
[loadgen]                               "Type": "STRING",
[loadgen]                               "Value": "go"
[loadgen]                       }
[loadgen]               },
[loadgen]               {
[loadgen]                       "Key": "service.name",
[loadgen]                       "Value": {
[loadgen]                               "Type": "STRING",
[loadgen]                               "Value": "loadgen.runQuery"
[loadgen]                       }
[loadgen]               },
[loadgen]               {
[loadgen]                       "Key": "query",
[loadgen]                       "Value": {
[loadgen]                               "Type": "STRING",
[loadgen]                               "Value": "faith"
[loadgen]                       }
[loadgen]               }
[loadgen]       ],
[loadgen]       "Events": null,
[loadgen]       "Links": null,
[loadgen]       "Status": {
[loadgen]               "Code": "Unset",
[loadgen]               "Description": ""
[loadgen]       },
[loadgen]       "DroppedAttributes": 0,
[loadgen]       "DroppedEvents": 0,
[loadgen]       "DroppedLinks": 0,
[loadgen]       "ChildSpanCount": 5,
[loadgen]       "Resource": [
[loadgen]               {
[loadgen]                       "Key": "service.name",
[loadgen]                       "Value": {
[loadgen]                               "Type": "STRING",
[loadgen]                               "Value": "unknown_service:loadgen"
...

The stdout exporter emits these messages. You will notice that the parents of all the spans created by loadgen have TraceID 00000000000000000000000000000000, because these are root spans, i.e. the first spans in their traces. You can also see that the embedded attribute "query" holds the query string that is passed to the client service.

Summary

In this step, you instrumented the load generator service and the client service, which communicate over HTTP, and confirmed that you could successfully propagate the Trace Context across the services and export Span information from both services to stdout.

Next up

In the next step, you will instrument the client service and the server service to confirm how to propagate the Trace Context via gRPC.

5. Instrumentation for gRPC

In the previous step, we instrumented the first half of the request path in these microservices. In this step, we instrument the gRPC communication between the client service and the server service. (The green and purple rectangles in the picture below.)

75310d8e0e3b1a30.png

Pre-built instrumentation for gRPC client

The OpenTelemetry ecosystem offers a lot of handy libraries that help developers instrument applications. In the previous step, we used pre-built instrumentation for the net/http package. In this step, as we are trying to propagate the Trace Context through gRPC, we use the corresponding library.

First, you import the prebuilt gRPC package called otelgrpc.

step0/src/client/main.go

import (
        "context"
        "encoding/json"
        "fmt"
        "io"
        "log"
        "net/http"
        "net/url"
        "os"
        "time"

        "opentelemetry-trace-codelab-go/client/shakesapp"
        // step2. add prebuilt gRPC package (otelgrpc) 
        "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
        "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
        "go.opentelemetry.io/otel/propagation"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        "go.opentelemetry.io/otel/trace"
        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
)

This time, the client service is a gRPC client of the server service, so you need to instrument the gRPC client. Find the mustConnGRPC function and add gRPC interceptors that instrument new spans every time the client makes a request to the server.

step0/src/client/main.go

// Helper function for gRPC connections: Dial and create client once, reuse.
func mustConnGRPC(ctx context.Context, conn **grpc.ClientConn, addr string) {
        var err error
        // step2. add gRPC interceptor
        interceptorOpt := otelgrpc.WithTracerProvider(otel.GetTracerProvider())
        *conn, err = grpc.DialContext(ctx, addr,
                grpc.WithTransportCredentials(insecure.NewCredentials()),
                grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor(interceptorOpt)),
                grpc.WithStreamInterceptor(otelgrpc.StreamClientInterceptor(interceptorOpt)),
                grpc.WithTimeout(time.Second*3),
        )
        // step2: end adding interceptor
        if err != nil {
                panic(fmt.Sprintf("Error %s grpc: failed to connect %s", err, addr))
        }
}

Because you already set up OpenTelemetry in the previous section, you don't need to do it again.

Pre-built instrumentation for gRPC server

As we did for the gRPC client, we use pre-built instrumentation for the gRPC server. Add the new package to the import section:

step0/src/server/main.go

import (
        "context"
        "fmt"
        "io/ioutil"
        "log"
        "net"
        "os"
        "regexp"
        "strings"

        "opentelemetry-trace-codelab-go/server/shakesapp"

        "cloud.google.com/go/storage"
        // step2. add OpenTelemetry packages including otelgrpc
        "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
        "go.opentelemetry.io/otel"
        stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
        "go.opentelemetry.io/otel/propagation"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        "google.golang.org/api/iterator"
        "google.golang.org/api/option"
        "google.golang.org/grpc"
        healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

Because this is the first time we instrument the server, you first need to set up OpenTelemetry, similarly to what we did for the loadgen and client services.

step0/src/server/main.go

// step2. add OpenTelemetry initialization function
func initTracer() (*sdktrace.TracerProvider, error) {
        // create a stdout exporter to show collected spans out to stdout.
        exporter, err := stdout.New(stdout.WithPrettyPrint())
        if err != nil {
                return nil, err
        }
        // for the demonstration, we use the AlwaysSample sampler to record all spans.
        // do not use this option in production.
        tp := sdktrace.NewTracerProvider(
                sdktrace.WithSampler(sdktrace.AlwaysSample()),
                sdktrace.WithBatcher(exporter),
        )
        otel.SetTracerProvider(tp)
        otel.SetTextMapPropagator(propagation.TraceContext{})
        return tp, nil
}

func main() {
        ...

        // step2. setup OpenTelemetry
        tp, err := initTracer()
        if err != nil {
                log.Fatalf("failed to initialize TracerProvider: %v", err)
        }
        defer func() {
                if err := tp.Shutdown(context.Background()); err != nil {
                        log.Fatalf("error shutting down TracerProvider: %v", err)
                }
        }()
        // step2. end setup
        ...

Next, you need to add server interceptors. In the main function, find the place where grpc.NewServer() is called and add interceptors to the call.

step0/src/server/main.go

func main() {
        ...
        svc := NewServerService()
        // step2: add interceptor
        interceptorOpt := otelgrpc.WithTracerProvider(otel.GetTracerProvider())
        srv := grpc.NewServer(
                grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor(interceptorOpt)),
                grpc.StreamInterceptor(otelgrpc.StreamServerInterceptor(interceptorOpt)),
        )
        // step2: end adding interceptor
        shakesapp.RegisterShakespeareServiceServer(srv, svc)
        ...

Run the microservice and confirm the trace

Then run your modified code with skaffold command.

skaffold dev

Now again, you see a bunch of span information on stdout.

Command output

...
[server] {
[server]        "Name": "shakesapp.ShakespeareService/GetMatchCount",
[server]        "SpanContext": {
[server]                "TraceID": "89b472f213a400cf975e0a0041649667",
[server]                "SpanID": "96030dbad0061b3f",
[server]                "TraceFlags": "01",
[server]                "TraceState": "",
[server]                "Remote": false
[server]        },
[server]        "Parent": {
[server]                "TraceID": "89b472f213a400cf975e0a0041649667",
[server]                "SpanID": "cd90cc3859b73890",
[server]                "TraceFlags": "01",
[server]                "TraceState": "",
[server]                "Remote": true
[server]        },
[server]        "SpanKind": 2,
[server]        "StartTime": "2022-07-14T14:05:55.74822525Z",
[server]        "EndTime": "2022-07-14T14:06:03.449258891Z",
[server]        "Attributes": [
...
[server]        ],
[server]        "Events": [
[server]                {
[server]                        "Name": "message",
[server]                        "Attributes": [
...
[server]                        ],
[server]                        "DroppedAttributeCount": 0,
[server]                        "Time": "2022-07-14T14:05:55.748235489Z"
[server]                },
[server]                {
[server]                        "Name": "message",
[server]                        "Attributes": [
...
[server]                        ],
[server]                        "DroppedAttributeCount": 0,
[server]                        "Time": "2022-07-14T14:06:03.449255889Z"
[server]                }
[server]        ],
[server]        "Links": null,
[server]        "Status": {
[server]                "Code": "Unset",
[server]                "Description": ""
[server]        },
[server]        "DroppedAttributes": 0,
[server]        "DroppedEvents": 0,
[server]        "DroppedLinks": 0,
[server]        "ChildSpanCount": 0,
[server]        "Resource": [
[server]                {
...
[server]        ],
[server]        "InstrumentationLibrary": {
[server]                "Name": "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc",
[server]                "Version": "semver:0.33.0",
[server]                "SchemaURL": ""
[server]        }
[server] }
...

Notice that you haven't set any span names or manually created spans with Tracer.Start() or trace.SpanFromContext(); still, you get a large number of spans because the gRPC interceptors generated them.

Summary

In this step, you instrumented gRPC-based communication with support from the OpenTelemetry ecosystem libraries.

Next up

In the next step, you will at last visualize the trace with Cloud Trace and learn how to analyze the collected spans.

6. Visualize trace with Cloud Trace

You have instrumented traces in the whole system with OpenTelemetry and learned how to instrument HTTP and gRPC services so far. However, you have not yet learned how to analyze them. In this section, you will replace the stdout exporters with Cloud Trace exporters and learn how to analyze your traces.

Use Cloud Trace exporter

One of the powerful characteristics of OpenTelemetry is its pluggability. To visualize all spans collected by your instrumentation, all you need to do is replace the stdout exporter with the Cloud Trace exporter.

Open the main.go file of each service and find the initTracer() function. Delete the line that creates the stdout exporter and create a Cloud Trace exporter instead.

step0/src/loadgen/main.go

import (
        ...
        // step3. add OpenTelemetry for Cloud Trace package
        cloudtrace "github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/trace"
)

// step1. add OpenTelemetry initialization function
func initTracer() (*sdktrace.TracerProvider, error) {
        // step3. replace stdout exporter with Cloud Trace exporter
        // cloudtrace.New() finds the credentials to Cloud Trace automatically following the
        // rules defined by golang.org/x/oauth2/google.FindDefaultCredentialsWithParams.
        // https://pkg.go.dev/golang.org/x/oauth2/google#FindDefaultCredentialsWithParams
        exporter, err := cloudtrace.New()
        // step3. end replacing exporter
        if err != nil {
                return nil, err
        }

        // for the demonstration, we use the AlwaysSample sampler to record all spans.
        // do not use this option in production.
        tp := sdktrace.NewTracerProvider(
                sdktrace.WithSampler(sdktrace.AlwaysSample()),
                sdktrace.WithBatcher(exporter),
        )
        otel.SetTracerProvider(tp)
        otel.SetTextMapPropagator(propagation.TraceContext{})
        return tp, nil
}

You need to edit the same function in the client and server services as well.
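
By default the exporter detects the destination project from the environment (for example, from the GKE metadata server). If you ever need to point it at a specific project explicitly, the exporter accepts options; the following is a hedged sketch assuming the WithProjectID option of this exporter package:

        // Optional (assumption): pass the project ID explicitly instead of relying
        // on automatic detection. <PROJECT_ID> is a placeholder for your own project.
        exporter, err := cloudtrace.New(
                cloudtrace.WithProjectID("<PROJECT_ID>"),
        )
        if err != nil {
                return nil, err
        }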

Run the microservice and confirm the trace

After the edit, just run the cluster as usual with skaffold command.

skaffold dev

Now you no longer see span information in structured log format on stdout, because you replaced the exporter with the Cloud Trace one.

Command output

[loadgen] 2022/07/14 15:01:07 simulated 20 requests
[loadgen] 2022/07/14 15:01:07 simulating client requests, round 37
[loadgen] 2022/07/14 15:01:14 query 'sweet': matched 958
[client] 2022/07/14 15:01:14 {"match_count":958}
[client] 2022/07/14 15:01:14 {"match_count":3040}
[loadgen] 2022/07/14 15:01:14 query 'love': matched 3040
[client] 2022/07/14 15:01:15 {"match_count":349}
[loadgen] 2022/07/14 15:01:15 query 'hello': matched 349
[client] 2022/07/14 15:01:15 {"match_count":484}
[loadgen] 2022/07/14 15:01:15 query 'faith': matched 484
[loadgen] 2022/07/14 15:01:15 query 'insolence': matched 14
[client] 2022/07/14 15:01:15 {"match_count":14}
[client] 2022/07/14 15:01:21 {"match_count":484}
[loadgen] 2022/07/14 15:01:21 query 'faith': matched 484
[client] 2022/07/14 15:01:21 {"match_count":728}
[loadgen] 2022/07/14 15:01:21 query 'world': matched 728
[client] 2022/07/14 15:01:22 {"match_count":484}
[loadgen] 2022/07/14 15:01:22 query 'faith': matched 484
[loadgen] 2022/07/14 15:01:22 query 'hello': matched 349
[client] 2022/07/14 15:01:22 {"match_count":349}
[client] 2022/07/14 15:01:23 {"match_count":1036}
[loadgen] 2022/07/14 15:01:23 query 'friend': matched 1036
[loadgen] 2022/07/14 15:01:28 query 'tear': matched 463
...

Now let's confirm that all spans are correctly sent to Cloud Trace. Access the Cloud Console and navigate to "Trace list". It is easy to access from the search box; otherwise, you can click the menu in the left pane.

8b3f8411bd737e06.png

Then you see a lot of blue spots distributed across the latency graph. Each spot represents a single trace.

3ecf131423fc4c40.png

Click one of them to see the details inside the trace.

4fd10960c6648a03.png

Even from this quick look, you already gain a lot of insight. For example, from the waterfall graph, you can see that most of the latency comes from the span named shakesapp.ShakespeareService/GetMatchCount. (See 1 in the image above.) You can confirm that from the summary table. (The rightmost column shows the duration of each span.) Also, this trace was for the query "friend". (See 2 in the image above.)

Given this short analysis, you may realize that you need more granular spans inside the GetMatchCount method. Compared to the stdout information, the visualization is powerful. To learn more about Cloud Trace, please visit the official documentation.

Summary

In this step, you replaced the stdout exporter with the Cloud Trace one and visualized traces in Cloud Trace. You also learned how to start analyzing the traces.

Next up

In the next step, you will modify the source code of the server service to add a sub span inside GetMatchCount.

7. Add sub span for better analysis

In the previous step, you found that most of the round-trip time observed from the loadgen is spent inside the GetMatchCount method, the gRPC handler in the server service. However, because we have not instrumented anything other than the handler, we cannot gain further insights from the waterfall graph. This is a common situation when we start instrumenting microservices.

3b63a1e471dddb8c.png

In this section, we are going to instrument a sub span where the server calls Google Cloud Storage, because external network I/O commonly takes a long time in a process and it is important to identify whether that call is the cause.

Instrument a sub span in the server

Open main.go in the server and find the function readFiles. This function sends a request to Google Cloud Storage to fetch all text files of Shakespeare's works. In this function, you can create a sub span, just as you did for the HTTP server instrumentation in the client service.

step0/src/server/main.go

func readFiles(ctx context.Context, bucketName, prefix string) ([]string, error) {
        type resp struct {
                s   string
                err error
        }

        // step4: add an extra span
        span := trace.SpanFromContext(ctx)
        span.SetName("server.readFiles")
        span.SetAttributes(attribute.Key("bucketname").String(bucketName))
        defer span.End()
        // step4: end add span
        ...
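
Note that trace.SpanFromContext() returns the span that is already active in the context (the one created by the gRPC interceptor) and renames it. If you prefer a brand-new child span around the Cloud Storage call instead, a minimal alternative sketch (not the codelab's step4 code; the tracer name is a placeholder) would be:

        // step4 (alternative sketch): start a new child span instead of renaming
        // the span created by the gRPC interceptor. Requires the
        // go.opentelemetry.io/otel import in addition to attribute and trace.
        ctx, span := otel.Tracer("server").Start(ctx, "server.readFiles")
        span.SetAttributes(attribute.Key("bucketname").String(bucketName))
        defer span.End()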

And that's all for adding a new span. Let's see how it goes by running the app.

Run the microservice and confirm the trace

After the edit, just run the cluster as usual with skaffold command.

skaffold dev

Pick one trace named query.request from the trace list. You will see a similar trace waterfall graph except for a new span under shakesapp.ShakespeareService/GetMatchCount. (The span enclosed by the red rectangle below.)

3d4a891aa30d7a32.png

What you can tell from this graph now is that the external call to Google Cloud Storage accounts for a large amount of the latency, but other things still make up the majority of it.

You have already gained a lot of insight from just a couple of looks at the trace waterfall graph. How do you obtain further performance details in your application? This is where the profiler comes in, but for now, let's end this codelab here and leave the profiler tutorial to part 2.

Summary

In this step, you instrumented another span in the server service and obtained further insights into the system latency.

8. Congratulations

You have successfully created distributed traces with OpenTelemetry and confirmed request latencies across the microservices in Google Cloud Trace.

For extended exercises, you can try the following topics by yourself.

  • The current implementation sends all spans generated by the health check (grpc.health.v1.Health/Check). How do you filter those spans out of Cloud Trace? A hint is here, and a rough sketch follows this list.
  • Correlate event logs with spans and see how it works in Google Cloud Trace and Google Cloud Logging. A hint is here.
  • Replace one of the services with an implementation in another language and try to instrument it with OpenTelemetry for that language.
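
For the first exercise, one possible direction (a hedged sketch, not the hinted solution; the type name is made up for illustration) is a custom sampler that drops health-check spans by name and delegates everything else to the existing sampler:

import (
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        "go.opentelemetry.io/otel/trace"
)

// healthCheckFilterSampler drops spans generated by gRPC health checks and
// delegates every other sampling decision to the wrapped sampler.
type healthCheckFilterSampler struct {
        next sdktrace.Sampler
}

func (s healthCheckFilterSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
        if p.Name == "grpc.health.v1.Health/Check" {
                return sdktrace.SamplingResult{
                        Decision:   sdktrace.Drop,
                        Tracestate: trace.SpanContextFromContext(p.ParentContext).TraceState(),
                }
        }
        return s.next.ShouldSample(p)
}

func (s healthCheckFilterSampler) Description() string {
        return "drops gRPC health check spans"
}

// Usage in initTracer:
//   sdktrace.WithSampler(healthCheckFilterSampler{next: sdktrace.AlwaysSample()})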

Also, if you would like to learn about the profiler after this, please move on to part 2. You can skip the clean up section below in that case.

Clean up

After this codelab, please stop the Kubernetes cluster and make sure to delete the project so that you don't get unexpected charges for Google Kubernetes Engine, Google Cloud Trace, and Google Artifact Registry.

First, delete the cluster. If you are running the cluster with skaffold dev, you just need to press Ctrl-C. If you are running the cluster with skaffold run, run the following command:

skaffold delete

Command output

Cleaning up...
 - deployment.apps "clientservice" deleted
 - service "clientservice" deleted
 - deployment.apps "loadgen" deleted
 - deployment.apps "serverservice" deleted
 - service "serverservice" deleted

After deleting the cluster, from the menu pane, select "IAM & Admin" > "Settings", and then click the "SHUT DOWN" button.

45aa37b7d5e1ddd1.png

Then enter the Project ID (not Project Name) in the form in the dialog and confirm shutdown.