1. Introduction
Last Updated: 2022-07-15
Observability and OpenTelemetry
Observability describes an attribute of a system: a system that is observable lets teams actively debug it. In that context, the three pillars of observability (logs, metrics, and traces) are the fundamental telemetry a system needs in order to be observable.
OpenTelemetry is a set of specifications, libraries, and agents that accelerate the instrumentation and export of the telemetry data (logs, metrics, and traces) that observability requires. It is an open standard and a community-driven project under the CNCF. With the libraries that the project and its ecosystem provide, developers can instrument their applications in a vendor-neutral way and across multiple architectures.
In addition to the three pillars, continuous profiling is another key component of observability, and its user base in the industry is expanding. Cloud Profiler is one of the originators and provides an easy interface to drill down into performance metrics in application call stacks.
This codelab is part 1 of the series and covers instrumenting distributed traces in microservices with OpenTelemetry and Cloud Trace. Part 2 will cover continuous profiling with Cloud Profiler.
Distributed Trace
Among logs, metrics, and traces, a trace is the telemetry that tells you the latency of a specific part of a process in the system. Especially in the era of microservices, distributed tracing is a strong driver for finding latency bottlenecks in the overall distributed system.
When analyzing distributed traces, visualizing the trace data is the key to grasping overall system latencies at a glance. In distributed tracing, the set of calls made to process a single request to the system entrypoint is handled as a Trace containing multiple Spans.
A Span represents an individual unit of work done in a distributed system, recording start and stop times. Spans often have hierarchical relationships with each other; in the picture below, all of the smaller spans are child spans of the large /messages span, and they are assembled into one Trace that shows the path of work through the system.
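To make the parent-child relationship concrete, here is a minimal sketch (not part of Shakesapp; the tracer name "example" and the span names are placeholders) of how nested spans form one trace with the OpenTelemetry Go API:

package main

import (
    "context"

    "go.opentelemetry.io/otel"
)

func handleMessages(ctx context.Context) {
    tr := otel.Tracer("example") // the tracer name is an arbitrary placeholder

    // Parent span, e.g. the whole /messages request.
    ctx, parent := tr.Start(ctx, "/messages")
    defer parent.End()

    // Child spans started from ctx share the parent's Trace ID and record
    // its Span ID, so the backend can assemble them into one trace.
    _, child1 := tr.Start(ctx, "auth.check")
    child1.End()

    _, child2 := tr.Start(ctx, "cache.lookup")
    child2.End()
}

func main() {
    // Without a registered TracerProvider this uses a no-op tracer,
    // but the call structure is the same as with a real SDK setup.
    handleMessages(context.Background())
}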
Google Cloud Trace is one option for a distributed trace backend, and it is well integrated with other products in Google Cloud.
What you'll build
In this codelab, you're going to instrument trace information in a set of services called the "Shakespeare application" (a.k.a. Shakesapp) that runs on a Google Kubernetes Engine cluster. The architecture of Shakesapp is as follows:
- Loadgen sends a query string to the client over HTTP
- The client passes the query from the loadgen through to the server over gRPC
- The server accepts the query from the client, fetches all of Shakespeare's works in text format from Google Cloud Storage, searches for the lines that contain the query, and returns the number of matching lines to the client
You will instrument trace information across this request path. After that, in part 2, you will embed a profiler agent in the server and investigate the bottleneck.
What you'll learn
- How to get started with the OpenTelemetry Trace libraries in a Go project
- How to create a span with the library
- How to propagate span contexts across the wire between app components
- How to send trace data to Cloud Trace
- How to analyze the trace on Cloud Trace
This codelab explains how to instrument your microservices. To make it easy to understand, this example only contains 3 components (load generator, client, and server), but you can apply the same process explained here to larger, more complex systems.
What you'll need
- A basic knowledge of Go
- A basic knowledge of Kubernetes
2. Setup and Requirements
Self-paced environment setup
If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign in to the Google Cloud Platform console (console.cloud.google.com) and create a new project.
If you already have a project, click on the project selection pull down menu in the upper left of the console:
and click the 'NEW PROJECT' button in the resulting dialog to create a new project:
If you don't already have a project, you should see a dialog like this to create your first one:
The subsequent project creation dialog allows you to enter the details of your new project:
Remember the project ID, which is a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!). It will be referred to later in this codelab as PROJECT_ID.
Next, if you haven't already done so, you'll need to enable billing in the Developers Console in order to use Google Cloud resources and enable the Cloud Trace API.
Running through this codelab shouldn't cost you more than a few dollars, but it could be more if you decide to use more resources or if you leave them running (see the "cleanup" section at the end of this document). Pricing for Google Cloud Trace, Google Kubernetes Engine, and Google Artifact Registry is noted in the official documentation.
- Pricing for Google Cloud's operations suite | Operations Suite
- Pricing | Kubernetes Engine Documentation
- Artifact Registry Pricing | Artifact Registry documentation
New users of Google Cloud Platform are eligible for a $300 free trial, which should make this codelab entirely free of charge.
Google Cloud Shell Setup
While Google Cloud and Google Cloud Trace can be operated remotely from your laptop, in this codelab we will be using Google Cloud Shell, a command line environment running in the Cloud.
This Debian-based virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. This means that all you will need for this codelab is a browser (yes, it works on a Chromebook).
To activate Cloud Shell from the Cloud Console, simply click Activate Cloud Shell (it should only take a few moments to provision and connect to the environment).
Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID.
gcloud auth list
Command output
Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)
gcloud config list project
Command output
[core]
project = <PROJECT_ID>
If, for some reason, the project is not set, simply issue the following command:
gcloud config set project <PROJECT_ID>
Looking for your PROJECT_ID? Check out what ID you used in the setup steps or look it up in the Cloud Console dashboard:
Cloud Shell also sets some environment variables by default, which may be useful as you run future commands.
echo $GOOGLE_CLOUD_PROJECT
Command output
<PROJECT_ID>
Finally, set the default zone and project configuration.
gcloud config set compute/zone us-central1-f
You can choose a variety of different zones. For more information, see Regions & Zones.
Go language setup
In this codelab, we use Go for all source code. Run the following command in Cloud Shell and confirm that the Go version is 1.17 or later.
go version
Command output
go version go1.18.3 linux/amd64
Set up a Google Kubernetes Engine cluster
In this codelab, you will run a cluster of microservices on Google Kubernetes Engine (GKE). The process of this codelab is as follows:
- Download the baseline project into Cloud Shell
- Build microservices into containers
- Upload containers onto Google Artifact Registry (GAR)
- Deploy containers onto GKE
- Modify the source code of services for trace instrumentation
- Go back to step 2 and repeat
Enable Kubernetes Engine
First, we set up a Kubernetes cluster for Shakesapp to run on, so we need to enable GKE. Navigate to the "Kubernetes Engine" menu and press the ENABLE button.
Now you are ready to create a Kubernetes cluster.
Create Kubernetes cluster
On Cloud Shell, run the following command to create a Kubernetes cluster. Confirm that the zone is in the region you will use when creating the Artifact Registry repository. Change the zone value us-central1-f if your repository region does not cover that zone.
gcloud container clusters create otel-trace-codelab2 \
  --zone us-central1-f \
  --release-channel rapid \
  --preemptible \
  --enable-autoscaling \
  --max-nodes 8 \
  --no-enable-ip-alias \
  --scopes cloud-platform
Command output
Note: Your Pod address range (`--cluster-ipv4-cidr`) can accommodate at most 1008 node(s).
Creating cluster otel-trace-codelab2 in us-central1-f...
Cluster is being health-checked (master is healthy)...done.
Created [https://container.googleapis.com/v1/projects/development-215403/zones/us-central1-f/clusters/otel-trace-codelab2].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1-f/otel-trace-codelab2?project=development-215403
kubeconfig entry generated for otel-trace-codelab2.
NAME: otel-trace-codelab2
LOCATION: us-central1-f
MASTER_VERSION: 1.23.6-gke.1501
MASTER_IP: 104.154.76.89
MACHINE_TYPE: e2-medium
NODE_VERSION: 1.23.6-gke.1501
NUM_NODES: 3
STATUS: RUNNING
Artifact Registry and skaffold setup
Now we have a Kubernetes cluster ready for deployment. Next, we prepare a container registry to push and deploy containers to. For these steps, we need to set up Artifact Registry (GAR) and configure skaffold to use it.
Artifact Registry setup
Navigate to the "Artifact Registry" menu and press the ENABLE button.
After some moments, you will see the repository browser of GAR. Click the "CREATE REPOSITORY" button and enter the name of the repository.
In this codelab, the new repository is named trace-codelab. The artifact format is "Docker" and the location type is "Region". Choose a region close to the one you set as the Google Compute Engine default zone. For example, this codelab chose "us-central1-f" above, so here we choose "us-central1 (Iowa)". Then click the "CREATE" button.
Now you see "trace-codelab" on the repository browser.
We will come back here later to check the registry path.
Skaffold setup
Skaffold is a handy tool for building microservices that run on Kubernetes. It handles the workflow of building, pushing, and deploying application containers with a small set of commands. By default, skaffold uses Docker Registry as the container registry, so you need to configure skaffold to recognize GAR when pushing containers.
Open Cloud Shell again and confirm that skaffold is installed. (Cloud Shell installs skaffold by default.) Run the following command to see the skaffold version.
skaffold version
Command output
v1.38.0
Now you can register the default repository for skaffold to use. To obtain the registry path, navigate to the Artifact Registry dashboard and click the name of the repository you just set up in the previous step.
Then you will see a breadcrumb trail at the top of the page. Click the copy icon to copy the registry path to the clipboard.
When you click the copy button, a dialog appears at the bottom of the browser with a message like:
"us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab" has been copied
Go back to Cloud Shell. Run the skaffold config set default-repo command with the value you just copied from the dashboard.
skaffold config set default-repo us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab
Command output
set value default-repo to us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab for context gke_stackdriver-sandbox-3438851889_us-central1-b_stackdriver-sandbox
You also need to add the registry to the Docker configuration. Run the following command:
gcloud auth configure-docker us-central1-docker.pkg.dev --quiet
Command output
{ "credHelpers": { "gcr.io": "gcloud", "us.gcr.io": "gcloud", "eu.gcr.io": "gcloud", "asia.gcr.io": "gcloud", "staging-k8s.gcr.io": "gcloud", "marketplace.gcr.io": "gcloud", "us-central1-docker.pkg.dev": "gcloud" } } Adding credentials for: us-central1-docker.pkg.dev
Now you are ready for the next step: deploying your containers to the Kubernetes cluster on GKE.
Summary
In this step, you set up your codelab environment:
- Set up Cloud Shell
- Created an Artifact Registry repository for the container registry
- Set up skaffold to use the container registry
- Created a Kubernetes cluster where the codelab microservices run
Next up
In the next step, you will build, push, and deploy your microservices onto the cluster.
3. Build, push and deploy the microservices
Download the codelab material
In the previous step, we set up all the prerequisites for this codelab. Now you are ready to run the whole set of microservices on top of them. The codelab material is hosted on GitHub, so download it to the Cloud Shell environment with the following git command.
cd ~
git clone https://github.com/ymotongpoo/opentelemetry-trace-codelab-go.git
cd opentelemetry-trace-codelab-go
The directory structure of the project is as follows:
.
├── README.md
├── step0
│   ├── manifests
│   ├── proto
│   ├── skaffold.yaml
│   └── src
├── step1
│   ├── manifests
│   ├── proto
│   ├── skaffold.yaml
│   └── src
├── step2
│   ├── manifests
│   ├── proto
│   ├── skaffold.yaml
│   └── src
├── step3
│   ├── manifests
│   ├── proto
│   ├── skaffold.yaml
│   └── src
├── step4
│   ├── manifests
│   ├── proto
│   ├── skaffold.yaml
│   └── src
├── step5
│   ├── manifests
│   ├── proto
│   ├── skaffold.yaml
│   └── src
└── step6
    ├── manifests
    ├── proto
    ├── skaffold.yaml
    └── src
- manifests: Kubernetes manifest files
- proto: proto definition for the communication between client and server
- src: directories for the source code of each service
- skaffold.yaml: Configuration file for skaffold
In this codelab, you will update the source code located under the step0 folder. You can also refer to the source code in the step[1-6] folders for the answers to the following steps. (Part 1 covers step0 through step4, and Part 2 covers step5 and step6.)
Run skaffold command
Finally, you are ready to build, push, and deploy everything onto the Kubernetes cluster you have just created. This sounds like multiple steps, but in reality skaffold does everything for you. Let's try it with the following command:
cd step0
skaffold dev
As soon as you run the command, you will see the log output of docker build and can confirm that the images are successfully pushed to the registry.
Command output
...
 ---> Running in c39b3ea8692b
 ---> 90932a583ab6
Successfully built 90932a583ab6
Successfully tagged us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab/serverservice:step1
The push refers to repository [us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab/serverservice]
cc8f5a05df4a: Preparing
5bf719419ee2: Preparing
2901929ad341: Preparing
88d9943798ba: Preparing
b0fdf826a39a: Preparing
3c9c1e0b1647: Preparing
f3427ce9393d: Preparing
14a1ca976738: Preparing
f3427ce9393d: Waiting
14a1ca976738: Waiting
3c9c1e0b1647: Waiting
b0fdf826a39a: Layer already exists
88d9943798ba: Layer already exists
f3427ce9393d: Layer already exists
3c9c1e0b1647: Layer already exists
14a1ca976738: Layer already exists
2901929ad341: Pushed
5bf719419ee2: Pushed
cc8f5a05df4a: Pushed
step1: digest: sha256:8acdbe3a453001f120fb22c11c4f6d64c2451347732f4f271d746c2e4d193bbe size: 2001
After the push of all service containers, Kubernetes deployments start automatically.
Command output
sha256:b71fce0a96cea08075dc20758ae561cf78c83ff656b04d211ffa00cedb77edf8 size: 1997
Tags used in deployment:
 - serverservice -> us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab/serverservice:step4@sha256:8acdbe3a453001f120fb22c11c4f6d64c2451347732f4f271d746c2e4d193bbe
 - clientservice -> us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab/clientservice:step4@sha256:b71fce0a96cea08075dc20758ae561cf78c83ff656b04d211ffa00cedb77edf8
 - loadgen -> us-central1-docker.pkg.dev/psychic-order-307806/trace-codelab/loadgen:step4@sha256:eea2e5bc8463ecf886f958a86906cab896e9e2e380a0eb143deaeaca40f7888a
Starting deploy...
 - deployment.apps/clientservice created
 - service/clientservice created
 - deployment.apps/loadgen created
 - deployment.apps/serverservice created
 - service/serverservice created
After the deployment, you'll see the actual application logs emitted to stdout in each container, like this:
Command output
[client] 2022/07/14 06:33:15 {"match_count":3040}
[loadgen] 2022/07/14 06:33:15 query 'love': matched 3040
[client] 2022/07/14 06:33:15 {"match_count":3040}
[loadgen] 2022/07/14 06:33:15 query 'love': matched 3040
[client] 2022/07/14 06:33:16 {"match_count":3040}
[loadgen] 2022/07/14 06:33:16 query 'love': matched 3040
[client] 2022/07/14 06:33:19 {"match_count":463}
[loadgen] 2022/07/14 06:33:19 query 'tear': matched 463
[loadgen] 2022/07/14 06:33:20 query 'world': matched 728
[client] 2022/07/14 06:33:20 {"match_count":728}
[client] 2022/07/14 06:33:22 {"match_count":463}
[loadgen] 2022/07/14 06:33:22 query 'tear': matched 463
Note that at this point, you don't see any log messages from the server. OK, finally you are ready to start instrumenting your application with OpenTelemetry for distributed tracing of the services.
Before you start instrumenting the services, please shut down your cluster with Ctrl-C.
Command output
...
[client] 2022/07/14 06:34:57 {"match_count":1}
[loadgen] 2022/07/14 06:34:57 query 'what's past is prologue': matched 1
^CCleaning up...
 - W0714 06:34:58.464305 28078 gcp.go:120] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.25+; use gcloud instead.
 - To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
 - deployment.apps "clientservice" deleted
 - service "clientservice" deleted
 - deployment.apps "loadgen" deleted
 - deployment.apps "serverservice" deleted
 - service "serverservice" deleted
Summary
In this step, you have prepared the codelab material in your environment and confirmed skaffold runs as expected.
Next up
In the next step, you will modify the source code of the loadgen service to instrument trace information.
4. Instrumentation for HTTP
Concept of trace instrumentation and propagation
Before editing the source code, let me briefly explain how distributed traces work, using a simple diagram.
In this example, we instrument the code to export Trace and Span information to Cloud Trace and propagate trace context across the request from the loadgen service to the server service.
Applications need to send trace metadata such as the Trace ID and Span ID so that Cloud Trace can assemble all spans with the same Trace ID into one trace. The application also needs to propagate the trace context (the combination of the Trace ID and the Span ID of the parent span) when requesting downstream services, so that they know which trace context they are handling. A minimal sketch of what this propagation looks like follows the list below.
OpenTelemetry helps you:
- to generate unique Trace IDs and Span IDs
- to export Trace IDs and Span IDs to the backend
- to propagate trace contexts to other services
- to embed extra metadata that helps you analyze traces
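The otelhttp and otelgrpc libraries used later in this codelab handle this propagation for you. As an illustration only, here is a minimal sketch of the sending (inject) side with the W3C Trace Context propagator; the URL is a placeholder:

package main

import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

// callDownstream injects the current trace context into the outgoing HTTP
// headers (a W3C "traceparent" header), so the downstream service can attach
// its spans to the same trace.
func callDownstream(ctx context.Context) error {
    req, err := http.NewRequestWithContext(ctx, "GET", "http://example.com/query", nil)
    if err != nil {
        return err
    }
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
    _, err = http.DefaultClient.Do(req)
    return err
}

func main() {
    // Register the W3C Trace Context propagator; the codelab does this in initTracer().
    otel.SetTextMapPropagator(propagation.TraceContext{})
    _ = callDownstream(context.Background())
}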
Components in OpenTelemetry Trace
The process to instrument application traces with OpenTelemetry is as follows:
- Create an exporter
- Create a TracerProvider that binds the exporter from step 1, and set it as the global TracerProvider
- Set a TextMapPropagator to configure the propagation method
- Get a Tracer from the TracerProvider
- Generate a Span from the Tracer
As of now, you do not need to understand the detailed properties of each component, but the most important things to remember are (a compact sketch of all five steps follows the list):
- the exporter is pluggable into the TracerProvider
- the TracerProvider holds all configuration regarding trace sampling and export
- all traces are bundled in the Tracer object
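To connect the list above to code, here is a compact sketch of all five steps in one place, using the stdout exporter that this codelab starts with; the tracer name "example" and the span name are placeholders:

package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
    "go.opentelemetry.io/otel/propagation"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    // 1. Create an exporter.
    exporter, err := stdout.New(stdout.WithPrettyPrint())
    if err != nil {
        log.Fatal(err)
    }

    // 2. Create a TracerProvider bound to the exporter and set it globally.
    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
    defer func() { _ = tp.Shutdown(context.Background()) }()
    otel.SetTracerProvider(tp)

    // 3. Set a TextMapPropagator to choose the propagation method.
    otel.SetTextMapPropagator(propagation.TraceContext{})

    // 4. Get a Tracer from the TracerProvider.
    tr := otel.Tracer("example")

    // 5. Generate a Span from the Tracer.
    _, span := tr.Start(context.Background(), "example.span")
    span.End()
}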
With the understanding of this, let's move on to the actual coding work.
Instrument first span
Instrument load generator service
Open the Cloud Shell Editor by pressing the button at the top right of Cloud Shell. Open step0/src/loadgen/main.go from the explorer in the left pane and find the main function.
step0/src/loadgen/main.go
func main() {
    ...
    for range t.C {
        log.Printf("simulating client requests, round %d", i)
        if err := run(numWorkers, numConcurrency); err != nil {
            log.Printf("aborted round with error: %v", err)
        }
        log.Printf("simulated %d requests", numWorkers)
        if numRounds != 0 && i > numRounds {
            break
        }
        i++
    }
}
In the main function, you see a loop that calls the run function. In the current implementation, the section has two log lines that record the beginning and the end of the function call. Now let's instrument Span information to track the latency of the function call.
First, as noted in the previous section, let's set up the whole OpenTelemetry configuration. Add the OpenTelemetry packages as follows:
step0/src/loadgen/main.go
import ( "context" // step1. add packages "encoding/json" "fmt" "io" "log" "math/rand" "net/http" "net/url" "time" // step1. add packages "go.opentelemetry.io/otel" "go.opentelemetry.io/otel/attribute" stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace" "go.opentelemetry.io/otel/propagation" sdktrace "go.opentelemetry.io/otel/sdk/trace" semconv "go.opentelemetry.io/otel/semconv/v1.10.0" "go.opentelemetry.io/otel/trace" // step1. end add packages )
For readability, we create a setup function called initTracer and call it from the main function.
step0/src/loadgen/main.go
// step1. add OpenTelemetry initialization function
func initTracer() (*sdktrace.TracerProvider, error) {
    // create a stdout exporter to show collected spans out to stdout.
    exporter, err := stdout.New(stdout.WithPrettyPrint())
    if err != nil {
        return nil, err
    }

    // for the demonstration, we use AlwaysSample sampler to take all spans.
    // do not use this option in production.
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithSampler(sdktrace.AlwaysSample()),
        sdktrace.WithBatcher(exporter),
    )
    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.TraceContext{})
    return tp, nil
}
You may notice that the setup procedure for OpenTelemetry is as described in the previous section. In this implementation, we use a stdout exporter that exports all trace information to stdout in a structured format.
Then call it from the main function. Call initTracer() and make sure to call TracerProvider.Shutdown() when you close the application.
step0/src/loadgen/main.go
func main() {
    // step1. setup OpenTelemetry
    tp, err := initTracer()
    if err != nil {
        log.Fatalf("failed to initialize TracerProvider: %v", err)
    }
    defer func() {
        if err := tp.Shutdown(context.Background()); err != nil {
            log.Fatalf("error shutting down TracerProvider: %v", err)
        }
    }()
    // step1. end setup

    log.Printf("starting worker with %d workers in %d concurrency", numWorkers, numConcurrency)
    log.Printf("number of rounds: %d (0 is infinite)", numRounds)
    ...
Once you finish the setup, you need to create a Span with a unique Trace ID and Span ID. OpenTelemetry provides handy libraries for this. Add extra packages to instrument the HTTP client.
step0/src/loadgen/main.go
import ( "context" "encoding/json" "fmt" "io" "log" "math/rand" "net/http" "net/http/httptrace" // step1. add packages "net/url" "time" // step1. add packages "go.opentelemetry.io/contrib/instrumentation/net/http/httptrace/otelhttptrace" "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp" // step1. end add packages "go.opentelemetry.io/otel" "go.opentelemetry.io/otel/attribute" stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace" "go.opentelemetry.io/otel/propagation" sdktrace "go.opentelemetry.io/otel/sdk/trace" semconv "go.opentelemetry.io/otel/semconv/v1.10.0" "go.opentelemetry.io/otel/trace" )
Because the load generator calls the client service over HTTP with net/http in the runQuery function, we use the contrib package for net/http and enable instrumentation with the httptrace and otelhttp extension packages.
First, add a package-level global variable httpClient so that HTTP requests go through the instrumented client.
step0/src/loadgen/main.go
var httpClient = http.Client{
    Transport: otelhttp.NewTransport(http.DefaultTransport),
}
Next, add instrumentation in the runQuery function to create a custom span with OpenTelemetry, alongside the span auto-generated by the custom HTTP client. What you will do is:

- Get a Tracer from the global TracerProvider with otel.Tracer()
- Create a root span with the Tracer.Start() method
- End the root span at an arbitrary point (in this case, the end of the runQuery function)
step0/src/loadgen/main.go
    reqURL.RawQuery = v.Encode()
    // step1. replace http.Get() with custom client call
    // resp, err := http.Get(reqURL.String())

    // step1. instrument trace
    ctx := context.Background()
    tr := otel.Tracer("loadgen")
    ctx, span := tr.Start(ctx, "query.request", trace.WithAttributes(
        semconv.TelemetrySDKLanguageGo,
        semconv.ServiceNameKey.String("loadgen.runQuery"),
        attribute.Key("query").String(s),
    ))
    defer span.End()
    ctx = httptrace.WithClientTrace(ctx, otelhttptrace.NewClientTrace(ctx))
    req, err := http.NewRequestWithContext(ctx, "GET", reqURL.String(), nil)
    if err != nil {
        return -1, fmt.Errorf("error creating HTTP request object: %v", err)
    }
    resp, err := httpClient.Do(req)
    // step1. end instrumentation
    if err != nil {
        return -1, fmt.Errorf("error sending request to %v: %v", reqURL.String(), err)
    }
Now you are done with the instrumentation in loadgen (the HTTP client application). Make sure to update your go.mod and go.sum with the go mod command.
go mod tidy
Instrument client service
In the previous section, we instrumented the part enclosed in the red rectangle in the drawing below: span information in the load generator service. Similarly, we now need to instrument the client service. The difference from the load generator service is that the client service needs to extract the trace context propagated from the load generator service in the HTTP headers and use it to generate Spans.
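The otelhttp handler wrapper you will add below performs this extraction for you. Purely as an illustration of what it does internally (assuming W3C Trace Context headers; the tracer and span names are placeholders), a hand-rolled version would look roughly like this:

package main

import (
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/trace"
)

// traceMiddleware is a hypothetical sketch, not part of Shakesapp: it extracts
// the incoming trace context from the request headers and starts a server-side
// span that becomes a child of the caller's span.
func traceMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
        ctx, span := otel.Tracer("client").Start(ctx, "client.handler",
            trace.WithSpanKind(trace.SpanKindServer))
        defer span.End()
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

func main() {
    // In the codelab you will use otelhttp.NewHandler instead of this sketch;
    // no server is started here.
    http.Handle("/", traceMiddleware(http.NotFoundHandler()))
}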
Open the Cloud Shell Editor and add the required packages as we did for the load generator service.
step0/src/client/main.go
import ( "context" "encoding/json" "fmt" "io" "log" "net/http" "net/url" "os" "time" "opentelemetry-trace-codelab-go/client/shakesapp" // step1. add new import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp" "go.opentelemetry.io/otel" "go.opentelemetry.io/otel/attribute" stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace" "go.opentelemetry.io/otel/propagation" sdktrace "go.opentelemetry.io/otel/sdk/trace" "go.opentelemetry.io/otel/trace" "google.golang.org/grpc" "google.golang.org/grpc/credentials/insecure" // step1. end new import )
Again, we need to set up OpenTelemetry. Just copy and paste the initTracer function from loadgen and call it in the client service's main function as well.
step0/src/client/main.go
// step1. add OpenTelemetry initialization function
func initTracer() (*sdktrace.TracerProvider, error) {
    // create a stdout exporter to show collected spans out to stdout.
    exporter, err := stdout.New(stdout.WithPrettyPrint())
    if err != nil {
        return nil, err
    }

    // for the demonstration, we use AlwaysSample sampler to take all spans.
    // do not use this option in production.
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithSampler(sdktrace.AlwaysSample()),
        sdktrace.WithBatcher(exporter),
    )
    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.TraceContext{})
    return tp, nil
}
Now it's time to instrument spans. Because the client service needs to accept HTTP requests from the loadgen service, it needs to instrument the handler. The HTTP server in the client service is implemented with net/http, and you can use the otelhttp package as we did in the loadgen.
First, we replace the handler registration with an otelhttp handler. In the main function, find the lines where the HTTP handler is registered with http.HandleFunc().
step0/src/client/main.go
// step1. change handler to intercept OpenTelemetry related headers
// http.HandleFunc("/", svc.handler)
otelHandler := otelhttp.NewHandler(http.HandlerFunc(svc.handler), "client.handler")
http.Handle("/", otelHandler)
// step1. end interceptor setting

http.HandleFunc("/_genki", svc.health)
Then, we instrument the actual span inside the handler. Find func (*clientService) handler() and add span instrumentation with trace.SpanFromContext().
step0/src/client/main.go
func (cs *clientService) handler(w http.ResponseWriter, r *http.Request) {
    ...
    ctx := r.Context()
    ctx, cancel := context.WithCancel(ctx)
    defer cancel()

    // step1. instrument trace
    span := trace.SpanFromContext(ctx)
    defer span.End()
    // step1. end instrument
    ...
With this instrumentation, you get spans from the beginning of the handler method to the end of it. To make the spans easier to analyze, add an extra attribute that stores the match count of the query. Right before the log line, add the following code.
func (cs *clientService) handler(w http.ResponseWriter, r *http.Request) {
    ...
    // step1. add span specific attribute
    span.SetAttributes(attribute.Key("matched").Int64(resp.MatchCount))
    // step1. end adding attribute

    log.Println(string(ret))
    ...
With all the instrumentation above, you have completed the trace instrumentation between loadgen and client. Let's see how it works. Run the code with skaffold again.
skaffold dev
After the services have been running on the GKE cluster for a while, you will see a huge amount of log messages like this:
Command output
[loadgen] { [loadgen] "Name": "query.request", [loadgen] "SpanContext": { [loadgen] "TraceID": "cfa22247a542beeb55a3434392d46b89", [loadgen] "SpanID": "18b06404b10c418b", [loadgen] "TraceFlags": "01", [loadgen] "TraceState": "", [loadgen] "Remote": false [loadgen] }, [loadgen] "Parent": { [loadgen] "TraceID": "00000000000000000000000000000000", [loadgen] "SpanID": "0000000000000000", [loadgen] "TraceFlags": "00", [loadgen] "TraceState": "", [loadgen] "Remote": false [loadgen] }, [loadgen] "SpanKind": 1, [loadgen] "StartTime": "2022-07-14T13:13:36.686751087Z", [loadgen] "EndTime": "2022-07-14T13:14:31.849601964Z", [loadgen] "Attributes": [ [loadgen] { [loadgen] "Key": "telemetry.sdk.language", [loadgen] "Value": { [loadgen] "Type": "STRING", [loadgen] "Value": "go" [loadgen] } [loadgen] }, [loadgen] { [loadgen] "Key": "service.name", [loadgen] "Value": { [loadgen] "Type": "STRING", [loadgen] "Value": "loadgen.runQuery" [loadgen] } [loadgen] }, [loadgen] { [loadgen] "Key": "query", [loadgen] "Value": { [loadgen] "Type": "STRING", [loadgen] "Value": "faith" [loadgen] } [loadgen] } [loadgen] ], [loadgen] "Events": null, [loadgen] "Links": null, [loadgen] "Status": { [loadgen] "Code": "Unset", [loadgen] "Description": "" [loadgen] }, [loadgen] "DroppedAttributes": 0, [loadgen] "DroppedEvents": 0, [loadgen] "DroppedLinks": 0, [loadgen] "ChildSpanCount": 5, [loadgen] "Resource": [ [loadgen] { [loadgen] "Key": "service.name", [loadgen] "Value": { [loadgen] "Type": "STRING", [loadgen] "Value": "unknown_service:loadgen" ...
The stdout exporter emits these messages. You will notice that the parent of each span emitted by loadgen has TraceID 00000000000000000000000000000000, because these are root spans, i.e. the first span in each trace. You will also find that the embedded attribute "query" holds the query string that is passed to the client service.
Summary
In this step, you instrumented the load generator service and the client service, which communicate over HTTP, and confirmed that you can successfully propagate the trace context across services and export Span information from both services to stdout.
Next up
In the next step, you will instrument the client service and the server service to confirm how to propagate the trace context via gRPC.
5. Instrumentation for gRPC
In the previous step, we instrumented the first half of the request path in these microservices. In this step, we instrument the gRPC communication between the client service and the server service. (The green and purple rectangles in the picture below.)
Pre-built instrumentation for the gRPC client
OpenTelemetry's ecosystem offers many handy libraries that help developers instrument applications. In the previous step, we used pre-built instrumentation for the net/http package. In this step, because we are trying to propagate the trace context through gRPC, we use the corresponding library.
First, you import the prebuilt gRPC package called otelgrpc.
step0/src/client/main.go
import ( "context" "encoding/json" "fmt" "io" "log" "net/http" "net/url" "os" "time" "opentelemetry-trace-codelab-go/client/shakesapp" // step2. add prebuilt gRPC package (otelgrpc) "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc" "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp" "go.opentelemetry.io/otel" "go.opentelemetry.io/otel/attribute" stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace" "go.opentelemetry.io/otel/propagation" sdktrace "go.opentelemetry.io/otel/sdk/trace" "go.opentelemetry.io/otel/trace" "google.golang.org/grpc" "google.golang.org/grpc/credentials/insecure" )
This time, the client service is a gRPC client of the server service, so you need to instrument the gRPC client. Find the mustConnGRPC function and add gRPC interceptors that instrument new spans every time the client makes a request to the server.
step0/src/client/main.go
// Helper function for gRPC connections: Dial and create client once, reuse.
func mustConnGRPC(ctx context.Context, conn **grpc.ClientConn, addr string) {
    var err error
    // step2. add gRPC interceptor
    interceptorOpt := otelgrpc.WithTracerProvider(otel.GetTracerProvider())
    *conn, err = grpc.DialContext(ctx, addr,
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor(interceptorOpt)),
        grpc.WithStreamInterceptor(otelgrpc.StreamClientInterceptor(interceptorOpt)),
        grpc.WithTimeout(time.Second*3),
    )
    // step2: end adding interceptor
    if err != nil {
        panic(fmt.Sprintf("Error %s grpc: failed to connect %s", err, addr))
    }
}
Because you already set up OpenTelemetry in the previous section, you don't need to do it again.
Pre-built instrumentation for gRPC server
As we did for the gRPC client, we call the pre-built instrumentation for the gRPC server. Add the new package to the import section:
step0/src/server/main.go
import ( "context" "fmt" "io/ioutil" "log" "net" "os" "regexp" "strings" "opentelemetry-trace-codelab-go/server/shakesapp" "cloud.google.com/go/storage" // step2. add OpenTelemetry packages including otelgrpc "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc" "go.opentelemetry.io/otel" stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace" "go.opentelemetry.io/otel/propagation" sdktrace "go.opentelemetry.io/otel/sdk/trace" "google.golang.org/api/iterator" "google.golang.org/api/option" "google.golang.org/grpc" healthpb "google.golang.org/grpc/health/grpc_health_v1" )
Because this is the first time we instrument the server, you need to set up OpenTelemetry first, similar to what we did for the loadgen and client services.
step0/src/server/main.go
// step2. add OpenTelemetry initialization function
func initTracer() (*sdktrace.TracerProvider, error) {
    // create a stdout exporter to show collected spans out to stdout.
    exporter, err := stdout.New(stdout.WithPrettyPrint())
    if err != nil {
        return nil, err
    }
    // for the demonstration, we use AlwaysSample sampler to take all spans.
    // do not use this option in production.
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithSampler(sdktrace.AlwaysSample()),
        sdktrace.WithBatcher(exporter),
    )
    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.TraceContext{})
    return tp, nil
}

func main() {
    ...
    // step2. setup OpenTelemetry
    tp, err := initTracer()
    if err != nil {
        log.Fatalf("failed to initialize TracerProvider: %v", err)
    }
    defer func() {
        if err := tp.Shutdown(context.Background()); err != nil {
            log.Fatalf("error shutting down TracerProvider: %v", err)
        }
    }()
    // step2. end setup
    ...
Next, you need to add server interceptors. In the main function, find the place where grpc.NewServer() is called and add interceptors to that call.
step0/src/server/main.go
func main() {
    ...
    svc := NewServerService()
    // step2: add interceptor
    interceptorOpt := otelgrpc.WithTracerProvider(otel.GetTracerProvider())
    srv := grpc.NewServer(
        grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor(interceptorOpt)),
        grpc.StreamInterceptor(otelgrpc.StreamServerInterceptor(interceptorOpt)),
    )
    // step2: end adding interceptor

    shakesapp.RegisterShakespeareServiceServer(srv, svc)
    ...
Run the microservice and confirm the trace
Then run your modified code with the skaffold command.
skaffold dev
Now again, you see a bunch of span information on stdout.
Command output
... [server] { [server] "Name": "shakesapp.ShakespeareService/GetMatchCount", [server] "SpanContext": { [server] "TraceID": "89b472f213a400cf975e0a0041649667", [server] "SpanID": "96030dbad0061b3f", [server] "TraceFlags": "01", [server] "TraceState": "", [server] "Remote": false [server] }, [server] "Parent": { [server] "TraceID": "89b472f213a400cf975e0a0041649667", [server] "SpanID": "cd90cc3859b73890", [server] "TraceFlags": "01", [server] "TraceState": "", [server] "Remote": true [server] }, [server] "SpanKind": 2, [server] "StartTime": "2022-07-14T14:05:55.74822525Z", [server] "EndTime": "2022-07-14T14:06:03.449258891Z", [server] "Attributes": [ ... [server] ], [server] "Events": [ [server] { [server] "Name": "message", [server] "Attributes": [ ... [server] ], [server] "DroppedAttributeCount": 0, [server] "Time": "2022-07-14T14:05:55.748235489Z" [server] }, [server] { [server] "Name": "message", [server] "Attributes": [ ... [server] ], [server] "DroppedAttributeCount": 0, [server] "Time": "2022-07-14T14:06:03.449255889Z" [server] } [server] ], [server] "Links": null, [server] "Status": { [server] "Code": "Unset", [server] "Description": "" [server] }, [server] "DroppedAttributes": 0, [server] "DroppedEvents": 0, [server] "DroppedLinks": 0, [server] "ChildSpanCount": 0, [server] "Resource": [ [server] { ... [server] ], [server] "InstrumentationLibrary": { [server] "Name": "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc", [server] "Version": "semver:0.33.0", [server] "SchemaURL": "" [server] } [server] } ...
Notice that you have not named any spans or manually created spans with trace.Start() or trace.SpanFromContext(). You still get a large number of spans because the gRPC interceptors generated them.
Summary
In this step, you instrumented gRPC-based communication with support from the OpenTelemetry ecosystem libraries.
Next up
In the next step, you will at last visualize the trace with Cloud Trace and learn how to analyze the collected spans.
6. Visualize trace with Cloud Trace
You have instrumented traces across the whole system with OpenTelemetry, and you have learned how to instrument HTTP and gRPC services. However, you have not yet learned how to analyze the traces. In this section, you will replace the stdout exporters with Cloud Trace exporters and learn how to analyze your traces.
Use Cloud Trace exporter
One of the powerful characteristics of OpenTelemetry is its pluggability. To visualize all the spans collected by your instrumentation, all you need to do is replace the stdout exporter with the Cloud Trace exporter.
Open the main.go file of each service and find the initTracer() function. Delete the line that creates a stdout exporter and create a Cloud Trace exporter instead.
step0/src/loadgen/main.go
import (
    ...
    // step3. add OpenTelemetry for Cloud Trace package
    cloudtrace "github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/trace"
)

// step1. add OpenTelemetry initialization function
func initTracer() (*sdktrace.TracerProvider, error) {
    // step3. replace stdout exporter with Cloud Trace exporter
    // cloudtrace.New() finds the credentials to Cloud Trace automatically following the
    // rules defined by golang.org/x/oauth2/google.FindDefaultCredentialsWithParams.
    // https://pkg.go.dev/golang.org/x/oauth2/google#FindDefaultCredentialsWithParams
    exporter, err := cloudtrace.New()
    // step3. end replacing exporter
    if err != nil {
        return nil, err
    }

    // for the demonstration, we use AlwaysSample sampler to take all spans.
    // do not use this option in production.
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithSampler(sdktrace.AlwaysSample()),
        sdktrace.WithBatcher(exporter),
    )
    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.TraceContext{})
    return tp, nil
}
You need to edit the same function in the client and server services as well.
Run the microservice and confirm the trace
After the edit, run the cluster as usual with the skaffold command.
skaffold dev
Now you do not see much span information in structured log format on stdout, because you replaced the exporter with the Cloud Trace one.
Command output
[loadgen] 2022/07/14 15:01:07 simulated 20 requests
[loadgen] 2022/07/14 15:01:07 simulating client requests, round 37
[loadgen] 2022/07/14 15:01:14 query 'sweet': matched 958
[client] 2022/07/14 15:01:14 {"match_count":958}
[client] 2022/07/14 15:01:14 {"match_count":3040}
[loadgen] 2022/07/14 15:01:14 query 'love': matched 3040
[client] 2022/07/14 15:01:15 {"match_count":349}
[loadgen] 2022/07/14 15:01:15 query 'hello': matched 349
[client] 2022/07/14 15:01:15 {"match_count":484}
[loadgen] 2022/07/14 15:01:15 query 'faith': matched 484
[loadgen] 2022/07/14 15:01:15 query 'insolence': matched 14
[client] 2022/07/14 15:01:15 {"match_count":14}
[client] 2022/07/14 15:01:21 {"match_count":484}
[loadgen] 2022/07/14 15:01:21 query 'faith': matched 484
[client] 2022/07/14 15:01:21 {"match_count":728}
[loadgen] 2022/07/14 15:01:21 query 'world': matched 728
[client] 2022/07/14 15:01:22 {"match_count":484}
[loadgen] 2022/07/14 15:01:22 query 'faith': matched 484
[loadgen] 2022/07/14 15:01:22 query 'hello': matched 349
[client] 2022/07/14 15:01:22 {"match_count":349}
[client] 2022/07/14 15:01:23 {"match_count":1036}
[loadgen] 2022/07/14 15:01:23 query 'friend': matched 1036
[loadgen] 2022/07/14 15:01:28 query 'tear': matched 463
...
Now let's confirm that all spans are correctly sent to Cloud Trace. Access the Cloud console and navigate to "Trace list". It is easy to access from the search box; otherwise, you can click the menu in the left pane.
Then you see many blue spots distributed across the latency graph. Each spot represents a single trace.
Click one of them and you can see the details inside the trace.
Even from this quick look, you already gain a lot of insight. For example, from the waterfall graph you can see that the latency is caused mostly by the span named shakesapp.ShakespeareService/GetMatchCount. (See 1 in the image above.) You can confirm that in the summary table. (The rightmost column shows the duration of each span.) Also, this trace was for the query "friend". (See 2 in the image above.)
Given this short analysis, you may realize that you need more granular spans inside the GetMatchCount method. Compared to the stdout information, the visualization is powerful. To learn more about Cloud Trace, please visit the official documentation.
Summary
In this step, you replaced the stdout exporter with the Cloud Trace one and visualized traces on Cloud Trace. You also learned how to start analyzing the traces.
Next up
In the next step, you will modify the source code of the server service to add a sub span inside GetMatchCount.
7. Add sub span for better analysis
In the previous step, you found that the cause of the round-trip time observed from loadgen is mostly the process inside the GetMatchCount method, the gRPC handler in the server service. However, because we have not instrumented anything other than the handler, we cannot find further insights from the waterfall graph. This is a common situation when we start instrumenting microservices.
In this section, we are going to instrument a sub span where the server calls Google Cloud Storage, because it is common for external network I/O to take a long time in a process, and it is important to identify whether that call is the cause.
Instrument a sub span in the server
Open main.go in the server and find the readFiles function. This function sends a request to Google Cloud Storage to fetch all the text files of Shakespeare's works. In this function, you can create a sub span, as you did for the HTTP server instrumentation in the client service.
step0/src/server/main.go
func readFiles(ctx context.Context, bucketName, prefix string) ([]string, error) {
    type resp struct {
        s   string
        err error
    }

    // step4: add an extra span
    span := trace.SpanFromContext(ctx)
    span.SetName("server.readFiles")
    span.SetAttributes(attribute.Key("bucketname").String(bucketName))
    defer span.End()
    // step4: end add span
    ...
And that's all for adding a new span. Let's see how it goes by running the app.
Run the microservice and confirm the trace
After the edit, run the cluster as usual with the skaffold command.
skaffold dev
Pick one trace named query.request from the trace list. You will see a similar trace waterfall graph, except for a new span under shakesapp.ShakespeareService/GetMatchCount. (The span enclosed in the red rectangle below.)
What you can tell from this graph now is that the external call to Google Cloud Storage accounts for a large amount of latency, but other things still make up the majority of the latency.
You have already gained a lot of insight from just a couple of looks at the trace waterfall graph. How do you obtain further performance details about your application? This is where the profiler comes in, but for now, let's end this codelab here and leave the profiler tutorial to part 2.
Summary
In this step, you instrumented another span in the server service and obtained further insight into the system latency.
8. Congratulations
You have successfully created distributed traces with OpenTelemetry and confirmed request latencies across the microservices in Google Cloud Trace.
For extended exercises, you can try the following topics by yourself.
- The current implementation sends all spans generated by the health check (grpc.health.v1.Health/Check). How do you filter those spans out of Cloud Trace? The hint is here; one possible approach is also sketched after this list.
- Correlate event logs with spans and see how it works on Google Cloud Trace and Google Cloud Logging. The hint is here.
- Replace one of the services with one written in another language and try to instrument it with OpenTelemetry for that language.
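For the first exercise, one possible approach (a sketch under the assumption that the interceptor names health-check spans grpc.health.v1.Health/Check, as shown above; not necessarily the answer behind the hint) is a custom sampler that drops those spans and delegates every other decision to the sampler you already use:

package main

import (
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// healthCheckFilter is a hypothetical sampler that drops spans created for the
// gRPC health check and delegates all other decisions to a base sampler.
type healthCheckFilter struct {
    base sdktrace.Sampler
}

func (f healthCheckFilter) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
    if p.Name == "grpc.health.v1.Health/Check" {
        return sdktrace.SamplingResult{Decision: sdktrace.Drop}
    }
    return f.base.ShouldSample(p)
}

func (f healthCheckFilter) Description() string {
    return "healthCheckFilter(" + f.base.Description() + ")"
}

func main() {
    // Use it in place of AlwaysSample() when building the TracerProvider.
    _ = sdktrace.NewTracerProvider(
        sdktrace.WithSampler(healthCheckFilter{base: sdktrace.AlwaysSample()}),
    )
}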
Also, if you would like to learn about the profiler after this, please move on to part 2. You can skip the clean up section below in that case.
Clean up
After this codelab, please stop the Kubernetes cluster and make sure to delete the project so that you don't get unexpected charges for Google Kubernetes Engine, Google Cloud Trace, and Google Artifact Registry.
First, delete the cluster. If you are running the cluster with skaffold dev, you just need to press Ctrl-C. If you are running the cluster with skaffold run, run the following command:
skaffold delete
Command output
Cleaning up...
 - deployment.apps "clientservice" deleted
 - service "clientservice" deleted
 - deployment.apps "loadgen" deleted
 - deployment.apps "serverservice" deleted
 - service "serverservice" deleted
After deleting the cluster, from the menu pane select "IAM & Admin" > "Settings", and then click the "SHUT DOWN" button.
Then enter the Project ID (not Project Name) in the form in the dialog and confirm shutdown.