Go · OpenTelemetry · Observable Systems

Observable systems

This post is mainly about instrumenting distributed systems with tracing, using [OpenTelemetry](https://opentelemetry.io/).

Before we get to that, we need to understand what constitutes an observable system. Observable systems have three ‘pillars’ of observability:

  • Logs
  • Traces
  • Metrics

OpenTelemetry itself hasn’t implemented logging for Go yet, but it does have tracing and metric-gathering capabilities.

On to the interesting stuff :)

I have an API container that has a RESTful interface for clients on the internet to call. The API container has a number of routes/endpoints that it handles by calling other containers via gRPC.

HTTP

I use Chi as my mux, along with the Gorilla OpenTelemetry middleware (otelmux), which works with Chi since both are built on net/http. Using its example server.go as a guide, I found it incredibly simple to instrument the HTTP server.

First, a trace initialisation:

package main

// required imports:
// "go.opentelemetry.io/otel"
// "go.opentelemetry.io/otel/propagation"
// stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
// sdktrace "go.opentelemetry.io/otel/sdk/trace"

var tracer = otel.Tracer("api-server")

func initTracer() (*sdktrace.TracerProvider, error) {
        exporter, err := stdout.New(stdout.WithPrettyPrint())
        if err != nil {
                return nil, err
        }

        tp := sdktrace.NewTracerProvider(
                sdktrace.WithSampler(sdktrace.AlwaysSample()),
                sdktrace.WithBatcher(exporter),
        )
        otel.SetTracerProvider(tp)
        otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))

        return tp, nil
}
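The composite propagator set above is what carries a trace across service boundaries: the TraceContext propagator writes and reads the W3C `traceparent` header, whose value has the shape `version-traceid-spanid-flags`. As a rough stdlib-only sketch of what the extract side does (the real implementation lives in the propagation package, this is purely illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent header value
// ("version-traceid-spanid-flags") into its fields. This loosely
// mimics what propagation.TraceContext does for you on every request.
func parseTraceparent(h string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false, fmt.Errorf("malformed traceparent: %q", h)
	}

	return parts[1], parts[2], parts[3] == "01", nil
}

func main() {
	tid, sid, sampled, err := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	fmt.Println(tid, sid, sampled)
}
```

With the propagator configured, the middleware and stats handlers do this extraction (and the matching injection on outgoing calls) automatically.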

Then tell our mux to use the middleware:

package main

// required imports:
// "github.com/go-chi/chi/v5"
// "github.com/go-chi/chi/v5/middleware"
// "go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux"
// "time" (not necessary for OpenTelemetry, but used for another middleware).

func createRouter() *chi.Mux {
        router := chi.NewRouter()

        router.Use(
                otelmux.Middleware("my-server"),
                // The rest are standard chi middleware (unnecessary for telemetry, but left in for example of other middleware).
                middleware.RequestID,
                middleware.RealIP,
                middleware.Logger,
                middleware.Recoverer,
                //nolint:gomnd
                middleware.Timeout(60*time.Second),
        )

        // All routes are in routes.go
        OnepageRoutes(router)

        return router
}

And a simple main - the only OpenTelemetry-related code is the call to the initTracer function:

package main

// required imports:
// "log"
// "context"
// "log/slog"
// "net/http"
// "time"

func main() {
        tp, err := initTracer()
        if err != nil {
                log.Fatal(err)
        }

        defer func() {
                if err := tp.Shutdown(context.Background()); err != nil {
                        log.Printf("Error shutting down tracer provider: %v", err)
                }
        }()

        router := createRouter()

        server := &http.Server{
                Addr:              ":80",
                ReadHeaderTimeout: 3 * time.Second,
                Handler:           router,
        }

        if err := server.ListenAndServe(); err != nil {
                slog.Warn("cannot listen and serve", "error", err.Error())
        }
}

The routes are untouched by the OpenTelemetry code. The next part is to ensure that the handlers create spans, so that a user’s call can be ‘traced’ across the systems.

I keep a tracer.go that ensures that the tracer is accessible across the package:

package api

import "go.opentelemetry.io/otel"

var tracer = otel.Tracer("mux-server")

And an example handler (all that OpenTelemetry requires is the first two lines):

package api

...

func GetSomething(response http.ResponseWriter, req *http.Request) {
        // Start from the request context, not context.Background(), so this
        // span becomes a child of the span the otelmux middleware created.
        _, span := tracer.Start(req.Context(), "GetSomething")
        defer span.End()
        ...
}

That’s all that’s required for HTTP telemetry (tracing).

gRPC Server

gRPC initialisation on the server is a little more involved. There is allegedly an easier (less code) way to do this, but I couldn’t grasp how to enumerate which middleware I wanted to include.

These examples are largely based on the Go gRPC middleware examples.

server:

package main

import (
        "bytes"
        "context"
        "fmt"
        "log"
        "log/slog"
        "net"
        "os"
        "runtime/debug"

        "shanehowearth.com/stuff"
        grpcProto "shanehowearth.com/stuff/grpc/proto/v1"

        grpcprom "github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus"
        "github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/recovery"
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
        "go.opentelemetry.io/otel"
        stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
        "go.opentelemetry.io/otel/propagation"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        "go.opentelemetry.io/otel/trace"
        "google.golang.org/grpc"
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/reflection"
        "google.golang.org/grpc/status"
)

type server struct {
        grpcProto.UnimplementedAccountServiceServer
}

// package global
var stuffServer *stuff.Server

func main() {
        stuffServer = stuff.NewAccountServer()

        // gRPC service
        portNumber, ok := os.LookupEnv("PORT")
        if !ok {
                log.Fatalf("PORT not set in environment, cannot continue")
        }

        lis, err := net.Listen("tcp", "0.0.0.0:"+portNumber)
        if err != nil {
                log.Fatalf("Failed to listen: %v", err)
        }

        // Setup metrics.
        srvMetrics := grpcprom.NewServerMetrics(
                grpcprom.WithServerHandlingTimeHistogram(
                        grpcprom.WithHistogramBuckets([]float64{0.001, 0.01, 0.1, 0.3, 0.6, 1, 3, 6, 9, 20, 30, 60, 90, 120}),
                ),
        )

        reg := prometheus.NewRegistry()
        reg.MustRegister(srvMetrics)

        exemplarFromContext := func(ctx context.Context) prometheus.Labels {
                if span := trace.SpanContextFromContext(ctx); span.IsSampled() {
                        return prometheus.Labels{"traceID": span.TraceID().String()}
                }

                return nil
        }

        // Set up OTLP tracing (stdout for debug).
        exporter, err := stdout.New(stdout.WithPrettyPrint())
        if err != nil {
                slog.Error("exporter setup", "error", err.Error())
                os.Exit(1)
        }

        tp := sdktrace.NewTracerProvider(
                sdktrace.WithSampler(sdktrace.AlwaysSample()),
                sdktrace.WithBatcher(exporter),
        )
        otel.SetTracerProvider(tp)

        otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))

        // Shut down the provider rather than just the exporter, so the
        // batcher flushes any pending spans first.
        defer func() { _ = tp.Shutdown(context.Background()) }()

        // Setup metric for panic recoveries.
        panicsTotal := promauto.With(reg).NewCounter(prometheus.CounterOpts{
                Name: "grpc_req_panics_recovered_total",
                Help: "Total number of gRPC requests recovered from internal panic.",
        })

        grpcPanicRecoveryHandler := func(p any) error {
                panicsTotal.Inc()
                slog.Error("recovered from panic", "panic", p, "stack", debug.Stack())

                return status.Errorf(codes.Internal, "%s", p)
        }

        grpcServer := grpc.NewServer(
                grpc.StatsHandler(otelgrpc.NewServerHandler()),
                grpc.ChainUnaryInterceptor(
                        // Order matters, e.g. the tracing interceptor has to
                        // create the span first for the later exemplars to
                        // work. (Note: otelgrpc's interceptors have since been
                        // deprecated in favour of the stats handler set above.)
                        otelgrpc.UnaryServerInterceptor(),
                        srvMetrics.UnaryServerInterceptor(grpcprom.WithExemplarFromContext(exemplarFromContext)),
                        recovery.UnaryServerInterceptor(recovery.WithRecoveryHandler(grpcPanicRecoveryHandler)),
                ),
                grpc.ChainStreamInterceptor(
                        otelgrpc.StreamServerInterceptor(),
                        srvMetrics.StreamServerInterceptor(grpcprom.WithExemplarFromContext(exemplarFromContext)),
                        recovery.StreamServerInterceptor(recovery.WithRecoveryHandler(grpcPanicRecoveryHandler)),
                ),
        )
        grpcProto.RegisterAccountServiceServer(grpcServer, &server{})
        reflection.Register(grpcServer)

        if err := grpcServer.Serve(lis); err != nil {
                panic(fmt.Sprintf("failed to serve: %v", err))
        }
}

Again, I have a tracer.go that exposes the tracer to the whole package:

package main

import "go.opentelemetry.io/otel"

var tracer = otel.Tracer("accounts-grpc-server")

And the functions exposed by the gRPC server look like this:

func (s *server) GetStuff(ctx context.Context) (*grpcProto.Stuff, error) {
        ctx, childSpan := tracer.Start(ctx, "Get Stuff")
        defer childSpan.End()
        
        ...

        return &grpcProto.Stuff{HtmlForm: "weee"}, nil
}

gRPC Client

The only OpenTelemetry-related code in the client connection is the otelgrpc stats handler:

package stuff

import (
        "context"
        "log"
        "log/slog"
        "time"

        "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        proto "shanehowearth.com/stuff/grpc/proto/v1"
)

type Client struct {
        Address string
}

func (ac *Client) newConnection() (proto.StuffServiceClient, *grpc.ClientConn) {
        conn, err := grpc.Dial(
                ac.Address,
                grpc.WithTransportCredentials(insecure.NewCredentials()),
                grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
        )
        if err != nil {
                log.Fatalf("did not connect: %v", err)
        }

        return proto.NewStuffServiceClient(conn), conn
}

And that’s it. I don’t start any spans in the client caller functions, but there’s no reason that it cannot be done.
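Under the hood, what the otelgrpc client handler does is inject the current trace context into the outgoing request metadata, so the server can continue the same trace. A stdlib-only sketch of the inject side (the propagator and otelgrpc do the real work; this is purely illustrative):

```go
package main

import (
	"fmt"
)

// inject loosely mimics what otelgrpc's client stats handler does: it
// writes the current trace context into the outgoing request metadata
// as a W3C traceparent header, linking the client and server spans.
func inject(md map[string]string, traceID, spanID string, sampled bool) {
	flags := "00"
	if sampled {
		flags = "01"
	}
	md["traceparent"] = fmt.Sprintf("00-%s-%s-%s", traceID, spanID, flags)
}

func main() {
	md := map[string]string{}
	inject(md, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true)
	fmt.Println(md["traceparent"])
}
```

The one thing the caller must do for this to work is pass its own ctx into the gRPC call, so the handler has a span context to inject.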

Summary

Turning a distributed system of microservices written in Go into an observable one with OpenTelemetry is a fairly simple affair, and these examples should mean you can do it too :)
