This post is mainly about instrumenting tracing in distributed systems with {OpenTelemetry](https://opentelemetry.io/).
Before we get to that we need to understand what constitutes an observable system. Observable systems have three ‘pillars’ of observability:
- Logs
- Traces
- Metrics
OpenTelemetry itself hasn’t got logging implemented, but it does have tracing and metric gathering capabilities (for Go).
On to the interesting stuff :)
I have an API
container that has a RESTful interface for clients on the internet to call. The API container has a number of routes/endpoints that it handles by calling other containers via gRPC.
HTTP
I use Chi as my mux, and I use the Gorilla OpenTelemetry Middleware. Using the Server.go as an example I found it incredibly simple to instrument the HTTP server.
First, a trace initialisation:
package main
// required imports:
// "go.opentelemetry.io/otel"
// stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
// sdktrace "go.opentelemetry.io/otel/sdk/trace"
var tracer = otel.Tracer("api-server")
func initTracer() (*sdktrace.TracerProvider, error) {
exporter, err := stdout.New(stdout.WithPrettyPrint())
if err != nil {
return nil, err
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithSampler(sdktrace.AlwaysSample()),
sdktrace.WithBatcher(exporter),
)
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))
return tp, nil
}
Then tell our mux to use the middleware:
package main
// required imports:
// "github.com/go-chi/chi/v5"
// "github.com/go-chi/chi/v5/middleware"
// "go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux"
// "time" (not necessary for OpenTelemetry, but used for another middleware).
func createRouter() *chi.Mux {
router := chi.NewRouter()
router.Use(
otelmux.Middleware("my-server"),
// The rest are standard chi middleware (unnecessary for telemetry, but left in for example of other middleware).
middleware.RequestID,
middleware.RealIP,
middleware.Logger,
middleware.Recoverer,
//nolint:gomnd
middleware.Timeout(60*time.Second),
)
// All routes are in routes.go
OnepageRoutes(router)
return router
}
And a simple main - the only OpenTelemetry related code is the call to the initTracer
function:
package main
// required imports:
// "log"
// "context"
// "log/slog"
// "net/http"
// "time"
func main() {
tp, err := initTracer()
if err != nil {
log.Fatal(err)
}
defer func() {
if err := tp.Shutdown(context.Background()); err != nil {
log.Printf("Error shutting down tracer provider: %v", err)
}
}()
router := createRouter()
server := &http.Server{
Addr: ":80",
ReadHeaderTimeout: 3 * time.Second,
Handler: router,
}
if err := server.ListenAndServe(); err != nil {
slog.Warn("cannot listen and serve", "error", err.Error())
}
}
The routes are untouched by the OpenTelemetry code, the next part is to ensure that the handlers are injecting traces, and spans, so that the call by a user can be ‘traced’ across the systems.
I keep a tracer.go that ensures that the tracer is accessible across the package:
package api
import "go.opentelemetry.io/otel"
var tracer = otel.Tracer("mux-server")
And, an example of handler (all that is required for OpenTelemetry is the first two lines):
package api
...
func GetSomething(response http.ResponseWriter, req *http.Request) {
_, span := tracer.Start(context.Background(), "GetSomething")
defer span.End()
...
}
That’s all that’s required for HTTP telemetry (tracing).
gRPC Server
gRPC initialisation on the server is a little more involved, allegedly there is an easier (less code) way to do this, but I couldn’t grasp how to enumerate which middleware I wanted to include.
These examples are largely based on Go gRPC examples
server:
package main
import (
"bytes"
"context"
"fmt"
"log"
"log/slog"
"net"
"os"
"runtime/debug"
"shanehowearth.com/stuff"
grpcProto "shanehowearth.com/stuff/grpc/proto/v1"
grpcprom "github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus"
"github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/recovery"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
"go.opentelemetry.io/otel"
stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
"go.opentelemetry.io/otel/propagation"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
"go.opentelemetry.io/otel/trace"
"google.golang.org/grpc"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/reflection"
"google.golang.org/grpc/status"
)
type server struct {
grpcProto.UnimplementedAccountServiceServer
}
// package global
var stuffServer *stuff.Server
func main() {
stuffServer = stuff.NewAccountServer()
// gRPC service
portNumber, ok := os.LookupEnv("PORT")
if !ok {
log.Fatalf("PORT not set in environment, cannot continue")
}
lis, err := net.Listen("tcp", "0.0.0.0:"+portNumber)
if err != nil {
log.Fatalf("Failed to listen: %v", err)
}
// Setup metrics.
srvMetrics := grpcprom.NewServerMetrics(
grpcprom.WithServerHandlingTimeHistogram(
grpcprom.WithHistogramBuckets([]float64{0.001, 0.01, 0.1, 0.3, 0.6, 1, 3, 6, 9, 20, 30, 60, 90, 120}),
),
)
reg := prometheus.NewRegistry()
reg.MustRegister(srvMetrics)
exemplarFromContext := func(ctx context.Context) prometheus.Labels {
if span := trace.SpanContextFromContext(ctx); span.IsSampled() {
return prometheus.Labels{"traceID": span.TraceID().String()}
}
return nil
}
// Set up OTLP tracing (stdout for debug).
exporter, err := stdout.New(stdout.WithPrettyPrint())
if err != nil {
slog.Error("exporter setup", "error", err.Error())
os.Exit(1)
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithSampler(sdktrace.AlwaysSample()),
sdktrace.WithBatcher(exporter),
)
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))
defer func() { _ = exporter.Shutdown(context.Background()) }()
// Setup metric for panic recoveries.
panicsTotal := promauto.With(reg).NewCounter(prometheus.CounterOpts{
Name: "grpc_req_panics_recovered_total",
Help: "Total number of gRPC requests recovered from internal panic.",
})
grpcPanicRecoveryHandler := func(p any) error {
panicsTotal.Inc()
slog.Error("msg", "recovered from panic", "panic", p, "stack", debug.Stack())
return status.Errorf(codes.Internal, "%s", p)
}
grpcServer := grpc.NewServer(
grpc.StatsHandler(otelgrpc.NewServerHandler()),
grpc.ChainUnaryInterceptor(
// Order matters e.g. tracing interceptor have to create
// span first for the later exemplars to work.
otelgrpc.UnaryServerInterceptor(),
srvMetrics.UnaryServerInterceptor(grpcprom.WithExemplarFromContext(exemplarFromContext)),
recovery.UnaryServerInterceptor(recovery.WithRecoveryHandler(grpcPanicRecoveryHandler)),
),
grpc.ChainStreamInterceptor(
otelgrpc.StreamServerInterceptor(),
srvMetrics.StreamServerInterceptor(grpcprom.WithExemplarFromContext(exemplarFromContext)),
recovery.StreamServerInterceptor(recovery.WithRecoveryHandler(grpcPanicRecoveryHandler)),
),
)
grpcProto.RegisterAccountServiceServer(grpcServer, &server{})
reflection.Register(grpcServer)
if err := grpcServer.Serve(lis); err != nil {
panic(fmt.Sprintf("failed to serve: %v", err))
}
}
again, I have a tracer.go that exposes the tracer to the whole package:
package main
import "go.opentelemetry.io/otel"
var tracer = otel.Tracer("accounts-grpc-server")
And the functions exposed by the gRPC server look like this:
func (s *server) GetStuff(ctx context.Context) (*grpcProto.Stuff, error) {
ctx, childSpan := tracer.Start(ctx, "Get Stuff")
defer childSpan.End()
...
return &grpcProto.Stuff{HtmlForm: "weee"}, nil
}
gRPC Client
The only thing happening in the client connection is the inclusion of the otel
middleware
package stuff
import (
"context"
"log"
"log/slog"
"time"
"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
proto "shanehowearth.com/stuff/grpc/proto/v1"
)
type Client struct {
Address string
}
func (ac *Client) newConnection() (proto.StuffServiceClient, *grpc.ClientConn) {
conn, err := grpc.Dial(
ac.Address,
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
)
if err != nil {
log.Fatalf("did not connect: %v", err)
}
return proto.NewStuffServiceClient(conn), conn
}
And that’s it. I don’t include any traces in the client caller function, but there’s no reason that it cannot be done.
Summary
Turning a distributed system of microservices written in Go into an observable one with OpenTelemetry is a fairly simple affair, and these examples should mean you can do it too :)