This post is mainly about instrumenting tracing in distributed systems with OpenTelemetry.
Before we get to that we need to understand what constitutes an observable system. Observable systems have three ‘pillars’ of observability:
- Logs
- Traces
- Metrics
OpenTelemetry itself hasn’t got logging implemented, but it does have tracing and metric gathering capabilities (for Go).
On to the interesting stuff :)
I have an API
container that has a RESTful interface for clients on the internet to call. The API container has a number of routes/endpoints that it handles by calling other containers via gRPC.
HTTP
I use Chi as my mux, and I use the Gorilla OpenTelemetry Middleware. Using the Server.go as an example I found it incredibly simple to instrument the HTTP server.
First, a trace initialisation:
package main
// required imports:
// "go.opentelemetry.io/otel"
// stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
// sdktrace "go.opentelemetry.io/otel/sdk/trace"
var tracer = otel.Tracer("api-server")
func initTracer() (*sdktrace.TracerProvider, error) {
exporter, err := stdout.New(stdout.WithPrettyPrint())
if err != nil {
return nil, err
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithSampler(sdktrace.AlwaysSample()),
sdktrace.WithBatcher(exporter),
)
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))
return tp, nil
}
Then tell our mux to use the middleware:
package main
// required imports:
// "github.com/go-chi/chi/v5"
// "github.com/go-chi/chi/v5/middleware"
// "go.opentelemetry.io/contrib/instrumentation/github.com/gorilla/mux/otelmux"
// "time" (not necessary for OpenTelemetry, but used for another middleware).
func createRouter() *chi.Mux {
router := chi.NewRouter()
router.Use(
otelmux.Middleware("my-server"),
// The rest are standard chi middleware (unnecessary for telemetry, but left in for example of other middleware).
middleware.RequestID,
middleware.RealIP,
middleware.Logger,
middleware.Recoverer,
//nolint:gomnd
middleware.Timeout(60*time.Second),
)
// All routes are in routes.go
OnepageRoutes(router)
return router
}
And a simple main - the only OpenTelemetry related code is the call to the initTracer
function:
package main
// required imports:
// "log"
// "context"
// "log/slog"
// "net/http"
// "time"
func main() {
tp, err := initTracer()
if err != nil {
log.Fatal(err)
}
defer func() {
if err := tp.Shutdown(context.Background()); err != nil {
log.Printf("Error shutting down tracer provider: %v", err)
}
}()
router := createRouter()
server := &http.Server{
Addr: ":80",
ReadHeaderTimeout: 3 * time.Second,
Handler: router,
}
if err := server.ListenAndServe(); err != nil {
slog.Warn("cannot listen and serve", "error", err.Error())
}
}
The routes are untouched by the OpenTelemetry code, the next part is to ensure that the handlers are injecting traces, and spans, so that the call by a user can be ‘traced’ across the systems.
I keep a tracer.go that ensures that the tracer is accessible across the package:
package api
import "go.opentelemetry.io/otel"
var tracer = otel.Tracer("mux-server")
And, an example of handler (all that is required for OpenTelemetry is the first two lines):
package api
...
func GetSomething(response http.ResponseWriter, req *http.Request) {
_, span := tracer.Start(context.Background(), "GetSomething")
defer span.End()
...
}
That’s all that’s required for HTTP telemetry (tracing).
gRPC Server
gRPC initialisation on the server is a little more involved, allegedly there is an easier (less code) way to do this, but I couldn’t grasp how to enumerate which middleware I wanted to include.
These examples are largely based on Go gRPC examples
server:
package main
import (
"bytes"
"context"
"fmt"
"log"
"log/slog"
"net"
"os"
"runtime/debug"
"shanehowearth.com/stuff"
grpcProto "shanehowearth.com/stuff/grpc/proto/v1"
grpcprom "github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus"
"github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/recovery"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
"go.opentelemetry.io/otel"
stdout "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
"go.opentelemetry.io/otel/propagation"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
"go.opentelemetry.io/otel/trace"
"google.golang.org/grpc"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/reflection"
"google.golang.org/grpc/status"
)
type server struct {
grpcProto.UnimplementedAccountServiceServer
}
// package global
var stuffServer *stuff.Server
func main() {
stuffServer = stuff.NewAccountServer()
// gRPC service
portNumber, ok := os.LookupEnv("PORT")
if !ok {
log.Fatalf("PORT not set in environment, cannot continue")
}
lis, err := net.Listen("tcp", "0.0.0.0:"+portNumber)
if err != nil {
log.Fatalf("Failed to listen: %v", err)
}
// Setup metrics.
srvMetrics := grpcprom.NewServerMetrics(
grpcprom.WithServerHandlingTimeHistogram(
grpcprom.WithHistogramBuckets([]float64{0.001, 0.01, 0.1, 0.3, 0.6, 1, 3, 6, 9, 20, 30, 60, 90, 120}),
),
)
reg := prometheus.NewRegistry()
reg.MustRegister(srvMetrics)
exemplarFromContext := func(ctx context.Context) prometheus.Labels {
if span := trace.SpanContextFromContext(ctx); span.IsSampled() {
return prometheus.Labels{"traceID": span.TraceID().String()}
}
return nil
}
// Set up OTLP tracing (stdout for debug).
exporter, err := stdout.New(stdout.WithPrettyPrint())
if err != nil {
slog.Error("exporter setup", "error", err.Error())
os.Exit(1)
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithSampler(sdktrace.AlwaysSample()),
sdktrace.WithBatcher(exporter),
)
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))
defer func() { _ = exporter.Shutdown(context.Background()) }()
// Setup metric for panic recoveries.
panicsTotal := promauto.With(reg).NewCounter(prometheus.CounterOpts{
Name: "grpc_req_panics_recovered_total",
Help: "Total number of gRPC requests recovered from internal panic.",
})
grpcPanicRecoveryHandler := func(p any) error {
panicsTotal.Inc()
slog.Error("msg", "recovered from panic", "panic", p, "stack", debug.Stack())
return status.Errorf(codes.Internal, "%s", p)
}
grpcServer := grpc.NewServer(
grpc.StatsHandler(otelgrpc.NewServerHandler()),
grpc.ChainUnaryInterceptor(
// Order matters e.g. tracing interceptor have to create
// span first for the later exemplars to work.
otelgrpc.UnaryServerInterceptor(),
srvMetrics.UnaryServerInterceptor(grpcprom.WithExemplarFromContext(exemplarFromContext)),
recovery.UnaryServerInterceptor(recovery.WithRecoveryHandler(grpcPanicRecoveryHandler)),
),
grpc.ChainStreamInterceptor(
otelgrpc.StreamServerInterceptor(),
srvMetrics.StreamServerInterceptor(grpcprom.WithExemplarFromContext(exemplarFromContext)),
recovery.StreamServerInterceptor(recovery.WithRecoveryHandler(grpcPanicRecoveryHandler)),
),
)
grpcProto.RegisterAccountServiceServer(grpcServer, &server{})
reflection.Register(grpcServer)
if err := grpcServer.Serve(lis); err != nil {
panic(fmt.Sprintf("failed to serve: %v", err))
}
}
again, I have a tracer.go that exposes the tracer to the whole package:
package main
import "go.opentelemetry.io/otel"
var tracer = otel.Tracer("accounts-grpc-server")
And the functions exposed by the gRPC server look like this:
func (s *server) GetStuff(ctx context.Context) (*grpcProto.Stuff, error) {
ctx, childSpan := tracer.Start(ctx, "Get Stuff")
defer childSpan.End()
...
return &grpcProto.Stuff{HtmlForm: "weee"}, nil
}
gRPC Client
The only thing happening in the client connection is the inclusion of the otel
middleware
package stuff
import (
"context"
"log"
"log/slog"
"time"
"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
proto "shanehowearth.com/stuff/grpc/proto/v1"
)
type Client struct {
Address string
}
func (ac *Client) newConnection() (proto.StuffServiceClient, *grpc.ClientConn) {
conn, err := grpc.Dial(
ac.Address,
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
)
if err != nil {
log.Fatalf("did not connect: %v", err)
}
return proto.NewStuffServiceClient(conn), conn
}
And that’s it. I don’t include any traces in the client caller function, but there’s no reason that it cannot be done.
Summary
Turning a distributed system of microservices written in Go into an observable one with OpenTelemetry is a fairly simple affair, and these examples should mean you can do it too :)