Distributed Tracing

Distributed Tracing and Request Correlation

NOTE - distributed tracing and request correlation, as documented herein, is not something that is solved intrinsically by Cloud Foundry. This is a high-level application architecture discussion, that is likely to be solved differently depending on application language and/or runtime.

This document describes recipes, both high-level and implementation-specific, for how to handle distributed tracing in the cloud.

When a request comes in to a monolith, it's fairly easy to trace through what happened within that request because everything is self-contained. However, when a request comes into a microservice (remember GUI apps can be microservices, they just render HTML instead of JSON), that request could then result in a cascading chain of 10 other HTTP calls to various other services. You might also be interested in knowing when a particular service makes non-HTTP calls as well, such as to a database or a message queue.

First, this recipe will discuss some of the things that are required from an architectural standpoint to support distributed tracing, and then we'll talk about some implementations, tools, and libraries to facilitate this.

What you are likely to discover is that the vast majority of modern distributed tracing solutions are either based on, or inspired by, the Google Dapper Whitepaper

Distributed Tracing Requirements

Distributed tracing is allowing the logs (or other metadata) for applications to span servers. More importantly, these logs need to be able to span not only multiple instances of a single service, but need to be able to correlate the collaborative activity of multiple services.

When service A calls service B which in turn calls service C, you want to be able to correlate certain log data and statistics about that one request. Manually fishing through logs to try and find the trace related to one client-originated call across an ecosystem of many services, each with multiple application instances, is a nightmare.

The solution is distributed tracing. Regardless of how you implement it, this always involves the creation of an origination ID as well as individual span IDs. The terminology is likely to vary across different servers, libraries, and agent implementations, but the concepts are the same. Let's take a look at the following simple request:

  • User requests the page /myprofile
    • Page calls account service
    • Account service calls LDAP
    • Page calls profile service
    • Profile service calls database
  • User receives profile page.

In order for a tracing system to be able to monitor this entire request flow, even with multiple instances of the web app, account service, and profile service running, we need to create an origination ID which will be an ID that will span our entire request flow, no matter what we call or how. At each individual step we're going to generate and use span IDs so that the trace for the individual steps can be viewed as a single step as well as part of a larger whole (the original request).

Decorated with IDs, we might have the following:

| Origination ID | Span ID | Activity | | --- | --- | --- | abc123 | span1 | User Requests the page /myprofile | | abc123 | span1 | Page calls account service | | abc123 | span2 | Account service handles request | | abc123 | span1 | Page calls profile service | | abc123 | span3 | Profile service handles request | | abc123 | span3 | Profile service queries database | | abc123 | span1 | Render requested page |

Drawn out as a sequence diagram, this single user-initiated request might look something like this:

Sample Microservice Transaction

The goal behind distributed tracing is to maintain enough context so that you (or a tool) can collect sufficient data from distributed trace logs to reconstitute or reverse a sequence diagram from the available data for any given initiating request.

Ideally, with the right metadata, tools, and libraries, you will be able to determine how much time was spent at each step in the sequence diagram, whether anything went wrong during those times, and to be able to dive into the detailed trace output for each step in the sequence. All of this should be available to you while being able to roll-up or collapse the individual steps in the request so you can examine the entire request.

Implementation in Java

There is nothing preventing you from creating your own code to solve this problem, however the usual caveats apply when discussing large projects that re-invent wheels. Keep in mind that your goal is to spend as much time as possible building your applications, not building tool suites that support your applications when such tools already exist.

Code-wise, the implementation core would be relatively simple. For every inbound request, examine the HTTP headers. If there is a trace id then you know you are part of some larger activity. If there isn't, create a new one. Always create a new span id. Then, just make the trace and span context available to your code so you can add additional troubleshooting, diagnostic, and metric information for collection later.

Given that Cloud Foundry already aggregates all stdout from all apps in all orgs and spaces, simply emitting this context to the logs for collection and analysis by an external tool seems like the ideal pattern.

Tracing with Spring Cloud Sleuth and Zipkin

As mentioned, there are already tools and libraries that solve this problem in the Spring Cloud / Spring Boot space - Sleuth and Zipkin.

Blog Post about Sleuth and Zipkin Using these tools, the larger, outermost request (we referred to this as the originator or origination ID above) is called the trace, and individual steps within this request are called spans. Additional information and metadata can be attached to a span through the use of tags.

Implementation in .NET

The general rules apply here - maintain your trace and span context at the request inbound and outbound level and emit log information to stdout decorated with that information.

You can use the extension points available in WCF, ASP.NET or the Web API to trap inbound and outbound requests, which would allow you to not only create and maintain trace and span contexts, but also do things like emit logs with elapsed time calculations per-span, which can then be rolled up by external tools.

There is a Zipkin Tracer Module for .NET that can be used if you're interested in working with Zipkin implementations.

Today, projects like SteelToe include .NET client code that allows .NET applications to interact with the Sleuth and Zipkin ecosystem.

Implementation in Node

Again, the high-level rules apply. The goal is to emit information with enough context so that correlations can be determined by some monitoring or analysis tool. There are a number of frameworks available in Node that facilitate this kind of instrumentation.

One such tool is called, appropriately enough, Trace, which is available both in open-source server form (on-premise) and as a hosted solution. For more information check out their Blog Post

Use of Agent-Based Distributed Tracing Systems

There are a number of solutions to this problem that involve installing an agent on a virtual machine. This agent can then monitor traffic and it does its own correlation so that you can see a directed graph of correlated activity of your entire infrastructure. Systems like AppDynamics work this way. The problem with agent-based solutions is that they require installation on a virtual machine, which is a cloud native anti-pattern.

If you can avoid installing an agent then you should do so. There are plenty of implementations of the Google Dapper whitepaper available, none of which require agent installations and prefer instead to inject code aspects or annotations to deal with distributed tracing.