A well-instrumented application consistently forwards a rich set of metrics, histograms, and traces to an observability platform. This data helps SREs triage issues faster by quickly identifying misbehaving services and drilling into the root cause.
In this blog, we’re going to look into best practices around Distributed Tracing by going through a sample microservices application. In the process, we’ll cover a journey from no instrumentation to full instrumentation that every developer can relate to.
Sample Microservices Application: BeachShirts
Wavefront makes monitoring so easy that our ops like to call it #beachops. Inspired by #beachops, we built a polyglot microservices-based sample application to order beach shirts. The following diagram shows the architecture of the BeachShirts app. In this blog, we’ll reference this BeachShirts app to identify best practices for service owners to instrument their respective services.
Best Practice #1 – Report Traces for all Your Inbound and Outbound Service Calls
A trace is of little use unless it is instrumented end-to-end. The trace view below illustrates why. In it, you can see that the OrderShirts API took 9.73 seconds. At first glance, an SRE might hold the shopping service owner accountable. But the detailed end-to-end trace makes it clear that the bottleneck lies in the packaging service, which took 7.47 of the 9.73 seconds. This valuable piece of information would have been unavailable to the on-call SRE if we hadn’t instrumented our application end-to-end.
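To make the idea concrete, here is a minimal, library-free sketch of how trace context can travel with an outbound call so that the downstream span joins the same trace. The `Span` class, the `x-trace-id` and `x-parent-span-id` header names, and the two service functions are hypothetical illustrations; this is not the Wavefront SDK or the OpenTracing API.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical header names, used only for this sketch.
TRACE_HEADER = "x-trace-id"
PARENT_HEADER = "x-parent-span-id"

FINISHED_SPANS: List["Span"] = []  # stand-in for a span reporter


@dataclass
class Span:
    service: str
    operation: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.perf_counter)
    duration: float = 0.0

    def finish(self) -> None:
        self.duration = time.perf_counter() - self.start
        FINISHED_SPANS.append(self)


def start_span(service: str, operation: str, headers: Dict[str, str]) -> Span:
    """Inbound side: continue the caller's trace if the headers carry one."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return Span(service, operation, trace_id, headers.get(PARENT_HEADER))


def inject(span: Span) -> Dict[str, str]:
    """Outbound side: copy the trace context into the request headers."""
    return {TRACE_HEADER: span.trace_id, PARENT_HEADER: span.span_id}


def packaging_service(headers: Dict[str, str]) -> str:
    span = start_span("packaging", "wrapShirts", headers)
    try:
        return "wrapped"
    finally:
        span.finish()


def shopping_service() -> str:
    span = start_span("shopping", "orderShirts", headers={})
    try:
        # The outbound call carries the trace context with it.
        return packaging_service(inject(span))
    finally:
        span.finish()


shopping_service()
for s in FINISHED_SPANS:
    print(s.service, s.operation, s.trace_id, s.parent_id)
```

Both spans share one trace ID, and the packaging span names the shopping span as its parent. If either side skipped its half (inject on the way out, extract on the way in), the trace would break into unrelated fragments.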
Manual instrumentation is time-consuming, as every developer knows. That is why we at Wavefront have built observability SDKs that provide an easy way to instrument applications. Check out Wavefront SDKs for all popular languages and frameworks such as Jersey, JAX-RS, Django and gRPC here.
Best Practice #2 – Make Sure Your Instrumentation is OpenTracing/OpenCensus Compliant
When it comes to distributed tracing, it’s really important to use the vendor-neutral OpenTracing APIs to avoid lock-in. The Wavefront Observability SDKs are fully compliant with OpenTracing. Wavefront has built an OpenTracing ‘tracer’ for all the popular languages, including Java, C#, Python, Go and Ruby.
Best Practice #3 – Report Request, Error and Duration (RED) Metrics
RED is a popular acronym in the SRE world. It stands for Requests, Errors, and Duration. It is an absolute must in the microservices world to instrument requests and errors as raw counters. From those counters you can easily derive the rate of requests made to your service and the percentage of requests that resulted in an error.
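As a rough illustration of why raw counters are enough, the sketch below derives a request rate and an error percentage from two counter snapshots. The `RedCounters` class and the helper functions are hypothetical names for this example, not part of any Wavefront SDK.

```python
from dataclasses import dataclass


@dataclass
class RedCounters:
    """Raw, monotonically increasing counters for one service."""
    requests: int = 0
    errors: int = 0

    def record(self, ok: bool) -> None:
        self.requests += 1
        if not ok:
            self.errors += 1


def request_rate(before: RedCounters, after: RedCounters, seconds: float) -> float:
    """Requests per second between two counter snapshots."""
    return (after.requests - before.requests) / seconds


def error_percentage(before: RedCounters, after: RedCounters) -> float:
    """Share of requests in the interval that ended in an error."""
    delta = after.requests - before.requests
    return 100.0 * (after.errors - before.errors) / delta if delta else 0.0


counters = RedCounters()
snapshot = RedCounters(counters.requests, counters.errors)  # start of interval

for i in range(200):
    counters.record(ok=(i % 20 != 0))  # simulate: every 20th request fails

print(request_rate(snapshot, counters, seconds=60.0))
print(error_percentage(snapshot, counters))
```

Because the counters only ever increase, the rate and the error percentage can be computed later for any interval, rather than being fixed at instrumentation time.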
The Wavefront OpenTracing SDK not only emits end-to-end traces but also derives RED metrics from those traces and spans. Using the Wavefront Observability SDK, you get RED metrics for the overall application, various services inside that application, service components and frameworks, individual spans and traces. We have also built an intuitive UI around those RED metrics showing top requests, top errors along with slowest requests.
Figure 3: Wavefront’s UI for RED Metrics
Best Practice #4 – Emphasize the D in RED Metrics
The Duration component of the RED metrics deserves its own section because it is such a valuable piece of information. Ideally, ‘Duration’ should be emitted as a histogram, from which you can derive the median, mean, p95, p99, etc. of the distribution. We strongly recommend reporting that value as a Wavefront Histogram, because a percentile of percentiles is useless: percentiles computed separately on each host or time window cannot be averaged into a meaningful overall percentile. Wavefront stores histograms as first-class citizens, so you can merge different distributions and then apply statistical functions to the merged data.
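A small worked example shows why percentiles should be computed after merging distributions. The latencies below are made-up numbers for two hypothetical instances of one service, and `percentile` is a simple nearest-rank helper written for this sketch, not a Wavefront function.

```python
import math
from typing import List


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile of raw samples (p between 0 and 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in seconds from two instances of the same service.
host_a = [0.1] * 99 + [0.2]        # healthy instance
host_b = [0.1] * 90 + [9.0] * 10   # instance with a slow tail

p99_a = percentile(host_a, 99)
p99_b = percentile(host_b, 99)
average_of_p99s = (p99_a + p99_b) / 2

# Merging the raw distributions first gives the service-wide p99.
merged_p99 = percentile(host_a + host_b, 99)

print(p99_a, p99_b, average_of_p99s, merged_p99)
```

With these numbers, averaging the per-host p99 values gives roughly 4.55 seconds, while the p99 of the merged distribution is 9.0 seconds. The averaged figure describes no real request population, which is the reason to keep the distribution itself and apply the statistic last.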
As an example, let’s look at the tracing screenshot above. We observe that the OrderShirts API is taking 9.73 seconds. It is difficult to know whether that is good or bad without looking at the distribution of API latencies. We reported the latency metric as a Wavefront Histogram aggregated over 1-hour buckets and plotted that distribution as the table chart shown below. Looking at the chart, we can conclude that the above latency is actually an anomaly, since it aligns with the P99 latency.
Table 1: OrderShirts API latency table chart in Wavefront
Best Practice #5 – Set Up Alerts on RED Metrics
A service owner promises a particular SLA, and that SLA is violated if we see a high percentage of errors. Also, if the rate of requests exceeds a threshold, the SRE should ideally spin up more containers to scale out that service. This is why it is important to create alerts for these conditions once you have successfully instrumented RED metrics for your service. Wavefront’s AI Genie provides automated correlation, anomaly detection, and forecasting. Alongside AI and ML, we strongly recommend that service owners manually set up the right alerts.
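The two conditions described above can be sketched as simple threshold checks. This is only an illustration of the logic: the `RedSnapshot` type, the `evaluate_alerts` function, and the threshold values are hypothetical, and this is not Wavefront alert syntax.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RedSnapshot:
    requests_per_sec: float
    error_percent: float


def evaluate_alerts(
    snapshot: RedSnapshot,
    max_error_percent: float,
    max_requests_per_sec: float,
) -> List[str]:
    """Return the alerts that fire for one snapshot of RED metrics."""
    fired = []
    if snapshot.error_percent > max_error_percent:
        fired.append("SLA at risk: error percentage above threshold")
    if snapshot.requests_per_sec > max_requests_per_sec:
        fired.append("Scale out: request rate above threshold")
    return fired


# Hypothetical thresholds: 1% errors, 100 requests per second.
print(evaluate_alerts(RedSnapshot(50.0, 0.5), 1.0, 100.0))
print(evaluate_alerts(RedSnapshot(250.0, 3.0), 1.0, 100.0))
```

In practice an alert would usually require the condition to hold over a time window rather than for a single snapshot, so that a brief spike does not page anyone.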
Table 2: Wavefront Observability SDKs for common languages and frameworks
Best Practice #6 – Report Custom Tracing Spans
Wavefront Observability SDKs provide out-of-the-box traces for all the popular frameworks. But for an end-to-end trace to be deep (in terms of the number of spans) and meaningful, you need to instrument custom methods inside the service. This might require some manual effort from the service owner. In the example above, notice that all the async APIs are instrumented manually using OpenTracing.
There is always a tension between ‘depth-level’ instrumentation and ‘breadth-level’ instrumentation. You need to decide whether you want to take the approach of traditional APM vendors and instrument every line of code in your service. In my opinion, instrumenting each line of code was more relevant for monolithic applications. In the world of microservices, services are spread out across different programming languages and frameworks. So it makes sense to do ‘breadth-level’ instrumentation across the various services. Along with that, you should implement meaningful and selective ‘depth-level’ instrumentation inside each service.
Best Practice #7 – Report Custom Business Metrics
Along with tracing, service owners should always be on the lookout for custom business metrics to report to the observability platform. For instance, they should consider incrementing an error counter whenever an error is logged. If you have an in-memory buffer that grows or shrinks, consider reporting the size of that buffer as a gauge. If you are building an application for grocery delivery, then consider reporting the ‘delivery-latency’ for those groceries as a histogram. When instrumented correctly, these custom business metrics are valuable data that can be correlated with the corresponding traces.
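The three examples above (an error counter tied to error logs, a gauge on a buffer, and a latency histogram) can be sketched with the standard library alone. All names here are hypothetical, and the plain list of latency samples is only a stand-in for a real histogram type.

```python
import logging
from collections import deque
from typing import Callable, Deque, Dict, List


class ErrorCountingHandler(logging.Handler):
    """Increments a counter every time an ERROR (or worse) is logged."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        self.count = 0

    def emit(self, record: logging.LogRecord) -> None:
        self.count += 1


log = logging.getLogger("delivery")
log.propagate = False          # keep the sketch's output quiet
log.setLevel(logging.INFO)
error_counter = ErrorCountingHandler()
log.addHandler(error_counter)

# Gauge: a callable that is read at report time, so it always shows
# the current buffer size rather than a stale copy.
buffer: Deque[str] = deque()
gauges: Dict[str, Callable[[], int]] = {"buffer.size": lambda: len(buffer)}

# Histogram stand-in: raw delivery latencies in minutes.
delivery_latency_minutes: List[float] = []

buffer.extend(["order-1", "order-2", "order-3"])
buffer.popleft()
delivery_latency_minutes.extend([32.0, 41.5, 28.0])
log.info("order-1 delivered")
log.error("order-2 delivery failed")

print(error_counter.count, gauges["buffer.size"](), delivery_latency_minutes)
```

Hooking the counter into the logging pipeline means existing `log.error(...)` calls feed the metric without any changes at the call sites.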
Wavefront has built custom metrics (and Wavefront Histograms) reporter SDKs in all the popular languages and frameworks. Using the below-mentioned SDKs, developers can instantiate Counters, Gauges, Wavefront Histograms, and Delta Counters as first-class citizens in their code. Furthermore, with Wavefront SDKs these metrics are reported to Wavefront automatically at regular intervals, without developers having to send that data manually.
If you’re a developer, you are responsible not only for building your service but also for fully instrumenting it. You should be reporting valuable instrumented data (metrics, histograms, traces) to a sophisticated and highly scalable observability platform. Follow the best practices mentioned above (#1 to #7) to take your application from no instrumentation to full instrumentation. Check out our free trial today to get out-of-the-box 3D Observability for your microservices.
The post 7 Best Practices for Distributed Tracing: How to Go from Zero to Full App Instrumentation appeared first on Wavefront by VMware.
About the Author: Sushant Dewan