Distributed Tracing in PCF Metrics: Breakthrough Insight for Microservices

April 5, 2017 Mukesh Gadiya

The shift to cloud-native delivers unheard-of application resilience and flexibility. As we've opined before, it mandates a new approach to troubleshooting and understanding system failures.

Today's architectures have exponentially more complexity. The question to answer isn't "what's wrong with my source code?" Instead, teams need to address a series of questions when issues arise:

  • Which component of the workload is having a problem?

  • How do we trace the relevant requests through the entire workload?

  • How do we find all diagnostic information from the components that processed that request?

  • And how do we do all of this as soon as possible?

The situation is compounded further when different teams own different pieces of the workload. Why? Only a select few have an end-to-end understanding of the entire workload.

But just as the industry has rallied around microservices, so too have we rallied to simplify the troubleshooting of these modern, distributed systems.

The answer: distributed tracing. These tools help engineers understand the scale of interactions between system components. Which brings us to PCF Metrics, the integrated metrics module for Pivotal Cloud Foundry.

PCF Metrics: A Single Set of Facts, and Now a Full Picture of Your System

PCF Metrics gives your engineering organization a single repository of application telemetry. Dev and ops teams use the data therein to kick-start issue mitigation. Events, metrics, and logs are shown on an intuitive timeline.

But these features don't sufficiently answer the questions posed earlier. To get a complete picture of your workload, engineers need to understand the scale of interactions between components.

That's what PCF Metrics 1.3 delivers with Trace Explorer! Use Trace Explorer to:

  • Examine distributed tracing across microservices - with correlated logs in the same view

  • Perform log filtering on specific HTTP requests within a trace

  • View a dependency tree that shows parent-child relationships for microservices within the trace
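
To make the dependency tree idea concrete, here is a minimal sketch of how parent-child relationships can be recovered from trace data. It assumes each span carries a trace ID, a span ID, and a parent span ID, as in Zipkin-style tracing. The Span class, the sample data, and the function names are hypothetical illustrations, not the PCF Metrics implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One unit of work in a trace (hypothetical shape, for illustration)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    app: str
    name: str


def build_tree(spans):
    """Group spans by parent ID and return (root spans, children lookup)."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span.parent_id is None:
            roots.append(span)
        else:
            children[span.parent_id].append(span)
    return roots, children


def render(spans):
    """Return the tree as indented text lines, one line per span."""
    roots, children = build_tree(spans)
    lines = []

    def walk(span, depth):
        lines.append("  " * depth + f"{span.app}: {span.name}")
        for child in children[span.span_id]:
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


# Made-up spans for a single checkout request.
SPANS = [
    Span("t1", "a", None, "user", "checkout"),
    Span("t1", "b", "a", "order", "processing"),
    Span("t1", "c", "b", "stock", "reserve"),
    Span("t1", "d", "b", "payments", "charge-card"),
    Span("t1", "e", "b", "notifications", "send"),
]

if __name__ == "__main__":
    print("\n".join(render(SPANS)))
```

Running the script prints the user span at the top, the order span indented beneath it, and the stock, payments, and notifications spans one level deeper.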

PCF Metrics is tightly integrated with UAA, Pivotal Cloud Foundry's identity management service. That means Metrics automatically respects the permissions of the user. Engineers only see the apps they are authorized to view.

SRE Life Without Trace Explorer

How does Trace Explorer work in the real world? Consider a scenario where an e-commerce site experiences latency in user checkout.

Our hypothetical system is composed of these elements:

  • User-facing properties, the UI and API that power the shopper's experience. Let's call this collection of services user.

  • Stock inventory management (stock)

  • Payment processing (payments)

  • Order processing (order)

  • Order notifications (notifications)

Suppose latency exists in the completion and processing of an order. How do on-call engineers approach this problem without Trace Explorer? This is a common flow:

  1. The SRE gets paged by the monitoring tool, because of the latency in user checkouts.

  2. She logs in to the monitoring tool and finds that the user.checkout HTTP request is slower than usual. She drills down further and discovers the slowness is really from order.processing.

  3. She opens a new tab in the monitoring tool for the order app.

  4. She zooms into the relevant time window inside the order monitoring page.

  5. Further analysis shows that payment processing was slow.

  6. She follows the same troubleshooting steps for payments as those for order. She finds that payments/charge-card is slow.

  7. While this investigation in the metrics tool unfolds, the SRE also examines the logging tool. She reviews logs for the desired time window from all apps - user, stock, order, payments, and notifications.

  8. After searching and filtering for the right time window in the log tool, she correlates the metrics from payments/charge-card to the application logs from payments. She sees that charge-card verification with the external bank was very slow. This introduced the latency.

This grunt work is the problem solved by Trace Explorer and PCF Metrics. No more alt-tabbing between tools. No more wading through logs after you've correlated the time window.
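
Why does a shared trace ID remove so much of this work? The sketch below is one way to picture it, under stated assumptions: every span and log line for a request carries the same trace ID, and child spans run one after another, so a span's own time is its duration minus the durations of its direct children. The record layouts, sample values, and function names are hypothetical, not PCF Metrics data formats.

```python
from collections import defaultdict

# Hypothetical spans for one trace: (span_id, parent_id, app, name, duration_ms)
SPANS = [
    ("a", None, "user", "checkout", 2100),
    ("b", "a", "order", "processing", 2000),
    ("c", "b", "stock", "reserve", 50),
    ("d", "b", "payments", "charge-card", 1800),
]

# Hypothetical log lines from several apps, each tagged with trace and span IDs.
LOGS = [
    {"trace_id": "t1", "span_id": "b", "app": "order", "msg": "order received"},
    {"trace_id": "t2", "span_id": "x", "app": "payments", "msg": "charge ok"},
    {"trace_id": "t1", "span_id": "d", "app": "payments", "msg": "bank verification slow"},
]


def self_times(spans):
    """Map span_id to time spent in the span itself, excluding direct children.

    Assumes children run sequentially; overlapping children would need
    interval arithmetic instead of simple subtraction.
    """
    child_total = defaultdict(int)
    for _span_id, parent_id, _app, _name, duration in spans:
        if parent_id is not None:
            child_total[parent_id] += duration
    return {
        span_id: duration - child_total[span_id]
        for span_id, _parent, _app, _name, duration in spans
    }


def slowest_span(spans):
    """Return the span tuple with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda span: times[span[0]])


def logs_for_trace(logs, trace_id, span_id=None):
    """Select log lines for one trace, optionally narrowed to one span."""
    return [
        line for line in logs
        if line["trace_id"] == trace_id
        and (span_id is None or line["span_id"] == span_id)
    ]


if __name__ == "__main__":
    culprit = slowest_span(SPANS)
    print("slowest:", culprit[2], culprit[3])
    for line in logs_for_trace(LOGS, "t1", span_id=culprit[0]):
        print(line["app"], "-", line["msg"])
```

In this made-up data, most of the checkout time sits in the payments charge-card span, and filtering the logs by that trace and span surfaces the relevant payments log line without any time-window matching.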

...And With Trace Explorer

With Trace Explorer, troubleshooting time goes from hours to minutes!

What's more, you don't need intimate knowledge of the holistic system to find issues. Trace Explorer puts it all in context for you.

Trace Explorer makes every developer a capable troubleshooter for all your organization's microservices!

Getting Started with PCF Metrics 1.3

Trace Explorer is the flagship feature in this release, but there are other new features too. The module captures more app events, and improves how logs and metrics are retained during tile upgrades.

Check out the release notes. Then, download and install the tile. Now try out Trace Explorer using this sample app.

Got feedback on Trace Explorer or PCF Metrics? We want to hear from you! Simply click on the Feedback icon inside the product and tell us what you think. Want to read about other new features in Pivotal Cloud Foundry 1.10? Check out the overview.

About the Author

Mukesh Gadiya

Mukesh is a product manager at Pivotal, helping transform how the world monitors software.
