Troubleshooting Microservices With Pivotal Cloud Foundry Metrics 1.2

October 31, 2016 Mukesh Gadiya

 

The software world is moving to microservices. In this era of distributed systems, short-lived servers, and scalable application instances, collecting and correlating logs is far harder than it used to be. Traditional monitoring tools are not set up to solve this problem.

So to fully realize the benefits of cloud-native patterns and microservices, you need to rethink your instrumentation.

As the joke goes:

OH: We replaced our monolith with #microservices so that every outage could be more like a murder mystery.

— Jason K Jackson (@jasonkjackson) October 17, 2015

 

How do you make troubleshooting less murder mystery and more drop-dead simple? With monitoring, re-imagined for microservices!

That’s our goal with Pivotal Cloud Foundry Metrics (PCF Metrics). This module delivers a holistic view of an application’s performance. Metrics, logs, and events, all aggregated together in an intuitive, contextual way. With PCF Metrics, everyone has the same set of facts about an issue. No need to slog through several monitoring and logging interfaces. No more arguing over which system has the best data.

We aim to help users answer three questions about their apps. Let’s use an application suffering from latency to illustrate how our opinions on metrics shine through.

monitoring venn diagram

  • What went wrong and when? The metrics of an app provide the answer. During periods of high network latency, metrics reveal the time window in which the latency occurred and the average response time during that window.
  • Why did it go wrong? The logs of an application provide the answer. In our latency scenario, logs show what happened during this time window: which API call was made, its exact response time, the parameters passed, and the other services involved. Logs and metrics together usually provide enough information to figure out why latency occurred.
  • Who is the culprit behind what went wrong? Events provide the answer. Events bring in external bits of information (such as when the app source code was changed, or whether the app was moved to another container on an underperforming cell). Correlating events with metrics and logs is often the final piece of the puzzle.
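To make the three questions concrete, here is a minimal sketch of the correlation idea in Python. It is not how PCF Metrics is implemented; the sample data, the 200 ms threshold, the five-minute lookback, and the helper names `latency_window` and `in_window` are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta

t0 = datetime(2016, 10, 31, 12, 0)

# Hypothetical metric samples: (timestamp, average response time in ms)
metrics = [(t0 + timedelta(minutes=i), ms)
           for i, ms in enumerate([40, 45, 42, 310, 480, 350, 44, 41])]

# Hypothetical log lines: (timestamp, message)
logs = [
    (t0 + timedelta(minutes=1), "GET /api/orders 200 38ms"),
    (t0 + timedelta(minutes=3, seconds=20), "GET /api/orders 200 512ms upstream=inventory"),
    (t0 + timedelta(minutes=4, seconds=5), "GET /api/orders 200 630ms upstream=inventory"),
    (t0 + timedelta(minutes=7), "GET /api/orders 200 40ms"),
]

# Hypothetical app events: (timestamp, description)
events = [(t0 + timedelta(minutes=2, seconds=30), "app update")]


def latency_window(samples, threshold_ms):
    """What and when: span from the first to the last sample above the threshold.

    Simplification: several separate slow periods are merged into one span.
    """
    slow = [ts for ts, ms in samples if ms > threshold_ms]
    return (min(slow), max(slow)) if slow else None


def in_window(items, start, end):
    """Keep the (timestamp, payload) items whose timestamp falls in [start, end]."""
    return [item for item in items if start <= item[0] <= end]


window = latency_window(metrics, threshold_ms=200)
start, end = window

# Why: the logs written during the slow window
slow_logs = in_window(logs, start, end)

# Who: events shortly before or during the window are candidate culprits
suspects = in_window(events, start - timedelta(minutes=5), end)
```

With this sample data, the slow window runs from 12:03 to 12:05, two log lines fall inside it, and the app update at 12:02:30 shows up as the one suspect.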

PCF Metrics 1.2, now generally available, includes many new features to help you answer these questions. Resolve issues with speed and confidence!

What’s New In Pivotal Cloud Foundry Metrics 1.2

Longer Retention of Metrics & Logs. PCF Metrics now keeps metrics and log data for 2 weeks! Before, data was stored for 24 hours. Customers can now use this additional data to identify and fix issues – an especially useful enhancement to resolve those thornier incidents.

Improved Time Slice Navigation. Selecting a time slice, and zooming into that time slice, is now easier. The new experience aligns with modern web navigation patterns, boosting overall usability.

Time Slice Navigation

To change the time slice, use the time selection dropdown in the top-left corner. Drag your mouse across any of the graphs (metrics, logs, or events) to zoom into times with anomalies or to move to adjacent time slices.

New Visual Treatments for App Events. One consistent theme from customers: interesting things (usually problems) often happen around app events. Application updates, for example, can result in unusual latency, app crashes, or frequent restarts.

This feedback spawned a new design of App Events in 1.2. Events now have their own toggle menu in the sidebar and their own graph, which makes it far easier to correlate app events with metrics and logs.

Visual treatment for app events

Logs Histogram Display when Filtering and Highlighting. In most cases, investigations key off the metrics of an app. But sometimes logs are the starting point (e.g. when a specific user receives an unauthorized error at login). The new log histogram enhancement simplifies this needle-in-the-haystack workflow. Filter and highlight for text strings, and the histogram will show when the string appeared most frequently. The corresponding logs for those time frames are then displayed to help you understand what went wrong.

logs histogram
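The idea behind the histogram can be sketched in a few lines of Python. This is only an illustration of the technique (counting matching log lines per time bucket), not the PCF Metrics implementation; the one-minute bucket width, the sample log lines, and the names `bucket_start` and `log_histogram` are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def bucket_start(ts, width):
    """Round a timestamp down to the start of its fixed-width bucket."""
    return EPOCH + ((ts - EPOCH) // width) * width


def log_histogram(log_lines, needle, width=timedelta(minutes=1)):
    """Count the log lines containing `needle`, grouped by time bucket."""
    counts = Counter()
    for ts, message in log_lines:
        if needle in message:
            counts[bucket_start(ts, width)] += 1
    return counts


t0 = datetime(2016, 10, 31, 12, 0)

# Hypothetical log lines: (timestamp, message)
logs = [
    (t0 + timedelta(seconds=10), "login ok user=alice"),
    (t0 + timedelta(minutes=1, seconds=5), "401 Unauthorized user=bob"),
    (t0 + timedelta(minutes=1, seconds=40), "401 Unauthorized user=bob"),
    (t0 + timedelta(minutes=2, seconds=15), "401 Unauthorized user=bob"),
    (t0 + timedelta(minutes=2, seconds=50), "login ok user=carol"),
]

hist = log_histogram(logs, "401 Unauthorized")

# The bucket where the string appeared most often
peak_bucket, peak_count = hist.most_common(1)[0]
```

Here the 12:01 bucket holds two of the three matching lines, so that is the time frame whose logs you would read first.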

Getting Started with PCF Metrics 1.2 On PWS (Or, Drinking Our Own Champagne)

We run all of our new features as a service ourselves before releasing them to PCF customers. It’s a best practice these days. It makes our products better and gives customers confidence to run Pivotal services on their infrastructure of choice. Some quick performance stats from running PCF Metrics 1.2 at scale for 2 weeks:

  • 50K/min: peak metrics persistence rate
  • 2.9 million logs/min: peak logs ingestion rate
  • 600 million: total network requests over a two-week period
  • 14 billion: total logs over a two-week period
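For a rough sense of scale, the two-week totals can be turned into average per-minute rates and compared with the peak. The arithmetic below assumes the rounded figures above are exact and that the peak was observed within the same two-week window; the variable names are ours.

```python
MINUTES_IN_TWO_WEEKS = 14 * 24 * 60  # 20,160 minutes

total_logs = 14_000_000_000      # total logs over two weeks
total_requests = 600_000_000     # total network requests over two weeks
peak_logs_per_min = 2_900_000    # peak logs ingestion rate

avg_logs_per_min = total_logs / MINUTES_IN_TWO_WEEKS          # about 694,000
avg_requests_per_min = total_requests / MINUTES_IN_TWO_WEEKS  # about 29,800
peak_to_avg_logs = peak_logs_per_min / avg_logs_per_min       # about 4.2

print(f"average logs/min:     {avg_logs_per_min:,.0f}")
print(f"average requests/min: {avg_requests_per_min:,.0f}")
print(f"peak vs. average log rate: {peak_to_avg_logs:.1f}x")
```

Under those assumptions, the peak log ingestion rate is roughly four times the two-week average.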

And with that, we’re excited for you to try out the new features. Here’s how to get started:

  1. Push an app on Pivotal Web Services
  2. Open metrics.run.pivotal.io in a browser
  3. Select your app from the apps dropdown
  4. Look at app metrics, logs, and events in a single view (that’s right, no need to add an agent in your code or configure anything anywhere to make it work!)

PCF customers can upgrade to PCF Metrics 1.2 later in November. To upgrade, download and install the tile from https://network.pivotal.io/products/pcf-metrics. NOTE: metrics and logs will not be ingested until the PCF Metrics 1.2 tile install is complete.

Got Feedback?

Pivotal has a passion for delivering an opinionated approach to application monitoring, and we’re just getting started! Our future investment areas include distributed tracing and correlated logs across connected microservices, support for Spring Boot Actuators, and alerts triggered by user-defined thresholds for metrics and events.

But this is a journey, and we want our customers to help guide us along the way. We welcome your feedback on our priorities and areas of focus. Let us know! Simply click on the Feedback icon and tell us what you think.

Editor’s Note: We are looking for app developers to participate in our ongoing Pivotal Cloud Foundry product development research. Participants will be compensated for their time. To see if you qualify, fill out our participation form here.

feedback

 

About the Author

Mukesh Gadiya

Mukesh is a product manager at Pivotal, helping transform how the world monitors software.
