The 3 Stages to Observability for Modern Apps

June 13, 2018 Alexis Richardson

Editor's note: This is a guest post written by Alexis Richardson, CEO at Weaveworks

Developers are writing applications for the Pivotal Cloud Foundry® (PCF) platform, which today means:

12-factor applications in Pivotal Application Service
Distributed applications and data services in Pivotal Container Service
A growing set of “serverless” functions and other new capabilities

The objective is to have a single, joined-up strategy for monitoring all of these as well as the platform they run on. In addition, you will need to think about which metrics to pay attention to and what you can leave under the control of the platform.

In this post we provide a step-by-step strategy for adding monitoring and observability to your platform team in a simple and cohesive way.

Observability

The goal of a platform is to enable developers to focus on user happiness and business logic - so you will need to prioritize user and business metrics. To relate these back to overall system operations you will need to care about Observability.

Observability is a property of your system. In a nutshell - if you can’t observe your system then you can’t understand it, operate it properly or fix it when it goes wrong. Advanced automated platforms like Pivotal Cloud Foundry and Kubernetes run hundreds or even thousands of applications as part of a distributed system, so developers need to cultivate a shared understanding of how their apps work on such platforms.

If you become aware that user experience is degrading you will want to be able to visualize your whole system, locate potential areas of failure, and interact with components, log stores and other services. The ability for developers to observe any part of a system, ask questions and find answers quickly is a precondition for successful operations.

Three stages to success

We believe that everyone on the cloud native journey will proceed at their own speed. Our recommendation is to attend to each of these stages in turn. At each step, success can be measured by how much more productive your users become.

Collection – Enterprises using dynamic platforms like Kubernetes need Prometheus to monitor their apps and infrastructure, collect the right metrics and, most importantly, create alerts on those metrics so that developers can react appropriately.
Correlation – Once metrics are collected and alerts generated, developers need to understand an application through visualization, logging and interactive debugging tools. A unified dashboard across PAS and PKS helps you understand your system. Also, by practicing GitOps, you can maintain the state of your system and recover more easily from a system disaster.
Causation – And with the right tools available, developers can monitor applications with the goal of gaining complete observability to determine the root cause of application problems.

Step One - Collection

The first objective is to collect any metrics that you will need. We recommend instrumenting your services using Prometheus - the cloud native monitoring and alerting tool.

There are two basic requirements. First, you must be able to move seamlessly between business, user, app, cluster and host metrics. Second, you are dealing with a highly automated and dynamic environment, so you cannot use “old” monitoring tools that are hard bound to machines, hosts, and relatively fixed IP networks.

In a container-native environment using microservices, the velocity of change is much greater than in a monolithic setup running in a virtualized machine environment. Containers for an app or service can be created and destroyed within seconds. Along with that, container orchestration software like Kubernetes dynamically creates and destroys nodes, pods, and replicas to scale with the needs of your service or app, or to “self-heal” when any of these components has failed.

This is why using Prometheus for monitoring and alerting is gaining popularity. Prometheus is designed for the high volume of fine-grained metrics you will need when running containers and microservices as part of your cloud native platform. Prometheus has a vast range of application-level data collectors including, of course, Spring, Docker, Kubernetes, etc.
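To make "instrumenting your services" concrete, here is a minimal sketch of what a service exposes for Prometheus to scrape: a counter rendered in the Prometheus text exposition format. The `Counter` class, the metric name and the label values are all hypothetical and written with the standard library only; in practice an official Prometheus client library would do this work, including details such as label escaping that this sketch skips.

```python
from collections import defaultdict


class Counter:
    """A monotonically increasing metric with optional labels (hypothetical helper)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)  # sorted label tuple -> running total

    def inc(self, amount=1.0, **labels):
        # Counters only ever go up; a decrease would break rate calculations.
        if amount < 0:
            raise ValueError("counters can only go up")
        self._values[tuple(sorted(labels.items()))] += amount

    def render(self):
        """Render the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for labels, value in sorted(self._values.items()):
            # Note: label values are not escaped here; a real client library does that.
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            selector = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{selector} {value}")
        return "\n".join(lines) + "\n"


# Assumed metric name and labels, for illustration only.
requests_total = Counter("http_requests_total", "Total HTTP requests handled.")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="POST", status="500")
print(requests_total.render())
```

Because every sample carries labels rather than a host name or IP address, the same metric keeps making sense as containers come and go.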

A Collection Solution for PCF

Weaveworks provide an enterprise-grade managed Prometheus which can be added to PCF (PAS, PKS) in a few clicks so users can start collecting metrics from all components of PCF including the apps and services running on it. You can use Weave Cloud to visualize all of your services in a graphical context-sensitive map, and also use it to monitor, troubleshoot and send you alerts about your applications. Check out the Weave Cloud for PCF available on Pivotal Services Marketplace.

Step Two - Correlation

Congratulations! You solved the first challenge, and you are now tracking hundreds (if not thousands) of ephemeral containers and app components. You will be able to adapt to any microservices as you evolve. Your business is “as dynamic as the platform”.

Your new objective is to understand and explain the system. Operations teams are responsible for user happiness expressed as service health - SLIs like error rate, request latency, or queries per second, and SLOs like minimal downtime.
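As a rough illustration of the difference between an indicator and an objective, the sketch below computes two SLIs (error rate and 95th-percentile latency) for one measurement window and checks them against objectives. The `Window` type, the thresholds and the sample numbers are assumptions chosen for the example, not values recommended by this post.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Window:
    """Request data observed over one measurement window (illustrative)."""
    total_requests: int
    failed_requests: int
    latencies_ms: list = field(default_factory=list)


def error_rate(window):
    """SLI: fraction of requests that failed."""
    if window.total_requests == 0:
        return 0.0
    return window.failed_requests / window.total_requests


def percentile(values, pct):
    """SLI helper: nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical objectives; real ones come from your own service agreements.
SLO_MAX_ERROR_RATE = 0.01
SLO_MAX_P95_LATENCY_MS = 300


def check_slos(window):
    """Return, per objective, whether the window met it."""
    return {
        "error_rate": error_rate(window) <= SLO_MAX_ERROR_RATE,
        "p95_latency": percentile(window.latencies_ms, 95) <= SLO_MAX_P95_LATENCY_MS,
    }


window = Window(
    total_requests=1000,
    failed_requests=5,
    latencies_ms=[80, 95, 99, 101, 110, 120, 130, 150, 310, 2000],
)
print(check_slos(window))
```

The result says which objective was missed, but nothing about why, which is the gap the rest of this step addresses.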

Monitoring can tell you if you are not meeting Service Level Indicators (SLI) and Service Level Objectives (SLO), but it has little explanatory power to help you diagnose and fix issues. A richer overall approach is needed that accounts for Observability. Here are some use cases for delivering Observability:

Validate that an application or component is in the correct state, by comparing it with a description of the desired state (e.g. a config file, or an alerting threshold)
Correlate deployment events and histories with application metrics. This becomes especially important when you have a high velocity team that deploys multiple times a day. Tracing, logging, and visualization of the services are the other techniques for collecting data that indicates the operational wellness of the service.
Correlate between multiple components to describe a high level behaviour impacting business users, for example the orchestrator repeatedly starts and stops a buggy container, leading to unhappy users.
Troubleshoot serious application issues as they arise. How do you quickly diagnose a problem in an ever changing environment?
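The first use case above, validating a component against a description of its desired state, can be sketched as a simple diff. The field names and values below are hypothetical; in a GitOps setup the desired state would typically come from a config file held in version control.

```python
ABSENT = "<absent>"  # marker for a key that exists on only one side


def diff_state(desired, observed):
    """Return {key: (desired_value, observed_value)} for every key that differs."""
    drift = {}
    for key in sorted(desired.keys() | observed.keys()):
        want = desired.get(key, ABSENT)
        have = observed.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical desired state (e.g. parsed from a config file).
desired = {"image": "shop/cart:1.4.2", "replicas": 3, "memory_mb": 512}
# Hypothetical state reported by the running component.
observed = {"image": "shop/cart:1.4.1", "replicas": 3, "memory_mb": 512, "debug": True}

for key, (want, have) in diff_state(desired, observed).items():
    print(f"{key}: desired={want!r} observed={have!r}")
```

An empty result means the component matches its description; anything else is drift worth alerting on or reconciling.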

A Correlation Solution for PCF

To help with these use cases you will need to observe and understand many components of your system at once. Weaveworks provide support for this in PCF:

Aggregation of full stack monitoring - host, cluster & app metrics - with visualization and interactive management, e.g. ssh/debugging, log viewing
A unified dashboard across the full stack, with filtering and auto-generation to save human users from mental overload
Integration of different exploration tools for team level diagnostics, e.g. incident handling notebooks and support for developer-created Prometheus tools

One of the obstacles with aggregating metrics from tracing, logging and visualization is managing and making sense of the volume of data. It is a huge cognitive challenge to determine the relative importance of each data point and what it means in context with any other source of metrics or logs. Hence the aggregation of metrics and visualization of trending data in a focused and actionable dashboard becomes key. Ideally you want to focus on key performance indicators that can alert you to potential bottlenecks, so you can proactively put an improvement in place or identify the root cause.

The tool or dashboard should be application-centric. This means that the dashboard should be able to generate service metrics and correlate them with deployment events and histories, so that you can analyze and compare current vs. past performance to make more informed decisions.
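A minimal sketch of that correlation, assuming latency samples arrive as `(timestamp, value)` pairs and a deployment is recorded as a timestamp: compare the average just before the deploy with the average just after it. The function names, the window length and the 20% regression threshold are illustrative assumptions.

```python
from statistics import mean


def compare_around(deploy_ts, samples, window_s=300):
    """Return (mean_before, mean_after) for samples within window_s of a deploy.

    samples is an iterable of (timestamp_seconds, value) pairs.
    Returns None when either side of the deploy has no samples.
    """
    before = [v for t, v in samples if deploy_ts - window_s <= t < deploy_ts]
    after = [v for t, v in samples if deploy_ts <= t < deploy_ts + window_s]
    if not before or not after:
        return None
    return mean(before), mean(after)


def is_regression(before, after, tolerance=0.2):
    """Flag a deploy when the metric rose by more than the tolerance (assumed 20%)."""
    return after > before * (1 + tolerance)


# Made-up request latencies in ms, one sample per minute; deploy at t=300s.
samples = list(zip(range(0, 600, 60), [100, 102, 98, 101, 99, 150, 155, 148, 152, 150]))
before, after = compare_around(300, samples)
print(before, after, is_regression(before, after))
```

Running the same comparison for every entry in the deployment history gives a quick first answer to "which release made this worse?".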

With that in mind, we want to introduce you to an observability solution for PCF - Weave Cloud, a developer-centric tool that allows you to gather observability metrics as well as real-time views of your entire PCF platform and the apps running on it.

Weave Cloud allows you to gather and push time-series metrics about the health of the PCF platform itself, such as CPU and memory usage, how many apps are running, and other metrics that are valuable for a PCF operator.

For example, we can group the CPU usage in the system by BOSH jobs.
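Grouping a metric by a label, as in the BOSH job example, amounts to summing samples that share a label value (PromQL expresses this kind of aggregation as `sum by (label)`). The sketch below shows the idea in plain Python; the label names, job names and CPU values are made up for illustration and are not taken from a real PCF deployment.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Sum sample values per distinct value of the given label."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


# Illustrative CPU usage samples (fraction of a core) with assumed labels.
cpu_samples = [
    ({"job": "diego_cell", "index": "0"}, 0.5),
    ({"job": "diego_cell", "index": "1"}, 0.25),
    ({"job": "router", "index": "0"}, 0.125),
]

for job, cpu in sorted(sum_by(cpu_samples, "job").items()):
    print(f"{job}: {cpu}")
```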

Step Three - Causation

Let’s summarize the steps thus far.

First, collect metrics and generate alerts. For Kubernetes, Docker and Cloud Foundry we highly recommend Prometheus and some enterprise features.
Second, observe and correlate across components, to make the system more understandable. Filter out noise using dashboarding to focus the UX.

There’s still one final objective: establishing Causation. This can be very hard. And so at the risk of disappointing readers who have come with us so far, we shall be brief. We’ll describe some of the issues, and hope one day to write a survey of solution techniques and new products in the space.

Microservices form complex networks of behaviour and involve many layers of technology - routing, discovery, etc. Given an apparently localized fault, how do we establish a root cause for the problem? How do we dig into systems that keep on misbehaving? How do we understand the causes of complex distributed systems failures?

This is where Observability is so important. A system is observable if developers can understand its current state from the outside -- and therefore have even half a chance of figuring out what may be wrong. Please do explore further details on Observability and fixing things that go wrong, on the Weaveworks blog.

What are we looking for?

Bad user experiences
Failures that get fixed by the scheduler before we can register negative user impact
The dreaded “grey failures” (read this)
Patterns in high cardinality data (read this)
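Failures that the scheduler repairs before users notice still leave a trace: restart events. One hedged way to surface them is to count restarts per container over a recent window and flag the ones that keep coming back. The event format, window length and threshold below are assumptions made for this sketch.

```python
from collections import Counter


def flapping(restarts, now, window_s=600, threshold=3):
    """Return names of containers restarted at least `threshold` times in the window.

    restarts is an iterable of (timestamp_seconds, container_name) events.
    """
    recent = Counter(
        name for ts, name in restarts if now - window_s <= ts <= now
    )
    return sorted(name for name, count in recent.items() if count >= threshold)


# Made-up restart events: "cart" keeps being restarted by the scheduler.
restarts = [
    (50, "auth"),
    (100, "cart"),
    (220, "cart"),
    (400, "cart"),
    (410, "search"),
]
print(flapping(restarts, now=600))
```

A container on this list may never have tripped a user-facing alert, which is exactly why it is worth looking at.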

Making an application or service observable means developers can be in charge of not only monitoring an app’s behavior, but also the impact it will have on their app’s users. You can solve some of these issues using platform tools like PCF and Weave Cloud, but ultimately this is where we hit the bleeding edge of technology and analytics. For now, let’s wrap by saying “it is a good idea to think about how to observe an application, as you develop it”.

Conclusion

The era of cloud native applications is upon us and developer velocity is everyone’s goal. Does this mean that operations is no longer needed? Certainly not. Your ops team is more valuable than ever before, but their role has changed and they must learn a new language. That is the language of Observability - the latest evolution of monitoring.

Start understanding your system and download the Weave Cloud for PCF tile from the Pivotal Network - your 30 day free trial can start now.

About the Author

Alexis is the co-founder and CEO of Weaveworks. He is also the chairman of the TOC for the Cloud Native Computing Foundation (CNCF), and the co-founder of the Coed:Code meetups. Previously he was at Pivotal, as head of products for Spring, RabbitMQ, Redis, Apache Tomcat and vFabric.
