Skyline Resolves Production Incidents Faster with Alert-Based Health Dashboards

March 5, 2020 Gregg Ulrich

With contributions from Rajiv Kumar, Senior SRE on the VMware Skyline team, and Mohan Machha, SRE on the VMware Skyline team, who created the Skyline Health Dashboard.

As members of the VMware Skyline Site Reliability Engineering (SRE) team, we ensure the availability and performance of our production services through obsessive measurement. We depend heavily on Wavefront by VMware, an enterprise observability platform for monitoring and analytics of cloud-native applications and environments.

We are a geographically distributed team supporting geographically distributed engineering teams. As SREs, we should know when something is wrong, and we usually do because Wavefront will tell us via smart alerts. Alerts are simple – compare a metric against a threshold, and if the value is unexpected, tell someone. This someone is either our team or the awesome VMware Command Center team, both of which use additional, detailed Wavefront dashboards to resolve alerts.

Building a Dashboard

While this process of alerting and resolving issues is great, we lacked automation for communicating production issues to our distributed stakeholders. As SREs, we know that if an alert is firing, then Skyline is not working as expected (i.e., Skyline is not healthy). Not all novice users of Wavefront know how to find this information in Wavefront, and during production incidents, we don’t have the time to teach them. As the single source of truth for determining the health of Skyline, Wavefront was the logical choice for solving this problem. Thus, the Health Dashboard was born.

Our goals for the Health Dashboard were:

REQ 1. No knowledge of Wavefront or Skyline infrastructure needed, just a URL

REQ 2. New alerts should be easily incorporated into the dashboard by anyone

REQ 3. Use native Wavefront functionality wherever possible

REQ 4. Solve this problem for all teams, not just Skyline

Skyline is a set of microservices that collect, process, and present relevant information to our customers. The health of Skyline is the sum of the health of defined components. A component is a logical function performed by Skyline and its dependencies. For example, Skyline generates a set of reports for each customer, and this function is considered healthy if the reports are generated within a stated SLA. We worked with Product Management to define the Skyline components and their health definitions, and together with the Engineering teams, created metrics and alerts to ensure everything was measured and monitored. These alerts became the basis of the Health Dashboard.

Creating an Alert Tagging Convention

Wavefront advanced alerts have several features, some of which we discovered naturally, that we leveraged when creating the dashboard:

Alerts can be tagged

Currently firing alerts can be queried via ~alert.isfiring

Alert tags with period-separated components become tag paths

Alert tags are the basis of the Health Dashboard. We defined our alert tagging convention as:

namespace.product.component

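A namespace is used to segment the alerts, and it can itself contain periods. For example, the alert tag for the report generation component is skyline.health.skyline.report-generation: the namespace is skyline.health, the product is skyline, and the component is report-generation.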
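To make the convention concrete, here is a minimal Python sketch that splits a tag into its three parts. The `HealthTag` type and `parse_health_tag` helper are hypothetical names used only for illustration, and the sketch assumes the last two period-separated segments are always the product and the component, with everything before them forming the namespace.

```python
from typing import NamedTuple


class HealthTag(NamedTuple):
    namespace: str
    product: str
    component: str


def parse_health_tag(tag: str) -> HealthTag:
    """Split an alert tag of the form namespace.product.component.

    Assumption: the namespace may contain periods (as in "skyline.health"),
    so the last two segments are read as product and component and
    everything before them is the namespace.
    """
    parts = tag.split(".")
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"tag does not follow namespace.product.component: {tag!r}")
    return HealthTag(".".join(parts[:-2]), parts[-2], parts[-1])


print(parse_health_tag("skyline.health.skyline.report-generation"))
```

Parsing from the right keeps multi-segment namespaces intact, at the cost of requiring that product and component names never contain a period.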

We tagged all our alerts, and when we did this, Wavefront converted the tags to tag paths (Fig. 1), allowing for easier navigation and component separation.

Fig. 1. Wavefront Converts Alert Tags to Tag Paths

This helps alert creators determine how to tag their alerts. It is always nice to get something awesome for free!

Building a Reusable Dashboard with Alert Tags

With everything tagged and after a LOT of usability iteration, we built a simple and powerful dashboard that satisfies all requirements. Here it is in Fig. 2!

Fig. 2. Skyline Health Dashboard

Note: Skyline is a production service, and its health dashboard is usually empty. The dashboard shown in Fig. 2 is populated with generated data used to test the dashboard functionality.

The Health Dashboard features are:

Overall HEALTH SEVERITY of Skyline - if green, then there are no known issues (REQ 1)

The UNHEALTHY COMPONENTS chart - lists all components with firing alerts, which allows Production Operators to focus on known problematic areas of the product

COMPONENT HEALTH PERCENTAGE - shows the percentage of time each component was healthy in the current time window, which is useful when troubleshooting

UNHEALTHY PRODUCTS - displays other products in the namespace that are having issues. Skyline is dependent on other products, which we also monitor and tag under our namespace. This simplifies correlating alerts between products.

The ALERTS time series - shows what alerts have fired over the specified time window

The charts on the dashboard are controlled by dynamic variables - we can change the scope of the dashboard by selecting a product value from the drop-down. This list of products is derived from the alert tags.

The dashboard is simple but effective; it presents enough information for operators and executives to understand the current state of Skyline production and where to start investigating in case of any production issues. Anyone can contribute to the dashboard, even creating new products and components, by creating and tagging alerts (REQ 2), which reduces dashboard atrophy (the bane of SREs everywhere). Anyone can use this dashboard by cloning it, changing the namespace variable, and creating and tagging alerts (REQ 4). Best of all, everything uses core Wavefront functionality (REQ 3). Health Dashboard mission accomplished!

Querying Alert Tags

So how does this work? As mentioned earlier, the metric ~alert.isfiring can be used to query for firing alerts. Fig. 3 shows the query used to populate the UNHEALTHY COMPONENTS chart.

Fig. 3. The Unhealthy Components Chart with its Query

The metric ~alert.isfiring with the defined tagging convention is used to populate the chart. Note that the query uses the dashboard variables ${namespace} and ${product}, which let the user control the scope of the dashboard. Wavefront’s taggify() function manipulates and creates point tags. We use it to exclude the namespace, which reduces clutter on the dashboard. If you are not using point tags or taggify() then you are not making the most of Wavefront!

All of the charts on the Health Dashboard use similar ~alert.isfiring queries. Fig. 4 shows the query used to compute component health percentage. We think it is clever!

Fig. 4. The Component Health Percentage Chart with the Query
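The query in Fig. 4 is not reproduced here, but the arithmetic behind a health percentage can be sketched in Python. This is an illustration only, not the Wavefront query: the `health_percentage` helper, the per-minute sampling, and the sample data are all assumptions, and a window with no data is treated as fully healthy.

```python
def health_percentage(firing_samples: list[int]) -> float:
    """Percentage of samples in the window with no firing alerts."""
    if not firing_samples:
        return 100.0  # assumption: no data means no known issues
    healthy = sum(1 for count in firing_samples if count == 0)
    return 100.0 * healthy / len(firing_samples)


# Hypothetical data: one sample per minute, value = number of firing alerts.
window = {
    "report-generation": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    "data-collection": [0] * 10,
}
for component, samples in window.items():
    print(f"{component}: {health_percentage(samples):.1f}%")
```

In this made-up window, report-generation had alerts firing in 2 of 10 samples, so it was healthy 80% of the time.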

Severity Aware Health Status Box

The color of the health severity box (1 in Fig. 2) changes based on the state of firing alerts. There are three options for this box, configurable via a static dashboard variable:

Any alert is unhealthy – either show the status as healthy (green, 0 alerts) or unhealthy (red, >0 alerts)

Severe only – the health severity box is only red if there are firing severe alerts; otherwise it is green

The highest severity – based on multi-threshold alert severities, display the highest severity of all firing alerts. Severity values are INFO, SMOKE, WARN, SEVERE. With this option, if there are WARN and INFO alerts firing simultaneously, the health status box will show WARN (yellow), the higher severity.

The set of queries we use to build this functionality is shown in Fig. 5.

Fig. 5. Health Severity Status and Queries

We individually query for all severities and assign them a value 0 through 4 using if(), and then use max(collect(values)) to get the highest severity. It works great, and we know this because we spent hours watching it!

You probably noticed that some of the queries multiply by ${health_adjustment}, a dashboard variable. This is how we convert the queries between display options. In this chart, anything greater than or equal to 4 is severe (red), so for:

Any alert is unhealthy - we selectively multiply by 10, making all alerts severe

Severe only - we multiply by 0 to ignore INFO, SMOKE and WARN alerts

The Highest Severity - we multiply by 1 to use the per-severity score

This is not a hack; it’s math!
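The multiplication trick can be sketched in Python. This is not the set of queries from Fig. 5; it is a small model of the same arithmetic. The per-severity scores (INFO=1 through SEVERE=4), the mode names, and the choice to apply the adjustment only to non-severe scores are assumptions, the last being one reading of "selectively multiply".

```python
# Assumed per-severity scores; the post only says values run 0 through 4.
SEVERITY_SCORE = {"INFO": 1, "SMOKE": 2, "WARN": 3, "SEVERE": 4}

# health_adjustment values for the three display options described above.
HEALTH_ADJUSTMENT = {
    "any_alert_is_unhealthy": 10,
    "severe_only": 0,
    "highest_severity": 1,
}

SEVERE_THRESHOLD = 4  # scores >= 4 render as severe (red)


def health_score(firing: list[str], mode: str) -> int:
    """Return the highest adjusted score across firing alert severities.

    Assumption: the adjustment is applied only to the non-severe scores.
    """
    adjustment = HEALTH_ADJUSTMENT[mode]
    scores = [0]  # 0 means nothing is firing
    for severity in firing:
        score = SEVERITY_SCORE[severity]
        if severity != "SEVERE":
            score *= adjustment
        scores.append(score)
    return max(scores)


firing_now = ["INFO", "WARN"]
for mode in HEALTH_ADJUSTMENT:
    score = health_score(firing_now, mode)
    print(mode, score, "red" if score >= SEVERE_THRESHOLD else "not red")
```

With INFO and WARN firing, multiplying by 10 pushes both scores past the threshold, multiplying by 0 removes them, and multiplying by 1 leaves WARN (3) as the highest score.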

Using Dynamic Dashboard Variables

Here is the Wavefront magic behind dynamically generating the product dashboard variable based on the namespace. Dynamic variables are incredibly useful.

Fig. 6. A Query Sample for Dynamic Dashboard Variable

Note that we are using collect() to add a wildcard value to the product list, which allows us to view all alerts for all products in our namespace. During incidents, the ability to easily see the health of all products is instrumental.
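The idea of deriving the product list from alert tags and adding a wildcard entry can be sketched in Python. This is not the Wavefront query from Fig. 6: the `product_choices` helper is a hypothetical name, and every tag in the sample data other than the report-generation one is made up for illustration.

```python
def product_choices(alert_tags: list[str], namespace: str) -> list[str]:
    """Products found under a namespace, with "*" first to mean all products."""
    prefix = namespace + "."
    products = set()
    for tag in alert_tags:
        rest = tag[len(prefix):] if tag.startswith(prefix) else ""
        if rest:
            # The first segment after the namespace is the product.
            products.add(rest.split(".")[0])
    return ["*"] + sorted(products)


# Hypothetical tags; only the first one appears in the post.
tags = [
    "skyline.health.skyline.report-generation",
    "skyline.health.skyline.data-collection",
    "skyline.health.other-product.api",
    "another.namespace.thing.component",
]
print(product_choices(tags, "skyline.health"))
```

Because the list is computed from the tags, tagging an alert with a new product name is enough to make that product selectable.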

Slack Health Alerts via Alert Targets

In addition to the dashboard, we use alert targets to update internal Slack channels whenever the component health status changes. The dashboard is great, but Slack notifications are useful when accessing internally available dashboards is less convenient - for example, on your phone. To accomplish this, we created an alert per product.component pair – a solution we are working to improve – that queries for firing alerts and dumps a simple message into Slack.

Fig. 7. Alert for Component Health Status Change
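The logic of notifying only on a status change can be sketched in Python. This does not show how Wavefront alert targets are configured: the `health_changes` and `slack_payload` helpers and the component states are hypothetical, and the payload assumes a Slack incoming webhook that accepts a JSON body with a "text" field.

```python
import json


def health_changes(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Describe every component whose healthy/unhealthy state changed."""
    messages = []
    for component in sorted(current):
        healthy = current[component]
        # Assumption: a component seen for the first time was healthy before.
        if previous.get(component, True) != healthy:
            state = "healthy" if healthy else "UNHEALTHY"
            messages.append(f"{component} is now {state}")
    return messages


def slack_payload(messages: list[str]) -> str:
    """Build a JSON body with a single "text" field for a Slack incoming webhook."""
    return json.dumps({"text": "\n".join(messages)})


# Hypothetical component states: True means no alerts are firing.
before = {"skyline.report-generation": True, "skyline.data-collection": False}
after = {"skyline.report-generation": False, "skyline.data-collection": True}
print(slack_payload(health_changes(before, after)))
```

Comparing two snapshots of all components in one pass produces one message for every change, rather than one notification source per product.component pair.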

Try Wavefront to Reduce Your MTTR

The Health Dashboard is now a critical component of Skyline’s operational support and one that is easily adopted anywhere. It has reduced time to resolve production incidents by better surfacing specific issues, improving communication across our organization, and – most importantly – enabling everyone to improve our production monitoring with a straightforward alert tagging and querying convention.

Wavefront is an incredibly flexible service for measuring and monitoring everything. This post shows just the beginning of what is possible with Wavefront. Sign up for a Wavefront free trial, explore Wavefront’s advanced functionality, and see what you can build (and check out Skyline too!).

Oh, and please tell us how to optimize our Slack channel updates with a single alert! We’re thinking about creating a custom webhook.

Special thanks to Rajiv Kumar (@Rajivnitr) and Mohan Machha (@machhachowdary) for creating the Skyline Health Dashboard.

Rajiv is a Senior SRE on the VMware Skyline team. Before VMware, he was instrumental in implementing Site Reliability Engineering practices for JCPenney, Target, JPMorgan Chase, and GE Healthcare e-commerce business. He has over 11 years of IT industry experience.

Mohan Krishna Machha works as a Site Reliability Engineer on the Skyline team at VMware. He maintains the production services by measuring and monitoring availability, latency, and overall systems health. Previously he worked with AWS.

About the Author

Gregg Ulrich is a member of the VMware Skyline Site Reliability Engineering team, a position which nurtures most of his passions – efficiency, accountability and brevity.