Skyline Resolves Production Incidents Faster with Alert-Based Health Dashboards

March 5, 2020 Gregg Ulrich

With contributions from Rajiv Kumar, Senior SRE on the VMware Skyline team, and Mohan Machha, SRE on the VMware Skyline team, who created the Skyline Health Dashboard.

As members of the VMware Skyline Site Reliability Engineering (SRE) team, we ensure the availability and performance of our production services through obsessive measurement. We depend heavily on Wavefront by VMware, an enterprise observability platform for monitoring and analytics of cloud-native applications and environments.

We are a geographically distributed team supporting geographically distributed engineering teams. As SREs, we should know when something is wrong, and we usually do because Wavefront will tell us via smart alerts. Alerts are simple – compare a metric against a threshold, and if the value is unexpected, tell someone. This someone is either our team or the awesome VMware Command Center team, both of which use additional, detailed Wavefront dashboards to resolve alerts.

Building a Dashboard

While this process of alerting and resolving issues is great, we lacked automation for communicating production issues to our distributed stakeholders. As SREs, we know that if an alert is firing, then Skyline is not working as expected (i.e., Skyline is not healthy). Not all novice users of Wavefront know how to find this information in Wavefront, and during production incidents, we don’t have the time to teach them. As the single source of truth for determining the health of Skyline, Wavefront was the logical choice for solving this problem. Thus, the Health Dashboard was born.

Our goals for the Health Dashboard were:

REQ 1. No knowledge of Wavefront or Skyline infrastructure needed, just a URL

REQ 2. New alerts should be easily incorporated into the dashboard by anyone

REQ 3. Use native Wavefront functionality wherever possible

REQ 4. Solve this problem for all teams, not just Skyline

Skyline is a set of microservices that collect, process, and present relevant information to our customers. The health of Skyline is the sum of the health of defined components. A component is a logical function performed by Skyline and its dependencies. For example, Skyline generates a set of reports for each customer, and this function is considered healthy if the reports are generated within a stated SLA. We worked with Product Management to define the Skyline components and their health definitions, and together with the Engineering teams, created metrics and alerts to ensure everything was measured and monitored. These alerts became the basis of the Health Dashboard.

Creating an Alert Tagging Convention

Wavefront advanced alerts have several features, some of which we discovered naturally, that we leveraged when creating the dashboard:

  1. Alerts can be tagged
  2. Currently firing alerts can be queried via ~alert.isfiring
  3. Alert tags with period-separated components become tag paths

Alert tags are the basis of the Health Dashboard. We defined our alert tagging convention as:

namespace.product.component

A namespace is used to segment the alerts. For example, the alert tag for the report generation component is skyline.health.skyline.report-generation.
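As a rough illustration of the convention, here is a minimal Python sketch that splits an alert tag into its three parts. It assumes the last two period-separated segments are the product and the component and that everything before them is the namespace (so the namespace in the example above would be skyline.health). That reading, and the names HealthTag and parse_health_tag, are ours for illustration and are not part of Wavefront.

```python
from typing import NamedTuple


class HealthTag(NamedTuple):
    """One alert tag split into the parts of the tagging convention."""
    namespace: str
    product: str
    component: str


def parse_health_tag(tag: str) -> HealthTag:
    """Split 'namespace.product.component' into its parts.

    Assumption: the namespace may itself contain periods, so the last two
    segments are treated as product and component.
    """
    parts = tag.split(".")
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"tag does not follow the convention: {tag!r}")
    *namespace_parts, product, component = parts
    return HealthTag(".".join(namespace_parts), product, component)


if __name__ == "__main__":
    print(parse_health_tag("skyline.health.skyline.report-generation"))
```

A check like this could run when alerts are created, so that a mistyped tag is caught before it silently drops out of the dashboard.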

We tagged all our alerts, and when we did this, Wavefront converted the tags to tag paths (Fig. 1), allowing for easier navigation and component separation.

Fig. 1. Wavefront Converts Alert Tags to Tag Paths

This helps alert creators determine how to tag their alerts. It is always nice to get something awesome for free!
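To picture what a tag path is, the following Python sketch groups period-separated tags into a nested tree, which is roughly the shape of the navigation shown in Fig. 1. It is only a model of the idea, not how Wavefront implements tag paths, and the "collection" component is a made-up example.

```python
from typing import Dict, Iterable


def build_tag_paths(tags: Iterable[str]) -> Dict[str, dict]:
    """Group period-separated tags into a nested dict, one level per segment."""
    root: Dict[str, dict] = {}
    for tag in tags:
        node = root
        for segment in tag.split("."):
            # Reuse the existing branch if another tag already created it.
            node = node.setdefault(segment, {})
    return root


if __name__ == "__main__":
    tree = build_tag_paths([
        "skyline.health.skyline.report-generation",
        "skyline.health.skyline.collection",  # hypothetical component
    ])
    print(tree)
```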

Building a Reusable Dashboard with Alert Tags

With everything tagged and after a LOT of usability iteration, we built a simple and powerful dashboard that satisfies all requirements. Here it is in Fig. 2!

Fig. 2. Skyline Health Dashboard

Note: Skyline is a production service, and its health dashboard is usually empty. The dashboard shown here is populated with generated data used to test its functionality.

Here are the Health Dashboard features:

  1. Overall HEALTH SEVERITY of Skyline - if green, then there are no known issues (REQ 1)
  2. The UNHEALTHY COMPONENTS chart - lists all components with firing alerts, which allows Production Operators to focus on known problematic areas of the product
  3. COMPONENT HEALTH PERCENTAGE - shows the percentage of time each component was healthy in the current time window, which is useful when troubleshooting
  4. UNHEALTHY PRODUCTS - displays other products in the namespace that are having issues. Skyline is dependent on other products, which we also monitor and tag under our namespace. This simplifies correlating alerts between products.
  5. The ALERTS time series - shows what alerts have fired over the specified time window
  6. The charts on the dashboard are controlled by dynamic variables - we can change the scope of the dashboard by selecting a product value from the drop-down. This list of products is derived from the alert tags.

The dashboard is simple but effective; it presents enough information for operators and executives to understand the current state of Skyline production and where to start investigating in case of any production issues. Anyone can contribute to the dashboard, even creating new products and components, by creating and tagging alerts (REQ 2), which reduces dashboard atrophy (the bane of SREs everywhere). Anyone can use this dashboard by cloning it, changing the namespace variable, and creating and tagging alerts (REQ 4). Best of all, this uses only core Wavefront functionality (REQ 3). Health Dashboard mission accomplished!

Querying Alert Tags

So how does this work? As mentioned earlier, the metric ~alert.isfiring can be used to query for firing alerts. Fig. 3 shows the query used to populate the UNHEALTHY COMPONENTS chart.

Fig. 3. The Unhealthy Components Chart with its Query

The metric ~alert.isfiring with the defined tagging convention is used to populate the chart. Note that the query uses the dashboard variables ${namespace} and ${product}, which let the user control the scope of the dashboard. taggify() is a Wavefront function used to manipulate and create point tags. We use it to exclude the namespace, which reduces clutter on the dashboard. If you are not using point tags or taggify() then you are not making the most of Wavefront!
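The actual query lives in Fig. 3, but the selection logic can be sketched in plain Python: keep the firing alerts whose tag falls under the chosen namespace and product, then drop the namespace prefix so only product.component is displayed. This mimics the effect described above and is not a translation of the Wavefront query; the function name and the sample tags other than the report-generation one are hypothetical.

```python
from typing import Iterable, List


def unhealthy_components(firing_tags: Iterable[str], namespace: str,
                         product: str = "*") -> List[str]:
    """Return 'product.component' labels for firing alerts in scope.

    firing_tags: alert tags of the alerts that are currently firing.
    namespace:   plays the role of the ${namespace} dashboard variable.
    product:     plays the role of ${product}; '*' selects every product.
    """
    prefix = namespace + "."
    labels = set()
    for tag in firing_tags:
        if not tag.startswith(prefix):
            continue  # alert belongs to another namespace
        label = tag[len(prefix):]  # strip the namespace to reduce clutter
        tag_product = label.split(".", 1)[0]
        if product in ("*", tag_product):
            labels.add(label)
    return sorted(labels)


if __name__ == "__main__":
    firing = [
        "skyline.health.skyline.report-generation",
        "skyline.health.otherproduct.api",  # hypothetical dependency
        "another.namespace.thing.part",     # hypothetical, out of scope
    ]
    print(unhealthy_components(firing, "skyline.health"))
    print(unhealthy_components(firing, "skyline.health", product="skyline"))
```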

All of the charts on the Health Dashboard use similar ~alert.isfiring queries. Fig. 4 shows the query used to compute the component health percentage. We think it is clever!

Fig. 4. The Component Health Percentage Chart with the Query
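The arithmetic behind a health percentage is simple enough to sketch outside Wavefront. The Python below assumes a component's state is sampled at regular intervals as 1 (an alert is firing) or 0 (nothing firing), and that a window with no samples counts as fully healthy. Both assumptions are ours; the real chart computes this with the query in Fig. 4.

```python
from typing import Sequence


def health_percentage(firing_samples: Sequence[int]) -> float:
    """Percentage of samples in the window during which nothing was firing.

    firing_samples: evenly spaced samples, 1 if the component had a firing
    alert at that instant and 0 otherwise.
    """
    if not firing_samples:
        return 100.0  # assumption: no data means no alert fired
    unhealthy = sum(1 for sample in firing_samples if sample)
    return 100.0 * (len(firing_samples) - unhealthy) / len(firing_samples)


if __name__ == "__main__":
    # One unhealthy sample out of four in the window.
    print(health_percentage([0, 0, 1, 0]))
```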

Severity Aware Health Status Box

The color of the health severity box (item 1 in Fig. 2) changes based on the state of firing alerts. There are three options for this box, configurable via a static dashboard variable:

  1. Any alert is unhealthy – either show the status as healthy (green, 0 alerts) or unhealthy (red, >0 alerts)
  2. Severe only – the health severity box is only red if there are firing severe alerts; otherwise it is green
  3. The highest severity – based on multi-threshold alert severities, display the highest severity of all firing alerts. Severity values are INFO, SMOKE, WARN, and SEVERE. With this option, if there are WARN and INFO alerts firing simultaneously, the health status box will show WARN (yellow), the higher severity.

The set of queries we use to build this functionality is shown in Fig. 5.

Fig. 5. Health Severity Status and Queries

We individually query for all severities and assign them a value 0 through 4 using if(), and then use max(collect(values)) to get the highest severity. It works great, and we know this because we spent hours watching it!

You probably noticed that some of the queries multiply by ${health_adjustment}, a dashboard variable. This is how we convert the queries between display options. In this chart, anything greater than or equal to 4 is severe (red), so for:

  • Any alert is unhealthy - we selectively multiply by 10, making all alerts severe
  • Severe only - we multiply by 0 to ignore INFO, SMOKE and WARN alerts
  • The Highest Severity - we multiply by 1 to use the per-severity score

This is not a hack; it’s math!
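Here is the same math as a small Python sketch. The per-severity scores (INFO=1, SMOKE=2, WARN=3, SEVERE=4, with 0 meaning nothing is firing) and the choice to apply the multiplier only to the non-severe scores are our reading of the description above, not values copied from the queries in Fig. 5. The mode names are made up for the example.

```python
from typing import Iterable

# Assumed scores; 0 is reserved for "no alert firing".
SEVERITY_SCORES = {"INFO": 1, "SMOKE": 2, "WARN": 3, "SEVERE": 4}

# Stand-ins for the ${health_adjustment} dashboard variable.
HEALTH_ADJUSTMENT = {
    "any_alert_is_unhealthy": 10,  # every firing alert lands at or above 4
    "severe_only": 0,              # INFO, SMOKE and WARN are zeroed out
    "highest_severity": 1,         # scores are used as they are
}

RED_THRESHOLD = 4  # anything >= 4 is shown as severe (red)


def health_score(firing_severities: Iterable[str], mode: str) -> int:
    """Return the highest adjusted score across all firing alerts."""
    adjustment = HEALTH_ADJUSTMENT[mode]
    scores = [0]  # baseline when nothing is firing
    for severity in firing_severities:
        score = SEVERITY_SCORES[severity]
        if severity != "SEVERE":
            score *= adjustment  # the "selective" multiplication
        scores.append(score)
    return max(scores)


if __name__ == "__main__":
    firing = ["WARN", "INFO"]
    for mode in HEALTH_ADJUSTMENT:
        score = health_score(firing, mode)
        print(mode, score, "red" if score >= RED_THRESHOLD else "not red")
```

With WARN and INFO firing, the three modes give 30 (red), 0 (green), and 3 (WARN), which matches the three display options described above.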

Using Dynamic Dashboard Variables

Here is the Wavefront magic behind dynamically generating the product dashboard variable based on the namespace. Dynamic variables are incredibly useful.

Fig. 6. A Query Sample for Dynamic Dashboard Variable

Note that we are using collect() to add a wildcard value to the product list, which allows us to view all alerts for all products in our namespace. During incidents, the ability to easily see the health of all products is instrumental.
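In plain Python terms, the drop-down is the set of products found under the namespace plus a wildcard entry. The sketch below shows that derivation under the same tag-layout assumption used throughout (namespace, then product, then component). It illustrates the idea only; the real list comes from the query in Fig. 6, and the product names other than skyline are hypothetical.

```python
from typing import Iterable, List


def product_choices(alert_tags: Iterable[str], namespace: str) -> List[str]:
    """Return '*' followed by the products tagged under the namespace."""
    prefix = namespace + "."
    products = set()
    for tag in alert_tags:
        if tag.startswith(prefix):
            product = tag[len(prefix):].split(".", 1)[0]
            if product:
                products.add(product)
    # The wildcard comes first so "all products" is always available.
    return ["*"] + sorted(products)


if __name__ == "__main__":
    tags = [
        "skyline.health.skyline.report-generation",
        "skyline.health.skyline.collection",  # hypothetical component
        "skyline.health.otherproduct.api",    # hypothetical product
    ]
    print(product_choices(tags, "skyline.health"))
```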

Slack Health Alerts via Alert Targets

In addition to the dashboard, we use alert targets to update internal Slack channels whenever the component health status changes. The dashboard is great, but Slack notifications are useful when accessing internally available dashboards is less convenient, for example, on your phone. To accomplish this, we created an alert per product.component pair (a solution we are working to improve) that queries for firing alerts and posts a simple message to Slack.

Fig. 7. Alert for Component Health Status Change
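For readers wondering what a single-notifier approach might look like, here is a hedged Python sketch. It compares two snapshots of unhealthy components and builds one Slack message for whatever changed. It is not what we run today: the function name and message wording are invented, and the only external fact relied on is that a Slack incoming webhook accepts a JSON body with a "text" field. Sending the request is deliberately left out.

```python
import json
from typing import Iterable, Optional


def health_change_message(previous: Iterable[str],
                          current: Iterable[str]) -> Optional[str]:
    """Build a Slack webhook JSON body describing component health changes.

    previous / current: 'product.component' labels that were unhealthy at
    the last check and at this check. Returns None when nothing changed.
    """
    before, now = set(previous), set(current)
    newly_unhealthy = sorted(now - before)
    recovered = sorted(before - now)
    if not newly_unhealthy and not recovered:
        return None  # no status change, so no message
    lines = [f"UNHEALTHY: {name}" for name in newly_unhealthy]
    lines += [f"RECOVERED: {name}" for name in recovered]
    return json.dumps({"text": "\n".join(lines)})


if __name__ == "__main__":
    print(health_change_message([], ["skyline.report-generation"]))
```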

Try Wavefront to Reduce Your MTTR

The Health Dashboard is now a critical component of Skyline’s operational support and one that is easily adopted anywhere. It has reduced the time to resolve production incidents by better surfacing specific issues, improving communication across our organization, and, most importantly, enabling everyone to improve our production monitoring with a straightforward alert tagging and querying convention.

Wavefront is an incredibly flexible service for measuring and monitoring everything. This post shows just the beginning of what is possible with Wavefront. Sign up for a Wavefront free trial, explore Wavefront’s advanced functionality, and see what you can build (and check out Skyline too!).

Oh, and please tell us how to optimize our Slack channel updates with a single alert! We’re thinking about creating a custom webhook.

Special thanks to Rajiv Kumar (@Rajivnitr) and Mohan Machha (@machhachowdary) for creating the Skyline Health Dashboard.

Rajiv is a Senior SRE on the VMware Skyline team. Before VMware, he was instrumental in implementing Site Reliability Engineering practices for JCPenney, Target, JPMorgan Chase, and GE Healthcare e-commerce business. He has over 11 years of IT industry experience.

Mohan Krishna Machha works as a Site Reliability Engineer on the Skyline team at VMware. He maintains the production services by measuring and monitoring availability, latency, and overall system health. Previously, he worked with AWS.

About the Author

Gregg Ulrich is a member of the VMware Skyline Site Reliability Engineering team, a position which nurtures most of his passions – efficiency, accountability and brevity.
