Faster, AI-Driven Incident Triaging: Wavefront Alert Viewer

December 4, 2019 Frank Hattler

As a developer, SRE, or operations engineer, you often get incident alerts, and you know it’s just the beginning. Alerts lead to a slew of questions that must be answered in order to triage an issue. For example:

  • What is the alert telling me/us?
  • Where is the problem?
  • What else might be relevant to this issue?

The Wavefront Alert Viewer helps you answer these questions quickly by presenting the information you need:

  • Alert description (what?)
  • Impact analysis (where?)
  • System context (what else?)

Let’s examine how this information helps practitioners quickly triage mission-critical issues.

Alert Description

The alert description section, at the top of the Wavefront Alert Viewer, gives practitioners a 10-second briefing. This section includes the alert title, a runbook (if the creator has included one), how long the alert has been firing, and who else has been paged. By condensing this context into the alert description, Wavefront reduces the need for practitioners to switch between screens searching for runbooks and notes that can help triage the alert. This streamlined triaging experience saves precious time that can be the difference between a customer-impacting outage and a seamless recovery experience.

In the example below, the request count has dropped by more than 50%. A linked runbook is available, and we can see that an alert email has been sent to two teams.

Impact Analysis

Once the practitioner responding to the alert has the 10-second briefing, they can dive into the details of the alert. The Wavefront Alert Viewer shows:

  • A chart with the time series causing the alert to fire
  • Details about the alert query
  • Affected point tags related to the alert (most important)

In Wavefront, each time series can be tagged with metadata such as region=us-west-1 or env=prod. When an alert fires, Wavefront analyzes which point tags are most likely to be related to the firing alert and displays them in ranked order in the Alert Viewer. These point tags become a list of suspects for why the alert is firing. For example, if the alert is caused by an outage in region=us-west-2, Wavefront ranks this tag higher than the other tags. As an on-call engineer, you can quickly click through the suspected tags and highlight the related lines in the chart. This reduces the time spent manually correlating complex time-series data and dramatically speeds up impact analysis for alerts.
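To make the idea of ranking point tags as "suspects" concrete, here is a minimal Python sketch. It is not Wavefront's actual algorithm, which this post does not describe. It assumes a simple heuristic: a tag is more suspicious the more over-represented it is among the series that triggered the alert compared with all series the alert query covers. The function name `rank_point_tags` and the scoring rule are hypothetical.

```python
from collections import Counter


def rank_point_tags(all_series, firing_series):
    """Rank (key, value) point tags by how over-represented they are
    among the firing series.

    Each series is a dict of point tags, e.g.
    {"region": "us-west-2", "env": "prod"}.

    Hypothetical score (an assumption, not Wavefront's method):
    share of firing series carrying the tag
    minus share of all series carrying the tag.
    """
    if not all_series or not firing_series:
        return []

    total = Counter(tag for s in all_series for tag in s.items())
    firing = Counter(tag for s in firing_series for tag in s.items())

    scores = {}
    for tag, count in firing.items():
        firing_share = count / len(firing_series)
        overall_share = total[tag] / len(all_series)
        scores[tag] = firing_share - overall_share

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative data: four series, two of which are firing.
series = [
    {"region": "us-west-2", "env": "prod"},
    {"region": "us-west-2", "env": "prod"},
    {"region": "us-west-1", "env": "prod"},
    {"region": "us-east-1", "env": "prod"},
]
ranked = rank_point_tags(series, series[:2])
for (key, value), score in ranked:
    print(f"{key}={value}: {score:.2f}")
```

With this data, region=us-west-2 ranks first because every firing series carries it but only half of all series do, while env=prod scores zero because it appears on every series and therefore explains nothing.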

If we zoom in on the example above, we can see the impact analysis tool in action. This alert indicates that the issue is occurring on two hosts running the purchase service in the us-east-1 region. We should consider investigating whether these hosts, this region, and the purchase service are all working correctly.

System Context

To further support alert triaging, Wavefront Alert Viewer gives practitioners a full view of what is going on in the system by showing correlated alerts and events. When an alert fires, Wavefront automatically scans all other alerts that have fired within 30 minutes and correlates them with the initial alert using AI/ML algorithms. The right panel of the Wavefront Alert Viewer shows these correlated events ranked by relevance and describes why Wavefront chose each one. These correlated events give users a full view of what is going on when an alert fires and help spot patterns that are useful for triaging the alert.
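The general shape of this step (filter by time window, score, rank, explain) can be sketched in a few lines of Python. This is only an illustration: the post does not say which AI/ML algorithms Wavefront uses, so the sketch substitutes a plain Jaccard overlap of point tags as the relevance score. The `Alert` class, the `correlated_alerts` function, and the choice to look 30 minutes on either side of the primary alert are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # the 30-minute window mentioned above


@dataclass(frozen=True)
class Alert:
    name: str
    fired_at: datetime
    tags: frozenset  # set of (key, value) point tags


def correlated_alerts(primary, others, window=WINDOW):
    """Return (score, name, reason) tuples for alerts near `primary`.

    Stand-in relevance score (an assumption, not Wavefront's model):
    Jaccard overlap between the two alerts' point tags.
    """
    results = []
    for other in others:
        # Assumption: the window applies on either side of the primary alert.
        if abs(other.fired_at - primary.fired_at) > window:
            continue
        shared = primary.tags & other.tags
        union = primary.tags | other.tags
        score = len(shared) / len(union) if union else 0.0
        if shared:
            reason = "shares " + ", ".join(f"{k}={v}" for k, v in sorted(shared))
        else:
            reason = "fired within the time window"
        results.append((score, other.name, reason))

    # Most relevant first; ties are broken by name for stable output.
    results.sort(key=lambda r: (-r[0], r[1]))
    return results


# Illustrative data only.
t0 = datetime(2019, 12, 4, 12, 0)
primary = Alert("request count drop", t0,
                frozenset({("service", "purchase"), ("region", "us-east-1")}))
others = [
    Alert("high heap usage", t0 - timedelta(minutes=10),
          frozenset({("service", "purchase"), ("region", "us-west-1")})),
    Alert("high CPU usage", t0 - timedelta(minutes=15),
          frozenset({("service", "purchase"), ("region", "us-east-1")})),
    Alert("disk almost full", t0 - timedelta(hours=2),
          frozenset({("service", "purchase")})),
]
for score, name, reason in correlated_alerts(primary, others):
    print(f"{name}: {score:.2f} ({reason})")
```

Returning a reason string alongside each score mirrors the Alert Viewer's behavior of describing why each correlated alert was chosen. The "disk almost full" alert is dropped because it fired outside the window.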

If we zoom in again on the sample alert, we can see the system context that Wavefront provides. To the right of the chart, we can see that this alert is likely related to other alerts firing in the system for network retransmits, high heap usage, and high CPU usage. We should investigate these alerts to see if one points to the root cause of the decrease in requests to the purchase service.

Conclusion

The Wavefront Alert Viewer enables developers, SREs, and operations engineers to triage alerts more quickly by providing a single page for alert information, impact analysis, and system context. To learn more, sign up for our free trial.

The post Faster, AI-Driven Incident Triaging: Wavefront Alert Viewer appeared first on Wavefront by VMware.

About the Author

Frank Hattler

Frank is an Associate Product Manager at Wavefront. He is currently working on enhancing Wavefront's user experience in order to solve observability challenges. Frank holds a B.S. in Computer Science from Tufts University as well as a self-proclaimed degree in "tinkering with stuff." In his free time, Frank enjoys rock climbing and riding mountain bikes.
