As a developer, SRE, or operations engineer, you often get incident alerts, and you know it’s just the beginning. Alerts lead to a slew of questions that must be answered in order to triage an issue. For example:
- What is the alert telling me/us?
- Where is the problem?
- What else might be relevant to this issue?
The Wavefront Alert Viewer helps you answer these questions quickly by presenting the information you need:
- Alert description (what?)
- Impact analysis (where?)
- System context (what else?)
Let’s examine how this information helps practitioners quickly triage mission-critical issues.
The alert description section, at the top of the Wavefront Alert Viewer, gives practitioners a 10-second briefing. This section includes the alert title, a runbook (if the creator has included one), how long the alert has been firing, and who else has been paged. By condensing this context into the alert description, Wavefront reduces the need for practitioners to switch between screens searching for runbooks and notes that can help triage the alert. This streamlined triaging experience saves precious time that can be the difference between a customer-impacting outage and a seamless recovery experience.
In the example below, the alert fired because the request count dropped by more than 50%. A runbook is linked, and we can see that an alert email has been sent to 2 teams.
Once the practitioner responding to the alert has the 10-second briefing, they can dive into the details of the alert. The Wavefront Alert Viewer shows:
- A chart with the time series causing the alert to fire
- Details about the alert query
- Affected point tags related to the alert (most important)
In Wavefront, each time series can be tagged with metadata such as region=us-west-1 or env=prod. When an alert fires, Wavefront analyzes the point tags that are most likely to be related to the firing alert and displays them in ranked order in the Alert Viewer. These point tags become a list of suspects for why the alert is firing. For example, if the alert is caused by an outage in region=us-west-2, Wavefront ranks this tag higher than the other tags. As an on-call engineer, you can quickly click through the suspected tags and highlight the related lines on the chart. This reduces the time spent manually correlating complex time-series data and speeds up impact analysis for alerts.
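To make the idea of ranking point tags as suspects concrete, here is a minimal Python sketch. It is not Wavefront's actual algorithm, which this post does not describe. It assumes a simple heuristic: a tag is more suspicious when it appears on a larger share of the firing series than of all series. The function name, the host names, and the scoring rule are all hypothetical.

```python
from collections import Counter


def rank_point_tags(series, firing_ids):
    """Rank key=value tags by how concentrated they are among firing series.

    Hypothetical heuristic for illustration only:
    score = share of firing series with the tag - share of all series with it.
    """
    firing_counts = Counter()
    total_counts = Counter()
    for series_id, tags in series.items():
        for key, value in tags.items():
            tag = f"{key}={value}"
            total_counts[tag] += 1
            if series_id in firing_ids:
                firing_counts[tag] += 1

    n_total = len(series)
    n_firing = sum(1 for series_id in series if series_id in firing_ids)

    scores = {}
    for tag, fired in firing_counts.items():
        coverage = fired / n_firing              # share of firing series with this tag
        baseline = total_counts[tag] / n_total   # share of all series with this tag
        scores[tag] = coverage - baseline

    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up example data: two of four hosts violate the alert condition.
series = {
    "host-1": {"region": "us-west-2", "env": "prod"},
    "host-2": {"region": "us-west-2", "env": "prod"},
    "host-3": {"region": "us-west-1", "env": "prod"},
    "host-4": {"region": "us-west-1", "env": "prod"},
}
firing = {"host-1", "host-2"}

ranked = rank_point_tags(series, firing)
for tag, score in ranked:
    print(f"{tag}: {score:.2f}")
```

In this made-up data, region=us-west-2 ranks first because it appears on every firing series but only half of all series, while env=prod scores zero because it appears on every series and so does not help narrow down the problem.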
If we zoom in on the example above, we can see the impact analysis tool in action. This alert indicates that the issue is occurring on 2 hosts running the purchase service in the us-east-1 region. We should investigate whether these hosts, this region, and the purchase service are all working correctly.
To further support alert triaging, Wavefront Alert Viewer gives practitioners a full view of what is going on in the system by showing correlated alerts and events. When an alert fires, Wavefront automatically scans all the other alerts that have fired within 30 minutes and correlates them with the initial alert using AI/ML algorithms. The right panel of the Wavefront Alert Viewer shows these correlated events ranked by relevance and describes why Wavefront chose each one. These correlated events give users a full view of what is going on when an alert fires and help spot patterns that are useful for triaging the alert.
If we zoom in again on the sample alert, we can see the system context that Wavefront provides. To the right of the chart, we can see that this alert is likely related to alerts firing in the system for network retransmit, high heap usage, and high CPU usage. We should investigate these alerts to see if one points to the root cause for the decrease in requests to the purchase service.
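The windowing and ranking steps described above can be sketched in a few lines of Python. This is a deliberately simple stand-in, not the AI/ML correlation Wavefront uses, which this post does not detail. It assumes "within 30 minutes" means 30 minutes before or after the initial alert, and it ranks candidates by the number of shared point tags, then by time proximity. The alert names, tags, and dictionary layout are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # window stated in the post; symmetric use is an assumption


def correlated_alerts(primary, others, window=WINDOW):
    """Return alerts that fired near `primary`, ranked by a simple relevance heuristic."""
    results = []
    for alert in others:
        gap = abs(alert["fired_at"] - primary["fired_at"])
        if gap > window:
            continue  # outside the correlation window
        shared = sorted(set(primary["tags"].items()) & set(alert["tags"].items()))
        reason = ", ".join(f"{k}={v}" for k, v in shared) or "fired in the same window"
        results.append(
            {"name": alert["name"], "shared_tags": len(shared), "gap": gap, "reason": reason}
        )
    # More shared tags first; for ties, the alert closest in time first.
    results.sort(key=lambda r: (-r["shared_tags"], r["gap"]))
    return results


# Made-up example data.
primary = {
    "name": "Request count dropped",
    "fired_at": datetime(2019, 1, 1, 12, 0),
    "tags": {"region": "us-east-1", "service": "purchase"},
}
others = [
    {"name": "Network retransmit", "fired_at": datetime(2019, 1, 1, 11, 55),
     "tags": {"region": "us-east-1"}},
    {"name": "High CPU usage", "fired_at": datetime(2019, 1, 1, 11, 50),
     "tags": {"region": "us-east-1", "service": "purchase"}},
    {"name": "Disk full", "fired_at": datetime(2019, 1, 1, 10, 0),
     "tags": {"region": "us-east-1"}},
]

related = correlated_alerts(primary, others)
for r in related:
    print(f"{r['name']}: {r['reason']}")
```

Each result carries a short reason string, mirroring the way the Alert Viewer explains why a correlated alert was chosen. A real correlation engine would weigh far more signals than shared tags and timing.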
The Wavefront Alert Viewer enables developers, SREs, and operations engineers to triage alerts more quickly by providing a single page for alert information, impact analysis, and system context. To learn more, sign up for our free trial.
The post Faster, AI-Driven Incident Triaging: Wavefront Alert Viewer appeared first on Wavefront by VMware.
About the Author: Frank Hattler