Editor’s Note: Author will be hosting a webinar on this topic Tuesday, November 19, 2019 at 10:00 AM PST. Click here to register now!
SLOs Just Became Important
Back in the good old days of monolithic applications, most developers and application owners relied on tribal knowledge for what performance to expect. Although applications could be incredibly complex, the understanding of their inner workings usually resided within a relative few in the organization. Application performance was managed informally and measured casually.
This model, though, falls apart in a microservices world. Using informal methods, a single team or small group can no longer see a full picture of an application and can deduct its performance. Instead, the chains of services can be very deep, and a lack of understanding of the performance of the downstream components can result in unwelcome surprises that ultimately affect end-users and the business.
So What’s an SLO?
Before we go any further, let’s define a Service Level Objective (SLO), along with some related concepts and acronyms.
SLIs Are What We Measure
A Service Level Indicator (SLI) is a quantity we measure to determine whether we comply with an SLO. Very commonly, SLIs measure either the success rate or performance of some operation, but an SLI can be just any metric.
SLOs Make SLIs Actionable
For an SLI to be interesting, we need to associate it with a condition. For example, we can state that 99% of all operations must complete successfully or that 95% of all transactions must finish within one second. We often express SLOs as a percentage over some time. For example, we may measure the 99% success rate over 24 hours, and the 95% latency performance better than one second may be measured over an hour.
An SLO should be Attainable, Repeatable, Measurable, Understandable, Meaningful, Controllable, Affordable, Mutually Acceptable [Rick Sturm, Wayne Morris “Foundations of Service Level Management”, April 2000, Pearson]. For example, a transaction success rate of 99.9999999% may not be attainable because commercially available hardware can’t offer that level of reliability. Similarly, an SLO based on some obscure technical metric is probably neither understandable nor meaningful to most users.
SLAs Are SLOs with an “Or-Else” Clause
While SLOs are actionable, Service Level Agreements (SLAs) make them enforceable from a legal standpoint. SLAs typically attach a monetary penalty to the SLO, saying, for example, that if we can’t meet our SLO for a customer, we may be required to pay back a certain percentage of the money they spent with us. Typically, as technologists, we don’t have to worry as much about SLAs, since it’s something our lawyers deal with.
Implementing Good SLOs
A good SLO, it turns out, isn’t as straightforward to implement as one might think at first glance. A naive approach might be to measure deviation from “good” over some time and alert on that. This simple approach suffers from a whole range of drawbacks. The following discussion draws from best practices in the Google SRE Handbook. Let’s explore how these ideas can be used to implement useful SLOs with Wavefront!
Picking the Right SLI
Selecting the right metric as the basis for an SLI is an art in itself and worthy of a separate article. As a rule, we should pick a metric that is meaningful to the business and actionable. Good metrics include error rates and performance of business services. Poor ones include purely technical metrics and metrics that don’t map to an actionable result. These are a few examples:
|Business service error rate||Payload size|
|Business service response time||CPU utilization|
|Time spent waiting in a queue||Queue depth|
Nobody’s perfect, so we need to allow ourselves some room for error. A system with 100% uptime isn’t realistic, so we need to make a tradeoff between our desire to provide a reliable system and the resources that are available. Setting our expectations for availability too high is one of the worst mistakes we can make. It results in constant alerts that people eventually ignore, and that will mask the real issues.
So we need to determine how much time over a certain period we allow ourselves to be out of compliance. This is typically expressed as a percentage. For example, we may allow ourselves to spend 1% of our time out of compliance. Usually, we turn this percentage around as an availability number, so in the case of a 1% error budget, we say we expect 99% availability.
The next issue we’re facing is the time frame to measure. We say we allow ourselves to be out of compliance 1% of the time, but over what time frame? Selecting this time frame turns out to be a hard problem to solve. Measure over a very short time, like 10 minutes, and insignificant glitches will constantly trigger alerts. Pick a long period, like 30 days, and we will be alerted too late and risk missing severe outages altogether.
Let’s say our contract with our users states 99.9% availability over 30 days. That works out to an error budget of 43 minutes. If we burn down these 43 minutes in small increments of minor glitches, our users are probably still happy and productive. But what if we have a single outage of 43 minutes at a business-critical time? It’s safe to say our users would be pretty unhappy with that experience!
To solve this we can introduce burn rates. The definition is simple: If we burn down precisely 43 minutes in our example over 30 days, we call this a burn rate of 1. If we burn it down twice as fast, i.e., in 15 days, the burn rate is 2 and so on. As you will see, this allows us to track the long-term compliance, and to alert on severe short-term issues.
The following graph illustrates the idea of multiple burn rates. The X-axis represents time and the Y-axis is our remaining error budget.
Putting It All Together
Let’s take what has just been explained and put it together into a useful solution for SLO monitoring and alerting!
We learned that long-term monitoring of an SLO trend and short term alerting must be done using different methods. We’ve also introduced the idea of burn rates. So how do we turn this into a great SLO alerting solution?
One idea is to use multiple time windows for immediate alerts and longer-term trend observations. The problem is that if we select a one hour window, a 99% objective allows only 36 seconds of non-compliance. For a 99.9% objective, that number drops to 3.6 seconds! This would cause a lot of noisy alerts for minor glitches.
Multiple Windows, Multiple Burn Rates
One way to overcome this problem is to employ the burn rates we just discussed. If we are looking at a shorter time window, we measure against a faster burn rate. This allows us to fire alerts when a service has been unavailable for a meaningful amount of time. When we monitor over a longer time window, we use a slower burn rate to catch worrisome trends. The following table illustrates how this can be done.
|Severity||Window||Burn Rate||Percent Consumed|
|Critical – Page technician!||1 hour||14.4||2%|
|Smoke – Open ticket||3 days||1||10%|
This means that if we’re burning our error budget at a rate of 14.4 during an hour window, we should have fired a critical alert. We select 14.4 because it represents a burn of 2% over an hour.
On the other hand, if we see a slow burn over a longer time, it may not have a severe impact on users, but it means that we’re on track to breach our SLO over a longer time. The first alert should probably cause someone to be paged immediately, while the second one can be a lower priority alert that someone can look at when they have time.
How Wavefront Helps
As we have seen, implementing a good SLO is relatively hard. You need to do a fair amount of math and make complex decisions. For that, you need a tool with a rich query language and good alerting capabilities. In other words, you need Wavefront!
Let’s say we have an SLI for the response time for some service. Here’s the time series for our SLI metric.
Let’s define acceptable performance as a response time less than one second. Anything else is considered an outage that will negatively impact users. We can easily modify the query to filter out only the outages.
Now, we define two windows — one short-term window with an aggressive burn rate and one longer-term window with a slow burn rate. We can now calculate the adjusted burn rate over time for the two windows.
The orange series represents the cumulative outage time over the shorter window and the blue series over the longer window. As you can see, the spikes in response times have an immediate impact on the short window, but they also contribute to a slow incremental change on the longer-term burn. Furthermore, thanks to the burn rate multiplier, we suppress alerts resulting from any insignificant glitches.
Let’s click on the “create alert” button in the Wavefront UI and define an alert that fires when one of the windows has burned through its allotted error budget.
The complex-looking query at the bottom is synthesized from the set of much simpler queries we defined already. All we have to do now is to select the alert notification mechanism, such as PagerDuty, Slack, JIRA ticket, etc. and we can expect to find out quickly when we’re in danger of no longer meeting SLOs.
In this blog, we’ve seen that alerting on an SLO is fairly complex. But thanks to the query language and alerting mechanisms in Wavefront, we can build this in minutes.
In a real-life scenario, there are additional techniques that can be employed to refine the alerts further. For example, we may want to use multiple windows per burn rate. If you would like to learn more about this, the Google SRE Handbook is a good place to start.Get Started with Wavefront Follow @prydin Follow @WavefrontHQ