Site reliability engineering (SRE) is often discussed as the future of IT operations, but it’s so much more than that. Done correctly, SRE is a set of principles that—yes—can greatly improve application reliability, but can also lead to happier, healthier employees; an improved set of engineering priorities; and even better-performing teams across all parts of the business.
Without going into too much detail (Google’s popular SRE book covers a number of principles and best practices that define it), a simple way to think about SRE is to focus on three things: automation, service level objectives (SLOs), and measuring the right things. Because in the end, what matters most is keeping your applications online and keeping your users happy with as little toil as possible.
Anybody who’s read anything about SRE has probably come across some version of this quote, which is popular among the folks at Google who helped popularize the practice: “SRE is what happens when you treat operations as a software problem.” It means that by letting machines do what they do best—repeatable, well-understood tasks, including some degree of recovery from performance issues—humans get to sleep easier and focus on more interesting and valuable problems.
Or, as Google senior director of engineering Dave Rensin put it during a recent episode of our Cloud Native in 15 Minutes podcast: “The world you want to live in is one where some system you’re responsible for is having a problem, it sort of mitigates itself, and then it writes a bunch of information out for you to debug the next morning after your morning coffee. That’s a world where the machines work for you.”
On the other end of the spectrum is the world of alert fatigue, sleep deprivation, anxiety, and burnout that’s all too familiar to many people. It also tends to be a world with lower reliability, in part because it’s less efficient and in part because healthier employees do better work.
Service level objectives
One major focus of SRE is the idea of “error budgeting,” which essentially means figuring out how much error (measured in downtime, degraded performance, or what have you) is acceptable in a set period of time. That budget is the flip side of your SLO: whatever share of the period the SLO does not require to be good is error you can afford. And an SLO is never 100 percent.
The goal of the error budgeting and SLO exercise is to be realistic about what’s actually required and to operate just above that level. The vaunted five-nines (99.999 percent) of availability is laudable, but might be overkill if users could live with just 99.9 percent. Aiming too high has a deleterious effect on engineers and sysadmins charged with keeping things running, because every blip becomes a stress-inducing, sleep-depriving emergency. It can also result in people covering up mistakes rather than addressing and learning from them, which will lead to lower reliability over time.
Ironically, over-performing too strongly on even a modest SLO means resources probably aren’t being used optimally. “If you determine that you can tolerate 43 minutes of error, or bad minutes, over 30 days, and you’re consistently only spending, say, 10 minutes, that’s not a victory,” Rensin explains. “That means you’ve over-engineered reliability. You have more reliability than your users need, and that’s time and resource and expense you could be applying to innovation or risk or some other thing.”
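The arithmetic behind numbers like Rensin’s 43 minutes is simple enough to sketch. The snippet below is a minimal illustration, assuming the SLO is a time-based availability percentage measured over a rolling window; the function names are hypothetical and not taken from any particular tool.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of 'bad' time the SLO tolerates over the window."""
    if not 0 < slo_percent < 100:
        # An SLO of 100 percent leaves no budget at all, so reject it.
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_spent_fraction(bad_minutes: float, slo_percent: float,
                          window_days: int = 30) -> float:
    """Share of the error budget already consumed (1.0 means fully spent)."""
    return bad_minutes / error_budget_minutes(slo_percent, window_days)


if __name__ == "__main__":
    for slo in (99.9, 99.99, 99.999):
        print(f"{slo}% over 30 days: {error_budget_minutes(slo):.2f} minutes of budget")
    spent = budget_spent_fraction(10, 99.9)
    print(f"10 bad minutes against 99.9%: {spent:.0%} of the budget spent")
```

Under these assumptions, a 99.9 percent SLO over 30 days leaves roughly 43 minutes of budget, while five-nines leaves under half a minute. Spending only 10 of those 43 minutes means most of the budget goes unused, which is the over-engineering Rensin describes.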
Measuring the right things
SRE encourages organizations to focus on what matters to users rather than what matters to IT. While spiking CPU consumption or memory allocation might be contributing to downtime, users experience something altogether different. They might be experiencing laggy performance, dropped sessions, or a website that won’t load. And maybe, of the plethora of performance metrics you’re tracking, only a small number actually meaningfully contribute to a degraded user experience.
This is why it’s not uncommon for companies reorganizing their operations around SRE to turn off, or at least reprioritize, a significant portion of their alerts in order to focus on the ones that really matter. For example, a late-night issue that doesn’t affect users can probably wait until morning and might be best delivered as a Slack message rather than something more urgent and obtrusive. After all, SRE is supposed to simplify operations, but—especially in an inherently complex distributed system—that’s nearly impossible if staff are getting code-red alerts about every metric that slips out of whack.
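One way to picture that reprioritization is as a routing rule keyed on user impact rather than on the underlying metric. The following is only a sketch: the `Alert` and `Route` types and the `affects_users` flag are hypothetical, and a real setup would derive that flag from user-facing indicators instead of setting it by hand.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page the on-call engineer"
    CHAT = "post a chat message for the morning"


@dataclass(frozen=True)
class Alert:
    name: str
    affects_users: bool  # hypothetical flag: is the user experience degraded?


def route_alert(alert: Alert) -> Route:
    """Interrupt a human only when users are actually affected."""
    return Route.PAGE if alert.affects_users else Route.CHAT


if __name__ == "__main__":
    overnight = [
        Alert("checkout requests failing", affects_users=True),
        Alert("CPU spike on batch worker", affects_users=False),
    ]
    for alert in overnight:
        print(f"{alert.name}: {route_alert(alert).value}")
```

The point of the design is that the decision to wake someone up depends on what users experience, not on which metric moved.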
Learn more, live!
Of course, making the move to SRE is not an overnight process. Apart from reassessing what you’re monitoring and alerting on, it’s going to require some changes in culture, skills, and process that might not come easily or naturally. It might require some changes in tooling in order to facilitate better automation, and a rethink of the overall application lifecycle that accounts for all of this.
If you want to learn more about SRE in person, from getting started to doing it in production, Pivotal’s upcoming SpringOne Platform conference (Oct. 7-10 in Austin, Texas) is a great place to start. There are a number of talks focused on SRE, delivered by our experts as well as by large enterprise customers who have made the evolution from traditional operations to SRE. Four talks that stand out are:
Are You SREious (Ron Cuffy and John Keenleyside, Royal Bank of Canada)
360-Degree Health Assessment of Microservices on the PCF Platform (Nehal Gandhi, Travelers; Rohit Kelapure, Pivotal)
4 Questions to Ask Your Dev Team (Hannah Foxwell and Jérôme Wiedemann, Pivotal)
Highly Available and Resilient Multi-Site Deployments Using Spinnaker (Koundinya Srinivasarao and Dodd Pfeffer, Pivotal)
There is also a slew of other customer talks (from JPMorgan Chase, Wells Fargo, Fidelity Investments, Cerner, Yahoo Japan, and more) highlighting how they’ve modernized operations by using the Pivotal Platform in combination with SRE, chaos engineering, DevOps, and other practices. You can peruse the schedule to get the details on all of them.
Learn more, online!
Of course, there is also a lot of online content explaining SRE and its concepts, in addition to the Google book mentioned above. One of my favorite explainers from Pivotal is a Cloud Foundry Summit presentation by Hannah Foxwell (who’s also presenting at SpringOne Platform next month), called Reliability Engineering for Humans.
We’ve also produced a number of blog posts discussing the topic from numerous angles, including best practices and real-world examples:
SLIs and error budgets: What these terms mean and how they apply to your platform monitoring strategy
Thinking in error budgets: How Pivotal’s cloud ops team used service level objectives and other modern SRE practices to improve outcomes
When DevOps in the enterprise is a dead end (or, where ‘you build it, you run it’ breaks)
Scale and velocity are driving the next generation of DevOps
Check them out, make plans to attend SpringOne Platform, and then create happier users and admins by bringing your operations into the 21st century!
About the Author: Derrick Harris