A Quick Guide For Getting Up to Speed on SRE

September 16, 2019 Derrick Harris

Site reliability engineering (SRE) is often discussed as the future of IT operations, but it’s so much more than that. Done correctly, SRE is a set of principles that—yes—can greatly improve application reliability, but can also lead to happier, healthier employees; an improved set of engineering priorities; and even better-performing teams across all parts of the business.

Without going into too much detail (Google’s popular SRE book covers a number of principles and best practices that define it), a simple way to think about SRE is to focus on three things: automation, service level objectives (SLOs), and measuring the right things. Because in the end, what matters most is keeping your applications online and keeping your users happy with as little toil as possible.

Automation

Anybody who’s read anything about SRE has probably come across some version of this quote, which is popular among the folks at Google who helped popularize the practice: “SRE is what happens when you treat operations as a software problem.” It means that by letting machines do what they do best—repeatable, well-understood tasks, including some degree of recovery from performance issues—humans get to sleep easier and focus on more interesting and valuable problems.

Or, as Google senior director of engineering Dave Rensin put it during a recent episode of our Cloud Native in 15 Minutes podcast: “The world you want to live in is one where some system you’re responsible for is having a problem, it sort of mitigates itself, and then it writes a bunch of information out for you to debug the next morning after your morning coffee. That’s a world where the machines work for you.”

On the other end of the spectrum is the world of alert fatigue, sleep deprivation, anxiety, and burnout that’s all too familiar to many people. It also tends to be a world with lower reliability, in part because it’s less efficient and in part because healthier employees do better work.

Service level objectives

One major focus of SRE is the idea of “error budgeting,” which essentially means figuring out how much error (measured in downtime, degraded performance, or what have you) is acceptable in a set period of time. That tolerance defines your SLO, and it is never 100 percent.

The goal of the error budgeting and SLO exercise is to be realistic about what’s actually required and to operate just above that level. The vaunted five-nines (99.999 percent) of availability is laudable, but might be overkill if users could live with just 99.9 percent. Aiming too high has a deleterious effect on engineers and sysadmins charged with keeping things running, because every blip becomes a stress-inducing, sleep-depriving emergency. It can also result in people covering up mistakes rather than addressing and learning from them, which will lead to lower reliability over time.

Ironically, over-performing too strongly on even a modest SLO means resources probably aren’t being used optimally. “If you determine that you can tolerate 43 minutes of error, or bad minutes, over 30 days, and you’re consistently only spending, say, 10 minutes, that’s not a victory,” Rensin explains. “That means you’ve over-engineered reliability. You have more reliability than your users need, and that’s time and resource and expense you could be applying to innovation or risk or some other thing.”
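The arithmetic behind Rensin’s 43-minute figure can be sketched in a few lines of Python. This is only an illustration, assuming an availability-style SLO counted in “bad minutes” over a 30-day window; the function names are hypothetical and do not come from any particular SRE tool.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the minutes of allowed "bad" time for an availability SLO."""
    if not 0 < slo_percent < 100:
        # A 100 percent SLO leaves no budget at all, so it is rejected here.
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def budget_spent_fraction(slo_percent: float, bad_minutes: float,
                          window_days: int = 30) -> float:
    """Return the share of the error budget consumed (1.0 means fully spent)."""
    return bad_minutes / error_budget_minutes(slo_percent, window_days)


if __name__ == "__main__":
    print(f"{error_budget_minutes(99.9):.1f}")       # 43.2 minutes per 30 days
    print(f"{error_budget_minutes(99.999):.2f}")     # 0.43 minutes per 30 days
    print(f"{budget_spent_fraction(99.9, 10):.0%}")  # 23%
```

At 99.9 percent, the budget works out to roughly 43 minutes a month, which matches the number in the quote. A team that spends only 10 of those minutes has used about a quarter of its budget, and five-nines would leave well under a minute.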

Measuring the right things

SRE encourages organizations to focus on what matters to users rather than what matters to IT. While spiking CPU consumption or memory allocation might be contributing to downtime, users experience something altogether different. They might be experiencing laggy performance, dropped sessions, or a website that won’t load. And maybe, of the plethora of performance metrics you’re tracking, only a small number actually meaningfully contribute to a degraded user experience.

This is why it’s not uncommon for companies reorganizing their operations around SRE to turn off, or at least reprioritize, a significant portion of their alerts in order to focus on the ones that really matter. For example, a late-night issue that doesn’t affect users can probably wait until morning and might be best delivered as a Slack message rather than something more urgent and obtrusive. After all, SRE is supposed to simplify operations, but—especially in an inherently complex distributed system—that’s nearly impossible if staff are getting code-red alerts about every metric that slips out of whack.
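One way to picture that reprioritization is as a small routing rule. The sketch below is hypothetical: the `Alert` fields, the `Route` options, and the example alert names are assumptions made for illustration, not the behavior of any real alerting product.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page the on-call engineer now"
    CHAT = "post to team chat for the morning"
    OFF = "record only, no notification"


@dataclass(frozen=True)
class Alert:
    name: str
    affects_users: bool  # hypothetical flag: is the user experience degraded?
    actionable: bool     # hypothetical flag: can a human do anything about it?


def route(alert: Alert) -> Route:
    """Pick a delivery channel based on user impact rather than raw metrics."""
    if not alert.actionable:
        return Route.OFF   # noise: turn it off rather than interrupt anyone
    if alert.affects_users:
        return Route.PAGE  # users are hurting, so this is worth waking someone
    return Route.CHAT      # real but not urgent: it can wait until morning


if __name__ == "__main__":
    for alert in (
        Alert("checkout requests failing", affects_users=True, actionable=True),
        Alert("batch job CPU spike", affects_users=False, actionable=True),
        Alert("routine cache eviction", affects_users=False, actionable=False),
    ):
        print(f"{alert.name}: {route(alert).value}")
```

The design choice worth noting is that the rule never looks at CPU or memory directly. Those metrics only matter here through the question of whether users feel the problem.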

Learn more, live!

Of course, making the move to SRE is not an overnight process. Apart from reassessing what you’re monitoring and alerting on, it’s going to require some changes in culture, skills, and process that might not come easily or naturally. It might also require some changes in tooling in order to facilitate better automation, and a rethink of the overall application lifecycle that accounts for all of this.

If you want to learn more about SRE in person, from getting started to running it in production, Pivotal’s upcoming SpringOne Platform conference (Oct. 7-10 in Austin, Texas) is a great place to start. There are a number of talks focused on SRE, delivered by our experts as well as by large enterprise customers who have made the evolution from traditional operations to SRE. Four talks that stand out are:

Are You SREious (Ron Cuffy and John Keenleyside, Royal Bank of Canada)
360-Degree Health Assessment of Microservices on the PCF Platform (Nehal Gandhi, Travelers; Rohit Kelapure, Pivotal)
4 Questions to Ask Your Dev Team (Hannah Foxwell and Jérôme Wiedemann, Pivotal)
Highly Available and Resilient Multi-Site Deployments Using Spinnaker (Koundinya Srinivasarao and Dodd Pfeffer, Pivotal)

There is also a slew of other customer talks (from JPMorgan Chase, Wells Fargo, Fidelity Investments, Cerner, Yahoo Japan, and more) highlighting how they’ve modernized operations by using the Pivotal Platform in combination with SRE, chaos engineering, DevOps, and other practices. You can peruse the schedule to get the details on all of them.

Learn more, online!

Of course, there is also a lot of online content explaining SRE and its concepts, in addition to the Google book mentioned above. One of my favorite explainers from Pivotal is the Cloud Foundry Summit presentation by Hannah Foxwell (who’s also presenting at SpringOne Platform next month), called Reliability Engineering for Humans.

We’ve also produced a number of blog posts discussing the topic from numerous angles, including best practices and real-world examples.

Check them out, make plans to attend SpringOne Platform, and then create happier users and admins by bringing your operations into the 21st century!

About the Author

Derrick Harris is a product marketing manager at VMware.

