SRE and the value of treating operations as a software problem

Site reliability engineering, better known as SRE, has been around since about 2004 but really took off around 2016, when Google engineers wrote a book describing how they use SRE principles to keep the company's applications online. Since then, SRE practices and "site reliability engineer" roles have popped up in technology companies large and small, and the topic comes up in any meaningful conversation about modern IT operations.

In this episode of Cloud Native in 15 Minutes, Dave Rensin, a senior director of engineering at Google, explains some of the key SRE principles and why he thinks of SRE as "a business practice that evolved in a technical culture." In addition to IT-centric topics such as what to measure and how SRE relates to DevOps, Rensin also discusses in some detail the theories behind error budgeting and service-level objectives (you should listen just for that), and how they can apply across the entire organization.

Here are a few quotes from the episode, where Rensin gives some of the basics around what SRE is and how to think about what to measure to really make a difference.

SRE is machines working for humans

“The way I like to think of it is this: You can live in one of two worlds. In the first world, a machine, called a pager, wakes you up at 3:00 in the morning because some other machine is having a hard time. In that world, you work for the computers.

“The world you want to live in is one where some system you’re responsible for is having a problem, it sort of mitigates itself, and then it writes a bunch of information out for you to debug the next morning after your morning coffee. That’s a world where the machines work for you. SRE is a world where the machines work for you. …

“That’s the difference between staring at a monitor and poking at a keyboard, versus trying to write software or implement systems that fix themselves.”

SRE and DevOps are not so far apart

“SRE and DevOps developed mostly independently of one another, mostly at the same time, in response to exactly the same set of problems. So, unsurprisingly, they landed in really similar spaces. They share 99 percent of the same principles, so I don’t like to argue about chronology and history and all that stuff. The way I like to think of it mentally is that SRE is a concrete, opinionated—and it certainly is—implementation of DevOps principles.

“The thing I like about SRE is if you do SRE work at Google, and then go to LinkedIn or Netflix or some other place with an SRE culture, the activities will rhyme with one another. You will recognize the things.”

There is such a thing as too good (in error budgeting)

“If you determine that you can tolerate 43 minutes of error, or bad minutes, over 30 days, and you’re consistently only spending, say, 10 minutes, that’s not a victory. That means you’ve over-engineered reliability. You have more reliability than your users need, and that’s time and resource and expense you could be applying to innovation or risk or some other thing.”
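To make the arithmetic behind that quote concrete, here is a minimal sketch of an error budget calculation. It assumes the 43 minutes come from a 99.9 percent availability target over a 30-day window, which the quote does not state explicitly; the function names are illustrative and not taken from any Google tooling.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of tolerated "bad" time in the window for a given SLO.

    slo is a fraction, e.g. 0.999 for an assumed 99.9 percent target.
    """
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return (1.0 - slo) * window_minutes


def budget_spent_fraction(bad_minutes: float, slo: float, window_days: int = 30) -> float:
    """Share of the error budget actually consumed by observed bad minutes."""
    return bad_minutes / error_budget_minutes(slo, window_days)


# Assumed target: 99.9 percent over 30 days, with 10 bad minutes observed.
budget = error_budget_minutes(0.999)
spent = budget_spent_fraction(10, 0.999)
print(f"budget: {budget:.1f} min, spent: {spent:.0%}")
# prints "budget: 43.2 min, spent: 23%"
```

Under these assumptions, spending only 10 of roughly 43 minutes leaves about three quarters of the budget unused, which is the over-engineering Rensin describes.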

Measure what matters

“No one’s users care about CPU load or memory pressure or disk fullness. … They care about, ‘How long did the thing I want take, and did I get the correct answer?’ … You want to measure the things your users care about.

“We like to say the important things are the symptoms, not the causes. The causes are important because you need the data to be able to debug and fix the thing, but your users care about the symptoms.”
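One way to picture measuring symptoms rather than causes is to count the share of requests that were both fast enough and correct. The sketch below is only an illustration: the `Request` record, the 300 ms threshold, and the sample data are all hypothetical and do not come from the episode.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # how long the user waited
    correct: bool      # whether the user got the right answer


def good_request_ratio(requests: list[Request], latency_threshold_ms: float = 300.0) -> float:
    """Fraction of requests that were correct and within the latency threshold.

    The threshold is an assumed value; a real one would come from what
    users of the service actually tolerate.
    """
    if not requests:
        return 1.0  # assumption: with no traffic, no user saw a failure
    good = sum(
        1 for r in requests
        if r.correct and r.latency_ms <= latency_threshold_ms
    )
    return good / len(requests)


# Hypothetical sample: one slow request and one wrong answer out of four.
sample = [
    Request(120, True),
    Request(450, True),   # correct but too slow
    Request(90, False),   # fast but wrong
    Request(200, True),
]
ratio = good_request_ratio(sample)
print(f"good requests: {ratio:.0%}")
# prints "good requests: 50%"
```

Nothing in this measure mentions CPU, memory, or disk. Those signals still matter for debugging, but the number tracked against the objective describes only what the user experienced.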