(Almost) Everything You Need to Know About SRE

Site Reliability Engineering (SRE) is a hot topic, but what exactly does it entail? And do you have to follow its principles to a T to benefit from it? If you’re searching for answers to these common questions, look no further.

In this episode of the Cloud & Culture podcast, VMware Tanzu’s Hannah Foxwell explains the what, why, and how of SRE, from key principles (such as SLIs, SLOs, and error budgets) to real-life examples of enterprise adoption. Importantly, she also points out that while SRE has clear benefits for uptime and for the efficient use of resources and energy, it can also be a boon to employees’ quality of life.

Below are some text highlights, but you’ll want to listen to the whole episode to hear more about how to get started, what to expect, and the importance of automation.

Shooting for 99.999% might be overkill

“Some of these [attempts to achieve maximum availability are] driven by some of the more legacy practices that we always had in operations—like ‘every outage is a disaster’ and ‘there's no such thing as an acceptable outage.’ And when you get into that mindset . . . then you start to really over-engineer your solutions to try and achieve the impossible, which is 100 percent.

“And I think these are tendencies that exist within all engineering teams: preparing for the absolute worst-case scenario, over-engineering solutions. And if you're achieving a very high level of availability that really your users don't need, it means that you have probably invested too much in it, whether that be through engineering time [or] whether that be through redundant resources. Maybe you've built a lot of resilience into your system that really wasn't needed. Maybe it was through automation that you'd then need to maintain over the long term and [that] creates toil for your team.

“There are lots of ways that this excessive amount of reliability actually costs you, not [least in] the fact that you actually could be potentially shipping features faster to your users and taking a few more risks in the application software development life cycle.”
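To put rough numbers on that trade-off, here is a minimal Python sketch of the arithmetic behind an availability target and the downtime it leaves room for (its error budget). It is not from the episode: the function name `allowed_downtime_minutes` and the 365-day year are assumptions made purely for illustration.

```python
# Illustrative sketch only: converts an availability target (an SLO) into
# the downtime it permits over a period, i.e. the error budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(slo_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime an availability target permits."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return period_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    # Compare a few common targets, from "two nines" to "five nines".
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.9 percent leaves nearly nine hours of downtime a year, while 99.999 percent leaves only about five minutes. Every extra nine shrinks the budget tenfold, which is why chasing a target your users don't need gets expensive quickly.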

Start with what works for you

“You’re not going to learn all of this new stuff and implement everything overnight. I think as long as you have an intent to start and continuously improve, then you're doing alright. I can talk about teams who don't use SLIs and SLOs, but they do do blameless postmortems on their incidents. They do create that blameless space where not every outage is a disaster and it's an opportunity to learn and improve. That in isolation delivers an amount of value.

“And also, when we talk about eliminating toil and using automation to do that, to build a software solution to what would be a manual or human repetitive task, that's again something that has value as a standalone practice. You can reproduce more consistent environments using infrastructure as code and configuration management systems. You can rebuild your pre-production environments overnight if you script it in the right way.

“These things improve the consistency and reliability of those things in isolation, but you're not necessarily going to get all of the benefits that all of the other SRE practices bring you.”

Why she got into SRE (and why you should look into it)

“I got involved in the DevOps community to start with because I saw that good engineering practices made the environment for the humans working in software development so much better. It was about the health and wellbeing of my own team to start with, like how can we get out of this cycle of rushing towards three-monthly release dates, having that enormous crunch of testing and fixing at the end. I started to research continuous delivery, and that's how I discovered DevOps, that's how I got interested in automation tooling and cloud. All of these things come together to actually make the life of the average software engineer better.

“And that's what really matters to me, because I've seen the impact of bad practices on people. I've seen burnouts, I've had engineers on call having relentless sleepless nights because of fragile systems in production. And that hurts. That hurt me as a manager, but it hurt my team more—it hurt their families, it hurt their relationships. It's a very human benefit to getting these things right.

“And that's why my career took this direction. That's why I'm here doing this stuff and teaching our customers today, because I really do think that the teams who adopt these practices are going to be happier and healthier and more sustainable.”