Kalai Wei, Corey Innis, Gustavo Franco, and Alexandra McCoy contributed to this post.
As engineering teams embrace Kubernetes and VMware’s Tanzu editions to deliver business value more effectively (e.g., by increasing deployment velocity and scalability), ensuring that workloads, now distributed across collections of microservices, are reliable is as important as ever. To keep up with today’s ever-increasing pace of change, teams need to take a strategic and iterative approach to adopting reliability engineering practices, processes, techniques, and tools.
However, when it comes to reliability engineering, individuals throughout an IT organization commonly have varying levels of knowledge, expertise, and interest. A basic understanding might include, for example, some awareness of service-level indicators (SLIs) but only a vague sense of how to use them with service-level objectives (SLOs) and error budgets to become more effective. Team members with a more advanced understanding, by contrast, know how site reliability engineering (SRE) practices complement those of agile and lean to generate and sustainably deliver business value.
At VMware, the Customer Reliability Engineering (CRE) team works closely with VMware internal teams and enterprise customers to support their adoption of SRE and the cultural transformation that comes with it. Over the course of these engagements, VMware CRE identifies opportunities for SRE education and enablement, and evolves tools and techniques to help guide teams during SRE adoption. The aim is to bring together these disparate levels of knowledge so that teams can come to a shared understanding and a sense of collective ownership. In this post, we’ll dive deeper into the VMware CRE approach.
VMware CRE has developed a series of SRE training presentations and workshops that enable teams to align and level up on their ability to manage production delivery and service reliability. Our plan is to continue developing a thorough curriculum of modularized education and training courses, paying special attention to platforms and workloads running on Kubernetes or in one of our Tanzu editions.
Teams must first ensure they can invest an appropriate amount of time in reliability (to get a better sense of how much time is required, refer to our blog post on how to reduce unscalable workloads). We also find that having a foundation in observability, specifically with a user-centric approach framed in SLIs and SLOs, is necessary before getting started.
We view SLOs as fundamental to reliability engineering practices and an essential foundation for building effective SRE capabilities within a team, so we tend to start with our SLO workshop series. The training consists of a high-level, hour-long presentation and Q&A followed by a hands-on workshop using a visual collaboration platform for attendees to brainstorm ideas and discuss examples in depth.
Some of the highlights of the presentation include:
A high-level overview of SRE
Definitions of SLIs, SLOs, and error budgets
How to utilize error budgets to manage risk
How to utilize error budgets to prioritize a team’s backlog
The economics of investing in reliability work
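The relationship between an SLO and its error budget comes down to simple arithmetic, sketched below in Python. This is an illustrative sketch rather than material taken from the presentation: the function names, the 99.9% target, and the request counts are all assumptions chosen for the example.

```python
def allowed_bad_events(slo_target: float, total_events: int) -> float:
    """Error budget for the window: how many events may fail under the SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed = allowed_bad_events(slo_target, total_events)
    bad = total_events - good_events
    if allowed == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 0.0 if bad == 0 else float("-inf")
    return 1.0 - bad / allowed


# Hypothetical 30-day window: 1,000,000 requests, 500 of which failed.
remaining = budget_remaining(
    slo_target=0.999, good_events=999_500, total_events=1_000_000
)
print(f"error budget remaining: {remaining:.0%}")
```

With these made-up numbers, a 99.9% target allows about 1,000 failed requests, so 500 failures leave roughly half of the budget. A team could spend that remainder on riskier changes or hold it in reserve, which is the risk-management and prioritization trade-off the presentation covers.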
Following the presentation and Q&A, the first hands-on workshop demonstrates how to specify and instrument SLIs. It focuses on key service use cases from a user-first perspective, as we believe that a lot of value can be derived from determining SLIs and correlating them with actual user experience and feedback (e.g., support tickets, customer satisfaction surveys, traffic pattern changes, and success rates for checkout and other key operations).
A follow-on workshop builds on the prioritized SLI specifications to define SLOs, error budgets, and policies. Notably, it is not necessary for teams to establish SLOs and error budgets in order to derive value from our SLO workshop series.
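An error budget policy spells out what a team agrees to do as its budget is consumed. The Python sketch below shows one possible shape for such a policy. The thresholds and actions are hypothetical examples, not recommendations from the workshop, and each team would negotiate its own.

```python
from typing import List, Tuple

# Hypothetical policy: each rule applies when the remaining budget fraction is
# strictly above its threshold. Rules are ordered from healthiest to most
# depleted. All values are illustrative only.
POLICY: List[Tuple[float, str]] = [
    (0.50, "continue shipping features at the normal pace"),
    (0.10, "move reliability items to the top of the backlog"),
    (0.00, "pause risky launches and focus on reliability work"),
]
EXHAUSTED_ACTION = "freeze feature releases until the budget recovers"


def action_for(remaining: float) -> str:
    """Return the agreed action for a given fraction of error budget remaining."""
    for threshold, action in POLICY:
        if remaining > threshold:
            return action
    # Budget is fully spent (zero) or overspent (negative).
    return EXHAUSTED_ACTION


for fraction in (0.8, 0.3, 0.05, -0.2):
    print(f"{fraction:+.2f} -> {action_for(fraction)}")
```

Keeping the policy as plain data makes it easy to review with stakeholders and to revise as the team learns how its users respond to different levels of service.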
Much in the way that SLOs are the foundation of SRE, SLIs are the foundation of a user-driven SLO approach. Indeed, in Implementing Service Level Objectives (O’Reilly, 2020), Alex Hidalgo refers to the building blocks of SLOs as the “Reliability Stack.” By taking an SLI-first approach to investigative work, we are able to determine how users, customers, and stakeholders react to and feel about varying levels of service.
With that in mind, we first help our audience define what an SLI is and identify some of the high-level criteria that an SLI consists of during our initial presentation. We also discuss a number of important concepts, such as critical user journeys and system boundaries and capabilities, as well as how to both determine the correct reliability metric to measure and implement it.
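One common way to express an SLI, and the one assumed in this sketch, is as the proportion of valid events that were good for a given critical user journey, measured at a chosen system boundary. The Python below illustrates the idea for a checkout journey. The `Request` record, its fields, and the 500 ms latency threshold are hypothetical and stand in for whatever a team's own telemetry provides.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Request:
    journey: str       # critical user journey the request belongs to
    status: int        # HTTP status code observed at the system boundary
    latency_ms: float  # time taken to serve the request


def journey_sli(
    requests: Iterable[Request], journey: str, max_latency_ms: float
) -> Optional[float]:
    """Good events divided by valid events for one journey (None if no traffic)."""
    valid = [r for r in requests if r.journey == journey]
    if not valid:
        return None
    # "Good" here means no server error and fast enough for the user.
    good = sum(
        1 for r in valid if r.status < 500 and r.latency_ms <= max_latency_ms
    )
    return good / len(valid)


# Made-up sample data for illustration.
sample_requests = [
    Request("checkout", 200, 120.0),
    Request("checkout", 200, 900.0),  # too slow
    Request("checkout", 503, 80.0),   # server error
    Request("checkout", 201, 300.0),
    Request("search", 200, 50.0),     # different journey, not counted
]

sli = journey_sli(sample_requests, "checkout", max_latency_ms=500.0)
if sli is not None:
    print(f"checkout SLI: {sli:.1%}")
```

Which requests count as valid and what counts as good are exactly the decisions the workshop asks teams to make from the user's point of view, so the filters above should be read as placeholders.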
Iteration and feedback loops
Our curriculum helps teams get started on setting up the SLIs and SLOs most appropriate to their organization’s business and services. More importantly, it provides teams with the capability to periodically revisit and adjust those SLIs and SLOs in order to keep them updated and relevant so they evolve with the business.
VMware CRE invests in continuously improving the curriculum as well. We send out surveys in order to determine ways to improve it for the next engagement, and generate additional tools and techniques while improving existing materials based on feedback from participants. In each case, the CRE team initially iterates with “internal customers” within the Tanzu business unit, then makes the resulting exercises available, both directly to VMware Tanzu customers and in partnership with our services teams.
[Figure: An example of output from a recent workshop]
As a team within the VMware Tanzu product organization, VMware CRE is positioned to take feedback from customer engagements and contribute it to product feature work. The result: a virtuous cycle of continuous improvement.
VMware CRE is committed to evolving our SRE education, training, and enablement tools and techniques. We’ll continue to share updates about them here.
Is your team interested in SRE workshops or other reliability engineering topics? Our brief on the value of CRE is a great source of more information. And of course you can always reach out to a VMware account team directly to apply to the CRE program.
VMware CRE is a team of site reliability engineers and program managers who work together with Tanzu customers and partner teams to learn and apply reliability engineering practices using our Tanzu portfolio of services. As part of our product engineering organization, VMware CRE is responsible for some reliability engineering-related features for Tanzu. We are also in the escalation path of our technical support teams, tasked with helping our customers meet their reliability goals.