Modern SRE Practices for Incident Management

September 23, 2021 Mary Chen

At VMware, we make use of modern development and site reliability engineering (SRE) practices on a regular basis. And those of us who work on the VMware Tanzu Observability product marketing team regularly get exposure to various SRE teams that implement modern practices with the observability technology we create. With that in mind, I sat down with Feargus O’Gorman, a member of the VMware Tanzu SRE team, to learn how he supports VMware Tanzu Mission Control’s services efficiently and at scale. 

For those of you who are not familiar with Tanzu Mission Control, it’s VMware’s central management platform for consistently operating and securing your multicluster Kubernetes infrastructure and modern applications across multiple teams and clouds. Tanzu Observability is the platform Feargus and the rest of his SRE team use to observe the health and performance of Tanzu Mission Control services and alert developers, and each other, about any incidents impacting the customer experience.

Mary: Feargus, please introduce yourself and your team. 

Feargus: I’m part of the SRE team in the Tanzu organization within VMware. My team’s focus is to help drive SRE best practices, particularly around service-level indicator (SLI) and service-level objective (SLO) definitions, and use them to drive behavioral changes within the team. A part of my role is on-call incident management for our Tanzu Mission Control SaaS offering. 

We’re always looking for efficient ways to solve service performance issues before they impact our customers. 

Incident management  

Mary: I can imagine how stressful it can be when you’re woken up at 3 a.m. to fix a customer-impacting issue. How do you approach monitoring and managing your on-call experience? 

Feargus: When we have a customer-impacting issue, or any issue, we create an incident in GitLab and track the state of the incident there. It’s the central place for sharing what steps have been taken to remediate the issue, including screenshots such as [the one below], to save people’s time when they’re looking at the issue. 

From a response perspective, I don’t get bogged down in the details. It’s not the numbers I care about; it’s more the trend and correlations that are helpful for me.  

For one of our recent incidents, this view in Tanzu Observability distributed tracing was very useful for seeing the spikes in error rates and a corresponding spike in request rates. I could quickly determine the issue with one of our services and include a snapshot in the incident ticket. 

We also include links to charts so people can alter their query or see if the issue correlates to something else. 

When I’m working on an incident, a lot of information can come at me from different places—alerts going off, and people messaging me on Slack. On top of that, I need to provide an external status update. Having a single, concise view into the system relieves some of the panic or intensity that can go into an incident response. With the tracing visualizations in Tanzu Observability, I can go on-call with peace of mind. 

Mary: Peace of mind is probably the best way to quantify the benefit of using observability technology. Once you’ve narrowed down where the issue is, what does the handoff to the service team or developer look like? 

Feargus: The service developer I page needs to know exactly where the problem lies, so it’s critical that I have the right tool to perform my job. 

Early in a recent incident, while determining where the issue lay within the service, I could see that a particular endpoint was encountering the error. By drilling into that endpoint, I could see that it was exclusively the agent gateway that was encountering these issues. And once we identified exactly where the issue was, we let the developer for that service know about it.

From there, the developer would look at logs to understand what was going on. Some service developers may use [the view below] in Tanzu Observability distributed tracing to see where things are slowing down if we’re seeing consistent slow requests across certain endpoints.  

Mary: Once the service team is notified, is your work done? 

Feargus: No. Our SRE team’s role is “incident command.” When we get alerted or paged for incidents, we do basic triaging before calling in the engineers responsible for that service. It’s the engineer’s job to troubleshoot the issue and bring the service back up and running. We, the SRE team, still maintain our role as commander to ensure the right people are on the response team, keep things moving along, and guide some of the decisions. And we’re also responsible for producing a status report or updating the external status page for customers while tracking the fixes in our GitLab incident.

Tools

Mary: Before using Tanzu Observability distributed tracing capabilities, how did you manage on-call? Does your team use other tools to reduce MTTR? 

Feargus: Before using tracing, it was ad-hoc dashboarding or on-the-spot writing of queries to call back the metrics I needed, or just running the top command on Kubernetes clusters to get the information I needed. 

Depending on the issue, I find distributed tracing to be a valuable tool for troubleshooting distributed systems. Distributed tracing in Tanzu Observability provides a nice window and collates all the relevant information into one location—that’s why I’m using it for my on-call responsibilities.

We have Prometheus running in each of our clusters, and all the metrics from Prometheus get pushed to Tanzu Observability.  

Mary: Is it standard practice to use Tanzu Observability for incident management? 

Feargus: There’s no standard practice for the team to use Tanzu Observability for incident management right now, but it will be widely used. It’s up to the engineers to decide what they want to use for debugging. Sometimes some folks rely more on Prometheus because of their familiarity with PromQL; some folks rely on Tanzu Observability; and some use CLI investigations. Our service developers use a variety of logging tools for searching through database logs and service-level logs. However, we’re using Tanzu Observability widely within the team for defining service-level indicators with service teams.  

Mary: You might be surprised to hear this, but Tanzu Observability complements open source tools such as Prometheus. Last year, we added support for PromQL in our UI so developers and Kubernetes operators can leverage their Prometheus query skills to create dashboards and alerts in Tanzu Observability. And we recently added support for the PromQL HTTP API, which lets you use Tanzu Observability as a drop-in replacement for Prometheus data sources without needing to change your existing tools.  
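For readers who have not used the PromQL HTTP API, here is a minimal sketch of what an instant query request looks like. The path /api/v1/query and the query and time parameters are those of the open source Prometheus HTTP API. The base URL and the helper name build_instant_query_url are hypothetical placeholders, and whether a given Tanzu Observability endpoint is reached at the same path is an assumption to check against its documentation.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url, promql, time=None):
    """Build the URL for a Prometheus-style instant query.

    base_url is a placeholder for any Prometheus-compatible data source.
    """
    params = {"query": promql}
    if time is not None:
        # Optional evaluation timestamp (RFC 3339 string or Unix seconds).
        params["time"] = time
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host; the PromQL expression is only an illustration.
    print(build_instant_query_url(
        "https://prometheus.example.invalid",
        "sum(rate(http_requests_total[5m]))",
    ))
```

Because only the base URL changes, a tool that already issues requests like this one can in principle be pointed at a different Prometheus-compatible backend without rewriting its queries.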

Feargus: Awesome. I think our service developers will find Prometheus support in Tanzu Observability very useful. 

Service levels 

Mary: Another part of your role is to meet with service teams responsible for all the services within Tanzu Mission Control and help them define SLIs and adopt their usage. Can you share what the process looks like? 

Feargus: The SLIs we’ve crafted for each service capture the real user impact. That’s what we care about. I rely on our SLI definitions to determine the health of our systems. The more generic view of error rates doesn’t necessarily represent the actual end user pain. Some errors may be correct behavior of the service.
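To make the distinction between a generic error rate and a user-focused SLI concrete, here is a minimal sketch. It is not the team's actual SLI definition: the Request type, the choice of 404 as an "expected" status, and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    endpoint: str
    status: int


# Assumption: for this hypothetical service, a 404 is correct behavior
# (the caller asked for something that does not exist), not user pain.
EXPECTED_STATUSES = frozenset({404})


def generic_error_rate(requests):
    """Fraction of requests with any 4xx/5xx status."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 400)
    return errors / len(requests)


def sli_good_ratio(requests, expected=EXPECTED_STATUSES):
    """Fraction of requests that were good from the user's point of view.

    Error statuses listed in `expected` count as good, because the
    service behaved correctly when it returned them.
    """
    if not requests:
        return 1.0
    bad = sum(1 for r in requests if r.status >= 400 and r.status not in expected)
    return (len(requests) - bad) / len(requests)


if __name__ == "__main__":
    sample = (
        [Request("/clusters", 200)] * 90
        + [Request("/clusters/missing", 404)] * 8
        + [Request("/clusters", 500)] * 2
    )
    print(generic_error_rate(sample))  # counts the 404s as errors
    print(sli_good_ratio(sample))      # only the 500s reduce the SLI
```

In this sample the generic view reports a 10 percent error rate, while the SLI shows that only 2 percent of requests actually failed the user.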

For example, when working with the inspection service team for defining an SLI for the service, [the screenshot below] is a cool view to see how the service interacts with other services. Looking at the App Map, you quickly understand that all requests are coming through the API gateway—our real user gateway—and it reaches out to three other services. It provides a common view of the system when working with the service team and trying to define the key workflows that we want to capture with an SLI.  

Defining custom, sensible SLIs for each service is a slow, ongoing process; we have to work with each service team to understand the key things their consumers or users care about. But it’s worth the investment. A key reason for taking a broader approach to defining SLIs for each service is moving to SLO-based alerting. With SLO-based alerting in place, if one of the services has issues that aren’t impacting the consumer of this service, then the developer will not be paged to get out of bed at 3 a.m. to fix it.
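One common way to implement SLO-based alerting is to page on the error budget burn rate rather than on raw errors. The sketch below illustrates that general technique under stated assumptions; the interview does not say which method the team uses, and the 99.9 percent target, the 14.4 threshold, and the function names are illustrative choices only.

```python
def burn_rate(good, total, slo_target):
    """How fast the error budget is being consumed in a window.

    1.0 means the budget is being spent exactly at the rate the SLO
    allows; higher values mean it will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, so no budget consumed
    error_ratio = (total - good) / total
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(good, total, slo_target=0.999, threshold=14.4):
    """Page only when users are losing budget fast enough to matter.

    `good` and `total` come from the SLI, so failures that do not
    affect consumers never reach this check. Both defaults are
    illustrative assumptions.
    """
    return burn_rate(good, total, slo_target) >= threshold


if __name__ == "__main__":
    # A small blip within budget: nobody is woken up.
    print(should_page(good=9_990, total=10_000))
    # A sustained user-facing failure: page the developer.
    print(should_page(good=9_800, total=10_000))
```

Because the inputs are SLI counts, an internal problem that never degrades the consumer's experience leaves the burn rate unchanged and does not trigger a page.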

Mary: With SLO-based alerting, we’re hoping developers will enjoy more uninterrupted sleep time. But getting smart about incident alerts, detection, management and prevention is not easy. Our new eBook, Establishing an SRE-based Incident Lifecycle Program, will help teams better define specific key terms and customer requirements, and blend together incident prevention and detection. I’m also looking forward to hosting the SLO-based Alerting with Tanzu Observability webinar on Nov. 4 with you to share more about your team’s progress and what you’ve learned along the way. 

Feargus, this is great. Thank you for sharing your SRE approach.  

Feargus: Thank you. See you at the webinar in November. 

About the Author

Mary Chen

Mary is a senior product marketing manager in VMware’s Modern Apps Platform business unit, where she is responsible for helping customers succeed in modern DevOps environments using VMware Tanzu Observability. Prior to VMware, she was a senior industry marketing manager at Splunk, working at the intersection of retail and marketing. And before that she worked in various roles to help bring data analytics, security, and network management solutions to market across industries.
