Modern SRE Practices for Incident Management

September 23, 2021 Mary Chen

At VMware, we make use of modern development and site reliability engineering (SRE) practices on a regular basis. And those of us who work on the VMware Tanzu Observability product marketing team regularly get exposure to various SRE teams that implement modern practices with the observability technology we create. With that in mind, I sat down with Feargus O’Gorman, a member of the VMware Tanzu SRE team, to learn how he supports VMware Tanzu Mission Control’s services efficiently and at scale.

For those of you who are not familiar with Tanzu Mission Control, it’s VMware’s central management platform for consistently operating and securing your multicluster Kubernetes infrastructure and modern applications across multiple teams and clouds. Tanzu Observability is the platform Feargus and the rest of his SRE team use to observe the health and performance of Tanzu Mission Control services and to alert developers, and each other, about any incidents impacting the customer experience.

Mary: Feargus, please introduce yourself and your team.

Feargus: I’m part of the SRE team in the Tanzu organization within VMware. My team’s focus is to help drive SRE best practices, particularly around service-level indicator (SLI) and service-level objective (SLO) definitions, and use them to drive behavioral changes within the team. A part of my role is on-call incident management for our Tanzu Mission Control SaaS offering.

We’re always looking for efficient ways to solve service performance issues before they impact our customers.

Incident management

Mary: I can imagine how stressful it can be when you’re woken up at 3 a.m. to fix a customer-impacting issue. How do you approach monitoring and managing your on-call experience?

Feargus: When we have a customer-impacting issue, or any issue, we create an incident in GitLab and track the state of the incident there. It’s the central place for sharing what steps have been taken to remediate the issue, including screenshots such as [the one below], to save people’s time when they’re looking at the issue.

From a response perspective, I don’t get bogged down in the details. It’s not the numbers I care about; it’s more the trend and correlations that are helpful for me.

For one of our recent incidents, this view in Tanzu Observability distributed tracing was very useful for seeing the spikes in error rates and a corresponding spike in request rates. I could quickly determine the issue with one of our services and include a snapshot in the incident ticket.

We also include links to charts so people can alter their query or see if the issue correlates to something else.

When I’m working on an incident, a lot of information can come at me from different places—alerts going off, and people messaging me on Slack. On top of that, I need to provide an external status update. Having a single, concise view into the system relieves some of the panic or intensity that can go into an incident response. With the tracing visualizations in Tanzu Observability, I can go on-call with peace of mind.

Mary: Peace of mind is probably the best way to quantify the benefit of using observability technology. Once you’ve narrowed down where the issue is, what does the handoff to the service team or developer look like?

Feargus: The service developer I page needs to know exactly where the problem lies, so it’s critical that I have the right tool to perform my job.

Early in a recent incident, while determining where the issue lay within the service, I could see that a particular endpoint was encountering the error. By drilling into that endpoint, I could see that it was exclusively the agent gateway that was encountering these issues. And once we identified exactly where the issue was, we let the developer for that service know about it.

From there, the developer would look at logs to understand what was going on. Some service developers may use [the view below] in Tanzu Observability distributed tracing to see where things are slowing down if we’re seeing consistent slow requests across certain endpoints.

Mary: Once the service team is notified, is your work done?

Feargus: No. Our SRE team’s role is “incident command.” When we get alerted or paged for incidents, we do basic triaging before calling in the engineers responsible for that service. It’s the engineer’s job to troubleshoot the issue, to bring the service back up and running. We, the SRE team, still maintain our role as commander to ensure the right people are on the response team, to ensure things are moving along, to guide some of the decisions. And we’re also responsible for producing a status report or updating the external status page for customers while tracking the fixes in our GitLab incident.

Tools

Mary: Before using Tanzu Observability distributed tracing capabilities, how did you manage on-call? Does your team use other tools to reduce MTTR?

Feargus: Before using tracing, it was ad-hoc dashboarding or on-the-spot writing of queries to call back the metrics I needed, or just running the top command on Kubernetes clusters to get the information I needed.

Depending on the issue, I find distributed tracing to be a valuable tool for troubleshooting distributed systems. Distributed tracing in Tanzu Observability provides a nice window and collates all the relevant information into one location—that’s why I’m using it for my on-call responsibilities.

We have Prometheus running in each of our clusters, and all the metrics from Prometheus get pushed to Tanzu Observability.

Mary: Is it standard practice to use Tanzu Observability for incident management?

Feargus: There’s no standard practice for the team to use Tanzu Observability for incident management right now, but it will be widely used. It’s up to the engineers to decide what they want to use for debugging. Sometimes some folks rely more on Prometheus because of their familiarity with PromQL; some folks rely on Tanzu Observability; and some use CLI investigations. Our service developers use a variety of logging tools for searching through database logs and service-level logs. However, we’re using Tanzu Observability widely within the team for defining service-level indicators with service teams.

Mary: You might be surprised to hear this, but Tanzu Observability complements open source tools such as Prometheus. Last year, we added support for PromQL in our UI so developers and Kubernetes operators can leverage their Prometheus query skills to create dashboards and alerts in Tanzu Observability. And we recently added support for the PromQL HTTP API, which lets you use Tanzu Observability as a drop-in replacement for Prometheus data sources without needing to change your existing tools.
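
As a rough illustration of what "drop-in replacement for Prometheus data sources" can mean in practice, the sketch below builds the URL for a Prometheus-style instant query (`GET /api/v1/query`, the path defined by the Prometheus HTTP API). The base URL and the metric name are hypothetical placeholders, and the exact path prefix that Tanzu Observability exposes is not stated in this interview, so treat it as an assumption to check against the product documentation.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus-style instant query.

    The /api/v1/query path comes from the Prometheus HTTP API; a compatible
    backend may mount it under a different prefix (assumption).
    """
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


# Hypothetical endpoint and metric name, for illustration only.
PROMQL = 'sum(rate(http_requests_total{code=~"5.."}[5m]))'
url = instant_query_url("https://observability.example.com/", PROMQL)
print(url)
```

Because only the base URL changes, an existing tool that already speaks this API would, in principle, need a configuration change rather than a rewrite of its queries.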

Feargus: Awesome. I think our service developers will find Prometheus support in Tanzu Observability very useful.

Service levels

Mary: Another part of your role is to meet with service teams responsible for all the services within Tanzu Mission Control and help them define SLIs and adopt their usage. Can you share what the process looks like?

Feargus: The SLIs we’ve crafted for each service capture the real user impact. That’s what we care about. I rely on our SLI definitions to determine the health of our systems. The more generic view of error rates doesn’t necessarily represent the actual end user pain. Some errors may be correct behavior of the service.
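
To make the difference between a generic error rate and a user-focused SLI concrete, here is a minimal sketch. It assumes, purely for illustration, that client errors (4xx) are correct service behavior and that only server errors (5xx) represent user pain; the `Request` type and the status-code rule are hypothetical and are not the definitions the Tanzu SRE team uses.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    endpoint: str
    status: int  # HTTP status code


def raw_error_rate(requests: Sequence[Request]) -> Optional[float]:
    """Generic view: every 4xx or 5xx response counts as an error."""
    if not requests:
        return None
    return sum(1 for r in requests if r.status >= 400) / len(requests)


def availability_sli(requests: Sequence[Request]) -> Optional[float]:
    """User-impact view: only 5xx responses count against the service.

    Treating 4xx as "good" is an assumption; a real SLI would be agreed
    with the service team per workflow.
    """
    if not requests:
        return None
    return sum(1 for r in requests if r.status < 500) / len(requests)


sample = (
    [Request("/v1/inspections", 200)] * 7
    + [Request("/v1/inspections", 404)] * 2  # e.g. lookups of missing items
    + [Request("/v1/inspections", 503)]
)
print(raw_error_rate(sample), availability_sli(sample))
```

In this sample the generic error rate reports 30 percent errors, while the SLI shows that 90 percent of requests were served correctly, which is closer to what users actually experienced.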

For example, when working with the inspection service team for defining an SLI for the service, [the screenshot below] is a cool view to see how the service interacts with other services. Looking at the App Map, you quickly understand that all requests are coming through the API gateway—our real user gateway—and it reaches out to three other services. It provides a common view of the system when working with the service team and trying to define the key workflows that we want to capture with an SLI.

Defining custom, sensible SLIs for each service is a slow, ongoing process; we have to work with each service team to understand the key things their consumers or users care about. But it’s worth the investment. A key reason for taking a broader approach to defining SLIs for each service is moving to SLO-based alerting. With SLO-based alerting in place, if one of the services has issues that aren’t impacting its consumers, the developer will not be paged out of bed at 3 a.m. to fix them.
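
One common way to implement SLO-based alerting is a multi-window burn-rate check, sketched below. This is not necessarily how the Tanzu SRE team does it: the 99.9 percent target, the 14.4 threshold (a value often cited for a one-hour window paired with a five-minute window), and the function names are all assumptions for illustration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A burn rate of 1.0 means the budget would be used up exactly at the end
    of the SLO period; higher values exhaust it sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float = 0.999,  # assumed SLO
    threshold: float = 14.4,    # assumed burn-rate threshold
) -> bool:
    """Page only if both windows burn budget faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# Error ratios here are SLI-based: the fraction of user-impacting failures.
print(should_page(0.02, 0.03))      # sustained, user-visible failures
print(should_page(0.0005, 0.0005))  # errors well within the budget
print(should_page(0.02, 0.0001))    # earlier spike that has since recovered
```

Because the inputs are ratios of user-impacting failures rather than raw error counts, a service that misbehaves without hurting its consumers never reaches the threshold, and nobody gets paged.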

Mary: With SLO-based alerting, we’re hoping developers will enjoy more uninterrupted sleep time. But getting smart about incident alerts, detection, management and prevention is not easy. Our new eBook, Establishing an SRE-based Incident Lifecycle Program, will help teams better define specific key terms and customer requirements, and blend together incident prevention and detection. I’m also looking forward to hosting the SLO-based Alerting with Tanzu Observability webinar on Nov. 4 with you to share more about your team’s progress and what you’ve learned along the way.

Feargus, this is great. Thank you for sharing your SRE approach.

Feargus: Thank you. See you at the webinar in November.

About the Author

Mary is a senior product marketing manager in VMware’s Modern Apps Platform business unit, where she is responsible for helping customers succeed in modern DevOps environments using VMware Tanzu Observability. Prior to VMware, she was a senior industry marketing manager at Splunk, working at the intersection of retail and marketing. And before that she worked in various roles to help bring data analytics, security, and network management solutions to market across industries.