A Service-Level What?

August 27, 2020 Corey Innis

For any given service, whether provided by software (e.g., application, platform), hardware (e.g., infrastructure), or humans (e.g., delivery, support, documentation), there is a level of reliability required to achieve user satisfaction.

Users, from end users of web or mobile applications to developers who build on a platform, want to use a service’s features, but they care even more about its reliability. After all, if the service is not working, they cannot make use of those features. Indeed, companies with unreliable systems suffer consequences. For that reason, we consider reliability to be a core product feature.

The journey toward embracing reliability as a feature and achieving reliability targets requires more than simply scaling up or out aspects of a service. It begins by establishing meaningful and reasonable objectives, then adopting the tools and techniques to achieve them. It begins, in other words, with service-level objectives and service-level indicators.

Which service level?

Service-level objectives (SLOs) and service-level indicators (SLIs) are a set of practices for applying a product mindset and an economic model to service operations. Together, they are the foundation of reliability engineering.

SLOs are a threshold, a quantifiable target for a system’s behavior and the answer to the question, “How reliable do I want my service to be?” We can consider these to be representative of user expectations.

SLIs are a metric, a measure of a system’s existing behavior and the answer to the question, “How is my service performing at this point in time?” We can consider these to be representative of user experience.
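To make the distinction concrete, here is a minimal sketch. It assumes an availability SLI defined as the fraction of successful requests, which is a common choice but not one this article prescribes. The function names and the request counts are hypothetical and purely illustrative.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Return the measured fraction of good events (the SLI)."""
    if total_events == 0:
        # Assumption: with no traffic there is nothing to have failed.
        return 1.0
    return good_events / total_events


def meets_slo(sli: float, slo_target: float) -> bool:
    """Compare the measurement (SLI) against the target (SLO)."""
    return sli >= slo_target


# Hypothetical numbers: 100,000 requests, 50 of which failed.
SLO_TARGET = 0.999  # "three nines"
sli = availability_sli(good_events=99_950, total_events=100_000)
print(f"SLI = {sli:.4%}, meets SLO: {meets_slo(sli, SLO_TARGET)}")
```

The SLI is the number we observe; the SLO is the number we compare it to.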

So what about SLAs?

Service-level agreements (SLAs) are agreements between a service provider and a consumer that are contractual and binding. When SLAs are not met, financial or legal consequences apply. The key distinction between SLAs and SLOs is that agreements suggest retroactive penalties whereas objectives suggest proactive behavior (e.g., “Let’s fix this before our users get upset”).

SLAs often get confused with SLOs and SLIs. There are some great resources to help disambiguate the terms, but here’s another attempt, courtesy of my colleague Aram Price:

Suppose you’ve been on holiday and are heading to the rental agency to return the car you’ve been using on the trip. You remember that you signed a rental contract stipulating that you’ll return the car with the fuel tank at least three-quarters full or be charged a penalty.

Think of that contract as the SLA you have with the agency and the needle on the fuel gauge the SLI. Finally—and this illustrates something important—your SLO is that three-quarters of a tank. Any less, and the penalty kicks in; any more, and you’re simply wasting your money.

Why a service level?

We’ve established that reliability must be a core feature of our service. That means any change to the service introduces the possibility that reliability will be adversely impacted. We need a way to balance change with reliability. SLOs provide that way.

The cost associated with achieving greater reliability is exponential. Additionally, for almost any given service, as reliability approaches 100 percent, the likelihood that anybody notices drops substantially. The likelihood that anybody cares drops even faster.

Like the three-quarters-of-a-tank fuel target, there is a point at which any additional investment on our part does not result in user benefit or business value. Meanwhile, the investment we’ve made in order to achieve greater (yet unnoticed) reliability is no longer available for other work. So there’s an opportunity cost. It’s a bad investment.

An SLO should be considered as both an upper and a lower bound.

How to apply a service level

We tend to refer to SLOs in terms of “number of nines.” A target of 99.9 percent uptime can be referred to as “three nines.” When applied to a window of time, a reliability SLO translates to what’s known as an “error budget,” that is, an allowable amount of downtime or rate of failure. For example, an SLO of “three nines over a rolling 28-day window” means we have an error budget of approximately 40 minutes (0.1 percent of the 40,320 minutes in 28 days).
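The arithmetic behind that figure can be sketched in a few lines. This is an illustration only, assuming a time-based (downtime) budget; the function name `error_budget` is hypothetical.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowable downtime for an SLO target over a time window."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return window * (1.0 - slo_target)


window = timedelta(days=28)
for label, target in [("two nines", 0.99), ("three nines", 0.999), ("four nines", 0.9999)]:
    minutes = error_budget(target, window).total_seconds() / 60
    print(f"{label} ({target:.2%}): {minutes:.2f} minutes of budget")
```

For three nines over 28 days this works out to 40.32 minutes, matching the approximate figure above. Each additional nine cuts the budget by a factor of ten.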

As long as we’re achieving our objective, we have remaining budget, so we move fast and invest in new product and service capabilities. It’s only when we’re approaching the SLO threshold, and our error budget is approaching zero, that we slow down and redirect resources toward greater reliability.
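One way to express that decision is as a small policy over the remaining budget. The sketch below assumes a request-based budget. The 25 percent warning threshold and the function names are hypothetical choices for illustration; a real SLO miss policy is something each team agrees on for itself.

```python
def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # No failures are allowed: the budget is either intact or gone.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


def policy(remaining: float, warn_below: float = 0.25) -> str:
    """Map remaining budget to an action. warn_below is a hypothetical threshold."""
    if remaining <= 0:
        return "focus on reliability"
    if remaining < warn_below:
        return "slow down"
    return "ship features"


# Hypothetical window: 1,000,000 requests against a three-nines SLO,
# so roughly 1,000 failures are allowed; 600 have occurred.
remaining = budget_remaining(0.999, good_events=999_400, total_events=1_000_000)
print(f"{remaining:.0%} of budget left -> {policy(remaining)}")
```

The same shape works for a time-based budget by substituting minutes of downtime for failed requests.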

Now that we understand some of the key reliability engineering vernacular and why we should care, how do we begin to put this knowledge to use? A great way to get started is with an SLI/SLO workshop with your team and stakeholders. A workshop can be used to introduce these topics in detail, generate your own SLI and SLO definitions, and establish ways in which the team can use error budgets and SLO miss policies to balance development and operational velocity with reliability.

Interested? VMware facilitates SLI/SLO workshops with our customers to help them get started. If you’re ready to get started on your journey with the help of experts who are always within reach, read more about the offerings in our Tanzu portfolio and reach out to a sales representative today.
