Assessing Reliability Risks on Kubernetes Clusters

Peter Grant, Kalai Wei, Gustavo Franco, Corey Innis, and Alexandra McCoy contributed to this post.

The VMware Customer Reliability Engineering (CRE) team is proud to announce an open source Reliability Scanner for Kubernetes! It includes an extensible set of reliability assessments, or checks, performed against various components of a cluster, such as Pods, Namespaces, and Services. Operators can then configure appropriate constraints for the checks on their clusters.

Here is a rundown of the initial set of checks available, with additional checks and features forthcoming.

Probes

Probes are periodic checks against the containers that run our services; they tell the kubelet whether a container is alive and ready to accept traffic. Probes help Kubernetes make more informed decisions about the current status of the Pods behind a Service. Defining this contract between the container and the kubelet ensures that if a subset of the Pods behind a Service becomes unhealthy, Kubernetes can react quickly: it restarts containers that fail their liveness probe and stops routing Service traffic to Pods that fail their readiness probe.

There are three kinds of probes: startupProbe, which confirms the application within the container has started; livenessProbe, which confirms the container is still in a running state; and readinessProbe, which confirms the container is ready to respond to requests. Our checks allow the operator to report on any combination of the three probes.
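To make the idea concrete, here is a minimal sketch of what a probe check might look like. It is not the Reliability Scanner's actual implementation: the function name `missing_probes`, the sample Pod, and its image names are all hypothetical, and the Pod is represented as a plain dictionary shaped like a Pod manifest.

```python
# Hypothetical sketch of a probe check; not the scanner's real code.
PROBE_FIELDS = ("startupProbe", "livenessProbe", "readinessProbe")


def missing_probes(pod, required=PROBE_FIELDS):
    """Return {container name: [probe fields it does not define]}."""
    missing = {}
    for container in pod.get("spec", {}).get("containers", []):
        absent = [field for field in required if field not in container]
        if absent:
            missing[container.get("name", "<unnamed>")] = absent
    return missing


# A made-up Pod manifest with one well-probed container and one bare sidecar.
pod = {
    "metadata": {"name": "web", "namespace": "default"},
    "spec": {
        "containers": [
            {
                "name": "app",
                "image": "example/app:1.0",
                "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
                "readinessProbe": {"httpGet": {"path": "/ready", "port": 8080}},
            },
            {"name": "sidecar", "image": "example/sidecar:1.0"},
        ]
    },
}

# Report only on the combination of probes we care about.
print(missing_probes(pod, required=("livenessProbe", "readinessProbe")))
# {'sidecar': ['livenessProbe', 'readinessProbe']}
```

Passing a different `required` tuple is how this sketch models reporting on any combination of the three probes.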

Owner annotations

Kubernetes has become one of the most commonly used platforms for multitenant deployments. And when there are multiple tenants, we have found it to be good practice to annotate services with an owner, as it helps with locating a point of contact for a particular service. If your incident management workflow allows for it, such annotation also provides the ability to route events or alerts to respective owners.
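As a rough illustration, an owner check can be as simple as looking for an agreed annotation key on each object. The sketch below assumes a made-up key, `example.com/owner`, and made-up Namespaces; the key your team standardizes on will differ, and the Reliability Scanner's own check may work differently.

```python
# Hypothetical annotation key; choose whatever convention suits your teams.
OWNER_KEY = "example.com/owner"


def owner_of(obj, key=OWNER_KEY):
    """Return the owner annotation value, or None if it is absent or blank."""
    annotations = obj.get("metadata", {}).get("annotations") or {}
    owner = (annotations.get(key) or "").strip()
    return owner or None


def unowned(objects, key=OWNER_KEY):
    """Return the names of objects that carry no usable owner annotation."""
    return [o["metadata"]["name"] for o in objects if owner_of(o, key) is None]


# Made-up Namespace manifests, reduced to their metadata.
namespaces = [
    {"metadata": {"name": "payments",
                  "annotations": {OWNER_KEY: "team-payments@example.com"}}},
    {"metadata": {"name": "scratch"}},
    {"metadata": {"name": "legacy", "annotations": {OWNER_KEY: "  "}}},
]

print(unowned(namespaces))  # ['scratch', 'legacy']
```

Treating a blank value as missing is a design choice in this sketch: an annotation that exists but names nobody is no help during an incident.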

Minimum desired quality of service

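Kubernetes assigns each Pod a quality of service (QoS) class based on its containers' resource requests and limits. The requests inform scheduling decisions, and the QoS class informs which Pods are evicted first when a Node runs short of resources. There are three classes of QoS in Kubernetes: BestEffort, Burstable, and Guaranteed.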

BestEffort

Kubernetes will schedule a Pod on any Node that has free resources. This is the default if none of the containers have CPU or memory requests and limits.

Burstable

Kubernetes will ensure that the Node has enough resources, as requested by the container, for the container to run. The container may exceed (“burst” beyond) these initial resource definitions.

Guaranteed

Kubernetes will reserve the resources defined by the container for use exclusively by that container.

As you will have noticed, the upfront reservation on the scheduled Node depends on how we define the resources our application requires. This may not always be the best approach if an application's requirements vary over time (we wouldn’t want to waste resources unnecessarily, for example), but defining QoS for our containers helps Kubernetes determine the optimal running conditions for a workload. It also helps ensure that the Node has or reserves enough resources when a container is scheduled, allowing the application to operate more efficiently.
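The way a Pod ends up in one of the three classes can be sketched in a few lines. This is a simplified model rather than Kubernetes' own code: the function name `qos_class` is ours, only regular containers are considered (init containers are ignored), and quantities are compared as plain strings, so `500m` and `0.5` would be treated as different values here.

```python
RESOURCES = ("cpu", "memory")


def qos_class(pod):
    """Classify a Pod manifest dict as BestEffort, Burstable, or Guaranteed."""
    containers = pod.get("spec", {}).get("containers", [])
    any_set = False                    # does any container set a request or limit?
    all_guaranteed = bool(containers)  # does every container have requests == limits?
    for container in containers:
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}
        for name in RESOURCES:
            limit = limits.get(name)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(name, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


def pod_with(resources):
    """Build a made-up single-container Pod with the given resources block."""
    return {"spec": {"containers": [{"name": "app", "resources": resources}]}}


print(qos_class(pod_with({})))                                             # BestEffort
print(qos_class(pod_with({"requests": {"cpu": "100m"}})))                  # Burstable
print(qos_class(pod_with({"limits": {"cpu": "500m", "memory": "128Mi"}}))) # Guaranteed
```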

These reliability checks can be applied in one of two ways: either by using the Reliability Scanner Sonobuoy Plugin, or via Tanzu Mission Control, to enable reliability policies to be applied and managed as code. We’ll discuss these approaches in more detail in the following sections.

Establishing reliability-as-code via Sonobuoy

Kubernetes brings together many capabilities that can assist in the delivery of software. The patterns for deploying and operating applications on Kubernetes allow teams to build and manage their services to suit their specific needs.

With such a diverse ecosystem of included functionality, add-ons, and custom extensions, Kubernetes clusters (or objects defined within them) can often become unwieldy to manage.

Kubernetes environments can also quickly become quite complex, given the various tenants, systems, applications, and configurations they feature.

So with all these components competing for resources, it is easy to overlook some of the built-in conventions that can help us be more deterministic about how our services may react to both internal and external events.

Reliability Scanner

The Reliability Scanner runs as a standalone container that serves as a Sonobuoy plugin and uses a configuration file to define customized checks. Let’s take a look at one of the checks to see how the scanner works.

We define the check as follows:

This check may be included in our Reliability Scanner configuration file. With it, we look across the cluster and report on the current state of each Pod, so that we can identify workloads that do not meet our minimum QoS class.

If we create a Pod with a guaranteed QoS class and run our scan, we should see it reflected in the report.

Let’s run the Reliability Scanner to see how the Pods in our cluster are currently configured.

Based on our defined check, we have two configuration options: `minimum_desired_qos_class` and `include_detail`. The first tells the scanner to fail any Pod that does not meet the minimum QoS class defined here. The second allows the report runner to return the current QoS class of each Pod being assessed.
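The effect of the two options can be sketched as follows. This is an illustration under assumptions, not the scanner's source: we assume the classes are ranked BestEffort, then Burstable, then Guaranteed, and the function name `check_pod`, the result field names, and the Pod `default/other` are hypothetical.

```python
# Assumed ordering of QoS classes, lowest to highest.
QOS_RANK = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}


def check_pod(name, qos_class, minimum_desired_qos_class, include_detail=False):
    """Pass the Pod if its QoS class is at least the configured minimum."""
    passed = QOS_RANK[qos_class] >= QOS_RANK[minimum_desired_qos_class]
    result = {"name": name, "status": "passed" if passed else "failed"}
    if include_detail:
        # Mirror the include_detail option: report the class we observed.
        result["details"] = {"qos_class": qos_class}
    return result


print(check_pod("default/test", "Guaranteed", "Guaranteed", include_detail=True))
# {'name': 'default/test', 'status': 'passed', 'details': {'qos_class': 'Guaranteed'}}
print(check_pod("default/other", "Burstable", "Guaranteed", include_detail=True))
# {'name': 'default/other', 'status': 'failed', 'details': {'qos_class': 'Burstable'}}
```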

Let’s review an excerpt from our report to see how our scan went.

We can see that, although our report is showing a failed status (as not all of the Pods in the cluster meet the minimum desired QoS class), our guaranteed Pod, which we created earlier (`default/test`), has passed the check.
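The relationship between the overall status and the per-Pod results can be modeled with a small roll-up. The item shape and the second Pod name below are assumptions made for illustration; the real report format may differ.

```python
def overall_status(items):
    """The report passes only if every assessed item passed."""
    return "passed" if all(item["status"] == "passed" for item in items) else "failed"


def failing(items):
    """Names of the items an operator would need to follow up on."""
    return [item["name"] for item in items if item["status"] != "passed"]


# Hypothetical per-Pod results: our guaranteed Pod plus one that falls short.
items = [
    {"name": "default/test", "status": "passed"},
    {"name": "default/other", "status": "failed"},
]

print(overall_status(items))  # failed
print(failing(items))         # ['default/other']
```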

Using this Reliability Scanner within a cluster is an easy way for cluster operators to identify any workloads or configurations that do not meet requirements and report them.

We hope that in time we are able to build sets of checks for multiple concerns, so we welcome feedback on good practices for operating workloads atop Kubernetes, as well as feedback from the broader community about the Reliability Scanner itself. If you are interested in contributing, or have suggestions for checks, please feel free to raise an issue on GitHub!

Establishing reliability checks as policy via Tanzu Mission Control

If you are already a VMware Tanzu Mission Control customer, you can use it to enforce reliability policies.

Tanzu Mission Control provides a management interface for operators to interact with and manage their clusters in groups by creating and assigning policies, as well as a dashboard to visually monitor the current state of those clusters. The policies are defined using the Rego language and enforced via the Open Policy Agent Gatekeeper.

You will need access to the VMware Cloud Services Portal as well as either an AWS cloud account or a vSphere environment connected to Tanzu Mission Control. Once you have successfully created or attached a cluster, take the following steps.

Step 1: Create a new template

You will need to create a new template under Policies > Template.

Probes template:

Owner annotations template:

Minimum desired QoS template:

Step 2: Create a custom policy

Once you have a new template, you will need to assign it. Look under Policies > Assignments in the menu, select your cluster, and navigate to the custom tab to create a custom policy. You will have to select the template name and the target resource. The target resource is Pod for the probes (livenessProbe and readinessProbe) and QoS policies, and Namespace for owner annotations. Leave API Group blank.

An example screenshot for probes:

An example screenshot for owner annotations:

An example screenshot for QoS:

When creating a custom policy assignment, note that the resource section will be improved in subsequent releases. For now, you will need to click to add another resource even if you have only one; otherwise, you will not be able to create the policy.

Using `kubectl get constrainttemplates`, you can see that the templates are being created by Tanzu Mission Control.

Step 3: Wait for the policy to apply

Please note that the first assignment will take some time to apply as it needs to wait for Pods in the gatekeeper-system namespace to be up and running.

At this time, Tanzu Mission Control doesn’t display any notifications when the Pods are up and running, but you can use `kubectl get pods -n gatekeeper-system` to check on the state of the Pods.

Keep in mind that once a policy is created in Tanzu Mission Control, it is managed by Tanzu Mission Control, and any change you make with kubectl will be overwritten by the version held there.
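If you would rather script that wait than eyeball it, one option is to parse the JSON form of the same kubectl command. The sketch below is only a suggestion: the helper names are ours, the sample Pod names are made up, and `fetch_gatekeeper_pods` assumes kubectl is installed and pointed at the right cluster.

```python
import json
import subprocess


def pod_is_ready(pod):
    """A Pod counts as ready when it is Running and its Ready condition is True."""
    status = pod.get("status", {})
    if status.get("phase") != "Running":
        return False
    return any(
        cond.get("type") == "Ready" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )


def not_ready(pod_list):
    """Names of Pods in a `kubectl get pods -o json` result that are not ready."""
    return [p["metadata"]["name"] for p in pod_list.get("items", [])
            if not pod_is_ready(p)]


def fetch_gatekeeper_pods():
    """Run kubectl against the current context (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", "gatekeeper-system", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Made-up sample shaped like kubectl's JSON output, so the logic runs offline.
sample = {"items": [
    {"metadata": {"name": "gatekeeper-audit-abc"},
     "status": {"phase": "Running",
                "conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "gatekeeper-controller-manager-xyz"},
     "status": {"phase": "Pending",
                "conditions": [{"type": "PodScheduled", "status": "True"}]}},
]}

print(not_ready(sample))  # ['gatekeeper-controller-manager-xyz']
```

Against a live cluster, `not_ready(fetch_gatekeeper_pods())` returns an empty list once every Pod in the namespace is ready.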

Step 4: Verify the policy

Now you can go back to the CLI and use kubectl to create objects. If an object passes the validation check, it will be created successfully. Otherwise, the error message will look something like:

What’s next?

We’ve demonstrated a couple of ways to bring reliability practices to your cluster management, codified as policies, with our Reliability Scanner. We hope that this lightweight, extensible tool will be a first step for the Kubernetes community to start applying reliability-as-code. If you would like to give it a try, you can find the policy templates in our project registry. You can use the command `docker run projects.registry.vmware.com/cre/reliability-policies` to print out the templates. And if you would like to share your experiences and help contribute to the Reliability Scanner plugin, we welcome any issues and pull requests from the community.

VMware CRE is a team of site reliability engineers and program managers who work together with Tanzu customers and partner teams to learn and apply reliability engineering practices using our Tanzu portfolio of services. As part of our product engineering organization, VMware CRE is responsible for some reliability engineering-related features for Tanzu. We are also in the escalation path of our technical support teams, tasked with helping our customers meet their reliability goals.