Root Cause of an Application Outage on Kubernetes, and How We Fixed It

A lot of the work I do with customers centers around helping them become successful, as quickly as possible. For some, that's helping install the PKS platform, or automating build pipelines. For others, it’s just helping containerize applications, and getting them ready to run on Kubernetes (and all that entails).

Outages happen to all applications—it’s just a part of running imperfect software and imperfect humans. This is the story of an outage one application took while running on Kubernetes, how we determined the root cause, and how we fixed it.

The Setup

This application is a scale-out Java application, running on-premises atop Pivotal Container Service (PKS). It receives requests from outside K8s, makes appropriate database calls (against a DB also outside K8s), and returns results. The app is accessed via an Ingress, and has a standard Service endpoint. The one goofy thing about this application is that it does a ton of processing upon startup: it warms up its cache and runs a slew of other tasks. That’s about 20-24 minutes' worth of startup time before it can take requests. That's fine; we can deal with that via liveness and readiness checks.

It ran fine for weeks, scaled out with a Deployment to about 30 Pods.

The Challenge

The underlying cluster (about eight nodes) needed to be upgraded from PKS 1.2.6 to 1.3. Now, with PKS, that's a mostly automated process that usually goes something like this:

Initiate the upgrade;
BOSH does the rest:

Node by node:
- Drain + cordon the Node
- Delete the node from the IaaS
- Create new Node from new image on IaaS
- Add node to cluster

In PKS, each node really only takes about 3-4 minutes to cycle through these steps. It’s all handled by the system.

The Event

About 25 minutes after the upgrade was started, the application monitoring systems started noticing that transactions were failing. Over 95% of requests to the application were timing out. After about 27 minutes, the application was entirely unresponsive.

Kubernetes should have moved those Pods off each node as it was drained and restarted them elsewhere. The process should have been invisible to end users.

How did this happen?

The Root Cause

The core of the problem turned out to be the aforementioned 20-24 minute startup time. Each of the 8 nodes in the cluster ran about 3 of the Pods in the deployment. When the first Node was drained and its ~3 Pods evicted, they restarted on other Nodes. But each rescheduled Pod had to run the full startup sequence again before it could serve traffic.

After 3-4 minutes of work on the first node, the second node met the same fate. Its ~3 Pods were evicted and began their startup sequence elsewhere. Even worse, some of the Pods on the second node were ones carried over from the first node that still hadn't finished starting, and they were killed again.

Roll through the entire cluster at 3-4 minutes per node (roughly 24-32 minutes for eight nodes), and you end up in a situation where all the Pods are running, just as the Deployment specifies, but none of them is ready to accept traffic, so none is part of the Service or Ingress yet. This resulted in the complete failure of the application.
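The cascade described above can be sketched with a few lines of Python. This is a deliberately simplified, best-case model, not a reconstruction of the real incident: the function name `ready_pods` is made up for illustration, and the numbers (8 nodes, 3 Pods per node, a 4-minute node cycle, a 24-minute startup) are assumptions picked from the ranges mentioned in this post. It also assumes a Pod is never evicted a second time, which was not true in the actual outage.

```python
def ready_pods(t, nodes=8, pods_per_node=3, cycle_min=4, startup_min=24):
    """Count Pods that are ready at minute t of the rolling upgrade.

    Best-case model (all values are illustrative assumptions):
    node i is drained at minute i * cycle_min, its Pods are rescheduled
    immediately, and they are never evicted a second time.
    """
    ready = 0
    for i in range(nodes):
        evicted_at = i * cycle_min
        # Pods from node i are warming up for startup_min minutes
        # after their eviction and cannot serve traffic in that window.
        warming_up = evicted_at <= t < evicted_at + startup_min
        if not warming_up:
            ready += pods_per_node
    return ready


if __name__ == "__main__":
    total = 8 * 3
    for minute in range(0, 56, 4):
        print(f"t={minute:2d} min  ready={ready_pods(minute):2d}/{total}")
```

Under these assumed numbers, ready capacity bottoms out at 6 of 24 Pods even though no Pod is ever killed twice. Once rescheduled Pods start landing on nodes that are still waiting to be drained, as happened here, the remaining capacity can disappear entirely.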

The Fix

Thankfully, there is a technique in Kubernetes that prevents this. What we want to be able to say is, "Don't drain this node if the overall application would be overly impacted." This is measured (generally) as a disruption budget—that is, how many components of this application can be down simultaneously without impacting our SLA?

For this particular application, the answer is about 30%. We could suffer a loss of about 30% of the Pods without a significant negative impact.

The way that Kubernetes can handle this problem is with a Pod Disruption Budget, or PDB, where we can declare our value, and apply it to any given set of selectors.

In this case, the fix was to create a PDB that looks something like this:

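# On Kubernetes 1.21+ use policy/v1 instead (policy/v1beta1 was removed in 1.25)
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  # At most 30% of the matching Pods may be unavailable at once
  maxUnavailable: 30%
  selector:
    matchLabels:
      app: myappname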

With this in place, when a drain starts, each eviction is checked against the Pod Disruption Budget, which looks at the current state of all the Pods matching that selector (app: myappname). The upgrade workflow then waits to drain the node until enough Pods are ready to stay within that maxUnavailable value of 30%.
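To make the budget arithmetic concrete, here is a small sketch of how many evictions a maxUnavailable of 30% would permit at a given moment. The function name `disruptions_allowed` is hypothetical, and the calculation is a simplified version of what I understand Kubernetes to do: a percentage is rounded up to a whole number of Pods, and evictions are allowed only while the number of healthy Pods stays above the resulting floor.

```python
def disruptions_allowed(expected_pods, healthy_pods, max_unavailable_pct):
    """Return how many more Pods could be evicted right now.

    Simplified sketch of the PDB calculation (an assumption, not the
    actual controller code): the percentage is rounded up to whole Pods.
    """
    # Integer ceiling of expected_pods * pct / 100, avoiding float error.
    max_unavailable = (expected_pods * max_unavailable_pct + 99) // 100
    # The budget requires at least this many healthy Pods at all times.
    desired_healthy = expected_pods - max_unavailable
    return max(0, healthy_pods - desired_healthy)


if __name__ == "__main__":
    expected = 30  # roughly the size of the Deployment in this story
    for healthy in (30, 27, 24, 21, 18):
        allowed = disruptions_allowed(expected, healthy, 30)
        print(f"healthy={healthy:2d}/{expected}  evictions allowed={allowed}")
```

With about 30 Pods at roughly 3 per node, the budget covers about three nodes' worth of evictions. After that, further evictions are refused until some of the restarted Pods report ready, which is exactly the pause the upgrade was missing.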

Conclusion

Fortunately, in this case, by the time the last node was upgraded and replaced, the first node's Pods were nearly complete with their startup. The outage in total only lasted about 14 minutes.

It was a fairly inexpensive lesson for the team and we’re now doing a few things differently:

- Larger-scale destructive testing: rather than the basic tests that were tried, like killing a couple of Pods manually, the team is now developing tests that include entire AZ failures and upgrades in the staging environment as part of their release QA process.
- Documenting best practices like PDBs in a central place.
- Figuring out how to reduce the startup time of the application.

Overall, this was a good experience for this customer to understand some of the features Kubernetes includes to keep applications in proper working order, and how multiple layers of a stack can influence each other in unexpected ways.
