Troubleshooting Clusters with Crash Recovery and Diagnostics for Kubernetes

November 18, 2019 Vladimir Vivien

When your cluster is properly configured and working as intended, Kubernetes can be a beautiful thing. Inevitably, however, your Kubernetes cluster will break for any number of reasons. Because Kubernetes is a complex system, debugging a broken cluster is a multi-step endeavor that usually starts with collecting machine states and other diagnostic data for analysis by an operations or customer reliability team.

As part of the Tanzu umbrella of open source projects, VMware created a new open source project - Crash Recovery and Diagnostics for Kubernetes (or Crash Diagnostics for short). This project is designed to help troubleshoot problem clusters by automating the collection of machine states and diagnostic data from unstable or inoperable clusters.

Read more from the Crash Diagnostics GitHub repository.

Yet Another Diagnostic Tool?

You may be wondering at this point: Does the Kubernetes community need yet another diagnostics tool? That would be a fair question as there are several tools available that are usually deployed as pods to help diagnose a running cluster. One example is Sonobuoy, an open source project from VMware that runs Kubernetes conformance tests and other plugins.

Crash Diagnostics, however, runs outside the cluster and investigates problems where the cluster may be partially or completely non-operational.

Collecting Troubleshooting Information

A series of commands is declared in a diagnostics file which specifies the resources to collect from cluster machines. Like a Dockerfile, the diagnostics file is a collection of line-by-line directives with commands that are executed on each specified cluster machine. The output of the commands is then added to a tar file and saved for further analysis.

For instance, when the following diagnostics file (saved as Diagnostics.file) is executed, it collects information from the two cluster machines specified with the FROM directive:

ENV remoteuser=adminop
FROM 192.168.176.100:22 192.168.176.102:22
AUTHCONFIG username:${remoteuser} private-key:${HOME}/.ssh/id_rsa
WORKDIR /tmp/crashout

# copy log files
COPY /var/log/kube-apiserver.log
COPY /var/log/kube-scheduler.log
COPY /var/log/kube-controller-manager.log
COPY /var/log/kubelet.log
COPY /var/log/kube-proxy.log

# Capture service status output
CAPTURE journalctl -l -u kubelet
CAPTURE journalctl -l -u kube-apiserver

# Collect docker-related logs
CAPTURE journalctl -l -u docker
CAPTURE /bin/sh -c "docker ps | grep apiserver"

OUTPUT ./crash-out.tar.gz

On a machine with SSH access to the cluster nodes, the previous diagnostics file can be executed as follows:

$ crash-diagnostics --file Diagnostics.file

When the diagnostics file above is executed, the following actions take place:

ENV declares a named variable that can be referenced throughout the file.
FROM declares the machines on which commands will be executed.
AUTHCONFIG configures an SSH connection that will be used to connect to the node machines.
WORKDIR specifies a temporary location where gathered files are staged.
The COPY commands collect Kubernetes log files for the apiserver, scheduler, controller manager, kubelet, and kube-proxy.
The CAPTURE commands execute the specified commands on each node and capture the result in a file that is bundled in the tar file.
Lastly, the OUTPUT directive specifies the name and location for the generated archive file.
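
Once the archive has been generated, a quick way to check what was gathered is to list its contents before unpacking it. The following is a minimal Python sketch, not part of Crash Diagnostics itself. The helper name list_bundle is hypothetical, and the sketch makes no assumption about how files are laid out inside the archive; it simply reports whatever regular files it finds.

```python
import tarfile


def list_bundle(path):
    """Return the sorted names of regular files in a gzip-compressed tar archive."""
    with tarfile.open(path, "r:gz") as bundle:
        # Directories and other special entries are skipped; only files are listed.
        return sorted(m.name for m in bundle.getmembers() if m.isfile())


if __name__ == "__main__":
    # Path taken from the OUTPUT directive in the example diagnostics file.
    for name in list_bundle("./crash-out.tar.gz"):
        print(name)
```

The same information is available from the command line with standard tar tooling; the function form is handy when the listing feeds a script that checks for expected log files.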

The Diagnostics File

Currently, the diagnostics file supports a small but powerful set of directives, including:

AUTHCONFIG – configures the user and key used for the SSH connection.
CAPTURE – runs a command and captures the result in a file.
COPY – used to specify files to copy.
ENV – declares environment variables.
FROM – lists machine addresses from which to retrieve data.
OUTPUT – specifies the output tar file to create.
RUN – runs the specified command.
WORKDIR – specifies the staging directory from which the output file is created.

These directives allow you to automate the collection of valuable machine states regardless of whether the cluster is stable or not.
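
To make the one-directive-per-line layout concrete, here is a toy reader written in Python. It is an illustration only and is not the parser used by Crash Diagnostics: the function name parse_diagnostics is hypothetical, and the handling of blank lines, comments, and unknown directives reflects assumptions made for this sketch rather than documented behavior of the tool.

```python
# Directive names taken from the list above.
KNOWN_DIRECTIVES = {
    "AUTHCONFIG", "CAPTURE", "COPY", "ENV",
    "FROM", "OUTPUT", "RUN", "WORKDIR",
}


def parse_diagnostics(text):
    """Split diagnostics-file text into (directive, argument_string) pairs."""
    directives = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' carry no directive.
        if not line or line.startswith("#"):
            continue
        # The first whitespace-separated word is the directive name;
        # everything after it is kept as one argument string.
        parts = line.split(None, 1)
        name = parts[0]
        args = parts[1].strip() if len(parts) > 1 else ""
        if name not in KNOWN_DIRECTIVES:
            raise ValueError(f"line {number}: unknown directive {name!r}")
        directives.append((name, args))
    return directives


if __name__ == "__main__":
    sample = """\
FROM 192.168.176.100:22 192.168.176.102:22
# copy log files
COPY /var/log/kubelet.log
CAPTURE journalctl -l -u kubelet
OUTPUT ./crash-out.tar.gz
"""
    for name, args in parse_diagnostics(sample):
        print(name, "->", args)
```

Keeping the arguments as a single string mirrors how a CAPTURE line carries a whole shell command; a real implementation would go on to interpret each directive's arguments separately.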

As shown in the earlier example, the diagnostics file also supports variable expansion, which provides a familiar feel for those who routinely use shell scripts. This variable expansion is demonstrated in the following snippet:

AUTHCONFIG username:${remoteuser} private-key:${HOME}/.ssh/id_rsa

The value of ${remoteuser} is resolved as the variable named remoteuser, which was declared with ENV in the previous example. The diagnostics file can also access predeclared variables, such as ${USER}, ${HOME}, and ${PWD}.
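
The expansion described above behaves much like shell-style substitution. The Python sketch below imitates it with the standard library's string.Template and is not the project's implementation. The helper name expand is hypothetical, and two behaviors are assumptions made for this sketch: values declared in the file take precedence over the process environment, and unknown references are left untouched.

```python
import os
from string import Template


def expand(line, declared):
    """Expand ${name} references in a diagnostics-file line."""
    # Start from the process environment (HOME, USER, PWD, ...).
    values = dict(os.environ)
    # Assumption: variables declared with ENV override the environment.
    values.update(declared)
    # safe_substitute leaves unknown ${name} references as-is instead of raising.
    return Template(line).safe_substitute(values)


if __name__ == "__main__":
    declared = {"remoteuser": "adminop"}
    print(expand("AUTHCONFIG username:${remoteuser} private-key:${HOME}/.ssh/id_rsa", declared))
```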

Project Roadmap

This project is only a few months old, yet early adopters have already put it to some interesting and critical uses. There are, however, big plans to make this project a great tool for the community at large. Here are a few items that we are considering implementing with help from contributors:

Troubleshooting recipes – a collection of diagnostics files that can help solve common Kubernetes cluster issues.
Tighter integration with Kubernetes – implementations of Kubernetes-specific directives to help extract cluster information directly from a running API server if available.
Pluggable backend – investigation of a possible pluggable internal backend that may use mechanisms other than SSH to reach remote machines.
Preliminary diagnosis – analyze the collected data for known and common problem patterns.

Getting Involved

Although we are just getting started, we look forward to contributors joining the project and shaping its direction and the community around it. You can:

Try out the latest release from GitHub.
Share a diagnostics file and the problem it helped solve.
Come chat with us in #crash-diagnostics on the Kubernetes Slack.
Collaborate with us on GitHub by opening an issue or creating a pull request.

About the Author

Vladimir Vivien has an extensive career as a software engineer. He currently works at VMware in the Cloud Native Application group where he is passionate about contributing upstream to the Kubernetes open source project. Vladimir also enjoys writing blogs on technology and he has recently published his latest book titled "Learn Go Programming".