Improving Kubernetes Operations One Step at a Time

February 20, 2024 Corey Dinkens

The performance, scalability, and flexibility that Kubernetes offers are big reasons for its rapid adoption. At the same time, however, managing Kubernetes clusters is neither simple nor easy, which means third-party tools are practically a requirement as you scale.

I have been reminded of this a lot lately. While attending three major tech conferences in recent months, I spoke with a number of companies at varying phases of their Kubernetes journey. Regardless of where they were, many faced similar challenges: skills, resources, culture, and technology. My conversations echo the responses we've received for our annual State of Kubernetes Report.

Many teams start with a DIY platform stitched together from various open source components. These early projects tend to be fine for managing a few clusters. However, as the number of clusters grows, so do the complexity and pain of keeping clusters secure and consistent at scale. Don't get caught in the DIY trap of thinking it's quick or easy to curate an end-to-end solution from the CNCF landscape; the devil is in the details, as they say. Because Kubernetes is complex, there are many considerations to take into account. Who is going to deploy and support those solutions going forward? How will you deploy them? Can all of this be automated? These are only the start of the questions that need to be answered.

Because I have always worked on small teams, efficiency and stability have always been paramount. Prior to my role at Tanzu by Broadcom, I was a systems admin (aka platform engineer) and a help desk manager for national and international companies, which meant I was frequently working to identify areas for improvement and automation.

Still doing things the hard way

For most of the customers I speak with, cluster management is a source of friction driven by factors both inside and outside of their control. Many are still managing operations via scripts that require manually adjusted values specific to each cluster. Unfortunately, this only solves one problem: deploying clusters and packages. But how do you ensure they are continually updated? Often this means manually maintaining the packages and role-based access control (RBAC) permissions on every cluster where the package is running. It is easy to see how this becomes a major bottleneck at scale.
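
To make that maintenance burden concrete, here is a minimal sketch of the kind of RBAC object that has to be applied, and kept in sync, on every cluster. The binding name and group are hypothetical; only the built-in view ClusterRole is standard:

```yaml
# Grants a team read-only access. In a script-driven workflow this file is
# copied and hand-edited per cluster, then re-applied whenever the team or
# its permissions change.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: app-team-readonly          # hypothetical binding name
subjects:
  - kind: Group
    name: app-team-dev             # hypothetical IdP group; often differs per cluster
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                       # built-in read-only ClusterRole
  apiGroup: rbac.authorization.k8s.io
```

Multiply a handful of files like this by every team and every cluster, and the manual upkeep quickly outgrows the scripts that created it.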

Today, this leaves many platform engineering teams with questions like: What happens when the sole person managing the Kubernetes deployment is out? Will operations grind to a halt until next week when that person returns to run the deployment scripts or apply needed RBAC changes? What if an application is flagged by InfoSec as carrying a critical CVE that needs to be addressed today, or the app has to be taken offline?

In most cases, Kubernetes lives in a silo because of its dependence on understanding YAML, kubectl, and CLI syntax, and this frequently prevents teams from involving other skilled, or perhaps more junior, engineers in the support process.

Where do you begin? 

“The secret of getting ahead is getting started. The secret of getting started is breaking your complex overwhelming tasks into small manageable tasks, and starting on the first one.” -Mark Twain

There are several factors that make these challenges difficult to solve, including team dynamics, varying skill levels, budget, available resources, processes, and more. For many, just knowing where to begin can be a challenge. 

I recommend starting with things under your immediate control or things that can be changed with little friction. One way to begin quantifying and breaking down complex tasks is to apply Little's law. The overall idea is that if it takes you X minutes to do something, you can only complete so many of those tasks within a given period of time, which makes it a quick way to identify bottlenecks and inefficiencies. For example, if it takes you 5 minutes to log in to a cluster and update or deploy a package, updating 10 clusters will take a minimum of 50 minutes. I have noticed this is often forgotten as operators become inundated with day-to-day work, other projects, or fires that need to be put out.
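
Formally, Little's law states that L = λ × W: the average number of items in a system equals their arrival rate multiplied by the average time each item spends in the system. Read loosely for the example above, if your effective rate is one cluster every 5 minutes, a backlog of 10 clusters cannot clear in less than 50 minutes of hands-on time, and that floor only rises once interruptions and context switching are added.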

There is another saying that dovetails nicely with Little's law, the law of the instrument: if the only tool you have is a hammer, you tend to treat everything as if it were a nail. Scripts work great in many cases; however, they are not meant to solve every problem.

Here are some of the questions I ask when using Little's law to identify areas for improvement:

  • Are the teams responsible for cluster operations and management resource-constrained?
  • Is there any existing automation, tooling, or scripting in place?
  • Are there multiple teams involved in cluster operations?
  • Do you have separate teams for each cloud provider that you operate in? 
  • How do you lower the barrier of entry to Kubernetes and get less technical team members involved, without requiring them to fully understand all of the complicated bits such as YAML or complex CLI commands?
  • How do you automate ~80% of tasks so that instead of constantly putting out fires your teams have the breathing room to focus on innovating, improving systems, and the critical last 20%?

The last 20% is usually the most difficult to address, yet it is also where you will have the biggest impact. You can learn more about the Pareto Principle's 80/20 rule here.

Hello Tanzu! Expanding cluster operations beyond power users

Keeping the above questions in mind, I want to do a quick thought exercise. If you are running Kubernetes in production today, ask yourself: Am I confident that someone from the platform operations team with little to no Kubernetes experience could deploy a production-ready Kubernetes cluster quickly, given they would have to: 

  • Create new cluster(s) in multiple environments (for example, on vSphere and on EKS)
  • Create/change/delete RBAC permissions 
  • Configure DNS and enterprise cluster packages
  • Configure secrets
  • Apply network and security policies, including the installation of security agents for runtime
  • Create dev namespace(s)
  • Deploy and configure Velero, take a cluster backup, and create a backup schedule
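
To make the exercise concrete, here is a minimal sketch of what just the last item on that list looks like in raw YAML, assuming Velero is already installed in the velero namespace. The schedule name and retention window are hypothetical:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-backup     # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"            # run every night at 02:00
  template:
    includedNamespaces:
      - "*"                        # back up every namespace
    ttl: 168h0m0s                  # keep each backup for 7 days
```

Every other item on the list has a similar amount of YAML or CLI behind it, which is exactly the knowledge barrier the question is probing.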

This is where Tanzu Mission Control can help. Tanzu Mission Control shifts the paradigm by lowering the barrier of entry to Kubernetes management. It was designed to make it easy to create Kubernetes clusters, update Kubernetes RBAC permissions, create namespaces, enable and take cluster backups, and more, all without needing to touch or know Kubernetes YAML or CLI syntax. This is essential to getting others involved in managing cluster operations. Tanzu Mission Control can help you solve cluster operations challenges today because it eliminates low-value, repetitive tasks and frees up operations teams and engineers for innovation and improvements instead of playing catch-up or continually putting out fires.

Taking things a step further, with a runbook and some training it is possible for these same users to help with app operations. Packages, FluxCD repositories, and kustomization directories can all be added to clusters and cluster groups from the GUI. It is also worth mentioning that all of this can be automated through the REST API, CLI, and Terraform.
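
For context on what the GUI is driving on your behalf, here is a minimal sketch of the upstream FluxCD objects involved: a GitRepository source plus a Kustomization that applies a path from it. The repository URL, names, and path are hypothetical, and this is plain Flux rather than anything Tanzu Mission Control-specific:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-config            # hypothetical source name
  namespace: flux-system
spec:
  interval: 5m                     # how often to poll the repository
  url: https://github.com/example-org/platform-config   # hypothetical repo
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-addons             # hypothetical kustomization name
  namespace: flux-system
spec:
  interval: 10m                    # how often to reconcile cluster state
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./clusters/prod            # hypothetical directory inside the repo
  prune: true                      # remove objects that disappear from Git
```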

Looking at the number of tasks it takes to deploy clusters, even with scripts, this is where we can apply Little's law to really begin understanding the time savings that Tanzu Mission Control can offer. We have removed the guesswork of selecting projects from the CNCF landscape and provided an opinionated set so that, altogether, these packages represent a fairly comprehensive Kubernetes management and operations solution.

When you create, manage, or attach a cluster, Tanzu Mission Control automatically deploys and configures the following (an example of one such policy appears after this list):

  • Pinniped (cluster auth)
  • Carvel package manager
  • OPA Gatekeeper (security and custom policies) 
  • AWS EBS CSI and CNI packages for AWS EKS clusters 
  • RBAC policies 
  • Security, Network, Image, Custom OPA, or Mutating policies
  • Proxy settings
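
To give a sense of what a custom OPA policy looks like, here is a plain upstream Gatekeeper constraint rather than anything Tanzu Mission Control-specific. It assumes the K8sRequiredLabels ConstraintTemplate from the Gatekeeper library is already installed, and the policy name and required label are hypothetical:

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-owner-label        # hypothetical policy name
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]       # enforce on namespaces only
  parameters:
    labels: ["owner"]              # every namespace must carry an owner label
```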

You can also configure the following tools to be automatically deployed to some or all clusters: 

  • Velero (cluster backup/restore, cross-cluster restore) 
  • FluxCD (workload continuous delivery) 
  • Tanzu Observability
  • Tanzu Service Mesh (Istio++)

To automate package deployment and cluster configuration, you could configure individual clusters or cluster groups with FluxCD to deploy the following (see the sketch after this list):

  • Security agents (e.g., Falco)
  • ExternalDNS
  • cert-manager
  • Jaeger tracing
  • Fluent Bit
  • Line-of-business (LOB) applications
  • Native Kubernetes CRDs
  • And more
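
As a rough sketch of what one of those Flux-managed deployments could look like, here is a HelmRepository plus HelmRelease pair for cert-manager. The names, namespaces, and values are illustrative, the apiVersions depend on the Flux version installed on the cluster, and in practice you would pin a chart version:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: jetstack
  namespace: flux-system
spec:
  interval: 1h                     # how often to refresh the chart index
  url: https://charts.jetstack.io
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: flux-system
spec:
  interval: 10m
  targetNamespace: cert-manager
  install:
    createNamespace: true          # create the target namespace if missing
  chart:
    spec:
      chart: cert-manager
      sourceRef:
        kind: HelmRepository
        name: jetstack
      # pin a chart version here in a real deployment
  values:
    installCRDs: true              # let the chart manage its CRDs
```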

This is a significant task list even for trained Kubernetes administrators. So, how does this look when applied in the real world? One of our field solution engineers recently reached out to celebrate the fact that a customer had successfully reduced cluster deployment times from two weeks to under five minutes and eliminated the hands-on work that was previously spread across five different teams.

I wrote a blog post with a corresponding example repo that can be used to jump-start your FluxCD configurations and deployments using Tanzu Mission Control. A colleague also created a powerful multi-tenant example repo that includes self-service namespaces; the self-service functionality is delivered via a custom operator.

Conclusion

It is easy to get caught up in the day-to-day tasks of ticket and cluster operations and fail to realize how big an impact certain tasks can have on other teams. I recommend taking a step back in the new year to analyze your processes and identify the key areas for improvement.

There are, of course, other areas that can affect the success of any change or implementation, such as people and culture, including those who are reluctant to accept any change or disruption to the status quo. A colleague recently published an article that speaks to this very point, so check it out.

With Tanzu Mission Control, you can start solving some of these problems today and unburden your skilled engineers from low-value tasks while lowering the barrier to Kubernetes operations for other members of the operations team.

For a deep dive into product capabilities, please explore our on-demand Kubernetes Operations Webinar Series, Part 1 and Part 2, where you can learn about best practices for managing Kubernetes environments and how to build software capabilities that sustain your organization into the future.
