VMware Carbon Black’s Self-Healing, Auto-Scaling Infrastructure, Powered by Observability

June 11, 2020 Stela Udovicic

Engineers in the VMware Carbon Black business unit are firm believers in observability. Developers can’t build reliable, scalable, trustworthy applications without it. The larger a company’s data pool, the greater the need for application observability.

In this post, I recap the Carbon Black observability journey based on discussions I had with David Bayendor, engineering manager of the Carbon Black infrastructure team. David is an observability veteran, with years of experience in the field. His team of five SREs is in charge of Carbon Black’s container orchestration infrastructure (internally called Mosaic)—including Kubernetes, AWS, and the CI/CD pipelines—which is currently used by about 250 engineers.

In the past, the Carbon Black infrastructure team used several methods to collect metrics. The tools they used included StatsD, InfluxDB and Telegraf, and AWS CloudWatch/CloudTrail. Eventually, they decided to use a single commercial metrics tool, from a vendor that we can’t name here. Over time, however, they began to face a number of challenges with this vendor’s solution.

Early challenges

The challenges the team faced with this vendor were varied but began early.

Immature and expensive tracing offering

First and foremost, the SRE team was not happy with the maturity level of the vendor’s tracing functionality. The lack of mature tracing capabilities resulted in a whole range of problems, none of which were surfaced to the SREs or the developers. Furthermore, high availability of this vendor’s tracing feature was dependent on relatively expensive AWS features.

Service disruptions due to cardinality issues 

As the SREs started to scale the Carbon Black services, they ran into issues with this vendor’s metrics collection service whenever they reached high cardinality. Carbon Black applications could scale from 10 to 1,000 containers in just a few minutes. Because each container has a unique ID, the cardinality of the metrics rose along with the number of containers. The high cardinality increased the number of time series in a non-linear way, causing massive spikes.
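To make the cardinality effect concrete, here is a minimal Python sketch. It assumes a simplified model in which every combination of tag values produces its own time series; the metric counts, tag names, and numbers are hypothetical and are not taken from Carbon Black's environment.

```python
from math import prod


def series_count(metric_names: int, tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct time series: one series per metric name
    per combination of tag values (a simplifying assumption)."""
    return metric_names * prod(tag_cardinalities.values())


# Hypothetical numbers: 50 metrics per container, tagged by container ID.
before = series_count(50, {"container_id": 10})
after = series_count(50, {"container_id": 1000})

# Adding a second (hypothetical) tag multiplies the total again.
with_endpoint = series_count(50, {"container_id": 1000, "endpoint": 20})

print(before, after, with_endpoint)
```

Under this model, a scale-out from 10 to 1,000 containers turns 500 series into 50,000, and one additional 20-value tag pushes that to a million. Short-lived containers make it worse in practice, because each new container ID starts a fresh set of series.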

Some of those spikes violated Carbon Black’s contract with the observability vendor, which would respond by temporarily canceling the metrics service. Instead of applying rate limiting, the vendor just canceled the service altogether; the teams couldn’t receive any metrics at all. These disruptions happened on a regular basis, which was unacceptable, especially given the type of services Carbon Black offers its customers. The vendor offered a different subscription plan that would have avoided such blackouts, but the price of that plan was not acceptable.

Complex removal of defunct metrics

Stopping defunct metrics coming from containers outside Kubernetes also proved difficult. Some Docker container instances run outside Kubernetes, and once those containers became obsolete, the team had to ask the vendor to manually remove their metrics collection, which was inconvenient.

The solution

In light of these challenges, the SRE team decided to migrate to VMware Tanzu Observability by Wavefront. Moving infrastructure metrics from the previous metric vendor to Tanzu Observability took only a week. The transition was fast because the SRE team utilizes infrastructure as code (IaC). With this approach, the code defines the infrastructure configuration, which makes any modifications completely repeatable. Just a single engineer is needed to run a process, and changes governed by the code happen automatically. One person can modify 13 Kubernetes clusters with about 2,000 containers in a mere 90 minutes, and that includes changes to metrics reporting. 
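The repeatability that IaC provides can be illustrated with a small Python sketch of a declarative "plan and apply" step. This is not the team's actual tooling; the cluster names, setting keys, and the `plan` and `apply` helpers are hypothetical and exist only to show the idea that desired state lives in code and applying it twice changes nothing the second time.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Return only the settings that must change for `actual` to match `desired`."""
    return {key: value for key, value in desired.items() if actual.get(key) != value}


def apply(desired: dict, actual: dict) -> tuple[dict, dict]:
    """Apply the planned changes and return (new_state, changes_made)."""
    changes = plan(desired, actual)
    return {**actual, **changes}, changes


# Hypothetical desired state, kept in version control.
desired = {"metrics_backend": "tanzu-observability", "scrape_interval_s": 30}

# Hypothetical current state of two clusters.
clusters = {
    "cluster-a": {"metrics_backend": "old-vendor", "scrape_interval_s": 30},
    "cluster-b": {"metrics_backend": "old-vendor", "scrape_interval_s": 60},
}

first_run = {}
for name, state in clusters.items():
    clusters[name], first_run[name] = apply(desired, state)

# A second run finds nothing left to change: the process is repeatable.
second_run = {name: plan(desired, state) for name, state in clusters.items()}
print(first_run, second_run)
```

Because the same code is applied to every cluster, one engineer can roll out a change such as a new metrics backend without hand-editing each environment.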

Within two months, 125 engineers supporting 30 containerized services had adopted Tanzu Observability. And most of the issues the SREs faced with the previous metrics vendor disappeared.

The Tanzu Observability difference

After changing vendors and adopting Tanzu Observability, the SRE team observed a number of notable improvements.

Outstanding experiences with distributed tracing 

After the transition to Tanzu Observability, the Carbon Black SREs started using distributed tracing. The team gets valuable insights from traces, such as details about traffic between microservices, service maps, histograms, and more. The significant advantage of the Tanzu Observability tracing solution is that both development and infrastructure teams can stay independent of one another by using their own Tanzu Observability metrics proxies. The proxy is lightweight, which means developers can turn tracing on without asking for permission from the infrastructure team. That self-service nature of the Tanzu Observability tracing solution saves time and improves the productivity and collaboration of both teams.

Improved customer satisfaction

The Carbon Black team reported that Tanzu Observability was actually far easier to use than the previous metrics tool, in particular when it came to adding and grouping users, as well as tagging metrics.

"I am really happy with the depth of the Tanzu Observability integrations, the quality of documentation, responsiveness of the code, and responsiveness of the Tanzu Observability developers. Customer success engineering support is solid—we just ping the team in Slack and we get a response quickly." —David Bayendor, Infrastructure Team Engineering Manager, Carbon Black

Tanzu Observability developers also got praise from David and his team for their responsiveness. On top of great support, the Tanzu Observability documentation is really solid; it includes videos, SDK docs, and in-product API docs.

Disappearance of cardinality issues 

After switching vendors, Carbon Black no longer has a problem with metrics cardinality. As David notes, few people are aware of how complex time series data can get in highly scaled and data-intensive environments. Tanzu Observability scales seamlessly, which means engineers don’t need to worry about cardinality. And even in the unlikely event that Carbon Black hits the contractual time series data limit of Tanzu Observability, there wouldn’t be any blackouts. Tanzu Observability only limits data ingestion based on the SLAs customers sign.

As David puts it, “I don’t even have to think about metrics and cardinality. My customers can scale to thousands of containers, and Tanzu Observability can take in huge amounts of data, process it, and then we can spin everything down.”

Full stack observability 

Tanzu Observability sources metrics from both Carbon Black applications and the cloud infrastructure, including AWS CloudWatch, vSphere, and Kubernetes. It monitors some 6,000 metrics endpoints. Information about any issues is sent directly to SREs or developers through Slack and PagerDuty.

Developers also create their own alerts that inform them if services aren’t healthy or aren’t keeping pace with the applications’ loads. The loads can become a challenge to manage due to the many microservices and all the communication that takes place among them. With Tanzu Observability distributed tracing, Carbon Black teams can see both the metrics and traces of interactions among the containers that host microservices.

All of Carbon Black’s Kubernetes clusters are multitenant, because more than one team of developers uses them. The depth and breadth of Tanzu Observability integrations with Kubernetes, and the way Tanzu Observability uses the Kubernetes state collector, together help the infrastructure team monitor the services that make Carbon Black’s applications work with minimal effort. The Kubernetes state collector allows Carbon Black to easily pull metrics from the Kubernetes APIs.

The infrastructure team deploys about a dozen services onto the Carbon Black Kubernetes clusters. Those services handle scaling, logging, metrics, and the automation of the ingress controllers.

Self-healing infrastructure 

When it comes to scaling Kubernetes pods, the Tanzu Observability Horizontal Pod Autoscaler Adapter (HPA adapter) is especially useful. It allows Carbon Black developers to add a few lines to their deployment YAML files that govern the scaling of their services based on key metrics. Sometimes people on call get paged about outstanding alerts even after the HPA has already scaled the number of pods and cleared the alert. With the HPA adapter, a person on call doesn’t need to do anything, because the infrastructure automatically fixes the problem.

Metrics, such as service queue latencies, can tell the HPA adapter when to scale the number of pods up or down. Some good examples of Tanzu Observability metrics sources that determine the scale of Carbon Black’s services include those from Amazon SQS and Kinesis.
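As a rough illustration of how a metric value turns into a pod count, the sketch below implements the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric value / target metric value), with a small tolerance band and min/max bounds. The function name, the queue-latency figures, and the bounds are hypothetical; in the setup described here, the adapter's role is to make the Tanzu Observability metric value available to the HPA.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Kubernetes HPA rule: ceil(current_replicas * current_value / target_value),
    skipped when the ratio is within the tolerance, then clamped to the bounds."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical queue latency of 900 ms against a 300 ms target: scale out.
scale_out = desired_replicas(4, 900, 300, max_replicas=50)

# Latency of 310 ms is within 10% of the target: stay put.
steady = desired_replicas(4, 310, 300)

# Latency of 100 ms against a 300 ms target: scale in.
scale_in = desired_replicas(10, 100, 300, min_replicas=2, max_replicas=50)

print(scale_out, steady, scale_in)
```

Because the rule is proportional, a queue that is three times slower than its target asks for three times as many pods, and the same rule shrinks the deployment once the backlog drains.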

With their previous metrics tool vendor, Carbon Black team members weren’t able to collect scaling-related metrics from one place. Now, with Tanzu Observability, they can.

Intuitive dashboarding and analytics improve developer productivity

The Carbon Black SREs and developers can easily clone and modify Tanzu Observability dashboards to meet their needs. The teams clone the existing dashboards included in each integration, then change those dashboards with very little effort. A dashboard wizard makes it easy to create dashboards from scratch by selecting metrics, chart types, or even an integration that accelerates creation and allows for modifications to be made on the fly. The SREs find Chart Builder and Chart Editor to be great tools for creating and modifying queries.

To allow developers and engineers to create dashboards and queries on their own, David’s team provided appropriate documentation. The user-friendly approach of Tanzu Observability, combined with his team's guidelines, resulted in the fast and painless adoption of Tanzu Observability at Carbon Black. It took developers an average of just 40 minutes to send custom metrics from development-owned containers on his team’s infrastructure platform to Tanzu Observability. They were able to generate new metrics without having to ask any clarifying questions. David was also surprised by how little time his team spent facilitating the broad adoption of Tanzu Observability by the development community. The positive impact on the productivity of engineers and developers was very significant.

Improved accountability and developer independence

Each team is in charge of the dashboards, alerts, queries, and traces they create, along with any issues they detect with Tanzu Observability. In most cases, the definitions of dashboards and alerts are stored in Terraform as IaC. That way, Tanzu Observability configurations can be easily managed as part of the CI/CD pipeline.

Envisioning a future with Tanzu Observability

David’s desire is to enable access to Tanzu Observability insights across the Carbon Black organization, with TV screens that display the Tanzu Observability dashboards to management, development, and engineering groups.

The best kind of infrastructure, and the best kind of application observability, is the one that resolves itself. —David Bayendor, infrastructure team engineering manager, Carbon Black

Another initiative that David’s team wants to explore is Tanzu Observability integration with GitLab. They are already looking into GitLab metrics, which can help them optimize their CI/CD pipelines, and would like to see metrics related to any bottlenecks that cause deployments to take more time than expected.

To experience the power of Tanzu Observability for your team, start your 30-day free trial today. 

 

About the Author

Stela Udovicic

Stela Udovicic (@stela_udo) is a Director of Product Marketing at VMware, leading the Tanzu Observability by Wavefront PMM team. Before VMware, as Sr. Director of Product Marketing at Wavefront, she led Product, Solutions, and Partner Marketing. Before Wavefront, Stela led Product Marketing for Splunk's DevOps, IT Ops, storage, and networking solutions. Stela holds an MSc in Electrical Engineering. She has presented at many major conferences, including Splunk.conf, VMworld, DevOps Days, Cisco Live, RSA, Monitorama, PuppetConf, and NetApp Insight.
