How VMware Cloud Engineering Team Exceeds SLAs Using Wavefront

April 22, 2019 Stela Udovicic

I chatted with Sandhya Sridharan, VP of Engineering; Yumei Xiong, Engineering Director; and Dhruvin Shah, SRE Group Manager at VMware. They shared how their cloud engineering team exceeds SLAs using Wavefront. Sandhya’s engineering team is responsible for delivering critical cloud services with strict SLAs to internal and external stakeholders. Their scale today is impressive: 650 pipelines (both cloud and on-prem) across multiple geographies, 5,000 containers deployed daily across 12+ Kubernetes clusters for 100+ application microservices, 600 alerts, and 500+ dashboards for 400 Wavefront users.

Cloud Management Platform Scale

The Before Times – Costs Spiraling Out of Control

Before Wavefront, the SRE team had selected another vendor to monitor all their metrics. Though this tool worked for them in the beginning, they ran into issues as they started to scale. As they grew their microservices and added more containers, costs started rising uncontrollably. They were unhappy with this vendor’s non-transparent pricing, with its hidden charges and opaque container pricing. Engineers also disliked the proprietary collection agents. Further, the vendor's packaged dashboards lacked the analytics needed to customize them, which was very limiting.

Enter Wavefront

As a result of all the above problems, they decided to try Wavefront. After an initial investment of some engineering effort, the SRE team gradually and seamlessly migrated their critical cloud services to Wavefront. Now the Cloud Management Business Unit (CMBU) SREs rely on Wavefront as their primary platform for full-stack observability. They use Wavefront for reliability and health analysis across all microservices, build pipelines, and hybrid cloud infrastructure. The block diagram below describes their environment today.

Monitoring CI/CD Pipeline

Developers push code at high velocity, about 900 pipeline executions per day, so the farsighted SRE team started automating the entire pipeline. An automated pipeline requires constant visibility to make sure that it is ‘always flowing’ and not stuck at any point.

Wavefront dashboards show all the pipeline metrics. The team analyzes not only the services but also the supporting infrastructure that keeps the pipeline running continuously while meeting SLAs. With tests for the different pipeline stages, and with powerful, easily customizable Wavefront alerting integrated with PagerDuty, a developer can securely push code from his or her laptop and see it in production within minutes. Developers immediately see the impact of the code release in production.

Improved Engineering Efficiency

The Wavefront platform is also integrated with Slack, driving a tremendous increase in productivity and efficiency for the entire engineering team, both SREs and developers. Sandhya reported that MTTR has dropped by more than 90% since her team adopted Wavefront.

For engineering managers, it matters how quickly a new developer can ramp up. Developers can now become productive within a couple of hours, compared to several days before Wavefront. With a single SDK that managers share with new developers, newly developed services are almost instantly ready to be pushed to production along with standard telemetry, including essential alerts.

Instant SLA Visibility Impresses Executives

Sandhya is proud of her global engineering team’s ability to offer a high 99.9% SLA to customers of their cloud services, and that is where Wavefront is critical for her team. It is essential for engineers to detect issues proactively. Sandhya can pull up Wavefront dynamic dashboards at any time to report to her executive team on how business-critical cloud services are performing. Even better, execs can now go to Wavefront themselves and see the SLA for a particular service at any second, as well as a high-granularity historical view.
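To make the 99.9% figure concrete, here is a minimal sketch of the arithmetic behind an availability SLA. It is not taken from the VMware team or from Wavefront; the function names and the 30-day reporting period are assumptions chosen for illustration. A 99.9% target over 30 days leaves a downtime budget of roughly 43 minutes.

```python
# Illustrative only: function names and the 30-day period are assumptions,
# not part of Wavefront or of the team's actual SLA reporting.
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Downtime budget, in minutes, that an availability SLA leaves in a period."""
    total_minutes = period_days * MINUTES_PER_DAY
    return total_minutes * (100.0 - sla_percent) / 100.0


def measured_availability_percent(downtime_minutes: float, period_days: int = 30) -> float:
    """Availability actually achieved, given the observed downtime in a period."""
    total_minutes = period_days * MINUTES_PER_DAY
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def meets_sla(downtime_minutes: float, sla_percent: float, period_days: int = 30) -> bool:
    """True when the observed downtime fits inside the SLA's downtime budget."""
    return downtime_minutes <= allowed_downtime_minutes(sla_percent, period_days)


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.9)  # roughly 43.2 minutes per 30 days
    print(f"Downtime budget at 99.9%: {budget:.1f} minutes")
    print(f"Availability with 20 minutes down: {measured_availability_percent(20):.3f}%")
```

A dashboard that tracks an SLA is, in effect, plotting the second function continuously and comparing it against the target.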

“What took us ten steps to do before, now it is only one step. What took us 20 minutes it is now less than a minute”, Sandhya Sridharan, VP of Engineering, VMware.

Cloud Management Services SLA

Achieving the Full-Stack Observability

Understanding how Kubernetes behaves across all levels, as well as how their AWS and VMware SDDCs behave, is paramount to the SRE team. They monitor all of those layers and more: be it on-prem or in the cloud, everything goes through Wavefront. The Wavefront Kubernetes integration helped them quickly light up their Kubernetes dashboards. No matter how many Kubernetes clusters there are, they can pick one and instantly see performance dashboards visualizing its health. And they get not only Kubernetes health but also correlated visibility across containerized microservices and the underlying infrastructure.

When an SRE receives a PagerDuty notification about a CPU load spike, they want to know everything that was happening around that time: what kind of load was in the system, how many users, how many database connections. That is what they get with Wavefront, which puts them one click away from the resolution. Developers are service owners, and Wavefront enables them to see instantly what is happening from code check-in through production.

The most recent addition, Wavefront Distributed Tracing, which is compliant with OpenTracing and OpenCensus, dramatically enhanced CMBU platform observability. The Wavefront tracing SDK enabled SREs to add detailed traces with very little effort. Particularly useful are the three key health metrics: request rate, errors, and duration (RED). This visibility empowers service developers to quickly identify the most critical failures within and across the services they own.
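As a rough illustration of what the RED metrics capture, the sketch below derives them from a window of request records. It does not use the Wavefront tracing SDK; the `RequestRecord` type, the `red_metrics` function, and the choice of a nearest-rank 95th percentile for duration are all assumptions made for this example.

```python
# Illustrative only: RequestRecord and red_metrics are hypothetical names,
# and this is not how the Wavefront SDK computes or reports RED metrics.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    duration_ms: float  # how long the request took
    is_error: bool      # whether the request failed


def red_metrics(requests: List[RequestRecord], window_seconds: float) -> Dict[str, float]:
    """Compute request rate, error ratio, and p95 duration for one time window."""
    total = len(requests)
    if total == 0 or window_seconds <= 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_duration_ms": 0.0}

    errors = sum(1 for r in requests if r.is_error)
    durations = sorted(r.duration_ms for r in requests)

    # Nearest-rank percentile, using integer math to avoid float rounding:
    # rank = ceil(0.95 * total), clamped to at least 1.
    rank = max(1, (95 * total + 99) // 100)

    return {
        "rate_per_s": total / window_seconds,    # R: requests per second
        "error_ratio": errors / total,           # E: fraction of failed requests
        "p95_duration_ms": durations[rank - 1],  # D: 95th percentile latency
    }


if __name__ == "__main__":
    sample = [RequestRecord(duration_ms=float(i), is_error=(i > 18)) for i in range(1, 21)]
    print(red_metrics(sample, window_seconds=10.0))
```

Tracking a percentile rather than the mean duration is a common choice because a few slow requests can hide behind a healthy-looking average.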

The First Pane of Glass for all Engineering

Ultimately, Wavefront became the first pane of glass for this VMware team. Wavefront readily meets the CMBU SREs' need to detect a problem quickly and resolve it without delay: ‘With Wavefront, you’re right in context.’ Automation services rely on dashboards on several levels, with Wavefront links sent along with alerts for immediate access. To summarize, use cases for the first pane of glass include:

  • Full-stack monitoring for all CMBU services on public cloud (AWS), Kubernetes, and containerized microservices
  • Private data center monitoring in North America, Europe, and Asia
  • Monitoring of monitoring silos with a unified view across Pingdom, PagerDuty, etc.
  • Measuring and monitoring SLAs for all cloud services
  • Monitoring CI/CD pipeline including Jenkins servers, etc.

The Wavefront platform stands out for:

  • 3D Observability – metrics, traces, and histograms
  • A powerful advanced analytics engine
  • Easy usage monitoring and reporting
  • OOTB Kubernetes visibility
  • Open-source agents for data collection
  • Future proof high scale
  • Language & framework agnostic
  • Easy onboarding for new developers

With Wavefront, you gain a reliable, scalable platform for your cloud applications, be it in a public or private cloud, without hidden, additional costs associated with dynamic infrastructure and support. Try it and see for yourself what is possible with Wavefront.


The post How VMware Cloud Engineering Team Exceeds SLAs Using Wavefront appeared first on Wavefront by VMware.

About the Author

Stela Udovicic

Stela Udovicic (@stela_udo) is a Director of Product Marketing at VMware leading Tanzu Observability by Wavefront PMM team. Before VMware, while at Wavefront, as Sr. Director, Product Marketing, she led Product, Solutions and Partner Marketing. Before Wavefront, Stela led Product Marketing for Splunk's DevOps, IT Ops, storage, and networking solutions. Stela holds an MSc in Electrical Engineering. She has presented at many major conferences, including Splunk.conf, VMworld, DevOps Days, Cisco Live, RSA, Monitorama, PuppetConf, NetApp Insight, etc.
