Hybrid cloud observability is essential for customers of VMware Cloud on AWS. Observability helps engineering teams see how their on-prem and cloud applications, as well as the underlying infrastructure, perform. In this blog I recap how SREs on the VMware Cloud on AWS team use Tanzu Observability by Wavefront to proactively root-cause issues before customers are affected, as well as meet service-level objectives (SLOs) and improve the quality of their releases.
Early days of Tanzu Observability for VMware Cloud on AWS SREs
Jon Cook is a senior staff SRE architect on the VMware Cloud on AWS team. There are 60 engineers on his SRE team. Jon has been using the Tanzu Observability platform for more than three years, and its usage has spread like wildfire through his organization. Now, at VMware Cloud on AWS, core services teams, including the NSX SRE, ESX and vCenter teams, look to Tanzu Observability data for QA testing, scale and performance monitoring.
SREs on the VMware Cloud on AWS team initially started using Tanzu Observability to understand availability, in particular to track the up/down states of their service. In addition, many black box tests were pushing related data into the platform, then reporting availability. Even today, SREs can query Tanzu Observability and obtain the availability of VMware Software Defined Data Centers (SDDCs) over a specified period.
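To make the availability calculation concrete, here is a minimal sketch of how availability over a period can be derived from up/down samples. It is an illustration only, not the SRE team's actual query: the `Sample` type, the `availability` function and the one-minute sampling interval are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Sample:
    """One up/down observation (hypothetical shape, for illustration)."""
    timestamp: int  # epoch seconds
    up: bool


def availability(samples: Iterable[Sample], start: int, end: int) -> float:
    """Return the fraction of samples in [start, end) that reported 'up'."""
    window = [s for s in samples if start <= s.timestamp < end]
    if not window:
        raise ValueError("no samples in the requested window")
    return sum(s.up for s in window) / len(window)


# Ten one-minute samples, with a single 'down' observation at t=120.
samples = [Sample(t, up=(t != 120)) for t in range(0, 600, 60)]
print(f"{availability(samples, 0, 600):.1%}")  # prints 90.0%
```

Counting samples treats every observation as equally weighted, which is reasonable only when the black box tests report at a fixed interval.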
The SRE team deployed Wavefront to help them troubleshoot the performance of various SDDC components, such as networking, compute and storage. Because Wavefront is equipped with a set of dashboards and alerts, SREs can quickly isolate the SDDC of a specific customer to see the related performance data and any important metrics. The Tanzu Observability tiles would show whether the historical performance metrics were within expected ranges or not, and also make clear if anything needed SRE attention from a performance standpoint.
Full-stack observability of VMware Cloud on AWS: applications, Kubernetes and cloud infrastructure
VMware Cloud on AWS currently has hundreds of customers, and the SRE team uses Tanzu Observability to monitor the critical performance data coming out of each of their data centers.
Tanzu Observability ingests data from all SDDC components: vSphere, vSAN and NSX. It also gathers data from Kubernetes clusters, SaaS apps and miscellaneous Linux systems, as well as all components of the VMware Cloud on AWS application itself. Recently, the SRE team has started using Tanzu Observability distributed tracing as well. As a result, all data is unified, which helps the SREs to quickly troubleshoot issues across different vSphere, vSAN and NSX sources.
For example, when the engineers were onboarding a large financial corporation—one of the biggest VMware Cloud on AWS customers—some performance issues emerged around networking between on-prem and off-prem components.
Tanzu Observability has helped VMware Cloud SREs identify many issues, enabling them to make improvements and optimizations that include:
- Elimination of network bottlenecks, and more reliable compute and storage performance
- Enhanced overall service performance and availability
- Improved customer satisfaction, as a result of an engineering initiative aimed at increasing the observability metrics and tracing provided by VMware products (the engineering team’s goal is to provide self-service observability to all VMC and vSphere customers)
- Faster troubleshooting using metrics and traces, including expanded instrumentation in the NSX and vCenter code that pushes more data into Tanzu Observability from within those appliances (no need for external collectors).
There is now a large amount of rich data coming in, including span log data from vCenter.
Detailed vSphere observability for all VMware Cloud on AWS
One set of dashboards that the SRE team finds extremely useful visualizes vSphere performance and utilization metrics. The SREs gather and visualize various metrics about utilization, including from the NSX Edge Network, NSX Resources Utilization and NSX storage statistics. As Figure 1 illustrates, SREs also have deep vSphere observability, including vCenter, NSX Manager and the Controllers' control plane performance metrics. The dashboards shown in Figures 1-4 are just a small sample of hundreds of dashboards created by SREs.
Figure 1. NSX Edge resource utilization metrics
VMware Cloud on AWS SREs look at ESXi performance and anything related to ESXi for comprehensive visibility. For instance, they can instantly spot oversubscription issues (indicated in yellow in Figure 2) or potential memory bottlenecks (shown in red), which helps the team proactively remediate emerging issues.
Figure 2. ESXi performance and utilization metrics
Tanzu Observability also gathers detailed vSAN metrics. Once all the different metrics are combined, the SRE team has detailed insight into cluster performance, including details around vSAN cluster consumption, back-end consumption and more.
Figure 3. Detailed vSAN metrics
No more “my workloads are slow” issues!
The VMC SREs visualize performance metrics for all VMs and workloads, as shown in Figure 4. They can quickly identify the top workloads that are having issues across compute, storage and networking metrics. And they can see latency problems even before customers report them, for example by figuring out which VMs are causing issues due to extreme network utilization.
Figure 4. Detailed VM metrics
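As a rough illustration of the "top workloads" idea, the sketch below ranks VMs by network throughput. The function name `top_talkers`, the VM names and the Mbps figures are invented for this example and do not come from the SRE team's dashboards.

```python
import heapq
from typing import List, Mapping, Tuple


def top_talkers(vm_network_mbps: Mapping[str, float], n: int = 3) -> List[Tuple[str, float]]:
    """Return the n VMs with the highest network utilization, highest first."""
    return heapq.nlargest(n, vm_network_mbps.items(), key=lambda kv: kv[1])


# Hypothetical per-VM throughput readings in Mbps.
readings = {"vm-a": 120.0, "vm-b": 940.5, "vm-c": 15.2, "vm-d": 610.0}
print(top_talkers(readings, n=2))  # prints [('vm-b', 940.5), ('vm-d', 610.0)]
```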
See everything in hybrid cloud: public and private cloud applications and infra performance in one view
The application migration capabilities of AWS Hybrid are one of the main reasons behind the enormous success of VMware Cloud. In the future, combining Tanzu Observability with VMware Cloud on AWS should allow VMware customers to understand the performance of their applications running on the hybrid cloud.
As our senior staff SRE architect Jon notes, “Anything that we start to use in VMware Cloud on AWS, from SaaS to SDDC, will have to give us telemetry data in Wavefront (now Tanzu Observability). That is a given now.”
Unified vCenter performance metrics and SLOs
To gain better visibility into vCenter, the SRE team also plans to add Tanzu Observability Distributed Tracing. The SREs’ goal is to define SLOs for the selected services. If an SLO is in violation, the VMware Cloud on AWS SREs can use the span log data to determine exactly where that violation is occurring and feed that information back to the appropriate team to remediate. Detecting SLO violations as soon as performance is affected is important for the SRE team, as it helps them to be proactive.
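The following sketch shows one way a latency SLO could be checked against span data and the slowest operation pinpointed. It is an assumption-laden illustration, not Tanzu Observability's API: the `Span` and `LatencySLO` types, the `slo_violations` function, the service names and the thresholds are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Span:
    """Simplified trace span (hypothetical shape, for illustration)."""
    service: str
    operation: str
    duration_ms: float


@dataclass(frozen=True)
class LatencySLO:
    threshold_ms: float  # a request is "good" if it finishes within this time
    target: float        # required fraction of good requests, e.g. 0.99


def slo_violations(
    spans: Iterable[Span], slos: Dict[str, LatencySLO]
) -> Dict[str, Tuple[float, str]]:
    """Map each violating service to (good_fraction, slowest_operation)."""
    by_service = defaultdict(list)
    for span in spans:
        by_service[span.service].append(span)

    violations = {}
    for service, slo in slos.items():
        svc_spans = by_service.get(service, [])
        if not svc_spans:
            continue  # no data for this service, so nothing to evaluate
        good = sum(s.duration_ms <= slo.threshold_ms for s in svc_spans)
        fraction = good / len(svc_spans)
        if fraction < slo.target:
            slowest = max(svc_spans, key=lambda s: s.duration_ms)
            violations[service] = (fraction, slowest.operation)
    return violations


# Hypothetical spans and SLOs.
spans = [
    Span("inventory", "list_vms", 120.0),
    Span("inventory", "list_vms", 90.0),
    Span("inventory", "clone_vm", 2500.0),
    Span("inventory", "power_on", 80.0),
    Span("auth", "login", 40.0),
]
slos = {
    "inventory": LatencySLO(threshold_ms=500.0, target=0.9),
    "auth": LatencySLO(threshold_ms=200.0, target=0.9),
}
print(slo_violations(spans, slos))  # prints {'inventory': (0.75, 'clone_vm')}
```

Returning the slowest operation alongside the good-request fraction mirrors the workflow described above: the violation is detected at the service level, and the span data indicates where to look first.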
Understanding VMware HCX performance
VMware HCX is the hybrid networking solution for VMware Cloud on AWS that, among other things, enables on-prem to off-prem vMotion and data replication. It handles all of the WAN traffic for VMware Cloud on AWS. Tanzu Observability is helping provide visibility not just into vSphere but also into HCX by using HCX APIs that expose performance metrics, which in turn get included in Tanzu Observability dashboards.
Exceptional hybrid cloud performance with Tanzu Observability and VMware Cloud on AWS
Using Tanzu Observability, the VMware Cloud on AWS SRE team evolved from having to spend many hours gathering a complete picture of their environment to having instantaneous access to rich performance and troubleshooting data. SREs no longer need to log in to vCenter to get insights, for example, but can provide their customers with instant access to performance data for their SDDCs.
And large VMware Cloud on AWS customers are now able to go to Tanzu Observability to see any performance issues; their days of calling VMware support to diagnose the issue or correlate their data are finally behind them. The VMware Cloud on AWS team has, in other words, made a significant impact on its customers’ satisfaction. To take Tanzu Observability by Wavefront for a spin yourself, check out our free trial.
About the Author
Stela Udovicic