Monitor vSphere with Wavefront: An In-Depth Walkthrough

March 12, 2019 Pontus Rydin

A while ago, Wavefront announced a vSphere plugin contribution to the Telegraf project. Telegraf is the data collection engine for many Wavefront integrations. The vSphere plugin allows pulling vSphere metrics into Wavefront. The exact set of available counters is dependent on the version of vSphere and the data collection level configured in vCenter, but a near-complete list of metrics is available here. Recently, we released the full Wavefront integration, which includes several predefined dashboards. You can deploy the vSphere plugin just like you would any other integration – and we even give you some best practices!

In this blog, I discuss how to set up vSphere monitoring in Wavefront and explore available metrics and dashboards.

What is Telegraf?

Telegraf is an open source metrics collector capable of pulling metrics from a wide range of technologies, including operating systems, application servers, and middleware. Because of its small footprint, a vast ecosystem of data sources, and a very active developer community, Wavefront has chosen Telegraf as its primary data collector.

Enabling the vSphere Plugin

Preparing for Installation

To use the vSphere plugin, you don’t have to run a Telegraf instance on the vCenter host. VMware discourages running plugins inside vCenter server because that can cause problems during upgrades. Instead, you can deploy the vSphere plugin on a separate host. The vSphere plugin also requires that you run a Wavefront proxy, either on the same host as the Telegraf collector or a different host. As long as sizing guidelines are followed, the vSphere plugin can send its data through a shared Wavefront proxy. Here are the requirements for a machine that hosts the vSphere Telegraf plugin:

  • 40 GB storage (Telegraf and Wavefront proxy)
  • 8 GB memory (Telegraf and Wavefront proxy) or 6 GB memory (Telegraf only, i.e., Wavefront proxy on a separate host)
  • 4 CPUs

This configuration has been tested with up to 7,000 virtual machines; larger environments may require somewhat more compute resources. The 8 GB memory figure assumes a dedicated Wavefront proxy running on the same host as the Telegraf collector. If the Wavefront proxy runs on a separate host, the memory can be reduced to 6 GB.
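As a quick sanity check before installation, the sizing guidance above can be expressed as a small script. This is only a sketch: the `HostSpec`, `required_spec`, and `shortfalls` names are hypothetical, and it assumes the 40 GB storage figure applies to both layouts, because the guidance gives a Telegraf-only number only for memory.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostSpec:
    storage_gb: int
    memory_gb: int
    cpus: int


def required_spec(proxy_on_same_host: bool) -> HostSpec:
    """Minimums from the sizing list; memory depends on where the proxy runs."""
    # Assumption: storage stays at 40 GB for both layouts.
    memory_gb = 8 if proxy_on_same_host else 6
    return HostSpec(storage_gb=40, memory_gb=memory_gb, cpus=4)


def shortfalls(actual: HostSpec, proxy_on_same_host: bool) -> List[str]:
    """Return one message per resource that falls below the minimum."""
    need = required_spec(proxy_on_same_host)
    checks = [
        ("storage", actual.storage_gb, need.storage_gb, " GB"),
        ("memory", actual.memory_gb, need.memory_gb, " GB"),
        ("cpus", actual.cpus, need.cpus, ""),
    ]
    return [
        f"{name}: have {have}{unit}, need {want}{unit}"
        for name, have, want, unit in checks
        if have < want
    ]


if __name__ == "__main__":
    # A 6 GB host is too small when the proxy shares the machine.
    print(shortfalls(HostSpec(40, 6, 4), proxy_on_same_host=True))
```

An empty list means the host meets the minimums for the chosen layout. Environments well above 7,000 virtual machines are outside what this check covers.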

Telegraf runs on a wide variety of operating systems and platforms, but we strongly recommend running it on a 64-bit system, because the vSphere SDK has some limitations when running on 32-bit platforms. For our internal tests, we ran Telegraf on Linux operating systems, but any modern Windows version should work too.

Installing Telegraf

When you have the Telegraf host prepared, you can install the collector by following the guidelines on the Wavefront Integrations tab:

  1. Log in to your Wavefront account or sign up for the free trial
  2. Click on the Integrations tab
  3. Search for vSphere and select the VMware vSphere tile
  4. Go to the Setup tab
  5. Select the proxy from the dropdown (assuming you will use an existing Wavefront proxy)
  6. On the host where you want to install Telegraf, copy and paste the generated installation command into a terminal window

Figure 1 shows how the command might look in your lab.

Figure 1. Telegraf Installation Command

Configuring Telegraf

Before you can start collecting metrics, you need to perform some basic Telegraf configuration. The configuration steps depend on your platform, and you can find the exact procedure on the Wavefront integration page. The steps include copying a configuration file and editing the address and credentials. Note that the vCenter address must be specified as a full URL, including the /sdk resource path at the end. If you want to pull metrics from multiple vCenter Server instances and you can use the same credentials for all of them, you can specify a comma-separated list of URLs. Here is an example of such a configuration:

[[inputs.vsphere]]
vcenters = [ "https://vc1.corp.local/sdk", "https://vc2.corp.local/sdk", "https://vc3.corp.local/sdk" ]
username = "user@corp.local"
password = "secret"
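If you manage many vCenter Server instances, it can help to generate this stanza rather than type the URLs by hand, since a missing /sdk path is an easy mistake to make. The following is a minimal sketch, not part of the integration: `sdk_url` and `vsphere_stanza` are hypothetical helper names, and the code assumes the user name and password contain no quotes or backslashes that would need escaping.

```python
from typing import Iterable
from urllib.parse import urlparse


def sdk_url(host_or_url: str) -> str:
    """Return a full vCenter URL, adding https:// and /sdk when they are missing."""
    url = host_or_url if "://" in host_or_url else "https://" + host_or_url
    parsed = urlparse(url)
    if not parsed.netloc:
        raise ValueError(f"no host found in {host_or_url!r}")
    path = parsed.path.rstrip("/")
    if not path.endswith("/sdk"):
        path += "/sdk"
    return f"{parsed.scheme}://{parsed.netloc}{path}"


def vsphere_stanza(hosts: Iterable[str], username: str, password: str) -> str:
    """Render an [[inputs.vsphere]] stanza with one shared set of credentials."""
    # Assumption: values need no escaping (no quotes or backslashes).
    urls = ", ".join(f'"{sdk_url(h)}"' for h in hosts)
    return "\n".join(
        [
            "[[inputs.vsphere]]",
            f"vcenters = [ {urls} ]",
            f'username = "{username}"',
            f'password = "{password}"',
        ]
    )


if __name__ == "__main__":
    print(vsphere_stanza(["vc1.corp.local", "https://vc2.corp.local/sdk"],
                         "user@corp.local", "secret"))
```

The output has the same shape as the example above and can be pasted into the Telegraf configuration file.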

When you are done with the configuration, you need to restart the Telegraf agent. On a Linux system, use the following command:

sudo service telegraf restart

Verifying the Installation

When configuration is complete, you can verify the installation by clicking on the Metrics tab of the vSphere integration you used earlier. Now you can browse the metrics, along with the point ingestion counters. It may take a couple of minutes before you see data because the initial resource discovery takes some time. When data starts flowing, you see a dashboard similar to the one shown in Figure 2.

Figure 2. Wavefront vSphere Dashboard Sample

Dashboards

The vSphere plugin for Wavefront comes with a full set of dashboards for monitoring vSphere infrastructure components. You can access the dashboards by selecting the Dashboards sub-tab under integrations (see Figure 3).

Figure 3. Accessing Wavefront vSphere Dashboards

From here, you have access to a preconfigured dashboard for every major construct in vSphere, such as virtual machines, ESXi hosts, clusters, and datastores. There is also a Summary dashboard (see Figure 4) that gives you an overview of overall vSphere performance. Let’s have a look!

Figure 4. Wavefront vSphere Summary Dashboard

On this dashboard, you can review the overall health of the components and constructs in vSphere and find the busiest components in the environment. This dashboard gives application troubleshooters insight into how the infrastructure is affecting their applications. Below is an example of how application performance metrics and vSphere metrics can be used together to conduct a powerful root-cause analysis.

Advanced Use Case – Correlation between Application and Infrastructure

The Issue

Your application is experiencing intermittent periods of poor performance, manifested as high latency. The behavior is somewhat volatile and may be related to DRS (Distributed Resource Scheduler) vMotion migrations (hot migrations between hosts in a cluster). You hypothesize that DRS sometimes migrates your workload to a host that’s performing poorly. Unfortunately, the data in your traditional vSphere monitoring systems are too noisy to draw any conclusions.

The Solution

To diagnose the problem, your SRE puts together a few simple Wavefront queries to try to correlate application latency with host behavior. To keep things simple, you put the resulting graphs and tables on a custom dashboard (see Figure 5):

Figure 5. Wavefront vSphere Customized Dashboard for Troubleshooting Poor Application Performance

Let’s walk through the dashboard. In the top-left position, you see the “signal” whose source you are trying to find. In this case, it’s latency within an application. You have noticed occasional spikes in latency when users hit your application. Here you have zoomed in on one such event.

What kind of problem is it?

The first question you can ask is what type of delay this is. Is it CPU-, memory-, network- or IO-bound? In other words, what metric shows a pattern that’s similar to the latency spikes? To do this, you use the Wavefront mcorr() function against a selection of VM metrics with the latency pattern as a template. You then organize the metrics in a table sorted by their correlation score (see the table on the top right). You can easily see that the CPU metrics have the best correlation with the latency spikes, which leads you to believe your application is CPU-bound.
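To make the idea concrete, here is a rough offline analogue of that ranking step in Python. It is a sketch only, not Wavefront's implementation of mcorr(): it computes a Pearson correlation over a sliding window, averages the windows into one score per metric, and sorts the result. The function names, the averaging step, and the sample data are all assumptions made for illustration.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def moving_corr(xs: Sequence[float], ys: Sequence[float], window: int) -> List[float]:
    """Correlation of xs and ys over each full sliding window of the given size."""
    return [
        pearson(xs[i - window:i], ys[i - window:i])
        for i in range(window, len(xs) + 1)
    ]


def rank_metrics(
    latency: Sequence[float], metrics: Dict[str, Sequence[float]], window: int
) -> List[Tuple[str, float]]:
    """Score each metric by its mean moving correlation with latency, best first."""
    scores = {}
    for name, series in metrics.items():
        corr = moving_corr(latency, series, window)
        scores[name] = sum(corr) / len(corr)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Made-up samples: CPU usage spikes together with latency, memory does not.
    latency = [10, 10, 50, 10, 10, 60, 10, 10]
    metrics = {
        "cpu.usage": [20, 20, 90, 20, 20, 95, 20, 20],
        "mem.usage": [50, 51, 50, 51, 50, 51, 50, 51],
    }
    for name, score in rank_metrics(latency, metrics, window=4):
        print(f"{name}: {score:.2f}")
```

With this sample data the CPU series ranks first, mirroring the kind of table described above. On real data, the choice of window size affects the scores considerably.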

Where does it come from?

But why do you see this problem only intermittently? Maybe the application slows down when it lands on a certain ESXi host because of a host misconfiguration. Let’s correlate the spikes against CPU ready states on hosts. A ready state is a situation where a VM is ready to run, but the ESXi host can’t offer it a physical CPU to run on. Ready states can cause severe application performance degradation.

In this case, you are using a weighted correlation, which is just the correlation score multiplied by the absolute value of the ready states. You show this in the middle-right diagram. Here, you quickly notice that two hosts in the cluster have a much higher number of ready states, indicating that the balancing of the cluster is misconfigured. You can see the same thing graphically in the bottom-right diagram.
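The weighting matters because a host can correlate perfectly with the latency spikes while contributing almost no ready time. The sketch below shows one possible reading of the weighted correlation in Python: the Pearson correlation between latency and a host's ready-time samples, multiplied by the mean absolute ready time. The function names, the use of a mean as the magnitude, and the sample values are assumptions for illustration, not the queries used on the dashboard.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def weighted_score(latency: Sequence[float], ready: Sequence[float]) -> float:
    """Correlation with latency, scaled by the magnitude of the ready values."""
    # Assumption: "absolute value" is taken as the mean absolute sample value.
    magnitude = sum(abs(v) for v in ready) / len(ready)
    return pearson(latency, ready) * magnitude


if __name__ == "__main__":
    latency = [10, 10, 50, 10, 10, 60]
    # Hypothetical hosts: both track latency, but only one has large ready times.
    hosts = {
        "esx-a": [100, 100, 900, 100, 100, 950],
        "esx-b": [1, 1, 5, 1, 1, 6],
    }
    ranked = sorted(hosts, key=lambda h: weighted_score(latency, hosts[h]), reverse=True)
    print(ranked)
```

Both sample hosts follow the latency pattern closely, but the weighting pushes the host with large ready times to the top, which is the effect the weighted table is meant to surface.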

The Smoking Gun

Just by running a few simple queries, you were able to detect what kind of delay you were dealing with, how significant the correlation between that delay and the latency spikes was, and what part of the infrastructure caused it. In this case, there’s a very high likelihood that the problem was caused by poorly configured DRS.

Conclusion

In this blog, you explored the vSphere plugin for Wavefront in depth. You’ve also demonstrated how correlating application metrics with the infrastructure is crucial for troubleshooting complex applications spread across multiple hosts and clusters. For more information, please contact us. If you want to take Wavefront for a spin and hook up your vSphere environment, have a look at our 30-day free trial.

The post Monitor vSphere with Wavefront: An In-Depth Walkthrough appeared first on Wavefront by VMware.
