It’s Not a Network Problem! New Ways to Troubleshoot and Scale NSX-T Data Center with Wavefront

October 25, 2018 Stela Udovicic

It’s been exhilarating to witness the evolution of networking over the years – from traditional architectures with classic routers and switches, to Software Defined Networking, and on to the cloud networking environments built on top of it. It’s been just as fascinating to see the transformation of the tools used to troubleshoot networking. With the adoption of infrastructure as code for networking components, engineering teams are beginning to add new tooling to see beyond traditional networking. Because these next-gen monitoring platforms are optimized for cloud-native scale, they can be easily applied to cloud networking domains. With this shift, troubleshooting cloud networking infrastructure is unified with troubleshooting cloud applications running on microservices and containers. In this blog, I’ll cover how the VMware NSX-T test team adopted these new methods for troubleshooting and scaling NSX-T Data Center environments, methods that were once the sole province of the most leading-edge DevOps, SRE and developer teams.

NSX-T Data Center (a.k.a. NSX-T) is an agile software-defined infrastructure for building cloud-native application environments. It’s focused on emerging application frameworks and architectures that have various technology stacks, including container environments. NSX-T Data Center works with vSphere, KVM, OpenStack, Kubernetes and Docker.

VMware’s NSX-T Scale Test Team needed a monitoring platform that both testers and developers could use to improve the quality of their releases and to efficiently resolve not only code issues but also issues with scaling their environments. In particular, they lacked the visibility needed to identify outliers and correlate large amounts of data across many sources. A typical test environment for them is NSX-T Data Center running a thousand hypervisors and 8,000 to 10,000 virtual machines with 20,000 virtual interfaces.

Without a monitoring tool that could scale to handle the vast amount of data generated by such an environment, it was hard for teams to find problems and resolve development and test issues. This is what led them to Wavefront, and they quickly realized they now had a solution that caters to the needs of both their developer and QA teams.

Enter Wavefront

The NSX Scale Test Team first decided to take advantage of Wavefront to find problems that were not easily detected by traditional test techniques. Users of NSX-T Data Center can take advantage of what the NSX Scale Test Team did (as described further below) to solve similar problems in large-scale NSX-T deployments. Moreover, there is more for NSX-T customers to gain from using Wavefront. They may need help from VMware support teams to understand their own scaling issues. NSX-T customers can give VMware support teams, who already know how to resolve such issues, remote access to their Wavefront dashboards. Some of these issues are highlighted below.

Troubleshoot Communications Issues Between NSX-T Management Control and Data Plane

The NSX-T architecture, at a high-level, includes the management plane, control plane, and data plane. The management plane is where the user logs in and makes configuration changes. The management plane pushes those changes to the control plane. The control plane decides where the configuration needs to be applied and pushes it to the data plane. Thus, each user action is processed on several levels. That generates a large amount of metrics data to be monitored to ensure successful communication across multiple planes.

The management plane, the control plane and the data plane all export metrics to Wavefront. This makes it much easier to correlate all of that data when a communication problem is detected. The NSX-T team created a dashboard set for each of these planes in Wavefront. When there is a problem in the data plane, they can correlate that data and see how the product was behaving at the control plane and the management plane, helping them troubleshoot issues across planes efficiently.
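
As a rough illustration of what exporting a metric can look like, the sketch below formats a single point in the Wavefront data format (`<metricName> <metricValue> [<timestamp>] source=<source> [pointTags]`), which is the line format a Wavefront proxy accepts. The metric name, source name and tag are hypothetical; the article does not say which metrics the NSX-T team actually exports or how.

```python
import time


def wavefront_line(metric, value, source, tags=None, timestamp=None):
    """Format one point in the Wavefront data format.

    Layout: <metricName> <metricValue> [<timestamp>] source=<source> [pointTags]
    """
    # Use the current epoch second when no timestamp is supplied.
    ts = int(time.time()) if timestamp is None else int(timestamp)
    # Point tags are rendered as key="value" pairs, sorted for stable output.
    tag_text = "".join(f' {k}="{v}"' for k, v in sorted((tags or {}).items()))
    return f"{metric} {value} {ts} source={source}{tag_text}"


# Hypothetical metric, source and tag names, for illustration only.
line = wavefront_line(
    "nsx.ccp.rpc.connections",
    42,
    "ccp-node-1",
    tags={"plane": "control"},
    timestamp=1540425600,
)
print(line)
```

Tagging every point with the plane it came from is one simple way to let a single dashboard query slice the same metric by management, control or data plane.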

For example, let’s say a specific configuration didn’t get pushed down to Local Control Plane (LCP) and at the same time RPC connections are bouncing. Wavefront can help them to identify the root of the issue and figure out if the cause for bouncing was a config push from Management Plane (MP) to Central Control Plane (CCP). That’s a good data point to understand the problem better. With all performance metrics and configuration events in one place, it is easy to decide where a problem hotspot is.
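
The kind of correlation described above can be sketched outside of Wavefront as well. The example below is a minimal sketch, assuming we already have two lists of epoch-second timestamps: one for MP-to-CCP config pushes and one for RPC connection resets. The function name, the 30-second window and the sample data are all assumptions made for illustration, not part of the NSX-T team's tooling.

```python
from bisect import bisect_left, bisect_right


def resets_following_pushes(push_times, reset_times, window_s=30):
    """Map each config push to the RPC resets seen within window_s seconds after it."""
    resets = sorted(reset_times)
    result = {}
    for push in push_times:
        # Binary search for the slice of resets in [push, push + window_s].
        lo = bisect_left(resets, push)
        hi = bisect_right(resets, push + window_s)
        result[push] = resets[lo:hi]
    return result


# Hypothetical timestamps (seconds): two config pushes and four RPC resets.
pushes = [100, 400]
resets = [105, 120, 250, 405]
correlated = resets_following_pushes(pushes, resets)
print(correlated)
```

In this sample data the reset at 250 does not follow any push, which is the sort of data point that suggests looking for a different cause.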

Troubleshoot Connection Resets

Another use case for the NSX-T Scale Test Team was troubleshooting connection resets. The Wavefront platform can show them how many connection resets happened over a given period. If they drill down further, Wavefront can show the reasons for the connection resets. In addition, the test team created a Wavefront dashboard for RPC calls. The RPC count is available, too. It shows them the number of connections that the Central Control Plane (CCP) makes to the Local Control Plane (LCP).

They can follow the data coming from the components of the CCP. Then they can look at the specific action which resulted in unexpected behavior. It tells them which particular element failed. Moreover, identifying failures in a scaled environment is precisely where Wavefront shines, providing visibility to teams across a vast amount of data.

Troubleshoot Increased Memory Usage Across Development and Test Teams

Another category of problems where Wavefront helped NSX engineers is isolating why memory usage across many NSX-T elements was spiking. Correlating memory growth across a large number of metrics sources over time helps teams troubleshoot critical issues in a scaled environment. If memory usage is high, they can use Wavefront to zoom into the granular and historical data during the memory spike. For example, they can look at JVM memory stats across all controller nodes while test runs are executing. Within the Controller Node CCP, they review multiple test runs, and when they see increased memory usage, they can file a bug to flag a potential issue in the code that developers can immediately take care of. This unified view helps teams not only troubleshoot faster but also optimize their code for better resource utilization.
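
The review of memory usage across test runs can be approximated with a simple comparison. The sketch below assumes we have the peak JVM heap (in MB) recorded per controller node for each test run, and flags nodes whose peak grew by more than a chosen percentage between the first and last run. The node names, the numbers and the 20 percent threshold are hypothetical.

```python
def memory_growth(peaks_by_node, threshold_pct=20.0):
    """Return nodes whose peak heap grew more than threshold_pct from first to last run."""
    flagged = {}
    for node, peaks in peaks_by_node.items():
        # Need at least two runs and a positive baseline to compute growth.
        if len(peaks) < 2 or peaks[0] <= 0:
            continue
        growth = (peaks[-1] - peaks[0]) / peaks[0] * 100.0
        if growth > threshold_pct:
            flagged[node] = round(growth, 1)
    return flagged


# Hypothetical peak heap usage (MB) per controller node over three test runs.
peaks = {
    "ccp-1": [1024, 1060, 1100],
    "ccp-2": [1024, 1300, 1650],
    "ccp-3": [1024, 1010, 1030],
}
flagged = memory_growth(peaks)
print(flagged)
```

A node flagged this way is only a candidate for a bug report; the per-second history is still what shows where in the run the growth happened.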

Identify Rare Anomalies in a Scaled Environment

Also, Wavefront makes it easy to search for anomalies in metrics over extended periods of time. A user can then decide whether those sporadic anomalies need to be addressed to avoid a buildup of unexpected behaviors in the scaled environment. Wavefront retains data at per-second granularity for 18 months.
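
To make the idea of a rare anomaly concrete, here is a minimal sketch that flags outliers in a metric series using the median absolute deviation (MAD). This is a generic statistical stand-in chosen for illustration; it is not a description of how Wavefront detects anomalies, and the sample values and the 3.5 threshold are assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indexes of points whose modified z-score exceeds threshold."""
    med = median(values)
    # Median absolute deviation: robust to the very outliers we are looking for.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # Series is too flat for this score to be meaningful.
    # 0.6745 scales the MAD so the score is comparable to a standard z-score.
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical per-interval latency samples with one sporadic spike.
latencies = [12, 11, 13, 12, 14, 95, 12, 13, 11, 12]
outliers = mad_outliers(latencies)
print(outliers)
```

A median-based score is used here instead of a mean and standard deviation because a single large spike would otherwise inflate the baseline it is being compared against.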

Deploy NSX-T Upgrades with Confidence, Validated with Analytics Insights

Upgrading NSX-T software is another excellent example of how Wavefront can help not only test teams but also NSX-T customers. Once the new NSX-T version is running, the same dashboards are available for users to look for new abnormal behaviors.

Summary

Armed with Wavefront, the NSX Scale team is reassured that NSX-T scaling is much less complicated when critical metrics and debug data are available in Wavefront dashboards for developer and QA teams. These real-time, per-second insights help them minimize bug buildup and improve the quality of their code, ultimately accelerating development.

The previous examples identified a range of issues that Wavefront addresses successfully, from configuration problems to individual component communication issues to quality issues during the development process. The powerful analytics from Wavefront can be used across many different use cases beyond those outlined here, from monitoring hybrid cloud (AWS, Azure, GCP) to understanding container and microservices performance. To see how Wavefront can help you, try Wavefront for yourself.


The post It’s Not a Network Problem! New Ways to Troubleshoot and Scale NSX-T Data Center with Wavefront appeared first on Wavefront by VMware.

About the Author

Stela Udovicic (@stela_udo) is a Director of Product Marketing at VMware, leading the Tanzu Observability by Wavefront PMM team. Before VMware, while at Wavefront, as Sr. Director, Product Marketing, she led Product, Solutions and Partner Marketing. Before Wavefront, Stela led Product Marketing for Splunk's DevOps, IT Ops, storage, and networking solutions. Stela holds an MSc in Electrical Engineering. She has presented at many major conferences, including Splunk.conf, VMworld, DevOps Days, Cisco Live, RSA, Monitorama, PuppetConf, and NetApp Insight.

