Tracing the Path to Clear Visibility in DevOps

June 24, 2021 Mary Chen

Today, we’re excited to announce enhancements to the VMware Tanzu Observability by Wavefront platform, which helps teams scale their observability practices and shorten the feedback loops between development and operations. The new features give more flexibility and functionality to any open source investments; help operations, development, and SRE teams resolve problems faster; and extend observability more efficiently into DevOps workflows.  

Here’s a quick rundown of what’s new. 

Cost-effective microservices performance troubleshooting 

In 2018, we added distributed tracing functionality to our observability platform for troubleshooting complex microservices-based applications. Along with enriched information that helps users visualize and understand how services interact, Tanzu Observability intelligently samples the data in a way that’s cost-effective for finding errors and performance problems in your code.

Tanzu Observability intelligent sampling analyzes the entire set of trace data that your application emits, but only retains the information you need to know. When determining whether a trace is worth retaining, the platform considers its characteristics in light of its history based on three key metrics: request rate, request errors, and request duration (RED). RED metrics, a subset of the Google Golden Signals, give us a standardized starting point for troubleshooting microservices and any request-driven applications. Tanzu Observability derives RED metrics from all the spans ingested to provide a highly accurate picture of your application’s behavior. This enables the platform to determine whether an analyzed trace is a true outlier.
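To make the RED idea concrete, here's a minimal sketch (purely conceptual, not the platform's actual implementation) of how a window of spans for one service rolls up into rate, errors, and duration:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Span:
    service: str
    duration_ms: float
    is_error: bool

def red_metrics(spans: List[Span], window_seconds: float) -> dict:
    """Roll a window of spans for one service up into RED metrics."""
    total = len(spans)
    errors = sum(1 for s in spans if s.is_error)
    durations = sorted(s.duration_ms for s in spans)
    p95 = durations[int(0.95 * (total - 1))] if total else 0.0
    return {
        "request_rate_per_sec": total / window_seconds,  # R: how many requests
        "error_rate_per_sec": errors / window_seconds,   # E: how many of them fail
        "duration_p95_ms": p95,                          # D: how long they take
    }

# Example: four spans observed over a 60-second window.
print(red_metrics([Span("payment", 120, False), Span("payment", 300, True),
                   Span("payment", 90, False), Span("payment", 2000, False)], 60))
```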

As our customers began adopting distributed tracing, it became apparent that their DevOps teams needed the flexibility to define specific spans to retain, irrespective of the intelligent sampling configuration, in order to meet the monitoring requirements of each application in a cost-effective way.

Flexible sampling configuration for business needs 

Now, you can create a policy and choose the specific spans (by service, API, tags, HTTP status, and more) that you want to investigate in detail. You can also choose to retain more than what intelligent sampling retains. Customers who need 100 percent visibility into business-critical services, such as a payment service, may want to retain all spans associated with that service for root cause analysis. Another good use case is onboarding applications: capture all errors to understand application behavior in production, then simply deactivate the policy once the issue is resolved. You also have the option to save the traces locally and review them at a later time.
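As a rough illustration of the kind of rule such a retention policy expresses (the service name and fields below are hypothetical, and in Tanzu Observability you define the policy in the UI rather than in code):

```python
def should_retain(span: dict) -> bool:
    """Hypothetical retention rule: keep every span from a business-critical
    service and every span that carries a server-error HTTP status."""
    return (
        span.get("service") == "payment"                 # 100 percent visibility for payments
        or int(span.get("http.status_code", 0)) >= 500   # keep all server errors
    )

# Example: both of these spans would be retained regardless of sampling.
print(should_retain({"service": "payment", "http.status_code": 200}))  # True
print(should_retain({"service": "cart", "http.status_code": 503}))     # True
```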

OpenTelemetry support for distributed tracing

In addition to the various instrumentation and ingestion methods that Tanzu Observability already supports for collecting trace data, we have extended the “contrib” OpenTelemetry collector by adding a Tanzu Observability (Wavefront) exporter so you can drop this integration into your current OpenTelemetry setup.

Here’s how to get your trace data into the platform using the OpenTelemetry collector and the new Tanzu Observability exporter. You deploy the OpenTelemetry collector as an intermediate data aggregator that receives telemetry data from your service. Then, in the OpenTelemetry collector, you enable the Tanzu Observability exporter to send trace data to Tanzu Observability via the Wavefront proxy.
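Here's a minimal sketch of the service side of that workflow, using the OpenTelemetry Python SDK to push spans over OTLP to a collector running alongside the service (the service name and endpoint are illustrative); the collector's Tanzu Observability exporter then forwards the data to the Wavefront proxy:

```python
# Requires: pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service emitting the spans (name is illustrative).
provider = TracerProvider(resource=Resource.create({"service.name": "payment-service"}))

# Ship spans to a local OpenTelemetry collector over OTLP gRPC (default port 4317).
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Each span emitted here flows: service -> collector -> Tanzu Observability exporter -> Wavefront proxy.
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("http.status_code", 200)
```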


The Tanzu Observability exporter resides in the opentelemetry-collector-contrib repository.

Tanzu Observability enriches OpenTelemetry data with Application Maps (App Maps) to help you understand interdependencies for faster root cause analysis. Whether your microservice is calling an external cloud service API or an external database, we derive RED metrics from any service-to-service communication, be it internal or external. This helps developers debug the microservice they developed and identify potential bottlenecks with any external components.

You can also set the Apdex score threshold for each service to track perceived user satisfaction. Apdex scoring, accepted throughout the industry, rates a service’s response times against the response time threshold that you define. In Tanzu Observability, the Apdex scores are shown alongside the RED metrics to provide a comprehensive view of service quality.
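The Apdex formula itself is simple: requests at or under the threshold T count as satisfied, requests between T and 4T count as tolerating (at half weight), and anything slower counts as frustrated. A quick sketch:

```python
from typing import List

def apdex(response_times_ms: List[float], threshold_ms: float) -> float:
    """Standard Apdex score: (satisfied + tolerating / 2) / total samples."""
    satisfied = sum(1 for t in response_times_ms if t <= threshold_ms)
    tolerating = sum(1 for t in response_times_ms if threshold_ms < t <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(response_times_ms)

# Example: with a 300 ms threshold, these four samples score 0.625.
print(apdex([120, 250, 900, 2000], threshold_ms=300))
```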

Learn more about using distributed tracing with OpenTelemetry in Tanzu Observability in this webinar.

Alert creation from the Application Status screen

Any time that we can shave from a developer’s workflow is a worthwhile pursuit. On the App Map, developers can see all active alerts along with the individual service’s health metrics. And now they can easily create alerts from tracing-derived metrics on the Application Status screen instead of having to switch to the Alerts creation screen.

Reliable open source investment augmentation at scale 

Monitoring containerized applications within the Kubernetes infrastructure is no easy task. While Prometheus is a great tool to start monitoring Kubernetes along with the code and features that developers deliver, as an organization grows and wants to roll out monitoring and observability as a service across all teams, that tool may not be enough. For those organizations choosing to keep their existing Prometheus installations, Tanzu Observability enhances their open source investments by adding hyperscale clustering, long-term data storage, and enterprise-ready controls. This provides enterprise organizations with a single, reliable platform for analyzing all their data. 

Accelerated DevOps teams’ productivity with PromQL support

At VMworld 2020, we announced support for PromQL along with other features to accelerate DevOps teams’ productivity. Tanzu Observability continues to improve the level of support for the PromQL language and is undergoing compliance testing. This means Kubernetes platform operators, SREs, and developers can leverage community-built queries and use their Prometheus query skills within the Tanzu Observability UI to create new dashboards and alerts with confidence.

When you’re ready, you can easily transition to the Wavefront Query Language for more functions and to utilize high-performance querying of metrics, histograms, traces, and other data to surface insights you care about. That means preserving the developer and operations experience while taking care of your metrics, storage, high availability, scaling, and maintenance so you can get back to delivering great experiences to your customers. 

Improved alerting workflow to reduce MTTR

A great alerting experience helps modern DevOps teams achieve faster incident resolution times by filtering noise and capturing true anomalies. Alerting in Tanzu Observability is enabled by a powerful query language and the ability to ingest data in real time. Given that software development is getting faster and faster, and the underlying infrastructure is getting more complex and dynamic, we’re designing a new experience that makes all aspects of alerting purposeful and useful. 

Managing alerts during maintenance windows

Creating a maintenance window when systems are undergoing maintenance is a useful way to reduce unnecessary noise. Previously, you configured which alerts a maintenance window should affect through a combination of alert tags, source tags, and/or source names. However, it’s often helpful for an SRE to be able to use data within point tags to determine what should be affected by a maintenance window.

Now you can add point tags in the redesigned maintenance window creation flow to group systems together by environment, data center, region, customer, or whatever tags you have on your data. Use multiple point tags to narrow the scope of where you want the maintenance window to take effect.

When performing maintenance on your systems, you may not want to suppress alerts, but instead have the team doing the work notified when alerts fire. There’s now an option to send alerts to a different target so you have a record showing that something happened during maintenance.

Stay tuned for more and better alerting experiences!

More capabilities on the way

These enhancements to the Tanzu Observability platform will help you scale your observability practice while facilitating teamwork and reducing friction. These are just a few of the many capabilities that are on the way to help DevOps teams manage the complexity of operationalizing microservices in multi-cloud, multicluster Kubernetes environments. We believe Tanzu Observability offers the most effective way to help anticipate problems and discover bottlenecks in your production environment on Day 1 of operations. 

Start seeing everything with Tanzu Observability today! Sign up for free

About the Author

Mary Chen

Mary is a senior product marketing manager in VMware’s Modern Apps Platform business unit, where she is responsible for helping customers succeed in modern DevOps environments using VMware Tanzu Observability. Prior to VMware, she was a senior industry marketing manager at Splunk, working at the intersection of retail and marketing. And before that, she worked in various roles helping bring data analytics, security, and network management solutions to market across industries.
