How to Find Silent Failures in Your Cloud Services Faster with Join() Function

September 30, 2019 Stela Udovicic

How do you find unknown unknowns? How do you detect silent failures in your cloud services involving hidden dependencies that are flying below your radar? If undetected, they can accumulate and be detrimental to your customers.

In this blog, I introduce the latest tool in your toolkit – the Wavefront join() function. It’s one of 100+ analytics functions within the Wavefront Query Language that can be used to drive dashboards and alerts. Think of the join() function as a way of combining and matching data from multiple data sources by using values common to each. While there are many uses of the join() function, I discuss below a few of the more common ways SREs use it to accelerate problem detection and investigation:

  • Finding silent failures – using the exclusive join() function
  • Finding hidden dependencies between related data sources to spot trends – using join() and partial tag matching. By quickly finding these correlations and dependencies across your stack, you realize the benefit of full-stack observability.

These are certainly not the only things an SRE can do with the join() function. The more you learn about it, the more you can do with it. Wavefront’s join() function is modeled after the SQL JOIN operation. Although it has its roots in relational databases, it can be applied in similar ways to metric time series data.

The Wavefront join() function supports inner joins, left outer joins, left exclusive joins, right outer joins, right exclusive joins, full outer joins, and full outer exclusive joins. If you’re not too familiar with join types, check out the Wavefront query language join() description in our documentation. Or watch the video below, which also walks you through how to create a Wavefront query using join().


Video Thumbnail for Join Function
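
To make the join types listed above concrete, here is a minimal Python sketch of which series each type keeps when two sets of series are matched on a common key. It is not Wavefront Query Language and not how Wavefront implements join(); the helper name join_keys and the host names are hypothetical, chosen only for illustration.

```python
# Illustrative only: models each join type as a selection over matching keys.
# join_keys() is a hypothetical helper, not part of any Wavefront API.
def join_keys(left, right, how):
    """Return the sorted keys that a join of the given type would keep."""
    l, r = set(left), set(right)
    selectors = {
        "inner": l & r,                 # keys present on both sides
        "left outer": l,                # every left key, matched or not
        "left exclusive": l - r,        # left keys with no match on the right
        "right outer": r,               # every right key, matched or not
        "right exclusive": r - l,       # right keys with no match on the left
        "full outer": l | r,            # every key from either side
        "full outer exclusive": l ^ r,  # keys present on exactly one side
    }
    return sorted(selectors[how])


# Hypothetical series, keyed by a shared "host" tag value.
deployed = {"host-a": 1, "host-b": 1, "host-c": 1}
running = {"host-b": 1, "host-c": 1, "host-d": 1}

for how in ("inner", "left exclusive", "right exclusive", "full outer exclusive"):
    print(how, join_keys(deployed, running, how))
```

The exclusive variants are the interesting ones for finding silent failures, because they return only the series that have no counterpart on the other side.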


And if you’re new to Wavefront and interested in learning more about our unique analytics-enabled approach to Enterprise Observability, explore our rich query language reference, which goes far beyond the join() function discussed in this blog.

Detecting Hidden Issues with Left Exclusive Join() Function

As an SRE, you may observe that your application, microservice, or container is functioning properly based on the uptime or availability metrics you receive. In the world of microservices and containers, there can be hundreds or more of those metric time series.

But as part of a dev team responsible for a particular microservice, how do you know that your microservice never started (it won’t emit metrics until it starts) or that a specific container instance isn’t functioning? No uptime metric data exists for containers that fail to start. It’s critical to learn about such problems quickly: SREs can be slow to respond to silent failures, and the delay becomes costly once customers are affected.

Again, when a service fails to start, SREs don’t have uptime metrics for that service. But they know which underlying hosts were intended to run the failed-to-start service, and they have uptime metrics from hosts running deployed services. If you’re familiar with the SQL JOIN operation, you may foresee how we will find the host(s) where the service is not running.

You can use the Left Exclusive join() function against two time series – (1) the time series of all of the hosts where the service is deployed and (2) the time series of all of the hosts where the service is running. This Wavefront function generates metrics for the hosts where the service is deployed but not running. SREs can use the resulting metric time series to generate alerts automatically. After a service fails to start, the alert notifies them of the problem. That allows fast triage and troubleshooting, which protects SLOs and avoids negative customer impact.
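
The logic described above can be sketched in a few lines of Python. This is only an illustration of what a left exclusive join computes, not the Wavefront implementation or its query syntax; the Series type, the left_exclusive_join helper, the host names, and the data points are all assumptions made up for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: host tag -> list of (timestamp, value) points.
Series = Dict[str, List[Tuple[int, float]]]


def left_exclusive_join(left: Series, right: Series) -> Series:
    """Keep the left-hand series whose host tag has no match on the right."""
    return {host: points for host, points in left.items() if host not in right}


# (1) Hosts where the service is deployed (made-up data).
deployed: Series = {
    "host-1": [(1569800000, 1.0)],
    "host-2": [(1569800000, 1.0)],
    "host-3": [(1569800000, 1.0)],
}
# (2) Hosts that report an uptime metric for the service (made-up data).
running: Series = {
    "host-1": [(1569800000, 420.0)],
    "host-3": [(1569800000, 415.0)],
}

# Hosts with the service deployed but not running: the silent failures.
silent = left_exclusive_join(deployed, running)
for host in sorted(silent):
    print(f"ALERT: service deployed but not running on {host}")
```

Because the result is itself a set of time series, an alert can fire whenever it is non-empty, which is the behavior the paragraph above relies on.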

Here is another example of the Left Exclusive join(). Suppose you want to find out how many CPU metrics in the kubernetes.node.cpu.usage_rate time series are not mapped properly in the node info gauge. You can perform a Left Exclusive join() to find them, then fix them so that they appear properly in the node info gauge. In the Wavefront dashboard below, you can see the join() result as well.

There’s yet another real-world use case for the Left Exclusive join() function. The Spectre vulnerability affecting modern chipsets resulted in a need to patch exposed hosts. But again, how do you find the exposed hosts that lack the critical patch?

You have a metric time series of all the hosts with the patch. You also have an uptime metric time series for all of the running hosts. All of the running hosts in this example need the patch. When you apply the Left Exclusive join() function to the time series of running hosts and the time series of hosts with the patch, the result is the time series of running hosts without the patch. Neat!

Use Join() Function and Partial Tag Matching to Achieve Full-Stack Observability

Wavefront’s join() function can be a powerful tool to help you correlate metric time series with point tags that only partially match. Having only partially matched tags is very common when you are trying to find dependencies between metrics in different parts of your stack.

For example, you can find trends between application metrics and Kubernetes pod metrics, or any correlation of data using similarly named point tags. The Wavefront join() function shows SREs the overlaps between two time series with partially matching tags and quickly guides them to understand the reasons for discrepancies.

Here’s an example of a similar use case where join() can be useful. Suppose that you have two different time series (metrics). The first time series, kubernetes.node.cpu.usage, includes a nodename tag that describes which node it comes from. However, it isn’t tagged with the AWS EC2 instance it’s running on. The other time series has both the nodename and provider_id tags, which detail the AWS EC2 instance ID the node is running on.

If the SRE wants to find out the relationship between a Kubernetes node and the underlying AWS EC2 instance, a Wavefront Inner join() can be used to join the two time series on their point tags, and the taggify() function can then extract the AWS EC2 instance ID as a new point tag.
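
As a rough Python illustration of that idea (not Wavefront Query Language, and not the actual query behind this example), the sketch below inner-joins two sets of series on nodename and then derives an instance_id tag from provider_id. The node names, values, and instance IDs are made up, the function join_with_instance_id is hypothetical, and the provider_id layout of aws:///<zone>/<instance-id> is an assumption about how the tag is populated.

```python
import re

# Made-up CPU usage series, tagged with nodename only.
cpu_usage = [
    {"nodename": "ip-10-0-1-12", "value": 0.42},
    {"nodename": "ip-10-0-1-77", "value": 0.91},
    {"nodename": "ip-10-0-9-99", "value": 0.10},  # no matching node info
]
# Made-up node info series, tagged with nodename and provider_id.
node_info = [
    {"nodename": "ip-10-0-1-12", "provider_id": "aws:///us-west-2a/i-0aaa1111bbbb2222c"},
    {"nodename": "ip-10-0-1-77", "provider_id": "aws:///us-west-2b/i-0ddd3333eeee4444f"},
]

# Assumption: the EC2 instance ID is the trailing "i-..." part of provider_id.
INSTANCE_ID = re.compile(r"(i-[0-9a-f]+)$")


def join_with_instance_id(usage, info):
    """Inner-join on nodename, then add the instance ID as a new tag."""
    info_by_node = {row["nodename"]: row for row in info}
    joined = []
    for row in usage:
        match = info_by_node.get(row["nodename"])
        if match is None:
            continue  # inner join: drop series without a partner
        found = INSTANCE_ID.search(match["provider_id"])
        joined.append({**row, "instance_id": found.group(1) if found else None})
    return joined


for row in join_with_instance_id(cpu_usage, node_info):
    print(row["nodename"], row["instance_id"], row["value"])
```

The third node is dropped because it has no partner in the node info series, which is what distinguishes the inner join here from the exclusive joins used earlier.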

I’m sure that you too will find great new uses of the Wavefront join() function, beyond the ones highlighted here. The best way to explore Wavefront’s unique Analytics and Query Language, and see first-hand what the join() function can do for you, is to check out the Wavefront free trial.

Now you can quickly get on top of silent failures. I would love to hear about your experience with our join() function – contact me @stela_udo.

Special thanks to Pierre Tessier (@PuckPuck), Howard Yoo (@YooHoward) and Mike McMahon (@killertypo) for their contributions to this blog.



The post How to Find Silent Failures in Your Cloud Services Faster with Join() Function appeared first on Wavefront by VMware.

About the Author

Stela Udovicic

Stela Udovicic (@stela_udo) is a Director of Product Marketing at VMware, leading the Tanzu Observability by Wavefront PMM team. Before VMware, while at Wavefront as Sr. Director, Product Marketing, she led Product, Solutions and Partner Marketing. Before Wavefront, Stela led Product Marketing for Splunk's DevOps, IT Ops, storage, and networking solutions. Stela holds an MSc in Electrical Engineering. She has presented at many major conferences, including Splunk.conf, VMworld, DevOps Days, Cisco Live, RSA, Monitorama, PuppetConf, and NetApp Insight.
