The world is built on free software. The world is also, it seems, increasingly running on Kubernetes.
Prometheus is the obvious way to monitor a Kubernetes platform. Integration is simple, with an intelligent out-of-the-box setup, and it's stable and mature, with a big user base. But sometimes Prometheus isn't enough. There are situations and use cases where a managed solution, such as Tanzu Observability by Wavefront, is a better fit.
Getting started with Prometheus is generally just a matter of running a Helm chart, and with next to no work, you get great insight into your Kubernetes cluster, leaving you to wonder how the heck you ever lived without metrics. But that's “cluster,” singular, and no one has just one cluster.
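That first step really is small. A sketch using the community kube-prometheus-stack chart (which bundles Prometheus, Grafana, and the usual exporters); the release and namespace names here are illustrative:

```shell
# Add the community chart repository and install the bundled stack.
# "monitoring" is a hypothetical release/namespace name.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```

A few minutes later, port-forwarding to the Grafana service shows pre-built cluster dashboards.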
Because you have more than one, next you'll want to instrument dev/test, pre-prod, and production. They will all need their own version of Prometheus and Grafana, each of which requires its own user management, dashboards, alerts, and all the other good stuff that makes them useful. None of this work is particularly onerous, but it all takes someone's time — time that could be spent on the product.
It would be nice to combine the data across clusters, perhaps to compare the performance of the release in pre-prod with the one in prod on a single chart. But that would take a bit of work too, and we're busy enough maintaining Kubernetes as it is, so let's make do with having two windows open for now.
Victim of success
When the payments team sees the platform team's fancy dashboards, it wants them too. So, it builds out its own Prometheus stack, and it's great.
In fact, it's so great that the engineers in booking want to use it, too, to keep track of the latency of a service they consume.
The payments people have to add the users, and enable access from booking's IP range, and that's all good. But then the booking people want to incorporate some of payments' metrics into their own dashboards. So things start to get more complicated.
Management hears about all these metrics, and they want a high-level dashboard that correlates bookings with completed payments and statistics from the feedback team. But all that information is on a different cluster! Having more tabs open won't cut it now.
It is, of course, possible to wire everything together, making each Prometheus a source for another, top-level, "meta" Prometheus. But that's likely to involve some heavyweight networking and security work, possibly involving multiple teams and risk assessments, and the devs want a service mesh now, so finding the time is really hard.
If you're in the brave new world, with modern approaches to cloud engineering, one VPC speaking to another might not be such a big deal. But the bigger and more corporate you are, the more problematic this kind of work becomes.
For instance, security restrictions mean one of my current clients can only access their dashboards if they're logged in to the correct VPN from a corporate device. Getting any kind of data in and out of where the production cluster lives is a lot of effort.
As the company grows and the platform grows, you're going to have to think about scaling. A vanilla Prometheus can handle a lot of metrics, but the bucket is not bottomless. The standard way of scaling is vertical, and there's a limit to how big those pods can get.
People sometimes handle bigger metric flows by dropping retention rates, but the less lookback you have, the less your metrics can tell you. Or you might lighten the load by increasing the intervals between points. Can you afford the compromise?
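The trade-off is easy to quantify with back-of-the-envelope arithmetic. A sketch (all figures are illustrative, not from any real cluster):

```python
# Rough storage arithmetic for a Prometheus-style TSDB.

def stored_points(series: int, interval_s: int, retention_days: int) -> int:
    """Total points on disk for `series` time series scraped every
    `interval_s` seconds and kept for `retention_days` days."""
    return series * (retention_days * 86_400 // interval_s)

baseline = stored_points(series=50_000, interval_s=15, retention_days=30)
shorter_lookback = stored_points(series=50_000, interval_s=15, retention_days=15)
coarser_interval = stored_points(series=50_000, interval_s=30, retention_days=30)

# Halving retention and doubling the scrape interval each halve the stored
# volume -- but one costs you history, the other resolution.
assert shorter_lookback == baseline // 2
assert coarser_interval == baseline // 2
```

Either lever halves the bill; neither is free.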
Where do you run Prometheus and Grafana? On small setups, it's common to run them on the cluster being monitored. This is simple to do, but not ideal. First, the thing doing the measuring can affect the thing being measured. Prometheus pods can get big, particularly in memory consumption, and that cluster only has so much memory. Now you have to think about, and enact, what you want Kubernetes to do when those pods grow. If the out-of-memory killer kicks in, you'd better be sure it's killing the right thing. Better to take down the metrics than the service, but better still to take down nothing at all.
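One common mitigation is to cap Prometheus explicitly, so the scheduler's eviction choices are made deliberately rather than by whichever pod happens to balloon first. An illustrative fragment in kube-prometheus-stack-style Helm values (the sizes are placeholders, not recommendations):

```yaml
# Illustrative Helm values: bound the Prometheus pod so an out-of-memory
# event hits the metrics stack, not your services.
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
```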
A different solution would be to have things stream out their metrics to a dedicated Prometheus server, or a pair of highly available servers. But that means more stuff to build, maintain, and probably debug. Again, none of this work is super hard, but can you spare the time? Wouldn't you (and the business) rather invest in improving your product than in measuring it?
How important are your metrics? Can you afford the risk of downtime? A small loss of metrics might not really matter to you. But if you have to publish SLAs on which your customers depend, those things need to be measured accurately, all the time. If that's you, and you're managing your own metrics, you've now got a proper high availability system to build.
For serious scaling and high availability (HA), you're likely to end up looking at Thanos. Thanos takes Prometheus's storage and HA capabilities to a new level, but it's another complex piece of software that you're going to have to learn, deploy, and possibly debug. I don't know about you, but the bane of my working life is the feeling that every problem I have to solve will involve learning some big new piece of software.
You have to think about scaling Grafana, too. When you have an incident and everyone starts piling on, pulling up dashboards and digging into metrics, the last thing you need is to DoS yourself. (This used to be a huge problem back in the days of Graphite, which would work great until everyone needed it!)
Make it someone else's problem
Let's consider the VMware Tanzu Observability by Wavefront approach.
Wiring up that first cluster is not much different using Tanzu Observability; there's a Helm chart to deploy everything you need to get Kubernetes metrics streaming. Wiring up a second cluster is no harder. The difference is that the metrics end up in the same place.
Comparing prod and pre-prod is trivial. You can plot their metrics on the same chart, and even perform arithmetic between their time series. Building a dashboard or alert with metrics from different services, clusters, or business units is exactly the same as building one from a single source. And there's one set of users to manage. Everyone can log into the same place, from anywhere, and see the same things. (Though access control lists may limit the actions they can perform on those things.) Getting that past security shouldn't be hard.
Tanzu Observability sits outside your network. You put proxies close to the things creating the metrics, simplifying the networking as much as possible, and those proxies go out to the Tanzu Observability API. It's easy to manage, and egress on 443 to a single endpoint is easy to get approved by even the toughest corporate security team. The proxies are lightweight, and don't contribute much to the load on the cluster. Unless you lose connectivity, they're always shipping off the processed metrics, so don't (usually) consume huge amounts of memory. All the hard work is done by Tanzu Observability at the other end.
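Under the hood, the proxy accepts a simple line protocol on a local port (conventionally 2878) and forwards the points over HTTPS. A minimal sketch of what one shipped point looks like; the proxy hostname and metric names are illustrative:

```python
import socket
import time

def format_point(metric, value, source, tags, timestamp=None):
    """Render one point in the Wavefront line format:
    <metric> <value> [<epoch-seconds>] source=<source> tag="val" ..."""
    ts = timestamp if timestamp is not None else int(time.time())
    tag_str = " ".join(f'{k}="{v}"' for k, v in sorted(tags.items()))
    return f'{metric} {value} {ts} source="{source}" {tag_str}'

line = format_point("payments.latency.p99", 0.231, "pod-7f9c",
                    {"cluster": "prod", "app": "payments"},
                    timestamp=1700000000)
# 'payments.latency.p99 0.231 1700000000 source="pod-7f9c" app="payments" cluster="prod"'

def ship(line, host="wavefront-proxy", port=2878):
    """Fire-and-forget send to a local Wavefront proxy (hostname illustrative)."""
    with socket.create_connection((host, port), timeout=2) as s:
        s.sendall((line + "\n").encode())
```

Anything that can open a TCP socket can emit metrics this way; the proxy handles batching, buffering, and the outbound TLS connection.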
Tanzu Observability also carries the burden of user management, role-based access control, and alerts. Alerts are tightly integrated in Tanzu Observability, and extremely easy to set up and to integrate with third-party escalation services — no YAML required. In these times, "no YAML" should be a pretty strong selling point!
I've worked with Tanzu Observability at multiple sites for several years now, and the closest we've ever gotten to scaling is renegotiating our point rate. From a technical perspective, it simply isn't an issue.
For instance, we once accidentally quadrupled our point rate, instantaneously jumping from 100,000 pps to not far off half a million, and Tanzu Observability just took it. There was a bit of lag because our proxies weren't sized for that load, so they buffered to disk and the backlog took a while to clear, but once the dust settled, not a single point was dropped. We messed up, but we ended up impressed.
My experience of Tanzu Observability as a pretty hard-core user over several years has been one of extreme reliability. I don’t recall a single outage of either the UI or API. Someone else is taking care of our metrics: the high availability, the scaling, the storage, the upgrades, the on-call, everything. And they keep all those metrics for a long time, with no drop in resolution.
It is very difficult to quantify the cost of doing something yourself. When an engineer spends time evaluating, deploying, architecting, patching, maintaining, debugging, or scaling ancillary software, the business isn't just paying them to work on that software; they're paying them to not work on the core product.
Good software is easy to find, but good engineers are not. Anything that frees up talented people to work on the thing that drives your business has a value beyond the service it provides.
The actual running costs are equally slippery. Cloud cost attribution, particularly Kubernetes cost attribution, is never straightforward, and it's likely to be hard to pin down exactly how much even the compute for Prometheus and Grafana is costing day-to-day. And those costs may melt into the fog of the AWS bill, giving the illusion that once it's working, it doesn't cost a thing.
Tanzu Observability, on the other hand, is a single, undeniable line item. And that makes the cost of your telemetry easy to understand and manage.
As you pay according to a points-per-second scale, you can have a clear focus on extracting value from metrics. When each one costs money, you'd better make it work for you. We've found exercises to reduce our point rate (i.e., our spend) have resulted in tighter, tidier metrics and more focused alerts and dashboards.
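The billing model makes that exercise concrete: point rate is roughly the number of reporting time series divided by their reporting interval. A sketch of the arithmetic (all figures invented for illustration):

```python
# Point rate (points per second) for a fleet of regularly reporting series.
# All figures below are invented for illustration.

def point_rate(series: int, interval_s: int) -> float:
    return series / interval_s

before = point_rate(series=300_000, interval_s=60)   # 5000.0 pps
# Drop 60,000 series that nobody charts or alerts on:
after = point_rate(series=240_000, interval_s=60)    # 4000.0 pps

# A 20% cut in series is a 20% cut in point rate -- and in spend.
assert after == before * 0.8
```

Pruning unused series and widening intervals on low-value metrics are the two obvious levers.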
Move when you need to
Prometheus is the perfect way to start instrumenting new infrastructure. Smaller shops likely need something very close to its default configuration, and may well never outgrow it. Maintenance is minimal, and it piggybacks off the infrastructure you are already managing and paying for.
But if you do need more than that simple setup, or find you're sinking time into maintaining or scaling your metric platform, switching to Tanzu Observability is simple.
Tanzu Observability has its own integrations, which speak directly to its proxies and API endpoints. But it can also hook into your existing Prometheus data sources, easing migration considerably. It can scrape your endpoints directly via a Telegraf proxy, obviating Prometheus, or you can deploy a container that pulls a copy of the metrics off your Prometheus servers.
Migration to native Tanzu Observability tooling can be a single spike, or phased. Whatever works for you.
Tanzu Observability’s own query language (WQL) is not difficult to learn, and it is not enormously different from PromQL. But Tanzu Observability now lets you write queries and alerts in PromQL (beta), so you can use your existing queries unchanged. For instance,

max(kubernetes.pod.cpu.usage_rate{cluster="preprod-cluster", label.app="helm"})

is functionally identical to

max(ts("kubernetes.pod.cpu.usage_rate", cluster="preprod-cluster" and label.app="helm"))
Though this bilingualism will ease the transition, WQL has more functions and power and is undoubtedly worth learning in the long run.
Prometheus is a great fit for both ends of the scale. A small ops team may well find it covers all of its needs, and requires little maintenance.
Very large organizations can afford the engineering effort to customize Prometheus to their unique needs. It is, after all, an open system, and if you have the time and the expertise, you can make it do anything you want. Thanos is a perfect example of that.
But for those of us in between, it may make more sense to outsource our metrics to experts. The hardest job in the industry is finding good engineers, so when you get them, it’s vital not to waste them. Put them on your unique problems, and give them the great tools they need to build a great product.
If you’d like to try Tanzu Observability yourself, check out the free trial.
About the author: Rob Fisher