Prometheus has become something of a de facto standard for how to start monitoring Kubernetes. There are good reasons for this: it's open source, freely available, and embraced by the Cloud Native Computing Foundation (CNCF). Prometheus was also designed to handle the highly ephemeral nature of Kubernetes workloads. This has propelled Prometheus to a position as the obvious choice for anyone starting to monitor Kubernetes.
However, running Prometheus at scale has proven to be challenging. For example, at the time of this posting, there is no native clustering capability. Many organizations have also reported problems retaining data for more than a few days in larger environments. Likewise, Prometheus is missing some of the enterprise-readiness controls that become necessary at scale and that enterprise DevOps teams like to use, such as granular security policy controls and operational usage reporting.
So, what should we do when we find Prometheus difficult to manage in a large environment? Replace it? Not necessarily. Instead, let’s explore how we can use Wavefront to bring Prometheus into both the enterprise-grade and hyper-scale worlds!
The Good
1. Easy to Get Started
First of all, Prometheus is really easy to get started with. Assuming you have Helm installed, you can literally get it up and running with a single command. You should try it!
helm install stable/prometheus-operator
Now you have a working Prometheus instance with all the messy connections to various parts of Kubernetes already set up for you. Expose the user interface ports, and you can start to play around with the Prometheus query language, PromQL.
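The same PromQL queries are also available over Prometheus's HTTP API, which is handy once you outgrow the UI. A minimal sketch, assuming Prometheus has been port-forwarded to localhost:9090 (the address and the example query are illustrative):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS = "http://localhost:9090"  # assumed address after port-forwarding


def instant_query_url(promql: str) -> str:
    # Prometheus exposes PromQL through its HTTP API at /api/v1/query.
    return f"{PROMETHEUS}/api/v1/query?{urlencode({'query': promql})}"


# Per-container CPU usage rate over 5 minutes, a typical starter query.
url = instant_query_url("rate(container_cpu_usage_seconds_total[5m])")


def run_query(promql: str):
    # Requires a reachable Prometheus; returns the decoded result list.
    with urlopen(instant_query_url(promql)) as resp:
        return json.load(resp)["data"]["result"]
```

Point `run_query` at any PromQL expression to get back the raw JSON series, which is all a dashboard or a script needs.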
2. Developers Like it
Prometheus is a tool for geeks. It allows you to run complex queries over your data sets and it has a well-defined API for getting data in and out. If you need to build prettier dashboards than what native Prometheus gives you, there’s a really easy way to bolt Grafana on top of it. In fact, the Helm Chart we discussed above already comes with that.
3. It’s Open Source
Prometheus is freely available on GitHub and is sponsored and maintained by the Cloud Native Computing Foundation (CNCF). If you find something lacking and have the skills, you can submit your updates to the project. Some people may argue that it's "free", although that's a bit of a loaded statement. Like any open source tool, it needs a relatively rich skillset to maintain and operate at scale. But for someone with a smaller environment who is just looking to get their feet wet, Prometheus is a great solution.
The Less Than Good
1. Lack of Native Clustering
Like any software, a single-node installation breaks down once you hit a certain scale. When that happens, you have two choices. The first is Prometheus's native federation: a hierarchical way of organizing Prometheus nodes, with each node responsible for a subset of the environment. A higher-level node receives an aggregated or downsampled version of the data from the lower-level nodes. This allows a user to query high-level data at a lower resolution from the higher-level node, while still having to access the leaf nodes for detailed data. It also has the drawback that each node has to be managed individually and keeps its own configuration files, which must change frequently.
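In practice, federation is just another scrape job on the higher-level node, pulling selected series from the `/federate` endpoint of each leaf. A sketch of what that configuration might look like (the hostnames and the `match[]` expression are assumptions):

```yaml
# prometheus.yml fragment on the higher-level node. It scrapes an
# aggregated subset of series from two lower-level nodes; every leaf
# still keeps its own, separately managed configuration.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 60s            # coarser resolution at the top level
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'   # only federate selected series
    static_configs:
      - targets:
          - 'prometheus-leaf-1:9090'
          - 'prometheus-leaf-2:9090'
```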
The second choice is one of the third-party true clustering projects, which remedy the lack of detail in the federated view but still require users to maintain and configure all the individual nodes. Also, since these solutions currently aren't part of any official Prometheus release, there's no guarantee their release cadences are synchronized with mainstream Prometheus.
2. Lack of Tiered Storage
Prometheus puts everything in a single datastore, whether the data is old or new. This can cause some issues when keeping lots of historic data. First of all, keeping a high volume of data in a single datastore can cause performance problems. Secondly, storage that's fast enough for real-time ingestion and querying tends to be expensive. A large environment can produce tens or sometimes hundreds of terabytes of metrics per day. Clearly, storing that data on fast storage devices, such as SSDs, for any period of time would be cost prohibitive. As a result, many larger organizations only have a few days' worth of data retention. This can be a huge problem, since complex problems can take days to analyze, and you run a significant risk of having your data disappear before you're done analyzing it.
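Some back-of-envelope arithmetic shows why this bites. All figures below are illustrative assumptions, not measurements:

```python
# Rough sizing of raw metric volume for a large estate (assumed figures).
samples_per_sec = 10_000_000       # assumed ingest rate across the estate
wire_bytes_per_sample = 100        # approx. size of one exposition-format line
seconds_per_day = 86_400

raw_tb_per_day = samples_per_sec * wire_bytes_per_sample * seconds_per_day / 1e12
# -> tens of terabytes of raw metric data per day at this rate

# Even though Prometheus's TSDB compresses samples heavily on disk,
# keeping weeks of data on SSD-class storage multiplies the cost quickly.
retention_days = 30
raw_tb_retained = raw_tb_per_day * retention_days
```

The exact numbers will vary widely by environment, but the shape of the problem is the same: fast storage scaled to weeks of retention becomes the dominant cost.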
Third-party projects exist to aid in this area as well, but they suffer from the same drawbacks as the clustering solutions: difficult maintenance and reliance on tools that aren't synchronized with the Prometheus release cadence.
3. Lack of Enterprise-Ready Controls
When you deploy an observability tool beyond a small group of developers, you will inevitably encounter the need for enterprise-ready controls. The obvious one is security, in the form of enterprise-ready authentication, authorization, and visibility control. Prometheus completely lacks an authentication and authorization model and doesn't even encrypt the data sent over the wire. If authentication is desired, the best practice is to deploy a proxy in front of Prometheus and let that proxy handle authentication. While this is certainly a workable solution, it requires administrators to manage a complex infrastructure of HTTP proxies, certificates, and identity stores. Furthermore, such a solution provides "all or nothing" access: if you get through the proxy, you're in the system, and there's no obvious way of protecting specific monitoring resources.
Enterprise-ready controls involve more than just security. Metering and throttling form another important aspect of enterprise readiness. What happens if a developer intentionally or accidentally generates an unexpectedly large volume of metrics? What if that volume is so large that it impacts the performance of the Prometheus platform, or even crashes it, causing the loss of important metrics? This happens more often than most people think and is one of the main causes of monitoring system failures. An enterprise-ready monitoring platform must be able to meter the traffic from various sources and apply throttling to sources that exceed their quotas. It must also track and report usage consumption by team. Prometheus lacks these metering, throttling, and reporting capabilities; instead, it depends on dev and ops engineers to manually make sure they don't overload the system.
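To make the idea concrete, here is a minimal sketch of the kind of per-team metering a platform might apply. This is not a Prometheus feature; the class and the rates are hypothetical, and a real platform would do this at the ingestion tier:

```python
import time


class TokenBucket:
    """Per-team ingestion quota: admit data points while tokens remain."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec        # tokens replenished per second
        self.capacity = burst           # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, points: int = 1) -> bool:
        # Refill tokens based on elapsed time, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= points:
            self.tokens -= points
            return True
        return False  # over quota: drop or queue instead of crashing


# Hypothetical quota: 1,000 points/sec sustained, bursts up to 2,000.
bucket = TokenBucket(rate_per_sec=1000, burst=2000)
```

A noisy team exhausts its own bucket and gets throttled, rather than taking down ingestion for everyone else.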
Get the Best of Both Worlds
Your developers like Prometheus, so you are likely to encounter resistance if you do a forklift switchover to another tool. Operations needs a single view of tens of thousands of containers with years of fine-grained data retention. You have more data than you can keep in a single environment, but you don't want to face all the drawbacks of a federated setup. And yet you want to roll out monitoring and observability as a service across all of your teams. Those demands don't have to be contradictory. In fact, you can let developers continue to use Prometheus metric libraries to instrument their code, or even keep Prometheus for short-term monitoring in a localized environment, while shipping the data off to Wavefront for long-term storage and deeper analytics.
Prometheus is a great way to start monitoring Kubernetes, but it lacks correlation with external components, as well as correlation of metrics against distributed traces and span logs. Wavefront has over 200 adapters for a wide range of technologies and offers cross-domain correlation – collecting metrics, traces, span logs, and histograms. Using Wavefront, you can correlate Kubernetes metrics harvested by Prometheus with metrics from the rest of your environment: external databases, hardware, or virtualization technologies. All in unified dashboards, using a single query language across all domains.
That all sounds great, but how can Wavefront help achieve this? First of all, Wavefront is a Software as a Service (SaaS) solution that can scale from very small environments to very large ones – up to millions of data points per second and petabytes of retained data. And since it’s a SaaS, you can grow without spending as much as a minute of your time upgrading the infrastructure. This is crucial, since most Kubernetes-based, containerized application environments tend to grow very rapidly.
But let's go into more detail on how we can use Wavefront to increase the value of your Prometheus installation!
Leverage Existing Prometheus Servers
So, some of your developers like Prometheus and are reluctant to give it up, while operations needs a unified view of a large container estate? If that's the case, you can keep your existing Prometheus server(s) and metric pipeline infrastructure and simply tell them to forward data to Wavefront in real time. To do this, use the Wavefront storage adapter for Prometheus. The operation is simple: whenever Prometheus stores a piece of data, that data is also sent to Wavefront. All you need is a tiny program and a couple of lines of configuration. You can download the storage adapter here. My colleague, Frank Hattler, wrote a great blog here on how to set it up.
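Under the hood, the adapter plugs into Prometheus's standard remote-write mechanism, so the Prometheus side really is a couple of lines. A sketch, with a hypothetical hostname, port, and path for the adapter (check the adapter's documentation for the exact address it listens on):

```yaml
# prometheus.yml fragment: mirror every stored sample to the storage
# adapter, which translates it into Wavefront's format. The target
# address below is an assumption for illustration.
remote_write:
  - url: "http://wavefront-storage-adapter:1234/receive"
```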
Once you’ve set this up, Wavefront allows you to view your entire Kubernetes estate in a single place and correlate with any of the over 200 other technologies we gather telemetry from. All this while your existing Prometheus installation remains intact.
Leverage Prometheus Endpoints
In a Kubernetes environment, there are more Prometheus components than just the Prometheus server itself. Of interest in this context are the Prometheus metrics endpoints exposed by various containers, including applications instrumented with Prometheus metric libraries. Wavefront can leverage all of these endpoints without a Prometheus server in between, connecting directly to them to collect Kubernetes health and performance information. Wavefront then runs analytics on the full set of metrics – optionally combined with traces, span logs, and histograms – driving a wide range of dashboards, alerts, and observability visualizations. This method of collecting data from Prometheus is also, by far, the easiest to scale, as it requires a bare minimum of onsite resources.
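This works because /metrics endpoints speak a simple, line-oriented text format that any collector can consume directly. A minimal sketch of parsing one such payload (it ignores optional trailing timestamps and label escaping, which the full format allows):

```python
def parse_exposition(text: str) -> dict:
    """Parse Prometheus text exposition format into {series: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        # The value is the last space-separated token on the line.
        name_and_labels, _, value = line.rpartition(" ")
        metrics[name_and_labels] = float(value)
    return metrics


# A tiny example payload, as an application's /metrics endpoint might serve it.
sample = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
http_requests_total{code="500"} 3
"""
parsed = parse_exposition(sample)
```

Because the wire format is this simple, a collector only needs HTTP access to each pod's endpoint – no intermediate Prometheus server required.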
Looking at the monitoring and observability journey of our customers, many start with visibility into native Kubernetes. As a next step, they use Wavefront to provide analytics and a unified monitoring portal spanning everything from the containerized applications above Kubernetes to the cloud and data center infrastructure below it. Finally, some of our customers end up having Wavefront collect data directly from Prometheus API endpoints and metric collectors.
A Real-Life Example
An organization I recently supported was running a 500-node Kubernetes cluster with thousands of microservices monitored by Prometheus. To avoid having engineers log in to many different instances of Prometheus, they had attempted to consolidate as many Kubernetes nodes as possible into each Prometheus instance. This worked well until the environment grew beyond a couple of hundred nodes. At that point, the Prometheus servers became overloaded, with long query times and crashes as a result. Data retention also suffered, as they couldn't find a cost-effective way of retaining more than seven days of metrics. This left them blind to incidents older than a week, and since troubleshooting a complex issue can take multiple weeks, their debugging and forensic capabilities suffered. It was debated internally whether Prometheus should be scrapped altogether and replaced with something else, but the developers in particular didn't want to change how they collected their metrics, and such a transition would have been painful.
Instead, this organization decided to keep their Prometheus servers in place, but with a retention period of only four days. Anything that required longer retention, cross environment visibility or advanced analytics was done using Wavefront. They decided to use Wavefront’s Prometheus storage adapter to “fork” the data to Wavefront in parallel, allowing the developers to continue using Prometheus for debugging and basic performance measurements.
The result was a much better performing platform, offering advanced analytics and 18 months of data retention instead of seven days. The centralized DevOps team also gained the enterprise-readiness capabilities they had always wanted, like policy-enforcement controls and granular usage tracking, which enabled them to roll out monitoring as a service across all their engineering teams.
Conclusions and Lessons Learned
Monitoring Kubernetes-enabled, containerized applications and infrastructure at scale is no easy task. Prometheus offers a great way to start monitoring Kubernetes and is very easy to install and get started with. It’s also the initial tool of choice by many developers and SREs. Unfortunately, for an enterprise centralized monitoring or DevOps team, providing end-to-end visibility in a large-scale Kubernetes environment using Prometheus has proven to be difficult. Lack of native clustering, storage tiering, and enterprise-readiness controls are major obstacles for any large enterprise deployment.
Our position at Wavefront is not to directly replace existing Prometheus installations – although we can – but to augment them by adding hyper-scale clustering, long-term storage, and enterprise-readiness controls. This provides enterprise organizations with a single source of monitoring and observability data with deep analytics for Kubernetes environments. We accomplish this by leveraging your existing Prometheus server(s) and metrics pipeline infrastructure, and/or by connecting to Prometheus endpoints within the Kubernetes infrastructure.
Start your free trial of Wavefront today, and see how we help make Prometheus monitoring enterprise-ready and enterprise-scalable.
The post How to Make Prometheus Monitoring Enterprise Ready appeared first on Wavefront by VMware.