How Moving From Prometheus Monitoring to Enterprise Observability Helped Secure State Deliver Exceptional Cloud Security Services

October 8, 2019 Nandesh Guru

For the VMware Secure State engineering team, metrics have become an integral part of daily life. From monitoring our services to customer success and new features, all activities are driven by metrics. In this blog, I share my team’s experience in transitioning from Prometheus monitoring to Wavefront enterprise observability.

Starting with Open Source – Prometheus and Grafana

Initially, we used Prometheus, which is a very popular monitoring tool for open-source projects. We integrated the Prometheus client into our microservices framework so that new and existing services could easily start emitting metrics. On the server side, we hosted a single node dedicated to running the Prometheus server and used Grafana for dashboards and alert visualization. This setup worked well at first, but as we started to scale out our services, operating the Prometheus server became complicated and time-consuming. It began to crash frequently and became practically unusable. At this point, we had no choice but to look for other options.

Our Effortless Move to Enterprise Observability and Scalability

We wanted to move away from spending long hours managing a monitoring stack like Prometheus. That’s one of the main reasons we started to look at Wavefront. Our experience with using Wavefront was at a completely different level. The Wavefront Enterprise Observability platform is very mature, proven to handle web-scale, and has fantastic analytics and alerting. Plus, it has a wide set of integrations and plug-ins, which helped us to be up and running within an hour, literally. Also, the Wavefront Customer Success team was very responsive in helping out and following up on issues.

Integrating with Kubernetes

One of the most useful Wavefront integrations is for Kubernetes. We deployed the Wavefront Kubernetes Collector in our Kubernetes cluster, and it started scraping Prometheus metrics immediately. The collector can also scrape Prometheus metrics from our existing microservices, which made our transition to Wavefront easy, with no code changes.
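To make the "no code changes" point concrete, here is a minimal sketch of the Prometheus text exposition format that a scraper reads from a service's metrics endpoint. It is not our framework code and it does not use the Prometheus client library. The function name, metric name, and labels are hypothetical, and the sketch skips label-value escaping.

```python
def render_prometheus(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Note: this sketch does not escape quotes or backslashes in label values.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between scrapes.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and label names, for illustration only.
body = render_prometheus(
    "rules_executed_total",
    "Rules executed per cloud account.",
    "counter",
    [({"provider": "aws", "region": "us-west-2"}, 42)],
)
print(body)
```

Because a service keeps serving this same text, any collector that understands the format can take over scraping without the service itself changing.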

Along with direct ingestion to the Wavefront backend in the cloud, we took advantage of the Wavefront Proxy. The Wavefront Proxy provides an excellent way to whitelist or blacklist metrics, append metadata, throttle services that ingest too much data, and buffer metrics and retry on failures. It already supports manifest files for the K8s cluster, which made it easy for us to deploy.
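The proxy responsibilities listed above can be pictured with a small toy model. This is only a sketch of the ideas (allow and block rules, appended metadata, buffering, and retry on failure). It is not the Wavefront Proxy or its configuration, and it leaves out throttling. The class, its parameters, and the metric names are all hypothetical.

```python
import re
from collections import deque


class MetricPipeline:
    """Toy stand-in for proxy-side preprocessing; not the Wavefront Proxy itself."""

    def __init__(self, allow, block, extra_tags, send, max_buffer=1000):
        self.allow = re.compile(allow)   # only names matching this pass
        self.block = re.compile(block)   # names matching this are dropped
        self.extra_tags = extra_tags     # metadata appended to every point
        self.send = send                 # callable(point); raises ConnectionError on failure
        self.buffer = deque(maxlen=max_buffer)  # oldest points fall off when full

    def accept(self, name, value, tags=None):
        """Filter a point, tag it, and queue it. Returns True if it was queued."""
        if not self.allow.search(name) or self.block.search(name):
            return False
        point = {"name": name, "value": value,
                 "tags": {**(tags or {}), **self.extra_tags}}
        self.buffer.append(point)
        return True

    def flush(self):
        """Send queued points in order; stop at the first failure and keep the rest."""
        sent = 0
        while self.buffer:
            try:
                self.send(self.buffer[0])
            except ConnectionError:
                break  # leave the point queued so the next flush retries it
            self.buffer.popleft()
            sent += 1
        return sent


# Simulated backend that is unreachable at first and recovers later.
delivered = []
outage = {"active": True}


def flaky_send(point):
    if outage["active"]:
        raise ConnectionError("backend unreachable")
    delivered.append(point)


pipeline = MetricPipeline(r"^app\.", r"\.debug\.", {"env": "prod"}, flaky_send)
pipeline.accept("app.requests.count", 10)
pipeline.accept("app.debug.heap", 5)   # dropped by the block rule
pipeline.accept("system.cpu", 0.4)     # dropped: does not match the allow rule

first = pipeline.flush()    # backend is down, so the point stays buffered
outage["active"] = False
second = pipeline.flush()   # the retry delivers the buffered point
print(first, second, delivered)
```

Keeping this logic between the services and the backend means the services themselves do not need to know about filtering rules or backend outages.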

Integrating with AWS

We also found the Wavefront integrations for AWS services very useful. We onboarded our AWS accounts in minutes and could monitor our AWS resources using the default AWS dashboards, which we could easily customize. It was the first time we were able to watch all our AWS resources in one place, across all regions and all AWS accounts.

We couldn’t import our service dashboards directly from Grafana, so we had to re-create them. But Wavefront can generate a dashboard and an alert via a JSON object, and this feature came in handy when migrating dashboards and alerts. All our alerts are codified as part of an Ansible module and configured to be deployed via CI/CD.

Useful PagerDuty, Slack and Uptime Integrations

Wavefront made it easy to integrate and configure alert targets for PagerDuty and Slack. We’ve also recently integrated the Uptime availability monitoring service with Wavefront, where we pull in monitoring metrics about the uptime of our service and generate custom analytics-driven alerts in Wavefront.

Wavefront has a rich collection of APIs and SDKs supported for all of the languages we use. We’re also consuming metrics via Wavefront APIs to rebalance weights for our graph database shards based on their utilization. This prevents any single shard from becoming a hot shard.
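This post does not describe the rebalancing logic itself, so the following is only one plausible scheme, sketched under assumptions: each shard reports a utilization between 0.0 and 1.0, and its weight is made proportional to its remaining headroom. The function name, the floor parameter, and the shard names are hypothetical, and the step that queries the metrics is left out.

```python
def rebalance_weights(utilization, floor=0.05):
    """Map per-shard utilization (0.0 to 1.0) to placement weights that sum to 1.

    Less-utilized shards get proportionally larger weights. The floor keeps a
    nearly full shard from dropping to a zero weight.
    """
    headroom = {shard: max(1.0 - used, floor) for shard, used in utilization.items()}
    total = sum(headroom.values())
    return {shard: room / total for shard, room in headroom.items()}


# Hypothetical utilization values, as might be read back from a metrics query.
weights = rebalance_weights({"shard-a": 0.90, "shard-b": 0.50, "shard-c": 0.60})
for shard, weight in sorted(weights.items()):
    print(f"{shard}: {weight:.2f}")
```

In this example the busiest shard, shard-a, receives the smallest weight, so new data is steered toward the shards with more headroom.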

Assuring Secure State Customer Success with Wavefront Analytics

Along with monitoring our services, we’re also using Wavefront to measure our customer success. We can track activities at a tenant level and are currently planning to surface this data to our customers for their visibility. We’re capturing how many events we receive, how many cloud service API calls we’re making on the customer’s behalf, how many rules we’re executing per cloud account (along with their duration), and how many cloud accounts, cloud objects, and findings there are per cloud provider and region.

Our goal is to use this data to understand how we’re performing for a given customer. Our long-term goal is to expose this data to customers, both to give them more visibility into their systems and to help them understand how their security posture is improving over time.

Overall, our experience with Wavefront has been phenomenal. We’re at a point where we’re building more features around Wavefront, and it has become one of the key components of our stack. If, like us, you started with Prometheus monitoring but now need something stable and mature that scales and moves you into enterprise observability, check out the Wavefront free trial.

The post How Moving From Prometheus Monitoring to Enterprise Observability Helped Secure State Deliver Exceptional Cloud Security Services appeared first on Wavefront by VMware.

About the Author

Nandesh Guru is a staff engineer on the VMware Secure State team. Nandesh was one of the core engineers on the CloudCoreo team, a Seattle-based startup building a next-generation cloud automation and security solution. Nandesh loves building distributed, low-latency services and designing for scale in GoLang.
