“Distributed systems are hard ... but wicked cool.”
That quote, from Cornelia Davis of Pivotal, sums up the state of software delivery.
If you’re an operator at a big company, the phrase is especially apt. You help your developers harness the “cool,” while taming the hard stuff.
PCF operators - the engineers who look after the platform - achieve remarkable efficiencies. It’s common to see an ops team of 8 supporting hundreds of developers. (SpringOne Platform attendees recently heard many such examples.)
Of course, there are always more jobs to be done. Especially in the world of distributed systems, where everything changes all the time.
One pressing job to be done: a better way to keep tabs on PCF itself. Operators told us they wanted a better monitoring solution for the platform. Sure, there are lots of tools to track VMs and infrastructure. But there isn’t one optimized for PCF platform metrics. Until now!
PCF Healthwatch helps operators monitor and understand the current health of the platform. The service tracks the recommended performance and scaling indicators for a given version of PCF.
Here’s the cool part: the product renders operational data in colorful dashboards. Is everything OK? What needs my attention? Is anything on fire? PCF Healthwatch shows you instantly.
You need new tools to wrangle distributed systems. That’s why we offer BBR (for backup and restore) and PCF Metrics (to troubleshoot microservices). Think of PCF Healthwatch in this same vein: a new product, designed for the era of distributed systems.
Let’s take a deeper look at why PCF Healthwatch is so handy for platform operators.
PCF Healthwatch is an Operational Dashboard for the Platform
End-User Impact shows you how your apps are doing in production. (“Is latency a problem for me right now?”)
Developer Impact conveys the health of the Cloud Foundry CLI and useful details about available capacity. (“Can my devs push code as expected? Is there sufficient memory for them to push and scale apps?”)
Platform Impact displays the status of Ops Manager, BOSH, and the underlying VMs. (“Is BOSH healthy and managing my VM resiliency as expected? Can I proceed with a platform upgrade now?”)
PCF Healthwatch shows you essential data about the health of your Pivotal Cloud Foundry installation.
PCF Healthwatch is helpfully updated to track the most important indicators for a given release of PCF. Previously, operators would have to tweak their bespoke platform monitoring setup. Say goodbye to that toil - PCF Healthwatch keeps you updated automatically!
So how does the product work? Where does the data come from? Simple. Healthwatch constantly runs validation tests in four areas:
Cloud Foundry CLI Health. The CLI is how developers push apps to the platform. PCF Healthwatch executes a continuous test suite that validates the core functions of the CLI. With this approach, you don’t need to wait for your devs to report an issue. Healthwatch will flip these metrics from green to red immediately if there’s a problem.
Ops Manager Health. You use Ops Manager to do upgrades and scale PCF. If an issue crops up, your ability to perform these tasks could be compromised. Healthwatch monitors Ops Manager availability for you. Once again, the dashboard refreshes to show you when an undesirable condition pops up.
Apps Manager Health. For Healthwatch, Apps Manager performs a unique function: it’s a canary app. The product checks on Apps Manager health as a leading indicator for availability and responsiveness. You’ll know about any hiccups in Apps Manager right away. This way, it’s easier for you to get in front of issues related to apps running on the platform.
BOSH Director Health. The BOSH Director is buried deep in the guts of the platform. Issues with the BOSH Director rarely impact the end-users of your apps running on PCF. But they can mean the loss of resiliency in BOSH-managed VMs. That’s why Healthwatch checks on the BOSH Director as a part of its test suite.
Capacity and logging performance loss rates are tracked too. The documentation goes into much more detail about each metric and why it matters.
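To make the idea of continuous validation tests concrete, here is a minimal sketch in Python. It is not how Healthwatch is implemented. The probe functions, the `CheckResult` type, and the `health.check.*` metric names are all hypothetical, invented for illustration. The sketch only shows the general pattern: run a probe per area, treat any failure as red, and flatten the results into simple 1/0 values that a dashboard could color green or red.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    area: str          # hypothetical area label, e.g. "cli" or "ops_manager"
    healthy: bool      # True = green, False = red
    detail: str = ""   # failure message, empty when healthy


def run_checks(probes: Dict[str, Callable[[], None]]) -> List[CheckResult]:
    """Run every probe once. A probe signals failure by raising an exception."""
    results = []
    for area, probe in probes.items():
        try:
            probe()
            results.append(CheckResult(area, True))
        except Exception as exc:  # any failure turns the area red
            results.append(CheckResult(area, False, str(exc)))
    return results


def to_metrics(results: List[CheckResult]) -> Dict[str, int]:
    """Flatten results into 1 (green) / 0 (red) values under made-up metric names."""
    return {f"health.check.{r.area}": int(r.healthy) for r in results}


# Stand-in probes. Real probes would exercise the CLI, Ops Manager, and so on.
def cli_probe() -> None:
    pass  # pretend the CLI smoke test succeeded


def ops_manager_probe() -> None:
    raise ConnectionError("Ops Manager did not respond")  # simulated outage


probes = {
    "cli": cli_probe,
    "ops_manager": ops_manager_probe,
    "apps_manager": lambda: None,
    "bosh_director": lambda: None,
}

metrics = to_metrics(run_checks(probes))
print(metrics)
```

In this sketch the simulated Ops Manager outage is the only area that reports 0. Running the loop on a schedule, rather than once, is what would let a dashboard flip from green to red before a developer files a ticket.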
It gets better. PCF Healthwatch works with the Loggregator Firehose. It’s easy to publish the results of validation tests to your favorite monitoring tools. Hurray for extensibility!
Use PCF Healthwatch to “Manage the Panic”
We’ve been testing a beta of PCF Healthwatch with a few customers. Internal teams here at Pivotal have been helping us too. Here are a few initial impressions from operators:
“Healthwatch helps us ‘manage the panic’.” One ops team had just completed an upgrade of the platform. Healthwatch flagged an unhealthy job: `clock global`. This is an obscure process, without obvious documentation. But thanks to Healthwatch, the team could see that the platform was working normally. The error wasn’t affecting other parts of the platform. The team confidently concluded that the error wasn’t critical. The issue was later resolved during normal working hours.
“A Pivotal-provided service is just what we need.” Pivotal knows its products best. An opinionated platform monitoring solution that reflects the company’s expertise is a welcome enhancement.
“Automatic updates that track new KPIs with each PCF version are a big help.” Operations teams don’t have time to adjust homegrown dashboards with each new release. This work is toil that doesn’t help the business. With PCF Healthwatch, ops teams have the new metrics tracked immediately after an upgrade.
Let’s Learn Together
Pivotal is a learning organization.
Our goal with the launch of PCF Healthwatch is the same as with any other release: to learn from you. We look forward to partnering with you on your journey to get better at software. While we’re at it, let’s make distributed systems that much easier!