Or… The One Where We Learn About Operations Dashboards and the Beauty of the Pivotal Cloud Foundry Services Protocol
In the first of a series, I shared the story of my first morning spent on the Cloud Operations team that runs Pivotal Web Services, when we quickly and painlessly responded to a production incident using parts of Pivotal Cloud Foundry itself (BOSH). Today I’d like to share another experience from my ops tenure that on the surface looks like it’s about monitoring your deployment, but also provides a far deeper lesson on continuous delivery and platforms. That’s why you got two titles for the price of one :-).
One of the tasks I had the opportunity to work on was adding to the Datadog dashboard that we use to monitor our Pivotal Cloud Foundry deployment. On it we show things such as the uptime for each DEA, the amount of available stager memory, CPU utilization, disk utilization and more. We have very basic displays where we use simple green/yellow/red traffic lights to indicate the overall health, behind which are more detailed displays. Here’s what our main stoplights page looks like:
Datadog is a really cool tool that provides a WYSIWYG editor in which you bind visualizations to the metrics you have supplied to it. Pivotal Cloud Foundry emits a whole host of metrics (partial list) that today are sent via the collector and in the future will be provided by the firehose. In addition to displays, Datadog also lets you define alerts that fire, for example, when a metric crosses a threshold.
Just as with any other application, our dashboards progress through a lifecycle: first developed and tested in a non-prod environment, the “code” (in this instance a dashboard configuration) is then checked into git, and finally it is deployed into production. One of the cardinal rules of continuous delivery is that you run the same code in production as in the earlier stages of your lifecycle, so naturally we began by checking out the latest dashboard configuration from git, deploying it into our staging environment and, oops, several of the widgets were not displaying correctly. I’ll spare you the details of how we found the problem (a bit tricky, since the symptom was a total red herring), but I will tell you that pairing on the task, with two different viewpoints, unquestionably helped us find the root cause more quickly than either of us would have alone (pleasure working with you, Kai).
Here’s what the issue was: The identifiers for Datadog alerts (and screenboards and dashboards) are globally unique, so the ID for the “router dial error missing data” alert in prod is different from the ID for the same alert in staging. The trouble was that the code in git included an ID for an alert that exists in the prod environment, so when we pushed that code to staging it was broken.
The real issue is global scope. Let me explain.
Remember what I said above about running the same code through every stage in your development cycle? Sounds great in theory, but some things cannot be the same through all stages. For example, your developers will not be testing their code against production databases containing personally identifiable information. The answer is to add a layer of abstraction (isn’t that always the answer? ;-)).
This is exactly what Pivotal Cloud Foundry does with service instances, for example (I’ll come back to dashboards in a moment, I promise). When a service instance is created the system assigns a globally unique ID, just as Datadog does for dashboards, screenboards and alerts. Pivotal Cloud Foundry, however, goes one step further and allows the user to assign a name, and that name is locally scoped to a Pivotal Cloud Foundry space. When I deploy an application, also to a space, I bind the app to a service instance by name, and Pivotal Cloud Foundry does the work of mapping the locally scoped name to the globally unique instance ID. I can set up a space for each stage in my development cycle: dev, test, staging and prod. In each space I can create a service instance with differing characteristics and its own globally unique identifier, but give them all the same locally scoped name. Then, without changing any code, I can deploy an application that binds to the service instance by name through this series of spaces and test out its behavior in different environments.
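To make the idea concrete, here is a minimal sketch of locally scoped names resolving to globally unique IDs. It is a toy model, not Cloud Foundry's actual code or API: the `Space` class, its methods, and the service name `orders-db` are all invented for illustration.

```python
# Toy model of name-to-ID resolution. Everything here is hypothetical and
# only illustrates the scoping idea; it is not how Cloud Foundry is built.
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Space:
    """A space holds service instances keyed by their locally scoped name."""
    name: str
    instances: Dict[str, str] = field(default_factory=dict)

    def create_service(self, local_name: str) -> str:
        # The platform, not the user, assigns the globally unique ID.
        guid = str(uuid.uuid4())
        self.instances[local_name] = guid
        return guid

    def bind(self, local_name: str) -> str:
        # Binding by name resolves to whichever instance lives in this space.
        return self.instances[local_name]


# One space per stage, each with its own instance under the same local name.
spaces = {stage: Space(stage) for stage in ("dev", "test", "staging", "prod")}
for space in spaces.values():
    space.create_service("orders-db")

# The application only ever says "orders-db"; the ID differs per space.
resolved = {stage: space.bind("orders-db") for stage, space in spaces.items()}
```

The application's reference (`"orders-db"`) never changes from stage to stage, while `resolved` holds four different IDs, one per space.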
Datadog, seriously cool product that it is, does not have this layer of abstraction; entities are referenced via their globally unique ID. This means that in order to move our dashboard code unchanged through stages we must add that abstraction ourselves, and this is exactly what the Pivotal Cloud Foundry Cloud Ops team has done. The good news is that our Cloud Ops team has recently open sourced that work, and as you can see from recent commits they continue to evolve these screens, dashboards and alerts that we use for our production Pivotal Web Services and Pivotal Web Services-Enterprise. In this project you will find the concept of an “environment”, such as staging or prod, JSON definitions for screens, dashboards and alerts that are ERB templates, and functions that perform ID translations from one environment to another. There is even a slightly cryptic reference to such translations directly in the README. This repository is a good representation of the type of automation that is defining a better way of building and bringing software to market.
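The general shape of that abstraction can be sketched in a few lines. The real project uses ERB templates; what follows is a rough Python analogue under stated assumptions, not the Cloud Ops team's implementation. The ID numbers, the `ALERT_IDS` table, the `render` function, and the widget field names are all made up for illustration and are not Datadog's schema.

```python
import json
from string import Template

# Hypothetical per-environment lookup: alert name -> ID (made-up numbers).
ALERT_IDS = {
    "staging": {"router_dial_error_missing_data": 1001},
    "prod": {"router_dial_error_missing_data": 2002},
}

# The checked-in definition refers to the alert by name, never by a raw ID.
# The field names below are illustrative only.
WIDGET_TEMPLATE = Template(
    '{"title": "router dial error missing data", '
    '"alert_id": $router_dial_error_missing_data}'
)


def render(environment: str) -> dict:
    """Fill in the IDs for one environment and parse the resulting JSON."""
    ids = ALERT_IDS[environment]
    return json.loads(WIDGET_TEMPLATE.substitute(ids))


staging_widget = render("staging")
prod_widget = render("prod")
```

The same template lives in git for every stage; only the lookup table differs per environment, so a prod ID can no longer leak into a staging deploy.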
There’s a good bit of work in that repo, and it provides a powerful demonstration of the difference between a platform designed to enable continuous delivery and one that is not. With Pivotal Cloud Foundry we could have provided elastic compute and data services without things such as spaces for local scoping and service bindings via locally scoped names, but then the support for continuous delivery would need to be bolted onto the outside. Continuous delivery is a top requirement for Pivotal Cloud Foundry so you get this support out of the box.
Oh, and so, yeah: we had a bug in that code that we bolted on to Datadog. We’ve fixed it.