I’m an app developer. I have spent the last 25 years doing dev and really have had little or no appreciation for operations. In fact, when I first started working with Cloud Foundry, I was gleefully ignorant and believed all of the press that Platform as a Service (PaaS) was all about me – all for the developer. Then I joined the Pivotal Cloud Foundry team, started engaging with large enterprises and very quickly realized that PaaS done right brings as much value to an operations team as it does to a development team. Application logging, app monitoring, app security, app health management, to name a few capabilities that are built into the Pivotal Cloud Foundry platform, make keeping apps up and running in production far less burdensome than in traditional settings. I have learned a lot working with our most innovative customers.
But I still hadn’t lived operations. Until now.
I just finished a four week dojo with our Pivotal Cloud Foundry operations team. The main charter for this team is to keep Pivotal Web Services (PWS), our public, multi-tenant Cloud Foundry platform running smoothly. This includes building and maintaining a set of tools that helps the operations team achieve that goal. The operations team is also responsible for providing operations support to a growing set of customers who have contracted Pivotal for their expertise in operating dedicated, single-tenant clouds. And, most subtly, they provide real service to our development teams; this has a deeply important effect but I will defer that discussion to a later post. In my month I pair-operated in all of these areas and, thanks entirely to the team, feel mildly competent at operations now. This first in a series of blog posts on my tour tells the story of my first morning on the job, a day that serendipitously brought exactly the type of experience I was hoping for.
On that morning we had a production “incident”; it wasn’t an outage, since not a single application running on PWS experienced any downtime or reduced performance, but something in the system wasn’t working completely right. What led to this was the Sundance Film Festival, whose web application runs on PWS. The festival was opening that Thursday, and two days before, they were releasing a sizable lot of tickets for sale via their website. They had anticipated a spike in traffic, had preemptively scaled the number of application instances, and when the increased traffic came as expected, the application worked brilliantly. We, however, weren’t handling the commensurate increase in logging volume quite as flawlessly.
One of the many operational benefits that Pivotal Cloud Foundry brings is that the logs from all application instances, as well as some system-level messages, are aggregated into a single log stream for an application. When application traffic goes up, so does the log volume, and the trouble was that the “loggregator” was reporting that it was overwhelmed. The loggregator team had instrumented those logs so that when these loggregator errors appeared they were immediately notified. We were the first to know of the issue; in fact, I don’t believe the customer ever noticed there was a problem.
When the alert came in, a pair from the cloudops team and a pair from the loggregator team came together in a Google Hangout, took a look at the logs and discussed a solution. The loggregator component is designed to be horizontally scalable; that is, the loggregator is a cluster. We quickly reached the conclusion that we needed to increase the loggregator capacity by doubling the number of instances. To do this we would do what we call a “manifest only deploy”, which means that none of the software running PWS would be changed; only the topology would be updated. In a BOSH-managed cluster the deployment is declared in a manifest, and BOSH then executes it. Here are some snippets from that manifest, before and after.
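The original before/after snippets are not reproduced here, so the following is only an illustrative sketch of what such a manifest-only change can look like; the job names, resource pools, and instance counts are hypothetical, not taken from the PWS manifest:

```yaml
# Hypothetical BOSH manifest excerpt: scaling loggregator in two
# availability zones is just a change to the `instances` values.
jobs:
- name: loggregator_z1
  instances: 4        # before the incident: 2
  resource_pool: medium_z1
  networks:
  - name: cf1
- name: loggregator_z2
  instances: 4        # before the incident: 2
  resource_pool: medium_z2
  networks:
  - name: cf1
```

With a change like this, running `bosh deploy` against the updated manifest rolls out the new topology without modifying any of the software releases themselves.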
We are running loggregators across two availability zones (z1 and z2) and yes, we changed two numbers. That’s it.
We ran our `bosh deploy` and a few minutes later we had doubled the capacity of the loggregators. While this helped, we still weren’t quite keeping up with the demand. We then decided to increase the number of loggregators further, simply by updating the manifest and running `bosh deploy` again. So yes, we did two deploys. This whole process, which also included us documenting things, measuring twice and thrice, and watching the effect of our changes, took just a bit over an hour. Two deploys. And, of course, not a single application running on PWS was affected while we were updating the platform. Pretty darn slick.
Pivotal practices an uncompromising approach to Agile software development, and we’re bringing that same discipline to building and operating Cloud Foundry. (We also happen to be led by a CEO who literally coined the phrase ‘eating your own dogfood’, and PWS is our dogfood in every sense of the word.) While we are focused on delivering a world-class service with the power of a structured automation platform, I also want to highlight cultural aspects of how we work. We pair in operations, just like we pair in development, and when there is a problem with the service, we bring pairs together from both groups. As I said, I’ve not had much previous experience with operations, but I can’t imagine trying to troubleshoot without the context provided by both sides. I know devops is an all-the-rage buzzword that might mean different things to different people, but between the automation and the collaboration, I’m certain that first morning I experienced devops.
I’ll be sharing several more stories over the coming weeks so stay tuned.
I’d like to thank Tony Hansmann and the entire CloudOps team at Pivotal for taking me under their wings. The experience was tremendous!