At Pivotal, our mission is to transform how the world builds software, and supporting software in production is an important part of that. In fact, Pivotal’s CloudOps team aims to become the ideal operator of Pivotal Cloud Foundry. This post is the latest in a series that shares the team’s learnings. Here, we explain the role of “Train Driver,” and how it has helped the CloudOps group at Pivotal improve the life of our on-call engineers.
Here’s What the Train Driver Role Helps You Achieve
When you change an IT system, problems can occur. In CloudOps, we categorize these problems as a car crash or a barge crash. A car crash is a catastrophic incident. It happens almost immediately after the change has been applied. In contrast, the barge crash is an incident that happens a significant period after the change. These are chronic failures, like two slow-moving barges inching closer and closer together.
When these crashes happen, you want your teams aligned to get the issues fixed. But there’s a common organizational problem that gets in the way. In many companies, there’s a troubling disconnect between those who implement changes, and those whose role it is to be on-call. The roles may exist in different orgs and departments. They are measured differently, with different incentives that often conflict.
The Train Driver can mitigate the car crash and barge crash scenarios, while aligning engineers across your organization. Here’s how...
At Pivotal, the Train Driver is the person coordinating and executing platform updates, as well as holding the pager for a given week. Previously, these had been two separate roles. Under the new arrangement, the Train Driver links the deployment phase to the support and on-call phase. Maximum context is retained. In the event of a failure, the person who has the most context about the change is the person tasked with remedying it.
So what does the Train Driver do? What makes this role effective? As we’ll see, it has a fair amount in common with actual train drivers in the real world.
The Schedule - Plan Your Changes in a Manageable Way
Before a train can leave the station, the driver must be familiar with their cargo and the route. The driver must be aware of the risks associated with the cargo. Furthermore, the driver must know how long each trip should take. In our CloudOps team, this means that the Train Driver (who is typically engaged for a week) will look at the changes and upgrades to be done. They estimate how long each change will take, and allow extra time (just in case!) for those changes with an element of risk. Most importantly, only one change will be made per trip. This approach aligns with the agile principle of small iterations rather than “big bang” changes. An added benefit: it simplifies the root cause analysis of any problems that may occur.
The Daily Report - Build a Body of Evidence to Make Informed Decisions
A friend of mine drives actual trains. He described the daily report used to catalog the day’s events:
“...All occurrences, incidents, accidents, faults, delays etc have to be reported into the traffic events database. This gives a total daily report...”
In CloudOps, we use something similar: a deployment report. The report details any problems encountered during a deployment, and the corresponding troubleshooting steps. Other important details are recorded, like the time spent applying a change. We also catalog the emotional experience of the deployment. Was it stressful? Confusing? Easy?
You can imagine how useful the report is. For our team, it serves two important functions:
Section of our deployment report.
The Whistle - Start the Upgrade with Confidence
Before the deployment starts, final checks are made. If the Train Driver is happy, we hear the iconic whistle blow. But the whistle only blows if certain conditions are met.
We are guided by Site Reliability Engineering (SRE) principles. So before the whistle is blown on a given day’s deploy train, the Service Level Indicators (SLIs) are checked against our Service Level Objectives (SLOs). (We describe these concepts in this handy blog post.) The Train Driver then makes the final decision on whether the deployment train should depart. If deployments earlier in the week have eaten into the error budget, the train may be suspended until we have more budget in reserve. When this happens, the Train Driver focuses on increasing reliability until the team has the budget to allow the train to leave the station again.
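To make the pre-departure check concrete, here is a minimal sketch of an error budget gate in Python. It is an illustration only, not the tooling our team uses: the request-based SLI, the `ErrorBudget` class, the `may_blow_whistle` function, and the 25% reserve threshold are all assumptions made for the example. The result is advisory, since the Train Driver still makes the final call.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests observed so far in the window
    failed_requests: int  # requests that missed the SLI in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # Fraction of the budget still unspent; negative if overspent.
        if self.allowed_failures <= 0:
            # No traffic or a 100% target: be conservative, report no budget.
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def may_blow_whistle(budget: ErrorBudget, min_remaining: float = 0.25) -> bool:
    """Advise whether the deploy train should depart.

    min_remaining is an assumed reserve: the train stays in the station
    unless at least this fraction of the error budget is left.
    """
    return budget.remaining_fraction >= min_remaining


if __name__ == "__main__":
    quiet_week = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    rough_week = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=900)
    print("quiet week, depart:", may_blow_whistle(quiet_week))
    print("rough week, depart:", may_blow_whistle(rough_week))
```

In this sketch, a week with 400 failed requests out of a budget of roughly 1,000 leaves plenty in reserve, so the train may depart. A week with 900 failures leaves too little, and the train waits while the team works on reliability.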
The Conductor - Achieve Success with a Continuity of Knowledge and Experience
Here at Pivotal, we pair program each day. There is always a Conductor to help the Train Driver. The duties of the Conductor are much the same as those of the driver (one key difference: the Conductor may change on a day-to-day basis).
But the role of the Conductor is of utmost importance to the team in two key scenarios:
At the end of each week, the person taking up on-call and train driving duties in the following week pairs with the current Train Driver. This allows a handover of context for all of the week’s deployments. Now, the new Train Driver can take over on Monday feeling comfortable with the state of affairs. And if there’s an unexpected barge crash, they will be ready!
The Conductor is also a useful role for new members of the team. By pairing with the Train Driver, they can get uninterrupted time with a more seasoned CloudOps engineer. This time can be used to explore new tech and investigate legacy issues. Most usefully, the shared experience prepares the new team member for going on-call!
The Hat - Make an Important Role Feel Rewarding and Special
Finally, the hat! When you have a Train Driver on your team, you need to instill in your engineers the confidence to lead. Performing upgrades and being on-call are never easy. In fact, they are normally associated with hardship and strife. Being a Train Driver and wearing the hat is a reflection of respect and confidence.
Shane (Anchor Cloudops-EU) and some of our guest Train Drivers.
Stay Tuned for Part 2, Where We Examine the Role of a Train Driver in a World of Platform Automation
This post describes a grand experiment - where a designated Train Driver made all platform changes. The trial ran from May 2017 to April 2018. During this time, we built up a shared familiarity and experience when it came to manual platform upgrades and handling incidents. In April 2018, a big change happened - the CloudOps team automated all of our platform upgrades by converting them to Concourse pipelines. Was this the end of the Train Driver? Anything but - who would give up such a fancy hat!? In our next post, we’ll review how the role of the Train Driver changed as a result of our team’s move towards automation.
In the meantime, check out these links for more useful reading on this topic.
About the Author