This post was co-authored by Jesse Bean and Tony Hansmann of Pivotal.
Operations teams in today’s enterprises have it rough. CIOs are trying to achieve the speed and innovation of startups while working at enterprise scale, which is a challenge, to say the least. Many organizations turn to a DevOps approach to ease their woes, hoping that by embedding operations engineers in their product teams they can speed time to delivery. And that’s a fine idea—that is, until it all starts to break down.
The truth is that at enterprise scale, DevOps can often be more hindrance than help. Enterprises require levels of stability and security that might not be immediate issues for startups, and they must maintain a fine balance between speed and safety, typically within a highly constrained environment. Under these conditions, the traditional DevOps model can become stressed beyond the breaking point.
These challenges are not insurmountable, however. While DevOps is a good starting point for achieving better agility and improved feature-flow, a better approach for modern, enterprise-scale organizations is to embrace platforms and reliability engineering. (For a deep dive into these concepts, check out these insights from Pivotal’s James Urquhart.)
DevOps is a solution to silos ... until it’s not
To understand where we are today, it’s important to remember where we came from. In the late 1990s and early 2000s, CIOs set up organizations with siloed delivery practices (development, analysis, QA, operations, and so on). The idea was to achieve some control over software delivery, as well as cost savings.
In practice, however, often the opposite was true. Teams naturally focused on optimizing their respective silos and collectively lost sight of the actual goal: optimizing the entire software-delivery process and flow of features. The results were higher costs and slower delivery. What’s more, shadow IT teams emerged and started to build many snowflake applications that created unwanted sprawl, costs, and security issues.
These problems inspired leaders to try to implement DevOps, whose principles appeared, in theory, to be the miracle cure for the silo problem. Putting development and operations teams together to own the software-delivery lifecycle would give the teams control of the whole process and let them deliver faster and more safely.
One early, vocal advocate of this approach was Amazon CTO Werner Vogels. “The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it,” Vogels said in a 2006 interview. “Not at Amazon. You build it, you run it.”
But as attractive as the “you build it, you run it” mantra may be, many CIOs have found that this strategy falls apart as teams outgrow direct communication or lines of sight with each other, and as they start to scale individually but still depend on other teams. Within many enterprises, “you build it, you run it” simply does not scale.
Lessons learned in the field
Consider this real-life example from a large financial services company, which we’ll call Enterprise X. The hypothesis was that if operations staff were assigned to each product team, the teams would have enough control over the process to support changes and facilitate releases faster.
The company started with a version of Spotify’s DevOps methodology and evolved it to fit the work and life culture at Enterprise X. One aspect was autonomous product teams that included ops engineers on each team. That was fine when the development of the platform and tech stack was confined to a single team, as the dependencies were minimal. But as more teams formed and became interdependent, they started to drift apart in their practices and tools, because each was driven by different objectives and different deliverables for different end-customers.
Specifically, friction increased dramatically when a project scaled to:
- More than two product teams, each consisting of five full-time developers and two operations engineers.
- Twice-weekly pushes to production.
- More than one location.
Essentially, the “you build it, you run it” principle worked for about 10 people all in the same location. As teams became larger or more distributed, folks started living in their own stories while ignoring issues that might be happening elsewhere. The tipping point for this tended to be when more than 10 people were reliant on the same operations infrastructure.
A good example of this breakdown involved the company’s internal and external web portals. The business decided to merge them for the very good reason that running two different sites to present the same data was wasteful. But they did not merge the DevOps teams. As a result, the internal portal team didn’t have, for example, the same security, penetration-testing, and compliance regime as the external team. Velocity dropped for the internal-facing team. Worse, friction around things like tooling choice, deploy timing, and microservice duplication led to business and technical frustration alike.
Getting back on track
Something had to be done. So began a three-to-four-month process designed to eliminate some of the waste and increase the pace of deployment cycles. When the dust cleared, the results were dramatic:
- Release cycles that had ballooned to multiple months were reduced to twice weekly, and were on track for daily releases.
- Platform engineers focused on automation, serving multiple developer "customers."
- The number of people needed for platform operations fell from 30 to 10.
- The 20 freed-up roles were reallocated to product managers, developers, and designers.
- Developer productive hours increased from 30 percent to 75 percent.
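To put those figures in proportion, here is a minimal sketch of the arithmetic behind them. The headcounts and percentages come from the results above; the 40-hour week is purely an assumption for illustration, and the helper name is hypothetical.

```python
# Figures quoted in the results above.
platform_staff_before = 30
platform_staff_after = 10
productive_share_before = 0.30
productive_share_after = 0.75

# Assumption for illustration only: a 40-hour developer work week.
ASSUMED_WEEK_HOURS = 40


def fractional_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, as a fraction of `before`."""
    return (before - after) / before


staff_freed = platform_staff_before - platform_staff_after
staff_reduction = fractional_reduction(platform_staff_before, platform_staff_after)
productivity_multiplier = productive_share_after / productive_share_before
hours_before = productive_share_before * ASSUMED_WEEK_HOURS
hours_after = productive_share_after * ASSUMED_WEEK_HOURS

print(f"Platform roles freed for other work: {staff_freed}")
print(f"Platform headcount reduction: {staff_reduction:.0%}")
print(f"Productive-time multiplier: {productivity_multiplier:.1f}x")
print(f"Productive hours per assumed week: {hours_before:.0f} -> {hours_after:.0f}")
```

In other words, the platform team shrank by about two-thirds, while the share of developer time spent productively grew by a factor of 2.5.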
Let’s examine some of the measures that helped Enterprise X reverse the negative trend in its feature velocity.
Surprise! DevOps is no miracle cure
It was clear that Enterprise X’s initial, pre-DevOps practices for software delivery needed to change. Yet shifting to a Spotify-style DevOps model just wasn’t a good fit for an enterprise of Enterprise X’s scale and culture. The lesson was that, with DevOps, one size does not fit all.
One significant issue was that the operations engineers who were assigned to product teams often sat fallow for weeks or even months until the product had matured into a robust application. In an effort to reduce this waste, operations engineers were often assigned feature-engineering work. Because this work was not necessarily their métier, reassigning them in this way often led to frustration and decreased morale.
Further complicating matters were the communication and coordination problems that arose as teams grew to include remote staff members. In practice, these effects were seen even when engineers moved from one floor of a building to another. In short order, these issues began to have a direct impact on time to delivery of new features.
As teams scaled, they tended to simultaneously optimize around local issues (the root of most process evil) and lose coordination across dependencies (the root of most organizational evil). This led to process drift and made integrations more difficult as the product moved toward delivery.
The bottom line was that for all the promise of DevOps as it was originally implemented at Enterprise X, the company wasn’t seeing the velocity it had hoped for. For a short time, the three-month delivery cycles that were common in the earlier, siloed model had been reduced to weekly cycles, but these soon crept back up to months. Further process changes were necessary.
Return to center
Eventually, the decision was made to begin a months-long effort to pull operations engineers out of their embedded roles in product teams and move them to central IT.
Unfortunately, once the Spotify-style DevOps methodology had taken root, moving back to a hybrid centralized model wasn’t easy. Some operations engineers even chose to leave the company, rather than take up new positions in central IT, underscoring how critical organizational and process issues can be to a software development organization.
Once the dust cleared, Enterprise X had arrived at a model that more closely resembled site reliability engineering (SRE). Originally developed at Google, SRE is a pattern in which dedicated engineers are responsible for ensuring that the underlying data center infrastructure can provide high availability for any applications the organization chooses to deploy. (For more information on SRE, check out this recent podcast with Google's Dave Rensin.)
The major difference between SRE and traditional DevOps is the idea of delivering a platform as a product. Instead of engineers embedded in application teams, the centralized operations team wholly owns the platform that is used to deliver applications and treats the internal app-dev teams as customers.
Under this model, delivering the platform isn’t a one-step process. The product-centric platform engineering team delivers a fluid infrastructure in the same way that developers build apps. This doesn’t mean building everything from scratch; rather, many teams build on a commercial platform by customizing it and extending it with differentiated services. (To learn more about delivering a platform as a product, check out this white paper.)
Conclusion: Modern enterprises need a modern delivery model
Enterprise X’s experience likely mirrors that of many modern enterprises. As the speed of competition and innovation accelerates, the classic DevOps model as developed by Spotify and others is just not a good fit for the culture and practices of companies in verticals like finance, retail, and healthcare.
That’s why it’s important to recognize when tightly codified and dogmatic practices might actually be tripping you up, and to be flexible. Flexibility is a principle that’s built into both SRE and product-style platform engineering, and many enterprises may find that a combination of the two is what serves them best.
About the authors
Jesse Bean is Field CIO for Pivotal and the former Head of Technology for Manulife/John Hancock, Digital Advice and Head of Delivery Investment Division. He is responsible for working with top-level executives, providing timely advice and strategy on top-of-mind problems that Pivotal’s clients are facing in how they move forward in a cloud native world.
Tony Hansmann works on the Pivotal Global CxO team focusing on SRE. He has been on the cloud-native journey with VMware since 2012. He focuses on Platform Operations using SRE, LEAN and XP.