To Improve Your Mean Time to Recovery, Start at the Beginning

September 17, 2024

Rita Manachi with David Zendzian 

A version of this post was published in The New Stack. 

Our IT systems are so constantly under threat - malicious or otherwise - that these incidents have become almost commonplace. Case in point: as I was putting cursor to screen for this piece, a colleague sent me this

We all recognize that as our cloud native systems continue to scale, their distributed nature also makes them more complex. This complexity affords us flexibility and velocity; it also exposes more points of failure and intrusion.

Guarding against human error, poorly written code, or an intentional breach isn't just about prevention and protection. Companies risk government scrutiny, billions of dollars in fines, or even legal action if they aren't able to recover quickly.

So while the CrowdStrike fiasco certainly made headlines, it's the aftermath that has us giving mean time to recovery (MTTR) a second look, and asking what you can do to reduce the time it takes to recover from an outage or malicious attack. For this article, we're using the DevOps Research and Assessment (DORA) team's definition of MTTR: the average amount of time it takes your team to restore service when there's a service disruption, like an outage.
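
To make that definition concrete, here is a minimal sketch of the arithmetic behind MTTR, assuming a hypothetical incident log of detection and restoration timestamps; it isn't tied to any particular tool.

```python
# Illustrative only: a minimal MTTR calculation over a hypothetical incident log,
# following the DORA-style "average time to restore service" definition.
from datetime import datetime, timedelta

def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average restore time across (detected_at, restored_at) pairs."""
    if not incidents:
        return timedelta(0)
    total = sum((restored - detected for detected, restored in incidents), timedelta(0))
    return total / len(incidents)

incidents = [
    (datetime(2024, 7, 19, 4, 9), datetime(2024, 7, 19, 9, 27)),   # made-up outage
    (datetime(2024, 8, 2, 13, 0), datetime(2024, 8, 2, 13, 45)),   # made-up outage
]
print(mean_time_to_recovery(incidents))  # 3:01:30
```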

The hard stuff first! 

Before you change your approach, change your organization's mindset on security so that it's an inherent part of your app lifecycle, from code to production to management. We start here because, no matter what technology choices we make, changing behaviors is much harder than adopting a new tool or platform. Here are some mindset shifts you'll have to make:

Stop treating security as an outcome - Security is not one thing, and today's cloud native ecosystems are extremely porous and connected. Rather than setting up major checkpoints that could thwart weeks of work, check security throughout the process, starting with developers. "People don't measure the value of something just working"

Embrace a product mindset - The platforms your developers work on are dynamic and need to be treated as such. That means treating the platform as a product that requires upgrading, patching, and improving over time. Be sure to include roles like platform engineers, compliance architects, and security specialists as part of your platform delivery and strategy. "Security is a team sport"

Make the secure thing the easy thing - Focus on the four golden commands (build, bind, deploy, scale) to make security inherent in your process. Give developers self-service access to app and code templates that are automatically updated and patched, a catalog of approved open source and commercial software, buildpacks, an API gateway with policy controls, and so on. Make sure they can use the tools they love, safely!

Dig Into the Technical Stuff

Platform choice matters to your security posture. Look for security-enhancing features and capabilities that support a DevSecOps-based working model, and upskill your current employees on new disciplines like platform engineering and architecting for compliance.

Blue-green deployment is a technique that can reduce app downtime and risk by running two identical production environments, one "blue" and one "green," where only one environment is live and serving production traffic while the other sits idle. Only after proper testing does the idle environment start serving production workloads.
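
Here is a rough, platform-agnostic sketch of that cutover logic. The Router class, health_check(), and the environment names are stand-ins for whatever your platform actually provides (a load balancer rule, a Kubernetes Service selector, a router in Cloud Foundry), not a real API.

```python
# A minimal sketch of a blue-green cutover; all names are illustrative assumptions.
class Router:
    def __init__(self, live: str):
        self.live = live                 # environment currently serving traffic

    def switch_to(self, env: str):
        self.live = env                  # in practice, an atomic load-balancer change

def health_check(env: str) -> bool:
    """Placeholder for smoke/integration tests run against the idle environment."""
    return True

router = Router(live="blue")
idle = "green"

# Deploy the new release to the idle environment, verify it, then cut over.
if health_check(idle):
    previous = router.live
    router.switch_to(idle)               # green goes live; blue stays as rollback target
    print(f"live: {router.live}, rollback target: {previous}")
else:
    print("keeping blue live; green failed verification")
```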

Canary deployment is another way to test the viability of new software or an update in production: you route a small slice of production traffic to the new version and see how it runs. If things are smooth, you gradually release more. It's part of another modern app delivery paradigm, progressive delivery, a term coined by RedMonk's James Governor several years ago. What blue-green and canary deployments have in common is that they let you easily roll back to a known good version of the software with minimal disruption if something breaks.
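
The sketch below shows the canary idea in miniature: a small, growing share of requests goes to the new version, and the rollout reverses the moment the observed error rate crosses a threshold. The weights, threshold, and metrics are arbitrary assumptions, not recommendations.

```python
# Illustrative canary routing with automatic rollback; numbers are assumptions.
import random

def choose_version(canary_weight: float) -> str:
    """Route one request: canary_weight is the fraction sent to the new version."""
    return "v2-canary" if random.random() < canary_weight else "v1-stable"

def next_step(canary_weight: float, error_rate: float, threshold: float = 0.01) -> float:
    """Progressive delivery step: widen the canary if healthy, roll back if not."""
    if error_rate > threshold:
        return 0.0                       # roll back: all traffic to the known good version
    return min(1.0, canary_weight * 2)   # e.g. 5% -> 10% -> 20% -> ... -> 100%

weight = 0.05
for observed_error_rate in (0.002, 0.004, 0.03):   # made-up readings from monitoring
    weight = next_step(weight, observed_error_rate)
    print(f"canary share now {weight:.0%}")         # 10%, 20%, then 0% after the spike
```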

Test-driven development (TDD) is critical to continuously releasing stable and resilient applications. To get the most from TDD, don't just run functional tests on what was added; test it in the context of everything else, and be sure to include regular fuzz, chaos, or fault testing.
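
As one small example of what "beyond functional tests" can look like, here is a fuzz-style test that hammers a function with random inputs and checks its invariants. parse_quantity() is a hypothetical function under test; substitute your own code and invariants.

```python
# A minimal fuzz-style test to complement functional TDD; names are illustrative.
import random
import string

def parse_quantity(raw: str) -> int:
    """Toy function under test: parse a non-negative integer, defaulting to 0."""
    try:
        return max(int(raw.strip()), 0)
    except ValueError:
        return 0

def test_parse_quantity_never_crashes_or_goes_negative():
    random.seed(42)                      # reproducible fuzz run
    alphabet = string.printable
    for _ in range(10_000):
        raw = "".join(random.choices(alphabet, k=random.randint(0, 20)))
        result = parse_quantity(raw)     # must not raise, whatever the input
        assert isinstance(result, int) and result >= 0

test_parse_quantity_never_crashes_or_goes_negative()
print("fuzz run passed")
```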

Error handling, combined with robust log monitoring and observability, can capture problems as they happen and limit the scope of a failure. Without deliberate error handling, a fault falls through to the default behavior before it registers as a system error; in the CrowdStrike case, that default was the blue screen. Pair it with a canary deployment that surfaces errors you know how to handle, and a bad update stops installing before it spreads.
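
The sketch below illustrates that "fail safe, not silent" pattern under stated assumptions: if a new update can't be validated, log the problem and keep running on the last known good configuration rather than letting the failure escalate. The validation rules and field names are hypothetical.

```python
# Illustrative only: reject a bad update, log it, and fall back to a known good state.
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("updater")

LAST_KNOWN_GOOD = {"rules_version": 41, "rules": ["baseline"]}

def validate(update: dict) -> bool:
    """Placeholder for schema/sanity checks run before an update is applied."""
    return isinstance(update.get("rules"), list) and len(update["rules"]) > 0

def apply_update(raw: str) -> dict:
    try:
        update = json.loads(raw)
        if not validate(update):
            raise ValueError("update failed validation")
        log.info("applied rules_version=%s", update.get("rules_version"))
        return update
    except (json.JSONDecodeError, ValueError) as exc:
        # Capture the problem where it happens and limit the blast radius.
        log.error("rejected update (%s); staying on last known good config", exc)
        return LAST_KNOWN_GOOD

active = apply_update('{"rules_version": 42, "rules": []}')   # fails validation
print(active["rules_version"])                                # 41: still on the good config
```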

Policy-based automation is not a panacea, but it can improve multiple aspects of your software delivery and maintenance processes. Based on input from teams including platform engineering, security, compliance, and I&O, automating the rollout of upgrades, patches, and error handling can help you either avert a disastrous outage or lessen the damage.
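
As a sketch of what a policy gate in front of that automation might look like, the rules and change metadata below are invented for illustration; in practice they would come from your platform engineering, security, compliance, and I&O teams.

```python
# A minimal policy-gated rollout check; rules and fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Change:
    kind: str            # "patch", "upgrade", "config"
    severity: str        # "low", "high", "critical"
    passed_canary: bool
    window_open: bool    # inside an approved maintenance window?

def policy_allows(change: Change) -> bool:
    # Example rule: critical security patches may ship any time, but only after a canary pass.
    if change.severity == "critical":
        return change.passed_canary
    # Example rule: everything else needs a canary pass and an open maintenance window.
    return change.passed_canary and change.window_open

for change in (
    Change("patch", "critical", passed_canary=True, window_open=False),
    Change("upgrade", "low", passed_canary=True, window_open=False),
):
    action = "roll out" if policy_allows(change) else "hold"
    print(f"{change.kind} ({change.severity}): {action}")
```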

Three before four 

Before the four golden commands there were the 3Rs, a simple way of looking at the security attributes of a cloud native platform. The idea behind the 3Rs is that by being fast you are safer (a minimal rotation sketch follows the list):

    • Rotate datacenter credentials every few minutes or hours. 

    • Repave every server and application in the datacenter every few hours from a known good state. 

    • Repair vulnerable operating systems and application stacks consistently within hours of patch availability.  
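
Here is an illustrative sketch of the "rotate" idea: short-lived credentials regenerated on a schedule so a leaked secret ages out quickly. issue_credential() stands in for a real secrets manager or identity provider, and the interval is an arbitrary example.

```python
# Illustrative credential rotation; issue_credential() and the interval are assumptions.
import secrets
from datetime import datetime, timedelta

ROTATION_INTERVAL = timedelta(minutes=15)   # "every few minutes or hours"

def issue_credential() -> dict:
    """Placeholder for a call to a real secrets manager / identity provider."""
    return {
        "token": secrets.token_urlsafe(32),
        "expires_at": datetime.utcnow() + ROTATION_INTERVAL,
    }

credential = issue_credential()

def current_credential() -> dict:
    """Return a valid credential, rotating it once it has expired."""
    global credential
    if datetime.utcnow() >= credential["expires_at"]:
        credential = issue_credential()     # the old token simply ages out
    return credential

print(current_credential()["expires_at"])
```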

The 3Rs continue to be core tenets of Tanzu Platform. For more about Tanzu and security, check here for updated content and the like!

Multiple factors determine whether recovering from an outage or security breach is devastating to your app dev and delivery processes, including platform choice, development styles (e.g., agile, extreme, test-driven), and organizational and cultural factors. Rather than treating security as a single outcome, focus on delivering secure software supply chains, supporting a security-focused culture, automating patches, upgrades, and policy enforcement, staying on top of policy drift and monitoring, and other security-enabling outcomes.
