Building Crash-Proof Applications the YAGNI Way

June 1, 2013 Matthew Parker

I’m a YAGNI’ist. I’m vigilant against over-engineering. I will seek out and destroy over-engineered, anticipatory, predictive designs. This wasn’t always the case; early on in my career, I was quite the opposite. I realize now that I suffered from a lack of confidence, and that BDD and extreme programming give us the power to deal with any problem that arises, WHEN it arises.

The Pivots I’m currently working with on a project are very much of the same mindset. So how did we manage to build a crash-proof application that can persevere in the face of extreme catastrophe, when that was never a goal?

Let me start at the beginning. We were tasked with developing an application that an IT administrator could install into his or her datacenter. After some initial configuration, the application would spider the datacenter, gathering all kinds of data about it. This application would then phone the data home to another application on the Internet, where the customer could review it. It was basically your standard ETL (extract, transform, load) application, with some very non-standard data sources.

At the very beginning, everything about this application was synchronous. The user would fill out a form, click a button, and wait. While waiting, the application would hit various bits of infrastructure in their datacenter, massage the data, and then phone it back home. Instead of making the customer wait for the form to submit, we could have backgrounded this process right off the bat. But at first we weren’t collecting quite enough data to warrant it, and creating a more graceful user experience wasn’t as high on the priority list as other features.
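For illustration, this is roughly the shape of that synchronous flow: a minimal sketch assuming a Rails app, where Datacenter, DatacenterSpider, ReportBuilder, and PhoneHome are hypothetical stand-ins, not our actual code:

```ruby
# Hypothetical sketch of the synchronous version: the request blocks
# until the entire ETL pass finishes.
class CollectionsController < ApplicationController
  def create
    datacenter = Datacenter.new(params[:datacenter])
    data       = DatacenterSpider.new(datacenter).crawl  # hit their infrastructure (extract)
    report     = ReportBuilder.new(data).build           # massage the data (transform)
    PhoneHome.upload(report)                             # phone it back home (load)
    redirect_to report_path, notice: "Collection complete."
  end
end
```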

The number of data points we collected started to grow, and at some point, we decided the customer had to wait too long. A minute or two was OK, but 5 minutes? 10 minutes? Unacceptable. We were risking losing customers. So we bit the bullet and backgrounded it. But we took no steps at that point to deal with transitory network failures. Remember, this process was collecting data from other pieces of infrastructure on the customer’s network and phoning that data back home over the Internet. Before, when everything was synchronous, if something went wrong, the user could always resubmit the form. In the new user experience, this was no longer possible. They submitted the form and were instantly presented with a message informing them that data collection was proceeding and that they should come back later.
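Backgrounding it meant moving that same work into a job and having the controller return immediately. Here’s a hedged sketch assuming Resque (any job queue would do); the class and collaborator names are again hypothetical:

```ruby
# Hypothetical sketch: the ETL work now runs in a background worker.
class CollectionJob
  @queue = :collections

  def self.perform(datacenter_id)
    datacenter = Datacenter.find(datacenter_id)
    data       = DatacenterSpider.new(datacenter).crawl
    report     = ReportBuilder.new(data).build
    PhoneHome.upload(report)
  end
end

# The controller just enqueues the job and responds right away.
class CollectionsController < ApplicationController
  def create
    datacenter = Datacenter.create!(params[:datacenter])
    Resque.enqueue(CollectionJob, datacenter.id)
    redirect_to status_path,
      notice: "Data collection is proceeding. Check back later."
  end
end
```

Note what’s missing from a sketch like this: nothing retries, times out, or checkpoints. At that point, that was deliberate.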

We could have written code right at that moment to anticipate failures. But here’s the rub: we’d experienced no failures up to this point in any of our testing. What should we expect to fail? Spidering their infrastructure? All of it? Or were certain aspects of their infrastructure more likely to become unresponsive than others? Or should we expect the phone-home application to stop responding? Anything could fail at any point, but writing code to be resilient in the face of any type of failure is expensive.

More importantly, we had no story telling us to anticipate failures. And we knew that the cost of writing code that could prepare for any type of failure was prohibitive. We made the case to our product owner to wait.

As the application grew, so did the code, and so did the number of data points we were collecting. The collection phase took longer and longer, and eventually we started seeing failed collections. Not every failure was alike. They happened for different reasons; some we could even control, or at the very least curtail (e.g., failures due to rate limiting).
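To make that concrete, here’s the shape of the fix for the rate-limiting case: a sketch under assumed names (RateLimitError, fetch_next_batch) and an assumed backoff schedule. The pattern of retrying only a failure mode we had actually observed is the point:

```ruby
# Hypothetical error raised when a data source tells us to slow down.
class RateLimitError < StandardError; end

# Retry a block with exponential backoff, but only for the one
# failure mode we'd actually seen in the wild.
def with_backoff(max_attempts = 5)
  attempts = 0
  begin
    yield
  rescue RateLimitError
    attempts += 1
    raise if attempts >= max_attempts
    sleep(2 ** attempts)  # wait 2s, 4s, 8s, ... before retrying
    retry
  end
end

# Usage: wrap only the call that gets rate-limited.
with_backoff { spider.fetch_next_batch }
```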

And now we had real stories, driven out by real-world experiences, that we could prioritize against new feature work. Dealing with failures in this way allowed our code to grow over time, responding to likely failures while ignoring unlikely ones. Had we attempted to engineer a crash-proof application at the beginning, the results would have been disastrous. But this way, not only did our code evolve in a much more organic and sustainable manner, but our understanding of the different types of failures also grew over time, giving everyone on the team a better understanding of the technologies our application interacted with.

Today, our application is incredibly resilient. Short of a nuclear bomb, you cannot stop this application from completing its ETL. The code is well-factored, readable, and maintainable. And we gradually built in that robustness while still delivering new features. WINNING

About the Author

Matt Parker is Head of Engineering for Pivotal Labs.
