Building Crash-Proof Applications the YAGNI Way

June 1, 2013 Matthew Parker

I’m a YAGNI’ist. I’m vigilant against over-engineering. I will seek out and destroy over-engineered, anticipatory, predictive designs. This wasn’t always the case; early on in my career, I was quite the opposite. I realize now that I suffered from a lack of confidence, and that BDD and extreme programming give us the power to deal with any problem that arises, WHEN it arises.

The Pivots I’m currently working with on a project are very much of the same mindset. So how did we manage to build a crash-proof application that can persevere in the face of extreme catastrophe, when that was never a goal?

Let me start at the beginning. We were tasked with developing an application that an IT administrator could install into his or her datacenter. After some initial configuration, the application would spider the datacenter, gathering all kinds of data about it. This application would then phone the data home to another application on the Internet, where the customer could review it. It was basically your standard ETL (extract, transform, load) application, with some very non-standard data sources.

At the very beginning, everything about this application was synchronous. The user would fill out a form, click a button, and wait. While waiting, the application would hit various bits of infrastructure in their datacenter, massage the data, and then phone it back home. Instead of making the customer wait for the form to submit, we could have backgrounded this process right off the bat. But at first we weren’t collecting quite enough data to warrant it, and creating a more graceful user experience wasn’t as high on the priority list as other features.
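For illustration, this is roughly the shape of that synchronous flow: a minimal sketch assuming a Rails app, where Datacenter, DatacenterSpider, ReportBuilder, and PhoneHome are hypothetical stand-ins, not our actual code:

```ruby
# Hypothetical sketch of the synchronous version: the request blocks
# until the entire ETL pass finishes.
class CollectionsController < ApplicationController
  def create
    datacenter = Datacenter.new(params[:datacenter])
    data       = DatacenterSpider.new(datacenter).crawl  # hit their infrastructure (extract)
    report     = ReportBuilder.new(data).build           # massage the data (transform)
    PhoneHome.upload(report)                             # phone it back home (load)
    redirect_to report_path, notice: "Collection complete."
  end
end
```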

The number of data points we collected started to grow, and at some point, we decided the customer had to wait too long. A minute or two was OK, but 5 minutes? 10 minutes? Unacceptable. We were risking losing customers. So we bit the bullet and backgrounded it. But we took no steps at that point to deal with transitory network failures. Remember, this process was collecting data from other pieces of infrastructure on the customer’s network and phoning that data back home over the Internet. Before, when everything was synchronous, if something went wrong, the user could always resubmit the form. In the new user experience, this was no longer possible. They submitted the form and were instantly presented with a message informing them that data collection was proceeding and that they should come back later.
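Backgrounding it meant moving that same work into a job and having the controller return immediately. Here’s a hedged sketch assuming Resque (any job queue would do); the class and collaborator names are again hypothetical:

```ruby
# Hypothetical sketch: the ETL work now runs in a background worker.
class CollectionJob
  @queue = :collections

  def self.perform(datacenter_id)
    datacenter = Datacenter.find(datacenter_id)
    data       = DatacenterSpider.new(datacenter).crawl
    report     = ReportBuilder.new(data).build
    PhoneHome.upload(report)
  end
end

# The controller just enqueues the job and responds right away.
class CollectionsController < ApplicationController
  def create
    datacenter = Datacenter.create!(params[:datacenter])
    Resque.enqueue(CollectionJob, datacenter.id)
    redirect_to status_path,
      notice: "Data collection is proceeding. Check back later."
  end
end
```

Note what’s missing from a sketch like this: nothing retries, times out, or checkpoints. At that point, that was deliberate.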

We could have written code right at that moment to anticipate failures. But here’s the rub: we’d experienced no failures up to this point in any of our testing. What should we expect to fail? Spidering their infrastructure? All of it? Or were certain aspects of their infrastructure more likely to become unresponsive than others? Or should we expect the phone-home application to stop responding? Anything could fail at any point, but writing code to be resilient in the face of any type of failure is expensive.

More importantly, we had no story telling us to anticipate failures. And we knew that the cost of writing code that could prepare for any type of failure was prohibitive. We made the case to our product owner to wait.

As the application grew, so did the code, and so did the number of data points we were collecting. The collection phase took longer and longer, and eventually we started seeing failed collections. Not every failure was alike. They happened for different reasons; some we could even control, or at the very least curtail (e.g., failures due to rate limiting).
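To make that concrete, here’s the shape of the fix for the rate-limiting case: a sketch under assumed names (RateLimitError, fetch_next_batch) and an assumed backoff schedule. The pattern of retrying only a failure mode we had actually observed is the point:

```ruby
# Hypothetical error raised when a data source tells us to slow down.
class RateLimitError < StandardError; end

# Retry a block with exponential backoff, but only for the one
# failure mode we'd actually seen in the wild.
def with_backoff(max_attempts = 5)
  attempts = 0
  begin
    yield
  rescue RateLimitError
    attempts += 1
    raise if attempts >= max_attempts
    sleep(2 ** attempts)  # wait 2s, 4s, 8s, ... before retrying
    retry
  end
end

# Usage: wrap only the call that gets rate-limited.
with_backoff { spider.fetch_next_batch }
```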

And now we had real stories, driven out by real-world experiences, that we could prioritize against new feature work. Dealing with failures in this way allowed our code to grow over time, responding to likely failures while ignoring unlikely ones. Had we attempted to engineer a crash-proof application at the beginning, the results would have been disastrous. But this way, not only did our code evolve in a much more organic and sustainable manner, but our understanding of the different types of failures also grew over time, giving everyone on the team a better understanding of the technologies our application interacted with.

Today, our application is incredibly resilient. Short of a nuclear bomb, you cannot stop this application from completing its ETL. The code is well-factored, readable, and maintainable. And we gradually built in that robustness while still delivering new features. WINNING

About the Author

Matt Parker is Head of Engineering for Pivotal Labs.
