Should That Be a Microservice? Part 5: Failure Isolation

In the first part of this series, we laid out a set of principles to help you understand when microservices can be a useful architectural choice. In this post, we explore one of those factors, failure isolation, in more detail.

Digital products don’t live alone. Every piece of tech we use today interacts with something else. The same is true for the custom software we write. Each bit of code works as a part of a larger whole. It’s just one piece of a system that executes a business process. It is right there in the name – “microservice” implies interdependence with other services!

Monoliths have plenty of dependencies, too. It’s common for monoliths to integrate with an aging third-party application that you don’t control. Here, the developers at your company decided to wire up these two systems with baling twine and duct tape. These engineers weren’t masochistic. They did the integration for good business reasons, most likely to provide more complete information or a better user experience.

These integrations are a fact of life in software development. You should expect to have integrations between services that were never designed to work together. You should also expect that external services are unlikely to meet your service-level objectives.

When a downstream dependency fails, it’s tempting to point the finger at a poorly architected piece of technology. But your customers don’t care about excuses. They just want to interact with your software, quickly and easily, then get on with their day.

When you need to protect your systems from failures you can’t control, microservices are a great option. Why? Refactoring the functionality in question into microservices allows you to isolate that dependency from the rest of your application. More importantly, you can protect your SLOs by building proper failover mechanisms.

Get started with an architectural review

Odds are, you have a pretty good idea of what aspects of your system will benefit from failure isolation. But don’t assume you know all the dark corners. Take the time to perform an architectural review. Gather all your subject matter experts together – developers, architects, and site reliability engineers. Draw up the architecture. You don’t need any formal architectural artifacts. A whiteboard works really well. Make sure to ask and answer questions like:

What systems does the application talk to?
How do they integrate?
Is it a direct call or do you go through a proxy layer?
What availability level can you expect from those systems?

Walk through the architecture. Does everyone have a shared understanding of what the application does? Does everyone understand the requirements? Are you all in agreement as to what talks to what?

You’ll uncover a lot of details if you ask impertinent questions like:

What happens when that call fails?
What is our average response time on that request?
What would our support team change about the user experience?

You will inevitably find gaps in the broader understanding. That is a feature, not a bug, of this exercise! What you thought was a direct call might actually go through a message bus. As you explore the architecture, you will find bottlenecks. You might discover that the Wombat service has a lower availability level than you need to provide. Interesting failure cases will surface, too – like when month end coincides with a Super Blue Blood Moon.
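
To put a number on the availability question, here is a rough sketch of the arithmetic. It assumes every dependency sits in the request path and that failures are independent, both of which are simplifications. The service names and figures are made up for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request that needs every dependency to succeed.

    Assumes independent failures, which real systems rarely guarantee.
    """
    return prod(availabilities)


# Hypothetical numbers, not measurements from any real system.
dependencies = {
    "our-app": 0.9995,
    "wombat": 0.995,
    "billing": 0.999,
}

combined = serial_availability(dependencies.values())
slo = 0.999

print(f"combined availability: {combined:.4%}")
print(f"meets {slo:.1%} SLO: {combined >= slo}")
```

Under these assumptions the combined figure always lands below the weakest dependency. A chain of direct calls is less available than its worst link unless you isolate the failure.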

All of this information will give you vital intelligence about where your application might benefit from failure isolation. Refactor away!

Failure finds a way

To paraphrase our favorite Jurassic Park character, Dr. Ian Malcolm, “failure, uh, finds a way”. Once you’ve isolated a failure, think about how to react when it happens. Because it will happen.

Do you need to add some redundancy to account for the flakiness of the Wombat system? Should you consider eventual consistency mechanisms, such as caching data in Redis? You will likely need the circuit breaker pattern, too. You don’t want failures cascading up to your users, now do you? Of course not!
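
As a rough illustration of the caching idea, the sketch below serves the last known good value when the dependency fails. A plain dictionary stands in for Redis to keep the example self-contained, and `fetch_from_wombat` is a hypothetical function, not a real API.

```python
import time


class StaleOnErrorCache:
    """Serve the last known good value when the upstream call fails.

    A dict stands in for an external cache such as Redis (an assumption
    made to keep the sketch self-contained).
    """

    def __init__(self, fetch, max_stale_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._store = {}  # key -> (value, stored_at)

    def get(self, key):
        try:
            value = self._fetch(key)
        except Exception:
            cached = self._store.get(key)
            if cached is None:
                raise  # nothing to fall back on, surface the original error
            value, stored_at = cached
            if self._clock() - stored_at > self._max_stale:
                raise  # cached copy is too old to trust
            return value
        # Successful call: remember the fresh value for later failures.
        self._store[key] = (value, self._clock())
        return value


# Hypothetical dependency that answers once and then goes down.
calls = {"n": 0}


def fetch_from_wombat(key):
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("wombat is down")
    return f"value-for-{key}"


cache = StaleOnErrorCache(fetch_from_wombat)
first = cache.get("order-42")   # fetched live and remembered
second = cache.get("order-42")  # upstream fails, cached copy is served
```

The trade-off is the usual one with eventual consistency: users may briefly see old data, which is often preferable to an error page. How stale is too stale is a business decision, so `max_stale_seconds` is only a placeholder.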

The anatomy of a circuit breaker

Quite simply, a circuit breaker protects a given service. It monitors calls to the service. When failures cross a configured threshold, the breaker trips (aka opens) and redirects calls to a configured failover mechanism. That could be an alternative service, a default result, or even an error message. A tripped breaker may also result in an alert to the development team. The circuit breaker can periodically let a call through to see if the service has recovered, resetting if the error threshold is no longer exceeded.

An overview of the circuit breaker pattern.
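
The states described above can be sketched in a few lines. This is a minimal illustration, not any particular library’s implementation. It counts consecutive failures, which is a simplification (production libraries often track an error rate over a rolling window instead), and all of the names here are hypothetical.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # time to let a trial call through
        return "open"

    def call(self, func, fallback):
        if self.state == "open":
            return fallback()  # skip the protected service entirely
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call re-opens the breaker; otherwise open
            # only once the threshold is reached.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success: reset the count and close the breaker.
        self._failures = 0
        self._opened_at = None
        return result


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def flaky():
    raise TimeoutError("wombat did not answer")


def default():
    return "default answer"


breaker.call(flaky, default)  # first failure, breaker stays closed
breaker.call(flaky, default)  # second failure, breaker opens
state_after_failures = breaker.state
now[0] = 11.0                 # pretend the reset timeout has passed
state_after_wait = breaker.state
recovered = breaker.call(lambda: "live answer", default)
```

While the breaker is open, callers get the fallback immediately instead of waiting on a service that is known to be struggling. That is what keeps one slow dependency from tying up the rest of the application.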

Circuit breakers are one of those useful components that simplify building and running microservices. There are multiple options to choose from, like Hystrix from Netflix, as well as implementations in several other technology stacks. Hystrix is the most common implementation, so let’s explore it in a bit more detail.

For enterprise customers that use Pivotal Cloud Foundry to run microservices, Pivotal offers the Circuit Breaker Dashboard as part of Spring Cloud Services. It’s a simple way to create, update and manage your circuit breakers. Pivotal Web Services, a hosted version of PCF, gives us an easy way to try out the Circuit Breaker Dashboard. We can dial it up from the Marketplace:

Adding the Circuit Breaker Dashboard in Pivotal Web Services.

Once you’ve bound the Circuit Breaker Dashboard to your service, you can configure appropriate fallback behavior. For example, with the Fortune Teller demo, I’ve configured the circuit breaker to return a default fortune if the fortune service is down. When the service comes back online, random fortunes are once again returned.

The Circuit Breaker Dashboard provides visibility into the current state of your circuit breaker:

A look at the Circuit Breaker Dashboard UI, with information about the state of bound services.

Regardless of your tech stack, failover and graceful degradation are a must in today’s distributed world. Failures will happen. Be prepared!

Practice chaos engineering: Because you can’t anticipate everything that can go wrong

Architectural reviews will help you find many soft spots in your system. But they won’t identify every instance where you could benefit from failure isolation. Developers tend to be very good at identifying the “happy path” of an application. The “happy path” is the flow a user should experience when everything is working as expected. It’s far more challenging to anticipate all the ways your system can go off the rails.

The difficulty grows quickly with distributed applications. A number of services interacting in unpredictable ways leads to unique, often chaotic, environments. How do you ensure a missed failure case doesn’t spiral into a major system outage? Chaos engineering to the rescue!

The discipline of chaos engineering attempts to solve the inherent difficulty in producing reliable distributed systems. After defining a steady state (aka normal behavior), chaos engineering injects various issues that real-world systems encounter. Crash an application instance. Simulate a network failure. Drop an availability zone. How does your application handle these situations?
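
On a very small scale, the same idea can be tried in a test harness. The sketch below wraps a hypothetical downstream call so that a fraction of requests fail, then checks a steady-state hypothesis: no error reaches the caller. Real chaos tooling works against running infrastructure. This only illustrates the experiment loop, and every name in it is an assumption.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper in place of a real outage."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail before reaching it."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def lookup_price(sku):
    # Hypothetical downstream call; a real one would go over the network.
    return 9.99


def price_or_default(lookup, sku):
    # The behavior under test: degrade to a placeholder instead of erroring.
    try:
        return lookup(sku)
    except Exception:
        return None


def run_experiment(requests=1000, failure_rate=0.3, seed=7):
    """Hypothesis: injected faults degrade responses but never reach the caller."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    chaotic_lookup = with_chaos(lookup_price, failure_rate, rng)
    degraded = 0
    unhandled = 0
    for i in range(requests):
        try:
            if price_or_default(chaotic_lookup, f"sku-{i}") is None:
                degraded += 1
        except Exception:
            unhandled += 1
    return {"requests": requests, "degraded": degraded, "unhandled": unhandled}


result = run_experiment()
print(result)
```

If `unhandled` ever rises above zero, the experiment has found a failure path that the fallback logic misses, which is exactly the kind of weakness you want to learn about before your users do.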

Odds are, at least at first, chaos engineering will highlight some weaknesses with your services. Once again, this result is a feature, not a bug. Figure out what you need to change in your system to handle the unexpected. Over time, your systems will become more and more reliable. The end result: you and your team will sleep easier at night.

Please microservice responsibly

Microservices are a complex architectural option. But they are the right one for certain scenarios – like when you need to isolate failure in certain components. There are ways to reduce the complexity in how you adopt microservices. Architectural reviews combined with some chaos engineering will identify vulnerable areas of your current application. Think through how to respond to the inevitable failures. Incorporate circuit breakers where they make sense.

Reliable services aren’t a guaranteed outcome of microservices; reliability requires engineering discipline. Armed with the proper tools and the right approach, your services won’t have you questioning your chosen career!

Read the rest of this series:

Part 1: Should that be a Microservice? Keep These 6 Factors in Mind
Part 2: Multiple Rates of Change
Part 3: Independent Life Cycles
Part 4: Independent Scalability
Part 6: Simplify External Dependencies
Part 7: The Freedom to Choose the Right Tech for the Job

Want to learn more about microservices? Join us at the next SpringOne!

Want more architectural guidance for your modern apps? Be sure to download Nathaniel's eBook Thinking Architecturally.
