Facebook and the limits of DIY distributed systems

March 27, 2019 Derrick Harris

This post originally appeared as part of the March 21 Intersect newsletter.

“Yesterday, as a result of a server configuration change, many people had trouble accessing our apps and services.”

Facebook’s explanation of its 14-hour outage last week sounds simple enough, but it very possibly belies a complex series of failures across an equally complex infrastructure that spans data centers around the world. Fourteen hours is an awfully long time for a company whose systems are more or less designed to maximize uptime, and that employs some of the smartest software engineers on the planet.

But Facebook is hardly alone in suffering lengthy outages caused by seemingly inconsequential things. Just about every large website, web company and cloud provider has been through the same thing, including AWS, Google, Microsoft and Apple. At their scale and with the complexity of their architectures—physical and software—all the automation and engineers in the world sometimes aren’t enough. One thing goes wrong, and it cascades.

This is one of the reasons why some people have a difficult time understanding, or at least accepting, the rush toward microservices architectures and all things Kubernetes. As the saying goes, “Shit happens.” When it does, it’s probably easier to debug a relatively simple monolith than to track down the cause across a collection of interconnected microservices running on ever-changing infrastructure.

That being said, when a company’s software footprint, user count and ambitions reach a certain scale (which is true of just about any large enterprise), microservices, done right, are almost certainly the right option for bringing order and agility to its IT organization. Depending on its application portfolio, Kubernetes might be, too. Companies like Facebook and Google don’t operate globally distributed systems and build the tools they build because they want to; they do it because they have to.

Of course, there are also business benefits to these types of architectures when they’re done well. Google’s just-announced streaming gaming service is perhaps an extreme example, but the software engineering culture and technologies the company has put in place do help it jump into new digital markets when it sees an opportunity.

However, the trick for most mainstream enterprises is taking advantage of the architectural lessons large web companies have taught the world (and the software they’ve developed) without taking on their do-it-yourself and/or not-built-here attitudes. Finding the budget, the people and, frankly, the institutional DNA to tackle every part of enterprise IT is hard work (thus the upcoming PagerDuty IPO). For example, standing up a Kubernetes cluster might be easy enough; operating it and all the complementary components at any reasonable scale, security level, etc., can prove to be a different story.

That’s why there’s a raging debate over open source licensing happening right now. The gist of the argument is who has the right to serve enterprise customers with commercial versions of popular projects.

The great message of Amazon CTO Werner Vogels in the early days of cloud computing was that companies shouldn’t invest in “undifferentiated heavy lifting,” by which he meant managing data centers and provisioning servers. The message seems to have resonated (if the success of AWS and its peers is any indicator), only now that heavy lifting has shifted to operating complex data center software and application architectures. Technologies like Kubernetes (or Hadoop or OpenStack before that) might not cost anything to install, but that’s where the free lunch ends.

Perhaps the rash of recent outages at webscale services, including Facebook, will be a useful reminder for enterprises to not fall into that old trap.

About the Author

Derrick Harris is a product marketing manager at VMware.

