This is the third post in a series about my time with the Pivotal Cloud Operations team (part 1, part 2). I want to share a fairly obscure fact I learned, one that really illustrates the power of a Platform as a Service (PaaS).
The story begins on a day where I was part of a pair that was doing a production deploy of a new release of Cloud Foundry to Pivotal Web Services (PWS). As we began the deployment I said that I hoped something would go wrong (much to the dismay of my pair), as more can always be learned that way. Well, into what was otherwise a very boring deploy (Cloud Foundry deploys are typically very, very dull), I got my wish.
BOSH was most of the way through upgrading the DEAs when one failed to come up. This was odd, because the whole platform is built on the principle of repeatability: if one DEA upgrades fine, they all should. As we dug into the logs, we found that on the failed DEA one of the processes had failed to obtain a port binding.
If you know a bit about Cloud Foundry architecture, you know that DEA VMs run several different processes: the BOSH agent, warden, a directory server and more, and some of these processes are bound to ports. To compare the ports in use, we SSH’d into a DEA that had been successfully upgraded and into the one that had failed. We found the same port number, 34567, bound to a different process on each VM: on the failed DEA it belonged to the BOSH agent.
Next we looked at how ports are assigned to the BOSH agent, and cppforlife had the answer: very simply, the BOSH agent is dynamically assigned a port. Ah ha! The smoking gun! The two bindings were on different VMs, which by itself wouldn’t be a problem, but the fact that the same port number was showing up bound to two different processes gave us a place to look. Sure enough, the deployment manifest set the port for the directory server to 34567. The first thing that starts when a BOSH VM is provisioned is the BOSH agent, which on the failed DEA had been dynamically assigned port 34567. When the directory server then tried to bind to the same port, we had a conflict.
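The failure mode is easy to reproduce on a single machine. The following is a minimal sketch in Python, not what BOSH or the DEA actually runs: the function name is hypothetical, the first socket stands in for a process that gets a dynamically assigned port, and the second stands in for a process statically configured with that same number.

```python
import errno
import socket


def reproduce_port_conflict():
    """Bind one socket to an OS-chosen port, then try to bind a second to it.

    Returns (port, conflict), where conflict is True if the second bind
    failed with "address already in use".
    """
    # Stand-in for the dynamically assigned process: port 0 asks the OS
    # to pick any free port from its dynamic range.
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    first.bind(("127.0.0.1", 0))
    first.listen(1)
    port = first.getsockname()[1]

    # Stand-in for the statically configured process: it insists on the
    # very port number the OS just handed out.
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        second.bind(("127.0.0.1", port))
        conflict = False
    except OSError as exc:
        conflict = exc.errno == errno.EADDRINUSE
    finally:
        second.close()
        first.close()
    return port, conflict


if __name__ == "__main__":
    port, conflict = reproduce_port_conflict()
    print(f"port {port}: conflict={conflict}")
```

Which process wins depends purely on start order, which is why the bug only bites when the dynamic assignment happens to land on the static number.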
How can you avoid this type of conflict?
The answer lies in ephemeral ports. I’ll let you read the Wikipedia article for details, but in summary, there is a port range that the operating system hands out dynamically, 32768 to 61000 on most Linux kernels, and static port assignments should never come from it. We had statically assigned a port from this range. The directory server, and every other process with a static port, needs a port that is NOT between 32768 and 61000.
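A check like this can be automated before a deploy. Here is a rough sketch, assuming a Linux host where the kernel exposes its dynamic range in /proc/sys/net/ipv4/ip_local_port_range. The function names and the dictionary of static ports are hypothetical, and the fallback range is simply the one quoted above.

```python
from pathlib import Path

# Fallback taken from the range quoted in this post; the real value is
# read from the kernel when available.
DEFAULT_EPHEMERAL_RANGE = (32768, 61000)
PROC_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def ephemeral_range():
    """Return the (low, high) ephemeral port range for this host."""
    try:
        low, high = PROC_FILE.read_text().split()
        return int(low), int(high)
    except (OSError, ValueError):
        # Not Linux, or the file is unreadable: fall back to the default.
        return DEFAULT_EPHEMERAL_RANGE


def conflicting_ports(static_ports, port_range=None):
    """Return the static assignments that fall inside the ephemeral range.

    static_ports maps a (hypothetical) process name to its configured port.
    """
    low, high = port_range if port_range is not None else ephemeral_range()
    return {
        name: port
        for name, port in static_ports.items()
        if low <= port <= high
    }


if __name__ == "__main__":
    configured = {"directory_server": 34567, "some_other_job": 8080}
    print(conflicting_ports(configured, DEFAULT_EPHEMERAL_RANGE))
```

Run against the assignment from this story, the check flags the directory server’s 34567 and leaves a port such as 8080 alone.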
You might note that this is a pretty big range, which explains why we had never had this happen before. It’s one of those really insidious bugs that often won’t show up in tests and can go undetected for a long time, only to rear its head at a terribly inopportune moment.
I like to think that my wish for something to go wrong coming true helped us find this bug ;-).
Let me come back to my opening salvo. As an application developer, you really shouldn’t need to know this relatively obscure fact. However, that means you need a platform that handles these minutiae for you, like Cloud Foundry. You see, when you deploy an application to Cloud Foundry, the platform chooses the port. Have you ever noticed that the port numbers assigned to app instances are in the 61001 and higher range? Compare that with the ephemeral port range above; it is not a coincidence. This is just one example of the many IT burdens that the platform lifts from the application developer and operator. That is so freakin’ cool!
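From the application’s side, “the platform chooses the port” means the app reads its port at startup and never hard-codes one. A minimal sketch, assuming the platform passes the port in a PORT environment variable (the convention Cloud Foundry uses, though this post does not go into it); the function name and the local-development default of 8080 are hypothetical.

```python
import os


def listen_port(environ=None, default=8080):
    """Return the port the app should listen on.

    Reads PORT from the environment (assumed to be set by the platform)
    and falls back to a default for local development.
    """
    env = os.environ if environ is None else environ
    return int(env.get("PORT", default))


if __name__ == "__main__":
    print(f"would listen on port {listen_port()}")
```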
One final note: throughout all of this, we experienced zero downtime for PWS. All of the Cloud Foundry components, save one of the many DEAs, continued to function as expected. Yep, no downtime whatsoever! BOOM!