The Road to Persistent Data Services on Cloud Foundry Diego

June 24, 2015 Paul M. Davis

During a presentation at the Cloud Foundry Summit 2015, Pivotal’s Caleb Miles and Ted Young discussed their research into managing persistent data on Cloud Foundry. As they explained, data services traditionally have been managed outside of Cloud Foundry, increasing the complexity of service brokers and reducing the efficiency of data-driven apps. During the talk the duo proposed a new 12-factor app model for data services, demonstrating how Cloud Foundry users can access these stateful services using the CF CLI and Diego.

Young and Miles provided technical background on how this model operates on a homogenous cluster, wherein data nodes and a service broker or cluster manager application operate as two distinct applications. They then provided a hypothetical walkthrough using the CF CLI to create a Redis service running on Cloud Foundry that can be accessed through the service broker API. This process begins with creating a floating volume of Redis data, then a cluster using the Docker Redis image, and then a separate application known as the Redis broker. You can then create a service instance off of the Redis broker and bind the service to an application. In this way, data services can be run on a unified Cloud Foundry platform.

They followed this example with a short discussion of Diego, the rewrite of the Cloud Foundry runtime. They explained how it provides virtual machines, called cells, which run Cloud Foundry applications in containers, and introduced the concept of volumes to the discussion. Volumes are a resource the scheduler can manage; a volume manager abstracts away platform-specific details and exposes the volumes through a simple API that contains only the information the scheduler cares about. This enables fault tolerance as the scheduler manages clusters across multiple hosts.

To further illustrate these operations, Young and Miles provided an example of how scheduling would operate following this model. In their example, the scheduler sees an application set consisting of six instances and a corresponding set of fixed volumes, and assigns them across the availability zones. This model creates a new type of interaction between developers and cloud operators, wherein the developer implements drain and interrupt handlers that run when a node is going offline. Developers also need to be cognizant of the difference between performance and availability when working with fixed versus floating volumes.

For more details on Young and Miles’s experiments with persistent data services on Cloud Foundry, check out the transcript, or watch the full Summit talk below.

Learn More


Ted Young:
All right. Hello, everyone. My name is Ted Young. This is Caleb Miles. We’re going to tell you today about a research project we’ve been working on at Pivotal about running services on Cloud Foundry, specifically, managing persistent data on Cloud Foundry and what would be involved in doing that.

To kick it off, let’s talk about the current state of what you can run on Cloud Foundry. Currently, we only allow you to directly run what we call elastic compute nodes, basically, stateless web processes. What those processes are allowed to do was best codified in the 12-factor web app manifesto that was released a while back. It describes a very high-level overview of the contract between the developer and the platform when you’re running these stateless services.

Caleb Miles:
After running these stateless applications for a little while, we’ve become a little bit dissatisfied, because it is difficult to write very compelling data service projects using the platform today. A lot of it has to do with the fact that you have to manage the lifecycle of these data services outside of the Cloud Foundry create-service lifecycle request, and you get a lot of complexity in your service brokers. There are also some inefficiencies in having to segregate your services from your applications, so you leave some resources on the table, which no one wants to do. Then, you also have two ways of setting up access credentials, one for your service partition and one for your application partition, which also introduces a bunch of overhead that you may not want.

Ted Young:
Mm-hmm (affirmative). If we’re going to extend our application model beyond just the 12-factor web app, how much of this needs to change? This is the 12-factor app in its completeness. There are kind of a lot of things there. At first, we were like, “Wow, how much of this gets thrown out the window once you start running data services?” The answer, it turns out, is not very much. These two points change. Everything else pretty much runs the same way. Before, when it was just stateless, the processes factor said you are just executing stateless processes that are undifferentiable. You can’t tell them apart.

We’re going to change that now. We’re going to say for data apps, what you do is you expose persistent data via one or more addressable processes. These processes are individually addressable now. The other aspect that changes is disposability. Before, disposability was very vague. It just said these things should start fast and fail fast. What we’re saying with data applications is you now need to add the concepts of an interrupt and a drain signal to manage shutdown and replication. There’s some nuance attached to these two points.

Digging into the first point a little more deeply about processes, exposing persistent data via one or more addressable processes: each individual member can now be directly addressed. Whether you expose that direct address publicly or not is still up to you. You may only want these things to be individually addressable by, say, a service broker or some other client that’s only running in the cloud as opposed to externally outside the cloud. Nevertheless, any client that wants to talk to this data service is probably going to have an algorithm where it needs to know which member it’s talking to. It can’t just round robin between the individual members.

Each member has a unique state, which we’re going to persist in a volume, but otherwise, the members are homogeneous. We’re still talking about a homogeneous cluster. For example, if you have an application that has, say, data nodes and then some kind of service discovery or cluster manager node, you deploy that as two applications. One application would be all your homogeneous data nodes, the other application would be the service broker or cluster manager application, which again would be homogeneous.

Caleb Miles:
This is where the individual addressability really comes into play, when stitching these clusters together via whatever orchestration is right for that particular data service.

Ted Young:
Yeah, but you’re still not saying, when we start this thing up, instance one, you get started with this set of configuration; instance two, you’re being started with some completely different set of configuration. We’re not saying we’re going all the way in supporting every possible way of deploying services.

Caleb Miles:
In handling disposability, really the ephemeral nature of resources in the cloud means these nodes need to expose two kinds of signal handling, an interrupt and a drain. We’re thinking that the interrupt is the cluster telling this data node instance, “Hey, you need to stop your work for a little while,” and to inform anyone that’s monitoring you that you’re going to be gone for some short period of time. Maybe we’re scheduling you on a different node. We may be rolling the underlying host for an OS update. We don’t know. We think we’re going to come back pretty soon.

We also have this idea of a drain, which again tells those people who are following you that this node is going to go offline, and you need to take some actions in terms of rebalancing the data that used to be on this node. We know that we want to be designing systems that have self-healing, self-monitoring processes, and we need to expose these lifecycle hooks to allow that mechanism to work.
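The interrupt and drain hooks described above can be sketched as ordinary signal handlers. This is a minimal toy model: the class name, state strings, and the choice of SIGINT for interrupt and SIGTERM for drain are our own assumptions, not anything the talk specifies.

```python
import signal

class DataNode:
    """Toy model of a data node honoring the two lifecycle signals
    from the talk. Names and signal choices are illustrative."""

    def __init__(self):
        self.state = "serving"

    def on_interrupt(self, signum=None, frame=None):
        # Interrupt: pause work and tell monitors we expect to be back
        # soon, e.g. the underlying host is being rolled for an OS update.
        self.state = "paused"

    def on_drain(self, signum=None, frame=None):
        # Drain: this node is leaving for good; kick off rebalancing of
        # the data it holds before shutting down.
        self.state = "draining"

node = DataNode()
# Hypothetical mapping: SIGINT as the interrupt, SIGTERM as the drain.
signal.signal(signal.SIGINT, node.on_interrupt)
signal.signal(signal.SIGTERM, node.on_drain)
```

A real Cassandra or Redis Cluster package would do application-specific work inside each handler, as the speakers note later; the point here is only the shape of the contract.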

Ted Young:
Okay. This is an imaginary, hand-waving example of what this would look like, a walk-through using the CF CLI. What we’re showing here is the steps that would be necessary to create a Redis service that you could expose through the service broker API while running all the components on Cloud Foundry. With the first command, you would create a floating volume, and we’ll get into in a bit what a floating volume is versus a fixed volume. You’re going to say create a floating volume of Redis data. You’re going to say how big each instance of this volume is going to be. Then, you’re also going to mention some reserved memory. This is telling it, when it’s thinking about scheduling these volumes, that the applications you might be binding to these volumes might be a certain size now, but you want to make sure that there are more resources available later if you want to attach additional or larger processes. You set aside some memory when you talk about your volumes.

Then you would create your Redis Data cluster. This root-fs directive is a hand-wavy way of saying let’s just use the default Docker Redis image. During our spike implementation, we actually got that working on Diego. You say mount volume, the name of your Redis Data volume up there, which actually should be Redis Data, not my Redis Data. That says when you push that application, only schedule that application with the Redis Data volume mounted to it.

Then you would create a separate application called your Redis Broker. We’re again imagining that we’ve created a Docker image that contains everything needed to run this Redis Broker. Then, we hand waved around whatever the command will be to allow that application access to the Redis Cluster application with all the permissions needed to be a broker for that application. That would allow it to see the addresses of the individual members of the Redis Cluster. It would allow it to scale up, scale down, have access to administrative ports, whatever things you need to actually be the broker.

Once you set up this Redis Broker, the system notices it’s a broker. You can then use create-service to create a service instance off of that broker and then bind that service to an application. That would be the way you would walk through this whole thing. Now, you have an application using Redis, the Redis Broker and the Redis Cluster, all of that running on one platform. That’s the goal here.
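The walk-through above might look something like the following. This is pseudocode, not a real CLI session: the speakers stress that the volume and broker commands are imaginary, and every flag and name here is our reconstruction; only create-service and bind-service resemble the actual CF CLI.

```shell
# All hypothetical, mirroring the hand-waving walkthrough from the talk.
cf create-volume redis-data --type floating --size 1G --reserve-memory 2G
cf push redis-cluster --root-fs docker:redis --mount-volume redis-data
cf push redis-broker --root-fs docker:redis-broker
cf allow-broker-access redis-broker redis-cluster  # imaginary permissions step
cf create-service redis default my-redis           # system now knows the broker
cf bind-service my-app my-redis
```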

Quickly, I want to point out that in this research we were only focused on looking at how volumes affect the scheduling, but you can see, once we’ve got that, there are these other questions we want answered next: how do things like network security, permissions, things like that, how can all those things work in this new world? We want to get this volumes step out of the door quickly because we think that is what’s going to allow us to address these other questions in a little bit more of a real-world scenario.

Real quick, we’re going to go over an example of how we implemented this in our spike. This is just going to be a description of the components we ended up making on Diego and a walkthrough of an example scheduling, just so you can get a visceral feel for what this would look like.

If you don’t know Diego, there are a couple of concepts you need to know about. A cell is an individual virtual machine running in the container farm. This is a virtual machine that then runs Cloud Foundry applications in containers. An application is a set of instances that are running in containers across the cell cluster. The scheduler is a daemon that keeps the current state of the system in its head along with the desired state of the system. When it notices there’s a discrepancy, it schedules work so that whatever isn’t running starts running, and whatever shouldn’t be running is stopped. That’s the scheduler.
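The desired-versus-actual convergence described above can be sketched in a few lines. This is a toy sketch under our own naming, not Diego's actual implementation:

```python
def reconcile(desired, actual):
    """One pass of the scheduler's convergence loop: start whatever is
    desired but not running, stop whatever runs but is not desired."""
    to_start = desired - actual   # desired instances with no running copy
    to_stop = actual - desired    # running instances no longer desired
    return to_start, to_stop

# Instances named "app/index" purely for readability.
desired = {"web/0", "web/1", "redis-data/0"}
actual = {"web/0", "stale-app/0"}
to_start, to_stop = reconcile(desired, actual)
# to_start == {"web/1", "redis-data/0"}; to_stop == {"stale-app/0"}
```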

To those concepts, we’re going to add a couple of new ones. We’re going to add the concept of a volume, which is a resource that the scheduler knows how to schedule. Individual hosts can offer resources like volumes via a volume manager, which is a cell component that abstracts away all the different platform-specific stuff around how those volumes are implemented and only exposes them through a much simpler API that only contains the information the scheduler cares about. When the scheduler schedules a volume, it thinks about it as a volume set. When you are scheduling individual volumes, you’re scheduling them against an application that is actually running many instances. Your Hadoop cluster is running, like, 50 instances. When you’re scheduling the volume for that cluster, you’re actually scheduling a set of 50 volumes. It’s important for the scheduler to think about that as a set because it’s those volumes that it needs to make fault tolerant relative to each other. You don’t want it putting all 50 volumes on the same host or something like that. It needs to know that these represent resources that need to be scheduled fault tolerant relative to each other.
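The "don't pile the set onto one host" constraint can be illustrated with a trivial round-robin placement. This is a stand-in of our own devising, far simpler than any real scheduler:

```python
import itertools

def place_volume_set(size, cells):
    """Place a volume set of `size` volumes across `cells` round-robin,
    so no host collects a second volume until every host has one."""
    return {
        f"vol-{i}": cell
        for i, cell in zip(range(size), itertools.cycle(cells))
    }

placement = place_volume_set(5, ["cell-a", "cell-b", "cell-c"])
# vol-0 -> cell-a, vol-1 -> cell-b, vol-2 -> cell-c, vol-3 -> cell-a, ...
```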

Caleb Miles:
In our experiment, we only really got to examine host-based storage, but we’re thinking that there are really two broad categories of storage: storage that is in some way attached to the host in terms of its lifecycle or availability, and another type that can float between one or more hosts. This is your SAN, your clustered EMC storage appliance that is able to be attached to more than one host, though maybe not at the same time.

We have these fixed volumes, which again are connected to the lifecycle of the underlying host. Imagine that these fixed volumes can provide stores that are very local to the processes that are running on top. We’re imagining things like Cassandra might run on top of this, which needs a lot of throughput, so we want to use local disks. Then, we have this floating type, which can float around the cluster. We see that our floating disks appear on all the cells because they could be attached to any of the cells, whereas the fixed ones are specific to the cell where they were created when initially scheduled.

Ted Young:
Mm-hmm (affirmative). Cool.

Caleb Miles:
Again, we’re drilling into the scheduling a little bit. We see that fixed volumes, as we were saying, are scheduled on the cells where they’re going to be used because they’re very close to the actual processes that are going to be consuming them. The fixed volumes are scheduled on each individual cell. And like Ted was saying, it’s very important from an availability and fault-tolerance perspective to not land all of these volumes on the same host. The scheduler will try and spread them across your available hosts, so you have a highly available, fault-tolerant solution.

Floating volumes, on the other hand, because they could be backed by some kind of network storage, can appear on any of the cells. When the scheduler is going around looking at how to schedule things, it has to place the applications that are actually going to run on these cells. It has a much wider choice because all of the floating volumes are available on any of the cells.

Ted Young:
Cool. Now, we’re going to walk through an example scheduling, okay? Just to explain this diagram, which we’re going to walk through over several slides: you’ve got the cell cluster. Those squares are all the different cells where we could run containers, divided into two availability zones, availability zone one and two. You’ve got the scheduler, which has a square representing what it’s currently thinking about scheduling. Then next to it, you have the database, which holds the state of the system. You’ve got some desired state and actual state. On this first slide, you can see we’ve got some desired state, which consists of an application set, a grouping of six instances, and a corresponding set of fixed volumes. We’re not running any of this currently.

The first thing that happens is the scheduler notices that it’s not running stuff that you want, and so it loads it into its brain and tries to decide where it’s going to place it. Placing the fixed volumes is the most important thing because wherever those land, it can’t move them later. It thinks about those first. It places them equally across the two availability zones. You notice it doesn’t pile them up on any particular cell. Then, it schedules the application that wants to run against those volumes. This is easy. It has no choice about where it can put them, so it just runs them right where it placed the volumes. Now, our actual state matches the desired state, and we’re done.

Then, we come in and we change our desired state. Now, we’re desiring a new application that has two instances that want to be attached to a floating volume. The scheduler notices this and loads it into its head. First, it schedules the floating volume. Notice what happens here. When it schedules this, one instance becomes available on any node in the first availability zone and the other instance becomes available on any node in the second availability zone. This is the volume manager saying that the way these floating instances work under the hood is you can attach anywhere you like, but you can’t attach processes to the same volume across availability zones. It won’t let you do that. Once it’s placed those, when it comes time to place the actual instances, it’s got a lot of choice here, right? It could place them on a number of cells. It picks two free cells that aren’t running anything else. There you have it.
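The cross-zone restriction on floating volumes boils down to one check. The function and argument names below are our own illustration of the rule, not a volume manager API:

```python
def attach_allowed(volume_az, process_azs):
    """The floating-volume rule from the example: processes may attach
    from any cell, but only within the volume's availability zone."""
    return all(az == volume_az for az in process_azs)

attach_allowed("az1", ["az1", "az1"])   # allowed: both attachers in az1
attach_allowed("az1", ["az1", "az2"])   # refused: attachment spans zones
```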

Ted Young:
Cool. To quickly sum up, when it comes to operating this, there’s a shared contract between the developer and the cloud operator. We just want to quickly review what the new operational requirements are for these two groups. The developer needs to be responsible for implementing drain and interrupt handlers. How your application drains or shuts itself down properly is entirely application-specific. Cassandra is going to do this differently from Redis Cluster, differently from Hadoop. Whoever is building those packages has to figure out what drain and interrupt mean for that application. They also have to understand the lifecycle policy. The lifecycle policy is what dictates, for example, how many instances might be told to drain at the same time. The developer needs to know what that is so they can ensure that their application is going to be tolerant to whatever the cloud operations people might be doing to it.

If you can say your application can handle having two instances draining at a time, then when cloud ops is rolling the cluster, they understand that they can shut off at most two cells at a time. That lifecycle policy is a very important policy that binds the developers and the cloud ops people. As the developer, you also need to understand the differences in availability and performance between fixed and floating volumes. Fixed volumes are going to be much more performant because they’re local and they’re not doing any kind of replication or backup for you. Floating volumes are going to be much more available, but they’re going to have slower throughput because they’re going to be replicated or remote.
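The lifecycle-policy arithmetic above ("at most two instances draining means at most two cells rolled at once") can be sketched greedily. This is our own toy illustration of the contract, not anything from the talk's implementation:

```python
def cells_safe_to_roll(max_draining, instances_per_cell):
    """Given the policy's cap on concurrently draining instances, pick
    which cells operations may take offline at once. Greedy sketch:
    favor lightly loaded cells; stable sort keeps dict order on ties."""
    chosen, draining = [], 0
    for cell, count in sorted(instances_per_cell.items(), key=lambda kv: kv[1]):
        if draining + count <= max_draining:
            chosen.append(cell)
            draining += count
    return chosen

hosts = {"cell-a": 1, "cell-b": 1, "cell-c": 2}
# Policy allows 2 draining instances -> roll cell-a and cell-b together.
```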

Cloud ops, conversely, need to know about the lifecycle policy because they’re going to use that as cover for any maintenance that they want to do. Anytime, as an operations person, you want to figure out how to migrate your cloud, this lifecycle policy is going to tell you what you can do without breaking your SLA with the services that are running on top of your cloud. Fixed volumes, in particular, interact with the host lifecycle, so when you’re thinking about rolling your hosts, you have to think about how the fixed volumes have been allocated on top of those hosts.

Of course, over the course of doing this, you’re going to encounter services that are not draining or interrupting properly. They’re hanging. While we want to keep cloud ops and dev ops totally separate, cloud ops is going to be in a position to notice when things have gotten stuck and take the appropriate response to resolve the issue.

Caleb Miles:
That’s, again, where the lifecycle policy comes in, empowering these two groups of people to not have to have each other sitting at the same screen while a cluster deploy is going on or while a new version of the developer software has to go out. It’s about the developer’s understanding what the constraints are on their use of the cluster and about the operations team knowing what their freedoms are in terms of how they can maintain the cluster.

Ted Young:
It’s not mentioned on this slide, but one of the things that really helps is moving a lot of the policy decisions away from the cloud ops people and over to the CF administrators. Right now, if you’re running services using BOSH, the BOSH operations are pretty tied to the cloud operations. That means any policies around where you’re running your services and how that’s going to work, cloud operations has to think about that. What we’re hoping is that once all that stuff is on the other side of the fence, running on our platform, you can use all of the administrative tools, like organizations and spaces, placement pools, things like that, in order to control policy. That’s just putting the three tasks back in the right place: developers caring about how their apps work, cloud ops caring about how the platform works, and administrators caring about policy.

That’s pretty much it. Just to cover this once again, the high-level stuff that affects the interface of Cloud Foundry is the lifecycle policy, drain, and interrupt. Those are the new concepts. We’re excited about this because our spike was not actually very hard to implement. We believe that it should be possible to roll these changes out fairly quickly given a concerted effort.

Another thing that we’re happy about was that the volume manager allowed us to keep the Diego platform decoupled from BOSH or the IaaS. We were a little concerned that once you start talking about managing volumes and stuff, you start having more sticky, IaaS-specific stuff showing up in Diego, but that doesn’t seem to be the case.

Caleb Miles:
Yeah. That’s again where it’s very helpful to have two distinct types of volume concepts, the fixed and the floating, because you’re going to have them both implemented in whatever way you, as an infrastructure provider, think is the right way to do it.

Ted Young:
Right. Lastly, we think a beta version of this, say, rolled out on top of Lattice is very important because once you’ve got volumes as a first-class resource, you can start working on the other parts of the problems with how do you run a service broker, how do you make the network secure. It’s easy to think about these problems if you’ve already got the basic service up and running on top of Diego.
