How to Distribute Data Across a PCC Cluster: Replicated or Partitioned?

April 30, 2019 Jagdish Mirani

When you need lightning-fast data reads and writes, Pivotal Cloud Cache (PCC) fits the bill. As an in-memory cache, PCC dramatically reduces latency when users request certain types of data. The end result: much faster performance, and a better user experience.

For Pivotal Application Service users, the addition of a cache makes sense in these scenarios:

  1. Your application is based on a modern, microservices-based, distributed architecture. You want to use caching for microservices to increase performance by avoiding network hops for accessing data.

  2. You require a simple cache for front-ending a legacy database, because the current system is too slow and unreliable for modern architectures on its own.

  3. Your applications require data to be highly available across multiple sites, or multiple clouds.

Pivotal engineers have tailored PCC to meet these requirements. The product now has a set of proven replication and clustering capabilities.

PCC nodes (called members) can be clustered, and the data can be spread across the members in the cluster. There are a couple of different approaches to spreading data across a cluster, and this is an important decision point in any PCC implementation.

So what’s the best way to capitalize on these enhancements and achieve superior stability and scalability?

We sat down with Pulkit Chandra, the PCC Product Manager, to help discuss how to best architect data across a PCC cluster.


Jag: Thanks for joining me Pulkit. PCC holds data in logically grouped sets called regions. A good place to start is to describe what a region is.

Pulkit: PCC stores the data in entries, where each entry is in a key-value format. So a region basically stores these key-value entries. Think of a region as a map that associates a key with a value. The number of maps that you create are the number of regions. So a region is analogous to a group of key values, which have similar types of values, like the type of object being stored as values. The map basically allows applications to lookup values by their keys.

Jag: So this is like a table in a relational database, only it doesn't have all the complexities of foreign keys and inter-relationships between tables. It's simpler than that.

Pulkit: Exactly. PCC regions are analogous to relational database tables, but PCC is a NoSQL cache. The entries in PCC regions are key-value pairs. The values can be simple data types or even complex objects because PCC is an object store.

Jag: There are a couple of different types of regions in PCC. Folks need to think through what kind of region they want to set up based on the data they are storing. Tell us about the types of regions in the product.

Pulkit: Let’s start with a replicated region. This spreads copies of the same data on all the members of the cluster. If a replicated region on one member has 10,000 values, all the other members of the cluster will have the same 10,000 values.

The second kind is a partitioned region. It’s similar to what the industry calls sharding. With sharding, we split up the region entries into different buckets or partitions, and store these partitions on different members. If you have five members and a region with 1 million entries, each member will store 200,000 entries. This is how PCC can operate in parallel across the members of the cluster.

Jag: This distinction can get blurry, because data from partitioned regions can also be copied to other members of a cluster, right?

Pulkit: Yes, a partition of data on a member can also be copied or replicated. It’s common to do replicate to just a few other members to improve performance and availability. If a member fails, the data can be accessed from one of the copies. Performance is improved because apps can read data from any of the members that hold copies. But data can only be written to the member that holds the primary.

Figure 1: A partitioned region with a primary partition and one copy

A replicated region makes copies of the region’s full data set to every member in the cluster. With partitioned regions, the partitions are copied typically only to a few other members.

Figure 2: A replicated region with a copy of the full data set on each member

Jag: It sounds like a partition region is important for how PCC scales.

Pulkit: Yes. You want to be able to scale the system horizontally. With a replicated region, you are just copying the data to each new member as the cluster grows. With a partition region, when you add a member, you’re actually increasing the overall data capacity of the system. Here, the total number of entries in a region is evenly spread across the members. Adding a member results in fewer entries per partition.

Jag: So where should you start designing your caching architecture? Do you look at the specific kinds of data sets you have? Would you treat customer data or product data differently? How would you decide between the replicated region or the partition route?

Pulkit: Here’s what matters: How often is the data changed by the applications that use it? Is the workload read intensive or write intensive?

If you have data that is fairly static, and the workload is “read” heavy, we would recommend replicated regions. This way, any copy of the data can support reads. Plus, the additional copies allow PCC to scale to a high volume of reads. Remember, adding copies does not help with “writes”. In fact, copies of data can reduce the performance of writes, because writes have to be replicated across all the copies.

Partitioned regions, on the other hand, increase the write capacity of the system. You simply add members to the cluster. Writes are sent only to the primary member of the cluster. The member that owns the data, and partitioning adds the ability to handle more concurrent writes. When you increase the number of members, you increase the parallel number of writes that you can serve from a cluster.

Here’s what matters: How often is the data changed by the applications that use it? Is the workload read intensive or write intensive?

Jag: So, you can add members to increase the number of partitions in order to boost the throughput of writes. Can you delve deeper into why replicated regions do not help with writes?

Pulkit: PCC is a CP system in CAP theorem parlance. PCC was designed for a high degree of data consistency. Updates to data within the cluster are done synchronously. This way, reads of the data will always return consistent values and there will be no ‘dirty’ reads.

When you have many copies of the data, writes are an expensive operation. The write will not be complete until all the copies are updated. Only then will PCC send an acknowledgment of the write to your app.

For any write operation, the speed of the write is constrained by the slowest member in the cluster. Again, writes are expensive with replicated regions, because writes have to be replicated to every member of the cluster. On the other hand, with partitioned regions, partitions of data are typically copied to only a few other members. This reduces the impact on write performance.

Jag: Other than read performance, why would anybody use replicated regions?

Pulkit: You can model many-to-many relationships with replicated regions. This pops up when data from a replicated region is accessed in conjunction with a specific data element from a partitioned region. In this scenario, the requested data is available locally, within the same member from which a response can be generated. This is actually the primary motivation for using replicated regions.

Jag: How easy would it be for me to change the decision? Let’s say I start with a partition region, but then realize I I have way more reads than writes. Therefore, a replicated region may have been a better choice. Should I make that change?

Pulkit: This is not too complex, but it does require some manual steps. You need to empty and drop the region. Then you need to recreate the region, this time as a replicated type. Finally, you redistribute the data. You can use PCC’s APIs to get the data and then just put the data in the new region. There is no automatic way of just converting a partition region to a replicated region.

Jag: What if it’s difficult to profile the workload on a region by reads and writes?

Pulkit: When in doubt, use partitioned regions as a default. This way, you get excellent write performance. You can always increase read performance by increasing the number of copies because reads can be serviced from any of the copies.

Choose replicated regions when you are sure that you’re dealing with static data. For example, a region that holds product catalog data won’t change often. If there is any doubt about the read-write profile of the workload, start with a partitioned region. You can change to a replicated region if necessary.

Jag: Partitioned regions give you more upside. You can improve read performance via the copies, and you’ve got sharding to handle the write performance.

Pulkit: Exactly. You also have horizontal scalability as well. You can increase the capacity of your cluster as you scale up.

Jag: Let’s explore this a bit. What is the impact of scaling the system, either up or down? The addition or removal of members can occur frequently. When you are in production, you’ve got data distributed across the cluster in a certain way. Now how do you go about changing the size of the cluster?

Pulkit: The good news is you don't have to worry about it as a developer, it all happens automatically. Every time a member is added or removed, PCC balances the data within that cluster. You can monitor this activity via metrics exposed by PCC. But it happens automatically.

Jag: Are there any attributes of the data itself that impact how well the data can be balanced?

Pulkit: Yes, the balance is affected by the distribution key. A good key will allow the data to be balanced nicely across the members. A bad key might not balance so well. This causes hotspots on particular members.

Jag: Lots of folks use Spring together with PCC. Can the Spring framework help with these kinds of decisions?

Pulkit: Absolutely. Spring Data Gemfire is one of the modules that we are 100% compatible with. There is a new project on the horizon as well, Spring Boot Data Gemfire which is going to give you all the capabilities of Spring Boot. Using these frameworks, you can actually let Spring decide the kind of region you want.

By default, Spring chooses a partition region. That’s our default recommendation anyway. You can always override this, though. The Spring caching abstraction and repository framework free you from some of these implementation details. Don’t worry about “what is a region,” or “how it maps to a particular data store.” Just let Spring handle that for you!

Jag: There’s a lot to say about Spring, and we’ll write more about how it works well with PCC. Do you have any other words of wisdom for enterprises who are wrestling with how to best spread my data across a cluster?

Pulkit: I would leave you with one thought: When it comes to deciding on whether to use a partitioned region or a replicated region for a particular dataset, go with partitioned region. The exception to this rule would be if you’re absolutely sure that your workload has a very limited need for writes. Partitioned regions give you horizontal scalability; you can easily scale your system up or down. You can focus on the business outcome your app needs to deliver. Then, just use Spring to handle all these low-level concerns for you.

The Spring caching abstraction and repository framework free you from some of these implementation details. Don’t worry about “what is a region,” or “how it maps to a particular data store.” Just let Spring handle that for you!

Jag: Thanks a lot Pulkit, we hope to hear a lot more from you.

Pulkit: Thanks Jag. I’m looking forward to more conversations like this one.


Next Steps for Pivotal Cloud Cache Users

Regardless of how you distribute your data across your cluster, you’ll be making copies, which raises the question of data consistency. From its Geode roots, PCC is biased towards strong consistency, and with the platform’s support for high availability, PCC delivers both, strong consistency and high-availability. You’ll find the most recent PCC buzz here.

For more best practices and tips with using PCC, you will not want to miss our SpringOne Platform conference, at which we will feature several sessions related to Geode and PCC. In addition to the sessions at the main conference, we’re hosting a half-day Geode Summit where you can engage with experts, core committers, and leading production users of Geode. You’ll also find these recorded Geode Summit presentations from prior years to be highly informative (2018, 2017, 2016).

About the Author

Jagdish Mirani

Jagdish Mirani is an enterprise software executive with extensive experience in Product Management and Product Marketing. Currently he is in charge of Product Marketing for Pivotal's data services (Cloud Cache, MySQL, Redis, PostgreSQL). Prior to Pivotal, Jagdish was at Oracle for 10 years in their Data Warehousing and Business Intelligence groups. More recently, Jag was at AgilOne, a startup in the predictive marketing cloud space. Prior to AgilOne, Jag held various Business Intelligence roles at Business Objects (now part of SAP), Actuate (now part OpenText), and NetSuite (now part of Oracle). Jagdish holds a B.S. in Electrical Engineering and Computer Science from Santa Clara University and an MBA from the U.C. Berkeley Haas School of Business.

More Content by Jagdish Mirani
Effective Communication For When There’s A Language Barrier
Effective Communication For When There’s A Language Barrier

Using Microsoft Configuration Extensions with Steeltoe
Using Microsoft Configuration Extensions with Steeltoe