How To Build a Hardware Cluster for Pivotal Greenplum Database

October 20, 2015 Scott Kahler

A question we get hit up with quite often here at the big data hotspot is, "What hardware should I buy to build my own Pivotal Greenplum Database (GPDB) cluster?" While it is true that GPDB is software only and can run on many different hardware configurations, in cloud environments (AWS, Google Compute), and even in containers such as Docker, your cluster will definitely get better mileage if you lay down a good foundation.

With the rapid progression in hardware and wide range in the cost spectrum of what you could procure, the goal of this post is NOT to lay out "The One Build" that is the fastest possible cluster. Instead, we define a build that offers excellent speed and capacity to work with big data sets at a reasonable price. This article explains how most of our enterprise customers have built an effective configuration, including monitoring, network, disk, processing, and compute. But, keep in mind that GPDB offers a lot of flexibility and runs across a wide spectrum of configurations.

A Key Difference: Apache Hadoop vs Greenplum Architecture

Since many of our “build your own cluster” questions come from people who have investigated or built a cluster in the Apache Hadoop® world, it is important to note that GPDB works quite differently. Let’s start by providing a conceptual comparison.

In general, we need to do something to data—let’s say the data is represented by packages that need to be moved. To do it with MapReduce, we first need a plan that basically lists the (Java) instructions for gathering and moving the packages—let’s say it assesses package type, source location, movement path, target location, and requires a summary. Then, we need a supervisor (YARN’s Application Master) who manages the plan and the workers. First, the supervisor advertises that there is work to be done, and idle workers—processes executing the instructions—show up on their single horse, a resource allocation in a hardware node. Each worker then puts the package in their saddle bag and uses their horse and the supervisor’s instructions to move the package. As they finish, they give a status to the supervisor, eventually finish, and become idle again.

GPDB, on the other hand, has a smarter supervisor who plans the work to be done in parallel and introduces carts that can be pulled by multiple "hardware horses" at once. With a smarter supervisor, GPDB doesn't need as exact a plan as Hadoop does. Instead, the supervisor gets a general description of the work, which is SQL, not Java. Then, they take care of the planning details—they look at the SQL description, the workers, and the horses available, along with what other work is being done. Using this knowledge and their experience, they come up with a balanced plan to execute the work concurrently across all of the resources. This is where carts come in: a cart is a worker drawn by many "hardware horses," loading up and moving multiple packages at once. In GPDB, this is called a slice, and it executes in parallel against data sets which are spread across all of the nodes. In this manner, the worker is able to move large sets of packages in unison quickly. Additionally, the destination of the cart doesn't have to be a location where the packages are unloaded (persisted to disk). If there is a next step for the packages, it can be (and most often is) the case that packages are transferred to another cart—for example, a query result acts as input to another query, as in data science scenarios. This creates a very effective system for moving and processing large amounts of packages (data).

To summarize, MapReduce receives a specific plan in Java, advertises the work, and passes chunks to workers as they become available—then workers process on a single “hardware horse” at a time and report back. GPDB receives SQL, does detailed planning for parallel execution, and allows processes to use multiple hardware nodes at once.

So, I emphasize this cart description to highlight the team nature of GPDB. It is important that the hardware across the cluster is fairly uniform and promotes this paradigm. If you were to hook up a couple of mules, five Clydesdales, and an elephant to a single cart, it would be extremely hard to get the team to work in unison (or parallel) to pull one cart. This is the same with GPDB. As you architect the cluster, you should work with nodes of the same type and size—avoid piecemealing in random metal that is lying around the datacenter. Processing will be pushed down to the nodes in parallel, and results will come back at a speed dictated by the slowest member. To get the most out of the parallel execution engine and avoid chasing down hotspots, keep the pieces as similar as possible. This way, they work in unison.

What are these pieces you should get to ensure effective hardware parallelism? Here are the key categories: monitoring, network, and hosts.

Monitoring for Pivotal Greenplum Clusters

Monitoring is the first thing we'll hit up. It is often forgotten in the fun of building a new technology stack. The GPDB customers with the best long-term results are effectively monitoring their systems. Whether it be OpenView, Nagios and Cacti, Zabbix, or OpenTSDB, it is essential to have something pulling in and logging server and network statistics while also providing system alerts. Greenplum Command Center has the ability to chart database-related metrics, and alerts can be exposed via SNMP or email. However, this doesn't give you the view of the entire stack, down to the bare metal, that you need to effectively run your own cluster. Management pieces need to be in place to help detect hardware- and network-level issues.

When our customers call with software problems, our support team regularly traces these down to platform issues. For this reason, we recommend you purchase your servers from major hardware vendors such as Dell, Cisco and HP. These companies have monitoring software packages and nice out-of-band management cards to help keep tabs on all of the blinky lights. Ultimately, monitoring means optimization and a better return on the hardware and software investments.

Networking for Pivotal Greenplum Clusters

The network is another important piece of the puzzle. While the system does work to avoid crosstalk whenever possible, there is still a lot of traffic that gets passed back and forth between GPDB segment hosts. For this reason, we recommend at least dual 10G ports on the servers, hooked up to two non-blocking switches that act as the interconnect between nodes; 1G network cards and switches aren't sufficient. The host's two ports should be configured in an MLAG (or equivalent), enabling both of them to participate in passing traffic. When looking to move beyond a single rack, the links between switches should meet or exceed 40Gb to keep up with the rack-to-rack traffic. No fancy rules are necessary; we just need the ability to pass traffic as quickly as possible.

Careful consideration should also be put into the bandwidth of the uplink from the cluster to the rest of the network. There are utilities within GPDB, gpfdist for example, which allow the segment hosts to reach out and pull data into the cluster, in parallel, from other sources. Additionally, you may want bandwidth that is sufficient for pushing backups to a remote location in a timely manner. For these reasons, we recommend at least a 10G uplink to the rest of the network to enable movement of large data into and out of the cluster. Lastly, regular inserts of large data streams, such as those from IoT systems, need similar consideration.
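To put the uplink recommendation in perspective, here is a rough back-of-the-envelope sketch. The dataset size and link efficiency below are illustrative assumptions, not Greenplum figures:

```python
# Transfer-time estimate for sizing the cluster uplink.
# All numbers are illustrative assumptions, not GPDB requirements.

def transfer_hours(dataset_tb, link_gbits, efficiency=0.7):
    """Hours to move a dataset over a link, assuming a given efficiency
    (protocol overhead, contention) against the raw line rate."""
    bits = dataset_tb * 8 * 10**12           # dataset size in bits
    usable_bps = link_gbits * 10**9 * efficiency
    return bits / usable_bps / 3600

# A hypothetical 50TB backup pushed over a 1G vs. a 10G uplink:
print(f"1G:  {transfer_hours(50, 1):.1f} hours")   # roughly a week
print(f"10G: {transfer_hours(50, 10):.1f} hours")  # well under a day
```

Even with generous assumptions, a 1G uplink turns a large backup window into a multi-day affair, which is why 10G is the floor here.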

Hosts—Server Hardware for Pivotal Greenplum Clusters

Hosts are the final part of the equation, and there are two distinct roles in the cluster itself—the Master and the Segment hosts. In short, the Master does management and planning while the Segments are there for processing the data. For most use cases, we recommend that the Master and Segment hosts are nearly the same build. Processors, memory, and network should all remain the same, but the Master host does not need as much storage as the Segment hosts. For this reason, we normally recommend a 2U chassis with 24 drives for the Segment hosts, while the Master can run with 4 to 6 drives, which might make it more economical to drop to a 1U chassis. Some customers will choose to stay with 2U servers across the board and get the same number of drives to maintain consistency throughout the cluster, and that is okay.

Disk space and IO will be one of the primary concerns for the Segment hosts. Within a GPDB cluster, there is a primary and a mirror copy of the data, managed by the database, so all data is written twice. We also strongly suggest the disks be run in RAID to provide a third layer of data integrity and protection. It is best to have at least two layers of failsafe when it comes to hardware issues: it is common to run in a degraded state while systems are in repair mode, and this provides a layer of protection while another layer is offline. The Segment hosts should have 24 disks that are 10K RPM or faster. We recommend enterprise SAS drives because SATA drives are less performant, and SSD still runs high on the price-per-TB scale. It's becoming common to see SAS drives at 1.2TB and larger, which is starting to push user data capacity near 1PB per rack. When grouping disks, a set of 24 disks will usually be split into two RAID5 sets plus two hot spares. That said, this decision needs to be driven by the performance specifications of the controller and the internal expectations for data integrity and query speed. The RAID5-with-hot-spares configuration is the most common, though RAID10 or RAID6 could be equally viable in your environment. Additionally, splitting 24 disks into 4 RAID sets has been known to improve recovery time at the cost of additional disks being allocated to overhead.
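The capacity math above can be sketched quickly. This is a rough estimate under the layout just described (two RAID5 sets, two hot spares, 1.2TB drives); a real deployment also loses some space to filesystem overhead, temp space, and formatting:

```python
# Rough usable-capacity estimate for a 24-disk Segment host.
# The RAID layout and drive size are the assumptions from the text,
# not fixed GPDB requirements.

def usable_tb(disks=24, drive_tb=1.2, hot_spares=2, raid5_sets=2,
              mirror_factor=2):
    """Usable user-data capacity per host after hot spares, RAID5 parity
    (one parity disk per set), and GPDB's primary+mirror copies."""
    data_disks = disks - hot_spares - raid5_sets
    raw_tb = data_disks * drive_tb
    return raw_tb / mirror_factor  # all data is written twice

print(f"Usable per host: {usable_tb():.1f}TB")  # 12.0TB with these assumptions
```

Swap in your own drive size and RAID layout to size a rack; larger drives scale the result linearly.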

Just having a large number of big disks is not enough; it is vital that the controller setup is sufficient to maintain the maximum throughput of the disks. Ideally, they will be able to maintain that throughput when in a degraded state and during rebuild cycles too. Some modern, single-controller systems will be up to the task where, in the past, it often took two; you should validate their abilities with your vendor. Controllers will usually come with a certain amount of cache, and GPDB will normally blow right through it on reads, working with data that is much larger than the cache. On writes, the cache can make a large difference because the process of laying the data down on disk is short-circuited and acknowledged much more quickly. It is important to note that the cache often has a battery backup, and a learn cycle for the battery that is associated with it. These learn cycles have a significant impact on the amount of IO you can push at any point in time and, ultimately, impact query speed. So, it is essential to avoid cheap controllers. Look for controllers that can minimize RAID overhead, allow for low overhead with disk rebuilds, have fast learn cycles, and allow you to effectively manage and report their settings.

To keep the processor recommendation simple, all hosts should have dual processors, at least 8 cores each, clocked at 2.5GHz or faster. If you choose to use them, the system will happily utilize more cores and faster processors. GPDB is flexible in the number of Segments you can run per Segment host, and this is initially set at cluster creation. Importantly, there is a temptation to run more Segments per host with the larger core counts, and we recommend customers do not exceed the suggested 8 Segments per Segment host—even with the increased core count, we must remember that the system is sharing storage, IO, networking, and memory between the Segments. When planning Greenplum clusters of a rack or larger, it is also advisable to reduce the number of Segments per host to 6, or even 4 in extremely large clusters or clusters that run workloads with a large amount of concurrency. Like most things, parallelism is good in moderation, and a balance needs to be found between the performance gained from multiple processes and the overhead associated with managing and coordinating them all.
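As a quick sanity check on that ratio, this sketch (using the dual-socket, 8-core baseline from the text) shows how many physical cores each segment gets as you vary segments per host:

```python
# Cores available per segment, assuming the dual-socket, 8-core-per-socket
# baseline described above. Illustrative only.

def cores_per_segment(sockets=2, cores_per_socket=8, segments=8):
    """Physical cores each segment can claim on average."""
    return (sockets * cores_per_socket) / segments

for segs in (8, 6, 4):
    print(f"{segs} segments/host -> {cores_per_segment(segments=segs):.1f} cores each")
```

Dropping from 8 to 4 segments per host doubles the cores each segment can lean on, which is the trade-off behind the large-cluster guidance above.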

On each Segment host, a slice of memory is allocated for each segment to run in. Previously, we recommended 8G of memory space per segment, and, with 8 segments per Segment host, this meant a minimum of 64G of memory per server. We no longer recommend this. We have found that customers are pushing a lot more analytic queries on larger datasets through the pipeline, and we've seen an increasing fondness for things like Python, R, and PostGIS processed in parallel within the database. Combined with the amount of memory available in current servers, we feel the bar should be raised. So, 256G of memory per host is now the suggested starting point, with each of the 8 segments allocated 32G. The increased memory baseline will help satisfy the previously mentioned demands and improve the performance of concurrent query activities. It should also be noted that as you expand memory, there is often a step down in memory speed as you increase the number of slots filled. So, stick with fewer, larger memory modules, as they will normally be better than many smaller ones.
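The memory arithmetic works out as follows; the figures are the baselines from the text, and any real deployment should be sized against its own workload:

```python
# Per-segment memory slice at a given host memory size.
# Figures follow the old and new baselines discussed above.

def memory_per_segment_gb(host_memory_gb, segments=8):
    """Memory slice available to each segment on a host."""
    return host_memory_gb / segments

print(memory_per_segment_gb(64))   # 8.0  -> the previous 8G-per-segment guidance
print(memory_per_segment_gb(256))  # 32.0 -> the current 32G-per-segment baseline
```

The same function makes it easy to see the effect of reducing segments per host: at 256G and 4 segments, each segment gets 64G to work with.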

As well, a decent out-of-band management solution is often overlooked in cluster setup. Most clusters will not have the luxury of someone sitting in front of them day to day nor the ability to just discard the machine and pop another into place. So, it is important to have a good solution for contacting a problem child, diagnosing their issue quickly, and getting things back to a steady state.

There is one last, important consideration—get an extra server that has all the right pieces and is a part of the cluster but isn’t running any segments. GPDB will continue to run if you lose a server, and it will shift processing for a segment to the mirror instance. As this happens, query speeds are reduced because fewer resources are now handling the same amount of load—until the missing resource comes back online. By having an extra, hot-spare machine, you can rebuild the offline segments and get back up to proper workload distribution faster than a motherboard/drive/cpu can be shipped and scheduled for installation at the datacenter.

GPDB’s Broad Configuration Options

While that wraps up my post on rolling your own Greenplum cluster, I'll note that I've run GPDB on a reformatted Chromebox, laptop, and desktop as well as servers. Presentations have been done with GPDB clusters running in AWS, Google Compute Engine, Rackspace, vSphere, Docker, and CenturyLink. As well, production clusters are running with drive counts from 4 to 24+, using SSD and SAN storage. This last month I've helped spec new clusters for customers using Dell, Lenovo and Cisco hardware to replace systems that have reached the end of their lifecycle. GPDB is software that can be run just about anywhere.

Please don’t take this post as the only way to run GPDB. What is laid out above is a set of pointers and a general template on how to build a Greenplum cluster. It will help you build a system in line with what the majority of our enterprise customers are using right now to produce an effective result for them.
