GPU-Accelerated Deep Learning on Greenplum Database

July 17, 2019 Frank McQuillan

Deep learning is becoming an increasingly important part of enterprise computing, since artificial neural networks are very effective in domains such as language processing, image recognition, fraud detection and recommendation systems.  Over the last five to ten years, a massive increase in computational power at reasonable cost, together with the availability of enormous troves of data, has contributed to an explosion of interest in deep learning.

Enterprises have made significant investments in SQL-based infrastructure, software and training of their employees.  However, the main innovations in deep learning have taken place outside of the SQL world, requiring the adoption of separate deep learning infrastructure.  This requires careful consideration, given the additional expense and level of effort, not to mention the risk of developing new data silos. In addition, moving large datasets between systems is not efficient.  What if enterprises could execute deep learning algorithms using popular frameworks like Keras and TensorFlow in an MPP relational database? This would enable enterprises to leverage their existing investments in SQL, making deep learning easier and more approachable.

An additional consideration is the multi-modal nature of many data science problems today.  Data scientists spend a lot of time on feature engineering and often employ multiple approaches to a problem, resulting in an ensemble of models.  If this computation can all occur within one engine, it is much more efficient than solving different parts of the problem in different systems and then trying to combine results. To do this, it’s helpful to have a collection of machine learning and analytical functions that can be executed within the database itself to reduce or eliminate data movement between environments.  

GPUs on Greenplum

To bring deep learning to Greenplum Database, standard libraries such as Keras [1] and TensorFlow [2] are deployed on the Greenplum segment hosts, and the GPUs on each host are shared by the segments (workers) on that host (Figure 1).

Figure 1:  Greenplum Architecture for Deep Learning

The design is intended to eliminate transport delays across the interconnect between segments and GPUs.  Each segment works on a local shard of data, and Apache MADlib [3], the open source machine learning library, is responsible for merging the model state from each of the segments into a single overall model.  In this way, we take advantage of the horizontal scale-out capability of MPP.
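The merge step described above can be illustrated with a small sketch. The text does not spell out the merge rule, so this assumes a simple weighted average of per-segment weights, weighted by the number of rows each segment trained on. The function name `merge_segment_models` and the dictionary layout are hypothetical and are not MADlib APIs.

```python
from typing import Dict, List

# Hypothetical representation: layer name -> flat list of parameter values.
Weights = Dict[str, List[float]]


def merge_segment_models(segment_weights: List[Weights],
                         rows_per_segment: List[int]) -> Weights:
    """Merge per-segment model weights into one model.

    Assumption: the merge is a weighted average, where each segment's
    contribution is proportional to the number of rows in its local shard.
    """
    if not segment_weights or len(segment_weights) != len(rows_per_segment):
        raise ValueError("need one row count per segment model")
    total_rows = sum(rows_per_segment)
    if total_rows <= 0:
        raise ValueError("total row count must be positive")

    merged: Weights = {}
    for layer, first_values in segment_weights[0].items():
        merged[layer] = [
            sum(w[layer][i] * n
                for w, n in zip(segment_weights, rows_per_segment)) / total_rows
            for i in range(len(first_values))
        ]
    return merged


# Two segments with shards of different sizes (illustrative values only).
segments = [{"dense": [1.0, 2.0]}, {"dense": [3.0, 6.0]}]
merged_model = merge_segment_models(segments, rows_per_segment=[1, 3])
```

In this sketch the segment holding three times as many rows pulls the merged weights three times as hard, which is one reasonable way to keep unevenly sized shards from skewing the result.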


Programming is done in SQL by calling Apache MADlib functions.  Here is the SQL to train a model on the well-known CIFAR-10 dataset of images [4]:

The model architecture table model_arch_library contains a JSON representation of the convolutional neural network (CNN) to be trained.  CNNs are a special kind of neural network that is very good at image classification [5].  Note the GPUs per host parameter, which specifies the number of GPUs on each segment host to use for training.  Specifying 0 for this parameter means training with CPUs rather than GPUs.  This can be useful for initial runs and for debugging shallow neural networks on smaller datasets, say on PostgreSQL, before moving to more expensive GPUs to train deep neural networks on the whole dataset on Greenplum.

Here is the SQL to predict the class of new images based on the model we trained above (inference):

Performance and Scalability

Modern GPUs have high memory bandwidth and ~200 times more processors per chip than CPUs because they are optimized for parallel data computations such as matrix operations, whereas CPUs are more general purpose in order to perform a wider variety of tasks.  The performance gains from using GPUs for training deep neural networks are well known. Figure 2 shows the difference in performance for a simple deep CNN [6] between a regular CPU Greenplum cluster and a GPU-accelerated Greenplum cluster. Here we plot test set accuracy for the CIFAR-10 dataset vs. training time for a small cluster with four segments.

Figure 2:  Training Performance on Greenplum Database GPU vs. CPU*

It takes more than 30 minutes of training time for the CPU cluster to achieve 75% accuracy on the test set, whereas the GPU cluster reaches 75% accuracy in less than 15 minutes.  Note that CIFAR-10 image resolution is only 32x32 RGB, so the gains from using GPUs are smaller than for higher resolution images. For example, with the Places dataset [7], which has 256x256 RGB images, we observed that training with GPUs was 6x faster than with CPUs for the VGG11 network configuration [8].

Reducing training time is key since it means data scientists can iterate more quickly on their models, and newly trained models can be deployed to production more quickly.  In the case of fraud detection, for example, reducing the time to train a new model and deploy it to production can translate directly into reduced financial losses.

Inference is the term for using a trained model to perform predictions on new data that has not been seen yet.  MPP databases like Greenplum are excellent for batch inference; throughput increases in a linear fashion with database cluster size.  For example, using the CNN model we trained above, Table 1 shows the time to perform batch inference on 50,000 new 32x32 color images.

Number of Greenplum segments | Batch inference time on 50,000 images (sec) | Scale factor vs. single node
1 (single node) | |

Table 1:  Batch Inference Scaling on Greenplum Database Clusters*
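As a minimal sketch of how the "scale factor vs. single node" column can be derived from measured times, the helpers below compute the scale factor and compare it with ideal linear scaling. The function names are hypothetical, and the timings in the usage lines are made-up placeholders, not the benchmark results.

```python
def scale_factor(single_node_seconds: float, cluster_seconds: float) -> float:
    """Speedup of a cluster relative to the single-node baseline."""
    if single_node_seconds <= 0 or cluster_seconds <= 0:
        raise ValueError("timings must be positive")
    return single_node_seconds / cluster_seconds


def scaling_efficiency(single_node_seconds: float,
                       cluster_seconds: float,
                       num_segments: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    return scale_factor(single_node_seconds, cluster_seconds) / num_segments


# Placeholder timings for illustration only (not measured values).
baseline_seconds = 100.0
four_segment_seconds = 25.0
speedup = scale_factor(baseline_seconds, four_segment_seconds)
efficiency = scaling_efficiency(baseline_seconds, four_segment_seconds, 4)
```

With perfectly linear scaling, the scale factor equals the number of segments and the efficiency is 1.0; values below 1.0 indicate overhead that grows with cluster size.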

Future Work

As part of the Apache MADlib project, the community plans to add new deep learning capability with each release.  For example, one common data science workflow is parameter selection, in the form of hyper-parameter tuning of models and model architecture search (deciding on the number and composition of network layers).  This involves training dozens or sometimes hundreds of different combinations in order to find the one with the best accuracy/training cost profile.  The parallel compute capability of MPP databases like Greenplum makes them potentially excellent systems for these types of workloads.
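To give a sense of how quickly such a search grows, here is a small sketch that expands a grid of candidate settings into individual training configurations and spreads them across workers round-robin. This is purely illustrative: the parameter names, the helper functions, and the round-robin assignment are assumptions, not a description of any MADlib feature.

```python
from itertools import product
from typing import Any, Dict, List

Config = Dict[str, Any]


def expand_grid(grid: Dict[str, List[Any]]) -> List[Config]:
    """Return every combination of the candidate values in the grid."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def assign_round_robin(configs: List[Config], num_workers: int) -> List[List[Config]]:
    """Distribute configurations across workers as evenly as possible."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    buckets: List[List[Config]] = [[] for _ in range(num_workers)]
    for index, config in enumerate(configs):
        buckets[index % num_workers].append(config)
    return buckets


# Hypothetical search space: 3 x 2 x 2 = 12 combinations.
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [32, 64],
    "num_conv_layers": [2, 3],
}
configs = expand_grid(search_space)
per_worker = assign_round_robin(configs, num_workers=4)
```

Even this small grid yields 12 training runs, so adding one more parameter with a handful of values easily pushes the count into the dozens or hundreds mentioned above.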

* Benchmarking infrastructure:

Google Cloud Platform

Greenplum 5

32 core vCPUs, 150 GB memory

NVIDIA Tesla P100 GPUs with 1 GPU per Greenplum segment (worker)

[1]  Keras

[2]  TensorFlow

[3]  Apache MADlib

[4]  CIFAR-10 dataset

[5]  Le Cun, Denker, Henderson, Howard, Hubbard and Jackel, Handwritten digit recognition with a back-propagation network, in: Proceedings of the Advances in Neural Information Processing Systems (NIPS), 1989, pp. 396–404.

[6]  Training a simple deep CNN on the CIFAR10 small images dataset

[7]  Places dataset

[8]  Simonyan and Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition


About the Author

Frank McQuillan

Frank McQuillan is Director of Product Management at Pivotal, focusing on analytics and machine learning for large data sets. Prior to Pivotal, Frank worked on projects in the areas of robotics, drones, flight simulation, and advertising technology. He holds a Master's degree from the University of Toronto and a Bachelor's degree from the University of Waterloo, both in Mechanical Engineering.
