VMware Tanzu Greenplum is a massively parallel processing (MPP) data platform based on the open source Greenplum Database project. It’s designed to run the full gamut of analytical workloads, from BI to AI. Because enterprise data lives and grows throughout an organization, it is suboptimal to copy large data sets between different systems as they aren’t able to perform fast enough, scale high enough, or offer the right features.
In this post, we will discuss the power of MPP on vSphere for end-to-end data science workflows. We will share scalability results from traditional machine learning workloads as well as deep neural networks leveraging GPUs.
Business intelligence is foundational in enterprise analytics
Business data analysis—including reporting, online analytical processing, the use of dashboards, and support for concurrent queries by analysts to explain historical outcomes—is a fundamental role of data platforms.
The data industry uses TPC-DS as the de facto standard benchmark for big data. It’s designed by the Transaction Processing Performance Council, where VMware is a full member. The intent of TPC benchmarks is to provide objective performance data. TPC-DS in particular measures query response time in both single- and multiuser mode for a complex decision support workload, modeled on a company that sells and distributes products. There are 99 queries and the schema is snowflake in nature, with multiple levels of dimensional tables, like with this store returns entity-relationship diagram.
A store returns entity-relationship diagram from TPC-DS
The table below shows the results from running TPC-DS with Greenplum on VMware vSphere 7 using the cluster configuration described at the end of the post.* The main factor affecting runtime for Greenplum is the number of CPU cores; increasing the number of cores will decrease query time linearly. In addition to query time, a key measure is whether all 99 queries complete successfully without modification (they do for Greenplum). Many other vendor and open source databases cannot run all 99 queries, or need some of the queries to be modified in order to make them run.
TPC-DS results on Greenplum for a 3 TB data set
A large commercial airline in Asia uses Greenplum to manage its customer loyalty program and leverages this scalable BI capability. It uses a mix of reporting, analytics, and even transactional queries, including concurrent line-of-business queries on very large data sets using SQL command-line as well as commercial BI tools. This is a common usage profile for many Greenplum customers.
Machine learning for predictive analytics
Machine learning workloads are growing in importance, and enterprises view them as a means to potentially improve the products and services they offer to their customers. Corporate data is often structured data and resides in relational form, so if machine learning can be done in-database and at scale with SQL, this can reduce the cost, complexity, and overhead associated with procuring and maintaining multiple tools and libraries.
Let’s look at the scalability results of two in-database machine learning methods that use the Apache MADlib open source library. It offers a comprehensive set of supervised and unsupervised learning methods invoked with SQL and optimized for Tanzu Greenplum’s parallelism. In the area of supervised learning, logistic regression is a well-known statistical method for binary classification problems. In the area of unsupervised learning, where data is not labelled, k-means++ is a clustering method that finds cluster centroids that minimize the intraclass variance (i.e., the sum of squared distances from each data point to its closest cluster centroid). The two figures below show scalability results for logistic regression and k-means with Tanzu Greenplum on vSphere 7 using the test cluster described at the end of this post, respectively.
For logistic regression we tested up to 100 million rows with 100 features, and for k-means++ we tested up to 1 million points with 100 centroids. Both charts are plotted on a log-log scale and the slope=1, which means linear scalability of run time with data size. As your data set grows in size, run time grows in a predictable manner (it is not going to blow up). This is the power of horizontal scalability with Greenplum MPP compared with single-node databases.
Logistic regression training time with 100 features
K-means++ training time with 100 centroids
A large automotive manufacturer in the U.S. employs Greenplum as part of its predictive maintenance program designed to improve customer safety and experience. Data including vehicles’ diagnostic trouble codes, maintenance history, mileage, driving locations, and other factors are used in supervised learning models to predict faults before they occur and plan any necessary repairs.
Graph analytics for discovering relationships
Graph computations can be challenging to parallelize and scale due to the inherent interdependencies in graph data, which leads to high traffic across the network. This often makes it impractical to reason across a large graph as a whole. Vertex-centric programming takes a more localized approach: Computation is expressed at the level of a single vertex, so it can be made highly scalable and inherently parallel, with reasonable amounts of serialization and communication across the network.
Many common graph algorithms, when viewed from this vertex-centric perspective, can be translated into standard SQL using scans, joins, and aggregates over large tables. MPP databases like Greenplum are well-suited for these types of queries because they take advantage of the advances in query optimization and execution that have been made over many years of intensive research and development.
Vertex-centric approaches may not offer all of the expressiveness of semantic query languages associated with graph databases; however, they are a good choice for many common use cases. For example, the figure below shows the runtime for the PageRank algorithm on a graph with up to 100 million vertices and 5 billion edges. The plot is on a log-log scale and the slope=1.1, meaning near linear scalability with graph size.
PageRank computation time
A large U.S.-based computer technology company uses a variety of graph methods on Greenplum as part of an identity resolution project. In order to understand its customer base more deeply, the firm’s marketing and digital insights groups created a unified view of customer data from point-of-sale and social media activity, loyalty programs, call center logs, and more. The graph methods are used to help link different touchpoints to the same customer IDs.
Deep neural networks for more complex supervised learning tasks
Artificial neural networks can be used to create highly accurate models in domains such as language processing and image recognition. However, training deep neural networks is incredibly resource-intensive since hundreds of trials may be needed to generate a good model architecture and associated hyperparameters. (Unlike parameters derived by the model itself, such as feature weights, hyperparameters are used to control the learning process.) This is the challenge of model selection.
To train many models simultaneously, we implement a novel approach from recent research called model hopper parallelism (MOP). An efficient way to move computation to the data, MOP involves locking data in place, since it is too big to move around the network, and having each model sequentially visit the data local to each node for training.
By leveraging MOP, the distributed compute capability of the Greenplum database with GPU acceleration can be used to train many deep learning models in parallel. This can help data scientists more efficiently determine the most accurate model architecture and associated hyperparameters from a set of candidates. To learn more about GPU acceleration, be sure to check out our post about deep learning on Greenplum database.
The table below shows the speedup when training 26,000 color images from the RSNA Pneumonia Detection data set using the AlexNet convolutional neural network with 30 different hyperparameter combinations. Training time is linear with the number of hosts, meaning that if more compute resources are added, training time will decrease proportionally. For this example, the test cluster for Greenplum includes four physical GPUs shared as virtual GPUs using NVIDIA GRID technology.
Speedup training a convolutional neural network with GPUs on Greenplum vs. a single host
Tanzu Greenplum is an all-around workhorse for enterprise analytics
VMware Tanzu Greenplum can run the full range of analytical workloads on vSphere at scale, from BI to AI. Advanced analytics projects in the enterprise often use multiple approaches, so the flexibility afforded by a horizontally scalable MPP database—one that can run a variety of methods on both CPUs and GPUs—means that more computations can be done where the data lives rather than having to move it between systems.
For more on Greenplum, watch my talk from VMWorld 2020.
* Test infrastructure:
4 Dell PowerEdge R740 for TPC-DS benchmarks (256 CPU cores)
2 Dell PowerEdge R740 for other runs (128 CPU cores)
1.12 TB memory
100.52 TB disk
VMware ESXi 7.0.0
4 VMs per host for TPC-DS benchmarks
2 VMs per host for other runs
8 workers per VM for TPC-DS benchmarks
4 workers per VM for other runs
4 NVIDIA V100 with 32 GB frame buffer each
GPUs shared as vGPUs using NVIDIA GRID