Analytic Workloads from BI to AI with VMware Tanzu Greenplum

December 7, 2020 Frank McQuillan

VMware Tanzu Greenplum is a massively parallel processing (MPP) data platform based on the open source Greenplum Database project. It’s designed to run the full gamut of analytical workloads, from BI to AI. Because enterprise data lives and grows throughout an organization, it is suboptimal to copy large data sets between different systems as they aren’t able to perform fast enough, scale high enough, or offer the right features.

In this post, we will discuss the power of MPP on vSphere for end-to-end data science workflows. We will share scalability results from traditional machine learning workloads as well as deep neural networks leveraging GPUs.

Business intelligence is foundational in enterprise analytics

Business data analysis—including reporting, online analytical processing, the use of dashboards, and support for concurrent queries by analysts to explain historical outcomes—is a fundamental role of data platforms.

The data industry uses TPC-DS as the de facto standard benchmark for big data. It’s designed by the Transaction Processing Performance Council, where VMware is a full member. The intent of TPC benchmarks is to provide objective performance data. TPC-DS in particular measures query response time in both single- and multiuser mode for a complex decision support workload, modeled on a company that sells and distributes products. There are 99 queries and the schema is snowflake in nature, with multiple levels of dimensional tables, like with this store returns entity-relationship diagram.

A store returns entity-relationship diagram from TPC-DS

The table below shows the results from running TPC-DS with Greenplum on VMware vSphere 7 using the cluster configuration described at the end of the post.* The main factor affecting runtime for Greenplum is the number of CPU cores; increasing the number of cores will decrease query time linearly. In addition to query time, a key measure is whether all 99 queries complete successfully without modification (they do for Greenplum). Many other vendor and open source databases cannot run all 99 queries, or need some of the queries to be modified in order to make them run.

Time to load data set from local disk

Number of queries that complete successfully

Query time

(1 user)

Query time

(5 concurrent users)

53 min

2 hours

5.7 hours

TPC-DS results on Greenplum for a 3 TB data set

A large commercial airline in Asia uses Greenplum to manage its customer loyalty program and leverages this scalable BI capability. It uses a mix of reporting, analytics, and even transactional queries, including concurrent line-of-business queries on very large data sets using SQL command-line as well as commercial BI tools. This is a common usage profile for many Greenplum customers.

Machine learning for predictive analytics

Machine learning workloads are growing in importance, and enterprises view them as a means to potentially improve the products and services they offer to their customers. Corporate data is often structured data and resides in relational form, so if machine learning can be done in-database and at scale with SQL, this can reduce the cost, complexity, and overhead associated with procuring and maintaining multiple tools and libraries.

Let’s look at the scalability results of two in-database machine learning methods that use the Apache MADlib open source library. It offers a comprehensive set of supervised and unsupervised learning methods invoked with SQL and optimized for Tanzu Greenplum’s parallelism. In the area of supervised learning, logistic regression is a well-known statistical method for binary classification problems. In the area of unsupervised learning, where data is not labelled, k-means++ is a clustering method that finds cluster centroids that minimize the intraclass variance (i.e., the sum of squared distances from each data point to its closest cluster centroid). The two figures below show scalability results for logistic regression and k-means with Tanzu Greenplum on vSphere 7 using the test cluster described at the end of this post, respectively.

For logistic regression we tested up to 100 million rows with 100 features, and for k-means++ we tested up to 1 million points with 100 centroids. Both charts are plotted on a log-log scale and the slope=1, which means linear scalability of run time with data size. As your data set grows in size, run time grows in a predictable manner (it is not going to blow up). This is the power of horizontal scalability with Greenplum MPP compared with single-node databases.

Logistic regression training time with 100 features

K-means++ training time with 100 centroids

A large automotive manufacturer in the U.S. employs Greenplum as part of its predictive maintenance program designed to improve customer safety and experience. Data including vehicles’ diagnostic trouble codes, maintenance history, mileage, driving locations, and other factors are used in supervised learning models to predict faults before they occur and plan any necessary repairs.

Graph analytics for discovering relationships

Graph computations can be challenging to parallelize and scale due to the inherent interdependencies in graph data, which leads to high traffic across the network. This often makes it impractical to reason across a large graph as a whole. Vertex-centric programming takes a more localized approach: Computation is expressed at the level of a single vertex, so it can be made highly scalable and inherently parallel, with reasonable amounts of serialization and communication across the network.

Many common graph algorithms, when viewed from this vertex-centric perspective, can be translated into standard SQL using scans, joins, and aggregates over large tables. MPP databases like Greenplum are well-suited for these types of queries because they take advantage of the advances in query optimization and execution that have been made over many years of intensive research and development.

Vertex-centric approaches may not offer all of the expressiveness of semantic query languages associated with graph databases; however, they are a good choice for many common use cases. For example, the figure below shows the runtime for the PageRank algorithm on a graph with up to 100 million vertices and 5 billion edges. The plot is on a log-log scale and the slope=1.1, meaning near linear scalability with graph size.

PageRank computation time

A large U.S.-based computer technology company uses a variety of graph methods on Greenplum as part of an identity resolution project. In order to understand its customer base more deeply, the firm’s marketing and digital insights groups created a unified view of customer data from point-of-sale and social media activity, loyalty programs, call center logs, and more. The graph methods are used to help link different touchpoints to the same customer IDs.

Deep neural networks for more complex supervised learning tasks

Artificial neural networks can be used to create highly accurate models in domains such as language processing and image recognition. However, training deep neural networks is incredibly resource-intensive since hundreds of trials may be needed to generate a good model architecture and associated hyperparameters. (Unlike parameters derived by the model itself, such as feature weights, hyperparameters are used to control the learning process.) This is the challenge of model selection.

To train many models simultaneously, we implement a novel approach from recent research called model hopper parallelism (MOP). An efficient way to move computation to the data, MOP involves locking data in place, since it is too big to move around the network, and having each model sequentially visit the data local to each node for training.

By leveraging MOP, the distributed compute capability of the Greenplum database with GPU acceleration can be used to train many deep learning models in parallel. This can help data scientists more efficiently determine the most accurate model architecture and associated hyperparameters from a set of candidates. To learn more about GPU acceleration, be sure to check out our post about deep learning on Greenplum database.

The table below shows the speedup when training 26,000 color images from the RSNA Pneumonia Detection data set using the AlexNet convolutional neural network with 30 different hyperparameter combinations. Training time is linear with the number of hosts, meaning that if more compute resources are added, training time will decrease proportionally. For this example, the test cluster for Greenplum includes four physical GPUs shared as virtual GPUs using NVIDIA GRID technology.

Number of model configs	Single host training time	Parallel training on 5 hosts on Greenplum	Speedup
30	39 hours	8 hours	4.9x

Speedup training a convolutional neural network with GPUs on Greenplum vs. a single host

Tanzu Greenplum is an all-around workhorse for enterprise analytics

VMware Tanzu Greenplum can run the full range of analytical workloads on vSphere at scale, from BI to AI. Advanced analytics projects in the enterprise often use multiple approaches, so the flexibility afforded by a horizontally scalable MPP database—one that can run a variety of methods on both CPUs and GPUs—means that more computations can be done where the data lives rather than having to move it between systems.

For more on Greenplum, watch my talk from VMWorld 2020.

* Test infrastructure:

Physical hosts:

4 Dell PowerEdge R740 for TPC-DS benchmarks (256 CPU cores)
2 Dell PowerEdge R740 for other runs (128 CPU cores)
1.12 TB memory
100.52 TB disk

Hypervisor

VMware ESXi 7.0.0
4 VMs per host for TPC-DS benchmarks
2 VMs per host for other runs

Database

Greenplum 6.6.0
8 workers per VM for TPC-DS benchmarks
4 workers per VM for other runs

GPUs

4 NVIDIA V100 with 32 GB frame buffer each
GPUs shared as vGPUs using NVIDIA GRID

Provisioning and Managing Tanzu Kubernetes Clusters on vSphere 7 from VMware Tanzu Mission Control

We are excited to announce integration between Tanzu Mission Control and Tanzu Kubernetes Grid Service, a c...

VMware Tanzu SQL, Now GA for Kubernetes: A Consistent Postgres Experience Everywhere

VMware Tanzu SQL with Postgres for Kubernetes, now generally available, provides a DevOps-friendly experien...

Analytic Workloads from BI to AI with VMware Tanzu Greenplum

Business intelligence is foundational in enterprise analytics

Machine learning for predictive analytics

Graph analytics for discovering relationships

Deep neural networks for more complex supervised learning tasks

Tanzu Greenplum is an all-around workhorse for enterprise analytics

Previous

Next

Analytic Workloads from BI to AI with VMware Tanzu Greenplum

Business intelligence is foundational in enterprise analytics

Machine learning for predictive analytics

Graph analytics for discovering relationships

Deep neural networks for more complex supervised learning tasks

Tanzu Greenplum is an all-around workhorse for enterprise analytics

Previous

Next

Related content in this Stream

A new tile (beta) will be added to the Tanzu Application Service foundation so developers can add generative AI-type functionality with existing or new applications on the platform.

With Tanzu Application Catalog, enterprises can get open source software that is customized per their requirements, fully ready to be deployed, easy to use, and built on a SLSA L3 pipeline.

Azure Spring Apps Enterprise is now eligible for the Azure Savings Plan for Compute, giving customers options for significant cost savings of up to 47 percent.

A new technology research paper by CCS Insight sheds light on the challenges enterprises face using open source software and offers insights into the value provided by Tanzu Application Catalog.

Reducing the number of CVEs in software is an important practice. But if compliance adherence becomes an obsession, bad practices that lower software quality will be adopted to achieve it.

This blog captures all of the top highlights related to Tanzu Application Service in 2023, including product releases, events, and customer reviews.

Explore how Tanzu Spring Runtime empowers enterprises to maximize their Spring-based investments, and see why it's crucial for governance, upgrading, and securing your tech stack.

Tanzu Application Catalog extends its software supply chain security capabilities by leveraging Notation for signing and verifying production-ready open source software artifacts.

Read the story of one retail leader that embarked on a journey to modernize their cloud and application strategy, and see what challenges they overcame with Tanzu Insights.

Learn how to view current charges for public IP addresses, how to optimize usage, and how to track reductions in IP address cost with Tanzu CloudHealth.

Tanzu Application Catalog now ships multi-architecture container images, supporting both ARM64 as well as x86-64.

Machine learning–powered forecasting in VMware Tanzu CloudHealth—now available for AWS, Azure, and GCP—can help forecast cloud costs, gain visibility into spending, and make informed cloud choices.

See how Intelligent Assist helps abstract the knowledge of GraphQL and Tanzu Hub’s GraphQL schema, easing the burden on users and requiring them only to review what’s generated.

Learn about Kubeapps, one of the open source projects the VMware Bitnami team contributes to, and how you can use it to simplify your Kubernetes application deployments.

Overview of the contributions VMware has made to the Backstage community and the adoption of Backstage in VMware Tanzu Application Platform.

The Kubernetes value line is becoming more clear as use cases start to coalesce around application development and delivery.

With VMware Tanzu Application Platform on VxRail, you can build and deploy cloud native applications on Kubernetes on-premises that can be extended to any cloud.

At VMware Explore in Barcelona, we’re announcing new artificial intelligence and machine learning offerings in the VMware Tanzu portfolio that can help organizations drive business innovation.

Tanzu Guardrails can empower cloud operations and application teams with expanded visibility into governance issues and deeper insights into overall operational and compliance risks in public clouds.

VMware Tanzu Mission Control is a hub for multi-cluster Kubernetes management. Improvements for its self-managed deployment option bring better support for identity services and large environments.