Analytic Workloads from BI to AI with VMware Tanzu Greenplum

December 7, 2020 Frank McQuillan

VMware Tanzu Greenplum is a massively parallel processing (MPP) data platform based on the open source Greenplum Database project. It’s designed to run the full gamut of analytical workloads, from BI to AI. Because enterprise data lives and grows throughout an organization, it is suboptimal to copy large data sets between different systems as they aren’t able to perform fast enough, scale high enough, or offer the right features.

In this post, we will discuss the power of MPP on vSphere for end-to-end data science workflows. We will share scalability results from traditional machine learning workloads as well as deep neural networks leveraging GPUs.

Business intelligence is foundational in enterprise analytics

Business data analysis—including reporting, online analytical processing, the use of dashboards, and support for concurrent queries by analysts to explain historical outcomes—is a fundamental role of data platforms.

The data industry uses TPC-DS as the de facto standard benchmark for big data. It’s designed by the Transaction Processing Performance Council, where VMware is a full member. The intent of TPC benchmarks is to provide objective performance data. TPC-DS in particular measures query response time in both single- and multiuser mode for a complex decision support workload, modeled on a company that sells and distributes products. There are 99 queries and the schema is snowflake in nature, with multiple levels of dimensional tables, like with this store returns entity-relationship diagram.

A store returns entity-relationship diagram from TPC-DS

The table below shows the results from running TPC-DS with Greenplum on VMware vSphere 7 using the cluster configuration described at the end of the post.* The main factor affecting runtime for Greenplum is the number of CPU cores; increasing the number of cores will decrease query time linearly. In addition to query time, a key measure is whether all 99 queries complete successfully without modification (they do for Greenplum). Many other vendor and open source databases cannot run all 99 queries, or need some of the queries to be modified in order to make them run.

Time to load data set from local disk

Number of queries that complete successfully

Query time

(1 user)

Query time

(5 concurrent users)

53 min

2 hours

5.7 hours

TPC-DS results on Greenplum for a 3 TB data set

A large commercial airline in Asia uses Greenplum to manage its customer loyalty program and leverages this scalable BI capability. It uses a mix of reporting, analytics, and even transactional queries, including concurrent line-of-business queries on very large data sets using SQL command-line as well as commercial BI tools. This is a common usage profile for many Greenplum customers.

Machine learning for predictive analytics

Machine learning workloads are growing in importance, and enterprises view them as a means to potentially improve the products and services they offer to their customers. Corporate data is often structured data and resides in relational form, so if machine learning can be done in-database and at scale with SQL, this can reduce the cost, complexity, and overhead associated with procuring and maintaining multiple tools and libraries.

Let’s look at the scalability results of two in-database machine learning methods that use the Apache MADlib open source library. It offers a comprehensive set of supervised and unsupervised learning methods invoked with SQL and optimized for Tanzu Greenplum’s parallelism. In the area of supervised learning, logistic regression is a well-known statistical method for binary classification problems. In the area of unsupervised learning, where data is not labelled, k-means++ is a clustering method that finds cluster centroids that minimize the intraclass variance (i.e., the sum of squared distances from each data point to its closest cluster centroid). The two figures below show scalability results for logistic regression and k-means with Tanzu Greenplum on vSphere 7 using the test cluster described at the end of this post, respectively.

For logistic regression we tested up to 100 million rows with 100 features, and for k-means++ we tested up to 1 million points with 100 centroids. Both charts are plotted on a log-log scale and the slope=1, which means linear scalability of run time with data size. As your data set grows in size, run time grows in a predictable manner (it is not going to blow up). This is the power of horizontal scalability with Greenplum MPP compared with single-node databases.

Logistic regression training time with 100 features

K-means++ training time with 100 centroids

A large automotive manufacturer in the U.S. employs Greenplum as part of its predictive maintenance program designed to improve customer safety and experience. Data including vehicles’ diagnostic trouble codes, maintenance history, mileage, driving locations, and other factors are used in supervised learning models to predict faults before they occur and plan any necessary repairs.

Graph analytics for discovering relationships

Graph computations can be challenging to parallelize and scale due to the inherent interdependencies in graph data, which leads to high traffic across the network. This often makes it impractical to reason across a large graph as a whole. Vertex-centric programming takes a more localized approach: Computation is expressed at the level of a single vertex, so it can be made highly scalable and inherently parallel, with reasonable amounts of serialization and communication across the network.

Many common graph algorithms, when viewed from this vertex-centric perspective, can be translated into standard SQL using scans, joins, and aggregates over large tables. MPP databases like Greenplum are well-suited for these types of queries because they take advantage of the advances in query optimization and execution that have been made over many years of intensive research and development.

Vertex-centric approaches may not offer all of the expressiveness of semantic query languages associated with graph databases; however, they are a good choice for many common use cases. For example, the figure below shows the runtime for the PageRank algorithm on a graph with up to 100 million vertices and 5 billion edges. The plot is on a log-log scale and the slope=1.1, meaning near linear scalability with graph size.

PageRank computation time

A large U.S.-based computer technology company uses a variety of graph methods on Greenplum as part of an identity resolution project. In order to understand its customer base more deeply, the firm’s marketing and digital insights groups created a unified view of customer data from point-of-sale and social media activity, loyalty programs, call center logs, and more. The graph methods are used to help link different touchpoints to the same customer IDs.

Deep neural networks for more complex supervised learning tasks

Artificial neural networks can be used to create highly accurate models in domains such as language processing and image recognition. However, training deep neural networks is incredibly resource-intensive since hundreds of trials may be needed to generate a good model architecture and associated hyperparameters. (Unlike parameters derived by the model itself, such as feature weights, hyperparameters are used to control the learning process.) This is the challenge of model selection.

To train many models simultaneously, we implement a novel approach from recent research called model hopper parallelism (MOP). An efficient way to move computation to the data, MOP involves locking data in place, since it is too big to move around the network, and having each model sequentially visit the data local to each node for training.

By leveraging MOP, the distributed compute capability of the Greenplum database with GPU acceleration can be used to train many deep learning models in parallel. This can help data scientists more efficiently determine the most accurate model architecture and associated hyperparameters from a set of candidates. To learn more about GPU acceleration, be sure to check out our post about deep learning on Greenplum database.

The table below shows the speedup when training 26,000 color images from the RSNA Pneumonia Detection data set using the AlexNet convolutional neural network with 30 different hyperparameter combinations. Training time is linear with the number of hosts, meaning that if more compute resources are added, training time will decrease proportionally. For this example, the test cluster for Greenplum includes four physical GPUs shared as virtual GPUs using NVIDIA GRID technology.

Number of model configs	Single host training time	Parallel training on 5 hosts on Greenplum	Speedup
30	39 hours	8 hours	4.9x

Speedup training a convolutional neural network with GPUs on Greenplum vs. a single host

Tanzu Greenplum is an all-around workhorse for enterprise analytics

VMware Tanzu Greenplum can run the full range of analytical workloads on vSphere at scale, from BI to AI. Advanced analytics projects in the enterprise often use multiple approaches, so the flexibility afforded by a horizontally scalable MPP database—one that can run a variety of methods on both CPUs and GPUs—means that more computations can be done where the data lives rather than having to move it between systems.

For more on Greenplum, watch my talk from VMWorld 2020.

* Test infrastructure:

Physical hosts:

4 Dell PowerEdge R740 for TPC-DS benchmarks (256 CPU cores)
2 Dell PowerEdge R740 for other runs (128 CPU cores)
1.12 TB memory
100.52 TB disk

Hypervisor

VMware ESXi 7.0.0
4 VMs per host for TPC-DS benchmarks
2 VMs per host for other runs

Database

Greenplum 6.6.0
8 workers per VM for TPC-DS benchmarks
4 workers per VM for other runs

GPUs

4 NVIDIA V100 with 32 GB frame buffer each
GPUs shared as vGPUs using NVIDIA GRID

Provisioning and Managing Tanzu Kubernetes Clusters on vSphere 7 from VMware Tanzu Mission Control

We are excited to announce integration between Tanzu Mission Control and Tanzu Kubernetes Grid Service, a c...

5 Steps to Financial Services App Modernization

Traditional financial services organizations looking to level the technology playing field must do five key...

SpringOne 2024

Learn More

Return to Home

Analytic Workloads from BI to AI with VMware Tanzu Greenplum

Business intelligence is foundational in enterprise analytics

Machine learning for predictive analytics

Graph analytics for discovering relationships

Deep neural networks for more complex supervised learning tasks

Tanzu Greenplum is an all-around workhorse for enterprise analytics

Previous

Next

Analytic Workloads from BI to AI with VMware Tanzu Greenplum

Business intelligence is foundational in enterprise analytics

Machine learning for predictive analytics

Graph analytics for discovering relationships

Deep neural networks for more complex supervised learning tasks

Tanzu Greenplum is an all-around workhorse for enterprise analytics

Previous

Next

Related content in this Stream

Tanzu Platform offers customers a rich set of data service offerings as part of the platform, including management, security, and disaster recovery capabilities.

This blog provides a summary of Tanzu CloudHealth news and product updates for the month of June, 2024.

We are thrilled to announce that the initial release of GenAI on Tanzu Platform for Cloud Foundry is now available on the Broadcom Support Portal.

Exciting reveal: VMware Tanzu Greenplum 7.2! Elevate data analytics and AI efforts with enhanced performance, optimization, and transformative features.

FinOps X Recap: The CloudHealth team was onsite meeting customers, presenting in a breakout session, and sharing a first look at the tech preview of our new CloudHealth user experience.

Broadcom recently announced that Arrow Electronics will now be the sole go-to-market provider of VMware Tanzu CloudHealth.

For the 2024 State of Cloud Native App Platform research, it’s clear that the landscape is rapidly maturing. We invite you to join us in exploring these trends further

Introducing an enhanced Tanzu CloudHealth user experience, catered to the needs of a FinOps practitioner by giving them the solutions required to maximize the business value of their cloud usage.

Tanzu Network's functions and capabilities will transition to Broadcom’s Customer Support Portal, starting June 7th, 2024.

We're excited to announce the latest version - v3.13 of VMware Tanzu RabbitMQ. In this blog, we will introduce the highlights of this next step in our journey.

The VMware Tanzu Application Catalog software knowledge graph is a powerful capability that will continue to deliver new product features over the next year, including integrations with Tanzu Platform

Learn how you can more easily use Salt to manage images deployed from Bitnami Application Catalog alongside other workloads and devices in your IT estate, all with open-source-software (OSS).

This blog provides a summary of VMware Tanzu CloudHealth news and product updates for May, 2024.

This blog will introduce the main security outcomes you can expect when deploying your applications on Tanzu Platform along with how you can implement them in your organization.

Unlock the future of software development with our State of Spring 2024 free ebook! Dive into the latest trendsand discover how these advancements accelerate the Spring ecosystem.

In this in-depth exploration of Spring Application Advisor, we'll unravel its multifaceted approach and underscore what makes it the keystone of integration.

We explore the key features of Spring Boot 3.3.0 and its seamless integration with Tanzu Platform, designed to empower developers, software engineers, and tech enthusiasts to build applications.

You can now use Azure Spring Apps to effectively run Spring Batch applications with adaptive cost control.

Discover how Tanzu CloudHealth is meeting the increased demand for the elimination of wasted and unused resources by investing in better, more accurate, and new ways to assist FinOps practitioners.

VMware Tanzu Data Services' vision is to be the #1 choice for application developers seeking easy-to-use, on-demand data services for modern applications.