Pivotal Greenplum: Innovation in Data Management for Analytics

March 19, 2019 Bob Glithero

Enterprise analytics at scale, reimagined for modern deployment strategies

Enterprises that want to mine new data sources and types—like text, geospatial, graph, and machine-generated data—are confronted with a growing number of proprietary and open-source data management systems that address an ever-expanding number of use cases.

Choice is usually a good thing, but in this case, it has downsides. Users are wary of being locked in by proprietary systems, and teams become exhausted from constantly having to find the right system for the newest use case. The proliferation of data management software leads to an environment that is under-utilized and under-optimized. While many enterprises benefit from cloud processing for ease of self-service, public cloud service providers are rapidly becoming another source of lock-in. What’s needed is a data management system that is based on the contributions of a large community, not the directives of a single vendor, and that can be deployed wherever the business needs, not in only one environment.


Enterprises are reconsidering Postgres-based systems

As users search for an alternative, there’s been a recognizable resurgence of interest in Postgres as the go-to solution for managing data in both operational and analytical contexts. It’s now the fourth most popular data management system, according to DB-Engines.com.

Why the renewed interest in Postgres?  

  • It’s compliant with the ANSI SQL standard, offers proven maturity and stability, and removes unwanted drama from enterprise data management

  • It’s increasingly useful for a wide variety of enterprise use cases—including rapid prototyping, with support for varied data types both structured and unstructured

  • Most importantly, Postgres has a large, vibrant community of users and contributors with independence from a single vendor

Postgres preserves the benefits of solid relational theory, like the performance of mature optimizers for efficient, fast querying—something not achieved simply by grafting SQL on a distributed key-value store.  That said, there are useful features of modern databases that users would love to have in their analytics projects. Near the top of many wish lists are:

  • The ability to augment traditional analytics like business intelligence (BI) and reporting with innovative analytics that goes into the realms of machine learning and deep learning

  • The ability to simplify database deployments, so that they can be managed by container orchestration systems like Kubernetes, with only a few lines of code. With modern automation and orchestration systems, DBAs can finally benefit from some of the same techniques that have freed application developers from lower-level implementation details, so that they can spend more time on higher-value work supporting the business

Is there a way to help users find an open-source escape from proprietary software while avoiding the treadmill of niche data management systems?


Enterprise analytics at scale in a modern application setting

For 15 years, Pivotal has developed Greenplum, the best massively parallel processing version of Postgres for BI, analytics, and machine learning at scale. We’re now combining that experience with our market-leading experience in platforms, tools, and methodologies for application transformation, to make Pivotal Greenplum a first-class citizen in a modern application setting.  

In the latest version of Greenplum (version 6 beta, expected GA late-June), we’ve made sizable technology investments in the areas of transaction processing and support for data streams. We’re also announcing Greenplum for Kubernetes, for deploying Greenplum with this increasingly popular container orchestration system. Finally, we have made important contributions to Apache MADlib, significantly expanding the analytical capabilities of Greenplum.

Greenplum: Support for a wider range of consolidated workloads  

Greenplum increasingly blurs the line between transactional and analytical databases that have otherwise been separate and distinct. With improved transaction processing capability and support for streaming ingest, Greenplum can address workloads across a spectrum of operational and analytic contexts from business intelligence to deep learning.

Greenplum combines fast analytic reads with higher performance for low-latency writes. For some workloads, this translates to up to a 50X performance improvement over Greenplum 5. Because of this, users can consolidate a diverse array of applications in one environment (for example, point queries, data science exploration, fast event processing, and long-running reporting queries), all with greater scale and concurrency.

When these performance improvements are combined with our new Confluent-certified Kafka connector, Greenplum is also better positioned to address a variety of sensor-driven workloads characteristic of IoT applications.

Greenplum is also smarter about how it processes data. With replicated tables, dimensions are replicated on local segments. Joining dimensions with facts locally reduces the need to move traffic across the cluster and improves speed. Locking is more granular in Greenplum now, as row-level locks enable highly concurrent updates and deletes on the same table.
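To make the replicated-table idea concrete, here is a toy Python model of a segmented join. It is only a sketch and not Greenplum code: the segment count, the sample rows, and the names `segment_for`, `place_dimension`, and `local_join` are all assumptions made for illustration. It counts how many fact rows have to look up their dimension row on another segment when the dimension is hash-distributed, compared with when a full copy sits on every segment.

```python
from collections import defaultdict

NUM_SEGMENTS = 4  # assumed cluster size, for illustration only


def segment_for(key: int) -> int:
    """Assign a row to a segment with a simple modulo hash on its key."""
    return key % NUM_SEGMENTS


# Fact rows: (sale_id, product_id, amount), distributed by sale_id.
facts = [(sale_id, sale_id % 3, 10.0 * sale_id) for sale_id in range(1, 13)]
# Dimension rows: product_id -> product name.
products = {0: "widget", 1: "gadget", 2: "gizmo"}


def place_facts():
    """Spread fact rows across segments by hashing sale_id."""
    segments = defaultdict(list)
    for row in facts:
        segments[segment_for(row[0])].append(row)
    return segments


def place_dimension(replicated: bool):
    """Either copy the dimension to every segment or hash it by product_id."""
    segments = {seg: {} for seg in range(NUM_SEGMENTS)}
    for product_id, name in products.items():
        targets = range(NUM_SEGMENTS) if replicated else [segment_for(product_id)]
        for seg in targets:
            segments[seg][product_id] = name
    return segments


def local_join(replicated: bool):
    """Join facts to the dimension; count lookups that leave the segment."""
    fact_segments = place_facts()
    dim_segments = place_dimension(replicated)
    joined, remote_lookups = [], 0
    for seg in range(NUM_SEGMENTS):
        for sale_id, product_id, amount in fact_segments[seg]:
            if product_id in dim_segments[seg]:
                name = dim_segments[seg][product_id]
            else:
                # The matching dimension row lives on another segment.
                name = dim_segments[segment_for(product_id)][product_id]
                remote_lookups += 1
            joined.append((sale_id, name, amount))
    return sorted(joined), remote_lookups
```

Calling `local_join(True)` and `local_join(False)` returns the same joined rows, but only the replicated layout resolves every lookup on the local segment.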

Consolidating more workloads in one database means users can gain faster operational intelligence with less need to move data.

Greenplum for Kubernetes: Container deployment automation

We’re also announcing Greenplum for Kubernetes, which is now generally available. Greenplum for Kubernetes gives data professionals and application developers the ability to deploy, operate, and upgrade self-service clusters wherever Kubernetes is installed, in both cloud and cloud-native scenarios.

With Greenplum for Kubernetes, customers run the exact same Greenplum in Pivotal Container Service (PKS), Google Kubernetes Engine (GKE), or wherever Kubernetes is installed. Deploying software in Kubernetes is better because of:

  • Consistency in packaging. Vendors can handle application dependencies in containers so the customer doesn’t have to.

  • Wide adoption. Enterprises are beginning to standardize on Kubernetes; it’s one of the fastest growing open-source projects in history.

  • Repeatable, self-service deployments. Users can codify repetitive tasks, leading to more automation, so they can focus on higher-level tasks.

  • Avoiding cloud vendor lock-in. With PKS, users can achieve multi-cloud deployments more easily.

Figure 1: Containerizing mixed workloads for better isolation

The key to Greenplum for Kubernetes is the Greenplum Operator. Complex stateful services, like databases, have configuration needs beyond what’s provided by the basic Kubernetes deployment controllers. To help users avoid lower-level configuration tasks, Pivotal created an Operator for Greenplum. The Operator creates, configures, and manages instances of complex stateful applications on behalf of a Kubernetes user, and encodes how Greenplum should be configured and deployed. This helps data professionals automate the deployment of a multitude of nodes, so they can focus more attention on higher-value work.
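The general operator pattern described above can be sketched as a reconcile loop: compare the state a user declared with the state actually observed, then act to close the gap. The Python below is a minimal illustration under stated assumptions. It does not call the Kubernetes API and is not the Greenplum Operator's implementation; `ClusterSpec`, `ClusterState`, and `reconcile` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterSpec:
    """Desired state as a user might declare it (hypothetical field)."""
    segments: int


@dataclass
class ClusterState:
    """Observed state: names of the segment instances currently running."""
    running: list = field(default_factory=list)


def reconcile(spec: ClusterSpec, state: ClusterState) -> list:
    """Move the observed state toward the spec and return the actions taken."""
    actions = []
    # Scale up: create instances until the desired count is reached.
    while len(state.running) < spec.segments:
        name = f"segment-{len(state.running)}"
        state.running.append(name)
        actions.append(f"create {name}")
    # Scale down: remove the most recently added instances first.
    while len(state.running) > spec.segments:
        name = state.running.pop()
        actions.append(f"delete {name}")
    return actions
```

Running `reconcile` a second time with the same spec returns an empty list, which is the point of the pattern: the user states what they want, and the loop works out whether anything needs doing.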

Apache MADlib:  Easier workflows, more powerful data science  

Pivotal continues to invest in analytics and machine learning via Apache MADlib, the massively parallel analytics library for the Postgres family of databases. In version 2.0 (expected GA in mid-2019), MADlib will support multi-GPU deployments for deep neural networks, with an initial focus on image processing use cases using convolutional neural networks. Supported libraries will include Keras with a TensorFlow backend.  

When it comes to data science workflows, MADlib 2.0 will introduce the ability to create and manage multiple repositories of models. Much of data science involves model architecture search and hyperparameter tuning, so it is essential to be able to run many combinations of models and parameters at once and save the results into a central location for analysis and prioritization.
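As a rough picture of that workflow, the Python sketch below runs every combination of a small search space and writes each result to one central table for later ranking. Everything here is an assumption for illustration: an in-memory SQLite database stands in for the central store, `fake_score` is a made-up placeholder rather than real training, and none of this is the MADlib API.

```python
import itertools
import sqlite3

# Hypothetical search space: two architectures and a few hyperparameters.
ARCHITECTURES = ["small_cnn", "large_cnn"]
LEARNING_RATES = [0.1, 0.01]
BATCH_SIZES = [32, 64]


def fake_score(arch: str, lr: float, batch: int) -> float:
    """Placeholder for training and validating a model; the score is made up."""
    base = 0.80 if arch == "small_cnn" else 0.85
    return round(base - abs(lr - 0.01) - batch / 10000, 4)


def run_search(conn: sqlite3.Connection) -> None:
    """Evaluate every combination and record it in one central results table."""
    conn.execute(
        "CREATE TABLE results (arch TEXT, lr REAL, batch INTEGER, score REAL)"
    )
    for arch, lr, batch in itertools.product(
        ARCHITECTURES, LEARNING_RATES, BATCH_SIZES
    ):
        conn.execute(
            "INSERT INTO results VALUES (?, ?, ?, ?)",
            (arch, lr, batch, fake_score(arch, lr, batch)),
        )
    conn.commit()


def best_run(conn: sqlite3.Connection):
    """Return the highest-scoring combination from the central table."""
    return conn.execute(
        "SELECT arch, lr, batch, score FROM results ORDER BY score DESC LIMIT 1"
    ).fetchone()
```

Because every run lands in the same table, comparing and prioritizing candidates becomes an ordinary query rather than a manual collation step.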

Placing models into production at scale is arguably one of the most challenging aspects of data science. Intermediate formats in XML or JSON are incomplete and not universally supported, meaning that models may not translate reliably between environments. The alternative, which is re-coding for different development and deployment languages, is inefficient and can introduce error.  What’s needed is a way of pushing from training to production in the same language, without re-coding, in order to faithfully deploy models exactly as they have been designed by the data scientist.

To facilitate this, Pivotal will be introducing new capabilities, initially available to customers as part of a services-led engagement, to create more efficient data science workflows from modeling to production. These include components such as a REST API that application developers can call in a simple manner.

Figure 2: Event scoring with Pivotal Greenplum and containerized Postgres

With these enhancements, and the ability to containerize and deploy models under Kubernetes, users will be able to move beyond batch training and scoring use cases to enable event-driven scoring applications like real-time transactional fraud detection. Users can also leverage modern deployment strategies like canaries or champion/challenger, with less need to deploy separate environments like Apache Spark. This can reduce the number of systems deployed, managed, and optimized, simplifying the environment.
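A champion/challenger rollout for event scoring can be sketched in a few lines of Python. This is an illustration under assumptions, not a Pivotal component: the two "models" are trivial threshold rules, and `route`, `score`, and the 10% challenger share are hypothetical. Routing hashes the event ID, so the same event always reaches the same model.

```python
import hashlib


def route(event_id: str, challenger_share: float) -> str:
    """Deterministically pick a model for an event from a hash of its id."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a bucket in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "challenger" if bucket < challenger_share else "champion"


def champion_model(amount: float) -> bool:
    """Hypothetical current production rule: flag large transactions."""
    return amount > 1000


def challenger_model(amount: float) -> bool:
    """Hypothetical candidate rule with a lower threshold."""
    return amount > 800


def score(event_id: str, amount: float, challenger_share: float = 0.1):
    """Score one event and report which model handled it."""
    model = route(event_id, challenger_share)
    if model == "challenger":
        flagged = challenger_model(amount)
    else:
        flagged = champion_model(amount)
    return model, flagged
```

Setting `challenger_share` to 0 sends all traffic to the champion, and raising it gradually works like a canary release; logging the returned model name alongside each decision lets the two be compared afterwards.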


Enterprise Analytics at Scale Starts Today!

We’ve merely scratched the surface of the new, rich feature set in Greenplum and its role in the Postgres revival. We’re also announcing Pivotal Postgres, self-managed open-source Postgres for any environment. It’s designed for applications like transactional workloads, prototyping, and data science exploration, as explained in our companion blog post.


This website contains statements that are intended to outline the general direction of certain of Pivotal's offerings. It is intended for information purposes only and may not be incorporated into any contract. Any information regarding the pre-release of Pivotal offerings, future updates or other planned modifications is subject to ongoing evaluation by Pivotal and is subject to change. All software releases are on an “if and when available” basis and are subject to change. This information is provided without warranty of any kind, express or implied, and is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions regarding Pivotal's offerings. Any purchasing decisions should only be based on features currently available. The development, release, and timing of any features or functionality described for Pivotal's offerings on this website remain at the sole discretion of Pivotal. Pivotal has no obligation to update forward-looking information on this website.

About the Author

Bob Glithero

Bob is Senior Manager, Product Marketing for VMware Tanzu Data Services.
