Meet Greenplum 5: The World’s First Open-Source, Multi-Cloud Data Platform Built for Advanced Analytics

September 14, 2017 Cesar Rojas

The largest and most innovative organizations in the world have deployed Pivotal Greenplum, the leading massively parallel analytical data platform, to help solve their most strategic analytical challenges, from fraud management and risk analysis to cybersecurity and IoT. These and other important analytical workloads are technically impossible or cost-prohibitive to run on traditional data platforms. In 2015, Pivotal shook up the data warehouse and analytics industry by taking Greenplum open source.

Today we’re thrilled to announce the latest innovation to the most powerful, agile, and mission-critical data platform for advanced analytics: Pivotal Greenplum 5. This massive release centers around three significant new capabilities and improvements:

  • Multi-Cloud Deployment. Greenplum 5 is now certified and available on Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), VMware vSphere, and OpenStack, in addition to currently supported on-premises options. Pivotal also offers deployment assistance and managed services on all these platforms.
  • Integrated Analytics. Greenplum 5 eliminates analytical silos by providing a single scale-out environment for next-generation advanced analytics (machine learning, graph, text, geospatial) as well as traditional (BI/reporting) workloads.
  • Fast Development of Analytical Innovations. Open source community innovations combined with Pivotal Engineering's agile development practices mean faster delivery of analytical innovation for customers and the community.

Multi-Cloud Data Analytics

Run your analytics anywhere you need them

Support for analytics in multi-cloud environments is an important requirement for many organizations in 2017.

A major reason is that organizations are adopting the cloud incrementally, on a project-by-project basis. Often, different groups within the enterprise want the flexibility to instantiate and shut down their own analytical environments in Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), or private clouds. They want the freedom to select the best cloud platform for each project and workload based on ease of use, performance, and total cost of ownership. Just as important, organizations want the elasticity and disaster recovery capabilities that multiple cloud environments enable. The present and future of analytics is multi-cloud.

Unlike both legacy enterprise data warehouses (EDWs) and new “cloud” data warehouses, all Greenplum platform optimizations are made in the software and not on proprietary hardware and/or network configurations. This makes Greenplum 5 a flexible yet powerful, infrastructure-agnostic platform able to run anywhere you need it, including:

  • All public clouds: AWS, Azure, and GCP with Bring Your Own License (BYOL) and hourly offerings
  • Private clouds: VMware vSphere and OpenStack
  • On Premises (Dedicated Hardware): Dell EMC DCA appliances, Dell EMC Blueprints, HP, and Cisco certified configurations, and customer-supplied hardware

An infrastructure-agnostic analytics platform such as Greenplum 5 offers a number of benefits when you select where to run it:

  • Helps avoid cloud/hardware vendor lock-in, enabling your organization to leverage the best available infrastructure at competitive prices.
  • Provides cloud adoption flexibility by enabling organizations to migrate designated analytical workloads to the cloud, while retaining others on-premises due to business, governance, or other requirements.
  • Eases the deployment of the best and most appropriate infrastructure for each project or independent environment (ETL, model building, testing, scoring, BI), helping your analytical users (ETL developers, data scientists, analysts) stay productive and focused on the needs of the business.
  • Allows for quickly instantiating new clusters in minutes when running on the AWS or Azure Marketplaces, with no impact on existing environments.

Integrated Analytics: ML, Graph, GeoSpatial and More

One platform for all compute-intensive and complex analytical needs

Before the explosion of new data sources, the EDW was the best place from which to provide as close to a 360-degree analytical view of the business as possible. In recent years, many organizations have deployed disparate analytics alternatives to the EDW in an attempt to glean more sophisticated insights from their data. These alternatives include:

  • Cloud data warehouses
  • Machine learning frameworks
  • Graph databases
  • Geospatial tools
  • Text analytics environments

Often these new deployments have resulted in the creation of analytical silos that are too complex to integrate with existing EDWs, thus significantly limiting enterprise-wide insights and innovation.

Unlike the traditional EDW and newer alternatives, Greenplum 5 eliminates data silos by integrating traditional and advanced analytics in one scale-out analytical platform. Here are some of the interfaces and operators integrated in Greenplum 5:

  • Open Source, Parallel Machine Learning, and Graph Analytics: Apache MADlib is an open source library for scalable and parallel analytics. It provides data-parallel implementations of machine learning, mathematical, statistical, and graph methods on Greenplum 5. MADlib uses Greenplum’s massively parallel processing (MPP) architecture’s full compute power to process very large data sets, whereas other products are limited by the amount of data that can be loaded into memory on a single node. MADlib algorithms can also be invoked from a familiar SQL interface so they are easy to create and use.
  • Open Source, Parallel GeoSpatial Analytics: Unlike the proprietary geospatial capabilities available in some EDWs, Greenplum 5 provides massively scalable geospatial analytics based on the PostGIS open source project. Pivotal takes full advantage of the vibrant PostGIS community and partner ecosystem to constantly deliver GIS innovations.
  • Parallel Text Analytics: Pivotal Greenplum 5 users have access to GPText, an Apache Solr-powered text analytics engine that is optimized for Greenplum’s MPP architecture. GPText 2.0 takes the flexibility and configurability of Solr and merges it with the scalability and easy SQL interface of Greenplum, dramatically simplifying and speeding up the time to insight for massive quantities of raw text data, including semi-structured and unstructured data (social media feeds, email databases, documents, etc.).
  • Support for Popular Python and R Analytical Libraries through Procedural Language Extensions (PL/X): Greenplum 5 allows users to write user-defined functions (UDFs) in a wide range of languages including SQL, Perl, Python, R, C, and Java, and supports the parallelized and distributed execution of these UDFs in data science workflows. Furthermore, Greenplum users can leverage functions from any of the add-on packages of these languages (e.g. NLTK for Python, rstan for R) in these UDFs. Greenplum 5 also provides easy-to-use installers for the most popular add-on libraries for Python and R.
  • Support for Spark with Greenplum-Spark Connector (GSC): The new GSC provides Spark users like data scientists a native connection to Pivotal Greenplum 5. GSC allows users to load data at high speed from Greenplum into Spark and to run workloads on the Spark cluster. Result sets from computation on the Spark cluster can then be pushed back into Greenplum for further analysis and persistent storage.
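To make the data-parallel idea in the MADlib bullet concrete, here is a minimal Python sketch of ordinary least squares computed segment by segment. It is not MADlib's code, and the function names and toy data are assumptions made for illustration. Each hypothetical segment reduces its own rows to two small summaries (X^T X and X^T y); only those summaries are merged and solved, so no single node ever needs the full data set in memory.

```python
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], float]            # (feature vector, target value)
Stats = Tuple[List[List[float]], List[float]]  # (X^T X, X^T y)


def segment_stats(rows: Sequence[Row], n_features: int) -> Stats:
    """Transition step: reduce one segment's rows to X^T X and X^T y."""
    xtx = [[0.0] * n_features for _ in range(n_features)]
    xty = [0.0] * n_features
    for x, y in rows:
        for i in range(n_features):
            xty[i] += x[i] * y
            for j in range(n_features):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge_stats(a: Stats, b: Stats) -> Stats:
    """Merge step: summaries from two segments add element-wise."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


def solve_normal_equations(stats: Stats) -> List[float]:
    """Final step: solve (X^T X) beta = X^T y by Gauss-Jordan elimination."""
    xtx, xty = stats
    n = len(xty)
    # Build the augmented matrix [X^T X | X^T y].
    aug = [list(row) + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def fit_linear_regression(segments: Sequence[Sequence[Row]],
                          n_features: int) -> List[float]:
    """Summarize every segment independently, merge, then solve once."""
    partials = [segment_stats(rows, n_features) for rows in segments]
    total = partials[0]
    for part in partials[1:]:
        total = merge_stats(total, part)
    return solve_normal_equations(total)


# Toy data following y = 1 + 2*x, split across three hypothetical segments.
# Each feature vector is [1.0, x]; the leading 1.0 is the intercept term.
segments = [
    [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)],
    [([1.0, 2.0], 5.0)],
    [([1.0, 3.0], 7.0), ([1.0, 4.0], 9.0)],
]
coefficients = fit_linear_regression(segments, n_features=2)
```

In this sketch the per-segment work grows with the number of rows, while the data exchanged between segments stays fixed at the size of the summaries. That is the property that lets this style of algorithm scale out across a cluster.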

Greenplum 5 and its integrated analytical operators enable users to operationalize analytical models at scale and ship tangible business innovation in record time. For example:

  • Machine learning in the database at-scale provides data science and analytics teams with a platform for rapidly responding to new business opportunities and challenges. Model training can be done at-scale in the database on-demand. Model scoring may be operationalized on the platform or models can be exported to run elsewhere including in modern data microservices architectures running on a Platform-as-a-Service (PaaS) such as Pivotal Cloud Foundry®.
  • The ability to process, analyze, and search on multi-structured text documents using modern libraries (Python) and operators (Apache Solr) combined with machine learning, provides the ideal platform for assessing a wide variety of multi-structured content.
  • For customers with Geographical Information Systems (GIS) requirements (e.g. retailers, banks, federal government), Greenplum 5 offers the ability to combine GeoSpatial analytics with machine learning. For example, a large retailer can easily understand how customers use different store locations, anticipate which stores will see an increased demand for particular items, and forecast changing markets, all leading to improved customer satisfaction and increased revenue. By providing these capabilities in the analytic data platform, analysis can be done at scale thereby avoiding the risk and effort of sampling.
  • Data scientists can use the tools with which they are comfortable, including Python and R, that process and analyze data at-scale without requiring data movement.
  • SQL-based, data platform integrated analytics deliver faster time to market for building and deploying data science models.
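As a rough illustration of in-database model scoring with Python, the sketch below shows a small logistic scoring function together with a hypothetical SQL wrapper that would register the same logic as a PL/Python UDF. The function name, argument names, coefficient values, and the `plpythonu` language name in the wrapper are assumptions for illustration; check the procedural languages installed in your own environment before adapting it.

```python
import math


def score_logistic(features, coefficients, intercept):
    """Return the predicted probability for one row of features."""
    z = intercept + sum(f * c for f, c in zip(features, coefficients))
    # Plain logistic function; not hardened against very large negative z.
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical wrapper: the same body registered as a UDF so the database
# can call it once per row, next to the data. Shown as text only; nothing
# in this sketch connects to a database.
CREATE_FUNCTION_SQL = """
CREATE FUNCTION score_logistic(features float8[],
                               coefficients float8[],
                               intercept float8)
RETURNS float8 AS $$
    import math
    z = intercept + sum(f * c for f, c in zip(features, coefficients))
    return 1.0 / (1.0 + math.exp(-z))
$$ LANGUAGE plpythonu;
"""

# Local check with made-up coefficients: z = 0.5*1.0 + (-0.25)*2.0 = 0.0
probability = score_logistic([1.0, 2.0], [0.5, -0.25], 0.0)
```

Keeping the scoring logic in a plain function makes it easy to unit test outside the database before registering it as a UDF.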

Fast Development of Analytical Innovations

100% Commitment to Open Source: Fast Innovation working with the PostgreSQL Community

In Greenplum 5, we merged 3000+ PostgreSQL improvements into the Greenplum core and provided new capabilities from PostgreSQL in many areas, including performance, support for JSON and HSTORE for semi-structured data, native support for additional data types such as Universally Unique Identifiers (UUID), and a raster geospatial module for advanced geospatial analysis.

Beyond fast delivery of new capabilities, aligning PostgreSQL and Greenplum Database open source communities gives our customers a strategic advantage as they are in control of the software they deploy, without vendor lock-in, while allowing open influence on product direction.

Agile Development: Constant Delivery of New Analytical Capabilities in Greenplum

For more than three years the Pivotal Greenplum engineering team has followed Pivotal’s agile development practices (small/focused teams, pair programming, test-driven development, and continuous integration). This has dramatically increased the pace of innovation, with new releases of the platform landing on a monthly basis, far outpacing both traditional open source and proprietary alternatives. There is no other analytical platform on the planet delivering innovation at the velocity of Pivotal Greenplum.

Greenplum 5 Supporting Quotes

Pivotal Greenplum Customer

“We used Greenplum running on AWS to build an advertising solution that's really changing our industry. We are very excited about the multi-cloud capabilities and the new analytics that Greenplum 5 brings to the table and hope to continue our close partnership with Pivotal.”

John Conley, Vice President Data Warehousing, Conversant.

Learn more about how Conversant is using Greenplum.


“Innovation is alive at Greenplum. The data platform continues to thrive for use cases involving petabyte-scale data sets requiring the service levels and concurrency of a proven SQL engine at open source prices.”

Tony Baer, Principal Analyst, Information Management, Ovum


“Pivotal’s 5th version of the Greenplum Data Platform allows our customers to feel confident that the critical analytics needed to run their businesses will continue to grow in capabilities, without fear of vendor lock-in and in the spirit of open source. It’s a major release that has shown tremendous interest from many of our most innovative and demanding customers.”

Dan Feldhusen, President, ZData, An Atos Business


"Pivotal Greenplum 5.0 is a huge step forward. It's the most performant version yet; it runs wherever you need it to; and it provides an incredible set of analytic capabilities to power both business intelligence and machine learning. With this release, Greenplum is more than a data warehouse, it's a data platform."

Elisabeth Hendrickson, Vice President of Data R&D, Pivotal

About the Author

Cesar Rojas

Cesar Rojas serves as the Head of Product Marketing for Pivotal Greenplum, responsible for setting the messaging and go to market strategy for Greenplum. Prior to joining Pivotal, Mr. Rojas was Director of Product Marketing for the Teradata Portfolio for Hadoop and Teradata Aster offerings. Mr. Rojas is an advanced analytics and data management veteran with 15 years of experience working for the largest data analytics vendors as well as successful data startups. Mr. Rojas has an MBA with emphasis in eBusiness from Notre Dame de Namur University, as well as a bachelor's in Computer Engineering.
