New Benchmark Results: Pivotal Query Optimizer Speeds Up Big Data Queries Up To 1000x

June 2, 2014 Gregory Chase

Have you heard about the new super-efficient Pivotal Query Optimizer developed by the Greenplum engineering team? Previously codenamed “Orca”, this new feature has been released as part of the HAWQ query engine in Pivotal HD, Pivotal’s commercially-supported distribution of Apache Hadoop®.

This new optimizer has been undergoing months of performance testing and improvements and is nearly ready for market. Pivotal will be showcasing a peer-reviewed paper on the results of this performance study at the ACM SIGMOD Conference 2014, June 22 – 27. Titled “Orca: A Modular Query Optimizer Architecture for Big Data”, the paper explains how the team built the query optimizer and shows the results seen so far in customer usage and ongoing testing. If you would like to get a copy of the paper yourself and see the detailed benchmark results, ask at the Pivotal booth (booth S32) at this week’s Hadoop® Summit in San Jose.

The Pivotal Query Optimizer is now also available to Pivotal Greenplum DB customers as part of an early access program. Customers who are interested in trying it out can register here.

Sophisticated Computer Science

Developing a query optimizer involves some very sophisticated computer science. The team wanted to create a new SQL-compliant query technology that was better suited to the trends we are seeing in big data:

  • Increasing volume from companies keeping detail data, not aggregates, from many more sources.
  • More variety in the types of data to be incorporated into queries such as application logs, sensor time series, geospatially tagged data, genomics data, and social media feeds.
  • Diverse storage due to an increasing variety of data technologies being used instead of traditional RDBMSs for storing and managing this data.
  • Complex queries generated by advanced analytics algorithms being applied to all this data.

This technology is laser-focused on providing fast SQL query results on petabytes of data and on being portable across data architectures, such as Pivotal HD and Pivotal Greenplum.


© 2014 ACM, used with permission.
Figure 1. The Pivotal Query Optimizer is a standalone optimizer that is portable across databases that implement Data eXchange Language (DXL).

Along with further enhancements in the release of Pivotal HD 2.0, this new query optimizer allows customers to run fully ANSI SQL-compliant queries against Hadoop® up to 1000X faster than they could with Pivotal HD 1.0. Not only does it speed up your queries, it also makes Apache Hadoop® more practical for some serious data science work. You can now take on more analytics use cases on Apache Hadoop® through faster queries in HAWQ, which comes with support for GraphLab, MADlib, languages such as R, Python and Java, and all-new support for Parquet files.


© 2014 ACM, used with permission.
Figure 2. The Pivotal Query Optimizer finds the fastest query plans for full ANSI SQL-compliant queries hitting either Pivotal Hadoop® or Pivotal Greenplum Database.

Performance Testing on Apache Hadoop®

I’m pleased to preview some of these test results in this blog, and with a specific purpose. Pivotal is looking for a few customers of Greenplum DB to help with final testing and validation of the new query optimizer. We’d love for you to join the early access program and experience for yourself the performance benefits and new use cases you can achieve with the new Pivotal Query Optimizer on Greenplum DB.

Validating the new Pivotal Query Optimizer includes performance testing against the TPC-DS benchmark. As mentioned, testing of Pivotal HD 2.0 versus Pivotal HD 1.0 against the benchmark showed that some queries improved by up to 1000X. More importantly, with the new query optimizer, Pivotal HD 2.0 is able to complete the entire benchmark of 111 queries. For the first time in the market, a commercially supported Apache Hadoop® stack can be used effectively for ad hoc analytical use cases while leveraging existing applications and expertise.

Performance Testing on Pivotal Greenplum DB

We ran similar performance tests comparing the new version of Greenplum DB with the prior version, again using the TPC-DS benchmark. GPDB configured with the Pivotal Query Optimizer showed an overall 5X improvement over GPDB configured to use the legacy query optimizer (the “planner”) in running the entire benchmark of 111 queries. For some specific queries we saw as much as a 1000X improvement, at which point we timed out the test.


© 2014 ACM, used with permission.
Figure 3. TPC-DS performance testing results of Pivotal Greenplum with Pivotal Query Optimizer vs. Pivotal Greenplum with “planner” query optimizer.
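
One way to read the timeout remark above is that a measured speedup becomes a lower bound once the slower run is cut off. The short Python sketch below illustrates that arithmetic. It assumes it was the legacy run that hit the timeout, and the function name and all timings are hypothetical, not figures from the benchmark.

```python
def measured_speedup(baseline_seconds, optimized_seconds, timeout_seconds):
    """Return (speedup, is_lower_bound) for a single query.

    If the baseline run reaches the timeout, its true runtime is unknown,
    so the reported speedup is only a lower bound.
    """
    timed_out = baseline_seconds >= timeout_seconds
    capped_baseline = min(baseline_seconds, timeout_seconds)
    return capped_baseline / optimized_seconds, timed_out


# Hypothetical timings in seconds, chosen only to illustrate the cap.
speedup, is_lower_bound = measured_speedup(
    baseline_seconds=50_000, optimized_seconds=10, timeout_seconds=10_000
)
print(speedup, is_lower_bound)
```

With these made-up numbers the baseline would have been 5000X slower, but the cap at the timeout means only 1000X can be reported.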

What many of these significantly improved queries have in common is layers of nested queries, often with window functions. We find these kinds of queries occur when users are working with advanced analytics packages such as SAS on top of Greenplum DB. We expect to see significant improvement in analysis times for users of these tools as we ramp up early access.
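
To give a feel for the query shape described above, here is a small sketch of a nested query with a window function. It is not one of the benchmark queries: the sales table, its columns, and its values are hypothetical, and it runs on SQLite through Python’s built-in sqlite3 module (which needs SQLite 3.25 or later for window functions) purely so the example is runnable, not because it reflects how HAWQ or Greenplum DB executes it.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, store TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 'e1', 100), ('east', 'e2', 300), ('east', 'e3', 200),
        ('west', 'w1', 500), ('west', 'w2', 400);
""")

# Three layers: aggregate per store, rank stores within each region
# using a window function, then keep only the top-ranked store.
QUERY = """
SELECT region, store, amount
FROM (
    SELECT region, store, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM (
        SELECT region, store, SUM(amount) AS amount
        FROM sales
        GROUP BY region, store
    ) AS per_store
) AS ranked
WHERE rnk = 1
ORDER BY region;
"""

top_stores = conn.execute(QUERY).fetchall()
print(top_stores)
```

Analytics tools often generate queries like this automatically, stacking many more layers than shown here, which is where the choice of query plan starts to matter.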

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Author

Greg Chase is an enterprise software business leader with more than 20 years of experience in business development, marketing, sales, and engineering with software companies. Most recently Greg has been focused on building the community and ecosystem around Pivotal Greenplum and Pivotal Cloud Foundry as part of the Global Ecosystem Team at Pivotal. His goal is to help create powerful solutions for Pivotal’s customers and drive business for Pivotal’s partners. Greg is also a wine maker, dog lover, community volunteer, and social entrepreneur.
