I’m very excited to finally be able to talk in public about something we’ve been working on for quite some time. In 2011 we announced our first version of Greenplum Hadoop, GPHD. This week, we’ll have a number of events to introduce Pivotal HD, our new Hadoop distribution. Pivotal HD isn’t just about making Hadoop better; it significantly expands Hadoop’s capabilities as a data platform. In this first post, I’ll explore the core functionality enabled by the major new component of Pivotal HD: HAWQ, the first fully functional, high-performance, all-encompassing relational database that runs in Hadoop.
Pivotal HD’s core Hadoop distribution includes all the usual suspects, with one change: we added truncate capability to HDFS to support database constructs. Pivotal HD has HDFS, MapReduce, Pig, Hive, and Mahout, as well as tools we’ve added to the ecosystem, which include:
- Command Center – manages and monitors your HDFS, MapReduce, and HAWQ.
- Virtualization extensions and Isilon support – The ability to deploy Hadoop in a virtualized environment and/or on Isilon for particularly challenging enterprise deployments.
- ICM (Installation/Configuration/Management) – Our subsystem for managing and administering several Pivotal HD Hadoop clusters through many different interfaces. ICM offers many hooks to integrate Hadoop clusters into your existing IT management infrastructure without getting in the way of your job.
- Spring Hadoop – Helps developers by integrating Hadoop into the Spring framework. This also includes Spring Batch, which can ease the job management and execution tasks on Pivotal HD.
These features will be familiar to users of the Greenplum HD 1.2 release, where they’ve been available for the past few months.
What takes the Pivotal HD Hadoop distribution to the next level is a major new component that we are very proud of: HAWQ, a relational database that runs atop HDFS. In short, we gutted the code where our solid, mature, fast MPP Greenplum Database writes to disk and pointed it at HDFS instead.
HAWQ has its own execution engine, separate from MapReduce, and manages its own data, which is stored on HDFS. It bridges the gap: a SQL interface layer on top of HDFS that also organizes the data beneath it. It also boasts a core feature called GPXF (Greenplum Extension Framework) that allows HAWQ to read flat files in HDFS stored in just about any common format (delimited text, sequence files, protobuf, and Avro). In addition, it has native support for HBase, including HBase predicate pushdown and Hive connectivity, and offers a ton of intelligent features for retrieving HBase data.
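GPXF’s internals aren’t covered in this post, but the core idea of exposing differently formatted HDFS files through one relational interface can be sketched in a few lines of Python. This is a toy model; the format name, the `read_rows` entry point, and the parser registry are illustrative, not GPXF’s actual API, and only delimited text is modeled here.

```python
import csv
import io

def parse_delimited(data, delimiter=","):
    """Parse delimited text into tuples of column values."""
    for row in csv.reader(io.StringIO(data), delimiter=delimiter):
        yield tuple(row)

# Registry of per-format parsers; GPXF itself also handles
# sequence files, protobuf, and Avro.
PARSERS = {"delimited_text": parse_delimited}

def read_rows(data, fmt, **opts):
    """Dispatch on the declared format, as an external-table
    definition would, and stream uniform rows back to the
    query engine."""
    try:
        parser = PARSERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}")
    yield from parser(data, **opts)

# Each line of the flat file becomes one relational row.
rows = list(read_rows("1,alice\n2,bob\n", "delimited_text"))
```

The point of the dispatch layer is that the query side never cares which on-disk format produced a row; that is what lets one SQL engine span many file formats.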
HAWQ draws on ten years of development of the Greenplum Database, and many of its features have already proven themselves in the wild in that product. Here is a quick look at what it offers:
- Columnar or row-oriented storage, so you can pick the layout that best suits each workload. The choice is specified when the table is created and is transparent to the user thereafter; HAWQ figures out how to shard, distribute, and store your data.
- Seamless partitioning allows you to split your tables on a partition key, enabling super-fast scans of subsets of your data by pruning away the portions you don’t need in a query. Common partition keys are dates, regions, or anything else you commonly filter on.
- Our parallel query optimizer and planner take SQL queries that look like any you’ve written before, then intelligently consult table statistics to figure out the best way to return the data. For example, they determine what kind of join to use on the fly, based on the size of the data sets.
- Table-by-table specification of distribution keys allows you to design your table schemas to take advantage of node-local joins and group bys, though HAWQ will perform well even without this.
- Fully compliant and robust SQL92 and SQL99 support, plus the SQL 2003 OLAP extensions. We are 100% compatible with PostgreSQL 8.2 and have added significant functionality beyond that.
- The full suite of administration tools from our Greenplum Database makes the system easy to maintain, install, and use.
- Full ODBC and JDBC support lets you plug your existing BI tools (Tableau, QlikView, etc.) or applications into HAWQ. Through the same connectors, those tools can also access HBase data or HDFS data via GPXF.
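To make the partition-pruning claim above concrete, here is a small Python sketch of the planner-side logic. The monthly layout, the `prune` function, and the column name are mine for illustration, not HAWQ internals: given range partitions on a date and a date predicate, only the overlapping partitions are ever scanned.

```python
from datetime import date

# Monthly range partitions for 2012: each covers [start, end).
partitions = [(date(2012, m, 1),
               date(2012 + (m == 12), m % 12 + 1, 1))
              for m in range(1, 13)]

def prune(partitions, lo, hi):
    """Keep only partitions whose [start, end) range overlaps the
    predicate lo <= sale_date < hi; the rest are never read."""
    return [(s, e) for (s, e) in partitions if s < hi and e > lo]

# WHERE sale_date >= '2012-03-15' AND sale_date < '2012-05-01'
survivors = prune(partitions, date(2012, 3, 15), date(2012, 5, 1))
# Only the March and April partitions survive pruning; the other
# ten partitions contribute zero I/O to the query.
```

The same interval-overlap test is all a planner needs once the partition boundaries are recorded in the catalog, which is why pruning costs almost nothing at plan time.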
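The cost-based join choice mentioned above can likewise be illustrated with a toy rule. The threshold and function name are hypothetical, and HAWQ’s real optimizer weighs far more statistics than row counts; the sketch just shows the shape of the decision: broadcast the small side when one input is tiny, otherwise redistribute both sides on the join key.

```python
def choose_join_strategy(left_rows, right_rows,
                         broadcast_threshold=100_000):
    """Pick a distributed join strategy from table statistics.

    Roughly what an MPP planner does: if one input is small
    enough, ship a full copy of it to every segment
    ("broadcast"); otherwise shuffle both inputs on the join
    key so matching rows meet on the same segment.
    """
    if min(left_rows, right_rows) <= broadcast_threshold:
        return "broadcast smaller table"
    return "redistribute both on join key"

# A 10K-row dimension joined to a billion-row fact table:
# copying the dimension everywhere is far cheaper than
# shuffling the fact table.
small_join = choose_join_strategy(1_000_000_000, 10_000)

# Two billion-row inputs: both sides get redistributed.
big_join = choose_join_strategy(1_000_000_000, 2_000_000_000)
```

Because the decision is made per query from current statistics, the same SQL text can get different physical plans as the tables grow.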
All of this, inside your own Hadoop cluster! These features stretch what your Hadoop cluster can do, pushing it further into the analytical and data-service realm, where Hadoop has struggled because it was designed primarily as a batch data processing system.
Hadoop complements our database technology very well. It handles unstructured data, data archiving and storage, and schema-less processing. Together, Hadoop and HAWQ cover all the bases, empowering you to use the right tool for the job within the same cluster.
So, how does HAWQ work? We have what we call “segment servers,” each of which manages a shard of each table. Several segment servers run on each data node of your cluster. Each shard of data, however, is stored entirely within HDFS. We have a “master” node whose job is to store the top-level metadata, build the query plan, and push the node-local queries down to the segment servers.
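A minimal sketch of that sharding, assuming a simple hash-on-distribution-key scheme (the actual placement logic is internal to HAWQ, and `zlib.crc32` here is just a stand-in for a stable hash): every row is routed to a segment server by hashing its distribution key, so two tables distributed on the same key can be joined node-locally.

```python
import zlib

NUM_SEGMENTS = 4  # e.g. four segment servers across the cluster

def segment_for(key, num_segments=NUM_SEGMENTS):
    """Route a row to a segment by hashing its distribution key."""
    return zlib.crc32(str(key).encode()) % num_segments

# Two tables both distributed on customer_id: rows for the same
# customer land on the same segment in both tables, so joining
# them needs no data movement between nodes.
orders_seg = segment_for("customer-42")
customers_seg = segment_for("customer-42")
```

This is why the table-by-table distribution keys mentioned earlier matter: matching keys mean matching segments, and matching segments mean node-local joins.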
When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine. HAWQ follows the MPP architecture, streaming data through the stages of a pipeline instead of spilling and checkpointing to disk (as MapReduce does). Also, the segment servers are always running, so there is no spin-up time.
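The pipelining contrast with MapReduce can be sketched with Python generators. This is an analogy, not HAWQ code: each stage pulls rows from the previous one and passes results along immediately, with no intermediate materialization between stages.

```python
def scan(rows):
    """Leaf of the pipeline: stream rows out of storage."""
    yield from rows

def filter_stage(rows, predicate):
    """Stream through only the rows matching the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def sum_stage(rows, column):
    """Terminal stage: fold the stream into one aggregate."""
    return sum(row[column] for row in rows)

table = [{"region": "west", "amount": 10},
         {"region": "east", "amount": 25},
         {"region": "west", "amount": 5}]

# SELECT sum(amount) FROM table WHERE region = 'west'
# Rows flow scan -> filter -> sum one at a time; nothing is
# spilled or checkpointed between stages, unlike a chain of
# MapReduce jobs writing to HDFS after every step.
pipeline = filter_stage(scan(table), lambda r: r["region"] == "west")
total = sum_stage(pipeline, "amount")
```

Because generators are lazy, the filter stage here never holds more than one row at a time; an MPP executor gets the same effect across processes by streaming tuples between operators.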
In our initial tests, we found HAWQ to be hundreds of times faster than Hive. We have also done some initial testing against competing SQL-on-Hadoop solutions and are orders of magnitude faster on some queries (especially group bys and joins, which are rather important).
The result is a near-real-time analytical SQL database that runs on Hadoop. You can run queries with sub-second response times while also running over much larger datasets, with the full expressiveness of SQL, in the same engine. Meanwhile, you don’t have to sacrifice or compromise anything on the Hadoop/MapReduce side of the house. They work together to get the job (whatever it is) done.