Over the past five years, we have seen the Apache Hadoop® ecosystem grow at an escalating pace. This week’s Strata and Hadoop World conference in New York is a testament to the level of interest this evolution has created among enterprises looking to expand their data analytic capabilities. Enterprises now have an arsenal of data analytics tools at their disposal: Data Lakes, SQL engines, parallel machine learning tools, real-time complex-event processing and online learning tools, key-value and object stores, visualization and analytics development tools, and more.
So this is the right time to take a step back and think about what business problems we are trying to solve and how the various solutions in the market align with business objectives: the business problems and use-cases; cost and performance goals; as well as policy, maturity and regulatory needs. When the question gets asked this way, we come to the realization that there are always trade-offs between these business objectives, and that there is no one-size-fits-all solution.
The figure below shows where the three commonly used analytics solutions in the industry fit in. Enterprise Data Warehouses (EDWs) have been around since the 1980s and continue to be used to store historical enterprise information. Enterprises have invested in frameworks of tools and business processes around the EDW. That said, there is a clear trend in the industry to offload analytics processing from EDWs to Massively Parallel Processing-based (MPP) Analytics Databases and Hadoop-based Analytics Stacks. The business drivers for this trend are fairly well understood:
- the economics in storing and analyzing petabytes of data from a variety of data sources
- an ecosystem of analytics development tools not tied to specific vendors
- use cases requiring deep analytics such as machine learning algorithms applied to large volumes of data
Then the question comes down to what use cases are best suited for implementation on Hadoop®-based Analytics Stacks and what use cases are a better fit for MPP-based Analytics Databases. This is not to suggest that these two stacks are mutually exclusive. In fact, the opposite is true in that quite a few use cases require these stacks to be integrated and working together. Nonetheless, it is a useful exercise to identify the criteria that drive the use of each of these stacks.
Structured and unstructured data analytics: The MapReduce paradigm native to Apache Hadoop® has proven to be an effective tool for pre-processing unstructured and semi-structured data sources such as images, text, raw logs, and XML/JSON objects. On the other hand, rapid implementation of the majority of data discovery and data science use cases requires strong support for SQL with embedded machine learning capabilities, which is normally available in an MPP-based Analytic Database. The significant shift toward SQL-based analytics is also being driven by the dearth of developers with MapReduce skills.
Performance and cost drivers: MPP-based Analytic Databases can be built using a shared-nothing architecture and are not constrained by the limitations of HDFS. In addition, automatic parallelization of ingest load and redistribution of processing load in an MPP-based Analytic Database ensure better latency for ad-hoc queries and better throughput for batch-mode queries. MPP-based Analytic Databases usually run on bare metal rather than in virtualized environments because their workloads are performance-intensive.
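To make the shared-nothing idea concrete, here is a minimal Python sketch. It is not how any particular MPP database is implemented; the segment count, the function names and the sample rows are all hypothetical. Each row is assigned to exactly one segment by hashing a distribution key, each segment computes a partial aggregate over only its own rows, and a coordinator merges the partial results.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen for illustration


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment by hashing the distribution key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments=NUM_SEGMENTS):
    """Place each row on exactly one segment; segments share no storage."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return segments


def local_sum(segment_rows, group_field, value_field):
    """Work done on one segment using only its own rows: a partial aggregate."""
    partial = defaultdict(float)
    for row in segment_rows:
        partial[row[group_field]] += row[value_field]
    return dict(partial)


def merged_sum(segments, group_field, value_field):
    """Coordinator step: merge the partial aggregates from every segment.

    The loop runs sequentially here; in a real cluster each segment
    would compute its partial result concurrently on its own node.
    """
    total = defaultdict(float)
    for segment_rows in segments:
        partial = local_sum(segment_rows, group_field, value_field)
        for group, value in partial.items():
            total[group] += value
    return dict(total)


rows = [
    {"customer": "acme", "region": "east", "amount": 120.0},
    {"customer": "globex", "region": "west", "amount": 75.5},
    {"customer": "acme", "region": "east", "amount": 30.0},
    {"customer": "initech", "region": "west", "amount": 10.0},
]
segments = distribute(rows, key_field="customer")
totals = merged_sum(segments, group_field="region", value_field="amount")
```

Because no segment needs another segment's data to produce its partial result, adding segments adds both storage and compute, which is the property the latency and throughput argument above relies on.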
Current Support for Enterprise Grade Features: MPP-based Analytic Databases have been designed with security, authentication, disaster recovery, high availability and backup/restore in mind. Hadoop®-based analytic stacks, on the other hand, were originally designed for distributed operation with high availability. Additional enterprise-grade features are actively being added to both Apache and vendor-specific distributions of the Apache Hadoop® stack, so this gap in support for enterprise-grade features is likely to narrow significantly in the future.
Greenplum MPP-Based Analytic Database
Pivotal offers Greenplum Database, the industry-leading MPP-based Analytic Database that performs data exploration and deep analytics at petabyte scale with blazing performance and support for critical IT and business requirements in security, policy and business continuity. Greenplum Database underscores Pivotal’s commitment to providing the strongest enterprise-grade SQL-based analytics offering in the market.
Architectural tenets: Greenplum Database is built using a shared-nothing architecture with collocated storage and compute. It supports parallel loading from diverse structured data sources and Apache Hadoop® data lakes, as well as massively parallel, high-performance ad-hoc queries. This enables Greenplum Database to be deployed in a diverse set of data pipeline processing architectures. Hardware capacity can be expanded incrementally with automatic or controlled load redistribution, minimizing lifecycle management costs.
Flexibility and Adaptability: Polymorphic storage enables columnar and row-based storage to be used simultaneously, for scanning large volumes of data and for small lookups respectively. This enables the solution to scale up to handle large data sets with thousands of columns, as well as to scale down, with respect to cost and latency, to handle smaller data sets. Appliance-based and software-only deployment options, column-level compression, flexible indexing and partitioning give enterprises full control to trade off performance against cost.
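The difference between the two layouts can be sketched in a few lines of Python. This is a conceptual illustration only, not Greenplum Database's storage format; the table, the field names and the helper functions are hypothetical.

```python
# Row-oriented layout: the fields of each record are stored together,
# which suits fetching one whole record (a small lookup).
row_store = [
    {"id": 1, "region": "east", "amount": 120.0},
    {"id": 2, "region": "west", "amount": 75.5},
    {"id": 3, "region": "east", "amount": 30.0},
]


def to_column_store(rows):
    """Column-oriented layout: the values of each column are stored together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def lookup_row(rows, row_id):
    """Small lookup: a single hit returns every field of the record."""
    for row in rows:
        if row["id"] == row_id:
            return row
    return None


def scan_total(columns, name):
    """Large scan: read only the one column the query needs."""
    return sum(columns[name])


column_store = to_column_store(row_store)
record = lookup_row(row_store, 2)
total_amount = scan_total(column_store, "amount")
```

With thousands of columns, the scan over the columnar layout reads one list and skips the rest, while the row layout would have to walk every full record. The lookup shows the opposite case, where keeping a record's fields together is the cheaper arrangement.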
Advanced Analytics: In addition to OLAP queries such as cube and grouping set operations, Greenplum Database has the richest support in the industry for massively parallel machine learning capabilities invoked from SQL, Python, R, etc.
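For readers unfamiliar with grouping set operations, the following Python sketch mimics what a SQL GROUPING SETS clause computes: one aggregate per requested combination of grouping columns, with columns left out of a combination reported as None (NULL in SQL). The function and the sample data are hypothetical, and the sketch assumes the requested grouping sets are distinct. It runs in a single process and says nothing about how a parallel database executes the query.

```python
from collections import defaultdict


def grouping_sets(rows, sets, value_field):
    """Sum value_field once per grouping set.

    Columns that are not part of the current grouping set appear as None
    in the result key. Assumes the grouping sets are distinct.
    """
    # Fixed column order for the result keys, in order of first appearance.
    all_fields = []
    for fields in sets:
        for field in fields:
            if field not in all_fields:
                all_fields.append(field)

    result = defaultdict(float)
    for fields in sets:
        for row in rows:
            key = tuple(row[f] if f in fields else None for f in all_fields)
            result[key] += row[value_field]
    return dict(result)


sales = [
    {"region": "east", "product": "widget", "amount": 100.0},
    {"region": "east", "product": "gadget", "amount": 50.0},
    {"region": "west", "product": "widget", "amount": 70.0},
]

# Detail rows, per-region subtotals, and a grand total in a single call.
report = grouping_sets(
    sales,
    sets=[("region", "product"), ("region",), ()],
    value_field="amount",
)
```

Here report[("east", None)] holds the subtotal for the east region and report[(None, None)] holds the grand total, alongside the per-region, per-product detail rows.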
Enterprise Grade Features: Besides cost, performance and deep analytics capabilities, enterprises need an analytics platform that conforms to their security and regulatory policies and business continuity SLAs. To this end, Greenplum Database supports row- and column-level encryption for data at rest and in motion, and a rich set of authentication and role-based access control mechanisms. Business continuity can be ensured using comprehensive High Availability with block-level replication capabilities, and full and incremental automated backup/restore with remote Disaster Recovery.
These product capabilities, along with excellent customer success initiatives, have earned Pivotal the leadership role in SQL-based enterprise analytics and machine learning. Recently, Gartner published the report “Gartner Critical Capabilities for Data Warehouse Database Management Systems”, which shares the results of a survey of customers about their experiences with data warehouse DBMS products. The report scored Pivotal in the top 2 out of 16 vendors in two use cases: “Traditional Data Warehouse” and “Logical Data Warehouse”. In a third use case, “Context Independent Data Warehouse”, Pivotal scored in the top 3 relative to the 15 other vendors.
Leveraging Greenplum MPP-Based Analytic Database for Apache Hadoop®
Our leadership in SQL-based enterprise analytics and machine learning has led us to challenge the conventional thinking in the industry around the gap between MPP-based Analytic Databases and Hadoop®-based Analytics Stacks.
While most analytics vendors are investing to improve their SQL-on-Hadoop implementations, Pivotal has leveraged the decade's worth of product development effort that went into Greenplum Database, reusing this code base to build a SQL engine on Hadoop® and enhancing it with the industry’s only cost-based query optimization framework tailored for HDFS. This SQL-on-Hadoop product is called HAWQ (Hadoop® With Query). HAWQ enables enterprises to benefit from hardened MPP-based analytic features and query performance while leveraging the Apache Hadoop® stack.
Pivotal offers one license for both HAWQ and Greenplum Database under Big Data Suite, at a price point normally found in SQL-over-Hadoop systems, and charges software license fees only for compute resources, not for the volume of data stored. This enables enterprises to switch between HAWQ and Greenplum Database as the volume of data grows and analytic needs change, without re-budgeting exercises or spending approvals for licenses. Furthermore, this combined stack, shown in the figure above, can run on commodity hardware or on the DCA appliance from EMC.
The combined stack significantly lowers the business risk for enterprises by providing a choice of interoperable analytic solutions and the ability to switch between them with minimal reconfiguration, all under one license. For more information and technical details on Greenplum Database, HAWQ and Big Data Suite, visit us at Strata and Hadoop® World, subscribe to our YouTube channel, or reach out to your local Pivotal sales representative to discuss your specific business analytics needs.
Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Author