Whilst big data can be “big” it is not always easy to access. HDFS has revolutionised the economics of storing huge volumes of data, and MapReduce provided a useful framework to analyse this data. However, it did not satisfy the need to truly interactive query processing. Further, it required a highly technical skill set—far removed from the familiar world of SQL that has powered business and analytics for years.
What if you could have the efficiency and scale of HDFS with the mature ecosystem of SQL? What if you could perform complex queries over tens or hundreds of nodes and petabytes of data? What if your existing SQL-compliant tools worked with this platform? What if you could connect your applications to this platform using standard ODBC or JDBC?
Well—you can! HAWQ is the true SQL engine that is engineered for Apache Hadoop. In this episode, we explore why HAWQ was developed, how it works and some of the benefits of using it in a Hadoop-based world.
PLAY EPISODE #18
- Subscribe to this feed.
- Find more episodes of All Things Pivotal podcast.
- Feedback: firstname.lastname@example.org
- Links Referred to in the Show:
Welcome to the All Things Pivotal Podcast, a podcast at the intersection of agile, cloud, and big data. Stay tuned for regular updates, technical deep dives, architecture discussions, and interviews. Please share your feedback with us by emailing email@example.com.
Hello everyone and welcome back to the All Things Pivotal Podcast, great to have you back. My name is Simon Elisha hosting you for another week of technical information and exploration of the world of agile, cloud, and big data.
Let’s talk a bit about big data this week. One of the areas I want to talk about was the area of SQL. You might say, “Hm, big data, we don’t normally talk so much about SQL. We talk about Hadoop, we talk about HDFS, we talk about a whole bunch of really cool petty software foundation projects related to big data with all kinds of interesting names. Why are we talking about something a bit old school like SQL?” Well let’s think about for a bit.
What does Hadoop do well? Hadoop is really useful for storing large amounts of information, so petabytes of information using HDFS. It does this quite cost effectively, it does this using commodity infrastructure which is great, and it means we can store even hundreds of petabytes at huge scale. Although for many customers a few terabytes is enough for them, but others need 500 terabytes, others need a petabyte et cetera, but you can go really as big as you like. In that context that is a really efficient cost effective storage platform.
On top of that of course we have MapReduce which is a technology that was designed to manage and query big data in a distributed fashion. This is really handy because it’s a distributed system. It can be very economical, it can work effectively, but it had some problems which is it’s not really meant for interactive to our queries. It’s more of batch type approach. Also the mechanism to use MapReduce so coding in Java and other frameworks on top of that can be tricky and less accessible to most people than we would like.
Back to SQL, why we’re talking about SQL? SQL is really the language of data analysis. It’s the Lingua Franca that exists throughout the world that we all know how it works, and it’s very well recognized, and most systems will use it, certainly most tool sets will use it and it’s pretty open and easy way to do things. In fact, even early on companies that took a big bet into the world of Hadoop, people like Facebook, Twitter, and Yahoo, they understood this and so they tried to create something called Hive which is a SQL like query engine that really does a limited version of SQL to MapReduce conversions, so they could really try and give the users a more familiar way of doing things.
But really when you boil it down a lot of users would just like to be able to use SQL on top of Hadoop. I want the scale and the efficiency and the performance at scale of HDFS and Hadoop, but I really like my native SQL interfaces. How do I solve that problem? That is something that the very, very smart engineers at Pivotal turned their mind to and spent a lot of time on. They created a solution. The solution is called HAWQ, H-A-W-Q. This is a SQL compliant engine for Hadoop. It is basically a massively parallel processing database operating on the HDFS substrate.
We’re going to get into this into some detail but essentially it allows or it’s a database that can read and write data from HDFS natively but it provides a complete standards compliant SQL interface to users. This is huge. This means I can now use all my existing tools, I can connect using ODBC and JDBC to what looks, feels, and tastes like a normal SQL database but it’s using the HDFS substrate and the distributed system to do all its work.
This brings to the table a number of interesting architectural challenges and opportunities, because with the challenge comes opportunity, with opportunity comes challenges. HAWQ was designed to be from the ground up a massively parallel processing SQL engine, or an MPP type database. Now what’s an MPP database do? An MPP database says, “I’m not going to use a monolithic service to process my queries. I’m going to take advantage of the preponderance of commodity based hardware that has reasonable performance, I’m going to strap them together in a ginormous grid, and I’m going to use the famous divide and conquer process by sending queries off to smaller subsets of data in the environment in basically query processing units and have them come back with their individual answers and then bring it all back together.”
Sounds easy, really, really, really hard to do, but if you do it well it is very, very effective. So what HAWQ does is it applies a cost-based query optimizer which is really a almost unique thing in this space where it can understand where to divide up the query workload to those little query engines. Those query engines work on segment servers and data nodes within the environment. So they could be on the same nodes as it is hosting the HDFS files, it could be on separate nodes, it depends on the typology. But essentially what you have is access to this large grid of processing nodes that can handle the workload being required.
Within the typology for like we have all these nodes, these segment instances that are containing subsets of the data. Multiple segment instances work together to form a single parallel processing query system. What this means is that from a user perspective you don’t have to worry about where the data is, how it’s organized, et cetera. It’s taken care for you.
All you’re worrying about is constructing your query and that query gets sent to the master node. The master node then does the optimization and uniquely does this in a metadata driven approach. It understands what’s going on in the environment from a holistic perspective, so it knows where all the data is on all the nodes at all the times, and can help construct queries that are efficient in the context of how some of the data might be joined between those nodes at the parallel processing level. It then takes this query plan, it passes it down to the individual workers in the cluster with a pile load of metadata that they need.
This means that worker nodes are just evaluating the pile load that they’re getting. They don’t have to worry about anything else in their environment, but they know that the evaluation that they do will be effective in the context of all the other evaluations taking place on all the other nodes. So it’s a very, very efficient and differentiated approach, because this big picture view is what gives you this increased performance. I’ll share with you some performance numbers later on. Obviously this metadata system is maintained by HAWQ. It’s highly valuable. It’s replicated and redundant so it’s always available to the system.
The other thing that tends to take place when you’re running these massively parallel queries is work takes place on these individual nodes, and the nodes will be limited by the amount of RAM they have available. What happens then? What HAWQ does is it can intelligently buffer and spill intermediate data to local disk on those individual nodes, meaning it can handle very high performance and very heavy duty workloads without having to be constrained by memory necessarily. What this is means is it can perform much bigger joins and sort operations beyond the size of the memory that you have in the cluster. You don’t have to size the memory of the cluster to the top to queries because queries tend to be unpredictable. It can be a more generic approach to that deployment.
The other nice thing the HAWQ database does is it supports asset transactions. As you imagine on Hadoop this is an interesting thing. This is the first database that can do that. It means you have an ability for users to isolate their concurrent activity and allows them to roll back modifications if something goes wrong or if they get an outcome that they don’t like. So it gives them that familiar database type experience.
In terms of connectivity, really easy to connect to. It’s ODBC and JDBC. So you can connect existing applications and existing tool sets to the database and get access to it straight away. The experience of using the database is also the same. You’re creating tables and dropping tables and doing all that sort of functionality creating fields et cetera that you normally would do. It’s no different.
Where it does become different is when you’re processing large amounts of data in parallel. This is where it gets exciting, because if you can compare it to some of the tool sets that are used in this space so things like Hive and Impala which are often used when you’re trying to achieve a SQL type outcome HAWQ provides a significantly increased level of performance. Some examples: for a sales analysis type work HAWQ is about 90 times faster than Hive and 69 times faster than Impala for that kind of workload. For data exploration it’s 20 times faster than Impala. For business intelligent drill down type workloads it’s 648 times faster than what Hive can do. So there are some real performance benefits from using HAWQ because it’s bringing together a really rich and mature way of interacting from a SQL perspective into the HDFS as well.
The other really interesting thing that it does from a data science perspective is it allows you to run a library called MADlib on the HAWQ database. Now MADlib provides a whole lot of statistical tools that are really useful for data scientists. Instead of having the data scientists having to extract data sets from the database and process them locally using their specialized tools, they can run these specialized tools at the database level so bringing the processing closer to the database. What this means is you process more data faster and you can have a far more speedy cycle of innovation and of data management without having to send it to big workstations et cetera. The MADlib library runs natively within the context of the HAWQ database as well.
In a nutshell fully SQL compliant, massively parallel processing, running on a HDFS substrate. Ticks a lot of boxes. What I’d like to do in a future episode and depending on the interest of the listenership is we’ll do a more deep dive into some of the technology behind the database, how it works, what it’s doing, et cetera, and I’ll get some of my learner colleagues to come on board and we’ll talk more about that as well to peel back the layers if you like of the design. But in the meantime that’s a little taste of the HAWQ database and what it can do in your environment. Until then I’d like to get your feedback, firstname.lastname@example.org and as always, keep on going.
Thanks for listening to the All Things Pivotal Podcast. If you enjoyed it please share it with others. We love hearing your feedback so please send any comments or suggestions to email@example.com.
Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Author