Transform Your Skills: Simple Steps to Set Up SQL on Hadoop

May 13, 2014 Alfred Domingo

According to Wikibon, the big data market grew by 61% in 2013.

As technologists, we don’t have to look any further than a couple of job trend sites to realize Hadoop skills are growing when compared to SQL. The demand over time and volume in these two charts for SQL and Hadoop are alarmingly different. Technology is also bringing these worlds and skills together because SQL queries can now run on Hadoop via capabilities like Pivotal GemFire XD (SQL, in-memory) and Pivotal HAWQ (SQL, MPP) on Pivotal HD. In this blog, I am going to show anyone with SQL skills how to quickly and easily set up a “Hadoop + SQL” cluster using pictures instead of words wherever possible.


If you aren’t familiar with Hadoop, we recently explained the concept of Hadoop in 5 Pictures and provided a Hadoop 101 to provide an introduction on Hadoop programming. If you need more background, these are both good explanations and worth understanding before we dive into our set-up.

Overview: Getting Familiar with the SQL on Hadoop Environment

It used to take some heavy lifting to set up Hadoop or put SQL on Hadoop—that is now a thing of the past. We’ve previously posted a way to set up the Pivotal HD (Hadoop) environment using Vagrant as well as Pivotal HD (Hadoop) on Docker. Below, I am going to perform a straightforward install of Pivotal HD 2.0 and HAWQ 1.2.0 (SQL on Hadoop) using the Pivotal Command Center (PCC) 2.2.1 graphical user interface.

In the Pivotal architecture diagram below, there are many elements—as noted in the legend, you can see the dark gray items are all Apache components and the green ones are Pivotal. We are going to first set up PCC—the green box on the far right—and use PCC’s graphical user interface (UI) to set up the rest—a Pivotal HD (Hadoop) cluster and HAWQ Advanced Database Services (SQL interface to Hadoop).


Understanding and Installing the Pivotal Command Center

Like other command centers, the Pivotal Command Center (PCC) allows you to configure, deploy, monitor, and manage systems. In this case, we can manage multiple Pivotal HD clusters and use PCC’s graphical UI or command line interface to deploy a cluster with various components.

Pivotal Command Center provides a PostgreSQL database, Puppet Server, and a management application to:

  • Manage users and profiles
  • Create, stop, start, and delete clusters
  • Check deployment status via a main dashboard or drill into detailed metrics
  • Monitor MapReduce jobs, YARN, and HAWQ queries
  • View and alter the cluster topology
  • Capture, filter, and search log files

The Pivotal Command Center installation requires a few prerequisites, and the install is pretty straightforward—especially when compared to the myriad of install steps and files for setting up an entire Hadoop environment. Of course, there is also a lot more detail about the install for those who want or need it.

At a high level, the install of Pivotal Command Center includes 1) putting some tar files in a directory, 2) running an install script, and 3) importing the JDK. Lastly, we have to move and import the tarballs that represent the installs for other items in the diagram above, namely the Apache and HAWQ elements. After the PCC is installed, the UI or CLI can be used to create many clusters of components as shown in the diagram below.


Logging In and Using the Wizard to Create a Cluster

The following sets of screen shots are going to walk you through the Pivotal Command Center wizard and explain how to configure, deploy, and test the environment.

1. Log In and Begin

Once PCC is stalled, we can log in at https://CommandCenterHost:5443, and we will see a cluster status view. Since this is our first time running PCC, the status is empty. Start by pressing the “Add Cluster” button.



2. Start the Wizard

Next, the Add Cluster Wizard runs, and we press the “Get Started” button to start. Then, we will create a new “Cluster Definition.” Note, you can also change an existing cluster configuration.



3. Define the Version, Services, and Hosts

Next, we give our new cluster a name, list the target hosts, enter a couple of passwords (for root and GPAdmin), provide a JDK file name, and select from a set of install flags/options. For the flags, we will go ahead and disable the SELinux, disable IPTables, run ScanHosts, and Setup NTP—you can learn more about those later.




4. Host Verification

The Pivotal Command Center verifies connections to the hosts listed and notes any error or informational messages.


5. Set Up the Topology

Next, we identify which hosts are hosting each service. This includes clients for Pig, Hive, HBase, and Mahout libraries as well as distributed services for HDFS, YARN, HBASE, Hive, ZooKeeper, HAWQ, GPXF, Mahout, and Pig.








6. Configure the Environment

The PCC then lists all configuration files as populated by the wizard—XML, sh, properties, and config files for all the components. Here you can review and edit as needed. At this point, it is important to consider how default values might need to change for properties like JVM size. However, we are not going through these in detail for this post. Click next.


7. Validate and Check Deployment Status

The system performs a validation step, and then we select “Deploy.” The system will then show the progression of the deployment with hostname, status, role, and messages. We will click next to see a summary of the successfully deployed cluster.



19_Pivotal-Command-Center_GUI-Successful-DeploymentAt this point, we can see the status with the name, version, services, and number of nodes successfully deployed. As mentioned, this functionality is available via command line as well.

Review the Pivotal Command Center Dashboard

We can now use the monitoring capabilities in PCC. The dashboard will give you the state of HDFS, YARN, HBase, Zookeeper, and Hive clusters. The Cluster Analysis screen will provide greater detail where you can filter metrics like CPU, disk bandwidth, network bandwidth, memory, and Swap I/O by service, category, or node. There are also screens for monitoring MapReduce Jobs and YARN apps as well as reviewing topology or drilling into logs.


For more information:

– Learn more about the Pivotal Big Data Suite, including Pivotal HD, HAWQ, and GemFire XD
– Read more about Pivotal HD, including documentation and downloads
– Find articles on big data, Hadoop, Pivotal HD, HAWQ, and GemFire
– Check out case studies on GemFire at EMC, SBI Securities, China Railways Corporation, and GIRE.

About the Author


HOWTO: Spring XD for Real Time Analytics With Twitter Example Code
HOWTO: Spring XD for Real Time Analytics With Twitter Example Code

Spring XD is a unified, distributed, and extensible system for data ingestion, real time analytics, batch p...

Strolling down the Happy Path
Strolling down the Happy Path

A highly-functioning software team has parallel development and design cycles. Developers and PMs work on b...