1 Command, 15 Minute Install: Hadoop + In Memory Data Grid + SQL Analytic Data Warehouse

July 7, 2014 Christian Tzolov

With one command and a 15 minute install time, anyone with a decent development sandbox can deploy and log into their own Pivotal HD environment with Apache Hadoop®, a SQL-on-HDFS analytic data warehouse (HAWQ), and an in-memory data grid (GemFire XD) integrated with HDFS.

Since the single-command install process below also includes GraphLab’s machine learning toolkit, data scientists, architects, statisticians, and analysts can have an industry-leading set of big data technologies at their fingertips before they finish their turkey sandwich at lunch.

Why Apache Hadoop® Alone Doesn’t Cut It—Three Data Services vs. One

For many companies, Apache Hadoop® alone doesn’t provide the complete big data solution needed to turn data into business value. Processing batches of big data within Apache Hadoop® is powerful, but the output often needs further analysis, and these analytics workloads typically come in two other forms.

  • One, interactive queries and statistical models are run via advanced SQL on structured data in some type of analytic data warehouse like Pivotal Greenplum DB or Pivotal HD’s HAWQ.
  • Two, resulting statistical models help to power real-time analytical queries within applications and operational processes like recommendation engines, risk analysis, or fraud identification. These algorithms are applied to business rules inside “operational applications.” For example, a workload like complex event processing is a fit for GemFire XD’s real-time, distributed, in-memory nature.
  • To complete the cycle, the results of real-time analytics within operational software are used to cycle back through the process and optimize further.

Due to the exploratory and iterative nature of this data science process, the size of the data, and the myriad of applications connecting, there are often manual steps in the process—data is moved from an application, to an Apache Hadoop® system, to an analytics warehouse, and then back to a real-time engine or in-memory data grid within the application. As we’ve all learned from software patterns of the past, this is inefficient and expensive.

Having the data on one integrated, underlying HDFS system can make life much easier, not to mention cheaper and more efficient. With Pivotal HD, batch processes, interactive queries, and real-time queries can all run on one data platform, with an integrated toolset for Java and SQL that is designed and proven to scale automatically in the cloud.

Getting Started with the Install

After hardware is available and software packages are downloaded, there is a Vagrant-based, single-command install process that sets up CentOS VMs on either VirtualBox (the default) or VMware Fusion.

In the process, four virtual machines are created—one for the Pivotal Command Center and three for the Pivotal HD cluster where Apache Hadoop® (HDFS, YARN, Pig, Zookeeper, HBase), HAWQ (SQL-on-HDFS analytic data warehouse), GemFire XD (in-memory data grid), and GraphLab services run.

The remainder of this document is organized as follows:

A. Basic Requirements and a Simple 1-Command, 15-Minute Installation

    • Step 1—Installing VirtualBox, Vagrant, and Downloading Files
    • Step 2—Download the Pivotal HD specific Vagrant Files
    • Step 3—Perform the Single-Command Install

B. Testing the Environment

C. Destroying the Environment and Changing the Install Configuration

    • More About Configuration
    • Pivotal HD Deployment Configuration Properties
    • Provisioning Scripts

D. Accessing Data, Tutorials, and Services

    • Preparing Demo Data
    • Running Tutorials
    • Trying GemFire
    • Trying GraphLab

E. Appendices

    • Appendix A—Installing Vagrant VirtualBox provider or the VMWare Fusion provider
    • Appendix B—Get the Oracle jdk-7u45-linux-x64.rpm
    • Appendix C—Download the Pivotal HD distribution packages
    • Appendix D—Troubleshooting

A. Basic Requirements and a Simple 1-Command, 15-Minute Installation

From a hardware standpoint, you need a 64-bit architecture, at least 8GB of physical memory, and around 160GB of free disk space.
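A quick way to confirm the host meets these requirements (a minimal sketch; the memory command differs between Linux and OS X):

df -h .              # free disk space on the current volume
free -g              # physical memory on Linux
sysctl hw.memsize    # physical memory on OS X, in bytes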

Step 1—Installing VirtualBox, Vagrant, and Downloading Files

Install the latest version of VirtualBox and Vagrant version 1.5.1 or greater (see Appendix A for VMware Fusion details instead of VirtualBox).

Create a new folder called PHD and download the JDK7 rpm (see Appendix B for commands if needed) into the folder, along with the desired Pivotal HD 1.1.x/2.x distribution (see Appendix C for your choice of Pivotal HD versions, components, and package file names for 1.1.0, 1.1.1, and 2.0.1).

Once you are done downloading, the following files should be in the PHD folder (if Pivotal HD 2.0.1 is selected):

PCC-2.2.1-150.x86_64.gz
PHD-2.0.1.0-148.gz
PADS-1.2.0.1-8119.gz
PRTS-1.0.0-14.gz
jdk-7u45-linux-x64.rpm
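Before moving on, it is worth confirming that all five files are actually in place (a minimal sketch; adjust the file names if you picked another PHD version):

for f in PCC-2.2.1-150.x86_64.gz PHD-2.0.1.0-148.gz PADS-1.2.0.1-8119.gz PRTS-1.0.0-14.gz jdk-7u45-linux-x64.rpm
do
  [ -f "$f" ] && echo "OK: $f" || echo "MISSING: $f"
done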

Step 2—Download the Pivotal HD specific Vagrant Files

Download the vagrant-pivotalhd files from GitHub into the PHD folder:

wget "https://github.com/tzolov/vagrant-pivotalhd/archive/blog-gopivotal.zip"-O phd-vagrant.tar.gz
tar --strip-components=1 -xzf ./phd-vagrant.tar.gz

The following files should now appear in the PHD folder: Vagrantfile, prepare_all_nodes.sh, pcc_install.sh, phd_cluster_deploy.sh

Step 3—Perform the Single-Command Install

Within PHD run the following command for a VirtualBox VM install (or see further below for a VMware Fusion install):

vagrant up

Note: When run for the first time, the ‘vagrant up’ command will download and install the bigdata/centos6.4_x86_64 Vagrant box (use ‘vagrant box list’ to see the list of installed boxes).

Both VirtualBox and VMware Fusion providers are supported; by default, Vagrant uses the VirtualBox provider. If you have installed the commercial VMware Fusion plugin (see Appendix A), you can run:

vagrant up --provider vmware_fusion

Note: If you switch from the VirtualBox to the VMware Fusion provider (or the other way around), you have to clean the network interfaces as explained in Appendix D.

As it runs, four virtual machines are created:

  • pcc (10.211.55.100) – dedicated for the Pivotal Command Center;
  • phd1, phd2, phd3 (10.211.55.10[1..3]) – used for the Pivotal HD cluster.

Note: Pivotal HD doesn’t permit collocation of PCC and cluster nodes, because a single PCC instance can install and manage multiple clusters.

The following services are now configured and running: HDFS, YARN, Pig, Zookeeper, HBase, Pivotal HD’s HAWQ, GemFire XD, and Hamster/GraphLab.
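Before opening the web consoles, you can verify the environment from the host with standard Vagrant commands (a minimal sketch):

vagrant status                    # all four VMs should report 'running'
ping -c 1 10.211.55.100           # the PCC node should respond
vagrant ssh phd1 -c "hostname"    # confirms SSH access to a cluster node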

B. Testing the Environment

Open the Pivotal Command Center (PCC) console at https://10.211.55.100:5443/hd/phd-c1/dashboard using the PCC credentials corresponding to your Pivotal distribution:

Pivotal distribution    Username    Password
PHD 2.0.x               gpadmin     Gpadmin1
PHD 1.1.x               gpadmin     gpadmin
PHD 1.0.x               gpadmin     gpadmin
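If the login page does not come up, first check from the host that the PCC web service is listening (a quick sketch; -k skips validation of the self-signed certificate):

curl -k -I https://10.211.55.100:5443/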

[Screenshot: Pivotal Command Center dashboard]

Below are the service roles per cluster node. Note that the GraphLab service is installed ‘manually’, not through the PCC:

  • phd1.localdomain – hawq-standbymaster, hawq-master, hive-server, hive-client, hive-metastore, hbase-client, hbase-master, gpxf-agent, gfxd-locator, namenode, hadoop-client, secondarynamenode, yarn-resourcemanager, mapreduce-historyserver, pig-client
  • phd2.localdomain – hawq-segment, hbase-regionserver, zookeeper-server, gpxf-agent, gfxd-server, datanode, yarn-nodemanager
  • phd3.localdomain – hawq-segment, hbase-regionserver, zookeeper-server, gpxf-agent, gfxd-server, datanode, yarn-nodemanager

You can also open the GemFire XD management console: http://10.211.55.101:7075/pulse/clusterDetail.html (username: admin, password: admin)

[Screenshot: GemFire XD Pulse management console]

C. Destroying the Environment and Changing the Install Configuration

To destroy all VMs run:

vagrant destroy -f
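A typical reconfiguration cycle is destroy, edit, re-create (a short sketch):

vagrant destroy -f   # remove all four VMs
# edit the Vagrantfile properties described below
vagrant up           # re-create the cluster with the new configuration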

More About Configuration

The Vagrantfile exposes configuration properties to allow VM and PHD configuration and customization. The properties fall into two groups: (1) VM infrastructure and PCC installation properties, and (2) Pivotal HD deployment configuration properties.

VM and PCC Configuration Properties:

  • PHD_DISTRIBUTION_TO_INSTALL – PCC, PHD, PADS, and PRTS package names to install. The first element is the package extension (e.g. ‘tar.gz’ or ‘gz’). The predefined options are PHD_110, PHD_111, and PHD_201; custom definitions are allowed as well (see the note below). Default: PHD_201
  • VM_BOX – Vagrant box name. The default box provides 40GB of disk space per VM; the bigdata/centos6.4_x86_64_small box is available as well and takes only 8GB of disk space. Default: bigdata/centos6.4_x86_64
  • MASTER_PHD_MEMORY_MB – PHD cluster Master node memory (MB). Default: 2048
  • WORKER_PHD_MEMORY_MB – PHD cluster Worker node memory (MB). Default: 1536
  • PCC_MEMORY_MB – PCC node memory (MB). Default: 768
  • DEPLOY_PHD_CLUSTER – By default, the Vagrant script deploys a PHD cluster as part of the installation. If you want to use the PCC Wizard to install PHD clusters instead, set this property to FALSE; Vagrant will then create and initialize the VMs and install the PCC, but will not attempt to install the PHD cluster. Default: TRUE

Note: In addition to the predefined PHD distributions, you can define your own custom package versions like this:

PHD_DISTRIBUTION_TO_INSTALL = ["gz", "PCC-2.1.1-73", "PHD-1.1.1.0-82", "PADS-1.1.4-34", "PRTS-1.0.0-9"]

PHD Deployment Configuration Properties

  • CLUSTER_NAME – Unique name of the PHD cluster. The name will appear in ‘icm_client list’ and in the PCC UI. Default: phd-c1
  • SERVICES – Services to install with the cluster; only services in the list will be installed. Note: some PHD distributions don’t support the latest services. Default: hdfs, yarn, hive, pig, zookeeper, hbase, gfxd, gpxf, hawq, graphlab
  • MASTER – FQDN of the node used as Master. By convention, the deploy script will use the Master node as a worker as well. Default: phd1.localdomain
  • WORKERS – Comma-separated FQDN list of Worker nodes. You can reuse the Master node as a Worker as well. Default: phd1.localdomain, phd2.localdomain
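After deployment, you can confirm the cluster name and installed services from the PCC node using the same icm_client tool mentioned above (a minimal sketch; the ssh password is whatever was set for gpadmin during the install):

ssh gpadmin@10.211.55.100
icm_client list   # shows the deployed cluster, e.g. phd-c1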

Provisioning Scripts

Vagrantfile uses the following provisioning scripts to create the cluster:

  • prepare_all_nodes.sh – configures /etc/hosts and the hostnames, installs the packages required by all cluster nodes (including PCC), sets up NTP, and disables the firewall.
  • pcc_install.sh – installs PCC on the pcc node and imports the PHD, PADS, PRTS, and Java RPM packages into a local yum repository.
  • phd_cluster_deploy.sh – deploys the PHD cluster, installs and initializes all requested services, and starts the cluster.
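These scripts run automatically during ‘vagrant up’. If you tweak one of them, you can re-run provisioning on the existing VMs with Vagrant’s standard command (note: the scripts were written for a fresh install and may not be idempotent):

vagrant provision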

D. Accessing Data, Tutorials, and Services

We can also test the cluster by running jobs against sample data from the Pivotal HD demo project, located at: http://pivotalhd.docs.gopivotal.com/tutorial/index.html

Prepare the Demo Data

  • ssh to phd1.localdomain as gpadmin (phd1 = 10.211.55.101, password: gpadmin): ssh gpadmin@10.211.55.101
  • Run the following commands from the command line:

sudo yum -y install git
rm -Rf /vagrant/pivotal-samples
git clone https://github.com/gopivotal/pivotal-samples.git /vagrant/pivotal-samples
mv /vagrant/pivotal-samples/sample-data /vagrant/retail_demo

hadoop fs -rm -r /retail_demo
hadoop fs -mkdir /retail_demo

for fullfile in /vagrant/retail_demo/*.gz
do
filename=$(basename "$fullfile")
filename="${filename%.*.*}"
echo "Loading $filename ..."
hadoop fs -mkdir /retail_demo/$filename
hadoop fs -put /vagrant/retail_demo/$filename.tsv.gz /retail_demo/$filename/
done
hdfs fsck /retail_demo -files
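To spot-check the upload before starting the tutorials (a minimal sketch; customers_dim is one of the sample tables), list the directories and peek at a few rows:

hadoop fs -ls /retail_demo
hadoop fs -text /retail_demo/customers_dim/customers_dim.tsv.gz | head -3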

Run Tutorials

http://pivotalhd.docs.gopivotal.com/tutorial/getting-started/overview.html

Notes:

  1. Make sure to replace any occurrence of pivhdsne in the tutorials with phd1.
  2. In the Hive tutorial there is a typo in the retail_demo.customer_addresses_dim_hive table definition: you have to remove the redundant ‘d’ from ‘LOCAdTION’ and from ‘Phone_Number d string’. Alternatively, you can load all tables at once: ‘hive -f /vagrant/pivotal-samples/hive/create_hive_tables.sql’
  3. For the HAWQ internal table tutorial use the following statements to load the internal tables:


zcat /vagrant/retail_demo/customers_dim.tsv.gz | psql -c "COPY retail_demo.customers_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/categories_dim.tsv.gz | psql -c "COPY retail_demo.categories_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/order_lineitems.tsv.gz | psql -c "COPY retail_demo.order_lineitems_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/orders.tsv.gz | psql -c "COPY retail_demo.orders_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/customer_addresses_dim.tsv.gz | psql -c "COPY retail_demo.customer_addresses_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/email_addresses_dim.tsv.gz | psql -c "COPY retail_demo.email_addresses_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/products_dim.tsv.gz | psql -c "COPY retail_demo.products_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/payment_methods.tsv.gz | psql -c "COPY retail_demo.payment_methods_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/date_dim.tsv.gz | psql -c "COPY retail_demo.date_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
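A quick row count confirms each load (a sketch; run as gpadmin on phd1, against the tables created in the tutorial):

psql -c "SELECT COUNT(*) FROM retail_demo.orders_hawq;"
psql -c "SELECT COUNT(*) FROM retail_demo.order_lineitems_hawq;"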

Once you have run these jobs, you can access the job monitor from the top menu (as shown below) or access the Job History Management UI here:

http://10.211.55.101:19888/jobhistory or https://10.211.55.100:5443/hd/phd-c1/jobs

Try GemFire XD

The GemFire XD guide explains the main concepts and provides tutorials to help you quickly begin using GemFire XD: http://gemfirexd.docs.gopivotal.com/latest/userguide/index.html?q=/latest/userguide/getting_started/tutorial_chapter_intro.html

To start the GemFire XD console, ssh to phd1 (password: gpadmin) and run the gfxd command:

ssh gpadmin@10.211.55.101
cd /usr/lib/gphd/gfxd/quickstart/
gfxd
gfxd> connect client 'phd1.localdomain:1527';
gfxd> show tables in sys;
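As a quick smoke test, you can create and query a throwaway table in the same gfxd session (a sketch; the demo_ping table is hypothetical):

gfxd> create table demo_ping (id int primary key, msg varchar(32));
gfxd> insert into demo_ping values (1, 'hello gemfire xd');
gfxd> select * from demo_ping;
gfxd> drop table demo_ping;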

Try GraphLab

Follow the Basic GraphLab Tutorial to implement a simple PageRank application from scratch:

http://docs.graphlab.org/using_graphlab.html

E. Appendices

Appendix A—Installing Vagrant VirtualBox provider or the VMWare Fusion provider

1. Install Vagrant v1.5.2 or newer: http://www.vagrantup.com/downloads.html

2a. Install VirtualBox v4.3.6 or newer: https://www.virtualbox.org/wiki/Downloads

2b. (Optional) Install the VMware Fusion 6 provider plugin and register its license (the plugin requires a commercial license file):

vagrant plugin install vagrant-vmware-fusion
vagrant plugin license vagrant-vmware-fusion license.lic
vagrant plugin list

Vagrant Boxes

All boxes support both the VirtualBox and VMware Fusion providers:

  • The default box is ‘bigdata/centos6.4_x86_64’, which reserves 40GB of disk space per VM.
  • ‘bigdata/centos6.4_x86_64_small’ is an alternative box that reserves only 8GB of disk space per VM. Note that 8GB is not enough to install all Pivotal services together.
  • Another option is to create your own Vagrant box: Build Vagrant Boxes

Appendix B—Get the Oracle jdk-7u45-linux-x64.rpm

Download the jdk-7u45-linux-x64.rpm into your PHD folder. You can download it manually from: http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase7-521261.html#jdk-7u45-oth-JPR

or with the following wget command:

wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.rpm"
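Oracle’s download endpoint occasionally returns an HTML error page instead of the RPM, so it is worth checking what actually arrived (a minimal sanity check):

file jdk-7u45-linux-x64.rpm   # should be identified as an RPM package, not HTML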

Appendix C—Download the Pivotal HD distribution packages

Three Pivotal HD distributions are supported: Pivotal HD 1.1.0, Pivotal HD 1.1.1, and Pivotal HD 2.0.1.

The instructions below explain how to get the packages for each distribution.
Note: you can download one or all distributions into the same PHD folder.

C1. Pivotal HD 2.0.1

Download Pivotal HD 2.0.1 from the Pivotal Network Distribution channel:
https://network.gopivotal.com/products/pivotal-hd

Open the Pivotal HD 2.0 file group and download the following files:

Component name                              Package file name
PHD 2.0.1: Pivotal Command Center 2.2.1     PCC-2.2.1-150.x86_64.gz
Pivotal HD 2.0.1                            PHD-2.0.1.0-148.gz
PHD 2.0.1: Pivotal HAWQ 1.2.0.1             PADS-1.2.0.1-8119.gz
PHD 2.0.1: Pivotal GemFire XD 1.0           PRTS-1.0.0-14.gz
C2. Pivotal HD 1.1.1

Download Pivotal HD 1.1.1 from the Pivotal Network Distribution channel:
https://network.gopivotal.com/products/pivotal-hd

Open the Pivotal HD 1.1.1 file group and download the following files:

Component name                              Package file name
PHD 1.1.1: Pivotal Command Center 2.1.1     PCC-2.1.1-73.x86_64.gz
Pivotal HD 1.1.1                            PHD-1.1.1.0-82.gz
PHD 1.1.1: Pivotal HAWQ 1.1.4               PADS-1.1.4-34.gz
C3. Pivotal HD 1.1.0

Download the pivotalhd_community_1.1 bundle and uncompress it in the PHD folder:

wget "http://bitcast-a.v1.o1.sjc1.bitgravity.com/greenplum/pivotal-sw/pivotalhd_community_1.1.tar.gz"
tar -xzf ./pivotalhd_community_1.1.tar.gz --strip 1

This bundle contains the PCC-2.1.0-460 and PHD-1.1.0.0-76 packages.

Download HAWQ packages:

wget http://bitcast-a.v1.o1.sjc1.bitgravity.com/greenplum/pivotal-sw/PADS-1.1.3-31.tar.gz

Download the GemFire XD package:

wget http://bitcast-a.v1.o1.sjc1.bitgravity.com/greenplum/pivotal-sw/Pivotal_GemFireXD_05Beta2_b44694.tar.gz

The PHD folder should contain the following files:

PCC-2.1.0-460.x86_64.tar.gz
PHD-1.1.0.0-76.tar.gz
PADS-1.1.3-31.tar.gz
PRTS-1.0.0-8.tar.gz
jdk-7u45-linux-x64.rpm

Appendix D—Troubleshooting

1. The installation (with the VMware Fusion provider) hangs at:

[phd1] Waiting for the VM to finish booting…

Then stop the installation (Ctrl + C) and clean the vmnetXXX interfaces:

vagrant destroy -f
sudo "/Applications/VMware Fusion.app/Contents/Library/vmnet-cli" --stop

Run ‘ifconfig -a’ to make sure that the vmnet1, vmnet2, vmnet3, and vmnet8 interfaces are removed.

2. The installation hangs at the “Verifying vmnet devices are healthy…” line and a vboxnet interface (e.g. vboxnet1) has an IP address assigned. E.g.

ifconfig -a

vboxnet1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
ether 0a:00:27:00:00:00
inet 10.211.155.1 netmask 0xffffff00 broadcast 10.211.155.255

Then delete the offending vboxnet interface:

sudo ifconfig vboxnet1 delete

To Learn More:

  • Read about Pivotal HD—product, documentation, downloads, and articles
  • Read about Pivotal GemFire—product, documentation, downloads, and articles
  • Read about Pivotal Greenplum DB—product, documentation, downloads, and articles

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Author

Christian Tzolov

Spring Engineer at Pivotal; Committer and PMC member at the Apache Software Foundation. Interested in anything about integration and interoperability architectures, and distributed, data-intensive systems.
