1 Command, 15 Minute Install: Hadoop + In Memory Data Grid + SQL Analytic Data Warehouse

July 7, 2014 Christian Tzolov

With one command and a 15 minute install time, anyone with a decent development sandbox can deploy and log into their own Pivotal HD environment with Apache Hadoop®, a SQL-on-HDFS analytic data warehouse (HAWQ), and an in-memory data grid (GemFire XD) integrated with HDFS.

Since the single-command install process below also includes GraphLab’s machine learning toolkit, data scientists, architects, statisticians, and analysts can have an industry-leading set of big data technologies at their fingertips before they finish their turkey sandwich at lunch.

Why Apache Hadoop® Alone Doesn’t Cut It—Three Data Services vs. One

For many companies, Apache Hadoop® alone doesn’t provide the complete big data solution needed to turn data into business value. Processing batches of big data within Apache Hadoop® is powerful, but the output often needs further analysis, and these analytics workloads typically come in two other forms.

  • One, interactive queries and statistical models are run via advanced SQL on structured data in some type of analytic data warehouse like Pivotal Greenplum DB or Pivotal HD’s HAWQ.
  • Two, resulting statistical models help to power real-time analytical queries within applications and operational processes like recommendation engines, risk analysis, or fraud identification. These algorithms are applied to business rules inside “operational applications.” For example, a workload like complex event processing is a fit for GemFire XD’s real-time, distributed, in-memory nature.
  • To complete the cycle, the results of real-time analytics within operational software are used to cycle back through the process and optimize further.

Due to the exploratory and iterative nature of this data science process, the size of the data, and the myriad of applications connecting, there are often manual steps in the process—data is moved from an application, to an Apache Hadoop® system, to an analytics warehouse, and then back to a real-time engine or in-memory data grid within the application. As we’ve all learned from software patterns of the past, this is inefficient and expensive.

Having the data on one integrated, underlying HDFS system can make life much easier, not to mention cheaper and more efficient. With Pivotal HD, batch processes, interactive queries, and real-time queries can all run on one data platform, with an integrated toolset for Java and SQL that is designed and proven to scale automatically in the cloud.

Getting Started with the Install

After hardware is available and software packages are downloaded, there is a Vagrant-based, single-command install process that sets up CentOS VMs on either VirtualBox (the default) or VMware Fusion.

In the process, four virtual machines are created—one for the Pivotal Command Center and three for the Pivotal HD cluster where Apache Hadoop® (HDFS, YARN, Pig, Zookeeper, HBase), HAWQ (SQL-on-HDFS analytic data warehouse), GemFire XD (in-memory data grid), and GraphLab services run.

The remainder of this document is organized as follows:

A. Basic Requirements and a Simple 1-Command, 15-Minute Installation

    • Step 1—Installing VirtualBox, Vagrant, and Downloading Files
    • Step 2—Download the Pivotal HD specific Vagrant Files
    • Step 3—Perform the Single-Command Install

B. Testing the Environment

C. Destroying the Environment and Changing the Install Configuration

    • More About Configuration
    • Pivotal HD Deployment Configuration Properties
    • Provisioning Scripts

D. Accessing Data, Tutorials, and Services

    • Preparing Demo Data
    • Running Tutorials
    • Trying GemFire
    • Trying GraphLab

E. Appendices

    • Appendix A—Installing Vagrant VirtualBox provider or the VMWare Fusion provider
    • Appendix B—Get the Oracle jdk-7u45-linux-x64.rpm
    • Appendix C—Download the Pivotal HD distribution packages
    • Appendix D—Troubleshooting

A. Basic Requirements and a Simple 1-Command, 15-Minute Installation

From a hardware standpoint, you need a 64-bit architecture, at least 8GB of physical memory, and around 160GB of free disk space.
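A quick way to confirm the host meets these requirements (a minimal sketch; the memory command differs between Linux and OS X):

df -h .              # free disk space on the current volume
free -g              # physical memory on Linux
sysctl hw.memsize    # physical memory on OS X, in bytes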

Step 1—Installing VirtualBox, Vagrant, and Downloading Files

Install the latest version of VirtualBox and Vagrant version 1.5.1 or greater (see Appendix A for VMware Fusion details instead of VirtualBox).

Create a new folder called PHD and download the JDK7 rpm (see Appendix B for commands if needed) into the folder, along with the desired Pivotal HD 1.1.x/2.x distribution (see Appendix C for your choice of Pivotal HD versions, components, and package file names for 1.1.0, 1.1.1, and 2.0.1).

Once you are done downloading, the following files should be in the PHD folder (if Pivotal HD 2.0.1 is selected):

PCC-2.2.1-150.x86_64.gz
PHD-2.0.1.0-148.gz
PADS-1.2.0.1-8119.gz
PRTS-1.0.0-14.gz
jdk-7u45-linux-x64.rpm
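Before moving on, it is worth confirming that all five files are actually in place (a minimal sketch; adjust the file names if you picked another PHD version):

for f in PCC-2.2.1-150.x86_64.gz PHD-2.0.1.0-148.gz PADS-1.2.0.1-8119.gz PRTS-1.0.0-14.gz jdk-7u45-linux-x64.rpm
do
  [ -f "$f" ] && echo "OK: $f" || echo "MISSING: $f"
done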

Step 2—Download the Pivotal HD specific Vagrant Files

Download the vagrant-pivotalhd files from GitHub into the PHD folder:

wget "https://github.com/tzolov/vagrant-pivotalhd/archive/blog-gopivotal.zip"-O phd-vagrant.tar.gz
tar --strip-components=1 -xzf ./phd-vagrant.tar.gz

The following files should now appear in the PHD folder: Vagrantfile, prepare_all_nodes.sh, pcc_install.sh, phd_cluster_deploy.sh

Step 3—Perform the Single-Command Install

Within PHD run the following command for a VirtualBox VM install (or see further below for a VMware Fusion install):

vagrant up

Note: When run for the first time, the ‘vagrant up’ command will download and install the bigdata/centos6.4_x86_64 Vagrant box (use ‘vagrant box list’ to see the list of installed boxes).

Both VirtualBox and VMware Fusion providers are supported; by default, Vagrant uses the VirtualBox provider. If you have installed the commercial VMware Fusion plugin (see Appendix A), you can run:

vagrant up --provider vmware_fusion

Note: If you switch from the VirtualBox to the VMware Fusion provider (or the other way around), you have to clean the network interfaces as explained in Appendix D.

As it runs, four virtual machines are created:

  • pcc (10.211.55.100) – dedicated for the Pivotal Command Center;
  • phd1, phd2, phd3 (10.211.55.10[1..3]) – used for the Pivotal HD cluster.

Note: Pivotal HD doesn’t permit collocation of PCC and cluster nodes, because a single PCC instance can install and manage multiple clusters.

The following services are now configured and running: HDFS, YARN, Pig, Zookeeper, HBase, Pivotal HD’s HAWQ, GemFire XD, and Hamster/GraphLab.
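Before opening the web consoles, you can verify the environment from the host with standard Vagrant commands (a minimal sketch):

vagrant status                    # all four VMs should report 'running'
ping -c 1 10.211.55.100           # the PCC node should respond
vagrant ssh phd1 -c "hostname"    # confirms SSH access to a cluster node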

B. Testing the Environment

Open the Pivotal Command Center (PCC) console at https://10.211.55.100:5443/hd/phd-c1/dashboard using the PCC credentials corresponding to your Pivotal distribution:

Pivotal distribution    Username    Password
PHD 2.0.x               gpadmin     Gpadmin1
PHD 1.1.x               gpadmin     gpadmin
PHD 1.0.x               gpadmin     gpadmin
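If the login page does not come up, first check from the host that the PCC web service is listening (a quick sketch; -k skips validation of the self-signed certificate):

curl -k -I https://10.211.55.100:5443/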

[Screenshot: Pivotal Command Center dashboard]

Below are the service roles per cluster node. Note that the GraphLab service is installed ‘manually’, not through the PCC:

  • phd1.localdomain – hawq-standbymaster, hawq-master, hive-server, hive-client, hive-metastore, hbase-client, hbase-master, gpxf-agent, gfxd-locator, namenode, hadoop-client, secondarynamenode, yarn-resourcemanager, mapreduce-historyserver, pig-client
  • phd2.localdomain – hawq-segment, hbase-regionserver, zookeeper-server, gpxf-agent, gfxd-server, datanode, yarn-nodemanager
  • phd3.localdomain – hawq-segment, hbase-regionserver, zookeeper-server, gpxf-agent, gfxd-server, datanode, yarn-nodemanager

You can also open the GemFire XD management console: http://10.211.55.101:7075/pulse/clusterDetail.html (username: admin, password: admin)

[Screenshot: GemFire XD Pulse management console]

C. Destroying the Environment and Changing the Install Configuration

To destroy all VMs run:

vagrant destroy -f
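A typical reconfiguration cycle is destroy, edit, re-create (a short sketch):

vagrant destroy -f   # remove all four VMs
# edit the Vagrantfile properties described below
vagrant up           # re-create the cluster with the new configuration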

More About Configuration

The Vagrantfile exposes configuration properties to allow VM and PHD configuration and customization. The properties fall into two groups: (1) VM infrastructure and PCC installation properties, and (2) Pivotal HD deployment configuration properties.

VM and PCC Configuration Properties:

  • PHD_DISTRIBUTION_TO_INSTALL – PCC, PHD, PADS, and PRTS package names to install. The first element is the package extension (e.g. ‘tar.gz’ or ‘gz’). The predefined options are PHD_110, PHD_111, and PHD_201; custom definitions are allowed as well (see the note below). Default: PHD_201
  • VM_BOX – Vagrant box name. The default box provides 40GB of disk space per VM; the bigdata/centos6.4_x86_64_small box is available as well and takes only 8GB of disk space. Default: bigdata/centos6.4_x86_64
  • MASTER_PHD_MEMORY_MB – PHD cluster Master node memory (MB). Default: 2048
  • WORKER_PHD_MEMORY_MB – PHD cluster Worker node memory (MB). Default: 1536
  • PCC_MEMORY_MB – PCC node memory (MB). Default: 768
  • DEPLOY_PHD_CLUSTER – By default, the Vagrant script deploys a PHD cluster as part of the installation. If you want to use the PCC Wizard to install PHD clusters instead, set this property to FALSE; Vagrant will then create and initialize the VMs and install the PCC, but will not attempt to install the PHD cluster. Default: TRUE

Note: In addition to the predefined PHD distributions, you can define your own custom package versions like this:

PHD_DISTRIBUTION_TO_INSTALL = ["gz", "PCC-2.1.1-73", "PHD-1.1.1.0-82", "PADS-1.1.4-34", "PRTS-1.0.0-9"]

PHD Deployment Configuration Properties

  • CLUSTER_NAME – Unique name of the PHD cluster. The name will appear in ‘icm_client list’ and in the PCC UI. Default: phd-c1
  • SERVICES – Services to install with the cluster; only services in the list will be installed. Note: some PHD distributions don’t support the latest services. Default: hdfs, yarn, hive, pig, zookeeper, hbase, gfxd, gpxf, hawq, graphlab
  • MASTER – FQDN of the node used as Master. By convention, the deploy script will use the Master node as a worker as well. Default: phd1.localdomain
  • WORKERS – Comma-separated FQDN list of Worker nodes. You can reuse the Master node as a Worker as well. Default: phd1.localdomain, phd2.localdomain
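After deployment, you can confirm the cluster name and installed services from the PCC node using the same icm_client tool mentioned above (a minimal sketch; the ssh password is whatever was set for gpadmin during the install):

ssh gpadmin@10.211.55.100
icm_client list   # shows the deployed cluster, e.g. phd-c1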

Provisioning Scripts

Vagrantfile uses the following provisioning scripts to create the cluster:

  • prepare_all_nodes.sh – configures /etc/hosts and the hostnames, installs the packages required by all cluster nodes (including PCC), sets up NTP, and disables the firewall.
  • pcc_install.sh – installs PCC on the pcc node and imports the PHD, PADS, PRTS, and Java RPM packages into a local yum repository.
  • phd_cluster_deploy.sh – deploys the PHD cluster, installs and initializes all requested services, and starts the cluster.
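These scripts run automatically during ‘vagrant up’. If you tweak one of them, you can re-run provisioning on the existing VMs with Vagrant’s standard command (note: the scripts were written for a fresh install and may not be idempotent):

vagrant provision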

D. Accessing Data, Tutorials, and Services

We can also test the cluster by running jobs against sample data from the Pivotal HD demo project, located at: http://pivotalhd.docs.gopivotal.com/tutorial/index.html

Prepare the Demo Data

  • ssh to phd1.localdomain as gpadmin (phd1 = 10.211.55.101, password: gpadmin): ssh gpadmin@10.211.55.101
  • Run the following commands from the command line:

sudo yum -y install git
rm -Rf /vagrant/pivotal-samples
git clone https://github.com/gopivotal/pivotal-samples.git /vagrant/pivotal-samples
mv /vagrant/pivotal-samples/sample-data /vagrant/retail_demo

hadoop fs -rm -r /retail_demo
hadoop fs -mkdir /retail_demo

for fullfile in /vagrant/retail_demo/*.gz
do
filename=$(basename "$fullfile")
filename="${filename%.*.*}"
echo "Loading $filename ..."
hadoop fs -mkdir /retail_demo/$filename
hadoop fs -put /vagrant/retail_demo/$filename.tsv.gz /retail_demo/$filename/
done
hdfs fsck /retail_demo -files
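To spot-check the upload before starting the tutorials (a minimal sketch; customers_dim is one of the sample tables), list the directories and peek at a few rows:

hadoop fs -ls /retail_demo
hadoop fs -text /retail_demo/customers_dim/customers_dim.tsv.gz | head -3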

Run Tutorials

http://pivotalhd.docs.gopivotal.com/tutorial/getting-started/overview.html

Notes:

  1. Make sure to replace any occurrence of pivhdsne in the tutorials with phd1.
  2. In the Hive tutorial there is a typo in the retail_demo.customer_addresses_dim_hive table definition: you have to remove the redundant ‘d’ from ‘LOCAdTION’ and from ‘Phone_Number d string’. Alternatively, you can load all tables at once: ‘hive -f /vagrant/pivotal-samples/hive/create_hive_tables.sql’
  3. For the HAWQ internal table tutorial use the following statements to load the internal tables:


zcat /vagrant/retail_demo/customers_dim.tsv.gz | psql -c "COPY retail_demo.customers_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/categories_dim.tsv.gz | psql -c "COPY retail_demo.categories_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/order_lineitems.tsv.gz | psql -c "COPY retail_demo.order_lineitems_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/orders.tsv.gz | psql -c "COPY retail_demo.orders_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/customer_addresses_dim.tsv.gz | psql -c "COPY retail_demo.customer_addresses_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/email_addresses_dim.tsv.gz | psql -c "COPY retail_demo.email_addresses_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/products_dim.tsv.gz | psql -c "COPY retail_demo.products_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/payment_methods.tsv.gz | psql -c "COPY retail_demo.payment_methods_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
zcat /vagrant/retail_demo/date_dim.tsv.gz | psql -c "COPY retail_demo.date_dim_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
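A quick row count confirms each load (a sketch; run as gpadmin on phd1, against the tables created in the tutorial):

psql -c "SELECT COUNT(*) FROM retail_demo.orders_hawq;"
psql -c "SELECT COUNT(*) FROM retail_demo.order_lineitems_hawq;"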

Once you have run these jobs, you can access the job monitor from the top menu (as shown below) or access the Job History Management UI here:

http://10.211.55.101:19888/jobhistory or https://10.211.55.100:5443/hd/phd-c1/jobs

Try GemFire XD

The GemFire XD guide explains the main concepts and provides tutorials to help you quickly begin using GemFire XD: http://gemfirexd.docs.gopivotal.com/latest/userguide/index.html?q=/latest/userguide/getting_started/tutorial_chapter_intro.html

To start the GemFire XD console, ssh to phd1 (password: gpadmin) and run the gfxd command:

ssh gpadmin@10.211.55.101
cd /usr/lib/gphd/gfxd/quickstart/
gfxd
gfxd> connect client 'phd1.localdomain:1527';
gfxd> show tables in sys;
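As a quick smoke test, you can create and query a throwaway table in the same gfxd session (a sketch; the demo_ping table is hypothetical):

gfxd> create table demo_ping (id int primary key, msg varchar(32));
gfxd> insert into demo_ping values (1, 'hello gemfire xd');
gfxd> select * from demo_ping;
gfxd> drop table demo_ping;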

Try GraphLab

Follow the Basic GraphLab Tutorial to implement a simple PageRank application from scratch:

http://docs.graphlab.org/using_graphlab.html

E. Appendices

Appendix A—Installing Vagrant VirtualBox provider or the VMWare Fusion provider

1. Install Vagrant v1.5.2 or newer: http://www.vagrantup.com/downloads.html

2a. Install VirtualBox v4.3.6 or newer: https://www.virtualbox.org/wiki/Downloads

2b. (Optional) Install the VMware Fusion 6 provider plugin and register its license (the plugin requires a commercial license file):

vagrant plugin install vagrant-vmware-fusion
vagrant plugin license vagrant-vmware-fusion license.lic
vagrant plugin list

Vagrant Boxes

All boxes support both the VirtualBox and VMware Fusion providers:

  • The default box is ‘bigdata/centos6.4_x86_64’, which reserves 40GB of disk space per VM.
  • ‘bigdata/centos6.4_x86_64_small’ is an alternative box that reserves only 8GB of disk space per VM. Note that 8GB is not enough to install all Pivotal services together.
  • Another option is to create your own Vagrant box: Build Vagrant Boxes

Appendix B—Get the Oracle jdk-7u45-linux-x64.rpm

Download the jdk-7u45-linux-x64.rpm into your PHD folder. You can download it manually from: http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase7-521261.html#jdk-7u45-oth-JPR

or with the following wget command:

wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.rpm"
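Oracle’s download endpoint occasionally returns an HTML error page instead of the RPM, so it is worth checking what actually arrived (a minimal sanity check):

file jdk-7u45-linux-x64.rpm   # should be identified as an RPM package, not HTML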

Appendix C—Download the Pivotal HD distribution packages

Three Pivotal HD distributions are supported: Pivotal HD 1.1.0, Pivotal HD 1.1.1, and Pivotal HD 2.0.1.

The instructions below explain how to get the packages for each distribution.
Note: you can download one or all distributions into the same PHD folder.

C1. Pivotal HD 2.0.1

Download Pivotal HD 2.0.1 from the Pivotal Network Distribution channel:
https://network.gopivotal.com/products/pivotal-hd

Open the Pivotal HD 2.0 file group and download the following files:

Component name                              Package file name
PHD 2.0.1: Pivotal Command Center 2.2.1     PCC-2.2.1-150.x86_64.gz
Pivotal HD 2.0.1                            PHD-2.0.1.0-148.gz
PHD 2.0.1: Pivotal HAWQ 1.2.0.1             PADS-1.2.0.1-8119.gz
PHD 2.0.1: Pivotal GemFire XD 1.0           PRTS-1.0.0-14.gz
C2. Pivotal HD 1.1.1

Download Pivotal HD 1.1.1 from the Pivotal Network Distribution channel:
https://network.gopivotal.com/products/pivotal-hd

Open the Pivotal HD 1.1.1 file group and download the following files:

Component name                              Package file name
PHD 1.1.1: Pivotal Command Center 2.1.1     PCC-2.1.1-73.x86_64.gz
Pivotal HD 1.1.1                            PHD-1.1.1.0-82.gz
PHD 1.1.1: Pivotal HAWQ 1.1.4               PADS-1.1.4-34.gz
C3. Pivotal HD 1.1.0

Download the pivotalhd_community_1.1 bundle and uncompress it in the PHD folder:

wget "http://bitcast-a.v1.o1.sjc1.bitgravity.com/greenplum/pivotal-sw/pivotalhd_community_1.1.tar.gz"
tar -xzf ./pivotalhd_community_1.1.tar.gz --strip 1

This bundle contains the PCC-2.1.0-460 and PHD-1.1.0.0-76 packages.

Download HAWQ packages:

wget http://bitcast-a.v1.o1.sjc1.bitgravity.com/greenplum/pivotal-sw/PADS-1.1.3-31.tar.gz

Download the GemFire XD package:

wget http://bitcast-a.v1.o1.sjc1.bitgravity.com/greenplum/pivotal-sw/Pivotal_GemFireXD_05Beta2_b44694.tar.gz

The PHD folder should contain the following files:

PCC-2.1.0-460.x86_64.tar.gz
PHD-1.1.0.0-76.tar.gz
PADS-1.1.3-31.tar.gz
PRTS-1.0.0-8.tar.gz
jdk-7u45-linux-x64.rpm

Appendix D—Troubleshooting

1. The installation (with the VMware Fusion provider) hangs at:

[phd1] Waiting for the VM to finish booting…

Then stop the installation (Ctrl + C) and clean the vmnetXXX interfaces:

vagrant destroy -f
sudo "/Applications/VMware Fusion.app/Contents/Library/vmnet-cli" --stop

Run ‘ifconfig -a’ to make sure that the vmnet1, vmnet2, vmnet3, and vmnet8 interfaces are removed.

2. The installation hangs at the “Verifying vmnet devices are healthy…” line and a vboxnet interface (e.g. vboxnet1) has an IP address assigned. E.g.

ifconfig -a

vboxnet1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
ether 0a:00:27:00:00:00
inet 10.211.155.1 netmask 0xffffff00 broadcast 10.211.155.255

Then delete the offending vboxnet interface:

sudo ifconfig vboxnet1 delete

To Learn More:

  • Read about Pivotal HD—product, documentation, downloads, and articles
  • Read about Pivotal GemFire—product, documentation, downloads, and articles
  • Read about Pivotal Greenplum DB—product, documentation, downloads, and articles

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Author

Christian Tzolov

Spring Engineer at Pivotal; Committer and PMC member at the Apache Software Foundation. Interested in anything about integration and interoperability architectures, and distributed, data-intensive systems.
