Recently, I spent time with a prospective customer responsible for marketing analytics. The company struggles to realize consistent analytical value due to both a fragmented data management landscape and varying skill sets. The organization generates its primary revenue by analyzing data and providing the resulting insights back to its customer base. It employs a number of data engineers and data scientists, each of whom prefers a specific procedural language (R, Python, and Perl). The company wanted to better combine and analyze its data, including data originating from multiple sources. It was using Amazon S3 as a cheap-and-deep, offsite backup for some of its on-premises applications, as well as for backup data from newly deployed applications on AWS. For the most part, the backup data was similarly organized, compressed, and comma delimited.
The project team was looking for a flexible and powerful analytics solution with mature ANSI SQL compliance and the ability to process data directly from S3. They also had a strong desire to bring some of the analytics back in-house, closer to the originating, on-premises applications. Lastly, the prospective customer wanted to provision, pay for, and obtain support for whatever solution it chose only when actually using the product, via a public cloud scenario. The team evaluated both Amazon EMR and Amazon Redshift, but neither met all of these requirements; in particular, neither EMR nor Redshift allows for that on-premises proximity.
Once I learned about the prospective customer's collective requirements, I pointed them to our newly released Pivotal Greenplum (Hourly) offering. The customer's data analysts were able to provision a production-grade Greenplum cluster in under an hour by filling in just a few basic fields. I then helped them connect to the new Greenplum cluster via the psql client.
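A connection of this kind is a single psql invocation. The IP address below is a placeholder; in practice you would substitute the master's public IP address and the gpadmin credentials from the CloudFormation stack's output tab:

```shell
# Illustrative only: replace 203.0.113.10 with your master host's public IP.
# psql prompts for the gpadmin password generated at provisioning time.
psql -h 203.0.113.10 -p 5432 -U gpadmin -d gpadmin
```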
Image 1: psql connection
Once we successfully established database connectivity, we launched pgAdmin3 on the user's desktop. We registered a new server in pgAdmin3 simply by providing the master's public IP address, the gpadmin user name, and the gpadmin password (which you can easily obtain from the CloudFormation stack's output tab).
Image 2: pgAdmin3 connection
Pivotal Greenplum Command Center, a key management tool, is packaged with the AWS Pivotal Greenplum (Hourly) offering. The user simply invokes an installation script that ships with Pivotal Greenplum (Hourly), selects the Command Center option, and starts the tool once installation completes. Not only do we provide the ability to run Pivotal Data products by the hour in licensed public cloud environments, we also ship value-add utilities! Once Pivotal Greenplum Command Center was up and running, I gave the prospect a tour of the user interface.
Image 3: AWS Pivotal Greenplum (Hourly) Optional Automated Installs
Image 4: Pivotal Greenplum Command Center
The next step in the process was to configure and consume sample S3 data. I had the prospective customer create a new (empty) database called eval. We then configured the S3 protocol for the new database following the instructions found here. End to end, this process took less than five minutes. For simplicity, we copied the S3 configuration file to /home/gpadmin and referenced it in the LOCATION clause of the CREATE EXTERNAL TABLE command. Once we successfully smoke tested S3 accessibility, we moved on to consuming the company's actual data.
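For reference, the S3 configuration file is a small INI-style file holding the AWS credentials and tuning parameters; the values below are illustrative placeholders, and the available parameters may vary by Greenplum version:

```ini
# /home/gpadmin/mytest_s3.config -- illustrative values only
[default]
accessid = "YOUR_AWS_ACCESS_KEY_ID"
secret = "YOUR_AWS_SECRET_ACCESS_KEY"
threadnum = 4
chunksize = 67108864
```

Greenplum also ships the gpcheckcloud utility, which is a convenient way to smoke test the configuration before defining any tables, e.g. `gpcheckcloud -c "s3://s3.amazonaws.com/customerdata001/ config=/home/gpadmin/mytest_s3.config"`.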
We ingested and created a brand new data mart from scratch, leveraging pre-existing data in S3. It is important to note that S3 protocol external tables provide native support for the gzip compression format. This allows data to remain compressed within your S3 buckets and still be accessible! By issuing a series of CREATE EXTERNAL TABLE scripts, we quickly built an evaluation data mart without any native data persistence!
DROP EXTERNAL TABLE IF EXISTS s3_emails;
CREATE READABLE EXTERNAL TABLE s3_emails (id int, email text)
LOCATION ('s3://s3.amazonaws.com/customerdata001/email_addresses_dim.tsv.gz config=/home/gpadmin/mytest_s3.config')
FORMAT 'TEXT';
Code Sample 1: AWS S3 Create External Table Script
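Once defined, a readable external table behaves like an ordinary table for queries, with rows streamed and decompressed from the gzip'd S3 object at query time. A quick smoke test (using the table defined above) might look like:

```sql
-- Verify connectivity and row counts against the S3-backed table.
SELECT count(*) FROM s3_emails;

-- Spot-check a few rows.
SELECT id, email FROM s3_emails LIMIT 10;
```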
Image 5: pgAdmin3 listing of S3 external tables
With the evaluation data mart built via CREATE EXTERNAL TABLE commands, it was time to turn our attention to data visualization. We chose to leverage open-source Apache Zeppelin and quickly configured its psql interpreter for the AWS-based Greenplum cluster. Once it was configured, the prospective customer's analysts began exploring the S3 external tables and were joining and querying the tables, all via Apache Zeppelin, within a matter of minutes!
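Pointing Zeppelin at Greenplum amounts to setting a handful of interpreter properties. The property names below follow the Zeppelin psql interpreter of that era and may differ in later Zeppelin versions (which use the generic jdbc interpreter instead); the host and database name are placeholders:

```ini
# Zeppelin psql interpreter settings -- illustrative values only
postgresql.driver.name = org.postgresql.Driver
postgresql.url = jdbc:postgresql://<master-public-ip>:5432/eval
postgresql.user = gpadmin
postgresql.password = <gpadmin-password>
```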
Image 6: Apache Zeppelin analyzing external tables
As a final step prior to decommissioning the Greenplum cluster, the user created a writable external table and used it to save the sample query results for future validation and comparison. An example of such a writable external table definition, including the INSERT that writes the results, can be found below.
DROP EXTERNAL TABLE IF EXISTS s3_ext_orders_cnt;
CREATE WRITABLE EXTERNAL TABLE s3_ext_orders_cnt (last_name text, payment_method text, orders int)
LOCATION ('s3://s3.amazonaws.com/customerdata001/ext_orders_cnt.tsv.gz config=/home/gpadmin/mytest_s3.config')
FORMAT 'TEXT';

INSERT INTO s3_ext_orders_cnt
SELECT c.last_name, o.payment_method, count(*) AS orders
FROM public.s3_emails e
JOIN public.s3_customers c ON (e.id = c.id)
JOIN public.s3_orders o ON (o.customerid = c.id)
GROUP BY c.last_name, o.payment_method
ORDER BY count(*) DESC;
Code Sample 2: AWS S3 Create Writable External Table Script
With the S3 external tables accessible and queried, it was time to delete the cluster. It is important to note that the prospective customer could have opted to simply shut down the cluster while retaining the stack's architectural components. However, once the customer saw how easy it was to provision a fresh cluster, it decided to delete the entire stack. Because Pivotal Greenplum (Hourly) leverages AWS CloudFormation, deleting the entire cluster is easy: within the CloudFormation console, the customer selected the Greenplum stack and then chose the Delete Stack option. The cluster was quickly destroyed, leaving all original data in the customer's S3 buckets.
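The same teardown can be scripted with the AWS CLI. The stack name below is a placeholder for whatever name was chosen at provisioning time, and the command assumes the AWS CLI is installed and configured with suitable credentials:

```shell
# Delete the entire Greenplum stack; data in your own S3 buckets is untouched.
aws cloudformation delete-stack --stack-name my-greenplum-stack
```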
Image 7: AWS CloudFormation delete stack
The prospective customer was impressed with how easy it was to spin up an entire dedicated, massively parallel analytic database in a public cloud environment with just a few clicks! It was able to quickly access existing data residing in S3 buckets and rapidly execute analytical queries against that data. The customer finished the exercise by writing sample query results back to its S3 buckets. Finally, it de-provisioned the entire cluster with a single click on the CloudFormation stack.
The attributes the customer valued in Pivotal Greenplum (Hourly), beyond the cloud-specific aspects, were mature ANSI SQL, a diverse variety of analytic capabilities, and rich machine learning exposed via SQL. Like most enterprises, it has a large variety of traditional data assets on-premises that it would like to analyze for better insights. Based on this experience with Pivotal Greenplum (Hourly), the prospective customer decided to further evaluate Greenplum for additional on-premises deployments.
Here at Pivotal, we recently spun up a net-new data research and development team called the Data Innovation Lab. The Data Innovation Lab works on a number of strategic solution engineering projects, including hourly-charged public cloud deployments, to rapidly perform interactive analytics for our valued customers. Thanks to the Data Innovation Lab's early efforts, we were able to quickly assist the marketing analytics customer via our Pivotal Greenplum (Hourly) offering on AWS Marketplace. This solution is coming next to Microsoft Azure to complement the existing Pivotal Greenplum (BYOL) offering. We plan to launch Pivotal Greenplum (Hourly) on Google Cloud Platform later this year.
On Tap Data Engineering Services
In addition to our Data Innovation Lab, we are also spearheading a next-generation set of professional services offerings to support today’s hybrid data landscapes. These engagements blur the lines between traditional and “on tap” engineering services. Today’s data-driven enterprise requires the right skills at the right time, consuming just the right quantity of engagement.