New Tools To Shape Data In Apache MADlib

September 27, 2016 Frank McQuillan

 

36539-madlib-sfeaturedData science is an exciting and innovative field that attracts a lot of attention these days, in part because it can add great value to the enterprise. What gets less attention, however, is the time-consuming work of collecting and preparing data sets for input into predictive analytics algorithms. A New York Times article on this topic reported that data scientists can spend 50 to 80 percent of their time collecting and preparing data. A survey reported in Forbes found that data scientists find this to be a less interesting part of their work.

However, data transformation cannot be fully automated. Understanding the integrity and completeness of data is an essential activity which helps a data scientist make proper scoping decisions, and determine how to fill in the blanks when certain pieces of data are missing. April Song, a data scientist at Pivotal comments:

“Before we start applying machine learning algorithms on our data, we must first reshape and transform our data. Having tools to help wrangle the data enables us to be more agile in our approach to data science.”

Apache MADlib (incubating) is a SQL-based open source library for scalable in-database analytics that supports Greenplum Database/Pivotal Greenplum, Apache HAWQ (incubating)/Pivotal HDB and PostgreSQL. The library offers data scientists numerous distributed implementations of mathematical, statistical and machine learning methods, including many utilities for data transformation.

New utilities have been added in the recent MADlib 1.9.1 release, including:

  • Pivot—data summarization tool that can do basic OLAP type operations
  • Sessionization—time-oriented session reconstruction on a data set comprising a sequence of events
  • Prediction metrics—set of metrics to evaluate the quality of predictions of a model

In this article we will take a closer look at each of these. Please also refer to this previous article on path functions in MADlib which describes a way to perform regular pattern matching over a sequence of rows and extracting useful information about the pattern matches, such as aggregations or window functions.

Pivot

The goal of the MADlib pivot function is to provide a data summarization tool that can do basic OLAP type operations on data stored in one table and output the summarized data to a second table. There are many variations on how to pivot data, depending on the needs of a particular situation, but the common theme is reshaping the data into a convenient format for insight and subsequent operations.

The pivot syntax is as follows:

pivot(
    source_table,	-- Name of input table
    output_table,	-- Table to store pivoted data
    index,		-- How to group (desired rows in output)
    pivot_cols,		-- What to pivot
    pivot_values,	-- Values to be summarized
    aggregate_func,	-- Aggregate(s) to apply to values
    fill_value,		-- How to fill NULLs resulting from pivot
    keep_null,		-- Whether to pivot NULLs
    output_col_dictionary	-- For managing long column names
    )

Some of the terminology above may look familiar to users of Pandas pivot tables, which informed the MADlib implementation. (Pandas is a popular Python data analysis toolkit.) Let’s look at a couple examples using this small table:

image00

The simplest possible pivot is:

SELECT madlib.pivot(
'pivset',     	-- Name of input table
'pivout',    	-- Table to store pivoted data
'id',         	-- Index
'piv',        	-- Pivot
'val');       	-- Values

The resulting pivot table is:

image02

Note that the columns names are auto-generated and intended to be descriptive. For example, the column val_avg_piv_10 means:

  • val is the column containing values that were aggregated
  • Aggregate function avg was used (default since no other aggregate function was specified)
  • Column div was pivoted
  • The number 10 indicates the values in the column piv

Next let’s look at a slightly more complicated pivot:

SELECT madlib.pivot(
'pivset',     		-- Name of input table
'pivout',    		-- Table to store pivoted data
'id',         		-- Index
'piv',        		-- Pivot
'val, val2'		-- 2 value columns
 'val=avg, val2=sum',	-- Different aggregates for each value
 '-999', 		-- Fill NULLs in pivot output with -999
TRUE, 			-- Pivot NULL values into columns
TRUE);			-- Create a column header dictionary

The resulting pivot table is:

image04

Because we opted to create a dictionary, the column headers are defined in a separate table:

image05

A dictionary can improve readability and programmatic operation on the pivoted table. Also, when auto-generated column headers grow to exceed the PostgreSQL limit of 63 characters, a dictionary table is automatically created to avoid potential side effects from truncation.

For more information on other ways to pivot data, please refer to the pivot user documentation.

Sessionization

The MADlib sessionize function performs time-oriented session reconstruction on a data set comprising a sequence of events. A defined period of inactivity indicates the end of one session and beginning of the next session. Sessions can be useful in many domains including web analytics, network security, manufacturing, finance, and operational analytics.

The sessionize syntax is as follows:

sessionize(
   source_table,	-- Name of input table
   output_table,	-- Table name for sessionized output
   partition_expr,	-- How to partition input table
   time_stamp,  	-- Time stamp column for session calculation
   max_time,		-- Maximum delta time to define session
   output_cols,	        -- Output format options
   create_view  	-- Create a view or materialize a table

Let’s look an example. The simple data set below describes shopper behavior on a notional web site that sells beer and wine. A beacon fires an event to a log file when the shopper visits different pages on the site—landing page, beer selection page, wine selection page, and checkout. There are other pages on the site, like help pages and account settings, and these show up in the logs as well.

Here are the first few rows of the log file, sorted by time:

image01

The sessionize function call is:

SELECT madlib.sessionize(
     'eventlog',             -- Name of input table
     'sessionize_output',    -- Table to store sessionize results
     'user_id',              -- Partition by user id
     'event_timestamp',      -- Time stamp column
     '0:30:0'                -- Time out of 30 min defines a session
     );

The result is:

image06

User 100821 had two sessions on this day and user 202201 had three sessions. For more information on sessionization, please refer to the sessionize user documentation.

Prediction Metrics

This utility provides a set of metrics to evaluate the quality of model predictions. A model may become less accurate over time if the data changes, or if there is a fundamental shift in the characteristics of the system being modeled. Data scientists use a variety of tools for model evaluation. The following prediction metrics have been added to MADlib:

  • For continuous variables:
    • Mean absolute error
    • Mean absolute percentage error
    • Mean percentage error
    • Mean squared error
    • R-squared
    • Adjusted R-squared
  • For binary classification:
    • True positive rates (TPR) and false positive rates (FPR)
    • Area under ROC curve
  • For multi-class classifiers:
    • Confusion matrix

For example, the area under an ROC curve is a common way to evaluate a binary classifier compared to a random coin toss:

SELECT madlib.area_under_roc ( 
'test_set', 		-- Name of input table
'table_out', 		-- Output table
'pred', 		-- Predicted values
'obs');			-- Observations

The result below shows that the model is better than random (area of 0.5) and less than perfect (area of 1.0):

image03

For more details, please refer to the prediction metrics user documentation.

Learning More:

 

About the Author

Frank McQuillan

Frank McQuillan is Director of Product Management at Pivotal, focusing on analytics and machine learning for large data sets. Prior to Pivotal, Frank has worked on projects in the areas of robotics, drones, flight simulation, and advertising technology. He holds a Masters degree from the University of Toronto and a Bachelor's degree from the University of Waterloo, both in Mechanical Engineering.

Previous
The Microsoft Quiz For Java Developers
The Microsoft Quiz For Java Developers

For the most part, Java developer's favorite thing to do with Microsoft was to ignore it. Closed source, hi...

Next
Pivotal Cloud Foundry: A First-Class Home For Your .NET Applications
Pivotal Cloud Foundry: A First-Class Home For Your .NET Applications

Pivotal and Microsoft are working hard to make the cloud a friendlier place for .NET developers. This post ...

SpringOne at VMware Explore 2023

Learn More