MADlib 1.7 Release—Adding Generalized Linear Models, Decision Trees, and Random Forest

January 15, 2015 Frank McQuillan

MADlib 1.7 is now available!

MADlib is a SQL-based open source library for scalable in-database analytics that supports PostgreSQL, Pivotal Greenplum Database, and Pivotal HAWQ. The library gives data scientists a ready-to-use set of algorithms that accelerate time to insight. It offers more than 30 data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data, and data scientists use these algorithms to solve complex problems across a wide variety of domains, from financial services to healthcare to academic research.

MADlib 1.7 adds the following capabilities:

  • Generalized linear models—a class of supervised learning algorithms that is a generalization of linear regression
  • Decision trees (completely new and improved implementation)—a supervised learning method that predicts the value of a target variable based on several input variables; the new implementation runs up to 40 times faster than the previous version
  • Random forest (completely new and improved implementation)—uses an ensemble of classifiers, each of which produces a tree modeled on some combination of the input data; includes variable importance metrics and the ability to explore each tree in the forest independently

Let’s take a closer look at each of these.

Generalized Linear Models

The Generalized Linear Model (GLM) is a class of supervised learning algorithms. As its name suggests, it is a generalization of linear regression. GLM involves relating a linear predictor (i.e., a linear combination of explanatory variables) to a response variable. A link function expresses the relationship between the response variable and the linear predictor. How to use GLM depends on the distribution of the data and nature of the response variable (continuous response, binary response, count, etc.).
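In symbols, a GLM relates the expected value of the response to the linear predictor through the link function g: g(E[Y]) = β0 + β1x1 + … + βnxn, where Y is the response variable and the xi are the explanatory variables.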

The families of distributions and link functions supported in MADlib 1.7 are:

[Table: supported distribution families and their link functions]
For example, the number of items bought by customers in a grocery store would typically be modeled with a Poisson distribution and a log link function. Number of items would be the response variable, and explanatory variables such as customer demographics, macroeconomic factors, and promotions could be included to build the Poisson regression.
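As a sketch, fitting that model with MADlib's glm function might look like the following; the purchases table and its columns here are hypothetical:

SELECT madlib.glm(
    'purchases',                    -- Data table (hypothetical)
    'purchases_glm',                -- Table to store model
    'items_bought',                 -- Response: count of items
    'ARRAY[1, age, income, promo]', -- Intercept and explanatory variables
    'family=poisson, link=log'      -- Distribution family and link function
);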

In addition to the distributions in the above table, other new regression algorithms added in MADlib 1.7 are multinomial regression and ordinal regression.

Multinomial regression is a classification method that generalizes binomial regression to multiclass problems having more than two possible discrete outcomes. It predicts the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables, which may be real-valued, binary-valued, categorical, and so on.

Ordinal regression is a type of regression analysis used for predicting an ordinal variable, that is, a variable whose value exists on an arbitrary scale where only the relative ordering between different values is significant. The two most common types of ordinal regression models are ordered logit, which applies to data that meets the proportional odds assumption, and ordered probit. Both types are included in MADlib 1.7.

An example of ordinal regression is Yelp data on restaurants and their crowd-sourced ratings. A restaurant’s Yelp rating is an ordered variable, ranging from 1 to 5. We could round a restaurant’s rating to its nearest 0.5 and set it as the response variable in an ordered probit model. Restaurant characteristics, including food type, price range, location, etc., and information on those who rated the restaurant, such as average rating, number of reviews submitted, etc., could be added as explanatory variables, with a probit link function used.
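As a sketch, such a model could be fit with MADlib's ordinal function; the restaurants table and feature columns below are hypothetical, and the ordering of the rating categories is passed as a '<'-separated string:

SELECT madlib.ordinal(
    'restaurants',               -- Data table (hypothetical)
    'restaurants_ordinal',       -- Table to store model
    'rating',                    -- Ordered response (rounded to nearest 0.5)
    'ARRAY[price_range, avg_reviewer_rating, num_reviews]', -- Features (hypothetical)
    '1<1.5<2<2.5<3<3.5<4<4.5<5', -- Order of categories
    'probit'                     -- Link function
);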

The parallel nature of MADlib’s algorithm design is demonstrated in the chart below. For ordinal regression with a probit link function, similar to the Yelp restaurant rating example above, execution time scales linearly with the number of rows in the training set:

[Chart: ordinal regression (probit link) execution time vs. number of training rows]

Benchmarks used a Pivotal Data Computing Appliance (DCA) half-rack for GPDB 4.2.7.1 and a DCA half-rack for HAWQ 1.2.1.0, with 8 nodes and 6 segments per node.

Decision Trees

Decision trees are supervised learning methods that predict the value of a target variable based on several input variables. They can be easily visualized and are intuitive to understand. Interior nodes of the tree split data tuples using a threshold value for one of the input variables, and each leaf node represents a value of the target variable.

MADlib 1.7 has a completely new and improved implementation that runs up to 40 times faster than the previous version. Additional features include pruning methods, surrogate variables for NULL handling, cross validation, tuning parameters and visualization of the trained tree.

Let’s look at an example using the Car Evaluation Data Set from the UC Irvine Machine Learning Repository. This data set describes the “acceptability” or selection of a car based on the following input variables: purchase price, maintenance cost, number of doors, passenger capacity, trunk size, and safety rating.

This data set in Greenplum format is available for download here.

The SQL statement to train the decision tree is:

SELECT * FROM madlib.tree_train(
    'car_eval', -- Data table
    'output',   -- Table to store model
    'id',       -- ID column name
    'class',    -- Column to predict
    '*',        -- Use all features
    NULL,       -- features to exclude (none for this case)
    'gini'      -- Classification impurity function
);

The resulting tree can be exported in DOT format, which is a plain text graph description language that is both human and machine readable.

-- Export tree to dot file
\pset format unaligned
\pset tuples_only
\o dt_output.dot
SELECT madlib.tree_display('output');
\o

A number of programs can be used to render DOT graphs. Using the Graphviz dot command from the Unix shell:

dot -Tpdf dt_output.dot -o dt_output.pdf

which results in the following decision tree:

[Figure: rendered decision tree for the car evaluation data]
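Once trained, the model can also be used to score data with tree_predict; as a simple sketch, here we predict back onto the training table itself:

SELECT madlib.tree_predict(
    'output',            -- Model table from tree_train
    'car_eval',          -- Table with rows to score
    'tree_predictions',  -- Table to store predictions
    'response'           -- Return predicted class labels
);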

Random Forest

Although a single decision tree is intuitive to understand, it may overfit the data. One way to mitigate this problem is by building an ensemble of classifiers, each of which produces a tree modeled on some combination of the input data. The results of these models are then combined to yield a single prediction, which is highly accurate at the expense of some loss in interpretation.

MADlib 1.7 has a completely new and improved implementation of random forest that includes variable importance metrics and the ability to explore each tree in the forest independently.

For the same car evaluation data set we used above, here is the SQL statement to train the random forest:

SELECT * FROM madlib.forest_train(
    'car_eval',        -- Data table
    'rf_output',       -- Table to store model
    'id',              -- ID column name
    'class',           -- Column to predict
    '*',               -- Use all features
    '',                -- features to exclude (none for this case)
    '',                -- Grouping columns (no grouping for this case)
    10,                -- Number of trees to train
    3,                 -- Use 3 randomly-selected features for each node
    TRUE,              -- Compute variable importance
    1,                 -- Use single permutation for variable importance
    4                  -- Maximum depth for each tree
);
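As with decision trees, the trained forest can score new data via forest_predict; a minimal sketch, again scoring the training table itself:

SELECT madlib.forest_predict(
    'rf_output',       -- Model table from forest_train
    'car_eval',        -- Table with rows to score
    'rf_predictions',  -- Table to store predictions
    'response'         -- Return predicted class labels
);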

Let’s say we are interested in understanding variable importance, that is, which variables contribute the most and the least to prediction in the training data. The SQL is:

SELECT unnest(regexp_split_to_array(cat_features, ',')) as variable,
   unnest(cat_var_importance) as importance
FROM rf_output_group, rf_output_summary;

which produces the following output:

 variable | importance
----------+---------------------
 maint    | 0.0272744901879319
 persons  | 0.088661843494196
 lug_boot | 0.00573215979836386
 safety   | 0.0826413222054395
 doors    | 0
 buying   | 0.0384399018694643
(6 rows)

From this table, it appears that passenger capacity is the most important explanatory variable for acceptability, followed closely by safety rating, while the number of doors is the least important (actually irrelevant).
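Individual trees in the forest can also be examined on their own. A sketch using MADlib's get_tree helper, which takes a group ID and a tree (sample) ID and returns that tree in displayable form:

-- Display the first tree of the first group
SELECT madlib.get_tree('rf_output', 1, 1);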

Other Interfaces

Each of the new algorithms in MADlib 1.7 is supported by PivotalR for users who prefer an R interface rather than SQL. PivotalR combines the usability of R with the performance and scalability benefits of in-database/in-Hadoop® computation.

Also, each model can be exported in Predictive Model Markup Language (PMML) format. PMML is an XML-based file format that gives applications a standard way to describe and exchange predictive models.
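As a sketch, exporting a trained model might look like the following, assuming the madlib.pmml function accepts the model table in question:

-- Export a trained model to PMML
SELECT madlib.pmml('output');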

Learning More

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Author

Frank McQuillan

Frank McQuillan is Director of Product Management at Pivotal, focusing on analytics and machine learning for large data sets. Prior to Pivotal, Frank worked on projects in the areas of robotics, drones, flight simulation, and advertising technology. He holds a Master's degree from the University of Toronto and a Bachelor's degree from the University of Waterloo, both in Mechanical Engineering.
