MADlib 1.7 Release—Adding Generalized Linear Models, Decision Trees, and Random Forest

January 15, 2015 Frank McQuillan

MADlib 1.7 is now available!

MADlib is a SQL-based open source library for scalable in-database analytics that supports PostgreSQL, Pivotal Greenplum Database, and Pivotal HAWQ. The library gives data scientists a ready-to-use set of algorithms that accelerate time to insight. It offers more than 30 data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data, and data scientists use these algorithms to solve complex problems across a wide variety of domains, from financial services to healthcare to academic research.

MADlib 1.7 adds the following capabilities:

  • Generalized linear models—a class of supervised learning algorithms that is a generalization of linear regression
  • Decision trees (completely new and improved implementation)—a supervised learning method that predicts the value of a target variable based on several input variables; the new implementation runs up to 40 times faster than the previous version
  • Random forest (completely new and improved implementation)—uses an ensemble of classifiers, each of which produces a tree modeled on some combination of the input data; includes variable importance metrics and the ability to explore each tree in the forest independently

Let’s take a closer look at each of these.

Generalized Linear Models

The Generalized Linear Model (GLM) is a class of supervised learning algorithms. As its name suggests, it is a generalization of linear regression. GLM involves relating a linear predictor (i.e., a linear combination of explanatory variables) to a response variable. A link function expresses the relationship between the response variable and the linear predictor. How to use GLM depends on the distribution of the data and nature of the response variable (continuous response, binary response, count, etc.).
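In symbols, a GLM relates the expected value of the response to the linear predictor through the link function g: g(E[Y]) = β0 + β1x1 + … + βnxn, where Y is the response variable and the xi are the explanatory variables.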

The families of distributions and link functions supported in MADlib 1.7 are:

[Table: supported distribution families and their link functions]
For example, the number of items bought by customers in a grocery store would typically be modeled with a Poisson distribution and a log link function. Number of items would be the response variable, and explanatory variables such as customer demographics, macroeconomic factors, and promotions could be included to build the Poisson regression.
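As a sketch, fitting that model with MADlib's glm function might look like the following; the purchases table and its columns here are hypothetical:

SELECT madlib.glm(
    'purchases',                    -- Data table (hypothetical)
    'purchases_glm',                -- Table to store model
    'items_bought',                 -- Response: count of items
    'ARRAY[1, age, income, promo]', -- Intercept and explanatory variables
    'family=poisson, link=log'      -- Distribution family and link function
);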

In addition to the distributions in the above table, other new regression algorithms added in MADlib 1.7 are multinomial regression and ordinal regression.

Multinomial regression is a classification method that generalizes binomial regression to multiclass problems having more than two possible discrete outcomes. It predicts the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables, which may be real-valued, binary-valued, categorical, and so on.

Ordinal regression is a type of regression analysis used for predicting an ordinal variable, that is, a variable whose value exists on an arbitrary scale where only the relative ordering between different values is significant. The two most common types of ordinal regression models are ordered logit, which applies to data that meets the proportional odds assumption, and ordered probit. Both types are included in MADlib 1.7.

An example of ordinal regression is Yelp data on restaurants and their crowd-sourced ratings. A restaurant’s Yelp rating is an ordered variable, ranging from 1 to 5. We could round a restaurant’s rating to its nearest 0.5 and set it as the response variable in an ordered probit model. Restaurant characteristics, including food type, price range, location, etc., and information on those who rated the restaurant, such as average rating, number of reviews submitted, etc., could be added as explanatory variables, with a probit link function used.
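As a sketch, such a model could be fit with MADlib's ordinal function; the restaurants table and feature columns below are hypothetical, and the ordering of the rating categories is passed as a '<'-separated string:

SELECT madlib.ordinal(
    'restaurants',               -- Data table (hypothetical)
    'restaurants_ordinal',       -- Table to store model
    'rating',                    -- Ordered response (rounded to nearest 0.5)
    'ARRAY[price_range, avg_reviewer_rating, num_reviews]', -- Features (hypothetical)
    '1<1.5<2<2.5<3<3.5<4<4.5<5', -- Order of categories
    'probit'                     -- Link function
);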

The parallel nature of MADlib’s algorithm design is demonstrated in the chart below. For ordinal regression with a probit link function, similar to the Yelp restaurant rating example above, execution time scales linearly with the number of rows in the training set:

[Chart: ordinal regression (probit link) execution time vs. number of training rows]

Benchmarks used a Pivotal Data Computing Appliance (DCA) half-rack for GPDB 4.2.7.1 and a DCA half-rack for HAWQ 1.2.1.0, with 8 nodes and 6 segments per node.

Decision Trees

Decision trees are supervised learning methods that predict the value of a target variable based on several input variables. They can be easily visualized and are intuitive to understand. Interior nodes of the tree split data tuples using a threshold value for one of the input variables, and each leaf node represents a value of the target variable.

MADlib 1.7 has a completely new and improved implementation that runs up to 40 times faster than the previous version. Additional features include pruning methods, surrogate variables for NULL handling, cross validation, tuning parameters and visualization of the trained tree.

Let’s look at an example using the Car Evaluation Data Set from the UC Irvine Machine Learning Repository. This data set describes the “acceptability” or selection of a car based on the following input variables: purchase price, maintenance cost, number of doors, passenger capacity, trunk size, and safety rating.

This data set in Greenplum format is available for download here.

The SQL statement to train the decision tree is:

SELECT * FROM madlib.tree_train(
    'car_eval', -- Data table
    'output',   -- Table to store model
    'id',       -- ID column name
    'class',    -- Column to predict
    '*',        -- Use all features
    NULL,       -- features to exclude (none for this case)
    'gini'      -- Classification impurity function
);

The resulting tree can be exported in DOT format, which is a plain text graph description language that is both human and machine readable.

-- Export tree to dot file
\pset format unaligned
\pset tuples_only
\o dt_output.dot
SELECT madlib.tree_display('output');
\o

A number of programs can be used to render DOT graphs. Using the Graphviz dot command from the Unix shell:

dot -Tpdf dt_output.dot -o dt_output.pdf

which results in the following decision tree:

[Figure: rendered decision tree for the car evaluation data]
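Once trained, the model can also be used to score data with tree_predict; as a simple sketch, here we predict back onto the training table itself:

SELECT madlib.tree_predict(
    'output',            -- Model table from tree_train
    'car_eval',          -- Table with rows to score
    'tree_predictions',  -- Table to store predictions
    'response'           -- Return predicted class labels
);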

Random Forest

Although a single decision tree is intuitive to understand, it may overfit the data. One way to mitigate this problem is by building an ensemble of classifiers, each of which produces a tree modeled on some combination of the input data. The results of these models are then combined to yield a single prediction, which is highly accurate at the expense of some loss in interpretation.

MADlib 1.7 has a completely new and improved implementation of random forest that includes variable importance metrics and the ability to explore each tree in the forest independently.

For the same car evaluation data set we used above, here is the SQL statement to train the random forest:

SELECT * FROM madlib.forest_train(
    'car_eval',        -- Data table
    'rf_output',       -- Table to store model
    'id',              -- ID column name
    'class',           -- Column to predict
    '*',               -- Use all features
    '',                -- features to exclude (none for this case)
    '',                -- Grouping columns (no grouping for this case)
    10,                -- Number of trees to train
    3,                 -- Use 3 randomly-selected features for each node
    TRUE,              -- Compute variable importance
    1,                 -- Use single permutation for variable importance
    4                  -- Maximum depth for each tree
);
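As with decision trees, the trained forest can score new data via forest_predict; a minimal sketch, again scoring the training table itself:

SELECT madlib.forest_predict(
    'rf_output',       -- Model table from forest_train
    'car_eval',        -- Table with rows to score
    'rf_predictions',  -- Table to store predictions
    'response'         -- Return predicted class labels
);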

Let’s say we are interested in understanding variable importance, that is, which variables contribute the most and the least to prediction in the training data. The SQL is:

SELECT unnest(regexp_split_to_array(cat_features, ',')) as variable,
   unnest(cat_var_importance) as importance
FROM rf_output_group, rf_output_summary;

which produces the following output:

 variable | importance
----------+---------------------
 maint    | 0.0272744901879319
 persons  | 0.088661843494196
 lug_boot | 0.00573215979836386
 safety   | 0.0826413222054395
 doors    | 0
 buying   | 0.0384399018694643
(6 rows)

From this table, it appears that passenger capacity is the most important explanatory variable for acceptability, followed closely by safety rating, while the number of doors is the least important (actually irrelevant).
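Individual trees in the forest can also be examined on their own. A sketch using MADlib's get_tree helper, which takes a group ID and a tree (sample) ID and returns that tree in displayable form:

-- Display the first tree of the first group
SELECT madlib.get_tree('rf_output', 1, 1);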

Other Interfaces

Each of the new algorithms in MADlib 1.7 is supported by PivotalR for users who prefer an R interface rather than SQL. PivotalR combines the usability of R with the performance and scalability benefits of in-database/in-Hadoop® computation.

Also, each model can be exported in Predictive Model Markup Language (PMML) format. PMML is an XML-based file format that gives applications a standard way to describe and exchange predictive models.
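As a sketch, exporting a trained model might look like the following, assuming the madlib.pmml function accepts the model table in question:

-- Export a trained model to PMML
SELECT madlib.pmml('output');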

Learning More

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Author

Frank McQuillan

Frank McQuillan is Director of Product Management at Pivotal, focusing on analytics and machine learning for large data sets. Prior to Pivotal, Frank worked on projects in the areas of robotics, drones, flight simulation, and advertising technology. He holds a Master's degree from the University of Toronto and a Bachelor's degree from the University of Waterloo, both in Mechanical Engineering.
