Apache MADlib Comes of Age

October 6, 2017 Frank McQuillan

MADlib has graduated to a Top Level Project in the Apache Software Foundation (ASF), signifying that the community has been well-governed under the ASF's meritocratic process and principles. For Pivotal, this means accelerated innovation in the area of in-database machine learning and advanced analytics for Greenplum Database.

In this post, we describe the journey of MADlib from its roots as an open source project to the ASF, and its use by data scientists to solve real-world problems across a wide variety of industries.

What is MADlib?

MADlib is an open source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical, graph and machine learning methods for structured and unstructured data. It uses shared-nothing, distributed, scale-out architectures to offer data scientists an effective toolset for challenging problems involving very large data sets. MADlib is SQL-based and supports Pivotal Greenplum Database and PostgreSQL.

Why was MADlib Developed?

MADlib was originally developed to support a departure from traditional enterprise data warehouse (EDW) and business intelligence (BI) solutions. These had been successful for enterprise reporting and descriptive analytics, but were poorly suited to advanced predictive analytics. The newer workloads required fast access to massive data sets and relied on highly iterative, parallelizable algorithms. Traditional EDW and BI solutions lacked the necessary performance, and implementing advanced algorithms on them required convoluted SQL that was difficult to construct and maintain.

By contrast, MADlib provides machine learning, graph, data utility and other advanced analytics capabilities that permit data scientists, data engineers and others to work in an integrated manner within a single platform, reducing friction in data science workflows. When MADlib is paired with an MPP (massively parallel processing) analytic data warehouse like Greenplum, data scientists can develop many models in parallel. This is helpful for use cases such as modeling large populations at the entity level (e.g., individual customer tendencies). MADlib also lets users invoke advanced algorithms via SQL, rather than requiring SQL analysts to write them. The result is more business value derived from the enterprise's data, and therefore better products and services for its customers.

Origins of MADlib

"MADlib was conceived from the outset as an open-source meeting ground for software developers, computing researchers and data scientists to collaborate on scalable, in-database machine learning and statistics," said Joe Hellerstein, Professor of Computer Science at UC Berkeley, Co-Founder and Chief Strategy Officer at Trifacta, and one of the original developers of MADlib.

The early discussions that led to MADlib were written up in a paper at VLDB 2009 that coined the term “MAD Skills” for data analysis. The MADlib software project began the following year as a collaboration between researchers at UC Berkeley and engineers and computer scientists at Pivotal (formerly EMC/Greenplum).

The initial MADlib codebase came from EMC/Greenplum, UC Berkeley, the University of Wisconsin, and the University of Florida. The project was publicly documented in a paper at VLDB 2012.

Journey to the ASF

In September 2015, MADlib joined the ASF community as an incubating project. At the time, the open source community behind MADlib felt that aligning with the ASF community, governance model, and infrastructure would allow the project to accelerate adoption and community growth. There were five releases of MADlib as an incubating project, along with a growing number of industry and academic contributors and users.

In July 2017, MADlib graduated to a Top Level Project at the ASF, followed shortly by its first top-level release, MADlib 1.12, in August 2017. This latest release includes new graph analytics (all pairs shortest path, weakly connected components, breadth first search, multiple graph measures), new sampling algorithms (stratified sampling, train-test split), and a multilayer perceptron, which is a type of artificial neural network. Read more about the 1.12 release here.
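For readers unfamiliar with two of the graph methods named above, the following Python sketch finds weakly connected components using a breadth first search. It is an in-memory illustration under stated assumptions, not MADlib's in-database implementation: the function name, the edge-list input, and the convention of labeling each component by its smallest vertex id are all choices made here for clarity.

```python
from collections import defaultdict, deque

def weakly_connected_components(edges):
    """Map each vertex to the smallest vertex id in its component."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        # "Weak" connectivity ignores edge direction, so store both ways.
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    component = {}
    for start in sorted(neighbors):
        if start in component:
            continue
        # Breadth first search from the smallest unvisited vertex.
        component[start] = start
        queue = deque([start])
        while queue:
            vertex = queue.popleft()
            for nxt in neighbors[vertex]:
                if nxt not in component:
                    component[nxt] = start
                    queue.append(nxt)
    return component

# Hypothetical directed edges: 1 -> 2, 3 -> 2, 4 -> 5
edges = [(1, 2), (3, 2), (4, 5)]
labels = weakly_connected_components(edges)
```

Vertices 1, 2 and 3 end up in one component even though no directed path leads from 1 to 3, because direction is ignored; vertices 4 and 5 form a second component.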

Enterprise Deployments

The recent announcement of Greenplum 5 reinforced the value proposition of a single platform that can perform compute-intensive and complex analytical workloads at scale. In the past, many enterprises have deployed separate platforms in an attempt to gain insight from data using different techniques. For example, in addition to running SQL workloads on an Enterprise Data Warehouse (EDW) for business intelligence, they may deploy and manage separate databases for graph, geospatial, text, machine learning, etc.

Greenplum 5 is designed to eliminate data silos by integrating traditional and advanced analytics in a single scale-out analytics platform.

In concert with Greenplum’s MPP architecture, MADlib’s wide range of statistical and machine learning methods can cover a variety of real-world use cases.

"At Pivotal, we have seen our customers successfully deploy MADlib on large scale data science projects across a wide variety of industry verticals," said Elisabeth Hendrickson, Vice President, R&D for Data at Pivotal. "As MADlib graduates to a Top-Level Project at the ASF, we anticipate increased adoption in the enterprise given the mature level of the codebase and the active developer community."

Continued Innovation

At Pivotal, we look forward to working with all future contributors as part of the MADlib community to advance the state of the art in scale-out data science tools.

There are many potential avenues for future development, including expanding the library of graph analytics algorithms, adding new machine learning capabilities and supporting evolving deep learning frameworks. If you have an idea, you are welcome to contribute to the open source project.

"It has been great to witness the growth of the MADlib community and codebase as an ASF incubating project, and I look forward to this continuing as a Top Level Project," added Hellerstein.

About the Author

Frank McQuillan

Frank McQuillan is Director of Product Management at Pivotal, focusing on analytics and machine learning for large data sets. Prior to Pivotal, Frank worked on projects in the areas of robotics, drones, flight simulation, and advertising technology. He holds a Master's degree from the University of Toronto and a Bachelor's degree from the University of Waterloo, both in Mechanical Engineering.
