Powerful open source library of scalable in-database algorithms for machine learning
Data is at the center of digital transformation: using data to drive action is how transformation happens. It is therefore important to extract patterns from data efficiently in order to identify the insights and actions needed. Machine learning algorithms allow organizations not only to identify patterns and trends in their datasets but also to make high-value predictions that can guide better decisions and smart actions in near real time, without human intervention.
Apache™ MADlib® is an open source library for scalable in-database analytics. It provides data-parallel implementations of machine learning, mathematical, statistical, and graph methods on the PostgreSQL family of databases, including VMware Tanzu Greenplum®. MADlib uses the full compute power of the massively parallel processing (MPP) architecture to process very large datasets, whereas other products are limited by the amount of data that can be loaded into memory on a single node. MADlib algorithms are invoked through a familiar SQL interface, so they are easy to use.
Massively Parallel Machine Learning
The methods in MADlib are designed to run on shared-nothing, “scale-out” MPP architectures. This allows machine learning computations to be executed close to the data and at very high speed.
MADlib runs analytics on extremely large datasets. It differentiates itself from other analytical packages by not limiting execution to in-memory structures on a single computing node. MADlib users can add more nodes as data grows. Training on the full dataset rather than on a sample can improve model accuracy.
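The shared-nothing idea can be illustrated with a small sketch. It is not taken from the MADlib source; it only assumes the common aggregate pattern in which each segment folds its own rows into a small state (transition), the states are combined (merge), and the result is computed once (final). The class `RegressionState`, the function names, and the sample rows are all hypothetical, and the model is a one-variable least-squares fit chosen for brevity.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class RegressionState:
    """Sufficient statistics for a simple least-squares fit y = a + b*x."""
    n: int = 0
    sum_x: float = 0.0
    sum_y: float = 0.0
    sum_xx: float = 0.0
    sum_xy: float = 0.0


def transition(state, row):
    """Fold one (x, y) row into a segment-local state."""
    x, y = row
    return RegressionState(
        state.n + 1,
        state.sum_x + x,
        state.sum_y + y,
        state.sum_xx + x * x,
        state.sum_xy + x * y,
    )


def merge(left, right):
    """Combine two segment states; only these small states need to move."""
    return RegressionState(
        left.n + right.n,
        left.sum_x + right.sum_x,
        left.sum_y + right.sum_y,
        left.sum_xx + right.sum_xx,
        left.sum_xy + right.sum_xy,
    )


def final(state):
    """Turn the merged state into (intercept, slope)."""
    denom = state.n * state.sum_xx - state.sum_x ** 2
    slope = (state.n * state.sum_xy - state.sum_x * state.sum_y) / denom
    intercept = (state.sum_y - slope * state.sum_x) / state.n
    return intercept, slope


def fit(segments):
    """Run transition per segment, then merge the states and finalize."""
    local_states = [
        reduce(transition, rows, RegressionState()) for rows in segments
    ]
    return final(reduce(merge, local_states, RegressionState()))


# Hypothetical data following y = 1 + 2x, spread over three "segments".
segments = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
intercept, slope = fit(segments)
print(intercept, slope)
```

Because the merge step is a plain sum, the result does not depend on how rows are distributed across segments, which is what lets this kind of computation scale out by adding nodes.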
Rich Portfolio of Analytical Methods
The MADlib community has steadily added new methods in the areas of mathematics, statistics, machine learning, graph analytics, and data transformation. The current library includes a comprehensive collection of algorithms, operators, and utility functions.
Extensive Support for Popular Data Science Interfaces
PivotalR is an R wrapper that allows practitioners who know R but very little SQL to leverage the performance and scalability benefits of MADlib. It translates R model formulas into corresponding SQL statements (via MADlib), executes these statements in the database or on Hadoop, and returns summarized model output to R.
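To make the formula-to-SQL idea concrete, here is a minimal sketch of how a formula such as `price ~ size + bedrooms` might map to a call to MADlib's `linregr_train` function. It is not PivotalR's implementation: the helper `formula_to_madlib_sql` and the table and column names are hypothetical, only additive terms are handled, and no identifier quoting or escaping is attempted.

```python
def formula_to_madlib_sql(formula, source_table, out_table):
    """Translate 'y ~ x1 + x2' into a madlib.linregr_train call (sketch only)."""
    response, _, rhs = formula.partition("~")
    response = response.strip()
    predictors = [term.strip() for term in rhs.split("+") if term.strip()]
    if not response or not predictors:
        raise ValueError(f"unsupported formula: {formula!r}")

    # The leading 1 adds an intercept term to the independent-variable array.
    independent = "ARRAY[1, " + ", ".join(predictors) + "]"
    return (
        "SELECT madlib.linregr_train("
        f"'{source_table}', '{out_table}', '{response}', '{independent}');"
    )


# Hypothetical table and column names, for illustration only.
sql = formula_to_madlib_sql("price ~ size + bedrooms", "houses", "houses_model")
print(sql)
```

The generated statement is the kind of SQL a user could also write by hand, which is the sense in which MADlib methods are available through a plain SQL interface.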

Graph Processing on Greenplum Database using Apache MADlib
Data Science Reveals Extraordinary Insights into Drivers and Their Behavior
Video: The MADlib Project: SQL Toolkit for Large Scale Predictive Analytics
Video: The Evolution of MADlib
Pivotal Data Science Transport Demo
White Paper: MAD Skills: New Analysis Practices for Big Data
White Paper: The MADlib Analytics Library
Machine Learning on Greenplum with Apache MADlib
MADlib Methods
- Cox Proportional Hazards Regression
- Elastic Net Regularization
- Generalized Linear Models
- Logistic Regression
- Marginal Effects
- Multinomial Regression
- Ordinal Regression
- Robust Variance, Clustered Variance
- Support Vector Machines
- Linear Regression
- Decision Tree
- Random Forest
- Conditional Random Field
- Naive Bayes
- Neural Networks
- ARIMA
- Cross Validation
- Prediction Metrics
- Train-Test Split
- k-Nearest Neighbors
- Association Rules (Apriori)
- Clustering (K-means)
- Topic Modeling (LDA)
- All Pairs Shortest Path (APSP)
- Breadth-First Search
- Hyperlink-Induced Topic Search (HITS)
- Average Path Length
- Closeness Centrality
- Graph Diameter
- In-Out Degree
- PageRank
- Single Source Shortest Path (SSSP)
- Weakly Connected Component
- Conjugate Gradient
- Linear Solvers (Dense Linear Systems, Sparse Linear Systems)
- PMML Export
- Random Sampling
- Stratified Sampling
- Balanced Sampling
- Term Frequency for Text
- Path Functions
- Sessionization
- Array Operations
- Dimensionality Reduction (PCA)
- Encoding Categorical Variables
- Matrix Operations
- Matrix Factorization (SVD, Low Rank)
- Norms and Distance Functions
- Sparse Vectors
- Pivot
- Stemming
- Cardinality Estimators
- Correlation and Covariance
- Summary
- Hypothesis Tests
- Probability Functions
MADlib Architecture
The MADlib Analytics Approach
The MADlib approach to analytics is based on the MAD acronym:
- Magnetic: Designed to draw different types of data sources and data scientists to a single environment where best practices for analytics can be shared
- Agile: Built for fast, exploratory and iterative analytics where lightweight modeling is possible and integration of new data is extremely easy
- Deep: An environment where advanced machine learning and statistical algorithms are supported