Updating PostgreSQL-based Open Source Machine Learning Library from Google Summer of Code

September 9, 2014 Andreas Scherbaum

Google Summer of Code 2014 has wrapped up and updated Pivotal’s MADlib project, an open source toolset for big data machine learning in SQL. The codebase now includes the implementation of a new analytics algorithm.

French computer science student Maxence Ahlouche originally proposed implementing two clustering algorithms for MADlib: k-medoids and OPTICS. Over the course of GSoC, we changed the goals. We left out OPTICS, and Maxence instead proposed refactoring k-means and k-medoids to share the same code base. This removed a lot of duplicate code and made everything more readable and easier to use.
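To illustrate why the two algorithms lend themselves to a shared code base, here is a minimal Python sketch. It is not MADlib's code or API: the names `cluster`, `mean_center`, and `medoid_center` are hypothetical, and squared Euclidean distance is assumed for both variants. The point is only that the assignment loop can be identical, with the center-update step as the single difference.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance (an assumption; other metrics would work)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_center(members: Sequence[Point]) -> Point:
    """k-means update: the coordinate-wise mean of the cluster members."""
    n = len(members)
    return tuple(sum(col) / n for col in zip(*members))


def medoid_center(members: Sequence[Point]) -> Point:
    """k-medoids update: the member with the smallest total distance to the rest."""
    return min(members, key=lambda c: sum(sq_dist(c, p) for p in members))


def cluster(
    points: Sequence[Point],
    centers: Sequence[Point],
    update: Callable[[Sequence[Point]], Point],
    max_iter: int = 100,
) -> List[Point]:
    """Shared loop: assign each point to its nearest center, then update centers."""
    centers = list(centers)
    for _ in range(max_iter):
        groups: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [update(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers


points = [(0, 0), (0, 2), (10, 10), (10, 12), (10, 14)]
initial = [(0, 0), (10, 10)]
kmeans_centers = cluster(points, initial, mean_center)
kmedoids_centers = cluster(points, initial, medoid_center)
```

Swapping `mean_center` for `medoid_center` is the only change needed to move from one algorithm to the other in this sketch. Note that a medoid is always one of the input points, while a mean generally is not.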

Just in time for the deadline, the implementation for both the PostgreSQL version and the Greenplum/HAWQ version was finished, along with the proper tests and documentation. Maxence also stated that he will continue his work and implement the OPTICS algorithm. After all, that is what the Google Summer of Code program is for: bringing open source projects and students together and offering students an easy way to contribute to a project.

The code from Maxence will be audited by the Pivotal team, namely by Pivotal Senior Engineer Hai Qian. Then, it will be added to the MADlib code base in the upcoming release.

Pivotal wants to thank all participants:

  • Maxence Ahlouche: for the excellent work during the summer

  • Atri Sharma: for mentoring the project

  • Hai Qian: for countless suggestions and hints

  • Caleb Welton: for support from Pivotal

  • Andreas Scherbaum (me)

Special thanks go to the PostgreSQL Global Development Group for enabling us to participate in the GSoC program.

For those who are unfamiliar with MADlib, the project sits at the intersection of commercial efforts, academic research, and open source development. The project is built from the ground up to operate in distributed computing environments and massively parallel processing databases. With Pivotal Greenplum, the data can be operated on locally within a shared-nothing architecture. To date, the library supports methods such as classification, regression, clustering, topic modeling, association rule mining, descriptive statistics, validation, and more.

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Author

Andreas Scherbaum

Andreas Scherbaum has been working with PostgreSQL since 1997. He is involved in several PostgreSQL-related community projects, is a member of the Board of Directors of the European PostgreSQL User Group, and has also written a PostgreSQL book (in German). Since 2011 he has been working for EMC/Greenplum/Pivotal, where he tackles very big databases.
