Pivotal's Google Summer of Code 2014: Implementing Clustering Algorithms in MADlib

June 5, 2014 Andreas Scherbaum

This year marks the 10th year of Google Summer of Code (GSoC). Since its inception, over 7500 students have developed over 50 million lines of code by working with over 440 open source projects and 7000 open source mentors from 100 countries.

This summer, French computer science student Maxence Ahlouche’s proposal was chosen out of 6313 proposals. He is spending 12 weeks, from mid-May to the end of August, developing two data science algorithms for MADlib, Pivotal’s open-source library of big data analytics and machine learning algorithms supporting PostgreSQL. The algorithms will also run on PostgreSQL-compatible, massively parallel database services like Pivotal Greenplum and Pivotal HD with HAWQ, the Pivotal distribution of Apache Hadoop®.

For GSoC 2014, Maxence was selected along with 1300 other students who will collectively work with 190 of the world’s top open source organizations, including The Apache Software Foundation, Ceph, CERN, Clojure, Debian, Drupal, FreeBSD, Git, Gnome, GNU, Google, Groovy, Haskell, Mozilla, openSUSE, phpMyAdmin, Python, R, PostgreSQL, Ruby on Rails, The Eclipse Foundation, The Fedora Project, The Linux Foundation, Twitter, WordPress, and Xen. The student efforts cover engineering software for web crawlers, in-memory data grids, JavaScript libraries, aggressive compilers, porting, integration, cryptography, semantics, self-tuning optimizers, speech recognition, computer vision, robotics, fuzzy visualization, and much more.

Only 28 students were selected from France, and the country ranked 13th in terms of student participants. The five countries with the most student participants were India (401), the United States (161), Germany (78), Sri Lanka (54), and the Russian Federation (51).

How Does Google Summer of Code Work?

For PostgreSQL, the four accepted GSoC projects were index-only scans for GiST, changing unlogged to logged tables, supporting KNN for SP-GiST, and implementing clustering algorithms in MADlib. The features to be implemented are decided solely by the projects, and the mentoring is done by well-known project members. In this case, former GSoC student Atri Sharma, Pivotal Senior Engineer Hai Qian, and EMC Architect and Advisor Andreas Scherbaum are guiding Maxence.

The aim of GSoC is to help “recruit” students as new members of open source projects and establish a long-term relationship, possibly beyond the current project. As part of the project, the students become familiar with the code base, infrastructure, and organization behind the open source project. During the process, students contribute real, working code to the fast-growing, dynamic, disruptive world of open source software. Later in the summer, students and mentors are invited to a conference on the Google campus in Mountain View.

The Work—Developing Cluster Analysis Tools for MADlib

A common task in data science is clustering or grouping data into sets by similarity. This type of analysis is performed in use cases with gene sequencing, bioinformatics, medical imaging, recommendation engines, search results, data mining, machine learning, pattern recognition, image analysis, information retrieval, robotics, geology, and many other areas. Maxence is developing the k-medoids and the OPTICS clustering algorithms as part of the MADlib open source project.

MADlib provides an open-source framework that separates machine learning logic from database-specific implementation details. This allows the analytics to run locally within the database, where the data lives, and to use massively parallel processing (MPP) techniques, similar to MapReduce, for parallelism and scale. It features a toolkit of algorithms for classification, regression, clustering, topic modeling, rule mining, descriptive statistics, validation, time series analysis, and other data science techniques.

The GSoC project is split into two parts:

Implementing the k-medoids algorithm, which is well suited to noisy datasets and related to the already implemented k-means algorithm.
Implementing the OPTICS (ordering points to identify the clustering structure) algorithm to identify density-based clusters in spatial data.

Both sub-projects will come with the necessary code, tests, and documentation. In addition, Maxence will remove duplicate code from the two new sub-projects and optionally from other MADlib code.

More About The Clustering Algorithms k-medoids and OPTICS

The k-medoids algorithm is similar to the well-known k-means algorithm: it also partitions a data set into groups (clusters) and aims to minimize the distance between each point and the center of its cluster. Unlike k-means, which uses the mean of a cluster as its center, the k-medoids algorithm uses actual data points (medoids) as cluster centers. This makes the result more robust to noise and outliers. It also makes the algorithm more computationally intensive.
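
To make the idea concrete, here is a minimal sketch of one common k-medoids variant (alternating between an assignment step and a medoid update step). It is an illustration only, not the MADlib implementation: the function name `k_medoids`, the Euclidean distance, and the naive choice of the first k points as starting medoids are all assumptions made for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids sketch. Returns (medoid_indices, labels)."""
    # Assumption: naive initialization, the first k points are the medoids.
    medoids = list(range(k))
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        labels = [
            min(range(k), key=lambda c: dist(p, points[medoids[c]]))
            for p in points
        ]
        # Update step: in each cluster, the new medoid is the member with
        # the smallest total distance to all other members.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Possible with duplicate points; keep the old medoid.
                new_medoids.append(medoids[c])
                continue
            best = min(
                members,
                key=lambda i: sum(dist(points[i], points[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break  # converged: no medoid changed
        medoids = new_medoids
    return medoids, labels


# Two small, well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = k_medoids(points, k=2)
print("medoid points:", [points[m] for m in medoids])
print("labels:", labels)
```

The update step compares every pair of points inside a cluster, which is where the extra computational cost relative to k-means comes from.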

OPTICS tries to find density-based clusters in spatial data sets. In contrast to its predecessors, OPTICS is able to identify meaningful clusters in data sets of varying density. It does this by producing a linear ordering of the points in which spatially close points end up next to each other, together with a distance value for each point that reflects how dense its neighborhood is.
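
The ordering idea can be sketched in a few lines. The following is a naive, quadratic-time illustration and not the MADlib implementation: the function name `optics_order`, the parameters `eps` and `min_pts`, the Euclidean distance, and the convention that a point counts as its own neighbor are assumptions made for this example.

```python
from math import dist, inf  # Euclidean distance, Python 3.8+


def optics_order(points, eps, min_pts):
    """Naive OPTICS sketch. Returns (order, reachability)."""
    n = len(points)
    reach = [inf] * n        # reachability distance per point
    processed = [False] * n
    order = []               # the linear cluster ordering

    def neighbors(i):
        # Assumption: a point is counted as its own neighbor.
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    def update(i, seeds):
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            return  # i is not a core point, it cannot reach anyone
        # Core distance: distance to the min_pts-th closest neighbor.
        core = sorted(dist(points[i], points[j]) for j in nbrs)[min_pts - 1]
        for j in nbrs:
            if not processed[j]:
                new_reach = max(core, dist(points[i], points[j]))
                if new_reach < reach[j]:
                    reach[j] = new_reach
                    seeds.add(j)

    for start in range(n):
        if processed[start]:
            continue
        processed[start] = True
        order.append(start)
        seeds = set()
        update(start, seeds)
        while seeds:
            # Always continue with the most easily reachable point.
            q = min(seeds, key=lambda j: (reach[j], j))
            seeds.remove(q)
            processed[q] = True
            order.append(q)
            update(q, seeds)
    return order, reach


# Two small, well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
order, reach = optics_order(points, eps=2.0, min_pts=3)
for i in order:
    print(points[i], reach[i])
```

In the printed ordering, points inside a dense group have small reachability values, while the first point of each new group has an infinite (undefined) one. Clusters are read off as the runs of low values between those jumps.
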
The project’s progress will be documented and discussed on the MADlib mailing list.

Learning More:

Read more about MADlib or PostgreSQL
Check out more details about the Google Summer of Code
Find out more about parallel processing of MADlib algorithms on Pivotal Greenplum MPP Database or Pivotal’s Hadoop® Distribution, Pivotal HD, with HAWQ
Get more info on the algorithms mentioned in this article: k-means, k-medoids, and OPTICS

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

About the Author

Andreas Scherbaum has been working with PostgreSQL since 1997. He is involved in several PostgreSQL-related community projects, is a member of the Board of Directors of the European PostgreSQL User Group, and wrote a PostgreSQL book (in German). Since 2011 he has been working for EMC/Greenplum/Pivotal, where he tackles very big databases.