Building A Rich Model Factory With MPP

July 27, 2016 Scott Hajek


A model factory is a repository of data science models from which users can select models and apply them to new data. To build transparency into a model factory, it is useful to store a rich collection of metadata along with the models themselves. Valuable metadata can include model parameters, training data, and evaluation results. The evaluation results might be a mixture of numeric metrics as well as reports and visualizations.

Massively parallel processing (MPP) databases can serve as a powerful platform for a model factory. They can store huge amounts of data and process it in a scalable fashion. For modeling, MPP platforms (like Pivotal Greenplum or Pivotal HDB) provide a lot of freedom because a variety of open source analytic and machine learning tools can be run inside the database, such as Apache MADlib (incubating), Python and R (the latter two as procedural language extensions).

To effectively utilize an MPP database as a model factory, it is important to be able to handle a wide variety of data types. Knowing how to generate, store, and process raw, binary data opens several possibilities. First, some tools store and apply models based on a binary format. Second, reports and visualizations generated during the training or evaluation of the models can be saved as metadata in the model factory. Finally, the same concepts for handling binary data inside the database can be applied to image, audio, and video processing. In this post, I will explain how these applications of binary data can be achieved in an MPP database.

Model Storage And Usage

Storing the model in a way that can later be used is crucial. Models trained in Apache MADlib (incubating) get stored as a database table. So, the model repository would just need to refer to the schema and table name.

Procedural extensions for Python (PL/Python) and R (PL/R) make it possible to leverage open source libraries and process data in parallel inside the database. The basic idea is that you use your normal code from the other language, and you just need to wrap it in a SQL user-defined function that specifies the call signature and return type. For more explanation, see this post, which describes using PL languages in the context of text analytics.

Models trained in PL/Python or PL/R, however, are typically written to a file. So, how could they be stored in the database? The answer is to serialize the model and store it in a text or bytea field. For example, consider training a logistic regression classification model with the popular Python library scikit-learn. The following snippet defines a PL/Python function that takes training data with three predictors and a target category. Each field must be aggregated into an array using array_agg(). The classifier is trained and then serialized using Python’s built-in cPickle module and its dumps() function. Using this function, the resulting binary representation of the model can be stored as a field in a table.

-- the function name is illustrative
CREATE OR REPLACE FUNCTION model_train(
	predictor1 numeric[],
	predictor2 numeric[],
	predictor3 numeric[],
	target integer[]
)
RETURNS bytea
LANGUAGE plpythonu
AS $$
   import cPickle
   import numpy
   from sklearn.linear_model import LogisticRegression
   # create matrix of predictors (one row per observation)
   X = numpy.array(
       [predictor1, predictor2, predictor3]
   ).transpose()
   # train model based on predictors and target categories
   logreg = LogisticRegression()
   logreg.fit(X, target)
   # return serialized model
   return cPickle.dumps(logreg)
$$;

A trained model can be retrieved and applied to new data to make predictions. Continuing the example above, a function expecting the serialized model (bytea) and values for the three predictors can load the model, apply it, and return the probability of being in the target category. Deserialization, or loading the scikit-learn model, happens using cPickle.loads().

CREATE OR REPLACE FUNCTION model_predict_prob(
   model_bytes bytea,
   predictor1 numeric,
   predictor2 numeric,
   predictor3 numeric
)
RETURNS double precision
LANGUAGE plpythonu
AS $$
   import cPickle
   # deserialize model
   model = cPickle.loads(model_bytes)
   # predict probabilities for a single row of predictors
   probabilities = model.predict_proba(
       [[predictor1, predictor2, predictor3]]
   )
   # return probability of category index 1
   return probabilities[0, 1]
$$;
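Under the hood, this pair of functions amounts to a pickle round trip: serialize the trained model to bytes, store the bytes, load them back, and predict. A standalone sketch of the same round trip with made-up data, using Python 3's pickle in place of cPickle:

```python
# Illustrative sketch of the serialize/deserialize round trip the
# PL/Python functions rely on. The training data here is made up.
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up training data: 4 observations, 3 predictors
X = np.array([[0.1, 1.0, 5.0],
              [0.2, 0.9, 4.0],
              [2.0, 0.1, 1.0],
              [2.2, 0.2, 0.9]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X, y)
model_bytes = pickle.dumps(model)     # what the training function returns
restored = pickle.loads(model_bytes)  # what the prediction function loads

# the restored model behaves identically to the original
print((restored.predict(X) == model.predict(X)).all())
```

The bytes in model_bytes are exactly what would land in the bytea column of the model repository.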

Model Evaluation As Metadata

Information about the quality of trained models, such as numerical metrics and visualizations, is also important information to capture in the model repository. Metrics can be stored simply as numbers, while visualizations can be stored as binary data. To illustrate this, I will show a model evaluation function that produces both a metric and a visualization.

A test data set would include the same predictor variables as in the training set and would include “the answer” (the target category for classification tasks). We will first apply our prediction function to the test data and then evaluate the predictions. The query below computes the predictions on a test table called ‘test’ and stores the results in the table ‘test_predictions’.

-- the models table name and its model_bytes column are illustrative
CREATE TABLE test_predictions AS
SELECT
   model_predict_prob(
       m.model_bytes,
       t.predictor1,
       t.predictor2,
       t.predictor3
   ) as prob,
   t.target
FROM test t
CROSS JOIN models m;

The example code below evaluates the model’s predictions based on the area under the ROC curve (AUC), a common metric for evaluating classification models. In addition, a plot of the ROC curve is generated in PNG format and returned as well. A key tip for dealing with binary data comes into play here—how to generate an image in a particular file format and return its raw bytes. Returning the raw bytes of a plot in Python requires a step of indirection since the plotting library expects to deal with file-like objects. The trick is to use the io.BytesIO class, which can be used like a file without actually touching the file system. This way, the raw bytes can be returned.

CREATE TYPE model_test_result
AS (
	auc double precision,
	image_bytes bytea
);

CREATE OR REPLACE FUNCTION evaluate_predictions(
	probabilities double precision[],
	targets integer[]
)
RETURNS model_test_result
LANGUAGE plpythonu
AS $$
   import io
   from sklearn.metrics import roc_curve, auc
   import matplotlib
   matplotlib.use('Agg')  # headless backend for server-side plotting
   import matplotlib.pyplot as plt
   # compute false- and true-positive rates at various thresholds
   fpr, tpr, thresholds = roc_curve(targets, probabilities)
   # compute Area Under the Curve (a model evaluation metric)
   roc_auc = auc(fpr, tpr)
   ## plotting
   fig = plt.figure()
   # plot false- and true-positive rates (ROC curve)
   plt.plot(fpr, tpr, label='ROC (area = {:0.2f})'.format(roc_auc))
   # plot random chance
   plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Chance')
   # plot aesthetics
   plt.xlim([-0.05, 1.05])
   plt.ylim([-0.05, 1.05])
   plt.xlabel('False Positive Rate')
   plt.ylabel('True Positive Rate')
   plt.title('Receiver operating characteristic')
   plt.legend(loc="lower right")
   # save figure as bytestream
   buff = io.BytesIO()
   fig.savefig(buff, format='png')
   # return fields matching the composite type
   return dict(
       auc=roc_auc,
       image_bytes=buff.getvalue()
   )
$$;
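The io.BytesIO trick can also be tried outside the database. A minimal standalone sketch, assuming the headless Agg backend so no display is needed:

```python
import io
import matplotlib
matplotlib.use('Agg')  # headless backend, as on a database server
import matplotlib.pyplot as plt

fig = plt.figure()
plt.plot([0, 1], [0, 1], '--', label='Chance')
plt.legend(loc='lower right')

# write the PNG into an in-memory buffer instead of the file system
buff = io.BytesIO()
fig.savefig(buff, format='png')
png_bytes = buff.getvalue()
plt.close(fig)

# every PNG file starts with the 8-byte signature \x89PNG\r\n\x1a\n
print(png_bytes[:8])
```

The resulting png_bytes value is exactly what gets stored in the bytea field of the composite result type.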

Consuming Images From The Database

The plots, which are produced and stored in the database, serve as valuable model repository metadata. It is important to be able to query and view models of interest. This can be done in a variety of ways, but, for simplicity, I will show how you can use IPython in a Jupyter notebook to connect to the database, query, and view the plots.

First, establish a connection using the psycopg2 package:

[Image: establishing a database connection with psycopg2]

Next, query the table where the raw plot bytes were stored. The pandas read_sql function executes the query and returns the results as a DataFrame. IPython has display functions that can take raw image data, as long as you pass the keyword argument raw=True. The other trick is that psycopg2 connections return bytea data as a buffer, so we force it to bytes using bytes(). The image below shows the query with the display_png function on a PNG plot retrieved from the database:

[Image: querying the plot bytes with read_sql and rendering them with display_png]
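The buffer-to-bytes conversion can be illustrated in isolation. In this sketch a memoryview stands in for the value psycopg2 returns for a bytea column (psycopg2 uses buffer under Python 2 and memoryview under Python 3):

```python
# psycopg2 returns bytea values as a buffer/memoryview rather than a
# plain byte string; display functions want real bytes, so convert.
raw = memoryview(b'\x89PNG\r\n\x1a\n')  # stand-in for a bytea query result
png_bytes = bytes(raw)
print(type(png_bytes).__name__)
```

The converted bytes can then be passed directly to display_png(png_bytes, raw=True).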

To store the retrieved plot as a local file, you can simply open a file and write to it. The resulting file can be opened using your normal operating system tools.

[Image: writing the retrieved PNG bytes to a local file]
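That step can be sketched in plain Python; here img_bytes stands in for the bytea value retrieved from the database, and the output path is illustrative:

```python
import os
import tempfile

img_bytes = b'\x89PNG\r\n\x1a\n'  # stand-in for the retrieved plot bytes

# write the raw bytes to a .png file in binary mode
out_path = os.path.join(tempfile.gettempdir(), 'roc_curve.png')
with open(out_path, 'wb') as f:
    f.write(img_bytes)

print(os.path.getsize(out_path))
```

Once written, the file opens in any ordinary image viewer.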

Broader Applications To Images, Audio, And Video

While the focus here has been on producing and storing metadata for a model factory, the approaches presented for handling binary data can also be used to apply powerful machine learning techniques to audio, images, and video at scale. Image processing steps such as smoothing, thresholding, and grouping of pixels can be performed inside the database as a precursor to subsequent analysis. Edge detection, an important building block for higher-level computer vision tasks like object recognition, can also be accomplished in Pivotal Greenplum or Pivotal HDB. Again, popular open source libraries, such as OpenCV, can be wrapped in PL/Python or PL/C. Tools like OpenCV operate on raw image bytes, so handling binary data is key. Likewise, the output of an algorithm like object recognition could be stored as binary images with the detected objects highlighted, then queried from the database and viewed in the same way as described for plots.

Information extraction is another opportunity for deriving value from binary data. Text and metadata can be extracted from files stored in a variety of formats, such as PDF, Word, Excel, and HTML, using Apache Tika. Then, heuristics and natural language processing can extract specific information from the text.

The bottom line is this—learning how to handle binary data inside the database creates a lot of opportunities for predictive model management as well as machine learning at scale.

Learning More

  • Pivotal Data Scientists frequently share lots of how-tos. Check out the rest of their latest additions here.
  • Pivotal Data Scientists are available for hire! To enquire about how they can help you, contact us.

