Data Science How To: Massively Parallel, In-Database Image Processing: Part 1

July 30, 2014 Ailey Crow

featured-cell-visionJoint work performed by Ailey Crow, Sarah Aerni, Hulya Farinas, Regunathan Radhakrishnan and Gautam Muralidhar of Pivotal’s Data Science Labs.

Automated image processing can increase efficiency for a diverse range of applications from defect detection in manufacturing, to motion detection in video games or surveillance, to tumor detection in medical images. Advances in computer vision have made a significant impact to a variety of fields including neurobiology, artificial intelligence, pathology, security, automation, and entertainment.

As imaging systems become more efficient and produce higher resolution images, image processing workflows need to adapt to handle larger image data both in image size and number of images. The Pivotal Data Science team has recently turned their attention to image processing in the context of big image data, resulting in this series of blog posts on this topic.

Traditional approaches to image processing often start with loading images, as multidimensional arrays of intensity values, into an application’s memory for performing operations such as smoothing (filtering), segmentation, object detection, and classification. Here, we propose an alternative approach of processing images in-database in a distributed fashion, which offers 2 main advantages. First, in-memory storage is not required so images with high spatial resolution (i.e. medical images or video) can be processed regardless of the memory capacity of the underlying system. Second, thanks to advances in massively parallel, shared nothing database technology, in-database image processing can be faster than traditional image processing in some contexts—especially when processing a large number of images distributed across the database.

The challenge lies in representing inherently multidimensional arrays of pixels and their relationships to one another in database.

How To Approach Massively Parallel Image Processing

In the remainder of this blog post, we will present an in-database solution for a basic image processing problem: counting cells in a digital pathology image, such as the one shown below. Specifically, we will be counting the nuclei (round purple objects in this image of a breast cancer tissue sample) in 6 steps: image loading, filtering (smoothing), thresholding, cleanup via morphological operations, object recognition using connected components, and counting.

If you want a better understanding of the technology stack we often use, this article provides more information.


Beck et al. Sci Transl Med 2011.

Image Representation in Database

Images can be represented in a database as a table where each row is a pixel and columns contain position information (i.e. x,y,z or row,col) and intensity information (i.e. grayscale or RGB values).

img_name | row | col | r_intensity | g_intensity | b_intensity


test.jpg | 331 | 188 | 250 | 249 | 255
test.jpg | 332 | 188 | 248 | 250 | 255
test.jpg | 331 | 189 | 249 | 249 | 255

We can then transform an image into tabular format by iterating through each pixel in each image. For example, leveraging the numpy toolbox in Python for a 2D image ‘im’, we can write a csv file following the column structure of the above table, as follows:

imName = ‘test.jpg’
OutPath = ‘/ImTable.csv’
for r in range(0,np.size(im,0)):
  for c in range(0,np.size(im,1)):
   S = S + imName + ',' + str(r) + ',' + str(c) + ',' +
      im[r,c,0].astype('str') + ',' +
      im[r,c,1].astype('str') + ',' +
      im[r,c,2].astype('str') + 'n'

There are a number of methods available to ingest images into a database that can be implemented at scale. For example, images could be loaded into the database in bytea format and then transformed to tabular format by running the above script in parallel as a user-defined PLpython function. Alternatively, the above Python script could be setup to run in parallel externally followed by parallel gpfdist processes pointing to the same table.

Smoothing (Filtering) Using Procedural Languages

Once the images are loaded in the database, we can begin distributed image processing. A common first step in image processing is to smooth the image to remove noise. Let’s apply the most basic filter: a uniform box filter (or local averaging filter), which simply replaces each pixel’s intensity with the mean intensity of all neighboring pixels.

There are many image processing toolboxes available to facilitate smoothing. We can leverage these in-database with a user-defined function. For example, we can use Python’s numpy and scipy ndimage packages in a PLpython user-defined function and reshape the image into a multidimensional array as follows:

create or replace function im_smooth(maxr integer, maxc integer,im_array integer[])
   returns integer[]
as $$
  import numpy as np
  import scipy.ndimage as ndi
  smooth = np.reshape(im_array, (maxr+1, maxc+1))
  smooth = ndi.uniform_filter(smooth, size=3)
  return np.reshape(smooth.astype(int), (maxr+1)*(maxc+1))
$$ language plpythonu;

Then, to smooth the blue channel:

SELECT img_name, im_smooth(max(row), max(col),
  array_agg(b_intensity order by row, col))
FROM (SELECT * from testim order by row, col) t;

Note, array_agg on line 2 is used to input all blue intensity values into the function in a pre-specified order that is assumed by the function. The smoothed intensity values will be returned as an array in the same order and would require additional processing to be returned to the tabular format.

This approach leverages familiar image processing tools, but requires holding the images in memory. If the image size exceeds the memory capacity of the underlying system, often true of satellite or medical images, an alternative approach is required.

We could downsample the image to make it fit in memory or break the image into tiles for block processing as explained in “Challenges of Medical Image Processing” or at Mathworks. Or, we can avoid memory limitations altogether and process the entire image at full resolution by processing in-database.

Smoothing (Filtering) with Window Functions

For smoothing an image in-database, we will access the neighboring pixel values using a convenient tool described previously in this series about time series and window functions by Caleb Welton.

Uniform smoothing of 4-connected pixels (ignore diagonally-connected pixels) for the blue channel can be achieved as follows:

SELECT img_name, row, col, madlib.array_mean(b_intensity) b_intensity
  SELECT img_name, row, col, array[b_intensity,
   lag(b_intensity) OVER( wdw_col ),
   lead(b_intensity) OVER( wdw_col ),
   lag(b_intensity) OVER( wdw_row ),
   lead(b_intensity) OVER( wdw_row )
   ] b_intensity
 FROM testim
 WINDOW wdw_col AS (partition by col order by row),
  wdw_row AS (partition by row order by col) )t;

This SQL query collects an array of the blue intensity values of a given pixel and its 4 immediate neighbors using window functions (lines 3-8). The smoothed value of the pixel is then computed as the mean of the array (line 1) using the MADlib in-database machine learning and analytics library.

If we want to include the diagonal pixels as well, we need to define a new coordinate system over which to run our window functions. The new coordinate system is simply rotated 45 degrees from the normal coordinate system and translated from the row, col system as follows:

Diag1 = row – col
Diag2 = row + col

This is demonstrated for the 4×4 pixel array below:


Now, let’s apply the new coordinate system for uniform smoothing including diagonals (8-connected pixels).

Uniform smoothing of 8-connected pixels (include diagonally-connected pixels) can be achieved as follows:

CREATE TEMP TABLE testim_smooth AS
SELECT img_name, row, col, madlib.array_mean(b_intensity) b_intensity
  SELECT img_name, row, col, array[b_intensity,
   lag(b_intensity) OVER( wdw_col ),
   lead(b_intensity) OVER( wdw_col ),
   lag(b_intensity) OVER( wdw_row ),
   lead(b_intensity) OVER( wdw_row ),
   lag(b_intensity) OVER( wdw_diag1 ),
   lead(b_intensity) OVER( wdw_diag1 ),
   lag(b_intensity) OVER(wdw_diag2 ),
   lead(b_intensity) OVER( wdw_diag2 )
  ] b_intensity
 FROM testim
 WINDOW wdw_col AS (partition by col order by row),
  	wdw_row AS (partition by row order by col),
    wdw_diag1 AS (partition by (row-col) order by col)
 	wdw_diag2 AS (partition by (row+col) order by col) )t;


This approach can be extended to more complex filters by computing the inner product of the neighborhood array and a vectorized kernel, such as a 2D Gaussian. In the next post, we will explore the subsequent steps required to count the number of objects (nuclei) in an image with in-database processing.

To Learn More:

  • Look for Part II and other data science articles on our Twitter, LinkedIn, Facebook, Google Plus, and Youtube where you can learn about related topics like neurobiology, statistics, optimization, machine learning, cognitive vision, artificial intelligence, control robotics, signal processing, optics, security, and smart cameras.

About the Author


The 'as a Service' is Silent
The 'as a Service' is Silent

At what point is 'infrastructure as a service', just 'infrastructure'? And a platform is just how one deplo...

Strengthening Apache Hadoop in the Enterprise with Apache Ambari
Strengthening Apache Hadoop in the Enterprise with Apache Ambari

Today we are excited to announce that Pivotal and Hortonworks will collaborate on the Apache® Ambari projec...