Model Selection for Deep Neural Networks on Greenplum Database

April 14, 2020 Frank McQuillan

Joint work between VMware and Yuhao Zhang and Dr. Arun Kumar of the Department of Computer Science and Engineering at the University of California, San Diego.

Artificial neural networks can be used to create highly accurate models in domains such as language processing and image recognition. However, training deep neural networks is extremely resource-intensive, because hundreds of trials may be needed to find a good model architecture and its associated hyperparameters. For example, trying five network architectures with five learning rates and five regularizer values yields 5 x 5 x 5 = 125 combinations. This is the challenge of model selection: it is time-consuming and expensive, especially when models are trained one at a time. Massively parallel processing databases like Greenplum can have dozens of segments (workers), so we exploit this parallel compute architecture to train many model configurations simultaneously.
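To make the arithmetic concrete, here is a minimal sketch of how a grid search enumerates its trials. The architecture names and hyperparameter values are hypothetical placeholders, not the ones used in this work.

```python
from itertools import product

# Hypothetical search space: names and values are illustrative only.
architectures = ["arch_1", "arch_2", "arch_3", "arch_4", "arch_5"]
learning_rates = [0.1, 0.03, 0.01, 0.003, 0.001]
regularizers = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]


def build_grid(archs, lrs, regs):
    """Return every (architecture, learning rate, regularizer) combination."""
    return [
        {"arch": arch, "learning_rate": lr, "regularizer": reg}
        for arch, lr, reg in product(archs, lrs, regs)
    ]


grid = build_grid(architectures, learning_rates, regularizers)
print(len(grid))  # 5 * 5 * 5 = 125
```

Each added hyperparameter multiplies the number of trials, which is why training one configuration at a time quickly becomes impractical.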

We use the Apache MADlib open-source project, which supports deep learning with Keras and TensorFlow on Greenplum Database [1, 2, 3].

Model hopper parallelism

To train many models simultaneously, we implement a novel approach from recent research called model hopper parallelism (MOP) [4, 5]. MOP combines task and data parallelism by exploiting the fact that stochastic gradient descent (SGD), a widely used optimization method, is robust to data visit order.

The method works as follows. Suppose, as shown in Figure 1, we have a set S of model configurations that we want to train for k iterations. The dataset is shuffled once and distributed to p segments (workers). We pick p model configurations from S and assign one configuration per segment, and each configuration is trained for a single sub-epoch on that segment's portion of the data. When a segment completes a sub-epoch, it is assigned a new configuration that has not yet been trained on that segment. Thus, a model "hops" from one segment to another until it has visited all workers, which completes one iteration of SGD (one full pass over the training data) for that model. Other models from S are rotated in, in round-robin fashion, until every configuration in S has been trained for k iterations.

Figure 1: An example of model hopper parallelism with three segments
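The hopping pattern can be sketched as a simple rotation. This is a simplified illustration, not the MADlib scheduler: it assumes configurations are handled in fixed groups of at most p, and that every sub-epoch takes the same time. The function name mop_schedule is hypothetical.

```python
def mop_schedule(num_configs, num_segments, num_iterations):
    """Build a simplified model-hopper schedule.

    Returns a list of (iteration, assignment) steps, where assignment maps
    segment id -> config id for one sub-epoch. Within a group, the config in
    slot j trains on segment (j + hop) % num_segments, so after num_segments
    hops it has visited every segment exactly once.
    """
    schedule = []
    config_ids = list(range(num_configs))
    for iteration in range(num_iterations):
        # Rotate groups of at most num_segments configs through the cluster.
        for start in range(0, num_configs, num_segments):
            group = config_ids[start:start + num_segments]
            for hop in range(num_segments):
                assignment = {
                    (slot + hop) % num_segments: config
                    for slot, config in enumerate(group)
                }
                schedule.append((iteration, assignment))
    return schedule


# Three configurations on three segments, one iteration (as in Figure 1).
for iteration, assignment in mop_schedule(3, 3, 1):
    print(iteration, dict(sorted(assignment.items())))
```

Only the configuration ids move in this sketch. The data partition on each segment never changes, which mirrors the fact that only model state travels between segments.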

Note that the training dataset is locked in place on the segments throughout the training period; only the model state moves between segments. The advantage of moving the model state rather than the training dataset is efficiency, since in practical deep learning problems the former is far smaller than the latter.

Data distribution rules are adjustable. For example, to control costs, GPUs may not be attached to all segment hosts. In this case, the training dataset is distributed only to segments with GPUs attached, and the models hop only to those segments.

The APIs

Below are the two SQL queries needed to run MOP on Greenplum using Apache MADlib. The first loads model configurations to be trained into a model selection table. The second calls the MOP function to fit each of the models in the model selection table.

Figure 2: APIs for model selection on Greenplum using Apache MADlib

Putting it to the test

We use the well-known CIFAR-10 dataset with 50,000 training examples of 32x32 color images. Grid search generates a set of 96 training configurations, consisting of three model architectures, eight combinations of optimizers and parameters, and four batch sizes.


Here we see the accuracy and cross-entropy loss for each of the 96 combinations after training for 10 epochs. Training takes 2 hours and 47 minutes on the test cluster.* The models display a wide range of accuracy, with some performing well and some not converging at all.

Figure 3: Initial training of 96 model configurations

The next step a data scientist might take is to review the initial run and select a subset of models that have reasonable accuracy and a small gap between training and validation performance, meaning they do not overfit. We select 12 of these good models and train them for 50 iterations, which takes 1 hour and 53 minutes on the test cluster.

Figure 4: Final training for more iterations with the 12 best model configurations
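The selection step between the two runs can be sketched as a filter followed by a ranking. The thresholds, field names, and accuracy values below are made up for illustration. They are not results from this experiment, and the actual selection was a judgment call rather than a fixed rule.

```python
def select_candidates(results, min_val_accuracy, max_gap, top_n):
    """Pick promising configurations from an initial run.

    Keeps runs whose validation accuracy is at least min_val_accuracy and
    whose training accuracy exceeds validation accuracy by at most max_gap
    (a rough overfitting check), then returns the top_n by validation accuracy.
    """
    kept = [
        r for r in results
        if r["val_accuracy"] >= min_val_accuracy
        and r["train_accuracy"] - r["val_accuracy"] <= max_gap
    ]
    return sorted(kept, key=lambda r: r["val_accuracy"], reverse=True)[:top_n]


# Toy results, invented for this example.
results = [
    {"config_id": 1, "train_accuracy": 0.95, "val_accuracy": 0.70},  # overfits
    {"config_id": 2, "train_accuracy": 0.78, "val_accuracy": 0.75},
    {"config_id": 3, "train_accuracy": 0.10, "val_accuracy": 0.10},  # did not converge
    {"config_id": 4, "train_accuracy": 0.74, "val_accuracy": 0.72},
]

best = select_candidates(results, min_val_accuracy=0.6, max_gap=0.05, top_n=2)
print([r["config_id"] for r in best])
```

The surviving configurations would then be trained for more iterations in a second run.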

From the second run, we select the best model, again ensuring that it does not overfit the training data. The chosen model achieves validation accuracy of roughly 81 percent using the SGD optimizer with learning rate=0.001, momentum=0.95, and batch size=256.

Once we have this model, we can then use it for inference to classify new images.

Figure 5: Inference showing the top three probabilities
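Reporting the top three probabilities amounts to ranking the model's per-class output. A minimal sketch follows, using the ten standard CIFAR-10 class names. The probability vector is made up for illustration, and the helper name top_k is hypothetical.

```python
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]


def top_k(probabilities, class_names, k=3):
    """Return the k (class name, probability) pairs with the highest probability."""
    ranked = sorted(
        zip(class_names, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Invented softmax output for a single image, for illustration only.
probs = [0.01, 0.02, 0.05, 0.60, 0.03, 0.20, 0.04, 0.02, 0.02, 0.01]

for name, p in top_k(probs, CIFAR10_CLASSES):
    print(f"{name}: {p:.2f}")
```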

The efficiency of the Greenplum database + GPU acceleration

The distributed compute capability of the Greenplum database with GPU acceleration can be used to train many deep learning models in parallel. This can help data scientists quickly determine the best model architecture and associated hyperparameters for their projects, so that the model can be deployed into production as soon as possible in order to improve the products and services that they offer to their customers.

Learn more

Ready to take the next step? The following are ways you can learn more about Apache MADlib and Greenplum:


Test infrastructure

  • Google Cloud Platform
  • 4 hosts each with 32 vCPUs & 150 GB memory & 4 NVIDIA Tesla P100 GPUs (16 total)
  • Greenplum 5 with 4 segments per host (16 total)
  • Apache MADlib 1.17
  • Keras 2.2.4, TensorFlow 1.13.1

References

[1] GPU-Accelerated Deep Learning on Greenplum Database, https://tanzu.vmware.com/content/engineers/gpu-accelerated-deep-learning-on-greenplum-database

[2] Transfer Learning for Deep Neural Networks on Greenplum Database, https://tanzu.vmware.com/content/practitioners/transfer-learning-for-deep-neural-networks-on-greenplum-database

[3] Connecting GPUs to Greenplum Database, https://tanzu.vmware.com/content/practitioners/connecting-gpus-to-greenplum-database

[4] Cerebro: A Data System for Optimized Deep Learning Model Selection, https://adalabucsd.github.io/papers/TR_2020_Cerebro.pdf

[5] Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems, DEEM’19, June 30, 2019, Amsterdam, Netherlands, https://adalabucsd.github.io/papers/2019_Cerebro_DEEM.pdf

[6] Jason Brownlee. 2019. How to Develop a CNN From Scratch for CIFAR10 Photo Classification, https://machinelearningmastery.com/how-to-develop-a-cnn-from-scratch-for-cifar-10-photo-classification/

[7] Keras Documentation. 2019. Train a simple deep CNN on the CIFAR10 small images dataset, https://keras.io/examples/cifar10_cnn/
