This year may be the year that automated machine learning (AutoML) enters the data science vernacular. KDnuggets recently wrote a comprehensive review of the state of AutoML in 2017, AirBnB described how AutoML has accelerated their data scientists’ productivity, and the International Conference on Machine Learning (ICML) hosted another workshop on AutoML in August.
In this post, I share an AutoML setup to train and deploy pipelines in the cloud using Python, Flask, and two AutoML frameworks that automate feature engineering and model building. To jump straight to the code, check out the GitHub repository.
What exactly is AutoML? AutoML is a broad term and technically could encompass the entire data science cycle from data exploration to model building. However, I have found it most commonly refers to automating feature preprocessing and selection, model algorithm selection, and hyperparameter tuning–steps located at the end of the data science process.
Earlier steps in the process, namely data exploration and cleansing and feature engineering, are significantly harder to automate as they require both domain expertise and human judgement. The good news is AutoML automates the tedious steps of model selection and development, allowing data scientists to focus on the interesting stuff.
Below, I share my experiences deploying an AutoML service in more detail. Here are my main takeaways:
- AutoML offers tangible benefits for model selection and optimization.
- It's very easy to get started. Many AutoML frameworks are compatible with scikit-learn or other popular data science tools.
- AutoML explores an algorithm and parameter space that is much broader than a typical data scientist workflow. Are you stuck in a Random Forest or Gradient Boosted Trees rut? AutoML can help you explore new, effective approaches.
- Optimal results require longer training times (hours to days) and multiple runs using different parameters.
- Using currently available open source tools, a fully automated time series classification pipeline is possible.
Automated Feature Engineering and Model Building
I tested and combined two open source Python tools: tsfresh, an automated feature engineering tool, and TPOT, an automated feature preprocessing and model optimization tool. The result is an automated time series classification pipeline that can be used to build a model on multi-dimensional, arbitrarily shaped data.
An automated pipeline for time-series classification.
While automating feature engineering is a hard problem in general, there have been recent advances for time-series data. Data scientists routinely create features ranging from time-domain summary statistics to more advanced frequency-domain features using Fourier analysis. The tsfresh library can perform these calculations for you, along with more advanced features.
Once the features are built, TPOT constructs a feature preprocessing and modelling pipeline using genetic programming under the hood. A stochastic search of this kind has two advantages over a grid search: it can step away from unfruitful pipelines, and it can return fresh pipelines unknown to the data scientist. Essentially, automating modelling pipeline development can improve the final model while also revealing new approaches to the data scientist.
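To make the contrast with grid search concrete, here is a deliberately simplified evolutionary search over a toy configuration space. It is not TPOT's implementation (TPOT evolves tree-structured pipelines with genetic programming); the search space, the toy_score function, and every name below are hypothetical stand-ins, with toy_score playing the role of a cross-validated pipeline score.

```python
import random

# Hypothetical search space: each "pipeline" is one choice per slot.
SEARCH_SPACE = {
    "scaler": ["none", "standard", "minmax"],
    "model": ["logreg", "random_forest", "extra_trees", "linear_svc"],
    "C": [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0],
}


def toy_score(config):
    """Stand-in for a cross-validated score; a real run would fit a model here."""
    score = 0.5
    score += 0.2 if config["scaler"] == "standard" else 0.0
    score += 0.2 if config["model"] == "logreg" else 0.05
    score += 0.1 if config["C"] == 0.1 else 0.0
    return score


def random_config(rng):
    return {key: rng.choice(values) for key, values in SEARCH_SPACE.items()}


def mutate(config, rng):
    """Copy a configuration and resample one randomly chosen slot."""
    child = dict(config)
    key = rng.choice(sorted(SEARCH_SPACE))
    child[key] = rng.choice(SEARCH_SPACE[key])
    return child


def evolve(generations=10, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Selection: keep the better half, discarding unfruitful candidates.
        population.sort(key=toy_score, reverse=True)
        survivors = population[: population_size // 2]
        # Variation: refill the population with mutated copies of survivors.
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
        history.append(toy_score(max(population, key=toy_score)))
    best = max(population, key=toy_score)
    return best, history


best_config, best_per_generation = evolve()
print(best_config, best_per_generation[-1])
```

Because the best candidates always survive into the next generation, the best score never decreases, while mutation keeps proposing configurations that a fixed grid might not contain.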
TPOT combats overfitting by default, using k-fold cross validation for each pipeline evaluation. A nested cross validation approach would yield a less biased estimate of pipeline performance, but is very computationally expensive.
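As a rough illustration of what k-fold cross validation does for each candidate pipeline, the sketch below splits sample indices into k folds and averages a score over the held-out folds. It is plain Python rather than TPOT's internals; fit_and_score is a hypothetical callback, and the majority-class scorer and labels are made up for the example.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_val_score(fit_and_score, n_samples, k=5):
    """Average a score over k held-out folds."""
    return mean(fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k))


# Hypothetical data and scorer: a majority-class "model" scored on the held-out fold.
labels = [0, 1, 1, 1, 0, 1, 1, 0, 1, 1]


def majority_class_accuracy(train, test):
    train_labels = [labels[i] for i in train]
    majority = max(set(train_labels), key=train_labels.count)
    return mean(1.0 if labels[i] == majority else 0.0 for i in test)


cv_accuracy = cross_val_score(majority_class_accuracy, len(labels), k=5)
print(round(cv_accuracy, 3))
```

A nested scheme would wrap this whole procedure in a second, outer loop of folds, which is why it multiplies the training cost.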
These libraries can be incorporated with a few lines of code. TPOT, for example, is trained by calling a fit method, à la scikit-learn. This is only one approach; numerous other AutoML libraries are available.
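A minimal sketch of how the two libraries might be chained is shown below. It assumes univariate series keyed by example id; the helper names, the column names, and the TPOT settings are illustrative choices rather than the configuration used for this post, and TPOT argument names can differ between releases.

```python
def to_long_format(series_by_id):
    """Flatten {example_id: [v0, v1, ...]} into rows of id, time, value."""
    return [
        {"id": example_id, "time": t, "value": value}
        for example_id, values in series_by_id.items()
        for t, value in enumerate(values)
    ]


def build_pipeline(series_by_id, labels_by_id, generations=5, population_size=20):
    """Extract tsfresh features, then let TPOT search for a modeling pipeline."""
    # Imported lazily: all three are third-party packages and slow to import.
    import pandas as pd
    from tpot import TPOTClassifier
    from tsfresh import extract_features
    from tsfresh.utilities.dataframe_functions import impute

    long_df = pd.DataFrame(to_long_format(series_by_id))
    features = extract_features(long_df, column_id="id", column_sort="time")
    features = impute(features)  # replace NaN and infinite feature values

    # Align labels with the feature rows, which are indexed by example id.
    y = pd.Series(labels_by_id).reindex(features.index)

    tpot = TPOTClassifier(
        generations=generations,
        population_size=population_size,
        cv=5,
        random_state=42,
    )
    tpot.fit(features.values, y.values)
    return tpot.fitted_pipeline_


# The reshaping step can be checked without either library installed.
toy_series = {"a": [1.0, 2.0, 3.0], "b": [3.0, 1.0, 2.0]}
rows = to_long_format(toy_series)
print(rows[0])
```

A real run needs many more examples than this toy input and, as noted later in this post, a far longer search budget.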
Deploying an AutoML Cloud Service
I wanted to scale out this approach, deploy it in the cloud, and expose it through an API using Flask. There are several motivations for this. Primarily, deploying a machine learning pipeline as a service is a great way to move a data science model into production. And by incorporating a flexible design, one can analyze many models in parallel, which is often a requirement for deploying models in the wild.
This pipeline can be deployed on any cloud environment supporting Python. At Pivotal, we often use Cloud Foundry, which comes with production-grade benefits including automatic environment setup, high scalability, load balancing, and built-in security.
Demo of the AutoML service.
The application exposes both model training and model predictions through a RESTful API. For model training, input data and labels are sent via POST request, a pipeline is trained, and model predictions are accessible via a prediction route. Additionally, each pipeline is stored under a unique key, so live predictions can be made on the same data using different feature construction and modelling pipelines.
The model training is initiated via an HTTP POST request. Through the included parameter file, various automated feature engineering and model building pipelines can be evaluated and assigned to different pipeline IDs.
The model training logic is exposed as a REST endpoint. Raw, labeled training data is uploaded via a POST request and an optimal modeling pipeline is developed.
A parameter file contains options for feature engineering and model building:
Since the feature engineering process is automated, the feature calculations are not dataset specific, and thus pipelines could be trained on different datasets with different time windows. The extract_features parameters are specific to the tsfresh extract_features function. Check out the documentation for more information.
A POST request is made including the raw data, labels, and parameters:
# example using Python requests
import requests

train_files = {'raw_data': open('data/data_train.json', 'rb'),
               'labels': open('data/label_train.json', 'rb'),
               'params': open('pipeline_parameters.yml', 'rb')}
r_train = requests.post('.../train_model', files=train_files)
Note: In general, the longer the classifier is run for, the larger the model space searched, and the better the model results are. The TPOT authors suggest that models should be run on the timescale of hours to days, rather than minutes.
After the feature engineering routine is applied and the modeling pipeline is selected, a fitted pipeline is returned:
{'trainTime': 12.51,
 'trainShape': [1647, 8],
 'mean_cv_accuracy': .9634,
 'mean_cv_roc_auc': .8985,
 'modelType': Pipeline(memory=None,
     steps=[('stackingestimator', StackingEstimator(estimator=LinearSVC(C=0.0001,...))),
            ('logisticregression', LogisticRegression(...))]),
 'modelId': 2}
In this case, TPOT built a stacking pipeline: a LinearSVC stacking estimator first builds synthetic features, which are then fed into a Logistic Regression model. Many other approaches were considered during the search.
During the search process, TPOT evaluates many different algorithms, hyperparameters, and preprocessing routines. This plot corresponds to model performance (AUC) for each model type evaluated.
After an optimal feature engineering and model building pipeline is determined, it is persisted in a Python dictionary inside our Flask application, keyed by the pipeline id specified in the parameter file.
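The sketch below shows one way such a service could be laid out in Flask, with trained pipelines kept in a dictionary keyed by pipeline id. It is not the application from the repository: the route names mirror the requests in this post, but the pipeline_id parameter, the response fields, and the MajorityClassPipeline stand-in (used in place of the tsfresh and TPOT steps) are assumptions, and parameters are read as JSON rather than YAML to keep the example short.

```python
import io
import json
from collections import Counter

from flask import Flask, jsonify, request

app = Flask(__name__)
PIPELINES = {}  # pipeline id -> trained pipeline, held in process memory


class MajorityClassPipeline:
    """Hypothetical stand-in for the feature engineering and TPOT pipeline."""

    def fit(self, labels):
        self.majority_ = Counter(labels.values()).most_common(1)[0][0]
        return self

    def predict(self, raw_data):
        return {example_id: self.majority_ for example_id in raw_data}


@app.route("/train_model", methods=["POST"])
def train_model():
    raw_data = json.load(request.files["raw_data"])
    labels = json.load(request.files["labels"])
    params = json.load(request.files["params"])
    pipeline_id = str(params["pipeline_id"])
    # The stand-in only looks at the labels; a real pipeline would use raw_data.
    PIPELINES[pipeline_id] = MajorityClassPipeline().fit(labels)
    return jsonify({"modelId": pipeline_id, "n_examples": len(raw_data)})


@app.route("/serve_prediction", methods=["POST"])
def serve_prediction():
    raw_data = json.load(request.files["raw_data"])
    params = json.load(request.files["params"])
    pipeline = PIPELINES.get(str(params["pipeline_id"]))
    if pipeline is None:
        return jsonify({"error": "unknown pipeline id"}), 404
    return jsonify(pipeline.predict(raw_data))


def as_upload(payload, filename):
    """Wrap a JSON-serializable object as an in-memory file upload."""
    return (io.BytesIO(json.dumps(payload).encode()), filename)


# Exercise both routes with Flask's built-in test client (no server needed).
client = app.test_client()
train_response = client.post(
    "/train_model",
    data={
        "raw_data": as_upload({"1003": [0.1, 0.4], "1005": [0.3, 0.2]}, "data.json"),
        "labels": as_upload({"1003": 1, "1005": 1}, "labels.json"),
        "params": as_upload({"pipeline_id": 1}, "params.json"),
    },
    content_type="multipart/form-data",
)
prediction_response = client.post(
    "/serve_prediction",
    data={
        "raw_data": as_upload({"1006": [0.2, 0.5]}, "data.json"),
        "params": as_upload({"pipeline_id": 1}, "params.json"),
    },
    content_type="multipart/form-data",
)
print(train_response.get_json(), prediction_response.get_json())
```

Keeping the dictionary in process memory is the simplest option, but it ties every trained pipeline to a single running instance of the app.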
After training, our trained modeling pipeline serves real-time predictions on incoming raw data.
Predictions from the trained pipeline are served via a REST API.
Again, raw data and parameters are sent via a POST request:
test_files = {'raw_data': open('data/data_test.json', 'rb'),
              'params': open('test_parameters.yml', 'rb')}
r = requests.post('.../serve_prediction', files=test_files)
A JSON object is returned containing each example id and a prediction:
# id, score
{'1003': 0.942, '1005': 0.981, '1006': 0.052, '1015': 0.572, … }
We can return multiple predictions on our data, one from each modeling pipeline, by specifying the pipeline id in the POST request.
Finally, we can view a list of the trained pipelines using the models route:
r = requests.post('.../models')
{pipeline_1: {'mean_cv_roc_auc': 89.981, 'modelType':"RandomForestClassifier...", ...},
pipeline_2: {'mean_cv_roc_auc': 91.234, 'modelType':"LogisticRegression...", ...},
pipeline_3: {'mean_cv_roc_auc': 90.525, 'modelType':"ExtraTreesClassifier…"}, ...,
...}
In practice, instead of persisting our trained pipelines in a Python dictionary, we might persist our pipelines in a data cache such as Redis or Pivotal Cloud Cache. This would allow other applications and multiple instances of the app to access the trained pipeline.
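One way to sketch that change is a small store that pickles each pipeline into a key-value cache. The class and method names below are hypothetical; the store only assumes a client object with get(key) and set(key, value) methods, which is the shape the redis-py client offers, and a dict-backed stand-in is used so the example runs without a cache server.

```python
import pickle


class CachedPipelineStore:
    """Store pickled pipelines in any client exposing get(key) and set(key, value)."""

    def __init__(self, client, prefix="pipeline:"):
        self._client = client
        self._prefix = prefix

    def save(self, pipeline_id, pipeline):
        self._client.set(self._prefix + str(pipeline_id), pickle.dumps(pipeline))

    def load(self, pipeline_id):
        payload = self._client.get(self._prefix + str(pipeline_id))
        if payload is None:
            raise KeyError(pipeline_id)
        # Only unpickle data from a cache you control: pickle can execute code.
        return pickle.loads(payload)


class InMemoryClient:
    """Dict-backed stand-in for a cache client, for demonstration only."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


store = CachedPipelineStore(InMemoryClient())
# A plain dict stands in for a fitted pipeline object here.
store.save(2, {"modelType": "LogisticRegression", "C": 0.0001})
restored = store.load(2)
print(restored["modelType"])
```

Because the store depends only on get and set, swapping the in-memory stand-in for a shared cache client would not require changes to the training or prediction routes.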
A scalable model training and model serving architecture.
Conclusion
AutoML can already be incorporated into the data science workflow, replacing inefficient grid searches and tedious model development processes. AutoML often improves model performance over traditional approaches while also providing insight into novel modelling pipelines. Automated feature engineering is a more difficult problem to tackle; however, there are effective tools available for some use cases. I have shown how to use open source AutoML tools to operationalize a scalable automated feature engineering and model building pipeline in the cloud.
Thank you to Mariann Micsinai, Brandon Shroyer, Scott Hajek, Josh Plotkin, and Matt Horan for their editorial assistance.
About the Author