As data scientists, we work with a large amount of diverse data. A common task for us is supervised statistical classification. Classification underpins many activities that are a part of our everyday lives. Emails are classified as spam or not spam, credit card purchases are classified as fraudulent or legitimate, and custom web advertisements are shown to people based on their viewing habits. A ubiquitous, yet difficult problem in machine learning classification is class imbalance, which occurs when one class occurs far less frequently than the others.

For example, how do you create a classifier to detect fraudulent transactions if only 1% of all transactions are fraudulent? In this post, I take a deep-dive into class imbalance, discussing the challenges associated with large datasets and approaches to tackle the class imbalance problem.

## Statistical Classification for Big Data Problems

Typically, statistical classification first involves training a model based on training data, then scoring the model on new data. Training the model estimates a function that links between a set of observations’ features and their corresponding classes. These features can be either categorical (e.g. gender, type of car) or numeric (e.g. price in dollars, length, speed), while the classes must be categorical. Once model training is complete, a new observation with the same set of features, but possibly different values, can be fed into the model. The output will be a set of scores which represents, for each class, the probability of a new observation belonging to the class. Alternatively, the model may simply return the predicted class for each observation.

Classification can become much more complicated if one or more classes occur far less frequently than others. This is known as a class imbalance problem or rare event detection, and it commonly occurs in big data applications. Some examples include predicting rare disease developments in individuals, predicting natural disasters, or detecting fraudulent transactions. Without accounting for class imbalance, fitting a model to the data can lead to poor results. To lay the groundwork for solutions to the class imbalance problem, the next section will review common metrics for comparing models.

## How to Evaluate a Machine Learning Model in HAWQ or GPDB

A crucial part of machine learning involves iterating through and comparing multiple models using one or more metrics. At Pivotal, we use our own Pivotal Greenplum (GPDB) and Apache HAWQ (incubating) alongside Apache MADlib (incubating), our preferred library of scalable in-database machine learning algorithms. These tools are built to run machine learning algorithms in a distributed fashion. Importantly, they allow us to compare multiple models simultaneously and without waiting for hours or days.

When comparing models, optimal metric choices are subjective and depend on the specific problems. Common ways to compare models include confusion matrices, accuracy, precision/recall, and the ROC (Receiver Operating Characteristic) curve. These are explained below.

### Confusion Matrix

A confusion matrix is a 2×2 table which contains the number of correct classifications and misclassifications for both positive and negative responses. A positive response is an observation that belongs to the class we are trying to predict (namely the rare class), and a negative response is an observation that does not belong to the rare class.

Table 1: Definition of the confusion matrix

Confusion matrices consist of four values—the number of true positives, false positives, false negatives, and true negatives.

**True positive:**a positive observation that is correctly predicted as positive**False positive:**a negative observation that is incorrectly predicted as positive**False negative:**a positive observation that is incorrectly predicted as negative**True negative:**a negative observation that is correctly predicted as positive

A perfect model will have only true positives and true negatives, that is, no false positives or false negatives.

In HAWQ and GPDB, data lies in tables. Each row corresponds to an observation while the columns contain the information for the features and classes. After fitting and running a model, we can add columns for the class predictions and the probabilities of each observation belonging to the rare class.

Table 2 is an example of a results table. The y_true column contains the true values, the y_predicted column contains the corresponding predictions, and the y_prob column contains the probabilities.

Table 2: Example results table

If the table is in this form, we can compute the confusion matrix in HAWQ or GPDB using the following code:

SELECT y_true, y_predicted, COUNT(*) FROM results GROUP BY y_true, y_predicted ORDER BY y_true, y_predicted;

With this matrix, we can compute many other relevant metrics for model evaluation.

### Accuracy

One of the most intuitive metrics is accuracy, which measures the percentage of all observations that are correctly classified. This is computed as the number of correctly classified observations divided by the total number of observations.

In a class imbalance setting, accuracy can be very misleading. Suppose we have a data set with 10,000 observations where 100 of the them are in the positive class and the remaining 9,900 are in the negative class. If we were to predict that each of these 10,000 observations belongs to the larger negative class, we would get the following confusion matrix:

Table 3: Confusion matrix showing the pitfalls of accuracy

This model results in an extremely high accuracy of 99%, but it is clearly not a good one. It never predicts any of the positive classes, as shown by the lack of true and false positives.

### Precision vs. Recall

Precision and recall provide an effective means to evaluate model performance in the presence of class imbalance. Precision measures the certainty of a positive test result. Recall measures the amount of true positives that are captured by the model. They are defined below:

In plain English:

**Precision:**Given an observation whose predicted result is positive, what is the probability that it actually belongs to the positive class?**Recall:**Out of all the observations in the positive class, how many of them were predicted as positive?

Precision is sometimes referred to as the positive predictive value. Recall is also known as sensitivity or the true positive rate.

The best models should have both high precision and recall values. However, there is often a tradeoff between the two. Determining which metric to optimize depends on the question being answered. Consider these two examples of rare event detection:

- Company A wishes to find customers who may be performing fraudulent transactions
- Company B wishes to decide which search engine results to return from a large pool of potential candidates

In the first case, recall should be more important. A missed fraud attempt is much more harmful to the company than detecting an innocuous transaction and flagging it as potentially fraudulent. In statistical terminology, this means that a false negative is more damaging than a false positive.

In the second case, precision should be more important. It would be more beneficial to a user if all of the returned search results were relevant. If more emphasis were put on improving recall, then the search engine would return many irrelevant results and the user would have to cull through a potentially larger number of them to find something useful. In this situation, false positives are more damaging.

We could also use F1 score which is a metric that incorporates both precision and recall. Figure 1 shows the F1 score as a function of precision and recall. It is high when both precision and recall are high but low when either precision or recall is low. Optimizing the F1 score will ensure that both precision and recall are high.

Figure 1: Precision and Recall vs. F1 score

### ROC (Receiver Operating Characteristic) Curve / AUC (Area Under the Curve) Statistic

For binary classification, many machine learning algorithms will output a score between 0 and 1, which can be interpreted as the probability of belonging to the positive class. To predict which class an observation is more likely to belong in, a threshold must be chosen to serve as the cut-off point between the positive and negative classes. A score lying below the threshold is assigned to the negative class and a score above it is assigned to the positive class.

Before moving on, we must discuss two more metrics—the true positive rate and the false positive rate. As mentioned earlier in this blog, true positive rate is simply another name for recall. The false positive rate is the proportion of negative class entries we incorrectly predict as being in the positive class.

The ROC curve is another way to characterize a model’s performance. It captures how the false positive rate (FPR) and true positive rate (TPR) change as the threshold separating classes is moved from 0 to 1. Increasing this threshold decreases the number of observations that are predicted as positive, which increases both the false positive rate and the true positive rate. This results in the ROC curve being an increasing function.

The AUC statistic, which represents the area under the ROC curve, provides a single number to to compare different models. The best models will have ROC curves where there is a large true positive rate even for small false positive rates. In such cases, the AUC will be large (close to 1). In contrast, a very bad model which randomly guesses between the two classes will result in an AUC of around 0.5 since the true positive rate and false positive rates will be about the same at every point.

Figure 2: Example of a good and bad ROC curve

## Techniques to Handle Class Imbalance with Big Data

Now that we have discussed how to judge the performance of a model, how do we actually create one? Without accounting for class imbalance, machine learning models can perform poorly. One way to get around the class imbalance problem is to balance the training data set. This way each class is closer in size. As a result, misclassifying positive examples becomes much more costly than misclassifying negative examples. There are a few ways in which we may balance the classes.

**Down-sample:**Subsample the larger class without replacement so that it is comparable in size to the smaller class**Up-sample:**Resample the smaller class so that it is closer in size to the larger class

The exact size of the sub-sampled or resampled data set can be arbitrary and will likely depend on the metric that is being optimized. A good first guess would be to make both classes the same size, but cross-validation may show that using slightly uneven class sizes works better. Using either of these methods effectively increases the weight of the rare class, which allows it to be more easily detected. Other methods to achieve similar results include using a threshold value other than 0.5 or using a custom loss function which assigns a higher cost to misclassifying a rare observation.

## Case Study

This next section provides a case study using the Forest Covertype Data Set, which comes from UCI’s machine learning repository. This data set contains 581,012 rows and 55 columns. In this particular problem, we are primarily interested in the cover type variable, which can take one of seven values—Spruce-Fir, Lodgepole Pine, Ponderosa Pine, Cottonwood/Willow, Aspen, Douglas-fir, and Krummholz. We are interested in predicting whether the cover type is our least frequent class, Cottonwood/Willow, which only accounts for 0.47% of the data. This means Cottonwood/Willow will be our positive/rare class and all others will be our negative class.

Before modeling, we split up the data into training and test sets. This way, cross-validation can be done later. We choose to assign an approximate split of 80% and 20% for training set and the test set, respectively:

CREATE TABLE cover_type_train_test AS SELECT CASE WHEN random() < 0.8 THEN 'train' ELSE 'test' END AS train_test, cover_type_binary, horizontal_distance_to_hydrology, vertical_distance_to_hydrology FROM cover_type_table DISTRIBUTED RANDOMLY;

We build our models using a random forest with only two features—horizontal distance to hydrology and vertical distance to hydrology. This is done purely for illustrative purposes. If we used more features it would result in better performance, but it would be more difficult to distinguish performance differences.

The code below shows how we could down-sample in HAWQ or GPDB.

CREATE TABLE cover_type_down_compressed AS SELECT array_agg(cover_type_binary) AS cover_type_binary, array_agg(horizontal_distance_to_hydrology) AS horizontal_distance_to_hydrology, array_agg(vertical_distance_to_hydrology) AS vertical_distance_to_hydrology FROM ( -- Positive class (SELECT cover_type_binary, horizontal_distance_to_hydrology, vertical_distance_to_hydrology FROM cover_type_train_test WHERE train_test = 'train' AND cover_type_binary = 1 ) UNION ALL -- Negative class down-sampled ( SELECT cover_type_binary, horizontal_distance_to_hydrology, vertical_distance_to_hydrology FROM ( -- Selects all negative examples (SELECT row_number() OVER () AS id, cover_type_binary, horizontal_distance_to_hydrology, vertical_distance_to_hydrology FROM cover_type_train_test WHERE train_test = 'train' AND cover_type_binary = 0 ORDER BY random() ) neg_examples -- Ensures the number of negative examples -- is the same as the number of positive examples JOIN (SELECT generate_series(1, num_pos) AS rand_id FROM (SELECT COUNT(*) AS num_pos FROM cover_type_train_test WHERE train_test = 'train' AND cover_type_binary = 1 ) num_pos_ex ) indices ON id = rand_id ) ) ) foo DISTRIBUTED RANDOMLY;

This code will take all of the positive examples and sub-sample the same number of randomly chosen negative examples. It will then merge them, and convert the columns into arrays, allowing them to be read into our PL/Python function. In the future, we will be able to replace these methods of down-sampling and up-sampling with functions in PDL Tools, which is a library of reusable tools for data science work in HAWQ and GPDB.

Next, we illustrate code to create two functions—plpython_rf and plpython_rf_score—that run and score a random forest model in PL/Python. The first function runs the random forest and then serializes the Python model object into a byte stream. This process of serializing a model is known as pickling. The second function will take the pickled byte stream as an input and unpickle it. Doing this recovers the original model, which we can then use to score new observations.

-- Function to train a random forest and output a pickle string CREATE FUNCTION plpython_rf(cover_type_binary integer[], horizontal_distance_to_hydrology integer[], vertical_distance_to_hydrology integer[] ) RETURNS bytea AS $$ import cPickle from sklearn.ensemble import RandomForestClassifier import numpy as np import pandas as pd df = pd.DataFrame({'cover_type_binary': cover_type_binary, 'horizontal_distance_to_hydrology': horizontal_distance_to_hydrology, 'vertical_distance_to_hydrology': vertical_distance_to_hydrology }) clf = RandomForestClassifier(criterion='entropy') clf.fit(df[['horizontal_distance_to_hydrology', 'vertical_distance_to_hydrology']], df['cover_type_binary']) # Saves the classifier as a pickled byte stream pickle_string = cPickle.dumps(clf, 2) return pickle_string $$ LANGUAGE plpythonu IMMUTABLE; -- Creates output type for the random forest scoring function CREATE TYPE rf_score AS (horizontal_distance_to_hydrology double precision, vertical_distance_to_hydrology double precision, cover_type_binary double precision, prediction double precision, prob double precision ); -- Define the scoring function CREATE FUNCTION plpython_rf_score(pickled_model bytea, cover_type_binary double precision[], horizontal_distance_to_hydrology double precision[], vertical_distance_to_hydrology double precision[] ) RETURNS SETOF rf_score AS $$ import cPickle from sklearn.ensemble import RandomForestClassifier from sklearn.metrics import roc_auc_score import numpy as np import pandas as pd clf = cPickle.loads(pickled_model) df = pd.DataFrame({'cover_type_binary': cover_type_binary, 'horizontal_distance_to_hydrology': horizontal_distance_to_hydrology, 'vertical_distance_to_hydrology': vertical_distance_to_hydrology }) df['y_pred'] = clf.predict(df[['horizontal_distance_to_hydrology', 'vertical_distance_to_hydrology']]) df['y_prob'] = clf.predict_proba(df[['horizontal_distance_to_hydrology', 'vertical_distance_to_hydrology']])[:,1] return np.array(df[['horizontal_distance_to_hydrology', 'vertical_distance_to_hydrology', 'cover_type_binary', 'y_pred', 'y_prob']]) $$ LANGUAGE plpythonu IMMUTABLE;

The results of the different sampling methods are shown in the table below:

Table 4: Comparing metrics for the different models.

We can see that applying a random forest to the entire unbalanced data set results in 99.5% accuracy, but a 0.37% true positive rate. Balancing the data set allows the model to detect positive observations. The ROC curves for each of these implementations are shown in Figure 3 below.

Figure 3: ROC curves for different sampling types

We see from the AUC scores that these two methods improve on the regular model by a quite a bit.

Table 5: AUC scores for the different sampling types

## Final Thoughts

Class imbalance problems can be very complex. Maximizing one metric can cause another one to drop. Without knowing which metric is the most important to focus on, it is difficult to define what the final model should be. While this blog does not exhaust all the complexities of a class imbalance problem, it gives a taste of important issues to consider and some approaches to overcome class imbalance.

### Learning More:

- Read more articles from the Pivotal Data Labs team
- Find out more about about Pivotal Greenplum and Pivotal HAWQ
- Read about path functions in Apache MADlib or predictive maintenance in cars

## About the Author