Measuring the Performance of Classification Models

Classification is one of the most common machine learning tasks in industry today. When it comes to deciding which model best suits the data, we need to employ appropriate performance measures: an ill-chosen measure can lead to a poorly chosen model. Here are some general thoughts on selecting performance criteria for classification models.

1. Things to do upon getting the data

Before fitting any model, the quality of the data should be checked:

  • What covariates do we have?
  • For a categorical covariate,
    • how many categories?
    • are they nominal or ordinal?
  • For a numerical covariate,
    • what is the shape of the distribution?
    • are there any outlying values?
  • Is there any missingness? (can be checked with the mice package in R, though I haven’t given it a try yet; a quick check in Python is sketched after this list)
    • if there is missingness, is it missing at random or not at random?
    • what’s an appropriate way to deal with it? (single imputation with the mean/median/mode, or multiple imputation)
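A quick missingness check in Python with pandas might look like the following minimal sketch (the file name data.csv is a placeholder):

```python
import pandas as pd

# hypothetical data file; replace with your own
df = pd.read_csv("data.csv")

# fraction of missing values in each column, sorted from most to least missing
missing_rate = df.isna().mean().sort_values(ascending=False)
print(missing_rate)

# rows with at least one missing value, to eyeball whether missingness
# clusters in particular covariates
print(df[df.isna().any(axis=1)].head())
```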

2. Data cleaning

The next thing is to do some data cleaning, including:

  • standardizing numerical covariates so that they have a mean of 0 and a standard deviation of 1;
  • combining levels of a categorical variable if some levels contain too few observations;
  • exploring the correlation between covariates and eliminating highly correlated features (a minimal sketch follows this list).
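A minimal sketch of these cleaning steps in Python, assuming the data sits in a CSV file and that the column names below are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# hypothetical data file and column names
df = pd.read_csv("data.csv")
num_cols = ["age", "income", "balance"]

# standardize numerical covariates to mean 0 and sd 1
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# drop one covariate from each highly correlated pair (|r| > 0.9)
corr = df[num_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
df = df.drop(columns=to_drop)
```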

3. Classification

I usually start with a logistic regression - the simplest and most interpretable model. Variable selection with the LASSO can be done at the same time, using the glmnet package in R or scikit-learn's LogisticRegression with an L1 penalty in Python. When used on large and highly imbalanced data, with say 99.5% 0’s and 0.5% 1’s, the model will be dominated by the negative class. When evaluating the model on the test data, the most likely outcome is that all observations are predicted to be 0.
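Here is a minimal sketch of an L1-penalized logistic regression with scikit-learn; the data is simulated with roughly 0.5% positives just to keep the snippet self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# simulated, highly imbalanced data: roughly 0.5% positives
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.995], flip_y=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# the L1 penalty performs LASSO-style variable selection
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_train, y_train)

# on data this imbalanced, the model tends to predict almost everything as 0
print("test accuracy:", clf.score(X_test, y_test))
print("positives predicted:", clf.predict(X_test).sum())
```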

For unbalanced datasets, such as credit card fraud or responses to direct-mail marketing campaigns, the main focus is often the positive class. To build a better model, people might suggest under-sampling the negative class or over-sampling the positive class. The problem with under-sampling is that we are, on purpose, throwing away information that could be very important in constructing the decision rule. What about over-sampling? We will have duplicated observations in the positive class. If the over-sampling is done before the data partition, the same observation might appear in both the training and testing data. Once a model is trained on the over-sampled training data and then tested on the new testing data, the duplicated observations will most likely be assigned the correct label, which does not reflect the true predictive power of the model!

To address this, some techniques are often used to “enlarge” the positive class, including the Synthetic Minority Over-sampling Technique (SMOTE) and the Adaptive Synthetic (ADASYN) approach. Both are implemented in the imblearn package in Python (a minimal sketch follows the list below).

  • SMOTE: creates new positive-class observations by interpolating between an existing positive observation and one of its nearest positive-class neighbors
  • ADASYN: a variant of SMOTE that adaptively generates more synthetic observations for positive-class points that are harder to learn, i.e., those surrounded by more negative-class neighbors.
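A minimal sketch with imblearn, reusing the simulated X_train and y_train from the earlier snippet. Note that the resampling is applied to the training data only, after the train/test split, to avoid the leakage described above:

```python
import numpy as np
from imblearn.over_sampling import ADASYN, SMOTE

# resample only the training data; the test set stays untouched
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
# alternatively: X_res, y_res = ADASYN(random_state=0).fit_resample(X_train, y_train)

print("class counts before:", np.bincount(y_train))
print("class counts after: ", np.bincount(y_res))
```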

Tree-based models work well on such datasets. The sklearn package contains a large collection of tree-based models, from the simple DecisionTree to more complicated ensemble methods like AdaBoost, GradientBoosting, and RandomForest. The very popular XGBoost method has its own package, and Microsoft’s LightGBM is also powerful. The difference between the two is how the trees make splits: XGBoost sorts the values of a feature first and calculates the gain or loss of making a split at each value, while LightGBM builds a histogram of the values first and then splits between bins, which greatly speeds up the computation. These two typically expect all-numerical inputs, so categorical covariates need to be encoded first. CatBoost is able to work on categorical inputs directly (I haven’t tried it yet).
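As an illustration, here is a minimal sketch fitting two of the sklearn ensembles mentioned above on the resampled training data from the previous snippet (XGBoost and LightGBM expose a very similar fit/predict interface):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# fit on the resampled training data, evaluate on the untouched test set
for model in (RandomForestClassifier(n_estimators=200, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    model.fit(X_res, y_res)
    print(type(model).__name__, "test accuracy:", model.score(X_test, y_test))
```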

4. Performance Measures

These methods are so powerful that properly tuned models often yield high accuracy on testing datasets. Is this simple performance measure sufficient? Do we just choose the model that has the highest accuracy? No, we don’t.

For binary classification problems specifically, performance evaluation is based on the confusion matrix:

                    Predicted Positive      Predicted Negative
Actual Positive     True positive, TP       False negative, FN
Actual Negative     False positive, FP      True negative, TN

The simplest performance measure would be the prediction accuracy:

$\text{Acc} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$

It is not difficult to notice that here we are paying equal attention to correct predictions in both the positive and negative classes. If the data is highly unbalanced, I can get a very high accuracy by just predicting everyone to be in the majority class.
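With scikit-learn, the confusion matrix and accuracy can be computed from the fitted model and test data of the earlier snippets:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = clf.predict(X_test)

# rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
# on highly unbalanced data this number can be high even for a useless model
print("accuracy:", accuracy_score(y_test, y_pred))
```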

The second set of measures is sensitivity and specificity. Intuitively,

sensitivity = proportion of actual positives that are correctly predicted = $\frac{\text{TP}}{\text{TP} + \text{FN}}$;
specificity = proportion of actual negatives that are correctly predicted = $\frac{\text{TN}}{\text{TN} + \text{FP}}$.

Sensitivity is also called the true positive rate, while 1 - specificity, $\frac{\text{FP}}{\text{TN} + \text{FP}}$, is called the false positive rate.

Most classifiers return a score for each observation, and the positive/negative decision is made by comparing the score with a threshold. Depending on the choice of threshold, one gets different combinations of (sensitivity, specificity) values. Plotting sensitivity (Y) against 1 - specificity (X) gives the Receiver Operating Characteristic (ROC) curve. The area under the curve (AUC) is a number between 0 and 1, and it is the probability that the classifier assigns a higher score to a randomly chosen positive observation than to a randomly chosen negative observation.
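A minimal sketch of computing the ROC curve and AUC with scikit-learn, reusing clf, X_test, and y_test from the snippets above:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# predicted scores (probabilities) for the positive class
scores = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)  # points on the ROC curve
print("AUC:", roc_auc_score(y_test, scores))
```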

Based on the AUC, another derived measure called the Gini index is defined as $\text{Gini} = 2\,\text{AUC} - 1$, so the two carry essentially the same information.

For very unbalanced classification, the AUC might still not be a good measure, since it essentially assigns equal importance to sensitivity and specificity. In a scenario like credit card fraud detection, our major focus is the minority (positive) class: we don’t care much if the system allows a normal transaction to be processed, but we do care whether it successfully stops a fraudulent one.

The third set of measures is precision and recall. Similar to sensitivity and specificity, they are defined using information from the confusion matrix:

precision = proportion of predicted positives that are truly positive = $\frac{\text{TP}}{\text{TP} + \text{FP}}$;
recall = sensitivity = $\frac{\text{TP}}{\text{TP} + \text{FN}}$.

Plotting precision (Y) against recall (X) gives the precision-recall (PR) curve, and correspondingly we have the PRAUC.
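A minimal sketch with scikit-learn, reusing y_test and the scores from the ROC snippet; average precision is one common way to summarize the PR curve:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, scores)
print("average precision:", average_precision_score(y_test, scores))
```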

Based upon the precision and recall, a measure called the F-score is defined. It is the harmonic mean of precision and recall, i.e.,

$F = \frac{2}{1 / \text{precision} + 1 / \text{recall}} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$.

We essentially place equal emphasis on false positives and false negatives. There are times, however, when FPs and FNs deserve different levels of emphasis. Imagine you are a credit card issuer doing fraud detection. When a real fraud happens and your classifier misses it, your cost is the amount of the transaction. When your classifier incorrectly flags a legitimate transaction and stops it, your cost is only sending an email or text message to the cardholder, asking them to confirm the transaction. FN is obviously a much more serious problem than FP here, so we wish to place more emphasis on the FN. The $F_2$ measure is based on this premise:

$F_2 = (1 + 2^2) \frac{\text{precision} \cdot \text{recall}}{2^2 \text{precision} + \text{recall}}$

A more general form would be $F_\beta$:

$F_\beta = (1 + \beta^2)\cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \text{precision} + \text{recall}}$.

While $F_2$ weights recall higher than precision, and thus pays more attention to FNs, $F_{0.5}$ weights recall lower than precision and emphasizes FPs.
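A minimal sketch of the F-scores with scikit-learn, reusing clf, X_test, and y_test from above:

```python
from sklearn.metrics import f1_score, fbeta_score

y_pred = clf.predict(X_test)

print("F1:  ", f1_score(y_test, y_pred))
print("F2:  ", fbeta_score(y_test, y_pred, beta=2))    # weights recall higher
print("F0.5:", fbeta_score(y_test, y_pred, beta=0.5))  # weights precision higher
```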
