In this post we will discuss two common metrics:
AUC Score and Average Precision.
In part 1, we will cover the following topics:
What is AUC Score?
What is the downside of AUC score?
AUC Score (also known as AUC-ROC) is the Area Under the Curve for the Receiver Operating Characteristic (ROC) curve.
The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR). It summarizes the performance of a binary classifier and measures the trade-off between TPR and FPR at different classification decision thresholds.
On the graph on the left above, various classification decision thresholds are marked on the red ROC curve. When a different threshold is chosen, the number of positive and negative predictions made by the classifier changes, and so do the TPR and FPR. To see how TPR and FPR change across thresholds, we rely on the ROC curve.
Examining the trade-off between TPR and FPR is not the only use of the ROC curve; more often, researchers use the area under the curve (AUC) to compare the performance of different models. A high AUC score indicates that a model achieves a high True Positive Rate while keeping the False Positive Rate low across different decision thresholds.
For example, consider a random classifier with an AUC score of 0.5, indicated by the red diagonal line on the graph on the right above: at an FPR of 0.5, the random classifier’s corresponding TPR is also 0.5. In contrast, the better-performing classifier indicated by the blue line on the right graph reaches a TPR of 1 at an FPR of 0.5. Therefore, a classifier whose ROC curve bends toward the top-left corner, which means a larger area under the curve and thus a higher AUC score, is preferred.
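As a concrete illustration, here is a minimal sketch of how an ROC curve and AUC score can be computed with scikit-learn. The dataset, classifier, and variable names below are my own illustrative choices and are not from the original project:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a simple classifier and get predicted probabilities for the positive class
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]

# roc_curve returns the FPR and TPR at every decision threshold;
# roc_auc_score gives the area under that curve
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("AUC:", roc_auc_score(y_test, y_score))
```

Plotting `fpr` against `tpr` reproduces the kind of ROC curve shown in the graphs above.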
AUC score is widely used to assess models in Machine Learning, but for a dataset that is highly imbalanced, the AUC score can be misleading.
To illustrate the potential issue, let’s first examine how TPR and FPR are calculated:
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
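To make the two formulas concrete, here is a small sketch that computes TPR and FPR from a confusion matrix with scikit-learn. The toy labels are made up purely for illustration:

```python
from sklearn.metrics import confusion_matrix

# Toy true and predicted labels (illustrative only)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # True Positive Rate
fpr = fp / (fp + tn)  # False Positive Rate
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```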
When the AUC score is used for model evaluation, it means, intuitively speaking, that our goal is to search for a model that maximizes TPR while minimizing FPR. But we can see from the second equation above that:
For a fixed number of False Positives, a low FPR can be achieved simply by increasing the number of True Negatives, as the sketch below shows.
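This dilution effect can be seen with simple arithmetic. In the hypothetical counts below, the number of false positives is held fixed while the number of true negatives grows, and FPR shrinks even though the classifier has not improved on the positive class at all:

```python
# Hypothetical counts: FP is fixed, TN grows as more negative examples are added
fp = 50
for tn in (100, 1_000, 10_000, 100_000):
    fpr = fp / (fp + tn)
    print(f"TN = {tn:>7,d} -> FPR = {fpr:.4f}")
```

As TN grows from 100 to 100,000, the FPR drops from roughly 0.33 to about 0.0005 without any change in how well the positive class is detected.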
This issue with using the AUC score to evaluate models trained on imbalanced datasets can be illustrated by a project I worked on in the past.
In that project, the goal was to classify burned areas using satellite imagery data.
In the dataset, the positive class refers to burned pixels and the negative class to unburned pixels. As one can imagine, since fire is a rare event, the majority of the data is unburned, so the dataset is highly imbalanced: among all our training data points, 95% are unburned (negative) and only 5% are burned (positive).
If we use the AUC score for evaluation, then, as mentioned above, as long as the model classifies the huge number of unburned points as unburned, it will have a low FPR and will appear to be a “good” model.
However, for tasks like burned area mapping, even though we would like low error overall, we care most about the burned area (the positive, minority class), as that is the signal we want the classifier to pick up. When there is severe class imbalance in the dataset, using the AUC score for model evaluation can be overly optimistic.
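To see this effect numerically, here is a sketch that compares the AUC score with the average precision score (the second metric discussed in this post) on a highly imbalanced dataset. The 95%/5% split mirrors the burned-area example, but the data itself is synthetic and the numbers are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic dataset with a 95% negative / 5% positive split (illustrative only)
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]

# AUC is computed over TPR/FPR, average precision over precision/recall;
# on imbalanced data the former tends to look more optimistic
print("AUC score:        ", round(roc_auc_score(y_test, y_score), 3))
print("Average precision:", round(average_precision_score(y_test, y_score), 3))
```

Because the abundant true negatives keep the FPR low, the AUC score typically looks much better than the average precision score on data like this, which is exactly the over-optimism described above.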