Evaluation Metrics for Classification Models
Evaluation metrics are essential for assessing the performance of classification models. This post covers commonly used metrics, including Accuracy, Precision, Recall, and F1 Score, and illustrates their application in binary, multi-class, and multi-label settings.
Metrics Overview
- Accuracy: The proportion of correctly predicted instances among all instances.
- Precision: The proportion of predicted positive instances that are actually positive.
- Recall: The proportion of actual positive instances that are correctly predicted.
- F1 Score: The harmonic mean of Precision and Recall, providing a balance between correctness and completeness.
These metrics provide complementary information and are often used together to evaluate model performance.
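As a quick illustration, the sketch below computes all four metrics with scikit-learn (an assumption: this post does not prescribe a particular library); the label vectors are made-up toy data, with 1 marking the positive class.

```python
# Minimal sketch: the four core metrics via scikit-learn.
# The label vectors are illustrative toy data, not taken from the examples below.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 0, 0, 1, 0]  # ground-truth labels (1 = positive class)
y_pred = [1, 0, 1, 0, 1, 0]  # model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))   # correct / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 Score :", f1_score(y_true, y_pred))         # harmonic mean of P and R
```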
Confusion Matrix Terminology
Evaluation metrics are derived from the confusion matrix:
- True Positive (TP): Predicted positive and actually positive. Example: Predicting Cat for an image that contains a cat.
- False Positive (FP): Predicted positive but actually negative. Example: Predicting Cat for an image that contains a dog.
- False Negative (FN): Predicted negative but actually positive. Example: Predicting Dog for an image that contains a cat.
- True Negative (TN): Predicted negative and actually negative. Example: Predicting Dog for an image that contains a dog.
Metrics such as Precision, Recall, and F1 Score are calculated using these values.
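The sketch below shows one way to extract these four counts from a binary confusion matrix, using scikit-learn's confusion_matrix on the same Cat/Dog data as the binary example in the next section (the library choice is an assumption; Cat is treated as the positive class).

```python
# Sketch: extracting TP, FP, FN, TN from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = ["Cat", "Cat", "Dog", "Dog"]   # actual labels
y_pred = ["Cat", "Dog", "Cat", "Dog"]   # predicted labels

# With labels=["Dog", "Cat"], rows/columns are ordered (negative, positive),
# so the flattened matrix reads [TN, FP, FN, TP].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["Dog", "Cat"]).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=1, FP=1, FN=1, TN=1
```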
Binary Classification Example
Binary classification involves predicting one of two possible labels. Consider the following example of classifying images as Cat (the positive class) or Dog (the negative class):
| Image | Actual Label | Predicted Label | Outcome |
|---|---|---|---|
| 1 | Cat | Cat | TP |
| 2 | Cat | Dog | FN |
| 3 | Dog | Cat | FP |
| 4 | Dog | Dog | TN |
From this table, TP = 1, FP = 1, FN = 1, and TN = 1, so:
- Precision = TP / (TP + FP) = 0.5
- Recall = TP / (TP + FN) = 0.5
- F1 Score = 2 × (Precision × Recall) / (Precision + Recall) = 0.5
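The counts from the table above can be plugged straight into these formulas with a few lines of plain Python (no libraries needed):

```python
# Counts read directly off the four-image Cat-vs-Dog table above.
tp, fp, fn, tn = 1, 1, 1, 1

precision = tp / (tp + fp)                            # 1 / 2 = 0.5
recall = tp / (tp + fn)                               # 1 / 2 = 0.5
f1 = 2 * precision * recall / (precision + recall)    # 0.5
accuracy = (tp + tn) / (tp + fp + fn + tn)            # 2 / 4 = 0.5

print(precision, recall, f1, accuracy)
```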
Multi-class Classification Example
Multi-class classification involves selecting one label out of multiple possible classes. Consider three classes: Cat, Dog, Bird.
| Image | Actual Label | Predicted Label | Outcome |
|---|---|---|---|
| 1 | Cat | Cat | Correct |
| 2 | Dog | Cat | Incorrect |
| 3 | Bird | Bird | Correct |
| 4 | Dog | Dog | Correct |
| 5 | Cat | Bird | Incorrect |
Metrics can be calculated per class. For example, for the Dog class:
- TP = 1 (Image 4: Dog correctly predicted as Dog)
- FP = 0 (no other image was predicted as Dog)
- FN = 1 (Image 2: Dog misclassified as Cat)
- Precision = TP / (TP + FP) = 1.0
- Recall = TP / (TP + FN) = 0.5
- F1 = 2 × (1.0 × 0.5) / (1.0 + 0.5) ≈ 0.67
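The per-class breakdown for all three classes can be reproduced with scikit-learn's classification_report (assuming scikit-learn is available); the Dog row should match the hand calculation above.

```python
# Sketch: per-class precision, recall, and F1 for the five-image example.
from sklearn.metrics import classification_report

y_true = ["Cat", "Dog", "Bird", "Dog", "Cat"]   # actual labels, Images 1-5
y_pred = ["Cat", "Cat", "Bird", "Dog", "Bird"]  # predicted labels

# The Dog row should show precision 1.0, recall 0.5, F1 ≈ 0.67.
print(classification_report(y_true, y_pred))
```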
Multi-label Classification Example
In multi-label classification, a single instance may have multiple labels simultaneously. For example:
{Cat, Cute, Indoor}
Metrics are computed per label and then averaged across labels. Averaging strategies include:
- Micro averaging: Combines counts of TP, FP, FN across all labels before computing metrics.
- Macro averaging: Computes metrics per label and averages them equally.
- Weighted averaging: Computes metrics per label and averages them weighted by the number of true instances.
Choosing an appropriate averaging strategy keeps the evaluation fair even when some labels are rare.
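To make the three strategies concrete, the sketch below computes micro, macro, and weighted F1 on a small, made-up multi-label indicator matrix (the data is hypothetical; columns stand for the labels Cat, Cute, and Indoor, with Cute deliberately rare).

```python
# Sketch: micro / macro / weighted F1 on a tiny, made-up multi-label example.
import numpy as np
from sklearn.metrics import f1_score

# Rows = instances, columns = hypothetical labels (Cat, Cute, Indoor); 1 = label present.
y_true = np.array([[1, 0, 1],
                   [1, 1, 0],
                   [1, 0, 0],
                   [1, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [1, 1, 1],
                   [1, 0, 1]])

# Per-label F1: Cat = 1.0, Cute ≈ 0.67, Indoor = 0.5.
for avg in ("micro", "macro", "weighted"):
    print(avg, round(f1_score(y_true, y_pred, average=avg), 3))
# micro ≈ 0.800, macro ≈ 0.722, weighted ≈ 0.810:
# macro weights every label equally, while micro and weighted
# lean toward the frequent, well-predicted Cat label.
```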
Metrics Comparison Across Classification Types
The following table summarizes how evaluation metrics are computed and interpreted in binary, multi-class, and multi-label settings:
| Metric / Type | Binary | Multi-class | Multi-label |
|---|---|---|---|
| Precision | TP / (TP + FP) | Per class, then averaged | Per label, then averaged (micro/macro/weighted) |
| Recall | TP / (TP + FN) | Per class, then averaged | Per label, then averaged (micro/macro/weighted) |
| F1 Score | Harmonic mean of Precision and Recall | Per class, then averaged | Per label, then averaged (micro/macro/weighted) |
| Accuracy | (TP + TN) / Total | Correct predictions / Total | Can be misleading; usually calculated per label and averaged |
| Averaging Method | N/A | Macro / Micro / Weighted | Macro / Micro / Weighted |
| Notes | Straightforward | Class imbalance should be considered | Rare labels require micro or weighted averaging |
Evaluation metrics such as Precision, Recall, F1 Score, and Accuracy are critical for measuring classification model performance. Proper understanding of these metrics, along with the confusion matrix and averaging strategies, enables effective assessment and comparison of models across binary, multi-class, and multi-label tasks.