Evaluation Metrics for Classification Models

Evaluation metrics are essential for assessing the performance of classification models. This post covers commonly used metrics, including Accuracy, Precision, Recall, and F1 Score, and illustrates their application in binary, multi-class, and multi-label settings.


Metrics Overview

  • Accuracy: The proportion of correctly predicted instances among all instances.
  • Precision: The proportion of predicted positive instances that are actually positive.
  • Recall: The proportion of actual positive instances that are correctly predicted.
  • F1 Score: The harmonic mean of Precision and Recall, providing a balance between correctness and completeness.

These metrics provide complementary information and are often used together to evaluate model performance.
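
As a quick, hypothetical illustration, the sketch below computes all four metrics on a handful of made-up binary predictions. It uses scikit-learn, which the post does not require; it is simply one common choice, and the labels and predictions here are invented for the example.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground-truth labels and model predictions (1 = positive, 0 = negative)
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # correct predictions / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 Score :", f1_score(y_true, y_pred))         # harmonic mean of Precision and Recall
```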


Confusion Matrix Terminology

Evaluation metrics are derived from the confusion matrix:

  • True Positive (TP): Predicted positive and actually positive.
    Example: Predicting Cat for an image that contains a cat.
  • False Positive (FP): Predicted positive but actually negative.
    Example: Predicting Cat for an image that contains a dog.
  • False Negative (FN): Predicted negative but actually positive.
    Example: Predicting Dog for an image that contains a cat.
  • True Negative (TN): Predicted negative and actually negative.
    Example: Predicting Dog for an image that contains a dog.

Metrics such as Precision, Recall, and F1 Score are calculated using these values.
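
As a minimal sketch (again with invented labels), scikit-learn's confusion_matrix returns these four counts directly for a binary problem:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels: 1 = positive, 0 = negative
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# For binary inputs, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=2, FP=1, FN=1, TN=2
```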


Binary Classification Example

Binary classification involves predicting one of two possible labels. Consider the following example of classifying images as Cat (the positive class) or Dog (the negative class):

| Image | Actual Label | Predicted Label | Outcome |
|-------|--------------|-----------------|---------|
| 1     | Cat          | Cat             | TP      |
| 2     | Cat          | Dog             | FN      |
| 3     | Dog          | Cat             | FP      |
| 4     | Dog          | Dog             | TN      |

From this table:

  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
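
With the counts from the table (TP = 1, FP = 1, FN = 1, TN = 1), Precision = 0.5, Recall = 0.5, and F1 = 0.5. A minimal sketch reproducing this with scikit-learn, treating Cat as the positive class:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# The four images from the table above; "Cat" is treated as the positive class
y_true = ["Cat", "Cat", "Dog", "Dog"]
y_pred = ["Cat", "Dog", "Cat", "Dog"]

precision = precision_score(y_true, y_pred, pos_label="Cat")  # 1 / (1 + 1) = 0.5
recall = recall_score(y_true, y_pred, pos_label="Cat")        # 1 / (1 + 1) = 0.5
f1 = f1_score(y_true, y_pred, pos_label="Cat")                # 2 * (0.5 * 0.5) / (0.5 + 0.5) = 0.5
print(precision, recall, f1)
```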

Multi-class Classification Example

Multi-class classification involves selecting one label out of multiple possible classes. Consider three classes: Cat, Dog, Bird.

| Image | Actual Label | Predicted Label | Outcome   |
|-------|--------------|-----------------|-----------|
| 1     | Cat          | Cat             | Correct   |
| 2     | Dog          | Cat             | Incorrect |
| 3     | Bird         | Bird            | Correct   |
| 4     | Dog          | Dog             | Correct   |
| 5     | Cat          | Bird            | Incorrect |

Metrics can be calculated per class. For example, for the Dog class:

  • TP = 1 (Image 4)
  • FP = 0
  • FN = 1 (Image 2)
  • Precision = TP / (TP + FP) = 1.0
  • Recall = TP / (TP + FN) = 0.5
  • F1 = 0.67
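
A minimal sketch of the same per-class calculation on the five images above, using scikit-learn's precision_recall_fscore_support (which returns one value per class when average=None):

```python
from sklearn.metrics import precision_recall_fscore_support

# The five images from the table above
y_true = ["Cat", "Dog", "Bird", "Dog", "Cat"]
y_pred = ["Cat", "Cat", "Bird", "Dog", "Bird"]
labels = ["Cat", "Dog", "Bird"]

# Per-class Precision, Recall, and F1 (one entry per label, in the order given)
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average=None, zero_division=0
)
for label, p, r, f in zip(labels, precision, recall, f1):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# Dog: precision=1.00 recall=0.50 f1=0.67, matching the calculation above
```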

Multi-label Classification Example

In multi-label classification, a single instance may have multiple labels simultaneously. For example:

{Cat, Cute, Indoor}

Metrics are computed per label and then averaged across labels. Averaging strategies include:

  • Micro averaging: Combines counts of TP, FP, FN across all labels before computing metrics.
  • Macro averaging: Computes metrics per label and averages them equally.
  • Weighted averaging: Computes metrics per label and averages them weighted by the number of true instances.

Choosing an appropriate averaging strategy ensures a fair evaluation even when some labels are rare.
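
A minimal sketch of the three strategies on made-up multi-label data, using scikit-learn's MultiLabelBinarizer to turn each label set into an indicator vector:

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

# Hypothetical label sets for three images (each image can carry several labels)
y_true = [{"Cat", "Cute", "Indoor"}, {"Dog", "Indoor"}, {"Cat"}]
y_pred = [{"Cat", "Indoor"}, {"Dog", "Cute"}, {"Cat", "Cute"}]

# Convert label sets into binary indicator matrices (one column per label)
mlb = MultiLabelBinarizer()
Y_true = mlb.fit_transform(y_true)
Y_pred = mlb.transform(y_pred)

# Micro: pool TP/FP/FN across all labels before computing F1
# Macro: unweighted mean of per-label F1
# Weighted: mean of per-label F1 weighted by each label's true count
for avg in ("micro", "macro", "weighted"):
    print(avg, f1_score(Y_true, Y_pred, average=avg, zero_division=0))
```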


Metrics Comparison Across Classification Types

The following table summarizes how evaluation metrics are computed and interpreted in binary, multi-class, and multi-label settings:

| Metric / Type    | Binary                                 | Multi-class                          | Multi-label                                                 |
|------------------|----------------------------------------|--------------------------------------|-------------------------------------------------------------|
| Precision        | TP / (TP + FP)                         | Per class, then averaged             | Per label, then averaged (micro/macro/weighted)             |
| Recall           | TP / (TP + FN)                         | Per class, then averaged             | Per label, then averaged (micro/macro/weighted)             |
| F1 Score         | Harmonic mean of Precision and Recall  | Per class, then averaged             | Per label, then averaged (micro/macro/weighted)             |
| Accuracy         | (TP + TN) / Total                      | Correct predictions / Total          | Can be misleading; usually calculated per label and averaged |
| Averaging Method | N/A                                    | Macro / Micro / Weighted             | Macro / Micro / Weighted                                    |
| Notes            | Straightforward                        | Class imbalance should be considered | Rare labels require micro or weighted averaging             |
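
For a quick way to see most of this at once, scikit-learn's classification_report prints per-class Precision, Recall, and F1 together with the accuracy, macro-averaged, and weighted-averaged summaries (shown here on the earlier multi-class example):

```python
from sklearn.metrics import classification_report

# Multi-class example from earlier; the report lists per-class metrics
# plus accuracy, macro-averaged, and weighted-averaged summaries
y_true = ["Cat", "Dog", "Bird", "Dog", "Cat"]
y_pred = ["Cat", "Cat", "Bird", "Dog", "Bird"]

print(classification_report(y_true, y_pred, zero_division=0))
```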

Evaluation metrics such as Precision, Recall, F1 Score, and Accuracy are critical for measuring classification model performance. Proper understanding of these metrics, along with the confusion matrix and averaging strategies, enables effective assessment and comparison of models across binary, multi-class, and multi-label tasks.