Evaluation Metrics for Classification Models

Evaluation metrics are essential for assessing the performance of classification models. This post covers commonly used metrics, including Accuracy, Precision, Recall, and F1 Score, and illustrates their application in binary, multi-class, and multi-label settings.


Metrics Overview

  • Accuracy: The proportion of correctly predicted instances among all instances.
  • Precision: The proportion of predicted positive instances that are actually positive.
  • Recall: The proportion of actual positive instances that are correctly predicted.
  • F1 Score: The harmonic mean of Precision and Recall, providing a balance between correctness and completeness.

These metrics provide complementary information and are often used together to evaluate model performance.
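
As a quick sanity check, here is a minimal sketch that computes all four metrics from raw confusion-matrix counts in plain Python (the count values are made up for illustration):

    # Hypothetical confusion-matrix counts for a binary classifier
    tp, fp, fn, tn = 90, 10, 15, 85

    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)

    print(f"Accuracy:  {accuracy:.3f}")   # 0.875
    print(f"Precision: {precision:.3f}")  # 0.900
    print(f"Recall:    {recall:.3f}")     # 0.857
    print(f"F1 Score:  {f1:.3f}")         # 0.878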


Confusion Matrix Terminology

Evaluation metrics are derived from the confusion matrix:

  • True Positive (TP): Predicted positive and actually positive.
    Example: Predicting Cat for an image that contains a cat.
  • False Positive (FP): Predicted positive but actually negative.
    Example: Predicting Cat for an image that contains a dog.
  • False Negative (FN): Predicted negative but actually positive.
    Example: Predicting Dog for an image that contains a cat.
  • True Negative (TN): Predicted negative and actually negative.
    Example: Predicting Dog for an image that contains a dog.

Metrics such as Precision, Recall, and F1 Score are calculated using these values.
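
As a minimal sketch, the snippet below tallies the four outcomes from paired actual/predicted labels, treating Cat as the positive class (the label lists are illustrative):

    # Illustrative labels; "Cat" is the positive class
    actual    = ["Cat", "Cat", "Dog", "Dog"]
    predicted = ["Cat", "Dog", "Cat", "Dog"]

    pairs = list(zip(actual, predicted))
    tp = sum(a == "Cat" and p == "Cat" for a, p in pairs)
    fp = sum(a != "Cat" and p == "Cat" for a, p in pairs)
    fn = sum(a == "Cat" and p != "Cat" for a, p in pairs)
    tn = sum(a != "Cat" and p != "Cat" for a, p in pairs)

    print(tp, fp, fn, tn)  # 1 1 1 1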


Binary Classification Example

Binary classification involves predicting one of two possible labels. Consider the following example of classifying images as Cat (the positive class) or Dog (the negative class):

  Image   Actual Label   Predicted Label   Outcome
  1       Cat            Cat               TP
  2       Cat            Dog               FN
  3       Dog            Cat               FP
  4       Dog            Dog               TN

From this table, TP = 1, FP = 1, FN = 1, and TN = 1, so:

  • Precision = TP / (TP + FP) = 1 / 2 = 0.5
  • Recall = TP / (TP + FN) = 1 / 2 = 0.5
  • F1 Score = 2 × (Precision × Recall) / (Precision + Recall) = 0.5
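
These numbers can be double-checked with scikit-learn, assuming it is installed; pos_label marks Cat as the positive class:

    from sklearn.metrics import precision_score, recall_score, f1_score

    actual    = ["Cat", "Cat", "Dog", "Dog"]
    predicted = ["Cat", "Dog", "Cat", "Dog"]

    # With TP = FP = FN = 1, all three metrics come out to 0.5
    print(precision_score(actual, predicted, pos_label="Cat"))  # 0.5
    print(recall_score(actual, predicted, pos_label="Cat"))     # 0.5
    print(f1_score(actual, predicted, pos_label="Cat"))         # 0.5
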

Multi-class Classification Example

Multi-class classification involves selecting one label out of multiple possible classes. Consider three classes: Cat, Dog, Bird.

  Image   Actual Label   Predicted Label   Outcome
  1       Cat            Cat               Correct
  2       Dog            Cat               Incorrect
  3       Bird           Bird              Correct
  4       Dog            Dog               Correct
  5       Cat            Bird              Incorrect

Metrics can be calculated per class. For example, for the Dog class:

  • TP = 1 (Image 4)
  • FP = 0
  • FN = 1 (Image 2)
  • Precision = TP / (TP + FP) = 1.0
  • Recall = TP / (TP + FN) = 0.5
  • F1 = 0.67
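
The same per-class numbers can be reproduced with scikit-learn's classification_report, shown here as a sketch using the five images above:

    from sklearn.metrics import classification_report

    actual    = ["Cat", "Dog", "Bird", "Dog", "Cat"]
    predicted = ["Cat", "Cat", "Bird", "Dog", "Bird"]

    # Prints per-class precision, recall, and F1 plus macro/weighted
    # averages; the Dog row shows precision 1.00, recall 0.50, F1 0.67
    print(classification_report(actual, predicted))
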

Multi-label Classification Example

In multi-label classification, a single instance may have multiple labels simultaneously. For example, one image might be tagged:

{Cat, Cute, Indoor}

Metrics are computed per label and then averaged across labels. Averaging strategies include:

  • Micro averaging: Combines counts of TP, FP, FN across all labels before computing metrics.
  • Macro averaging: Computes metrics per label and averages them equally.
  • Weighted averaging: Computes metrics per label and averages them weighted by the number of true instances.

The choice of strategy matters when some labels are rare: macro averaging weights every label equally, while micro and weighted averaging are dominated by the more frequent labels.
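
The sketch below contrasts the three strategies with scikit-learn's f1_score on a small, made-up multi-label example, encoded as binary indicator matrices over the labels Cat, Cute, and Indoor:

    import numpy as np
    from sklearn.metrics import f1_score

    # Made-up data: rows are instances, columns are [Cat, Cute, Indoor]
    y_true = np.array([[1, 0, 1],
                       [1, 1, 0],
                       [1, 0, 0]])
    y_pred = np.array([[1, 0, 0],
                       [1, 0, 0],
                       [1, 0, 1]])

    # zero_division=0 scores labels with no predicted positives as 0
    # instead of emitting a warning
    for avg in ("micro", "macro", "weighted"):
        print(avg, f1_score(y_true, y_pred, average=avg, zero_division=0))
    # micro 0.667, macro 0.333, weighted 0.600: the frequent Cat label
    # dominates micro, while the two rare, misclassified labels pull
    # macro down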


Metrics Comparison Across Classification Types

The following table summarizes how evaluation metrics are computed and interpreted in binary, multi-class, and multi-label settings:

  Metric / Type      Binary                                   Multi-class                    Multi-label
  Precision          TP / (TP + FP)                           Per class, then averaged       Per label, then averaged (micro/macro/weighted)
  Recall             TP / (TP + FN)                           Per class, then averaged       Per label, then averaged (micro/macro/weighted)
  F1 Score           Harmonic mean of Precision and Recall    Per class, then averaged       Per label, then averaged (micro/macro/weighted)
  Accuracy           (TP + TN) / Total                        Correct predictions / Total    Can be misleading; usually per label, then averaged
  Averaging Method   N/A                                      Macro / Micro / Weighted       Macro / Micro / Weighted
  Notes              Straightforward                          Consider class imbalance       Averaging choice strongly affects rare labels

Evaluation metrics such as Precision, Recall, F1 Score, and Accuracy are critical for measuring classification model performance. Proper understanding of these metrics, along with the confusion matrix and averaging strategies, enables effective assessment and comparison of models across binary, multi-class, and multi-label tasks.