11.2 Accuracy, Precision, Recall, F1, and When to Use Each
Right, let’s talk about metrics. Because if you’re going to build a model, you need to know if it’s any good. Throwing data at an algorithm and hoping for the best is a fantastic way to waste electricity. We need to measure performance, and not just with a single number that tells a comforting lie.
The classic beginner mistake is to reach for accuracy first. It’s the most intuitive metric: (number of correct predictions) / (total predictions). Simple, right? Let’s see it in action on a terribly imbalanced dataset.
from sklearn.metrics import accuracy_score
import numpy as np
# Let's simulate a "cat or dog" classifier on a dataset that's 95% cat pictures.
# Our brilliant model has decided the optimal strategy is to just yell "CAT!" every time.
y_true = np.array(['cat'] * 95 + ['dog'] * 5) # 95 cats, 5 dogs
y_pred = np.array(['cat'] * 100) # Model predicts 'cat' for all 100 images
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
This will proudly print Accuracy: 0.95. 95%! A+! Ship it! Except our model is a useless pile of code that has never once correctly identified a dog. It has a 0% success rate on the thing we might actually care about. This is why accuracy is often a garbage metric for imbalanced datasets, and, let’s be honest, most interesting problems are imbalanced.
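If you want that 0% in writing, here is a minimal sketch that measures how often each class is actually found. It rebuilds the same simulated labels as plain lists so it runs on its own; the helper name `per_class_hit_rate` is made up for illustration and is not a scikit-learn function.

```python
# Same simulated data as the accuracy example, rebuilt as plain lists.
y_true = ['cat'] * 95 + ['dog'] * 5
y_pred = ['cat'] * 100

def per_class_hit_rate(y_true, y_pred, label):
    """Fraction of items whose true class is `label` that were predicted as `label`.

    Hypothetical helper for this sketch only.
    """
    preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in preds_for_label) / len(preds_for_label)

cat_rate = per_class_hit_rate(y_true, y_pred, 'cat')
dog_rate = per_class_hit_rate(y_true, y_pred, 'dog')

print(f"Cats found: {cat_rate:.2f}")  # 1.00
print(f"Dogs found: {dog_rate:.2f}")  # 0.00
```

Same predictions, same data, and a very different story once you look at one class at a time.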
We need to break down the types of mistakes our model makes. Enter the Confusion Matrix. Don’t let the name fool you; it’s there to prevent confusion. It’s a simple table that cross-tabulates what actually happened (y_true) with what your model predicted (y_pred).
| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | True Negative (TN) | False Positive (FP) |
| Actual Positive | False Negative (FN) | True Positive (TP) |
The names are logical: a False Positive is when you falsely predicted a positive, and a False Negative is when you falsely predicted a negative. This table is the bedrock. Every metric that follows is derived from these four counts.
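To make the four cells concrete, here is a small sketch that tallies them by hand for the all-"CAT!" classifier, treating 'dog' as the positive class (that choice is an assumption for this example). The function `tally_outcomes` is a hypothetical helper, written in plain Python so each branch maps onto one cell of the table.

```python
from collections import Counter

def tally_outcomes(y_true, y_pred, positive):
    """Count TN, FP, FN, TP for a binary problem (illustrative helper)."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Model said "positive": correct if the item really was positive.
            counts['TP' if actual == positive else 'FP'] += 1
        else:
            # Model said "negative": a miss if the item really was positive.
            counts['FN' if actual == positive else 'TN'] += 1
    return counts

# The 95-cat, 5-dog dataset with a model that always predicts 'cat'.
y_true = ['cat'] * 95 + ['dog'] * 5
y_pred = ['cat'] * 100

outcomes = tally_outcomes(y_true, y_pred, positive='dog')
print(f"TN={outcomes['TN']} FP={outcomes['FP']} FN={outcomes['FN']} TP={outcomes['TP']}")
# TN=95 FP=0 FN=5 TP=0
```

All five dogs land in the False Negative cell, which is exactly the failure that accuracy hid.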
Precision: Are We Sure?
Precision answers a very specific question: When my model says something is positive, how often is it right? It measures how far you can trust a positive prediction.
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Think of it as quality control. You want high precision when the cost of a false positive is high. Spam filtering is the textbook example. If you classify a legitimate email from your boss as spam (a false positive), that’s a disaster. You’d rather let a few spam emails through (false negatives) than accidentally trash important messages. A high-precision spam filter is conservative; it only labels something as spam if it’s really sure.
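As a worked instance of the formula, here is a sketch with invented numbers: a hypothetical spam filter that flags 40 emails, 38 of which really are spam. The counts and the function name are assumptions for illustration, and returning 0.0 when nothing was flagged is just a convention picked for this sketch.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 if the model flagged nothing (sketch convention)."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0

# Hypothetical counts for a conservative spam filter.
spam_caught = 38     # true positives: flagged and really spam
legit_flagged = 2    # false positives: flagged but legitimate

spam_precision = precision(spam_caught, legit_flagged)
print(f"Precision: {spam_precision:.2f}")  # 0.95
```

Note that the formula never looks at the spam that slipped through. Precision only judges the emails the filter chose to flag.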
Recall: Did We Find Them All?
Recall (or Sensitivity) answers a different question: Of all the actual positive cases, how many did we correctly find?
\[ \text{Recall} = \frac{TP}{TP + FN} \]
This is a measure of completeness. You need high recall when missing a positive case is disastrous. The classic example is cancer screening. You’d rather flag a few healthy patients for additional testing (false positives) than miss a single person who actually has cancer (a false negative). A high-recall system is paranoid; it casts a wide net to catch everything that might be a problem.
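The same kind of back-of-the-envelope check works for recall. The numbers below are invented: a hypothetical screening test applied to 50 patients who truly have the disease, catching 48 of them. The function name and the zero-denominator convention are assumptions of this sketch.

```python
def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 if there were no actual positives (sketch convention)."""
    actual_positives = tp + fn
    return tp / actual_positives if actual_positives else 0.0

# Hypothetical counts for a screening test.
cases_caught = 48   # true positives: sick and flagged
cases_missed = 2    # false negatives: sick but not flagged

screening_recall = recall(cases_caught, cases_missed)
print(f"Recall: {screening_recall:.2f}")  # 0.96
```

Healthy patients sent for extra tests do not appear anywhere in this calculation. Recall only judges how many real cases were found.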
Here’s the brutal truth: For a given model, precision and recall are often in tension. Making your model more sensitive (higher recall) usually means it also raises more false alarms (lower precision), and vice-versa. You can’t have it all.
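You can watch the tension play out by sliding a decision threshold over a model’s scores. The sketch below uses eight made-up scores and labels (not output from any real model) and a hypothetical helper, `precision_recall_at`, to compute both metrics at three thresholds.

```python
# Made-up model scores (higher = more confident "positive") and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    1,    0,    1,    0]

def precision_recall_at(threshold, scores, labels):
    """Predict positive when score >= threshold, then compute (precision, recall)."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

results = {t: precision_recall_at(t, scores, labels) for t in (0.85, 0.50, 0.25)}
for threshold, (p, r) in results.items():
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.85  precision=1.00  recall=0.40
# threshold=0.50  precision=0.80  recall=0.80
# threshold=0.25  precision=0.71  recall=1.00
```

Lowering the threshold catches every positive eventually, but only by letting more negatives through the door.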
The F1 Score: The Harmonious Middle Child
So what if you care about both false positives and false negatives? You need a single number to compare models. Enter the F1 Score, the harmonic mean of precision and recall.
\[ F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
“Why the harmonic mean?” I hear you ask. Because it punishes extreme values. The regular arithmetic mean of 1.0 precision and 0.0 recall would be 0.5, which isn’t a useful representation of a terrible model. The harmonic mean of those same numbers is 0. The F1 score only gets high if both precision and recall are high. It’s a great default metric when you have an imbalanced dataset and need a balanced perspective.
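A quick side-by-side makes the punishment visible. This sketch compares the arithmetic mean with F1 for a few illustrative precision/recall pairs (the pairs are chosen for the example, not measured from a model), and treating F1 as 0 when both inputs are 0 is a convention assumed here.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def f1(p, r):
    """Harmonic mean of precision and recall; 0.0 if both are zero (sketch convention)."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Illustrative (precision, recall) pairs.
pairs = [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5), (0.8, 0.8)]

comparison = [(p, r, arithmetic_mean(p, r), f1(p, r)) for p, r in pairs]
for p, r, mean, score in comparison:
    print(f"P={p:.1f} R={r:.1f}  arithmetic={mean:.2f}  F1={score:.2f}")
# P=1.0 R=0.0  arithmetic=0.50  F1=0.00
# P=0.9 R=0.1  arithmetic=0.50  F1=0.18
# P=0.5 R=0.5  arithmetic=0.50  F1=0.50
# P=0.8 R=0.8  arithmetic=0.80  F1=0.80
```

The first three rows all share an arithmetic mean of 0.50, but only the balanced one keeps that score under F1.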
Let’s calculate all of these properly.
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
# Let's use a more realistic example
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1] # 1 is our "positive" class
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1] # Our model's predictions
print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))
print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall: {recall_score(y_true, y_pred):.2f}")
print(f"F1 Score: {f1_score(y_true, y_pred):.2f}")
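This prints the raw confusion matrix and the scores. The matrix holds 3 true negatives, 1 false positive, 2 false negatives, and 4 true positives, which gives a precision of 0.80 (4 of 5 positive predictions were right), a recall of 0.67 (4 of 6 actual positives were found), and an F1 of 0.73. Run this. Stare at the numbers. See how they connect. The matrix shows the raw counts, and the metrics distill them into a story about your model’s behavior.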
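One way to see the connection is to rebuild the three scores by hand from the matrix. This sketch relies on scikit-learn’s binary layout, where `confusion_matrix(...).ravel()` yields the counts in the order TN, FP, FN, TP, and then applies the formulas from this section to the same labels.

```python
from sklearn.metrics import confusion_matrix

# Same labels and predictions as the example above.
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# For binary labels, ravel() flattens the 2x2 matrix into TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Apply the formulas from this section directly to the counts.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
```

If these hand-computed values match what `precision_score`, `recall_score`, and `f1_score` report, you have understood the whole section.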
So, Which One Should You Actually Use?
Stop looking for a silver bullet. The choice is dictated by the business problem, not the algorithm.
- Use Precision when false positives are worse than false negatives. (Spam filtering, content moderation, fraud detection where investigating a false alert wastes resources).
- Use Recall when false negatives are worse than false positives. (Medical diagnostics, search and rescue, sensitive security screening).
- Use F1 when you need a single metric for a balanced view and the class distribution is uneven. It’s your best bet for general model comparison on imbalanced data.
- Use Accuracy only when your classes are perfectly balanced and the cost of both error types is roughly equal. (Spoiler: this almost never happens).
The real pro move is to not just pick one. Plot a Precision-Recall curve to see the trade-off across different decision thresholds. But that’s a topic for another section. For now, just promise me you’ll never deploy a model based solely on accuracy again. You’re better than that.