33.6 Evaluation: mAP, IoU, and COCO Benchmarks

Right, let’s talk about how we tell if our object detector is any good. Because trust me, your model will confidently spew out bounding boxes whether it’s found a cat or a carburetor. It’s our job to hold it accountable. We’re not just counting right and wrong answers; we’re grading its precision on a curve. The core of this entire evaluation circus revolves around two simple geometric ideas and one glorified report card.

The Intersection over Union (IoU) Gut Check

First, you need a way to measure how “right” a single prediction is. That’s where IoU comes in. It’s brutally simple: it’s the area of overlap between your predicted box and the ground truth box, divided by the area of their union. It gives you a score between 0 (no overlap, complete failure) and 1 (perfect match, they’re the same box).

Why this ratio? Because a single number captures both localization accuracy (is it in the right place?) and size accuracy (is it the right size?). A box that’s slightly off will still have a high IoU, say 0.8. A box that’s way off or the completely wrong size will score much lower. We use a threshold, typically 0.5, to decide whether a prediction is a “True Positive” (its IoU with a ground truth box of the same class meets the threshold) or a “False Positive” (it doesn’t). Each ground truth box can be claimed only once, so a second detection of the same object is a False Positive too. Is the 0.5 arbitrary? A bit. But it’s a standard the whole field agreed on, probably over a few beers. It works.

Here’s how you’d calculate it yourself. It’s good to do this once so you never take it for granted again.

def calculate_iou(boxA, boxB):
    # Box format is typically [x1, y1, x2, y2] for top-left and bottom-right corners.
    
    # Determine the coordinates of the intersection rectangle
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])
    
    # Compute the area of intersection
    interArea = max(0, xB - xA) * max(0, yB - yA)
    # The max(0,...) is crucial. It handles the case where they don't overlap at all.
    
    # Compute the area of both the prediction and ground-truth rectangles
    boxAArea = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    boxBArea = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    
    # Compute the area of union = sum of both areas - intersection area
    unionArea = boxAArea + boxBArea - interArea
    
    # Avoid the divide-by-zero edge case (if both boxes have zero area, IoU is undefined)
    if unionArea == 0:
        return 0
    
    # IoU = Intersection / Union
    iou = interArea / unionArea
    return iou

# Example:
gt_box = [50, 50, 150, 150]  # Ground truth box
pred_box = [60, 60, 140, 140] # Our prediction
print(f"IoU: {calculate_iou(gt_box, pred_box):.3f}")  # Prints "IoU: 0.640" (6400 / 10000)
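
To see how the threshold turns IoU scores into True Positives and False Positives, here is a minimal sketch of greedy matching for one class in one image. The function name label_predictions and the (score, box) tuple layout are assumptions made for this illustration, and real evaluators add details (crowd regions, ignore flags) that are skipped here.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_predictions(preds, gt_boxes, iou_thresh=0.5):
    """Label each (score, box) prediction as TP (True) or FP (False).

    Predictions are visited in descending score order, and each ground
    truth box can be claimed by at most one prediction.
    """
    claimed = [False] * len(gt_boxes)
    labels = []
    for score, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(gt_boxes):
            if claimed[idx]:
                continue  # already matched to a higher-scoring prediction
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        is_tp = best_idx >= 0 and best_iou >= iou_thresh
        if is_tp:
            claimed[best_idx] = True
        labels.append((score, is_tp))
    return labels


gts = [[50, 50, 150, 150]]
preds = [
    (0.9, [60, 60, 140, 140]),    # IoU 0.64 with the ground truth
    (0.8, [55, 55, 150, 150]),    # overlaps well, but the box is already claimed
    (0.7, [200, 200, 260, 260]),  # no overlap at all
]
print(label_predictions(preds, gts))
# [(0.9, True), (0.8, False), (0.7, False)]
```

Notice the second prediction: it fits the object even better than the first, but it arrived with a lower score, so it counts as a duplicate.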

Precision, Recall, and the Confidence Trade-Off

Now, for a single class, we have a list of predictions, each with a confidence score. Your model might be 95% confident about one dog, 87% confident about another, and so on. If we set a high confidence threshold (e.g., 0.9), we keep only the most surefire predictions. This leads to high precision (most of what we predicted was correct) but low recall (we missed a lot of objects). Lower the threshold, and you catch more objects (higher recall) but also start hallucinating (lower precision).

The only way to see the full picture is to sort all predictions by confidence and walk through every possible threshold. At each threshold, you calculate the precision and recall. Plotting all these points gives you the Precision-Recall (PR) Curve. A good model has a curve that bulges towards the top-right corner—high precision and high recall simultaneously. The area under this curve (AUC) is a single number that summarizes its performance. A perfect model has an AUC of 1.
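
The walk through every threshold can be sketched in a few lines. This assumes each prediction has already been labeled as a True Positive or False Positive; the function name precision_recall_points and the input layout are made up for this example.

```python
def precision_recall_points(labels, num_gt):
    """Return (precision, recall) after each prediction, highest score first.

    labels: list of (score, is_true_positive) pairs for one class.
    num_gt: number of ground truth objects of that class (must be > 0).
    """
    tp = fp = 0
    points = []
    for _, is_tp in sorted(labels, key=lambda x: x[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        # Precision: fraction of predictions so far that were correct.
        # Recall: fraction of all ground truth objects found so far.
        points.append((tp / (tp + fp), tp / num_gt))
    return points


labels = [(0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False)]
for precision, recall in precision_recall_points(labels, num_gt=4):
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

Each step down the sorted list is equivalent to lowering the confidence threshold just enough to admit one more prediction, which is why recall never decreases while precision zigzags.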

Mean Average Precision (mAP): The Main Event

This is where it all comes together, and where most people’s eyes glaze over. But stick with me. Average Precision (AP) is the area under the PR curve for a single class. There are different ways to calculate this area (the 11-point interpolation from PASCAL VOC 2007, the all-points area that VOC switched to in 2010, and the 101-point interpolation COCO uses), which is a classic academic feud that we just have to live with. Just make sure you know which one produced the number you’re comparing against. On top of that, COCO’s primary metric averages AP over 10 IoU thresholds from 0.5 to 0.95 (in increments of 0.05). This is your AP or AP@[.5:.95].
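
Here is a minimal sketch of one of those area calculations, the all-points interpolation (the VOC 2010 style), at a single IoU threshold. The function name average_precision is an assumption for this example; COCO's own evaluator samples the curve at 101 recall points instead, so its numbers will differ slightly.

```python
def average_precision(precisions, recalls):
    """Area under the PR curve using all-points interpolation.

    precisions, recalls: parallel lists ordered by descending confidence,
    so recall is non-decreasing.
    """
    # Pad the curve so it spans recall 0 to 1.
    p = [0.0] + list(precisions) + [0.0]
    r = [0.0] + list(recalls) + [1.0]

    # Build the envelope: each precision becomes the best precision
    # achievable at that recall or any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])

    # Sum rectangle areas wherever recall increases.
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))


precisions = [1.0, 1.0, 2 / 3, 0.75, 0.6]
recalls = [0.25, 0.5, 0.5, 0.75, 0.75]
print(f"AP = {average_precision(precisions, recalls):.4f}")  # AP = 0.6875
```

The envelope step is what smooths out the zigzags in the raw curve: the dip to 0.67 at recall 0.5 is ignored because a precision of 0.75 is still reachable at a higher recall.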

Why average over IoU? Because it rewards detectors that don’t just get “good enough” boxes (IoU > 0.5) but excellent, tightly-fitting boxes (IoU > 0.9). It’s a stricter measure.

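Then, Mean Average Precision (mAP) is simply the mean of the APs across all your object classes. It’s your one-number-to-rule-them-all for a detector’s overall accuracy. When you see a paper say “our model achieves 45.6 mAP on COCO,” this is what they’re talking about: AP averaged over every class and, for COCO, over those 10 IoU thresholds as well. (COCO’s own documentation simply calls this number AP and makes no distinction between AP and mAP.)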
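
The double averaging (over IoU thresholds, then over classes) is easy to lose track of, so here is a small sketch. The AP values in ap_table are hypothetical numbers invented for illustration, not results from any real model, and coco_style_map is a made-up helper name.

```python
# The 10 IoU thresholds 0.50, 0.55, ..., 0.95.
iou_thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def coco_style_map(ap_table):
    """ap_table[class_name] holds one AP value per IoU threshold.

    Returns (per-class AP averaged over thresholds, mean over classes).
    """
    per_class = {name: sum(aps) / len(aps) for name, aps in ap_table.items()}
    overall = sum(per_class.values()) / len(per_class)
    return per_class, overall


# Hypothetical AP values, one per threshold in iou_thresholds.
ap_table = {
    "cat": [0.9, 0.9, 0.8, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1],
    "dog": [0.8, 0.7, 0.6, 0.5, 0.4, 0.4, 0.3, 0.2, 0.1, 0.0],
}
per_class, overall = coco_style_map(ap_table)
for name, ap in per_class.items():
    print(f"{name}: AP@[.5:.95] = {ap:.3f}, AP50 = {ap_table[name][0]:.3f}")
print(f"mAP = {overall:.3f}")
```

Notice how much lower each class's averaged AP is than its AP at 0.5 alone. That gap is the price of demanding tight boxes.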

The COCO Benchmark’s Kitchen Sink

The Common Objects in Context (COCO) dataset didn’t just give us a ton of nice images; it gave us a standardized test. Its evaluation protocol is gloriously comprehensive, and you need to know what all the suffixes mean:

AP@[.5:.95] (or just AP): The main event, as described.
AP50: AP at IoU threshold 0.5. The old, easier standard. If your AP is low but your AP50 is high, your model is sloppy—it finds objects but draws bad boxes.
AP75: AP at IoU threshold 0.75. A much stricter measure of localization.
AP_S, AP_M, AP_L: AP for small, medium, and large objects, bucketed by pixel area. These are often the most telling metrics. If your AP_S is abysmal (and it often is), your model is blind to tiny objects. This is where many real-world applications fail.
AR (Average Recall): Max recall given a fixed number of detections per image (1, 10, 100). It measures how good your model is at not missing things if you let it make enough guesses.
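
For the size-based metrics, COCO splits objects at areas of 32² and 96² pixels. The sketch below buckets an object by area; the function name size_bucket is made up, the handling of areas exactly on a boundary is a simplification, and note that COCO measures area from the annotation (the segmentation mask), so using box area as done in the usage lines is only an approximation.

```python
SMALL_MAX_AREA = 32 ** 2   # 1024 pixels
MEDIUM_MAX_AREA = 96 ** 2  # 9216 pixels


def size_bucket(area):
    """Return 'small', 'medium', or 'large' for an object area in pixels."""
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


# Approximating area with width * height of an [x1, y1, x2, y2] box.
for box in ([10, 10, 30, 30], [0, 0, 50, 50], [0, 0, 200, 100]):
    area = (box[2] - box[0]) * (box[3] - box[1])
    print(box, area, size_bucket(area))
```

Running your ground truth through something like this before training is a cheap way to find out how many small objects you actually have, and therefore how much AP_S should worry you.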

The best practice? Don’t just look at the mAP. Look at the whole suite, especially AP_S and AP_L. It tells you what your model is actually good at, not just its average score. And when you run evaluation, for the love of all that is holy, make sure you’ve run non-maximum suppression (NMS) on your predictions before you feed them to the COCO evaluator. The evaluator won’t deduplicate for you: each ground truth box can match only one detection, so every extra overlapping box counts as a false positive and your score will be a tragedy. I’ve done it. You’ll do it. We all do it once.
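
If your pipeline doesn't already apply NMS, a greedy version for a single class looks roughly like this. It's a plain-Python sketch for understanding, not a replacement for the vectorized implementation in your framework; the function name nms and the (score, box) layout are assumptions for this example.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_thresh=0.5):
    """Greedy NMS for one class. detections: list of (score, box) pairs."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-scoring box still standing
        kept.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_thresh]
    return kept


dets = [
    (0.9, [50, 50, 150, 150]),
    (0.8, [55, 55, 155, 155]),    # near-duplicate of the first box (IoU ~0.82)
    (0.6, [300, 300, 400, 400]),  # a different object
]
print(nms(dets))
# [(0.9, [50, 50, 150, 150]), (0.6, [300, 300, 400, 400])]
```

Run it per class, not across all classes at once, or a dog sitting next to a cat may get suppressed by its neighbor.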