33.1 Object Detection Formulation: Bounding Boxes and Class Labels

Right, let’s get this out of the way. Object detection isn’t just slapping a label on an image and calling it a day. That’s image classification, and frankly, it’s the easy part. We’re after the full monty: not only what is in the picture, but precisely where it is and how many of them there are. This “where” is almost always a bounding box, which is a fancy term for the tightest rectangle you can draw around an object without including your neighbour’s cat photobombing in the corner.

Think of it like this: classification answers “dog.” Detection answers “dog, at these pixel coordinates, and also another smaller dog slightly to the left, probably wondering what the first dog is looking at.”

The Two-Headed Beast: Regression and Classification

Every modern object detector you’ll meet is essentially a two-headed beast. One head is a regressor, a part of the network whose sole job is to predict four continuous numbers. The other is a classifier, predicting a discrete class label (and a confidence score). The magic is getting a single network to do both of these wildly different tasks simultaneously and competently.
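To make the two heads concrete, here is a minimal sketch of how their raw outputs for one candidate object might be combined into a single detection. The names `Detection` and `assemble_detection` are hypothetical, and using a softmax to turn class scores into a confidence is an assumption for illustration; real detectors differ in the details.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    box: list      # four continuous numbers from the regression head
    class_id: int  # index of the highest-scoring class
    score: float   # confidence for that class


def softmax(logits):
    """Turns raw class scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def assemble_detection(box_output, class_logits):
    """Combines the regression output and the classification output."""
    probs = softmax(class_logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return Detection(box=list(box_output), class_id=class_id, score=probs[class_id])


# Hypothetical raw outputs for one candidate object and three classes
det = assemble_detection([0.31, 0.31, 0.31, 0.42], [2.0, 0.5, -1.0])
print(det)
```

The point is only the shape of the output: four numbers for “where”, one label and one score for “what”.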

Those four numbers from the regressor define the bounding box. But here’s the first of many “questionable choices” you’ll encounter: there isn’t one agreed-upon way to represent a box. The two most common formats are:

(x_center, y_center, width, height): The coordinates of the box’s center point, plus its width and height. This is usually normalized by the image dimensions (i.e., all values are between 0 and 1), so the numbers stay the same when the image is resized. YOLO-style labels use this. It’s often considered more stable for training.
(x_min, y_min, x_max, y_max): The coordinates of the top-left and bottom-right corners of the box. Humans tend to think in these terms, but it can be trickier for a network.

You’ll be converting between these two until you dream about it. Here’s how you do it without losing your mind.

def xyxy_to_xywh(box, img_width, img_height):
    """Converts (x_min, y_min, x_max, y_max) to normalized (x_center, y_center, width, height)"""
    x_min, y_min, x_max, y_max = box
    
    x_center = (x_min + x_max) / (2.0 * img_width)
    y_center = (y_min + y_max) / (2.0 * img_height)
    width = (x_max - x_min) / img_width
    height = (y_max - y_min) / img_height
    
    return [x_center, y_center, width, height]

# Example: A box in a 640x480 image
pixel_coords_box = [100, 50, 300, 250] # x_min, y_min, x_max, y_max
normalized_box = xyxy_to_xywh(pixel_coords_box, img_width=640, img_height=480)
print(f"Normalized box (xywh): {normalized_box}")
# Output: [0.3125, 0.3125, 0.3125, 0.416666...]
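Since the conversion goes both ways, here is a sketch of the reverse direction. The function name `xywh_to_xyxy` is my own, chosen to mirror the one above, and it assumes the input box is normalized by the image dimensions.

```python
def xywh_to_xyxy(box, img_width, img_height):
    """Converts normalized (x_center, y_center, width, height) to pixel (x_min, y_min, x_max, y_max)."""
    x_center, y_center, width, height = box

    # Scale everything back to pixel units first
    box_w = width * img_width
    box_h = height * img_height
    cx = x_center * img_width
    cy = y_center * img_height

    # Step half a width/height away from the center in each direction
    return [cx - box_w / 2.0, cy - box_h / 2.0, cx + box_w / 2.0, cy + box_h / 2.0]


# Round trip: the normalized box from the 640x480 example above
recovered = xywh_to_xyxy([0.3125, 0.3125, 0.3125, 200 / 480], img_width=640, img_height=480)
print([round(v, 6) for v in recovered])
# Output: [100.0, 50.0, 300.0, 250.0]
```

The rounding is only there to hide floating-point noise; the values come back as floats, so cast to int yourself if you need pixel indices.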

The Label Format: It’s All About the Ground Truth

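Your training data needs to be in a specific format for the model to digest. For a single image, the label is typically a tensor or list containing multiple entries, one for each object in the image. Each entry contains the class id and the bounding box. The order is crucial. In YOLO-style labels, which we’ll use here, it’s [class_id, x_center, y_center, width, height]. Other datasets differ (COCO’s own annotation files, for instance, store boxes as [x_min, y_min, width, height] in pixels), so always check before you train.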

For a dataset with 80 classes (like COCO), a single object’s label might look like [0, 0.3125, 0.3125, 0.3125, 0.4166] where 0 represents the “person” class. The entire label for an image with two objects would be a list of lists or a 2D array.

# A realistic example label for a single image containing a person and a car.
# Assume class_id 0 for 'person', class_id 2 for 'car' (COCO dataset mapping).
image_ground_truth_labels = [
    [0, 0.3125, 0.3125, 0.3125, 0.4166],  # First object: person
    [2, 0.7500, 0.6000, 0.1500, 0.2000]   # Second object: car
]
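Labels in this layout are often stored as plain text, one object per line, with the five fields separated by spaces. Here is a minimal sketch of reading such lines into the list-of-lists form above. `parse_label_lines` is a hypothetical helper, and the range checks reflect an assumption about what is worth validating, not a fixed standard.

```python
def parse_label_lines(lines, num_classes):
    """Parses 'class_id x_center y_center width height' lines into a list of labels."""
    labels = []
    for line_no, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 5:
            raise ValueError(f"line {line_no}: expected 5 fields, got {len(parts)}")

        class_id = int(parts[0])
        coords = [float(p) for p in parts[1:]]

        # Catch the two classic mistakes: bad class ids and un-normalized boxes
        if not 0 <= class_id < num_classes:
            raise ValueError(f"line {line_no}: class_id {class_id} out of range")
        if any(not 0.0 <= c <= 1.0 for c in coords):
            raise ValueError(f"line {line_no}: coordinates must be in [0, 1]")

        labels.append([class_id, *coords])
    return labels


raw_lines = [
    "0 0.3125 0.3125 0.3125 0.4166",
    "2 0.7500 0.6000 0.1500 0.2000",
]
labels = parse_label_lines(raw_lines, num_classes=80)
print(labels)
# Output: [[0, 0.3125, 0.3125, 0.3125, 0.4166], [2, 0.75, 0.6, 0.15, 0.2]]
```

Failing loudly on pixel coordinates that sneak into a normalized file saves a lot of debugging later.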

The Messy Reality: Anchors, IoU, and The Matching Problem

Here’s where the designers really went to town. You have an image. The model might predict thousands of potential boxes at various locations and scales. But you only have a handful of ground truth objects. How do you decide which predicted box is responsible for which ground truth box? You can’t just say “the closest one.” You need a rigorous, mathematical way.

This is solved with two concepts:

Anchors (or Priors): Pre-defined bounding boxes with fixed sizes and aspect ratios, tiled across the image at various scales. The model doesn’t predict absolute box coordinates from scratch; it predicts offsets from these anchor boxes. It’s a genius way to simplify the learning problem. The network just learns “adjust this default box a bit to the left, make it a little wider,” which is easier than “conjure a perfect box out of the void.”
Intersection over Union (IoU): The metric of choice for judging “closeness.” It’s the area of overlap between two boxes divided by the area of their union. An IoU of 1 is perfect; 0 is no overlap. During training, we assign a ground truth object to an anchor box (and its corresponding prediction) if their IoU is above a certain threshold (e.g., 0.5).
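The “adjust this default box” idea for anchors can be written down as an encode/decode pair. The sketch below uses the center-offset and log-scale parameterization popularized by the R-CNN family. Treat it as one common choice, not the only one, since other detectors use variants. The function names and the example anchor are made up for illustration.

```python
import math


def encode_offsets(gt_box, anchor):
    """Computes regression targets (tx, ty, tw, th) for a ground truth box relative to an anchor.

    Both boxes are in (x_center, y_center, width, height) format.
    """
    x, y, w, h = gt_box
    xa, ya, wa, ha = anchor
    return [
        (x - xa) / wa,     # horizontal shift, in units of anchor width
        (y - ya) / ha,     # vertical shift, in units of anchor height
        math.log(w / wa),  # log of the width ratio
        math.log(h / ha),  # log of the height ratio
    ]


def decode_offsets(offsets, anchor):
    """Inverts encode_offsets: applies predicted offsets to an anchor to get a box."""
    tx, ty, tw, th = offsets
    xa, ya, wa, ha = anchor
    return [xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th)]


anchor = [0.30, 0.30, 0.25, 0.50]          # hypothetical anchor box
gt_box = [0.3125, 0.3125, 0.3125, 0.4166]  # the "person" box from earlier

offsets = encode_offsets(gt_box, anchor)
decoded = decode_offsets(offsets, anchor)
print([round(v, 4) for v in decoded])
```

During training the network is asked to output something close to `offsets`. At inference time, `decode_offsets` turns its predictions back into actual boxes.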

def calculate_iou(box1, box2):
    """Calculates IoU for two boxes in (x_min, y_min, x_max, y_max) format."""
    # Determine coordinates of intersection rectangle
    x1_inter = max(box1[0], box2[0])
    y1_inter = max(box1[1], box2[1])
    x2_inter = min(box1[2], box2[2])
    y2_inter = min(box1[3], box2[3])
    
    # Calculate area of intersection
    inter_area = max(0, x2_inter - x1_inter) * max(0, y2_inter - y1_inter)
    
    # Calculate area of each box and the union
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = box1_area + box2_area - inter_area
    
    # Avoid division by zero
    if union_area == 0:
        return 0.0
        
    iou = inter_area / union_area
    return iou

# Example: Two overlapping boxes
box_a = [10, 10, 100, 100]   # A 90x90 box
box_b = [50, 50, 150, 150]   # A 100x100 box that overlaps with box_a
print(f"IoU: {calculate_iou(box_a, box_b):.3f}")
# Output: IoU: 0.160
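With IoU in hand, the assignment rule itself can be sketched in a few lines. This is a deliberately simplified version: each anchor takes the ground truth box it overlaps most, and counts as a match only if that IoU reaches the threshold. `match_anchors` and `iou_xyxy` are hypothetical names, and real detectors typically add extra rules on top (for example, guaranteeing every ground truth gets at least one anchor).

```python
def iou_xyxy(a, b):
    """IoU for two boxes in (x_min, y_min, x_max, y_max) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_boxes, iou_threshold=0.5):
    """Returns, for each anchor, the index of its matched ground truth box, or -1 for background."""
    matches = []
    for anchor in anchors:
        best_idx, best_iou = -1, 0.0
        for gt_idx, gt in enumerate(gt_boxes):
            iou = iou_xyxy(anchor, gt)
            if iou > best_iou:
                best_idx, best_iou = gt_idx, iou
        # Below the threshold, the anchor is treated as background
        matches.append(best_idx if best_iou >= iou_threshold else -1)
    return matches


anchors = [[0, 0, 100, 100], [40, 40, 160, 160], [300, 300, 400, 400]]
gt_boxes = [[10, 10, 100, 100], [50, 50, 150, 150]]
print(match_anchors(anchors, gt_boxes))
# Output: [0, 1, -1]
```

The first anchor overlaps the first ground truth box with an IoU of 0.81, the second covers the second box with an IoU of about 0.69, and the third touches nothing, so it becomes background.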

This matching process is the heart of the training loop for anchor-based models like Faster R-CNN and the anchor-based versions of YOLO. Get it wrong, and your model will confidently predict boxes in all the wrong places. It’s also the source of many headaches, which is why newer models like DETR said “to hell with anchors altogether” and used a completely different approach (which we’ll get to, and it’s just as wild). But for now, remember: detection is a game of high-stakes matching, and IoU is the judge.