33.5 DETR: End-to-End Detection with Transformers

Right, so you’ve slogged through the two-stage R-CNN family and the one-stage YOLO grid. You’re probably thinking, “Isn’t there a less… hacky way to do this?” A way that doesn’t involve anchor boxes, non-max suppression, and all that hand-woven, heuristic nonsense?

Enter DETR. The paper’s title says it all: “End-to-End Object Detection with Transformers.” It’s a bold claim. It looks at the last 10 years of computer vision and says, “That’s cute.” Instead of anchor grids and carefully engineered proposal systems, it bolts a Transformer encoder-decoder, the same kind that took over natural language processing, onto an ordinary CNN backbone. And it works. It’s brilliantly simple conceptually, even if the devil is in the details.

The core idea is so stupidly simple it’s genius: treat object detection as a set prediction problem. We have a fixed, learned set of “object queries” that the model uses to ask the image, “Hey, is there something here?” The Transformer decoder then spits out a set of predictions: exactly N of them, no more, no less. In the standard model, N is set to 100, comfortably more than the number of objects in a typical image. A fixed budget alone doesn’t stop duplicates, though. What kills the need for non-max suppression is that the queries can see each other through decoder self-attention and the training loss rewards exactly one prediction per object, so spamming a pile of boxes on one object simply doesn’t pay.

The Architecture Breakdown: It’s Just a Transformer, Really

DETR has three main parts: a CNN backbone, an encoder-decoder Transformer, and a simple prediction head.

First, a CNN backbone (like a ResNet) takes your image and spits out a lower-resolution feature map. This is our “context.” This feature map is then flattened spatially into a sequence of feature vectors, combined with a positional encoding (absolutely critical, since attention on its own has no idea where in the image each vector came from), and fed into the Transformer encoder.
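The flatten-and-add-positions step can be sketched in a few lines of plain Python. This is a simplified illustration, not DETR's actual code: the function names, the tiny 2x3 feature map, and the 8-dimensional vectors are all made up for the example, and the encoding is added once to the tokens just to keep things short.

```python
import math

def flatten_feature_map(fmap):
    """Turn an H x W grid of feature vectors into a list of H*W tokens (row-major)."""
    return [vec for row in fmap for vec in row]

def sine_position_2d(y, x, dim, temperature=10000.0):
    """Simplified 2D sinusoidal encoding: first half encodes y, second half encodes x.

    Assumes dim is divisible by 4.
    """
    half = dim // 2

    def encode(pos):
        out = []
        for i in range(0, half, 2):
            freq = temperature ** (i / half)
            out.append(math.sin(pos / freq))
            out.append(math.cos(pos / freq))
        return out

    return encode(y) + encode(x)

# Hypothetical toy feature map: 2 rows, 3 columns, 8 channels, all zeros.
H, W, DIM = 2, 3, 8
fmap = [[[0.0] * DIM for _ in range(W)] for _ in range(H)]

tokens = flatten_feature_map(fmap)
positions = [(y, x) for y in range(H) for x in range(W)]

# Add the positional encoding to each token so identical features at
# different locations become distinguishable.
encoded = [
    [f + p for f, p in zip(tok, sine_position_2d(y, x, DIM))]
    for tok, (y, x) in zip(tokens, positions)
]
print(len(encoded), len(encoded[0]))
```

Every token started out identical here, so any difference between entries of `encoded` comes purely from position, which is exactly the information the encoder would otherwise lack.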

The encoder’s job is to let every part of the image talk to every other part, building a rich, global representation. This is where it gets its edge over convolutional methods that have a limited receptive field without many layers.

Now, the star of the show: the decoder. It takes in two things. First, the refined encoded features. Second, the object queries. These are just learned embeddings. Think of them as a set of questions like “Is there something large in the top left?” or “Is there something small near the center?” that the model has learned to ask. The decoder uses self-attention to let the queries coordinate with each other, and cross-attention to let each query look at all the encoded image features and then, based on that, make a prediction.
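To make "each query looks at all the encoded image features" concrete, here is a bare-bones sketch of single-head scaled dot-product attention in plain Python. It leaves out the learned projection matrices, multiple heads, and everything else a real decoder layer has; the feature vectors and queries are invented toy values.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, features):
    """One query attends over all feature vectors (single head, no projections).

    Returns the attention weights and the weighted average of the features.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)
    ]
    return weights, attended

# Toy "encoded image": three positions with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Two toy "object queries", each tuned to a different feature direction.
queries = [[4.0, 0.0], [0.0, 4.0]]

for q in queries:
    weights, attended = cross_attend(q, features)
    print([round(w, 3) for w in weights], [round(a, 3) for a in attended])
```

The two queries end up weighting the same three positions differently, which is the whole trick: different learned queries pull different evidence out of the same image.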

Each decoder output is fed through a simple feed-forward network (FFN) that acts as the prediction head. This head predicts a class (including a “no object” class, denoted as ∅) and a bounding box (center x, center y, width, height), expressed relative to the image size, directly. No anchors, no offsets. Just direct regression.
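If you want to draw one of these boxes, you have to turn (center x, center y, width, height) into pixel corners. A minimal sketch, assuming the four numbers are normalized to the range 0 to 1 relative to the image; the function name and the 640x480 image are hypothetical.

```python
def cxcywh_to_xyxy(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1) corners."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * image_width
    y0 = (cy - h / 2) * image_height
    x1 = (cx + w / 2) * image_width
    y1 = (cy + h / 2) * image_height
    return (x0, y0, x1, y1)

# A box centered in a 640x480 image, a quarter as wide and half as tall.
print(cxcywh_to_xyxy((0.5, 0.5, 0.25, 0.5), 640, 480))
# → (240.0, 120.0, 400.0, 360.0)
```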

The Bipartite Matching Loss: The Secret Sauce

This is the real magic. How do you train this? You can’t just compare prediction 1 to object 1 and prediction 2 to object 2, because the model predicts an unordered set. The order of its 100 predictions carries no meaning. The loss needs to match each prediction to a ground truth object first.

It does this with Hungarian matching. It finds the optimal one-to-one assignment between the set of predictions and the set of ground truth objects (padded with “no object” to make the sets the same size), choosing the assignment with the lowest total matching cost. Only after this matching is done do we calculate the loss: a classification loss like cross-entropy for every prediction (the ones paired with padding are trained to say “no object”), plus a box loss (L1 and GIoU) for the predictions paired with a real object. Because each object gets exactly one prediction, a second box on the same object is scored as a wrong “no object” call, and that is what removes the duplicate-box nonsense that requires NMS.
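Here is a toy version of the matching step. To keep it dependency-free it brute-forces every permutation instead of running the Hungarian algorithm, which only works for a handful of predictions; in practice you would reach for something like `scipy.optimize.linear_sum_assignment`. The cost is a simplified stand-in (negative class probability plus weighted L1 box distance, with the GIoU term left out), and the weight of 5.0, the dictionaries, and the example values are assumptions made for illustration.

```python
from itertools import permutations

NO_OBJECT = None  # padding entry used to make both sets the same size

def l1(a, b):
    """L1 distance between two boxes given as equal-length tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pair_cost(pred, target, box_weight=5.0):
    """Cost of assigning one prediction to one (possibly padded) target."""
    if target is NO_OBJECT:
        return 0.0  # padding contributes nothing to the matching cost
    return -pred["probs"][target["label"]] + box_weight * l1(pred["box"], target["box"])

def match(preds, targets):
    """Return a list where entry i is the target index matched to prediction i, or None."""
    padded = targets + [NO_OBJECT] * (len(preds) - len(targets))
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(preds))):
        # perm[i] is the prediction assigned to padded target i
        cost = sum(pair_cost(preds[p], padded[i]) for i, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    assignment = [None] * len(preds)
    for target_idx, pred_idx in enumerate(best_perm):
        if padded[target_idx] is not NO_OBJECT:
            assignment[pred_idx] = target_idx
    return assignment

# Three predictions; probs are [class 0, class 1, no-object].
preds = [
    {"probs": [0.1, 0.1, 0.8], "box": (0.9, 0.9, 0.1, 0.1)},
    {"probs": [0.1, 0.8, 0.1], "box": (0.2, 0.3, 0.1, 0.1)},
    {"probs": [0.9, 0.05, 0.05], "box": (0.5, 0.5, 0.4, 0.4)},
]
# Two ground truth objects.
targets = [
    {"label": 0, "box": (0.5, 0.5, 0.4, 0.5)},
    {"label": 1, "box": (0.2, 0.3, 0.1, 0.1)},
]

print(match(preds, targets))
# → [None, 1, 0]
```

Prediction 2 gets the first object, prediction 1 gets the second, and prediction 0 is left paired with padding, so during training it would be pushed toward “no object.”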

import torch
from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image
import requests

# Let's not reinvent the wheel. Here's how you use the Hugging Face implementation.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the pre-trained processor and model
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

# Process the image and run inference (no gradients needed)
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Post-process... and note, NO NMS!
# PIL's image.size is (width, height); the post-processor expects (height, width)
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]

# Print results
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"Detected {model.config.id2label[label.item()]} with confidence {round(score.item(), 3)} at location {box}"
    )
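
So if there is no NMS, what is that threshold doing? Roughly speaking, each of the 100 slots gets a softmax over the classes plus “no object,” and only slots whose best real-class score clears the threshold are kept. The sketch below illustrates that idea in plain Python. It is not the library's code: the function name, the three-slot logits, and the convention that the last entry is the no-object class are assumptions for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def keep_confident(class_logits, threshold=0.9):
    """Keep (query index, label, score) for slots confident about a real class.

    class_logits holds one list per query; the LAST entry is the no-object class.
    """
    kept = []
    for query_idx, logits in enumerate(class_logits):
        probs = softmax(logits)
        object_probs = probs[:-1]  # drop the no-object column
        score = max(object_probs)
        label = object_probs.index(score)
        if score > threshold:
            kept.append((query_idx, label, score))
    return kept

# Three toy slots: a confident detection, a confident "nothing here",
# and a slot that can't make up its mind.
class_logits = [
    [8.0, 0.0, 0.0],
    [0.0, 0.0, 6.0],
    [1.0, 1.2, 1.0],
]
kept = keep_confident(class_logits)
print([(q, label, round(score, 3)) for q, label, score in kept])
```

Only the first slot survives. Nothing here compares boxes against each other, which is the point: filtering is per-slot, with no suppression step.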

The Rough Edges and Why You Should Care

DETR is not perfect. The authors would be the first to tell you that.

First, training is slow. The cost of self-attention grows quadratically with the number of feature-map positions, so every step is expensive, and on top of that the model needs hundreds of epochs to converge on COCO, while YOLO or Faster R-CNN are much quicker off the mark. Follow-ups like Deformable DETR attack this with deformable attention, but the vanilla model is a compute hog.
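A bit of back-of-the-envelope arithmetic shows how fast the quadratic term bites. This sketch assumes an 800x1200 input (an illustrative size, not a figure from this section), a feature map whose sides are the image sides divided by the stride and rounded up, and it counts only the entries of one encoder self-attention matrix.

```python
import math

def num_tokens(height, width, stride=32):
    """Number of feature-map positions after downsampling by `stride`."""
    return math.ceil(height / stride) * math.ceil(width / stride)

def attention_entries(height, width, stride=32):
    """Entries in one full self-attention matrix (every token attends to every token)."""
    n = num_tokens(height, width, stride)
    return n * n

for stride in (32, 16, 8):
    n = num_tokens(800, 1200, stride)
    print(f"stride {stride:>2}: {n:>6} tokens, {attention_entries(800, 1200, stride):>12,} attention entries")
```

Halving the stride roughly quadruples the token count and multiplies the attention matrix by about sixteen, which is why simply feeding the Transformer a higher-resolution feature map is not a cheap fix.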

Second, it struggles with small objects. The CNN backbone creates a feature map that’s downsampled by a factor of 32. A 10x10 pixel object becomes less than a single feature vector. The global attention is great for context, but it can blur the fine-grained details needed for tiny objects. Convolutions, with their innate locality, often handle this better.
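The “less than a single feature vector” claim is easy to check with arithmetic. A tiny sketch, assuming each feature-map cell summarizes a stride x stride patch of pixels; the function name is made up.

```python
def cell_area_fraction(object_width, object_height, stride=32):
    """Object area measured in units of one feature-map cell (stride x stride pixels)."""
    return (object_width * object_height) / (stride * stride)

print(cell_area_fraction(10, 10))   # a 10x10 object at stride 32
print(cell_area_fraction(64, 64))   # a 64x64 object at stride 32
```

The 10x10 object covers just under a tenth of one cell, while the 64x64 object spans four full cells' worth of area.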

Third, the fixed set of 100 predictions is a double-edged sword. It’s great for not having duplicates, but it’s a nightmare if your use case involves more than 100 objects. You can increase N, but then you’re just making the computational problem worse.

So, do you use it today? For most production tasks, the answer is still probably “no” – YOLO is just faster and smaller. But for proof-of-concepts, for research, and for applications where the sheer simplicity of an end-to-end pipeline is worth the computational cost, it’s a breathtakingly beautiful model. It showed us that the anchors-and-NMS pipeline that had defined object detection for years wasn’t the only path forward, and for that, we owe it a lot.