36.5 Flamingo and Idefics: Multi-Image Understanding

Right, so you’ve got a model that can handle text. Big deal. You’ve got one that can handle a single image. Cute. But the real world doesn’t work like that. Your problems are messier. You’ve got a diagram and a chart. A product photo and a scrawled note from a client. This is where the real magic happens: models that can juggle multiple images and text in a single, coherent thought. And for that, we have to talk about the pioneers: Flamingo and its open-source spiritual successor, Idefics.

These models didn’t just add a second image input; they rethought the entire architecture to enable what’s called interleaved multimodal understanding. You can feed them a prompt like: [Image of a cat] “What is this animal?” [Image of a dog] “And this one?” “Which one is bigger?” And it will use both images to reason about the answer. It’s the difference between having two separate conversations and one conversation where you can point at two different things.

The Secret Sauce: Perceiver Resampler and Gated Cross-Attention

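The genius of Flamingo wasn’t just in the scale (though it was massive), but in two clever architectural hacks that made this interleaving possible without retraining the entire universe from scratch: the pre-trained vision encoder and language model stay frozen, and only the new connecting layers are trained.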

First, the Perceiver Resampler. You have a powerful, pre-trained vision encoder (Flamingo used an NFNet; open reimplementations typically use a CLIP ViT) that spits out a grid of feature vectors for an image. But you can’t just shove hundreds of feature vectors per image into a language model built for word tokens; it would be a computational nightmare, and the count changes with the image size. The Perceiver Resampler is a lightweight attention module that takes that messy grid of image features and “summarizes” it into a fixed, smaller number of visual tokens (64 per image in Flamingo). It does this with a small set of learned query vectors that cross-attend to the image features; each query comes back as one visual token. Think of it as a smart compressor: it takes the most salient information from the image and packages it into a neat little bundle the language model can actually digest.
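The resampling idea can be sketched in a few lines of PyTorch. Treat this as a minimal illustration rather than DeepMind’s implementation: the class name `PerceiverResamplerSketch`, the depth, and the layer sizes are assumptions chosen for readability, and details such as positional embeddings are left out. What it does show is the core trick: a fixed set of learned latents queries however many image features the encoder produced, so the output size never changes.

```python
import torch
from torch import nn


class PerceiverResamplerSketch(nn.Module):
    """Compress a variable number of image features into a fixed set of visual tokens.

    Hypothetical, simplified module for illustration only.
    """

    def __init__(self, dim, num_latents=64, num_heads=8, depth=2):
        super().__init__()
        # Learned queries: one per output visual token.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "norm_q": nn.LayerNorm(dim),
                "norm_kv": nn.LayerNorm(dim),
                "attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "ffn": nn.Sequential(
                    nn.LayerNorm(dim),
                    nn.Linear(dim, 4 * dim),
                    nn.GELU(),
                    nn.Linear(4 * dim, dim),
                ),
            })
            for _ in range(depth)
        ])

    def forward(self, image_features):
        # image_features: (batch, num_patches, dim); num_patches may vary between calls.
        batch = image_features.shape[0]
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        for layer in self.layers:
            q = layer["norm_q"](latents)
            # Keys/values are the image features concatenated with the latents.
            kv = torch.cat([layer["norm_kv"](image_features), q], dim=1)
            latents = latents + layer["attn"](q, kv, kv, need_weights=False)[0]
            latents = latents + layer["ffn"](latents)
        return latents  # (batch, num_latents, dim)


resampler = PerceiverResamplerSketch(dim=32, num_latents=64, num_heads=4)
small = resampler(torch.randn(2, 100, 32))  # 100 patch features per image
large = resampler(torch.randn(2, 257, 32))  # 257 patch features per image
print(small.shape, large.shape)  # both are (2, 64, 32)
```

Whatever the encoder hands over, the language model always sees the same number of visual tokens per image, which is what keeps the cost of adding more images predictable.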

Second, Gated Cross-Attention. This is the real star. The language model (a pre-trained Chinchilla model, in Flamingo’s case) has its standard self-attention layers, the ones that let words attend to other words. The Flamingo team froze those layers and interleaved new, additional cross-attention layers between them. These layers are gated: each one scales its output by the tanh of a learnable scalar that starts at zero, so at the beginning of training the model behaves exactly like the original language model, and it then learns how much to actually trust the visual input versus just relying on text. Inside a cross-attention layer, the text tokens are the queries and the visual tokens (the output of the Perceiver Resampler) are the keys and values, which lets the text “attend to” or “look at” the visual information. Each text token is only allowed to look at the image that most recently precedes it in the prompt. This is how the model connects the words “what is this animal” to the specific image you placed right before them.

# This is a conceptual illustration of a Flamingo-style block. It runs, but it is
# simplified: no layer norms, feed-forward layers, or attention masks.

import torch
from torch import nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        # New cross-attention layer, trained from scratch
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stand-in for the frozen language model's text self-attention
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable gate, initialized to 0 so tanh(gate) = 0 and the block starts as text-only
        self.gate = nn.Parameter(torch.tensor(0.0))

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (batch, text_len, dim); visual_tokens: (batch, num_visual, dim)

        # Step 1: Cross-attention: text queries attend to visual key/values.
        # The gate controls how much of this cross-attention info we add.
        cross_info = self.cross_attn(text_tokens, visual_tokens, visual_tokens)[0]
        text_tokens = text_tokens + torch.tanh(self.gate) * cross_info

        # Step 2: Normal text self-attention, with its residual connection
        self_info = self.self_attn(text_tokens, text_tokens, text_tokens)[0]
        return text_tokens + self_info
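
One detail the block above leaves out is which visual tokens a given text token is allowed to see. In the Flamingo paper, each text token cross-attends only to the image that most recently precedes it in the interleaved sequence. The pure-Python sketch below builds that mask; the function names and the list-of-booleans representation are assumptions made for illustration, not anything from a released codebase.

```python
def last_image_index(is_image_token):
    """For each position, the index of the most recent image placeholder
    at or before it, or -1 if no image has appeared yet."""
    indices, current = [], -1
    for flag in is_image_token:
        if flag:
            current += 1
        indices.append(current)
    return indices


def cross_attention_mask(is_image_token, num_images, tokens_per_image):
    """Boolean matrix [sequence_position][visual_token]; True means 'may attend'."""
    mask = []
    for visible_image in last_image_index(is_image_token):
        row = []
        for image in range(num_images):
            row.extend([image == visible_image] * tokens_per_image)
        mask.append(row)
    return mask


# Sequence: <image> "What is this?" <image> "And this?"
# (one text token per question here, to keep the example small)
is_image_token = [True, False, True, False]
mask = cross_attention_mask(is_image_token, num_images=2, tokens_per_image=2)
for row in mask:
    print(row)
# [True, True, False, False]
# [True, True, False, False]
# [False, False, True, True]
# [False, False, True, True]
```

Note the polarity if you wire this into `nn.MultiheadAttention`: its boolean `attn_mask` marks positions that are not allowed to attend, so this matrix would need to be inverted first.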

From Flamingo to Idefics: The Open-Source Heir

Flamingo, developed by DeepMind, was a landmark paper. It was also completely closed. You couldn’t run it. You couldn’t fine-tune it. This, frankly, was a massive bummer for everyone not named DeepMind.

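Enter Idefics from Hugging Face. Idefics is the open-source answer. It’s not a clone, but a spiritual successor, and the first release replicates the core architecture: it uses a Perceiver Resampler on top of a vision encoder (a CLIP ViT) and grafts it onto a powerful language model (LLaMA) using gated cross-attention. Its follow-up, Idefics2, keeps the resampler-style pooling but swaps in a SigLIP vision encoder and Mistral-7B, and drops the cross-attention layers entirely: the pooled visual tokens are projected and placed directly into the language model’s input sequence alongside the text. The best part? You can actually use it right now.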

# This is actual, runnable code using the transformers library.
from transformers import Idefics2ForConditionalGeneration, AutoProcessor
import torch
from PIL import Image
import requests

# Load the model and processor. This will pull roughly 16 GB of weights (8B parameters in bfloat16), so get a coffee.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
    device_map=device
)
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

# Create an interleaved prompt: image + text + image + text
url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/idefics2/cat.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/idefics2/dog.jpg"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

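# Every content entry is a dict: {"type": "image"} placeholders and {"type": "text"} parts.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What's this animal?"},
        {"type": "image"},
        {"type": "text", "text": "And this one?"},
        {"type": "text", "text": "Which one is more fluffy?"},
    ]}
]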

# Prepare the inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt").to(device)

# Generate a response
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

Pitfalls and Battle-Tested Practices

This power comes with responsibility, and a few headaches.

Order Matters. A Lot. The model associates images in the prompt with the special <image> tokens in the order they appear: the first image you pass fills the first placeholder, the second fills the second, and so on. If your prompt is [image1] [image2] "compare these", the model will know which is which. If you mess up the order, it will give you a confident, spectacularly wrong answer. Always double-check your image sequencing.
Resolution and Aspect Ratio are Killers. Many of these models have a fixed, and often low, resolution for images (e.g., 320px or 448px). Squeezing a high-resolution schematic or a wide-format dashboard into that tiny square loses a catastrophic amount of detail. Idefics2 is more forgiving (it accepts images at their native aspect ratio, up to 980px on a side), but anything larger still gets downscaled. Pre-processing is key: cropping to the relevant part and resizing appropriately is not just a good idea, it’s mandatory for accuracy.
The Context Window is Your Real Bottleneck. Each image is converted into a fixed block of visual tokens (64 per image in Idefics2, or 320 with its image-splitting option turned on), and in models that place those tokens in the input sequence, they come straight out of your context budget. A conversation with multiple high-resolution images can burn through your context window faster than a team of interns at an open bar. You’ll quickly hit a “max length” error. Be ruthless in your prompt design. Use fewer, more relevant images.
They Hallucinate. Especially on Text-in-Images. Ask a model to read text from a blurry screenshot or a handwritten note, and it will often just make something up that seems plausible based on the context. It’s a vision-language model, not an OCR system. For any task where precise text extraction is needed, you’re still better off running a dedicated OCR tool like Tesseract and then feeding the text into the model.
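
Several of these headaches can be caught before the model ever sees a prompt. The helpers below are a hedged sketch of such pre-flight checks: the function names are hypothetical, and the tokens-per-image, context-window, and maximum-side values are numbers you would supply from your model’s documentation, not constants baked into any library.

```python
def count_image_placeholders(messages):
    """Count {"type": "image"} entries across all chat messages."""
    return sum(
        1
        for message in messages
        for part in message["content"]
        if isinstance(part, dict) and part.get("type") == "image"
    )


def fit_within(width, height, max_side):
    """Target size that caps the longer side at max_side, preserving aspect ratio."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def visual_token_budget(num_images, tokens_per_image, context_window, reserved_for_text):
    """Return (visual tokens used, tokens left over after text and images)."""
    used = num_images * tokens_per_image
    return used, context_window - reserved_for_text - used


messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What's this animal?"},
        {"type": "image"},
        {"type": "text", "text": "And this one?"},
    ]}
]
images = ["cat.jpg", "dog.jpg"]  # stand-ins for loaded images

# Order/count check: one image per placeholder, in the same order.
assert count_image_placeholders(messages) == len(images)

# A 4000x1000 dashboard capped at 980px on the long side (illustrative limit).
print(fit_within(4000, 1000, 980))  # (980, 245)

# Three images at 64 tokens each, 8192-token window, 1024 tokens kept for text
# (all illustrative numbers).
print(visual_token_budget(3, 64, 8192, 1024))  # (192, 6976)
```

None of this replaces looking at the images yourself, but a failed count check is a much cheaper signal than a confident, wrong answer.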

The early designers made a questionable choice in standardizing on such low image resolutions, likely for computational ease. On Flamingo-style models, it’s the biggest practical limitation you’ll face. But despite the rough edges, these models are nothing short of revolutionary. They’re the first real step towards AI that can understand the chaotic, multimodal world the way we do.