30.6 Vision: Analyzing Images and Documents

Right, let’s talk about getting Claude to open its eyes. Vision isn’t a separate API, and it isn’t about slapping an image into a prompt and hoping for the best. It’s the same Messages API you already use, with image and document blocks sitting alongside your text. Think of it as giving Claude a new pair of glasses and teaching it how to read the fine print. At the time of writing, the supported image formats are JPEG, PNG, GIF, and WebP, and PDFs are accepted as documents. Other formats, DOCX included, need converting to one of those first. Within those limits, the model doesn’t just see the pixels; it understands the content. This is where we move from a fancy chatbot to a genuine analysis engine.

The first thing you need to know is that you can’t attach raw image bytes to a request. You, or more accurately your client code, have to do the legwork of encoding the image as a base64 string. It feels a bit like preparing a slide for a microscope: you have to get the sample on the glass just right before the scientist can take a look. Why base64? Because it’s a reliable, text-safe way to represent binary data in a JSON payload, which is the lingua franca of APIs. It’s a bit bulky (the encoded string is about a third larger than the original file), but it works. The API also accepts other source types, such as a URL reference, so check the current documentation; base64 is the form used throughout this section.
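To make the encoding step concrete, here is a minimal sketch using only the Python standard library. The helper name `encode_image_bytes` is made up for illustration, and the sample data is just the 8-byte PNG file signature standing in for a real image.

```python
import base64


def encode_image_bytes(raw: bytes) -> str:
    """Return the base64 text form of raw bytes, ready to drop into a JSON payload."""
    return base64.b64encode(raw).decode("ascii")


# The 8-byte signature that starts every PNG file, used here as stand-in data.
sample = b"\x89PNG\r\n\x1a\n"
encoded = encode_image_bytes(sample)

# Base64 turns every 3 input bytes into 4 output characters, padding with "=".
expected_length = 4 * ((len(sample) + 2) // 3)

print(encoded)                        # text-safe representation of the bytes
print(len(sample), "->", len(encoded), "characters")
```

The 3-to-4 ratio is where the size overhead comes from, which is worth remembering when you start sending multi-megabyte scans.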

Here’s the basic anatomy of a vision message. Notice that the content is not just a string; it’s a list of blocks, each one a dictionary with a type field. This structure is your friend.

# A single image message for the Claude API
vision_message = {
    "role": "user",
    "content": [
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",  # This is critical!
                "data": "base64_encoded_string_goes_here", # This would be very long
            }
        },
        {
            "type": "text",
            "text": "What's in this image? Be thorough."
        }
    ]
}

The Crucial `media_type` Field

This is the number one rookie mistake. You cannot just encode your image and throw it over the wall. You must tell Claude what format it’s in using the media_type field. Get this wrong, and it’s like trying to play a DVD in a CD player: a lot of whirring and a disappointing error message. For a JPEG, it’s image/jpeg. For a PNG, it’s image/png. GIF and WebP follow the same pattern with image/gif and image/webp. A PDF uses application/pdf, but it travels in a block of type document rather than type image. This isn’t a suggestion; it’s a requirement. The API uses this field to know how to interpret the decoded base64 data.
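One way to avoid mislabeling a file is to look at its first few bytes instead of trusting the extension. The sketch below is an illustration, not part of any SDK: `sniff_media_type` is a hypothetical helper, and the list of accepted types reflects my understanding of what the API supports, so verify it against the current documentation.

```python
def sniff_media_type(data: bytes) -> str:
    """Guess the media type from well-known file signatures (magic bytes)."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    raise ValueError("unrecognized file signature; refusing to guess a media type")


# Stand-in headers only; real files would be read with open(path, "rb").
print(sniff_media_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16))
print(sniff_media_type(b"GIF89a" + b"\x00" * 16))
```

Raising on unknown data is a deliberate choice here: a loud failure in your own code is easier to debug than a rejected request.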

Mixing Text and Image Context

You’ll almost never just send an image. The power comes from combining the visual input with a textual directive. Your text prompt is the question you’re asking about the image. Be specific! “What’s in this image?” is fine for a demo. “Extract the total invoice amount, the due date, and the vendor name from this receipt, and format them as JSON” is where you start to earn your paycheck. Structure your prompt to guide Claude’s analysis toward your desired output.

# A more practical example using the Anthropic Python SDK
import base64
import anthropic

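# Initialize the client. With no arguments, the SDK reads the ANTHROPIC_API_KEY
# environment variable, which keeps the key out of source control.
client = anthropic.Anthropic()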

# Read and encode your image
with open("complicated_chart.png", "rb") as image_file:
    image_data = base64.b64encode(image_file.read()).decode("utf-8")

# Send the prompt
message = client.messages.create(
    model="claude-3-sonnet-20240229",  # model IDs get retired; substitute a current vision-capable model
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "This chart is from our quarterly report. Summarize the key trend it shows and suggest one question a board member might ask about it."
                }
            ]
        }
    ]
)

print(message.content[0].text)
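The receipt prompt described earlier (total, due date, vendor name, formatted as JSON) can be wrapped in a small builder so the instructions stay consistent from call to call. This is a sketch under assumptions: `build_extraction_message` and the field names are hypothetical, and the base64 string is placeholder data rather than a real receipt. Nothing here calls the API; it only builds the message dictionary.

```python
import json


def build_extraction_message(image_b64: str, media_type: str, fields: list[str]) -> dict:
    """Build a user message that pairs an image block with a specific extraction request."""
    field_list = ", ".join(fields)
    prompt = (
        f"Extract the following fields from this receipt: {field_list}. "
        "Respond with a single JSON object using exactly those keys and no other text. "
        "Use null for any field you cannot read."
    )
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {"type": "base64", "media_type": media_type, "data": image_b64},
            },
            {"type": "text", "text": prompt},
        ],
    }


# Placeholder base64 data and hypothetical field names, for illustration only.
receipt_message = build_extraction_message(
    "iVBORw0KGgo=", "image/png", ["vendor_name", "due_date", "total_amount"]
)
print(json.dumps(receipt_message, indent=2))
```

Telling the model what to do with unreadable fields (here, null) tends to be safer than letting it guess.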

Document Analysis: Claude’s Party Trick

While analyzing photos is cool, where this feature truly shines is in document processing. Toss a 10-page PDF terms of service document at Claude and ask it to list the top 5 most onerous clauses in plain English. It will do it without complaining. The key here is that Claude works with the meaning of the document rather than just performing OCR. It can reason about the concepts, not just read the words.
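For PDFs, the message looks much like the image case, except that the block type is document. The sketch below shows that shape as I understand it from the API’s PDF support; treat the exact block layout as an assumption to confirm against the current documentation. `build_pdf_message` is a hypothetical helper, and the sample bytes are only a PDF-style header, not a valid file.

```python
import base64


def build_pdf_message(pdf_bytes: bytes, question: str) -> dict:
    """Build a user message that pairs a base64-encoded PDF with a question about it."""
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("data does not start with a PDF header")
    return {
        "role": "user",
        "content": [
            {
                "type": "document",  # assumed block type for PDFs, not "image"
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": base64.b64encode(pdf_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": question},
        ],
    }


# Stand-in bytes; a real call would use open("terms.pdf", "rb").read().
fake_pdf = b"%PDF-1.7\n% placeholder content, not a valid document\n"
pdf_message = build_pdf_message(
    fake_pdf, "List the five most onerous clauses in plain English."
)
print(pdf_message["content"][0]["type"], pdf_message["content"][0]["source"]["media_type"])
```

The resulting dictionary goes into the messages list of client.messages.create exactly as the image message did.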

Pitfalls and Performance

A few hard truths. First, resolution and detail matter. If you send a tiny, low-res thumbnail of a dense schematic and ask it to trace a specific circuit, it will fail. As a rule of thumb, you’re limited by what a human could reasonably see at that resolution. Second, this is more expensive and slower than text-only prompts. Images are converted to input tokens, so bigger images cost more, and latency will be higher. It’s worth it, but don’t expect sub-second responses on a large document. Finally, always sanity-check the output. For critical tasks like data extraction, implement a validation step. Claude is brilliant, but it’s not a database. It can hallucinate or misread a smudged number, just like a human intern might.
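A validation step does not have to be elaborate. Here is a minimal sketch that checks a JSON reply before anything downstream trusts it. The field names, the ISO date format, and the `validate_extraction` helper are all assumptions chosen for illustration; adapt them to whatever your prompt actually asks for.

```python
import json
from datetime import date
from decimal import Decimal, InvalidOperation

REQUIRED_FIELDS = ("vendor_name", "due_date", "total_amount")  # hypothetical schema


def validate_extraction(raw_reply: str) -> dict:
    """Parse a model reply and reject it unless every field is present and well-formed."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")

    missing = [name for name in REQUIRED_FIELDS if data.get(name) in (None, "")]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    try:
        due = date.fromisoformat(str(data["due_date"]))
    except ValueError as exc:
        raise ValueError(f"due_date is not an ISO date: {data['due_date']!r}") from exc

    try:
        total = Decimal(str(data["total_amount"]))
    except InvalidOperation as exc:
        raise ValueError(f"total_amount is not a number: {data['total_amount']!r}") from exc
    if not total.is_finite() or total < 0:
        raise ValueError(f"total_amount is out of range: {total}")

    return {"vendor_name": str(data["vendor_name"]), "due_date": due, "total_amount": total}


# A hand-written reply standing in for real model output.
checked = validate_extraction(
    '{"vendor_name": "Acme Supplies", "due_date": "2024-03-31", "total_amount": "1249.50"}'
)
print(checked)
```

Anything that fails validation can be retried, routed to a human, or logged, which is usually cheaper than letting a misread total reach your accounting system.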
