36.3 GPT-4o Vision: Understanding Images with LLMs

Alright, let’s get our hands dirty with GPT-4o’s vision capabilities. Forget the old days of stitching together a separate image classifier and a language model and hoping they’d get along. GPT-4o (“o” for “omni”) is natively multimodal. That means it was trained from the ground up on images, text, and audio all at once. It doesn’t “see” an image in the way you and I do; it processes it into a sequence of tokens, much like it does with text. This is the magic trick: it speaks one common language for multiple types of information. The result? It’s scarily good at understanding the content, context, and even the humor in your pictures.

The Basic API Call: Throwing an Image at the Model

The most common way you’ll interact with this is through the Chat Completions API. You provide a list of messages, and one (or more) of those messages can contain an image. You have two ways to hand the image over: point to a publicly reachable URL, or, for a local file, encode the bytes in base64 and slap them right into the JSON payload as a data URL. The second option feels a bit like sending a letter by first photocopying every page onto a single, enormous scroll, but it works.

Here’s the canonical “what’s in this picture?” example. We’ll use the OpenAI Python library to keep things clean.

import base64
import openai

# Path to your local image
image_path = "path/to/your/messy_desk.jpg"

# Encode the image to base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image(image_path)

# With no arguments, the client reads OPENAI_API_KEY from the environment,
# which keeps the key out of your source code
client = openai.OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image? Be thorough."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)

You’ll get back a detailed description like, “The image shows a cluttered desk with a laptop open to a code editor, a half-empty coffee mug that says ‘I hate mornings’, two books stacked precariously: ‘Design Patterns’ and ‘Clean Code’, and what appears to be a pet cat sleeping on a notebook in the corner.”
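
If you would rather not care whether an image lives on disk or on the web, a small helper can build the image part of the message for you. This is a minimal sketch, not part of the OpenAI library: the name image_part is made up for this example, the example.com URL is a placeholder, and the MIME type is guessed from the file extension with a JPEG fallback. The optional "detail" field ("low", "high", or "auto") is a hint to the API about how much resolution to spend on the image.

```python
import base64
import mimetypes


def image_part(source, detail="auto"):
    """Build an image content part from an http(s) URL or a local file path.

    Hypothetical helper for illustration; not an OpenAI library function.
    """
    if source.startswith(("http://", "https://")):
        # Remote image: pass the URL through and let the API fetch it
        url = source
    else:
        # Local image: guess the MIME type from the extension (assumption:
        # fall back to JPEG when the extension is unknown)
        mime = mimetypes.guess_type(source)[0] or "image/jpeg"
        with open(source, "rb") as image_file:
            encoded = base64.b64encode(image_file.read()).decode("utf-8")
        url = f"data:{mime};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


# Placeholder URL; nothing is downloaded when the part is built
part = image_part("https://example.com/messy_desk.jpg", detail="low")
print(part["image_url"]["url"])
```

The returned dictionary drops straight into the "content" list next to the text part, exactly where the hand-written image_url entry sits in the example above.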

Beyond Description: Complex Reasoning and Extraction

This is where it gets wild. You’re not limited to just asking for a description. You can ask it to reason about the image, extract specific information, or even write code based on a screenshot.

Example: Data Extraction from a Graph

Got a screenshot of a line chart from a meeting? Don’t manually transcribe the data points.

# Assuming base64_image is already encoded from a JPEG chart screenshot
# (for a PNG screenshot, use data:image/png in the URL below)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Look at this line chart. What are the approximate data values for each point on the blue line? Please list them in a structured JSON format with the keys 'x' and 'y'. If the axes are labeled, use those labels."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
            ],
        }
    ],
    max_tokens=500,
)

print(response.choices[0].message.content)

The model will often return JSON you can parse, though it may wrap it in a Markdown code fence or tack on a sentence of preamble, so strip those before parsing. Either way, you get a shocking head start on data digitization. Is it pixel-perfect accurate? No. Is it a million times faster than doing it by hand for a rough approximation? Absolutely.
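
Since the reply is free text rather than guaranteed JSON, it pays to parse it defensively. The sketch below assumes the most common wrapping, a Markdown code fence around the JSON, and falls back to parsing the whole reply. The function name parse_json_reply and the sample reply are invented for illustration; a real reply may need more cleanup than this.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable


def parse_json_reply(reply_text):
    """Parse JSON from a model reply, unwrapping a Markdown fence if present."""
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, reply_text, re.DOTALL)
    candidate = match.group(1) if match else reply_text
    # json.loads raises json.JSONDecodeError if the text is still not JSON
    return json.loads(candidate.strip())


# Made-up reply in the shape the chart prompt asks for
sample = (
    "Here are the approximate values:\n"
    + FENCE + "json\n"
    + '[{"x": "Jan", "y": 12}, {"x": "Feb", "y": 18}]\n'
    + FENCE
)
points = parse_json_reply(sample)
print(points[1]["y"])  # prints 18
```

Catching json.JSONDecodeError around the call is a sensible place to retry the request or flag the chart for manual review.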

Common Pitfalls and How to Avoid Them

The model is brilliant, but it’s not omniscient (despite the name). Here’s what to watch out for.

Hallucination: It will confidently invent details that aren’t in the image. It might see a blurry blob and call it “a dog,” or read a distorted word incorrectly. Best Practice: For any mission-critical task, especially involving text extraction (OCR), use a dedicated, deterministic OCR engine like Tesseract first. Use GPT-4o Vision for the context and reasoning around that text.
Resolution and Detail: The model doesn’t see your high-res image in all its glory. It gets downsampled. Tiny text, fine details, and subtle colors can be lost. If your question is about the serial number on a capacitor, you’re likely out of luck unless it’s a very tightly cropped shot.
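The Context Window is Everything: Your image is eating into the same context window as your text. The token cost depends on the image’s pixel dimensions and the detail setting, not on the length of the base64 string, but a large, high-detail image can still run to hundreds of tokens or more. This can quickly become expensive and limit how much follow-up conversation you can have. Be mindful of your image size, and remember that max_tokens only caps the length of the reply, not the size of the input.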
Bias and Safety: The model has all the classic LLM biases, now with sight. It might make assumptions about people’s professions, identities, or situations based on its training data. The built-in safety filters will also sometimes refuse to analyze perfectly benign images (medical diagrams are a common trip-up) because they’re trained to be overly cautious. There’s no easy fix for this; it’s a fundamental limitation of the current tech.
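
To make the resolution and context-window points concrete, here is a rough estimate of how many tokens an image costs. It follows the tile-based rule OpenAI has published for GPT-4o: a flat 85 tokens in low detail, and in high detail 85 base tokens plus 170 per 512-pixel tile after the image is scaled to fit 2048 x 2048 and its shortest side is brought down to 768. Treat the constants as assumptions that may change, check the current documentation before budgeting on them, and note that the no-upscaling choice for small images and the name estimate_image_tokens are this sketch’s own.

```python
import math


def estimate_image_tokens(width, height, detail="high"):
    """Estimate input tokens for one image (assumed GPT-4o tile rule)."""
    if detail == "low":
        return 85  # flat cost, regardless of image size

    # Step 1: scale down (never up) to fit inside a 2048 x 2048 square
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 2: scale down (never up) so the shortest side is at most 768
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 3: count the 512-pixel tiles needed to cover the result
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024))         # prints 765
print(estimate_image_tokens(2048, 4096))         # prints 1105
print(estimate_image_tokens(2048, 4096, "low"))  # prints 85
```

The same arithmetic explains the serial-number problem: a 4096-pixel photo ends up at most 768 pixels on its short side, so cropping tightly around the detail you care about preserves far more of it than sending the whole frame.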

The Real Magic: Conversing with Images

The killer feature isn’t one-off analysis; it’s the conversation. You can ask follow-up questions about the same image across a sequence of messages.

# First message: Describe the scene
# Second message: "Now, looking at that same image, what brand is the coffee mug?"
# Third message: "And based on the book titles, what programming language do you think I was probably using?"

The model keeps the image in context throughout the chat, as long as you keep the image message in the history you send. The Chat Completions API is stateless, so every request carries the full conversation, image included. This allows for a truly interactive, investigative experience. It feels less like running a tool and more like collaborating with a ridiculously perceptive intern who never sleeps. Use this to drill down into details or explore different aspects of a complex image without having to describe the scene again each time. Just remember, the image tokens are counted again on every request, against both your context window and your wallet.
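
A minimal sketch of that multi-turn loop, assuming you manage the message history yourself and resend all of it on each request. The helper name ask is invented for this example; client is expected to be an openai.OpenAI() instance, and the only API call used is client.chat.completions.create, the same one as in the earlier examples.

```python
def ask(client, history, content, model="gpt-4o"):
    """Add a user turn, send the full history, and record the reply.

    content is either a plain string or a list of content parts
    (text plus image_url), as in the first example of this section.
    Hypothetical helper for illustration only.
    """
    history.append({"role": "user", "content": content})
    response = client.chat.completions.create(
        model=model,
        messages=history,  # the whole conversation, image message included
        max_tokens=300,
    )
    answer = response.choices[0].message.content
    # Store the reply so the next question is asked in context
    history.append({"role": "assistant", "content": answer})
    return answer


# Usage sketch (needs a real client and an encoded image, so left commented):
# history = []
# ask(client, history, [
#     {"type": "text", "text": "Describe the scene."},
#     {"type": "image_url",
#      "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
# ])
# ask(client, history, "What brand is the coffee mug?")
```

Only the first turn contains the image part; later turns are plain strings, but the image still travels with every request because it sits at the top of the history list.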