31.4 Ollama: Serving Local LLMs with an OpenAI-Compatible API

Right, so you’ve got your local model running, probably via some command line incantation you found on a forum and prayed would work. It’s a start. But you and I both know that’s not how you use this thing. You don’t want to be pasting prompts into a terminal; you want to build an application. You want an API.

That’s where Ollama struts in, wearing a leather jacket it definitely didn’t steal from OpenAI. It takes the raw, unwashed power of llama.cpp and other inference engines and wraps it in a well-behaved, HTTP-speaking service. Best part? It speaks OpenAI’s language. This is a massive win because it means the entire ecosystem of tools built for the OpenAI API—libraries, frameworks, UIs—can now point to your local machine instead of a credit-card-melting endpoint in the cloud.

The Core Concept: It’s Just HTTP

At its heart, Ollama is a local server. You run it, and then it just… waits. The first time a request names a model, Ollama loads that model into memory (GPU or CPU) and answers. You send HTTP requests to it, it sends back responses. The magic is in the shape of those requests and responses. They mimic the OpenAI Chat Completions API closely enough that you can often just change the base_url in your code from https://api.openai.com/v1 to http://localhost:11434/v1 and things mostly just work. This is the kind of interoperability that makes engineers weep with joy.

Let’s get it running. You’ve likely already installed Ollama, but if not, on Linux it’s a one-liner (macOS and Windows have regular installers):

curl -fsSL https://ollama.com/install.sh | sh

Now, pull a model. Think of this as downloading the weights and having Ollama prep them for its internal use. Let’s grab a popular workhorse:

ollama pull llama3.1:8b

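And now, make sure the server is up. The Linux installer and the desktop apps usually start it for you as a background service, and ollama pull needs a running server, so if the pull worked you are probably already set. If nothing is listening yet, start it yourself. It runs in the foreground, which is great for testing. If the command complains that the address is already in use, that’s the background service, and you can skip this step: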

ollama serve

Boom. You now have an API endpoint at http://localhost:11434. Let’s annoy it with some code.

Your First Local API Call

Forget the Ollama CLI for a second. We’re talking API now. Here’s how you’d hit it with Python using the openai library, just as if it were the real deal.

pip install openai

from openai import OpenAI

# The key here is to point the client at YOUR server, not OpenAI's.
# The library insists on an API key, but Ollama ignores it, so any non-empty
# string works; "ollama" is the convention. A charmingly low-effort security
# model, fine for local use. Don't expose this to the internet. Seriously.

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama", # Required for the library, functionally ignored by Ollama.
)

response = client.chat.completions.create(
    model="llama3.1:8b", # Must match the model you pulled and have available.
    messages=[
        {"role": "system", "content": "You are a sarcastic technical assistant."},
        {"role": "user", "content": "Explain quantum computing in the style of a tired fast-food worker closing up shop."}
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)

Run this. If you get a beautifully sardonic explanation of qubits and fries, congratulations. You’ve just replaced a cloud API call with a local one. The base_url and model parameters are the only critical changes. The structure of the messages array, temperature, and the other common parameters work the way they do with OpenAI. This is the power of compatibility.
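If you want proof that there is nothing magic inside the openai library, here is a rough sketch of the same kind of call using only Python’s standard library. It assumes the default address and the llama3.1:8b model from above. The helper names build_chat_request and send_chat_request are made up for this example, and the network call is left commented out so the snippet runs even when the server is down.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default address.
BASE_URL = "http://localhost:11434/v1"


def build_chat_request(model, messages, temperature=0.7):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": messages, "temperature": temperature}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "llama3.1:8b",
    [{"role": "user", "content": "Say hello in five words."}],
)

# Inspect what would go over the wire.
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# With the server running, uncomment to actually send it:
# print(send_chat_request(request))
```

The JSON body is the same messages-and-parameters structure you passed to the client above; the library mostly adds retries, typing, and convenience on top.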

The Rough Edges and “Wait, What?” Moments

Ollama is brilliant, but it’s not a perfect facsimile. The designers made some… choices.

First, the model naming. OpenAI uses a simple string like gpt-4-turbo. Ollama uses a name:tag format like llama3.1:8b. If you use the wrong name, the error messages can be cryptic. Always run ollama list to see what models you have available on your system.
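You can do the same check from code. The sketch below assumes the default port and uses Ollama’s native /api/tags endpoint, which returns the locally available models as JSON. The function names are invented for this example, and the sample dictionary is hand-written in the shape of that response rather than captured from a real server.

```python
import json
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"  # assumes the default port


def model_names(tags_response):
    """Extract 'name:tag' strings from a parsed /api/tags response."""
    return sorted(m["name"] for m in tags_response.get("models", []))


def fetch_model_names(url=TAGS_URL, timeout=5):
    """Ask a running Ollama server which models it has locally."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_names(json.load(resp))


def require_model(name, available):
    """Fail early with a readable message instead of a cryptic API error."""
    if name not in available:
        raise ValueError(f"Model {name!r} not found locally. Available: {available}")
    return name


# Hand-written sample in the shape /api/tags returns (most fields trimmed).
sample = {"models": [{"name": "llama3.1:8b"}, {"name": "nomic-embed-text:latest"}]}
print(model_names(sample))

# With the server running:
# require_model("llama3.1:8b", fetch_model_names())
```

Note that a model pulled without a tag shows up with :latest appended, so compare against the full name:tag string.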

Second, while the API is compatible, it’s not exhaustively so. Some of the more obscure parameters in the OpenAI API might be silently ignored by Ollama. It handles the big ones (temperature, max_tokens, stream) reliably, but if you’re trying to do something esoteric, check the OpenAI compatibility notes in the Ollama docs on GitHub. Don’t assume.
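Streaming deserves a quick look, since it changes the response shape. With stream enabled, an OpenAI-style endpoint sends a series of lines that start with "data: ", each carrying a JSON chunk with a small piece of text under choices[0].delta.content, and finishes with "data: [DONE]". Here is a minimal sketch of a parser for that format. The function name iter_stream_text is made up, and the sample lines are hand-written for illustration, not captured from a server.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style 'data: {...}' stream lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything unexpected
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines, not real server output.
sample_lines = [
    b'data: {"choices":[{"delta":{"role":"assistant","content":"Hel"}}]}\n',
    b"\n",
    b'data: {"choices":[{"delta":{"content":"lo"}}]}\n',
    b"data: [DONE]\n",
]
print("".join(iter_stream_text(sample_lines)))
```

If you stick with the openai library, you never see these lines: pass stream=True to client.chat.completions.create and iterate over the result, reading chunk.choices[0].delta.content from each chunk.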

Third, and this is the big one, context management is on you. Ollama loads a model on the first request that names it and, by default, keeps it in memory for about five minutes after the last request (the keep_alive setting controls this). But a loaded model is not a remembered conversation. Every API call you make is independent; the API is stateless. If you need to maintain a conversation history, your application must send the entire message history with each request. This is how the OpenAI API works too, but it’s easy to forget, leading to models with shocking amnesia. The upside is total control. The downside is you have to actually implement that control.
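Implementing that control does not take much. Here is a minimal sketch of a history keeper. The Conversation class and its send argument are hypothetical names for this example: send is any function that takes a list of messages and returns the reply text. A fake one stands in for the API here so the snippet runs without a server.

```python
class Conversation:
    """Keeps the full message history and resends it on every turn."""

    def __init__(self, send, system_prompt=None):
        # `send` is any callable: list of message dicts in, reply text out.
        self._send = send
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text):
        self.messages.append({"role": "user", "content": user_text})
        reply = self._send(list(self.messages))  # send a copy of everything so far
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages):
    """Stand-in for a real API call: reports how many messages it received."""
    return f"I can see {len(messages)} message(s)."


chat = Conversation(fake_send, system_prompt="You are terse.")
print(chat.ask("My name is Ada."))
print(chat.ask("What is my name?"))
```

To wire it to the real thing, pass a function that calls client.chat.completions.create(model="llama3.1:8b", messages=messages) and returns response.choices[0].message.content. Keep in mind that the history grows with every turn and the model’s context window does not, so a long-running chat eventually needs trimming or summarizing.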

Going Beyond Basic Chat

The /v1/chat/completions endpoint is the main event, but Ollama provides other endpoints. Need embeddings?

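# Using the same client setup as before.
# A dedicated embedding model is the usual choice here; nomic-embed-text is
# one example (fetch it first with: ollama pull nomic-embed-text).
response = client.embeddings.create(
    model="nomic-embed-text",
    input="This is a test sentence to embed.",
)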

print(response.data[0].embedding)
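What you get back is a long list of floats, which is not much use until you compare it with another one. Cosine similarity is the usual way to do that. The sketch below uses toy three-dimensional vectors as stand-ins, since real embeddings are far longer, and the document labels are invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Toy vectors standing in for response.data[i].embedding values.
query = [1.0, 0.0, 1.0]
docs = {"close": [0.9, 0.1, 1.1], "unrelated": [0.0, 1.0, 0.0]}

# Rank documents from most to least similar to the query.
ranked = sorted(docs, key=lambda k: cosine_similarity(query, docs[k]), reverse=True)
print(ranked)
```

Only compare vectors produced by the same embedding model; vectors from different models live in unrelated spaces.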

You can also use the lower-level /api/generate endpoint directly with a tool like curl for quick tests or debugging:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

The Golden Rule: Never Expose This to the Internet

I need to be exceptionally clear about this. The default Ollama setup has no authentication and no transport security. It’s designed for local development. It is not meant to be exposed directly to the wild, chaotic internet. If you need a remote API, put Ollama behind a proper reverse proxy (like Nginx or Caddy) that adds authentication, rate limiting, and TLS. By default the server listens only on 127.0.0.1. The OLLAMA_HOST environment variable (for example, OLLAMA_HOST=0.0.0.0:11434) makes it bind to other interfaces, but you should only use this on trusted, internal networks. Getting this wrong is how you end up on a cybersecurity news site for all the wrong reasons.

Ollama takes the profoundly complex task of running local models and makes it feel simple, or at least, approachably complex. It gives you a standard, well-understood interface to build against. It’s the duct tape and WD-40 that makes the glorious, janky future of local AI actually stick together. Now go build something.