31.5 Open-Source Model Landscape: LLaMA 3, Mistral, Qwen, Gemma, Phi

Right, let’s get you oriented. The “open-source” model landscape is a bit of a wild west right now. I put “open-source” in quotes because the licenses range from “do whatever you want” to “you can use this but don’t you dare compete with us, also we might change the terms later.” It’s less a unified ecosystem and more a collection of brilliant, chaotic fiefdoms. Your job is to pick the right champion for your specific quest.

The biggest shift recently has been the move from monolithic, general-purpose models to a mix of generalists and smaller, more specialized ones. The idea is that you might not need a 70-billion-parameter beast to write a polite email; a nimble 7-billion-parameter model can do that just fine, leaving the big guns for the complex reasoning tasks. This is where your choice of tool (Ollama vs. llama.cpp) and model becomes critical.

The Major Players: Strengths, Quirks, and Licensing Gotchas

Meta’s LLaMA 3 is the current heavyweight champion of general-purpose open models. Its 8B and 70B models are incredibly capable, sensible, and well-rounded. It’s like the reliable, brilliant friend who shows up on time and helps you move. The license is also fairly permissive for most of us, with two catches: if you have more than 700 million monthly active users, you need to call Meta, and you can’t use the model or its outputs to improve other, non-Llama LLMs. The context window is a respectable 8k tokens, which handles most conversations and documents without breaking a sweat.

Mistral’s models (like Mixtral 8x7B and Mistral 7B) are the clever, efficient French contenders. Mixtral is a “Mixture of Experts” (MoE) model. Don’t let the jargon scare you; think of it not as one giant brain, but a committee of 8 smaller brains (experts) inside each layer. For any given token, a small router consults only 2 of them. Roughly 13B of its roughly 47B parameters do work on each token, which makes it much faster per token than a dense model of the same total size, while still being incredibly capable. The catch is that all 8 experts still have to sit in memory, so it is cheap on compute, not on RAM. Its 32k context window is a massive advantage for working with long documents. The license is Apache 2.0, which is about as permissive as it gets.
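If the committee metaphor feels hand-wavy, here is a toy sketch of top-2 routing in plain Python. Everything in it is made up for illustration: the eight "experts" are trivial functions and the router scores are hard-coded, whereas a real MoE layer computes those scores with a learned layer for every token. It only shows the shape of the idea: score all experts, keep the best two, and blend their outputs.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, experts, router_logits, k=2):
    """Blend the outputs of the k chosen experts; the others never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Eight toy experts: each one just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in range(1, 9)]

# Hypothetical router scores for a single token (hard-coded for the demo).
logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]

routes = top_k_route(logits)
output = moe_layer(1.0, experts, logits)
print("chosen experts:", [i for i, _ in routes])
print("blended output:", round(output, 4))
```

Only two of the eight expert functions are ever called for this token, which is where the speed comes from. All eight still exist in the list, which is the toy version of why the full model still has to fit in memory.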

Google’s Gemma is their answer to this open-ish world. The models (2B and 7B) are solid, but the license is the real story. It’s not open-source in the purest sense; it ships under Google’s own Gemma Terms of Use, which come with a prohibited-use policy and conditions that carry over to derivative models, including, as I read the terms, models distilled from Gemma’s outputs. It’s like Google is saying, “Here, bake with this pre-made dough, but our kitchen rules follow every loaf you sell.” Technically good, but the legal baggage is something to read before you build on it.

Microsoft’s Phi-3 is the poster child for the “small but mighty” movement. The Phi-3-mini (3.8B parameters) performs tasks that will make you double-check its size. Microsoft achieved this through heavily curated, high-quality training data (“textbooks are all you need”). It’s perfect for resource-constrained environments where you still need cogent output. The license is MIT, which is fantastic.

Alibaba’s Qwen 2 is the dark horse you should be paying attention to. Most sizes are Apache 2.0 licensed, meaning no sneaky restrictions (the 72B flagship is the exception and ships under Alibaba’s own Qianwen license), and its performance is top-tier, often trading blows with LLaMA 3. For a truly free model without corporate caveats, the Apache-licensed Qwen 2 sizes are arguably the best place to start.

Actually Pulling and Running a Model

This is where Ollama shines. Forget hunting down multi-gigabyte GGUF files and choosing quant types by hand. Ollama’s pull command abstracts all that away.

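# Let's get the efficient Mixtral model. With no tag given, Ollama pulls the model's default build, typically a 4-bit quant.
# "Efficient" is relative: expect a download in the tens of gigabytes.
ollama pull mixtral

# Now, let's run it. The '--verbose' flag prints timing stats (load time, tokens per second) after the answer, so we can see the mechanics.
echo "Why is the sky blue?" | ollama run mixtral --verbose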
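Once a model is pulled, you will usually want to call it from code rather than a terminal. Ollama runs a local HTTP server for this. The sketch below assumes the default address (localhost, port 11434) and the /api/generate endpoint with streaming turned off; the helper names build_request and generate are mine, not part of Ollama. Treat it as a starting point and check the API docs for your installed version.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model, prompt, url=OLLAMA_URL):
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt):
    """Send the prompt and return the model's full text reply."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

# Usage, once `ollama pull mixtral` has finished and the server is running:
#   print(generate("mixtral", "Why is the sky blue?"))
```

Splitting the request construction from the network call is deliberate: you can inspect or log exactly what you are about to send without needing a model loaded.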

But sometimes you need to get under the hood. That’s where llama.cpp and its quantized GGUF model files come in. You might do this to use a specific quant type (e.g., Q4_K_M for a great size/quality balance) or a model Ollama doesn’t yet serve.

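First, you’d download a GGUF file from a repository like Hugging Face. Then you’d run it with the llama.cpp command-line binary (called main in older builds; current builds name it llama-cli):

# This is a more manual, powerful approach. '-ngl 40' offloads up to 40 layers to your GPU; Mistral 7B has 32 transformer layers, so this puts the whole model there.
# '-n 128' caps the reply at 128 tokens.
./main -m /path/to/mistral-7b-v0.1.Q4_K_M.gguf -p "Why is the sky blue?" -n 128 -ngl 40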

Best Practices and Pitfalls From the Trenches

Start Small, Then Scale: Your first instinct will be to pull the biggest, baddest 70B model. Fight it. Start with a 7B or 8B model like llama3:8b or qwen2:7b. You’ll get instant feedback and learn the ropes without melting your CPU. You can always scale up later.
Quantization is Your Best Friend: These models are distributed in “quantized” formats (Q2, Q4, Q6, Q8). The number is roughly how many bits are stored per weight, down from 16 in the original. This reduces their precision to save a colossal amount of disk space and RAM. The sweet spot for most is Q4_K_M. It offers a great balance. Q8 is nearly indistinguishable from full precision but about twice the size of Q4, and Q2 is tiny but can noticeably degrade quality. Don’t be afraid to experiment.
Mind the Context Window: This is the model’s short-term memory. If you need to summarize a long document, you must use a model with a large context (like Mixtral’s 32k), and check that your runtime is actually configured to use it, since the default setting can be much smaller than the model’s maximum. If you exceed the context, the oldest tokens are typically dropped, so the model literally forgets the beginning of the conversation. It will not tell you this; it will just start giving you progressively weirder answers.
The System Prompt is Your Secret Weapon: This is how you guide the model’s personality and output. Ollama lets you define this in a Modelfile. This is where you turn a general-purpose model into a specialized assistant.

# Create a file named 'Modelfile' with these contents:
FROM llama3:8b
SYSTEM """
You are an expert software architect. You answer concisely and technically.
You avoid fluff and never use phrases like "awesome" or "let's dive in".
If the user asks about something unrelated, you refuse politely.
"""

# Then, from your shell, build and run your customized model
ollama create architect -f Modelfile
ollama run architect
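Two of the tips above, quantization and the context window, come down to simple arithmetic, so here is a back-of-the-envelope sketch. It rests on loud assumptions: the size estimate uses nominal bits per weight (real GGUF files run somewhat larger because they store scaling factors and keep some tensors at higher precision), and the token count uses a crude "about 4 characters per token" rule for English text. The function names are hypothetical helpers, not part of any tool.

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GiB (nominal bit width)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

def rough_tokens(text):
    """Crude token estimate: roughly 4 characters per token for English."""
    return len(text) // 4 + 1

def trim_history(messages, context_tokens, reserve_for_reply=512):
    """Keep the newest messages that fit the budget; drop the oldest first."""
    budget = context_tokens - reserve_for_reply
    kept = []
    for msg in reversed(messages):      # walk from newest to oldest
        cost = rough_tokens(msg)
        if cost > budget:
            break                       # everything older is dropped too
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))         # restore chronological order

# Nominal weight sizes for two model scales at several bit widths.
for name, params in [("8B", 8), ("70B", 70)]:
    sizes = ", ".join(
        f"Q{bits}: {weight_gib(params, bits):.1f} GiB" for bits in (2, 4, 6, 8)
    )
    print(f"{name} -> {sizes}, 16-bit: {weight_gib(params, 16):.1f} GiB")
```

By this estimate an 8B model at 4 bits needs a little under 4 GiB for its weights, while a 70B model at the same setting needs over 30 GiB, which is the "start small" advice in numbers. Trimming the history yourself, as trim_history does, at least makes the forgetting explicit instead of leaving it to the runtime.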

The biggest pitfall? Assuming these models share the same “truth.” They don’t. They are stochastic parrots with PhDs. They will confidently hallucinate facts, code, and citations. Your job is to use their incredible pattern-matching ability while building systems that verify their output. Trust, but verify.
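What does “verify” look like in practice? One common pattern, sketched here under my own assumptions, is to ask the model for JSON, check the reply against the fields you expect, and retry when it fails. The names parse_verified and ask_until_valid are hypothetical, and fake_model is a stand-in for a real model call so the example runs without one. Note the limit: this catches malformed or incomplete replies, not wrong facts, so anything factual still needs checking against a real source.

```python
import json

def parse_verified(raw, required):
    """Parse a model reply as JSON and check required fields and their types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"field {key} should be {expected_type.__name__}")
    return data

def ask_until_valid(ask, prompt, required, attempts=3):
    """Call ask(prompt) until the reply passes the checks or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return parse_verified(ask(prompt), required)
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid reply after {attempts} attempts: {last_error}")

# Stand-in for a real model: chatty on the first call, valid JSON on the second.
replies = iter([
    "Sure! Here is the JSON you asked for...",
    '{"language": "python", "lines": 12}',
])

def fake_model(prompt):
    return next(replies)

schema = {"language": str, "lines": int}
result = ask_until_valid(fake_model, "Describe the snippet as JSON.", schema)
print(result)
```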