20.7 Open-Source LLMs: LLaMA, Mistral, Gemma, Phi, Qwen

Right, let’s talk about the open-source revolution. Because let’s be honest, the big, proprietary models from OpenAI and Google are impressive, but they’re also black boxes. You can’t see the gears turning, you can’t fine-tune them on your own secret data without paying an arm and a leg, and you certainly can’t run them on your own hardware without a corporate-sized trust fund. That’s where this motley crew of open-source models comes in. They’re the rebels, the tinkerer’s paradise, and frankly, the reason this field is moving at lightspeed. We’re not just users here; we’re mechanics.

The Contenders: A Quick Roll Call

First, let’s meet the players. It’s a crowded field, but a few have pulled ahead of the pack.

LLaMA (Meta): The one that started the modern gold rush. When Meta released LLaMA (Large Language Model Meta AI) in early 2023, it was an “are they insane?” moment. It wasn’t strictly open source at first (the weights were available only to researchers), but it was good enough to prove that smaller models could punch way above their weight class if trained for longer on a large, carefully filtered corpus of public data. It’s the granddaddy; many later open models either borrow its architectural recipe or were fine-tuned directly from its weights.
Mistral AI: The French efficiency experts. Mistral came out swinging with models that are brutally performant for their size. Their 7B model routinely embarrassed models twice its size. They’re masters of architectural tweaks (like grouped-query attention) and data curation. They also popularized the Mixture-of-Experts (MoE) approach for open models with Mixtral, which is a fancy way of saying “we only use a fraction of the total parameters for any given token,” making it wildly efficient at inference time.
Gemma (Google): Google’s response to the open-source clamor. Based on the same tech as their Gemini models, Gemma is their “okay, fine, here you go” contribution. It’s solid, well-documented, and benefits from Google’s massive infrastructure. Think of it as the corporate-sanctioned, very competent entry.
Phi (Microsoft): The minimalists. The Phi team believes in “textbooks are all you need”: small models trained on heavily filtered and synthetically generated, textbook-quality data. The results are stunning; the 3.8B parameter Phi-3 model can outperform Llama 2 7B on many benchmarks. It’s strong evidence that data quality matters at least as much as parameter count.
Qwen (Alibaba): The strong international contender. Often overlooked in Western-centric discussions, the Qwen models are absolute workhorses, multilingual by default, and consistently rank near the top of performance leaderboards. They’re a fantastic choice if your use case has a global focus.
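
To make the Mixture-of-Experts idea from the Mistral entry concrete, here is a minimal sketch of top-k routing for a single token. It is a toy, not Mixtral’s implementation: the experts are stand-in scalar functions, the router scores are made up, and the names route_token and moe_layer are hypothetical. The only detail taken from Mixtral is the shape of the scheme (8 experts, 2 active per token, gate weights from a softmax over the selected scores).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a short list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=2):
    """Pick the top_k experts for one token and normalize their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, experts, router_logits, top_k=2):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](x) for i, w in route_token(router_logits, top_k))

# Eight toy "experts"; each scalar function stands in for a feed-forward block.
experts = [lambda x, k=k: (k + 1) * x for k in range(8)]
# Made-up router scores for one token (a real router computes these from x).
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, -0.5, 0.0, 0.2]

chosen = route_token(router_logits)
print(chosen)                                  # two (expert_index, weight) pairs
print(moe_layer(1.0, experts, router_logits))  # only those two experts were called
```

All eight experts still have to sit in memory; the saving is in compute per token, because six of them are never evaluated.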

Why the Hell Would I Use Open-Source?

Glad you asked. It boils down to three things: control, cost, and customization.

Control: You own the model weights. You can run it on your own machine, in your own VPC, on a laptop on a plane with no internet. No API calls, no rate limits, no worrying about a provider’s policy changes or downtime.

Cost: Once you’ve downloaded the model, there are no per-token fees; what you pay for is hardware and electricity. For high-volume applications, the savings compared to per-call API fees can be substantial, provided you keep the hardware busy. You’re trading an upfront hardware/compute cost for a much lower marginal cost per request.

Customization: This is the big one. You can fine-tune these models on your specific data. Got a database of archaic legal documents? Fine-tune LLaMA on it. Need a model that speaks in your company’s brand voice? Fine-tune Mistral. This is impossible with most closed API models, or it’s a prohibitively expensive enterprise feature.
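
To put rough numbers on the cost trade-off, here is a back-of-the-envelope break-even sketch. Every figure in it (hardware price, power cost, throughput, API price) is a hypothetical placeholder, and the function name breakeven_tokens is invented for illustration; plug in your own quotes and measurements.

```python
def breakeven_tokens(hardware_cost_usd, power_cost_per_hour_usd,
                     tokens_per_second, api_price_per_million_usd):
    """Number of tokens after which self-hosting beats a per-token API.

    Returns None if the API is cheaper per token than your running costs,
    in which case the hardware never pays for itself.
    """
    tokens_per_hour = tokens_per_second * 3600
    self_host_per_million = power_cost_per_hour_usd / tokens_per_hour * 1_000_000
    saving_per_million = api_price_per_million_usd - self_host_per_million
    if saving_per_million <= 0:
        return None
    return hardware_cost_usd / saving_per_million * 1_000_000

# Hypothetical inputs: a $2,000 GPU box, $0.10/hour of electricity,
# 50 tokens/second sustained, and an API charging $2.00 per million tokens.
tokens = breakeven_tokens(2000, 0.10, 50, 2.00)
print(f"Break-even after roughly {tokens / 1e9:.2f} billion tokens")
```

The sketch ignores engineering time, idle hardware, and batching gains, all of which can move the answer in either direction.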

Here’s the absolute simplest way to get one of these models running, using the transformers library and a dash of torch.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Pick your model. Let's start with Mistral's 7B instruct model.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# This will download the model (a multi-gigabyte file, so get coffee)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # Use half-precision to save VRAM
    device_map="auto",          # Let HF automatically handle GPU/CPU placement
    low_cpu_mem_usage=True      # A lifesaver for larger models
)

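# Create a prompt in the model's required chat format
messages = [
    {"role": "user", "content": "Explain the concept of quantum entanglement like I'm a witty, slightly sarcastic physicist."}
]
# add_generation_prompt=True appends whatever marker the template uses to
# signal "assistant turn starts here", so the model answers instead of
# continuing the user's message.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)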

# Generate a response
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

# Decode the output and print it
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The Gotchas: Welcome to the Trenches

This isn’t all sunshine and rainbows. The open-source world is messy.

1. The Hardware Tax: The biggest hurdle. These models are huge. A 7B parameter model in 16-bit precision needs about 14GB of VRAM for the weights alone (2 bytes per parameter), before you count activations and the KV cache. A 70B model? 140GB. You’re not running that on your gaming PC. You’ll immediately become best friends with concepts like quantization (reducing precision to 4 or 8 bits to shrink memory footprint) and offloading (shoving parts of the model onto the CPU or disk). In from_pretrained, device_map="auto" takes care of offloading, and passing quantization_config=BitsAndBytesConfig(load_in_4bit=True) (which needs the bitsandbytes package) takes care of 4-bit loading. These are your first tools for fighting this battle.
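
The memory figures above are just parameters times bytes per parameter. A small sketch of that arithmetic follows; the helper name weight_memory_gb is made up for illustration, and it counts weights only, so real usage will be higher once activations and the KV cache are included.

```python
def weight_memory_gb(num_params_billion, bits_per_param):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bytes = num_params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

# Compare full 16-bit weights against 8-bit and 4-bit quantization.
for size in (7, 70):
    for bits in (16, 8, 4):
        print(f"{size}B @ {bits}-bit: ~{weight_memory_gb(size, bits):.1f} GB")
```

By this estimate a 4-bit 7B model fits comfortably on a consumer GPU, while a 4-bit 70B model still needs about 35GB for weights.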

2. Inference Speed: Without the heavily optimized serving infrastructure of an OpenAI, naive inference will be slower. Much slower. You need a proper toolchain. Using a dedicated inference server like vLLM or TGI (Text Generation Inference) is all but mandatory for production. They handle continuous batching, KV-cache management, and other black magic to get the most tokens per second out of your hardware.

# Serve the model with vLLM's OpenAI-compatible HTTP server (port 8000 by default)
pip install vllm
python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
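
Once a server is up, you talk to it over HTTP. The sketch below assumes the model is being served through vLLM’s OpenAI-compatible entrypoint (python -m vllm.entrypoints.openai.api_server) on its default address, localhost:8000; if you started a different entrypoint or changed the port, the URL and payload shape will differ. The helper names build_chat_request and chat are hypothetical, and only the standard library is used.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: vLLM's default host and port

def build_chat_request(prompt,
                       model="mistralai/Mistral-7B-Instruct-v0.2",
                       max_tokens=256,
                       temperature=0.7):
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt, base_url=BASE_URL):
    """POST the prompt to the server and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Usage, with the server running in another terminal:
#   print(chat("Summarize grouped-query attention in two sentences."))
```

Because the endpoint mimics the OpenAI chat API, most existing OpenAI client code can usually be pointed at it by changing the base URL.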

3. The Responsibility is Yours: There’s no safety net. Closed APIs have extensive content moderation filters. Your open-source model will happily generate recipes for chemical weapons, bad fanfiction, or worse, if you prompt it to. It’s on you to implement guardrails, output filtering, and responsible usage policies.
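
As a starting point for output filtering, here is a deliberately simple sketch: a post-generation check that swaps a response for a refusal when it matches a blocklist. The patterns, the refusal text, and the function name filter_output are all hypothetical placeholders. Keyword matching like this is easy to evade, so treat it as the shape of the hook rather than a real safety system; a production setup would more likely put a dedicated moderation classifier in the same place.

```python
import re

# Hypothetical blocklist; a real deployment needs a far more careful policy.
BLOCKED_PATTERNS = [
    re.compile(r"\bnerve agent\b", re.IGNORECASE),
    re.compile(r"\bsynthesi[sz]e\b.*\bexplosive\b", re.IGNORECASE),
]

REFUSAL = "Sorry, I can't help with that."

def filter_output(text):
    """Return the model output unchanged, or a refusal if it trips a pattern."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            return REFUSAL
    return text

print(filter_output("Entanglement links the states of two particles."))
print(filter_output("Step one: synthesize the explosive compound."))
```

You would call filter_output on the decoded generation before it reaches the user, and log what was blocked so the rules can be reviewed.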

The Fine-Tuning Mandate

The whole point is to make the model yours. The most common way is Supervised Fine-Tuning (SFT) with a framework like TRL (Transformer Reinforcement Learning) from Hugging Face. Here’s the gist: you prepare a dataset of prompt-response pairs specific to your task and then continue training the model on it.

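from trl import SFTConfig, SFTTrainer

dataset = ... # Your dataset of {'prompt': '...', 'completion': '...'} pairs

# SFTConfig extends transformers.TrainingArguments with SFT-specific options.
# Argument names have shifted between TRL releases (max_length was previously
# max_seq_length), so check them against the version you have installed.
training_args = SFTConfig(
    output_dir="./my_finetuned_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    fp16=True,  # Use mixed precision
    logging_steps=10,
    save_steps=500,
    max_length=1024,  # Truncate examples longer than this many tokens
)

# Note: mixed precision expects full-precision weights. A model loaded with
# torch_dtype=torch.float16 typically fails when gradients are unscaled, so
# reload it without torch_dtype before training.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,  # Recent TRL reads the prompt/completion columns directly
)

trainer.train()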

This process updates the model’s weights so that it assigns higher probability to responses that look like the ones in your dataset, nudging it towards your specific style and content. It’s not magic, and it lives or dies on the quality of your data, but when done right, it transforms a general-purpose model into your own personal expert.
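
Since data quality carries so much weight, it pays to clean your pairs before training. The following is a minimal sketch, assuming your raw examples are dictionaries with prompt and completion keys as in the snippet above; the helper names clean_pairs and to_jsonl are hypothetical, and the rules (strip whitespace, drop empties, drop exact duplicates) are only a baseline.

```python
import json

def clean_pairs(records):
    """Strip whitespace, drop incomplete pairs, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        prompt = (record.get("prompt") or "").strip()
        completion = (record.get("completion") or "").strip()
        if not prompt or not completion:
            continue  # an example missing either side teaches the model nothing
        key = (prompt, completion)
        if key in seen:
            continue  # exact duplicates over-weight a single example
        seen.add(key)
        cleaned.append({"prompt": prompt, "completion": completion})
    return cleaned

def to_jsonl(records):
    """Serialize records as JSON Lines, one example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

raw = [
    {"prompt": "  What is a tort? ", "completion": "A civil wrong."},
    {"prompt": "What is a tort?", "completion": "A civil wrong."},  # duplicate
    {"prompt": "Define estoppel.", "completion": ""},               # incomplete
]
pairs = clean_pairs(raw)
print(to_jsonl(pairs))
```

Writing the result to a .jsonl file gives you something most dataset loaders can read back in without custom parsing.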