31.8 Hardware Requirements: GPU VRAM for Different Model Sizes
Alright, let’s talk hardware. This is where the rubber meets the road, or more accurately, where your expensive graphics card meets a torrent of matrix multiplications. You can’t just throw any old computer at this and expect magic. The single most important number on your spec sheet for running local LLMs is your GPU’s VRAM. Think of it as the “working memory” for your model. The model’s weights—its entire knowledge and reasoning capability—have to be loaded into this space to run efficiently. If they don’t fit, everything slows to a crawl as your system starts shuffling data back and forth to regular RAM, which is like trying to feed a Formula 1 engine through a drinking straw.
The Brutal Math of Model Sizes
Let’s get the cold, hard numbers out of the way first. Model sizes are usually given in parameters (e.g., 7 billion, 70 billion). To figure out how much VRAM you need, you need a rough conversion formula. The most common precision for running these models is 16-bit floating point (FP16), where each parameter takes 2 bytes.
So, a 7B model: 7,000,000,000 * 2 bytes = 14,000,000,000 bytes ≈ 14 GB.
But wait, you clever thing, you’re thinking, “My 8GB card can run a 7B model!” And you’d be right. That’s because we use quantization, the black magic of making models smaller and faster by reducing the precision of their weights. The most common quant is 4-bit (as in Q4_0, Q4_K_M). This stores each parameter in, you guessed it, ~4 bits, or ~0.5 bytes.
Our 7B model now: 7,000,000,000 * 0.5 bytes = 3,500,000,000 bytes ≈ 3.5 GB.
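If you'd rather not do this on a napkin, the same arithmetic fits in a few lines of Python. This is a minimal sketch that counts the weights only; the function name `weight_memory_gb` is made up for illustration, and it uses decimal gigabytes to match the numbers above.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts are the nominal sizes, not exact counts for any specific model.
for name, n_params in [("7B", 7e9), ("13B", 13e9), ("34B", 34e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    q4 = weight_memory_gb(n_params, 4)
    print(f"{name}: FP16 ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB")
```

Real 4-bit formats also store a scale for each small block of weights, so the effective cost is a little above 4 bits per weight. llama.cpp's Q4_0, for example, packs 32 weights with one 16-bit scale, which works out to 4.5 bits per weight; pass `bits_per_weight=4.5` to account for that.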
That’s a much more manageable number. Here’s a quick cheat sheet for the minimum VRAM you’d want for a decently performant experience:
| Model Size (Billions) | FP16 (GB) | Q4 Quantized (GB) | Reality (GB) |
|---|---|---|---|
| 7B (e.g., Llama 2) | ~14 GB | ~3.5 GB | 6-8 GB |
| 13B | ~26 GB | ~6.5 GB | 10-12 GB |
| 34B (e.g., CodeLlama) | ~68 GB | ~17 GB | 20 GB |
| 70B | ~140 GB | ~35 GB | 40 GB+ |
“Why the ‘Reality’ column?” I hear you cry. Because you’re not just loading weights. You need overhead for the inference process itself: the KV cache (memory for tracking the context of your conversation), intermediate activations (the model’s “thoughts” as it processes your prompt), and a buffer for generating tokens. If you max out your VRAM, you’ll trigger out-of-memory errors or that dreaded RAM-shuffling slowdown. Always aim for at least 1-2 GB of headroom.
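To get a feel for how much of that overhead is the KV cache, here is a rough sketch. It assumes a standard transformer that caches one key and one value vector per layer, per KV head, per token, stored in FP16. The function name `kv_cache_bytes` is hypothetical, and the shape used (32 layers, 32 KV heads, head dimension 128) is an assumption matching a Llama 2 7B style model; models with grouped-query attention have fewer KV heads and a smaller cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Estimate KV cache size: keys and values for every layer, head, and token."""
    # The leading 2 accounts for the two cached tensors (keys and values).
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Assumed shape: 32 layers, 32 KV heads, head dimension 128, FP16 cache.
for ctx in (2048, 4096, 8192):
    gib = kv_cache_bytes(32, 32, 128, ctx) / 2**30
    print(f"context {ctx}: KV cache ~{gib:.1f} GiB")
```

Under these assumptions the cache alone costs about 1 GiB at 2k context and 2 GiB at 4k, on top of the weights, which goes a long way toward explaining the gap between the Q4 column and the Reality column.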
Quantization: Your Get-Out-of-Jail-Free Card
Quantization is the reason you and I can do this on consumer hardware without taking out a second mortgage. It’s the process of mapping the model’s high-precision weights (like FP16) into a lower-precision format (like INT4). Yes, you lose a little fidelity—imagine converting a FLAC file to a decent MP3. It’s still the same song, just with some barely perceptible detail shaved off. The Q4_K_M variant is often the sweet spot for a great balance of size and performance.
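To make the idea concrete, here is a toy version of block-wise 4-bit quantization in plain Python. This is a simplified absmax scheme for illustration only, not the actual Q4_0 or Q4_K_M layout used by llama.cpp; the function names and the block of random "weights" are made up.

```python
import random


def quantize_block(weights):
    """Map floats to integers in [-8, 7] plus one shared scale (toy absmax scheme)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the 4-bit integers and the shared scale."""
    return [q * scale for q in quants]


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(32)]  # stand-in for 32 real weights
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.5f}, max abs error={max_error:.5f}")
```

Each weight is now one of 16 possible values times a shared scale, and the rounding error per weight is at most half the scale. That small error is the "barely perceptible detail" being shaved off.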
You can use tools like llama.cpp to quantize your own models, but most popular models on Hugging Face come with a zoo of pre-quantized versions. The Ollama library serves pre-quantized builds too, so that is what you’ll typically get when you pull a model with Ollama.
# When you pull a model with Ollama, it grabs a pre-quantized build.
# The default tag usually points at a 4-bit quantization, even though the tag name doesn't say so.
ollama pull llama3
# You can also ask for a specific quantization, e.g., a smaller but potentially lower-quality Q2_K.
# Check the model's tag list in the Ollama library for the exact tag names that are published.
ollama pull codellama:7b-code-q2_K
VRAM vs. RAM: The Offloading Fallback
So what happens if your GPU’s VRAM is too small? Tools like llama.cpp and text-generation-webui can split the model: some layers are offloaded to the GPU, and the rest stay in system RAM and run on the CPU. This is the software equivalent of duct tape. It works, but it’s not pretty.
# Example using llama.cpp's CLI to run a 13B model on a GPU with only 8GB VRAM
# -ngl 30 offloads 30 layers to the GPU; the rest will be run on the CPU.
# (In newer llama.cpp builds the binary is named llama-cli instead of main.)
./main -m models/codellama-13b.Q4_K_M.gguf -n 1024 -ngl 30
You’ll need to experiment with the -ngl value (short for --n-gpu-layers, the number of layers placed on the GPU). Start high and lower it until you stop getting out-of-memory errors. The performance hit is severe. You might go from 20 tokens/second to 2. It’s a useful feature for occasionally running a model that’s slightly too big, but it’s not a viable long-term solution. If you’re constantly offloading, you’re better off using a smaller model.
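If you want a starting guess instead of pure trial and error, a back-of-the-envelope estimate helps. The sketch below assumes every layer is roughly the same size and reserves a fixed chunk of VRAM for the KV cache and buffers. Both are simplifications, the function name `estimate_gpu_layers` is hypothetical, and the file size and layer count in the example are illustrative numbers, not measurements.

```python
def estimate_gpu_layers(model_file_gb, n_layers, vram_gb, reserve_gb=2.0):
    """Rough guess at how many layers fit on the GPU, assuming equal-sized layers."""
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)  # keep headroom for cache and buffers
    return min(n_layers, int(usable_gb // per_layer_gb))


# Illustrative numbers: a ~7.9 GB quantized 13B file with 40 layers on an 8 GB card.
print(estimate_gpu_layers(model_file_gb=7.9, n_layers=40, vram_gb=8.0))
```

Treat the result as the first value to try, then adjust up or down based on what actually loads and how much VRAM is left once the context fills up.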
Best Practices and Pitfalls
- Check Your VRAM First: Don’t guess. Know your hardware. On Linux, use nvidia-smi or radeontop. On Windows, Task Manager > Performance > GPU is your friend.
- Start Small: Pull a small model in the 7B-8B range first (ollama pull llama3). Get it working, see how it performs, and understand your system’s limits before you go for the big ones.
- Beware the “It Fits” Trap: Just because a model loads doesn’t mean it will run well. If your VRAM is 99% utilized during load, generation will be slow and might error out. Monitor your usage.
- Context Length is a VRAM Killer: The KV cache grows in direct proportion to context length, so doubling the context (e.g., from 2k to 4k) doubles the cache, and the compute buffers grow along with it. On a large model that can mean several extra gigabytes. A long conversation will use significantly more VRAM than a single short prompt. If you need long context, you need more headroom.
- Consumer vs. Pro Cards: Not all 16GB NVIDIA cards are created equal. A gaming card like the RTX 4080 uses fast GDDR6X memory, while a workstation card like the RTX A4000 uses GDDR6 with ECC and has noticeably less memory bandwidth. For pure inference throughput, the gaming card often wins. The pro cards are for folks who can’t afford a single flipped bit in a days-long training run, which is not us right now. Don’t feel like you need a “professional” card; a high-end consumer card is almost always the better value for this specific task.
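For the "check your VRAM" and "monitor your usage" advice above, you can also poll nvidia-smi from a script. This is a minimal sketch for NVIDIA cards only: it relies on nvidia-smi's `--query-gpu` and `--format=csv,noheader,nounits` options, which report memory in MiB, and the helper names are made up for illustration. On a machine without nvidia-smi it simply prints nothing.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'total, used' MiB lines from nvidia-smi into (total, used, free) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        total, used = (int(field) for field in line.split(","))
        gpus.append((total, used, total - used))
    return gpus


def query_gpu_memory():
    """Return per-GPU memory figures in MiB, or an empty list if the query fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total,memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_gpu_memory(result.stdout)


for index, (total, used, free) in enumerate(query_gpu_memory()):
    print(f"GPU {index}: {used}/{total} MiB used, {free} MiB free")
```

Run it once before loading a model and again mid-generation; if the free figure is hovering near zero, you have fallen into the “It Fits” trap.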