22.6 PEFT Library: LoRA, Prefix Tuning, Prompt Tuning

Alright, let’s get our hands dirty. You’ve trained a massive model from scratch, and you’re feeling pretty good about yourself. Now you want to adapt that behemoth to your specific task—say, generating 18th-century pirate sea shanties or classifying the emotional state of garden gnomes. The naive way is “full fine-tuning”: you take the entire multi-gigabyte model and update every single one of its billions of parameters. It works, but let’s be honest, it’s absurd. You need a small fortune in GPU memory, you’re risking “catastrophic forgetting” where the model forgets how to speak English while learning about gnomes, and it’s about as efficient as using a particle accelerator to toast bread.

This is where Parameter-Efficient Fine-Tuning (PEFT) comes in, and it’s an absolute game-changer. The core idea is brilliantly simple: leave the original, pre-trained model frozen the heck alone. Instead, you inject a tiny, trainable module into the model or add a small set of external parameters. You then only train that little bit. The result? On many tasks you land close to full fine-tuning quality while using a fraction of the compute and memory. It’s the difference between repainting an entire skyscraper and just swapping out the welcome mat. The PEFT library from Hugging Face is our toolbox for this, and we’re going to focus on its three most compelling techniques.

LoRA: Low-Rank Adaptation, The Main Attraction

LoRA is the rockstar of PEFT, and for good reason. It’s based on a wild but well-supported hypothesis: the update to a model’s weights during fine-tuning has a low “intrinsic rank.” In human terms, while the weight matrices themselves are huge (e.g., 4096x4096), the update we actually need to make to them can be approximated by the product of two much smaller matrices.

Here’s the magic: for a frozen pre-trained weight matrix W0 of shape d x k, LoRA doesn’t change it. Instead, it learns two much smaller matrices, B (d x r) and A (r x k), so that the forward pass becomes h = W0*x + (alpha/r)*B*A*x. We freeze W0 and train only A and B. The rank r is the key hyperparameter: it’s the size of the bottleneck between these two matrices. Typical values of r run from 8 to 64, which is tiny compared to the original 4096. The factor alpha/r is just a constant scale on the update; it shows up again below as lora_alpha. The number of trainable parameters plummets, and so does your GPU memory bill.
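To make the formula concrete, here is a toy sketch of the LoRA forward pass in plain Python. The sizes, the helper names (matvec, lora_forward) and the initialization (small random A, all-zero B, as described in the LoRA paper) are assumptions for illustration; real implementations do this with tensors on the GPU.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha, r):
    """Compute h = W0 x + (alpha / r) * B (A x). W0 is frozen; A and B are trainable."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))  # A maps k -> r, B maps r -> d
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, k, r, alpha = 6, 6, 2, 4  # toy sizes, purely hypothetical
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # small random init
B = [[0.0] * r for _ in range(d)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(k)]

h = lora_forward(W0, A, B, x, alpha, r)
base = matvec(W0, x)
print(h == base)  # B starts at zero, so the adapted model begins identical to the base model

# Parameter bookkeeping for one realistic 4096x4096 matrix at r=16.
full_params = 4096 * 4096
lora_params = 16 * 4096 + 4096 * 16  # A is r x k, B is d x r
print(f"{lora_params:,} trainable vs {full_params:,} frozen")
```

Because B starts at zero, training begins from exactly the pre-trained behavior, and only the small A and B matrices ever receive gradients.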

Let’s see it in action. We’ll tweak a model to be a sarcasm-detector, because of course.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType

# Load our base model. 7B counts as "small" for an LLM, and we load it in 4-bit to save memory.
model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # Requires the bitsandbytes package. We'll talk about this 4bit thing soon...
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

# Define the LoRA configuration. This is where we set our key parameters.
lora_config = LoraConfig(
    r=16,                  # The rank. The size of our low-rank matrices.
    lora_alpha=32,         # Scaling factor. Just set it to 2x your `r` to start, trust me.
    target_modules=["q_proj", "v_proj"], # Which parts of the model to inject into. For LLMs, attention is usually the gold.
    lora_dropout=0.05,     # A little dropout to prevent overfitting, classic.
    bias="none",           # Don't train the bias weights.
    task_type=TaskType.CAUSAL_LM  # The type of task we're doing.
)

# Now apply it to our model!
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters() # This will output the glorious result.
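# Output format: trainable params: ... || all params: ... || trainable%: ...
# For this config on Mistral-7B, trainable params should come to 6,815,744.
# The reported total (and so the percentage) depends on quantization and your PEFT version.

Look at that. We’re training a fraction of a percent of the total parameters. With the base model in 4-bit, you can practically do this on a single gaming GPU.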
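A handy sanity check is to work out the trainable count by hand and compare it with what print_trainable_parameters() reports. The sketch below assumes the dimensions from Mistral-7B-v0.1's published config (hidden size 4096, 32 layers, 8 key/value heads of width 128, so v_proj maps 4096 to 1024); the helper name is made up for illustration.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

# Assumed dimensions for Mistral-7B-v0.1 (taken from its model config).
hidden_size = 4096
kv_dim = 8 * 128   # grouped-query attention: 8 KV heads of width 128
num_layers = 32
r = 16

per_layer = (
    lora_params(hidden_size, hidden_size, r)  # q_proj: 4096 -> 4096
    + lora_params(hidden_size, kv_dim, r)     # v_proj: 4096 -> 1024
)
expected_trainable = num_layers * per_layer
print(f"{expected_trainable:,}")  # 6,815,744
```

If the number your model prints is far from this, the first suspect is target_modules.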

Prefix and Prompt Tuning: Learning to… Prompt

Before LoRA stole everyone’s heart, there were these two similar approaches: Prefix and Prompt Tuning. The concept is weird but cool: instead of changing the model’s internals, we train a small set of continuous embeddings that we prepend to the input sequence. The model’s weights stay completely frozen.

Prompt Tuning: You train a small set of “soft prompts” (embedding vectors) for a specific task. The original “hard prompts” (the text you type) are fixed.
Prefix Tuning: A more sophisticated version that not only adds embeddings to the input (prefix) but also creates past key-value pairs for the attention mechanism at every layer, giving it more to work with.
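The difference between the two shows up clearly in the parameter counts. This is a back-of-the-envelope sketch, assuming plain multi-head attention (key/value width equal to the hidden size) and no reparameterization network for the prefix; the function names and model dimensions are hypothetical.

```python
def prompt_tuning_params(num_virtual_tokens: int, hidden_size: int) -> int:
    """One trainable embedding vector per virtual token, at the input only."""
    return num_virtual_tokens * hidden_size

def prefix_tuning_params(num_virtual_tokens: int, hidden_size: int, num_layers: int) -> int:
    """One trainable key vector and one value vector per virtual token, at every layer."""
    return num_virtual_tokens * num_layers * 2 * hidden_size

# Hypothetical model: 32 layers, hidden size 4096, 20 virtual tokens.
n_tokens, hidden, layers = 20, 4096, 32
print(f"prompt tuning: {prompt_tuning_params(n_tokens, hidden):,}")          # 81,920
print(f"prefix tuning: {prefix_tuning_params(n_tokens, hidden, layers):,}")  # 5,242,880
```

Prefix Tuning buys its extra capacity by touching every layer, which is why it costs roughly 2 x num_layers times more parameters than Prompt Tuning for the same number of virtual tokens.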

The PEFT library makes them both trivial to implement. Here’s Prompt Tuning:

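from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

# Start from a fresh base model: get_peft_model() in the LoRA example
# injected LoRA layers into `model` in place. Load it however you loaded
# the original (e.g. in 4-bit).
base_model = AutoModelForCausalLM.from_pretrained(model_id)

prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT, # Initialize from text, not random
    prompt_tuning_init_text="Classify the sentiment of this tweet as positive or negative:", # Your starting point
    num_virtual_tokens=20,  # Number of those soft prompt tokens to train
    tokenizer_name_or_path=model_id,  # Tokenizer used to embed the init text
)

prompt_model = get_peft_model(base_model, prompt_config)
prompt_model.print_trainable_parameters()
# Trainable params should come to 20 virtual tokens x 4096 hidden dims = 81,920.
# The reported total and percentage depend on how the base model was loaded.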

Even more efficient than LoRA! So why isn’t everyone using it? Well, it can be trickier to train and often doesn’t perform quite as well as LoRA, especially on more complex tasks. It’s a fantastic tool for a very specific job, but LoRA is generally the more reliable and powerful choice.

Best Practices and Pitfalls

Don’t just throw r=64 at everything and call it a day. Here’s the real trench knowledge.

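What to Target: For decoder-based models (like GPT, Llama, Mistral), target the query (q_proj) and value (v_proj) projection matrices in the attention layers. This is a solid default. If quality falls short, widening the targets to all of the linear layers (attention and MLP) often helps, at the cost of more trainable parameters. For encoder models (like BERT), the attention projections are named query and value, and intermediate.dense and output.dense are also prime candidates.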
Rank r Matters, But Not That Much: Start with 8 or 16. Going to 64 might give a slight boost, but often the returns diminish rapidly. A higher r isn’t automatically better; it’s just more parameters.
Alpha is Your Scaling Knob: The lora_alpha parameter scales the learned weights. Think of r as the complexity of the update and alpha as its magnitude. The ratio alpha/r is what’s important. A ratio of 1 is neutral. Start with a ratio of 2 or 4 (e.g., r=16, alpha=32) and adjust if you need more or less “oomph.”
Check Your Target Modules: The biggest pitfall is misconfiguring target_modules. The names must match the module names in your specific model. Call peft_model.print_trainable_parameters() on the wrapped model to see the count. If it’s suspiciously low or high, you messed this up. Always print the model’s module names (print([name for name, _ in model.named_modules()])) if you’re unsure.
Save and Load Correctly: When you save, you’re only saving the tiny LoRA weights, not the entire model. To reload for inference, you must load the base model and then load the PeftModel with the adapter.

# Save the adapter, not the whole model
peft_model.save_pretrained("./my_lora_adapter")

# Later, for inference:
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(model_id)
loaded_model = PeftModel.from_pretrained(base_model, "./my_lora_adapter")
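Since the adapter lives apart from the base weights, one base model can carry several adapters at once. Here is a minimal sketch, assuming PEFT's load_adapter and set_adapter methods on PeftModel; the helper name and every path other than ./my_lora_adapter are hypothetical.

```python
def load_with_adapters(base_model, adapter_dirs):
    """Attach several saved adapters to a single base model.

    adapter_dirs maps an adapter name to a directory written by save_pretrained().
    """
    # Imported inside the function so the helper can be defined without peft installed.
    from peft import PeftModel

    if not adapter_dirs:
        raise ValueError("adapter_dirs must contain at least one adapter")

    names = list(adapter_dirs)
    first = names[0]
    # The first adapter creates the PeftModel wrapper around the base model.
    model = PeftModel.from_pretrained(base_model, adapter_dirs[first], adapter_name=first)
    # The remaining adapters are attached to that same wrapper.
    for name in names[1:]:
        model.load_adapter(adapter_dirs[name], adapter_name=name)
    return model

# Usage sketch (the second path is made up):
# model = load_with_adapters(base_model, {
#     "sarcasm": "./my_lora_adapter",
#     "shanties": "./my_shanty_adapter",
# })
# model.set_adapter("shanties")  # switch which adapter is active
```

Only one copy of the base weights sits in memory; switching tasks is a matter of picking a different adapter name.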

The beauty of this is that you can have a single base model and dozens of tiny, task-specific adapters that you can hot-swap on the fly. It’s elegantly efficient, and it frankly makes full fine-tuning look a little barbaric.
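One last practical aid, going back to the target_modules pitfall from the list above: you can check your target names against the model's module names before training anything. This is a small sketch that mimics suffix matching (a target matches a module whose full name equals it or ends with "." plus it), which is our understanding of how PEFT treats a list of targets. The helper and the sample names are hypothetical; in practice you would feed it [name for name, _ in model.named_modules()].

```python
def check_target_modules(module_names, target_modules):
    """Return {target: [matching module names]} using suffix matching."""
    matches = {}
    for target in target_modules:
        matches[target] = [
            name for name in module_names
            if name == target or name.endswith("." + target)
        ]
    return matches

# Hypothetical module names, in the style of model.named_modules() output.
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.up_proj",
]

result = check_target_modules(names, ["q_proj", "v_proj", "query"])
missing = [target for target, hits in result.items() if not hits]
print(missing)  # ['query']
```

A target with no matches (here the BERT-style name "query" on a Llama-style model) is exactly the kind of mistake that produces a suspicious trainable-parameter count.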