22.2 Instruction Fine-Tuning: Training on (Instruction, Response) Pairs

Right, so you’ve got a base model. It’s a brilliant, rambling savant that can predict the next word with terrifying accuracy. But ask it to write a polite email to your boss about that “project timeline adjustment” (read: you broke the production database), and it might just give you a recipe for chicken soup instead. It needs to learn to obey. That’s where instruction fine-tuning comes in.

We teach it to follow commands by training it on a dataset of (instruction, response) pairs. The core idea is stupidly simple: we show the model an instruction (e.g., “Translate this to French: Hello, world”), and we train it to produce the correct response (“Bonjour, le monde”). We’re not teaching it new facts; we’re teaching it a new style of interaction. We’re shaping its behavior to be helpful, honest, and harmless, or at least as close as we can get.

The training objective is identical to pre-training: next-token prediction. The magic is entirely in the dataset’s structure. We feed the entire string, "### Instruction:\n<your instruction>\n\n### Response:\n<desired response>", into the model, but we mask the loss so that only the tokens in the response part count toward it. We want the model to learn what to generate when it sees an instruction, not to learn how to reconstruct the instruction itself. If you forget to mask the instruction, part of every gradient step goes to predicting instruction text, and you drift toward a weird, inefficient form of continued pre-training.
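To make the masking concrete, here is a minimal pure-Python sketch of a masked next-token loss. It is a toy, not what a training framework runs internally: the function name masked_next_token_loss, the four-token vocabulary, and the uniform scores are all made up for illustration. The only borrowed convention is the -100 sentinel, which PyTorch's cross-entropy loss ignores by default.

```python
import math

IGNORE_INDEX = -100  # sentinel for "do not score this position"


def masked_next_token_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not masked.

    logits: one list of scores per position (one score per vocabulary entry)
    labels: token ids, with ignore_index at positions to skip
    The scores at position t are compared with the label at position t + 1.
    """
    total, count = 0.0, 0
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # instruction token: contributes nothing to the loss
        # Numerically stable log-sum-exp, then subtract the target's score
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
        count += 1
    if count == 0:
        raise ValueError("every label is masked")
    return total / count


# Toy sequence: 3 instruction tokens, then 2 response tokens, vocabulary of 4
input_ids = [0, 1, 2, 3, 1]
labels = [IGNORE_INDEX] * 3 + input_ids[3:]
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in input_ids]
loss = masked_next_token_loss(uniform_logits, labels)

# Changing a prediction that targets an instruction token leaves the loss unchanged
perturbed_logits = [row[:] for row in uniform_logits]
perturbed_logits[0] = [5.0, -3.0, 2.0, 0.0]
perturbed_loss = masked_next_token_loss(perturbed_logits, labels)
```

Only the two response tokens are scored, so the model gets no credit or blame for how well it would have predicted the instruction.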

The Formatting Dance is Crucial

You can’t just throw a CSV file at it. The model needs to understand where the instruction ends and where its response begins. You must use a consistent format with clear separators. I’m a fan of the ### Instruction: and ### Response: tags; they’re highly distinct and unlikely to appear in your actual text. This format becomes a secret handshake between you and the model.

Here’s a peek at how you’d structure a single example for your dataset. Notice that the response field holds only the completion, while the string we actually train on (training_prompt in the code) is the instruction and response combined in the format above.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Add a padding token if it doesn't have one (Llama doesn't)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

example = {
    "instruction": "Write a haiku about debugging code.",
    "response": "A silent error hides,\nIn the logic's tangled depths,\nA print statement shines."
}

# The text we actually train on:
training_prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"

# Tokenize for training (what we feed into the model)
tokenized_input = tokenizer(training_prompt, truncation=True, padding="max_length", max_length=128)
print(tokenized_input["input_ids"])
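The handshake only works if both sides use it, so it can help to keep the template in one place and reuse it for training and for inference. What follows is a minimal sketch under that assumption. The helper names build_prompt, build_training_text, and extract_response are hypothetical, not part of any library.

```python
# Hypothetical helpers: a single source of truth for the prompt template.
INSTRUCTION_TAG = "### Instruction:\n"
RESPONSE_TAG = "\n\n### Response:\n"


def build_prompt(instruction: str) -> str:
    """Prompt used at inference time: everything up to where the response starts."""
    return f"{INSTRUCTION_TAG}{instruction.strip()}{RESPONSE_TAG}"


def build_training_text(instruction: str, response: str) -> str:
    """Full training string: the inference prompt followed by the desired response."""
    return build_prompt(instruction) + response.strip()


def extract_response(generated_text: str) -> str:
    """Return the text after the last response tag, or the input if no tag is found."""
    tag = RESPONSE_TAG.strip()
    _, found, tail = generated_text.rpartition(tag)
    return tail.strip() if found else generated_text.strip()


prompt = build_prompt("Write a haiku about debugging code.")
text = build_training_text("Write a haiku about debugging code.", "A silent error hides")
```

Because the training text is literally the inference prompt plus the response, the model sees the same prefix at generation time that it saw during fine-tuning.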

The Loss Masking Trick

This is the most common “oops” moment. You must tell your training loop to only calculate loss on the response tokens. Here’s how you do it conceptually and in code. We create a labels copy where we set all tokens not in the response section to -100, which the cross-entropy loss function in PyTorch ignores.

def format_train_example(example, tokenizer, max_length=512):
    # Build the prompt (instruction + separator) and the full training text.
    # Appending EOS lets the model learn where a response ends; this assumes the
    # tokenizer maps its EOS string back to the EOS token id.
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    full_text = prompt + example["response"] + tokenizer.eos_token

    # Tokenize the full text, padded to a fixed length
    tokenized = tokenizer(full_text, truncation=True, padding="max_length", max_length=max_length)

    # Count the prompt tokens by tokenizing the prompt on its own (no padding).
    # Searching for the separator's token ids is fragile: SentencePiece-style
    # tokenizers can split "### Response:\n" differently in isolation than in context.
    # This is still an approximation, since tokens can merge across the boundary.
    prompt_len = len(tokenizer(prompt, truncation=True, max_length=max_length)["input_ids"])

    # Padding may sit on the left or the right, so locate the first real token
    attention_mask = tokenized["attention_mask"]
    response_start = attention_mask.index(1) + prompt_len

    # Labels: -100 for prompt tokens and padding, the token id everywhere else
    labels = [
        token_id if (mask == 1 and i >= response_start) else -100
        for i, (token_id, mask) in enumerate(zip(tokenized["input_ids"], attention_mask))
    ]

    if all(label == -100 for label in labels):
        # Truncation removed the whole response, so there is nothing to learn from
        raise ValueError("No response tokens left after truncation; increase max_length.")

    tokenized["labels"] = labels
    return tokenized

# Use this function in your dataset's __getitem__ method or map it to your dataset.
formatted_data = format_train_example(example, tokenizer)
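Since a wrong mask fails silently, a quick sanity check is worth the few lines it costs. The sketch below pulls out the token ids that still contribute to the loss so you can inspect them. The helper name supervised_token_ids is hypothetical, and the ids in the toy example are invented rather than real tokenizer output.

```python
def supervised_token_ids(input_ids, labels, ignore_index=-100):
    """Return the token ids that still contribute to the loss, in order."""
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels must have the same length")
    kept = []
    for token_id, label in zip(input_ids, labels):
        if label == ignore_index:
            continue  # masked position, e.g. instruction text
        if label != token_id:
            raise ValueError("an unmasked label must equal its input id")
        kept.append(token_id)
    return kept


# Invented ids standing in for a tokenized example: 4 prompt tokens, 3 response tokens
toy_input_ids = [1, 835, 2799, 13, 450, 1234, 2]
toy_labels = [-100, -100, -100, -100, 450, 1234, 2]
kept_ids = supervised_token_ids(toy_input_ids, toy_labels)
```

With the real tokenizer, decoding the result of supervised_token_ids(formatted_data["input_ids"], formatted_data["labels"]) with tokenizer.decode should give back the haiku. If instruction text or a long run of padding tokens shows up, the mask is wrong.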

Data Quality is Everything

Garbage in, garbage out. This is the law. A base model has seen a trillion or more tokens of mostly decent internet text. If your instruction dataset is small, shallow, and full of repetitive, robotic responses, your model will become shallow and robotic too. You need diversity in task types (translation, summarization, creative writing, reasoning), in complexity, and in writing style. Good datasets typically contain thousands to tens of thousands of high-quality examples, often painstakingly crafted by humans or filtered from larger, noisier sets. Don’t expect to train on 500 hastily generated examples from ChatGPT and get a genius.
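Filtering a noisy set usually starts with the boring checks. Here is a rough sketch of a first pass that drops empty, very short, and duplicate examples. The function name clean_instruction_data and the 20-character threshold are arbitrary assumptions, and a length cutoff will also throw away legitimately short answers, so treat it as a starting point rather than a recipe.

```python
def clean_instruction_data(examples, min_response_chars=20):
    """Drop empty, too-short, and duplicate examples; keep first occurrences."""
    seen = set()
    cleaned = []
    for ex in examples:
        instruction = ex.get("instruction", "").strip()
        response = ex.get("response", "").strip()
        if not instruction or len(response) < min_response_chars:
            continue
        # Normalize case and whitespace so trivial variants count as duplicates
        key = " ".join(instruction.lower().split())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"instruction": instruction, "response": response})
    return cleaned


raw = [
    {"instruction": "Summarize the article.", "response": "The article argues that tests save time."},
    {"instruction": "summarize  the article.", "response": "A duplicate with different spacing."},
    {"instruction": "Translate to French: Hello", "response": "Ok."},
    {"instruction": "", "response": "An answer with no question attached."},
]
cleaned = clean_instruction_data(raw)
```

Checks like these catch only the mechanical problems. Diversity and response quality still need a human eye or a stronger filter.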

The Overfitting Trap

You’re training on a finite set of instructions. The model is a sponge and will memorize them. If you train for too long, you’ll get a model that performs brilliantly on its training data and fails utterly on any new, slightly different instruction. This is called overfitting, and it’s the enemy. You must hold out a separate validation dataset and check the loss on those unseen instructions at regular intervals. Once the validation loss stops decreasing and starts rising, you’ve gone too far: stop training and keep the checkpoint with the lowest validation loss. This is non-negotiable.
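The stopping rule can be sketched in a few lines of plain Python. The class name EarlyStopper is hypothetical and the validation losses are made up. The patience of 2 is a design choice that tolerates one noisy evaluation before stopping, while a patience of 1 stops at the first rise.

```python
class EarlyStopper:
    """Signal a stop once validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step, val_loss):
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_step = step  # the checkpoint from this step is the one to keep
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Made-up validation losses, one per evaluation
val_losses = [1.92, 1.61, 1.48, 1.51, 1.57, 1.66]
stopper = EarlyStopper(patience=2)
stopped_at = None
for step, val_loss in enumerate(val_losses):
    if stopper.update(step, val_loss):
        stopped_at = step
        break
```

In a real loop you would save a checkpoint whenever best_step changes and reload that checkpoint once update returns True.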