22.7 Dataset Preparation: Formatting and Tokenization for Fine-Tuning

Right, let’s get your data ready for the main event. This is the part where most people screw it up, not because it’s intellectually taxing, but because it’s tedious and the rules are annoyingly specific. Think of it like packing a parachute: boring as hell, but you’ll be profoundly grateful you did it right when you jump.

Your model doesn’t read prose. It reads numbers. More specifically, it reads a sequence of numbers called tokens. Our job is to take your beautifully curated text and convert it into a perfectly formatted, numerically tokenized dataset that the model can digest without getting indigestion.

The Golden Rule: Match the Original Training Format

This is the single most important piece of advice I can give you. These models were trained on data in a specific format. Deviating from that format is like trying to feed a cat by throwing grapes at it—confusing, ineffective, and mildly infuriating for all parties involved.

Instruction-tuned models like Llama 3 or Mistral were trained on a chat format that uses special tokens or markers to separate the system prompt, the user turns, and the assistant turns. The exact markers differ from model to model. For example, Zephyr-style models use a structure along these lines:

<|system|>
You are a helpful AI assistant.
<|user|>
What is the capital of France?
<|assistant|>
The capital of France is Paris.

Your job is to replicate this structure exactly. Not kinda-sorta. Exactly. The model’s next-token prediction superpowers are conditioned on seeing these tokens. If you use ### Human: instead of <|user|>, you’re starting from a handicap. Find the official template for your base model and use it religiously.

Here’s how you’d create a single example in the ChatML format (used by many models):

system_message = "You are a helpful, terse assistant."
user_input = "Explain quantum entanglement."
assistant_response = "It's when two particles are so deeply in love that they share a single existence, no matter the distance. Spooky."

formatted_example = f"<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{user_input}<|im_end|>\n<|im_start|>assistant\n{assistant_response}<|im_end|>"
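Real conversations often have more than one turn, so here is a minimal sketch of the same ChatML layout generalized to a list of messages. The helper name format_chatml is hypothetical (it is not part of any library), and it assumes the plain ChatML layout shown above with turns joined by a newline.

```python
def format_chatml(messages):
    """Render a list of {"role": ..., "content": ...} dicts as one ChatML string."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>")
    # Turns are separated by a single newline, matching the f-string above
    return "\n".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful, terse assistant."},
    {"role": "user", "content": "Explain quantum entanglement."},
    {"role": "assistant", "content": "Spooky."},
]
formatted = format_chatml(conversation)
print(formatted)
```

For a real model, prefer the chat template that ships with its tokenizer over a hand-rolled formatter like this one; the sketch is only meant to show what such a template produces.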

Tokenization: Where the Magic (and Pain) Happens

Tokenization is the process of converting your text into those model-digestible tokens. It’s not just fancy splitting on spaces; it uses a learned algorithm (like Byte-Pair Encoding) to break text into subword units. The key thing to remember: the tokenizer must be the one that originally shipped with the model you’re using. A different tokenizer maps the same text to different token IDs, and the model’s embeddings only make sense for the IDs it was trained on. Using the wrong tokenizer is a non-starter.
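To make "subword units" less abstract, here is a toy sketch of how Byte-Pair Encoding applies its learned merge rules to a word. The merge table below is invented for illustration; a real tokenizer learns tens of thousands of merges from data and usually works on bytes rather than characters.

```python
def apply_bpe(word, merges):
    """Split a word into characters, then apply merge rules in priority order."""
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse every adjacent (left, right) pair into a single symbol
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, ordered from highest to lowest priority
toy_merges = [("t", "o"), ("k", "e"), ("to", "ke"), ("i", "z")]
pieces = apply_bpe("tokenizer", toy_merges)
print(pieces)  # ['toke', 'n', 'iz', 'e', 'r']
```

The same word would split differently under a different merge table, which is exactly why the tokenizer has to match the model.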

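Let’s tokenize our example. With Llama-family tokenizers, the tokenizer prepends the bos_token (beginning of sequence) automatically, once, at the very start. The model expects this. If your formatted text already begins with the BOS token, pass add_special_tokens=False so you don’t end up with two of them.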

from transformers import AutoTokenizer

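# Swap in the model you are actually fine-tuning. The ChatML markers above are only
# single special tokens for tokenizers whose vocabulary includes them.
model_id = "meta-llama/Meta-Llama-3-8B"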
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Set padding token if it's not set (common with some models)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize the entire formatted example
tokenized_input = tokenizer(
    formatted_example,
    return_tensors="pt",  # Return PyTorch tensors
    truncation=True,
    max_length=512,       # Your chosen context length
)
print(tokenized_input.input_ids)

The Label Shifting Trick

Pay attention, because this is the clever bit that makes training work. During causal language model training, we teach the model to predict the next token. So, for the sequence [The, capital, is, Paris], the input is [The, capital, is] and the label (what we’re calculating loss against) is [capital, is, Paris].

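With Hugging Face causal LM models, you don’t do the shifting yourself: you pass the same token IDs as both the input_ids and the labels, and the model shifts the labels by one position internally when it computes the loss. Any label position set to the ignore_index (-100 by default) is excluded from the loss; every other position is something the model learns from.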
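A small pure-Python sketch shows which (input, target) pairs actually contribute to the loss once the labels are offset by one position and ignored positions are dropped. The function name next_token_pairs is hypothetical, and strings stand in for token IDs to keep the output readable.

```python
IGNORE_INDEX = -100


def next_token_pairs(input_ids, labels):
    """Pair the input at position t with the label at position t + 1, skipping ignored labels."""
    pairs = []
    for t in range(len(input_ids) - 1):
        target = labels[t + 1]
        if target != IGNORE_INDEX:
            pairs.append((input_ids[t], target))
    return pairs


tokens = ["The", "capital", "is", "Paris"]

# Labels identical to the inputs: every next-token prediction counts
all_pairs = next_token_pairs(tokens, tokens)
print(all_pairs)

# First two label positions ignored: only the later predictions count
masked_labels = [IGNORE_INDEX, IGNORE_INDEX, "is", "Paris"]
masked_pairs = next_token_pairs(tokens, masked_labels)
print(masked_pairs)
```

Note that the first token never appears as a target and the last token never appears as an input, which is why a sequence of four tokens yields at most three training pairs.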

In instruction tuning, we want the model to only learn from the assistant’s response. We mask out the loss calculation for the system prompt and user input. Here’s how you do it manually to understand the principle:

# 'full_text' is our tokenized instruction + response (tokenized with return_tensors="pt")
labels = full_text.input_ids.clone()
# Assume, for illustration, that tokens 0 to 49 are the system & user parts (the instruction).
# In practice, find this boundary by tokenizing the prompt portion on its own.
# We set the labels for these parts to -100 so they are ignored in the loss function
labels[:, :50] = -100

# Now, your data dict for training looks like this:
data_dict = {
    "input_ids": full_text.input_ids,
    "attention_mask": full_text.attention_mask,
    "labels": labels,
}
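The hard-coded boundary above is only for illustration. One common way to find the real boundary is to tokenize the prompt and the response separately and concatenate them, so the prompt length is known exactly. The sketch below shows that idea with a hypothetical whitespace tokenizer (toy_tokenize) standing in for the real one; with a real tokenizer you would also need to make sure special tokens are not added twice.

```python
IGNORE_INDEX = -100


def toy_tokenize(text, vocab):
    """Hypothetical stand-in for a real tokenizer: one ID per whitespace-separated piece."""
    return [vocab.setdefault(piece, len(vocab)) for piece in text.split()]


def build_masked_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt positions in the labels."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


vocab = {}
prompt_ids = toy_tokenize("<|user|> What is the capital of France? <|assistant|>", vocab)
response_ids = toy_tokenize("The capital of France is Paris.", vocab)

input_ids, labels = build_masked_labels(prompt_ids, response_ids)
print(input_ids)
print(labels)
```

Because the mask is derived from len(prompt_ids) rather than a guessed constant, it stays correct when prompts vary in length from example to example.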

Automating the Process with tokenize_function

You’re not going to do this for every example by hand. You’ll write a function to process your entire dataset. This function is where you implement the formatting and the label masking. When you call the datasets library’s map with batched=True, the function receives a batch of examples: a dict that maps each column name to a list of values.

def tokenize_function(examples):
    # First, format the text into the correct chat template
    # Let's assume your dataset has 'system', 'user', and 'assistant' columns
    messages_list = []
    for sys, user, ast in zip(examples['system'], examples['user'], examples['assistant']):
        messages = [
            {"role": "system", "content": sys},
            {"role": "user", "content": user},
            {"role": "assistant", "content": ast}
        ]
        # This apply_chat_template is the modern, best way to format. USE IT.
        formatted_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        messages_list.append(formatted_text)

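    # Now, tokenize the entire batch of formatted text
    tokenized_inputs = tokenizer(
        messages_list,
        truncation=True,
        max_length=2048,
        padding=False,  # We'll do dynamic padding later in a collator
        add_special_tokens=False,  # The chat template already includes the special tokens; don't add them twice
    )

    # Create the labels as a per-example copy of the input_ids.
    # As written, this trains on every token. To learn only from the assistant's response,
    # set the prompt positions to -100 here, as in the manual example above.
    tokenized_inputs["labels"] = [list(ids) for ids in tokenized_inputs["input_ids"]]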

    return tokenized_inputs

# Apply this to your HF dataset
dataset = your_huggingface_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=your_huggingface_dataset.column_names # Remove original text to save space
)

The Final Boss: Dynamic Padding and the Data Collator

You’ve got a problem: your examples are all different lengths. Padding every example to the length of the longest one in the entire dataset is a fantastic way to waste colossal amounts of memory and time. The solution is dynamic padding: we pad the examples in each batch only to the length of the longest example in that batch.

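This is handled by a data collator; here we use DataCollatorForLanguageModeling. With mlm=False it pads each batch dynamically and builds the labels for causal LM by copying input_ids and setting the padding positions to -100. It does not shift anything: the model does that internally when it computes the loss. Two catches. First, because the collator rebuilds labels from input_ids, any prompt masking you applied yourself is not carried over; if you need to keep custom labels, use a collator that pads an existing labels column with -100 instead, such as DataCollatorForSeq2Seq. Second, if you set pad_token to eos_token as we did earlier, the real end-of-sequence tokens get masked out of the loss along with the padding.

from transformers import DataCollatorForLanguageModeling

# This collator handles the dynamic padding and builds labels from input_ids,
# with padding positions set to -100. The shift happens inside the model.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # We are doing causal LM, not masked LM
    return_tensors="pt",
)

Now, when your training loop fetches a batch, the collator takes your list of tokenized examples (all with different lengths), pads them to the longest one in that batch, and creates the labels tensor as a copy of input_ids with the padding masked out. The shift by one position happens later, inside the model’s loss computation.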
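To make the mechanics of dynamic padding concrete, here is a minimal pure-Python sketch of what a padding collator does for one batch. The function collate_batch is hypothetical and is not the transformers implementation; it works on plain lists, pads on the right, and assumes each example already carries its own labels.

```python
IGNORE_INDEX = -100


def collate_batch(features, pad_token_id):
    """Pad every example to the longest one in this batch only."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        # Padding positions are excluded from the loss
        batch["labels"].append(f["labels"] + [IGNORE_INDEX] * n_pad)
    return batch


features = [
    {"input_ids": [5, 6, 7, 8], "labels": [-100, -100, 7, 8]},
    {"input_ids": [5, 9], "labels": [-100, 9]},
]
batch = collate_batch(features, pad_token_id=0)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```

The batch is padded to length 4 because that is the longest example in it; a different batch with shorter examples would be padded less, which is where the memory savings come from.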

Get this pipeline right, and the actual fine-tuning part will feel like a breeze. Get it wrong, and you’ll be staring at a loss curve that doesn’t move, wondering if you’ve offended the machine spirit of your GPU. It’s worth taking the time to get it right.
