19.5 Few-Shot and Zero-Shot Transfer
Right, so you’ve got a big, beefy pre-trained model. It knows the visual structure of the world or the statistical shape of human language better than you know the route to your favorite coffee shop. But you want it to do something specific—recognize a particular type of manufacturing defect, classify customer support tickets, generate code comments in your team’s weirdly specific style. You don’t have a million labeled examples for this. You might only have a handful. You might even have zero. This is where we move from just slinging models to doing actual wizardry. Welcome to few-shot and zero-shot transfer.
The secret sauce here isn’t magic; it’s semantics. These massive models, especially the transformer-based ones like BERT, GPT, and their myriad descendants, aren’t just pattern matchers. They build rich, internal representations of concepts. Few-shot and zero-shot learning is the art of tapping into those representations and bending them to your will with minimal guidance.
The Zero-Shot Gambit
Zero-shot learning is the ultimate test of a model’s semantic understanding. You’re asking it to perform a task it was never explicitly trained on, using only its pre-existing knowledge and your instructions. The most common way to do this is by reformulating your task as a text-to-text or text-matching problem.
Think of it like giving a brilliant intern a task. You wouldn’t just shove data at them and say “classify!” You’d say, “Hey, read this customer email and tell me if the sentiment is positive, negative, or neutral.” You’re providing the context and the possible labels. Zero-shot does the same thing with the model.
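Here’s how you might do it with facebook/bart-large-mnli, a BART checkpoint fine-tuned on the MultiNLI natural language inference dataset. The zero-shot classification pipeline needs an NLI-trained model like this one, not a plain text-to-text generator. Let’s say you want to classify news headlines into categories, but you have zero labeled examples for those categories.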
from transformers import pipeline
# Load a zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# Our candidate labels - the model has never been explicitly trained on these.
candidate_labels = ['politics', 'technology', 'sports', 'science', 'art']
# The headline we want to classify
headline = "Researchers discover a novel approach to quantum computing using graphene."
results = classifier(headline, candidate_labels)
print(f"The headline is classified as: {results['labels'][0]} with confidence {results['scores'][0]:.4f}")
# Prints the top-ranked label and its score. The exact label and number depend on the checkpoint, so treat any specific value as illustrative.
How did it do that? Under the hood, the model is performing natural language inference (NLI). The pipeline treats the headline as the premise and turns each candidate label into a hypothesis, by default “This example is science.” The model scores how strongly the premise entails each hypothesis, and in the default single-label setting those entailment scores are normalized across the candidate labels so they sum to 1. A score is therefore relative to the other labels you offered, not an absolute probability. The model isn’t just looking for keywords; it’s using learned associations among “novel approach,” “quantum computing,” and “graphene” to connect them to the broader concept of “science.” Call it reasoning in a statistical sense.
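To make the NLI framing concrete, here is a minimal sketch of the scoring logic in plain Python. It assumes the pipeline's default hypothesis template ("This example is {}.") and single-label normalization. The names `zero_shot_scores` and `toy_entailment_logit` are hypothetical, and the toy scorer returns made-up numbers in place of a real NLI model.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_scores(premise, labels, entailment_logit):
    """Rank labels by how strongly the premise entails each hypothesis.

    entailment_logit is a hypothetical callable (premise, hypothesis) -> float
    standing in for the entailment logit of a real NLI model.
    """
    hypotheses = build_hypotheses(labels)
    logits = [entailment_logit(premise, h) for h in hypotheses]
    probs = softmax(logits)  # normalize across labels so scores sum to 1
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

def toy_entailment_logit(premise, hypothesis):
    """Illustration only: made-up logits, NOT the output of a real model."""
    made_up = {"science": 2.0, "technology": 1.5}
    label = hypothesis[len("This example is "):-1]
    return made_up.get(label, 0.0)

ranked = zero_shot_scores(
    "Researchers discover a novel approach to quantum computing using graphene.",
    ["politics", "technology", "sports", "science", "art"],
    toy_entailment_logit,
)
print(ranked[0][0])  # "science", because the made-up logits above favor it
```

Because the scores are normalized across whatever labels you pass in, adding or removing a candidate label changes every other label's score. That is worth remembering before you read too much into a single confidence value.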
The biggest pitfall here? The model’s pre-trained knowledge is its ceiling and its floor. If it has a poor or biased understanding of the concepts in your candidate labels, it will perform poorly. Don’t expect a general-purpose model to perfectly classify highly specialized, domain-specific labels it’s never encountered before (e.g., ‘sub-orbital payload logistics’).
Few-Shot: Learning by Example
Zero-shot is cool, but let’s be honest, it can be a bit of a gamble. Few-shot learning is where we get serious. We give the model a few examples of what we want—a few shots—to prime the pump. This is incredibly powerful because it doesn’t just rely on the model’s internal semantics; it gives it a direct, in-context pattern to follow.
The most elegant way to do this now is in-context learning, popularized by the GPT-style models. You don’t update the model’s weights (that’s fine-tuning, which we’ll get to); you just show it what to do right in the input prompt.
from transformers import pipeline
# Note: For real use, you'd use the OpenAI API or a local model like Llama 3 via an appropriate library.
# Let's create a sentiment analysis classifier with just 3 examples.
few_shot_prompt = """
Classify the text into positive, negative, or neutral sentiment.
Text: I absolutely loved the concert last night! The energy was incredible.
Sentiment: positive
Text: The product arrived broken and the customer service was unhelpful.
Sentiment: negative
Text: The package was delivered at 3:15 PM today.
Sentiment: neutral
Text: This new update has completely ruined the app's performance.
Sentiment:
"""
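# gpt2 is only a small placeholder so the snippet runs locally; it is too weak to follow the pattern reliably.
generator = pipeline('text-generation', model='gpt2')
# Strip the surrounding newlines so the prompt ends at "Sentiment:", decode greedily, and return only the new text.
result = generator(few_shot_prompt.strip(), max_new_tokens=5, do_sample=False, return_full_text=False)
print(result[0]['generated_text'])
# A capable instruction-following model would be expected to continue with "negative"; gpt2 may not.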
Why does this work so well? You’re essentially activating the model’s immense pattern-completion capabilities. It sees the structure of your examples (Text: … Sentiment: …) and completes the pattern for your new input. It’s less about “learning” in the traditional sense and more about guided improvisation.
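Since the whole trick is pattern completion, it helps to build the pattern programmatically and to read only the first thing the model writes back. The sketch below is one way to do that, assuming the same "Text: / Sentiment:" layout used above. The helper names `build_few_shot_prompt` and `parse_label` are hypothetical, and the completion string is hand-written, not real model output.

```python
LABELS = ("positive", "negative", "neutral")

def build_few_shot_prompt(instruction, examples, query):
    """Assemble instruction, labeled examples, and the unlabeled query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Text: {query}")
    lines.append("Sentiment:")  # the prompt ends exactly where the answer goes
    return "\n".join(lines)

def parse_label(completion, allowed_labels=LABELS):
    """Return the first word of the completion if it is an allowed label."""
    words = completion.strip().split()
    if not words:
        return None
    first = words[0].strip(".,:;!?\"'").lower()
    return first if first in allowed_labels else None

prompt = build_few_shot_prompt(
    "Classify the text into positive, negative, or neutral sentiment.",
    [
        ("I absolutely loved the concert last night! The energy was incredible.", "positive"),
        ("The product arrived broken and the customer service was unhelpful.", "negative"),
        ("The package was delivered at 3:15 PM today.", "neutral"),
    ],
    "This new update has completely ruined the app's performance.",
)

# Hand-written stand-in for a model completion (models often keep going
# and invent further examples, so only the first word is trusted).
fake_completion = " negative\n\nText: Another made-up example"
print(parse_label(fake_completion))
```

Returning `None` for anything outside the label set is a deliberate choice: it forces the caller to handle off-pattern completions instead of silently accepting them.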
The pitfalls are numerous, however. The order of your examples matters: the model can be sensitive to which examples you show first. The quality of your examples is paramount, because a bad example will teach it the wrong pattern. And perhaps most importantly, you’re limited by the model’s context window. You can only fit so many examples before you run out of space, which is why this is called “few-shot” and not “hundred-shot.”
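The context-window constraint can be handled mechanically. Here is a rough sketch that keeps examples, in the order given, until a token budget is exhausted. The whitespace word count is a crude assumption standing in for a real tokenizer, and `fit_examples` and `approx_token_count` are hypothetical helper names.

```python
def approx_token_count(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words.

    Real tokenizers usually produce more tokens than this, so leave headroom.
    """
    return len(text.split())

def fit_examples(examples, query, budget, count_tokens=approx_token_count):
    """Keep examples in order until the next one would exceed the budget.

    Returns the kept examples and the estimated tokens used, including the query.
    """
    used = count_tokens(f"Text: {query}\nSentiment:")
    kept = []
    for text, label in examples:
        cost = count_tokens(f"Text: {text}\nSentiment: {label}")
        if used + cost > budget:
            break  # stop at the first example that does not fit
        kept.append((text, label))
        used += cost
    return kept, used

EXAMPLES = [
    ("I absolutely loved the concert last night! The energy was incredible.", "positive"),
    ("The product arrived broken and the customer service was unhelpful.", "negative"),
    ("The package was delivered at 3:15 PM today.", "neutral"),
]
QUERY = "This new update has completely ruined the app's performance."

kept, used = fit_examples(EXAMPLES, QUERY, budget=40)
print(len(kept), used)
```

Because the function drops examples from the end of the list, the order you pass them in doubles as a priority order. Put the examples you can least afford to lose first, and note that with a tight budget an entire label can vanish from the prompt.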
Best Practices: Not All Shots Are Created Equal
Throwing random examples at a model is a surefire way to get mediocre results. Here’s how to do it right:
- Be Consistent: Your examples must be perfectly formatted. If you use “Sentiment:” in your examples, don’t suddenly switch to “Feeling:” in the prompt you want classified. The model is a pedant.
- Choose Representative Examples: Your few shots should cover edge cases and the breadth of what the model might see. If you’re classifying tickets, include an example of a really ambiguous, poorly written ticket.
- Mind the Context Length: This is the brutal engineering constraint. You have a finite number of tokens. Every example you add is a trade-off. Choose wisely.
- Label Space Matters: In zero-shot, your candidate labels need to be concepts the model already understands. “Not happy” is a worse label than “negative” because it’s more complex and less common in the training data.
- When to Jump to Fine-Tuning: Few-shot is amazing, but if you have hundreds of examples, you’re just teasing the model. At that point, you should actually fine-tune (update the weights) on your dataset. It’ll be more accurate, faster at inference, and won’t waste resources repeating a context prompt every single time. Few-shot is for when you can’t fine-tune.
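A couple of these practices are easy to check automatically before a prompt ever reaches the model. The sketch below lints a set of example blocks for format consistency and label coverage, assuming the "Text: / Sentiment:" layout from earlier in this section. `check_shots` is a hypothetical helper, and the sample blocks are invented for illustration.

```python
import re
from collections import Counter

# Each example must be exactly two lines: "Text: ..." then "Sentiment: <label>".
EXAMPLE_PATTERN = re.compile(r"^Text: (?P<text>.+)\nSentiment: (?P<label>\w+)$")

def check_shots(blocks, allowed_labels):
    """Return a list of human-readable problems found in the example blocks."""
    problems = []
    seen = Counter()
    for i, block in enumerate(blocks):
        match = EXAMPLE_PATTERN.match(block.strip())
        if match is None:
            problems.append(f"example {i}: does not match 'Text: ... / Sentiment: ...'")
            continue
        label = match.group("label")
        if label not in allowed_labels:
            problems.append(f"example {i}: unknown label {label!r}")
        else:
            seen[label] += 1
    # Every label the model may output should appear at least once.
    for label in allowed_labels:
        if seen[label] == 0:
            problems.append(f"no example for label {label!r}")
    return problems

blocks = [
    "Text: Loved it.\nSentiment: positive",
    "Text: It broke.\nFeeling: negative",  # inconsistent field name
    "Text: Arrived Tuesday.\nSentiment: neutral",
]
for problem in check_shots(blocks, ("positive", "negative", "neutral")):
    print(problem)
```

Here the "Feeling:" slip triggers two reports: the block fails the format check, and as a result the negative label ends up with no valid example at all.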
The sheer fact that you can get a model to perform a complex task it was never trained for, with just a few examples or even just a description, is a testament to how far we’ve come. It feels like cheating. Because it is. So use these powers wisely.