38.5 Zero-Shot Classification with NLI Models

Right, so you’ve got a pile of text and you need to sort it into categories, but here’s the kicker: you don’t have any labeled training data for those specific categories. In the old days, this is where you’d throw your hands up and start the soul-crushing process of manual labeling. Not anymore. Welcome to the party trick of modern NLP: Zero-Shot Classification.

Here’s the genius, slightly absurd idea we’re stealing: we’re going to reframe classification as a natural language inference (NLI) task. You know NLI, right? It’s the “does this sentence contradict that premise?” problem. The model is given a premise and a hypothesis and has to classify their relationship as entailment, contradiction, or neutral.

Now, watch closely as we perform a little semantic judo. Your text to classify becomes the premise. The label you want to test for (“sports”, “politics”, “anger”, “joy”) becomes the hypothesis. We format it as: “This text is about {label}.” The model then gives us a probability score for whether the premise (your text) entails that hypothesis. A high entailment score for a given label? Boom. That’s your predicted class. It’s cheating, but it’s the kind of cheating that makes you feel smart.

How It Actually Works Under the Hood

Don’t just think of it as magic. The model, usually a BART or RoBERTa checkpoint fine-tuned on an NLI dataset such as MNLI, has learned a rich, generalized understanding of language from its NLI training. It understands that if a premise is “The quarterback threw a touchdown pass,” it pretty strongly entails the hypothesis “This text is about sports.” For an unrelated hypothesis like “This text is about baking” or “This text is about oceanography,” the model leans toward contradiction or neutral instead. Either way, the entailment score stays low, and the entailment score is the one we care about.

We’re essentially using the model’s general language knowledge, baked into its hundreds of millions of parameters, to act as a remarkably flexible labeling oracle. It’s not perfect, but it’s shockingly effective for a parlor trick.
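
To make the reframing concrete, here is a minimal sketch of the bookkeeping: one premise/hypothesis pair per label, then a softmax over the entailment logits. No model is loaded. The logits are made-up numbers, the function names are hypothetical, and the contradiction/neutral/entailment column order is an assumption you should check against your model’s config.

```python
import math

# Assumed column order of the NLI logits: contradiction, neutral, entailment.
ENTAILMENT = 2

def build_pairs(text, labels, template="This text is about {}."):
    """Pair the text (premise) with one hypothesis per candidate label."""
    return [(text, template.format(label)) for label in labels]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(labels, nli_logits):
    """Softmax the entailment logits across labels, then sort high to low."""
    entailment = [row[ENTAILMENT] for row in nli_logits]
    scores = softmax(entailment)
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)

labels = ["sports", "baking", "oceanography"]
pairs = build_pairs("The quarterback threw a touchdown pass.", labels)

# Made-up logits, one row per (premise, hypothesis) pair, for illustration only.
fake_logits = [
    [-2.0, 0.5, 3.5],   # sports: strong entailment
    [2.5, 1.0, -3.0],   # baking: leans contradiction
    [1.0, 2.0, -2.5],   # oceanography: leans neutral
]

ranking = rank_labels(labels, fake_logits)
for label, score in ranking:
    print(f"{label:<14} -> {score:.4f}")
```

In a real run, the rows of `fake_logits` would come from feeding each pair in `pairs` through the NLI model. Everything after that is just arithmetic.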

Implementing It with Hugging Face `pipelines`

The good folks at Hugging Face have made this embarrassingly easy. The zero-shot-classification pipeline handles all the formatting and scoring for you. Let’s see it in action.

from transformers import pipeline
import torch

# This is the one-liner that does the black magic.
# We specify we want the NLI model behind the zero-shot trick.
# Use the first GPU if one is available, otherwise fall back to CPU.
device = 0 if torch.cuda.is_available() else -1

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=device,
)

# Your text - the premise
sequence = "The shareholders are pleased with the latest quarterly earnings report."

# Your candidate labels - these will be turned into hypotheses
candidate_labels = ["politics", "finance", "sports", "technology"]

# Let's run it
results = classifier(sequence, candidate_labels)

print(f"Text: {sequence}")
for label, score in zip(results['labels'], results['scores']):
    print(f"{label:<12} -> {score:.4f}")

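This should output something sensible like the following. Your exact scores will differ, and note that the pipeline returns labels sorted by score, highest first:

Text: The shareholders are pleased with the latest quarterly earnings report.
finance      -> 0.9782
politics     -> 0.0121
sports       -> 0.0051
technology   -> 0.0046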

Look at that. It correctly identified the text as finance-related with high confidence. No finance data was ever explicitly shown to the model for this task. Wild.

The Crucial Art of Writing Hypothesis Templates

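Here’s where most people screw it up. If you don’t pass a template, the pipeline falls back to its generic default, “This example is {}.”, which works but is rarely the best fit. For topics, “This text is about {}.” reads more naturally. But what if you’re doing sentiment analysis? Or emotion detection? Using “about” is a terrible fit.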

The pipeline lets you define a custom hypothesis_template. This is your secret weapon.

# Emotion detection? Frame it as an emotion.
candidate_emotions = ["joy", "anger", "surprise", "sadness"]
results = classifier(sequence, candidate_emotions, hypothesis_template="This text expresses {}.")

print(f"\nText: {sequence}")
for label, score in zip(results['labels'], results['scores']):
    print(f"{label:<12} -> {score:.4f}")

Output (your exact scores will differ):

Text: The shareholders are pleased with the latest quarterly earnings report.
joy          -> 0.8721
surprise     -> 0.0783
sadness      -> 0.0261
anger        -> 0.0235

Suddenly, we’re doing zero-shot emotion detection because we changed a single sentence. The template is everything. For sentiment, you might use "This review is {}." with labels like “positive” and “negative”. Think like the model. What hypothesis would the text entail?
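
Since the template matters this much, it’s worth trying a few side by side instead of guessing. Below is a rough sketch of a helper that does that. `compare_templates` and `uniform_stub` are hypothetical names. The stub only mimics the pipeline’s call signature and result shape so the snippet runs without downloading a model; its scores mean nothing.

```python
def compare_templates(classifier, text, labels, templates):
    """Return {template: (top_label, top_score)} for each hypothesis template."""
    summary = {}
    for template in templates:
        result = classifier(text, labels, hypothesis_template=template)
        # Pipeline-style results list labels from highest to lowest score.
        summary[template] = (result["labels"][0], result["scores"][0])
    return summary

def uniform_stub(text, labels, hypothesis_template="This example is {}."):
    """Stand-in for the real pipeline: every label gets the same score."""
    share = 1.0 / len(labels)
    return {"sequence": text, "labels": list(labels), "scores": [share] * len(labels)}

templates = [
    "This review is {}.",
    "The sentiment of this review is {}.",
    "The author feels {} about the product.",
]

# Swap uniform_stub for the real `classifier` object to get meaningful numbers.
summary = compare_templates(
    uniform_stub,
    "Battery died after two days. Would not buy again.",
    ["positive", "negative"],
    templates,
)
for template, (label, score) in summary.items():
    print(f"{template:<42} -> {label} ({score:.4f})")
```

If the winning label flips depending on the template, that’s a hint your phrasing, your labels, or both need work.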

Pitfalls, Limitations, and When to Call BS

This isn’t a silver bullet. The model’s knowledge is only as good as its training data, which has cutoffs and biases. It can be hilariously confident in its wrong answers sometimes.

Label Ambiguity: If your labels are too semantically similar (“cloud computing” vs. “internet technology”), the scores might be close and noisy. The model’s understanding of the label words matters.
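
One cheap way to catch this is to look at the gap between the top two scores and flag anything too close to call. Here is a small sketch, assuming a result dict shaped like the pipeline’s output. The helper names, the example scores, and the 0.15 cutoff are all arbitrary choices for illustration, not recommended values.

```python
def top_margin(result):
    """Gap between the best and second-best score in a pipeline-style result."""
    scores = sorted(result["scores"], reverse=True)
    if len(scores) < 2:
        return scores[0]
    return scores[0] - scores[1]

def is_ambiguous(result, min_margin=0.15):
    """True when the top two labels are too close to separate confidently."""
    return top_margin(result) < min_margin

# Hand-written example result, not real model output.
example = {
    "sequence": "We migrated our web services to a managed hosting provider.",
    "labels": ["cloud computing", "internet technology", "sports"],
    "scores": [0.48, 0.45, 0.07],
}

print(f"margin = {top_margin(example):.2f}, ambiguous = {is_ambiguous(example)}")
```

Texts that get flagged are good candidates for a human look, or a sign that two labels should be merged or reworded.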

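The Multi-Label Problem: By default the pipeline is single-label. It takes a softmax over the entailment scores of all candidate labels, so the scores compete with each other and sum to 1. Setting multi_label=True scores each label independently instead: for every label, the pipeline takes a softmax over just that label’s entailment and contradiction logits, which is equivalent to a sigmoid on their difference. Each score lands between 0 and 1, and the scores no longer sum to 1. Use this when a text can belong to several categories at once.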

sequence_2 = "Apple's new iPhone features a revolutionary graphene battery developed by their AI research division."
candidate_labels_2 = ["technology", "business", "science", "politics"]

# Single-label (default)
results_single = classifier(sequence_2, candidate_labels_2)
print("Single-label (softmax):")
for label, score in zip(results_single['labels'], results_single['scores']):
    print(f"{label:<12} -> {score:.4f}")

# Multi-label
results_multi = classifier(sequence_2, candidate_labels_2, multi_label=True)
print("\nMulti-label (sigmoids):")
for label, score in zip(results_multi['labels'], results_multi['scores']):
    print(f"{label:<12} -> {score:.4f}")

You’ll likely see “technology”, “business”, and “science” all get high scores in the multi-label run, which is correct.
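
If you’re curious what the multi-label arithmetic looks like, here is a minimal sketch of the per-label scoring: drop the neutral logit, softmax entailment against contradiction, keep the entailment probability. The logits are invented for illustration, the function name is hypothetical, and the column order (contradiction, neutral, entailment) is an assumption to verify against your model’s config.

```python
import math

def multi_label_scores(nli_logits, contradiction=0, entailment=2):
    """Per label: softmax over [contradiction, entailment], keep P(entailment)."""
    scores = []
    for row in nli_logits:
        c, e = row[contradiction], row[entailment]
        peak = max(c, e)  # subtract the max for numerical stability
        exp_c, exp_e = math.exp(c - peak), math.exp(e - peak)
        scores.append(exp_e / (exp_c + exp_e))
    return scores

labels = ["technology", "business", "science", "politics"]

# Made-up logits, one row per label, for illustration only.
fake_logits = [
    [-3.0, 0.2, 3.0],   # technology
    [-1.0, 0.5, 1.5],   # business
    [-0.5, 0.8, 1.0],   # science
    [3.0, 0.1, -3.0],   # politics
]

scores = multi_label_scores(fake_logits)
for label, score in zip(labels, scores):
    print(f"{label:<12} -> {score:.4f}")
print(f"sum of scores = {sum(scores):.4f}")
```

Because every label is judged on its own, several can score high at once, and the total can run well past 1.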

Inference Speed: This is slower than a dedicated, fine-tuned classifier. Every text has to be paired with every candidate label, and each pair goes through the full NLI model. If you have 50 labels, that’s 50 premise-hypothesis pairs per text. For production, you might want to cache results, or fine-tune a smaller dedicated classifier on your most common labels once you’ve used zero-shot to create your initial dataset.
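
The bootstrapping idea can be sketched in a few lines: run zero-shot over unlabeled texts, keep only the confident predictions, and treat those as a starter training set. Everything below is illustrative. `pseudo_label`, `pairs_to_score`, and `keyword_stub` are hypothetical names, the 0.8 threshold is arbitrary, and the stub is a keyword matcher that merely imitates the pipeline’s result shape so the snippet runs offline.

```python
def pairs_to_score(num_texts, num_labels):
    """Number of premise-hypothesis pairs the NLI model must score."""
    return num_texts * num_labels

def pseudo_label(classifier, texts, labels, min_score=0.8):
    """Keep (text, top_label) only when the top score clears the threshold."""
    kept = []
    for text in texts:
        result = classifier(text, labels)
        top_label, top_score = result["labels"][0], result["scores"][0]
        if top_score >= min_score:
            kept.append((text, top_label))
    return kept

def keyword_stub(text, labels):
    """Stand-in classifier: 0.9 for the one label named in the text, else uniform."""
    hits = [label for label in labels if label in text.lower()]
    if len(hits) == 1 and len(labels) > 1:
        rest = 0.1 / (len(labels) - 1)
        scored = [(label, 0.9 if label in hits else rest) for label in labels]
    else:
        scored = [(label, 1.0 / len(labels)) for label in labels]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return {
        "sequence": text,
        "labels": [label for label, _ in scored],
        "scores": [score for _, score in scored],
    }

texts = [
    "Quarterly finance results beat expectations.",
    "Nice weather today.",
]
labels = ["finance", "sports", "politics"]

# Swap keyword_stub for the real `classifier` object in practice.
training_set = pseudo_label(keyword_stub, texts, labels)
print(f"pairs to score: {pairs_to_score(len(texts), len(labels))}")
print(f"kept {len(training_set)} of {len(texts)} texts: {training_set}")
```

The threshold trades coverage for label quality, so spot-check a sample of the kept examples before training anything on them.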

Zero-shot classification is your go-to tool for prototyping, exploring unknown datasets, or handling tasks where the labels change constantly. It’s the Swiss Army knife that lets you ask arbitrary questions of your text without any prep work. Just remember to frame the question right.
