32.4 Hallucination Detection: Fact-Checking and Grounding

Right, let’s talk about the LLM’s most infamous party trick: hallucination. It’s not the fun, psychedelic kind. It’s the “I will confidently state that the capital of France is Berlin because it sounds right” kind. As you start building systems on top of these models, this isn’t just a quirky bug; it’s a critical failure mode that can torpedo user trust, business logic, and your reputation. So, how do we catch these fabrications before they escape into the wild? We ground them and we fact-check them.

The first and most powerful weapon against hallucination isn’t a fancy algorithm; it’s your data. If you never let the model answer questions from its own parametric memory (whatever it absorbed during training), you drastically reduce its ability to make things up. This is the core principle of Retrieval-Augmented Generation (RAG). You provide context, and you instruct the model to answer only from that context.

The First Line of Defense: Prompt Engineering for Grounding

This is step zero, and it’s infuriatingly simple yet often botched. You must explicitly, forcefully, and redundantly tell the model to stick to the provided text. This is called “grounding” the response. A weak prompt is a polite request; a strong prompt is a direct order that spells out exactly what to do when the answer isn’t there.

# A weak, naive prompt. Nothing here stops the model from blending in what it "remembers" from training.
weak_prompt = f"""
Answer the following question based on the context below.
Question: {user_question}
Context: {retrieved_text}
"""

# A strong, grounding prompt. Notice the explicit instructions and consequences.
strong_prompt = f"""
You are a helpful AI assistant. Your knowledge is STRICTLY LIMITED to the CONTEXT PROVIDED below. You must answer the user's question using ONLY the information from the provided context.

Follow these rules absolutely:
1. If the answer cannot be found in the context, you MUST respond with "I cannot answer that question based on the information provided."
2. Do not speculate, infer, or use any prior knowledge.
3. Quote directly from the context when possible.

Context: {retrieved_text}

User Question: {user_question}
"""

Why this works: The model is a probability engine. Without explicit boundaries, it will happily follow the most statistically likely path, which often involves blending its training data with your context. The strong prompt narrows the probability space significantly, making an ungrounded hallucination a much less likely output. It’s not foolproof, but it’s your most important lever.
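A side benefit of rule 1 is that the refusal is a fixed sentence, so your code can detect it and route the query elsewhere (a fallback search, a human, a “no answer” UI state). Here is a minimal sketch of that idea. The helper names `build_grounded_prompt` and `is_refusal` are made up for illustration and do not come from any library.

```python
# The exact refusal sentence the prompt asks the model to use.
REFUSAL = "I cannot answer that question based on the information provided."

def build_grounded_prompt(user_question: str, retrieved_text: str) -> str:
    """Assemble a grounding prompt around the retrieved context."""
    return (
        "You are a helpful AI assistant. Your knowledge is STRICTLY LIMITED to the "
        "CONTEXT PROVIDED below. Answer using ONLY the information from that context.\n\n"
        "Follow these rules absolutely:\n"
        f'1. If the answer cannot be found in the context, you MUST respond with "{REFUSAL}"\n'
        "2. Do not speculate, infer, or use any prior knowledge.\n"
        "3. Quote directly from the context when possible.\n\n"
        f"Context: {retrieved_text}\n\n"
        f"User Question: {user_question}\n"
    )

def is_refusal(response: str) -> bool:
    """True if the response contains the agreed refusal sentence.

    Whitespace and letter case are normalized first, because models do not
    always reproduce the sentence character for character.
    """
    normalized = " ".join(response.split()).lower()
    return REFUSAL.lower() in normalized
```

Keeping the refusal sentence in one constant means the prompt and the detector cannot drift apart when someone rewords it.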

Automated Fact-Checking with Entailment Models

Okay, let’s say you’ve got a response from your LLM. How do you programmatically check if it’s supported by the source context? Asking another LLM in free form is a bit like asking a fabulist to check another fabulist’s homework. A cheaper and more predictable first option is a Natural Language Inference (NLI) model, specifically one trained for entailment.

These models are designed for one job: take a premise (your source context) and a hypothesis (the LLM’s generated statement) and judge whether the premise entails the hypothesis (supports it), contradicts it, or is neutral.

from transformers import pipeline

# Load a model fine-tuned for NLI on the MNLI dataset (e.g., Microsoft's DeBERTa)
entailment_checker = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def check_fact_with_entailment(context, generated_text):
    """
    Checks if the generated text is entailed by the context.
    Returns a verdict and a confidence score.
    """
    # Pass premise and hypothesis as a text pair so the tokenizer inserts the
    # model's own separator tokens, rather than relying on a hand-typed
    # "[SEP]" string, which is tokenizer-specific.
    result = entailment_checker({"text": context, "text_pair": generated_text})

    # Depending on the transformers version, a single input may come back
    # as a dict or as a one-element list.
    if isinstance(result, list):
        result = result[0]

    # The label is one of 'ENTAILMENT', 'CONTRADICTION', or 'NEUTRAL'
    verdict = result['label']
    confidence = result['score']

    return verdict, confidence

# Example usage
context = "The Apollo 11 mission landed on the Moon on July 20, 1969. Neil Armstrong was the first person to step onto the lunar surface."
generated_response = "Apollo 11 landed on the Moon in 1969."

verdict, confidence = check_fact_with_entailment(context, generated_response)
print(f"Verdict: {verdict}, Confidence: {confidence:.2f}")
# Expected verdict: ENTAILMENT (the exact confidence depends on the model)

generated_hallucination = "Buzz Aldrin was the first person to walk on the Moon."
verdict, confidence = check_fact_with_entailment(context, generated_hallucination)
print(f"Verdict: {verdict}, Confidence: {confidence:.2f}")
# Expected verdict: CONTRADICTION (the exact confidence depends on the model)

The Pitfall: These models are good but not perfect. They have a limited input length, and they can struggle with long, complex contexts or highly technical language. They’re best used as an automated scoring mechanism (e.g., “flag any response with a ‘CONTRADICTION’ score above 0.8 for human review”) rather than a single source of truth.
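The “flag for human review” rule can be sketched in a few lines. This version assumes you check each sentence of the response separately, since NLI models tend to do better on short hypotheses. The names `split_into_claims`, `flag_for_review`, and `stub_nli` are hypothetical. `stub_nli` is a hard-coded stand-in so the example runs without downloading a model; in practice you would pass a function like `check_fact_with_entailment`.

```python
import re
from typing import Callable, List, Tuple

# An NLI function takes (premise, hypothesis) and returns (label, score).
NliFn = Callable[[str, str], Tuple[str, float]]

def split_into_claims(text: str) -> List[str]:
    """Naive sentence splitter; a real pipeline would likely use a proper tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def flag_for_review(
    context: str, response: str, nli_fn: NliFn, threshold: float = 0.8
) -> List[Tuple[str, float]]:
    """Return every sentence the NLI function marks as a confident contradiction."""
    flagged = []
    for claim in split_into_claims(response):
        label, score = nli_fn(context, claim)
        if label.upper() == "CONTRADICTION" and score > threshold:
            flagged.append((claim, score))
    return flagged

def stub_nli(premise: str, hypothesis: str) -> Tuple[str, float]:
    """Stand-in for a real NLI model, hard-coded for this demo only."""
    if "Aldrin" in hypothesis:
        return "CONTRADICTION", 0.9
    return "ENTAILMENT", 0.95

context = "Neil Armstrong was the first person to step onto the lunar surface."
response = "Apollo 11 landed in 1969. Buzz Aldrin was the first person to walk on the Moon."
flagged = flag_for_review(context, response, stub_nli)
print(flagged)
```

With the stub, only the Buzz Aldrin sentence is flagged. The 0.8 threshold is the illustrative value from the text, not a recommendation; tune it against examples from your own domain.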

Using RAGAS for Structured Evaluation

You didn’t think we’d get through a section on RAG hallucinations without RAGAS, did you? RAGAS (Retrieval-Augmented Generation Assessment) is a framework that provides metrics to evaluate your RAG pipeline. For hallucination, the key metrics are Faithfulness (is the answer based only on the given context?) and Answer Correctness (is it factually accurate against a ground truth?).

Here’s the beautiful part: it uses an LLM as a judge, but in a structured way. For faithfulness, the answer is broken into individual statements and each one is checked against the context, which tends to be more reliable than asking a model for a single yes-or-no verdict on the whole answer.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness
from datasets import Dataset

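# Note: these metrics call a judge LLM, so credentials for one must be configured
# (ragas uses OpenAI by default). The imports and column names below follow the
# ragas 0.1-style API; other releases may name them differently.
# Prepare your data in a Hugging Face Dataset format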
data_dict = {
    "question": ["When did Apollo 11 land on the Moon?"],
    "answer": ["Apollo 11 landed in 1969."], # Your LLM's generated answer
    "contexts": [["The Apollo 11 mission landed on the Moon on July 20, 1969."]], # Your retrieved context
    "ground_truth": ["July 20, 1969"] # The actual correct answer (for correctness)
}

dataset = Dataset.from_dict(data_dict)

# Run the evaluation
score = evaluate(dataset, metrics=[faithfulness, answer_correctness])
print(score)

Why this works: RAGAS formalizes the evaluation process. faithfulness is internally checking for hallucinations by comparing the answer to the context, much like our entailment check but with more sophisticated prompting. answer_correctness then checks that faithful answer against a known ground truth. It’s the difference between “did it make something up?” and “was it right?”
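To make the faithfulness idea concrete, here is a deliberately simplified sketch: the score is the fraction of the answer’s statements that the context supports. This is not RAGAS’s actual implementation. In RAGAS the statement extraction and the support check are both done by an LLM; here the statements are supplied by hand, and `toy_is_supported` is a crude word-overlap stand-in invented for this example.

```python
import re
from typing import Callable, List

def faithfulness_score(
    statements: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of statements that the context supports (1.0 means fully grounded)."""
    if not statements:
        raise ValueError("need at least one statement to score")
    supported = sum(1 for s in statements if is_supported(context, s))
    return supported / len(statements)

def toy_is_supported(context: str, statement: str) -> bool:
    """Toy judge: every word of the statement must also appear in the context.

    A real judge would be an NLI model or an LLM; this exists only so the
    example runs on its own.
    """
    context_words = set(re.findall(r"\w+", context.lower()))
    statement_words = set(re.findall(r"\w+", statement.lower()))
    return statement_words <= context_words

context = "The Apollo 11 mission landed on the Moon on July 20, 1969."
statements = ["Apollo 11 landed on the Moon", "The landing happened in 1970"]
print(faithfulness_score(statements, context, toy_is_supported))  # 0.5
```

One supported statement out of two gives 0.5. A score below 1.0 tells you that something in the answer lacks support, and the per-statement results tell you what.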

The Big Caveat: This isn’t a fully automated production solution. The ground truth is the killer: answer_correctness needs a human-in-the-loop to establish a benchmark dataset of correct answers to evaluate against. Faithfulness needs no ground truth, but every evaluation still costs extra judge-LLM calls. Use this for testing and improving your pipeline, not for real-time fact-checking on live queries.
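One way to use a benchmark run “for testing and improving your pipeline” is as a regression gate: after each prompt or retriever change, re-run the evaluation and fail the build if a metric drops below a floor you have agreed on. This is a sketch under assumptions: `find_regressions` is a hypothetical helper, the scores are assumed to already be a plain dict of metric name to average score (getting there from your evaluation output is left out), and the numbers are invented for illustration.

```python
from typing import Dict, List

def find_regressions(scores: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Return a human-readable failure message for each metric below its floor."""
    failures = []
    for metric, minimum in minimums.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from results")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} is below the minimum {minimum:.2f}")
    return failures

# Invented example values, not measurements.
scores = {"faithfulness": 0.92, "answer_correctness": 0.61}
minimums = {"faithfulness": 0.90, "answer_correctness": 0.70}

failures = find_regressions(scores, minimums)
for message in failures:
    print(message)
```

In a CI job you would exit with a non-zero status when `failures` is non-empty.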

The brutal truth is that there is no magic bullet. Combating hallucination is a layered defense: strong grounding prompts, high-quality retrieval, and automated checks using both specialized models (NLI) and structured LLM judges (RAGAS). Your goal isn’t perfection. It’s to catch as much of the nonsense as you can before a user ever sees it.
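As a closing sketch, here is how the layers might be wired together at request time. Everything is injected as a plain callable, so nothing here is tied to a specific retriever, LLM client, or NLI model. The function name `answer_with_guardrails` and its parameters are assumptions made for illustration, and the inline prompt is abbreviated.

```python
from typing import Callable, Tuple

FALLBACK = "I cannot answer that question based on the information provided."

def answer_with_guardrails(
    question: str,
    retrieve: Callable[[str], str],
    generate: Callable[[str], str],
    nli_fn: Callable[[str, str], Tuple[str, float]],
    threshold: float = 0.8,
) -> Tuple[str, bool]:
    """Return (text to show the user, whether the draft answer was blocked)."""
    # Layer 1: retrieval supplies the only knowledge the model may use.
    context = retrieve(question)

    # Layer 2: a grounding prompt (abbreviated here) constrains the model.
    prompt = (
        "Answer using ONLY the context below.\n\n"
        f"Context: {context}\n\nUser Question: {question}"
    )
    draft = generate(prompt)

    # Layer 3: an automated check blocks confident contradictions.
    label, score = nli_fn(context, draft)
    if label.upper() == "CONTRADICTION" and score > threshold:
        return FALLBACK, True
    return draft, False
```

Returning the blocked flag alongside the text lets the caller log the draft for review instead of silently discarding it.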