37.5 Coreference Resolution

Right, coreference resolution. This is where your NLP pipeline stops just pointing at words and starts actually reading. It’s the task of figuring out which expressions (names, noun phrases, pronouns) refer to the same real-world entity. When I say “The model loaded its weights. It was trained for weeks,” you know that “its” and “It” both point back to “The model.” You do this effortlessly. Getting a computer to do it is, predictably, a bit of a circus.

The reason this isn’t just a simple lookup is the sheer number of ways language can be ambiguous. Does “it” refer to the model, the weights, or some nebulous concept I mentioned three paragraphs ago? Does “they” refer to the developers, the users, or the GPUs? This is the messy reality of human communication, and it’s why coreference resolution is a cornerstone of true language understanding, not just language processing.

The Two Main Flavors: Rule-Based vs. Neural

Historically, we tackled this with rule-based systems (like the deterministic, multi-pass sieve system in Stanford’s CoreNLP). These are giant, intricate Rube Goldberg machines built on hand-crafted rules, grammatical constraints (e.g., a singular “it” can’t refer to a plural “models”), and features like semantic compatibility. They’re interpretable (you can see why a rule fired), but they’re also brittle. They handle the obvious cases well but fall apart spectacularly on anything that wasn’t foreseen by the grad student who wrote the rules.
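
To make the "grammatical constraint" idea concrete, here is a deliberately tiny sketch of a rule-based resolver that applies a single rule, number agreement, and otherwise prefers the nearest candidate. The lexicon, the function name, and the hand-labeled candidates are all assumptions for illustration; real systems stack many such rules.

```python
# Toy rule-based resolver: link a pronoun to the nearest preceding candidate
# that agrees with it in number. The lexicon and names are illustrative
# assumptions, not taken from CoreNLP or any real system.

PRONOUN_NUMBER = {"it": "sg", "its": "sg", "they": "pl", "them": "pl", "their": "pl"}

def resolve_pronoun(pronoun, candidates):
    """Return the nearest preceding candidate whose number matches the pronoun.

    candidates: list of (text, number) tuples in document order,
                where number is "sg" or "pl".
    """
    wanted = PRONOUN_NUMBER.get(pronoun.lower())
    if wanted is None:
        return None  # not a pronoun this toy lexicon knows about
    for text, number in reversed(candidates):  # walk back from the nearest
        if number == wanted:
            return text
    return None

candidates = [("The model", "sg"), ("the weights", "pl")]
print(resolve_pronoun("It", candidates))    # singular, so it skips "the weights"
print(resolve_pronoun("They", candidates))
print(resolve_pronoun("She", candidates))   # unknown to the lexicon
```

The brittleness is visible even here: anything outside the lexicon returns nothing, and the rule has no idea what to do when two candidates agree in number.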

The modern approach, and what you should probably be using, is neural coreference models. These are typically end-to-end systems that learn to cluster mentions (words or phrases that refer to something) from annotated corpora such as OntoNotes. They don’t encode grammar rules explicitly; they learn statistical patterns of what co-refers with what. A popular, easy-to-use option in the spaCy universe is neuralcoref, an extension for spaCy from Hugging Face. More recent systems build coreference on top of transformer encoders and generally score higher on benchmarks.
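
What "learning to cluster mentions" boils down to can be sketched without any neural network at all. In the sketch below, each mention picks its best-scoring earlier mention as antecedent, or starts a new cluster if nothing scores above a threshold. The pairwise scores are made up by hand and stand in for what a trained model would output; the function name and threshold are assumptions as well.

```python
def cluster_mentions(mentions, scores, threshold=0.5):
    """Greedy mention ranking: link each mention to its best-scoring antecedent.

    mentions: list of mention strings in document order.
    scores: dict mapping (antecedent_index, mention_index) -> score in [0, 1].
            Missing pairs count as 0.0. These scores are hypothetical; a real
            model would predict them.
    """
    cluster_of = {}  # mention index -> cluster id
    clusters = []    # list of lists of mention indices
    for j in range(len(mentions)):
        best_i, best_score = None, threshold
        for i in range(j):  # only earlier mentions can be antecedents
            score = scores.get((i, j), 0.0)
            if score > best_score:
                best_i, best_score = i, score
        if best_i is None:
            cluster_of[j] = len(clusters)  # no good antecedent: new cluster
            clusters.append([j])
        else:
            cluster_of[j] = cluster_of[best_i]
            clusters[cluster_of[j]].append(j)
    # Singletons are not coreference clusters, so drop them.
    return [[mentions[k] for k in c] for c in clusters if len(c) > 1]

mentions = ["The model", "its", "its weights", "It"]
scores = {(0, 1): 0.9, (0, 3): 0.8, (1, 3): 0.6, (2, 3): 0.3}
print(cluster_mentions(mentions, scores))
```

With these hand-set scores, “its” and “It” both attach to “The model” and “its weights” stays on its own.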

Let’s Get Our Hands Dirty with spaCy and neuralcoref

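First, a reality check. Coreference resolution is computationally expensive. This isn’t some lightweight lemmatizer. You’ll want a decent machine. Let’s use spaCy with the neuralcoref add-on. Note: as of my writing, neuralcoref only supports spaCy v2.x (it does not run on v3), so pin your spaCy version. If importing neuralcoref fails with a binary incompatibility error, rebuilding it from source (pip install neuralcoref --no-binary neuralcoref) usually helps. The ecosystem is moving towards built-in coref in newer transformer-based pipelines, and neuralcoref sees little active development these days, but it remains a simple, well-worn way to see coreference resolution in action.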

pip install spacy==2.3.5
pip install neuralcoref
python -m spacy download en_core_web_sm

Now, let’s run it on a classic example.

import spacy
import neuralcoref

# Load the model and add the neuralcoref pipe to it
nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)

text = "Apple is looking at buying a U.K. startup. They are interested in the deal."
doc = nlp(text)

# Print all clusters of coreferring mentions
if doc._.has_coref:
    print("Coreference clusters found:")
    for cluster in doc._.coref_clusters:
        print(f" - {cluster}")
else:
    print("No coreference clusters detected.")

# Let's see the resolved text for clarity
print("\nResolved text:")
print(doc._.coref_resolved)

This should output something like:

Coreference clusters found:
 - Apple: [Apple, They]

Resolved text:
Apple is looking at buying a U.K. startup. Apple are interested in the deal.

See what it did? It correctly identified that “They” refers to “Apple” and even gave us a resolved version of the text where the pronoun is replaced. The substitution is purely textual, which is why you get the ungrammatical “Apple are”. It is still incredibly useful for downstream tasks like sentiment analysis or summarization: imagine trying to summarize a news article full of “it,” “they,” and “the company” without knowing what those words actually mean.
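
If you wonder how a resolved text gets built from clusters, the idea fits in a few lines. This is a sketch of the general substitution technique, not neuralcoref’s actual implementation; the cluster format (token spans with a designated main mention) and the function name are assumptions made for the example.

```python
def resolve_text(tokens, clusters):
    """Replace every non-main mention with the text of its cluster's main mention.

    tokens: list of token strings.
    clusters: list of dicts {"main": (start, end), "mentions": [(start, end), ...]}
              using end-exclusive token spans (a hypothetical format).
    """
    replacement = {}  # start index -> (end index, tokens to substitute)
    for cluster in clusters:
        main_start, main_end = cluster["main"]
        main_tokens = tokens[main_start:main_end]
        for start, end in cluster["mentions"]:
            if (start, end) != (main_start, main_end):
                replacement[start] = (end, main_tokens)
    out, i = [], 0
    while i < len(tokens):
        if i in replacement:
            end, main_tokens = replacement[i]
            out.extend(main_tokens)  # swap the mention for the main mention
            i = end                  # and skip over the original span
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

tokens = "Apple is looking at buying a U.K. startup . They are interested in the deal .".split()
clusters = [{"main": (0, 1), "mentions": [(0, 1), (9, 10)]}]
print(resolve_text(tokens, clusters))
```

Nothing here touches verb agreement or possessives, so any grammatical cleanup after substitution is your job.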

Where It All Goes Horribly Wrong (The Edge Cases)

This technology is impressive, but it’s not magic. Here’s where it will confidently hand you the wrong answer.

Pitfall 1: The Universal “It”. The word “it” is a nightmare. It can be a referential pronoun pointing at an entity (“I love it”), a dummy subject that refers to nothing at all (“It is raining”), or a pointer to an entire clause or event (“He was late again, which annoyed me. It was unacceptable.”). Neural models often struggle to tell these apart. Try this:

text = "It is cold today. I don't like it."
doc = nlp(text)
print(doc._.coref_clusters)  # Might incorrectly link the two "it"s
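
One cheap mitigation is to flag likely dummy subjects before trusting any link that involves them. The sketch below is a crude, assumption-laden filter: the pattern list is hypothetical and far from complete, and it works at sentence level, so a sentence containing both a dummy and a referential “it” gets flagged as a whole.

```python
import re

# Hypothetical, deliberately small pattern list for dummy-subject "it".
PLEONASTIC_PATTERNS = [
    # weather and time expressions: "it is raining", "it's cold"
    r"\bit(?:'s|\s+is|\s+was)\s+(?:raining|snowing|cold|hot|warm|late|early)\b",
    # extraposition: "it is clear that ...", "it was hard to ..."
    r"\bit(?:'s|\s+is|\s+was)\s+\w+\s+(?:that|to)\b",
    # raising verbs: "it seems", "it appears", "it turns out"
    r"\bit\s+(?:seems|appears|turns\s+out)\b",
]

def is_pleonastic(sentence):
    """Return True if the sentence matches a known dummy-'it' pattern."""
    lowered = sentence.lower()
    return any(re.search(pattern, lowered) for pattern in PLEONASTIC_PATTERNS)

for sentence in ["It is cold today.", "I don't like it.", "It seems fine."]:
    print(sentence, "->", is_pleonastic(sentence))
```

A filter like this will miss plenty of cases, so treat it as a sanity check on the resolver’s output, not a replacement for it.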

Pitfall 2: Cataphora. This is when a pronoun appears before the noun it refers to. “Despite her success, Dr. Smith remained humble.” Models trained predominantly on written text, which favors anaphora (pronoun after noun), often miss this.

Pitfall 3: Abstract Concepts and Plurals. “The team celebrated. They were victorious.” Easy. “The idea was revolutionary. They didn’t understand it.” Hard. Here, “They” refers to some unnamed group of people, not “the idea.” The model has to use world knowledge to avoid a nonsensical link.

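Pitfall 4: The Winograd Schema. These are specially crafted sentences designed to be ambiguous without real-world knowledge. They have long been the kryptonite of statistical models. Example: “The trophy doesn’t fit in the suitcase because it is too big.” What is “it”? You know it’s the trophy, because big things don’t fit into smaller containers. Swap “big” for “small” and the answer flips to the suitcase, even though the grammar hasn’t changed. A model that leans on surface cues, such as how close “it” sits to “suitcase,” cannot get both versions right. There’s a whole benchmark built on these, the Winograd Schema Challenge. It’s humbling.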
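
To see why surface cues are not enough, here is a pure-proximity baseline run on the trophy sentence and on its twin, where “big” becomes “small” and the correct answer flips from the trophy to the suitcase. The function and the hand-picked candidate set are assumptions for illustration.

```python
def nearest_antecedent(tokens, pronoun_index, candidates):
    """Pick the candidate noun that appears closest before the pronoun."""
    best = None
    for i in range(pronoun_index):
        if tokens[i] in candidates:
            best = tokens[i]  # later matches overwrite earlier ones
    return best

results = {}
for adjective in ("big", "small"):
    sentence = f"The trophy does not fit in the suitcase because it is too {adjective}"
    tokens = sentence.split()
    results[adjective] = nearest_antecedent(
        tokens, tokens.index("it"), {"trophy", "suitcase"}
    )
    print(adjective, "->", results[adjective])
```

The baseline gives the same answer for both sentences, so it is guaranteed to be wrong on one of them. Only knowledge about sizes and containers separates the two.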

Best Practices for the Trenches

Preprocessing is Key. Garbage in, garbage out. Ensure your text is clean and sentences are properly segmented. A missing period can merge two sentences and completely confuse the resolver.
It’s a Suggestion, Not Gospel. Never treat coreference output as ground truth. Use it as one signal among several in a larger system, not as the sole source of truth. Always have a fallback or a human-in-the-loop for critical applications.
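Mind the Context Window. Neural models only look back so far. neuralcoref, for instance, considers a bounded number of preceding mentions as antecedent candidates (controlled by its max_dist parameter), and transformer-based models are capped by their input length. If the referent is 50 sentences back, the model will most likely never consider it. For long documents, you’ll need to chunk them and resolve coreference within chunks, accepting that cross-chunk references will be lost.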
Consider the Task. Do you need full clusters, or just pronoun resolution? Sometimes a simpler, faster heuristic might be “good enough” for your specific use case.
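
For the chunking advice, a minimal sketch follows. It assumes you already have a list of sentences from your segmenter, and it adds a small overlap between consecutive chunks, which is my own addition: it softens the loss of cross-chunk references but does not eliminate it. The function name and default sizes are arbitrary.

```python
def chunk_sentences(sentences, chunk_size=20, overlap=5):
    """Split a list of sentences into overlapping windows.

    Each chunk shares its first `overlap` sentences with the end of the
    previous chunk, so a pronoun near a boundary still sees recent context.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + chunk_size])
        if start + chunk_size >= len(sentences):
            break  # the last window already reaches the end of the document
    return chunks

sentences = [f"Sentence {i}." for i in range(10)]
for chunk in chunk_sentences(sentences, chunk_size=4, overlap=1):
    print(chunk[0], "...", chunk[-1])
```

You would then run the resolver on " ".join(chunk) for each chunk. Clusters found in the overlapping sentences appear twice, so deduplicate them before merging results.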

Coreference resolution is one of those tasks that separates a simple text scraper from a true language understanding system. It’s hard, it’s messy, and it will fail in ways that are both baffling and obvious in hindsight. But when it works, it feels like actual magic. Use it, but for heaven’s sake, don’t trust it blindly.