37.3 Named Entity Recognition: Rule-Based and Neural Approaches
Right, let’s talk about Named Entity Recognition, or NER. Your goal here is simple: teach a machine to read a sentence like “Apple is looking to buy a U.K. startup for $1 billion” and not have an existential crisis about whether we’re discussing fruit, a tech giant, or a very expensive piece of produce. NER is the task of finding named entities (people, organizations, locations, monetary values, and more) and classifying them into predefined categories.
This is deceptively hard. Language is messy. “Will Smith” is a person, but “will” is also a verb and “smith” is also a job. “Python” could be a snake, a programming language, or a comedy group. Context is everything. We’ve tackled this problem in two major waves: first with rule-based systems (which are like giving the computer a very detailed map) and then with neural models (which are more like teaching it to drive by intuition).
The Old Guard: Rule-Based NER with spaCy
Before deep learning ate everything, we relied on patterns, dictionaries, and a whole lot of cleverness. Rule-based systems are fantastic when you need precision on a specific, well-defined task (like extracting part numbers from engineering manuals) and you have zero tolerance for the model “making stuff up.”
spaCy’s EntityRuler is a brilliant modern take on this. It lets you build a list of patterns—combining token attributes like text, lemma, part-of-speech tags, and more—and add them to the pipeline as a named entity recognizer. It’s transparent, blazingly fast, and completely deterministic.
from spacy.lang.en import English

nlp = English()  # Start with a blank English pipeline (tokenizer only, no trained components)
ruler = nlp.add_pipe("entity_ruler")
# Define your patterns. This is where you bring your domain knowledge.
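patterns = [
    # Token pattern: two tokens matched case-insensitively
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
    # Token pattern with regular expressions; the operator key must be uppercase "REGEX"
    {"label": "ORG", "pattern": [{"TEXT": {"REGEX": "^[A-Z][a-z]+$"}}, {"TEXT": {"REGEX": "^(Labs|Technologies)$"}}]},
    # Simple string-based (phrase) pattern with an optional entity id
    {"label": "PRODUCT", "pattern": "The Crank", "id": "crank-model"},
]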
ruler.add_patterns(patterns)
# Process a text
doc = nlp("I work at San Francisco based Riverdale Labs and I love The Crank.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Output: [('San Francisco', 'GPE'), ('Riverdale Labs', 'ORG'), ('The Crank', 'PRODUCT')]
Why this works: You’re not waiting for a model to be trained. You’re directly encoding knowledge. The REGEX operator is particularly powerful for catching common suffixes like “Inc.” or “Ltd.”; just remember that spaCy applies it to one token at a time, not to the whole text.
The pitfall: This is brutally rigid. It will only find what you tell it to find. If a document mentions “Riverdale Lab” (singular), your pattern misses it. Maintaining these patterns for a large, evolving domain is a full-time job. It has the knowledge of a savant and the adaptability of a brick.
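To make that rigidity concrete, here is a minimal sketch using only Python's standard `re` module. It is a toy stand-in for a token-pattern rule, not spaCy's matcher, and the names `STRICT_ORG`, `LOOSE_ORG`, and `find_orgs` are hypothetical. It shows the singular form slipping past the strict rule, and one way you might loosen the rule by hand.

```python
import re

# Toy stand-in for a rule: a capitalized word followed by a fixed suffix.
# This is NOT spaCy's matcher, only an illustration of how literal rules are.
STRICT_ORG = re.compile(r"\b[A-Z][a-z]+ (?:Labs|Technologies)\b")

# Hypothetical loosened variant that also accepts the singular forms.
LOOSE_ORG = re.compile(r"\b[A-Z][a-z]+ (?:Labs?|Technolog(?:y|ies))\b")


def find_orgs(pattern, text):
    """Return every substring of `text` matched by the compiled rule."""
    return pattern.findall(text)


plural = "She joined Riverdale Labs last year."
singular = "She joined Riverdale Lab last year."

strict_plural = find_orgs(STRICT_ORG, plural)
strict_singular = find_orgs(STRICT_ORG, singular)
loose_singular = find_orgs(LOOSE_ORG, singular)

print(strict_plural)    # ['Riverdale Labs']
print(strict_singular)  # []
print(loose_singular)   # ['Riverdale Lab']
```

Every variant you want to catch has to be anticipated and written down like this, which is exactly the maintenance burden described above.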
The New School: Neural Network-Based NER
This is where we are now. Instead of you writing all the rules, you show a neural network thousands of examples and it infers the rules itself. It learns from context. A model trained on enough data will understand that “Python” in “I love Python’s syntax” is a technology, while in “The python slithered away” it’s an animal.
Modern libraries like spaCy ship with powerful pre-trained neural models that are shockingly good at this out of the box.
import spacy
# Load a pre-trained pipeline (spaCy's medium English model).
# Install it first with: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")
text = "Apple, based in Cupertino, is looking to buy a U.K. startup for $1 billion. Tim Cook is thrilled."
doc = nlp(text)
# Let's see what it found
for ent in doc.ents:
    print(f"{ent.text:<20} {ent.label_:<10} {spacy.explain(ent.label_)}")
# Output:
# Apple                ORG        Companies, agencies, institutions, etc.
# Cupertino            GPE        Countries, cities, states
# U.K.                 GPE        Countries, cities, states
# $1 billion           MONEY      Monetary values, including unit
# Tim Cook             PERSON     People, including fictional
Why this works so well: The model isn’t just looking at words in isolation; it builds a numeric representation (an embedding) of each token that is shaped by the surrounding words. The representation for “Apple” near “Cupertino” and “buy” ends up mathematically closer to the “company” concept than the “fruit” concept. It’s statistical poetry.
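If “mathematically closer” sounds hand-wavy, the following toy sketch may help. Everything in it is an assumption made for illustration: the three-dimensional vectors are hand-picked rather than learned, and averaging a word with its neighbors is a crude stand-in for what a trained network does. It is not how spaCy computes its representations; it only shows the geometric idea.

```python
import math

# Hypothetical 3-dimensional vectors, hand-picked for illustration only.
# Real models learn far larger vectors from data.
VECTORS = {
    "apple": [0.5, 0.5, 0.0],  # ambiguous on its own
    "cupertino": [0.9, 0.0, 0.1],
    "buy": [0.8, 0.1, 0.1],
    "pie": [0.0, 0.9, 0.1],
    "eat": [0.1, 0.8, 0.1],
    "company": [1.0, 0.0, 0.0],
    "fruit": [0.0, 1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def contextual(word, neighbors):
    """Crude context mixing: average the word's vector with its neighbors'."""
    vecs = [VECTORS[word]] + [VECTORS[n] for n in neighbors]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


tech_apple = contextual("apple", ["cupertino", "buy"])
food_apple = contextual("apple", ["pie", "eat"])

for name, vec in [("tech context", tech_apple), ("food context", food_apple)]:
    to_company = cosine(vec, VECTORS["company"])
    to_fruit = cosine(vec, VECTORS["fruit"])
    print(f"{name}: company={to_company:.2f} fruit={to_fruit:.2f}")
```

The same word lands nearer to “company” in one context and nearer to “fruit” in the other, which is the whole trick in miniature.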
The Inevitable Trade-Offs and How to Navigate Them
Neither approach is perfect. The neural model will sometimes make bizarre mistakes, mislabeling a span or tagging something that isn’t an entity at all, because its “intuition” is based on statistical correlations from its training data. You might see it tag a common company name as a person because the CEO is mentioned so often. The rule-based system will never do that, but it will also never find anything new.
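One way to put numbers on this trade-off is entity-level precision and recall. The sketch below assumes exact matching on (text, label) pairs, and the gold and predicted sets are invented for illustration, not measured results from any real system. The helper name `precision_recall` is hypothetical.

```python
def precision_recall(predicted, gold):
    """Entity-level precision and recall using exact (text, label) matches."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations, invented for illustration only.
gold = {
    ("Apple", "ORG"),
    ("Cupertino", "GPE"),
    ("Tim Cook", "PERSON"),
    ("The Crank", "PRODUCT"),
}

# A rule-based system that only knows one product name.
rule_pred = {("The Crank", "PRODUCT")}

# A neural system that finds the common entities but mislabels the product.
neural_pred = {
    ("Apple", "ORG"),
    ("Cupertino", "GPE"),
    ("Tim Cook", "PERSON"),
    ("Crank", "PERSON"),
}

rule_scores = precision_recall(rule_pred, gold)
neural_scores = precision_recall(neural_pred, gold)

print("rules: ", rule_scores)    # (1.0, 0.25)
print("neural:", neural_scores)  # (0.75, 0.75)
```

In this made-up example the rules are never wrong but miss most entities, while the model covers more ground at the cost of one bad label.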
The best practice, especially in production, is often to combine them. Use a pre-trained neural model as your broad net to catch the obvious stuff. Then, use a custom EntityRuler to add highly specific rules for your domain (e.g., your internal product names, project codenames, obscure locations) or to correct consistent errors the model makes.
# After loading the pre-trained model, add a custom ruler
nlp = spacy.load("en_core_web_md")
ruler = nlp.add_pipe("entity_ruler", before="ner") # Add it before the built-in 'ner'
# Rule to catch a specific model the neural net might miss
patterns = [{"label": "PRODUCT", "pattern": "The Crank"}]
ruler.add_patterns(patterns)
# Now your pipeline has both world knowledge and your specific rules
doc = nlp("Tim Cook announced The Crank today.")
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")
# The neural model finds 'Tim Cook - PERSON'
# Your ruler finds 'The Crank - PRODUCT'
This hybrid approach gives you the best of both worlds: the adaptability and contextual understanding of a neural network, plus the deterministic, predictable behavior of a rule-based system for the entities you explicitly list. It acknowledges that sometimes, you just need to tell the computer, “No, really, this is what I mean.”
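The precedence at work in the hybrid pipeline can be sketched in plain Python. When the ruler runs before the statistical recognizer, spaCy's recognizer respects the entity spans that are already set and predicts around them. The function below only mimics that outcome on character offsets; it is not spaCy's internal logic. `merge_entities` is a hypothetical helper, and the model prediction that mislabels “Crank” is invented for illustration.

```python
def merge_entities(rule_ents, model_ents):
    """Combine (start, end, label) spans; rule spans win on any overlap."""
    merged = list(rule_ents)
    for start, end, label in model_ents:
        overlaps_rule = any(
            start < rule_end and rule_start < end
            for rule_start, rule_end, _ in rule_ents
        )
        if not overlaps_rule:
            merged.append((start, end, label))
    return sorted(merged)


text = "Tim Cook announced The Crank today."

# Span produced by a rule for the known product name.
rule_ents = [(19, 28, "PRODUCT")]

# Hypothetical model output: one good span and one mislabel inside the product.
model_ents = [(0, 8, "PERSON"), (23, 28, "PERSON")]

merged = merge_entities(rule_ents, model_ents)
for start, end, label in merged:
    print(f"{text[start:end]} - {label}")
# Tim Cook - PERSON
# The Crank - PRODUCT
```

Giving rules priority is a design choice that suits curated, high-confidence patterns; if your rules are noisier than your model, you may want the opposite order.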