37.4 Dependency Parsing: Syntactic Structure of Sentences
Right, so you’ve tagged your words, you’ve found your entities, and now you’re staring at a sentence like “The old man the boat.” and your brain just did a little somersault, didn’t it? Welcome to the party. This is why we need dependency parsing. It’s the process of mapping out the grammatical structure of a sentence and figuring out how the words relate to each other. It’s the difference between seeing a pile of lumber and seeing the blueprint for the house.
Think of it as a directed graph where words are nodes and the grammatical relationships are the edges. The parser’s job is to find the “head” word for each word and label the nature of that relationship (is it a subject? an object? a modifier?). The result is a tree that shows you the syntactic scaffolding holding the sentence together. This is monumentally useful. It’s the foundation for more complex NLP tasks like information extraction, question answering, and machine translation. If you don’t know what’s modifying what, you’re just guessing.
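To make the graph picture concrete, here is a minimal sketch in plain Python. It is not any library's internal representation: the `Arc` record and the `is_tree` helper are names made up for illustration, the analysis is written by hand, and the "root is its own head" convention is an assumption of this sketch (spaCy happens to follow the same convention for `token.head`).

```python
from typing import NamedTuple


class Arc(NamedTuple):
    word: str
    head: int   # index of the head word; the root points at itself
    label: str  # name of the grammatical relation to the head


# Hand-written analysis of "The cat sleeps" (not parser output)
sentence = [
    Arc("The", 1, "det"),       # "The" depends on "cat"
    Arc("cat", 2, "nsubj"),     # "cat" depends on "sleeps"
    Arc("sleeps", 2, "ROOT"),   # the root is its own head
]


def is_tree(arcs):
    """Return True if there is exactly one root and every word reaches it."""
    roots = [i for i, arc in enumerate(arcs) if arc.head == i]
    if len(roots) != 1:
        return False
    for start in range(len(arcs)):
        node, seen = start, set()
        # Walk up the head links; revisiting a node means we hit a cycle
        while arcs[node].head != node:
            if node in seen:
                return False
            seen.add(node)
            node = arcs[node].head
    return True


print(is_tree(sentence))
```

One head per word plus a single root is what makes the result a tree rather than an arbitrary graph, which is why the check only needs to follow head links upward.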
The Grammatical Universe of Universal Dependencies
Now, we could all get into a bloody academic knife-fight over the “best” set of grammatical relations. To avoid that, the field has largely coalesced around a project called Universal Dependencies (UD). It’s a framework that aims to be consistent across languages. It’s not perfect, but it’s a fantastic starting point and what you’ll see in most modern libraries.
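UD defines a set of universal part-of-speech tags and a set of universal dependency relations. The relations are the gold. One wrinkle: spaCy's English models use a related but not identical label set, so a couple of the names below differ from current UD. You’ll see things like:
- nsubj: nominal subject. The doer of the verb. “The cat sleeps.” -> nsubj(sleeps, cat)
- dobj: direct object. The thing acted upon. “The cat chases the mouse.” -> dobj(chases, mouse). UD v2 calls this relation obj; spaCy's English models emit dobj.
- amod: adjectival modifier. An adjective describing a noun. “The quick brown fox.” -> amod(fox, quick)
- det: determiner. Words like ‘the’, ‘a’, ‘this’. “The cat” -> det(cat, The)
- prep: prepositional modifier. The head of a prepositional phrase. “sleeps on the mat” -> prep(sleeps, on). This is also a spaCy-style label; UD v2 instead attaches the noun to the verb, obl(sleeps, mat), and hangs the preposition off the noun, case(mat, on).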
There are dozens of these. The beauty is that once you have this tree, you can ask precise questions: “What is the subject of the main verb?” or “Find all the adjectives describing this entity.”
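Those two questions turn into short lookups once a parse is stored as (word, head index, label) triples. The sketch below uses a hand-annotated parse with spaCy-style labels such as dobj, not real parser output, and `children`, `subject_of_root`, and `adjectives_of` are hypothetical helper names.

```python
# Hand-annotated parse of "The quick brown fox chases the mouse".
# Each entry is (word, index of its head, label); the root is its own head.
parse = [
    ("The", 3, "det"),
    ("quick", 3, "amod"),
    ("brown", 3, "amod"),
    ("fox", 4, "nsubj"),
    ("chases", 4, "ROOT"),
    ("the", 6, "det"),
    ("mouse", 4, "dobj"),
]


def children(parse, head_index, label=None):
    """Words attached to head_index, optionally filtered by relation label."""
    return [
        word
        for i, (word, head, dep) in enumerate(parse)
        if head == head_index and i != head_index and (label is None or dep == label)
    ]


def subject_of_root(parse):
    """Answer: what is the subject of the main verb?"""
    root = next(i for i, (_, head, _) in enumerate(parse) if head == i)
    return children(parse, root, "nsubj")


def adjectives_of(parse, word):
    """Answer: which adjectives describe this word? (first occurrence only)"""
    index = next(i for i, (w, _, _) in enumerate(parse) if w == word)
    return children(parse, index, "amod")


print(subject_of_root(parse))        # ['fox']
print(adjectives_of(parse, "fox"))   # ['quick', 'brown']
```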
Actually Doing It: Code with spaCy
Enough theory. Let’s get our hands dirty with spaCy, because that’s what you’ll probably use. First, get it installed (pip install spacy and then python -m spacy download en_core_web_sm for a small English model).
import spacy
from spacy import displacy

# Load the small English model. For real work, use 'en_core_web_lg' or 'en_core_web_trf'
nlp = spacy.load("en_core_web_sm")

# Our classic grammatical ambiguity
doc = nlp("The old man the boat.")

# Iterate through each token and print its dependency label, its head, and the head's POS tag
for token in doc:
    print(f"{token.text:10} {token.dep_:12} {token.head.text:10} {token.head.pos_}")

# A more visual way to see the whole structure (jupyter=True renders inline in a notebook)
displacy.render(doc, style="dep", jupyter=True)
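If you run this, you can check whether the parser recovers the intended reading: “man” is the main verb (as in “to operate or serve on the boat”), “The old” is its subject, and “the boat” is the object. Don’t count on it, though. This is a garden-path sentence, and a parser, especially the small model, may well tag “man” as a noun and treat the whole thing as a noun phrase. If the dep_ column doesn’t show “man” as ROOT, the parser tripped over the same ambiguity you did.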
Where This All Goes Sideways
Parsers are brilliant, but they’re not clairvoyant. They make mistakes, especially on:
- Long-distance dependencies: Sentences with lots of clauses between the subject and verb. “The keys that I thought I had lost yesterday were actually on the table.” Here “keys” is the subject of “were”, several words later. A weaker parser might get confused and attach “keys” as the subject of a nearer verb such as “lost”.
- Non-standard grammar: Poetry, song lyrics, and social media posts will absolutely wreck most parsers. They’re trained on well-formed text like news articles.
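- Coordination: Parsing things like “macaroni and cheese” or “She bought apples, oranges, and pears” can be tricky. In both UD and spaCy's scheme, the first conjunct (“macaroni”) is the head, the later conjuncts attach to it with conj, and “and” hangs off a conjunct as cc. The words are not linked through “and”, and one should not end up modifying the other. Parsers often misjudge where a conjunct starts and ends, especially in long lists.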
- Prepositional Phrase Attachment: The classic NLP problem. In “I saw the man with the telescope,” does “with the telescope” modify “saw” (I used the telescope to see) or “man” (the man who had the telescope)? This is often ambiguous without real-world context.
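The prepositional phrase problem above is easy to see when the two readings are written out as head indices: the analyses differ in exactly one arc. This is a hand-built sketch, not parser output. It follows the spaCy-style convention where the preposition heads its phrase (in UD the noun “telescope” would be the word that attaches), and `differing_arcs` is a hypothetical helper name.

```python
words = ["I", "saw", "the", "man", "with", "the", "telescope"]

# heads[i] is the index of the head of words[i]; the root ("saw") is its own head.
# Reading 1: "with the telescope" modifies "saw" (the instrument of seeing)
heads_verb_attach = [1, 1, 3, 1, 1, 6, 4]
# Reading 2: "with the telescope" modifies "man" (the man has the telescope)
heads_noun_attach = [1, 1, 3, 1, 3, 6, 4]


def differing_arcs(words, heads_a, heads_b):
    """List (word, head in A, head in B) wherever the two analyses disagree."""
    return [
        (words[i], words[a], words[b])
        for i, (a, b) in enumerate(zip(heads_a, heads_b))
        if a != b
    ]


print(differing_arcs(words, heads_verb_attach, heads_noun_attach))
# [('with', 'saw', 'man')]
```

Everything else in the sentence is parsed identically, which is why a parser can be right about six out of seven words and still get the meaning wrong.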
Best Practices and Pitfalls
- Model Size Matters: The small (sm) model is fast and good for a demo. The large (lg) or transformer-based (trf) models are significantly more accurate. If parsing is mission-critical for your project, use the biggest model you can afford computationally.
- Don’t Trust It Blindly: Always sanity-check the parser’s output on a sample of your actual data. The performance on medical notes will be different than on product reviews.
- Use the Tree, Don’t Just Look At It: The real power comes from traversing the tree programmatically. Need to find all the subjects of a verb?
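def find_verb_subjects(doc):
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [child for child in token.children if child.dep_ in ("nsubj", "nsubjpass")]
            # A coordinated verb usually has no subject child of its own,
            # so fall back to the subject of the verb it is conjoined to.
            if not subjects and token.dep_ == "conj":
                subjects = [child for child in token.head.children if child.dep_ in ("nsubj", "nsubjpass")]
            print(f"Verb: {token.text} -> Subjects: {[s.text for s in subjects]}")

doc = nlp("The cat chases the mouse and eats the cheese.")
find_verb_subjects(doc)
In spaCy's scheme, “eats” attaches to “chases” with the label conj and has no nsubj child of its own, so a plain scan of token.children would come back empty for it. The conj fallback is what lets the function report “cat” for both “chases” and “eats”, provided the parser gets the coordinated structure right.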
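One way to act on the sanity-check advice above is to hand-annotate a small sample of your own data and measure how often the parser agrees with you. The sketch below computes the two usual attachment scores over (head index, label) pairs: unlabeled (UAS, correct head) and labeled (LAS, correct head and label). The gold and predicted values are made up for illustration, and `attachment_scores` is a hypothetical helper, not a spaCy function.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two equal-length lists of (head_index, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("need at least one token to score")
    # UAS counts tokens whose head is right; LAS also requires the right label
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / len(gold), full_hits / len(gold)


# "I saw the man with the telescope": made-up gold annotation vs. made-up parser output
gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj"), (3, "prep"), (6, "det"), (4, "pobj")]
pred = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj"), (1, "prep"), (6, "det"), (4, "dep")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=0.86 LAS=0.71
```

Even a few dozen annotated sentences from your own domain will tell you more than a benchmark number measured on news text.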
So, the next time you have a sentence that makes your head spin, don’t just stare at it. Throw it at a dependency parser. It’s like having a brilliant, hyper-literal linguist in your code who can see the underlying matrix of language. Just remember, even the best linguist sometimes has a bad day.