41.3 pg_trgm: Trigram-Based Fuzzy String Search
Right, let’s talk about something you’ve undoubtedly wrestled with: the absolute nightmare of human-generated text data. People misspell things. They use abbreviations you’ve never seen. They add spaces where they shouldn’t or forget them where they should. Searching for St. James's Park in your database when a user types st james park is a recipe for an empty result set and a frustrated user.
This is where pg_trgm waltzes in, puts its arm around your shoulder, and says, “Relax, I’ve got this.” It’s one of my absolute favorite PostgreSQL extensions because it solves a messy, real-world problem with a dose of elegant, statistical cleverness.
The core idea is the trigram. A trigram is a group of three consecutive characters taken from a string. The extension splits your string into these tiny little chunks, and we use the overlap of these chunks between two strings to guess how similar they are. Let’s see it in action. First, you’ve got to enable it. This isn’t magic; it’s a tool you have to install.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
Now, let’s dissect a word. The show_trgm function is our microscope.
SELECT show_trgm('word');
{" w"," wo","ord","rd ",wor}
See what it did? It pads the word with two spaces at the beginning and end, and then breaks it down every possible three-character sequence: ' w', ' wo', 'ord', 'rd ', 'wor'. This padding is crucial—it makes the matching sensitive to the start and end of words. The similarity between two strings is calculated using a formula that essentially asks, “How many of these trigram sets overlap?” It’s a measurement of shared DNA.
The Two Superpowers: Similarity and Distance
pg_trgm gives you two primary operators to work with. Don’t get them confused; they’re siblings, not twins.
%(the similarity operator): This returnsTRUEif the similarity of the two strings is above the currentsimilarity_thresholdsetting. Think of it as a fuzzy=.<->(the distance operator): This returns the “distance,” which is1 - similarity. A distance of 0 means identical; as it gets closer to 1, the strings are more different. This is what you use for ordering.
Let’s get our hands dirty. Imagine a table of products with a name column.
SELECT name,
similarity(name, 'Classic Chrome Bicyle Pedal') as sim,
name <-> 'Classic Chrome Bicyle Pedal' as dist
FROM products
ORDER BY dist ASC
LIMIT 5;
This query will happily find "Classic Chrome Bicycle Pedal" (note the correct spelling of “Bicycle”) even though our search term mangled it. The results are ordered by distance (lowest first, meaning most similar), and we also get to see the raw similarity score. This is incredibly powerful for search autocompletion or correcting user input.
GIN is Your Friend, GiST is… There
You cannot, I repeat, cannot, use this effectively on any sizable table without an index. Trigram matching is computationally expensive because it’s comparing sets for every row. The extension provides two index types: GiST and GIN.
Here’s the deal: Almost always use a GIN index. Just do it. A GIN index is faster to look up but slower to build. For read-heavy applications (which is what fuzzy search usually is), it’s the undisputed champion. A GiST index is slower for lookups but faster to build and supports nearest-neighbor searches with the <-> operator natively, which can be useful in very specific, advanced scenarios. But for 99% of you, the choice is simple.
-- This is what you should probably do
CREATE INDEX idx_products_name_gin ON products USING GIN (name gin_trgm_ops);
Once that index is in place, queries using the % and <-> operators will blaze. Without it, they’ll crawl and die on you. It’s the difference between a helpful friend and a useless one.
Tuning Your Fuzzy Radar
The default similarity_threshold is 0.3. This is, frankly, a bit enthusiastic. It will match almost anything to anything else. You can adjust this session-wide to tighten the criteria. Let’s be a bit more strict.
SET pg_trgm.similarity_threshold = 0.5;
SELECT name FROM products WHERE name % 'bicyle';
Play with this value. Set it to 0.7 for a very tight, precise match, or drop it to 0.4 for a much more forgiving net. The right value depends entirely on how badly your users type and how long your strings are. This is a knob you must twist yourself to get the right feel.
The Rough Edges and Pitfalls
It’s not all roses. The trigram approach has weaknesses you must respect.
- Short Words: It’s terrible with short words.
catandbatshare two out of three trigrams ('at ',' ca'vs' ba'), so they’ll have a high similarity. This is a fundamental limitation of the three-character window. If you’re dealing with short strings, this might not be the tool for you. - Language Weirdness: The algorithm is language-agnostic. It doesn’t know that “run” and “ran” are the same verb; it just sees completely different trigrams. It’s statistical, not linguistic.
- Performance Cost: That GIN index isn’t small. It’s effectively creating an index for every single trigram in every single indexed value. It’s a trade-off: you’re exchanging storage space and index maintenance time for blistering query speed.
So, use pg_trgm when your data is messy and your users are human. It’s a pragmatic, brilliantly engineered solution to one of the dirtiest problems in software. Just remember to build the index and tune your threshold, or you’ll be right back in the trenches with me, wondering why the database is so damn slow.