27.7 Language Configurations and Custom Dictionaries
Right, so you’ve got your tsvectors and tsquerys working. You’re feeling pretty good. You search for ‘database’ and it finds ‘database’. A modern miracle. But then you try to search for ‘running’ and it doesn’t find ‘ran’. You search for ‘colour’ and it doesn’t find ‘color’. Suddenly, this powerful tool feels a bit… stupid. This is where we stop just using the tool and start configuring it. This is the layer that separates a toy from a professional-grade search setup.
Think of a text search configuration as the rulebook. It defines three crucial things: how to break text into tokens (parsing), how to reduce those tokens to their normalized form (stemming), and which words to ignore entirely (stop words). PostgreSQL comes with a bunch of built-in configurations for different languages, like english, french, and german. You’ve probably been using english by default without even knowing it.
Checking and Setting the Default Configuration
First, let’s see what you’re working with. Your default configuration is dictated by the default_text_search_config setting.
SHOW default_text_search_config;
On most standard installs, this will likely be pg_catalog.simple. The simple configuration isn’t dumb, but it’s, well, simple. It doesn’t know about language-specific stemming or stop words; it just lowercases everything. This is why your search for ‘running’ didn’t find ‘ran’ – simple doesn’t do stemming.
You can set this for your session or, better yet, in your postgresql.conf file to make it permanent for your database cluster.
SET default_text_search_config = 'english';
-- Now let's see the difference
SELECT to_tsvector('simple', 'the runner is running'),
to_tsvector('english', 'the runner is running');
The simple vector will contain 'runner' and 'running'. The english vector will contain 'runner' and 'run'. See? The english config understood that ‘running’ is a form of the stem ‘run’. Much smarter.
How to See What a Configuration Actually Does
Don’t just take my word for it. The ts_debug() function is your best friend for peeking under the hood. It shows you exactly what happens to every word.
SELECT * FROM ts_debug('english', 'The quick brown foxes jumped over the lazy dog.');
You’ll get a row per word, showing the original token, what type of token it was identified as (word, number, space, etc.), the dictionary it was processed with, and most importantly, the final lexeme it produced. You’ll see that ’the’ was recognized as a stop word and reduced to nothing. ‘foxes’ became 'fox'. ‘jumped’ became 'jump'. This is how you debug why a search isn’t working as expected.
When Built-in Dictionaries Aren’t Enough
The built-in english configuration is good, but it’s not perfect. Let’s say you’re building a medical application. The word ‘patient’ is a noun, but it’s also a perfectly valid adjective meaning ‘calm and enduring’. The stemmer will happily reduce ‘patiently’ to 'patient'. This is a problem. A search for medical notes about a patient will now also match notes saying someone “waited patiently,” polluting your results with false positives.
This is where custom dictionaries come in. You can create your own, or more commonly, modify the rules of the existing ones. The most powerful tool is the ispell dictionary, which uses a rule file to define how words are stemmed. You can also use a synonym dictionary to say “when you see ’tummy’, treat it as ‘stomach’”.
Here’s the thing, though: setting up a custom dictionary from scratch involves creating physical files on the database server’s filesystem. It’s powerful, but it’s a bit of an arcane art and requires server access. For most of us, a more practical approach is to override the search path for a specific configuration.
The Power of Overriding Stop Words
A simpler but incredibly effective trick is to create your own version of a dictionary that extends the built-in one but with your own stop word list. Let’s say in your application, the word ‘user’ is a meaningless stop word because everything is about a user.
-- First, create a text search configuration based on 'english'
CREATE TEXT SEARCH CONFIGURATION my_app_english ( COPY = pg_catalog.english );
-- Now, create a new dictionary that uses the english stemmer but a custom stopword list
CREATE TEXT SEARCH DICTIONARY my_app_english_stop (
TEMPLATE = pg_catalog.simple,
STOPWORDS = my_stopwords -- You'd have to create this file
);
-- This is the complex part: altering the mapping.
-- You need to change the mapping for 'word' tokens to use your new dictionary first,
-- then fall back to the standard english_stem.
ALTER TEXT SEARCH CONFIGURATION my_app_english
ALTER MAPPING FOR word, asciiword
WITH my_app_english_stop, english_stem;
This is the point where you might rightly think, “Wow, that syntax is something only a mother could love.” It’s clunky. But what it does is vital: it processes words through your custom stop list first. If a word is not in your stop list, it then passes it to the standard english_stem dictionary for normal stemming.
The Golden Rule: Test Relentlessly
Your choice of configuration is not a set-it-and-forget-it deal. It’s a critical part of your application’s user experience. You must test it with real data and real queries from your domain. Use ts_debug on a representative sample of your text. Craft queries that should work and, more importantly, ones that shouldn’t (to check for false positives). The difference between a good search and a frustrating one often boils down to taking the time to get this configuration right. It’s the unsexy, trench-work part of full-text search, and it’s where the real battle is won.