27.1 tsvector: Normalized Lexeme Vectors

Alright, let’s get our hands dirty with tsvector. This is the foundational data type for full-text search in PostgreSQL, and if you don’t understand it, the rest of this chapter will feel like wizardry performed by a very unhelpful wizard. Think of a tsvector not as a block of text, but as a normalized, optimized index of the words in that text. It’s the “after” picture in a makeover montage.

When you convert a string to a tsvector, PostgreSQL performs a series of brutal, efficient transformations. It doesn’t care about your feelings for capitalization or grammar. It’s going to:

Parse: Break the text into individual tokens (words).
Normalize: Convert all text to lowercase. ‘The’, ’the’, and ‘THE’ all become ’the’. This is why your searches are case-insensitive.
Remove Stop Words: Toss out common words that carry little semantic weight for search, like ‘a’, ‘an’, ’the’, ‘is’, ‘in’. The specific list depends on the text search configuration you’re using (we’ll get to that). This drastically reduces the size of the index.
Reduce to a Lexeme: This is the fancy term. A lexeme is a canonical representation of a word, its root form. For example, ‘running’, ‘ran’, and ‘runs’ might all be reduced to the lexeme ‘run’. This process is called stemming.

The final output is a sorted list of these distinct lexemes, each optionally tagged with their position in the original string and a weight (more on weights in a second).

Creating a tsvector

You create one using the to_tsvector() function. You must specify a text search configuration (here, 'english') which dictates the rules for parsing, the stop word list, and the stemming dictionary.

SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog.');

Result:

'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2

Let’s autopsy this. Notice:

‘The’ and ’the’ are gone (stop words).
Everything is lowercase.
‘jumped’ became ‘jump’ (stemmed to its lexeme).
’lazy’ became ’lazi’ (Yes, the English stemmer is… a choice. It’s not perfect, but it’s pragmatic and fast).
Each lexeme has a number: its position in the original string. ‘brown’ was the 3rd word.

Weights: A is for Important

You can assign importance, or “weights,” to words from different parts of your text. This is crucial for ranking results later. The four weights are A (highest), B, C, and D (lowest). The setweight() function is used to assign a weight to a given tsvector.

This is most useful when building a tsvector from multiple columns. You’d typically give words from a title more weight than words from a comment body.

SELECT setweight(to_tsvector('english', 'The Art of War'), 'A') ||
       setweight(to_tsvector('english', 'By Sun Tzu. A classic treatise on strategy.'), 'D');

Result:

'art':1A 'classic':5D 'strategi':8D 'sun':3D 'treatis':6D 'tzu':4D 'war':3A

See how ‘art’ and ‘war’ (from the title) are tagged with A, while the words from the description are tagged with D? The || operator concatenates the vectors. This weighted vector will be invaluable when we start ranking search results later.

The Guts of a tsvector

A tsvector doesn’t store the original text. It stores a compressed, optimized representation. You can see its internal structure with ts_debug(), which is like putting your text under a microscope.

SELECT * FROM ts_debug('english', 'jumped');

This will return a row showing each stage of processing: what token was found, what dictionary processed it, and the final lexeme. It’s incredibly useful for debugging why a specific word isn’t being found as you expect (hint: it’s probably being stopped or stemmed strangely).

Common Pitfalls and The ‘Simple’ Configuration

Sometimes you don’t want the linguistic voodoo. You might be indexing product codes, model numbers, or email addresses. Using the 'english' config on ‘PC-1054’ might try to stem it and remove it as a stop word. Bad news.

Enter the 'simple' configuration. It only does two things: lowercase everything and split on non-alphanumeric characters. It does no stemming and removes only the absolute most basic stop words. It’s your blunt instrument.

-- Compare the two:
SELECT to_tsvector('english', 'PC-1054 laptop'), to_tsvector('simple', 'PC-1054 laptop');

Result:

english: 'laptop':2
simple:  '1054':2 'laptop':3 'pc':1

See the drastic difference? For technical data, 'simple' is almost always the correct choice. The most common pitfall I see is people blindly using 'english' on every single text column in their database. Don’t be that person. Think about what you’re indexing. The configuration is your first and most important choice.