27.3 to_tsvector() and to_tsquery(): Converting Text and Queries

Right, let’s get your words ready for a fight. You’ve got a pile of text in a TEXT column and a user typing queries into a search box. Throwing a LIKE '%query%' at that problem is like bringing a butter knife to a gunfight. It’s slow, it’s clumsy, and it will utterly fail the moment you need to find “run” but the text says “running”.

PostgreSQL’s full-text search is your special ops unit. It understands language. It’s fast. But first, we have to speak its language. That means converting your messy human text and your user’s even messier queries into two structured, machine-readable types: tsvector and tsquery.

What’s a tsvector, Really?

Think of a tsvector as a sorted, de-duplicated bag of lexemes. A lexeme is a normalized word, sort of its root form. The process of creating one is called tokenization and normalization. It’s where the magic happens.

SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog.');

                        to_tsvector
-----------------------------------------------------------
 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2

Notice what to_tsvector() did? It:

Threw out the stop words (’the’, ‘over’).
Normalized the words (‘jumped’ became ‘jump’, ’lazy’ became ’lazi’ – which is the lexeme for lazy, don’t @ me, linguists).
Recorded their positions (the original word number: ‘quick’ was the 2nd word).
Sorted everything alphabetically.

This compact, intelligent representation is what we’ll eventually index for blistering speed.

Configurations: The Rulebook

That first argument, 'english', is crucial. It’s the text search configuration. This is the rulebook that tells PostgreSQL how to parse your text: what words to ignore (stop words), and how to reduce words to their root (the stemmer). Use the wrong one, and you’ll get gibberish.

-- Using the 'simple' config, which doesn't remove stop words or do stemming
SELECT to_tsvector('simple', 'The quick brown fox jumped over the lazy dog.');

                                  to_tsvector
-------------------------------------------------------------------------------
 'the':1 'quick':2 'brown':3 'fox':4 'jumped':5 'over':6 'the':7 'lazy':8 'dog':9

Big difference. The 'simple' config is useful for things like product codes or tags where you need exact matches, but for natural language, you almost always want a specific language config like 'english'.

Crafting Queries with to_tsquery()

A tsvector is useless without a query to match against. You craft that query with to_tsquery(). It takes your user’s search string and normalizes it into a boolean predicate that the system can execute against the tsvector.

The basic operators are & (AND), | (OR), and ! (NOT).

SELECT to_tsquery('english', 'quick & dog');

   to_tsquery
-----------------
 'quick' & 'dog'

Now, let’s see them in action together. The match operator is @@.

SELECT to_tsvector('english', 'The quick brown fox jumped over the lazy dog.')
       @@
       to_tsquery('english', 'quick & dog');

?column?
----------
 t

It returns true because the vector contains both ‘quick’ and ‘dog’. Beautiful.

Why You Must Use the Same Configuration

This is the most common foot-gun. The configuration is the shared language between your tsvector and your tsquery. If they’re not speaking the same language, the conversation falls apart.

-- This will FAIL, even though the word is there.
SELECT to_tsvector('english', 'Theatre') @@ to_tsquery('simple', 'Theatre');

?column?
----------
 f

Why? Because the english config normalizes ‘Theatre’ to the lexeme ’theatr’. But the simple config leaves ‘Theatre’ as-is. The vector contains ’theatr’ but the query is looking for ‘Theatre’. No match. Always, always use the same configuration for both sides of the operation.

The Perils of Implicit Casting and the Horror of the ::tsvector Cast

You might see someone try to shortcut this by casting a text field directly to tsvector.

-- This is a TERRIBLE IDEA. Do not do this.
SELECT 'The quick brown fox'::tsvector;

           tsvector
-------------------------------
 'The' 'brown' 'fox' 'quick'

This uses the default_text_search_config, which is a setting that can change out from under you. It’s a silent, lurking bug waiting to happen. More importantly, it doesn’t do any normalization! It just splits on whitespace and sorts. No stemming, no stop-word removal. It’s practically useless for real search. Never use ::tsvector. Always use to_tsvector(config, text) and be explicit.

Making It Practical: A Quick Example

Let’s make this concrete. Setting up a table for search isn’t rocket surgery.

CREATE TABLE book (
  id SERIAL PRIMARY KEY,
  title TEXT NOT NULL,
  author TEXT NOT NULL,
  body_text TEXT NOT NULL
);

-- Create a generated column to hold our pre-processed tsvector
ALTER TABLE book ADD COLUMN body_text_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', body_text)) STORED;

-- Now, a search is clean and fast
SELECT title, author
FROM book
WHERE body_text_tsv @@ to_tsquery('english', 'magic & dragon');

By using a STORED generated column, we pre-compute the tsvector on write, so our reads are just doing the super-fast match. This is the foundation. Next, we’ll throw a GIN index on that body_text_tsv column and watch it fly.