27.5 Ranking Results: ts_rank() and ts_rank_cd()

Right, so you’ve got your search results. They’re… correct. That’s the boring part. The magic trick, the part that makes users think your app is brilliant instead of just accurate, is putting the best results at the top. That’s where ts_rank() and its slightly more pedantic cousin ts_rank_cd() come in. They don’t just find the needles in the haystack; they tell you which needles are shiniest.

These functions essentially ask: “How well does this tsvector match that tsquery?” and return a number. Higher number, better match. It’s a deceptively simple concept that hides a surprising amount of nuance.

How the Sausage is Made: Ranking Algorithms

PostgreSQL gives you two main recipes for this ranking sauce, and you need to know which one to use.

ts_rank() (The Standard-Issue Workhorse): This is your go-to. It ranks based on a brutally simple but effective idea: a document that contains the search terms more frequently is probably more relevant. It also takes the weight labels on the matching lexemes into account, so a hit in an important section (a title, say) can count for more than a hit in body text. It does not favor a match just for appearing early in the document. It’s fast and gets it right most of the time.
ts_rank_cd() (The Pedantic Cover-Density Nerd): The ‘cd’ stands for “cover density.” This algorithm is less concerned with raw frequency and more with the proximity of the search terms to each other. If you search for cat & dog, a document that says “the cat and the dog” (terms close together) will rank higher than one that says “cat” in the first paragraph and “dog” three pages later (terms far apart). It needs the positional information stored in the tsvector, so a tsvector that has been through strip() scores zero. It can be more expensive to compute, but it is often worth it for multi-term queries where proximity is a strong signal of relevance.
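To make the proximity point concrete, here is a minimal sketch that scores two throwaway sentences against the same query with both functions. The sample sentences and the column aliases are invented for illustration. The expectation is that ts_rank_cd() puts the first sentence well ahead of the second. Exact scores are not shown because they are not worth memorizing.

```sql
-- Compare both ranking functions on two made-up documents.
-- In the first, "cat" and "dog" sit close together; in the second, far apart.
SELECT
    doc,
    ts_rank(to_tsvector('english', doc), q)    AS rank_frequency,
    ts_rank_cd(to_tsvector('english', doc), q) AS rank_cover_density
FROM (VALUES
        ('the cat and the dog'),
        ('a cat sat by the door for hours and hours until at last a dog walked in')
     ) AS samples(doc),
     to_tsquery('english', 'cat & dog') AS q
ORDER BY rank_cover_density DESC;
```

Running both columns side by side on a sample of your own data is a cheap way to see whether cover density changes the ordering enough to matter.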

You can, and absolutely should, tweak their behavior using weights. PostgreSQL defines four weight labels (A, B, C, D) that you can attach to different parts of your text when you build your tsvector. Nothing gets labeled for you: every lexeme starts out as D, and it’s your job to call setweight() to mark, say, title words as A and body words as B. The ranking functions then apply a per-label weight, which defaults to 1.0 for A, 0.4 for B, 0.2 for C, and 0.1 for D.
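A quick way to see the labels is to print a tsvector before and after setweight(). This is only a sketch with a made-up input string. In the text output, the default label D is not printed, while any other label appears as a letter after each position.

```sql
-- Plain to_tsvector() output: every position carries the default label D,
-- which is not printed.
SELECT to_tsvector('english', 'ranking search results') AS unlabeled;

-- After setweight(..., 'A'), each position is printed with an A suffix.
SELECT setweight(to_tsvector('english', 'ranking search results'), 'A') AS labeled_a;
```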

-- Let's create a simple table and assign weights
CREATE TABLE doc_example (
    title TEXT,
    body TEXT,
    tsv TSVECTOR
);

-- Insert the row first. A VALUES list can't refer to the row's own columns,
-- so the tsvector is built in a separate UPDATE.
INSERT INTO doc_example (title, body)
VALUES (
    'PostgreSQL Full-Text Search',
    'This is a tutorial about implementing full-text search in PostgreSQL. It covers tsvector and tsquery.'
);

-- Use setweight() to label title lexemes 'A' and body lexemes 'B',
-- then concatenate the two into one tsvector.
UPDATE doc_example
SET tsv = setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
          setweight(to_tsvector('english', coalesce(body, '')), 'B');

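-- Now, let's rank them. We pass in the weights we want to consider.
-- The array is ordered {D, C, B, A}, so '{0,0,0,1}' counts only
-- lexemes labeled 'A' (the title).
SELECT title, ts_rank('{0,0,0,1}', tsv, to_tsquery('english', 'postgresql | tutorial')) AS rank
FROM doc_example;

Only the title match on “PostgreSQL” contributes to this score. “Tutorial” appears only in the body (weight B), and we zeroed out B with {0,0,0,1}. Three details are easy to get wrong here. The array runs from D to A, not A to D, so {1,0,0,0} would count only unlabeled (D) lexemes. The query spells out postgresql, because the english configuration stems “postgres” to postgr, which does not match the lexeme produced from “PostgreSQL”. And the query uses | rather than &, because with & a zero weight on one term can drag the whole score down to zero.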
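To check how the weights array behaves, the sketch below builds a throwaway weighted tsvector inline and ranks it three ways. The sample text and column aliases are made up for illustration. The first two columns should agree, since {0.1, 0.2, 0.4, 1.0} is the documented default, and the third should come out higher because the only match for “tutorial” carries label B.

```sql
-- A throwaway weighted tsvector built inline (title -> A, body -> B).
WITH sample AS (
    SELECT setweight(to_tsvector('english', 'PostgreSQL Full-Text Search'), 'A') ||
           setweight(to_tsvector('english', 'A tutorial about search in PostgreSQL'), 'B') AS tsv
)
SELECT
    ts_rank(tsv, q)                         AS default_weights,
    ts_rank('{0.1, 0.2, 0.4, 1.0}', tsv, q) AS explicit_defaults,  -- order is {D, C, B, A}
    ts_rank('{0.1, 0.2, 1.0, 1.0}', tsv, q) AS body_equals_title   -- B raised to match A
FROM sample, to_tsquery('english', 'tutorial') AS q;
```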

Normalization: Taming the Wild Document Length

Here’s the classic pitfall: a five-page essay that mentions “PostgreSQL” ten times will naturally get a higher raw frequency score than a concise, perfect one-paragraph answer that mentions it twice. This is obviously stupid. The essay isn’t more relevant; it’s just longer.

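This is why you want a normalization strategy. Both ts_rank() and ts_rank_cd() take an optional final integer argument to deal with this exact nonsense. It’s a bit mask, so flags can be combined with |: 0 (the default) ignores document length, 1 divides the rank by 1 + the logarithm of the document length, and 2 divides the rank by the document length itself. The most common and useful one is 2.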

-- The bad way: long documents will dominate.
SELECT title, ts_rank(tsv, query) as rank
FROM doc_example, to_tsquery('postgresql') query
ORDER BY rank DESC;

-- The correct way: normalize by document length.
SELECT title, ts_rank(tsv, query, 2) as rank -- Notice the '2'
FROM doc_example, to_tsquery('postgresql') query
ORDER BY rank DESC;

Always, always use a normalization code. 2 (divide by the document length) is a solid default. If it punishes long documents too hard, 1 (divide by 1 + the logarithm of the length) is the gentler option. It’s the difference between a useful ranking and a useless one.
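To get a feel for what the flags do, you can rank the same documents under several normalization codes at once. The following is a sketch with two invented documents, one short and one padded. The labels and column aliases are assumptions for this example only. Code 32 (rank divided by rank + 1) is included because it squeezes scores into the 0 to 1 range, which is handy for display.

```sql
-- Two made-up documents: same topic, very different lengths.
WITH samples(label, doc) AS (
    VALUES
        ('short', 'PostgreSQL ranking explained'),
        ('long',  'PostgreSQL ranking explained, followed by a great deal of padding text about indexes, vacuum, replication, backups and other unrelated topics that mention PostgreSQL again')
)
SELECT
    label,
    ts_rank(to_tsvector('english', doc), q, 0)  AS raw,            -- ignore length
    ts_rank(to_tsvector('english', doc), q, 1)  AS div_log_length, -- rank / (1 + log(length))
    ts_rank(to_tsvector('english', doc), q, 2)  AS div_length,     -- rank / length
    ts_rank(to_tsvector('english', doc), q, 32) AS scaled_0_to_1   -- rank / (rank + 1)
FROM samples, to_tsquery('english', 'postgresql') AS q
ORDER BY div_length DESC;
```

Comparing the raw and div_length columns shows how much of the long document's score comes from sheer length.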

Putting It All Together in a Real Query

Let’s see this in action with a proper query. The key is to remember that ranking is computationally expensive. You don’t want to rank every single document in your 10-million-row table. You filter first with your @@ operator, then you rank the matching results.

SELECT
    title,
    body,
    ts_rank(tsv, query, 2) AS rank -- Filter, then rank.
FROM doc_example, to_tsquery('english', 'search & tutorial') query
WHERE tsv @@ query -- Filter first; this is fast when tsv has a GIN index.
ORDER BY rank DESC;

This is the blueprint. You use the GIN index to quickly narrow down the candidates, and then you apply the relatively expensive ranking function to only that subset to put them in the right order. It’s a one-two punch of efficiency and intelligence.
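The blueprint assumes a GIN index on the tsv column, which the table definition earlier does not create. Here is a minimal sketch of the missing piece plus a top-ten variant of the query. The index name doc_example_tsv_idx is made up for this example.

```sql
-- Hypothetical index name; the GIN index is what makes the @@ filter cheap.
CREATE INDEX IF NOT EXISTS doc_example_tsv_idx
    ON doc_example USING GIN (tsv);

-- Filter with @@, rank the survivors, and keep only the ten best.
SELECT title,
       ts_rank(tsv, query, 2) AS rank
FROM doc_example, to_tsquery('english', 'search & tutorial') AS query
WHERE tsv @@ query
ORDER BY rank DESC
LIMIT 10;
```

Keep in mind that LIMIT trims the output, not the work: every row that passes the @@ filter still has to be ranked before the sort can pick the top ten, so a very broad query stays expensive.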

The designers actually got this part pretty right. The weights system is flexible, the normalization options handle the biggest problem, and the two algorithms give you a choice between speed and precision. Your job is to wield these tools deliberately. Don’t just throw ts_rank() into a query and hope for the best. Think about your data. Are titles important? Use weights. Are your documents wildly different lengths? Use normalization. Do users often search for phrases? Consider ts_rank_cd(). This is how you move from having search to having good search.