16.6 String Aggregation: string_agg() and array_agg()

Right, let’s talk about making lists. Because sometimes you don’t want a summary statistic like SUM() or AVG(); you want to see the actual stuff that got grouped together. That’s where string_agg() and array_agg() come in. They’re your go-to functions for when you need to roll up a bunch of row values into a single, consolidated value. Think of them as the “and company” in a list of names.

string_agg() is the more commonly used of the two. It takes a column of values, mashes them together into a single string, and plops a delimiter of your choice between each one. It’s shockingly useful for generating comma-separated lists, tags, or any scenario where you need a human-readable summary.

array_agg() is its more structured, data-nerd cousin. Instead of a string, it returns a PostgreSQL array. This is objectively better for programmatic use because you preserve the individual elements in a proper data structure, avoiding the nightmare of later having to parse a string back apart. If your next step is another part of your application code, you almost always want array_agg().

The Anatomy of string_agg()

The signature is straightforward: string_agg(expression, delimiter). The expression is usually just a column name, but it can be any expression that results in a text value (or something coercible to text). The delimiter is what gets shoved between each value. Let’s get our hands dirty.

Imagine a film_actors table that links actors to movies. To get a list of all actors for each film, you’d do this:

SELECT
  film_id,
  string_agg(actor_name, ', ') AS cast_list
FROM film_actors
GROUP BY film_id
LIMIT 3;

Your result might look like:

 film_id |                     cast_list
---------+-----------------------------------------------------
     101 | Daniel Day-Lewis, Meryl Streep, John Doe
     102 | Jane Fonda, John Doe
     103 | ...

Simple, right? But here’s the first pitfall: what if an actor’s name has a comma in it? Suddenly, your beautifully formatted list is ambiguous garbage. This is a classic problem with CSV-style data and a perfect example of why string_agg() is for presentation and array_agg() is for data processing.

You can also ORDER BY within the aggregate itself. This is crucial for controlling the order of the concatenated list, since the order rows are fed into the aggregate during a GROUP BY is not guaranteed.

SELECT
  film_id,
  string_agg(actor_name, ' > ' ORDER BY cast_billing_order) AS cast_list_by_billing
FROM film_actors
GROUP BY film_id;

Now your list will respect the official billing order from the movie credits, separated by ’ > ‘.

Why array_agg() is Often the Smarter Choice

While string_agg() gives you a friendly string, array_agg() gives you a precise array. This is a non-destructive operation. You get each element preserved in its original form, ready to be used.

SELECT
  film_id,
  array_agg(actor_name) AS cast_array
FROM film_actors
GROUP BY film_id
LIMIT 1;

Result:

 film_id |                cast_array
---------+-------------------------------------------
     101 | {"Daniel Day-Lewis","Meryl Streep","John Doe"}

See the difference? The elements are now individual items within an array, denoted by curly braces {}. No more worrying about delimiter collisions. You can also ORDER BY inside array_agg() just like with string_agg().

The real power comes when you use this array downstream. You can unnest it, search it, get its length, or pass it directly to a function that expects an array.

-- Find films with more than 5 actors
SELECT film_id, array_agg(actor_name) as cast_array
FROM film_actors
GROUP BY film_id
HAVING cardinality(array_agg(actor_name)) > 5;

FILTER: The Cleanest Way to Conditional Aggregation

Now for the really cool part. Sometimes you only want to aggregate some of the rows in your group. The old-school way was to use a CASE statement inside the aggregate, which is functional but ugly as sin.

PostgreSQL’s FILTER (WHERE ...) clause is the elegant solution. It lets you apply a WHERE clause just to the input of that specific aggregate function. It’s brilliantly readable.

Let’s say you have a sales table and you want a report showing total sales per staff member, but also a list of all the high-value transactions (say, over $100) they made.

SELECT
  staff_id,
  sum(amount) as total_sales,
  string_agg('Sale #' || sale_id::text, '; ' FILTER (WHERE amount > 100)) AS big_ticket_sales
FROM sales
GROUP BY staff_id;

For a given staff member, the result might be:

 staff_id | total_sales |         big_ticket_sales
----------+-------------+-----------------------------------
        1 |      850.00 | Sale #101; Sale #155; Sale #209

The FILTER clause ensures string_agg() only considers rows where amount > 100. You can mix and match different FILTER clauses on different aggregates in the same SELECT statement. It’s incredibly powerful for building complex pivot-like reports without losing your mind. This is one of those features where the PostgreSQL designers absolutely nailed it.