Count

16.6 String Aggregation: string_agg() and array_agg()

Right, let’s talk about making lists. Because sometimes you don’t want a summary statistic like SUM() or AVG(); you want to see the actual stuff that got grouped together. That’s where string_agg() and array_agg() come in. They’re your go-to functions for when you need to roll up a bunch of row values into a single, consolidated value. Think of them as the “and company” in a list of names. string_agg() is the more commonly used of the two. It takes a column of values, mashes them together into a single string, and plops a delimiter of your choice between each one. It’s shockingly useful for generating comma-separated lists, tags, or any scenario where you need a human-readable summary.

16.5 Statistical Aggregates: stddev, variance, corr, regr_*

Alright, let’s talk about the fun stuff: making your data confess its deepest secrets. We’ve summed and averaged, but now it’s time to get properly nosy. We’re moving beyond “what” to “how much does this stuff vary?” and “do these things even move together?”. That’s where statistical aggregates come in. They sound academic, but they’re just tools for measuring spread and relationship. And I promise, we’ll get through this without a single lecture about bell curves.

16.4 FILTER Clause: Conditional Aggregation

Alright, let’s talk about the FILTER clause. You’re going to love this. It’s one of those SQL features that, once you start using it, you’ll wonder how you ever lived without it. It’s the antidote to a common SQL headache: writing a bunch of messy CASE statements inside your aggregate functions. The core problem it solves is conditional aggregation. You know the scenario. You want to count things, or sum things, but only for specific rows within your groups. The old-school way, which you’ve probably written a thousand times, involves a CASE expression nested inside the aggregate. It gets the job done, but it’s like reading a sentence with a bunch of parentheses in the middle—it breaks your flow.

16.3 HAVING: Filtering After Aggregation

Right, so you’ve grouped your data. You’ve got your neat little summary rows, your averages, your counts. It’s a beautiful, aggregated masterpiece. But what if you don’t want all of those groups? What if you only care about the groups where the average is above a certain threshold, or the count is suspiciously high? You can’t slap a WHERE clause on this; WHERE is for filtering raw rows before they get fed into the aggregation machine. For filtering the results of that machine, you need its bouncer: the HAVING clause.

16.2 GROUP BY: Collapsing Rows into Groups

Alright, let’s get down to brass tacks. You’ve got a big ol’ table of data, and you want to summarize it. You don’t want to see every single row; you want to see the shape of the forest, not count every tree. That’s what GROUP BY is for. It’s the SQL clause that collapses rows into meaningful groups and lets you apply aggregate functions (SUM, AVG, COUNT, etc.) to each group individually. It’s the difference between a pile of individual receipts and a monthly financial statement.

16.1 COUNT, SUM, AVG, MIN, MAX: The Core Aggregates

Right, let’s get our hands dirty with the core aggregate functions. These are the workhorses, the ones you’ll use in 90% of your analytical queries. They’re deceptively simple. COUNT, SUM, AVG, MIN, and MAX seem straightforward, but the devil—and the real power—is in the details. Think of them as a tiny, efficient factory you attach to a group of rows. You feed a bunch of values in, and the factory spits out a single, summarized value. SUM adds everything up, AVG calculates the mean, and so on. The magic happens when you pair them with GROUP BY, which we’ll get to shortly. For now, let’s meet the team.

16.6 String Aggregation: string_agg() and array_agg()

16.5 Statistical Aggregates: stddev, variance, corr, regr_*

16.4 FILTER Clause: Conditional Aggregation

16.3 HAVING: Filtering After Aggregation

16.2 GROUP BY: Collapsing Rows into Groups

16.1 COUNT, SUM, AVG, MIN, MAX: The Core Aggregates

16. Aggregate Functions: GROUP BY, HAVING, and FILTER

16. count and for_each

6. RDD Actions: collect, reduce, count, take, and saveAsTextFile