16.5 Statistical Aggregates: stddev, variance, corr, regr_*
Alright, let’s talk about the fun stuff: making your data confess its deepest secrets. We’ve summed and averaged, but now it’s time to get properly nosy. We’re moving beyond “what” to “how much does this stuff vary?” and “do these things even move together?”. That’s where statistical aggregates come in. They sound academic, but they’re just tools for measuring spread and relationship. And I promise, we’ll get through this without a single lecture about bell curves.
The Usual Suspects: STDDEV and VARIANCE
Variance (var_samp) is the polite way to start. It calculates the average of the squared differences from the mean. Squaring the differences does two things: it makes all values positive (so negatives don’t cancel out positives) and it punishes outliers more severely. A big difference becomes a really big number when squared. Standard deviation (stddev_samp) is just the square root of the variance. We take the root to bring the units back to something we can actually understand. If your data is in “meters,” variance is in “meters²” (which is useless), but standard deviation is back in “meters.”
PostgreSQL, being the overachiever it is, gives you two flavors of each: the population and sample versions.
var_samp()/stddev_samp(): Use this when your data is a sample of a larger population. This is what you’ll use 99% of the time. It usesN-1in the denominator (where N is the number of rows) to correct for sample bias. It’s the statistically honest thing to do.var_pop()/stddev_pop(): Use this only if your data literally is the entire population you’re studying. It usesNin the denominator.
-- Sample data: Salaries in a (very small) company
SELECT
department,
avg(salary) AS avg_salary,
round(stddev_samp(salary)::numeric, 2) AS sample_sd,
round(stddev_pop(salary)::numeric, 2) AS population_sd,
count(*) AS n_employees
FROM employees
GROUP BY department;
department | avg_salary | sample_sd | population_sd | n_employees
------------+------------+-----------+---------------+-------------
Engineering| 110000 | 22456.12 | 19526.03 | 5
Marketing | 85000 | 7071.07 | 6123.72 | 3
See the difference? With only 3 people in Marketing, the sample correction (n-1 = 2) makes the sample standard deviation larger than the population version. This is the database politely reminding you that your sample size is tiny and you shouldn’t bet the company on it.
How Correlated Are Your Variables? Enter CORR
Correlation (corr) measures the direction and strength of a linear relationship between two variables. It returns a number between -1 and 1.
- 1: A perfect positive linear relationship. As X goes up, Y goes up in a perfectly straight line.
- -1: A perfect negative linear relationship. As X goes up, Y goes down in a perfectly straight line.
- 0: No linear relationship. They might have a wild, non-linear relationship (like a parabola), but
corrwon’t see it. This is its biggest blind spot.
Let’s say you’re suspicious that longer titles get more clicks.
SELECT corr(length(title), click_count) AS title_length_click_corr
FROM blog_posts;
title_length_click_corr
-------------------------
0.1274
A value of 0.12 suggests a very weak positive relationship. The takeaway? Maybe title length has a tiny effect, but it’s being drowned out by other factors (like, I don’t know, the content?). Don’t let a weak correlation like this trick you into thinking you’ve found a cause. It hasn’t passed the sniff test yet.
The Heavy Artillery: Linear Regression with REGR_*
This is where PostgreSQL flexes. The regr_* family of functions (regr_slope, regr_intercept, regr_r2) aren’t just measuring a relationship; they’re fitting a straight line (y = mx + b) to your data. This is simple linear regression built right into your SQL query. It’s absurdly powerful.
regr_slope(Y, X): Themin the equation. For every one unit increase in X, how much does Y change?regr_intercept(Y, X): Thebin the equation. What is the value of Y when X is zero? (Often this is a theoretical concept, not a practical one).regr_r2(Y, X): The coefficient of determination. This tells you how well the line fits your data. Closer to 1 = better fit.
Let’s model the relationship between a product’s discount percentage and its sales volume.
SELECT
product_id,
regr_slope(sales_volume, discount_pct) as slope,
regr_intercept(sales_volume, discount_pct) as intercept,
regr_r2(sales_volume, discount_pct) as r_squared
FROM sales_data
GROUP BY product_id;
product_id | slope | intercept | r_squared
------------+---------+------------+-----------
101 | 55.61 | 102.44 | 0.87
102 | 12.15 | 198.01 | 0.23
Look at that. Product 101 has a strong relationship (r_squared = 0.87). The slope tells us that for every 1% increase in discount, we sell about 55 more units. The intercept suggests we’d sell about 102 units at a 0% discount. Product 102, however, has a weak relationship (r_squared = 0.23). Discounts just don’t drive sales for that product. Stop putting it on sale; you’re just leaving money on the table.
The Crucial HAVING Clause for Statistics
You can’t use a WHERE clause to filter based on the result of an aggregate function. That’s what HAVING is for. This is especially useful here to find meaningful patterns and ignore statistical noise.
-- Find products where discount has a STRONG positive effect on sales
SELECT
product_id,
regr_slope(sales_volume, discount_pct) as slope,
regr_r2(sales_volume, discount_pct) as r_squared
FROM sales_data
GROUP BY product_id
HAVING regr_slope(sales_volume, discount_pct) > 0 -- Positive correlation
AND regr_r2(sales_volume, discount_pct) > 0.7; -- Strong model fit
This query filters out all the weak, negative, or messy relationships and only shows you the ones where the signal is clear and actionable. This is how you stop guessing and start knowing.