78.5 Statistical Tests: t-test, chi-squared, ANOVA

Right, let’s talk about p-values. No, don’t groan. I know they’ve been the subject of more academic drama than a stolen research idea, but they’re still the lingua franca of “is this thing I’m seeing real?” in science. We use them not because they’re perfect, but because they’re a standardized, if slightly clunky, tool. And SciPy is your toolbox for wielding them without cutting your fingers off.

The core idea is simple: you have a hypothesis (e.g., “this new fertilizer makes plants grow taller”) and its boring counterpart, the null hypothesis (the fertilizer actually does nothing). You collect some data, and then you use a statistical test to calculate the probability of seeing data at least as extreme as yours if the null hypothesis were true. That probability is the p-value. A very low p-value (conventionally below 0.05) tells you the null hypothesis is looking pretty shaky. It’s not proof, it’s evidence. Now, let’s get our hands dirty.
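To make that definition concrete, here is a minimal sketch of a permutation test, which estimates a p-value by brute force rather than by formula. The plant heights are invented for illustration, and the one-sided comparison and the number of shuffles are assumptions of this sketch, not something the tests in this section do internally.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical plant heights in cm (invented for illustration)
control = np.array([20.1, 21.3, 19.8, 22.0, 20.5, 21.1])
fertilized = np.array([22.4, 23.1, 21.9, 24.0, 22.7, 23.5])

observed_diff = fertilized.mean() - control.mean()

# If the null hypothesis is true, the group labels are interchangeable,
# so we shuffle the pooled data and recompute the difference many times.
pooled = np.concatenate([control, fertilized])
n_control = len(control)
n_resamples = 10_000
null_diffs = np.empty(n_resamples)
for i in range(n_resamples):
    shuffled = rng.permutation(pooled)
    null_diffs[i] = shuffled[n_control:].mean() - shuffled[:n_control].mean()

# One-sided p-value: share of shuffles at least as extreme as the real data
# (the +1 terms count the observed arrangement itself)
p_value = (np.sum(null_diffs >= observed_diff) + 1) / (n_resamples + 1)

print(f"Observed difference: {observed_diff:.2f} cm")
print(f"Permutation p-value: {p_value:.4f}")
```

The t-test, ANOVA, and chi-squared test answer the same kind of question, but they replace the shuffling with a known theoretical distribution, which is why their assumptions matter.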

The t-test: Comparing Two Means

The workhorse. You use this when you want to compare the averages of two groups. Did users on the new website design (Group B) spend more money than users on the old one (Group A)? Let’s say we have our data in two Polars Series. First, we’ll use Polars for what it’s best at: wrangling and describing the data.

import polars as pl
from scipy import stats

# Let's fabricate some plausible, if underwhelming, A/B test results
# Group A (old design)
group_a = pl.Series("old_design", [12.50, 15.75, 11.20, 14.10, 13.05, 16.40, 10.90])
# Group B (new design)
group_b = pl.Series("new_design", [14.20, 18.10, 15.30, 17.55, 16.80, 12.50, 19.25, 15.90])

print(f"Group A mean: {group_a.mean():.2f}")
print(f"Group B mean: {group_b.mean():.2f}")

Now, the million-dollar question: is that difference in means (13.41 vs. 16.20) real, or just random noise? We use scipy.stats.ttest_ind. The ind is for “independent” samples, which these are.

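# Perform an independent two-sample t-test.
# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(group_b, group_a, equal_var=False)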

print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis: There's a statistically significant difference!")
else:
    print("Fail to reject the null hypothesis: Any difference could just be chance.")

Why it works & Pitfalls: The classic t-test assumes your data is roughly normally distributed and that the two groups have similar variances. Watch the default here: ttest_ind uses equal_var=True, which is Student’s t-test and does assume equal variances. Pass equal_var=False to get Welch’s t-test, which drops that assumption and is almost always the smarter choice. If you’re a stickler and want to check the variance assumption first, you can use scipy.stats.levene(group_a, group_b). The biggest pitfall is using this test on non-independent data (e.g., before-and-after measurements on the same users, which call for a paired t-test, ttest_rel) or using it to compare more than two groups (which is a job for…).
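As a rough sketch of those checks, the snippet below runs Levene’s test, compares the Student and Welch variants side by side, and then shows a paired test. The A/B figures are the ones from this section; the before/after numbers for the paired case are hypothetical and exist only to show the call.

```python
import numpy as np
from scipy import stats

# The A/B figures from this section, as plain NumPy arrays
old_design = np.array([12.50, 15.75, 11.20, 14.10, 13.05, 16.40, 10.90])
new_design = np.array([14.20, 18.10, 15.30, 17.55, 16.80, 12.50, 19.25, 15.90])

# Levene's test: the null hypothesis is that both groups have equal variances
levene_stat, levene_p = stats.levene(old_design, new_design)
print(f"Levene p-value: {levene_p:.4f}")

# Student's t-test (equal_var=True) versus Welch's t-test (equal_var=False)
student = stats.ttest_ind(new_design, old_design, equal_var=True)
welch = stats.ttest_ind(new_design, old_design, equal_var=False)
print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.4f}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.4f}")

# Paired case: hypothetical spend for the SAME six users before and after
before = np.array([12.0, 15.5, 11.0, 14.0, 13.5, 16.0])
after = np.array([13.1, 16.0, 12.4, 14.2, 15.0, 16.9])
paired = stats.ttest_rel(after, before)
print(f"Paired:  t={paired.statistic:.3f}, p={paired.pvalue:.4f}")
```

When the two variances are close, as they are here, Student and Welch give nearly identical answers, so defaulting to Welch costs you very little.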

ANOVA: The Fancy t-test for Three or More Groups

ANOVA (Analysis of Variance) is essentially a t-test on steroids for when you have three or more groups. Let’s say you have three different fertilizers and you want to see if any of them produce different average plant heights. The null hypothesis is that all group means are equal.

# Three different fertilizer groups
fertilizer_a = pl.Series([15.0, 16.5, 14.2, 17.1, 15.8])
fertilizer_b = pl.Series([18.5, 20.1, 19.3, 17.9, 18.8])
fertilizer_c = pl.Series([14.8, 15.2, 13.9, 16.0, 14.5])

# Perform one-way ANOVA
f_stat, p_value = stats.f_oneway(fertilizer_a, fertilizer_b, fertilizer_c)

print(f"F-statistic: {f_stat:.3f}")
print(f"p-value: {p_value:.8f}") # It's gonna be tiny

The Critical Caveat: A significant ANOVA result only tells you that not all the groups are the same. It doesn’t tell you which ones are different from which others. For that, you need a post-hoc test like Tukey’s HSD. For a long time SciPy didn’t include one, so you had to grab it from statsmodels (pairwise_tukeyhsd), which was a hassle and a questionable omission. SciPy 1.8 and later ship scipy.stats.tukey_hsd, so check your version before reaching for another dependency. You deserve built-in post-hocs, and now you have one.
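A minimal sketch of the post-hoc step, assuming SciPy 1.8 or newer (where scipy.stats.tukey_hsd is available) and reusing the three fertilizer groups from above as plain lists. The group labels are just for printing.

```python
from scipy import stats

# The three fertilizer groups from the ANOVA example
fertilizer_a = [15.0, 16.5, 14.2, 17.1, 15.8]
fertilizer_b = [18.5, 20.1, 19.3, 17.9, 18.8]
fertilizer_c = [14.8, 15.2, 13.9, 16.0, 14.5]

# Tukey's HSD compares every pair of groups while controlling
# the family-wise error rate across all comparisons.
result = stats.tukey_hsd(fertilizer_a, fertilizer_b, fertilizer_c)

# result.pvalue[i, j] holds the adjusted p-value for group i vs. group j
labels = ["A", "B", "C"]
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f"{labels[i]} vs {labels[j]}: p = {result.pvalue[i, j]:.4f}")
```

With these numbers, fertilizer B separates from both A and C, while A and C are not distinguishable from each other, which is exactly the detail the F-statistic alone cannot give you.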

The Chi-Squared Test: Categorical Showdown

While t-tests and ANOVA deal with means of continuous data, chi-squared (chi2) is for counts and categories. The classic example: does the distribution of blood types (A, B, AB, O) in your sample match the known distribution for the general population? This is the goodness-of-fit test.
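Here is a small sketch of that goodness-of-fit test using scipy.stats.chisquare. Both the sample counts and the population shares are invented placeholders, not real blood-type statistics, so treat the numbers as scaffolding for the call.

```python
import numpy as np
from scipy import stats

# Hypothetical sample counts for blood types A, B, AB, O (invented)
observed = np.array([84, 24, 10, 82])

# Assumed population shares (illustrative placeholders, must sum to 1)
population_share = np.array([0.40, 0.11, 0.04, 0.45])

# Expected counts if the sample followed the population exactly
expected = population_share * observed.sum()

chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

print(f"Chi2 statistic: {chi2_stat:.3f}")
print(f"p-value: {p_value:.4f}")
```

Note that chisquare expects the observed and expected counts to add up to the same total, which is why the expected values are derived from the sample size instead of typed in by hand.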

More commonly, you’ll use the chi-squared test of independence on a contingency table. For example: is there a relationship between gender (Male, Female) and preference for a product (Like, Dislike)?

# Let's create a contingency table
# Rows: Gender (Male, Female)
# Columns: Product Preference (Like, Dislike)
observed_data = pl.DataFrame({
    "gender": ["Male", "Male", "Female", "Female"],
    "preference": ["Like", "Dislike", "Like", "Dislike"],
    "count": [20, 10, 30, 15]
})

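# Pivot to create the matrix SciPy expects.
# Polars 1.0+ names this parameter `on`; older releases called it `columns`.
cont_table = observed_data.pivot(
    on="preference",
    index="gender",
    values="count",
    aggregate_function="first"
).select(pl.col("Like"), pl.col("Dislike"))  # drop the gender label column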

print("Contingency Table:")
print(cont_table)

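# Run the chi2 test on the numeric matrix, converted to a NumPy array.
# These counts are exactly proportional (2:1 in both rows), so expect chi2 = 0 and p = 1.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(cont_table.to_numpy())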

print(f"\nChi2 statistic: {chi2_stat:.3f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
print("\nExpected frequencies if independent:")
print(expected)

The Non-Negotiable Rule: The chi-squared approximation becomes unreliable if the expected frequency in any cell of the table is too low (a common rule of thumb is below 5). The expected array returned by chi2_contingency lets you check this directly. If you see small values there, you shouldn’t trust the result. Your options are to collect more data or use an exact test like Fisher’s exact test (scipy.stats.fisher_exact), though it’s computationally heavier, and older SciPy releases only accept 2x2 tables.
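One way to put that rule into practice is sketched below: run chi2_contingency, inspect the expected frequencies, and fall back to Fisher’s exact test when any cell is below 5. The 2x2 table is hypothetical, and the threshold of 5 is the rule of thumb mentioned above, not a hard statistical boundary.

```python
import numpy as np
from scipy import stats

# Hypothetical small 2x2 table (rows: two groups, columns: Like / Dislike)
small_table = np.array([[8, 2],
                        [1, 5]])

chi2_stat, chi2_p, dof, expected = stats.chi2_contingency(small_table)
print("Expected frequencies:")
print(expected)

# Rule of thumb: any expected count below 5 makes the chi2 p-value suspect
if (expected < 5).any():
    odds_ratio, p_value = stats.fisher_exact(small_table)
    test_used = "Fisher's exact"
else:
    p_value = chi2_p
    test_used = "chi-squared"

print(f"Test used: {test_used}")
print(f"p-value: {p_value:.4f}")
```

Checking the expected counts rather than the observed ones is deliberate, since the approximation depends on what the table would look like under independence.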

Remember, all of these tests have assumptions. Violate them and you’re building on a shaky foundation. Your brilliant Polars-driven data preparation is the first and most important step, because no statistical test can save you from garbage data. Now go forth and test responsibly. And maybe stop treating the 0.05 significance level like it’s a religious text; it’s a convention, not a commandment.