11.8 Statistical Significance Testing for Model Comparison
Right, so you’ve got two models. One is your shiny new thing, the promise of a better tomorrow. The other is the old, boring baseline (maybe a linear regression, or just predicting the average). Your new model has better accuracy, a lower RMSE, a higher F1-score. You’re feeling pretty good. But hold on. Did it really win, or did it just get lucky on this particular slice of data? That isn’t a question of opinion; it’s a question of probability, and that’s where statistical significance testing comes in. We’re going to move from saying “it looks better” to saying “if the two models were really equally good, a gap this large would show up less than 5% of the time by chance alone.” Note what that does not say: a p-value below 0.05 is not a 95% probability that the improvement is real. It only tells you the result would be surprising if there were no true difference. This is how you lower the odds of shipping a model that’s actually no better, or even worse.
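To make the "did it just get lucky?" question concrete, here is a minimal sketch of one way to ask it: a paired sign-flip permutation test on per-example errors. It is only one of several reasonable tests, and not necessarily the one you will end up using. The function name `paired_permutation_test` and the synthetic error arrays are assumptions made up for illustration; in practice you would plug in the per-example errors of your two models on the same test set.

```python
import numpy as np

def paired_permutation_test(errors_a, errors_b, n_permutations=10_000, seed=0):
    """Two-sided paired sign-flip permutation test (hypothetical helper).

    errors_a and errors_b are per-example errors of two models
    evaluated on the SAME test examples, in the same order.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(errors_a, dtype=float) - np.asarray(errors_b, dtype=float)
    observed = diffs.mean()

    # Null hypothesis: the models are equally good, so each paired
    # difference is just as likely to be positive as negative.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    null_means = (signs * diffs).mean(axis=1)

    # Share of sign-flipped datasets with a gap at least as large as the
    # observed one. The +1 terms keep the estimate strictly above zero.
    extreme = np.sum(np.abs(null_means) >= abs(observed))
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Synthetic data, purely for illustration (not real results).
rng = np.random.default_rng(42)
baseline_errors = np.abs(rng.normal(1.0, 0.3, size=200))
new_model_errors = baseline_errors - 0.1 + rng.normal(0.0, 0.2, size=200)

mean_gap, p_value = paired_permutation_test(baseline_errors, new_model_errors)
print(f"mean error reduction: {mean_gap:.3f}, p-value: {p_value:.4f}")
```

The pairing matters: both models are scored on the same examples, so the test looks at per-example differences rather than treating the two sets of errors as independent samples. A small p-value here means a gap this large would rarely appear if the models were truly equivalent; it does not give the probability that the new model is better.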