Statistics | mikePietsch.com

3.9 Principal Component Analysis as a Linear Algebra Application

Right, so you’ve got data. Lots of it. A spreadsheet with a thousand rows and a hundred columns, a point cloud with a million 3D coordinates, image data with thousands of pixels per sample. It’s a mess. It’s high-dimensional, which is a fancy way of saying it’s a pain in the neck to visualize, process, and train models on. Many of those dimensions are probably redundant, correlated, or just noisy. Wouldn’t it be nice to squash it down into its most important, uncorrelated components without losing the good stuff? Enter Principal Component Analysis, or PCA. Don’t let the fancy name intimidate you; at its heart, it’s just a brutally effective application of the linear algebra we’ve been talking about.

3.8 Information Theory: Entropy, KL Divergence, and Cross-Entropy

Alright, let’s get our hands dirty with the math that makes AI models actually care about being right. We’re talking about information theory. Don’t let the name intimidate you; at its core, it’s just a brutally honest way to measure surprise and disagreement. It’s the difference between a model that confidently spouts nonsense and one that whispers, “I’m not entirely sure, but here’s my best guess.” Think of it this way: if I told you the sun rose this morning, you’d offer a polite nod. Low surprise, low information. If I told you a penguin just delivered my new passport, you’d be shocked. High surprise, high information. Information theory gives us a mathematical yardstick for that feeling of surprise. And in AI, we use that yardstick to beat our models into shape, teaching them to assign high probabilities to things that actually happen and low probabilities to things that don’t.
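
To put rough numbers on that feeling of surprise, here is a minimal sketch using only the standard library. The probabilities assigned to the sunrise and the penguin are invented for illustration, and the helper names `surprisal_bits` and `entropy_bits` are ours, not part of any library.

```python
import math

def surprisal_bits(p: float) -> float:
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

def entropy_bits(dist: list[float]) -> float:
    """Average surprisal of a distribution; zero-probability terms add nothing."""
    return sum(p * surprisal_bits(p) for p in dist if p > 0)

# Hypothetical probabilities, chosen only to illustrate the contrast.
p_sunrise = 0.999999
p_penguin = 1e-9

print(surprisal_bits(p_sunrise))   # tiny: almost no information
print(surprisal_bits(p_penguin))   # large: very surprising
print(entropy_bits([0.5, 0.5]))    # a fair coin carries 1.0 bit per flip
```

The rarer the event, the bigger the number, which is all "high surprise, high information" really means.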

3.7 Bayes' Theorem and Bayesian Reasoning

Right, let’s talk about Bayes’ Theorem. This isn’t just some dusty equation from a statistics textbook; it’s the very engine of modern reasoning for AI systems. It’s how your spam filter learns what you consider junk, how diagnostic tools weigh evidence, and how a self-driving car updates its belief about a pedestrian stepping off the curb. At its heart, it’s a formal method for changing your mind in the face of new evidence. And it’s scandalously simple.
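
As a rough sketch of "changing your mind in the face of new evidence", here is a single Bayesian update for a yes-or-no hypothesis. The spam-filter numbers are made up, and the function name `posterior` and its parameters are hypothetical, chosen for this illustration.

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """P(H | E) for a binary hypothesis H after observing evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Made-up numbers: 20% of mail is spam; the word "prize" shows up in
# 60% of spam and in 5% of legitimate mail.
belief = posterior(prior=0.2, likelihood=0.6, false_positive_rate=0.05)
print(belief)   # belief that the message is spam, given it contains "prize"

# Yesterday's posterior is today's prior: feed it back in when a second,
# independent piece of evidence with the same rates arrives.
belief_again = posterior(prior=belief, likelihood=0.6, false_positive_rate=0.05)
print(belief_again)
```

Chaining updates like this assumes the pieces of evidence are conditionally independent, which real data rarely honors perfectly.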

3.6 Probability Distributions: Gaussian, Bernoulli, Categorical, Multinomial

Right, let’s talk probability distributions. You can’t do AI without them. They’re the mathematical machinery for handling uncertainty, which is pretty much the entire job description of an intelligent system. Think of them as the personality profiles for your data. Is your data a well-behaved, predictable type (Gaussian)? Or is it a fickle, yes-or-no drama queen (Bernoulli)? Let’s meet the usual suspects.

The All-Powerful Gaussian (Normal) Distribution

The Gaussian, or normal, distribution is the overachieving golden child of probability. It’s everywhere, thanks to the Central Limit Theorem, which basically says that if you take a bunch of independent random quantities (each with finite variance) and add them together, the sum will tend toward a Gaussian. It’s the universe’s default setting for noise.
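
The Central Limit Theorem claim is easy to poke at with the standard library. The sketch below adds up twelve uniform draws, which look nothing like a bell curve on their own, and checks whether the sums behave the way a Gaussian would. The helper `noisy_sum`, the sample count, and the seed are arbitrary choices for this illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def noisy_sum(n_terms: int = 12) -> float:
    """Add n_terms independent Uniform(0, 1) draws, shifted to centre on zero."""
    return sum(random.random() for _ in range(n_terms)) - n_terms / 2

samples = [noisy_sum() for _ in range(20_000)]

mean = statistics.fmean(samples)
std = statistics.pstdev(samples)
# A Gaussian puts roughly 68% of its mass within one standard deviation.
within_one_std = sum(abs(s - mean) <= std for s in samples) / len(samples)

print(f"mean={mean:.3f}  std={std:.3f}  within one std={within_one_std:.3f}")
```

Each uniform draw has variance 1/12, so twelve of them sum to a variance of 1, which is why the standard deviation should land near 1.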

3.5 Partial Derivatives and Gradients

Right, so you’ve got a function. Maybe it’s your model’s loss function, a complex simulation, or just a weirdly shaped wavy sheet. Up until now, you’ve probably asked questions like, “If I nudge my input this way, what happens to the output?” That’s a derivative. But our world isn’t one-dimensional. Your AI model has thousands, millions, sometimes billions of parameters. Nudging things is a multi-directional affair. This is where we stop thinking in terms of slopes and start thinking in terms of gradients.

3.4 Derivatives and the Chain Rule: Foundations of Backpropagation

Alright, let’s get our hands dirty with derivatives. Forget the dusty old definition from calculus class with the limit of the secant line. In the AI world, you need a more practical, almost physical intuition. Think of a derivative not as a slope, but as a sensitivity measurement. If you have a function f(x), the derivative f'(x) or df/dx tells you one thing: if you give x a tiny nudge h, how much will the output f(x) nudge in response? It’s the function’s amplification factor for change at that specific point. A large derivative means it’s super sensitive; a small one means it barely cares. This is the absolute bedrock of training neural networks. We nudge the weights (our x) based on how sensitive the loss (our f(x)) is to them. It’s how the network learns.
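
The "tiny nudge" idea translates almost word for word into code. This is a minimal sketch, not how deep learning frameworks compute derivatives (they use automatic differentiation); the function `f` and the helper `sensitivity` are made up for illustration.

```python
def f(x: float) -> float:
    """An arbitrary example function; its exact derivative is 3 * x**2."""
    return x ** 3

def sensitivity(func, x: float, h: float = 1e-6) -> float:
    """Nudge x by h and report the change in output per unit of nudge."""
    return (func(x + h) - func(x)) / h

print(sensitivity(f, 2.0))   # approximately 12: very sensitive here
print(sensitivity(f, 0.1))   # approximately 0.03: barely cares here
```

Same function, two different points, two very different amplification factors. That dependence on where you stand is the whole point.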

3.3 Dot Products, Norms, and Projections

Alright, let’s get our hands dirty with the real workhorses of linear algebra: dot products, norms, and projections. These aren’t just abstract mathematical curiosities; they are the fundamental tools that let AI models measure similarity, understand distance, and even “learn” by nudging things in the right direction. If you’ve ever used a recommendation system or seen a neural network classify an image, these concepts were working overtime under the hood.

3.2 Matrix Operations: Multiplication, Transpose, Inverse, Eigendecomposition

Right, let’s talk about the things you’ll actually do with matrices. You’ve got these grids of numbers, and you can add them, which is delightfully sane. But multiplication? That’s where the designers of our universe decided to get weird. Matrix multiplication isn’t just multiplying each element by its corresponding partner. That operation exists; it’s called the Hadamard product, and it has its uses, but it’s not what anyone means by “matrix multiplication.” No, true matrix multiplication is a more profound, and frankly, more useful beast. It’s the mathematical embodiment of composing linear transformations. If I have a matrix A that rotates a vector, and a matrix B that scales it, then A @ B (in Python parlance) gives me a new matrix that does the scaling first and then the rotation, in one step. The key rule: the number of columns in the first matrix must equal the number of rows in the second. If A is (m x n) and B is (n x p), then the result C is (m x p). The element C[i, j] is the dot product of the i-th row of A and the j-th column of B.
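
To make the row-times-column rule and the composition claim concrete, here is a deliberately naive sketch in plain Python. In practice you would reach for NumPy and `@`; the hand-rolled `matmul` helper and the specific rotation and scaling matrices are assumptions made for this illustration.

```python
import math

def matmul(A, B):
    """Multiply an (m x n) matrix by an (n x p) matrix, both lists of rows."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must equal rows of B")
    p = len(B[0])
    # C[i][j] is the dot product of row i of A and column j of B.
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]

theta = math.pi / 2
rotate = [[math.cos(theta), -math.sin(theta)],
          [math.sin(theta),  math.cos(theta)]]   # 90 degrees counter-clockwise
scale = [[2.0, 0.0],
         [0.0, 3.0]]                             # stretch x by 2 and y by 3

combined = matmul(rotate, scale)   # one matrix: scale first, then rotate
v = [[1.0], [1.0]]                 # a column vector, i.e. a (2 x 1) matrix

one_step = matmul(combined, v)
two_steps = matmul(rotate, matmul(scale, v))
print(one_step)    # approximately [[-3.0], [2.0]]
print(two_steps)   # the same point, reached the long way round
```

The rightmost matrix acts on the vector first, which is why `rotate @ scale` reads "backwards" relative to the order in which things happen.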

3.1 Vectors, Matrices, and Tensors: The Language of ML

Right, let’s get this out of the way. You’re not here to learn about vectors in the abstract, geometric sense, like some arrow pointing into space from your high school physics class. In our world—the world of machine learning—vectors, matrices, and tensors are just data containers. They’re the fundamental structures we use to shove numbers into a model’s mouth. A vector is a list of numbers. A matrix is a list of lists of numbers. A tensor is just a fancy, multi-dimensional array of numbers (and yes, a vector is a 1D tensor, a matrix is a 2D tensor; don’t let anyone make it sound more mystical than that).

3. Mathematics for AI: Linear Algebra, Calculus, and Probability

78.7 SymPy: Symbolic Mathematics in Python

Right, so you’ve graduated from just plugging numbers into functions and you want to ask the big questions. What is the derivative of this monstrosity? How do I solve this equation for x without guessing? That’s where SymPy saunters in, the library that gives Python the soul of a grumpy, infinitely patient mathematician. Forget floating-point approximations for a second. SymPy is all about symbolic computation. It deals with symbols, variables, and exact relationships. It manipulates mathematical expressions the way a human would on paper, just a lot faster and without the coffee stains. It’s not a numerical library; it’s an algebraic one. We use it when we want to understand the structure of a problem before we ever feed it numbers.
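
A small taste of what that looks like, as a sketch that assumes SymPy is installed. The expression being differentiated and the equation being solved are arbitrary examples picked for illustration.

```python
import sympy as sp

x = sp.symbols("x")

expr = x**2 * sp.sin(x)          # an arbitrary expression to differentiate
derivative = sp.diff(expr, x)    # exact symbolic derivative via the product rule
roots = sp.solve(x**2 - 2, x)    # exact roots, kept as square roots

print(derivative)
print(roots)
```

Notice that the roots come back as symbolic square roots of 2 rather than 1.41421356..., which is exactly the "forget floating-point approximations" promise.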

78.6 Polars: Lazy Evaluation and Performance vs Pandas

Right, let’s talk about what happens when you stop asking your CPU to politely wait around and instead tell it to get its act together. That’s the fundamental shift in mindset between eager evaluation (Pandas’ default mode) and lazy evaluation (Polars’ superpower). Pandas is like that eager intern who runs off to do each task you give them the second you ask, which is great… until you realize you needed to change the first step. Polars, in its lazy mode, is the senior engineer who asks for the entire project plan first, stares at it for a while, optimizes the hell out of the route, and then executes it all in one go. It’s not just faster; it’s smarter.

78.5 Statistical Tests: t-test, chi-squared, ANOVA

Right, let’s talk about p-values. No, don’t groan. I know they’ve been the subject of more academic drama than a stolen research idea, but they’re still the lingua franca of “is this thing I’m seeing real?” in science. We use them not because they’re perfect, but because they’re a standardized, if slightly clunky, tool. And SciPy is your toolbox for wielding them without cutting your fingers off. The core idea is simple: you have a hypothesis (e.g., “this new fertilizer makes plants grow taller”) and its boring counterpart, the null hypothesis (the fertilizer actually does nothing). You collect some data, and then you use a statistical test to calculate the probability of seeing data at least as extreme as yours if the null hypothesis were true. That probability is the p-value. A very low p-value (conventionally below 0.05) tells you the null hypothesis is looking pretty shaky. It’s not proof, it’s evidence. Now, let’s get our hands dirty.
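
Here is a minimal sketch of the fertilizer example using `scipy.stats.ttest_ind`. The plant heights are invented for illustration, and the choice of Welch's variant (`equal_var=False`, which does not assume both groups share a variance) is ours, not something the text prescribes.

```python
from scipy import stats

# Made-up plant heights in centimetres, purely for illustration.
control = [20.1, 19.8, 21.0, 20.5, 19.9, 20.3]
fertilized = [22.4, 23.1, 22.8, 21.9, 23.5, 22.6]

# Two-sample t-test; equal_var=False selects Welch's t-test.
result = stats.ttest_ind(fertilized, control, equal_var=False)

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}")
if result.pvalue < 0.05:
    print("The null hypothesis (no effect) is looking shaky.")
```

With groups this clearly separated the p-value comes out far below 0.05, but that reflects the invented data, not anything about fertilizer.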

78.4 Signal Processing: FFT, Filtering, and Spectral Analysis

Right, let’s talk about making your data sing. Or at least making it stop screaming. Signal processing is how you take a raw, noisy, often infuriatingly messy signal from the real world and extract the information you actually care about. It’s the digital equivalent of tuning an old radio—you’re turning the knobs (applying filters) to bring the station (your signal) into focus and drown out the static (the noise). We’ll use SciPy for the heavy-duty signal processing math because, frankly, it’s a beast. But since our signals often live in big, beautiful DataFrames, we’ll use Polars to manage them before we hand things off to SciPy’s algorithms. This is the classic one-two punch: Polars for fast, efficient data wrangling, SciPy for the rigorous numerical analysis.

78.3 Optimization: Minimizing Functions and Curve Fitting

Right, so you’ve got some data and a model. Maybe it’s the decay of a radioactive isotope, the growth of a bacterial colony, or how many cups of coffee it takes before your hands start to vibrate at a measurable frequency. You need to find the parameters of your model that make it fit your data best. This isn’t guesswork; it’s optimization. And SciPy’s scipy.optimize module is your brilliantly stocked toolbox for this exact job. Let’s crack it open.

78.2 Numerical Integration and Solving ODEs

Right, so you’ve got some data, maybe it’s the trajectory of a particle, the growth rate of a tumor, or the decay of a radioactive sample. The math governing it is a differential equation. You could try to solve it analytically, wrestle with integrals and constants of integration until your pencil snaps. But let’s be real, most of the interesting problems in the real world are non-linear, messy, and refuse to have a nice closed-form solution. That’s where we stop being mathematicians and start being computational scientists.

78.1 SciPy Subpackages: integrate, optimize, signal, stats, sparse

Right, let’s get into the meat of SciPy. You’ve got NumPy for your arrays, the raw material. SciPy is the fully-stocked workshop where you shape that material into something useful. It’s a massive collection of subpackages, but we’re going to focus on the heavy hitters you’ll actually use. Forget the kitchen-sink approach; we’re here to talk about the tools that earn their keep.

integrate: Making Sense of the Curve

The world isn’t discrete. Sometimes you need to know the whole of something, not just the sum of its sampled parts. That’s where scipy.integrate comes in. The workhorse here is quad, which handles the definite integral of a function of one variable. It’s shockingly simple to use, but the magic is in what it’s doing behind the scenes: it’s using robust numerical techniques (like adaptive quadrature) to figure out the area under your curve without needing an analytical solution.
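
To show just how little ceremony `quad` needs, here is a minimal sketch that integrates sin(x) from 0 to pi. The integrand is our own choice, picked because the exact answer (2) is easy to check by hand.

```python
import math
from scipy.integrate import quad

# quad returns the integral estimate and an estimate of its absolute error.
value, abs_error = quad(math.sin, 0.0, math.pi)

print(value)       # very close to the exact answer, 2
print(abs_error)   # quad's own estimate of how far off it might be
```

The second return value is easy to ignore, but it is the first thing to look at when an integrand is spiky or oscillates wildly.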