3.6 Probability Distributions: Gaussian, Bernoulli, Categorical, Multinomial

Right, let’s talk probability distributions. You can’t do AI without them. They’re the mathematical machinery for handling uncertainty, which is pretty much the entire job description of an intelligent system. Think of them as the personality profiles for your data. Is your data a well-behaved, predictable type (Gaussian)? Or is it a fickle, yes-or-no drama queen (Bernoulli)? Let’s meet the usual suspects.

The All-Powerful Gaussian (Normal) Distribution

The Gaussian, or normal, distribution is the overachieving golden child of probability. It’s everywhere, thanks to the Central Limit Theorem, which basically says that if you add up a large number of independent random quantities, each with finite variance, the sum (suitably centered and scaled) will tend toward a Gaussian, whatever the individual quantities looked like. That’s why it’s the universe’s default setting for noise.
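
The Central Limit Theorem is easy to poke at numerically. Here is a minimal sketch, assuming we sum independent Uniform(0, 1) draws (the choice of uniform inputs, the seed, and the sizes `n_terms` and `n_sums` are illustrative, not anything special): the sums should have the mean and spread that theory predicts, and roughly 68% of them should land within one standard deviation of the mean, as a Gaussian would.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n_terms = 30      # how many uniform draws go into each sum (illustrative)
n_sums = 50_000   # how many sums we collect (illustrative)

# Each row is one experiment: add up n_terms independent Uniform(0, 1) draws.
sums = rng.uniform(0.0, 1.0, size=(n_sums, n_terms)).sum(axis=1)

# A single Uniform(0, 1) draw has mean 1/2 and variance 1/12,
# so the sum has mean n_terms / 2 and variance n_terms / 12.
expected_mean = n_terms / 2
expected_std = np.sqrt(n_terms / 12)

# Standardize the sums and measure how many fall within one standard deviation.
z = (sums - expected_mean) / expected_std
within_one_sigma = np.mean(np.abs(z) < 1.0)

print("sample mean:", sums.mean(), "expected:", expected_mean)
print("sample std: ", sums.std(), "expected:", expected_std)
print("fraction within 1 sigma:", within_one_sigma, "(a Gaussian gives about 0.683)")
```

Nothing about a single uniform draw looks like a bell curve; the shape only shows up once you add enough of them together.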

Its personality is defined by two parameters: the mean (μ), which is where it centers itself, and the standard deviation (σ), which controls how spread out it is. A small σ means it’s a tightly-wound, precise distribution. A large σ means it’s… well, it’s had a few and is now telling long, rambling stories to everyone at the party.

The classic bell curve is beautiful because it’s incredibly tractable. The entire thing is described by this equation, which I’m showing you not to intimidate you but so you’ll recognize it in a dark alley:

$$f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

Now, let’s make it in code. You’ll use this constantly.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Define the parameters: mean and standard deviation
mu = 0.0   # Center it at zero
sigma = 1.0 # Standard Deviation of one -> "Standard Normal"

# Generate some data points from this distribution
data = np.random.normal(mu, sigma, 10000)

# Plot a histogram to see the famous bell curve
plt.hist(data, bins=50, density=True, alpha=0.6, color='g', label='Sampled Data')

# Now plot the actual, smooth Probability Density Function (PDF)
x = np.linspace(-4, 4, 100)
plt.plot(x, norm.pdf(x, mu, sigma), 'r-', lw=2, label='PDF')

plt.title('The Glorious Gaussian Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.show()
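
If you want to convince yourself that the density equation really is all there is to it, you can type it in directly and compare against the library. This is a small sketch; `gaussian_pdf` is a helper name made up for illustration. One thing worth noticing: the formula is written in terms of the variance σ², but `norm.pdf` (via its `scale` argument) and `np.random.normal` both expect the standard deviation σ.

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    """Evaluate the Gaussian density formula directly (illustrative helper)."""
    variance = sigma ** 2
    return np.exp(-(x - mu) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)

x = np.linspace(-4, 4, 9)
by_hand = gaussian_pdf(x, mu=0.0, sigma=1.0)
from_scipy = norm.pdf(x, loc=0.0, scale=1.0)  # scale is sigma, not sigma squared

print("Formula matches scipy:", np.allclose(by_hand, from_scipy))

# Height of the standard normal at its center: 1 / sqrt(2 * pi)
peak = gaussian_pdf(0.0, mu=0.0, sigma=1.0)
print("Peak density:", peak)
```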

Pitfall: The biggest mistake is assuming everything is Gaussian. Real-world data is often messy, skewed, and has fat tails. Blindly assuming normality will lead your model astray. Always visualize your data first.
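
Alongside a plot, a couple of summary numbers can flag trouble quickly. The sketch below is one possible check, not a full normality test: it compares sample skewness and excess kurtosis (both near 0 for Gaussian data) on a genuinely Gaussian sample and on a log-normal sample, which stands in here for "messy, skewed, fat-tailed" data. The sample sizes and seed are arbitrary.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)

gaussian_data = rng.normal(0.0, 1.0, size=20_000)
skewed_data = rng.lognormal(mean=0.0, sigma=1.0, size=20_000)  # long right tail

for name, sample in [("gaussian", gaussian_data), ("log-normal", skewed_data)]:
    # kurtosis() returns *excess* kurtosis by default, so a Gaussian scores about 0
    print(f"{name:>10}: skewness = {skew(sample):.2f}, "
          f"excess kurtosis = {kurtosis(sample):.2f}")
```

Values far from zero on either measure are a hint that a Gaussian assumption deserves a second look.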

The Binary Bouncer: Bernoulli Distribution

Meet the simplest distribution: the Bernoulli. It’s the distribution for a single event with two possible outcomes. Success (1) or failure (0). Heads or tails. Cat picture or not-cat-picture.

It has just one parameter: p, the probability of success. The probability of failure is then 1 - p. That’s it. Its entire personality is defined by how optimistic p is.

from scipy.stats import bernoulli

# Let's model a biased coin flip where P(Heads) = 0.7
p = 0.7
bernoulli_rvs = bernoulli.rvs(p, size=20) # Generate 20 flips

print("20 coin flips (1=Heads, 0=Tails):", bernoulli_rvs)
print("Mean of our samples (should be close to p):", bernoulli_rvs.mean())
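
The Bernoulli is simple enough that its summary statistics can be written down at a glance: the mean is p and the variance is p(1 - p). A quick sketch that reads these off scipy's "frozen" distribution object (the variable names are just for illustration):

```python
from scipy.stats import bernoulli

p = 0.7
coin = bernoulli(p)  # a "frozen" distribution with p fixed

prob_heads = coin.pmf(1)   # P(X = 1) = p
prob_tails = coin.pmf(0)   # P(X = 0) = 1 - p
mean = coin.mean()         # p
variance = coin.var()      # p * (1 - p)

print("P(heads):", prob_heads, " P(tails):", prob_tails)
print("mean:", mean, " variance:", variance)
```

The variance peaks at p = 0.5, which matches intuition: a fair coin is the least predictable coin.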

Why it matters: It’s the building block. You don’t just use it for coins; every time a model makes a single yes-or-no call (say, a sigmoid output unit reporting the probability that an image is a cat picture), or you run a single trial in an experiment, you’re in Bernoulli territory.

Choosing One from the Menu: Categorical Distribution

The Bernoulli distribution has a fancy cousin: the Categorical distribution. Instead of two outcomes, it has K possible outcomes. Think of it as rolling a single, possibly unfair, die. The outcome is a single category (e.g., “you rolled a 3”).

Its parameters are a vector of probabilities, one for each category. The probabilities must sum to 1, or the math gods get very angry.

from numpy.random import choice

# Let's model a weird 4-sided die with probabilities:
categories = ['Red', 'Green', 'Blue', 'Yellow']
probabilities = [0.1, 0.2, 0.5, 0.2] # Must sum to 1.0

# Roll the die 15 times
rolls = choice(categories, size=15, p=probabilities)
print("Your 15 rolls:", rolls)
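
Fifteen rolls is too few to see the probabilities emerge, so here is a sketch with many more. It uses NumPy's `default_rng` generator rather than the module-level `choice`, purely so the run can be seeded; the sample size is arbitrary. It also shows what the "angry math gods" look like in practice: NumPy refuses a probability vector that does not sum to 1.

```python
import numpy as np

rng = np.random.default_rng(2)

categories = np.array(['Red', 'Green', 'Blue', 'Yellow'])
probabilities = np.array([0.1, 0.2, 0.5, 0.2])

# Roll the die many times; observed frequencies should approach the probabilities
rolls = rng.choice(categories, size=50_000, p=probabilities)
frequencies = np.array([(rolls == c).mean() for c in categories])

for c, target, observed in zip(categories, probabilities, frequencies):
    print(f"{c:>6}: target {target:.2f}, observed {observed:.3f}")

# A vector that sums to 1.1 is rejected with a ValueError
bad_probabilities = [0.1, 0.2, 0.5, 0.3]
try:
    rng.choice(categories, size=1, p=bad_probabilities)
    rejected = False
except ValueError as err:
    rejected = True
    print("Rejected:", err)
```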

…And Counting Them: Multinomial Distribution

Now, what if you roll that weird die n times and want to know how many times each side came up? You’ve left the realm of the Categorical and entered the Multinomial. It’s the distribution of counts across multiple categories.

It generalizes the Binomial distribution (which is just the count of successes from multiple Bernoulli trials) to more than two categories. Its parameters are the number of trials n and the probability vector p.

from scipy.stats import multinomial

# Same weird die from before
p = [0.1, 0.2, 0.5, 0.2]

# Roll it 100 times and count the outcomes
# size=1 returns a single row of counts (shape 1 x 4); [0] takes that row
outcome = multinomial.rvs(n=100, p=p, size=1)[0]

print(f"Results of 100 rolls: {outcome}")
print(f"Expected red: {100 * p[0]}, Actual red: {outcome[0]}")
print(f"Expected blue: {100 * p[2]}, Actual blue: {outcome[2]}")
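
The claim that the Multinomial generalizes the Binomial can be checked directly: with only two categories, the two should assign identical probabilities. A minimal sketch, with `n`, `k`, and `p_success` chosen arbitrarily for illustration:

```python
from scipy.stats import binom, multinomial

n = 10           # number of trials
p_success = 0.3  # probability of "success" on each trial
k = 4            # number of successes we ask about

# Binomial: probability of exactly k successes in n trials
binomial_prob = binom.pmf(k, n, p_success)

# Multinomial with two categories: k successes and n - k failures
multinomial_prob = multinomial.pmf([k, n - k], n=n, p=[p_success, 1 - p_success])

print("Binomial:   ", binomial_prob)
print("Multinomial:", multinomial_prob)

# Every multinomial draw is a vector of counts that adds up to the number of trials
samples = multinomial.rvs(n=100, p=[0.1, 0.2, 0.5, 0.2], size=5, random_state=0)
print(samples)
print("Row sums:", samples.sum(axis=1))
```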

Why you care: This is the core of classification. The output of a classic machine learning classifier (like multiclass logistic regression or a simple neural network with a softmax layer) is usually interpreted as a probability vector, which is exactly the parameter vector of a Categorical distribution over the classes. To make a prediction you typically pick the most probable class; generative models such as language models go a step further and actually draw a sample from that distribution. And when you tally results per class over a test set of n examples, you’re looking at counts of exactly the kind the Multinomial describes. These aren’t abstract academic concepts; they are the literal plumbing of AI.
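
To make the link to classifiers concrete, here is a small sketch of turning raw scores into a Categorical distribution and then using it. Everything specific in it is hypothetical: the class names, the `logits` values, and the `softmax` helper are made up for illustration and do not come from any particular model. It shows both common ways of using the probability vector: taking the most probable class, and drawing samples from it.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (illustrative helper)."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

class_names = ['cat', 'dog', 'bird']  # hypothetical classes
logits = np.array([2.0, 1.0, 0.1])    # hypothetical raw classifier scores

probs = softmax(logits)  # parameters of a Categorical distribution
print("Class probabilities:", dict(zip(class_names, probs.round(3))))

# Option 1: deterministic prediction, pick the most probable class
predicted = class_names[int(np.argmax(probs))]
print("Most probable class:", predicted)

# Option 2: stochastic prediction, draw from the Categorical distribution
rng = np.random.default_rng(3)
sampled = rng.choice(class_names, size=10, p=probs)
print("Ten sampled predictions:", sampled)
```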
