3.7 Bayes' Theorem and Bayesian Reasoning

Right, let’s talk about Bayes’ Theorem. This isn’t just some dusty equation from a statistics textbook; it’s the workhorse behind much of the probabilistic reasoning in AI systems. It’s how your spam filter learns what you consider junk, how diagnostic tools weigh evidence, and how a self-driving car updates its belief about a pedestrian stepping off the curb. At its heart, it’s a formal method for changing your mind in the face of new evidence. And it’s scandalously simple.

The theorem itself is almost embarrassingly straightforward. It tells us how to update the probability of a hypothesis H (like “this email is spam”) given some new evidence E (like the word “Viagra”). Here’s the classic formulation:

P(H|E) = [P(E|H) * P(H)] / P(E)

Don’t glaze over. Let’s break down what these symbols are actually saying:

P(H|E) is what we want: the posterior probability. This is our updated belief about the hypothesis H after seeing the evidence E.
P(E|H) is the likelihood. This is the probability of seeing the evidence E if the hypothesis H were true. In our spam example: how likely is the word “Viagra” to appear in a spam email?
P(H) is the prior probability. This is our belief about the hypothesis before we see the new evidence. It’s our starting point. Maybe we know that 30% of all emails are spam, so our prior P(spam) is 0.3.
P(E) is the marginal likelihood or evidence. This is the total probability of seeing the evidence E under all possible hypotheses. With just two hypotheses, H and not-H, that is P(E) = P(E|H) * P(H) + P(E|¬H) * P(¬H). It acts as a normalizing constant: dividing by it guarantees that the posteriors of all the competing hypotheses sum to 1.

The magic is in the update: P(H|E) is proportional to P(E|H) * P(H). Your new belief is your old belief, scaled by how well the hypothesis predicts the new evidence.
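To make the update concrete, here is a minimal sketch of the spam example. The 30% prior is the one used in the text; the two likelihoods for the word “Viagra” are made-up illustrative numbers, not measurements from any real mailbox.

```python
# Prior from the text: 30% of all emails are spam.
p_spam = 0.3
p_ham = 1 - p_spam

# ASSUMPTION: illustrative likelihoods, not real measurements.
p_word_given_spam = 0.20   # P("Viagra" | spam)
p_word_given_ham = 0.001   # P("Viagra" | not spam)

# Unnormalized posteriors: likelihood * prior for each hypothesis.
score_spam = p_word_given_spam * p_spam
score_ham = p_word_given_ham * p_ham

# P(E) is the sum of those scores over all hypotheses.
p_word = score_spam + score_ham

# Dividing by P(E) turns the scores into proper probabilities.
p_spam_given_word = score_spam / p_word
p_ham_given_word = score_ham / p_word

print(f"P(spam | word) = {p_spam_given_word:.4f}")
print(f"P(ham  | word) = {p_ham_given_word:.4f}")
```

Under these assumed numbers, the word is 200 times more likely in spam than in legitimate mail, so a 30% prior jumps to a posterior above 98%. Change the likelihoods and the size of the jump changes with them.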

The Prior Isn’t an Opinion, It’s a Starting Point

The most controversial part of Bayes is often the prior, P(H). Critics (usually frequentists) will whine that it’s subjective. To which I say: of course it is, and that’s the point. All reasoning starts from some prior assumption. The beautiful thing about Bayesian analysis is that it forces you to state your assumption explicitly so everyone can see it. Frequentist methods make assumptions too (the choice of model, test, and stopping rule); they just sit in the methodology where they are easier to overlook. Ours are right there in the equation for the world to critique.

If you have enough data, the prior gets overwhelmed by the evidence. The one exception is a prior of exactly 0 or 1, which no amount of evidence can move, so avoid those unless you really mean them. A strong prior just means you need more convincing evidence to change your mind significantly, which is exactly how rational thinking should work. The key is to not just make up a number. Your prior should be based on historical data, domain expertise, or a well-reasoned default (like a uniform distribution if you’re truly ignorant).
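The claim that data swamps the prior is easy to check numerically. The sketch below uses a Beta prior on the spam rate with binomial data, which is a standard textbook model chosen here for convenience, not something the text above prescribes. The two priors and the email counts are hypothetical.

```python
def posterior_mean(prior_a, prior_b, spam_count, ham_count):
    """Mean of the Beta posterior for the spam rate.

    A Beta(a, b) prior updated with binomial counts gives a
    Beta(a + spam_count, b + ham_count) posterior.
    """
    return (prior_a + spam_count) / (prior_a + prior_b + spam_count + ham_count)


# ASSUMPTION: two hypothetical analysts with different starting beliefs.
priors = {
    "uniform Beta(1, 1)": (1, 1),       # no opinion about the spam rate
    "skeptical Beta(2, 18)": (2, 18),   # expects roughly 10% spam
}

# ASSUMPTION: hypothetical inboxes of growing size, always 30% spam.
datasets = [(3, 7), (30, 70), (300, 700), (3000, 7000)]

gaps = []
for spam_count, ham_count in datasets:
    means = [posterior_mean(a, b, spam_count, ham_count) for a, b in priors.values()]
    gap = abs(means[0] - means[1])
    gaps.append(gap)
    total = spam_count + ham_count
    print(f"n={total:>5}: uniform -> {means[0]:.4f}, "
          f"skeptical -> {means[1]:.4f}, gap = {gap:.4f}")
```

With ten emails the two analysts disagree noticeably; with ten thousand their estimates are practically identical. The prior still matters when data is scarce, which is exactly when you should be arguing about it.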

Let’s Code a Classic Example: The Drug Test

Imagine a highly accurate drug test: it’s 99% sensitive (it correctly flags a user 99% of the time) and 99% specific (it correctly clears a non-user 99% of the time). If someone tests positive, what’s the probability they actually use drugs? Your gut might say 99%. Your gut, like everyone’s before learning Bayes, is terrible at probability.

Let’s say only 0.5% of the population actually uses the drug. Our hypothesis H is “is a user”. Our evidence E is “tests positive”.

# Define our probabilities
p_user = 0.005           # P(H): prior probability of being a user
p_not_user = 1 - p_user  # P(¬H): prior probability of not being a user

# Sensitivity: P(E|H) - Probability of testing positive GIVEN they are a user
p_pos_given_user = 0.99

# Specificity: P(¬E|¬H) - Probability of testing negative GIVEN they are not a user
p_neg_given_not_user = 0.99
# Therefore, the probability of a false positive is 1 - specificity
p_pos_given_not_user = 1 - p_neg_given_not_user

# Total probability of testing positive P(E)
# This is P(E|H)*P(H) + P(E|¬H)*P(¬H)
p_positive = (p_pos_given_user * p_user) + (p_pos_given_not_user * p_not_user)

# Now apply Bayes' Theorem: P(H|E) = [P(E|H) * P(H)] / P(E)
p_user_given_positive = (p_pos_given_user * p_user) / p_positive

print(f"The probability that someone who tests positive is actually a user is: {p_user_given_positive:.3f}")
print(f"That's only {p_user_given_positive * 100:.1f}%!")

Run this code. I’ll wait. The result is a shockingly low ~33% (about 33.2%, to be precise). Why? Because with a prior of P(H) = 0.005, the 1% of false positives from the vast non-using population outnumber the true positives from the tiny using population roughly two to one. This is the exact kind of counter-intuitive result Bayes’ Theorem exists to uncover. Ignoring the prior like this has a name, the base rate fallacy, and it leads to disastrously wrong conclusions.
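If the formula still feels like a trick, counting heads usually settles it. The sketch below reruns the drug-test numbers as whole people rather than probabilities. The population of 100,000 is an arbitrary choice made so the counts come out as integers.

```python
# ASSUMPTION: an arbitrary population size chosen to give whole-number counts.
population = 100_000

# 0.5% of people are users (integer arithmetic avoids float rounding).
users = population * 5 // 1000
non_users = population - users

# 99% sensitivity: 99 out of every 100 users test positive.
true_positives = users * 99 // 100

# 99% specificity: 1 out of every 100 non-users tests positive anyway.
false_positives = non_users * 1 // 100

all_positives = true_positives + false_positives
share_real_users = true_positives / all_positives

print(f"Users: {users}, non-users: {non_users}")
print(f"True positives: {true_positives}, false positives: {false_positives}")
print(f"Share of positives who are users: {share_real_users:.3f}")
```

Out of 100,000 people, 495 users and 995 non-users test positive. Pick a positive result at random and it belongs to a real user only 495 times out of 1,490, which is the same answer the theorem gave.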

The Naive Bayes “Gotcha”

You’ll see Bayes used everywhere in machine learning, most famously in the Naive Bayes classifier. It’s “naive” because it makes a heroic (and almost always wrong) assumption: that all features (pieces of evidence) are conditionally independent given the class.

For spam, it assumes the words “Viagra” and “prescription” appear independently of each other in spam emails, which is blatantly false. This is the rough edge. So why do we use it? Because this simplification turns an impossibly complex calculation into a trivial one, and it often works surprisingly well for classification tasks. It’s a classic engineering trade-off: a wrong model that gives a roughly right answer quickly is more useful than a perfect model that’s too computationally expensive to run. Just be aware of its naivety: correlated words get counted as if each were independent evidence, so the probabilities it reports tend to be overconfident even when the predicted class is right.
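Here is a minimal sketch of what the independence assumption buys you. Every number in the word table is hypothetical, and `naive_bayes_posterior` is an illustrative helper, not a function from any library. A real classifier would estimate the table from labelled emails and smooth it so unseen words do not produce a zero.

```python
import math

# Prior from the text: 30% of emails are spam.
prior = {"spam": 0.3, "ham": 0.7}

# ASSUMPTION: hypothetical per-word likelihoods P(word | class).
p_word_given_class = {
    "spam": {"viagra": 0.20, "prescription": 0.10, "meeting": 0.01},
    "ham": {"viagra": 0.001, "prescription": 0.01, "meeting": 0.10},
}


def naive_bayes_posterior(words):
    """Return P(class | words), treating the words as independent given the class."""
    log_scores = {}
    for label, p_label in prior.items():
        # The naive step: the joint likelihood is just a product of
        # per-word likelihoods, computed as a sum of logs for stability.
        log_score = math.log(p_label)
        for word in words:
            log_score += math.log(p_word_given_class[label][word])
        log_scores[label] = log_score

    # Normalize: subtract the max before exponentiating to avoid underflow.
    max_log = max(log_scores.values())
    unnormalized = {label: math.exp(s - max_log) for label, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {label: value / total for label, value in unnormalized.items()}


result = naive_bayes_posterior(["viagra", "prescription"])
print({label: round(p, 4) for label, p in result.items()})
```

Notice that the model multiplies in “prescription” as a full, fresh piece of evidence even though it tends to travel with “Viagra”. That double counting is where the overconfidence comes from, and it is also why the whole thing needs only one small table per class instead of a joint distribution over every combination of words.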