5.5 Logistic Regression: The Sigmoid Function and Binary Classification

Right, so linear regression was a neat party trick for predicting things like house prices or how many cups of coffee I’ll need to get through this chapter. But you and I both live in the real world, and the real world is full of questions that linear regression is hilariously bad at answering. What’s the probability this email is spam? Will this customer churn? Is that a picture of a cat or a very fluffy loaf of bread?

These are classification problems. We’re not predicting a continuous value anymore; we’re predicting the probability that something belongs to a class, and a probability is always between 0 and 1. If you try to shove a yes/no question through our old y = mx + b formula, it’ll gleefully spit out numbers like -4.2 or 17.3, which are, to put it technically, absolute nonsense as probabilities.

We need a way to take that unbounded linear output and squish it, gracefully and mathematically, into that nice, interpretable 0-to-1 range. Enter the star of our show: the sigmoid function.

The Magic S-Curve: What the Sigmoid Actually Does

The sigmoid function, also called the logistic function, is our mathematical hammer that smashes any number into the space between 0 and 1. It’s defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where z is the output from our linear model (z = mx + b or, in multi-dimensional terms, z = β₀ + β₁x₁ + ... + βₙxₙ).

Let’s break down why this is so brilliant. Imagine z is our linear score. If z is a large positive number (say, 5), e^{-5} becomes a tiny number (~0.0067). So, σ(5) ≈ 1 / (1 + 0.0067) ≈ 0.993. That’s a very high probability. Conversely, if z is a large negative number (say, -5), e^{5} is a large number (~148.4), so σ(-5) ≈ 1 / (1 + 148.4) ≈ 0.0067. A very low probability. If z is 0, e^{0} is 1, so σ(0) = 1 / (1+1) = 0.5. Dead even.
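
The arithmetic above is easy to check numerically. Here’s a minimal sketch; the intercept `beta0`, the weights `beta`, and the sample `x` are made-up values for illustration, not anything fitted to real data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one sample with two features (illustrative only)
beta0 = -1.0
beta = np.array([2.0, 0.5])
x = np.array([1.5, 2.0])

# Linear score: z = beta0 + beta1*x1 + beta2*x2 = -1 + 3 + 1
z = beta0 + beta @ x

print(f"z = {z:.1f}, sigmoid(z) = {sigmoid(z):.4f}")

# The three cases walked through in the text
for value in (5.0, -5.0, 0.0):
    print(f"sigmoid({value}) = {sigmoid(value):.4f}")
```

Whatever score the linear part produces, the output of `sigmoid` stays strictly between 0 and 1.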

The sigmoid gives us a smooth, S-shaped curve that is not only interpretable but also differentiable everywhere, which is crucial for gradient-based optimization (gradient descent and its cousins) to work its magic. Let’s see it in action.

import numpy as np
import matplotlib.pyplot as plt

# Define the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Generate a range of values from -10 to 10
z = np.linspace(-10, 10, 100)
# Apply the sigmoid function
sigma_z = sigmoid(z)

# Plot it
plt.figure(figsize=(9, 6))
plt.plot(z, sigma_z)
plt.axhline(y=0.5, color='r', linestyle='--', label='Default threshold (0.5)')
plt.xlabel('Linear Output (z)')
plt.ylabel('Sigmoid Output σ(z)')
plt.title('The S-Shaped Sigmoid Function')
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.show()

Run that code. See how the curve flattens out toward 0 and 1 at the extremes but is nice and soft in the middle? That softness is what allows the model to be updated with nuance during training.
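
That softness can be made precise. Differentiating the definition of σ(z) (a standard calculus result, added here as a supplement) gives a slope that depends only on the sigmoid’s own output:

$$\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)$$

The slope peaks at z = 0, where σ'(0) = 0.5 × 0.5 = 0.25, and shrinks toward zero at the extremes: σ'(5) ≈ 0.993 × 0.0067 ≈ 0.0066. So a gradient-based update moves the model most when it is uncertain and barely at all when it is already confident.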

From Probability to Decision: The Decision Boundary

Okay, so the sigmoid gives us a probability, P(class = 1). But we live in a binary world; we need a final answer. This is where you, the human, come in. You have to choose a threshold, typically 0.5.

If σ(z) >= 0.5, we predict class 1.
If σ(z) < 0.5, we predict class 0.

Why 0.5? Because it’s the intuitive, fair midpoint. But it’s not a law. Let’s say you’re building a model to detect a rare but fatal disease. A false negative (missing the disease) is much worse than a false positive (causing some worry but then ruling it out with further tests). In that case, you might lower the threshold to, say, 0.3. Anything with a 30%+ probability gets flagged for review. The choice of threshold is a business decision, not just a technical one, dictated by the cost of different types of errors.
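
To make the threshold choice concrete, here’s a small sketch. The probabilities are invented for illustration, and `classify` is a hypothetical helper, not a library function.

```python
import numpy as np

# Hypothetical predicted probabilities P(class = 1) for six patients (made-up numbers)
probs = np.array([0.05, 0.28, 0.35, 0.48, 0.62, 0.91])

def classify(probabilities, threshold=0.5):
    """Return 1 where the probability meets the threshold, else 0."""
    return (probabilities >= threshold).astype(int)

default_preds = classify(probs)         # the usual 0.5 cutoff
cautious_preds = classify(probs, 0.3)   # lower cutoff flags more cases for review

print("threshold 0.5:", default_preds)
print("threshold 0.3:", cautious_preds)
```

Lowering the threshold from 0.5 to 0.3 flags two extra patients (the 0.35 and 0.48 cases). You trade more false alarms for fewer misses, which is exactly the trade you want when a miss is the expensive error.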

Implementing It: The Code Reality

Let’s be real, you’re rarely going to code this from scratch for a real project. You’ll use scikit-learn. Still, knowing what’s happening under the hood makes the library’s output much easier to interpret. Here’s a classic example using the Wisconsin Breast Cancer dataset.

import matplotlib.pyplot as plt  # needed for plt.show() at the end of this block
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

# Load the data - a classic binary classification problem
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# This is CRUCIAL: StandardScaler. scikit-learn fits logistic regression with an
# iterative solver and applies regularization by default. If your features are on
# wildly different scales, the solver converges slowly and the penalty hits them unevenly.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # Note: transform, NOT fit_transform on the test set!

# Create and train the model. Notice 'liblinear' is a good solver for smaller datasets.
# The 'C' parameter is inverse regularization strength. More on that later.
model = LogisticRegression(solver='liblinear', random_state=42)
model.fit(X_train_scaled, y_train)

# Make predictions (this returns the class 0 or 1)
y_pred = model.predict(X_test_scaled)

# But you can also get the probabilities themselves! This is often more useful.
y_pred_proba = model.predict_proba(X_test_scaled)
print("Probabilities for the first few test samples:")
print(y_pred_proba[:5]) # Columns are [P(class=0), P(class=1)]; here 0 = malignant, 1 = benign

# Evaluate
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.3f}")

# The Confusion Matrix is your best friend. Accuracy is a liar on imbalanced datasets.
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=data.target_names)
disp.plot(cmap='Blues')
plt.show()
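
To connect the library back to the sigmoid, the sketch below checks that, for a binary problem, scikit-learn’s probabilities are the sigmoid of the linear score. It refits a model so the block is self-contained, and it uses the default solver with a raised `max_iter` rather than `liblinear`; that swap is a choice made for this illustration, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# decision_function returns the raw linear score z for each sample
z = model.decision_function(X_test_scaled)

# The same score computed by hand: z = x . coefficients + intercept
z_manual = X_test_scaled @ model.coef_.ravel() + model.intercept_[0]

# Squash the scores through the sigmoid ourselves
p_manual = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba is P(class = 1)
p_sklearn = model.predict_proba(X_test_scaled)[:, 1]

print("scores match:", np.allclose(z, z_manual))
print("probabilities match:", np.allclose(p_manual, p_sklearn))
```

If both checks print `True`, there is nothing mysterious left in `predict_proba`: it is a linear model followed by the S-curve from earlier in this section.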

The Pitfalls and “Wait, What?” Moments

The Scaling Non-Negotiable: I can’t stress this enough. If you don’t scale your features (StandardScaler is your friend), the solver can take far longer to converge, or give up at its iteration limit with a ConvergenceWarning, and the regularization penalty will hit features unevenly depending on their units. The optimization process needs all features on a level playing field.
It’s Still Linear: Don’t let the fancy S-curve fool you. Logistic regression is a linear classifier. It finds a single straight line (or a hyperplane in higher dimensions) to separate the classes. If the true boundary between your classes is curved, its performance will plateau hard unless you hand it engineered features (interactions, polynomial terms) that straighten things out. This is its greatest strength (interpretability) and its biggest weakness.
The Mysterious C Parameter: You’ll see C=1.0 as the default. C is the inverse of regularization strength. C down means regularization up. A smaller C tells the model to trust the data less, resulting in smaller coefficients. A very large C can lead to overfitting. Tune this hyperparameter. It matters.
Multicollinearity is a Party Crasher: Just like in linear regression, if your features are highly correlated, it makes the coefficients unstable and harder to interpret. The model’s overall predictive power might be fine, but don’t try to read too much into the importance of any one individual coefficient if they’re all chatting with each other behind your back.
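
The effect of C described above can be seen directly by refitting the same model at a few settings and measuring the size of the coefficient vector. This is a minimal sketch: the three C values are arbitrary picks, the default L2 penalty is assumed, and the exact numbers will vary with your scikit-learn version. The trend is the point.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling all rows at once is acceptable here only because we inspect
# coefficients and never evaluate on held-out data
X_scaled = StandardScaler().fit_transform(X)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=10_000)
    model.fit(X_scaled, y)
    # Euclidean length of the coefficient vector
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C = {C:>6}: ||coefficients|| = {coef_norms[C]:.2f}")
```

Smaller C (stronger regularization) should give a shorter coefficient vector, and larger C a longer one. In practice you would pick C by cross-validation rather than by eyeballing coefficient sizes.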

So there you have it. Logistic regression: the elegant, sensible, and surprisingly robust workhorse of classification. It’s often the first thing you should try, not because it’s the simplest, but because it’s often good enough and its results are beautifully interpretable. You can see why it made a decision, which in the real world is often more valuable than a fraction of a percent of accuracy from a black-box model.