79.4 Classification: Logistic Regression, Random Forest, SVM
Right, so you want to classify things. You have data, you have categories, and you want to teach a machine to sort the former into the latter. It’s the digital equivalent of training a very smart, very fast dog to herd sheep, only with less fluff and more math. We’re going to look at three of the most trusty workhorses for this job: the deceptively simple Logistic Regression, the robust and democratic Random Forest, and the geometrically elegant Support Vector Machine. Each has its own superpower and its own tragic flaw. Let’s get into it.
Logistic Regression: It’s Not Actually Regression
First, the name is a lie, or at least misleading. It’s used as a classification algorithm, through and through. The “regression” part survives because, like its cousin Linear Regression, it fits a linear function of the features; here that function models the log-odds of the positive class rather than the target itself. Instead of spitting out a continuous number, it feeds that linear output into the logistic sigmoid function, which squishes everything into a nice, interpretable probability between 0 and 1.
Think of it like this: you’re trying to decide if you want a second coffee. Your brain weighs a linear combination of factors: how much sleep you got (x1), how boring your current task is (x2), the current time (x3). Logistic Regression takes that same internal calculus (coefficient * sleep + coefficient * boredom + ...) and runs it through a function that gives you the probability of you saying “yes, more coffee.”
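To make the coffee calculus concrete, here is a minimal sketch of the "linear score, then sigmoid" step. The weights, intercept, and feature values are made up for illustration; nothing here is fitted to real data.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical, hand-picked parameters (assumptions for illustration only).
intercept = -1.0
weights = np.array([-0.8, 0.6, 0.1])  # sleep hours, boredom (0-10), hours since 9am

# One made-up observation: 5 hours of sleep, boredom level 7, 2 hours into the day.
features = np.array([5.0, 7.0, 2.0])

linear_score = intercept + weights @ features  # the "linear" part
probability = sigmoid(linear_score)            # squashed into (0, 1)

print(f"Linear score: {linear_score:.2f}")
print(f"P(more coffee): {probability:.3f}")
```

A linear score of exactly 0 maps to a probability of 0.5, which is why the sign of the score decides the predicted class at the default threshold.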
Here’s the simplest example. Let’s say we’re classifying irises (the flower, not the part of your eye).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load the classic iris dataset
X, y = load_iris(return_X_y=True)
# Let's make it a simple binary problem: is it a setosa or not?
# (because setosa is easily separable from the others)
y_binary = (y == 0).astype(int)
# Split the data - always do this.
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)
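# Create and train the model. C is the inverse of regularization strength,
# so a very high C means almost no regularization: we trust the data a lot.
# max_iter is raised because weakly regularized fits on separable data can need more iterations.
model = LogisticRegression(C=1e5, max_iter=1000, random_state=42)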
model.fit(X_train, y_train)
# See how it did
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Training accuracy: {train_score:.3f}")
print(f"Testing accuracy: {test_score:.3f}")
# And here's the magic - the predicted probability for the first test sample
proba = model.predict_proba(X_test[:1])
print(f"Probabilities for class 0 and class 1: {proba}")
Why it’s great: It’s fast, it’s highly interpretable (you can inspect the coefficients to see which features matter and in what direction), and its probabilities are usually reasonably well calibrated out of the box.
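As a rough sketch of what "inspect the coefficients" looks like in practice, the snippet below refits the setosa-vs-rest problem on standardized features so the coefficient magnitudes are comparable. The standardization step and the default regularization are choices made for this illustration, not something the example above requires.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, (iris.target == 0).astype(int)  # setosa vs. the rest

# Standardizing first puts all coefficients on a comparable scale.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X, y)

# make_pipeline names each step after its lowercased class name.
coefficients = pipeline.named_steps["logisticregression"].coef_[0]

# Print features from most to least influential (by absolute coefficient).
ranked = sorted(zip(iris.feature_names, coefficients), key=lambda p: abs(p[1]), reverse=True)
for name, coef in ranked:
    direction = "raises" if coef > 0 else "lowers"
    print(f"{name:<20} {coef:+.3f}  ({direction} the log-odds of setosa)")
```

Each coefficient is the change in log-odds per one standard deviation of that feature, holding the others fixed.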
The catch: It’s a linear model. Its decision boundary is a straight line (or a hyperplane in higher dimensions). If the true boundary between your classes is curved, it will struggle without help from feature engineering such as polynomial or interaction terms. It doesn’t require independent features, but strongly correlated ones make the coefficients unstable and hard to interpret. That’s almost always the situation in the real world, yet the predictions often hold up anyway because machine learning is just weird like that.
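Here is a small sketch of that limitation and one way around it. It uses synthetic ring-shaped data from make_circles (an assumption chosen to make the point obvious) and compares plain Logistic Regression with the same model fed degree-2 polynomial features.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: one class forms a ring around the other, so no straight line separates them.
X_circles, y_circles = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=42)

plain = LogisticRegression()
# Adding squared and interaction terms lets a linear model draw a circular boundary.
with_squares = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(),
)

plain_score = cross_val_score(plain, X_circles, y_circles, cv=5).mean()
poly_score = cross_val_score(with_squares, X_circles, y_circles, cv=5).mean()

print(f"Plain logistic regression:  {plain_score:.3f}")
print(f"With degree-2 features:     {poly_score:.3f}")
```

The model is still linear in its inputs; the inputs just got more expressive.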
Random Forest: The Wisdom of Crowds
If Logistic Regression is a brilliant but opinionated specialist, a Random Forest is a large committee of slightly dumb, easily distracted experts. Individually, they’re not that great. Together, they’re terrifyingly effective.
It works by building a bunch of decision trees (hence, a “forest”). But here’s the clever bit that stops it from being just “a bunch of identical trees that overfit in the same way”: each tree is trained on a bootstrap sample of the data (rows drawn at random with replacement), and at each split in the tree, it only considers a random subset of the features. This injects diversity. The final prediction is made by majority vote (for classification) or averaging (for regression).
from sklearn.ensemble import RandomForestClassifier
# Create the forest. n_estimators is the number of trees - more is almost always better until you hit diminishing returns.
# max_depth restricts how deep each tree can go, a key lever to prevent overfitting.
forest_model = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=42)
forest_model.fit(X_train, y_train)
print(f"Forest Test Accuracy: {forest_model.score(X_test, y_test):.3f}")
# Want to see how important each feature was? This is a killer feature.
import pandas as pd
feature_importances = pd.Series(forest_model.feature_importances_, index=load_iris().feature_names)
print(feature_importances.sort_values(ascending=False))
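The bootstrap-plus-random-features recipe described above can be sketched by hand with plain decision trees. This is a simplified illustration on the three-class iris data, not how scikit-learn implements its forest internally; the tree count and seeds are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rng = np.random.default_rng(42)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap: draw len(X_train) row indices *with replacement*.
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes every split consider only a random subset of features.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(0, 1_000_000))
    )
    trees.append(tree.fit(X_train[rows], y_train[rows]))

# Each tree votes; the most common class per test sample wins.
votes = np.stack([tree.predict(X_test) for tree in trees])  # shape: (n_trees, n_test)
majority = np.array([np.bincount(column, minlength=3).argmax() for column in votes.T])

accuracy = (majority == y_test).mean()
print(f"Hand-rolled forest test accuracy: {accuracy:.3f}")
```

One difference worth knowing: scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes, which usually gives the same answer but not always.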
Why it’s great: It’s famously robust. It works well out of the box on almost any problem with minimal tuning. It handles non-linear relationships like a champ and isn’t overly sensitive to messy data or features being on different scales. The feature importance output is pure gold for explaining what actually matters, with one caveat: the default impurity-based scores tend to favor features with many distinct values, so treat them as a guide rather than gospel.
The catch: You lose interpretability. A forest of 100 trees is not a simple equation you can write on a napkin. It can also be computationally expensive and memory-hungry with a huge number of trees. Most critically, it can still overfit if you let the trees grow too deep (max_depth is your best friend here) or if your data is very noisy.
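To see the depth lever in action, here is a rough sketch on synthetic noisy data. The dataset (make_classification with flip_y=0.2) and the depth values are assumptions picked to exaggerate the effect; your numbers will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data; flip_y=0.2 assigns a random class to about 20% of samples to simulate label noise.
X_noisy, y_noisy = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_noisy, y_noisy, test_size=0.3, random_state=42)

results = {}
for depth in (2, 5, None):  # None lets every tree grow until its leaves are pure
    forest = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    forest.fit(X_tr, y_tr)
    results[depth] = (forest.score(X_tr, y_tr), forest.score(X_te, y_te))
    train_acc, test_acc = results[depth]
    print(f"max_depth={depth!s:<5} train={train_acc:.3f}  test={test_acc:.3f}")
```

A large gap between training and test accuracy is the usual symptom of overfitting; whether the deeper forest actually tests worse depends on the data, so let a held-out set or cross-validation make the call.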
Support Vector Machines: Finding the Gap
SVMs are the geometry nerds of the classification world. Their core idea is beautiful: instead of just drawing any old line that separates the classes, find the best line. The “best” line is the one that creates the widest possible margin—the clearest no-man’s-land—between the classes. The data points that define the edge of this margin are called “support vectors,” and they’re the only ones that actually matter for defining the boundary; the rest could be ignored.
For non-linear data, SVMs use a neat piece of math called the “kernel trick”: the kernel function computes dot products as if the data had been mapped into a higher-dimensional space, where a linear separator is more likely to exist, without ever having to compute that computationally horrific transformation explicitly.
from sklearn.svm import SVC
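# The 'linear' kernel is for, well, linear problems. C is the regularization parameter (inverse strength).
# A high C penalizes misclassified points heavily, giving a narrower margin that hugs the training data.
# A low C tolerates more margin violations, giving a wider, often more generalizable margin.
# probability=True enables predict_proba, but it runs an internal cross-validation, so training is noticeably slower.
svm_linear_model = SVC(kernel='linear', C=1.0, random_state=42, probability=True)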
svm_linear_model.fit(X_train, y_train)
print(f"Linear SVM Test Accuracy: {svm_linear_model.score(X_test, y_test):.3f}")
# Let's try the famous RBF kernel for non-linear problems
svm_rbf_model = SVC(kernel='rbf', gamma='scale', C=1.0)  # random_state is only used when probability=True
svm_rbf_model.fit(X_train, y_train)
print(f"RBF Kernel SVM Test Accuracy: {svm_rbf_model.score(X_test, y_test):.3f}")
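The kernel trick mentioned earlier can be checked by hand for a small case. This sketch uses the homogeneous degree-2 polynomial kernel on 2-D vectors (chosen because its feature map is short enough to write out); the helper names explicit_map and poly_kernel are made up for this example.

```python
import numpy as np

def explicit_map(v):
    """Explicit degree-2 feature map for a 2-D vector: (v1^2, sqrt(2)*v1*v2, v2^2)."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly_kernel(a, b):
    """Homogeneous degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return (a @ b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

slow = explicit_map(x) @ explicit_map(z)  # map to 3-D first, then take the dot product
fast = poly_kernel(x, z)                  # same number, no mapping needed

print(f"Explicit mapping: {slow:.6f}")
print(f"Kernel shortcut:  {fast:.6f}")
```

Both routes give the same value. For the RBF kernel the implied feature space is infinite-dimensional, so the shortcut is the only practical route.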
Why it’s great: They are exceptionally powerful, especially in high-dimensional spaces (like text classification). The kernel trick makes them incredibly flexible. The trained model is also compact, since it only needs to remember the support vectors to make predictions.
The catch: They are a nightmare to scale to very large datasets. Training time for kernel SVMs grows faster than linearly with the number of samples, so it can balloon. They are also notoriously sensitive to feature scaling: you should standardize your features (e.g., with StandardScaler) before feeding them to an SVM, or the larger-scaled features will dominate the calculation. (The iris snippets above get away without it only because all four measurements are in centimeters and span similar ranges.) Tuning the C and gamma parameters is also critical and can feel like black magic.
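One common way to handle both the scaling and the tuning is to put StandardScaler and SVC in a pipeline and grid-search over it. The sketch below uses the breast cancer dataset (an assumption, chosen because its features live on very different scales) and a deliberately tiny, arbitrary parameter grid.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc
)

# Baseline: RBF SVM on raw, unscaled features.
unscaled_score = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

# Scaling lives inside the pipeline, so it is refit on each training fold
# and no information from the validation fold leaks in.
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
}
search = GridSearchCV(svm_pipeline, param_grid, cv=5)
search.fit(X_tr, y_tr)
tuned_score = search.score(X_te, y_te)

print(f"Unscaled, default settings: {unscaled_score:.3f}")
print(f"Scaled and tuned:           {tuned_score:.3f}")
print(f"Best parameters: {search.best_params_}")
```

The "svc__" prefix in the grid is how scikit-learn routes a parameter to a named pipeline step. A real search would normally cover a wider, log-spaced range of C and gamma.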
So, Which One Do You Use?
This is the real question, isn’t it? Here’s the brutally honest answer:
- Start with Logistic Regression as your baseline. It’s your null hypothesis. If you can beat a simple linear model, you’re actually adding value.
- Most of the time, at least on tabular data, just use Random Forest. It will probably give you a great result with minimal fuss. It’s the default for a reason.
- Use SVM if you have a moderately sized dataset, you’ve remembered to scale your features, and you have the time to tune hyperparameters meticulously. It can eke out that last 1% of performance, but it’s often not worth the effort compared to a well-tuned Random Forest.
The true answer, of course, is to try all three and let cross-validation decide. But now you know why you’re trying them, and you’re not just throwing algorithms at the wall to see what sticks. You’re welcome.
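A minimal sketch of "try all three and let cross-validation decide", using the full three-class iris problem. The hyperparameters are mostly defaults and the candidate names are just labels for this example, so treat the ranking it prints as specific to this toy dataset.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # all three species this time

# The linear model and the SVM get a scaler; the forest does not need one.
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name:<20} mean={scores.mean():.3f}  std={scores.std():.3f}")
```

When the means sit within a standard deviation of each other, the honest conclusion is a tie, and the simpler or faster model is usually the better pick.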