7.5 RBF, Polynomial, and Sigmoid Kernels

Alright, let’s get our hands dirty with the kernel bag of tricks. You’ve seen the linear kernel—solid, dependable, but about as exciting as a dial tone. It can’t handle the messy, non-linearly separable reality we actually live in. That’s where these three come in: the Radial Basis Function (RBF), the Polynomial, and the Sigmoid kernels. They’re your key to projecting your data into higher dimensions where a clean slice, a hyperplane, can finally be found. Think of it less like magic and more like very clever geometry.

The Workhorse: The RBF Kernel

This is, without a doubt, the kernel you’ll use most often. It’s your default, your go-to, the first thing you should try when a linear kernel fails miserably. The RBF kernel, also called the Gaussian kernel, has a beautifully simple formula:

$K(\mathbf{x}_1, \mathbf{x}_2) = \exp\left(-\gamma \|\mathbf{x}_1 - \mathbf{x}_2\|^2\right)$

Let’s break down why this is so brilliant. The term $\|\mathbf{x}_1 - \mathbf{x}_2\|^2$ is just the straight-line Euclidean distance between your two data points, squared. The kernel then says: “The similarity between two points is based on how close they are to each other.” Identical points have a similarity of exactly 1, nearby points score close to 1, and points that are far apart score close to 0.
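To make the formula concrete, here is a minimal sketch that computes the RBF similarity by hand with NumPy and cross-checks it against scikit-learn's `rbf_kernel`. The three points, the gamma value of 0.5, and the helper name `rbf_similarity` are all made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_similarity(x1, x2, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = np.sum((x1 - x2) ** 2)
    return np.exp(-gamma * sq_dist)

# Illustrative points (not from any dataset): an anchor, a near neighbor, a far one
anchor = np.array([0.0, 0.0])
near = np.array([0.1, 0.1])
far = np.array([3.0, 3.0])
gamma = 0.5  # arbitrary value chosen for the demo

sim_self = rbf_similarity(anchor, anchor, gamma)
sim_near = rbf_similarity(anchor, near, gamma)
sim_far = rbf_similarity(anchor, far, gamma)

print(f"anchor vs itself: {sim_self:.4f}")
print(f"anchor vs near:   {sim_near:.4f}")
print(f"anchor vs far:    {sim_far:.4f}")

# Cross-check the hand-rolled values against scikit-learn's implementation
sk_values = rbf_kernel(anchor.reshape(1, -1), np.vstack([near, far]), gamma=gamma)
matches_sklearn = np.allclose(sk_values, [[sim_near, sim_far]])
print("Matches sklearn:", matches_sklearn)
```

The near neighbor scores just under 1 and the far point scores practically 0, which is exactly the "closeness equals similarity" behavior described above.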

Now, the $\gamma$ (gamma) parameter is the knob you turn to control this. A small gamma means a large similarity radius, so points farther away still influence each other. This leads to a smoother, more generalized decision boundary. A large gamma means the radius of similarity is small; the model only cares about very close points, leading to a more complex, wiggly boundary that can overfit your training data spectacularly.
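One way to get a feel for the knob: solve $\exp(-\gamma d^2) = 0.5$ for $d$, which gives $d = \sqrt{\ln 2 / \gamma}$. The sketch below tabulates this "half-similarity distance" for a few gamma values. The values are arbitrary and `half_similarity_distance` is a helper name invented here, not a library function.

```python
import numpy as np

def half_similarity_distance(gamma):
    """Distance at which exp(-gamma * d**2) drops to 0.5."""
    return np.sqrt(np.log(2.0) / gamma)

gammas = [0.01, 0.1, 1.0, 10.0]
radii = [half_similarity_distance(g) for g in gammas]

for g, d in zip(gammas, radii):
    # Sanity check: plugging d back into the kernel should give 0.5
    k = np.exp(-g * d ** 2)
    print(f"gamma={g:<5} half-similarity distance={d:.3f}  kernel there={k:.2f}")
```

On standardized features, where most points sit within a few units of each other, gamma = 0.01 (a radius of roughly 8.3) makes nearly everything look similar, while gamma = 10 (roughly 0.26) makes a point similar only to its immediate neighbors.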

Getting the gamma and C (the regularization parameter) balance right is 90% of the battle with RBF SVMs.

from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Let's create a classic non-linear problem: moons
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Always scale your features for SVM! The kernel relies on distance calculations.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Let's try two extremes of gamma
svc_high_gamma = SVC(kernel='rbf', gamma=10, C=1)
svc_low_gamma = SVC(kernel='rbf', gamma=0.01, C=1)

svc_high_gamma.fit(X_train_scaled, y_train)
svc_low_gamma.fit(X_train_scaled, y_train)

print(f"High Gamma Train Score: {svc_high_gamma.score(X_train_scaled, y_train):.3f}")
print(f"High Gamma Test Score: {svc_high_gamma.score(X_test_scaled, y_test):.3f}")
print("---")
print(f"Low Gamma Train Score: {svc_low_gamma.score(X_train_scaled, y_train):.3f}")
print(f"Low Gamma Test Score: {svc_low_gamma.score(X_test_scaled, y_test):.3f}")

# You'll likely see high_gamma fit the training set more tightly than it generalizes
# (train score above test score), while low_gamma underfits (mediocre on both).

The Power-Hungry Polynomial Kernel

If the RBF kernel is a scalpel, the Polynomial kernel is a sledgehammer. It’s defined as:

$K(\mathbf{x}_1, \mathbf{x}_2) = (\gamma \cdot \mathbf{x}_1^\top \mathbf{x}_2 + r)^d$

Here $d$ is the degree of the polynomial, $\gamma$ is a scale factor (similar to RBF), and $r$ is a coefficient term that controls how much influence higher-degree terms have versus lower-degree terms.

This kernel gives you the effect of building every polynomial feature up to degree d and then fitting a linear SVM in that colossal, bloated feature space, but thanks to the kernel trick those features are never explicitly constructed; only the dot product above is computed. The result? It can model curves and interaction terms. The problem? It can be slow to train and is notoriously finicky: performance is highly sensitive to the choice of degree, gamma, and r, and high degrees tend to be numerically unstable. It can work wonders on problems where you know the underlying relationship is truly polynomial, but in practice, that’s rare. I use it about as often as I use a fax machine.

# Continuing from the previous scaled data
svc_poly = SVC(kernel='poly', degree=3, gamma='scale', coef0=1.0) # coef0 is 'r'
svc_poly.fit(X_train_scaled, y_train)

print(f"Polynomial Kernel Test Score: {svc_poly.score(X_test_scaled, y_test):.3f}")

# It'll probably work fine on the moons, but try changing 'degree' to 10 and watch it potentially lose its mind.
# Also, note the gamma='scale' option, which is sklearn's sensible default (1 / (n_features * X.var())).
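A small sketch of what "polynomial features up to degree d" means for the kernel. For degree 2 with gamma = 1 and r = 1 on 2-D inputs, the kernel value computed from a single dot product equals an ordinary dot product between six hand-written polynomial features. The feature map `phi_degree2` and the two sample points are illustrative choices, not part of any library.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

def phi_degree2(x):
    """Explicit feature map whose dot product equals (x . z + 1)^2 for 2-D input."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

# Two made-up points
x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])

# Kernel route: one dot product in the original 2-D space
kernel_value = (x @ z + 1.0) ** 2

# Explicit route: expand both points to 6-D, then take the dot product
explicit_value = phi_degree2(x) @ phi_degree2(z)

# scikit-learn's version of the same kernel
sk_value = polynomial_kernel(
    x.reshape(1, -1), z.reshape(1, -1), degree=2, gamma=1.0, coef0=1.0
)[0, 0]

print(f"kernel:   {kernel_value:.6f}")
print(f"explicit: {explicit_value:.6f}")
print(f"sklearn:  {sk_value:.6f}")
```

All three numbers agree. With more input features or a higher degree the expanded space grows very quickly, while the kernel route still costs one dot product and one power.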

The Quirky Relic: The Sigmoid Kernel

Here’s the odd one out. The Sigmoid kernel is defined as:

$K(\mathbf{x}_1, \mathbf{x}_2) = \tanh(\gamma \cdot \mathbf{x}_1^\top \mathbf{x}_2 + r)$

Yes, it looks like a neural network activation function. That’s because it is. This kernel originated from early attempts to make SVMs behave like neural nets. Here’s the honest truth: it’s rarely a good choice. Its behavior is erratic and it’s not even guaranteed to be a valid kernel (i.e., positive semi-definite) in all situations, which can cause the optimization to fail or behave poorly. I’m mentioning it so you know it exists, but my strong advice is to steer clear unless you have a very specific, documented reason to use it and you’re willing to spend hours tuning gamma and r (often called coef0 in libraries) to maybe, possibly, get it to work. Just use RBF.
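The "not a valid kernel" complaint is easy to check numerically. A valid kernel's Gram matrix never has a negative eigenvalue; the sketch below builds Gram matrices for two toy points and compares the sigmoid kernel with RBF. The points and the gamma = 1, coef0 = -1 setting are hand-picked to expose the problem, so treat this as a demonstration that it can happen, not a claim about every parameter setting.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, sigmoid_kernel

# Two illustrative 1-D points
X = np.array([[0.0], [1.0]])

# Gram matrices: entry (i, j) is the kernel value between points i and j
K_sigmoid = sigmoid_kernel(X, gamma=1.0, coef0=-1.0)
K_rbf = rbf_kernel(X, gamma=1.0)

# eigvalsh is meant for symmetric matrices, which Gram matrices are
min_eig_sigmoid = np.linalg.eigvalsh(K_sigmoid).min()
min_eig_rbf = np.linalg.eigvalsh(K_rbf).min()

print(f"Smallest eigenvalue, sigmoid: {min_eig_sigmoid:.3f}")
print(f"Smallest eigenvalue, RBF:     {min_eig_rbf:.3f}")
```

The sigmoid Gram matrix has a negative eigenvalue here (note that its diagonal entry tanh(-1) is already negative, meaning a point has negative "similarity" with itself), while the RBF one stays positive.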

Best Practices and Pitfalls

Scale Your Data: I said it before, I’ll scream it from the rooftops. Kernels like RBF and Polynomial are based on distances and dot products. If one feature has a range of 0-1000 and another has a range of 0-1, the first feature will completely dominate the calculation. Use StandardScaler or MinMaxScaler. Always.
Tune Hyperparameters Systematically: Don’t guess. Use GridSearchCV or RandomizedSearchCV to find the best C and gamma (and degree if you’re brave enough for Polynomial). Your model’s performance is almost entirely dependent on this.
Beware of Overfitting (The RBF Special): A model with a very high gamma and a very high C will create an incredibly complex boundary that perfectly separates every training point, including the noise. It will be useless on new data. Your validation score is your truth detector.
Interpretability Goes Out the Window: The moment you use any of these kernels, you can no longer look at the coefficients of the SVM and explain the model in the original feature space. The “support vectors” are your model now. You’re trading interpretability for power.
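The scaling and tuning advice above can be sketched in a few lines. This is a minimal example under stated assumptions: it reuses the moons data from earlier, and the log-spaced grid values for C and gamma are illustrative starting points, not recommendations for your data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scaling lives inside the pipeline, so each cross-validation fold is scaled
# using statistics from its own training split only
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Illustrative log-spaced grid; sensible ranges depend on the dataset
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.01, 0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

test_score = search.score(X_test, y_test)
print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {test_score:.3f}")
```

Putting the scaler in the pipeline is a deliberate choice: scaling the full training set before cross-validation would leak information from each validation fold into the scaler. The held-out test score is only looked at once, after the search has picked its parameters.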