15.1 Activation Functions: Sigmoid, Tanh, ReLU, Leaky ReLU, GELU, Swish

Let’s be honest: your neurons are just doing a weighted sum. That’s linear. And if your entire network is just a bunch of linear operations stacked together, guess what? It’s still just one big linear operation. That’s spectacularly useless for learning anything interesting, like the difference between a cat and a dog, or a good and a bad decision. We need to introduce non-linearity, a way to bend the data. That’s the job of the activation function. It’s the decision-maker, the gatekeeper, the source of all our network’s actual intelligence. And some of these gatekeepers are… well, let’s just say they’ve had better career choices than others.
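The collapse claim is easy to check numerically. What follows is a minimal sketch, not anything from a real model: the shapes, the seed, and the names `W1`, `b1`, `W2`, `b2` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical "layers" with no activation: weights and biases only.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through both layers, one after the other.
stacked = W2 @ (W1 @ x + b1) + b2

# The same map written as a single layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
single = W_combined @ x + b_combined

print(np.allclose(stacked, single))  # True

# Put a non-linearity between the layers and, in general, no single
# (W, b) pair reproduces the result for every input.
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(with_relu, single)
```

However many layers you stack, the same trick of multiplying the weight matrices together folds them into one.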

The Old Guard: Sigmoid & Tanh

The sigmoid function, σ(x) = 1 / (1 + e^{-x}), was the darling of the early days. It squashes any input into a nice, predictable range between 0 and 1. This was fantastic for interpretation: it gave us a probabilistic output. Perfect for a binary classification output layer, right? Well, yes. As a hidden-layer activation, though, it’s a nightmare.

Its two big flaws are the vanishing gradient problem and, to a lesser extent, computational cost. Look at its curve. The ends are flat. When your input is a very large positive or negative number, the gradient (the derivative) approaches zero, and even at its peak (x = 0) it is only 0.25. During backpropagation, when you’re chain-ruling your way back to the early layers, these small factors get multiplied together, one per layer, and the product shrinks toward zero, effectively stopping learning in the early layers dead in its tracks. The network just gives up. Also, it involves an exponentiation, which isn’t cheap.
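To put numbers on that, here is a small sketch using the identity σ'(x) = σ(x)(1 − σ(x)). It deliberately ignores the weight matrices, which also appear in the real backpropagated gradient, and only tracks the factor the activation itself contributes. The helper name `sigmoid_grad` is ours.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s * (1 - s)

# The derivative peaks at x = 0 and collapses as the input moves away from it.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   sigmoid'(x) = {sigmoid_grad(x):.6f}")

# Best case: every layer sits exactly at x = 0, so each contributes 0.25.
for depth in [1, 5, 10, 20]:
    print(f"{depth:2d} layers -> activation factor at most {0.25 ** depth:.2e}")
```

Even in the best case, ten sigmoid layers scale the gradient by less than one in a million before the weights get a say.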

Then there’s its cooler, more centered cousin, the Hyperbolic Tangent (tanh): tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x}). It squishes values to between -1 and 1. This is nice because its outputs are centered around zero, which often helps the next layer learn more efficiently. But guess what? It also has saturating regions and suffers from the same vanishing gradient problem. Its derivative peaks at 1 rather than 0.25, so the damage is less severe near zero, but the tails are just as flat. We keep it around mostly inside recurrent cells such as LSTMs and GRUs, and for outputs that need to land in [-1, 1], but for hidden layers in deep feed-forward networks, it has largely been shown the door.

import numpy as np
import matplotlib.pyplot as plt

# Define the old guard
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

x = np.linspace(-5, 5, 100)
plt.plot(x, sigmoid(x), label='Sigmoid')
plt.plot(x, tanh(x), label='Tanh')
plt.title("The Saturating Old Guard")
plt.grid(True)
plt.legend()
plt.show()
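The family resemblance in that plot is not a coincidence: tanh is a rescaled, recentered sigmoid. The sketch below checks the identity tanh(x) = 2σ(2x) − 1 numerically and compares the peak derivatives. The grid and variable names are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 101)

# tanh(x) = 2 * sigmoid(2x) - 1, up to floating-point error.
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True

# Closed-form derivatives of both functions on the same grid.
sigmoid_grad = sigmoid(x) * (1 - sigmoid(x))
tanh_grad = 1 - np.tanh(x) ** 2

print(sigmoid_grad.max())  # peak of sigmoid', reached at x = 0
print(tanh_grad.max())     # peak of tanh', reached at x = 0

# Both derivatives are close to zero in the tails.
print(sigmoid_grad[0], tanh_grad[0])
```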

The Modern Default: ReLU & Its Slightly Less Lazy Sibling

Then came the Rectified Linear Unit (ReLU). A moment of silence for its beautiful, arrogant simplicity: f(x) = max(0, x).

It’s computationally dirt cheap: it’s just a max() operation. For positive inputs, the gradient is a constant 1, so there is no saturation on the positive side and gradients pass through active units undiminished. This simple change was a primary driver behind the ability to train much deeper networks. It works shockingly well in practice.

But ReLU has a rather embarrassing flaw, poetically named the “Dying ReLU” problem. For any input less than zero, the output is zero, and crucially, the gradient is zero. If a neuron’s weights get updated in such a way that its pre-activation (the weighted sum going into the ReLU) is negative for every training example, it effectively turns off. Its incoming weights will never update again because no gradient flows through it. It’s not dead dead, but it’s in a permanent vegetative state, contributing nothing to the network’s learning.
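Here is a contrived sketch of a dead neuron. The setup is hypothetical: one ReLU unit, non-negative inputs (think pixel intensities), and weights that a bad update has pushed all negative. The names `X`, `w`, and `b` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 1000 examples, 3 non-negative features in [0, 1).
X = rng.uniform(0, 1, size=(1000, 3))

# Assumed state after an unlucky update: every weight and the bias are negative.
w = np.array([-1.0, -2.0, -0.5])
b = -0.1

z = X @ w + b                       # pre-activations, all negative here
a = np.maximum(0, z)                # ReLU output
relu_grad = (z > 0).astype(float)   # dReLU/dz: 1 where z > 0, else 0

# Gradient of the output with respect to each weight, averaged over the batch.
grad_w = (relu_grad[:, None] * X).mean(axis=0)

print((z > 0).sum())  # 0
print(grad_w)         # [0. 0. 0.]
```

With a gradient of exactly zero on every example, no amount of further training moves `w` or `b`, which is what makes the state permanent.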

Enter Leaky ReLU, the fix for this existential crisis: f(x) = max(αx, x), with 0 < α < 1. Instead of a hard zero for negative inputs, it gives a small, non-zero slope (α is a small constant, like 0.01). This means even when the input is negative, there’s a small gradient, allowing the neuron to potentially recover. It’s a direct admission that the original ReLU’s design was a bit too extreme.

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

x = np.linspace(-5, 5, 100)
plt.plot(x, relu(x), label='ReLU')
plt.plot(x, leaky_relu(x), label='Leaky ReLU (α=0.01)')
plt.title("ReLU: The Workhorse & Its Fix")
plt.grid(True)
plt.legend()
plt.show()
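On a plot with α = 0.01 the two curves are hard to tell apart, so it helps to look at the gradients instead. A small sketch, with helper names of our own choosing and the derivative at exactly x = 0 set by convention to the negative-side value:

```python
import numpy as np

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise.
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    # 1 for positive inputs, alpha otherwise.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu_grad(x))        # zero on the negative side
print(leaky_relu_grad(x))  # alpha on the negative side, never zero

# For 0 < alpha < 1, max(alpha * x, x) and the np.where form agree.
alpha = 0.01
same = np.allclose(np.maximum(alpha * x, x), np.where(x > 0, x, alpha * x))
print(same)  # True
```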

The New Contenders: GELU & Swish

The neural network community got bored of piecewise linear functions and decided to bring the smooth, probabilistic curves back, but this time without saturating on the positive side. The key idea: instead of a hard on/off gate at zero, multiply the input by a smooth gate between 0 and 1 whose value depends on the input itself.

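The Gaussian Error Linear Unit (GELU) is f(x) = x * Φ(x), where Φ(x) is the cumulative distribution function of the standard Gaussian distribution. Think of it as weighting the input by the probability that a standard normal variable falls below it: large positive inputs pass through almost unchanged, large negative ones are squashed toward zero. It’s smooth, differentiable everywhere (no kink at zero), and performs brilliantly, especially in Transformer models (hello, BERT, GPT). It’s non-convex and non-monotonic (it dips slightly below zero for small negative inputs), and otherwise looks like a smoothed-out ReLU. The downside? It’s computationally more expensive because of the erf calculation, which is why a tanh-based approximation is often used instead.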
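For reference, here is a sketch of the exact form next to the widely used tanh approximation, relying on Φ(x) = 0.5 · (1 + erf(x / √2)). It uses `math.erf` from the standard library, vectorized with `np.vectorize` for convenience rather than speed. The function names are ours.

```python
import math
import numpy as np

def gelu_exact(x):
    # x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1 + erf(x / math.sqrt(2)))

def gelu_tanh(x):
    # The common tanh-based approximation.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4, 4, 801)
exact = gelu_exact(x)

# Largest disagreement between the two forms on this grid.
max_err = np.abs(exact - gelu_tanh(x)).max()
print(max_err)

# Location and depth of the dip below zero on the negative side.
i = exact.argmin()
print(x[i], exact[i])
```

The dip is the non-monotonic part: the function falls to its minimum at a small negative input and then climbs back toward zero as x goes further negative.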

Then Google Brain dropped Swish: f(x) = x * σ(βx), which with β = 1 is simply x * sigmoid(x) (also known as SiLU). It is suspiciously similar to GELU, which is itself well approximated by x * σ(1.702x). It’s also smooth and, like GELU, was reported to outperform ReLU on deeper models. The smoothness is often credited with a small benefit during optimization. The joke is that the researchers found it through automated search, proving that sometimes we can just brute-force our way to good ideas.

# Note: this is the common tanh approximation of GELU, not the exact erf form
def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    return x * sigmoid(x)

x = np.linspace(-4, 4, 100)
plt.plot(x, relu(x), label='ReLU', alpha=0.5)
plt.plot(x, gelu(x), label='GELU (approx)')
plt.plot(x, swish(x), label='Swish')
plt.title("The Smooth Operators: GELU & Swish")
plt.grid(True)
plt.legend()
plt.show()
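How similar are the two, really? The sketch below measures the gap between Swish and exact GELU on a grid, once with β = 1 and once with β = 1.702, the constant from the sigmoid approximation of GELU. The `beta` parameter and the helper names are illustrative choices, not a fixed API.

```python
import math
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); beta = 1 gives the plain Swish / SiLU form.
    return x * sigmoid(beta * x)

def gelu_exact(x):
    # x * Phi(x), with Phi written in terms of erf.
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1 + erf(x / math.sqrt(2)))

x = np.linspace(-4, 4, 801)

# Largest absolute difference from exact GELU on this grid.
gap_beta_1 = np.abs(swish(x) - gelu_exact(x)).max()
gap_beta_1702 = np.abs(swish(x, beta=1.702) - gelu_exact(x)).max()

print(gap_beta_1, gap_beta_1702)
```

With the default β the curves have the same shape but Swish is visibly softer; sharpening the sigmoid closes most of the gap.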

So, which one should you use? Here’s the brutal truth: start with ReLU. It’s the default for a reason. It’s simple, fast, and gets you most of the way there. If you suspect dead neurons (e.g., a large share of units output zero for every input, not just for some), switch to Leaky ReLU. If you’re training a state-of-the-art transformer or feel like being fancy and can spare the compute, go with GELU. The choice is rarely about finding a single “best” function and almost always about the specific architecture and dataset you’re wrestling with. Now go make some non-linear decisions.
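One rough way to check for dead neurons is to look at a layer’s post-ReLU activations over a batch and count the units that never fire. The sketch below uses synthetic data; `dead_unit_fraction` is a hypothetical helper, and in practice you would feed it activations captured from your own model.

```python
import numpy as np

def dead_unit_fraction(activations):
    """Fraction of units whose output is zero for every example.

    activations: array of shape (n_examples, n_units), taken after the ReLU.
    """
    dead = np.all(activations == 0, axis=0)
    return dead.mean()

rng = np.random.default_rng(0)

# Synthetic pre-activations: 256 examples, 8 units.
z = rng.normal(size=(256, 8))
z[:, :2] -= 100.0          # push two units so far negative they never fire
acts = np.maximum(0, z)

print(dead_unit_fraction(acts))  # 0.25

# Overall share of zero entries, for comparison with the per-unit measure.
zero_share = (acts == 0).mean()
print(zero_share)
```

A healthy ReLU layer produces plenty of zeros on any given input, so the overall share of zeros says little by itself. The warning sign is a unit that is zero on all of them.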