6.1 Decision Trees: Splitting Criteria (Gini, Entropy, MSE)

Alright, let’s talk about how a decision tree decides where to make its cuts. This isn’t arbitrary; it’s not like the tree is just throwing darts at your dataset’s features. It’s methodical, and it uses a mathematical scoring system to find the single best question to ask at each point to most effectively separate your data. We call this the splitting criterion. It’s the algorithm’s way of quantifying how “mixed up” or “impure” a node is. Our goal is to find the split that creates the purest possible child nodes.

Think of it this way: you’re at a party with a mix of developers and designers. A good question would be, “Do you prefer sans-serif fonts?” This would likely cleanly separate the two groups (designers say yes, developers shrug). A bad question would be, “Do you like pizza?” Everyone says yes, and you’ve learned nothing. The splitting criterion is how the tree numerically grades these questions.

Gini Impurity: The Quick Calculation

Don’t let the name scare you. Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. It’s a mouthful, but the calculation is beautifully simple.

For a node with C classes, the Gini Impurity is: $G = 1 - \sum_{i=1}^{C}(p_i)^2$

where $p_i$ is the probability of picking a data point of class i. If a node is perfectly pure (all one class), Gini impurity is 0. If it’s a complete mess (50/50 split between two classes), it’s 0.5. The tree’s job is to find the split that results in the largest weighted decrease in Gini impurity. It’s fast to compute, which is why it’s often the default.

import numpy as np

# Let's calculate Gini impurity for a few nodes manually.
def gini_impurity(classes):
    """Calculates Gini impurity for a list of classes."""
    # Count the frequency of each unique class
    _, counts = np.unique(classes, return_counts=True)
    probabilities = counts / counts.sum()
    return 1 - np.sum(probabilities**2)

# A perfectly pure node (all 'Cat')
print(gini_impurity(['Cat', 'Cat', 'Cat']))  # Output: 0.0

# A perfectly mixed node for two classes
print(gini_impurity(['Cat', 'Dog', 'Cat', 'Dog']))  # Output: 0.5

# A more realistic, impure node
print(gini_impurity(['Cat', 'Dog', 'Dog', 'Dog', 'Fish']))  # Output: ~0.56

Information Gain & Entropy: The Theorist’s Choice

If Gini is the quick calculation, Entropy is its more theoretically grounded cousin, straight from information theory. Entropy, in this context, measures the “disorder” or “uncertainty” in a node. A pure node has low entropy (you’re very certain what class a random sample will be). A mixed node has high entropy (you’re very uncertain).

The formula for entropy is: $E = -\sum_{i=1}^{C}p_i\log_2(p_i)$

The tree then calculates the Information Gain for a potential split. Information Gain is simply the parent node’s entropy minus the weighted average of the child nodes’ entropies. The split with the highest information gain is chosen. It often produces trees very similar to Gini impurity, but it’s slightly more sensitive to changes in the node’s composition.

from scipy.stats import entropy

def entropy_impurity(classes):
    """Calculates entropy for a list of classes."""
    _, counts = np.unique(classes, return_counts=True)
    probabilities = counts / counts.sum()
    # base=2 gives entropy in bits
    return entropy(probabilities, base=2)

# Let's compare Gini and Entropy on the same node
node_data = ['Cat', 'Dog', 'Dog', 'Dog', 'Fish']

print(f"Gini Impurity: {gini_impurity(node_data):.4f}")
print(f"Entropy: {entropy_impurity(node_data):.4f}")

# You'll see they track each other very closely. The choice is rarely a game-changer.

Here’s the dirty little secret: for classification tasks, it hardly matters. Gini and Entropy are like choosing between a Phillips-head and a flat-head screwdriver for most screws. They’ll both get the job done, and the resulting trees are often indistinguishable. Gini is marginally faster computationally, so it’s a fine default. Don’t spend days agonizing over this choice; your time is better spent on feature engineering or pruning the tree.

Mean Squared Error: For When You’re Predicting Numbers

The above criteria are for classification. But what if you’re doing regression—predicting a continuous number, like a house price? You can’t use Gini or Entropy. Instead, we use a criterion based on variance. The most common is Mean Squared Error (MSE).

For a node, the MSE is just the average of the squared differences between the actual values and the mean value of that node. The tree aims to find the split that minimizes the overall MSE across the child nodes. It’s basically performing a greedy piecewise constant regression.

# Example: Predicting house prices based on a feature like 'number of bedrooms'
def mse_impurity(values):
    """Calculates MSE for a list of values (the target)."""
    if len(values) == 0:
        return 0
    mean_val = np.mean(values)
    return np.mean((values - mean_val)**2)

# A 'pure' node where all houses are around the same price
print(mse_impurity([299000, 300000, 301000]))  # Low MSE

# A very impure node with wildly different prices
print(mse_impurity([100000, 300000, 500000]))  # High MSE

# The tree's algorithm will seek splits that minimize the sum of the MSE in the left and right children.

The Pitfalls: Why Your Tree is Greedy and Shortsighted

Here’s the crucial part the manuals often gloss over: this splitting process is greedy. It only looks one step ahead. It finds the very best split right now with no consideration for whether that split will lead to the best possible tree two splits down the line. It’s like playing chess by only ever making the move that captures the highest-value piece immediately, without any strategy for the endgame.

This greediness is the fundamental reason why we need ensemble methods like Random Forest. A single decision tree, making a series of locally optimal decisions, is almost never globally optimal. It’s prone to overfitting, latching onto spurious patterns in the training data because that one split reduced impurity just a tiny bit more. Always remember: a lower impurity score on your training data doesn’t necessarily mean a better model. It usually just means a more complex, overfit one. The splitting criterion is brilliant at its job, but its job is very narrow. It’s our job to manage its myopia.