77.2 Common Plot Types: line, scatter, bar, histogram, boxplot
Right, let’s get you plotting. Forget the sterile, corporate examples you see in most tutorials. We’re going to make graphs that actually communicate something, using the three workhorses of the Python world: the venerable matplotlib, the stylish seaborn, and the interactive plotly. I’ll be honest with you—matplotlib can feel like assembling IKEA furniture with instructions in a language you don’t speak, but once you understand its logic, you own the whole factory. Seaborn is the chic friend who comes in and makes your default matplotlib plots look like they belong in a journal. And plotly is for when you need to make things you can poke and prod on a webpage.
The key to not losing your mind is understanding the object hierarchy. matplotlib has a Figure (the whole canvas) and one or more Axes (the individual plots on that canvas). You almost always want to work with the Axes object directly. This is the first pitfall everyone hits: they use the lazy plt.plot(), which draws on whatever Axes happens to be “current,” and then wonder why customizing a specific plot gets confusing as soon as a figure holds more than one. We’re not doing that.
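To make the hierarchy concrete, here is a minimal sketch with one Figure holding two Axes. The variable names (left, right) and the toy data are just for illustration.

```python
import matplotlib.pyplot as plt

# One Figure (the canvas) holding two Axes (the individual plots)
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
left, right = axes

# Each Axes is customized through its own methods, independently of the other
left.plot([0, 1, 2], [0, 1, 4])
left.set_title("Left panel")
right.plot([0, 1, 2], [4, 1, 0])
right.set_title("Right panel")

# The hierarchy can be walked in both directions
print(len(fig.axes))        # how many Axes live on this Figure
print(left.figure is fig)   # each Axes keeps a reference to its parent Figure
# plt.show()  # uncomment to display the figure
```

Because every call names the Axes it acts on, there is never any doubt about which plot a title or label ends up on.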
The Indispensable Line Plot
The line plot is the workhorse for sequential data, usually time series. Its biggest secret? It’s just a connect-the-dots game. The function ax.plot(x, y) draws lines between the points in the order they appear in your arrays. This leads to the most common gotcha: if your x-axis data isn’t sorted, you get a spectacularly useless spaghetti scribble.
import matplotlib.pyplot as plt
import numpy as np
# Create some data
x = np.linspace(0, 10, 100) # 100 points from 0 to 10
y = np.sin(x) + np.random.randn(100) * 0.1 # Noisy sine wave
# The RIGHT way: create figure and axes explicitly
fig, ax = plt.subplots(figsize=(10, 5)) # figsize is in inches, because of course it is.
ax.plot(x, y, linestyle='-', linewidth=1, color='steelblue', label='Measured Data')
ax.plot(x, np.sin(x), 'r--', label='True Model') # Shorthand: 'r--' = red dashed line
ax.set_title("The Results of Very Important Science", fontweight='bold')
ax.set_xlabel("Time (s)")
ax.set_ylabel("Widgets Processed")
ax.legend()
ax.grid(True, alpha=0.3) # alpha controls transparency. Use it.
plt.tight_layout() # This prevents labels from getting chopped off. Always use it.
plt.show()
Why fig, ax = plt.subplots()? It gives you direct, unambiguous control over your specific plot. Now you can customize ax to your heart’s content without some hidden state from a previous plt command messing it up.
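The unsorted-x gotcha from the start of this section is easy to reproduce and just as easy to fix. The sketch below uses made-up, randomly ordered data and np.argsort to reorder x and y together before plotting; the seed and sample size are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)          # seeded so the sketch is reproducible
x = rng.uniform(0, 10, 50)              # x values arrive in random order
y = np.sin(x)

order = np.argsort(x)                   # indices that would sort x ascending

fig, (ax_raw, ax_sorted) = plt.subplots(1, 2, figsize=(10, 4))
ax_raw.plot(x, y)                       # connects points in array order: a scribble
ax_raw.set_title("Unsorted x")
ax_sorted.plot(x[order], y[order])      # same data, x and y reordered together
ax_sorted.set_title("Sorted x")
fig.tight_layout()
# plt.show()  # uncomment to display the figure
```

The important detail is indexing both arrays with the same order, so each y value stays attached to its x.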
The Humble, Powerful Scatter Plot
A scatter plot shows you the raw relationship between two variables. The biggest mistake here is using a line plot for this job. If your points aren’t connected by sequence, you want ax.scatter(). Its superpower is the ability to map a third (or even fourth) dimension to the size and color of the markers. This is where you can really start to see stories in your data.
# Let's use the classic Iris dataset to avoid being boring
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
target = iris.target
fig, ax = plt.subplots(figsize=(8, 6))
# Plot sepal length vs width, color by species, size by petal length
scatter = ax.scatter(X[:, 0], X[:, 1], c=target, s=X[:, 2]*20, alpha=0.7, cmap='viridis')
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
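cbar = fig.colorbar(scatter, ticks=[0, 1, 2]) # One tick per species code
cbar.set_label('Species')
cbar.ax.set_yticklabels(iris.target_names) # Show species names instead of 0, 1, 2
# s is the marker area in points squared, so s=X[:,2]*20 is a quick hack. You'd scale it properly for a real publication.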
plt.tight_layout()
plt.show()
Notice the alpha parameter? It’s non-negotiable. It makes overlapping points visible, revealing density and structure that solid markers would completely hide.
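The marker sizes in the example above came from a quick multiplication. One way to scale them more deliberately is to map the values linearly onto a chosen range of marker areas. This is only a sketch: scale_sizes is a hypothetical helper, not part of matplotlib, and the 20 to 200 range and sample values are arbitrary.

```python
import numpy as np

def scale_sizes(values, min_size=20.0, max_size=200.0):
    """Linearly map values onto [min_size, max_size] (marker area in points**2).

    Hypothetical helper for illustration; not part of matplotlib.
    """
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:                         # constant input: avoid dividing by zero
        return np.full(values.shape, (min_size + max_size) / 2)
    return min_size + (values - lo) / (hi - lo) * (max_size - min_size)

petal_length = np.array([1.4, 4.7, 6.0, 1.0, 6.9])   # made-up sample values
sizes = scale_sizes(petal_length)
print(sizes.min(), sizes.max())
```

You would then pass the result as s=scale_sizes(X[:, 2]) to ax.scatter(). A linear map on area is one reasonable choice among several; it keeps the smallest points visible and the largest from swallowing their neighbors.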
The “Just the Facts” Bar Plot
Use ax.bar() for categorical comparisons. You can pass the category labels directly as the x values, and matplotlib will place them at positions 0, 1, 2, … for you. Passing numeric positions yourself and labeling them afterwards with ax.set_xticks() and ax.set_xticklabels() is clunkier, but it’s what you need once you want grouped or offset bars.
categories = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']
points = [850, 650, 930, 720] # Obviously not canonical
colors = ['#AE0001', '#FFDB00', '#0E1A40', '#1A472A']
fig, ax = plt.subplots()
bars = ax.bar(categories, points, color=colors) # Yes, you can just pass the list of names for x.
ax.set_ylabel("House Points")
ax.set_title("End-of-Year Total (Pre-Potter)")
# Annotate the bars with their values. This is how you make a plot *better*.
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 5,
            f'{height}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
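Passing the names straight in covers the simple case. For grouped bars you do need numeric positions plus explicit tick labels. Here is a minimal sketch; the second series (year_two) and the 0.4 bar width are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

categories = ['Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin']
year_one = [850, 650, 930, 720]          # values from the example above
year_two = [780, 700, 880, 760]          # made-up second series for illustration

positions = np.arange(len(categories))   # 0, 1, 2, 3: one slot per category
width = 0.4                              # each bar takes under half a slot

fig, ax = plt.subplots()
# Shift each series half a bar-width to either side of the slot center
ax.bar(positions - width / 2, year_one, width, label='Year 1')
ax.bar(positions + width / 2, year_two, width, label='Year 2')

# One tick at the center of each pair, labeled with the category name
ax.set_xticks(positions)
ax.set_xticklabels(categories)
ax.set_ylabel("House Points")
ax.legend()
fig.tight_layout()
# plt.show()  # uncomment to display the figure
```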
The Histogram: See Your Data’s Shape
A histogram bins your data and shows the frequency in each bin. The single most important choice here is the number of bins. Too few, and you lose all detail. Too many, and you get a spiky, meaningless mess. ax.hist() defaults to a fixed 10 bins no matter how much data you hand it, which is rarely the right number, so don’t lean on it. Start with bins='auto', which lets NumPy pick a bin width from the data, then experiment until the story is clear.
# Let's make some fake, skewed data
data = np.random.gamma(2, 1.5, 1000)
fig, ax = plt.subplots(1, 2, figsize=(12, 4)) # 1 row, 2 columns of axes
# The Terrible Default
ax[0].hist(data, edgecolor='black')
ax[0].set_title("The Default (Why?)")
# The Actually Useful Version
ax[1].hist(data, bins='auto', edgecolor='black', color='skyblue', alpha=0.7)
ax[1].set_title("bins='auto' (Much Better)")
ax[1].set_xlabel("Value")
ax[1].set_ylabel("Frequency")
plt.tight_layout()
plt.show()
The edgecolor parameter matters more than it looks: it draws a thin border around each bar, which makes adjacent bars much easier to tell apart, especially when printed in black and white.
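If you want to see what the two settings actually do before plotting anything, you can ask NumPy for the bin edges directly. As far as I know, ax.hist() hands string values of bins to NumPy's binning strategies, so np.histogram_bin_edges should mirror what the plot uses. The seed here is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)          # seeded so the counts are reproducible
data = rng.gamma(2, 1.5, 1000)           # skewed sample, like the one above

default_edges = np.histogram_bin_edges(data, bins=10)      # the fixed default
auto_edges = np.histogram_bin_edges(data, bins='auto')     # data-driven choice

# N edges delimit N - 1 bins
print("default bins:", len(default_edges) - 1)
print("auto bins:   ", len(auto_edges) - 1)
```

Comparing the two counts on your own data is a quick sanity check on whether the default is hiding structure.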
The Boxplot: The Statistical Summary
A boxplot is a brilliant, compact way to show a distribution’s median, quartiles, and outliers. The box spans the interquartile range (IQR, from the first quartile to the third), the line inside is the median, and by default each “whisker” extends to the most extreme data point that lies within 1.5 * IQR of the box edge. Points beyond that are shown as fliers (outliers). It’s perfect for comparing distributions across categories.
# Let's compare sepal width across the iris species
data_by_species = [X[target == i, 1] for i in range(3)] # Group sepal width by species
fig, ax = plt.subplots()
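boxplot = ax.boxplot(data_by_species, patch_artist=True)
ax.set_xticklabels(iris.target_names) # labels= was renamed tick_labels= in matplotlib 3.9; setting tick labels separately works on both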
# Make it pretty (because default boxplots are tragically beige)
colors = ['lightgreen', 'lightyellow', 'lightblue']
for patch, color in zip(boxplot['boxes'], colors):
    patch.set_facecolor(color)
ax.set_ylabel("Sepal Width (cm)")
ax.set_title("Distribution by Species")
plt.show()
The patch_artist=True is your ticket to coloring the boxes. Without it, the boxes are drawn as plain line outlines with no face to fill, and the set_facecolor() calls above will fail. The designers made that default choice, and it was a bad one.
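To take the mystery out of where the whiskers and fliers land, you can compute them by hand. The sketch below uses made-up data with one planted outlier, applies the 1.5 * IQR rule with np.percentile, and cross-checks against matplotlib's cbook.boxplot_stats helper, assuming default whisker settings.

```python
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(1)
sample = np.append(rng.normal(3.0, 0.4, 50), [5.5])   # made-up data plus one outlier

q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences
whisker_low = sample[sample >= lower_fence].min()
whisker_high = sample[sample <= upper_fence].max()
fliers = sample[(sample < lower_fence) | (sample > upper_fence)]

# Cross-check against matplotlib's own helper (default whis=1.5)
stats = cbook.boxplot_stats(sample)[0]
print(whisker_low, stats['whislo'])
print(whisker_high, stats['whishi'])
print(fliers)
```

Note that the whiskers end on actual data points, not on the fences themselves, which is why two boxplots with the same IQR can have different whisker lengths.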