78.1 SciPy Subpackages: integrate, optimize, signal, stats, sparse

Right, let’s get into the meat of SciPy. You’ve got NumPy for your arrays, the raw material. SciPy is the fully-stocked workshop where you shape that material into something useful. It’s a massive collection of subpackages, but we’re going to focus on the heavy hitters you’ll actually use. Forget the kitchen-sink approach; we’re here to talk about the tools that earn their keep.

integrate: Making Sense of the Curve

The world isn’t discrete. Sometimes you need to know the whole of something, not just the sum of its sampled parts. That’s where scipy.integrate comes in. The workhorse here is quad, which handles the definite integral of a function of one variable. It’s shockingly simple to use, but the magic is in what it’s doing behind the scenes: it’s using robust numerical techniques (like adaptive quadrature) to figure out the area under your curve without needing an analytical solution.

Let’s say you need to integrate a function that looks deceptively simple but is a nightmare on paper: e^(-x^2). The Gaussian. Good luck finding a nice closed-form for its definite integral. For SciPy, it’s Tuesday.

import numpy as np
from scipy.integrate import quad

# Define the function to integrate: the Gaussian bell curve
def integrand(x):
    return np.exp(-x**2)

# Integrate from -infinity to +infinity. The result should be sqrt(pi).
result, error_estimate = quad(integrand, -np.inf, np.inf)

print(f"Result: {result}")
print(f"Error estimate: {error_estimate}")
print(f"√π: {np.sqrt(np.pi)}")

The output will show you it’s nailed it. The error_estimate is key—it tells you how much the function is lying to you (usually a very, very small amount). Pitfall: Your function must be vectorized. If you write something that can’t handle a NumPy array of x values, quad will choke. Use NumPy functions inside your integrand, not math functions.

For differential equations, solve_ivp (Initial Value Problem) is your new best friend. It replaces the older, clunkier odeint. Let’s model a simple damped spring.

from scipy.integrate import solve_ivp

# Define the system of ODEs: y'' = -y' - y (damped harmonic oscillator)
# We turn this 2nd-order ODE into two 1st-order ODEs:
# Let y0 = position, y1 = velocity
# Then: dy0/dt = y1
#       dy1/dt = -y1 - y0
def damped_spring(t, y):
    return [y[1], -y[1] - y[0]]

# Initial conditions: position=1, velocity=0
initial_state = [1.0, 0.0]
time_span = [0, 10]  # Solve from t=0 to t=10

solution = solve_ivp(damped_spring, time_span, initial_state, dense_output=True)

# The solution is stored neatly
t_vals = np.linspace(0, 10, 300)
y_vals = solution.sol(t_vals)[0]  # Grab the position (y0)

# Now you can plot t_vals vs y_vals to see the oscillation decay.

optimize: Finding the Best (or Least Worst) Option

Optimization is the heart of so much scientific computing. Finding a root, minimizing a function, fitting a curve—it’s all here. Let’s start with minimization. Say you have a classic nasty function like Rastrigin’s function. It’s got a global minimum but a zillion local minima to trap naive algorithms.

from scipy.optimize import minimize

def rastrigin(x):
    """A function designed to make optimizers cry."""
    A = 10
    return A * len(x) + sum([(xi**2 - A * np.cos(2 * np.pi * xi)) for xi in x])

# Start at a truly terrible point far from the minimum [0,0]
initial_guess = [2.8, 3.2]

# Let's use the robust Nelder-Mead method because it's derivative-free
result = minimize(rastrigin, initial_guess, method='Nelder-Mead')
print(f"Minimum found at: {result.x}")
print(f"Function value at minimum: {result.fun}")

The output will get you heartwrenchingly close to [0, 0]. The key is choosing the right method. If your function has gradients, use 'BFGS'. If it doesn’t, 'Nelder-Mead' is a good start. Best practice: Always provide Jacobians (and Hessians if you can) if your problem is large. It turns a painful wait into a blink.

Curve fitting is just a specific type of minimization (minimizing the sum of squared errors). curve_fit wraps this up nicely.

from scipy.optimize import curve_fit

# Let's generate some noisy data that follows an exponential decay
t_data = np.linspace(0, 4, 50)
y_data = 2.5 * np.exp(-1.2 * t_data) + np.random.normal(scale=0.1, size=len(t_data))

# Define the model function to fit
def model(t, a, b):
    return a * np.exp(-b * t)

# Perform the fit! curve_fit returns the optimal parameters and the covariance matrix.
popt, pcov = curve_fit(model, t_data, y_data)

print(f"Fitted parameters: a={popt[0]:.3f}, b={popt[1]:.3f}")
# Should be close to a=2.5, b=1.2

signal: Taming the Noise

This package is a signal processing Swiss Army knife. Need to filter, find peaks, or design a FIR filter? This is your place. A common task: detrending your data by removing a linear best-fit line.

from scipy import signal

# Create a noisy signal with a linear drift
t = np.linspace(0, 10, 1000)
actual_signal = np.sin(t)
linear_drift = 0.1 * t  # The annoying trend
noisy_signal = actual_signal + linear_drift + np.random.normal(scale=0.1, size=len(t))

# Remove the linear trend in one line
detrended_signal = signal.detrend(noisy_signal)

# Now plot detrended_signal vs t to see your clean sine wave.

Another classic: finding peaks in a noisy dataset. The find_peaks function is brilliantly practical. You can set minimum height, distance between peaks, and even prominence (how much a peak stands out from its surroundings).

# Let's find the peaks in our cleaned-up signal
peaks, properties = signal.find_peaks(detrended_signal, height=0, prominence=0.5)

print(f"Found {len(peaks)} peaks at indices: {peaks}")

stats: More Than Just P-Values

While it does hypothesis tests, scipy.stats is also your one-stop shop for probability distributions. This is its killer feature. Need a Normal distribution? It’s an object with methods to PDF, CDF, generate random variates, and fit data.

from scipy import stats

# Create a frozen normal distribution object with mean=10 and std=2
my_dist = stats.norm(loc=10, scale=2)

# Calculate probability density at x=11
pdf_val = my_dist.pdf(11)

# Calculate the cumulative probability P(X < 12)
cdf_val = my_dist.cdf(12)

# Generate 1000 random samples from this distribution
samples = my_dist.rvs(size=1000)

# Now, let's say you have data and want to *fit* a distribution to it.
fit_params = stats.norm.fit(samples)  # Returns (loc, scale)
print(f"Fitted mean (loc): {fit_params[0]:.2f}, fitted std (scale): {fit_params[1]:.2f}")

Pitfall: The naming convention. Mean is loc, standard deviation is scale. It’s generic because other distributions (like Cauchy) use location and scale parameters too. You get used to it.

sparse: When Most of Your Data is Nothing

If you’ve ever had a matrix where 99% of the entries are zero, you know dense matrices are a waste of memory and CPU cycles. Sparse matrices store only the non-zero values. The scipy.sparse module has a dozen formats because, of course, the best way to store nothing depends on the structure of the nothing you’re not storing. The most common are CSR (efficient for arithmetic) and CSC (efficient for column slicing).

from scipy import sparse

# Create a large 10,000 x 10,000 identity matrix.
# A dense version would consume ~800 MB of RAM.
# A sparse diagonal matrix? A few kilobytes.
large_identity = sparse.eye(10000, format='csr')

# You can do linear algebra on it almost like a normal matrix.
# Let's compute the dot product with a vector.
vector = np.random.randn(10000)
result = large_identity.dot(vector)  # This is just the vector itself, but computed efficiently.

# You can also build matrices from non-zero entries.
row = [0, 1, 2, 2]  # row indices
col = [0, 1, 2, 0]  # column indices
data = [4, 5, 6, 7] # values at those indices
sparse_matrix = sparse.csr_matrix((data, (row, col)), shape=(3, 3))

print(sparse_matrix.toarray())  # Convert to dense to see it. Use this only for debugging!

Best practice: Keep data in a sparse format for as long as possible. The moment you convert to a dense array with .toarray(), you’ve lost the advantage. The ecosystem support is great; many solvers in scipy.sparse.linalg (like spsolve) are built specifically for these structures.