87.10 librosa and pydub: Audio Processing

Right, let’s talk about audio. You’re not here to learn how to make a beep; you’re here to wrangle sound waves into submission. For that, we have two main allies in Python: librosa for the heavy scientific lifting and pydub for the practical, “I just need to cut this file” tasks. They serve different masters, and knowing which to use when is half the battle.

librosa: The Signal Processing Scholar

librosa is your go-to when you need to understand what’s inside the audio. It’s a library built by and for audio signal processing nerds (I say this with the utmost respect). It doesn’t care much about file formats; it cares about the raw numerical signal once it’s loaded.

Its first brilliant design choice is that, by default, it converts audio to a mono floating-point time series at a standard sample rate. This saves you from a world of integer normalization headaches.

import librosa

# Load an audio file. Notice we don't specify a format.
# By default librosa resamples to 22050 Hz, mixes down to mono, and returns float32.
y, sr = librosa.load('your_cool_song.mp3')

print(f"Audio shape: {y.shape}, Sample rate: {sr}")
# Audio shape: (1555200,), Sample rate: 22050

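Why 22050 Hz? Because for many analysis tasks (like music), you don’t need the full frequency range of a CD (44.1 kHz). By the Nyquist theorem, a sample rate of 22050 Hz represents frequencies up to 11025 Hz. That falls short of the roughly 20 kHz ceiling of human hearing, but it covers most of the energy in speech and music, and halving the rate makes computations faster and uses less memory. You can override this with sr=None to keep the file’s native rate, which is worth doing when high-frequency content matters or when you plan to write the audio back out.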
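
The arithmetic behind the sample-rate choice is easy to check by hand. Here is a minimal sketch in plain Python comparing librosa’s default rate with the CD rate; the helper names nyquist_hz and float32_megabytes are made up for illustration and are not part of librosa.

```python
def nyquist_hz(sample_rate):
    """Highest frequency a given sample rate can represent."""
    return sample_rate / 2


def float32_megabytes(sample_rate, seconds, channels=1):
    """Approximate size of a float32 signal (4 bytes per sample), in MB."""
    return sample_rate * seconds * channels * 4 / 1_000_000


for rate in (22050, 44100):
    print(f"{rate} Hz -> Nyquist {nyquist_hz(rate):.0f} Hz, "
          f"{float32_megabytes(rate, 180):.1f} MB for a 3-minute mono track")
```

Half the sample rate means half the representable bandwidth and half the memory, which is the trade librosa makes by default.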

Now, let’s extract a matrix of Mel-frequency cepstral coefficients (MFCCs). This is a classic feature set that vaguely mimics human hearing, and it’s the bread and butter of many music genre classification and speech recognition models.

# Extract MFCCs
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(f"MFCC shape: {mfccs.shape}")  # (n_mfcc, time_frames)
# MFCC shape: (13, 3038)  -> 1 + 1555200 // 512 frames at the default hop_length of 512

# Want to see the spectral centroid? It's a measure of the "brightness" of the sound.
centroids = librosa.feature.spectral_centroid(y=y, sr=sr)

The magic here is that librosa.feature doesn’t just give you numbers; it gives you a well-tested, academically sound implementation of these complex features. You could try to build MFCCs from scratch using scipy and numpy, but you’d probably get it slightly wrong and waste a week. Don’t do that.
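
The time_frames dimension of these feature matrices is predictable. Assuming librosa’s defaults for these functions (hop_length=512 with centered frames), the frame count works out to 1 + len(y) // hop_length. A small sketch, where expected_frames is an illustrative helper rather than a librosa function:

```python
def expected_frames(n_samples, hop_length=512):
    """Frame count for a centered short-time analysis with the given hop."""
    return 1 + n_samples // hop_length


sr = 22050
ten_seconds = 10 * sr                   # 220500 samples
print(expected_frames(ten_seconds))     # 431

# Each frame advances by hop_length samples, i.e. about 23 ms at 22050 Hz.
hop_ms = 512 / sr * 1000
print(f"{hop_ms:.1f} ms per frame")
```

This is handy when you need to line features up with timestamps or preallocate arrays before running the extraction.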

pydub: The Pragmatic Audio Editor

While librosa is a scientist, pydub is a skilled mechanic. Its job is to manipulate audio files: slicing, dicing, changing volume, fading, and format conversion are its home turf. For anything beyond plain WAV it hands the decoding and encoding to FFmpeg (or libav), which is the real hero here, but pydub makes it painless.

Need to split a podcast into 30-second chunks? pydub.

from pydub import AudioSegment
from pydub.utils import make_chunks

# Load an audio file. pydub is format-aware.
podcast = AudioSegment.from_file("my_podcast.wav", format="wav")

# Slice it with millisecond precision.
first_10_seconds = podcast[:10000]  # 10,000 milliseconds

# Or split it into chunks
chunk_length_ms = 30000  # 30 seconds
chunks = make_chunks(podcast, chunk_length_ms)

# Export each chunk
for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i}.mp3", format="mp3")
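
One detail worth knowing: make_chunks does not pad, so the final chunk is simply whatever is left over. The sketch below reproduces the boundary arithmetic in plain Python so you can predict how many files a recording will produce. chunk_bounds is a hypothetical helper, not part of pydub.

```python
from math import ceil


def chunk_bounds(total_ms, chunk_ms):
    """(start, end) millisecond pairs; the last chunk may be shorter."""
    n_chunks = ceil(total_ms / chunk_ms)
    return [(i * chunk_ms, min((i + 1) * chunk_ms, total_ms))
            for i in range(n_chunks)]


# A 70-second recording cut into 30-second pieces
print(chunk_bounds(70_000, 30_000))
# [(0, 30000), (30000, 60000), (60000, 70000)]
```

If a short trailing chunk would upset downstream code, filter on the chunk length before exporting.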

The beauty is the simplicity. The AudioSegment object is intuitive. The biggest “gotcha”? pydub reads and writes WAV files on its own, but every other format (including the MP3 export above) needs FFmpeg installed and accessible on your system path. Without it those calls fail, and the error messages can be cryptic. On macOS, brew install ffmpeg. On Debian or Ubuntu, sudo apt install ffmpeg. On Windows, it’s a bit more of a chore: download the binaries and add them to your PATH. It’s worth the setup.
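
A cheap way to avoid the cryptic errors is to check for the executable up front. This is a minimal sketch using only the standard library; ffmpeg_path is an illustrative name, and it assumes the binary is called ffmpeg on your system.

```python
import shutil


def ffmpeg_path():
    """Return the full path to the ffmpeg executable, or None if not on PATH."""
    return shutil.which("ffmpeg")


path = ffmpeg_path()
if path is None:
    print("ffmpeg not found on PATH; install it before using pydub on compressed formats")
else:
    print(f"ffmpeg found at {path}")
```

Running this at the top of a script turns a confusing failure deep inside an export call into a clear message at startup.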

When to Use Which (and When to Use Both)

This is the critical part. Use pydub for:

Format conversion (e.g., .flac to .mp3)
Trimming, concatenating, or splicing audio files
Simple operations like adjusting gain (volume) or adding fades
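
On the gain point: pydub expresses volume changes in decibels, so segment + 6 (or segment.apply_gain(6)) raises the level by 6 dB rather than multiplying it by 6. The sketch below shows what those numbers mean as linear amplitude multipliers; db_to_amplitude_ratio is a made-up helper for illustration.

```python
def db_to_amplitude_ratio(db):
    """Convert a gain in decibels to a linear amplitude multiplier."""
    return 10 ** (db / 20)


for gain_db in (-6, 0, 6, 20):
    print(f"{gain_db:+d} dB -> x{db_to_amplitude_ratio(gain_db):.3f}")
```

Roughly, every 6 dB doubles or halves the amplitude, which is why small-looking numbers make audible differences.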

Use librosa for:

Any kind of feature extraction (MFCCs, chroma, spectral contrast, etc.)
Beat detection, tempo estimation, or time-stretching
Any scientific analysis of the audio signal
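
To give a feel for what one of these features measures, here is a deliberately simplified spectral centroid computed with NumPy on synthetic tones. It treats the whole signal as a single frame, so it is an illustration of the idea and not how librosa implements it; for real work, use librosa.feature.spectral_centroid. The function name spectral_centroid_hz is hypothetical.

```python
import numpy as np


def spectral_centroid_hz(signal, sample_rate):
    """Magnitude-weighted mean frequency of a whole signal."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * magnitudes) / np.sum(magnitudes))


sr = 22050
t = np.arange(sr) / sr                        # one second of samples
low = np.sin(2 * np.pi * 220 * t)             # a dull 220 Hz tone
bright = low + np.sin(2 * np.pi * 4000 * t)   # same tone plus a 4 kHz partial

print(spectral_centroid_hz(low, sr))      # close to 220
print(spectral_centroid_hz(bright, sr))   # close to 2110, midway between the two partials
```

Adding high-frequency energy pulls the centroid upward, which is the sense in which it tracks “brightness.”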

Often, you’ll use them together in a beautiful partnership. Use pydub to decode a tricky format, then either export it as a .wav for librosa to load or, as below, hand the samples to librosa directly as a NumPy array. This sidesteps librosa’s sometimes-fragile file loading.

from pydub import AudioSegment
import librosa
import numpy as np

# Use pydub to load a finicky .m4a file
audio_segment = AudioSegment.from_file("annoying_file.m4a", format="m4a")
# Convert it to a raw numpy array and sample rate
y = np.array(audio_segment.get_array_of_samples())
sr = audio_segment.frame_rate

# Stereo samples are interleaved (L, R, L, R, ...), so reshape and average down to mono.
if audio_segment.channels > 1:
    y = y.reshape((-1, audio_segment.channels)).mean(axis=1)

# Now convert to float32 in [-1.0, 1.0) like librosa would.
# sample_width is in bytes, so the divisor is 2**15 for 16-bit audio.
y = y.astype(np.float32) / float(1 << (8 * audio_segment.sample_width - 1))

# Now you can proceed with librosa as normal!
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
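
The reshape in that snippet relies on multi-channel samples being interleaved. If you want to convince yourself without an audio file, here is a small check on a synthetic 16-bit stereo buffer using NumPy alone; the sample values are arbitrary.

```python
import numpy as np

# Synthetic 16-bit stereo buffer, interleaved as L, R, L, R, ...
left = np.array([1000, 2000, 3000], dtype=np.int16)
right = np.array([-1000, 0, 1000], dtype=np.int16)
interleaved = np.empty(left.size * 2, dtype=np.int16)
interleaved[0::2] = left
interleaved[1::2] = right

channels, sample_width = 2, 2                   # stereo, 2 bytes per sample
frames = interleaved.reshape((-1, channels))    # one row per frame: [L, R]
mono = frames.mean(axis=1).astype(np.float32)
mono /= float(1 << (8 * sample_width - 1))      # 2**15 for 16-bit audio

print(frames.tolist())   # [[1000, -1000], [2000, 0], [3000, 1000]]
print(mono)              # each value is the L/R average divided by 32768
```

Each row of frames holds one moment in time across both channels, so averaging along axis 1 is the mono mixdown.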

The key takeaway? librosa gives you the deep insight, while pydub handles the mundane but crucial tasks of audio I/O and editing. Together, they cover most of what you’ll need to do programmatically with sound. Now go make some noise. Responsibly.