76.4 Data Cleaning: dropna, fillna, duplicated, astype

Right, let’s talk about cleaning. Not the fun, put-on-some-music-and-zone-out kind. The data kind. It’s the unglamorous, absolutely essential foundation of everything you’ll do with pandas. If your data is a mess, your results are a lie. It’s that simple. So let’s roll up our sleeves and get our hands dirty with the tools that make it less dirty.

The Art of Dropping the Nulls (`dropna`)

Your first instinct when you see NaN (Not a Number, pandas’ way of saying “I got nothing”) is probably to just delete the whole row. That’s what dropna does, and it’s a blunt instrument. Use it carelessly, and you’ll be left with a sad, empty DataFrame.

By default, dropna() drops any row that has at least one NaN in it. This is often overkill.

import pandas as pd
import numpy as np

# Let's create some realistically messy data
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'score': [99, np.nan, 85, 78, 92],
    'comments': ["Great", None, "Okay", "Bad", "Excellent"]
})

print("Original DataFrame:")
print(df)
print("\nDefault dropna() - nuclear option:")
print(df.dropna())

This nukes rows 1 (Bob), 2 (Charlie?!), and 3. We lost Charlie, who had a score and a comment, just because his name was missing? That’s a data tragedy. Be more surgical.

# Only drop rows where ALL values are NaN (rare but useful)
print(df.dropna(how='all'))

# Only drop rows where the 'score' is specifically NaN
print(df.dropna(subset=['score']))

# Drop rows where BOTH 'name' AND 'comments' are missing
print(df.dropna(subset=['name', 'comments'], how='all'))

The subset parameter is your best friend here. It lets you surgically target the columns where missingness actually matters for your analysis.

Filling in the Blanks (`fillna`)

Sometimes deletion is too drastic. You’d rather make an educated guess. Enter fillna. The key here is to think about what an appropriate fill value is. Filling a numerical score with 0 is very different from filling it with the mean, and both have consequences.

# The bad: Filling a score with 0 implies they got a zero. Probably not true.
df['score'].fillna(0, inplace=True)

# The meh: Filling with the mean distorts the distribution. Use cautiously.
mean_fill = df['score'].fillna(df['score'].mean())

# The contextual: Forward fill (ffill) might work for time series data.
# For our user data, it's probably nonsense. "Bob's score? Oh, same as Alice's, 99."
ffill_fill = df['score'].fillna(method='ffill')

# For categorical data, a dedicated placeholder is often best.
df['comments'].fillna('No Comment', inplace=True)
df['name'].fillna('Unknown', inplace=True)

The designers gave us methods like ffill and bfill because they’re incredibly useful for ordered data like sensor readings or stock prices. For unordered, categorical data, they’re a trap. Don’t just use a method because it’s there; use it because it makes sense.

Hunting Duplicates (`duplicated` and `drop_duplicates`)

Duplicates waste memory and skew counts. Finding them is straightforward, but the definition of a “duplicate” is key. Is a row a duplicate only if every single column is identical? Or just if a user ID appears twice?

# Let's add a blatant duplicate and a subtle one (same user_id, different score)
df_dupes = df.append(df.iloc[0]).append(pd.DataFrame({'user_id': [2], 'name': ['Bob'], 'score': [100], 'comments': ["Great"]}))

# Find all complete duplicates (every column matches)
print(df_dupes.duplicated())

# Find duplicates based only on the 'user_id' column
print(df_dupes.duplicated(subset=['user_id']))

# Keep the first or last occurrence? Your choice.
# Default is `keep='first'`, which marks subsequent duplicates as True.
dropped_dupes = df_dupes.drop_duplicates(subset=['user_id'], keep='first')

The subtle pitfall here is keep='first' vs. keep='last'. If you have two entries for user 2 (scores NaN and 100), which one do you want to keep? The first one (the bad data) or the last one (the updated data)? You have to make that call. pandas can’t read your mind.

Forcing Types (`astype`)

This is where the rubber meets the road. Data comes in as strings from CSV files all the time, even when it’s clearly numbers or dates. astype is your coercion tool.

# A classic: 'score' looks like a number but is actually stored as an object (string)
print(df['score'].dtype) # Probably `object` after our earlier shenanigans

# Force it to be an integer. This will fail if there's any text left.
df['score'] = df['score'].astype(int)

# Safer: convert to numeric, coercing errors to NaN
df['score'] = pd.to_numeric(df['score'], errors='coerce')

# Convert a string to a category to save a ton of memory on large datasets
df['comments'] = df['comments'].astype('category')

The errors='coerce' parameter is a lifesaver. It doesn’t just scream and give up when it hits a value like “N/A”; it quietly turns it into a NaN that you can handle later. This is almost always what you want. The other massive win is converting long, repetitive text columns (like “country” or “product_name”) to the category dtype. It can reduce memory usage by 90% or more. It’s the easiest performance win in the book.

The Art of Dropping the Nulls (dropna)

Filling in the Blanks (fillna)

Hunting Duplicates (duplicated and drop_duplicates)

Forcing Types (astype)

The Art of Dropping the Nulls (`dropna`)

Filling in the Blanks (`fillna`)

Hunting Duplicates (`duplicated` and `drop_duplicates`)

Forcing Types (`astype`)