76.2 Reading and Writing: CSV, Excel, SQL, Parquet, JSON
Right, let’s talk about getting data in and out of this circus. This isn’t just about saving files; it’s about not losing your mind (or your data types) in the process. I’ve seen more projects derailed by a botched CSV import than by flawed machine learning models. Consider this your guide to doing it right.
The Humble, Deceptively Treacherous CSV
Ah, the CSV. The format everyone uses and no one agrees on. Pandas makes it look easy, which is its greatest trick. read_csv has more parameters than a spaceship cockpit, and you’ll need about five of them to avoid disaster.
The biggest lie a CSV tells is that it’s just commas and lines. It’s a nest of dragons: inconsistent escaping, mixed encodings, and a complete disregard for data types. Your integer column from yesterday is a string column today because one cell had the audacity to be empty. Let’s tame it.
import pandas as pd
# This is how you get bit. Don't do this.
df_naive = pd.read_csv('some_file_from_a_client.csv')
# This is how you do it. Like a professional.
df = pd.read_csv(
'data.csv',
sep=',', # It's not always a comma. Sometimes it's a tab ('\t') or a pipe ('|')
encoding='utf-8', # The silent killer. If you see �, try 'latin-1'
dtype={'user_id': str}, # Force this column to string. Trust me.
parse_dates=['signup_date'], # "But it looked like a date in Excel!"
na_values=['', '#N/A', 'NULL'] # Treat these as NaN, not the string "NULL"
)
# And when you write, be kind to the next person (probably you).
df.to_csv(
'cleaned_data.csv',
index=False, # Unless you love explaining what "Unnamed: 0" is, leave this False.
encoding='utf-8'
)
The dtype and parse_dates parameters are your best friends. Use them aggressively to stop Pandas from guessing, because Pandas guesses like a toddler—with enthusiasm but no foresight.
The Excel Quagmire
Excel files are CSV’s evil twin, holding data hostage in a proprietary binary format that also remembers the color of the header row from 1997. Reading them is fine, mostly. Writing them? That’s where the pain lives.
# Reading is straightforward, if slow.
df_excel = pd.read_excel(
'financial_mess.xlsx',
sheet_name='Q3 Sales', # Can also use sheet index (0). Name is clearer.
engine='openpyxl' # For .xlsx files. For .xls, you'll need 'xlrd'.
)
# Writing is where you learn about despair.
with pd.ExcelWriter('output.xlsx', engine='openpyxl') as writer:
df.to_excel(writer, sheet_name='Clean Data', index=False)
df_excel.to_excel(writer, sheet_name='Original Mess', index=False)
# The context manager (`with`) is crucial. It's what actually closes the file.
# Forget this and you'll end up with a corrupt, zero-byte file. Ask me how I know.
The engine parameter matters. For modern .xlsx files, openpyxl is your go-to. Old .xls files require the xlrd library. It’s a mess, but it’s a managed mess.
Talking to Databases like a Pro
This is where Pandas truly shines. Instead of writing reams of boilerplate SQL connection code, you can do it in one line. The read_sql function is a godsend.
from sqlalchemy import create_engine
import os
# Create a connection engine. SQLAlchemy is the key here.
# connection_string format: 'dialect+driver://username:password@host:port/database'
engine = create_engine('postgresql://user:password@localhost:5432/mydb')
# Write a query. Use a parameterized query with `text()` for safety.
query = """
SELECT user_id, signup_date, last_login
FROM users
WHERE last_login > %(login_cutoff)s
"""
# Let Pandas handle the connection, execution, and cursor looping.
df_sql = pd.read_sql(
query,
engine,
params={'login_cutoff': '2023-01-01'} # Safely pass parameters to avoid SQL injection
)
# Writing a DataFrame back is just as easy.
df.to_sql(
'new_table_name',
engine,
index=False,
if_exists='replace' # 'fail', 'replace', or 'append'. Choose wisely.
)
The magic here is SQLAlchemy. It handles the differences between PostgreSQL, MySQL, SQLite, etc., so you don’t have to. Never, ever use string formatting (f"...") to put values into your queries. That’s how you get hacked. Use the params argument. It’s non-negotiable.
The Modern Champions: Parquet and JSON
CSVs are for amateurs. Parquet is for professionals. It’s a binary, columnar format. This means it’s fast to read (especially if you only need a few columns) and it preserves your data types and index. It’s compressed, so it’s smaller on disk. It’s the format you should use for everything that doesn’t absolutely have to be a CSV.
# It's embarrassingly simple and it just works.
df.to_parquet('data.parquet', engine='pyarrow') # Also 'fastparquet'
df_reloaded = pd.read_parquet('data.parquet', engine='pyarrow')
JSON is the other contender—ubiquitous in web APIs, flexible, and a pain to parse. The key is orienting the JSON correctly. The orient parameter in read_json and to_json is everything.
# For a list of records (the most common API output)
json_data = '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'
df_from_api = pd.read_json(json_data, orient='records')
# When you write, choose your orient based on who's consuming it.
# 'records' is good for APIs, 'split' is more efficient.
as_split_json = df.to_json(orient='split')
The bottom line? Use Parquet for your own work. It’s faster and safer. Use JSON when you have to talk to the outside world, and always specify the orient unless you enjoy reading cryptic error messages.