Let’s be honest: you’re here because you have to be. Someone, somewhere, decided that the fate of your critical data should rest in the hands of a proprietary binary file format designed for accountants. I’m sorry. But since we’re stuck with it, we might as well master the art of wrestling Excel files with Python. The two main contenders are xlrd for reading the ancient scrolls and openpyxl for both reading and writing the modern ones.

The first thing you need to know is the Great Schism of Excel file formats. Everything changed in 2007 with the introduction of the .xlsx format, which is basically a ZIP file full of XML documents. The old .xls format is a binary nightmare. xlrd (version 1.2.0 or earlier) was the undisputed champion for reading both, but for security reasons, it now only handles .xlsx. If you have a legacy .xls file, you’ll need to use xlrd==1.2.0 or find another library. openpyxl only deals with .xlsx. Keep this in mind, or you’ll spend an hour wondering why your script is screaming about a corrupt file.

The Simple Art of Reading (with openpyxl)

First, install your weapon of choice: pip install openpyxl. To read a file, you load it into a Workbook object. The data_only flag is your first important choice. Do you want the formula itself, or the last calculated value that Excel saved? For most data extraction tasks, you want the value.

from openpyxl import load_workbook

# Load the workbook. data_only=True gives you the result, not the formula.
wb = load_workbook(filename='your_spreadsheet.xlsx', data_only=True)

# Get a worksheet by name. This step is where typos will ruin your day.
ws = wb['Sheet1']

# Access a cell. openpyxl is 1-indexed, like Excel itself. This is A1.
cell_value = ws.cell(row=1, column=1).value
# Or use the friendly, Excel-style notation
other_cell_value = ws['A1'].value

print(f"The value in A1 is: {cell_value}")

Iterating Without Going Mad

You will, inevitably, need to process a range of cells. The most efficient way is to treat the worksheet as a iterable of rows, which are themselves iterables of cells.

# Iterate over all rows that contain data (up to the 'max_row' the file defines)
for row in ws.iter_rows(min_row=2, max_col=3, max_row=10, values_only=True):
    # With values_only=True, `row` is a tuple of the cell values.
    # Without it, `row` is a tuple of Cell objects.
    id_value, name, score = row # unpack the first three columns
    print(f"ID: {id_value}, Name: {name}, Score: {score}")

The values_only parameter is a lifesaver. It returns the cell values directly instead of forcing you to access the .value attribute on every single cell object, which cleans up your code significantly.

Writing and Saving Your Masterpiece

This is where openpyxl truly shines. You can create a workbook from scratch or load an existing one to modify. Remember, until you call save(), nothing is written to disk. It’s all in memory.

from openpyxl import Workbook
from openpyxl.styles import Font

# Create a new workbook. It always starts with one worksheet.
wb = Workbook()
ws = wb.active
ws.title = "Master Data" # Change the default 'Sheet' name

# Dump a list of lists into the sheet.
data_rows = [
    ['Item', 'Quantity', 'Price'],
    ['Python Book', 1, 39.99],
    ['Coffee Mug', 4, 12.50]
]

for row in data_rows:
    ws.append(row) # append() adds a whole row

# Fancy styling because why not? Let's make the header bold.
for cell in ws[1]: # ws[1] gets all cells in the first row
    cell.font = Font(bold=True)

# Adjust column widths. This is annoyingly manual.
ws.column_dimensions['A'].width = 20
ws.column_dimensions['B'].width = 10

# CRITICAL: Actually save the file. This is the step everyone forgets.
wb.save("my_new_file.xlsx")

Common Pitfalls and the Ghost of Data Past

  1. The Phantom Cell: Excel files often have “phantom” rows or columns—cells that were once edited and then deleted, but which openpyxl still considers part of the worksheet’s dimensions. Use ws.max_row and ws.max_column to check the bounds of your data, but be warned: they might be lies. You may need to write logic to find the real last cell with data.

  2. Dates are a Lie: Excel stores dates as floating-point numbers (the number of days since January 0, 1900). openpyxl can usually convert these to Python datetime objects automatically, but if you’re reading a cell with data_only=True and get a float instead of a date, you’ve been bitten. You might need to check the cell’s number format and do the conversion yourself. It’s a headache.

  3. Formulas are Not Magic: If you write a formula to a cell using ws['A1'] = "=SUM(B1:B10)" and then immediately open the file and look at A1 with data_only=True, it will show None. Why? Because the formula hasn’t been evaluated by Excel. openpyxl can write the formula, but it cannot calculate the result. You need Excel for that. This trips up everyone.

  4. Performance is a Thing: Reading a massive 100,000-row spreadsheet? load_workbook() will try to eat your entire RAM for lunch. Use read_only=True mode to stream the file. Conversely, if you’re writing a huge file, use write_only=True mode to create the file as a stream. These modes are less flexible but essential for big jobs.

# For reading huge files
wb = load_workbook(filename='massive_file.xlsx', read_only=True)
# For writing huge files
wb = Workbook(write_only=True)
ws = wb.create_sheet()
# In write_only mode, you can only use ws.append([list, of, values])