54.3 zipfile: Creating, Reading, and Extracting ZIP Archives

The zipfile module in Python's standard library creates, reads, and extracts ZIP archives. It hides the details of the ZIP file format behind a file-like API, so members can be listed, read, and written by name. This is useful for bundling multiple files for distribution, compressing data to save space, and reading individual members without extracting the whole archive to disk first.

Creating a ZIP Archive

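To create a new ZIP archive, open a ZipFile object in write mode ('w'). This mode creates a new archive and overwrites any existing file with the same name; use 'a' to append to an existing archive or 'x' to fail if the file already exists. When adding files, you can control the compression method and, through the compresslevel argument (Python 3.7 and later), the compression level. The default method is ZIP_STORED (no compression). ZIP_DEFLATED (which requires the zlib module) is the most widely supported compressed method and a good general-purpose choice. ZIP_BZIP2 and ZIP_LZMA are also available, but fewer tools can read them.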

import zipfile
import os

# Create a new archive and add files to it
with zipfile.ZipFile('project_backup.zip', 'w', compression=zipfile.ZIP_DEFLATED) as archive:
    # Add a single file, using arcname to control the path inside the archive
    archive.write('important_report.pdf', arcname='docs/report.pdf')
    
    # Add the contents of a directory recursively
    for root, _dirs, files in os.walk('src'):
        for file in files:
            file_path = os.path.join(root, file)
            # Build the archive path relative to 'src', dropping the leading 'src/'
            arcname = os.path.relpath(file_path, start='src')
            archive.write(file_path, arcname=arcname)

# The archive is finalized only once the with block has exited
print("Archive created successfully.")

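A common pitfall is passing absolute paths to write() without an arcname. The module strips the drive letter and leading path separators from the stored name, but the rest of the directory hierarchy (for example home/alice/projects/...) is still embedded in the archive, which clutters it and exposes details of the source machine. Always use the arcname parameter to specify a clean, relative path within the archive.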
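
The effect of arcname on stored names can be checked with a small experiment. The sketch below is illustrative only: the file name report.txt and the in-memory buffer are assumptions made for the example, not part of the backup script above.

```python
import io
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical source file, addressed by its absolute path
    source = os.path.join(tmpdir, "report.txt")
    with open(source, "w") as handle:
        handle.write("quarterly numbers")

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # No arcname: the stored name is derived from the absolute path
        archive.write(source)
        # Explicit arcname: the stored name is exactly what we choose
        archive.write(source, arcname="docs/report.txt")

    with zipfile.ZipFile(buffer) as archive:
        names = archive.namelist()

for name in names:
    print(name)
```

The first name printed still carries the temporary directory's hierarchy (without a leading separator), while the second is the clean relative path.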

Reading Metadata and Listing Contents

The ZipFile object serves as a context manager, ensuring the archive is properly closed. You can inspect its contents without extracting anything. The namelist() method returns a list of all member names, while infolist() returns a list of ZipInfo objects containing rich metadata like file size, compressed size, and modification time.

with zipfile.ZipFile('project_backup.zip', 'r') as archive:
    print("Archive contents:")
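    for info in archive.infolist():
        # Directory entries and empty stored files have a compressed size of 0
        if info.compress_size:
            ratio = f"{info.file_size / info.compress_size:.1f}x"
        else:
            ratio = "n/a"
        print(f"  {info.filename} - "
              f"Original: {info.file_size} bytes, "
              f"Compressed: {info.compress_size} bytes, "
              f"Ratio: {ratio}")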
    
    # Read the contents of a specific file into memory without extracting
    try:
        with archive.open('docs/report.pdf') as file_in_zip:
            first_few_bytes = file_in_zip.read(10)
            print(f"First bytes of report: {first_few_bytes}")
    except KeyError:
        print("The file was not found in the archive.")
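
The ZipInfo records can also be aggregated into archive-level figures. The following is a minimal sketch: the helper name summarize and the sample member names are hypothetical, and the archive is built in memory so the example is self-contained.

```python
import datetime
import io
import zipfile


def summarize(archive):
    """Collect totals from the ZipInfo records, skipping directory entries."""
    files = [info for info in archive.infolist() if not info.is_dir()]
    return {
        "files": len(files),
        "original_bytes": sum(info.file_size for info in files),
        "compressed_bytes": sum(info.compress_size for info in files),
        # date_time is a (year, month, day, hour, minute, second) tuple
        "newest": max(
            (datetime.datetime(*info.date_time) for info in files),
            default=None,
        ),
    }


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # ZipInfo objects let us pin the modification time for the example
    archive.writestr(zipfile.ZipInfo("a.txt", date_time=(2024, 5, 17, 12, 30, 0)), b"hello")
    archive.writestr(zipfile.ZipInfo("b.txt", date_time=(2023, 1, 1, 0, 0, 0)), b"world!!")
    archive.writestr("empty_dir/", "")  # a directory entry

with zipfile.ZipFile(buffer) as archive:
    summary = summarize(archive)

print(summary)
```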

Extracting Archives

Extraction is straightforward but requires caution. The extractall() method extracts all members to the current directory or to a specified path; the extract() method extracts a single member. Never extract archives from untrusted sources without inspecting them first. Malicious archives can contain member names like ../../etc/passwd (a path traversal attack) or absolute paths. The extract() and extractall() methods sanitize such names: a leading slash or drive letter is stripped and '..' components are removed, so members are written under the target directory. These safeguards do not cover every risk (for example, an archive that expands to an enormous size or overwrites existing files), so it is still good practice to validate member names yourself and to use the path parameter to extract into a dedicated, empty directory.

import tempfile
import pathlib

# Safely extract to a temporary directory
with tempfile.TemporaryDirectory() as tmpdir:
    with zipfile.ZipFile('project_backup.zip', 'r') as archive:
        # Validate member names first. ZIP names use '/' as the separator,
        # so PurePosixPath parses them the same way on every platform
        for member in archive.namelist():
            member_path = pathlib.PurePosixPath(member)
            if member_path.is_absolute() or '..' in member_path.parts:
                raise ValueError(f"Potentially unsafe archive member path: {member!r}")
        
        # Proceed with extraction if paths are safe
        archive.extractall(path=tmpdir)
    print(f"Archive extracted safely to: {tmpdir}")
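
A complementary check resolves each member name against the intended target directory and flags anything that would land outside it. This is a sketch under stated assumptions: the helper name find_unsafe_members and the directory name extract_here are hypothetical, and the comparison assumes a POSIX-style filesystem (on Windows, paths on different drives would need extra handling).

```python
import io
import os
import zipfile


def find_unsafe_members(archive, dest_dir):
    """Return member names that would resolve outside dest_dir if joined naively."""
    dest_root = os.path.realpath(dest_dir)
    unsafe = []
    for name in archive.namelist():
        # Joining with an absolute name discards dest_root, so those are caught too
        target = os.path.realpath(os.path.join(dest_root, name))
        if os.path.commonpath([dest_root, target]) != dest_root:
            unsafe.append(name)
    return unsafe


# Build a deliberately hostile archive in memory for demonstration
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docs/readme.txt", "harmless")
    archive.writestr("../evil.txt", "tries to escape the target directory")

with zipfile.ZipFile(buffer) as archive:
    flagged = find_unsafe_members(archive, "extract_here")

print(flagged)
```

Rejecting the whole archive when the returned list is non-empty is usually simpler than trying to repair individual names.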

Working with Password-Protected Archives

The module can read archives protected with a password, but it cannot create encrypted archives. The setpassword() method and the pwd argument of read(), open(), extract(), and extractall() only supply the password used for decryption; calling setpassword() while writing does not encrypt anything. Only the traditional PKWARE encryption is supported for reading, and it is weak and should not be considered secure against determined attackers. Decryption is also slow because it is implemented in Python rather than C. To create encrypted archives, preferably with AES, use an external tool or a third-party library.

# Reading an encrypted archive. 'secure_data.zip' is assumed to have been
# created with an external tool; zipfile itself cannot encrypt members.
with zipfile.ZipFile('secure_data.zip', 'r') as archive:
    archive.setpassword(b'my_weak_password')  # Password must be bytes
    try:
        data = archive.read('data.txt')
    except RuntimeError as exc:
        # Typically raised when the password is missing or incorrect
        print(f"Could not decrypt: {exc}")
    else:
        print(f"Decrypted data: {data.decode()}")
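
Whether a member is actually encrypted can be checked from its metadata: bit 0 of the general-purpose flags, exposed as ZipInfo.flag_bits, marks an encrypted entry. The sketch below uses a hypothetical helper, encrypted_members, and an in-memory archive. It also illustrates that calling setpassword() before writing leaves the data unencrypted.

```python
import io
import zipfile


def encrypted_members(archive):
    """Return the names of members whose encryption flag (bit 0) is set."""
    return [info.filename for info in archive.infolist() if info.flag_bits & 0x1]


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    # This password is only remembered for reading; it does not encrypt writes
    archive.setpassword(b"not_used_for_writing")
    archive.writestr("data.txt", "plain text payload")

with zipfile.ZipFile(buffer) as archive:
    names = encrypted_members(archive)
    payload = archive.read("data.txt")  # readable without any password

print(names, payload)
```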

Best Practices and Common Pitfalls

Always Use Context Managers (with statements): This guarantees that the archive is properly closed, writing all necessary data structures and avoiding corruption.
Beware of Path Traversal: As mentioned, validate member names from untrusted archives before extracting, and pass a dedicated directory to extractall(path) rather than extracting into the current directory.
Compression Trade-offs: ZIP_DEFLATED offers good compression but is slower than ZIP_STORED. Use ZIP_STORED for archives where speed of creation is more important than size, or for files that are already compressed (e.g., JPEGs, MP3s).
Large Files: For very large files, use ZipFile.open() to read or write a member in chunks instead of loading the entire file into memory with read().
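Metadata Limitations: The ZIP format does not preserve all filesystem metadata. Timestamps have two-second resolution and carry no time zone, and zipfile does not restore Unix permissions or ownership on extraction. For faithful backups, consider the tar format, available through the tarfile module.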
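
The compression trade-off and the chunked approach to large files can be sketched together. The helper names archive_stream and read_member_in_chunks, the chunk size, and the sample payload are assumptions made for illustration; in-memory buffers stand in for real files.

```python
import io
import shutil
import zipfile

CHUNK_SIZE = 64 * 1024  # bytes moved per read/write call


def archive_stream(source, member_name, compression):
    """Stream a file-like object into a new in-memory archive, chunk by chunk."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=compression) as archive:
        # open(..., mode="w") returns a writable handle for a single member
        with archive.open(member_name, mode="w") as member:
            shutil.copyfileobj(source, member, CHUNK_SIZE)
    return buffer


def read_member_in_chunks(buffer, member_name):
    """Read a member back piece by piece and return its total length."""
    total = 0
    with zipfile.ZipFile(buffer) as archive:
        with archive.open(member_name) as member:
            while chunk := member.read(CHUNK_SIZE):
                total += len(chunk)
    return total


payload = b"highly repetitive line\n" * 50_000
stored = archive_stream(io.BytesIO(payload), "log.txt", zipfile.ZIP_STORED)
deflated = archive_stream(io.BytesIO(payload), "log.txt", zipfile.ZIP_DEFLATED)

print(f"Stored archive:   {len(stored.getvalue())} bytes")
print(f"Deflated archive: {len(deflated.getvalue())} bytes")
print(f"Bytes read back:  {read_member_in_chunks(deflated, 'log.txt')}")
```

For members that may exceed 4 GiB when their size is not known in advance, ZipFile.open() also accepts force_zip64=True.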