53.5 struct: Packing and Unpacking Binary Data

The struct module converts between Python values and C-style data structures represented as Python bytes objects. This is essential when working with binary files, network protocols, device drivers, or any other system that defines a tightly packed binary layout. Unlike pickle, whose format is Python-specific, struct reads and writes layouts that you describe explicitly, which makes it possible to exchange data with programs written in other languages.

Its core functions are pack() and unpack(). The pack(fmt, v1, v2, ...) function takes a format string and a series of values, and returns a bytes object containing the packed values. The unpack(fmt, buffer) function takes a format string and a buffer (e.g., a bytes object), and returns a tuple of unpacked values, even when the format describes a single item. The buffer's length must equal the size the format requires, as reported by calcsize().

Format Strings: The Blueprint for Data

The format string is the heart of the struct module. It defines the type, byte order, and size of the data you are handling. Apart from an optional prefix character and repeat counts, each character in the string describes one data item.

Byte Order, Size, and Alignment Characters: These optional prefix characters define crucial low-level details.
- @: Native byte order, size, and alignment (default).
- =: Native byte order, standard size, no alignment.
- <: Little-endian (least significant byte first).
- >: Big-endian (most significant byte first).
- !: Network byte order (which is always big-endian).
Specifying byte order is a critical best practice. Using the default native format (@) can create data that is unreadable on a system with a different architecture. For portable data, always explicitly use <, >, or !.
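
To make the effect of the prefix concrete, the following minimal sketch packs one 32-bit value under each explicit byte order and then reads big-endian bytes back with the wrong prefix. The value 0x12345678 and the variable names are arbitrary choices for illustration.

```python
import struct

value = 0x12345678  # arbitrary test value with four distinct bytes

little = struct.pack('<I', value)   # least significant byte first
big = struct.pack('>I', value)      # most significant byte first
network = struct.pack('!I', value)  # network order, same as big-endian

print(little.hex())    # 78563412
print(big.hex())       # 12345678
print(network == big)  # True

# Reading big-endian bytes as little-endian raises no error;
# it just yields a different number.
wrong = struct.unpack('<I', big)[0]
print(hex(wrong))      # 0x78563412
```

Note that the mismatch goes undetected by struct itself, which is why the byte order has to be agreed on by both the writer and the reader.
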
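Format Characters: These define the data type. The sizes listed are the standard sizes, which apply whenever a byte order prefix other than @ is given; in native mode (@), sizes follow the platform's C compiler, so a long, for example, may be 8 bytes.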
- x: Pad byte (no corresponding value; used for alignment).
- c: char (a bytes object of length 1).
- b: signed char (1 byte).
- B: unsigned char (1 byte).
- ?: _Bool (1 byte).
- h: short (2 bytes).
- H: unsigned short (2 bytes).
- i: int (4 bytes).
- I: unsigned int (4 bytes).
- l: long (4 bytes).
- L: unsigned long (4 bytes).
- q: long long (8 bytes).
- Q: unsigned long long (8 bytes).
- f: float (4 bytes).
- d: double (8 bytes).
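- s: char[] (bytes; the count is the length in bytes, e.g., 10s is one 10-byte value; without a count the length is 1).
- p: char[] (Pascal string; the first byte stores the length and the count is the total size including that byte).
- P: void * (pointer; available only in native mode).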

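A number before a format character is a repeat count (e.g., 3h means three shorts and consumes three values). The s and p characters are the exception: there the number is the length in bytes of a single value, so 10s is one 10-byte string. Whitespace between format characters is ignored, but a count and its character must not be separated.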
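
The difference between a repeat count and a length is easy to miss, so here is a small sketch of how counts, the s and p characters, the x pad byte, and whitespace behave. The variable names are illustrative only.

```python
import struct

# A count before h repeats the item: three values in, six bytes out.
three_shorts = struct.pack('<3h', 1, 2, 3)
print(three_shorts)  # b'\x01\x00\x02\x00\x03\x00'

# A count before s is a length: one value in, three bytes out.
one_string = struct.pack('<3s', b'abc')
print(one_string)    # b'abc'

# A value shorter than the declared length is padded with null bytes.
padded = struct.pack('<10s', b'Python')
print(padded)        # b'Python\x00\x00\x00\x00'

# p stores the actual length in the first byte, then pads to 10 bytes total.
pascal = struct.pack('<10p', b'Python')
print(pascal)        # b'\x06Python\x00\x00\x00'

# x emits a zero byte and consumes no value; whitespace between
# format characters makes no difference.
spaced = struct.pack('<h x h', 1, 2)
print(spaced)        # b'\x01\x00\x00\x02\x00'
print(spaced == struct.pack('<hxh', 1, 2))  # True
```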

import struct

# Packing integers with explicit little-endian byte order
packed_data = struct.pack('<2h', 1000, 2000)  # Pack two short integers
print(f"Packed bytes: {packed_data}")  # e.g., b'\xe8\x03\xd0\x07'

# Unpacking the same data
unpacked_data = struct.unpack('<2h', packed_data)
print(f"Unpacked tuple: {unpacked_data}")  # (1000, 2000)

# Handling a string (must be encoded to bytes)
name = "Python".encode('utf-8')
packed_name = struct.pack('<5s', name)  # '5s' silently truncates the 6-byte value to 5 bytes
print(f"Packed name: {packed_name}")  # b'Pytho'

Calculating Size and The `calcsize` Function

The struct.calcsize(fmt) function is indispensable. It returns the number of bytes required to store the data described by the format string. This is vital for reading fixed-length records from a file or network socket—you need to know how many bytes to read before you can unpack them.

format_string = '<I10sf'  # unsigned int, 10-char string, float
size = struct.calcsize(format_string)
print(f"Size of structure: {size} bytes")  # 4 + 10 + 4 = 18 bytes

# Use this size to read exactly one record from a binary file
with open('data.bin', 'rb') as f:
    record_data = f.read(size)  # Returns fewer than 18 bytes at end of file
    if len(record_data) == size:  # unpack() raises struct.error on a short buffer
        data_tuple = struct.unpack(format_string, record_data)
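
The same idea extends to a sequence of fixed-length records. The sketch below reuses the '<I10sf' layout and walks an in-memory bytes object in steps of calcsize(); the sample records, the field names, and the read_records helper are hypothetical, and ignoring a trailing partial record is one possible policy rather than the only reasonable one.

```python
import struct

RECORD_FORMAT = '<I10sf'  # unsigned int, 10-byte string, float
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 18

# Hypothetical sample data standing in for the contents of a binary file.
blob = b''.join(
    struct.pack(RECORD_FORMAT, rec_id, name, score)
    for rec_id, name, score in [(1, b'alice', 1.5), (2, b'bob', 2.25)]
)

def read_records(data):
    """Yield (id, name, score) for each complete record in data."""
    # Ignore any trailing bytes that do not form a whole record.
    usable = len(data) - len(data) % RECORD_SIZE
    for offset in range(0, usable, RECORD_SIZE):
        chunk = data[offset:offset + RECORD_SIZE]
        rec_id, raw_name, score = struct.unpack(RECORD_FORMAT, chunk)
        # The 10s field comes back null-padded; strip before decoding.
        yield rec_id, raw_name.rstrip(b'\x00').decode('utf-8'), score

records = list(read_records(blob))
print(records)  # [(1, 'alice', 1.5), (2, 'bob', 2.25)]
```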

Advanced Packing with `pack_into` and `unpack_from`

For working with pre-allocated buffers (like a bytearray), the pack_into(fmt, buffer, offset, v1, v2, ...) and unpack_from(fmt, buffer, offset=0) functions are used. They allow you to write to or read from a specific location (offset) within the buffer, which is essential for handling complex binary formats with headers and variable-length sections.

# Create a buffer of 20 bytes, initialized to zeros
buffer = bytearray(20)

# Pack data starting at offset 5
struct.pack_into('<hH', buffer, 5, -10, 65000)
print(f"Buffer after pack_into: {buffer}")

# Unpack data starting from offset 5
data = struct.unpack_from('<hH', buffer, 5)
print(f"Data unpacked from offset 5: {data}")  # (-10, 65000)

Common Pitfalls and Best Practices

Byte Order Neglect: The most common mistake is omitting the byte order specifier. Always explicitly define it (<, >, !) for portable data. Data written with the native @ format can be misread, without any error being raised, on a machine with a different byte order, type sizes, or alignment rules.
Size Mismatches: The number of values must match the number of items in the format, and each value must fit the range of its C type. Packing a Python integer that is too large for a C short (h), or passing too few or too many values, raises struct.error.
Buffer Management: When using pack_into, ensure the offset plus the calculated size of the structure does not exceed the length of the buffer, or struct.error is raised.
Handling Strings: Remember that the s format character packs bytes, not Unicode strings. You must .encode() your strings first. When unpacking, you get a bytes object of the full declared length, so you may need to strip trailing null padding before calling .decode().
Data Alignment: Be aware of padding bytes. Like a C compiler, the native @ format inserts padding automatically so that each item is aligned to its natural memory boundary; this implicit padding is separate from the explicit x pad byte. If you need a packed structure with no implicit padding, use <, >, !, or = (native byte order with standard sizes), all of which select standard sizes and no alignment.
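
Several of these pitfalls can be reproduced in a few lines. The following sketch is illustrative only: the exact error messages vary between Python versions, and the native size in the last step depends on the platform, so neither is shown as fixed output.

```python
import struct

errors = []

# Size mismatch 1: 40000 does not fit in a signed short (-32768..32767).
try:
    struct.pack('<h', 40000)
except struct.error as exc:
    errors.append(str(exc))

# Size mismatch 2: the format expects two values but only one is given.
try:
    struct.pack('<2h', 1)
except struct.error as exc:
    errors.append(str(exc))

print(f"struct.error raised {len(errors)} times")  # 2 times

# Handling strings: encode before packing, strip padding before decoding.
packed = struct.pack('<10s', 'Python'.encode('utf-8'))
(raw,) = struct.unpack('<10s', packed)
text = raw.rstrip(b'\x00').decode('utf-8')
print(repr(raw))  # b'Python\x00\x00\x00\x00'
print(text)       # Python

# Data alignment: native mode may pad between the char and the int.
standard_size = struct.calcsize('<bi')  # 1 + 4 = 5, never padded
native_size = struct.calcsize('@bi')    # platform-dependent, often 8
print(standard_size, native_size)
```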
