53.6 msgpack, cbor2, and Other Compact Binary Formats

While Python’s pickle module is convenient for Python-specific serialization, its lack of interoperability with other languages and inherent security risks make it unsuitable for many applications. This is where compact, language-agnostic binary formats like MessagePack (msgpack) and Concise Binary Object Representation (CBOR) excel. These formats offer a compelling blend of performance, small payload size, and cross-platform compatibility, making them ideal for network communication, data storage, and inter-process communication where Python is not the sole participant.

Why Choose msgpack or cbor2?

The primary advantage of these formats over pickle is standardization. pickle produces a byte stream that is tightly coupled to Python’s object model; a data structure pickled in Python cannot readily be unpickled in JavaScript or Go. MessagePack and CBOR, by contrast, are defined by public specifications, so any compliant library in any language can read what another one wrote. Furthermore, pickle is unsafe on untrusted input: unpickling can invoke arbitrary callables (for example, through an object’s __reduce__ method). msgpack and cbor2 decode data only, which removes that class of vulnerability. Their payloads are typically smaller than the equivalent JSON and often smaller than pickle output, though the exact difference depends on the data.
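The security concern can be made concrete with a small, harmless sketch that uses only the standard library. The Payload class is hypothetical and exists only for this illustration; it names print as the callable to run, but an attacker could name any importable callable.

```python
import pickle


class Payload:
    """Hypothetical class whose pickled form names a callable to run on load."""

    def __reduce__(self):
        # pickle stores (callable, args); pickle.loads() then calls callable(*args).
        # print is harmless, but any importable callable could be named here.
        return (print, ("this ran during unpickling",))


blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # prints the message as a side effect of loading

# The caller does not even get a Payload back, only the return value of print().
print(type(result).__name__)
```

A data-only decoder has no equivalent of this step: it can build integers, strings, lists, and maps, but it never looks up and calls a function named in the input.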

Working with the msgpack Library

MessagePack is often described as “JSON-like but binary and faster.” It defines a compact binary representation for basic data types like integers, floats, strings, arrays, and maps (dictionaries). The Python msgpack library provides a simple API reminiscent of the json module.

import msgpack

# Data to serialize
data = {
    "name": "Alice",
    "age": 30,
    "is_active": True,
    "tags": ["python", "data"],
    "coordinates": (44.3, -72.4)  # Will be serialized as a list
}

# Serialization (packing)
packed_data = msgpack.packb(data)
print(f"Packed data: {packed_data}")
print(f"Size: {len(packed_data)} bytes")

# Deserialization (unpacking)
unpacked_data = msgpack.unpackb(packed_data)
print(f"Unpacked data: {unpacked_data}")
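To see where the compactness comes from, the following sketch hand-encodes a tiny subset of the MessagePack wire format (single-byte headers for small maps, short strings, small non-negative integers, and booleans) using only the standard library. The function pack_small is a hypothetical helper written for this illustration and is not part of the msgpack library; for inputs inside this subset it should produce the same bytes that a conforming encoder emits.

```python
import json


def pack_small(obj):
    """Encode a tiny subset of MessagePack by hand (illustration only)."""
    if obj is True:
        return b"\xc3"
    if obj is False:
        return b"\xc2"
    if isinstance(obj, int) and 0 <= obj <= 0x7F:
        return bytes([obj])                        # positive fixint: the value itself
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw  # fixstr: length lives in the header
    if isinstance(obj, dict) and len(obj) <= 15:
        body = b"".join(pack_small(k) + pack_small(v) for k, v in obj.items())
        return bytes([0x80 | len(obj)]) + body     # fixmap: entry count in the header
    raise ValueError(f"outside the subset handled here: {obj!r}")


record = {"name": "Alice", "age": 30, "is_active": True}
packed = pack_small(record)
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

print(packed.hex(" "))
print(len(packed), "bytes as MessagePack vs", len(as_json), "bytes as JSON")
```

The savings come from dropping quotes, colons, commas, and braces: each short string costs one header byte plus its content, and a small integer or boolean costs a single byte.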

A crucial point to understand is type conversion. Python’s tuple is not a native msgpack type; it is serialized as an array and, by default, deserialized as a Python list (pass use_list=False to unpackb to get tuples instead). MessagePack also distinguishes a str type for text from a bin type for raw bytes. In msgpack 1.0 and later, packb uses the bin type for bytes objects by default (use_bin_type=True) and unpackb decodes str values to Python str (raw=False). Older releases had different defaults, so set these parameters explicitly if you need to support them.
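A short sketch of these conversions follows, assuming msgpack 1.0 or later is installed. The parameters are spelled out explicitly even where they match the defaults, so the behavior does not depend on the installed version; the sample data is made up for the illustration.

```python
import msgpack

original = {"point": (44.3, -72.4), "blob": b"\x00\x01raw", "label": "text"}

# use_bin_type=True keeps bytes and str distinct on the wire.
packed = msgpack.packb(original, use_bin_type=True)

# Default decoding: msgpack arrays come back as Python lists.
as_lists = msgpack.unpackb(packed, raw=False)

# use_list=False asks for tuples instead, for every array in the payload.
as_tuples = msgpack.unpackb(packed, raw=False, use_list=False)

print(type(as_lists["point"]).__name__)
print(type(as_tuples["point"]).__name__)
print(type(as_lists["blob"]).__name__, type(as_lists["label"]).__name__)
```

Note that use_list=False applies to the whole payload: values that were originally lists also come back as tuples, because the wire format has only one array type.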

Working with the cbor2 Library

CBOR is a more recent format that builds on ideas from MessagePack and JSON. It is an IETF standard (RFC 8949, which obsoletes RFC 7049) and offers richer type support, including big integers and semantic tags that mark values such as date/time strings or custom data types. The cbor2 library is a widely used Python implementation.

import cbor2
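from datetime import datetime, date, timezone

# Data with more complex types
data = {
    # cbor2 rejects naive datetimes unless a default timezone is configured
    "timestamp": datetime.now(timezone.utc),
    # Native date encoding may need a recent cbor2 release; on older
    # versions, pass date_as_datetime=True to dump()
    "birthday": date(1992, 5, 15),
    "large_number": 2**64,
    "binary_data": b"some raw bytes"
}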

# Serialization (dumping)
with open('data.cbor', 'wb') as f:
    cbor2.dump(data, f)

# Deserialization (loading)
with open('data.cbor', 'rb') as f:
    loaded_data = cbor2.load(f)

print(loaded_data)
print(f"Timestamp type: {type(loaded_data['timestamp'])}")

CBOR’s strength is its tagging mechanism, which lets a value carry its meaning with it. cbor2 serializes a timezone-aware datetime as a tagged CBOR item (by default an RFC 3339 string) and deserializes it back into a datetime object, so the type survives the round trip and other CBOR implementations can recognize it. Naive datetimes are rejected unless you pass a default timezone to the encoder, and native date support depends on the cbor2 version.

Common Pitfalls and Best Practices

Type Fidelity: The most common surprise is the loss of specific Python types. With msgpack, a tuple comes back as a list by default, and a set is not supported natively, so it must be sent as a list or through an extension type. cbor2 also decodes arrays as lists, although it can round-trip sets through a registered tag. Always check the mapping between Python types and the serialization format’s types.
Custom Objects: Neither format can serialize arbitrary Python class instances out of the box. You must convert your object into a serializable representation (e.g., a dictionary) and reconstruct it on load. Note that __getstate__ and __setstate__ belong to the pickle protocol and are not consulted by msgpack or cbor2; use the libraries’ hooks or extension mechanisms instead.

Encoding and Decoding Hooks: Both libraries provide hooks to handle non-native types: default for encoding plus object_hook or ext_hook for decoding in msgpack, and default plus tag_hook or object_hook in cbor2. This is the recommended way to add support for types a library does not handle natively, such as complex or custom classes.

import msgpack
from decimal import Decimal

def encode_custom(obj):
    if isinstance(obj, Decimal):
        # Convert to a string or a tuple representation
        return {'__class__': 'Decimal', 'value': str(obj)}
    raise TypeError(f"Object of type {type(obj)} is not serializable")

def decode_custom(obj):
    if '__class__' in obj and obj['__class__'] == 'Decimal':
        return Decimal(obj['value'])
    return obj

data = {'price': Decimal('19.99')}
packed = msgpack.packb(data, default=encode_custom)
unpacked = msgpack.unpackb(packed, object_hook=decode_custom)
# unpacked['price'] is a Decimal object

Performance: packb()/unpackb() and dumps()/loads() operate on in-memory bytes objects (not strings) and are fine for small messages. When reading from or writing to files or sockets, prefer the stream-oriented functions (msgpack.pack()/unpack(), cbor2.dump()/load()) or msgpack’s Packer and Unpacker classes, so that a large payload or a long sequence of records does not have to be held in memory all at once. Measure before optimizing, since the difference depends on the data and the I/O path.
Schema Evolution: When these formats are used for long-term storage, plan for how your data structures might change. Adding new optional fields is usually safe, but renaming or removing fields can break backward compatibility. Consider a schema-based format such as Protocol Buffers if strict schema enforcement is required.
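The stream-oriented approach from the Performance item can be sketched as follows, assuming msgpack is installed. A BytesIO object stands in for a file or socket here, and the records are made up for the illustration.

```python
import io

import msgpack

records = [{"id": i, "tags": ["python", "data"]} for i in range(3)]

# Write the records back to back; MessagePack values are self-delimiting,
# so no separator or length prefix is needed between them.
stream = io.BytesIO()  # stands in for a file opened in binary mode
for record in records:
    msgpack.pack(record, stream)

# Read them back one at a time instead of loading everything at once.
stream.seek(0)
decoded = []
for record in msgpack.Unpacker(stream, raw=False):
    decoded.append(record)

print(decoded)
```

Because the Unpacker yields one object at a time, the consumer can process each record and discard it before reading the next, which keeps memory use roughly proportional to the largest record rather than to the whole stream.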