9.1 bytes: Immutable Binary Sequences
The bytes type in Python represents an immutable sequence of integers in the range 0 <= x < 256. Conceptually, it is the immutable counterpart to the mutable bytearray. Its immutability means that once a bytes object is created, its contents cannot be altered, appended to, or removed from. This design choice is fundamental and has significant implications for its behavior, performance, and use cases.
The immutability of bytes provides several key advantages. First, it allows for hashability, meaning bytes objects can be used as keys in dictionaries or elements in sets, which is crucial for tasks like caching or deduplication of binary data. Second, it guarantees data integrity. When a function receives a bytes object, it can be certain that the data will not be unexpectedly modified by another part of the program, making it safer for concurrent operations. Finally, immutability enables certain optimizations by the Python interpreter, as it can safely reuse existing bytes objects or make assumptions about their unchanging state.
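The hashability point can be illustrated with a small deduplication sketch. The chunk values and the names `chunks`, `seen`, `unique_chunks`, and `counts` are made up for illustration and stand in for whatever binary data a real program would handle.

```python
# Hypothetical binary chunks, e.g. blocks read from a file or socket
chunks = [b'\x00\x01', b'\xff\xfe', b'\x00\x01', b'data', b'\xff\xfe']

# bytes objects are hashable, so they can be set members
seen = set()
unique_chunks = []
for chunk in chunks:
    if chunk not in seen:
        seen.add(chunk)
        unique_chunks.append(chunk)

print(unique_chunks)  # [b'\x00\x01', b'\xff\xfe', b'data']

# They can also serve as dictionary keys, here to count occurrences
counts = {}
for chunk in chunks:
    counts[chunk] = counts.get(chunk, 0) + 1
print(counts[b'\x00\x01'])  # 2

# A bytearray is mutable and therefore not hashable
try:
    hash(bytearray(b'data'))
except TypeError as e:
    print(f"Error: {e}")
```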
Creating bytes Objects
There are several primary ways to instantiate a bytes object. The most common method is using a literal, prefixed with a b.
# Creating from a literal
data = b'Hello World'
print(data) # Output: b'Hello World'
print(type(data)) # Output: <class 'bytes'>
# Creating from an iterable of integers
byte_sequence = bytes([72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100])
print(byte_sequence) # Output: b'Hello World'
# Creating from an existing object (like a string) and specifying encoding
text_string = "Hello World"
encoded_data = bytes(text_string, encoding='utf-8')
print(encoded_data) # Output: b'Hello World'
# Creating a zero-filled sequence of a specific length
zero_bytes = bytes(5)
print(zero_bytes) # Output: b'\x00\x00\x00\x00\x00'
It is critical to understand the distinction between the literal syntax and creating bytes from a string. The literal b'abc' creates a bytes object directly. Using bytes('abc', encoding='utf-8') first creates a Unicode string object, then encodes it into a sequence of bytes according to the specified encoding rules. Forgetting to specify the encoding parameter in this constructor is a common pitfall and will raise a TypeError.
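A short sketch of the pitfall described above, plus the `str.encode()` spelling of the same conversion. The variable names are illustrative only.

```python
# Omitting the encoding when passing a str raises TypeError
try:
    bytes('abc')
except TypeError as e:
    print(f"Error: {e}")

# str.encode() performs the same conversion as the two-argument constructor
from_constructor = bytes('abc', encoding='utf-8')
from_method = 'abc'.encode('utf-8')
print(from_constructor == from_method == b'abc')  # True

# Bytes literals accept only ASCII characters; other byte values need
# \x escapes, or the text can be encoded from a str instead
escaped = b'caf\xc3\xa9'
encoded = 'café'.encode('utf-8')
print(escaped == encoded)  # True
```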
Operations and Indexing
As a sequence, bytes supports most of the common operations you would expect, such as indexing, slicing, and iteration. However, due to its immutability, any operation that would modify the sequence returns a new bytes object instead.
data = b'Python'
# Indexing returns an integer representing the byte at that position
print(data[0]) # Output: 80 (the ASCII value for 'P')
# Slicing returns a new bytes object
slice_of_data = data[1:4]
print(slice_of_data) # Output: b'yth'
# Concatenation with the + operator creates a new object
new_data = data + b' Programming'
print(new_data) # Output: b'Python Programming'
# The in operator can be used for membership testing
print(b'Py' in data) # Output: True
print(b'py' in data) # Output: False (case-sensitive)
# Iteration yields integers
for byte in data:
    print(byte, end=' ')
# Output: 80 121 116 104 111 110
A common point of confusion arises from indexing. While data[0] returns an integer, a slice like data[0:1] returns a bytes object of length 1. This matches list and tuple, where indexing yields an element and slicing yields a sequence, but it differs from str, where both indexing and slicing return a string. The difference is often surprising when first working with binary data.
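The indexing behavior is easiest to see side by side with str. This is a minimal comparison; the last few lines show a practical consequence when testing for a particular byte.

```python
data = b'Python'
text = 'Python'

print(data[0])    # 80
print(data[0:1])  # b'P'
print(text[0])    # P
print(text[0:1])  # P

# An indexed byte is an int, so comparing it to a bytes literal is False
print(data[0] == b'P')        # False
print(data[0] == ord('P'))    # True
print(data[0:1] == b'P')      # True
print(data.startswith(b'P'))  # True
```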
Immutability in Practice
Attempting to change a single byte within a bytes object will result in a TypeError. This is the core manifestation of its immutable nature.
data = b'Hello'
try:
    data[0] = 90 # Try to change 'H' (72) to 'Z' (90)
except TypeError as e:
    print(f"Error: {e}") # Output: Error: 'bytes' object does not support item assignment
To “modify” the data, you must create a new object. For a single change, slicing and concatenation is sufficient, though each step allocates a new bytes object. For many changes or more complex transformations, copy the data into a bytearray (which is mutable), edit it in place, and convert it back to bytes.
# Simple way for a one-off change: each + creates an intermediate object
original = b'abcd'
modified = original[:1] + b'Z' + original[2:]
print(modified) # Output: b'aZcd'
# Efficient way for complex changes: use a mutable bytearray
mutable_version = bytearray(original)
mutable_version[1] = 90 # ASCII code for 'Z'
new_immutable_version = bytes(mutable_version)
print(new_immutable_version) # Output: b'aZcd'
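When every byte needs to change, the bytearray route avoids building a new bytes object per step. The sketch below uses a hypothetical helper, `xor_mask`, which is not part of the standard library; it only illustrates the copy, edit in place, convert back pattern.

```python
def xor_mask(payload: bytes, key: int) -> bytes:
    """Return a new bytes object with every byte XOR-ed with key (0-255)."""
    buffer = bytearray(payload)   # one mutable copy of the input
    for i in range(len(buffer)):
        buffer[i] ^= key          # in-place edit, no new object per step
    return bytes(buffer)          # one final immutable copy

original = b'abcd'
masked = xor_mask(original, 0x20)
print(masked)                  # b'ABCD'
print(xor_mask(masked, 0x20))  # b'abcd'
print(original)                # b'abcd' (the input is never modified)
```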
Relationship with Strings and Encoding
Perhaps the most important concept to grasp is that bytes represents encoded data, while str represents Unicode text. A bytes object without knowledge of its encoding is just a sequence of numbers. Interpreting it as text requires decoding it using the correct character encoding.
# Encoding a string to bytes
text = "café"
utf8_bytes = text.encode('utf-8')
print(utf8_bytes) # Output: b'caf\xc3\xa9' (the é is encoded as two bytes)
# Decoding bytes back to a string
decoded_text = utf8_bytes.decode('utf-8')
print(decoded_text) # Output: café
# Trying to decode with the wrong encoding leads to errors
try:
    decoded_wrong = utf8_bytes.decode('ascii')
except UnicodeDecodeError as e:
    print(f"Decoding error: {e}")
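A wrong encoding does not always announce itself with an exception. The sketch below shows two related behaviors: a permissive encoding such as latin-1 decodes any byte sequence and can silently produce garbled text, and the `errors` parameter of `decode()` changes how undecodable bytes are handled. The variable names are illustrative.

```python
utf8_bytes = 'café'.encode('utf-8')  # b'caf\xc3\xa9'

# latin-1 maps every byte to a character, so no error is raised,
# but the two UTF-8 bytes of é become two unrelated characters
mojibake = utf8_bytes.decode('latin-1')
print(mojibake)  # cafÃ©

# errors='replace' substitutes U+FFFD for each undecodable byte
replaced = utf8_bytes.decode('ascii', errors='replace')
print(len(replaced))  # 5

# errors='ignore' drops undecodable bytes entirely
ignored = utf8_bytes.decode('ascii', errors='ignore')
print(ignored)  # caf
```

Both `replace` and `ignore` lose information, so they suit logging or display better than data that must round-trip.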
A critical best practice is to decode input bytes to strings at the earliest possible point in your program (the “unicode sandwich” approach) and encode strings back to bytes as late as possible when outputting. You should never mix bytes and str objects in a single operation. Concatenation and ordering comparisons (such as <) raise a TypeError, while an equality test between bytes and str raises nothing and simply evaluates to False, which can hide bugs.
b_data = b'hello'
s_data = 'world'
try:
    result = b_data + s_data
except TypeError as e:
    print(f"Error: {e}") # Output: Error: can't concat str to bytes
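The “unicode sandwich” can be sketched as a function that decodes at the input boundary, works only on str, and encodes at the output boundary. The helper name `shout` and its default encoding are assumptions made for this example, not an established API.

```python
def shout(raw: bytes, encoding: str = 'utf-8') -> bytes:
    """Hypothetical helper: decode on input, work on str, encode on output."""
    text = raw.decode(encoding)        # bytes -> str at the boundary
    processed = text.strip().upper()   # all text logic operates on str
    return processed.encode(encoding)  # str -> bytes just before output

print(shout(b'  caf\xc3\xa9 \n'))  # b'CAF\xc3\x89'

# Equality between bytes and str raises nothing; it is simply False
print(b'hello' == 'hello')                  # False
print(b'hello'.decode('ascii') == 'hello')  # True

# Ordering comparisons between the two types do raise TypeError
try:
    b'hello' < 'hello'
except TypeError as e:
    print(f"Error: {e}")
```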