89.2 Encoding in Practice: UTF-8, UTF-16, and Latin-1
Right, let’s get our hands dirty with the actual bytes. You’ve probably heard “just use UTF-8” as a mantra, and 99% of the time, that’s brilliant advice. But it’s our job to understand the why behind the mantra, so we know how to handle that other 1% and, more importantly, so we can debug the spectacularly weird errors that happen when this goes wrong.
First, a crucial distinction. Unicode is the grand, all-encompassing catalog of every character we might want to use, with each character assigned a number called a code point. An encoding is the map that turns those code points into bytes. UTF-8 and UTF-16 are different encodings of the full Unicode catalog; the ancient Latin-1 can only represent its first 256 code points. Think of Unicode as the idea of “the number 42,” and UTF-8 as one specific way to write that idea down as bytes (0x34, 0x32 for the digits “4” and “2”).
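To make the catalog-versus-encoding split concrete, here is a minimal sketch that takes one character and writes it out three ways. The character é and the variable names are just illustrative choices.

```python
# One abstract character, one code point, three different byte sequences.
char = "\u00e9"                       # é, written as an escape to avoid source-encoding surprises
code_point = ord(char)                # 233, i.e. U+00E9

as_utf8 = char.encode("utf-8")        # two bytes: C3 A9
as_utf16 = char.encode("utf-16-le")   # two bytes: E9 00 (little-endian, no BOM)
as_latin1 = char.encode("latin-1")    # one byte:  E9

print(f"U+{code_point:04X}", as_utf8.hex(), as_utf16.hex(), as_latin1.hex())

# Decoding with the matching codec always recovers the same character.
decoded = {
    as_utf8.decode("utf-8"),
    as_utf16.decode("utf-16-le"),
    as_latin1.decode("latin-1"),
}
```

The code point never changes; only the bytes do. That is the whole distinction in a nutshell.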
The Undisputed Champion: UTF-8
UTF-8 is the internet’s default for a reason: it’s brilliantly designed. It’s a variable-length encoding, meaning a single character can take 1, 2, 3, or even 4 bytes. The sheer genius is in its backwards compatibility with ASCII. The first 128 Unicode code points (the boring old ASCII stuff) are encoded in a single byte, and they are identical to their ASCII counterparts.
This means any system that expects ASCII (which is, well, everything old) can happily read a UTF-8 string containing only English letters and not throw a tantrum. It’s a Trojan horse of goodness. Here’s how it works:
# Let's encode a simple string containing an ASCII character and a non-ASCII one.
text = "A naïve"
# Encoding: from string (Unicode) to bytes (UTF-8)
utf8_bytes = text.encode('utf-8')
print(utf8_bytes) # b'A na\xc3\xafve'
# See that? 'A' and ' ' are single bytes: 0x41 and 0x20.
# The 'ï' (U+00EF) is two bytes: 0xc3 and 0xaf.
The beauty is in the bit patterns. A byte starting with 0 is a single-byte character. A byte starting with 110 is the start of a two-byte character, 1110 a three-byte, and 11110 a four-byte. The following bytes all start with 10 to signal they are continuations. This structure makes it robust; if you land in the middle of a character, you skip forward past any bytes starting with 10 and you’re back in sync at the next character.
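If you want to see those prefixes for yourself, a small sketch like the following prints the bits of each byte. The helper names show_bits and is_continuation are made up for this example, and the sample characters are arbitrary picks of each length.

```python
def show_bits(text):
    """Map each character to its UTF-8 bytes, rendered as 8-bit binary strings."""
    return {ch: [f"{b:08b}" for b in ch.encode("utf-8")] for ch in text}


def is_continuation(byte):
    """True if the byte has the 10xxxxxx shape of a continuation byte."""
    return (byte & 0b1100_0000) == 0b1000_0000


# One character of each length: A (1 byte), ï (2), € (3), 🐉 (4).
patterns = show_bits("A\u00ef\u20ac\U0001F409")
for ch, bits in patterns.items():
    print(f"U+{ord(ch):04X}", " ".join(bits))

# Resynchronizing: starting from an arbitrary offset, skip continuation bytes.
data = "\u20acA".encode("utf-8")      # E2 82 AC 41
offset = 1                            # deliberately lands inside the euro sign
while offset < len(data) and is_continuation(data[offset]):
    offset += 1
```

After the loop, offset points at the next lead byte (here, the "A"), which is all that resynchronizing amounts to.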
The Pitfall: The variable-length nature means you cannot randomly split a UTF-8 string on a byte boundary. If you cut between a lead byte and its continuation bytes, each half fails to decode on its own (you get an error or replacement characters, depending on how strict the decoder is). Always split on a character boundary, which any decent string library will do for you.
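Here is a rough sketch of that pitfall, and of one way around it when the split is not under your control (say, data arriving in network-sized chunks). It leans on the incremental decoder from Python's standard codecs module; the chunk boundary is chosen by hand to land mid-character.

```python
import codecs

data = "na\u00efve".encode("utf-8")   # 6 bytes; the ï is the pair C3 AF
chunks = [data[:3], data[3:]]         # cut right between C3 and AF

# Decoding the first chunk on its own fails: it ends with a dangling lead byte.
try:
    chunks[0].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# An incremental decoder holds back the incomplete sequence until the rest arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b"", final=True))   # flush; raises if bytes are left over
result = "".join(pieces)
```

The first piece comes back as just "na", with the 0xC3 quietly buffered until its partner shows up in the next chunk.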
The Messy Legacy: UTF-16
If UTF-8 is the elegant new architecture, UTF-16 is the city’s old, confusing plumbing. It was designed when we thought 65,536 code points (a “16-bit” space) would be enough for everyone. We were colossally wrong. So UTF-16 is also variable-length: most common characters fit in a single 16-bit “code unit” (2 bytes), but ones outside the “Basic Multilingual Plane” (BMP) require a surrogate pair: two 16-bit code units, or 4 bytes.
This is where the infamous pain begins. The existence of surrogate pairs means the number of code units is not always the same as the number of characters. UTF-16 also introduces endianness, the order of the two bytes within each code unit: is 0x12 0x34 the number 0x1234 or 0x3412? This is why you see a Byte Order Mark (BOM), the code point U+FEFF, at the start of files: it shows up as the bytes 0xFE 0xFF in big-endian data and 0xFF 0xFE in little-endian data.
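The endianness question is easy to poke at directly. This is a small sketch using U+1234 (picked only because its two bytes are conveniently 0x12 and 0x34) and the BOM constants from the standard codecs module.

```python
import codecs

ch = "\u1234"                          # any BMP character with two distinct bytes would do

big = ch.encode("utf-16-be")           # bytes 12 34
little = ch.encode("utf-16-le")        # bytes 34 12

# Read the bytes with the wrong byte order and you silently get a different character.
wrong = big.decode("utf-16-le")        # U+3412, not U+1234

# With a BOM in front, the generic "utf-16" codec works out the byte order itself.
from_be = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + little).decode("utf-16")

print(big.hex(), little.hex(), f"U+{ord(wrong):04X}")
```

Note that the wrong-endian decode does not raise anything. It just hands you a perfectly valid, perfectly wrong character, which is what makes this class of bug so entertaining.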
text = "A naïve 🐉" # Note the dragon emoji, outside the BMP.
utf16_bytes = text.encode('utf-16')
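print(utf16_bytes) # b'\xff\xfeA\x00 \x00n\x00a\x00\xef\x00v\x00e\x00 \x00=\xd8\t\xdc' (on a little-endian machine)
# It's a mess! It starts with the BOM (0xFF 0xFE for little-endian); plain 'utf-16' uses the machine's native byte order.
# The emoji (U+1F409) is encoded as the surrogate pair 0xD83D 0xDC09, which shows up above as '=\xd8' and '\t\xdc'.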
# A common mistake is treating a UTF-16 string as if it were fixed-width.
utf16_code_units = text.encode('utf-16le') # Specify little-endian, no BOM.
length_in_code_units = len(utf16_code_units) // 2
print(f"String length: {len(text)}") # 9 code points
print(f"UTF-16 code units: {length_in_code_units}") # 10. Not the same!
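The Pitfall: Never assume a length function gives you a character count on UTF-16 data. Java’s String.length(), for instance, counts 16-bit code units, not characters, and a byte-oriented C strlen() is worse still, because it stops at the first zero byte (which is the second byte of a little-endian “A”). This is the root of countless bugs in systems like Windows and Java, which made the unfortunate early bet on UTF-16 as their internal string representation.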
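The surrogate-pair arithmetic is short enough to write out by hand. The sketch below follows the standard UTF-16 rule (subtract 0x10000, split the remaining 20 bits in half); the function names to_surrogate_pair and utf16_length are invented for this example.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = code_point - 0x10000          # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_length(text):
    """Count UTF-16 code units, the number a UTF-16 based length function reports."""
    return len(text.encode("utf-16-le")) // 2


high, low = to_surrogate_pair(ord("\U0001F409"))   # the dragon emoji
print(hex(high), hex(low))

sample = "A na\u00efve \U0001F409"
code_points = len(sample)                  # Python counts code points
code_units = utf16_length(sample)          # one more, thanks to the surrogate pair
```

For the dragon this gives 0xD83D and 0xDC09, and the sample string comes out at 9 code points but 10 code units.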
The Zombie Encoding: Latin-1 (ISO-8859-1)
Latin-1 is the ghost at the feast. It’s a fixed-width, single-byte encoding that covers most Western European languages. Its fatal attraction is that the first 256 code points of Unicode were deliberately made to align with Latin-1. This creates a deceptively easy conversion path.
This is the source of the “mojibake” you see all over the web: Ã± instead of ñ. This happens when software naively assumes bytes are Latin-1, but they’re actually UTF-8, so each byte of a multi-byte character gets displayed as its own Latin-1 character.
# How mojibake is born
original_text = "pañal"
utf8_bytes = original_text.encode('utf-8') # b'pa\xc3\xb1al'
# Now, something stupid happens: we decode those UTF-8 bytes as if they were Latin-1.
mojibake = utf8_bytes.decode('latin-1')
print(mojibake) # paÃ±al
# The reverse (decoding Latin-1 bytes as UTF-8) is also common and just as bad: it usually fails outright with a UnicodeDecodeError.
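Two follow-ups are worth sketching here: undoing the damage, and what the reverse mistake looks like. The repair_mojibake helper below is a hypothetical name, and it only handles the specific UTF-8-read-as-Latin-1 case; real-world mojibake often involves other code pages (Windows-1252 is a frequent culprit), which this does not attempt to cover.

```python
def repair_mojibake(garbled):
    """Undo 'UTF-8 bytes decoded as Latin-1', leaving other strings untouched."""
    try:
        return garbled.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return garbled                     # not this kind of mojibake; don't make it worse


garbled = "pa\u00f1al".encode("utf-8").decode("latin-1")   # 'paÃ±al'
fixed = repair_mojibake(garbled)                            # 'pañal' again

# The reverse mistake: Latin-1 bytes handed to a UTF-8 decoder.
latin1_bytes = "pa\u00f1al".encode("latin-1")              # the ñ is the single byte F1
try:
    latin1_bytes.decode("utf-8")
    reverse_failed = False
except UnicodeDecodeError:
    reverse_failed = True

# A lenient decoder hides the error behind U+FFFD, and the original letter is gone for good.
lossy = latin1_bytes.decode("utf-8", errors="replace")
```

The first kind of damage is reversible because Latin-1 maps every byte to a character and back. The second kind, once it has been papered over with replacement characters, is not.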
The Pitfall: The number one rule: know your encoding. You cannot guess. You must know. If you’re consuming data from an HTTP response, look for the charset parameter of the Content-Type header (for example, charset=utf-8). If you’re reading a file, the source should tell you. If you’re dealing with a system that stubbornly emits Latin-1, your first step is to decode those bytes to a string using latin-1; once you’ve done that, you have a proper Unicode string to work with.
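As a sketch of “know, don’t guess” for the HTTP case, the snippet below pulls the charset out of a Content-Type value using the standard library’s email.message.Message (HTTP borrows the same header syntax). The function names and the sample header values are made up for illustration, and refusing to decode when no charset is declared is one possible policy, not the only sensible one.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the lower-cased charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def decode_body(body, content_type):
    """Decode bytes using the declared charset; refuse to guess if none is declared."""
    charset = charset_from_content_type(content_type)
    if charset is None:
        raise ValueError("no charset declared; refusing to guess")
    return body.decode(charset)


# A legacy system that stubbornly emits Latin-1, and says so.
legacy_text = decode_body(b"pa\xf1al", "text/plain; charset=ISO-8859-1")

# A modern one sending UTF-8.
modern_text = decode_body(b"pa\xc3\xb1al", "text/html; charset=UTF-8")
```

Both calls end up with the same Unicode string, which is the point: once the bytes are decoded with the right codec, it no longer matters where they came from.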
The Best Practice: Use UTF-8 everywhere. For storage, for transmission, for logs. Specify it explicitly in your code; don’t rely on default system settings, which are a time bomb. Be vigilant when interacting with older systems (file paths on Windows, some legacy databases) and know that you might need to perform an explicit conversion from their preferred encoding to your UTF-8 world. It’s a bit of paperwork, but it beats cleaning up mojibake later.
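In Python terms, “specify it explicitly” mostly means passing encoding= every time you open a text file, and doing legacy conversions as a deliberate decode-then-encode step. A minimal sketch, using a throwaway temp directory and a made-up file name:

```python
import os
import tempfile

text = "pa\u00f1al \U0001F409"
path = os.path.join(tempfile.mkdtemp(), "example.txt")   # hypothetical scratch file

# Always name the encoding; the platform default varies between systems.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "rb") as f:
    raw = f.read()                     # the exact bytes that landed on disk

with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

# The "paperwork" for a legacy source: decode with its encoding, re-encode as UTF-8.
legacy_bytes = b"pa\xf1al"             # Latin-1 bytes from some older system
as_utf8 = legacy_bytes.decode("latin-1").encode("utf-8")
```

The conversion is two method calls. The hard part is never the code; it is knowing that legacy_bytes really was Latin-1 in the first place.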