89.1 Unicode Deep Dive: Code Points, Planes, and Normalization
Right, let’s get into the weeds. You’ve probably heard that “Unicode solves everything.” It mostly does, but it does so by trading a simple, obvious problem (mapping one character to one number) for a complex, robust solution (mapping human text to a system of codes, rules, and algorithms). It’s a fantastic trade, but you need to understand its machinery or it will bite you.
Think of Unicode not as a simple character set but as a database. Its core abstraction is the code point. A code point is just a number, written as U+XXXX, where XXXX is four to six hexadecimal digits. For example, the code point for the letter ‘A’ is U+0041. This number isn’t the bytes you’ll store it in; it’s the abstract identity of the character. The range of possible code points runs from U+0000 to U+10FFFF. That’s 1,114,112 slots, most of them still unassigned. We call this entire space the codespace.
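To make the "a code point is just a number" idea concrete, here is a minimal Python sketch. The helper name describe is made up for illustration; ord and chr are Python built-ins.

```python
# Sketch: moving between characters and their code points.
# `describe` is a hypothetical helper name, not a standard function.

MAX_CODE_POINT = 0x10FFFF
CODESPACE_SIZE = MAX_CODE_POINT + 1  # every slot from U+0000 to U+10FFFF


def describe(ch: str) -> str:
    """Return the U+XXXX notation for a string holding one code point."""
    return f"U+{ord(ch):04X}"


print(describe("A"))    # U+0041
print(chr(0x41))        # A
print(describe("😀"))   # U+1F600
print(CODESPACE_SIZE)   # 1114112
```

Asking for anything past the end of the codespace, such as chr(0x110000), raises a ValueError.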
The Planes: A Question of Real Estate
Now, 1.1 million is a big number. Unicode divides this codespace into 17 logical chunks called planes. Each plane is a range of 65,536 (2^16) consecutive code points. The most important one by far is Plane 0, the Basic Multilingual Plane (BMP). This covers code points from U+0000 to U+FFFF and contains almost every character you’ve ever heard of: Latin, Greek, Cyrillic, Chinese, Japanese, Korean, Arabic, you name it. This is why for a long time, people (and programming languages) could pretend a “character” was 16 bits and get away with it.
The other 16 planes (U+10000 to U+10FFFF) are called supplementary planes, or informally, astral planes. This is where the fun stuff lives: ancient scripts, rarer CJK ideographs, musical notation, mathematical alphanumeric symbols, and of course, everyone’s favorite, most emoji. The “Grinning Face” emoji (😀) isn’t in the BMP; it’s U+1F600 in the first supplementary plane, Plane 1 (the Supplementary Multilingual Plane).
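Since each plane holds exactly 2^16 code points, the plane number is just the code point shifted right by 16 bits. A quick sketch, with plane_of and is_bmp as illustrative names:

```python
# Sketch: which plane does a character live in?
# `plane_of` and `is_bmp` are hypothetical helper names.


def plane_of(ch: str) -> int:
    """Return the plane number (0 to 16) of a single code point."""
    return ord(ch) >> 16  # each plane spans 0x10000 code points


def is_bmp(ch: str) -> bool:
    """True if the code point is in Plane 0, the BMP."""
    return plane_of(ch) == 0


print(plane_of("A"))            # 0 (BMP)
print(plane_of("😀"))           # 1 (Supplementary Multilingual Plane)
print(plane_of("\U00020000"))   # 2 (a CJK ideograph in Plane 2)
print(plane_of("\U0010FFFF"))   # 16 (the last plane)
print(is_bmp("😀"))             # False
```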
This is the first “gotcha.” If you’re using a language or a library that claims to use “UCS-2” encoding, run. UCS-2 is a relic that only handles the BMP. It’s literally incapable of representing anything in the supplementary planes, which includes most emoji. You need UTF-8, UTF-16, or UTF-32.
Encoding: From Abstract Idea to Concrete Bytes
A code point is an idea. To store it or send it, you need to encode it into bytes. This is where UTF-8, UTF-16, and UTF-32 come in.
- UTF-32: The simpleton’s choice. Every code point, whether it’s U+0041 or U+1F4A9 (💩), is stored as a full 4-byte number. It’s predictable but wasteful for most text, which lives largely in the BMP, so it’s rarely used for storage or transmission.
- UTF-16: The trickster. It uses 2 bytes for any character in the BMP. But for a supplementary character (like most emoji), it uses a surrogate pair: two special 2-byte code units that, when combined, represent a single code point from an astral plane. If your string processing is naive (e.g., counting char elements in a Java String), it will see two “characters” instead of one. This breaks everything from substring() operations to simply counting user-perceived characters.
- UTF-8: The champion of the modern web. It’s variable-length: 1 byte for ASCII, 2 or 3 bytes for the rest of the BMP, and 4 bytes for astral characters. Its genius is its backwards compatibility with ASCII: any valid ASCII text is already valid UTF-8. The catch is that byte-oriented functions count bytes, not characters. C’s strlen("café") returns 5 on UTF-8 data (with é stored as U+00E9), while a Unicode-aware length function reports 4.
Here’s the practical difference. Let’s take the pile of poo emoji (U+1F4A9).
// JavaScript example: It 'gets' Unicode for the most part.
const poo = '💩';
console.log(poo.length); // 2 <- WAIT, WHAT? Gotcha! JS uses UTF-16 under the hood.
console.log(Array.from(poo).length); // 1 <- Correct. Use modern methods!
// Let's see its UTF-8 encoding in Node.js
const buffer = Buffer.from(poo, 'utf8');
console.log(buffer); // <Buffer f0 9f 92 a9> (4 bytes)
console.log(buffer.length); // 4
// And its actual code point
console.log(poo.codePointAt(0).toString(16)); // "1f4a9" <- The correct code point.
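For comparison, here is a rough Python sketch of the same emoji across all three encodings, plus the surrogate pair arithmetic that UTF-16 performs. The function name to_surrogates is made up for illustration; the "-be" codec names are used so that Python does not prepend a byte order mark.

```python
# Sketch: encoded sizes and UTF-16 surrogate pair arithmetic.
# `to_surrogates` is a hypothetical helper, not a standard library function.
from typing import Tuple


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = cp - 0x10000                # 20 bits of payload
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


poo = "\U0001F4A9"
high, low = to_surrogates(ord(poo))
print(f"{high:04X} {low:04X}")            # D83D DCA9
print(poo.encode("utf-16-be").hex())      # d83ddca9 (the same pair, as bytes)

# Byte counts per encoding for a few sample characters
for ch in ("A", "\u00e9", "\u20ac", poo):
    sizes = [len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")]
    print(f"U+{ord(ch):04X}", sizes)
# U+0041 [1, 2, 4]
# U+00E9 [2, 2, 4]
# U+20AC [3, 2, 4]
# U+1F4A9 [4, 4, 4]
```

The two code units printed by to_surrogates are exactly what JavaScript is counting when poo.length reports 2.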
Normalization: The “Same” But Different
Here’s where Unicode gets philosophical. Is "é" one thing or two? Unicode allows it to be represented two ways:
- As a single code point: U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
- As a sequence of two code points: U+0065 (LATIN SMALL LETTER E) + U+0301 (COMBINING ACUTE ACCENT).
They look identical to a human. Unicode calls them canonically equivalent. But to a computer, they are completely different sequences of code points, and therefore of bytes. If you compare them or search for one using the other, you’ll get a false negative. This is a massive source of bugs.
Unicode Normalization is the process of converting these equivalent sequences to a single, standardized form. The two most common forms are NFC and NFD.
- NFC (Normalization Form C): Composes wherever a precomposed form exists. Prefers the single code point U+00E9.
- NFD (Normalization Form D): Decomposes everything. Prefers the sequence U+0065 + U+0301.
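A small sketch of what the two forms actually produce for "é", using Python's standard unicodedata module. The helper code_points is just an illustrative name for printing a string's code points.

```python
# Sketch: inspecting the code points that NFC and NFD produce.
# `code_points` is a hypothetical helper for display purposes.
import unicodedata


def code_points(s: str) -> list:
    """List a string's code points in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


composed = "\u00e9"      # é as one code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(code_points(unicodedata.normalize("NFD", composed)))    # ['U+0065', 'U+0301']
print(code_points(unicodedata.normalize("NFC", decomposed)))  # ['U+00E9']

# Normalizing text that is already in the target form changes nothing
print(unicodedata.normalize("NFC", composed) == composed)     # True
```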
You must normalize your strings before comparing them or storing them if you care about textual equality. It’s not optional.
# Python has excellent Unicode support
import unicodedata

s1 = "caf\u00e9"   # precomposed é (U+00E9), written as an escape to be explicit
s2 = "cafe\u0301"  # e + combining acute accent (U+0301)
print(s1, s2)            # Both print 'café'
print(len(s1), len(s2))  # 4 5 <- Different lengths!
print(s1 == s2)          # False <- They are not equal!

# Normalize both to the same form before comparing
normalized_s1 = unicodedata.normalize('NFC', s1)
normalized_s2 = unicodedata.normalize('NFC', s2)
print(normalized_s1 == normalized_s2)  # True <- Now they are equal.
Best practice: Pick a normalization form (usually NFC for storage and display) and stick with it. Normalize on input, before processing, and before comparison. It’s like washing your hands. It just prevents so many gross bugs.
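One way to apply "normalize on input" is to funnel all incoming text through a single function at the edge of your program. This is only a sketch of that idea; clean_input and same_text are hypothetical names, and a real application may need more (trimming, case folding) depending on its rules.

```python
# Sketch: normalize once at the boundary, then compare and store freely.
# `clean_input` and `same_text` are hypothetical helper names.
import unicodedata


def clean_input(text: str) -> str:
    """Normalize incoming text to NFC before it enters the system."""
    return unicodedata.normalize("NFC", text)


def same_text(a: str, b: str) -> bool:
    """Compare two strings for canonical equivalence."""
    return clean_input(a) == clean_input(b)


# Stored values are normalized when they come in...
usernames = {clean_input("caf\u00e9")}

# ...so a lookup with the decomposed spelling still finds them.
print(clean_input("cafe\u0301") in usernames)   # True
print(same_text("caf\u00e9", "cafe\u0301"))     # True
print(same_text("cafe", "caf\u00e9"))           # False (genuinely different text)
```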
The takeaway? Unicode is deep. Respect its depth. Never assume one byte equals one character. Never assume one code point equals what a user would call a “character” (that’s a whole other can of worms called a grapheme cluster). Use well-tested libraries for everything, and always normalize your strings. It’s the only way to avoid the inevitable chaos.