22.7 unicode and unicode/utf8: Working with Runes
Right, let’s talk about text. You’ve probably been happily using string for everything, thinking “it’s just text.” Go’s string is a fantastic abstraction, but it’s built on a lie of omission. Under the hood, a string is effectively a read-only sequence of bytes (think of an immutable []byte). Not characters. Bytes. And this is where the entire world of unicode and unicode/utf8 comes crashing into our pleasant little program.
The problem is simple: the world uses more than 128 characters (the limit of ASCII). My last name has an “é”; that’s one character, but it’s represented by two bytes in UTF-8. If you try to process it by just indexing the string (s[0], s[1]…), you’re slicing through those bytes and getting utter garbage. The crucial concept here is the rune.
A rune in Go is simply an alias for int32. It’s a number meant to represent a single Unicode code point. Think of a code point as a unique ID for every character, symbol, and emoji imaginable. When you write a for loop over a string with a range clause, Go does the hard work for you: it decodes the UTF-8 bytes and gives you each rune, one by one.
s := "Gophér" // Note the 'é'
fmt.Printf("String: %s\n", s)

// The wrong way: byte-by-byte
fmt.Println("Byte indexing:")
for i := 0; i < len(s); i++ {
	fmt.Printf("Index %d: %q (%#x)\n", i, s[i], s[i])
}

// The right way: rune-by-rune
fmt.Println("\nRange iteration:")
for index, runeValue := range s {
	fmt.Printf("Index %d: %#U\n", index, runeValue)
}
This outputs:
String: Gophér
Byte indexing:
Index 0: 'G' (0x47)
Index 1: 'o' (0x6f)
Index 2: 'p' (0x70)
Index 3: 'h' (0x68)
Index 4: 'Ã' (0xc3) // <- What even is this? It's the first byte of 'é'
Index 5: '©' (0xa9) // <- And the second byte. Garbage.
Index 6: 'r' (0x72)
Range iteration:
Index 0: U+0047 'G'
Index 1: U+006F 'o'
Index 2: U+0070 'p'
Index 3: U+0068 'h'
Index 4: U+00E9 'é' // <- There's our actual character!
Index 6: U+0072 'r' // <- Note the index jumped to 6!
See the index jump from 4 to 6? That’s the range loop silently, correctly, handling the multi-byte nature of UTF-8 for you. This is why you should almost always use for range to iterate over strings.
The unicode and unicode/utf8 Packages: Your Toolkit
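The unicode package is your high-level guide to the Unicode universe. It’s full of useful functions like unicode.IsLetter(r), unicode.IsDigit(r), unicode.IsSpace(r), and unicode.ToLower(r). These functions understand the Unicode standard, so they know that ‘é’ is a letter and that ‘٣’ (ARABIC-INDIC DIGIT THREE) is a decimal digit. Watch the fine print, though: ‘Ⅷ’ counts as a number for unicode.IsNumber, but it is not a decimal digit, so unicode.IsDigit returns false for it.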
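To make those categories concrete, here is a small sketch. The classify helper is a hypothetical name invented for this example, not part of the unicode package; it only calls the real predicates in a fixed order.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify is an illustrative helper (not a standard library function).
// It reports the first unicode category check that matches r.
func classify(r rune) string {
	switch {
	case unicode.IsDigit(r):
		return "decimal digit"
	case unicode.IsNumber(r):
		return "number (not a decimal digit)"
	case unicode.IsLetter(r):
		return "letter"
	case unicode.IsSpace(r):
		return "space"
	default:
		return "other"
	}
}

func main() {
	// '٣' is ARABIC-INDIC DIGIT THREE, 'Ⅷ' is ROMAN NUMERAL EIGHT.
	for _, r := range "7٣Ⅷé !" {
		fmt.Printf("%#U: %s\n", r, classify(r))
	}
	fmt.Printf("ToLower('É') = %c\n", unicode.ToLower('É'))
}
```

The order of the cases matters: every decimal digit is also a number, so IsDigit has to be checked before IsNumber if you want to tell them apart.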
The unicode/utf8 package is the mechanic that works under the hood. It deals with the raw bytes. It provides functions to encode/decode runes and to analyze byte slices.
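As a rough illustration of the "mechanic" side, the sketch below encodes a few runes into bytes and decodes them back. The encode helper is a name made up for this example; utf8.EncodeRune, utf8.DecodeRune, utf8.RuneLen, and utf8.UTFMax are the real package API.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode is an illustrative helper that returns the UTF-8 bytes for r.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest possible encoding
	n := utf8.EncodeRune(buf, r)     // n is the number of bytes written
	return buf[:n]
}

func main() {
	for _, r := range []rune{'A', 'é', '世', '🐍'} {
		b := encode(r)
		back, size := utf8.DecodeRune(b)
		fmt.Printf("%#U -> % x (RuneLen=%d), decodes back to %q from %d byte(s)\n",
			r, b, utf8.RuneLen(r), back, size)
	}
}
```

Running it should show encodings of one, two, three, and four bytes respectively, which is the whole range UTF-8 uses.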
Why You Can’t Just Truncate Strings
This is a classic pitfall. Let’s say you want the first 5 characters of a string. Your first instinct might be s[:5]. But that expression counts bytes, not characters, and it is a serious bug whenever the cut lands in the middle of a multi-byte rune.
s := "Hello, 世界" // "Hello, " (7 ASCII chars) + "世界" (2 Chinese chars, 3 bytes each)
truncatedBad := s[:11] // Slice after the 11th byte, one byte into '界'
truncatedGood := string([]rune(s)[:8]) // Slice after the 8th rune
fmt.Printf("Original: %s (len=%d bytes)\n", s, len(s))
fmt.Printf("Bad truncation (byte slice): %s (len=%d bytes)\n", truncatedBad, len(truncatedBad))
fmt.Printf("Good truncation (rune slice): %s (len=%d bytes)\n", truncatedGood, len(truncatedGood))
Output:
Original: Hello, 世界 (len=13 bytes)
Bad truncation (byte slice): Hello, 世� (len=11 bytes) // <- Garbage at the end!
Good truncation (rune slice): Hello, 世 (len=10 bytes) // <- Correct
The “bad” version sliced the string after the first byte of the second Chinese character, leaving an incomplete UTF-8 sequence, which most terminals render as the replacement character �. Note that a byte slice is only wrong when the cut falls inside a rune: s[:10] would happen to land on a rune boundary and look fine, which is exactly why this bug survives testing. The simplest fix is to convert to a []rune, slice that, and then convert back to a string.
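The []rune conversion copies and decodes the whole string even if you only want a short prefix. One possible alternative, sketched below, walks the string with range and cuts at a rune boundary. The name truncateRunes is hypothetical, and the sketch assumes "character" means a single rune.

```go
package main

import "fmt"

// truncateRunes is an illustrative helper that returns at most the
// first n runes of s, always cutting on a rune boundary.
func truncateRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	count := 0
	// range yields the byte offset at which each rune starts.
	for i := range s {
		if count == n {
			return s[:i] // i is the start of rune n+1, so s[:i] holds n runes
		}
		count++
	}
	return s // s has n runes or fewer
}

func main() {
	s := "Hello, 世界"
	fmt.Println(truncateRunes(s, 8))  // Hello, 世
	fmt.Println(truncateRunes(s, 50)) // Hello, 世界
}
```

Because the result is a slice of the original string, no new string data is allocated.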
Counting “Characters” Correctly
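len(s) gives you the length in bytes. This is almost never what you want for user-facing text. Use utf8.RuneCountInString(s) to get the number of runes, which is usually much closer to what a user would perceive as characters (though a single visible character can still be built from several runes, such as a letter followed by a combining accent).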
s := "🐍 Python" // Snake emoji is 4 bytes
fmt.Printf("Bytes: %d\n", len(s)) // Output: Bytes: 11 (4 + 1 space + 6)
fmt.Printf("Runes: %d\n", utf8.RuneCountInString(s)) // Output: Runes: 8
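One caveat worth seeing in action: a rune is a code point, not necessarily a whole "character" as a reader sees it. The sketch below compares two ways of writing é. The counts helper is a name invented for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is an illustrative helper returning the byte and rune counts of s.
func counts(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00e9" // é as a single code point
	decomposed := "e\u0301" // 'e' followed by COMBINING ACUTE ACCENT

	b, r := counts(precomposed)
	fmt.Printf("%s: %d bytes, %d rune(s)\n", precomposed, b, r) // 2 bytes, 1 rune(s)

	b, r = counts(decomposed)
	fmt.Printf("%s: %d bytes, %d rune(s)\n", decomposed, b, r) // 3 bytes, 2 rune(s)
}
```

Both strings normally render as the same single glyph, yet the rune counts differ. If you need to count what the user actually sees, rune counting alone is not enough, and that job goes beyond what unicode/utf8 offers.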
Validating UTF-8
Not every slice of bytes is valid UTF-8. You might be processing user input or files that claim to be UTF-8 but aren’t. The utf8.ValidString(s) function is your first line of defense. It reports whether the string s consists entirely of valid UTF-8-encoded runes. If this returns false, you need to handle the error, perhaps by cleaning the input or rejecting it outright.
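Here is a minimal sketch of the "clean it" option, assuming you are happy to substitute the replacement character for anything invalid. The sanitize helper is a hypothetical name; utf8.ValidString and strings.ToValidUTF8 are real standard library functions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize is an illustrative helper. It returns s unchanged when it is
// valid UTF-8; otherwise each run of invalid bytes becomes one U+FFFD.
func sanitize(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	good := "Gophér"
	bad := "Goph\xc3r" // 0xc3 opens a two-byte sequence, but 'r' cannot continue it

	fmt.Println(utf8.ValidString(good)) // true
	fmt.Println(utf8.ValidString(bad))  // false

	cleaned := sanitize(bad)
	fmt.Println(utf8.ValidString(cleaned)) // true
	fmt.Printf("% x\n", cleaned)
}
```

Whether to repair or reject is a policy decision. For something like a username, rejecting the input outright is often the safer choice.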
Manual Decoding and The Edge of The Universe
Sometimes you need to process a stream of bytes and can’t just convert it to a string first. This is where utf8.DecodeRune shines. It takes a []byte and returns the first rune and the width of its UTF-8 encoding in bytes. You then use this width to advance past that rune to the start of the next one. If the bytes at the front aren’t valid UTF-8, it returns utf8.RuneError with a width of 1, so the loop always makes progress. It’s essentially the same decoding the for range loop performs on a string.
b := []byte("A±B")
for len(b) > 0 {
	r, size := utf8.DecodeRune(b)
	fmt.Printf("Decoded rune %#U, consumed %d bytes. Remaining: %v\n", r, size, b[size:])
	b = b[size:] // Slice off the rune we just consumed.
}
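The same loop can also tell you when the input is broken. When utf8.DecodeRune hits an invalid byte it returns utf8.RuneError with a size of 1, whereas a genuine U+FFFD in the input decodes with a size of 3. The sketch below uses that to count bad bytes. The scan function is a name made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// scan is an illustrative helper that walks b one rune at a time and
// reports how many runes decoded cleanly and how many bytes were invalid.
func scan(b []byte) (valid, invalid int) {
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if r == utf8.RuneError && size == 1 {
			invalid++ // a byte that is not part of any valid encoding
		} else {
			valid++
		}
		b = b[size:] // size is at least 1 here, so the loop terminates
	}
	return valid, invalid
}

func main() {
	data := []byte("A\xffB\u00b1") // 'A', a stray 0xff byte, 'B', '±'
	valid, invalid := scan(data)
	fmt.Printf("valid runes: %d, invalid bytes: %d\n", valid, invalid) // valid runes: 3, invalid bytes: 1
}
```

Checking the size as well as the rune is what keeps a legitimately encoded U+FFFD from being counted as an error.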
The key takeaway is this: Go has excellent support for UTF-8, but it doesn’t hide the complexity from you. It gives you the tools (rune, unicode, unicode/utf8) to handle text correctly. Respect the difference between bytes and runes, and your code will work everywhere, from ASCII to emoji. Ignore it, and your code will work only in Kansas.