5.6 char: A Four-Byte Unicode Scalar Value
Now, let’s talk about char. You might be coming from a language where a char is a single, lonely byte representing an ASCII character. Rust politely asks you to forget all that. In Rust, a char is not a byte; it’s a Unicode Scalar Value. This is a fancy way of saying it represents a single Unicode code point, and it takes up a full 4 bytes in memory.
“Why on earth would you do that, Rust?” I hear you cry. It’s not for fun, I promise. It’s for correctness. By making a char 32 bits wide, Rust can guarantee that every char is a valid, self-contained Unicode value. This completely sidesteps the nightmare of trying to figure out if a byte is part of a UTF-8 sequence or not when you’re just trying to iterate over characters. It’s a bit like using a sledgehammer to crack a nut, but the nut is the entire history of broken text encoding, and the sledgehammer is beautifully designed.
The Anatomy of a Rust char
A Unicode Scalar Value is any Unicode code point except for surrogate code points (the range U+D800 to U+DFFF). These surrogates are a legacy hack from UTF-16 and are utterly meaningless on their own. Rust’s char type enforces this validity at the type level. You simply cannot create an invalid char in safe code: a surrogate literal is rejected by the compiler, and runtime conversions such as char::from_u32 return None instead of handing you a bad value.
// This is a perfectly cromulent char. It's the letter 'a'.
let a = 'a';
// This is a char representing a non-Latin script character.
let katakana_a = 'ア';
// This is a char representing an emoji. Yes, really.
let crab = '🦀';
// This does not compile (it is a compile-time error, not a panic):
// unicode escapes must not be surrogates.
// let invalid = '\u{D800}';
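If you want to poke at those guarantees yourself, here is a minimal sketch. The helper name is_scalar_value is made up for this example; the only standard-library calls involved are std::mem::size_of and char::from_u32.

```rust
use std::mem::size_of;

/// Hypothetical helper: true if `value` is a Unicode scalar value,
/// i.e. something `char::from_u32` is willing to turn into a `char`.
fn is_scalar_value(value: u32) -> bool {
    char::from_u32(value).is_some()
}

fn main() {
    // Every char occupies four bytes, whichever character it holds.
    println!("size_of::<char>() = {}", size_of::<char>());

    // Surrogates (U+D800..=U+DFFF) and anything past U+10FFFF are rejected.
    // Note that rejection means `None`, not a panic.
    for value in [0x61, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000] {
        println!("U+{:04X} valid: {}", value, is_scalar_value(value));
    }
}
```

The values just below and just above the surrogate range (U+D7FF and U+E000) are included to show that the hole in the middle is the only gap up to U+10FFFF.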
Iterating Over Strings: The Great Reveal
This is where the char type truly shines and saves you from countless bugs. Let’s look at what happens when you try to get characters from a string in a language that uses bytes (like C/C++) versus Rust.
let hello = "hello";
let helloz = "hello🦀"; // Note the crab emoji at the end.
// Just like a byte-based language, Rust's .len() on a &str gives you the length in BYTES:
// hello: 5 bytes
// helloz: 9 bytes (because '🦀' is 4 bytes in UTF-8)
// This is important for memory layout, but useless for human perception.
println!("Bytes length of 'helloz': {}", helloz.len()); // Prints 9
// To get the number of Unicode scalar values (graphemes are a different story!), use .chars().
println!("Chars count of 'helloz': {}", helloz.chars().count()); // Prints 6
// Let's see what those chars actually are.
for c in helloz.chars() {
    println!("{}", c);
}
// Output:
// h
// e
// l
// l
// o
// 🦀
See? No guessing, no wondering if you’ve split a multi-byte sequence in half. The .chars() iterator hands you back a sequence of pristine, valid char values. It’s blissfully simple.
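To connect the byte length with the char count, it can help to see where each char starts inside the string. The sketch below uses str::char_indices and char::len_utf8 from the standard library; the function name char_layout is just an illustrative choice.

```rust
/// Hypothetical helper: pairs each char with the byte offset where it
/// starts and the number of bytes it occupies in UTF-8.
fn char_layout(s: &str) -> Vec<(usize, char, usize)> {
    s.char_indices()
        .map(|(offset, c)| (offset, c, c.len_utf8()))
        .collect()
}

fn main() {
    for (offset, c, width) in char_layout("hello🦀") {
        println!("byte {}: {:?} takes {} byte(s) in UTF-8", offset, c, width);
    }
}
```

The five ASCII letters sit at offsets 0 through 4 with one byte each, and the crab starts at offset 5 and takes four bytes, which is how 6 chars add up to 9 bytes.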
The Gotchas and Gray Areas
Now, I wouldn’t be your brilliant friend if I didn’t tell you the whole truth. A char is a Unicode Scalar Value, but that’s not always the same thing as what a human would call a “character.”
Consider the character ‘é’. You can represent this as a single scalar value (U+00E9, Latin Small Letter E with Acute) or as two scalar values: a normal ‘e’ (U+0065) followed by a combining acute accent (U+0301). Both are valid, both will render as ‘é’, but they contain a different number of chars!
// Both forms are written as &str so the same methods apply to each.
let e_acute_single = "\u{00E9}"; // 1 char: the precomposed 'é'
let e_acute_combo = "e\u{0301}"; // 2 chars: 'e' and the combining accent
println!("{}", e_acute_single); // é
println!("{}", e_acute_combo); // é
println!("{}", e_acute_single.len()); // 2 bytes
println!("{}", e_acute_combo.len()); // 3 bytes
println!("{}", e_acute_single.chars().count()); // 1
println!("{}", e_acute_combo.chars().count()); // 2
This is why for advanced text processing (like cursor movement in a text editor), you’d need to look at grapheme clusters, which you can handle with crates like unicode-segmentation. The char type gives you the fundamental building blocks, but human language is messy. Rust gives you the tools to handle that mess correctly, rather than pretending it doesn’t exist.
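One practical consequence is worth seeing in code: since the two spellings of ‘é’ are different sequences of scalar values, a plain == treats them as different strings. The sketch below sticks to the standard library; scalar_values is a hypothetical helper name. Making the two compare equal would take Unicode normalization, which the standard library does not provide and which is left out here.

```rust
/// Hypothetical helper: lists the code point of every char in `s`.
fn scalar_values(s: &str) -> Vec<u32> {
    s.chars().map(u32::from).collect()
}

fn main() {
    let precomposed = "\u{00E9}"; // one scalar value
    let decomposed = "e\u{0301}"; // base letter plus combining accent

    println!("{:X?}", scalar_values(precomposed)); // [E9]
    println!("{:X?}", scalar_values(decomposed)); // [65, 301]

    // String equality compares the underlying bytes, so these differ
    // even though they look identical on screen.
    println!("equal: {}", precomposed == decomposed); // equal: false
}
```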
Best Practices and Conversion
- Indexing: Never try to pull a character out of a string by byte position. Rust won’t let you index a &str with a plain integer at all, and slicing with a byte range that cuts a UTF-8 sequence in half will panic. Use .chars().nth() if you must, but be aware it’s an O(n) operation.
- Conversion to Integer: A char can be cast to a u32 to get its code point value, and you can convert back using char::from_u32. This returns an Option<char> because the integer might not be a valid scalar value.
- Use it for its purpose: The char type is your go-to for when you conceptually mean “a single character.” For everything else, especially raw bytes, you should be using u8.
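The indexing advice can be sketched as follows. The helper names nth_char and prefix_chars are invented for this example; the standard-library pieces are chars().nth(), char_indices(), str::get, and str::is_char_boundary.

```rust
/// Hypothetical helper: the nth char, found by walking from the start (O(n)).
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// Hypothetical helper: a prefix holding at most `n` chars. The cut point
/// comes from char_indices, so it always lands on a char boundary.
fn prefix_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        Some((offset, _)) => &s[..offset],
        None => s, // fewer than n chars: return the whole string
    }
}

fn main() {
    let s = "hello🦀";

    println!("{:?}", nth_char(s, 5)); // Some('🦀')
    println!("{:?}", prefix_chars(s, 5)); // "hello"

    // Byte 6 falls inside the crab's four-byte sequence.
    println!("{}", s.is_char_boundary(6)); // false
    // &s[0..6] would panic here; .get() reports the problem as None instead.
    println!("{:?}", s.get(0..6)); // None
    println!("{:?}", s.get(0..5)); // Some("hello")
}
```

Reaching for .get() over the [] slicing syntax is a judgment call: it trades a panic for an Option you have to handle, which tends to be the friendlier choice when the byte offsets come from outside your own code.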
let star_char = '★';
let star_value: u32 = star_char as u32;
println!("Code point for ★: U+{:04X}", star_value); // U+2605
let maybe_char = char::from_u32(0x1F980); // The crab emoji, again.
println!("{:?}", maybe_char); // Some('🦀')
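Since char::from_u32 gives you an Option, you have to decide what to do with the None case. One possible policy, sketched below, is to substitute U+FFFD (available as char::REPLACEMENT_CHARACTER) for anything invalid. The function name code_points_to_string is an assumption for illustration, and other policies, such as rejecting the whole input, may suit your program better.

```rust
/// Hypothetical helper: builds a String from raw code points, swapping any
/// value that is not a Unicode scalar value for U+FFFD.
fn code_points_to_string(values: &[u32]) -> String {
    values
        .iter()
        .map(|&v| char::from_u32(v).unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

fn main() {
    // Going from char to u32 can never fail, so u32::from works as well as `as`.
    println!("U+{:04X}", u32::from('★')); // U+2605

    // Going back can fail: 0xD800 is a surrogate and 0x110000 is out of range.
    let text = code_points_to_string(&[0x2605, 0xD800, 0x1F980, 0x110000]);
    println!("{}", text);
}
```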