5.6 char: A Four-Byte Unicode Scalar Value

Now, let’s talk about char. You might be coming from a language where a char is a single, lonely byte representing an ASCII character. Rust politely asks you to forget all that. In Rust, a char is not a byte; it’s a Unicode Scalar Value. This is a fancy way of saying it holds a single Unicode code point (minus one small reserved range, covered below), and it always takes up a full 4 bytes in memory.

“Why on earth would you do that, Rust?” I hear you cry. It’s not for fun, I promise. It’s for correctness. By making a char 32 bits wide, Rust can guarantee that every char is a valid, self-contained Unicode value. This completely sidesteps the nightmare of trying to figure out if a byte is part of a UTF-8 sequence or not when you’re just trying to iterate over characters. It’s a bit like using a sledgehammer to crack a nut, but the nut is the entire history of broken text encoding, and the sledgehammer is beautifully designed.
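If you want to see the four bytes for yourself, here is a minimal sketch. The helper name `sizes` is made up for illustration; `size_of` and `len_utf8` come from the standard library. It contrasts the fixed in-memory size of a char with the variable number of bytes the same character needs once encoded as UTF-8.

```rust
use std::mem::size_of;

/// Returns (in-memory size of `char`, UTF-8 encoded length of `c`).
/// `sizes` is a hypothetical helper, not a standard library function.
fn sizes(c: char) -> (usize, usize) {
    (size_of::<char>(), c.len_utf8())
}

fn main() {
    for c in ['a', 'é', 'ア', '🦀'] {
        let (in_memory, encoded) = sizes(c);
        println!("{}: {} bytes as a char, {} bytes as UTF-8", c, in_memory, encoded);
    }
}
```

The first number is always 4; the second ranges from 1 to 4 depending on the character.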

The Anatomy of a Rust `char`

A Unicode Scalar Value is any Unicode code point except for surrogate code points (the range U+D800 to U+DFFF). These surrogates are a legacy hack from UTF-16 and are utterly meaningless on their own. Rust’s char type enforces this validity at the type level. In safe Rust you simply cannot create an invalid char: an invalid literal is rejected at compile time, and the checked runtime conversion char::from_u32 returns None instead of handing you a bad value.

// This is a perfectly cromulent char. It's the letter 'a'.
let a = 'a';

// This is a char representing a non-Latin script character.
let katakana_a = 'ア';

// This is a char representing an emoji. Yes, really.
let crab = '🦀';

// This does not compile (it is a compile-time error, not a runtime panic):
// a Unicode escape must not be a surrogate.
// let invalid = '\u{D800}';
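Literals are checked by the compiler, but what about numbers that only show up at runtime? A small sketch, assuming nothing beyond the standard library's `char::from_u32` (the wrapper name `checked` is purely illustrative), shows that surrogates and out-of-range values come back as `None` rather than as a broken char.

```rust
/// Tries to build a `char` from a raw code point.
/// `None` means the number is not a Unicode scalar value.
/// `checked` is a hypothetical wrapper around `char::from_u32`.
fn checked(code_point: u32) -> Option<char> {
    char::from_u32(code_point)
}

fn main() {
    println!("{:?}", checked(0x61));     // an ordinary letter
    println!("{:?}", checked(0xD800));   // first surrogate
    println!("{:?}", checked(0xDFFF));   // last surrogate
    println!("{:?}", checked(0x110000)); // one past the last code point, U+10FFFF
}
```

Only the first call yields `Some`; the other three print `None`.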

Iterating Over Strings: The Great Reveal

This is where the char type truly shines and saves you from countless bugs. Let’s look at what happens when you try to get characters from a string in a language that uses bytes (like C/C++) versus Rust.

let hello = "hello";
let helloz = "hello🦀"; // Note the crab emoji at the end.

// A byte-oriented length function (C's strlen, say) would report:
// hello: 5 bytes
// helloz: 9 bytes (because '🦀' is 4 bytes in UTF-8)

// Rust's .len() method on a &str ALSO gives you the length in BYTES.
// This is important for memory layout, but useless for human perception.
println!("Bytes length of 'helloz': {}", helloz.len()); // Prints 9

// To get the number of Unicode scalar values (graphemes are a different story!), use .chars().
println!("Chars count of 'helloz': {}", helloz.chars().count()); // Prints 6

// Let's see what those chars actually are.
for c in helloz.chars() {
    println!("{}", c);
}
// Output:
// h
// e
// l
// l
// o
// 🦀

See? No guessing, no wondering if you’ve split a multi-byte sequence in half. The .chars() iterator hands you back a sequence of pristine, valid char values. It’s blissfully simple.
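If you also need to know where each char lives inside the string's bytes, the standard library's `char_indices` pairs every char with its starting byte offset. The sketch below wraps it in an illustrative helper called `offsets` (my name, not a library one) and uses `is_char_boundary` to show which byte positions are safe to cut at.

```rust
/// Pairs every char in `s` with the byte offset at which it starts.
/// `offsets` is a hypothetical helper built on `str::char_indices`.
fn offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}

fn main() {
    let helloz = "hello🦀";
    for (byte_offset, c) in offsets(helloz) {
        println!("byte {}: {} ({} UTF-8 bytes)", byte_offset, c, c.len_utf8());
    }

    // The crab starts at byte 5; bytes 6, 7 and 8 are inside its encoding.
    println!("{}", helloz.is_char_boundary(5)); // true
    println!("{}", helloz.is_char_boundary(6)); // false
}
```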

The Gotchas and Gray Areas

Now, I wouldn’t be your brilliant friend if I didn’t tell you the whole truth. A char is a Unicode Scalar Value, but that’s not always the same thing as what a human would call a “character.”

Consider the character ‘é’. You can represent this as a single scalar value (U+00E9, Latin Small Letter E with Acute) or as two scalar values: a normal ‘e’ (U+0065) followed by a combining acute accent (U+0301). Both are valid, both will render as ‘é’, but they have a different number of chars!

// Both are string slices, so the same methods apply to each.
let e_acute_single = "\u{00E9}"; // 1 char: the precomposed 'é'
let e_acute_combo = "e\u{0301}"; // 2 chars: 'e' and the combining accent

println!("{}", e_acute_single); // é
println!("{}", e_acute_combo);  // é

// .len() on a &str is the UTF-8 byte length.
println!("{}", e_acute_single.len()); // 2 bytes
println!("{}", e_acute_combo.len());  // 3 bytes

println!("{}", e_acute_single.chars().count()); // 1
println!("{}", e_acute_combo.chars().count());  // 2

This is why for advanced text processing (like cursor movement in a text editor), you’d need to look at grapheme clusters, which you can handle with crates like unicode-segmentation. The char type gives you the fundamental building blocks, but human language is messy. Rust gives you the tools to handle that mess correctly, rather than pretending it doesn’t exist.
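One practical consequence of the two spellings of ‘é’ is that they do not compare equal, even though they look identical on screen. The sketch below sticks to the standard library (grapheme segmentation itself needs an external crate, so it is not shown here); `code_points` is a hypothetical helper that lists the scalar values in a string.

```rust
/// Lists the scalar values in `s` as "U+XXXX" strings.
/// `code_points` is an illustrative helper, not a library function.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    let precomposed = "\u{00E9}"; // single scalar value
    let combining = "e\u{0301}";  // 'e' plus combining acute accent

    println!("{:?}", code_points(precomposed)); // ["U+00E9"]
    println!("{:?}", code_points(combining));   // ["U+0065", "U+0301"]

    // String equality compares the underlying bytes, so these differ.
    println!("{}", precomposed == combining); // false
}
```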

Best Practices and Conversion

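Indexing: Never try to pull a character out of a string by byte position. A &str cannot be indexed with a single integer at all (it won’t compile), and slicing with a byte range that cuts a UTF-8 sequence in half will panic. Use .chars().nth() if you must, but be aware it’s an O(n) operation.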
Conversion to Integer: A char can be cast to a u32 to get its code point value, and you can convert back using char::from_u32. This returns an Option<char> because the integer might not be a valid scalar value: it could be a surrogate or lie beyond U+10FFFF.
Use it for its purpose: The char type is your go-to for when you conceptually mean “a single character.” For everything else, especially raw bytes, you should be using u8.

let star_char = '★';
let star_value: u32 = star_char as u32;
println!("Code point for ★: U+{:04X}", star_value); // U+2605

let maybe_char = char::from_u32(0x1F980); // The crab emoji, again.
println!("{:?}", maybe_char); // Some('🦀')
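The indexing advice above can be sketched the same way. This is only an illustration, with `nth_char` as a made-up helper name: it counts in chars rather than bytes, and the final two lines use the standard `str::get`, which returns `None` instead of panicking when a byte range would split a character.

```rust
/// Returns the char at position `n`, counting chars rather than bytes.
/// Walks the string from the start, so it is O(n).
/// `nth_char` is a hypothetical helper around `chars().nth()`.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    let s = "🦀rust"; // 8 bytes, 5 chars

    println!("{:?}", nth_char(s, 0)); // Some('🦀')
    println!("{:?}", nth_char(s, 1)); // Some('r')
    println!("{:?}", nth_char(s, 9)); // None

    // Byte range 0..1 ends inside the crab, so `get` refuses it.
    println!("{:?}", s.get(0..1)); // None
    // Byte range 0..4 covers the crab exactly.
    println!("{:?}", s.get(0..4)); // Some("🦀")
}
```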
