55.2 Character Classes: [], \d, \w, \s, and Negation

Character classes, also known as character sets, are among the most fundamental and powerful constructs in regular expressions. They allow you to instruct the regex engine to match any one character from a predefined or custom set of characters. This moves beyond literal matching, enabling you to define flexible patterns for categories of text like digits, letters, or whitespace.

Defining Custom Character Sets with Square Brackets []

The most basic character class is written with square brackets []. The class matches exactly one character, which may be any of the characters listed between the brackets. For single characters this is more concise, and usually more efficient, than alternation (the | operator): [aeiou] is preferable to (a|e|i|o|u).

You can also specify a range of consecutive characters using a hyphen -. The common ranges are [a-z] for lowercase letters, [A-Z] for uppercase letters, [0-9] for digits, and combinations like [a-zA-Z] for any letter. The regex engine interprets these ranges based on the character’s position in the ASCII or Unicode table, so [A-z] would match all uppercase letters, lowercase letters, and the six characters between Z and a ([, \, ], ^, _, `), which is rarely the intended behavior.

// Match a hexadecimal digit (0-9, a-f, A-F)
const hexPattern = /[0-9a-fA-F]/;
console.log(hexPattern.test('x')); // false
console.log(hexPattern.test('5')); // true
console.log(hexPattern.test('b')); // true
console.log(hexPattern.test('F')); // true

// Match a vowel
const vowelPattern = /[aeiouAEIOU]/;
console.log('hello'.replace(vowelPattern, '-')); // "h-llo" (without the g flag, only the first vowel is replaced)
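
The range pitfall described above is easy to check. The following is a minimal Python sketch (the names loose and strict are illustrative and do not come from the text) that compares [A-z] with [A-Za-z] against the six punctuation characters that sit between Z and a in ASCII:

```python
import re

# Illustrative names: `loose` is the accidental range, `strict` the intended one.
loose = re.compile(r'[A-z]')
strict = re.compile(r'[A-Za-z]')

# The six ASCII characters between 'Z' (0x5A) and 'a' (0x61).
between = '[\\]^_`'

loose_hits = [ch for ch in between if loose.fullmatch(ch)]
strict_hits = [ch for ch in between if strict.fullmatch(ch)]

print(loose_hits)   # ['[', '\\', ']', '^', '_', '`']
print(strict_hits)  # []
```

Spelling out both ranges costs three extra characters and removes the accidental matches.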

Predefined Shorthand Character Classes

Because certain character classes are used so frequently, regex engines provide shorthand notations for them. These are much more concise than their bracket equivalents.

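\d matches any digit character. In many engines, including JavaScript, it is equivalent to [0-9]; Unicode-aware engines may also match digits from other scripts. The d stands for digit.
\w matches any word character. In ASCII-only engines and modes this is equivalent to [a-zA-Z0-9_] (letters, digits, and the underscore). The w stands for word. Whether accented letters such as é or ä count depends on the engine: JavaScript excludes them, while Python 3 includes them by default for str patterns.
\s matches any whitespace character. This includes spaces, tabs (\t), newlines (\n), carriage returns (\r), form feeds (\f), and, in most engines, vertical tabs (\v). Many engines also include Unicode whitespace such as the no-break space. The s stands for space.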

import re

# Extract all digits from a string
text = "Order 123: 45 items for $99.99"
digits = re.findall(r'\d', text)
print(digits) # Output: ['1', '2', '3', '4', '5', '9', '9', '9', '9']

# Find a word followed by whitespace and a number
pattern = r'\w+\s\d+'
match = re.search(pattern, "Total items 42")
if match:
    print(match.group()) # Output: items 42
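
How broad \d and \w are depends on the engine and its flags. As one illustration, Python 3's re module treats str patterns as Unicode-aware by default and narrows the shorthands to ASCII when re.ASCII is passed. Other engines, JavaScript among them, behave differently, so read this as a sketch of one engine's behavior rather than a general rule. The sample string is made up for the example:

```python
import re

# Made-up sample: "caf\u00e9" ends in é, and \u0969 is the Devanagari digit three.
sample = "caf\u00e9 \u0969 x_1"

# Default for str patterns in Python 3: Unicode-aware shorthands.
unicode_words = re.findall(r'\w+', sample)
unicode_digits = re.findall(r'\d', sample)

# With re.ASCII, \w behaves like [a-zA-Z0-9_] and \d like [0-9].
ascii_words = re.findall(r'\w+', sample, flags=re.ASCII)
ascii_digits = re.findall(r'\d', sample, flags=re.ASCII)

print(unicode_words)   # ['café', '३', 'x_1']
print(unicode_digits)  # ['३', '1']
print(ascii_words)     # ['caf', 'x_1']
print(ascii_digits)    # ['1']
```

If only ASCII digits are acceptable, for instance when the result is parsed as a number by code that expects 0-9, writing [0-9] explicitly avoids depending on the flag.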

Negated Character Classes

You can invert the meaning of any character class by placing a caret ^ immediately after the opening bracket [. A negated character class will match any character that is NOT listed in the set. This is incredibly useful for matching delimiters or “everything until” a certain character.

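[^aeiou] matches any single character that is not a lowercase vowel, including consonants, uppercase letters, digits, punctuation, and whitespace.
\D matches any character that is not a digit. Where \d is ASCII-only, it is equivalent to [^0-9].
\W matches any character that is not a word character. Where \w is ASCII-only, it is equivalent to [^a-zA-Z0-9_].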
\S matches any character that is not a whitespace character.

// Match a string enclosed in quotes, but don't include the quotes themselves.
// This matches any character that is not a quote, one or more times.
const quotedTextPattern = /"([^"]+)"/;
const result = quotedTextPattern.exec('The message was "Hello, world!"');
console.log(result[1]); // Output: Hello, world! (capture group 1, without the surrounding quotes)

// Split a string on any non-digit character
const dataString = "ID:123,Price:$456";
const parts = dataString.split(/\D+/);
console.log(parts); // Output: [ '', '123', '456' ]
// Note the empty first element due to leading non-digit characters.
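
The "everything until" idiom mentioned above can also be sketched in Python. The input line and its key=value layout are hypothetical, chosen only to show how a negated class stops at a delimiter without needing a lazy quantifier:

```python
import re

# Hypothetical input: key=value pairs separated by semicolons.
line = "host=example.org;port=8080;mode=read only"

# [^=;]+ matches a run of characters up to (not including) the next '=' or ';'.
# [^;]+  matches a run of characters up to the next ';' or the end of the string.
pairs = re.findall(r'([^=;]+)=([^;]+)', line)

print(pairs)
# [('host', 'example.org'), ('port', '8080'), ('mode', 'read only')]
```

Because the class itself excludes the delimiter, the match cannot run past it, which tends to be easier to reason about than .*? followed by the delimiter.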

Important Pitfalls and Best Practices

Metacharacters Inside Brackets: Most regex metacharacters (e.g., ., *, +, ?, |) lose their special meaning inside square brackets and are treated as literal characters. The major exceptions are the closing bracket ], the hyphen -, the caret ^, and the backslash \. To include a literal hyphen, either escape it (\-) or place it first or last in the set (e.g., [-a-z] or [a-z-]). To include a literal caret, either escape it (\^) or place it anywhere but first (e.g., [a-z^]). To include a literal closing bracket, it is safest to escape it (e.g., [\[\]]). Placing it first in the set (e.g., []a-z]) works in some engines, such as POSIX tools, PCRE, and Python, but not in JavaScript, where [] is an empty class.
The Dot (.) is Not a Character Class: While often grouped conceptually with the shorthands, the dot . is a separate regex token that, by default, matches any character except a newline (most engines offer a "dotall" or s flag to lift that restriction). It has no negated form of its own. Inside brackets the dot loses its special meaning: [.] is a class containing only a literal dot, with no escaping needed, and [^.] therefore means "any character except a literal dot", not "the opposite of .".
Locale and Unicode Awareness: The definitions of \w, \d, and \s (and their negations) vary between engines and can be influenced by locale settings or Unicode-related flags. In Python 3, for example, \d and \w match digits and letters from any script by default for str patterns, and the re.ASCII flag restricts them to ASCII. In JavaScript, \d is always [0-9] and the u flag on its own does not widen \w to non-ASCII letters, while \s already covers Unicode whitespace. Always test these shorthands if your application will process international text. For maximum clarity and control, especially with non-ASCII text, explicitly defining your character classes with ranges, or with Unicode properties where the engine supports them, is often a better practice.
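Performance: Character classes are generally very efficient. A class like [aeiou] is usually at least as fast as an alternation group like (a|e|i|o|u), and often faster, because the regex engine can test a single character position against a set instead of trying each alternative in turn. Some engines optimize simple alternations into the same form, so measure before relying on the difference.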
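
The bracket rules in the pitfalls above can be verified with a few small checks. This is a minimal Python sketch; Python's re module serves only as one example engine, the variable names are illustrative, and engine-specific details such as []a-z] are deliberately left out:

```python
import re

# Metacharacters are literal inside brackets: this class matches ., *, +, ? or |.
literals = re.findall(r'[.*+?|]', 'a.b*c+d?e|f')
print(literals)      # ['.', '*', '+', '?', '|']

# A hyphen placed last is a literal hyphen rather than a range operator.
hyphen_last = re.findall(r'[a-c-]', 'a-b-z')
print(hyphen_last)   # ['a', '-', 'b', '-']

# A caret that is not first is a literal caret.
caret_later = re.findall(r'[a-c^]', '^a^z')
print(caret_later)   # ['^', 'a', '^']

# Escaped brackets are literal brackets.
brackets = re.findall(r'[\[\]]', 'x[0]')
print(brackets)      # ['[', ']']

# Outside brackets the dot skips newlines by default. [.] is a literal dot,
# and [^.] is any character except a literal dot (newline included).
text = 'a.\n'
bare_dot = re.findall(r'.', text)
dot_class = re.findall(r'[.]', text)
not_dot = re.findall(r'[^.]', text)
print(bare_dot)      # ['a', '.']
print(dot_class)     # ['.']
print(not_dot)       # ['a', '\n']
```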