30.1 grep: Basic and Extended Regular Expressions

Right, let’s talk about grep. It’s the first text search tool you learn, and if you’re like me, it’s the one you’ll use 90% of the time. The name stands for Global Regular Expression Print, which sounds intimidating but just means “find this pattern in the file and show me the lines where it appears.” Its superpower is its simplicity. You give it a pattern and a file, and it gets to work. No fuss.

But here’s the secret: most of grep’s power, and most of the confusion around it, comes from the regular expressions it uses. And grep has a… let’s call it a “quirky” history with them, which leads to the first big fork in the road.

Basic vs. Extended Regular Expressions: The Great Schism

This is where grep shows its age. For historical reasons, it distinguishes between Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE). The difference boils down to one thing: which meta-characters you have to escape.

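In BRE, the characters +, ?, {, |, (, and ) are treated as literal characters. To get their special regex meaning, you escape them with a backslash. One catch: POSIX only guarantees this for intervals and groups (\{, \}, \(, and \)). The \+, \?, and \| forms are GNU extensions that other greps may not honor. In ERE, it’s the opposite: those characters are special by default, and you escape them to make them literal.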

Why does this madness exist? Because BRE came first, from the original ed editor. ERE was added later to make patterns less cluttered. It’s a classic case of Unix “worse is better” design. My advice? Unless you have to maintain an existing BRE pattern, use Extended Regular Expressions by invoking grep -E. You’ll still see its old alias, egrep, in scripts, but it’s deprecated and recent GNU grep releases print a warning when you run it. grep -E is far more intuitive and saves you from a backslash-induced headache.

# Let's find lines containing '1+' meaning one or more '1's.

# This fails with BRE because '+' is literal. It searches for the literal string "1+".
grep '1+' file.txt

# This works in BRE with GNU grep ('\+' is a GNU extension), but look at that ugly escape.
grep '1\+' file.txt

# This works cleanly with ERE. This is the way.
grep -E '1+' file.txt
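
To see the two dialects side by side, here is a small sketch. The file and its contents are made up for illustration, and the BRE version uses the \{1,\} interval rather than \+ because the interval is the spelling POSIX defines.

```shell
# Hypothetical sample data in a temp directory, so nothing real is touched.
tmpdir=$(mktemp -d)
printf '%s\n' 'a1b' 'a111b' 'a1+b' 'ab' > "$tmpdir/sample.txt"

# BRE, unescaped '+': matches only the two literal characters "1+".
bre_literal=$(grep '1+' "$tmpdir/sample.txt")

# BRE, portable interval: "one or more 1s" spelled as \{1,\}.
bre_interval=$(grep '1\{1,\}' "$tmpdir/sample.txt")

# ERE: the same "one or more 1s" with no backslashes.
ere_plus=$(grep -E '1+' "$tmpdir/sample.txt")

# ERE with an escaped plus: back to the literal "1+".
ere_literal=$(grep -E '1\+' "$tmpdir/sample.txt")

printf 'BRE literal:\n%s\n\n' "$bre_literal"
printf 'BRE interval:\n%s\n\n' "$bre_interval"
printf 'ERE plus:\n%s\n\n' "$ere_plus"
printf 'ERE literal:\n%s\n' "$ere_literal"

rm -rf "$tmpdir"
```

The two “literal” searches should each print only a1+b, while the interval and ERE versions print the three lines that contain at least one 1.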

Anchors: ^ and $ (Not the Nautical Kind)

These are non-negotiable. You must understand them. The caret ^ anchors your pattern to the beginning of a line. The dollar sign $ anchors it to the end of a line. They don’t match characters; they match positions.

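This is the most common beginner mistake: forgetting anchors and getting partial matches. Want to find every line that starts with “error”? That’s ^error. If you just use error, you’ll get every line where those five letters appear anywhere, including inside “terror” or buried mid-sentence in “no errors found”, which is probably not what you want. (An anchor pins the position, not the word: ^error still matches a line that starts with “errors”.)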

# Find lines that are exactly "DEBUG"
grep '^DEBUG$' application.log

# Find lines that end with a period (like the end of a sentence)
grep '\.$' my_novel.txt

Note: I escaped the dot (\.) because a dot is a regex meta-character that means “match any single character.” To match a literal dot, you must escape it. This is true in both BRE and ERE.
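
Here is a quick sketch of how anchors and the escaped dot change what gets counted. The log file and its lines are invented for the example; grep -c prints the number of matching lines instead of the lines themselves.

```shell
# Hypothetical log file in a temp directory.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'error: disk full' \
  'terror level raised' \
  'no errors found' \
  'DEBUG' \
  'DEBUG mode on' \
  'v1x2' \
  'v1.2' > "$tmpdir/app.log"

unanchored=$(grep -c 'error' "$tmpdir/app.log")    # 3: matches anywhere in the line
anchored=$(grep -c '^error' "$tmpdir/app.log")     # 1: only the line that starts with it
exact=$(grep -c '^DEBUG$' "$tmpdir/app.log")       # 1: the whole line must be "DEBUG"
any_char=$(grep -c 'v1.2' "$tmpdir/app.log")       # 2: '.' matches the 'x' as well
literal_dot=$(grep -c 'v1\.2' "$tmpdir/app.log")   # 1: only the real dot

echo "unanchored=$unanchored anchored=$anchored exact=$exact"
echo "any_char=$any_char literal_dot=$literal_dot"

rm -rf "$tmpdir"
```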

Character Classes and the Magic Bracket

Square brackets [ ] are your best friend for “match any one of these characters.” It’s a fantastic tool for tolerating inconsistent data. Think [Tt] for a word that might start with a capital or lowercase ‘T’, or [aeiou] to find a vowel.

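The real magic trick inside a character class is the hyphen (-) for defining ranges. [0-9] is identical to [0123456789] but much cooler. You can combine them: [a-zA-Z] for any letter. (Letter ranges can behave surprisingly in some locales; [[:alpha:]] is the portable spelling if that ever bites you.)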

# Find all lines containing three consecutive digits (this also hits longer numbers like 12345)
grep -E '[0-9]{3}' data.txt

# Find "color" or "colour" because the US/UK debate is a solved problem here.
grep 'colou?r' file.txt  # Wait, this fails in BRE! The '?' is literal, so it only matches the text "colou?r".
grep -E 'colou?r' file.txt # This works. The '?' makes the 'u' optional.
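
One thing worth trying yourself: [0-9]{3} says “three digits in a row”, not “a number with exactly three digits”. The sketch below uses made-up data and one possible way to pin the count, by requiring a non-digit (or the line edge) on both sides.

```shell
# Hypothetical data file in a temp directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'id 42' 'id 123' 'id 12345' 'Total: 7' 'total: 9' > "$tmpdir/data.txt"

# Three digits in a row: also matches inside the longer number 12345.
three_plus=$(grep -E '[0-9]{3}' "$tmpdir/data.txt")

# Exactly three: a non-digit or a line edge must sit on each side.
exactly_three=$(grep -E '(^|[^0-9])[0-9]{3}([^0-9]|$)' "$tmpdir/data.txt")

# A bracket tolerating inconsistent capitalisation.
either_case=$(grep -c '^[Tt]otal' "$tmpdir/data.txt")

printf 'three or more:\n%s\n\n' "$three_plus"
printf 'exactly three:\n%s\n\n' "$exactly_three"
printf 'Total/total lines: %s\n' "$either_case"

rm -rf "$tmpdir"
```

The first search prints both id 123 and id 12345; the second prints only id 123.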

The Pit of Despair: Greedy Matching

Here’s the part that trips up absolutely everyone, including me on a bad day. Regular expressions are greedy by default. This means they will match the longest possible string that satisfies the pattern.

Let’s say you have an HTML tag (I know, don’t parse HTML with regex, but just for example) <div>some stuff</div>. You foolishly try to match everything between the tags with <div>.*</div>. The .* doesn’t mean “some characters until the next </div>”; it means “ALL THE CHARACTERS until the very last </div> on the line.” (grep works one line at a time, so the damage stops at the newline.) If you have multiple divs on one line, it will slurp up everything from the first <div> to the very last </div>. Since grep prints whole matching lines by default, you’ll mostly notice this when you use -o to print only the matched text, or --color to highlight it.

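Plain grep can’t fix this with BRE or ERE; neither dialect has a non-greedy quantifier, and the same goes for sed and awk. You have three ways out: use a negated character class like [^<]* so the match can’t run past the next tag, use GNU grep’s -P flag for Perl-style patterns (where .*? is non-greedy, if your build supports it), or graduate to perl itself. But for now, just be aware of the greed. It will save you hours of confusion.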
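
Here is a minimal sketch of the greed in action, using a made-up line with two short tags. It relies on -o, which prints only the matched text; GNU and BSD grep both have it, though POSIX doesn’t require it.

```shell
# Hypothetical input: two tags on one line.
line='<b>one</b> and <b>two</b>'

# Greedy: .* runs to the LAST </b> on the line, so the match is the whole thing.
greedy=$(printf '%s\n' "$line" | grep -o '<b>.*</b>')

# Workaround: forbid '<' inside the body, so each match stops at its own closing tag.
bounded=$(printf '%s\n' "$line" | grep -o '<b>[^<]*</b>')

printf 'greedy:\n%s\n\n' "$greedy"
printf 'bounded:\n%s\n' "$bounded"
```

The greedy pattern prints one match covering the entire line; the bounded one prints <b>one</b> and <b>two</b> on separate lines.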

Best Practices for the Smart Grepper

(Almost) Always Quote Your Patterns: Put your search pattern in single quotes: grep 'pattern' file. This prevents the shell from interpreting any special characters (like * or $) before grep even sees them. It’s not just a good idea; it’s the law.
-i for Case-Insensitivity: The world is messy. Text is inconsistently cased. Use grep -i 'error' to find “Error”, “ERROR”, and “error”. It’s a lifesaver.
-v to Invert the Match: Show me all the lines that don’t contain this pattern. Incredibly useful for filtering out noise.
-n to Show Line Numbers: You’re not just searching; you’re investigating. You need to know where the thing is. -n gives you the line number.
-w for Whole Words: This is a nicer alternative to cumbersome word-boundary anchors (\b). grep -w 'class' will find “class” but not “classic” or “subclass”. It’s clean and intuitive.
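
The flags above are easy to try on a throwaway file. This is just a sketch with invented contents; swap in your own file and patterns.

```shell
# Hypothetical notes file in a temp directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'Error: bad class' 'classic rock' 'ERROR again' 'all good' > "$tmpdir/notes.txt"

# -i: ignore case (-c just counts the matching lines).
case_insensitive=$(grep -ci 'error' "$tmpdir/notes.txt")

# -v: keep only the lines that do NOT match.
inverted=$(grep -vi 'error' "$tmpdir/notes.txt")

# -n: prefix each match with its line number.
numbered=$(grep -n 'good' "$tmpdir/notes.txt")

# -w: match "class" as a whole word, so "classic" is skipped.
whole_word=$(grep -w 'class' "$tmpdir/notes.txt")

printf '%s\n' "-i count: $case_insensitive" "-n: $numbered" "-w: $whole_word"
printf -- '-v:\n%s\n' "$inverted"

rm -rf "$tmpdir"
```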

So there you have it. grep isn’t just a command; it’s your first and most loyal line of defense against the chaos of unstructured text. Master its regex quirks, and you can ask precise questions of your data. Now go find something.