31.3 uniq: Removing Duplicate Lines (-c, -d, -u)

Right, uniq. The name is a bit of a lie, and that’s the first thing you need to get into your head. It doesn’t magically find all unique lines in a file. No, no. Its job is far more specific, and frankly, a little bit dumb: it only collapses adjacent duplicate lines. If the duplicates in your data aren’t already sitting next to each other (which usually means sorting first), uniq is about as useful as a screen door on a submarine.

Think of it like this: you’re scanning a list, and you only remove a line if the one immediately above it is identical. This is why you almost always see sort and uniq chained together. It’s a package deal. sort file.txt | uniq is the classic one-two punch for actually getting all unique lines.
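
To see the "adjacent only" rule in isolation, here is a minimal sketch. The fruit list is made-up sample data, and the variable names are just for illustration.

```shell
# Hypothetical sample data: "apple" appears three times, twice back to back.
fruits=$(printf '%s\n' apple apple banana apple)

# uniq alone collapses only the adjacent pair; the last "apple" survives.
adjacent_only=$(printf '%s\n' "$fruits" | uniq)

# Sorting first puts all three "apple" lines next to each other.
fully_deduped=$(printf '%s\n' "$fruits" | sort | uniq)

printf 'uniq alone:\n%s\n\n' "$adjacent_only"
printf 'sort | uniq:\n%s\n' "$fully_deduped"
```

The first result still contains two apples (one before banana, one after); only the sorted pipeline gets it down to one.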

The Core Function: Adjacent Deduplication

Let’s get the failure case out of the way first so you understand its limitations. Imagine a guest list where people have signed in multiple times.

$ cat party_guests.txt
Alice
Bob
Alice
Charlie
Bob
Alice

Running uniq on this directly is an exercise in frustration. No two identical names sit next to each other, so uniq finds nothing to remove and hands the list back unchanged.

$ uniq party_guests.txt
Alice
Bob
Alice
Charlie
Bob
Alice

Pathetic. Now, let’s do it correctly by sorting first. This groups all identical lines together, making uniq’s job possible.

$ sort party_guests.txt | uniq
Alice
Bob
Charlie

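There. Now you have a proper, deduplicated list. This is non-negotiable: to deduplicate a whole file, uniq needs every duplicate sitting next to its twin, and sorting is the standard way to get there. Don’t forget it. (If plain deduplication is all you need, sort -u does the same job in one step.)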

Counting Your Dupes (-c)

This is where uniq starts to earn its keep. The -c (count) flag is brilliant. Instead of just removing duplicates, it prefixes each line with the number of times it occurred. This is how you turn a list into a frequency report. It’s incredibly useful for log analysis, survey results, or figuring out which error message is haunting your dreams.

$ sort party_guests.txt | uniq -c
      3 Alice
      2 Bob
      1 Charlie

See? Alice is clearly the life of the party. The output is formatted with the count left-padded in a field, which makes it easy to pipe into sort -n for sorting by frequency.

$ sort party_guests.txt | uniq -c | sort -n
      1 Charlie
      2 Bob
      3 Alice

Now the most common entries sit at the bottom of the list; use sort -rn instead if you’d rather have them at the top. This pipeline is a workhorse.
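
For a "top N" report, the same pipeline can be flipped and trimmed. This is a rough sketch rather than a canonical recipe: guest_list is a helper invented here to recreate the guest data, and the final sed step is an addition that strips uniq’s padding, whose width differs between implementations.

```shell
# Hypothetical helper that prints the same names as party_guests.txt.
guest_list() { printf '%s\n' Alice Bob Alice Charlie Bob Alice; }

# Count each name, sort by count descending (-r reverses -n),
# keep the top two, then strip the leading padding from the counts.
top_two=$(guest_list | sort | uniq -c | sort -rn | head -n 2 | sed 's/^ *//')

printf '%s\n' "$top_two"
```

With this data the report should list Alice (3) first and Bob (2) second. Swap the helper for a real log file and raise the head count to taste.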

Finding the Repeat Offenders (-d) and the Loners (-u)

Sometimes, you don’t want the unique list; you want to know what’s duplicated. The -d (repeated) flag prints one copy of each line that appears more than once, and nothing else. It’s the “show me the duplicates” option.

$ sort party_guests.txt | uniq -d
Alice
Bob

Notice Charlie is missing from this list because he only showed up once. Poor Charlie.

Conversely, the -u (unique) flag does the opposite: it only prints lines that are not duplicated. It finds the true uniques.

$ sort party_guests.txt | uniq -u
Charlie

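Treat these two flags as mutually exclusive: a line can’t be both repeated and unrepeated, so asking for both gets you nothing useful (GNU uniq, for one, prints nothing at all). Use -d to find duplicates and -u to find uniques; the default behavior (no flags) shows you everything, just deduplicated.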
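
One way to convince yourself that -d and -u split the default output cleanly in two is to glue their results back together. A small sketch, again using a made-up guest_list helper in place of the file:

```shell
# Hypothetical helper that prints the same names as party_guests.txt.
guest_list() { printf '%s\n' Alice Bob Alice Charlie Bob Alice; }

repeated=$(guest_list | sort | uniq -d)   # names seen more than once
singles=$(guest_list | sort | uniq -u)    # names seen exactly once
everyone=$(guest_list | sort | uniq)      # default: one copy of each name

# Merging the two partial lists and re-sorting should reproduce the default output.
recombined=$(printf '%s\n%s\n' "$repeated" "$singles" | sort)

if [ "$recombined" = "$everyone" ]; then
    echo "-d and -u together cover every distinct line"
fi
```

Every distinct line lands in exactly one of the two buckets, which is why combining the flags in a single call makes no sense.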

The Devil’s in the Details: Whitespace and Comparison

Here’s a common pitfall that will make you tear your hair out: uniq is picky. It doesn’t just look at the text; it looks at the entire line, including any trailing whitespace.

$ cat notes.txt
important
important 
important

Those three lines are not the same to uniq. The second one has a trailing space, so it matches neither of its neighbors. Watch:

$ uniq notes.txt
important
important 
important

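It kept all three! The odd one out in the middle even stops the two genuinely identical lines from being adjacent. Always pre-process your data with something like sed 's/[[:space:]]*$//' to trim trailing whitespace before throwing it at uniq, or prepare for bizarre results. Similarly, it’s case-sensitive by default. ALICE and alice are different lines. Note that sort -f (fold case) on its own only groups them together; uniq will still see two different lines. If you want case-insensitive uniqueness, follow sort -f with uniq -i (ignore case, supported by GNU and BSD uniq but not required by POSIX), or pipe your text through tr first to normalize case.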
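
Put together, a cleanup pipeline might look like the sketch below. The messy input is invented for the example, and tr is used for case folding simply because it works everywhere; uniq -i would be the alternative on GNU or BSD systems.

```shell
# Hypothetical messy input: mixed case, plus a trailing space on the second line.
messy=$(printf '%s\n' 'Alice' 'alice ' 'ALICE' 'Bob')

# 1. sed trims trailing whitespace.
# 2. tr lowercases everything so case no longer matters.
# 3. sort | uniq -c counts the normalized lines.
# 4. The last sed strips the padding uniq -c puts in front of each count.
clean_counts=$(printf '%s\n' "$messy" \
    | sed 's/[[:space:]]*$//' \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | sed 's/^ *//')

printf '%s\n' "$clean_counts"
```

All three spellings of Alice collapse into a single line with a count of 3. The price is that the original capitalization is gone, so keep the raw file around if that matters.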

The takeaway? uniq is a simple tool with a simple mind. It does exactly what it says, no more, no less. Your job is to prepare the data to its exacting standards. Master that, and it becomes an indispensable part of your text-wrangling toolkit.