31.1 sort: Alphabetical, Numeric, and Reverse Sorting
Right, let’s talk about sort. It’s one of those commands you’ll use so often it becomes a reflex, but it’s also deceptively powerful. Most people get as far as sort file.txt and call it a day, but that’s like using a sports car to drive to the mailbox and back. We’re going to open the garage door and take this thing for a proper spin.
By default, sort does what you’d expect: it reads lines of text and sorts them in ascending order. But here’s the first “gotcha” that trips up everyone, including me on a bad day: it uses the locale’s collating order. This means sort file.txt on your machine and my machine might give slightly different results if we’re in different countries. It’s generally alphabetical, but it knows that in Spanish, for example, “ñ” should sort after “n”. For 99% of what you do, this is fine, but just know the ghost of internationalization is in the machine. When you need the same output everywhere, run it as LC_ALL=C sort file.txt, which compares raw byte values and ignores the locale entirely.
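To see the locale effect for yourself, here is a minimal sketch. The three-word input is made up for illustration, and the second command's output depends on whatever locale your session happens to use, so only the C-locale result is spelled out.

```shell
# Same input, sorted under the plain byte-order "C" locale.
# In C, every uppercase letter sorts before every lowercase letter.
c_sorted=$(printf 'apple\nBanana\ncherry\n' | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# Banana
# apple
# cherry

# Sorted under your session's own locale. Many locales
# (en_US.UTF-8, for instance) put "apple" first here instead.
printf 'apple\nBanana\ncherry\n' | sort
```

If the two commands disagree on your machine, that is the collating order at work, not a bug.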
The Default (and Why It’s Weird with Numbers)
Let’s look at a simple example. We have a file, animals.txt:
zebra
aardvark
3 monkeys
11 elephants
2 lions
Running sort animals.txt gives us this:
11 elephants
2 lions
3 monkeys
aardvark
zebra
Wait, what? “11” came before “2”? That’s absurd! Well, no, it’s actually perfectly logical. By default, sort does a lexicographical sort (i.e., character by character, like in a dictionary). Comparing “11 elephants” with “2 lions”, it looks at the first characters, ‘1’ and ‘2’, and stops right there: ‘1’ comes before ‘2’ in the character table, so “11” wins without the second digit ever being consulted. This is, frankly, useless for numbers. Which is why we have…
Numeric Sorting: Your Sanity Saver
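Behold the -n flag. This tells sort, “Hey, please be a dear and actually treat these as numbers.” It compares the leading number on each line by value (an optional minus sign and a decimal fraction are understood too), which makes it non-negotiable for any data you want in numeric order.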
sort -n animals.txt
Output:
aardvark
zebra
2 lions
3 monkeys
11 elephants
Now we’re talking: 2, 3, 11, in that order. But notice where the purely textual lines, “aardvark” and “zebra”, ended up. sort -n doesn’t skip lines that lack a leading number; it treats them as zero, so they land ahead of every positive number (and would land after any negative ones). Lines that tie on the numeric value, like our two zeros, fall back to a plain text comparison, which is why “aardvark” still precedes “zebra”. It’s a bit of a kludge, but it’s predictable.
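A quick way to convince yourself how -n handles lines with no leading number is to throw a negative value into the mix. This is a small sketch with invented input ("-5 degrees" is not in animals.txt), and LC_ALL=C is there only to keep the result reproducible.

```shell
# Lines without a leading number count as 0 under -n,
# so they sit between the negative and the positive values.
numeric_sorted=$(printf '%s\n' '2 lions' 'aardvark' '-5 degrees' '11 elephants' | LC_ALL=C sort -n)
printf '%s\n' "$numeric_sorted"
# -5 degrees
# aardvark
# 2 lions
# 11 elephants
```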
For real data processing, you often have files where the number isn’t at the beginning. This is where sort gets clever.
Sorting by a Specific Field
Let’s say we have a comma-delimited file, employees.csv:
Alice Smith,Engineering,75000
Bob Jones,Marketing,65000
Carol Davis,Engineering,82000
Sorting this normally compares whole lines, which here means ordering by first name. Useful, but what if you want to sort by salary (the third field)? This is where -t (field separator) and -k (key definition) come to the rescue.
sort -t ',' -k 3n employees.csv
-t ',': Sets the field delimiter to a comma.
-k 3n: This means “use the 3rd field as the sort key, and treat it numerically (n)”.
The output is what any underpaid employee would expect:
Bob Jones,Marketing,65000
Alice Smith,Engineering,75000
Carol Davis,Engineering,82000
You can get wildly specific with -k. Want to sort by department (field 2) and then by salary (field 3) within each department? No problem.
sort -t ',' -k 2,2 -k 3n employees.csv
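-k 2,2 means “use field 2 for the entire sort key” (from start of field 2 to end of field 2). This ensures we only look at the department for the first key. Without the second 2, a bare -k 2 would run from field 2 to the end of the line, dragging the salary into the department comparison as text. Then, for any lines that have the same department, the second key -k 3n breaks the tie by salary.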
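The difference between -k 2 and -k 2,2 is easier to believe when you watch it change the output. The sketch below uses hypothetical rows: the "Dan Reed" line is invented so that the two commands disagree, and LC_ALL=C pins the comparison to byte order.

```shell
# Hypothetical rows; the third one is made up for this demonstration.
rows='Alice Smith,Engineering,75000
Bob Jones,Marketing,65000
Dan Reed,Engineering,9000'

# -k 2 runs from field 2 to the end of the line, so the first key is
# "Engineering,75000" vs "Engineering,9000": the salary is compared as
# text ('7' before '9') and the numeric key never gets a say.
open_key=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ',' -k 2 -k 3n)
printf '%s\n' "$open_key"
# Alice Smith,Engineering,75000
# Dan Reed,Engineering,9000
# Bob Jones,Marketing,65000

# -k 2,2 stops at the end of field 2, so the Engineering rows tie
# and fall through to the numeric comparison on field 3.
closed_key=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ',' -k 2,2 -k 3,3n)
printf '%s\n' "$closed_key"
# Dan Reed,Engineering,9000
# Alice Smith,Engineering,75000
# Bob Jones,Marketing,65000
```

Closing every key with an explicit end field (3,3 as well as 2,2) is a cheap habit that keeps later columns from leaking into a comparison.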
Reverse Engineering Your Order
Need the highest salary at the top? That’s what -r is for. It reverses the final sort order. It works with any key type.
sort -t ',' -k 3nr employees.csv
Output:
Carol Davis,Engineering,82000
Alice Smith,Engineering,75000
Bob Jones,Marketing,65000
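A crucial tip: where you put the r matters. Given as a standalone option (sort -r), it applies to the whole line when there are no keys, and otherwise to every -k that carries no modifiers of its own. Attached to a key, as in -k 3nr above, it reverses only that key. So reversing one key but not another is straightforward: sort -t ',' -k 2,2 -k 3,3nr employees.csv lists departments A to Z with the highest salary first inside each one. The trap is mixing the two styles: a key that already has a modifier such as n ignores a standalone -r, so sort -r -k 3n does not sort field 3 in descending order.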
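Here is a small sketch of reversing one key at a time by attaching r to that key. It reuses the employees.csv rows from earlier (inlined so the snippet is self-contained) and pins the locale with LC_ALL=C.

```shell
rows='Alice Smith,Engineering,75000
Bob Jones,Marketing,65000
Carol Davis,Engineering,82000'

# Department ascending, salary descending within each department.
dept_then_top_salary=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ',' -k 2,2 -k 3,3nr)
printf '%s\n' "$dept_then_top_salary"
# Carol Davis,Engineering,82000
# Alice Smith,Engineering,75000
# Bob Jones,Marketing,65000

# Department descending, salary ascending within each department.
dept_reversed=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ',' -k 2,2r -k 3,3n)
printf '%s\n' "$dept_reversed"
# Bob Jones,Marketing,65000
# Alice Smith,Engineering,75000
# Carol Davis,Engineering,82000
```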
The Pit of Whitespace and Stability
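One last pro tip. When you don’t pass -t, sort splits fields at the transition from a non-blank character to a blank (space or tab), and the blanks that precede a field count as part of that field. This can make a file with sloppy formatting sort in weird ways, because a key padded with two spaces compares differently from one padded with a single space. Adding -b tells sort to ignore those leading blanks. If you’re dealing with human-generated data, always eyeball it first with cat -A (GNU cat) or less to see the hidden characters.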
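To make the leading-blank problem concrete, here is a tiny sketch with two invented lines whose second column is padded unevenly. It assumes the C locale (set via LC_ALL=C), where a space sorts before any letter.

```shell
# Two lines; the second field is preceded by two spaces in one, one space in the other.
padded=$(printf '%s\n' 'x  b' 'x a')

# Without -b the keys are "  b" and " a", blanks included,
# and the extra space makes "  b" sort first.
raw_key=$(printf '%s\n' "$padded" | LC_ALL=C sort -k 2,2)
printf '%s\n' "$raw_key"
# x  b
# x a

# With -b the leading blanks are skipped, so the keys are "b" and "a".
trimmed_key=$(printf '%s\n' "$padded" | LC_ALL=C sort -b -k 2,2)
printf '%s\n' "$trimmed_key"
# x a
# x  b
```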
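Also, despite what you might assume, sort is not a “stable” sort by default. Stable is a computer science term meaning that if two lines compare as equal, their original input order is preserved. When every key ties, sort instead falls back to comparing the whole lines as a last resort, so tied lines come out in byte order rather than input order. If you want ties left exactly as they arrived, which matters most when you’re chaining sorts, add -s (supported by GNU sort, among others).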
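A short sketch of what -s changes in practice, assuming GNU sort. The rows are a trimmed, reordered variant of the employee data, arranged so that input order and byte order differ for the tied lines.

```shell
# Carol appears before Alice in the input; both share the key "Engineering".
rows='Carol Davis,Engineering
Alice Smith,Engineering
Bob Jones,Marketing'

# Default: tied lines are re-compared as whole lines, so Alice jumps ahead.
default_order=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ',' -k 2,2)
printf '%s\n' "$default_order"
# Alice Smith,Engineering
# Carol Davis,Engineering
# Bob Jones,Marketing

# With -s: tied lines keep their input order, so Carol stays first.
stable_order=$(printf '%s\n' "$rows" | LC_ALL=C sort -s -t ',' -k 2,2)
printf '%s\n' "$stable_order"
# Carol Davis,Engineering
# Alice Smith,Engineering
# Bob Jones,Marketing
```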
So there you have it. sort is far more than an alphabetical organizer; it’s a data wrangler’s best friend. Use -n liberally, master -t and -k, and you’ll be slicing and dicing text files like a pro.