31.7 paste and join: Combining Files Side by Side and by Key
Right, so you’ve sorted your data, you’ve de-duplicated it, you’ve sliced and diced it. Now you’re left with two or more files, each a neat column of information, and you need to put them together. This is where paste and join come in. They are the dynamic duo of horizontal file combination, but they have wildly different personalities and use cases. One is a simple, no-fuss bricklayer; the other is a finicky, key-obsessed database administrator.
The Simple Art of paste
Think of paste as the command-line equivalent of slapping two pieces of paper side-by-side on a photocopier. It doesn’t ask questions. It doesn’t look for relationships. It just takes line 1 from file A and line 1 from file B and mashes them together, separated by a tab, onto a new line. It is gloriously, stupidly simple, and that’s its power.
Let’s say you have two files. One (names.txt) has a list of employees, and another (ids.txt) has their employee IDs in the same order.
$ cat names.txt
Alice
Bob
Carol
$ cat ids.txt
a123
b456
c789
To combine them side-by-side, you just… paste them.
$ paste names.txt ids.txt
Alice a123
Bob b456
Carol c789
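See? No thinking required. The default delimiter is a tab, which is why they line up nicely. But what if you want a comma, or a semicolon? paste has you covered with the -d (delimiter) flag. Its argument is a list of single-character delimiters, not a string. Give it one character and that character goes between every pair of columns. Give it several and paste cycles through them: the first goes between columns 1 and 2, the second between columns 2 and 3, and so on, wrapping around if it runs out. That means -d', ' does not produce a comma followed by a space; with two files, only the comma is ever used.
$ paste -d',' names.txt ids.txt
Alice,a123
Bob,b456
Carol,c789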
$ paste -d':-' names.txt ids.txt otherfile.txt
Alice:a123-firstfield
Bob:b456-secondfield
Carol:c789-thirdfield
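If you want to watch the delimiter cycling for yourself, here is a minimal sketch. The scratch directory and the teams.txt file are made up for illustration; only names.txt and ids.txt mirror the files used above.

```shell
# Hypothetical sample data, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' Alice Bob Carol > "$tmp/names.txt"
printf '%s\n' a123 b456 c789  > "$tmp/ids.txt"
printf '%s\n' Sales Ops Legal > "$tmp/teams.txt"   # invented third column

# One delimiter character: used between every pair of columns.
csv=$(paste -d',' "$tmp/names.txt" "$tmp/ids.txt" "$tmp/teams.txt")

# Two delimiter characters: ':' after column 1, '-' after column 2.
mixed=$(paste -d':-' "$tmp/names.txt" "$tmp/ids.txt" "$tmp/teams.txt")

printf '%s\n' "$csv" "$mixed"
rm -r "$tmp"
```

The first three lines of output use commas throughout (Alice,a123,Sales), while the last three alternate the two delimiters (Alice:a123-Sales).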
The most common “gotcha” with paste is file length mismatch. If one file is longer than the other, paste just keeps going, treating the exhausted file as a run of empty lines until the longest file runs out of data. It won’t error out. It’s your responsibility to make sure your files are in the same order and of the expected lengths. This no-questions-asked behavior is its greatest strength and its most common pitfall.
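One way to guard against a mismatch is to compare line counts before pasting. The sketch below is only an illustration: safe_paste is a hypothetical helper name, not a standard tool, and the sample files are invented, with ids.txt deliberately one line short.

```shell
tmp=$(mktemp -d)
printf '%s\n' Alice Bob Carol > "$tmp/names.txt"
printf '%s\n' a123 b456       > "$tmp/ids.txt"   # deliberately one line short

# Without a check, paste pads the exhausted file with an empty field.
unchecked=$(paste -d',' "$tmp/names.txt" "$tmp/ids.txt")

# safe_paste (hypothetical helper): refuse to paste two files
# whose line counts differ.
safe_paste() {
    n1=$(( $(wc -l < "$1") ))   # arithmetic expansion strips any padding from wc
    n2=$(( $(wc -l < "$2") ))
    if [ "$n1" -ne "$n2" ]; then
        echo "line count mismatch: $1 has $n1, $2 has $n2" >&2
        return 1
    fi
    paste "$1" "$2"
}

if safe_paste "$tmp/names.txt" "$tmp/ids.txt"; then
    status=ok
else
    status=mismatch
fi
echo "$status"
rm -r "$tmp"
```

Here the unchecked paste ends with a dangling "Carol," line, while the guarded version refuses to run and reports the mismatch on standard error.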
The Keyed Precision of join
Now, meet paste’s pedantic older brother. join is for when you have a relationship between files. It doesn’t pair lines by position; it pairs them by a shared key field. It’s like doing a SQL table join in the shell, and it’s just as fussy about its input data.
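By default, join expects both files to be sorted on the join field (the first field unless you say otherwise) and expects fields to be separated by blanks. The order has to be the one sort itself produces in the same locale, so run sort and join with matching locale settings. If your data doesn’t meet these exacting standards, join may quietly drop lines that should have matched, or complain that its input is not in sorted order, depending on the implementation. I cannot stress this enough: join requires sorted input on the join key. It’s not a suggestion.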
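Rather than trusting that a file is sorted, you can ask sort to check. The sketch below uses sort -c, which tests the order without writing any output and exits non-zero if the input is out of order. Pinning LC_ALL=C is an assumption made here so that the check and any later join agree on collation; the sample file is invented.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a42 Alice' 'a57 Bob' 'a10 Carol' > "$tmp/users.txt"

# Check the raw file: a10 comes after a57, so this is not sorted on field 1.
if LC_ALL=C sort -c -k1,1 "$tmp/users.txt" 2>/dev/null; then
    before=sorted
else
    before=unsorted
fi

# Sort on the key field, then check again.
LC_ALL=C sort -k1,1 "$tmp/users.txt" > "$tmp/users_sorted.txt"
if LC_ALL=C sort -c -k1,1 "$tmp/users_sorted.txt" 2>/dev/null; then
    after=sorted
else
    after=unsorted
fi

echo "$before -> $after"
rm -r "$tmp"
```

A check like this is cheap enough to put in front of every join in a script.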
Let’s use a classic example. You have a file with user IDs and names (users.txt), and another with user IDs and departments (depts.txt).
$ cat users.txt
a42 Alice
a57 Bob
a10 Carol
$ cat depts.txt
a57 Engineering
a10 Design
a42 Product
Notice the files are in different orders. This is why we use join. But first, we must sort both files on the first field (the key).
$ sort -k1,1 users.txt > users_sorted.txt
$ sort -k1,1 depts.txt > depts_sorted.txt
$ cat users_sorted.txt
a10 Carol
a42 Alice
a57 Bob
$ cat depts_sorted.txt
a10 Design
a42 Product
a57 Engineering
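Now we can join them on the first field. The -j1 option, a shorthand that GNU join accepts for -1 1 -2 1, tells it to use field 1 from both files as the key. Field 1 is also the default, so a plain join with no options would do the same thing here; spelling it out just makes the intent obvious.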
$ join -j1 users_sorted.txt depts_sorted.txt
a10 Carol Design
a42 Alice Product
a57 Bob Engineering
Beautiful, right? But let’s talk about the rough edges, because there are many. What if your files are delimited by commas, not spaces? Use -t , (and remember to give sort the same -t , so both tools agree on where the key is). What if the key is in a different column in each file? For example, if the key was column 2 in the first file and column 1 in the second, you’d use -1 2 -2 1. What if you only want specific output fields? You have to use -o with a specific list like -o 1.1,1.2,2.2 (file1.field1, file1.field2, file2.field2). Write the list as a single argument with no spaces after the commas, or quote it.
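Here is a small sketch that puts -t, -1, -2 and -o together. The file names and contents are invented for illustration: names.csv keeps the ID in its second column, depts.csv keeps it in its first. LC_ALL=C is set as an assumption so that sort and join use the same collation.

```shell
export LC_ALL=C
tmp=$(mktemp -d)
printf '%s\n' 'Alice,a42' 'Bob,a57' 'Carol,a10'               > "$tmp/names.csv"
printf '%s\n' 'a57,Engineering' 'a10,Design' 'a42,Product'    > "$tmp/depts.csv"

# Sort each file on its own key column, using the same delimiter join will use.
sort -t, -k2,2 "$tmp/names.csv" > "$tmp/names_sorted.csv"
sort -t, -k1,1 "$tmp/depts.csv" > "$tmp/depts_sorted.csv"

# Key is field 2 of file 1 and field 1 of file 2.
# Output: the ID, then the name, then the department.
joined=$(join -t, -1 2 -2 1 -o 1.2,1.1,2.2 \
    "$tmp/names_sorted.csv" "$tmp/depts_sorted.csv")

printf '%s\n' "$joined"
rm -r "$tmp"
```

The result is three comma-separated lines, starting with a10,Carol,Design.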
The most brutal edge case is missing keys. By default, join does an “inner join”; it only outputs lines where the key exists in both files. If you want a “left join” (all lines from file1, with matched data from file2 where there is any), you must use the -a 1 flag. Unmatched lines come out as they are, with nothing appended; if you want a placeholder in the gap, combine -e with an explicit -o list. It’s powerful, but the syntax feels like it was designed by a grumpy Unix wizard in 1972 who assumed you’d already read the full manual. Twice.
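To make the difference concrete, the sketch below runs an inner join, a left join with -a 1, and a left join that fills the gaps using -e together with -o. The data is invented: two of the four users have no department. The placeholder NONE is an arbitrary choice, and LC_ALL=C is assumed so that the already-sorted sample data matches join’s collation.

```shell
export LC_ALL=C
tmp=$(mktemp -d)
# Both files are already sorted on field 1.
printf '%s\n' 'a10 Carol' 'a42 Alice' 'a57 Bob' 'a99 Dave' > "$tmp/users.txt"
printf '%s\n' 'a10 Design' 'a57 Engineering'               > "$tmp/depts.txt"

# Inner join: only keys present in both files.
inner=$(join "$tmp/users.txt" "$tmp/depts.txt")

# Left join: also print unpairable lines from file 1, unchanged.
left=$(join -a 1 "$tmp/users.txt" "$tmp/depts.txt")

# Left join with a placeholder: -o names the output fields
# (0 is the join key) and -e fills the ones that are missing.
filled=$(join -a 1 -e NONE -o 0,1.2,2.2 "$tmp/users.txt" "$tmp/depts.txt")

printf '%s\n' "--- inner" "$inner" "--- left" "$left" "--- filled" "$filled"
rm -r "$tmp"
```

The inner join prints two lines, the plain left join prints four (with "a42 Alice" and "a99 Dave" left short), and the filled version turns those into "a42 Alice NONE" and "a99 Dave NONE".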
Best Practice: For paste, always double-check the line count of your inputs with wc -l. For join, always pre-sort your files explicitly on the exact key field you intend to use, and use -t to specify your delimiter. Trusting your data to “probably be sorted” is a recipe for silent, catastrophic failure. These tools are incredibly powerful for building quick data pipelines, but they demand precision. Give it to them, and they’ll serve you well.