10.5 file: Detecting File Type by Magic Bytes
Right, let’s get our hands dirty. You’ve got a file. It has no extension, or worse, a lying extension like virus.exe.pdf. The file command on your system is about to become your new best friend, and we’re going to understand its secret language: magic bytes.
At its core, file doesn’t trust filenames. Filenames are suggestions; the content is the law. It works by peeking at bytes near the start of the file (the so-called “magic bytes” or “magic number”) and comparing them against a massive, gloriously detailed database of known file signatures. On many Linux systems this database lives at /usr/share/misc/magic.mgc (a compiled binary version), with its human-readable source at /usr/share/misc/magic where that is installed; exact paths vary by distribution. Go ahead, page through the source sometime (not the .mgc file, which is binary). It’s a beautiful, arcane mess of patterns and incantations.
How file Actually Works Its Magic
Think of it as a bouncer with a very specific checklist. It doesn’t just look at one thing; it runs down a list of rules. It might start with “Are the first 2 bytes 0xFF 0xD8? If yes, it’s a JPEG.” If that fails, it moves on: “Are the first 4 bytes 0x25 0x50 0x44 0x46 (which is %PDF in ASCII)? If yes, it’s a PDF.” This continues until it finds a match or gives up and tells you it’s “data.” This sequential checking is why the order in which rules are tried is critically important: a broad rule tried too early would shadow a more specific one. (file sorts its rules by a computed “strength” so that more specific tests tend to run first, rather than relying purely on their position in the magic file.)
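To make the checklist idea concrete, here is a toy sketch in shell. It is not how file is implemented; sniff_magic is a hypothetical helper that knows exactly three signatures and only ever looks at the first four bytes, using the standard od utility to read them as hex.

```shell
# sniff_magic FILE: a toy imitation of a rule checklist (hypothetical helper,
# not part of file). Rules are tried top to bottom; the first match wins.
sniff_magic() {
    # Read the first 4 bytes as hex and strip whitespace, e.g. "25504446".
    head_hex=$(od -An -tx1 -N4 < "$1" | tr -d '[:space:]')
    case "$head_hex" in
        ffd8*)    echo "JPEG image data" ;;   # FF D8: JPEG start-of-image marker
        25504446) echo "PDF document" ;;      # "%PDF" in ASCII
        504b0304) echo "Zip archive data" ;;  # "PK" followed by 0x03 0x04
        *)        echo "data" ;;              # no rule matched: give up
    esac
}
```

Calling sniff_magic on a file prints one of those four labels. The real file command applies the same idea with thousands of rules, many of which look at offsets beyond the first few bytes.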
Here’s the simplest way to use it. Point it at a file and pray.
$ file picture.jpg
picture.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 72x72, segment length 16, baseline, precision 8, 1920x1080, frames 3
But the real power comes when you use it on something mysterious.
$ file unknown_data.bin
unknown_data.bin: PDF document, version 1.5
Ah-ha! Someone tried to hide a PDF. Not on our watch.
The Limits of Magic (and Why They Exist)
This system is brilliant, but it’s not clairvoyant. The most common pitfall is false positives. Since file judges most formats by a short signature near the beginning, it can be tricked. A plain text file that happens to start with a PDF header such as %PDF-1.4 can be misidentified as a PDF. This is a fundamental trade-off for speed and simplicity.
Another edge case is overly generic signatures. For example, a file starting with the bytes PK followed by 0x03 0x04 is identified as a ZIP archive. This is technically correct, but unhelpfully broad, because ZIP is a container format. It could be a .zip, a .jar, a .docx (which is just a fancy ZIP), or an .apk. The modern file command has gotten better at this, often specifying “Microsoft OOXML” for DOCX files, but it’s a constant battle for the maintainers of the magic database.
Going Deeper: The -i Flag and MIME Types
Sometimes you don’t need a verbose description for a human; you need a standard identifier for a script. That’s where the -i (or --mime) option comes in. (On macOS the short flag is a capital -I instead.) This outputs the MIME type, which is far less witty but much more machine-parseable.
$ file -i picture.jpg
picture.jpg: image/jpeg; charset=binary
$ file -i unknown_data.bin
unknown_data.bin: application/pdf; charset=binary
This is incredibly useful in scripts where you need to handle files based on their actual type, not their name.
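As a minimal sketch of that kind of script, the function below routes a file by its detected MIME type. The name classify and its output labels are made up for illustration. It assumes a file that supports --brief (drop the filename prefix) and --mime-type (print only the type, without the charset part).

```shell
# classify FILE: decide what to do with a file based on its content, not its
# name. Hypothetical helper; assumes file(1) supports --brief and --mime-type.
classify() {
    # file usually exits 0 even for unreadable paths, so check up front.
    [ -r "$1" ] || { echo "cannot read: $1" >&2; return 1; }
    mime=$(file --brief --mime-type -- "$1")
    case "$mime" in
        image/*)         echo "image" ;;
        application/pdf) echo "pdf" ;;
        text/*)          echo "text" ;;
        *)               echo "other ($mime)" ;;
    esac
}
```

A caller can then branch on the label, for example sending anything classified as “other” to a quarantine directory instead of trusting its extension.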
Building Your Own Signature Detective Kit
What if you encounter something file doesn’t know, or you simply don’t trust its verdict? You become the detective. Use xxd or hexdump to look at those first bytes yourself.
Let’s say file says something is “data”. Let’s investigate:
$ xxd -g 2 -l 32 my_mystery_file
00000000: 504b 0304 1400 0800 0800 63a2 3d51 0000 PK........c.=Q..
00000010: 0000 0000 0000 0000 1300 0000 776f 7264 ............word
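See those first two bytes? 0x50 0x4b? That’s ‘P’ and ‘K’ in ASCII, and together with the 0x03 0x04 that follows, it’s the signature for a ZIP file. The string “word” a little further in hints at an entry under word/, which would suggest a DOCX rather than a plain archive. So if file insisted this was just “data”, you now know that its magic database might be missing a rule, or that this is some custom format built on ZIP. As for the PK, the designers of the format chose it because it stands for Phil Katz, the creator of PKZIP. A little vanity in your hex dump, how quaint.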
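If xxd isn’t installed (it usually ships with Vim, so minimal systems may lack it), the standard od utility can do the same peek. The sketch below wraps it in a hypothetical helper named peek; the only assumption is a POSIX-style od with the -A, -t and -N options.

```shell
# peek FILE [COUNT]: hex-dump the first COUNT bytes (default 32) using od.
# Hypothetical helper name; it only wraps standard od options.
peek() {
    # -A x  : print offsets in hexadecimal
    # -t x1 : show one byte per column, in hexadecimal
    # -N n  : stop after n bytes
    od -A x -t x1 -N "${2:-32}" < "$1"
}
```

Running peek my_mystery_file 4 on the file above should show the bytes 50 4b 03 04. Unlike xxd, this form prints no ASCII column, so you read the letters off the hex yourself.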
The file command is a testament to the pragmatic, hacky brilliance of Unix. It’s a simple idea executed with a massive lookup table. It’s not perfect, but it’s indispensable. Use it, trust it, but always know how to verify its work yourself. That’s how you move from following instructions to actually understanding what’s going on.