39.2 The OOM Killer: How Linux Evicts Processes Under Memory Pressure
Right, let’s talk about the OOM Killer. This is the part of the Linux kernel that, when you’re desperately out of memory, stops being polite and starts getting real. It’s the digital equivalent of a bouncer at an over-capacity nightclub: its job is to pick a process to throw out so the whole system doesn’t collapse into a twitching heap. It’s brutal, often surprising, and frankly, something of an admission of design failure. We ran out of clever ideas, so we just pick a sucker and shoot it.
The core reason it exists is simple: the kernel made a promise. When a process calls malloc(), it’s not actually getting real RAM at that moment; it’s getting a promise of RAM (a range of virtual addresses). Physical pages are only handed over when the process first touches them. The kernel is an optimist, banking on the fact you might not use all that memory you asked for. This is called overcommit. It’s a great strategy until everyone decides to actually use their promised memory all at once. Suddenly, the kernel has to make good on its promises and there’s no cash in the bank. Panic ensues. The OOM Killer is its way of balancing the books by… well… murder.
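If you want to see how deep in debt your kernel currently is, the books are open. The sketch below reads the overcommit mode and pulls two fields out of /proc/meminfo: Committed_AS (memory promised so far) and CommitLimit (the ceiling, which is only enforced in strict mode). The helper name commit_summary is made up for illustration; the file paths and field names are the standard ones.

```shell
#!/bin/sh
# Summarize commit accounting from meminfo-formatted text on stdin.
# Prints one line: committed_kB=<Committed_AS> limit_kB=<CommitLimit>
commit_summary() {
    awk '
        $1 == "CommitLimit:"  { limit = $2 }
        $1 == "Committed_AS:" { used  = $2 }
        END { printf "committed_kB=%d limit_kB=%d\n", used, limit }
    '
}

if [ -r /proc/sys/vm/overcommit_memory ]; then
    # 0 = heuristic overcommit (default), 1 = always overcommit, 2 = strict accounting
    echo "overcommit_memory=$(cat /proc/sys/vm/overcommit_memory)"
fi

if [ -r /proc/meminfo ]; then
    commit_summary < /proc/meminfo
fi
```

In the default mode it is perfectly normal for committed_kB to exceed limit_kB. That gap is the optimism the OOM Killer eventually has to pay for.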
How the OOM Killer Chooses Its Victim
It’s not random. That would be too cruel, even for Linux. Instead, it uses an oom_score calculated for each process. You can see these scores yourself. Find a process ID (PID) and take a look:
# Let's look at the OOM score for your current shell ($$)
cat /proc/$$/oom_score
3
# And the adjusted score (more on this in a bit)
cat /proc/$$/oom_score_adj
0
The kernel calculates this score with a formula that used to be frankly byzantine and is now fairly simple, but the gist is: it’s trying to kill the process that will free up the most memory while causing the least amount of fuss. It favors sacrificing a single fat process over several small ones. Older kernels also went easy on processes running as root (holding CAP_SYS_ADMIN, to be precise), shaving a little off their score because taking out a system-level process could cause a bigger disaster; recent kernels have dropped that discount.
The score is a combination of:
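- The total amount of memory the process is using (resident pages, swap, and page tables).
- On old kernels (before 2.6.36), the CPU time it has consumed and how long it has been running. Both lowered the score, so long-running jobs were protected rather than punished. Current kernels ignore CPU time entirely.
- A user-adjustable oom_score_adj value that lets you tip the scales.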
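Rather than guess who is first in line, you can ask. This is a minimal sketch that walks a proc-style directory and prints the processes with the highest oom_score. The function name top_oom_candidates and its optional second argument (an alternate root, handy for testing) are my own inventions; the oom_score and comm files are standard procfs.

```shell
#!/bin/sh
# List the N processes with the highest oom_score.
# Usage: top_oom_candidates [N] [PROC_ROOT]
# Output columns: score, PID, command name.
top_oom_candidates() {
    n=${1:-5}
    root=${2:-/proc}
    for dir in "$root"/[0-9]*; do
        [ -r "$dir/oom_score" ] || continue
        # Processes can exit mid-scan, so tolerate read failures.
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || name="?"
        printf '%s %s %s\n' "$score" "${dir##*/}" "$name"
    done | sort -k1,1nr -k2,2n | head -n "$n"
}

# Show the five likeliest victims on this machine.
top_oom_candidates 5
```

Whatever sits at the top of that list is, as far as the kernel is concerned, the most expendable thing you are running.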
Taking Control with oom_score_adj
This is your “Please don’t kill me” (or “Kill this first”) knob. It’s a value between -1000 and 1000 that gets added to the internal calculation.
- -1000: Means “never, ever kill this process.” Use this for your most critical daemons.
- 0: The default. The kernel does what it wants.
- 1000: Means “I hate this process, kill it with fire on sight.” The kernel will almost certainly pick it first.
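To put numbers on that knob: per the kernel’s proc documentation, each point of oom_score_adj is worth roughly a thousandth of the memory the process is allowed to use (RAM plus swap for a system-wide OOM, or the limit of the cgroup it lives in). A back-of-the-envelope sketch follows. adj_to_kib is a made-up helper, the 16 GiB machine is just an example, and the real kernel works in pages rather than kB.

```shell
#!/bin/sh
# adj_to_kib TOTAL_KIB ADJ
# Approximate how much memory (in kB) an oom_score_adj value adds to, or
# subtracts from, a process's footprint when the kernel ranks victims.
adj_to_kib() {
    echo $(( $1 * $2 / 1000 ))
}

# On a box with 16 GiB (16777216 kB) of RAM plus swap:
adj_to_kib 16777216 -500   # prints -8388608: scored as if it used 8 GiB less
adj_to_kib 16777216 200    # prints 3355443: scored as if it used ~3.2 GiB more
```

So -500 is not a magic shield. It is a discount, and a process that balloons past the discount is back on the menu.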
You can set it on the fly for a running process:
# Protect your precious database PID 4242 from the OOM Killer's gaze
echo -1000 | sudo tee /proc/4242/oom_score_adj
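If you find yourself doing this from scripts, it is worth wrapping the write in a little validation so a typo doesn’t end up in /proc. The function below is a sketch, not a standard tool: set_oom_adj and its optional third argument (an alternate proc root, for dry runs) are hypothetical. Note that lowering the value generally needs root (CAP_SYS_RESOURCE), while raising it on your own processes does not.

```shell
#!/bin/sh
# set_oom_adj PID VALUE [PROC_ROOT]
# Validate VALUE, write it to the process's oom_score_adj, then echo back
# what the kernel actually stored.
set_oom_adj() {
    pid=$1
    value=$2
    root=${3:-/proc}

    # Accept only an optionally negative integer.
    case $value in
        ''|-|*[!0-9-]*|?*-*)
            echo "invalid value: $value" >&2
            return 2
            ;;
    esac
    if [ "$value" -lt -1000 ] || [ "$value" -gt 1000 ]; then
        echo "value must be between -1000 and 1000: $value" >&2
        return 2
    fi

    target="$root/$pid/oom_score_adj"
    if [ ! -w "$target" ]; then
        echo "cannot write $target (process gone, or root needed?)" >&2
        return 1
    fi
    printf '%s\n' "$value" > "$target" || return 1
    cat "$target"
}

# Example, run as root:
#   set_oom_adj 4242 -1000
```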
To make it permanent, you’d set this in your service’s systemd unit file with the OOMScoreAdjust directive. It’s one of the best and simplest bits of preventative medicine you can administer.
# Example systemd service snippet
[Service]
ExecStart=/usr/local/bin/my-memory-hogging-daemon
OOMScoreAdjust=-500
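For a unit you don’t own (say, one shipped by your distribution), a drop-in file is usually kinder than editing the unit itself. The sketch below only prints what it would install. The unit name is hypothetical, borrowed from the snippet above, and oom_dropin and the oom.conf filename are my own choices; the drop-in directory layout and the systemctl commands in the comments are standard systemd.

```shell
#!/bin/sh
# oom_dropin VALUE
# Print a systemd drop-in that overrides only OOMScoreAdjust.
oom_dropin() {
    printf '[Service]\nOOMScoreAdjust=%s\n' "$1"
}

# Hypothetical unit name; substitute your own.
unit="my-memory-hogging-daemon.service"
dropin_dir="/etc/systemd/system/${unit}.d"

# Dry run: show what would be written, without touching the system.
echo "# would write ${dropin_dir}/oom.conf:"
oom_dropin -500

# To apply for real (needs root):
#   sudo mkdir -p "$dropin_dir"
#   oom_dropin -500 | sudo tee "$dropin_dir/oom.conf" >/dev/null
#   sudo systemctl daemon-reload && sudo systemctl restart "$unit"
#   systemctl show -p OOMScoreAdjust "$unit"
```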
The Aftermath: Reading the Tea Leaves
When the OOM Killer strikes, it’s not a quiet affair. It leaves a glaring confession in the kernel log (journalctl -k, dmesg, or /var/log/kern.log on Debian-style systems). The log entry is a masterpiece of kernel-level shame.
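# What to look for in your logs after a process mysteriously dies
# (an all-lowercase --grep pattern is matched case-insensitively)
journalctl -k --grep="killed process"
# Or, with -i because the kernel writes "Killed" with a capital K
grep -i "killed process" /var/log/kern.log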
The output will look something like this:
kernel: Out of memory: Killed process 12345 (some_hog) total-vm:123456kB, anon-rss:12345kB, file-rss:1234kB, shmem-rss:0kB, UID:1000 pgtables:256kB oom_score_adj:200
This tells you who died, how much memory it was using (in various categories), and what its oom_score_adj was. It’s the kernel’s way of saying, “Look, I did what I had to do.”
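If the OOM Killer visits often enough that you’re reading these lines in bulk, a little sed saves your eyes. This sketch pulls the PID, command name, and anon-rss out of lines shaped like the example above. parse_oom_kills is a made-up name, and the pattern assumes the "Killed process N (name) ... anon-rss:NkB" layout shown here; other kernel versions may word it differently, and a command name containing a closing parenthesis would confuse it.

```shell
#!/bin/sh
# parse_oom_kills
# Read kernel log text on stdin; print "PID NAME ANON_RSS_KB" for each kill.
parse_oom_kills() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*anon-rss:\([0-9][0-9]*\)kB.*/\1 \2 \3/p'
}

# Demo on the sample line from the text.
sample='kernel: Out of memory: Killed process 12345 (some_hog) total-vm:123456kB, anon-rss:12345kB, file-rss:1234kB, shmem-rss:0kB, UID:1000 pgtables:256kB oom_score_adj:200'
printf '%s\n' "$sample" | parse_oom_kills   # prints: 12345 some_hog 12345

# Real-world use:
#   journalctl -k | parse_oom_kills
```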
Best Practices and Pitfalls
- It’s a Last Resort: The kernel tries everything else first (reclaiming page cache, swapping, and so on) before invoking the OOM Killer. If it’s triggered, your system was truly on the brink.
- The Performance Cliff: You’ll feel the system slowdown long before the OOM Killer activates. This is the kernel struggling to free memory. It’s your warning sign. Heed it.
- The Biggest Pitfall: The OOM Killer can kill the wrong thing. I’ve seen it take out an SSH session, a display manager, or a critical database. This is why adjusting oom_score_adj for critical infrastructure is non-negotiable. It’s the difference between losing a single application and having to drive to the data center to physically reboot a server.
- The Alternative: vm.overcommit_memory: You can tell the kernel to stop being an optimist. Setting sysctl vm.overcommit_memory=2 makes it use a “strict” policy where malloc() can fail with ENOMEM once total commitments exceed swap plus a configurable share of RAM (vm.overcommit_ratio, 50% by default). This pushes the problem to the application, which might handle it more gracefully. It’s a trade-off: more predictable crashes for some apps versus the chaotic violence of the OOM Killer for all.
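Before flipping that switch, it helps to know where the ceiling will land. According to the kernel’s overcommit documentation, in mode 2 the commit limit is swap plus vm.overcommit_ratio percent of RAM (the ratio defaults to 50). The sketch below does that arithmetic. commit_limit_kib is a hypothetical helper, and it deliberately ignores huge pages and the vm.overcommit_kbytes alternative, both of which change the real figure.

```shell
#!/bin/sh
# commit_limit_kib RAM_KIB SWAP_KIB RATIO_PERCENT
# Approximate the strict-mode commit limit: swap + RAM * ratio / 100.
commit_limit_kib() {
    echo $(( $2 + $1 * $3 / 100 ))
}

# 16 GiB of RAM, 2 GiB of swap, default ratio of 50:
commit_limit_kib 16777216 2097152 50   # prints 10485760 (10 GiB)
```

Note what that example says: with the default ratio, a 16 GiB machine starts refusing allocations at 10 GiB of promises. If you go strict, you almost certainly want to tune the ratio as well.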
In the end, the OOM Killer is a blunt instrument. It’s not elegant, but it’s effective. Its existence is a reminder that resource management is a hard problem, and sometimes the only way out is to get mean. Your job is to understand its logic so you can protect what’s important and maybe, just maybe, engineer your systems so you never have to meet it face-to-face.