Right, let’s talk about cgroups memory limits. You’ve probably been here: you’re running a big data processing job, a leaky web app, or just a badly behaved script, and it decides it’s going to try to eat every single byte of RAM on your machine. The OS, in a desperate bid to survive, starts the OOM (Out-of-Memory) Killer, which is basically a glorified game of Russian roulette for your processes. It picks a process and shoots it dead to free up memory. Spoiler alert: it’s almost never the one you wanted it to kill.

This is where cgroups (control groups) come in. They’re the bouncers of the Linux kernel. Their job isn’t to kill processes after they’ve caused a scene; it’s to prevent them from ever getting that out of control in the first place. We’re putting up a velvet rope and a strict guest list for memory usage.

How the Kernel Enforces the Limit

This is the crucial bit most people miss, so pay attention. When you set a memory limit in a cgroup, you’re not setting a suggestion. The kernel enforces this thing with prejudice. Here’s the play-by-play:

  1. Your process inside the cgroup happily allocates memory (via malloc or whatever) like it’s at an all-you-can-eat buffet.
  2. The kernel’s job is to actually hand out physical pages of RAM (or swap) when the process tries to use that allocated memory.
  3. When the total memory usage of the cgroup (RSS + file cache) starts approaching the limit you set, the kernel shifts from a friendly waiter to a stern accountant.
  4. If the cgroup is at its limit and a process inside touches memory that has no physical page behind it yet (this triggers a page fault), the kernel stalls that process before handing over the page.
  5. While the process waits, the kernel runs its memory reclamation machinery. It tries really hard to free up memory within the cgroup to satisfy the request. This means throwing out clean page cache, writing back dirty pages, swapping out anonymous memory to disk (if you have swap enabled), etc.
  6. If it succeeds, the process wakes up and gets its memory, none the wiser. If it fails… well, that’s where things get interesting.
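If you want to watch a cgroup creep toward its limit, a small helper along these lines can do it. This is a minimal sketch, assuming the cgroup v2 layout where each cgroup directory exposes memory.current and memory.max; the function name cg_mem_usage is made up for illustration.

```shell
# Print how close a cgroup is to its memory.max.
# $1 is the cgroup directory (assumed: cgroup v2 unified hierarchy).
cg_mem_usage() {
    cg_current=$(cat "$1/memory.current") || return 1
    cg_max=$(cat "$1/memory.max") || return 1
    # memory.max holds the literal string "max" when no limit is set
    if [ "$cg_max" = "max" ]; then
        echo "$cg_current bytes used, no limit set"
        return 0
    fi
    echo "$cg_current / $cg_max bytes ($(( $cg_current * 100 / $cg_max ))%)"
}
```

Point it at the directory of the cgroup you care about under /sys/fs/cgroup and run it in a loop (for example with watch) while your workload runs. When the percentage sits near 100 and the process stops making progress, you are looking at steps 4 and 5 in action.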

The OOM Killer’s Cgroup Edition

If the kernel’s reclamation efforts can’t free up enough memory to satisfy the request, it has no choice. It invokes the OOM Killer, but with a critical twist: it only looks at processes within the offending cgroup.

This is a game-changer. Instead of some random SSH session or database getting shot to save a runaway script, only the processes within the cgroup that breached its limit are on the chopping block. The rest of the system hums along perfectly. It’s beautiful, really.
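The cgroup OOM killer leaves a paper trail. On cgroup v2, each cgroup has a memory.events file with counters such as max (how often usage ran into the hard limit) and oom_kill (how many processes in the cgroup were killed). Here is a rough sketch for pulling out a single counter; the function name cg_event is hypothetical.

```shell
# Print one counter from a cgroup's memory.events file.
# $1 = cgroup directory, $2 = counter name (e.g. oom_kill, max, high).
# Returns non-zero if the counter is not present in the file.
cg_event() {
    awk -v key="$2" '
        $1 == key { print $2; found = 1 }
        END { exit !found }
    ' "$1/memory.events"
}
```

If cg_event reports a non-zero oom_kill for your cgroup while the rest of the system shows no OOM activity, the containment did its job.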

Let’s get our hands dirty. The modern way to interact with cgroups is via systemd, which manages them for us. Let’s create a slice (a group of units) with a memory limit.

First, create a slice file:

# /etc/systemd/system/my-stingy-slice.slice
[Unit]
Description=A slice with a strict memory limit

[Slice]
MemoryMax=500M
MemoryHigh=450M

Now, let’s run a process in this slice. We’ll use systemd-run to launch a shell that’s a member of this slice.

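# Make sure systemd has picked up the new slice file
sudo systemctl daemon-reload

# This starts a shell inside our new cgroup.
# No --user here: the slice lives in the system manager, not a user session.
sudo systemd-run --slice=my-stingy-slice.slice --shell --unit=test-limit

# From within that new shell, let's see our limits.
# Dashes in a slice name encode nesting, so my-stingy-slice.slice sits
# under my.slice/my-stingy.slice in the cgroup tree.
cat /sys/fs/cgroup/my.slice/my-stingy.slice/my-stingy-slice.slice/memory.max
# You'll see 524288000 (500M in bytes)
cat /sys/fs/cgroup/my.slice/my-stingy.slice/my-stingy-slice.slice/memory.high
# 471859200 (450M)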
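If those byte counts look odd, it's because systemd reads the K, M, G and T suffixes as powers of 1024, not 1000. The arithmetic can be sketched like this; to_bytes is a made-up helper, and it deliberately ignores the other forms systemd accepts, such as percentages and "infinity".

```shell
# Convert a systemd-style size (e.g. 500M) to bytes, base 1024.
# Sketch only: handles a plain integer or a single K/M/G/T suffix.
to_bytes() {
    tb_num=${1%[KMGT]}   # strip one trailing suffix letter, if any
    case $1 in
        *K) echo $(( $tb_num * 1024 )) ;;
        *M) echo $(( $tb_num * 1024 * 1024 )) ;;
        *G) echo $(( $tb_num * 1024 * 1024 * 1024 )) ;;
        *T) echo $(( $tb_num * 1024 * 1024 * 1024 * 1024 )) ;;
        *)  echo "$tb_num" ;;
    esac
}
```

So to_bytes 500M gives 524288000 and to_bytes 450M gives 471859200, matching what the kernel reports.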

Now for the fun part. Let’s try to break it. Here’s a dumb C program that allocates memory and writes to every byte of it until it can’t.

// eatmem.c
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    long long allocated = 0;
    const size_t chunk = 100 * 1024 * 1024; // 100MB chunks
    while (1) {
        char *p = malloc(chunk);
        if (p == NULL) {
            perror("malloc failed");
            break;
        }
        // Touch every page. Untouched malloc'd memory is only address
        // space and is never charged to the cgroup.
        memset(p, 1, chunk);
        allocated += chunk;
        printf("Allocated %lld MB\n", allocated / (1024 * 1024));
        sleep(1); // Slow down the inevitable
    }
    printf("I'm done. Goodbye.\n");
    return 0;
}

Compile it: gcc -o eatmem eatmem.c. Now run it inside the shell that’s in your limited slice: ./eatmem. Watch it print “Allocated 100 MB”, “Allocated 200 MB”, “Allocated 300 MB”, “Allocated 400 MB”… and then it will likely stall partway through the fifth chunk. Once usage crosses the MemoryHigh watermark, the kernel throttles the process while it tries to reclaim memory. These pages are anonymous and freshly written, so unless there’s swap to push them to, there’s nothing to reclaim. Usage creeps up to MemoryMax, and the cgroup OOM killer terminates eatmem (your shell will report something like “Killed”). If swap is available, expect it to limp along for longer before that happens. The key point? Your main SSH session is utterly unaffected.
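One quick way to tell how a command died is its exit status. A process terminated by a signal shows up in the shell as 128 plus the signal number, so SIGKILL (signal 9, which is what the OOM killer sends) appears as 137. The wrapper below is a sketch of that check; run_and_report is a hypothetical name, and 137 only proves SIGKILL, not who sent it.

```shell
# Run a command and describe how it ended.
# Exit status above 128 means the command was killed by a signal.
run_and_report() {
    rr_status=0
    "$@" || rr_status=$?
    if [ "$rr_status" -eq 137 ]; then
        echo "killed by SIGKILL (possibly the OOM killer)"
    elif [ "$rr_status" -gt 128 ]; then
        echo "killed by signal $(( $rr_status - 128 ))"
    else
        echo "exited with status $rr_status"
    fi
}
```

Something like run_and_report ./eatmem would be the obvious use. To confirm it really was the OOM killer and not a stray kill -9, check the oom_kill counter in the cgroup's memory.events file or the kernel log (journalctl -k).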

Common Pitfalls and Best Practices

  • MemoryHigh vs. MemoryMax: MemoryHigh is a “soft” limit. The kernel will throttle processes and aggressively reclaim memory to keep usage below it, but crossing it never triggers the OOM Killer by itself. MemoryMax is the hard limit. When usage hits it and reclaim can’t make room, the OOM Killer fires. Use MemoryHigh as the speed bump and MemoryMax as the concrete wall.
  • Swap is a Get-Out-of-Jail-Free Card: And you usually don’t want to give that to your processes. If your system has swap enabled, a cgroup at its RAM limit can keep growing by pushing pages out to swap, thrashing the disk and dragging the rest of the system down with it. Always set MemorySwapMax. On cgroup v2 it limits swap usage separately from MemoryMax (it is not a combined RAM-plus-swap total), so set MemorySwapMax=0 to disable swap for the cgroup entirely, unless you have a very specific reason not to.
  • It’s the Total, Not Just RSS: The limit covers the anonymous memory of your processes (heap, stack and friends), the file cache (page cache) they generate, and some kernel memory used on their behalf. A process that reads a lot of files can evict itself by pushing its own data out of the page cache to stay under the limit. It’s holistic, which is generally good, but can be surprising.
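  • The Accounting Isn’t Instantaneous: Memory is charged when pages are actually touched, not when malloc returns, and the kernel batches those charges instead of doing exact bookkeeping on every page. MemoryHigh in particular can be overshot under heavy allocation pressure, and even MemoryMax may be exceeded briefly in some circumstances. Don’t treat it as a real-time guarantee; treat it as a very effective containment field.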
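To see how a cgroup's usage splits between anonymous memory and page cache, you can read its memory.stat file. The helper below is a minimal sketch, assuming cgroup v2 and the anon and file fields that memory.stat reports in bytes; the name cg_mem_breakdown is invented for this example.

```shell
# Show anonymous vs. file-cache usage for a cgroup, in MiB (rounded down).
# $1 is the cgroup directory.
cg_mem_breakdown() {
    awk '
        $1 == "anon" || $1 == "file" {
            printf "%s %d MiB\n", $1, $2 / 1048576
        }
    ' "$1/memory.stat"
}
```

A big file number with a small anon number usually means the cgroup is mostly holding cache the kernel can drop under pressure. The reverse means there is little to reclaim without swap, and the OOM killer is closer than the total suggests.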