19.3 Restart Policies: on-failure, always, on-abnormal

Right, so you’ve got a service unit written and it’s running. The big question now is: what should systemd do when it, inevitably, crashes? Or when it exits cleanly? This isn’t a philosophical question; it’s a practical one answered by the Restart= directive. (Coming back after a reboot is a separate matter, handled by enabling the unit, not by Restart=.) Get this wrong, and you’ll either have a service that’s dead and never comes back, or one that keeps resurrecting itself straight into another failure, burning CPU cycles for absolutely no reason. Let’s get this right.

The Core Restart Policies

You have a few main options here, each with a very specific job. Don’t just slap always on everything and call it a day. That’s the equivalent of using duct tape to fix a server—it might hold, but it’s not exactly elegant.

The on-failure policy is your sensible go-to, the workhorse of the bunch. (systemd’s own default is Restart=no, so you have to ask for it.) It restarts the service if it exits with a non-zero exit code, if it’s killed by a signal that doesn’t count as a clean shutdown (a segmentation fault, say), or if it hits an operation timeout or a watchdog timeout. This is the policy you want for most long-running daemons. It assumes that an unclean exit means “something went wrong, try again.”

[Service]
Type=simple
ExecStart=/usr/bin/my-awesome-daemon
Restart=on-failure
# This is the gold standard. Use it.

Then there’s always. This is the overbearing parent of restart policies. The service will restart no matter why it stopped on its own: clean exit (exit code 0), crash, signal, timeout, you name it. The one exception is a deliberate systemctl stop, which systemd always honors. This is crucial for services that must be running at all times, but it’s a terrible choice for services that might exit cleanly under normal operation, and systemd refuses it outright for Type=oneshot units. If your service exits with code 0 and has Restart=always, systemd will relaunch it again and again until the start limit trips. Don’t be that person.

[Service]
Type=simple
ExecStart=/usr/bin/my-needy-daemon
Restart=always
# Restarts after any exit, clean or not. A manual systemctl stop still stops it.

Finally, we have on-abnormal. This one is a bit more nuanced. It restarts the service only if it was killed by an unclean signal (including a core dump), hit an operation timeout, or missed a watchdog deadline. It does not restart on any exit code, zero or otherwise. This is less common but useful for services where a crash is a restart-worthy event, but a deliberate exit, even one with an error code, is not.
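
The other two policies got a snippet, so here is a minimal sketch for on-abnormal as well. The binary path is made up for illustration; the comments summarize how systemd documents the policy.

```ini
[Service]
Type=simple
# Hypothetical binary, named only for illustration.
ExecStart=/usr/bin/my-batch-worker
Restart=on-abnormal
# Restarted after an unclean signal, a timeout, or a watchdog expiry.
# Not restarted after any exit code, zero or non-zero.
```

The design choice here is that the worker's own exit code is treated as a decision, not an accident: if it chose to exit, systemd leaves it alone.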

The Devil’s in the Details: Exit Codes and Signals

Here’s where the designers, in their infinite wisdom, made a choice you need to understand. on-failure triggers on any non-zero exit code and on any signal termination except SIGHUP, SIGINT, SIGTERM, and SIGPIPE, which systemd treats as clean shutdowns. But what’s a “failure”? Your application gets to decide that by its exit code. 1? Failure. 4? Also failure. 0? Success. This is why your application should have sensible exit codes. If your task runs and finds no work to do, exiting 0 is correct. If it exits 1 because there’s no work, on-failure will unnecessarily restart it. Be intentional with your exit codes.
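
If you can’t change the application’s exit codes, systemd lets you reinterpret them from the unit file using SuccessExitStatus= and RestartPreventExitStatus=. The sketch below is illustrative only: the binary path and the meanings assigned to exit codes 3 and 78 are assumptions, not something your daemon necessarily does.

```ini
[Service]
# Hypothetical binary and exit codes, chosen only for illustration.
ExecStart=/usr/bin/my-queue-worker
Restart=on-failure
# Treat exit code 3 ("nothing to do" in this sketch) as a clean exit.
SuccessExitStatus=3
# Never restart on exit code 78 (a bad-config error in this sketch).
RestartPreventExitStatus=78
```

With this in place, on-failure ignores the “no work” exit and also stops retrying an error that a restart can’t fix, such as a broken config file.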

Avoiding the Restart Loop of Despair

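This is the big pitfall. You’ve set Restart=always, your service has a bug and exits immediately with code 0, and now systemd is relaunching it every 100ms. It won’t keep that up until the heat death of the universe: by default, systemd gives up after 5 starts within 10 seconds. But a longer RestartSec= can keep the service under that limit forever, so you should understand the two safety valves involved: RestartSec= and the start limit (StartLimitBurst= together with StartLimitIntervalSec=).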

RestartSec is the breath your service takes between restart attempts. The default is 100ms, which is a great way to hammer your own machine if the service fails immediately. Be kind to your CPU. If a service is failing, give it a few seconds to cool off.

[Service]
ExecStart=/usr/bin/my-flaky-daemon
Restart=on-failure
RestartSec=5s
# Wait 5 seconds before trying again. Much more civilized.

StartLimitBurst= and StartLimitIntervalSec= work together to say “okay, that’s enough.” If the service is started more than StartLimitBurst times within the StartLimitIntervalSec window, systemd refuses further starts and leaves the unit in a failed state. It’s an intervention. Note that both settings belong in the [Unit] section, not [Service].

[Unit]
StartLimitIntervalSec=120
StartLimitBurst=3

[Service]
ExecStart=/usr/bin/my-very-flaky-daemon
Restart=on-failure
RestartSec=5s
# If it fails 3 times in 2 minutes, stop trying. You need to go check the logs.
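
One trap worth spelling out: the two safety valves can cancel each other. If RestartSec= spaces the restarts so far apart that StartLimitBurst starts never fit inside the StartLimitIntervalSec window, the limit never trips and the service restarts forever. The following is a rough sketch of sizing the window to match the delay. The binary path is hypothetical, and the arithmetic assumes each run fails almost immediately.

```ini
[Unit]
# 5 starts spaced 30s apart take about 150s before a 6th is attempted.
# The window must be comfortably longer than that, or the limit never trips.
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
# Hypothetical binary, named only for illustration.
ExecStart=/usr/bin/my-slow-flaky-daemon
Restart=on-failure
RestartSec=30s
```

With the default 10-second window, this same RestartSec=30s would allow only one start per window, so the limit would never be reached. Once a limit does trip, systemctl reset-failed on the unit clears the counter so you can start it again after fixing the problem.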

So, Which One Should You Actually Use?

Here’s the simple heuristic: for the vast majority of services, use Restart=on-failure. It’s the sane choice. It handles crashes and errors and gets out of the way for clean stops. Use Restart=always only for those truly critical, must-be-always-on services, and only if you are certain they never exit cleanly under normal circumstances. And for goodness’ sake, always pair an aggressive restart policy with a sensible RestartSec and start limit. Your monitoring system will thank you.