10.6 Job Cleanup: TTL Controller and Manual Deletion

Right, so you’ve run your Job. It did its thing. The pods are sitting there, “Completed” or maybe “Failed,” cluttering up your kubectl get pods output like dirty mugs on a desk. You’re not a digital hoarder; you want this stuff cleaned up. Kubernetes, thankfully, agrees with you. Let’s talk about how it gets done, both automatically and when you need to take matters into your own hands.

The TTL Controller: Your Automatic Janitor

First shipped as an alpha feature way back in 1.12, promoted to beta in 1.21, and stable since 1.23 (so don’t panic), the TTL-after-finished controller is the primary way to automate Job cleanup. It’s brilliantly simple: you tell a Job, “Hey, live for this long after you finish, and then please vanish.” You do this by setting the .spec.ttlSecondsAfterFinished field. The clock starts when the Job finishes, whether it ended up Complete or Failed.

The controller, which runs in the kube-controller-manager, wakes up every so often, checks all the finished Jobs, and does the math: JobCompletionTime + ttlSeconds < CurrentTime. If that statement is true, it deletes the Job resource. And because of the magic of owner references, deleting the Job cascades to its Pods. Neat and tidy.
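The expiry check is easy to sketch in a few lines of shell. This is only an illustration of the comparison described above, using Unix epoch seconds; ttl_expired is a hypothetical helper name, not something kubectl or Kubernetes provides, and the real controller does this work internally.

```shell
# Sketch of the TTL expiry test: finish time + TTL < current time.
# ttl_expired is a hypothetical helper, not part of kubectl or Kubernetes.
#   $1 = epoch seconds when the Job finished
#   $2 = value of .spec.ttlSecondsAfterFinished
#   $3 = current time in epoch seconds
ttl_expired() {
  if [ $(( $1 + $2 )) -lt "$3" ]; then
    echo "expired"
  else
    echo "keep"
  fi
}

# A Job that finished at t=1000 with a 30-second TTL:
ttl_expired 1000 30 1020   # prints "keep"
ttl_expired 1000 30 1031   # prints "expired"
```

Passing the current time as an argument, instead of calling date inside the function, keeps the sketch easy to reason about and to test.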

Here’s how you use it. This Job will happily delete itself (and its pods) 30 seconds after it completes.

apiVersion: batch/v1
kind: Job
metadata:
  name: clean-up-after-yourself
spec:
  ttlSecondsAfterFinished: 30 # The magic number
  template:
    spec:
      containers:
      - name: echo-and-dip
        image: busybox
        command: ["echo", "My work here is done."]
      restartPolicy: Never

Why this is great: It’s declarative. You set the policy once and forget it. It prevents your cluster’s history from piling up indefinitely.
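If you already have finished Jobs lying around that were created without a TTL, you don’t have to recreate them: the field can be set on an existing Job. Here is a minimal sketch using kubectl patch; set_job_ttl is a hypothetical helper name, and the Job name and TTL in the usage comment are placeholders.

```shell
# Sketch: add or change the TTL on an existing Job.
# set_job_ttl is a hypothetical helper name.
#   $1 = Job name
#   $2 = TTL in seconds
set_job_ttl() {
  # A merge patch only touches the field named in the JSON body.
  kubectl patch job "$1" --type=merge \
    -p "{\"spec\":{\"ttlSecondsAfterFinished\":$2}}"
}

# Usage (placeholder values): keep my-finished-job for 24 hours after it finished.
#   set_job_ttl my-finished-job 86400
```

The TTL is measured from when the Job finished, not from when you patch it, so a short TTL on an old Job makes it eligible for deletion right away.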

The Rough Edges & Pitfalls:

It’s a best-effort service. The controller’s cleanup loop isn’t instantaneous. It might take a minute or two after the TTL expires for the deletion to actually happen. Don’t write code that relies on millisecond precision.
Zero means “delete immediately.” Setting ttlSecondsAfterFinished: 0 is perfectly valid: the Job becomes eligible for deletion the moment it finishes, so it can vanish before you ever get a look at its status or logs. Only use it when you really mean it. If you want to keep a Job forever, omit the field; with no TTL set, the controller never touches the Job.
Remember, it deletes the Job. This means you lose the history. You can no longer kubectl describe job my-job to see its completion status or why it failed three weeks ago, and kubectl logs has nothing left to read because the Pods are gone too (though you should be shipping those logs elsewhere, right?). The TTL is for hygiene, not for record-keeping.

Manual Deletion: The Big Red Button

Sometimes you don’t want to wait. The job failed catastrophically and you need to rerun it now. Or you’re testing and you just want to clear the deck. Manual deletion is your friend, but you have to understand how Kubernetes’ cascading deletion works.

A Job is just the manager object; the Pods it created point back at it through owner references. If you delete only the Job and leave the Pods behind, they are “orphaned”: they lose their owner and hang around until someone removes them. This is almost never what you want. You want the Pods to be cleaned up too.

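This is where --cascade comes in. With kubectl, the default is --cascade=background, which means “delete the Job now and let the garbage collector remove its dependents right after.” This is what you want 99.9% of the time. (Older examples spell it --cascade=true; that form is deprecated in favor of background.) If you delete Jobs through the API rather than kubectl, set propagationPolicy explicitly instead of trusting the default.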

# This deletes the Job AND its Pods. This is what you want.
kubectl delete job my-finished-job

# This is the explicit version of the same command.
# (--cascade=true is the older, deprecated spelling of background.)
kubectl delete job my-finished-job --cascade=background

# This is the "please orphan my Pods" command. You probably don't want this.
# Now you have to go delete the Pods manually. Enjoy that.
kubectl delete job my-finished-job --cascade=orphan

Why you might use --cascade=orphan: It’s a niche debugging tool. Imagine a Job’s Pod is in a weird state and a plain kubectl delete never finishes; the Pod is stuck “Terminating.” Deleting the Job with --cascade=orphan severs the link, allowing you to then force-delete the Pod itself without the Job controller trying to recreate it (which is a whole other fight). It’s a surgical instrument, not an everyday hammer.
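That orphan-then-force-delete routine might look roughly like the sketch below. It assumes the Pods still carry the job-name label that the Job controller puts on them, which is how they are found once the owner link is gone; orphan_then_force_delete is a hypothetical helper name.

```shell
# Sketch: orphan a Job's Pods, then force-delete them.
# orphan_then_force_delete is a hypothetical helper name.
#   $1 = Job name
orphan_then_force_delete() {
  job="$1"
  # Remove the Job but leave its Pods in place, now without an owner.
  kubectl delete job "$job" --cascade=orphan
  # Assumption: the orphaned Pods still carry the job-name label.
  kubectl get pods -l "job-name=$job"
  # Force deletion skips graceful termination, so treat it as a last resort.
  kubectl delete pods -l "job-name=$job" --grace-period=0 --force
}

# Usage (placeholder name):
#   orphan_then_force_delete my-finished-job
```

Force deletion removes the Pod object from the API without waiting for the node to confirm the containers have stopped, so check the get pods output before letting the last command run against anything you care about.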

Best Practices and The Obvious Thing Everyone Forgets

Always Set a TTL: Make it a habit. For most Jobs, keeping them around for 24 hours (86400 seconds) is more than enough time to diagnose any post-completion issues. For CI/CD runners, maybe 30 minutes (1800 seconds) is sufficient. Define it in your GitOps templates so you don’t have to think about it.
Logs, Logs, Logs: I cannot stress this enough. The TTL controller is designed to destroy information. If you care about why a Job succeeded or failed, you must have a log aggregation system (Loki, Elasticsearch, Splunk, etc.) that collects logs from your Pods before they get deleted. Relying on kubectl logs for a historical Job is a recipe for frustration. The TTL controller will happily delete your only copy of the logs. It’s just doing its job.
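If you don’t have log aggregation yet, a stopgap is to snapshot a Job’s status and Pod logs to disk before the TTL fires. The sketch below is a poor man’s substitute, not a replacement for a real pipeline; archive_job and the file names are hypothetical, and it assumes the Pods carry the job-name label.

```shell
# Sketch: save a Job's describe output and Pod logs before cleanup.
# archive_job and the output file names are hypothetical.
#   $1 = Job name
#   $2 = directory to write into
archive_job() {
  job="$1"
  dir="$2"
  mkdir -p "$dir"
  # Status, conditions, and events as kubectl currently sees them.
  kubectl describe job "$job" > "$dir/$job.describe.txt"
  # Logs from every Pod of the Job. --tail=-1 asks for full logs, because
  # selecting by label otherwise returns only the last few lines per Pod.
  kubectl logs -l "job-name=$job" --all-containers=true --tail=-1 --prefix \
    > "$dir/$job.pods.log"
}

# Usage (placeholder values):
#   archive_job my-finished-job ./job-archive
```

This only works while the Job and its Pods still exist, so it has to run inside the TTL window; that is one more reason to give yourself a generous TTL.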