Alright, let’s talk about CronJobs, the part of Kubernetes that tries its best to be your old-school system scheduler but ends up being a bit more… Kubernetes-y. Which is to say, powerful, but with a few more knobs to turn and a couple of new ways to shoot yourself in the foot.

The core idea is simple: you want to run a Job (one or more Pods that run to completion) on a schedule. Easy, right? You define a schedule using that familiar, slightly-cryptic cron syntax, wrap a Pod template in a Job template, and off you go. But of course, in true Kubernetes fashion, the devil is in the details: time zones, concurrency, and what happens when your job takes longer to run than the time between schedules.

The Cron Syntax and The Time Zone Problem

First, the schedule. You’ll use a standard cron string in the schedule field. "0 * * * *" runs at the top of every hour, "30 6 * * *" runs at 6:30 AM every day, you know the drill.
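
If the field order never quite sticks, here is a small annotated sketch. The weekday range is an illustrative addition, not something the examples above use.

spec:
  # Five space-separated fields, left to right:
  #   minute (0-59), hour (0-23), day of month (1-31),
  #   month (1-12), day of week (0-6, where 0 is Sunday)
  schedule: "30 6 * * 1-5" # 6:30 AM, Monday through Friday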

Here’s the first “gotcha”: if you don’t specify a time zone, the schedule is interpreted in the local time zone of the kube-controller-manager, which on most clusters means UTC. If you’re not careful, you’ll be wondering why your “daily 2 AM report” is running at 2 AM UTC, which might be yesterday evening for you. This is, frankly, a bit daft in a world where we’ve mostly agreed on the existence of time zones. Thankfully, you’re not stuck. You can set the timeZone field to an IANA time zone name, like "America/New_York". But check your cluster’s version: the field sat behind the CronJobTimeZone feature gate, which was alpha (off by default) in 1.24, on by default from 1.25, and stable in 1.27. On an older cluster, don’t assume it’s on.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: your-witty-job-name
spec:
  schedule: "30 6 * * *" # 6:30 AM, but in what timezone?
  timeZone: "America/New_York" # Ah, *this* timezone. Critical.
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report-generator
            image: your-image:latest
            command: ["python", "/scripts/generate-report.py"]
          restartPolicy: OnFailure

Concurrency Policy: Dealing with Overlap

This is where things get interesting. What happens if your "*/5 * * * *" job (every 5 minutes) occasionally takes 7 minutes to run? You’ll have overlaps. Is that okay? Sometimes yes, sometimes a disaster.

Kubernetes gives you three choices via the concurrencyPolicy field:

  1. Allow (default): The “sure, why not” approach. New jobs are created even if the previous one is still running. Use this if your job is idempotent and you don’t mind multiple instances chewing through resources.
  2. Forbid: The “nope” approach. If it’s time for a new run but the old one is still executing, the new run is skipped. The schedule itself doesn’t shift; later runs are still evaluated at their normal cron times. This is what you use for jobs that absolutely must not run concurrently, like a database schema migration.
  3. Replace: The “hurry up and wait” approach. If it’s time for a new run and the old one is still going, the old one is deleted and a new one is created in its place. This is… risky. You need to be very confident your job can handle being terminated mid-execution (SIGTERM first, then SIGKILL if it overstays its grace period) without corrupting anything. I rarely use this.

spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid # Skip the next run if the previous is still going.
  jobTemplate:
    ...

Managing History: Don’t Let It Pile Up

By default, Kubernetes keeps the last three successful Jobs and the last failed Job around (along with their Pods) and prunes anything older. That default is sane, but crank the limits up on a frequent job and you have a fantastic way to slowly fill your etcd datastore with garbage. Set these deliberately.

The successfulJobsHistoryLimit and failedJobsHistoryLimit fields are your friends. Set them to the minimum you need for debugging. For a job that runs every minute, keeping 10 successful ones might be plenty. For a daily job, maybe 3.

spec:
  schedule: "* * * * *" # Every minute! That's a lot of jobs.
  successfulJobsHistoryLimit: 5 # Only keep the last 5 successful runs
  failedJobsHistoryLimit: 2 # Only keep the last 2 failures
  jobTemplate:
    ...

Starting Deadlines and Missed Runs

What if your cluster is having a really bad day and is down for maintenance at the exact time a job is supposed to run? The startingDeadlineSeconds field defines how long after the scheduled time a job is still allowed to start. If the deadline passes first, that run is skipped and counted as a missed schedule.

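More importantly, the controller keeps count of missed schedules. Per the Kubernetes documentation, if it finds more than 100 of them, it does not start the Job and logs an error instead. The window it counts over depends on startingDeadlineSeconds: leave the field unset and the count runs from the last scheduled time until now; set it and only the last startingDeadlineSeconds seconds are counted. So the CronJob most at risk is a frequent one with no deadline (or a very large one) that has been blocked for a long stretch. It’s basically throwing its hands up and saying “I give up, this is clearly broken, figure it out.” You may have to manually intervene. It’s a brutal but effective alarm bell.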
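
A minimal sketch of where the field sits, assuming a five-minute job that isn’t worth starting more than 200 seconds late (both numbers are illustrative, not recommendations). One caveat the Kubernetes documentation calls out: the controller checks roughly every 10 seconds, so a deadline below 10 may mean the job never gets scheduled at all.

spec:
  schedule: "*/5 * * * *"
  startingDeadlineSeconds: 200 # Illustrative: skip a run that can't start within 200s of its scheduled time.
  concurrencyPolicy: Forbid
  jobTemplate:
    ...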

The Pod Template: It’s Just a Job

Remember, jobTemplate is just a Job spec with a Pod template inside it. All the same rules apply. The most important one: you must set restartPolicy to OnFailure or Never. You cannot use Always because a Job is not a Deployment; it’s supposed to end. Using Always is a classic mistake: the API server rejects the manifest with a validation error, leaving you with a confused look on your face.
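
For orientation, here is a sketch of where restartPolicy lives in all that nesting. The container name, image, schedule, and backoffLimit value are placeholders chosen for illustration.

spec:
  schedule: "0 3 * * *"
  jobTemplate:
    spec:
      backoffLimit: 2 # Job-level retry cap; illustrative value.
      template:
        spec:
          containers:
          - name: one-shot-task # Hypothetical name.
            image: your-image:latest
          restartPolicy: Never # Or OnFailure. Always is not allowed in a Job.

Roughly speaking, with OnFailure the kubelet restarts the failed container inside the same Pod, while with Never the Job controller creates a fresh Pod for each retry.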

So there you have it. CronJobs are more than just crontab -e for the cloud. They’re a robust, if slightly fussy, system for scheduled work. Set your timeZone, think hard about concurrencyPolicy, clean up your history, and for the love of all that is holy, get your restartPolicy right. Do that, and you’ll have a brilliant scheduler. Forget it, and you’ll have a brilliant mess.