
How to Monitor Kubernetes CronJobs with Heartbeat Monitoring

Kubernetes CronJobs are deceptively tricky to monitor. The job might fail because the pod was evicted, the image pull failed, the node ran out of resources, or the CronJob was suspended during a cluster upgrade. Kubernetes won't page you for any of these.

Why Kubernetes CronJobs fail silently

The kubectl get cronjobs output looks healthy even when jobs are broken:

NAME          SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE
db-backup     0 */6 * * *   False     0        3h
etl-sync      */15 * * * *  False     0        12m

But "LAST SCHEDULE" only means Kubernetes tried to create a Job. The pod could have:

  • CrashLoopBackOff — container crashed immediately
  • ImagePullBackOff — wrong image tag or registry auth expired
  • OOMKilled — exceeded memory limits
  • Evicted — node pressure killed the pod
  • Pending forever — no node had enough resources

You'd need to run kubectl get pods and check each job's pod status to find failures. Nobody does this manually at 3 AM.
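For reference, that manual check looks something like this (the pod name and `production` namespace are illustrative):

```shell
# List pods created by Jobs (the Job controller sets the job-name label)
kubectl get pods -l job-name -n production

# Inspect why a specific pod failed (OOMKilled, Evicted, image pull, ...)
kubectl describe pod db-backup-28490112-x7k2p -n production
```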

Adding heartbeat monitoring

The pattern is the same as any cron job: add a curl ping after the work completes. In Kubernetes, you do this inside the container's command.

Basic CronJob with monitoring

yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
  namespace: production
spec:
  schedule: "0 */6 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 1
      activeDeadlineSeconds: 1800
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              # Note: the stock postgres image ships neither curl nor the AWS CLI;
              # use a derived image that adds both.
              image: postgres:16
              command:
                - /bin/sh
                - -c
                - |
                  set -euo pipefail

                  # Signal start
                  curl -fsS https://api.getcronsafe.com/ping/k8s-db-backup/start

                  # Run backup
                  pg_dump -Fc "$DATABASE_URL" > /tmp/backup.dump

                  # Validate
                  SIZE=$(stat -c%s /tmp/backup.dump)
                  if [ "$SIZE" -lt 1048576 ]; then
                    curl -fsS -X POST https://api.getcronsafe.com/ping/k8s-db-backup/fail \
                      -d "Backup too small: $SIZE bytes"
                    exit 1
                  fi

                  # Upload
                  aws s3 cp /tmp/backup.dump "s3://backups/db/$(date +%Y%m%d_%H%M).dump"

                  # Signal success
                  curl -fsS -X POST https://api.getcronsafe.com/ping/k8s-db-backup \
                    -d "Backup OK: $SIZE bytes"
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: url
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access-key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret-key
              resources:
                requests:
                  memory: "256Mi"
                  cpu: "100m"
                limits:
                  memory: "512Mi"
                  cpu: "500m"

Key Kubernetes-specific settings

concurrencyPolicy: Forbid — prevents overlapping runs. If the previous job hasn't finished, the new one is skipped, and because the skipped run never pings, CronSafe alerts you to the missed schedule.

backoffLimit: 1 — don't retry failed jobs automatically. Let the failure propagate to CronSafe so you can investigate.

activeDeadlineSeconds: 1800 — kill the job if it runs longer than 30 minutes. Without this, a hung job blocks future runs when concurrencyPolicy is Forbid.

restartPolicy: Never — don't restart the container on failure. Combined with backoffLimit: 1, this ensures one attempt and one alert.
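Before trusting the schedule, it's worth triggering a one-off run to verify the whole ping flow end to end (the test job name here is arbitrary):

```shell
# Create a one-off Job from the CronJob's template
kubectl create job --from=cronjob/db-backup db-backup-test -n production

# Follow the logs to see the curl pings and the backup output
kubectl logs -f job/db-backup-test -n production

# Clean up afterwards
kubectl delete job db-backup-test -n production
```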

Using an init container for the start ping

If you want to separate the monitoring ping from your main container:

yaml
spec:
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: signal-start
          image: curlimages/curl:latest
          command:
            - curl
            - -fsS
            - https://api.getcronsafe.com/ping/k8s-etl/start
          resources:
            requests:
              memory: "16Mi"
              cpu: "10m"
            limits:
              memory: "32Mi"
              cpu: "50m"
      containers:
        - name: etl
          image: my-registry/etl-worker:latest
          command:
            - /bin/sh
            - -c
            - |
              set -euo pipefail
              python3 /app/etl.py
              curl -fsS https://api.getcronsafe.com/ping/k8s-etl

The init container runs before the main container, giving you an accurate start timestamp. It also means that if the main container's image later fails to pull, the start ping still fires but the success ping never does, so the failure still surfaces.

Monitoring CronJob suspension

During cluster upgrades or maintenance, CronJobs are often suspended:

bash
kubectl patch cronjob db-backup -p '{"spec":{"suspend":true}}'

When a CronJob is suspended, no jobs are created. If someone forgets to unsuspend it after maintenance, the job silently stops running.

CronSafe catches this automatically — the expected pings don't arrive, and you get alerted regardless of why they stopped.
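It's still worth auditing for forgotten suspensions periodically; a jsonpath query along these lines (a sketch) lists them cluster-wide:

```shell
# List every suspended CronJob in every namespace
kubectl get cronjobs --all-namespaces \
  -o jsonpath='{range .items[?(@.spec.suspend==true)]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'

# Resume a CronJob after maintenance
kubectl patch cronjob db-backup -p '{"spec":{"suspend":false}}'
```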

Handling image pull failures

A common Kubernetes failure mode: the image tag was updated in the CronJob manifest, but the new image doesn't exist or the registry credentials expired.

yaml
# This will fail silently - the pod enters ImagePullBackOff
containers:
  - name: backup
    image: my-registry/backup:v2.0.1  # typo: should be v2.0.0

The CronJob's LAST SCHEDULE column updates (Kubernetes tried to run it), but the pod never starts. Without heartbeat monitoring, you'd only discover this by manually checking pod status.
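That manual check would look something like this:

```shell
# Pods stuck pulling an image stay Pending
kubectl get pods -n production --field-selector=status.phase=Pending

# Recent warning events name the exact failure (bad tag, expired registry auth, ...)
kubectl get events -n production --field-selector=type=Warning --sort-by=.lastTimestamp
```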

With CronSafe, the start ping never arrives, and you're alerted as soon as the grace period expires.

Complete production example with Helm

If you're using Helm, here's a reusable template:

yaml
# templates/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: {{ .Values.name }}
spec:
  schedule: {{ .Values.schedule | quote }}
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 1
      activeDeadlineSeconds: {{ .Values.deadlineSeconds | default 1800 }}
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
              command:
                - /bin/sh
                - -c
                - |
                  set -euo pipefail
                  curl -fsS {{ .Values.cronsafe.url }}/start
                  {{ .Values.command }}
                  curl -fsS {{ .Values.cronsafe.url }}
              envFrom:
                - secretRef:
                    name: {{ .Values.name }}-secrets
              resources:
                {{- toYaml .Values.resources | nindent 16 }}

yaml
# values.yaml
name: db-backup
schedule: "0 */6 * * *"
deadlineSeconds: 1800
image:
  repository: postgres
  tag: "16"
command: "pg_dump -Fc $DATABASE_URL > /tmp/backup.dump && aws s3 cp /tmp/backup.dump s3://backups/"
cronsafe:
  url: https://api.getcronsafe.com/ping/k8s-db-backup
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"

This gives you a standardized pattern for every CronJob in your cluster, with monitoring built in.
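Deploying a job from the template might look like this (the chart path and release name are illustrative):

```shell
# Render locally first to sanity-check the generated manifest
helm template db-backup ./charts/cronjob -f values.yaml

# Install one release per CronJob
helm install db-backup ./charts/cronjob -f values.yaml -n production
```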

Setting up the CronSafe monitor

For Kubernetes CronJobs, set these values:

  • Schedule: Match the CronJob's schedule field exactly
  • Grace period: Set to activeDeadlineSeconds + 5 minutes. This accounts for pod scheduling delays and the job's maximum runtime.
  • Overlap detection: Enable this. It catches cases where concurrencyPolicy: Allow causes parallel runs.
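The grace-period arithmetic for the db-backup example above works out like this:

```shell
# Grace period = the job's maximum runtime plus a buffer for pod scheduling
ACTIVE_DEADLINE_SECONDS=1800   # activeDeadlineSeconds from the manifest
SCHEDULING_BUFFER_SECONDS=300  # headroom for image pulls and node scheduling

GRACE_MINUTES=$(( (ACTIVE_DEADLINE_SECONDS + SCHEDULING_BUFFER_SECONDS) / 60 ))
echo "grace period: ${GRACE_MINUTES} minutes"
```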

Start monitoring your cron jobs for free

20 monitors, email alerts, GitHub badges. No credit card required.

Get started free →