# How to Monitor Kubernetes CronJobs with Heartbeat Monitoring
Kubernetes CronJobs are deceptively tricky to monitor. The job might fail because the pod was evicted, the image pull failed, the node ran out of resources, or the CronJob was suspended during a cluster upgrade. Kubernetes won't page you for any of these.
## Why Kubernetes CronJobs fail silently
The `kubectl get cronjobs` output looks healthy even when jobs are broken:

```
NAME        SCHEDULE       SUSPEND   ACTIVE   LAST SCHEDULE
db-backup   0 */6 * * *    False     0        3h
etl-sync    */15 * * * *   False     0        12m
```

But "LAST SCHEDULE" only means Kubernetes tried to create a Job. The pod could have:
- `CrashLoopBackOff` — container crashed immediately
- `ImagePullBackOff` — wrong image tag or registry auth expired
- `OOMKilled` — exceeded memory limits
- `Evicted` — node pressure killed the pod
- Pending forever — no node had enough resources
You'd need to run `kubectl get pods` and check each job's pod status to find failures. Nobody does this manually at 3 AM.
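The manual version of that check looks roughly like this (namespace and the `<job-name>`/`<pod-name>` placeholders are illustrative):

```shell
# Recent Jobs created by the CronJob, with completion counts
kubectl get jobs -n production --sort-by=.metadata.creationTimestamp

# Status of the pods behind a specific Job
kubectl get pods -n production -l job-name=<job-name> \
  -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,REASON:.status.reason'

# Events explaining the failure (evictions, pull errors, OOM kills)
kubectl describe pod <pod-name> -n production
```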
## Adding heartbeat monitoring
The pattern is the same as any cron job: add a `curl` ping after the work completes. In Kubernetes, you do this inside the container's command.
### Basic CronJob with monitoring
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
  namespace: production
spec:
  schedule: "0 */6 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 1
      activeDeadlineSeconds: 1800
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              # NOTE: the stock postgres image ships pg_dump but not curl or
              # the AWS CLI; in practice, build a small backup image with both.
              image: postgres:16
              command:
                - /bin/sh
                - -c
                - |
                  set -eu  # pipefail is not POSIX sh; no pipelines here anyway
                  # Signal start
                  curl -fsS https://api.getcronsafe.com/ping/k8s-db-backup/start
                  # Run backup
                  pg_dump -Fc "$DATABASE_URL" > /tmp/backup.dump
                  # Validate
                  SIZE=$(stat -c%s /tmp/backup.dump)
                  if [ "$SIZE" -lt 1048576 ]; then
                    curl -fsS -X POST https://api.getcronsafe.com/ping/k8s-db-backup/fail \
                      -d "Backup too small: $SIZE bytes"
                    exit 1
                  fi
                  # Upload
                  aws s3 cp /tmp/backup.dump "s3://backups/db/$(date +%Y%m%d_%H%M).dump"
                  # Signal success
                  curl -fsS -X POST https://api.getcronsafe.com/ping/k8s-db-backup \
                    -d "Backup OK: $SIZE bytes"
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: url
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access-key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret-key
              resources:
                requests:
                  memory: "256Mi"
                  cpu: "100m"
                limits:
                  memory: "512Mi"
                  cpu: "500m"
```

### Key Kubernetes-specific settings
- `concurrencyPolicy: Forbid` — prevents overlapping runs. If the previous job hasn't finished, the new one is skipped. CronSafe's overlap detection catches this scenario too.
- `backoffLimit: 1` — don't retry failed jobs automatically. Let the failure propagate to CronSafe so you can investigate.
- `activeDeadlineSeconds: 1800` — kill the job if it runs longer than 30 minutes. Without this, a hung job blocks future runs when `concurrencyPolicy` is `Forbid`.
- `restartPolicy: Never` — don't restart the container on failure. Combined with `backoffLimit: 1`, this ensures one attempt and one alert.
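One related setting the manifest above doesn't use: `startingDeadlineSeconds` bounds how late a missed run may still start (for example after a controller restart). A sketch, assuming a 10-minute window is acceptable:

```yaml
spec:
  schedule: "0 */6 * * *"
  # Skip a run entirely if it can't start within 10 minutes
  # of its scheduled time, rather than starting arbitrarily late.
  startingDeadlineSeconds: 600
```

A skipped run sends no start ping, so the heartbeat monitor still alerts you.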
### Using an init container for the start ping
If you want to separate the monitoring ping from your main container:
```yaml
spec:
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: signal-start
          image: curlimages/curl:latest
          command:
            - curl
            - -fsS
            - https://api.getcronsafe.com/ping/k8s-etl/start
          resources:
            requests:
              memory: "16Mi"
              cpu: "10m"
            limits:
              memory: "32Mi"
              cpu: "50m"
      containers:
        - name: etl
          image: my-registry/etl-worker:latest
          command:
            - /bin/sh
            - -c
            - |
              set -eu
              python3 /app/etl.py
              curl -fsS https://api.getcronsafe.com/ping/k8s-etl
```

The init container runs before the main container, giving you an accurate start timestamp.
## Monitoring CronJob suspension
During cluster upgrades or maintenance, CronJobs are often suspended:

```shell
kubectl patch cronjob db-backup -p '{"spec":{"suspend":true}}'
```

When a CronJob is suspended, no jobs are created. If someone forgets to unsuspend it after maintenance, the job silently stops running.
CronSafe catches this automatically — the expected pings don't arrive, and you get alerted regardless of why they stopped.
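A periodic cluster-side audit is cheap, too. This one-liner (standard `kubectl` flags) lists every CronJob's suspend flag across namespaces:

```shell
kubectl get cronjobs --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,SUSPENDED:.spec.suspend'
```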
## Handling image pull failures
A common Kubernetes failure mode: the image tag was updated in the CronJob manifest, but the new image doesn't exist or the registry credentials expired.
```yaml
# This will fail silently - the pod enters ImagePullBackOff
containers:
  - name: backup
    image: my-registry/backup:v2.0.1  # typo: should be v2.0.0
```

The CronJob's LAST SCHEDULE column updates (Kubernetes tried to run it), but the pod never starts. Without heartbeat monitoring, you'd only discover this by manually checking pod status.
With CronSafe, the start ping never arrives, and you're alerted within the grace period.
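For manual debugging, the waiting reason is buried under the pod's container statuses; one way to surface it:

```shell
# Print each pod's container waiting reason, if any
# (ImagePullBackOff, CrashLoopBackOff, ErrImagePull, ...)
kubectl get pods -n production \
  -o custom-columns='NAME:.metadata.name,WAITING:.status.containerStatuses[*].state.waiting.reason'
```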
## Complete production example with Helm
If you're using Helm, here's a reusable template:
```yaml
# templates/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: {{ .Values.name }}
spec:
  schedule: {{ .Values.schedule | quote }}
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 1
      activeDeadlineSeconds: {{ .Values.deadlineSeconds | default 1800 }}
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
              command:
                - /bin/sh
                - -c
                - |
                  set -eu
                  curl -fsS {{ .Values.cronsafe.url }}/start
                  {{ .Values.command }}
                  curl -fsS {{ .Values.cronsafe.url }}
              envFrom:
                - secretRef:
                    name: {{ .Values.name }}-secrets
              resources:
                {{- toYaml .Values.resources | nindent 16 }}
```

```yaml
# values.yaml
name: db-backup
schedule: "0 */6 * * *"
deadlineSeconds: 1800
image:
  repository: postgres
  tag: "16"
command: "pg_dump -Fc $DATABASE_URL > /tmp/backup.dump && aws s3 cp /tmp/backup.dump s3://backups/"
cronsafe:
  url: https://api.getcronsafe.com/ping/k8s-db-backup
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```

This gives you a standardized pattern for every CronJob in your cluster, with monitoring built in.
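Before deploying, it's worth rendering the chart locally to check that the injected `command` and the `nindent` indentation come out right (chart path and release name here are illustrative):

```shell
# Render the manifests without installing anything
helm template cron-jobs ./cron-chart -f values.yaml

# Install or upgrade the release
helm upgrade --install cron-jobs ./cron-chart -f values.yaml -n production
```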
## Setting up the CronSafe monitor
For Kubernetes CronJobs, set these values:
- **Schedule**: match the CronJob's `schedule` field exactly.
- **Grace period**: set to `activeDeadlineSeconds` + 5 minutes. This accounts for pod scheduling delays and the job's maximum runtime.
- **Overlap detection**: enable this. It catches cases where `concurrencyPolicy: Allow` causes parallel runs.