Why Silent Cron Job Failures Are Dangerous (And How to Fix Them)
A backup script that stopped running three weeks ago. An ETL pipeline that silently dropped half its records. A certificate renewal job that failed, taking down your site when the cert expired. These aren't hypothetical — they're the most common incidents caused by silent cron failures.
What makes a failure "silent"?
A silent failure is one where a job stops working but nothing reports the problem. No error in your dashboard. No alert in Slack. No entry in your logs. The server is up, the cron daemon is running, but the job either:
1. Never started — removed from crontab, wrong schedule, server timezone changed
2. Started but crashed — unhandled exception, OOM kill, dependency not found
3. Completed but produced wrong results — empty backup file, partial data sync, stale cache
4. Ran too late — the previous run was still going when the next one started
In all cases, the system looks healthy from the outside. The failure only becomes visible when its consequences surface — often days or weeks later.
Real examples of silent failures
The backup that wasn't
A startup ran nightly pg_dump backups to S3. The cron job had been running reliably for 18 months. Then an OS update changed the system PATH, and pg_dump was no longer found. The script exited silently (no set -e), and the curl to S3 uploaded an empty file.
They discovered the problem 23 days later when they needed to restore from backup. Twenty-three days of data, unrecoverable.
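The PATH story is worth defending against directly: cron runs jobs with a minimal environment, so binaries that work in your interactive shell may not be found. A sketch of a defensive preamble (the require helper is a hypothetical name, not a standard tool):

```shell
#!/bin/bash
set -euo pipefail
# Cron's default PATH is often just /usr/bin:/bin; pin it explicitly
# so an OS update or shell-profile change cannot silently break lookups.
export PATH=/usr/local/bin:/usr/bin:/bin

# Exit non-zero (loudly) if any required binary is missing.
require() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null || { echo "missing: $cmd" >&2; exit 1; }
    done
}

# A real backup script would check its actual dependencies, e.g.:
# require pg_dump aws curl
require sh   # demo invocation with a command guaranteed to exist
```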
The payment sync that drifted
An e-commerce team synced orders to their accounting system every 15 minutes. A database migration added a new required column. The sync script crashed on the first new-format order, but because the cron entry redirected stderr to /dev/null, nobody saw the traceback.
Orders accumulated for 4 days before accounting noticed the gap. Reconciliation took a week.
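The fix is a one-line change in the crontab entry: keep stderr somewhere durable instead of discarding it (paths and schedule here are illustrative):

```
# Dangerous: the traceback vanishes
*/15 * * * * /opt/sync/run.sh > /dev/null 2>&1

# Better: stdout and stderr land in a log you can inspect
*/15 * * * * /opt/sync/run.sh >> /var/log/sync.log 2>&1
```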
The SSL cert that expired
A Let's Encrypt renewal script ran monthly via cron. After a server migration, the new server's crontab wasn't copied over. The cert expired 60 days later, causing a full outage at 2 AM on a Saturday.
Why traditional monitoring doesn't catch this
Uptime monitors check if your website responds. They don't check if your background jobs ran.
Log monitoring only works if the job produces logs. If the job never starts, there's nothing to log.
APM tools instrument your application code. Cron jobs that run as standalone scripts aren't covered.
Server metrics show CPU, memory, disk. A cron job that didn't run consumes zero resources — indistinguishable from a healthy idle server.
The only reliable way to detect a job that didn't run is to expect it to run and notice when it doesn't. That's heartbeat monitoring.
Three strategies to eliminate silent failures
Strategy 1: Heartbeat monitoring for every critical job
Add a ping to the end of every important cron job. If the ping doesn't arrive on time, you get alerted.
```bash
#!/bin/bash
set -euo pipefail
pg_dump -Fc mydb > /backups/mydb_$(date +%Y%m%d).dump
aws s3 cp /backups/mydb_$(date +%Y%m%d).dump s3://backups/
curl -s https://api.getcronsafe.com/ping/backup-prod
```
Key details:
- set -euo pipefail ensures the script exits on any error
- The ping is at the end, so it only fires if everything succeeded
- If pg_dump fails, the script exits before reaching curl
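One refinement worth noting: under set -euo pipefail, a transient network blip in the final curl would make the whole job exit non-zero. Bounding and retrying the ping keeps the heartbeat itself from becoming a failure source (the URL is the same hypothetical endpoint as above):

```shell
# -m 10: give up after 10 seconds; --retry 3: retry transient failures;
# -fsS: treat HTTP errors as failures, stay quiet except for real errors.
# The trailing || true keeps a failed ping from aborting the script;
# if the ping really never arrives, the monitor alerts anyway.
curl -fsS -m 10 --retry 3 https://api.getcronsafe.com/ping/backup-prod || true
```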
Strategy 2: Fail pings for explicit error reporting
Don't just rely on the absence of a success ping. Actively report failures:
```bash
#!/bin/bash
set -euo pipefail
cleanup() {
    rc=$?   # capture the exit code before any other command overwrites $?
    if [ "$rc" -ne 0 ]; then
        curl -s -X POST https://api.getcronsafe.com/ping/etl-sync/fail \
            -d "Job failed with exit code $rc"
    fi
}
trap cleanup EXIT
python3 /opt/etl/sync.py
curl -s https://api.getcronsafe.com/ping/etl-sync
```
The trap ensures that even if the script crashes, a fail ping is sent. This gives you faster detection and more context.
Strategy 3: Validate output, not just execution
The most dangerous silent failure is when the job runs but produces garbage. Add validation before pinging:
```bash
#!/bin/bash
set -euo pipefail
BACKUP_FILE="/backups/mydb_$(date +%Y%m%d).dump"
pg_dump -Fc mydb > "$BACKUP_FILE"
# Validate: file must be at least 1MB (BSD stat first, GNU stat as fallback)
FILE_SIZE=$(stat -f%z "$BACKUP_FILE" 2>/dev/null || stat --printf="%s" "$BACKUP_FILE")
if [ "$FILE_SIZE" -lt 1048576 ]; then
    curl -s -X POST https://api.getcronsafe.com/ping/backup-prod/fail \
        -d "Backup too small: ${FILE_SIZE} bytes"
    exit 1
fi
curl -s https://api.getcronsafe.com/ping/backup-prod
```
This catches the empty-backup scenario. The job technically "succeeded" but the output is invalid.
Building a safety net
The combination of all three strategies creates a robust safety net:
1. Heartbeat monitoring catches jobs that don't run
2. Fail pings catch jobs that crash
3. Output validation catches jobs that produce bad data
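Stitched together, all three strategies fit in a single wrapper. A runnable sketch, in which job is a stub standing in for pg_dump and the ping URL is the hypothetical endpoint used throughout:

```shell
#!/bin/bash
set -uo pipefail
PING_URL="https://api.getcronsafe.com/ping/backup-prod"  # hypothetical endpoint

job()      { echo "backup data" > /tmp/demo.dump; }      # stand-in for pg_dump
validate() { [ "$(wc -c < /tmp/demo.dump)" -ge 1 ]; }    # Strategy 3: real scripts check >= 1MB

if job && validate; then
    curl -fsS -m 10 "$PING_URL" || true                  # Strategy 1: heartbeat only on success
    echo "backup ok"
else
    # Strategy 2: report the failure explicitly instead of staying silent
    curl -fsS -m 10 -X POST "$PING_URL/fail" -d "job or validation failed" || true
    echo "backup failed" >&2
    exit 1
fi
```

The success ping is deliberately the last thing that happens: any earlier failure takes the else branch, so silence can only ever mean "not run yet", which is exactly what the heartbeat monitor detects.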
Add CronSafe's escalating reminders (1h, 6h, 24h) and you have a system where no failure goes unnoticed for more than a few minutes.
The cost of not monitoring
Every team that experiences a silent cron failure says the same thing: "We assumed it was running." The cost of that assumption is measured in lost data, broken SLAs, and weekend incidents.
Adding heartbeat monitoring takes 30 seconds per job. The alternative is discovering the failure when it's already too late.
Start monitoring your cron jobs for free
20 monitors, email alerts, GitHub badges. No credit card required.
Get started free →