
Why Silent Cron Job Failures Are Dangerous (And How to Fix Them)

A backup script that stopped running three weeks ago. An ETL pipeline that silently dropped half its records. A certificate renewal job that failed, taking down your site when the cert expired. These aren't hypothetical — they're the most common incidents caused by silent cron failures.

What makes a failure "silent"?

A silent failure is when a job stops working but nothing reports the problem. No error in your dashboard. No alert in Slack. No entry in your logs. The server is up, the cron daemon is running, but the job either:

1. Never started — removed from crontab, wrong schedule, server timezone changed
2. Started but crashed — unhandled exception, OOM kill, dependency not found
3. Completed but produced wrong results — empty backup file, partial data sync, stale cache
4. Ran too late — the previous run was still going when the next one started
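The fourth failure mode, overlapping runs, has a standard guard on Linux: flock. A minimal sketch, with an illustrative lock path and an echo standing in for the real job command:

```shell
#!/bin/bash
# flock guards against overlapping runs: if a previous run still holds
# the lock, this run skips instead of stacking up. Paths are illustrative.
LOCKFILE=/tmp/sync-job.lock
(
  flock -n 9 || { echo "previous run still active, skipping"; exit 0; }
  echo "job running"   # stand-in for the real job command
) 9>"$LOCKFILE"
```

Skipping is usually safer than stacking: two concurrent runs of the same sync can corrupt state, while a skipped run is caught by heartbeat monitoring.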

In all cases, the system looks healthy from the outside. The failure only becomes visible when its consequences surface — often days or weeks later.

Real examples of silent failures

The backup that wasn't

A startup ran nightly pg_dump backups to S3. The cron job had been running reliably for 18 months. Then an OS update changed the system PATH, and pg_dump was no longer found. Without set -e, the script ignored the failure and carried on, and the curl to S3 uploaded an empty file.

They discovered the problem 23 days later when they needed to restore from backup. Twenty-three days of data, unrecoverable.

The payment sync that drifted

An e-commerce team synced orders to their accounting system every 15 minutes. A database migration added a new required column. The sync script crashed on the first new-format order, but because the cron entry redirected stderr to /dev/null, nobody saw the traceback.

Orders accumulated for 4 days before accounting noticed the gap. Reconciliation took a week.
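The enabling mistake is common enough to be worth showing. A hedged sketch of the two crontab styles (paths and schedule are illustrative, not from the incident):

```bash
# Silent: the traceback on stderr vanishes
*/15 * * * * /opt/sync/orders.py > /dev/null 2>&1

# Better: append stderr to a log (or drop the redirect and let cron mail it)
*/15 * * * * /opt/sync/orders.py >> /var/log/orders-sync.log 2>&1
```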

The SSL cert that expired

A Let's Encrypt renewal script ran monthly via cron. After a server migration, the new server's crontab wasn't copied over. The cert expired 60 days later, causing a full outage at 2 AM on a Saturday.

Why traditional monitoring doesn't catch this

Uptime monitors check if your website responds. They don't check if your background jobs ran.

Log monitoring only works if the job produces logs. If the job never starts, there's nothing to log.

APM tools instrument your application code. Cron jobs that run as standalone scripts aren't covered.

Server metrics show CPU, memory, disk. A cron job that didn't run consumes zero resources — indistinguishable from a healthy idle server.

The only reliable way to detect a job that didn't run is to expect it to run and notice when it doesn't. That's heartbeat monitoring.

Three strategies to eliminate silent failures

Strategy 1: Heartbeat monitoring for every critical job

Add a ping to the end of every important cron job. If the ping doesn't arrive on time, you get alerted.

```bash
#!/bin/bash
set -euo pipefail

# Compute the filename once so a run crossing midnight can't mismatch
BACKUP_FILE="/backups/mydb_$(date +%Y%m%d).dump"
pg_dump -Fc mydb > "$BACKUP_FILE"
aws s3 cp "$BACKUP_FILE" s3://backups/

curl -s https://api.getcronsafe.com/ping/backup-prod
```

Key details:

  • set -euo pipefail ensures the script exits on any error
  • The ping is at the end, so it only fires if everything succeeded
  • If pg_dump fails, the script exits before reaching curl
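The ordering is the whole trick: with `set -e`, a failing command aborts the script before any later line runs, so the ping never fires. A minimal sketch, using `false` as a stand-in for a failing pg_dump:

```shell
#!/bin/bash
# Run the failing pipeline in a child shell so we can inspect the result.
out=$(bash -c 'set -euo pipefail; false; echo "ping sent"' || true)
if [ -z "$out" ]; then
  echo "ping was never sent"   # set -e aborted before the echo
fi
```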

Strategy 2: Fail pings for explicit error reporting

Don't just rely on the absence of a success ping. Actively report failures:

```bash
#!/bin/bash
set -euo pipefail

cleanup() {
  rc=$?  # capture the exit code immediately, before it is overwritten
  if [ "$rc" -ne 0 ]; then
    curl -s -X POST https://api.getcronsafe.com/ping/etl-sync/fail \
      -d "Job failed with exit code $rc"
  fi
}
trap cleanup EXIT

python3 /opt/etl/sync.py
curl -s https://api.getcronsafe.com/ping/etl-sync
```

The trap ensures that even if the script crashes, a fail ping is sent. This gives you faster detection and more context.

Strategy 3: Validate output, not just execution

The most dangerous silent failure is when the job runs but produces garbage. Add validation before pinging:

```bash
#!/bin/bash
set -euo pipefail

BACKUP_FILE="/backups/mydb_$(date +%Y%m%d).dump"
pg_dump -Fc mydb > "$BACKUP_FILE"

# Validate: file must be at least 1MB (BSD stat first, GNU stat as fallback)
FILE_SIZE=$(stat -f%z "$BACKUP_FILE" 2>/dev/null || stat --printf="%s" "$BACKUP_FILE")
if [ "$FILE_SIZE" -lt 1048576 ]; then
  curl -s -X POST https://api.getcronsafe.com/ping/backup-prod/fail \
    -d "Backup too small: ${FILE_SIZE} bytes"
  exit 1
fi

curl -s https://api.getcronsafe.com/ping/backup-prod
```

This catches the empty-backup scenario. The job technically "succeeded" but the output is invalid.
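If several jobs need the same check, the size test can be factored into a small reusable helper. A sketch — the function name validate_min_size and the thresholds are illustrative, not part of any tool:

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file exists and is at least
# MIN_BYTES long. Tries GNU stat first, then the BSD/macOS variant.
validate_min_size() {
  local file="$1" min_bytes="$2" size
  [ -f "$file" ] || return 1
  size=$(stat --printf="%s" "$file" 2>/dev/null || stat -f%z "$file")
  [ "$size" -ge "$min_bytes" ]
}

# Demo on a throwaway 10-byte file
tmp=$(mktemp)
printf '0123456789' > "$tmp"
validate_min_size "$tmp" 5        && echo "pass"
validate_min_size "$tmp" 1048576  || echo "too small"
rm -f "$tmp"
```

For pg_dump custom-format archives specifically, `pg_restore --list "$BACKUP_FILE" > /dev/null` is a stricter check than size alone, since it verifies the dump is a readable archive.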

Building a safety net

The combination of all three strategies creates a robust safety net:

1. Heartbeat monitoring catches jobs that don't run
2. Fail pings catch jobs that crash
3. Output validation catches jobs that produce bad data

Add CronSafe's escalating reminders (1h, 6h, 24h) and you have a system where no failure goes unnoticed for more than a few minutes.

The cost of not monitoring

Every team that experiences a silent cron failure says the same thing: "We assumed it was running." The cost of that assumption is measured in lost data, broken SLAs, and weekend incidents.

Adding heartbeat monitoring takes 30 seconds per job. The alternative is discovering the failure when it's already too late.

Start monitoring your cron jobs for free

20 monitors, email alerts, GitHub badges. No credit card required.

Get started free →