
How to Monitor Database Backup Jobs (And Know When They Fail)

The most dangerous silent failure in any system is a broken backup. Everything else — downtime, bugs, performance issues — is recoverable if you have good backups. But when backups fail silently, you only discover it when you need to restore. And by then, it's too late.

How backup jobs fail silently

pg_dump failures

bash
# This looks correct, but 2>/dev/null throws every error away
0 2 * * * pg_dump mydb > /backups/mydb.sql 2>/dev/null

Common pg_dump failure modes:

  • Authentication failure — password changed, pg_hba.conf updated
  • Connection refused — PostgreSQL restarted, port changed
  • Disk full — backup file grows until the disk is full, then the dump is truncated
  • OOM kill — large databases can exhaust memory during dump
  • Table lock timeout — long-running transactions block the dump

In all cases, the cron job "ran" but produced either no file or a corrupt file. Without monitoring, you won't know.
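A minimal fix is to stop discarding stderr and act on the dump's exit code. Here's a sketch of that pattern — `run_backup` is a made-up wrapper name, and the paths and reporting step are placeholders:

```bash
#!/bin/bash
# Sketch: run a dump command, capture stderr, and report on the exit code
# instead of sending errors to /dev/null. run_backup is illustrative.
run_backup() {
  local log
  log=$(mktemp)
  if "$@" 2> "$log"; then
    echo "backup OK"
    rm -f "$log"
  else
    # Surface the real error instead of discarding it
    echo "backup FAILED: $(cat "$log")" >&2
    rm -f "$log"
    return 1
  fi
}

# In a script invoked by cron, stdout still goes to the backup file:
# run_backup sh -c 'pg_dump mydb > /backups/mydb.sql'
```

The later examples build on this idea by wiring the failure branch into a monitoring ping instead of just printing to stderr.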

mysqldump failures

bash
# Another silent failure waiting to happen
0 3 * * * mysqldump --all-databases > /backups/all.sql

mysqldump-specific issues:

  • Locked tables — InnoDB lock waits timeout
  • Binary log position lost — replication-safe dumps fail if binlog rotated
  • Character encoding corruption — missing --default-character-set=utf8mb4
  • Partial dumps — one database errors out, rest are skipped

The "empty file" trap

The worst failure: the backup runs, produces a file, but the file is empty or trivially small.

bash
$ ls -lah /backups/
-rw-r--r-- 1 root root    0 Apr 14 02:00 mydb.sql      # empty!
-rw-r--r-- 1 root root 1.0K Apr 13 02:00 mydb.sql.1    # yesterday, nearly empty
-rw-r--r-- 1 root root 450M Apr 10 02:00 mydb.sql.4    # last real backup: 4 days ago

Without size validation, you'll archive weeks of empty files and only realize it during an emergency restore.
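A size gate catches this class of failure cheaply. The helper below is a sketch — `check_backup_size` and its 1MB default floor are illustrative choices, not a standard tool:

```bash
#!/bin/bash
# Sketch: refuse to trust any backup file below a minimum size.
# The function name and the 1MB default floor are illustrative.
check_backup_size() {
  local file=$1
  local min=${2:-1048576}  # default floor: 1MB
  local size
  # GNU stat first, BSD stat as a fallback
  size=$(stat -c %s "$file" 2>/dev/null || stat -f %z "$file") || return 1
  if [ "$size" -lt "$min" ]; then
    echo "SUSPECT: $file is only $size bytes (floor: $min)" >&2
    return 1
  fi
  echo "OK: $file is $size bytes"
}
```

Call it immediately after the dump, before rotating anything: a failed check should alert and keep yesterday's file rather than overwrite it.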

Monitored pg_dump with validation

bash
#!/bin/bash
set -euo pipefail

MONITOR="https://api.getcronsafe.com/ping/pg-backup-prod"
DB_URL="postgresql://user:pass@localhost:5432/mydb"
BACKUP_DIR="/backups/postgres"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/mydb_${TIMESTAMP}.dump"
MIN_SIZE=1048576  # 1MB minimum — adjust for your database

# Signal start
curl -fsS "${MONITOR}/start"

# Ensure backup directory exists
mkdir -p "$BACKUP_DIR"

# Create backup (custom format for compression + selective restore)
pg_dump -Fc "$DB_URL" > "$BACKUP_FILE"

# Validate file size
FILE_SIZE=$(stat --printf="%s" "$BACKUP_FILE")
if [ "$FILE_SIZE" -lt "$MIN_SIZE" ]; then
  curl -fsS -X POST "${MONITOR}/fail" \
    -d "Backup too small: ${FILE_SIZE} bytes (minimum: ${MIN_SIZE})"
  rm -f "$BACKUP_FILE"
  exit 1
fi

# Validate backup integrity
if ! pg_restore --list "$BACKUP_FILE" > /dev/null 2>&1; then
  curl -fsS -X POST "${MONITOR}/fail" \
    -d "Backup integrity check failed"
  exit 1
fi

# Upload to remote storage
aws s3 cp "$BACKUP_FILE" "s3://my-backups/postgres/${TIMESTAMP}.dump" \
  --storage-class STANDARD_IA

# Cleanup old local backups (keep 7 days)
find "$BACKUP_DIR" -name "mydb_*.dump" -mtime +7 -delete

# Signal success with details
HUMAN_SIZE=$(numfmt --to=iec "$FILE_SIZE")
curl -fsS -X POST "$MONITOR" \
  -d "Backup OK: ${HUMAN_SIZE}, uploaded to S3, local cleanup done"

This script:

  1. Sends a start ping for duration tracking
  2. Creates the backup
  3. Validates file size (catches empty/truncated dumps)
  4. Validates integrity with pg_restore --list
  5. Uploads to S3
  6. Cleans up old local files
  7. Sends a success ping with details — or a fail ping with the reason

Monitored mysqldump with validation

bash
#!/bin/bash
set -euo pipefail

MONITOR="https://api.getcronsafe.com/ping/mysql-backup-prod"
BACKUP_DIR="/backups/mysql"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/alldb_${TIMESTAMP}.sql.gz"
MIN_SIZE=524288  # 512KB minimum

curl -fsS "${MONITOR}/start"
mkdir -p "$BACKUP_DIR"

# Dump all databases with consistent snapshot
mysqldump \
  --all-databases \
  --single-transaction \
  --routines \
  --triggers \
  --default-character-set=utf8mb4 \
  --set-gtid-purged=OFF \
  2>/tmp/mysqldump_err.log | gzip > "$BACKUP_FILE"

# Check if mysqldump produced errors
if [ -s /tmp/mysqldump_err.log ]; then
  ERRORS=$(cat /tmp/mysqldump_err.log)
  # Warnings alone are acceptable; anything mentioning "error" is not
  if echo "$ERRORS" | grep -qi "error"; then
    curl -fsS -X POST "${MONITOR}/fail" -d "mysqldump errors: $ERRORS"
    exit 1
  fi
fi

# Validate size
FILE_SIZE=$(stat --printf="%s" "$BACKUP_FILE")
if [ "$FILE_SIZE" -lt "$MIN_SIZE" ]; then
  curl -fsS -X POST "${MONITOR}/fail" \
    -d "Backup too small: ${FILE_SIZE} bytes"
  exit 1
fi

# Validate gzip integrity
if ! gzip -t "$BACKUP_FILE" 2>/dev/null; then
  curl -fsS -X POST "${MONITOR}/fail" \
    -d "Gzip integrity check failed"
  exit 1
fi

# Upload and cleanup
aws s3 cp "$BACKUP_FILE" "s3://my-backups/mysql/${TIMESTAMP}.sql.gz"
find "$BACKUP_DIR" -name "alldb_*.sql.gz" -mtime +7 -delete

HUMAN_SIZE=$(numfmt --to=iec "$FILE_SIZE")
curl -fsS -X POST "$MONITOR" -d "MySQL backup OK: ${HUMAN_SIZE}"

What to validate in your backups

Minimum file size: Set this to roughly 80% of your typical backup size. If your dumps usually come out around 500MB, anything under 400MB deserves investigation. An empty or 1KB file means something went catastrophically wrong.
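That 80% rule can also be computed dynamically against the previous backup instead of hard-coded. A sketch — `shrunk_since_last` is a made-up helper name, not a real tool:

```bash
#!/bin/bash
# Sketch: flag a new backup smaller than 80% of the previous one.
# shrunk_since_last is an illustrative name, not a standard utility.
shrunk_since_last() {
  local current=$1 previous=$2
  local cur prev floor
  # GNU stat first, BSD stat as a fallback
  cur=$(stat -c %s "$current" 2>/dev/null || stat -f %z "$current") || return 1
  prev=$(stat -c %s "$previous" 2>/dev/null || stat -f %z "$previous") || return 1
  floor=$(( prev * 80 / 100 ))
  if [ "$cur" -lt "$floor" ]; then
    echo "SHRUNK: $cur bytes vs previous $prev bytes (floor: $floor)" >&2
    return 1
  fi
  echo "OK: $cur bytes (previous: $prev)"
}
```

The advantage over a fixed floor: it keeps pace with a growing database without anyone remembering to bump `MIN_SIZE`.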

Integrity check: Use format-specific tools:

  • PostgreSQL custom format: pg_restore --list
  • PostgreSQL plain SQL: the first few lines should contain -- PostgreSQL database dump
  • MySQL: gzip -t for compressed, or check the last line contains -- Dump completed
  • MongoDB: mongorestore --dryRun
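Two of those checks are easy to script. A sketch with made-up helper names and illustrative paths:

```bash
#!/bin/bash
# Sketch: header/trailer checks for the two plain-text formats above.
# pg_header_ok and mysql_trailer_ok are illustrative names, not standard tools.

# PostgreSQL plain SQL: the dump header appears in the first few lines
pg_header_ok() {
  head -n 5 "$1" | grep -q -- "-- PostgreSQL database dump"
}

# MySQL compressed dump: a completed dump ends with "-- Dump completed"
mysql_trailer_ok() {
  gzip -dc "$1" | tail -n 1 | grep -q -- "-- Dump completed"
}
```

The trailer check is the stronger of the two: a dump that was killed mid-write will often have a valid header but no completion marker.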

Remote upload verification: After uploading to S3/GCS, verify the remote file exists:

bash
aws s3 ls "s3://my-backups/postgres/${TIMESTAMP}.dump" || {
  curl -fsS -X POST "${MONITOR}/fail" -d "S3 upload verification failed"
  exit 1
}

Restore testing

The ultimate backup validation is a restore test. Run this weekly or monthly:

bash
#!/bin/bash
set -euo pipefail

MONITOR="https://api.getcronsafe.com/ping/backup-restore-test"
curl -fsS "${MONITOR}/start"

# Download latest backup
LATEST=$(aws s3 ls s3://my-backups/postgres/ --recursive | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://my-backups/${LATEST}" /tmp/restore_test.dump

# Restore to test database
dropdb --if-exists restore_test
createdb restore_test
pg_restore -d restore_test /tmp/restore_test.dump

# Run a basic sanity check
USERS=$(psql -tA -c "SELECT count(*) FROM users" restore_test)
if [ "$USERS" -lt 1 ]; then
  curl -fsS -X POST "${MONITOR}/fail" -d "Restore test: users table empty"
  exit 1
fi

# Cleanup
dropdb restore_test
rm /tmp/restore_test.dump

curl -fsS -X POST "$MONITOR" -d "Restore test passed: ${USERS} users found"

For teams needing full infrastructure monitoring

If you're monitoring backups across multiple databases, servers, and cloud services, consider LuxkernOS. It combines CronSafe (cron monitoring), LogDrain (log aggregation), PingCheck (uptime monitoring), and AI-powered anomaly detection in a single platform at €49/mo — replacing the patchwork of Cronitor, Better Stack, UptimeRobot, and Datadog that most teams cobble together.

The rule of backup monitoring

If you can't prove your backup ran successfully in the last 24 hours, you don't have backups. You have a script that might be creating backups.

Add heartbeat monitoring to every backup job. Validate file sizes. Test restores. The 10 minutes of setup prevents the worst day of your career.

Start monitoring your cron jobs for free

20 monitors, email alerts, GitHub badges. No credit card required.

Get started free →