How to Monitor Database Backup Jobs (And Know When They Fail)
The most dangerous silent failure in any system is a broken backup. Everything else — downtime, bugs, performance issues — is recoverable if you have good backups. But when backups fail silently, you only discover it when you need to restore. And by then, it's too late.
How backup jobs fail silently
pg_dump failures
```bash
# This looks correct but has a critical flaw
0 2 * * * pg_dump mydb > /backups/mydb.sql 2>/dev/null
```

Common pg_dump failure modes:
- Authentication failure: the password changed or pg_hba.conf was updated
- Connection refused: PostgreSQL restarted or the port changed
- Disk full: the backup file grows until the disk is full, then the dump is truncated
- OOM kill: large databases can exhaust memory during the dump
- Table lock timeout: long-running transactions block the dump
In all cases, the cron job "ran" but produced either no file or a corrupt file. Without monitoring, you won't know.
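The minimal first step is to stop discarding stderr and to report a non-zero exit. A sketch using the same CronSafe ping URL as the full script later in this post:

```bash
# Ping the /fail endpoint whenever pg_dump exits non-zero,
# instead of hiding every error with 2>/dev/null
0 2 * * * pg_dump mydb > /backups/mydb.sql || curl -fsS -X POST "https://api.getcronsafe.com/ping/pg-backup-prod/fail" -d "pg_dump exited non-zero"
```

This still misses dumps that "succeed" but produce an empty file; the monitored scripts below add size and integrity validation on top.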
mysqldump failures
```bash
# Another silent failure waiting to happen
0 3 * * * mysqldump --all-databases > /backups/all.sql
```

mysqldump-specific issues:
- Locked tables: InnoDB lock waits time out
- Binary log position lost: replication-safe dumps fail if the binlog has rotated
- Character encoding corruption: missing `--default-character-set=utf8mb4`
- Partial dumps: one database errors out and the rest are skipped (see the per-database sketch below)
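The partial-dump failure mode can be contained by dumping each database separately, so a single error can't silently skip the rest. A minimal sketch (paths and the schema exclusion list are illustrative; the monitored script below is the full version):

```bash
#!/bin/bash
set -euo pipefail

FAILED=0
# Dump each database on its own so one failure can't skip the others;
# skip the read-only system schemas
for DB in $(mysql -N -e 'SHOW DATABASES' | grep -Ev '^(information_schema|performance_schema|sys)$'); do
  if ! mysqldump --single-transaction "$DB" | gzip > "/backups/mysql/${DB}.sql.gz"; then
    echo "dump failed: $DB" >&2
    FAILED=1
  fi
done
exit "$FAILED"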
The "empty file" trap
The worst failure: the backup runs, produces a file, but the file is empty or trivially small.
```
$ ls -lah /backups/
-rw-r--r-- 1 root root    0 Apr 14 02:00 mydb.sql    # empty!
-rw-r--r-- 1 root root 1.0K Apr 13 02:00 mydb.sql.1  # yesterday, nearly empty
-rw-r--r-- 1 root root 450M Apr 10 02:00 mydb.sql.4  # last real backup: 4 days ago
```

Without size validation, you'll archive weeks of empty files and only realize it during an emergency restore.
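A cheap guard against this trap is a daily sweep that flags suspiciously small backup files, independent of the backup job itself. A sketch using find's size test (the 1 KB threshold and the ping URL reuse the examples from this post; tune them to your setup):

```bash
# Alert if any backup written in the last day is under 1 KB
SMALL=$(find /backups -name '*.sql*' -mtime -1 -size -1k)
if [ -n "$SMALL" ]; then
  curl -fsS -X POST "https://api.getcronsafe.com/ping/pg-backup-prod/fail" \
    -d "Suspiciously small backups: $SMALL"
fi
```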
Monitored pg_dump with validation
```bash
#!/bin/bash
set -euo pipefail

MONITOR="https://api.getcronsafe.com/ping/pg-backup-prod"
DB_URL="postgresql://user:pass@localhost:5432/mydb"
BACKUP_DIR="/backups/postgres"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/mydb_${TIMESTAMP}.dump"
MIN_SIZE=1048576  # 1MB minimum, adjust for your database

# Signal start
curl -fsS "${MONITOR}/start"

# Ensure backup directory exists
mkdir -p "$BACKUP_DIR"

# Create backup (custom format for compression + selective restore)
pg_dump -Fc "$DB_URL" > "$BACKUP_FILE"

# Validate file size
FILE_SIZE=$(stat --printf="%s" "$BACKUP_FILE")
if [ "$FILE_SIZE" -lt "$MIN_SIZE" ]; then
  curl -fsS -X POST "${MONITOR}/fail" \
    -d "Backup too small: ${FILE_SIZE} bytes (minimum: ${MIN_SIZE})"
  rm -f "$BACKUP_FILE"
  exit 1
fi

# Validate backup integrity
if ! pg_restore --list "$BACKUP_FILE" > /dev/null 2>&1; then
  curl -fsS -X POST "${MONITOR}/fail" \
    -d "Backup integrity check failed"
  exit 1
fi

# Upload to remote storage
aws s3 cp "$BACKUP_FILE" "s3://my-backups/postgres/${TIMESTAMP}.dump" \
  --storage-class STANDARD_IA

# Cleanup old local backups (keep 7 days)
find "$BACKUP_DIR" -name "mydb_*.dump" -mtime +7 -delete

# Signal success with details
HUMAN_SIZE=$(numfmt --to=iec "$FILE_SIZE")
curl -fsS -X POST "$MONITOR" \
  -d "Backup OK: ${HUMAN_SIZE}, uploaded to S3, local cleanup done"
```

This script:
1. Sends a start ping for duration tracking
2. Creates the backup
3. Validates file size (catches empty/truncated dumps)
4. Validates integrity with `pg_restore --list`
5. Uploads to S3
6. Cleans up old local files
7. Sends a success ping with details, or a fail ping with the reason
Monitored mysqldump with validation
```bash
#!/bin/bash
set -euo pipefail

MONITOR="https://api.getcronsafe.com/ping/mysql-backup-prod"
BACKUP_DIR="/backups/mysql"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/alldb_${TIMESTAMP}.sql.gz"
MIN_SIZE=524288  # 512KB minimum

curl -fsS "${MONITOR}/start"
mkdir -p "$BACKUP_DIR"

# Dump all databases with a consistent snapshot
mysqldump \
  --all-databases \
  --single-transaction \
  --routines \
  --triggers \
  --default-character-set=utf8mb4 \
  --set-gtid-purged=OFF \
  2>/tmp/mysqldump_err.log | gzip > "$BACKUP_FILE"

# Check whether mysqldump wrote errors to stderr while still exiting 0
# (a non-zero exit already aborts the script via set -e / pipefail)
if [ -s /tmp/mysqldump_err.log ]; then
  ERRORS=$(cat /tmp/mysqldump_err.log)
  # Warnings alone are OK; actual errors are not
  if echo "$ERRORS" | grep -qi "error"; then
    curl -fsS -X POST "${MONITOR}/fail" -d "mysqldump errors: $ERRORS"
    exit 1
  fi
fi

# Validate size
FILE_SIZE=$(stat --printf="%s" "$BACKUP_FILE")
if [ "$FILE_SIZE" -lt "$MIN_SIZE" ]; then
  curl -fsS -X POST "${MONITOR}/fail" \
    -d "Backup too small: ${FILE_SIZE} bytes"
  exit 1
fi

# Validate gzip integrity
if ! gzip -t "$BACKUP_FILE" 2>/dev/null; then
  curl -fsS -X POST "${MONITOR}/fail" \
    -d "Gzip integrity check failed"
  exit 1
fi

# Upload and cleanup
aws s3 cp "$BACKUP_FILE" "s3://my-backups/mysql/${TIMESTAMP}.sql.gz"
find "$BACKUP_DIR" -name "alldb_*.sql.gz" -mtime +7 -delete

HUMAN_SIZE=$(numfmt --to=iec "$FILE_SIZE")
curl -fsS -X POST "$MONITOR" -d "MySQL backup OK: ${HUMAN_SIZE}"
```

What to validate in your backups
Minimum file size: Set this to roughly 80% of your typical backup size. If your database is 500MB, the backup should never be under 400MB. An empty or 1KB file means something went catastrophically wrong.
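A static MIN_SIZE goes stale as the database grows. A hedged alternative is to derive the threshold from the previous backup, enforcing the 80% rule automatically. This sketch assumes the file naming scheme from the pg_dump script above and would sit just before its size check:

```bash
# Require the new backup to be at least 80% of the previous one's size.
# The newest file is the backup just written, so take the second-newest.
PREV=$(ls -t "${BACKUP_DIR}"/mydb_*.dump 2>/dev/null | sed -n '2p' || true)
if [ -n "$PREV" ]; then
  PREV_SIZE=$(stat --printf="%s" "$PREV")
  MIN_SIZE=$(( PREV_SIZE * 8 / 10 ))
fi
```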
Integrity check: Use format-specific tools:
- PostgreSQL custom format: `pg_restore --list`
- PostgreSQL plain SQL: the `-- PostgreSQL database dump` banner should appear in the first few lines (see the sketch after this list)
- MySQL: `gzip -t` for compressed dumps, or check that the last line contains `-- Dump completed`
- MongoDB: `mongorestore --dryRun`
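For the plain-SQL cases, the header and footer checks are only a few lines of shell. A sketch, assuming an uncompressed pg_dump output and a gzipped mysqldump (file names are illustrative):

```bash
# Plain-format pg_dump: the banner appears in the first few lines
head -5 /backups/mydb.sql | grep -q -- "-- PostgreSQL database dump" \
  || { echo "pg_dump header check failed" >&2; exit 1; }

# mysqldump: an interrupted dump lacks the completion footer on its last line
zcat /backups/alldb.sql.gz | tail -1 | grep -q -- "-- Dump completed" \
  || { echo "mysqldump footer check failed" >&2; exit 1; }
```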
Remote upload verification: After uploading to S3/GCS, verify the remote file exists:
aws s3 ls "s3://my-backups/postgres/${TIMESTAMP}.dump" || {
curl -fsS -X POST "${MONITOR}/fail" -d "S3 upload verification failed"
exit 1
}Restore testing
The ultimate backup validation is a restore test. Run this weekly or monthly:
```bash
#!/bin/bash
set -euo pipefail

MONITOR="https://api.getcronsafe.com/ping/backup-restore-test"
curl -fsS "${MONITOR}/start"

# Download latest backup (--recursive prints keys relative to the bucket root)
LATEST=$(aws s3 ls s3://my-backups/postgres/ --recursive | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://my-backups/${LATEST}" /tmp/restore_test.dump

# Restore to a throwaway test database
dropdb --if-exists restore_test
createdb restore_test
pg_restore -d restore_test /tmp/restore_test.dump

# Run a basic sanity check (-tA: tuples only, unaligned, so $USERS is a bare number)
USERS=$(psql -tA -c "SELECT count(*) FROM users" restore_test)
if [ "$USERS" -lt 1 ]; then
  curl -fsS -X POST "${MONITOR}/fail" -d "Restore test: users table empty"
  exit 1
fi

# Cleanup
dropdb restore_test
rm /tmp/restore_test.dump

curl -fsS -X POST "$MONITOR" -d "Restore test passed: ${USERS} users found"
```

For teams needing full infrastructure monitoring
If you're monitoring backups across multiple databases, servers, and cloud services, consider LuxkernOS. It combines CronSafe (cron monitoring), LogDrain (log aggregation), PingCheck (uptime monitoring), and AI-powered anomaly detection in a single platform at €49/mo — replacing the patchwork of Cronitor, Better Stack, UptimeRobot, and Datadog that most teams cobble together.
The rule of backup monitoring
If you can't prove your backup ran successfully in the last 24 hours, you don't have backups. You have a script that might be creating backups.
Add heartbeat monitoring to every backup job. Validate file sizes. Test restores. The 10 minutes of setup prevents the worst day of your career.
Start monitoring your cron jobs for free
20 monitors, email alerts, GitHub badges. No credit card required.
Get started free →