Tomas Dvorak 3cb40adb23 first commit
2026-04-10 12:04:09 +02:00

Productier Disaster Recovery Runbook

Scope

This runbook covers backup and restore of:

  • PostgreSQL data (postgres service)
  • Object storage data (rustfs bucket configured by S3_BUCKET)

It assumes the production compose stack and env setup in:

  • infra/docker-compose.prod.yml
  • .env.production

Preconditions

  1. Validate production env:
npm run check:prod-env
  2. Ensure production services are running:
npm run ops:deploy

If the deployment is already running and you only need readiness validation:

npm run ops:preflight

Backup

Create a timestamped backup:

npm run ops:backup

Output directory format:

backups/<UTC timestamp>/

Expected files:

  • postgres.sql.gz
  • s3/ (synced object data)
  • checksums.sha256
  • metadata.json

Verify backup:

bash scripts/ops/verify-backup.sh backups/<timestamp>
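
The expected layout can also be checked by hand. The following is a minimal sketch, not the contents of scripts/ops/verify-backup.sh: it assumes checksums.sha256 is in the standard sha256sum format with paths relative to the backup directory, and check_backup_dir is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: checks that a backup directory contains the expected
# files and that the recorded checksums match. Not the project's script.
check_backup_dir() {
  local dir="$1" f
  for f in postgres.sql.gz checksums.sha256 metadata.json; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing file: $f" >&2
      return 1
    fi
  done
  if [ ! -d "$dir/s3" ]; then
    echo "missing directory: s3" >&2
    return 1
  fi
  # Assumption: checksum paths are relative to the backup directory.
  (cd "$dir" && sha256sum -c checksums.sha256)
}

# Usage: check_backup_dir "backups/<timestamp>"
```

A non-zero exit status means either a missing file or a checksum mismatch; the message on stderr or the sha256sum output shows which.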

Run the full backup flow the way the scheduled job does (backup + verify + prune):

npm run ops:backup:job

Restore

Restore is destructive and requires explicit safety flags.

FORCE=1 RESET_DB=1 RESTORE_S3=1 \
bash scripts/ops/restore-prod.sh .env.production backups/<timestamp>

Flags:

  • FORCE=1: required; the restore script exits if it is not set
  • RESET_DB=1: recommended; drops and recreates the schema before import
  • RESTORE_S3=1: restores object storage (on by default)
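
Requiring FORCE=1 is a common guard pattern for destructive scripts. A minimal sketch of how such a guard can look, assuming nothing about the real restore-prod.sh beyond the flags listed above; require_force and restore_plan are hypothetical names, and the off-by-default treatment of RESET_DB is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical guard: refuse to continue unless FORCE=1 is set explicitly.
require_force() {
  if [ "${FORCE:-0}" != "1" ]; then
    echo "Refusing to restore: set FORCE=1 to confirm." >&2
    return 1
  fi
}

# Illustrative default handling: RESTORE_S3 is on unless overridden,
# RESET_DB is off unless requested (the latter is an assumption).
restore_plan() {
  require_force || return 1
  echo "reset_db=${RESET_DB:-0} restore_s3=${RESTORE_S3:-1}"
}

# Usage: FORCE=1 RESET_DB=1 restore_plan
```

Printing the resolved plan before doing any work makes it easy to confirm which destructive steps are about to run.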

Restore drill (non-destructive)

Run drill against latest backup:

npm run ops:restore:drill

Run full isolated staging drill (temporary compose project + teardown):

npm run ops:drill:staging

Run drill against specific backup:

bash scripts/ops/restore-drill.sh .env.production backups/<timestamp>
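
To pass the newest backup explicitly, its path can be resolved first. A sketch, assuming directory names under backups/ are UTC timestamps that sort lexicographically in time order; the exact timestamp format is not specified in this runbook, and latest_backup is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the newest backup directory under a root.
# Assumption: directory names sort lexicographically in time order.
latest_backup() {
  local root="${1:-backups}" d latest=""
  for d in "$root"/*/; do
    [ -d "$d" ] || continue   # skip the unexpanded glob when root is empty
    latest="${d%/}"           # globs expand in sorted order, so the last wins
  done
  [ -n "$latest" ] || return 1
  printf '%s\n' "$latest"
}

# Usage (commented out so sourcing this file has no side effects):
# bash scripts/ops/restore-drill.sh .env.production "$(latest_backup backups)"
```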

Drill behavior:

  • Imports DB dump into temporary drill database
  • Runs sanity check (public table count)
  • Optionally syncs backup objects into a temporary drill bucket
  • Drops temporary drill DB and bucket by default
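
The public table count check can be approximated as below. This is a sketch, not the drill script itself: the query is standard PostgreSQL, while the function name, the DRILL_DB_URL variable, and the threshold of at least one table are assumptions.

```shell
#!/usr/bin/env bash
# Standard information_schema query: number of base tables in "public".
TABLE_COUNT_SQL="SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public' AND table_type = 'BASE TABLE';"

# Hypothetical check: succeed only if the count is an integer >= 1.
sanity_check_count() {
  local count="$1"
  case "$count" in
    ''|*[!0-9]*) echo "unexpected count: '$count'" >&2; return 1 ;;
  esac
  [ "$count" -ge 1 ]
}

# Usage against a drill database (DRILL_DB_URL is a hypothetical variable):
# count="$(psql "$DRILL_DB_URL" -At -c "$TABLE_COUNT_SQL")"
# sanity_check_count "$count" && echo "drill DB looks populated"
```

A count of zero after import usually means the dump was empty or was loaded into a different schema, so it is worth failing the drill on it.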

Post-restore checks

  1. API health:
curl -sS https://<PUBLIC_DOMAIN>/v1/health
  2. Auth health:
curl -sS https://<PUBLIC_DOMAIN>/api/auth/get-session
  3. Manual smoke:
    • sign in
    • open board/calendar/notes/mail
    • download at least one attachment

Automated smoke script:

npm run ops:smoke

The smoke script checks:

  • public homepage response
  • /v1/health payload
  • security response headers
  • HTTP->HTTPS redirect behavior
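
Comparable checks can be run by hand with curl. A rough sketch, not the ops:smoke script: the header names shown are common examples rather than the script's confirmed list, and has_header is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read raw response headers on stdin and succeed
# if the named header is present (case-insensitive match).
has_header() {
  local name="$1"
  grep -qi "^${name}:"
}

# Usage (replace <PUBLIC_DOMAIN>; header names are illustrative):
# curl -sSI "https://<PUBLIC_DOMAIN>/" | has_header strict-transport-security
# curl -sSI "https://<PUBLIC_DOMAIN>/" | has_header x-content-type-options
# Print the status code and redirect target for the plain-HTTP URL:
# curl -sS -o /dev/null -w '%{http_code} %{redirect_url}\n' "http://<PUBLIC_DOMAIN>/"
```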

Cadence recommendations

  • Daily backup (off-peak)
  • Weekly restore drill in staging
  • Keep at least 14 daily restore points and 8 weekly restore points

Automation (systemd)

Template files:

  • infra/systemd/productier-backup.service
  • infra/systemd/productier-backup.timer

Example install on Linux host:

sudo cp infra/systemd/productier-backup.service /etc/systemd/system/
sudo cp infra/systemd/productier-backup.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now productier-backup.timer
sudo systemctl status productier-backup.timer

Run backup manually through systemd:

sudo systemctl start productier-backup.service
sudo journalctl -u productier-backup.service -n 200 --no-pager

Retention is controlled by BACKUP_KEEP_COUNT in productier-backup.service.
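
Count-based retention of this kind can be sketched as follows. This is an illustration, not the project's prune logic: it assumes backup directory names sort in time order, prune_backups is a hypothetical name, and the fallback of 14 in the usage line is only an example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep the newest N backup directories, delete the rest.
# Assumption: names under the root sort lexicographically in time order.
prune_backups() {
  local root="$1" keep="$2" d excess
  set --                          # reuse positional parameters as a list
  for d in "$root"/*/; do
    [ -d "$d" ] || continue
    set -- "$@" "${d%/}"
  done
  excess=$(( $# - keep ))
  # The oldest entries come first in glob order, so remove from the front.
  while [ "$excess" -gt 0 ]; do
    echo "pruning $1"
    rm -rf -- "$1"
    shift
    excess=$(( excess - 1 ))
  done
}

# Usage: prune_backups backups "${BACKUP_KEEP_COUNT:-14}"
```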

Alerting:

  • OPS_ALERT_WEBHOOK_URL
  • OPS_ALERT_WEBHOOK_BEARER_TOKEN
  • OPS_NOTIFY_ON_SUCCESS
  • OPS_ALERT_TIMEOUT_SECONDS

These variables can be set in /opt/productier/.env.production and are loaded by productier-backup.service.
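
How a notifier might consume these variables is sketched below. Only the variable names come from this runbook; the JSON payload shape, the 10-second default timeout, the meaning of OPS_NOTIFY_ON_SUCCESS=1, and the function names are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical payload builder; the real webhook payload format is not
# documented here. Inputs are assumed to need no JSON escaping.
alert_payload() {
  printf '{"service":"productier-backup","status":"%s","message":"%s"}\n' "$1" "$2"
}

# Hypothetical sender: skips success notices unless OPS_NOTIFY_ON_SUCCESS=1.
send_alert() {
  local status="$1" message="$2"
  [ -n "${OPS_ALERT_WEBHOOK_URL:-}" ] || return 0   # alerting not configured
  if [ "$status" = "success" ] && [ "${OPS_NOTIFY_ON_SUCCESS:-0}" != "1" ]; then
    return 0
  fi
  # Build the curl arguments; add the Authorization header only if a token is set.
  set -- -fsS --max-time "${OPS_ALERT_TIMEOUT_SECONDS:-10}" \
    -H 'Content-Type: application/json'
  if [ -n "${OPS_ALERT_WEBHOOK_BEARER_TOKEN:-}" ]; then
    set -- "$@" -H "Authorization: Bearer ${OPS_ALERT_WEBHOOK_BEARER_TOKEN}"
  fi
  curl "$@" -X POST -d "$(alert_payload "$status" "$message")" "$OPS_ALERT_WEBHOOK_URL"
}

# Usage: send_alert failure "backup verification failed"
```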

Restore drill automation:

  • infra/systemd/productier-restore-drill.service
  • infra/systemd/productier-restore-drill.timer

Example install:

sudo cp infra/systemd/productier-restore-drill.service /etc/systemd/system/
sudo cp infra/systemd/productier-restore-drill.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now productier-restore-drill.timer
sudo systemctl status productier-restore-drill.timer