Productier Disaster Recovery Runbook
Scope
This runbook covers backup and restore of:
- PostgreSQL data (postgres service)
- Object storage data (rustfs bucket configured by S3_BUCKET)
It assumes the production compose stack and env setup in:
- infra/docker-compose.prod.yml
- .env.production
Preconditions
- Validate production env:
npm run check:prod-env
- Ensure production services are running:
npm run ops:deploy
If deployment is already running and you only want readiness validation:
npm run ops:preflight
Backup
Create a timestamped backup:
npm run ops:backup
Output directory format:
backups/<UTC timestamp>/
Expected files:
- postgres.sql.gz
- s3/ (synced object data)
- checksums.sha256
- metadata.json
Verify backup:
bash scripts/ops/verify-backup.sh backups/<timestamp>
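As a rough illustration of what a checksum-based verification could look like, here is a minimal sketch. It assumes checksums.sha256 is in the standard `sha256sum` format with paths relative to the backup directory; the function name `verify_backup_dir` is hypothetical, and the real verify-backup.sh may do more or work differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the contents of scripts/ops/verify-backup.sh.
# Assumes checksums.sha256 lists files relative to the backup directory
# in the format produced by GNU `sha256sum`.
verify_backup_dir() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing backup directory: $dir" >&2
    return 1
  fi
  if [ ! -f "$dir/checksums.sha256" ]; then
    echo "missing checksums.sha256 in $dir" >&2
    return 1
  fi
  # Subshell keeps the caller's working directory unchanged.
  # --quiet suppresses the per-file "OK" lines; failures are still printed.
  ( cd "$dir" && sha256sum --check --quiet checksums.sha256 )
}
```

Called as `verify_backup_dir backups/<timestamp>`, it returns non-zero if any listed file is missing or its hash differs.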
Run full scheduled-style backup flow (backup + verify + prune):
npm run ops:backup:job
Restore
Restore is destructive and requires explicit safety flags.
FORCE=1 RESET_DB=1 RESTORE_S3=1 \
bash scripts/ops/restore-prod.sh .env.production backups/<timestamp>
Flags:
- FORCE=1: required; otherwise restore exits
- RESET_DB=1: recommended; drops and recreates schema before import
- RESTORE_S3=1: restore object storage (default on)
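The safety-flag behavior can be pictured with a small guard like the one below. This is only a sketch of the pattern: the function names `require_force` and `restore_s3_enabled` are hypothetical, and the actual checks in restore-prod.sh may be written differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a destructive-operation guard.
# Not taken from scripts/ops/restore-prod.sh.
require_force() {
  # Refuse to continue unless the operator explicitly set FORCE=1.
  if [ "${FORCE:-0}" != "1" ]; then
    echo "Refusing to restore: set FORCE=1 to confirm a destructive restore." >&2
    return 1
  fi
}

restore_s3_enabled() {
  # Object storage restore is on unless RESTORE_S3 is set to something else.
  [ "${RESTORE_S3:-1}" = "1" ]
}
```

A restore script following this pattern would call `require_force || exit 1` before touching the database, so a bare invocation without flags fails fast.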
Restore drill (non-destructive)
Run drill against latest backup:
npm run ops:restore:drill
Run full isolated staging drill (temporary compose project + teardown):
npm run ops:drill:staging
Run drill against specific backup:
bash scripts/ops/restore-drill.sh .env.production backups/<timestamp>
Drill behavior:
- Imports DB dump into temporary drill database
- Runs sanity check (public table count)
- Optionally syncs backup objects into a temporary drill bucket
- Drops temporary drill DB and bucket by default
Post-restore checks
- API health:
curl -sS https://<PUBLIC_DOMAIN>/v1/health
- Auth health:
curl -sS https://<PUBLIC_DOMAIN>/api/auth/get-session
- Manual smoke:
- sign in
- open board/calendar/notes/mail
- download at least one attachment
Automated smoke script:
npm run ops:smoke
The smoke script checks:
- public homepage response
- /v1/health payload
- security response headers
- HTTP->HTTPS redirect behavior
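For ad hoc checks of response headers outside the smoke script, a helper along these lines may be handy. It is a sketch only: `has_header` is a hypothetical name, and which headers the real smoke script inspects is not stated here, so the header in the usage comment is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the ops:smoke script.
# Reads HTTP response headers on stdin and succeeds if the named
# header is present (case-insensitive match on the header name).
has_header() {
  grep -qi "^$1:"
}

# Example usage (header name is an assumption):
#   curl -sSI https://<PUBLIC_DOMAIN>/ | has_header strict-transport-security
```

Because the helper reads from stdin, it can also be run against saved header dumps when debugging a failed smoke run.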
Cadence recommendations
- Daily backup (off-peak)
- Weekly restore drill in staging
- Keep at least 14 daily restore points and 8 weekly restore points
Automation (systemd)
Template files:
- infra/systemd/productier-backup.service
- infra/systemd/productier-backup.timer
Example install on Linux host:
sudo cp infra/systemd/productier-backup.service /etc/systemd/system/
sudo cp infra/systemd/productier-backup.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now productier-backup.timer
sudo systemctl status productier-backup.timer
Run backup manually through systemd:
sudo systemctl start productier-backup.service
sudo journalctl -u productier-backup.service -n 200 --no-pager
Retention is controlled by BACKUP_KEEP_COUNT in productier-backup.service.
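To make the count-based retention concrete, here is a minimal pruning sketch. It assumes backup directory names sort chronologically (as UTC timestamps normally do) and that the keep count is passed in as a plain number; `prune_backups` is a hypothetical name, and the real prune step may apply different rules.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of count-based pruning; not the project's prune script.
# Assumes directory names under the backup root sort oldest-first.
prune_backups() {
  local root="$1" keep="$2"
  local -a dirs=()
  local d i
  # Glob results are sorted, so the oldest timestamps come first.
  for d in "$root"/*/; do
    if [ -d "$d" ]; then
      dirs+=("${d%/}")
    fi
  done
  # Delete the oldest entries until only `keep` remain.
  local excess=$(( ${#dirs[@]} - keep ))
  for (( i = 0; i < excess; i++ )); do
    rm -rf -- "${dirs[i]}"
  done
  return 0
}
```

For example, `prune_backups backups "${BACKUP_KEEP_COUNT}"` would leave only the newest entries; if fewer directories exist than the keep count, nothing is deleted.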
Alerting:
- OPS_ALERT_WEBHOOK_URL
- OPS_ALERT_WEBHOOK_BEARER_TOKEN
- OPS_NOTIFY_ON_SUCCESS
- OPS_ALERT_TIMEOUT_SECONDS
These variables can be set in /opt/productier/.env.production and are loaded by productier-backup.service.
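One plausible way a script could consume these variables is sketched below. The payload shape (`{"text": ...}`), the 10-second default timeout, and the function name `send_alert` are all assumptions for illustration; the project's actual alert format is not documented here, and OPS_NOTIFY_ON_SUCCESS is not modeled.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a webhook alert; payload and defaults are assumptions.
send_alert() {
  # The message must not contain double quotes or backslashes,
  # since this sketch does no JSON escaping.
  local message="$1"
  # No webhook configured: treat alerting as disabled.
  if [ -z "${OPS_ALERT_WEBHOOK_URL:-}" ]; then
    return 0
  fi
  local -a args=(
    --silent --show-error --fail
    --max-time "${OPS_ALERT_TIMEOUT_SECONDS:-10}"
    --header "Content-Type: application/json"
  )
  # Send the bearer token only when one is configured.
  if [ -n "${OPS_ALERT_WEBHOOK_BEARER_TOKEN:-}" ]; then
    args+=(--header "Authorization: Bearer ${OPS_ALERT_WEBHOOK_BEARER_TOKEN}")
  fi
  curl "${args[@]}" --data "{\"text\": \"${message}\"}" "$OPS_ALERT_WEBHOOK_URL"
}
```

With `--fail`, curl exits non-zero on an HTTP error status, so a caller can detect that the alert itself did not go through.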
Restore drill automation:
- infra/systemd/productier-restore-drill.service
- infra/systemd/productier-restore-drill.timer
Example install:
sudo cp infra/systemd/productier-restore-drill.service /etc/systemd/system/
sudo cp infra/systemd/productier-restore-drill.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now productier-restore-drill.timer
sudo systemctl status productier-restore-drill.timer