# Productier Disaster Recovery Runbook

## Scope

This runbook covers backup and restore of:

- PostgreSQL data (`postgres` service)
- Object storage data (`rustfs` bucket configured by `S3_BUCKET`)

It assumes the production compose stack and env setup in:

- `infra/docker-compose.prod.yml`
- `.env.production`

## Preconditions

1. Validate production env:

```bash
npm run check:prod-env
```

2. Ensure production services are running:

```bash
npm run ops:deploy
```

If deployment is already running and you only want readiness validation:

```bash
npm run ops:preflight
```

## Backup

Create a timestamped backup:

```bash
npm run ops:backup
```

Output directory format: `backups//`

Expected files:

- `postgres.sql.gz`
- `s3/` (synced object data)
- `checksums.sha256`
- `metadata.json`

Verify backup:

```bash
bash scripts/ops/verify-backup.sh backups/
```

Run full scheduled-style backup flow (backup + verify + prune):

```bash
npm run ops:backup:job
```

## Restore

Restore is destructive and requires explicit safety flags.

```bash
FORCE=1 RESET_DB=1 RESTORE_S3=1 \
  bash scripts/ops/restore-prod.sh .env.production backups/
```

Flags:

- `FORCE=1`: required; otherwise restore exits
- `RESET_DB=1`: recommended; drops and recreates schema before import
- `RESTORE_S3=1`: restore object storage (default on)

## Restore drill (non-destructive)

Run drill against latest backup:

```bash
npm run ops:restore:drill
```

Run full isolated staging drill (temporary compose project + teardown):

```bash
npm run ops:drill:staging
```

Run drill against specific backup:

```bash
bash scripts/ops/restore-drill.sh .env.production backups/
```

Drill behavior:

- Imports DB dump into temporary drill database
- Runs sanity check (`public` table count)
- Optionally syncs backup objects into a temporary drill bucket
- Drops temporary drill DB and bucket by default

## Post-restore checks

1. API health:

```bash
curl -sS https:///v1/health
```

2. Auth health:

```bash
curl -sS https:///api/auth/get-session
```

3. Manual smoke:

- sign in
- open board/calendar/notes/mail
- download at least one attachment

Automated smoke script:

```bash
npm run ops:smoke
```

The smoke script checks:

- public homepage response
- `/v1/health` payload
- security response headers
- HTTP->HTTPS redirect behavior

## Cadence recommendations

- Daily backup (off-peak)
- Weekly restore drill in staging
- Keep at least 14 daily restore points and 8 weekly restore points

## Automation (systemd)

Template files:

- `infra/systemd/productier-backup.service`
- `infra/systemd/productier-backup.timer`

Example install on Linux host:

```bash
sudo cp infra/systemd/productier-backup.service /etc/systemd/system/
sudo cp infra/systemd/productier-backup.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now productier-backup.timer
sudo systemctl status productier-backup.timer
```

Run backup manually through systemd:

```bash
sudo systemctl start productier-backup.service
sudo journalctl -u productier-backup.service -n 200 --no-pager
```

Retention is controlled by `BACKUP_KEEP_COUNT` in `productier-backup.service`.

Alerting:

- `OPS_ALERT_WEBHOOK_URL`
- `OPS_ALERT_WEBHOOK_BEARER_TOKEN`
- `OPS_NOTIFY_ON_SUCCESS`
- `OPS_ALERT_TIMEOUT_SECONDS`

These variables can be set in `/opt/productier/.env.production` and are loaded by `productier-backup.service`.

Restore drill automation:

- `infra/systemd/productier-restore-drill.service`
- `infra/systemd/productier-restore-drill.timer`

Example install:

```bash
sudo cp infra/systemd/productier-restore-drill.service /etc/systemd/system/
sudo cp infra/systemd/productier-restore-drill.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now productier-restore-drill.timer
sudo systemctl status productier-restore-drill.timer
```
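
Before changing `BACKUP_KEEP_COUNT`, it can help to preview which restore points a given keep count would leave outside the retention window. The sketch below is a dry-run listing only and deletes nothing. It is not the prune logic used by `npm run ops:backup:job`, which this runbook does not describe; the function name `list_prunable_backups` is hypothetical, and the sketch assumes each backup is a directory directly under `backups/` whose name sorts chronologically.

```bash
#!/usr/bin/env bash
# Hypothetical helper: not part of the repository's scripts/ops tooling.
# Prints the backup directories that fall outside the newest KEEP entries.
# Assumptions: each backup is a directory directly under the given root, and
# directory names sort chronologically (for example, timestamp-style names).
list_prunable_backups() {
  local backup_root="$1"
  local keep="${2:-14}"   # 14 mirrors the daily cadence recommendation
  local -a dirs=()
  local d

  # Collect immediate subdirectories; the glob expands in sorted order.
  for d in "$backup_root"/*/; do
    [ -d "$d" ] || continue   # skips the literal pattern when nothing matches
    dirs+=("${d%/}")
  done

  # Everything before the last KEEP entries is a prune candidate.
  local total="${#dirs[@]}"
  local cut=$(( total - keep ))
  local i
  for (( i = 0; i < cut; i++ )); do
    printf '%s\n' "${dirs[$i]}"
  done
}
```

For example, `list_prunable_backups backups "${BACKUP_KEEP_COUNT:-14}"` prints one candidate directory per line, oldest first, and prints nothing when the number of backups does not exceed the keep count.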