first commit

2026-07-29 15:03:49 +00:00 · 2026-04-10 12:04:09 +02:00
commit 3cb40adb23
203 changed files with 40226 additions and 0 deletions
@@ -0,0 +1,195 @@
+# Productier Disaster Recovery Runbook
+
+## Scope
+
+This runbook covers backup and restore of:
+
+- PostgreSQL data (`postgres` service)
+- Object storage data (`rustfs` bucket configured by `S3_BUCKET`)
+
+It assumes the production compose stack and env setup in:
+
+- `infra/docker-compose.prod.yml`
+- `.env.production`
+
+## Preconditions
+
+1. Validate production env:
+
+```bash
+npm run check:prod-env
+```
+
+2. Ensure production services are running:
+
+```bash
+npm run ops:deploy
+```
+
+If the deployment is already running and you only need to validate readiness:
+
+```bash
+npm run ops:preflight
+```
+
+## Backup
+
+Create a timestamped backup:
+
+```bash
+npm run ops:backup
+```
+
+Output directory format:
+
+`backups/<UTC timestamp>/`
+
+Expected files:
+
+- `postgres.sql.gz`
+- `s3/` (synced object data)
+- `checksums.sha256`
+- `metadata.json`
+
+Verify backup:
+
+```bash
+bash scripts/ops/verify-backup.sh backups/<timestamp>
+```
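+
+As a manual cross-check, the checksum file can be validated directly. This is only a sketch: it assumes `checksums.sha256` uses the standard `sha256sum` format with paths relative to the backup directory, which this runbook does not confirm. `verify_checksums` is a hypothetical helper, not part of `scripts/ops/`.
+
+```bash
+# Hypothetical helper; assumes checksums.sha256 is in sha256sum format
+# with paths relative to the backup directory.
+verify_checksums() {
+  local backup_dir="$1"
+  if [ ! -f "$backup_dir/checksums.sha256" ]; then
+    echo "missing checksums.sha256 in $backup_dir" >&2
+    return 1
+  fi
+  # Run inside the backup directory so relative paths resolve.
+  ( cd "$backup_dir" && sha256sum --check --quiet checksums.sha256 )
+}
+```
+
+Usage: `verify_checksums backups/<timestamp>` returns non-zero if any listed file is missing or modified.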
+
+Run the full backup flow used by the scheduled job (backup, verify, prune):
+
+```bash
+npm run ops:backup:job
+```
+
+## Restore
+
+Restore is destructive: it overwrites live data, so the script requires explicit safety flags.
+
+```bash
+FORCE=1 RESET_DB=1 RESTORE_S3=1 \
+bash scripts/ops/restore-prod.sh .env.production backups/<timestamp>
+```
+
+Flags:
+
+- `FORCE=1`: required; without it the restore script exits
+- `RESET_DB=1`: recommended; drops and recreates the schema before import
+- `RESTORE_S3=1`: restores object storage (enabled by default)
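+
+The `FORCE` requirement is a common guard pattern for destructive scripts. A minimal sketch of such a guard is shown here; `require_force` is a hypothetical name, and the actual check inside `restore-prod.sh` may differ.
+
+```bash
+# Hypothetical guard; the real restore-prod.sh may implement this differently.
+require_force() {
+  if [ "${FORCE:-0}" != "1" ]; then
+    echo "Refusing to restore: set FORCE=1 to confirm a destructive restore." >&2
+    return 1
+  fi
+}
+```
+
+Treating an unset variable as `0` means a restore run without any flags fails safely instead of proceeding.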
+
+## Restore drill (non-destructive)
+
+Run drill against latest backup:
+
+```bash
+npm run ops:restore:drill
+```
+
+Run full isolated staging drill (temporary compose project + teardown):
+
+```bash
+npm run ops:drill:staging
+```
+
+Run drill against specific backup:
+
+```bash
+bash scripts/ops/restore-drill.sh .env.production backups/<timestamp>
+```
+
+Drill behavior:
+
+- Imports DB dump into temporary drill database
+- Runs sanity check (`public` table count)
+- Optionally syncs backup objects into a temporary drill bucket
+- Drops temporary drill DB and bucket by default
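+
+The `public` table count check could look roughly like the sketch below. The query, the "at least one table" threshold, and the names `TABLE_COUNT_SQL`, `check_table_count`, and `DRILL_DATABASE_URL` are assumptions for illustration; the drill script may use a different query or threshold.
+
+```bash
+# Hypothetical query; counts ordinary tables in the public schema.
+TABLE_COUNT_SQL="SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public' AND table_type = 'BASE TABLE';"
+
+# Pass when the count is a positive integer (assumed threshold: at least one table).
+check_table_count() {
+  local count="$1"
+  case "$count" in
+    ''|*[!0-9]*) echo "unexpected count: '$count'" >&2; return 1 ;;
+  esac
+  [ "$count" -gt 0 ]
+}
+
+# Usage, with DRILL_DATABASE_URL as a placeholder connection string:
+#   check_table_count "$(psql "$DRILL_DATABASE_URL" -Atc "$TABLE_COUNT_SQL")"
+```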
+
+## Post-restore checks
+
+1. API health:
+
+```bash
+curl -sS https://<PUBLIC_DOMAIN>/v1/health
+```
+
+2. Auth health:
+
+```bash
+curl -sS https://<PUBLIC_DOMAIN>/api/auth/get-session
+```
+
+3. Manual smoke:
+
+- sign in
+- open board/calendar/notes/mail
+- download at least one attachment
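+
+The two `curl` checks above print the response body but do not fail on an error status. A small wrapper can turn them into pass/fail checks. This sketch assumes a healthy endpoint answers with HTTP 200, which the runbook does not state; `check_status` is a hypothetical helper.
+
+```bash
+# Hypothetical helper; assumes a healthy endpoint returns HTTP 200.
+check_status() {
+  local url="$1" code
+  # Discard the body and capture only the HTTP status code.
+  code=$(curl -sS -o /dev/null -w '%{http_code}' "$url") || return 1
+  [ "$code" = "200" ]
+}
+```
+
+Usage: `check_status "https://<PUBLIC_DOMAIN>/v1/health" || echo "API health check failed"`.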
+
+Automated smoke script:
+
+```bash
+npm run ops:smoke
+```
+
+The smoke script checks:
+
+- public homepage response
+- `/v1/health` payload
+- security response headers
+- HTTP->HTTPS redirect behavior
+
+## Cadence recommendations
+
+- Daily backup (off-peak)
+- Weekly restore drill in staging
+- Keep at least 14 daily restore points and 8 weekly restore points
+
+## Automation (systemd)
+
+Template files:
+
+- `infra/systemd/productier-backup.service`
+- `infra/systemd/productier-backup.timer`
+
+Example install on Linux host:
+
+```bash
+sudo cp infra/systemd/productier-backup.service /etc/systemd/system/
+sudo cp infra/systemd/productier-backup.timer /etc/systemd/system/
+sudo systemctl daemon-reload
+sudo systemctl enable --now productier-backup.timer
+sudo systemctl status productier-backup.timer
+```
+
+Run backup manually through systemd:
+
+```bash
+sudo systemctl start productier-backup.service
+sudo journalctl -u productier-backup.service -n 200 --no-pager
+```
+
+Retention is controlled by `BACKUP_KEEP_COUNT` in `productier-backup.service`.
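+
+Count-based pruning of this kind can be sketched as follows. The sketch assumes `BACKUP_KEEP_COUNT` means "number of most recent backups to keep" and that backup directory names sort chronologically, as UTC timestamps normally do. `prune_backups` is a hypothetical helper, not the project's prune script.
+
+```bash
+# Hypothetical helper; assumes directory names sort in chronological order.
+prune_backups() {
+  local root="$1" keep="$2"
+  # List backup directories newest first, skip the first $keep, delete the rest.
+  find "$root" -mindepth 1 -maxdepth 1 -type d | sort -r | tail -n +"$((keep + 1))" |
+    while IFS= read -r dir; do
+      rm -rf -- "$dir"
+    done
+}
+```
+
+Usage: `prune_backups backups "${BACKUP_KEEP_COUNT:-14}"`, where the fallback of 14 is only an example that matches the daily retention recommendation above.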
+
+Alerting:
+
+- `OPS_ALERT_WEBHOOK_URL`
+- `OPS_ALERT_WEBHOOK_BEARER_TOKEN`
+- `OPS_NOTIFY_ON_SUCCESS`
+- `OPS_ALERT_TIMEOUT_SECONDS`
+
+These variables can be set in `/opt/productier/.env.production` and are loaded by `productier-backup.service`.
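+
+To illustrate how these variables might fit together, here is a minimal sketch of a webhook call. The variable semantics are inferred from their names, and the JSON payload shape, the 10-second default timeout, and the `send_alert` helper are assumptions; the actual alert script may behave differently.
+
+```bash
+# Hypothetical helper; payload shape and default timeout are assumptions.
+send_alert() {
+  local message="$1"
+  # Treat an unset webhook URL as "alerting disabled".
+  [ -n "${OPS_ALERT_WEBHOOK_URL:-}" ] || return 0
+  local args=(--silent --show-error --fail
+              --max-time "${OPS_ALERT_TIMEOUT_SECONDS:-10}"
+              -H "Content-Type: application/json")
+  if [ -n "${OPS_ALERT_WEBHOOK_BEARER_TOKEN:-}" ]; then
+    args+=(-H "Authorization: Bearer ${OPS_ALERT_WEBHOOK_BEARER_TOKEN}")
+  fi
+  # Note: the message is not JSON-escaped in this sketch.
+  curl "${args[@]}" --data "{\"text\":\"${message}\"}" "$OPS_ALERT_WEBHOOK_URL"
+}
+```
+
+Usage: `send_alert "productier backup failed"`. Under the same assumptions, a caller would invoke it on success only when `OPS_NOTIFY_ON_SUCCESS` is enabled.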
+
+Restore drill automation:
+
+- `infra/systemd/productier-restore-drill.service`
+- `infra/systemd/productier-restore-drill.timer`
+
+Example install:
+
+```bash
+sudo cp infra/systemd/productier-restore-drill.service /etc/systemd/system/
+sudo cp infra/systemd/productier-restore-drill.timer /etc/systemd/system/
+sudo systemctl daemon-reload
+sudo systemctl enable --now productier-restore-drill.timer
+sudo systemctl status productier-restore-drill.timer
+```