first commit

Tomas Dvorak
2026-04-10 12:04:09 +02:00
commit 3cb40adb23
203 changed files with 40226 additions and 0 deletions
# Productier Disaster Recovery Runbook
## Scope
This runbook covers backup and restore of:
- PostgreSQL data (`postgres` service)
- Object storage data (`rustfs` bucket configured by `S3_BUCKET`)
It assumes the production compose stack and env setup in:
- `infra/docker-compose.prod.yml`
- `.env.production`
## Preconditions
1. Validate production env:
```bash
npm run check:prod-env
```
2. Ensure production services are running:
```bash
npm run ops:deploy
```
If the deployment is already running and you only need to validate readiness:
```bash
npm run ops:preflight
```
## Backup
Create a timestamped backup:
```bash
npm run ops:backup
```
Output directory format:
`backups/<UTC timestamp>/`
Expected files:
- `postgres.sql.gz`
- `s3/` (synced object data)
- `checksums.sha256`
- `metadata.json`
Verify backup:
```bash
bash scripts/ops/verify-backup.sh backups/<timestamp>
```
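If the verify script is unavailable, a backup directory can be inspected by hand. The following is a minimal sketch, not the contents of `scripts/ops/verify-backup.sh`: the function name `check_backup_dir` is hypothetical, and it assumes `checksums.sha256` is in `sha256sum` format with paths relative to the backup directory.

```bash
#!/usr/bin/env bash
# Hypothetical manual check of a backup directory (assumption: this is NOT
# how scripts/ops/verify-backup.sh is implemented).
check_backup_dir() {
  local dir="$1" entry missing=0

  # Confirm every expected entry from the runbook is present.
  for entry in postgres.sql.gz s3 checksums.sha256 metadata.json; do
    if [ ! -e "$dir/$entry" ]; then
      echo "missing: $entry"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] || return 1

  # gzip -t tests the archive's integrity without extracting it.
  gzip -t "$dir/postgres.sql.gz" || return 1

  # Assumption: checksum paths are relative to the backup directory.
  (cd "$dir" && sha256sum --quiet -c checksums.sha256)
}
```

Usage: `check_backup_dir backups/<timestamp>` returns a non-zero status if an entry is missing, the dump is corrupt, or a checksum does not match.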
Run full scheduled-style backup flow (backup + verify + prune):
```bash
npm run ops:backup:job
```
## Restore
Restore is destructive and requires explicit safety flags.
```bash
FORCE=1 RESET_DB=1 RESTORE_S3=1 \
bash scripts/ops/restore-prod.sh .env.production backups/<timestamp>
```
Flags:
- `FORCE=1`: required; the restore script exits if it is not set
- `RESET_DB=1`: recommended; drops and recreates the schema before the import
- `RESTORE_S3=1`: restores object storage (enabled by default)
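To illustrate how the flags above interact, here is a sketch of a guard that only prints the restore plan. It is an assumption about the general shape of such a check, not the actual logic of `scripts/ops/restore-prod.sh`; the function name `restore_plan` and the messages are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run guard: prints what a restore would do, changes nothing.
restore_plan() {
  # FORCE=1 is mandatory; anything else aborts.
  if [ "${FORCE:-0}" != "1" ]; then
    echo "Refusing to restore: set FORCE=1 to confirm." >&2
    return 1
  fi

  if [ "${RESET_DB:-0}" = "1" ]; then
    echo "plan: reset schema, then import postgres.sql.gz"
  else
    echo "plan: import postgres.sql.gz into the existing schema"
  fi

  # RESTORE_S3 defaults to on, matching the flag list above.
  if [ "${RESTORE_S3:-1}" = "1" ]; then
    echo "plan: restore object storage from s3/"
  fi
}
```

Running `FORCE=1 RESET_DB=1 restore_plan` before the real restore is one way to confirm which destructive steps are enabled in the current shell environment.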
## Restore drill (non-destructive)
Run drill against latest backup:
```bash
npm run ops:restore:drill
```
Run full isolated staging drill (temporary compose project + teardown):
```bash
npm run ops:drill:staging
```
Run drill against specific backup:
```bash
bash scripts/ops/restore-drill.sh .env.production backups/<timestamp>
```
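When a specific backup must be passed by path, the newest directory can be resolved with a small helper. This is a sketch under one assumption: the timestamped directory names sort chronologically when sorted as plain text. The function name `latest_backup` is hypothetical and may not match how `ops:restore:drill` selects its backup.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the newest backup directory under a root.
# Assumption: directory names sort chronologically as plain text.
latest_backup() {
  local root="${1:-backups}" latest

  # List immediate subdirectories, sort by name, keep the last one.
  latest=$(find "$root" -mindepth 1 -maxdepth 1 -type d | LC_ALL=C sort | tail -n 1)

  # Fail when no backup directory exists.
  [ -n "$latest" ] || return 1
  printf '%s\n' "$latest"
}
```

Usage: `bash scripts/ops/restore-drill.sh .env.production "$(latest_backup backups)"`.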
Drill behavior:
- Imports DB dump into temporary drill database
- Runs sanity check (`public` table count)
- Optionally syncs backup objects into a temporary drill bucket
- Drops temporary drill DB and bucket by default
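The `public` table count check can also be reproduced by hand against any restored database. The sketch below is an assumption about what such a check might look like, not the query used by `restore-drill.sh`; both function names are hypothetical, and the database name is whatever the caller supplies.

```bash
#!/usr/bin/env bash
# Hypothetical helper: counts base tables in the public schema.
count_public_tables() {
  local db="$1"
  # -A: unaligned output, -t: rows only, so stdout is just the number.
  psql -d "$db" -At -c \
    "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public' AND table_type = 'BASE TABLE';"
}

# Succeeds only when the given count is a positive integer.
assert_tables_present() {
  local count="$1"
  case "$count" in
    '' | *[!0-9]*)
      echo "unexpected count: '$count'" >&2
      return 2
      ;;
  esac
  [ "$count" -gt 0 ]
}
```

Usage: `assert_tables_present "$(count_public_tables <drill_db>)"`, where `<drill_db>` is a placeholder for the temporary database name.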
## Post-restore checks
1. API health:
```bash
curl -sS https://<PUBLIC_DOMAIN>/v1/health
```
2. Auth health:
```bash
curl -sS https://<PUBLIC_DOMAIN>/api/auth/get-session
```
3. Manual smoke:
- sign in
- open board/calendar/notes/mail
- download at least one attachment
Automated smoke script:
```bash
npm run ops:smoke
```
The smoke script checks:
- public homepage response
- `/v1/health` payload
- security response headers
- HTTP->HTTPS redirect behavior
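As an illustration of the last item, an HTTP to HTTPS redirect can be checked with `curl` write-out variables. This is a sketch and not the implementation behind `npm run ops:smoke`; the function names are hypothetical, and accepting the status codes 301, 302, 307 and 308 is an assumption.

```bash
#!/usr/bin/env bash
# Succeeds only for a redirect status whose target uses https://.
is_https_redirect() {
  local status="$1" target="${2:-}"
  case "$status" in
    301 | 302 | 307 | 308) ;;
    *) return 1 ;;
  esac
  case "$target" in
    https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: requests the plain-HTTP homepage without following
# redirects and inspects where the server wants to send the client.
check_redirect() {
  local domain="$1" result status target
  # %{http_code} and %{redirect_url} are curl --write-out variables.
  result=$(curl -sS -o /dev/null -w '%{http_code} %{redirect_url}' "http://$domain/") || return 1
  status=${result%% *}
  target=${result#* }
  is_https_redirect "$status" "$target"
}
```

Usage: `check_redirect <PUBLIC_DOMAIN>` returns a non-zero status if the site answers over plain HTTP or redirects elsewhere.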
## Cadence recommendations
- Daily backup (off-peak)
- Weekly restore drill in staging
- Keep at least 14 daily restore points and 8 weekly restore points
## Automation (systemd)
Template files:
- `infra/systemd/productier-backup.service`
- `infra/systemd/productier-backup.timer`
Example install on Linux host:
```bash
sudo cp infra/systemd/productier-backup.service /etc/systemd/system/
sudo cp infra/systemd/productier-backup.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now productier-backup.timer
sudo systemctl status productier-backup.timer
```
Run backup manually through systemd:
```bash
sudo systemctl start productier-backup.service
sudo journalctl -u productier-backup.service -n 200 --no-pager
```
Retention is controlled by `BACKUP_KEEP_COUNT` in `productier-backup.service`.
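A count-based retention policy of this kind can be sketched as follows. This is an assumption about how pruning might work, not the project's actual prune step: the function name `prune_backups` is hypothetical, and it assumes backup directory names sort chronologically as plain text.

```bash
#!/usr/bin/env bash
# Hypothetical prune: keeps the newest <keep> backup directories, deletes the rest.
prune_backups() {
  local root="$1" keep="$2" dir

  # Reject empty, non-numeric, or zero values so a bad setting cannot wipe everything.
  case "$keep" in
    '' | *[!0-9]*) echo "keep count must be a positive integer" >&2; return 2 ;;
  esac
  [ "$keep" -ge 1 ] || { echo "keep count must be a positive integer" >&2; return 2; }

  # Sort newest first, skip the first <keep> entries, delete what remains.
  find "$root" -mindepth 1 -maxdepth 1 -type d | LC_ALL=C sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r dir; do
      echo "pruning $dir"
      rm -rf -- "$dir"
    done
}
```

Usage: `prune_backups backups "$BACKUP_KEEP_COUNT"`. Test it on a copy of the backup directory first, since deletion is immediate.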
Alerting:
- `OPS_ALERT_WEBHOOK_URL`
- `OPS_ALERT_WEBHOOK_BEARER_TOKEN`
- `OPS_NOTIFY_ON_SUCCESS`
- `OPS_ALERT_TIMEOUT_SECONDS`
These variables can be set in `/opt/productier/.env.production` and are loaded by `productier-backup.service`.
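The sketch below shows one way these four variables could drive a webhook notification. It is illustrative only: the function names, the JSON payload shape, the `1` value for `OPS_NOTIFY_ON_SUCCESS`, and the 10-second default timeout are all assumptions, not the behavior of the project's backup job.

```bash
#!/usr/bin/env bash
# Hypothetical: decides whether an alert should be sent for a job status.
should_notify() {
  local status="$1"
  # No webhook configured means alerting is disabled.
  [ -n "${OPS_ALERT_WEBHOOK_URL:-}" ] || return 1
  # Assumption: success notifications are opt-in via OPS_NOTIFY_ON_SUCCESS=1.
  if [ "$status" = "success" ] && [ "${OPS_NOTIFY_ON_SUCCESS:-0}" != "1" ]; then
    return 1
  fi
  return 0
}

# Hypothetical: posts a small JSON payload to the configured webhook.
send_ops_alert() {
  local status="$1" message="$2"
  should_notify "$status" || return 0

  # Build curl arguments; the bearer token header is added only when set.
  set -- -fsS --max-time "${OPS_ALERT_TIMEOUT_SECONDS:-10}" \
    -H 'Content-Type: application/json'
  if [ -n "${OPS_ALERT_WEBHOOK_BEARER_TOKEN:-}" ]; then
    set -- "$@" -H "Authorization: Bearer ${OPS_ALERT_WEBHOOK_BEARER_TOKEN}"
  fi

  # Note: the message is not JSON-escaped; keep it free of quotes and backslashes.
  curl "$@" -d "{\"status\":\"${status}\",\"message\":\"${message}\"}" \
    "$OPS_ALERT_WEBHOOK_URL"
}
```

Usage: `send_ops_alert failure "backup verification failed"`. With no webhook URL set, the call is a no-op and returns success.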
Restore drill automation:
- `infra/systemd/productier-restore-drill.service`
- `infra/systemd/productier-restore-drill.timer`
Example install:
```bash
sudo cp infra/systemd/productier-restore-drill.service /etc/systemd/system/
sudo cp infra/systemd/productier-restore-drill.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now productier-restore-drill.timer
sudo systemctl status productier-restore-drill.timer
```