mirror of
https://github.com/Dvorinka/Productier.git
synced 2026-06-03 20:13:01 +00:00
# Productier Disaster Recovery Runbook

## Scope

This runbook covers backup and restore of:

- PostgreSQL data (`postgres` service)
- Object storage data (`rustfs` bucket configured by `S3_BUCKET`)

It assumes the production compose stack and env setup in:

- `infra/docker-compose.prod.yml`
- `.env.production`

## Preconditions

1. Validate production env:

```bash
npm run check:prod-env
```

2. Ensure production services are running:

```bash
npm run ops:deploy
```

If deployment is already running and you only want readiness validation:

```bash
npm run ops:preflight
```

## Backup

Create a timestamped backup:

```bash
npm run ops:backup
```

Output directory format:

`backups/<UTC timestamp>/`

Expected files:

- `postgres.sql.gz`
- `s3/` (synced object data)
- `checksums.sha256`
- `metadata.json`

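Because each backup lands in its own timestamped directory, the most recent one can be located by name alone. The helper below is a minimal sketch, not part of the repository: `latest_backup` is a hypothetical name, and it assumes the timestamp format sorts lexicographically (for example `20260603T201301Z`), which the runbook does not state.

```bash
# Hypothetical helper: print the newest backup directory under a root.
# Assumes directory names are UTC timestamps that sort lexicographically;
# adjust if the real naming scheme differs.
latest_backup() {
  local root="${1:-backups}"
  local newest=""
  local dir
  for dir in "$root"/*/; do
    # An unmatched glob stays literal, so skip anything that is not a directory.
    [ -d "$dir" ] || continue
    # Glob expansion is sorted, so the last directory seen is the newest.
    newest="${dir%/}"
  done
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}
```

For example, `latest_backup backups` prints the path to pass to the verify or drill scripts, and returns a non-zero status when no backup exists yet.
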
Verify backup:

```bash
bash scripts/ops/verify-backup.sh backups/<timestamp>
```

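If the verify script is unavailable, a backup can also be spot-checked by hand. The following is a rough sketch under two assumptions that the runbook does not confirm: that `checksums.sha256` is in `sha256sum` format, and that its paths are relative to the backup directory. `verify_backup_dir` is a hypothetical name, and the real `verify-backup.sh` may check more than this.

```bash
# Hypothetical manual check of one backup directory.
# Assumes checksums.sha256 is in `sha256sum` format with paths relative
# to the backup directory; the real verify-backup.sh may do more.
verify_backup_dir() {
  local dir="$1"
  [ -f "$dir/checksums.sha256" ] || { echo "missing checksums.sha256" >&2; return 1; }
  [ -f "$dir/postgres.sql.gz" ] || { echo "missing postgres.sql.gz" >&2; return 1; }
  # Run in a subshell so the caller's working directory is unchanged.
  (
    cd "$dir" || exit 1
    # Recompute and compare every recorded checksum.
    sha256sum -c checksums.sha256 >/dev/null || exit 1
    # Confirm the dump is a readable gzip stream without extracting it.
    gzip -t postgres.sql.gz
  )
}
```

A call such as `verify_backup_dir backups/<timestamp>` returns a non-zero status if any file was modified or the dump is truncated.
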
Run full scheduled-style backup flow (backup + verify + prune):

```bash
npm run ops:backup:job
```

## Restore

Restore is destructive and requires explicit safety flags.

```bash
FORCE=1 RESET_DB=1 RESTORE_S3=1 \
bash scripts/ops/restore-prod.sh .env.production backups/<timestamp>
```

Flags:

- `FORCE=1`: required; the restore script exits if it is not set
- `RESET_DB=1`: recommended; drops and recreates the schema before import
- `RESTORE_S3=1`: restores object storage (on by default)

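To illustrate how an environment-variable gate like this typically works, here is a minimal sketch. It is not taken from `restore-prod.sh`: the function names and messages are hypothetical, and `RESET_DB` is assumed to default to off because the runbook only calls it "recommended".

```bash
# Hypothetical guard illustrating the FORCE gate; the real
# restore-prod.sh may word its messages and defaults differently.
require_force() {
  if [ "${FORCE:-0}" != "1" ]; then
    echo "Refusing to restore: set FORCE=1 to confirm a destructive restore." >&2
    return 1
  fi
}

# Print which steps the current flags would enable, one per line.
# RESET_DB is assumed to default to off; RESTORE_S3 defaults to on.
describe_restore_plan() {
  require_force || return 1
  if [ "${RESET_DB:-0}" = "1" ]; then
    echo "reset-db"
  fi
  echo "import-db"
  if [ "${RESTORE_S3:-1}" = "1" ]; then
    echo "restore-s3"
  fi
  return 0
}
```

Printing the plan before acting is one way to confirm, in a dry run, exactly which destructive steps a given combination of flags would trigger.
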
## Restore drill (non-destructive)

Run drill against latest backup:

```bash
npm run ops:restore:drill
```

Run full isolated staging drill (temporary compose project + teardown):

```bash
npm run ops:drill:staging
```

Run drill against specific backup:

```bash
bash scripts/ops/restore-drill.sh .env.production backups/<timestamp>
```

Drill behavior:

- Imports DB dump into temporary drill database
- Runs sanity check (`public` table count)
- Optionally syncs backup objects into a temporary drill bucket
- Drops temporary drill DB and bucket by default

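The `public` table count check can be pictured as a single catalog query followed by a threshold test. The sketch below is an assumption about what such a check might look like, not the drill script's actual query; `drill_sanity_sql`, `drill_count_ok`, and the `DRILL_DB` variable used in the usage note are hypothetical names.

```bash
# Hypothetical sanity query: count base tables in the `public` schema.
# The real drill script may use a different query or threshold.
drill_sanity_sql() {
  cat <<'SQL'
SELECT count(*)
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_type = 'BASE TABLE';
SQL
}

# Succeed only when the reported table count is a positive integer.
drill_count_ok() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -gt 0 ]
}
```

Against a restored drill database this could be run as `count="$(drill_sanity_sql | psql -At -d "$DRILL_DB")" && drill_count_ok "$count"`, which fails if the import produced no tables.
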
## Post-restore checks

1. API health:

```bash
curl -sS https://<PUBLIC_DOMAIN>/v1/health
```

2. Auth health:

```bash
curl -sS https://<PUBLIC_DOMAIN>/api/auth/get-session
```

3. Manual smoke:

- sign in
- open board/calendar/notes/mail
- download at least one attachment

Automated smoke script:

```bash
npm run ops:smoke
```

The smoke script checks:

- public homepage response
- `/v1/health` payload
- security response headers
- HTTP->HTTPS redirect behavior

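The redirect check in the list above can be reproduced by hand with `curl`. This is a sketch of one way to do it, not the smoke script's implementation: both function names are hypothetical, and the accepted status codes (301, 302, 307, 308) are an assumption about what counts as a valid redirect.

```bash
# Hypothetical helper: decide whether a status code and redirect target
# describe an HTTP->HTTPS redirect. Not taken from the real smoke script.
is_https_redirect() {
  local status="$1"
  local location="$2"
  case "$status" in
    301|302|307|308) ;;
    *) return 1 ;;
  esac
  case "$location" in
    https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Probe a plain-HTTP URL and apply the check above.
# curl's %{http_code} and %{redirect_url} write-out variables report the
# status and the URL a redirect would lead to, without following it.
check_http_redirect() {
  local url="$1"
  local result status location
  result="$(curl -sS -o /dev/null -w '%{http_code} %{redirect_url}' "$url")" || return 1
  status="${result%% *}"
  location="${result#* }"
  is_https_redirect "$status" "$location"
}
```

Running `check_http_redirect "http://<PUBLIC_DOMAIN>/"` returns success only when the plain-HTTP endpoint answers with a redirect to an `https://` URL.
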
## Cadence recommendations

- Daily backup (off-peak)
- Weekly restore drill in staging
- Keep at least 14 daily restore points and 8 weekly restore points

## Automation (systemd)

Template files:

- `infra/systemd/productier-backup.service`
- `infra/systemd/productier-backup.timer`

Example install on Linux host:

```bash
sudo cp infra/systemd/productier-backup.service /etc/systemd/system/
sudo cp infra/systemd/productier-backup.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now productier-backup.timer
sudo systemctl status productier-backup.timer
```

Run backup manually through systemd:

```bash
sudo systemctl start productier-backup.service
sudo journalctl -u productier-backup.service -n 200 --no-pager
```

Retention is controlled by `BACKUP_KEEP_COUNT` in `productier-backup.service`.

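Count-based retention of this kind usually means keeping the newest N backup directories and deleting the rest. The sketch below shows that idea only; it is not the repository's prune logic. `prune_backups` is a hypothetical name, the fallback of 14 is borrowed from the cadence recommendation rather than from any documented default, and directory names are assumed to sort lexicographically by age.

```bash
# Hypothetical pruning helper: keep the newest N backup directories.
# Assumes timestamped names sort lexicographically; the real job may differ.
prune_backups() {
  local root="$1"
  local keep="${2:-${BACKUP_KEEP_COUNT:-14}}"
  local total=0
  local index=0
  local dir
  case "$keep" in
    ''|*[!0-9]*) echo "keep count must be a non-negative integer" >&2; return 1 ;;
  esac
  # First pass: count existing backup directories.
  for dir in "$root"/*/; do
    if [ -d "$dir" ]; then
      total=$((total + 1))
    fi
  done
  # Second pass: directories are visited oldest first; delete until
  # only `keep` of them remain.
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    index=$((index + 1))
    [ $((total - index)) -ge "$keep" ] || break
    rm -rf -- "${dir%/}"
  done
  return 0
}
```

For example, `prune_backups backups 14` leaves the 14 most recent directories in place and removes anything older.
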
Alerting:

- `OPS_ALERT_WEBHOOK_URL`
- `OPS_ALERT_WEBHOOK_BEARER_TOKEN`
- `OPS_NOTIFY_ON_SUCCESS`
- `OPS_ALERT_TIMEOUT_SECONDS`

These variables can be set in `/opt/productier/.env.production` and are loaded by `productier-backup.service`.

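As a rough picture of how these four variables might fit together, the sketch below posts a small JSON body to the webhook with `curl`. Everything beyond the variable names is an assumption: the payload shape, the `1` value that enables success notifications, the 10-second fallback timeout, and the function names are all hypothetical, and the real job may behave differently.

```bash
# Hypothetical payload builder. The JSON shape ({"status","message"}) is an
# assumption; match it to whatever the webhook receiver expects.
# Only single-line messages are handled: backslashes and double quotes
# are escaped, newlines are not.
ops_alert_payload() {
  local status="$1"
  local message="$2"
  message="$(printf '%s' "$message" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')"
  printf '{"status":"%s","message":"%s"}' "$status" "$message"
}

# Hypothetical sender built on the OPS_* variables listed above.
send_ops_alert() {
  local status="$1"
  local message="$2"
  # No webhook configured: alerting is treated as disabled.
  [ -n "${OPS_ALERT_WEBHOOK_URL:-}" ] || return 0
  # Skip success notifications unless explicitly enabled (assumed value: 1).
  if [ "$status" = "success" ] && [ "${OPS_NOTIFY_ON_SUCCESS:-0}" != "1" ]; then
    return 0
  fi
  # Build the curl argument list; the 10-second fallback is an assumption.
  set -- -sS --fail --max-time "${OPS_ALERT_TIMEOUT_SECONDS:-10}" \
    -H 'Content-Type: application/json'
  if [ -n "${OPS_ALERT_WEBHOOK_BEARER_TOKEN:-}" ]; then
    set -- "$@" -H "Authorization: Bearer ${OPS_ALERT_WEBHOOK_BEARER_TOKEN}"
  fi
  curl "$@" -d "$(ops_alert_payload "$status" "$message")" "$OPS_ALERT_WEBHOOK_URL"
}
```

A failed job could then call `send_ops_alert failure "backup failed on $(hostname)"`, while successful runs stay silent unless `OPS_NOTIFY_ON_SUCCESS` is enabled.
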
Restore drill automation:

- `infra/systemd/productier-restore-drill.service`
- `infra/systemd/productier-restore-drill.timer`

Example install:

```bash
sudo cp infra/systemd/productier-restore-drill.service /etc/systemd/system/
sudo cp infra/systemd/productier-restore-drill.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now productier-restore-drill.timer
sudo systemctl status productier-restore-drill.timer
```