TDvorak/Productier

mirror of https://github.com/Dvorinka/Productier.git synced 2026-06-03 20:13:01 +00:00

Tomas Dvorak 3cb40adb23 first commit

2026-04-10 12:04:09 +02:00

Productier Disaster Recovery Runbook

Scope

This runbook covers backup and restore of:

PostgreSQL data (postgres service)
Object storage data (rustfs bucket configured by S3_BUCKET)

It assumes the production compose stack and env setup in:

infra/docker-compose.prod.yml
.env.production

Preconditions

Validate production env:

npm run check:prod-env

Ensure production services are running:

npm run ops:deploy

If the deployment is already running and you only need a readiness check:

npm run ops:preflight

Backup

Create a timestamped backup:

npm run ops:backup

Output directory format:

backups/<UTC timestamp>/

Expected files:

postgres.sql.gz
s3/ (synced object data)
checksums.sha256
metadata.json

Verify backup:

bash scripts/ops/verify-backup.sh backups/<timestamp>
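
As a rough illustration of what verifying a backup directory can involve, the sketch below checks that the expected files exist, tests the gzip stream, and validates checksums. It is not the contents of scripts/ops/verify-backup.sh. The function name check_backup_dir is hypothetical, and the sketch assumes checksums.sha256 is in the format written by GNU sha256sum with paths relative to the backup directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the project's verify-backup.sh.
# Assumption: checksums.sha256 uses sha256sum format with relative paths.
check_backup_dir() {
  local dir="$1" f
  # All expected top-level files must be present.
  for f in postgres.sql.gz checksums.sha256 metadata.json; do
    [ -f "$dir/$f" ] || { echo "missing: $f" >&2; return 1; }
  done
  [ -d "$dir/s3" ] || { echo "missing: s3/" >&2; return 1; }
  # The dump must be a readable gzip stream.
  gzip -t "$dir/postgres.sql.gz" || return 1
  # Every file listed in the checksum file must match its recorded hash.
  ( cd "$dir" && sha256sum --quiet -c checksums.sha256 )
}
```

Calling check_backup_dir backups/<timestamp> returns a non-zero status if a file is missing, the dump is not valid gzip, or any checksum does not match.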

Run the full backup flow used by scheduled jobs (backup, verify, prune):

npm run ops:backup:job

Restore

Restore is destructive and requires explicit safety flags.

FORCE=1 RESET_DB=1 RESTORE_S3=1 \
bash scripts/ops/restore-prod.sh .env.production backups/<timestamp>

Flags:

FORCE=1: required; without it the restore exits without making changes
RESET_DB=1: recommended; drops and recreates the schema before import
RESTORE_S3=1: restores object storage (enabled by default)
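
The flag behavior described above follows a common guard pattern in shell scripts. The sketch below shows that pattern only; it is not taken from scripts/ops/restore-prod.sh. The function name check_restore_flags is hypothetical, and treating RESET_DB as off when unset is an assumption (the runbook only says it is recommended).

```shell
#!/usr/bin/env bash
# Hypothetical guard, illustrating the flag semantics listed above.
check_restore_flags() {
  # FORCE must be exactly 1, otherwise refuse to continue.
  if [ "${FORCE:-0}" != "1" ]; then
    echo "Refusing to restore: set FORCE=1 to confirm" >&2
    return 1
  fi
  # Assumption: RESET_DB is off unless explicitly set to 1.
  if [ "${RESET_DB:-0}" = "1" ]; then echo "reset_db=yes"; else echo "reset_db=no"; fi
  # RESTORE_S3 is on unless explicitly set to something other than 1.
  if [ "${RESTORE_S3:-1}" = "1" ]; then echo "restore_s3=yes"; else echo "restore_s3=no"; fi
}
```

Using `${VAR:-default}` keeps the defaults in one place and makes the destructive path opt-in.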

Restore drill (non-destructive)

Run drill against latest backup:

npm run ops:restore:drill

Run a full isolated staging drill (temporary compose project, torn down afterwards):

npm run ops:drill:staging

Run drill against specific backup:

bash scripts/ops/restore-drill.sh .env.production backups/<timestamp>
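
When choosing a directory to pass as backups/<timestamp>, a small helper can print the newest one. This is a sketch under one assumption: the timestamp directory names sort chronologically when sorted as plain text (for example names such as 20260101T000000Z, which is a made-up format here). The function name latest_backup_dir is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Assumption: directory names sort chronologically as text.
latest_backup_dir() {
  local root="${1:-backups}"
  # List immediate subdirectories, sort bytewise, keep the last (newest) one.
  find "$root" -mindepth 1 -maxdepth 1 -type d | LC_ALL=C sort | tail -n 1
}
```

For example, `bash scripts/ops/restore-drill.sh .env.production "$(latest_backup_dir backups)"` would run the drill against the newest directory under that assumption.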

Drill behavior:

Imports the DB dump into a temporary drill database
Runs a sanity check (public table count)
Optionally syncs backup objects into a temporary drill bucket
Drops the temporary drill DB and bucket by default
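
A public table count check can be expressed as a single query against information_schema. The sketch below is one possible form, not the query used by scripts/ops/restore-drill.sh. The function names and the DRILL_DATABASE_URL variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating a "public table count" sanity check.

# Print SQL that counts ordinary tables in the public schema.
public_table_count_sql() {
  printf '%s\n' "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public' AND table_type = 'BASE TABLE';"
}

# Succeed only when the argument is a positive integer.
drill_sanity_ok() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -gt 0 ]
}

# Example wiring (needs psql and a reachable drill database):
# count="$(psql "$DRILL_DATABASE_URL" -At -c "$(public_table_count_sql)")"
# drill_sanity_ok "$count" || echo "drill database has no public tables" >&2
```

A count of zero after import usually means the dump did not load, which is why the check treats it as a failure.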

Post-restore checks

API health:

curl -sS https://<PUBLIC_DOMAIN>/v1/health

Auth health:

curl -sS https://<PUBLIC_DOMAIN>/api/auth/get-session

Manual smoke:

sign in
open board/calendar/notes/mail
download at least one attachment

Automated smoke script:

npm run ops:smoke

The smoke script checks:

public homepage response
/v1/health payload
security response headers
HTTP->HTTPS redirect behavior
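
Two of these checks can be reproduced by hand with curl. The sketch below is not the smoke script itself. The function names are hypothetical, and the runbook does not say which security headers are checked, so Strict-Transport-Security appears only as an example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for manual spot checks.

# Read raw response headers on stdin; succeed if header "$1" is present
# (header names are matched case-insensitively).
has_header() {
  grep -qi "^$1:"
}

# Succeed if "$1" is a redirect status code and "$2" is an https:// URL.
is_https_redirect() {
  case "$1" in
    301|302|307|308) ;;
    *) return 1 ;;
  esac
  case "$2" in
    https://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage (replace <PUBLIC_DOMAIN> first):
# curl -sSI "https://<PUBLIC_DOMAIN>/" | has_header strict-transport-security
# read -r code url < <(curl -sS -o /dev/null -w '%{http_code} %{redirect_url}\n' "http://<PUBLIC_DOMAIN>/")
# is_https_redirect "$code" "$url"
```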

Cadence recommendations

Daily backup (off-peak)
Weekly restore drill in staging
Keep at least 14 daily restore points and 8 weekly restore points

Automation (systemd)

Template files:

infra/systemd/productier-backup.service
infra/systemd/productier-backup.timer

Example install on Linux host:

sudo cp infra/systemd/productier-backup.service /etc/systemd/system/
sudo cp infra/systemd/productier-backup.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now productier-backup.timer
sudo systemctl status productier-backup.timer

Run backup manually through systemd:

sudo systemctl start productier-backup.service
sudo journalctl -u productier-backup.service -n 200 --no-pager

Retention is controlled by BACKUP_KEEP_COUNT in productier-backup.service.
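
Count-based retention generally means keeping the newest N backup directories and removing the rest. The sketch below only lists the directories that would fall outside the newest N, and deletes nothing. It is an illustration, not the project's prune logic. The function name list_prunable_backups is hypothetical, and it assumes directory names sort chronologically as text.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Assumption: directory names sort chronologically as text.
# Prints directories under "$1" that fall outside the newest "$2" entries.
list_prunable_backups() {
  local root="$1" keep="$2" total excess
  total="$(find "$root" -mindepth 1 -maxdepth 1 -type d | wc -l)"
  excess=$((total - keep))
  # Nothing to prune when the count is within the limit.
  [ "$excess" -gt 0 ] || return 0
  # The oldest entries come first after a bytewise sort.
  find "$root" -mindepth 1 -maxdepth 1 -type d | LC_ALL=C sort | head -n "$excess"
}
```

Running `list_prunable_backups backups "$BACKUP_KEEP_COUNT"` before a real prune gives a quick preview of what a count-based policy would remove.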

Alerting:

OPS_ALERT_WEBHOOK_URL
OPS_ALERT_WEBHOOK_BEARER_TOKEN
OPS_NOTIFY_ON_SUCCESS
OPS_ALERT_TIMEOUT_SECONDS

These variables can be set in /opt/productier/.env.production and are loaded by productier-backup.service.
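
To test a webhook endpoint by hand, a request along the following lines can be sent. This is a sketch only: the runbook does not document the payload format, so the JSON shape, the function names, and the 10-second fallback timeout are all assumptions. Only the environment variable names come from the list above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. The payload shape is an assumption, not the project's format.

# Build a minimal JSON payload; escapes backslashes and double quotes
# in a single-line message.
build_alert_payload() {
  local status="$1" message
  message="$(printf '%s' "$2" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')"
  printf '{"status":"%s","message":"%s"}\n' "$status" "$message"
}

# POST the payload with a bearer token and a request timeout.
send_alert() {
  curl -sS --max-time "${OPS_ALERT_TIMEOUT_SECONDS:-10}" \
    -H "Authorization: Bearer ${OPS_ALERT_WEBHOOK_BEARER_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$(build_alert_payload "$1" "$2")" \
    "$OPS_ALERT_WEBHOOK_URL"
}

# Example: send_alert failure "manual webhook test"
```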

Restore drill automation:

infra/systemd/productier-restore-drill.service
infra/systemd/productier-restore-drill.timer

Example install:

sudo cp infra/systemd/productier-restore-drill.service /etc/systemd/system/
sudo cp infra/systemd/productier-restore-drill.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now productier-restore-drill.timer
sudo systemctl status productier-restore-drill.timer