Infrastructure Reliability — Milestone Proposal
Status: Planned · April 2026
Defines the reliability, observability, and failover baseline required before the platform can handle real-money transactions at scale. Covers uptime monitoring, log aggregation, circuit breakers, DB backup drills, and disaster recovery.
No items implemented yet. All eight are pre-launch requirements.
Problem
The platform runs four Docker containers on a single Hetzner VPS (webapi, kick-server, worker, migrator), backed by Hetzner Managed PostgreSQL and Cloudflare Pages. The current posture is functional for a closed beta but has significant gaps:
- No `/health` endpoint — there is no machine-readable liveness or readiness signal
- No external uptime monitor — outages are discovered by users, not by alerts
- `SENTRY_DSN` in `.env` is a placeholder — error tracking is not active
- Logs are stdout/stderr from four containers with no centralised store; correlation across services is impossible
- External dependencies (ScraperAPI, Kick OAuth, X OAuth, Pusher, Resend) have no retry policy, backoff, or circuit breaker — a single external outage cascades
- No tested database restore procedure; RTO and RPO targets are undefined
- Deployments use a stop-then-start pattern that causes a brief availability gap
- No disaster recovery runbook — rebuilding the environment from scratch is undocumented
Proposed Milestone — M-IR
A dedicated infrastructure milestone to be completed before opening the platform beyond the current invite-only cohort. Grouped into three themes: Reliability, Observability, and Failover.
Reliability
IR-1 · Health Endpoint & Uptime Monitoring
- Add `GET /api/health` that checks DB connectivity (1-row ping) and the worker heartbeat timestamp (last job < 10 min ago), and returns `{ status, db, worker, uptime_s }`
- Register the endpoint with an external uptime monitor (UptimeRobot free tier or Grafana Cloud Synthetic Monitoring) polling every 60 s
- Alert to the Slack `#ops` channel on two consecutive failures; auto-resolve on recovery
- Document the expected response shape and the acceptable latency threshold (< 200 ms)
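The response logic could be assembled as follows. This is a minimal sketch: `buildHealthReport` and its parameters are hypothetical names, while the `{ status, db, worker, uptime_s }` shape and the 10-minute staleness rule come from the bullets above.

```typescript
// Sketch of the /api/health response logic (framework wiring omitted).
// buildHealthReport and its inputs are hypothetical names.
const WORKER_STALE_MS = 10 * 60 * 1000; // heartbeat older than 10 min => stale
const startedAt = Date.now();           // captured once at process start

interface HealthReport {
  status: "ok" | "degraded";
  db: "ok" | "error";
  worker: "ok" | "stale";
  uptime_s: number;
}

function buildHealthReport(dbOk: boolean, lastJobAtMs: number, nowMs: number): HealthReport {
  const db = dbOk ? "ok" : "error"; // result of the 1-row ping
  const worker = nowMs - lastJobAtMs < WORKER_STALE_MS ? "ok" : "stale";
  return {
    status: db === "ok" && worker === "ok" ? "ok" : "degraded",
    db,
    worker,
    uptime_s: Math.floor((nowMs - startedAt) / 1000),
  };
}
```

A non-`ok` status would map to an HTTP 503 so the uptime monitor counts it as a failure.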
IR-2 · Zero-Downtime Rolling Deploy
- Replace the current `docker compose down && docker compose up` deploy pattern with a rolling restart (`docker compose up --no-deps --wait webapi`)
- Add a `HEALTHCHECK` directive to all four Dockerfiles so Docker itself gates traffic on readiness
- Validate with a synthetic request during the CI deploy step — reject the deploy if `/api/health` does not return 200 within 30 s of container start
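A minimal sketch of such a directive for the webapi image, assuming the container listens on port 3000 and has `curl` in its runtime image (both are assumptions to adjust per container):

```dockerfile
# Illustrative values: port 3000, curl availability, and the timing flags are assumptions.
HEALTHCHECK --interval=15s --timeout=3s --start-period=30s --retries=3 \
  CMD curl -fsS http://localhost:3000/api/health || exit 1
```

With a health check in place, `docker compose up --wait` blocks until the container reports healthy, which is what lets the rolling restart gate on readiness.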
IR-3 · External Dependency Circuit Breakers
- Wrap ScraperAPI calls (Kick follower poll) with retry-with-backoff (3 attempts, exponential 1 s / 4 s / 16 s) and a per-run circuit breaker — skip remaining batch if 5 consecutive requests fail
- Add similar retry wrappers to: Resend email dispatch, Pusher event publish, X/Twitter OAuth token refresh
- Log `[WARN] circuit open: {service}` on trip; `[INFO] circuit closed: {service}` on recovery
- The worker should complete its batch in a degraded state rather than crashing entirely
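The retry and breaker behaviour above can be sketched as a small wrapper class. The delays (1 s / 4 s / 16 s) and the 5-consecutive-failure trip threshold are from the bullets; the class name, the `service` label, and the reading of "3 attempts" as three retries after the initial try are assumptions.

```typescript
// Sketch of retry-with-backoff plus a per-run circuit breaker.
// PerRunBreaker is an illustrative name, not existing code.
const DELAYS_MS = [1_000, 4_000, 16_000];
const TRIP_AFTER = 5;

class CircuitOpenError extends Error {}

class PerRunBreaker {
  private consecutiveFailures = 0;
  private tripped = false;

  constructor(private service: string, private delaysMs: number[] = DELAYS_MS) {}

  isOpen(): boolean {
    return this.tripped;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.tripped) throw new CircuitOpenError(`circuit open: ${this.service}`);
    for (let attempt = 0; ; attempt++) {
      try {
        const result = await fn();
        this.consecutiveFailures = 0; // any success resets the failure streak
        return result;
      } catch (err) {
        this.consecutiveFailures++;
        if (this.consecutiveFailures >= TRIP_AFTER) {
          this.tripped = true; // remaining batch items are skipped
          console.warn(`[WARN] circuit open: ${this.service}`);
          throw new CircuitOpenError(`circuit open: ${this.service}`);
        }
        if (attempt >= this.delaysMs.length) throw err; // retries exhausted
        await new Promise((r) => setTimeout(r, this.delaysMs[attempt]));
      }
    }
  }
}
```

The worker would construct one breaker per run and catch `CircuitOpenError` per batch item, logging it instead of crashing, which gives the degraded completion the last bullet asks for.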
Observability
IR-4 · Sentry Activation & Error Budget
- Replace the placeholder `SENTRY_DSN` with the real project DSN; upload source maps from all four containers as part of the CI build step
- Set an error budget alert: page `#ops` if `errors/min > 5` for 3 consecutive minutes
- Tag every Sentry event with `{ service, env, version }` so errors are filterable by container
- Document the triage SLA: P1 (crash/data loss) < 1 h response; P2 (degraded) < 4 h; P3 (cosmetic) next sprint
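The paging condition can be expressed as a check over one-minute error buckets. A sketch: `shouldPage` is a hypothetical name; the `> 5` threshold and the 3-minute window come from the bullet above.

```typescript
// Page when errors/min exceeds the limit for 3 consecutive one-minute buckets.
const ERRORS_PER_MIN_LIMIT = 5;
const CONSECUTIVE_MINUTES = 3;

function shouldPage(errorsPerMinute: number[]): boolean {
  let streak = 0;
  for (const count of errorsPerMinute) {
    streak = count > ERRORS_PER_MIN_LIMIT ? streak + 1 : 0; // a quiet minute resets the streak
    if (streak >= CONSECUTIVE_MINUTES) return true;
  }
  return false;
}
```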
IR-5 · Structured Log Aggregation
- Prefix every log line with a `request_id` correlation header (UUID generated per request, forwarded to all downstream calls)
- Ship container stdout/stderr to a centralised store — Logtail (Better Stack) or self-hosted Loki + Grafana on the same Hetzner VPS
- Retain logs for 30 days minimum; index on `level`, `service`, `request_id`, and `streamer_id`/`operator_id` where present
- Add a saved query for each of the three most common investigation patterns: failed payouts, KYC upload errors, auth failures
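One format that satisfies the indexing bullet is one JSON object per line carrying exactly those keys. A sketch, with `formatLogLine` and the `webapi` example as illustrative assumptions:

```typescript
// JSON-per-line log format so Loki/Logtail can index the keys directly.
import { randomUUID } from "node:crypto";

interface LogFields {
  level: "INFO" | "WARN" | "ERROR";
  service: string;
  request_id: string;
  msg: string;
  streamer_id?: string; // included only where the request concerns a streamer
  operator_id?: string; // included only where the request concerns an operator
}

function formatLogLine(fields: LogFields): string {
  return JSON.stringify({ ts: new Date().toISOString(), ...fields });
}

const requestId = randomUUID(); // generated once per incoming request, forwarded downstream
console.log(formatLogLine({ level: "INFO", service: "webapi", request_id: requestId, msg: "payout created" }));
```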
IR-6 · Performance Metrics Baseline
- Instrument the Hono middleware layer to emit per-route request duration (p50 / p95 / p99) and status-code counters — exposed as a Prometheus-compatible `/metrics` endpoint (internal only)
- Enable PostgreSQL `log_min_duration_statement = 500ms` — flag slow queries to the log aggregator
- Track worker job duration per job type; alert if any job type exceeds 2× its 7-day median
- Publish a simple Grafana dashboard covering: request rate, error rate, DB connection pool utilisation, worker queue depth
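The p50/p95/p99 figures could be computed from buffered per-route samples using the nearest-rank method. A sketch with illustrative names; a production exporter would more likely use histogram buckets than raw samples.

```typescript
// Nearest-rank percentile over an ascending-sorted sample buffer:
// take the value at index ceil(p/100 * n) - 1.
function percentile(sortedMs: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sortedMs.length);
  return sortedMs[Math.max(0, rank - 1)];
}

function routeSummary(durationsMs: number[]): { p50: number; p95: number; p99: number } {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  return {
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}
```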
Failover
IR-7 · Database Backup & Restore Runbook
- Verify that Hetzner Managed PostgreSQL automated daily backups are enabled and retained for ≥ 7 days
- Document and drill the full point-in-time restore procedure: restore to a staging DB, run `npm run migrate`, smoke-test critical queries
- Define the RTO target (≤ 4 h) and RPO target (≤ 24 h) in writing; confirm the Hetzner backup schedule meets the RPO
- Add a monthly calendar reminder to re-validate the backup; record outcome in a runbook changelog
IR-8 · Disaster Recovery Playbook
- Document the full environment rebuild procedure from zero:
  - Provision a new Hetzner VPS + Managed PostgreSQL instance
  - Pull Docker images from GHCR
  - Restore the DB from the latest backup
  - Inject secrets from the secrets manager (see IS-2)
  - Run `npm run migrate`
  - Update DNS / the Cloudflare route
  - Validate `/api/health` + run the smoke-test checklist
- Test the playbook end-to-end at least once before public launch; document time-to-recovery achieved
- Store the playbook in the internal wiki alongside IS and IR runbooks
Prioritisation
| Item | Theme | Priority | Effort |
|---|---|---|---|
| IR-1 Health endpoint + uptime monitor | Reliability | High | Low |
| IR-4 Sentry activation | Observability | High | Low |
| IR-2 Zero-downtime rolling deploy | Reliability | High | Medium |
| IR-5 Log aggregation | Observability | High | Medium |
| IR-3 Circuit breakers | Reliability | Medium | Medium |
| IR-6 Performance metrics | Observability | Medium | Medium |
| IR-7 DB backup drill | Failover | High | Low |
| IR-8 DR playbook | Failover | Required before launch | High |
Out of Scope
- Multi-region active-active deployment (V2 consideration)
- Kubernetes / container orchestration (single-server deployment is sufficient for current scale)
- CDN failover for the API layer (API is on Hetzner; frontend already on Cloudflare Pages)
- SLA commitments to operators (post-launch, once baseline is established)