Infrastructure Reliability — Milestone Proposal
Status: Planned · April 2026
Defines the reliability, observability, and failover baseline required before the platform can handle real-money transactions at scale. Covers uptime monitoring, log aggregation, circuit breakers, DB backup drills, and disaster recovery.
No items implemented yet. All eight are pre-launch requirements.
Problem
The platform runs four Docker containers on a single Hetzner VPS (webapi, kick-server, worker, migrator), backed by Hetzner Managed PostgreSQL and Cloudflare Pages. The current posture is functional for a closed beta but has significant gaps:
- No `/health` endpoint — there is no machine-readable liveness or readiness signal
- No external uptime monitor — outages are discovered by users, not by alerts
- `SENTRY_DSN` in `.env` is a placeholder — error tracking is not active
- Logs are stdout/stderr from four containers with no centralised store; correlation across services is impossible
- External dependencies (ScraperAPI, Kick OAuth, X OAuth, Pusher, Resend) have no retry policy, backoff, or circuit breaker — a single external outage cascades
- No tested database restore procedure; RTO and RPO targets are undefined
- Deployments use a stop-then-start pattern that causes a brief availability gap
- No disaster recovery runbook — rebuilding the environment from scratch is undocumented
Proposed Milestone — M-IR
A dedicated infrastructure milestone to be completed before opening the platform beyond the current invite-only cohort. Grouped into three themes: Reliability, Observability, and Failover.
Reliability
IR-1 · Health Endpoint & Uptime Monitoring
- Add `GET /api/health` that checks DB connectivity (1-row ping) and the worker heartbeat timestamp (last job < 10 min ago), and returns `{ status, db, worker, uptime_s }`
- Register the endpoint with an external uptime monitor (UptimeRobot free tier or Grafana Cloud Synthetic Monitoring) polling every 60 s
- Alert to the Slack `#ops` channel on two consecutive failures; auto-resolve on recovery
- Document the expected response shape and the acceptable latency threshold (< 200 ms)
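The response logic could be assembled as follows. This is a minimal sketch: `buildHealthReport` and its parameters are hypothetical names, while the `{ status, db, worker, uptime_s }` shape and the 10-minute staleness rule come from the bullets above.

```typescript
// Sketch of the /api/health response logic (framework wiring omitted).
// buildHealthReport and its inputs are hypothetical names.
const WORKER_STALE_MS = 10 * 60 * 1000; // heartbeat older than 10 min => stale
const startedAt = Date.now();           // captured once at process start

interface HealthReport {
  status: "ok" | "degraded";
  db: "ok" | "error";
  worker: "ok" | "stale";
  uptime_s: number;
}

function buildHealthReport(dbOk: boolean, lastJobAtMs: number, nowMs: number): HealthReport {
  const db = dbOk ? "ok" : "error"; // result of the 1-row ping
  const worker = nowMs - lastJobAtMs < WORKER_STALE_MS ? "ok" : "stale";
  return {
    status: db === "ok" && worker === "ok" ? "ok" : "degraded",
    db,
    worker,
    uptime_s: Math.floor((nowMs - startedAt) / 1000),
  };
}
```

A non-`ok` status would map to an HTTP 503 so the uptime monitor counts it as a failure.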
IR-2 · Zero-Downtime Rolling Deploy
- Replace the current `docker compose down && docker compose up` deploy pattern with a rolling restart (`docker compose up --no-deps --wait webapi`)
- Add a `HEALTHCHECK` directive to all four Dockerfiles so Docker itself gates traffic on readiness
- Validate with a synthetic request during the CI deploy step — reject the deploy if `/api/health` does not return 200 within 30 s of container start
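A minimal sketch of such a directive for the webapi image, assuming the container listens on port 3000 and has `curl` in its runtime image (both are assumptions to adjust per container):

```dockerfile
# Illustrative values: port 3000, curl availability, and the timing flags are assumptions.
HEALTHCHECK --interval=15s --timeout=3s --start-period=30s --retries=3 \
  CMD curl -fsS http://localhost:3000/api/health || exit 1
```

With a health check in place, `docker compose up --wait` blocks until the container reports healthy, which is what lets the rolling restart gate on readiness.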
IR-3 · External Dependency Circuit Breakers
- Wrap ScraperAPI calls (Kick follower poll) with retry-with-backoff (3 attempts, exponential 1 s / 4 s / 16 s) and a per-run circuit breaker — skip remaining batch if 5 consecutive requests fail
- Add similar retry wrappers to: Resend email dispatch, Pusher event publish, X/Twitter OAuth token refresh
- Log `[WARN] circuit open: {service}` on trip; `[INFO] circuit closed: {service}` on recovery
- The worker should complete its batch in a degraded state rather than crashing entirely
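The retry and breaker behaviour above can be sketched as a small wrapper class. The delays (1 s / 4 s / 16 s) and the 5-consecutive-failure trip threshold are from the bullets; the class name, the `service` label, and the reading of "3 attempts" as three retries after the initial try are assumptions.

```typescript
// Sketch of retry-with-backoff plus a per-run circuit breaker.
// PerRunBreaker is an illustrative name, not existing code.
const DELAYS_MS = [1_000, 4_000, 16_000];
const TRIP_AFTER = 5;

class CircuitOpenError extends Error {}

class PerRunBreaker {
  private consecutiveFailures = 0;
  private tripped = false;

  constructor(private service: string, private delaysMs: number[] = DELAYS_MS) {}

  isOpen(): boolean {
    return this.tripped;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.tripped) throw new CircuitOpenError(`circuit open: ${this.service}`);
    for (let attempt = 0; ; attempt++) {
      try {
        const result = await fn();
        this.consecutiveFailures = 0; // any success resets the failure streak
        return result;
      } catch (err) {
        this.consecutiveFailures++;
        if (this.consecutiveFailures >= TRIP_AFTER) {
          this.tripped = true; // remaining batch items are skipped
          console.warn(`[WARN] circuit open: ${this.service}`);
          throw new CircuitOpenError(`circuit open: ${this.service}`);
        }
        if (attempt >= this.delaysMs.length) throw err; // retries exhausted
        await new Promise((r) => setTimeout(r, this.delaysMs[attempt]));
      }
    }
  }
}
```

The worker would construct one breaker per run and catch `CircuitOpenError` per batch item, logging it instead of crashing, which gives the degraded completion the last bullet asks for.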
Observability
IR-4 · Sentry Activation & Error Budget
- Replace the placeholder `SENTRY_DSN` with the real project DSN; upload source maps from all four containers as part of the CI build step
- Set an error budget alert: page `#ops` if `errors/min > 5` for 3 consecutive minutes
- Tag every Sentry event with `{ service, env, version }` so errors are filterable by container
- Document the triage SLA: P1 (crash/data loss) < 1 h response; P2 (degraded) < 4 h; P3 (cosmetic) next sprint
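The paging condition can be expressed as a check over one-minute error buckets. A sketch: `shouldPage` is a hypothetical name; the `> 5` threshold and the 3-minute window come from the bullet above.

```typescript
// Page when errors/min exceeds the limit for 3 consecutive one-minute buckets.
const ERRORS_PER_MIN_LIMIT = 5;
const CONSECUTIVE_MINUTES = 3;

function shouldPage(errorsPerMinute: number[]): boolean {
  let streak = 0;
  for (const count of errorsPerMinute) {
    streak = count > ERRORS_PER_MIN_LIMIT ? streak + 1 : 0; // a quiet minute resets the streak
    if (streak >= CONSECUTIVE_MINUTES) return true;
  }
  return false;
}
```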
IR-5 · Structured Log Aggregation
- Prefix every log line with a `request_id` correlation header (UUID generated per request, forwarded to all downstream calls)
- Ship container stdout/stderr to a centralised store — Logtail (Better Stack) or self-hosted Loki + Grafana on the same Hetzner VPS
- Retain logs for 30 days minimum; index on `level`, `service`, `request_id`, and `streamer_id`/`operator_id` where present
- Add a saved query for each of the three most common investigation patterns: failed payouts, KYC upload errors, auth failures
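One format that satisfies the indexing bullet is one JSON object per line carrying exactly those keys. A sketch, with `formatLogLine` and the `webapi` example as illustrative assumptions:

```typescript
// JSON-per-line log format so Loki/Logtail can index the keys directly.
import { randomUUID } from "node:crypto";

interface LogFields {
  level: "INFO" | "WARN" | "ERROR";
  service: string;
  request_id: string;
  msg: string;
  streamer_id?: string; // included only where the request concerns a streamer
  operator_id?: string; // included only where the request concerns an operator
}

function formatLogLine(fields: LogFields): string {
  return JSON.stringify({ ts: new Date().toISOString(), ...fields });
}

const requestId = randomUUID(); // generated once per incoming request, forwarded downstream
console.log(formatLogLine({ level: "INFO", service: "webapi", request_id: requestId, msg: "payout created" }));
```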
IR-6 · Performance Metrics Baseline
- Instrument the Hono middleware layer to emit per-route request duration (p50 / p95 / p99) and status-code counters — exposed as a Prometheus-compatible `/metrics` endpoint (internal only)
- Enable PostgreSQL `log_min_duration_statement = 500ms` — flag slow queries to the log aggregator
- Track worker job duration per job type; alert if any job type exceeds 2× its 7-day median
- Publish a simple Grafana dashboard covering: request rate, error rate, DB connection pool utilisation, worker queue depth
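The p50/p95/p99 figures could be computed from buffered per-route samples using the nearest-rank method. A sketch with illustrative names; a production exporter would more likely use histogram buckets than raw samples.

```typescript
// Nearest-rank percentile over an ascending-sorted sample buffer:
// take the value at index ceil(p/100 * n) - 1.
function percentile(sortedMs: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sortedMs.length);
  return sortedMs[Math.max(0, rank - 1)];
}

function routeSummary(durationsMs: number[]): { p50: number; p95: number; p99: number } {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  return {
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}
```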
Failover
IR-7 · Database Backup & Restore Runbook
- Verify that Hetzner Managed PostgreSQL automated daily backups are enabled and retained for ≥ 7 days
- Document and drill the full point-in-time restore procedure: restore to a staging DB, run `npm run migrate`, smoke-test critical queries
- Define the RTO target (≤ 4 h) and RPO target (≤ 24 h) in writing; confirm the Hetzner backup schedule meets the RPO
- Add a monthly calendar reminder to re-validate the backup; record outcome in a runbook changelog
IR-8 · Disaster Recovery Playbook
- Document the full environment rebuild procedure from zero:
  - Provision a new Hetzner VPS + Managed PostgreSQL instance
  - Pull Docker images from GHCR
  - Restore the DB from the latest backup
  - Inject secrets from the secrets manager (see IS-2)
  - Run `npm run migrate`
  - Update DNS / the Cloudflare route
  - Validate `/api/health` + run the smoke-test checklist
- Test the playbook end-to-end at least once before public launch; document time-to-recovery achieved
- Store the playbook in the internal wiki alongside IS and IR runbooks
Prioritisation
| Item | Theme | Priority | Effort |
|---|---|---|---|
| IR-1 Health endpoint + uptime monitor | Reliability | High | Low |
| IR-4 Sentry activation | Observability | High | Low |
| IR-2 Zero-downtime rolling deploy | Reliability | High | Medium |
| IR-5 Log aggregation | Observability | High | Medium |
| IR-3 Circuit breakers | Reliability | Medium | Medium |
| IR-6 Performance metrics | Observability | Medium | Medium |
| IR-7 DB backup drill | Failover | High | Low |
| IR-8 DR playbook | Failover | Required before launch | High |
Out of Scope
- Multi-region active-active deployment (V2 consideration)
- Kubernetes / container orchestration (single-server deployment is sufficient for current scale)
- CDN failover for the API layer (API is on Hetzner; frontend already on Cloudflare Pages)
- SLA commitments to operators (post-launch, once baseline is established)