
Infrastructure Reliability — Milestone Proposal

Status: Planned · April 2026

Defines the reliability, observability, and failover baseline required before the platform can handle real-money transactions at scale. Covers uptime monitoring, log aggregation, circuit breakers, DB backup drills, and disaster recovery.

No items implemented yet. All eight are pre-launch requirements.


Problem

The platform runs four Docker containers on a single Hetzner VPS (webapi, kick-server, worker, migrator), backed by Hetzner Managed PostgreSQL and Cloudflare Pages. The current posture is functional for a closed beta but has significant gaps:

  • No /health endpoint — there is no machine-readable liveness or readiness signal
  • No external uptime monitor — outages are discovered by users, not by alerts
  • SENTRY_DSN in .env is a placeholder — error tracking is not active
  • Logs are stdout/stderr from four containers with no centralised store; correlation across services is impossible
  • External dependencies (ScraperAPI, Kick OAuth, X OAuth, Pusher, Resend) have no retry policy, backoff, or circuit breaker — a single external outage cascades
  • No tested database restore procedure; RTO and RPO targets are undefined
  • Deployments use a stop-then-start pattern that causes a brief availability gap
  • No disaster recovery runbook — rebuilding the environment from scratch is undocumented

Proposed Milestone — M-IR

A dedicated infrastructure milestone to be completed before opening the platform beyond the current invite-only cohort. Grouped into three themes: Reliability, Observability, and Failover.


Reliability

IR-1 · Health Endpoint & Uptime Monitoring

  • Add GET /api/health that checks: DB connectivity (1-row ping), worker heartbeat timestamp (last job < 10 min ago), and returns { status, db, worker, uptime_s }
  • Register endpoint with an external uptime monitor (UptimeRobot free tier or Grafana Cloud Synthetic) polling every 60 s
  • Alert to Slack #ops channel on two consecutive failures; auto-resolve on recovery
  • Document expected response shape and acceptable latency threshold (< 200 ms)
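
The check described above can be sketched as a pure report builder plus a thin Hono route. This is a sketch, not the implementation: checkDb and lastWorkerHeartbeat are hypothetical helpers standing in for the real DB ping and worker heartbeat read.

```typescript
// Sketch of the /api/health payload logic (IR-1).
type HealthReport = {
  status: "ok" | "degraded";
  db: "up" | "down";
  worker: "up" | "stale";
  uptime_s: number;
};

const WORKER_STALE_MS = 10 * 60 * 1000; // last job must be < 10 min ago
const startedAt = Date.now();

function buildHealthReport(
  dbUp: boolean,
  lastHeartbeat: Date,
  now: Date = new Date(),
): HealthReport {
  const workerFresh = now.getTime() - lastHeartbeat.getTime() < WORKER_STALE_MS;
  return {
    status: dbUp && workerFresh ? "ok" : "degraded",
    db: dbUp ? "up" : "down",
    worker: workerFresh ? "up" : "stale",
    uptime_s: Math.floor((Date.now() - startedAt) / 1000),
  };
}

// Hono wiring (sketch — checkDb / lastWorkerHeartbeat are placeholders):
// app.get("/api/health", async (c) => {
//   const report = buildHealthReport(await checkDb(), await lastWorkerHeartbeat());
//   return c.json(report, report.status === "ok" ? 200 : 503);
// });
```

Returning 503 on a degraded report lets the uptime monitor and the Docker healthcheck reuse the same endpoint without parsing the body.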

IR-2 · Zero-Downtime Rolling Deploy

  • Replace the current stop-then-start deploy pattern (docker compose down && docker compose up) with a rolling restart (docker compose up -d --no-deps --wait webapi, repeated per service)
  • Add a HEALTHCHECK directive to all four Dockerfiles so Docker itself gates traffic on readiness
  • Validate with a synthetic request during the CI deploy step — reject the deploy if /api/health returns non-200 within 30 s of container start
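
A minimal sketch of the HEALTHCHECK directive, assuming the webapi container listens on port 3000 and that curl is available in the image (both are assumptions; the real port and probe tool may differ):

```dockerfile
# Gate readiness on the health endpoint so `docker compose up --wait`
# only returns once the container is actually serving traffic.
HEALTHCHECK --interval=10s --timeout=3s --start-period=30s --retries=3 \
  CMD curl -fsS http://localhost:3000/api/health || exit 1
```

With this in place, docker compose up -d --no-deps --wait webapi blocks until the healthcheck passes, so the CI deploy step can fail fast whenever the new container never becomes healthy.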

IR-3 · External Dependency Circuit Breakers

  • Wrap ScraperAPI calls (Kick follower poll) with retry-with-backoff (3 attempts, exponential 1 s / 4 s / 16 s) and a per-run circuit breaker — skip remaining batch if 5 consecutive requests fail
  • Add similar retry wrappers to: Resend email dispatch, Pusher event publish, X/Twitter OAuth token refresh
  • Log [WARN] circuit open: {service} on trip; [INFO] circuit closed: {service} on recovery
  • Worker should complete its batch in a degraded state rather than crashing entirely
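
The retry and breaker policy above can be sketched as follows. The spec's "3 attempts, 1 s / 4 s / 16 s" is read here as one initial request plus three backed-off retries; the sleep function is injectable so the policy is testable without real waits. A fresh breaker per batch run gives the per-run semantics (and logging the closed transition on the next healthy run).

```typescript
// Sketch of retry-with-backoff plus a per-run circuit breaker (IR-3).
const BACKOFF_MS = [1000, 4000, 16000]; // exponential: 1 s / 4 s / 16 s
const TRIP_AFTER = 5; // skip remaining batch after 5 consecutive failed requests

class CircuitBreaker {
  private consecutiveFailures = 0;
  open = false;

  constructor(private service: string) {}

  async call<T>(
    fn: () => Promise<T>,
    sleep: (ms: number) => Promise<void> = (ms) =>
      new Promise<void>((resolve) => setTimeout(resolve, ms)),
  ): Promise<T | undefined> {
    if (this.open) return undefined; // batch continues, this service is skipped
    for (let attempt = 0; attempt <= BACKOFF_MS.length; attempt++) {
      if (attempt > 0) await sleep(BACKOFF_MS[attempt - 1]); // back off before retrying
      try {
        const result = await fn();
        this.consecutiveFailures = 0; // any success resets the window
        return result;
      } catch {
        if (++this.consecutiveFailures >= TRIP_AFTER) {
          this.open = true;
          console.warn(`[WARN] circuit open: ${this.service}`);
          return undefined;
        }
      }
    }
    return undefined; // this call failed, but the breaker is still closed
  }
}
```

Callers treat undefined as a degraded result and record the miss instead of throwing, which is what keeps the worker finishing its batch rather than crashing.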

Observability

IR-4 · Sentry Activation & Error Budget

  • Replace the placeholder SENTRY_DSN with the real project DSN; upload source maps from all four containers as part of the CI build step
  • Set an error budget alert: page #ops if errors/min > 5 for 3 consecutive minutes
  • Tag every Sentry event with { service, env, version } so errors are filterable by container
  • Document triage SLA: P1 (crash/data loss) < 1 h response; P2 (degraded) < 4 h; P3 (cosmetic) next sprint
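
The { service, env, version } tag set could be derived once at startup, roughly as below. The environment variable names (SERVICE_NAME, APP_ENV, GIT_SHA) are illustrative, not the project's actual configuration:

```typescript
// Sketch: per-event tag set for IR-4, built from (assumed) env vars.
function eventTags(
  env: Record<string, string | undefined>,
): { service: string; env: string; version: string } {
  return {
    service: env.SERVICE_NAME ?? "unknown",
    env: env.APP_ENV ?? "development",
    version: env.GIT_SHA ?? "dev",
  };
}

// With @sentry/node this would be applied once per container at startup:
// Sentry.init({ dsn: process.env.SENTRY_DSN });
// Sentry.setTags(eventTags(process.env));
```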

IR-5 · Structured Log Aggregation

  • Prefix every log line with a request_id correlation header (UUID generated per request, forwarded to all downstream calls)
  • Ship container stdout/stderr to a centralised store — Logtail (Better Stack) or self-hosted Loki + Grafana on the same Hetzner VPS
  • Retain logs for 30 days minimum; index on level, service, request_id, and streamer_id / operator_id where present
  • Add a saved query for the three most common investigation patterns: failed payouts, KYC upload errors, auth failures
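
One possible shape for the correlated log line, with field names mirroring the proposed indexes (level, service, request_id); extra identifiers such as streamer_id are merged in where present. This is a sketch, not the agreed format:

```typescript
// Sketch of a structured, request-correlated log line (IR-5).
type LogFields = Record<string, string | number | undefined>;

function logLine(
  level: "INFO" | "WARN" | "ERROR",
  service: string,
  requestId: string,
  message: string,
  extra: LogFields = {},
): string {
  // One JSON object per line keeps Logtail/Loki ingestion and indexing trivial.
  return JSON.stringify({
    ts: new Date().toISOString(),
    level,
    service,
    request_id: requestId,
    message,
    ...extra,
  });
}

// Per-request middleware (Hono sketch): generate the id once, then forward
// it on every downstream call so all four services share one request_id.
// app.use(async (c, next) => {
//   c.set("requestId", c.req.header("x-request-id") ?? crypto.randomUUID());
//   await next();
// });
```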

IR-6 · Performance Metrics Baseline

  • Instrument the Hono middleware layer to emit per-route request duration (p50 / p95 / p99) and status-code counters — expose as Prometheus-compatible /metrics endpoint (internal only)
  • Enable PostgreSQL log_min_duration_statement = 500ms — flag slow queries to the log aggregator
  • Track worker job duration per job type; alert if any job type exceeds 2× its 7-day median
  • Publish a simple Grafana dashboard covering: request rate, error rate, DB connection pool utilisation, worker queue depth
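
The per-route duration summary can be sketched with the nearest-rank percentile definition. In production these numbers would come from a Prometheus histogram rather than raw samples; the sketch only shows the shape of the computation:

```typescript
// Sketch of the p50/p95/p99 summary for IR-6 (nearest-rank percentiles).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank definition
  return sorted[Math.max(rank - 1, 0)];
}

function summarise(durationsMs: number[]): { p50: number; p95: number; p99: number } {
  return {
    p50: percentile(durationsMs, 50),
    p95: percentile(durationsMs, 95),
    p99: percentile(durationsMs, 99),
  };
}
```

A Hono middleware would record one duration per request (keyed by route and status class) and expose the counters on the internal-only /metrics endpoint.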

Failover

IR-7 · Database Backup & Restore Runbook

  • Verify that Hetzner Managed PostgreSQL automated daily backups are enabled and retained for ≥ 7 days
  • Document and drill the full point-in-time restore procedure: restore to a staging DB, run npm run migrate, smoke-test critical queries
  • Define RTO target (≤ 4 h) and RPO target (≤ 24 h) in writing; confirm Hetzner backup schedule meets RPO
  • Add a monthly calendar reminder to re-validate the backup; record outcome in a runbook changelog

IR-8 · Disaster Recovery Playbook

  • Document the full environment rebuild procedure from zero:
    1. Provision new Hetzner VPS + Managed PG instance
    2. Pull Docker images from GHCR
    3. Restore DB from latest backup
    4. Inject secrets from secrets manager (see IS-2)
    5. Run npm run migrate
    6. Update DNS / Cloudflare route
    7. Validate /api/health + smoke-test checklist
  • Test the playbook end-to-end at least once before public launch; document time-to-recovery achieved
  • Store the playbook in the internal wiki alongside IS and IR runbooks

Prioritisation

  Item                                    Theme          Priority                Effort
  IR-1  Health endpoint + uptime monitor  Reliability    High                    Low
  IR-4  Sentry activation                 Observability  High                    Low
  IR-2  Zero-downtime rolling deploy      Reliability    High                    Medium
  IR-5  Log aggregation                   Observability  High                    Medium
  IR-3  Circuit breakers                  Reliability    Medium                  Medium
  IR-6  Performance metrics               Observability  Medium                  Medium
  IR-7  DB backup drill                   Failover       High                    Low
  IR-8  DR playbook                       Failover       Required before launch  High

Out of Scope

  • Multi-region active-active deployment (V2 consideration)
  • Kubernetes / container orchestration (single-server deployment is sufficient for current scale)
  • CDN failover for the API layer (API is on Hetzner; frontend already on Cloudflare Pages)
  • SLA commitments to operators (post-launch, once baseline is established)
