Scaling to 100 Streamers
A concrete plan for taking the platform from its current state (~9 active streamers on stage) to 100 active streamers without an outage spiral. Built on the bottlenecks + points-of-failure analysis from the same period — see Engineering → Monitoring → Alerting for the alert side.
The headline: going from 9 → 100 active streamers is ~11× concurrency, not 11× total volume. Most operations scale per-active-streamer, not per-row. Real cost only appears in three places: chat-message ingress, the worker's per-tick work, and a couple of dashboards that don't paginate. Everything else is comfortable headroom.
Numbers we're planning against
Assumptions: 100 streamers total, ~30 % live concurrency, ~20 chat msgs / min per live streamer when live.
| Signal | Today (~9 active) | At 100 streamers | Bound |
|---|---|---|---|
| Probe rows / day | ~6.5 k | ~45 k | trivial — PG + CH happy at billions |
| Kick chat msgs / day | ~3 k | ~865 k | comfortable — webhook receiver is RPS-bound |
| Webhook deliveries / day | ~1 k | ~10 k | easy |
| Pusher concurrent connections | ~10 | ~500 | paid tier needed |
| ScraperAPI calls / month | ~6 k | ~72 k | paid tier or kill the dependency |
| Postgres row growth (1 yr) | ~2.5 M | ~25 M (probes) | needs retention policy |
| Worker pod CPU under spike | ~30 m | ~150–200 m | bump request, easy |
Almost everything is fine. Two third-party costs go from $0 to ~$100 / mo combined; everything else is operational hygiene.
Phase 0 — pre-ramp prerequisites
Cheap, high-leverage. Each item is something already flagged as latent in the bottleneck analysis. The acceptable ratio of "live streamers" to "auth-race bug" is fine at 9 — at 100 it isn't.
Don't onboard past ~25 streamers without (0.1) and (0.2) at minimum.
0.1 Deadman alert: "no kick_streamer_probe row in 10 min"
One PrometheusRule. Catches worker death, Kick API outage, OAuth refresh failure, ClickHouse write storm — all in one signal. The current heartbeat-probe scheme assumes the worker is alive, but the worker dying is precisely when you need to know.
0.2 CronJob success alert
kube_job_status_failed{namespace="stage"} > 0 → Telegram. The vf-daily-summary CronJob fails silently if the clickhouse-readonly secret rotates and we miss the manual mirror. Currently invisible until someone notices the daily message stopped arriving.
0.3 Replace hand-mirrored secrets
Today: vf-pg-app, vf-pg-migrator in stage are manually re-created from CNPG-managed sources. clickhouse-readonly and telegram-bot in monitoring mirror manually too. With 4 hand-mirrored secrets and rotation being a silent failure, this becomes the highest-frequency outage source as the platform grows.
Either install reflector (annotation-driven secret mirroring) or adopt external-secrets-operator for centralized rotation. Reflector is the smaller change.
0.4 ClickHouse dual-write: surface failures
void chInsert(…) is fire-and-forget — a CH outage produces gaps and we have no metric. Add ch_dual_write_failures_total Prometheus counter + alert above some threshold. One-line code change in the writer; the alert wires automatically to the existing kube-prometheus-stack.
0.5 Bump worker resources
Today: 50 m CPU request, 128 Mi memory. At 100 streamers the per-minute viewer poll does 2 batched API calls + 100 PG inserts + 100 CH inserts — comfortably ~150 m CPU under spike. Pre-emptive bump to 200 m / 256 Mi keeps the pod from becoming the bottleneck before alerting catches it.
0.6 PG retention on kick_streamer_probe
Today the table has no TTL — only ClickHouse does (12 mo). At 100 streamers that's ~16 M rows / year accumulating in PG. Add a monthly DELETE … WHERE ts < NOW() - INTERVAL '90 days' job, or better, partition by month and DROP old partitions. CH is the analytical store anyway; PG only needs the recent window for the live dashboard + worker decisions.
Phase 0 total effort: ~5 hours of focused work, $0 incremental cost.
Phase 1 — capacity (during the ramp 25 → 100 streamers)
1.1 Pusher → paid plan
500 concurrent connections is past Pusher's free tier (100). Their Startup tier is $49 / mo for 500 connections.
Cheaper alternatives:
- Stay on Pusher Startup — zero engineering, $49 / mo
- Migrate to Soketi (self-hosted, Pusher-protocol compatible, drops into the same JS client) — one cluster pod, ~10 m CPU, $0 incremental
- Migrate to NATS or a simpler websocket layer — bigger lift, only worth it if doing other realtime work
Recommendation: stay on Pusher Startup until usage justifies more.
1.2 Drop ScraperAPI dependency or move to paid
ScraperAPI is only used by kick_follower_poll.ts to hit Kick's internal API for channel-age data. At 100 streamers × hourly = 72 k calls / month — well past their 1 000 / month free tier ($49 / mo for the next tier).
Two options:
- Pay $49 / mo — easiest
- Rate-limit follower poll to once / day — channel age + follower count change slowly; daily granularity is sufficient for trust scoring. Drops to ~3 k / month, fits free tier.
Recommendation: drop to daily polling. $0 incremental, the trust score doesn't need finer granularity.
1.3 Operator inbox UI scaling
The OperatorLayout Inbox renders one tab per streamer. At 100 streamers per operator, the horizontal Tabs strip is unusable. Two options:
- Master / detail layout: left sidebar list of streamers (sortable by last activity, with a search box) + selected thread on the right. Standard messaging UI.
- Campaign filter pre-applied: tabs only show streamers within the currently-selected campaign, defaulting to "all but limit to 20 most recent".
Recommendation: master / detail. Same shape every messaging app uses for a reason.
1.4 Admin streamers panel: pagination
Currently shows all rows in one table. At 100 rows the UI is tolerable but the auth-protected endpoint that returns them all in one shot is wasteful. Add ?limit=20&offset=0 plus a search box.
1.5 Resend tier check
PIN emails (legacy signup), invitation emails, KYC notifications. At 100 streamers + their ramp-up traffic, probably 200–500 emails / month. Free tier covers 3 000 / month. Likely no action needed — verify before assuming.
1.6 Replace node-cron with a durable job substrate
Today every periodic job runs as node-cron inside a single worker pod (api/src/worker.ts). Schedules, in-flight state, and retry logic all live in pod memory. That's fine at 9 streamers; it's structurally wrong at 100. Concrete failure modes that surface as we scale:
- Pod restart drops in-flight work. A trust-score recompute that's 20 s into a 70 s run loses everything when the pod is killed for an image update. No retry; the next tick simply starts a new run.
- Overrun. If trust-score takes 70 s, the next 60 s tick fires while it's still running.
node-cronhas no built-in serialisation — two recomputes race over the same tables. - Two pods double-fire. If we ever scale
replicas: 1 → 2for availability, every cron fires on both pods simultaneously. There's no leader election in the worker. - No exactly-once. Jobs that mutate external state (Kick API unsubscribe, S3 chat export) need idempotency in the handler; there's no framework safety net.
- No observability. "What ran when, how long, what failed" requires reading log lines.
The original "split into two pods" idea (worker-fast / worker-slow) sidesteps overrun but doesn't address restart, leader election, or observability. Time to evaluate a real substrate.
Option A — Temporal (suggested)
Temporal is a workflow engine: durable code that survives pod restart, retries with configurable backoff, deduplicates, and exposes a UI for inspection / replay / pause. Workflows are expressed as long-running functions; activities are the atomic side-effect units that get retried.
- Strengths: durability is the killer feature for the long-running paths (trust-score recompute, chat export, daily summary). Observability (Temporal UI) is best-in-class. Naturally absorbs more than just cron — the
offer → negotiation → deallifecycle is the kind of multi-step state machine Temporal was designed for. - Weaknesses: Temporal Server is itself a stateful cluster (Cassandra/MySQL/PG-backed, multiple components). Self-hosting adds meaningful operational burden — possibly more than the system it's trying to make robust. Temporal Cloud sidesteps that but starts at ~$200 / month for a useful tier.
- Migration cost: every cron handler rewrites as a workflow + activities. New mental model for the team. Not free.
- Where it makes sense: when the platform has multiple long-running stateful processes worth modelling (deal funding + HTLC lifecycle, KYC review queues, multi-stream campaigns). For pure-cron use it's overkill.
Option B — Postgres-backed job queue (recommended for 100)
pg-boss is a Node library that implements scheduled jobs, retries, dedup, queues, and cron — all on top of Postgres. We already run CNPG; no new infra.
- Strengths: zero new infrastructure. Library, not server. Schedules + retry state survive pod restart because they're rows in Postgres. Dedup via job key. Multiple workers pull from the same queue without races (PG
SELECT FOR UPDATE SKIP LOCKED). Suits our specific mix: short tight loops (viewer-poll) and slower jobs (trust-score, daily summary) coexist. - Weaknesses: less powerful than Temporal — no workflow-level durability, no UI out of the box (admin queries via SQL or a small panel), no first-class long-running-process model.
- Migration cost: crons become `boss.schedule('viewer-poll', '*
- *', { … })
calls + handlers. Idempotency still in handler code (we already havededupe_key` patterns). Estimated 2–3 days of focused work for the full crontab.
- *', { … })
Alternatives in the same shape: BullMQ (Redis-backed; adds infra), river (PostgreSQL, Go only — not applicable to a TypeScript codebase).
Option C — Kubernetes CronJob for everything
Already in use for vf-daily-summary. Each invocation is a fresh pod; durability is "did the job complete or not", retry is per-job-template.
- Strengths: zero application-level cron state. Survives pod restart trivially. Battle-tested.
- Weaknesses: pod startup overhead (5–10 s) makes per-minute jobs wasteful — half the runtime is bootstrapping. Not viable for the viewer-poll hot path. Good for hourly+ jobs.
Option D — Inngest / Quirrel (cloud-managed)
Turnkey but vendor lock-in; adds a cloud dependency. Not aligned with the rest of the stack which is self-hosted on Hetzner.
Recommendation for the 9 → 100 transition
Move to pg-boss for the durable / slow jobs. Keep the per-minute viewer-poll in-process — but use the database as the coordinator (atomic claim via UPDATE … RETURNING) so multiple workers can run the same code without leader election.
The viewer-poll has a property worth being explicit about: it does not need exactly-once semantics or precise timing. A probe at T+30 s captures the same useful info as one at T+0 s. Missing one minute is invisible (heartbeat row catches up within the hour); missing five triggers the deadman alert. What it does need is no duplicate probes for the same minute — otherwise viewer_hours = sum(viewer_count)/60 double-counts.
That property unlocks a much simpler architecture:
worker (replicas: 2, podAntiAffinity by hostname, NO leader role)
├─ per-minute timer fires on BOTH pods, independently
│ → atomic UPDATE … RETURNING on kick_streamer_probe_schedule
│ claims a subset of due streamers per pod (database = lease)
└─ pg-boss client (both pods pull, queue coordinates via SKIP LOCKED)Three tiers, all coordinating through Postgres rows. Zero special election primitives.
| Tier | Substrate | Jobs |
|---|---|---|
| Per-minute viewer-poll | atomic UPDATE … RETURNING on kick_streamer_probe_schedule (the existing schedule row IS the lease) | viewer-poll |
| Durable queue | pg-boss on the existing CNPG | trust-score recompute, chat export, daily summary, Kick token refresh, stale-session cleanup, follower poll |
| Hourly batch | K8s CronJob (already in use) | vf-daily-summary and similar |
The atomic-claim pattern for the viewer-poll:
-- Phase 1 of pollViewerCounts(): atomic claim + filter in one CTE.
-- Two workers running this concurrently get DISJOINT subsets.
WITH due_existing AS (
UPDATE kick_streamer_probe_schedule
SET next_probe_at = NOW() + INTERVAL '5 minutes', -- placeholder hold
last_probed_at = NOW(),
updated_at = NOW()
WHERE next_probe_at <= NOW()
RETURNING kick_user_id
),
due_new AS (
-- Atomically seed schedule for streamers who don't have one yet
INSERT INTO kick_streamer_probe_schedule
(kick_user_id, next_probe_at, last_probed_at, updated_at)
SELECT sc.kick_user_id, NOW() + INTERVAL '5 minutes', NOW(), NOW()
FROM streamer_channels sc
WHERE sc.platform = 'kick' AND sc.kick_user_id IS NOT NULL
AND NOT EXISTS (SELECT 1 FROM kick_streamer_probe_schedule
WHERE kick_user_id = sc.kick_user_id)
ON CONFLICT (kick_user_id) DO NOTHING
RETURNING kick_user_id
)
SELECT sc.kick_user_id, sc.channel_name
FROM streamer_channels sc
LEFT JOIN streamers s ON s.id = sc.streamer_id
WHERE (sc.kick_user_id IN (SELECT kick_user_id FROM due_existing)
OR sc.kick_user_id IN (SELECT kick_user_id FROM due_new))
AND (s.status IS NULL OR s.status NOT IN ('archived','suspended'));The 5-minute placeholder is a soft hold — gets overwritten by the existing UPSERT after the API call completes (with the actual cadence based on is_live). Long enough to cover API latency + processing, short enough that a worker crash mid-claim makes the row available for retry within minutes.
Two pods racing the same query: PG row-level locks make the UPDATE atomic per row. The slower pod's predicate next_probe_at <= NOW() no longer matches the rows the faster pod claimed → 0 rows in its result for those streamers → it skips them. No duplicate probes, no duplicate Kick API calls.
What this costs vs. what it buys
- One new dependency:
pg-boss(single npm package). No new infrastructure. - One small refactor:
pollViewerCounts()claim logic from SELECT-then-INSERT to the CTE pattern above. Localised to one file.
In return:
- Durable retry on pod restart for everything that matters (queued jobs).
- No leader-election dependency to maintain — the database row IS the lease.
- Failover for the per-minute loop is invisible: the next tick on the surviving pod claims everything that was due. No coordination primitive to mis-configure.
- Two pods can run, with
podAntiAffinityacross nodes, surviving any single pod or node failure. - Observability via PG queries (and a Grafana panel on
pgboss.archive).
Path to Temporal stays open if/when the platform's domain logic (offer / negotiation / deal lifecycle, HTLC funding) reaches a complexity where workflow modelling pays for itself. That's not at streamer #100 — it's at "we want to model long-running multi-party state machines as code instead of polling DB rows."
Effort: ~2 hours for the viewer-poll claim restructure (with a follow-up unique-index migration as a backstop for races on schedule-row creation), 2–3 days for the pg-boss migration of the slow tier. Trigger: streamer #50 or earlier if the worker-overrun symptom shows up first.
Phase 2 — scale comfort (around streamer #50)
2.1 Probe-cadence assumption
viewer_hours = sum(viewer_count) / 60 lives in 5+ places (Grafana panels + the daily-summary CronJob). At 100 streamers, the cost of getting this wrong (e.g. someone changes the cadence) is large enough that the divisor should become a single constant queried at job start.
Refactor: LIVE_PROBE_MINUTES as either a CH UDF or a row in a settings table. One-time cleanup, ~2 hours.
2.2 Stream-session jitter audit
The (kick_user_id, stream_started_at) predicate counts a "stream" — but we never verified that Kick reports the same start_time across probes of the same live session. Even 1-second jitter creates duplicate rows in the daily report.
Audit query:
SELECT kick_user_id, count(distinct stream_started_at) AS distinct_starts
FROM vf.kick_streamer_probe
WHERE is_live = 1 AND ts > now() - INTERVAL 24 HOUR
GROUP BY 1
HAVING count(distinct stream_started_at) > 1;If it returns rows, floor stream_started_at to the minute on insert.
2.3 Webhook receiver capacity
Chat messages at 865 k / day = ~10 RPS average, with bursts to maybe 50–100 RPS during peak. The current receiver writes to PG kick_chat_messages synchronously. At 100 RPS with 5 ms PG insert each, you saturate one PG connection per request.
Mitigation: batch chat-message inserts. api/src/kick_chat_export.ts already batches to S3 — extend the same pattern to the PG write path so the receiver can return 200 quickly and a worker drains them out of a queue. Only matters if peak chat actually hits 100 RPS.
2.4 ClickHouse query caching
The daily-summary CronJob currently scans the full UTC day every run. At 100 streamers × 30 days of probes = a few million rows. Cheap today. After streamer #200 it's worth turning on chunksCache + resultsCache in the CH cluster — one Helm-values change.
2.5 PG read replica
Single-primary CNPG today. Read traffic at 100 streamers (admin panel, dashboards, worker queries) is fine on the primary. Past streamer #200, add a CNPG replicas: 2 and route the heavy analytical queries (Grafana, daily report) to the replica.
Phase 3 — flagged but not actioned
These don't matter at 100 but worth knowing they exist:
- Multi-region — today we're Hetzner Helsinki only. Latency-critical streamers (US, JP) take a 200 ms hit on every write. Plausibly never matters for this app.
- Sharding
kick_streamer_probeby env or kick_user_id — only needed past 1 B rows. Years away. - Replacing CNPG with managed Aurora / Cloud SQL — only if Hetzner becomes the bottleneck, which it won't soon.
- Operator-side delegation / multi-seat — today one operator account = one human. Past ~50 operators with teams, you'll need invitations + per-seat permissions.
- Temporal for domain workflows — once we want to model
offer → negotiation → deal → HTLC funding → submission → payoutas a workflow rather than as polling state machines over DB rows, Temporal earns its operational cost. Plausibly relevant in the next 12–18 months, not at streamer #100.
What ships in what order
| # | Deliverable | Effort | Trigger |
|---|---|---|---|
| 1 | Deadman alert: 10-min no-probe → Telegram | 30 min | now |
| 2 | CronJob failed-job alert | 30 min | now |
| 3 | Reflector for vf-pg-app, vf-pg-migrator, clickhouse-readonly, telegram-bot | 2 h | now |
| 4 | ch_dual_write_failures_total metric + alert | 1 h | now |
| 5 | Worker resources bump 50 m → 200 m CPU | 5 min | now |
| 6 | PG retention DELETE / partition on kick_streamer_probe | 1 day | streamer #25 |
| 7 | Drop ScraperAPI to daily polling | 30 min | streamer #25 |
| 8 | Operator inbox: master / detail layout | 1 day | streamer #30 |
| 9 | Admin streamers pagination | 2 h | streamer #50 |
| 10 | Pusher Startup tier billing | 5 min + billing | streamer #50 |
| 11 | Probe-cadence constant → single source | 2 h | before next change |
| 12 | Stream-start jitter audit | 30 min | streamer #75 |
| 13a | Atomic-claim refactor of viewer-poll (DB row = lease, no leader election) | 2 h | streamer #25 |
| 13b | pg-boss migration: durable queue for slow tier | 2–3 days | streamer #50 |
| 14 | Webhook receiver: async chat inserts | 1 day | streamer #80 |
| 15 | ClickHouse query cache enable | 5 min config | streamer #100+ |
Phase 0 (items 1–5) is ~5 hours of work and zero infra cost. Don't onboard past 25 streamers without it. Total Phase 0 + Phase 1 effort: ~5 days of focused work + ~$50 / mo incremental external services.
Estimated monthly cost at 100 streamers
| Item | Cost (USD / mo) |
|---|---|
| Hetzner K3s cluster + LB | ~€60 (~$65) |
| Hetzner Object Storage | <$1 |
| Pusher Startup | $49 |
| ScraperAPI (free, daily poll) | $0 |
| Resend (free tier) | $0 |
| Sentry (free tier) | $0 |
| Cloudflare Pages + Access | $0 |
| LaunchDarkly (free tier) | $0 |
| Total | ~$115 / mo |
The economics keep up trivially with the platform's revenue at 100 streamers. The constraint is operator and engineer attention, not infrastructure cost.