Skip to content

Scaling to 100 Streamers

A concrete plan for taking the platform from its current state (~9 active streamers on stage) to 100 active streamers without an outage spiral. Built on the bottlenecks + points-of-failure analysis from the same period — see Engineering → Monitoring → Alerting for the alert side.

The headline: going from 9 → 100 active streamers is ~11× concurrency, not 11× total volume. Most operations scale per-active-streamer, not per-row. Real cost only appears in three places: chat-message ingress, the worker's per-tick work, and a couple of dashboards that don't paginate. Everything else is comfortable headroom.


Numbers we're planning against

Assumptions: 100 streamers total, ~30 % live concurrency, ~20 chat msgs / min per live streamer when live.

SignalToday (~9 active)At 100 streamersBound
Probe rows / day~6.5 k~45 ktrivial — PG + CH happy at billions
Kick chat msgs / day~3 k~865 kcomfortable — webhook receiver is RPS-bound
Webhook deliveries / day~1 k~10 keasy
Pusher concurrent connections~10~500paid tier needed
ScraperAPI calls / month~6 k~72 kpaid tier or kill the dependency
Postgres row growth (1 yr)~2.5 M~25 M (probes)needs retention policy
Worker pod CPU under spike~30 m~150–200 mbump request, easy

Almost everything is fine. Two third-party costs go from $0 to ~$100 / mo combined; everything else is operational hygiene.


Phase 0 — pre-ramp prerequisites

Cheap, high-leverage. Each item is something already flagged as latent in the bottleneck analysis. The acceptable ratio of "live streamers" to "auth-race bug" is fine at 9 — at 100 it isn't.

Don't onboard past ~25 streamers without (0.1) and (0.2) at minimum.

0.1 Deadman alert: "no kick_streamer_probe row in 10 min"

One PrometheusRule. Catches worker death, Kick API outage, OAuth refresh failure, ClickHouse write storm — all in one signal. The current heartbeat-probe scheme assumes the worker is alive, but the worker dying is precisely when you need to know.

0.2 CronJob success alert

kube_job_status_failed{namespace="stage"} > 0 → Telegram. The vf-daily-summary CronJob fails silently if the clickhouse-readonly secret rotates and we miss the manual mirror. Currently invisible until someone notices the daily message stopped arriving.

0.3 Replace hand-mirrored secrets

Today: vf-pg-app, vf-pg-migrator in stage are manually re-created from CNPG-managed sources. clickhouse-readonly and telegram-bot in monitoring mirror manually too. With 4 hand-mirrored secrets and rotation being a silent failure, this becomes the highest-frequency outage source as the platform grows.

Either install reflector (annotation-driven secret mirroring) or adopt external-secrets-operator for centralized rotation. Reflector is the smaller change.

0.4 ClickHouse dual-write: surface failures

void chInsert(…) is fire-and-forget — a CH outage produces gaps and we have no metric. Add ch_dual_write_failures_total Prometheus counter + alert above some threshold. One-line code change in the writer; the alert wires automatically to the existing kube-prometheus-stack.

0.5 Bump worker resources

Today: 50 m CPU request, 128 Mi memory. At 100 streamers the per-minute viewer poll does 2 batched API calls + 100 PG inserts + 100 CH inserts — comfortably ~150 m CPU under spike. Pre-emptive bump to 200 m / 256 Mi keeps the pod from becoming the bottleneck before alerting catches it.

0.6 PG retention on kick_streamer_probe

Today the table has no TTL — only ClickHouse does (12 mo). At 100 streamers that's ~16 M rows / year accumulating in PG. Add a monthly DELETE … WHERE ts < NOW() - INTERVAL '90 days' job, or better, partition by month and DROP old partitions. CH is the analytical store anyway; PG only needs the recent window for the live dashboard + worker decisions.

Phase 0 total effort: ~5 hours of focused work, $0 incremental cost.


Phase 1 — capacity (during the ramp 25 → 100 streamers)

1.1 Pusher → paid plan

500 concurrent connections is past Pusher's free tier (100). Their Startup tier is $49 / mo for 500 connections.

Cheaper alternatives:

  • Stay on Pusher Startup — zero engineering, $49 / mo
  • Migrate to Soketi (self-hosted, Pusher-protocol compatible, drops into the same JS client) — one cluster pod, ~10 m CPU, $0 incremental
  • Migrate to NATS or a simpler websocket layer — bigger lift, only worth it if doing other realtime work

Recommendation: stay on Pusher Startup until usage justifies more.

1.2 Drop ScraperAPI dependency or move to paid

ScraperAPI is only used by kick_follower_poll.ts to hit Kick's internal API for channel-age data. At 100 streamers × hourly = 72 k calls / month — well past their 1 000 / month free tier ($49 / mo for the next tier).

Two options:

  • Pay $49 / mo — easiest
  • Rate-limit follower poll to once / day — channel age + follower count change slowly; daily granularity is sufficient for trust scoring. Drops to ~3 k / month, fits free tier.

Recommendation: drop to daily polling. $0 incremental, the trust score doesn't need finer granularity.

1.3 Operator inbox UI scaling

The OperatorLayout Inbox renders one tab per streamer. At 100 streamers per operator, the horizontal Tabs strip is unusable. Two options:

  • Master / detail layout: left sidebar list of streamers (sortable by last activity, with a search box) + selected thread on the right. Standard messaging UI.
  • Campaign filter pre-applied: tabs only show streamers within the currently-selected campaign, defaulting to "all but limit to 20 most recent".

Recommendation: master / detail. Same shape every messaging app uses for a reason.

1.4 Admin streamers panel: pagination

Currently shows all rows in one table. At 100 rows the UI is tolerable but the auth-protected endpoint that returns them all in one shot is wasteful. Add ?limit=20&offset=0 plus a search box.

1.5 Resend tier check

PIN emails (legacy signup), invitation emails, KYC notifications. At 100 streamers + their ramp-up traffic, probably 200–500 emails / month. Free tier covers 3 000 / month. Likely no action needed — verify before assuming.

1.6 Replace node-cron with a durable job substrate

Today every periodic job runs as node-cron inside a single worker pod (api/src/worker.ts). Schedules, in-flight state, and retry logic all live in pod memory. That's fine at 9 streamers; it's structurally wrong at 100. Concrete failure modes that surface as we scale:

  • Pod restart drops in-flight work. A trust-score recompute that's 20 s into a 70 s run loses everything when the pod is killed for an image update. No retry; the next tick simply starts a new run.
  • Overrun. If trust-score takes 70 s, the next 60 s tick fires while it's still running. node-cron has no built-in serialisation — two recomputes race over the same tables.
  • Two pods double-fire. If we ever scale replicas: 1 → 2 for availability, every cron fires on both pods simultaneously. There's no leader election in the worker.
  • No exactly-once. Jobs that mutate external state (Kick API unsubscribe, S3 chat export) need idempotency in the handler; there's no framework safety net.
  • No observability. "What ran when, how long, what failed" requires reading log lines.

The original "split into two pods" idea (worker-fast / worker-slow) sidesteps overrun but doesn't address restart, leader election, or observability. Time to evaluate a real substrate.

Option A — Temporal (suggested)

Temporal is a workflow engine: durable code that survives pod restart, retries with configurable backoff, deduplicates, and exposes a UI for inspection / replay / pause. Workflows are expressed as long-running functions; activities are the atomic side-effect units that get retried.

  • Strengths: durability is the killer feature for the long-running paths (trust-score recompute, chat export, daily summary). Observability (Temporal UI) is best-in-class. Naturally absorbs more than just cron — the offer → negotiation → deal lifecycle is the kind of multi-step state machine Temporal was designed for.
  • Weaknesses: Temporal Server is itself a stateful cluster (Cassandra/MySQL/PG-backed, multiple components). Self-hosting adds meaningful operational burden — possibly more than the system it's trying to make robust. Temporal Cloud sidesteps that but starts at ~$200 / month for a useful tier.
  • Migration cost: every cron handler rewrites as a workflow + activities. New mental model for the team. Not free.
  • Where it makes sense: when the platform has multiple long-running stateful processes worth modelling (deal funding + HTLC lifecycle, KYC review queues, multi-stream campaigns). For pure-cron use it's overkill.

pg-boss is a Node library that implements scheduled jobs, retries, dedup, queues, and cron — all on top of Postgres. We already run CNPG; no new infra.

  • Strengths: zero new infrastructure. Library, not server. Schedules + retry state survive pod restart because they're rows in Postgres. Dedup via job key. Multiple workers pull from the same queue without races (PG SELECT FOR UPDATE SKIP LOCKED). Suits our specific mix: short tight loops (viewer-poll) and slower jobs (trust-score, daily summary) coexist.
  • Weaknesses: less powerful than Temporal — no workflow-level durability, no UI out of the box (admin queries via SQL or a small panel), no first-class long-running-process model.
  • Migration cost: crons become `boss.schedule('viewer-poll', '*
        • *', { … })calls + handlers. Idempotency still in handler code (we already havededupe_key` patterns). Estimated 2–3 days of focused work for the full crontab.

Alternatives in the same shape: BullMQ (Redis-backed; adds infra), river (PostgreSQL, Go only — not applicable to a TypeScript codebase).

Option C — Kubernetes CronJob for everything

Already in use for vf-daily-summary. Each invocation is a fresh pod; durability is "did the job complete or not", retry is per-job-template.

  • Strengths: zero application-level cron state. Survives pod restart trivially. Battle-tested.
  • Weaknesses: pod startup overhead (5–10 s) makes per-minute jobs wasteful — half the runtime is bootstrapping. Not viable for the viewer-poll hot path. Good for hourly+ jobs.

Option D — Inngest / Quirrel (cloud-managed)

Turnkey but vendor lock-in; adds a cloud dependency. Not aligned with the rest of the stack which is self-hosted on Hetzner.

Recommendation for the 9 → 100 transition

Move to pg-boss for the durable / slow jobs. Keep the per-minute viewer-poll in-process — but use the database as the coordinator (atomic claim via UPDATE … RETURNING) so multiple workers can run the same code without leader election.

The viewer-poll has a property worth being explicit about: it does not need exactly-once semantics or precise timing. A probe at T+30 s captures the same useful info as one at T+0 s. Missing one minute is invisible (heartbeat row catches up within the hour); missing five triggers the deadman alert. What it does need is no duplicate probes for the same minute — otherwise viewer_hours = sum(viewer_count)/60 double-counts.

That property unlocks a much simpler architecture:

text
worker (replicas: 2, podAntiAffinity by hostname, NO leader role)
  ├─ per-minute timer fires on BOTH pods, independently
  │  → atomic UPDATE … RETURNING on kick_streamer_probe_schedule
  │    claims a subset of due streamers per pod (database = lease)
  └─ pg-boss client (both pods pull, queue coordinates via SKIP LOCKED)

Three tiers, all coordinating through Postgres rows. Zero special election primitives.

TierSubstrateJobs
Per-minute viewer-pollatomic UPDATE … RETURNING on kick_streamer_probe_schedule (the existing schedule row IS the lease)viewer-poll
Durable queuepg-boss on the existing CNPGtrust-score recompute, chat export, daily summary, Kick token refresh, stale-session cleanup, follower poll
Hourly batchK8s CronJob (already in use)vf-daily-summary and similar

The atomic-claim pattern for the viewer-poll:

sql
-- Phase 1 of pollViewerCounts(): atomic claim + filter in one CTE.
-- Two workers running this concurrently get DISJOINT subsets.
WITH due_existing AS (
  UPDATE kick_streamer_probe_schedule
  SET    next_probe_at  = NOW() + INTERVAL '5 minutes',  -- placeholder hold
         last_probed_at = NOW(),
         updated_at     = NOW()
  WHERE  next_probe_at <= NOW()
  RETURNING kick_user_id
),
due_new AS (
  -- Atomically seed schedule for streamers who don't have one yet
  INSERT INTO kick_streamer_probe_schedule
         (kick_user_id, next_probe_at, last_probed_at, updated_at)
  SELECT sc.kick_user_id, NOW() + INTERVAL '5 minutes', NOW(), NOW()
  FROM   streamer_channels sc
  WHERE  sc.platform = 'kick' AND sc.kick_user_id IS NOT NULL
    AND  NOT EXISTS (SELECT 1 FROM kick_streamer_probe_schedule
                     WHERE kick_user_id = sc.kick_user_id)
  ON CONFLICT (kick_user_id) DO NOTHING
  RETURNING kick_user_id
)
SELECT sc.kick_user_id, sc.channel_name
FROM   streamer_channels sc
LEFT JOIN streamers s ON s.id = sc.streamer_id
WHERE  (sc.kick_user_id IN (SELECT kick_user_id FROM due_existing)
     OR sc.kick_user_id IN (SELECT kick_user_id FROM due_new))
  AND  (s.status IS NULL OR s.status NOT IN ('archived','suspended'));

The 5-minute placeholder is a soft hold — gets overwritten by the existing UPSERT after the API call completes (with the actual cadence based on is_live). Long enough to cover API latency + processing, short enough that a worker crash mid-claim makes the row available for retry within minutes.

Two pods racing the same query: PG row-level locks make the UPDATE atomic per row. The slower pod's predicate next_probe_at <= NOW() no longer matches the rows the faster pod claimed → 0 rows in its result for those streamers → it skips them. No duplicate probes, no duplicate Kick API calls.

What this costs vs. what it buys

  • One new dependency: pg-boss (single npm package). No new infrastructure.
  • One small refactor: pollViewerCounts() claim logic from SELECT-then-INSERT to the CTE pattern above. Localised to one file.

In return:

  • Durable retry on pod restart for everything that matters (queued jobs).
  • No leader-election dependency to maintain — the database row IS the lease.
  • Failover for the per-minute loop is invisible: the next tick on the surviving pod claims everything that was due. No coordination primitive to mis-configure.
  • Two pods can run, with podAntiAffinity across nodes, surviving any single pod or node failure.
  • Observability via PG queries (and a Grafana panel on pgboss.archive).

Path to Temporal stays open if/when the platform's domain logic (offer / negotiation / deal lifecycle, HTLC funding) reaches a complexity where workflow modelling pays for itself. That's not at streamer #100 — it's at "we want to model long-running multi-party state machines as code instead of polling DB rows."

Effort: ~2 hours for the viewer-poll claim restructure (with a follow-up unique-index migration as a backstop for races on schedule-row creation), 2–3 days for the pg-boss migration of the slow tier. Trigger: streamer #50 or earlier if the worker-overrun symptom shows up first.


Phase 2 — scale comfort (around streamer #50)

2.1 Probe-cadence assumption

viewer_hours = sum(viewer_count) / 60 lives in 5+ places (Grafana panels + the daily-summary CronJob). At 100 streamers, the cost of getting this wrong (e.g. someone changes the cadence) is large enough that the divisor should become a single constant queried at job start.

Refactor: LIVE_PROBE_MINUTES as either a CH UDF or a row in a settings table. One-time cleanup, ~2 hours.

2.2 Stream-session jitter audit

The (kick_user_id, stream_started_at) predicate counts a "stream" — but we never verified that Kick reports the same start_time across probes of the same live session. Even 1-second jitter creates duplicate rows in the daily report.

Audit query:

sql
SELECT kick_user_id, count(distinct stream_started_at) AS distinct_starts
FROM   vf.kick_streamer_probe
WHERE  is_live = 1 AND ts > now() - INTERVAL 24 HOUR
GROUP  BY 1
HAVING count(distinct stream_started_at) > 1;

If it returns rows, floor stream_started_at to the minute on insert.

2.3 Webhook receiver capacity

Chat messages at 865 k / day = ~10 RPS average, with bursts to maybe 50–100 RPS during peak. The current receiver writes to PG kick_chat_messages synchronously. At 100 RPS with 5 ms PG insert each, you saturate one PG connection per request.

Mitigation: batch chat-message inserts. api/src/kick_chat_export.ts already batches to S3 — extend the same pattern to the PG write path so the receiver can return 200 quickly and a worker drains them out of a queue. Only matters if peak chat actually hits 100 RPS.

2.4 ClickHouse query caching

The daily-summary CronJob currently scans the full UTC day every run. At 100 streamers × 30 days of probes = a few million rows. Cheap today. After streamer #200 it's worth turning on chunksCache + resultsCache in the CH cluster — one Helm-values change.

2.5 PG read replica

Single-primary CNPG today. Read traffic at 100 streamers (admin panel, dashboards, worker queries) is fine on the primary. Past streamer #200, add a CNPG replicas: 2 and route the heavy analytical queries (Grafana, daily report) to the replica.


Phase 3 — flagged but not actioned

These don't matter at 100 but worth knowing they exist:

  • Multi-region — today we're Hetzner Helsinki only. Latency-critical streamers (US, JP) take a 200 ms hit on every write. Plausibly never matters for this app.
  • Sharding kick_streamer_probe by env or kick_user_id — only needed past 1 B rows. Years away.
  • Replacing CNPG with managed Aurora / Cloud SQL — only if Hetzner becomes the bottleneck, which it won't soon.
  • Operator-side delegation / multi-seat — today one operator account = one human. Past ~50 operators with teams, you'll need invitations + per-seat permissions.
  • Temporal for domain workflows — once we want to model offer → negotiation → deal → HTLC funding → submission → payout as a workflow rather than as polling state machines over DB rows, Temporal earns its operational cost. Plausibly relevant in the next 12–18 months, not at streamer #100.

What ships in what order

#DeliverableEffortTrigger
1Deadman alert: 10-min no-probe → Telegram30 minnow
2CronJob failed-job alert30 minnow
3Reflector for vf-pg-app, vf-pg-migrator, clickhouse-readonly, telegram-bot2 hnow
4ch_dual_write_failures_total metric + alert1 hnow
5Worker resources bump 50 m → 200 m CPU5 minnow
6PG retention DELETE / partition on kick_streamer_probe1 daystreamer #25
7Drop ScraperAPI to daily polling30 minstreamer #25
8Operator inbox: master / detail layout1 daystreamer #30
9Admin streamers pagination2 hstreamer #50
10Pusher Startup tier billing5 min + billingstreamer #50
11Probe-cadence constant → single source2 hbefore next change
12Stream-start jitter audit30 minstreamer #75
13aAtomic-claim refactor of viewer-poll (DB row = lease, no leader election)2 hstreamer #25
13bpg-boss migration: durable queue for slow tier2–3 daysstreamer #50
14Webhook receiver: async chat inserts1 daystreamer #80
15ClickHouse query cache enable5 min configstreamer #100+

Phase 0 (items 1–5) is ~5 hours of work and zero infra cost. Don't onboard past 25 streamers without it. Total Phase 0 + Phase 1 effort: ~5 days of focused work + ~$50 / mo incremental external services.


Estimated monthly cost at 100 streamers

ItemCost (USD / mo)
Hetzner K3s cluster + LB~€60 (~$65)
Hetzner Object Storage<$1
Pusher Startup$49
ScraperAPI (free, daily poll)$0
Resend (free tier)$0
Sentry (free tier)$0
Cloudflare Pages + Access$0
LaunchDarkly (free tier)$0
Total~$115 / mo

The economics keep up trivially with the platform's revenue at 100 streamers. The constraint is operator and engineer attention, not infrastructure cost.

Verifluence Documentation