External Data Ingestion — Proposal
Status: Proposed · May 2026
Scrape Kick/Twitch/X (and inbound webhooks from Kick, Sumsub) once from a single tier, fan out the resulting domain events to both stage and production environments, and keep all metrics in the shared ClickHouse so dashboards work the same way for both envs.
Nothing implemented yet. Architecture, schemas, and rollout plan only.
Problem
Once the production tier comes up (see Streamer App — Production Launch) we will have two complete app stacks — stage and production namespaces — each running their own copy of:
- Kick OAuth-token-refresh polling
- Kick viewer polling (every minute, per active streamer)
- Kick follower polling (every minute, due streamers)
- Kick channel-name resolution + Kick event subscriptions (
/v1/events/subscriptions) - Kick webhook receivers (
/wh/kick) - Kick chat export (hourly)
- X follower polling (hourly)
- Sumsub webhook receivers (
/wh/sumsub) - Sumsub applicant lookups on PoA approval
- Trust-score refresh (daily)
- Sheets streamer sync (hourly)
- ClickHouse backfills
That works against four real constraints:
- External rate limits. Kick's public API allows ~few-hundred req/min per token. We already burn most of that on the per-streamer-per-minute viewer poll. Doubling the call volume puts us at risk of getting rate-limited or banned.
- External webhook configuration. Kick / Sumsub register one callback URL per app. We'd either duplicate apps (
wh.verifluence.io+wh.stage.verifluence.io, two Kick OAuth apps) or one env doesn't see the events. - Proxy / scraper costs.
SCRAPEDO_TOKENandSCRAPER_API_KEYare paid services billed per request — duplicating that traffic doubles the bill. - State drift. Stage and production each polling at independent cron offsets would record different viewer counts for the same stream. Bug repro in stage stops matching reality in production.
Running two independent scraper tiers is wasteful and risks reputation loss with each upstream. We need scrape-once-fan-out.
Why now
- Production environment is being provisioned this milestone. The pre-production state is the only time we can change this cleanly — once both envs are live, every shared component is a coordinated cutover.
- ClickHouse already lives in
productionand is consumed by bothstageand (soon)productionworkers — half the design is implicit today. - Existing pollers were written with
Envinjected at the call-site; refactoring to "scrape once, write to a bus" requires touching the same files we'd already be modifying for production cutover.
Considered alternatives
| Option | Verdict |
|---|---|
| A. Two independent scraper tiers | Status-quo extended. Rate-limit risk, paid-traffic doubled, state drift. Rejected. |
B. Scrape only in prod; stage replicates from prod DB (logical decoder / pg_dump snapshots) | Tight coupling — stage schema changes lock-step with prod. Replaying old data into a fresh stage table requires a full re-snapshot. Rejected for development velocity. |
| C. Scrape once → raw event bus → per-env projectors ✓ | Each env owns its own DB; bus has replay; scraper is one tier. Recommended. |
| D. Same as C with NATS JetStream / Kafka as the bus | Right shape, but introduces new infra a week before launch. Defer to v2. |
| E. Same as C with ClickHouse as the bus ✓ | Already deployed, append-only, supports cursors, free. Used for raw event log. |
| F. Same as C with Postgres outbox + pg-boss fan-out ✓ | Already deployed; transactional dual-write at the scraper; per-env HTTP delivery with retries. Used for domain-event fan-out. |
Recommended design combines C + E + F: ClickHouse is the raw event log (full fidelity, replay, analytics); Postgres outbox + pg-boss is the domain-event fan-out (idempotent HTTP delivery to each env).
Proposed Milestone — M-INGEST
Group the work into four themes: Ingest tier, Bus & fan-out, Per-env consumers, Metrics unification. Items numbered INGEST-<N>.
Theme 1 — Ingest tier (one scraper, in production namespace)
INGEST-1 — Stand up a new k8s release ingest in the production namespace that owns:
- All cron-driven pollers currently in
api/src/worker.ts(viewer poll, follower poll, X poll, trust-score refresh, kick-subscriptions, kick-token-refresh, chat-export, sheets-sync) - Outbound calls to ScraperAPI / Scrape.do (only one paid-API customer)
- The webhook receiver process —
wh.verifluence.iobecomes a single public endpoint, taking over fromwh.stage.verifluence.io(which becomes an internal-only forwarder for stage smoke tests)
INGEST-2 — Strip pollers/webhooks from stage's worker.yaml and webhooks.yaml. Stage namespace keeps api + per-env-consumer (see Theme 3); it no longer talks to Kick/Sumsub/X.
INGEST-3 — Register all external webhooks against wh.verifluence.io:
- Kick OAuth app: production-only Kick app, one redirect URL, one event-subscription host
- Sumsub: production-only webhook host
- X / Resend / Pusher / Scrape.do callbacks: same
INGEST-4 — ingest's only data sinks are:
- ClickHouse
vf.raw_events(append-only, partitioned by date) — full payload of every webhook + every poll response - Postgres
production.ingest_outbox(one row per domain event — see Theme 2)
The ingest tier owns no domain tables. It does not write to streamers, streamer_channels, stream_sessions, etc. — those live in each per-env DB and are filled by the consumer.
Theme 2 — Bus & fan-out
INGEST-5 — Define a stable DomainEvent envelope:
interface DomainEvent {
id: string; // ULID; idempotency key
source: "kick" | "sumsub" | "x" | "scrape" | "sheets";
type: string; // e.g. "kick.viewer_count.observed", "sumsub.applicant.reviewed"
occurred_at: string; // ISO8601 — when the upstream produced it
ingested_at: string; // ISO8601 — when we received it
subject: { // who/what is this about
kind: "streamer" | "channel" | "applicant" | "session" | "campaign";
external_id: string; // kick_user_id, sumsub_applicant_id, etc.
};
payload: unknown; // event-type-specific JSON
schema_version: number; // bump on breaking shape change
}INGEST-6 — vf.raw_events (ClickHouse) — append-only, no env tag:
CREATE TABLE vf.raw_events ON CLUSTER vf (
id String,
source LowCardinality(String),
type LowCardinality(String),
occurred_at DateTime64(3),
ingested_at DateTime64(3) DEFAULT now64(3),
subject_kind LowCardinality(String),
subject_id String,
payload String, -- JSON
schema_version UInt16
)
ENGINE = ReplicatedMergeTree
PARTITION BY toYYYYMM(ingested_at)
ORDER BY (source, type, subject_id, ingested_at);This is the canonical source-of-truth. Anyone can replay history from here.
INGEST-7 — ingest_outbox (Postgres in production) — fan-out queue:
CREATE TABLE ingest_outbox (
event_id TEXT PRIMARY KEY, -- matches DomainEvent.id
envelope JSONB NOT NULL, -- the whole DomainEvent
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
-- One delivery row per target env (see ingest_outbox_delivery)
);
CREATE TABLE ingest_outbox_delivery (
event_id TEXT NOT NULL REFERENCES ingest_outbox(event_id) ON DELETE CASCADE,
target_env TEXT NOT NULL, -- 'stage' | 'production'
status TEXT NOT NULL DEFAULT 'pending', -- pending | delivered | dead
attempts INT NOT NULL DEFAULT 0,
last_error TEXT,
delivered_at TIMESTAMPTZ,
PRIMARY KEY (event_id, target_env)
);The ingest tier writes both rows transactionally inside the same DB transaction that records the raw payload acknowledgement. This is the standard transactional-outbox pattern — guarantees no event is fanned-out if the raw record didn't land, and vice versa.
INGEST-8 — Fan-out worker (a new pg-boss queue ingest:deliver):
- Polls
ingest_outbox_delivery WHERE status = 'pending' - POSTs
envelopeto each env's/internal/eventsendpoint (with HMAC signature using a per-env shared secret) - On 2xx →
status = 'delivered' - On 4xx (other than 409 = idempotent-dup) →
status = 'dead'; alert - On 5xx / timeout → exponential backoff via pg-boss retry semantics; cap at 24h before marking dead
Stage being down does not block production delivery (separate rows); production being down does not block stage. Each env has its own backlog.
INGEST-9 — Reconciliation job — once per hour, scan ingest_outbox_delivery for dead rows newer than 7 days and re-queue with a manual override flag. Beyond 7 days, operator decides via admin UI.
Theme 3 — Per-env consumers
INGEST-10 — Each env's api exposes an internal-only POST /internal/events:
- HMAC-verified (
X-VF-Signatureheader againstINGEST_DELIVERY_SECRET) - Idempotent on
DomainEvent.id— checked againstprocessed_events(event_id PRIMARY KEY, processed_at)table - Returns 200 if successfully projected, 409 if already-processed, 4xx for malformed, 5xx for transient errors
INGEST-11 — Inside the handler: a thin projector module per source/type, e.g.
api/src/projectors/
kick_viewer_count.ts // updates streamer_channels.last_viewer_count, stream_sessions
kick_follower_count.ts // updates streamer_channels.follower_count
kick_channel_event.ts // creates/closes stream_sessions
sumsub_applicant_reviewed.ts // updates streamers.kyc_age_status / kyc_country_status
...The projector decides what part of the domain DB to write. Each projector is a pure function (env, DomainEvent) → SQL writes — no network I/O, no external calls.
INGEST-12 — Replay tooling — npm run replay -- --from='2026-05-01' --types='kick.*' reads from vf.raw_events, re-wraps into envelopes, and POSTs to the local env's /internal/events. Used after a projector bug fix to rebuild a window of state.
Theme 4 — Metrics unification
INGEST-13 — All metric tables in ClickHouse get an env LowCardinality(String) column. Existing tables (kick_streamer_probe, streamer_trust_score_history, events, impressions, sumsub_webhooks) get a migration adding env DEFAULT 'stage' (since today everything is stage), then we backfill 'production' rows as production starts emitting.
INGEST-14 — Each app pod sets ENV=stage|production (already true for webhooks; extend to api + worker). All emitted ClickHouse rows include it.
INGEST-15 — vf.raw_events does not carry env — it's environment-agnostic. Per-env tables (anything written by a projector) do.
INGEST-16 — Grafana variables: add a top-level env template variable on every dashboard; default production; queries filter WHERE env = $env. The shared dashboards (operator funnel, streamer onboarding, wallet, prelaunch) get a stage version "free" by switching the variable.
INGEST-17 — Alert rules: split into stage/production alert groups via the same env label. Stage alerts route to #alerts-dev; production to the paging channel.
Data flow (visual)
┌────────────────────────────────────────────────────────────────────┐
│ External APIs / webhooks (Kick · X · Sumsub · etc.) │
└──────────────────────────────┬─────────────────────────────────────┘
│ one set of credentials
▼
┌──────────────────────────────────┐
│ ingest (production namespace) │
│ • pollers · webhook receivers │
│ • outbound proxy traffic │
└──────┬─────────────────────┬─────┘
│ │
▼ ▼
vf.raw_events ingest_outbox
(ClickHouse, (Postgres in prod,
replay log) one row per DomainEvent +
per-env delivery state)
│
┌────────────────────┼────────────────────┐
│ pg-boss `ingest:deliver` (HMAC-signed │
│ POST to each env's /internal/events) │
▼ ▼
┌────────────────┐ ┌──────────────────┐
│ stage api │ │ production api │
│ /internal/events │ /internal/events │
│ projectors → stage Postgres │ projectors → │
│ │ prod Postgres │
└────────────────┘ └──────────────────┘
Metrics: both envs write to the same ClickHouse cluster
with `env` column → Grafana templating.Security
wh.verifluence.iois the only externally exposed ingress; behind CF Tunnel + CF Access (allow-list for Kick + Sumsub source IPs, plus signature verification at the application layer)/internal/eventsis not exposed publicly — only reachable from within the cluster. HMAC is belt-and-braces.INGEST_DELIVERY_SECRETis per-env (so a compromised stage secret can't be used to inject into production)- Raw payloads in
vf.raw_eventsmay contain PII (Sumsub applicant data); the table sits in the same security boundary as the rest of CH. No additional egress beyond Grafana queries. - Replay tooling is a CLI run by engineers — no UI surface.
Failure modes & how the design handles them
| Failure | Behaviour |
|---|---|
Stage /internal/events returns 5xx | Production delivery unaffected. Stage backlog accumulates in ingest_outbox_delivery and drains when stage recovers. |
Production /internal/events returns 5xx | Stage delivery unaffected. Production backlog accumulates. Page-worthy if it persists. |
| Ingest tier crashes | Webhooks retry from upstream (Kick has 3 retries on 5xx). Pollers resume on next cron tick. No data lost from upstream perspective; gap captured in next poll. |
| ClickHouse down | Ingest tier writes to outbox only; raw-event archive paused. Backlog drains to CH when it returns (separate write path). |
| Bad projector ships → stage corrupted | Roll back projector code, run npm run replay -- --from=<release> against stage. Production untouched. |
| Need to onboard a third env (dev cluster) | Add a row in INGEST_TARGET_ENVS, register HMAC secret, point its api at the bus. No scraper changes. |
| Kick rate-limits us | Single tier means we can globally back off without one env starving the other. |
Rollout plan
Phase 1 — Ingest skeleton (week 1)
INGEST-1, 4, 5, 6, 7— stand upingestrelease, raw-events table, outbox tables.- Run alongside existing stage worker, dry-run mode (writes to outbox but fan-out is paused).
- Verify outbox grows in step with stage's existing polling.
Phase 2 — Fan-out (week 2)
INGEST-8, 10, 11— wire/internal/eventsin stage api; enable fan-out for stage target only.- Verify stage tables stay consistent vs. its existing direct-write workers.
- Once green, flip stage workers off — stage now receives all data via ingest.
Phase 3 — Strip pollers from stage (week 2-3)
INGEST-2, 3— register external webhooks againstwh.verifluence.io; turn off stage's pollers and webhook receivers.- Stage now consumes domain events exclusively from the ingest tier.
Phase 4 — Production go-live (week 3, blocks on Streamer App launch)
- Production env's
/internal/eventswired; addproductiontoINGEST_TARGET_ENVS. - Both envs now receive every event from the single ingest tier.
Phase 5 — Metrics unification (week 4)
INGEST-13–17— addenvcolumn migrations, Grafana templating, alert routing.
Phase 6 — Replay tooling + reconciliation (week 4-5)
INGEST-9, 12— replay CLI, reconciliation job, admin UI for dead rows.
Open questions
- Where does pg-boss live? Current pg-boss runs in stage's Postgres. Move to production's Postgres (since ingest lives in production)? Yes — keep ingest's queue colocated with ingest's outbox.
- Do we need a separate
ingest_dbPostgres? Probably not. The production app's Postgres has spare capacity. Reuse and namespace tables under aingest.schema. - Outbox retention? Keep delivered rows for 30 days for audit, then drop.
vf.raw_eventskeeps forever (or per CH partition TTL — discuss). - Schema evolution of
DomainEvent? Addschema_versionfrom day one. Projectors can branch on version. Breaking changes ship as new event types, not in-place mutations. - What about reads? Today's API reads from its own Postgres only. No change — projectors keep the per-env DB in sync, and reads stay local.
- What about admin actions that affect external state (e.g. operator forces a Kick re-subscription)? Those go via the api → ingest tier (RPC), not directly to Kick. Ingest is the single point of egress to external APIs.