Skip to content

External Data Ingestion — Proposal

Status: Proposed · May 2026

Scrape Kick/Twitch/X (and inbound webhooks from Kick, Sumsub) once from a single tier, fan out the resulting domain events to both stage and production environments, and keep all metrics in the shared ClickHouse so dashboards work the same way for both envs.

Nothing implemented yet. Architecture, schemas, and rollout plan only.


Problem

Once the production tier comes up (see Streamer App — Production Launch) we will have two complete app stacks — stage and production namespaces — each running their own copy of:

  • Kick OAuth-token-refresh polling
  • Kick viewer polling (every minute, per active streamer)
  • Kick follower polling (every minute, due streamers)
  • Kick channel-name resolution + Kick event subscriptions (/v1/events/subscriptions)
  • Kick webhook receivers (/wh/kick)
  • Kick chat export (hourly)
  • X follower polling (hourly)
  • Sumsub webhook receivers (/wh/sumsub)
  • Sumsub applicant lookups on PoA approval
  • Trust-score refresh (daily)
  • Sheets streamer sync (hourly)
  • ClickHouse backfills

That works against four real constraints:

  1. External rate limits. Kick's public API allows ~few-hundred req/min per token. We already burn most of that on the per-streamer-per-minute viewer poll. Doubling the call volume puts us at risk of getting rate-limited or banned.
  2. External webhook configuration. Kick / Sumsub register one callback URL per app. We'd either duplicate apps (wh.verifluence.io + wh.stage.verifluence.io, two Kick OAuth apps) or one env doesn't see the events.
  3. Proxy / scraper costs. SCRAPEDO_TOKEN and SCRAPER_API_KEY are paid services billed per request — duplicating that traffic doubles the bill.
  4. State drift. Stage and production each polling at independent cron offsets would record different viewer counts for the same stream. Bug repro in stage stops matching reality in production.

Running two independent scraper tiers is wasteful and risks reputation loss with each upstream. We need scrape-once-fan-out.


Why now

  • Production environment is being provisioned this milestone. The pre-production state is the only time we can change this cleanly — once both envs are live, every shared component is a coordinated cutover.
  • ClickHouse already lives in production and is consumed by both stage and (soon) production workers — half the design is implicit today.
  • Existing pollers were written with Env injected at the call-site; refactoring to "scrape once, write to a bus" requires touching the same files we'd already be modifying for production cutover.

Considered alternatives

OptionVerdict
A. Two independent scraper tiersStatus-quo extended. Rate-limit risk, paid-traffic doubled, state drift. Rejected.
B. Scrape only in prod; stage replicates from prod DB (logical decoder / pg_dump snapshots)Tight coupling — stage schema changes lock-step with prod. Replaying old data into a fresh stage table requires a full re-snapshot. Rejected for development velocity.
C. Scrape once → raw event bus → per-env projectorsEach env owns its own DB; bus has replay; scraper is one tier. Recommended.
D. Same as C with NATS JetStream / Kafka as the busRight shape, but introduces new infra a week before launch. Defer to v2.
E. Same as C with ClickHouse as the busAlready deployed, append-only, supports cursors, free. Used for raw event log.
F. Same as C with Postgres outbox + pg-boss fan-outAlready deployed; transactional dual-write at the scraper; per-env HTTP delivery with retries. Used for domain-event fan-out.

Recommended design combines C + E + F: ClickHouse is the raw event log (full fidelity, replay, analytics); Postgres outbox + pg-boss is the domain-event fan-out (idempotent HTTP delivery to each env).


Proposed Milestone — M-INGEST

Group the work into four themes: Ingest tier, Bus & fan-out, Per-env consumers, Metrics unification. Items numbered INGEST-<N>.


Theme 1 — Ingest tier (one scraper, in production namespace)

INGEST-1 — Stand up a new k8s release ingest in the production namespace that owns:

  • All cron-driven pollers currently in api/src/worker.ts (viewer poll, follower poll, X poll, trust-score refresh, kick-subscriptions, kick-token-refresh, chat-export, sheets-sync)
  • Outbound calls to ScraperAPI / Scrape.do (only one paid-API customer)
  • The webhook receiver process — wh.verifluence.io becomes a single public endpoint, taking over from wh.stage.verifluence.io (which becomes an internal-only forwarder for stage smoke tests)

INGEST-2 — Strip pollers/webhooks from stage's worker.yaml and webhooks.yaml. Stage namespace keeps api + per-env-consumer (see Theme 3); it no longer talks to Kick/Sumsub/X.

INGEST-3 — Register all external webhooks against wh.verifluence.io:

  • Kick OAuth app: production-only Kick app, one redirect URL, one event-subscription host
  • Sumsub: production-only webhook host
  • X / Resend / Pusher / Scrape.do callbacks: same

INGEST-4ingest's only data sinks are:

  • ClickHouse vf.raw_events (append-only, partitioned by date) — full payload of every webhook + every poll response
  • Postgres production.ingest_outbox (one row per domain event — see Theme 2)

The ingest tier owns no domain tables. It does not write to streamers, streamer_channels, stream_sessions, etc. — those live in each per-env DB and are filled by the consumer.


Theme 2 — Bus & fan-out

INGEST-5 — Define a stable DomainEvent envelope:

ts
interface DomainEvent {
  id:              string;          // ULID; idempotency key
  source:          "kick" | "sumsub" | "x" | "scrape" | "sheets";
  type:            string;          // e.g. "kick.viewer_count.observed", "sumsub.applicant.reviewed"
  occurred_at:     string;          // ISO8601 — when the upstream produced it
  ingested_at:     string;          // ISO8601 — when we received it
  subject: {                        // who/what is this about
    kind:     "streamer" | "channel" | "applicant" | "session" | "campaign";
    external_id: string;            // kick_user_id, sumsub_applicant_id, etc.
  };
  payload:         unknown;         // event-type-specific JSON
  schema_version:  number;          // bump on breaking shape change
}

INGEST-6vf.raw_events (ClickHouse) — append-only, no env tag:

sql
CREATE TABLE vf.raw_events ON CLUSTER vf (
  id              String,
  source          LowCardinality(String),
  type            LowCardinality(String),
  occurred_at     DateTime64(3),
  ingested_at     DateTime64(3) DEFAULT now64(3),
  subject_kind    LowCardinality(String),
  subject_id      String,
  payload         String,           -- JSON
  schema_version  UInt16
)
ENGINE = ReplicatedMergeTree
PARTITION BY toYYYYMM(ingested_at)
ORDER BY (source, type, subject_id, ingested_at);

This is the canonical source-of-truth. Anyone can replay history from here.

INGEST-7ingest_outbox (Postgres in production) — fan-out queue:

sql
CREATE TABLE ingest_outbox (
  event_id      TEXT PRIMARY KEY,           -- matches DomainEvent.id
  envelope      JSONB NOT NULL,             -- the whole DomainEvent
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  -- One delivery row per target env (see ingest_outbox_delivery)
);

CREATE TABLE ingest_outbox_delivery (
  event_id      TEXT NOT NULL REFERENCES ingest_outbox(event_id) ON DELETE CASCADE,
  target_env    TEXT NOT NULL,              -- 'stage' | 'production'
  status        TEXT NOT NULL DEFAULT 'pending',  -- pending | delivered | dead
  attempts      INT  NOT NULL DEFAULT 0,
  last_error    TEXT,
  delivered_at  TIMESTAMPTZ,
  PRIMARY KEY (event_id, target_env)
);

The ingest tier writes both rows transactionally inside the same DB transaction that records the raw payload acknowledgement. This is the standard transactional-outbox pattern — guarantees no event is fanned-out if the raw record didn't land, and vice versa.

INGEST-8 — Fan-out worker (a new pg-boss queue ingest:deliver):

  1. Polls ingest_outbox_delivery WHERE status = 'pending'
  2. POSTs envelope to each env's /internal/events endpoint (with HMAC signature using a per-env shared secret)
  3. On 2xx → status = 'delivered'
  4. On 4xx (other than 409 = idempotent-dup) → status = 'dead'; alert
  5. On 5xx / timeout → exponential backoff via pg-boss retry semantics; cap at 24h before marking dead

Stage being down does not block production delivery (separate rows); production being down does not block stage. Each env has its own backlog.

INGEST-9 — Reconciliation job — once per hour, scan ingest_outbox_delivery for dead rows newer than 7 days and re-queue with a manual override flag. Beyond 7 days, operator decides via admin UI.


Theme 3 — Per-env consumers

INGEST-10 — Each env's api exposes an internal-only POST /internal/events:

  • HMAC-verified (X-VF-Signature header against INGEST_DELIVERY_SECRET)
  • Idempotent on DomainEvent.id — checked against processed_events(event_id PRIMARY KEY, processed_at) table
  • Returns 200 if successfully projected, 409 if already-processed, 4xx for malformed, 5xx for transient errors

INGEST-11 — Inside the handler: a thin projector module per source/type, e.g.

api/src/projectors/
  kick_viewer_count.ts        // updates streamer_channels.last_viewer_count, stream_sessions
  kick_follower_count.ts      // updates streamer_channels.follower_count
  kick_channel_event.ts       // creates/closes stream_sessions
  sumsub_applicant_reviewed.ts // updates streamers.kyc_age_status / kyc_country_status
  ...

The projector decides what part of the domain DB to write. Each projector is a pure function (env, DomainEvent) → SQL writes — no network I/O, no external calls.

INGEST-12 — Replay tooling — npm run replay -- --from='2026-05-01' --types='kick.*' reads from vf.raw_events, re-wraps into envelopes, and POSTs to the local env's /internal/events. Used after a projector bug fix to rebuild a window of state.


Theme 4 — Metrics unification

INGEST-13 — All metric tables in ClickHouse get an env LowCardinality(String) column. Existing tables (kick_streamer_probe, streamer_trust_score_history, events, impressions, sumsub_webhooks) get a migration adding env DEFAULT 'stage' (since today everything is stage), then we backfill 'production' rows as production starts emitting.

INGEST-14 — Each app pod sets ENV=stage|production (already true for webhooks; extend to api + worker). All emitted ClickHouse rows include it.

INGEST-15vf.raw_events does not carry env — it's environment-agnostic. Per-env tables (anything written by a projector) do.

INGEST-16 — Grafana variables: add a top-level env template variable on every dashboard; default production; queries filter WHERE env = $env. The shared dashboards (operator funnel, streamer onboarding, wallet, prelaunch) get a stage version "free" by switching the variable.

INGEST-17 — Alert rules: split into stage/production alert groups via the same env label. Stage alerts route to #alerts-dev; production to the paging channel.


Data flow (visual)

┌────────────────────────────────────────────────────────────────────┐
│           External APIs / webhooks (Kick · X · Sumsub · etc.)      │
└──────────────────────────────┬─────────────────────────────────────┘
                               │ one set of credentials

            ┌──────────────────────────────────┐
            │   ingest (production namespace)  │
            │   • pollers · webhook receivers  │
            │   • outbound proxy traffic       │
            └──────┬─────────────────────┬─────┘
                   │                     │
                   ▼                     ▼
            vf.raw_events         ingest_outbox
            (ClickHouse,          (Postgres in prod,
             replay log)           one row per DomainEvent +
                                   per-env delivery state)

                  ┌────────────────────┼────────────────────┐
                  │ pg-boss `ingest:deliver` (HMAC-signed   │
                  │  POST to each env's /internal/events)   │
                  ▼                                         ▼
        ┌────────────────┐                       ┌──────────────────┐
        │   stage api    │                       │  production api  │
        │ /internal/events                       │ /internal/events │
        │   projectors → stage Postgres          │   projectors →   │
        │                                        │   prod Postgres  │
        └────────────────┘                       └──────────────────┘

       Metrics: both envs write to the same ClickHouse cluster
       with `env` column → Grafana templating.

Security

  • wh.verifluence.io is the only externally exposed ingress; behind CF Tunnel + CF Access (allow-list for Kick + Sumsub source IPs, plus signature verification at the application layer)
  • /internal/events is not exposed publicly — only reachable from within the cluster. HMAC is belt-and-braces.
  • INGEST_DELIVERY_SECRET is per-env (so a compromised stage secret can't be used to inject into production)
  • Raw payloads in vf.raw_events may contain PII (Sumsub applicant data); the table sits in the same security boundary as the rest of CH. No additional egress beyond Grafana queries.
  • Replay tooling is a CLI run by engineers — no UI surface.

Failure modes & how the design handles them

FailureBehaviour
Stage /internal/events returns 5xxProduction delivery unaffected. Stage backlog accumulates in ingest_outbox_delivery and drains when stage recovers.
Production /internal/events returns 5xxStage delivery unaffected. Production backlog accumulates. Page-worthy if it persists.
Ingest tier crashesWebhooks retry from upstream (Kick has 3 retries on 5xx). Pollers resume on next cron tick. No data lost from upstream perspective; gap captured in next poll.
ClickHouse downIngest tier writes to outbox only; raw-event archive paused. Backlog drains to CH when it returns (separate write path).
Bad projector ships → stage corruptedRoll back projector code, run npm run replay -- --from=<release> against stage. Production untouched.
Need to onboard a third env (dev cluster)Add a row in INGEST_TARGET_ENVS, register HMAC secret, point its api at the bus. No scraper changes.
Kick rate-limits usSingle tier means we can globally back off without one env starving the other.

Rollout plan

Phase 1 — Ingest skeleton (week 1)

  • INGEST-1, 4, 5, 6, 7 — stand up ingest release, raw-events table, outbox tables.
  • Run alongside existing stage worker, dry-run mode (writes to outbox but fan-out is paused).
  • Verify outbox grows in step with stage's existing polling.

Phase 2 — Fan-out (week 2)

  • INGEST-8, 10, 11 — wire /internal/events in stage api; enable fan-out for stage target only.
  • Verify stage tables stay consistent vs. its existing direct-write workers.
  • Once green, flip stage workers off — stage now receives all data via ingest.

Phase 3 — Strip pollers from stage (week 2-3)

  • INGEST-2, 3 — register external webhooks against wh.verifluence.io; turn off stage's pollers and webhook receivers.
  • Stage now consumes domain events exclusively from the ingest tier.

Phase 4 — Production go-live (week 3, blocks on Streamer App launch)

  • Production env's /internal/events wired; add production to INGEST_TARGET_ENVS.
  • Both envs now receive every event from the single ingest tier.

Phase 5 — Metrics unification (week 4)

  • INGEST-13–17 — add env column migrations, Grafana templating, alert routing.

Phase 6 — Replay tooling + reconciliation (week 4-5)

  • INGEST-9, 12 — replay CLI, reconciliation job, admin UI for dead rows.

Open questions

  1. Where does pg-boss live? Current pg-boss runs in stage's Postgres. Move to production's Postgres (since ingest lives in production)? Yes — keep ingest's queue colocated with ingest's outbox.
  2. Do we need a separate ingest_db Postgres? Probably not. The production app's Postgres has spare capacity. Reuse and namespace tables under a ingest. schema.
  3. Outbox retention? Keep delivered rows for 30 days for audit, then drop. vf.raw_events keeps forever (or per CH partition TTL — discuss).
  4. Schema evolution of DomainEvent? Add schema_version from day one. Projectors can branch on version. Breaking changes ship as new event types, not in-place mutations.
  5. What about reads? Today's API reads from its own Postgres only. No change — projectors keep the per-env DB in sync, and reads stay local.
  6. What about admin actions that affect external state (e.g. operator forces a Kick re-subscription)? Those go via the api → ingest tier (RPC), not directly to Kick. Ingest is the single point of egress to external APIs.

Verifluence Documentation