
Ingest Tier Rollout — Execution Plan

Status: Proposed · May 2026

Execution plan for Phase 4.b of the Shared data Namespace rollout — the actual ingest workload that takes over external API ownership from each env's per-env worker. The design itself lives in External Data Ingestion; this doc is the PR-by-PR sequence for landing it safely against a live stage cluster.

Pre-requisites already in place:

  • data namespace, vf-pg-dm, chi-vf alias, dm-migrator (Phase 1–3 done).
  • ingest.ingest_outbox + ingest.ingest_outbox_delivery in vf-pg-dm (migrations-dm/0002_ingest_outbox.sql).
  • vf.raw_events in ClickHouse (api/clickhouse/06_raw_events.sql).
  • DomainEvent envelope type in api/src/ingest.ts.

Scope summary

| Step | Repo | Touches running services? | Reversible? |
| --- | --- | --- | --- |
| 1. dist/ingest-server.js entry point | app | No (new binary) | Yes |
| 2. Ingest writer helpers + raw-event dual-write from worker | app | Yes (worker rolling deploy) | Yes (revert) |
| 3. Outbox + fan-out worker (dry-run mode) | app + ops | Yes (new Deployment in data) | Yes |
| 4. Per-env /internal/events + projector modules | app | Yes (api rolling deploy) | Yes |
| 5. Enable fan-out for stage | ops | Yes (flag flip) | Yes |
| 6. Strip pollers from stage's worker | ops | Yes (worker rolling deploy) | Yes (revert HelmRelease) |
| 7. Re-register external webhooks | external (Kick / Sumsub admin UI) | Yes | No without coordinated downtime |
| 8. Strip webhook receiver from stage | ops | Yes (webhooks rolling deploy) | Yes |
| 9. Replay tool + reconciliation job | app + ops | No (new tool, manual invocation) | Yes |

Steps 1–6 and 8 are routine rolling deploys. Step 7 is the only irreversible action — once Kick's OAuth app has been re-registered to point at wh.verifluence.io (the new ingress in data), there is no clean way back to the per-env webhook URLs without a registration round-trip and lost events. Treat as the cutover point.


PR-by-PR sequence

PR 1 — dist/ingest-server.js entry point (no behaviour change)

Files:

  • api/src/ingest-server.ts — new entry point. Mirrors worker.ts's startup (env, db, pgboss) but registers only ingest-tier cron handlers. Initially there are none; the cron registrations move over in PR 6.
  • api/Dockerfile — new stage FROM base AS ingest with CMD ["node", "dist/ingest-server.js"].
  • api/package.json: new npm run dev:ingest script.

Verification: npm run build produces dist/ingest-server.js; the binary boots, connects to both DBs, exits cleanly on SIGTERM. No runtime use yet.
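
The "exits cleanly on SIGTERM" check could be satisfied by a shutdown hook along the following lines. This is a minimal sketch, not the actual api/src/ingest-server.ts: the `Closeable` interface, `makeShutdown`, and the resource names are hypothetical.

```typescript
// Hypothetical sketch; the real entry point may be structured differently.
// Anything with an async close(): a PG pool, a ClickHouse client, pg-boss.
interface Closeable {
  name: string;
  close(): Promise<void>;
}

// Returns a handler that closes resources in reverse start order, at most once.
function makeShutdown(
  resources: Closeable[],
  log: (msg: string) => void = console.log,
): () => Promise<void> {
  let started: Promise<void> | null = null;
  return function shutdown(): Promise<void> {
    if (started === null) {
      started = (async () => {
        for (const r of resources.slice().reverse()) {
          try {
            await r.close();
            log(`closed ${r.name}`);
          } catch (err) {
            // Keep going: one failed close should not block the others.
            log(`failed to close ${r.name}: ${String(err)}`);
          }
        }
      })();
    }
    return started;
  };
}
```

In the entry point this would typically be wired as `process.once("SIGTERM", () => void shutdown().then(() => process.exit(0)))`. Memoising the promise means a repeated signal does not close a resource twice.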

PR 2 — Ingest writer helpers + raw-event dual-write from worker

Files:

  • api/src/ingest-writer.ts exports writeRawEvent(env, event: DomainEvent): Promise<void> and appendOutbox(env, event, targets: IngestTargetEnv[]): Promise<void>. Together they write to CH (vf.raw_events) and PG (ingest.ingest_outbox + ingest.ingest_outbox_delivery). ClickHouse cannot join a Postgres transaction, so only the two PG inserts are atomic with each other; the CH insert has to tolerate being retried. Uses env.DM_DB and the existing chInsert helper.
  • api/src/kick_viewer_poll.ts (and a couple of others incrementally) — after the existing PG/CH writes, also call writeRawEvent with a kick.viewer_count.observed envelope, plus appendOutbox with an empty target list so that no delivery rows exist yet. Still in the worker, still per-env.
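
The relationship between appendOutbox's target list and the two PG tables can be sketched as a pure planning step. The field and column names below are assumptions for illustration; the real DomainEvent type lives in api/src/ingest.ts and the real columns in migrations-dm/0002_ingest_outbox.sql, and both may differ.

```typescript
// Hypothetical shapes, not the actual envelope or table definitions.
type IngestTargetEnv = "stage" | "production";

interface DomainEvent {
  id: string;         // unique event id, reused downstream as the idempotency key
  source: string;     // e.g. "kick"
  type: string;       // e.g. "kick.viewer_count.observed"
  occurredAt: string; // ISO-8601 timestamp
  payload: Record<string, unknown>;
}

interface OutboxRow { event_id: string; source: string; type: string; payload: string }
interface DeliveryRow { event_id: string; target_env: IngestTargetEnv; status: "pending" }

// One outbox row per event, one pending delivery row per target.
// An empty target list yields an outbox row with nothing to deliver.
function planOutboxWrite(
  event: DomainEvent,
  targets: IngestTargetEnv[],
): { outbox: OutboxRow; deliveries: DeliveryRow[] } {
  const outbox: OutboxRow = {
    event_id: event.id,
    source: event.source,
    type: event.type,
    payload: JSON.stringify(event.payload),
  };
  const deliveries = targets.map((target_env): DeliveryRow => ({
    event_id: event.id,
    target_env,
    status: "pending",
  }));
  return { outbox, deliveries };
}
```

Under this reading, a call with `[]` accumulates outbox rows only, while a later call with `["stage"]` adds one pending delivery row per event.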

Verification: stage worker writes envelopes to vf.raw_events and ingest.ingest_outbox; rows accumulate but nothing consumes them yet. Confirm via:

```sql
-- vf-pg-dm
SELECT count(*), max(created_at) FROM ingest.ingest_outbox;
-- ClickHouse
SELECT count() FROM vf.raw_events WHERE source = 'kick' AND type = 'kick.viewer_count.observed';
```

This is the most useful dry-run gate — every code path that should produce a DomainEvent is observable, side-effect-free, and replayable, before any consumer or new tier ships.

PR 3 — Fan-out worker (dry-run mode, no targets)

Files:

  • api/src/ingest-fanout.ts — pg-boss worker registered against vf-pg-dm's pgboss schema. Polls ingest.ingest_outbox_delivery WHERE status = 'pending' (none yet — there are no rows because PR 2 didn't insert into ingest_outbox_delivery, only ingest_outbox). Initially a no-op.
  • operations/clusters/prod/data/releases/ingest.yaml — new Helm release for dist/ingest-server.js, replicas=1. Loads a dm-migrator-database-url-style secret plus a new INGEST_TARGET_ENVS config env. Annotated with a flux-system:ingest ImagePolicy, which needs a new policy + repository that mirror the existing pattern.
  • operations/clusters/prod/system/policies/ingest.yaml, operations/clusters/prod/system/sources/images/ingest.yaml — new ImagePolicy / ImageRepository pair.

Verification: deploy reconciles, pod runs idle, no errors. No deliveries because PR 2 only writes the outbox row, not delivery rows yet.

PR 4 — Per-env /internal/events + projector modules

Files:

  • api/src/internal-events.ts: POST /internal/events handler, which does the following:
    • HMAC verify against INGEST_DELIVERY_SECRET (per-env).
    • Idempotency check against new public.processed_events(event_id PRIMARY KEY, processed_at) table.
    • Dispatch to the right projector based on (source, type).
  • api/migrations/0124_processed_events.sql — the dedupe table.
  • api/src/projectors/ — one file per (source, type). First one: kick_viewer_count.ts — reads the envelope, updates streamer_channels + kick_streamer_probe exactly as kick_viewer_poll.ts does today. Same SQL, just moved.
  • api/src/server.ts — wire POST /internal/events into the route table. Reject when missing HMAC header.
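
The three handler steps (HMAC verify, idempotency check, dispatch) can be sketched as below. This is an illustration under stated assumptions: the signature is taken to be a hex HMAC-SHA256 of the raw body, the envelope fields and status codes are hypothetical, and an in-memory Set stands in for public.processed_events.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical envelope fields; the real DomainEvent type may differ.
interface Envelope { id: string; source: string; type: string; payload: unknown }
type Projector = (e: Envelope) => void;

function sign(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Constant-time comparison; a missing or wrong-length signature is rejected.
function verify(secret: string, rawBody: string, signature: string | undefined): boolean {
  if (!signature) return false;
  const expected = Buffer.from(sign(secret, rawBody), "hex");
  const given = Buffer.from(signature, "hex");
  return given.length === expected.length && timingSafeEqual(given, expected);
}

function handleInternalEvent(
  secret: string,
  rawBody: string,
  signature: string | undefined,
  processed: Set<string>,             // stand-in for public.processed_events
  projectors: Map<string, Projector>, // keyed by `${source}/${type}`
): 200 | 401 | 422 {
  if (!verify(secret, rawBody, signature)) return 401;
  const envelope = JSON.parse(rawBody) as Envelope;
  if (processed.has(envelope.id)) return 200; // duplicate delivery: acknowledge, do nothing
  const projector = projectors.get(`${envelope.source}/${envelope.type}`);
  if (!projector) return 422; // no projector registered for this (source, type)
  projector(envelope);
  processed.add(envelope.id);
  return 200;
}
```

Against Postgres, the dedupe step would more likely be an `INSERT ... ON CONFLICT DO NOTHING` into processed_events in the same transaction as the projector's writes, so that a crash between the two cannot mark an event processed without applying it.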

Verification: deploy stage api. POST a synthetic envelope to /internal/events from a sandbox pod with a valid HMAC. Confirm streamer_channels row gets touched. No actual fan-out traffic yet because PR 2's worker still does both the dual-write and the direct DB write — the projector path is just available.

PR 5 — Enable fan-out for stage

Files:

  • api/src/kick_viewer_poll.ts — extend appendOutbox(env, evt, ["stage"]) so PR 2's outbox rows now have a corresponding delivery row.
  • api/src/ingest-fanout.ts — flip the no-op into the real delivery loop. POST envelopes to STAGE_INTERNAL_EVENTS_URL (set on the ingest pod's env). HMAC sign. Retry with backoff. Mark delivered / dead.
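
The "retry with backoff, mark delivered / dead" part of the delivery loop could be modelled as a small state transition. This is a sketch only: the backoff constants, the attempt limit, and the `Delivery` shape are assumptions, not values taken from the design.

```typescript
type DeliveryStatus = "pending" | "delivered" | "dead";

// Hypothetical in-memory view of one ingest_outbox_delivery row.
interface Delivery { attempts: number; status: DeliveryStatus; nextAttemptAt: number }

// Exponential backoff: 1s, 2s, 4s, ... capped at 60s (assumed constants).
function backoffMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Applies the outcome of one POST attempt to a delivery row.
function applyAttempt(d: Delivery, ok: boolean, now: number, maxAttempts = 8): Delivery {
  const attempts = d.attempts + 1;
  if (ok) return { attempts, status: "delivered", nextAttemptAt: now };
  if (attempts >= maxAttempts) return { attempts, status: "dead", nextAttemptAt: now };
  return { attempts, status: "pending", nextAttemptAt: now + backoffMs(d.attempts) };
}
```

Keeping the transition pure makes it easy to unit-test separately from the HTTP call and the PG update. Adding random jitter to `backoffMs` is a common refinement when many rows fail at once.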

Verification:

  • Stage's public.streamer_channels updated twice per probe: once by the worker's direct write (PR 2 still has it), once by the projector via /internal/events. Look for one processed_events row per envelope id.
  • Watch ingest_outbox_delivery drain to delivered within seconds.
  • Read-side queries return identical results (the projector is doing a no-op overwrite — same SQL, same data).

This is the "fan-out is working" checkpoint. Stage soaks here for at least 24h before PR 6.

PR 6 — Strip the pollers from stage's worker

Files:

  • api/src/worker.ts — remove the cron registrations for viewer-poll, follower-poll, X poll, trust-score, sheets-sync, etc. Keep only env-local jobs (pg-boss queue consumers).
  • operations/clusters/prod/stage/releases/worker.yaml — strip the env vars the removed cron handlers used (mostly the upstream creds — KICK_CLIENT_*, SCRAPER_API_KEY, SCRAPEDO_TOKEN).
  • api/src/ingest-server.ts — register the same cron handlers here, talking to the same upstream APIs.

Verification: stage worker pod restarts without the cron jobs. kubectl logs shows no more [viewer-poll] lines on stage. Meanwhile kubectl --context vf -n data logs deploy/ingest shows them, and stage's streamer_channels.kick_synced_at continues to advance via the /internal/events projector.
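
One way to keep worker.ts and ingest-server.ts from both registering the same poller is a single registry that tags each job with its tier. The sketch below is hypothetical: the ingest job names follow the list above ("x-poll" is a guessed identifier for the X poll), and "email-send" is an invented example of an env-local queue consumer.

```typescript
// Hypothetical registry; job names are illustrative, not the real handler ids.
type Tier = "ingest" | "env-local";

const JOB_TIERS: Record<string, Tier> = {
  "viewer-poll": "ingest",
  "follower-poll": "ingest",
  "x-poll": "ingest",
  "trust-score": "ingest",
  "sheets-sync": "ingest",
  "email-send": "env-local",
};

// Each entry point registers only the jobs belonging to its own tier.
function jobsFor(tier: Tier): string[] {
  return Object.keys(JOB_TIERS)
    .filter((name) => JOB_TIERS[name] === tier)
    .sort();
}
```

With this shape, worker.ts would iterate `jobsFor("env-local")` and ingest-server.ts `jobsFor("ingest")`, so moving a job between tiers is a one-line change.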

PR 7 — Re-register external webhooks (CUTOVER, irreversible)

Steps (cluster + external, not git):

  1. Add a CFTunnel record for wh.verifluence.io pointing at a Gateway HTTPRoute in data/routes/.
  2. Update Kick OAuth App's redirect URL + webhook callback URL to https://wh.verifluence.io/wh/kick.
  3. Same for Sumsub.
  4. Same for X / Resend / ScraperAPI callbacks.
  5. Deploy webhooks-server.js as a second container inside the ingest Deployment (or a sibling Deployment in data).

Coordination checklist (paste in incident channel before starting):

  • [ ] Maintenance window scheduled (recommend 09:00–10:00 UTC on a low-traffic weekday).
  • [ ] All upstream admin UI credentials accessible.
  • [ ] Previous webhook URL preserved as a comment so revert is documented.
  • [ ] Stage validation script ready (curl + signed payload to both old and new URLs to confirm 401 vs 200).
  • [ ] Rollback plan: revert the Kick app's callback URL. Stage will silently miss live webhooks until the URL is restored — no data corruption, just a polling-only gap.

Why this is irreversible mid-flight: Kick's webhook endpoint configuration is global per OAuth app. There is no atomic flip. During the swap window, webhooks sent to the old URL bounce (404), and webhooks sent to the new URL before data's receiver is live bounce too. Plan for a 30-second outage of livestream.status.updated events. Viewer-poll covers the gap.

PR 8 — Strip the webhook receiver from stage

Files:

  • operations/clusters/prod/stage/releases/webhooks.yaml — delete. The HelmRelease reconciles to zero.
  • operations/clusters/prod/stage/routes/http/webhooks.yaml — delete the HTTPRoute.
  • DNS — wh-stage.verifluence.io record retired (cosmetic; can stay as a CNAME to wh.verifluence.io if anything stale points at it).

Verification: kubectl --context vf -n stage get deploy webhooks returns NotFound. Kick events keep flowing via the new tier into stage's /internal/events.

PR 9 — Replay tool + reconciliation

Files:

  • api/src/scripts/replay.ts — CLI. Reads from vf.raw_events filtered by --from/--to/--types, re-wraps each row into a fresh delivery envelope, POSTs to the target env's /internal/events. Useful when a projector bug is fixed and stage state needs rebuilding from a window.
  • api/src/ingest-reconciliation.ts — hourly pg-boss job. Scans ingest_outbox_delivery WHERE status = 'dead' AND attempted_at > now() - interval '7 days', and surfaces the rows in the admin UI for manual re-queue.
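
The core of the replay CLI (filter by window and type, then re-wrap) might look like the sketch below. The row shape and column names for vf.raw_events are assumptions, as is the choice to give each replayed envelope a new id so that the target's processed_events dedupe does not drop it.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical row shape; real column names and timestamp format may differ.
interface RawEventRow { event_id: string; source: string; type: string; observed_at: string; payload: string }
interface ReplayOptions { from: Date; to: Date; types?: string[] }
interface DeliveryEnvelope { id: string; replayOf: string; source: string; type: string; payload: unknown }

// Keeps rows inside [from, to) whose type matches --types (all types when omitted).
function selectForReplay(rows: RawEventRow[], opts: ReplayOptions): RawEventRow[] {
  return rows.filter((r) => {
    const t = Date.parse(r.observed_at);
    const inWindow = t >= opts.from.getTime() && t < opts.to.getTime();
    const typeOk = !opts.types || opts.types.indexOf(r.type) !== -1;
    return inWindow && typeOk;
  });
}

// Assumption: a fresh id per replayed envelope; the original id is kept for traceability.
function rewrap(row: RawEventRow): DeliveryEnvelope {
  return {
    id: randomUUID(),
    replayOf: row.event_id,
    source: row.source,
    type: row.type,
    payload: JSON.parse(row.payload),
  };
}
```

If the replay is instead meant to reuse the original event id, the matching processed_events rows would have to be cleared first; which of the two the tool adopts is a design decision this sketch does not settle.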

Verification: introduce a synthetic projector bug in stage, fix it, run replay against the affected window, and confirm stage tables converge. There is no production analogue; this is dev tooling.


Stop conditions / abort criteria

Pause the rollout at any of these and write a postmortem before continuing:

  • Stage streamer_channels.kick_synced_at drifts more than 5 minutes from now() for any active streamer (PR 5 expected drift: <30s).
  • ingest_outbox_delivery rows aging past 5 minutes in pending status for stage.
  • Kick rate-limit headroom (X-RateLimit-Remaining) drops below 20% of the limit on two consecutive polls.
  • Any Sumsub webhook delivery returns 5xx from our side for two consecutive callbacks (Sumsub retries only twice).
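
The four criteria above could be encoded as a single check that a dashboard or a cron job evaluates. The thresholds are the ones listed; the metric names and the way each metric is collected are hypothetical.

```typescript
// Hypothetical metric names; only the thresholds come from the list above.
interface RolloutMetrics {
  maxSyncLagSeconds: number;            // worst now() - kick_synced_at over active streamers
  oldestPendingSeconds: number;         // age of the oldest pending delivery row for stage
  kickRemainingRatio: [number, number]; // remaining / limit for the last two polls
  sumsubConsecutive5xx: number;         // consecutive 5xx responses sent to Sumsub
}

// Returns the reasons to pause the rollout; an empty list means keep going.
function trippedStopConditions(m: RolloutMetrics): string[] {
  const reasons: string[] = [];
  if (m.maxSyncLagSeconds > 300) reasons.push("kick_synced_at drift over 5 minutes");
  if (m.oldestPendingSeconds > 300) reasons.push("pending deliveries older than 5 minutes");
  if (m.kickRemainingRatio[0] < 0.2 && m.kickRemainingRatio[1] < 0.2) {
    reasons.push("Kick rate-limit headroom under 20% on two consecutive polls");
  }
  if (m.sumsubConsecutive5xx >= 2) reasons.push("two consecutive Sumsub 5xx responses");
  return reasons;
}
```

Returning reasons rather than a boolean gives the postmortem a ready-made record of which condition fired.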

Open questions

  1. One image or two? We could ship the ingest tier from the existing worker image with a different CMD, or build a dedicated ingest image. Two images add CI churn; one image keeps it cohesive. Recommendation: a dedicated image, because the ingest tier carries upstream credentials the per-env workers shouldn't need anymore, and the image manifest is the natural place to enforce that boundary.
  2. Where does the webhook receiver live in PR 7? Inline with the ingest Deployment (one pod, two ports) or as a sibling Deployment? Sibling is cleaner for scaling — ingest is CPU-light, webhooks can spike on a Kick raid. Two Deployments, both in data.
  3. Production launch coupling. When production rolls, its /internal/events URL gets added to INGEST_TARGET_ENVS. Production's worker also stops polling on day one — it's a pure consumer from the start. Document in the production-launch runbook.
  4. Deletion of streamer_trust_score_history PG mirror, etc. Per kick-external-schema Phase B — out of scope here.
