Ingest Tier Rollout — Execution Plan
Status: Proposed · May 2026
Execution plan for Phase 4.b of the Shared
dataNamespace rollout — the actualingestworkload that takes over external API ownership from each env's per-env worker. The design itself lives in External Data Ingestion; this doc is the PR-by-PR sequence for landing it safely against a live stage cluster.Pre-requisites already in place:
datanamespace,vf-pg-dm,chi-vfalias, dm-migrator (Phase 1–3 done).ingest.ingest_outbox+ingest.ingest_outbox_deliveryin vf-pg-dm (migrations-dm/0002_ingest_outbox.sql).vf.raw_eventsin ClickHouse (api/clickhouse/06_raw_events.sql).DomainEventenvelope type inapi/src/ingest.ts.
Scope summary
| Step | Repo | Touches running services? | Reversible? |
|---|---|---|---|
1. dist/ingest.js entry point | app | No (new binary) | Yes |
| 2. Ingest writer helpers + raw-event dual-write from worker | app | Yes — worker rolling deploy | Yes (revert) |
| 3. Outbox + fan-out worker (dry-run mode) | app + ops | Yes — new Deployment in data | Yes |
4. Per-env /internal/events + projector modules | app | Yes — api rolling deploy | Yes |
| 5. Enable fan-out for stage | ops | Yes — flag flip | Yes |
| 6. Strip pollers from stage's worker | ops | Yes — worker rolling deploy | Yes (revert HelmRelease) |
| 7. Re-register external webhooks | external (Kick / Sumsub admin UI) | Yes | No without coordinated downtime |
| 8. Strip webhook receiver from stage | ops | Yes — webhooks rolling deploy | Yes |
| 9. Replay tool + reconciliation job | app + ops | No (new tool, manual invocation) | Yes |
Steps 1–6 and 8 are routine rolling deploys. Step 7 is the only irreversible action — once Kick's OAuth app has been re-registered to point at wh.verifluence.io (the new ingress in data), there is no clean way back to the per-env webhook URLs without a registration round-trip and lost events. Treat as the cutover point.
PR-by-PR sequence
PR 1 — dist/ingest.js entry point (no behaviour change)
Files:
api/src/ingest-server.ts— new entry point. Mirrorsworker.ts's startup (env, db, pgboss) but registers only ingest-tier cron handlers (initially none — we leave the cron registration to PR 2).api/Dockerfile— new stageFROM base AS ingestwithCMD ["node", "dist/ingest-server.js"].api/package.json—npm run dev:ingest.
Verification: npm run build produces dist/ingest-server.js; the binary boots, connects to both DBs, exits cleanly on SIGTERM. No runtime use yet.
PR 2 — Ingest writer helpers + raw-event dual-write from worker
Files:
api/src/ingest-writer.ts—writeRawEvent(env, event: DomainEvent): Promise<void>andappendOutbox(env, event, targets: IngestTargetEnv[]): Promise<void>. Single transaction across CH (vf.raw_events) and PG (ingest.ingest_outbox+ingest.ingest_outbox_delivery). Usesenv.DM_DBand the existingchInserthelper.api/src/kick_viewer_poll.ts(and a couple of others incrementally) — after the existing PG/CH writes, also callwriteRawEventwith akick.viewer_count.observedenvelope. Still in the worker, still per-env.
Verification: stage worker writes envelopes to vf.raw_events and ingest.ingest_outbox; rows accumulate but nothing consumes them yet. Confirm via:
-- vf-pg-dm
SELECT count(*), max(created_at) FROM ingest.ingest_outbox;
-- ClickHouse
SELECT count() FROM vf.raw_events WHERE source = 'kick' AND type = 'kick.viewer_count.observed';This is the most useful dry-run gate — every code path that should produce a DomainEvent is observable, side-effect-free, and replayable, before any consumer or new tier ships.
PR 3 — Fan-out worker (dry-run mode, no targets)
Files:
api/src/ingest-fanout.ts— pg-boss worker registered againstvf-pg-dm'spgbossschema. Pollsingest.ingest_outbox_delivery WHERE status = 'pending'(none yet — there are no rows because PR 2 didn't insert intoingest_outbox_delivery, onlyingest_outbox). Initially a no-op.operations/clusters/prod/data/releases/ingest.yaml— new Helm release fordist/ingest-server.js, replicas=1. Loadsdm-migrator-database-url-style secret plus a newINGEST_TARGET_ENVSconfig env. Annotated with sameflux-system:ingestImagePolicy (new policy + repository — mirror the existing pattern).operations/clusters/prod/system/policies/ingest.yaml,operations/clusters/prod/system/sources/images/ingest.yaml— new ImagePolicy / ImageRepository pair.
Verification: deploy reconciles, pod runs idle, no errors. No deliveries because PR 2 only writes the outbox row, not delivery rows yet.
PR 4 — Per-env /internal/events + projector modules
Files:
api/src/internal-events.ts—POST /internal/eventshandler:- HMAC verify against
INGEST_DELIVERY_SECRET(per-env). - Idempotency check against new
public.processed_events(event_id PRIMARY KEY, processed_at)table. - Dispatch to the right projector based on
(source, type).
- HMAC verify against
api/migrations/0124_processed_events.sql— the dedupe table.api/src/projectors/— one file per(source, type). First one:kick_viewer_count.ts— reads the envelope, updatesstreamer_channels+kick_streamer_probeexactly askick_viewer_poll.tsdoes today. Same SQL, just moved.api/src/server.ts— wirePOST /internal/eventsinto the route table. Reject when missing HMAC header.
Verification: deploy stage api. POST a synthetic envelope to /internal/events from a sandbox pod with a valid HMAC. Confirm streamer_channels row gets touched. No actual fan-out traffic yet because PR 2's worker still does both the dual-write and the direct DB write — the projector path is just available.
PR 5 — Enable fan-out for stage
Files:
api/src/kick_viewer_poll.ts— extendappendOutbox(env, evt, ["stage"])so PR 2's outbox rows now have a corresponding delivery row.api/src/ingest-fanout.ts— flip the no-op into the real delivery loop. POST envelopes toSTAGE_INTERNAL_EVENTS_URL(set on the ingest pod's env). HMAC sign. Retry with backoff. Markdelivered/dead.
Verification:
- Stage's
public.streamer_channelsupdated twice per probe: once by the worker's direct write (PR 2 still has it), once by the projector via/internal/events. Look for oneprocessed_eventsrow per envelope id. - Watch
ingest_outbox_deliverydrain todeliveredwithin seconds. - Read-side queries return identical results (the projector is doing a no-op overwrite — same SQL, same data).
This is the "fan-out is working" checkpoint. Stage soaks here for at least 24h before PR 6.
PR 6 — Strip the pollers from stage's worker
Files:
api/src/worker.ts— remove the cron registrations for viewer-poll, follower-poll, X poll, trust-score, sheets-sync, etc. Keep only env-local jobs (pg-boss queue consumers).operations/clusters/prod/stage/releases/worker.yaml— strip the env vars the removed cron handlers used (mostly the upstream creds —KICK_CLIENT_*,SCRAPER_API_KEY,SCRAPEDO_TOKEN).api/src/ingest-server.ts— register the same cron handlers here, talking to the same upstream APIs.
Verification: stage worker pod restarts without the cron jobs. kubectl logs shows no more [viewer-poll] lines on stage. Meanwhile kubectl --context vf -n data logs deploy/ingest shows them, and stage's streamer_channels.kick_synced_at continues to advance via the /internal/events projector.
PR 7 — Re-register external webhooks (CUTOVER, irreversible)
Steps (cluster + external, not git):
- Add a CFTunnel record for
wh.verifluence.iopointing at a Gateway HTTPRoute indata/routes/. - Update Kick OAuth App's redirect URL + webhook callback URL to
https://wh.verifluence.io/wh/kick. - Same for Sumsub.
- Same for X / Resend / ScraperAPI callbacks.
- Deploy
webhooks-server.jsas a second container inside theingestDeployment (or a sibling Deployment indata).
Coordination checklist (paste in incident channel before starting):
- [ ] Maintenance window scheduled (recommend 09:00–10:00 UTC on a low-traffic weekday).
- [ ] All upstream admin UI credentials accessible.
- [ ] Previous webhook URL preserved as a comment so revert is documented.
- [ ] Stage validation script ready (curl + signed payload to both old and new URLs to confirm 401 vs 200).
- [ ] Rollback plan: revert the Kick app's callback URL. Stage will silently miss live webhooks until the URL is restored — no data corruption, just a polling-only gap.
Why this is irreversible mid-flight: Kick's webhook endpoint configuration is global per OAuth app. There is no atomic flip. During the swap window, webhooks sent to the old URL bounce (404), and webhooks sent to the new URL before data's receiver is live bounce too. Plan for a 30-second outage of livestream.status.updated events. Viewer-poll covers the gap.
PR 8 — Strip the webhook receiver from stage
Files:
operations/clusters/prod/stage/releases/webhooks.yaml— delete. The HelmRelease reconciles to zero.operations/clusters/prod/stage/routes/http/webhooks.yaml— delete the HTTPRoute.- DNS —
wh-stage.verifluence.iorecord retired (cosmetic; can stay as a CNAME towh.verifluence.ioif anything stale points at it).
Verification: kubectl --context vf -n stage get deploy webhooks returns NotFound. Kick events keep flowing via the new tier into stage's /internal/events.
PR 9 — Replay tool + reconciliation
Files:
api/src/scripts/replay.ts— CLI. Reads fromvf.raw_eventsfiltered by--from/--to/--types, re-wraps each row into a fresh delivery envelope, POSTs to the target env's/internal/events. Useful when a projector bug is fixed and stage state needs rebuilding from a window.api/src/ingest-reconciliation.ts— hourly pg-boss job. Scansingest_outbox_delivery WHERE status = 'dead' AND attempted_at > now() - 7d, surfaces them in the admin UI for manual re-queue.
Verification: bury a synthetic projector bug in stage, run replay against the window, confirm stage tables converge. No production analogue — this is dev tooling.
Stop conditions / abort criteria
Pause the rollout at any of these and write a postmortem before continuing:
- Stage
streamer_channels.kick_synced_atdrifts more than 5 minutes fromnow()for any active streamer (PR 5 expected drift: <30s). ingest_outbox_deliveryrows aging past 5 minutes inpendingstatus for stage.- Kick rate-limit headers (
X-RateLimit-Remaining) drop below 20% across two consecutive polls. - Any Sumsub webhook delivery returns 5xx from our side for two consecutive callbacks (Sumsub retries only twice).
Open questions
- One image or two? Could ship the ingest tier from the existing
workerimage with a differentCMD, or build a dedicatedingestimage. Two images adds CI churn; one image keeps it cohesive. Recommendation: dedicated image — the ingest tier carries upstream credentials the per-env workers shouldn't need anymore, and the image manifest is the natural place to enforce that boundary. - Where does the webhook receiver live in PR 7? Inline with the ingest Deployment (one pod, two ports) or as a sibling Deployment? Sibling is cleaner for scaling — ingest is CPU-light, webhooks can spike on a Kick raid. Two Deployments, both in
data. - Production launch coupling. When production rolls, its
/internal/eventsURL gets added toINGEST_TARGET_ENVS. Production's worker also stops polling on day one — it's a pure consumer from the start. Document in the production-launch runbook. - Deletion of
streamer_trust_score_historyPG mirror, etc. Per kick-external-schema Phase B — out of scope here.
Related docs
- Shared
dataNamespace — the infrastructure foundation - External Data Ingestion — the design (not this execution plan)
- Datamining Schema (dm) — the dm schema this builds on