Skip to content

Datamining Schema (dm) — Streamer Channel as First-Class Entity

Status: Proposed · May 2026 (revised — channel-first design)

Move all environment-agnostic streaming-platform data into a dedicated dm schema. Promote the channel itself (dm.streamer_channel) to a first-class entity with its own stable PK. Reduce public.streamer_channels to a thin per-env link table (public.streamer_channel_link) that maps Verifluence streamer accounts to one or more canonical channels via soft references.

Enables the External Data Ingestion proposal: with dm.* sitting cleanly outside public.* and the dependency arrow inverted, the schema becomes physically movable to a shared ingest tier without dragging per-env tables along — and without needing FDW gymnastics for hot-path joins.


Problem

Today the streaming-platform tables sit in three different schemas (public., scrape., webhooks.), keyed by kick_user_id text everywhere, with the human-readable identifier scattered under three different column names (channel_name, channel_slug, slug). And the public.streamer_channels table does two unrelated jobs in one row. That has four concrete costs:

1. Identification is awkward

External APIs (admin, support, ad-hoc SQL) overwhelmingly know streamers by their handleninja, trainwreckstv — not their Kick numeric ID. Every existing lookup pays for a streamer_channels round-trip:

sql
-- "Show me the last 10 stream sessions for username 'trainwreckstv'"
SELECT s.* FROM kick_stream_sessions s
JOIN streamer_channels c ON c.kick_user_id = s.kick_user_id
WHERE c.channel_name = 'trainwreckstv'    -- per-env table!
ORDER BY s.stream_started_at DESC LIMIT 10;

The handle lookup crosses into a per-env table even though the answer is environment-agnostic. A Kick stream session exists whether or not any Verifluence streamer in this env has claimed it.

2. Boundary lives in convention, not in the schema

scrape.kick_raw, public.kick_streamer_probe, public.kick_stream_sessions, public.kick_chat_messages, etc. are all env-agnostic — see the data-model audit. But they share the public. namespace with per-env tables (streamers, campaigns, deals). The boundary is only visible to developers who happen to know.

When we split production from stage, "what gets duplicated vs. what's shared" becomes a per-table judgment call. We want it to be WHERE schema = 'dm'.

3. streamer_channels does two jobs in one row

The current table holds both:

sql
-- Half A: per-env claim
streamer_id          → public.streamers
verified, created_at,

-- Half B: denormalised cache of Kick profile data
channel_name, kick_display_name, kick_avatar_url, kick_bio,
kick_follower_count, kick_is_live, kick_subscriber_count, kick_synced_at

Half B is env-agnostic — it describes the Kick channel, which exists independently of any Verifluence env. Half A is the only thing that ties the row to an env. Bundling them means every change to the Kick profile rewrites a per-env row, every per-env streamer claim drags Kick-profile columns along, and queries that need only one half pay the cost of carrying the other.

4. Renames are silently lossy

channel_name on streamer_channels is the streamer's last-known slug. Kick lets users rename. We update channel_name on resolution, but the previous slug is gone. Historical chat lines reference channel_slug literally and can't be tied back to today's streamer. No rename history is captured.

5. Multi-platform support is bolted on

Today streamer_channels.platform can be 'kick' or 'twitch' (and X-follower polling piggybacks on the same table), but the schema is shaped around Kick — kick_user_id, kick_display_name, etc. Adding Twitch scraping would mean either new columns on the same table, a parallel twitch_streamer_channels table, or a refactor — none of them clean.


Proposed schema

The channel is the first-class entity. It has its own stable PK and identifier. Per-env streamer accounts point at it through a thin link table — many-to-many in principle, one-to-one in practice for most streamers.

sql
CREATE SCHEMA dm;

-- ═══ Canonical channel identity ═══════════════════════════════════════
-- One row per external streaming-platform account ever observed.
-- Stable internal PK (bigserial) — this is what every other dm.* table
-- references. The platform-supplied user-id stays as a text field for
-- backward-compat reads but is no longer the join key.
CREATE TABLE dm.streamer_channel (
  id                   bigserial    PRIMARY KEY,
  external_id          text         NOT NULL UNIQUE,         -- ULID, for cross-system refs (events, audit)
  platform             text         NOT NULL,                -- 'kick' | 'twitch' | 'x'
  platform_user_id     text         NOT NULL,                -- kick_user_id / twitch user_id / x user_id
  current_slug         text         NOT NULL,
  display_name         text,
  avatar_url           text,
  bio                  text,
  follower_count       integer,
  is_live              boolean      NOT NULL DEFAULT false,
  is_banned            boolean      NOT NULL DEFAULT false,
  first_seen_at        timestamptz  NOT NULL DEFAULT now(),
  last_seen_at         timestamptz  NOT NULL DEFAULT now(),
  UNIQUE (platform, platform_user_id),
  UNIQUE (platform, current_slug),
  CHECK  (current_slug = lower(current_slug))
);

CREATE INDEX idx_channel_platform_slug ON dm.streamer_channel (platform, current_slug);

-- ═══ Rename history ═══════════════════════════════════════════════════
-- Append-only. Every observed change of current_slug writes a row here.
-- Historical chat lines + scrape rows can be tied back to a channel
-- even when the slug was different at the time.
CREATE TABLE dm.streamer_channel_slug_history (
  channel_id           bigint       NOT NULL REFERENCES dm.streamer_channel ON DELETE CASCADE,
  slug                 text         NOT NULL,
  observed_from        timestamptz  NOT NULL,
  observed_until       timestamptz,                           -- NULL = current
  PRIMARY KEY (channel_id, observed_from),
  CHECK (slug = lower(slug))
);

CREATE INDEX idx_slug_history_slug ON dm.streamer_channel_slug_history (slug);

public.streamer_channels is replaced by public.streamer_channel_link — a thin mapping. A VF streamer can claim multiple channels (one per platform, or even multiple per platform if needed for, say, alt accounts).

sql
CREATE TABLE public.streamer_channel_link (
  streamer_id          integer      NOT NULL REFERENCES public.streamers ON DELETE CASCADE,
  channel_id           bigint       NOT NULL,                 -- soft ref → dm.streamer_channel.id
  verified             boolean      NOT NULL DEFAULT false,
  claimed_at           timestamptz  NOT NULL DEFAULT now(),
  PRIMARY KEY (streamer_id, channel_id)
);

-- While same-DB, enforce the FK. Drop it when dm moves to its own DB.
ALTER TABLE public.streamer_channel_link
  ADD CONSTRAINT streamer_channel_link_channel_fkey
  FOREIGN KEY (channel_id) REFERENCES dm.streamer_channel(id) NOT VALID;

-- A channel can only be verified-claimed by one streamer at a time.
CREATE UNIQUE INDEX uq_verified_channel
  ON public.streamer_channel_link (channel_id)
  WHERE verified = TRUE;

The link table is per-env by construction — it FKs into public.streamers. The reference to dm.streamer_channel is intentionally a soft pointer: while same-DB it's enforced by FK; after the future DB split it's enforced by the channel-link upsert flow (one transaction in dm to upsert the channel, then a domain event into the env DB to insert the link). Either way the application is responsible for keeping the relationship coherent — which it already had to be.

Tables that move into dm

Every dm.* descendant now keys on channel_id bigint instead of kick_user_id text. Smaller indexes, uniform across platforms, type-safe.

FromToKey columnNotes
scrape.kick_rawdm.scrape_logchannel_id (NULL allowed for pre-resolution)Raw HTTP audit log
webhooks.incomingsplit — see "Webhook inbox" belown/aToday does three jobs; we split
public.kick_streamer_probedropped — moved to ClickHouse onlyn/aSee "Probe table → ClickHouse" below
public.kick_streamer_probe_scheduledm.probe_schedulechannel_id PKWhen to next probe each channel
public.kick_stream_sessionsdm.stream_sessionchannel_idOne row per live session (the canonical "stream")
public.kick_chat_messagesdropped — moved to ClickHouse onlyn/aSee "Chat messages → ClickHouse" below
public.kick_chat_s3_archive_logdm.chat_archive_log(none — bookkeeping)S3 export ledger; stays PG
public.kick_event_subscriptionsdm.event_subscriptionchannel_idPlatform's subscription IDs

Each descendant gets an indexed FK to dm.streamer_channel(id). The dm.* tables stop carrying any denormalised platform-specific identifiers — kick_user_id and channel_slug columns drop. Slug rendering goes through one JOIN to dm.streamer_channel.current_slug (cheap) or through dm.streamer_channel_slug_history for historical accuracy.

Why this is better

  1. Channel is a first-class entity with its own identifier. A channel_id bigint is stable, internal, and uniform across platforms. Kick renames a user → current_slug changes, channel_id doesn't. Twitch GUIDs and X snowflake IDs collapse to the same column. Indexes get smaller.

  2. external_id text (ULID) is the cross-system identifier for domain events and external audit. Internal bigserial stays implementation detail.

  3. UNIQUE (platform, platform_user_id) and UNIQUE (platform, current_slug) at the dm layer give canonical-row guarantees the old design couldn't enforce without scattering CHECK constraints across many tables.

  4. The link table is genuinely thin. Per-env data shrinks to "this streamer claims that channel, here's when, here's whether we verified it." Nothing platform-specific.

  5. DB split becomes free. When dm moves to its own database (the ingestion-tier proposal), the soft reference from streamer_channel_link.channel_id survives without FDW. Hot-path joins on the env side don't need to cross databases because the env side carries nothing platform-specific to merge in — rendering the channel card pre-joins on the dm side.

  6. Multi-platform support is native, not bolted on. Adding Twitch = INSERT INTO dm.streamer_channel (platform='twitch', …). Probes, sessions, chat, subscriptions all get the Twitch row for free with the same shape. X follower polling can move under this umbrella when it's scoped properly.


Probe table → ClickHouse (drop the Postgres copy)

public.kick_streamer_probe is the per-minute viewer/follower snapshot — append-only, no FK in or out, hot growth (~150k rows/hour at the current cohort, dominates OLTP WAL). The team already dual-writes every probe to vf.kick_streamer_probe in ClickHouse (api/src/kick_viewer_poll.ts:279chInsert("kick_streamer_probe", …)).

A code audit of every PG read:

SitePatternCan CH serve it?
kick_webhook.ts:752 — 7-day follower history (≤ 1008 rows)Range scan on (platform_user_id, recorded_at)✅ trivial
kick_webhook.ts:1839 — per-session per-minute viewer chartEquality on (platform_user_id, stream_started_at)✅ trivial
kick_viewer_poll.ts:438closeStaleStreamSessions finds last probe per sessionMAX(recorded_at) per channel✅ trivial (or skip CH: read dm.probe_schedule.last_probed_at)
applications.ts:804 — resolve a submitted stream URL to a stream_session by overlapping probe rangeRange scan + window match✅ rewrite as two queries (CH for window, PG for session)
backfill-clickhouse.ts:120 — explicit PG → CH backfill toolReads PG❌ obviously goes away; tool deletes

Nothing requires PG transactional semantics with public.*. Trust-score reads dm.stream_session (the aggregate) — not the raw probe.

Decision: drop public.kick_streamer_probe entirely. ClickHouse becomes the single source of truth. The PG table was always a hot-path mirror written defensively while CH was new; that defence is no longer needed.

What changes:

  • kick_viewer_poll.ts — keep only the chInsert call; remove the PG INSERT. ~5–10k writes/hour avoided.
  • The four PG read paths repoint to CH via a new helper channelProbeQuery(env, channelId, …) in api/src/clickhouse.ts. The helper resolves channel_id(platform, platform_user_id) via a tiny PG lookup, then queries CH.
  • Migration 0123 drops the PG table at the end of the transaction. CH already carries 12 months of history.
  • backfill-clickhouse.ts kick_streamer_probe deletes.
  • closeStaleStreamSessions reads dm.probe_schedule.last_probed_at (already updated on every probe), removing the cross-system join entirely.

dm.probe_schedule stays in Postgres. Small mutable bookkeeping (one row per channel with next_probe_at); PG handles point updates with FK-safe semantics. ≤ 50k rows.

So the final picture for "where is probe data?":

  • Append-only telemetry: vf.kick_streamer_probe (ClickHouse) — sole copy
  • Polling bookkeeping: dm.probe_schedule (Postgres)
  • Session aggregate: dm.stream_session (Postgres)

ClickHouse keeps its existing (env, platform, platform_user_id, ts) ordering — no CH schema change. The dm.streamer_channel mapping happens entirely on the PG side before the CH query.


Trust score history → ClickHouse (drop the Postgres copy)

public.streamer_trust_score_history is the second PG↔CH dual-write. Same shape: keyed on (platform, platform_user_id), env-agnostic, append-only, no FK in or out. kick_trust_score.ts:564 inserts PG and :586 does chInsert("streamer_trust_score_history", …).

Read-site audit:

SitePatternCH-servable?
kick_webhook.ts:1322 — 90-day score history for the trend chartRange scan on (platform='kick', platform_user_id, computed_at)✅ trivial

That's the only production read. The cleanup function delete_streamer_cascade in 00_baseline.sql:254 did a DELETE keyed on platform_user_id — semantically wrong in a multi-env world (the Kick channel still exists after a per-env streamer is purged). Dropping the PG table removes the offending cleanup branch.

Write volume is small (~1 row/streamer/day) — the win is schema-shape consistency. Either both env-agnostic mirrors stay in PG or both go CH-only. Mixed is the worst of both worlds.

Decision: drop public.streamer_trust_score_history in the same migration as kick_streamer_probe.

What changes:

  • kick_trust_score.ts:564 — remove PG INSERT; keep the chInsert.
  • kick_webhook.ts:1322 — replace PG SELECT with channelTrustScoreHistory(env, channelId, 90) reading from CH.
  • delete_streamer_cascade (in baseline / follow-up migration) — drop the trust-score-history DELETE branch.
  • backfill-clickhouse.ts — delete both supported-table branches; tool retires. The only two options were kick_streamer_probe and streamer_trust_score_history.
  • Migration 0123 ends with DROP TABLE public.streamer_trust_score_history. CH has 24-month TTL.

public.streamer_trust_scores (the current score, distinct from history) stays in PG. Small (one row per streamer), heavily UPSERTed, joined into operator-facing leaderboard queries. After the inversion, it gains a channel_id column FK to dm.streamer_channel — or stays keyed on (platform, platform_user_id) if that's lower-disruption. Both shapes work.


Chat messages → ClickHouse (drop the Postgres copy)

Chat messages are the third PG→CH candidate. They're an inherently append-only event stream — every chat.message.sent webhook becomes one row that's never updated except for the exported boolean the export pipeline mutates as bookkeeping. The PG-specific operations (UPDATE exported = TRUE, bulk DELETE after 14 days) are workarounds for what CH expresses natively.

Chat is also anonymous: the streamer being monitored has no business relationship with the chatters whose messages we record, and there is no compliance-driven takedown workflow for chat content. That removes the only reason to keep a mutable PG row.

CH-native lifecycle

OpPG todayCH replacement
Skip-already-exported on next runUPDATE exported = TRUE WHERE id = ANY($1) per batchWatermark in a tiny PG table dm.chat_export_watermark(env, last_exported_at). Next run reads WHERE created_at > watermark. No row mutation.
RetentionDELETE … WHERE exported = TRUE AND created_at < now() - INTERVAL '14 days'TTL created_at + INTERVAL 30 DAY DELETE on the CH table. Background merges handle it.
Buffer-flush boundaryPG transactional reads see committed rows immediatelyExport job's upper bound is now() - INTERVAL 1 HOUR (already true today for stream-session matching) — well within CH's flush latency.

Schema

sql
-- ClickHouse
CREATE TABLE vf.kick_chat_message (
  env               LowCardinality(String)         NOT NULL,
  channel_id        UInt64                         NOT NULL,
  channel_slug      LowCardinality(String)         NOT NULL,    -- denormalised for query convenience
  message_id        String                         NOT NULL,
  platform_user_id  String                         NOT NULL,    -- the chatter, not the streamer
  username          LowCardinality(String),
  content           String,
  emotes_json       String  DEFAULT '[]',
  badges_json       String  DEFAULT '[]',
  created_at        DateTime                       NOT NULL,
  inserted_at       DateTime                       DEFAULT now()
)
ENGINE = MergeTree
PARTITION BY (env, toYYYYMMDD(created_at))
ORDER BY (env, channel_id, created_at)
TTL created_at + INTERVAL 30 DAY DELETE;

-- Postgres — tiny side table, the only PG residue for chat
CREATE TABLE dm.chat_export_watermark (
  env               text         PRIMARY KEY,
  last_exported_at  timestamptz  NOT NULL DEFAULT '1970-01-01'
);

dm.chat_archive_log (the S3 export ledger) stays — it's a tiny per-batch bookkeeping table, naturally PG.

Export path rewrite

ts
// kick_chat_export.ts — hourly
const { last_exported_at } = await env.DB`
  SELECT last_exported_at FROM dm.chat_export_watermark WHERE env = ${env}
`;
const upperBound = new Date(Date.now() - 60 * 60_000);   // now − 1h

const rows = await chQuery<ChatRow>(`
  SELECT * FROM vf.kick_chat_message
  WHERE env = '${env}'
    AND created_at >  toDateTime(${last_exported_at.getTime() / 1000})
    AND created_at <= toDateTime(${upperBound.getTime() / 1000})
  ORDER BY created_at
  LIMIT 500000
`);

if (rows.length) {
  const maxSeen = rows[rows.length - 1].created_at;
  await streamToS3(rows);
  await env.DB`INSERT INTO dm.chat_archive_log (rows_exported, files_written, success) VALUES (${rows.length}, ${files}, TRUE)`;
  await env.DB`UPDATE dm.chat_export_watermark SET last_exported_at = ${maxSeen} WHERE env = ${env}`;
}

No exported flag. No bulk DELETE in app code. The 50-ish lines of UPDATE/DELETE bookkeeping in the current kick_chat_export.ts shrink to ~20 lines.

Read paths

Site todayAfter CH-only
kick_webhook.ts:589 daily chat histogram (30 days)One CH query: SELECT toDate(created_at), count() FROM vf.kick_chat_message WHERE channel_id = ? GROUP BY toDate(created_at). Faster than the PG version.
kick_webhook.ts:1846 per-session per-minute chat aggregationTwo-step: fetch session bounds from PG (dm.stream_session), then CH aggregation GROUP BY toStartOfMinute(created_at). App merges with probe data — same code path that already does multi-CH-query merge after the probe move.
kick_webhook.ts:546 24h chat totalsCH WHERE channel_id = ? AND created_at >= now() - INTERVAL 24 HOUR.

All three read patterns map natively to CH columnar scans.

Why this wins

  • Storage 10–20× smaller via CH LZ4 compression. Chat is the highest-write PG table after probe; moving it sheds another ~50 MB/day of WAL plus indexes.
  • Write throughput decouples from webhook latency — chInsert is async-buffered. Giveaway/raid spikes are absorbed by CH without backpressuring the webhook handler.
  • Analytical reads (histograms, per-minute aggregations) run 5–10× faster in CH than in PG.
  • Schema-shape consistency — all append-only telemetry (probe, trust-score history, webhook audit, chat) now lives in CH. The remaining PG tables in dm are all small mutable bookkeeping: streamer_channel, probe_schedule, stream_session, event_subscription, webhook_message_ids, kick_follow_event, chat_archive_log, chat_export_watermark, slug_history, scrape_log.

Costs

  • Eventual consistency (~minutes) on writes. Acceptable for chat; export job's hour-aligned upper bound absorbs this.
  • Per-session chart needs two queries instead of one PG join — but the same code path already issues two CH queries (session + probe) after the probe move. Chat just becomes a third.
  • 30-day TTL is a hard retention floor. If the S3 export pipeline lags more than 30 days, that batch is lost. Alert on watermark lag; failure mode is best-effort archival (acceptable — chat is not a transactional record).

Decision: drop public.kick_chat_messages in the same migration as kick_streamer_probe and streamer_trust_score_history. Data goes to vf.kick_chat_message (CH). The export ledger dm.chat_archive_log stays in PG; a new dm.chat_export_watermark table replaces the exported flag.


Webhook inbox — split into three concerns

webhooks.incoming today serves three distinct purposes in one table:

ConcernCurrent implementationRecommended
Idempotency — was message_id already processed? (kick_webhook.ts:261)SELECT id FROM webhooks.incoming WHERE message_id = $1 against a row that also holds the full payloadTiny dm.webhook_message_ids(source, message_id PK, received_at) — no payload. Drops the payload-bearing PG row off the hot path.
Raw audit log — full payload of every verified + every rejected webhook, 30-day retentionPG row with payload text + signature + verified boolMove to ClickHouse. Append-only, large rows, never read transactionally. Add vf.webhook_inbox(source, event_type, message_id, payload String, signature, verified UInt8, received_at). Same pattern as vf.sumsub_webhooks.
Domain-event projectionchannel.followed is only recoverable from webhooks.incoming; the per-session follow chart depends on it (kick_webhook.ts:1994)Reads raw rows back out of audit logProject at write time. Add dm.kick_follow_event(channel_id bigint, occurred_at timestamptz) — handler inserts one row per channel.followed event in the same path that writes the audit log. Per-session follow chart reads from this table directly.

Net effect:

  • PG webhook hot-path table shrinks from 11 columns to 3 — 1–2 orders of magnitude smaller.
  • Raw payloads land in CH alongside Sumsub webhooks. Compliance + replay queries are one CH query.
  • channel.followed follow counts read from a dedicated projected table — same query shape but smaller scan, and the dependency on raw audit retention disappears (you can shorten audit TTL without breaking the chart).
  • webhooks.incoming deletes after 0123 ships.

Stat-aggregation queries that today join webhooks.incoming to kick_event_subscriptions (line 538) move to CH too — they're analytical, not transactional.


API surface — channel-first, username-friendly

A new helper module api/src/streamer_channel.ts is the only place that needs to know how to translate between platform-user-id, slug, and the canonical channel_id:

ts
// Resolve (platform, slug) → channel_id. Cheap (indexed). Null if unknown.
export async function channelIdBySlug(
  env: Env, platform: string, slug: string
): Promise<bigint | null>;

// Resolve (platform, platform_user_id) → channel_id. Cheap.
export async function channelIdByPlatformUserId(
  env: Env, platform: string, platformUserId: string
): Promise<bigint | null>;

// Resolve channel_id → full row (current slug, display name, etc.).
export async function channel(env: Env, channelId: bigint): Promise<Channel | null>;

// Idempotent upsert. Called by every poller/webhook before inserting
// into a child table. Records a slug-rename in slug_history when the
// observed slug differs from current_slug.
export async function upsertChannel(
  env: Env,
  args: {
    platform: string;
    platformUserId: string;
    slug: string;
    displayName?: string;
    avatarUrl?: string;
    followerCount?: number;
    isLive?: boolean;
  }
): Promise<bigint>;          // returns channel_id

Every existing call site that built a SQL query of the form WHERE kick_user_id = $1 converts to WHERE channel_id = $1 once the dm.* descendants are switched. A transition helper kickUserIdToChannelId(env, kuid) covers the brief overlap window.

HTTP

Every endpoint that previously took ?kick_user_id=… learns to also accept ?username=… and ?channel_id=…:

ts
const channelId =
     url.searchParams.has("channel_id")     ? BigInt(url.searchParams.get("channel_id")!)
  :  url.searchParams.has("username")       ? await channelIdBySlug(env, "kick", url.searchParams.get("username")!)
  :  url.searchParams.has("kick_user_id")   ? await channelIdByPlatformUserId(env, "kick", url.searchParams.get("kick_user_id")!)
  :  null;

New routes can take username exclusively. Admin/support tooling stops needing to know about numeric platform IDs at all.


Migration

Two-phase migration. Phase A is purely additive (new structure alongside old). Phase B switches readers/writers and drops the old.

sql
-- 0123_dm_streamer_channel.sql

BEGIN;

CREATE SCHEMA dm;

-- ═══ Phase A — add new structure ═════════════════════════════════════

CREATE TABLE dm.streamer_channel (...);
CREATE TABLE dm.streamer_channel_slug_history (...);

-- Backfill from every observed kick_user_id. Source priority:
--   1. streamer_channels  (we have a resolved slug)
--   2. kick_stream_sessions.channel_name
--   3. (chat lives in CH now — not a backfill source)
--   4. kick_streamer_probe with a synthetic slug if all else missing
INSERT INTO dm.streamer_channel (external_id, platform, platform_user_id, current_slug,
                                  display_name, avatar_url, bio, follower_count,
                                  is_live, first_seen_at, last_seen_at)
SELECT gen_ulid(),
       'kick',
       sc.kick_user_id,
       COALESCE(sc.channel_name, 'unknown-' || sc.kick_user_id),
       sc.kick_display_name, sc.kick_avatar_url, sc.kick_bio,
       sc.kick_follower_count, COALESCE(sc.kick_is_live, false),
       MIN(sc.created_at),     -- approx; real history starts here
       MAX(COALESCE(sc.kick_synced_at, sc.created_at))
FROM   public.streamer_channels sc
WHERE  sc.kick_user_id IS NOT NULL AND sc.platform = 'kick'
GROUP  BY sc.kick_user_id, sc.channel_name, sc.kick_display_name,
          sc.kick_avatar_url, sc.kick_bio, sc.kick_follower_count, sc.kick_is_live;

-- Pick up channels we've observed in probe/session/chat but haven't claimed.
INSERT INTO dm.streamer_channel (external_id, platform, platform_user_id, current_slug, first_seen_at)
SELECT DISTINCT gen_ulid(), 'kick', kss.kick_user_id,
                COALESCE(kss.channel_name, 'unknown-' || kss.kick_user_id),
                MIN(kss.stream_started_at)
FROM   public.kick_stream_sessions kss
WHERE  kss.kick_user_id NOT IN (
         SELECT platform_user_id FROM dm.streamer_channel WHERE platform = 'kick'
       )
GROUP  BY kss.kick_user_id, kss.channel_name;

-- Seed slug_history with one row per channel covering the current slug.
INSERT INTO dm.streamer_channel_slug_history (channel_id, slug, observed_from)
SELECT id, current_slug, first_seen_at FROM dm.streamer_channel;

-- Create the per-env link table.
CREATE TABLE public.streamer_channel_link (
  streamer_id  integer  NOT NULL REFERENCES public.streamers ON DELETE CASCADE,
  channel_id   bigint   NOT NULL,
  verified     boolean  NOT NULL DEFAULT false,
  claimed_at   timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (streamer_id, channel_id)
);

ALTER TABLE public.streamer_channel_link
  ADD CONSTRAINT streamer_channel_link_channel_fkey
  FOREIGN KEY (channel_id) REFERENCES dm.streamer_channel(id) NOT VALID;

INSERT INTO public.streamer_channel_link (streamer_id, channel_id, verified, claimed_at)
SELECT sc.streamer_id, dsc.id, sc.verified, sc.created_at
FROM   public.streamer_channels sc
JOIN   dm.streamer_channel       dsc
  ON   dsc.platform = sc.platform AND dsc.platform_user_id = sc.kick_user_id
WHERE  sc.streamer_id IS NOT NULL;

-- ═══ Phase A.2 — move tables, add channel_id, backfill ═══════════════

ALTER TABLE scrape.kick_raw                  SET SCHEMA dm;
ALTER TABLE dm.kick_raw                      RENAME TO scrape_log;
ALTER TABLE dm.scrape_log                    ADD COLUMN channel_id bigint;
UPDATE dm.scrape_log sl SET channel_id = sc.id
  FROM dm.streamer_channel sc
  WHERE sc.platform = 'kick' AND sc.platform_user_id = sl.kick_user_id;

ALTER TABLE public.kick_streamer_probe_schedule  SET SCHEMA dm;
ALTER TABLE dm.kick_streamer_probe_schedule      RENAME TO probe_schedule;
ALTER TABLE dm.probe_schedule                    ADD COLUMN channel_id bigint;
UPDATE dm.probe_schedule ps SET channel_id = sc.id
  FROM dm.streamer_channel sc
  WHERE sc.platform = 'kick' AND sc.platform_user_id = ps.kick_user_id;
ALTER TABLE dm.probe_schedule ALTER COLUMN channel_id SET NOT NULL;
ALTER TABLE dm.probe_schedule DROP CONSTRAINT IF EXISTS probe_schedule_pkey;
ALTER TABLE dm.probe_schedule ADD PRIMARY KEY (channel_id);
ALTER TABLE dm.probe_schedule
  ADD CONSTRAINT probe_schedule_channel_fkey
  FOREIGN KEY (channel_id) REFERENCES dm.streamer_channel NOT VALID;
ALTER TABLE dm.probe_schedule VALIDATE CONSTRAINT probe_schedule_channel_fkey;

-- (Same pattern for kick_stream_sessions → dm.stream_session,
--  kick_chat_messages does NOT move — it's dropped in Phase B
--  after the kick_chat_export pipeline switches to CH reads.)

-- ═══ Phase B — switch code → drop the old ════════════════════════════
-- Deferred to a follow-up migration once all code sites have been
-- repointed. Phase B drops:
--   public.streamer_channels       (replaced by streamer_channel_link + dm.streamer_channel)
--   public.kick_streamer_probe     (CH only)
--   public.streamer_trust_score_history  (CH only)
--   webhooks.incoming              (split into dm.webhook_message_ids + vf.webhook_inbox + dm.kick_follow_event)
--   kick_user_id columns on dm.*   (channel_id is the join key now)

COMMIT;

Phase A is a single transaction that can run on a live database. Every dm.* descendant temporarily carries both kick_user_id and channel_id columns; the code switches its reads/writes over to channel_id incrementally; then Phase B drops kick_user_id and the old public.streamer_channels table. Two PRs, one to flip the schema and one to retire the old columns.

The NOT VALID + VALIDATE CONSTRAINT pattern lets us add FKs without an AccessExclusiveLock on child tables — kept for the large historical PG tables (stream_session, event_subscription, scrape_log) during the move. Chat is no longer a concern: it leaves PG entirely.


Code change scope

FileChangeLines
api/migrations/0123_dm_streamer_channel.sqlPhase A migration~250
api/migrations/0124_drop_legacy_kick_columns.sqlPhase B (after code repoint)~80
api/src/streamer_channel.tsNew helper module (channelIdBySlug, channelIdByPlatformUserId, upsertChannel, channel)~120
api/src/kick_viewer_poll.tsDrop PG probe INSERT (keep chInsert); upsertChannel returns channel_id; touch dm.probe_schedule.last_probed_at~25
api/src/kick_subscriptions.tsUse channelIdBySlug / channel(channelId) instead of poking streamer_channels.channel_name; subscription tables key on channel_id~35
api/src/kick_webhook.tsupsertChannel on every inbound event; resolve channel_id from payload; replace webhooks.incoming dedupe with dm.webhook_message_ids; project channel.followed into dm.kick_follow_event; raw audit goes to vf.webhook_inbox; 4 probe-read sites repoint to CH helper~100
api/src/kick_maintenance.tsDrop webhooks.incoming purge; add dm.webhook_message_ids TTL prune~5
api/src/kick_chat_export.tsRewrite: read from CH via watermark, write S3, advance dm.chat_export_watermark. The 50 lines of UPDATE/DELETE bookkeeping shrink to ~20 lines of watermark-driven read.-30 (net)
api/clickhouse/07_kick_chat_message.sqlNew CH schema for chat messages (MergeTree, 30-day TTL)~30
api/src/kick_trust_score.tsDrop PG streamer_trust_score_history INSERT; keep chInsert; reads dm.stream_session by channel_id~25
api/src/applications.tssession_submissions resolver: probe-window scan moves to CH helper (resolves channel_id first); PG session join uses channel_id~20
api/src/streamers.tsStreamer-channel linking: upsertChannel first, then INSERT into public.streamer_channel_link~25
api/src/clickhouse.tsNew helpers: channelProbeQuery, channelTrustScoreHistory, webhookInboxInsert; each takes a channel_id and resolves to (platform, platform_user_id) for the CH WHERE clause~70
api/src/backfill-clickhouse.tsDelete both kick_streamer_probe and streamer_trust_score_history branches (PG sources gone). Tool retires.-260
api/k8s/backfill-clickhouse-job.yamlDelete (tool is gone)-107
api/clickhouse/06_webhook_inbox.sqlNew CH schema for raw webhook archive~30
api/src/oauth.tsKick OAuth callback writes streamer_channel_link instead of streamer_channels; upsertChannel first~20
api/src/x_follower_poll.ts(Optional in 0123) Move X polling under dm.streamer_channel with platform='x'; or defer to a follow-up~30 or —
frontend/src/api/streamers.tsAdd username / channel_id query param support~15
admin/src/panels/StreamersPanel.tsxSearch by username without an explicit user-id lookup; channel rows render from dm.streamer_channel~10

Touches roughly 19 files. The bulk is the kick_user_idchannel_id propagation, which is mechanical: a typed helper covers the translation during the overlap window, then Phase B retires the old column.

Four PG table drops (in Phase B) — kick_streamer_probe, streamer_trust_score_history, kick_chat_messages, webhooks.incoming — plus streamer_channels itself. The backfill-clickhouse tool retires entirely. After Phase B, every append-only telemetry source lives in ClickHouse; the PG tables that remain in dm are all small mutable state.


Migration ordering relative to other in-flight work

This proposal slots cleanly after the v2 baseline (current 00_baseline.sql) and before the ingestion-tier extraction. Suggested ordering:

  1. Phase A — migration 0123: purely additive (new schema + new identity table + streamer_channel_link + table moves with dual kick_user_id/channel_id columns + FKs). Can ship to a live DB.
  2. Code repoints — same week: ~18 files, low-risk grep-and-replace. The dual-column overlap period means readers can switch incrementally.
  3. Phase B — migration 0124: drop kick_user_id columns from dm.* descendants, drop public.streamer_channels, drop public.kick_streamer_probe, drop public.streamer_trust_score_history, drop public.kick_chat_messages, drop webhooks.incoming. Ships once all writers and readers are on the new columns and kick_chat_export.ts has switched to the watermark-driven CH read.
  4. External Data Ingestion proposal: operates on cleaner ground — dm.* is a single namespace to physically extract, and the dependency arrow already points the right way (env reaches into dm via soft reference; nothing in dm references env-specific data).

Doing it in this order means the ingestion proposal's "move the env-agnostic schema to its own database" step becomes a single pg_dump --schema=dm plus dropping the FK on streamer_channel_link.channel_id — no application-level join rewrites needed.


Open questions

  1. Schema namedm (datamining). Short, fits the existing schema-naming style (auth, pgboss). Platform-neutral; future dm.sumsub_applicant_* etc. land in the same namespace.
  2. platform_user_id retention — keep dm.streamer_channel.platform_user_id for backward compat with CH (which still keys on it) and for OAuth lookups. Not a join key for dm.* descendants, just a stable identifier.
  3. Should streamer_videos move? No — curated profile content owned by the streamer (per-env), not scraped. After the inversion it FKs streamer_id only; no link to dm.streamer_channel needed.
  4. Backfilling slug history — we don't have it. Backfill seeds one row per channel covering (observed_from = first_seen_at, observed_until = NULL). Real rename detection begins from migration forward.
  5. Slug uniqueness across renames — if channel A renames to foo after channel B held foo and released it, UNIQUE (platform, current_slug) trips. Resolution: relax to a partial UNIQUE allowing NULLs; on a collision the loser gets current_slug = NULL and is only reachable by platform_user_id. Edge case worth logging.
  6. Multi-claim semanticsstreamer_channel_link allows multiple unverified claims on the same channel (uq_verified_channel only enforces uniqueness among verified rows). Application code decides when to mark a claim verified; only verified claims count for payout and KYC gating.
  7. What about other platforms? Twitch and X drop in naturally under platform='twitch' / platform='x'. The X follower poll already piggybacks on streamer_channels; in this migration it moves under dm.streamer_channel with the same shape. Twitch scraping (when it ships) gets the schema for free.
  8. ClickHouse re-keying? Not needed in this proposal. CH tables stay keyed on (env, platform, platform_user_id, ts). Per-channel queries resolve channel_id(platform, platform_user_id) via a one-row PG lookup before issuing the CH query.

Verifluence Documentation