Datamining Schema (`dm`) — Streamer Channel as First-Class Entity

Status: Proposed · May 2026 (revised — channel-first design)
Move all environment-agnostic streaming-platform data into a dedicated dm schema. Promote the channel itself (dm.streamer_channel) to a first-class entity with its own stable PK. Reduce public.streamer_channels to a thin per-env link table (public.streamer_channel_link) that maps Verifluence streamer accounts to one or more canonical channels via soft references.
Enables the External Data Ingestion proposal: with dm.* sitting cleanly outside public.* and the dependency arrow inverted, the schema becomes physically movable to a shared ingest tier without dragging per-env tables along — and without needing FDW gymnastics for hot-path joins.

Problem

Today the streaming-platform tables sit in three different schemas (public., scrape., webhooks.), keyed by kick_user_id text everywhere, with the human-readable identifier scattered under three different column names (channel_name, channel_slug, slug). And the public.streamer_channels table does two unrelated jobs in one row. That has four concrete costs:

1. Identification is awkward

External APIs (admin, support, ad-hoc SQL) overwhelmingly know streamers by their handle — ninja, trainwreckstv — not their Kick numeric ID. Every existing lookup pays for a streamer_channels round-trip:

sql

-- "Show me the last 10 stream sessions for username 'trainwreckstv'"
SELECT s.* FROM kick_stream_sessions s
JOIN streamer_channels c ON c.kick_user_id = s.kick_user_id
WHERE c.channel_name = 'trainwreckstv'    -- per-env table!
ORDER BY s.stream_started_at DESC LIMIT 10;

The handle lookup crosses into a per-env table even though the answer is environment-agnostic. A Kick stream session exists whether or not any Verifluence streamer in this env has claimed it.

2. Boundary lives in convention, not in the schema

scrape.kick_raw, public.kick_streamer_probe, public.kick_stream_sessions, public.kick_chat_messages, etc. are all env-agnostic — see the data-model audit. But they share the public. namespace with per-env tables (streamers, campaigns, deals). The boundary is only visible to developers who happen to know.

When we split production from stage, "what gets duplicated vs. what's shared" becomes a per-table judgment call. We want it to be WHERE schema = 'dm'.

3. `streamer_channels` does two jobs in one row

The current table holds both:

sql

-- Half A: per-env claim
streamer_id          → public.streamers
verified, created_at,

-- Half B: denormalised cache of Kick profile data
channel_name, kick_display_name, kick_avatar_url, kick_bio,
kick_follower_count, kick_is_live, kick_subscriber_count, kick_synced_at

Half B is env-agnostic — it describes the Kick channel, which exists independently of any Verifluence env. Half A is the only thing that ties the row to an env. Bundling them means every change to the Kick profile rewrites a per-env row, every per-env streamer claim drags Kick-profile columns along, and queries that need only one half pay the cost of carrying the other.

4. Renames are silently lossy

channel_name on streamer_channels is the streamer's last-known slug. Kick lets users rename. We update channel_name on resolution, but the previous slug is gone. Historical chat lines reference channel_slug literally and can't be tied back to today's streamer. No rename history is captured.

5. Multi-platform support is bolted on

Today streamer_channels.platform can be 'kick' or 'twitch' (and X-follower polling piggybacks on the same table), but the schema is shaped around Kick — kick_user_id, kick_display_name, etc. Adding Twitch scraping would mean either new columns on the same table, a parallel twitch_streamer_channels table, or a refactor — none of them clean.

Proposed schema

The channel is the first-class entity. It has its own stable PK and identifier. Per-env streamer accounts point at it through a thin link table — many-to-many in principle, one-to-one in practice for most streamers.

sql

CREATE SCHEMA dm;

-- ═══ Canonical channel identity ═══════════════════════════════════════
-- One row per external streaming-platform account ever observed.
-- Stable internal PK (bigserial) — this is what every other dm.* table
-- references. The platform-supplied user-id stays as a text field for
-- backward-compat reads but is no longer the join key.
CREATE TABLE dm.streamer_channel (
  id                   bigserial    PRIMARY KEY,
  external_id          text         NOT NULL UNIQUE,         -- ULID, for cross-system refs (events, audit)
  platform             text         NOT NULL,                -- 'kick' | 'twitch' | 'x'
  platform_user_id     text         NOT NULL,                -- kick_user_id / twitch user_id / x user_id
  current_slug         text         NOT NULL,
  display_name         text,
  avatar_url           text,
  bio                  text,
  follower_count       integer,
  is_live              boolean      NOT NULL DEFAULT false,
  is_banned            boolean      NOT NULL DEFAULT false,
  first_seen_at        timestamptz  NOT NULL DEFAULT now(),
  last_seen_at         timestamptz  NOT NULL DEFAULT now(),
  UNIQUE (platform, platform_user_id),
  UNIQUE (platform, current_slug),
  CHECK  (current_slug = lower(current_slug))
);

CREATE INDEX idx_channel_platform_slug ON dm.streamer_channel (platform, current_slug);

-- ═══ Rename history ═══════════════════════════════════════════════════
-- Append-only. Every observed change of current_slug writes a row here.
-- Historical chat lines + scrape rows can be tied back to a channel
-- even when the slug was different at the time.
CREATE TABLE dm.streamer_channel_slug_history (
  channel_id           bigint       NOT NULL REFERENCES dm.streamer_channel ON DELETE CASCADE,
  slug                 text         NOT NULL,
  observed_from        timestamptz  NOT NULL,
  observed_until       timestamptz,                           -- NULL = current
  PRIMARY KEY (channel_id, observed_from),
  CHECK (slug = lower(slug))
);

CREATE INDEX idx_slug_history_slug ON dm.streamer_channel_slug_history (slug);

Per-env link table

public.streamer_channels is replaced by public.streamer_channel_link — a thin mapping. A VF streamer can claim multiple channels (one per platform, or even multiple per platform if needed for, say, alt accounts).

sql

CREATE TABLE public.streamer_channel_link (
  streamer_id          integer      NOT NULL REFERENCES public.streamers ON DELETE CASCADE,
  channel_id           bigint       NOT NULL,                 -- soft ref → dm.streamer_channel.id
  verified             boolean      NOT NULL DEFAULT false,
  claimed_at           timestamptz  NOT NULL DEFAULT now(),
  PRIMARY KEY (streamer_id, channel_id)
);

-- While same-DB, enforce the FK. Drop it when dm moves to its own DB.
ALTER TABLE public.streamer_channel_link
  ADD CONSTRAINT streamer_channel_link_channel_fkey
  FOREIGN KEY (channel_id) REFERENCES dm.streamer_channel(id) NOT VALID;

-- A channel can only be verified-claimed by one streamer at a time.
CREATE UNIQUE INDEX uq_verified_channel
  ON public.streamer_channel_link (channel_id)
  WHERE verified = TRUE;

The link table is per-env by construction — it FKs into public.streamers. The reference to dm.streamer_channel is intentionally a soft pointer: while same-DB it's enforced by FK; after the future DB split it's enforced by the channel-link upsert flow (one transaction in dm to upsert the channel, then a domain event into the env DB to insert the link). Either way the application is responsible for keeping the relationship coherent — which it already had to be.

Tables that move into `dm`

Every dm.* descendant now keys on channel_id bigint instead of kick_user_id text. Smaller indexes, uniform across platforms, type-safe.

From	To	Key column	Notes
`scrape.kick_raw`	`dm.scrape_log`	`channel_id` (NULL allowed for pre-resolution)	Raw HTTP audit log
`webhooks.incoming`	split — see "Webhook inbox" below	n/a	Today does three jobs; we split
`public.kick_streamer_probe`	dropped — moved to ClickHouse only	n/a	See "Probe table → ClickHouse" below
`public.kick_streamer_probe_schedule`	`dm.probe_schedule`	`channel_id` PK	When to next probe each channel
`public.kick_stream_sessions`	`dm.stream_session`	`channel_id`	One row per live session (the canonical "stream")
`public.kick_chat_messages`	dropped — moved to ClickHouse only	n/a	See "Chat messages → ClickHouse" below
`public.kick_chat_s3_archive_log`	`dm.chat_archive_log`	(none — bookkeeping)	S3 export ledger; stays PG
`public.kick_event_subscriptions`	`dm.event_subscription`	`channel_id`	Platform's subscription IDs

Each descendant gets an indexed FK to dm.streamer_channel(id). The dm.* tables stop carrying any denormalised platform-specific identifiers — kick_user_id and channel_slug columns drop. Slug rendering goes through one JOIN to dm.streamer_channel.current_slug (cheap) or through dm.streamer_channel_slug_history for historical accuracy.

Why this is better

Channel is a first-class entity with its own identifier. A channel_id bigint is stable, internal, and uniform across platforms. Kick renames a user → current_slug changes, channel_id doesn't. Twitch GUIDs and X snowflake IDs collapse to the same column. Indexes get smaller.
external_id text (ULID) is the cross-system identifier for domain events and external audit. Internal bigserial stays implementation detail.
UNIQUE (platform, platform_user_id) and UNIQUE (platform, current_slug) at the dm layer give canonical-row guarantees the old design couldn't enforce without scattering CHECK constraints across many tables.
The link table is genuinely thin. Per-env data shrinks to "this streamer claims that channel, here's when, here's whether we verified it." Nothing platform-specific.
DB split becomes free. When dm moves to its own database (the ingestion-tier proposal), the soft reference from streamer_channel_link.channel_id survives without FDW. Hot-path joins on the env side don't need to cross databases because the env side carries nothing platform-specific to merge in — rendering the channel card pre-joins on the dm side.
Multi-platform support is native, not bolted on. Adding Twitch = INSERT INTO dm.streamer_channel (platform='twitch', …). Probes, sessions, chat, subscriptions all get the Twitch row for free with the same shape. X follower polling can move under this umbrella when it's scoped properly.

Probe table → ClickHouse (drop the Postgres copy)

public.kick_streamer_probe is the per-minute viewer/follower snapshot — append-only, no FK in or out, hot growth (~150k rows/hour at the current cohort, dominates OLTP WAL). The team already dual-writes every probe to vf.kick_streamer_probe in ClickHouse (api/src/kick_viewer_poll.ts:279 — chInsert("kick_streamer_probe", …)).

A code audit of every PG read:

Site	Pattern	Can CH serve it?
`kick_webhook.ts:752` — 7-day follower history (≤ 1008 rows)	Range scan on `(platform_user_id, recorded_at)`	✅ trivial
`kick_webhook.ts:1839` — per-session per-minute viewer chart	Equality on `(platform_user_id, stream_started_at)`	✅ trivial
`kick_viewer_poll.ts:438` — `closeStaleStreamSessions` finds last probe per session	`MAX(recorded_at)` per channel	✅ trivial (or skip CH: read `dm.probe_schedule.last_probed_at`)
`applications.ts:804` — resolve a submitted stream URL to a `stream_session` by overlapping probe range	Range scan + window match	✅ rewrite as two queries (CH for window, PG for session)
`backfill-clickhouse.ts:120` — explicit PG → CH backfill tool	Reads PG	❌ obviously goes away; tool deletes

Nothing requires PG transactional semantics with public.*. Trust-score reads dm.stream_session (the aggregate) — not the raw probe.

Decision: drop public.kick_streamer_probe entirely. ClickHouse becomes the single source of truth. The PG table was always a hot-path mirror written defensively while CH was new; that defence is no longer needed.

What changes:

kick_viewer_poll.ts — keep only the chInsert call; remove the PG INSERT. ~5–10k writes/hour avoided.
The four PG read paths repoint to CH via a new helper channelProbeQuery(env, channelId, …) in api/src/clickhouse.ts. The helper resolves channel_id → (platform, platform_user_id) via a tiny PG lookup, then queries CH.
Migration 0123 drops the PG table at the end of the transaction. CH already carries 12 months of history.
backfill-clickhouse.ts kick_streamer_probe deletes.
closeStaleStreamSessions reads dm.probe_schedule.last_probed_at (already updated on every probe), removing the cross-system join entirely.

dm.probe_schedule stays in Postgres. Small mutable bookkeeping (one row per channel with next_probe_at); PG handles point updates with FK-safe semantics. ≤ 50k rows.

So the final picture for "where is probe data?":

Append-only telemetry: vf.kick_streamer_probe (ClickHouse) — sole copy
Polling bookkeeping: dm.probe_schedule (Postgres)
Session aggregate: dm.stream_session (Postgres)

ClickHouse keeps its existing (env, platform, platform_user_id, ts) ordering — no CH schema change. The dm.streamer_channel mapping happens entirely on the PG side before the CH query.

Trust score history → ClickHouse (drop the Postgres copy)

public.streamer_trust_score_history is the second PG↔CH dual-write. Same shape: keyed on (platform, platform_user_id), env-agnostic, append-only, no FK in or out. kick_trust_score.ts:564 inserts PG and :586 does chInsert("streamer_trust_score_history", …).

Read-site audit:

Site	Pattern	CH-servable?
`kick_webhook.ts:1322` — 90-day score history for the trend chart	Range scan on `(platform='kick', platform_user_id, computed_at)`	✅ trivial

That's the only production read. The cleanup function delete_streamer_cascade in 00_baseline.sql:254 did a DELETE keyed on platform_user_id — semantically wrong in a multi-env world (the Kick channel still exists after a per-env streamer is purged). Dropping the PG table removes the offending cleanup branch.

Write volume is small (~1 row/streamer/day) — the win is schema-shape consistency. Either both env-agnostic mirrors stay in PG or both go CH-only. Mixed is the worst of both worlds.

Decision: drop public.streamer_trust_score_history in the same migration as kick_streamer_probe.

What changes:

kick_trust_score.ts:564 — remove PG INSERT; keep the chInsert.
kick_webhook.ts:1322 — replace PG SELECT with channelTrustScoreHistory(env, channelId, 90) reading from CH.
delete_streamer_cascade (in baseline / follow-up migration) — drop the trust-score-history DELETE branch.
backfill-clickhouse.ts — delete both supported-table branches; tool retires. The only two options were kick_streamer_probe and streamer_trust_score_history.
Migration 0123 ends with DROP TABLE public.streamer_trust_score_history. CH has 24-month TTL.

public.streamer_trust_scores (the current score, distinct from history) stays in PG. Small (one row per streamer), heavily UPSERTed, joined into operator-facing leaderboard queries. After the inversion, it gains a channel_id column FK to dm.streamer_channel — or stays keyed on (platform, platform_user_id) if that's lower-disruption. Both shapes work.

Chat messages → ClickHouse (drop the Postgres copy)

Chat messages are the third PG→CH candidate. They're an inherently append-only event stream — every chat.message.sent webhook becomes one row that's never updated except for the exported boolean the export pipeline mutates as bookkeeping. The PG-specific operations (UPDATE exported = TRUE, bulk DELETE after 14 days) are workarounds for what CH expresses natively.

Chat is also anonymous: the streamer being monitored has no business relationship with the chatters whose messages we record, and there is no compliance-driven takedown workflow for chat content. That removes the only reason to keep a mutable PG row.

CH-native lifecycle

Op	PG today	CH replacement
Skip-already-exported on next run	`UPDATE exported = TRUE WHERE id = ANY($1)` per batch	Watermark in a tiny PG table `dm.chat_export_watermark(env, last_exported_at)`. Next run reads `WHERE created_at > watermark`. No row mutation.
Retention	`DELETE … WHERE exported = TRUE AND created_at < now() - INTERVAL '14 days'`	`TTL created_at + INTERVAL 30 DAY DELETE` on the CH table. Background merges handle it.
Buffer-flush boundary	PG transactional reads see committed rows immediately	Export job's upper bound is `now() - INTERVAL 1 HOUR` (already true today for stream-session matching) — well within CH's flush latency.

Schema

sql

-- ClickHouse
CREATE TABLE vf.kick_chat_message (
  env               LowCardinality(String)         NOT NULL,
  channel_id        UInt64                         NOT NULL,
  channel_slug      LowCardinality(String)         NOT NULL,    -- denormalised for query convenience
  message_id        String                         NOT NULL,
  platform_user_id  String                         NOT NULL,    -- the chatter, not the streamer
  username          LowCardinality(String),
  content           String,
  emotes_json       String  DEFAULT '[]',
  badges_json       String  DEFAULT '[]',
  created_at        DateTime                       NOT NULL,
  inserted_at       DateTime                       DEFAULT now()
)
ENGINE = MergeTree
PARTITION BY (env, toYYYYMMDD(created_at))
ORDER BY (env, channel_id, created_at)
TTL created_at + INTERVAL 30 DAY DELETE;

-- Postgres — tiny side table, the only PG residue for chat
CREATE TABLE dm.chat_export_watermark (
  env               text         PRIMARY KEY,
  last_exported_at  timestamptz  NOT NULL DEFAULT '1970-01-01'
);

dm.chat_archive_log (the S3 export ledger) stays — it's a tiny per-batch bookkeeping table, naturally PG.

Export path rewrite

// kick_chat_export.ts — hourly
const { last_exported_at } = await env.DB`
  SELECT last_exported_at FROM dm.chat_export_watermark WHERE env = ${env}
`;
const upperBound = new Date(Date.now() - 60 * 60_000);   // now − 1h

const rows = await chQuery<ChatRow>(`
  SELECT * FROM vf.kick_chat_message
  WHERE env = '${env}'
    AND created_at >  toDateTime(${last_exported_at.getTime() / 1000})
    AND created_at <= toDateTime(${upperBound.getTime() / 1000})
  ORDER BY created_at
  LIMIT 500000
`);

if (rows.length) {
  const maxSeen = rows[rows.length - 1].created_at;
  await streamToS3(rows);
  await env.DB`INSERT INTO dm.chat_archive_log (rows_exported, files_written, success) VALUES (${rows.length}, ${files}, TRUE)`;
  await env.DB`UPDATE dm.chat_export_watermark SET last_exported_at = ${maxSeen} WHERE env = ${env}`;
}

No exported flag. No bulk DELETE in app code. The 50-ish lines of UPDATE/DELETE bookkeeping in the current kick_chat_export.ts shrink to ~20 lines.

Read paths

Site today	After CH-only
`kick_webhook.ts:589` daily chat histogram (30 days)	One CH query: `SELECT toDate(created_at), count() FROM vf.kick_chat_message WHERE channel_id = ? GROUP BY toDate(created_at)`. Faster than the PG version.
`kick_webhook.ts:1846` per-session per-minute chat aggregation	Two-step: fetch session bounds from PG (`dm.stream_session`), then CH aggregation `GROUP BY toStartOfMinute(created_at)`. App merges with probe data — same code path that already does multi-CH-query merge after the probe move.
`kick_webhook.ts:546` 24h chat totals	CH `WHERE channel_id = ? AND created_at >= now() - INTERVAL 24 HOUR`.

All three read patterns map natively to CH columnar scans.

Why this wins

Storage 10–20× smaller via CH LZ4 compression. Chat is the highest-write PG table after probe; moving it sheds another ~50 MB/day of WAL plus indexes.
Write throughput decouples from webhook latency — chInsert is async-buffered. Giveaway/raid spikes are absorbed by CH without backpressuring the webhook handler.
Analytical reads (histograms, per-minute aggregations) run 5–10× faster in CH than in PG.
Schema-shape consistency — all append-only telemetry (probe, trust-score history, webhook audit, chat) now lives in CH. The remaining PG tables in dm are all small mutable bookkeeping: streamer_channel, probe_schedule, stream_session, event_subscription, webhook_message_ids, kick_follow_event, chat_archive_log, chat_export_watermark, slug_history, scrape_log.

Costs

Eventual consistency (~minutes) on writes. Acceptable for chat; export job's hour-aligned upper bound absorbs this.
Per-session chart needs two queries instead of one PG join — but the same code path already issues two CH queries (session + probe) after the probe move. Chat just becomes a third.
30-day TTL is a hard retention floor. If the S3 export pipeline lags more than 30 days, that batch is lost. Alert on watermark lag; failure mode is best-effort archival (acceptable — chat is not a transactional record).

Decision: drop public.kick_chat_messages in the same migration as kick_streamer_probe and streamer_trust_score_history. Data goes to vf.kick_chat_message (CH). The export ledger dm.chat_archive_log stays in PG; a new dm.chat_export_watermark table replaces the exported flag.

Webhook inbox — split into three concerns

webhooks.incoming today serves three distinct purposes in one table:

Concern	Current implementation	Recommended
Idempotency — was `message_id` already processed? (`kick_webhook.ts:261`)	`SELECT id FROM webhooks.incoming WHERE message_id = $1` against a row that also holds the full payload	Tiny `dm.webhook_message_ids(source, message_id PK, received_at)` — no payload. Drops the payload-bearing PG row off the hot path.
Raw audit log — full payload of every verified + every rejected webhook, 30-day retention	PG row with `payload text` + `signature` + `verified bool`	Move to ClickHouse. Append-only, large rows, never read transactionally. Add `vf.webhook_inbox(source, event_type, message_id, payload String, signature, verified UInt8, received_at)`. Same pattern as `vf.sumsub_webhooks`.
Domain-event projection — `channel.followed` is only recoverable from `webhooks.incoming`; the per-session follow chart depends on it (`kick_webhook.ts:1994`)	Reads raw rows back out of audit log	Project at write time. Add `dm.kick_follow_event(channel_id bigint, occurred_at timestamptz)` — handler inserts one row per `channel.followed` event in the same path that writes the audit log. Per-session follow chart reads from this table directly.

Net effect:

PG webhook hot-path table shrinks from 11 columns to 3 — 1–2 orders of magnitude smaller.
Raw payloads land in CH alongside Sumsub webhooks. Compliance + replay queries are one CH query.
channel.followed follow counts read from a dedicated projected table — same query shape but smaller scan, and the dependency on raw audit retention disappears (you can shorten audit TTL without breaking the chart).
webhooks.incoming deletes after 0123 ships.

Stat-aggregation queries that today join webhooks.incoming to kick_event_subscriptions (line 538) move to CH too — they're analytical, not transactional.

API surface — channel-first, username-friendly

A new helper module api/src/streamer_channel.ts is the only place that needs to know how to translate between platform-user-id, slug, and the canonical channel_id:

// Resolve (platform, slug) → channel_id. Cheap (indexed). Null if unknown.
export async function channelIdBySlug(
  env: Env, platform: string, slug: string
): Promise<bigint | null>;

// Resolve (platform, platform_user_id) → channel_id. Cheap.
export async function channelIdByPlatformUserId(
  env: Env, platform: string, platformUserId: string
): Promise<bigint | null>;

// Resolve channel_id → full row (current slug, display name, etc.).
export async function channel(env: Env, channelId: bigint): Promise<Channel | null>;

// Idempotent upsert. Called by every poller/webhook before inserting
// into a child table. Records a slug-rename in slug_history when the
// observed slug differs from current_slug.
export async function upsertChannel(
  env: Env,
  args: {
    platform: string;
    platformUserId: string;
    slug: string;
    displayName?: string;
    avatarUrl?: string;
    followerCount?: number;
    isLive?: boolean;
  }
): Promise<bigint>;          // returns channel_id

Every existing call site that built a SQL query of the form WHERE kick_user_id = $1 converts to WHERE channel_id = $1 once the dm.* descendants are switched. A transition helper kickUserIdToChannelId(env, kuid) covers the brief overlap window.

HTTP

Every endpoint that previously took ?kick_user_id=… learns to also accept ?username=… and ?channel_id=…:

const channelId =
     url.searchParams.has("channel_id")     ? BigInt(url.searchParams.get("channel_id")!)
  :  url.searchParams.has("username")       ? await channelIdBySlug(env, "kick", url.searchParams.get("username")!)
  :  url.searchParams.has("kick_user_id")   ? await channelIdByPlatformUserId(env, "kick", url.searchParams.get("kick_user_id")!)
  :  null;

New routes can take username exclusively. Admin/support tooling stops needing to know about numeric platform IDs at all.

Migration

Two-phase migration. Phase A is purely additive (new structure alongside old). Phase B switches readers/writers and drops the old.

sql

-- 0123_dm_streamer_channel.sql

BEGIN;

CREATE SCHEMA dm;

-- ═══ Phase A — add new structure ═════════════════════════════════════

CREATE TABLE dm.streamer_channel (...);
CREATE TABLE dm.streamer_channel_slug_history (...);

-- Backfill from every observed kick_user_id. Source priority:
--   1. streamer_channels  (we have a resolved slug)
--   2. kick_stream_sessions.channel_name
--   3. (chat lives in CH now — not a backfill source)
--   4. kick_streamer_probe with a synthetic slug if all else missing
INSERT INTO dm.streamer_channel (external_id, platform, platform_user_id, current_slug,
                                  display_name, avatar_url, bio, follower_count,
                                  is_live, first_seen_at, last_seen_at)
SELECT gen_ulid(),
       'kick',
       sc.kick_user_id,
       COALESCE(sc.channel_name, 'unknown-' || sc.kick_user_id),
       sc.kick_display_name, sc.kick_avatar_url, sc.kick_bio,
       sc.kick_follower_count, COALESCE(sc.kick_is_live, false),
       MIN(sc.created_at),     -- approx; real history starts here
       MAX(COALESCE(sc.kick_synced_at, sc.created_at))
FROM   public.streamer_channels sc
WHERE  sc.kick_user_id IS NOT NULL AND sc.platform = 'kick'
GROUP  BY sc.kick_user_id, sc.channel_name, sc.kick_display_name,
          sc.kick_avatar_url, sc.kick_bio, sc.kick_follower_count, sc.kick_is_live;

-- Pick up channels we've observed in probe/session/chat but haven't claimed.
INSERT INTO dm.streamer_channel (external_id, platform, platform_user_id, current_slug, first_seen_at)
SELECT DISTINCT gen_ulid(), 'kick', kss.kick_user_id,
                COALESCE(kss.channel_name, 'unknown-' || kss.kick_user_id),
                MIN(kss.stream_started_at)
FROM   public.kick_stream_sessions kss
WHERE  kss.kick_user_id NOT IN (
         SELECT platform_user_id FROM dm.streamer_channel WHERE platform = 'kick'
       )
GROUP  BY kss.kick_user_id, kss.channel_name;

-- Seed slug_history with one row per channel covering the current slug.
INSERT INTO dm.streamer_channel_slug_history (channel_id, slug, observed_from)
SELECT id, current_slug, first_seen_at FROM dm.streamer_channel;

-- Create the per-env link table.
CREATE TABLE public.streamer_channel_link (
  streamer_id  integer  NOT NULL REFERENCES public.streamers ON DELETE CASCADE,
  channel_id   bigint   NOT NULL,
  verified     boolean  NOT NULL DEFAULT false,
  claimed_at   timestamptz NOT NULL DEFAULT now(),
  PRIMARY KEY (streamer_id, channel_id)
);

ALTER TABLE public.streamer_channel_link
  ADD CONSTRAINT streamer_channel_link_channel_fkey
  FOREIGN KEY (channel_id) REFERENCES dm.streamer_channel(id) NOT VALID;

INSERT INTO public.streamer_channel_link (streamer_id, channel_id, verified, claimed_at)
SELECT sc.streamer_id, dsc.id, sc.verified, sc.created_at
FROM   public.streamer_channels sc
JOIN   dm.streamer_channel       dsc
  ON   dsc.platform = sc.platform AND dsc.platform_user_id = sc.kick_user_id
WHERE  sc.streamer_id IS NOT NULL;

-- ═══ Phase A.2 — move tables, add channel_id, backfill ═══════════════

ALTER TABLE scrape.kick_raw                  SET SCHEMA dm;
ALTER TABLE dm.kick_raw                      RENAME TO scrape_log;
ALTER TABLE dm.scrape_log                    ADD COLUMN channel_id bigint;
UPDATE dm.scrape_log sl SET channel_id = sc.id
  FROM dm.streamer_channel sc
  WHERE sc.platform = 'kick' AND sc.platform_user_id = sl.kick_user_id;

ALTER TABLE public.kick_streamer_probe_schedule  SET SCHEMA dm;
ALTER TABLE dm.kick_streamer_probe_schedule      RENAME TO probe_schedule;
ALTER TABLE dm.probe_schedule                    ADD COLUMN channel_id bigint;
UPDATE dm.probe_schedule ps SET channel_id = sc.id
  FROM dm.streamer_channel sc
  WHERE sc.platform = 'kick' AND sc.platform_user_id = ps.kick_user_id;
ALTER TABLE dm.probe_schedule ALTER COLUMN channel_id SET NOT NULL;
ALTER TABLE dm.probe_schedule DROP CONSTRAINT IF EXISTS probe_schedule_pkey;
ALTER TABLE dm.probe_schedule ADD PRIMARY KEY (channel_id);
ALTER TABLE dm.probe_schedule
  ADD CONSTRAINT probe_schedule_channel_fkey
  FOREIGN KEY (channel_id) REFERENCES dm.streamer_channel NOT VALID;
ALTER TABLE dm.probe_schedule VALIDATE CONSTRAINT probe_schedule_channel_fkey;

-- (Same pattern for kick_stream_sessions → dm.stream_session,
--  kick_chat_messages does NOT move — it's dropped in Phase B
--  after the kick_chat_export pipeline switches to CH reads.)

-- ═══ Phase B — switch code → drop the old ════════════════════════════
-- Deferred to a follow-up migration once all code sites have been
-- repointed. Phase B drops:
--   public.streamer_channels       (replaced by streamer_channel_link + dm.streamer_channel)
--   public.kick_streamer_probe     (CH only)
--   public.streamer_trust_score_history  (CH only)
--   webhooks.incoming              (split into dm.webhook_message_ids + vf.webhook_inbox + dm.kick_follow_event)
--   kick_user_id columns on dm.*   (channel_id is the join key now)

COMMIT;

Phase A is a single transaction that can run on a live database. Every dm.* descendant temporarily carries both kick_user_id and channel_id columns; the code switches its reads/writes over to channel_id incrementally; then Phase B drops kick_user_id and the old public.streamer_channels table. Two PRs, one to flip the schema and one to retire the old columns.

The NOT VALID + VALIDATE CONSTRAINT pattern lets us add FKs without an AccessExclusiveLock on child tables — kept for the large historical PG tables (stream_session, event_subscription, scrape_log) during the move. Chat is no longer a concern: it leaves PG entirely.

Code change scope

File	Change	Lines
`api/migrations/0123_dm_streamer_channel.sql`	Phase A migration	~250
`api/migrations/0124_drop_legacy_kick_columns.sql`	Phase B (after code repoint)	~80
`api/src/streamer_channel.ts`	New helper module (`channelIdBySlug`, `channelIdByPlatformUserId`, `upsertChannel`, `channel`)	~120
`api/src/kick_viewer_poll.ts`	Drop PG probe INSERT (keep `chInsert`); `upsertChannel` returns channel_id; touch `dm.probe_schedule.last_probed_at`	~25
`api/src/kick_subscriptions.ts`	Use `channelIdBySlug` / `channel(channelId)` instead of poking `streamer_channels.channel_name`; subscription tables key on channel_id	~35
`api/src/kick_webhook.ts`	`upsertChannel` on every inbound event; resolve `channel_id` from payload; replace `webhooks.incoming` dedupe with `dm.webhook_message_ids`; project `channel.followed` into `dm.kick_follow_event`; raw audit goes to `vf.webhook_inbox`; 4 probe-read sites repoint to CH helper	~100
`api/src/kick_maintenance.ts`	Drop `webhooks.incoming` purge; add `dm.webhook_message_ids` TTL prune	~5
`api/src/kick_chat_export.ts`	Rewrite: read from CH via watermark, write S3, advance `dm.chat_export_watermark`. The 50 lines of UPDATE/DELETE bookkeeping shrink to ~20 lines of watermark-driven read.	-30 (net)
`api/clickhouse/07_kick_chat_message.sql`	New CH schema for chat messages (MergeTree, 30-day TTL)	~30
`api/src/kick_trust_score.ts`	Drop PG `streamer_trust_score_history` INSERT; keep `chInsert`; reads `dm.stream_session` by `channel_id`	~25
`api/src/applications.ts`	`session_submissions` resolver: probe-window scan moves to CH helper (resolves channel_id first); PG session join uses `channel_id`	~20
`api/src/streamers.ts`	Streamer-channel linking: `upsertChannel` first, then INSERT into `public.streamer_channel_link`	~25
`api/src/clickhouse.ts`	New helpers: `channelProbeQuery`, `channelTrustScoreHistory`, `webhookInboxInsert`; each takes a `channel_id` and resolves to `(platform, platform_user_id)` for the CH WHERE clause	~70
`api/src/backfill-clickhouse.ts`	Delete both `kick_streamer_probe` and `streamer_trust_score_history` branches (PG sources gone). Tool retires.	-260
`api/k8s/backfill-clickhouse-job.yaml`	Delete (tool is gone)	-107
`api/clickhouse/06_webhook_inbox.sql`	New CH schema for raw webhook archive	~30
`api/src/oauth.ts`	Kick OAuth callback writes `streamer_channel_link` instead of `streamer_channels`; `upsertChannel` first	~20
`api/src/x_follower_poll.ts`	(Optional in 0123) Move X polling under `dm.streamer_channel` with `platform='x'`; or defer to a follow-up	~30 or —
`frontend/src/api/streamers.ts`	Add `username` / `channel_id` query param support	~15
`admin/src/panels/StreamersPanel.tsx`	Search by username without an explicit user-id lookup; channel rows render from `dm.streamer_channel`	~10

Touches roughly 19 files. The bulk is the kick_user_id → channel_id propagation, which is mechanical: a typed helper covers the translation during the overlap window, then Phase B retires the old column.

Four PG table drops (in Phase B) — kick_streamer_probe, streamer_trust_score_history, kick_chat_messages, webhooks.incoming — plus streamer_channels itself. The backfill-clickhouse tool retires entirely. After Phase B, every append-only telemetry source lives in ClickHouse; the PG tables that remain in dm are all small mutable state.

Migration ordering relative to other in-flight work

This proposal slots cleanly after the v2 baseline (current 00_baseline.sql) and before the ingestion-tier extraction. Suggested ordering:

Phase A — migration 0123: purely additive (new schema + new identity table + streamer_channel_link + table moves with dual kick_user_id/channel_id columns + FKs). Can ship to a live DB.
Code repoints — same week: ~18 files, low-risk grep-and-replace. The dual-column overlap period means readers can switch incrementally.
Phase B — migration 0124: drop kick_user_id columns from dm.* descendants, drop public.streamer_channels, drop public.kick_streamer_probe, drop public.streamer_trust_score_history, drop public.kick_chat_messages, drop webhooks.incoming. Ships once all writers and readers are on the new columns and kick_chat_export.ts has switched to the watermark-driven CH read.
External Data Ingestion proposal: operates on cleaner ground — dm.* is a single namespace to physically extract, and the dependency arrow already points the right way (env reaches into dm via soft reference; nothing in dm references env-specific data).

Doing it in this order means the ingestion proposal's "move the env-agnostic schema to its own database" step becomes a single pg_dump --schema=dm plus dropping the FK on streamer_channel_link.channel_id — no application-level join rewrites needed.

Open questions

Schema name — dm (datamining). Short, fits the existing schema-naming style (auth, pgboss). Platform-neutral; future dm.sumsub_applicant_* etc. land in the same namespace.
platform_user_id retention — keep dm.streamer_channel.platform_user_id for backward compat with CH (which still keys on it) and for OAuth lookups. Not a join key for dm.* descendants, just a stable identifier.
Should streamer_videos move? No — curated profile content owned by the streamer (per-env), not scraped. After the inversion it FKs streamer_id only; no link to dm.streamer_channel needed.
Backfilling slug history — we don't have it. Backfill seeds one row per channel covering (observed_from = first_seen_at, observed_until = NULL). Real rename detection begins from migration forward.
Slug uniqueness across renames — if channel A renames to foo after channel B held foo and released it, UNIQUE (platform, current_slug) trips. Resolution: relax to a partial UNIQUE allowing NULLs; on a collision the loser gets current_slug = NULL and is only reachable by platform_user_id. Edge case worth logging.
Multi-claim semantics — streamer_channel_link allows multiple unverified claims on the same channel (uq_verified_channel only enforces uniqueness among verified rows). Application code decides when to mark a claim verified; only verified claims count for payout and KYC gating.
What about other platforms? Twitch and X drop in naturally under platform='twitch' / platform='x'. The X follower poll already piggybacks on streamer_channels; in this migration it moves under dm.streamer_channel with the same shape. Twitch scraping (when it ships) gets the schema for free.
ClickHouse re-keying? Not needed in this proposal. CH tables stay keyed on (env, platform, platform_user_id, ts). Per-channel queries resolve channel_id → (platform, platform_user_id) via a one-row PG lookup before issuing the CH query.

Datamining Schema (dm) — Streamer Channel as First-Class Entity ​

Problem ​

1. Identification is awkward ​

2. Boundary lives in convention, not in the schema ​

3. streamer_channels does two jobs in one row ​

4. Renames are silently lossy ​

5. Multi-platform support is bolted on ​

Proposed schema ​

Per-env link table ​

Tables that move into dm ​

Why this is better ​

Probe table → ClickHouse (drop the Postgres copy) ​

Trust score history → ClickHouse (drop the Postgres copy) ​

Chat messages → ClickHouse (drop the Postgres copy) ​

CH-native lifecycle ​

Schema ​

Export path rewrite ​

Read paths ​

Why this wins ​

Costs ​

Webhook inbox — split into three concerns ​

API surface — channel-first, username-friendly ​

HTTP ​

Migration ​

Code change scope ​

Migration ordering relative to other in-flight work ​

Open questions ​

Related docs ​