Datamining Schema (dm) — Streamer Channel as First-Class Entity
Status: Proposed · May 2026 (revised — channel-first design)
Move all environment-agnostic streaming-platform data into a dedicated
dmschema. Promote the channel itself (dm.streamer_channel) to a first-class entity with its own stable PK. Reducepublic.streamer_channelsto a thin per-env link table (public.streamer_channel_link) that maps Verifluence streamer accounts to one or more canonical channels via soft references.Enables the External Data Ingestion proposal: with
dm.*sitting cleanly outsidepublic.*and the dependency arrow inverted, the schema becomes physically movable to a shared ingest tier without dragging per-env tables along — and without needing FDW gymnastics for hot-path joins.
Problem
Today the streaming-platform tables sit in three different schemas (public., scrape., webhooks.), keyed by kick_user_id text everywhere, with the human-readable identifier scattered under three different column names (channel_name, channel_slug, slug). And the public.streamer_channels table does two unrelated jobs in one row. That has four concrete costs:
1. Identification is awkward
External APIs (admin, support, ad-hoc SQL) overwhelmingly know streamers by their handle — ninja, trainwreckstv — not their Kick numeric ID. Every existing lookup pays for a streamer_channels round-trip:
-- "Show me the last 10 stream sessions for username 'trainwreckstv'"
SELECT s.* FROM kick_stream_sessions s
JOIN streamer_channels c ON c.kick_user_id = s.kick_user_id
WHERE c.channel_name = 'trainwreckstv' -- per-env table!
ORDER BY s.stream_started_at DESC LIMIT 10;The handle lookup crosses into a per-env table even though the answer is environment-agnostic. A Kick stream session exists whether or not any Verifluence streamer in this env has claimed it.
2. Boundary lives in convention, not in the schema
scrape.kick_raw, public.kick_streamer_probe, public.kick_stream_sessions, public.kick_chat_messages, etc. are all env-agnostic — see the data-model audit. But they share the public. namespace with per-env tables (streamers, campaigns, deals). The boundary is only visible to developers who happen to know.
When we split production from stage, "what gets duplicated vs. what's shared" becomes a per-table judgment call. We want it to be WHERE schema = 'dm'.
3. streamer_channels does two jobs in one row
The current table holds both:
-- Half A: per-env claim
streamer_id → public.streamers
verified, created_at,
-- Half B: denormalised cache of Kick profile data
channel_name, kick_display_name, kick_avatar_url, kick_bio,
kick_follower_count, kick_is_live, kick_subscriber_count, kick_synced_atHalf B is env-agnostic — it describes the Kick channel, which exists independently of any Verifluence env. Half A is the only thing that ties the row to an env. Bundling them means every change to the Kick profile rewrites a per-env row, every per-env streamer claim drags Kick-profile columns along, and queries that need only one half pay the cost of carrying the other.
4. Renames are silently lossy
channel_name on streamer_channels is the streamer's last-known slug. Kick lets users rename. We update channel_name on resolution, but the previous slug is gone. Historical chat lines reference channel_slug literally and can't be tied back to today's streamer. No rename history is captured.
5. Multi-platform support is bolted on
Today streamer_channels.platform can be 'kick' or 'twitch' (and X-follower polling piggybacks on the same table), but the schema is shaped around Kick — kick_user_id, kick_display_name, etc. Adding Twitch scraping would mean either new columns on the same table, a parallel twitch_streamer_channels table, or a refactor — none of them clean.
Proposed schema
The channel is the first-class entity. It has its own stable PK and identifier. Per-env streamer accounts point at it through a thin link table — many-to-many in principle, one-to-one in practice for most streamers.
CREATE SCHEMA dm;
-- ═══ Canonical channel identity ═══════════════════════════════════════
-- One row per external streaming-platform account ever observed.
-- Stable internal PK (bigserial) — this is what every other dm.* table
-- references. The platform-supplied user-id stays as a text field for
-- backward-compat reads but is no longer the join key.
CREATE TABLE dm.streamer_channel (
id bigserial PRIMARY KEY,
external_id text NOT NULL UNIQUE, -- ULID, for cross-system refs (events, audit)
platform text NOT NULL, -- 'kick' | 'twitch' | 'x'
platform_user_id text NOT NULL, -- kick_user_id / twitch user_id / x user_id
current_slug text NOT NULL,
display_name text,
avatar_url text,
bio text,
follower_count integer,
is_live boolean NOT NULL DEFAULT false,
is_banned boolean NOT NULL DEFAULT false,
first_seen_at timestamptz NOT NULL DEFAULT now(),
last_seen_at timestamptz NOT NULL DEFAULT now(),
UNIQUE (platform, platform_user_id),
UNIQUE (platform, current_slug),
CHECK (current_slug = lower(current_slug))
);
CREATE INDEX idx_channel_platform_slug ON dm.streamer_channel (platform, current_slug);
-- ═══ Rename history ═══════════════════════════════════════════════════
-- Append-only. Every observed change of current_slug writes a row here.
-- Historical chat lines + scrape rows can be tied back to a channel
-- even when the slug was different at the time.
CREATE TABLE dm.streamer_channel_slug_history (
channel_id bigint NOT NULL REFERENCES dm.streamer_channel ON DELETE CASCADE,
slug text NOT NULL,
observed_from timestamptz NOT NULL,
observed_until timestamptz, -- NULL = current
PRIMARY KEY (channel_id, observed_from),
CHECK (slug = lower(slug))
);
CREATE INDEX idx_slug_history_slug ON dm.streamer_channel_slug_history (slug);Per-env link table
public.streamer_channels is replaced by public.streamer_channel_link — a thin mapping. A VF streamer can claim multiple channels (one per platform, or even multiple per platform if needed for, say, alt accounts).
CREATE TABLE public.streamer_channel_link (
streamer_id integer NOT NULL REFERENCES public.streamers ON DELETE CASCADE,
channel_id bigint NOT NULL, -- soft ref → dm.streamer_channel.id
verified boolean NOT NULL DEFAULT false,
claimed_at timestamptz NOT NULL DEFAULT now(),
PRIMARY KEY (streamer_id, channel_id)
);
-- While same-DB, enforce the FK. Drop it when dm moves to its own DB.
ALTER TABLE public.streamer_channel_link
ADD CONSTRAINT streamer_channel_link_channel_fkey
FOREIGN KEY (channel_id) REFERENCES dm.streamer_channel(id) NOT VALID;
-- A channel can only be verified-claimed by one streamer at a time.
CREATE UNIQUE INDEX uq_verified_channel
ON public.streamer_channel_link (channel_id)
WHERE verified = TRUE;The link table is per-env by construction — it FKs into public.streamers. The reference to dm.streamer_channel is intentionally a soft pointer: while same-DB it's enforced by FK; after the future DB split it's enforced by the channel-link upsert flow (one transaction in dm to upsert the channel, then a domain event into the env DB to insert the link). Either way the application is responsible for keeping the relationship coherent — which it already had to be.
Tables that move into dm
Every dm.* descendant now keys on channel_id bigint instead of kick_user_id text. Smaller indexes, uniform across platforms, type-safe.
| From | To | Key column | Notes |
|---|---|---|---|
scrape.kick_raw | dm.scrape_log | channel_id (NULL allowed for pre-resolution) | Raw HTTP audit log |
webhooks.incoming | split — see "Webhook inbox" below | n/a | Today does three jobs; we split |
public.kick_streamer_probe | dropped — moved to ClickHouse only | n/a | See "Probe table → ClickHouse" below |
public.kick_streamer_probe_schedule | dm.probe_schedule | channel_id PK | When to next probe each channel |
public.kick_stream_sessions | dm.stream_session | channel_id | One row per live session (the canonical "stream") |
public.kick_chat_messages | dropped — moved to ClickHouse only | n/a | See "Chat messages → ClickHouse" below |
public.kick_chat_s3_archive_log | dm.chat_archive_log | (none — bookkeeping) | S3 export ledger; stays PG |
public.kick_event_subscriptions | dm.event_subscription | channel_id | Platform's subscription IDs |
Each descendant gets an indexed FK to dm.streamer_channel(id). The dm.* tables stop carrying any denormalised platform-specific identifiers — kick_user_id and channel_slug columns drop. Slug rendering goes through one JOIN to dm.streamer_channel.current_slug (cheap) or through dm.streamer_channel_slug_history for historical accuracy.
Why this is better
Channel is a first-class entity with its own identifier. A
channel_id bigintis stable, internal, and uniform across platforms. Kick renames a user →current_slugchanges,channel_iddoesn't. Twitch GUIDs and X snowflake IDs collapse to the same column. Indexes get smaller.external_idtext (ULID) is the cross-system identifier for domain events and external audit. Internal bigserial stays implementation detail.UNIQUE (platform, platform_user_id)andUNIQUE (platform, current_slug)at the dm layer give canonical-row guarantees the old design couldn't enforce without scattering CHECK constraints across many tables.The link table is genuinely thin. Per-env data shrinks to "this streamer claims that channel, here's when, here's whether we verified it." Nothing platform-specific.
DB split becomes free. When
dmmoves to its own database (the ingestion-tier proposal), the soft reference fromstreamer_channel_link.channel_idsurvives without FDW. Hot-path joins on the env side don't need to cross databases because the env side carries nothing platform-specific to merge in — rendering the channel card pre-joins on the dm side.Multi-platform support is native, not bolted on. Adding Twitch =
INSERT INTO dm.streamer_channel (platform='twitch', …). Probes, sessions, chat, subscriptions all get the Twitch row for free with the same shape. X follower polling can move under this umbrella when it's scoped properly.
Probe table → ClickHouse (drop the Postgres copy)
public.kick_streamer_probe is the per-minute viewer/follower snapshot — append-only, no FK in or out, hot growth (~150k rows/hour at the current cohort, dominates OLTP WAL). The team already dual-writes every probe to vf.kick_streamer_probe in ClickHouse (api/src/kick_viewer_poll.ts:279 — chInsert("kick_streamer_probe", …)).
A code audit of every PG read:
| Site | Pattern | Can CH serve it? |
|---|---|---|
kick_webhook.ts:752 — 7-day follower history (≤ 1008 rows) | Range scan on (platform_user_id, recorded_at) | ✅ trivial |
kick_webhook.ts:1839 — per-session per-minute viewer chart | Equality on (platform_user_id, stream_started_at) | ✅ trivial |
kick_viewer_poll.ts:438 — closeStaleStreamSessions finds last probe per session | MAX(recorded_at) per channel | ✅ trivial (or skip CH: read dm.probe_schedule.last_probed_at) |
applications.ts:804 — resolve a submitted stream URL to a stream_session by overlapping probe range | Range scan + window match | ✅ rewrite as two queries (CH for window, PG for session) |
backfill-clickhouse.ts:120 — explicit PG → CH backfill tool | Reads PG | ❌ obviously goes away; tool deletes |
Nothing requires PG transactional semantics with public.*. Trust-score reads dm.stream_session (the aggregate) — not the raw probe.
Decision: drop public.kick_streamer_probe entirely. ClickHouse becomes the single source of truth. The PG table was always a hot-path mirror written defensively while CH was new; that defence is no longer needed.
What changes:
kick_viewer_poll.ts— keep only thechInsertcall; remove the PG INSERT. ~5–10k writes/hour avoided.- The four PG read paths repoint to CH via a new helper
channelProbeQuery(env, channelId, …)inapi/src/clickhouse.ts. The helper resolveschannel_id→(platform, platform_user_id)via a tiny PG lookup, then queries CH. - Migration 0123 drops the PG table at the end of the transaction. CH already carries 12 months of history.
backfill-clickhouse.ts kick_streamer_probedeletes.closeStaleStreamSessionsreadsdm.probe_schedule.last_probed_at(already updated on every probe), removing the cross-system join entirely.
dm.probe_schedule stays in Postgres. Small mutable bookkeeping (one row per channel with next_probe_at); PG handles point updates with FK-safe semantics. ≤ 50k rows.
So the final picture for "where is probe data?":
- Append-only telemetry:
vf.kick_streamer_probe(ClickHouse) — sole copy - Polling bookkeeping:
dm.probe_schedule(Postgres) - Session aggregate:
dm.stream_session(Postgres)
ClickHouse keeps its existing (env, platform, platform_user_id, ts) ordering — no CH schema change. The dm.streamer_channel mapping happens entirely on the PG side before the CH query.
Trust score history → ClickHouse (drop the Postgres copy)
public.streamer_trust_score_history is the second PG↔CH dual-write. Same shape: keyed on (platform, platform_user_id), env-agnostic, append-only, no FK in or out. kick_trust_score.ts:564 inserts PG and :586 does chInsert("streamer_trust_score_history", …).
Read-site audit:
| Site | Pattern | CH-servable? |
|---|---|---|
kick_webhook.ts:1322 — 90-day score history for the trend chart | Range scan on (platform='kick', platform_user_id, computed_at) | ✅ trivial |
That's the only production read. The cleanup function delete_streamer_cascade in 00_baseline.sql:254 did a DELETE keyed on platform_user_id — semantically wrong in a multi-env world (the Kick channel still exists after a per-env streamer is purged). Dropping the PG table removes the offending cleanup branch.
Write volume is small (~1 row/streamer/day) — the win is schema-shape consistency. Either both env-agnostic mirrors stay in PG or both go CH-only. Mixed is the worst of both worlds.
Decision: drop public.streamer_trust_score_history in the same migration as kick_streamer_probe.
What changes:
kick_trust_score.ts:564— remove PG INSERT; keep thechInsert.kick_webhook.ts:1322— replace PG SELECT withchannelTrustScoreHistory(env, channelId, 90)reading from CH.delete_streamer_cascade(in baseline / follow-up migration) — drop the trust-score-history DELETE branch.backfill-clickhouse.ts— delete both supported-table branches; tool retires. The only two options werekick_streamer_probeandstreamer_trust_score_history.- Migration 0123 ends with
DROP TABLE public.streamer_trust_score_history. CH has 24-month TTL.
public.streamer_trust_scores (the current score, distinct from history) stays in PG. Small (one row per streamer), heavily UPSERTed, joined into operator-facing leaderboard queries. After the inversion, it gains a channel_id column FK to dm.streamer_channel — or stays keyed on (platform, platform_user_id) if that's lower-disruption. Both shapes work.
Chat messages → ClickHouse (drop the Postgres copy)
Chat messages are the third PG→CH candidate. They're an inherently append-only event stream — every chat.message.sent webhook becomes one row that's never updated except for the exported boolean the export pipeline mutates as bookkeeping. The PG-specific operations (UPDATE exported = TRUE, bulk DELETE after 14 days) are workarounds for what CH expresses natively.
Chat is also anonymous: the streamer being monitored has no business relationship with the chatters whose messages we record, and there is no compliance-driven takedown workflow for chat content. That removes the only reason to keep a mutable PG row.
CH-native lifecycle
| Op | PG today | CH replacement |
|---|---|---|
| Skip-already-exported on next run | UPDATE exported = TRUE WHERE id = ANY($1) per batch | Watermark in a tiny PG table dm.chat_export_watermark(env, last_exported_at). Next run reads WHERE created_at > watermark. No row mutation. |
| Retention | DELETE … WHERE exported = TRUE AND created_at < now() - INTERVAL '14 days' | TTL created_at + INTERVAL 30 DAY DELETE on the CH table. Background merges handle it. |
| Buffer-flush boundary | PG transactional reads see committed rows immediately | Export job's upper bound is now() - INTERVAL 1 HOUR (already true today for stream-session matching) — well within CH's flush latency. |
Schema
-- ClickHouse
CREATE TABLE vf.kick_chat_message (
env LowCardinality(String) NOT NULL,
channel_id UInt64 NOT NULL,
channel_slug LowCardinality(String) NOT NULL, -- denormalised for query convenience
message_id String NOT NULL,
platform_user_id String NOT NULL, -- the chatter, not the streamer
username LowCardinality(String),
content String,
emotes_json String DEFAULT '[]',
badges_json String DEFAULT '[]',
created_at DateTime NOT NULL,
inserted_at DateTime DEFAULT now()
)
ENGINE = MergeTree
PARTITION BY (env, toYYYYMMDD(created_at))
ORDER BY (env, channel_id, created_at)
TTL created_at + INTERVAL 30 DAY DELETE;
-- Postgres — tiny side table, the only PG residue for chat
CREATE TABLE dm.chat_export_watermark (
env text PRIMARY KEY,
last_exported_at timestamptz NOT NULL DEFAULT '1970-01-01'
);dm.chat_archive_log (the S3 export ledger) stays — it's a tiny per-batch bookkeeping table, naturally PG.
Export path rewrite
// kick_chat_export.ts — hourly
const { last_exported_at } = await env.DB`
SELECT last_exported_at FROM dm.chat_export_watermark WHERE env = ${env}
`;
const upperBound = new Date(Date.now() - 60 * 60_000); // now − 1h
const rows = await chQuery<ChatRow>(`
SELECT * FROM vf.kick_chat_message
WHERE env = '${env}'
AND created_at > toDateTime(${last_exported_at.getTime() / 1000})
AND created_at <= toDateTime(${upperBound.getTime() / 1000})
ORDER BY created_at
LIMIT 500000
`);
if (rows.length) {
const maxSeen = rows[rows.length - 1].created_at;
await streamToS3(rows);
await env.DB`INSERT INTO dm.chat_archive_log (rows_exported, files_written, success) VALUES (${rows.length}, ${files}, TRUE)`;
await env.DB`UPDATE dm.chat_export_watermark SET last_exported_at = ${maxSeen} WHERE env = ${env}`;
}No exported flag. No bulk DELETE in app code. The 50-ish lines of UPDATE/DELETE bookkeeping in the current kick_chat_export.ts shrink to ~20 lines.
Read paths
| Site today | After CH-only |
|---|---|
kick_webhook.ts:589 daily chat histogram (30 days) | One CH query: SELECT toDate(created_at), count() FROM vf.kick_chat_message WHERE channel_id = ? GROUP BY toDate(created_at). Faster than the PG version. |
kick_webhook.ts:1846 per-session per-minute chat aggregation | Two-step: fetch session bounds from PG (dm.stream_session), then CH aggregation GROUP BY toStartOfMinute(created_at). App merges with probe data — same code path that already does multi-CH-query merge after the probe move. |
kick_webhook.ts:546 24h chat totals | CH WHERE channel_id = ? AND created_at >= now() - INTERVAL 24 HOUR. |
All three read patterns map natively to CH columnar scans.
Why this wins
- Storage 10–20× smaller via CH LZ4 compression. Chat is the highest-write PG table after probe; moving it sheds another ~50 MB/day of WAL plus indexes.
- Write throughput decouples from webhook latency —
chInsertis async-buffered. Giveaway/raid spikes are absorbed by CH without backpressuring the webhook handler. - Analytical reads (histograms, per-minute aggregations) run 5–10× faster in CH than in PG.
- Schema-shape consistency — all append-only telemetry (probe, trust-score history, webhook audit, chat) now lives in CH. The remaining PG tables in
dmare all small mutable bookkeeping:streamer_channel,probe_schedule,stream_session,event_subscription,webhook_message_ids,kick_follow_event,chat_archive_log,chat_export_watermark,slug_history,scrape_log.
Costs
- Eventual consistency (~minutes) on writes. Acceptable for chat; export job's hour-aligned upper bound absorbs this.
- Per-session chart needs two queries instead of one PG join — but the same code path already issues two CH queries (session + probe) after the probe move. Chat just becomes a third.
- 30-day TTL is a hard retention floor. If the S3 export pipeline lags more than 30 days, that batch is lost. Alert on watermark lag; failure mode is best-effort archival (acceptable — chat is not a transactional record).
Decision: drop public.kick_chat_messages in the same migration as kick_streamer_probe and streamer_trust_score_history. Data goes to vf.kick_chat_message (CH). The export ledger dm.chat_archive_log stays in PG; a new dm.chat_export_watermark table replaces the exported flag.
Webhook inbox — split into three concerns
webhooks.incoming today serves three distinct purposes in one table:
| Concern | Current implementation | Recommended |
|---|---|---|
Idempotency — was message_id already processed? (kick_webhook.ts:261) | SELECT id FROM webhooks.incoming WHERE message_id = $1 against a row that also holds the full payload | Tiny dm.webhook_message_ids(source, message_id PK, received_at) — no payload. Drops the payload-bearing PG row off the hot path. |
| Raw audit log — full payload of every verified + every rejected webhook, 30-day retention | PG row with payload text + signature + verified bool | Move to ClickHouse. Append-only, large rows, never read transactionally. Add vf.webhook_inbox(source, event_type, message_id, payload String, signature, verified UInt8, received_at). Same pattern as vf.sumsub_webhooks. |
Domain-event projection — channel.followed is only recoverable from webhooks.incoming; the per-session follow chart depends on it (kick_webhook.ts:1994) | Reads raw rows back out of audit log | Project at write time. Add dm.kick_follow_event(channel_id bigint, occurred_at timestamptz) — handler inserts one row per channel.followed event in the same path that writes the audit log. Per-session follow chart reads from this table directly. |
Net effect:
- PG webhook hot-path table shrinks from 11 columns to 3 — 1–2 orders of magnitude smaller.
- Raw payloads land in CH alongside Sumsub webhooks. Compliance + replay queries are one CH query.
channel.followedfollow counts read from a dedicated projected table — same query shape but smaller scan, and the dependency on raw audit retention disappears (you can shorten audit TTL without breaking the chart).webhooks.incomingdeletes after 0123 ships.
Stat-aggregation queries that today join webhooks.incoming to kick_event_subscriptions (line 538) move to CH too — they're analytical, not transactional.
API surface — channel-first, username-friendly
A new helper module api/src/streamer_channel.ts is the only place that needs to know how to translate between platform-user-id, slug, and the canonical channel_id:
// Resolve (platform, slug) → channel_id. Cheap (indexed). Null if unknown.
export async function channelIdBySlug(
env: Env, platform: string, slug: string
): Promise<bigint | null>;
// Resolve (platform, platform_user_id) → channel_id. Cheap.
export async function channelIdByPlatformUserId(
env: Env, platform: string, platformUserId: string
): Promise<bigint | null>;
// Resolve channel_id → full row (current slug, display name, etc.).
export async function channel(env: Env, channelId: bigint): Promise<Channel | null>;
// Idempotent upsert. Called by every poller/webhook before inserting
// into a child table. Records a slug-rename in slug_history when the
// observed slug differs from current_slug.
export async function upsertChannel(
env: Env,
args: {
platform: string;
platformUserId: string;
slug: string;
displayName?: string;
avatarUrl?: string;
followerCount?: number;
isLive?: boolean;
}
): Promise<bigint>; // returns channel_idEvery existing call site that built a SQL query of the form WHERE kick_user_id = $1 converts to WHERE channel_id = $1 once the dm.* descendants are switched. A transition helper kickUserIdToChannelId(env, kuid) covers the brief overlap window.
HTTP
Every endpoint that previously took ?kick_user_id=… learns to also accept ?username=… and ?channel_id=…:
const channelId =
url.searchParams.has("channel_id") ? BigInt(url.searchParams.get("channel_id")!)
: url.searchParams.has("username") ? await channelIdBySlug(env, "kick", url.searchParams.get("username")!)
: url.searchParams.has("kick_user_id") ? await channelIdByPlatformUserId(env, "kick", url.searchParams.get("kick_user_id")!)
: null;New routes can take username exclusively. Admin/support tooling stops needing to know about numeric platform IDs at all.
Migration
Two-phase migration. Phase A is purely additive (new structure alongside old). Phase B switches readers/writers and drops the old.
-- 0123_dm_streamer_channel.sql
BEGIN;
CREATE SCHEMA dm;
-- ═══ Phase A — add new structure ═════════════════════════════════════
CREATE TABLE dm.streamer_channel (...);
CREATE TABLE dm.streamer_channel_slug_history (...);
-- Backfill from every observed kick_user_id. Source priority:
-- 1. streamer_channels (we have a resolved slug)
-- 2. kick_stream_sessions.channel_name
-- 3. (chat lives in CH now — not a backfill source)
-- 4. kick_streamer_probe with a synthetic slug if all else missing
INSERT INTO dm.streamer_channel (external_id, platform, platform_user_id, current_slug,
display_name, avatar_url, bio, follower_count,
is_live, first_seen_at, last_seen_at)
SELECT gen_ulid(),
'kick',
sc.kick_user_id,
COALESCE(sc.channel_name, 'unknown-' || sc.kick_user_id),
sc.kick_display_name, sc.kick_avatar_url, sc.kick_bio,
sc.kick_follower_count, COALESCE(sc.kick_is_live, false),
MIN(sc.created_at), -- approx; real history starts here
MAX(COALESCE(sc.kick_synced_at, sc.created_at))
FROM public.streamer_channels sc
WHERE sc.kick_user_id IS NOT NULL AND sc.platform = 'kick'
GROUP BY sc.kick_user_id, sc.channel_name, sc.kick_display_name,
sc.kick_avatar_url, sc.kick_bio, sc.kick_follower_count, sc.kick_is_live;
-- Pick up channels we've observed in probe/session/chat but haven't claimed.
INSERT INTO dm.streamer_channel (external_id, platform, platform_user_id, current_slug, first_seen_at)
SELECT DISTINCT gen_ulid(), 'kick', kss.kick_user_id,
COALESCE(kss.channel_name, 'unknown-' || kss.kick_user_id),
MIN(kss.stream_started_at)
FROM public.kick_stream_sessions kss
WHERE kss.kick_user_id NOT IN (
SELECT platform_user_id FROM dm.streamer_channel WHERE platform = 'kick'
)
GROUP BY kss.kick_user_id, kss.channel_name;
-- Seed slug_history with one row per channel covering the current slug.
INSERT INTO dm.streamer_channel_slug_history (channel_id, slug, observed_from)
SELECT id, current_slug, first_seen_at FROM dm.streamer_channel;
-- Create the per-env link table.
CREATE TABLE public.streamer_channel_link (
streamer_id integer NOT NULL REFERENCES public.streamers ON DELETE CASCADE,
channel_id bigint NOT NULL,
verified boolean NOT NULL DEFAULT false,
claimed_at timestamptz NOT NULL DEFAULT now(),
PRIMARY KEY (streamer_id, channel_id)
);
ALTER TABLE public.streamer_channel_link
ADD CONSTRAINT streamer_channel_link_channel_fkey
FOREIGN KEY (channel_id) REFERENCES dm.streamer_channel(id) NOT VALID;
INSERT INTO public.streamer_channel_link (streamer_id, channel_id, verified, claimed_at)
SELECT sc.streamer_id, dsc.id, sc.verified, sc.created_at
FROM public.streamer_channels sc
JOIN dm.streamer_channel dsc
ON dsc.platform = sc.platform AND dsc.platform_user_id = sc.kick_user_id
WHERE sc.streamer_id IS NOT NULL;
-- ═══ Phase A.2 — move tables, add channel_id, backfill ═══════════════
ALTER TABLE scrape.kick_raw SET SCHEMA dm;
ALTER TABLE dm.kick_raw RENAME TO scrape_log;
ALTER TABLE dm.scrape_log ADD COLUMN channel_id bigint;
UPDATE dm.scrape_log sl SET channel_id = sc.id
FROM dm.streamer_channel sc
WHERE sc.platform = 'kick' AND sc.platform_user_id = sl.kick_user_id;
ALTER TABLE public.kick_streamer_probe_schedule SET SCHEMA dm;
ALTER TABLE dm.kick_streamer_probe_schedule RENAME TO probe_schedule;
ALTER TABLE dm.probe_schedule ADD COLUMN channel_id bigint;
UPDATE dm.probe_schedule ps SET channel_id = sc.id
FROM dm.streamer_channel sc
WHERE sc.platform = 'kick' AND sc.platform_user_id = ps.kick_user_id;
ALTER TABLE dm.probe_schedule ALTER COLUMN channel_id SET NOT NULL;
ALTER TABLE dm.probe_schedule DROP CONSTRAINT IF EXISTS probe_schedule_pkey;
ALTER TABLE dm.probe_schedule ADD PRIMARY KEY (channel_id);
ALTER TABLE dm.probe_schedule
ADD CONSTRAINT probe_schedule_channel_fkey
FOREIGN KEY (channel_id) REFERENCES dm.streamer_channel NOT VALID;
ALTER TABLE dm.probe_schedule VALIDATE CONSTRAINT probe_schedule_channel_fkey;
-- (Same pattern for kick_stream_sessions → dm.stream_session,
-- kick_chat_messages does NOT move — it's dropped in Phase B
-- after the kick_chat_export pipeline switches to CH reads.)
-- ═══ Phase B — switch code → drop the old ════════════════════════════
-- Deferred to a follow-up migration once all code sites have been
-- repointed. Phase B drops:
-- public.streamer_channels (replaced by streamer_channel_link + dm.streamer_channel)
-- public.kick_streamer_probe (CH only)
-- public.streamer_trust_score_history (CH only)
-- webhooks.incoming (split into dm.webhook_message_ids + vf.webhook_inbox + dm.kick_follow_event)
-- kick_user_id columns on dm.* (channel_id is the join key now)
COMMIT;Phase A is a single transaction that can run on a live database. Every dm.* descendant temporarily carries both kick_user_id and channel_id columns; the code switches its reads/writes over to channel_id incrementally; then Phase B drops kick_user_id and the old public.streamer_channels table. Two PRs, one to flip the schema and one to retire the old columns.
The NOT VALID + VALIDATE CONSTRAINT pattern lets us add FKs without an AccessExclusiveLock on child tables — kept for the large historical PG tables (stream_session, event_subscription, scrape_log) during the move. Chat is no longer a concern: it leaves PG entirely.
Code change scope
| File | Change | Lines |
|---|---|---|
api/migrations/0123_dm_streamer_channel.sql | Phase A migration | ~250 |
api/migrations/0124_drop_legacy_kick_columns.sql | Phase B (after code repoint) | ~80 |
api/src/streamer_channel.ts | New helper module (channelIdBySlug, channelIdByPlatformUserId, upsertChannel, channel) | ~120 |
api/src/kick_viewer_poll.ts | Drop PG probe INSERT (keep chInsert); upsertChannel returns channel_id; touch dm.probe_schedule.last_probed_at | ~25 |
api/src/kick_subscriptions.ts | Use channelIdBySlug / channel(channelId) instead of poking streamer_channels.channel_name; subscription tables key on channel_id | ~35 |
api/src/kick_webhook.ts | upsertChannel on every inbound event; resolve channel_id from payload; replace webhooks.incoming dedupe with dm.webhook_message_ids; project channel.followed into dm.kick_follow_event; raw audit goes to vf.webhook_inbox; 4 probe-read sites repoint to CH helper | ~100 |
api/src/kick_maintenance.ts | Drop webhooks.incoming purge; add dm.webhook_message_ids TTL prune | ~5 |
api/src/kick_chat_export.ts | Rewrite: read from CH via watermark, write S3, advance dm.chat_export_watermark. The 50 lines of UPDATE/DELETE bookkeeping shrink to ~20 lines of watermark-driven read. | -30 (net) |
api/clickhouse/07_kick_chat_message.sql | New CH schema for chat messages (MergeTree, 30-day TTL) | ~30 |
api/src/kick_trust_score.ts | Drop PG streamer_trust_score_history INSERT; keep chInsert; reads dm.stream_session by channel_id | ~25 |
api/src/applications.ts | session_submissions resolver: probe-window scan moves to CH helper (resolves channel_id first); PG session join uses channel_id | ~20 |
api/src/streamers.ts | Streamer-channel linking: upsertChannel first, then INSERT into public.streamer_channel_link | ~25 |
api/src/clickhouse.ts | New helpers: channelProbeQuery, channelTrustScoreHistory, webhookInboxInsert; each takes a channel_id and resolves to (platform, platform_user_id) for the CH WHERE clause | ~70 |
api/src/backfill-clickhouse.ts | Delete both kick_streamer_probe and streamer_trust_score_history branches (PG sources gone). Tool retires. | -260 |
api/k8s/backfill-clickhouse-job.yaml | Delete (tool is gone) | -107 |
api/clickhouse/06_webhook_inbox.sql | New CH schema for raw webhook archive | ~30 |
api/src/oauth.ts | Kick OAuth callback writes streamer_channel_link instead of streamer_channels; upsertChannel first | ~20 |
api/src/x_follower_poll.ts | (Optional in 0123) Move X polling under dm.streamer_channel with platform='x'; or defer to a follow-up | ~30 or — |
frontend/src/api/streamers.ts | Add username / channel_id query param support | ~15 |
admin/src/panels/StreamersPanel.tsx | Search by username without an explicit user-id lookup; channel rows render from dm.streamer_channel | ~10 |
Touches roughly 19 files. The bulk is the kick_user_id → channel_id propagation, which is mechanical: a typed helper covers the translation during the overlap window, then Phase B retires the old column.
Four PG table drops (in Phase B) — kick_streamer_probe, streamer_trust_score_history, kick_chat_messages, webhooks.incoming — plus streamer_channels itself. The backfill-clickhouse tool retires entirely. After Phase B, every append-only telemetry source lives in ClickHouse; the PG tables that remain in dm are all small mutable state.
Migration ordering relative to other in-flight work
This proposal slots cleanly after the v2 baseline (current 00_baseline.sql) and before the ingestion-tier extraction. Suggested ordering:
- Phase A — migration 0123: purely additive (new schema + new identity table +
streamer_channel_link+ table moves with dualkick_user_id/channel_idcolumns + FKs). Can ship to a live DB. - Code repoints — same week: ~18 files, low-risk grep-and-replace. The dual-column overlap period means readers can switch incrementally.
- Phase B — migration 0124: drop
kick_user_idcolumns fromdm.*descendants, droppublic.streamer_channels, droppublic.kick_streamer_probe, droppublic.streamer_trust_score_history, droppublic.kick_chat_messages, dropwebhooks.incoming. Ships once all writers and readers are on the new columns andkick_chat_export.tshas switched to the watermark-driven CH read. - External Data Ingestion proposal: operates on cleaner ground —
dm.*is a single namespace to physically extract, and the dependency arrow already points the right way (env reaches into dm via soft reference; nothing in dm references env-specific data).
Doing it in this order means the ingestion proposal's "move the env-agnostic schema to its own database" step becomes a single pg_dump --schema=dm plus dropping the FK on streamer_channel_link.channel_id — no application-level join rewrites needed.
Open questions
- Schema name —
dm(datamining). Short, fits the existing schema-naming style (auth,pgboss). Platform-neutral; futuredm.sumsub_applicant_*etc. land in the same namespace. platform_user_idretention — keepdm.streamer_channel.platform_user_idfor backward compat with CH (which still keys on it) and for OAuth lookups. Not a join key for dm.* descendants, just a stable identifier.- Should
streamer_videosmove? No — curated profile content owned by the streamer (per-env), not scraped. After the inversion it FKsstreamer_idonly; no link todm.streamer_channelneeded. - Backfilling slug history — we don't have it. Backfill seeds one row per channel covering
(observed_from = first_seen_at, observed_until = NULL). Real rename detection begins from migration forward. - Slug uniqueness across renames — if channel A renames to
fooafter channel B heldfooand released it,UNIQUE (platform, current_slug)trips. Resolution: relax to a partial UNIQUE allowing NULLs; on a collision the loser getscurrent_slug = NULLand is only reachable byplatform_user_id. Edge case worth logging. - Multi-claim semantics —
streamer_channel_linkallows multiple unverified claims on the same channel (uq_verified_channelonly enforces uniqueness among verified rows). Application code decides when to mark a claim verified; only verified claims count for payout and KYC gating. - What about other platforms? Twitch and X drop in naturally under
platform='twitch'/platform='x'. The X follower poll already piggybacks onstreamer_channels; in this migration it moves underdm.streamer_channelwith the same shape. Twitch scraping (when it ships) gets the schema for free. - ClickHouse re-keying? Not needed in this proposal. CH tables stay keyed on
(env, platform, platform_user_id, ts). Per-channel queries resolvechannel_id→(platform, platform_user_id)via a one-row PG lookup before issuing the CH query.