Worker metrics

Exposed by the worker process at :9092/metrics (cluster-internal). All metrics carry service="worker".

The worker serves no application HTTP requests; its job is running the cron schedule. The two questions worth asking of its metrics are:

- Is each poller alive and not stuck?
- Are external API calls (Kick / X / OAuth) succeeding, or are we hitting auth failures such as HTTP 401?

Both are answered directly below.

Poll-job lifecycle

Every cron schedule in worker.ts is wrapped in instrumentPoll(name, fn), which automatically records the run count, the run duration, and a "last successful run" timestamp.

Metric	Type	Labels
`vf_worker_poll_runs_total`	counter	`job`, `result` (`ok`/`error`)
`vf_worker_poll_duration_seconds`	histogram	`job`
`vf_worker_poll_last_success_timestamp_seconds`	gauge	`job`
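As a rough illustration of what such a wrapper does, here is a minimal sketch. The real instrumentPoll lives in api/src/metrics.ts and is not reproduced in this document, so everything below is an assumption: the in-memory `pollStats` object is a hypothetical stand-in for the three Prometheus metrics, and swallowing the error is one possible design choice, not necessarily the one the worker uses.

```typescript
// Sketch only: hypothetical in-memory stand-in for the three metrics in the table above.
interface PollStats {
  runs: Record<string, number>;           // key `${job}|${result}`, mirrors vf_worker_poll_runs_total
  durationsSec: Record<string, number[]>; // key job, mirrors vf_worker_poll_duration_seconds
  lastSuccessSec: Record<string, number>; // key job, mirrors vf_worker_poll_last_success_timestamp_seconds
}

const pollStats: PollStats = { runs: {}, durationsSec: {}, lastSuccessSec: {} };

// Wraps a cron callback so that every run is counted, timed, and (on success) timestamped.
function instrumentPoll(job: string, fn: () => Promise<void>): () => Promise<void> {
  return async () => {
    const startMs = Date.now();
    let result: "ok" | "error" = "ok";
    try {
      await fn();
      // Only a successful run moves the liveness timestamp.
      pollStats.lastSuccessSec[job] = Date.now() / 1000;
    } catch {
      // Assumption: the error is swallowed so one failed run cannot crash the worker.
      result = "error";
    } finally {
      const key = `${job}|${result}`;
      pollStats.runs[key] = (pollStats.runs[key] ?? 0) + 1;
      const durations = pollStats.durationsSec[job] ?? [];
      durations.push((Date.now() - startMs) / 1000);
      pollStats.durationsSec[job] = durations;
    }
  };
}
```

A failing job therefore still increments the run counter and records a duration, but its last-success timestamp stays where it was.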

Job names

Job	Schedule	What it does
`viewer-poll`	`* * * * *` (every minute)	Polls Kick channels API for live status / viewer counts
`chat-export`	`0 * * * *` (hourly)	Exports Kick chat batches to Hetzner S3
`stale-session-cleanup`	`45 * * * *` (hourly)	Closes stream sessions open > 18 h with no recent probe
`kick-token-refresh`	`30 * * * *` (hourly)	Proactively refreshes near-expiry Kick OAuth tokens
`x-follower-poll`	`15 * * * *` (hourly)	Polls X (Twitter) follower counts
`kick-subscriptions`	`45 3 * * *` (daily 03:45)	Ensures every active streamer has Kick webhook subscriptions
`trust-score-refresh`	`17 3 * * *` (daily 03:17)	Recomputes streamer trust scores
`daily-maintenance`	`22 4 * * *` (daily 04:22)	Purges old rows from high-volume tables

Liveness — the single most important signal

vf_worker_poll_last_success_timestamp_seconds is a Unix timestamp set on every successful run. The headline query is:

promql

# seconds since last success, by job
time() - max by(job) (vf_worker_poll_last_success_timestamp_seconds{service="worker"})

Alert when this value exceeds the job's normal interval multiplied by a headroom factor:

Job	Normal interval	Suggested alert threshold
`viewer-poll`	60 s	300 s (5 min — 5× headroom)
`chat-export`	1 h	2 h
`stale-session-cleanup`	1 h	2 h
`kick-token-refresh`	1 h	2 h
`x-follower-poll`	1 h	2 h
`kick-subscriptions`	24 h	30 h
`trust-score-refresh`	24 h	30 h
`daily-maintenance`	24 h	30 h

This single metric catches both crashes and persistent error loops: a job that throws on every run keeps incrementing vf_worker_poll_runs_total{result="error"}, but vf_worker_poll_last_success_timestamp_seconds never advances, so the alert still fires.
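The threshold table can be turned into per-job alert expressions mechanically. The sketch below is illustrative only: `STALE_AFTER_SECONDS`, `stalenessAlertExpr`, and `isStale` are hypothetical names, and the thresholds are simply the suggested values from the table converted to seconds.

```typescript
// Suggested alert thresholds from the table above, in seconds.
const STALE_AFTER_SECONDS: Record<string, number> = {
  "viewer-poll": 300,
  "chat-export": 2 * 3600,
  "stale-session-cleanup": 2 * 3600,
  "kick-token-refresh": 2 * 3600,
  "x-follower-poll": 2 * 3600,
  "kick-subscriptions": 30 * 3600,
  "trust-score-refresh": 30 * 3600,
  "daily-maintenance": 30 * 3600,
};

function thresholdFor(job: string): number {
  const threshold = STALE_AFTER_SECONDS[job];
  if (threshold === undefined) throw new Error(`unknown job: ${job}`);
  return threshold;
}

// Builds the PromQL expression that is true while the job is stale.
function stalenessAlertExpr(job: string): string {
  return (
    "time() - max by(job) (vf_worker_poll_last_success_timestamp_seconds" +
    `{service="worker",job="${job}"}) > ${thresholdFor(job)}`
  );
}

// The same rule evaluated locally from two Unix timestamps (seconds).
function isStale(job: string, lastSuccessSec: number, nowSec: number): boolean {
  return nowSec - lastSuccessSec > thresholdFor(job);
}
```

For example, `stalenessAlertExpr("viewer-poll")` yields the headline query restricted to that job with `> 300` appended.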

Other useful queries

promql

# error rate by job
sum by(job) (rate(vf_worker_poll_runs_total{result="error"}[10m]))

# job duration p95
histogram_quantile(0.95,
  sum by(le, job) (rate(vf_worker_poll_duration_seconds_bucket[10m]))
)

# total runs in the last hour by job
sum by(job) (increase(vf_worker_poll_runs_total[1h]))

External HTTP calls

Incremented at every call site that already logs on a non-2xx response. The counter separates auth failures (401) from network errors and other upstream failures.

Metric	Type	Labels
`vf_external_http_total`	counter	`service`, `endpoint`, `status_code`

status_code is the literal HTTP status ("200", "401", "429", …) or the literal string "network" when the call never reached a response (DNS / TCP / TLS / abort).
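The labelling rule above can be sketched as a small wrapper around an outbound call. This is an assumption-laden illustration, not the code in the files listed under Implementation: `externalHttpTotal`, `recordExternalHttp`, and `countedCall` are hypothetical names, and a `Map` stands in for the real counter.

```typescript
// Hypothetical in-memory stand-in for vf_external_http_total.
// Key format: "service|endpoint|status_code".
const externalHttpTotal = new Map<string, number>();

function recordExternalHttp(service: string, endpoint: string, statusCode: string): void {
  const key = `${service}|${endpoint}|${statusCode}`;
  externalHttpTotal.set(key, (externalHttpTotal.get(key) ?? 0) + 1);
}

// Minimal response shape; the Response returned by fetch satisfies it.
type StatusOnly = { status: number };

// Runs one external call and records how it ended.
async function countedCall<T extends StatusOnly>(
  service: string,
  endpoint: string,
  doCall: () => Promise<T>,
): Promise<T> {
  let res: T;
  try {
    res = await doCall();
  } catch (err) {
    // No response at all (DNS / TCP / TLS / abort): label as "network".
    recordExternalHttp(service, endpoint, "network");
    throw err;
  }
  // A response arrived: label with the literal status, e.g. "200" or "401".
  recordExternalHttp(service, endpoint, String(res.status));
  return res;
}
```

A call site would then look like `countedCall("kick", "channels", () => fetch(url))`, where `url` is whatever the poller already requests.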

Instrumented call sites

`service` label	`endpoint` label	What it is
`kick`	`channels`	`pollViewerCounts` — Kick channels read API
`kick`	`events/subscriptions`	`subscribeAllEvents` — webhook subscription create
`kick-oauth`	`oauth/token`	Both `getAppToken` (client_credentials) and the user-token refresh in `getValidKickToken`
`x`	`users/me`	`pollXFollowerCounts` — Twitter follower count read

Useful queries

promql

# 401 rate across all external services — the headline "is auth broken" query
sum by(service, endpoint) (
  rate(vf_external_http_total{status_code="401"}[5m])
)

# all non-2xx by service
sum by(service, status_code) (
  rate(vf_external_http_total{status_code!~"2.."}[5m])
)

# transport errors (DNS / TCP / TLS / abort)
sum by(service) (
  rate(vf_external_http_total{status_code="network"}[5m])
)

# rate-limit detection
sum by(service, endpoint) (
  rate(vf_external_http_total{status_code="429"}[5m])
)

Scraper providers (ScraperAPI / Scrape.do)

A failover stack of scraping providers, used to get past Cloudflare on Kick endpoints that we can't reach from VPS IPs. It is wrapped by scraperFetch() in api/src/scraper.ts, which round-robins across the configured providers and fails over to the next one on a 5xx, a Cloudflare block, a network error, or a timeout.

Metric	Type	Labels
`vf_scraper_requests_total`	counter	`provider` (`scraperapi` / `scrapedo`), `outcome`

Outcomes:

Outcome	Provider state	Counted as failure?
`success`	healthy	no
`upstream_4xx`	healthy (target site returned 4xx)	no
`http_5xx`	failing	yes
`cf_block`	failing (Cloudflare interstitial)	yes
`timeout`	failing (AbortController fired)	yes
`network`	failing (DNS / TCP / TLS)	yes
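To make the outcome table and the round-robin failover concrete, here is a simplified sketch. It is not the code in api/src/scraper.ts: the `Attempt` type, `classify`, and `fetchWithFailover` are hypothetical, the attempt is synchronous to keep the example short (the real call is presumably async), and the circuit breaker is left out entirely.

```typescript
type Outcome = "success" | "upstream_4xx" | "http_5xx" | "cf_block" | "timeout" | "network";

// Outcomes that mark the provider itself as failing (the "yes" rows above).
const FAILURE_OUTCOMES: ReadonlySet<Outcome> = new Set<Outcome>([
  "http_5xx", "cf_block", "timeout", "network",
]);

// Hypothetical description of one attempt against one provider.
type Attempt =
  | { kind: "response"; status: number; cloudflareBlock: boolean }
  | { kind: "timeout" }
  | { kind: "network" };

function classify(attempt: Attempt): Outcome {
  if (attempt.kind === "timeout") return "timeout";
  if (attempt.kind === "network") return "network";
  // Checked before the status code, since an interstitial can arrive as 403 or 503.
  if (attempt.cloudflareBlock) return "cf_block";
  if (attempt.status >= 500) return "http_5xx";
  if (attempt.status >= 400) return "upstream_4xx";
  return "success";
}

let nextStart = 0; // round-robin cursor shared across calls

// Tries each provider at most once, starting at the cursor. Returns the provider
// that served the request (or null) and every (provider, outcome) pair to count.
function fetchWithFailover(
  providers: string[],
  tryProvider: (provider: string) => Attempt,
): { served: string | null; counted: Array<[string, Outcome]> } {
  const counted: Array<[string, Outcome]> = [];
  if (providers.length === 0) return { served: null, counted };
  const start = nextStart % providers.length;
  nextStart = (start + 1) % providers.length;
  for (let i = 0; i < providers.length; i++) {
    const provider = providers[(start + i) % providers.length];
    const outcome = classify(tryProvider(provider));
    counted.push([provider, outcome]);
    // upstream_4xx is the target's answer, not a provider fault, so it ends the loop too.
    if (!FAILURE_OUTCOMES.has(outcome)) return { served: provider, counted };
  }
  return { served: null, counted };
}
```

Each pair in `counted` corresponds to one increment of vf_scraper_requests_total, which is why a failing provider shows up in the metric even when failover hides the failure from callers.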

Headline queries

promql

# Per-provider request rate — should track ~50/50 under round-robin
sum by(provider) (rate(vf_scraper_requests_total[5m]))

# Per-provider failure ratio (drives the warning alert)
sum by(provider) (rate(vf_scraper_requests_total{
  outcome=~"http_5xx|cf_block|timeout|network"
}[5m]))
/
clamp_min(sum by(provider) (rate(vf_scraper_requests_total[5m])), 0.0001)

Alerts

Alert	Threshold	Severity
`ScraperProviderHighFailureRate`	per-provider failure ratio > 20% for 10m	warning
`ScraperAllProvidersDown`	global failure ratio > 50% for 5m	critical

The warning alert fires when one provider degrades: failover is still absorbing the impact, but quota is being burned and the circuit breaker is cycling. The critical alert fires when both providers fail together, at which point follower-backfill, slug resolution, and VOD backfill all stall.

Defined in operations/clusters/prod/stage/monitoring/scraper-rules.yaml.
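The arithmetic behind the two alerts can be sketched as follows. This is only an illustration of the ratio and thresholds described above, not the contents of the rules file: `failureRatio` and `evaluateScraperAlerts` are hypothetical names, the "global" ratio is assumed to be failures summed over all providers divided by total requests, and the `for 10m` / `for 5m` hold durations are not modelled.

```typescript
// Failed and total request rates (or counts) for one provider over the window.
type Counts = { failed: number; total: number };

// Mirrors failures / clamp_min(total, 0.0001): no traffic yields 0 instead of NaN.
function failureRatio({ failed, total }: Counts): number {
  return failed / Math.max(total, 0.0001);
}

function evaluateScraperAlerts(perProvider: Record<string, Counts>): {
  warning: string[]; // providers whose own failure ratio exceeds 20%
  critical: boolean; // true when the assumed global ratio exceeds 50%
} {
  const warning = Object.keys(perProvider).filter(
    (provider) => failureRatio(perProvider[provider]) > 0.2,
  );
  let failed = 0;
  let total = 0;
  for (const provider of Object.keys(perProvider)) {
    failed += perProvider[provider].failed;
    total += perProvider[provider].total;
  }
  return { warning, critical: failureRatio({ failed, total }) > 0.5 };
}
```

With one provider at 30% failures and the other at 5%, only the warning applies; the critical condition needs the combined ratio to cross 50%, which in practice means both providers are failing.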

Node runtime (free, via prom-client defaults)

Same set as the API process — nodejs_eventloop_lag_seconds, process_resident_memory_bytes, GC, heap, active handles. See API metrics → Node runtime.

Implementation

- Module: api/src/metrics.ts (shared with the api process)
- Wired in: api/src/worker.ts, which calls initMetrics("worker") at startup, wraps every cron callback in instrumentPoll(name, fn), and calls startMetricsServer(9092)
- External-HTTP recorders sit next to the existing log.warn lines in:
  - api/src/kick_viewer_poll.ts
  - api/src/kick_token.ts
  - api/src/kick_subscriptions.ts
  - api/src/x_follower_poll.ts

Heartbeat vs metrics — note on `_readiness`

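The worker also exposes /_readiness on :3002, which returns 503 if the every-minute viewer-poll cron hasn't completed in the last 5 minutes (tracked via lastCronTickAt). This is a k8s-only signal, consumed by the kubelet. Kubernetes restarts a container only when a liveness (or startup) probe fails; a failing readiness probe just marks the pod NotReady, so this endpoint restarts a stuck pod only if it also backs the liveness probe. The Prometheus metrics above tell operators the same story and add per-job granularity, alert hooks, and history. They don't replace the probe; they complement it.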
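The decision the endpoint makes reduces to a single comparison. A minimal sketch, assuming lastCronTickAt is stored as epoch milliseconds and initialised at startup (neither detail is confirmed by this document); `readinessStatus` and `READINESS_MAX_AGE_MS` are hypothetical names:

```typescript
// The 5-minute window described above, in milliseconds.
const READINESS_MAX_AGE_MS = 5 * 60 * 1000;

// Returns the HTTP status the readiness endpoint would answer with.
// lastCronTickAtMs: epoch ms of the last completed viewer-poll run (assumed representation).
function readinessStatus(lastCronTickAtMs: number, nowMs: number): 200 | 503 {
  return nowMs - lastCronTickAtMs <= READINESS_MAX_AGE_MS ? 200 : 503;
}
```

The 5-minute window matches the suggested 300 s alert threshold for viewer-poll, so the probe and the Prometheus alert should trip at roughly the same time.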
