Worker metrics
Exposed by the worker process at :9092/metrics (cluster-internal). All metrics carry service="worker".
The worker has no HTTP request surface — its job is running the cron schedule. The two questions worth asking with metrics are:
- Is each poller alive and not stuck?
- Are external API calls (Kick / X / OAuth) succeeding, or are we running into auth issues like 401?
Both are answered directly below.
Poll-job lifecycle
Every cron schedule in worker.ts is wrapped in instrumentPoll(name, fn) which automatically captures runs, duration, and a "last successful run" timestamp.
| Metric | Type | Labels |
|---|---|---|
| vf_worker_poll_runs_total | counter | job, result (ok/error) |
| vf_worker_poll_duration_seconds | histogram | job |
| vf_worker_poll_last_success_timestamp_seconds | gauge | job |
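As a rough illustration of what a wrapper like instrumentPoll might do, here is a minimal sketch. The real implementation in api/src/metrics.ts is not shown in this document, so everything here is an assumption: `PollRecorder` is a hypothetical in-memory stand-in for the Prometheus counter, histogram, and gauge, `instrumentPollSketch` is an illustrative name, and the choice to swallow errors rather than rethrow them is a guess.

```typescript
// Hypothetical in-memory stand-in for the three metrics in the table above.
// The real wrapper presumably records into a Prometheus client instead.
type PollResult = "ok" | "error";

class PollRecorder {
  runs = new Map<string, number>(); // key: `${job}:${result}`
  durationsSeconds = new Map<string, number[]>(); // key: job
  lastSuccessSeconds = new Map<string, number>(); // key: job, Unix seconds

  recordRun(job: string, result: PollResult, durationSeconds: number): void {
    const key = `${job}:${result}`;
    this.runs.set(key, (this.runs.get(key) ?? 0) + 1);

    const samples = this.durationsSeconds.get(job) ?? [];
    samples.push(durationSeconds);
    this.durationsSeconds.set(job, samples);

    // Only successful runs move the "last success" timestamp.
    if (result === "ok") {
      this.lastSuccessSeconds.set(job, Date.now() / 1000);
    }
  }
}

// Wraps a cron callback so every run is counted and timed.
// Assumption: errors are swallowed so one failed run does not stop the schedule.
function instrumentPollSketch(
  recorder: PollRecorder,
  job: string,
  fn: () => Promise<void>,
): () => Promise<void> {
  return async () => {
    const startedMs = Date.now();
    let result: PollResult = "ok";
    try {
      await fn();
    } catch {
      result = "error";
    }
    recorder.recordRun(job, result, (Date.now() - startedMs) / 1000);
  };
}
```

The point of the design is that the call site only supplies a job name and a callback; run counting, timing, and the last-success timestamp all happen in one place.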
Job names
| Job | Schedule | What it does |
|---|---|---|
| viewer-poll | * * * * * (every minute) | Polls Kick channels API for live status / viewer counts |
| chat-export | 0 * * * * (hourly) | Exports Kick chat batches to Hetzner S3 |
| stale-session-cleanup | 45 * * * * (hourly) | Closes stream sessions open > 18 h with no recent probe |
| kick-token-refresh | 30 * * * * (hourly) | Proactively refreshes near-expiry Kick OAuth tokens |
| x-follower-poll | 15 * * * * (hourly) | Polls X (Twitter) follower counts |
| kick-subscriptions | 45 3 * * * (daily 03:45) | Ensures every active streamer has Kick webhook subscriptions |
| trust-score-refresh | 17 3 * * * (daily 03:17) | Recomputes streamer trust scores |
| daily-maintenance | 22 4 * * * (daily 04:22) | Purges old rows from high-volume tables |
Liveness — the single most important signal
vf_worker_poll_last_success_timestamp_seconds is a Unix timestamp set on every successful run. The headline query is:
# seconds since last success, by job
time() - max by(job) (vf_worker_poll_last_success_timestamp_seconds{service="worker"})
Alert when this exceeds the schedule's normal interval × headroom:
| Job | Normal interval | Suggested alert threshold |
|---|---|---|
| viewer-poll | 60 s | 300 s (5 min, 5× headroom) |
| chat-export | 1 h | 2 h |
| stale-session-cleanup | 1 h | 2 h |
| kick-token-refresh | 1 h | 2 h |
| x-follower-poll | 1 h | 2 h |
| kick-subscriptions | 24 h | 30 h |
| trust-score-refresh | 24 h | 30 h |
| daily-maintenance | 24 h | 30 h |
This single metric catches both crashes and persistent error loops: a job that throws every minute increments runs_total{result="error"} but last_success_timestamp_seconds never moves — so the alert still fires.
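To make the threshold table concrete, the sketch below turns it into per-job alert expressions and a local staleness check. The threshold values come from the table above; the names `STALENESS_THRESHOLD_SECONDS`, `stalenessAlertExpr`, and `isStale` are hypothetical helpers for illustration, not part of the worker code.

```typescript
// Alert thresholds in seconds, copied from the table above.
const STALENESS_THRESHOLD_SECONDS: Record<string, number> = {
  "viewer-poll": 300,
  "chat-export": 2 * 3600,
  "stale-session-cleanup": 2 * 3600,
  "kick-token-refresh": 2 * 3600,
  "x-follower-poll": 2 * 3600,
  "kick-subscriptions": 30 * 3600,
  "trust-score-refresh": 30 * 3600,
  "daily-maintenance": 30 * 3600,
};

function thresholdFor(job: string): number {
  const threshold = STALENESS_THRESHOLD_SECONDS[job];
  if (threshold === undefined) throw new Error(`unknown job: ${job}`);
  return threshold;
}

// Builds the PromQL expression that is true while the job is stale.
function stalenessAlertExpr(job: string): string {
  return (
    "time() - max by(job) (vf_worker_poll_last_success_timestamp_seconds" +
    `{service="worker",job="${job}"}) > ${thresholdFor(job)}`
  );
}

// The same comparison evaluated locally, given two Unix timestamps in seconds.
function isStale(job: string, lastSuccessSeconds: number, nowSeconds: number): boolean {
  return nowSeconds - lastSuccessSeconds > thresholdFor(job);
}
```

For example, `stalenessAlertExpr("viewer-poll")` yields the headline query restricted to that job with `> 300` appended. Because the comparison uses only the last success time, a job that runs but always errors is treated exactly like a job that never runs.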
Other useful queries
# error rate by job
sum by(job) (rate(vf_worker_poll_runs_total{result="error"}[10m]))
# job duration p95
histogram_quantile(0.95,
sum by(le, job) (rate(vf_worker_poll_duration_seconds_bucket[10m]))
)
# total runs in the last hour by job
sum by(job) (increase(vf_worker_poll_runs_total[1h]))
External HTTP calls
Incremented at every call site where we already log a non-2xx response. This lets us see auth issues (401) separately from network errors and other upstream failures.
| Metric | Type | Labels |
|---|---|---|
| vf_external_http_total | counter | service, endpoint, status_code |
status_code is the literal HTTP status ("200", "401", "429", …) or the literal string "network" when the call never reached a response (DNS / TCP / TLS / abort).
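The labeling rule above can be sketched as follows. This is an illustration under stated assumptions, not the recorder used in the worker: `externalHttpCounts`, `recordExternalHttp`, and `recordedCall` are hypothetical names, the counter is an in-memory map rather than a Prometheus metric, and the HTTP client is injected as `doFetch` so the sketch does not depend on any particular library.

```typescript
// Hypothetical in-memory counter keyed by "service|endpoint|status_code".
const externalHttpCounts = new Map<string, number>();

function recordExternalHttp(service: string, endpoint: string, statusCode: string): void {
  const key = `${service}|${endpoint}|${statusCode}`;
  externalHttpCounts.set(key, (externalHttpCounts.get(key) ?? 0) + 1);
}

// Runs one outbound call and records its outcome.
// A resolved call is labeled with its literal HTTP status; a rejected call
// never produced a response, so it is labeled "network".
async function recordedCall(
  service: string,
  endpoint: string,
  doFetch: () => Promise<{ status: number }>,
): Promise<{ status: number } | undefined> {
  try {
    const res = await doFetch();
    recordExternalHttp(service, endpoint, String(res.status));
    return res;
  } catch {
    // DNS, TCP, TLS, or abort: there is no status code to report.
    recordExternalHttp(service, endpoint, "network");
    return undefined;
  }
}
```

Keeping "network" as a value of the same label means a single metric covers both HTTP-level and transport-level failures, and they can still be split apart with a label matcher.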
Instrumented call sites
| service label | endpoint label | What it is |
|---|---|---|
| kick | channels | pollViewerCounts: Kick channels read API |
| kick | events/subscriptions | subscribeAllEvents: webhook subscription create |
| kick-oauth | oauth/token | Both getAppToken (client_credentials) and the user-token refresh in getValidKickToken |
| x | users/me | pollXFollowerCounts: Twitter follower count read |
Useful queries
# 401 rate across all external services — the headline "is auth broken" query
sum by(service, endpoint) (
rate(vf_external_http_total{status_code="401"}[5m])
)
# all non-2xx by service
sum by(service, status_code) (
rate(vf_external_http_total{status_code!~"2.."}[5m])
)
# transport errors (DNS / TCP / TLS / abort)
sum by(service) (
rate(vf_external_http_total{status_code="network"}[5m])
)
# rate-limit detection
sum by(service, endpoint) (
rate(vf_external_http_total{status_code="429"}[5m])
)
Scraper providers (ScraperAPI / Scrape.do)
Failover stack used to bypass Cloudflare on Kick endpoints we can't reach from VPS IPs. Wrapped by scraperFetch() in api/src/scraper.ts, which round-robins across configured providers and fails over on 5xx / Cloudflare-block / network / timeout.
| Metric | Type | Labels |
|---|---|---|
| vf_scraper_requests_total | counter | provider (scraperapi / scrapedo), outcome |
Outcomes:
| Outcome | Provider state | Counted as failure? |
|---|---|---|
| success | healthy | no |
| upstream_4xx | healthy (target site returned 4xx) | no |
| http_5xx | failing | yes |
| cf_block | failing (Cloudflare interstitial) | yes |
| timeout | failing (AbortController fired) | yes |
| network | failing (DNS / TCP / TLS) | yes |
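The round-robin and failover behaviour described for scraperFetch(), combined with the outcome table above, could look roughly like this. It is a simplified sketch, not the code in api/src/scraper.ts: `makeScraperFetch`, `ProviderCall`, and the `attempts` log are hypothetical, providers are injected as plain functions that already report an outcome, and the circuit breaker is not modeled.

```typescript
type ScraperOutcome =
  | "success"
  | "upstream_4xx"
  | "http_5xx"
  | "cf_block"
  | "timeout"
  | "network";

// Outcomes that mark the provider itself as failing, per the table above.
const FAILURE_OUTCOMES = new Set<ScraperOutcome>(["http_5xx", "cf_block", "timeout", "network"]);

function countsAsFailure(outcome: ScraperOutcome): boolean {
  return FAILURE_OUTCOMES.has(outcome);
}

// Hypothetical provider signature: resolves with an outcome and an optional body.
type ProviderCall = (url: string) => Promise<{ outcome: ScraperOutcome; body?: string }>;

function makeScraperFetch(providers: Record<string, ProviderCall>) {
  const names = Object.keys(providers);
  if (names.length === 0) throw new Error("no scraper providers configured");

  let next = 0; // index of the provider that gets the next request first
  const attempts: Array<{ provider: string; outcome: ScraperOutcome }> = [];

  async function fetchVia(url: string) {
    const start = next;
    next = (next + 1) % names.length; // rotate the starting provider per request
    for (let i = 0; i < names.length; i++) {
      const provider = names[(start + i) % names.length];
      const res = await providers[provider](url);
      attempts.push({ provider, outcome: res.outcome }); // one count per attempt
      // upstream_4xx is the target's answer, not a provider fault: no failover.
      if (!countsAsFailure(res.outcome)) return { provider, ...res };
    }
    throw new Error("all scraper providers failed");
  }

  return { fetchVia, attempts };
}
```

Note that each attempt is recorded, so one logical request that fails over produces two entries. That matches the idea that the per-provider failure ratio reflects provider health rather than end-user success.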
Headline queries
# Per-provider request rate — should track ~50/50 under round-robin
sum by(provider) (rate(vf_scraper_requests_total[5m]))
# Per-provider failure ratio (drives the warning alert)
sum by(provider) (rate(vf_scraper_requests_total{
outcome=~"http_5xx|cf_block|timeout|network"
}[5m]))
/
clamp_min(sum by(provider) (rate(vf_scraper_requests_total[5m])), 0.0001)
Alerts
| Alert | Threshold | Severity |
|---|---|---|
| ScraperProviderHighFailureRate | per-provider failure ratio > 20% for 10m | warning |
| ScraperAllProvidersDown | global failure ratio > 50% for 5m | critical |
The warning fires when one provider degrades — failover is still absorbing the impact, but quota is being burned and the circuit breaker is cycling. The critical fires when both providers fail together — follower-backfill, slug resolution, and VOD backfill all stall.
Defined in operations/clusters/prod/stage/monitoring/scraper-rules.yaml.
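The arithmetic behind the two alerts can be sketched locally. This is an illustration only: the actual rules live in the YAML file and are not reproduced here. The sketch assumes the "global" ratio is computed over all providers combined, works on request counts in a window instead of rates, mirrors `clamp_min` with `Math.max` to avoid dividing by zero, and ignores the `for 10m` / `for 5m` hold durations. All names are hypothetical.

```typescript
// Requests per outcome within one evaluation window, for one provider.
type OutcomeCounts = Record<string, number>;

const FAILING_OUTCOMES = ["http_5xx", "cf_block", "timeout", "network"];

// failed / total, with the denominator clamped so an idle window yields 0.
function failureRatio(counts: OutcomeCounts): number {
  let total = 0;
  let failed = 0;
  for (const outcome of Object.keys(counts)) {
    total += counts[outcome];
    if (FAILING_OUTCOMES.indexOf(outcome) !== -1) failed += counts[outcome];
  }
  return failed / Math.max(total, 0.0001);
}

function evaluateScraperAlerts(perProvider: Record<string, OutcomeCounts>) {
  const warningProviders: string[] = [];
  const combined: OutcomeCounts = {};

  for (const provider of Object.keys(perProvider)) {
    const counts = perProvider[provider];
    // Warning threshold: per-provider failure ratio above 20%.
    if (failureRatio(counts) > 0.2) warningProviders.push(provider);
    for (const outcome of Object.keys(counts)) {
      combined[outcome] = (combined[outcome] ?? 0) + counts[outcome];
    }
  }

  // Critical threshold: failure ratio across all providers above 50%.
  return { warningProviders, critical: failureRatio(combined) > 0.5 };
}
```

With one provider at 30% failures and the other healthy, only the warning condition holds, because the combined ratio stays well under 50%. The critical condition needs failures on both sides.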
Node runtime (free, via prom-client defaults)
Same set as the API process — nodejs_eventloop_lag_seconds, process_resident_memory_bytes, GC, heap, active handles. See API metrics → Node runtime.
Implementation
- Module: api/src/metrics.ts (shared with the api process)
- Wired in: api/src/worker.ts, which calls initMetrics("worker") at startup, wraps every cron callback in instrumentPoll(name, fn), and calls startMetricsServer(9092)
- External-HTTP recorders sit next to the existing log.warn lines in:
  - api/src/kick_viewer_poll.ts
  - api/src/kick_token.ts
  - api/src/kick_subscriptions.ts
  - api/src/x_follower_poll.ts
Heartbeat vs metrics — note on _readiness
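The worker also exposes /_readiness on :3002, which returns 503 if the every-minute viewer-poll cron hasn't completed in the last 5 minutes (lastCronTickAt). This is a k8s-only signal that lets the kubelet detect a stuck pod. Note that the kubelet restarts a container only on a failing liveness probe; a failing readiness probe alone just marks the pod unready, so whether a restart follows depends on how the probe is wired. The Prometheus metrics above tell operators the same story plus per-job granularity, alert hooks, and history. They don't replace the readiness probe; they complement it.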
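The readiness rule reduces to a single comparison, sketched below. The 5-minute window and the lastCronTickAt name come from the description above; `readinessStatusCode` and `READINESS_MAX_AGE_MS` are hypothetical, and how the real handler behaves before the first tick after startup is not covered here.

```typescript
// Maximum age of the last completed viewer-poll tick before reporting unready.
const READINESS_MAX_AGE_MS = 5 * 60 * 1000;

// Returns the HTTP status the readiness endpoint would answer with,
// given the time of the last completed tick and the current time (both in ms).
function readinessStatusCode(lastCronTickAtMs: number, nowMs: number): 200 | 503 {
  return nowMs - lastCronTickAtMs <= READINESS_MAX_AGE_MS ? 200 : 503;
}
```

This is the same 300-second window as the suggested viewer-poll alert threshold, so the probe and the alert should flip at about the same moment. The difference is that the probe only knows about one job and keeps no history.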