
Worker metrics

Exposed by the worker process at :9092/metrics (cluster-internal). All metrics carry service="worker".

The worker has no HTTP request surface — its job is running the cron schedule. The two questions worth asking with metrics are:

  1. Is each poller alive and not stuck?
  2. Are external API calls (Kick / X / OAuth) succeeding, or are we running into auth issues like 401?

Both are answered directly below.

Poll-job lifecycle

Every cron schedule in worker.ts is wrapped in instrumentPoll(name, fn), which records run counts, run duration, and a "last successful run" timestamp for that job.

| Metric | Type | Labels |
| --- | --- | --- |
| vf_worker_poll_runs_total | counter | job, result (ok/error) |
| vf_worker_poll_duration_seconds | histogram | job |
| vf_worker_poll_last_success_timestamp_seconds | gauge | job |
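
The wrapper's behaviour can be sketched roughly as follows. This is not the code in api/src/metrics.ts: the in-memory `pollStats` map stands in for the real metrics, the injectable `nowMs` clock is a hypothetical parameter added for testability, and swallowing the error after counting it is an assumption.

```typescript
// Hypothetical in-memory stand-ins for the three poll metrics.
type PollResult = "ok" | "error";

interface PollStats {
  runs: Record<PollResult, number>; // vf_worker_poll_runs_total
  durationsSeconds: number[]; // vf_worker_poll_duration_seconds observations
  lastSuccessTimestampSeconds?: number; // vf_worker_poll_last_success_timestamp_seconds
}

const pollStats = new Map<string, PollStats>();

function statsFor(job: string): PollStats {
  let stats = pollStats.get(job);
  if (!stats) {
    stats = { runs: { ok: 0, error: 0 }, durationsSeconds: [] };
    pollStats.set(job, stats);
  }
  return stats;
}

// Wraps a cron callback so every run is counted and timed, while only
// successful runs advance the "last success" timestamp.
function instrumentPoll(
  job: string,
  fn: () => Promise<void>,
  nowMs: () => number = Date.now,
): () => Promise<void> {
  return async () => {
    const stats = statsFor(job);
    const startedMs = nowMs();
    let result: PollResult = "ok";
    try {
      await fn();
      stats.lastSuccessTimestampSeconds = nowMs() / 1000;
    } catch (err) {
      result = "error";
      // Assumption: the error is logged and swallowed so that one failed
      // run does not stop the scheduler.
      console.warn(`poll ${job} failed`, err);
    } finally {
      stats.runs[result] += 1;
      stats.durationsSeconds.push((nowMs() - startedMs) / 1000);
    }
  };
}
```

Because the timestamp is written only on the success path, a job that fails on every run keeps incrementing its error count while its last-success value stays where it was.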

Job names

| Job | Schedule | What it does |
| --- | --- | --- |
| viewer-poll | * * * * * (every minute) | Polls Kick channels API for live status / viewer counts |
| chat-export | 0 * * * * (hourly) | Exports Kick chat batches to Hetzner S3 |
| stale-session-cleanup | 45 * * * * (hourly) | Closes stream sessions open > 18 h with no recent probe |
| kick-token-refresh | 30 * * * * (hourly) | Proactively refreshes near-expiry Kick OAuth tokens |
| x-follower-poll | 15 * * * * (hourly) | Polls X (Twitter) follower counts |
| kick-subscriptions | 45 3 * * * (daily 03:45) | Ensures every active streamer has Kick webhook subscriptions |
| trust-score-refresh | 17 3 * * * (daily 03:17) | Recomputes streamer trust scores |
| daily-maintenance | 22 4 * * * (daily 04:22) | Purges old rows from high-volume tables |

Liveness — the single most important signal

vf_worker_poll_last_success_timestamp_seconds is a Unix timestamp set on every successful run. The headline query is:

```promql
# seconds since last success, by job
time() - max by(job) (vf_worker_poll_last_success_timestamp_seconds{service="worker"})
```

Alert when this exceeds the schedule's normal interval × headroom:

| Job | Normal interval | Suggested alert threshold |
| --- | --- | --- |
| viewer-poll | 60 s | 300 s (5 min, 5× headroom) |
| chat-export | 1 h | 2 h |
| stale-session-cleanup | 1 h | 2 h |
| kick-token-refresh | 1 h | 2 h |
| x-follower-poll | 1 h | 2 h |
| kick-subscriptions | 24 h | 30 h |
| trust-score-refresh | 24 h | 30 h |
| daily-maintenance | 24 h | 30 h |

This single metric catches both crashes and persistent error loops: a job that throws every minute increments runs_total{result="error"} but last_success_timestamp_seconds never moves — so the alert still fires.
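
To make the staleness rule concrete, here is a small sketch that applies the suggested thresholds from the table above in plain TypeScript. The names `STALE_AFTER_SECONDS` and `isStale` are hypothetical, and treating a job with no recorded success as stale is an assumption of this sketch, not something the alert rules are known to do.

```typescript
// Suggested alert thresholds from the table above, in seconds.
const HOUR = 3600;
const STALE_AFTER_SECONDS: Record<string, number> = {
  "viewer-poll": 300,
  "chat-export": 2 * HOUR,
  "stale-session-cleanup": 2 * HOUR,
  "kick-token-refresh": 2 * HOUR,
  "x-follower-poll": 2 * HOUR,
  "kick-subscriptions": 30 * HOUR,
  "trust-score-refresh": 30 * HOUR,
  "daily-maintenance": 30 * HOUR,
};

// Mirrors `time() - last_success > threshold` for a single job.
function isStale(
  job: string,
  lastSuccessSeconds: number | undefined,
  nowSeconds: number,
): boolean {
  const threshold = STALE_AFTER_SECONDS[job];
  if (threshold === undefined) {
    throw new Error(`no threshold configured for job "${job}"`);
  }
  // Assumption: a job that has never succeeded counts as stale.
  if (lastSuccessSeconds === undefined) return true;
  return nowSeconds - lastSuccessSeconds > threshold;
}
```

For example, `isStale("viewer-poll", 1000, 1301)` is true, because 301 seconds have passed against a 300-second threshold, regardless of how many failed runs happened in between.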

Other useful queries

```promql
# error rate by job
sum by(job) (rate(vf_worker_poll_runs_total{result="error"}[10m]))

# job duration p95
histogram_quantile(0.95,
  sum by(le, job) (rate(vf_worker_poll_duration_seconds_bucket[10m]))
)

# total runs in the last hour by job
sum by(job) (increase(vf_worker_poll_runs_total[1h]))
```

External HTTP calls

The counter is incremented at every call site that already logs on a non-2xx response. It lets us see auth issues (401) separately from network errors and other upstream failures.

| Metric | Type | Labels |
| --- | --- | --- |
| vf_external_http_total | counter | service, endpoint, status_code |

status_code is the literal HTTP status ("200", "401", "429", …) or the literal string "network" when the call never reached a response (DNS / TCP / TLS / abort).
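
The labelling rule can be illustrated with a short sketch. It assumes each call site knows whether it got an HTTP response or failed before one arrived. The `CallOutcome` type, the `statusCodeLabel` and `recordExternalHttp` helpers, and the in-memory map standing in for the counter are all hypothetical.

```typescript
// Hypothetical in-memory stand-in for vf_external_http_total,
// keyed by "service|endpoint|status_code".
const externalHttpTotal = new Map<string, number>();

// A call either produced an HTTP response or failed before one arrived.
type CallOutcome = { status: number } | { error: unknown };

// The literal status as a string, or "network" when there was no response.
function statusCodeLabel(outcome: CallOutcome): string {
  return "status" in outcome ? String(outcome.status) : "network";
}

function recordExternalHttp(
  service: string,
  endpoint: string,
  outcome: CallOutcome,
): void {
  const key = [service, endpoint, statusCodeLabel(outcome)].join("|");
  externalHttpTotal.set(key, (externalHttpTotal.get(key) ?? 0) + 1);
}

// Usage: an auth failure and a transport failure land in different series.
recordExternalHttp("kick-oauth", "oauth/token", { status: 401 });
recordExternalHttp("x", "users/me", { error: new Error("connection reset") });
```

Keeping transport failures under one fixed label value means a query on `status_code="network"` never has to enumerate individual error types.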

Instrumented call sites

| service label | endpoint label | What it is |
| --- | --- | --- |
| kick | channels | pollViewerCounts: Kick channels read API |
| kick | events/subscriptions | subscribeAllEvents: webhook subscription create |
| kick-oauth | oauth/token | Both getAppToken (client_credentials) and the user-token refresh in getValidKickToken |
| x | users/me | pollXFollowerCounts: Twitter follower count read |

Useful queries

```promql
# 401 rate across all external services: the headline "is auth broken" query
sum by(service, endpoint) (
  rate(vf_external_http_total{status_code="401"}[5m])
)

# all non-2xx by service
sum by(service, status_code) (
  rate(vf_external_http_total{status_code!~"2.."}[5m])
)

# transport errors (DNS / TCP / TLS / abort)
sum by(service) (
  rate(vf_external_http_total{status_code="network"}[5m])
)

# rate-limit detection
sum by(service, endpoint) (
  rate(vf_external_http_total{status_code="429"}[5m])
)
```

Scraper providers (ScraperAPI / Scrape.do)

Failover stack used to bypass Cloudflare on Kick endpoints we can't reach from VPS IPs. Wrapped by scraperFetch() in api/src/scraper.ts, which round-robins across configured providers and fails over on 5xx / Cloudflare-block / network / timeout.

| Metric | Type | Labels |
| --- | --- | --- |
| vf_scraper_requests_total | counter | provider (scraperapi / scrapedo), outcome |

Outcomes:

| Outcome | Provider state | Counted as failure? |
| --- | --- | --- |
| success | healthy | no |
| upstream_4xx | healthy (target site returned 4xx) | no |
| http_5xx | failing | yes |
| cf_block | failing (Cloudflare interstitial) | yes |
| timeout | failing (AbortController fired) | yes |
| network | failing (DNS / TCP / TLS) | yes |
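
One way to read the outcome table is as a classification function over a single provider attempt. The sketch below is illustrative only and is not taken from api/src/scraper.ts: the `Attempt` shape, the `cloudflareBlocked` flag (how an interstitial is detected is not specified here), and the use of the "AbortError" name to recognise a timeout are all assumptions.

```typescript
type ScraperOutcome =
  | "success"
  | "upstream_4xx"
  | "http_5xx"
  | "cf_block"
  | "timeout"
  | "network";

// Hypothetical summary of one provider attempt.
interface Attempt {
  status?: number; // undefined when no HTTP response arrived
  cloudflareBlocked?: boolean; // assumption: set by some interstitial check
  errorName?: string; // name of the thrown error, if any
}

// Outcomes the table marks as "counted as failure".
const FAILING_OUTCOMES = new Set<ScraperOutcome>([
  "http_5xx",
  "cf_block",
  "timeout",
  "network",
]);

function classifyOutcome(attempt: Attempt): ScraperOutcome {
  // An aborted fetch rejects with an error named "AbortError".
  if (attempt.errorName === "AbortError") return "timeout";
  if (attempt.status === undefined) return "network";
  // Checked before the status ranges: a block page can carry a 4xx status.
  if (attempt.cloudflareBlocked) return "cf_block";
  if (attempt.status >= 500) return "http_5xx";
  if (attempt.status >= 400) return "upstream_4xx";
  return "success";
}

// A failing outcome is what would trigger failover to the next provider.
function shouldFailOver(outcome: ScraperOutcome): boolean {
  return FAILING_OUTCOMES.has(outcome);
}
```

Note that `upstream_4xx` is deliberately not a failover trigger: the provider did its job and the target site itself answered with a client error.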

Headline queries

```promql
# Per-provider request rate; should track ~50/50 under round-robin
sum by(provider) (rate(vf_scraper_requests_total[5m]))

# Per-provider failure ratio (drives the warning alert)
sum by(provider) (rate(vf_scraper_requests_total{
  outcome=~"http_5xx|cf_block|timeout|network"
}[5m]))
/
clamp_min(sum by(provider) (rate(vf_scraper_requests_total[5m])), 0.0001)
```

Alerts

| Alert | Threshold | Severity |
| --- | --- | --- |
| ScraperProviderHighFailureRate | per-provider failure ratio > 20% for 10m | warning |
| ScraperAllProvidersDown | global failure ratio > 50% for 5m | critical |

The warning fires when one provider degrades — failover is still absorbing the impact, but quota is being burned and the circuit breaker is cycling. The critical fires when both providers fail together — follower-backfill, slug resolution, and VOD backfill all stall.
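
As a worked example of the failure-ratio arithmetic, the sketch below recomputes the ratio from per-outcome request rates, including the 0.0001 floor on the denominator that the PromQL query applies with clamp_min. The function name and input shape are hypothetical, and the "for 10m" / "for 5m" hold durations of the alerts are not modelled.

```typescript
// Outcomes counted as failures, matching the regex in the PromQL query.
const FAILURE_OUTCOME_NAMES = ["http_5xx", "cf_block", "timeout", "network"];

// Per-outcome request rates (requests per second) for one provider,
// or summed over all providers for the global ratio.
function failureRatio(ratesByOutcome: Record<string, number>): number {
  let failures = 0;
  let total = 0;
  for (const outcome of Object.keys(ratesByOutcome)) {
    const rate = ratesByOutcome[outcome];
    total += rate;
    if (FAILURE_OUTCOME_NAMES.indexOf(outcome) !== -1) failures += rate;
  }
  // Same floor as clamp_min(..., 0.0001): avoids dividing by zero when idle.
  return failures / Math.max(total, 0.0001);
}

// 3 failing out of 12 requests per second gives 0.25, above the 20% warning
// threshold but below the 50% critical threshold.
const degraded = failureRatio({ success: 8, upstream_4xx: 1, timeout: 2, cf_block: 1 });
```

An idle provider with no traffic yields a ratio of 0 rather than a division by zero, so it cannot trip either threshold on its own.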

Defined in operations/clusters/prod/stage/monitoring/scraper-rules.yaml.

Node runtime (free, via prom-client defaults)

Same set as the API process — nodejs_eventloop_lag_seconds, process_resident_memory_bytes, GC, heap, active handles. See API metrics → Node runtime.

Implementation

  • Module: api/src/metrics.ts (shared with the api process)
  • Wired in: api/src/worker.ts (initMetrics("worker") at startup, every cron callback wrapped in instrumentPoll(name, fn), startMetricsServer(9092))
  • External-HTTP recorders sit next to the existing log.warn lines in:
    • api/src/kick_viewer_poll.ts
    • api/src/kick_token.ts
    • api/src/kick_subscriptions.ts
    • api/src/x_follower_poll.ts

Heartbeat vs metrics — note on _readiness

The worker also exposes /_readiness on :3002 which 503s if the every-minute viewer-poll cron hasn't completed in the last 5 minutes (lastCronTickAt). This is a k8s-only signal — it tells the kubelet to restart a stuck pod. The Prometheus metrics above tell operators the same story plus per-job granularity, alert hooks, and history. They don't replace the readiness probe; they complement it.
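
The readiness rule described above reduces to one comparison. A minimal sketch, assuming `lastCronTickAt` is a millisecond timestamp and that a worker which has not yet completed any tick reports 503; the real handler may behave differently on both points.

```typescript
// Five minutes, matching the readiness window described above.
const READINESS_MAX_AGE_MS = 5 * 60 * 1000;

// Returns the HTTP status /_readiness would answer with at `nowMs`.
function readinessStatus(
  lastCronTickAt: number | undefined,
  nowMs: number,
): 200 | 503 {
  // Assumption: no completed tick yet means "not ready".
  if (lastCronTickAt === undefined) return 503;
  return nowMs - lastCronTickAt <= READINESS_MAX_AGE_MS ? 200 : 503;
}
```

The probe only looks at the every-minute viewer-poll tick, which is why the per-job Prometheus gauge is still needed to notice a stuck hourly or daily job.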
