
Webhook metrics

Inbound webhooks are received by the API process at the public endpoint /wh/kick and emit metrics on :9091/metrics alongside the rest of the API metrics. All carry service="api".

The receiver has two phases, each with its own metric pair:

  1. Receive (sync): runs to completion before the 200 ack goes back. Kick retries any delivery whose ack takes longer than 1 s, so this path is latency-sensitive.
  2. Async (processEvent): runs after the ack via ctx.waitUntil. Errors here never surface to the sender, so the metric is the only signal we have for downstream-handler health.
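
The ordering of the two phases can be sketched as follows. This is a minimal illustration, not the code in api/src/kick_webhook.ts: `ExecutionContextLike`, `handleDelivery`, and the `onAsyncResult` callback are hypothetical names, and only the `waitUntil` method of the runtime context is modelled.

```typescript
// Hypothetical stand-in for the runtime's execution context.
interface ExecutionContextLike {
  waitUntil(promise: Promise<unknown>): void;
}

type Ack = { status: number; body: string };

// Phase 1 is awaited before the ack is returned; phase 2 is handed to
// waitUntil and settles after the sender already has its 200.
async function handleDelivery(
  ctx: ExecutionContextLike,
  receive: () => Promise<void>,
  processEvent: () => Promise<void>,
  onAsyncResult: (result: "ok" | "error") => void,
): Promise<Ack> {
  await receive(); // verify, dedupe, persist: the latency-sensitive part

  // A rejection here can no longer change the response, so it is turned
  // into a recorded result instead of being rethrown.
  ctx.waitUntil(
    processEvent().then(
      () => onAsyncResult("ok"),
      () => onAsyncResult("error"),
    ),
  );

  return { status: 200, body: "ok" };
}
```

In this sketch a failing `processEvent` still produces a 200 ack; the failure shows up only through `onAsyncResult("error")`, which is why the async counter matters.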

Receive (synchronous ack path)

| Metric | Type | Labels |
| --- | --- | --- |
| vf_webhook_received_total | counter | source, event_type, result |
| vf_webhook_receive_duration_seconds | histogram | source, result |

The histogram drops event_type to keep cardinality bounded — result alone is enough for SLOs.
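
A rough worked example of the cardinality argument, as a sketch under stated assumptions: one source (kick), 12 event_type values (the 11 subscribed types listed on this page plus unknown), the 7 documented result values, and a hypothetical 11 finite histogram buckets (the real bucket layout is not stated here). `seriesUpperBound` is an illustrative helper, not part of api/src/metrics.ts.

```typescript
// Upper bound on time series: product of the distinct values per label,
// times the number of series each label combination produces.
function seriesUpperBound(labelValueCounts: number[], seriesPerCombination = 1): number {
  return labelValueCounts.reduce((acc, n) => acc * n, seriesPerCombination);
}

const SOURCES = 1;          // "kick"
const EVENT_TYPES = 12;     // 11 subscribed types + "unknown"
const RESULTS = 7;          // ok, duplicate, invalid_signature, ...
const FINITE_BUCKETS = 11;  // assumption: actual bucket layout not documented here

// A histogram exposes one _bucket series per finite bucket, one for +Inf,
// plus _sum and _count.
const SERIES_PER_HISTOGRAM = FINITE_BUCKETS + 1 + 2;

const counterSeries = seriesUpperBound([SOURCES, EVENT_TYPES, RESULTS]);
const histogramSeries = seriesUpperBound([SOURCES, RESULTS], SERIES_PER_HISTOGRAM);
const histogramWithEventType = seriesUpperBound(
  [SOURCES, EVENT_TYPES, RESULTS],
  SERIES_PER_HISTOGRAM,
);
```

Under these assumptions the counter tops out at 84 series and the histogram at 98; keeping event_type on the histogram would raise its bound to 1176.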

Result values

| result | When |
| --- | --- |
| ok | Signature verified, persisted, async dispatch fired. The 200 ack went back. |
| duplicate | Same Kick-Event-Message-Id already in webhooks.incoming. Returns 200 already_processed. Normal under Kick retries; don't alert on rate. |
| invalid_signature | RSA verify failed or signature was unparseable. Could be a misconfigured key, a payload mutation in transit, or a spoof attempt. Returns 401. |
| stale_timestamp | Kick-Event-Message-Timestamp is outside the 5-minute window. Replay protection. Returns 401. |
| missing_headers | Required Kick-Event-* headers absent. Returns 400. Anything other than zero usually means somebody is poking the endpoint. |
| bad_method | HEAD / GET / PUT: uptime probes, scanners. Returns 405. |
| error | Reserved for unhandled exceptions in the receive path. Currently unused. |

event_type is the literal Kick event type (chat.message.sent, livestream.status.updated, channel.followed, …) or unknown when the header was missing.
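
The mapping from a delivery to its result and status code can be sketched as a pure function. The statuses and the 5-minute window come from the table above; everything else is an assumption made for illustration: the `DeliveryFacts` shape, the order of the checks, treating any non-POST method as bad_method, treating an unparseable timestamp as stale, and the timestamp being parseable by `Date.parse`.

```typescript
type ReceiveResult =
  | "ok" | "duplicate" | "invalid_signature"
  | "stale_timestamp" | "missing_headers" | "bad_method";

// Hypothetical summary of what the handler knows about one request.
interface DeliveryFacts {
  method: string;
  messageId?: string;      // Kick-Event-Message-Id
  timestamp?: string;      // Kick-Event-Message-Timestamp
  eventType?: string;      // event type header (name not given in this doc)
  signatureValid: boolean; // outcome of the RSA verification
  alreadySeen: boolean;    // message id already persisted
}

const WINDOW_MS = 5 * 60 * 1000;

function classifyDelivery(f: DeliveryFacts, nowMs: number) {
  // The label falls back to "unknown" when the header is missing.
  const eventType = f.eventType ?? "unknown";
  const done = (result: ReceiveResult, status: number) => ({ result, status, eventType });

  if (f.method !== "POST") return done("bad_method", 405);
  if (!f.messageId || !f.timestamp || !f.eventType) return done("missing_headers", 400);

  const sentMs = Date.parse(f.timestamp);
  if (Number.isNaN(sentMs) || Math.abs(nowMs - sentMs) > WINDOW_MS) {
    return done("stale_timestamp", 401);
  }
  if (!f.signatureValid) return done("invalid_signature", 401);
  if (f.alreadySeen) return done("duplicate", 200);
  return done("ok", 200);
}
```

Returning the label values together with the status keeps the metric call and the HTTP response from drifting apart on any exit path.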

Useful queries

promql
# successful delivery rate by event type
sum by(event_type) (
  rate(vf_webhook_received_total{source="kick", result="ok"}[5m])
)

# headline security alert — spoof attempts
rate(vf_webhook_received_total{result="invalid_signature"}[5m])

# clock-skew / replay alert
rate(vf_webhook_received_total{result="stale_timestamp"}[10m])

# Kick retry rate (informational, NOT an alert signal)
rate(vf_webhook_received_total{result="duplicate"}[5m])

# p99 ack latency — Kick retries on > 1 s, so >0.5 s is a warning
histogram_quantile(0.99,
  sum by(le, result) (rate(vf_webhook_receive_duration_seconds_bucket[5m]))
)
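
The p99 query depends on histogram_quantile, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below re-implements that estimate for a single set of cumulative buckets, assuming 0 < q ≤ 1. The bucket boundaries and counts in the example are hypothetical; the actual buckets of vf_webhook_receive_duration_seconds are not listed here.

```typescript
// One cumulative histogram bucket: `count` observations were <= `le`.
interface Bucket {
  le: number; // upper bound in seconds; Infinity for the +Inf bucket
  count: number;
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (sorted.length < 2 || last.le !== Infinity || last.count === 0) return NaN;

  const rank = q * last.count;
  const i = sorted.findIndex((b) => b.count >= rank);

  // Rank falls in the +Inf bucket: report the highest finite upper bound.
  if (i === sorted.length - 1) return sorted[sorted.length - 2].le;

  const upper = sorted[i].le;
  const lower = i === 0 ? 0 : sorted[i - 1].le;   // lowest bucket starts at 0
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  return lower + (upper - lower) * ((rank - below) / inBucket);
}

// Hypothetical 5-minute snapshot of the receive-duration histogram.
const snapshot: Bucket[] = [
  { le: 0.1, count: 900 },
  { le: 0.25, count: 980 },
  { le: 0.5, count: 995 },
  { le: 1, count: 1000 },
  { le: Infinity, count: 1000 },
];
const p99 = histogramQuantile(0.99, snapshot);
```

With these numbers the rank of 990 lands in the 0.25 to 0.5 s bucket, giving an estimate of roughly 0.417 s. Because the result is interpolated, the estimate is only as precise as the bucket boundaries near the 0.5 s and 1 s thresholds.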

Async (background processEvent)

| Metric | Type | Labels |
| --- | --- | --- |
| vf_webhook_async_processed_total | counter | source, event_type, result |
| vf_webhook_async_duration_seconds | histogram | source, event_type |

Result values

| result | When |
| --- | --- |
| ok | processEvent returned without throwing. The downstream handler ran (it may still have decided to no-op for unknown event types; we don't track that here). |
| error | processEvent threw or rejected. Logged via log.error; Sentry already captures the stack trace. |
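
One way to produce exactly these two results is to wrap the promise and record on both settlement paths. The sketch below is an assumption about the shape of such a wrapper, not the actual recordWebhookAsync signature: `AsyncMetricsSink` and `instrumentProcessEvent` are hypothetical names, and logging is left out.

```typescript
type AsyncResult = "ok" | "error";

// Hypothetical sink mirroring the two metrics: the counter carries `result`,
// the duration histogram does not.
interface AsyncMetricsSink {
  count(labels: { source: string; event_type: string; result: AsyncResult }): void;
  observe(labels: { source: string; event_type: string }, seconds: number): void;
}

function instrumentProcessEvent(
  source: string,
  eventType: string,
  processEvent: () => Promise<void>,
  sink: AsyncMetricsSink,
  now: () => number = Date.now,
): Promise<void> {
  const startedMs = now();
  const finish = (result: AsyncResult): void => {
    sink.count({ source, event_type: eventType, result });
    sink.observe({ source, event_type: eventType }, (now() - startedMs) / 1000);
  };
  // Both branches resolve, so the promise passed on to waitUntil never rejects.
  return processEvent().then(
    () => finish("ok"),
    () => finish("error"),
  );
}
```

Injecting `now` keeps the duration measurement testable; in production code the default `Date.now` would apply.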

Useful queries

promql
# error rate of background handlers, by event type
sum by(event_type) (
  rate(vf_webhook_async_processed_total{result="error"}[10m])
)

# downstream-handler latency p95 by event type
histogram_quantile(0.95,
  sum by(le, event_type) (rate(vf_webhook_async_duration_seconds_bucket[10m]))
)

# total events processed in last hour
sum by(event_type) (
  increase(vf_webhook_async_processed_total[1h])
)

Suggested alerts

| Alert | Condition | Why |
| --- | --- | --- |
| KickWebhookSpoofAttempts | rate(vf_webhook_received_total{result="invalid_signature"}[5m]) > 0.05 for 10 m | Sustained signature failures = misconfigured Kick public key OR somebody poking the public endpoint. Investigate immediately. |
| KickWebhookStaleTimestamp | rate(vf_webhook_received_total{result="stale_timestamp"}[15m]) > 0.05 for 15 m | Persistent timestamp rejection = clock drift on our pod (unlikely under k8s) or Kick replaying old deliveries (operational issue). |
| KickWebhookSlowAck | histogram_quantile(0.99, ...vf_webhook_receive_duration_seconds_bucket...) > 0.5 for 10 m | Approaching Kick's 1 s retry threshold: we'll start getting duplicates, which increases load on the receiver. |
| KickWebhookAsyncErrorRate | sum(rate(vf_webhook_async_processed_total{result="error"}[10m])) / sum(rate(vf_webhook_async_processed_total[10m])) > 0.1 for 10 m | More than 10 % of webhooks fail their downstream handler: a symptom of DB issues, schema drift, or a bug in processEvent. Both sides are wrapped in sum() because an unaggregated division matches each error series only against itself and always yields 1. |
| KickWebhookSilent | absent(rate(vf_webhook_received_total{source="kick", result="ok"}[5m])) or rate(...) == 0 for 30 m during business hours | We've stopped receiving any Kick webhooks. Either Kick isn't sending (unlikely) or our endpoint / Cloudflare Tunnel is unreachable. |

Kick event types we receive

Subscribed at signup time via subscribeAllEvents in api/src/kick_subscriptions.ts:

  • livestream.status.updated
  • chat.message.sent
  • channel.followed
  • channel.subscription.new
  • channel.subscription.renewal
  • channel.subscription.gifts
  • livestream.metadata.updated
  • moderation.banned
  • chat.message.deleted
  • livestream.host.start / .stop

Any of these can show up in the event_type label.

Implementation

  • Module: api/src/metrics.ts (recordWebhookReceive, recordWebhookAsync)
  • Wired in: api/src/kick_webhook.ts — every exit path of handleKickWebhook calls recordWebhookReceive(...); the async processEvent is wrapped to call recordWebhookAsync(...) on resolution / rejection.

What's NOT measured here

  • Outbound webhook signatures we send (e.g. Pusher signatures). These live in api/src/pusher.ts and are signing operations, not deliveries. If we ever sign and POST to a customer endpoint, that gets its own metric pair separate from this one.
  • Per-streamer or per-channel breakdowns. That would explode cardinality. Use Loki + the existing [kick-webhook] log-line prefix to drill into a single delivery.

Verifluence Documentation