Webhook metrics
Inbound webhooks are received by the API process at the public endpoint /wh/kick and emit metrics on :9091/metrics alongside the rest of the API metrics. All carry service="api".
The receiver has two phases, each with its own metric pair:
- Receive (sync): runs to completion before the 200 ack goes back. Kick retries when the ack takes longer than 1 s, so this path is latency-sensitive.
- Async (processEvent): runs after the ack via ctx.waitUntil. Errors here never surface to the sender, so the metric is the only signal we have for downstream-handler health.
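The ordering of the two phases can be sketched as follows. This is a minimal illustration, not the code in kick_webhook.ts: only processEvent and ctx.waitUntil come from the text above, while ExecutionContextLike, handleWebhookSketch, the receive callback, and the trace array (standing in for the two metric pairs) are hypothetical.

```typescript
// Hypothetical sketch: only processEvent and ctx.waitUntil are named in this document.
interface ExecutionContextLike {
  waitUntil(promise: Promise<unknown>): void;
}

interface Ack {
  status: number;
  body: string;
}

async function handleWebhookSketch(
  ctx: ExecutionContextLike,
  receive: () => Promise<void>,      // receive phase: verify + persist
  processEvent: () => Promise<void>, // async phase: downstream handler
  trace: string[],                   // stands in for the two metric pairs
): Promise<Ack> {
  // Receive phase: must finish before the sender gets its ack.
  await receive();
  trace.push("receive:ok");

  // Async phase: handed to the runtime. Both outcomes are recorded here
  // because a rejection is never visible to the sender.
  ctx.waitUntil(
    processEvent().then(
      () => { trace.push("async:ok"); },
      () => { trace.push("async:error"); },
    ),
  );

  trace.push("ack:200");
  return { status: 200, body: "ok" };
}
```

In this sketch a failing processEvent still yields a 200 ack, which is why the async counter is the only place the failure shows up.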
Receive (synchronous ack path)
| Metric | Type | Labels |
|---|---|---|
| vf_webhook_received_total | counter | source, event_type, result |
| vf_webhook_receive_duration_seconds | histogram | source, result |
The histogram drops event_type to keep cardinality bounded — result alone is enough for SLOs.
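To make the cardinality argument concrete, here is some rough arithmetic. A Prometheus histogram exposes, per label combination, one _bucket series per bucket boundary (including +Inf) plus a _sum and a _count series. The bucket count of 11 is an assumption for illustration, not taken from metrics.ts. The other inputs come from this document: one source, seven result values, and twelve event_type values (the eleven subscribed types plus unknown). Both figures are upper bounds, since not every label combination occurs in practice.

```typescript
// Upper bound on the time series one histogram can expose.
// finiteBuckets is an assumed value; the real bucket layout is not stated here.
function histogramSeries(labelCombos: number, finiteBuckets: number): number {
  // finite buckets + the +Inf bucket + _sum + _count
  return labelCombos * (finiteBuckets + 3);
}

const SOURCES = 1;       // "kick"
const RESULTS = 7;       // ok, duplicate, invalid_signature, stale_timestamp, missing_headers, bad_method, error
const EVENT_TYPES = 12;  // eleven subscribed types + "unknown"
const ASSUMED_BUCKETS = 11;

const withoutEventType = histogramSeries(SOURCES * RESULTS, ASSUMED_BUCKETS);
const withEventType = histogramSeries(SOURCES * RESULTS * EVENT_TYPES, ASSUMED_BUCKETS);

console.log(withoutEventType, withEventType); // 98 1176
```

Under these assumptions, keeping event_type on the receive histogram would multiply its series count by twelve.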
Result values
| result | When |
|---|---|
| ok | Signature verified, persisted, async dispatch fired. The 200 ack went back. |
| duplicate | Same Kick-Event-Message-Id already in webhooks.incoming. Returns 200 already_processed. Normal under Kick retries, so don't alert on rate. |
| invalid_signature | RSA verify failed or signature was unparseable. Could be a misconfigured key, a payload mutation in transit, or a spoof attempt. Returns 401. |
| stale_timestamp | Kick-Event-Message-Timestamp is outside the 5-minute window. Replay protection. Returns 401. |
| missing_headers | Required Kick-Event-* headers absent. Returns 400. Anything other than zero usually means somebody is poking the endpoint. |
| bad_method | HEAD / GET / PUT (uptime probes, scanners). Returns 405. |
| error | Reserved for unhandled exceptions in the receive path. Currently unused. |
event_type is the literal Kick event type (chat.message.sent, livestream.status.updated, channel.followed, …) or unknown when the header was missing.
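The result values above can be read as a decision ladder. The sketch below is one way to express it, with several assumptions: the order of the checks, POST being the only accepted method, and the 5-minute window being symmetric around the current time are all guesses, and classifyReceive and ReceiveFacts are hypothetical names rather than anything in kick_webhook.ts. The reserved error value is left out because it is currently unused.

```typescript
type ReceiveResult =
  | "ok"
  | "duplicate"
  | "invalid_signature"
  | "stale_timestamp"
  | "missing_headers"
  | "bad_method";

// Facts the receive path is assumed to have gathered about one request.
interface ReceiveFacts {
  method: string;
  messageId: string | undefined;   // Kick-Event-Message-Id
  timestampMs: number | undefined; // parsed Kick-Event-Message-Timestamp
  signatureValid: boolean;
  alreadySeen: boolean;            // message id already persisted
  nowMs: number;
}

const MAX_SKEW_MS = 5 * 60 * 1000; // the 5-minute window

// Assumed check order: cheapest rejections first, duplicate detection last.
function classifyReceive(f: ReceiveFacts): { result: ReceiveResult; status: number } {
  if (f.method !== "POST") return { result: "bad_method", status: 405 };
  if (f.messageId === undefined || f.timestampMs === undefined) {
    return { result: "missing_headers", status: 400 };
  }
  if (Math.abs(f.nowMs - f.timestampMs) > MAX_SKEW_MS) {
    return { result: "stale_timestamp", status: 401 };
  }
  if (!f.signatureValid) return { result: "invalid_signature", status: 401 };
  if (f.alreadySeen) return { result: "duplicate", status: 200 };
  return { result: "ok", status: 200 };
}
```

Returning the label and the status together keeps the metric and the HTTP response from drifting apart on any exit path.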
Useful queries
```promql
# successful delivery rate by event type
sum by(event_type) (
  rate(vf_webhook_received_total{source="kick", result="ok"}[5m])
)

# headline security alert: spoof attempts
rate(vf_webhook_received_total{result="invalid_signature"}[5m])

# clock-skew / replay alert
rate(vf_webhook_received_total{result="stale_timestamp"}[10m])

# Kick retry rate (informational, NOT an alert signal)
rate(vf_webhook_received_total{result="duplicate"}[5m])

# p99 ack latency: Kick retries on > 1 s, so > 0.5 s is a warning
histogram_quantile(0.99,
  sum by(le, result) (rate(vf_webhook_receive_duration_seconds_bucket[5m]))
)
```
Async (background processEvent)
| Metric | Type | Labels |
|---|---|---|
| vf_webhook_async_processed_total | counter | source, event_type, result |
| vf_webhook_async_duration_seconds | histogram | source, event_type |
Result values
| result | When |
|---|---|
| ok | processEvent returned without throwing. The downstream handler ran (it may still have decided to no-op for unknown event types; we don't track that here). |
| error | processEvent threw or rejected. Logged via log.error; Sentry already captures the stack trace. |
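A wrapper that produces these two result values might look like the sketch below. It is written under assumptions: runAsyncPhase and the Recorder signature are hypothetical stand-ins (the real recordWebhookAsync signature is not shown in this document), the clock is injectable only to make the example deterministic, and the log.error call is omitted.

```typescript
type AsyncResult = "ok" | "error";

// Hypothetical stand-in for recordWebhookAsync: feeds the counter (result)
// and the duration histogram (seconds).
type Recorder = (
  source: string,
  eventType: string,
  result: AsyncResult,
  seconds: number,
) => void;

async function runAsyncPhase(
  source: string,
  eventType: string,
  processEvent: () => Promise<void>,
  record: Recorder,
  now: () => number = () => Date.now(),
): Promise<void> {
  const start = now();
  let result: AsyncResult = "ok";
  try {
    await processEvent();
  } catch {
    // Swallowed on purpose: the sender already has its ack, so the
    // metric (and the log line, omitted here) is the only signal.
    result = "error";
  } finally {
    // Exactly one observation per event, whatever the outcome.
    record(source, eventType, result, (now() - start) / 1000);
  }
}
```

Recording in finally means a thrown error and a normal return both produce one counter increment and one duration observation.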
Useful queries
```promql
# error rate of background handlers, by event type
sum by(event_type) (
  rate(vf_webhook_async_processed_total{result="error"}[10m])
)

# downstream-handler latency p95 by event type
histogram_quantile(0.95,
  sum by(le, event_type) (rate(vf_webhook_async_duration_seconds_bucket[10m]))
)

# total events processed in last hour
sum by(event_type) (
  increase(vf_webhook_async_processed_total[1h])
)
```
Suggested alerts
| Alert | Condition | Why |
|---|---|---|
| KickWebhookSpoofAttempts | rate(vf_webhook_received_total{result="invalid_signature"}[5m]) > 0.05 for 10 m | Sustained signature failures = misconfigured Kick public key OR somebody poking the public endpoint. Investigate immediately. |
| KickWebhookStaleTimestamp | rate(vf_webhook_received_total{result="stale_timestamp"}[15m]) > 0.05 for 15 m | Persistent timestamp rejection = clock drift on our pod (unlikely under k8s) or Kick replaying old deliveries (operational issue). |
| KickWebhookSlowAck | histogram_quantile(0.99, ...vf_webhook_receive_duration_seconds_bucket...) > 0.5 for 10 m | Approaching Kick's 1 s retry threshold: we'll start getting duplicates and increase load on the receiver. |
| KickWebhookAsyncErrorRate | sum(rate(vf_webhook_async_processed_total{result="error"}[10m])) / sum(rate(vf_webhook_async_processed_total[10m])) > 0.1 for 10 m | More than 10 % of webhooks fail their downstream handler: symptom of DB issues, schema drift, or a bug in processEvent. Both sides are wrapped in sum() because an unaggregated division only matches series with identical labels, so the ratio would always be 1. |
| KickWebhookSilent | absent(rate(vf_webhook_received_total{source="kick", result="ok"}[5m])) or rate(...) == 0 for 30 m during business hours | We've stopped receiving any Kick webhooks. Either Kick isn't sending (unlikely) or our endpoint / Cloudflare Tunnel is unreachable. |
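As a worked example of the KickWebhookAsyncErrorRate condition, the ratio check can be sketched like this. It assumes the two rates have already been aggregated across all series. The function name and the sample numbers are illustrative, and the "for 10 m" hold is not modeled.

```typescript
// errorPerSec and totalPerSec are assumed to be pre-aggregated per-second rates.
function asyncErrorRateFires(
  errorPerSec: number,
  totalPerSec: number,
  threshold = 0.1,
): boolean {
  const ratio = errorPerSec / totalPerSec; // 0 / 0 evaluates to NaN
  return ratio > threshold;                // NaN > x is false
}

console.log(asyncErrorRateFires(0.3, 2)); // true  (15 % of events failing)
console.log(asyncErrorRateFires(0.1, 2)); // false (5 % of events failing)
console.log(asyncErrorRateFires(0, 0));   // false (no traffic at all)
```

In this sketch a receiver with no traffic never trips the ratio check, which is the gap KickWebhookSilent is there to cover.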
Kick event types we receive
Subscribed at signup time via subscribeAllEvents in api/src/kick_subscriptions.ts:
- livestream.status.updated
- chat.message.sent
- channel.followed
- channel.subscription.new
- channel.subscription.renewal
- channel.subscription.gifts
- livestream.metadata.updated
- moderation.banned
- chat.message.deleted
- livestream.host.start / .stop
Any of these can show up in the event_type label.
Implementation
- Module: api/src/metrics.ts (recordWebhookReceive, recordWebhookAsync)
- Wired in: api/src/kick_webhook.ts. Every exit path of handleKickWebhook calls recordWebhookReceive(...); the async processEvent is wrapped to call recordWebhookAsync(...) on resolution / rejection.
What's NOT measured here
- Outbound webhook signatures we send (e.g. Pusher signatures). These live in api/src/pusher.ts and are signing operations, not deliveries. If we ever sign and POST to a customer endpoint, that gets its own metric pair separate from this one.
- Per-streamer or per-channel breakdowns. That would explode cardinality. Use Loki + the existing [kick-webhook] log-line prefix to drill into a single delivery.