API metrics
Exposed by the api process at :9091/metrics (cluster-internal). All metrics carry service="api".
HTTP server
Recorded by a Hono middleware mounted before the auth gate, so 401s and CORS-rejected requests show up in the histogram too.
| Metric | Type | Labels |
|---|---|---|
| vf_http_requests_total | counter | method, route, status_code |
| vf_http_request_duration_seconds | histogram | method, route, status_code |
| vf_http_requests_in_flight | gauge | (none) |
The route label is the Hono matched pattern (e.g. /api/offers/:id), not the literal URL, which keeps cardinality bounded across the platform's ~50 routes. Unmatched requests group under route="unmatched".
Histogram buckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 seconds.
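To make the bucket list concrete, here is a small sketch of how a single observation maps onto those bounds. It only illustrates Prometheus's cumulative bucket semantics and is not the code in api/src/metrics.ts; the function names are made up for the example.

```typescript
// Upper bounds in seconds, copied from the bucket list above.
// Prometheus adds the le="+Inf" bucket implicitly.
const BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];

// Histogram buckets are cumulative: an observation increments every bucket
// whose upper bound (le) is >= the observed value, plus le="+Inf".
function bucketsFor(seconds: number): string[] {
  const hit = BUCKETS.filter((le) => seconds <= le).map(String);
  return [...hit, "+Inf"];
}

// Smallest bucket containing the observation. This is the resolution that
// histogram_quantile has to work with when it interpolates.
function smallestBucket(seconds: number): string {
  return bucketsFor(seconds)[0] ?? "+Inf";
}

console.log(smallestBucket(0.3)); // "0.5"
console.log(bucketsFor(12));      // [ "+Inf" ]
```

A request slower than 10 s is counted only in the +Inf bucket, so quantile queries over this histogram cannot distinguish latencies beyond 10 s.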
Useful queries
# request rate by route, top 10
topk(10, sum by(route) (rate(vf_http_requests_total[5m])))
# 5xx error rate
sum(rate(vf_http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(vf_http_requests_total[5m]))
# p99 latency by route
histogram_quantile(0.99,
sum by(le, route) (rate(vf_http_request_duration_seconds_bucket[5m]))
)
# in-flight (saturation signal)
vf_http_requests_in_flight
Domain gauges (30 s tick)
The API process owns a 30 s background tick that runs cheap SELECT status, COUNT(*) … GROUP BY status queries against five tables and updates these gauges. Each table costs one round-trip. A failed query is logged and the tick continues with the remaining tables, so a transient DB hiccup on one table doesn't leave the others stale.
| Metric | Statuses |
|---|---|
| vf_offers_by_status{status} | pending, accepted, declined, withdrawn |
| vf_negotiations_by_status{status} | in_progress, agreed, cancelled |
| vf_deals_by_status{status} | active, completed, refunded, cancelled |
| vf_session_submissions_by_status{status} | pending_review, confirmed, rejected |
| vf_campaigns_by_status{status} | prepared, funded, in_progress, completed, … |
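The tick described above could look roughly like the sketch below. The types, the runDomainTick name, and the injected query / setGauge / logWarn callbacks are all hypothetical stand-ins (the real module presumably uses prom-client gauges and the project's DB client). Writing an explicit 0 for statuses missing from the result is an assumption of this sketch, not something the source confirms.

```typescript
// Hypothetical shapes, for illustration only.
type StatusRow = { status: string; count: number };
type SetGauge = (metric: string, status: string, value: number) => void;

interface TableSpec {
  metric: string;                    // e.g. "vf_offers_by_status"
  knownStatuses: string[];           // statuses to zero out when absent
  query: () => Promise<StatusRow[]>; // one GROUP BY status round-trip
}

// One query per table. A failure is logged and the loop moves on, so the
// remaining tables still get fresh values in this tick.
async function runDomainTick(
  tables: TableSpec[],
  setGauge: SetGauge,
  logWarn: (msg: string) => void,
): Promise<void> {
  for (const table of tables) {
    try {
      const rows = await table.query();
      const seen = new Set<string>();
      for (const row of rows) {
        setGauge(table.metric, row.status, row.count);
        seen.add(row.status);
      }
      // GROUP BY returns no row for a status with zero matches; without an
      // explicit 0 the gauge would keep its previous value (assumed design).
      for (const status of table.knownStatuses) {
        if (!seen.has(status)) setGauge(table.metric, status, 0);
      }
    } catch (err) {
      logWarn(`domain tick: ${table.metric} failed: ${String(err)}`);
    }
  }
}
```

In this shape the function would be called from a 30 s interval timer. Injecting the callbacks keeps the loop testable without a database.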
Useful queries
# pending offers right now
sum(vf_offers_by_status{status="pending"})
# active deals trend (last 24 h)
max_over_time(vf_deals_by_status{status="active"}[24h])
# submissions waiting for operator review
sum(vf_session_submissions_by_status{status="pending_review"})
Node runtime (free, via prom-client defaults)
Emitted by prom-client's collectDefaultMetrics(). The most useful single signal for "is the API healthy" is event loop lag.
| Metric | What it tells you |
|---|---|
| nodejs_eventloop_lag_seconds | how late the scheduler is firing; sustained > 100 ms means the API is starved |
| process_cpu_seconds_total | CPU usage over time |
| process_resident_memory_bytes | RSS; alert as it approaches the pod memory limit |
| nodejs_heap_size_total_bytes, nodejs_heap_size_used_bytes | heap pressure |
| nodejs_active_handles_total, nodejs_active_requests_total | leak indicator |
| nodejs_gc_duration_seconds | GC pause distribution |
Useful queries
# eventloop lag (alert > 100 ms for 5 min)
nodejs_eventloop_lag_seconds{service="api"}
# resident memory in MiB (compare against the pod limit for headroom)
process_resident_memory_bytes{service="api"} / 1024 / 1024
Implementation
- Module: api/src/metrics.ts
- Wired in: api/src/server.ts (middleware + startMetricsServer(9091) + startDomainTick(env))
Adding a new metric
- Define the Counter / Histogram / Gauge at module scope in metrics.ts, registered against registry.
- Use a vf_<scope>_… name. Counter ends in _total. Duration ends in _seconds.
- Keep label cardinality bounded: never add a user-id / offer-id label.
- Increment / observe at the call site. Try to put it next to existing log.warn / log.info lines so signal stays co-located with logs.
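As an illustration of the checklist above, here is a minimal sketch using prom-client. The webhook metric names, the outcome label, the bucket values, and the deliver helper are all hypothetical. The local registry stands in for whatever metrics.ts actually exports.

```typescript
import { Counter, Histogram, Registry } from "prom-client";

// Stand-in for the registry defined in metrics.ts (assumed shape).
const registry = new Registry();

// Hypothetical counter: vf_<scope>_ prefix, name ends in _total, and the only
// label takes a small fixed set of values (no user-id / offer-id).
export const webhookDeliveries = new Counter({
  name: "vf_webhook_deliveries_total",
  help: "Webhook delivery attempts by outcome",
  labelNames: ["outcome"] as const,
  registers: [registry],
});

// Hypothetical duration metric: name ends in _seconds.
export const webhookDuration = new Histogram({
  name: "vf_webhook_delivery_duration_seconds",
  help: "Time spent delivering one webhook",
  buckets: [0.05, 0.1, 0.5, 1, 5],
  registers: [registry],
});

// Call site: observe right where the outcome is known, next to any logging.
export async function deliver(send: () => Promise<void>): Promise<void> {
  const stopTimer = webhookDuration.startTimer();
  try {
    await send();
    webhookDeliveries.inc({ outcome: "ok" });
  } catch (err) {
    webhookDeliveries.inc({ outcome: "error" });
    throw err;
  } finally {
    stopTimer(); // records the elapsed seconds into the histogram
  }
}
```

Defining the metrics at module scope means they are registered once at import time. Constructing them inside a request handler would try to register the same name again, which prom-client rejects.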