API metrics

Exposed by the api process at :9091/metrics (cluster-internal). All metrics carry service="api".

HTTP server

Recorded by a Hono middleware mounted before the auth gate, so 401s and CORS-rejected requests are counted and timed too.

Metric	Type	Labels
`vf_http_requests_total`	counter	`method`, `route`, `status_code`
`vf_http_request_duration_seconds`	histogram	`method`, `route`, `status_code`
`vf_http_requests_in_flight`	gauge	—

The route label is the Hono matched pattern (e.g. /api/offers/:id), not the literal URL, which keeps cardinality bounded across the platform's ~50 routes. Unmatched requests are grouped under route="unmatched".
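The recording logic described above could look roughly like the sketch below. It is not the code in metrics.ts: the `HttpMetrics` and `RequestMeta` shapes, the `routeLabel` and `recordRequest` helpers, and the choice to label a thrown handler as 500 are all assumptions made for illustration, and the metric objects are plain stand-ins rather than prom-client instances.

```typescript
// Stand-ins for the three HTTP metrics; real ones would be prom-client objects.
interface Labels { method: string; route: string; status_code: string }
interface HttpMetrics {
  inFlight: { inc(): void; dec(): void };
  requests: { inc(labels: Labels): void };
  duration: { observe(labels: Labels, seconds: number): void };
}

// Hypothetical request shape; Hono's Context carries far more than this.
interface RequestMeta { method: string; matchedRoute?: string }

// Use the matched pattern when there is one, otherwise a single fixed label,
// so arbitrary URLs cannot create new label values.
function routeLabel(matchedRoute: string | undefined): string {
  return matchedRoute && matchedRoute.length > 0 ? matchedRoute : "unmatched";
}

// Wraps a handler that resolves to the HTTP status code.
async function recordRequest(
  metrics: HttpMetrics,
  req: RequestMeta,
  handler: () => Promise<number>,
): Promise<number> {
  const startMs = Date.now();
  metrics.inFlight.inc();
  let status = 500; // assumption: a thrown handler is reported as a 500
  try {
    status = await handler();
    return status;
  } finally {
    // Runs on success and on throw, so the in-flight gauge cannot leak.
    metrics.inFlight.dec();
    const labels: Labels = {
      method: req.method,
      route: routeLabel(req.matchedRoute),
      status_code: String(status),
    };
    metrics.requests.inc(labels);
    metrics.duration.observe(labels, (Date.now() - startMs) / 1000);
  }
}
```

Because the wrapper sits outside the handler, a request rejected early (for example with a 401) still passes through the `finally` block and is recorded with its status code.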

Histogram buckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 seconds.
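To make the bucket list concrete, here is a small sketch of how observations map onto these bounds. Prometheus histogram buckets are cumulative and the `le` bound is inclusive; the helper names `smallestBucket` and `cumulativeCounts` are made up for this example and do not exist in the codebase or in prom-client.

```typescript
// Bucket upper bounds in seconds, copied from the list above.
const BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];

// `le` label of the smallest bucket that contains the observation,
// or "+Inf" when it exceeds every finite bound.
function smallestBucket(seconds: number): string {
  const bound = BUCKETS.find((b) => seconds <= b);
  return bound === undefined ? "+Inf" : String(bound);
}

// Cumulative count per finite bucket: each observation is counted in
// every bucket whose bound is greater than or equal to it.
function cumulativeCounts(observations: number[]): number[] {
  return BUCKETS.map((b) => observations.filter((s) => s <= b).length);
}

console.log(smallestBucket(0.03)); // "0.05"
console.log(smallestBucket(12));   // "+Inf"
```

With these bounds, anything slower than 10 s lands only in the implicit +Inf bucket, so histogram_quantile cannot report a quantile above 10 s.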

Useful queries

promql

# request rate by route, top 10
topk(10, sum by(route) (rate(vf_http_requests_total[5m])))

# 5xx error rate
sum(rate(vf_http_requests_total{status_code=~"5.."}[5m]))
  /
sum(rate(vf_http_requests_total[5m]))

# p99 latency by route
histogram_quantile(0.99,
  sum by(le, route) (rate(vf_http_request_duration_seconds_bucket[5m]))
)

# in-flight (saturation signal)
vf_http_requests_in_flight

Domain gauges (30 s tick)

The API process owns a 30 s background tick that runs a cheap SELECT status, COUNT(*) … GROUP BY status query against each of five tables and updates these gauges. That is one round-trip per table. A failed query is logged and the tick continues, so a transient DB hiccup leaves only the affected table's gauge stale instead of aborting the rest of the refresh.
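The log-and-continue behaviour can be sketched as a loop with one try/catch per table. This is an illustration under assumptions, not the actual startDomainTick: the table names are guessed from the metric names, and `QueryFn`, `SetGauge`, and `runDomainTick` are hypothetical, with the database and gauges injected as plain functions.

```typescript
type StatusRow = { status: string; count: number };
type QueryFn = (table: string) => Promise<StatusRow[]>;
type SetGauge = (table: string, status: string, value: number) => void;

// Assumed table names, inferred from the gauge names; the real ones may differ.
const TABLES = ["offers", "negotiations", "deals", "session_submissions", "campaigns"];

// Runs one refresh pass and returns the tables whose query failed.
async function runDomainTick(
  query: QueryFn,
  setGauge: SetGauge,
  logWarn: (message: string) => void,
): Promise<string[]> {
  const failed: string[] = [];
  for (const table of TABLES) {
    try {
      // One round-trip per table.
      const rows = await query(table);
      for (const row of rows) setGauge(table, row.status, row.count);
    } catch (err) {
      // Log and move on: the remaining tables still get refreshed.
      logWarn(`domain tick: ${table} failed: ${String(err)}`);
      failed.push(table);
    }
  }
  return failed;
}
```

A scheduler would call this every 30 s, for example `setInterval(() => void runDomainTick(query, setGauge, log), 30_000)`, with the three arguments supplied by the application.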

Metric	Statuses
`vf_offers_by_status{status}`	`pending`, `accepted`, `declined`, `withdrawn`
`vf_negotiations_by_status{status}`	`in_progress`, `agreed`, `cancelled`
`vf_deals_by_status{status}`	`active`, `completed`, `refunded`, `cancelled`
`vf_session_submissions_by_status{status}`	`pending_review`, `confirmed`, `rejected`
`vf_campaigns_by_status{status}`	`prepared`, `funded`, `in_progress`, `completed`, …

Useful queries

promql

# pending offers right now
sum(vf_offers_by_status{status="pending"})

# peak active deals over the last 24 h
max_over_time(vf_deals_by_status{status="active"}[24h])

# submissions waiting for operator review
sum(vf_session_submissions_by_status{status="pending_review"})

Node runtime (free, via prom-client defaults)

Emitted by prom-client.collectDefaultMetrics(). The single most useful signal for "is the API healthy" is event loop lag.

Metric	What it tells you
`nodejs_eventloop_lag_seconds`	how late the scheduler is firing — sustained > 100 ms means the API is starved
`process_cpu_seconds_total`	CPU usage over time
`process_resident_memory_bytes`	RSS; alert as it approaches the pod memory limit
`nodejs_heap_size_total_bytes`, `nodejs_heap_size_used_bytes`	heap pressure
`nodejs_active_handles_total`, `nodejs_active_requests_total`	leak indicator
`nodejs_gc_duration_seconds`	GC pause distribution

Useful queries

promql

# event loop lag in seconds (alert when > 0.1, i.e. 100 ms, for 5 min)
nodejs_eventloop_lag_seconds{service="api"}

# RSS in MiB (compare against the pod memory limit for headroom)
process_resident_memory_bytes{service="api"} / 1024 / 1024

Implementation

Module: api/src/metrics.ts
Wired in: api/src/server.ts (middleware + startMetricsServer(9091) + startDomainTick(env))

Adding a new metric

Define the Counter / Histogram / Gauge at module scope in metrics.ts, registered against registry.
Use a vf_<scope>_… name. Counter ends in _total. Duration ends in _seconds.
Keep label cardinality bounded — never add a user-id / offer-id label.
Increment / observe at the call site. Try to put it next to existing log.warn / log.info lines so signal stays co-located with logs.
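The naming and cardinality rules above are easy to check mechanically. The sketch below is a hypothetical lint helper, not something that exists in metrics.ts; `checkMetricName`, its options, and the heuristic that any label ending in `id` is an identifier are assumptions made for this example.

```typescript
type MetricKind = "counter" | "histogram" | "gauge";

interface CheckOptions {
  isDuration?: boolean; // true when the metric measures elapsed time
  labels?: string[];    // label names the metric will carry
}

// Returns a list of rule violations; an empty list means the name passes.
function checkMetricName(name: string, kind: MetricKind, opts: CheckOptions = {}): string[] {
  const problems: string[] = [];
  // vf_<scope>_<rest>, lower-case snake_case.
  if (!/^vf_[a-z0-9]+_[a-z0-9_]+$/.test(name)) {
    problems.push("name should look like vf_<scope>_<rest>");
  }
  if (kind === "counter" && !name.endsWith("_total")) {
    problems.push("counter names should end in _total");
  }
  if (opts.isDuration && !name.endsWith("_seconds")) {
    problems.push("duration names should end in _seconds");
  }
  // Heuristic (an assumption): labels named id or *_id are per-entity identifiers.
  for (const label of opts.labels ?? []) {
    if (/(^|_)id$/.test(label)) {
      problems.push(`label ${label} looks unbounded`);
    }
  }
  return problems;
}

console.log(checkMetricName("vf_http_requests_total", "counter")); // []
```

A check like this could run in a unit test over every registered metric, which catches a stray user-id label before it reaches Prometheus.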
