Alerting
Alerts are evaluated by Prometheus, routed by Alertmanager, and delivered to a Telegram channel via Alertmanager's native telegram_configs. Grafana's Alerting UI is a read-only window onto the same Alertmanager — no Grafana-managed rules exist.
Delivery path
PrometheusRule CRDs         Alertmanager StatefulSet
(worker-poller-staleness +  (k8s-alertmanager, :9093)
 ~80 cluster defaults from              │
 kube-prometheus-stack)                 │  route:
        │                               │    match severity=critical → critical-alerts
        │ scraped + evaluated           │    match severity=warning  → warning-alerts
        │ every 30s                     │    default                 → null (drop)
        ▼                               │
Prometheus ──── POST /api/v2/alerts ───►│  receivers:
(k8s-prometheus, :9090)                 │    critical-alerts: telegram_configs
                                        │    warning-alerts:  telegram_configs
                                        │
                                        │  bot_token_file: /etc/alertmanager/secrets/telegram-bot/token
                                        │  chat_id: -100… (in git, not sensitive)
                                        │
                                        ▼
                     api.telegram.org/bot<token>/sendMessage
                                        │
                                        ▼
                        Verifluence stage alerts channel

Each link lives somewhere you can inspect:
| Layer | How to inspect |
|---|---|
| Rule definitions | kubectl --context vf get prometheusrule -A |
| Currently firing | Prometheus UI → Alerts tab, or curl /api/v1/alerts |
| Routes + receivers | clusters/prod/monitoring/releases/kube-prometheus-stack.yaml (Helm values) |
| Live AM config | kubectl --context vf -n monitoring exec alertmanager-k8s-alertmanager-0 -- cat /etc/alertmanager/config_out/alertmanager.env.yaml |
| Telegram receiver | The same Helm values file, receivers: block |
| Bot token | k8s Secret/telegram-bot in monitoring, key token |
Custom rules — worker-poller-staleness
Lives at operations/clusters/prod/stage/monitoring/poller-staleness-rules.yaml. Four rules, all driven by the vf_worker_poll_last_success_timestamp_seconds gauge that instrumentPoll() updates on every successful run.
| Alert | Threshold | For | Severity | Why this number |
|---|---|---|---|---|
| ViewerPollStale | time() - last_success > 600 | 1m | critical | viewer-poll runs every 60 s; 10 min = 10× headroom. The user-visible "10 minutes" promise. |
| HourlyPollerStale | > 7200 (2 h) | 5m | warning | hourly schedule × 2; catches a missed run + 1 h grace. Covers chat-export, stale-session-cleanup, kick-token-refresh, x-follower-poll. |
| DailyPollerStale | > 108000 (30 h) | 15m | warning | daily schedule + 6 h grace. Covers kick-subscriptions, trust-score-refresh, daily-maintenance. |
| WorkerScrapeDown | up{job="worker"} == 0 | 5m | critical | catches the case where the worker pod is down entirely; the per-job staleness alerts can't fire because their label sets disappear with the target. |
Why last_success_timestamp and not error rate? A job that throws every minute increments runs_total{result="error"} but never moves last_success. The timestamp gauge catches both crashes AND persistent error loops in a single expression.
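The comparison behind these rules can be walked through in plain shell arithmetic. This is only an illustration: poll_is_stale is a hypothetical helper (it is not part of the worker or the rule file) and the timestamps are made up. It mirrors the shape of the ViewerPollStale expression, time() - last_success > 600.

```shell
# Hypothetical helper mirroring the PromQL comparison
#   time() - last_success > threshold
# Arguments: now, last_success (Unix seconds), threshold (seconds).
poll_is_stale() {
  [ $(( $1 - $2 )) -gt "$3" ]
}

# Made-up timestamps: viewer-poll last succeeded at t=1000 and has thrown
# on every run since. Failed runs never update the gauge.
last_success=1000

# 5 minutes later: 300 s elapsed, still under the 600 s threshold.
if poll_is_stale 1300 "$last_success" 600; then echo "stale"; else echo "ok"; fi

# 11 minutes later: 660 s elapsed, so the ViewerPollStale condition holds.
if poll_is_stale 1660 "$last_success" 600; then echo "stale"; else echo "ok"; fi
```

The same arithmetic applies whether the job crashed once or is failing on every run, which is why a single expression covers both cases.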
Default cluster rules
kube-prometheus-stack ships ~80 default rules covering the kubelet, node-exporter, scheduler, etcd, Prometheus itself, and general namespace health. They're enabled via defaultRules.rules.* in the Helm values. Notable ones you'll see in the channel:
- Watchdog: fires forever, intentionally. Routed to the null receiver. Confirms Alertmanager itself is alive end-to-end.
- KubeMemoryOvercommit: sum of pod memory requests > cluster capacity. Informational; only matters when pods actually use what they reserved.
- TargetDown: a Prometheus scrape target is unreachable. Each failing target gets one of these.
- KubePodCrashLooping: a pod's CrashLoopBackOff has been ongoing.
Disable groups you don't want in the Helm values rather than silencing them individually (see defaultRules.rules.kubeProxy: false etc. for the existing pattern).
Telegram delivery
Why Alertmanager's native Telegram
Alertmanager v0.26+ supports telegram_configs directly, so no webhook bridge is needed. v0.30 (what we run) handles HTML formatting, send-on-resolve, and group templates. One k8s Secret, one chat_id, done.
Bot setup (one-time)
- @BotFather → /newbot → token like 7890123456:AA…
- Add the bot to the channel as admin with "Post Messages"
- Send any message to the channel as a user, then run the command below and pull id (negative, looks like -100…) from the output:
curl -s "https://api.telegram.org/bot<TOKEN>/getUpdates" \
  | jq '.result[].channel_post.chat'
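Before wiring the cluster, it can be worth confirming that the bot is actually able to post to the channel. The sketch below calls the Bot API's sendMessage method directly. The helper names telegram_api_url and send_test_message are hypothetical, and the token and chat id are placeholders you supply yourself.

```shell
# Hypothetical helpers; neither ships with Telegram or Alertmanager.

# Build the Bot API URL for a method such as sendMessage or getUpdates.
# Arguments: token, method.
telegram_api_url() {
  printf 'https://api.telegram.org/bot%s/%s\n' "$1" "$2"
}

# Post a plain-text message to a chat.
# Arguments: token, chat_id, text.
send_test_message() {
  curl -s -X POST "$(telegram_api_url "$1" sendMessage)" \
    --data-urlencode "chat_id=$2" \
    --data-urlencode "text=$3"
}

# Usage (placeholders, not real credentials):
#   send_test_message "$TOKEN" "$CHAT_ID" "bot setup check"
```

A JSON response containing "ok": true means the bot has posting rights. An error response at this stage usually points at the admin permission or the chat id rather than at Alertmanager.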
Cluster wiring
# Create the Secret (token only — chat_id stays in git)
kubectl --context vf -n monitoring create secret generic telegram-bot \
  --from-literal=token='<bot-token>'

Then in kube-prometheus-stack.yaml:
alertmanagerSpec:
  secrets:            # mount the Secret into the AM pod
    - telegram-bot    # → /etc/alertmanager/secrets/telegram-bot/token
config:
  route:
    routes:
      - match: { severity: critical }
        receiver: critical-alerts
      - match: { severity: warning }
        receiver: warning-alerts
  receivers:
    - name: critical-alerts
      telegram_configs:
        - bot_token_file: /etc/alertmanager/secrets/telegram-bot/token
          chat_id: -1003739402712
          parse_mode: HTML
          send_resolved: true
          message: |
            🔴 <b>{{ .Status | toUpper }}</b> · {{ .GroupLabels.alertname }}
            …

The bot_token_file indirection keeps the credential out of git.
Message format
HTML parse mode + send_resolved: true so the channel sees both firing and recovery events:
🔴 FIRING · ViewerPollStale
critical · viewer-poll
viewer-poll has not succeeded in 18 minutes
Check worker logs: kubectl -n stage logs deploy/worker --tail=200 | grep viewer-poll

🔴 RESOLVED · ViewerPollStale
critical · viewer-poll
viewer-poll has not succeeded in 18 minutes

Critical uses 🔴, warnings use 🟡.
Splitting channels by severity
Right now both routes point at the same chat_id. Most teams want critical in a paged channel and warnings in a quiet one. To split:
- Create a second Telegram channel + add the same bot as admin
- Get its chat_id
- Replace the chat_id under the warning-alerts receiver only
No new bot, no new Secret needed.
Grafana Alerting UI
Read-only window. Configured via unified_alerting.enabled: true in Grafana's grafana.ini plus an Alertmanager datasource:
additionalDataSources:
  - name: Alertmanager
    type: alertmanager
    url: http://k8s-alertmanager.monitoring.svc.cluster.local:9093
    jsonData:
      implementation: prometheus
      handleGrafanaManagedAlerts: false

handleGrafanaManagedAlerts: false keeps Grafana out of the rule-management business: rules stay in PrometheusRule CRDs, and Grafana just displays them.
| Grafana page | What it shows |
|---|---|
| Alerting → Alert rules | Every PrometheusRule CRD (use the namespace filter to find ours) |
| Alerting → Contact points | null, critical-alerts, warning-alerts |
| Alerting → Notification policies | The route tree |
| Alerting → Silences | Active silences (creating new ones from here POSTs to AM's API) |
| Alerting → Active notifications | Currently-firing alerts grouped by route |
If the sidebar looks empty at the top, click the datasource picker in the top-right and choose Alertmanager instead of "Grafana".
Adding a new alert rule
- Add a PrometheusRule resource. Using the same namespace as your service (e.g. stage) keeps the rule discoverable next to the workload.
- The release: kube-prometheus-stack label is not required; the cluster's ruleSelector: {} matches all PrometheusRules.
- Group related rules (one alert + one recording rule, or a family of similar checks) into a single groups[].rules[] list; they evaluate together.
- Set severity: critical or severity: warning so the existing routes pick it up. Without a severity label, alerts fall to the null default route and never deliver.
- Always set for: 1m (or longer) on rate-based alerts to avoid flapping on single-scrape outliers.
Skeleton:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-feature-rules
  namespace: stage
spec:
  groups:
    - name: my-feature
      interval: 30s
      rules:
        - alert: MyAlert
          expr: <PromQL>
          for: 5m
          labels:
            severity: warning
            component: my-feature
          annotations:
            summary: "Short one-liner with {{ $labels.foo }}"
            description: |
              Multi-line context. Include a kubectl command or a
              dashboard link so the on-call doesn't have to dig.

After commit + Flux reconcile, the rule shows up in Prometheus → Alerts and (when firing) in the Telegram channel.
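One way to catch a missing severity label before commit is a quick line-based check on the manifest. This is a rough sketch rather than a YAML parser: check_severity_labels is a hypothetical helper, and it only recognizes unquoted severity: critical or severity: warning lines, so treat a mismatch as a prompt to look at the file, not as proof of a problem.

```shell
# Hypothetical pre-commit helper (not part of any tool used on this page).
# Compares the number of "alert:" entries in a manifest with the number of
# severity labels that the existing routes would match.
check_severity_labels() {
  file=$1
  # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
  # keeps that from aborting scripts that run under "set -e".
  alerts=$(grep -cE '^[[:space:]]*(-[[:space:]]*)?alert:' "$file" || true)
  routed=$(grep -cE '^[[:space:]]*severity:[[:space:]]*(critical|warning)[[:space:]]*$' "$file" || true)
  if [ "$alerts" -eq "$routed" ]; then
    echo "ok: $alerts alert(s), each with a routable severity"
  else
    echo "mismatch: $alerts alert(s), $routed routable severity label(s)"
    return 1
  fi
}

# Usage (the file name is an example):
#   check_severity_labels my-feature-rules.yaml
```

The non-zero return on a mismatch lets the function act as a gate in a local hook or CI step, if you choose to wire it in.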
Silencing noisy alerts
For a known-noisy alert during planned work:
kubectl --context vf -n monitoring port-forward svc/k8s-alertmanager 19093:9093 &
amtool --alertmanager.url=http://localhost:19093 silence add \
  alertname=KubeMemoryOvercommit \
  --duration=4h \
  --comment "planned migration; resumes after $(date +%H:%M)"

Or in Grafana: Alerting → Silences → New silence.
Smoke-testing delivery
Two paths:
1. Synthetic alert via the AM API:
amtool --alertmanager.url=http://localhost:19093 alert add \
  alertname=TelegramSmokeTest severity=critical \
  --annotation=summary='Telegram delivery smoke test'

A 🔴 message lands in the channel within ~10 s.
2. Force a real alert by stopping the worker (covers the WorkerScrapeDown rule end-to-end):
kubectl --context vf -n stage scale deploy/worker --replicas=0
# wait ~6 min
kubectl --context vf -n stage scale deploy/worker --replicas=1

You'll see 🔴 FIRING · WorkerScrapeDown then 🔴 RESOLVED · WorkerScrapeDown in the channel.
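The "~6 min" wait can be estimated from the intervals involved. In this rough sketch, the 30 s scrape and evaluation interval and the 5m "for" come from this page. The 30 s group_wait is an assumption (Alertmanager's default), since the route's actual value is not shown here. worst_case_delay is a hypothetical helper, and the estimate ignores scrape timeouts and network latency.

```shell
# Hypothetical helper: rough worst case, in seconds, from a target going
# down to Alertmanager sending the first notification.
# Arguments: scrape interval, evaluation interval, rule "for", group_wait.
worst_case_delay() {
  echo $(( $1 + $2 + $3 + $4 ))
}

# WorkerScrapeDown: 30 s scrape, 30 s evaluation, for: 5m (300 s),
# plus an assumed group_wait of 30 s.
total=$(worst_case_delay 30 30 300 30)
echo "worst case: ${total} s ($(( total / 60 )) min $(( total % 60 )) s)"
```

Under those assumptions the total comes to 390 s, which is why scaling the worker back up before roughly six minutes have passed may produce no message at all.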
Source
- Custom rules: operations/clusters/prod/stage/monitoring/poller-staleness-rules.yaml
- AM config: operations/clusters/prod/monitoring/releases/kube-prometheus-stack.yaml (alertmanager.config section)
- Bot Secret: Secret/telegram-bot in monitoring (created out-of-band, not in git)