Alerting

Alerts are evaluated by Prometheus, routed by Alertmanager, and delivered to a Telegram channel via Alertmanager's native telegram_configs. Grafana's Alerting UI is a read-only window onto the same Alertmanager — no Grafana-managed rules exist.

Delivery path

PrometheusRule CRDs                      Alertmanager StatefulSet
(worker-poller-staleness +                 (k8s-alertmanager, :9093)
 ~80 cluster defaults from                  │
 kube-prometheus-stack)                     │   route:
        │                                   │     match severity=critical → critical-alerts
        │ scraped + evaluated               │     match severity=warning  → warning-alerts
        │ every 30s                         │     default                 → null (drop)
        ▼                                   │
   Prometheus  ──── POST /api/v2/alerts ───►│   receivers:
   (k8s-prometheus, :9090)                  │     critical-alerts: telegram_configs
                                            │     warning-alerts:  telegram_configs
                                            │
                                            │   bot_token_file: /etc/alertmanager/secrets/telegram-bot/token
                                            │   chat_id: -100…  (in git, not sensitive)
                                            │
                                            ▼
                                    api.telegram.org/bot<token>/sendMessage
                                            │
                                            ▼
                                  Verifluence stage alerts channel

Each link lives somewhere you can inspect:

Layer	How to inspect
Rule definitions	`kubectl --context vf get prometheusrule -A`
Currently firing	Prometheus UI → Alerts tab, or `curl http://<prometheus-host>:9090/api/v1/alerts`
Routes + receivers	`operations/clusters/prod/monitoring/releases/kube-prometheus-stack.yaml` (Helm values)
Live AM config	`kubectl --context vf -n monitoring exec alertmanager-k8s-alertmanager-0 -- cat /etc/alertmanager/config_out/alertmanager.env.yaml`
Telegram receiver	The same Helm values file, `receivers:` block
Bot token	k8s `Secret/telegram-bot` in `monitoring`, key `token`

Custom rules — `worker-poller-staleness`

Lives at operations/clusters/prod/stage/monitoring/poller-staleness-rules.yaml. Four rules: three are driven by the vf_worker_poll_last_success_timestamp_seconds gauge that instrumentPoll() updates on every successful run, and the fourth watches the worker's scrape target itself.

Alert	Threshold	For	Severity	Why this number
`ViewerPollStale`	`time() - last_success > 600`	1m	critical	viewer-poll runs every 60 s; 10 min = 10× headroom. The user-visible "10 minutes" promise.
`HourlyPollerStale`	`> 7200` (2 h)	5m	warning	hourly schedule × 2; catches a missed run + 1 h grace. Covers chat-export, stale-session-cleanup, kick-token-refresh, x-follower-poll.
`DailyPollerStale`	`> 108000` (30 h)	15m	warning	daily schedule + 6 h grace. Covers kick-subscriptions, trust-score-refresh, daily-maintenance.
`WorkerScrapeDown`	`up{job="worker"} == 0`	5m	critical	catches the case where the worker pod is down entirely — the per-job staleness alerts can't fire because their label sets disappear with the target.
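
As a worked check of the numbers in the table, the sketch below recomputes each threshold from its schedule and adds the For column to get the earliest point at which an alert can fire after the last successful run. The helper name `earliest_fire` is hypothetical, and the result ignores the 30 s evaluation interval and any Alertmanager grouping delay, so treat it as a lower bound.

```shell
#!/usr/bin/env bash
# Hypothetical helper: earliest_fire THRESHOLD_SECONDS FOR_SECONDS
# The expression turns true once the age exceeds the threshold, and the
# alert fires only after it has stayed true for the `for` duration.
earliest_fire() {
  local threshold=$1 for_seconds=$2
  echo $(( threshold + for_seconds ))
}

viewer_threshold=$(( 10 * 60 ))          # 10 x the 60 s schedule
hourly_threshold=$(( 2 * 3600 ))         # hourly schedule x 2
daily_threshold=$(( (24 + 6) * 3600 ))   # daily schedule + 6 h grace

echo "ViewerPollStale:   threshold=${viewer_threshold}s fires>=$(earliest_fire "$viewer_threshold" 60)s"
echo "HourlyPollerStale: threshold=${hourly_threshold}s fires>=$(earliest_fire "$hourly_threshold" 300)s"
echo "DailyPollerStale:  threshold=${daily_threshold}s fires>=$(earliest_fire "$daily_threshold" 900)s"
```

In other words, a stalled viewer-poll reaches the channel after roughly 11 minutes (660 s) at the earliest, not 10.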

Why last_success_timestamp and not error rate? A job that throws every minute increments runs_total{result="error"} but never moves last_success. The timestamp gauge catches both crashes AND persistent error loops in a single expression.
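
To make that concrete, here is a small simulation of a job stuck in an error loop. It is only a sketch: `is_stale` is a hypothetical helper that mirrors the rule expression, and the timestamps are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the rule expression
#   time() - vf_worker_poll_last_success_timestamp_seconds > threshold
# Usage: is_stale NOW LAST_SUCCESS THRESHOLD  (exit 0 if stale, 1 otherwise)
is_stale() {
  local now=$1 last_success=$2 threshold=$3
  (( now - last_success > threshold ))
}

# The job last succeeded at t=1000 and has failed on every run since.
# Failed runs keep happening, but last_success never moves, so the age grows.
last_success=1000
for now in 1060 1300 1600 1660; do
  if is_stale "$now" "$last_success" 600; then
    echo "t=$now age=$(( now - last_success ))s stale"
  else
    echo "t=$now age=$(( now - last_success ))s ok"
  fi
done
```

Only the last iteration reports stale, because the comparison is strict: an age of exactly 600 s does not yet trip the expression. A crashed worker and a worker that runs but always fails look identical to this check.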

Default cluster rules

kube-prometheus-stack ships ~80 default rules covering the kubelet, node-exporter, scheduler, etcd, Prometheus itself, and general namespace health. They're enabled via defaultRules.rules.* in the Helm values. Notable ones:

Watchdog: fires forever, intentionally. Routed to the null receiver, so it never reaches the channel; seeing it firing in Alertmanager confirms the Prometheus → Alertmanager path is alive.
KubeMemoryOvercommit: sum of pod memory requests > cluster capacity. Informational; only matters when pods actually use what they reserved.
TargetDown: a Prometheus scrape target is unreachable. Each failing target gets one of these.
KubePodCrashLooping: a pod has been stuck in CrashLoopBackOff.

Disable groups you don't want in the Helm values rather than silencing them individually (see defaultRules.rules.kubeProxy: false etc. for the existing pattern).

Telegram delivery

Why Alertmanager's native Telegram

Alertmanager v0.26+ supports this setup natively (telegram_configs with bot_token_file), so no webhook bridge is needed. v0.30 (what we run) handles HTML formatting, send-on-resolve, and group templates. One k8s Secret, one chat_id, done.

Bot setup (one-time)

@BotFather → /newbot → token like 7890123456:AA…
Add the bot to the channel as admin with "Post Messages"

Send any message to the channel as a user, then:

bash

curl -s "https://api.telegram.org/bot<TOKEN>/getUpdates" \
  | jq '.result[].channel_post.chat'

Take the id field from the output (a negative number that looks like -100…).

Cluster wiring

bash

# Create the Secret (token only — chat_id stays in git)
kubectl --context vf -n monitoring create secret generic telegram-bot \
  --from-literal=token='<bot-token>'

Then in kube-prometheus-stack.yaml:

yaml

alertmanagerSpec:
  secrets:                              # mount the Secret into the AM pod
    - telegram-bot                      # → /etc/alertmanager/secrets/telegram-bot/token

config:
  route:
    receiver: "null"                    # default route: drop alerts without a matching severity
    routes:
      - match: { severity: critical }
        receiver: critical-alerts
      - match: { severity: warning }
        receiver: warning-alerts
  receivers:
    - name: critical-alerts
      telegram_configs:
        - bot_token_file: /etc/alertmanager/secrets/telegram-bot/token
          chat_id: -1003739402712
          parse_mode: HTML
          send_resolved: true
          message: |
            🔴 <b>{{ .Status | toUpper }}</b> · {{ .GroupLabels.alertname }}
            …

The bot_token_file indirection keeps the credential out of git.

Message format

Receivers use HTML parse mode and send_resolved: true, so the channel sees both firing and recovery events:

🔴 FIRING · ViewerPollStale
critical · viewer-poll
viewer-poll has not succeeded in 18 minutes
Check worker logs: kubectl -n stage logs deploy/worker --tail=200 | grep viewer-poll

🔴 RESOLVED · ViewerPollStale
critical · viewer-poll
viewer-poll has not succeeded in 18 minutes

The icon tracks severity, not state: critical uses 🔴 and warning uses 🟡, so a resolved critical alert still shows 🔴.

Splitting channels by severity

Right now both routes point at the same chat_id. A common arrangement is critical in a paged channel and warnings in a quiet one. To split:

Create a second Telegram channel + add the same bot as admin
Get its chat_id
Replace the chat_id under the warning-alerts receiver only

No new bot, no new Secret needed.

Grafana Alerting UI

Read-only for rules and routing; silences are the one thing you can create from here. Configured via unified_alerting.enabled: true in Grafana's grafana.ini plus an Alertmanager datasource:

yaml

additionalDataSources:
  - name: Alertmanager
    type: alertmanager
    url: http://k8s-alertmanager.monitoring.svc.cluster.local:9093
    jsonData:
      implementation: prometheus
      handleGrafanaManagedAlerts: false

handleGrafanaManagedAlerts: false tells Grafana not to send Grafana-managed alerts to this Alertmanager. Rules stay in PrometheusRule CRDs, and Grafana just displays them.

Grafana page	What it shows
Alerting → Alert rules	Every PrometheusRule CRD (use the namespace filter to find ours)
Alerting → Contact points	`null`, `critical-alerts`, `warning-alerts`
Alerting → Notification policies	The route tree
Alerting → Silences	Active silences (creating new ones from here POSTs to AM's API)
Alerting → Active notifications	Currently-firing alerts grouped by route

If the sidebar looks empty at the top, click the datasource picker in the top-right and choose Alertmanager instead of "Grafana".

Adding a new alert rule

Add a PrometheusRule resource — same namespace as your service (e.g. stage) keeps the rule discoverable next to the workload.
The release: kube-prometheus-stack label is not required — the cluster's ruleSelector: {} matches all PrometheusRules.
Group related rules (one alert + one recording rule, or a family of similar checks) into a single groups[].rules[] list; rules in a group are evaluated in order on the group's interval.
Set severity: critical or severity: warning so the existing routes pick it up. Without a severity label, alerts fall to the null default route and never deliver.
Always set for: 1m (or longer) on rate-based alerts to avoid flapping on single-scrape outliers.

Skeleton:

yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-feature-rules
  namespace: stage
spec:
  groups:
    - name: my-feature
      interval: 30s
      rules:
        - alert: MyAlert
          expr: <PromQL>
          for: 5m
          labels:
            severity: warning
            component: my-feature
          annotations:
            summary: "Short one-liner with {{ $labels.foo }}"
            description: |
              Multi-line context. Include a kubectl command or a
              dashboard link so the on-call doesn't have to dig.

After commit + Flux reconcile, the rule shows up in Prometheus → Alerts and (when firing) in the Telegram channel.
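
The severity requirement in the checklist above comes from how the route tree picks a receiver. The sketch below pictures that outcome, assuming only the three receivers named on this page. The function name `receiver_for` is hypothetical, and this is an illustration of the result, not of how Alertmanager matches routes internally.

```shell
#!/usr/bin/env bash
# Hypothetical helper: receiver_for SEVERITY
# Prints the receiver an alert with that severity label ends up at.
receiver_for() {
  case "$1" in
    critical) echo "critical-alerts" ;;
    warning)  echo "warning-alerts" ;;
    *)        echo "null" ;;          # default route: dropped, never delivered
  esac
}

for severity in critical warning info ""; do
  echo "severity='${severity}' -> $(receiver_for "$severity")"
done
```

Anything other than critical or warning, including a missing label, lands on null and is silently dropped, which is the usual reason a new rule fires in Prometheus but never shows up in Telegram.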

Silencing noisy alerts

For a known-noisy alert during planned work:

bash

kubectl --context vf -n monitoring port-forward svc/k8s-alertmanager 19093:9093 &
# The comment records when the 4h silence ends (GNU date; on macOS/BSD use: date -v+4H +%H:%M)
amtool --alertmanager.url=http://localhost:19093 silence add \
  alertname=KubeMemoryOvercommit \
  --duration=4h \
  --comment "planned migration; alerts resume after $(date -d '+4 hours' +%H:%M)"

Or in Grafana: Alerting → Silences → New silence.
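
If the planned work finishes early, the silence can be listed and expired with amtool's `silence query` and `silence expire` subcommands. A minimal sketch, assuming the port-forward above is still running; the wrapper function names and the `AM_URL` variable are illustrative, not part of the repo.

```shell
#!/usr/bin/env bash
# Assumes the port-forward to svc/k8s-alertmanager is still running.
AM_URL="http://localhost:19093"

# list_silences [MATCHER...]: show active silences, optionally filtered.
list_silences() {
  amtool --alertmanager.url="$AM_URL" silence query "$@"
}

# expire_silence ID...: end one or more silences before their duration elapses.
expire_silence() {
  amtool --alertmanager.url="$AM_URL" silence expire "$@"
}

# Example:
#   list_silences alertname=KubeMemoryOvercommit
#   expire_silence <silence-id>
```

Expiring the silence by hand avoids leaving a real alert muted for the rest of the 4h window.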

Smoke-testing delivery

Two paths:

1. Synthetic alert via the AM API (assumes the port-forward from "Silencing noisy alerts" is still running):

bash

amtool --alertmanager.url=http://localhost:19093 alert add \
  alertname=TelegramSmokeTest severity=critical \
  --annotation=summary='Telegram delivery smoke test'

A 🔴 message lands in the channel within ~10 s.

2. Force a real alert by stopping the worker (covers the WorkerScrapeDown rule end-to-end):

bash

kubectl --context vf -n stage scale deploy/worker --replicas=0
# wait ~6 min
kubectl --context vf -n stage scale deploy/worker --replicas=1

You'll see 🔴 FIRING · WorkerScrapeDown then 🔴 RESOLVED · WorkerScrapeDown in the channel.
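
If amtool isn't installed, path 1 can also be exercised with curl against the same /api/v2/alerts endpoint that Prometheus posts to. This is a sketch under the same assumption as above (port-forward on localhost:19093); the function names and `AM_URL` are hypothetical.

```shell
#!/usr/bin/env bash
AM_URL="http://localhost:19093"

# smoke_payload: JSON body for one synthetic alert. The v2 API takes an array.
smoke_payload() {
  cat <<'EOF'
[
  {
    "labels": {
      "alertname": "TelegramSmokeTest",
      "severity": "critical"
    },
    "annotations": {
      "summary": "Telegram delivery smoke test"
    }
  }
]
EOF
}

# send_smoke_alert: POST the payload to Alertmanager.
send_smoke_alert() {
  smoke_payload | curl -sS -X POST "$AM_URL/api/v2/alerts" \
    -H 'Content-Type: application/json' \
    --data-binary @-
}

# Example:
#   send_smoke_alert
```

The payload sets no endsAt, so Alertmanager should treat the alert as resolved once its resolve_timeout passes without a re-post, and the RESOLVED message then follows on its own.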

Source

Custom rules: operations/clusters/prod/stage/monitoring/poller-staleness-rules.yaml
AM config: operations/clusters/prod/monitoring/releases/kube-prometheus-stack.yaml (alertmanager.config section)
Bot Secret: Secret/telegram-bot in monitoring (created out-of-band, not in git)
