Alerting

Alerts are evaluated by Prometheus, routed by Alertmanager, and delivered to a Telegram channel via Alertmanager's native telegram_configs. Grafana's Alerting UI is a read-only window onto the same Alertmanager — no Grafana-managed rules exist.

Delivery path

PrometheusRule CRDs                      Alertmanager StatefulSet
(worker-poller-staleness +                 (k8s-alertmanager, :9093)
 ~80 cluster defaults from                  │
 kube-prometheus-stack)                     │   route:
        │                                   │     match severity=critical → critical-alerts
        │ scraped + evaluated               │     match severity=warning  → warning-alerts
        │ every 30s                         │     default                 → null (drop)
        ▼                                   │
   Prometheus  ──── POST /api/v2/alerts ───►│   receivers:
   (k8s-prometheus, :9090)                  │     critical-alerts: telegram_configs
                                            │     warning-alerts:  telegram_configs
                                            │
                                            │   bot_token_file: /etc/alertmanager/secrets/telegram-bot/token
                                            │   chat_id: -100…  (in git, not sensitive)
                                            │
                                            ▼
                                    api.telegram.org/bot<token>/sendMessage
                                            │
                                            ▼
                                  Verifluence stage alerts channel

Each link lives somewhere you can inspect:

| Layer | How to inspect |
|---|---|
| Rule definitions | kubectl --context vf get prometheusrule -A |
| Currently firing | Prometheus UI → Alerts tab, or curl /api/v1/alerts |
| Routes + receivers | clusters/prod/monitoring/releases/kube-prometheus-stack.yaml (Helm values) |
| Live AM config | kubectl --context vf -n monitoring exec alertmanager-k8s-alertmanager-0 -- cat /etc/alertmanager/config_out/alertmanager.env.yaml |
| Telegram receiver | The same Helm values file, receivers: block |
| Bot token | k8s Secret/telegram-bot in monitoring, key token |

Custom rules — worker-poller-staleness

Lives at operations/clusters/prod/stage/monitoring/poller-staleness-rules.yaml. Four rules, all driven by the vf_worker_poll_last_success_timestamp_seconds gauge that instrumentPoll() updates on every successful run.

| Alert | Threshold | For | Severity | Why this number |
|---|---|---|---|---|
| ViewerPollStale | time() - last_success > 600 | 1m | critical | viewer-poll runs every 60 s; 10 min = 10× headroom. The user-visible "10 minutes" promise. |
| HourlyPollerStale | > 7200 (2 h) | 5m | warning | hourly schedule × 2; catches a missed run + 1 h grace. Covers chat-export, stale-session-cleanup, kick-token-refresh, x-follower-poll. |
| DailyPollerStale | > 108000 (30 h) | 15m | warning | daily schedule + 6 h grace. Covers kick-subscriptions, trust-score-refresh, daily-maintenance. |
| WorkerScrapeDown | up{job="worker"} == 0 | 5m | critical | catches the case where the worker pod is down entirely; the per-job staleness alerts can't fire because their label sets disappear with the target. |

Why last_success_timestamp and not error rate? A job that throws every minute increments runs_total{result="error"} but never moves last_success. The timestamp gauge catches both crashes AND persistent error loops in a single expression.
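
As a rough illustration of how the thresholds above might translate into rule definitions, the sketch below writes two of the four alerts as a rule group. It is not a copy of poller-staleness-rules.yaml. The poller label name is an assumption, since the source only names the gauge, and the group name is borrowed from the CRD name.

```yaml
# Sketch only. "poller" is a hypothetical label name; substitute whatever
# label instrumentPoll() actually attaches to the gauge.
groups:
  - name: worker-poller-staleness
    rules:
      - alert: ViewerPollStale
        # Seconds since the last successful run. The value keeps growing
        # whether the job crashed or is stuck in an error loop.
        expr: time() - vf_worker_poll_last_success_timestamp_seconds{poller="viewer-poll"} > 600
        for: 1m
        labels:
          severity: critical
      - alert: WorkerScrapeDown
        # Covers the case the staleness rules cannot see: no target, no gauge.
        expr: up{job="worker"} == 0
        for: 5m
        labels:
          severity: critical
```

The hourly and daily rules would follow the same shape, with the 7200 and 108000 thresholds and a matcher that selects the relevant pollers.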

Default cluster rules

kube-prometheus-stack ships ~80 default rules covering the kubelet, node-exporter, scheduler, etcd, Prometheus itself, and general namespace health. They're enabled via defaultRules.rules.* in the Helm values. Notable ones:

  • Watchdog: fires forever, intentionally. Routed to the null receiver, so it never appears in the channel. Confirms Alertmanager itself is alive end-to-end.
  • KubeMemoryOvercommit: sum of pod memory requests > cluster capacity. Informational; only matters when pods actually use what they reserved.
  • TargetDown: a Prometheus scrape target is unreachable. Each failing target gets one of these.
  • KubePodCrashLooping: a container keeps crashing and restarting, leaving its pod stuck in CrashLoopBackOff.

Disable groups you don't want in the Helm values rather than silencing them individually (see defaultRules.rules.kubeProxy: false etc. for the existing pattern).
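
A minimal sketch of that pattern in the Helm values follows. Only kubeProxy comes from the existing configuration; the etcd entry is an illustrative assumption, and group key names vary between chart versions, so check the chart's values.yaml before copying.

```yaml
# Sketch, assuming the kube-prometheus-stack values layout.
defaultRules:
  create: true
  rules:
    kubeProxy: false   # existing pattern: the whole kube-proxy rule group is off
    etcd: false        # hypothetical example of switching off another group
```

Turning a group off removes its rules from the cluster entirely, whereas a silence only hides notifications until it expires.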

Telegram delivery

Why Alertmanager's native Telegram

Alertmanager has shipped a native Telegram integration (telegram_configs) since v0.24, and the bot_token_file option used here since v0.26, so no webhook bridge is needed. v0.30 (what we run) handles HTML formatting, send-on-resolve, and group templates. One k8s Secret, one chat_id, done.

Bot setup (one-time)

  1. @BotFather → /newbot → token like 7890123456:AA…
  2. Add the bot to the channel as admin with "Post Messages"
  3. Send any message to the channel as a user, then:
    bash
    curl -s "https://api.telegram.org/bot<TOKEN>/getUpdates" \
      | jq '.result[].channel_post.chat'
    Take the id field from the output (negative, looks like -100…).

Cluster wiring

bash
# Create the Secret (token only — chat_id stays in git)
kubectl --context vf -n monitoring create secret generic telegram-bot \
  --from-literal=token='<bot-token>'

Then in kube-prometheus-stack.yaml:

yaml
# Abridged excerpt; both blocks sit under the chart's top-level alertmanager: key
alertmanager:
  alertmanagerSpec:
    secrets:                            # mount the Secret into the AM pod
      - telegram-bot                    # → /etc/alertmanager/secrets/telegram-bot/token

  config:
    route:
      receiver: "null"                  # default: drop anything without a matching severity
      routes:
        - match: { severity: critical }
          receiver: critical-alerts
        - match: { severity: warning }
          receiver: warning-alerts
    receivers:
      - name: "null"
      - name: critical-alerts
        telegram_configs:
          - bot_token_file: /etc/alertmanager/secrets/telegram-bot/token
            chat_id: -1003739402712
            parse_mode: HTML
            send_resolved: true
            message: |
              🔴 <b>{{ .Status | toUpper }}</b> · {{ .GroupLabels.alertname }}
      # warning-alerts: same shape as critical-alerts, with 🟡 in the message

The bot_token_file indirection keeps the credential out of git.

Message format

Messages use HTML parse mode with send_resolved: true, so the channel sees both firing and recovery events:

🔴 FIRING · ViewerPollStale
critical · viewer-poll
viewer-poll has not succeeded in 18 minutes
Check worker logs: kubectl -n stage logs deploy/worker --tail=200 | grep viewer-poll
🔴 RESOLVED · ViewerPollStale
critical · viewer-poll
viewer-poll has not succeeded in 18 minutes

Critical uses 🔴, warnings use 🟡.
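
For orientation, here is a sketch of a message template that could produce output shaped like the example above. It is not the template from the Helm values, which this page shows only in part. The poller label and the choice to print the description only while firing are assumptions inferred from the sample messages.

```yaml
# Fragment of one telegram_configs entry. "poller" is a hypothetical label name.
message: |
  🔴 <b>{{ .Status | toUpper }}</b> · {{ .GroupLabels.alertname }}
  {{ range .Alerts -}}
  {{ .Labels.severity }} · {{ .Labels.poller }}
  {{ .Annotations.summary }}
  {{ if eq .Status "firing" }}{{ .Annotations.description }}{{ end }}
  {{ end }}
```

The warning-alerts receiver would carry the same template with 🟡 in place of 🔴.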

Splitting channels by severity

Right now both routes point at the same chat_id. Most teams want critical in a paged channel and warnings in a quiet one. To split:

  1. Create a second Telegram channel + add the same bot as admin
  2. Get its chat_id
  3. Replace the chat_id under the warning-alerts receiver only

No new bot, no new Secret needed.
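
A sketch of what the changed receiver might look like after the split. The chat_id below is a placeholder, not a real channel, and the remaining fields are assumed to mirror the critical-alerts receiver.

```yaml
receivers:
  - name: warning-alerts
    telegram_configs:
      - bot_token_file: /etc/alertmanager/secrets/telegram-bot/token   # same bot, same Secret
        chat_id: -1000000000000   # placeholder: id of the second, quieter channel
        parse_mode: HTML
        send_resolved: true
```

The routes stay as they are, because they refer to receivers by name and only the destination behind warning-alerts changes.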

Grafana Alerting UI

Read-only window. Configured via unified_alerting.enabled: true in Grafana's grafana.ini plus an Alertmanager datasource:

yaml
additionalDataSources:
  - name: Alertmanager
    type: alertmanager
    url: http://k8s-alertmanager.monitoring.svc.cluster.local:9093
    jsonData:
      implementation: prometheus
      handleGrafanaManagedAlerts: false

handleGrafanaManagedAlerts: false keeps Grafana out of the rule-management business — rules stay in PrometheusRule CRDs, Grafana just displays them.

| Grafana page | What it shows |
|---|---|
| Alerting → Alert rules | Every PrometheusRule CRD (use the namespace filter to find ours) |
| Alerting → Contact points | null, critical-alerts, warning-alerts |
| Alerting → Notification policies | The route tree |
| Alerting → Silences | Active silences (creating new ones from here POSTs to AM's API) |
| Alerting → Active notifications | Currently-firing alerts grouped by route |

If the sidebar looks empty at the top, click the datasource picker in the top-right and choose Alertmanager instead of "Grafana".

Adding a new alert rule

  1. Add a PrometheusRule resource — same namespace as your service (e.g. stage) keeps the rule discoverable next to the workload.
  2. The release: kube-prometheus-stack label is not required — the cluster's ruleSelector: {} matches all PrometheusRules.
  3. Group related rules (one alert + one recording rule, or a family of similar checks) into a single groups[].rules[] list — they evaluate together.
  4. Set severity: critical or severity: warning so the existing routes pick it up. Without a severity label, alerts fall to the null default route and never deliver.
  5. Always set for: 1m (or longer) on rate-based alerts to avoid flapping on single-scrape outliers.

Skeleton:

yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-feature-rules
  namespace: stage
spec:
  groups:
    - name: my-feature
      interval: 30s
      rules:
        - alert: MyAlert
          expr: <PromQL>
          for: 5m
          labels:
            severity: warning
            component: my-feature
          annotations:
            summary: "Short one-liner with {{ $labels.foo }}"
            description: |
              Multi-line context. Include a kubectl command or a
              dashboard link so the on-call doesn't have to dig.

After commit + Flux reconcile, the rule shows up in Prometheus → Alerts and (when firing) in the Telegram channel.

Silencing noisy alerts

For a known-noisy alert during planned work:

bash
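kubectl --context vf -n monitoring port-forward svc/k8s-alertmanager 19093:9093 &
sleep 2   # give the port-forward a moment to come up
# The comment records when the silence ends (now + 4h).
# GNU date syntax; on macOS use: date -v+4H +%H:%M
amtool --alertmanager.url=http://localhost:19093 silence add \
  alertname=KubeMemoryOvercommit \
  --duration=4h \
  --comment "planned migration; resumes after $(date -d '+4 hours' +%H:%M)"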

Or in Grafana: Alerting → Silences → New silence.

Smoke-testing delivery

Two paths:

1. Synthetic alert via the AM API (assumes the port-forward from the silencing example is still running on localhost:19093):

bash
amtool --alertmanager.url=http://localhost:19093 alert add \
  alertname=TelegramSmokeTest severity=critical \
  --annotation=summary='Telegram delivery smoke test'

A 🔴 message lands in the channel within ~10 s.

2. Force a real alert by stopping the worker (covers the WorkerScrapeDown rule end-to-end):

bash
kubectl --context vf -n stage scale deploy/worker --replicas=0
# wait ~6 min
kubectl --context vf -n stage scale deploy/worker --replicas=1

You'll see 🔴 FIRING · WorkerScrapeDown then 🔴 RESOLVED · WorkerScrapeDown in the channel.

Source

Verifluence Documentation