Alert Pipeline

Designed a custom severity model that collapses alert storms into a single, prioritized notification, with silences managed automatically.

Alert Rules: 500+

Severity Levels: critical, warning, low

Deployed via: Fleet

vmalert, AlertManager, OpsGenie, Jenkins, Rancher Fleet, Python

Architecture

The Problem

One incident triggered 10+ pages. A database slowdown would fire critical for complete failure, warning for degradation, and low for elevated latency, all at the same time and all for the same root cause. Nothing correlated the severities with each other. Passive standby clusters paged with the same urgency as active production. Engineers wrote each severity as a separate rule, so changing a query meant updating three places. Across 500+ rules, the copies drifted apart.

Template System

I designed a template format where one YAML definition contains all severity levels. Each severity can have completely different expressions, not just different thresholds. A critical might check for complete failure while a warning checks degradation using a different query entirely. A Python generator expands these into standard vmalert rules with automatic severity and severity_order labels. Open-sourced both the generator and a migration tool that converts existing rules into template format.
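The expansion step can be sketched roughly as follows. This is a minimal illustration, not the open-sourced generator: the template keys (`severities`, `expr`, `for`), the metric names, and the numeric ordering in `SEVERITY_ORDER` are all assumptions made for the example. The template is shown as a Python dict standing in for the parsed YAML.

```python
from typing import Any

# Hypothetical template shape: one alert, one entry per severity.
# Each severity carries its own expression, not just its own threshold.
TEMPLATE = {
    "alert": "DatabaseUnhealthy",
    "for": "5m",
    "severities": {
        "critical": {"expr": 'up{job="db"} == 0'},
        "warning": {"expr": "db_query_latency_seconds > 1", "for": "10m"},
        "low": {"expr": "db_query_latency_seconds > 0.5"},
    },
}

# Assumed ordering: a lower number means a more severe alert.
SEVERITY_ORDER = {"critical": 1, "warning": 2, "low": 3}


def expand_template(template: dict[str, Any]) -> list[dict[str, Any]]:
    """Expand one multi-severity template into one rule per severity."""
    rules = []
    ordered = sorted(template["severities"].items(),
                     key=lambda item: SEVERITY_ORDER[item[0]])
    for severity, spec in ordered:
        # Template-level labels first, then per-severity overrides.
        labels = dict(template.get("labels", {}))
        labels.update(spec.get("labels", {}))
        # These two labels are always set by the generator.
        labels["severity"] = severity
        labels["severity_order"] = str(SEVERITY_ORDER[severity])

        rule = {
            "alert": template["alert"],  # same alertname for every severity
            "expr": spec["expr"],
            "labels": labels,
        }
        # A per-severity "for" wins over the template-level default.
        duration = spec.get("for", template.get("for"))
        if duration is not None:
            rule["for"] = duration
        rules.append(rule)
    return rules
```

Calling `expand_template(TEMPLATE)` yields three rule dicts that share the alertname `DatabaseUnhealthy` and differ in expression, `for` duration, and severity labels. Serializing them under a `groups:` key would give a standard rules file.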

The CI Pipeline

Unlike ArgoCD, Fleet has no pre-sync hooks, so templates can't be transformed at deploy time. I built a Jenkins pipeline instead. On feature branches, it generates the rules locally and validates them with a vmalert dry run, so no broken expressions reach the main branch. On merge, it generates the final rules, commits them to a rules-bundle directory, and pushes with retry-and-rebase logic. Fleet detects the commit and deploys the rules to vmalert across all clusters.
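The push-with-retry step might look something like the sketch below. It is an illustration only: the function name, the default branch, the attempt count, and the back-off are assumptions, and the real pipeline may do this in Jenkins shell steps rather than Python. The git runner is injectable so the logic can be exercised without a repository.

```python
import subprocess
import time
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_git(args: Sequence[str]) -> int:
    """Run a git command and return its exit code."""
    return subprocess.run(["git", *args]).returncode


def push_with_rebase(run: Runner = run_git, branch: str = "main",
                     attempts: int = 3, delay: float = 2.0) -> bool:
    """Push generated rules, rebasing onto the remote when a push is rejected."""
    for attempt in range(1, attempts + 1):
        if run(["push", "origin", f"HEAD:{branch}"]) == 0:
            return True
        if attempt == attempts:
            break
        # A rejected push usually means another merge landed first:
        # replay the generated-rules commit on top of the remote branch.
        if run(["pull", "--rebase", "origin", branch]) != 0:
            run(["rebase", "--abort"])  # leave the workspace clean
            return False
        time.sleep(delay * attempt)  # simple linear back-off
    return False
```

Because two merges can finish close together, the rebase matters more than the retry itself: retrying the same push against a moved branch would fail every time.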

Inhibition and Routing

The generated rules share the same alertname across severities. AlertManager inhibition suppresses lower severities when a higher one fires: critical suppresses warning and low, and warning suppresses low. The on-call engineer sees one alert at the highest applicable severity instead of three. OpsGenie priority is assigned automatically: P1 for critical alerts on active clusters, P2 for critical alerts on passive clusters, and P3 for warnings.
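A rough sketch of how the inhibition rules and the priority mapping could be derived from the severity list. The dict keys (`source_matchers`, `target_matchers`, `equal`) follow AlertManager's `inhibit_rules` configuration, but the label names passed to `equal`, the `cluster_state` values, and both function names are assumptions for illustration.

```python
from typing import Optional, Sequence

# Most severe first; each level inhibits every level after it.
SEVERITIES = ["critical", "warning", "low"]


def build_inhibit_rules(equal: Sequence[str]) -> list[dict]:
    """Build one inhibit rule per severity that has lower levels beneath it."""
    rules = []
    for index, source in enumerate(SEVERITIES):
        lower = SEVERITIES[index + 1:]
        if not lower:
            continue  # the lowest severity inhibits nothing
        pattern = "|".join(lower)
        rules.append({
            "source_matchers": [f'severity="{source}"'],
            "target_matchers": [f'severity=~"{pattern}"'],
            # Only inhibit when these labels match on both alerts.
            "equal": list(equal),
        })
    return rules


def opsgenie_priority(severity: str, cluster_state: str) -> Optional[str]:
    """Map severity and active/passive state to an OpsGenie priority."""
    if severity == "critical":
        return "P1" if cluster_state == "active" else "P2"
    if severity == "warning":
        return "P3"
    return None  # the mapping for other severities is not stated here
```

For example, `build_inhibit_rules(["alertname", "namespace", "cluster"])` produces two rules: critical inhibiting `warning|low`, and warning inhibiting `low`. Including `alertname` in `equal` is what ties the three generated rules to the same root cause.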

Health Check Grouping

For the health monitoring system, alerts are grouped by namespace, cluster, and active/passive state instead of per service. If 15 services fail in one environment, the on-call engineer gets one grouped notification listing all 15. A 30-second group wait collects related alerts before the first notification, and a 2-minute group interval batches late arrivals into a follow-up.
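To make the grouping concrete, here is a small sketch. `HEALTH_ROUTE` mirrors the shape of an AlertManager route (`group_by`, `group_wait`, `group_interval`), written as a Python dict; the label names `namespace`, `cluster`, `state`, and `service` are assumptions. `group_alerts` only imitates how alerts sharing those labels end up in one notification and is not AlertManager's implementation.

```python
from collections import defaultdict

# Assumed label names; the timings are the ones described above.
HEALTH_ROUTE = {
    "group_by": ["namespace", "cluster", "state"],
    "group_wait": "30s",      # collect related alerts before the first notification
    "group_interval": "2m",   # batch late arrivals into a follow-up
}


def group_alerts(alerts: list[dict], group_by: list[str]) -> dict[tuple, list[str]]:
    """Bucket alerts by the group_by labels and list the affected services."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for alert in alerts:
        labels = alert["labels"]
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels.get("service", "unknown"))
    return dict(groups)
```

With 15 failing services that share one namespace, cluster, and state, `group_alerts` returns a single key whose value lists all 15 services, which corresponds to the one grouped notification the on-call engineer receives.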

Silence Automation

Built a Python service that periodically checks an external business API for environment state. If an environment has no live activity, the silence manager creates a silence in AlertManager for that namespace. When the environment goes live again, the silence is removed. About 40% of environments are covered this way at any given time. This replaced a manual process in which people created silences and forgot to clean them up until weeks later.
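The core of such a reconcile loop can be sketched as below, under several assumptions: the function names, the `createdBy` value, the comment text, and the 24-hour default are hypothetical, and the business API is reduced to a plain dict of namespace to live flag. The payload fields follow AlertManager's v2 silence API (`matchers`, `startsAt`, `endsAt`, `createdBy`, `comment`). HTTP calls are left out so the decision logic stays testable.

```python
from datetime import datetime, timedelta

def silence_payload(namespace: str, now: datetime, hours: int = 24) -> dict:
    """Build a silence body matching every alert in one namespace."""
    return {
        "matchers": [{
            "name": "namespace",
            "value": namespace,
            "isRegex": False,
            "isEqual": True,
        }],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": "silence-manager",            # hypothetical identity
        "comment": "Environment has no live activity",
    }


def plan_silences(live_by_namespace: dict[str, bool],
                  silenced: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (namespaces to silence, silence IDs to expire).

    `silenced` maps namespace to the ID of the silence this service created.
    """
    to_create = sorted(ns for ns, live in live_by_namespace.items()
                       if not live and ns not in silenced)
    # Namespaces the API no longer reports are left untouched.
    to_expire = [silenced[ns] for ns in sorted(silenced)
                 if live_by_namespace.get(ns, False)]
    return to_create, to_expire
```

Each entry in the first list would be posted as a `silence_payload`, and each ID in the second list would be expired. Tracking only the silences the service itself created keeps it from removing silences that engineers set by hand.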

Deep dive

Building a Health Platform Part 4: From Logs to Grouped Alerts

Deploying Alert Rules at Scale with Fleet and Jenkins