Designing a Custom Alert Severity System
The Problem
Alert rules were managed individually. If you wanted an alert to fire at different severity levels - critical when something is completely down, warning when it's degraded, low when it's a minor issue - you had to write three separate rules, each with its own thresholds, its own labels, and its own maintenance burden.
With 500+ alert rules across 70+ clusters, this wasn't sustainable. Rules would drift - someone would update the critical threshold but forget to update the warning version. Or a new alert would get created at one severity level with no correlation to related alerts.
The Template System
I designed a template format where engineers write one YAML file per alert concept. The template includes all three severity levels with their respective thresholds in a single definition:
```yaml
name: HighErrorRate
expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m])
critical: "> 0.1"
warning: "> 0.05"
low: "> 0.02"
for: 5m
annotations:
  summary: "Error rate is {{ $value | humanizePercentage }}"
```
A Jenkins job processes these templates and generates the actual vmalert rule files - three rules per template, each with the correct severity label. The generated rules are committed to a dedicated Git branch and deployed via Rancher Fleet to vmalert across all clusters.
The engineer writes one template. The pipeline produces three consistent, correctly labeled rules.
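The generator itself isn't shown here; a minimal sketch of the expansion step, assuming templates are parsed into dicts with the field names from the example above (the actual Jenkins job may differ):

```python
SEVERITIES = ("critical", "warning", "low")

def expand_template(template):
    """Expand one alert template into one vmalert rule per severity level."""
    rules = []
    for severity in SEVERITIES:
        threshold = template.get(severity)
        if threshold is None:
            continue  # this severity is not defined for the alert
        rules.append({
            "alert": template["name"],  # same alertname at every level,
                                        # so inhibition can correlate them
            "expr": f"{template['expr']} {threshold}",
            "for": template.get("for", "5m"),
            "labels": {"severity": severity},
            "annotations": dict(template.get("annotations", {})),
        })
    return rules

template = {
    "name": "HighErrorRate",
    "expr": "rate(http_errors_total[5m]) / rate(http_requests_total[5m])",
    "critical": "> 0.1",
    "warning": "> 0.05",
    "low": "> 0.02",
    "for": "5m",
    "annotations": {"summary": "Error rate is {{ $value | humanizePercentage }}"},
}

rules = expand_template(template)
```

Keeping the alertname identical across severities is the important design choice: it gives AlertManager a stable key to correlate the three generated rules.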
Correlation and Suppression
The real value isn't just generating three rules - it's how they interact. AlertManager inhibition rules correlate alerts by name:
- When a critical fires, it suppresses warning and low for the same alert
- When a warning fires, it suppresses low
- Low only reaches on-call if nothing more severe has triggered
This means the on-call engineer sees one alert at the highest applicable severity, not three alerts for the same problem at different levels. Before this system, a single failing service could generate a storm of alerts - critical, warning, and informational all firing simultaneously.
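In AlertManager terms, the three suppression rules above map onto `inhibit_rules`. A sketch, assuming the `severity` labels produced by the pipeline; the exact `equal` label set is an assumption:

```yaml
inhibit_rules:
  # critical suppresses warning and low for the same alert
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity=~"warning|low"']
    equal: ["alertname", "namespace"]
  # warning suppresses low
  - source_matchers: ['severity="warning"']
    target_matchers: ['severity="low"']
    equal: ["alertname", "namespace"]
```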
On-Call Routing
AlertManager routes to OpsGenie with priority mapping:
- Critical on an active cluster = P1
- Critical on a passive cluster = P2
- Warning = P3
The cluster active/passive state comes from the infrastructure - the same data that powers the health dashboard. This means an alert for a failing service on a passive (standby) cluster doesn't wake someone up at 3am.
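As AlertManager routing, this could look roughly like the following. The `cluster_state` label name is an assumption for illustration; `priority` is a real field of `opsgenie_configs`:

```yaml
route:
  receiver: opsgenie-p3            # default: lowest paging priority
  routes:
    - matchers: ['severity="critical"', 'cluster_state="active"']
      receiver: opsgenie-p1
    - matchers: ['severity="critical"', 'cluster_state="passive"']
      receiver: opsgenie-p2

receivers:
  - name: opsgenie-p1
    opsgenie_configs:
      - priority: P1
  - name: opsgenie-p2
    opsgenie_configs:
      - priority: P2
  - name: opsgenie-p3
    opsgenie_configs:
      - priority: P3
```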
The Silence Manager
Some environments aren't live. They might be in maintenance, decommissioned, or used for testing. Alerts for these environments are noise.
I built a Python service that queries an external API for each environment's business state. If an environment has no live activity, the silence manager creates a silence in AlertManager for that environment's namespace. When the environment goes live again, the silence is removed.
This runs continuously, checking state and managing silences automatically. No manual intervention, no forgotten silences that mask real problems.
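The service's code isn't part of this post; a minimal sketch of the silence-creation step against AlertManager's v2 API (the `namespace` label, silence duration, and AlertManager address are assumptions):

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

ALERTMANAGER = "http://alertmanager:9093"  # assumed in-cluster address

def build_silence(namespace, hours=4):
    """Build an AlertManager v2 silence payload covering one namespace."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            # Silence everything carrying this environment's namespace label
            {"name": "namespace", "value": namespace,
             "isRegex": False, "isEqual": True},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": "silence-manager",
        "comment": f"Environment {namespace} has no live activity",
    }

def post_silence(payload):
    """POST the silence to AlertManager's v2 API."""
    req = urllib.request.Request(
        f"{ALERTMANAGER}/api/v2/silences",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # AlertManager returns the silence ID
```

In the real service this would run in a loop: re-check the business-state API, renew silences before they expire for still-idle environments, and expire them early when an environment goes live.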
Deployment
The alert rules live on a dedicated Git branch. The pipeline:
- Engineer writes or updates a template
- Jenkins validates the YAML structure
- Jenkins generates vmalert rule files (3 per template)
- Generated files are committed to the rules-bundle directory
- Rancher Fleet detects the change and deploys to vmalert
- vmalert picks up the new rules and starts evaluating
The whole cycle from template change to rules being active across all clusters takes a few minutes. No manual deployment steps.
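For reference, one of the three generated rules could look like this in vmalert's rule-file format (the group name and file layout are assumptions; the warning and low variants follow the same shape with their own thresholds):

```yaml
groups:
  - name: HighErrorRate
    rules:
      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate is {{ $value | humanizePercentage }}"
```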
What Changed
Before this system, alert management was a constant source of friction. Rules were inconsistent, severity levels didn't correlate, and on-call engineers dealt with alert storms for single incidents.
After implementing the template system with inhibition:
- One template per alert concept instead of three separate rules
- Automatic correlation - only the highest severity reaches on-call
- Consistent thresholds across all severity levels
- Git-managed - full history, review process, automated deployment
- Auto-silencing for non-production environments
The on-call experience went from "everything is screaming" to "here's the one thing that matters, at the right priority."