Designing a Custom Alert Severity System
The Problem
Alert rules were managed individually. If you wanted an alert to fire at different severity levels - critical when something is completely down, warning when it's degraded, low when it's a minor issue - you had to write three separate rules, each with its own thresholds, its own labels, and its own maintenance burden.
With 500+ alert rules across 70+ clusters, this wasn't sustainable. Rules would drift - someone would update the critical threshold but forget to update the warning version. Or a new alert would get created at one severity level with no correlation to related alerts.
The Template System
I designed a template format where engineers write one YAML file per alert concept. The template includes all three severity levels with their respective thresholds in a single definition:
```yaml
name: HighErrorRate
expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m])
critical: "> 0.1"
warning: "> 0.05"
low: "> 0.02"
for: 5m
annotations:
  summary: "Error rate is {{ $value | humanizePercentage }}"
```
A Jenkins job processes these templates and generates the actual vmalert rule files - three rules per template, each with the correct severity label. The generated rules are committed to a dedicated Git branch and deployed via Rancher Fleet to vmalert across all clusters.
The engineer writes one template. The pipeline produces three consistent, correctly labeled rules.
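The generator itself isn't shown here; a minimal sketch of the expansion step, assuming templates are parsed into dicts with the field names from the example above (the actual Jenkins job may differ):

```python
SEVERITIES = ("critical", "warning", "low")

def expand_template(template):
    """Expand one alert template into one vmalert rule per severity level."""
    rules = []
    for severity in SEVERITIES:
        threshold = template.get(severity)
        if threshold is None:
            continue  # this severity is not defined for the alert
        rules.append({
            "alert": template["name"],  # same alertname at every level,
                                        # so inhibition can correlate them
            "expr": f"{template['expr']} {threshold}",
            "for": template.get("for", "5m"),
            "labels": {"severity": severity},
            "annotations": dict(template.get("annotations", {})),
        })
    return rules

template = {
    "name": "HighErrorRate",
    "expr": "rate(http_errors_total[5m]) / rate(http_requests_total[5m])",
    "critical": "> 0.1",
    "warning": "> 0.05",
    "low": "> 0.02",
    "for": "5m",
    "annotations": {"summary": "Error rate is {{ $value | humanizePercentage }}"},
}

rules = expand_template(template)
```

Keeping the alertname identical across severities is the important design choice: it gives AlertManager a stable key to correlate the three generated rules.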
Correlation and Suppression
The real value isn't just generating three rules - it's how they interact. AlertManager inhibition rules correlate alerts by name:
- When a critical fires, it suppresses warning and low for the same alert
- When a warning fires, it suppresses low
- Low only reaches on-call if nothing more severe has triggered
This means the on-call engineer sees one alert at the highest applicable severity, not three alerts for the same problem at different levels. Before this system, a single failing service could generate a storm of alerts - critical, warning, and informational all firing simultaneously.
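In AlertManager terms, the three suppression rules above map onto `inhibit_rules`. A sketch, assuming the `severity` labels produced by the pipeline; the exact `equal` label set is an assumption:

```yaml
inhibit_rules:
  # critical suppresses warning and low for the same alert
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity=~"warning|low"']
    equal: ["alertname", "namespace"]
  # warning suppresses low
  - source_matchers: ['severity="warning"']
    target_matchers: ['severity="low"']
    equal: ["alertname", "namespace"]
```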
On-Call Routing
AlertManager routes to OpsGenie with priority mapping:
- Critical on an active cluster = P1
- Critical on a passive cluster = P2
- Warning = P3
The cluster active/passive state comes from the infrastructure - the same data that powers the health dashboard. This means an alert for a failing service on a passive (standby) cluster doesn't wake someone up at 3am.
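As AlertManager routing, this could look roughly like the following. The `cluster_state` label name is an assumption for illustration; `priority` is a real field of `opsgenie_configs`:

```yaml
route:
  receiver: opsgenie-p3            # default: lowest paging priority
  routes:
    - matchers: ['severity="critical"', 'cluster_state="active"']
      receiver: opsgenie-p1
    - matchers: ['severity="critical"', 'cluster_state="passive"']
      receiver: opsgenie-p2

receivers:
  - name: opsgenie-p1
    opsgenie_configs:
      - priority: P1
  - name: opsgenie-p2
    opsgenie_configs:
      - priority: P2
  - name: opsgenie-p3
    opsgenie_configs:
      - priority: P3
```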
The Silence Manager
Some environments aren't live. They might be in maintenance, decommissioned, or used for testing. Alerts for these environments are noise.
I built a Python service that queries an external API for each environment's business state. If an environment has no live activity, the silence manager creates a silence in AlertManager for that environment's namespace. When the environment goes live again, the silence is removed.
This runs continuously, checking state and managing silences automatically. No manual intervention, no forgotten silences that mask real problems.
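The service's code isn't part of this post; a minimal sketch of the silence-creation step against AlertManager's v2 API (the `namespace` label, silence duration, and AlertManager address are assumptions):

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

ALERTMANAGER = "http://alertmanager:9093"  # assumed in-cluster address

def build_silence(namespace, hours=4):
    """Build an AlertManager v2 silence payload covering one namespace."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            # Silence everything carrying this environment's namespace label
            {"name": "namespace", "value": namespace,
             "isRegex": False, "isEqual": True},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": "silence-manager",
        "comment": f"Environment {namespace} has no live activity",
    }

def post_silence(payload):
    """POST the silence to AlertManager's v2 API."""
    req = urllib.request.Request(
        f"{ALERTMANAGER}/api/v2/silences",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # AlertManager returns the silence ID
```

In the real service this would run in a loop: re-check the business-state API, renew silences before they expire for still-idle environments, and expire them early when an environment goes live.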
Deployment
The alert rules live on a dedicated Git branch. The pipeline:
- Engineer writes or updates a template
- Jenkins validates the YAML structure
- Jenkins generates vmalert rule files (3 per template)
- Generated files are committed to the rules-bundle directory
- Rancher Fleet detects the change and deploys to vmalert
- vmalert picks up the new rules and starts evaluating
The whole cycle from template change to rules being active across all clusters takes a few minutes. No manual deployment steps.
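For reference, one of the three generated rules could look like this in vmalert's rule-file format (the group name and file layout are assumptions; the warning and low variants follow the same shape with their own thresholds):

```yaml
groups:
  - name: HighErrorRate
    rules:
      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate is {{ $value | humanizePercentage }}"
```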
What Changed
Before this system, alert management was a constant source of friction. Rules were inconsistent, severity levels didn't correlate, and on-call engineers dealt with alert storms for single incidents.
After implementing the template system with inhibition:
- One template per alert concept instead of three separate rules
- Automatic correlation - only the highest severity reaches on-call
- Consistent thresholds across all severity levels
- Git-managed - full history, review process, automated deployment
- Auto-silencing for non-production environments
The on-call experience went from "everything is screaming" to "here's the one thing that matters, at the right priority."