Alert Pipeline
Custom severity model with alert correlation, suppression, and automated on-call routing
500+ alert rules
3 severity levels
Deployed via Fleet
Architecture
The Problem
Alert rules were managed individually with no correlation between severity levels. Engineers had to write and maintain separate rules for each severity, leading to inconsistency and gaps in coverage.
Template System
Designed a template system in which a single alert definition covers all three severity levels: critical, warning, and low. A Jenkins job transforms each template into vmalert rules and groups the related severities together so they are managed consistently.
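As a rough illustration of the transformation step, the sketch below expands one template into a rule group in the Prometheus-compatible format that vmalert reads. The template schema (`name`, `expr`, `levels`, the `<threshold>` placeholder) and the function name are assumptions made for this example, not the actual format used by the Jenkins job.

```python
import json

# Severity order used throughout this sketch (hypothetical constant).
SEVERITIES = ("critical", "warning", "low")


def expand_template(template):
    """Expand one multi-severity template into a single rule group."""
    rules = []
    for severity in SEVERITIES:
        level = template["levels"].get(severity)
        if level is None:
            continue  # a severity that is not defined produces no rule
        rules.append({
            "alert": template["name"],
            # Plain string replacement avoids clashing with PromQL braces.
            "expr": template["expr"].replace("<threshold>", str(level["threshold"])),
            "for": level.get("for", "5m"),
            "labels": {"severity": severity},
            "annotations": {"summary": template["summary"]},
        })
    # All severities of one alert share a group, so they are managed together.
    return {"name": template["name"], "rules": rules}


# Hypothetical template: one definition, three thresholds.
template = {
    "name": "HighErrorRate",
    "expr": 'rate(http_errors_total{job="api"}[5m]) > <threshold>',
    "summary": "API error rate is elevated",
    "levels": {
        "critical": {"threshold": 10, "for": "2m"},
        "warning": {"threshold": 5},
        "low": {"threshold": 1},
    },
}

group = expand_template(template)
print(json.dumps({"groups": [group]}, indent=2))
```

Keeping the alert name identical across severities and varying only the `severity` label is what lets name-based correlation work later in the pipeline.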
Correlation and Suppression
AlertManager inhibition rules correlate alerts by name, so when a critical alert fires it suppresses the warning and low alerts of the same name. This prevents alert storms and ensures that on-call sees only the highest severity.
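The effect of name-based inhibition can be modeled in a few lines. This is only a simulation of the outcome, not AlertManager's implementation: in a real configuration the behavior comes from `inhibit_rules` entries with `source_matchers`, `target_matchers`, and `equal`. The sketch also assumes that a warning suppresses a low of the same name, which goes slightly beyond the critical case described above.

```python
# Higher number means higher severity (hypothetical ranking for this sketch).
SEVERITY_RANK = {"critical": 3, "warning": 2, "low": 1}


def visible_alerts(firing):
    """Return only the highest-severity alert(s) for each alert name."""
    top = {}
    for alert in firing:
        name = alert["alertname"]
        rank = SEVERITY_RANK[alert["severity"]]
        top[name] = max(top.get(name, 0), rank)
    # Anything below the top severity for its name is treated as inhibited.
    return [a for a in firing if SEVERITY_RANK[a["severity"]] == top[a["alertname"]]]


firing = [
    {"alertname": "HighErrorRate", "severity": "critical"},
    {"alertname": "HighErrorRate", "severity": "warning"},
    {"alertname": "HighErrorRate", "severity": "low"},
    {"alertname": "DiskFilling", "severity": "warning"},
    {"alertname": "DiskFilling", "severity": "low"},
]

shown = visible_alerts(firing)
for alert in shown:
    print(alert["alertname"], alert["severity"])
```

Five firing alerts collapse to two notifications here: the critical for one name and the warning for the other.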
Silence Manager
Built a Python service that checks an external API for each operator's business state. If an operator has no live agents, the silence manager automatically creates silences in AlertManager, which prevents false pages for environments that are intentionally inactive.
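A minimal sketch of the decision logic, under stated assumptions: the external API's response is stood in for by a plain dict of operator name to live-agent count, the `operator` label name and the one-hour duration are hypothetical, and the HTTP calls are left out. The payload shape follows the silence object accepted by AlertManager's v2 API (`POST /api/v2/silences`).

```python
from datetime import datetime, timedelta, timezone


def build_silences(live_agents, now, duration=timedelta(hours=1)):
    """Build one silence payload per operator that has no live agents."""
    silences = []
    for operator, count in sorted(live_agents.items()):
        if count > 0:
            continue  # operator is active, so its alerts should still page
        silences.append({
            "matchers": [{
                "name": "operator",  # hypothetical label carried by the alerts
                "value": operator,
                "isRegex": False,
                "isEqual": True,
            }],
            "startsAt": now.isoformat(),
            "endsAt": (now + duration).isoformat(),
            "createdBy": "silence-manager",
            "comment": "No live agents; environment intentionally inactive",
        })
    return silences


# Stand-in for the external API response (hypothetical operators).
state = {"acme": 0, "globex": 4}
now = datetime(2024, 1, 1, tzinfo=timezone.utc)

payloads = build_silences(state, now)
for payload in payloads:
    print(payload["matchers"][0]["value"], payload["startsAt"], payload["endsAt"])
```

Short-lived silences that are renewed on every check would expire on their own if the service stopped running, which is one reason to prefer them over long ones in a design like this.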
Deployment
Alert rules live on a dedicated Git branch with CI validation. Jenkins validates and transforms the templates, and Fleet then deploys the generated rules to vmalert across all clusters automatically.
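The checks Jenkins runs are not spelled out here, so the following is only a guess at the kind of validation that could gate a merge: every template must carry the required fields and define all three severities. The field names and error messages are hypothetical.

```python
# Hypothetical template schema used only for this sketch.
REQUIRED_KEYS = ("name", "expr", "levels")
SEVERITIES = ("critical", "warning", "low")


def validate_template(template):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for key in REQUIRED_KEYS:
        if key not in template:
            errors.append(f"missing key: {key}")
    levels = template.get("levels", {})
    for severity in SEVERITIES:
        if severity not in levels:
            errors.append(f"missing severity level: {severity}")
    for severity in levels:
        if severity not in SEVERITIES:
            errors.append(f"unknown severity level: {severity}")
    return errors


incomplete = {
    "name": "DiskFilling",
    "expr": "disk_free_ratio < <threshold>",
    "levels": {"critical": {"threshold": 0.05}, "warning": {"threshold": 0.1}},
}

problems = validate_template(incomplete)
for problem in problems:
    print(problem)
```

A CI job could fail the build whenever the returned list is non-empty, so an incomplete template never reaches the transformation step.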
Deep dive