Deploying Alert Rules at Scale with Fleet and Jenkins
The Problem with Fleet and Alert Rules
Rancher Fleet is great for GitOps. You push configs to a Git branch, Fleet syncs them to your clusters. Simple, reliable, no manual deployments.
But Fleet has a limitation that caught me off guard: no pre-sync hooks.
ArgoCD has resource hooks with a PreSync phase - you can run a Kubernetes Job before the actual sync happens. This is useful when your source files need transformation before deployment. Fleet doesn't have this concept. It uses Helm's native hooks (pre-install, pre-upgrade) but those run at deploy time inside the cluster, not at the Git level before sync. Fleet syncs exactly what's in your Git repository, as-is.
This became a problem when I needed to deploy alert rules to vmalert. The format I wanted engineers to write in wasn't the format vmalert understands. I needed a transformation step between "engineer writes a template" and "vmalert receives the rule."
The Repo Structure
The alert rules live in a dedicated branch of the Fleet GitOps repository. Here's the layout:
vmalert-apps/
├── alert-templates/              # Engineers edit these
│   ├── app-rules/                # Application-specific alerts
│   │   ├── service-a/alerts.yaml
│   │   ├── service-b/alerts.yaml
│   │   └── ...
│   └── infra-rules/              # Infrastructure alerts
│       ├── kubernetes/alerts.yaml
│       ├── database/alerts.yaml
│       └── ...
├── rules-bundle/                 # Auto-generated - don't edit
│   └── rules/
│       ├── app-rules/            # Generated from templates
│       ├── infra-rules/          # Generated from templates
│       └── recording-rules/      # Edited directly (no templates)
├── scripts/
│   └── generate_from_template.py
├── fleet.yaml                    # Fleet targeting config
└── validation_Jenkinsfile        # CI pipeline
The alert-templates/ directory is where engineers work. The rules-bundle/rules/ directory is what Fleet deploys to vmalert. The recording-rules/ subdirectory is an exception - those are edited directly since they don't need severity splitting.
Why Templates
Standard vmalert alert rules look like this:
groups:
  - name: kubernetes.rules
    rules:
      - alert: CPUCloseToLimits
        expr: |
          (sum by (namespace,pod,container,cluster)(
            rate(container_cpu_usage_seconds_total[5m])
          ) / sum by(namespace,pod,container,cluster)(
            kube_pod_container_resource_limits{resource="cpu"}
          )) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPU usage close to limits"
That's one rule, one severity. If I want critical at 95%, warning at 90%, and low at 80%, I write three separate rules with the same expression but different thresholds. Three rules to maintain. Three places to update when the query changes. Three chances to make a mistake.
With 500+ rules across infrastructure, databases, applications, and health checks, this doesn't scale. I designed a template format where you define one alert with a severities block containing all levels. The expressions can be completely different between levels - not just different thresholds. A critical might check for complete failure while a warning checks for degradation using a different query entirely.
But vmalert doesn't understand severities: blocks. Something needs to transform this into standard rules before deployment.
The Transformation: Input vs Output
Here's what the generator actually produces. One template in, multiple vmalert rules out:
What the engineer writes (template):
- name: CPU Close to Limits
  annotations:
    summary: "CPU usage close to limits"
    runbook: /docs/runbooks/cpu-limits.md
  labels:
    team: infrastructure
  severities:
    - level: critical
      expr: |
        (sum by (namespace,pod,container,cluster)(
          rate(container_cpu_usage_seconds_total[5m])
        ) / sum by(namespace,pod,container,cluster)(
          kube_pod_container_resource_limits{resource="cpu"}
        )) * 100 > 95
      for: 5m
    - level: warning
      expr: |
        ...same query... > 90
      for: 5m
What vmalert receives (generated):
- alert: CPU Close to Limits
  expr: |
    (sum by (namespace,pod,container,cluster)(
      rate(container_cpu_usage_seconds_total[5m])
    ) / sum by(namespace,pod,container,cluster)(
      kube_pod_container_resource_limits{resource="cpu"}
    )) * 100 > 95
  for: 5m
  labels:
    team: infrastructure
    severity: critical
    severity_order: "1"
  annotations:
    summary: "CPU usage close to limits"
    runbook: /docs/runbooks/cpu-limits.md
- alert: CPU Close to Limits
  expr: |
    ...same query... > 90
  for: 5m
  labels:
    team: infrastructure
    severity: warning
    severity_order: "2"
  annotations:
    summary: "CPU usage close to limits"
    runbook: /docs/runbooks/cpu-limits.md
The key details: both generated rules share the same alert name (for inhibition matching), each gets a severity and severity_order label automatically, and all shared labels/annotations from the template are copied to every rule.
The Generator Logic
The core of the Python script is straightforward. For each template rule that has a severities block, it generates one standard alert rule per severity:
SEVERITY_ORDER = {
    'critical': '1',
    'warning': '2',
    'low': '3',
    'info': '4'
}


def generate_alert_rule(template_rule, severity):
    """Generate a single vmalert rule from a template + severity."""
    alert_rule = {
        'alert': template_rule['name'],
        'expr': severity['expr'],
        'labels': {},
        'annotations': {}
    }
    if severity.get('for'):
        alert_rule['for'] = severity['for']
    # Copy shared labels from template, then add severity
    if 'labels' in template_rule:
        alert_rule['labels'].update(template_rule['labels'])
    alert_rule['labels']['severity'] = severity['level']
    alert_rule['labels']['severity_order'] = SEVERITY_ORDER.get(severity['level'], '0')
    # Copy shared annotations
    if 'annotations' in template_rule:
        alert_rule['annotations'].update(template_rule['annotations'])
    return alert_rule


def process_template_group(template_group):
    """Expand all templates in a group into individual rules."""
    generated_rules = []
    for template_rule in template_group.get('rules', []):
        # Skip disabled alerts
        if not template_rule.get('enabled', True):
            continue
        for severity in template_rule.get('severities', []):
            generated_rules.append(
                generate_alert_rule(template_rule, severity)
            )
    return generated_rules
The script also validates that every severity has both expr and level fields, preserves multi-line PromQL as YAML block scalars, and supports an enabled: false flag to disable alerts without deleting them.
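The required-field check could look roughly like the sketch below. The function name validate_template_rule and the wording of the messages are my own placeholders for illustration, not the actual script's API; the only things taken from the text are that every severity needs level and expr.

```python
def validate_template_rule(template_rule):
    """Return a list of problems found in one template rule (empty list = valid).

    Hypothetical helper: the real generator's checks and messages may differ.
    """
    errors = []
    name = template_rule.get("name")
    if not name:
        errors.append("rule is missing 'name'")
        name = "<unnamed>"

    severities = template_rule.get("severities") or []
    if not severities:
        errors.append(f"{name}: no 'severities' block")

    # Every severity entry needs both a level and an expression.
    for index, severity in enumerate(severities):
        for field in ("level", "expr"):
            if not severity.get(field):
                errors.append(f"{name}: severities[{index}] is missing '{field}'")
    return errors
```

Returning a list instead of raising on the first problem lets the pipeline report every broken template in one run.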
The Pipeline Flow
Engineers edit templates in alert-templates/, and Fleet deploys the generated vmalert rules from rules-bundle/. The Jenkinsfile sits between the two and handles either validation or generation, depending on which branch it runs on.
Two pipelines in one Jenkinsfile:
Feature Branch: Validate Only
When an engineer opens a PR from a feature branch, Jenkins:
- Checks for direct rule edits - if someone edited the generated rules directory instead of the templates, the pipeline fails immediately with a clear error. The generated directory is output-only.
- Generates rules locally - runs the Python generator to transform templates into vmalert format.
- Validates with vmalert dry-run - spins up a vmalert container and validates every generated rule file:

  docker run --rm \
    -v "$(pwd)/rules:/rules" \
    victoriametrics/vmalert:v1.123.0 \
    -rule="/rules/**/*.yaml" \
    -dryRun

If any expression has a syntax error, the pipeline fails before the PR can merge. No broken rules reach production.
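The direct-edit check from the first step can be sketched as a small path filter. This is an illustration under assumptions: the function name find_direct_rule_edits is made up, paths are taken to be relative to the vmalert-apps/ directory shown earlier, and recording-rules/ is exempted because it is edited by hand.

```python
from pathlib import PurePosixPath

GENERATED_ROOT = PurePosixPath("rules-bundle/rules")
# recording-rules/ has no templates and is edited directly, so it is exempt.
HAND_EDITED = {"recording-rules"}


def find_direct_rule_edits(changed_paths):
    """Return the changed paths that touch generated (output-only) rule files."""
    offenders = []
    for raw in changed_paths:
        try:
            relative = PurePosixPath(raw).relative_to(GENERATED_ROOT)
        except ValueError:
            continue  # path is not under the generated directory
        if relative.parts and relative.parts[0] in HAND_EDITED:
            continue
        offenders.append(raw)
    return offenders
```

The list of changed paths would typically come from something like git diff --name-only against the PR's base branch. If the function returns anything, the pipeline can fail and print the offending files.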
Main Branch: Generate and Push
After the PR merges to the main alerting branch, Jenkins:
- Detects template changes - inspects the merge commit to see whether any template files changed. If only non-template files changed, it skips generation entirely.
- Generates rules - the Python script reads every template, expands the severities blocks into individual vmalert rules, and adds severity and severity_order labels automatically.
- Commits and pushes - the generated rules get committed back to the same branch with a [jenkins] tag. Fleet detects the new commit and syncs to vmalert.
The push has retry logic with rebase - if someone else pushed to the branch between the generation and push, Jenkins rebases and retries up to 3 times.
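The retry-with-rebase loop is easiest to see in code. The real logic lives in the Jenkinsfile; this is a Python sketch of the same idea. The names push_with_rebase and run_git are hypothetical, and I'm reading "up to 3 times" as three push attempts in total.

```python
import subprocess


def run_git(*args):
    """Run a git command in the current directory; True means exit code 0."""
    return subprocess.run(["git", *args], check=False).returncode == 0


def push_with_rebase(branch, run=run_git, max_attempts=3):
    """Push HEAD to `branch`, rebasing onto the remote between failed attempts.

    Returns True once a push succeeds, False if all attempts fail.
    """
    for attempt in range(1, max_attempts + 1):
        if run("push", "origin", f"HEAD:{branch}"):
            return True
        if attempt == max_attempts:
            break
        # Someone else pushed first: replay the generated commit on top of theirs.
        if not run("pull", "--rebase", "origin", branch):
            run("rebase", "--abort")  # leave the workspace clean on conflict
            return False
    return False
```

Passing the command runner in as a parameter keeps the loop testable without touching a real repository.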
The Generator
The Python script does the actual transformation. For each template rule with a severities block, it generates one standard alert rule per severity level:
- name becomes alert (the alert name)
- Each severity's expr and for become the rule's expression and pending duration
- severity label is added automatically (critical, warning, low, info)
- severity_order label is added for sorting (1, 2, 3, 4)
- Shared annotations and labels from the template are copied to each generated rule
- Multi-line PromQL expressions are preserved as YAML block scalars
Rules in the recording-rules/ directory are excluded from generation - those are edited directly since they don't need severity splitting.
The script also handles an enabled: false flag per alert. Engineers can disable a specific alert without deleting it, keeping the definition for reference.
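One way to picture how template files map onto generated files, assuming the output tree mirrors the template tree. The repo layout suggests this but nothing above states it outright, and output_path_for is a name invented for this sketch.

```python
from pathlib import PurePosixPath

TEMPLATE_ROOT = PurePosixPath("alert-templates")
OUTPUT_ROOT = PurePosixPath("rules-bundle/rules")


def output_path_for(template_path):
    """Map a template file to the generated rule file that mirrors it.

    Raises ValueError for paths outside alert-templates/.
    """
    relative = PurePosixPath(template_path).relative_to(TEMPLATE_ROOT)
    return str(OUTPUT_ROOT / relative)
```

Under this mapping, recording-rules/ is never written by the generator simply because no template points at it.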
Why This Matters
The generated rules share the same alertname label across severities. This is what makes AlertManager's inhibition work:
inhibit_rules:
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity =~ warning|info|low
    equal:
      - alertname
When "CPU Close to Limits" fires as critical (>95%), AlertManager automatically suppresses the warning (>90%) and low (>80%) for the same alert name. The on-call engineer sees one alert at the highest severity, not three.
Without the template system generating consistent alert names across severities, this inhibition wouldn't work. Engineers would have to manually ensure naming consistency across separately maintained rules.
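To make the effect of that one inhibit rule concrete, here is a toy model of it in Python. It is not how AlertManager is implemented. It only mirrors this specific rule (critical as source, warning/info/low as targets, equality on alertname), and the alert dictionaries are made-up sample data.

```python
TARGET_SEVERITIES = {"warning", "info", "low"}


def inhibited(alerts):
    """Return the firing alerts that a critical alert with the same name suppresses."""
    critical_names = {
        alert["labels"]["alertname"]
        for alert in alerts
        if alert["labels"].get("severity") == "critical"
    }
    return [
        alert
        for alert in alerts
        if alert["labels"].get("severity") in TARGET_SEVERITIES
        and alert["labels"]["alertname"] in critical_names
    ]


firing = [
    {"labels": {"alertname": "CPU Close to Limits", "severity": "critical"}},
    {"labels": {"alertname": "CPU Close to Limits", "severity": "warning"}},
    {"labels": {"alertname": "Memory Close to Limits", "severity": "warning"}},
]
suppressed = inhibited(firing)
```

Only the CPU warning ends up in suppressed. The memory warning has no critical counterpart with the same name, so it still notifies.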
What's Still Not Solved
There's a pattern I keep running into: multiple different warning conditions should be suppressed by a single critical condition. For example, I might have three warning-level alerts checking different aspects of database health, and one critical alert for "database completely down." When the critical fires, all three warnings should be suppressed.
Current inhibition rules match on alertname, so the critical needs the same name as the warnings. But it's a different alert checking a different thing.
I haven't found a clean solution for this yet. Options I'm considering:
- A group label that links related alerts across different names
- Expanding the template format to support alert families
- Using AlertManager's target_matchers with regex patterns
If you've solved this problem, I'd like to hear about it.
The Full Stack
The template system is one piece of a larger alerting pipeline:
- Templates define alerts with multiple severities
- Jenkins transforms and validates before deployment
- Fleet syncs generated rules to vmalert across all clusters
- vmalert evaluates rules against VictoriaMetrics
- AlertManager handles inhibition, grouping, and routing
- OpsGenie receives alerts with priority based on severity and cluster state
- Silence Manager auto-suppresses alerts for non-live environments
Each piece is managed via GitOps from a single repository. Template changes go through PR review, get validated by CI, and deploy automatically. No SSH, no manual kubectl, no "I forgot to apply the new rules to cluster 47."
Migrating Existing Rules
If you already have hundreds of standard Prometheus/vmalert rules and want to adopt this template format, rewriting them by hand isn't practical. I built a migration tool that converts existing rules into the template format automatically.
It reads your current rule files, groups alerts by name, detects rules that share the same name but have different severity labels, and merges them into a single template with a severities block. Rules that don't have severity labels get converted to single-severity templates.
python3 convert_to_template.py \
--input ./existing-rules \
--output ./templates
This bootstraps the migration. You'll still want to review the output and adjust thresholds, but it saves hours of manual conversion.
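The merge step at the heart of the migration can be sketched as below. This is a simplified illustration, not the tool's actual code. The function name merge_rules_into_templates is hypothetical, shared labels and annotations are taken from the first rule seen for each name, and DEFAULT_LEVEL is an assumed fallback for rules without a severity label (the text doesn't say which level the tool picks).

```python
DEFAULT_LEVEL = "warning"  # assumption: fallback for rules with no severity label


def merge_rules_into_templates(rules):
    """Group standard alert rules by name and fold them into template rules."""
    templates = {}  # dicts keep insertion order, so output order follows input
    for rule in rules:
        labels = dict(rule.get("labels", {}))
        level = labels.pop("severity", None)
        labels.pop("severity_order", None)  # regenerated later by the generator
        template = templates.setdefault(rule["alert"], {
            "name": rule["alert"],
            "labels": labels,
            "annotations": dict(rule.get("annotations", {})),
            "severities": [],
        })
        severity = {"level": level or DEFAULT_LEVEL, "expr": rule["expr"]}
        if "for" in rule:
            severity["for"] = rule["for"]
        template["severities"].append(severity)
    return list(templates.values())
```

Rules whose labels or annotations differ between severities would need manual review after a merge like this, which is one reason to check the converted output by hand.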
Open Source
Both the generator and migration tool are available on GitHub: alert-template-generator
For more on the alerting pipeline, OpsGenie routing, and silence automation, see From Health Logs to Grouped OpsGenie Alerts.