Building a Health Platform Part 4: From Logs to Grouped Alerts
The Series
This is part 4 of a 4-part series on building a health monitoring platform:
- Part 1: Architecture and Data Pipeline
- Part 2: Caching 12M+ Logs for a Real-Time Dashboard
- Part 3: Designing Views for Hundreds of Environments
- Part 4: From Health Logs to Grouped OpsGenie Alerts (this post)
The Alerting Problem
The dashboard from Part 3 gives teams a visual overview. But dashboards require someone to look at them. You also need alerts that find you.
The previous alerting setup used PRTG. Every monitored endpoint was a separate sensor. When an endpoint went down, PRTG sent an alert. A single database failure could affect 15 services in one environment, and each one would fire its own alert. One incident, 15 pages. During a larger outage affecting multiple environments, the on-call engineer's phone would light up with dozens of notifications. The first 10 minutes of every incident were spent triaging which alerts shared the same root cause.
There was also no priority based on context. A failing service on a passive standby cluster paged with the same urgency as a failure on the active production cluster.
vmalert Rules on VictoriaLogs
The health check data lives in VictoriaLogs (see Part 1). vmalert evaluates rules against VictoriaLogs directly using LogsQL:
groups:
  - name: healthcheck.rules
    type: vlogs
    params:
      query:
        - tenant_id=0:0
    interval: 1m
    rules:
      - alert: ServiceUnhealthy
        expr: |
          _time:5m AND status:"Unhealthy"
          | stats by (cluster, namespace, service) count() failures
          | filter failures:>3
        for: 5m
        labels:
          source: healthcheck
        annotations:
          summary: >-
            {{ $labels.service }} unhealthy in
            {{ $labels.namespace }} on {{ $labels.cluster }}
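The rule matches services that logged more than 3 Unhealthy checks in the trailing 5-minute window. The failures do not have to be consecutive. The for: 5m pending period then requires that condition to keep holding for 5 minutes of evaluations before the alert fires, which prevents alerts on brief health check flaps.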
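To make the interaction between the count threshold and the pending period concrete, here is a minimal Python sketch of the logic. It is a simplified model, not vmalert's code: the function name, the parameters, and the per-evaluation input list are all illustrative assumptions.

```python
from typing import Optional, Sequence


def first_firing_eval(
    failure_counts: Sequence[int],
    threshold: int = 3,
    interval_s: int = 60,
    for_s: int = 300,
) -> Optional[int]:
    """Return the index of the evaluation at which the alert would fire.

    failure_counts[i] is the number of Unhealthy log entries found in the
    5-minute window at evaluation i (one evaluation every `interval_s`).
    Simplified model: the alert goes pending when the count exceeds the
    threshold and fires once it has stayed above it for `for_s` seconds.
    Any evaluation at or below the threshold resets the pending state.
    """
    pending_since: Optional[int] = None
    for i, count in enumerate(failure_counts):
        if count > threshold:
            if pending_since is None:
                pending_since = i  # condition became true: start pending
            if (i - pending_since) * interval_s >= for_s:
                return i  # held long enough: alert fires
        else:
            pending_since = None  # a healthy window resets the clock
    return None


# Sustained failure: pending at evaluation 0, fires at evaluation 5.
sustained = first_firing_eval([4] * 10)

# Flapping: the condition never holds for a full 5 minutes, so no alert.
flapping = first_firing_eval([4, 4, 4, 0, 4, 4, 4])
```

In this model a service that fails continuously pages after roughly five minutes in the pending state, while a service that recovers even once inside that period starts over.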
For the alert severity template format, Jenkins CI validation, and Fleet deployment pipeline, see Deploying Alert Rules at Scale with Fleet and Jenkins.
Grouping Over Per-Service Noise
The single biggest improvement: how AlertManager groups these alerts.
route:
  routes:
    - receiver: opsgenie-healthcheck
      matchers:
        - source = "healthcheck"
      group_by:
        - namespace
        - cluster
        - is_cluster_active
      group_wait: 30s
      group_interval: 2m
      repeat_interval: 4h
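When 15 services fail in the same environment on the same cluster, AlertManager collects them into one group. The group_wait: 30s gives it time to batch related alerts before sending the first notification. Services that fail later join the same group, and AlertManager sends an updated notification for the group at most once per group_interval (2 minutes). While the group keeps firing unchanged, repeat_interval: 4h controls how often the notification is re-sent.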
One notification per environment per cluster, not one per service. The notification lists all failing services inside it. The on-call engineer sees the scope immediately.
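The effect of group_by can be illustrated with a short Python sketch. This is not AlertManager's implementation, only a model of the grouping key; the label values (env-a, prod-1, svc-0 and so on) are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Same labels as the group_by list in the route above.
GROUP_BY = ("namespace", "cluster", "is_cluster_active")


def group_alerts(alerts: List[Dict[str, str]]) -> Dict[Tuple[str, ...], List[str]]:
    """Bucket alerts by their group_by label values.

    Each bucket corresponds to one notification; the list holds the
    failing services that would appear inside it.
    """
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for labels in alerts:
        # A missing label is treated as an empty value in the key.
        key = tuple(labels.get(name, "") for name in GROUP_BY)
        groups[key].append(labels["service"])
    return dict(groups)


# Hypothetical incident: 15 services down in one environment,
# plus one unrelated failure in another environment.
alerts = [
    {"namespace": "env-a", "cluster": "prod-1",
     "is_cluster_active": "true", "service": f"svc-{i}"}
    for i in range(15)
]
alerts.append({"namespace": "env-b", "cluster": "prod-1",
               "is_cluster_active": "true", "service": "svc-0"})

groups = group_alerts(alerts)
print(len(alerts), "alerts ->", len(groups), "notifications")
```

Sixteen alerts collapse into two notifications here: one for env-a listing all 15 services, and one for env-b.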
OpsGenie Priority Mapping
Not all failures are equal. A failing service on the active production cluster needs someone now. The same failure on a passive standby cluster is urgent but not "wake someone at 3am" urgent.
The priority mapping:
- Critical on an active cluster = P1 (immediate, wakes someone up)
- Critical on a passive/standby cluster = P2 (urgent, business hours)
- Warning on any cluster = P3 (handle during business hours)
- Low/info = P4 (review when available)
The is_cluster_active label comes from infrastructure automation based on the cluster's role. The same service failure generates different OpsGenie priorities depending on where it happens. No manual priority assignment, no judgment calls at 3am about whether this particular cluster matters.
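Expressed as code, the mapping is a small decision function. The sketch below is only an illustration of the table above; the function name and the lowercase severity strings are assumptions, and in a real setup the same logic would more likely live in the notification template than in Python.

```python
def opsgenie_priority(severity: str, is_cluster_active: bool) -> str:
    """Map alert severity and cluster role to an OpsGenie priority.

    Mirrors the list above: only critical alerts depend on whether the
    cluster is active; warnings are always P3; anything else is P4.
    """
    severity = severity.lower()
    if severity == "critical":
        return "P1" if is_cluster_active else "P2"
    if severity == "warning":
        return "P3"
    return "P4"


# Same failure, different cluster role, different priority.
active_prod = opsgenie_priority("critical", is_cluster_active=True)
passive_standby = opsgenie_priority("critical", is_cluster_active=False)
```

Converting the is_cluster_active label value into a boolean is left out here because the label's exact format is not specified.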
The Silence Manager
Some environments aren't live. They're in maintenance, used for testing, or temporarily inactive with no real traffic. Alerts for these environments are pure noise.
I built a Python service that runs on a schedule and checks an external business API for environment state. If an environment has no live activity, the silence manager creates a silence in AlertManager for that environment's namespace. When the environment goes live again, the silence gets removed automatically.
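The core of such a service is a reconciliation step: compare what the business API reports with the silences that already exist, then create or expire the difference. The sketch below is a hypothetical reconstruction, not the actual service. The function names, the createdBy marker, and the 24-hour silence duration are assumptions. The payload follows the shape of AlertManager's v2 API, where silences are created with POST /api/v2/silences and expired with DELETE /api/v2/silence/{silenceID}; the HTTP calls themselves are omitted.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple

CREATED_BY = "silence-manager"  # hypothetical marker to identify our own silences


def build_silence(namespace: str, now: datetime, hours: int = 24) -> dict:
    """Build a silence payload for AlertManager's POST /api/v2/silences."""
    return {
        "matchers": [
            {"name": "namespace", "value": namespace,
             "isRegex": False, "isEqual": True}
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": CREATED_BY,
        "comment": "Environment has no live activity",
    }


def plan_changes(
    live: Dict[str, bool], silenced: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Decide which silences to create and which to expire.

    live:     namespace -> True if the environment currently has live activity
    silenced: namespace -> ID of the active silence this tool created
    Returns (namespaces that need a new silence, silence IDs to expire).
    """
    to_create = sorted(
        ns for ns, is_live in live.items() if not is_live and ns not in silenced
    )
    # Namespaces missing from the API response keep their silence.
    to_expire = sorted(
        sid for ns, sid in silenced.items() if live.get(ns, False)
    )
    return to_create, to_expire


# Hypothetical run: env-a went quiet, env-b came back to life.
create, expire = plan_changes(
    live={"env-a": False, "env-b": True, "env-c": False},
    silenced={"env-b": "id-1", "env-c": "id-2"},
)
payload = build_silence("env-a", datetime.now(timezone.utc))
```

Keeping the planning step free of network calls makes it easy to test, and tagging silences with a fixed createdBy value lets the service leave manually created silences alone.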
This replaced a manual process: someone would create silences in AlertManager when they knew an environment was down, then forget to remove them weeks later. Or not create the silence at all and let alerts fire into the void.
At any given time, the silence manager keeps about 40% of environments silenced. That's a significant chunk of alert noise eliminated without anyone touching AlertManager manually.
Before and After
Before: One alert per service. 15+ notifications per incident. Same priority for active and passive clusters. Manual silencing that people forgot to clean up. Alert fatigue leading to ignored pages.
After: One grouped notification per environment. Priority based on cluster state. Auto-silenced non-live environments. When the pager fires, it's something that needs attention, at the right priority, with the right context.
The on-call experience went from dreading the pager to trusting it.