Monitoring Stack
GitOps-managed VictoriaMetrics and Grafana across 70 Kubernetes clusters
70+
Clusters
50M+
Time Series
1B+
Label Pairs
500+
Alert Rules
Architecture
The Starting Point
I took over an unoptimized VictoriaMetrics setup: Helm charts were applied manually, and the Bitbucket repository was frequently out of sync with the actual cluster state. Monitoring was scattered across clusters via Rancher's cattle-monitoring, with no unified management.
GitOps Migration
I consolidated everything into a single vm-stack managed via Rancher Fleet from one Git repository. Per-cluster configs, label-based targeting, automatic deployment across 70+ clusters with zero manual Helm operations.
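A Fleet bundle along these lines drives the per-cluster targeting; the repo paths, cluster labels, and values shown here are illustrative assumptions, not the actual config:

```yaml
# fleet.yaml — sketch of a Fleet bundle for the vm-stack.
# Chart path, label keys, and overrides are assumptions.
defaultNamespace: monitoring
helm:
  chart: ./charts/vm-stack
  releaseName: vm-stack
targetCustomizations:
  # Label-based targeting: clusters labeled "passive" get
  # different values than active ones, with no manual Helm runs.
  - name: passive-clusters
    clusterSelector:
      matchLabels:
        monitoring/state: passive
    helm:
      values:
        alerting:
          clusterState: passive
```

Fleet reconciles this from Git on every commit, so the repository stays the single source of truth for all 70+ clusters.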
Metrics at Scale
The VictoriaMetrics cluster runs in full cluster mode, with vminsert, vmstorage, and vmselect components handling 50M+ active time series and 1B+ label/value pairs.
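With the VictoriaMetrics operator, the three-tier topology can be declared as a single VMCluster resource. A minimal sketch, where replica counts, retention, and storage sizes are placeholders rather than the production values:

```yaml
# VMCluster sketch via the VictoriaMetrics operator.
# Replica counts and sizes are illustrative only.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: vm-stack
spec:
  retentionPeriod: "3"      # retention in months
  vminsert:
    replicaCount: 4         # stateless ingestion tier
  vmselect:
    replicaCount: 4         # stateless query tier
  vmstorage:
    replicaCount: 8         # stateful shards holding the active series
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 500Gi
```

Because vminsert and vmselect are stateless, they scale horizontally under ingest or query pressure without touching the vmstorage shards.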
Grafana Overhaul
Rebuilt the Grafana dashboard library from scratch, replacing a sprawl of duplicated and broken panels. Dashboards are organized by domain, queries are optimized, and everything is built for the ops teams who actually use them daily.
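Domain-organized dashboards can be provisioned from files so they stay in Git alongside the rest of the stack. A sketch using Grafana's dashboard provisioning format, with the folder layout as an assumption:

```yaml
# Grafana dashboard provisioning — illustrative paths and names.
apiVersion: 1
providers:
  - name: ops-dashboards
    type: file
    disableDeletion: true    # provisioned dashboards can't be deleted in the UI
    options:
      path: /var/lib/grafana/dashboards
      # One subdirectory per domain becomes one Grafana folder.
      foldersFromFilesStructure: true
```

With `disableDeletion` and file-based provisioning, ad-hoc edits can't silently drift away from the versioned dashboards.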
Alerting and On-Call
AlertManager handles severity-based inhibition, and alerts are routed to OpsGenie at P1/P2/P3 priority depending on whether the originating cluster is in an active or passive state.
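The routing and inhibition logic looks roughly like this AlertManager fragment; the label names, receiver names, and the exact priority mapping are assumptions for illustration:

```yaml
# AlertManager sketch — labels and priority mapping are assumptions.
route:
  receiver: opsgenie-p3
  routes:
    - matchers: [severity="critical", cluster_state="active"]
      receiver: opsgenie-p1
    - matchers: [severity="critical", cluster_state="passive"]
      receiver: opsgenie-p2
receivers:
  - name: opsgenie-p1
    opsgenie_configs:
      - priority: P1
  - name: opsgenie-p2
    opsgenie_configs:
      - priority: P2
  - name: opsgenie-p3
    opsgenie_configs:
      - priority: P3
inhibit_rules:
  # Suppress warnings for an alert that is already firing as critical.
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [alertname, cluster]
```

The inhibit rule keeps on-call noise down: once a critical page fires, the matching warning-level duplicates are muted.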
Disaster Recovery
The entire stack is recoverable. A Terraform-based DR plan provisions a new monitoring cluster, auto-connects to Fleet, and redeploys everything from Git.
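Since Fleet redeploys everything from Git once a cluster is registered, the DR plan reduces to provisioning and registering a replacement cluster. A Terraform sketch, where the resource names and provider usage are illustrative assumptions:

```hcl
# DR sketch — names and attributes are illustrative, not the actual plan.
resource "rancher2_cluster" "monitoring_dr" {
  name        = "monitoring-dr"
  description = "Replacement monitoring cluster"
}

# Wait until the new cluster is active and registered with Rancher;
# Fleet then picks it up and redeploys the vm-stack bundle from Git.
resource "rancher2_cluster_sync" "monitoring_dr" {
  cluster_id = rancher2_cluster.monitoring_dr.id
}
```

No state beyond Git is required to rebuild the stack; dashboards, alert rules, and the VictoriaMetrics deployment all converge automatically.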
Deep dive