
Monitoring Stack

GitOps-managed VictoriaMetrics and Grafana across 70+ Kubernetes clusters

70+

Clusters

50M+

Time Series

1B+

Label Pairs

500+

Alert Rules

VictoriaMetrics, Grafana, Rancher Fleet, Helm, Kubernetes

Architecture

The Starting Point

I took over an unoptimized VictoriaMetrics setup where Helm charts were applied by hand and the Bitbucket repository was frequently out of sync with actual cluster state. Monitoring was scattered across clusters using cattle-monitoring, with no unified management.

GitOps Migration

I consolidated everything into a single vm-stack managed via Rancher Fleet from one Git repository. Per-cluster configs and label-based targeting drive automatic deployment across 70+ clusters, with zero manual Helm operations.
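To illustrate the idea behind label-based targeting, here is a minimal Python sketch of how a per-cluster config could be resolved from cluster labels. It is a toy model, not Fleet's implementation: the `Cluster` and `Target` types, the label names, and the setting keys are all hypothetical, and the first-match-wins ordering is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    labels: dict[str, str] = field(default_factory=dict)


@dataclass
class Target:
    """Hypothetical deployment target: a label selector plus config overrides."""
    name: str
    match_labels: dict[str, str]
    overrides: dict[str, str] = field(default_factory=dict)


def matches(cluster: Cluster, selector: dict[str, str]) -> bool:
    # Every key/value pair in the selector must be present on the cluster.
    return all(cluster.labels.get(k) == v for k, v in selector.items())


def resolve(cluster: Cluster, targets: list[Target], defaults: dict[str, str]) -> dict[str, str]:
    # Assumption: the first matching target wins; its overrides are layered on the defaults.
    for target in targets:
        if matches(cluster, target.match_labels):
            return {**defaults, **target.overrides}
    # No target matched: the cluster gets the shared defaults unchanged.
    return dict(defaults)


if __name__ == "__main__":
    defaults = {"retention": "30d", "replicas": "1"}  # hypothetical settings
    targets = [
        Target("prod", {"env": "prod"}, {"replicas": "2"}),
        Target("edge", {"tier": "edge"}, {"retention": "7d"}),
    ]
    prod = Cluster("prod-eu-1", {"env": "prod", "region": "eu"})
    print(resolve(prod, targets, defaults))
```

The point of the pattern is that adding a cluster means labeling it, not writing a new deployment: the shared defaults live in one place and only the differences are stated per target.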

Metrics at Scale

The VictoriaMetrics cluster runs in full cluster mode, with vminsert, vmstorage, and vmselect handling 50M+ active time series and 1B+ label-value pairs.
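In cluster mode, vminsert accepts writes and spreads series across the vmstorage nodes, while vmselect queries those nodes and merges the results. The Python sketch below shows the general idea of the write path: derive a stable key from the metric name and its labels, then hash that key to pick a node. It uses rendezvous hashing as a stand-in and is not VictoriaMetrics' actual algorithm; the node names and the example series are hypothetical.

```python
import hashlib


def series_key(name: str, labels: dict[str, str]) -> str:
    # Canonical form: metric name plus labels sorted by key, so label order does not matter.
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{pairs}}}"


def pick_node(key: str, nodes: list[str]) -> str:
    # Rendezvous (highest-random-weight) hashing: score every node against the key
    # and choose the node with the highest score.
    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}|{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(nodes, key=score)


if __name__ == "__main__":
    nodes = ["vmstorage-0", "vmstorage-1", "vmstorage-2"]  # hypothetical node names
    key = series_key("http_requests_total", {"job": "api", "cluster": "prod-eu-1"})
    print(key, "->", pick_node(key, nodes))
```

Because the key includes every label, each distinct label combination is its own series and may land on a different node, which is why label cardinality matters as much as the raw series count at this scale.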

Grafana Overhaul

I rebuilt the Grafana dashboard library from scratch. The previous state was a mess of duplicates and broken panels. The new dashboards are organized by domain, use optimized queries, and are built for the ops teams who actually use them daily.

Alerting and On-Call

Alertmanager handles severity-based inhibition, and Opsgenie routes alerts at P1, P2, or P3 priority depending on whether the cluster is active or passive.
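The two ideas here, severity-based inhibition and state-dependent priority, can be sketched in a few lines of Python. This is a simplified model rather than the real configuration: the severity names, the choice to group by alert name and cluster, the default for unknown clusters, and the whole priority table are assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed severity levels, ranked from least to most severe.
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}

# Hypothetical mapping from (severity, cluster state) to on-call priority.
PRIORITY = {
    ("critical", "active"): "P1",
    ("critical", "passive"): "P2",
    ("warning", "active"): "P2",
    ("warning", "passive"): "P3",
}


@dataclass(frozen=True)
class Alert:
    name: str
    cluster: str
    severity: str


def inhibit(alerts: list[Alert]) -> list[Alert]:
    # Find the highest severity currently firing per (alert name, cluster).
    top: dict[tuple[str, str], int] = {}
    for a in alerts:
        k = (a.name, a.cluster)
        top[k] = max(top.get(k, -1), SEVERITY_RANK[a.severity])
    # Keep only alerts at that highest severity; lower-severity ones are inhibited.
    return [a for a in alerts if SEVERITY_RANK[a.severity] == top[(a.name, a.cluster)]]


def priority(alert: Alert, cluster_state: dict[str, str]) -> str:
    # Assumption: a cluster with unknown state is treated as active, so it pages loudly.
    state = cluster_state.get(alert.cluster, "active")
    # Anything not in the table (for example, info alerts) falls back to P3.
    return PRIORITY.get((alert.severity, state), "P3")


if __name__ == "__main__":
    firing = [
        Alert("DiskFull", "c1", "warning"),
        Alert("DiskFull", "c1", "critical"),
        Alert("DiskFull", "c2", "warning"),
    ]
    states = {"c1": "active", "c2": "passive"}
    for alert in inhibit(firing):
        print(alert.name, alert.cluster, alert.severity, priority(alert, states))
```

Inhibition keeps a single incident from paging twice at different severities, and the state lookup keeps a passive cluster from waking anyone at the same urgency as an active one.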

Disaster Recovery

The entire stack is recoverable. A Terraform-based DR plan provisions a new monitoring cluster, auto-connects to Fleet, and redeploys everything from Git.
