Monitoring Stack

Took over an unstable VictoriaMetrics cluster and turned it into a GitOps-managed monitoring stack serving 50M+ time series across 100+ Kubernetes clusters.

50M+ Time Series

5.7M/s Ingestion

26 Storage Nodes

500+ Alert Rules

VictoriaMetrics, Grafana, Rancher Fleet, Helm, Kubernetes

Architecture

What I Inherited

An unoptimized VictoriaMetrics cluster with vmstorage nodes restarting multiple times per week. The cause: OOM kills during background merge operations that periodically spike memory usage. When a vmstorage node restarts, other nodes pick up extra load, creating a cascade risk. Helm charts were manually applied and frequently out of sync with actual cluster state.

Stabilizing vmstorage

The fix required two changes together. First, I limited vmstorage background operations to 60% of available memory, leaving 40% headroom for queries and ingestion; at 80%, the read path was starved and vmselect queries started timing out. Second, I moved vmstorage to a dedicated node pool with pod anti-affinity so that merge spikes don't compete with other workloads for memory. After both changes: zero OOM restarts for months.
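To make the 60/40 split concrete, here is a minimal arithmetic sketch. It assumes the 28Gi per-node RAM figure from The Numbers is the "available memory" the percentage applies to; the function name is hypothetical and nothing here is a VictoriaMetrics API.

```python
def memory_budget(total_gib: float, background_fraction: float) -> tuple[float, float]:
    """Split node memory into a background-operations cap and the remaining headroom."""
    if not 0 < background_fraction < 1:
        raise ValueError("background_fraction must be between 0 and 1")
    background = total_gib * background_fraction
    return background, total_gib - background


# Compare the chosen 60% cap with the 80% setting that starved the read path.
# 28 GiB per node is taken from the text; treating all of it as "available" is an assumption.
for fraction in (0.6, 0.8):
    background, headroom = memory_budget(28, fraction)
    print(f"{fraction:.0%} cap: {background:.1f} GiB background, {headroom:.1f} GiB headroom")
```

Under that assumption, the 60% cap leaves roughly twice as much memory for queries and ingestion as the 80% setting did.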

Fixing vmagent

The team's approach to handling scrape load was adding more vmagent replicas. The problem: each replica scraped all targets, so more replicas meant the same metrics being sent to VictoriaMetrics multiple times. I converted vmagent to a StatefulSet with target sharding (6 shards, each scraping 1/6th of targets). I also added persistent storage so that when VictoriaMetrics is briefly unavailable, vmagent buffers to disk instead of dropping data.
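The difference between replicas and shards can be sketched as follows. This illustrates the general idea of hash-based target sharding only; it is not vmagent's actual hashing scheme, and the target addresses and function names are made up for the example.

```python
import hashlib

SHARD_COUNT = 6  # from the text: 6 shards, each scraping 1/6th of targets


def shard_for(target: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a scrape target to exactly one shard using a stable hash."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def targets_for_shard(targets: list[str], shard: int,
                      shard_count: int = SHARD_COUNT) -> list[str]:
    """Return only the targets this shard is responsible for scraping."""
    return [t for t in targets if shard_for(t, shard_count) == shard]


# Hypothetical target addresses, for illustration only.
targets = [f"10.0.{i // 256}.{i % 256}:9100" for i in range(600)]
assignments = [targets_for_shard(targets, s) for s in range(SHARD_COUNT)]

# Each target lands on exactly one shard, so nothing is scraped twice.
print([len(a) for a in assignments], sum(len(a) for a in assignments))
```

With plain replicas, every instance would receive the full `targets` list; with sharding, the per-shard lists are disjoint and together cover every target once.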

Ingestion and Cardinality

Added a 60-label limit per metric to catch misconfigured exporters that would create millions of unique series. Set the vminsert max request size to 64MB. Set query timeouts to 30s for vmselect and 10s for the labels API. Enabled kube-state-metrics autosharding with gzip encoding; previously this was off and the raw payload was unnecessarily large. Top cardinality offenders: response codes at 1.6M series and apiserver SLI buckets at 1.5M.
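The label limit and the cardinality ranking can be sketched as below. This is a toy model over in-memory label sets, not how vminsert enforces the limit; whether `__name__` counts toward the limit is an assumption here, and the sample series are invented.

```python
from collections import Counter

MAX_LABELS = 60  # limit from the text

Series = dict[str, str]


def over_label_limit(series: list[Series], limit: int = MAX_LABELS) -> list[Series]:
    """Return the series whose label count exceeds the limit.

    Assumption: the metric name label (__name__) is counted like any other label.
    """
    return [s for s in series if len(s) > limit]


def series_per_metric(series: list[Series]) -> list[tuple[str, int]]:
    """Count unique series per metric name, highest cardinality first."""
    unique = {frozenset(s.items()) for s in series}
    counts = Counter(dict(labels).get("__name__", "") for labels in unique)
    return counts.most_common()


# Invented sample data: two ordinary series and one from a misbehaving exporter.
ok_200 = {"__name__": "http_requests_total", "code": "200", "job": "api"}
ok_500 = {"__name__": "http_requests_total", "code": "500", "job": "api"}
bloated = {"__name__": "bad_exporter_metric", **{f"label_{i}": "x" for i in range(60)}}
sample = [ok_200, ok_500, bloated]

print(len(over_label_limit(sample)), series_per_metric(sample))
```

The same per-metric count, run against real series data, is what surfaces offenders like the response-code and apiserver SLI bucket metrics mentioned above.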

GitOps via Rancher Fleet

Consolidated the entire stack into a single Git repository managed by Rancher Fleet. VictoriaMetrics, Grafana, Alertmanager, vmalert, and custom exporters all live there, with per-cluster configs and label-based targeting. Changes go through PR review, then Fleet deploys automatically. No SSH, no manual Helm upgrades.
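Label-based targeting boils down to matching a selector against each cluster's labels. The sketch below shows that idea with equality-only matching; the cluster names, labels, and function names are hypothetical, and Fleet's real selectors are defined in its own configuration, which is not reproduced here.

```python
Labels = dict[str, str]


def matches(cluster_labels: Labels, selector: Labels) -> bool:
    """A cluster matches when it carries every key/value pair in the selector."""
    return all(cluster_labels.get(key) == value for key, value in selector.items())


def select_clusters(clusters: dict[str, Labels], selector: Labels) -> list[str]:
    """Return the names of all clusters the selector targets, sorted for stable output."""
    return sorted(name for name, labels in clusters.items() if matches(labels, selector))


# Hypothetical clusters and labels, for illustration only.
clusters = {
    "prod-eu-1": {"env": "prod", "region": "eu"},
    "prod-us-1": {"env": "prod", "region": "us"},
    "dev-eu-1": {"env": "dev", "region": "eu"},
}

print(select_clusters(clusters, {"env": "prod"}))
print(select_clusters(clusters, {"env": "prod", "region": "eu"}))
```

Adding a key to the selector narrows the target set, which is what lets one repository carry both fleet-wide defaults and per-cluster overrides.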

The Numbers

26 vmstorage nodes with 28Gi RAM and 350Gi disk each. vminsert scales from 20 to 100 replicas via HPA. vmselect scales from 20 to 80. Total data: ~3.9TB across 14.4 trillion rows with 30-day retention. Ingestion rate: 5.7 million samples per second.
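These figures can be cross-checked against each other with simple arithmetic. The sketch below uses only numbers stated above; treating "~3.9TB" as decimal terabytes and assuming a constant ingestion rate over the retention window are both simplifications.

```python
SECONDS_PER_DAY = 86_400

# Figures from the text.
ingestion_per_second = 5.7e6
retention_days = 30
reported_rows = 14.4e12
reported_bytes = 3.9e12  # assumption: "~3.9TB" read as decimal terabytes
nodes = 26
ram_gib_per_node = 28
disk_gib_per_node = 350

# Rows expected if the ingestion rate held steady for the whole retention window.
expected_rows = ingestion_per_second * SECONDS_PER_DAY * retention_days
# Average on-disk size per stored sample.
bytes_per_row = reported_bytes / reported_rows
# Aggregate vmstorage capacity.
total_ram_gib = nodes * ram_gib_per_node
total_disk_gib = nodes * disk_gib_per_node

print(f"expected rows over retention: {expected_rows / 1e12:.2f} trillion")
print(f"average bytes per row: {bytes_per_row:.2f}")
print(f"vmstorage totals: {total_ram_gib} GiB RAM, {total_disk_gib} GiB disk")
```

A steady 5.7M samples/s over 30 days works out to about 14.8 trillion rows, which is consistent with the reported 14.4 trillion, and the storage figure implies roughly 0.27 bytes per row on disk.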

Disaster Recovery

The entire stack is recoverable from Git. A Terraform-based DR plan provisions a new monitoring cluster, connects to Fleet, and redeploys everything automatically.

Deep dive

Running VictoriaMetrics at 50M+ Time Series

Deploying Alert Rules at Scale with Fleet and Jenkins