2026-03-25-10 minvictoriametricsobservabilitykubernetes

Running VictoriaMetrics at 50M+ Time Series

Why Cluster Mode

VictoriaMetrics single-node can handle up to 100 million active time series and about 1 million samples per second ingestion. Our setup pushes ~5.7 million samples per second from 100+ Kubernetes clusters. That alone puts us well past the single-node threshold.

But the ingestion rate isn't the only reason. We need multi-tenancy (production and staging metrics isolated by tenant ID), and we need the ability to scale each component independently. When query load spikes during an incident investigation, we scale vmselect. When ingestion spikes during traffic peaks, vminsert handles it. vmstorage stays steady.

The cluster consists of:

26 vmstorage nodes - each with 28Gi RAM and 350Gi persistent disk on a dedicated nodepool
vminsert - scales from 20 to 100 replicas via HPA at 70% CPU
vmselect - scales from 20 to 80 replicas via HPA at 70% CPU

Total data stored: ~3.9TB across 14.4 trillion rows with 30-day retention.

The OOM Problem

When I took over, vmstorage nodes were restarting a few times per week. The cause: OOM kills during background merge operations.

VictoriaMetrics periodically merges smaller data parts into larger ones on disk. This is normal and necessary, but it temporarily spikes memory usage. With the default configuration, these merge spikes would push vmstorage past its memory limit and the kernel would kill it.

When a vmstorage node restarts, it takes time to recover. Other nodes pick up extra load during that window. In the worst case, this creates a cascade - one node goes down, the others get more pressure, another one goes down.

Fixing Memory Management

My first attempt was setting memory.allowedPercent to 80%, telling vmstorage to keep background operations within 80% of available memory:

extraArgs:
  memory.allowedPercent: "80"

This reduced the OOM kills but created a new problem - vmselect queries started timing out. With 80% of memory reserved for storage operations, there wasn't enough headroom for query processing. The read path was starved.

The fix required two changes together:

1. Lower the memory limit to 60%:

extraArgs:
  memory.allowedPercent: "60"

This gives vmstorage 60% for background work and leaves 40% for queries and ingestion. The merge spikes stay within bounds and reads aren't affected.

2. Move vmstorage to a dedicated nodepool:

Before, vmstorage pods shared nodes with other workloads. When a merge spike happened, it competed with everything else on the node for memory. Moving vmstorage to its own nodepool with pod anti-affinity eliminated that contention entirely. Each node runs vmstorage pods and nothing else.

After both changes: zero OOM restarts. Months of stability.

Handling Traffic Spikes

Our traffic peaks on weekends. When traffic rises, pods autoscale across multiple clusters. More pods means more targets for vmagent to scrape, more time series being generated, and a burst of new metrics hitting vminsert.

The original vmagent setup was a Deployment. When it couldn't keep up with scrape load, the team's solution was to increase replicas. The problem: each vmagent replica scraped the same targets. More replicas meant the same metrics were being sent to VictoriaMetrics multiple times. This wasted resources and didn't actually solve the scaling problem.

I converted vmagent to a StatefulSet with target sharding:

vmagent:
  spec:
    replicaCount: 1
    shardCount: 6
    statefulMode: true
    statefulStorage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 10Gi

With shardCount: 6, targets are distributed across 6 instances. Each shard scrapes roughly 1/6th of the targets instead of all of them. Scaling means adding more shards, not duplicating work.

The statefulMode: true with persistent storage adds another benefit: if VictoriaMetrics has a brief unavailability (like a vmstorage restart), vmagent buffers the data to disk instead of dropping it. When VM comes back, the buffered data gets sent. No metrics lost during short outages.

kube-state-metrics Scaling

The same sharding approach applies to kube-state-metrics. With multiple clusters, the combined /metrics payload from kube-state-metrics is massive. A single instance would hit scrape size limits.

kube-state-metrics:
  autosharding:
    enabled: true
  replicas: 4
  extraArgs:
    - --enable-gzip-encoding

Four replicas with autosharding split the metrics across instances. Adding gzip encoding reduced the transfer size significantly - this wasn't enabled before and the raw payload was unnecessarily large.

Ingestion Tuning

A few settings that prevent bad data from destabilizing the cluster:

Label limits - reject metrics with too many labels:

vminsert:
  extraArgs:
    maxLabelsPerTimeseries: "60"
    maxInsertRequestSize: "64MB"

60 labels per metric is generous for legitimate metrics but catches misconfigured exporters that would otherwise create millions of unique series. The top cardinality offenders in our cluster are nginxplus response codes at 1.6M series and apiserver SLI duration buckets at 1.5M.

Query timeouts - prevent runaway queries from consuming vmselect:

vmselect:
  extraArgs:
    search.maxQueryDuration: "30s"
    search.maxLabelsAPIDuration: "10s"

If a query can't complete in 30 seconds, it needs to be rewritten, not given more time.

Everything in Git

The entire cluster configuration lives in a single Git repository managed by Rancher Fleet. VictoriaMetrics, vmagent, vmselect, vminsert, Grafana, AlertManager, vmalert, and all custom exporters - one repo, one source of truth.

Changes go through PR review. Fleet detects the merge and deploys automatically. No SSH into nodes, no manual Helm upgrades, no "I forgot to apply this to the monitoring cluster."

What I'd Do Differently

I'd start with the dedicated nodepool and 60% memory limit from day one instead of discovering these through production incidents. The defaults work fine for small clusters, but at this scale, tuning isn't optional.

I'd also set up vmagent sharding earlier. The "just add more replicas" approach wastes resources and masks the real problem.