2026-03-25 · 8 min · observability, victoriametrics, prometheus

Why VictoriaMetrics Over Prometheus

The Short Answer

If you're running a small cluster, Prometheus is fine. If you're managing monitoring across dozens of clusters with millions of time series, VictoriaMetrics is worth the switch. I run both - here's what I've learned.

My Setup

I manage the observability stack across 70+ Kubernetes clusters. The VictoriaMetrics cluster handles 50M+ active time series with 1B+ unique label/value pairs. Before I took over, the setup was an unoptimized single-node VictoriaMetrics instance with manually applied Helm charts. I redesigned it into a proper cluster mode deployment managed via GitOps.

Why VictoriaMetrics

Cluster mode scales horizontally. VictoriaMetrics splits into vminsert (ingestion), vmstorage (persistence), and vmselect (querying). Each component scales independently. When ingestion spikes during high traffic, I scale vminsert without touching storage or query capacity.
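As a rough sketch, independent scaling comes down to separate replica counts per component. The key names below follow the victoria-metrics-cluster Helm chart, and the numbers are illustrative, not a recommendation:

```yaml
# Hypothetical values.yaml excerpt for the victoria-metrics-cluster chart.
vminsert:
  replicaCount: 6        # stateless; scale up during ingestion spikes
vmstorage:
  replicaCount: 4        # stateful; data is sharded across nodes, scale carefully
  persistentVolume:
    size: 500Gi
vmselect:
  replicaCount: 3        # query fan-out layer; scale for dashboard/alert load
```

Because vminsert and vmselect are stateless, bumping their replica counts is a cheap, reversible operation; vmstorage changes involve resharding and deserve more planning.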

MetricsQL is a superset of PromQL. Everything you know from PromQL works, plus useful extensions: keep_metric_names preserves metric names through functions, label_set and label_del manipulate labels inline, and the rollup_* functions give finer control over range calculations. Many of the 500+ alert rules I manage lean on these extensions.
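For illustration, here's what those MetricsQL-only constructs look like (metric and label names are placeholders):

```promql
# keep_metric_names: rate() normally drops the metric name; this keeps it
rate(node_network_receive_bytes_total[5m]) keep_metric_names

# label_set / label_del: rewrite labels inline, no relabeling config needed
label_set(up, "team", "platform")
label_del(node_cpu_seconds_total, "mode")

# rollup_rate: min/avg/max rollups of the per-second rate in a single query
rollup_rate(http_requests_total[5m])
```

None of these parse under vanilla PromQL, which is worth remembering if a rule might ever need to run against a plain Prometheus instance.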

Resource efficiency matters at scale. With 50M+ time series, every optimization counts. VictoriaMetrics compresses data more efficiently than Prometheus, uses less memory per series, and handles high cardinality better. I've seen metrics like nginxplus_upstream_server_responses_codes generate 1.6M series alone - VictoriaMetrics handles this without the OOM issues that would plague a single Prometheus instance.
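One way to spot offenders like that is a cardinality query against vmselect. This is an illustrative (and fairly heavy) query, so keep the lookback short:

```promql
# Top 10 metric names by active series count
topk(10, count by (__name__) ({__name__!=""}))
```

VictoriaMetrics also exposes a TSDB status endpoint with per-metric cardinality stats, which is cheaper than running this ad hoc.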

Multi-tenant support is built in. Each cluster's vmagent pushes to a tenant endpoint. Data isolation without running separate Prometheus instances per cluster.
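Tenant routing is encoded directly in the write path: vminsert takes the tenant ID from the URL. A sketch of the corresponding vmagent invocation, with a made-up tenant ID of 42:

```shell
# vmagent pushes this cluster's metrics into tenant 42 of the central VM cluster
vmagent \
  -promscrape.config=/etc/vmagent/scrape.yaml \
  -remoteWrite.url=http://vminsert:8480/insert/42/prometheus/api/v1/write
```

Queries are scoped the same way on the vmselect side, so tenants can't see each other's data without any extra access-control machinery.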

Where Prometheus Still Wins

Ecosystem compatibility. Every exporter, every dashboard, every tutorial assumes Prometheus. VictoriaMetrics is compatible, but you'll occasionally hit edge cases where a Grafana dashboard or an alert rule behaves slightly differently.

Simplicity for small setups. If you have 3 clusters and a few thousand time series, Prometheus with remote write to Thanos or Cortex is well-documented and battle-tested. VictoriaMetrics cluster mode is overkill at that scale.

Alertmanager and recording rules. I still use Prometheus Alertmanager - vmalert evaluates rules against vmselect and fires to Alertmanager. It works, but it's an extra component. With Prometheus, alerting is built in.
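The extra moving part looks roughly like this (endpoints and paths are placeholders for my environment):

```shell
# vmalert evaluates rules against vmselect and forwards firing alerts to Alertmanager
vmalert \
  -rule=/etc/vmalert/rules/*.yaml \
  -datasource.url=http://vmselect:8481/select/0/prometheus \
  -notifier.url=http://alertmanager:9093 \
  -remoteWrite.url=http://vminsert:8480/insert/0/prometheus
```

The remoteWrite flag is what makes recording rules work: vmalert writes their results back through vminsert so they're queryable like any other series.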

The Migration

I didn't replace Prometheus everywhere. vmagent (VictoriaMetrics' scraper) runs on every cluster and pushes to the central VM cluster. Some clusters still run Prometheus for local scraping before remote-writing to VM. The key decision was making VictoriaMetrics the central storage and query layer, not replacing every Prometheus instance.

The whole stack is managed via Rancher Fleet from a single Git repository. Per-cluster configs handle differences in scrape targets, retention, and resource allocation. Zero manual Helm operations.
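Per-cluster differences live in fleet.yaml via Fleet's targetCustomizations. A minimal sketch, with invented cluster labels and values:

```yaml
# fleet.yaml - one bundle, per-cluster overrides selected by cluster labels
defaultNamespace: monitoring
helm:
  chart: victoria-metrics-agent
  values:
    remoteWrite:
      - url: http://vminsert.central:8480/insert/0/prometheus/api/v1/write
targetCustomizations:
  - name: high-traffic-clusters
    clusterSelector:
      matchLabels:
        tier: high-traffic
    helm:
      values:
        resources:
          limits:
            memory: 2Gi
```

Fleet matches each downstream cluster against the selectors and merges the override values, so one Git commit rolls out everywhere with the right per-cluster settings.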

What I'd Do Differently

I'd have started with cluster mode from day one instead of a single-node VM instance that needed to be migrated later. The initial "just install it" approach created technical debt that took months to clean up - data migration, config restructuring, and rewriting alert rules to work with the cluster setup.

Bottom Line

VictoriaMetrics is the better choice when you need a centralized metrics backend at scale. Prometheus is the better choice when you want simplicity and don't need to aggregate across many clusters. For my setup - 70+ clusters, 50M+ time series, 500+ alert rules - VictoriaMetrics was the right call.