Blog
Lessons from building and operating infrastructure at scale.
Running VictoriaMetrics at 50M+ Time Series
What actually breaks when you run VictoriaMetrics at scale, and the specific tweaks that stabilized a 26-node cluster handling 50M+ time series across 100+ Kubernetes clusters.
Building a Health Platform
Part 1: Auto-Discovery for Dynamic Environments
Why blackbox exporter doesn't scale for dynamic environments, and how I built auto-discovery health monitoring using the Kubernetes API and structured logs.
Part 2: Caching 12M+ Logs in Real-Time
LogsQL deduplication, a 4-tier Redis cache, and distributed workers with Redis Streams to serve sub-50ms responses from 12M+ daily health check logs.
Part 3: Designing Views for Hundreds of Environments
How I structured a health monitoring dashboard with four drill-down levels, cascading filters, cross-environment service views, and embedded Grafana panels.
Part 4: From Logs to Grouped Alerts
Replacing per-service alert noise with notifications grouped by environment, automatic priority based on cluster state, and silence automation for non-live environments.
Turning Prometheus Label Values Into Metrics You Can Alert On
PromQL can't convert label values to metric values. I built a YAML-driven exporter that bridges this gap, with hot-reload and stale metric cleanup for dynamic clusters.
Deploying Alert Rules at Scale with Fleet and Jenkins
Unlike ArgoCD, Fleet doesn't have pre-sync hooks. Here's how I built a Jenkins pipeline that transforms custom alert templates into vmalert rules, validates them with a dry run, and deploys them across 100+ clusters via GitOps.