Building a Health Dashboard for 10K+ Services
The Problem
Every environment had its own health check page. Hundreds of environments across multiple Kubernetes clusters, each with its own page showing service dependencies and their status. To check if something was healthy, you needed to know which domain, which cluster, and which specific page to open.
PRTG was used to monitor individual health check endpoints, but every endpoint had to be added manually. With dynamic Kubernetes environments where services spin up and down constantly, manual endpoint management wasn't sustainable.
There was no global view. No way to answer "what's broken right now across everything?"
The Architecture
The solution has four layers:
Discovery. A custom exporter runs per-cluster, deployed via Rancher Fleet. It discovers all health check services via the Kubernetes API across every namespace in that cluster, runs concurrent health checks against each endpoint, and pushes structured results to VictoriaLogs. Each log entry contains the cluster, namespace, service name, status, and the full dependency data.
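The check-and-push loop can be sketched roughly like this. This is a minimal sketch, not the real exporter: the field names and the injected `probe` coroutine are assumptions, and in production the service list would come from the Kubernetes API while results would be shipped to VictoriaLogs' ingestion endpoint.

```python
# Sketch of one discovery/check cycle. Hypothetical field names; the real
# exporter discovers services via the Kubernetes API and pushes results
# to VictoriaLogs.
import asyncio
import time

async def check_endpoint(probe, svc):
    """Run one health check and wrap the result as a structured log entry."""
    status, dependencies = await probe(svc["url"])
    return {
        "cluster": svc["cluster"],
        "namespace": svc["namespace"],
        "service": svc["name"],
        "status": status,              # e.g. "healthy" / "unhealthy"
        "dependencies": dependencies,  # nested dependency tree, up to ~170KB
        "_time": time.time(),
    }

async def run_cycle(services, probe):
    """Check all discovered services concurrently in one pass."""
    return await asyncio.gather(*(check_endpoint(probe, s) for s in services))
```

Running the checks concurrently is what keeps a cycle fast even with hundreds of endpoints per cluster.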
Processing. This is where it gets interesting. Each health check log entry carries up to 170KB of nested dependency data - the full tree of what that service depends on and the health of each dependency. The system generates 12M+ of these log entries per day across all clusters.
A Python cache worker queries VictoriaLogs every 30 seconds using optimized LogsQL. The query uses partition-based deduplication to keep only the latest entry per unique service, reducing thousands of raw entries to unique service states. Results get organized into four tiers of cached data in Redis:
- Tier 1 (~1KB) - Global dashboard summary stats
- Tier 2 (~10KB) - Failing and warning services only
- Tier 3 (~100KB) - Search indexes as Redis sorted sets
- Tier 4 (~1MB) - Full detail per environment and cluster
All writes to Redis are atomic using MULTI/EXEC transactions, so the frontend never sees partial state.
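A rough sketch of how the tiers could be derived and written, assuming hypothetical key names and record fields (the article doesn't show the worker's internals). With redis-py, `pipeline(transaction=True)` wraps the queued commands in MULTI/EXEC, so all tiers flip to the new cycle's data in one atomic step.

```python
# Sketch of tier construction and atomic write-out. Key names and record
# fields are assumptions for illustration.
import json

def build_tiers(states):
    """states: deduplicated list of dicts with cluster/namespace/service/status."""
    failing = [s for s in states if s["status"] != "healthy"]
    tier1 = {"total": len(states), "failing": len(failing)}  # ~1KB summary
    tier2 = failing                                          # failing/warning only
    tier3 = sorted(f'{s["cluster"]}:{s["namespace"]}:{s["service"]}'.lower()
                   for s in states)                          # search index members
    tier4 = {}                                               # full detail per cluster
    for s in states:
        tier4.setdefault(s["cluster"], []).append(s)
    return tier1, tier2, tier3, tier4

def write_tiers(r, tiers):
    """r: a redis-py client. Queue all writes, then EXEC them atomically."""
    t1, t2, t3, t4 = tiers
    pipe = r.pipeline(transaction=True)  # MULTI ... EXEC
    pipe.set("health:tier1", json.dumps(t1))
    pipe.set("health:tier2", json.dumps(t2))
    for member in t3:
        pipe.zadd("health:search", {member: 0})
    for cluster, detail in t4.items():
        pipe.set(f"health:tier4:{cluster}", json.dumps(detail))
    pipe.execute()  # frontend never observes a half-written cycle
```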
Scaling. The cache worker scales horizontally using Redis Streams and Consumer Groups. One worker wins a publisher election via a Lua-scripted lock, fetches the data, and publishes processing tasks to a stream. All workers (including the publisher) consume and process tasks in parallel. If a worker crashes, XAUTOCLAIM reassigns its orphaned tasks after 60 seconds.
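The election-plus-fan-out pattern can be approximated as below. The real lock is a Lua script per the article; this sketch stands in with `SET NX EX`, which gives the same single-winner semantics for the simple case. Key and stream names are assumptions.

```python
# Sketch of the per-cycle publisher election and task fan-out. `r` is any
# client exposing set()/xadd(); redis-py matches these signatures.
CYCLE_LOCK = "health:publisher-lock"

def try_become_publisher(r, worker_id, ttl=25):
    # SET key value NX EX ttl: exactly one worker per cycle acquires the
    # lock; the TTL releases it automatically before the next cycle.
    return bool(r.set(CYCLE_LOCK, worker_id, nx=True, ex=ttl))

def publish_tasks(r, environments):
    # One stream entry per environment. A consumer group (XREADGROUP)
    # delivers each entry to exactly one worker; XAUTOCLAIM reassigns
    # entries a crashed worker left unacknowledged.
    for env in environments:
        r.xadd("health:tasks", {"environment": env})
```

The losing workers skip publishing and go straight to consuming, so every worker (publisher included) processes tasks in parallel.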
There's also a safety mechanism: if a new data snapshot contains less than 70% of the environments from the previous cycle, the worker assumes the data is incomplete and extends the TTL on existing cache instead of overwriting with partial results.
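The guard itself is a one-liner; a sketch with the article's 70% threshold (function and argument names are illustrative):

```python
# Sketch of the partial-data guard: refuse to overwrite the cache when a
# new snapshot covers too few of the previously seen environments.
def snapshot_is_complete(new_envs, previous_envs, threshold=0.7):
    if not previous_envs:  # first cycle: nothing to compare against
        return True
    return len(new_envs) >= threshold * len(previous_envs)
```

When this returns False, the worker extends the TTL on the existing cache keys instead of writing, so stale-but-complete data wins over fresh-but-partial data.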
The Dashboard. A Next.js frontend reads from Redis. A command palette (Cmd+K) searches across all services, environments, and clusters using Redis sorted sets for sub-millisecond prefix matching. Multiple views let operations teams see the big picture or drill into individual service dependencies. Grafana dashboards are embedded with predefined variables so teams can go from "this service is unhealthy" to the actual metrics and logs without leaving the page.
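The sorted-set trick behind the prefix search: when every member has score 0, ZRANGEBYLEX returns lexicographic ranges, so a prefix query is just a range from the prefix to the prefix capped with a high byte. A sketch (key name assumed):

```python
# Sketch of the Cmd+K prefix search against a zero-score sorted set.
# redis-py signature: zrangebylex(name, min, max, start=None, num=None).
def prefix_search(r, prefix, limit=20):
    lo = f"[{prefix}"          # "[" = inclusive bound in ZRANGEBYLEX syntax
    hi = f"[{prefix}\xff"      # high byte caps the range at the prefix
    return r.zrangebylex("health:search", lo, hi, start=0, num=limit)
```

Because the range scan touches only matching members, lookups stay sub-millisecond even with tens of thousands of indexed names.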
The Numbers
- 12M+ health check logs processed per day
- 227M+ total logs stored
- 350+ namespaces monitored
- Up to 170KB per log entry
- Sub-50ms API response time
- 30-second refresh cycle
What I Learned
Log deduplication is the critical optimization. The initial approach queried VictoriaLogs without deduplication - response times were 90+ seconds. Adding `partition by (cluster, namespace, service) | limit 1` reduced query time to under 30 seconds and the result set from ~28K entries to ~8.7K.
Atomic cache writes prevent flickering. Early versions wrote keys individually. The frontend would sometimes display a mix of old and new data between keys being updated. Wrapping everything in a Redis MULTI/EXEC transaction fixed this completely.
Horizontal scaling needs careful work distribution. The first attempt at scaling workers had race conditions - multiple workers processing the same environment. Redis Streams with Consumer Groups solved this cleanly. Each task is delivered to exactly one worker, no duplicates, no lost tasks.
Protect against incomplete data. VictoriaLogs occasionally returns partial results during high load. The 70% threshold check was added after an incident where the cache was overwritten with data for only half the environments, making it look like services had disappeared.
The Impact
Operations teams use this as their single source of truth for service health. No more hunting through individual health check pages or manually managing PRTG endpoints. When something breaks, the dashboard shows it within 30 seconds.
The system also feeds into the alerting pipeline - health check logs are evaluated by vmalert with dedicated rules that fire to OpsGenie with P1/P2/P3 priority based on cluster active/passive state. A separate silence manager automatically suppresses alerts for environments that aren't live based on business state from an external API.