Building a Health Platform Part 2: Caching 12M+ Logs in Real-Time
The Series
This is part 2 of a 4-part series on building a health monitoring platform:
- Part 1: Architecture and Data Pipeline
- Part 2: Caching 12M+ Logs for a Real-Time Dashboard (this post)
- Part 3: Designing Views for Hundreds of Environments
- Part 4: From Health Logs to Grouped OpsGenie Alerts
The Data Challenge
The health exporter described in Part 1 pushes 12M+ structured log entries per day to VictoriaLogs. Each entry can be up to 170KB. The dashboard needs to show current health status for every environment, every cluster, every service, with sub-50ms response times.
Querying VictoriaLogs on every page load isn't practical. Even with LogsQL, scanning millions of entries per request means multi-second response times. The data needs to be pre-processed and cached.
The Deduplication Query
The core of the cache worker is a single LogsQL query that runs every 30 seconds:
_time:5m | sort by (_time) desc limit 1 partition by (cluster, namespace, service)
This does the heavy lifting server-side. For each unique combination of cluster, namespace, and service, it returns only the most recent log entry from the last 5 minutes. Thousands of raw entries get reduced to one per service.
The partition by clause is what makes this work. Before I found this pattern, the query pulled all entries for the time window and the worker deduplicated them in Python. Query times were 90+ seconds. With server-side partitioning, they dropped to under 30 seconds.
The 5-minute window is wider than the 30-second cycle on purpose. It handles brief gaps in data. If a healthchecker is slow to respond or the exporter hits a timeout, the previous entry still shows up in the window instead of the service appearing as "unknown."
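As a rough illustration, here is a minimal Python sketch of how a worker might send that query to the VictoriaLogs HTTP query endpoint (/select/logsql/query, which streams one JSON object per line), next to the kind of client-side deduplication that partition by replaces. The function names, the base URL argument, and the status field in the sample are assumptions made for the example, not the actual worker code.

```python
import json
import urllib.parse
import urllib.request

QUERY = "_time:5m | sort by (_time) desc limit 1 partition by (cluster, namespace, service)"


def parse_json_lines(body: str) -> list[dict]:
    """Parse a response made of one JSON object per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_latest_entries(base_url: str, timeout: float = 60.0) -> list[dict]:
    """POST the deduplication query and return the parsed entries."""
    data = urllib.parse.urlencode({"query": QUERY}).encode()
    url = f"{base_url}/select/logsql/query"
    with urllib.request.urlopen(url, data=data, timeout=timeout) as resp:
        return parse_json_lines(resp.read().decode("utf-8"))


def latest_per_service(entries: list[dict]) -> list[dict]:
    """The old client-side approach: keep the newest entry per service."""
    newest: dict[tuple, dict] = {}
    for entry in entries:
        key = (entry["cluster"], entry["namespace"], entry["service"])
        # Timestamps in one ISO 8601 UTC format compare correctly as strings.
        if key not in newest or entry["_time"] > newest[key]["_time"]:
            newest[key] = entry
    return list(newest.values())
```

With the partitioned query, latest_per_service is no longer needed: the server already returns one row per (cluster, namespace, service).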
The 4-Tier Cache
The worker processes query results into four tiers of cached data in Redis. Each tier is optimized for a different dashboard use case:
Tier 1: Global Summary (~1KB)
{
"totalEnvironments": 347,
"totalClusters": 12,
"totalServices": 4892,
"healthyPercent": 98.2,
"warningPercent": 1.1,
"unhealthyPercent": 0.7,
"lastUpdate": "2026-03-30T10:15:00Z"
}
One key, read on every page load. Powers the top-level health indicator and environment counts.
Tier 2: Failing Services (~10KB)
Only services in warning or unhealthy state, pre-filtered. The alerts view reads just this tier instead of scanning all 4,800+ services to find the ones that are broken.
Tier 3: Search Indexes (~100KB)
Redis sorted sets for instant prefix matching. When someone types "pay" in the search bar, a ZRANGEBYLEX query returns all matching services, environments, and clusters in sub-millisecond time. No full-text search engine needed.
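The prefix lookup relies on a property of sorted sets: when every member has the same score, Redis orders members lexicographically, so a prefix becomes a range. Below is a small sketch of that range construction using the zrangebylex call from redis-py. The key name in the usage note and the result limit are assumptions for the example.

```python
def prefix_bounds(prefix: str) -> tuple[bytes, bytes]:
    """Inclusive lexicographic range covering every member that starts with prefix."""
    raw = prefix.encode("utf-8")
    # "[" marks an inclusive bound; 0xff sorts after any byte found in UTF-8 text.
    return b"[" + raw, b"[" + raw + b"\xff"


def search_prefix(client, key: str, prefix: str, limit: int = 20) -> list:
    """Return up to `limit` members of the sorted set `key` starting with prefix.

    `client` is expected to behave like a redis-py client. All members of the
    set must share the same score (for example 0) for lexicographic ranges to work.
    """
    low, high = prefix_bounds(prefix)
    return client.zrangebylex(key, low, high, start=0, num=limit)
```

Typing "pay" would then translate to something like search_prefix(client, "idx:services", "pay"), where idx:services is a hypothetical key name.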
Tier 4: Full Detail (~1MB)
Complete data per environment and per cluster. Every service's full dependency tree, pod status, resource usage. This is what the drill-down views read when someone clicks into a specific environment.
Atomic Writes
All cache updates happen inside Redis MULTI/EXEC transactions. The dashboard never sees a half-updated state where Tier 1 says "347 environments" but Tier 4 only has data for 300.
Before I added atomic writes, the dashboard would occasionally flicker. A service would appear healthy for a split second before the status caught up during a refresh. It was cosmetic, but it eroded trust. Users would see a brief flash of "all healthy" during an incident and wonder if they imagined the alert.
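A minimal sketch of the write path, assuming a redis-py style client. In redis-py, pipeline(transaction=True) wraps the queued commands in MULTI/EXEC, so every tier is replaced in one step. The key names and the TTL value are assumptions for the example.

```python
import json


def write_tiers_atomically(client, tiers: dict[str, dict], ttl_seconds: int = 120) -> None:
    """Replace every cache tier in a single MULTI/EXEC transaction."""
    pipe = client.pipeline(transaction=True)
    for key, value in tiers.items():
        # Commands are only queued here; nothing is visible to readers yet.
        pipe.set(key, json.dumps(value), ex=ttl_seconds)
    # EXEC applies all queued writes as one unit.
    pipe.execute()
```

A call might look like write_tiers_atomically(client, {"health:summary": summary, "health:failing": failing}), with both key names being hypothetical.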
Scaling Horizontally
A single cache worker handled the processing at first. But as the number of environments grew, one worker couldn't complete a full refresh cycle within 30 seconds. The VictoriaLogs query returned fast enough, but processing thousands of 170KB entries (parsing JSON, calculating health status, building sorted sets) took longer than the cycle time.
I added horizontal scaling using Redis Streams and Consumer Groups:
- One worker wins a publisher election via a Lua-scripted Redis lock with a short TTL
- The publisher fetches data from VictoriaLogs and publishes processing tasks to a Redis Stream, one task per cluster or namespace batch
- All workers (including the publisher) consume tasks from the stream via Consumer Groups
- Redis distributes each task to exactly one consumer, no duplicates, no coordination needed
- Workers acknowledge tasks after processing
- If a worker crashes mid-task, the unacknowledged task stays in the group's pending list, and another consumer claims it (for example with XAUTOCLAIM) once it has been idle for 60 seconds
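The consumer side of the steps above can be sketched roughly as follows, assuming a redis-py style client and a consumer group that already exists (created once with XGROUP CREATE). The stream, group, and consumer names, the batch field, and the handler callback are all hypothetical. The 60-second idle time mirrors the reassignment delay described in the list.

```python
def _process(client, stream: str, group: str, messages, handler) -> int:
    """Run the handler on each message and acknowledge it afterwards."""
    processed = 0
    for message_id, fields in messages:
        if fields is None:
            continue  # entry was deleted from the stream; nothing to process
        handler(fields)
        # Acknowledge only after successful processing, so a crash leaves it pending.
        client.xack(stream, group, message_id)
        processed += 1
    return processed


def consume_once(client, stream, group, consumer, handler, count=10, block_ms=5000) -> int:
    """Read new tasks for this consumer (">" means never-delivered entries)."""
    response = client.xreadgroup(group, consumer, {stream: ">"}, count=count, block=block_ms)
    total = 0
    for _stream_name, messages in response or []:
        total += _process(client, stream, group, messages, handler)
    return total


def reclaim_stale(client, stream, group, consumer, handler, min_idle_ms=60_000, count=10) -> int:
    """Take over tasks another consumer left unacknowledged for too long."""
    result = client.xautoclaim(
        stream, group, consumer, min_idle_time=min_idle_ms, start_id="0-0", count=count
    )
    # redis-py returns the next cursor first and the claimed messages second.
    return _process(client, stream, group, result[1], handler)
```

A worker loop would typically call reclaim_stale before consume_once on each pass, so abandoned tasks are picked up ahead of new ones.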
The lock election itself is simple: SET publisher_lock <worker_id> NX EX 45. If your SET succeeds, you're the publisher for this cycle. If not, someone else already is. The 45-second expiry means that if the publisher crashes, another worker takes over once the lock expires, at most 45 seconds after it was last set.
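In Python, the election might look like the sketch below, assuming a redis-py style client. The acquisition is the SET ... NX EX command quoted above. The release script is an assumption about where Lua would typically come in: a compare-and-delete, so a worker can only remove a lock it still owns. The function names are made up for the example.

```python
# Delete the lock only if it still holds this worker's id (compare-and-delete).
RELEASE_SCRIPT = """
if redis.call('GET', KEYS[1]) == ARGV[1] then
    return redis.call('DEL', KEYS[1])
end
return 0
"""


def try_become_publisher(client, worker_id: str, lock_key: str = "publisher_lock",
                         ttl_seconds: int = 45) -> bool:
    """Attempt to win the election for this cycle.

    SET with nx=True only succeeds when the key does not exist; redis-py
    returns True on success and None otherwise.
    """
    return bool(client.set(lock_key, worker_id, nx=True, ex=ttl_seconds))


def release_if_owner(client, worker_id: str, lock_key: str = "publisher_lock") -> bool:
    """Release the lock early, but only if this worker still holds it."""
    return client.eval(RELEASE_SCRIPT, 1, lock_key, worker_id) == 1
```

Doing the check and the delete inside one script matters because a plain GET followed by DEL could remove a lock that expired and was re-acquired by another worker in between.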
Data Integrity
A scenario I hit early on: VictoriaLogs had a brief hiccup and returned partial data. The cache worker saw 200 environments instead of the usual 340 and overwrote the cache. For about 30 seconds, the dashboard showed 140 environments as "missing" before the next cycle corrected it.
The fix is a 70% threshold. If a new data snapshot contains less than 70% of the environments from the previous cycle, the worker assumes the upstream data is incomplete. Instead of overwriting, it extends the TTL on existing cache data and logs a warning. The dashboard keeps showing the previous state until VictoriaLogs recovers.
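The check itself is a one-line comparison. A minimal sketch, with a hypothetical function name and integer arithmetic to sidestep floating-point edge cases at the boundary:

```python
def snapshot_is_complete(new_env_count: int, previous_env_count: int,
                         threshold_percent: int = 70) -> bool:
    """Return False when the new snapshot looks like partial upstream data."""
    if previous_env_count == 0:
        return True  # first run: nothing to compare against
    # Equivalent to new / previous >= threshold_percent / 100, without floats.
    return new_env_count * 100 >= previous_env_count * threshold_percent
```

In the incident above, 200 of 340 environments is about 59%, so snapshot_is_complete(200, 340) returns False and the worker would keep the previous cache instead of overwriting it.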
There's also orphan cleanup. When an environment genuinely gets removed (not a data hiccup), the stale cache entry needs to go. A background process compares current data against cached keys and removes entries that haven't been refreshed in 3 consecutive cycles. This prevents ghost environments from lingering in the dashboard.
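One way to track "3 consecutive cycles" is a per-key miss counter that resets whenever the key shows up again. The sketch below shows only that bookkeeping, with hypothetical names. In the real worker the counters would need to live somewhere that survives restarts.

```python
def find_orphans(missed_cycles: dict[str, int], cached_keys: set[str],
                 current_keys: set[str], max_missed: int = 3) -> set[str]:
    """Update miss counters in place and return the cached keys to delete."""
    orphans: set[str] = set()
    for key in cached_keys:
        if key in current_keys:
            # Key was refreshed this cycle: reset its counter.
            missed_cycles.pop(key, None)
            continue
        missed_cycles[key] = missed_cycles.get(key, 0) + 1
        if missed_cycles[key] >= max_missed:
            orphans.add(key)
    for key in orphans:
        del missed_cycles[key]  # stop tracking keys that are about to be removed
    return orphans
```

Called once per refresh cycle, this returns an empty set until a key has been absent three times in a row, which keeps a single bad cycle from evicting a live environment.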
The Result
VictoriaLogs query: under 30 seconds for millions of entries, deduplicated server-side.
Dashboard API response: under 50ms for any view, from global summary to service-level detail.
The cache worker runs without manual intervention. Workers restart, elections happen, tasks get redistributed. The dashboard stays current.
What's Next
- Part 3: Designing Views for Hundreds of Environments covers how the dashboard structures these cache tiers into views for different teams.
- Part 4: From Health Logs to Grouped OpsGenie Alerts covers the alerting pipeline built on the same health data.
The dashboard powered by this caching layer has a live demo with anonymized data. Every view in the demo reads from the cache tiers described above.