Status Dashboard
Built a health monitoring dashboard that replaced manual endpoint management across hundreds of environments and 70+ clusters.
12M+
Logs/Day
227M+
Total Logs
350+
Namespaces
170KB
Max Log Entry Size
<50ms
Response Time
30s
Refresh Cycle
Architecture
The Problem
Hundreds of environments, each with its own health check page. No global view. To check whether something was healthy, you needed to know which domain, which cluster, and which page to open. PRTG monitored individual endpoints, but every one had to be added by hand.
Service Discovery
Built a per-cluster service that automatically discovers every application via the Kubernetes API and checks their dependencies continuously. New services get picked up without any manual setup. Generates 12M+ structured health check logs per day, each carrying up to 170KB of dependency data.
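As a rough illustration of the discovery step, the sketch below turns a Kubernetes ServiceList response (the JSON returned when listing Services through the API) into a list of health check targets. It is not the production code: the `HealthTarget` type, the `discover_targets` function, the `/health` path, and the choice of the first declared port are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthTarget:
    """One endpoint to probe (hypothetical type, for illustration only)."""
    namespace: str
    service: str
    url: str


def discover_targets(service_list: dict, health_path: str = "/health") -> list[HealthTarget]:
    """Turn a Kubernetes ServiceList payload into health check targets.

    Assumes each Service answers health checks on its first declared port
    at `health_path`; a real deployment may use labels or annotations instead.
    """
    targets = []
    for item in service_list.get("items", []):
        meta = item.get("metadata", {})
        ports = item.get("spec", {}).get("ports") or []
        if not ports:
            continue  # nothing to probe if the Service declares no ports
        port = ports[0]["port"]
        # Standard in-cluster DNS name, assuming the default cluster domain.
        host = f"{meta['name']}.{meta['namespace']}.svc.cluster.local"
        targets.append(
            HealthTarget(meta["namespace"], meta["name"], f"http://{host}:{port}{health_path}")
        )
    return targets
```

Because the target list is rebuilt from the API response on every pass, a newly deployed Service shows up in the next cycle without anyone registering it.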
Data Processing
Built a Python cache worker that queries and deduplicates millions of log entries every 30 seconds, organizing results into multiple cache tiers in Redis. Scales horizontally with multiple workers splitting the load via Redis Streams. All cache writes are atomic so the dashboard never shows partial state. Sub-50ms API responses from 227M+ stored logs.
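The deduplicate-then-write cycle could look roughly like the sketch below. It assumes each log entry is a dict with `cluster`, `namespace`, `service`, and a comparable `timestamp`, and that `client` is a redis-py style client; the field names, the `health:...` key layout, and the 90-second TTL are hypothetical. Only one cache tier is shown, and atomicity here comes from a single MULTI/EXEC transaction, which may differ from what the real worker does.

```python
import json


def latest_per_service(entries):
    """Keep only the newest entry for each (cluster, namespace, service)."""
    latest = {}
    for entry in entries:
        key = (entry["cluster"], entry["namespace"], entry["service"])
        current = latest.get(key)
        if current is None or entry["timestamp"] > current["timestamp"]:
            latest[key] = entry
    return latest


def write_snapshot(client, latest, ttl_seconds=90):
    """Queue every cache write and commit them in one MULTI/EXEC transaction.

    Readers see either the previous snapshot or the new one, never a mix.
    The TTL (assumed: three 30-second cycles) lets stale keys expire if the
    worker stops.
    """
    pipe = client.pipeline(transaction=True)
    for (cluster, namespace, service), entry in latest.items():
        # Hypothetical key layout: health:<cluster>:<namespace>:<service>
        key = f"health:{cluster}:{namespace}:{service}"
        pipe.set(key, json.dumps(entry), ex=ttl_seconds)
    return pipe.execute()
```

With redis-py, a worker would call `write_snapshot(redis.Redis(), latest_per_service(batch))` once per refresh cycle, where `batch` is whatever the log query returned for that window.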
The Dashboard
Built a Next.js frontend where one search finds any service, environment, or cluster instantly. Multiple views let teams see the big picture or drill into individual dependencies. Embedded Grafana dashboards give direct access to metrics and logs without leaving the page. Used daily by operations teams as the single source of truth for service health.
Deep dive