Building a Health Platform Part 1: Auto-Discovery for Dynamic Environments
The Series
This is part 1 of a 4-part series on building a health monitoring platform:
- Part 1: Architecture and Data Pipeline (this post)
- Part 2: Caching 12M+ Logs for a Real-Time Dashboard
- Part 3: Designing Views for Hundreds of Environments
- Part 4: From Health Logs to Grouped OpsGenie Alerts
The Problem
When you run hundreds of environments across 100+ Kubernetes clusters, monitoring health is not a configuration problem. It's a discovery problem.
Every time a team deploys a new environment, someone has to add the monitoring endpoints. With PRTG (what we had before), this was manual: someone would create a sensor, point it at the health endpoint, and set thresholds. With Prometheus Blackbox Exporter, it's the same pattern dressed up as code: write a ServiceMonitor or a Probe config, add the target URL, apply it. When that environment scales or moves clusters, you update the config again.
Neither approach works when environments are dynamic. Teams spin up environments, scale them, tear them down. You can't maintain a static list of endpoints when the list changes faster than anyone can update it.
The company needed a global health view: one place to see the status of every environment across every cluster. Not per-team dashboards, not individual healthchecker pages scattered across domains. A single status page that operations, management, and on-call engineers could all use. That was the business requirement that drove this project.
I needed health monitoring that discovers services on its own and requires zero configuration when something new deploys.
Why Not the Standard Exporters
Two Prometheus exporters seem like they could solve this:
Blackbox Exporter probes HTTP endpoints and returns probe_success, probe_duration_seconds, and TLS status. Good for "is this endpoint up?" but that's all you get. It can pass or fail a probe based on the response body, but it doesn't extract data from it. The rich dependency data in our healthchecker JSON (which databases are connected, their response times, pod status) is completely invisible to it.
JSON Exporter (prometheus-json-exporter) is the closer fit. It can scrape JSON endpoints and extract fields into Prometheus metrics using JSONPath expressions. You could map $.dependencies[0].status to a metric. But each service has different dependencies at different depths. Maintaining JSONPath mappings for hundreds of endpoints with varying structures isn't realistic. And even if you did, you'd flatten nested dependency trees into individual metrics, losing the relationships between them.
Both exporters share the same fundamental problem: target management. Every endpoint needs to be defined somewhere, either in a static config, a ServiceMonitor, or a Probe custom resource. When environments deploy and scale dynamically across multiple clusters, maintaining these target lists becomes its own operational burden. Add a new environment, update the target list. Remove one, update again. Forget to update, and you have a blind spot.
The Existing Infrastructure
I didn't start from zero. Each environment already had a healthchecker service. This is a service that queries every application in its namespace, collects dependency information, and exposes the results as a JSON endpoint. It had been running for years, maintained by another team.
The healthchecker output looks something like this (anonymized):
{
  "serviceName": "payment-api",
  "status": "Healthy",
  "dependencies": [
    {
      "name": "postgresql",
      "status": "Healthy",
      "responseTime": "12ms",
      "operations": { "total": 45230, "failed": 0 }
    },
    {
      "name": "redis-cache",
      "status": "Healthy",
      "responseTime": "2ms"
    }
  ],
  "pods": {
    "ready": 3,
    "total": 3,
    "containers": [
      { "name": "api", "cpuUsage": "250m", "memoryUsage": "512Mi" }
    ]
  },
  "hpa": { "current": 3, "min": 2, "max": 10 }
}
That's not just "healthy" or "unhealthy," but the full picture: which dependencies are up, their response times, operation counts, pod status, resource usage, HPA state. A single entry can be 170 KB of structured JSON when an environment has dozens of services with deep dependency trees.
I don't own this healthchecker service. I can't modify it or scale it the way I'd want. But it works, it's been battle-tested over years, and the data it produces is genuinely useful if you actually do something with it.
Why Not Just Scrape Metrics Directly?
Looking at that JSON, the obvious question is: why not skip the healthchecker entirely? Build an exporter that scrapes metrics from each pod, push them to VictoriaMetrics, and create Grafana dashboards. Standard Prometheus pattern, no logs layer needed.
That approach would work. It might even be simpler if I were starting from scratch. But there were real reasons it wasn't the right path here:
The healthchecker already existed. It's a running service I don't own. I can't modify it to expose Prometheus metrics, and rebuilding its functionality from scratch wasn't justified when it already worked reliably.
It gives you a namespace-level aggregated view. One HTTP call to the healthchecker returns every service in that environment with all their dependencies. With direct metric scraping, you'd hit each pod individually and reconstruct the service-to-dependency relationships yourself. That's a lot of scrape targets and a lot of relabeling to get back to the same picture.
The dashboard needs the raw structure. The detail views show full dependency trees: which databases a service connects to, their response times, operation counts, failure rates. That's naturally nested data. Metrics would flatten it into label combinations like dep_response_time{service="payment-api",dep="postgresql"}, and the dashboard would have to reconstruct the tree from individual metrics. With the JSON preserved as a log entry, the frontend just renders it directly.
Cardinality. Each service has different dependencies at different depths. Turning variable-depth JSON trees into metric labels creates unpredictable cardinality. With hundreds of environments and thousands of services, that gets expensive fast.
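To make the cardinality point concrete, here is a minimal Python sketch that counts how many time series a flattened dependency tree would produce. It assumes a dependency can itself carry a nested "dependencies" list (the sample above only shows one level), and the service names and the number of metrics per dependency are made up for illustration.

```python
from typing import Iterator


def flatten(node: dict, path: tuple[str, ...] = ()) -> Iterator[tuple[str, ...]]:
    """Yield one path per dependency, descending into nested dependencies."""
    for dep in node.get("dependencies", []):
        dep_path = path + (dep["name"],)
        yield dep_path
        # Assumption: nested dependencies use the same "dependencies" key.
        yield from flatten(dep, dep_path)


def series_count(services: list[dict], metrics_per_dependency: int) -> int:
    """Count distinct (service, dependency path) label sets times metrics."""
    label_sets = set()
    for svc in services:
        for dep_path in flatten(svc):
            label_sets.add((svc["serviceName"], "/".join(dep_path)))
    return len(label_sets) * metrics_per_dependency


# Hypothetical sample: two services, one with a nested dependency.
services = [
    {"serviceName": "payment-api", "dependencies": [
        {"name": "postgresql"},
        {"name": "redis-cache"},
    ]},
    {"serviceName": "order-api", "dependencies": [
        {"name": "payment-api", "dependencies": [{"name": "postgresql"}]},
    ]},
]

print(series_count(services, metrics_per_dependency=3))  # 12
```

Two small services already yield 12 series before adding cluster and namespace labels. Every extra level of nesting adds label combinations, so the series count depends on the shape of each service's tree rather than on anything you can plan for up front.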
The honest take: if I owned the health endpoints and could design them from scratch, I'd expose Prometheus metrics and skip the logs layer. That's covered in "What I'd Do Differently" at the end. But given what existed, working with the healthchecker's JSON output and storing it as structured logs was the pragmatic choice.
Why VictoriaLogs
Given that the data is JSON, it needs a log storage backend. I chose VictoriaLogs because it was already part of our stack (we run VictoriaMetrics for metrics) and it supports LogsQL, which handles the deduplication and filtering server-side.
Each health check result gets pushed as a structured log entry with stream fields for cluster, namespace, and service. The full JSON is preserved and queryable. No flattening, no cardinality concerns, no data loss.
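As a rough sketch of what such an entry could look like, the Python below builds a JSON-lines payload for VictoriaLogs' /insert/jsonline endpoint and declares the stream fields through the _stream_fields query parameter. Putting the full healthchecker JSON in _msg, the extra top-level "status" field, and the helper names are assumptions for illustration; the real exporter may lay out its fields differently. Nothing is sent over the network here.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode

# Stream fields named in the post: cluster, namespace, service.
STREAM_FIELDS = ("cluster", "namespace", "service")


def build_entry(cluster: str, namespace: str, result: dict, now=None) -> dict:
    """Wrap one healthchecker result as a structured log entry (hypothetical layout)."""
    now = now or datetime.now(timezone.utc)
    return {
        "_time": now.isoformat(),
        # Assumption: the untouched JSON document travels in the message field.
        "_msg": json.dumps(result, separators=(",", ":")),
        "cluster": cluster,
        "namespace": namespace,
        "service": result["serviceName"],
        "status": result["status"],
    }


def build_request(base_url: str, entries: list[dict]) -> tuple[str, bytes]:
    """Return the ingestion URL and a newline-delimited JSON body."""
    query = urlencode({"_stream_fields": ",".join(STREAM_FIELDS)})
    body = "\n".join(json.dumps(e) for e in entries).encode("utf-8")
    return f"{base_url}/insert/jsonline?{query}", body


entry = build_entry(
    "cluster-a", "team-a-prod",
    {"serviceName": "payment-api", "status": "Healthy", "dependencies": []},
)
url, body = build_request("http://victorialogs.example:9428", [entry])
print(url)
```

The nested structure stays intact inside the entry, while only the three low-cardinality fields define the log stream.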
The key capability that makes the whole caching layer work is LogsQL's partition by clause, which lets me deduplicate millions of entries server-side in a single query. More on that in Part 2.
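To picture what that deduplication produces, here is a plain-Python equivalent: keep only the newest entry per (cluster, namespace, service). This is not the LogsQL query (that comes in Part 2) and it is not how the work is actually done, since the whole point is that VictoriaLogs does it server-side. The field names and the assumption that timestamps are ISO 8601 strings in one consistent format are mine.

```python
def latest_per_stream(entries: list[dict]) -> list[dict]:
    """Keep the most recent entry for each (cluster, namespace, service)."""
    latest: dict[tuple[str, str, str], dict] = {}
    for entry in entries:
        key = (entry["cluster"], entry["namespace"], entry["service"])
        # ISO 8601 timestamps in the same format and timezone sort
        # lexicographically, so a string comparison is enough for this sketch.
        if key not in latest or entry["_time"] > latest[key]["_time"]:
            latest[key] = entry
    return list(latest.values())


# Hypothetical sample: two checks of the same service, one of another.
entries = [
    {"cluster": "c1", "namespace": "prod", "service": "payment-api",
     "_time": "2024-01-01T10:00:00Z", "status": "Healthy"},
    {"cluster": "c1", "namespace": "prod", "service": "payment-api",
     "_time": "2024-01-01T10:01:00Z", "status": "Unhealthy"},
    {"cluster": "c1", "namespace": "prod", "service": "order-api",
     "_time": "2024-01-01T10:00:30Z", "status": "Healthy"},
]

for entry in latest_per_stream(entries):
    print(entry["service"], entry["status"])
```

Doing this in the client would mean pulling every raw entry over the network first, which is exactly what the server-side query avoids.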
Auto-Discovery
The piece I built from scratch is the exporter (hc-exporter). It runs per-cluster, deployed via Rancher Fleet. On each cycle:
- Queries the Kubernetes API for all namespaces and services matching the healthchecker pattern
- Runs concurrent HTTP checks against each discovered endpoint
- Structures the results with cluster, namespace, and service metadata
- Pushes everything to VictoriaLogs as structured log entries
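The middle two steps of that cycle can be sketched in a few lines of Python. This is an illustration and not the hc-exporter code: the function names, the worker count, and the shape of the result record are all hypothetical, and the fetch function is injected so the sketch runs without a cluster.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Endpoint = tuple[str, str, str]  # (namespace, service, url)


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """Default fetcher: GET the URL and parse the body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def check_all(cluster: str, endpoints: list[Endpoint],
              fetch: Callable[[str], dict] = fetch_json,
              max_workers: int = 16) -> list[dict]:
    """Check every endpoint concurrently and attach cluster metadata."""
    def check(endpoint: Endpoint) -> dict:
        namespace, service, url = endpoint
        try:
            payload, reachable = fetch(url), True
        except Exception as exc:  # one failing endpoint must not stop the cycle
            payload, reachable = {"error": str(exc)}, False
        return {"cluster": cluster, "namespace": namespace,
                "service": service, "reachable": reachable, "payload": payload}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input endpoints.
        return list(pool.map(check, endpoints))


def fake_fetch(url: str) -> dict:
    """Stand-in fetcher so the example runs offline."""
    if "broken" in url:
        raise ConnectionError("connection refused")
    return {"serviceName": "payment-api", "status": "Healthy"}


results = check_all("cluster-a", [
    ("team-a-prod", "healthchecker", "http://ok.example/health"),
    ("team-b-dev", "healthchecker", "http://broken.example/health"),
], fetch=fake_fetch)
print([r["reachable"] for r in results])  # [True, False]
```

Recording an unreachable endpoint as a result, instead of dropping it, matters here: a healthchecker that stops answering is itself a health signal.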
When a new environment deploys, it gets discovered on the next cycle. When an environment is removed, it stops appearing. No configuration files to update, no targets to maintain.
The discovery relies on Kubernetes labels and service naming conventions to identify healthchecker endpoints. It's not magic. It works because every environment's healthchecker follows the same naming pattern. But conventions are cheap and reliable. They've held up across hundreds of environments without a single missed discovery.
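A convention-based match is simple enough to sketch. The Python below filters a list of service records, shaped roughly like what a "list services in all namespaces" call returns, by name pattern and optional labels. The pattern, port, path, and record layout are placeholders, not the real convention, and the URL uses the default in-cluster DNS form for a Service.

```python
import fnmatch
from typing import Optional


def discover(services: list[dict],
             name_pattern: str = "healthchecker*",
             required_labels: Optional[dict] = None,
             port: int = 8080,
             path: str = "/health") -> list[tuple[str, str, str]]:
    """Return (namespace, service, url) for every service matching the convention."""
    required = required_labels or {}
    endpoints = []
    for svc in services:
        labels = svc.get("labels") or {}
        if not fnmatch.fnmatchcase(svc["name"], name_pattern):
            continue
        if any(labels.get(key) != value for key, value in required.items()):
            continue
        # Default in-cluster DNS name: <service>.<namespace>.svc.cluster.local
        url = (f"http://{svc['name']}.{svc['namespace']}"
               f".svc.cluster.local:{port}{path}")
        endpoints.append((svc["namespace"], svc["name"], url))
    return endpoints


# Hypothetical snapshot of services in one cluster.
snapshot = [
    {"namespace": "team-a-prod", "name": "healthchecker", "labels": {"app": "hc"}},
    {"namespace": "team-a-prod", "name": "payment-api", "labels": {"app": "api"}},
    {"namespace": "team-b-dev", "name": "healthchecker", "labels": {"app": "hc"}},
]

for namespace, name, url in discover(snapshot, required_labels={"app": "hc"}):
    print(namespace, url)
```

Because the snapshot is rebuilt from the Kubernetes API on every cycle, a new namespace shows up in the result without anyone touching a target list, and a deleted one simply drops out.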
The Architecture
The full data flow:
Ingestion path: K8s clusters (healthchecker endpoints) -> hc-exporter (auto-discovery via K8s API) -> VictoriaLogs
Dashboard path: VictoriaLogs -> Cache Worker (LogsQL dedup) -> Redis (4-tier cache) -> Next.js Dashboard
Alerting path: VictoriaLogs -> vmalert (rules evaluated on logs) -> Alertmanager (grouping, routing) -> OpsGenie
Each of these paths has its own engineering challenges. The caching layer alone handles 12M+ logs/day and turns them into sub-50ms dashboard responses. The alerting path replaces per-service noise with grouped notifications by environment. The next three posts in this series cover each path in detail.
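The idea behind grouped notifications fits in a few lines of Python. In the real pipeline the grouping happens in the alerting path, not in code like this. The sketch assumes one namespace per environment, and the field names are hypothetical.

```python
from collections import defaultdict


def group_by_environment(failures: list[dict]) -> list[dict]:
    """Collapse per-service failures into one notification per environment."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for failure in failures:
        # Assumption: an environment is identified by (cluster, namespace).
        groups[(failure["cluster"], failure["namespace"])].append(failure["service"])
    return [
        {"cluster": cluster, "environment": namespace, "services": sorted(services)}
        for (cluster, namespace), services in sorted(groups.items())
    ]


# Hypothetical failures: three services down in one environment, one in another.
failures = [
    {"cluster": "c1", "namespace": "team-a-prod", "service": "payment-api"},
    {"cluster": "c1", "namespace": "team-a-prod", "service": "order-api"},
    {"cluster": "c1", "namespace": "team-a-prod", "service": "cart-api"},
    {"cluster": "c2", "namespace": "team-b-dev", "service": "search-api"},
]

notifications = group_by_environment(failures)
print(len(failures), "failures ->", len(notifications), "notifications")
```

Four failing services become two notifications, and the on-call engineer sees which environment is in trouble before seeing which services are.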
What's Next
- Part 2: Caching 12M+ Logs for a Real-Time Dashboard covers the LogsQL deduplication, 4-tier Redis cache, and distributed workers.
- Part 3: Designing Views for Hundreds of Environments covers the dashboard UX decisions and the view hierarchy.
- Part 4: From Health Logs to Grouped OpsGenie Alerts covers the alerting pipeline, priority routing, and automatic silence management.
What I'd Do Differently
If I were building from scratch, I'd skip the healthchecker and the logs layer. Have the exporter query application endpoints directly, scrape their health data as metrics, and push to VictoriaMetrics. The data pipeline becomes simpler: auto-discover services via K8s API, scrape metrics, query with PromQL instead of LogsQL. No structured log entries, no log-based deduplication.
When I originally built this, I didn't understand the healthchecker's internals well enough to bypass it. It was a running service that already aggregated dependency data per-namespace, so I built on top of it. In hindsight, going directly to the application endpoints would have given me more control and removed a dependency on a service I don't own.
The dashboard and the custom UI would stay. That's not something Grafana replaces. Grafana is great for metrics exploration, but a purpose-built status page with search, environment drill-downs, and grouped health views serves a different need. Operations teams don't want to set up dashboard variables and navigate panel grids during an incident. They want to type a service name and see its health across every environment in one place. The UI is the product. The data backend behind it (logs vs metrics) is an implementation detail.
The auto-discovery piece would also stay the same. That's the core value regardless of what's behind the health endpoint.
I'm working toward this. The exporter is designed so it can query any HTTP endpoint, not just the existing healthcheckers. Moving to direct endpoint checks is the next evolution.
The other thing I'd push for earlier is OpenTelemetry. With OTel instrumentation, applications would emit health signals (traces, metrics, logs) in a standardized format. The exporter wouldn't need to know anything about endpoint structure or response formats. It would just consume what OTel provides. I'm currently working on driving OTel adoption inside the company, but we're not there yet. When it lands, a lot of the custom scraping logic in this system becomes unnecessary.
This is also what's driving Atlas, a project I'm building that maps infrastructure dependencies from OTel traces, Kubernetes state, and alerting data into a graph database. The health monitoring platform tells you what's broken. Atlas would tell you why, by tracing the dependency graph from a failing service back to the root cause. OTel is the foundation that connects both systems.
The dashboard built on top of this architecture has a live demo with anonymized data. Part 3 walks through the design decisions.