HyprWatch — monitoring as a find→fix→verify loop
Status: v1 shipped (the first slice below is implemented; "Deferred" items are not). Read AUTOPILOT.md first — HyprWatch is the canonical example of the "we install + verify host apps, we don't become Terraform/k8s" line.
v1 wiring (all in tree): opt-in hyprwatch flag on the per-node agent config
(apps/api/src/routes/nodes.ts, agent runner.go) → ScanMonitoring
(agent/hyprnode/internal/scanner/monitoring.go, gated, INFO monitoring.absent)
→ monitoring.install-stack recommendation (prisma/seed-recommendations.ts)
→ presets/monitoring-only.yaml (now carries risk_level: CONFIRM + a verify:
block that polls Prometheus/Grafana/node-exporter) → existing job-complete
auto-resolve closes the finding. Detection heuristic chosen: C (opt-in).
What HyprWatch is
The pitch: "point HyprBox at a PME server and it stands up + keeps healthy your
monitoring stack." Concretely, the Prometheus / node_exporter / Grafana /
Alertmanager stack shipped as a vetted, verified preset, then driven by the
same loop as every other fix: discover → finding → recommend → preview → apply → verify → auto-resolve.
HyprVault (backups, Phase 4a) is the shape to mirror: a scanner emits a Finding,
a Recommendation maps it to a preset, the preset's verify: asserts the result,
and the job's success auto-resolves the Finding.
Current state (the gap)
presets/monitoring-only.yaml already deploys the full stack via a
docker_compose step + a firewall step. But it is an orphan:
- No scanner detects "this node has no monitoring", so no Finding is ever raised.
- No Recommendation points at it, so it never appears on the Health page as a one-click fix — an operator has to queue it by hand.
- No
verify:steps — applying it FAILS-or-SUCCEEDS on the compose step alone; it never asserts Prometheus/Grafana actually came up. This violates the Fix Contract the whole product is built on. - No
risk_level(defaults to CONFIRM).
So HyprWatch today is a preset, not a module. This doc closes that.
The first slice (end-to-end)
Turn the orphan preset into a real loop. Four small, coherent changes:
Agent scanner
agent/hyprnode/internal/scanner/monitoring.go→ScanMonitoring(cfg)returns amonitoring.absentFinding when the host looks unmonitored (heuristic below). Registered inrunner.golike the others. Severity INFO — this is an enhancement suggestion, not a risk.Recommendation seed (
prisma/seed-recommendations.ts): new entrymonitoring.install-stack,findingTypes: ['monitoring.absent'],presetName: 'monitoring-only',riskLevel: 'CONFIRM'.Preset
monitoring-only.yaml: addrisk_level: CONFIRMandverify:steps that assert the stack is live (see Verify below).Auto-resolve: already wired —
POST /api/jobs/:id/completeresolves OPEN findings whose recommendation preset matches the job, on that node. Nothing to build; it just starts working once 1–3 exist.
Detection heuristic — THE decision to settle first
"Absent" is ambiguous and the failure mode is noise (nagging every node that's monitored from elsewhere). Options:
| Heuristic | False-positive risk | |
|---|---|---|
| A | Nothing listening on Prometheus :9090 |
High — flags every node in a fleet where one central Prom scrapes the rest |
| B | No prometheus/grafana/node-exporter container running (reuse docker_inventory.go) |
Medium — flags nodes monitored by a non-container or external tool |
| C | Opt-in: only scan when the node's agent config enables it (mirrors the HYPRBOX_DEMO-gated demo.go scanner), then apply B (+ port check) |
Low — operator opted this node in |
Recommendation: C for v1. A hyprwatch flag in the per-node agent config
turns the scan on; until then ScanMonitoring is a no-op (zero surprise
findings across a fleet). When enabled, flag monitoring.absent only if Docker
is present (the preset's prerequisite) AND no prometheus/grafana/node-exporter
container is running AND nothing answers on :9090. Promote to default-on once
the heuristic is proven in the field.
Verify steps (completing the Fix Contract)
After the compose step, assert the stack is actually serving:
- Prometheus healthy:
curl -fsS http://localhost:9090/-/healthy - Grafana healthy:
curl -fsS http://localhost:3001/api/health(maps to the preset's3001:3000port) - node-exporter scraping:
curl -fsS http://localhost:9100/metrics | head -1
Any non-zero exit ⇒ job FAILED ⇒ Finding stays OPEN. All idempotent.
Deferred (NOT in the first slice)
- Loki + log shipping, Alertmanager routing config (the preset wires the webhook env but we don't verify alert delivery).
- Multi-node Prometheus federation / a central scrape target registry.
- Grafana dashboard provisioning beyond the stack's default.
- Drift detection ("Grafana container died") as its own Finding — natural next step once the install loop is proven, reusing the same scanner.
Anti-goals (locked, see AUTOPILOT.md)
Not a Grafana competitor (we install it, we don't replace its UI). Not k8s/Helm (host-installed compose stack only). Not a SaaS metrics backend. The agent runs bash + compose on a host; that is the entire surface.