HyprWatch — monitoring as a find→fix→verify loop

Status: v1 shipped (the first slice below is implemented; "Deferred" items are not). Read AUTOPILOT.md first — HyprWatch is the canonical example of the "we install + verify host apps, we don't become Terraform/k8s" line.

v1 wiring (all in tree): opt-in hyprwatch flag on the per-node agent config (apps/api/src/routes/nodes.ts, agent runner.go) → ScanMonitoring (agent/hyprnode/internal/scanner/monitoring.go, gated, INFO monitoring.absent) → monitoring.install-stack recommendation (prisma/seed-recommendations.ts) → presets/monitoring-only.yaml (now carries risk_level: CONFIRM + a verify: block that polls Prometheus/Grafana/node-exporter) → existing job-complete auto-resolve closes the finding. Detection heuristic chosen: C (opt-in).

What HyprWatch is

The pitch: "point HyprBox at a PME server and it stands up + keeps healthy your monitoring stack." Concretely, the Prometheus / node_exporter / Grafana / Alertmanager stack shipped as a vetted, verified preset, then driven by the same loop as every other fix: discover → finding → recommend → preview → apply → verify → auto-resolve.

HyprVault (backups, Phase 4a) is the shape to mirror: a scanner emits a Finding, a Recommendation maps it to a preset, the preset's verify: asserts the result, and the job's success auto-resolves the Finding.
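As a minimal sketch of the "Recommendation maps it to a preset" step, the lookup below uses the field values this doc gives for the monitoring.install-stack seed entry. The Go struct, the in-memory slice, and the RecommendFor helper are hypothetical illustrations; the real seed lives in prisma/seed-recommendations.ts and is not reproduced here.

```go
package main

import "fmt"

// Recommendation mirrors the seed fields named in this doc
// (findingTypes, presetName, riskLevel). The Go type itself is hypothetical.
type Recommendation struct {
	Key          string
	FindingTypes []string
	PresetName   string
	RiskLevel    string
}

// recommendations is an illustrative in-memory stand-in for the seeded table.
var recommendations = []Recommendation{
	{
		Key:          "monitoring.install-stack",
		FindingTypes: []string{"monitoring.absent"},
		PresetName:   "monitoring-only",
		RiskLevel:    "CONFIRM",
	},
}

// RecommendFor returns the first recommendation whose FindingTypes
// contains findingType, and false when none matches.
func RecommendFor(findingType string) (Recommendation, bool) {
	for _, r := range recommendations {
		for _, t := range r.FindingTypes {
			if t == findingType {
				return r, true
			}
		}
	}
	return Recommendation{}, false
}

func main() {
	if r, ok := RecommendFor("monitoring.absent"); ok {
		fmt.Printf("%s -> preset %s (risk %s)\n", r.Key, r.PresetName, r.RiskLevel)
	}
}
```

A finding type with no matching entry returns false, which corresponds to the "orphan" situation: nothing surfaces as a one-click fix.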

Current state (the gap)

presets/monitoring-only.yaml already deploys the full stack via a docker_compose step + a firewall step. But it is an orphan:

  • No scanner detects "this node has no monitoring", so no Finding is ever raised.
  • No Recommendation points at it, so it never appears on the Health page as a one-click fix — an operator has to queue it by hand.
  • No verify: steps, so the job fails or succeeds on the compose step alone; it never asserts that Prometheus/Grafana actually came up. This violates the Fix Contract the whole product is built on.
  • No risk_level (defaults to CONFIRM).

So HyprWatch today is a preset, not a module. This doc closes that gap.

The first slice (end-to-end)

Turn the orphan preset into a real loop. Four small, coherent changes:

  1. Agent scanner (agent/hyprnode/internal/scanner/monitoring.go): ScanMonitoring(cfg) returns a monitoring.absent Finding when the host looks unmonitored (heuristic below). Registered in runner.go like the others. Severity INFO: this is an enhancement suggestion, not a risk.

  2. Recommendation seed (prisma/seed-recommendations.ts): new entry monitoring.install-stack, findingTypes: ['monitoring.absent'], presetName: 'monitoring-only', riskLevel: 'CONFIRM'.

  3. Preset monitoring-only.yaml: add risk_level: CONFIRM and verify: steps that assert the stack is live (see Verify below).

  4. Auto-resolve: already wired — POST /api/jobs/:id/complete resolves OPEN findings whose recommendation preset matches the job, on that node. Nothing to build; it just starts working once 1–3 exist.
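The auto-resolve rule in step 4 can be sketched as follows. This is an illustration only, written in Go for consistency with the agent code; the real handler is the API's POST /api/jobs/:id/complete route and is not shown in this doc. The Finding and Job structs, the presetForFinding map, and ResolveOnComplete are all hypothetical names.

```go
package main

import "fmt"

// Finding and Job are hypothetical, simplified shapes for illustration.
type Finding struct {
	ID     int
	NodeID string
	Type   string
	Status string // "OPEN" or "RESOLVED"
}

type Job struct {
	NodeID     string
	PresetName string
	Succeeded  bool
}

// presetForFinding stands in for the recommendation table:
// finding type -> preset that fixes it.
var presetForFinding = map[string]string{
	"monitoring.absent": "monitoring-only",
}

// ResolveOnComplete marks OPEN findings as RESOLVED when the job succeeded,
// ran on the same node, and used the preset recommended for that finding type.
// It returns how many findings were resolved.
func ResolveOnComplete(job Job, findings []Finding) int {
	if !job.Succeeded {
		return 0 // a failed job leaves every finding OPEN
	}
	resolved := 0
	for i := range findings {
		f := &findings[i]
		if f.Status != "OPEN" || f.NodeID != job.NodeID {
			continue
		}
		if preset, ok := presetForFinding[f.Type]; ok && preset == job.PresetName {
			f.Status = "RESOLVED"
			resolved++
		}
	}
	return resolved
}

func main() {
	findings := []Finding{
		{ID: 1, NodeID: "node-a", Type: "monitoring.absent", Status: "OPEN"},
		{ID: 2, NodeID: "node-b", Type: "monitoring.absent", Status: "OPEN"},
	}
	job := Job{NodeID: "node-a", PresetName: "monitoring-only", Succeeded: true}
	n := ResolveOnComplete(job, findings)
	fmt.Println("resolved:", n, "statuses:", findings[0].Status, findings[1].Status)
}
```

Scoping the match to the job's node is what keeps a successful install on one host from closing the same finding on another.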

Detection heuristic — THE decision to settle first

"Absent" is ambiguous and the failure mode is noise (nagging every node that's monitored from elsewhere). Options:

| Option | Heuristic | False-positive risk |
|---|---|---|
| A | Nothing listening on Prometheus :9090 | High: flags every node in a fleet where one central Prom scrapes the rest |
| B | No prometheus/grafana/node-exporter container running (reuse docker_inventory.go) | Medium: flags nodes monitored by a non-container or external tool |
| C | Opt-in: only scan when the node's agent config enables it (mirrors the HYPRBOX_DEMO-gated demo.go scanner), then apply B (+ port check) | Low: operator opted this node in |

Recommendation: C for v1. A hyprwatch flag in the per-node agent config turns the scan on; until then ScanMonitoring is a no-op (zero surprise findings across a fleet). When enabled, flag monitoring.absent only if Docker is present (the preset's prerequisite) AND no prometheus/grafana/node-exporter container is running AND nothing answers on :9090. Promote to default-on once the heuristic is proven in the field.
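The option C decision reduces to a short predicate. The sketch below is not the shipped ScanMonitoring; HostFacts, its fields, and MonitoringAbsent are hypothetical names, and matching containers by a substring of their name is an assumption made for illustration (the doc does not say how containers are identified).

```go
package main

import (
	"fmt"
	"strings"
)

// HostFacts is a hypothetical snapshot of what a scanner would gather.
type HostFacts struct {
	HyprwatchEnabled  bool     // opt-in flag from the per-node agent config
	DockerPresent     bool     // the preset's prerequisite
	RunningContainers []string // names of running containers
	Port9090Answers   bool     // something responds on :9090
}

// monitoringNames are the container names the heuristic looks for.
var monitoringNames = []string{"prometheus", "grafana", "node-exporter"}

// MonitoringAbsent reports whether a monitoring.absent finding should be raised.
// Assumption: a container counts as monitoring if its name contains one of
// monitoringNames (case-insensitive).
func MonitoringAbsent(h HostFacts) bool {
	if !h.HyprwatchEnabled {
		return false // not opted in: the scan is a no-op
	}
	if !h.DockerPresent {
		return false // the preset could not be applied anyway
	}
	for _, c := range h.RunningContainers {
		lc := strings.ToLower(c)
		for _, n := range monitoringNames {
			if strings.Contains(lc, n) {
				return false
			}
		}
	}
	return !h.Port9090Answers
}

func main() {
	h := HostFacts{
		HyprwatchEnabled:  true,
		DockerPresent:     true,
		RunningContainers: []string{"nginx", "postgres"},
	}
	fmt.Println("raise monitoring.absent:", MonitoringAbsent(h))
}
```

Every early return errs toward silence, which matches the stated goal of zero surprise findings across a fleet.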

Verify steps (completing the Fix Contract)

After the compose step, assert the stack is actually serving:

  • Prometheus healthy: curl -fsS http://localhost:9090/-/healthy
  • Grafana healthy: curl -fsS http://localhost:3001/api/health (maps to the preset's 3001:3000 port)
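  • node-exporter scraping: curl -fsS http://localhost:9100/metrics | head -1. In a shell pipeline the exit status is head's unless pipefail is set, so a curl failure can be masked here; if the verify step keys only on the exit status, curl -fsS -o /dev/null http://localhost:9100/metrics preserves curl's own status.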

Any non-zero exit ⇒ job FAILED ⇒ Finding stays OPEN. All idempotent.
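For illustration, the same three checks can be expressed as a small Go probe. This is a sketch of equivalent logic, not how the preset runs its verify: block (the doc describes curl commands). Check, StackChecks, Probe, and VerifyStack are hypothetical names, and the 5-second timeout is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// Check pairs a label with the URL probed by the matching curl command.
type Check struct {
	Name string
	URL  string
}

// StackChecks lists the three endpoints from the verify steps above.
var StackChecks = []Check{
	{Name: "prometheus", URL: "http://localhost:9090/-/healthy"},
	{Name: "grafana", URL: "http://localhost:3001/api/health"},
	{Name: "node-exporter", URL: "http://localhost:9100/metrics"},
}

// statusOK approximates curl -f, which fails on HTTP status 400 and above.
func statusOK(code int) bool { return code < 400 }

// Probe issues one GET and returns an error on a transport failure
// or a failing HTTP status.
func Probe(client *http.Client, c Check) error {
	resp, err := client.Get(c.URL)
	if err != nil {
		return fmt.Errorf("%s: %w", c.Name, err)
	}
	defer resp.Body.Close()
	if !statusOK(resp.StatusCode) {
		return fmt.Errorf("%s: unexpected status %d", c.Name, resp.StatusCode)
	}
	return nil
}

// VerifyStack runs the checks in order and returns the first failure, or nil.
func VerifyStack(client *http.Client, checks []Check) error {
	for _, c := range checks {
		if err := Probe(client, c); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	client := &http.Client{
		Timeout: 5 * time.Second, // assumed value, not from the preset
		// curl without -L does not follow redirects; mirror that behaviour.
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	if err := VerifyStack(client, StackChecks); err != nil {
		fmt.Fprintln(os.Stderr, "verify failed:", err)
		os.Exit(1) // non-zero exit, so the job would be marked FAILED
	}
	fmt.Println("all checks passed")
}
```

The probes are plain GET requests with no side effects, so re-running them is safe, in line with the idempotency requirement.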

Deferred (NOT in the first slice)

  • Loki + log shipping, Alertmanager routing config (the preset wires the webhook env but we don't verify alert delivery).
  • Multi-node Prometheus federation / a central scrape target registry.
  • Grafana dashboard provisioning beyond the stack's default.
  • Drift detection ("Grafana container died") as its own Finding — natural next step once the install loop is proven, reusing the same scanner.

Anti-goals (locked, see AUTOPILOT.md)

Not a Grafana competitor (we install it, we don't replace its UI). Not k8s/Helm (host-installed compose stack only). Not a SaaS metrics backend. The agent runs bash + compose on a host; that is the entire surface.