HyprWatch — monitoring as a find→fix→verify loop

Status: v1 shipped (the first slice below is implemented; "Deferred" items are not). Read AUTOPILOT.md first — HyprWatch is the canonical example of the "we install + verify host apps, we don't become Terraform/k8s" line.

v1 wiring (all in tree): opt-in hyprwatch flag on the per-node agent config (apps/api/src/routes/nodes.ts, agent runner.go) → ScanMonitoring (agent/hyprnode/internal/scanner/monitoring.go, gated, INFO monitoring.absent) → monitoring.install-stack recommendation (prisma/seed-recommendations.ts) → presets/monitoring-only.yaml (now carries risk_level: CONFIRM + a verify: block that polls Prometheus/Grafana/node-exporter) → existing job-complete auto-resolve closes the finding. Detection heuristic chosen: C (opt-in).

What HyprWatch is

The pitch: "point HyprBox at a PME server and it stands up + keeps healthy your monitoring stack." Concretely, the Prometheus / node_exporter / Grafana / Alertmanager stack shipped as a vetted, verified preset, then driven by the same loop as every other fix: discover → finding → recommend → preview → apply → verify → auto-resolve.

HyprVault (backups, Phase 4a) is the shape to mirror: a scanner emits a Finding, a Recommendation maps it to a preset, the preset's verify: asserts the result, and the job's success auto-resolves the Finding.

Current state (the gap)

presets/monitoring-only.yaml already deploys the full stack via a docker_compose step + a firewall step. But it is an orphan:

No scanner detects "this node has no monitoring", so no Finding is ever raised.
No Recommendation points at it, so it never appears on the Health page as a one-click fix — an operator has to queue it by hand.
No verify: steps — applying it FAILS-or-SUCCEEDS on the compose step alone; it never asserts Prometheus/Grafana actually came up. This violates the Fix Contract the whole product is built on.
No risk_level (defaults to CONFIRM).

So HyprWatch today is a preset, not a module. This doc closes that.

The first slice (end-to-end)

Turn the orphan preset into a real loop. Four small, coherent changes:

Agent scanner agent/hyprnode/internal/scanner/monitoring.go → ScanMonitoring(cfg) returns a monitoring.absent Finding when the host looks unmonitored (heuristic below). Registered in runner.go like the others. Severity INFO — this is an enhancement suggestion, not a risk.
Recommendation seed (prisma/seed-recommendations.ts): new entry monitoring.install-stack, findingTypes: ['monitoring.absent'], presetName: 'monitoring-only', riskLevel: 'CONFIRM'.
Preset monitoring-only.yaml: add risk_level: CONFIRM and verify: steps that assert the stack is live (see Verify below).
Auto-resolve: already wired — POST /api/jobs/:id/complete resolves OPEN findings whose recommendation preset matches the job, on that node. Nothing to build; it just starts working once 1–3 exist.

Detection heuristic — THE decision to settle first

"Absent" is ambiguous and the failure mode is noise (nagging every node that's monitored from elsewhere). Options:

	Heuristic	False-positive risk
A	Nothing listening on Prometheus `:9090`	High — flags every node in a fleet where one central Prom scrapes the rest
B	No `prometheus`/`grafana`/`node-exporter` container running (reuse `docker_inventory.go`)	Medium — flags nodes monitored by a non-container or external tool
C	Opt-in: only scan when the node's agent config enables it (mirrors the `HYPRBOX_DEMO`-gated `demo.go` scanner), then apply B (+ port check)	Low — operator opted this node in

Recommendation: C for v1. A hyprwatch flag in the per-node agent config turns the scan on; until then ScanMonitoring is a no-op (zero surprise findings across a fleet). When enabled, flag monitoring.absent only if Docker is present (the preset's prerequisite) AND no prometheus/grafana/node-exporter container is running AND nothing answers on :9090. Promote to default-on once the heuristic is proven in the field.

Verify steps (completing the Fix Contract)

After the compose step, assert the stack is actually serving:

Prometheus healthy: curl -fsS http://localhost:9090/-/healthy
Grafana healthy: curl -fsS http://localhost:3001/api/health (maps to the preset's 3001:3000 port)
node-exporter scraping: curl -fsS http://localhost:9100/metrics | head -1

Any non-zero exit ⇒ job FAILED ⇒ Finding stays OPEN. All idempotent.

Deferred (NOT in the first slice)

Loki + log shipping, Alertmanager routing config (the preset wires the webhook env but we don't verify alert delivery).
Multi-node Prometheus federation / a central scrape target registry.
Grafana dashboard provisioning beyond the stack's default.
Drift detection ("Grafana container died") as its own Finding — natural next step once the install loop is proven, reusing the same scanner.

Anti-goals (locked, see AUTOPILOT.md)

Not a Grafana competitor (we install it, we don't replace its UI). Not k8s/Helm (host-installed compose stack only). Not a SaaS metrics backend. The agent runs bash + compose on a host; that is the entire surface.

HyprWatch — monitoring as a find→fix→verify loop

#What HyprWatch is

#Current state (the gap)

#The first slice (end-to-end)

#Detection heuristic — THE decision to settle first

#Verify steps (completing the Fix Contract)

#Deferred (NOT in the first slice)