Autopilot — product north star

Where HyprBox is going. Read this before adding a major feature; if it doesn't pull us toward the loop described here, push back.

The product in one sentence

HyprBox is a self-hosted infrastructure autopilot for the small fleet managed by a freelancer or MSP. It discovers what's actually running on your servers, scores the risks in plain language, and applies verified fixes that the operator can preview before they run.

What it is NOT

Not a monitoring dashboard. Grafana exists; we don't compete with it.
Not an inventory tool. NetBox exists; we don't compete with that either.
Not an "uptime status page". Uptime Kuma exists.

The line that separates HyprBox from those tools: HyprBox proposes a specific, vetted action and runs it for you with a verify-or-rollback contract. Everything else is plumbing to make that action trustworthy.

What we install vs. what we don't (the multi-app question)

The fix loop generalises naturally to provisioning: a preset can install + configure + verify an app, not only repair one. That's the growth path. The roadmap's HyprWatch is exactly this: Prometheus / Loki / Alertmanager / Grafana (+ exporters) shipped as vetted, verified presets, then kept healthy by the same find→fix loop. The pitch for the SME / MSP audience becomes "point HyprBox at your servers and it stands up + maintains your monitoring, logging, alerting, and backups."

What we deliberately do not become:

A Terraform / IaC platform. Terraform drives cloud-provider APIs with its own state + plan/apply/drift model. We can run terraform apply as a command: step if a specific client needs it, but state, drift, and cloud-credential RBAC are a different product (Spacelift / env0 / Atlantis).
A Kubernetes / Helm platform. Helm deploys into a cluster via the k8s API — a different control plane from "an agent running bash on a host". Rancher / Portainer / ArgoCD own it.

Why the line sits here: our audience is the SME / small fleet (a handful of VMs, usually no platform engineer). They need the ops stack installed and watched; they are not running clusters or managing cloud infra as code (and when they touch k8s it's managed, with those tools already in hand). Chasing Terraform/k8s pulls us up-market into saturated territory and away from the wedge. Host-installed apps = yes (presets). Other control planes = no, or a thin command: runner at most.

The closed loop

                 ┌──────────────────┐
                 │ HyprNode scanner │
                 │ (discovers facts)│
                 └────────┬─────────┘
                          │
                          ▼
                 ┌──────────────────┐
                 │     Finding      │
                 │ (a risk seen)    │
                 └────────┬─────────┘
                          │
                          ▼
                 ┌──────────────────┐
                 │  Recommendation  │
                 │ (a vetted fix)   │
                 └────────┬─────────┘
                          │  preview + approve
                          ▼
                 ┌──────────────────┐
                 │   Fix (preset)   │
                 │ detect/precheck  │
                 │ plan/apply/verify│
                 │      rollback?   │
                 └────────┬─────────┘
                          │  agent runs it
                          ▼
                 ┌──────────────────┐
                 │ Job + Finding    │
                 │ → SUCCEEDED →    │
                 │ Finding RESOLVED │
                 └──────────────────┘

Every loop iteration leaves a complete trail: the finding (with its detection time and cause), the job (with its rendered script and stdout), the verify step (with its expected outcome), and the audit event (who clicked apply, from which IP).

What each block actually means

Finding

A discrete risk detected by an agent scanner. Concrete, deduplicated, actionable. Examples:

"TLS certificate for app.client.fr expires in 18 days"
"Postgres container pg-main has no associated backup policy"
"SSH is configured with PasswordAuthentication yes"

A finding is NOT a metric. CPU at 72% is not a finding; CPU pegged at 100% sustained for an hour, with the agent unable to identify the runaway process, might be.

The Finding row carries severity (INFO/WARN/CRITICAL), a stable key for dedup, an optional recommendationId pointing at the Recommendation that names the Fix, and a status (OPEN/SNOOZED/RESOLVED).
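
A minimal sketch of such a row, in Python for illustration only (the real schema is not shown here). The severity, key, recommendation id, and status fields come from the description above; the class names, the `message` field, and the `dedup` helper with its "first report wins" policy are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    INFO = "INFO"
    WARN = "WARN"
    CRITICAL = "CRITICAL"


class Status(Enum):
    OPEN = "OPEN"
    SNOOZED = "SNOOZED"
    RESOLVED = "RESOLVED"


@dataclass
class Finding:
    key: str                                  # stable key used for dedup
    severity: Severity
    message: str                              # hypothetical: the plain-language sentence
    recommendation_id: Optional[str] = None   # optional link to a recommendation
    status: Status = Status.OPEN


def dedup(findings):
    """Keep one finding per key; the first report seen wins (assumed policy)."""
    by_key = {}
    for finding in findings:
        by_key.setdefault(finding.key, finding)
    return list(by_key.values())
```

Two scans that report the same certificate under the same key collapse into a single row, which is what keeps the finding list actionable rather than noisy.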

Recommendation

A vetted mapping (finding type) → (preset to run). Curated server-side; not user-editable in the MVP. Each recommendation declares:

The preset to use.
The risk level — Safe / Confirm / Dangerous / Manual (see below).
A short human description of what the fix does.

The same recommendation can apply to many findings (TLS expiring on many domains → one "renew Caddy certificates" recommendation).
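
The mapping can be pictured as a lookup table keyed by finding type. The sketch below is Python for illustration only; the identifiers `tls.expiring`, `tls.caddy-renew`, and `caddy-tls-renew` are the MVP names used elsewhere in this document, while the `Recommendation` class, the table layout, and `group_by_recommendation` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recommendation:
    id: str
    preset: str        # the preset to use
    risk_level: str    # SAFE / CONFIRM / DANGEROUS / MANUAL
    description: str   # short human description of the fix


# Hypothetical curated table: finding type -> recommendation.
RECOMMENDATIONS = {
    "tls.expiring": Recommendation(
        id="tls.caddy-renew",
        preset="caddy-tls-renew",
        risk_level="SAFE",
        description="Renew the Caddy certificate",
    ),
}


def recommend(finding_type: str) -> Optional[Recommendation]:
    """Return the vetted recommendation for a finding type, if one exists."""
    return RECOMMENDATIONS.get(finding_type)


def group_by_recommendation(findings):
    """findings: iterable of (finding_type, key) pairs.

    Returns {recommendation id or None: [finding keys]}, so many findings
    can share a single recommendation.
    """
    groups = {}
    for finding_type, key in findings:
        rec = recommend(finding_type)
        groups.setdefault(rec.id if rec else None, []).append(key)
    return groups
```

Findings with no entry in the table land under `None`, which corresponds to a finding shown with a description but no Apply button.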

Fix (a preset with a contract)

A preset that conforms to the Fix Contract:

detect — re-establish that the problem is still present at apply time (the finding might be stale).
precheck — assert preconditions before changing anything (e.g. "at least one SSH key is authorised before we disable password auth").
plan — render the exact bash that will run. This is the artifact the operator previews.
apply — execute the steps. Same agent path as any other Job.
verify — run typed checks AFTER apply. If they fail, the job is marked FAILED even if the steps succeeded.
rollback — optional. When provided, a verify failure runs this automatically.

In code, precheck, verify, and rollback are special step lists in the preset YAML, rendered as guarded sections in the bash output.
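
The ordering of the contract can be sketched as a small driver, here in Python for illustration only. The phase names and the SUCCEEDED / FAILED outcomes come from the contract above; the `Fix` class, the `run_fix` function, and the `SKIPPED` label for a stale finding are assumptions. In the product, the operator previews and approves the plan before apply; the sketch collapses that step.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Fix:
    detect: Callable[[], bool]       # True if the problem is still present
    precheck: Callable[[], bool]     # True if preconditions hold
    plan: Callable[[], str]          # renders the script the operator previews
    apply: Callable[[str], bool]     # runs the script; True if the steps succeeded
    verify: Callable[[], bool]       # checks run AFTER apply
    rollback: Optional[Callable[[], None]] = None


def run_fix(fix: Fix) -> str:
    if not fix.detect():
        return "SKIPPED"             # assumed label: the finding was stale
    if not fix.precheck():
        return "FAILED"              # nothing has been changed yet
    script = fix.plan()
    if not fix.apply(script):
        return "FAILED"
    if not fix.verify():
        # Steps succeeded but the outcome is wrong: roll back if we can.
        if fix.rollback is not None:
            fix.rollback()
        return "FAILED"
    return "SUCCEEDED"
```

The point of the shape is that a job only reports success when verify passes, never merely because every step exited cleanly.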

Safety levels

Level	What it means	Default UI
`SAFE`	Read-only, or a strictly idempotent, reversible no-op. Audit scripts, snapshot listing.	Apply without confirm
`CONFIRM`	Writes that change state but are easy to undo (firewall rule add, backup policy create).	Modal confirm + show plan
`DANGEROUS`	State changes that can lock you out or destroy data (disable SSH password auth, delete volumes).	Modal confirm + type the node hostname to confirm
`MANUAL`	We can describe the fix but won't run it (e.g. "rotate your Postgres superuser password").	"Open instructions" — no Apply button

Every preset declares one of these as risk_level. Without it, the default is CONFIRM (fail-safe).
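
A sketch of the fail-safe default and the UI gate per level, in Python for illustration only. The four levels, the `risk_level` key, and the CONFIRM default come from this section; the function names, the dict-shaped preset, and the choice to reject unknown values with an error are assumptions.

```python
from enum import Enum


class RiskLevel(Enum):
    SAFE = "SAFE"
    CONFIRM = "CONFIRM"
    DANGEROUS = "DANGEROUS"
    MANUAL = "MANUAL"


def risk_level_of(preset: dict) -> RiskLevel:
    """Read risk_level from a parsed preset; a missing value means CONFIRM."""
    raw = preset.get("risk_level")
    if raw is None:
        return RiskLevel.CONFIRM
    # Unknown values raise ValueError instead of silently downgrading (assumption).
    return RiskLevel(str(raw).upper())


def can_apply(level, confirmed=False, typed_hostname=None, node_hostname=None) -> bool:
    """Mirror the default UI column: what the operator must do before Apply runs."""
    if level is RiskLevel.MANUAL:
        return False                 # instructions only, never an Apply button
    if level is RiskLevel.SAFE:
        return True                  # apply without confirm
    if level is RiskLevel.CONFIRM:
        return confirmed             # modal confirm + show plan
    # DANGEROUS: confirm and type the node hostname exactly.
    return confirmed and typed_hostname is not None and typed_hostname == node_hostname
```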

The MVP loop (intentionally tiny)

We will demonstrate the end-to-end loop with ONE scanner, ONE finding type, ONE recommendation, ONE preset. That's enough to show the pattern and let us copy-paste it for the next scanner.

Layer	MVP scope
Scanner	TLS expiry — walks `/etc/letsencrypt/live` and configurable cert paths
Finding	`tls.expiring` — key = the cert subject + the path
Recommendation	`tls.caddy-renew` (when a Caddy reverse proxy is detected)
Preset	`caddy-tls-renew` with verify = "new cert valid for > 60 days"
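
The arithmetic behind the scanner and the verify step is small enough to sketch. This is Python for illustration only (the agent itself is Go) and it assumes the expiry date is already available as an OpenSSL-style `notAfter` string; the function names and the exact key format are assumptions, while the "subject + path" key and the 60-day threshold come from the table above.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """not_after uses the OpenSSL text form, e.g. 'Aug 25 12:00:00 2026 GMT'."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days      # whole days, rounded down


def finding_key(subject: str, path: str) -> str:
    """Stable dedup key built from the cert subject and its path (assumed format)."""
    return f"tls.expiring:{subject}:{path}"


def verify_renewed(not_after: str, now: datetime, min_days: int = 60) -> bool:
    """The preset's verify: the new cert must be valid for more than min_days."""
    return days_until_expiry(not_after, now) > min_days
```

For example, with `now` set to 2026-08-07 00:00 UTC and a `notAfter` of `Aug 25 12:00:00 2026 GMT`, `days_until_expiry` gives 18 and `verify_renewed` is false, so a renewal that silently did nothing would fail the job.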

Once that loop is closed and tested, we add:

postgres-no-backup finding → hypervault-restic recommendation
ssh-password-auth finding → ssh-harden recommendation (existing preset)
disk-usage-high finding → docker-prune recommendation (new preset)

Each new finding type is a few hundred lines of agent scanner + a recommendation row + (sometimes) a new preset. We don't refactor the loop — we replicate it.

What we deliberately defer

Infra graph — sexy demo, huge build, marginal MVP value. We store per-finding relatedEntities: string[] (a flat list of identifiers like container:pg-main, volume:pg_data, domain:app.client.fr) and visualise relations as text on the finding card. The "graph view" becomes a thing when we have 10+ entity types AND a customer who pays for it.
Numeric health score — we use three buckets (Healthy / Attention / At risk) calculated from the count of open findings by severity. No "76/100" because every customer asks "why not 78" and we can't answer.
PDF/HTML client reports — the dashboard IS the report. We'll add an export when a paying customer asks. Until then, "Export to PDF" is feature-creep.
Rollback for every fix — providing rollback: is optional. For reversible operations (firewall, backup policy) it's worth writing; for cumulative apt installs it's noise. Don't force every preset to declare it.
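
Two of the deferrals above rest on very small mechanisms, sketched here in Python for illustration only. The three bucket names and the `type:name` identifier shape come from the list; the exact rule (any open CRITICAL means At risk, any open WARN means Attention) and the function names are assumptions, since the document only says the bucket is derived from open-finding counts by severity.

```python
from collections import Counter


def health_bucket(findings) -> str:
    """findings: iterable of (severity, status) string pairs."""
    open_counts = Counter(sev for sev, status in findings if status == "OPEN")
    if open_counts["CRITICAL"] > 0:
        return "At risk"
    if open_counts["WARN"] > 0:
        return "Attention"
    return "Healthy"                 # only INFO findings, or nothing open


def split_entity(identifier: str):
    """Split a flat relatedEntities item like 'container:pg-main' into (type, name)."""
    kind, _, name = identifier.partition(":")
    return kind, name
```

A rule this simple is easy to explain to a customer, which is the reason given above for avoiding a numeric score.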

How this maps onto what we already shipped

The architecture as of Phase 4a already provides ~60% of what's needed:

Autopilot block	Existing primitive
Discovery Engine	HyprNode collector — needs more scanners
Action Engine	Preset render + Job queue + agent runner
Action audit	`AuditEvent` (Phase 4a)
First Fix template	`hypervault-restic` — needs `verify:` to formalise it
Apply pipeline	`bash -s` over agent + live tail
State after apply	Job row + new heartbeat / scan

What's missing:

Finding + Recommendation schemas (Part A).
At least one real scanner in HyprNode (Part B).
The Findings API + the Health page (Part C).
The Fix Contract — risk_level, verify: steps in the preset schema (Part D).
The auto-resolve loop: when a fix succeeds, the relevant findings flip to RESOLVED on the next scan (Part E).
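
The auto-resolve step in Part E can be sketched as a reconciliation between stored findings and the keys reported by the latest scan. This is Python for illustration only; the statuses come from the Finding definition, while the `reconcile` function and its policies (a resolved finding that reappears is reopened, a snoozed finding that is still present stays snoozed) are assumptions.

```python
def reconcile(statuses: dict, seen_keys: set) -> dict:
    """statuses: finding key -> status before the scan.
    seen_keys: finding keys reported by the latest scan.
    Returns the statuses after the scan.
    """
    result = {}
    for key, status in statuses.items():
        if key in seen_keys:
            # Still detected: reopen if it had been resolved, else keep as is.
            result[key] = "OPEN" if status == "RESOLVED" else status
        else:
            # No longer detected: the fix (or something else) cleared it.
            result[key] = "RESOLVED"
    for key in seen_keys:
        result.setdefault(key, "OPEN")   # newly detected findings
    return result
```

Driving resolution from the scan rather than from the job result means a finding only closes when the scanner stops seeing the problem.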

The demo we're building toward

Operator opens HyprBox. Health page shows:

vps-client-01            At risk
  • CRITICAL  TLS certificate for app.client.fr expires in 7 days
  • WARN     Postgres container 'pg-main' has no verified backup
  • INFO     Docker volume 'old_app_data' unused for 32 days

Click the TLS finding. Modal:

Recommendation: Renew the Caddy certificate
Risk level: Safe

Plan (showing rendered bash):
  caddy reload    # triggers automatic renewal
  wait 30s
  verify: openssl s_client -connect app.client.fr:443
          returns a cert with notAfter > 60 days

[Cancel]  [Apply ▸]

Click Apply. Job page opens with live tail. Under a minute later (the plan includes a 30-second wait):

✓ Renew complete
✓ verify: certificate valid until 2026-08-25

Finding `tls.expiring` → RESOLVED.

That is the product. Everything else exists to make it trustworthy.

When to add a new scanner

A scanner is worth adding when:

The thing it detects is specific (not "CPU is sometimes high" — "container worker has been OOM-killed 3 times in 24h").
There's a plausible fix the operator could apply, even if we don't yet ship the corresponding recommendation.
The detection logic is cheap — a scanner that needs to ssh into another host, parse 200 MB of logs, or call a paid API isn't an MVP scanner.

Each scanner ships with:

The Go code in agent/hyprnode/internal/scanner/<name>/.
A row in docs/SCANNERS.md documenting what it looks at, on which distros, and what findings it can emit.
(Eventually) a corresponding recommendation.

When to add a new finding type

When you have a scanner that can detect it AND a clear human sentence to describe it AND you're willing to draft the matching recommendation.

A finding type without a recommendation is acceptable — it shows up as "At risk" with no Apply button, just a description. That's still useful ("we noticed this, you should do something about it"). It just isn't autopilot — it's a smarter Nagios.

The autopilot pitch only holds when most findings have recommendations and most recommendations have working fixes.
