Roadmap
Source of truth for what we ship next, in what order, and why. If you find yourself re-deciding a question that's answered here, the answer here wins. Update this doc instead of arguing.
Last frozen: 2026-06-02 (Phase 8 complete + two internal security passes).
0. TL;DR
We are building HyprBox = infrastructure autopilot for freelance + MSP.
The product is the loop discover → finding → recommend → preview → apply → verify.
Everything we ship is either:
- A new scanner that produces a new kind of Finding.
- A new fix preset that resolves an existing finding type.
- Infrastructure that makes the loop trustworthy at scale (auth, migrations, cancel, audit UI, observability).
- Polish that makes the demo land (UI, copywriting, performance).
Anything that doesn't fit one of those four buckets goes in the Anti-goals list at the bottom of this file. We do not build it.
Read AUTOPILOT.md for the why; this doc is the what + when.
1. Where we are
| Phase | Status | Shipped |
|---|---|---|
| 1 — MVP | ✅ Done | Heartbeat → dashboard, agent, CLI |
| 2 — Auth + presets + SSE | ✅ Done | JWT + agent tokens, 3 presets, CLI preset commands, hardening |
| 3 — Remote apply | ✅ Done | Job queue + atomic claim + agent jobrunner + live tail |
| 4a — HyprVault + audit | ✅ Done | Backups (policy + run), audit log on sensitive routes |
| 4b — Autopilot loop | ✅ Done | Finding + Recommendation models, TLS scanner, Health page, Fix contract (risk_level + verify), auto-resolve |
| 5 — Scanner expansion | ✅ Done | SSH password-auth + disk usage + Postgres no-backup scanners, 2 new fix presets, inventory cross-ref endpoint, 10 Go parser tests |
| 6 — Make it real | ✅ Done | Prisma migrations + drift check, RBAC (VIEWER/OPERATOR/ADMIN), WebSocket reverse channel + RUNNING-job cancel, admin audit UI with CSV export |
| 7 — HyprGuard | ✅ Done | 9-check security audit scanner (kernel updates, root password, NOPASSWD sudo, /etc world-writable, UFW, fail2ban, unattended-upgrades, LUKS, auditd), apt-security-upgrade preset, conditional severity on UFW for public-IP hosts |
| 8 — Production polish | ✅ Done | 8.1 output offload (chunk-in-DB) + 8.2 change-password + forgot-password reset + 8.3 per-user rate limit + 8.4 GHCR release pipeline + 8.5 Playwright E2E + 8.6 per-node agent config. Fully shipped |
| 9 — HyprWatch v1 | 🟡 v1 | Monitoring as a find→fix→verify loop: opt-in (hyprwatch per-node config) monitoring.absent scanner → monitoring.install-stack recommendation → monitoring-only preset (now with verify: + risk_level) → auto-resolve. Deferred: Loki/log shipping, Alertmanager routing verify, multi-node federation, drift detection. See docs/HYPRWATCH.md |
Tests: 167 vitest specs + 32 Go specs + 3 Playwright E2E, all passing. Typecheck clean on API + Web. Go vet clean on agent + CLI. CI green.
This is the baseline. Everything below is forward-looking.
2. Phase 5 — Scanner expansion (shipped 2026-05-30)
Goal. Take the autopilot loop from "works for TLS" to "covers the three pains every SMB server has".
Why this was next. The loop is the product, but ONE finding type is not a product. Three well-chosen findings turn the dashboard from "demo gimmick" into "you should run this on every server".
Shipped:
- ssh.password-auth-enabled (CRITICAL): sshd_config parser handles Include + Match + first-write-wins semantics. Recommendation: ssh.disable-password-auth (risk: DANGEROUS) with hard precheck on authorized_keys, sshd_config backup, sshd -t validation, auto-revert on validation failure.
- disk.usage-high (WARN ≥80%, CRITICAL ≥90%): per mount-point, skips pseudo filesystems. Recommendation: docker-prune (risk: CONFIRM) with "must be docker host" precheck and verify-step that re-checks usage drop.
- postgres.no-backup (CRITICAL): agent ships a docker inventory via POST /api/nodes/inventory; the cross-reference logic lives on the API (containers matching postgres* × BackupPolicy on this node). Reconciliation built in: container removed or policy added → finding auto-resolves on the next inventory tick.
- 10 Go unit tests for the sshd_config parser.
The fresh-VM success metric from the original spec (≥3 findings, ≥2 with working Apply) is achievable as written.
5.1 SSH password authentication
| Scanner | agent/hyprnode/internal/scanner/ssh.go — parse /etc/ssh/sshd_config |
| Finding | ssh.password-auth-enabled (CRITICAL) |
| Key | sshd:/etc/ssh/sshd_config (one per file path; supports Include directives) |
| Recommendation | ssh.disable-password-auth → existing preset extended with verify |
| Preset risk | DANGEROUS — need an authorised key first |
| Verify | sshd -t succeeds AND grep -qE '^PasswordAuthentication\s+no' /etc/ssh/sshd_config |
| Precheck | At least one entry in /root/.ssh/authorized_keys OR /home/*/.ssh/authorized_keys |
Definition of done. Spec: scanner emits the finding; recommendation auto-links it; preview shows the bash with the precheck guard; verify-pass auto-resolves the finding. Apply fails cleanly if precheck doesn't pass (no keys → exit 1 before touching anything).
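The first-write-wins rule the scanner depends on can be illustrated in a few lines. The shipped parser is Go code in the agent; this is only a TypeScript sketch of the semantics, and the helper name passwordAuthEnabled is hypothetical. It assumes OpenSSH's documented behaviour (the first value obtained for a keyword wins, and PasswordAuthentication defaults to yes), treats the first Match line as the end of the global section, and does not expand Include directives.

```typescript
// Hypothetical helper: effective *global* PasswordAuthentication value.
// Include expansion and Match-block evaluation are out of scope here.
function passwordAuthEnabled(sshdConfig: string): boolean {
  for (const raw of sshdConfig.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    // sshd accepts both "Keyword value" and "Keyword=value".
    const parts = line.split(/[\s=]+/);
    const keyword = parts[0].toLowerCase(); // keywords are case-insensitive
    // A Match line ends the global section; conditional overrides are ignored.
    if (keyword === "match") break;
    if (keyword === "passwordauthentication" && parts.length > 1) {
      // First occurrence wins; later duplicates are never reached.
      return parts[1].toLowerCase() !== "no";
    }
  }
  // Keyword absent: OpenSSH defaults to "yes".
  return true;
}
```

A config with "PasswordAuthentication no" followed by a later "PasswordAuthentication yes" therefore counts as disabled, which is the case a naive last-line-wins grep would get wrong.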
5.2 Disk usage high
| Scanner | agent/hyprnode/internal/scanner/disk.go — gopsutil/v4/disk.Partitions + Usage |
| Finding | disk.usage-high (WARN ≥80%, CRITICAL ≥90%) per mount point |
| Key | mount:/var (mount path; survives device renames) |
| Recommendation | disk.docker-prune (one of two paths: docker cleanup or just "no fix") |
| Preset risk | CONFIRM |
| Preset | docker-prune — docker system prune -af --volumes with a "must be a Docker host" precheck |
| Verify | Re-run the disk check, new usage < before (or below threshold) |
Definition of done. Findings are reported per mount point: not "the host is at 87%" but "/var is at 87% because Docker has 23 GiB of dangling images". Thresholds are env-configurable.
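The threshold logic for 5.2 is small enough to sketch. The real scanner is Go and uses gopsutil; the TypeScript below only illustrates the classification step. The function name classifyMount and its parameter names are assumptions, and the defaults mirror the 80% / 90% thresholds from the table.

```typescript
type DiskSeverity = "WARN" | "CRITICAL";

interface DiskFinding {
  type: "disk.usage-high";
  key: string; // "mount:<path>", stable across device renames
  severity: DiskSeverity;
  usedPercent: number;
}

// Hypothetical classifier: returns null when the mount is below the WARN line.
function classifyMount(
  mountPath: string,
  usedPercent: number,
  warnAt: number = 80,
  criticalAt: number = 90,
): DiskFinding | null {
  if (usedPercent < warnAt) return null;
  return {
    type: "disk.usage-high",
    key: `mount:${mountPath}`,
    severity: usedPercent >= criticalAt ? "CRITICAL" : "WARN",
    usedPercent,
  };
}
```

Keying on the mount path rather than the device is what lets the verify step compare usage before and after the fix against the same finding.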
5.3 PostgreSQL without a backup
The marquee demo finding. This is what makes someone say "oh, I want this".
| Scanner | agent/hyprnode/internal/scanner/postgres.go |
| Detection | Docker container with image matching postgres:* AND no BackupPolicy referencing this node AND the container has a mounted volume |
| Finding | postgres.no-backup (CRITICAL — data loss risk) |
| Key | container:<container_name> |
| Recommendation | postgres.add-backup → hypervault-restic (already seeded) |
| Preset variables | backup_paths auto-suggested to the container's volume mount points |
The detection cross-references DB state (BackupPolicy rows) with agent-reported state (running containers). The cross-reference logic lives on the API, not the agent: the agent ships a docker inventory; the API decides "this looks like a Postgres without a paired policy".
Definition of done. Detection has false-positive rate near zero on the test fleet. False positives are unacceptable here because the fix is not idempotent at zero cost (Restic init on a fresh S3 bucket).
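The cross-reference described in 5.3 could look roughly like the sketch below. All type and function names (ContainerInfo, BackupPolicyRow, findUnbackedPostgres) are hypothetical, the image match is simplified to "postgres" with an optional tag, and the inventory and policy rows are passed in as plain arrays rather than read from the database.

```typescript
interface ContainerInfo {
  name: string;
  image: string;
  mounts: string[]; // volume mount points reported by the agent
}

interface BackupPolicyRow {
  nodeId: string;
}

interface NoBackupFinding {
  type: "postgres.no-backup";
  key: string; // "container:<container_name>"
  suggestedBackupPaths: string[];
}

// Hypothetical API-side check: Postgres container + mounted volume + no policy on this node.
function findUnbackedPostgres(
  nodeId: string,
  inventory: ContainerInfo[],
  policies: BackupPolicyRow[],
): NoBackupFinding[] {
  const nodeHasPolicy = policies.some((p) => p.nodeId === nodeId);
  if (nodeHasPolicy) return [];
  return inventory
    .filter((c) => /^postgres(:|$)/.test(c.image) && c.mounts.length > 0)
    .map((c) => ({
      type: "postgres.no-backup" as const,
      key: `container:${c.name}`,
      suggestedBackupPaths: c.mounts,
    }));
}
```

Because the function is recomputed from the latest inventory, removing the container or adding a policy makes the finding disappear from the result, which matches the reconciliation behaviour described for the inventory tick.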
5.4 Phase 5 success metric
A fresh Debian VM running nginx + postgres-via-docker + caddy, scanner
tick once: dashboard shows at least 3 findings with at least 2 of
them having a working Apply button.
Estimated complexity. ~2 sessions (1 for the 3 scanners + 1 for the preset polish + auto-resolve specs + demo recording).
3. Phase 6 — Make it real (shipped 2026-05-31)
Shipped:
- 6.1 Prisma migrations: baseline migration 20260531000000_init captures cumulative schema through Phase 5; orphan migrations cleared. Docker entrypoint flipped from db push to migrate deploy. CI gained a migrate diff --from-migrations drift check using a shadow DB so a PR that edits schema.prisma without db:migrate fails fast.
- 6.2 RBAC: UserRole enum (VIEWER/OPERATOR/ADMIN) baked into JWT at sign time. requireRole() middleware gates every mutation route (jobs, tokens, backups, findings/snooze, findings/resolve). /api/users (admin-only) lists + manages teammates; role updates refuse to demote/delete the last admin. First user on a fresh deployment auto-promotes to ADMIN; subsequent registrations land as OPERATOR.
- 6.3 WebSocket reverse channel: @fastify/websocket plugin, /api/agent/ws endpoint with per-node connection registry. Server pushes {type:"wakeup"} on Job.create (sub-second pickup instead of waiting for the 5-min poll tick); pushes {type:"cancel", jobId:…} on operator cancel of a RUNNING job → agent's job context is cancelled → bash SIGTERMed. Falls back transparently to polling when the WS is down (gorilla/websocket reconnect with exponential backoff, 1s → 60s). RUNNING-job cancel without WS connection is refused cleanly (409).
- 6.4 Admin audit UI: GET /api/audit (admin-only) with filters (user/email substring, action, resource type), pagination (50/page), CSV export (?format=csv, cap 10k rows). /dashboard/admin/audit page gated client-side AND server-side; "Admin" nav entry only rendered for ADMIN role.
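The reconnect schedule mentioned in 6.3 (exponential backoff, 1s → 60s) can be sketched as a pure function. The agent's actual reconnect loop is Go; the TypeScript below assumes a doubling factor and no jitter, neither of which is stated above, and the name reconnectDelayMs is hypothetical.

```typescript
// Delay before reconnect attempt `attempt` (0-based): 1s, 2s, 4s, ... capped at 60s.
// Doubling and the absence of jitter are assumptions of this sketch.
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 60000,
): number {
  // Clamp the exponent so very large attempt counts stay finite.
  const exponent = Math.min(Math.max(attempt, 0), 30);
  return Math.min(baseMs * Math.pow(2, exponent), capMs);
}
```

With these defaults the cap is reached on the seventh attempt, after which the agent would retry once a minute while polling keeps jobs flowing.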
The "3-person team self-serves" success metric from §3.5 is achievable end-to-end: admin invites teammates via /register, role-management UI lets them become viewer/operator, audit log captures everything.
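The RBAC rules behind that metric can be written down compactly. This is a sketch, not the middleware itself: it assumes the three roles are ordered (each role includes the permissions of the ones before it), and the helper names hasRole, roleForNewUser and canChangeRole are hypothetical. Only the demotion half of the last-admin guard is shown.

```typescript
type UserRole = "VIEWER" | "OPERATOR" | "ADMIN";

// Assumed ordering: VIEWER < OPERATOR < ADMIN.
const ROLE_RANK: Record<UserRole, number> = { VIEWER: 0, OPERATOR: 1, ADMIN: 2 };

function hasRole(actual: UserRole, required: UserRole): boolean {
  return ROLE_RANK[actual] >= ROLE_RANK[required];
}

// First account on a fresh deployment becomes ADMIN, later ones OPERATOR.
function roleForNewUser(existingUserCount: number): UserRole {
  return existingUserCount === 0 ? "ADMIN" : "OPERATOR";
}

interface Teammate {
  id: string;
  role: UserRole;
}

// Refuse a role change that would leave the deployment without any admin.
function canChangeRole(users: Teammate[], targetId: string, newRole: UserRole): boolean {
  const target = users.find((u) => u.id === targetId);
  if (!target) return false;
  if (target.role !== "ADMIN" || newRole === "ADMIN") return true;
  const otherAdmins = users.filter((u) => u.role === "ADMIN" && u.id !== targetId);
  return otherAdmins.length > 0;
}
```

Because the role is baked into the JWT at sign time, a check like hasRole needs no database lookup on the request path; the trade-off is that a role change only takes effect once the user's token is reissued.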
4. Phase 7 — HyprGuard (shipped 2026-05-31)
Goal. A scheduled security audit that produces a structured list of findings, not a 200-line bash log no one reads — Lynis-shaped, but every item is a Finding with a Recommendation where one exists.
Shipped:
- agent/hyprnode/internal/scanner/audit.go: runs all 9 checks every scanner tick. Pure parsing in audit_parse.go (testable against captured fixture output, no shell required).
The 9 checks and their stable keys:
| Check | Type | Severity |
|---|---|---|
| apt list --upgradable shows -security packages | audit.kernel-updates-pending | WARN |
| /etc/shadow root row has a real password hash | audit.root-password-set | WARN |
| NOPASSWD: in /etc/sudoers{,.d/*} | audit.sudo-nopasswd | WARN |
| find /etc -perm -o+w returns ≥1 file | audit.world-writable-etc | WARN |
| UFW missing or Status: inactive or default-allow | audit.ufw-inactive | CRITICAL on public-IP host, WARN else |
| systemctl is-active fail2ban ≠ "active" | audit.fail2ban-down | WARN |
| /etc/apt/apt.conf.d/20auto-upgrades missing or disabled | audit.no-automatic-updates | INFO |
| lsblk -no TYPE has no crypt row | audit.no-disk-encryption | INFO |
| systemctl is-active auditd ≠ "active" | audit.no-auditd | INFO |
- presets/apt-security-upgrade.yaml (CONFIRM): runs unattended-upgrade --debug --minimal-upgrade-steps against pending security packages, then verifies apt list --upgradable reports zero -security rows remaining.
Recommendations seeded:
- audit.kernel-updates-pending → apt-security-upgrade
- audit.ufw-inactive / audit.fail2ban-down / audit.no-automatic-updates → existing server-light (one rec, three finding types; that's why findingTypes is a JSON array).
The other 5 findings stay MANUAL: passwd -l root, sudoers review, per-file chmod o-w, LUKS-on-reprovision, auditd-on-purpose are decisions an operator owns. The UI just shows them without an Apply button.
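The one-recommendation-to-many-finding-types mapping can be illustrated with a lookup over the seeded rows. The row shape (RecommendationRow) and the function name presetFor are hypothetical; the preset and finding-type names are the ones listed above.

```typescript
interface RecommendationRow {
  preset: string;
  findingTypes: string[]; // a JSON array in the design described above
}

const seeded: RecommendationRow[] = [
  { preset: "apt-security-upgrade", findingTypes: ["audit.kernel-updates-pending"] },
  {
    preset: "server-light",
    findingTypes: ["audit.ufw-inactive", "audit.fail2ban-down", "audit.no-automatic-updates"],
  },
];

// Returns the preset to offer, or null for MANUAL findings (no Apply button).
function presetFor(findingType: string, recs: RecommendationRow[] = seeded): string | null {
  const rec = recs.find((r) => r.findingTypes.indexOf(findingType) !== -1);
  return rec ? rec.preset : null;
}
```

A null result is what the UI would treat as MANUAL: the finding is shown, but there is nothing to apply.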
Done-when metric (from the original spec): A fresh Ubuntu 24.04 VM
with no hardening produces ≥6 findings; applying server-light resolves
≥4 of them on the next scanner tick. Achievable as wired.
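The conditional severity on audit.ufw-inactive (CRITICAL on a public-IP host, WARN otherwise) needs some notion of "public". How the agent decides this is not specified here; the sketch below assumes a host counts as public when any of its IPv4 addresses falls outside the RFC 1918, loopback and link-local ranges. Both function names are hypothetical, and IPv6 and carrier-grade NAT ranges are ignored for brevity.

```typescript
// True for RFC 1918, loopback (127/8) and link-local (169.254/16) IPv4 addresses.
function isPrivateIPv4(ip: string): boolean {
  const octets = ip.split(".").map(Number);
  if (octets.length !== 4 || octets.some((o) => !Number.isInteger(o) || o < 0 || o > 255)) {
    throw new Error(`not an IPv4 address: ${ip}`);
  }
  const a = octets[0];
  const b = octets[1];
  return (
    a === 10 ||
    a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

// Assumed rule: one non-private address is enough to escalate to CRITICAL.
function ufwSeverity(hostIPv4s: string[]): "CRITICAL" | "WARN" {
  return hostIPv4s.some((ip) => !isPrivateIPv4(ip)) ? "CRITICAL" : "WARN";
}
```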
5. Phase 8 — Production polish (complete)
Shipped in this pass:
- 8.2 Change password: POST /api/auth/change-password requires the current password, refuses re-use, audits success and failure separately. Settings page gains a "Change password" form with inline validation. (Forgot-password email reset was deferred at this point and shipped later in 0.9.3; see §5b.)
- 8.3 Per-user rate limiting: custom keyGenerator on @fastify/rate-limit buckets Bearer/cookie-authenticated traffic by user.sub and falls back to req.ip for anonymous. Solves the NAT'd-office case where the global per-IP bucket was effectively per-organisation.
- 8.4 GHCR release pipeline: .github/workflows/release.yml matrix over the three Docker images, triggered by v*.*.* tag pushes (or workflow_dispatch with an explicit ref). Stable releases move :latest; pre-releases push the version tag only.
- 8.6 Per-node agent config: new schema column Node.agentConfig (JSON string) + migration. PATCH /api/nodes/:id/config (operator-auth) merges into existing config; null on a key clears it. GET /api/nodes/me/config (node-token, pinned-only) returns the live config to the agent. Agent pulls before each scanner tick: tls_paths overrides the env default for the TLS scanner, and scan_interval now drives the scanner ticker cadence (runner.Tick returns the next interval; honored since the 2026-06-02 audit pass).
- New "Agent config" panel on the node detail page edits TLS paths + scan interval.
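The merge semantics of the per-node config PATCH (new keys overwrite, null clears a key) can be sketched as follows. The function names are hypothetical, and the merge is assumed to be shallow, which is not stated above.

```typescript
type AgentConfig = Record<string, unknown>;

// Shallow merge (an assumption): patch keys overwrite, null deletes, the rest is kept.
function mergeAgentConfig(current: AgentConfig, patch: AgentConfig): AgentConfig {
  const next: AgentConfig = { ...current };
  for (const key of Object.keys(patch)) {
    if (patch[key] === null) {
      delete next[key];
    } else {
      next[key] = patch[key];
    }
  }
  return next;
}

// Node.agentConfig is a JSON string column, so round-trip through JSON.
function applyConfigPatch(stored: string | null, patch: AgentConfig): string {
  const current: AgentConfig = stored ? JSON.parse(stored) : {};
  return JSON.stringify(mergeAgentConfig(current, patch));
}
```

Clearing a key this way lets the agent fall back to its env default for that setting instead of storing an explicit "unset" value.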
Phase 8 fully shipped — nothing deferred. (8.1 output offload,
8.2 change-password + forgot-password reset, 8.3–8.6 all landed; see §5b.)
The only loosely-open follow-on idea is opt-in email notifications on backup
failures, which rides on the same lib/mailer.ts transport once SMTP is wired.
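For reference, the bucketing rule from 8.3 reduces to a small key function. The request shape (RateLimitRequest) and the user: / ip: prefixes are assumptions of this sketch; in the API this logic would be passed as the keyGenerator option of @fastify/rate-limit.

```typescript
interface RateLimitRequest {
  ip: string;
  user?: { sub: string }; // present once a Bearer token or cookie has been verified
}

// Authenticated traffic gets a per-user bucket; anonymous traffic stays per-IP.
function rateLimitKey(req: RateLimitRequest): string {
  return req.user ? `user:${req.user.sub}` : `ip:${req.ip}`;
}
```

Two operators behind the same office NAT now land in separate buckets, while unauthenticated requests from that address still share one. The prefixes keep a user id from ever colliding with an IP string.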
5b. Phase 8 — leftover scope (deferred)
8.1 Output offload for long logs (shipped 2026-06-01, chunk-in-DB)
Decision made: chunk-in-DB as the self-hosted default (no extra infra).
A stream >4 MB no longer loses its head: the full ordered log is persisted in
JobOutputChunk (cheap appends), Job.stdout/stderr keep a bounded 4 MB
inline tail, and Job.stdoutBytes/stderrBytes track the true totals.
GET /api/jobs/:id/output/raw?stream=… streams the full log; the job detail
page links to it when truncated. lib/jobOutput.ts is the seam — an
S3/MinIO backend is a future opt-in (reimplement appendJobOutput /
readJobOutput, routes untouched). Deviation from the original sketch: we keep
a 4 MB inline tail (not first+last 100 KB) — simpler, and the full log is one
click away.
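The chunk-in-DB scheme can be modelled in memory to show how the three pieces (ordered chunks, bounded inline tail, true totals) relate. This is a sketch only: the state object stands in for the JobOutputChunk rows and the Job columns, the signatures of appendJobOutput and readJobOutput are guesses, and sizes are counted in string characters rather than bytes to stay dependency-free.

```typescript
const INLINE_TAIL_LIMIT = 4 * 1024 * 1024; // 4 MB inline tail

interface JobOutputState {
  chunks: string[];   // stands in for JobOutputChunk rows, in append order
  inlineTail: string; // stands in for Job.stdout
  totalBytes: number; // stands in for Job.stdoutBytes
}

function newJobOutput(): JobOutputState {
  return { chunks: [], inlineTail: "", totalBytes: 0 };
}

// Append is cheap: one new chunk, plus a re-trim of the bounded tail.
function appendJobOutput(
  state: JobOutputState,
  chunk: string,
  tailLimit: number = INLINE_TAIL_LIMIT,
): void {
  state.chunks.push(chunk);
  state.totalBytes += chunk.length;
  const combined = state.inlineTail + chunk;
  state.inlineTail =
    combined.length > tailLimit ? combined.slice(combined.length - tailLimit) : combined;
}

// The full ordered log, as the raw output route would stream it.
function readJobOutput(state: JobOutputState): string {
  return state.chunks.join("");
}

// The job detail page links to the raw log only when this is true.
function isTruncated(state: JobOutputState, tailLimit: number = INLINE_TAIL_LIMIT): boolean {
  return state.totalBytes > tailLimit;
}
```

Swapping in an S3/MinIO backend would mean replacing the chunk storage behind these two functions while the tail and totals stay on the Job row.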
8.2 Account self-service — email reset (shipped 2026-06-01)
Change-password (0.9.0) + forgot-password reset (0.9.3) both shipped.
Reset = single-use SHA-256-hashed tokens (1h TTL, burned on use + siblings),
anti-enumeration (forgot-password always 200), rate-limited, audited
(password.reset.{request,success,failure}). Email delivery goes through the
lib/mailer.ts seam — default logs the link (dev/self-hosted); SMTP/SES/
Postmark is a transport swap, callers untouched. The same transport could
later power 8.2bis: email notifications on backup failures (opt-in per
user) — still open.
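The token lifecycle described for the reset flow (hash at rest, 1h TTL, burned on use together with its siblings) can be sketched with an in-memory store. The class ResetTokenStore and its method names are hypothetical, the token length is an arbitrary choice, and a real implementation would persist rows in the database instead of an array.

```typescript
import { createHash, randomBytes } from "node:crypto";

const TOKEN_TTL_MS = 60 * 60 * 1000; // 1h TTL

interface ResetTokenRow {
  userId: string;
  tokenHash: string; // SHA-256 hex digest; the raw token is never stored
  expiresAt: number; // epoch milliseconds
}

const sha256 = (value: string): string =>
  createHash("sha256").update(value).digest("hex");

class ResetTokenStore {
  private rows: ResetTokenRow[] = [];

  // Returns the raw token, which goes into the reset link exactly once.
  issue(userId: string, now: number): string {
    const raw = randomBytes(32).toString("hex");
    this.rows.push({ userId, tokenHash: sha256(raw), expiresAt: now + TOKEN_TTL_MS });
    return raw;
  }

  // Returns the user id on success, null otherwise. A successful use burns
  // the token and every sibling token issued for the same user.
  consume(raw: string, now: number): string | null {
    const hash = sha256(raw);
    const row = this.rows.find((r) => r.tokenHash === hash);
    if (!row || row.expiresAt <= now) return null;
    this.rows = this.rows.filter((r) => r.userId !== row.userId);
    return row.userId;
  }
}
```

Storing only the digest means a leaked table does not yield usable reset links, and burning siblings closes the window where an older, still-valid email could be replayed after a successful reset.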
8.5 E2E tests with Playwright (shipped 2026-06-01)
Playwright wired at the repo root (playwright.config.ts + tests/e2e/,
pnpm test:e2e). Auth is established once in auth.setup.ts and reused via
storageState — keeps the suite under the login route's 10/min rate limit.
Three non-destructive specs cover the "I broke the auth cookie" regression
class that inject() can't see: login → edge-proxy gate → dashboard lists a
node; Health page surfaces a finding with its severity; opening a CONFIRM
finding shows the recommendation + Apply button; a MANUAL finding shows
no Apply button.
Deliberately read-only — the specs never click Apply, because queuing a
preset would be executed for real by a live agent on the host. The "queue a
job → wait for SUCCEEDED → see it in the list" tail from the original spec is
left for CI, where a sacrificial target + a stack-orchestration step
(docker compose up web+api+db, seed a node/findings) can run it safely.
Today the specs assume the dev stack is already up with an agent reporting
≥1 node + findings.
Estimated complexity per item. 0.5–1.5 sessions each. The other Phase 8 items shipped together; these three (8.1, the 8.2 email reset, 8.5) were an explicit follow-up.
Parked — not scheduled
- Anonymized + encrypted dev-DB snapshot (sanitize PII/secrets → encrypt → devs restore locally; app/login must still work). Proposed 2026-06-01, parked: not in any phase, no audit finding requires it, low value while single-operator. Revisit when the team grows or a finding demands it. If done, keep it tiny: 2 scripts (scripts/dev-data-{export,import}.sh) + 3 surgical SQL ops (truncate risky blobs → 200B; wipe secret columns; TRUNCATE AuditEvent/Heartbeat), smoke = boot + login, ≤2h. Do NOT build a field-by-field transform pipeline.
6. Phase 9 — Commercial edition (open-core, only after one paying customer)
This is the commercial edition, separate from the Apache-2.0 OSS core (see LICENSE/NOTICE). Open-core: the core stays Apache-2.0; these ship under a commercial license. Locked behind a real "I'll pay for this" conversation — don't build speculatively.
- SSO via OIDC (Google Workspace, Microsoft Entra ID).
- 2FA / WebAuthn for admin accounts.
- Multi-org / multi-tenant (orgs own nodes, users belong to orgs).
- Per-client PDF reports ("here's what HyprBox did for you this month").
- White-label branding (logo, colours, domain).
If a customer demands one of these but won't write a PO, say no. The first PO is expected to come from a design-partner MSP who says "I'd pay for multi-tenant" — that's the trigger to start this edition. We don't speculative-build commercial features.
7. Anti-goals (things we WILL NOT build)
When asked "can we…":
| Ask | Answer | Why |
|---|---|---|
| Infrastructure graph view | Not before 10+ entity types + a paying customer | Sexy to build, marginal until we have meat |
| Numeric health score (62/100) | No | Every customer asks "why not 64" — we can't answer. Buckets only. |
| Mobile app | No | Operator persona is at a laptop |
| Built-in alerting (page on call) | No | Customers keep PagerDuty/OpsGenie; we publish events for them to consume |
| Custom query language | No | If you need to query, use Postgres directly |
| Plugin system | No | API is the extension point |
| Real-time collab features | No | Single-operator-at-a-time is fine |
| Chat/Slack bot | No | The dashboard IS the interaction model |
| AI-generated recommendations | No (until a clear win) | Hand-curated recommendations are still the bar; LLMs are at suggestion quality, not "I trust this to run as root" quality |
| Windows server support | No | Linux only. Period. |
| Be a Terraform / IaC platform | No (command: runner at most) | Different control plane (cloud APIs + state/plan/drift); Spacelift/env0/Atlantis own it; off the SMB wedge |
| Be a Kubernetes / Helm platform | No (command: runner at most) | Cluster API ≠ host-bash agent; Rancher/Portainer/ArgoCD own it; SMBs rarely self-run k8s |
This list is not ordered by likelihood of someone asking; it's ordered by how hard "no" is to enforce when the ask gets cute.
8. Decisions locked in
Things we keep re-deciding. Stop re-deciding. The answer is below.
| Question | Decision | Source |
|---|---|---|
| Polling vs WebSocket for agent | Polling until Phase 6.3 | ARCHITECTURE.md |
| Job script: render now or render at apply? | At queue time (audit trail) | JOBS.md |
| Should we offer "reads" RBAC? | No, reads stay open to authenticated users for now | This doc, Phase 6.2 |
| Per-finding-type vs per-finding recommendations | Per-type (one Recommendation row, many Findings link to it) | FINDINGS.md |
| Health score: number or bucket? | Three buckets, no number | AUTOPILOT.md |
| Should verify: failure trigger rollback: automatically? | No (Phase 4b). Operator decides. rollback: lands when a fix needs it. | PRESETS.md |
| Default risk_level when not declared? | CONFIRM (fail-safe) | PRESETS.md |
| Where does Restic's password live? | NOT in the DB. passwordRef pointer only. | HYPRVAULT.md |
| Multi-org from day one? | No — single workspace. Multi-tenant is Phase 9. | This doc |
If something feels under-decided, add a row here in a follow-up commit.
9. Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Scope creep on Phase 5 (operator asks for 10 new scanners) | High | Slows the loop | Ship 3 at most; defer the rest to Phase 7 (HyprGuard) |
| Phase 6.3 WebSocket adds reliability bugs | Medium | Job loss / double-runs | Fallback to polling; atomic claim still serialises both transports |
| Migrations cutover breaks existing prod stacks | Medium | Customer downtime | Baseline migration tested on a snapshot of the prod DB before cutover |
| A scanner false-positives at scale | High | Trust collapse | Treat every false positive as a Sev-2 bug; tighten the key first, the detection second |
| We try to do Phase 9 before a real customer | Medium | Wasted months on enterprise plumbing nobody asks for | Hard rule: no Phase 9 work without a signed PO |
10. Living document
Update this file when:
- A phase ships → flip its status in §1.
- The next phase's scope changes (added or dropped a chunk) → edit the phase block AND note it in CHANGELOG.md.
- Anti-goal #N gets challenged seriously (real customer money) → don't edit the anti-goal silently; open a discussion thread, link it here, and only then move the row.
The last-frozen date at the top is the only thing that should drift between PRs. Everything else is intentional.