
Roadmap

Source of truth for what we ship next, in what order, and why. If you find yourself re-deciding a question that's answered here, the answer here wins. Update this doc instead of arguing.

Last frozen: 2026-06-02 (Phase 8 complete + two internal security passes).


0. TL;DR

We are building HyprBox = infrastructure autopilot for freelance + MSP. The product is the loop discover → finding → recommend → preview → apply → verify. Everything we ship is either:

  1. A new scanner that produces a new kind of Finding.
  2. A new fix preset that resolves an existing finding type.
  3. Infrastructure that makes the loop trustworthy at scale (auth, migrations, cancel, audit UI, observability).
  4. Polish that makes the demo land (UI, copywriting, performance).

Anything that doesn't fit one of those four buckets goes in the Anti-goals list at the bottom of this file. We do not build it.

Read AUTOPILOT.md for the why; this doc is the what + when.


1. Where we are

Phase | Status | Shipped
1 — MVP | ✅ Done | Heartbeat → dashboard, agent, CLI
2 — Auth + presets + SSE | ✅ Done | JWT + agent tokens, 3 presets, CLI preset commands, hardening
3 — Remote apply | ✅ Done | Job queue + atomic claim + agent jobrunner + live tail
4a — HyprVault + audit | ✅ Done | Backups (policy + run), audit log on sensitive routes
4b — Autopilot loop | ✅ Done | Finding + Recommendation models, TLS scanner, Health page, Fix contract (risk_level + verify), auto-resolve
5 — Scanner expansion | ✅ Done | SSH password-auth + disk usage + Postgres no-backup scanners, 2 new fix presets, inventory cross-ref endpoint, 10 Go parser tests
6 — Make it real | ✅ Done | Prisma migrations + drift check, RBAC (VIEWER/OPERATOR/ADMIN), WebSocket reverse channel + RUNNING-job cancel, admin audit UI with CSV export
7 — HyprGuard | ✅ Done | 9-check security audit scanner (kernel updates, root password, NOPASSWD sudo, /etc world-writable, UFW, fail2ban, unattended-upgrades, LUKS, auditd), apt-security-upgrade preset, conditional severity on UFW for public-IP hosts
8 — Production polish | ✅ Done | 8.1 output offload (chunk-in-DB) + 8.2 change-password + forgot-password reset + 8.3 per-user rate limit + 8.4 GHCR release pipeline + 8.5 Playwright E2E + 8.6 per-node agent config. Fully shipped
9 — HyprWatch v1 | 🟡 v1 | Monitoring as a find→fix→verify loop: opt-in (hyprwatch per-node config) monitoring.absent scanner → monitoring.install-stack recommendation → monitoring-only preset (now with verify: + risk_level) → auto-resolve. Deferred: Loki/log shipping, Alertmanager routing verify, multi-node federation, drift detection. See docs/HYPRWATCH.md

Tests: 167 vitest specs + 32 Go specs + 3 Playwright E2E, all passing. Typecheck clean on API + Web. Go vet clean on agent + CLI. CI green.

This is the baseline. Sections 2–5 below record what shipped and why; forward-looking scope starts at §6.


2. Phase 5 — Scanner expansion (shipped 2026-05-30)

Goal. Take the autopilot loop from "works for TLS" to "covers the three pains every SMB server has".

Why this was next. The loop is the product, but ONE finding type is not a product. Three well-chosen findings turn the dashboard from "demo gimmick" into "you should run this on every server".

Shipped:

  • ssh.password-auth-enabled (CRITICAL) — sshd_config parser handles Include + Match + first-write-wins semantics. Recommendation: ssh.disable-password-auth (risk: DANGEROUS) with hard precheck on authorized_keys, sshd_config backup, sshd -t validation, auto-revert on validation failure.
  • disk.usage-high (WARN ≥80%, CRITICAL ≥90%) — per mount-point, skips pseudo filesystems. Recommendation: docker-prune (risk: CONFIRM) with "must be docker host" precheck and verify-step that re-checks usage drop.
  • postgres.no-backup (CRITICAL) — agent ships a docker inventory via POST /api/nodes/inventory; the cross-reference logic lives on the API (containers matching postgres* × BackupPolicy on this node). Reconciliation built in — container removed or policy added → finding auto-resolves on the next inventory tick.
  • 10 Go unit tests for the sshd_config parser.

The fresh-VM success metric from the original spec (≥3 findings, ≥2 with working Apply) is achievable as written.

5.1 SSH password authentication

Scanner | agent/hyprnode/internal/scanner/ssh.go — parse /etc/ssh/sshd_config
Finding | ssh.password-auth-enabled (CRITICAL)
Key | sshd:/etc/ssh/sshd_config (one per file path; supports Include directives)
Recommendation | ssh.disable-password-auth → existing preset extended with verify
Preset risk | DANGEROUS — need an authorised key first
Verify | sshd -t succeeds AND grep -qE '^PasswordAuthentication\s+no' /etc/ssh/sshd_config
Precheck | At least one entry in /root/.ssh/authorized_keys OR /home/*/.ssh/authorized_keys

Definition of done. Spec: scanner emits the finding; recommendation auto-links it; preview shows the bash with the precheck guard; verify-pass auto-resolves the finding. Apply fails cleanly if precheck doesn't pass (no keys → exit 1 before touching anything).
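The first-write-wins rule the scanner relies on can be sketched as follows. This is an illustration only: the real parser is Go and also handles Include and Match, which this sketch skips. The function name effectivePasswordAuth is hypothetical.

```typescript
// Hypothetical helper: resolve the global PasswordAuthentication value from
// sshd_config text. Include resolution and Match evaluation are left out.
function effectivePasswordAuth(config: string): "yes" | "no" {
  for (const raw of config.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    // Keyword and argument are separated by whitespace and/or a single "=".
    const [keyword = "", value = ""] = line.split(/[\s=]+/);
    // A Match block ends the global section; later lines are conditional.
    if (keyword.toLowerCase() === "match") break;
    if (keyword.toLowerCase() === "passwordauthentication") {
      // sshd keeps the first value it reads for a keyword.
      return value.toLowerCase() === "no" ? "no" : "yes";
    }
  }
  // OpenSSH defaults PasswordAuthentication to yes when it is not set.
  return "yes";
}
```

A later "PasswordAuthentication no" does not override an earlier "yes", which is why the verify step should not be trusted on a file where the directive appears twice.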

5.2 Disk usage high

Scanner | agent/hyprnode/internal/scanner/disk.go (gopsutil/v4/disk.Partitions + Usage)
Finding | disk.usage-high (WARN ≥80%, CRITICAL ≥90%) per mount point
Key | mount:/var (mount path; survives device renames)
Recommendation | disk.docker-prune (one of two paths: docker cleanup or just "no fix")
Preset risk | CONFIRM
Preset | docker-prune: docker system prune -af --volumes with a "must be a Docker host" precheck
Verify | Re-run the disk check, new usage < before (or below threshold)

Definition of done. Mount-point granularity (not "the host is at 87%" — "/var is at 87% because Docker has 23 GiB of dangling images"). Threshold env-configurable.
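A minimal sketch of the per-mount classification, assuming the thresholds come from two environment variables. The variable names DISK_WARN_PERCENT and DISK_CRITICAL_PERCENT and all function names are hypothetical; the doc only states that the threshold is env-configurable.

```typescript
type Severity = "WARN" | "CRITICAL";

interface Thresholds {
  warn: number;
  critical: number;
}

// Parse a percentage from an env value, falling back when unset or invalid.
function parsePercent(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return Number.isFinite(n) ? n : fallback;
}

// Hypothetical env names; defaults match the documented 80% / 90%.
function thresholdsFromEnv(env: Record<string, string | undefined>): Thresholds {
  return {
    warn: parsePercent(env["DISK_WARN_PERCENT"], 80),
    critical: parsePercent(env["DISK_CRITICAL_PERCENT"], 90),
  };
}

// Returns null when the mount is below the WARN threshold (no finding).
function classifyMount(usedPercent: number, t: Thresholds): Severity | null {
  if (usedPercent >= t.critical) return "CRITICAL";
  if (usedPercent >= t.warn) return "WARN";
  return null;
}

// The finding key is the mount path, so it survives device renames.
function findingKey(mountPath: string): string {
  return `mount:${mountPath}`;
}
```

With the defaults, /var at 87% yields a WARN finding keyed mount:/var.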

5.3 PostgreSQL without a backup

The marquee demo finding. This is what makes someone say "oh, I want this".

Scanner | agent/hyprnode/internal/scanner/postgres.go
Detection | Docker container with image matching postgres:* AND no BackupPolicy referencing this node AND the container has a mounted volume
Finding | postgres.no-backup (CRITICAL — data loss risk)
Key | container:<container_name>
Recommendation | postgres.add-backup → hypervault-restic (already seeded)
Preset variables | backup_paths auto-suggested to the container's volume mount points

The detection cross-references DB state (BackupPolicy rows) with agent-reported state (running containers). The cross-reference logic lives on the API, not the agent — the agent ships a docker inventory; the API decides "this looks like a Postgres without a paired policy".

Definition of done. Detection has false-positive rate near zero on the test fleet. False positives are unacceptable here because the fix is not idempotent at zero cost (Restic init on a fresh S3 bucket).
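The API-side cross-reference and the reconciliation step could look roughly like this. The types and function names are hypothetical, and the image match is a simplification (last path segment starts with "postgres"); the real matching rules may be stricter.

```typescript
interface Container {
  name: string;
  image: string;
  mounts: string[];
}

interface BackupPolicy {
  nodeId: string;
}

// Keys of postgres.no-backup findings that should be open for this node.
function postgresNoBackupKeys(
  nodeId: string,
  inventory: Container[],
  policies: BackupPolicy[],
): string[] {
  // Any policy referencing this node clears the finding for its containers.
  if (policies.some((p) => p.nodeId === nodeId)) return [];
  return inventory
    .filter((c) => {
      // Drop a registry/namespace prefix such as "docker.io/library/".
      const repo = (c.image.split("/").pop() ?? "").toLowerCase();
      return repo.startsWith("postgres") && c.mounts.length > 0;
    })
    .map((c) => `container:${c.name}`);
}

// Compare currently open keys with the desired set from the latest inventory.
function reconcile(open: string[], desired: string[]) {
  const have = new Set(open);
  const want = new Set(desired);
  return {
    toOpen: desired.filter((k) => !have.has(k)),
    toResolve: open.filter((k) => !want.has(k)),
  };
}
```

Running reconcile on every inventory tick is what makes "container removed or policy added" resolve the finding without a separate code path.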

5.4 Phase 5 success metric

A fresh Debian VM running nginx + postgres-via-docker + caddy, scanner tick once: dashboard shows at least 3 findings with at least 2 of them having a working Apply button.

Estimated complexity. ~2 sessions (1 for the 3 scanners + 1 for the preset polish + auto-resolve specs + demo recording).


3. Phase 6 — Make it real (shipped 2026-05-31)

Shipped:

  • 6.1 Prisma migrations — baseline migration 20260531000000_init captures cumulative schema through Phase 5; orphan migrations cleared. Docker entrypoint flipped from db push to migrate deploy. CI gained a migrate diff --from-migrations drift check using a shadow DB so a PR that edits schema.prisma without db:migrate fails fast.
  • 6.2 RBAC: UserRole enum (VIEWER/OPERATOR/ADMIN) baked into JWT at sign time. requireRole() middleware gates every mutation route (jobs, tokens, backups, findings/snooze, findings/resolve). /api/users (admin-only) lists + manages teammates; role updates refuse to demote/delete the last admin. First user on a fresh deployment auto-promotes to ADMIN; subsequent registrations land as OPERATOR.
  • 6.3 WebSocket reverse channel: @fastify/websocket plugin, /api/agent/ws endpoint with per-node connection registry. Server pushes {type:"wakeup"} on Job.create (sub-second pickup instead of waiting for the 5-min poll tick); pushes {type:"cancel", jobId:…} on operator cancel of a RUNNING job → agent's job context is cancelled → bash SIGTERMed. Falls back transparently to polling when the WS is down (gorilla/websocket reconnect with exponential backoff, 1s → 60s). RUNNING-job cancel without WS connection is refused cleanly (409).
  • 6.4 Admin audit UI: GET /api/audit (admin-only) with filters (user/email substring, action, resource type), pagination (50/page), CSV export (?format=csv, cap 10k rows). /dashboard/admin/audit page gated client-side AND server-side; "Admin" nav entry only rendered for ADMIN role.
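The role rules in 6.2 can be sketched as pure functions. This assumes the roles form a hierarchy (ADMIN ≥ OPERATOR ≥ VIEWER), which the doc does not state explicitly; hasRole, roleForNewUser, and canChangeRole are hypothetical names, not the actual requireRole() implementation.

```typescript
type UserRole = "VIEWER" | "OPERATOR" | "ADMIN";

// Assumed ordering: a higher rank satisfies any lower requirement.
const RANK: Record<UserRole, number> = { VIEWER: 0, OPERATOR: 1, ADMIN: 2 };

function hasRole(actual: UserRole, required: UserRole): boolean {
  return RANK[actual] >= RANK[required];
}

// First user on a fresh deployment becomes ADMIN; everyone after is OPERATOR.
function roleForNewUser(existingUserCount: number): UserRole {
  return existingUserCount === 0 ? "ADMIN" : "OPERATOR";
}

interface User {
  id: string;
  role: UserRole;
}

// Refuse a role change that would leave the deployment with no admin.
function canChangeRole(users: User[], targetId: string, newRole: UserRole): boolean {
  const target = users.find((u) => u.id === targetId);
  if (!target) return false;
  if (target.role !== "ADMIN" || newRole === "ADMIN") return true;
  const otherAdmins = users.filter((u) => u.role === "ADMIN" && u.id !== targetId);
  return otherAdmins.length > 0;
}
```

The same last-admin check would apply to deletion; only the demotion case is shown here.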

The "3-person team self-serves" success metric from §3.5 is achievable end-to-end: admin invites teammates via /register, role-management UI lets them become viewer/operator, audit log captures everything.
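For the 6.3 reconnect behaviour, the backoff schedule (exponential, 1s → 60s) might be computed like this. The agent itself is Go; this sketch only illustrates the schedule, and the doubling factor and function name are assumptions.

```typescript
// Delay before reconnect attempt N (0-based): doubles from baseMs, capped at capMs.
function reconnectDelayMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  // Clamp the exponent so very large attempt counts cannot overflow.
  const exponent = Math.min(Math.max(0, Math.floor(attempt)), 30);
  return Math.min(capMs, baseMs * 2 ** exponent);
}
```

The cap is reached on the seventh attempt (index 6), after which the agent would retry once a minute while polling keeps jobs flowing.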


4. Phase 7 — HyprGuard (shipped 2026-05-31)

Goal. A scheduled security audit that produces a structured list of findings, not a 200-line bash log no one reads — Lynis-shaped, but every item is a Finding with a Recommendation where one exists.

Shipped:

  • agent/hyprnode/internal/scanner/audit.go — runs all 9 checks every scanner tick. Pure parsing in audit_parse.go (testable against captured fixture output, no shell required).

  • The 9 checks and their stable keys:

    Check | Type | Severity
    apt list --upgradable shows -security packages | audit.kernel-updates-pending | WARN
    /etc/shadow root row has a real password hash | audit.root-password-set | WARN
    NOPASSWD: in /etc/sudoers{,.d/*} | audit.sudo-nopasswd | WARN
    find /etc -perm -o+w returns ≥1 file | audit.world-writable-etc | WARN
    UFW missing or Status: inactive or default-allow | audit.ufw-inactive | CRITICAL on public-IP host, WARN else
    systemctl is-active fail2ban ≠ "active" | audit.fail2ban-down | WARN
    /etc/apt/apt.conf.d/20auto-upgrades missing or disabled | audit.no-automatic-updates | INFO
    lsblk -no TYPE has no crypt row | audit.no-disk-encryption | INFO
    systemctl is-active auditd ≠ "active" | audit.no-auditd | INFO
  • presets/apt-security-upgrade.yaml (CONFIRM) — runs unattended-upgrade --debug --minimal-upgrade-steps against pending security packages, then verifies apt list --upgradable reports zero -security rows remaining.

  • Recommendations seeded:

    • audit.kernel-updates-pending → apt-security-upgrade
    • audit.ufw-inactive / audit.fail2ban-down / audit.no-automatic-updates → existing server-light (one rec, three finding types — that's why findingTypes is a JSON array).
  • The other 5 findings stay MANUAL — passwd -l root, sudoers review, per-file chmod o-w, LUKS-on-reprovision, auditd-on-purpose are decisions an operator owns. The UI just shows them without an Apply button.
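A sketch of how a per-type recommendation lookup might use the findingTypes array. The row shape and ids are hypothetical; only the preset names and finding types come from the list above.

```typescript
interface Recommendation {
  id: string; // hypothetical identifier
  preset: string;
  findingTypes: string[]; // one recommendation can cover several finding types
}

const seeded: Recommendation[] = [
  {
    id: "rec-apt",
    preset: "apt-security-upgrade",
    findingTypes: ["audit.kernel-updates-pending"],
  },
  {
    id: "rec-light",
    preset: "server-light",
    findingTypes: [
      "audit.ufw-inactive",
      "audit.fail2ban-down",
      "audit.no-automatic-updates",
    ],
  },
];

// null means the finding is MANUAL: the UI shows it without an Apply button.
function recommendationFor(
  findingType: string,
  recs: Recommendation[],
): Recommendation | null {
  return recs.find((r) => r.findingTypes.includes(findingType)) ?? null;
}
```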

Done-when metric (from the original spec): A fresh Ubuntu 24.04 VM with no hardening produces ≥6 findings; applying server-light resolves ≥4 of them on the next scanner tick. Achievable as wired.
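The conditional UFW severity could be decided as below. How the agent actually determines "public-IP host" is not described in this doc; this sketch assumes it inspects the host's IPv4 addresses and treats anything outside the common private, loopback, and link-local ranges as public. Function names are hypothetical.

```typescript
// True for a syntactically valid IPv4 address outside 10/8, 172.16/12,
// 192.168/16, 127/8, and 169.254/16. IPv6 and CGNAT ranges are not handled.
function isPublicIPv4(ip: string): boolean {
  const parts = ip.split(".");
  const valid =
    parts.length === 4 &&
    parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  if (!valid) return false;
  const [a = 0, b = 0] = parts.map(Number);
  const isPrivate =
    a === 10 ||
    a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254);
  return !isPrivate;
}

// CRITICAL when any address is public, WARN otherwise.
function ufwInactiveSeverity(hostAddresses: string[]): "CRITICAL" | "WARN" {
  return hostAddresses.some(isPublicIPv4) ? "CRITICAL" : "WARN";
}
```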


5. Phase 8 — Production polish (complete)

Shipped in this pass:

  • 8.2 Change password: POST /api/auth/change-password requires the current password, refuses re-use, audits success and failure separately. Settings page gains a "Change password" form with inline validation. (Forgot-password email reset shipped as a follow-up; see 8.2 in §5b.)
  • 8.3 Per-user rate limiting — custom keyGenerator on @fastify/rate-limit buckets Bearer/cookie-authenticated traffic by user.sub and falls back to req.ip for anonymous. Solves the NAT'd-office case where the global per-IP bucket was effectively per-organisation.
  • 8.4 GHCR release pipeline: .github/workflows/release.yml matrix over the three Docker images, triggered by v*.*.* tag pushes (or workflow_dispatch with an explicit ref). Stable releases move :latest; pre-releases push the version tag only.
  • 8.6 Per-node agent config — new schema column Node.agentConfig (JSON string) + migration. PATCH /api/nodes/:id/config (operator-auth) merges into existing config; null on a key clears it. GET /api/nodes/me/config (node-token, pinned-only) returns the live config to the agent. Agent pulls before each scanner tick: tls_paths overrides the env default for the TLS scanner, and scan_interval now drives the scanner ticker cadence (runner.Tick returns the next interval; honored since the 2026-06-02 audit pass).
  • New "Agent config" panel on the node detail page edits TLS paths + scan interval.
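The PATCH merge semantics from 8.6 (merge into existing config, null clears a key) could be implemented along these lines. A shallow merge is assumed, and mergeAgentConfig / patchStoredConfig are hypothetical names.

```typescript
type AgentConfig = Record<string, unknown>;

// Shallow merge: a null value removes the key, anything else overwrites it.
function mergeAgentConfig(
  current: AgentConfig,
  patch: Record<string, unknown>,
): AgentConfig {
  const next: AgentConfig = { ...current };
  for (const [key, value] of Object.entries(patch)) {
    if (value === null) {
      delete next[key];
    } else {
      next[key] = value;
    }
  }
  return next;
}

// Node.agentConfig is stored as a JSON string, so parse, merge, re-serialise.
function patchStoredConfig(stored: string | null, patch: Record<string, unknown>): string {
  const current = stored ? (JSON.parse(stored) as AgentConfig) : {};
  return JSON.stringify(mergeAgentConfig(current, patch));
}
```

Clearing scan_interval this way would let the agent fall back to its default cadence on the next tick.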

Phase 8 fully shipped — nothing deferred. (8.1 output offload, 8.2 change-password + forgot-password reset, 8.3–8.6 all landed; see §5b.) The only loosely-open follow-on idea is opt-in email notifications on backup failures, which rides on the same lib/mailer.ts transport once SMTP is wired.
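For reference, the 8.3 bucketing rule can be sketched as a pure key function of the kind passed to @fastify/rate-limit's keyGenerator option. The request shape and the "user:" / "ip:" prefixes are assumptions made so the two key spaces cannot collide.

```typescript
// Minimal stand-in for the parts of the request the key depends on.
interface RateLimitRequest {
  ip: string;
  user?: { sub: string };
}

// Authenticated traffic is bucketed per user; anonymous traffic per IP.
function rateLimitKey(req: RateLimitRequest): string {
  return req.user?.sub ? `user:${req.user.sub}` : `ip:${req.ip}`;
}
```

Two operators behind the same NAT then get separate buckets, while unauthenticated requests from that office still share one.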


5b. Phase 8 — follow-up scope (originally deferred, all shipped 2026-06-01)

8.1 Output offload for long logs (shipped 2026-06-01, chunk-in-DB)

Decision made: chunk-in-DB as the self-hosted default (no extra infra). A stream >4 MB no longer loses its head: the full ordered log is persisted in JobOutputChunk (cheap appends), Job.stdout/stderr keep a bounded 4 MB inline tail, and Job.stdoutBytes/stderrBytes track the true totals. GET /api/jobs/:id/output/raw?stream=… streams the full log; the job detail page links to it when truncated. lib/jobOutput.ts is the seam — an S3/MinIO backend is a future opt-in (reimplement appendJobOutput / readJobOutput, routes untouched). Deviation from the original sketch: we keep a 4 MB inline tail (not first+last 100 KB) — simpler, and the full log is one click away.
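An in-memory sketch of the chunk-plus-bounded-tail idea. It is not the lib/jobOutput.ts implementation: the JobOutput shape and the functions emptyOutput and appendOutput are hypothetical, and the chunks array stands in for JobOutputChunk rows.

```typescript
import { Buffer } from "node:buffer";

interface JobOutput {
  chunks: Buffer[]; // full ordered log (stand-in for JobOutputChunk rows)
  tail: Buffer; // bounded inline tail (stand-in for Job.stdout)
  totalBytes: number; // true total (stand-in for Job.stdoutBytes)
}

function emptyOutput(): JobOutput {
  return { chunks: [], tail: Buffer.alloc(0), totalBytes: 0 };
}

// Append one chunk: the full log grows, the inline tail keeps the last tailCap bytes.
function appendOutput(out: JobOutput, data: Buffer, tailCap: number): JobOutput {
  const joined = Buffer.concat([out.tail, data]);
  const tail =
    joined.length > tailCap ? joined.subarray(joined.length - tailCap) : joined;
  return {
    chunks: [...out.chunks, data],
    tail,
    totalBytes: out.totalBytes + data.length,
  };
}

// The detail page would link to the raw log only when this is true.
function isTruncated(out: JobOutput): boolean {
  return out.totalBytes > out.tail.length;
}
```

With a 4-byte cap, appending "hello " then "world" leaves a tail of "orld", a total of 11 bytes, and the full text recoverable from the chunks.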

8.2 Account self-service — email reset (shipped 2026-06-01)

Change-password (0.9.0) + forgot-password reset (0.9.3) both shipped. Reset = single-use SHA-256-hashed tokens (1h TTL, burned on use + siblings), anti-enumeration (forgot-password always 200), rate-limited, audited (password.reset.{request,success,failure}). Email delivery goes through the lib/mailer.ts seam — default logs the link (dev/self-hosted); SMTP/SES/Postmark is a transport swap, callers untouched. The same transport could later power 8.2bis: email notifications on backup failures (opt-in per user) — still open.
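The token rules above (store only a SHA-256 hash, 1h TTL, burn the token and its siblings on use) can be sketched with an in-memory store. The row shape and function names are hypothetical, and a real implementation would use the database rather than an array.

```typescript
import { createHash, randomBytes } from "node:crypto";

interface ResetToken {
  userId: string;
  tokenHash: string; // only the hash is stored, never the raw token
  expiresAt: number; // epoch milliseconds
  usedAt: number | null;
}

const TTL_MS = 60 * 60 * 1000; // 1 hour

function sha256(value: string): string {
  return createHash("sha256").update(value).digest("hex");
}

// Returns the raw token for the emailed link; the store keeps its hash.
function issueToken(store: ResetToken[], userId: string, now: number): string {
  const token = randomBytes(32).toString("hex");
  store.push({ userId, tokenHash: sha256(token), expiresAt: now + TTL_MS, usedAt: null });
  return token;
}

// Returns the user id on success, null if unknown, expired, or already used.
function redeemToken(store: ResetToken[], token: string, now: number): string | null {
  const hash = sha256(token);
  const row = store.find((r) => r.tokenHash === hash);
  if (!row || row.usedAt !== null || row.expiresAt <= now) return null;
  // Burn this token and every other outstanding token for the same user.
  for (const r of store) {
    if (r.userId === row.userId && r.usedAt === null) r.usedAt = now;
  }
  return row.userId;
}
```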

8.5 E2E tests with Playwright (shipped 2026-06-01)

Playwright wired at the repo root (playwright.config.ts + tests/e2e/, pnpm test:e2e). Auth is established once in auth.setup.ts and reused via storageState — keeps the suite under the login route's 10/min rate limit. Three non-destructive specs cover the "I broke the auth cookie" regression class that inject() can't see: login → edge-proxy gate → dashboard lists a node; Health page surfaces a finding with its severity; opening a CONFIRM finding shows the recommendation + Apply button; a MANUAL finding shows no Apply button.

Deliberately read-only — the specs never click Apply, because queuing a preset would be executed for real by a live agent on the host. The "queue a job → wait for SUCCEEDED → see it in the list" tail from the original spec is left for CI, where a sacrificial target + a stack-orchestration step (docker compose up web+api+db, seed a node/findings) can run it safely. Today the specs assume the dev stack is already up with an agent reporting ≥1 node + findings.
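A minimal sketch of a Playwright config that logs in once and reuses the session, in the spirit of the setup described above. The project names and the storage-state path are assumptions; the actual playwright.config.ts may differ.

```typescript
import { defineConfig } from "@playwright/test";

export default defineConfig({
  testDir: "tests/e2e",
  projects: [
    // Runs auth.setup.ts once and saves the logged-in browser state.
    { name: "setup", testMatch: /auth\.setup\.ts/ },
    {
      name: "e2e",
      // Every spec waits for the setup project, then starts already logged in.
      dependencies: ["setup"],
      use: { storageState: "playwright/.auth/user.json" }, // hypothetical path
    },
  ],
});
```

Because only the setup project hits the login route, the suite makes one login request per run regardless of how many specs are added.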

Estimated complexity per item. 0.5–1.5 sessions each. These three items were deferred from the main Phase 8 pass and shipped as an explicit follow-up.

Parked — not scheduled

  • Anonymized + encrypted dev-DB snapshot (sanitize PII/secrets → encrypt → devs restore locally; app/login must still work). Proposed 2026-06-01, parked: not in any phase, no audit finding requires it, low value while single-operator. Revisit when the team grows or a finding demands it. If done, keep it tiny: 2 scripts (scripts/dev-data-{export,import}.sh) + 3 surgical SQL ops (truncate risky blobs → 200B; wipe secret columns; TRUNCATE AuditEvent/Heartbeat), smoke = boot + login, ≤2h. Do NOT build a field-by-field transform pipeline.

6. Phase 9 — Commercial edition (open-core, only after one paying customer)

This is the commercial edition, separate from the Apache-2.0 OSS core (see LICENSE/NOTICE). Open-core: the core stays Apache-2.0; these ship under a commercial license. Locked behind a real "I'll pay for this" conversation — don't build speculatively.

  • SSO via OIDC (Google Workspace, Microsoft Entra ID).
  • 2FA / WebAuthn for admin accounts.
  • Multi-org / multi-tenant (orgs own nodes, users belong to orgs).
  • Per-client PDF reports ("here's what HyprBox did for you this month").
  • White-label branding (logo, colours, domain).

If a customer demands one of these but won't write a PO, say no. The first PO is expected to come from a design-partner MSP who says "I'd pay for multi-tenant" — that's the trigger to start this edition. We don't speculative-build commercial features.


7. Anti-goals (things we WILL NOT build)

When asked "can we…":

Ask | Answer | Why
Infrastructure graph view | Not before 10+ entity types + a paying customer | Sexy to build, marginal until we have meat
Numeric health score (62/100) | No | Every customer asks "why not 64" — we can't answer. Buckets only.
Mobile app | No | Operator persona is at a laptop
Built-in alerting (page on call) | No | Customers keep PagerDuty/OpsGenie; we publish events for them to consume
Custom query language | No | If you need to query, use Postgres directly
Plugin system | No | API is the extension point
Real-time collab features | No | Single-operator-at-a-time is fine
Chat/Slack bot | No | The dashboard IS the interaction model
AI-generated recommendations | No (until a clear win) | Hand-curated recommendations are still the bar; LLMs are at suggestion quality, not "I trust this to run as root" quality
Windows server support | No | Linux only. Period.
Be a Terraform / IaC platform | No — command: runner at most | Different control plane (cloud APIs + state/plan/drift); Spacelift/env0/Atlantis own it; off the SMB wedge
Be a Kubernetes / Helm platform | No — command: runner at most | Cluster API ≠ host-bash agent; Rancher/Portainer/ArgoCD own it; SMBs rarely self-run k8s

This list is not ordered by likelihood of someone asking; it's ordered by how hard "no" is to enforce when the ask gets cute.


8. Decisions locked in

Things we keep re-deciding. Stop re-deciding. The answer is below.

Question | Decision | Source
Polling vs WebSocket for agent | Polling until Phase 6.3; since 6.3 shipped, WebSocket push with polling as the fallback | ARCHITECTURE.md
Job script: render now or render at apply? | At queue time (audit trail) | JOBS.md
Should we offer "reads" RBAC? | No, reads stay open to authenticated users for now | This doc, Phase 6.2
Per-finding-type vs per-finding recommendations | Per-type (one Recommendation row, many Findings link to it) | FINDINGS.md
Health score: number or bucket? | Three buckets, no number | AUTOPILOT.md
Should verify: failure trigger rollback: automatically? | No (Phase 4b). Operator decides. rollback: lands when a fix needs it. | PRESETS.md
Default risk_level when not declared? | CONFIRM (fail-safe) | PRESETS.md
Where does Restic's password live? | NOT in the DB. passwordRef pointer only. | HYPRVAULT.md
Multi-org from day one? | No — single workspace. Multi-tenant is Phase 9. | This doc

If something feels under-decided, add a row here in a follow-up commit.
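As an illustration of the fail-safe default for risk_level, a preset loader might resolve the effective level like this. Only CONFIRM and DANGEROUS are named in this doc, so the list below is an assumption and other levels may exist; effectiveRiskLevel is a hypothetical name.

```typescript
// Levels named in this doc; the real set may be larger.
const RISK_LEVELS = ["CONFIRM", "DANGEROUS"] as const;
type RiskLevel = (typeof RISK_LEVELS)[number];

// Missing risk_level falls back to CONFIRM; an unknown value is rejected
// rather than silently mapped to something weaker.
function effectiveRiskLevel(declared?: string): RiskLevel {
  if (declared === undefined) return "CONFIRM";
  const known = RISK_LEVELS.find((level) => level === declared);
  if (known === undefined) {
    throw new Error(`unknown risk_level: ${declared}`);
  }
  return known;
}
```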


9. Risks

Risk | Likelihood | Impact | Mitigation
Scope creep on Phase 5 (operator asks for 10 new scanners) | High | Slows the loop | Ship 3 at most; defer the rest to Phase 7 (HyprGuard)
Phase 6.3 WebSocket adds reliability bugs | Medium | Job loss / double-runs | Fallback to polling; atomic claim still serialises both transports
Migrations cutover breaks existing prod stacks | Medium | Customer downtime | Baseline migration tested on a snapshot of the prod DB before cutover
A scanner false-positives at scale | High | Trust collapse | Treat every false positive as a Sev-2 bug; tighten the key first, the detection second
We try to do Phase 9 before a real customer | Medium | Wasted months on enterprise plumbing nobody asks for | Hard rule: no Phase 9 work without a signed PO
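The "atomic claim still serialises both transports" mitigation can be pictured with an in-memory sketch. The QUEUED status name and the claimJob function are assumptions; in the real system the same check-and-set would presumably be a single conditional database update.

```typescript
type JobStatus = "QUEUED" | "RUNNING";

interface Job {
  id: string;
  status: JobStatus;
  claimedBy: string | null; // which transport or worker won the claim
}

// Claim succeeds only if the job is still QUEUED; check and update happen
// in one synchronous step, so a second caller always sees RUNNING.
function claimJob(jobs: Map<string, Job>, jobId: string, claimer: string): boolean {
  const job = jobs.get(jobId);
  if (!job || job.status !== "QUEUED") return false;
  job.status = "RUNNING";
  job.claimedBy = claimer;
  return true;
}
```

If a WebSocket wakeup and a poll tick race for the same job, one claim returns true and the other false, so the job runs once.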

10. Living document

Update this file when:

  • A phase ships → flip its status in §1.
  • The next phase's scope changes (added or dropped a chunk) → edit the phase block AND note it in CHANGELOG.md.
  • Anti-goal #N gets challenged seriously (real customer money) → don't edit the anti-goal silently; open a discussion thread, link it here, and only then move the row.

The last-frozen date at the top is the only thing that should drift between PRs. Everything else is intentional.