Architecture
State as of Phase 3 + Solidification. Each subsystem is the smallest thing that delivers the user-facing feature it owns; we deliberately avoid premature abstractions (no event sourcing, no message broker, no service mesh).
Bird's-eye view
                              ┌──────────────────────┐
                              │  Browser (operator)  │
                              └──────────┬───────────┘
                                         │ HTTPS
                                         ▼
                           ┌────────────────────────────┐
                           │  apps/web — Next.js 16     │
                           │  Studio dark dashboard     │
                           │  EventSource → /api/stream │
                           └─────────────┬──────────────┘
                                         │ Bearer JWT / cookie
                                         ▼
    ┌─────────────────┐    ┌────────────────────────────┐    ┌──────────┐
    │  cli/hyprbox    │───▶│  apps/api — Fastify 5      │───▶│ Postgres │
    │  (Go + Cobra)   │    │  /auth /nodes /tokens      │    │    16    │
    └─────────────────┘    │  /presets /jobs /stream    │    └──────────┘
                           └───┬──────────────────▲─────┘
                               │ rendered bash    │ Bearer node token
                               │ + status flips   │ heartbeat + poll
                               ▼                  │
                        ┌─────────────────────────┴──┐
                        │  agent/hyprnode (Go)       │
                        │  collector + reporter +    │
                        │  jobrunner (bash -s)       │
                        └────────────┬───────────────┘
                                     │
                                     ▼
                          ┌──────────────────────┐
                          │ Linux server (target)│
                          └──────────────────────┘
Components
apps/web — operator dashboard
Next.js 16 App Router, React 19, Tailwind v4. Pages:
- `/login`: credentials → `POST /api/auth/login` → stores the JWT in localStorage AND the `hb_token` cookie (the cookie is what the proxy/middleware checks).
- `/dashboard`: fleet overview (hero chart, KPI cards, node grid with per-node sparklines fed by `/api/nodes/sparks`).
- `/dashboard/nodes/[nodeId]`: detail page (3 BigMetric cards with `/history` sparklines, details panel, activity timeline).
- `/dashboard/deploys`: preset catalogue. Click Preview → modal with rendered bash + variables form. Click Apply → node-picker → `POST /api/jobs`.
- `/dashboard/jobs`: job list (polled every 3s).
- `/dashboard/jobs/[id]`: live tail via SSE on `/api/jobs/:id/stream`.
- `/dashboard/logs`: live feed of heartbeats + activity events via SSE.
- `/dashboard/settings`: agent token management (create/revoke + one-shot reveal).
- `/dashboard/alerts`: Coming Soon stub.
Edge proxy (apps/web/src/proxy.ts, ex-middleware.ts — Next.js 16 rename)
gates every /dashboard/* route on the presence of the hb_token cookie.
Without it, the browser is redirected to /login?next=….
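
The gate described above can be sketched as a pure function. This is a minimal illustration, not the contents of `proxy.ts`: the name `gateDashboard`, its argument shapes, and the way the `next` value is built (URL-encoded original path) are all assumptions, and the real file would express the same decision through the Next.js request/response types.

```typescript
// Hypothetical helper: returns a redirect target, or null to let the request through.
// Only cookie *presence* is checked, as described above; JWT validity is the API's job.
function gateDashboard(
  pathname: string,
  search: string,
  cookies: Map<string, string>,
): string | null {
  // Treating the bare /dashboard path as gated too is an assumption.
  const isDashboard = pathname === "/dashboard" || pathname.startsWith("/dashboard/");
  if (!isDashboard) return null;

  if (cookies.get("hb_token")) return null; // cookie present: proceed

  // Assumed: `next` carries the original destination so /login can send the operator back.
  return `/login?next=${encodeURIComponent(pathname + search)}`;
}
```

Because the check is presence-only, an expired or forged cookie still reaches the page shell; the API calls behind it are what reject the token.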
apps/api — control plane
Fastify 5 + Prisma 6 + Zod. Single process, listens on :4000. Plugins, in
registration order: helmet, rate-limit, cors (credentials enabled), cookie, jwt
(reads both Authorization: Bearer and the hb_token cookie), swagger (in prod,
only when HYPRBOX_ENABLE_DOCS=true). The global rate-limit bucket is keyed
per authenticated user for Bearer/cookie JWTs, per IP otherwise.
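
The bucket selection rule can be written as a small key function. This is a sketch under assumptions: `RequestIdentity`, `rateLimitKey`, and the `user:` / `ip:` prefixes are hypothetical names chosen for illustration, and how the real plugin configuration obtains the user id is not shown in this document.

```typescript
// Hypothetical view of what is known about a request when the limiter runs.
interface RequestIdentity {
  userId?: string; // present only when a Bearer or cookie JWT verified successfully
  ip: string;
}

// One bucket per authenticated user, otherwise one bucket per client IP.
// The prefixes are an assumption; they only keep the two key spaces from colliding.
function rateLimitKey(id: RequestIdentity): string {
  return id.userId ? `user:${id.userId}` : `ip:${id.ip}`;
}
```

Keying on the user rather than the IP means operators behind a shared NAT do not exhaust each other's budget, while anonymous traffic is still bounded per address.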
Routes are split by domain in src/routes/*.ts:
- `auth.ts`: `/login`, `/register`, `/me`, `/logout`. Bcrypt password hashing, rate-limit 10/min per IP on `/login`, 5/min on `/register`.
- `nodes.ts`: `/heartbeat` (POST, agent token), `/` (GET, user JWT), `/:nodeId/status|history|activity`, `/sparks` batch.
- `tokens.ts`: node-token CRUD. Plaintext returned exactly once at creation; the rest of the API only sees the SHA-256 hash.
- `presets.ts`: `/` list, `/:name` detail, `/:name/render` (POST → bash script).
- `jobs.ts`: user side (`POST /`, `GET /`, `GET /:id`, `POST /:id/cancel`, `GET /:id/stream` SSE) + agent side (`GET /pending`, `POST /:id/output`, `POST /:id/complete`).
- `stream.ts`: `/api/stream/nodes`, a process-local EventEmitter bus.
- `backups.ts`: HyprVault. `BackupPolicy` CRUD + agent-side run lifecycle.
- `lib/audit.ts`: append-only log helper; called from auth, tokens, jobs, backups after each sensitive mutation.
Middleware:
- `requireAuth` extracts a token from header OR cookie and calls `app.jwt.verify` synchronously. We bypass `req.jwtVerify()` because `@fastify/jwt` v9 with cookie support refuses to fall back to the header when the cookie is absent (see `docs/SECURITY.md`).
- `requireNodeToken` looks up the token by 8-char prefix index, then `timingSafeEqual` against the SHA-256 hash. Optional in dev unless `HYPRBOX_REQUIRE_NODE_TOKEN=true`.
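
The extraction step of `requireAuth` can be illustrated without any framework types. This is a minimal sketch, not the project's code: the function name `extractToken`, the plain-object arguments, and the choice to let the header win over the cookie are assumptions. The returned string would then be handed to the JWT verifier.

```typescript
// Hypothetical, framework-free version of "token from header OR cookie".
function extractToken(
  headers: Record<string, string | undefined>,
  cookies: Record<string, string | undefined>,
): string | null {
  const auth = headers["authorization"];
  if (auth) {
    // Accept "Bearer <token>" case-insensitively; anything else falls through to the cookie.
    const match = /^Bearer\s+(\S+)$/i.exec(auth.trim());
    if (match) return match[1] ?? null;
  }
  const cookie = cookies["hb_token"];
  return cookie ? cookie : null;
}
```

Doing the lookup by hand is what makes the header and cookie paths independent of each other, which is the behaviour the bypass of `req.jwtVerify()` is meant to guarantee.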
agent/hyprnode — server-side agent
Go 1.23, statically linked, ~9 MB. Two responsibilities, one tick loop:
- Heartbeat: collect host metrics (`gopsutil/v4`) → POST to `/api/nodes/heartbeat`, with the optional inventory fields (ip, region, kernel, architecture) sent on the first heartbeat or whenever they change.
- Job runner: after the heartbeat, drain queued jobs by polling `GET /api/jobs/pending?nodeId=…`. Each claimed job is executed via `bash -s` with stdin = rendered script; stdout/stderr are buffered and flushed to `/output` every 1.5s, then `/complete` reports the final exit code.
Graceful shutdown via signal-aware context — a running bash gets killed if
the parent receives SIGINT/SIGTERM.
cli/hyprbox — operator CLI
Go + Cobra. Single command tree:
- `hyprbox status [--api-url] [--token] [--json]`: fleet table.
- `hyprbox preset list|show|preview|apply`: preset operations. `apply` is scoped to `--target=local` (Phase 3.5: remote will route via the agent).
Both the API URL and the token are resolved in precedence order: --flag > $HYPRBOX_API_URL/$HYPRBOX_API_TOKEN > defaults.
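
The precedence rule is simple enough to show in a few lines. The CLI itself is written in Go; the sketch below uses TypeScript only to stay consistent with the other examples in this document. `resolveSetting` is a hypothetical name, and treating an empty string as "not set" is an assumption.

```typescript
// Resolves one setting using flag > environment variable > built-in default.
function resolveSetting(
  flag: string | undefined,
  envValue: string | undefined,
  fallback: string,
): string {
  if (flag) return flag; // an explicit flag always wins
  if (envValue) return envValue; // then the environment
  return fallback; // then the default
}

// Example: no flag given, environment set. The URLs are placeholders.
const apiUrl = resolveSetting(undefined, "https://hyprbox.example", "http://localhost:4000");
```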
Data model
User 1 ── n NodeToken
User 1 ── n Job
Node 1 ── n Heartbeat
Node 1 ── n Job
- `User`: operators with bcrypt password hash.
- `NodeToken`: agent shared secret, stored as a prefix (8 chars, indexed) and a SHA-256 hash, with an optional `nodeId` for pinning.
- `Node`: registered hosts. Status enum (ONLINE/OFFLINE/WARNING/CRITICAL/UNKNOWN); effective status is computed at query time from `lastSeenAt` (>10 min → OFFLINE).
- `Heartbeat`: append-only metric history. ~24h@1min = ~1440 rows/node.
- `Job`: preset application. Script frozen at queue time; status flips through `QUEUED → RUNNING → (SUCCEEDED|FAILED|CANCELLED)`. Output capped at 4 MB per stream (tail kept on overflow).
- `BackupPolicy` / `BackupRun`: HyprVault. Policy is the recipe, run is one execution. `passwordRef` is a pointer (`env:NAME` / `file:/path`); the password itself is never stored.
- `AuditEvent`: append-only trail of sensitive mutations (login, token CRUD, job CRUD, backup CRUD). See `docs/AUDIT.md`.
Full schema: apps/api/prisma/schema.prisma.
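
Two of the derived behaviours in the model, the query-time node status and the capped job output, can be sketched as pure functions. The names `effectiveStatus` and `appendCapped` are hypothetical, as are the choices to report a never-seen node as UNKNOWN and to read "4 MB" as 4 × 1024 × 1024 bytes.

```typescript
type NodeStatus = "ONLINE" | "OFFLINE" | "WARNING" | "CRITICAL" | "UNKNOWN";

const OFFLINE_AFTER_MS = 10 * 60 * 1000; // ">10 min → OFFLINE"

// Keeps the stored status unless the node has been silent for more than 10 minutes.
// Returning UNKNOWN for a node with no lastSeenAt is an assumption.
function effectiveStatus(stored: NodeStatus, lastSeenAt: Date | null, now: Date): NodeStatus {
  if (lastSeenAt === null) return "UNKNOWN";
  const silentMs = now.getTime() - lastSeenAt.getTime();
  return silentMs > OFFLINE_AFTER_MS ? "OFFLINE" : stored;
}

const OUTPUT_CAP_BYTES = 4 * 1024 * 1024; // assumed reading of "4 MB"

// Appends a chunk; on overflow the oldest bytes are dropped so the tail survives.
function appendCapped(
  existing: Uint8Array,
  chunk: Uint8Array,
  cap: number = OUTPUT_CAP_BYTES,
): Uint8Array {
  const joined = new Uint8Array(existing.length + chunk.length);
  joined.set(existing, 0);
  joined.set(chunk, existing.length);
  return joined.length > cap ? joined.subarray(joined.length - cap) : joined;
}
```

Cutting on a byte boundary can split a multi-byte UTF-8 character at the start of the kept tail; a real implementation might trim to the next line break instead.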
Flows
Heartbeat (every 5min in prod, 10s in dev)
agent.tick()
→ collector.Collect() # gopsutil host metrics
→ POST /api/nodes/heartbeat # Bearer agent token (optional in dev)
→ requireNodeToken # prefix lookup + hash compare
→ prisma.node.upsert # create or refresh metadata
→ prisma.heartbeat.create
→ publish({type: 'heartbeat'}) # fans out to /api/stream/nodes
→ 202 Accepted
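
The `publish` step at the end of this flow relies on a process-local bus. A minimal sketch of such a bus on top of Node's `EventEmitter` follows; the `BusEvent` shape, the channel name, and the helper names `subscribe` and `toSseFrame` are assumptions made for illustration, with only the `type` field taken from the flow above.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical event shape.
interface BusEvent {
  type: "heartbeat" | "activity";
  nodeId?: string;
  payload?: unknown;
}

const bus = new EventEmitter();
const CHANNEL = "event"; // assumed internal channel name

function publish(event: BusEvent): void {
  bus.emit(CHANNEL, event);
}

// Returns an unsubscribe function so an SSE handler can detach when the client disconnects.
function subscribe(listener: (event: BusEvent) => void): () => void {
  bus.on(CHANNEL, listener);
  return () => {
    bus.off(CHANNEL, listener);
  };
}

// Formats one Server-Sent Events frame: an event name, a JSON data line, and a blank line.
function toSseFrame(event: BusEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}
```

Because the emitter lives in one process, a subscriber connected to a second API instance would never see these events, which is why horizontal scaling needs sticky sessions or an external pub/sub.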
Queue + run a preset
Operator → POST /api/jobs
→ requireAuth → AuthedUser
→ getPreset(name) + renderPreset(preset, vars) # bash frozen here
→ prisma.job.create({ status: QUEUED, script })
→ publish({type: 'activity', kind: 'deploy'})
→ 201
agent.tick() (next heartbeat)
→ GET /api/jobs/pending?nodeId=X
→ SELECT WHERE status=QUEUED ORDER BY createdAt LIMIT 1
→ UPDATE ... WHERE id=X AND status=QUEUED # atomic claim
→ if count=0 → return null (someone else got it)
→ publish({kind: 'info', msg: 'Agent picked up...'})
→ exec /bin/bash -s < script
→ every 1.5s: POST /:id/output {stdout, stderr}
→ on exit: POST /:id/complete {exitCode, ...}
→ prisma.job.update({ status, finishedAt, ... })
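
The claim step in the middle of this flow (select a candidate, then update it only if it is still QUEUED) can be sketched against a small store interface. `JobStore`, `claimNext`, and `memoryStore` are hypothetical names; with Prisma the conditional update would presumably be an `updateMany` whose returned `count` plays the role of the number below.

```typescript
type JobStatus = "QUEUED" | "RUNNING" | "SUCCEEDED" | "FAILED" | "CANCELLED";

interface Job {
  id: string;
  nodeId: string;
  status: JobStatus;
  createdAt: number;
  script: string;
}

// Minimal stand-in for the two queries in the flow.
interface JobStore {
  findOldestQueued(nodeId: string): Promise<Job | null>;
  markRunningIfQueued(id: string): Promise<number>; // rows affected: 0 or 1
}

async function claimNext(store: JobStore, nodeId: string): Promise<Job | null> {
  const candidate = await store.findOldestQueued(nodeId);
  if (!candidate) return null;
  // Conditional update: only succeeds while the row is still QUEUED.
  const count = await store.markRunningIfQueued(candidate.id);
  if (count === 0) return null; // another poller won the race
  return { ...candidate, status: "RUNNING" };
}

// In-memory fake, for illustration only; the database does this atomically per row.
function memoryStore(jobs: Job[]): JobStore {
  return {
    async findOldestQueued(nodeId) {
      const queued = jobs
        .filter((j) => j.nodeId === nodeId && j.status === "QUEUED")
        .sort((a, b) => a.createdAt - b.createdAt);
      return queued[0] ?? null;
    },
    async markRunningIfQueued(id) {
      const job = jobs.find((j) => j.id === id && j.status === "QUEUED");
      if (!job) return 0;
      job.status = "RUNNING";
      return 1;
    },
  };
}
```

The select on its own is not safe, since two pollers can read the same row; it is the status predicate on the update that decides the winner.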
Live tail in the dashboard
Browser /dashboard/jobs/[id]
→ GET /api/jobs/:id (initial state, includes stdout so far)
→ EventSource /api/jobs/:id/stream
→ server polls Job row every 500ms
→ emits `stdout` / `stderr` events for the delta since last tick
→ emits `status` on flip + `done` on terminal state, then closes
→ React patches the console buffer; auto-scroll unless the user
scrolled up (detected via scrollTop reflow).
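
The server side of this flow boils down to comparing the latest Job row with what has already been sent. A sketch of one polling tick follows; `diffTick`, `Cursor`, and the event payload shapes are assumptions, and only the event names (`stdout`, `stderr`, `status`, `done`) come from the flow above.

```typescript
interface JobSnapshot {
  status: string;
  stdout: string;
  stderr: string;
}

// What has already been sent to this client.
interface Cursor {
  stdoutLen: number;
  stderrLen: number;
  status: string;
}

interface SseEvent {
  event: "stdout" | "stderr" | "status" | "done";
  data: string;
}

const TERMINAL = new Set(["SUCCEEDED", "FAILED", "CANCELLED"]);

// Returns the events to emit for this tick plus the advanced cursor.
function diffTick(row: JobSnapshot, cursor: Cursor): { events: SseEvent[]; cursor: Cursor } {
  const events: SseEvent[] = [];
  if (row.stdout.length > cursor.stdoutLen) {
    events.push({ event: "stdout", data: row.stdout.slice(cursor.stdoutLen) });
  }
  if (row.stderr.length > cursor.stderrLen) {
    events.push({ event: "stderr", data: row.stderr.slice(cursor.stderrLen) });
  }
  if (row.status !== cursor.status) {
    events.push({ event: "status", data: row.status });
  }
  if (TERMINAL.has(row.status)) {
    events.push({ event: "done", data: row.status }); // caller closes the stream after this
  }
  return {
    events,
    cursor: { stdoutLen: row.stdout.length, stderrLen: row.stderr.length, status: row.status },
  };
}
```

A length-based cursor like this one stops being accurate once the stored output has been truncated to its tail, so a real implementation would need an absolute offset or sequence number for long-running jobs.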
Why these choices
- Polling, not WebSocket, for agent → API: agents already poll for heartbeats. Adding a job-pull on the same tick was zero new infrastructure. Latency = `HYPRBOX_INTERVAL` (5 min default), which is fine for ops jobs. WebSocket lands in Phase 4 if customers need sub-minute apply latency.
- SSE, not WebSocket, for API → browser: one-way push fits the use case (heartbeats, job output). SSE works through reverse proxies without sticky sessions, EventSource reconnects for free, and the protocol is text, so it is easy to debug with `curl -N`.
- Cookie + header dual auth: EventSource cannot set custom headers, so SSE endpoints depend on the `hb_token` cookie. The same cookie is also what the Next.js proxy reads to gate `/dashboard/*` without round-tripping the API. For everything else (CLI, agent, API consumers), `Authorization: Bearer` is the canonical transport.
- Render-at-queue, not render-at-run: storing the bash script on the Job row means a later schema change to the preset doesn't retroactively rewrite what a job actually ran. The audit trail is exact and reproducible.
- Atomic claim via `updateMany`: PostgreSQL row locking + `WHERE status='QUEUED'` guarantees exactly one winner across parallel pollers. No advisory locks, no job queue library; the schema is the queue.
- Prefix-indexed token lookup: the first 8 chars of `hbnt_…` are unique and indexed, so the `/heartbeat` hot path is a single index scan + one hash compare. The full token never appears in a query.
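
The prefix-indexed lookup from the last point can be sketched with Node's `crypto` module. This is an illustration under assumptions: `verifyNodeToken` and `TokenRow` are hypothetical names, a `Map` stands in for the indexed prefix column, the prefix is taken as the first 8 characters of the full token, and the hash is held as raw bytes rather than whatever encoding the schema uses.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

interface TokenRow {
  prefix: string; // first 8 chars of the token, unique and indexed
  hash: Buffer; // SHA-256 of the full token
}

const PREFIX_LEN = 8;

function sha256(value: string): Buffer {
  return createHash("sha256").update(value, "utf8").digest();
}

// `index` stands in for the database lookup by prefix.
function verifyNodeToken(presented: string, index: Map<string, TokenRow>): boolean {
  const row = index.get(presented.slice(0, PREFIX_LEN));
  if (!row) return false;
  const candidate = sha256(presented);
  // timingSafeEqual throws on length mismatch, so guard first; both are 32-byte digests here.
  return candidate.length === row.hash.length && timingSafeEqual(candidate, row.hash);
}
```

Only the prefix is used as a lookup key, so the secret part of the token is hashed in memory and never sent to the database in plaintext.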
Operational properties
| Property | Current |
|---|---|
| API process model | Single Fastify process. Horizontal scale needs sticky sessions for SSE OR Redis pub/sub. |
| DB | PostgreSQL 16. Migrations via prisma db push (no migration history yet — landing in Phase 4). |
| Logs | API: pino JSON in prod, pino-pretty in dev. Agent: stdlib log to stderr. |
| Metrics | Application metrics: none yet. Host metrics: collected by HyprNode for the host it runs on, not the API/web. Phase 4 will surface API-level metrics via Prometheus. |
| Backups | None automated. Postgres dump = manual. HyprVault module (Phase 4) will own this. |
| Auth lifetime | JWT TTL = 7 days, no refresh token, no revocation list. Node tokens are revocable per-row. |