Security
The threat model assumes the API and dashboard sit on the public internet, the agents reach out from server LANs, and the database is internal. Within that, the goals are:
- An attacker without credentials cannot enumerate users, read fleet data, or queue jobs.
- A compromised node token cannot impersonate the operator or pivot to other nodes.
- A leaked operator JWT lets the attacker do anything that operator could, but expires within 7 days.
Threat #1 — the agent runs as root and executes control-plane bash
This is the largest blast radius in the system; everything else is secondary.
The agent's whole job is to apply system fixes (apt, ufw, ssh, docker), so it runs as root and executes shell that the control-plane renders and stores at queue time. Consequence: anyone who can queue a job for a node gets root code execution on that node. Aggregated across the fleet, the prize targets are, in order:
- The API host / its secrets. RCE on the API process, a stolen `JWT_SECRET`, or admin credentials → the attacker mints/queues arbitrary root jobs to every connected node = fleet-wide root RCE.
- The database, on its own. Jobs are claimed via `GET /api/jobs/pending`, which serves whatever `Job` rows are `QUEUED`. An attacker who can only write the DB (SQL injection, a restored/leaked DB dump, a rogue DBA) can `INSERT` a `QUEUED` job row; the next time that node's agent polls, it claims and runs it as root. There is currently no per-job integrity check, so a DB-write primitive is a fleet-root primitive.
- A node token. Pinned to one `nodeId` (see Identities), so it's scoped to that node: it cannot queue work for, or claim work from, other nodes.
What constrains this today
- Transport + identity: HTTPS; node tokens are pinned to a `nodeId` (`requirePinnedNodeToken` on every agent write/claim, so a token for node A can't touch node B); operator `tokenVersion` gives instant revocation; agents dial out (no inbound port on the node).
- Authoring: only `OPERATOR`+ can create jobs (RBAC). The exact rendered script is stored on the Job row at queue time → a complete audit trail of what ran. Injection sinks are hardened (`assertCommandSafe` rejects shell metacharacters in variables interpolated into `command`/`config_file`; SSH port is digit/range-validated) and typed steps are preferred over raw `command`.
- Operator-facing safety: preview the bash, per-tier Apply gating, prechecks that refuse when unsafe, and `verify` (no false "success"). Note these protect against operator mistakes, not a malicious authenticated attacker: an OPERATOR is trusted to run root fixes by design.
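The injection-sink hardening mentioned above can be pictured with a minimal sketch. The function names and the exact character set below are assumptions for illustration; the real `assertCommandSafe` may reject a different set of characters.

```typescript
// Hypothetical stand-in for assertCommandSafe; the real rule set may differ.
const SHELL_METACHARACTERS = /[;&|`$(){}<>\\\n\r'"!*?~#]/;

// Throws when a value that will be interpolated into a script contains
// characters the shell would interpret.
function assertNoShellMetacharacters(variable: string, value: string): void {
  if (SHELL_METACHARACTERS.test(value)) {
    throw new Error(`unsafe characters in variable "${variable}"`);
  }
}

// Digits only first, then a numeric range check (1-65535).
function parseSshPort(raw: string): number {
  if (!/^\d{1,5}$/.test(raw)) {
    throw new Error("SSH port must be 1-5 digits");
  }
  const port = Number(raw);
  if (port < 1 || port > 65535) {
    throw new Error("SSH port out of range");
  }
  return port;
}
```

Rejecting outright, rather than escaping, keeps the check easy to audit: a value either passes unchanged or the job is never rendered.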
Residual risk + job-payload signing (analysis, not yet implemented)
Targets #1 and #2 above are not mitigated by the controls above. The defense-in-depth option is to sign job payloads: the control-plane signs the rendered script (e.g. Ed25519) at queue time; the agent verifies against a pinned public key and refuses anything unsigned or mis-signed.
- What it buys: closes target #2. A DB-only attacker can write a `Job` row but cannot produce a valid signature, so the agent rejects it. It is also belt-and-suspenders against a TLS-bypassing MITM.
- What it does NOT buy: target #1. If the API host is compromised, the attacker has whatever the API uses to sign → they can sign malicious jobs. Signing only helps if the key lives outside the DB (so #2 holds) and ideally outside the API runtime (HSM / separate signer), and even then a compromised API can request signatures. So signing meaningfully raises the bar against DB-tampering; it is not a substitute for hardening the API host.
- Verdict: worth doing as defense-in-depth against the DB-tampering class (a common, lower-bar attack), with the signing key kept out of the database. Tracked in docs/BACKLOG.md. Not a 1.0 blocker, but the first thing to add when the fleet holds more than one customer's nodes.
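Since signing is not implemented yet, the following is only a sketch of the idea using Node's built-in Ed25519 support. `signScript` and `verifyScript` are hypothetical names, and the key pair is generated inline purely for the demo; in the scheme described above the private key would live outside the database and the agent would hold a pinned public key.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Signer side: sign the exact rendered script bytes at queue time.
function signScript(script: string, privateKey: KeyObject): string {
  // Ed25519 takes no separate digest algorithm, hence the null.
  return sign(null, Buffer.from(script, "utf8"), privateKey).toString("base64");
}

// Agent side: refuse anything unsigned or mis-signed before executing.
function verifyScript(
  script: string,
  signatureB64: string | null,
  publicKey: KeyObject,
): boolean {
  if (!signatureB64) return false;
  return verify(
    null,
    Buffer.from(script, "utf8"),
    publicKey,
    Buffer.from(signatureB64, "base64"),
  );
}

// Demo key pair; a real deployment would provision and pin these out of band.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const script = "ufw allow 2222/tcp";
const signature = signScript(script, privateKey);
```

A row inserted directly into the database carries either no signature or one that does not match its script, so `verifyScript` returns `false` and the agent would skip it. A fuller design would probably also bind the target node and job id into the signed bytes so a valid signature cannot be replayed elsewhere; that is left out of this sketch.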
Hardening guidance
- Treat the API host as crown jewels: minimal surface, patched, restricted egress, `JWT_SECRET` and any future signing key kept out of the DB.
- Rotate `JWT_SECRET` (fleet-wide kill switch) and revoke node tokens on any suspicion.
- Where a node only needs a narrow set of fixes, prefer a `NOPASSWD` sudoers allowlist for that agent over full root: smaller per-node blast radius (trade-off: presets needing broader privileges will fail their precheck).
- Keep the audit log + stored job scripts for forensics (Audit log, below).
Identities
| Identity | What it authenticates | Lifetime |
|---|---|---|
| Operator | HS256 JWT issued by `/api/auth/login`. Bearer header OR `hb_token` cookie. | 7 days, no refresh. |
| Agent | `hbnt_<43 base64url chars>` shared secret. Sent as Bearer. | Until revoked. |
JWT is verified via @fastify/jwt (v10 → fast-jwt). requireAuth also checks
the token's tv claim against User.tokenVersion on every request (one
indexed lookup) — that per-request check is the per-user revocation
mechanism: role change / password change / reset / user delete bump
tokenVersion and kill outstanding tokens immediately. The browser hb_token
cookie is HttpOnly; SameSite=Lax; Secure (prod). Node tokens are verified by a
prefix-indexed lookup + timingSafeEqual against the SHA-256 hash.
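The per-request revocation check reduces to comparing the token's `tv` claim with the user's current `tokenVersion`. A minimal sketch, assuming a `sub` claim carries the user id (the claim name and both interfaces are illustrative, not taken from the codebase):

```typescript
interface JwtClaims {
  sub: string; // assumed name for the user-id claim
  role: "VIEWER" | "OPERATOR" | "ADMIN";
  tv: number; // tokenVersion at the time the JWT was signed
}

interface UserRow {
  id: string;
  tokenVersion: number;
}

// True only when the token was minted for the user's current tokenVersion.
// A deleted user (undefined row) or a bumped version both fail.
function isTokenCurrent(claims: JwtClaims, user: UserRow | undefined): boolean {
  return (
    user !== undefined &&
    user.id === claims.sub &&
    user.tokenVersion === claims.tv
  );
}

const alice: UserRow = { id: "u1", tokenVersion: 3 };
const aliceClaims: JwtClaims = { sub: "u1", role: "OPERATOR", tv: 3 };
```

Bumping `alice.tokenVersion` to 4, as a password change would, makes `isTokenCurrent(aliceClaims, alice)` return `false` even though the JWT signature and expiry are still valid.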
Why two systems
Operators need to be able to log in and out, recover passwords (someday), and have a personality (the dashboard says "signed in as Alice"). JWTs are great for that.
Agents need a long-lived secret that can be embedded in a config file, rotated when a node is decommissioned, and pinned to a specific node so a stolen token from one server can't impersonate another. That's a different problem from operator auth, and shoving both into the same primitive made the code dirtier without paying for itself.
Password storage
Bcrypt cost factor 10 (bcrypt.hash(password, 10)). The login path runs
bcrypt.compare against a dummy hash on user-not-found so the timing
profile of "wrong password" vs "no such user" matches — defeats user
enumeration via login response time.
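The dummy-hash pattern can be sketched as follows. To stay self-contained this uses Node's built-in `scryptSync` in place of bcrypt, so it illustrates the control flow rather than the actual login route; the in-memory `users` map and all function names are assumptions.

```typescript
import { randomBytes, scryptSync, timingSafeEqual } from "node:crypto";

interface StoredUser {
  salt: Buffer;
  hash: Buffer;
}

// scrypt stands in for bcrypt here so the example needs no dependencies.
function hashPassword(password: string, salt: Buffer): Buffer {
  return scryptSync(password, salt, 32);
}

// Computed once at startup so the user-not-found path still pays for one hash.
const DUMMY_SALT = randomBytes(16);
const DUMMY_HASH = hashPassword(randomBytes(16).toString("hex"), DUMMY_SALT);

const users = new Map<string, StoredUser>(); // stand-in for the User table

function register(email: string, password: string): void {
  const salt = randomBytes(16);
  users.set(email, { salt, hash: hashPassword(password, salt) });
}

function login(email: string, password: string): boolean {
  const user = users.get(email);
  // Always hash and compare, against the dummy when there is no such user.
  const salt = user ? user.salt : DUMMY_SALT;
  const expected = user ? user.hash : DUMMY_HASH;
  const matches = timingSafeEqual(hashPassword(password, salt), expected);
  return user !== undefined && matches;
}
```

Both failure paths do the same amount of hashing work, which is the property the paragraph above relies on.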
Token storage
NodeToken rows store:
- `prefix` (first 8 chars after the `hbnt_` marker): unique, indexed. Used for the fast lookup in the heartbeat hot path.
- `hash`: SHA-256 hex of the plaintext. The plaintext itself is never persisted; we return it once at creation and the row only knows the hash.
- `nodeId` (optional, but required for agent writes): pins the token. A pinned token used against a different node returns `403`, and agent write/claim endpoints (job pull/output/complete, findings upsert, backup runs, inventory) reject unpinned tokens (`requirePinnedNodeToken`). Anonymous heartbeat is the only exception (dev bootstrap).
- `revokedAt`: soft delete. The check `if (!row || row.revokedAt)` is the first thing the middleware does after the hash compare.
Plaintext appears in exactly two places: (1) the create-token response —
shown once in the dashboard with a Copy button — and (2) wherever the user
chooses to store it (typically /etc/hyprbox/hyprnode.env mode 0600).
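Putting the fields together, minting and checking a node token might look like the sketch below. The `Map` stands in for the prefix-indexed table, and `mintToken`, `checkAgentWrite` and the verdict strings are hypothetical names rather than the real middleware.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

interface NodeTokenRow {
  prefix: string;
  hash: string; // SHA-256 hex of the plaintext
  nodeId: string | null;
  revokedAt: Date | null;
}

type Verdict = "ok" | "revoked-or-unknown" | "unpinned" | "wrong-node";

const MARKER = "hbnt_";
const rows = new Map<string, NodeTokenRow>(); // keyed by prefix

const sha256Hex = (s: string): string =>
  createHash("sha256").update(s).digest("hex");

function mintToken(nodeId: string | null): string {
  // 32 random bytes encode to 43 base64url characters.
  const plaintext = MARKER + randomBytes(32).toString("base64url");
  const prefix = plaintext.slice(MARKER.length, MARKER.length + 8);
  rows.set(prefix, { prefix, hash: sha256Hex(plaintext), nodeId, revokedAt: null });
  return plaintext; // returned once; only the hash is stored
}

function checkAgentWrite(plaintext: string, targetNodeId: string): Verdict {
  if (!plaintext.startsWith(MARKER)) return "revoked-or-unknown";
  const row = rows.get(plaintext.slice(MARKER.length, MARKER.length + 8));
  const candidate = Buffer.from(sha256Hex(plaintext), "hex");
  // Compare against zeros when the prefix is unknown so the work is the same.
  const stored = row ? Buffer.from(row.hash, "hex") : Buffer.alloc(candidate.length);
  const hashOk = timingSafeEqual(candidate, stored);
  if (!row || !hashOk || row.revokedAt) return "revoked-or-unknown";
  if (row.nodeId === null) return "unpinned"; // rejected on write/claim paths
  if (row.nodeId !== targetNodeId) return "wrong-node"; // the 403 case
  return "ok";
}
```

The prefix narrows the lookup to one row, and the constant-time compare on the full hash is what actually authenticates the token.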
RBAC
Three roles (Phase 6.2):
| Role | Can do |
|---|---|
| `VIEWER` | Read everything (fleet, jobs, backups, findings, presets). Cannot mutate. |
| `OPERATOR` | All viewer + queue jobs, mint/revoke tokens, create/edit/delete backup policies, snooze/resolve findings. |
| `ADMIN` | All operator + list/promote/demote users, read the audit log, export it as CSV. |
Implementation:
- The role lives in `User.role` and is baked into the JWT at sign time. `requireAuth` verifies the token's `tv` claim against `User.tokenVersion` on every request (one indexed lookup), so any flow that bumps `tokenVersion` (role change, password change, password reset, user delete) invalidates outstanding JWTs immediately, without waiting for the 7d expiry. Rotating the JWT secret remains the fleet-wide kill switch.
- `requireRole(role)` in `apps/api/src/middleware/auth.ts` chains AFTER `requireAuth`. Every mutation route uses it; reads stay open to any authenticated user. The role hierarchy is `VIEWER < OPERATOR < ADMIN`, so `requireRole('OPERATOR')` admits both OPERATOR and ADMIN.
- First-user bootstrapping: on a fresh deployment, the very first call to `POST /api/auth/register` lands as ADMIN (`User.count() === 0`). After that, public self-registration is closed by default: it returns 403 in production (`HYPRBOX_ALLOW_PUBLIC_REGISTRATION` overrides; default is open outside production for the dev quick-start). Admins provision teammates via admin-only `POST /api/users` (email + role + initial password), then adjust roles with `PATCH /api/users/:id/role`.
- Last-admin guard: `PATCH /api/users/:id/role` and `DELETE /api/users/:id` refuse to demote or delete the last remaining ADMIN, since that would orphan the workspace.
- Job and backup-policy rows are STILL owned by their creator (`createdBy`); other operators see `404` (not `403`) on detail/cancel. This is deliberate: no existence leak.
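The role hierarchy and the last-admin guard are both small enough to express as pure functions. This is a sketch under assumed names (`hasRole`, `wouldOrphanWorkspace`), not the code in `apps/api/src/middleware/auth.ts`:

```typescript
type Role = "VIEWER" | "OPERATOR" | "ADMIN";

// VIEWER < OPERATOR < ADMIN, expressed as ranks.
const ROLE_RANK: Record<Role, number> = { VIEWER: 0, OPERATOR: 1, ADMIN: 2 };

// A caller passes when their rank is at least the required rank.
function hasRole(actual: Role, required: Role): boolean {
  return ROLE_RANK[actual] >= ROLE_RANK[required];
}

// Last-admin guard: nextRole is null for a delete. Refuse any change that
// would take the only remaining ADMIN out of the ADMIN role.
function wouldOrphanWorkspace(
  currentRole: Role,
  nextRole: Role | null,
  adminCount: number,
): boolean {
  return currentRole === "ADMIN" && nextRole !== "ADMIN" && adminCount <= 1;
}
```

With ranks, `hasRole("ADMIN", "OPERATOR")` is true without listing ADMIN explicitly on every operator route.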
If you need to bootstrap a deployment without going through the API
(scripted seeding), use the seed-admin.ts script which writes
role: 'ADMIN' directly.
Rate limiting
@fastify/rate-limit:
- Global: 200 req/min per authenticated user (`Authorization: Bearer` or `hb_token` cookie), falling back to IP for anonymous/invalid requests.
- `/api/auth/login`: 10 req/min, slows credential stuffing.
- `/api/auth/register`: 5 req/min, slows account-creation spam.
- `/api/stream/*`: exempt, since long-lived SSE connections shouldn't count.
429 responses include the standard X-RateLimit-* headers and an
Retry-After so clients back off.
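The keying rule (user when the credential is valid, IP otherwise) can be sketched as below. `RequestLike`, `rateLimitKey` and the `verifyJwt` callback are hypothetical, and the fixed-window counter is only a simple illustration; the plugin's internal algorithm may differ.

```typescript
interface RequestLike {
  headers: Record<string, string | undefined>;
  cookies: Record<string, string | undefined>;
  ip: string;
}

// verifyJwt is an assumed callback: user id for a valid token, otherwise null.
function rateLimitKey(
  req: RequestLike,
  verifyJwt: (token: string) => string | null,
): string {
  const auth = req.headers["authorization"];
  const bearer = auth && auth.startsWith("Bearer ") ? auth.slice(7) : undefined;
  const token = bearer ?? req.cookies["hb_token"];
  const userId = token ? verifyJwt(token) : null;
  // Anonymous or invalid credentials fall back to the client IP.
  return userId ? `user:${userId}` : `ip:${req.ip}`;
}

const hits = new Map<string, { count: number; windowStart: number }>();

// Fixed-window counter: at most `max` requests per key per window.
function allow(key: string, max: number, now: number, windowMs = 60000): boolean {
  const entry = hits.get(key);
  if (!entry || now - entry.windowStart >= windowMs) {
    hits.set(key, { count: 1, windowStart: now });
    return true;
  }
  entry.count += 1;
  return entry.count <= max;
}
```

Keying on the user rather than the IP means operators behind one NAT do not share a budget, while unauthenticated traffic is still bounded per address.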
Transport
- TLS termination at the reverse proxy (Caddy / Nginx). The compose stack binds the API port to `127.0.0.1` by default. Set `WEB_BIND=127.0.0.1` as well when the dashboard is meant to be reachable only through the reverse proxy; otherwise the production compose default exposes the Web container directly on `0.0.0.0:3000`.
- CORS allows credentials and is locked to `CORS_ORIGIN`. EventSource requires `withCredentials: true` and a same-site cookie, so set `CORS_ORIGIN=https://your-web-host` exactly, no wildcards.
- HSTS is set on API responses by `@fastify/helmet` in production, and on Web responses by `apps/web/next.config.ts` in production. The reverse proxy should also send HSTS on public TLS hostnames.
- API responses use the Helmet bundle. Web responses set the non-breaking baseline headers in `apps/web/next.config.ts`: `X-Frame-Options: SAMEORIGIN`, `X-Content-Type-Options: nosniff`, `Referrer-Policy`, COOP/CORP, `Origin-Agent-Cluster`, and a restrictive `Permissions-Policy`.
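The "exactly, no wildcards" requirement follows from how browsers treat credentialed requests: a response with `Access-Control-Allow-Origin: *` is rejected when credentials are included. A minimal sketch of the matching logic, with `corsHeadersFor` as a hypothetical helper rather than the CORS plugin's real implementation:

```typescript
// Returns the CORS headers for a credentialed request, or {} when the
// request origin does not exactly match the configured CORS_ORIGIN.
function corsHeadersFor(
  requestOrigin: string | undefined,
  corsOrigin: string,
): Record<string, string> {
  if (
    requestOrigin === undefined ||
    corsOrigin === "*" || // wildcard is unusable with credentials
    requestOrigin !== corsOrigin
  ) {
    return {};
  }
  return {
    "Access-Control-Allow-Origin": requestOrigin,
    "Access-Control-Allow-Credentials": "true",
    Vary: "Origin",
  };
}
```

An origin that differs by scheme, port, or a trailing slash gets no CORS headers at all, so the browser blocks the cross-origin SSE stream before any cookie-authenticated data is exposed.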
Secrets
Three things should never end up in the repo or in logs:
| Secret | Lives | Don't log |
|---|---|---|
| `JWT_SECRET` | env var only | API never logs the value. |
| User passwords | bcrypt hash on the User row | Login route logs email only. |
| Node tokens | hashed on NodeToken row | Logger logs the prefix only (when used). |
HYPRBOX_REQUIRE_NODE_TOKEN=true in prod (set automatically by the prod
compose) makes anonymous heartbeats hard-fail with 401. Dev keeps it off
so the quick-start works without provisioning a token first.
The error handler in apps/api/src/index.ts logs bodyKeys (the field
names) but never the body itself — so a Zod-validation error on
/api/auth/login won't leak the rejected password into your aggregator.
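Logging field names without values can be sketched like this. `bodyKeys` matches the log field named above, but the helper functions and the shape of the log record are assumptions, not the actual handler in `apps/api/src/index.ts`.

```typescript
// Field names only; non-object bodies (strings, arrays, null) yield [].
function bodyKeys(body: unknown): string[] {
  return body !== null && typeof body === "object" && !Array.isArray(body)
    ? Object.keys(body)
    : [];
}

// Builds the structured log record for a failed request without ever
// copying a body value into it.
function errorLogFields(url: string, body: unknown, message: string) {
  return { url, bodyKeys: bodyKeys(body), message };
}

const logged = errorLogFields(
  "/api/auth/login",
  { email: "alice@example.com", password: "hunter2" },
  "validation failed",
);
```

Here `logged.bodyKeys` is `["email", "password"]`, which is enough to debug a malformed request, and the password string never reaches the log record.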
Common operations
Rotate the JWT secret
Changing JWT_SECRET invalidates every existing operator session — they all
get 401 on the next request and have to log back in. Acceptable for an
emergency; not great for routine rotation.
    # Generate a new value (≥ 32 chars enforced in prod)
    openssl rand -base64 48
    # Edit .env.production, then:
    docker compose -f docker-compose.prod.yml --env-file .env.production up -d api
Revoke a node token
From the dashboard: /dashboard/settings → row → Revoke. Or via API:
    curl -X DELETE http://localhost:4000/api/tokens/<id> \
      -H "Authorization: Bearer $JWT"
The agent gets 401 on its next heartbeat. There's no kill signal — the
worst case is one extra heartbeat in the offline gap.
Identify which token a request used
Tokens log their prefix (8 chars) but never the full string. Search the
API logs for the prefix to trace a session:
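    # --no-log-prefix drops the container-name prefix so each line is plain JSON for jq
    docker compose -f docker-compose.prod.yml logs --no-log-prefix api \
      | jq 'select(.nodeToken.id == "<id>")'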
Audit log
Every sensitive mutation writes a row to AuditEvent: login success/failure,
register, token create/revoke, job create/cancel, backup policy CRUD +
trigger, user create/role/delete, password change/reset, and per-node config
updates. The schema captures userId (nullable, so failed logins still land),
action, resourceType/resourceId, ip, and a small JSON metadata object.
Reads are not logged on purpose — they'd dwarf the interesting events.
The audit table is now exposed at GET /api/audit (admin-only) with
filters (user/email substring, action, resourceType, resourceId),
pagination, and CSV export (?format=csv, capped at 10k rows). The
admin dashboard surface lives at /dashboard/admin/audit.
Full reference: docs/AUDIT.md.
Two operational details worth knowing:
- The helper is fire-and-forget. A failed audit write becomes a warn log and the user-facing request still succeeds. We'd rather drop a row than degrade the response.
- `userId` on `login.failure` is always `null` even when the email matches a real user. That's deliberate, to avoid leaking which emails are registered through the audit table.
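Both details can be illustrated with a small sketch. `AuditRow`, `AuditWriter` and `audit` are hypothetical names, and the writer is injected so the example runs without a database.

```typescript
interface AuditRow {
  userId: string | null;
  action: string;
  resourceType?: string;
  resourceId?: string;
  ip: string;
  metadata?: Record<string, unknown>;
}

type AuditWriter = (row: AuditRow) => Promise<void>;

// Fire-and-forget: the caller never awaits the write, and a rejected
// write is downgraded to a warning instead of failing the request.
function audit(
  write: AuditWriter,
  warn: (msg: string) => void,
  event: AuditRow,
): void {
  // login.failure never records a userId, even if the email matched.
  const row =
    event.action === "login.failure" ? { ...event, userId: null } : event;
  void write(row).catch((err: unknown) =>
    warn(`audit write failed: ${String(err)}`),
  );
}
```

The trade-off is the one stated above: a database hiccup can silently lose an audit row, but it cannot turn a successful login or job creation into an error response.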
What we explicitly do NOT do (yet)
- 2FA / WebAuthn — Phase 9 / commercial.
- SSO (OIDC / SAML) — Phase 9 / commercial.
- Strict enforced CSP — tracked separately. Baseline headers are in place; CSP needs report-only rollout and inline-style/script cleanup first.
- Token usage anomaly detection — none. If an agent token is leaking data, you'd have to spot it manually.
- Per-node IP allowlist: the schema has `from` for firewall rules in presets, not for the API itself. Reverse-proxy-level IP allowlists are the workaround.
If any of those land before Phase 9 it's because a customer or deployment constraint demanded them — open an issue with the scenario.