Security

The threat model assumes the API and dashboard sit on the public internet, the agents reach out from server LANs, and the database is internal. Within that, the goals are:

An attacker without credentials cannot enumerate users, read fleet data, or queue jobs.
A compromised node token cannot impersonate the operator or pivot to other nodes.
A leaked operator JWT lets the attacker do anything that operator could, but expires within 7 days.

Threat #1 — the agent runs as root and executes control-plane bash

This is the largest blast radius in the system; everything else is secondary.

The agent's whole job is to apply system fixes (apt, ufw, ssh, docker), so it runs as root and executes shell that the control-plane renders and stores at queue time. Consequence: anyone who can queue a job for a node gets root code execution on that node. Aggregated across the fleet, the prize targets are, in order:

1. The API host / its secrets. RCE on the API process, a stolen JWT_SECRET, or admin credentials → the attacker mints/queues arbitrary root jobs to every connected node = fleet-wide root RCE.
2. The database, on its own. Jobs are claimed via GET /api/jobs/pending, which serves whatever Job rows are QUEUED. An attacker who can only write the DB (SQL injection, a restored/leaked DB dump, a rogue DBA) can INSERT a QUEUED job row; the next time that node's agent polls, it claims and runs it as root. There is currently no per-job integrity check, so a DB-write primitive is a fleet-root primitive.
3. A node token. Pinned to one nodeId (see Identities), so it's scoped to that node: it cannot queue work for, or claim work from, other nodes.

What constrains this today

Transport + identity: HTTPS; node tokens are pinned to a nodeId (requirePinnedNodeToken on every agent write/claim — a token for node A can't touch node B); operator tokenVersion gives instant revocation; agents dial out (no inbound port on the node).
Authoring: only OPERATOR+ can create jobs (RBAC). The exact rendered script is stored on the Job row at queue time → a complete audit trail of what ran. Injection sinks are hardened (assertCommandSafe rejects shell metacharacters in variables interpolated into command/config_file; SSH port is digit/range-validated) and typed steps are preferred over raw command.
Operator-facing safety: preview the bash, per-tier Apply gating, prechecks that refuse when unsafe, and verify (no false "success"). Note these protect against operator mistakes, not a malicious authenticated attacker — an OPERATOR is trusted to run root fixes by design.
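
To make the injection-sink hardening concrete, here is a minimal sketch of a metacharacter guard in the spirit of assertCommandSafe. The real function's rules are not shown in this document, so the name assertNoShellMetacharacters, the rejected character set, and the isValidPortSpec helper are all assumptions for illustration.

```typescript
// Hypothetical guard, NOT the project's actual assertCommandSafe.
// Characters that could let an interpolated value escape its position in a shell command.
const SHELL_METACHARACTERS = /[;&|`$(){}<>\\'"\n\r*?!#~]/;

function assertNoShellMetacharacters(variableName: string, value: string): void {
  if (SHELL_METACHARACTERS.test(value)) {
    throw new Error(`variable "${variableName}" contains shell metacharacters`);
  }
}

// Assumed shape of the SSH port check: a single port or a "low-high" range.
function isValidPortSpec(spec: string): boolean {
  const match = /^(\d{1,5})(?:-(\d{1,5}))?$/.exec(spec);
  if (!match) return false;
  const low = Number(match[1]);
  const high = match[2] === undefined ? low : Number(match[2]);
  return low >= 1 && high <= 65535 && low <= high;
}
```

A reject-list like this is only a backstop; the typed steps mentioned above avoid interpolating free text in the first place.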

Residual risk + job-payload signing (analysis, not yet implemented)

Targets #1 and #2 above are not mitigated by the controls above. The defense-in-depth option is to sign job payloads: the control-plane signs the rendered script (e.g. Ed25519) at queue time; the agent verifies against a pinned public key and refuses anything unsigned or mis-signed.

What it buys: closes target #2 — a DB-only attacker can write a Job row but cannot produce a valid signature, so the agent rejects it. Also a belt-and-suspenders against a TLS-bypassing MITM.
What it does NOT buy: target #1. If the API host is compromised, the attacker has whatever the API uses to sign → they can sign malicious jobs. Signing only helps if the key lives outside the DB (so #2 holds) and ideally outside the API runtime (HSM / separate signer) — and even then a compromised API can request signatures. So signing meaningfully raises the bar against DB-tampering; it is not a substitute for hardening the API host.
Verdict: worth doing as defense-in-depth against the DB-tampering class (a common, lower-bar attack), with the signing key kept out of the database. Tracked in docs/BACKLOG.md. Not a 1.0 blocker, but the first thing to add when the fleet holds more than one customer's nodes.
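
Since signing is not implemented yet, the following is only a sketch of what the sign-at-queue / verify-on-agent flow could look like with Ed25519 from Node's built-in crypto module. The SignedJob shape, the function names, and the choice to bind jobId and nodeId into the signed bytes (so a valid signature cannot be copied onto another node's row) are assumptions, not the project's design.

```typescript
import { sign, verify, KeyObject } from "node:crypto";

// Hypothetical payload shape for illustration only.
interface SignedJob {
  jobId: string;
  nodeId: string;
  script: string;
  signature: string; // base64-encoded Ed25519 signature
}

// A JSON array keeps field boundaries unambiguous in the signed bytes.
function canonicalBytes(jobId: string, nodeId: string, script: string): Buffer {
  return Buffer.from(JSON.stringify([jobId, nodeId, script]), "utf8");
}

// Control-plane side: sign the rendered script at queue time.
function signJob(privateKey: KeyObject, jobId: string, nodeId: string, script: string): SignedJob {
  // Ed25519 takes no separate digest algorithm, hence the null first argument.
  const signature = sign(null, canonicalBytes(jobId, nodeId, script), privateKey);
  return { jobId, nodeId, script, signature: signature.toString("base64") };
}

// Agent side: refuse anything unsigned or mis-signed before executing.
function verifyJob(pinnedPublicKey: KeyObject, job: SignedJob): boolean {
  if (job.signature.length === 0) return false;
  return verify(
    null,
    canonicalBytes(job.jobId, job.nodeId, job.script),
    pinnedPublicKey,
    Buffer.from(job.signature, "base64"),
  );
}
```

In this sketch the private key would be loaded from somewhere outside the database, and the agent would ship with the public key baked into its config. A DB-only attacker can then write a row but cannot make verifyJob return true.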

Hardening guidance

Treat the API host as crown jewels: minimal surface, patched, restricted egress, JWT_SECRET and any future signing key kept out of the DB.
Rotate JWT_SECRET (fleet-wide kill switch) and revoke node tokens on any suspicion.
Where a node only needs a narrow set of fixes, prefer a NOPASSWD sudoers allowlist for that agent over full root — smaller per-node blast radius (trade-off: presets needing broader privileges will fail their precheck).
Keep the audit log + stored job scripts for forensics (Audit log, below).

Identities

| Identity | How it authenticates | Lifetime |
| --- | --- | --- |
| Operator | HS256 JWT issued by `/api/auth/login`. Bearer header OR `hb_token` cookie. | 7 days, no refresh. |
| Agent | `hbnt_<43 base64url chars>` shared secret. Sent as Bearer. | Until revoked. |

JWT is verified via @fastify/jwt (v10 → fast-jwt). requireAuth also checks the token's tv claim against User.tokenVersion on every request (one indexed lookup) — that per-request check is the per-user revocation mechanism: role change / password change / reset / user delete bump tokenVersion and kill outstanding tokens immediately. The browser hb_token cookie is HttpOnly; SameSite=Lax; Secure (prod). Node tokens are verified by a prefix-indexed lookup + timingSafeEqual against the SHA-256 hash.
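
The per-request revocation check boils down to comparing one integer. A minimal sketch, assuming hypothetical OperatorClaims and UserRecord shapes (the real claims and User model likely carry more fields):

```typescript
// Hypothetical shapes for illustration.
interface OperatorClaims { sub: string; role: string; tv: number; }
interface UserRecord { id: string; tokenVersion: number; }

// True only when the token was minted under the user's current tokenVersion.
function isTokenCurrent(claims: OperatorClaims, user: UserRecord | undefined): boolean {
  if (user === undefined) return false; // deleted user: every outstanding token dies
  return claims.tv === user.tokenVersion;
}

// Any flow that should revoke sessions (role change, password change, reset) bumps the counter.
function bumpTokenVersion(user: UserRecord): UserRecord {
  return { ...user, tokenVersion: user.tokenVersion + 1 };
}
```

The signature check proves the token is genuine; this comparison proves it is still wanted.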

Why two systems

Operators need to log in and out, recover passwords (someday), and have a visible identity (the dashboard says "signed in as Alice"). JWTs are great for that.

Agents need a long-lived secret that can be embedded in a config file, rotated when a node is decommissioned, and pinned to a specific node so a stolen token from one server can't impersonate another. That's a different problem from operator auth, and shoving both into the same primitive made the code dirtier without paying for itself.

Password storage

Bcrypt cost factor 10 (bcrypt.hash(password, 10)). The login path runs bcrypt.compare against a dummy hash on user-not-found so the timing profile of "wrong password" vs "no such user" matches — defeats user enumeration via login response time.
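
The dummy-hash trick can be sketched as follows. To keep the example runnable on the standard library alone it swaps bcrypt for scrypt from node:crypto; the structure (always run the expensive hash, then decide) is the point, and the names checkLogin, hashPassword, and DUMMY are hypothetical.

```typescript
import { randomBytes, scryptSync, timingSafeEqual } from "node:crypto";

// Stand-in for a bcrypt hash: scrypt salt + derived key.
interface StoredPassword { salt: Buffer; hash: Buffer; }

function hashPassword(password: string, salt: Buffer = randomBytes(16)): StoredPassword {
  return { salt, hash: scryptSync(password, salt, 32) };
}

// Computed once at startup from a random password that no account uses.
const DUMMY: StoredPassword = hashPassword(randomBytes(32).toString("hex"));

function checkLogin(users: Map<string, StoredPassword>, email: string, password: string): boolean {
  const stored = users.get(email);
  // Always pay for one hash, even when the user does not exist,
  // so "wrong password" and "no such user" cost roughly the same time.
  const target = stored ?? DUMMY;
  const candidate = scryptSync(password, target.salt, 32);
  const matches = timingSafeEqual(candidate, target.hash);
  return stored !== undefined && matches;
}
```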

Token storage

NodeToken rows store:

prefix (first 8 chars after the hbnt_ marker) — unique, indexed. Used for the fast lookup in the heartbeat hot path.
hash — SHA-256 hex of the plaintext. The plaintext itself is never persisted; we return it once at creation and the row only knows the hash.
nodeId (optional, but required for agent writes) — pins the token. A pinned token used against a different node returns 403, and agent write/claim endpoints (job pull/output/complete, findings upsert, backup runs, inventory) reject unpinned tokens (requirePinnedNodeToken). Anonymous heartbeat is the only exception (dev bootstrap).
revokedAt — soft delete. The check if (!row || row.revokedAt) is the first thing the middleware does after the hash compare.

Plaintext appears in exactly two places: (1) the create-token response — shown once in the dashboard with a Copy button — and (2) wherever the user chooses to store it (typically /etc/hyprbox/hyprnode.env mode 0600).
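
A sketch of the mint/verify/pin flow described above, using node:crypto. Two details are assumptions: that the token body is 32 random bytes (which base64url-encode to exactly 43 characters), and that the SHA-256 is taken over the full plaintext including the hbnt_ marker. The function names and the in-memory Map standing in for the NodeToken table are hypothetical.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

interface NodeTokenRow {
  prefix: string;        // first 8 chars after "hbnt_"
  hash: string;          // SHA-256 hex of the plaintext
  nodeId: string | null; // null = unpinned
  revokedAt: Date | null;
}

function mintToken(nodeId: string | null): { plaintext: string; row: NodeTokenRow } {
  const body = randomBytes(32).toString("base64url"); // assumed size: 32 bytes -> 43 chars
  const plaintext = `hbnt_${body}`;
  const hash = createHash("sha256").update(plaintext).digest("hex");
  // The plaintext is returned once; only the prefix and hash are stored.
  return { plaintext, row: { prefix: body.slice(0, 8), hash, nodeId, revokedAt: null } };
}

// Prefix lookup, then constant-time hash compare, then the revocation check.
function findValidToken(rows: Map<string, NodeTokenRow>, presented: string): NodeTokenRow | null {
  if (!presented.startsWith("hbnt_")) return null;
  const row = rows.get(presented.slice(5, 13));
  // Compare against an all-zero hash when no row matches, so both paths do the same work.
  const expected = Buffer.from(row?.hash ?? "0".repeat(64), "hex");
  const actual = createHash("sha256").update(presented).digest();
  const same = timingSafeEqual(actual, expected);
  if (!row || row.revokedAt || !same) return null;
  return row;
}

// Pinning rule for agent write/claim endpoints: unpinned or mismatched tokens are refused.
function canActOnNode(row: NodeTokenRow, targetNodeId: string): boolean {
  return row.nodeId !== null && row.nodeId === targetNodeId;
}
```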

RBAC

Three roles (Phase 6.2):

| Role | Can do |
| --- | --- |
| `VIEWER` | Read everything (fleet, jobs, backups, findings, presets). Cannot mutate. |
| `OPERATOR` | All viewer + queue jobs, mint/revoke tokens, create/edit/delete backup policies, snooze/resolve findings. |
| `ADMIN` | All operator + list/promote/demote users, read the audit log, export it as CSV. |

Implementation:

The role lives in User.role and is baked into the JWT at sign time. requireAuth verifies the token's tv claim against User.tokenVersion on every request (one indexed lookup), so any flow that bumps tokenVersion — role change, password change, password reset, user delete — invalidates outstanding JWTs immediately, without waiting for the 7d expiry. Rotating the JWT secret remains the fleet-wide kill switch.
requireRole(role) in apps/api/src/middleware/auth.ts chains AFTER requireAuth. Every mutation route uses it; reads stay open to any authenticated user. The role hierarchy is VIEWER < OPERATOR < ADMIN, so requireRole('OPERATOR') admits both OPERATOR and ADMIN.
First-user bootstrapping: on a fresh deployment, the very first call to POST /api/auth/register lands as ADMIN (User.count()===0). After that, public self-registration is closed by default — it returns 403 in production (HYPRBOX_ALLOW_PUBLIC_REGISTRATION overrides; default is open outside production for the dev quick-start). Admins provision teammates via admin-only POST /api/users (email + role + initial password), then adjust roles with PATCH /api/users/:id/role.
Last-admin guard: PATCH /api/users/:id/role and DELETE /api/users/:id refuse to demote or delete the last remaining ADMIN — would orphan the workspace.
Job and backup-policy rows are STILL owned by their creator (createdBy); other operators see 404 (not 403) on detail/cancel — deliberate, no existence leak.
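
The hierarchy check and the last-admin guard are both small pure functions at heart. A sketch under assumed names (roleSatisfies, wouldOrphanWorkspace, and the Account shape are hypothetical; the real requireRole is Fastify middleware in apps/api/src/middleware/auth.ts and is not reproduced here):

```typescript
type Role = "VIEWER" | "OPERATOR" | "ADMIN";

// VIEWER < OPERATOR < ADMIN
const ROLE_RANK: Record<Role, number> = { VIEWER: 0, OPERATOR: 1, ADMIN: 2 };

// requireRole('OPERATOR') admits OPERATOR and ADMIN: anything at or above the required rank.
function roleSatisfies(actual: Role, required: Role): boolean {
  return ROLE_RANK[actual] >= ROLE_RANK[required];
}

interface Account { id: string; role: Role; }

// Last-admin guard. newRole = null models a delete; otherwise a role change.
function wouldOrphanWorkspace(users: Account[], targetId: string, newRole: Role | null): boolean {
  const target = users.find((u) => u.id === targetId);
  if (!target || target.role !== "ADMIN" || newRole === "ADMIN") return false;
  const otherAdmins = users.filter((u) => u.role === "ADMIN" && u.id !== targetId).length;
  return otherAdmins === 0;
}
```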

If you need to bootstrap a deployment without going through the API (scripted seeding), use the seed-admin.ts script which writes role: 'ADMIN' directly.

Rate limiting

@fastify/rate-limit:

Global: 200 req/min per authenticated user (Authorization: Bearer or hb_token cookie), falling back to IP for anonymous/invalid requests.
/api/auth/login: 10 req/min — slows credential stuffing.
/api/auth/register: 5 req/min — slows account-creation spam.
/api/stream/*: exempt — long-lived SSE connections shouldn't count.

429 responses include the standard X-RateLimit-* headers and a Retry-After header so clients back off.
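
The keying and budget rules above can be summarized as two pure functions. This is a sketch only: how they are wired into @fastify/rate-limit is not shown in this document, and RateLimitSubject, rateLimitKey, and budgetFor are hypothetical names.

```typescript
// Hypothetical request view: userId is set only after the JWT verified successfully.
interface RateLimitSubject { userId: string | null; ip: string; }

// Authenticated requests share a per-user bucket; anonymous or invalid ones fall back to IP.
function rateLimitKey(subject: RateLimitSubject): string {
  return subject.userId !== null ? `user:${subject.userId}` : `ip:${subject.ip}`;
}

// Budgets in requests per minute; null means the route is exempt.
function budgetFor(path: string): number | null {
  if (path.startsWith("/api/stream/")) return null; // long-lived SSE connections
  if (path === "/api/auth/login") return 10;
  if (path === "/api/auth/register") return 5;
  return 200;
}
```

Keying on the user rather than the IP keeps several operators behind one office NAT from eating each other's budget.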

Transport

TLS termination at the reverse proxy (Caddy / Nginx). The compose stack binds the API port to 127.0.0.1 by default. Set WEB_BIND=127.0.0.1 as well when the dashboard is meant to be reachable only through the reverse proxy; otherwise the production compose default exposes the Web container directly on 0.0.0.0:3000.
CORS allows credentials and is locked to CORS_ORIGIN. EventSource requires withCredentials: true and a same-site cookie — set CORS_ORIGIN=https://your-web-host exactly, no wildcards.
HSTS is set on API responses by @fastify/helmet in production, and on Web responses by apps/web/next.config.ts in production. The reverse proxy should also send HSTS on public TLS hostnames.
API responses use the Helmet bundle. Web responses set the non-breaking baseline headers in apps/web/next.config.ts: X-Frame-Options: SAMEORIGIN, X-Content-Type-Options: nosniff, Referrer-Policy, COOP/CORP, Origin-Agent-Cluster, and a restrictive Permissions-Policy.

Secrets

Three things should never end up in the repo or in logs:

| Secret | Where it lives | Logging rule |
| --- | --- | --- |
| `JWT_SECRET` | env var only | API never logs the value. |
| User passwords | bcrypt hash on the User row | Login route logs `email` only. |
| Node tokens | hashed on NodeToken row | Logger logs the prefix only (when used). |

HYPRBOX_REQUIRE_NODE_TOKEN=true in prod (set automatically by the prod compose) makes anonymous heartbeats hard-fail with 401. Dev keeps it off so the quick-start works without provisioning a token first.

The error handler in apps/api/src/index.ts logs bodyKeys (the field names) but never the body itself — so a Zod-validation error on /api/auth/login won't leak the rejected password into your aggregator.
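
The bodyKeys idea is simple enough to sketch in a few lines. The helper name bodyKeysForLog is hypothetical, and the actual handler in apps/api/src/index.ts may differ in detail:

```typescript
// Returns only the top-level field names of a request body, never the values.
function bodyKeysForLog(body: unknown): string[] {
  if (body === null || typeof body !== "object" || Array.isArray(body)) return [];
  return Object.keys(body as Record<string, unknown>).sort();
}
```

For a rejected login body with email and password fields, the log line would carry the two field names and nothing else, which is enough to debug a malformed request without recording the credential.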

Common operations

Rotate the JWT secret

Changing JWT_SECRET invalidates every existing operator session — they all get 401 on the next request and have to log back in. Acceptable for an emergency; not great for routine rotation.

# Generate a new value (≥ 32 chars enforced in prod)
openssl rand -base64 48

# Edit .env.production, then:
docker compose -f docker-compose.prod.yml --env-file .env.production up -d api

Revoke a node token

From the dashboard: /dashboard/settings → row → Revoke. Or via API:

curl -X DELETE http://localhost:4000/api/tokens/<id> \
  -H "Authorization: Bearer $JWT"

The agent gets 401 on its next heartbeat. There's no kill signal — the worst case is one extra heartbeat in the offline gap.

Identify which token a request used

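Tokens log their prefix (8 chars) but never the full string. To trace a session, filter the API logs on the token's logged fields, for example by id. The --no-log-prefix flag strips the "service |" prefix that docker compose adds to each line, which would otherwise stop jq from parsing the output as JSON:

docker compose -f docker-compose.prod.yml logs --no-log-prefix api \
  | jq 'select(.nodeToken.id == "<id>")'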

Audit log

Every sensitive mutation writes a row to AuditEvent: login success/failure, register, token create/revoke, job create/cancel, backup policy CRUD + trigger, user create/role/delete, password change/reset, and per-node config updates. The schema captures userId (nullable, so failed logins still land), action, resourceType/resourceId, ip, and a small JSON metadata object.

Reads are not logged on purpose — they'd dwarf the interesting events. The audit table is now exposed at GET /api/audit (admin-only) with filters (user/email substring, action, resourceType, resourceId), pagination, and CSV export (?format=csv, capped at 10k rows). The admin dashboard surface lives at /dashboard/admin/audit. Full reference: docs/AUDIT.md.

Two operational details worth knowing:

The helper is fire-and-forget. A failed audit write becomes a warn log and the user-facing request still succeeds. We'd rather drop a row than degrade the response.
userId on login.failure is always null even when the email matches a real user — that's deliberate to avoid leaking which emails are registered through the audit table.
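
Both details can be illustrated with a short sketch. The field names follow the schema described above; the function names, the AuditWriter type, and the "login.failure" action string as a literal are assumptions.

```typescript
interface AuditEventInput {
  userId: string | null;
  action: string;
  resourceType?: string;
  resourceId?: string;
  ip: string;
  metadata?: Record<string, unknown>;
}

type AuditWriter = (event: AuditEventInput) => Promise<void>;

// Fire-and-forget: the caller never awaits the write and never sees its failure.
function recordAudit(write: AuditWriter, warn: (msg: string) => void, event: AuditEventInput): void {
  // Starting from Promise.resolve() also catches a writer that throws synchronously.
  void Promise.resolve()
    .then(() => write(event))
    .catch((err: unknown) => warn(`audit write failed for ${event.action}: ${String(err)}`));
}

// login.failure never carries a userId, even if the email belongs to a real account.
function loginFailureEvent(ip: string): AuditEventInput {
  return { userId: null, action: "login.failure", ip };
}
```

The trade-off is visible in the signature: recordAudit returns void, so a request handler cannot accidentally block on, or fail because of, the audit table.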

What we explicitly do NOT do (yet)

2FA / WebAuthn — Phase 9 / commercial.
SSO (OIDC / SAML) — Phase 9 / commercial.
Strict enforced CSP — tracked separately. Baseline headers are in place; CSP needs report-only rollout and inline-style/script cleanup first.
Token usage anomaly detection — none. If an agent token is leaking data, you'd have to spot it manually.
Per-node IP allowlist — schema has from for firewall rules in presets, not for the API itself. Reverse-proxy-level IP allowlists are the workaround.

If any of those land before Phase 9 it's because a customer or deployment constraint demanded them — open an issue with the scenario.
