Jobs
A job is one preset application against one node. The user queues it from the dashboard or API; the agent on the target node claims it on its next heartbeat tick, runs the rendered bash, and reports back.
Lifecycle
      ┌────────────┐
      │   QUEUED   │  ← user POST /api/jobs (script rendered + frozen)
      └─────┬──────┘
            │ agent GET /api/jobs/pending → atomic claim
            ▼
      ┌────────────┐
      │  RUNNING   │  ← agent exec /bin/bash -s, streams /output every 1.5s
      └─────┬──────┘
            │ agent POST /api/jobs/:id/complete
     ┌──────┴──────┐
     ▼             ▼
┌───────────┐ ┌───────────┐
│ SUCCEEDED │ │  FAILED   │   (CANCELLED reachable from QUEUED only)
└───────────┘ └───────────┘
States:
- QUEUED: created, but no agent has claimed it yet. The user can still cancel it (POST /api/jobs/:id/cancel).
- RUNNING: the agent claimed it (atomic UPDATE WHERE status='QUEUED') and is executing. startedAt is set.
- SUCCEEDED: the complete endpoint was called with exitCode === 0.
- FAILED: non-zero exit code, or the agent reported an errorMsg (e.g. bash failed to start).
- CANCELLED: the user cancelled before the agent picked it up. Cancelling a RUNNING job is a Phase 4 feature (needs the agent reverse channel).
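The state list above can be read as a small transition table. The following is a minimal sketch of that reading, not the API's actual code; the names `TRANSITIONS`, `canTransition`, and `terminalStatus` are hypothetical.

```typescript
type JobStatus = "QUEUED" | "RUNNING" | "SUCCEEDED" | "FAILED" | "CANCELLED";

// Allowed transitions as described in the state list; terminal states have none.
const TRANSITIONS: Record<JobStatus, readonly JobStatus[]> = {
  QUEUED: ["RUNNING", "CANCELLED"],
  RUNNING: ["SUCCEEDED", "FAILED"],
  SUCCEEDED: [],
  FAILED: [],
  CANCELLED: [],
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Maps a completion report to a terminal status: only exit code 0
// without an error message counts as success.
function terminalStatus(exitCode: number, errorMsg?: string): JobStatus {
  return exitCode === 0 && !errorMsg ? "SUCCEEDED" : "FAILED";
}
```

Under this sketch, `canTransition("RUNNING", "CANCELLED")` is false, which matches the note that cancelling a running job is not yet supported.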
The transition QUEUED → RUNNING is the only contended one. We use a single
UPDATE jobs SET status='RUNNING' WHERE id=? AND status='QUEUED' — Postgres'
row-level locking guarantees exactly one winner across N concurrent pullers.
Losers see count=0 from updateMany and return { job: null } to the agent.
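To illustrate the claim semantics, here is a hedged sketch that replaces Postgres with an in-memory table. `JobTable`, `claimIfQueued`, and `pullJob` are hypothetical names; the only behaviour taken from the text is "update only if still QUEUED, report the affected count, return { job: null } to losers".

```typescript
type ClaimStatus = "QUEUED" | "RUNNING";

interface JobRow {
  id: string;
  status: ClaimStatus;
  startedAt?: Date;
}

// Hypothetical in-memory stand-in for the jobs table.
class JobTable {
  private rows: Record<string, JobRow> = {};

  insert(row: JobRow): void {
    this.rows[row.id] = row;
  }

  get(id: string): JobRow | undefined {
    return this.rows[id];
  }

  // Conditional update: the row changes only if it is still QUEUED.
  // Returns the number of affected rows, like an updateMany result.
  claimIfQueued(id: string): { count: number } {
    const row = this.rows[id];
    if (!row || row.status !== "QUEUED") return { count: 0 };
    row.status = "RUNNING";
    row.startedAt = new Date();
    return { count: 1 };
  }
}

// What a puller sees: the job if it won the claim, null otherwise.
function pullJob(table: JobTable, id: string): { job: JobRow | null } {
  const { count } = table.claimIfQueued(id);
  if (count !== 1) return { job: null };
  const row = table.get(id);
  return { job: row === undefined ? null : row };
}
```

Calling `pullJob` twice for the same id yields the job once and `{ job: null }` afterwards. In the real system the database enforces this across processes; the sketch only shows the contract.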
Why the script is frozen at queue time
Job rows store the rendered bash on creation, not a reference to the preset name. So if you:
- Queue a server-light job for prod-eu-1 with ssh_port=2222.
- Edit presets/server-light.yaml and change defaults.
- Restart the API.
The queued job still runs the original script. The audit trail is exact — diffing two jobs of the same preset tells you what changed at apply time, not what the preset YAML happened to say later.
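The freezing behaviour can be sketched as follows. The `{{key}}` placeholder syntax, the `Preset` shape, and the `render` and `queueJob` helpers are assumptions made for illustration; this document does not specify the real template format.

```typescript
interface Preset {
  name: string;
  template: string;
  defaults: Record<string, string>;
}

interface QueuedJob {
  preset: string;
  node: string;
  script: string; // rendered bash, stored on the job itself
}

// Hypothetical renderer: substitutes {{key}} placeholders with
// caller-supplied values, falling back to the preset defaults.
function render(preset: Preset, params: Record<string, string>): string {
  const values: Record<string, string> = { ...preset.defaults, ...params };
  return preset.template.replace(/\{\{(\w+)\}\}/g, (_match: string, key: string) =>
    key in values ? values[key] : ""
  );
}

// Rendering happens once, at queue time; the job keeps the resulting
// string rather than a reference to the preset.
function queueJob(preset: Preset, node: string, params: Record<string, string>): QueuedJob {
  return { preset: preset.name, node, script: render(preset, params) };
}
```

Because `script` is a plain string copied into the job, later edits to the preset object (or its YAML) cannot alter it.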
Output capture
The agent buffers stdout + stderr in memory and flushes every 1.5s via
POST /api/jobs/:id/output. The API appends to the row's stdout/stderr
TEXT columns, capped at 4 MB per stream. On overflow we keep the tail
(the last 4 MB) — that's where the failure usually lives.
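A minimal sketch of the "append, then keep the tail" rule, working on raw bytes. `appendCapped` is a hypothetical helper, and reading 4 MB as 4 * 1024 * 1024 bytes is an assumption.

```typescript
// Assumed value for the 4 MB cap (binary megabytes).
const STDOUT_MAX_BYTES = 4 * 1024 * 1024;

// Appends chunk to existing and, if the result exceeds maxBytes,
// drops bytes from the head so that only the last maxBytes remain.
function appendCapped(
  existing: Uint8Array,
  chunk: Uint8Array,
  maxBytes: number = STDOUT_MAX_BYTES
): Uint8Array {
  const combined = new Uint8Array(existing.length + chunk.length);
  combined.set(existing, 0);
  combined.set(chunk, existing.length);
  if (combined.length <= maxBytes) return combined;
  return combined.slice(combined.length - maxBytes);
}
```

Note that cutting at a byte offset can split a multi-byte UTF-8 character at the start of the kept tail; a real implementation may want to trim to the next character or line boundary.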
Live tail in the browser goes through GET /api/jobs/:id/stream (SSE). The
endpoint polls the row every 500ms and pushes only the delta since the last
tick. When the job hits a terminal state, the server emits event: done and
closes the connection.
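The per-tick work of the stream endpoint can be sketched as two small pure functions: one computes the delta since the last tick, the other formats a Server-Sent Events frame. `deltaSince` and `sseFrame` are hypothetical names, and tracking progress as a character offset is an assumption.

```typescript
// Returns the output added since the last tick plus the new offset.
function deltaSince(full: string, sentChars: number): { delta: string; sentChars: number } {
  return { delta: full.slice(sentChars), sentChars: full.length };
}

// Formats one SSE frame. Each line of data gets its own "data:" field,
// and a blank line terminates the frame, as the SSE format requires.
function sseFrame(data: string, event?: string): string {
  const lines: string[] = [];
  if (event !== undefined) lines.push(`event: ${event}`);
  for (const line of data.split("\n")) lines.push(`data: ${line}`);
  return lines.join("\n") + "\n\n";
}
```

A polling loop would call `deltaSince` every 500ms, write `sseFrame(delta)` when the delta is non-empty, and finish with `sseFrame("", "done")` once the job is terminal. A plain offset assumes the stored output only grows; once the 4 MB cap starts dropping the head, the offset would need adjusting.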
Troubleshooting
My job stays QUEUED forever
Most likely: the agent isn't polling for jobs.
- Check the node is ONLINE on the dashboard. If not, no agent is reaching the API at all; verify HYPRBOX_API_URL and HYPRBOX_NODE_TOKEN.
- Check the agent's logs for [jobrunner] poll error: .... Common causes:
  - "unauthorized — check HYPRBOX_NODE_TOKEN" → the token was revoked or refers to a different node than HYPRBOX_NODE_ID.
  - HTTP error reaching /api/jobs/pending → network/firewall.
- The agent silently skips job polling on non-Linux hosts such as Windows (runtime.GOOS != "linux"). Logs only show heartbeats. Use a Linux VM or container for real jobs.
Job went RUNNING and then stuck there
The agent crashed mid-run or lost network mid-job. There's no automatic timeout sweeper yet (Phase 4) — for now, manually flip the row:
UPDATE jobs SET status='FAILED', error_msg='agent disappeared',
finished_at=now() WHERE id='<id>' AND status='RUNNING';
A new agent process won't pick the same job back up — pending only returns
QUEUED rows.
Job says FAILED — exit 2 and the output mentions distro guard
The preset's targets: list doesn't include the node's /etc/os-release ID.
Either:
- Add the distro to targets: in the YAML and re-queue.
- Or override at apply time with HYPRBOX_FORCE=1 on the agent (last resort).
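The real guard lives in the rendered bash, so the following is only an illustration of the check it performs, written in TypeScript. `parseOsReleaseId` and `distroGuard` are hypothetical names; exit code 2 and the force override come from the description above.

```typescript
// Extracts the ID field from /etc/os-release content, or null if absent.
function parseOsReleaseId(content: string): string | null {
  for (const rawLine of content.split("\n")) {
    const line = rawLine.trim();
    // "ID_LIKE=" and "VERSION_ID=" do not start with "ID=", so they are skipped.
    if (line.indexOf("ID=") !== 0) continue;
    // Values may be wrapped in single or double quotes.
    return line.slice(3).replace(/^["']|["']$/g, "");
  }
  return null;
}

// Returns the exit code the guard would produce: 0 to proceed, 2 to refuse.
function distroGuard(osRelease: string, targets: string[], force: boolean = false): number {
  if (force) return 0; // mirrors the HYPRBOX_FORCE=1 override
  const id = parseOsReleaseId(osRelease);
  return id !== null && targets.indexOf(id) !== -1 ? 0 : 2;
}
```

For an Ubuntu node and a preset with `targets: [debian]`, this returns 2 unless the distro is added to the list or the override is set.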
Output is truncated at exactly 4 MB
That's the hard cap (STDOUT_MAX_BYTES). The tail is kept; the head is
dropped. Either fix the preset to be less chatty (drop set -x) or wait for
Phase 4 (object-storage offload).
"Job not found" on a job I can see in the DB
The GET /api/jobs/:id and cancel endpoints return 404 (not 403) for
jobs owned by another user. Log in as the right operator, or check
createdBy in the row.
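The 404-instead-of-403 behaviour can be sketched like this. `getJobFor` and the `Lookup` type are hypothetical; the point is that "missing" and "owned by someone else" produce the same response, so the endpoint does not reveal whether a job id exists.

```typescript
interface OwnedJob {
  id: string;
  createdBy: string;
}

type Lookup =
  | { status: 200; job: OwnedJob }
  | { status: 404; error: string };

function getJobFor(userId: string, jobs: OwnedJob[], id: string): Lookup {
  const job = jobs.filter((j) => j.id === id)[0];
  // Missing and not-owned jobs are deliberately indistinguishable to the caller.
  if (!job || job.createdBy !== userId) {
    return { status: 404, error: "Job not found" };
  }
  return { status: 200, job };
}
```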
API surface for ops
| Action | Endpoint | Auth |
|---|---|---|
| Queue | POST /api/jobs | user |
| List mine | GET /api/jobs[?nodeId&status] | user |
| Detail | GET /api/jobs/:id | user |
| Cancel queued | POST /api/jobs/:id/cancel | user |
| Live tail | GET /api/jobs/:id/stream (SSE) | user (cookie) |
| Agent pull | GET /api/jobs/pending?nodeId=… | node token |
| Agent flush | POST /api/jobs/:id/output | node token |
| Agent finish | POST /api/jobs/:id/complete | node token |