Jobs

A job is one preset application against one node. The user queues it from the dashboard or API; the agent on the target node claims it on its next heartbeat tick, runs the rendered bash, and reports back.

Lifecycle

        ┌────────────┐
        │  QUEUED    │ ← user POST /api/jobs (script rendered + frozen)
        └─────┬──────┘
              │ agent GET /api/jobs/pending → atomic claim
              ▼
        ┌────────────┐
        │  RUNNING   │ ← agent exec /bin/bash -s, streams /output every 1.5s
        └─────┬──────┘
              │ agent POST /api/jobs/:id/complete
       ┌──────┴──────┐
       ▼             ▼
 ┌───────────┐ ┌───────────┐
 │ SUCCEEDED │ │  FAILED   │   (CANCELLED reachable from QUEUED only)
 └───────────┘ └───────────┘

States:

- QUEUED: created, but no agent has claimed it yet. The user can still cancel it with POST /api/jobs/:id/cancel.
- RUNNING: the agent claimed it (atomic UPDATE WHERE status='QUEUED') and is executing. startedAt is set.
- SUCCEEDED: complete was called with exitCode === 0.
- FAILED: non-zero exit code, or the agent reported an errorMsg (e.g. bash failed to start).
- CANCELLED: the user cancelled before the agent picked it up. Cancelling a RUNNING job is a Phase 4 feature (it needs the agent reverse channel).
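The state rules above can be written down as a small transition table plus a function that picks the terminal state from a completion report. This is an illustrative Python sketch, not the API's actual code; `ALLOWED`, `can_transition`, and `terminal_status` are hypothetical names.

```python
from typing import Optional

# Allowed transitions, as described in the state list (hypothetical table).
ALLOWED = {
    "QUEUED": {"RUNNING", "CANCELLED"},
    "RUNNING": {"SUCCEEDED", "FAILED"},
    "SUCCEEDED": set(),
    "FAILED": set(),
    "CANCELLED": set(),
}


def can_transition(src: str, dst: str) -> bool:
    """Return True if a job may move from src to dst."""
    return dst in ALLOWED.get(src, set())


def terminal_status(exit_code: Optional[int], error_msg: Optional[str]) -> str:
    """Map a completion report to SUCCEEDED or FAILED."""
    # An agent-reported error (e.g. bash failed to start) means FAILED,
    # even when no exit code was captured.
    if error_msg:
        return "FAILED"
    # Only a clean exit counts as success.
    return "SUCCEEDED" if exit_code == 0 else "FAILED"
```

Note that `can_transition("RUNNING", "CANCELLED")` is false here, which matches the current behaviour: a running job cannot be cancelled until the reverse channel exists.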

The transition QUEUED → RUNNING is the only contended one. We use a single UPDATE jobs SET status='RUNNING' WHERE id=? AND status='QUEUED' — Postgres' row-level locking guarantees exactly one winner across N concurrent pullers. Losers see count=0 from updateMany and return { job: null } to the agent.
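The compare-and-set claim can be tried out in isolation. The sketch below uses Python's built-in `sqlite3` with an in-memory database as a stand-in for Postgres, and a two-column `jobs` table that is an assumption for illustration only; `claim_job` is a hypothetical helper. The point it shows is that the affected-row count, not a prior SELECT, decides who won.

```python
import sqlite3


def claim_job(conn: sqlite3.Connection, job_id: str) -> bool:
    """Try to move one job QUEUED -> RUNNING; True only for the winner."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'RUNNING' WHERE id = ? AND status = 'QUEUED'",
        (job_id,),
    )
    conn.commit()
    # rowcount is 1 if this call flipped the row, 0 if someone else got there first.
    return cur.rowcount == 1


# Minimal stand-in schema (assumption: the real table has more columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO jobs VALUES ('job-1', 'QUEUED')")
conn.commit()

first = claim_job(conn, "job-1")   # wins the claim
second = claim_job(conn, "job-1")  # row is already RUNNING, so nothing matches
print(first, second)  # True False
```

A second puller that loses the race takes the `rowcount == 0` branch, which corresponds to returning `{ job: null }` to the agent.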

Why the script is frozen at queue time

Job rows store the rendered bash on creation, not a reference to the preset name. Suppose you:

1. Queue a server-light job for prod-eu-1 with ssh_port=2222.
2. Edit presets/server-light.yaml and change defaults.
3. Restart the API.

The queued job still runs the original script. The audit trail is exact: diffing two jobs of the same preset tells you what changed at apply time, not what the preset YAML happened to say later.
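The effect of freezing at queue time can be illustrated with a toy renderer. Everything in this sketch is an assumption made for illustration: the in-memory `presets` dict stands in for the YAML files, the template text is invented, and `queue_job` and `Job` are hypothetical names. It only demonstrates the idea that the job stores the rendered text, so later preset edits affect new jobs only.

```python
from dataclasses import dataclass
from string import Template

# Hypothetical in-memory stand-in for presets/*.yaml.
presets = {
    "server-light": {
        "template": "echo configuring sshd on port $ssh_port",
        "defaults": {"ssh_port": "22"},
    }
}


@dataclass(frozen=True)
class Job:
    node: str
    preset: str  # kept for display; never re-read at run time
    script: str  # rendered bash, frozen when the job is queued


def queue_job(node: str, preset_name: str, overrides: dict) -> Job:
    """Render the preset now and store the result on the job."""
    preset = presets[preset_name]
    params = {**preset["defaults"], **overrides}
    script = Template(preset["template"]).substitute(params)
    return Job(node=node, preset=preset_name, script=script)


job = queue_job("prod-eu-1", "server-light", {"ssh_port": "2222"})

# Later edit to the preset: only jobs queued afterwards see it.
presets["server-light"]["template"] = "echo hardened sshd on port $ssh_port"
presets["server-light"]["defaults"]["ssh_port"] = "2200"

later = queue_job("prod-eu-1", "server-light", {})
print(job.script)    # echo configuring sshd on port 2222
print(later.script)  # echo hardened sshd on port 2200
```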

Output capture

The agent buffers stdout + stderr in memory and flushes every 1.5s via POST /api/jobs/:id/output. The API appends to the row's stdout/stderr TEXT columns, capped at 4 MB per stream. On overflow we keep the tail (the last 4 MB) — that's where the failure usually lives.
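The keep-the-tail rule is easy to get subtly wrong, so here is a minimal sketch of it. It assumes the cap is counted in bytes and applied after each append; `append_capped` is a hypothetical helper, not the API's actual function.

```python
STDOUT_MAX_BYTES = 4 * 1024 * 1024  # 4 MB per stream


def append_capped(existing: bytes, chunk: bytes, cap: int = STDOUT_MAX_BYTES) -> bytes:
    """Append chunk to existing output and keep only the last `cap` bytes."""
    combined = existing + chunk
    if len(combined) <= cap:
        return combined
    # Drop the head and keep the tail. A byte-level cut may split a multi-byte
    # UTF-8 character at the new start, so decode with errors="replace".
    return combined[len(combined) - cap:]


# Small cap so the behaviour is visible.
buf = append_capped(b"abcdef", b"ghij", cap=8)
print(buf)  # b'cdefghij'
```

Slicing from `len(combined) - cap` rather than `-cap` keeps the function correct even if someone passes a cap of zero.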

Live tail in the browser goes through GET /api/jobs/:id/stream (SSE). The endpoint polls the row every 500ms and pushes only the delta since the last tick. When the job hits a terminal state, the server emits event: done and closes the connection.
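The delta-per-tick behaviour can be sketched as a generator over row snapshots. This is a simplified illustration under stated assumptions: the `snapshots` iterable stands in for the 500ms database poll, only stdout is tracked, the payload of the `done` event (the final status) is a guess, and the offset bookkeeping ignores the case where the 4 MB cap has already dropped the head of the output. `sse_frame` and `stream_deltas` are hypothetical names.

```python
from typing import Iterable, Iterator, Optional, Tuple

TERMINAL = {"SUCCEEDED", "FAILED", "CANCELLED"}


def sse_frame(data: str, event: Optional[str] = None) -> str:
    """Format one Server-Sent Events frame; multi-line data gets one data: field per line."""
    lines = [f"event: {event}"] if event else []
    lines += [f"data: {part}" for part in data.split("\n")]
    return "\n".join(lines) + "\n\n"


def stream_deltas(snapshots: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield SSE frames for (status, stdout) snapshots, one snapshot per poll tick."""
    sent = 0  # number of characters already pushed to the client
    for status, stdout in snapshots:
        if len(stdout) > sent:
            yield sse_frame(stdout[sent:])  # only the new part
            sent = len(stdout)
        if status in TERMINAL:
            yield sse_frame(status, event="done")
            return  # the server closes the connection here


snapshots = [
    ("RUNNING", "a\n"),
    ("RUNNING", "a\n"),         # nothing new: no frame
    ("RUNNING", "a\nb\n"),
    ("SUCCEEDED", "a\nb\nc\n"),
]
frames = list(stream_deltas(snapshots))
print(len(frames))  # 4
```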

Troubleshooting

My job stays `QUEUED` forever

Most likely: the agent isn't polling for jobs.

1. Check that the node is ONLINE on the dashboard. If not, no agent is reaching the API at all; verify HYPRBOX_API_URL and HYPRBOX_NODE_TOKEN.
2. Check the agent's logs for [jobrunner] poll error: .... Common causes:
   - unauthorized — check HYPRBOX_NODE_TOKEN → the token was revoked or refers to a different node than HYPRBOX_NODE_ID.
   - HTTP error reaching /api/jobs/pending → network/firewall.
3. The agent silently skips job polling on any non-Linux host, Windows included (runtime.GOOS != "linux"). Its logs show only heartbeats. Use a Linux VM or container for real jobs.

Job went `RUNNING` and then stuck there

The agent crashed mid-run or lost network mid-job. There's no automatic timeout sweeper yet (Phase 4) — for now, manually flip the row:

UPDATE jobs SET status='FAILED', error_msg='agent disappeared',
  finished_at=now() WHERE id='<id>' AND status='RUNNING';

A new agent process won't pick the same job back up — pending only returns QUEUED rows.

Job says `FAILED — exit 2` and the output mentions distro guard

The preset's targets: list doesn't include the node's /etc/os-release ID. Either:

- Add the distro to targets: in the YAML and re-queue.
- Or override at apply time with HYPRBOX_FORCE=1 on the agent (last resort).

Output is truncated at exactly 4 MB

That's the hard cap (STDOUT_MAX_BYTES). The tail is kept; the head is dropped. Either fix the preset to be less chatty (drop set -x) or wait for Phase 4 (object-storage offload).

"Job not found" on a job I can see in the DB

The GET /api/jobs/:id and cancel endpoints return 404 (not 403) for jobs owned by another user. Log in as the right operator, or check createdBy in the row.

API surface for ops

| Action | Endpoint | Auth |
| --- | --- | --- |
| Queue | `POST /api/jobs` | user |
| List mine | `GET /api/jobs[?nodeId&status]` | user |
| Detail | `GET /api/jobs/:id` | user |
| Cancel queued | `POST /api/jobs/:id/cancel` | user |
| Live tail | `GET /api/jobs/:id/stream` (SSE) | user (cookie) |
| Agent pull | `GET /api/jobs/pending?nodeId=…` | node token |
| Agent flush | `POST /api/jobs/:id/output` | node token |
| Agent finish | `POST /api/jobs/:id/complete` | node token |
