
Findings & Recommendations

The data model behind the autopilot loop. Read docs/AUTOPILOT.md first for the why; this doc covers the how.

Data model

Recommendation 1 ──n── Finding
                       │
                       ├── nodeId → Node
                       ├── resolvedByUserId → User?  (manual resolve)
                       └── resolvedByJobId  → Job?   (auto-resolve via job)

Finding

A discrete risk detected by a scanner. Every Finding row answers "what's wrong, on which node, and how do I know it's the same thing I saw last scan?"

model Finding {
  id               String   @id @default(cuid())
  nodeId           String                       // FK Node.nodeId
  type             String                       // "tls.expiring", "postgres.no-backup", ...
  key              String                       // discriminator inside (nodeId, type)
  severity         FindingSeverity              // INFO | WARN | CRITICAL
  title            String                       // one-line headline
  message          String   @db.Text            // operator-facing explanation
  recommendationId String?                      // nullable: not every finding has a fix yet
  relatedEntities  String   @default("[]")      // JSON array of "kind:id" strings
  metadata         String   @default("{}")      // scanner output, curated down to a small JSON object
  status           FindingStatus                // OPEN | SNOOZED | RESOLVED
  detectedAt       DateTime @default(now())
  lastSeenAt       DateTime @default(now())
  resolvedAt       DateTime?
  resolvedByJobId  String?
  resolvedByUserId String?
  snoozedUntil     DateTime?

  @@unique([nodeId, type, key])               // dedup contract
  @@index([nodeId, status, severity])
  @@index([recommendationId])
}

The (nodeId, type, key) unique constraint is the dedup contract. When the same scanner re-sees the same problem, it upserts: the row's lastSeenAt advances; the id, detectedAt, and any user actions (snooze, manual resolve) survive.
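The upsert behaviour described above can be sketched in memory as follows. This is a minimal illustration, not the actual API code: `FindingStore`, `ScanResult`, and the choice to refresh `severity` and `message` on every re-sighting are assumptions made for the example. It covers only the re-seen case and does not model re-detection after a RESOLVED row.

```typescript
type Severity = "INFO" | "WARN" | "CRITICAL";
type Status = "OPEN" | "SNOOZED" | "RESOLVED";

interface FindingRow {
  id: string;
  nodeId: string;
  type: string;
  key: string;
  severity: Severity;
  message: string;
  status: Status;
  detectedAt: Date;
  lastSeenAt: Date;
}

// What a scanner reports for one problem (hypothetical shape).
interface ScanResult {
  nodeId: string;
  type: string;
  key: string;
  severity: Severity;
  message: string;
}

class FindingStore {
  private rows = new Map<string, FindingRow>();
  private nextId = 1;

  // The map key mirrors @@unique([nodeId, type, key]).
  private static tuple(r: { nodeId: string; type: string; key: string }): string {
    return JSON.stringify([r.nodeId, r.type, r.key]);
  }

  upsert(scan: ScanResult, now: Date): FindingRow {
    const k = FindingStore.tuple(scan);
    const existing = this.rows.get(k);
    if (existing) {
      // Re-seen: advance lastSeenAt and refresh scanner-owned fields (assumed).
      // id, detectedAt and status (snooze, manual resolve) are left untouched.
      existing.lastSeenAt = now;
      existing.severity = scan.severity;
      existing.message = scan.message;
      return existing;
    }
    const row: FindingRow = {
      ...scan,
      id: `f${this.nextId++}`,
      status: "OPEN",
      detectedAt: now,
      lastSeenAt: now,
    };
    this.rows.set(k, row);
    return row;
  }
}
```

Calling `upsert` twice with the same tuple returns the same row both times; only `lastSeenAt` and the dynamic fields move.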

Recommendation

A vetted mapping from a finding type to a preset to run. Curated server-side, seeded into the DB; not user-editable in MVP.

model Recommendation {
  id            String              @id   // "tls.caddy-renew"
  title         String
  description   String              @db.Text
  presetName    String                    // FK by name into the YAML catalogue
  riskLevel     RecommendationRisk        // SAFE | CONFIRM | DANGEROUS | MANUAL
  findingTypes  String              @default("[]")  // JSON array of finding `type`s
  createdAt     DateTime            @default(now())
  updatedAt     DateTime            @updatedAt
  findings      Finding[]
}

One recommendation can serve multiple finding types, which is why findingTypes is a JSON array rather than a single field. It is also why no join table is needed: the row count stays in the dozens, and parsing the JSON on read is cheap.

Each new recommendation lands in apps/api/prisma/seed-recommendations.ts AND ships with a matching preset YAML. The seed script is idempotent — re-running it upserts on the stable string id.
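As a rough illustration of the two points above (idempotent seeding on the stable id, and a JSON `findingTypes` array standing in for a join table), here is an in-memory sketch. The function names, the `Map`-as-table, and the row title are hypothetical. It also assumes that the arrow in the built-in types table ("audit.server-light-baseline → server-light") names the preset.

```typescript
type Risk = "SAFE" | "CONFIRM" | "DANGEROUS" | "MANUAL";

interface RecommendationRow {
  id: string;            // stable string id, e.g. "tls.caddy-renew"
  title: string;
  presetName: string;
  riskLevel: Risk;
  findingTypes: string;  // JSON array stored as text, as in the model
  createdAt: Date;
  updatedAt: Date;
}

interface RecommendationSeed {
  id: string;
  title: string;
  presetName: string;
  riskLevel: Risk;
  findingTypes: string[];
}

// Upsert every seed on its id: re-running keeps createdAt and never duplicates rows.
function seedRecommendations(
  table: Map<string, RecommendationRow>,
  seeds: RecommendationSeed[],
  now: Date,
): void {
  for (const s of seeds) {
    const previous = table.get(s.id);
    table.set(s.id, {
      id: s.id,
      title: s.title,
      presetName: s.presetName,
      riskLevel: s.riskLevel,
      findingTypes: JSON.stringify(s.findingTypes),
      createdAt: previous ? previous.createdAt : now,
      updatedAt: now,
    });
  }
}

// Linear scan plus JSON.parse; acceptable while the table holds dozens of rows.
function recommendationFor(
  table: Map<string, RecommendationRow>,
  findingType: string,
): RecommendationRow | undefined {
  const rows: RecommendationRow[] = [];
  table.forEach((row) => rows.push(row));
  for (const row of rows) {
    const types = JSON.parse(row.findingTypes) as string[];
    if (types.indexOf(findingType) !== -1) return row;
  }
  return undefined;
}
```

With a single seed for `audit.server-light-baseline` listing three finding types, a lookup for any of the three returns the same row.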

Severity model

| Severity | What it means | Health bucket impact |
|----------|---------------|----------------------|
| INFO | FYI; no immediate action needed | Doesn't move the node |
| WARN | Should be addressed soon; not on fire | Bumps the node to Attention |
| CRITICAL | On fire OR will be very soon (cert expiring < 7d, no DB backup) | Bumps the node to At risk |

The dashboard computes per-node health as: CRITICAL > 0 ⇒ At risk else WARN > 0 ⇒ Attention else Healthy. No numeric score — see AUTOPILOT.md for the why.
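The rule above is small enough to write out directly. A minimal sketch, assuming the caller passes in the severities of the findings that currently count for one node; the function name is illustrative.

```typescript
type Severity = "INFO" | "WARN" | "CRITICAL";
type HealthBucket = "Healthy" | "Attention" | "At risk";

// CRITICAL > 0 => At risk, else WARN > 0 => Attention, else Healthy.
// INFO findings never move the node.
function healthBucket(severities: Severity[]): HealthBucket {
  if (severities.indexOf("CRITICAL") !== -1) return "At risk";
  if (severities.indexOf("WARN") !== -1) return "Attention";
  return "Healthy";
}
```

Because the result depends only on presence, ten WARN findings and one WARN finding land in the same bucket, which matches the "no numeric score" choice.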

Lifecycle

                     ┌──────────┐
                     │  (seed)  │  scanner detects something for the first time
                     └────┬─────┘
                          │
                          ▼
                     ┌──────────┐
            ┌────────│   OPEN   │────────┐
            │        └────┬─────┘        │
            │             │              │ next scan: still there?
            │             │              │ → upsert, lastSeenAt advances, stays OPEN
   user clicks       agent re-scan       │ → OPEN but scan misses it for 3 ticks? → RESOLVED
   Snooze            after job           │ → not there? → RESOLVED, resolvedByJobId set
   for N hours       success
            │             │
            ▼             ▼
       ┌─────────┐   ┌──────────┐
       │ SNOOZED │   │ RESOLVED │  terminal — but a future re-detect creates a NEW row
       └────┬────┘   └──────────┘  (the old one stays for history)
            │
            │ snoozedUntil elapses
            ▼
       ┌─────────┐
       │  OPEN   │
       └─────────┘

A few subtleties:

  • Auto-resolve via job: when a Job finishes SUCCEEDED, the API selects findings where recommendation.presetName == job.presetName AND nodeId == job.nodeId AND status == OPEN, and marks them RESOLVED with resolvedByJobId = job.id. If the next scan confirms the fix, they simply stay resolved. If a later scan re-detects the problem, it creates a NEW finding with a different id. The resolved row still holds the (nodeId, type, key) tuple, so reusing that tuple would violate the unique constraint; the new occurrence therefore gets a new key (e.g. with a date suffix) and shows up as a separate row.
  • Snooze with a deadline: snoozedUntil is a timestamp. The UI hides snoozed findings by default; the API treats OPEN and SNOOZED-but-past-the-deadline identically when computing the health bucket.
  • History stays: RESOLVED rows are kept forever (the table is small). Use them to answer "did we already fix this on this node before?".
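The first two subtleties can be sketched as two small functions. This is an illustration under assumptions: the `Job` shape (including a "FAILED" state), the helper names, and passing the recommendation-to-preset mapping as a `Map` are all made up for the example; only the matching rule and the snooze rule come from the text above.

```typescript
type FindingStatus = "OPEN" | "SNOOZED" | "RESOLVED";

interface Finding {
  id: string;
  nodeId: string;
  status: FindingStatus;
  recommendationId: string | null;
  snoozedUntil: Date | null;
  resolvedAt: Date | null;
  resolvedByJobId: string | null;
}

interface Job {
  id: string;
  nodeId: string;
  presetName: string;
  status: "SUCCEEDED" | "FAILED"; // "FAILED" is an assumed second state
}

// presetByRecommendation maps Recommendation.id -> Recommendation.presetName.
// Returns the findings that were flipped to RESOLVED.
function resolveForJob(
  job: Job,
  findings: Finding[],
  presetByRecommendation: Map<string, string>,
  now: Date,
): Finding[] {
  const resolved: Finding[] = [];
  if (job.status !== "SUCCEEDED") return resolved;
  for (const f of findings) {
    if (f.nodeId !== job.nodeId || f.status !== "OPEN" || f.recommendationId === null) continue;
    if (presetByRecommendation.get(f.recommendationId) !== job.presetName) continue;
    f.status = "RESOLVED";
    f.resolvedAt = now;
    f.resolvedByJobId = job.id;
    resolved.push(f);
  }
  return resolved;
}

// OPEN findings count toward the health bucket, and so do SNOOZED findings
// whose snoozedUntil deadline has already passed.
function countsTowardHealth(f: Finding, now: Date): boolean {
  if (f.status === "OPEN") return true;
  return f.status === "SNOOZED"
    && f.snoozedUntil !== null
    && f.snoozedUntil.getTime() <= now.getTime();
}
```

Note that a SNOOZED finding is not touched by `resolveForJob`, since the rule only selects rows with status OPEN.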

Adding a new finding type

A scanner emits findings of a type. To wire a new type end-to-end:

  1. Pick a stable type string. Format: <scanner>.<short-name>, kebab-case, dot-separated. Examples: tls.expiring, ssh.password-auth-enabled, postgres.no-backup, disk.usage-high.
  2. Define the key. It's the discriminator that says "this finding is distinct from another finding of the same type on the same node". For TLS expiry, the key is the cert subject + path. For "disk usage high", it's the mount point.
  3. Pick a severity. Be conservative: INFO is the default. Promote to WARN only when the problem should be addressed soon, and to CRITICAL only when the operator should drop what they're doing.
  4. Write the scanner (Go, in agent/hyprnode/internal/scanner/<name>/). It POSTs [{type, key, severity, title, message, metadata, relatedEntities}] to /api/findings after every tick.
  5. (Optional) Add a recommendation. Add a row to seed-recommendations.ts and write the matching preset. A finding type without a recommendation is acceptable — it shows in the UI without an Apply button.
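To make steps 1 to 4 concrete, the sketch below types one element of the batch posted to /api/findings, checks a type string against the `<scanner>.<short-name>` format, and builds a disk.usage-high payload using the thresholds from the built-in types table. The real scanners are written in Go; this TypeScript version only illustrates the payload shape. The regular expression, the function names, the title and message wording, and treating `metadata` as an object on the wire are assumptions.

```typescript
type Severity = "INFO" | "WARN" | "CRITICAL";

// One element of the array posted to /api/findings (field list from step 4).
interface FindingPayload {
  type: string;
  key: string;
  severity: Severity;
  title: string;
  message: string;
  metadata: Record<string, unknown>;
  relatedEntities: string[]; // "kind:id" strings
}

// Two kebab-case segments joined by exactly one dot (assumed reading of the format).
const FINDING_TYPE = /^[a-z0-9]+(?:-[a-z0-9]+)*\.[a-z0-9]+(?:-[a-z0-9]+)*$/;

function isValidFindingType(type: string): boolean {
  return FINDING_TYPE.test(type);
}

// Returns null below the WARN threshold: nothing to report.
function diskUsageFinding(mountPath: string, usedPercent: number): FindingPayload | null {
  if (usedPercent < 80) return null;
  return {
    type: "disk.usage-high",
    key: `mount:${mountPath}`,                        // stable: the mount point only
    severity: usedPercent >= 90 ? "CRITICAL" : "WARN",
    title: `Disk usage high on ${mountPath}`,
    message: `${mountPath} is ${usedPercent}% full.`, // dynamic value lives here
    metadata: { usedPercent },
    relatedEntities: [],
  };
}
```

The usage percentage appears in `message` and `metadata` but never in `key`, so consecutive scans of the same mount keep hitting the same row.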

API surface

| Action | Endpoint | Auth |
|--------|----------|------|
| Agent: upsert batch | POST /api/findings | node token |
| Agent: docker inventory (cross-ref) | POST /api/nodes/inventory | node token |
| List for the operator | GET /api/findings[?nodeId&status] | user JWT |
| Detail | GET /api/findings/:id | user JWT |
| Snooze | POST /api/findings/:id/snooze | user JWT |
| Manual resolve | POST /api/findings/:id/resolve | user JWT |
| Health rollup per node | GET /api/findings/health/summary | user JWT |

The full reference lives in docs/API.md.

Built-in finding types

The catalogue as of Phase 7. New types land here when the matching scanner ships.

| Type | Severity | Scanner | Key shape | Recommendation |
|------|----------|---------|-----------|----------------|
| tls.expiring | WARN <30d, CRITICAL <7d | scanner/tls.go walks LE + /etc/ssl/certs/hyprbox/* | cert:<subject>:<path> | tls.caddy-renew (SAFE) |
| tls.unreadable | INFO | scanner/tls.go | <path> | none (operator visibility only) |
| ssh.password-auth-enabled | CRITICAL | scanner/ssh.go parses sshd_config (Include + Match + first-write-wins) | sshd:<config-path> | ssh.disable-password-auth (DANGEROUS) |
| ssh.unreadable | INFO | scanner/ssh.go | sshd:<path> | none |
| disk.usage-high | WARN ≥80%, CRITICAL ≥90% | scanner/disk.go via gopsutil, skips pseudo FS | mount:<path> | disk.docker-prune (CONFIRM); only meaningful on docker hosts |
| postgres.no-backup | CRITICAL | API cross-ref on POST /api/nodes/inventory × BackupPolicy | container:<name> | postgres.add-backup (CONFIRM) → hypervault-restic |
| audit.kernel-updates-pending | WARN | scanner/audit.go parses apt list --upgradable for -security source | audit:kernel-updates | audit.apply-security-updates (CONFIRM) → apt-security-upgrade |
| audit.root-password-set | WARN | /etc/shadow line for root has a real hash starting with $ | audit:root-password | none; MANUAL (passwd -l root) |
| audit.sudo-nopasswd | WARN | greps /etc/sudoers{,.d/*} for NOPASSWD: | audit:sudo-nopasswd | none; MANUAL (per-rule review) |
| audit.world-writable-etc | WARN | find /etc -xdev -type f -perm -o+w returns ≥1 | audit:world-writable-etc | none; MANUAL (per-file chmod o-w) |
| audit.ufw-inactive | CRITICAL on public-IP host, WARN otherwise | ufw status verbose parsed for Status: active AND Default: deny (incoming) | audit:ufw | audit.server-light-baseline (CONFIRM) → server-light |
| audit.fail2ban-down | WARN | systemctl is-active fail2ban ≠ active | audit:fail2ban | audit.server-light-baseline (CONFIRM) |
| audit.no-automatic-updates | INFO | /etc/apt/apt.conf.d/20auto-upgrades missing or both directives ≠ "1" | audit:auto-upgrades | audit.server-light-baseline (CONFIRM) |
| audit.no-disk-encryption | INFO | lsblk -no TYPE has no row equal to crypt | audit:luks | none; MANUAL (LUKS can't be added live) |
| audit.no-auditd | INFO | systemctl is-active auditd ≠ active | audit:auditd | none (INFO only) |

Reconciliation rules per type

Not all findings are emit-only. Some get auto-resolved by the API when the underlying state changes, without waiting for a Job to run:

  • postgres.no-backup: the POST /api/nodes/inventory endpoint walks existing OPEN findings on the node. If the container in the key is no longer present in the live inventory, OR a BackupPolicy now targets the node, the finding flips to RESOLVED on that tick. If a later inventory shows the container back AND no policy, the finding is re-emitted.
  • All other types: resolved either by a successful Job whose preset matches the recommendation, OR by a manual POST /api/findings/:id/resolve. They are re-emitted on the next scan if the scanner sees the problem again.
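The postgres.no-backup rule above could look roughly like this on the API side. The function name, the in-memory finding shape, and reducing "a BackupPolicy now targets the node" to a boolean argument are simplifications made for this sketch.

```typescript
interface TrackedFinding {
  id: string;
  nodeId: string;
  type: string;
  key: string; // "container:<name>" for postgres.no-backup
  status: "OPEN" | "SNOOZED" | "RESOLVED";
  resolvedAt: Date | null;
}

// Called once per inventory tick for one node. Returns the ids that were resolved.
function reconcilePostgresNoBackup(
  findings: TrackedFinding[],
  nodeId: string,
  liveContainers: string[],
  nodeHasBackupPolicy: boolean,
  now: Date,
): string[] {
  const resolvedIds: string[] = [];
  for (const f of findings) {
    if (f.nodeId !== nodeId || f.type !== "postgres.no-backup" || f.status !== "OPEN") continue;
    const container = f.key.replace(/^container:/, "");
    const containerGone = liveContainers.indexOf(container) === -1;
    // Either condition alone is enough to resolve on this tick.
    if (containerGone || nodeHasBackupPolicy) {
      f.status = "RESOLVED";
      f.resolvedAt = now;
      resolvedIds.push(f.id);
    }
  }
  return resolvedIds;
}
```

No Job is involved here, so `resolvedByJobId` stays empty for findings resolved this way.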

Writing dedup-safe scanners

Two rules:

  1. Never invent a new key for the same underlying problem. A scanner that emits key="cert-app.client.fr-2026-08-25" (with a date) will produce a new finding row every scan and drown the operator. Use a stable identifier: key="cert:app.client.fr:/etc/letsencrypt/live/app.client.fr/fullchain.pem".
  2. Don't include free-form messages in key. Phrasing changes, the key shouldn't. Anything dynamic (the actual days-until-expiry, current usage percent) goes in message and metadata, not in key.

The unique index does the heavy lifting; the upsert does the right thing as long as the key is stable.
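A short sketch of both rules applied to tls.expiring: the key is built only from the cert subject and path, while the days-until-expiry figure goes into `message` and `metadata`. The function names and message wording are illustrative; the severity thresholds are the ones listed in the built-in types table.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Stable identifier: nothing in here changes from one scan to the next.
function tlsKey(subject: string, path: string): string {
  return `cert:${subject}:${path}`;
}

interface TlsFinding {
  type: "tls.expiring";
  key: string;
  severity: "WARN" | "CRITICAL";
  message: string;
  metadata: { daysLeft: number };
}

// Returns null when the cert has 30 or more days left.
function tlsExpiringFinding(
  subject: string,
  path: string,
  notAfter: Date,
  now: Date,
): TlsFinding | null {
  const daysLeft = Math.floor((notAfter.getTime() - now.getTime()) / DAY_MS);
  if (daysLeft >= 30) return null;
  return {
    type: "tls.expiring",
    key: tlsKey(subject, path),
    severity: daysLeft < 7 ? "CRITICAL" : "WARN",
    message: `Certificate for ${subject} expires in ${daysLeft} days.`, // dynamic
    metadata: { daysLeft },                                              // dynamic
  };
}
```

Scanning the same certificate on two different days yields two different messages and possibly two severities, but one key, so the upsert keeps updating a single row.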

Querying

-- Column names follow the Prisma models above (camelCase, no @map assumed),
-- so they must be double-quoted in PostgreSQL.

-- Open findings per node, grouped by severity
SELECT "nodeId", severity, count(*)
FROM "Finding"
WHERE status = 'OPEN'
GROUP BY "nodeId", severity
ORDER BY "nodeId";

-- Health bucket for one node
SELECT
  CASE
    WHEN sum(CASE WHEN severity='CRITICAL' THEN 1 ELSE 0 END) > 0 THEN 'At risk'
    WHEN sum(CASE WHEN severity='WARN'     THEN 1 ELSE 0 END) > 0 THEN 'Attention'
    ELSE 'Healthy'
  END AS health
FROM "Finding"
WHERE "nodeId" = 'prod-eu-1' AND status = 'OPEN';

-- "How did this finding get resolved?"
SELECT f.id, f.title, f."resolvedAt", j."presetName", j.status
FROM "Finding" f
LEFT JOIN "Job" j ON j.id = f."resolvedByJobId"
WHERE f.id = 'cmp...';

What this is NOT

  • An alerting system. Findings are facts about the present state of a node. Alerts (with rules, channels, dedup windows, escalation policies) are a separate layer — Phase 5 territory.
  • A monitoring buffer. Heartbeats and per-minute metrics live in Heartbeat. Findings only get created when something crosses a threshold OR is structurally wrong (no backup on a DB container).
  • A graph database. relatedEntities is a flat array of kind:identifier strings. The "click around the graph" UX is a Phase 4.5 feature once we have enough scanners to make it interesting.