Dossier schema reference

Every data/dossiers/<driver-path>/dossier.json validates against data/schema.v1.json. The schema is a closed Draft-2020-12 JSON Schema:

  • every property is in required
  • every level has additionalProperties: false
  • every enum is closed

This makes the corpus directly usable as an Astro content collection (mirror the schema in Zod) or any other typed data pipeline.
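
The first two "closed" rules above can be checked mechanically. The following is a minimal sketch, not the project's validator: the function name closed_violations is hypothetical, and it only walks properties and array items, which is assumed to be enough for a schema of this shape.

```python
def closed_violations(schema, path="$"):
    """Return human-readable violations of the 'closed schema' rules."""
    problems = []
    if not isinstance(schema, dict):
        return problems
    props = schema.get("properties")
    if isinstance(props, dict):
        # Rule 1: every declared property must also be listed in "required".
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not in required: {missing}")
        # Rule 2: the object level must forbid undeclared properties.
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties is not false")
        for name, sub in props.items():
            problems += closed_violations(sub, f"{path}.{name}")
    # Array item schemas may describe nested objects (e.g. sources).
    if isinstance(schema.get("items"), dict):
        problems += closed_violations(schema["items"], f"{path}[]")
    return problems
```

Running it over the parsed contents of data/schema.v1.json should return an empty list if the schema is closed in the sense described here.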

Top-level fields

{
  "driver_path":                       "drivers/net/wireless/ath/ar5523",
  "chipset_family":                    "Atheros AR5523 802.11abg USB",
  "hardware_still_sold_new_in_2025":   false,
  "last_widely_available_year":        2010,
  "deployments_today":                 "low",
  "replacement_driver":                null,
  "recommendation_hint":               "deprecate",
  "confidence":                        0.79,
  "sources":                           [{"url": "...", "claim": "..."}],
  "reasoning_notes":                   "Long prose explaining the verdict + provenance of each cited URL."
}

Field-by-field

driver_path — string

Path within the Linux source tree, e.g. drivers/net/wireless/ath/ar5523. Must match the directory the dossier lives in (the validator checks this).
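
The directory check might look roughly like the sketch below. The function name and arguments are hypothetical, and the layout data/dossiers/<driver-path>/dossier.json is taken from the top of this page; the real validator may compare paths differently.

```python
from pathlib import PurePosixPath


def driver_path_matches(dossier_file, dossiers_root, driver_path):
    """True if dossier_file is exactly <dossiers_root>/<driver_path>/dossier.json."""
    expected = PurePosixPath(dossiers_root) / driver_path / "dossier.json"
    return PurePosixPath(dossier_file) == expected
```

For example, driver_path_matches("data/dossiers/drivers/net/wireless/ath/ar5523/dossier.json", "data/dossiers", "drivers/net/wireless/ath/ar5523") evaluates to True.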

chipset_family — string

Short chipset family name as referred to in vendor docs. Free text. May be empty for not-a-driver entries. Examples: “Atheros AR5523 802.11abg USB”, “Cortina/StorLink Gemini SoC”, “3Com Vortex/Boomerang/Cyclone PCI Ethernet”.

hardware_still_sold_new_in_2025 — boolean

Best evidence-based answer to “is this hardware sold new today?” The model is asked to ground this in vendor pages, distro package indexes, retail searches, etc. — not training-data recall alone.

last_widely_available_year — integer | null

The year the hardware was last widely available at retail. The valid range is 1990 to 2026. Null if the model could not pin down a date with confidence; the prompt prefers null over guessing.

deployments_today — enum

One of none | low | medium | high | unknown. The qualitative answer to “is anyone running this today?” It combines evidence from virtualization guest support, embedded use, distro defaults, and hobbyist communities.

  • none — no evidence of any current deployment
  • low — niche / hobbyist / single-vendor industrial, observed but rare
  • medium — real ongoing use in non-trivial population
  • high — broadly deployed
  • unknown — model could not find evidence

replacement_driver — string | null

The upstream driver(s) that cover the same use case today. Null if no clean replacement exists. Examples:

  • ibmveth (replaces IBM eHEA)
  • e1000e (replaces classic 3Com / DEC PCI NICs)
  • ath9k_htc (replaces ar5523 USB Atheros)
  • null (no equivalent — driver is genuinely orphaned)

recommendation_hint — enum

One of keep | keep-annotate | deprecate | remove | unsure | not-a-driver. The headline output of the dossier.

  • keep — actively maintained, broadly deployed, no action needed
  • keep-annotate — mostly inactive but plausible niche use; document the niche rather than remove
  • deprecate — strong evidence the hardware is gone; candidate for the next removal series; no in-flight patch yet
  • remove — an upstream removal patch is already in flight for this driver; the dossier exists to surface that fact and back the patch with evidence
  • unsure — confidence < 0.5; the model declined to commit
  • not-a-driver — the directory is content (routing tables, test fixtures, header-only) rather than a driver. Phase-1 catches most of these; this is the model’s safety net for ones that slipped through.

confidence — number

Range [0, 1]. Self-reported by the model. Calibration:

  • < 0.3 → almost always unsure or not-a-driver early-exit
  • 0.3-0.6 → cautious recommendation, often keep-annotate
  • 0.6-0.8 → defensible recommendation backed by 3-5 cited URLs
  • > 0.8 → strong recommendation; multiple corroborating sources; often a lore.kernel.org cite to active upstream activity
  • 0.9+ → reserved for remove verdicts where an in-flight removal patch is cited
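
For a website or report, the calibration bands above can be turned into a display label. This is only a sketch: the function name and label strings are hypothetical, and the handling of the exact boundaries (0.3, 0.6, 0.8) is an assumption, since the bands above do not say which side they belong to.

```python
def confidence_band(confidence):
    """Map a self-reported confidence in [0, 1] to a coarse label."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence!r}")
    if confidence < 0.3:
        return "very low"    # almost always unsure or not-a-driver
    if confidence < 0.6:
        return "cautious"    # often keep-annotate
    if confidence <= 0.8:
        return "defensible"  # backed by 3-5 cited URLs
    return "strong"          # multiple corroborating sources
```

The example dossier at the top of this page, with confidence 0.79, would land in the "defensible" band.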

sources — array

Zero or more {url, claim} objects. The prompt requires every non-trivial fact to be cited. An empty array is allowed when the model has no evidence and its confidence is correspondingly low.

{
  "url":   "https://lore.kernel.org/netdev/20260422-...-lunn.ch/",
  "claim": "April 22, 2026 removal patch series proposes deleting fmvj18x_cs.c (1213 lines)."
}

URLs are checked after the fact by scripts/spot_check.py, which issues a HEAD request for every URL and reports anomalies.
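
A URL check of this kind can be sketched with the standard library alone. This is not the contents of scripts/spot_check.py; the function names are hypothetical, and treating only http and https as acceptable schemes is an assumption.

```python
from urllib.error import HTTPError
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def is_well_formed(url):
    """True if url parses as an absolute http(s) URL with a host part."""
    try:
        parts = urlsplit(url)
    except ValueError:  # e.g. a malformed IPv6 literal
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def head_status(url, timeout=10.0):
    """Send a HEAD request; return the HTTP status, or None if unreachable."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:  # 4xx/5xx still carry a status code
        return err.code
    except (OSError, ValueError):  # DNS failure, timeout, refused, bad URL
        return None
```

Some servers reject HEAD requests while serving GET normally, so a non-200 result is better treated as an anomaly to review than as proof that a citation is dead.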

reasoning_notes — string

Free text. The model uses this to:

  • explain the verdict in 1-2 sentences
  • name the provenance of each cited URL (“the lore URL came from lore_file_timeline; the LKDDb URL was canonical recall; the vendor EOL URL was found via web_search”)
  • record any caveats or uncertainty

The prompt explicitly asks for provenance attribution, which gives auditors a single place to spot-check tool-vs-recall claims.

Schema enforcement

Two layers of validation:

  1. At inference time: codex is invoked with --output-schema, and the Responses API enforces the schema server-side. The model cannot emit malformed JSON or out-of-enum values; the API refuses such a response.
  2. Post-hoc: scripts/validate_dossiers.py re-validates every dossier with jsonschema.Draft202012Validator, plus structural checks: required files present, driver_path matches directory layout, URLs well-formed, meta.json carries token counts.

On the current corpus, post-hoc validation reports zero issues across 864 dossiers and 3,552 cited URLs.
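
To illustrate what the post-hoc layer guards against, here is a dependency-free approximation of a few of the field rules from this page. It is a sketch only: the real script uses jsonschema.Draft202012Validator against data/schema.v1.json, the names check_dossier and is_number are hypothetical, and string and boolean type checks are omitted for brevity.

```python
FIELDS = (
    "driver_path", "chipset_family", "hardware_still_sold_new_in_2025",
    "last_widely_available_year", "deployments_today", "replacement_driver",
    "recommendation_hint", "confidence", "sources", "reasoning_notes",
)
DEPLOYMENTS = ("none", "low", "medium", "high", "unknown")
HINTS = ("keep", "keep-annotate", "deprecate", "remove", "unsure", "not-a-driver")


def is_number(value, lo, hi, integer=False):
    """True for a non-bool number in [lo, hi]; bool is an int subclass in Python."""
    kinds = int if integer else (int, float)
    return (not isinstance(value, bool)
            and isinstance(value, kinds) and lo <= value <= hi)


def check_dossier(d):
    """Return a list of problems; an empty list means these checks passed."""
    if set(d) != set(FIELDS):
        return [f"unexpected or missing keys: {sorted(set(d) ^ set(FIELDS))}"]
    errors = []
    year = d["last_widely_available_year"]
    if year is not None and not is_number(year, 1990, 2026, integer=True):
        errors.append("last_widely_available_year must be null or 1990..2026")
    if d["deployments_today"] not in DEPLOYMENTS:
        errors.append("deployments_today is not a known enum value")
    if d["recommendation_hint"] not in HINTS:
        errors.append("recommendation_hint is not a known enum value")
    if not is_number(d["confidence"], 0, 1):
        errors.append("confidence must be a number in [0, 1]")
    if not isinstance(d["sources"], list):
        errors.append("sources must be an array")
    else:
        for i, src in enumerate(d["sources"]):
            if not isinstance(src, dict) or set(src) != {"url", "claim"}:
                errors.append(f"sources[{i}] must have exactly url and claim")
    return errors
```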

Mapping to Zod (Astro)

import { z, defineCollection } from "astro:content";

const driver = defineCollection({
  type: "data",
  schema: z.object({
    driver_path: z.string(),
    chipset_family: z.string(),
    hardware_still_sold_new_in_2025: z.boolean(),
    last_widely_available_year: z.number().int().min(1990).max(2026).nullable(),
    deployments_today: z.enum(["none", "low", "medium", "high", "unknown"]),
    replacement_driver: z.string().nullable(),
    recommendation_hint: z.enum([
      "keep", "keep-annotate", "deprecate",
      "remove", "unsure", "not-a-driver",
    ]),
    confidence: z.number().min(0).max(1),
    // Nested objects must be strict too, mirroring additionalProperties: false.
    sources: z.array(z.object({
      url: z.string().url(),
      claim: z.string(),
    }).strict()),
    reasoning_notes: z.string(),
  }).strict(),
});

export const collections = { driver };

Sibling files

Beyond dossier.json, every per-driver dir contains:

  • summary.json — derived index entry: {driver_path, valid_json, recommendation_hint, confidence, sources, tool_counts, tool_log}
  • meta.json — invocation receipt: {driver_path, kernel_root, kernel_head_sha, phase1_since, codex_cmd, elapsed_s, exit_code, in_tokens, cached_input_tokens, out_tokens, generated_at}
  • static_features.json — phase-1 git-log facts: see ranking.md for the field list
  • prompt.md — the prompt sent to codex (auditable)
  • events.jsonl — full codex event stream
  • stderr.log — codex stderr

For the website, the dossier is the primary content. The other files are useful for an “audit trail” view per driver: show what was asked, what tools fired, how long it took, how many tokens it cost, against which kernel SHA.
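
An audit-trail view could start from meta.json alone. The sketch below assumes the field names listed above and assumes in_tokens and out_tokens are plain numbers; the function name audit_summary is hypothetical, and whether in_tokens already includes cached_input_tokens is not stated here, so the total is only indicative.

```python
import json
from pathlib import Path


def audit_summary(driver_dir):
    """Collect the headline audit facts for one per-driver directory."""
    meta_path = Path(driver_dir) / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return {
        "kernel_head_sha": meta["kernel_head_sha"],
        "elapsed_s": meta["elapsed_s"],
        "exit_code": meta["exit_code"],
        # Assumption: input and output token counts can simply be added.
        "total_tokens": meta["in_tokens"] + meta["out_tokens"],
    }
```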

Schema versioning

schema.v1.json is version 1 of the schema. If a future revision changes its shape:

  • bump to schema.v2.json
  • keep v1 around for old dossiers
  • the validator should pick the schema version from the dossier (we don’t currently embed it; consider adding a _schema field if v2 lands)
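
If a _schema field is ever added, version selection could be as small as the sketch below. Everything here is hypothetical: the field does not exist today, it is assumed to hold a positive integer, and the function name and schema_dir default are illustrative.

```python
def schema_path_for(dossier, schema_dir="data", default_version=1):
    """Pick the schema file for a dossier, defaulting to v1 when unmarked."""
    version = dossier.get("_schema", default_version)
    # bool is an int subclass in Python, so reject it explicitly.
    if isinstance(version, bool) or not isinstance(version, int) or version < 1:
        raise ValueError(f"invalid _schema value: {version!r}")
    return f"{schema_dir}/schema.v{version}.json"
```

Defaulting to v1 matters because v1 sets additionalProperties: false, so existing v1 dossiers cannot carry a _schema key without failing their own validation.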

The current schema is intentionally simple. Future fields to consider:

  • cve_count_5y — number of CVEs touching this driver in the last 5 years
  • distro_enablement[] — structured per-distro Y/M/N
  • audit_trail — names of tools used by the model in producing the dossier (currently free-text in reasoning_notes)