Dossier schema reference

Every data/dossiers/<driver-path>/dossier.json validates against data/schema.v1.json. The schema is a closed Draft-2020-12 JSON Schema:

every property is listed in required
every level has additionalProperties: false
every enum is closed
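The first two closure rules can be checked mechanically. The following is a minimal sketch, not the project's validator: the function name `closure_violations` is hypothetical, and it assumes object subschemas are identified by a `properties` mapping. It walks a schema loaded as a Python dict and reports every object level that leaves a property optional or does not set additionalProperties to false.

```python
from typing import Any, Iterator


def closure_violations(schema: Any, path: str = "$") -> Iterator[str]:
    """Yield one message per object subschema that is not closed."""
    if isinstance(schema, dict):
        props = schema.get("properties")
        if isinstance(props, dict):
            # Rule 1: every declared property must also appear in "required".
            missing = sorted(set(props) - set(schema.get("required", [])))
            if missing:
                yield f"{path}: not in required: {missing}"
            # Rule 2: unknown keys must be rejected at this level.
            if schema.get("additionalProperties") is not False:
                yield f"{path}: additionalProperties is not false"
        # Recurse into every nested value (properties, items, and so on).
        for key, value in schema.items():
            yield from closure_violations(value, f"{path}.{key}")
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            yield from closure_violations(item, f"{path}[{index}]")


# Illustrative fragment only; the nested object deliberately omits "claim"
# from "required" so that the check has something to report.
example = {
    "type": "object",
    "properties": {
        "confidence": {"type": "number"},
        "sources": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"url": {"type": "string"}, "claim": {"type": "string"}},
                "required": ["url"],
                "additionalProperties": False,
            },
        },
    },
    "required": ["confidence", "sources"],
    "additionalProperties": False,
}

for message in closure_violations(example):
    print(message)
```

The third rule needs no check of its own, since a JSON Schema enum only ever accepts the values it lists.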

Because the schema is closed, the corpus can be used directly as an Astro content collection (mirror the schema in Zod) or fed into any other typed data pipeline.

Top-level fields

{
  "driver_path":                       "drivers/net/wireless/ath/ar5523",
  "chipset_family":                    "Atheros AR5523 802.11abg USB",
  "hardware_still_sold_new_in_2025":   false,
  "last_widely_available_year":        2010,
  "deployments_today":                 "low",
  "replacement_driver":                null,
  "recommendation_hint":               "deprecate",
  "confidence":                        0.79,
  "sources":                           [{"url": "...", "claim": "..."}],
  "reasoning_notes":                   "Long prose explaining the verdict + provenance of each cited URL."
}

Field-by-field

`driver_path` — `string`

Path within the Linux source tree, e.g. drivers/net/wireless/ath/ar5523. Must match the directory the dossier lives in (the validator checks this).
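The directory check can be pictured as follows. This is a sketch under the assumption that dossiers live at data/dossiers/<driver-path>/dossier.json as described at the top of this document; the function names are hypothetical and are not taken from the validator.

```python
from pathlib import Path


def expected_driver_path(dossier_file: Path, corpus_root: Path) -> str:
    """Derive the driver path from where dossier.json sits under the corpus root."""
    return dossier_file.parent.relative_to(corpus_root).as_posix()


def driver_path_matches(dossier: dict, dossier_file: Path, corpus_root: Path) -> bool:
    """True when the dossier's driver_path field agrees with its directory."""
    return dossier.get("driver_path") == expected_driver_path(dossier_file, corpus_root)


root = Path("data/dossiers")
location = root / "drivers/net/wireless/ath/ar5523" / "dossier.json"
print(driver_path_matches({"driver_path": "drivers/net/wireless/ath/ar5523"}, location, root))
```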

`chipset_family` — `string`

Short chipset family name as referred to in vendor docs. Free text. May be empty for not-a-driver entries. Examples: “Atheros AR5523 802.11abg USB”, “Cortina/StorLink Gemini SoC”, “3Com Vortex/Boomerang/Cyclone PCI Ethernet”.

`hardware_still_sold_new_in_2025` — `boolean`

Best evidence-based answer to “is this hardware sold new today?” The model is asked to ground this in vendor pages, distro package indexes, retail searches, etc. — not training-data recall alone.

`last_widely_available_year` — `integer | null`

The last year the hardware was widely available at retail. The allowed range is 1990-2026. Null if the model could not pin down a year with confidence; the prompt prefers null over a guess.

`deployments_today` — enum

One of none | low | medium | high | unknown. A qualitative answer to “is anyone running this today?” that combines virtualization guest support, embedded use, distro defaults, and hobbyist communities.

none — no evidence of any current deployment
low — niche / hobbyist / single-vendor industrial, observed but rare
medium — real ongoing use in non-trivial population
high — broadly deployed
unknown — model could not find evidence

`replacement_driver` — `string | null`

The upstream driver(s) that cover the same use case today. Null if no clean replacement exists. Examples:

ibmveth (replaces IBM eHEA)
e1000e (replaces classic 3Com / DEC PCI NICs)
ath9k_htc (replaces ar5523 USB Atheros)
null (no equivalent — driver is genuinely orphaned)

`recommendation_hint` — enum

keep — actively maintained, broadly deployed, no action needed
keep-annotate — mostly inactive but plausible niche use; document the niche rather than remove
deprecate — strong evidence the hardware is gone; candidate for the next removal series; no in-flight patch yet
remove — an upstream removal patch is already in flight for this driver; the dossier exists to surface that fact and back the patch with evidence
unsure — confidence < 0.5; the model declined to commit
not-a-driver — the directory is content (routing tables, test fixtures, header-only) rather than a driver. Phase-1 catches most of these; this is the model’s safety net for ones that slipped through.

`confidence` — `number`

Range [0, 1]. Self-reported by the model. Calibration:

< 0.3 → almost always unsure or not-a-driver early-exit
0.3-0.6 → cautious recommendation, often keep-annotate
0.6-0.8 → defensible recommendation backed by 3-5 cited URLs
> 0.8 → strong recommendation; multiple corroborating sources; often a lore.kernel.org cite to active upstream activity
0.9+ → reserved for remove verdicts where an in-flight removal patch is cited
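For a consumer that wants to display these bands, a small lookup may be enough. The sketch below is an assumption-laden reading of the list above: the band labels are hypothetical, and because the list does not say which band owns the boundary values, the code assigns exactly 0.6 and exactly 0.8 to the "defensible" band and 0.9 to the top band.

```python
def confidence_band(confidence: float) -> str:
    """Map a self-reported confidence to a display band (labels are illustrative)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range [0, 1]: {confidence}")
    if confidence < 0.3:
        return "early-exit"   # almost always unsure or not-a-driver
    if confidence < 0.6:
        return "cautious"     # often keep-annotate
    if confidence <= 0.8:
        return "defensible"   # backed by a handful of cited URLs
    if confidence < 0.9:
        return "strong"       # multiple corroborating sources
    return "remove-grade"     # reserved for cited in-flight removal patches


print(confidence_band(0.79))
```

Since the value is self-reported, the band is a reading aid rather than a measured probability.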

`sources` — array

Zero or more {url, claim} objects. The prompt requires every non-trivial fact to be cited. An empty array is allowed when the model has no evidence, in which case confidence should be correspondingly low.

{
  "url":   "https://lore.kernel.org/netdev/20260422-...-lunn.ch/",
  "claim": "April 22, 2026 removal patch series proposes deleting fmvj18x_cs.c (1213 lines)."
}

URLs are additionally checked after the fact by scripts/spot_check.py, which sends a HEAD request to every cited URL and reports anomalies.
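Syntactic well-formedness and a HEAD probe are separable checks, and it can help to see them side by side. The sketch below is not the code of scripts/spot_check.py; the function names, the restriction to http and https, and the 10-second timeout are all assumptions made for illustration.

```python
import urllib.error
import urllib.request
from typing import Optional
from urllib.parse import urlsplit


def is_well_formed(url: str) -> bool:
    """Syntactic check only: an http(s) scheme and a non-empty host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def head_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send a HEAD request; return the HTTP status, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, TimeoutError, ValueError):
        return None      # DNS failure, refused connection, timeout, bad URL


print(is_well_formed("https://lore.kernel.org/"))
```

Some servers reject HEAD while serving GET normally, so a non-2xx status from a probe like this is a prompt for manual review rather than proof of a dead link.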

`reasoning_notes` — `string`

Free text. The model uses this to:

explain the verdict in 1-2 sentences
name the provenance of each cited URL (“the lore URL came from lore_file_timeline; the LKDDb URL was canonical recall; the vendor EOL URL was found via web_search”)
record any caveats or uncertainty

The prompt explicitly asks for provenance attribution, which gives auditors a single place to spot-check tool-vs-recall claims.

Schema enforcement

Two layers of validation:

At inference time: codex is run with --output-schema, and the Responses API enforces the schema server-side. The model cannot emit malformed JSON or an out-of-enum value; the API refuses such a response.
Post-hoc: scripts/validate_dossiers.py re-validates every dossier with jsonschema.Draft202012Validator and adds structural checks: required files are present, driver_path matches the directory layout, URLs are well-formed, and meta.json carries token counts.

On the current corpus, post-hoc validation reports zero issues across 864 dossiers and 3,552 cited URLs.
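Two of the structural checks named above, file presence and token counts in meta.json, can be sketched with the standard library alone. This is not the code of scripts/validate_dossiers.py. It assumes the required files are dossier.json plus the sibling files this document lists, and that "token counts" means the three token fields of meta.json; all function and constant names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, List

# Assumed from this document's list of per-driver files.
REQUIRED_FILES = (
    "dossier.json", "summary.json", "meta.json", "static_features.json",
    "prompt.md", "events.jsonl", "stderr.log",
)
TOKEN_FIELDS = ("in_tokens", "cached_input_tokens", "out_tokens")


def token_field_issues(meta: Any) -> List[str]:
    """Report token fields that are absent or not integers."""
    if not isinstance(meta, dict):
        return ["meta.json is not a JSON object"]
    return [f"meta.json lacks {name}" for name in TOKEN_FIELDS
            if not isinstance(meta.get(name), int)]


def structural_issues(driver_dir: Path) -> List[str]:
    """Collect structural problems for one per-driver directory."""
    issues = [f"missing file: {name}" for name in REQUIRED_FILES
              if not (driver_dir / name).is_file()]
    meta_file = driver_dir / "meta.json"
    if meta_file.is_file():
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            issues.append("meta.json is not valid JSON")
        else:
            issues.extend(token_field_issues(meta))
    return issues
```

Returning a list of messages rather than raising on the first problem lets a corpus-wide run report every issue in one pass.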

Mapping to Zod (Astro)

import { z, defineCollection } from "astro:content";

// One cited source. .strict() mirrors additionalProperties: false at this level too.
const source = z.object({
  url: z.string().url(),
  claim: z.string(),
}).strict();

const driver = defineCollection({
  type: "data",
  schema: z.object({
    driver_path: z.string(),
    chipset_family: z.string(), // may be empty for not-a-driver entries
    hardware_still_sold_new_in_2025: z.boolean(),
    // Integer year in 1990-2026, or null when no year could be pinned down.
    last_widely_available_year: z.number().int().min(1990).max(2026).nullable(),
    deployments_today: z.enum(["none", "low", "medium", "high", "unknown"]),
    replacement_driver: z.string().nullable(),
    recommendation_hint: z.enum([
      "keep", "keep-annotate", "deprecate",
      "remove", "unsure", "not-a-driver",
    ]),
    confidence: z.number().min(0).max(1),
    sources: z.array(source),
    reasoning_notes: z.string(),
  }).strict(), // reject unknown top-level keys, as the JSON Schema does
});

export const collections = { driver };

Sibling files

Beyond dossier.json, every per-driver directory contains:

summary.json — derived index entry: {driver_path, valid_json, recommendation_hint, confidence, sources, tool_counts, tool_log}
meta.json — invocation receipt: {driver_path, kernel_root, kernel_head_sha, phase1_since, codex_cmd, elapsed_s, exit_code, in_tokens, cached_input_tokens, out_tokens, generated_at}
static_features.json — phase-1 git-log facts: see ranking.md for the field list
prompt.md — the prompt sent to codex (auditable)
events.jsonl — full codex event stream
stderr.log — codex stderr

For the website, the dossier is the primary content. The other files are useful for an “audit trail” view per driver: show what was asked, what tools fired, how long it took, how many tokens it cost, against which kernel SHA.
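An audit-trail view mostly needs a flat record assembled from meta.json and summary.json. The sketch below assumes both files have already been parsed into dicts and that they carry the field names listed above. The function name and the output key names are hypothetical, and the shape of tool_counts is not specified in this document, so it is passed through untouched.

```python
from typing import Any, Dict


def audit_trail(meta: Dict[str, Any], summary: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the per-driver audit facts into one record for display."""
    return {
        "driver_path": meta.get("driver_path"),
        "kernel_head_sha": meta.get("kernel_head_sha"),  # which kernel tree was examined
        "command": meta.get("codex_cmd"),                # how codex was invoked
        "elapsed_s": meta.get("elapsed_s"),              # how long the run took
        "exit_code": meta.get("exit_code"),
        "tokens": {                                      # what the run cost
            "in": meta.get("in_tokens"),
            "cached_in": meta.get("cached_input_tokens"),
            "out": meta.get("out_tokens"),
        },
        "tool_counts": summary.get("tool_counts"),       # which tools fired
        "generated_at": meta.get("generated_at"),
    }
```

Missing fields come through as None instead of raising, which keeps the view renderable for a directory whose receipt is incomplete.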

Schema versioning

schema.v1.json is the first version of the schema. If a future revision changes the shape:

bump to schema.v2.json
keep v1 around for old dossiers
the validator should pick the schema version from the dossier (we don’t currently embed it; consider adding a _schema field if v2 lands)
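If a _schema field is ever added, version selection could look like the sketch below. Everything here is hypothetical: the field does not exist today, and the integer encoding of the version, the default schema directory, and the function name are assumptions. Dossiers without the field are treated as v1, which keeps the current corpus valid without rewriting it.

```python
from pathlib import Path


def schema_file_for(dossier: dict, schema_dir: Path = Path("data")) -> Path:
    """Pick the schema file for a dossier, defaulting to v1 when unversioned."""
    version = dossier.get("_schema", 1)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(version, bool) or not isinstance(version, int) or version < 1:
        raise ValueError(f"unsupported _schema value: {version!r}")
    return schema_dir / f"schema.v{version}.json"


print(schema_file_for({}).as_posix())
```

Because v1 sets additionalProperties to false, a _schema key would be rejected by the v1 schema itself, so the field can only appear in dossiers validated against v2 or later.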

The current schema is intentionally simple. Future fields to consider:

cve_count_5y — number of CVEs touching this driver in 5 years
distro_enablement[] — structured per-distro Y/M/N
audit_trail — names of tools used by the model in producing the dossier (currently free-text in reasoning_notes)
