Methodology

Goal

Produce an evidence-backed deprecation recommendation for every driver-bearing subdirectory under drivers/ in the Linux kernel, without burying the operator in cost.

The output is a structured dossier per directory. Each dossier:

  • assigns one of six recommendations: keep, keep-annotate, deprecate, remove, unsure, or not-a-driver
  • carries a confidence in [0, 1]
  • cites at least one source for every non-trivial fact
  • names the chipset family and replacement driver (if any)
  • explicitly notes which tool produced which fact

Taken together, the dossiers form a structured corpus. Anyone (a maintainer, a distro packager, a researcher) can browse it, sort it, filter it, challenge individual verdicts, and rebuild the whole thing from scratch with a different kernel SHA or a different model.

The three phases

Phase 1 — git-log dormancy ranker (no LLM)

Walks drivers/ and produces, per leaf directory containing at least one .c file with a real driver entry-point macro:

  • commits_5y (raw count, last 5 years)
  • substantive_commits_5y (raw minus mechanical sweeps)
  • first_touch_ts, last_touch_ts, last_substantive_touch_ts
  • unique_authors_5y, top_author, top_author_commits
  • a single scalar dormancy_score derived from the above
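The "real driver entry-point macro" filter that decides which directories get these features could be approximated as below. This is a minimal sketch, not the check in phase1_rank.py: the macro list is an assumed, non-exhaustive set of common kernel registration macros, and the helper name has_driver_entry_point is hypothetical.

```python
import re

# Assumed, non-exhaustive set of entry-point macros; the real pipeline's list may differ.
# Anchoring at line start (after optional whitespace) skips "// module_init(...)" comments.
ENTRY_POINT_RE = re.compile(
    r"^\s*(module_init|module_[a-z0-9_]+_driver|builtin_[a-z0-9_]+_driver"
    r"|device_initcall|subsys_initcall)\s*\(",
    re.MULTILINE,
)


def has_driver_entry_point(source: str) -> bool:
    """Return True if the C source text invokes a driver entry-point macro."""
    return ENTRY_POINT_RE.search(source) is not None
```

A directory would then qualify if any of its .c files passes this test.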

A directory’s dormancy score is high when it is old, has had few substantive commits recently, and the most recent substantive touch was long ago. The score is zero for:

  • directories younger than 5 years (recent re-organisations don’t count as dormant)
  • mega-subsystem leaves (any path that falls under an entry in MEGA_SUBSYSTEM_PREFIXES)
  • non-driver content (asset suffix /tests, /include, /dt, … or no driver-entry-point macro found)
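To make the shape of the score concrete, here is a hedged sketch of a scorer with the three zero conditions above. The actual formula lives in ranking.md; the arithmetic below (age times staleness, divided by recent substantive commits) is a hypothetical stand-in, the Features record is illustrative, and the two prefixes are only the examples this document names.

```python
from dataclasses import dataclass


@dataclass
class Features:
    path: str
    age_years: float                # assumed to be derived from first_touch_ts
    substantive_commits_5y: int
    years_since_substantive: float  # assumed to be derived from last_substantive_touch_ts
    is_driver: bool                 # False for asset dirs or when no entry-point macro is found


# Illustrative subset only; the real list is phase1_rank.py::MEGA_SUBSYSTEM_PREFIXES.
MEGA_SUBSYSTEM_PREFIXES = ("drivers/gpu/drm/amd", "drivers/net/ethernet/intel")


def under_prefix(path: str, prefix: str) -> bool:
    """True if path is the prefix directory itself or lives below it."""
    return path == prefix or path.startswith(prefix + "/")


def dormancy_score(f: Features) -> float:
    if f.age_years < 5.0:
        return 0.0  # too young to count as dormant
    if any(under_prefix(f.path, p) for p in MEGA_SUBSYSTEM_PREFIXES):
        return 0.0  # known-active mega-subsystem
    if not f.is_driver:
        return 0.0  # non-driver content
    # Hypothetical formula: older and staler raises the score, recent work lowers it.
    return f.age_years * f.years_since_substantive / (1.0 + f.substantive_commits_5y)
```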

Phase 1 runs in about three minutes on the full kernel tree, costs nothing, and emits a ranked top-N shortlist with parent-subsumption applied (if drivers/foo/bar is shortlisted, no descendant of bar/ is added separately).
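Parent-subsumption can be sketched as a single pass over the ranked list. This is an illustration under one assumption: parents are encountered before their descendants in rank order (the case where a child outranks its parent is not handled here). The function name shortlist is hypothetical.

```python
def shortlist(ranked_paths: list[str], top_n: int) -> list[str]:
    """Take the top-N paths in rank order, skipping descendants of selected paths."""
    selected: list[str] = []
    for path in ranked_paths:
        if len(selected) >= top_n:
            break
        # The trailing "/" keeps drivers/foo/barn from matching drivers/foo/bar.
        if any(path.startswith(parent + "/") for parent in selected):
            continue
        selected.append(path)
    return selected
```

For example, with drivers/foo/bar already selected, drivers/foo/bar/baz is skipped while drivers/foo/barn is still eligible.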

Full design + parameters: ranking.md.

Phase 2 — codex CLI dossier probe

For each shortlisted directory, one call to OpenAI’s codex CLI:

codex exec \
    --ephemeral --ignore-rules --skip-git-repo-check \
    -c model="gpt-5.4" \
    -c model_reasoning_effort="medium" \
    -s workspace-write \
    -C "$KERNEL_ROOT" \
    --add-dir data/dossiers \
    --add-dir "/run/user/$(id -u)" \
    --output-schema data/schema.v1.json \
    -o data/dossiers/<path>/dossier.json \
    --json \
    "<prompt>"

The prompt:

  • explains the role and the available tools
  • carries the phase-1 static features so the model can reason about age + churn without re-reading git
  • defines an “early exit” rule: if the directory is clearly not a driver, return recommendation_hint=not-a-driver immediately, no tool calls
  • sets a tool budget (3-5 tool calls in total) and names the fallback to use when a primary tool fails

The model can use:

  • a lore.kernel.org MCP server (lore_activity, lore_file_timeline, lore_search, lore_regex, etc.)
  • shell, including lei q against the local public-inbox mirror
  • native web_search

--output-schema is non-negotiable: without it the model free-texts around enum values and we lose the discipline that makes the corpus uniform. In data/schema.v1.json every property is listed in required, additionalProperties: false is set at every level, and every enum is closed.
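The two object-level strictness rules can be checked mechanically before a run. The sketch below is not part of the pipeline as described; it is a hypothetical lint that walks a JSON Schema document and reports any object whose properties are not all in required, or whose additionalProperties is not false. It assumes no field is literally named "properties".

```python
def schema_violations(schema, where="$"):
    """Recursively collect strictness violations in a JSON Schema document."""
    problems = []
    if isinstance(schema, dict):
        props = schema.get("properties")
        if isinstance(props, dict):
            missing = sorted(set(props) - set(schema.get("required", [])))
            if missing:
                problems.append(f"{where}: not in required: {missing}")
            if schema.get("additionalProperties") is not False:
                problems.append(f"{where}: additionalProperties is not false")
        # Descend into every nested value (properties, items, and so on).
        for key, value in schema.items():
            problems.extend(schema_violations(value, f"{where}.{key}"))
    elif isinstance(schema, list):
        for i, item in enumerate(schema):
            problems.extend(schema_violations(item, f"{where}[{i}]"))
    return problems
```

Loading data/schema.v1.json with json.load and passing the result to schema_violations should return an empty list if the schema is as strict as described.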

Per-probe cost: ~75s wall clock, ~170k input tokens (~78% cached across a batch), ~3k output tokens, ~$0.20-$0.30 at current gpt-5.4 pricing.

Full design + the structured-output gotchas: pipeline.md.

Phase 3 — validation

Two checks, both runnable any time:

# Structural: schema + required files + driver_path consistency
uv run --script scripts/validate_dossiers.py

# URL resolution: HEAD-check every cited source
uv run --script scripts/spot_check.py data/dossiers

The structural validator confirms that every per-driver directory has the same set of files, that dossier.json validates against the schema, that summary.json and meta.json carry the expected keys, and that driver_path matches the directory layout. On the current corpus it reports zero issues across 864 dossiers.
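The file-set and driver_path checks might look roughly like this. It is a sketch, not scripts/validate_dossiers.py: it assumes driver_path equals the dossier directory's path relative to the dossier root, it omits schema validation and the key checks on summary.json and meta.json, and it only discovers directories that already contain a dossier.json.

```python
import json
from pathlib import Path

REQUIRED_FILES = ("dossier.json", "summary.json", "meta.json")


def check_layout(root: Path) -> list[str]:
    """Report missing files and driver_path mismatches under a dossier root."""
    issues = []
    for dossier in sorted(root.rglob("dossier.json")):
        directory = dossier.parent
        for name in REQUIRED_FILES:
            if not (directory / name).is_file():
                issues.append(f"{directory}: missing {name}")
        # Assumption: driver_path mirrors the directory path below the root.
        expected = directory.relative_to(root).as_posix()
        try:
            data = json.loads(dossier.read_text())
        except json.JSONDecodeError:
            issues.append(f"{dossier}: not valid JSON")
            continue
        actual = data.get("driver_path") if isinstance(data, dict) else None
        if actual != expected:
            issues.append(f"{dossier}: driver_path {actual!r} != {expected!r}")
    return issues
```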

The URL resolver issues HEAD requests via curl -L in parallel and classifies each cited source as 2xx (real and reachable), 4xx (broken), or 5xx (transient). The exception within 4xx is bot-blocked URLs (Anubis on lore, Cloudflare on vendor sites), which return 403/429: the pages exist, so these count as real rather than as fabrications. Across the spot-checked deprecate/remove subset, 0 of 159 URLs were genuine fabrications.
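The classification, with 403/429 split out as bot-blocked rather than broken, can be sketched as follows. The category labels and function names are hypothetical, and the curl flags shown (-I for HEAD, -w for the status code) are one plausible invocation rather than the one scripts/spot_check.py necessarily uses.

```python
import subprocess


def head_status(url: str, timeout_s: int = 20) -> int:
    """HEAD-check a URL with curl, following redirects; returns 0 on failure."""
    result = subprocess.run(
        ["curl", "-sS", "-L", "-I", "-o", "/dev/null",
         "-w", "%{http_code}", "--max-time", str(timeout_s), url],
        capture_output=True, text=True,
    )
    code = result.stdout.strip()
    return int(code) if code.isdigit() else 0


def classify_status(status: int) -> str:
    """Map a final HTTP status to a spot-check category."""
    if 200 <= status < 300:
        return "reachable"
    if status in (403, 429):
        return "bot-blocked"  # page exists, but the checker was refused
    if 400 <= status < 500:
        return "broken"
    if 500 <= status < 600:
        return "transient"
    return "unknown"
```

Only the "broken" bucket then needs a human look to separate link rot from fabrication.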

Why this shape

A few decisions are worth defending explicitly:

  • JSON Schema, not free-text. Asking the model “should this driver be deprecated?” is a different question depending on who is asking. Asking “fill out this 10-field form” is the same question every time. The schema discipline makes the corpus comparable across drivers and re-runnable.
  • Citations or confidence-cap. The prompt says: if you can’t cite real evidence, return sources: [] and cap confidence at 0.3. That nudges the model away from confident-sounding hot takes when web search hasn’t returned anything useful.
  • Six categories, not two. “Keep vs deprecate” loses too much information. keep-annotate (a real category; most legacy drivers belong here) means “leave the code in place but document the niche”. remove means “patches are already in flight”; deprecate means “candidate for the next removal series”; unsure is an escape valve for the model. Six categories let us distinguish active ongoing cleanup from latent deprecation candidates.
  • One dir, one dossier. A driver-file-level pipeline would be more accurate but burns more codex calls and produces redundant output for tightly-related files. The dir-level coarsening is a deliberate trade. See limitations.md.
  • Mega-subsystem blocklist. Drivers under drivers/gpu/drm/amd or drivers/net/ethernet/intel are essentially all keep. The blocklist prevents wasted codex calls on a-priori-known-active code. The list is in phase1_rank.py::MEGA_SUBSYSTEM_PREFIXES.
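The citations-or-confidence-cap rule is stated in the prompt, but it is also easy to audit after the fact. The helper below is a hypothetical post-check, not something the pipeline is documented to run; it assumes the dossier fields are named sources and confidence.

```python
def enforce_confidence_cap(dossier: dict, cap: float = 0.3) -> dict:
    """Return a copy of the dossier with confidence capped when nothing is cited."""
    fixed = dict(dossier)
    if not fixed.get("sources"):
        fixed["sources"] = []
        fixed["confidence"] = min(float(fixed.get("confidence", 0.0)), cap)
    return fixed
```

Comparing the input and output of this function over the corpus would show how often the model ignored the cap.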

Re-running

The pipeline is reproducible given:

  • a Linux kernel checkout (any branch or SHA)
  • the OpenAI codex CLI configured with a working mcp_servers.lore-http entry in ~/.codex/config.toml
  • uv (Python package manager) installed
  • ~$140 of API budget for the full top-1000 sweep

To bump the kernel snapshot, change --ref and --since on the phase-1 invocation; the dormancy formula and filters carry forward. Phase 2 is idempotent: it skips any driver in the new ranking whose dossier.json already exists. Pass --force to re-probe everything.
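The skip-unless-forced behaviour amounts to a one-line existence check. This is a sketch of the idea rather than the code in phase2_probe.py; the function name needs_probe is hypothetical and the on-disk layout is assumed to follow data/dossiers/<path>/dossier.json.

```python
from pathlib import Path


def needs_probe(dossier_root: Path, driver_path: str, force: bool = False) -> bool:
    """Probe when forced, or when no dossier.json exists yet for this driver."""
    return force or not (dossier_root / driver_path / "dossier.json").is_file()
```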

To change the model or the prompt: edit phase2_probe.py and re-run with --force on the affected paths. Each run records its model + reasoning effort in meta.json, so you can mix versions in one corpus and audit later.

What this is not

  • It is not a recommendation that any specific patch should land. The corpus surfaces evidence; humans decide.
  • It is not stable across re-runs in a strict sense — model outputs are deterministic-ish but not bit-identical. Re-running the same prompt with the same model often produces a different source list but a similar verdict.
  • It is not a replacement for upstream review. A remove verdict here is a conversation starter, not a pre-approved patch.