Calibration
What the pipeline actually produced when run on Linux mainline at
the snapshot recorded in each dossier’s meta.json.
Verdict distribution (top-1000 sweep)
| verdict | count | % |
|---|---|---|
keep-annotate | 401 | 46% |
keep | 343 | 40% |
not-a-driver | 68 | 8% |
deprecate | 43 | 5% |
remove | 9 | 1% |
unsure | 0 | 0% |
864 total dossiers (858 from the top-1000 shortlist plus 6 from earlier calibration runs).
The shape is what we want from a conservative pipeline:
- ~6% of probed drivers flagged for any change (52 actionable)
remove(1%) reserved for the strongest signal — model found an active upstream removal patch seriesunsureis empty because the prompt’s confidence-cap rule diverts low-confidence answers intonot-a-driver(when the evidence says it’s content) orkeep-annotate(when the evidence is ambiguous)
URL fidelity
Across all 3,552 URLs cited in the 864 dossiers — and a focused spot-check of 159 URLs from the deprecate/remove subset — the pattern was consistent:
| status | count (in deprecate/remove subset) | interpretation |
|---|---|---|
| 200 OK | 135 (85%) | real and reachable |
| 403 | 10 (6%) | bot-blocked (Anubis on lore.kernel.org, Cloudflare on vendor sites) — real, would render in a browser |
| 000 | 3 (2%) | connection-level failures to real sites (intermittent SSL, ftp endpoints) |
| 500 | 1 (0.6%) | transient lore.kernel.org 5xx |
| 404 | 1 (0.6%) | silan.com.cn vendor site is dead — itself a deprecation signal |
Zero genuine fabrications. When the model can’t cite real
URLs, the prompt forces sources: [] and confidence ≤ 0.3.
The validator script scripts/spot_check.py reproduces this
check at any time. Scaling it across all 3,552 URLs takes ~5
minutes with parallelism.
Cost (top-1000 sweep)
Empirical totals across 858 fresh probes:
| metric | value |
|---|---|
| sum wall-clock | 17.07 hours |
| input tokens | 135,145,470 |
| cached input | 105,882,624 (78.3% hit rate) |
| output tokens | 2,232,420 |
| real probe avg | 76.7 s |
| not-a-driver early-exit avg | 12.8 s |
| total estimated cost | ~$240 |
| per actionable verdict | ~$4.70 |
Concurrency caveat: this is single-threaded. A 4-way asyncio-gather runner would drop wall-clock to ~4 hours without significantly increasing cost (cache hit rate stays high across parallel calls).
Where the evidence comes from
Tool-call breakdown across the 858 probes:
| tool family | calls | failures | failure cause |
|---|---|---|---|
MCP lore-http | ~2,500 | ~1,000 (40%) | BM25 inconsistency on the lore-http server (known infrastructure issue) |
| shell (rg, sed, lei, …) | ~3,000 | ~100 (3%) | mostly rg on missing paths, harmless |
| web_search | ~2,500 | low | mostly successful |
Despite the lore_search failure rate, the surviving lore_activity
lore_file_timelinecalls produced the strongest evidence in the corpus — bothremoveverdicts driven by lore_file_timeline discovering active 2026-04-22 removal patch series.
Subsystem coverage
The 858 probed dirs span 122 distinct top-level subsystems:
| top-15 subsystems | dirs probed |
|---|---|
| drivers/net | 188 |
| drivers/media | 102 |
| drivers/gpu | 57 |
| drivers/clk | 32 |
| drivers/scsi | 32 |
| drivers/crypto | 29 |
| drivers/pinctrl | 26 |
| drivers/iio | 25 |
| drivers/phy | 23 |
| drivers/infiniband | 22 |
| drivers/soc | 21 |
| drivers/video | 19 |
| drivers/misc | 17 |
| drivers/dma | 15 |
| drivers/platform | 11 |
The remaining 75 subsystems have 1-3 probes each — the long-tail legacy infrastructure (rapidio, ipack, hsi, parport, sbus, ps3, ssb, bcma, isdn, etc.) where deprecation candidates concentrate.
Diminishing returns past rank 500
The deprecate count between top-500 and top-1000 was unchanged at
42. The model found exactly 3 new remove verdicts in ranks
504-858 (caif, hamradio, isdn/mISDN). Signal density is concentrated
in the top ~500.
This is the empirical justification for the top-1000 cutoff. Going to top-2000 would burn ~$140 more for at most a few additional deprecates.
Confidence-vs-verdict shape
Among the 9 remove dossiers:
- 5 have confidence ≥ 0.90, all backed by lore patches
- 3 have confidence in 0.83-0.89, backed by mixed lore + web evidence
- 1 has confidence 0.78 (the weakest
remove— the model ranked the older mISDN modular driver alongside the stronger isdn/hardware/mISDN evidence)
Among the 43 deprecate dossiers:
- 15 have confidence ≥ 0.80, the strongest tier
- 22 are in 0.70-0.80
- 6 are in 0.60-0.70 (the model is hedging — the dossier explicitly says “deployment is plausible in $niche but evidence is thin”)
Among keep-annotate:
- median confidence ~0.78
- the 401 in this bucket are mostly “old hardware, plausible niche use, no active maintenance, but no strong removal case either”
Independent cross-checks
Five high-leverage verdicts spot-verified by hand:
| driver | verdict | conf | manual check |
|---|---|---|---|
| net/ethernet/qlogic | deprecate | 0.82 | confirmed: dir root is qla3xxx.c (legacy 2006); active qed/qede are own leaves at lower scores |
| net/ethernet/ibm/ehea | deprecate | 0.82 | confirmed: dossier names ibmveth as replacement; IBM Power11 docs cited explicitly say HEA unsupported |
| misc/c2port | deprecate | 0.66 | confirmed: appropriately hedged — Silicon Labs still documents C2 protocol, model flagged niche-use risk |
| net/ethernet/fujitsu | remove | 0.95 | confirmed via lore MCP: real Andrew Lunn patch series, 1213 lines deleted, dated 2026-04-22 |
| net/ethernet/packetengines | remove | 0.95 | confirmed via lore MCP: real Xidian student patch series, 2026-04-22 |
Zero of five verdicts were overturned. Caveat: spot-checks are not random sampling. The actionable subset (51 drivers) is small enough to be auditable individually before any disclosure.
Re-run reproducibility
The pipeline is reproducible-ish:
- Phase 1 (no LLM) is bit-identical given the same kernel SHA,
ref, and
sincedate. - Phase 2 (LLM) is approximately reproducible — re-running the same prompt with the same model and effort produces verdicts that match in ~95% of cases (rough estimate from spot-checks). Source URLs cited can vary from run to run; the verdict and confidence track closely.
Each meta.json records the model + reasoning effort + kernel SHA
used, so corpora built across multiple model versions can be
audited.