Reproduce
Concrete recipe for re-running every phase of the activity-tracking pipeline against a fresh kernel snapshot, a different model, or a different prompt. Each phase is idempotent and independently re-runnable.
Prereqs
- A Linux kernel checkout, any branch or SHA you want to snapshot against. Set KERNEL_ROOT=/path/to/your/linux for the commands below.
- The OpenAI codex CLI, with a working mcp_servers.lore-http entry in ~/.codex/config.toml. The probe uses lore.kernel.org MCP tools (lore_activity, lore_file_timeline, lore_search, lore_regex) plus the model’s native web_search.
- uv, the Python package manager used by every script in scripts/. The scripts are PEP-723 inline-script style; uv run --script foo.py resolves dependencies and runs the script in one step.
- About $140 of OpenAI API budget for a full top-1000 sweep at gpt-5.4 / model_reasoning_effort="medium". Per-probe cost is ~$0.20–0.30; ~78 % of input tokens cache across a batch.
- (Optional) a lei mirror of lore.kernel.org for offline lore queries. Not required: the MCP server will fetch live if lei isn’t reachable.
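The prerequisites above can be checked before spending any API budget. The following is a minimal sketch of a preflight script, written in the same PEP-723 inline-script style; it is hypothetical and not one of the project's scripts/. The function name missing_prereqs and the check for a drivers/ directory under KERNEL_ROOT are assumptions made for illustration.

```python
# /// script
# requires-python = ">=3.9"
# dependencies = []
# ///
"""Hypothetical preflight check; not part of the project's scripts/."""
import os
import shutil
from pathlib import Path


def missing_prereqs(env=os.environ, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    kernel_root = env.get("KERNEL_ROOT")
    if not kernel_root:
        problems.append("KERNEL_ROOT is not set")
    elif not (Path(kernel_root) / "drivers").is_dir():
        # Assumption: a usable kernel checkout has a top-level drivers/ dir.
        problems.append(f"{kernel_root} has no drivers/ directory")
    for tool in ("codex", "uv"):
        # Only checks that the binary is on PATH, not that it is configured.
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems
```

Saved as, say, preflight.py, it could be run with uv run --script preflight.py after adding a line that prints the returned list. It does not verify the mcp_servers.lore-http entry or the API budget.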
Phase 1 — dormancy ranker (no LLM)
uv run --script scripts/phase1_rank.py \
--kernel-root "$KERNEL_ROOT" \
--since 2021-04-24 \
--top-n 1000 \
--out data/phase1-ranking.json \
--shortlist data/phase1-shortlist.txt
Walks every leaf directory under drivers/, computes the
deterministic features (commit counts, age, last-substantive-touch,
top author, dormancy score), and emits two artifacts:
- data/phase1-ranking.json: the full ranking of all ~2,000 leaf dirs (including the ones forced to score 0 by the mega-subsystem and no-driver-marker filters).
- data/phase1-shortlist.txt: the top-N (default 1000) candidates with parent-subsumption applied.
Wall time: ~3 minutes on a kernel checkout with five years of history. No API spend.
See ranking for the dormancy formula and the mega-subsystem blocklist.
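To make "leaf directory" concrete, here is a minimal sketch that assumes a leaf is simply a directory with no subdirectories. phase1_rank.py may define it differently (for example by applying the driver-marker filter first), and the function name leaf_dirs is hypothetical.

```python
import os


def leaf_dirs(root):
    """Yield paths, relative to root, of directories with no subdirectories."""
    for dirpath, dirnames, _filenames in os.walk(root):
        if not dirnames:
            # No child directories, so this directory is a leaf.
            yield os.path.relpath(dirpath, root)
```

Calling sorted(leaf_dirs(os.path.join(kernel_root, "drivers"))) on a real checkout would give the candidate set that the deterministic features are then computed for, under the assumption stated above.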
Phase 2 — codex dossier probe
uv run --script scripts/phase2_probe.py \
--shortlist data/phase1-shortlist.txt \
--kernel-root "$KERNEL_ROOT" \
--schema data/schema.v1.json \
--out-dir data/dossiers
For each shortlisted directory, runs one codex exec call
producing a strict-schema dossier. The actual command, frozen on
disk in every data/dossiers/<path>/meta.json, is:
codex exec \
--ephemeral --ignore-rules --skip-git-repo-check \
-c model="gpt-5.4" \
-c model_reasoning_effort="medium" \
-s workspace-write \
-C "$KERNEL_ROOT" \
--add-dir data/dossiers/<path> \
--add-dir "/run/user/$(id -u)" \
--output-schema data/schema.v1.json \
-o data/dossiers/<path>/dossier.json \
--json \
"<prompt>"
The probe is idempotent: it skips any directory whose
dossier.json already exists. To re-probe a subset, pass
--force --include drivers/foo,drivers/bar. To re-probe
everything, pass --force alone.
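The skip rule described above can be sketched as a small predicate. This is an illustration of the documented behavior, not the code in phase2_probe.py; the function name should_probe is hypothetical, and the handling of --include without --force (ignored here) is an assumption.

```python
from pathlib import Path


def should_probe(driver_path, out_dir, force=False, include=None):
    """Decide whether one shortlisted directory gets a codex run."""
    dossier = Path(out_dir) / driver_path / "dossier.json"
    if not force:
        # Default, idempotent mode: probe only when no dossier exists yet.
        return not dossier.exists()
    if include:
        # --force --include a,b: re-probe just the named subset.
        return driver_path in include
    # --force alone: re-probe everything.
    return True
```

For example, should_probe("drivers/foo", "data/dossiers") returns False once data/dossiers/drivers/foo/dossier.json is on disk, which is what makes an interrupted batch safe to restart.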
Per-probe stats (recorded in meta.json):
- ~75 seconds wall clock
- ~170k input tokens (~78 % cached after the first few probes in a batch)
- ~3k output tokens
- ~$0.20–0.30 cost at current gpt-5.4 pricing
Sequential probing of all 864 dirs: ~18 hours wall, ~$200.
Parallelize with xargs -P, staying within your account’s RPM/TPM
rate limits, if you want it faster.
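The sequential figures follow directly from the per-probe stats, as the arithmetic sketch below shows. The workers parameter models ideal parallel scaling, which is an assumption; real speedup is bounded by rate limits. The name sweep_estimate is hypothetical.

```python
N_DIRS = 864                # probed directories, from the text above
SECONDS_PER_PROBE = 75      # approximate wall clock per probe
COST_RANGE = (0.20, 0.30)   # approximate dollars per probe


def sweep_estimate(n_dirs=N_DIRS, workers=1):
    """Return (wall_hours, low_cost, high_cost) for a full sweep."""
    # Assumes perfect scaling across workers; rate limits will reduce this.
    wall_hours = n_dirs * SECONDS_PER_PROBE / workers / 3600
    low, high = (round(n_dirs * cost, 2) for cost in COST_RANGE)
    return wall_hours, low, high
```

With one worker this gives 18.0 hours and a cost between $172.80 and $259.20, which brackets the ~$200 quoted above. With four ideal workers the wall time drops to 4.5 hours at the same cost.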
See pipeline for the prompt design and tool budget.
Phase 3 — validate
uv run --script scripts/validate_dossiers.py
uv run --script scripts/spot_check.py data/dossiers
validate_dossiers.py is the structural check: every dossier dir
has the expected files, dossier.json validates against
data/schema.v1.json, and driver_path matches the directory
layout. Across 864 dossiers, it reports zero issues.
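Two of the structural checks can be sketched with the standard library alone. This is a hedged illustration, not validate_dossiers.py itself: it assumes the expected files are dossier.json and meta.json, that driver_path equals the dossier directory's path relative to the dossier root, and it omits the JSON Schema validation step. The name structural_issues is hypothetical.

```python
import json
from pathlib import Path

# Assumption: these two files make up a complete dossier directory.
EXPECTED_FILES = ("dossier.json", "meta.json")


def structural_issues(dossier_root):
    """Return a sorted list of problems found under dossier_root."""
    root = Path(dossier_root)
    # A dossier dir is any directory holding at least one expected file.
    dirs = {p.parent for name in EXPECTED_FILES for p in root.rglob(name)}
    issues = []
    for d in sorted(dirs):
        rel = d.relative_to(root).as_posix()
        missing = [n for n in EXPECTED_FILES if not (d / n).is_file()]
        issues += [f"{rel}: missing {n}" for n in missing]
        if "dossier.json" in missing:
            continue
        try:
            data = json.loads((d / "dossier.json").read_text())
        except json.JSONDecodeError:
            issues.append(f"{rel}: dossier.json is not valid JSON")
            continue
        # Assumption: driver_path mirrors the directory layout exactly.
        if not isinstance(data, dict) or data.get("driver_path") != rel:
            issues.append(f"{rel}: driver_path does not match directory")
    return issues
```

An empty list from structural_issues("data/dossiers") would correspond to the "zero issues" result, for the subset of checks covered here.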
spot_check.py issues curl -L HEAD requests against every
cited URL in parallel and classifies each as 2xx / 3xx / 4xx /
5xx / blocked. Bot-blocked URLs (Anubis on lore, Cloudflare on
some vendor sites) return 403/429 — those are real, just gated.
Genuine fabrications would be 404s.
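The classification step can be sketched as follows. This assumes the "blocked" bucket is exactly the 403 and 429 responses mentioned above; spot_check.py may use additional signals to tell a bot wall from an ordinary 403. The names classify and summarize are hypothetical.

```python
from collections import Counter


def classify(status):
    """Map an HTTP status code to one of the report buckets."""
    if status in (403, 429):
        # Assumption: treated as bot-gated rather than as a dead link.
        return "blocked"
    if 200 <= status < 600:
        return f"{status // 100}xx"
    return "other"


def summarize(statuses):
    """Count how many checked URLs fall into each bucket."""
    return Counter(classify(s) for s in statuses)
```

Under this scheme a 404 lands in the 4xx bucket, so the 4xx count is the number to inspect when looking for fabricated citations.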
Refresh the snapshot
To bump to a newer kernel:
- cd "$KERNEL_ROOT" && git pull
- Re-run Phase 1 (3 min, no cost).
- Re-run Phase 2, which is incremental by default; only newly-shortlisted dirs get probed. Add --force if you want to re-probe drivers whose dossiers might be stale.
- Re-run Phase 3.
- Rebuild the site: cd site && npm run build. The site reads the corpus through the symlinks in site/src/content/driver and site/src/data/, so no code change is needed.
Change the model or the prompt
The model and reasoning effort are in phase2_probe.py:
"-c", 'model="gpt-5.4"',
"-c", 'model_reasoning_effort="medium"',
The prompt is built in build_prompt() in the same file. Edit
either, then re-probe with --force on the affected paths. Each
run preserves its own model, model_reasoning_effort, and the
full codex_cmd argv in meta.json, so a single corpus can mix
versions and remain auditable.
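Because every meta.json records its own settings, a mixed corpus can be audited with a short tally. The sketch below assumes model and model_reasoning_effort are top-level keys in meta.json; the function name model_mix is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def model_mix(dossier_root):
    """Count dossiers per (model, reasoning effort) pair."""
    counts = Counter()
    for meta_file in Path(dossier_root).rglob("meta.json"):
        meta = json.loads(meta_file.read_text())
        # Assumption: both fields are top-level keys; missing ones become None.
        key = (meta.get("model"), meta.get("model_reasoning_effort"))
        counts[key] += 1
    return counts
```

Running model_mix("data/dossiers") after a partial re-probe would show how many dossiers still come from the earlier configuration.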
Bulk export
The site publishes the corpus as flat machine-readable downloads for anyone who wants to crunch it directly without scraping the HTML pages:
- /data/dossiers.json: all 864 LLM dossiers as a single JSON array, full schema (chipset, verdict, confidence, sources, reasoning_notes).
- /data/dossiers.csv: same, flattened to CSV. Sources are joined with ; in a single column; reasoning_notes is preserved as a quoted cell.
- /data/registry.json: all 2,028 leaf driver directories as a single array, with a kind field (dossier or stub) and the deterministic Phase 1 features for every entry.
- /data/registry.csv: same, CSV-flat.
All four files are regenerated on every site build, so they always
match the current snapshot. License: CC-BY-4.0 (data). Cite this
project and the kernel SHA in data/index.json when you use it.
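Reading the CSV export takes one extra step because the sources column packs several URLs into one cell. The sketch below assumes the CSV header uses the schema field names listed above (sources, reasoning_notes, and so on); the function name load_dossiers_csv and the sample row are hypothetical.

```python
import csv
import io


def load_dossiers_csv(csv_text):
    """Parse dossiers.csv text into dicts, splitting sources on ';'."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Sources are joined with ';' in a single column; drop empty pieces.
        parts = (row.get("sources") or "").split(";")
        row["sources"] = [p.strip() for p in parts if p.strip()]
        rows.append(row)
    return rows


# Hypothetical sample row, for illustration only.
SAMPLE = (
    "chipset,verdict,confidence,sources,reasoning_notes\n"
    'ExampleChip,dormant,high,https://a.example/1;https://b.example/2,'
    '"No commits, no mail"\n'
)
```

The csv module handles the quoted reasoning_notes cell, so commas inside the notes do not break the row.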