Phase 2: codex CLI dossier probe
The phase-2 runner takes a shortlist of driver directories and
emits one dossier per directory by calling the OpenAI codex CLI
with strict structured outputs.
It lives in scripts/phase2_probe.py. Each probe takes about 75
seconds and roughly 170k tokens; a full top-1000 sweep
takes ~17 hours of wall-clock time when run serially on a single thread.
Runtime requirements
- codex CLI installed (tested with codex-cli 0.124.0)
- a configured OpenAI API key in your codex auth
- ~/.codex/config.toml with at minimum a [mcp_servers.lore-http] entry pointing at a working lore.kernel.org MCP server
- a Linux kernel checkout for the model to read (passed via -C)
- lei CLI optional; if installed and reachable, the model can drop into shell and use it
- uv for running the Python wrapper
The invocation
codex exec \
--ephemeral \
--ignore-rules \
--skip-git-repo-check \
-c model="gpt-5.4" \
-c model_reasoning_effort="medium" \
-s workspace-write \
-C "$KERNEL_ROOT" \
--add-dir data/dossiers/<driver-path> \
--add-dir "/run/user/$(id -u)" \
--output-schema data/schema.v1.json \
-o data/dossiers/<driver-path>/dossier.json \
--json \
"<prompt>" \
> events.jsonl 2> stderr.log
Each flag has a specific reason to exist:
| flag | what it does | why we need it |
|---|---|---|
| exec | non-interactive mode | one-shot batch probe, no TUI |
| --ephemeral | no session file on disk | dossier IS the artifact; no session state to garbage-collect |
| --ignore-rules | skip ~/.codex/rules.md | reproducibility across operators with different rule files |
| --skip-git-repo-check | tolerate non-git cwd | the working dir is the kernel checkout, but the dossier output dir may not be |
| -c model="gpt-5.4" | model override | explicit so re-runs can compare |
| -c model_reasoning_effort="medium" | medium effort | sufficient for the task; high adds cost without much accuracy |
| -s workspace-write | sandbox mode | shell + MCP + lei need writes; read-only blocks lei’s runtime dir |
| -C "$KERNEL_ROOT" | working dir | model reads kernel source to identify chipset/Kconfig context |
| --add-dir <dossier-dir> | extra writable path | needed for -o to land |
| --add-dir /run/user/$UID | extra writable path | lei’s public-inbox daemon writes here; without it lei q fails with “Read-only file system” |
| --output-schema <path> | structured-output enforcement | rejects free-text replies; closes enum values; non-negotiable |
| -o <path> | write final response to file | one-shot deliverable; pairs with --json |
| --json | event stream as JSONL on stdout | per-call cost accounting and tool-call audit |
Don’ts
- Do not pass --ignore-user-config. It strips MCP server registrations from ~/.codex/config.toml, silently disabling the lore search tool.
- Do not pass -s read-only. It blocks lei’s daemon from creating its runtime directory.
- Do not pass -a (ask-for-approval). It is a top-level flag (TUI-only); codex exec rejects it.
- Do not omit -C. Without it, codex drops into the dossier output dir and pulls in unrelated context.
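For a Python wrapper, the invocation and the don'ts can be expressed as an argument list plus a small guard. This is a sketch under assumptions rather than the code in phase2_probe.py: `build_codex_argv` and `argv_problems` are hypothetical names, and only flags shown in the invocation above are used.

```python
import os


def build_codex_argv(kernel_root, dossier_dir, schema_path, prompt,
                     model="gpt-5.4", effort="medium"):
    """Assemble the codex exec argument list from the invocation above."""
    return [
        "codex", "exec",
        "--ephemeral",
        "--ignore-rules",
        "--skip-git-repo-check",
        # The shell strips the double quotes, so codex sees model=gpt-5.4.
        "-c", f"model={model}",
        "-c", f"model_reasoning_effort={effort}",
        "-s", "workspace-write",
        "-C", str(kernel_root),
        "--add-dir", str(dossier_dir),
        "--add-dir", f"/run/user/{os.getuid()}",  # lei runtime directory
        "--output-schema", str(schema_path),
        "-o", os.path.join(str(dossier_dir), "dossier.json"),
        "--json",
        prompt,
    ]


def argv_problems(argv):
    """List violations of the don'ts; the last element is taken as the prompt."""
    options = argv[:-1]
    problems = []
    for flag in ("--ignore-user-config", "-a"):
        if flag in options:
            problems.append(f"forbidden flag {flag}")
    for flag, value in zip(options, options[1:]):
        if flag == "-s" and value == "read-only":
            problems.append("sandbox must not be read-only")
    if "-C" not in options:
        problems.append("missing -C")
    return problems
```

The list can be handed to `subprocess.run` with stdout redirected to events.jsonl and stderr to stderr.log, mirroring the shell redirections in the invocation.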
Structured-output schema rules
The OpenAI Responses-API structured-output validator is stricter than generic JSON Schema:
- Every property must appear in required. Optional fields break validation. We list every property in required and declare optional values via union types like "type": ["string", "null"].
- additionalProperties: false at every object level. Prevents the model from inventing extra fields.
- format keywords are rejected. "format": "uri" returns HTTP 400. URL fields are plain strings; URL well-formedness is enforced by the validator script post-hoc.
- Closed enums for any categorical field. recommendation_hint is an enum of six strings. deployments_today is an enum of five.
data/schema.v1.json is the canonical schema. See
schema.md for the field-by-field reference.
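The first three rules can be linted locally before a schema is ever sent to the API. The sketch below is an illustration, not the project's validator script: `strict_schema_problems` is a hypothetical helper, it walks only `properties` and `items` (not `anyOf` or `$defs`), and the field names in the usage example are made up.

```python
def strict_schema_problems(schema, path="$"):
    """Return human-readable violations of the strict-schema rules."""
    problems = []
    if not isinstance(schema, dict):
        return problems
    if "format" in schema:
        problems.append(f"{path}: 'format' keyword present")
    props = schema.get("properties")
    if isinstance(props, dict):
        # Every declared property must also be listed in "required".
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not in required: {missing}")
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        for name, sub in props.items():
            problems += strict_schema_problems(sub, f"{path}.{name}")
    items = schema.get("items")
    if isinstance(items, dict):
        problems += strict_schema_problems(items, f"{path}[]")
    return problems


# Hypothetical fragment: "note" is optional via a union type, yet still required.
good = {
    "type": "object",
    "additionalProperties": False,
    "required": ["url", "note"],
    "properties": {
        "url": {"type": "string"},
        "note": {"type": ["string", "null"]},
    },
}
print(strict_schema_problems(good))
```

Running such a check in CI would catch a stray `format` keyword before it shows up as an HTTP 400 mid-sweep.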
The prompt
PROMPT_TEMPLATE in phase2_probe.py is parameterised by the
phase-1 features for that directory. The prompt:
- Sets the role: “hardware obsolescence analyst”
- Defines the early-exit rule: if the directory is clearly not a driver (test fixtures, header-only, etc.), return recommendation_hint=not-a-driver immediately, no tool calls. This saves 60s and ~100k tokens per misclassified dir.
- Lists tool budgets: 3-5 tool calls total, INCLUDING web search
- Lists fallback strategies: if lore_search returns a BM25-inconsistency error, fall back to lore_regex or lore_file_timeline; do not retry the same query
- Inserts phase-1 facts: c file count, substantive vs raw commit count, last touch date, dominant author, parent subsystem
- Asks the model to ground evidence in lore first, then look for deployment evidence
- Reminds the model: confidence < 0.5 → recommendation_hint = unsure. State in reasoning_notes how each cited URL was obtained.
The prompt does NOT prescribe a verdict. It asks the model to
reason from evidence. Empirically this produces a healthy
distribution: 46% keep-annotate, 40% keep, 8% not-a-driver,
5% deprecate, 1% remove.
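Because the confidence rule is only a reminder in the prompt, a post-hoc check can confirm the model followed it. This is a minimal sketch that assumes `confidence` is a top-level numeric field next to `recommendation_hint` in dossier.json; `hint_is_consistent` is a hypothetical helper name.

```python
def hint_is_consistent(dossier: dict) -> bool:
    """Low-confidence dossiers must carry recommendation_hint == 'unsure'."""
    if dossier["confidence"] < 0.5:
        return dossier["recommendation_hint"] == "unsure"
    return True
```

A dossier with confidence 0.3 and a hint of remove would fail this check and could be queued for a re-probe.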
MCP tools (lore.kernel.org)
The lore-http MCP server exposes a structured search interface
to the public-inbox archives at lore.kernel.org. Useful tools and
their cost class:
| tool | what it does | reliability (current state) |
|---|---|---|
| lore_activity | per-file activity counts over time | reliable |
| lore_file_timeline | per-year file touches | reliable |
| lore_substr_subject | byte substring on subject_raw | reliable |
| lore_message | fetch one message by ID | reliable |
| lore_search | fused BM25 + trigram + metadata | server-side issue: BM25 generation behind corpus, fails ~40% of the time |
| lore_regex | DFA regex over subject/from/prose/patch | reliable but slow |
| lore_thread | walk thread by message-id | times out at 5s on long threads |
| lore_count | aggregate counts over predicates | reliable for cheap predicates |
The 40% lore_search failure rate observed in the top-1000 sweep
is a known server-side issue (BM25 needs --with-bm25 rebuild). The
prompt explicitly instructs the model to fall back to lore_regex
or lore_file_timeline rather than retry the same failing query.
The lore_file_timeline tool is the workhorse. The two
high-confidence remove verdicts (packetengines, fujitsu) were
both driven by lore_file_timeline calls that surfaced active
upstream removal patches.
Web search
codex exec exposes the native Responses web_search tool to the
model even without the --search top-level flag (which is
TUI-only). This appears to be enabled by default when any MCP
server is configured. We treat that as observed behaviour rather
than a guarantee — if a future codex upgrade removes implicit
web_search, fall back to a brave/serpapi MCP server registered
in ~/.codex/config.toml.
In the corpus, web search is heavily used (199 calls across the top-100 sweep) and contributes most of the deployment evidence (distro package pages, vendor EOL notices, virtualisation guest docs, OpenWrt/postmarketOS wikis).
Per-driver output directory
Every probe writes to data/dossiers/<driver-path>/:
prompt.md the prompt sent to codex
dossier.json the schema-validated verdict
events.jsonl full codex event stream (tool calls, tokens, ...)
stderr.log codex stderr (rarely interesting; lei daemon failures land here)
static_features.json phase-1 git-log facts
meta.json invocation receipt: model, kernel SHA, since date, tokens, elapsed
summary.json derived: verdict + tool counts + tool log
The directory is self-contained. Anyone reading
data/dossiers/drivers/foo/bar/ can reconstruct exactly what was
asked, what tools fired, what the model said, and what kernel
state it was reasoning against.
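Self-containment is easy to verify mechanically. The sketch below lists which of the seven files named above are absent from a dossier directory; `missing_artifacts` is a hypothetical helper, not something the runner is known to ship.

```python
from pathlib import Path

# File names taken from the per-driver listing above.
EXPECTED_FILES = (
    "prompt.md",
    "dossier.json",
    "events.jsonl",
    "stderr.log",
    "static_features.json",
    "meta.json",
    "summary.json",
)


def missing_artifacts(dossier_dir) -> list[str]:
    """Return the expected files that are absent from one dossier directory."""
    root = Path(dossier_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty result means the directory holds everything needed to reconstruct the probe.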
Idempotence and re-runs
phase2_probe.py skips any driver whose dossier.json already
exists. To re-probe:
# Single driver
rm -rf data/dossiers/drivers/foo/bar
uv run --script scripts/phase2_probe.py ... --paths drivers/foo/bar
# Force the whole shortlist
uv run --script scripts/phase2_probe.py ... --force
Mixing model versions across a corpus is fine: each meta.json
records its own model, model_reasoning_effort and timestamp. The
build_index step picks them up uniformly.
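The skip rule amounts to a single existence check. A minimal sketch, assuming the dossier root is data/dossiers and that `--force` simply bypasses the check; `should_probe` is a hypothetical name and the real runner may structure this differently.

```python
from pathlib import Path


def should_probe(dossier_root, driver_path, force=False) -> bool:
    """Probe when forced, or when no dossier.json exists yet for the driver."""
    dossier = Path(dossier_root) / driver_path / "dossier.json"
    return force or not dossier.exists()
```

Deleting the driver's directory, as in the single-driver recipe above, makes `should_probe` return True again on the next run.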
Cost characteristics
Empirical (top-1000 sweep, 858 dossiers):
- 76.7s avg wall-clock per real probe
- 12.8s avg for not-a-driver early-exits (significant saving)
- 78% input-token cache hit rate across consecutive probes
- ~165k input + ~3k output tokens per real probe
- ~$0.20-$0.30 per probe at gpt-5.4 pricing
- ~$240 for the full top-1000
The cache hit rate matters: the codex system context is reused across calls, so amortised cost per driver drops substantially in a long batch. A single-shot probe pays the full input rate on every token; in a batch of 100, only about 22% of each driver's input tokens are uncached, with the rest billed at the cached-token rate.
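The figures above are mutually consistent, which a few lines of arithmetic confirm. This worked example assumes the 8% not-a-driver share from the verdict distribution applies to the same 858 dossiers; the constant and function names are illustrative only.

```python
DOSSIERS = 858
REAL_SECONDS = 76.7        # avg wall-clock per real probe
EARLY_EXIT_SECONDS = 12.8  # avg wall-clock per not-a-driver early exit
EARLY_EXIT_SHARE = 0.08    # assumption: share taken from the verdict distribution
TOTAL_COST_USD = 240.0


def sweep_hours(n, early_share, real_s, early_s):
    """Serial wall-clock hours for n probes with a mix of real and early-exit runs."""
    mean_s = (1 - early_share) * real_s + early_share * early_s
    return n * mean_s / 3600


hours = sweep_hours(DOSSIERS, EARLY_EXIT_SHARE, REAL_SECONDS, EARLY_EXIT_SECONDS)
cost_per_dossier = TOTAL_COST_USD / DOSSIERS
print(f"{hours:.1f} h, ${cost_per_dossier:.2f} per dossier")
# prints "17.1 h, $0.28 per dossier"
```

Both results sit inside the stated ranges: about 17 hours for the sweep and $0.20-$0.30 per probe.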
Concurrency
The current runner is single-threaded. Phase-2 probes are CPU-light
and IO-bound (waiting on codex/MCP/web), so a 4-way
concurrent runner should cut wall-clock time by close to 4x, provided
the MCP server and API rate limits keep up. Implementing it means
asyncio.gather over the shortlist, gated by a semaphore;
roughly 30 lines. The current corpus was built serially, but this is the
obvious next improvement for further scale.
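The semaphore-gated gather could look roughly like this. It is a sketch of the proposed improvement, not existing runner code: `run_all` and `probe_with_codex` are hypothetical names, and the probe coroutine is injected so the concurrency logic stays independent of how codex is launched.

```python
import asyncio


async def run_all(paths, probe, limit=4):
    """Run probe(path) for every path with at most `limit` probes in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(path):
        async with sem:  # blocks while `limit` probes are already running
            return await probe(path)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(p) for p in paths))


async def probe_with_codex(argv):
    """Hypothetical probe: launch one codex process and wait for its exit code."""
    proc = await asyncio.create_subprocess_exec(*argv)
    return await proc.wait()
```

A caller would pass a probe coroutine that builds the per-driver argument list and awaits `probe_with_codex`; raising `limit` beyond what the lore server tolerates would likely just trade wall-clock for more tool failures.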