Phase 2: codex CLI dossier probe

The phase-2 runner takes a shortlist of driver directories and emits one dossier per directory by calling the OpenAI codex CLI with strict structured outputs.

The runner lives in scripts/phase2_probe.py. A real probe averages about 77 seconds and roughly 170k tokens (~165k input, ~3k output); a full top-1000 sweep takes ~17 hours of wall-clock when run serially on a single thread.

Runtime requirements

codex CLI installed (tested with codex-cli 0.124.0)
a configured OpenAI API key in your codex auth
~/.codex/config.toml with at minimum a [mcp_servers.lore-http] entry pointing at a working lore.kernel.org MCP server
a Linux kernel checkout for the model to read (passed via -C)
lei CLI (optional); if installed and reachable, the model can drop into a shell and use it
uv for running the Python wrapper

The invocation

codex exec \
    --ephemeral \
    --ignore-rules \
    --skip-git-repo-check \
    -c model="gpt-5.4" \
    -c model_reasoning_effort="medium" \
    -s workspace-write \
    -C "$KERNEL_ROOT" \
    --add-dir data/dossiers/<driver-path> \
    --add-dir "/run/user/$(id -u)" \
    --output-schema data/schema.v1.json \
    -o data/dossiers/<driver-path>/dossier.json \
    --json \
    "<prompt>" \
    > events.jsonl 2> stderr.log

Each flag has a specific reason to exist:

flag	what it does	why we need it
`exec`	non-interactive mode	one-shot batch probe, no TUI
`--ephemeral`	no session file on disk	dossier IS the artifact; no session state to garbage-collect
`--ignore-rules`	skip `~/.codex/rules.md`	reproducibility across operators with different rule files
`--skip-git-repo-check`	tolerate non-git cwd	the working dir is the kernel checkout, but the dossier output dir may not be inside a git repository
`-c model="gpt-5.4"`	model override	explicit so re-runs can compare
`-c model_reasoning_effort="medium"`	medium effort	sufficient for the task; high adds cost without much accuracy
`-s workspace-write`	sandbox mode	shell + MCP + lei need writes; `read-only` blocks lei’s runtime dir
`-C "$KERNEL_ROOT"`	working dir	model reads kernel source to identify chipset/Kconfig context
`--add-dir <dossier-dir>`	extra writable path	needed for `-o` to land
`--add-dir /run/user/$UID`	extra writable path	lei’s public-inbox daemon writes here; without it `lei q` fails with “Read-only file system”
`--output-schema <path>`	structured-output enforcement	rejects free-text replies; closes enum values; non-negotiable
`-o <path>`	write final response to file	one-shot deliverable; pairs with `--json`
`--json`	event stream as JSONL on stdout	per-call cost accounting and tool-call audit
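
For a Python wrapper, the same invocation can be assembled as an argv list. This is a sketch under assumptions, not the code in phase2_probe.py: `build_codex_argv` and `run_probe` are hypothetical names, and the redirect targets are placed in the dossier directory to match the per-driver layout described later.

```python
import os
import subprocess
from pathlib import Path


def build_codex_argv(kernel_root: Path, dossier_dir: Path, schema: Path, prompt: str) -> list[str]:
    """Mirror the shell invocation above, one list element per shell word."""
    return [
        "codex", "exec",
        "--ephemeral",
        "--ignore-rules",
        "--skip-git-repo-check",
        # The shell strips the quotes in -c model="gpt-5.4", so codex sees model=gpt-5.4.
        "-c", "model=gpt-5.4",
        "-c", "model_reasoning_effort=medium",
        "-s", "workspace-write",
        "-C", str(kernel_root),
        "--add-dir", str(dossier_dir),
        "--add-dir", f"/run/user/{os.getuid()}",  # lei runtime dir (Unix only)
        "--output-schema", str(schema),
        "-o", str(dossier_dir / "dossier.json"),
        "--json",
        prompt,
    ]


def run_probe(argv: list[str], dossier_dir: Path) -> int:
    """Run codex, capturing the JSONL event stream and stderr next to the dossier."""
    with open(dossier_dir / "events.jsonl", "wb") as out, \
         open(dossier_dir / "stderr.log", "wb") as err:
        return subprocess.run(argv, stdout=out, stderr=err).returncode
```

Passing a list rather than a shell string avoids quoting problems when the prompt contains quotes or newlines.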

Don’ts

Do not pass --ignore-user-config. It strips MCP server registrations from ~/.codex/config.toml, silently disabling the lore search tool.
Do not pass -s read-only. It blocks lei’s daemon from creating its runtime directory.
Do not pass -a (ask-for-approval). It is a top-level flag (TUI-only); codex exec rejects it.
Do not omit -C. Without it, codex drops into the dossier output dir and pulls in unrelated context.
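
These four rules are mechanical enough to check before launching a batch. A minimal sketch of such a guard follows; `argv_problems` is a hypothetical helper, and it assumes the prompt is the final positional argument, as in the invocation above.

```python
def argv_problems(argv: list[str]) -> list[str]:
    """Return violations of the rules above; an empty list means argv is acceptable."""
    flags = argv[:-1]  # the last element is the prompt text, not a flag
    problems = []
    if "--ignore-user-config" in flags:
        problems.append("--ignore-user-config strips MCP server registrations")
    if "-a" in flags:
        problems.append("-a is a top-level flag; codex exec rejects it")
    # -s takes a value, so look at the word that follows it.
    if "-s" in flags:
        mode = flags[flags.index("-s") + 1 : flags.index("-s") + 2]
        if mode == ["read-only"]:
            problems.append("-s read-only blocks lei's runtime directory")
    if "-C" not in flags:
        problems.append("-C is missing; codex would not run in the kernel checkout")
    return problems
```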

Structured-output schema rules

The OpenAI Responses-API structured-output validator is stricter than generic JSON Schema:

Every property must appear in required. Optional fields break validation. We list every property in required and declare optional values via union types like "type": ["string", "null"].
additionalProperties: false at every object level. Prevents the model from inventing extra fields.
format keywords are rejected. "format": "uri" returns HTTP 400. URL fields are plain strings; URL well-formedness is enforced by the validator script post-hoc.
Closed enums for any categorical field. recommendation_hint is an enum of six strings. deployments_today is an enum of five.
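
The first three rules can be linted locally before a schema change is sent to the API. The sketch below is illustrative: `schema_problems` is a hypothetical helper, and `EXAMPLE` is a toy schema, not data/schema.v1.json. Its `evidence_url` field is invented for illustration, and the enum values are the six hint names that appear in this document, which should be checked against the canonical schema.

```python
def schema_problems(schema: dict, path: str = "$") -> list[str]:
    """Walk a JSON Schema and report violations of the strict-mode rules above."""
    problems = []
    if "format" in schema:
        problems.append(f"{path}: 'format' keyword is rejected")
    props = schema.get("properties")
    if isinstance(props, dict):
        # Every declared property must also be listed in required.
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not in required: {missing}")
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        for name, sub in props.items():
            problems += schema_problems(sub, f"{path}.{name}")
    # Recurse into array items and anyOf branches, which may hold nested objects.
    if isinstance(schema.get("items"), dict):
        problems += schema_problems(schema["items"], f"{path}[]")
    for i, sub in enumerate(schema.get("anyOf", [])):
        problems += schema_problems(sub, f"{path}.anyOf[{i}]")
    return problems


EXAMPLE = {
    "type": "object",
    "additionalProperties": False,
    "required": ["recommendation_hint", "evidence_url"],
    "properties": {
        "recommendation_hint": {
            "type": "string",
            "enum": ["keep", "keep-annotate", "deprecate", "remove", "not-a-driver", "unsure"],
        },
        # Optional value expressed as a nullable union, not by omitting it from required.
        "evidence_url": {"type": ["string", "null"]},
    },
}
```

The linter does not check enum closure, since whether a field is categorical is a design decision rather than a syntactic property.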

data/schema.v1.json is the canonical schema. See schema.md for the field-by-field reference.

The prompt

PROMPT_TEMPLATE in phase2_probe.py is parameterised by the phase-1 features for that directory. The prompt:

Sets the role — “hardware obsolescence analyst”
Defines the early-exit rule — if the directory is clearly not a driver (test fixtures, header-only, etc.), return recommendation_hint=not-a-driver immediately, no tool calls. This saves 60s and ~100k tokens per misclassified dir.
Lists tool budgets — 3-5 tool calls total, INCLUDING web search
Lists fallback strategies — if lore_search returns a BM25-inconsistency error, fall back to lore_regex or lore_file_timeline; do not retry the same query
Inserts phase-1 facts — .c file count, substantive vs raw commit count, last touch date, dominant author, parent subsystem
Asks the model to ground evidence in lore first, then look for deployment evidence
Reminds the model: confidence < 0.5 → recommendation_hint = unsure. State in reasoning_notes how each cited URL was obtained.
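
Because the confidence rule is only a prompt instruction, a returned dossier can still violate it. A post-hoc check along the following lines would catch that, together with the URL well-formedness check that the schema itself cannot express. This is a sketch: `dossier_problems` is a hypothetical helper, `confidence` and `recommendation_hint` are assumed to be top-level keys, and cited URLs are passed in separately because the name of the field that holds them is not given here.

```python
from urllib.parse import urlparse


def is_well_formed_url(value: str) -> bool:
    """Accept only absolute http(s) URLs that have a host part."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def dossier_problems(dossier: dict, cited_urls: list[str]) -> list[str]:
    """Check the low-confidence rule and the shape of every cited URL."""
    problems = []
    # Prompt rule: confidence below 0.5 must map to the 'unsure' hint.
    if dossier["confidence"] < 0.5 and dossier["recommendation_hint"] != "unsure":
        problems.append("confidence < 0.5 but recommendation_hint is not 'unsure'")
    for url in cited_urls:
        if not is_well_formed_url(url):
            problems.append(f"malformed URL: {url!r}")
    return problems
```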

The prompt does NOT prescribe a verdict. It asks the model to reason from evidence. Empirically this produces a healthy distribution: 46% keep-annotate, 40% keep, 8% not-a-driver, 5% deprecate, 1% remove.

MCP tools (lore.kernel.org)

The lore-http MCP server exposes a structured search interface to the public-inbox archives at lore.kernel.org. Useful tools and their cost class:

tool	what it does	reliability (current state)
`lore_activity`	per-file activity counts over time	reliable
`lore_file_timeline`	per-year file touches	reliable
`lore_substr_subject`	byte substring on subject_raw	reliable
`lore_message`	fetch one message by ID	reliable
`lore_search`	fused BM25 + trigram + metadata	server-side issue: the BM25 index generation lags the corpus; fails ~40% of the time
`lore_regex`	DFA regex over subject/from/prose/patch	reliable but slow
`lore_thread`	walk thread by message-id	times out at 5s on long threads
`lore_count`	aggregate counts over predicates	reliable for cheap predicates

The 40% lore_search failure rate observed in the top-1000 sweep is a known server-side issue (the BM25 index needs a rebuild with --with-bm25). The prompt explicitly instructs the model to fall back to lore_regex or lore_file_timeline rather than retry the same failing query.

The lore_file_timeline tool is the workhorse. The two high-confidence remove verdicts (packetengines, fujitsu) were both driven by lore_file_timeline calls that surfaced active upstream removal patches.

Web search

codex exec exposes the native Responses web_search tool to the model even without the --search top-level flag (which is TUI-only). This appears to be enabled by default when any MCP server is configured. We treat that as observed behaviour rather than a guarantee — if a future codex upgrade removes implicit web_search, fall back to a brave/serpapi MCP server registered in ~/.codex/config.toml.

In the corpus, web search is heavily used (199 calls across the top-100 sweep) and contributes most of the deployment evidence (distro package pages, vendor EOL notices, virtualisation guest docs, OpenWrt/postmarketOS wikis).

Per-driver output directory

Every probe writes to data/dossiers/<driver-path>/:

prompt.md               the prompt sent to codex
dossier.json            the schema-validated verdict
events.jsonl            full codex event stream (tool calls, tokens, ...)
stderr.log              codex stderr (rarely interesting; lei daemon failures land here)
static_features.json    phase-1 git-log facts
meta.json               invocation receipt: model, kernel SHA, since date, tokens, elapsed
summary.json            derived: verdict + tool counts + tool log

The directory is self-contained. Anyone reading data/dossiers/drivers/foo/bar/ can reconstruct exactly what was asked, what tools fired, what the model said, and what kernel state it was reasoning against.
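
Self-containment is easy to verify mechanically. A minimal sketch, assuming only the seven file names listed above (`missing_artifacts` is a hypothetical helper):

```python
from pathlib import Path

# The seven artifacts every probe is expected to leave behind.
EXPECTED = (
    "prompt.md",
    "dossier.json",
    "events.jsonl",
    "stderr.log",
    "static_features.json",
    "meta.json",
    "summary.json",
)


def missing_artifacts(dossier_dir: Path) -> list[str]:
    """Return the expected files that are absent from one dossier directory."""
    return [name for name in EXPECTED if not (dossier_dir / name).is_file()]
```

Running it over every directory under data/dossiers/ would surface probes that were interrupted before all files were written.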

Idempotence and re-runs

phase2_probe.py skips any driver whose dossier.json already exists. To re-probe:

# Single driver
rm -rf data/dossiers/drivers/foo/bar
uv run --script scripts/phase2_probe.py ... --paths drivers/foo/bar

# Force the whole shortlist
uv run --script scripts/phase2_probe.py ... --force

Mixing model versions across a corpus is fine: each meta.json records its own model, model_reasoning_effort, and timestamp. The build_index step picks them up uniformly.
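
The skip rule amounts to a filter over the shortlist. The sketch below shows the idea; `pending_paths` is a hypothetical name and phase2_probe.py may structure this differently.

```python
from pathlib import Path


def pending_paths(shortlist: list[str], dossier_root: Path, force: bool = False) -> list[str]:
    """Return the driver paths that still need a probe.

    A driver is skipped when <dossier_root>/<driver-path>/dossier.json exists,
    unless force is set, in which case the whole shortlist is returned.
    """
    return [
        path for path in shortlist
        if force or not (dossier_root / path / "dossier.json").exists()
    ]
```

Keying the check on dossier.json alone means a probe that crashed before writing its verdict is retried on the next run, while its partial events.jsonl or stderr.log does not block it.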

Cost characteristics

Empirical (top-1000 sweep, 858 dossiers):

76.7s avg wall-clock per real probe
12.8s avg for not-a-driver early-exits (significant saving)
78% input-token cache hit rate across consecutive probes
~165k input + ~3k output tokens per real probe
~$0.20-$0.30 per probe at gpt-5.4 pricing
~$240 for the full top-1000

The cache hit rate matters: the codex system context is reused across calls, so the amortised cost per driver drops substantially in a long batch. A single-shot probe pays the uncached rate on all of its input; in a batch of 100, only about 22% of input tokens are uncached, and the rest are billed at the provider's cached-input rate.
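
The arithmetic behind these figures can be made explicit. In the sketch below, `cached_price_ratio` (the price of a cached input token relative to an uncached one) is an assumed parameter, not a number from this document; setting it to 0 reproduces the 22% figure. The sweep estimate additionally assumes that the 8% not-a-driver share applies to the 858 dossiers.

```python
def relative_input_cost(hit_rate: float, cached_price_ratio: float) -> float:
    """Input cost per probe, relative to a probe with no cache hits at all."""
    return (1.0 - hit_rate) + hit_rate * cached_price_ratio


def sweep_hours(dossiers: int, early_exit_share: float,
                real_s: float = 76.7, exit_s: float = 12.8) -> float:
    """Serial wall-clock estimate from the two measured per-probe averages."""
    exits = dossiers * early_exit_share
    return ((dossiers - exits) * real_s + exits * exit_s) / 3600.0


# Cached tokens treated as free: 22% of the uncached input cost.
free_cache = relative_input_cost(0.78, 0.0)
# Serial sweep estimate for 858 dossiers with an 8% early-exit share.
estimated_hours = sweep_hours(858, 0.08)
```

With these inputs the sweep estimate lands at about 17 hours, in line with the wall-clock figure quoted for the serial run.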

Concurrency

The current runner is single-threaded. Phase-2 probes are CPU-light and IO-bound (waiting on codex/MCP/web), so a 4-way concurrent runner would cut wall-clock by up to roughly 4x, provided API rate limits and the lore MCP server keep up. Adding that amounts to an asyncio.gather over the shortlist, bounded by a semaphore: roughly 30 lines. The current corpus was built serially, but this is the obvious next improvement for further scale.
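
A minimal sketch of that pattern follows. `run_all` is a hypothetical name, and `probe` stands for any coroutine that probes one driver path; a real one would most likely wrap the codex invocation with asyncio.create_subprocess_exec.

```python
import asyncio
from typing import Awaitable, Callable


async def run_all(paths: list[str],
                  probe: Callable[[str], Awaitable],
                  limit: int = 4) -> list:
    """Probe every path with at most `limit` probes in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(path: str):
        # The semaphore caps concurrency; gather preserves input order in its result.
        async with sem:
            return await probe(path)

    return await asyncio.gather(*(guarded(p) for p in paths))
```

As written, one failing probe raises out of gather and the caller loses the other results; passing return_exceptions=True to asyncio.gather is one way to keep the rest of a long batch.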