Phase 2: codex CLI dossier probe

The phase-2 runner takes a shortlist of driver directories and emits one dossier per directory by calling the OpenAI codex CLI with strict structured outputs.

It lives in scripts/phase2_probe.py. Each probe takes about 75 seconds and roughly 170k tokens (see Cost characteristics); a full top-1000 sweep takes ~17 hours of wall-clock time when run serially in a single thread.

Runtime requirements

  • codex CLI installed (tested with codex-cli 0.124.0)
  • a configured OpenAI API key in your codex auth
  • ~/.codex/config.toml with at minimum a [mcp_servers.lore-http] entry pointing at a working lore.kernel.org MCP server
  • a Linux kernel checkout for the model to read (passed via -C)
  • lei CLI optional; if installed and reachable, the model can drop into shell and use it
  • uv for running the Python wrapper

The invocation

codex exec \
    --ephemeral \
    --ignore-rules \
    --skip-git-repo-check \
    -c model="gpt-5.4" \
    -c model_reasoning_effort="medium" \
    -s workspace-write \
    -C "$KERNEL_ROOT" \
    --add-dir data/dossiers/<driver-path> \
    --add-dir "/run/user/$(id -u)" \
    --output-schema data/schema.v1.json \
    -o data/dossiers/<driver-path>/dossier.json \
    --json \
    "<prompt>" \
    > events.jsonl 2> stderr.log

Each flag has a specific reason to exist:

| flag | what it does | why we need it |
|---|---|---|
| exec | non-interactive mode | one-shot batch probe, no TUI |
| --ephemeral | no session file on disk | dossier IS the artifact; no session state to garbage-collect |
| --ignore-rules | skip ~/.codex/rules.md | reproducibility across operators with different rule files |
| --skip-git-repo-check | tolerate non-git cwd | the working dir is the kernel checkout, but the dossier output dir may not be |
| -c model="gpt-5.4" | model override | explicit so re-runs can compare |
| -c model_reasoning_effort="medium" | medium effort | sufficient for the task; high adds cost without much accuracy |
| -s workspace-write | sandbox mode | shell + MCP + lei need writes; read-only blocks lei’s runtime dir |
| -C "$KERNEL_ROOT" | working dir | model reads kernel source to identify chipset/Kconfig context |
| --add-dir <dossier-dir> | extra writable path | needed for -o to land |
| --add-dir /run/user/$UID | extra writable path | lei’s public-inbox daemon writes here; without it lei q fails with “Read-only file system” |
| --output-schema <path> | structured-output enforcement | rejects free-text replies; closes enum values; non-negotiable |
| -o <path> | write final response to file | one-shot deliverable; pairs with --json |
| --json | event stream as JSONL on stdout | per-call cost accounting and tool-call audit |

Don’ts

  • Do not pass --ignore-user-config. It strips MCP server registrations from ~/.codex/config.toml, silently disabling the lore search tool.
  • Do not pass -s read-only. It blocks lei’s daemon from creating its runtime directory.
  • Do not pass -a (ask-for-approval). It is a top-level flag (TUI-only); codex exec rejects it.
  • Do not omit -C. Without it, codex drops into the dossier output dir and pulls in unrelated context.
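
The invocation and the don'ts above can be captured in a small argv builder. This is a minimal sketch, not the code in phase2_probe.py: build_codex_argv and check_argv are hypothetical helper names, and only flags already listed in this section are used.

```python
import os
from pathlib import Path

# Flags this section says must never reach `codex exec`.
FORBIDDEN = {"--ignore-user-config", "-a"}


def check_argv(argv):
    """Raise ValueError if argv violates one of the don'ts listed above."""
    flags = argv[:-1]  # the last element is the prompt text, not a flag
    bad = FORBIDDEN.intersection(flags)
    if bad:
        raise ValueError(f"forbidden flag(s): {sorted(bad)}")
    if "-C" not in flags:
        raise ValueError("missing -C <kernel checkout>")
    if "-s" in flags:
        i = flags.index("-s")
        if flags[i + 1:i + 2] == ["read-only"]:
            raise ValueError("-s read-only blocks lei's runtime directory")


def build_codex_argv(kernel_root, dossier_dir, schema_path, prompt, uid=None):
    """Return the argv list for one probe (hypothetical helper)."""
    uid = os.getuid() if uid is None else uid
    dossier_dir = Path(dossier_dir)
    argv = [
        "codex", "exec",
        "--ephemeral",
        "--ignore-rules",
        "--skip-git-repo-check",
        # The shell strips the double quotes in the command shown earlier,
        # so the argv elements carry the bare values.
        "-c", "model=gpt-5.4",
        "-c", "model_reasoning_effort=medium",
        "-s", "workspace-write",
        "-C", str(kernel_root),
        "--add-dir", str(dossier_dir),
        "--add-dir", f"/run/user/{uid}",
        "--output-schema", str(schema_path),
        "-o", str(dossier_dir / "dossier.json"),
        "--json",
        prompt,
    ]
    check_argv(argv)
    return argv
```

The resulting list can be handed to subprocess.run with stdout and stderr redirected to events.jsonl and stderr.log, mirroring the shell redirections in the invocation.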

Structured-output schema rules

The OpenAI Responses-API structured-output validator is stricter than generic JSON Schema:

  • Every property must appear in required. Optional fields break validation. We list every property in required and declare optional values via union types like "type": ["string", "null"].
  • additionalProperties: false at every object level. Prevents the model from inventing extra fields.
  • format keywords are rejected. "format": "uri" returns HTTP 400. URL fields are plain strings; URL well-formedness is enforced by the validator script post-hoc.
  • Closed enums for any categorical field. recommendation_hint is an enum of six strings. deployments_today is an enum of five.

data/schema.v1.json is the canonical schema. See schema.md for the field-by-field reference.
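
Three of the rules above (everything in required, additionalProperties: false, no format keywords) are mechanical enough to lint before a schema ever reaches the API. The following is a rough sketch under the assumption that the schema only nests through properties and items; lint_schema is a hypothetical name, and combinators such as anyOf are not walked.

```python
def lint_schema(schema, path="$"):
    """Yield human-readable violations of the structured-output rules above."""
    if not isinstance(schema, dict):
        return
    if "format" in schema:
        yield f"{path}: 'format' keyword is rejected"
    types = schema.get("type")
    is_object = types == "object" or (isinstance(types, list) and "object" in types)
    if is_object:
        props = schema.get("properties", {})
        # Every declared property must also be listed in `required`.
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            yield f"{path}: properties not in required: {missing}"
        if schema.get("additionalProperties") is not False:
            yield f"{path}: additionalProperties must be false"
        for name, sub in props.items():
            yield from lint_schema(sub, f"{path}.{name}")
    # Arrays: descend into the element schema when it is a single object.
    items = schema.get("items")
    if isinstance(items, dict):
        yield from lint_schema(items, f"{path}[]")
```

Calling list(lint_schema(json.load(open("data/schema.v1.json")))) should return an empty list for a schema that follows the rules.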

The prompt

PROMPT_TEMPLATE in phase2_probe.py is parameterised by the phase-1 features for that directory. The prompt:

  1. Sets the role — “hardware obsolescence analyst”
  2. Defines the early-exit rule — if the directory is clearly not a driver (test fixtures, header-only, etc.), return recommendation_hint=not-a-driver immediately, no tool calls. This saves 60s and ~100k tokens per misclassified dir.
  3. Lists tool budgets — 3-5 tool calls total, INCLUDING web search
  4. Lists fallback strategies — if lore_search returns a BM25-inconsistency error, fall back to lore_regex or lore_file_timeline; do not retry the same query
  5. Inserts phase-1 facts — c file count, substantive vs raw commit count, last touch date, dominant author, parent subsystem
  6. Asks the model to ground evidence in lore first, then look for deployment evidence
  7. Reminds the model: confidence < 0.5 → recommendation_hint = unsure. State in reasoning_notes how each cited URL was obtained.

The prompt does NOT prescribe a verdict. It asks the model to reason from evidence. Empirically this produces a healthy distribution: 46% keep-annotate, 40% keep, 8% not-a-driver, 5% deprecate, 1% remove.
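
Rule 7 in the list above is easy to verify after the fact. Here is a minimal sketch of such a check, assuming the dossier carries a numeric confidence field alongside recommendation_hint and reasoning_notes; the field name confidence and the function name are assumptions, not taken from schema.v1.json.

```python
def check_confidence_rule(dossier, threshold=0.5):
    """Return a list of problems; an empty list means rule 7 is honoured."""
    problems = []
    confidence = dossier["confidence"]  # assumed field name
    hint = dossier["recommendation_hint"]
    if confidence < threshold and hint != "unsure":
        problems.append(
            f"confidence {confidence} is below {threshold} but hint is {hint!r}"
        )
    # The prompt also asks for provenance of every cited URL in reasoning_notes.
    if not (dossier.get("reasoning_notes") or "").strip():
        problems.append("reasoning_notes is empty; URL provenance is missing")
    return problems
```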

MCP tools (lore.kernel.org)

The lore-http MCP server exposes a structured search interface to the public-inbox archives at lore.kernel.org. Useful tools and their cost class:

| tool | what it does | reliability (current state) |
|---|---|---|
| lore_activity | per-file activity counts over time | reliable |
| lore_file_timeline | per-year file touches | reliable |
| lore_substr_subject | byte substring on subject_raw | reliable |
| lore_message | fetch one message by ID | reliable |
| lore_search | fused BM25 + trigram + metadata | server-side issue: BM25 generation behind corpus, fails ~40% of the time |
| lore_regex | DFA regex over subject/from/prose/patch | reliable but slow |
| lore_thread | walk thread by message-id | times out at 5s on long threads |
| lore_count | aggregate counts over predicates | reliable for cheap predicates |

The ~40% lore_search failure rate observed in the top-1000 sweep is a known server-side issue (the BM25 index needs a --with-bm25 rebuild). The prompt explicitly instructs the model to fall back to lore_regex or lore_file_timeline rather than retry the same failing query.

The lore_file_timeline tool is the workhorse. The two high-confidence remove verdicts (packetengines, fujitsu) were both driven by lore_file_timeline calls that surfaced active upstream removal patches.

codex exec exposes the native Responses web_search tool to the model even without the --search top-level flag (which is TUI-only). This appears to be enabled by default when any MCP server is configured. We treat that as observed behaviour rather than a guarantee — if a future codex upgrade removes implicit web_search, fall back to a brave/serpapi MCP server registered in ~/.codex/config.toml.

In the corpus, web search is heavily used (199 calls across the top-100 sweep) and contributes most of the deployment evidence (distro package pages, vendor EOL notices, virtualisation guest docs, OpenWrt/postmarketOS wikis).

Per-driver output directory

Every probe writes to data/dossiers/<driver-path>/:

prompt.md               the prompt sent to codex
dossier.json            the schema-validated verdict
events.jsonl            full codex event stream (tool calls, tokens, ...)
stderr.log              codex stderr (rarely interesting; lei daemon failures land here)
static_features.json    phase-1 git-log facts
meta.json               invocation receipt: model, kernel SHA, since date, tokens, elapsed
summary.json            derived: verdict + tool counts + tool log

The directory is self-contained. Anyone reading data/dossiers/drivers/foo/bar/ can reconstruct exactly what was asked, what tools fired, what the model said, and what kernel state it was reasoning against.
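
A quick completeness check over one dossier directory follows from the listing above. This is only a sketch; missing_artifacts is a hypothetical helper, and the file names are copied from the listing.

```python
from pathlib import Path

# File names taken from the per-driver directory listing.
EXPECTED_FILES = (
    "prompt.md",
    "dossier.json",
    "events.jsonl",
    "stderr.log",
    "static_features.json",
    "meta.json",
    "summary.json",
)


def missing_artifacts(dossier_dir):
    """Return the expected files that are absent from dossier_dir, in listing order."""
    d = Path(dossier_dir)
    return [name for name in EXPECTED_FILES if not (d / name).is_file()]
```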

Idempotence and re-runs

phase2_probe.py skips any driver whose dossier.json already exists. To re-probe:

# Single driver
rm -rf data/dossiers/drivers/foo/bar
uv run --script scripts/phase2_probe.py ... --paths drivers/foo/bar

# Force the whole shortlist
uv run --script scripts/phase2_probe.py ... --force

Mixing model versions across a corpus is fine: each meta.json records its own model, model_reasoning_effort, and timestamp. The build_index step picks them up uniformly.
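
The skip rule itself is simple to express. The sketch below assumes the layout data/dossiers/<driver-path>/dossier.json described earlier; pending_paths is a hypothetical name and the actual logic in phase2_probe.py may differ.

```python
from pathlib import Path


def pending_paths(shortlist, dossier_root="data/dossiers", force=False):
    """Return the driver paths that still need a probe.

    A driver is considered done when its dossier.json exists;
    force=True re-probes everything regardless.
    """
    if force:
        return list(shortlist)
    root = Path(dossier_root)
    return [p for p in shortlist if not (root / p / "dossier.json").exists()]
```

Keying on dossier.json alone means a probe that crashed before writing its verdict is retried on the next run without any extra bookkeeping.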

Cost characteristics

Empirical (top-1000 sweep, 858 dossiers):

  • 76.7s avg wall-clock per real probe
  • 12.8s avg for not-a-driver early-exits (significant saving)
  • 78% input-token cache hit rate across consecutive probes
  • ~165k input + ~3k output tokens per real probe
  • ~$0.20-$0.30 per probe at gpt-5.4 pricing
  • ~$240 for the full top-1000

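The cache hit rate matters: the codex system context is reused across calls, so amortised cost per driver drops substantially in a long batch. A single-shot probe pays full price for every input token; in a batch of 100, only about 22% of each driver's input tokens are uncached. Cached tokens are still billed, at a discounted rate, so the real saving is large but smaller than the 22% figure alone suggests.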
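
How much a given hit rate saves depends on how cached input tokens are billed. A small worked sketch, assuming cached tokens cost a fixed fraction of the normal input price; cached_price_ratio is an assumption to be filled in from current pricing, not a figure from the sweep.

```python
def relative_input_cost(hit_rate, cached_price_ratio):
    """Input cost per probe relative to a fully uncached probe.

    hit_rate: fraction of input tokens served from cache (0.78 in the sweep).
    cached_price_ratio: price of a cached token divided by the price of an
        uncached one (assumption; 0.0 would mean cached tokens are free).
    """
    uncached = 1.0 - hit_rate
    return uncached + hit_rate * cached_price_ratio
```

With hit_rate=0.78, a ratio of 0.0 gives 0.22, while a ratio of 0.5 gives 0.61. Output tokens are not cached and are unaffected either way.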

Concurrency

The current runner is single-threaded. Phase-2 probes are CPU-light and IO-bound (waiting on codex/MCP/web), so a 4-way concurrent runner should cut wall-clock time by roughly 4x, provided API and MCP rate limits do not become the bottleneck. Adding it amounts to an asyncio.gather over the shortlist, bounded by a semaphore: roughly 30 lines. The current corpus was built serially, but this is the obvious next improvement for further scale.
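
The bounded-concurrency pattern described above might look roughly like this. It is a sketch rather than the planned implementation: run_all is a hypothetical name, and probe stands in for whatever coroutine would launch one codex exec call.

```python
import asyncio


async def run_all(paths, probe, limit=4):
    """Run probe(path) for every path with at most `limit` probes in flight.

    Results come back in the same order as `paths`; a probe that raises
    is returned as the exception object instead of aborting the batch.
    """
    sem = asyncio.Semaphore(limit)

    async def guarded(path):
        async with sem:  # blocks while `limit` probes are already running
            return await probe(path)

    return await asyncio.gather(
        *(guarded(p) for p in paths), return_exceptions=True
    )
```

Using return_exceptions=True keeps one failed driver from cancelling the rest of the shortlist, which matters for a sweep that runs for hours.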