CorrectBench Paper Experiments
This directory contains the batch runner and analyzer used to execute the paper-style Baseline vs CGA study on FSM/protocol tasks.
Scope
- Task set: experiments/paper_tasks.py
- Base config: config/configs/paper_fsm_qwen.yaml
- Runner: experiments/run_paper_experiments.py
- Analyzer: experiments/analyze_paper_experiments.py
Conditions
- baseline: runs the full CorrectBench pipeline but skips CGA optimization. A single coverage collection pass is still executed so the structural coverage metric remains directly comparable with CGA.
- cga: runs the full pipeline with CGA enabled.
Recommended main experiment
Run 5 repeats for each condition:
venv/bin/python experiments/run_paper_experiments.py \
--base-config config/configs/paper_fsm_qwen.yaml \
--experiment-name paper_fsm_qwen \
--models qwen-max \
--conditions baseline cga \
--repeats 5
If you want a quick smoke test first:
venv/bin/python experiments/run_paper_experiments.py \
--dry-run \
--limit-tasks 2 \
--repeats 1
Analyze outputs
After runs finish, analyze the generated manifest:
venv/bin/python experiments/analyze_paper_experiments.py \
--manifest analysis/paper_runs/paper_fsm_qwen/run_manifest.json
To write the analysis to a specific directory, add --output-dir:
venv/bin/python experiments/analyze_paper_experiments.py \
--manifest analysis/paper_runs/paper_fsm_qwen/run_manifest.json \
--output-dir analysis/paper_runs/paper_fsm_qwen/final_analysis
Files to use in the paper
- Run-level source of truth: each run's Chatbench_RunInfo.json
- Per-task iteration trace: each task's task_log.log
- Main aggregated table: task_summary.csv
- Paired Baseline vs CGA deltas: paired_deltas.csv
- Per-task mean delta and bootstrap CI: task_delta_summary.csv
- Overall statistics and Wilcoxon test: stats_summary.txt
- Case-study pointers: case_studies.md
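For readers who want to sanity-check the delta and CI columns by hand, the following is a minimal sketch of a percentile bootstrap CI over paired Baseline vs CGA deltas using only the standard library. It is not taken from the analyzer: the function name `bootstrap_mean_ci`, the resample count, the seed, and the coverage numbers are all illustrative assumptions, and the analyzer may use a different bootstrap variant.

```python
import random
import statistics


def bootstrap_mean_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired deltas (hypothetical helper)."""
    if not deltas:
        raise ValueError("need at least one delta")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(deltas)
    # Resample the deltas with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return statistics.fmean(deltas), means[lo_idx], means[hi_idx]


# Placeholder scores for one task over five repeats (made up, not results).
baseline = [0.71, 0.68, 0.74, 0.70, 0.69]
cga = [0.78, 0.75, 0.77, 0.79, 0.73]

# Pair the runs by repeat index and take CGA minus Baseline.
deltas = [c - b for b, c in zip(baseline, cga)]
mean_delta, ci_low, ci_high = bootstrap_mean_ci(deltas)
print(f"mean delta = {mean_delta:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

With only five repeats per condition the interval is coarse, so it is best read alongside the per-run values in paired_deltas.csv rather than on its own.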
Notes
- Structural coverage is the Verilator annotated score used by CorrectBench. It is not pure line coverage and not functional coverage.
- The analyzer uses pure-Python bootstrap CI and Wilcoxon logic so it does not require SciPy.
- If matplotlib is installed, the analyzer also emits the four paper figures.
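As an illustration of what a SciPy-free Wilcoxon test can look like, here is a minimal sketch of a two-sided signed-rank test on paired deltas. It assumes the normal approximation with a tie correction and no continuity correction, drops zero differences, and uses the hypothetical name `wilcoxon_signed_rank`; the analyzer's own logic may differ (for example, by using exact p-values for small samples).

```python
import math


def wilcoxon_signed_rank(deltas):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Zero differences are dropped; tied |delta| values share an average rank.
    Returns (statistic, p_value), where statistic = min(W+, W-).
    """
    nonzero = [d for d in deltas if d != 0]
    n = len(nonzero)
    if n == 0:
        raise ValueError("all differences are zero")

    ordered = sorted(nonzero, key=abs)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of equal absolute values starting at position i.
        j = i
        while j + 1 < n and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, ordered) if d < 0)
    statistic = min(w_plus, w_minus)

    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (statistic - mean_w) / math.sqrt(var_w)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return statistic, min(1.0, p_value)
```

The normal approximation is rough for very small samples, so p-values from a sketch like this should be treated as indicative when only a handful of paired runs are available.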