# CorrectBench Paper Experiments

This directory contains the batch runner and analyzer used to execute the paper-style Baseline vs CGA study on FSM/protocol tasks.

## Scope

- Task set: `experiments/paper_tasks.py`
- Base config: `config/configs/paper_fsm_qwen.yaml`
- Runner: `experiments/run_paper_experiments.py`
- Analyzer: `experiments/analyze_paper_experiments.py`

## Conditions

- `baseline`: runs the full CorrectBench pipeline but skips CGA optimization. A single coverage collection pass is still executed so the structural coverage metric remains directly comparable with CGA.
- `cga`: runs the full pipeline with CGA enabled.

## Recommended main experiment

Run 5 repeats for each condition:

```bash
venv/bin/python experiments/run_paper_experiments.py \
  --base-config config/configs/paper_fsm_qwen.yaml \
  --experiment-name paper_fsm_qwen \
  --models qwen-max \
  --conditions baseline cga \
  --repeats 5
```

If you want a quick smoke test first:

```bash
venv/bin/python experiments/run_paper_experiments.py \
  --dry-run \
  --limit-tasks 2 \
  --repeats 1
```

## Analyze outputs

After runs finish, analyze the generated manifest:

```bash
venv/bin/python experiments/analyze_paper_experiments.py \
  --manifest analysis/paper_runs/paper_fsm_qwen/run_manifest.json
```

Optional output directory:

```bash
venv/bin/python experiments/analyze_paper_experiments.py \
  --manifest analysis/paper_runs/paper_fsm_qwen/run_manifest.json \
  --output-dir analysis/paper_runs/paper_fsm_qwen/final_analysis
```

## Files to use in the paper

- Run-level source of truth: each run's `Chatbench_RunInfo.json`
- Per-task iteration trace: each task's `task_log.log`
- Main aggregated table: `task_summary.csv`
- Paired Baseline vs CGA deltas: `paired_deltas.csv`
- Per-task mean delta and bootstrap CI: `task_delta_summary.csv`
- Overall statistics and Wilcoxon test: `stats_summary.txt`
- Case-study pointers: `case_studies.md`

## Notes

- Structural coverage is the Verilator annotated score used by CorrectBench. It is not pure line coverage and not functional coverage.
- The analyzer uses pure-Python bootstrap CI and Wilcoxon logic so it does not require SciPy.
- If `matplotlib` is installed, the analyzer also emits the four paper figures.
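
For readers who want to sanity-check the statistics without SciPy, the following is a minimal sketch of how a bootstrap CI and a Wilcoxon signed-rank test on paired Baseline vs CGA deltas could be computed with the standard library alone. It is not the analyzer's actual code: the function names `bootstrap_mean_ci` and `wilcoxon_signed_rank`, the percentile bootstrap method, the resample count, and the normal approximation for the p-value are all assumptions made for illustration, and the analyzer may differ on any of them.

```python
import math
import random
from statistics import mean


def bootstrap_mean_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired deltas (hypothetical helper)."""
    if not deltas:
        raise ValueError("deltas must be non-empty")
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(deltas)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


def wilcoxon_signed_rank(deltas):
    """Two-sided Wilcoxon signed-rank test (hypothetical helper).

    Assumptions: zero deltas are dropped, tied |delta| values share an
    average rank, and the p-value uses the normal approximation without
    tie or continuity correction. Returns (W_plus, p_value).
    """
    nonzero = [d for d in deltas if d != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0
    ordered = sorted(nonzero, key=abs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied absolute values.
        j = i
        while j + 1 < n and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    # W+ is the sum of ranks belonging to positive deltas.
    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p_value = min(1.0, math.erfc(abs(z) / math.sqrt(2)))  # two-sided tail
    return w_plus, p_value


if __name__ == "__main__":
    # Made-up placeholder deltas (cga minus baseline), not real results.
    example_deltas = [0.04, 0.10, -0.02, 0.07, 0.00, 0.05]
    print("95% CI for mean delta:", bootstrap_mean_ci(example_deltas))
    print("Wilcoxon (W+, p):", wilcoxon_signed_rank(example_deltas))
```

With only a handful of tasks or repeats, the normal approximation is rough, so a sketch like this is best used to cross-check the direction and order of magnitude of the values in `stats_summary.txt` rather than to reproduce them exactly.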