# CorrectBench Paper Experiments

This directory contains the batch runner and analyzer used to execute the
paper-style Baseline vs CGA study on FSM/protocol tasks.

## Scope

- Task set: `experiments/paper_tasks.py`
- Base config: `config/configs/paper_fsm_qwen.yaml`
- Runner: `experiments/run_paper_experiments.py`
- Analyzer: `experiments/analyze_paper_experiments.py`

## Conditions

- `baseline`: runs the full CorrectBench pipeline but skips CGA optimization.
  A single coverage collection pass is still executed so the structural
  coverage metric remains directly comparable with CGA.
- `cga`: runs the full pipeline with CGA enabled.

## Recommended main experiment

Run 5 repeats for each condition:

```bash
venv/bin/python experiments/run_paper_experiments.py \
  --base-config config/configs/paper_fsm_qwen.yaml \
  --experiment-name paper_fsm_qwen \
  --models qwen-max \
  --conditions baseline cga \
  --repeats 5
```
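
As a rough check on how much work the command above schedules, the sketch below enumerates the run matrix. It assumes the runner launches one run per (model, condition, repeat) combination, which this README does not state explicitly; the variable names are illustrative and not taken from `run_paper_experiments.py`.

```python
from itertools import product

# Values copied from the command-line flags shown above.
models = ["qwen-max"]
conditions = ["baseline", "cga"]
repeats = 5

# Assumption: one run per (model, condition, repeat) combination.
run_matrix = [
    {"model": m, "condition": c, "repeat": r}
    for m, c, r in product(models, conditions, range(1, repeats + 1))
]

# 1 model x 2 conditions x 5 repeats = 10 runs
print(len(run_matrix))
```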

If you want a quick smoke test first:

```bash
venv/bin/python experiments/run_paper_experiments.py \
  --dry-run \
  --limit-tasks 2 \
  --repeats 1
```

## Analyze outputs

After runs finish, analyze the generated manifest:

```bash
venv/bin/python experiments/analyze_paper_experiments.py \
  --manifest analysis/paper_runs/paper_fsm_qwen/run_manifest.json
```

To write the analysis to a specific directory, add `--output-dir`:

```bash
venv/bin/python experiments/analyze_paper_experiments.py \
  --manifest analysis/paper_runs/paper_fsm_qwen/run_manifest.json \
  --output-dir analysis/paper_runs/paper_fsm_qwen/final_analysis
```

## Files to use in the paper

- Run-level source of truth: each run's `Chatbench_RunInfo.json`
- Per-task iteration trace: each task's `task_log.log`
- Main aggregated table: `task_summary.csv`
- Paired Baseline vs CGA deltas: `paired_deltas.csv`
- Per-task mean delta and bootstrap CI: `task_delta_summary.csv`
- Overall statistics and Wilcoxon test: `stats_summary.txt`
- Case-study pointers: `case_studies.md`
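
To make the relationship between the paired deltas and the per-task means concrete, here is a minimal sketch. It assumes pairs are matched by task and repeat index and that the delta is the CGA score minus the baseline score. The record fields (`task`, `repeat`, `condition`, `coverage`) and the toy values are hypothetical; they are not the real column names of `paired_deltas.csv` and not real results.

```python
from collections import defaultdict
from statistics import fmean

# Toy records, NOT real results. Field names are hypothetical.
runs = [
    {"task": "fsm_a", "repeat": 1, "condition": "baseline", "coverage": 70.0},
    {"task": "fsm_a", "repeat": 1, "condition": "cga", "coverage": 82.5},
    {"task": "fsm_a", "repeat": 2, "condition": "baseline", "coverage": 71.0},
    {"task": "fsm_a", "repeat": 2, "condition": "cga", "coverage": 80.0},
    {"task": "fsm_b", "repeat": 1, "condition": "baseline", "coverage": 90.0},
    {"task": "fsm_b", "repeat": 1, "condition": "cga", "coverage": 90.0},
]


def paired_deltas(records):
    """Pair runs by (task, repeat) and return CGA minus baseline per pair."""
    by_pair = defaultdict(dict)
    for rec in records:
        by_pair[(rec["task"], rec["repeat"])][rec["condition"]] = rec["coverage"]
    deltas = []
    for (task, repeat), scores in sorted(by_pair.items()):
        # Skip incomplete pairs, e.g. when one condition's run is missing.
        if "baseline" in scores and "cga" in scores:
            deltas.append(
                {"task": task, "repeat": repeat,
                 "delta": scores["cga"] - scores["baseline"]}
            )
    return deltas


def mean_delta_per_task(deltas):
    """Average the paired deltas within each task."""
    grouped = defaultdict(list)
    for d in deltas:
        grouped[d["task"]].append(d["delta"])
    return {task: fmean(values) for task, values in grouped.items()}


deltas = paired_deltas(runs)
print(mean_delta_per_task(deltas))
```

Pairing within the same task and repeat keeps each comparison between runs that differ only in the condition, which is the kind of paired input a Wilcoxon signed-rank test expects.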

## Notes

- Structural coverage is the Verilator annotated score used by CorrectBench.
  It is not pure line coverage and not functional coverage.
- The analyzer uses pure-Python bootstrap CI and Wilcoxon logic so it does not
  require SciPy.
- If `matplotlib` is installed, the analyzer also emits the four paper figures.
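
For readers unfamiliar with what a SciPy-free bootstrap CI and Wilcoxon test look like, the sketch below shows one possible standard-library version. It is not the analyzer's actual code: the percentile method, the resample count, the fixed seed, and the normal approximation without tie or continuity correction are all assumptions made for illustration, as are the function names.

```python
import math
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def wilcoxon_signed_rank(deltas):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Zero differences are dropped and tied magnitudes share an average rank.
    Returns (W_plus, p_value). The p-value is approximate for small samples.
    """
    nonzero = [d for d in deltas if d != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0
    ordered = sorted(nonzero, key=abs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of N(0, 1)
    return w_plus, p_value


# Toy deltas for illustration only, not real results.
toy_deltas = [1.0, 2.0, 3.0, -1.0, 0.0]
print(bootstrap_ci(toy_deltas))
print(wilcoxon_signed_rank(toy_deltas))
```

With only a handful of paired runs, an exact signed-rank distribution would be more accurate than the normal approximation used in this sketch.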