# CorrectBench Paper Experiments
This directory contains the batch runner and analyzer used to execute the
paper-style Baseline vs CGA study on FSM/protocol tasks.
## Scope
- Task set: `experiments/paper_tasks.py`
- Base config: `config/configs/paper_fsm_qwen.yaml`
- Runner: `experiments/run_paper_experiments.py`
- Analyzer: `experiments/analyze_paper_experiments.py`
## Conditions
- `baseline`: runs the full CorrectBench pipeline but skips CGA optimization.
A single coverage collection pass is still executed so the structural
coverage metric remains directly comparable with CGA.
- `cga`: runs the full pipeline with CGA enabled.
## Recommended main experiment
Run 5 repeats for each condition:
```bash
venv/bin/python experiments/run_paper_experiments.py \
--base-config config/configs/paper_fsm_qwen.yaml \
--experiment-name paper_fsm_qwen \
--models qwen-max \
--conditions baseline cga \
--repeats 5
```
If you want a quick smoke test first:
```bash
venv/bin/python experiments/run_paper_experiments.py \
--dry-run \
--limit-tasks 2 \
--repeats 1
```
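
As a rough illustration of what the main experiment command asks for, the sketch below enumerates one batch run per (model, condition, repeat) combination. The function name `build_run_matrix` and the dictionary keys are hypothetical; the actual runner may organize its runs differently.

```python
from itertools import product


def build_run_matrix(models, conditions, repeats):
    """Enumerate one entry per (model, condition, repeat) combination.

    Hypothetical helper: the names and record layout are assumptions,
    not taken from run_paper_experiments.py.
    """
    return [
        {"model": model, "condition": condition, "repeat": repeat}
        for model, condition, repeat in product(
            models, conditions, range(1, repeats + 1)
        )
    ]


# Mirrors the flags of the recommended main experiment.
matrix = build_run_matrix(["qwen-max"], ["baseline", "cga"], 5)
print(len(matrix))  # 10
```

With one model, two conditions, and five repeats this gives ten batch runs, each covering the task set.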
## Analyze outputs
After runs finish, analyze the generated manifest:
```bash
venv/bin/python experiments/analyze_paper_experiments.py \
--manifest analysis/paper_runs/paper_fsm_qwen/run_manifest.json
```
To write the analysis to a specific directory, add `--output-dir`:
```bash
venv/bin/python experiments/analyze_paper_experiments.py \
--manifest analysis/paper_runs/paper_fsm_qwen/run_manifest.json \
--output-dir analysis/paper_runs/paper_fsm_qwen/final_analysis
```
## Files to use in the paper
- Run-level source of truth: each run's `Chatbench_RunInfo.json`
- Per-task iteration trace: each task's `task_log.log`
- Main aggregated table: `task_summary.csv`
- Paired Baseline vs CGA deltas: `paired_deltas.csv`
- Per-task mean delta and bootstrap CI: `task_delta_summary.csv`
- Overall statistics and Wilcoxon test: `stats_summary.txt`
- Case-study pointers: `case_studies.md`
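
To make the idea of paired Baseline vs CGA deltas concrete, here is a minimal sketch that pairs runs by task and repeat index and subtracts the baseline metric from the CGA metric. The field names (`task`, `repeat`, `condition`, `coverage`), the pairing rule, and the sample values are all assumptions for illustration; they are not the analyzer's actual schema or real results.

```python
from collections import defaultdict


def paired_deltas(rows, metric="coverage"):
    """Return CGA-minus-baseline deltas for each (task, repeat) pair.

    Assumed input: dicts with hypothetical keys 'task', 'repeat',
    'condition', and the metric name. Pairs missing either condition
    are skipped.
    """
    by_key = defaultdict(dict)
    for row in rows:
        by_key[(row["task"], row["repeat"])][row["condition"]] = row[metric]

    deltas = []
    for (task, repeat), values in sorted(by_key.items()):
        if "baseline" in values and "cga" in values:
            deltas.append(
                {
                    "task": task,
                    "repeat": repeat,
                    "delta": values["cga"] - values["baseline"],
                }
            )
    return deltas


# Made-up placeholder values, not experimental results.
rows = [
    {"task": "fsm_a", "repeat": 1, "condition": "baseline", "coverage": 0.5},
    {"task": "fsm_a", "repeat": 1, "condition": "cga", "coverage": 0.75},
    {"task": "fsm_b", "repeat": 1, "condition": "baseline", "coverage": 0.25},
]
deltas = paired_deltas(rows)
```

Here only `fsm_a` yields a delta, because `fsm_b` has no matching `cga` run.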
## Notes
- Structural coverage is the Verilator-annotated coverage score used by CorrectBench.
It is neither pure line coverage nor functional coverage.
- The analyzer uses pure-Python bootstrap CI and Wilcoxon logic so it does not
require SciPy.
- If `matplotlib` is installed, the analyzer also emits the four paper figures.
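
For readers unfamiliar with the statistics mentioned above, the following is a minimal standard-library sketch of a percentile bootstrap CI for a mean and a two-sided Wilcoxon signed-rank test. It is not the analyzer's code: the function names, the resample count, and the choice of a normal approximation without tie or continuity correction are assumptions made for illustration, and that approximation is rough for small samples.

```python
import math
import random


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


def wilcoxon_signed_rank(deltas):
    """Two-sided Wilcoxon signed-rank test via normal approximation.

    Zero deltas are dropped and tied magnitudes share an average rank.
    Returns (W, p) where W is the smaller of the two signed rank sums.
    """
    nonzero = [d for d in deltas if d != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0

    ordered = sorted(nonzero, key=abs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, ordered) if d < 0)
    w = min(w_plus, w_minus)

    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, min(p, 1.0)
```

Both functions take a plain list of per-pair deltas, for example `bootstrap_mean_ci(deltas)` and `wilcoxon_signed_rank(deltas)`.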