experiments/README_paper_experiments.md

# CorrectBench Paper Experiments

This directory contains the batch runner and analyzer used to execute the
paper-style Baseline vs CGA study on FSM/protocol tasks.

## Scope

- Task set: `experiments/paper_tasks.py`
- Base config: `config/configs/paper_fsm_qwen.yaml`
- Runner: `experiments/run_paper_experiments.py`
- Analyzer: `experiments/analyze_paper_experiments.py`

## Conditions

- `baseline`: runs the full CorrectBench pipeline but skips CGA optimization.
  A single coverage collection pass is still executed so the structural
  coverage metric remains directly comparable with CGA.
- `cga`: runs the full pipeline with CGA enabled.
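
As a rough illustration of how the two conditions differ, the sketch below models each one as a small set of overrides applied to a base config. The option names `cga_enabled` and `collect_coverage_once` and the helper `apply_condition` are hypothetical, chosen only for this example; they are not taken from `paper_fsm_qwen.yaml` or from the runner.

```python
# Hypothetical sketch: option names are illustrative, not the real config keys.
CONDITION_OVERRIDES = {
    # Baseline skips CGA but still collects coverage once for comparability.
    "baseline": {"cga_enabled": False, "collect_coverage_once": True},
    # CGA runs the full pipeline with optimization switched on.
    "cga": {"cga_enabled": True},
}


def apply_condition(base_config: dict, condition: str) -> dict:
    """Return a copy of base_config with the overrides for one condition."""
    if condition not in CONDITION_OVERRIDES:
        raise ValueError(f"unknown condition: {condition!r}")
    merged = dict(base_config)  # shallow copy so the base config is untouched
    merged.update(CONDITION_OVERRIDES[condition])
    return merged
```

Keeping both conditions on the same base config, with only these overrides changed, is what makes the coverage numbers comparable.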

## Recommended main experiment

Run 5 repeats for each condition:

```bash
venv/bin/python experiments/run_paper_experiments.py \
  --base-config config/configs/paper_fsm_qwen.yaml \
  --experiment-name paper_fsm_qwen \
  --models qwen-max \
  --conditions baseline cga \
  --repeats 5
```
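
To get a feel for the size of this experiment, the sketch below enumerates the run matrix implied by the flags above. It assumes the runner takes the full cross product of models, conditions, and repeats; `plan_runs` is a hypothetical helper, not part of `run_paper_experiments.py`, and how tasks are distributed across runs is not modeled here.

```python
from itertools import product


def plan_runs(models, conditions, repeats):
    """List every (model, condition, repeat) combination, repeats numbered from 1."""
    return [
        {"model": model, "condition": condition, "repeat": repeat}
        for model, condition, repeat in product(
            models, conditions, range(1, repeats + 1)
        )
    ]


runs = plan_runs(["qwen-max"], ["baseline", "cga"], 5)
print(len(runs))  # 10 combinations: 1 model x 2 conditions x 5 repeats
```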

For a quick smoke test before the full run:

```bash
venv/bin/python experiments/run_paper_experiments.py \
  --dry-run \
  --limit-tasks 2 \
  --repeats 1
```

## Analyze outputs

After runs finish, analyze the generated manifest:

```bash
venv/bin/python experiments/analyze_paper_experiments.py \
  --manifest analysis/paper_runs/paper_fsm_qwen/run_manifest.json
```

To write the analysis to a specific directory, pass `--output-dir`:

```bash
venv/bin/python experiments/analyze_paper_experiments.py \
  --manifest analysis/paper_runs/paper_fsm_qwen/run_manifest.json \
  --output-dir analysis/paper_runs/paper_fsm_qwen/final_analysis
```

## Files to use in the paper

- Run-level source of truth: each run's `Chatbench_RunInfo.json`
- Per-task iteration trace: each task's `task_log.log`
- Main aggregated table: `task_summary.csv`
- Paired Baseline vs CGA deltas: `paired_deltas.csv`
- Per-task mean delta and bootstrap CI: `task_delta_summary.csv`
- Overall statistics and Wilcoxon test: `stats_summary.txt`
- Case-study pointers: `case_studies.md`
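
The paired deltas and the per-task bootstrap CI can be sketched as follows. This is a minimal illustration, not the analyzer's code: the row fields `task`, `repeat`, `condition`, and `coverage` are assumed names, pairing by `(task, repeat)` is an assumption, and the percentile bootstrap shown here is one common choice that may differ from what `analyze_paper_experiments.py` does.

```python
import random
from statistics import mean


def paired_deltas(rows):
    """Map (task, repeat) -> CGA coverage minus baseline coverage."""
    by_key = {}
    for row in rows:
        key = (row["task"], row["repeat"])
        by_key.setdefault(key, {})[row["condition"]] = row["coverage"]
    # Keep only pairs where both conditions produced a value.
    return {
        key: scores["cga"] - scores["baseline"]
        for key, scores in by_key.items()
        if "cga" in scores and "baseline" in scores
    }


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean; returns (mean, low, high)."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(values), low, high
```

A positive delta means CGA reached higher structural coverage than the baseline for that pair.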

## Notes

- Structural coverage is the Verilator annotated score used by CorrectBench.
  It is not pure line coverage and not functional coverage.
- The analyzer computes the bootstrap CI and the Wilcoxon test in pure Python,
  so it does not require SciPy.
- If `matplotlib` is installed, the analyzer also emits the four paper figures.
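
For readers unfamiliar with a SciPy-free Wilcoxon test, the sketch below shows one way to compute a two-sided Wilcoxon signed-rank test with only the standard library. It is an assumption-laden illustration: it drops zero deltas, assigns average ranks to ties, and uses the normal approximation without continuity correction. The analyzer's actual implementation may make different choices, for example an exact test for small samples.

```python
import math


def wilcoxon_signed_rank(deltas):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, w_minus, p_value). Zero deltas are dropped and tied
    absolute values share their average rank.
    """
    nonzero = [d for d in deltas if d != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 0.0, 1.0

    ordered = sorted(nonzero, key=abs)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        # Extend j over the run of equal absolute values starting at i.
        while j + 1 < n and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        t = j - i + 1
        tie_term += t**3 - t  # tie correction for the variance
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, ordered) if d < 0)

    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (min(w_plus, w_minus) - mean_w) / math.sqrt(var_w)
    p_value = min(1.0, math.erfc(abs(z) / math.sqrt(2)))
    return w_plus, w_minus, p_value
```

With few paired observations the normal approximation is coarse, so p-values from a sketch like this should be read with care.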

First commit: 2026-05-22 10:02:42 +08:00