CGA-bench/experiments/README_paper_experiments.md
2026-05-22 10:02:42 +08:00

CorrectBench Paper Experiments

This directory contains the batch runner and analyzer used to execute the paper-style Baseline vs CGA study on FSM/protocol tasks.

Scope

  • Task set: experiments/paper_tasks.py
  • Base config: config/configs/paper_fsm_qwen.yaml
  • Runner: experiments/run_paper_experiments.py
  • Analyzer: experiments/analyze_paper_experiments.py

Conditions

  • baseline: runs the full CorrectBench pipeline but skips CGA optimization. A single coverage collection pass is still executed so the structural coverage metric remains directly comparable with CGA.
  • cga: runs the full pipeline with CGA enabled.

Run five repeats of each condition:

venv/bin/python experiments/run_paper_experiments.py \
  --base-config config/configs/paper_fsm_qwen.yaml \
  --experiment-name paper_fsm_qwen \
  --models qwen-max \
  --conditions baseline cga \
  --repeats 5
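
As a rough illustration of what the command above requests, the sketch below expands the model, condition, and repeat arguments into a list of planned runs. It assumes the runner schedules the full Cartesian product for every task; the helper name `expand_runs` is hypothetical and is not taken from run_paper_experiments.py.

```python
from itertools import product


def expand_runs(models, conditions, repeats):
    """Return one (model, condition, repeat_index) tuple per planned run.

    Assumption: every model is combined with every condition and every repeat.
    The real runner may order or filter runs differently.
    """
    return list(product(models, conditions, range(1, repeats + 1)))


# Values mirror the command-line arguments shown above.
runs = expand_runs(["qwen-max"], ["baseline", "cga"], 5)
print(len(runs))  # 1 model x 2 conditions x 5 repeats
```

Under that assumption, each task is executed ten times: five baseline runs and five CGA runs.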

For a quick smoke test first, do a dry run limited to two tasks and a single repeat:

venv/bin/python experiments/run_paper_experiments.py \
  --dry-run \
  --limit-tasks 2 \
  --repeats 1

Analyze outputs

After runs finish, analyze the generated manifest:

venv/bin/python experiments/analyze_paper_experiments.py \
  --manifest analysis/paper_runs/paper_fsm_qwen/run_manifest.json

To write the analysis results to a specific directory, add --output-dir:

venv/bin/python experiments/analyze_paper_experiments.py \
  --manifest analysis/paper_runs/paper_fsm_qwen/run_manifest.json \
  --output-dir analysis/paper_runs/paper_fsm_qwen/final_analysis

Files to use in the paper

  • Run-level source of truth: each run's Chatbench_RunInfo.json
  • Per-task iteration trace: each task's task_log.log
  • Main aggregated table: task_summary.csv
  • Paired Baseline vs CGA deltas: paired_deltas.csv
  • Per-task mean delta and bootstrap CI: task_delta_summary.csv
  • Overall statistics and Wilcoxon test: stats_summary.txt
  • Case-study pointers: case_studies.md
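
To make the paired-delta and bootstrap outputs listed above more concrete, here is a minimal sketch of how such values could be computed. The row keys (`task`, `repeat`, `condition`, `coverage`), the helper names, and the toy numbers are all assumptions for illustration; the actual columns of paired_deltas.csv and task_delta_summary.csv, and the analyzer's exact bootstrap settings, may differ.

```python
import random
from statistics import mean


def paired_deltas(rows, metric="coverage"):
    """Pair baseline and cga rows by (task, repeat) and return cga - baseline.

    Rows without a partner in the other condition are skipped.
    """
    by_key = {}
    for row in rows:
        key = (row["task"], row["repeat"])
        by_key.setdefault(key, {})[row["condition"]] = row[metric]
    return {
        key: vals["cga"] - vals["baseline"]
        for key, vals in sorted(by_key.items())
        if "baseline" in vals and "cga" in vals
    }


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy rows with made-up coverage percentages (not real results).
rows = [
    {"task": "fsm_a", "repeat": 1, "condition": "baseline", "coverage": 70.0},
    {"task": "fsm_a", "repeat": 1, "condition": "cga", "coverage": 80.0},
    {"task": "fsm_a", "repeat": 2, "condition": "baseline", "coverage": 60.0},
    {"task": "fsm_a", "repeat": 2, "condition": "cga", "coverage": 75.0},
    {"task": "fsm_a", "repeat": 3, "condition": "baseline", "coverage": 90.0},
]
deltas = paired_deltas(rows)
ci_low, ci_high = bootstrap_ci(list(deltas.values()))
print(deltas, ci_low, ci_high)
```

Pairing by task and repeat index is one plausible choice; if repeats are not matched one-to-one in the real study, pairing on per-task means would be the natural alternative.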

Notes

  • Structural coverage is the Verilator annotated score used by CorrectBench. It is neither pure line coverage nor functional coverage.
  • The analyzer uses pure-Python bootstrap CI and Wilcoxon logic so it does not require SciPy.
  • If matplotlib is installed, the analyzer also emits the four paper figures.
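
For readers curious how a Wilcoxon signed-rank test can be done without SciPy, the following is a minimal standard-library sketch. It assumes a two-sided test with the normal approximation, average ranks for ties, and zero differences dropped; the analyzer's actual variant (exact versus approximate, continuity correction, zero handling) is not documented here and may differ. The function name is hypothetical.

```python
import math


def wilcoxon_signed_rank(deltas):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Zero differences are dropped; tied absolute values share an average rank.
    Returns (W_plus, p_value), where W_plus is the sum of positive ranks.
    """
    diffs = [d for d in deltas if d != 0]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t  # feeds the tie correction of the variance
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    if var_w == 0:
        return w_plus, 1.0
    z = (w_plus - mean_w) / math.sqrt(var_w)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, min(1.0, p_value)


# Toy deltas for illustration only (not real results).
w_all_pos, p_all_pos = wilcoxon_signed_rank([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
w_sym, p_sym = wilcoxon_signed_rank([1, -1, 2, -2])
print(w_all_pos, p_all_pos, w_sym, p_sym)
```

The normal approximation is coarse for very small samples, so with only a handful of paired deltas an exact test would be the more careful choice.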