CorrectBench Paper Experiment Summary
==================================================
Total run-level rows: 4
Total paired rows: 2

[Overall Coverage Delta]
Mean delta: 0.0
95% bootstrap CI: [0.0, 0.0]
Wilcoxon signed-rank: n=0, p=1.000000, method=degenerate

[Overall Semantic Coverage Delta]
Mean delta: 0.0
95% bootstrap CI: [0.0, 0.0]
Wilcoxon signed-rank: n=0, p=1.000000, method=degenerate

[Per-Task Paired Coverage Delta]
qwen-max | 2012_q2fsm: mean=0.0 CI=[0.0, 0.0] n=1
qwen-max | 2013_q2afsm: mean=0.0 CI=[0.0, 0.0] n=1
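
Note on the degenerate Wilcoxon rows: the report lists 2 paired rows but n=0 for the signed-rank test. This is consistent with the classic Wilcoxon procedure, which discards zero differences before ranking, so two deltas of 0.0 would leave nothing to test. The sketch below is one plausible way such a summary could be produced, not the code that generated this report. The function names (bootstrap_ci, summarize), the resample count, the seed, and the choice to report p=1.0 with method "degenerate" when no nonzero deltas remain are all assumptions made for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired deltas (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(deltas)
    # Resample the deltas with replacement and record each resample's mean.
    means = sorted(
        mean([rng.choice(deltas) for _ in range(n)]) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def summarize(deltas):
    """Summarize paired deltas; flags the all-zero case instead of testing it."""
    # Zero differences carry no sign, so the signed-rank test drops them.
    nonzero = [d for d in deltas if d != 0]
    summary = {
        "mean": mean(deltas),
        "ci": bootstrap_ci(deltas),
        "n": len(nonzero),
    }
    if not nonzero:
        # Assumption: mirror the report's labels when nothing is left to rank.
        summary["p"] = 1.0
        summary["method"] = "degenerate"
    else:
        # The actual signed-rank p-value is left out of this sketch.
        summary["p"] = None
        summary["method"] = "not computed in this sketch"
    return summary


# The two paired rows in the report both have a delta of 0.0.
result = summarize([0.0, 0.0])
print(result)
```

With every delta equal to zero, each bootstrap resample also has mean zero, so the interval collapses to a single point. A collapsed interval here reflects the lack of variation in two identical pairs and says little about whether a difference would appear with more paired runs.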