CorrectBench Paper Experiment Summary
==================================================
Total run-level rows: 4
Total paired rows: 2

[Overall Coverage Delta]
Mean delta: 0.0
95% bootstrap CI: [0.0, 0.0]
Wilcoxon signed-rank: n=0, p=1.000000, method=degenerate

[Overall Semantic Coverage Delta]
Mean delta: 0.0
95% bootstrap CI: [0.0, 0.0]
Wilcoxon signed-rank: n=0, p=1.000000, method=degenerate

[Per-Task Paired Coverage Delta]
qwen-max | 2012_q2fsm: mean=0.0 CI=[0.0, 0.0] n=1
qwen-max | 2013_q2afsm: mean=0.0 CI=[0.0, 0.0] n=1
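The figures above are consistent with every paired delta being exactly zero. The sketch below shows one way such a summary could be produced. It is not the script that generated this report: the function names, the resample count, and the seed are hypothetical, and it assumes that "method=degenerate" means no non-zero differences were left after zero deltas were dropped, which would explain n=0 and p=1.0 with two paired rows.

```python
import random
from itertools import product
from statistics import mean


def bootstrap_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired deltas (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(deltas)
    # Resample the deltas with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def wilcoxon_signed_rank(deltas):
    """Two-sided exact Wilcoxon signed-rank test for small samples (hypothetical helper)."""
    # Zero differences carry no sign information and are dropped.
    nonzero = [d for d in deltas if d != 0]
    n = len(nonzero)
    if n == 0:
        # Assumed meaning of "degenerate": nothing left to rank.
        return {"n": 0, "p": 1.0, "method": "degenerate"}

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    observed = abs(w_plus - total / 2)

    # Enumerate all 2^n sign assignments (only practical for small n).
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return {"n": n, "p": extreme / 2 ** n, "method": "exact"}


# Two paired rows, both with a delta of zero, as in the report above.
deltas = [0.0, 0.0]
print("Mean delta:", mean(deltas))
print("95% bootstrap CI:", bootstrap_ci(deltas))
print("Wilcoxon signed-rank:", wilcoxon_signed_rank(deltas))
```

With only two paired rows and identical coverage in both conditions, the interval collapses to a single point and the test has nothing to rank, so the output supports neither a gain nor a loss.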