mcpmark-release - Evaluation Results
Generated: 2025-12-15T06:36:08.102520
Task set: standard
| Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
|---|---|---|---|---|---|---|
| gpt-5-2-high | 127 | 57.5% ± 1.1% | 66.9% | 44.9% | $250.47 | 732.5 |
| gemini-3-pro-high | 127 | 53.9% ± 0.4% | 66.9% | 37.8% | $265.59 | 222.4 |
| gpt-5-medium | 127 | 52.6% ± 1.3% | 68.5% | 33.9% | $127.46 | 478.2 |
| gpt-5-high | 127 | 51.6% ± 2.5% | 66.1% | 33.1% | $153.89 | 1029.3 |
| gemini-3-pro-low | 127 | 50.8% ± 2.1% | 67.7% | 30.7% | $257.15 | 209.4 |
| gpt-5-low | 127 | 46.9% ± 2.9% | 63.0% | 26.8% | $125.87 | 385.8 |
| claude-opus-4-5-high | 127 | 42.3% ± 2.0% | 53.5% | 33.9% | $466.18 | 216.9 |
| deepseek-v3-2-thinking | 127 | 36.8% ± 1.8% | 51.2% | 21.3% | $31.28 | 398.0 |
| claude-sonnet-4-5 | 127 | 32.1% ± 2.3% | 46.5% | 16.5% | $281.6 | 173.2 |
| grok-4 | 127 | 31.7% ± 2.9% | 44.9% | 18.1% | $257.41 | 319.8 |
| gpt-5-mini-high | 127 | 30.3% ± 1.7% | 46.5% | 16.5% | $40.35 | 349.9 |
| claude-opus-4-1 | 127 | 29.9% ± 0.0% | / | / | $1165.45 | 361.8 |
| deepseek-v3-2-chat | 127 | 29.7% ± 1.5% | 46.5% | 13.4% | $26.58 | 298.4 |
| claude-sonnet-4-high | 127 | 28.3% ± 2.4% | 40.9% | 18.1% | $442.33 | 185.6 |
| claude-sonnet-4 | 127 | 28.1% ± 2.6% | 44.9% | 12.6% | $252.41 | 218.3 |
| claude-sonnet-4-low | 127 | 27.4% ± 1.7% | 39.4% | 18.1% | $460.95 | 199.4 |
| gpt-5-mini-medium | 127 | 27.4% ± 3.1% | 45.7% | 9.4% | $26.02 | 159.9 |
| o3 | 127 | 25.4% ± 2.0% | 43.3% | 12.6% | $113.94 | 169.4 |
| qwen-3-coder-plus | 127 | 24.8% ± 2.1% | 40.9% | 12.6% | $36.46 | 274.3 |
| grok-4-fast | 127 | 24.0% ± 3.1% | 38.6% | 12.6% | $17.88 | 109.9 |
| kimi-k2-0905 | 127 | 21.9% ± 1.2% | 31.5% | 12.6% | $72.57 | 493.8 |
| deepseek-v3-1-terminus-thinking | 127 | 21.3% ± 3.3% | 37.0% | 5.5% | $10.52 | 734.5 |
| grok-code-fast-1 | 127 | 20.5% ± 3.4% | 30.7% | 9.4% | $16.08 | 156.6 |
| kimi-k2-0711 | 127 | 19.1% ± 1.6% | 31.5% | 11.8% | $36.45 | 214.8 |
| qwen-3-max | 127 | 17.7% ± 1.3% | 22.8% | 11.0% | $160 | 213.6 |
| o4-mini | 127 | 17.3% ± 2.3% | 31.5% | 6.3% | $63.62 | 323.3 |
| deepseek-chat | 127 | 16.7% ± 1.4% | 28.3% | 7.9% | $35.66 | 269.9 |
| deepseek-v3-1-terminus | 127 | 16.5% ± 5.1% | 29.9% | 3.9% | $12.65 | 244.9 |
| gemini-2-5-pro | 127 | 15.8% ± 0.6% | 29.9% | 4.7% | $162.48 | 119.4 |
| glm-4-5 | 127 | 15.6% ± 1.2% | 24.4% | 6.3% | $18.27 | 166.3 |
| gemini-2-5-flash | 127 | 9.1% ± 0.7% | 18.1% | 3.9% | $41.81 | 114.9 |
| gpt-5-mini-low | 127 | 8.3% ± 1.3% | 18.9% | 0.8% | $7.86 | 63.2 |
| gpt-4-1 | 127 | 8.1% ± 0.7% | 12.6% | 3.1% | $83.62 | 59.7 |
| gpt-5-nano-medium | 127 | 6.3% ± 2.0% | 11.8% | 1.6% | $4.06 | 157.2 |
| gpt-5-nano-high | 127 | 5.1% ± 2.1% | 14.2% | 0.0% | $7.79 | 309.5 |
| gpt-oss-120b | 127 | 4.7% ± 1.0% | 13.4% | 0.0% | $0.64 | 27.4 |
| gpt-5-nano-low | 127 | 4.3% ± 1.2% | 10.2% | 0.8% | $2.5 | 96.4 |
| gpt-4-1-mini | 127 | 3.9% ± 1.0% | 7.1% | 1.6% | $59.96 | 85.7 |
| gpt-4-1-nano | 127 | 0.0% ± 0.0% | 0.0% | 0.0% | $2.54 | 39.1 |

| Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
|---|---|---|---|---|---|---|
| gpt-5-2-high | 30 | 60.8% ± 2.8% | 70.0% | 46.7% | $40.75 | 500.0 |
| gemini-3-pro-high | 30 | 59.2% ± 3.6% | 80.0% | 40.0% | $41.68 | 229.3 |
| gpt-5-medium | 30 | 57.5% ± 3.6% | 76.7% | 36.7% | $13.31 | 313.1 |
| gemini-3-pro-low | 30 | 56.7% ± 4.1% | 80.0% | 33.3% | $39.3 | 209.2 |
| gpt-5-low | 30 | 54.2% ± 6.8% | 73.3% | 33.3% | $15.48 | 275.6 |
| gpt-5-high | 30 | 52.5% ± 3.6% | 70.0% | 36.7% | $17.48 | 828.0 |
| grok-4 | 30 | 50.8% ± 6.4% | 73.3% | 26.7% | $27.08 | 256.6 |
| claude-opus-4-5-high | 30 | 40.0% ± 2.4% | 50.0% | 33.3% | $36.23 | 87.9 |
| deepseek-v3-2-thinking | 30 | 36.7% ± 4.1% | 46.7% | 23.3% | $5.47 | 413.5 |
| o3 | 30 | 35.8% ± 2.8% | 50.0% | 26.7% | $45.65 | 277.9 |
| gpt-5-mini-high | 30 | 35.0% ± 7.6% | 46.7% | 23.3% | $5.56 | 288.9 |
| claude-opus-4-1 | 30 | 33.3% ± 0.0% | / | / | $132.3 | 267.8 |
| gpt-5-mini-medium | 30 | 33.3% ± 6.2% | 53.3% | 10.0% | $3.74 | 174.3 |
| claude-sonnet-4-5 | 30 | 32.5% ± 4.9% | 43.3% | 13.3% | $26.32 | 95.1 |
| grok-4-fast | 30 | 29.2% ± 7.2% | 53.3% | 16.7% | $3.25 | 74.7 |
| claude-sonnet-4 | 30 | 27.5% ± 2.8% | 50.0% | 6.7% | $29 | 193.1 |
| deepseek-v3-2-chat | 30 | 25.0% ± 5.5% | 43.3% | 6.7% | $5.03 | 311.2 |
| o4-mini | 30 | 25.0% ± 2.9% | 36.7% | 13.3% | $11.78 | 263.6 |
| deepseek-v3-1-terminus-thinking | 30 | 24.2% ± 10.1% | 43.3% | 6.7% | $1.03 | 773.4 |
| gemini-2-5-pro | 30 | 24.2% ± 3.6% | 43.3% | 10.0% | $19.61 | 126.1 |
| claude-sonnet-4-low | 30 | 23.3% ± 4.7% | 36.7% | 13.3% | $95.98 | 176.0 |
| grok-code-fast-1 | 30 | 23.3% ± 7.4% | 40.0% | 10.0% | $1.76 | 75.5 |
| claude-sonnet-4-high | 30 | 23.3% ± 4.1% | 36.7% | 10.0% | $82.56 | 143.4 |
| kimi-k2-0711 | 30 | 20.0% ± 2.4% | 30.0% | 13.3% | $6.99 | 222.5 |
| deepseek-chat | 30 | 15.8% ± 1.4% | 26.7% | 6.7% | $7.25 | 281.7 |
| kimi-k2-0905 | 30 | 14.2% ± 1.4% | 23.3% | 6.7% | $12.88 | 376.5 |
| qwen-3-coder-plus | 30 | 13.3% ± 6.7% | 26.7% | 3.3% | $5.93 | 157.2 |
| gpt-4-1 | 30 | 12.5% ± 1.4% | 20.0% | 3.3% | $9.07 | 41.0 |
| gpt-5-nano-low | 30 | 12.5% ± 3.6% | 30.0% | 3.3% | $1.19 | 129.1 |
| gpt-5-mini-low | 30 | 12.5% ± 4.9% | 33.3% | 3.3% | $1.13 | 67.7 |
| deepseek-v3-1-terminus | 30 | 10.8% ± 4.9% | 20.0% | 3.3% | $1.28 | 179.8 |
| qwen-3-max | 30 | 10.8% ± 1.4% | 13.3% | 10.0% | $14.54 | 133.9 |
| gemini-2-5-flash | 30 | 8.3% ± 1.7% | 13.3% | 6.7% | $1.18 | 62.1 |
| glm-4-5 | 30 | 7.5% ± 1.4% | 13.3% | 3.3% | $2.08 | 130.2 |
| gpt-5-nano-medium | 30 | 6.7% ± 5.3% | 16.7% | 0.0% | $0.93 | 129.5 |
| gpt-5-nano-high | 30 | 5.8% ± 4.9% | 16.7% | 0.0% | $1.47 | 206.6 |
| gpt-oss-120b | 30 | 5.8% ± 4.3% | 16.7% | 0.0% | $0.05 | 18.3 |
| gpt-4-1-mini | 30 | 3.3% ± 0.0% | 3.3% | 3.3% | $2.43 | 51.5 |
| gpt-4-1-nano | 30 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.37 | 28.1 |

| Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
|---|---|---|---|---|---|---|
| gpt-5-high | 23 | 50.0% ± 2.2% | 60.9% | 34.8% | $34.73 | 1083.5 |
| gpt-5-medium | 23 | 47.8% ± 8.1% | 65.2% | 17.4% | $23.7 | 456.6 |
| gpt-5-2-high | 23 | 47.8% ± 5.3% | 60.9% | 34.8% | $43.34 | 714.5 |
| gemini-3-pro-high | 23 | 46.7% ± 6.4% | 65.2% | 30.4% | $30.96 | 209.5 |
| gemini-3-pro-low | 23 | 45.6% ± 11.7% | 65.2% | 26.1% | $32.55 | 205.5 |
| claude-opus-4-5-high | 23 | 37.0% ± 7.2% | 52.2% | 21.7% | $143.37 | 555.3 |
| claude-sonnet-4-5 | 23 | 29.3% ± 8.9% | 43.5% | 17.4% | $68.75 | 220.0 |
| claude-sonnet-4-high | 23 | 28.3% ± 2.2% | 43.5% | 21.7% | $58.39 | 170.0 |
| gpt-5-low | 23 | 27.2% ± 1.9% | 39.1% | 17.4% | $22.31 | 268.3 |
| claude-sonnet-4-low | 23 | 25.0% ± 3.6% | 34.8% | 21.7% | $57.93 | 173.6 |
| glm-4-5 | 23 | 22.8% ± 6.4% | 34.8% | 13.0% | $3.77 | 153.5 |
| claude-opus-4-1 | 23 | 21.7% ± 0.0% | / | / | $224.18 | 390.2 |
| deepseek-v3-2-thinking | 23 | 20.6% ± 1.9% | 43.5% | 0.0% | $6.18 | 411.8 |
| qwen-3-coder-plus | 23 | 19.6% ± 6.5% | 34.8% | 13.0% | $9.2 | 320.4 |
| gpt-5-mini-high | 23 | 19.6% ± 2.2% | 34.8% | 8.7% | $6.92 | 338.1 |
| gpt-5-mini-medium | 23 | 18.5% ± 7.8% | 34.8% | 4.3% | $3.89 | 127.6 |
| deepseek-v3-2-chat | 23 | 17.4% ± 5.3% | 39.1% | 0.0% | $5.45 | 292.1 |
| claude-sonnet-4 | 23 | 16.3% ± 5.7% | 30.4% | 8.7% | $49.61 | 196.5 |
| kimi-k2-0905 | 23 | 16.3% ± 1.9% | 26.1% | 8.7% | $14.21 | 780.8 |
| gemini-2-5-flash | 23 | 15.2% ± 2.2% | 21.7% | 8.7% | $8.37 | 206.4 |
| o3 | 23 | 14.1% ± 3.6% | 21.7% | 4.3% | $21.41 | 128.0 |
| qwen-3-max | 23 | 14.1% ± 3.6% | 17.4% | 4.3% | $37.56 | 181.5 |
| grok-4 | 23 | 14.1% ± 3.6% | 21.7% | 8.7% | $56.18 | 269.0 |
| o4-mini | 23 | 14.1% ± 6.4% | 26.1% | 4.3% | $13.79 | 248.8 |
| grok-4-fast | 23 | 13.0% ± 3.1% | 21.7% | 0.0% | $2.77 | 143.1 |
| deepseek-v3-1-terminus-thinking | 23 | 10.9% ± 4.9% | 21.7% | 0.0% | $2.07 | 702.3 |
| kimi-k2-0711 | 23 | 10.9% ± 2.2% | 13.0% | 4.3% | $5.37 | 205.0 |
| deepseek-chat | 23 | 9.8% ± 1.9% | 13.0% | 8.7% | $4.75 | 194.0 |
| gemini-2-5-pro | 23 | 9.8% ± 1.9% | 21.7% | 0.0% | $11.96 | 91.3 |
| grok-code-fast-1 | 23 | 8.7% ± 5.3% | 17.4% | 4.3% | $3.68 | 182.9 |
| gpt-5-nano-high | 23 | 8.7% ± 3.1% | 17.4% | 0.0% | $1.86 | 317.0 |
| gpt-5-mini-low | 23 | 8.7% ± 3.1% | 13.0% | 0.0% | $2.59 | 63.1 |
| gpt-4-1 | 23 | 7.6% ± 1.9% | 8.7% | 4.3% | $20.97 | 90.2 |
| gpt-5-nano-medium | 23 | 7.6% ± 1.9% | 13.0% | 0.0% | $1.11 | 187.3 |
| gpt-4-1-mini | 23 | 6.5% ± 6.5% | 17.4% | 0.0% | $4.35 | 83.1 |
| deepseek-v3-1-terminus | 23 | 5.4% ± 7.1% | 17.4% | 0.0% | $2.2 | 231.9 |
| gpt-oss-120b | 23 | 4.3% ± 3.1% | 8.7% | 0.0% | $0.14 | 24.0 |
| gpt-5-nano-low | 23 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.29 | 57.7 |
| gpt-4-1-nano | 23 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.74 | 51.8 |

| Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
|---|---|---|---|---|---|---|
| gpt-5-2-high | 28 | 60.7% ± 2.5% | 67.9% | 50.0% | $60.66 | 1259.0 |
| gemini-3-pro-high | 28 | 47.3% ± 3.0% | 57.1% | 28.6% | $19.11 | 158.6 |
| deepseek-v3-2-thinking | 28 | 45.5% ± 4.6% | 57.1% | 32.1% | $6.53 | 408.1 |
| gpt-5-high | 28 | 44.6% ± 1.8% | 60.7% | 21.4% | $28.98 | 1161.4 |
| gemini-3-pro-low | 28 | 43.8% ± 3.9% | 57.1% | 21.4% | $24.13 | 197.0 |
| gpt-5-medium | 28 | 42.0% ± 3.0% | 50.0% | 32.1% | $21.98 | 661.8 |
| claude-opus-4-5-high | 28 | 38.4% ± 3.9% | 46.4% | 32.1% | $79.08 | 148.7 |
| gpt-5-low | 28 | 36.6% ± 7.7% | 53.6% | 14.3% | $23.26 | 559.6 |
| claude-opus-4-1 | 28 | 35.7% ± 0.0% | / | / | $276.24 | 294.2 |
| deepseek-v3-2-chat | 28 | 32.1% ± 5.7% | 46.4% | 14.3% | $6.06 | 264.8 |
| claude-sonnet-4-5 | 28 | 25.0% ± 7.6% | 46.4% | 3.6% | $58.39 | 165.0 |
| o3 | 28 | 24.1% ± 3.9% | 46.4% | 7.1% | $14.72 | 171.4 |
| deepseek-v3-1-terminus-thinking | 28 | 22.3% ± 3.0% | 39.3% | 3.6% | $3.05 | 919.4 |
| claude-sonnet-4-low | 28 | 22.3% ± 3.0% | 32.1% | 7.1% | $105.36 | 220.0 |
| deepseek-v3-1-terminus | 28 | 22.3% ± 8.1% | 39.3% | 7.1% | $4.28 | 252.5 |
| glm-4-5 | 28 | 21.4% ± 2.5% | 32.1% | 10.7% | $5.97 | 222.2 |
| claude-sonnet-4 | 28 | 21.4% ± 5.1% | 39.3% | 7.1% | $56.1 | 193.2 |
| gpt-5-mini-high | 28 | 20.5% ± 13.0% | 42.9% | 7.1% | $10.62 | 522.6 |
| o4-mini | 28 | 20.5% ± 5.9% | 42.9% | 7.1% | $11.44 | 442.4 |
| qwen-3-coder-plus | 28 | 19.6% ± 6.4% | 39.3% | 7.1% | $4.52 | 99.7 |
| claude-sonnet-4-high | 28 | 19.6% ± 8.2% | 35.7% | 3.6% | $117.43 | 190.6 |
| qwen-3-max | 28 | 17.0% ± 4.6% | 25.0% | 3.6% | $33.34 | 183.5 |
| gpt-5-mini-medium | 28 | 16.1% ± 5.9% | 32.1% | 3.6% | $5.63 | 168.7 |
| kimi-k2-0711 | 28 | 14.3% ± 4.4% | 32.1% | 7.1% | $9.37 | 183.4 |
| deepseek-chat | 28 | 12.5% ± 3.1% | 28.6% | 0.0% | $8 | 238.9 |
| kimi-k2-0905 | 28 | 8.0% ± 3.0% | 10.7% | 3.6% | $19.13 | 467.0 |
| gpt-4-1 | 28 | 6.2% ± 1.6% | 14.3% | 0.0% | $7.9 | 48.8 |
| gemini-2-5-flash | 28 | 6.2% ± 4.6% | 21.4% | 0.0% | $2.15 | 55.8 |
| gpt-5-mini-low | 28 | 5.4% ± 5.4% | 14.3% | 0.0% | $1.07 | 62.6 |
| gemini-2-5-pro | 28 | 4.5% ± 3.0% | 7.1% | 0.0% | $17.9 | 102.4 |
| gpt-5-nano-medium | 28 | 3.6% ± 0.0% | 3.6% | 3.6% | $0.65 | 170.5 |
| gpt-oss-120b | 28 | 3.6% ± 2.5% | 14.3% | 0.0% | $0.15 | 34.0 |
| grok-4-fast | 28 | 3.6% ± 0.0% | 3.6% | 3.6% | $3.17 | 152.7 |
| grok-code-fast-1 | 28 | 2.7% ± 1.6% | 3.6% | 0.0% | $3.45 | 334.1 |
| grok-4 | 28 | 2.7% ± 1.6% | 3.6% | 0.0% | $62.48 | 554.0 |
| gpt-4-1-mini | 28 | 1.8% ± 1.8% | 3.6% | 0.0% | $3 | 59.1 |
| gpt-5-nano-high | 28 | 0.9% ± 1.6% | 3.6% | 0.0% | $1.43 | 401.0 |
| gpt-5-nano-low | 28 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.19 | 68.0 |
| gpt-4-1-nano | 28 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.28 | 32.2 |

| Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
|---|---|---|---|---|---|---|
| gpt-5-2-high | 25 | 46.0% ± 3.5% | 60.0% | 32.0% | $88 | 534.8 |
| gpt-5-low | 25 | 45.0% ± 1.7% | 56.0% | 32.0% | $58.7 | 526.9 |
| claude-opus-4-5-high | 25 | 45.0% ± 1.7% | 52.0% | 40.0% | $153.68 | 169.6 |
| gpt-5-medium | 25 | 43.0% ± 5.2% | 56.0% | 36.0% | $61.92 | 608.2 |
| gpt-5-high | 25 | 42.0% ± 4.5% | 56.0% | 24.0% | $61.32 | 1115.9 |
| gemini-3-pro-high | 25 | 40.0% ± 5.7% | 48.0% | 28.0% | $162.52 | 325.5 |
| gemini-3-pro-low | 25 | 40.0% ± 4.0% | 48.0% | 28.0% | $153.19 | 286.8 |
| grok-4 | 25 | 35.0% ± 7.7% | 48.0% | 20.0% | $97.36 | 277.2 |
| qwen-3-coder-plus | 25 | 30.0% ± 4.5% | 48.0% | 8.0% | $14.31 | 680.0 |
| kimi-k2-0905 | 25 | 30.0% ± 6.0% | 40.0% | 20.0% | $20.51 | 380.6 |
| claude-sonnet-4-5 | 25 | 27.0% ± 5.9% | 36.0% | 16.0% | $94.37 | 175.4 |
| grok-4-fast | 25 | 27.0% ± 3.3% | 40.0% | 16.0% | $7.68 | 105.8 |
| claude-sonnet-4 | 25 | 26.0% ± 6.0% | 36.0% | 8.0% | $94.47 | 278.7 |
| claude-sonnet-4-high | 25 | 26.0% ± 2.0% | 28.0% | 24.0% | $154.28 | 261.9 |
| grok-code-fast-1 | 25 | 25.0% ± 1.7% | 36.0% | 8.0% | $6.06 | 119.5 |
| claude-opus-4-1 | 25 | 24.0% ± 0.0% | / | / | $435.18 | 395.2 |
| claude-sonnet-4-low | 25 | 22.0% ± 3.5% | 28.0% | 20.0% | $157.11 | 239.1 |
| deepseek-v3-2-chat | 25 | 19.0% ± 3.3% | 28.0% | 12.0% | $7.19 | 314.9 |
| deepseek-v3-2-thinking | 25 | 17.0% ± 1.7% | 24.0% | 12.0% | $10.21 | 349.5 |
| o3 | 25 | 15.0% ± 5.2% | 32.0% | 8.0% | $28.71 | 153.9 |
| gemini-2-5-pro | 25 | 15.0% ± 1.7% | 32.0% | 4.0% | $108.12 | 177.7 |
| gpt-5-mini-high | 25 | 15.0% ± 5.2% | 32.0% | 4.0% | $15.42 | 365.6 |
| glm-4-5 | 25 | 13.0% ± 3.3% | 20.0% | 4.0% | $4.9 | 165.6 |
| deepseek-v3-1-terminus | 25 | 13.0% ± 1.7% | 20.0% | 8.0% | $3.57 | 329.9 |
| kimi-k2-0711 | 25 | 13.0% ± 3.3% | 16.0% | 8.0% | $11.17 | 221.4 |
| gpt-5-mini-medium | 25 | 12.0% ± 6.3% | 24.0% | 4.0% | $11.77 | 216.0 |
| o4-mini | 25 | 12.0% ± 2.8% | 28.0% | 0.0% | $25.71 | 530.6 |
| deepseek-v3-1-terminus-thinking | 25 | 9.0% ± 1.7% | 20.0% | 0.0% | $3.05 | 775.8 |
| gpt-4-1 | 25 | 8.0% ± 2.8% | 12.0% | 4.0% | $43.16 | 92.2 |
| qwen-3-max | 25 | 8.0% ± 0.0% | 12.0% | 4.0% | $69.1 | 417.7 |
| deepseek-chat | 25 | 7.0% ± 3.3% | 16.0% | 0.0% | $11.78 | 288.3 |
| gemini-2-5-flash | 25 | 6.0% ± 2.0% | 12.0% | 0.0% | $29.31 | 205.4 |
| gpt-oss-120b | 25 | 3.0% ± 1.7% | 4.0% | 0.0% | $0.26 | 37.3 |
| gpt-5-nano-high | 25 | 2.0% ± 2.0% | 4.0% | 0.0% | $2.32 | 325.0 |
| gpt-5-mini-low | 25 | 1.0% ± 1.7% | 4.0% | 0.0% | $2.72 | 67.1 |
| gpt-5-nano-low | 25 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.67 | 139.3 |
| gpt-4-1-nano | 25 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.98 | 53.8 |
| gpt-5-nano-medium | 25 | 0.0% ± 0.0% | 0.0% | 0.0% | $1.07 | 171.1 |
| gpt-4-1-mini | 25 | 0.0% ± 0.0% | 0.0% | 0.0% | $49.72 | 195.7 |

| Model | Total Tasks | Pass@1 (avg ± std) | Pass@4 | Pass^4 | Per-Run Cost (USD) | Avg Agent Time (s) |
|---|---|---|---|---|---|---|
| gemini-3-pro-high | 21 | 79.8% ± 5.2% | 85.7% | 66.7% | $11.32 | 188.8 |
| gpt-5-medium | 21 | 76.2% ± 7.5% | 100.0% | 47.6% | $6.55 | 338.2 |
| gpt-5-low | 21 | 73.8% ± 4.1% | 95.2% | 38.1% | $6.11 | 272.3 |
| gpt-5-high | 21 | 72.6% ± 4.0% | 85.7% | 52.4% | $11.37 | 977.9 |
| gpt-5-2-high | 21 | 72.6% ± 2.1% | 76.2% | 61.9% | $17.73 | 617.9 |
| gemini-3-pro-low | 21 | 70.2% ± 4.0% | 90.5% | 47.6% | $7.97 | 138.6 |
| gpt-5-mini-high | 21 | 66.7% ± 3.4% | 81.0% | 42.9% | $1.83 | 201.2 |
| deepseek-v3-2-thinking | 21 | 66.7% ± 5.8% | 90.5% | 38.1% | $2.88 | 405.0 |
| gpt-5-mini-medium | 21 | 61.9% ± 5.8% | 90.5% | 28.6% | $1 | 96.5 |
| deepseek-v3-2-chat | 21 | 59.5% ± 5.3% | 81.0% | 38.1% | $2.84 | 311.9 |
| grok-4 | 21 | 58.3% ± 7.8% | 81.0% | 38.1% | $14.32 | 204.3 |
| claude-sonnet-4 | 21 | 53.6% ± 6.2% | 71.4% | 38.1% | $23.24 | 239.5 |
| claude-opus-4-5-high | 21 | 53.6% ± 5.2% | 71.4% | 42.9% | $53.82 | 177.6 |
| grok-4-fast | 21 | 52.4% ± 8.9% | 81.0% | 28.6% | $1 | 71.8 |
| claude-sonnet-4-5 | 21 | 50.0% ± 4.1% | 66.7% | 38.1% | $33.77 | 241.8 |
| claude-sonnet-4-high | 21 | 50.0% ± 7.1% | 66.7% | 38.1% | $29.68 | 165.1 |
| claude-sonnet-4-low | 21 | 48.8% ± 7.0% | 71.4% | 33.3% | $44.57 | 186.2 |
| qwen-3-coder-plus | 21 | 47.6% ± 5.8% | 61.9% | 38.1% | $2.5 | 140.9 |
| grok-code-fast-1 | 21 | 47.6% ± 4.8% | 61.9% | 28.6% | $1.12 | 51.3 |
| kimi-k2-0905 | 21 | 47.6% ± 4.8% | 66.7% | 28.6% | $5.84 | 517.3 |
| qwen-3-max | 21 | 44.0% ± 2.1% | 52.4% | 38.1% | $5.46 | 159.6 |
| deepseek-chat | 21 | 42.9% ± 7.5% | 61.9% | 28.6% | $3.89 | 355.6 |
| deepseek-v3-1-terminus-thinking | 21 | 41.7% ± 7.8% | 61.9% | 19.1% | $1.31 | 418.5 |
| kimi-k2-0711 | 21 | 40.5% ± 7.9% | 71.4% | 28.6% | $3.55 | 248.4 |
| o3 | 21 | 36.9% ± 4.0% | 66.7% | 14.3% | $3.46 | 75.6 |
| claude-opus-4-1 | 21 | 33.3% ± 0.0% | / | / | $97.54 | 515.4 |
| deepseek-v3-1-terminus | 21 | 33.3% ± 19.9% | 57.1% | 0.0% | $1.34 | 240.7 |
| gemini-2-5-pro | 21 | 26.2% ± 7.9% | 47.6% | 9.5% | $4.89 | 93.5 |
| gpt-5-nano-medium | 21 | 15.5% ± 5.2% | 28.6% | 4.8% | $0.3 | 129.1 |
| glm-4-5 | 21 | 14.3% ± 7.5% | 23.8% | 0.0% | $1.56 | 158.1 |
| gpt-5-mini-low | 21 | 14.3% ± 3.4% | 28.6% | 0.0% | $0.34 | 52.7 |
| o4-mini | 21 | 11.9% ± 4.1% | 19.1% | 4.8% | $0.9 | 84.7 |
| gemini-2-5-flash | 21 | 10.7% ± 6.2% | 23.8% | 4.8% | $0.81 | 60.9 |
| gpt-5-nano-high | 21 | 9.5% ± 3.4% | 33.3% | 0.0% | $0.72 | 307.7 |
| gpt-4-1-mini | 21 | 9.5% ± 3.4% | 14.3% | 4.8% | $0.45 | 42.1 |
| gpt-5-nano-low | 21 | 8.3% ± 4.0% | 19.1% | 0.0% | $0.16 | 78.5 |
| gpt-oss-120b | 21 | 7.1% ± 2.4% | 23.8% | 0.0% | $0.04 | 23.3 |
| gpt-4-1 | 21 | 4.8% ± 0.0% | 4.8% | 4.8% | $2.52 | 28.9 |
| gpt-4-1-nano | 21 | 0.0% ± 0.0% | 0.0% | 0.0% | $0.17 | 32.5 |
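
This report does not define its Pass@1, Pass@4, and Pass^4 columns. One common reading, assumed here rather than confirmed by the report, is that every task is attempted in four independent runs: Pass@1 is the per-run success rate averaged over the runs (with its spread across runs), Pass@4 is the share of tasks solved in at least one run, and Pass^4 is the share solved in all four. The sketch below illustrates that reading on made-up data. The `summarize` function, the `outcomes` matrix, and the use of the population standard deviation are all hypothetical choices, not the benchmark's actual code.

```python
from statistics import mean, pstdev


def summarize(outcomes):
    """Summarize a tasks-by-runs matrix of booleans.

    outcomes[t][r] is True if task t succeeded in run r.
    Assumed definitions (not taken from the report):
      pass@1 -> mean and std of the per-run success rate
      pass@k -> fraction of tasks solved in at least one run
      pass^k -> fraction of tasks solved in every run
    """
    n_tasks = len(outcomes)
    n_runs = len(outcomes[0])
    # Success rate of each individual run across all tasks.
    per_run = [sum(task[r] for task in outcomes) / n_tasks for r in range(n_runs)]
    return {
        "pass@1_avg": mean(per_run),
        # Population std is an assumption; the report does not say which it uses.
        "pass@1_std": pstdev(per_run),
        "pass@k": sum(any(task) for task in outcomes) / n_tasks,
        "pass^k": sum(all(task) for task in outcomes) / n_tasks,
    }


# Illustrative data only: 4 tasks, 4 runs each.
outcomes = [
    [True, True, True, True],      # solved in every run
    [True, False, True, False],    # solved in some runs
    [False, False, False, False],  # never solved
    [False, True, False, False],   # solved once
]
result = summarize(outcomes)
print(result["pass@1_avg"], result["pass@k"], result["pass^k"])  # 0.4375 0.75 0.25
```

Under this reading Pass^4 ≤ Pass@1 ≤ Pass@4 always holds, which matches every row above that reports all three values. A row showing `/` for Pass@4 and Pass^4 together with a standard deviation of 0.0% would be consistent with a model that was evaluated in a single run.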