# English Performance

## English Scores by Model

Scores are the average of 30 bootstrap resamples; 95% confidence intervals are shown. Filters: model size ≤200B, open instruct models only.
*(Bar chart of overall EN scores per model, ranging from 66.27 ± 0.15 at the top down to 14.80 ± 0.11; only the parameter counts survived extraction. The full ranking with model names appears in the tables below.)*
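The ± intervals reported throughout can be reproduced with a standard nonparametric bootstrap. The page states only "average of 30 bootstraps", so the exact resampling scheme below (resample examples with replacement, percentile interval) is an assumption; this is a minimal sketch, not the leaderboard's actual pipeline.

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=30, alpha=0.05, seed=0):
    """Bootstrap the mean of per-example scores.

    Returns (point estimate, lower, upper) where the point estimate is the
    average of the bootstrap means and the bounds are percentile limits.
    NOTE: the resampling scheme is an assumption -- the source only says
    "average of 30 bootstraps".
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Resample the per-example scores with replacement.
        resample = [rng.choice(scores) for _ in scores]
        means.append(statistics.mean(resample))
    means.sort()
    point = statistics.mean(means)
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi
```

With only 30 resamples the percentile bounds are coarse, which is consistent with the intervals here being reported as a symmetric ± half-width.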
## English Competencies
Model | EN | English Tasks |
|---|---|---|
Qwen 3 32B Alibaba | 66.27 ± 0.15 | 66.27 ± 0.15 |
Qwen 3 Next 80B MoE Alibaba | 65.49 ± 0.10 | 65.49 ± 0.10 |
SEA-LION v4 (Qwen) 32B AISG | 65.30 ± 0.17 | 65.30 ± 0.17 |
Qwen 3 VL 32B Alibaba | 65.04 ± 0.13 | 65.04 ± 0.13 |
Llama 3.3 70B Meta | 64.74 ± 0.16 | 64.74 ± 0.16 |
SEA-LION v3 (Llama) 70B AISG | 63.47 ± 0.17 | 63.47 ± 0.17 |
Qwen 3 14B Alibaba | 63.44 ± 0.16 | 63.44 ± 0.16 |
Mistral Large 2411 123B Mistral AI | 62.72 ± 0.15 | 62.72 ± 0.15 |
Llama 4 Scout 109B MoE Meta | 62.22 ± 0.15 | 62.22 ± 0.15 |
Command A 03-2025 111B CohereLabs | 62.16 ± 0.19 | 62.16 ± 0.19 |
SEA-LION v4 (Gemma) 27B AISG | 61.82 ± 0.21 | 61.82 ± 0.21 |
Qwen 2.5 72B Alibaba | 61.71 ± 0.18 | 61.71 ± 0.18 |
Gemma 3 27B | 61.70 ± 0.17 | 61.70 ± 0.17 |
Qwen 3 30B MoE Alibaba | 60.93 ± 0.15 | 60.93 ± 0.15 |
SEA-LION v4 (Qwen VL) 8B AISG | 60.36 ± 0.16 | 60.36 ± 0.16 |
ERNIE 4.5 21B MoE Baidu | 59.65 ± 0.16 | 59.65 ± 0.16 |
Qwen 3 8B Alibaba | 58.97 ± 0.18 | 58.97 ± 0.18 |
Llama 3.1 70B Meta | 57.93 ± 0.18 | 57.93 ± 0.18 |
Qwen 2.5 32B Alibaba | 57.83 ± 0.16 | 57.83 ± 0.16 |
Tulu 3 70B AI2 | 57.17 ± 0.19 | 57.17 ± 0.19 |
Gemma 3 12B | 56.33 ± 0.21 | 56.33 ± 0.21 |
phi-4 14B Microsoft | 56.33 ± 0.20 | 56.33 ± 0.20 |
Qwen 3 VL 8B Alibaba | 56.20 ± 0.12 | 56.20 ± 0.12 |
Qwen 2.5 14B Alibaba | 54.49 ± 0.16 | 54.49 ± 0.16 |
SEA-LION v4 (Qwen VL) 4B AISG | 50.21 ± 0.14 | 50.21 ± 0.14 |
Llama 3 70B Meta | 49.87 ± 0.14 | 49.87 ± 0.14 |
Qwen 3 VL 4B Alibaba | 49.81 ± 0.12 | 49.81 ± 0.12 |
Mistral Small 3.1 2503 24B Mistral AI | 48.45 ± 0.17 | 48.45 ± 0.17 |
SEA-LION v3 (Llama) 8B AISG | 44.92 ± 0.28 | 44.92 ± 0.28 |
SEA-LION v3 (Gemma 2) 9B AISG | 43.80 ± 0.18 | 43.80 ± 0.18 |
Olmo 3 7B AI2 | 43.41 ± 0.19 | 43.41 ± 0.19 |
Qwen 2.5 7B Alibaba | 42.11 ± 0.18 | 42.11 ± 0.18 |
Gemma 2 27B | 41.24 ± 0.24 | 41.24 ± 0.24 |
Llama 3.1 8B Meta | 38.40 ± 0.19 | 38.40 ± 0.19 |
Tulu 3 8B AI2 | 36.94 ± 0.17 | 36.94 ± 0.17 |
Aya Expanse 32B CohereLabs | 36.23 ± 0.26 | 36.23 ± 0.26 |
Olmo 2 0325 32B AI2 | 33.35 ± 0.09 | 33.35 ± 0.09 |
Sailor2 20B SAIL | 32.37 ± 0.14 | 32.37 ± 0.14 |
Gemma 2 9B | 31.44 ± 0.19 | 31.44 ± 0.19 |
Command R7B 12-2024 7B CohereLabs | 30.63 ± 0.14 | 30.63 ± 0.14 |
Command R+ 08-2024 104B CohereLabs | 30.39 ± 0.15 | 30.39 ± 0.15 |
MERaLiON 2 10B A*STAR | 30.25 ± 0.19 | 30.25 ± 0.19 |
Olmo 2 1124 13B AI2 | 30.19 ± 0.18 | 30.19 ± 0.18 |
Babel 83B Alibaba-DAMO | 29.06 ± 0.19 | 29.06 ± 0.19 |
Llama 3 8B Meta | 28.85 ± 0.17 | 28.85 ± 0.17 |
Command R 08-2024 32B CohereLabs | 28.37 ± 0.24 | 28.37 ± 0.24 |
Apertus 70B Swiss AI | 26.18 ± 0.16 | 26.18 ± 0.16 |
Ministral 2410 8B Mistral AI | 25.17 ± 0.26 | 25.17 ± 0.26 |
Aya Expanse 8B CohereLabs | 23.32 ± 0.20 | 23.32 ± 0.20 |
Olmo 2 1124 7B AI2 | 22.37 ± 0.13 | 22.37 ± 0.13 |
Apertus 8B Swiss AI | 21.38 ± 0.22 | 21.38 ± 0.22 |
SeaLLMs V3 7B Alibaba-DAMO | 20.87 ± 0.11 | 20.87 ± 0.11 |
Babel 9B Alibaba-DAMO | 19.95 ± 0.14 | 19.95 ± 0.14 |
Sailor2 8B SAIL | 14.80 ± 0.11 | 14.80 ± 0.11 |
## English Tasks
Model | EN | English Tasks | BBH | GPQA | IFEval | MATH Hard | MMLU Pro | MuSR |
|---|---|---|---|---|---|---|---|---|
Qwen 3 32B Alibaba | 66.27 ± 0.15 | 66.27 ± 0.15 | 75.19 ± 0.09 | 36.02 ± 0.73 | 83.76 ± 0.27 | 68.90 ± 0.21 | 65.43 ± 0.09 | 68.35 ± 0.58 |
Qwen 3 Next 80B MoE Alibaba | 65.49 ± 0.10 | 65.49 ± 0.10 | 76.96 ± 0.09 | 28.35 ± 0.41 | 86.08 ± 0.22 | 69.55 ± 0.20 | 68.16 ± 0.06 | 63.84 ± 0.57 |
SEA-LION v4 (Qwen) 32B AISG | 65.30 ± 0.17 | 65.30 ± 0.17 | 75.18 ± 0.06 | 31.18 ± 0.82 | 83.89 ± 0.28 | 69.67 ± 0.31 | 63.96 ± 0.08 | 67.91 ± 0.55 |
Qwen 3 VL 32B Alibaba | 65.04 ± 0.13 | 65.04 ± 0.13 | 74.11 ± 0.07 | 26.05 ± 0.49 | 81.50 ± 0.32 | 70.47 ± 0.22 | 67.11 ± 0.07 | 71.01 ± 0.59 |
Llama 3.3 70B Meta | 64.74 ± 0.16 | 64.74 ± 0.16 | 75.09 ± 0.08 | 43.45 ± 0.60 | 88.34 ± 0.18 | 52.63 ± 0.25 | 65.33 ± 0.08 | 63.59 ± 0.48 |
SEA-LION v3 (Llama) 70B AISG | 63.47 ± 0.17 | 63.47 ± 0.17 | 75.00 ± 0.15 | 33.42 ± 0.59 | 86.10 ± 0.32 | 55.64 ± 0.36 | 66.37 ± 0.10 | 64.26 ± 0.51 |
Qwen 3 14B Alibaba | 63.44 ± 0.16 | 63.44 ± 0.16 | 71.95 ± 0.08 | 29.58 ± 0.70 | 85.59 ± 0.17 | 66.95 ± 0.32 | 63.63 ± 0.08 | 62.92 ± 0.67 |
Mistral Large 2411 123B Mistral AI | 62.72 ± 0.15 | 62.72 ± 0.15 | 70.80 ± 0.13 | 33.51 ± 0.65 | 79.00 ± 0.27 | 52.52 ± 0.29 | 66.58 ± 0.09 | 73.90 ± 0.54 |
Llama 4 Scout 109B MoE Meta | 62.22 ± 0.15 | 62.22 ± 0.15 | 70.11 ± 0.12 | 36.18 ± 0.57 | 84.44 ± 0.22 | 63.66 ± 0.24 | 58.37 ± 0.13 | 60.57 ± 0.65 |
Command A 03-2025 111B CohereLabs | 62.16 ± 0.19 | 62.16 ± 0.19 | 75.06 ± 0.12 | 14.05 ± 0.81 | 86.81 ± 0.29 | 57.86 ± 0.31 | 65.14 ± 0.10 | 74.02 ± 0.49 |
SEA-LION v4 (Gemma) 27B AISG | 61.82 ± 0.21 | 61.82 ± 0.21 | 73.95 ± 0.11 | 22.99 ± 0.79 | 80.54 ± 0.23 | 73.20 ± 0.29 | 61.37 ± 0.08 | 58.87 ± 0.66 |
Qwen 2.5 72B Alibaba | 61.71 ± 0.18 | 61.71 ± 0.18 | 71.60 ± 0.11 | 27.11 ± 0.85 | 83.44 ± 0.31 | 61.67 ± 0.28 | 66.21 ± 0.07 | 60.26 ± 0.59 |
Gemma 3 27B | 61.70 ± 0.17 | 61.70 ± 0.17 | 73.44 ± 0.14 | 23.62 ± 0.61 | 81.38 ± 0.30 | 73.34 ± 0.28 | 61.40 ± 0.09 | 57.02 ± 0.59 |
Qwen 3 30B MoE Alibaba | 60.93 ± 0.15 | 60.93 ± 0.15 | 74.60 ± 0.09 | 16.20 ± 0.49 | 82.96 ± 0.23 | 66.42 ± 0.23 | 64.53 ± 0.08 | 60.84 ± 0.64 |
SEA-LION v4 (Qwen VL) 8B AISG | 60.36 ± 0.16 | 60.36 ± 0.16 | 71.41 ± 0.08 | 15.32 ± 0.65 | 83.41 ± 0.24 | 69.41 ± 0.21 | 59.05 ± 0.06 | 63.56 ± 0.59 |
ERNIE 4.5 21B MoE Baidu | 59.65 ± 0.16 | 59.65 ± 0.16 | 66.62 ± 0.14 | 41.12 ± 0.63 | 79.64 ± 0.32 | 63.22 ± 0.30 | 56.00 ± 0.12 | 51.30 ± 0.72 |
Qwen 3 8B Alibaba | 58.97 ± 0.18 | 58.97 ± 0.18 | 67.52 ± 0.12 | 22.93 ± 0.71 | 82.30 ± 0.35 | 64.55 ± 0.27 | 58.68 ± 0.09 | 57.82 ± 0.64 |
Llama 3.1 70B Meta | 57.93 ± 0.18 | 57.93 ± 0.18 | 72.73 ± 0.12 | 27.13 ± 0.76 | 83.79 ± 0.33 | 39.96 ± 0.31 | 63.27 ± 0.08 | 60.68 ± 0.72 |
Qwen 2.5 32B Alibaba | 57.83 ± 0.16 | 57.83 ± 0.16 | 64.92 ± 0.10 | 23.52 ± 0.68 | 79.11 ± 0.28 | 57.35 ± 0.25 | 65.42 ± 0.08 | 56.64 ± 0.84 |
Tulu 3 70B AI2 | 57.17 ± 0.19 | 57.17 ± 0.19 | 73.04 ± 0.15 | 26.29 ± 0.80 | 79.78 ± 0.31 | 45.24 ± 0.20 | 59.82 ± 0.09 | 58.86 ± 0.58 |
Gemma 3 12B | 56.33 ± 0.21 | 56.33 ± 0.21 | 68.57 ± 0.11 | 15.70 ± 0.62 | 78.43 ± 0.35 | 64.48 ± 0.31 | 53.77 ± 0.09 | 57.04 ± 0.78 |
phi-4 14B Microsoft | 56.33 ± 0.20 | 56.33 ± 0.20 | 70.39 ± 0.12 | 30.95 ± 0.62 | 59.31 ± 0.42 | 61.40 ± 0.31 | 57.41 ± 0.09 | 58.51 ± 0.84 |
Qwen 3 VL 8B Alibaba | 56.20 ± 0.12 | 56.20 ± 0.12 | 70.61 ± 0.12 | 6.40 ± 0.46 | 83.65 ± 0.26 | 65.67 ± 0.22 | 57.82 ± 0.07 | 53.07 ± 0.59 |
Qwen 2.5 14B Alibaba | 54.49 ± 0.16 | 54.49 ± 0.16 | 66.71 ± 0.15 | 18.97 ± 0.77 | 78.31 ± 0.22 | 55.09 ± 0.30 | 60.09 ± 0.10 | 47.78 ± 0.71 |
SEA-LION v4 (Qwen VL) 4B AISG | 50.21 ± 0.14 | 50.21 ± 0.14 | 68.14 ± 0.11 | 0.20 ± 0.17 | 81.20 ± 0.34 | 63.05 ± 0.25 | 51.15 ± 0.08 | 37.53 ± 0.79 |
Llama 3 70B Meta | 49.87 ± 0.14 | 49.87 ± 0.14 | 72.04 ± 0.09 | 16.97 ± 0.55 | 77.12 ± 0.32 | 24.32 ± 0.21 | 55.35 ± 0.08 | 53.42 ± 0.56 |
Qwen 3 VL 4B Alibaba | 49.81 ± 0.12 | 49.81 ± 0.12 | 68.68 ± 0.12 | 0.00 ± 0.00 | 80.88 ± 0.27 | 62.49 ± 0.23 | 51.31 ± 0.09 | 35.50 ± 0.68 |
Mistral Small 3.1 2503 24B Mistral AI | 48.45 ± 0.17 | 48.45 ± 0.17 | 47.58 ± 0.20 | 25.79 ± 0.79 | 70.04 ± 0.45 | 43.83 ± 0.43 | 46.20 ± 0.17 | 57.27 ± 0.75 |
SEA-LION v3 (Llama) 8B AISG | 44.92 ± 0.28 | 44.92 ± 0.28 | 62.30 ± 0.15 | 13.83 ± 1.01 | 78.62 ± 0.43 | 27.42 ± 0.30 | 49.17 ± 0.13 | 38.16 ± 0.94 |
SEA-LION v3 (Gemma 2) 9B AISG | 43.80 ± 0.18 | 43.80 ± 0.18 | 58.02 ± 0.22 | 16.11 ± 0.81 | 75.85 ± 0.36 | 28.86 ± 0.33 | 48.91 ± 0.11 | 35.03 ± 0.60 |
Olmo 3 7B AI2 | 43.41 ± 0.19 | 43.41 ± 0.19 | 54.45 ± 0.21 | 2.19 ± 0.68 | 76.04 ± 0.31 | 55.24 ± 0.30 | 37.63 ± 0.12 | 34.91 ± 0.74 |
Qwen 2.5 7B Alibaba | 42.11 ± 0.18 | 42.11 ± 0.18 | 54.25 ± 0.13 | 9.73 ± 0.59 | 70.97 ± 0.29 | 48.70 ± 0.32 | 49.82 ± 0.12 | 19.19 ± 0.75 |
Gemma 2 27B | 41.24 ± 0.24 | 41.24 ± 0.24 | 54.71 ± 0.16 | 12.86 ± 0.99 | 74.82 ± 0.36 | 25.27 ± 0.26 | 45.90 ± 0.12 | 33.86 ± 0.76 |
Llama 3.1 8B Meta | 38.40 ± 0.19 | 38.40 ± 0.19 | 53.60 ± 0.21 | 4.06 ± 0.80 | 74.46 ± 0.33 | 22.33 ± 0.28 | 41.68 ± 0.11 | 34.30 ± 0.64 |
Tulu 3 8B AI2 | 36.94 ± 0.17 | 36.94 ± 0.17 | 47.27 ± 0.17 | 8.14 ± 0.64 | 79.09 ± 0.32 | 20.07 ± 0.33 | 39.23 ± 0.10 | 27.83 ± 0.79 |
Aya Expanse 32B CohereLabs | 36.23 ± 0.26 | 36.23 ± 0.26 | 54.74 ± 0.22 | 7.00 ± 0.96 | 68.19 ± 0.36 | 14.75 ± 0.21 | 40.96 ± 0.13 | 31.74 ± 0.97 |
Olmo 2 0325 32B AI2 | 33.35 ± 0.09 | 33.35 ± 0.09 | 49.19 ± 0.17 | 0.00 ± 0.00 | 80.89 ± 0.37 | 20.24 ± 0.30 | 45.04 ± 0.11 | 4.77 ± 0.35 |
Sailor2 20B SAIL | 32.37 ± 0.14 | 32.37 ± 0.14 | 37.54 ± 0.13 | 14.53 ± 0.55 | 33.97 ± 0.25 | 40.32 ± 0.30 | 46.62 ± 0.08 | 21.23 ± 0.74 |
Gemma 2 9B | 31.44 ± 0.19 | 31.44 ± 0.19 | 46.10 ± 0.16 | 3.84 ± 0.64 | 69.04 ± 0.37 | 19.09 ± 0.25 | 27.72 ± 0.11 | 22.87 ± 0.86 |
Command R7B 12-2024 7B CohereLabs | 30.63 ± 0.14 | 30.63 ± 0.14 | 49.89 ± 0.26 | 0.50 ± 0.28 | 67.86 ± 0.42 | 20.65 ± 0.27 | 23.59 ± 0.12 | 21.27 ± 0.80 |
Command R+ 08-2024 104B CohereLabs | 30.39 ± 0.15 | 30.39 ± 0.15 | 53.10 ± 0.24 | 0.00 ± 0.00 | 69.65 ± 0.40 | 9.74 ± 0.20 | 36.26 ± 0.12 | 13.57 ± 0.76 |
MERaLiON 2 10B A*STAR | 30.25 ± 0.19 | 30.25 ± 0.19 | 47.46 ± 0.22 | 0.61 ± 0.31 | 69.37 ± 0.43 | 17.09 ± 0.26 | 27.15 ± 0.12 | 19.84 ± 0.80 |
Olmo 2 1124 13B AI2 | 30.19 ± 0.18 | 30.19 ± 0.18 | 39.20 ± 0.22 | 0.02 ± 0.04 | 73.27 ± 0.43 | 15.79 ± 0.22 | 33.18 ± 0.13 | 19.70 ± 0.80 |
Babel 83B Alibaba-DAMO | 29.06 ± 0.19 | 29.06 ± 0.19 | 57.11 ± 0.19 | 10.85 ± 0.86 | 30.06 ± 0.49 | 13.11 ± 0.34 | 46.31 ± 0.16 | 16.90 ± 0.63 |
Llama 3 8B Meta | 28.85 ± 0.17 | 28.85 ± 0.17 | 47.84 ± 0.20 | 0.00 ± 0.00 | 67.62 ± 0.45 | 7.92 ± 0.25 | 28.80 ± 0.08 | 20.94 ± 0.70 |
Command R 08-2024 32B CohereLabs | 28.37 ± 0.24 | 28.37 ± 0.24 | 48.67 ± 0.19 | 3.92 ± 0.88 | 61.70 ± 0.38 | 5.84 ± 0.15 | 33.17 ± 0.17 | 16.92 ± 0.78 |
Apertus 70B Swiss AI | 26.18 ± 0.16 | 26.18 ± 0.16 | 34.52 ± 0.27 | 0.03 ± 0.06 | 57.19 ± 0.40 | 8.75 ± 0.20 | 29.37 ± 0.12 | 27.20 ± 0.82 |
Ministral 2410 8B Mistral AI | 25.17 ± 0.26 | 25.17 ± 0.26 | 36.25 ± 0.25 | 2.02 ± 0.72 | 47.01 ± 0.58 | 19.30 ± 0.28 | 26.01 ± 0.14 | 20.41 ± 0.81 |
Aya Expanse 8B CohereLabs | 23.32 ± 0.20 | 23.32 ± 0.20 | 33.24 ± 0.22 | 2.61 ± 0.72 | 57.54 ± 0.37 | 7.20 ± 0.19 | 24.76 ± 0.12 | 14.60 ± 0.69 |
Olmo 2 1124 7B AI2 | 22.37 ± 0.13 | 22.37 ± 0.13 | 25.75 ± 0.22 | 0.03 ± 0.06 | 66.87 ± 0.34 | 11.79 ± 0.24 | 24.93 ± 0.13 | 4.87 ± 0.68 |
Apertus 8B Swiss AI | 21.38 ± 0.22 | 21.38 ± 0.22 | 25.19 ± 0.23 | 3.25 ± 0.82 | 64.38 ± 0.61 | 4.75 ± 0.16 | 24.13 ± 0.11 | 6.57 ± 0.58 |
SeaLLMs V3 7B Alibaba-DAMO | 20.87 ± 0.11 | 20.87 ± 0.11 | 38.31 ± 0.21 | 0.05 ± 0.08 | 38.31 ± 0.52 | 15.82 ± 0.30 | 32.71 ± 0.13 | 0.01 ± 0.01 |
Babel 9B Alibaba-DAMO | 19.95 ± 0.14 | 19.95 ± 0.14 | 43.00 ± 0.22 | 0.38 ± 0.24 | 28.52 ± 0.34 | 12.00 ± 0.27 | 32.06 ± 0.16 | 3.78 ± 0.52 |
Sailor2 8B SAIL | 14.80 ± 0.11 | 14.80 ± 0.11 | 31.69 ± 0.20 | 0.25 ± 0.22 | 30.62 ± 0.32 | 12.08 ± 0.23 | 12.36 ± 0.12 | 1.77 ± 0.39 |
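The EN column appears to be the unweighted mean of the six task scores (BBH, GPQA, IFEval, MATH Hard, MMLU Pro, MuSR); this is an inference from the numbers, not something the page states. A quick spot-check against two rows of the table above:

```python
# Task scores copied from the "English Tasks" table, in column order:
# BBH, GPQA, IFEval, MATH Hard, MMLU Pro, MuSR.
rows = {
    "Qwen 3 32B":          [75.19, 36.02, 83.76, 68.90, 65.43, 68.35],
    "Qwen 3 Next 80B MoE": [76.96, 28.35, 86.08, 69.55, 68.16, 63.84],
}
reported_en = {"Qwen 3 32B": 66.27, "Qwen 3 Next 80B MoE": 65.49}

for name, tasks in rows.items():
    mean = sum(tasks) / len(tasks)
    # The unweighted mean matches the reported EN score to rounding.
    assert abs(mean - reported_en[name]) < 0.01, (name, mean)
```

The same check passes for other rows, which suggests the overall EN score weights all six tasks equally.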