English Performance
English Scores by Model
Average of 30 bootstraps; 95% CIs are shown. Model size ≤200B, open instruct models only.
[Bar chart: English scores by model, ranked from 68.02 ± 0.15 down to 14.80 ± 0.11. Model names did not survive extraction; the same values appear, with names, in the English Competencies table below.]
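The bootstrapped intervals behind these scores can be sketched as follows. This is a minimal illustration, not the leaderboard's actual harness code: it assumes the reported value is the mean of 30 bootstrap resample means and that the ± half-width comes from a normal approximation (1.96 × the standard deviation of those means); the per-item scores below are made up.

```python
import random
import statistics

def bootstrap_mean_ci(scores, n_boot=30, seed=0):
    """Mean of n_boot bootstrap resample means, plus an approximate
    95% CI half-width (1.96 * stdev of the resample means)."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Resample with replacement, same size as the original sample.
        resample = [rng.choice(scores) for _ in scores]
        means.append(statistics.mean(resample))
    center = statistics.mean(means)
    half_width = 1.96 * statistics.stdev(means)
    return center, half_width

# Toy example: made-up 0/1 per-item correctness scores.
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
center, hw = bootstrap_mean_ci(scores)
print(f"{100 * center:.2f} ± {100 * hw:.2f}")
```

With only 30 resamples the interval is coarse; the normal approximation is a convenient reading of a symmetric ± display, and a percentile interval would be an equally plausible choice.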
English Competencies
Average of 30 bootstraps; 95% CIs are shown. Model size ≤200B, open instruct models only.
| Model | EN | English Tasks |
|---|---|---|
Qwen 3 32B Alibaba | 68.02 ± 0.15 | 68.02 ± 0.15 |
Qwen 3 Next 80B MoE Alibaba | 67.31 ± 0.10 | 67.31 ± 0.10 |
SEA-LION v4 (Qwen) 32B AISG | 67.10 ± 0.17 | 67.10 ± 0.17 |
Llama 3.3 70B Meta | 66.50 ± 0.16 | 66.50 ± 0.16 |
SEA-LION v3 (Llama) 70B AISG | 65.20 ± 0.18 | 65.20 ± 0.18 |
Qwen 3 VL 32B Alibaba | 65.04 ± 0.13 | 65.04 ± 0.13 |
Qwen 3 14B Alibaba | 65.00 ± 0.16 | 65.00 ± 0.16 |
Mistral Large 2411 123B Mistral AI | 64.31 ± 0.15 | 64.31 ± 0.15 |
Command A 03-2025 111B CohereLabs | 63.92 ± 0.18 | 63.92 ± 0.18 |
Llama 4 Scout 109B MoE Meta | 63.86 ± 0.14 | 63.86 ± 0.14 |
SEA-LION v4 (Gemma) 27B AISG | 63.68 ± 0.21 | 63.68 ± 0.21 |
Gemma 3 27B | 63.55 ± 0.17 | 63.55 ± 0.17 |
Qwen 2.5 72B Alibaba | 63.38 ± 0.18 | 63.38 ± 0.18 |
Qwen 3 30B MoE Alibaba | 62.49 ± 0.15 | 62.49 ± 0.15 |
Qwen 3 8B Alibaba | 60.57 ± 0.18 | 60.57 ± 0.18 |
SEA-LION v4 (Qwen VL) 8B AISG | 60.36 ± 0.16 | 60.36 ± 0.16 |
ERNIE 4.5 21B MoE Baidu | 59.65 ± 0.16 | 59.65 ± 0.16 |
Llama 3.1 70B Meta | 59.59 ± 0.18 | 59.59 ± 0.18 |
Tulu 3 70B AI2 | 58.73 ± 0.19 | 58.73 ± 0.19 |
Qwen 2.5 32B Alibaba | 58.61 ± 0.16 | 58.61 ± 0.16 |
Gemma 3 12B | 58.06 ± 0.21 | 58.06 ± 0.21 |
phi-4 14B Microsoft | 57.87 ± 0.20 | 57.87 ± 0.20 |
Qwen 3 VL 8B Alibaba | 56.20 ± 0.12 | 56.20 ± 0.12 |
Qwen 2.5 14B Alibaba | 55.84 ± 0.16 | 55.84 ± 0.16 |
Llama 3 70B Meta | 51.23 ± 0.14 | 51.23 ± 0.14 |
SEA-LION v4 (Qwen VL) 4B AISG | 50.21 ± 0.14 | 50.21 ± 0.14 |
Qwen 3 VL 4B Alibaba | 49.81 ± 0.12 | 49.81 ± 0.12 |
Mistral Small 3.1 2503 24B Mistral AI | 49.53 ± 0.17 | 49.53 ± 0.17 |
SEA-LION v3 (Llama) 8B AISG | 46.31 ± 0.28 | 46.31 ± 0.28 |
SEA-LION v3 (Gemma 2) 9B AISG | 45.14 ± 0.19 | 45.14 ± 0.19 |
Olmo 3 7B AI2 | 43.41 ± 0.19 | 43.41 ± 0.19 |
Qwen 2.5 7B Alibaba | 43.29 ± 0.18 | 43.29 ± 0.18 |
Gemma 2 27B | 42.75 ± 0.24 | 42.75 ± 0.24 |
Llama 3.1 8B Meta | 39.62 ± 0.20 | 39.62 ± 0.20 |
Tulu 3 8B AI2 | 38.12 ± 0.17 | 38.12 ± 0.17 |
Aya Expanse 32B CohereLabs | 37.21 ± 0.26 | 37.21 ± 0.26 |
Olmo 2 0325 32B AI2 | 34.44 ± 0.09 | 34.44 ± 0.09 |
Gemma 2 9B | 32.65 ± 0.19 | 32.65 ± 0.19 |
Sailor2 20B SAIL | 32.37 ± 0.14 | 32.37 ± 0.14 |
Command R7B 12-2024 7B CohereLabs | 31.63 ± 0.15 | 31.63 ± 0.15 |
Command R+ 08-2024 104B CohereLabs | 31.53 ± 0.15 | 31.53 ± 0.15 |
MERaLiON 2 10B A*STAR | 31.41 ± 0.19 | 31.41 ± 0.19 |
Olmo 2 1124 13B AI2 | 31.12 ± 0.17 | 31.12 ± 0.17 |
Llama 3 8B Meta | 29.88 ± 0.17 | 29.88 ± 0.17 |
Command R 08-2024 32B CohereLabs | 29.20 ± 0.24 | 29.20 ± 0.24 |
Babel 83B Alibaba-DAMO | 29.20 ± 0.20 | 29.20 ± 0.20 |
Apertus 70B Swiss AI | 26.92 ± 0.17 | 26.92 ± 0.17 |
Ministral 2410 8B Mistral AI | 26.03 ± 0.26 | 26.03 ± 0.26 |
Aya Expanse 8B CohereLabs | 24.06 ± 0.21 | 24.06 ± 0.21 |
Olmo 2 1124 7B AI2 | 23.09 ± 0.13 | 23.09 ± 0.13 |
Apertus 8B Swiss AI | 22.00 ± 0.22 | 22.00 ± 0.22 |
SeaLLMs V3 7B Alibaba-DAMO | 21.87 ± 0.11 | 21.87 ± 0.11 |
Babel 9B Alibaba-DAMO | 20.99 ± 0.13 | 20.99 ± 0.13 |
Sailor2 8B SAIL | 14.80 ± 0.11 | 14.80 ± 0.11 |
English Tasks
Average of 30 bootstraps; 95% CIs are shown. Model size ≤200B, open instruct models only.
| Model | EN | English Tasks | BBH | GPQA | IFEval | MATH Hard | MMLU Pro | MuSR |
|---|---|---|---|---|---|---|---|---|
Qwen 3 32B Alibaba | 68.02 ± 0.15 | 68.02 ± 0.15 | 85.65 ± 0.10 | 36.02 ± 0.73 | 83.76 ± 0.27 | 68.90 ± 0.21 | 65.43 ± 0.09 | 68.35 ± 0.58 |
Qwen 3 Next 80B MoE Alibaba | 67.31 ± 0.10 | 67.31 ± 0.10 | 87.85 ± 0.09 | 28.35 ± 0.41 | 86.08 ± 0.22 | 69.55 ± 0.20 | 68.16 ± 0.06 | 63.84 ± 0.57 |
SEA-LION v4 (Qwen) 32B AISG | 67.10 ± 0.17 | 67.10 ± 0.17 | 86.02 ± 0.06 | 31.18 ± 0.82 | 83.89 ± 0.28 | 69.67 ± 0.31 | 63.96 ± 0.08 | 67.91 ± 0.55 |
Llama 3.3 70B Meta | 66.50 ± 0.16 | 66.50 ± 0.16 | 85.63 ± 0.08 | 43.45 ± 0.60 | 88.34 ± 0.18 | 52.63 ± 0.25 | 65.33 ± 0.08 | 63.59 ± 0.48 |
SEA-LION v3 (Llama) 70B AISG | 65.20 ± 0.18 | 65.20 ± 0.18 | 85.42 ± 0.16 | 33.42 ± 0.59 | 86.10 ± 0.32 | 55.64 ± 0.36 | 66.37 ± 0.10 | 64.26 ± 0.51 |
Qwen 3 VL 32B Alibaba | 65.04 ± 0.13 | 65.04 ± 0.13 | 74.11 ± 0.07 | 26.05 ± 0.49 | 81.50 ± 0.32 | 70.47 ± 0.22 | 67.11 ± 0.07 | 71.01 ± 0.59 |
Qwen 3 14B Alibaba | 65.00 ± 0.16 | 65.00 ± 0.16 | 81.32 ± 0.08 | 29.58 ± 0.70 | 85.59 ± 0.17 | 66.95 ± 0.32 | 63.63 ± 0.08 | 62.92 ± 0.67 |
Mistral Large 2411 123B Mistral AI | 64.31 ± 0.15 | 64.31 ± 0.15 | 80.36 ± 0.15 | 33.51 ± 0.65 | 79.00 ± 0.27 | 52.52 ± 0.29 | 66.58 ± 0.09 | 73.90 ± 0.54 |
Command A 03-2025 111B CohereLabs | 63.92 ± 0.18 | 63.92 ± 0.18 | 85.64 ± 0.12 | 14.05 ± 0.81 | 86.81 ± 0.29 | 57.86 ± 0.31 | 65.14 ± 0.10 | 74.02 ± 0.49 |
Llama 4 Scout 109B MoE Meta | 63.86 ± 0.14 | 63.86 ± 0.14 | 79.95 ± 0.11 | 36.18 ± 0.57 | 84.44 ± 0.22 | 63.66 ± 0.24 | 58.37 ± 0.13 | 60.57 ± 0.65 |
SEA-LION v4 (Gemma) 27B AISG | 63.68 ± 0.21 | 63.68 ± 0.21 | 85.14 ± 0.12 | 22.99 ± 0.79 | 80.54 ± 0.23 | 73.20 ± 0.29 | 61.37 ± 0.08 | 58.87 ± 0.66 |
Gemma 3 27B | 63.55 ± 0.17 | 63.55 ± 0.17 | 84.55 ± 0.15 | 23.62 ± 0.61 | 81.38 ± 0.30 | 73.34 ± 0.28 | 61.40 ± 0.09 | 57.02 ± 0.59 |
Qwen 2.5 72B Alibaba | 63.38 ± 0.18 | 63.38 ± 0.18 | 81.56 ± 0.12 | 27.11 ± 0.85 | 83.44 ± 0.31 | 61.67 ± 0.28 | 66.21 ± 0.07 | 60.26 ± 0.59 |
Qwen 3 30B MoE Alibaba | 62.49 ± 0.15 | 62.49 ± 0.15 | 83.96 ± 0.09 | 16.20 ± 0.49 | 82.96 ± 0.23 | 66.42 ± 0.23 | 64.53 ± 0.08 | 60.84 ± 0.64 |
Qwen 3 8B Alibaba | 60.57 ± 0.18 | 60.57 ± 0.18 | 77.13 ± 0.12 | 22.93 ± 0.71 | 82.30 ± 0.35 | 64.55 ± 0.27 | 58.68 ± 0.09 | 57.82 ± 0.64 |
SEA-LION v4 (Qwen VL) 8B AISG | 60.36 ± 0.16 | 60.36 ± 0.16 | 71.41 ± 0.08 | 15.32 ± 0.65 | 83.41 ± 0.24 | 69.41 ± 0.21 | 59.05 ± 0.06 | 63.56 ± 0.59 |
ERNIE 4.5 21B MoE Baidu | 59.65 ± 0.16 | 59.65 ± 0.16 | 66.62 ± 0.14 | 41.12 ± 0.63 | 79.64 ± 0.32 | 63.22 ± 0.30 | 56.00 ± 0.12 | 51.30 ± 0.72 |
Llama 3.1 70B Meta | 59.59 ± 0.18 | 59.59 ± 0.18 | 82.69 ± 0.12 | 27.13 ± 0.76 | 83.79 ± 0.33 | 39.96 ± 0.31 | 63.27 ± 0.08 | 60.68 ± 0.72 |
Tulu 3 70B AI2 | 58.73 ± 0.19 | 58.73 ± 0.19 | 82.36 ± 0.15 | 26.29 ± 0.80 | 79.78 ± 0.31 | 45.24 ± 0.20 | 59.82 ± 0.09 | 58.86 ± 0.58 |
Qwen 2.5 32B Alibaba | 58.61 ± 0.16 | 58.61 ± 0.16 | 69.62 ± 0.11 | 23.52 ± 0.68 | 79.11 ± 0.28 | 57.35 ± 0.25 | 65.42 ± 0.08 | 56.64 ± 0.84 |
Gemma 3 12B | 58.06 ± 0.21 | 58.06 ± 0.21 | 78.94 ± 0.11 | 15.70 ± 0.62 | 78.43 ± 0.35 | 64.48 ± 0.31 | 53.77 ± 0.09 | 57.04 ± 0.78 |
phi-4 14B Microsoft | 57.87 ± 0.20 | 57.87 ± 0.20 | 79.63 ± 0.13 | 30.95 ± 0.62 | 59.31 ± 0.42 | 61.40 ± 0.31 | 57.41 ± 0.09 | 58.51 ± 0.84 |
Qwen 3 VL 8B Alibaba | 56.20 ± 0.12 | 56.20 ± 0.12 | 70.61 ± 0.12 | 6.40 ± 0.46 | 83.65 ± 0.26 | 65.67 ± 0.22 | 57.82 ± 0.07 | 53.07 ± 0.59 |
Qwen 2.5 14B Alibaba | 55.84 ± 0.16 | 55.84 ± 0.16 | 74.80 ± 0.16 | 18.97 ± 0.77 | 78.31 ± 0.22 | 55.09 ± 0.30 | 60.09 ± 0.10 | 47.78 ± 0.71 |
Llama 3 70B Meta | 51.23 ± 0.14 | 51.23 ± 0.14 | 80.22 ± 0.09 | 16.97 ± 0.55 | 77.12 ± 0.32 | 24.32 ± 0.21 | 55.35 ± 0.08 | 53.42 ± 0.56 |
SEA-LION v4 (Qwen VL) 4B AISG | 50.21 ± 0.14 | 50.21 ± 0.14 | 68.14 ± 0.11 | 0.20 ± 0.17 | 81.20 ± 0.34 | 63.05 ± 0.25 | 51.15 ± 0.08 | 37.53 ± 0.79 |
Qwen 3 VL 4B Alibaba | 49.81 ± 0.12 | 49.81 ± 0.12 | 68.68 ± 0.12 | 0.00 ± 0.00 | 80.88 ± 0.27 | 62.49 ± 0.23 | 51.31 ± 0.09 | 35.50 ± 0.68 |
Mistral Small 3.1 2503 24B Mistral AI | 49.53 ± 0.17 | 49.53 ± 0.17 | 54.06 ± 0.22 | 25.79 ± 0.79 | 70.04 ± 0.45 | 43.83 ± 0.43 | 46.20 ± 0.17 | 57.27 ± 0.75 |
SEA-LION v3 (Llama) 8B AISG | 46.31 ± 0.28 | 46.31 ± 0.28 | 70.67 ± 0.17 | 13.83 ± 1.01 | 78.62 ± 0.43 | 27.42 ± 0.30 | 49.17 ± 0.13 | 38.16 ± 0.94 |
SEA-LION v3 (Gemma 2) 9B AISG | 45.14 ± 0.19 | 45.14 ± 0.19 | 66.10 ± 0.24 | 16.11 ± 0.81 | 75.85 ± 0.36 | 28.86 ± 0.33 | 48.91 ± 0.11 | 35.03 ± 0.60 |
Olmo 3 7B AI2 | 43.41 ± 0.19 | 43.41 ± 0.19 | 54.45 ± 0.21 | 2.19 ± 0.68 | 76.04 ± 0.31 | 55.24 ± 0.30 | 37.63 ± 0.12 | 34.91 ± 0.74 |
Qwen 2.5 7B Alibaba | 43.29 ± 0.18 | 43.29 ± 0.18 | 61.35 ± 0.15 | 9.73 ± 0.59 | 70.97 ± 0.29 | 48.70 ± 0.32 | 49.82 ± 0.12 | 19.19 ± 0.75 |
Gemma 2 27B | 42.75 ± 0.24 | 42.75 ± 0.24 | 63.80 ± 0.17 | 12.86 ± 0.99 | 74.82 ± 0.36 | 25.27 ± 0.26 | 45.90 ± 0.12 | 33.86 ± 0.76 |
Llama 3.1 8B Meta | 39.62 ± 0.20 | 39.62 ± 0.20 | 60.91 ± 0.22 | 4.06 ± 0.80 | 74.46 ± 0.33 | 22.33 ± 0.28 | 41.68 ± 0.11 | 34.30 ± 0.64 |
Tulu 3 8B AI2 | 38.12 ± 0.17 | 38.12 ± 0.17 | 54.35 ± 0.17 | 8.14 ± 0.64 | 79.09 ± 0.32 | 20.07 ± 0.33 | 39.23 ± 0.10 | 27.83 ± 0.79 |
Aya Expanse 32B CohereLabs | 37.21 ± 0.26 | 37.21 ± 0.26 | 60.64 ± 0.21 | 7.00 ± 0.96 | 68.19 ± 0.36 | 14.75 ± 0.21 | 40.96 ± 0.13 | 31.74 ± 0.97 |
Olmo 2 0325 32B AI2 | 34.44 ± 0.09 | 34.44 ± 0.09 | 55.67 ± 0.18 | 0.00 ± 0.00 | 80.89 ± 0.37 | 20.24 ± 0.30 | 45.04 ± 0.11 | 4.77 ± 0.35 |
Gemma 2 9B | 32.65 ± 0.19 | 32.65 ± 0.19 | 53.32 ± 0.15 | 3.84 ± 0.64 | 69.04 ± 0.37 | 19.09 ± 0.25 | 27.72 ± 0.11 | 22.87 ± 0.86 |
Sailor2 20B SAIL | 32.37 ± 0.14 | 32.37 ± 0.14 | 37.54 ± 0.13 | 14.53 ± 0.55 | 33.97 ± 0.25 | 40.32 ± 0.30 | 46.62 ± 0.08 | 21.23 ± 0.74 |
Command R7B 12-2024 7B CohereLabs | 31.63 ± 0.15 | 31.63 ± 0.15 | 55.93 ± 0.27 | 0.50 ± 0.28 | 67.86 ± 0.42 | 20.65 ± 0.27 | 23.59 ± 0.12 | 21.27 ± 0.80 |
Command R+ 08-2024 104B CohereLabs | 31.53 ± 0.15 | 31.53 ± 0.15 | 59.97 ± 0.27 | 0.00 ± 0.00 | 69.65 ± 0.40 | 9.74 ± 0.20 | 36.26 ± 0.12 | 13.57 ± 0.76 |
MERaLiON 2 10B A*STAR | 31.41 ± 0.19 | 31.41 ± 0.19 | 54.42 ± 0.21 | 0.61 ± 0.31 | 69.37 ± 0.43 | 17.09 ± 0.26 | 27.15 ± 0.12 | 19.84 ± 0.80 |
Olmo 2 1124 13B AI2 | 31.12 ± 0.17 | 31.12 ± 0.17 | 44.77 ± 0.22 | 0.02 ± 0.04 | 73.27 ± 0.43 | 15.79 ± 0.22 | 33.18 ± 0.13 | 19.70 ± 0.80 |
Llama 3 8B Meta | 29.88 ± 0.17 | 29.88 ± 0.17 | 53.99 ± 0.18 | 0.00 ± 0.00 | 67.62 ± 0.45 | 7.92 ± 0.25 | 28.80 ± 0.08 | 20.94 ± 0.70 |
Command R 08-2024 32B CohereLabs | 29.20 ± 0.24 | 29.20 ± 0.24 | 53.66 ± 0.20 | 3.92 ± 0.88 | 61.70 ± 0.38 | 5.84 ± 0.15 | 33.17 ± 0.17 | 16.92 ± 0.78 |
Babel 83B Alibaba-DAMO | 29.20 ± 0.20 | 29.20 ± 0.20 | 57.97 ± 0.20 | 10.85 ± 0.86 | 30.06 ± 0.49 | 13.11 ± 0.34 | 46.31 ± 0.16 | 16.90 ± 0.63 |
Apertus 70B Swiss AI | 26.92 ± 0.17 | 26.92 ± 0.17 | 38.97 ± 0.28 | 0.03 ± 0.06 | 57.19 ± 0.40 | 8.75 ± 0.20 | 29.37 ± 0.12 | 27.20 ± 0.82 |
Ministral 2410 8B Mistral AI | 26.03 ± 0.26 | 26.03 ± 0.26 | 41.40 ± 0.25 | 2.02 ± 0.72 | 47.01 ± 0.58 | 19.30 ± 0.28 | 26.01 ± 0.14 | 20.41 ± 0.81 |
Aya Expanse 8B CohereLabs | 24.06 ± 0.21 | 24.06 ± 0.21 | 37.65 ± 0.23 | 2.61 ± 0.72 | 57.54 ± 0.37 | 7.20 ± 0.19 | 24.76 ± 0.12 | 14.60 ± 0.69 |
Olmo 2 1124 7B AI2 | 23.09 ± 0.13 | 23.09 ± 0.13 | 30.04 ± 0.23 | 0.03 ± 0.06 | 66.87 ± 0.34 | 11.79 ± 0.24 | 24.93 ± 0.13 | 4.87 ± 0.68 |
Apertus 8B Swiss AI | 22.00 ± 0.22 | 22.00 ± 0.22 | 28.94 ± 0.24 | 3.25 ± 0.82 | 64.38 ± 0.61 | 4.75 ± 0.16 | 24.13 ± 0.11 | 6.57 ± 0.58 |
SeaLLMs V3 7B Alibaba-DAMO | 21.87 ± 0.11 | 21.87 ± 0.11 | 44.35 ± 0.21 | 0.05 ± 0.08 | 38.31 ± 0.52 | 15.82 ± 0.30 | 32.71 ± 0.13 | 0.01 ± 0.01 |
Babel 9B Alibaba-DAMO | 20.99 ± 0.13 | 20.99 ± 0.13 | 49.20 ± 0.22 | 0.38 ± 0.24 | 28.52 ± 0.34 | 12.00 ± 0.27 | 32.06 ± 0.16 | 3.78 ± 0.52 |
Sailor2 8B SAIL | 14.80 ± 0.11 | 14.80 ± 0.11 | 31.69 ± 0.20 | 0.25 ± 0.22 | 30.62 ± 0.32 | 12.08 ± 0.23 | 12.36 ± 0.12 | 1.77 ± 0.39 |
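The EN column appears to be the unweighted mean of the six task scores. For example, for Qwen 3 32B: (85.65 + 36.02 + 83.76 + 68.90 + 65.43 + 68.35) / 6 ≈ 68.02. This aggregation rule is inferred from the numbers, not stated in the table; a minimal check:

```python
def english_score(task_scores):
    """Unweighted mean of the per-task scores (inferred aggregation rule)."""
    return sum(task_scores) / len(task_scores)

# Qwen 3 32B row: BBH, GPQA, IFEval, MATH Hard, MMLU Pro, MuSR
qwen3_32b = [85.65, 36.02, 83.76, 68.90, 65.43, 68.35]
print(round(english_score(qwen3_32b), 2))  # → 68.02
```

The same arithmetic reproduces other rows (e.g. Llama 3.3 70B averages to 66.50), so the aggregate treats all six tasks equally rather than weighting by task size.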