SEA-HELM
(Southeast Asian Holistic Evaluation of Language Models)
SEA-HELM evaluates large language models across a range of tasks, with an emphasis on Southeast Asian (SEA) languages. The leaderboard covers key multilingual capabilities, including chat proficiency in SEA languages, instruction-following in SEA languages, SEA linguistic tasks, and performance on a suite of English tasks.
76 models tested: 68 open and 8 closed
Model families: Claude 4*, Gemini 2.5*, GPT-5*, Qwen 3, Gemma 3, Llama 4, DeepSeek, Tulu, Apertus, and many more.
*Supported by credits from their respective teams.
SEA-LION v4 Instruct ranking: 1st
Among models of up to 200B parameters, SEA-LION v4 is the top-performing instruct model overall on SEA languages.*
*Tested SEA Languages: Burmese, Filipino, Indonesian, Malay, Tamil, Thai and Vietnamese.
SEA Overall
Scores are averaged over 30 bootstraps; 95% confidence intervals (CI) are shown.
Model Size: ≤200B
Open instruct models only
32B 60.63±0.06
80B MoE 60.55±0.05
27B 59.74±0.06
27B 59.63±0.06
32B 58.40±0.06
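The page does not say how the bootstrap average and the ± values are computed. As a rough illustration only, the sketch below resamples a list of per-example scores 30 times, averages the resample means, and reports a normal-approximation 95% interval around that average. The function name `bootstrap_summary`, the sample scores, and the interval formula are all assumptions made for this example, not the SEA-HELM implementation.

```python
import random
import statistics


def bootstrap_summary(scores, n_bootstraps=30, seed=0):
    """Return (mean, half_width) over bootstrap resamples of `scores`.

    Hypothetical helper: the interval is a normal approximation
    (1.96 * standard error of the bootstrap means), which is an
    assumption and may differ from what the leaderboard uses.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_bootstraps):
        # Resample with replacement, keeping the original sample size.
        resample = rng.choices(scores, k=n)
        means.append(statistics.fmean(resample))
    center = statistics.fmean(means)
    half_width = 1.96 * statistics.stdev(means) / (n_bootstraps ** 0.5)
    return center, half_width


# Made-up per-example scores on a 0-100 scale, for illustration only.
example_scores = [0.0, 100.0, 100.0, 0.0, 100.0, 100.0, 0.0, 100.0, 100.0, 100.0]
center, half_width = bootstrap_summary(example_scores)
print(f"{center:.2f}±{half_width:.2f}")
```

Fixing the seed makes the summary reproducible; a percentile interval over the resample means would be an equally reasonable choice under different assumptions.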
Performance for each SEA Language
Burmese
32B 48.28±0.14
27B 47.36±0.17
27B 46.50±0.16
109B MoE 44.27±0.17
80B MoE 43.68±0.16
Filipino
27B 68.10±0.14
27B 67.70±0.12
80B MoE 66.48±0.13
70B 66.38±0.17
32B 65.35±0.14
Indonesian
80B MoE 67.11±0.10
111B 66.73±0.17
32B 66.59±0.10
32B 65.67±0.11
72B 64.82±0.08
Malay
80B MoE 62.80±0.12
32B 61.36±0.14
30B MoE 61.13±0.14
27B 61.10±0.16
27B 60.92±0.17
Tamil
27B 64.43±0.16
27B 64.36±0.22
32B 62.30±0.15
80B MoE 60.05±0.13
12B 59.86±0.21
Thai
80B MoE 58.09±0.09
32B 57.91±0.13
32B 56.98±0.15
30B MoE 55.77±0.11
14B 54.82±0.15
Vietnamese
80B MoE 65.68±0.10
30B MoE 65.56±0.14
32B 62.63±0.14
70B 62.37±0.23
32B 62.19±0.12
English
32B 68.02±0.15
80B MoE 67.31±0.10
32B 67.10±0.17
70B 66.50±0.16
70B 65.20±0.18