SEA-HELM
(Southeast Asian Holistic Evaluation of Language Models)
SEA-HELM evaluates large language models across a broad range of tasks, with an emphasis on Southeast Asian (SEA) languages. The leaderboard covers key multilingual capabilities: chat proficiency in SEA languages, instruction-following in SEA languages, SEA linguistic tasks, and performance on a suite of English tasks.
68 models tested: 60 open and 8 closed.
Model families: Claude 4*, Gemini 2.5*, GPT-5*, Qwen 3, Gemma 3, Llama 4, DeepSeek, Tulu, and many more.
*Supported by credits from their respective teams.
SEA-LION v4 Instruct ranks 1st: among models under 200B parameters, SEA-LION v4 is the top-performing instruct model overall on SEA languages.*
*Tested SEA languages: Burmese, Filipino, Indonesian, Malay, Tamil, Thai, and Vietnamese.
SEA Overall
Average of 8 runs; 95% confidence intervals shown. Model size ≤200B; open instruct models only.

| Rank | Model size | Score |
|------|------------|-------|
| 1 | 27B | 67.52 ± 0.11 |
| 2 | 27B | 67.35 ± 0.08 |
| 3 | 32B | 65.00 ± 0.16 |
| 4 | 12B | 64.88 ± 0.10 |
| 5 | 70B | 64.44 ± 0.38 |

View all scores →
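Each score above is a mean over 8 evaluation runs, reported with a 95% confidence interval. The page does not specify how the interval is constructed; below is a minimal sketch, assuming a two-sided Student's t interval over the 8 per-run scores (the run values are invented for illustration).

```python
import math
import statistics

def mean_with_ci95(run_scores: list[float]) -> tuple[float, float]:
    """Return (mean, half-width) of a 95% t-interval over per-run scores.

    2.365 is the two-sided 95% critical value of Student's t with
    n - 1 = 7 degrees of freedom, matching the n = 8 runs used here.
    """
    n = len(run_scores)
    mean = statistics.fmean(run_scores)
    std_err = statistics.stdev(run_scores) / math.sqrt(n)
    return mean, 2.365 * std_err

# Hypothetical SEA Overall scores for one model across 8 runs.
runs = [67.4, 67.6, 67.5, 67.5, 67.6, 67.4, 67.5, 67.6]
mean, half_width = mean_with_ci95(runs)
print(f"{mean:.2f} ± {half_width:.2f}")  # 67.51 ± 0.07
```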
Performance for each SEA Language
All charts below show the average of 8 runs with 95% confidence intervals, filtered to model sizes ≤200B and open instruct models only.
Burmese
| Rank | Model size | Score |
|------|------------|-------|
| 1 | 27B | 57.78 ± 0.43 |
| 2 | 27B | 57.18 ± 0.42 |
| 3 | 109B MoE | 54.76 ± 0.23 |
| 4 | 12B | 52.82 ± 0.22 |
| 5 | 32B | 43.03 ± 0.63 |
Filipino
| Rank | Model size | Score |
|------|------------|-------|
| 1 | 27B | 74.53 ± 0.28 |
| 2 | 27B | 74.09 ± 0.12 |
| 3 | 70B | 72.84 ± 0.46 |
| 4 | 12B | 72.02 ± 0.31 |
| 5 | 70B | 70.26 ± 0.29 |
Indonesian
| Rank | Model size | Score |
|------|------------|-------|
| 1 | 111B | 74.75 ± 0.66 |
| 2 | 32B | 72.81 ± 0.18 |
| 3 | 30B MoE | 72.36 ± 0.28 |
| 4 | 70B | 72.15 ± 0.48 |
| 5 | 27B | 71.89 ± 0.33 |
Malay
| Rank | Model size | Score |
|------|------------|-------|
| 1 | 30B MoE | 71.55 ± 0.28 |
| 2 | 27B | 71.31 ± 0.43 |
| 3 | 27B | 71.20 ± 0.34 |
| 4 | 32B | 70.01 ± 0.23 |
| 5 | 70B | 69.82 ± 0.43 |
Tamil
| Rank | Model size | Score |
|------|------------|-------|
| 1 | 27B | 68.47 ± 0.30 |
| 2 | 27B | 68.45 ± 0.47 |
| 3 | 12B | 65.83 ± 0.63 |
| 4 | 109B MoE | 64.22 ± 0.28 |
| 5 | 32B | 64.10 ± 0.39 |
Thai
| Rank | Model size | Score |
|------|------------|-------|
| 1 | 32B | 65.36 ± 0.33 |
| 2 | 30B MoE | 64.57 ± 0.21 |
| 3 | 27B | 63.18 ± 0.16 |
| 4 | 14B | 63.01 ± 0.32 |
| 5 | 72B | 62.91 ± 0.44 |
Vietnamese
| Rank | Model size | Score |
|------|------------|-------|
| 1 | 30B MoE | 72.49 ± 0.20 |
| 2 | 32B | 69.94 ± 0.38 |
| 3 | 70B | 69.65 ± 0.55 |
| 4 | 111B | 69.10 ± 0.58 |
| 5 | 70B | 68.85 ± 0.24 |
English
| Rank | Model size | Score |
|------|------------|-------|
| 1 | 32B | 73.82 ± 0.29 |
| 2 | 70B | 72.16 ± 0.15 |
| 3 | 14B | 71.66 ± 0.24 |
| 4 | 70B | 71.35 ± 0.45 |
| 5 | 27B | 70.90 ± 0.24 |

View all scores →
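SEA Overall and the per-language charts come from the same runs. The precise aggregation is defined by the SEA-HELM methodology; purely for illustration, the sketch below treats SEA Overall as an unweighted mean of the seven per-language scores, with English reported separately as on this page (the scores are hypothetical).

```python
# The seven SEA languages tested on this leaderboard.
SEA_LANGUAGES = ("Burmese", "Filipino", "Indonesian", "Malay",
                 "Tamil", "Thai", "Vietnamese")

# Hypothetical per-language scores for one model.
scores = {
    "Burmese": 57.8, "Filipino": 74.5, "Indonesian": 71.9, "Malay": 71.3,
    "Tamil": 68.5, "Thai": 63.2, "Vietnamese": 68.9,
    "English": 70.9,  # shown as its own chart, excluded from SEA Overall here
}

sea_overall = sum(scores[lang] for lang in SEA_LANGUAGES) / len(SEA_LANGUAGES)
print(f"SEA Overall: {sea_overall:.2f}")  # SEA Overall: 68.01
```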