Evaluation Details

Methodology

Evaluation Parameters

SEA-HELM evaluations are conducted using the following parameters:

  • Number of evaluation runs: 8 independent runs per model

  • Generation parameters: We use the model-specific defaults from each model's configuration when available; any unspecified parameters fall back to the vLLM default settings (a short sketch follows the note below).

Note: All prompts in SEA-HELM are presented in their native languages using zero-shot prompting.
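To make the fallback behavior concrete, here is a minimal sketch of how generation parameters could be resolved. It assumes a Hugging Face-style `generation_config.json` next to the model weights and vLLM's `SamplingParams`; the file layout, the field list, and the `resolve_sampling_params` helper are illustrative assumptions, not the SEA-HELM implementation.

```python
# Minimal sketch (not the SEA-HELM source): prefer values from the model's own
# generation_config.json and fall back to vLLM's SamplingParams defaults for
# anything left unspecified.
import json
from pathlib import Path

from vllm import SamplingParams


def resolve_sampling_params(model_dir: str, max_tokens: int = 1024) -> SamplingParams:
    """Build SamplingParams from model-specific defaults, else vLLM defaults."""
    overrides = {}
    config_path = Path(model_dir) / "generation_config.json"  # assumed layout
    if config_path.exists():
        config = json.loads(config_path.read_text())
        # Only carry over fields that SamplingParams understands.
        for field in ("temperature", "top_p", "top_k", "repetition_penalty"):
            if config.get(field) is not None:
                overrides[field] = config[field]
    # Anything not overridden keeps vLLM's default value.
    return SamplingParams(max_tokens=max_tokens, **overrides)
```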

Scoring

Score Aggregation Methodology

Our scoring system follows a hierarchical approach, aggregating results from individual tasks up to the overall SEA score:

  • 📋 Task Level

    Individual task scores are calculated as the mean across all 8 evaluation runs. Standard errors are computed using the clustered standard error methodology described in Miller (2024); a sketch appears after this list.

  • 🎯 Competency Level

    For each evaluation run, we calculate competency scores by averaging all task scores within that competency area. The final competency score and its standard error are then derived from the mean and standard error of these per-run scores (see the aggregation sketch after this list).

  • 🌏 Language Level

    Language scores aggregate all competency scores available for that specific language, calculated using the same approach as the competency-level aggregation.

  • 🏆 SEA Level (Overall Score)

    The SEA score summarizes performance across all Southeast Asian languages and is calculated by aggregating the individual language scores and their respective standard errors, following the same approach as the competency-level aggregation.
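
As an illustration of the task-level standard errors, the sketch below computes a cluster-robust standard error of the mean in the spirit of Miller (2024). Treating the question ID shared across the 8 runs as the clustering key is an assumption made for this example, not necessarily SEA-HELM's exact grouping.

```python
# Sketch of a cluster-robust standard error for a task score, following the
# general form in Miller (2024). The clustering key is assumed here to be the
# question id repeated across the 8 evaluation runs.
import math
from collections import defaultdict


def clustered_standard_error(scores: list[float], cluster_ids: list[str]) -> float:
    """Standard error of the mean with observations grouped into clusters."""
    n = len(scores)
    grand_mean = sum(scores) / n
    # Sum of residuals within each cluster.
    cluster_residuals = defaultdict(float)
    for score, cluster in zip(scores, cluster_ids):
        cluster_residuals[cluster] += score - grand_mean
    # SE = (1/n) * sqrt(sum over clusters of squared within-cluster residual sums).
    return math.sqrt(sum(r * r for r in cluster_residuals.values())) / n
```

The per-run averaging used at the competency, language, and SEA levels can be sketched as follows; the data layout and the `aggregate_runs` helper are assumptions for illustration. The same routine is simply reapplied at each level, with task, competency, or language scores as the children.

```python
# Sketch of the per-run aggregation: average the child scores within each run,
# then report the mean and standard error across the per-run aggregates.
import math
from statistics import mean, stdev


def aggregate_runs(per_run_child_scores: list[list[float]]) -> tuple[float, float]:
    """per_run_child_scores[r] holds the child scores observed in run r
    (e.g. the task scores within one competency). Returns (score, standard_error)."""
    per_run_means = [mean(run_scores) for run_scores in per_run_child_scores]
    n_runs = len(per_run_means)
    score = mean(per_run_means)
    # Standard error of the mean across the independent evaluation runs.
    se = stdev(per_run_means) / math.sqrt(n_runs) if n_runs > 1 else 0.0
    return score, se
```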