Evaluation Details

Methodology

Evaluation Parameters

SEA-HELM evaluations are conducted using the following parameters:

  • Number of evaluation runs: 8 independent runs per model

  • Generation parameters: We use the model-specific defaults from each model's configuration when available; any unspecified parameters fall back to the vLLM default settings (a short sketch follows the note below).

Note: All prompts in SEA-HELM are presented in their native languages using zero-shot prompting.
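To make the fallback behavior concrete, here is a minimal sketch of how generation parameters could be resolved. It assumes a Hugging Face-style `generation_config.json` next to the model weights and vLLM's `SamplingParams`; the file layout, the field list, and the `resolve_sampling_params` helper are illustrative assumptions, not the SEA-HELM implementation.

```python
# Minimal sketch (not the SEA-HELM source): prefer values from the model's own
# generation_config.json and fall back to vLLM's SamplingParams defaults for
# anything left unspecified.
import json
from pathlib import Path

from vllm import SamplingParams


def resolve_sampling_params(model_dir: str, max_tokens: int = 1024) -> SamplingParams:
    """Build SamplingParams from model-specific defaults, else vLLM defaults."""
    overrides = {}
    config_path = Path(model_dir) / "generation_config.json"  # assumed layout
    if config_path.exists():
        config = json.loads(config_path.read_text())
        # Only carry over fields that SamplingParams understands.
        for field in ("temperature", "top_p", "top_k", "repetition_penalty"):
            if config.get(field) is not None:
                overrides[field] = config[field]
    # Anything not overridden keeps vLLM's default value.
    return SamplingParams(max_tokens=max_tokens, **overrides)
```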

Scoring

Score Aggregation Methodology

Our scoring system follows a hierarchical approach, aggregating results from individual tasks up to the overall SEA score:

  • 📋 Task Level

    Individual task scores are calculated as the mean across all 8 evaluation runs. Standard errors are computed using the clustered standard error methodology described in Miller (2024); a sketch appears after this list.

  • 🎯 Competency Level

    For each evaluation run, we calculate competency scores by averaging all task scores within that competency area. The final competency score and its standard error are then derived from the mean and standard error of these per-run scores (see the aggregation sketch after this list).

  • 🌏 Language Level

    Language scores aggregate all competency scores available for that specific language, calculated using the same approach as the competency-level aggregation.

  • 🏆 SEA Level (Overall Score)

    The SEA score summarizes performance across all Southeast Asian languages and is calculated by aggregating the individual language scores and their respective standard errors, following the same approach as the competency-level aggregation.
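
As an illustration of the task-level standard errors, the sketch below computes a cluster-robust standard error of the mean in the spirit of Miller (2024). Treating the question ID shared across the 8 runs as the clustering key is an assumption made for this example, not necessarily SEA-HELM's exact grouping.

```python
# Sketch of a cluster-robust standard error for a task score, following the
# general form in Miller (2024). The clustering key is assumed here to be the
# question id repeated across the 8 evaluation runs.
import math
from collections import defaultdict


def clustered_standard_error(scores: list[float], cluster_ids: list[str]) -> float:
    """Standard error of the mean with observations grouped into clusters."""
    n = len(scores)
    grand_mean = sum(scores) / n
    # Sum of residuals within each cluster.
    cluster_residuals = defaultdict(float)
    for score, cluster in zip(scores, cluster_ids):
        cluster_residuals[cluster] += score - grand_mean
    # SE = (1/n) * sqrt(sum over clusters of squared within-cluster residual sums).
    return math.sqrt(sum(r * r for r in cluster_residuals.values())) / n
```

The per-run averaging used at the competency, language, and SEA levels can be sketched as follows; the data layout and the `aggregate_runs` helper are assumptions for illustration. The same routine is simply reapplied at each level, with task, competency, or language scores as the children.

```python
# Sketch of the per-run aggregation: average the child scores within each run,
# then report the mean and standard error across the per-run aggregates.
import math
from statistics import mean, stdev


def aggregate_runs(per_run_child_scores: list[list[float]]) -> tuple[float, float]:
    """per_run_child_scores[r] holds the child scores observed in run r
    (e.g. the task scores within one competency). Returns (score, standard_error)."""
    per_run_means = [mean(run_scores) for run_scores in per_run_child_scores]
    n_runs = len(per_run_means)
    score = mean(per_run_means)
    # Standard error of the mean across the independent evaluation runs.
    se = stdev(per_run_means) / math.sqrt(n_runs) if n_runs > 1 else 0.0
    return score, se
```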