Benchmark (AI Benchmark)

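A standardized test or dataset, paired with a fixed scoring method, used to evaluate model quality, robustness, and performance so that results are comparable across models. Examples include MMLU (Massive Multitask Language Understanding), HELM (Holistic Evaluation of Language Models), and custom task-specific evals.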
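
As a rough illustration of the idea, the sketch below treats a benchmark as a fixed set of items plus a scoring rule. Everything in it is hypothetical and made up for the example: the three items, the `evaluate` helper, the exact-match metric, and the `toy_model` stand-in. Real benchmarks such as MMLU or HELM use far larger datasets and their own harnesses.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical toy benchmark: a fixed list of (prompt, expected answer) pairs.
BENCHMARK: Sequence[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]

def evaluate(model: Callable[[str], str],
             items: Sequence[Tuple[str, str]] = BENCHMARK) -> float:
    """Return the exact-match accuracy of `model` over the benchmark items."""
    correct = sum(1 for prompt, expected in items
                  if model(prompt).strip() == expected)
    return correct / len(items)

def toy_model(prompt: str) -> str:
    """Stand-in for a real model; it knows two of the three answers."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")

score = evaluate(toy_model)
print(f"accuracy = {score:.2f}")  # accuracy = 0.67
```

Because the items and the metric stay fixed, any other model passed to `evaluate` gets a score that can be compared directly with this one.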