LM Evaluation Harness

Category: MLOps

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use it when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. It is an industry standard maintained by EleutherAI and used by HuggingFace and major labs. Supports HuggingFace Transformers, vLLM, and API-based backends.

Enabled by default · Built In
CLI install command: `aegis skills install lm-evaluation-harness`
Overview

This skill is bundled with the packaged Aegis CLI as a built-in procedural skill, so it is available without any additional installation.

Run `aegis skills install lm-evaluation-harness` only when you want an explicit local materialization record.
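Once the skill is available, evaluations are typically driven through the harness's own `lm_eval` CLI. The sketch below assumes the upstream EleutherAI lm-evaluation-harness is installed in the environment and uses its standard flags; the model name shown is only an illustrative placeholder.

```shell
# Evaluate a HuggingFace model on two of the benchmarks listed above.
# "hf" selects the HuggingFace Transformers backend; swap in "vllm"
# or an API backend as needed. The pretrained= value is a placeholder.
lm_eval \
  --model hf \
  --model_args pretrained=EleutherAI/pythia-1b \
  --tasks hellaswag,gsm8k \
  --batch_size 8 \
  --output_path results/
```

Results are written under `results/` as JSON, which makes it straightforward to diff runs when tracking training progress or comparing models.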