LM Evaluation Harness
MLOps
Evaluates LLMs across 60+ academic benchmarks, including MMLU, HumanEval, GSM8K, TruthfulQA, and HellaSwag. Use it when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. An industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace and vLLM backends as well as API-served models.
Enabled by default · Built In
CLI install command
aegis skills install lm-evaluation-harness
Bundled with the packaged Aegis CLI as a built-in procedural skill, so no separate installation is required. Run `aegis skills install lm-evaluation-harness` only when you want an explicit local materialization record.
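To make the benchmarking use case above more concrete, here is a minimal sketch of how an evaluation run might be assembled. It only builds the argument list for the upstream harness's `lm_eval` command and prints it; nothing is executed. The helper name `build_lm_eval_command`, the example model `EleutherAI/pythia-160m`, and the `results` output directory are illustrative assumptions, and the flags shown (`--model`, `--model_args`, `--tasks`, `--batch_size`, `--output_path`) reflect the upstream CLI as commonly documented, so check `lm_eval --help` for the version you have installed. How the Aegis skill itself invokes the harness is not specified here.

```python
import shlex


def build_lm_eval_command(pretrained, tasks, backend="hf", batch_size=8, output_path=None):
    """Build an argv list for the upstream `lm_eval` CLI (hypothetical helper).

    Flag names are assumed to match the installed harness version.
    """
    if not tasks:
        raise ValueError("at least one task is required")

    argv = [
        "lm_eval",
        "--model", backend,                           # e.g. "hf" or "vllm"
        "--model_args", f"pretrained={pretrained}",   # model identifier or local path
        "--tasks", ",".join(tasks),                   # comma-separated benchmark names
        "--batch_size", str(batch_size),
    ]
    if output_path is not None:
        argv += ["--output_path", output_path]        # where results would be written
    return argv


# Example: a small model on two benchmarks (illustrative values only).
cmd = build_lm_eval_command(
    "EleutherAI/pythia-160m",
    ["hellaswag", "gsm8k"],
    output_path="results",
)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run(cmd, check=True)` without shell quoting issues, should you choose to execute it.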