LM Evaluation Harness

Category: MLOps

Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use it when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. It is an industry standard maintained by EleutherAI and used by HuggingFace and major labs. Supports HuggingFace Transformers, vLLM, and API-based backends.

Enabled by default · Built In
CLI install command: `aegis skills install lm-evaluation-harness`
Overview

This skill is bundled with the packaged Aegis CLI as a built-in procedural skill, so it is available without any additional installation.

Run `aegis skills install lm-evaluation-harness` only when you want an explicit local materialization record.
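Once the skill is available, evaluations are typically driven through the harness's own `lm_eval` CLI. The sketch below assumes the upstream EleutherAI lm-evaluation-harness is installed in the environment and uses its standard flags; the model name shown is only an illustrative placeholder.

```shell
# Evaluate a HuggingFace model on two of the benchmarks listed above.
# "hf" selects the HuggingFace Transformers backend; swap in "vllm"
# or an API backend as needed. The pretrained= value is a placeholder.
lm_eval \
  --model hf \
  --model_args pretrained=EleutherAI/pythia-1b \
  --tasks hellaswag,gsm8k \
  --batch_size 8 \
  --output_path results/
```

Results are written under `results/` as JSON, which makes it straightforward to diff runs when tracking training progress or comparing models.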