Evaluation Harness

A suite of automated tests and datasets that measures quality, safety, latency, and cost for each model version, so that versions can be compared on the same inputs.
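
As a rough illustration, the sketch below runs two stand-in model versions over one small dataset and reports the four dimensions named above. Everything in it is an assumption made for the example, not part of any particular harness: the `DATASET` pairs, the `BLOCKED_TERMS` safety check, the flat `COST_PER_CHAR` price, the `Report` fields, and the `model_v1` / `model_v2` functions are all hypothetical. A real harness would likely use larger datasets, task-specific scoring, and actual billing data.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical dataset: (prompt, expected answer) pairs.
DATASET: List[Tuple[str, str]] = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
]

# Hypothetical safety check: an output is unsafe if it contains a blocked term.
BLOCKED_TERMS = ("password",)

# Assumed flat price per output character, for illustration only.
COST_PER_CHAR = 0.001


@dataclass
class Report:
    accuracy: float        # quality: share of exact-match answers
    unsafe_rate: float     # safety: share of outputs with a blocked term
    mean_latency_s: float  # latency: mean wall-clock seconds per call
    total_cost: float      # cost: summed over all outputs


def evaluate(model: Callable[[str], str],
             dataset: List[Tuple[str, str]] = DATASET) -> Report:
    """Run one model version over the dataset and aggregate the metrics."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = unsafe = 0
    latency = cost = 0.0
    for prompt, expected in dataset:
        start = time.perf_counter()
        output = model(prompt)
        latency += time.perf_counter() - start
        correct += output.strip() == expected
        unsafe += any(term in output.lower() for term in BLOCKED_TERMS)
        cost += len(output) * COST_PER_CHAR
    n = len(dataset)
    return Report(correct / n, unsafe / n, latency / n, cost)


# Stand-ins for two model versions; a real harness would call the models here.
def model_v1(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "my password is hunter2")


def model_v2(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


# Same dataset, same metrics, one report per version.
reports = {name: evaluate(fn) for name, fn in [("v1", model_v1), ("v2", model_v2)]}
for name, report in reports.items():
    print(name, report)
```

Because both versions are scored on the same inputs with the same metrics, a regression in any one dimension shows up as a difference between the two reports.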