What to Keep When Evaluating AI

Evaluating an AI system requires more than collecting successful examples. Failure cases, the evaluation criteria, and the boundaries of human review need to be recorded as well.

LLM-based tools make the gap between “looks good once” and “works repeatedly” especially visible. Evaluation notes should preserve at least the following: