What to Keep When Evaluating AI
Evaluating an AI system requires more than collecting successful examples. Failure cases, the evaluation criteria, and the boundaries of human review need to be recorded as well.
LLM-based tools make the gap between “looks good once” and “works repeatedly” especially visible. Evaluation notes should preserve at least the following:
- where the system works well
- where it fails
- whether the failure is risky or merely inconvenient
- where human review is required
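
One way to keep these four items together is a small record type. The sketch below is only an illustration under assumptions: the names `EvaluationNote`, `Failure`, and `Severity`, and the sample entries, are hypothetical and not part of any existing tool. It shows a note that stores each item and can report which sections are still empty.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    """Hypothetical two-level scale: is a failure risky or merely inconvenient?"""
    RISKY = "risky"
    INCONVENIENT = "inconvenient"


@dataclass
class Failure:
    description: str
    severity: Severity


@dataclass
class EvaluationNote:
    """One evaluation note for one system (all field names are assumptions)."""
    system: str
    works_well: list[str] = field(default_factory=list)
    failures: list[Failure] = field(default_factory=list)
    human_review_required: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that have no entries yet."""
        sections = {
            "works_well": self.works_well,
            "failures": self.failures,
            "human_review_required": self.human_review_required,
        }
        return [name for name, items in sections.items() if not items]

    def risky_failures(self) -> list[Failure]:
        """Return only the failures marked as risky."""
        return [f for f in self.failures if f.severity is Severity.RISKY]


# Made-up example entries, for illustration only.
note = EvaluationNote(
    system="example-summarizer",
    works_well=["short news articles"],
    failures=[
        Failure("drops numeric details from long tables", Severity.RISKY),
        Failure("output sometimes exceeds the length limit", Severity.INCONVENIENT),
    ],
)

print(note.missing_sections())
print([f.description for f in note.risky_failures()])
```

Here the note has no entry yet for human review, so `missing_sections()` flags that gap. Keeping severity as an explicit field forces the "risky or merely inconvenient" decision to be made for every recorded failure.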
This note is a public fragment that may later grow into Writing or a Lab.