Graders

What is a Grader?

A grader is the evaluation component that scores or judges the output of an AI system against a desired standard.

Think of it as the core evaluation unit in the workflow:

Inputs: The grader takes in the AI model's response (and sometimes the "ideal" or ground-truth answer, if one exists).

Process: It applies a scoring method. This could be rule-based (exact string match, regex check, cosine similarity) or LLM-as-a-Judge (using another model to evaluate correctness, relevance, style, or safety).

Outputs: It produces a score, typically 0 (fail) or 1 (pass). This score feeds into the overall Statsig experiment or eval framework to determine performance across datasets, experiments, or model versions.
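To make the rule-based case concrete, here is a minimal sketch of two such graders in Python. The function names and return shape are illustrative assumptions for this doc, not the Statsig SDK interface.

```python
import re

def exact_match_grader(response: str, ground_truth: str) -> int:
    """Rule-based grader: pass (1) only if the response matches the
    ground-truth answer after normalizing case and whitespace."""
    normalize = lambda s: " ".join(s.split()).lower()
    return 1 if normalize(response) == normalize(ground_truth) else 0

def regex_grader(response: str, pattern: str) -> int:
    """Rule-based grader: pass (1) if the response contains a
    required pattern, e.g. a correctly formatted date."""
    return 1 if re.search(pattern, response) else 0

# Example: require a date in YYYY-MM-DD format.
print(regex_grader("The invoice is due 2024-07-01.", r"\d{4}-\d{2}-\d{2}"))  # 1
print(exact_match_grader("Paris", "paris"))  # 1
```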

What is a Critical Grader?

A critical grader is a must-pass evaluation in Statsig AI Evals: if the AI output fails this grader, the entire run is marked as failed. It enforces non-negotiable requirements, acting as a hard gate before results are considered valid. When it passes, it behaves like any other grader.
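The gating behavior can be pictured with a short sketch. The Grader dataclass and run_graders helper below are hypothetical, assumed for illustration only; they are not Statsig's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Grader:
    name: str
    fn: Callable[[str, str], int]  # returns 1 (pass) or 0 (fail)
    critical: bool = False

def run_graders(graders: list[Grader], response: str, truth: str) -> dict:
    """Score a response with every grader. If any critical grader
    fails, the whole run is marked failed regardless of other scores."""
    scores = {g.name: g.fn(response, truth) for g in graders}
    critical_failed = any(scores[g.name] == 0 for g in graders if g.critical)
    return {"scores": scores, "run_passed": not critical_failed}
```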

Use Case

For example, in a financial support chatbot, a critical grader could check that the model never fabricates account balances. Even if the answer is otherwise helpful, a single failure here blocks the model from being promoted.
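As an illustrative sketch (the extraction logic here is an assumption, not a Statsig feature), such a grader might verify that every dollar amount in the response also appears in the account's source-of-truth record:

```python
import re

def no_fabricated_balances(response: str, account_record: str) -> int:
    """Critical grader: fail (0) if the response mentions any dollar
    amount that does not appear in the trusted account record."""
    amounts = re.findall(r"\$[\d,]+(?:\.\d{2})?", response)
    known = set(re.findall(r"\$[\d,]+(?:\.\d{2})?", account_record))
    return 1 if all(a in known for a in amounts) else 0

# "$1,024.50" appears in the record, so this passes (1).
print(no_fabricated_balances(
    "Your checking balance is $1,024.50.",
    "checking: $1,024.50; savings: $9,300.00",
))
```

Because the grader is marked critical, a single fabricated figure fails the entire run even if every other grader passes.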