# Evaluation

Canonical page: https://antern.co/evaluation/

## Summary

Antern evaluation is process-aware. Participants are assessed on reasoning, verification, awareness of business consequences, failure detection, and the ability to defend technical decisions, not only on whether they produced a polished artifact.

In an AI-native world, output alone is a weak signal: a participant can produce something fluent without understanding it. The assessment therefore looks at how the output was reached and checked, not just at the output itself.

## Assessment Target

The target is operator-level judgment.

Participants are evaluated on whether they can reason, verify, recover, and defend technical decisions when AI is useful but not automatically trustworthy.

## Assessment Dimensions

### Reasoning Trace

Can the participant explain the path from problem framing to final decision, including assumptions, alternatives, and why a chosen approach is appropriate?

### Verification Discipline

Can the participant test AI output, inspect evidence, design checks, use independent reviewers, and reject plausible but unsupported answers?
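
One way to picture "reject plausible but unsupported answers" is a differential check against an independent reference. The sketch below is illustrative only: `ai_dedupe`, `reference_dedupe`, and `find_counterexample` are hypothetical names, and the planted bug is an assumption chosen to show the idea, not an example taken from an Antern assessment.

```python
import random

def ai_dedupe(items):
    """Hypothetical AI-generated helper: claims to drop duplicates while keeping order."""
    return sorted(set(items))  # plausible output, but it sorts instead of keeping order

def reference_dedupe(items):
    """Independent, deliberately simple reference used only as a check."""
    seen, out = set(), []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out

def find_counterexample(candidate, reference, trials=200, seed=0):
    """Return the first random input where candidate and reference disagree, else None."""
    rng = random.Random(seed)  # seeded so the check is reproducible
    for _ in range(trials):
        data = [rng.randint(0, 50) for _ in range(rng.randint(0, 12))]
        if candidate(list(data)) != reference(list(data)):
            return data
    return None

counterexample = find_counterexample(ai_dedupe, reference_dedupe)
if counterexample is not None:
    print("Rejected: candidate disagrees with reference on", counterexample)
```

The point of the sketch is the habit rather than the tooling: the claim "keeps order" is treated as unsupported until a check that does not depend on the AI's own explanation has had a chance to break it.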

### Failure Awareness

Can the participant predict where the system breaks: hallucination, bad retrieval, context overflow, silent regressions, cost spikes, excessive latency, or unsafe actions?

### System Judgment

Can the participant connect technical choices to users, maintainability, business constraints, deployment realities, and long-term system behavior?

## Competence Ladder

- Order Taker: accepts the answer, follows the tool, and treats fluent output as evidence of competence.
- Mechanic: can make the artifact run, fix local issues, and identify obvious technical mistakes.
- Operator: can supervise the full system, challenge AI, defend tradeoffs, and reason about consequences under ambiguity.

## Evaluation Loop

1. Ambiguous prompt: participants receive a task with missing context, unclear intent, or a tempting but incomplete instruction.
2. System design first: they define the objective, invariants, risks, success metrics, and plan before implementation.
3. AI-assisted build: they use AI tools, but must preserve logs, reasoning, checkpoints, and decision ownership.
4. Independent verification: they evaluate the artifact with tests, evaluation slices, critics, reviewers, traces, and failure probes.
5. Defense: they explain what worked, what failed, what they would change, and why the system should be trusted.
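
The slices mentioned in step 4 can be sketched as grouped test cases scored separately, so that a strong happy path cannot hide a weak edge case. This is a minimal sketch under stated assumptions: `normalize_email` stands in for any artifact under test, and the slice names and cases are invented for illustration.

```python
from collections import defaultdict

def normalize_email(raw: str) -> str:
    """Hypothetical artifact under test: trims and lowercases an email address."""
    return raw.strip().lower()

# Each case is (slice name, input, expected output). None means "should be rejected".
CASES = [
    ("happy_path", "Ada@Example.com", "ada@example.com"),
    ("happy_path", "bob@example.com", "bob@example.com"),
    ("whitespace", "  carol@example.com \n", "carol@example.com"),
    ("malformed", "not-an-email", None),
]

def evaluate_by_slice(fn, cases):
    """Return the pass rate per slice instead of one aggregate score."""
    totals = defaultdict(lambda: [0, 0])  # slice name -> [passed, total]
    for slice_name, raw, expected in cases:
        try:
            actual = fn(raw)
        except Exception:
            actual = None  # an exception is treated as a rejection
        totals[slice_name][0] += int(actual == expected)
        totals[slice_name][1] += 1
    return {name: passed / total for name, (passed, total) in totals.items()}

report = evaluate_by_slice(normalize_email, CASES)
for name, rate in report.items():
    print(f"{name}: {rate:.0%}")
```

In this sketch the artifact passes the `happy_path` and `whitespace` slices but fails `malformed`, because it never rejects invalid input. An aggregate score of three out of four would blur exactly the failure the defense step is meant to surface.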

## Evidence Required

Proof-of-work must include the reasoning around the work. A final demo matters, but it is not enough.

Expected artifacts may include:

- Definition of done
- System design note
- Invariants and success metrics
- Evaluation slices
- Failure-mode analysis
- Human-AI verification log
- Architecture decision record
- Final demo and technical writeup
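
As a rough illustration, the list above could be tracked as a simple completeness check on a submission. The sketch assumes a hypothetical submission format (a dictionary from artifact name to content or link) and treats every listed item as expected, which is stricter than "may include"; the key names are invented for this example.

```python
# Keys mirror the artifact list above; the naming scheme is an assumption.
EXPECTED_ARTIFACTS = (
    "definition_of_done",
    "system_design_note",
    "invariants_and_success_metrics",
    "evaluation_slices",
    "failure_mode_analysis",
    "human_ai_verification_log",
    "architecture_decision_record",
    "final_demo_and_writeup",
)

def missing_artifacts(submission: dict[str, str]) -> list[str]:
    """Return expected artifacts that are absent or blank, in list order."""
    return [
        name
        for name in EXPECTED_ARTIFACTS
        if not submission.get(name, "").strip()
    ]

# A demo-only submission: the final artifact exists, the reasoning around it does not.
demo_only = {"final_demo_and_writeup": "link to demo", "definition_of_done": "   "}
print(missing_artifacts(demo_only))
```

A check like this only confirms that evidence exists. Whether the reasoning inside each artifact holds up is still a judgment made during the defense.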

## What Does Not Count

Fluent output is not proof of capability.

The evaluation rejects:

- A polished AI-generated answer without explanation
- A project that works only on the happy path
- A demo with no failure cases or evaluation
- A correct output that the participant cannot defend
- High activity with no clear reasoning trail
- Overconfidence when the evidence is weak
