Human Review

Automated evaluation cannot answer every question. When AIGuard's confidence in a result is low, the case is escalated to a review queue. Reviewers approve, reject, or comment; their decisions feed back into the scorers as calibration data.
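The escalation rule described above can be sketched as a simple confidence cutoff. This is a minimal illustration, not AIGuard's actual logic: `EvalResult`, `needs_human_review`, and the `0.7` cutoff are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the text does not state AIGuard's real threshold.
ESCALATION_THRESHOLD = 0.7


@dataclass
class EvalResult:
    """Illustrative stand-in for an automated evaluation result."""
    trace_id: str
    verdict: str
    confidence: float  # assumed to lie in [0, 1]


def needs_human_review(result: EvalResult,
                       threshold: float = ESCALATION_THRESHOLD) -> bool:
    """Return True when confidence is below the cutoff."""
    return result.confidence < threshold


results = [
    EvalResult("t-1", "pass", 0.95),
    EvalResult("t-2", "hallucination", 0.55),
]

# Only the low-confidence result is routed to reviewers.
escalated = [r.trace_id for r in results if needs_human_review(r)]
print(escalated)
```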

Review queue

Each project has its own review queue stored in SQLite. Items contain the original trace, the evaluation that triggered escalation, and a secure completion token.
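To make the item layout concrete, here is a rough sketch of how such a queue could be stored using Python's standard `sqlite3` and `secrets` modules. The table name, column names, and the `enqueue` helper are assumptions for illustration; AIGuard's real schema is not documented here.

```python
import json
import secrets
import sqlite3

# In-memory database for the example; a real queue would use one file per project.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE review_items (
        id         INTEGER PRIMARY KEY,
        trace      TEXT NOT NULL,         -- original trace, serialized as JSON
        evaluation TEXT NOT NULL,         -- evaluation that triggered escalation
        token      TEXT NOT NULL UNIQUE,  -- secure completion token
        status     TEXT NOT NULL DEFAULT 'pending'
    )
    """
)


def enqueue(trace: dict, evaluation: dict) -> str:
    """Insert one item and return its unguessable completion token."""
    token = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe encoding
    conn.execute(
        "INSERT INTO review_items (trace, evaluation, token) VALUES (?, ?, ?)",
        (json.dumps(trace), json.dumps(evaluation), token),
    )
    conn.commit()
    return token


token = enqueue(
    {"trace_id": "t-2", "output": "example model output"},
    {"scorer": "hallucination", "confidence": 0.55},
)
pending = conn.execute(
    "SELECT COUNT(*) FROM review_items WHERE status = 'pending'"
).fetchone()[0]
print(pending)
```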

$ aiguard review serve --port 8123

The review server provides:

A web UI for browsing pending items
Optional SMTP alerts when new items are queued
Token-based completion endpoints so reviewers can be reached via email
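A token-based completion endpoint lets a reviewer act on an item from an emailed link without logging in. The sketch below shows one plausible way to validate such a token. The in-memory `items` store and the `issue_token` and `complete` helpers are hypothetical and do not reflect AIGuard's actual endpoints.

```python
import hmac
import secrets

# Hypothetical in-memory store: item id -> {"token": ..., "status": ...}
items: dict[int, dict[str, str]] = {}


def issue_token(item_id: int) -> str:
    """Create a pending item and return the token to embed in the email link."""
    token = secrets.token_urlsafe(32)
    items[item_id] = {"token": token, "status": "pending"}
    return token


def complete(item_id: int, presented_token: str, decision: str) -> bool:
    """Record a decision only if the token matches and the item is still pending."""
    item = items.get(item_id)
    if item is None or item["status"] != "pending":
        return False
    # Constant-time comparison avoids leaking token contents through timing.
    if not hmac.compare_digest(item["token"].encode(), presented_token.encode()):
        return False
    item["status"] = decision  # the token is effectively single-use
    return True


t = issue_token(1)
print(complete(1, "wrong-token", "approved"))
print(complete(1, t, "approved"))
print(complete(1, t, "rejected"))
```

Treating the token as single-use means a forwarded or replayed email link cannot change a decision that has already been recorded.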

Decisions and comments

Reviewers record a ReviewDecision (approved, rejected, or needs_more_context) with an optional free-text comment. Decisions are stored alongside the original evaluation result.
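One way to picture the decision record is as an enum plus a small dataclass. The three decision values come from the text above; modeling `ReviewDecision` as an `Enum`, and the `ReviewRecord` class with its fields, are assumptions made for this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReviewDecision(str, Enum):
    """The three outcomes named in the text; the Enum form is assumed."""
    APPROVED = "approved"
    REJECTED = "rejected"
    NEEDS_MORE_CONTEXT = "needs_more_context"


@dataclass
class ReviewRecord:
    """Hypothetical record pairing a decision with its optional comment."""
    trace_id: str
    reviewer: str
    decision: ReviewDecision
    comment: Optional[str] = None  # free text, may be omitted


record = ReviewRecord(
    trace_id="t-2",
    reviewer="alice",
    decision=ReviewDecision("rejected"),  # look up the member by its value
    comment="Cited source does not exist.",
)
print(record.decision.value)
```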

Agreement and reviewer history

AIGuard tracks per-reviewer history and inter-rater agreement. Use these records to spot reviewers who systematically diverge from the consensus, or to weight calibration data.
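As an illustration of the idea, the sketch below computes each reviewer's rate of agreement with the majority decision per item. This is a simple stand-in metric, not necessarily the agreement statistic AIGuard uses; the `agreement_with_consensus` function and the sample data are invented for the example.

```python
from collections import Counter, defaultdict

# Sample rows: (trace_id, reviewer, decision). Invented for illustration.
decisions = [
    ("t-1", "alice", "approved"), ("t-1", "bob", "approved"), ("t-1", "carol", "rejected"),
    ("t-2", "alice", "rejected"), ("t-2", "bob", "rejected"), ("t-2", "carol", "rejected"),
    ("t-3", "alice", "approved"), ("t-3", "bob", "approved"), ("t-3", "carol", "rejected"),
]


def agreement_with_consensus(rows):
    """Return, per reviewer, the fraction of items where they matched the majority."""
    by_item = defaultdict(list)
    for trace_id, reviewer, decision in rows:
        by_item[trace_id].append((reviewer, decision))

    hits, totals = Counter(), Counter()
    for votes in by_item.values():
        # Majority decision for this item; ties go to the first decision seen.
        consensus, _ = Counter(d for _, d in votes).most_common(1)[0]
        for reviewer, decision in votes:
            totals[reviewer] += 1
            hits[reviewer] += int(decision == consensus)
    return {reviewer: hits[reviewer] / totals[reviewer] for reviewer in totals}


rates = agreement_with_consensus(decisions)
for reviewer, rate in sorted(rates.items()):
    print(reviewer, round(rate, 2))
```

A reviewer with a persistently low rate is a candidate for a closer look, or for a lower weight when their decisions are used as calibration data.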

Calibration loop

Periodically, reviewer decisions are aggregated and used to recalibrate evaluation thresholds. Concretely, if reviewers consistently overturn AIGuard's hallucination verdicts in one direction, the threshold for that category is nudged to compensate. As a result, the system becomes more accurate with use, without code changes.
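The nudge can be sketched as a small bounded step whose direction depends on which kind of overturn dominates. Everything here is an assumption for illustration: the `recalibrate` function, the step size, the imbalance cutoff, and the convention that a higher threshold means fewer outputs are flagged.

```python
def recalibrate(threshold: float,
                false_positives: int,
                false_negatives: int,
                total: int,
                step: float = 0.02,
                min_imbalance: float = 0.1) -> float:
    """Nudge a flagging threshold based on how reviewers overturned verdicts.

    false_positives: flagged by the scorer, overturned by reviewers.
    false_negatives: passed by the scorer, overturned by reviewers.
    Assumes scores at or above the threshold are flagged.
    """
    if total == 0:
        return threshold
    imbalance = (false_positives - false_negatives) / total
    if abs(imbalance) < min_imbalance:
        return threshold  # overturns are roughly balanced; leave as is
    # Too many false positives: raise the bar. Too many false negatives: lower it.
    direction = 1 if imbalance > 0 else -1
    return min(1.0, max(0.0, threshold + direction * step))


# Reviewers overturned 30 flags but only 5 passes out of 100 reviewed items.
new_threshold = recalibrate(0.5, false_positives=30, false_negatives=5, total=100)
print(round(new_threshold, 2))
```

Keeping the step small and clamping the result to [0, 1] prevents a single noisy batch of reviews from swinging the threshold sharply.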

python

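from review.queue import ReviewQueue

# Open the review queue for one project and list the items awaiting review.
q = ReviewQueue(project="my-project")
for item in q.pending():
    # Print each pending item's trace ID and the reason it was escalated.
    print(item.trace_id, item.reason)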