Human Review
Automated evaluation cannot answer every question. When AIGuard's confidence in a result is low, the case is escalated to a review queue. Reviewers approve, reject, or comment; their decisions feed back into the scorers as calibration data.
Review queue
Each project has its own review queue stored in SQLite. Items contain the original trace, the evaluation that triggered escalation, and a secure completion token.
aiguard review serve --port 8123
The review server provides:
- A web UI for browsing pending items
- Optional SMTP alerts when new items are queued
- Token-based completion endpoints so reviewers can be reached via email
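As a rough illustration of how a SQLite-backed queue item and its completion token could fit together, here is a minimal sketch. The table name `review_items`, its columns, and the helpers `create_queue`, `enqueue`, and `complete` are all hypothetical and are not AIGuard's actual schema or API. Only the standard-library calls (`sqlite3`, `secrets.token_urlsafe`, `hmac.compare_digest`) are real.

```python
import hmac
import secrets
import sqlite3


def create_queue(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per escalated case.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS review_items ("
        "id INTEGER PRIMARY KEY, "
        "trace_id TEXT NOT NULL, "
        "reason TEXT NOT NULL, "
        "token TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )


def enqueue(conn: sqlite3.Connection, trace_id: str, reason: str) -> str:
    # An unguessable, URL-safe token that can be embedded in an email link.
    token = secrets.token_urlsafe(32)
    conn.execute(
        "INSERT INTO review_items (trace_id, reason, token) VALUES (?, ?, ?)",
        (trace_id, reason, token),
    )
    return token


def complete(conn: sqlite3.Connection, trace_id: str,
             presented_token: str, decision: str) -> bool:
    # Look up the pending item, then compare tokens in constant time.
    row = conn.execute(
        "SELECT id, token FROM review_items "
        "WHERE trace_id = ? AND status = 'pending'",
        (trace_id,),
    ).fetchone()
    if row is None:
        return False
    item_id, stored_token = row
    if not hmac.compare_digest(stored_token.encode(), presented_token.encode()):
        return False
    conn.execute(
        "UPDATE review_items SET status = ? WHERE id = ?",
        (decision, item_id),
    )
    return True


demo_conn = sqlite3.connect(":memory:")
create_queue(demo_conn)
demo_token = enqueue(demo_conn, "trace-001", "low confidence")
print(complete(demo_conn, "trace-001", demo_token, "approved"))
```

The token stands in for a login, so a reviewer can act on an item straight from an email link. Comparing it with `hmac.compare_digest` avoids leaking information through timing differences.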
Decisions and comments
Reviewers record a ReviewDecision (approved, rejected, or needs_more_context) with an optional free-text comment. Decisions are stored alongside the original evaluation result.
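One way to picture a recorded decision is as an enum plus a small record type. The three decision values come from the text above. The Python shape shown here (a `str`-based `Enum` and a `ReviewRecord` dataclass with `trace_id`, `reviewer`, and `comment` fields) is an assumption for illustration, not AIGuard's actual class definitions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReviewDecision(str, Enum):
    # The three outcomes named in the documentation.
    APPROVED = "approved"
    REJECTED = "rejected"
    NEEDS_MORE_CONTEXT = "needs_more_context"


@dataclass(frozen=True)
class ReviewRecord:
    # Hypothetical record layout; field names are illustrative only.
    trace_id: str
    reviewer: str
    decision: ReviewDecision
    comment: Optional[str] = None  # free-text comment is optional


record = ReviewRecord(
    trace_id="trace-001",
    reviewer="alice",
    decision=ReviewDecision("rejected"),  # parse from the stored string value
    comment="Cites a paper that does not exist.",
)
print(record.decision.value, "-", record.comment)
```

Parsing through `ReviewDecision(...)` rejects any value outside the three allowed outcomes, so typos are caught when a decision is recorded and do not reach calibration.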
Agreement and reviewer history
AIGuard tracks per-reviewer history and inter-rater agreement. Use this to spot reviewers who systematically diverge from the consensus or to weight calibration data.
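The text does not say which agreement statistic AIGuard uses. As one common choice, the sketch below computes Cohen's kappa for two reviewers who labeled the same items. The function name `cohens_kappa` and the sample labels are illustrative assumptions.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two reviewers labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both reviewers must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items where both reviewers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each reviewer's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both reviewers used a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


alice = ["approved", "approved", "rejected", "rejected"]
bob = ["approved", "rejected", "rejected", "rejected"]
print(cohens_kappa(alice, bob))
```

Kappa corrects raw agreement for the agreement expected by chance. A reviewer whose kappa against the others stays low is a candidate for follow-up or for down-weighting in calibration.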
Calibration loop
Periodically, reviewer decisions are aggregated and used to recalibrate evaluation thresholds. Concretely, if reviewers consistently overturn AIGuard's hallucination verdicts in one direction, the threshold for that category is nudged. The result: the system gets more accurate with use, without code changes.
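The nudging described above could look roughly like the following. This is a sketch under stated assumptions, not AIGuard's actual calibration code. It assumes a verdict is "flagged" when a score meets the threshold, and the function `nudge_threshold`, the fixed `step`, and the `min_bias` cutoff are all hypothetical.

```python
def nudge_threshold(
    threshold: float,
    overturned_flags: int,   # reviewer said "not a hallucination" after a flag
    overturned_passes: int,  # reviewer said "hallucination" after a pass
    total_reviewed: int,
    step: float = 0.02,      # assumed fixed step size per calibration run
    min_bias: float = 0.1,   # assumed minimum net bias before acting
) -> float:
    """Move the threshold one small step in the direction reviewers disagree."""
    if total_reviewed == 0:
        return threshold

    # Net direction of overturns as a fraction of all reviewed items.
    bias = (overturned_flags - overturned_passes) / total_reviewed
    if abs(bias) < min_bias:
        return threshold  # disagreement is not consistent enough to act on

    # Too many false flags: raise the bar. Too many misses: lower it.
    direction = 1 if bias > 0 else -1
    return min(1.0, max(0.0, threshold + direction * step))


new_threshold = nudge_threshold(0.70, overturned_flags=12,
                                overturned_passes=2, total_reviewed=50)
print(round(new_threshold, 2))
```

A small, clamped step keeps any single batch of reviews from swinging the threshold sharply. The `min_bias` cutoff ignores overturns that roughly cancel out.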
from review.queue import ReviewQueue

# Open the review queue for one project.
q = ReviewQueue(project="my-project")

# List every item still awaiting a reviewer decision.
for item in q.pending():
    print(item.trace_id, item.reason)