Adversarial Testing
Adversarial testing exercises your model against attack prompts designed to bypass its instructions, extract hidden system prompts, or coerce unsafe output. AIGuard ships with a curated default dataset of ~262k attacks and supports custom datasets.
Attack types
- Prompt injection — hostile user input that attempts to override the system prompt or insert new instructions.
- Jailbreak — role-play, encoding tricks, and persona attacks aimed at bypassing safety guidelines.
- Instruction override — direct attempts to make the model ignore its deployment-time instructions.
Attack suites and thresholds
Each attack is scored by the heuristic scorer (or your own scorer). Results are averaged and compared against the threshold configured in aiguard.yaml:
evaluation:
adversarial:
threshold: 0.15 # fail if avg_risk > 0.15
mode: quick # quick | thorough
runs_per_test: 3
quick_limit: 20
use_live_model: trueRun an attack suite
aiguard evaluate adversarial --mode quick --output report.jsonThe CLI exits with code 0 when the suite passes and a non-zero code when the threshold is exceeded — perfect for gating CI pipelines.
CI enforcement
Generate a starter workflow that fails the build when AIGuard reports unsafe behaviour:
aiguard ci github > .github/workflows/aiguard.ymlSee CI/CD Integration for the full workflow templates and threshold-tuning advice.
Custom attack datasets
Bring your own dataset as JSON, JSONL, CSV, or any HuggingFace dataset. Pointdataset_config at the file and AIGuard will mutate and evolve the attack population over time.
evaluation:
adversarial:
dataset_config: datasets.json