Skip to content

Adversarial Testing

Adversarial testing exercises your model against attack prompts designed to bypass its instructions, extract hidden system prompts, or coerce unsafe output. AIGuard ships with a curated default dataset of ~262k attacks and supports custom datasets.

Attack types

  • Prompt injection — hostile user input that attempts to override the system prompt or insert new instructions.
  • Jailbreak — role-play, encoding tricks, and persona attacks aimed at bypassing safety guidelines.
  • Instruction override — direct attempts to make the model ignore its deployment-time instructions.

Attack suites and thresholds

Each attack is scored by the heuristic scorer (or your own scorer). Results are averaged and compared against the threshold configured in aiguard.yaml:

aiguard.yamlyaml
evaluation:
  adversarial:
    threshold: 0.15      # fail if avg_risk > 0.15
    mode: quick          # quick | thorough
    runs_per_test: 3
    quick_limit: 20
    use_live_model: true

Run an attack suite

$aiguard evaluate adversarial --mode quick --output report.json

The CLI exits with code 0 when the suite passes and a non-zero code when the threshold is exceeded — perfect for gating CI pipelines.

CI enforcement

Generate a starter workflow that fails the build when AIGuard reports unsafe behaviour:

$aiguard ci github > .github/workflows/aiguard.yml

See CI/CD Integration for the full workflow templates and threshold-tuning advice.

Custom attack datasets

Bring your own dataset as JSON, JSONL, CSV, or any HuggingFace dataset. Pointdataset_config at the file and AIGuard will mutate and evolve the attack population over time.

yaml
evaluation:
  adversarial:
    dataset_config: datasets.json