Red team your AI agents before attackers do.

AgentGuardian is an open-source toolkit for testing prompt injection, tool abuse, RAG poisoning, memory attacks, and multi-agent vulnerabilities in AI systems. It deploys an adversarial swarm of eleven specialist attackers against your LLM agent and returns a deterministic 0–100 AIVSS score aligned with the OWASP Top 10 for Agentic Applications 2026, MITRE ATLAS v5.4.0, and the CSA Agentic AI Red Teaming Guide.
AIVSS is AgentGuardian’s deterministic Agentic-AI Vulnerability Scoring System: a reproducible 0–100 score (higher is safer) derived from per-finding severity, target-tier weighting, and ASI sub-scores. Same seed, same target, same score — every time.

What AgentGuardian does

Single-chain red-teaming tools send one prompt at a time. Production agents compose tools, hold memory, talk to other agents, and run real code, and that surface needs eleven attackers working in concert. AgentGuardian runs a Swarm Commander LLM that dispatches eleven attacker agents: one reconnaissance agent maps your target, then ten ASI-aligned specialists attack concurrently. Every finding is triple-tagged with OWASP ASI, MITRE ATLAS, and CSA Agentic-RT categories, then composed into a single signed report.
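The recon-then-fan-out flow described above can be pictured with a minimal asyncio sketch. It is not AgentGuardian's implementation: the function names `recon`, `specialist`, and `run_swarm` and the dictionary shapes are hypothetical, and only the ordering (one recon agent first, then ten ASI specialists running concurrently) is taken from the text.

```python
import asyncio

# Category IDs follow the OWASP ASI list in this document. Everything else
# (function names, return shapes) is hypothetical, not AgentGuardian's API.
ASI_CATEGORIES = [f"ASI{n:02d}" for n in range(1, 11)]


async def recon(target: str) -> dict:
    """Hypothetical recon step: map the target before any attack runs."""
    await asyncio.sleep(0)  # stand-in for real probing work
    return {"target": target, "tools": [], "has_memory": False}


async def specialist(category: str, surface: dict) -> dict:
    """Hypothetical specialist: attacks one ASI category using the recon map."""
    await asyncio.sleep(0)  # stand-in for sending probes
    return {"category": category, "target": surface["target"], "findings": []}


async def run_swarm(target: str) -> list:
    # Phase 1: a single reconnaissance agent runs on its own.
    surface = await recon(target)
    # Phase 2: ten specialists, one per ASI category, run concurrently.
    return await asyncio.gather(
        *(specialist(category, surface) for category in ASI_CATEGORIES)
    )


if __name__ == "__main__":
    reports = asyncio.run(run_swarm("demo-agent"))
    print(len(reports), "specialist reports")
```

`asyncio.gather` returns results in the order the coroutines were passed, so the reports line up with `ASI_CATEGORIES` even though the specialists run concurrently.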

Who it’s for

  • Agent developers shipping LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Strands, or Google ADK applications who need a pre-production red-team before the agent touches a real user.
  • Security engineers running PR-gate checks and continuous assurance on agent code, with SARIF findings landing inline in GitHub’s Security tab.
  • MCP server authors and tool-surface owners who need to know which of their tools can be weaponised before publishing.
  • Compliance and audit teams who need a signed evidence pack (HMAC + Ed25519, optional Sigstore) tied to a reproducible scan.

What it tests

96 probes across the 10 OWASP Top 10 for Agentic Applications 2026 categories. Every probe is a YAML file under src/agent_guardian/probes/asi*/ — versioned, golden-tested, and triple-framework-tagged.

ASI01 — Goal Hijack

Direct and indirect prompt injection that overrides the agent’s system goal. 9 probes covering role-swap, EchoLeak zero-click, tool-output indirect prompt injection (IPI), and dialect roleplay.

ASI02 — Tool Misuse

Argument injection, chain exfiltration, parameter smuggling, recursion bombs. 8 probes — the eight ways an agent’s tool surface gets weaponised.

ASI03 — Privilege Abuse

Escalation through inherited credentials, role confusion, and authority misattribution. 9 probes.

ASI04 — Supply Chain

Poisoned plugins, malicious adapters, dependency-confusion routes into the agent. 8 probes.

ASI05 — Code Execution

Unauthorised code execution through exec/eval tools, sandbox escapes, and code-as-input vectors. 8 probes.

ASI06 — Memory Poisoning

RAG corpus injection, persistent triggers, embedding collisions, cross-tenant vector bleed. 13 probes covering both memory-poisoning (MP) and human-in-the-loop (HITL) failure modes.

ASI07 — Agent-to-Agent (A2A)

Compromise that crosses an A2A boundary — a delegated sub-agent that follows attacker-supplied instructions. 8 probes.

ASI08 — Cascading Failures

Multi-step blast radius: one bad tool call that opens the next, that opens the next. 8 probes.

ASI09 — Trust Exploitation

Misuse of inherited trust — user impersonation, identity leak, output exfiltration via trusted channels. 17 probes (the largest family).

ASI10 — Rogue Agents (drift)

Goal-drift over long sessions, persona breaks, behavioural divergence from the system prompt. 8 probes.

What it generates

Every scan persists a canonical scan.json at ~/.agentguardian/scans/<scan_id>/scan.json (HMAC- and Ed25519-signed), plus your chosen report format at --output-path.
| Format | Flag | When to use |
| --- | --- | --- |
| JSON | --output json | Programmatic post-processing, dashboards, custom gates. |
| SARIF | --output sarif | GitHub Code Scanning; auto-upload via github/codeql-action/upload-sarif@v3. |
| JUnit | --output junit | Any CI runner that already parses test results. |
| Markdown | --output md | PR comments, RFCs, internal write-ups. |
| PDF | --output pdf | Auditor / stakeholder share-out (WeasyPrint, ReportLab fallback). |
For evidence retention, --bundle ./evidence/ produces a bundle_<scan_id>/ tree with the SARIF, the PoV (proof-of-vulnerability) replay artefacts, raw evidence, and a manifest.json.

When to use this

  • Pre-production agent red-team before the first real user touches it.
  • PR gate on every change to system prompts, tool definitions, or adapter code.
  • Before scaling tool surface — adding a new tool, a new MCP server, a new sub-agent.
  • Before publishing an MCP server so the tool author knows what an adversarial caller can do.
  • Regression testing after a hardening change to prove the AIVSS number moved.
AgentGuardian is for testing systems you own or are explicitly authorised to test. Use against third-party systems without authorisation is unlawful in most jurisdictions.

Get a score in 60 seconds

uv add agent-guardian
agent-guardian scan --system-prompt prompt.txt --model stub
The --model stub flag runs the swarm against a deterministic in-process model so you can verify the pipeline with zero API keys. Swap to --model gemini:gemini-2.5-flash, --model openai:gpt-4o-mini, or --model anthropic:claude-sonnet-4-5 once you have a key exported.

Expected output

scan ag_2026_xxxxxxxx done: AIVSS=87 band=GOOD tier=T4 findings=2 report=scan.json
→ live dashboard: http://127.0.0.1:7474/scans/ag_2026_xxxxxxxx
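If you want to act on that summary line in a script, one option is to parse it with a regular expression. The sketch below assumes the line keeps exactly the shape shown above, which this page does not guarantee. `parse_summary` and `passes_floor` are hypothetical names, and a scan that reports no numeric score would not match. For anything durable, the JSON report is likely the safer input.

```python
import re

# Pattern derived from the example summary line above; the exact format is
# an assumption and may differ between versions.
SUMMARY_RE = re.compile(
    r"scan (?P<scan_id>\S+) done: AIVSS=(?P<score>\d+) band=(?P<band>\S+) "
    r"tier=(?P<tier>\S+) findings=(?P<findings>\d+) report=(?P<report>\S+)"
)


def parse_summary(line: str) -> dict:
    """Extract the fields of a scan summary line into a dictionary."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised summary line: {line!r}")
    fields = match.groupdict()
    fields["score"] = int(fields["score"])
    fields["findings"] = int(fields["findings"])
    return fields


def passes_floor(line: str, floor: int) -> bool:
    """Return True when the parsed AIVSS score is at or above the floor."""
    return parse_summary(line)["score"] >= floor


if __name__ == "__main__":
    example = (
        "scan ag_2026_xxxxxxxx done: AIVSS=87 band=GOOD tier=T4 "
        "findings=2 report=scan.json"
    )
    print(parse_summary(example))
    print("passes floor of 80:", passes_floor(example, floor=80))
```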

How to read the score

AIVSS is inverse-risk: 100 is perfect, 0 is critical. The numeric score is bucketed into a human-readable band by agent_guardian.models.severity.band_for_score:
| Band | Range | Meaning |
| --- | --- | --- |
| EXCELLENT | 90–100 | No high/critical findings; ship with confidence. |
| GOOD | 80–89 | Minor findings; review and ship. |
| WARNING | 60–79 | Real findings present; remediate before production. |
| POOR | 40–59 | Material risk; do not ship. |
| CRITICAL | 0–39 | Severe agent vulnerabilities; halt. |
| not_evaluated | n/a | Scan ran in --model stub or --mode fast/smart; no authoritative number can be claimed (see the warning on stderr). |
tier is the auto-detected target tier (T4 = prompt-only … T1 = tools + memory + PII), which weights how much each finding moves the score.
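The numeric bands above amount to a simple threshold lookup. The sketch below re-creates that mapping from the table for illustration. It is not the source of `agent_guardian.models.severity.band_for_score`, whose signature and edge-case handling are not shown on this page, and it does not model the not_evaluated case.

```python
def band_for_score_sketch(score: int) -> str:
    """Map a 0-100 AIVSS score to its band, using the thresholds in the table.

    Illustrative re-creation only; the library's own function may differ.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return "EXCELLENT"
    if score >= 80:
        return "GOOD"
    if score >= 60:
        return "WARNING"
    if score >= 40:
        return "POOR"
    return "CRITICAL"


if __name__ == "__main__":
    for value in (100, 87, 60, 40, 0):
        print(value, band_for_score_sketch(value))
```

Because AIVSS is inverse-risk, the checks run from the safest band downward, so a score lands in the highest band whose lower bound it meets.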

Next step

Quickstart

Five minutes from pip install to your first AIVSS score on a real LangGraph chatbot.

How it works

The six-phase swarm: Recon → Decompose → Parallel launch → Checkpoint → Budget donate → Finalise.

Attack library

All 96 probes across the 10 OWASP ASI categories, with severity and tier breakdowns.

GitHub Actions

Gate every PR on an AIVSS floor with SARIF auto-upload to the Security tab.