Back to Promptfoo

redteam-guardrails (Guardrails Evaluation)

examples/redteam-guardrails/README.md

0.121.94.1 KB
Original Source

redteam-guardrails (Guardrails Evaluation)

This preset provides a comprehensive testing suite for evaluating guardrails effectiveness against the most critical security threats and harmful content that guardrails are designed to prevent. Similar to how IBM evaluates their Granite Guardian models across 40+ datasets including AegisSafetyTest, ToxicChat, HarmBench, BeaverTails, and RAG hallucination benchmarks, this preset ensures your guardrails are rigorously tested against real-world threats. The comprehensive coverage is essential because guardrails must defend against an evolving landscape of attacks - from sophisticated jailbreaks and prompt injections to harmful content generation and hallucination risks in production AI systems.

You can run this example with:

bash
npx promptfoo@latest init --example redteam-guardrails
cd redteam-guardrails

What it Tests

The Guardrails Evaluation preset includes 42 plugins specifically chosen to test guardrails against:

🎯 Prompt Injection & Jailbreaking (10 plugins)

  • ascii-smuggling - Unicode obfuscation attacks
  • indirect-prompt-injection - Context-based injection attacks
  • cca - Context compliance attacks
  • hijacking - Goal hijacking attempts
  • system-prompt-override - System prompt manipulation
  • beavertails - 330K jailbreak prompts dataset
  • harmbench - Academic jailbreak dataset
  • pliny - Curated jailbreak collection
  • donotanswer - Refusal testing dataset
  • prompt-extraction - System prompt disclosure

⚠️ Harmful Content - WMD & Weapons (3 plugins)

  • harmful:chemical-biological-weapons - WMD content
  • harmful:indiscriminate-weapons - Mass destruction weapons
  • harmful:weapons:ied - IED instructions

🔥 Harmful Content - Violence & Crime (5 plugins)

  • harmful:violent-crime - Violent criminal activities
  • harmful:sex-crime - Sexual crime content
  • harmful:non-violent-crime - Non-violent crimes
  • harmful:graphic-content - Graphic/violent content
  • harmful:unsafe-practices - Dangerous instructions

👥 Harmful Content - Exploitation & Abuse (7 plugins)

  • harmful:child-exploitation - Child safety
  • harmful:harassment-bullying - Harassment content
  • harmful:hate - Hate speech
  • harmful:self-harm - Self-harm content
  • harmful:sexual-content - Sexual content
  • harmful:insults - Personal attacks
  • harmful:profanity - Profane language

🏴‍☠️ Harmful Content - Radicalization & Extremism (1 plugin)

  • harmful:radicalization - Extremist content

💻 Harmful Content - Cybercrime & Security (2 plugins)

  • harmful:cybercrime - General cybercrime
  • harmful:cybercrime:malicious-code - Malicious code generation

💊 Harmful Content - Substances (3 plugins)

  • harmful:illegal-activities - General illegal activities
  • harmful:illegal-drugs - Drug-related content
  • harmful:illegal-drugs:meth - Specific methamphetamine content

📰 Harmful Content - Information Integrity (2 plugins)

  • harmful:misinformation-disinformation - False information
  • harmful:specialized-advice - Unqualified advice
  • harmful:copyright-violations - Copyright infringement
  • harmful:intellectual-property - IP theft

🛡️ System Security & Integrity (6 plugins)

  • cyberseceval - Meta's cybersecurity dataset
  • excessive-agency - Unauthorized actions
  • hallucination - False information generation
  • overreliance - Over-dependence issues
  • divergent-repetition - Resource exhaustion attacks
  • reasoning-dos - Reasoning DoS attacks

🔐 PII & Privacy Protection (1 plugin)

  • harmful:privacy - Privacy violations

Usage

Basic Usage

yaml
redteam:
  plugins:
    - guardrails-eval

With Strategies (Enhanced Testing)

yaml
redteam:
  plugins:
    - guardrails-eval
  strategies:
    - jailbreak
    - prompt-injection
    - multilingual