src/redteam/providers/README.md
This folder contains red teaming providers designed to test and evaluate language and image generation models. These providers implement adversarial approaches to probe the limits of AI safety measures.
The text-based iterative provider generates adversarial prompts that attempt to bypass the safety measures of language models. It refines each prompt iteratively, using the target model's previous responses as feedback. The loop works as follows:
flowchart TB
Start([Start]) --> Init[Initialize]
Init --> Loop{"Iteration<br/>Loop"}
Loop -->|New Iteration| Generate[Generate Prompt]
Generate --> OnTopic{On Topic?}
OnTopic -->|Yes| Target[Send to Target Model]
OnTopic -->|No| Feedback
Target --> Judge[Judge Response]
Judge --> Penalize{"Contains<br/>Penalized<br/>Phrase?"}
Penalize -->|Yes| AdjustScore[Adjust Score]
Penalize -->|No| CheckBest
AdjustScore --> CheckBest
CheckBest{Best Score?}
CheckBest -->|Yes| UpdateBest[Update Best Response]
CheckBest -->|No| Feedback
UpdateBest --> Feedback[Prepare Feedback]
Feedback --> CheckEnd{"End Conditions:<br/>1. Score ≥ 10<br/>2. Max Iterations<br/>Reached"}
CheckEnd -->|No| Loop
CheckEnd -->|Yes| End([End])
class Generate,Target,Judge,AdjustScore,UpdateBest,Feedback process;
class OnTopic,Penalize,CheckBest,CheckEnd decision;
class Start,End start_end;
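
The loop in the flowchart can be sketched in TypeScript as follows. This is a minimal illustration, not the provider's actual code: every name here (`runIterative`, `IterativeDeps`, `penalizedPhrases`, `penalty`) is hypothetical, the model calls are injected as plain synchronous functions to keep the example self-contained (real providers would be asynchronous), and the size of the score adjustment is an assumption.

```typescript
// Hypothetical types mirroring the boxes in the flowchart.
interface Attempt {
  prompt: string;
  response: string;
  score: number;
}

interface IterativeDeps {
  generatePrompt: (goal: string, feedback: string) => string; // attacker model
  isOnTopic: (goal: string, prompt: string) => boolean; // on-topic check
  sendToTarget: (prompt: string) => string; // target model
  judge: (goal: string, response: string) => number; // judge score
}

interface IterativeOptions {
  maxIterations: number;
  penalizedPhrases: string[]; // assumption: phrases that lower the score
  penalty: number; // assumption: how much a penalized phrase costs
}

function runIterative(
  goal: string,
  deps: IterativeDeps,
  opts: IterativeOptions,
): Attempt | undefined {
  let best: Attempt | undefined;
  let feedback = "";

  for (let i = 0; i < opts.maxIterations; i++) {
    const prompt = deps.generatePrompt(goal, feedback);

    // Off-topic prompts skip the target model and go straight to feedback.
    if (!deps.isOnTopic(goal, prompt)) {
      feedback = "The previous prompt drifted off topic.";
      continue;
    }

    const response = deps.sendToTarget(prompt);
    let score = deps.judge(goal, response);

    // Adjust the score when the response contains a penalized phrase.
    const lower = response.toLowerCase();
    const penalized = opts.penalizedPhrases.some(
      (phrase) => lower.indexOf(phrase.toLowerCase()) !== -1,
    );
    if (penalized) {
      score = Math.max(1, score - opts.penalty);
    }

    // Keep the best-scoring attempt seen so far.
    if (best === undefined || score > best.score) {
      best = { prompt, response, score };
    }

    feedback = `Last score: ${score}. Last response: ${response}`;

    // End condition 1: score of 10 or more. End condition 2 is the loop bound.
    if (score >= 10) {
      break;
    }
  }
  return best;
}
```

Injecting the model calls keeps the control flow testable without any network access; swapping in real attacker, target, and judge calls would not change the shape of the loop.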
The Iterative Tree Provider is an advanced red teaming approach that uses a tree search algorithm to generate and refine adversarial prompts. It systematically explores variations of a prompt to find the ones most effective at bypassing language model safety measures. The search proceeds as follows:
graph TD
A[Start] --> B[Initialize Root Node]
B --> C[Tree Search Loop]
C --> D[Select Best Unexplored Node]
D --> E[Generate Child Prompts]
E --> F[Evaluate Prompts]
F --> G[Update Tree]
G --> H{Stop Condition Met?}
H -->|No| C
H -->|Yes| I[Return Best Result]
I --> J[End]
subgraph Evaluation Process
F --> F1[Check On-Topic]
F1 --> F2[Get Model Response]
F2 --> F3[Check Non-Refusal]
F3 --> F4[Judge Response]
F4 --> F5[Calculate Score]
end
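
A minimal TypeScript sketch of the search in the diagram is shown below. It is an illustration under stated assumptions, not the provider's implementation: all names (`treeSearch`, `TreeDeps`, `evaluatePrompt`) are hypothetical, the stop condition is assumed to be "target score reached or node budget exhausted", off-topic prompts are assumed to score 0 and refusals 1, and model calls are injected as synchronous functions for brevity.

```typescript
// Hypothetical node type for the search tree.
interface TreeNode {
  prompt: string;
  score: number;
  depth: number;
  explored: boolean;
}

interface TreeDeps {
  generateChildren: (parentPrompt: string, count: number) => string[];
  isOnTopic: (goal: string, prompt: string) => boolean;
  sendToTarget: (prompt: string) => string;
  isNonRefusal: (response: string) => boolean;
  judge: (goal: string, response: string) => number;
}

interface TreeOptions {
  maxNodes: number; // assumption: node budget used as a stop condition
  branching: number; // children generated per expanded node
  targetScore: number; // assumption: search stops once this score is reached
}

// The "Evaluation Process" subgraph: on-topic check, model response,
// non-refusal check, judge, score.
function evaluatePrompt(goal: string, prompt: string, deps: TreeDeps): number {
  if (!deps.isOnTopic(goal, prompt)) {
    return 0; // assumption: off-topic prompts get the lowest score
  }
  const response = deps.sendToTarget(prompt);
  if (!deps.isNonRefusal(response)) {
    return 1; // assumption: refusals get a minimal score
  }
  return deps.judge(goal, response);
}

function treeSearch(goal: string, deps: TreeDeps, opts: TreeOptions): TreeNode {
  const root: TreeNode = { prompt: goal, score: 0, depth: 0, explored: false };
  const nodes: TreeNode[] = [root];
  let best = root;

  while (best.score < opts.targetScore && nodes.length < opts.maxNodes) {
    // Select the best unexplored node.
    const frontier = nodes.filter((n) => !n.explored);
    if (frontier.length === 0) {
      break;
    }
    const current = frontier.reduce((a, b) => (b.score > a.score ? b : a));
    current.explored = true;

    // Generate child prompts, evaluate them, and update the tree.
    for (const prompt of deps.generateChildren(current.prompt, opts.branching)) {
      const child: TreeNode = {
        prompt,
        score: evaluatePrompt(goal, prompt, deps),
        depth: current.depth + 1,
        explored: false,
      };
      nodes.push(child);
      if (child.score > best.score) {
        best = child;
      }
    }
  }
  return best;
}
```

Always expanding the highest-scoring unexplored node makes this a greedy best-first search; other selection strategies would fit the same loop.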
The image-based iterative provider extends the concept to image generation models. It generates prompts that attempt to create images that bypass safety filters, then uses a vision model to interpret the results.
flowchart TB
Start([Start]) --> Init[Initialize Providers]
Init --> Loop{"Iteration<br/>Loop"}
Loop -->|New Iteration| Generate[Generate Prompt]
Generate --> OnTopic{On Topic?}
OnTopic -->|Yes| Target[Send to Target Image Model]
OnTopic -->|No| Feedback
Target --> ParseURL[Parse Image URL]
ParseURL --> VisionModel[Send to Vision Model]
VisionModel --> Judge[Judge Response]
Judge --> CheckBest{Best Score?}
CheckBest -->|Yes| UpdateBest[Update Best Response]
CheckBest -->|No| Feedback
UpdateBest --> Feedback[Prepare Feedback]
Feedback --> CheckEnd{"End Conditions:<br/>1. Score ≥ 10<br/>2. Max Iterations<br/>Reached"}
CheckEnd -->|No| Loop
CheckEnd -->|Yes| End([End])
class Generate,Target,ParseURL,VisionModel,Judge,UpdateBest,Feedback process;
class OnTopic,CheckBest,CheckEnd decision;
class Start,End start_end;
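
The step that differs from the text-based loop (target image model, URL parsing, vision model, judge) can be sketched as below. This is a hedged illustration only: `evaluateImagePrompt`, `parseImageUrl`, and `ImageDeps` are hypothetical names, the target's output is assumed to contain a plain `http(s)` URL somewhere in its text, and the model calls are synchronous stand-ins.

```typescript
// Hypothetical dependencies for one image-based evaluation step.
interface ImageDeps {
  sendToImageModel: (prompt: string) => string; // raw output containing an image URL
  describeImage: (imageUrl: string) => string; // vision model's interpretation
  judge: (goal: string, description: string) => number;
}

interface ImageAttempt {
  prompt: string;
  imageUrl: string;
  description: string;
  score: number;
}

// Assumption: the first http(s) URL in the output is the generated image.
function parseImageUrl(output: string): string | undefined {
  const match = output.match(/https?:\/\/[^\s)"'<>]+/);
  return match ? match[0] : undefined;
}

function evaluateImagePrompt(
  goal: string,
  prompt: string,
  deps: ImageDeps,
): ImageAttempt | undefined {
  const output = deps.sendToImageModel(prompt);
  const imageUrl = parseImageUrl(output);
  if (imageUrl === undefined) {
    return undefined; // assumption: without an image there is nothing to judge
  }
  // The judge scores the vision model's description, not the image itself.
  const description = deps.describeImage(imageUrl);
  const score = deps.judge(goal, description);
  return { prompt, imageUrl, description, score };
}
```

The returned attempt would feed the same best-score and feedback bookkeeping as in the text-based loop.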
The unblocking feature helps red team providers handle situations where a target model asks clarifying questions that block the conversation from progressing. When enabled, it automatically detects and responds to these blocking questions to keep the adversarial conversation flowing.
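
Conceptually, the detect-and-respond step might look like the sketch below. It is not the code in shared.ts: `getUnblockingReply`, `UnblockDeps`, and `BlockingAnalysis` are hypothetical names, and the analysis itself is left as an injected function because its real interface is not described here.

```typescript
// Hypothetical result of analyzing a target response for a blocking question.
interface BlockingAnalysis {
  isBlocking: boolean;
  suggestedReply?: string;
}

interface UnblockDeps {
  analyze: (targetResponse: string) => BlockingAnalysis;
}

// Returns a reply that answers the blocking question, or undefined when
// unblocking is disabled or the response does not block the conversation.
function getUnblockingReply(
  targetResponse: string,
  enabled: boolean,
  deps: UnblockDeps,
): string | undefined {
  if (!enabled) {
    return undefined; // feature is off: skip the analysis call entirely
  }
  const analysis = deps.analyze(targetResponse);
  if (!analysis.isBlocking) {
    return undefined;
  }
  return analysis.suggestedReply;
}
```

Checking the flag before calling the analysis means a disabled feature adds no extra model calls.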
During a red team attack (e.g., Crescendo, GOAT, Custom providers), the unblocking system:
Example blocking questions:
Enable unblocking (PROMPTFOO_ENABLE_UNBLOCKING=true) when:
Keep disabled (default) when:
Set the environment variable before running red team evaluations:
# Enable unblocking
export PROMPTFOO_ENABLE_UNBLOCKING=true
# Run red team eval
promptfoo eval
Pros:
Cons:
The unblocking feature is implemented in shared.ts and uses:
blocking-question-analysis task
Unblocking is available in: