Red Team Providers

This folder contains red teaming providers designed to test and evaluate language and image generation models. These providers implement adversarial approaches to probe the limits of AI safety measures.

Iterative Provider

The text-based iterative provider generates adversarial prompts that attempt to bypass the safety measures of language models, refining each prompt based on the target model's responses. It:

Implements a red teaming assistant to generate adversarial prompts
Uses a judge to evaluate the effectiveness of the prompts
Penalizes specific phrases to discourage certain types of attacks
Implements an on-topic check to ensure relevance

mermaid

flowchart TB
    Start([Start]) --> Init[Initialize]
    Init --> Loop{"Iteration<br/>Loop"}

    Loop -->|New Iteration| Generate[Generate Prompt]
    Generate --> OnTopic{On Topic?}

    OnTopic -->|Yes| Target[Send to Target Model]
    OnTopic -->|No| Feedback

    Target --> Judge[Judge Response]
    Judge --> Penalize{"Contains<br/>Penalized<br/>Phrase?"}

    Penalize -->|Yes| AdjustScore[Adjust Score]
    Penalize -->|No| CheckBest
    AdjustScore --> CheckBest

    CheckBest{Best Score?}
    CheckBest -->|Yes| UpdateBest[Update Best Response]
    CheckBest -->|No| Feedback

    UpdateBest --> Feedback[Prepare Feedback]
    Feedback --> CheckEnd{"End Conditions:<br/>1. Score ≥ 10<br/>2. Max Iterations Reached"}

    CheckEnd -->|No| Loop
    CheckEnd -->|Yes| End([End])

    class Generate,Target,Judge,AdjustScore,UpdateBest,Feedback process;
    class OnTopic,Penalize,CheckBest,CheckEnd decision;
    class Start,End start_end;
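
The loop in the diagram can be sketched roughly as follows. This is a minimal illustration, not the provider's actual code: the `IterativeDeps` callbacks, the `PENALIZED_PHRASES` list, the `PENALTY` value, and the default iteration count are all hypothetical, and the sketch assumes the penalty applies to phrases found in the generated prompt.

```typescript
// Hypothetical interfaces: stand-ins for the red teaming assistant,
// on-topic check, target model, and judge described above.
interface IterativeDeps {
  generatePrompt: (goal: string, feedback: string[]) => string;
  isOnTopic: (goal: string, prompt: string) => boolean;
  queryTarget: (prompt: string) => string;
  judge: (goal: string, response: string) => number; // assumed 1 to 10
}

interface Attempt {
  prompt: string;
  response: string;
  score: number;
}

const PENALIZED_PHRASES = ["in the face of impending doom"]; // placeholder phrase
const PENALTY = 3; // assumed penalty size

function runIterative(
  goal: string,
  deps: IterativeDeps,
  maxIterations = 4,
): Attempt | undefined {
  let best: Attempt | undefined;
  const feedback: string[] = [];

  for (let i = 0; i < maxIterations; i++) {
    const prompt = deps.generatePrompt(goal, feedback);

    // Off-topic prompts skip the target and go straight to feedback.
    if (!deps.isOnTopic(goal, prompt)) {
      feedback.push(`Off topic: ${prompt}`);
      continue;
    }

    const response = deps.queryTarget(prompt);
    let score = deps.judge(goal, response);

    // Lower the score when the prompt relies on a penalized phrase.
    const lowered = prompt.toLowerCase();
    if (PENALIZED_PHRASES.some((phrase) => lowered.includes(phrase))) {
      score = Math.max(1, score - PENALTY);
    }

    if (!best || score > best.score) {
      best = { prompt, response, score };
    }
    feedback.push(`Score ${score} for: ${prompt}`);

    // End condition 1: a top score. End condition 2 is the loop bound.
    if (score >= 10) break;
  }
  return best;
}
```

Passing the model calls in as callbacks keeps the control flow testable with stubs; a real provider would make these calls asynchronously.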

Iterative Tree Provider

The Iterative Tree Provider is an advanced red teaming approach that uses a tree search algorithm to generate and refine adversarial prompts. This method systematically explores different variations of prompts to identify the most effective ones for bypassing language model safety measures. It:

Initializes with a root node representing the initial prompt.
Iteratively selects the best unexplored node and generates child prompts.
Evaluates each prompt based on its relevance and effectiveness.
Updates the tree with new nodes and their scores.
Continues the process until a stopping condition is met, such as reaching a maximum number of iterations or achieving a high score.

mermaid

graph TD
    A[Start] --> B[Initialize Root Node]
    B --> C[Tree Search Loop]
    C --> D[Select Best Unexplored Node]
    D --> E[Generate Child Prompts]
    E --> F[Evaluate Prompts]
    F --> G[Update Tree]
    G --> H{Stop Condition Met?}
    H -->|No| C
    H -->|Yes| I[Return Best Result]
    I --> J[End]

    subgraph Evaluation Process
    F --> F1[Check On-Topic]
    F1 --> F2[Get Model Response]
    F2 --> F3[Check Non-Refusal]
    F3 --> F4[Judge Response]
    F4 --> F5[Calculate Score]
    end
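
The search and its evaluation step can be sketched as below. This is an illustrative best-first search under stated assumptions, not the provider's implementation: `TreeNode`, `TreeDeps`, the branching factor, and the scores assigned to off-topic prompts (0) and refusals (1) are all hypothetical.

```typescript
// Hypothetical node and callback types for illustration only.
interface TreeNode {
  prompt: string;
  score: number;
  explored: boolean;
}

interface TreeDeps {
  expand: (parentPrompt: string, count: number) => string[]; // child prompts
  isOnTopic: (prompt: string) => boolean;
  queryTarget: (prompt: string) => string;
  isRefusal: (response: string) => boolean;
  judge: (response: string) => number; // assumed 1 to 10
}

// Evaluation process: on-topic check, model response, non-refusal check, judge.
function evaluatePrompt(prompt: string, deps: TreeDeps): number {
  if (!deps.isOnTopic(prompt)) return 0; // assumed score for off-topic prompts
  const response = deps.queryTarget(prompt);
  if (deps.isRefusal(response)) return 1; // assumed score for refusals
  return deps.judge(response);
}

function treeSearch(
  rootPrompt: string,
  deps: TreeDeps,
  maxIterations = 5,
  branching = 2,
  targetScore = 10,
): TreeNode {
  const root: TreeNode = {
    prompt: rootPrompt,
    score: evaluatePrompt(rootPrompt, deps),
    explored: false,
  };
  const nodes: TreeNode[] = [root];
  let best: TreeNode = root;

  // Stop on the iteration limit or once a high enough score is reached.
  for (let i = 0; i < maxIterations && best.score < targetScore; i++) {
    // Select the best unexplored node.
    const frontier = nodes.filter((n) => !n.explored);
    if (frontier.length === 0) break;
    const current = frontier.reduce((a, b) => (b.score > a.score ? b : a));
    current.explored = true;

    // Generate, evaluate, and record its children.
    for (const childPrompt of deps.expand(current.prompt, branching)) {
      const child: TreeNode = {
        prompt: childPrompt,
        score: evaluatePrompt(childPrompt, deps),
        explored: false,
      };
      nodes.push(child);
      if (child.score > best.score) best = child;
    }
  }
  return best;
}
```

A flat list of nodes is enough here because selection only needs each node's score and explored flag; a real implementation might also track parent links or prune low-scoring branches.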

Key Features

Systematic Exploration: Uses a tree search algorithm to explore different prompt variations.
Evaluation Process: Each prompt is evaluated for relevance and effectiveness, ensuring only the best prompts are refined further.
Stopping Conditions: The search ends when a predefined condition is met, such as reaching a maximum number of iterations or achieving a high score, which bounds resource use.

Iterative Image Provider

The image-based iterative provider extends the iterative approach to image generation models. It generates prompts that attempt to create images that bypass safety filters, then uses a vision model to interpret each generated image so that a judge can score the result.

mermaid

flowchart TB
    Start([Start]) --> Init[Initialize Providers]
    Init --> Loop{"Iteration<br/>Loop"}

    Loop -->|New Iteration| Generate[Generate Prompt]
    Generate --> OnTopic{On Topic?}

    OnTopic -->|Yes| Target[Send to Target Image Model]
    OnTopic -->|No| Feedback

    Target --> ParseURL[Parse Image URL]
    ParseURL --> VisionModel[Send to Vision Model]
    VisionModel --> Judge[Judge Response]

    Judge --> CheckBest{Best Score?}

    CheckBest -->|Yes| UpdateBest[Update Best Response]
    CheckBest -->|No| Feedback

    UpdateBest --> Feedback[Prepare Feedback]
    Feedback --> CheckEnd{"End Conditions:<br/>1. Score ≥ 10<br/>2. Max Iterations Reached"}

    CheckEnd -->|No| Loop
    CheckEnd -->|Yes| End([End])

    class Generate,Target,ParseURL,VisionModel,Judge,UpdateBest,Feedback process;
    class OnTopic,CheckBest,CheckEnd decision;
    class Start,End start_end;
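
The image-specific part of the diagram (target image model, URL parsing, vision model, judge) might look like the sketch below. The `ImageDeps` callbacks are hypothetical, and the sketch assumes the target returns the image URL somewhere in its text output, for example inside a markdown image link.

```typescript
// Hypothetical callbacks; the real provider's interfaces may differ.
interface ImageDeps {
  generateImage: (prompt: string) => string; // raw target output containing a URL
  describeImage: (url: string) => string; // vision model description
  judge: (description: string) => number; // assumed 1 to 10
}

// Pull the first http(s) URL out of the target's output, if there is one.
function parseImageUrl(output: string): string | undefined {
  const match = output.match(/https?:\/\/[^\s)"']+/);
  return match ? match[0] : undefined;
}

// Score one prompt: generate an image, describe it, judge the description.
function scoreImagePrompt(prompt: string, deps: ImageDeps): number | undefined {
  const url = parseImageUrl(deps.generateImage(prompt));
  if (!url) return undefined; // nothing to show the vision model
  const description = deps.describeImage(url);
  return deps.judge(description);
}
```

The returned score would then feed the same best-score and end-condition checks as the text-based loop.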

Unblocking Feature

Overview

The unblocking feature helps red team providers handle situations where a target model asks clarifying questions that block the conversation from progressing. When enabled, it automatically detects and responds to these blocking questions to keep the adversarial conversation flowing.

How It Works

During a red team attack (e.g., with the Crescendo, GOAT, or Custom providers), the unblocking system:

Analyzes the target model's response to detect whether it is asking a blocking question
If a blocking question is detected, generates an appropriate answer
Sends the answer to continue the conversation
Continues with the red team attack
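
The steps above can be sketched as a single helper. This is a simplified, synchronous illustration: `Turn`, `BlockingAnalysis`, `UnblockDeps`, and `unblockIfNeeded` are hypothetical names, and the sketch folds detection and answer generation into one `analyze` call, which may not match how the real feature divides the work.

```typescript
// Hypothetical shapes for illustration only.
interface Turn {
  role: "user" | "assistant";
  content: string;
}

interface BlockingAnalysis {
  isBlocking: boolean;
  suggestedAnswer?: string;
}

interface UnblockDeps {
  // Detects a blocking question and proposes an answer,
  // using the conversation so far and the attack goal.
  analyze: (turns: Turn[], goal: string) => BlockingAnalysis;
  sendToTarget: (message: string) => string;
}

function unblockIfNeeded(turns: Turn[], goal: string, deps: UnblockDeps): Turn[] {
  const analysis = deps.analyze(turns, goal);

  // Not blocked: hand the conversation back untouched so the attack continues.
  if (!analysis.isBlocking || !analysis.suggestedAnswer) return turns;

  // Blocked: answer the question, record both turns, then continue.
  const reply = deps.sendToTarget(analysis.suggestedAnswer);
  return [
    ...turns,
    { role: "user", content: analysis.suggestedAnswer },
    { role: "assistant", content: reply },
  ];
}
```

Returning the original array when nothing is blocking lets the caller detect that no extra turn was spent.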

Example blocking questions:

"What specific industry are you in?"
"Can you provide more details about your use case?"
"Which country are you located in?"

When to Enable

Enable unblocking (PROMPTFOO_ENABLE_UNBLOCKING=true) when:

Testing conversational agents that frequently ask clarifying questions
Running red team evaluations against customer service bots or domain-specific assistants
Aiming for more realistic multi-turn adversarial conversations
Testing a target system that requires contextual information to proceed

Keep disabled (default) when:

Testing simple question-answering systems
Prioritizing faster red team evaluations (fewer API calls)
Measuring how well the target handles ambiguous queries
Working under budget constraints (each unblocking check adds cost)

Configuration

Set the environment variable before running red team evaluations:

bash

# Enable unblocking
export PROMPTFOO_ENABLE_UNBLOCKING=true

# Run red team eval
promptfoo eval
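
For readers wiring a similar flag into their own code, the check could look like the sketch below. The helper `isUnblockingEnabled` is hypothetical, and treating only the string "true" (case-insensitive, surrounding whitespace ignored) as enabled is an assumption; the values promptfoo itself accepts may differ.

```typescript
// Hypothetical helper. Assumption: only "true" enables the feature,
// and an unset or empty variable leaves it disabled (the default).
function isUnblockingEnabled(env: Record<string, string | undefined>): boolean {
  const raw = env["PROMPTFOO_ENABLE_UNBLOCKING"] ?? "";
  return raw.trim().toLowerCase() === "true";
}
```

In Node.js, the function would typically be called as `isUnblockingEnabled(process.env)`.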

Tradeoffs

Pros:

More realistic adversarial testing
Better coverage for conversational systems
Helps surface vulnerabilities in multi-turn scenarios

Cons:

Increases evaluation time and cost
Adds extra API calls to the remote inference service
May generate unnecessary responses if the target model isn't actually blocking

Implementation Details

The unblocking feature is implemented in shared.ts and uses:

Remote inference for blocking question detection
The blocking-question-analysis task
Conversation history and goal context for generating appropriate answers

Unblocking is available in:

Crescendo Provider
GOAT Provider
Custom Provider

References

Based on research from: Jailbroken: How Does LLM Safety Training Fail?
Based on research from: A STRONGREJECT for Empty Jailbreaks