# redteam-multi-modal
This example demonstrates how to use promptfoo's red teaming capabilities with multi-modal models, showing three different approaches to testing model safety and robustness against adversarial inputs involving images.
You can run this example with:

```sh
npx promptfoo@latest init --example redteam-multi-modal
cd redteam-multi-modal
```
Install dependencies:

```sh
npm install promptfoo sharp
```

Set up environment variables (see the next section), then run whichever example you want:

```sh
# Static image example
npx promptfoo@latest redteam eval -c redteam.static-image.yaml

# Image strategy example
npx promptfoo@latest redteam eval -c redteam.image-strategy.yaml

# UnsafeBench example
npx promptfoo@latest redteam eval -c redteam.unsafebench.yaml

# VLGuard example
npx promptfoo@latest redteam eval -c redteam.vlguard.yaml
```

Finally, review the results in the promptfoo web interface.
## Environment Variables

This example requires the following environment variables, depending on which approach you're using.

For the static image and image strategy examples:

- `AWS_ACCESS_KEY_ID` - Your AWS access key for Amazon Bedrock
- `AWS_SECRET_ACCESS_KEY` - Your AWS secret key for Amazon Bedrock
- `AWS_REGION` - Your AWS region (e.g., `us-east-1`)

For the UnsafeBench and VLGuard examples:

- `HF_TOKEN` or `HF_API_TOKEN` - Your Hugging Face API token (optional for VLGuard, required for UnsafeBench dataset access)

You can set these in a `.env` file or directly in your environment.
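For example, a `.env` file covering both providers might look like this (all values below are placeholders, not real credentials):

```
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_REGION=us-east-1
HF_TOKEN=your_huggingface_token
```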
## Approaches

This example provides four different approaches to red team testing with multi-modal models.
### 1. Static Image (`promptfooconfig.static-image.yaml`)

This configuration demonstrates how to red team a multi-modal model by keeping a static image (in this example, Buzz Aldrin on the moon) constant while varying the text prompt to test different potential attack vectors. The adversarial prompts are injected into the `question` variable.
### 2. Image Strategy (`promptfooconfig.image-strategy.yaml`)

This configuration demonstrates how to use promptfoo's image strategy for red teaming. Instead of attacking through text prompts, this approach converts potentially harmful text into images (text-to-image) and then sends those images to the model. This can help identify whether models process text differently when it's embedded in an image rather than presented as plain text.
### 3. UnsafeBench (`promptfooconfig.unsafebench.yaml`)

This configuration uses the UnsafeBench plugin to evaluate multi-modal models against potentially unsafe imagery from the UnsafeBench dataset. It tests how well models detect and refuse to engage with harmful content across various categories.
### 4. VLGuard (`promptfooconfig.vlguard.yaml`)

This configuration uses the VLGuard plugin to evaluate multi-modal models against potentially unsafe imagery from the VLGuard dataset. It provides focused testing of content moderation capabilities with a smaller, curated dataset of 442 images.
In all examples, we work with prompt templates that have multiple variables: `image` and `question`. The static image example injects adversarial content into the `question` variable, while the image strategy injects into the `image` variable.

A crucial concept in these examples is that red teaming injects content into only one variable at a time (specified by `injectVar` in the config). The remaining variables need default values, which are set in the `defaultTest` section:
```yaml
defaultTest:
  vars:
    # For the static image example:
    image: file://aldrin_apollo_11.jpg # Fixed image while question varies
    # For the image strategy example:
    question: 'Please answer the question in the image' # Fixed text while image varies
```
This approach lets you precisely control which part of your prompt receives adversarial content, while keeping other elements constant across all tests.
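Putting the pieces together, a minimal static-image red team config might look like the sketch below. This is illustrative, not the shipped config: the prompt file name and plugin list are placeholders, and you should check the actual `promptfooconfig.static-image.yaml` for the real values.

```yaml
targets:
  - id: bedrock:amazon.nova-lite-v1:0
    label: multimodal-model

prompts:
  - file://image_prompt.json # hypothetical template using {{image}} and {{question}}

redteam:
  injectVar: question # adversarial content replaces {{question}} only
  purpose: >-
    You are a helpful assistant for NASA that can answer questions
    about this photo of Buzz Aldrin on the moon.

defaultTest:
  vars:
    image: file://aldrin_apollo_11.jpg # held constant across all tests
```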
The static image example uses a fixed JPEG image with adversarial text prompts injected into the conversation. The image remains constant while the text varies based on the redteam plugins.
The image strategy converts text into images that are then encoded as base64 strings before being sent to the model.
This approach can reveal whether models process text differently when it's embedded in images versus presented as plain text.
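To make the text-to-image idea concrete, here is a minimal stdlib-only sketch. Note the assumption: promptfoo's actual image strategy rasterizes text with the `sharp` library; this sketch instead embeds the text in an SVG and base64-encodes it, which is enough to illustrate how a text payload becomes an image payload.

```javascript
// Sketch only: promptfoo's real image strategy uses `sharp` to render text
// to a raster image. Here we wrap the text in an SVG (itself an image
// format) and base64-encode it, producing a data-URI-ready string.
function textToImageBase64(text) {
  // Escape characters that would break the SVG markup.
  const escaped = text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');

  const svg = `<svg xmlns="http://www.w3.org/2000/svg" width="800" height="100">
  <rect width="100%" height="100%" fill="white"/>
  <text x="10" y="55" font-family="sans-serif" font-size="24">${escaped}</text>
</svg>`;

  // Multi-modal APIs typically accept base64-encoded image bytes.
  return Buffer.from(svg, 'utf8').toString('base64');
}

const b64 = textToImageBase64('Please answer the question in the image');
console.log(`data:image/svg+xml;base64,${b64.slice(0, 32)}...`);
```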
Run the static image example:

```sh
npx promptfoo@latest redteam eval -c redteam.static-image.yaml
```

Run the image strategy example:

```sh
npm install sharp # Required for the image strategy
npx promptfoo@latest redteam eval -c redteam.image-strategy.yaml
```
First, ensure you have access to the UnsafeBench dataset and set your Hugging Face token:
```sh
export HF_TOKEN=your_huggingface_token
```
Then run:
```sh
npx promptfoo@latest redteam generate -c promptfooconfig.unsafebench.yaml
npx promptfoo@latest redteam eval -c redteam.yaml
```
The VLGuard example works similarly to UnsafeBench but with a smaller, more focused dataset:
```sh
npx promptfoo@latest redteam eval -c redteam.vlguard.yaml
```
Or generate fresh test cases:
```sh
npx promptfoo@latest redteam generate -c promptfooconfig.vlguard.yaml
npx promptfoo@latest redteam eval -c redteam.yaml
```
When configuring your red team tests, ensure that the purpose statement in your configuration is relevant to the image content in some way. This helps create more realistic and contextually appropriate test scenarios.
For example, in the static image example, the purpose statement references NASA and Buzz Aldrin, which is directly relevant to the moon landing image:
```yaml
redteam:
  purpose: >-
    You are a helpful assistant for NASA that can answer questions
    about this photo of Buzz Aldrin on the moon.
```
For more information on how the image strategy works, refer to the Image Inputs strategy documentation.
This example uses Amazon Bedrock's Nova model by default for the first two examples:
```yaml
targets:
  - id: bedrock:amazon.nova-lite-v1:0
    label: multimodal-model
```
However, the techniques demonstrated can be applied to any multi-modal model that supports image inputs. You can easily swap out the target model in the configuration files to use models from:
- OpenAI: `openai:gpt-4.1-mini` and newer versions that support vision

To use a different model, simply update the `targets` section in your configuration file.
For example, to use OpenAI's GPT-4o:
```yaml
targets:
  - id: openai:gpt-4o
    label: gpt4o-vision
```
Note you may have to update the prompt format to match the requirements of your chosen model.
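As a sketch of what such an adjustment can look like, the message template below follows OpenAI's chat vision format, with a text part and an `image_url` part. The variable names `{{question}}` and `{{image}}` match the templates used elsewhere in this example; whether the image is passed as a base64 data URI or a plain URL depends on how your config loads it, so treat this as a starting point rather than a drop-in file.

```json
[
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "{{question}}" },
      {
        "type": "image_url",
        "image_url": { "url": "data:image/jpeg;base64,{{image}}" }
      }
    ]
  }
]
```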
After running these examples, review the results in the promptfoo interface and consider experimenting with different target models, plugins, and image inputs.