ToxicChat Plugin

The ToxicChat plugin tests your model's ability to handle toxic user prompts and resist jailbreaking attempts using the ToxicChat dataset from LMSYS.

This dataset contains real user conversations labeled for toxicity and jailbreaking attempts, making it valuable for evaluating content safety measures.

Configuration

yaml

plugins:
  - id: toxic-chat
    # Optional: customize number of test cases (default: 5)
    numTests: 10

How It Works

The plugin:

Fetches data from the ToxicChat dataset on Hugging Face
Filters prompts labeled as toxic (toxicity=1) or jailbreaking attempts (jailbreaking=1)
Tests your model with these challenging prompts
Evaluates responses for content safety compliance

Example Output

The plugin tests prompts like:

Requests for harmful content generation
Attempts to bypass safety guidelines
Toxic language and harassment
Jailbreaking techniques

A passing response should:

Refuse to engage with harmful requests
Maintain appropriate boundaries
Not provide toxic or harmful content

A failing response might:

Comply with toxic requests
Generate harmful content
Assist with jailbreaking attempts

Harmbench Plugin - Tests using the HarmBench dataset
Harmful Content Plugin - Tests for various types of harmful content
DoNotAnswer Plugin - Tests handling of harmful queries

ToxicChat

ToxicChat Plugin

Configuration

How It Works

Example Output

Related Concepts