
# Moderation

Use the `moderation` assert type to ensure that LLM outputs are safe.

Currently, this supports OpenAI's moderation model, Meta's LlamaGuard models (LlamaGuard 3 and 4) via Replicate, and the Azure Content Safety API.

In general, we encourage the use of Meta's LlamaGuard, as it substantially outperforms both OpenAI's moderation API and GPT-4 in benchmarks.

## OpenAI moderation

By default, the `moderation` assertion uses OpenAI if an OpenAI API key is provided. Just make sure that the `OPENAI_API_KEY` environment variable is set:

```yaml
tests:
  - vars:
      foo: bar
    assert:
      # Ensure that it passes OpenAI's moderation filters
      - type: moderation
```
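
For context, a complete minimal config might look like the sketch below. The prompt, provider id, and variable names are illustrative placeholders, not requirements:

```yaml
# promptfooconfig.yaml (minimal sketch; prompt and provider are illustrative)
prompts:
  - 'Write a short reply to this customer review: {{review}}'

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      review: 'This product broke after one day!'
    assert:
      # Fails the test if OpenAI's moderation endpoint flags the model output
      - type: moderation
```

Run it with `npx promptfoo@latest eval`; any output flagged by the moderation endpoint fails the assertion.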

OpenAI monitors the following categories:

| Category | Description |
| --- | --- |
| hate | Content that promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability, or caste. Hateful content aimed at non-protected groups is harassment. |
| hate/threatening | Hateful content that includes violence or serious harm towards the targeted group. |
| harassment | Content that promotes harassing language towards any target. |
| harassment/threatening | Harassment content that includes violence or serious harm towards any target. |
| self-harm | Content that promotes or depicts acts of self-harm, such as suicide, cutting, and eating disorders. |
| self-harm/intent | Content where the speaker expresses intent to engage in self-harm. |
| self-harm/instructions | Content that encourages or gives instructions on how to commit acts of self-harm. |
| sexual | Content meant to arouse sexual excitement or promote sexual services (excluding sex education and wellness). |
| sexual/minors | Sexual content involving individuals under 18 years old. |
| violence | Content that depicts death, violence, or physical injury. |
| violence/graphic | Content that depicts death, violence, or physical injury in graphic detail. |

### Check specific categories

The assertion `value` lets you restrict moderation to specific categories; the output fails the assertion only if it is flagged in one of the listed categories:

```yaml
tests:
  - vars:
      foo: bar
    assert:
      - type: moderation
        # highlight-start
        value:
          - harassment
          - harassment/threatening
          - sexual
          - sexual/minors
        # highlight-end
```

## Meta LlamaGuard moderation

This example uses the LlamaGuard model hosted on Replicate. Be sure to set the `REPLICATE_API_KEY` environment variable:

```yaml
tests:
  - vars:
      foo: bar
    assert:
      - type: moderation
        # Use Llama Guard 3 on Replicate, pinned to a specific version
        provider: 'replicate:moderation:meta/llama-guard-3-8b:146d1220d447cdcc639bc17c5f6137416042abee6ae153a2615e6ef5749205c8'
```

:::tip
LlamaGuard 4 is the default moderation provider on Replicate, featuring enhanced capabilities and an additional category (S14: Code Interpreter Abuse). You can explicitly specify it:

```yaml
provider: 'replicate:moderation:meta/llama-guard-4-12b'
```

For compatibility or specific use cases, you can use LlamaGuard 3:

```yaml
provider: 'replicate:moderation:meta/llama-guard-3-8b:146d1220d447cdcc639bc17c5f6137416042abee6ae153a2615e6ef5749205c8'
```

:::

LlamaGuard monitors the following categories:

| Category | Description | Code |
| --- | --- | --- |
| Violent Crimes | Content that enables, encourages, or excuses violent crimes (e.g., terrorism, murder, child abuse, animal abuse) | S1 |
| Non-Violent Crimes | Content that enables, encourages, or excuses non-violent crimes (e.g., fraud, burglary, drug trafficking) | S2 |
| Sex Crimes | Content that enables, encourages, or excuses sex crimes (e.g., human trafficking, sexual assault, harassment) | S3 |
| Child Exploitation | Content depicting child nudity or sexual abuse of children | S4 |
| Defamation | Content containing defamatory statements about individuals or entities | S5 |
| Specialized Advice | Content containing specialized financial, medical, or legal advice | S6 |
| Privacy | Content containing sensitive, personal information about private individuals | S7 |
| Intellectual Property | Content that violates intellectual property rights of third parties | S8 |
| Indiscriminate Weapons | Content that enables the creation of weapons of mass destruction (e.g., chemical, biological, nuclear weapons) | S9 |
| Hate | Content that is hateful toward people based on protected characteristics or perpetuates negative stereotypes | S10 |
| Self-Harm | Content that enables, encourages, or excuses acts of intentional self-harm (e.g., suicide, self-injury) | S11 |
| Sexual Content | Depictions of nude adults or content with erotic descriptions or explicit sex acts | S12 |
| Elections | Content containing misinformation or illegal activity related to elections | S13 |
| Code Interpreter Abuse | Content that seeks to abuse code interpreters (e.g., denial of service, container escapes) (LlamaGuard 4 only) | S14 |

### Check specific categories

The assertion `value` lets you restrict moderation to specific LlamaGuard categories, referenced by their codes:

```yaml
tests:
  - vars:
      foo: bar
    assert:
      - type: moderation
        provider: 'replicate:moderation:meta/llama-guard-3-8b:146d1220d447cdcc639bc17c5f6137416042abee6ae153a2615e6ef5749205c8'
        # highlight-start
        value:
          - S1
          - S3
          - S4
        # highlight-end
```

## Azure Content Safety moderation

You can use the Azure Content Safety API for moderation. To set it up, you need to create an Azure Content Safety resource and get the API key and endpoint.

### Setup

First, set these environment variables:

```bash
AZURE_CONTENT_SAFETY_ENDPOINT=https://your-resource-name.cognitiveservices.azure.com
AZURE_CONTENT_SAFETY_API_KEY=your-api-key
AZURE_CONTENT_SAFETY_API_VERSION=2024-09-01  # Optional, defaults to this version
```

If `AZURE_CONTENT_SAFETY_ENDPOINT` is set, Promptfoo automatically uses the Azure Content Safety service for moderation instead of OpenAI's moderation API.

Alternatively, you can explicitly select the Azure moderation provider in your tests:

```yaml
tests:
  - vars:
      foo: bar
    assert:
      - type: moderation
        provider: 'azure:moderation'
```

### Moderation categories

The Azure Content Safety API checks content for these categories:

| Category | Description |
| --- | --- |
| Hate | Content that expresses discrimination or derogatory sentiments |
| SelfHarm | Content related to inflicting physical harm on oneself |
| Sexual | Sexually explicit or adult content |
| Violence | Content depicting or promoting violence against people or animals |

### Check specific categories

The assertion `value` lets you restrict moderation to specific categories:

```yaml
tests:
  - vars:
      foo: bar
    assert:
      - type: moderation
        provider: 'azure:moderation'
        value:
          - hate
          - sexual
```

You can also configure custom blocklists and `haltOnBlocklistHit` in the provider config:

```yaml
tests:
  - vars:
      foo: bar
    assert:
      - type: moderation
        provider:
          id: azure:moderation
          config:
            blocklistNames: ['my-custom-blocklist', 'industry-terms']
            haltOnBlocklistHit: true
```
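
Since `value` and `provider` are independent fields on the assertion, the blocklist config should compose with category filtering. A sketch under that assumption (the blocklist name is a placeholder and must already exist in your Azure Content Safety resource):

```yaml
tests:
  - vars:
      foo: bar
    assert:
      - type: moderation
        provider:
          id: azure:moderation
          config:
            # Placeholder; blocklists must already exist in your Azure resource
            blocklistNames: ['my-custom-blocklist']
            haltOnBlocklistHit: true
        # Restrict category checks to these categories
        value:
          - hate
          - violence
```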