
# Detoxify evaluation

Detoxify is an open-source model for classifying text prompts as toxic.

It comes in three variants that differ in transformer type and training data:

| Model name   | Transformer type  | Training data                              |
|--------------|-------------------|--------------------------------------------|
| original     | bert-base-uncased | Toxic Comment Classification Challenge     |
| unbiased     | roberta-base      | Unintended Bias in Toxicity Classification |
| multilingual | xlm-roberta-base  | Multilingual Toxic Comment Classification  |

The unbiased and original models also come in 'small' versions, but since the regular models are not memory-heavy and the small ones perform noticeably worse, the small versions are only covered in the notebook.
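For context, any of the three models can be loaded and run through the detoxify pip package (a minimal sketch; the example text is a placeholder):

```python
from detoxify import Detoxify

# model name is one of: "original", "unbiased", "multilingual"
model = Detoxify("original")

# predict() accepts a single string or a list of strings and returns
# a dict mapping label names to scores between 0 and 1
scores = model.predict("an example prompt to score")
print(scores["toxicity"])
```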

All tests below were run on an RTX 3090 Ti.

## Inference and training times and memory usage

Charts showing detailed memory usage and timings for different sentence lengths and batch sizes are inside the notebook. Quick overview: batch size 16 and sentence length 4k for training; batch size 128 and sentence length 4k for inference.

| Model name   | Training memory | Training time | Inference memory | Inference time |
|--------------|-----------------|---------------|------------------|----------------|
| original     | 11.8 GB         | 2.40 s        | 4.8 GB           | 16.48 s        |
| unbiased     | 12 GB           | 1.09 s        | 4.8 GB           | 5.59 s         |
| multilingual | 14 GB           | 1.00 s        | 5.5 GB           | 4.89 s         |
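For reference, a minimal sketch of how inference timings like these can be measured, assuming the detoxify package on a CUDA device; the batch below is a placeholder rather than the notebook's actual data:

```python
import time

import torch
from detoxify import Detoxify

model = Detoxify("unbiased", device="cuda")

# placeholder batch: 128 copies of one sentence, not the notebook's data
batch = ["an example sentence to score"] * 128

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
model.predict(batch)
torch.cuda.synchronize()  # wait for the GPU before stopping the clock
elapsed = time.perf_counter() - start

print(f"inference time: {elapsed:.2f} s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```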

## Filtering quality

Detoxify was tested on four different types of inputs:

- Not obviously toxic
- Not obviously non-toxic
- Obviously toxic
- Obviously non-toxic

The sentences used for testing and rating are contained inside the .ipynb.

| Model name   | Not obviously toxic | Not obviously non-toxic | Obviously toxic | Obviously non-toxic |
|--------------|---------------------|-------------------------|-----------------|---------------------|
| original     | Failed across the board; readily accepted racist, sexist, and otherwise toxic prompts that were well formulated | Very sensitive to swear words; failed to recognize context | Good performance | Good performance |
| unbiased     | Found some hidden toxicity, but not in all sentences | Very sensitive to explicit language, but showed the ability to recognize context | Did well, but failed to recognize some gender-stereotype mockery | Good performance |
| multilingual | Found some hidden toxicity, but not in all sentences | Very sensitive to explicit language, but showed the ability to recognize context | Did well, but failed to recognize some gender-stereotype mockery | Good performance |

Subjectively, 'unbiased' looks like the best-performing model.

I don't think it would do well as a security layer in a live version of Open Assistant unless we do some fine-tuning first, because it can be fooled into passing toxic content when it is phrased in formal language.

With some caution it can be used to filter prompts, but I would also suggest having a human verify messages that are marked as toxic yet fall below 90% confidence.
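As a sketch of that policy (the thresholds, function name, and return labels below are my own choices, not from the notebook):

```python
from detoxify import Detoxify

model = Detoxify("unbiased")

def triage(prompt: str, block_at: float = 0.9, review_at: float = 0.5) -> str:
    """Hypothetical triage helper: block confidently toxic prompts,
    route borderline ones to a human reviewer, pass the rest."""
    score = model.predict(prompt)["toxicity"]
    if score >= block_at:
        return "block"         # confidently toxic
    if score >= review_at:
        return "human_review"  # marked toxic but below 90% confidence
    return "pass"

print(triage("an example prompt"))
```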

## Licensing

Detoxify is released under the Apache-2.0 license, which means:

You can:

- Commercial use
- Modification
- Distribution
- Patent use
- Private use

You cannot:

- Hold the owner liable
- Use the owner's trademark

You must:

- Include copyright
- Include license
- State changes you made to the product
- Include notice

This is obviously not legal advice.

## Hosting

The models are currently available on Hugging Face and Torch Hub:

```python
torch.hub.load('unitaryai/detoxify', model)
```

where `model` is one of:

- `toxic_bert`
- `unbiased_toxic_roberta`
- `multilingual_toxic_xlm_r`
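For the Hugging Face route, a hedged sketch using transformers directly (the model id `unitary/toxic-bert` is my assumption of where the original checkpoint lives; verify it on the hub before relying on it):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# assumed model id for the 'original' checkpoint on the Hugging Face hub
model_id = "unitary/toxic-bert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

inputs = tokenizer("an example prompt to score", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# multi-label head: apply a sigmoid per label rather than a softmax
scores = torch.sigmoid(logits)[0]
for label, score in zip(model.config.id2label.values(), scores.tolist()):
    print(f"{label}: {score:.3f}")
```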