[Industry] Towards Building a Robust Toxicity Predictor

Dmitriy Bespalov, Sourav Bhabesh, Yi Xiang, Liutong Zhou, Yanjun Qi

Industry: Industry Paper

Session 4: Industry (Virtual Poster)
Conference Room: Pier 7&8
Conference Time: July 11, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 11, Session 4 (15:00-16:30 UTC)
TLDR: Recent NLP literature pays little attention to the robustness of toxicity language predictors, while these systems are most likely to be used in adversarial contexts. This paper presents a novel adversarial attack, ToxicTrap, introducing small word-level perturbations to fo...
Abstract: Recent NLP literature pays little attention to the robustness of toxicity language predictors, while these systems are most likely to be used in adversarial contexts. This paper presents a novel adversarial attack, ToxicTrap, introducing small word-level perturbations to fool SOTA text classifiers into predicting toxic text samples as benign. ToxicTrap exploits greedy-based search strategies to enable fast and effective generation of toxic adversarial examples. Two novel goal function designs allow ToxicTrap to identify weaknesses in both multiclass and multilabel toxic language detectors. Our empirical results show that SOTA toxicity text classifiers are indeed vulnerable to the proposed attacks, attaining over 98% attack success rates in multilabel cases. We also show how vanilla adversarial training and its improved version can help increase the robustness of a toxicity detector even against unseen attacks.
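To make the attack idea concrete, below is a minimal sketch of a greedy word-level perturbation attack with an untargeted "predict benign" goal, in the spirit of what the abstract describes. This is not the authors' ToxicTrap implementation: the toy toxicity scorer, the synonym table, and the 0.5 decision threshold are hypothetical placeholders standing in for a real trained detector and its perturbation candidates.

```python
# Hedged sketch of a greedy word-level attack against a toxicity detector.
# toxic_score(), SYNONYMS, and the threshold are illustrative stand-ins,
# not the paper's actual model, transformation set, or goal function.

from typing import Dict, List


def toxic_score(text: str) -> float:
    """Stand-in for a classifier's P(toxic | text); a real attack queries a trained model."""
    toxic_words = {"idiot", "stupid", "trash"}
    words = text.lower().split()
    hits = sum(w in toxic_words for w in words)
    return min(1.0, hits / max(len(words), 1) * 3)


# Hypothetical word-level perturbation candidates (a real attack would draw
# these from, e.g., an embedding-based synonym space).
SYNONYMS: Dict[str, List[str]] = {
    "idiot": ["1diot", "goof"],
    "stupid": ["stup1d", "silly"],
    "trash": ["tr@sh", "junk"],
}


def goal_reached(text: str, threshold: float = 0.5) -> bool:
    """Untargeted goal: the detector no longer flags the text as toxic."""
    return toxic_score(text) < threshold


def greedy_attack(text: str, max_swaps: int = 5) -> str:
    """Greedily apply the word swap that lowers the toxic score the most."""
    words = text.split()
    for _ in range(max_swaps):
        if goal_reached(" ".join(words)):
            break
        best_score, best_edit = toxic_score(" ".join(words)), None
        for i, w in enumerate(words):
            for cand in SYNONYMS.get(w.lower(), []):
                trial = words[:i] + [cand] + words[i + 1:]
                score = toxic_score(" ".join(trial))
                if score < best_score:
                    best_score, best_edit = score, (i, cand)
        if best_edit is None:  # no candidate swap improves the score; stop early
            break
        i, cand = best_edit
        words[i] = cand
    return " ".join(words)


if __name__ == "__main__":
    original = "you are a stupid idiot"
    adversarial = greedy_attack(original)
    print(original, "->", adversarial, "| toxic score:", toxic_score(adversarial))
```

For a multilabel detector, the same greedy loop would apply, with the goal function instead checking that every toxic label's score falls below its threshold; for adversarial training, such generated examples would be added back to the training set with their original (toxic) labels.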