[Industry] Towards Building a Robust Toxicity Predictor

Dmitriy Bespalov, Sourav Bhabesh, Yi Xiang, Liutong Zhou, Yanjun Qi

Industry: Industry Paper

Session 4: Industry (Virtual Poster)
Conference Room: Pier 7&8
Conference Time: July 11, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 11, Session 4 (15:00-16:30 UTC)
TLDR: Recent NLP literature pays little attention to the robustness of toxicity language predictors, while these systems are most likely to be used in adversarial contexts. This paper presents a novel adversarial attack, ToxicTrap, introducing small word-level perturbations to fo...
Abstract: Recent NLP literature pays little attention to the robustness of toxicity language predictors, while these systems are most likely to be used in adversarial contexts. This paper presents a novel adversarial attack, ToxicTrap, introducing small word-level perturbations to fool SOTA text classifiers into predicting toxic text samples as benign. ToxicTrap exploits greedy-based search strategies to enable fast and effective generation of toxic adversarial examples. Two novel goal function designs allow ToxicTrap to identify weaknesses in both multiclass and multilabel toxic language detectors. Our empirical results show that SOTA toxicity text classifiers are indeed vulnerable to the proposed attacks, attaining over 98% attack success rates in multilabel cases. We also show how vanilla adversarial training and its improved version can help increase the robustness of a toxicity detector even against unseen attacks.
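To make the attack idea concrete, below is a minimal sketch of a greedy word-level perturbation attack with an untargeted "predict benign" goal, in the spirit of what the abstract describes. This is not the authors' ToxicTrap implementation: the toy toxicity scorer, the synonym table, and the 0.5 decision threshold are hypothetical placeholders standing in for a real trained detector and its perturbation candidates.

```python
# Hedged sketch of a greedy word-level attack against a toxicity detector.
# toxic_score(), SYNONYMS, and the threshold are illustrative stand-ins,
# not the paper's actual model, transformation set, or goal function.

from typing import Dict, List


def toxic_score(text: str) -> float:
    """Stand-in for a classifier's P(toxic | text); a real attack queries a trained model."""
    toxic_words = {"idiot", "stupid", "trash"}
    words = text.lower().split()
    hits = sum(w in toxic_words for w in words)
    return min(1.0, hits / max(len(words), 1) * 3)


# Hypothetical word-level perturbation candidates (a real attack would draw
# these from, e.g., an embedding-based synonym space).
SYNONYMS: Dict[str, List[str]] = {
    "idiot": ["1diot", "goof"],
    "stupid": ["stup1d", "silly"],
    "trash": ["tr@sh", "junk"],
}


def goal_reached(text: str, threshold: float = 0.5) -> bool:
    """Untargeted goal: the detector no longer flags the text as toxic."""
    return toxic_score(text) < threshold


def greedy_attack(text: str, max_swaps: int = 5) -> str:
    """Greedily apply the word swap that lowers the toxic score the most."""
    words = text.split()
    for _ in range(max_swaps):
        if goal_reached(" ".join(words)):
            break
        best_score, best_edit = toxic_score(" ".join(words)), None
        for i, w in enumerate(words):
            for cand in SYNONYMS.get(w.lower(), []):
                trial = words[:i] + [cand] + words[i + 1:]
                score = toxic_score(" ".join(trial))
                if score < best_score:
                    best_score, best_edit = score, (i, cand)
        if best_edit is None:  # no candidate swap improves the score; stop early
            break
        i, cand = best_edit
        words[i] = cand
    return " ".join(words)


if __name__ == "__main__":
    original = "you are a stupid idiot"
    adversarial = greedy_attack(original)
    print(original, "->", adversarial, "| toxic score:", toxic_score(adversarial))
```

For a multilabel detector, the same greedy loop would apply, with the goal function instead checking that every toxic label's score falls below its threshold; for adversarial training, such generated examples would be added back to the training set with their original (toxic) labels.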