Layerwise universal adversarial attack on NLP models

Olga Tsymboi, Danil Malaev, Andrei Petrovskii, Ivan Oseledets

Findings Paper: Interpretability and Analysis of Models for NLP

Session 1: Interpretability and Analysis of Models for NLP (Virtual Poster)
Conference Room: Pier 7&8
Conference Time: July 10, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 10, Session 1 (15:00-16:30 UTC)
Spotlight Session: Metropolitan West
Conference Room: Metropolitan West
Conference Time: July 10, 19:00-21:00 (EDT) (America/Toronto)
Global Time: July 10, Spotlight Session (23:00-01:00 UTC)
Keywords: adversarial attacks/examples/training
Abstract: In this work, we examine the vulnerability of language models to universal adversarial triggers (UATs). We propose a new white-box approach to the construction of layerwise UATs (LUATs), which searches for triggers by perturbing hidden layers of a network. On three transformer models and three datasets from the GLUE benchmark, we demonstrate that our method provides better transferability in a model-to-model setting, with an average gain of 9.3% in the fooling rate over the baseline. Moreover, we investigate trigger transferability in the task-to-task setting. Using small subsets of datasets similar to the target tasks to choose the perturbed layer, we show that LUATs are more efficient than vanilla UATs by 7.1% in the fooling rate.
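
The abstract describes the search only at a high level. As a rough illustration, the sketch below shows one way a layerwise trigger search of this kind could be implemented with PyTorch and Hugging Face Transformers: a fixed trigger prepended to every input is updated with a HotFlip-style gradient step so that it maximally perturbs a chosen hidden layer. The checkpoint name, layer index, trigger length, number of iterations, and the L2 perturbation objective at the [CLS] position are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a layerwise universal-trigger search (not the
# paper's exact procedure): a short trigger is prepended to every input and
# updated greedily so that it maximally perturbs a chosen hidden layer.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "textattack/bert-base-uncased-SST-2"        # assumed victim model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, output_hidden_states=True
).eval()
model.requires_grad_(False)                              # gradients only w.r.t. inputs
embedding_matrix = model.get_input_embeddings().weight   # (vocab_size, hidden_dim)

sentences = ["a gorgeous, witty, seductive movie.",
             "the film is strictly routine."]            # toy batch of inputs
num_trigger_tokens, layer_idx = 3, 6                     # assumed hyperparameters
trigger_ids = torch.full((num_trigger_tokens,),
                         tokenizer.mask_token_id, dtype=torch.long)

enc = tokenizer(sentences, return_tensors="pt", padding=True)
input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]
batch_size = input_ids.size(0)

# Insert the trigger right after [CLS] and extend the attention mask.
ids_adv = torch.cat([input_ids[:, :1],
                     trigger_ids.unsqueeze(0).expand(batch_size, -1),
                     input_ids[:, 1:]], dim=1)
mask_adv = torch.cat([attention_mask[:, :1].expand(-1, num_trigger_tokens),
                      attention_mask], dim=1)

with torch.no_grad():                                    # clean reference activations
    clean_hidden = model(input_ids=input_ids,
                         attention_mask=attention_mask).hidden_states[layer_idx]

for _ in range(10):                                      # greedy search iterations
    ids_adv[:, 1:1 + num_trigger_tokens] = trigger_ids
    embeds = model.get_input_embeddings()(ids_adv).detach().requires_grad_(True)
    adv_hidden = model(inputs_embeds=embeds,
                       attention_mask=mask_adv).hidden_states[layer_idx]
    # Objective to maximize: push the chosen layer away from its clean value,
    # compared at the [CLS] position (whose index is unaffected by the insertion).
    distance = torch.norm(adv_hidden[:, 0] - clean_hidden[:, 0], dim=-1).mean()
    distance.backward()
    grad = embeds.grad[:, 1:1 + num_trigger_tokens].sum(dim=0)  # (T, hidden_dim)
    # HotFlip step: first-order score of every vocabulary token per trigger slot.
    trigger_ids = (grad @ embedding_matrix.T).argmax(dim=-1)

print("trigger:", tokenizer.decode(trigger_ids))
```

The sketch operates on a single toy batch and replaces all trigger slots at once; the paper's method would additionally involve choosing the perturbed layer and evaluating candidate triggers over a dataset, which is omitted here for brevity.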