Layerwise universal adversarial attack on NLP models

Olga Tsymboi, Danil Malaev, Andrei Petrovskii, Ivan Oseledets

Findings Paper: Interpretability and Analysis of Models for NLP

Session 1: Interpretability and Analysis of Models for NLP (Virtual Poster)
Conference Room: Pier 7&8
Conference Time: July 10, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 10, Session 1 (15:00-16:30 UTC)
Spotlight Session: Metropolitan West
Conference Room: Metropolitan West
Conference Time: July 10, 19:00-21:00 (EDT) (America/Toronto)
Global Time: July 10, Spotlight Session (23:00-01:00 UTC)
Keywords: adversarial attacks/examples/training
Abstract: In this work, we examine the vulnerability of language models to universal adversarial triggers (UATs). We propose a new white-box approach to the construction of layerwise UATs (LUATs), which searches for triggers by perturbing hidden layers of a network. On three transformer models and three datasets from the GLUE benchmark, we demonstrate that our method provides better transferability in a model-to-model setting, with an average gain of 9.3% in the fooling rate over the baseline. Moreover, we investigate trigger transferability in the task-to-task setting. Using small subsets of datasets similar to the target tasks to choose the perturbed layer, we show that LUATs are more efficient than vanilla UATs by 7.1% in the fooling rate.
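
The abstract describes the search only at a high level. As a rough illustration, the sketch below shows one way a layerwise trigger search of this kind could be implemented with PyTorch and Hugging Face Transformers: a fixed trigger prepended to every input is updated with a HotFlip-style gradient step so that it maximally perturbs a chosen hidden layer. The checkpoint name, layer index, trigger length, number of iterations, and the L2 perturbation objective at the [CLS] position are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a layerwise universal-trigger search (not the
# paper's exact procedure): a short trigger is prepended to every input and
# updated greedily so that it maximally perturbs a chosen hidden layer.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "textattack/bert-base-uncased-SST-2"        # assumed victim model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, output_hidden_states=True
).eval()
model.requires_grad_(False)                              # gradients only w.r.t. inputs
embedding_matrix = model.get_input_embeddings().weight   # (vocab_size, hidden_dim)

sentences = ["a gorgeous, witty, seductive movie.",
             "the film is strictly routine."]            # toy batch of inputs
num_trigger_tokens, layer_idx = 3, 6                     # assumed hyperparameters
trigger_ids = torch.full((num_trigger_tokens,),
                         tokenizer.mask_token_id, dtype=torch.long)

enc = tokenizer(sentences, return_tensors="pt", padding=True)
input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]
batch_size = input_ids.size(0)

# Insert the trigger right after [CLS] and extend the attention mask.
ids_adv = torch.cat([input_ids[:, :1],
                     trigger_ids.unsqueeze(0).expand(batch_size, -1),
                     input_ids[:, 1:]], dim=1)
mask_adv = torch.cat([attention_mask[:, :1].expand(-1, num_trigger_tokens),
                      attention_mask], dim=1)

with torch.no_grad():                                    # clean reference activations
    clean_hidden = model(input_ids=input_ids,
                         attention_mask=attention_mask).hidden_states[layer_idx]

for _ in range(10):                                      # greedy search iterations
    ids_adv[:, 1:1 + num_trigger_tokens] = trigger_ids
    embeds = model.get_input_embeddings()(ids_adv).detach().requires_grad_(True)
    adv_hidden = model(inputs_embeds=embeds,
                       attention_mask=mask_adv).hidden_states[layer_idx]
    # Objective to maximize: push the chosen layer away from its clean value,
    # compared at the [CLS] position (whose index is unaffected by the insertion).
    distance = torch.norm(adv_hidden[:, 0] - clean_hidden[:, 0], dim=-1).mean()
    distance.backward()
    grad = embeds.grad[:, 1:1 + num_trigger_tokens].sum(dim=0)  # (T, hidden_dim)
    # HotFlip step: first-order score of every vocabulary token per trigger slot.
    trigger_ids = (grad @ embedding_matrix.T).argmax(dim=-1)

print("trigger:", tokenizer.decode(trigger_ids))
```

The sketch operates on a single toy batch and replaces all trigger slots at once; the paper's method would additionally involve choosing the perturbed layer and evaluating candidate triggers over a dataset, which is omitted here for brevity.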