HyperMixer: An MLP-based Low Cost Alternative to Transformers

Florian Mai; Arnaud Pannatier; Fabio J Fehr; Haolin Chen; Francois Marelli; Francois Fleuret; James Henderson

Main: Machine Learning for NLP Main-poster Paper

Poster Session 1: Machine Learning for NLP (Poster)

Conference Room: Frontenac Ballroom and Queen's Quay

Conference Time: July 10, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 10, Poster Session 1 (15:00-16:30 UTC)

Keywords: representation learning

Abstract: Transformer-based architectures are the model of choice for natural language understanding, but they come at a significant cost, as they have quadratic complexity in the input length, require a lot of training data, and can be difficult to tune. In the pursuit of lower costs, we investigate simple MLP-based architectures. We find that existing architectures such as MLPMixer, which achieves token mixing through a static MLP applied to each feature independently, are too detached from the inductive biases required for natural language understanding. In this paper, we propose a simple variant, HyperMixer, which forms the token mixing MLP dynamically using hypernetworks. Empirically, we demonstrate that our model performs better than alternative MLP-based models, and on par with Transformers. In contrast to Transformers, HyperMixer achieves these results at substantially lower costs in terms of processing time, training data, and hyperparameter tuning.
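
To make the contrast in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single token-mixing step with a GELU nonlinearity, and it assumes the hypernetworks are plain linear maps applied to each token; the names `H1`, `H2`, `W1_static`, `W2_static`, and the sizes `d` and `h` are hypothetical. Position information, layer normalization, and the feature-mixing MLP are left out.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def static_token_mixing(X, W1, W2):
    """Static token mixing in the style the abstract attributes to MLPMixer.

    X: (N, d) token representations.
    W1, W2: (N, h) learned weights, fixed after training and tied to length N.
    The same MLP is applied to every feature column of X independently.
    """
    return W2 @ gelu(W1.T @ X)

def hyper_token_mixing(X, H1, H2):
    """Dynamic token mixing: the mixing weights are generated from the input.

    H1, H2: (d, h) hypernetwork parameters (assumed linear here).
    Their shape does not depend on the sequence length N.
    """
    W1 = X @ H1  # (N, h): one row of mixing weights per token
    W2 = X @ H2  # (N, h)
    return W2 @ gelu(W1.T @ X)

rng = np.random.default_rng(0)
d, h = 8, 4  # feature size and hidden size (illustrative values)
H1 = 0.1 * rng.normal(size=(d, h))
H2 = 0.1 * rng.normal(size=(d, h))

# Static weights must be allocated for one specific length, here N = 5.
W1_static = 0.1 * rng.normal(size=(5, h))
W2_static = 0.1 * rng.normal(size=(5, h))

for n in (5, 12):
    X = rng.normal(size=(n, d))
    Y = hyper_token_mixing(X, H1, H2)
    print(n, Y.shape)  # the output keeps the input shape (n, d)
```

In this sketch the generated weights adapt to any sequence length, whereas `static_token_mixing` only accepts inputs whose length matches its weight matrices. Every matrix product here costs O(N · d · h), so the mixing step is linear in the number of tokens N.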