From the perspective of the layer normalization (LN) position, Transformer architectures can be categorized into two types: Post-LN and Pre-LN.
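Concretely, the two variants are commonly written as follows (a standard formulation; here $\mathcal{F}$ denotes a sub-layer such as self-attention or a feed-forward network, and $x_l$ the input to the $l$-th sub-layer, notation we introduce for illustration):
$$\text{Post-LN: } x_{l+1} = \mathrm{LN}\bigl(x_l + \mathcal{F}(x_l)\bigr), \qquad \text{Pre-LN: } x_{l+1} = x_l + \mathcal{F}\bigl(\mathrm{LN}(x_l)\bigr).$$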
Recent Transformers tend to adopt Pre-LN because training Post-LN with deep Transformers, e.g., ten or more layers, often becomes unstable, resulting in useless models.
In contrast, however, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers, e.g., six or fewer layers.
This study first investigates the reasons for these discrepant observations empirically and theoretically, and discovers that (1) the LN in Post-LN is the main source of the vanishing gradient problem that leads to unstable training, whereas Pre-LN prevents it, and (2) Post-LN tends to preserve larger gradient norms in higher layers during back-propagation, which may lead to more effective training.
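To sketch why the LN position matters for gradient flow (a simplified view using the notation introduced above, not the paper's full analysis), the per-layer Jacobians are
$$\text{Pre-LN: } \frac{\partial x_{l+1}}{\partial x_l} = I + \frac{\partial \mathcal{F}(\mathrm{LN}(x_l))}{\partial x_l}, \qquad \text{Post-LN: } \frac{\partial x_{l+1}}{\partial x_l} = \frac{\partial \mathrm{LN}(y_l)}{\partial y_l}\left(I + \frac{\partial \mathcal{F}(x_l)}{\partial x_l}\right), \quad y_l = x_l + \mathcal{F}(x_l).$$
The identity term in Pre-LN gives back-propagated gradients a direct path to lower layers, whereas in Post-LN every layer's gradient is rescaled by the Jacobian of LN, which can repeatedly shrink it as depth grows.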
Exploiting these findings, we propose a method that provides both high stability and effective training through a simple modification of Post-LN.
We conduct experiments on a wide range of text generation tasks and demonstrate that our method outperforms Pre-LN and enables stable training in both shallow and deep layer settings.