CAME: Confidence-guided Adaptive Memory Efficient Optimization

Yang Luo; Xiaozhe REN; Zangwei Zheng; ZHUO JIANG; Xin Jiang; Yang You

CAME: Confidence-guided Adaptive Memory Efficient Optimization

Yang Luo, Xiaozhe REN, Zangwei Zheng, ZHUO JIANG, Xin Jiang, Yang You

📝 Paper

Anthology

Underline 🪧 Poster 📺 Watch Video on Underline Add to Favorites

Main: Machine Learning for NLP Main-poster Paper

Session 4: Machine Learning for NLP (Virtual Poster)

Conference Room: Pier 7&8

Conference Time: July 11, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 11, Session 4 (15:00-16:30 UTC)

Keywords: optimization methods

TLDR: Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent performance in the training of large language models. Nevertheless, the need for adaptivity requires maintaining second-moment estimates of the per-parameter gradients, which entails a high cost of extra memory overheads. ...

You can open the #paper-P1856 channel in a separate window.

Abstract: Adaptive gradient methods, such as Adam and LAMB, have demonstrated excellent performance in the training of large language models. Nevertheless, the need for adaptivity requires maintaining second-moment estimates of the per-parameter gradients, which entails a high cost of extra memory overheads. To solve this problem, several memory-efficient optimizers (e.g., Adafactor) have been proposed to obtain a drastic reduction in auxiliary memory usage, but with a performance penalty. In this paper, we first study a confidence-guided strategy to reduce the instability of existing memory efficient optimizers. Based on this strategy, we propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods. Extensive experiments demonstrate the training stability and superior performance of CAME across various NLP tasks such as BERT and GPT-2 training. Notably, for BERT pre-training on the large batch size of 32,768, our proposed optimizer attains faster convergence and higher accuracy compared with the Adam optimizer. The implementation of CAME is publicly available.