Small Pre-trained Language Models Can be Fine-tuned as Large Models via Over-Parameterization
Ze-Feng Gao, Kun Zhou, Peiyu Liu, Wayne Xin Zhao, Ji-Rong Wen
Main Track: Machine Learning for NLP (Oral Paper)
Session 2: Machine Learning for NLP (Oral)
Conference Room: Metropolitan Centre
Conference Time: July 10, 14:00-15:30 (EDT) (America/Toronto)
Global Time: July 10, Session 2 (18:00-19:30 UTC)
Keywords:
model compression methods
TLDR:
By scaling up the model size, large pre-trained language models (PLMs) have shown remarkable performance on various natural language processing tasks, mostly outperforming small PLMs by a large margin.
However, the huge number of parameters also restricts the applicability of large PLMs in real-world systems due to the high computational cost.
Abstract:
By scaling up the model size, large pre-trained language models (PLMs) have shown remarkable performance on various natural language processing tasks, mostly outperforming small PLMs by a large margin.
However, the huge number of parameters also restricts the applicability of large PLMs in real-world systems due to the high computational cost.
In this paper, we focus on scaling up the parameters of PLMs only during fine-tuning, to benefit from over-parameterization without increasing the inference latency. Given a relatively small PLM, we over-parameterize it with the matrix product operator (MPO), an efficient and almost lossless decomposition that factorizes each of its parameter matrices into a set of higher-dimensional tensors.
For efficiency, we further propose both static and dynamic strategies to select the most important parameter matrices for over-parameterization.
Extensive experiments demonstrate that our approach significantly boosts the fine-tuning performance of small PLMs and can even help them outperform larger models with 3x as many parameters.
Our code is publicly available at https://github.com/zfgao66/OPF.
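For intuition about the decomposition step described in the abstract, below is a minimal NumPy sketch of an MPO (tensor-train style) factorization: a weight matrix is reshaped and split into a chain of small cores via sequential SVDs, and can be contracted back almost losslessly. This is an illustrative assumption-based sketch, not the authors' released implementation (see the repository above for that); the function names (mpo_decompose, mpo_reconstruct), the dimension factorizations, and the truncation threshold eps are chosen purely for demonstration.

import numpy as np

def mpo_decompose(W, in_dims, out_dims, eps=1e-9):
    # Factorize a weight matrix W (I x J) into a chain of MPO cores via
    # sequential SVDs. in_dims and out_dims factorize I and J, i.e.
    # prod(in_dims) == I and prod(out_dims) == J.
    # Returns cores of shape (r_{k-1}, in_k, out_k, r_k).
    assert np.prod(in_dims) == W.shape[0] and np.prod(out_dims) == W.shape[1]
    n = len(in_dims)
    # Reshape to (i1, ..., in, j1, ..., jn), then interleave input/output modes.
    T = W.reshape(list(in_dims) + list(out_dims))
    T = T.transpose([x for k in range(n) for x in (k, n + k)])
    cores, rank = [], 1
    for k in range(n - 1):
        # Split off the (in_k, out_k) pair; the remainder feeds later cores.
        T = T.reshape(rank * in_dims[k] * out_dims[k], -1)
        U, S, Vt = np.linalg.svd(T, full_matrices=False)
        keep = max(1, int(np.sum(S > eps * S[0])))  # near-lossless truncation
        cores.append(U[:, :keep].reshape(rank, in_dims[k], out_dims[k], keep))
        T = S[:keep, None] * Vt[:keep]
        rank = keep
    cores.append(T.reshape(rank, in_dims[-1], out_dims[-1], 1))
    return cores

def mpo_reconstruct(cores, in_dims, out_dims):
    # Contract the cores back into a full matrix (sanity check for losslessness).
    T = cores[0]
    for core in cores[1:]:
        T = np.tensordot(T, core, axes=([-1], [0]))
    n = len(in_dims)
    T = T.squeeze(0).squeeze(-1)  # shape (i1, j1, i2, j2, ..., in, jn)
    T = T.transpose([2 * k for k in range(n)] + [2 * k + 1 for k in range(n)])
    return T.reshape(int(np.prod(in_dims)), int(np.prod(out_dims)))

# Example (illustrative shapes): factorize a 768 x 3072 weight into four cores.
W = np.random.randn(768, 3072)
cores = mpo_decompose(W, in_dims=[4, 4, 6, 8], out_dims=[4, 8, 8, 12], eps=1e-12)
print(np.allclose(W, mpo_reconstruct(cores, [4, 4, 6, 8], [4, 8, 8, 12])))  # expect True

During fine-tuning, the idea is that such cores provide extra trainable parameters; contracting them back before deployment keeps the original matrix shape, so inference latency is unchanged. The static/dynamic selection of which matrices to over-parameterize is described in the paper and is not reproduced here.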