Teaching Small Language Models to Reason

Lucie Charlotte Magister, Jonathan Mallinson, Jakub Dominik Adamek, Eric Malmi, Aliaksei Severyn

Main: Machine Learning for NLP Main-poster Paper

Session 7: Machine Learning for NLP (Virtual Poster)
Conference Room: Pier 7&8
Conference Time: July 12, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 12, Session 7 (15:00-16:30 UTC)
Keywords: few-shot learning
Abstract: Chain-of-thought prompting successfully improves the reasoning capabilities of large language models, achieving state-of-the-art results on a range of datasets. However, these reasoning capabilities appear to emerge only in models with at least tens of billions of parameters. In this paper, we explore the transfer of such reasoning capabilities to smaller models via knowledge distillation, and we also investigate the trade-off between model size and dataset size. Specifically, we finetune a student model on the chain-of-thought outputs generated by a larger teacher model. Our experiments show that the proposed method improves task performance across arithmetic, commonsense, and symbolic reasoning datasets. For example, the accuracy of T5 XXL on GSM8K improves from 8.11% to 21.99% and 18.42% when finetuned on chains of thought generated by PaLM 540B and GPT-3 175B, respectively.
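The finetuning step described in the abstract can be illustrated with a minimal sketch of how teacher generations might be turned into training pairs for a student model. This is only an illustration under stated assumptions: the names `TeacherSample`, `to_student_example`, and `build_distillation_set` are hypothetical, and the target layout (reasoning followed by "The answer is ...") is an assumed format, not one confirmed by the abstract.

```python
from dataclasses import dataclass


@dataclass
class TeacherSample:
    """One teacher generation: a question plus the chain of thought it produced."""
    question: str
    chain_of_thought: str  # reasoning text generated by the larger teacher model
    answer: str            # final answer stated by the teacher


def to_student_example(sample: TeacherSample) -> dict[str, str]:
    """Turn a teacher generation into an (input, target) pair for seq2seq finetuning.

    ASSUMPTION: the target layout below is chosen for illustration only.
    """
    reasoning = sample.chain_of_thought.strip()
    target = f"{reasoning} The answer is {sample.answer.strip()}."
    return {"input": sample.question.strip(), "target": target}


def build_distillation_set(samples: list[TeacherSample]) -> list[dict[str, str]]:
    """Format every teacher generation, skipping any with an empty chain of thought."""
    return [to_student_example(s) for s in samples if s.chain_of_thought.strip()]


if __name__ == "__main__":
    # Made-up example question, not taken from any dataset named in the abstract.
    demo = [
        TeacherSample(
            question="Tom has 3 apples and buys 2 more. How many apples does he have?",
            chain_of_thought="Tom starts with 3 apples. He buys 2 more, so 3 + 2 = 5.",
            answer="5",
        )
    ]
    for pair in build_distillation_set(demo):
        print(pair["input"])
        print(pair["target"])
```

The resulting pairs could then be passed to any standard sequence-to-sequence finetuning loop for the student. The student learns to emit the reasoning text before the answer, which is the sense in which the teacher's chain of thought is distilled.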