Better Language Models of Code through Self-Improvement

Hung Quoc To; Nghi D. Q. Bui; Jin L.C. Guo; Tien N Nguyen

Findings: NLP Applications Findings Paper

Session 7: NLP Applications (Virtual Poster)

Conference Room: Pier 7&8

Conference Time: July 12, 11:00-12:30 (EDT) (America/Toronto)

Global Time: July 12, Session 7 (15:00-16:30 UTC)

Keywords: code generation and understanding

Abstract: Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is limited by the size of the dataset provided. We aim to alleviate this issue by proposing a data augmentation framework based on knowledge distillation. Our framework uses the knowledge gained during the pre-training and fine-tuning stages to augment the training data, which is then used in the next training step. We incorporate this framework into state-of-the-art language models such as CodeT5, CodeBERT, and UniXcoder. The results show that our framework significantly improves PLMCs' performance on sequence-generation tasks in the CodeXGLUE benchmark, such as code summarization and code generation.
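The augmentation step described in the abstract can be illustrated with a minimal Python sketch. It is one plausible reading, not the authors' implementation: it assumes a fine-tuned model is exposed as a `generate` callable that returns candidate outputs for an input, and that one candidate per training example is kept by comparing it to the ground-truth target. The names `token_overlap`, `augment`, and `generate`, as well as the overlap-based selection rule, are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sequence, target sequence)


def token_overlap(candidate: str, reference: str) -> float:
    """Hypothetical similarity: fraction of reference tokens present in the candidate."""
    ref_tokens = reference.split()
    if not ref_tokens:
        return 0.0
    cand_tokens = set(candidate.split())
    return sum(tok in cand_tokens for tok in ref_tokens) / len(ref_tokens)


def augment(
    train: List[Pair],
    generate: Callable[[str], List[str]],
    score: Callable[[str, str], float] = token_overlap,
) -> List[Pair]:
    """Build pseudo pairs from a fine-tuned model's own predictions.

    `generate` stands in for the fine-tuned model (an assumption of this
    sketch). For each training source, the candidate closest to the
    ground-truth target is kept as a pseudo target.
    """
    pseudo: List[Pair] = []
    for source, target in train:
        candidates = generate(source)
        if not candidates:
            continue  # nothing generated for this example, so skip it
        best = max(candidates, key=lambda c: score(c, target))
        pseudo.append((source, best))
    return pseudo


# Toy usage with a stand-in "model" that always proposes the same two summaries.
toy_train = [("def add(a, b): return a + b", "Add two numbers")]
toy_generate = lambda source: ["Add numbers", "Subtract two values"]
toy_pseudo = augment(toy_train, toy_generate)
```

In this sketch the pseudo pairs would be combined with, or used in place of, the original pairs for a further round of fine-tuning; which of the two the framework does is not stated in the abstract.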