Boosting Text Augmentation via Hybrid Instance Filtering Framework
Heng Yang, Ke Li
Findings: Machine Learning for NLP Findings Paper
Session 1: Machine Learning for NLP (Virtual Poster)
Conference Room: Pier 7&8
Conference Time: July 10, 11:00-12:30 (EDT) (America/Toronto)
Global Time: July 10, Session 1 (15:00-16:30 UTC)
TLDR:
Text augmentation is an effective technique for addressing the problem of insufficient data in natural language processing. However, existing text augmentation methods tend to focus on few-shot scenarios and usually perform poorly on large public datasets. Our research indicates that existing augmen...
You can open the
#paper-P5464
channel in a separate window.
Abstract:
Text augmentation is an effective technique for addressing the problem of insufficient data in natural language processing. However, existing text augmentation methods tend to focus on few-shot scenarios and usually perform poorly on large public datasets. Our research indicates that existing augmentation methods often generate instances with shifted feature spaces, which leads to a drop in performance on the augmented data (for example, EDA generally loses approximately 2\% in aspect-based sentiment classification). To address this problem, we propose a hybrid instance-filtering framework (BoostAug) based on pre-trained language models that can maintain a similar feature space with natural datasets. BoostAug is transferable to existing text augmentation methods (such as synonym substitution and back translation) and significantly improves the augmentation performance by 2-3\% in classification accuracy. Our experimental results on three classification tasks and nine public datasets show that BoostAug addresses the performance drop problem and outperforms state-of-the-art text augmentation methods. Additionally, we release the code to help improve existing augmentation methods on large datasets.