[SRW] A State-Vector Framework For Dataset Effects

Esmat Sahak, Zining Zhu, Frank Rudzicz

Student Research Workshop (SRW) Paper

Session 5: Student Research Workshop (Poster)
Conference Room: Frontenac Ballroom and Queen's Quay
Conference Time: July 11, 16:15-17:45 (EDT) (America/Toronto)
Global Time: July 11, Session 5 (20:15-21:45 UTC)
Abstract: The impressive recent performance of DNN-based systems can be partially attributed to high-quality datasets, often several used at once. However, the effects of these datasets, especially how they interact with each other, are not well studied. We propose a state-vector framework to quantify the effect of the data itself. This framework uses idealized probing task results as the bases of a vector space, which allows us to quantify both the individual and the interaction effects of datasets. We show that the significant effects of some commonly used language understanding datasets are characteristic and are concentrated on a few linguistic dimensions. Additionally, we observe some "spill-over" effects: the datasets can impact the models along dimensions that might appear irrelevant to the intended tasks. Our state-vector framework provides a systematic approach to studying the effects of data, a crucial component of responsible model development.
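
As a rough illustration of the idea in the abstract, the sketch below treats a model's probing results as a vector, one coordinate per probing task, and reads a dataset's effect as the displacement of that vector. The abstract does not give formulas, so the definitions here are assumptions: the individual effect is taken as "state after training on the dataset minus the baseline state", and the interaction effect as the part of the joint effect that is not the sum of the two individual effects. The probe names and all numbers are hypothetical toy values, not results from the paper.

```python
from typing import List

# Hypothetical probing-task names; each one is an axis of the state space.
PROBES: List[str] = ["probe_a", "probe_b", "probe_c"]

StateVector = List[float]  # one probing score per entry in PROBES


def effect(after: StateVector, before: StateVector) -> StateVector:
    """Assumed individual effect: per-dimension change in probing score."""
    return [a - b for a, b in zip(after, before)]


def interaction(
    base: StateVector,
    with_x: StateVector,
    with_y: StateVector,
    with_both: StateVector,
) -> StateVector:
    """Assumed interaction effect: joint effect minus the sum of the
    two individual effects, computed per probing dimension."""
    e_x = effect(with_x, base)
    e_y = effect(with_y, base)
    e_xy = effect(with_both, base)
    return [xy - (x + y) for xy, x, y in zip(e_xy, e_x, e_y)]


# Toy probing scores (hypothetical, chosen only to make the arithmetic visible).
base = [0.5, 0.5, 0.5]        # model before seeing either dataset
with_x = [0.75, 0.5, 0.5]     # after training on dataset X only
with_y = [0.5, 0.625, 0.5]    # after training on dataset Y only
with_both = [0.75, 0.75, 0.5] # after training on X and Y together

effect_x = effect(with_x, base)
joint_minus_sum = interaction(base, with_x, with_y, with_both)

for name, e, i in zip(PROBES, effect_x, joint_minus_sum):
    print(f"{name}: effect of X = {e:+.3f}, interaction of X and Y = {i:+.3f}")
```

In this toy setup, dataset X moves only the first dimension, while the nonzero interaction on the second dimension shows a change that neither dataset accounts for additively. A nonzero individual effect on a dimension unrelated to a dataset's intended task would correspond to the "spill-over" the abstract mentions.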