A Question Answering Benchmark Database for Hungarian
Attila Novák, Borbála Novák, Tamás Zombori, Gergő Szabó, Zsolt Szántó, Richárd Farkas
The 17th Linguistic Annotation Workshop (LAW-XVII) @ ACL 2023. Long paper (8 pages).
TLDR:
MILQA is a new question answering benchmark database for Hungarian that follows the principles of the English SQuAD 2.0 but adds yes/no questions, multi-span list answers, long answers, questions requiring calculation, and other non-extractive question types. Baseline experiments show BM25 outperforming vector-based retrieval and cross-lingual transfer from English improving answer span extraction.
Abstract:
Within the research presented in this article, we created a new question answering benchmark database for Hungarian called MILQA. When creating the dataset, we largely followed the principles of the English SQuAD 2.0; however, like some more recent English question answering datasets, we introduced a number of innovations beyond SQuAD: e.g., yes/no questions, list-like answers consisting of several text spans, long answers, questions requiring calculation, and other question types where the answer cannot simply be copied from the text. For all these non-extractive question types, the pragmatically adequate form of the answer was also added to make the training of generative models possible. We implemented and evaluated a set of baseline retrieval and answer span extraction models on the dataset. BM25 performed better than any vector-based solution for retrieval, and cross-lingual transfer from English significantly improved the span extraction models.
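To make the two baseline stages described in the abstract concrete, the following is a minimal sketch of a retrieve-then-extract pipeline: BM25 passage retrieval followed by extractive answer span prediction with a multilingual QA model. This is not the authors' code; the toy passages, the question, and the model checkpoint (deepset/xlm-roberta-base-squad2) are illustrative assumptions, not part of MILQA.

```python
# Illustrative sketch only: BM25 retrieval + extractive span prediction.
# The corpus, question, and model name are assumptions for demonstration.
from rank_bm25 import BM25Okapi
from transformers import pipeline

# Toy Hungarian passage collection standing in for the benchmark contexts.
passages = [
    "Budapest Magyarország fővárosa és legnagyobb városa.",
    "A Duna Európa második leghosszabb folyója.",
]
question = "Mi Magyarország fővárosa?"

# Retrieval stage: BM25 over whitespace-tokenized passages.
tokenized = [p.lower().split() for p in passages]
bm25 = BM25Okapi(tokenized)
scores = bm25.get_scores(question.lower().split())
best_passage = passages[max(range(len(passages)), key=scores.__getitem__)]

# Span extraction stage: multilingual extractive QA model (assumed checkpoint).
qa = pipeline(
    "question-answering",
    model="deepset/xlm-roberta-base-squad2",
)
result = qa(question=question, context=best_passage)
print(best_passage)
print(result["answer"], result["score"])
```

In the paper's setting, the retriever selects candidate contexts from the full collection and the reader extracts an answer span (or abstains, for unanswerable questions); the sketch above only mirrors that structure on toy data.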