A Question Answering Benchmark Database for Hungarian
Attila Novák, Borbála Novák, Tamás Zombori, Gergő Szabó, Zsolt Szántó, Richárd Farkas
The 17th Linguistic Annotation Workshop (LAW-XVII) @ ACL 2023. Long paper (8 pages).
TLDR:
MILQA is a new question answering benchmark database for Hungarian that follows the principles of the English SQuAD 2.0 but adds yes/no questions, multi-span list answers, long answers, questions requiring calculation, and other non-extractive question types. Baseline experiments show BM25 outperforming vector-based retrieval and cross-lingual transfer from English improving answer span extraction.
Abstract:
Within the research presented in this article, we created a new question answering benchmark database for Hungarian called MILQA. When creating the dataset, we largely followed the principles of the English SQuAD 2.0; however, like some more recent English question answering datasets, we introduced a number of innovations beyond SQuAD: e.g., yes/no questions, list-like answers consisting of several text spans, long answers, questions requiring calculation, and other question types where the answer cannot simply be copied from the text. For all these non-extractive question types, the pragmatically adequate form of the answer was also added to make the training of generative models possible. We implemented and evaluated a set of baseline retrieval and answer span extraction models on the dataset. BM25 performed better than any vector-based solution for retrieval, and cross-lingual transfer from English significantly improved the span extraction models.
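To make the two baseline stages described in the abstract concrete, the following is a minimal sketch of a retrieve-then-extract pipeline: BM25 passage retrieval followed by extractive answer span prediction with a multilingual QA model. This is not the authors' code; the toy passages, the question, and the model checkpoint (deepset/xlm-roberta-base-squad2) are illustrative assumptions, not part of MILQA.

```python
# Illustrative sketch only: BM25 retrieval + extractive span prediction.
# The corpus, question, and model name are assumptions for demonstration.
from rank_bm25 import BM25Okapi
from transformers import pipeline

# Toy Hungarian passage collection standing in for the benchmark contexts.
passages = [
    "Budapest Magyarország fővárosa és legnagyobb városa.",
    "A Duna Európa második leghosszabb folyója.",
]
question = "Mi Magyarország fővárosa?"

# Retrieval stage: BM25 over whitespace-tokenized passages.
tokenized = [p.lower().split() for p in passages]
bm25 = BM25Okapi(tokenized)
scores = bm25.get_scores(question.lower().split())
best_passage = passages[max(range(len(passages)), key=scores.__getitem__)]

# Span extraction stage: multilingual extractive QA model (assumed checkpoint).
qa = pipeline(
    "question-answering",
    model="deepset/xlm-roberta-base-squad2",
)
result = qa(question=question, context=best_passage)
print(best_passage)
print(result["answer"], result["score"])
```

In the paper's setting, the retriever selects candidate contexts from the full collection and the reader extracts an answer span (or abstains, for unanswerable questions); the sketch above only mirrors that structure on toy data.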