Can NLP Models 'Identify', 'Distinguish', and 'Justify' Questions that Don't have a Definitive Answer?

Ayushi Agarwal, Nisarg Patel, Neeraj Varshney, Mihir Parmar, Pavan Mallina, Aryan Shah, Srihari Raju Sangaraju, Tirth Patel, Nihar Thakkar, Chitta Baral

Paper at the Third Workshop on Trustworthy Natural Language Processing (TrustNLP)

TLDR: Though state-of-the-art (SOTA) NLP systems have achieved remarkable performance on a variety of language understanding tasks, they primarily focus on questions that have a correct and definitive answer. However, in real-world applications, users often ask questions that don't have a definitive answer.
Abstract: Though state-of-the-art (SOTA) NLP systems have achieved remarkable performance on a variety of language understanding tasks, they primarily focus on questions that have a correct and definitive answer. However, in real-world applications, users often ask questions that don't have a definitive answer, such as questions about future events, questions lacking the details necessary to find the answer, and questions that are ambiguous. Incorrectly answering such questions certainly hampers a system's reliability and trustworthiness. Can SOTA models accurately identify such questions and provide a reasonable response? To investigate the above question, we introduce QnotA, a dataset consisting of five different categories of questions that don't have definitive answers. Furthermore, for each QnotA instance, we also provide a corresponding 'QA' instance, i.e., an alternate question that "can be" answered. With this data, we formulate three evaluation tasks that test a system's ability to 'identify', 'distinguish', and 'justify' QnotA questions. Through comprehensive experiments, we show that even SOTA models including GPT-3 and Flan T5 do not fare well on these tasks and lag considerably behind the human performance baseline. We conduct a thorough analysis that leads to several interesting findings; for instance, despite not being able to accurately identify a QnotA question, GPT-3, when prompted to explain why a given QnotA question doesn't have a definitive answer, is able to provide a reasonable justification. Finally, we believe our work and findings will encourage and facilitate the development of more robust NLP systems that can also reasonably respond to questions that don't have a definitive answer.
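The 'identify' task described above amounts to binary classification over paired QnotA/QA questions. A minimal sketch of such an evaluation harness is shown below; the example questions, the `identify` heuristic (a naive keyword stand-in for an actual LLM call), and all names are illustrative assumptions, not the paper's actual dataset or protocol.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    question: str
    label: str  # "QnotA" (no definitive answer) or "QA" (answerable)

# Toy paired examples in the spirit of the dataset (not actual QnotA data):
# a future-event question vs. its answerable counterpart, and an
# underspecified question vs. a version with enough detail.
DATA = [
    Instance("Who will win the 2040 World Cup?", "QnotA"),
    Instance("Who won the 2018 World Cup?", "QA"),
    Instance("How long does the trip take?", "QnotA"),
    Instance("Roughly how long is a nonstop flight from JFK to LHR?", "QA"),
]

def identify(question: str) -> str:
    """Stand-in classifier: a crude cue-based heuristic.
    A real evaluation would instead prompt a model such as GPT-3 or
    Flan T5 and map its response to one of the two labels."""
    cues = ("will ", "2040", "does the trip")
    return "QnotA" if any(c in question for c in cues) else "QA"

def accuracy(data: list[Instance]) -> float:
    # Fraction of instances where the predicted label matches the gold label.
    correct = sum(identify(x.question) == x.label for x in data)
    return correct / len(data)

print(f"identify-task accuracy: {accuracy(DATA):.2f}")
```

The 'distinguish' task would follow the same pattern but present the QnotA/QA pair jointly and ask which member lacks a definitive answer, while 'justify' would score the model's free-text explanation instead of a label.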