KNOW How to Make Up Your Mind! Adversarially Detecting and Alleviating Inconsistencies in Natural Language Explanations

Myeongjun Jang, Bodhisattwa Prasad Majumder, Julian McAuley, Thomas Lukasiewicz, Oana-Maria Camburu

Main: Interpretability and Analysis of Models for NLP Main-oral Paper

Session 5: Interpretability and Analysis of Models for NLP (Oral)
Conference Room: Metropolitan East
Conference Time: July 11, 16:15-17:15 (EDT) (America/Toronto)
Global Time: July 11, Session 5 (20:15-21:15 UTC)
Keywords: free-text/natural language explanations
TLDR: While recent works have been considerably improving the quality of the natural language explanations (NLEs) generated by a model to justify its predictions, there is very limited research in detecting and alleviating inconsistencies among generated NLEs. In this work, we leverage external knowledge ...
You can open the #paper-P2373 channel in a separate window.
Abstract: While recent works have been considerably improving the quality of the natural language explanations (NLEs) generated by a model to justify its predictions, there is very limited research in detecting and alleviating inconsistencies among generated NLEs. In this work, we leverage external knowledge bases to significantly improve on an existing adversarial attack for detecting inconsistent NLEs. We apply our attack to high-performing NLE models and show that models with higher NLE quality do not necessarily generate fewer inconsistencies. Moreover, we propose an off-the-shelf mitigation method to alleviate inconsistencies by grounding the model into external background knowledge. Our method decreases the inconsistencies of previous high-performing NLE models as detected by our attack.