NLRSE: Natural Language Reasoning and Structured Explanations Workshop

Organizers: Peter Clark, Ellie Pavlick, Denny Zhou, Noah Goodman, Sarah Wiegreffe, Felix Hill

With the recent scaling of large pre-trained Transformer language models (LLMs), the scope of feasible NLP tasks has broadened. Significant recent work has focused on tasks that require some kind of natural language reasoning. A trajectory in question answering has led us from extraction-oriented datasets like SQuAD to “multi-hop” reasoning datasets like HotpotQA and StrategyQA. Although LLMs have shown remarkable performance on most NLP tasks, it is often unclear why their answers follow from what they know. To address this gap, a new class of explanation techniques has emerged that plays an integral part in structuring the reasoning necessary to solve these datasets. For example, the chain-of-thought paradigm leverages explanations as vehicles for LLMs to mimic human reasoning processes. Entailment trees offer a way to ground multi-step reasoning in a collection of verifiable steps. Frameworks like SayCan bridge high-level planning in language with low-level action trajectories. As a result, we see a confluence of methods blending explainable machine learning/NLP, classical AI (especially theorem proving), and cognitive science (how do humans structure explanations?). This workshop aims to bring together a diverse set of perspectives from these different traditions and to establish common ground for how these various kinds of explanation structures can tackle a broad class of reasoning problems in natural language and beyond.

Workshop Papers

Logical Reasoning over Natural Language as Knowledge Representation: A Survey
Authors: Zonglin Yang, Xinya Du, Rui Mao, Jinjie Ni, Erik Cambria

Logical reasoning is central to human cognition and intelligence. Past research on logical reasoning within AI has used formal language as knowledge representation (and symbolic reasoners). However, reasoning with formal language has proved challenging (e.g., brittleness and the knowledge-acquisition bottleneck). This paper provides a comprehensive overview of a new paradigm of logical reasoning, which uses natural language as knowledge representation (and pretrained language models as reasoners), including a philosophical definition and categorization of logical reasoning, the advantages of the new paradigm, benchmarks and methods, challenges of the new paradigm, desirable tasks & methods for the future, and relations to related NLP fields. This new paradigm is promising since it not only alleviates many challenges of formal representation but also has advantages over end-to-end neural methods.

Go to Paper
I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors
Authors: Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn, Artemis Panagopoulou, Yue Yang, Marianna Apidianaki, Smaranda Muresan

Visual metaphors are powerful rhetorical devices used to persuade or communicate creative ideas through images. Similar to linguistic metaphors, they convey meaning implicitly through symbolism and juxtaposition of the symbols. We propose a new task of generating visual metaphors from linguistic metaphors. This is a challenging task for diffusion-based text-to-image models, such as DALL-E 2, since it requires the ability to model implicit meaning and compositionality. We propose to solve the task through collaboration between Large Language Models and Diffusion Models. We use GPT-3 with Chain-of-Thought prompting to generate text that represents a visual elaboration of the linguistic metaphor, containing the implicit meaning and relevant objects, which is then used as input to diffusion-based text-to-image models. Using a human-AI collaboration framework, where humans interact both with the LLM and the top-performing diffusion model, we create a high-quality dataset containing 6,476 visual metaphors. Evaluation by professional illustrators shows the promise of LLM-Diffusion Model collaboration for this task. We also perform an intrinsic and an extrinsic evaluation using a downstream task: visual entailment. Fine-tuning a state-of-the-art vision-language model on our dataset leads to a 23-point improvement in accuracy compared to its performance when finetuned on SNLI-VE, a large-scale visual entailment dataset.
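
The pipeline described above can be illustrated with a short sketch: a hypothetical LLM call produces a chain-of-thought visual elaboration of the metaphor, and the result is passed to a hypothetical text-to-image call. Function names and prompt wording are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of an LLM -> diffusion pipeline for visual metaphors.
# `llm` and `text_to_image` are hypothetical stand-ins for real model APIs.

def llm(prompt: str) -> str:
    """Hypothetical text-completion call (e.g., an instruction-tuned LLM)."""
    raise NotImplementedError

def text_to_image(caption: str) -> bytes:
    """Hypothetical diffusion-model call returning image bytes."""
    raise NotImplementedError

def visual_elaboration(metaphor: str) -> str:
    # Chain-of-thought style prompt: surface the implicit meaning and concrete
    # objects before producing a literal scene description usable as a caption.
    prompt = (
        f"Metaphor: {metaphor}\n"
        "Step 1: What is the implicit meaning?\n"
        "Step 2: Which concrete objects and symbols convey it?\n"
        "Step 3: Write a literal scene description for an illustrator.\n"
        "Scene description:"
    )
    return llm(prompt)

def metaphor_to_image(metaphor: str) -> bytes:
    return text_to_image(visual_elaboration(metaphor))
```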

Go to Paper
Negated Complementary Commonsense using Large Language Models
Authors: Navid Rezaei, Marek Reformat

Larger language models, such as GPT-3, have been shown to excel at many tasks. However, we demonstrate that out-of-the-ordinary questions can throw the model off guard. This work focuses on finding answers to negated complementary questions in commonsense scenarios. We illustrate how such questions adversely affect the model responses. We propose a model-agnostic methodology to improve the performance in negated complementary scenarios. Our method outperforms few-shot generation from GPT-3 (by more than 11 points) and, more importantly, highlights the significance of studying the responses of large language models to negated complementary questions. The code, data, and experiments are available at: https://github.com/navidre/negated_complementary_commonsense.

Go to Paper
The Role of Semantic Parsing in Understanding Procedural Text
Authors: Hossein Rajaby Faghihi, Parisa Kordjamshidi, Choh Man Teng, James Allen

In this paper, we investigate whether symbolic semantic representations, extracted from deep semantic parsers, can help to reason over the states of involved entities in a procedural text. We consider a deep semantic parser (TRIPS) and semantic role labeling as two sources of semantic parsing knowledge. First, we propose PROPOLIS, a symbolic parsing-based procedural reasoning framework. Second, we integrate semantic parsing information into state-of-the-art neural models to conduct procedural reasoning. Our experiments indicate that explicitly incorporating such semantic knowledge improves procedural understanding. This paper presents new metrics for evaluating procedural reasoning tasks that clarify the challenges and identify differences among neural, symbolic, and integrated models.

Go to Paper
Interpretable Math Word Problem Solution Generation Via Step-by-step Planning
Authors: Mengxue Zhang, Zichao Wang, Zhichao Yang, Weiqi Feng, Andrew Lan

We study the problem of generating coherent and correct intermediate solution steps for math word problems (MWPs). Solutions to MWPs with step-by-step explanations are valuable, especially in education, to help students better comprehend problem-solving strategies. Most existing approaches narrowly focus on obtaining the final correct answer. A few recent approaches leverage intermediate solution steps to improve final answer correctness but often cannot generate coherent steps with a clear solution strategy. Contrary to existing work, we focus on improving the correctness and coherence of the intermediate solution steps. We propose a step-by-step planning method for intermediate solution generation, which strategically plans the generation of the next solution step based on the MWP and the previous solution steps. Our approach first plans the next step by predicting the math operation needed to proceed given the steps so far, then generates the next step, token by token, by prompting a language model with the predicted math operation. Experiments on the GSM8K dataset demonstrate that our method improves the accuracy and interpretability of the solution, as measured by both automatic metrics and human evaluation.
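
A plan-then-generate loop of this flavor can be sketched as follows; predict_operation and generate_step are hypothetical model wrappers and the operation vocabulary is illustrative, not taken from the paper.

```python
# Minimal sketch of step-by-step planning for math word problem solutions.
OPERATIONS = ["add", "subtract", "multiply", "divide", "answer"]  # illustrative

def predict_operation(problem: str, history: list[str]) -> str:
    """Hypothetical planner: choose the next math operation given prior steps."""
    raise NotImplementedError

def generate_step(problem: str, history: list[str], operation: str) -> str:
    """Hypothetical generator: write the next step, conditioned on the operation."""
    raise NotImplementedError

def solve(problem: str, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        op = predict_operation(problem, history)    # plan the next operation
        step = generate_step(problem, history, op)  # realize it as a solution step
        history.append(step)
        if op == "answer":                          # stop once the answer is stated
            break
    return history
```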

Go to Paper
SCOTT: Self-Consistent Chain-of-Thought Distillation
Authors: Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, Xiang Ren

Large language models (LMs) beyond a certain scale demonstrate the emergent capability of generating free-text rationales for their predictions via chain-of-thought (CoT) prompting. While CoT can yield dramatically improved performance, such gains are only observed for sufficiently large LMs. Even more concerning, there is little guarantee that the generated rationales are consistent with the LM's predictions or faithfully justify the decisions. In this work, we propose a faithful knowledge distillation method to learn a small, self-consistent CoT model from a teacher model that is orders of magnitude larger. To form better supervision, we elicit rationales supporting the gold answers from a large LM (teacher) by contrastive decoding, which encourages the teacher to generate tokens that become more plausible only when the answer is considered. To ensure faithful distillation, we use the teacher-generated rationales to learn a student LM with a counterfactual reasoning objective, which prevents the student from ignoring the rationales and making inconsistent predictions. Experiments show that while yielding comparable performance, our method leads to a more faithful model than baselines. Further analysis shows that such a model respects the rationales more when making decisions; thus, we can improve its performance further by refining its rationales.
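
The contrastive-decoding idea can be pictured as scoring each candidate token by how much more probable it becomes once the gold answer is visible to the teacher. The sketch below is an assumption-laden illustration: next_token_logprobs is a hypothetical teacher-LM call and the prompt format is invented.

```python
import numpy as np

def next_token_logprobs(prompt: str) -> np.ndarray:
    """Hypothetical teacher call: log-probabilities over the vocabulary."""
    raise NotImplementedError

def contrastive_next_token(question: str, answer: str,
                           rationale_so_far: str, vocab: list[str]) -> str:
    with_answer = next_token_logprobs(
        f"Q: {question}\nA: {answer}\nRationale: {rationale_so_far}")
    without_answer = next_token_logprobs(
        f"Q: {question}\nRationale: {rationale_so_far}")
    # Contrastive score: prefer tokens that become plausible mainly *because*
    # the gold answer is in view, so the rationale actually supports it.
    scores = with_answer - without_answer
    return vocab[int(np.argmax(scores))]
```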

Go to Paper
Hierarchical Prompting Assists Large Language Model on Web Navigation
Authors: Chi-fan Lo, Abishek Sridhar, Hao Zhu, Frank F. Xu, Shuyan Zhou

Prompting has been utilized to exploit large language models (LLMs) for sequential planning tasks within interactive settings. In this paper, we propose a novel prompting approach, Actor-Summarizer-Hierarchical prompting, for interactive web navigation. Diverging from previous prompting approaches that always put the full state (e.g., a web page) into the prompt, we propose to first construct an action-aware state, which is more condensed and relevant, with a dedicated summarizer prompt. The resulting state is concatenated with the summarized history and fed to an actor prompt to predict the next action. This hierarchical mechanism is especially useful since the full state of a step in web navigation often contains redundant and irrelevant information. Our approach outperforms the previous state-of-the-art prompting mechanism with the same LLM by 6.2% on task success rate, demonstrating its potential on interactive decision-making tasks with long observation traces.
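
A rough sketch of the summarizer/actor split, with a hypothetical llm completion call and invented prompt wording:

```python
def llm(prompt: str) -> str:
    """Hypothetical LLM completion call."""
    raise NotImplementedError

def summarize_state(raw_web_page: str, goal: str) -> str:
    # Summarizer prompt: keep only the elements relevant to the next action.
    return llm(f"Goal: {goal}\nPage: {raw_web_page}\n"
               "List only the page elements relevant to the next action:")

def next_action(goal: str, history: list[str], raw_web_page: str) -> str:
    state = summarize_state(raw_web_page, goal)        # condensed, action-aware state
    return llm(f"Goal: {goal}\nHistory: {' | '.join(history)}\n"
               f"Observation: {state}\nNext action:")  # actor prompt
```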

Go to Paper
Towards Reasoning in Large Language Models: Survey, Implication, and Reflection
Authors: Jie Huang, Kevin Chen-chuan Chang

Reasoning is a fundamental aspect of human intelligence that plays a crucial role in activities such as problem solving, decision making, and critical thinking. In recent years, large language models (LLMs) have made significant progress in natural language processing, and it has been observed that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to what extent LLMs are capable of reasoning. This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions for future directions. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful discussion and future work.

Go to Paper
Foveate, Attribute, and Rationalize: Towards Physically Safe and Trustworthy AI
Authors: Alex Mei, Sharon Levy, William Yang Wang

Users' physical safety is an increasing concern as the market for intelligent systems continues to grow, where unconstrained systems may recommend dangerous actions to users that can lead to serious injury. Covertly unsafe text is an area of particular interest, as such texts may arise from everyday scenarios and are challenging to detect as harmful. We propose FARM, a novel framework that leverages external knowledge for trustworthy rationale generation in the context of safety. In particular, FARM foveates on missing knowledge to qualify the information required to reason in specific scenarios and retrieves this information with attribution to trustworthy sources. It then uses this knowledge to both classify the safety of the original text and generate human-interpretable rationales, shedding light on the risk systems pose to specific user groups, helping stakeholders manage the risks of their systems, and helping policymakers provide concrete safeguards for consumer safety. Our experiments show that FARM obtains state-of-the-art results on the SafeText dataset, improving safety classification accuracy by 5.9 absolute points.

Go to Paper
Teaching Large Language Models to Self-Debug
Authors: Xinyun Chen, Maxwell Lin, Nathanael Schaerli, Denny Zhou

Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging. In this work, we propose self-debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, we demonstrate that self-debugging can teach the large language model to perform rubber duck debugging; i.e., without any feedback on code correctness or error messages, the model is able to identify its mistakes by explaining the generated code in natural language. Self-debugging achieves state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark, where there are no unit tests to verify the correctness of predictions, self-debugging with code explanation consistently improves the baseline by 2-3%, and improves the prediction accuracy on problems of the hardest label by 9%. On TransCoder and MBPP, where unit tests are available, self-debugging can improve the baseline accuracy by 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, self-debugging notably improves sample efficiency, and can match or outperform baseline models that generate more than 10x as many candidate programs.
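
A self-debugging round can be sketched as: generate a program, ask the model to explain it line by line, and request a fix when feedback (e.g., from unit tests, where available) indicates failure. llm and run_tests below are hypothetical helpers, not the paper's code.

```python
def llm(prompt: str) -> str:
    """Hypothetical LLM completion call."""
    raise NotImplementedError

def run_tests(code: str) -> tuple[bool, str]:
    """Hypothetical test runner returning (passed, feedback message)."""
    raise NotImplementedError

def self_debug(task: str, max_rounds: int = 3) -> str:
    code = llm(f"Task: {task}\nWrite a program:")
    for _ in range(max_rounds):
        # "Rubber duck" step: the model explains its own code in natural language.
        explanation = llm(f"Explain this code line by line:\n{code}")
        passed, feedback = run_tests(code)
        if passed:
            break
        code = llm(f"Task: {task}\nCode:\n{code}\nExplanation:\n{explanation}\n"
                   f"Feedback: {feedback}\nFix the code:")
    return code
```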

Go to Paper
QAMPARI: A Benchmark for Open-domain Questions with Many Answers
Authors: Samuel Amouyal, Tomer Wolfson, Ohad Rubin, Ori Yoran, Jonathan Herzig, Jonathan Berant

Existing benchmarks for open-domain question answering (ODQA) typically focus on questions whose answers are all in a single paragraph. By contrast, many natural questions, such as "What players were drafted by the Brooklyn Nets?", have a long list of answers extracted from multiple paragraphs. Answering such questions requires retrieving and reading many passages from a large corpus. We introduce QAMPARI, an ODQA benchmark where answers are lists of entities spread across many paragraphs. We created QAMPARI by (a) generating questions with multiple answers from Wikipedia's knowledge graph and tables, (b) automatically pairing answers with supporting evidence in Wikipedia paragraphs, and (c) manually paraphrasing questions and validating each answer. Across a wide range of ODQA models, we find that QAMPARI is challenging in terms of both passage retrieval and answer generation, with models reaching an F1 score of 32.8 at best. We view QAMPARI as a valuable resource for ODQA research that will aid the development of models that handle a broad range of question types, including single- and multi-answer questions.

Go to Paper
Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering
Authors: Jinheon Baek, Alham Fikri Aji, Amir Saffari

Large Language Models (LLMs) are capable of performing zero-shot closed-book question answering tasks based on their internal knowledge stored in parameters during pre-training. However, such internalized knowledge might be insufficient or incorrect, which could lead LLMs to generate factually wrong answers. Furthermore, fine-tuning LLMs to update their knowledge is expensive. To this end, we propose to augment the knowledge directly in the input of LLMs. Specifically, we first retrieve the facts relevant to the input question from the knowledge graph based on semantic similarities between the question and its associated facts. After that, we prepend the retrieved facts to the input question in the form of a prompt, which is then forwarded to the LLM to generate the answer. Our framework, Knowledge-Augmented language model PromptING (KAPING), requires no model training and is thus completely zero-shot. We validate the performance of our KAPING framework on the knowledge graph question answering task, which aims to answer a user's question based on facts over a knowledge graph, on which ours outperforms relevant zero-shot baselines by up to 48% on average, across multiple LLMs of various sizes.
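
The retrieve-then-prepend recipe can be sketched directly; embed and llm are hypothetical stand-ins for a sentence encoder and an LLM, and the prompt wording is an assumption.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical sentence encoder."""
    raise NotImplementedError

def llm(prompt: str) -> str:
    """Hypothetical LLM completion call."""
    raise NotImplementedError

def knowledge_augmented_answer(question: str,
                               triples: list[tuple[str, str, str]],
                               k: int = 5) -> str:
    verbalized = [f"({s}, {r}, {o})" for s, r, o in triples]
    q_vec = embed(question)
    sims = [float(np.dot(q_vec, embed(t))) for t in verbalized]
    top = [verbalized[i] for i in np.argsort(sims)[::-1][:k]]  # most similar facts
    prompt = ("Below are facts that may be relevant to the question.\n"
              + "\n".join(top)
              + f"\nQuestion: {question}\nAnswer:")
    return llm(prompt)  # no parameter updates: the approach stays zero-shot
```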

Go to Paper
Generative Multi-hop Retrieval
Authors: Hyunji Lee, Sohee Yang, Hanseok Oh, Minjoon Seo

A common practice for text retrieval is to use an encoder to map the documents and the query to a common vector space and perform a nearest neighbor search (NNS); multi-hop retrieval also often adopts the same paradigm, usually with a modification of iteratively reformulating the query vector so that it can retrieve different documents at each hop. However, such a bi-encoder approach has limitations in multi-hop settings: (1) the reformulated query gets longer as the number of hops increases, which further tightens the embedding bottleneck of the query vector, and (2) it is prone to error propagation. In this paper, we focus on alleviating these limitations in multi-hop settings by formulating the problem in a fully generative way. We propose an encoder-decoder model that performs multi-hop retrieval by simply generating the entire text sequences of the retrieval targets, which means the query and the documents interact in the language model's parametric space rather than in L2 or inner product space as in the bi-encoder approach. Our approach, Generative Multi-hop Retrieval (GMR), consistently achieves comparable or higher performance than bi-encoder models on five datasets while demonstrating a superior GPU memory and storage footprint.

Go to Paper
Causal Reasoning of Entities and Events in Procedural Texts
Authors: Li Zhang, Hainiu Xu, Yue Yang, Shuyan Zhou, Weiqiu You, Manni Arora, Chris Callison-burch

Entities and events are crucial to natural language reasoning and common in procedural texts. Existing work has focused either exclusively on entity state tracking (e.g., whether a pan is hot) or on event reasoning (e.g., whether one would burn themselves by touching the pan), while these two tasks are often causally related. We propose CREPE, the first benchmark on causal reasoning of event plausibility and entity states. We show that most language models, including GPT-3, perform close to chance at .35 F1, lagging far behind humans at .87 F1. We boost model performance to .59 F1 by creatively representing events as programming-language code while prompting language models pretrained on code. By injecting the causal relations between entities and events as intermediate reasoning steps in our representation, we further boost the performance to .67 F1. Our findings indicate not only the challenge that CREPE brings for language models, but also the efficacy of code-like prompting combined with chain-of-thought prompting for multi-hop event reasoning.
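
The code-like prompting idea can be illustrated by rendering the procedure, an entity state, and the queried event as Python-style code for a code-pretrained LM to complete; the format below is invented for illustration and is not the benchmark's official template.

```python
def code_llm(prompt: str) -> str:
    """Hypothetical completion call to a code-pretrained language model."""
    raise NotImplementedError

def judge_event(goal: str, steps: list[str],
                entity: str, entity_state: str, event: str) -> str:
    lines = [f"# Goal: {goal}"]
    lines += [f"step_{i} = {s!r}" for i, s in enumerate(steps, 1)]
    # The causal link (entity state) is written out as an intermediate step.
    lines += [f"state[{entity!r}] = {entity_state!r}",
              "# Given the steps and state above, is the event more or less likely?",
              f"# Event: {event}",
              "event_likelihood ="]
    return code_llm("\n".join(lines))
```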

Go to Paper
Can In-context Learners Learn a Reasoning Concept from Demonstrations?
Authors: Michal Štefánik, Marek Kadlčík

Large language models show an emergent ability to learn a new task from a small number of input-output demonstrations. However, recent work shows that in-context learners largely rely on their pre-trained knowledge, such as the sentiment of the labels, instead of finding new associations in the input. However, the commonly used few-shot evaluation settings, which use a random selection of in-context demonstrations, cannot disentangle models' ability to learn a new skill from demonstrations, as most of the randomly selected demonstrations do not present relations informative for prediction beyond exposing the new task distribution. To disentangle models' in-context learning ability independent of models' memory, we introduce a Conceptual few-shot learning method that selects demonstrations sharing a possibly informative concept with the predicted sample. We extract a set of such concepts from annotated explanations and measure how much models can benefit from presenting these concepts in few-shot demonstrations. We find that smaller models are more sensitive to the presented concepts. While some of the models are able to benefit from concept-presenting demonstrations for each assessed concept, we find that none of the assessed in-context learners can benefit from all presented reasoning concepts consistently, leaving in-context concept learning an open challenge.

Go to Paper
DREAM: Improving Situational QA by First Elaborating the Situation
Authors: Yuling Gu, Bhavana Dalvi Mishra, Peter Clark

When people answer questions about a specific situation, e.g., "I cheated on my mid-term exam last week. Was that wrong?", cognitive science suggests that they form a mental picture of that situation before answering. While we do not know how language models (LMs) answer such questions, we conjecture that they may answer more accurately if they are also provided with additional details about the question situation, elaborating the "scene". To test this conjecture, we train a new model, DREAM, to answer questions that elaborate the scenes that situated questions are about, and then provide those elaborations as additional context to a question-answering (QA) model. We find that DREAM is able to create better scene elaborations (more accurate, useful, and consistent) than a representative state-of-the-art, zero-shot model (Macaw). We also find that using the scene elaborations as additional context improves the answer accuracy of a downstream QA system, including beyond that obtainable by simply further fine-tuning the QA system on DREAM's training data. These results suggest that adding focused elaborations about a situation can improve a system's reasoning about it, and may serve as an effective way of injecting new scenario-based knowledge into QA models.
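
The elaborate-then-answer pattern can be sketched with two hypothetical components standing in for DREAM and a downstream QA model:

```python
def elaborate(situation: str) -> str:
    """Hypothetical scene elaborator (e.g., motivations, consequences, norms)."""
    raise NotImplementedError

def qa_model(context: str, question: str) -> str:
    """Hypothetical downstream question-answering model."""
    raise NotImplementedError

def answer_situated_question(situation: str, question: str) -> str:
    scene = elaborate(situation)  # first elaborate the situation into a "scene"
    return qa_model(context=f"{situation}\nScene: {scene}", question=question)
```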

Go to Paper
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Authors: Cheng-yu Hsieh, Chun-liang Li, Chih-kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-yu Lee, Tomas Pfister

Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so while using less training data than needed by finetuning or distillation. Our method extracts LLM rationales as additional supervision for small models within a multi-task training framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with far fewer labeled/unlabeled training examples. Second, compared to LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our 770M T5 model outperforms the 540B PaLM model using only 80% of the available data on a benchmark task.
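
One way to picture the multi-task setup is a combined objective in which the small model is trained both to predict the label and, as a separate task, to reproduce the LLM-provided rationale. The sketch below is only an assumption about the general shape of such an objective; seq2seq_loss, the task prefixes, and the 0.5 weight are illustrative, not the paper's settings.

```python
def seq2seq_loss(model, input_text: str, target_text: str) -> float:
    """Hypothetical cross-entropy of generating target_text from input_text."""
    raise NotImplementedError

def multitask_distillation_loss(model, x: str, label: str, rationale: str,
                                rationale_weight: float = 0.5) -> float:
    label_loss = seq2seq_loss(model, "[label] " + x, label)                # answer task
    rationale_loss = seq2seq_loss(model, "[rationale] " + x, rationale)    # rationale task
    return label_loss + rationale_weight * rationale_loss
```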

Go to Paper
Effect Graph: Effect Relation Extraction for Explanation Generation
Authors: Jonathan Kobbe, Ioana Hulpuș, Heiner Stuckenschmidt

Argumentation is an important means of communication. For describing arguments about consequences in particular, the notion of effect relations has recently been introduced. We propose a method to extract effect relations from large text resources and apply it to encyclopedic and argumentative texts. By connecting the extracted relations, we generate a knowledge graph which we call the effect graph. For evaluating the effect graph, we perform crowd and expert annotations and create a novel dataset. We demonstrate a possible use case of the effect graph by proposing a method for explaining arguments from consequences.

Go to Paper
Explaining Competitive-Level Programming Solutions using LLMs
Authors: Jierui Li, Szymon Tworkowski, Yingying Wu, Raymond Mooney

In this paper, we approach competitive-level programming problem-solving as a composite task of reasoning and code generation. We propose a novel method to automatically annotate natural language explanations to <problem, solution> pairs. We show that despite poor performance in solving competitive-level programming problems, state-of-the-art LLMs exhibit a strong capacity in describing and explaining their solutions. Our explanation generation methodology can generate a structured solution explanation for the problem, containing both a description and an analysis. To evaluate the quality of the annotated explanations, we examine their effectiveness in two aspects: 1) satisfying the human programming expert who authored the oracle solution, and 2) aiding LLMs in solving problems more effectively. The experimental results on the CodeContests dataset demonstrate that while GPT-3.5's and GPT-4's abilities in describing the solution are comparable, GPT-4 shows a better understanding of the key idea behind the solution.

Go to Paper
OPT-R: Exploring the Role of Explanations in Finetuning and Prompting for Reasoning Skills of Large Language Models
Authors: Badr Alkhamissi, Siddharth Verma, Ping Yu, Zhijing Jin, Asli Celikyilmaz, Mona Diab

We conduct a thorough investigation into the reasoning capabilities of Large Language Models (LLMs), focusing specifically on the Open Pretrained Transformers (OPT) models as a representative of such models. Our study entails finetuning three different sizes of OPT on a carefully curated reasoning corpus, resulting in two sets of finetuned models: OPT-R, finetuned without explanations, and OPT-RE, finetuned with explanations. We then evaluate all models on 57 out-of-domain tasks drawn from the Super-NaturalInstructions benchmark, covering 26 distinct reasoning skills, utilizing three prompting techniques. Through a comprehensive grid of 27 configurations and 6,156 test evaluations, we investigate the dimensions of finetuning, prompting, and scale to understand the role of explanations on different reasoning skills. Our findings reveal that having explanations in the few-shot exemplars has no significant impact on the model's performance when the model is finetuned, while positively affecting the non-finetuned counterpart. Moreover, we observe a slight yet consistent increase in classification accuracy as we incorporate explanations during prompting and finetuning, respectively. Finally, we offer insights on which reasoning skills benefit the most from incorporating explanations during finetuning and prompting, such as Numerical (+20.4%) and Analogical (+13.9%) reasoning, as well as skills that exhibit negligible or negative effects.

Go to Paper
Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study
Authors: Boxin Wang, Wei Ping, Peng Xu, Lawrence Mcafee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao

Large decoder-only language models (LMs) can be largely improved in terms of perplexity by retrieval (e.g., RETRO), but the impact of retrieval on text generation quality and downstream task accuracy is unclear. Thus, it is still an open question: shall we pretrain large autoregressive LMs with retrieval? To answer it, we perform a comprehensive study on a scalable pretrained retrieval-augmented LM (i.e., RETRO) compared with standard GPT and retrieval-augmented GPT incorporated at fine-tuning or inference stages. We first provide the recipe to reproduce RETRO up to 9.5B parameters while retrieving a text corpus with 330B tokens. Based on that, we have the following novel findings: i) RETRO outperforms GPT on text generation with much less degeneration (i.e., repetition), moderately higher factual accuracy, and slightly lower toxicity with a nontoxic retrieval database. ii) On the LM Evaluation Harness benchmark, RETRO largely outperforms GPT on knowledge-intensive tasks, but is on par with GPT on other tasks. Furthermore, we introduce a simple variant of the model, RETRO++, which largely improves the open-domain QA results of the original RETRO and significantly outperforms retrieval-augmented GPT across different model sizes. Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models.

Go to Paper
Grounded physical language understanding with probabilistic programs and simulated worlds
Authors: Cedegao Zhang, Lionel Wong, Gabriel Grand, Josh Tenenbaum

Human language richly invokes our intuitive physical knowledge. We talk about physical objects, scenes, properties, and events, and we can make predictions and draw inferences about physical worlds described entirely in language. Understanding this everyday language requires inherently probabilistic reasoning over possible physical worlds invoked in language and over the uncertainty inherent to those physical worlds. In this paper, we propose PiLoT, a neurosymbolic generative model that translates language into probabilistic programs grounded in a physics engine. Our model integrates a large language model to robustly parse language into program expressions and uses a probabilistic physics engine to support inferences over scenes described in language. We construct a linguistic reasoning benchmark based on prior psychophysics experiments that requires reasoning about physical outcomes based on linguistic scene descriptions. We show that PiLoT predicts human judgments well and outperforms baseline large language models across this battery of tasks.

Go to Paper
Designing harder benchmarks for evaluating zero-shot generalizability in Question Answering over Knowledge Bases
Authors: Ritam Dutt, Sopan Khosla, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah

Most benchmarks for question answering on knowledge bases (KBQA) operate under the i.i.d. assumption. Recently, the GrailQA dataset was established to evaluate the zero-shot generalization capabilities of KBQA models. The reasonable performance of current KBQA systems on the zero-shot GrailQA split hints that the field might be moving towards more generalizable systems. In this work, we observe a bias in the GrailQA dataset towards simpler one- or two-hop questions, which results in an inaccurate assessment of the aforementioned prowess. We propose GrailQA++, a challenging zero-shot KBQA test set that contains a larger number of questions relying on complex reasoning. We leverage the concept of reasoning paths to control the complexity of the questions and to ensure that our proposed test set has a fair distribution of simple and complex questions. Evaluating existing KBQA models on this new test set shows that they suffer a substantial drop in performance compared to the GrailQA zero-shot split. This highlights the non-generalizability of existing models and the necessity for harder benchmarks. Our analysis reveals how reasoning paths can be used to understand the complementary strengths of different KBQA models and to gain deeper insight into model mispredictions.

Go to Paper
Deductive Additivity for Planning of Natural Language Proofs
Authors: Zayne Sprague, Kaj Bostrom, Swarat Chaudhuri, Greg Durrett

Current natural language systems designed for multi-step claim validation typically operate in two phases: retrieve a set of relevant premise statements using heuristics (planning), then generate novel conclusions from those statements using a large language model (deduction). The planning step often requires expensive Transformer operations and does not scale to arbitrary numbers of premise statements. In this paper, we investigate whether an efficient planning heuristic is possible via embedding spaces compatible with deductive reasoning. Specifically, we evaluate whether embedding spaces exhibit a property we call deductive additivity: the sum of premise statement embeddings should be close to embeddings of conclusions based on those premises. We explore multiple sources of off-the-shelf dense embeddings in addition to fine-tuned embeddings from GPT-3 and sparse embeddings from BM25. We study embedding models both intrinsically, evaluating whether the property of deductive additivity holds, and extrinsically, using them to assist planning in natural language proof generation. Lastly, we create a dataset, Single-Step Reasoning Contrast (SSRC), to further probe performance on various reasoning types. Our findings suggest that while standard embedding methods frequently embed conclusions near the sums of their premises, they fall short of being effective heuristics and lack the ability to model certain categories of reasoning.
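
The deductive-additivity property itself is easy to operationalize: sum the premise embeddings and measure how close the result is to the conclusion embedding. The check below assumes a hypothetical sentence encoder embed and uses cosine similarity as the closeness measure.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical sentence encoder."""
    raise NotImplementedError

def deductive_additivity_score(premises: list[str], conclusion: str) -> float:
    combined = np.sum([embed(p) for p in premises], axis=0)  # additive combination
    target = embed(conclusion)
    return float(np.dot(combined, target) /
                 (np.linalg.norm(combined) * np.linalg.norm(target)))

# A higher score for the true conclusion than for distractor conclusions would
# indicate that the embedding space supports this kind of additive planning.
```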

Go to Paper
Case-Based Reasoning with Language Models for Classification of Logical Fallacies
Authors: Zhivar Sourati, Filip Ilievski, Hông-Ân Sandlin, Alain Mermoud

The ease and speed of spreading misinformation and propaganda on the Web motivate the need to develop trustworthy technology for detecting fallacies in natural language arguments. However, state-of-the-art language modeling methods exhibit a lack of robustness on tasks like logical fallacy classification that require complex reasoning. In this paper, we propose a Case-Based Reasoning method that classifies new cases of logical fallacy by language-modeling-driven retrieval and adaptation of historical cases. We design four complementary strategies to enrich input representation for our model, based on external information about goals, explanations, counterarguments, and argument structure. Our experiments in in-domain and out-of-domain settings indicate that Case-Based Reasoning improves the accuracy and generalizability of language models. Our ablation studies suggest that representations of similar cases have a strong impact on the model performance, that models perform well with fewer retrieved cases, and that the size of the case database has a negligible effect on the performance. Finally, we dive deeper into the relationship between the properties of the retrieved cases and the model performance.

Go to Paper
The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code
Authors: Xiao Liu, Da Yin, Chen Zhang, Yansong Feng, Dongyan Zhao

Causal reasoning, the ability to identify cause-and-effect relationships, is crucial in human thinking. Although large language models (LLMs) succeed in many NLP tasks, it is still challenging for them to conduct complex causal reasoning like abductive reasoning and counterfactual reasoning. Given that programming code may express causal relations more often and more explicitly with conditional statements like "if", we want to explore whether Code-LLMs acquire better causal reasoning abilities. Our experiments show that, compared to text-only LLMs, Code-LLMs with code prompts are better causal reasoners. We further intervene on the prompts from different aspects and discover that the key point is the programming structure.

Go to Paper
Neural-symbolic Contrastive Learning for Cross-domain Inference
Authors: Mingyue Liu, Jialin Yu, Hao Cui, Sara Uckelman, Yang Long

It has been suggested in the literature that large pre-trained language models (PLMs) are able to surpass human-level performance on natural language inference (NLI) tasks. However, the failure to learn underlying generalizations and the inconsistency under small textual perturbations raise doubts about whether models rely on shallow heuristics to guess the correct label. To mitigate this issue, we propose a neural-symbolic contrastive learning framework inspired by Inductive Logic Programming (ILP) to better capture logical relationships from data. Unlike the usual methods for NLI tasks, our approach represents data as logic programs, i.e., sets of logic rules. We aim to learn an embedding space in which examples with similar underlying logical meanings but varied textual surface forms lie close together, and vice versa. Experimental results affirm this approach's ability to enhance the model's transfer performance.

Go to Paper
Synthetic Dataset for Evaluating Complex Compositional Knowledge for Natural Language Inference
Authors: Sushma Anand Akoju, Robert Vacareanu, Eduardo Blanco, Haris Riaz, Mihai Surdeanu

We introduce a synthetic dataset called Sentences Involving Complex Compositional Knowledge (SICCK) and a novel analysis that investigates the performance of Natural Language Inference (NLI) models in understanding compositionality in logic. We produce 1,304 sentence pairs by modifying 15 examples from the SICK dataset (Marelli et al., 2014). To this end, we modify the original texts using a set of phrase modifiers that correspond to universal quantifiers, existential quantifiers, negation, and other concept modifiers in Natural Logic (NL) (MacCartney, 2009). We use these phrases to modify the subject, verb, and object parts of the premise and hypothesis. Lastly, we annotate these modified texts with the corresponding entailment labels following NL rules. We conduct a preliminary verification of how well the change in structural and semantic composition is captured by neural NLI models, in both zero-shot and fine-tuned scenarios. We find that the performance of NLI models under the zero-shot setting is poor, especially for modified sentences with negation and existential quantifiers. After fine-tuning on this dataset, we observe that models continue to perform poorly over negation, existential, and universal modifiers.

Go to Paper
STREET: A Multi-Task Structured Reasoning and Explanation Benchmark
Authors: Danilo Neves Ribeiro, Shen Wang, Xiaofei Ma, Henghui Zhu, Rui Dong, Deguang Kong, Juliette Burger, Anjelica Ramos, William Yang Wang, Zhiheng Huang

We introduce STREET, a unified multi-task and multi-domain natural language reasoning and explanation benchmark. Unlike most existing question-answering (QA) datasets, we expect models to not only answer questions, but also produce step-by-step structured explanations describing how premises in the question are used to produce intermediate conclusions that can prove the correctness of a certain answer. We perform extensive evaluation with popular language models such as few-shot prompted GPT-3 and fine-tuned T5. We find that these models still lag behind human performance when producing such structured reasoning steps. We believe this work will provide a way for the community to better train and test systems on multi-step reasoning and explanations in natural language.

Go to Paper
Complementary Explanations for Effective In-Context Learning
Authors: Xi Ye, Srinivasan Iyer, Asli Celikyilmaz, Veselin Stoyanov, Greg Durrett, Ramakanth Pasunuru

Large language models (LLMs) have exhibited remarkable capabilities in learning from explanations in prompts, but there has been limited understanding of exactly how these explanations function or why they are effective. This work aims to better understand the mechanisms by which explanations are used for in-context learning. We first study the impact of two different factors on the performance of prompts with explanations: the computation trace (the way the solution is decomposed) and the natural language used to express the prompt. By perturbing explanations on three controlled tasks, we show that both factors contribute to the effectiveness of explanations. We further study how to form maximally effective sets of explanations for solving a given test query. We find that LLMs can benefit from the complementarity of the explanation set: diverse reasoning skills shown by different exemplars can lead to better performance. Therefore, we propose a maximal marginal relevance-based exemplar selection approach for constructing exemplar sets that are both relevant as well as complementary, which successfully improves the in-context learning performance across three real-world tasks on multiple LLMs.
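
Maximal marginal relevance (MMR) selection of exemplars can be sketched as a greedy loop that trades off similarity to the test query against redundancy with already-chosen exemplars; embed is a hypothetical encoder and the trade-off weight is illustrative.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical sentence encoder."""
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(query: str, pool: list[str], k: int = 4, lam: float = 0.5) -> list[str]:
    q = embed(query)
    vecs = [embed(x) for x in pool]
    chosen: list[int] = []
    while len(chosen) < min(k, len(pool)):
        def score(i: int) -> float:
            relevance = cosine(vecs[i], q)
            redundancy = max((cosine(vecs[i], vecs[j]) for j in chosen), default=0.0)
            return lam * relevance - (1 - lam) * redundancy  # relevant yet complementary
        chosen.append(max((i for i in range(len(pool)) if i not in chosen), key=score))
    return [pool[i] for i in chosen]
```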

Go to Paper

ACL 2023


© 2023 Association for Computational Linguistics