TrustNLP

Organizers: Yada Pruksachatkun, Ninareh Mehrabi, Kai-Wei Chang, Aram Galstyan, Jwala Dhamala, Anaelia Ovalle, Apurv Verma, Yang Trista Cao, Anoop Kumar, Rahul Gupta

Recent advances in Natural Language Processing, and the emergence of pretrained Large Language Models (LLMs) specifically, have made NLP systems omnipresent in various aspects of our everyday life. In addition to traditional examples such as personal voice assistants and recommender systems, more recent developments include content-generation models such as ChatGPT, text-to-image models such as DALL-E, and so on. While these emergent technologies have unquestionable potential to power various innovative NLP and AI applications, they also pose a number of challenges in terms of their safe and ethical use. To address such challenges, NLP researchers have formulated various objectives, e.g., intended to make models more fair, safe, and privacy-preserving. However, these objectives are often considered separately, which is a major limitation since it is often important to understand the interplay and/or tension between them. For instance, meeting a fairness objective might require access to users’ demographic information, which creates tension with privacy objectives. The goal of this workshop is to move toward a more comprehensive notion of Trustworthy NLP by bringing together researchers working on these distinct yet related topics, as well as their intersection.

Workshop Papers

Improving Factuality of Abstractive Summarization via Contrastive Reward Learning
Authors: I-Chun Chern, Zhiruo Wang, Sanjan Das, Bhavuk Sharma, Pengfei Liu, Graham Neubig

Modern abstractive summarization models often generate summaries that contain hallucinated or contradictory information. In this paper, we propose a simple but effective contrastive learning framework that incorporates recent developments in reward learning and factuality metrics. Empirical studies demonstrate that the proposed framework enables summarization models to learn from feedback of factuality metrics using contrastive reward learning, leading to more factual summaries according to human evaluations. This suggests that further advances in learning and evaluation algorithms can feed directly into providing more factual summaries. Code and human evaluation results will be publicly available at https://github.com/EthanC111/factuality_summarization.

Examining the Causal Impact of First Names on Language Models: The Case of Social Commonsense Reasoning
Authors: Sullam Jeoung, Jana Diesner, Halil Kilicoglu

As language models continue to be integrated into applications of personal and societal relevance, ensuring these models' trustworthiness is crucial, particularly with respect to producing consistent outputs regardless of sensitive attributes. Given that first names may serve as proxies for (intersectional) socio-demographic representations, it is imperative to examine the impact of first names on commonsense reasoning capabilities. In this paper, we study whether a model's reasoning given a specific input differs based on the first names provided. Our underlying assumption is that the reasoning about Alice should not differ from the reasoning about James. We propose and implement a controlled experimental framework to measure the causal effect of first names on commonsense reasoning, enabling us to distinguish between model predictions due to chance and caused by actual factors of interest. Our results indicate that the frequency of first names has a direct effect on model prediction, with less frequent names yielding divergent predictions compared to more frequent names. To gain insights into the internal mechanisms of models that are contributing to these behaviors, we also conduct an in-depth explainable analysis. Overall, our findings suggest that to ensure model robustness, it is essential to augment datasets with more diverse first names during the configuration stage.

Reliability Check: An Analysis of GPT-3's Response to Sensitive Topics and Prompt Wording
Authors: Aisha Khatun, Daniel Brown

Large language models (LLMs) have become mainstream technology with their versatile use cases and impressive performance. Despite the countless out-of-the-box applications, LLMs are still not reliable. A lot of work is being done to improve the factual accuracy, consistency, and ethical standards of these models through fine-tuning, prompting, and Reinforcement Learning with Human Feedback (RLHF), but there is no systematic analysis of how these models respond to different categories of statements or of their potential vulnerabilities to simple changes in prompting. In this work, we analyze what confuses GPT-3: how the model responds to certain sensitive topics and what effects the prompt wording has on the model response. We find that GPT-3 correctly disagrees with obvious Conspiracies and Stereotypes but makes mistakes with common Misconceptions and Controversies. The model responses are inconsistent across prompts and settings, highlighting GPT-3's unreliability.

On the Privacy Risk of In-context Learning
Authors: Haonan Duan, Adam Dziedzic, Mohammad Yaghini, Nicolas Papernot, Franziska Boenisch

Large language models (LLMs) are excellent few-shot learners. They can perform a wide variety of tasks purely based on natural language prompts provided to them. These prompts contain data for a specific downstream task---often the private dataset of a party, e.g., a company that wants to leverage the LLM for its purposes. We show that deploying prompted models presents a significant privacy risk for the data used within the prompt by proposing a highly effective membership inference attack. We also observe that the privacy risk of prompted models exceeds that of fine-tuned models at the same utility levels. After identifying the model's sensitivity to its prompts---in the form of a significantly higher prediction confidence on the prompted data---as a cause of the increased risk, we propose ensembling as a mitigation strategy. By aggregating over multiple different versions of a prompted model, membership inference risk can be decreased.
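
As a rough illustration of the kind of confidence-based membership inference described above (a sketch, not the authors' implementation), the snippet below simply thresholds the model's confidence in each candidate's true label; the `prompted_model` callable, its dict-of-probabilities output, and the threshold value are all hypothetical assumptions.

```python
import numpy as np

def confidence_threshold_mia(prompted_model, candidates, threshold=0.9):
    """Toy membership-inference attack: flag a text as a likely member of the
    prompt data if the model's confidence in the text's true label exceeds a
    threshold. `prompted_model(text)` is assumed to return a dict mapping
    labels to probabilities; `candidates` is a list of (text, label) pairs."""
    guesses = []
    for text, label in candidates:
        probs = prompted_model(text)            # hypothetical model API
        guesses.append(probs.get(label, 0.0) > threshold)
    return np.array(guesses)                    # True = predicted member
```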

Sample Attackability in Natural Language Adversarial Attacks
Authors: Vyas Raina, Mark Gales

Adversarial attack research in natural language processing (NLP) has made significant progress in designing powerful attack methods and defence approaches. However, few efforts have sought to identify which source samples are the most attackable or robust, i.e., whether we can determine, for an unseen target model, which samples are the most vulnerable to an adversarial attack. This work formally extends the definition of sample attackability/robustness for NLP attacks. Experiments on two popular NLP datasets, four state-of-the-art models and four different NLP adversarial attack methods demonstrate that sample uncertainty is insufficient for describing the characteristics of attackable/robust samples, and hence a deep-learning-based detector can perform much better at identifying the most attackable and robust samples for an unseen target model. Nevertheless, further analysis finds that there is little agreement on which samples are considered the most attackable/robust across different NLP attack methods, explaining a lack of portability of attackability detection methods across attack methods.

A Keyword Based Approach to Understanding the Overpenalization of Marginalized Groups by English Marginal Abuse Models on Twitter
Authors: Kyra Yee, Alice Schoenauer Sebag, Olivia Redfield, Matthias Eck, Emily Sheng, Luca Belli

Harmful content detection models tend to have higher false positive rates for content from marginalized groups. In the context of marginal abuse modeling on Twitter, such disproportionate penalization poses the risk of reduced visibility, where marginalized communities lose the opportunity to voice their opinion on the platform. Current approaches to algorithmic harm mitigation and bias detection for NLP models are often very ad hoc and subject to human bias. We make two main contributions in this paper. First, we design a novel methodology, which provides a principled approach to detecting and measuring the severity of potential harms associated with a text-based model. Second, we apply our methodology to audit Twitter's English marginal abuse model, which is used for removing amplification eligibility of marginally abusive content. Without utilizing demographic labels or dialect classifiers, we are still able to detect and measure the severity of issues related to the over-penalization of the speech of marginalized communities, such as the use of reclaimed speech, counterspeech, and identity-related terms. In order to mitigate the associated harms, we experiment with adding additional true negative examples and find that doing so provides improvements to our fairness metrics without large degradations in model performance.

An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models
Authors: Saghar Hosseini, Hamid Palangi, Ahmed Hassan Awadallah

Large-scale Pre-Trained Language Models (PTLMs) capture knowledge from massive human-written data which contains latent societal biases and toxic content. In this paper, we leverage the primary task of PTLMs, i.e., language modeling, and propose a new metric to quantify manifested implicit representational harms in PTLMs towards 13 marginalized demographics. Using this metric, we conducted an empirical analysis of 24 widely used PTLMs. Our analysis provides insights into the correlation between the proposed metric and other related metrics for representational harm. We observe that our metric correlates with most of the gender-specific metrics in the literature. Through extensive experiments, we explore the connections between PTLM architectures and representational harms across two dimensions: the depth and width of the networks. We found that prioritizing depth over width mitigates representational harms in some PTLMs. Our code and data can be found at [place holder].

Towards Faithful Explanations for Text Classification with Robustness Improvement and Explanation Guided Training
Authors: Dongfang Li, Baotian Hu, Qingcai Chen, Shan He

Feature attribution methods highlight the important input tokens as explanations for model predictions and have been widely applied to deep neural networks towards trustworthy AI. However, recent work shows that explanations provided by these methods face challenges of being faithful and robust. In this paper, we propose a method with Robustness improvement and Explanation Guided training towards more faithful EXplanations (REGEX) for text classification. First, we improve model robustness via an input gradient regularization technique and virtual adversarial training. Second, we use salient ranking to mask noisy tokens and maximize the similarity between model attention and feature attribution, which can be seen as a self-training procedure without importing other external information. We conduct extensive experiments on six datasets with five attribution methods, and also evaluate faithfulness in the out-of-domain setting. The results show that REGEX improves fidelity metrics of explanations in all settings and further achieves consistent gains based on two randomization tests. Moreover, we show that using highlight explanations produced by REGEX to train select-then-predict models results in comparable task performance to the end-to-end method.

Linguistic Properties of Truthful Response
Authors: Bruce W. Lee, Benedict Florance Arockiaraj, Helen Jin

We investigate the phenomenon of untruthful LLM responses using a large set of 220 handcrafted linguistic features. We focus on GPT-3 models and find that the linguistic profiles of responses are similar across model sizes. That is, how differently sized LLMs respond to given prompts stays similar at the level of linguistic properties. We expand upon this finding by training support vector machines that rely only upon the stylistic components of model responses to classify the truthfulness of statements. Though the dataset size limits our current findings, we present promising evidence that truthfulness detection is possible without evaluating the content itself. We release our code and raw data.
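
As an illustration of the style-only classifier setup described above (a sketch, not the authors' code), the snippet below fits an SVM on a precomputed matrix of handcrafted linguistic features with scikit-learn; the feature matrix `X` and labels `y` are assumed inputs.

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def style_only_truthfulness_accuracy(X, y):
    """X: (n_responses, n_linguistic_features) matrix of handcrafted stylistic
    features; y: binary truthfulness labels. Returns the 5-fold cross-validated
    accuracy of an RBF-kernel SVM trained on style features alone."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    return cross_val_score(clf, X, y, cv=5).mean()
```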

Debunking Biases in Attention
Authors: Shijing Chen, Usman Naseem, Imran Razzak

Despite their remarkable performance in various applications, machine learning (ML) models can discriminate: they may introduce bias into decision-making, negatively impacting individuals and society. Recently, various methods have been developed to mitigate such bias while maintaining strong performance. Attention mechanisms are a fundamental component of many state-of-the-art ML models and may affect their fairness; however, how they explicitly influence fairness has yet to be thoroughly explored. In this paper, we investigate how different attention mechanisms affect the fairness of ML models, focusing on models used in Natural Language Processing (NLP). We evaluate the fairness and performance of several models with and without different attention mechanisms on widely used benchmark datasets. Our results indicate that most of the attention mechanisms assessed improve the fairness of Bidirectional Gated Recurrent Unit (BiGRU) and Bidirectional Long Short-Term Memory (BiLSTM) models on all three datasets with respect to religion- and gender-sensitive groups, albeit with varying degrees of trade-off in accuracy. Our findings highlight the possibility that fairness is affected by adopting specific attention mechanisms in machine learning models for certain datasets.

Guiding Text-to-Text Privatization by Syntax
Authors: Stefan Arnold, Dilara Yesilbas, Sven Weinzierl

Metric Differential Privacy is a generalization of differential privacy tailored to address the unique challenges of text-to-text privatization. By adding noise to the representation of words in the geometric space of embeddings, words are replaced with words located in the proximity of the noisy representation. Since embeddings are trained based on word co-occurrences, this mechanism ensures that substitutions stem from a common semantic context. Without considering the grammatical category of words, however, this mechanism cannot guarantee that substitutions play similar syntactic roles. We analyze the capability of text-to-text privatization to preserve the grammatical category of words after substitution and find that surrogate texts consist almost exclusively of nouns. Since the mechanism cannot produce surrogate texts that correlate with the structure of the sensitive texts, we extend our analysis by transforming the privatization step into a candidate selection problem in which substitutions are directed to words with matching grammatical properties. We demonstrate a substantial improvement in the performance of downstream tasks by up to 4.66% while retaining comparable privacy guarantees.
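
For readers unfamiliar with the mechanism being analyzed, the following minimal sketch illustrates generic metric-DP word substitution (noise added to a word embedding, then nearest-neighbour projection back to the vocabulary). It is not the paper's code; the embedding table, vocabulary, and noise calibration are illustrative assumptions.

```python
import numpy as np

def privatize_word(word, emb, vocab, epsilon=10.0, rng=None):
    """Toy metric-DP word substitution: perturb the word's embedding with
    noise drawn in a random direction with Gamma-distributed magnitude, then
    project back to the nearest word in the vocabulary. `emb` maps words to
    NumPy vectors and `vocab` is the list of candidate replacement words."""
    rng = rng or np.random.default_rng(0)
    v = emb[word]
    direction = rng.normal(size=v.shape[0])
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=v.shape[0], scale=1.0 / epsilon)
    noisy = v + magnitude * direction
    # nearest-neighbour projection back onto the discrete vocabulary
    distances = [np.linalg.norm(emb[w] - noisy) for w in vocab]
    return vocab[int(np.argmin(distances))]
```

The syntactic guidance proposed in the paper would then roughly correspond to restricting `vocab` to candidates whose part-of-speech tag matches that of the original word.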

Differentially Private In-Context learning
Authors: Ashwinee Panda, Tong Wu, Jiachen Wang, Prateek Mittal

An important question in deploying large language models (LLMs) is how to augment LLMs with private data. We propose Differentially Private In-context Learning (DP-ICL) to enable LLMs to adapt to new tasks while maintaining privacy guarantees. DP-ICL performs private inference by establishing a noisy consensus over an ensemble of exemplars using the Report-Noisy-Max mechanism. We evaluate DP-ICL on four benchmarks and find that it achieves performance comparable (< 2% degradation) to non-private ICL.
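
To make the aggregation step concrete, here is a minimal sketch of a Report-Noisy-Max vote over an ensemble of predictions (an illustration, not the authors' implementation); the Laplace noise scale shown is a common textbook calibration and is an assumption here.

```python
import numpy as np

def report_noisy_max(votes, labels, epsilon=1.0, rng=None):
    """Aggregate an ensemble of per-exemplar-subset predictions: count votes
    per label, add Laplace noise to each count, and release only the label
    with the highest noisy count. `votes` is a list of predicted labels."""
    rng = rng or np.random.default_rng(0)
    counts = np.array([votes.count(label) for label in labels], dtype=float)
    noisy = counts + rng.laplace(scale=2.0 / epsilon, size=len(labels))
    return labels[int(np.argmax(noisy))]

# Example: three prompted ensemble members vote on a sentiment label.
print(report_noisy_max(["positive", "positive", "negative"], ["positive", "negative"]))
```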

Are fairness metric scores enough to assess discrimination biases in machine learning?
Authors: Fanny Jourdan, Laurent Risser, Jean-Michel Loubes, Nicholas Asher

This paper presents novel experiments shedding light on the shortcomings of current metrics for assessing gender discrimination biases of machine learning algorithms on textual data. We focus on the Bios dataset, and our learning task is to predict the occupation of individuals based on their biography. Such prediction tasks are common in commercial Natural Language Processing (NLP) applications such as automatic job recommendations. We address an important limitation of theoretical discussions dealing with group-wise fairness metrics: they focus on large datasets, although the norm in many industrial NLP applications is to use small to reasonably large linguistic datasets for which the main practical constraint is to get a good prediction accuracy. We then question how reliable different popular measures of bias are when the size of the training set is just sufficient to learn reasonably accurate predictions. Our experiments sample the Bios dataset and learn more than 200 models on different sample sizes. This allows us to study our results statistically and to confirm that common gender bias indices provide diverging and sometimes unreliable results when applied to relatively small training and test samples. This highlights the crucial importance of variance calculations for providing sound results in this field.

DEPTH+: An Enhanced Depth Metric for Wikipedia Corpora Quality
Authors: Saied Alshahrani, Norah Alshahrani, Jeanna Matthews

Wikipedia articles are a common source of training data for Natural Language Processing (NLP) research, especially as a source of corpora in languages other than English. However, research has shown that not all Wikipedia editions are produced organically by native speakers, and there are substantial levels of automation and translation activity in the Wikipedia project that could negatively impact the degree to which they truly represent the language and culture of native speakers. To encourage transparency in the Wikipedia project, the Wikimedia Foundation introduced the depth metric as an indication of the degree of collaboration, or how frequently users edit a Wikipedia edition's articles. While a promising start, this depth metric suffers from a few serious problems, such as inadequate handling of inflation in the edit counts and underutilization of user-related metrics. In this paper, we propose the DEPTH+ metric, provide its mathematical definition, and describe how it better represents the depth of human collaborativeness. We also quantify bot activity in Wikipedia and offer a bot-free depth metric after removing bot-created articles and bot-made edits on Wikipedia articles.

Distinguishing Fact from Fiction: A Benchmark Dataset for Identifying Machine-Generated Scientific Papers in the LLM Era.
Authors: Edoardo Mosca, Mohamed Hesham Ibrahim Abdalla, Paolo Basso, Margherita Musumeci, Georg Groh

As generative NLP can now produce content nearly indistinguishable from human writing, it becomes difficult to identify genuine research contributions in academic writing and scientific publications. Moreover, information in NLP-generated text can potentially be factually wrong or even entirely fabricated. This study introduces a novel benchmark dataset, containing human-written and machine-generated scientific papers from SCIgen, GPT-2, GPT-3, ChatGPT, and Galactica. After describing the generation and extraction pipelines, we also experiment with four distinct classifiers as a baseline for detecting the authorship of scientific text. A strong focus is put on generalization capabilities and explainability to highlight the strengths and weaknesses of detectors. We believe our work serves as an important step towards creating more robust methods for distinguishing between human-written and machine-generated scientific papers, ultimately ensuring the integrity of scientific literature.

Detecting Personal Information in Training Corpora: an Analysis
Authors: Nishant Subramani, Sasha Luccioni, Jesse Dodge, Margaret Mitchell

Large language models are trained on increasing quantities of unstructured text, the largest sources of which are scraped from the Web. These Web scrapes are mainly composed of heterogeneous collections of text from multiple domains with minimal documentation. While some work has been done to identify and remove toxic, biased, or sexual language, the topic of personal information (PI) in textual data used for training Natural Language Processing (NLP) models is relatively under-explored. In this work, we draw from definitions of PI across multiple countries to define the first PI taxonomy of its kind, categorized by type and risk level. We then conduct a case study on the Colossal Clean Crawled Corpus (C4) and the Pile, to detect some of the highest-risk personal information, such as email addresses and credit card numbers, and examine the differences between automatic and regular expression-based approaches for their detection. We identify shortcomings in modern approaches for PI detection, and propose a reframing of the problem that is informed by global perspectives and the goals in personal information detection.

Enhancing textual counterfactual explanation intelligibility through Counterfactual Feature Importance
Authors: Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Marie-Jeanne Lesot

Textual counterfactual examples explain a prediction by modifying the tokens of an initial instance in order to flip the outcome of a classifier. Even under a sparsity constraint, counterfactual generation can lead to numerous changes from the initial text, making the explanation hard to understand. We propose Counterfactual Feature Importance, a method to make non-sparse counterfactual explanations more intelligible. Counterfactual Feature Importance assesses the importance of each token change between an instance to explain and its counterfactual example. We develop two ways of computing Counterfactual Feature Importance, based respectively on classifier gradient computation and on the evolution of the counterfactual generator loss during counterfactual search. We then design a global version of Counterfactual Feature Importance, providing rich information about the semantic fields that globally impact classifier predictions. Counterfactual Feature Importance makes it possible to focus on the impactful parts of counterfactual explanations, making counterfactual explanations involving numerous changes more understandable.

Driving Context into Text-to-Text Privatization
Authors: Stefan Arnold, Dilara Yesilbas, Sven Weinzierl

Metric Differential Privacy enables text-to-text privatization by adding calibrated noise to the vector of a word derived from an embedding space and projecting this noisy vector back to a discrete vocabulary using a nearest neighbor search. Since words are substituted without context, this mechanism is expected to fall short at finding substitutes for words with ambiguous meanings, such as 'bank'. To account for these ambiguous words, we leverage a sense embedding and incorporate a sense disambiguation step prior to noise injection. We accompany our modification to the privatization mechanism with an estimation of privacy and utility. For word sense disambiguation on the Words in Context dataset, we demonstrate a substantial increase in classification accuracy by 6.05%.

Privacy- and Utility-Preserving NLP with Anonymized data: A case study of Pseudonymization
Authors: Oleksandr Yermilov, Vipul Raheja, Artem Chernodub

This work investigates the effectiveness of different pseudonymization techniques, ranging from rule-based substitutions to using pre-trained Large Language Models (LLMs), on a variety of datasets and models used for two widely used NLP tasks: text classification and summarization. Our work provides crucial insights into the gaps between original and anonymized data (focusing on the pseudonymization technique) and model quality, and fosters future research into higher-quality anonymization techniques to better balance the trade-offs between data protection and utility preservation. We make our code, pseudonymized datasets, and downstream models publicly available.

Can NLP Models 'Identify', 'Distinguish', and 'Justify' Questions that Don't have a Definitive Answer?
Authors: Ayushi Agarwal, Nisarg Patel, Neeraj Varshney, Mihir Parmar, Pavan Mallina, Aryan Shah, Srihari Raju Sangaraju, Tirth Patel, Nihar Thakkar, Chitta Baral

Though state-of-the-art (SOTA) NLP systems have achieved remarkable performance on a variety of language understanding tasks, they primarily focus on questions that have a correct and definitive answer. However, in real-world applications, users often ask questions that don't have a definitive answer, such as questions about future events, questions lacking the necessary details to find the answer, and questions that are ambiguous. Incorrectly answering such questions certainly hampers a system's reliability and trustworthiness. Can SOTA models accurately identify such questions and provide a reasonable response? To investigate this question, we introduce QnotA, a dataset consisting of five different categories of questions that don't have definitive answers. Furthermore, for each QnotA instance, we also provide a corresponding 'QA' instance, i.e., an alternate question that "can be" answered. With this data, we formulate three evaluation tasks that test a system's ability to 'identify', 'distinguish', and 'justify' QnotA questions. Through comprehensive experiments, we show that even SOTA models including GPT-3 and Flan T5 do not fare well on these tasks and lag considerably behind the human performance baseline. We conduct a thorough analysis which leads to several interesting findings; for instance, despite not being able to accurately identify a QnotA question, GPT-3, when prompted to output a justification of why the given QnotA question doesn't have a definitive answer, is able to provide a reasonable one. Finally, we believe our work and findings will encourage and facilitate the development of more robust NLP systems that can also reasonably respond to questions that don't have a definitive answer.

SMoA: Sparse Mixture of Adapters to Mitigate Multiple Dataset Biases
Authors: Yanchen Liu, Jing Yan, Yan Chen, Jing Liu, Hua Wu

Recent studies have shown that various biases exist in different NLP tasks, and over-reliance on these biases can result in poor generalization and low adversarial robustness in models. To address this issue, previous research has proposed several debiasing techniques that effectively mitigate specific biases, but are limited in their ability to address other biases. In this paper, we introduce a novel debiasing method, Sparse Mixture-of-Adapters (SMoA), which can effectively and efficiently mitigate multiple dataset biases. Our experiments on Natural Language Inference and Paraphrase Identification tasks demonstrate that SMoA outperforms both full-finetuning and adapter tuning baselines, as well as prior strong debiasing methods. Further analysis reveals that SMoA is interpretable, with each sub-adapter capable of capturing specific patterns from the training data and specializing in handling specific biases.

GPTs Don't Keep Secrets: Searching for Backdoor Watermark Triggers in Autoregressive Language Models
Authors: Evan Lucas, Timothy Havens

This work analyzes backdoor watermarks in an autoregressive transformer fine-tuned to perform a generative sequence-to-sequence task, specifically summarization. We propose and demonstrate an attack to identify trigger words or phrases by analyzing open-ended generations from autoregressive models that have backdoor watermarks inserted. We show that triggers based on random common words are easier to identify than those based on single, rare tokens. The proposed attack is easy to implement and only requires access to the model weights. Code used to create the backdoor watermarked models and analyze their outputs is shared at [github link to be inserted for camera ready version].
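
To make the idea concrete, here is a hypothetical sketch of how over-represented tokens in a model's open-ended generations could be surfaced as trigger candidates; it is an illustration under simple whitespace tokenization, not the attack as implemented in the paper.

```python
from collections import Counter

def candidate_triggers(generations, reference_texts, top_k=20):
    """Rank tokens that appear unusually often in a model's open-ended
    generations relative to a reference corpus; heavily over-represented
    tokens are candidate backdoor triggers. Both inputs are lists of strings,
    tokenized here by simple whitespace splitting."""
    gen_counts = Counter(tok for g in generations for tok in g.split())
    ref_counts = Counter(tok for r in reference_texts for tok in r.split())
    gen_total = sum(gen_counts.values()) or 1
    ref_total = sum(ref_counts.values()) or 1
    ratio = {
        tok: (gen_counts[tok] / gen_total) / ((ref_counts[tok] + 1) / ref_total)
        for tok in gen_counts
    }
    return sorted(ratio, key=ratio.get, reverse=True)[:top_k]
```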

Make Text Unlearnable: Exploiting Effective Patterns to Protect Personal Data
Authors: Xinzhe Li, Ming Liu

This paper addresses the ethical concerns arising from the use of unauthorized public data in deep learning models and proposes a novel solution. Specifically, building on the work of Huang et al. (2021), we extend their bi-level optimization approach to generate unlearnable text using a gradient-based search technique. However, although effective, this approach faces practical limitations, including the requirement of batches of instances and of model architecture knowledge that is not readily accessible to ordinary users with limited access to their own data. Furthermore, even with semantic-preserving constraints, unlearnable noise can alter the text's semantics. To address these challenges, we extract simple patterns from unlearnable text produced by bi-level optimization and demonstrate that the data remains unlearnable for unknown models. Additionally, these patterns are not instance- or dataset-specific, allowing users to readily apply them to text classification and question-answering tasks, even if only a small proportion of users implement them on their public content. We also open-source code to generate unlearnable text and assess unlearnable noise to benefit the public and future studies.

Bias Beyond English: Counterfactual Tests for Bias in Sentiment Analysis in Four Languages
Authors: Seraphina Goldfarb-Tarrant, Adam Lopez, Roi Blanco, Diego Marcheggiani

Sentiment analysis (SA) systems are used in many products and hundreds of languages. Gender and racial biases are well-studied in English SA systems, but understudied in other languages, with few resources for such studies. To remedy this, we build a counterfactual evaluation corpus for gender and racial/migrant bias in four languages. We demonstrate its usefulness by answering a simple but important question that an engineer might need to answer when deploying a system: What biases do systems import from pre-trained models when compared to a baseline with no pre-training? Our evaluation corpus, by virtue of being counterfactual, not only reveals which models have less bias, but also pinpoints changes in model bias behaviour, which enables more targeted mitigation strategies. We release our code and evaluation corpora to facilitate future research.
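
As an illustration of counterfactual evaluation in general (not this paper's corpus or code), the sketch below scores templates with paired protected-attribute terms and reports the mean score gap; `sentiment_model`, the templates, and the term pairs are placeholders.

```python
def counterfactual_gap(sentiment_model, templates, term_pairs):
    """Average score difference between counterfactual pairs. `templates`
    contain a '{term}' slot (e.g. "{term} is a doctor."), `term_pairs` are
    (a, b) tuples such as paired names, and `sentiment_model(text)` is assumed
    to return a real-valued sentiment score."""
    gaps = []
    for template in templates:
        for a, b in term_pairs:
            gaps.append(
                sentiment_model(template.format(term=a))
                - sentiment_model(template.format(term=b))
            )
    return sum(gaps) / len(gaps)
```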

Training Data Extraction From Pre-trained Language Models: A Survey
Authors: Shotaro Ishihara

As the deployment of pre-trained language models (PLMs) expands, pressing security concerns have arisen regarding the potential for malicious extraction of training data, posing a threat to data privacy. This study is the first to provide a comprehensive survey of training data extraction from PLMs. Our review covers more than 100 key papers in fields such as natural language processing and security. First, preliminary knowledge is recapped and a taxonomy of various definitions of memorization is presented. The approaches for attack and defense are then systematized. Furthermore, the empirical findings of several quantitative studies are highlighted. Finally, future research directions based on this review are suggested.

Expanding Scope: Adapting English Adversarial Attacks to Chinese
Authors: Hanyu Liu, Chengyuan Cai, Yanjun Qi

Recent studies have revealed that NLP predictive models are vulnerable to adversarial attacks. Most existing studies have focused on designing attacks to evaluate the robustness of NLP models in English alone. As the literature sees an increasing need for NLP solutions in other languages, we ask a natural question: do state-of-the-art (SOTA) attack methods generalize to other languages? This paper investigates how to adapt SOTA adversarial attack algorithms in English to the Chinese language. Our experiments show that attack methods previously applied to English NLP can generate high-quality adversarial examples in Chinese when combined with proper text segmentation and linguistic constraints. In addition, we demonstrate that the generated adversarial examples can achieve high fluency and sentiment consistency by focusing on the Chinese language's morphology and phonology, which in turn can be used to improve the adversarial robustness of Chinese NLP models.

IMBERT: Making BERT Immune to Insertion-based Backdoor Attacks
Authors: Xuanli He, Jun Wang, Benjamin Rubinstein, Trevor Cohn

Backdoor attacks are an insidious security threat against machine learning models. Adversaries can manipulate the predictions of compromised models by inserting triggers into the training phase. Various backdoor attacks have been devised which can achieve nearly perfect attack success without affecting model predictions for clean inputs. Means of mitigating such vulnerabilities are underdeveloped, especially in natural language processing. To fill this gap, we introduce IMBERT, which uses either gradients or self-attention scores derived from victim models to self-defend against backdoor attacks at inference time. Our empirical studies demonstrate that IMBERT can effectively identify up to 98.5% of inserted triggers. Thus, it significantly reduces the attack success rate while attaining competitive accuracy on the clean dataset across widespread insertion-based attacks compared to two baselines. Finally, we show that our approach is model-agnostic, and can be easily ported to several pre-trained transformer models.

Large Language Models with Controllable Working Memory
Authors: Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, Sanjiv Kumar

Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP), owing to their excellent understanding and generation abilities. Remarkably, what further sets these models apart is the massive amounts of world knowledge they internalize during pretraining. While many downstream applications provide the model with an informational context to aid its performance on the underlying task, how the model's world knowledge interacts with the factual information presented in the context remains underexplored. As a desirable behavior, an LLM should give precedence to the context whenever it contains task-relevant information that conflicts with the model's memorized knowledge. This enables model predictions to be grounded in the context, which can then be used to update or correct specific model predictions without frequent retraining. By contrast, when the context is irrelevant to the task, the model should ignore it and fall back on its internal knowledge. In this paper, we undertake a first joint study of the aforementioned two properties, namely controllability and robustness, in the context of LLMs. We demonstrate that state-of-the-art T5 and PaLM (both pretrained and finetuned) could exhibit poor controllability and robustness, which do not scale with increasing model size. As a solution, we propose a novel method - Knowledge Aware FineTuning (KAFT) - to strengthen both controllability and robustness by incorporating counterfactual and irrelevant contexts to standard supervised datasets. Our comprehensive evaluation showcases the utility of KAFT across model architectures and sizes.

On The Real-world Performance of Machine Translation: Exploring Social Media Post-authors' Perspectives
Authors: Ananya Gupta, Jae Takeuchi, Bart Knijnenburg

Many social networking sites (SNS) offer machine translation of posts in an effort to increase understanding, engagement, and connectivity between users across language barriers. However, the translations of these posts are still not 100% accurate and can be a cause of misunderstandings that can harm post-authors' professional or personal relationships. An exacerbating factor is that on most SNS, authors cannot view the translation of their own posts, nor make corrections to inaccurate translations. This paper reports findings from a survey (N = 189) and an interview study (N = 15) exploring users' concerns regarding this automatic form of machine translation. Our findings show that users are concerned about potential inaccuracies in the meaning of the translations of their posts, and would thus appreciate being able to view and potentially correct such translations. Additionally, we found that when users write posts in their native language, they write them for specific audiences, so they do not always want them translated. This underscores the urgency of providing users with more control over the translation of their posts.

ActiveAED: A Human in the Loop Improves Annotation Error Detection
Authors: Leon Weber, Barbara Plank

Manually annotated datasets are crucial for training and evaluating Natural Language Processing models. However, recent work has discovered that even widely-used benchmark datasets contain a substantial number of erroneous annotations. This problem has been addressed with Annotation Error Detection (AED) models, which can flag such errors for human re-annotation. However, even though many of these AED methods assume a final curation step in which a human annotator decides whether the annotation is erroneous, they have been developed as static models without any human-in-the-loop component. In this work, we propose ActiveAED, an AED method that can detect errors more accurately by repeatedly querying a human for error corrections in its prediction loop. We evaluate ActiveAED on eight datasets spanning five different tasks and find that it leads to improvements over the state of the art on seven of them, with gains of up to six percentage points in average precision.

Shielded Representations: Protecting Sensitive Attributes Through Iterative Gradient-Based Projection
Authors: Shadi Iskander, Kira Radinsky, Yonatan Belinkov

This paper proposes a novel approach, called Iterative Gradient-Based Projection (IGBP), for removing non-linearly encoded demographic information from neural representations. The method is evaluated on gender and race attributes using intrinsic and extrinsic metrics. The comprehensive results demonstrate the effectiveness of the proposed method.

Automated Ableism: An Exploration of Explicit Disability Biases in Sentiment and Toxicity Analysis Models
Authors: Pranav Narayanan Venkit, Mukund Srinath, Shomir Wilson

We analyze sentiment analysis and toxicity detection models to detect the presence of explicit bias against people with disability (PWD). We employ the bias identification framework of Perturbation Sensitivity Analysis to examine conversations related to PWD on social media platforms, specifically Twitter and Reddit, in order to gain insight into how disability bias is disseminated in real-world social settings. We then create the Bias Identification Test in Sentiment (BITS) corpus to quantify explicit disability bias in any sentiment analysis and toxicity detection model. Our study utilizes BITS to uncover significant biases in four open AIaaS (AI as a Service) sentiment analysis tools, namely TextBlob, VADER, Google Cloud Natural Language API, and DistilBERT, and in two toxicity detection models, namely two versions of Toxic-BERT. Our findings indicate that all of these models exhibit statistically significant explicit bias against PWD.

Keeping Up with the Language Models: Robustness-Bias Interplay in NLI Data and Models
Authors: Ioana Baldini, Chhavi Yadav, Payel Das, Kush Varshney

Auditing unwanted social bias in language models (LMs) is inherently hard due to the multi-disciplinary nature of the work. In addition, the rapid evolution of LMs can make benchmarks irrelevant in no time. Bias auditing is further complicated by LM brittleness: when a presumably biased outcome is observed, is it due to model bias or model brittleness? We propose enlisting the models themselves to help construct bias auditing datasets that remain challenging, and introduce bias measures that distinguish between types of model errors. First, we extend an existing bias benchmark for NLI (BBNLI) using a combination of LM-generated lexical variations, adversarial filtering, and human validation. We demonstrate that the newly created dataset (BBNLI-next) is more challenging than BBNLI: on average, BBNLI-next reduces the accuracy of state-of-the-art NLI models from 95.3%, as observed on BBNLI, to 58.6%. Second, we employ BBNLI-next to showcase the interplay between robustness and bias, and the subtlety in differentiating between the two. Third, we point out shortcomings in current bias scores used in the literature and propose bias measures that take into account pro-/anti-stereotype bias and model brittleness. We will publicly release the BBNLI-next dataset to inspire research on rapidly expanding benchmarks to keep up with model evolution, along with research on the robustness-bias interplay in bias auditing. Note: This paper contains offensive text examples.

Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values
Authors: Yejin Bang, Tiezheng Yu, Andrea Madotto, Zhaojiang Lin, Mona Diab, Pascale Fung

Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values. Yet, human values can vary under diverse cultural conditions. Therefore, we introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command. Along with the task, we propose a practical approach that distills value-aligned knowledge from large-scale language models (LLMs) to construct value-aligned classifiers in two steps. First, we generate value-aligned training data from LLMs by prompt-based few-shot learning. Next, we fine-tune smaller classification models with the generated data for the task. Empirical results show that our VA-Models surpass multiple baselines by at least 15.56% on the F1-score, including few-shot learning with OPT-175B and existing text augmentation methods. We suggest that using classifiers with explicit human value input improves both inclusivity and explainability in AI.

This Prompt is Measuring <MASK>: Evaluating Bias Evaluation in Language Models
Authors: Seraphina Goldfarb-Tarrant, Eddie Ungless, Esma Balkir, Su Lin Blodgett

Bias research in NLP seeks to analyse models for social biases, thus helping NLP practitioners uncover, measure, and mitigate social harms. We analyse the body of work that uses prompts and templates to assess bias in language models. We draw on a measurement modelling framework to create a taxonomy of attributes that capture what a bias test aims to measure and how that measurement is carried out. By applying this taxonomy to 90 bias tests, we illustrate qualitatively and quantitatively that core aspects of bias test conceptualisations and operationalisations are frequently unstated or ambiguous, carry implicit assumptions, or are mismatched. Our analysis illuminates the scope of possible bias types the field is able to measure, and reveals types that are as yet under-researched. We offer guidance to enable the community to explore a wider section of the possible bias space, and to better close the gap between desired outcomes and experimental design, both for bias and for evaluating language models more broadly.

COCKATIEL: COntinuous Concept ranKed ATtribution with Interpretable ELements for explaining neural net classifiers on NLP tasks
Authors: Fanny Jourdan, Agustin Picard, Laurent Risser, Jean-Michel Loubes, Nicholas Asher

Transformer architectures are complex and, while their use in NLP has engendered many successes, their interpretability or explainability remains challenging. Recent debates have shown that attention maps and attribution methods are unreliable (Pruthi et al., 2019; Brunner et al., 2019). In this paper, we present some of their limitations and introduce COCKATIEL, which successfully addresses some of them. COCKATIEL is a novel, post-hoc, concept-based, model-agnostic XAI technique that generates meaningful explanations from the last layer of a neural net model trained on an NLP classification task. It uses Non-Negative Matrix Factorization (NMF) to discover the concepts the model leverages to make predictions, and exploits a Sensitivity Analysis to accurately estimate the importance of each of these concepts for the model. It does so without compromising the accuracy of the underlying model or requiring a new one to be trained. We conduct experiments on single- and multi-aspect sentiment analysis tasks and show COCKATIEL's superior ability to discover concepts that align with humans' on Transformer models without any supervision; we objectively verify the faithfulness of its explanations through fidelity metrics, and we showcase its ability to provide meaningful explanations on two different datasets.

Strength in Numbers: Estimating Confidence of Large Language Models by Prompt Agreement
Authors: Gwenyth Portillo Wightman, Alexandra DeLucia, Mark Dredze

Large language models have achieved impressive few-shot performance on a wide variety of tasks. However, in many settings, users require confidence estimates for model predictions. While traditional classifiers produce scores for each label, language models instead produce scores for the generation which may not be well calibrated. We compare generations across diverse prompts and show that these can be used to create confidence scores. By utilizing more prompts we can get more precise confidence estimates and use response diversity as a proxy for confidence. We evaluate this approach across ten multiple-choice question-answering datasets using three models: T0, FLAN-T5, and GPT-3. In addition to analyzing multiple human written prompts, we automatically generate more prompts using a language model in order to produce finer-grained confidence estimates. Our method produces more calibrated confidence estimates compared to the log probability of the answer to a single prompt. These improvements could benefit users who rely on prediction confidence for integration into a larger system or in decision-making processes.
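
A minimal sketch of the general prompt-agreement idea (not the paper's exact scoring): query the model once per prompt paraphrase and use the fraction of agreeing answers as a confidence proxy. The `answer_fn` callable and the prompt list are hypothetical.

```python
from collections import Counter

def prompt_agreement_confidence(answer_fn, prompts, question):
    """Query the model once per prompt paraphrase and return the majority
    answer together with the fraction of prompts that agree with it, used as
    a confidence proxy. `answer_fn(prompt, question)` is assumed to return an
    answer string for a multiple-choice question."""
    answers = [answer_fn(prompt, question) for prompt in prompts]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / len(answers)
```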

Adversarial Named-Entity Recognition with Word Attributions and Disentanglement
Authors: Xiaomeng Jin, Bhanukiran Vinzamuri, Sriram Venkatapathy, Heng Ji, Pradeep Natarajan

The problem of making Named Entity Recognition (NER) models robust to adversarial attacks has received widespread attention recently (Simoncini and Spanakis, 2021; Lin et al., 2021). Existing techniques for robustifying NER models rely on exhaustive perturbation of the input training data to generate adversarial examples, often resulting in adversarial examples that are not semantically equivalent to the original. In this paper, we employ word-attribution-guided perturbations that generate adversarial examples with comparable attack rates but at a lower modification rate. Our approach also uses disentanglement of entity and non-entity word representations as a mechanism to generate diverse and unbiased adversarial examples. Adversarial training based on our method improves the F1 score over the originally trained NER model by 8% and 18% on the CoNLL-2003 and OntoNotes 5.0 datasets respectively.

Characterizing Political Bias in Automatic Summaries: A Case Study of Trump and Biden
Authors: Karen Zhou, Chenhao Tan

Growing literature has shown that powerful NLP systems may encode social biases; however, the political bias of summarization models remains relatively unknown. In this work, we use an entity replacement method to investigate the portrayal of politicians in automatically generated summaries of news articles. We develop a computational framework based on political entities and lexical resources, and use it to assess biases about Donald Trump and Joe Biden in both extractive and abstractive summarization models. We find consistent differences, such as stronger associations of a collective US government (i.e., administration) with Biden than with Trump. These summary dissimilarities are most prominent when the entity is heavily featured in the source article. Our systematic characterization provides a framework for future studies of bias in summarization.
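
As a toy illustration of entity replacement in general (not the paper's framework), the function below swaps two politician names in an article and returns summaries of both versions for comparison; `summarize` is a placeholder for any summarization model.

```python
import re

def entity_swap_summaries(summarize, article, entity_a="Trump", entity_b="Biden"):
    """Summarize the original article and a copy in which the two entities are
    swapped, so downstream analyses can compare how each politician is
    portrayed under identical context. `summarize(text)` is a placeholder for
    any extractive or abstractive summarization model."""
    placeholder = "\x00ENTITY\x00"
    swapped = re.sub(rf"\b{entity_a}\b", placeholder, article)
    swapped = re.sub(rf"\b{entity_b}\b", entity_a, swapped)
    swapped = swapped.replace(placeholder, entity_b)
    return summarize(article), summarize(swapped)
```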

Model-tuning Via Prompts Makes NLP Models Adversarially Robust
Authors: Mrigank Raman, Pratyush Maini, Zico Kolter, Zachary C. Lipton, Danish Pruthi

In recent years, NLP practitioners have converged on the following practice: (i) import an off-the-shelf pretrained (masked) language model; (ii) append a multilayer perceptron atop the CLS token's hidden representation (with randomly initialized weights); and (iii) fine-tune the entire model on a downstream task. This procedure has produced massive gains on standard NLP benchmarks, but these models remain brittle, even to mild adversarial perturbations such as word-level synonym substitutions. In this work, we demonstrate surprising gains in adversarial robustness enjoyed by Model-tuning Via Prompts (MVP), an alternative method of adapting to downstream tasks. Rather than modifying the model (by appending an MLP head), MVP instead modifies the input (by appending a prompt template). Across three classification datasets, MVP improves performance against adversarial word-level synonym substitutions by an average of 8% over standard methods and even outperforms adversarial-training-based state-of-the-art defenses by 3.5%. By combining MVP with adversarial training, we achieve further improvements in robust accuracy while maintaining clean accuracy. Finally, we conduct ablations to investigate the mechanism underlying these gains. Notably, we find that the main causes of vulnerability of the MLP approach can be attributed to the misalignment between pre-training and fine-tuning tasks, and to the randomly initialized MLP parameters.

Mitigating Bias for Question Answering Models by Tracking Bias Influence
Authors: Mingyu Derek Ma, Jiun-Yu Kao, Arpit Gupta, Yu-Hsiang Lin, Wenbo Zhao, Tagyoung Chung, Kai-Wei Chang, Nanyun Peng

Models of various NLP tasks have been shown to exhibit stereotypes, and bias in question answering (QA) models is especially harmful as the output answers might be directly consumed by end users. There are datasets to evaluate bias in QA models, while bias mitigation techniques for QA models are still under-explored. In this work, we propose BMBI, an approach to mitigate the bias of multiple-choice QA models. Based on the intuition that a model would tend to be more biased if it learns from a biased example, we measure the bias level of a query instance by observing its influence on another instance. If the influenced instance becomes more biased, we infer that the query instance is biased. We then use the detected bias level as an optimization objective to form a multi-task learning setting in addition to the original QA task. We further introduce a new bias evaluation metric to quantify bias in a comprehensive and sensitive way. We show that our method can be applied to multiple QA formulations across multiple bias categories. It can significantly reduce the bias level in all 9 bias categories in the BBQ dataset while maintaining comparable QA accuracy.

Pay Attention to the Robustness of Chinese Minority Language Models! Syllable-level Textual Adversarial Attack on Tibetan Script
Authors: Xi Cao, Dolma Dawa, Nuo Qun, Trashi Nyima

A textual adversarial attack is an attack method in which the attacker adds imperceptible perturbations to the original texts by elaborate design so that the NLP (natural language processing) model produces false judgments. This method is also used to evaluate the robustness of NLP models. Currently, most of the research in this field focuses on English, and there is also a certain amount of research on Chinese. However, to the best of our knowledge, there is little research targeting Chinese minority languages. Textual adversarial attacks are a new challenge for the information processing of Chinese minority languages. In response to this situation, we propose a Tibetan syllable-level black-box textual adversarial attack called TSAttacker, based on syllable cosine distance and a scoring mechanism. We then conduct TSAttacker on six models generated by fine-tuning two PLMs (pre-trained language models) for three downstream tasks. The experiment results show that TSAttacker is effective and generates high-quality adversarial samples. In addition, the robustness of the involved models still has much room for improvement.

PromptAttack: Probing Dialogue State Trackers with Adversarial Prompts
Authors: Xiangjue Dong, Yun He, Ziwei Zhu, James Caverlee

A key component of modern conversational systems is the Dialogue State Tracker (or DST), which models a user's goals and needs. Toward building more robust and reliable DSTs, we introduce a prompt-based learning approach to automatically generate effective adversarial examples to probe DST models. Two key characteristics of this approach are: (i) it only needs the output of the DST with no need for model parameters; and (ii) it can learn to generate natural language utterances that can target any DST. Through experiments over state-of-the-art DSTs, the proposed framework leads to the greatest reduction in accuracy and the best attack success rate while maintaining good fluency and low perturbation ratio. We also show how much the generated adversarial examples can bolster a DST through adversarial training. These results indicate the strength of prompt-based attacks on DSTs and leave open avenues for continued refinement.

Can we trust the evaluation on ChatGPT?
Authors: Rachith Aiyappa, Jisun An, Haewoon Kwak, Yong-Yeol Ahn

ChatGPT, the first large language model with mass adoption, has demonstrated remarkable performance in numerous natural language tasks. Despite its evident usefulness, evaluating ChatGPT's performance in diverse problem domains remains challenging due to the closed nature of the model and its continuous updates via Reinforcement Learning from Human Feedback (RLHF). We highlight the issue of data contamination in ChatGPT evaluations, with a case study in stance detection. We discuss the challenge of preventing data contamination and ensuring fair model evaluation in the age of closed and continuously trained models.


ACL 2023
