Can we trust the evaluation on ChatGPT?
Rachith Aiyappa, Jisun An, Haewoon Kwak, Yong-yeol Ahn
The Third Workshop on Trustworthy Natural Language Processing Paper
TLDR:
ChatGPT, the first large language model with mass adoption, has demonstrated remarkableperformance in numerous natural language tasks. Despite its evident usefulness, evaluatingChatGPT's performance in diverse problem domains remains challenging due to the closednature of the model and its continuou
You can open the
#paper-TrustNLP_8
channel in a separate window.
Abstract:
ChatGPT, the first large language model with mass adoption, has demonstrated remarkableperformance in numerous natural language tasks. Despite its evident usefulness, evaluatingChatGPT's performance in diverse problem domains remains challenging due to the closednature of the model and its continuous updates via Reinforcement Learning from HumanFeedback (RLHF). We highlight the issue of data contamination in ChatGPT evaluations, with a case study in stance detection. We discuss the challenge of preventing data contamination and ensuring fair model evaluation in the age of closed and continuously trained models.