Drowzee: Metamorphic Testing for Fact-conflicting Hallucination Detection in Large Language Models
Large language models (LLMs) have revolutionized language processing, but they face critical challenges with security, privacy, and hallucinations: coherent but factually inaccurate outputs. A major issue is fact-conflicting hallucination (FCH), where LLMs produce content that contradicts established facts. Addressing FCH is difficult due to two key challenges: (1) automatically constructing and updating benchmark datasets is hard, because existing methods rely on manually curated static benchmarks that cannot cover the broad, evolving spectrum of FCH cases; and (2) validating the reasoning behind LLM outputs is inherently difficult, especially for complex logical relations.
To tackle these challenges, we introduce a novel logic-programming-aided metamorphic testing technique for FCH detection. We develop an extensive and extensible framework that constructs a comprehensive factual knowledge base by crawling sources such as Wikipedia, seamlessly integrated into Drowzee. Using logical reasoning rules, we transform and augment this knowledge into a large set of test cases with ground-truth answers. We test LLMs on these cases through template-based prompts, requiring them to provide reasoned answers. To validate their reasoning, we propose two semantic-aware oracles that assess the similarity between the logical/semantic structures of the LLM answers and the ground truth. Our approach automatically generates useful test cases and identifies hallucinations across six LLMs within nine domains, with hallucination rates ranging from 24.7% to 59.8%. Key findings are that LLMs struggle with temporal concepts and out-of-distribution knowledge and lack logical reasoning capabilities. The results show that logic-based test cases generated by Drowzee effectively trigger and detect hallucinations. To further mitigate the identified FCHs, we explored model editing techniques, which proved effective on a small scale (with edits to fewer than 1,000 knowledge pieces). Our findings emphasize the need for continued community efforts to detect and mitigate model hallucinations.
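To make the abstract's pipeline concrete, below is a minimal, illustrative sketch (not the authors' implementation) of the idea of deriving test cases from factual knowledge via a logical reasoning rule and checking answers with a simple oracle. The seed facts, the transitivity-style rule, the prompt template, and the `simple_oracle` helper are all assumptions introduced here for illustration; Drowzee's actual reasoning rules and semantic-aware oracles are more sophisticated.

```python
# Sketch of logic-rule-based test generation for FCH detection.
# All facts, rules, and the oracle below are illustrative assumptions.

from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str


# Seed knowledge base (e.g., facts extracted from a source like Wikipedia).
KB = [
    Fact("Marie Curie", "born_in", "Warsaw"),
    Fact("Warsaw", "located_in", "Poland"),
]


def derive_by_composition(kb):
    """Apply a simple composition rule:
    born_in(X, Y) and located_in(Y, Z)  =>  born_in_country(X, Z)."""
    derived = []
    for f1 in kb:
        for f2 in kb:
            if (f1.relation == "born_in" and f2.relation == "located_in"
                    and f1.obj == f2.subject):
                derived.append(Fact(f1.subject, "born_in_country", f2.obj))
    return derived


def to_test_case(fact):
    """Turn a derived fact into a template-based prompt plus ground truth."""
    prompt = (f"Was {fact.subject} born in {fact.obj}? "
              "Answer yes or no and explain your reasoning.")
    return {"prompt": prompt, "ground_truth": "yes", "fact": fact}


def simple_oracle(llm_answer, ground_truth):
    """Toy stand-in for a semantic-aware oracle: flag a potential
    fact-conflicting hallucination when the answer contradicts ground truth."""
    normalized = llm_answer.strip().lower()
    return ("hallucination" if not normalized.startswith(ground_truth)
            else "consistent")


if __name__ == "__main__":
    for case in map(to_test_case, derive_by_composition(KB)):
        print(case["prompt"])
        # In the real pipeline the prompt would be sent to the LLM under test;
        # here the oracle is demonstrated on a mocked (incorrect) answer.
        print(simple_oracle("No, she was born in France.", case["ground_truth"]))
```

The underlying metamorphic relation is that an LLM consistent with the seed facts should also agree with any fact derivable from them by the reasoning rules; a contradiction on a derived case signals a candidate fact-conflicting hallucination.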
Fri 25 Oct (displayed time zone: Pacific Time, US & Canada)
16:00 - 17:40 | Testing Everything, Everywhere, All At Once (OOPSLA 2024) at IBR East
Chair(s): Alex Potanin (Australian National University)

16:00 (20m) Talk | Crabtree: Rust API Test Synthesis Guided by Coverage and Type (OOPSLA 2024)
Yoshiki Takashima (Carnegie Mellon University), Chanhee Cho (Carnegie Mellon University), Ruben Martins (Carnegie Mellon University), Limin Jia, Corina S. Păsăreanu (Carnegie Mellon University; NASA Ames) | DOI

16:20 (20m) Talk | Drowzee: Metamorphic Testing for Fact-conflicting Hallucination Detection in Large Language Models (OOPSLA 2024)
Ningke Li (Huazhong University of Science and Technology), Yuekang Li (UNSW), Yi Liu (Nanyang Technological University), Ling Shi (Nanyang Technological University), Kailong Wang (Huazhong University of Science and Technology), Haoyu Wang (Huazhong University of Science and Technology) | DOI

16:40 (20m) Talk | Reward Augmentation in Reinforcement Learning for Testing Distributed Systems (OOPSLA 2024)
Andrea Borgarelli (Max Planck Institute for Software Systems), Constantin Enea (LIX, CNRS, Ecole Polytechnique), Rupak Majumdar (MPI-SWS), Srinidhi Nagendra (CNRS, Université Paris Cité, IRIF, Chennai Mathematical Institute) | DOI

17:00 (20m) Talk | Rustlantis: Randomized Differential Testing of the Rust Compiler (OOPSLA 2024) | DOI

17:20 (20m) Talk | Statistical Testing of Quantum Programs via Fixed-Point Amplitude Amplification (OOPSLA 2024) | DOI