Fri 25 Oct 2024, 16:20 - 16:40, at IBR East - Testing Everything, Everywhere, All At Once. Chair(s): Alex Potanin

Large language models (LLMs) have revolutionized language processing, but they face critical challenges involving security, privacy, and hallucinations, i.e., coherent but factually inaccurate outputs. A major issue is fact-conflicting hallucination (FCH), where LLMs produce content that contradicts established facts. Addressing FCH is difficult for two reasons: (1) automatically constructing and updating benchmark datasets is hard, as existing methods rely on manually curated static benchmarks that cannot cover the broad, evolving spectrum of FCH cases; and (2) validating the reasoning behind LLM outputs is inherently difficult, especially for complex logical relations.

To tackle these challenges, we introduce a novel logic-programming-aided metamorphic testing technique for FCH detection, implemented in our extensive and extensible framework Drowzee. Drowzee constructs a comprehensive factual knowledge base by crawling sources such as Wikipedia. Using logical reasoning rules, it transforms and augments this knowledge into a large set of test cases with ground-truth answers. We test LLMs on these cases through template-based prompts that require them to provide reasoned answers. To validate their reasoning, we propose two semantic-aware oracles that assess the similarity between the logical/semantic structures of the LLM answers and the ground truth. Our approach automatically generates useful test cases and identifies hallucinations across six LLMs in nine domains, with hallucination rates ranging from 24.7% to 59.8%. Key findings are that LLMs struggle with temporal concepts and out-of-distribution knowledge, and that they lack logical reasoning capabilities. The results show that the logic-based test cases generated by Drowzee effectively trigger and detect hallucinations. To further mitigate the identified FCHs, we explored model editing techniques, which proved effective at a small scale (edits to fewer than 1,000 knowledge pieces). Our findings emphasize the need for continued community efforts to detect and mitigate model hallucinations.
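
To make the pipeline concrete, the following is a minimal, hypothetical Python sketch of the idea (not the authors' implementation): it derives new ground-truth facts from a small triple-based knowledge base with one simple reasoning rule, turns them into template-based prompts, and scores the triples recovered from an answer against the ground truth as a rough stand-in for the semantic-aware oracles. The facts, the rule, and the triple-overlap scoring are illustrative assumptions.

```python
# Illustrative sketch only: logic-rule-based test-case generation over a tiny
# fact base, plus a crude structural oracle. Facts, rules, and scoring are
# assumptions for exposition, not the Drowzee implementation.

# Factual knowledge as (subject, relation, object) triples, e.g. crawled
# from a source such as Wikipedia.
FACTS = {
    ("Alan Turing", "born_in", "London"),
    ("London", "located_in", "United Kingdom"),
}

def apply_transitivity(facts):
    """Reasoning rule: born_in(X, C) and located_in(C, Y) => born_in_region(X, Y)."""
    derived = set()
    for s1, r1, o1 in facts:
        for s2, r2, o2 in facts:
            if r1 == "born_in" and r2 == "located_in" and o1 == s2:
                derived.add((s1, "born_in_region", o2))
    return derived

def to_prompt(triple):
    """Template-based prompt that asks for a reasoned answer."""
    subj, _, obj = triple
    return (f"Is it true that {subj} was born in a place located in {obj}? "
            f"Answer yes or no and justify your reasoning step by step.")

def oracle_score(expected, answered):
    """Toy oracle: Jaccard overlap between expected and extracted triples.
    A low score flags a potential fact-conflicting hallucination."""
    if not expected and not answered:
        return 1.0
    return len(expected & answered) / len(expected | answered)

if __name__ == "__main__":
    test_triples = apply_transitivity(FACTS)
    for t in test_triples:
        print(to_prompt(t))
    # Suppose a (hypothetical) answer-parsing step recovered these triples
    # from the LLM's response:
    answered = {("Alan Turing", "born_in_region", "United Kingdom")}
    print("oracle score:", oracle_score(test_triples, answered))
```

The real approach described above is far richer (multiple logical relations, knowledge augmentation, and two semantic-aware oracles over logical/semantic structure), but the generate-prompt-check loop has the same shape as this sketch.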

Fri 25 Oct

Displayed time zone: Pacific Time (US & Canada)

16:00 - 17:40: Testing Everything, Everywhere, All At Once (OOPSLA 2024) at IBR East
Chair(s): Alex Potanin (Australian National University)
16:00 (20m) Talk: Crabtree: Rust API Test Synthesis Guided by Coverage and Type (OOPSLA 2024)
Yoshiki Takashima (Carnegie Mellon University), Chanhee Cho (Carnegie Mellon University), Ruben Martins (Carnegie Mellon University), Limin Jia, Corina S. Păsăreanu (Carnegie Mellon University; NASA Ames)
16:20 (20m) Talk: Drowzee: Metamorphic Testing for Fact-conflicting Hallucination Detection in Large Language Models (OOPSLA 2024)
Ningke Li (Huazhong University of Science and Technology), Yuekang Li (UNSW), Yi Liu (Nanyang Technological University), Ling Shi (Nanyang Technological University), Kailong Wang (Huazhong University of Science and Technology), Haoyu Wang (Huazhong University of Science and Technology)
16:40 (20m) Talk: Reward Augmentation in Reinforcement Learning for Testing Distributed Systems (OOPSLA 2024)
Andrea Borgarelli (Max Planck Institute for Software Systems), Constantin Enea (LIX, CNRS, Ecole Polytechnique), Rupak Majumdar (MPI-SWS), Srinidhi Nagendra (CNRS, Université Paris Cité, IRIF; Chennai Mathematical Institute)
17:00 (20m) Talk: Rustlantis: Randomized Differential Testing of the Rust Compiler (OOPSLA 2024)
Qian (Andy) Wang (ETH Zurich and Imperial College London), Ralf Jung (ETH Zurich)
17:20 (20m) Talk: Statistical Testing of Quantum Programs via Fixed-Point Amplitude Amplification (OOPSLA 2024)
Chan Gu Kang (Korea University), Joonghoon Lee (Korea University), Hakjoo Oh (Korea University)