Drowzee: Metamorphic Testing for Fact-conflicting Hallucination Detection in Large Language Models
Large language models (LLMs) have revolutionized language processing, but they face critical challenges with security, privacy, and hallucinations: coherent but factually inaccurate outputs. A major issue is fact-conflicting hallucination (FCH), where LLMs produce content that contradicts established facts. Addressing FCH is difficult due to two key challenges: (1) automatically constructing and updating benchmark datasets is hard, because existing methods rely on manually curated static benchmarks that cannot cover the broad, evolving spectrum of FCH cases; and (2) validating the reasoning behind LLM outputs is inherently difficult, especially for complex logical relations.
To tackle these challenges, we introduce a novel logic-programming-aided metamorphic testing technique for FCH detection. We develop an extensive and extensible framework that constructs a comprehensive factual knowledge base by crawling sources such as Wikipedia, seamlessly integrated into Drowzee. Using logical reasoning rules, we transform and augment this knowledge into a large set of test cases with ground-truth answers. We test LLMs on these cases through template-based prompts, requiring them to provide reasoned answers. To validate their reasoning, we propose two semantic-aware oracles that assess the similarity between the logical/semantic structures of the LLM answers and the ground truth. Our approach automatically generates useful test cases and identifies hallucinations across six LLMs within nine domains, with hallucination rates ranging from 24.7% to 59.8%. Key findings are that LLMs struggle with temporal concepts and out-of-distribution knowledge and lack logical reasoning capabilities. The results show that logic-based test cases generated by Drowzee effectively trigger and detect hallucinations. To further mitigate the identified FCHs, we explored model editing techniques, which proved effective on a small scale (with edits to fewer than 1,000 knowledge pieces). Our findings emphasize the need for continued community efforts to detect and mitigate model hallucinations.
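To make the abstract's pipeline concrete, below is a minimal, illustrative sketch (not the authors' implementation) of the idea of deriving test cases from factual knowledge via a logical reasoning rule and checking answers with a simple oracle. The seed facts, the transitivity-style rule, the prompt template, and the `simple_oracle` helper are all assumptions introduced here for illustration; Drowzee's actual reasoning rules and semantic-aware oracles are more sophisticated.

```python
# Sketch of logic-rule-based test generation for FCH detection.
# All facts, rules, and the oracle below are illustrative assumptions.

from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str


# Seed knowledge base (e.g., facts extracted from a source like Wikipedia).
KB = [
    Fact("Marie Curie", "born_in", "Warsaw"),
    Fact("Warsaw", "located_in", "Poland"),
]


def derive_by_composition(kb):
    """Apply a simple composition rule:
    born_in(X, Y) and located_in(Y, Z)  =>  born_in_country(X, Z)."""
    derived = []
    for f1 in kb:
        for f2 in kb:
            if (f1.relation == "born_in" and f2.relation == "located_in"
                    and f1.obj == f2.subject):
                derived.append(Fact(f1.subject, "born_in_country", f2.obj))
    return derived


def to_test_case(fact):
    """Turn a derived fact into a template-based prompt plus ground truth."""
    prompt = (f"Was {fact.subject} born in {fact.obj}? "
              "Answer yes or no and explain your reasoning.")
    return {"prompt": prompt, "ground_truth": "yes", "fact": fact}


def simple_oracle(llm_answer, ground_truth):
    """Toy stand-in for a semantic-aware oracle: flag a potential
    fact-conflicting hallucination when the answer contradicts ground truth."""
    normalized = llm_answer.strip().lower()
    return ("hallucination" if not normalized.startswith(ground_truth)
            else "consistent")


if __name__ == "__main__":
    for case in map(to_test_case, derive_by_composition(KB)):
        print(case["prompt"])
        # In the real pipeline the prompt would be sent to the LLM under test;
        # here the oracle is demonstrated on a mocked (incorrect) answer.
        print(simple_oracle("No, she was born in France.", case["ground_truth"]))
```

The underlying metamorphic relation is that an LLM consistent with the seed facts should also agree with any fact derivable from them by the reasoning rules; a contradiction on a derived case signals a candidate fact-conflicting hallucination.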
Fri 25 Oct (displayed time zone: Pacific Time, US & Canada)
16:00 - 17:40 | Testing Everything, Everywhere, All At Once (OOPSLA 2024) at IBR East
Chair(s): Alex Potanin (Australian National University)

16:00 (20m) Talk | Crabtree: Rust API Test Synthesis Guided by Coverage and Type (OOPSLA 2024)
Yoshiki Takashima (Carnegie Mellon University), Chanhee Cho (Carnegie Mellon University), Ruben Martins (Carnegie Mellon University), Limin Jia, Corina S. Păsăreanu (Carnegie Mellon University; NASA Ames) | DOI

16:20 (20m) Talk | Drowzee: Metamorphic Testing for Fact-conflicting Hallucination Detection in Large Language Models (OOPSLA 2024)
Ningke Li (Huazhong University of Science and Technology), Yuekang Li (UNSW), Yi Liu (Nanyang Technological University), Ling Shi (Nanyang Technological University), Kailong Wang (Huazhong University of Science and Technology), Haoyu Wang (Huazhong University of Science and Technology) | DOI

16:40 (20m) Talk | Reward Augmentation in Reinforcement Learning for Testing Distributed Systems (OOPSLA 2024)
Andrea Borgarelli (Max Planck Institute for Software Systems), Constantin Enea (LIX, CNRS, Ecole Polytechnique), Rupak Majumdar (MPI-SWS), Srinidhi Nagendra (CNRS, Université Paris Cité, IRIF, Chennai Mathematical Institute) | DOI

17:00 (20m) Talk | Rustlantis: Randomized Differential Testing of the Rust Compiler (OOPSLA 2024) | DOI

17:20 (20m) Talk | Statistical Testing of Quantum Programs via Fixed-Point Amplitude Amplification (OOPSLA 2024) | DOI