1. Introduction
The rapid advancements in large language models (LLMs) in terms of abstraction, comprehension, knowledge retention, and human interaction capabilities [1] have enabled their application across a variety of domains, including medical data analysis, personalized education, customer service automation, and personal assistance. The deployment of these LLM-based agents in real-world settings is driven by their potential to enhance precision in healthcare [2] and to improve productivity and quality in other professional environments [3]. However, as LLMs assume responsibilities traditionally held by humans, their decisions can have significant impacts on human well-being and organizational performance. Ensuring the reliability and trustworthiness of LLMs is therefore crucial.
Among the various issues arising from the integration of LLMs into professional environments, hallucination is one of the most pressing concerns. Hallucination refers to a phenomenon where a model’s output is either partially or completely inconsistent with the ideal ground truth completion [4]. This means that the information generated by the model can be inaccurate, misleading, or entirely false. Additionally, hallucination can be challenging to detect because language models typically exhibit a high degree of language fluency, which makes them appear confident even when the content is incorrect [5]. Such subtle inaccuracies can render LLMs unsafe not only in critical roles but also as ordinary assistants interacting with humans in daily tasks. For instance, a public AI system like ChatGPT may be asked to provide medical advice or used for self-diagnosis based on symptoms due to its perceived benefits in decision-making processes [6]. However, undetected hallucinations can lead to serious consequences, such as incorrect treatment.
Figure 1 illustrates an example of hallucination in a medical advice scenario: the assistant incorrectly recommends ibuprofen instead of acetaminophen as a safe medication during the third trimester of pregnancy. This is a critical error, since NSAIDs such as ibuprofen are associated with significant risks in late pregnancy [7].
Previous studies on the issue of hallucinations have extensively described their types and causes. Hong et al. identified two types of hallucination: faithfulness hallucination and factuality hallucination [8]. Faithfulness hallucination pertains to the model’s ability to produce information that is consistent with the provided sources, whereas factuality hallucination pertains to its ability to produce factually correct information. In straightforward scenarios involving simple prompts, such as basic question answering for an LLM-based tutor or medical adviser, faithfulness hallucinations may be less relevant. However, in more complex scenarios where the model must analyze large volumes of data, both types of hallucination are important considerations for assessing the model’s reliability.
Hallucinations can arise from various factors, including the data source (misinformation, bias, limitations), the training process (flawed architecture, capacity, or belief misalignment), and the inference process (sampling randomness, decoding representation deficiencies) [9]. Regardless of the cause, hallucinations ultimately manifest during inference when the model interacts with the user. Besides augmenting the raw model with additional frameworks, such as retrieval-augmented generation (RAG) [10] or self-correcting with tool-interactive critiquing (CRITIC) [11], several techniques have been proposed to reduce the model’s intrinsic susceptibility to hallucinations. One such technique is the improvement of training data quality, as demonstrated by the Phi model family from Microsoft. Despite their small size, the Phi models are trained on textbook-quality data, enabling them to compete with larger models [12]. For example, according to the Hugging Face Open LLM Leaderboard [13], Phi-3-mini (a 3.8 billion parameter model) performs significantly better than Llama-3-8b on the TruthfulQA dataset, a benchmark designed to evaluate hallucinations (see Section 2.2 for more detail). Other methods to mitigate hallucinations include the development of better network architectures and improved alignment techniques, both of which enhance stability during training and inference. Given the extensive ongoing research in these areas, it is essential to have a set of reliable and targeted metrics to directly evaluate a model’s susceptibility to hallucinations.
We introduce a novel method for benchmarking LLMs on their susceptibility to hallucination: Deception-Based (DB) benchmarking. This approach involves asking the model to complete a text generation task, which may require either a step-by-step reasoning process or an answer to an open-ended question. For each question, the model responds twice independently. The first time, the model answers normally after being provided with the prompt. The second time, the model is required to begin its answer with a pre-written misleading start. This misleading introduction is intended to induce hallucination by hinting at an incorrect conclusion. The model is then evaluated based on three metrics:
Accuracy: This metric reflects the score obtained on the benchmark for each category independently (normal answer and misleading answer). Each answer is graded in the same way regardless of its category. A higher score indicates better performance. The normal-answer category is expected to score higher than the misleading-answer category, as the model is less likely to hallucinate under normal conditions.
Susceptibility: This metric indicates the likelihood that the model is influenced by the misleading prompt. It is calculated as the ratio of the accuracy of the normal answers to the accuracy of the misleading answers. A higher score suggests a greater susceptibility to hallucination and a lower capacity for self-correction once hallucination occurs during inference.
Consistency: This metric measures the percentage of answers that are identical across both categories, regardless of whether the answer is correct. A higher consistency score indicates that the model is certain of its answers and is less likely to be influenced by noise in the random sampling process during inference.
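As a concrete illustration of how these three metrics could be computed, the following Python sketch assumes a hypothetical list of per-question records holding the graded and extracted answers from both runs; the field names are illustrative assumptions, not part of the benchmark specification.

```python
# Minimal sketch of the three DB-benchmarking metrics.
# Assumes `results` is a list of dicts with hypothetical fields:
#   normal_correct / misled_correct : bool, whether each answer was graded correct
#   normal_answer  / misled_answer  : str, the final answer extracted from each run

def db_metrics(results):
    n = len(results)
    acc_normal = sum(r["normal_correct"] for r in results) / n
    acc_misled = sum(r["misled_correct"] for r in results) / n

    # Susceptibility: ratio of normal accuracy to misled accuracy.
    # Higher values mean the misleading start degrades performance more.
    susceptibility = acc_normal / acc_misled if acc_misled > 0 else float("inf")

    # Consistency: share of questions where both runs give the same final answer,
    # regardless of correctness.
    consistency = sum(r["normal_answer"] == r["misled_answer"] for r in results) / n

    return {
        "accuracy_normal": acc_normal,
        "accuracy_misled": acc_misled,
        "susceptibility": susceptibility,
        "consistency": consistency,
    }
```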
With deception-based benchmarking, we aim to address both types of hallucination (faithfulness and factuality) and all three sources of hallucination (data, training, and inference) as previously discussed. This necessitates a flexible benchmarking methodology adaptable to various contexts. Therefore, we propose two approaches for preparing the dataset used for DB benchmarking: either by modifying an existing dataset to include a misleading prompt for each question or by creating a new dataset that targets a specific aspect of LLMs. When using existing datasets, it is essential that the dataset involves a task requiring text generation. If a few-shot multiple-choice dataset is selected, it can be adapted so that each question is answered using chain-of-thought (CoT) reasoning [14]. The ability to utilize different datasets ensures that DB benchmarking can target various characteristics of LLMs, encompassing both types of hallucination. For example, a dataset involving information retrieval from a given context can be used to evaluate faithfulness, while a question-answering dataset can be used to evaluate factuality. Moreover, the text generation process ensures that hallucination from the inference stage is also considered, unlike few-shot multiple-choice questions where only the next-token probability is taken into account. The process of DB benchmarking is illustrated in Figure 2, with a concrete example of a question provided in Figure 3.
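To make the adaptation step concrete, the sketch below shows one way an existing multiple-choice item might be turned into a DB pair: a chain-of-thought prompt answered normally, and the same prompt whose completion is forced to begin with a misleading start. The prompt template, field names, and the example item (based on the Figure 1 scenario) are illustrative assumptions, not the exact templates used for DB-MMLU.

```python
# Illustrative adaptation of a multiple-choice item into a DB pair.
# The templates and field names below are assumptions for demonstration only.

def build_db_pair(question, choices, misleading_start):
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    base_prompt = (
        f"Question: {question}\n{options}\n"
        "Think step by step, then give the final answer as a single letter."
    )
    return {
        "normal_prompt": base_prompt,
        # The misled run uses the same prompt, but the model's completion is
        # forced to begin with `misleading_start`, which hints at a wrong conclusion.
        "misled_prompt": base_prompt,
        "forced_answer_prefix": misleading_start,
    }

pair = build_db_pair(
    question="Which medication is generally considered safe in the third trimester of pregnancy?",
    choices=["Ibuprofen", "Acetaminophen", "Aspirin", "Naproxen"],
    misleading_start="At first glance, ibuprofen seems like the safest option because",
)
```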
To evaluate the effectiveness of the DB benchmarking methodology, we introduce a new dataset: Deception-Based Massive Multitask Language Understanding (DB-MMLU), derived from the MMLU dataset [15]. We tested several open-source chat or instruct models with fewer than 16 billion parameters on this benchmark. The results indicate that susceptibility and consistency are distinct from traditional performance metrics on the benchmark and provide valuable insight into hallucination. These new metrics can help guide the development of the next generation of models for safer user interaction.
6. Limitations and Future Work
Full control over models: Deception-based benchmarking requires complete control over the models to force them to start with a predefined sentence. Currently, this is not possible with closed models, making it challenging to compare top-performing models, as most of them are closed source.
Instruction-following capabilities: The DB-MMLU benchmark requires precise instruction-following abilities from the models, as their output must be converted to JSON format. Among the models tested, Llama 3 and Mistral exhibit the highest error rates (see Figure A12). Other models, such as Qwen1.5-14B-Chat, StableLM-2-12b-chat, and InternLM2-chat-20b, could not be tested due to their high error rates. To address this issue, fine-tuning the models before benchmarking could be a viable option.
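To make the error criterion concrete, the sketch below shows one way a response could be validated before grading; the expected keys ("reasoning", "answer") and the regex-based extraction are assumptions for illustration, not the exact procedure used in the benchmark.

```python
import json
import re

# Illustrative validation of a model response that is expected to contain a JSON
# object with hypothetical keys "reasoning" and "answer".
def parse_response(text):
    # Grab the outermost {...} span in case the model adds prose around the JSON.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None  # counted as an instruction-following error
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None  # malformed JSON, also counted as an error
    if not isinstance(data, dict) or not {"reasoning", "answer"} <= data.keys():
        return None  # missing required fields
    return data
```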
Result variability: Results on the DB-MMLU dataset can vary slightly even when the benchmark is repeated under the same conditions. This variability stems from randomness in the token sampling process. The testing methodology deliberately uses a non-zero temperature to evaluate the model under real-world conditions; however, this randomness can be eliminated entirely by switching to deterministic (greedy) decoding if more precise results are required.
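For instance, with the Hugging Face transformers generation API, disabling sampling removes this source of variability; the model identifier and generation settings below are placeholders, not the configuration used in our experiments.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-org/some-chat-model"  # placeholder model identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Question: ...", return_tensors="pt")

# Benchmark-style run: non-zero temperature, closer to real-world usage.
sampled = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Fully deterministic run: greedy decoding, no sampling randomness.
deterministic = model.generate(**inputs, max_new_tokens=256, do_sample=False)
```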
Source of the dataset: Due to resource constraints, the misleading prompts used in DB-MMLU are generated by Gemini 1.5 Pro. While most prompts meet expectations, some do not. Since Gemini 1.5 Pro itself reaches only 85.9% accuracy on the MMLU dataset, it does not correctly interpret every question. Consequently, the dataset contains some inaccuracies that can affect the results. Potential improvements include constructing the dataset with human experts or filtering the questions after generation.