Preprint Article | Version 1 | Preserved in Portico | This version is not peer-reviewed

Understanding Logical Reasoning Ability of Large Language Models

Version 1: Received: 22 August 2024 / Approved: 23 August 2024 / Online: 23 August 2024 (18:52:34 CEST)

How to cite: Chan, E. Understanding Logical Reasoning Ability of Large Language Models. Preprints 2024, 2024081712. https://doi.org/10.20944/preprints202408.1712.v1

Abstract

Large language models (LLMs) have recently made significant progress in natural language processing, and these models have been observed to exhibit reasoning abilities once they are sufficiently large. This has sparked considerable research interest, since reasoning is a hallmark of human intelligence that has long been considered missing from artificial intelligence systems. Because of the large size of these models, evaluation of LLMs' reasoning ability is largely empirical, and creating datasets to evaluate this ability is therefore an important area of LLM research. A key open question is whether LLMs reason or simply recite memorized text encountered during training. This work conducts simple experiments using Cheryl's Birthday Puzzle and Cheryl's Age Puzzle, together with variants created in this work, to investigate whether LLMs recite or reason. It finds that LLMs tend to recite memorized answers to well-known questions, which appear frequently on the internet, even when such answers are not sensible for the modified versions of those questions. When presented with less well-known questions, LLMs answer with correct reasoning more frequently. A possible inference is that LLMs tend to reason on less well-known questions but recite memorized answers on popular ones. Consequently, to evaluate the reasoning ability of LLMs accurately, it is essential to create new datasets that elicit genuine reasoning rather than recall. In view of this finding, this work proposes a new dataset of unseen questions requiring semantic and deductive logical reasoning skills. The proposed evaluation framework is based on the intuition that some questions or puzzles can only be answered through the mastery of reasoning needed for situational awareness, and the dataset consists of 84 logical reasoning questions on room assignment subject to constraints. The framework has several desirable properties, including resilience to training data contamination, ease of response verification, extensibility, usefulness and automated test case generation. This work then applies the proposed dataset to evaluate the reasoning ability of state-of-the-art LLMs, including GPT-3, GPT-4, Llama-3.1, Gemini-1.5 and Claude-3.5, and compares their performance with that of human intelligence on the dataset.
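
The abstract describes the room-assignment puzzles and their automated generation and verification only at a high level. As an informal illustration of how such questions can be represented and checked, the minimal Python sketch below (with hypothetical guests, rooms and constraints that do not come from the paper's dataset) enumerates all assignments of guests to distinct rooms and confirms that the puzzle has a unique solution; it is a sketch of the general technique under stated assumptions, not the authors' implementation.

# Illustrative sketch (hypothetical, not taken from the paper): representing a
# room-assignment puzzle as a constraint-satisfaction problem and verifying answers
# by exhaustive enumeration, which is what makes automated checking of responses easy.
from itertools import permutations

PEOPLE = ["Alice", "Bob", "Carol", "Dave"]   # hypothetical guests
ROOMS = [1, 2, 3, 4]                         # rooms along a corridor, one guest per room

# Each constraint takes a complete assignment {person: room} and returns True or False.
CONSTRAINTS = [
    lambda a: a["Alice"] != 1,                  # Alice is not in room 1
    lambda a: abs(a["Bob"] - a["Carol"]) == 1,  # Bob is in a room adjacent to Carol's
    lambda a: a["Dave"] > a["Alice"],           # Dave's room number is higher than Alice's
    lambda a: a["Carol"] % 2 == 0,              # Carol is in an even-numbered room
]

def solutions():
    """Return every assignment of people to distinct rooms that satisfies all constraints."""
    found = []
    for rooms in permutations(ROOMS):
        assignment = dict(zip(PEOPLE, rooms))
        if all(check(assignment) for check in CONSTRAINTS):
            found.append(assignment)
    return found

if __name__ == "__main__":
    sols = solutions()
    # A well-posed benchmark question would admit exactly one satisfying assignment,
    # so a model's proposed room assignment can be verified automatically.
    print(f"{len(sols)} satisfying assignment(s):")
    for sol in sols:
        print(sol)

Running this sketch prints a single satisfying assignment, which is the property a benchmark question of this kind would need for automated response verification.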

Keywords

large language models; reasoning; deductive logic; prompting; reasoning benchmark

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
