Submitted: 27 July 2023
Posted: 02 August 2023
Abstract
Keywords:
1. Introduction
2. What is Reasoning?
3. Test Problems
3.1. Simple Arithmetic
3.2. Simple Counting
3.3. (Medical) Common Sense
3.4. Elementary Logic
- P(x): x is an even number
- Q(x): x is a positive number
- a:
- P(x) implies Q(x): If x is an even number, then x is a positive number. This statement is not universally true, as we can have negative even numbers. However, it does not preclude the existence of a model with the other given conditions.
- Q(a) does not hold: Q(a) means a is a positive number, which is false. So, this condition is satisfied.
- P(a) holds: P(a) means a is an even number, which is true. So, this condition is also satisfied.
- P(x): x is a multiple of 4
- Q(x): x is an even number
- P(x) implies Q(x): If x is a multiple of 4, then x is an even number. This statement is universally true because every multiple of 4 is also a multiple of 2, making it an even number.
- Q(a) does not hold: Q(4) means 4 is an even number, which is true. So, this condition is not satisfied.
- P(x): x is a positive multiple of 4
- Q(x): x is a positive even number
- P(x) implies Q(x): If x is a positive multiple of 4, then x is a positive even number. This statement is universally true because every positive multiple of 4 is also a multiple of 2, making it a positive even number.
- Q(a) does not hold: Q(a) means a is a positive even number, which is false. So, this condition is satisfied.
- P(a) holds: P(a) means a is a positive multiple of 4, which is false. So, this condition is not satisfied.
- P(x): x is an element of set A
- Q(x): x is an element of set B
- a: a specific element, a
- P(x) implies Q(x): If x is an element of set A, then x is an element of set B. In this case, since the only element in set A is a, and set B is defined such that it contains all elements of A except a, this implication is vacuously true. This is because there are no other elements in set A for which the implication would be false.
- Q(a) does not hold: Q(a) means a is an element of set B. By definition, set B does not include a, so this condition is satisfied.
- P(a) holds: P(a) means a is an element of set A. Since set A is defined to contain only a, this condition is also satisfied.
- : x is divisible by 3
- : x is divisible by 2
- a: 3
- P(x) implies Q(x): If x is divisible by 3, then x is divisible by 2. This statement is not universally true. However, we can create a domain D that contains only one element, 3, which is divisible by 3 but not by 2. By restricting the domain, the implication P(x) implies Q(x) becomes vacuously true within the domain D because there are no elements in D for which the implication would be false.
- Q(a) does not hold: Q(3) means 3 is divisible by 2. Since 3 is not divisible by 2, this condition is satisfied.
- P(a) holds: P(3) means 3 is divisible by 3. Since 3 is divisible by 3, this condition is also satisfied.
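All five candidate "models" above fail for the same reason: if the implication is read as universally quantified over a domain containing a (as the excerpts themselves read it), then P(a) together with the implication forces Q(a), contradicting the requirement that Q(a) not hold. A brute-force search over every interpretation on small finite domains (an illustrative sketch, not part of the original test) confirms that no such model exists:

```python
from itertools import product

def satisfiable(max_size=4):
    """Search for a domain and interpretation of P, Q, and a such that
    (forall x. P(x) ==> Q(x)), P(a), and not Q(a) all hold."""
    for n in range(1, max_size + 1):
        domain = range(n)
        # Every way of choosing the extensions of P and Q, and the value of a.
        for P, Q in product(product([False, True], repeat=n), repeat=2):
            for a in domain:
                if all((not P[x]) or Q[x] for x in domain) and P[a] and not Q[a]:
                    return True
    return False

print(satisfiable())  # False: no interpretation satisfies all three conditions
```

The search is exhaustive for domains of up to four elements, and the underlying contradiction (P(a) plus the implication yields Q(a)) rules out larger domains for the same reason.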
3.5. Simple Quantifier Semantics
- [forall x.P(x) ==> Q(x)]
- [exists x.P(x)]
- [exists x.~Q(x)]
- [forall x.P(x) ==> Q(x)] says that for all x, if P(x) is true then Q(x) is true.
- [exists x.P(x)] says that there exists at least one x such that P(x) is true.
- [exists x.~Q(x)] says that there exists at least one x such that Q(x) is not true.
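Reading the third sentence as [exists x.~Q(x)], as its gloss indicates, the three sentences are jointly satisfiable, and the question can be settled mechanically. A brute-force search over two-element interpretations (a sketch for illustration; the code is not from the original test) immediately finds a model:

```python
from itertools import product

def find_model(n=2):
    """Look for extensions of P and Q over a domain of size n satisfying
    (forall x. P(x) ==> Q(x)), (exists x. P(x)), (exists x. not Q(x))."""
    domain = range(n)
    for P, Q in product(product([False, True], repeat=n), repeat=2):
        if (all((not P[x]) or Q[x] for x in domain)
                and any(P[x] for x in domain)
                and any(not Q[x] for x in domain)):
            return P, Q
    return None

print(find_model())  # P = Q = (False, True): both hold at element 1 only
```

With domain {0, 1} and P = Q = {1}, the implication holds at every element, P holds somewhere, and Q fails at 0, so all three sentences are true together.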
- (forall x.P(x)<==>Q(x)) implies (forall x.P(x))<==>(forall x.Q(x)): Assume that for all x, P(x)<==>Q(x) holds. Direction 1: If forall x, P(x) holds, we need to show that forall x, Q(x) holds. Since P(x)<==>Q(x) for all x, if P(x) holds for all x, then Q(x) must also hold for all x, as they are equivalent. Direction 2: If forall x, Q(x) holds, we need to show that forall x, P(x) holds. Since P(x)<==>Q(x) for all x, if Q(x) holds for all x, then P(x) must also hold for all x, as they are equivalent. Thus, we have shown that (forall x.P(x))<==>(forall x.Q(x)) holds when (forall x.P(x)<==>Q(x)) holds.
- (forall x.P(x))<==>(forall x.Q(x)) implies (forall x.P(x)<==>Q(x)): Assume that (forall x.P(x))<==>(forall x.Q(x)) holds. We need to show that for all x, P(x)<==>Q(x). Let x be an arbitrary element from the domain of discourse. We will consider the two directions of the biconditional P(x)<==>Q(x). Direction 1: If P(x) holds, we need to show that Q(x) holds. If P(x) holds for our arbitrary x, then it must be the case that forall x, P(x) holds. Since (forall x.P(x))<==>(forall x.Q(x)) holds, it follows that forall x, Q(x) holds. Therefore, Q(x) must hold for our arbitrary x as well. Direction 2: If Q(x) holds, we need to show that P(x) holds. If Q(x) holds for our arbitrary x, then it must be the case that forall x, Q(x) holds. Since (forall x.P(x))<==>(forall x.Q(x)) holds, it follows that forall x, P(x) holds. Therefore, P(x) must hold for our arbitrary x as well.
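The second "proof" above is spurious: from P(x) holding at one arbitrary x it infers forall x. P(x), which does not follow. The implication it purports to establish is in fact invalid, and a brute-force search over two-element interpretations (an illustrative sketch, not from the original text) produces a countermodel:

```python
from itertools import product

def countermodel(n=2):
    """Find extensions of P and Q over a domain of size n where
    (forall x. P(x)) <==> (forall x. Q(x)) holds but
    forall x. (P(x) <==> Q(x)) fails."""
    domain = range(n)
    for P, Q in product(product([False, True], repeat=n), repeat=2):
        lhs = all(P[x] for x in domain) == all(Q[x] for x in domain)
        rhs = all(P[x] == Q[x] for x in domain)
        if lhs and not rhs:
            return P, Q
    return None

print(countermodel())  # e.g. P empty, Q = {1}: both universals are false,
                       # so their biconditional holds, but P(1) <==> Q(1) fails
```

Any interpretation in which P and Q differ somewhere while neither holds universally refutes the implication, which is exactly the step the quoted argument glosses over.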
3.6. Simple Graph Coloring
- Vertex 0: Red
- Vertex 1: Green
- Vertex 2: Red
- Vertex 3: Green
- Vertex 4: Red
- Vertex 5: Green
- Vertex 0: Red
- Vertex 1: Green
- Vertex 2: Green
- Vertex 3: Red
- Vertex 4: Green
- Vertex 5: Red
- Vertex 0: Red
- Vertex 1: Green
- Vertex 2: Red
- Vertex 3: Green
- Vertex 4: Red
- Vertex 5: Green
- Vertex 0: Red
- Vertex 1: Green
- Vertex 2: Blue
- Vertex 3: Red
- Vertex 4: Green
- Vertex 5: Blue
- Vertex 0: Red
- Vertex 1: Green
- Vertex 2: Blue
- Vertex 3: Green
- Vertex 4: Red
- Vertex 5: Blue
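The edge set of the graph used in this test is not reproduced in the excerpt, but any proposed coloring like those above can be verified mechanically in a few lines. The checker below uses a hypothetical six-vertex edge list purely for illustration (it is NOT the graph from the original test):

```python
def is_proper(coloring, edges):
    """A coloring is proper iff no edge joins two same-colored vertices."""
    return all(coloring[u] != coloring[v] for u, v in edges)

# Hypothetical edge list, chosen only to demonstrate the check.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 2)]
coloring = {0: "Red", 1: "Green", 2: "Blue", 3: "Red", 4: "Green", 5: "Blue"}
print(is_proper(coloring, edges))  # True for this particular graph/coloring pair
```

A check of this kind makes a model's wrong answer immediately detectable: a single monochromatic edge falsifies the claimed coloring.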
3.7. Subset Sum
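Subset-sum instances of the size used in tests like this one can be settled exactly by brute force, which makes a model's wrong answer easy to expose. A minimal sketch (the numbers below are illustrative, not the instance from the original test):

```python
from itertools import combinations

def subset_sums_to(nums, target):
    """Return a subset of nums summing to target, or None if none exists."""
    for r in range(len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return combo
    return None

print(subset_sums_to([2, 7, 11, 19, 23], 30))  # (7, 23)
print(subset_sums_to([2, 7, 11, 19, 23], 27))  # None
```

For n numbers this examines at most 2^n subsets, which is instantaneous at the scale of such test problems.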
3.8. Elementary Discrete Math
3.9. Simple Scheduling
- T2 before T4:
- If T1 is before T3: T1, T2, T3, T4 (T2 before T4, and T4 is the last task as T1 is before T3)
- If T1 is after T3: T2, T3, T1, T4 (T2 before T4, and no constraint on the position of T4 as T1 is after T3)
- T4 before T2: Not possible, as T2 must be done before T4.
- T1, T2, T3, T4
- T2, T3, T1, T4
- T2, T1, T3, T4
- T3, T2, T1, T4
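Only one constraint (T2 before T4) is explicit in the excerpt above; the full constraint set is not reproduced here. Under one assumed reconstruction (T4 comes after each of T1, T2, T3, and T2 precedes T1 or T3), brute-force enumeration yields exactly the four orders listed. The two constraints marked "assumed" below are guesses for illustration, not quoted from the original problem:

```python
from itertools import permutations

tasks = ["T1", "T2", "T3", "T4"]

def valid(order):
    pos = {t: i for i, t in enumerate(order)}
    return (pos["T2"] < pos["T4"]                                 # explicit in the text
            and all(pos[t] < pos["T4"] for t in ["T1", "T3"])     # assumed
            and (pos["T2"] < pos["T1"] or pos["T2"] < pos["T3"])) # assumed

for order in permutations(tasks):
    if valid(order):
        print(order)  # prints the four valid schedules
```

Whatever the exact constraint set, the point stands: with four tasks there are only 24 orders, so exhaustive checking settles such questions definitively.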
3.10. Russell’s Paradox
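The paradox turns on a single unsatisfiable biconditional: a set R defined by x ∈ R <==> x ∉ x yields, on instantiating x with R itself, R ∈ R <==> R ∉ R. A two-line check (purely illustrative) confirms that no truth value for "R ∈ R" is consistent:

```python
# Russell's set R is defined by: x in R <==> x not in x.
# Instantiating x := R gives R in R <==> R not in R, which no
# truth value v satisfies:
consistent_values = [v for v in (True, False) if v == (not v)]
print(consistent_values)  # [] -- the defining condition is unsatisfiable
```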
3.11. Blocks World
- The second-from-the-top block is green.
- The fourth-from-the-top block is not green.
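From the two quoted premises one can already derive the classic conclusion that somewhere in the stack a green block sits directly on top of a non-green block: the only relevant unknown is the color of the third block, and both cases yield such a pair. A brute-force check (a sketch, assuming a stack of at least four blocks numbered from the top) illustrates this:

```python
from itertools import product

def green_on_nongreen(colors):
    """True iff some block is green and the block directly below it is not."""
    return any(top == "green" and below != "green"
               for top, below in zip(colors, colors[1:]))

# Blocks listed top-down: block 2 is green, block 4 is not green.
# Blocks 1 and 3 have unknown colors -- try every case.
for c1, c3 in product(["green", "non-green"], repeat=2):
    stack = [c1, "green", c3, "non-green"]
    assert green_on_nongreen(stack)
print("a green block sits on a non-green block in every case")
```

The case split is the whole proof: if block 3 is green it sits on non-green block 4; if it is not green, green block 2 sits on it.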
3.12. Spatial Reasoning
3.12.1. Left and Right
3.12.2. Furniture Arrangement
- A must not be adjacent to C.
- Nothing is to the right of E.
- If D and A are not adjacent, then B should be in the middle.
- D is above all others.
- E and D are adjacent.
3.12.3. Seating Puzzle
- If p5 is in seat 1: In this case, there are no seats to the left of p5 for either p1 or p3 to occupy. Since both p1 and p3 must flank p5, this position for p5 is not possible.
- If p5 is in seat 2: p1 would have to be seated in seat 1 to flank p5. However, there is no available seat for p3 to also flank p5 on the other side, as p3 would be required to sit in seat 3, which is not adjacent to p5. Thus, this position for p5 is not possible.
- If p5 is in seat 4: Similar to the case when p5 is in seat 2, p3 would have to be seated in seat 5 to flank p5, but there is no available seat for p1 to also flank p5 on the other side, as p1 would be required to sit in seat 3, which is not adjacent to p5. Thus, this position for p5 is not possible.
- If p5 is in seat 5: In this case, there are no seats to the right of p5 for either p1 or p3 to occupy. Since both p1 and p3 must flank p5, this position for p5 is not possible.
- If p5 is in the middle seat (seat 3): In this case, p1 and p3 can flank p5 by occupying seats 2 and 4 or seats 4 and 2, respectively. This configuration satisfies condition C.
- If p5 is in seat 2: p1 would have to be seated in seat 1 to flank p5. p3 could be seated in seat 3, which is also flanking p5.
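The quoted reasoning is internally inconsistent: it first rules out seat 2 for p5 on the grounds that seat 3 "is not adjacent" to seat 2, then concedes that p3 in seat 3 does flank p5 in seat 2. Taking the flanking condition alone (p1 and p3 must occupy the seats immediately to either side of p5), a mechanical check (a sketch; the puzzle's other conditions are not reproduced in this excerpt and may rule out further seats) shows flanking is geometrically possible exactly when p5 sits in seats 2 through 4:

```python
def can_flank(p5_seat, n_seats=5):
    """p1 and p3 can flank p5 iff both neighboring seats exist."""
    return 1 <= p5_seat - 1 and p5_seat + 1 <= n_seats

print([s for s in range(1, 6) if can_flank(s)])  # [2, 3, 4]
```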
3.13. Temporal Reasoning
3.14. Murder or Suicide?
- Someone who lives in Dreadbury Mansion killed Aunt Agatha.
- The only people who live in Dreadbury Mansion are Aunt Agatha, the butler, and Charles.
- A killer always hates his victims, and is never richer than his victims.
- Charles hates no one that Aunt Agatha hates.
- Aunt Agatha hates everyone except the butler.
- The butler hates everyone not richer than Aunt Agatha.
- The butler hates everyone Aunt Agatha hates.
- No one hates everyone.
- Aunt Agatha is not the butler.
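These premises (a classic puzzle, Pelletier's problem 55) determine the killer uniquely: Aunt Agatha herself. Because the domain has only three people, the hates and richer relations can be enumerated exhaustively. The sketch below (an illustration, not the paper's formalization) checks every interpretation and collects the killers consistent with the premises:

```python
from itertools import product

A, B, C = 0, 1, 2                      # Aunt Agatha, the butler, Charles
people = (A, B, C)

def consistent(killer, hates, richer):
    return (hates[A][A] and hates[A][C] and not hates[A][B]            # Agatha hates all but the butler
            and hates[killer][A] and not richer[killer][A]             # killer hates victim, not richer
            and all(not hates[C][x] for x in people if hates[A][x])    # Charles hates no one Agatha hates
            and all(hates[B][x] for x in people if not richer[x][A])   # butler hates the not-richer
            and all(hates[B][x] for x in people if hates[A][x])        # butler hates whom Agatha hates
            and all(any(not hates[x][y] for y in people) for x in people))  # no one hates everyone

killers = set()
# Enumerate every possible hates/richer relation over the three residents.
for bits in product([False, True], repeat=18):
    hates = [bits[0:3], bits[3:6], bits[6:9]]
    richer = [bits[9:12], bits[12:15], bits[15:18]]
    for killer in people:
        if consistent(killer, hates, richer):
            killers.add(killer)
print(killers)  # {0}: only Aunt Agatha can have killed Aunt Agatha
```

Charles is excluded because he would have to hate Agatha, whom Agatha hates; the butler is excluded because killing Agatha would force him to hate all three residents, violating the "no one hates everyone" premise.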
3.15. Wason Selection Task
3.16. Entropy
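The excerpt does not reproduce the entropy test itself, but tasks of this kind reduce to evaluating Shannon's formula H = -Σ p_i log2 p_i, which a few lines of code settle exactly. The distribution below is illustrative, not the one from the original test:

```python
from math import log2

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
```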
3.17. Simple Compiler Correctness
4. Conclusions
- Use of generative AI in software development (or in science and engineering in general) for anything other than tedious tasks (as a sort of turbo-charged autocomplete for knowledge-heavy coding questions) is fraught with serious risks. Normative standards of correctness are of paramount importance in these fields, and current LLMs cannot meet such standards. Just like generative AI is already starting to pollute the web with badly written ads,30 it has the potential to proliferate buggy code at scale.
- If LLM reasoning continues to improve, rigorous proof checking is likely to become increasingly important. Confidence in the correctness of a system’s reasoning is imperative for applications, particularly in science, medicine, and engineering, and proof checking is a technology that can deliver such confidence. This approach could be implemented by requiring LLMs to formalize their reasoning (express it in a symbolic notation that is amenable to proof checking), or potentially by training other LLMs to check a stretch of reasoning expressed in natural language.
- As things stand, dystopian scenarios involving a rogue AI that subjugates humankind, or even other humans using AI for sinister purposes, are exceedingly far-fetched, often to the point of absurdity.31 When the most advanced AI system cannot tell left from right (literally, see Section 3.12), it is at best comically premature to call for policies and institutions to protect humanity from it or its descendants (often by appeal to the latest “scaling law”). At worst, it is a misuse of human time and capital that could be better channeled into addressing much more pressing challenges.
References
- Arkoudas, K.; Musser, D. Fundamental Proof Methods in Computer Science; MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
- Barwise, J.; Perry, J. Situations and Attitudes; MIT Press: Cambridge, MA, USA, 1983. [Google Scholar]
- Karpas, E.; Abend, O.; Belinkov, Y.; Lenz, B.; Lieber, O.; Ratner, N.; Shoham, Y.; Bata, H.; Levine, Y.; Leyton-Brown, K.; et al. MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. arXiv 2022, arXiv:2205.00445. [Google Scholar]
- Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv 2023, arXiv:2210.03629. [Google Scholar]
- Planken, L. Temporal Reasoning Problems and Algorithms for Solving Them: Literature Survey, 2008.
- McCoy, T.; Pavlick, E.; Linzen, T. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; 2019. [Google Scholar]
- Liu, H.; Ning, R.; Teng, Z.; Liu, J.; Zhou, Q.; Zhang, Y. Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4. arXiv 2023, arXiv:2304.03439. [Google Scholar]
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Wang, J.; Hu, X.; Hou, W.; Chen, H.; Zheng, R.; Wang, Y.; Yang, L.; Huang, H.; Ye, W.; Geng, X.; et al. On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective. arXiv 2023, arXiv:2302.12095. [Google Scholar]
- Niven, T.; Kao, H.-Y. Probing Neural Network Comprehension of Natural Language Arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; 2019. [Google Scholar]
- Johnson-Laird, P.N. How We Reason; Oxford University Press: Oxford, UK, 2006. [Google Scholar]
| 1 | A modified version of this article is being published in the journal Philosophy & Technology. |
| 2 | The notion of an emergent property is clear enough, at least at a high enough level. What is not clear is the relationship between such properties and LLM architectures, their basic configurations (number of parameters, compute budget, dataset size, and so on), and, more importantly, tasks such as reasoning. |
| 3 | Or with perfect precision and recall, to put it—more loosely—in ML-like terms. |
| 4 | Of which there are many: propositional logic, the two-variable fragment of first-order logic, the Ackermann fragment, the guarded fragment, various quantifier-prefix fragments, and so on. |
| 5 | Understanding that structure and rigorously characterizing its relationship with algorithm performance (e.g., via different problem parameterizations, such as clause/variable ratios in the case of SAT) is a key open problem in theoretical computer science, but that is another matter. |
| 6 | Humans do not seem to solve problems by predicting the most likely sequence of tokens to generate. They think, explore, experiment, engage in protracted conversation with the people who posed the problem (sometimes over weeks, months, or even years), refine, generalize, come up with new concepts and terminology, prove results, make and refute conjectures, apply heuristics, execute algorithms, analyze and synthesize, and iterate. But how solutions are generated is one thing and what solutions are generated is another, and that’s why it’s not incoherent to speak of a model whose reasoning performance is roughly at the same level as that of an average human engineer. Such a claim can be understood operationally, to mean that a given LLM is able to produce roughly the same solutions that we might reasonably expect an average human engineer to produce (though obviously on a very different time scale). |
| 7 | According to the analysis carried out by the lm-contamination index, well-known NLP datasets such as SQuAD, CoNLL03, MNLI, and others, are indeed contaminated, while several others are at best suspicious. |
| 8 | In fact, the substring checks carried out by OpenAI were not even applied on the entire problem instance, only on 3 randomly selected substrings of 50 characters each. This is not enough to ensure disjointness for long (or even moderately long) problems, which are quite common in tests like the UBE (Uniform Bar Exam). |
| 9 | Models have been shown to leverage the presence of certain cue words (especially negation words) and to formulate quick-and-dirty (i.e., unsound) heuristics such as lexical overlap, subsequence, and constituency [6]. Most of these results are from 2019 and revolve around BERT, but more recent work [9] has shown that while larger foundational models such as ChatGPT are more robust to input perturbations and OOD (out-of-distribution) samples, these continue to be challenges, suggesting that even ChatGPT-scale models learn unsound shortcuts. |
| 10 | Here we understood premises and conclusions as syntactic objects (sentences or diagrams), but there are alternative approaches. For instance, a semanticist might think of premises and conclusions as propositions, abstract objects capable of being true or false. A sentence then expresses or represents a proposition. Propositions are handy theoretical entities for many reasons. For example, they can serve as the objects of psychological attitudes such as beliefs and desires. What do I mean when I claim to believe that Obama won the 2012 presidential election? Surely I don’t believe a particular sentence, i.e., a specific syntactic object like “Obama won the 2012 US presidential election” (I). Rather, I believe something about the way the world actually is. That something can be understood as a proposition, a unique entity that can be expressed by many different equivalent sentences. Propositions can be cashed out in modal terms, as sets of possible worlds (or as “situations” in situation-theoretic semantics [2]). A possible world is a way in which things might have been, but described completely, down to the most minute detail (unlike situations, which can be thought of as partial specifications of worlds). So the proposition that Obama won the 2012 US presidential election is identified with the set of all possible worlds in which Obama won that election. This set becomes the information content of sentences such as (I). Propositions can also serve to analyze fundamental semantic notions such as entailment. A set of premises entails a conclusion p iff the intersection of the sets of possible worlds represented by all the premises is a subset of the set of worlds represented by p. This is another way of understanding the claim that the conclusion of a valid deductive argument does not introduce any information that is not already contained in the premises.
Note, however, that while the possible-worlds approach to propositions is very powerful, it also suffers from severe defects, as it is notoriously coarse-grained, meaning that it cannot distinguish between propositions that we intuitively regard as quite distinct. This is perhaps easier to see in the case of mathematical truths, which, being necessary (true in all possible worlds), are collapsed into one and the same object, the set of all possible worlds (and dually, of course, all contradictions are identified with the empty set of worlds). As a result, the proposition that 1 + 1 = 2 and Fermat’s theorem become identical, as they have the exact same information content. There have been attempts to address these issues (structured propositions and impossible worlds being two of the most prominent), but the interested reader will have to consult the literature for more details. |
| 11 | This can be made more precise using information-theoretic notions, at least in the case of propositional logic, where we have an infinite supply of formulas that are either atomic (propositional variables) or else Boolean combinations of formulas. Instead of imposing the usual Kolmogorov axioms on a probability measure defined over a set of events (a σ-field) from a sample space Ω, we impose the same axioms (non-negativity, finite additivity, and the axiom that assigns a measure of 1 to every tautology—the analogue of P(Ω) = 1) on a probability measure P defined over the set of all formulas. Then truth and falsity become the extreme probabilities of 1 and 0, respectively. This allows us to associate a probability P(s) with any sentence (event) s, and hence every sentence automatically gets an information content in the usual way: I(s) = -log P(s). To say that the information content of a valid deductive argument with premises p_1, ..., p_n and conclusion p is zero is simply to say that the conditional (p_1 and ... and p_n) ==> p is a tautology. By definition, a tautology has probability 1, and therefore its information content is -log 1 = 0. |
| 12 | At this point the reader might ask: If deductive arguments convey zero information, why bother with them? Indeed, if all mathematical proofs are proofs of tautologies, with zero information content, what is their point? The thinking is that arguments with no information content are not useful, so if all deductive arguments (including all mathematical results) have zero information content, then they are not useful. This is, in brief, the so-called “scandal of deduction” (named by parity to the “scandal of induction,” i.e., Hume’s problem of induction). There have not been any widely accepted resolutions of this ostensible paradox. But few of course doubt that mathematical results are actually informative and extend our knowledge. (Surely if we woke up tomorrow and read that someone proved P ≠ NP, that would be tremendously informative.) It’s also clear that the word “information” has a number of informal meanings that are not captured by the canonical definition of information content (as the negative logarithm of probability), and most efforts to resolve the “scandal of deduction” have attempted to formalize distinct notions of informational gain that would render deductive arguments informative. |
| 13 | Several other types of reasoning are often discussed in the literature, such as analogical reasoning (which includes, for instance, case-based reasoning), Bayesian reasoning, causal reasoning, and so on, but these are usually subsumed under one of the three main categories I have described, most often under induction. (But there is no consensus; for instance, some thinkers, from Aristotle to recent authors, have tried to assimilate analogical reasoning under deduction.) |
| 14 | We are assuming of course that the car model whose mpg we are predicting was not included in the given data, otherwise there would be no prediction or generalization involved. |
| 15 | The training of deep neural networks, too, works by trying to discover values for various weights that are “optimal” for a given training dataset (in that they minimize loss), except that in their case the relationship between the inputs, outputs, and weights can be much more complicated (non-linear) and the training algorithm might not converge to the optimal weight values. |
| 16 | Some desired properties of explanations are obvious. Truth is one of them—a good explanation cannot be based on a false hypothesis. But other desired properties, such as parsimony and generality (explaining as much as possible while assuming as little as possible) are much harder to explicate. |
| 17 | Even from a purely linguistic viewpoint, it doesn’t seem appropriate to say that I have “concluded” or “derived” or “inferred” anything at all in the swan or in the plumber examples. I have simply made a tentative hypothesis (or conjecture), which might be refuted. |
| 18 | In the same way that even the process of discovering deductions is not itself deductive, at least not entirely so. Both are fundamentally search processes, though they are almost certainly informed and generally penetrated by deduction. |
| 19 | This viewpoint assumes a functional-programming stance, but computation can be readily reduced to deduction in any other style of programming (e.g., imperative) by an appropriate axiomatic formulation of the relevant semantics (e.g., operational semantics using stores). |
| 20 | In addition, of course, different versions of GPT-4 might get deployed at any time. |
| 21 | An unrealistic assumption given that the Internet is filled with an unbounded number of agents (millions of them, from completely arbitrary computer programs to smart-phone apps to travel-booking APIs to games and beyond) that provide an open-ended and constantly changing array of functionality. |
| 22 | By concrete counting I mean counting a number of specific object tokens instantiated in space and time, as in the coins in one’s pocket or the number of lines in a text file. By contrast, abstract counting based on combinatorial principles, search procedures, and logical constraints (like the scheduling problem in Section 3.9) is indeed a reasoning activity. |
| 23 | In the same way that the numbers 100000 and 1000000 only differ in one zero, but if we are talking about your bank balance that one zero makes a huge difference. |
| 24 | Usually the quantifier variables range explicitly over a sort such as Man, but this is not essential for the derivation. |
| 25 | Formally, this problem belongs to a class of temporal-reasoning problems literally known as STPs (“Simple Temporal Problems”) [5]. This class is of limited expressivity and there exist very efficient algorithms for solving STPs (e.g., consistency can be decided in O(n · m) time, where n is the number of events described in a given STP and m is the number of constraints between the events). |
| 26 | This particular version is taken from Chapter 18 of the textbook Fundamental Proof Methods in Computer Science by Arkoudas and Musser [1]. |
| 27 | Many shallow coding problems these days are essentially knowledge problems. What library or API can I use to do such and such? What configuration parameters are available and how can they be set? How do I zip or unzip files in Python? How do I read and write JSON or XML? How do I compute quantiles for a frequency table? Knowledge-heavy problems of this sort tend to be widely discussed on the web, and LLMs can be very effective productivity boosters for such problems (at least as long as this data remains freely available to companies such as OpenAI for pretraining purposes, something that might well change in the near future). Even conventional search engines like Google were already effective for these types of problems, prior to LLMs (and remain more effective than LLMs in many cases). But most interesting coding problems are reasoning-heavy. How can I make sure that this program produces correct outputs? How can I improve the asymptotic complexity of this program (where the program might contain many thousands of lines of code)? And so on. If we are talking about self-contained and cookie-cutter components, like sorting algorithms, then these questions can often be reduced to knowledge-based questions. But the minute we start straying into unique situations with arbitrary specifications and code bases, we start facing the curse of general reasoning. |
| 28 | Can this be posed as a simple SAT problem? Is it an SMT problem? Does it need quantifier reasoning? If so, is it of the sort that SMT solvers can handle or does it need a full first-order prover? Does the problem quantify over infinite functions or sets? If so, higher-order logic might be needed. Does it have any temporal or epistemic operators that might call for a modal-logic reasoner? And so on. |
| 29 | For instance, a state-of-the-art automated theorem prover might generate a proof, but the proof would be incomprehensible to the LLM user, as it would be expressed in the resolution calculus and would operate on CNF versions of the input formulas. It is an open problem to convert resolution proofs into fluid natural-deduction proofs (e.g., proofs that avoid references to Skolem constants introduced during the CNF conversion). |
| 30 | A recent Wall Street Journal article interviewed editors who are “seeing a growing amount of AI-generated content that is so far beneath their standards that they consider it a new kind of spam”, a trend that is “growing exponentially.” The publishers interviewed for the article said that their publications “reject all AI-written submissions” and that these “are easy to identify.” They have “perfect spelling and grammar, but a completely incoherent story.” Another said “They’re all written in a rather bland and generic way. They are all grammatically correct. They just feel very formulaic, and they are really useless to us.” |
| 31 | The former scenarios would be absurd even if AI technology had already attained superhuman intelligence, as LLMs do not have desires, in the same way that they don’t have beliefs or any other mental states. They do not actually want anything. To think otherwise is akin to thinking that a laptop that is simulating a hurricane will get wet (or, as Stephen Pinker has put it, thinking that because airplanes have now exceeded the flight ability of birds, they will suddenly start acting like eagles, swooping down from the sky to grab rabbits and squirrels). Genuine mental states can only be produced by brains, or by systems that have the same causal powers that brains have. Digital computers executing DNNs are not such systems. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).