Large Language Model (LLM)
Definition: A Large Language Model is a type of AI specifically designed to understand and generate human language. LLMs are trained on massive amounts of text data and can generate coherent and contextually relevant text based on the input they receive.
Examples: GPT-4 (which powers this conversation) is an example of an LLM. It can assist with tasks such as writing, summarizing, translating, and answering questions across various topics.
Note that the examples given precisely represent the sorts of tasks traditionally faced by students in many “softer” discipline areas, including the humanities, arts, social sciences and business. The capacity of ChatGPT-4o extends to answering questions in more technical disciplines. For example, when asked to solve a first-order differential equation, it not only gave the correct solution but provided each step in the solution along with appropriate explanations. The explanations given were certainly clear enough to be understood by an undergraduate student of calculus who was engaging with the course material, and could serve as a further learning tool. The concept of an integrating factor used in the solution of such equations (sketched below) is one that initially poses some difficulty for many students. Prompted to explain the intuition behind the idea of an integrating factor, ChatGPT-4o provided a clear and correct explanation. Of course, this may not always be the case, and we have seen examples where an LLM gets wrong answers to relatively simple mathematical questions and even hallucinates references in other discipline areas. In the case of incorrect answers to technical questions, ChatGPT-4o will often reconsider its answer if prompted to do so and then produce the correct answer along with an apology. A student may not know that an answer is wrong but, nevertheless, with reasonably judicious use, such as always asking the chatbot to re-check its answer, students in most disciplines could, with the assistance of an LLM, produce potentially excellent answers to many types of assessment question. Less sophisticated use may produce answers of lesser yet still passable quality.
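Since the specific equation from our trial is not reproduced here, the following is a generic sketch of the integrating-factor method that ChatGPT-4o explained; the concrete equation at the end is our own illustration, not the one used in the trial. For a first-order linear equation in standard form,

\[ \frac{dy}{dx} + P(x)\,y = Q(x), \]

multiplying through by the integrating factor \( \mu(x) = e^{\int P(x)\,dx} \) turns the left-hand side into an exact derivative, since \( \mu'(x) = P(x)\,\mu(x) \):

\[ \frac{d}{dx}\bigl(\mu(x)\,y\bigr) = \mu(x)\,Q(x) \quad\Longrightarrow\quad y = \frac{1}{\mu(x)}\left( \int \mu(x)\,Q(x)\,dx + C \right). \]

For instance, \( y' + y = x \) has \( \mu(x) = e^{x} \), giving \( (e^{x}y)' = x e^{x} \) and hence \( y = x - 1 + C e^{-x} \).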
The question then is whether the use of AI can be detected by those charged with assessing student work. Some useful work has been done to address this question empirically. Before we examine some of this research, note that, although it may be possible to cheat in a traditionally invigilated, in-person examination, it is other forms of assessment, such as written assignments, take-home exams and non-invigilated online exams, that provide the potential for unethical use of AI as well as other forms of outsourcing of effort. The COVID-19 pandemic certainly led to an increase in the use of unsupervised online assessments. However, the trend away from traditional exams to coursework started many decades ago and has accelerated this century as educational theorists have promoted the concepts of authentic assessment and constructive alignment of assessment. These concepts are frequently invoked in higher education settings, although it is often far from clear what is meant by them. Gullikers [37] (p. 69) defines an authentic assessment as ‘an assessment requiring students to use the same competencies, or combinations of knowledge, skills, and attitudes that they need to apply in the criterion situation in professional life’ before noting that ‘Authenticity is subjective.’ This sort of vagueness, along with an insistence on aligning the physical and social context of assessment with professional practice, seems to argue for assessment conducted face to face in small-group settings. While this may be ideal, especially in disciplines such as medicine where clear certification of certain competencies would seem essential, it is neither practical nor often observed in the context of mass higher education in the majority of disciplines. As Scarfe et al. [35] (p. 18) note, there are assessments that may well satisfy the definition of authentic yet can be completed by an LLM in seconds. If an institution of higher education is to reliably certify student learning, the crucial distinction is between supervised and unsupervised assessment rather than between authentic and inauthentic.
Terwiesch, at the Wharton School of the University of Pennsylvania, teaches operations management on the MBA course [2]. He was interested to know how ChatGPT-3 would perform on his MBA exam. He notes that if the skills taught in such courses can, to a large extent, be automated, then the value of the qualification ought to fall, if only from the fall in demand for graduates as incumbents in related jobs perform tasks more efficiently. He submitted the five questions on his exam to ChatGPT-3 and provided his own analysis of the responses. On questions involving simple process analysis, inventory turns and working capital requirements, Terwiesch would have scored the responses A+. On some other questions, the LLM made rather basic errors in calculation. Some of the poorer answers were improved by further prompting. Overall, ChatGPT-3 would have achieved a B to B-, which is perhaps now below the median Ivy League performance but a sound result for a student who need have learned nothing. Interestingly, ChatGPT has been considerably improved since Terwiesch’s trial. We entered the question that had been handled worst (on queuing theory) into ChatGPT-4o, and it almost instantly provided the correct answer, aligned with that given by Terwiesch, along with details of the theory and the steps it used.
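Terwiesch’s actual queuing question is not reproduced here; the following is a representative M/M/1 calculation of the sort such exams contain, with illustrative numbers of our own choosing. With Poisson arrivals at rate \( \lambda \) and exponential service at rate \( \mu > \lambda \), utilization is \( \rho = \lambda/\mu \), the average number in the system is \( L = \rho/(1-\rho) \), and by Little’s law (\( L = \lambda W \)) the average time in the system is

\[ W = \frac{L}{\lambda} = \frac{1}{\mu - \lambda}. \]

For example, \( \lambda = 4 \) customers per hour and \( \mu = 5 \) per hour give \( \rho = 0.8 \), \( L = 4 \) customers in the system on average, and \( W = 1 \) hour.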
A study providing direct experimental evidence on the use of AI in a university examinations system [35] added submissions generated entirely by ChatGPT-4 to the online examinations system of five undergraduate Psychology courses at the University of Reading, without the knowledge of the markers. Perhaps surprisingly, the University Ethics Committee Lead decided that this study did not require research ethics approval, despite the lack of consent from the markers, who could be considered the subjects of the research. However, the subjects’ ignorance of the subterfuge provided more credible evidence than if they had been forewarned. In summary, 94% of the AI submissions were not detected, which raises the question of how many of the submissions by real students may also have been at least AI-assisted and gone undetected. The grades attained by the AI submissions clustered at the upper end of the grade distributions in all but one course. In two of the courses, the probability of AI outperforming real students was 100%.
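One natural way to compute a figure such as the ‘probability of AI outperforming real students’ is as a pairwise comparison across the two grade samples (the common-language effect size); whether [35] used exactly this method is an assumption on our part. A minimal sketch in Python, with hypothetical grades:

    # Probability that a randomly chosen AI submission outgrades a randomly
    # chosen student submission (the common-language effect size).
    # Hypothetical grades for illustration only -- not data from [35].
    ai_grades = [72, 75, 78, 80, 81]
    student_grades = [55, 58, 62, 64, 68, 70]

    def prob_superiority(xs, ys):
        # P(X > Y) plus half of P(X == Y), averaged over all pairs (x, y).
        wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
                   for x in xs for y in ys)
        return wins / (len(xs) * len(ys))

    # Prints 1.0 here: every AI grade in the sample beats every student grade,
    # mirroring the 100% figure reported for two of the courses.
    print(prob_superiority(ai_grades, student_grades))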
Chaudhry et al. [36] adopted a research design crafted to avoid the perceived ethical issue. Four courses from the subjects offered for the Bachelor of Business Administration in Human Resource Management at a high-ranking private provider in Abu Dhabi were chosen at random, and one assignment from each course was then chosen for the experiment. Questions from these assignments were submitted to ChatGPT, and the responses were marked by instructors who had led each course in a previous semester. A sample was also taken from a control group of students who had completed plagiarism-free work before the launch of ChatGPT. In the courses Leadership & Teamwork, Organizational Behavior, and Production Management, the AI-generated responses received perfect scores, while in Staffing, the AI response gained 70%.
Ibrahim and 38 co-authors [3] compared the performance of ChatGPT against that of students in 32 different university-level courses at New York University Abu Dhabi. Teachers of each course provided 10 questions from the course they taught, along with three randomly chosen student answers to each question; ChatGPT was used to generate three responses to each question. The resulting six submissions for each question were graded by three different markers who had previously taught the course. Every discipline, with the exceptions of Mathematics and Economics, had at least one course in which the performance of AI equaled or exceeded that of actual students. Quantitative questions appear to be the area in which ChatGPT performed relatively poorly, but note our comment above on re-submitting Terwiesch’s most quantitative question to ChatGPT-4o. It is also our own direct experience that, on detecting incorrect answers to quantitative questions from ChatGPT-4o, asking the chatbot to reconsider often produces a correct answer, explanation and apology. Even if a student does not recognize that an answer is wrong, simply asking for a check is likely to ensure a high probability of success.
The examples of research given above lead us to the conclusion that, left unchecked, the use of LLMs in higher education will certainly exacerbate the problem of grade inflation, if not credential inflation, with the obvious flow-on effects on returns to education, or rather returns to credentials, and on the mismatch of only nominally qualified workers to jobs. It is undoubtedly difficult for human markers to detect the use of AI. Perhaps we could be more alert to answers that seem too good to be true, in the sense of being near perfect not only in substance but also in expression, but suspecting unethical conduct and proving it are different issues. Plagiarism and other forms of cheating are nothing new in academia; unethical practices can never be eliminated, and the question is therefore to determine the optimal level of effort to devote to detection. The use of AI itself can assist, and a number of services have been developed specifically to detect AI content. LLMs themselves offer AI writing detection, but all come with disclaimers as to their current accuracy levels.

To illustrate the potential difficulties, we submitted a brief piece of writing to both ChatGPT-4o and Turnitin. Many universities have, since 2000, used Turnitin [38] as a provider of similarity checking of student submissions. In 2023, Turnitin added AI writing detection to its similarity checking. Turnitin classified our submission as 100% AI-generated, with 0% of the AI-generated text likely to have been revised using a paraphrase tool, albeit displaying the usual disclaimer that this ‘should not be used as the sole basis for adverse actions against the student.’ Our own ‘scrutiny and human judgment’ [38] is that the piece is almost certainly mainly AI-written; in fact, our suspicion that this was the case was our reason for choosing the piece. However, relying on only one piece of evidence risks confirmation bias and, in stark contrast, here is the first response of ChatGPT-4o to the same piece of writing:
Based on the analysis of the provided excerpt, it appears to be written with a high level of detail and specificity that is typical of human authorship, particularly in an academic context. The style is formal and consistent, which is also common in professional writing. There are no clear signs of AI-generated content such as unnatural repetition, lack of depth, or excessive keyword stuffing.
We followed up, suggesting to ChatGPT-4o that some of the word choices were odd and perhaps characteristic of AI writing. In part of a quite lengthy response, ChatGPT-4o said:
You're correct that certain phrases and word choices might seem unusual or overly complex. Let's delve deeper into the text to scrutinize the language usage and structure.
While there are some phrases that seem overly complex or jargon-heavy, the content itself is detailed and specific, which is more typical of human-written text, especially in academic contexts. The unusual phrases could simply be attributed to an attempt to sound more scholarly or technical rather than AI generation.
These types of conflicting responses may deter many teachers from further action. Turnitin and similar services warn users that AI detection software may generate both false positives and false negatives. Action against a student for academic misconduct becomes an issue only in the case of a false positive, so that, once the software has flagged a submission, it falls to human judgment to determine whether any action should be taken. In the context of the transactional relationship between a student and a higher education provider, it is likely that college and university administrators will lean towards affording a student the benefit of the doubt.
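Turnitin’s detector is proprietary, and ChatGPT-4o discloses no method at all, so neither of the responses above can be audited. To illustrate why false positives and false negatives are inherent to this class of tool, the following is a minimal sketch of one heuristic that publicly documented detectors have used: perplexity under a reference language model. The threshold, the sample text and the choice of GPT-2 are our own illustrative assumptions, and this is not Turnitin’s method; the sketch assumes the torch and transformers packages are installed.

    # A minimal perplexity-based AI-text heuristic -- a sketch of one publicly
    # known detection idea, not Turnitin's (proprietary) method.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def perplexity(text: str) -> float:
        # Perplexity of the text under GPT-2; lower means more "predictable"
        # prose, which detectors of this kind read as a weak AI signal.
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(enc.input_ids, labels=enc.input_ids)
        return torch.exp(out.loss).item()

    # Hypothetical threshold: polished, formal human prose also scores low,
    # which is exactly how false positives arise; quirky AI output can score
    # high, producing false negatives.
    THRESHOLD = 40.0
    sample = "The style is formal and consistent, as is common in professional writing."
    print("flagged as possibly AI-generated:", perplexity(sample) < THRESHOLD)

Because formal academic writing is precisely the register in which human and machine text are hardest to separate on a score of this kind, no single number can settle a misconduct case, which is consistent with the disclaimers the commercial services attach.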
It can be concluded that the widespread availability of LLMs, together with the obstacles to definitive identification of AI-generated text, reduces the ability to be sure that an unsupervised student assignment is entirely the student’s own work. To the extent that such assessments underpin the granting of a qualification, both the signaling value and the human capital value of the qualification are necessarily diminished.