1. Introduction
The annual count of reported incidents within the Aviation Safety Reporting System (ASRS) has been consistently increasing, a trend that is anticipated to persist in the foreseeable future. This projected growth is largely attributable to the ease of submitting incident reports and the integration of novel systems such as Unmanned Aerial Systems (UAS) into the National Airspace System (NAS). Because the current processing time of incident reports by two safety analysts can take up to five business days [
1], new approaches are sought to facilitate and accelerate such tasks. The development of hybrid human-AI approaches, in particular those involving the use of Large Language Models (LLMs), is expected to enhance the efficiency of safety analyses and reduce the time required to process incident reports.
Prior studies, including those mentioned in [
2], have utilized the ASRS dataset in conjunction with Large Language Models (LLMs), such as BERT. However, the conclusions drawn from these studies may be limited for a variety of reasons. The first is the imposed maximum narrative length of 256 or 512 WordPiece tokens when using BERT, which may lead to information loss. For comparison, this limit falls below the 25th percentile of incident narrative lengths, which stands at 747 WordPiece tokens. The second reason is that BERT requires specific training or fine-tuning on domain-specific data. Generative language models, on the other hand, can learn from a wider range of information due to their expansive training on larger datasets. This not only makes them more adaptable to evolving linguistic trends and domain shifts but also enhances their performance in zero-shot tasks without requiring specialized fine-tuning.
The use of generative language models in the field of aviation safety remains largely unexplored. These models can serve as effective “copilots” or assistants to aviation safety analysts in numerous ways. They can automate the analysis of safety reports, identify patterns or anomalies that highlight potential safety issues, and identify potential risks based on historical data, hence aiding in the development of proactive safety strategies. Their proficiency in Natural Language Processing (NLP) can be harnessed for summarizing incident reports and extracting crucial information of interest. Furthermore, these models can be employed as training tools to simulate various scenarios, or to create synthetic data as a means to test safety measures or fill data gaps. However, their utility relies heavily on their training and implementation, and they should complement rather than replace human expertise.
In light of the considerable potential of generative language models, the primary objective of this work is to conduct a comprehensive assessment of their applicability and significance, using GPT-3.5 (ChatGPT) [3], in the context of aviation safety analysis and specifically the ASRS dataset. In this context, these language models hold the potential to serve as instrumental tools that aid human safety analysts by accelerating the examination of incident reports while preserving consistency and reproducibility in their analyses. In particular, this paper evaluates the efficacy of generative language models in automatically performing the following tasks from free-form incident narratives:
Generate succinct synopses of the incidents.
Compare the faithfulness of the generated synopses to human-written synopses.
Discern the human factor(s) contributing to an incident.
Pinpoint the party responsible for the incident.
Provide explanatory logic/rationale for the generative language model’s decisions.
The assembled dataset, which includes the ground truth, generated outputs, and accompanying rationale, can be found on the HuggingFace platform [
4]. This accessibility allows for additional examination and validation, thereby fostering further advancements in the field.
This paper is organized as follows.
Section 2 provides detailed information regarding the ASRS, introduces LLMs, and discusses the use of LLMs in the context of the ASRS dataset.
Section 3 elaborates on the methodology implemented in this study, with a particular focus on the dataset used, prompt engineering, and the methodology used for comparing the generated outputs to the ground truth.
Section 4 discusses the findings of this work and presents examples of incident narratives, synopses, human factor errors, and responsible parties. Lastly,
Section 5 summarizes this research effort, discusses its limitations, and suggests potential avenues for future work.
2. Background
This section provides more information about the ASRS dataset and the way in which incident reports are gathered and analyzed by safety analysts to draw useful insights. This section also offers a comprehensive overview of LLMs as foundation models, specifically focusing on generative language models, as well as a discussion on the application of NLP in aviation safety analysis.
2.1. Aviation Safety Reporting System (ASRS)
The ASRS offers a selection of five distinct forms for the submission of incident reports by various personnel, as presented in
Table 1. Multiple reports pertaining to the same event or incident may exist; these are subsequently merged by safety analysts. A segment of the General Form for reporting incidents involving aircraft is depicted in
Figure 1.
Figure 2 illustrates the pipeline for processing incident reports in the ASRS. The process begins with ASRS receiving the reports in electronic or paper format. Subsequently, each report undergoes a date and time-stamping procedure based on the receipt date. Two ASRS analysts then screen the report to determine its initial categorization and triage for processing [
1]. This screening process typically takes approximately five working days. Based on the initial analysis, the analysts have the authority to issue an
Alert Message and share de-identified information with relevant organizations in positions of authority. These organizations are responsible for further evaluation and any necessary corrective actions [
1].
Afterward, multiple reports related to the same incident are consolidated to form a single
record in the ASRS database. Reports that require additional analysis are identified and entered into the database after being coded using the ASRS taxonomy. If further clarification is needed, the analyst may contact the reporter of the incident and any newly obtained information is documented in the
Callback column. Following the analysis phase, any identifying information is removed, and a final check is conducted to ensure coding accuracy. The de-identified reports are then added to the ASRS database, which can be accessed through the organization’s website. To maintain confidentiality, all original incident reports, both physical and electronic, are securely destroyed [
1].
Table 2 shows some of the columns included in the ASRS dataset.
In the ASRS database, information in different columns is populated either based on reporter-provided data or by safety analysts who analyze incident reports. For instance, the Narrative column is examined to populate related columns like Human Factors, Contributing Factors/Situations, and Synopsis.
With the increase in the number of incident reports over time, there is a need for a human-in-the-loop system to assist safety analysts in processing and analyzing these reports, which will help with reducing the processing time, improving labeling accuracy, and ultimately enhancing the safety of the NAS. LLMs, which are introduced in the section below, have the potential to help address this need.
2.2. Large Language Models
This section provides an overview of LLMs, their pre-training and fine-tuning processes, and highlights their significance as foundational models. Furthermore, it explores recent advancements in the field, particularly focusing on generative language models and their relevance to the present work on aviation safety.
2.2.1. Large Language Models (LLMs) as Foundation Models
LLMs, such as Bidirectional Encoder Representations from Transformers (BERT) [
7] and the Generative Pre-trained Transformer (GPT) family [
8,
9,
10,
11], LLaMA [
12], LaMDA [
13], and PaLM [14], are advanced natural language processing systems that have shown remarkable capabilities in understanding and generating human-like text. These models are built upon Transformer neural networks with attention mechanisms [
15]. Neural networks, inspired by the functioning of the human brain, consist of interconnected nodes organized in layers that process and transform input data. The attention mechanism enables the model to focus on relevant parts of the input during processing, effectively capturing dependencies between different words and improving contextual understanding. Transformer architectures have been particularly successful in NLP tasks, providing an efficient and effective way to process text.
The training process of LLMs involves two main stages:
pre-training and
fine-tuning. During pre-training, the model is exposed to vast amounts of text data from the Internet or other sources, which helps it learn patterns, grammar, and semantic relationships. This unsupervised learning phase utilizes a large corpus of text to predict masked words, allowing the model to capture the linguistic nuances of the language. The pre-training stage often involves a variant of unsupervised learning called
self-supervised learning [
16], where the model generates its own training labels using methods such as Masked Language Modeling (MLM), Next Sentence Prediction (NSP), generative pre-training, etc. This enables the model to learn without relying on human-annotated data, making it highly scalable. Unsupervised pre-training typically uses general-purpose texts, including data scraped from the Internet, novels, and other sources. This helps overcome the non-trivial cost and limits on scaling associated with data annotation.
After pre-training, LLMs are often fine-tuned on specific downstream tasks. This stage involves training the model on annotated data for NLP tasks like text classification, question answering, or translation. Fine-tuning allows the model to adapt its pre-learned knowledge to the specific requirements of the target task and domain, enhancing its performance and accuracy. The fine-tuning process typically involves using gradient-based optimization methods to update the model’s parameters and minimize the discrepancy between its predictions and the ground truth labels.
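As a concrete illustration of this pre-train-then-fine-tune workflow, the following minimal sketch fine-tunes a generic BERT checkpoint for text classification with the Hugging Face transformers library; the model name, toy data, labels, and hyperparameters are illustrative assumptions, not settings used in this work.

```python
# Minimal sketch of supervised fine-tuning of a pre-trained BERT-style model
# for text classification. Model name, toy data, labels, and hyperparameters
# are illustrative placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["Aircraft deviated from assigned altitude during climb.",
         "Ground crew towed the aircraft past the hold-short line."]
labels = [0, 1]  # placeholder label ids for a downstream classification task

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    # Convert raw text into WordPiece token ids with padding/truncation
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train_ds = Dataset.from_dict({"text": texts, "label": labels}).map(tokenize, batched=True)

args = TrainingArguments(output_dir="ft-out", num_train_epochs=3, per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()  # gradient-based updates minimize the loss against the labels
```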
The general schematics of pre-training and fine-tuning LLMs are shown in
Figure 3.
In summary, LLMs such as BERT and GPT are often termed
foundation models. They provide the basis for an extensive range of specialized models and applications [
18].
With specific regard to the aerospace field, there exist fine-tuned models, namely aeroBERT-NER [
17] and aeroBERT-Classifier [
19], developed by fine-tuning variants of BERT on annotated aerospace corpora [
17,
19,
20,
21]. These models were designed to recognize aerospace-specific named entities and to categorize aerospace requirements into different types, respectively [
21].
The next subsection introduces a specific type of foundation model, namely, generative language models.
2.2.2. Generative Language Models
An alternative to BERT’s MLM and NSP pre-training is
generative pre-training. This approach draws inspiration from statistical language models [
22,
23], which aim to generate text sequences by choosing word (token) sequences that maximize next-token probabilities conditioned on prior text. Neural language models [
24] use neural networks to estimate the conditional next-token probabilities. During generative pre-training, a model is fed a partial sequence of text with the remainder hidden from it, and it is trained to complete the text. Hence, the corpus for generative pre-training can be unlabeled, as in the self-supervised training of BERT.
The GPT [
8] is a neural language model that employs a Transformer-based
decoder [
15] as its neural architecture. This should be contrasted with the Transformer
encoder of BERT, which is pre-trained with MLM and NSP. The decoder-only structure of GPT allows the model to perform diverse NLP tasks, such as classifying text, answering questions, and summarizing text, with minimal architectural changes. Rather than modifying the architecture for each task, GPT applies task-specific input transformations that indicate which task should be performed, all of which can be fed to the same Transformer.
In GPT-1 [
8], generative pre-training is followed by task-specific supervised fine-tuning by simply adding a dense layer and softmax, and fine-tuning for only a few training epochs. This approach is similar to BERT in that it requires sufficiently large annotated datasets for supervised fine-tuning. GPT-2 and GPT-3 place greater focus on zero-shot and few-shot learning, where the model must learn how to perform its tasks with zero or only a few examples of correct answers. GPT-2 [
9] proposed conditioning its output probabilities on both the input and the desired task. Training a model for this multi-task learning by supervised means is infeasible, as it would require thousands of (dataset, objective) pairs. Therefore, GPT-2 shifts focus and demonstrates that strong zero-shot performance on some tasks can be achieved, without supervised fine-tuning, by using a larger model (1.5B parameters). GPT-3 [
10] scales GPT-2 up to 175B parameters, which greatly improves task-agnostic performance in zero-shot, one-shot, and few-shot settings without any supervised fine-tuning or parameter updates. It even outperforms some fine-tuned models, achieving state-of-the-art results on a few NLP benchmarks.
However, GPT-3 has numerous limitations. It struggles to synthesize long passages of text, repeating itself, losing coherence, and including non-sequitur sentences. It lags far behind fine-tuned models in some NLP tasks, such as question answering. In addition, its responses to user prompts are not always aligned with the user’s intent and sometimes show unintended behaviors, such as making up facts (“hallucinating”), generating biased or toxic text [
25], and not following user instructions. These limitations stem from a fundamental incompatibility between the pre-training objective of generating the next token and the real objective of following user instructions safely and helpfully. InstructGPT (GPT-3.5) [
3] aims to correct this misalignment. InstructGPT is the underlying LLM for ChatGPT by OpenAI [
3].
Since the mistakes GPT-3 makes are not easy to evaluate through simple NLP metrics, InstructGPT employs reinforcement learning with human feedback (RLHF) [
26,
27] after pre-training to dramatically improve performance. This is a three-step process:
Supervised policy fine-tuning: Collect a set of instruction prompts and have data labelers demonstrate the desired output. This data is used for supervised fine-tuning (SFT) of GPT-3.
Train a reward model: Collect a set of instruction prompts, each with multiple different model outputs, and have data labelers rank the responses. This is used to train a reward model (RM) starting from the SFT model with the final layer removed.
Optimize a policy against the RM via RL: Collect a set of prompts, outputs, and corresponding rewards. These are used to fine-tune the SFT model in this environment using proximal policy optimization (PPO).
Once InstructGPT fine-tunes GPT-3 through these steps, it becomes a standalone, off-the-shelf LLM that can perform a diverse set of tasks effectively based on text instructions without the need for any additional training. Indeed, it has numerous benefits over GPT-3: labelers prefer InstructGPT outputs, it is more truthful and less toxic, and it generalizes better to tasks and instructions not seen during training.
Generative language models can be instrumental in advancing aviation safety analysis by streamlining various processes and tasks. These models can significantly expedite the review and assessment of incident reports, by automatically identifying and categorizing crucial information, thus aiding in faster decision-making and implementation of preventive measures. In terms of forecasting, generative language models can detect patterns and trends from historical data to anticipate potential safety issues, thereby contributing to proactive safety management. The ability of these models to generate concise summaries of complex reports further enhances their utility by ensuring that key insights are readily available for stakeholders. Additionally, these models can provide nuanced insights into the contextual factors influencing safety incidents by analyzing communication data and safety reports. Therefore, generative language models can significantly augment safety analysis by introducing efficiencies and providing multifaceted insights. Despite the promises, it is imperative to assert that generative language models should not supersede the role of human safety analysts. Rather, they ought to be incorporated within a system that maintains human involvement at its core. In addition, most of these applications of generative language models still remain unexplored.
In the context of this study, the terms InstructGPT, GPT-3.5, and ChatGPT are used synonymously, as they fundamentally represent the same technology utilized via an API.
2.2.3. NLP in Aviation Safety Analysis
While the occurrence of incidents resulting from technical failures has decreased, issues related to human factors have emerged as the predominant underlying cause of incidents [
28,
29]. Several studies, such as [
30,
31,
32,
33,
34,
35], have looked into human factor issues in aviation. One of the most complex and difficult tasks when classifying aviation safety incidents is to sub-classify incidents stemming from human factor complications, which is a primary area of interest in this research. The investigation conducted in [
36] gathered labels from six individual annotators, each working on a subset of 400 incident reports, culminating in a collective 2,400 individual annotations. The outcomes indicated that there was a high level of disagreement among the human annotators. This highlights the potential for language models to assist in incident classification, with subsequent verification by safety analysts.
In light of contemporary advancements in the field of NLP, numerous LLMs have been employed for the evaluation of aviation safety reports. The majority of research conducted in this sphere has largely concentrated on the classification of safety documents or reports [
2,
37].
In their study [
2], Andrade et al. introduce SafeAeroBERT, an LLM created by training BERT on incident and accident reports sourced from the ASRS and the National Transportation Safety Board (NTSB). The model is capable of classifying reports into four distinct categories, each based on the causative factor that led to the incident. Despite this capability, SafeAeroBERT outperformed BERT and SciBERT in only two of the four categories on which it was explicitly trained, thereby indicating potential areas for enhancement. In a similar vein, Kierszbaum et al. [
37] propose a model named ASRS-CMFS, which is a more compact model drawing inspiration from RoBERTa and is trained using domain-specific corpora. The purpose of the training is to perform diverse classifications based on the types of anomalies that resulted in incidents. From the authors’ findings, it became evident that in most instances, the base RoBERTa model maintained a comparative advantage.
Despite the abundance of research and literature in the domain of aviation safety analysis, the application of generative language models remains largely unexplored within this field.
The following section discusses the dataset and methodology developed to demonstrate the potential of generative language models in the realm of aviation safety analysis.
3. Materials and Methods
This section details the dataset utilized in this work, the specific prompt employed, and the methodology adopted for interacting with ChatGPT.
3.1. Dataset
The ASRS database contains 70,829 incident reports added between January 2009 and July 2022. A total of 10,000 incident reports whose Primary Problem was labeled as Human Factors were downloaded for use in this study. This choice was motivated by the large number of incidents resulting from human factors when compared to other causes. The high number of aviation incidents linked to human factors can be attributed to several factors, such as the aviation environment being complex and stressful, which can lead to errors. People have physical and mental limits, and issues like fatigue, stress, and misunderstanding can contribute to mistakes. Differences in crew training and skills can also cause disparities. Sometimes, cockpit design can lead to confusion, and some airline training cultures may not focus enough on safety.
3.2. Prompt Engineering for ASRS Analysis
This work leverages GPT-3.5 (via the OpenAI’s ChatGPT API) [
38] to analyze incident narratives, identify human factor issues leading to incidents (
Table 3), pinpoint responsible parties, and generate incident synopses. As mentioned, the primary objective is to investigate and validate the potential uses of generative language models in the context of aviation safety analysis.
Interacting with ChatGPT involved testing a variety of prompts before selecting the most suitable one. A prompt forms the initial input to a language model, shaping its subsequent output and significantly impacting the quality and relevance of the generated text. Prompt engineering is the act of optimizing these prompts: the process refines the input to guide the model’s responses effectively, improving its performance and the applicability of its output. The temperature parameter of ChatGPT was set to zero for this work; at this setting, the output becomes largely deterministic, which is well-suited for tasks that require stability and the most probable outcome.
The initial step in the prompt engineering process involved assigning the persona of an aviation safety analyst to ChatGPT. Subsequently, ChatGPT was instructed to produce a brief synopsis based on the incident description. Initially, there were no restrictions on the length of the generated synopses, resulting in significant variations in length compared to the actual synopses. To address this, the lengths of the actual synopses were examined, and a maximum limit of two sentences was imposed on the generated synopses. Because the model appeared to omit the names of the systems involved at first, it was then specifically prompted to include system names and other relevant abbreviations.
Subsequently, the model was tasked with identifying various human factor issues responsible for the incident based on the provided incident narratives. While the model demonstrated the ability to identify these human factor issues, its responses exhibited significant variability due to its capability to generate highly detailed causal factors. This made it challenging to compare the generated responses with the ground truth, which encompassed twelve overarching human factor issues. Consequently, adjustments were made to the prompt to instruct the model to categorize the incidents into these twelve predefined classifications, where the model could choose one or more factors for each incident. Additionally, the model’s reasoning behind the identification of human factor issues was generated via a prompt to provide an explanation for the decision made by the language model. Likewise, the model was directed to determine the party accountable for the incident from a predetermined list of general options (such as ATC, Dispatch, Flight Crew, etc.) to prevent the generation of excessively specific answers, thereby facilitating the aggregation and subsequent evaluation process. The rationale behind these classifications was generated as well.
Lastly, the model was prompted to generate the output in a JSON format for which the keys were provided, namely, Synopsis, Human Factor issue, Rationale - Human Factor issue, Incident attribution, and Rationale - Incident attribution. This structured format was then converted into a .csv file for further analysis using Python.
The prompt employed in this work can be found in
Appendix A.
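For illustration, a minimal sketch of such an API interaction is shown below. It assumes the openai Python client; the prompt text is a paraphrase of the prompt in Appendix A, and the file name and helper function are hypothetical.

```python
# Sketch of one API call with a fixed analyst persona, temperature 0, and a
# JSON-formatted response. The prompt below paraphrases the prompt in
# Appendix A; file name and helper function are hypothetical.
import csv
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

KEYS = ["Synopsis", "Human Factor issue", "Rationale - Human Factor issue",
        "Incident attribution", "Rationale - Incident attribution"]

def analyze_narrative(narrative: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic, most probable output
        messages=[
            {"role": "system", "content": "You are an aviation safety analyst."},
            {"role": "user", "content": (
                "Read the incident narrative below and return a JSON object "
                f"with the keys {KEYS}. Limit the synopsis to 1-2 sentences "
                "and include system names and abbreviations.\n\n" + narrative)},
        ],
    )
    return json.loads(response.choices[0].message.content)  # may raise if the reply is not pure JSON

# Convert the structured output into a .csv row for further analysis
with open("asrs_chatgpt_outputs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=KEYS)
    writer.writeheader()
    writer.writerow(analyze_narrative("Aircraft X was cleared to descend ..."))
```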
3.3. Analyzing ChatGPT’s Performance
The .csv file generated was further analyzed to benchmark ChatGPT’s performance against that of safety analysts.
The quality of ChatGPT-generated incident synopses was analyzed first. Two approaches were taken to assess quality: (1) the similarity of ChatGPT-generated synopses to human-written synopses using BERT-based LM embeddings, and (2) the manual examination of a small subset of the synopses.
When a sequence of text is fed to a BERT-based LM, it is first encoded by its pre-trained Transformer encoder into a numerical representation (i.e., an
embedding). This same embedding may be fed to heads (or decoders) that perform various NLP tasks, such as sentiment analysis, text summarization, or named-entity recognition. Hence, this embedding contains deep syntactical and contextual information characterizing the text. For standard-size BERT models, the hidden representation of a text sequence made up of T WordPiece tokens is of dimension T × 768, and the row sum over the T tokens is commonly used as the embedding. Two text sequences are considered similar if their embedding vectors from the same LM have a similar angle, i.e., a high cosine similarity (near 1), and dissimilar when they have a low cosine similarity (near −1). Hence, the cosine similarity of each pair of human-written and ChatGPT-generated incident synopses was measured using several BERT-based LMs, including BERT [
7], aeroBERT [
17,
19], and Sentence-BERT (SBERT) [
39]. It is important to note that the latter uses a Siamese network architecture and fine-tunes BERT to generate sentence embeddings that the authors suggest are more suited to comparison by cosine similarity. This was followed by a manual examination of some of the actual and generated synopses to better understand the reasoning employed by ChatGPT for the task.
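The following minimal sketch illustrates this comparison for a single pair of synopses, using a generic BERT checkpoint and the row-sum pooling described above; the generated synopsis shown is invented for illustration, and the study applied the same procedure with aeroBERT and SBERT embeddings.

```python
# Sketch: embed one human-written and one ChatGPT-generated synopsis with a
# generic BERT checkpoint, pool by the row sum described above, and compare
# by cosine similarity. The generated synopsis is a made-up example.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, T, 768)
    return hidden.sum(dim=1).squeeze(0)             # row sum over the T tokens -> (768,)

human_synopsis = "A light twin and a single engine have a NMAC at Isla Grande."
generated_synopsis = "Two aircraft experienced a near mid-air collision near Isla Grande."

similarity = torch.nn.functional.cosine_similarity(
    embed(human_synopsis), embed(generated_synopsis), dim=0)
print(float(similarity))  # near 1 = similar, near -1 = dissimilar
```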
Subsequently, a comparison was made between the human factor issues identified by ChatGPT and those identified by human safety analysts. The frequency at which ChatGPT associated each human factor issue with incident narratives was taken into account. To visualize these comparisons, a histogram was constructed. Furthermore, a normalized multi-label confusion matrix was created to illustrate the level of concordance between ChatGPT and safety analysts in attributing human factor issues to incident narratives. Each row in the matrix represents a human factor issue assigned to cases by the human annotators. The values within each row indicate the percentage of instances where ChatGPT assigned the same issue to the corresponding narrative. In an ideal scenario where ChatGPT perfectly aligns with the human annotators, all values except for those on the diagonal would be zero. Performance metrics such as precision, recall, and F1 score were used to assess the agreement between ChatGPT and safety analysts.
Finally, an analysis was conducted regarding the attribution of fault by ChatGPT. The focus was placed on the top five parties involved in incidents, and their attribution was discussed qualitatively, supported by specific examples.
The subsequent section discusses the results obtained by implementing the methodology described in the current section.
4. Results and Discussion
The dataset of 10,000 records from the ASRS database was first screened for duplicates as represented by the ASRS Case Number (ACN), and all duplicates were deleted, resulting in 9,984 unique records. The prompt in
Appendix A was run on each record’s narrative via the OpenAI ChatGPT API, resulting in five new features generated by GPT-3.5, outlined in
Table 4.
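A minimal sketch of this de-duplication step is shown below, assuming the downloaded reports are in a CSV file with an ACN column; the file name is a placeholder.

```python
# Sketch of the de-duplication step by ASRS Case Number (ACN).
# The file name is a placeholder.
import pandas as pd

reports = pd.read_csv("asrs_human_factors_reports.csv")
unique_reports = reports.drop_duplicates(subset="ACN", keep="first")
print(len(unique_reports))  # 9,984 unique records in this study
```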
4.1. Generation of Incident Synopses
Two approaches were employed to assess the quality of the ChatGPT-generated incident synopses: (1) the similarity of ChatGPT-generated synopses to human-written synopses using BERT-based LM embeddings, and (2) the manual examination of a small subset of the synopses.
As depicted in
Figure 4, all the models (BERT, SBERT, and aeroBERT) find the synopses mostly quite similar. aeroBERT, with its fine-tuning on aerospace-specific language, evaluates the sequences as the most similar, while the dedicated but general-purpose sequence-comparison model SBERT finds the ChatGPT-generated synopses similar, but less so.
The practical significance of these similarity scores was not clear; hence, the next step involved a manual evaluation of a sample of high-similarity and low-similarity pairs of synopses from each of the three models.
Table 5 displays, for each LM, the three synopsis pairs with the highest cosine similarity between the embeddings of the ChatGPT-generated synopses and those written by safety analysts. It is important to highlight that the lengths of the generated and ground truth synopses are very similar in this case. This is particularly relevant considering the prompt directed the generative model to produce a synopsis based on the incident narrative while maintaining a length constraint of 1-2 sentences, as depicted in
Table 4.
Similarly,
Table 6 illustrates, for each LM, the three synopsis pairs with the lowest cosine similarity between the embeddings of the generated synopses and the corresponding synopses written by safety analysts. A noteworthy observation from these data is the significant difference in length between the generated synopses and the ground truth for all three LMs. The length of the generated synopses fluctuates depending on the narrative’s length and the prompt’s restriction to produce 1-2 sentences. In contrast, the ground truth synopses are notably concise. Furthermore, there appears to be a degree of overlap concerning the synopsis pairs that obtained the lowest ranking in terms of cosine similarity. For example, the generated synopsis corresponding to “
A light twin and a single engine have a NMAC at Isla Grande” (ground truth) was given the lowest ranking by both BERT and SBERT models. However, the cosine similarity when employing BERT embeddings was higher relative to that of SBERT. The cosine similarities between human-written and ChatGPT-written synopses for aeroBERT, SBERT, and BERT show correlations of 0.25, 0.32, and 0.50, respectively, to the lengths of the human-written synopses (in words). This suggests aeroBERT is less likely to attribute similarities based on the confounding variable of synopsis length.
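A sketch of this length-versus-similarity check is shown below, assuming a table with one row per synopsis pair; the column names are placeholders.

```python
# Sketch of the length-versus-similarity check: Pearson correlation between
# the word count of the human-written synopsis and the cosine similarity
# produced with each LM's embeddings. Column names are placeholders.
import pandas as pd

pairs = pd.read_csv("synopsis_similarities.csv")
human_length = pairs["human_synopsis"].str.split().str.len()
for column in ["cos_sim_aerobert", "cos_sim_sbert", "cos_sim_bert"]:
    print(column, round(human_length.corr(pairs[column]), 2))  # ~0.25, 0.32, 0.50 reported above
```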
4.2. Performance with Human Factors Related Issues
In this section, we evaluate how human evaluators and ChatGPT compare in attributing human factors issues to ASRS incidents. ChatGPT was prompted (see
Appendix A) to attribute human factors issues from
Table 3 to each incident narrative, and to explain its rationale.
First, the frequency with which ChatGPT attributes each human factors issue to incident narratives was considered. ChatGPT assigns almost all human factors issues to narratives less frequently than human annotators. The only exception is Training/Qualification, which ChatGPT assigns more frequently. Notably, ChatGPT almost never assigns Confusion, Human-Machine Interface, Other/Unknown, and Troubleshooting. It should be emphasized that there were variations in the human factor issues as identified by safety analysts due to subtle differences in categories like Confusion and Fatigue.
Figure 5. Frequency of Human-attributed and ChatGPT-attributed Human Factors Issues.
Next, the normalized multi-label confusion matrix [
40] in
Figure 6 shows the degree of agreement between ChatGPT and the human annotators for human factors issues attributed to each record in the ASRS. Each row represents a human factor issue attributed to cases by the human annotators. The numbers in the row are the percentages of times ChatGPT attributed the same issue to the narrative. If ChatGPT agreed perfectly with the human annotators, all numbers except the numbers on the diagonal would be 0.
We note that ChatGPT agrees with human annotators more than 40% of the time on the following classes:
Communication Breakdown,
Fatigue,
Situational Awareness, and
Training/Qualification. It agrees substantially less for the other classes. The most common source of disagreement is when ChatGPT chooses not to attribute any human factors issues (see the rightmost column of the multilabel confusion matrix in
Figure 6). In addition, ChatGPT attributes many more incidents than humans do to the factors Communication Breakdown,
Distraction,
Situational Awareness, and
Training/Qualification.
To provide a more comprehensive assessment,
Table 7 provides precision, recall, and F1 score measuring ChatGPT agreement with humans for each issue. Generally speaking, when ChatGPT claims a human factor issue applies, human annotators agree 61% of the time (i.e., precision). The most common source of disagreement is that ChatGPT is generally more conservative in assigning human factors issues than humans and, hence, has lower recall values for most issues. However, it is crucial to note that there can be significant variations in annotations performed by safety analysts, as demonstrated in [
36].
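As an illustration of how such per-issue agreement metrics can be computed, the sketch below uses scikit-learn on a toy multi-label example; the label assignments shown are invented for illustration and do not come from the ASRS records.

```python
# Sketch: per-issue precision, recall, and F1 for multi-label agreement between
# analyst-assigned and ChatGPT-assigned human factors issues. Toy labels only.
from sklearn.metrics import precision_recall_fscore_support
from sklearn.preprocessing import MultiLabelBinarizer

issues = ["Communication Breakdown", "Confusion", "Distraction", "Fatigue",
          "Situational Awareness", "Training / Qualification"]

analyst_labels = [{"Situational Awareness"}, {"Fatigue", "Distraction"}, {"Confusion"}]
chatgpt_labels = [{"Situational Awareness"}, {"Fatigue"}, set()]

mlb = MultiLabelBinarizer(classes=issues)
y_true = mlb.fit_transform(analyst_labels)   # analysts as reference labels
y_pred = mlb.transform(chatgpt_labels)       # ChatGPT as predictions

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0)
for issue, p, r, f in zip(issues, precision, recall, f1):
    print(f"{issue}: precision={p:.2f}, recall={r:.2f}, F1={f:.2f}")
```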
4.3. Attribution of Fault
ChatGPT was further employed to identify the party to which each incident could be attributed. It is important to emphasize that the intent is not to impose punitive measures on the implicated parties, but rather to leverage these specific incidents as instructive examples during training.
The outputs from ChatGPT did not rigidly follow the options provided in the prompt; the model sometimes generated responsible parties directly from the narrative. Moreover, in scenarios where responsibility could be apportioned to several parties, the model demonstrated the capability to do so.
Table 8 provides the top five parties responsible for incidents that can be attributed to human factors.
In the context of this task, the term ‘flight crew’ encompassed all individuals aboard the flight, barring the passengers. A total of 5744 incidents were attributed to errors by the flight crew. Similarly, the model adeptly pinpointed incidents resulting from Air Traffic Control (ATC) errors with considerable specificity, including those from ATC (Approach Control), ATC (Ground Control), ATC (Kona Tower), ATC (Military Controller), ATC (TRACON Controller), ATC (Indy Center), and so forth. These were combined into a single ATC category by the authors.
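The grouping of fine-grained ATC attributions into a single category can be sketched as follows; the mapping rule and example values are illustrative.

```python
# Sketch: collapse fine-grained attributions such as "ATC (Ground Control)" or
# "ATC (TRACON Controller)" into a single ATC category before counting.
import re
import pandas as pd

def normalize_party(party: str) -> str:
    if re.match(r"^ATC\b", party.strip()):
        return "ATC"
    return party.strip()

attributions = pd.Series(["Flight Crew", "ATC (Approach Control)",
                          "ATC (Ground Control)", "Maintenance"])
print(attributions.map(normalize_party).value_counts())
```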
To comprehend the reason behind ChatGPT associating a particular incident to a specific party, it was encouraged to produce the underlying rationale for each attribution.
Table 9 provides some examples to facilitate a deeper understanding of the principles governing attribution rationale. Upon a detailed review of certain incident accounts, it was noted that the entity to which ChatGPT attributed the incident, and the logic underlying such attribution, seemed to indicate a high level of accuracy in the model’s responses and their justifications. For example, in the first case presented in
Table 9, the lead Flight Attendant, overly concerned with a passenger’s non-compliance to mask policies, violated sterile cockpit procedures by alerting the flight crew via a chime during landing rollout to request a supervisor, a gesture typically associated with emergencies or critical aircraft issues. Consequently, the attribution of the incident to the flight crew, as made by ChatGPT, is appropriate. Furthermore, the reasoning behind such attribution is well-founded.
Asking the model for a rationale behind a certain decision provides a degree of transparency in the identification process, promoting trust in the conclusions drawn. Furthermore, the rationale can help in error identification and subsequent model optimization, thereby improving the precision of incident responsibility attribution. In regulatory contexts, the rationale supports auditability by providing a clear trail of how conclusions were reached, which is essential for accountability. Moreover, it allows for an assessment of any potential bias in assigning responsibility, leading to fairer and more balanced conclusions.
5. Conclusions
The primary objective of this study was to evaluate the potential utility of generative language models, based on the current state-of-the-art generative LM (ChatGPT), for aviation safety analysis. ChatGPT was deployed to generate incident synopses derived from the narratives. These generated synopses were then compared with the ground truth synopses found in the Synopsis column of the ASRS dataset, utilizing embeddings generated by LLMs. During manual evaluation, it was observed that synopsis pairs with higher cosine similarity tended to be of similar length, whereas pairs with lower cosine similarity displayed more noticeable discrepancies in their respective lengths.
Subsequently, a comparison was conducted between the human factor issues linked to an incident by safety analysts and those identified by ChatGPT on the basis of the narrative. In general, there was a 61% concurrence between decisions made by ChatGPT and those made by safety analysts. ChatGPT demonstrated a more cautious approach in assigning human factor issues than the human evaluators. This may be ascribed to its inability to infer causes that extend beyond the explicit content of the narrative, given that no other columns were provided as inputs to the model.
Lastly, ChatGPT was employed to determine the party to which each incident could be attributed. As there was no dedicated column serving as ground truth for this specific task, a manual inspection was undertaken on a limited subset of the data. ChatGPT attributed 5877, 1971, 805, and 738 incidents to the Flight Crew, ATC, Ground Personnel, and Maintenance, respectively. The rationale and underlying logic provided by ChatGPT for its attributions were well-founded. Nonetheless, due to the sheer volume of incidents used in this study, a manual examination of every individual incident record was not conducted.
The aforementioned results lead to the inference that the application of generative language models for aviation safety purposes presents considerable potential. However, it is suggested that these models be utilized in the capacity of “co-pilots” or assistants to aviation safety analysts, as opposed to being solely used in an automated way for safety analysis purposes. Implementing models in such a manner would serve to alleviate the analysts’ workload while simultaneously enhancing the efficiency of their work.
Future work in this area should primarily focus on broadening and validating the application of generative language models for aviation safety analysis. More specifically, fine-tuning models like ChatGPT on domain-specific data could enhance their understanding of the field’s nuances, improving the generation of incident synopses, the identification of human factors, and the attribution of fault. To reiterate, the identification of responsibility will aid in enhancing the training of a specific group (such as pilots, maintenance personnel, and so forth), as opposed to imposing punitive measures. The significant positive correlations between synopsis length and cosine similarity suggest future experiments to isolate and account for this bias.
Expanding the scope of incident attribution to include other parties such as passengers, weather conditions, or technical failures could further refine the application of these models. To solidify the ground truth for incident attribution, future research should scale manual inspection or employ other methods on a larger dataset. Following the suggestion to utilize these models as “co-pilots,” the development of human-AI teaming approaches is another promising avenue. By designing interactive systems where human analysts can guide and refine model outputs, or systems providing explanatory insights for aiding human decision-making, both efficiency and accuracy could be enhanced. Finally, assessing the generalizability of these models across other safety-critical sectors such as space travel, maritime, or nuclear industries would further solidify their wider applicability.
Author Contributions
Conceptualization, A.T.R., A.P.B., O.J.P.F., and D.N.M.; methodology, A.T.R., A.P.B., R.T.W., and V.M.N.; software, A.T.R., R.T.W., and V.M.N.; validation, A.T.R. and R.T.W.; formal analysis, A.T.R. and A.P.B.; investigation, A.T.R., A.P.B., and R.T.W.; data curation, A.T.R. and A.P.B.; writing—original draft preparation, A.T.R. and R.T.W.; writing—review and editing, A.T.R., A.P.B., O.J.P.F., R.T.W., V.M.N., and D.N.M.; visualization, A.T.R. and R.T.W. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
AI | Artificial Intelligence
ASRS | Aviation Safety Reporting System
BERT | Bidirectional Encoder Representations from Transformers
CSV | Comma-Separated Values
FAA | Federal Aviation Administration
GPT | Generative Pre-trained Transformer
JSON | JavaScript Object Notation
LaMDA | Language Models for Dialog Applications
LLaMA | Large Language Model Meta AI
LLM | Large Language Model
MLM | Masked Language Modeling
NAS | National Airspace System
NASA | National Aeronautics and Space Administration
NLP | Natural Language Processing
NSP | Next Sentence Prediction
PaLM | Pathways Language Model
PPO | Proximal Policy Optimization
RL | Reinforcement Learning
RLHF | Reinforcement Learning from Human Feedback
RM | Reward Model
SFT | Supervised Fine-Tuning
UAS | Unmanned Aerial Systems
Appendix A
The prompt used for this work is presented below.
Listing 1. Prompt used for this work.
References
- ASRS Program Briefing. https://asrs.arc.nasa.gov/overview/summary.html (accessed: 05.16.2023).
- Andrade, S.R.; Walsh, H.S., SafeAeroBERT: Towards a Safety-Informed Aerospace-Specific Language Model. In AIAA AVIATION 2023 Forum. [CrossRef]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; others. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 2022, 35, 27730–27744. [Google Scholar]
- Tikayat Ray, A.; Bhat, A.P.; White, R.T.; Nguyen, V.M.; Pinon Fischer, O.J.; Mavris, D.N. ASRS-ChatGPT Dataset. [CrossRef]
- Electronic Report Submission (ERS). https://asrs.arc.nasa.gov/report/electronic.html (accessed: 05.16.2023).
- General Form. https://akama.arc.nasa.gov/asrs_ers/general.html (accessed: 05.16.2023).
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, Minnesota, 2019; pp. 4171–4186. [CrossRef]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I.; others. Improving language understanding by generative pre-training 2018.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; others. Language models are unsupervised multitask learners. OpenAI blog 2019, 1, 9. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; others. Language models are few-shot learners. Advances in neural information processing systems 2020, 33, 1877–1901. [Google Scholar]
- OpenAI. GPT-4 Technical Report, 2023. arXiv:cs.CL/2303.08774. [CrossRef]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; others. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
- Thoppilan, R.; De Freitas, D.; Hall, J.; Shazeer, N.; Kulshreshtha, A.; Cheng, H.T.; Jin, A.; Bos, T.; Baker, L.; Du, Y.; others. Lamda: Language models for dialog applications. arXiv 2022, arXiv:2201.08239. [Google Scholar] [CrossRef]
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; others. Palm: Scaling language modeling with pathways. arXiv 2022, arXiv:2204.02311. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. Advances in Neural Information Processing Systems; Guyon, I.; Luxburg, U.V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; Garnett, R., Eds. Curran Associates, Inc., 2017, Vol. 30.
- Hinton, G.E.; Osindero, S.; Teh, Y.W. A fast learning algorithm for deep belief nets. Neural computation 2006, 18, 1527–1554. [Google Scholar] [CrossRef]
- Tikayat Ray, A.; Pinon Fischer, O.J.; Mavris, D.N.; White, R.T.; Cole, B.F. aeroBERT-NER: Named-Entity Recognition for Aerospace Requirements Engineering using BERT. In AIAA SCITECH 2023 Forum. [CrossRef]
- Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; Brynjolfsson, E.; Buch, S.; Card, D.; Castellon, R.; Chatterji, N.S.; Chen, A.S.; Creel, K.A.; Davis, J.; Demszky, D.; Donahue, C.; Doumbouya, M.; Durmus, E.; Ermon, S.; Etchemendy, J.; Ethayarajh, K.; Fei-Fei, L.; Finn, C.; Gale, T.; Gillespie, L.E.; Goel, K.; Goodman, N.D.; Grossman, S.; Guha, N.; Hashimoto, T.; Henderson, P.; Hewitt, J.; Ho, D.E.; Hong, J.; Hsu, K.; Huang, J.; Icard, T.F.; Jain, S.; Jurafsky, D.; Kalluri, P.; Karamcheti, S.; Keeling, G.; Khani, F.; Khattab, O.; Koh, P.W.; Krass, M.S.; Krishna, R.; Kuditipudi, R.; Kumar, A.; Ladhak, F.; Lee, M.; Lee, T.; Leskovec, J.; Levent, I.; Li, X.L.; Li, X.; Ma, T.; Malik, A.; Manning, C.D.; Mirchandani, S.P.; Mitchell, E.; Munyikwa, Z.; Nair, S.; Narayan, A.; Narayanan, D.; Newman, B.; Nie, A.; Niebles, J.C.; Nilforoshan, H.; Nyarko, J.F.; Ogut, G.; Orr, L.; Papadimitriou, I.; Park, J.S.; Piech, C.; Portelance, E.; Potts, C.; Raghunathan, A.; Reich, R.; Ren, H.; Rong, F.; Roohani, Y.H.; Ruiz, C.; Ryan, J.; R’e, C.; Sadigh, D.; Sagawa, S.; Santhanam, K.; Shih, A.; Srinivasan, K.P.; Tamkin, A.; Taori, R.; Thomas, A.W.; Tramèr, F.; Wang, R.E.; Wang, W.; Wu, B.; Wu, J.; Wu, Y.; Xie, S.M.; Yasunaga, M.; You, J.; Zaharia, M.A.; Zhang, M.; Zhang, T.; Zhang, X.; Zhang, Y.; Zheng, L.; Zhou, K.; Liang, P. On the Opportunities and Risks of Foundation Models. ArXiv 2021. [Google Scholar] [CrossRef]
- Tikayat Ray, A.; Cole, B.F.; Pinon Fischer, O.J.; White, R.T.; Mavris, D.N. aeroBERT-Classifier: Classification of Aerospace Requirements Using BERT. Aerospace 2023, 10. [Google Scholar] [CrossRef]
- Tikayat Ray, A.; Cole, B.F.; Pinon Fischer, O.J.; Bhat, A.P.; White, R.T.; Mavris, D.N. Agile Methodology for the Standardization of Engineering Requirements using Large Language Models; 2023. [CrossRef]
- Tikayat Ray, A. Standardization of Engineering Requirements Using Large Language Models. PhD thesis, Georgia Institute of Technology, 2023. [CrossRef]
- Weaver, W. Translation. In Machine Translation of Languages; Locke, W.N., Boothe, A.D., Eds.; MIT Press: Cambridge, MA, 1949/1955; pp. 15–23. Reprinted from a memorandum written by Weaver in 1949. [Google Scholar]
- Brown, P.F.; Cocke, J.; Della Pietra, S.A.; Della Pietra, V.J.; Jelinek, F.; Lafferty, J.; Mercer, R.L.; Roossin, P.S. A statistical approach to machine translation. Computational linguistics 1990, 16, 79–85. [Google Scholar] [CrossRef]
- Bengio, Y.; Ducharme, R.; Vincent, P. A Neural Probabilistic Language Model. Advances in Neural Information Processing Systems; Leen, T.; Dietterich, T.; Tresp, V., Eds. MIT Press, 2000, Vol. 13.
- Gehman, S.; Gururangan, S.; Sap, M.; Choi, Y.; Smith, N.A. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. [CrossRef]
- Ziegler, D.M.; Stiennon, N.; Wu, J.; Brown, T.B.; Radford, A.; Amodei, D.; Christiano, P.; Irving, G. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593 2019. [Google Scholar] [CrossRef]
- Stiennon, N.; Ouyang, L.; Wu, J.; Ziegler, D.; Lowe, R.; Voss, C.; Radford, A.; Amodei, D.; Christiano, P.F. Learning to summarize with human feedback. Advances in Neural Information Processing Systems 2020, 33, 3008–3021. [Google Scholar] [CrossRef]
- Graeber, C. The role of human factors in improving aviation safety. Aero Boeing 1999, 8. [Google Scholar]
- Santos, L.; Melicio, R. Stress, Pressure and Fatigue on Aircraft Maintenance Personal. International Review of Aerospace Engineering 2019, 12, 35–45. [Google Scholar] [CrossRef]
- Saleh, J.H.; Tikayat Ray, A.; Zhang, K.S.; Churchwell, J.S. Maintenance and inspection as risk factors in helicopter accidents: Analysis and recommendations. PloS one 2019, 14, e0211424. [Google Scholar] [CrossRef] [PubMed]
- Dumitru, I.M.; Boşcoianu, M. Human factors contribution to aviation safety. International Scientific Committee 2015, 49. [Google Scholar]
- Hobbs, A. Human factors: the last frontier of aviation safety? The International Journal of Aviation Psychology 2004, 14, 331–345. [Google Scholar] [CrossRef]
- Salas, E.; Maurino, D.; Curtis, M. Human factors in aviation: an overview. Human factors in aviation 2010, pp. 3–19. [CrossRef]
- Cardosi, K.; Lennertz, T.; others. Human factors considerations for the integration of unmanned aerial vehicles in the national airspace system: An analysis of reports submitted to the aviation safety reporting system (ASRS) 2017.
- Madeira, T.; Melício, R.; Valério, D.; Santos, L. Machine learning and natural language processing for prediction of human factors in aviation incident reports. Aerospace 2021, 8, 47. [Google Scholar] [CrossRef]
- Boesser, C.T. Comparing human and machine learning classification of human factors in incident reports from aviation 2020.
- Kierszbaum, S.; Klein, T.; Lapasset, L. ASRS-CMFS vs. RoBERTa: Comparing Two Pre-Trained Language Models to Predict Anomalies in Aviation Occurrence Reports with a Low Volume of In-Domain Data Available. Aerospace 2022, 9. [Google Scholar] [CrossRef]
- OpenAI. ChatGPT API. https://openai.com/blog/introducing-chatgpt-and-whisper-apis, 2023. gpt-3.5-turbo.
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Association for Computational Linguistics: Hong Kong, China, 2019; pp. 3982–3992. [CrossRef]
- Heydarian, M.; Doyle, T.E.; Samavi, R. MLCM: Multi-Label Confusion Matrix. IEEE Access 2022, 10, 19083–19095. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).