3.1. Architectural Perspective
The majority of AFV systems broadly adopt a three-stage pipeline architecture similar to that of the Fact Extraction and VERification (FEVER) shared task [24], as identified and commented on by many researchers [24,25,26,27,28,29,30]. These three stages (also called sub-tasks) are document retrieval (evidence retrieval), sentence selection (evidence selection), and Recognizing Textual Entailment, or RTE (label/veracity prediction). The document-retrieval component is responsible for gathering relevant documents from a knowledge base, such as Wikipedia, based on a given query. The sentence-selection component then selects the most pertinent evidence sentences from the retrieved documents. Lastly, the RTE component predicts the entailment relationship between the query and the retrieved evidence. While this framework is generally followed in AFV, alternative approaches incorporate additional distinct components to identify credible claims and provide justifications for label predictions, as shown in Figure 3. The inclusion of a justification component in such alternative approaches contributes to the system’s capacity for explainability within the AFV paradigm.
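To make this pipeline concrete, the following minimal sketch wires the three stages together in Python. It is an illustrative stand-in rather than the implementation of any surveyed system: TF-IDF similarity substitutes for a production retriever and sentence selector, and the entailment stage is stubbed where a trained transformer NLI model would normally sit.

```python
# Minimal sketch of the three-stage FEVER-style AFV pipeline described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def retrieve_documents(claim, corpus, k=3):
    """Stage 1: document retrieval -- rank knowledge-base documents by similarity to the claim."""
    vec = TfidfVectorizer().fit(corpus + [claim])
    scores = cosine_similarity(vec.transform([claim]), vec.transform(corpus))[0]
    return [corpus[i] for i in scores.argsort()[::-1][:k]]


def select_sentences(claim, documents, k=5):
    """Stage 2: sentence selection -- pick the most claim-relevant sentences from the retrieved documents."""
    sentences = [s.strip() for d in documents for s in d.split(".") if s.strip()]
    vec = TfidfVectorizer().fit(sentences + [claim])
    scores = cosine_similarity(vec.transform([claim]), vec.transform(sentences))[0]
    return [sentences[i] for i in scores.argsort()[::-1][:k]]


def predict_label(claim, evidence):
    """Stage 3: RTE / veracity prediction -- stubbed; real systems use a trained entailment model."""
    # A real system would score each (claim, evidence) pair with an NLI classifier
    # and aggregate the results into SUPPORTED / REFUTED / NOT ENOUGH INFO.
    return "NOT ENOUGH INFO"


def verify(claim, corpus):
    documents = retrieve_documents(claim, corpus)
    evidence = select_sentences(claim, documents)
    return predict_label(claim, evidence), evidence
```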
The majority of AFV systems are highly dependent on deep neural networks (DNNs) for the label-prediction task [4]. Furthermore, in recent years, deep learning-based approaches have demonstrated exceptional performance in detecting fake news [31]. As mentioned in Section 2, however, there is an inherent conflict between the performance of AI models and their ability to explain how they make decisions. Although existing AFV systems lack inherent explainability [4], it would be unwise to overlook the potential of these less interpretable deep models for AFV, as they are able to achieve state-of-the-art results with a remarkable level of prediction accuracy. This also indicates, however, that model-based interpretation approaches may not be a suitable solution for AFV systems, because such methods require simple and transparent AI models that can be easily understood and interpreted.
Therefore, considering the architectural characteristics of state-of-the-art AFV systems, a potential trade-off solution for achieving explainability may involve incorporating post hoc measures of explainability, either at the prediction level or at the dataset level, while still leveraging the capabilities of less interpretable deep transformer models. The subsequent subsections delve into the attempts made in the literature to incorporate post hoc explainability, in terms of both methods and input, within the context of AFV.
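As one concrete, hedged illustration of the prediction-level post hoc route, the sketch below applies LIME to a black-box transformer verdict classifier. The checkpoint name is a placeholder, the label set is an assumed FEVER-style one, and the alphabetical label ordering is only for illustration; this is a sketch of the general idea, not a method taken from the surveyed systems.

```python
# Prediction-level post hoc explanation of a black-box verdict model with LIME (sketch).
import numpy as np
from transformers import pipeline
from lime.lime_text import LimeTextExplainer

LABELS = ["NOT ENOUGH INFO", "REFUTED", "SUPPORTED"]   # assumed label set, alphabetical order
clf = pipeline("text-classification",
               model="a-fever-finetuned-checkpoint",   # placeholder model name
               top_k=None)                             # return scores for all classes


def predict_proba(texts):
    """Wrap the pipeline so LIME receives an (n_samples, n_classes) probability matrix."""
    outputs = clf(list(texts))
    # Sort each prediction alphabetically by label so columns align with LABELS (assumed here).
    return np.array([[d["score"] for d in sorted(o, key=lambda d: d["label"])]
                     for o in outputs])


explainer = LimeTextExplainer(class_names=LABELS)
pair = "claim: ... [SEP] evidence: ..."                # claim-evidence pair fed to the verdict model
explanation = explainer.explain_instance(pair, predict_proba, num_features=8)
print(explanation.as_list())                           # tokens that most influenced the verdict
```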
3.3. Data Perspective
The potential of data explainability lies in its ability to provide deep insights that enhance the explainability of AI systems (which rely heavily on data for knowledge acquisition) [6,13]. Data explainability methods encompass a collection of techniques aimed at better comprehending the datasets used in the training and design of AI models [13]. The importance of a training dataset in shaping the behavior of AI models highlights the need to achieve a high level of data explainability; constructing a high-performing and explainable model requires a high-quality training dataset. In AFV, the nature of this dataset, also known as the source of evidence, has evolved over time. Initially, the evidence was primarily based on claims, where information directly related to the claim was used for verification. Subsequently, knowledge-base-based approaches were introduced, utilizing structured knowledge sources to support the verification process. Further advances led to the adoption of text-based evidence, where relevant textual sources were used for verification. In recent developments, there has been a shift towards dynamically retrieved sentences, where the system dynamically retrieves and selects the sentences most relevant to the claim for verification purposes. We will explore these changes through the lens of explainability.
Systems such as [40], which process the claim itself and use no other source of information as evidence, can be termed ‘knowledge-free’ or ‘retrieval-free’ systems. In these systems, the linguistic characteristics of the claim are considered the deciding factor; for example, claims that contain a misleading phrase are labeled ‘Mostly False’. The authors of [41] employ a similar approach focused on linguistic patterns but adopt a hybrid methodology, supplying claim-related metadata alongside the input text to the deep learning model. These additional data include information such as the claim reporter’s profile and the media source where the claim was published. Knowledge-free systems face limitations in their performance, as they depend only on the information inherent in the claim and do not consider the current state of affairs [42]. The absence of contextual understanding and the inability to incorporate external information make dataset-level explainability infeasible in these systems.
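A minimal sketch of such a knowledge-free setup is shown below: only the claim text and a few illustrative metadata fields are featurized and passed to a simple classifier. The field names, labels, and model choice are assumptions for illustration and do not reproduce the configurations of the cited systems.

```python
# Knowledge-free (retrieval-free) verification sketch: claim text + metadata only, no external evidence.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({
    "claim":   ["Claim text one ...", "Claim text two ..."],
    "speaker": ["politician_a", "blogger_b"],        # claim reporter's profile (illustrative)
    "source":  ["tv_interview", "social_media"],     # medium where the claim appeared (illustrative)
    "label":   ["mostly-false", "true"],
})

features = ColumnTransformer([
    ("text", TfidfVectorizer(ngram_range=(1, 2)), "claim"),                   # linguistic patterns
    ("meta", OneHotEncoder(handle_unknown="ignore"), ["speaker", "source"]),  # claim-related metadata
])
model = Pipeline([("features", features), ("clf", LogisticRegression(max_iter=1000))])
model.fit(train[["claim", "speaker", "source"]], train["label"])
print(model.predict(pd.DataFrame({"claim": ["A new claim ..."],
                                  "speaker": ["politician_a"],
                                  "source": ["social_media"]})))
```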
In knowledge-base-based fact-verification systems [43,44,45], a claim is verified against the RDF triples present in a knowledge graph. The veracity of the claim is calculated by assessing the discrepancy between the claim and the triples using rule-based, subgraph-based, or embedding-based approaches. The drawback of such systems is that a claim may be judged false simply because its supporting facts are absent from the graph; the underlying assumption that the supporting facts of every true claim are already present in the graph does not always hold. This limited scalability and the inability to capture nuanced information hinder the achievement of explainability in this type of fact-verification model.
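The simplest, rule-style variant of this idea is sketched below with a toy RDF graph; the namespace and facts are hypothetical, and real systems rely on rule-, subgraph-, or embedding-based scoring rather than exact triple lookup.

```python
# Knowledge-base-based verification sketch: check a claim triple against a toy RDF graph.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")          # hypothetical namespace
kg = Graph()
kg.add((EX.Barack_Obama, EX.bornIn, EX.Honolulu))
kg.add((EX.Honolulu, EX.locatedIn, EX.Hawaii))


def verify_triple(subject, predicate, obj):
    """Return 'SUPPORTED' if the triple is in the graph, otherwise 'NOT FOUND'.

    Note the closed-world pitfall discussed above: the absence of a fact from
    the graph does not prove that the claim is false."""
    return "SUPPORTED" if (subject, predicate, obj) in kg else "NOT FOUND"


print(verify_triple(EX.Barack_Obama, EX.bornIn, EX.Honolulu))   # SUPPORTED
print(verify_triple(EX.Barack_Obama, EX.bornIn, EX.Chicago))    # NOT FOUND
```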
Unlike the two approaches above, the evidence-retrieval approach requires supporting pieces of evidence for the claim verdict to be fetched from a relevant source using an information-retrieval method. While the benefits of such systems outweigh the limitations of the static approaches mentioned earlier, certain significant constraints can also affect the explainability of these models. The quality of the source (biased or unreliable content), the availability of the source (geographical or language restrictions), and the resources required for retrieval (time-consuming and expensive human and computational effort) can significantly impact evidence retrieval and limit the scope of the evidence; moreover, a deep understanding of the claim’s context is critical to avoid misinterpreted and incomplete evidence, which leads to erroneous verdicts. These limitations suggest that the evidence-retrieval approach might not be entirely consistent with key XAI principles such as ‘Accuracy’ and ‘Fidelity’, which in turn casts doubt on the effectiveness of any post hoc explainability measures attempted within this data aspect.
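As a small illustration of how dynamically retrieved evidence can at least carry its provenance (which bears directly on ‘Fidelity’), the sketch below ranks a toy evidence pool with BM25 and returns each sentence together with its source metadata; the pool, field names, and scoring choice are assumptions made for illustration.

```python
# Evidence retrieval sketch with provenance tracking (BM25 ranking over a toy pool).
from rank_bm25 import BM25Okapi

evidence_pool = [
    {"sentence": "Sentence about topic A ...", "source": "news-site.example",    "date": "2023-01-10"},
    {"sentence": "Sentence about topic B ...", "source": "encyclopedia.example", "date": "2022-07-02"},
]
bm25 = BM25Okapi([e["sentence"].lower().split() for e in evidence_pool])


def retrieve_evidence(claim, k=2):
    scores = bm25.get_scores(claim.lower().split())
    ranked = sorted(zip(scores, evidence_pool), key=lambda pair: pair[0], reverse=True)[:k]
    # Keep the provenance of every retrieved sentence so a verdict can be traced
    # back to a concrete, inspectable source.
    return [{"score": float(score), **evidence} for score, evidence in ranked]


print(retrieve_evidence("claim about topic A"))
```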
An alternative approach is to use text from verified sources of information as evidence; encyclopedia articles, journals, Wikipedia, and fact-checked databases are some examples. Since Wikipedia is an open-source, web-based encyclopedia containing articles on a wide range of topics, it is consistently considered an important source of information for many applications, including economic development [46], education [47], data mining [48], and AFV. For example, the FEVER task [24], an application in AFV, relies on the retrieval of evidence from Wikipedia pages. In the FEVER dataset, each SUPPORTED/REFUTED claim is annotated with evidence from Wikipedia. This evidence can be a single sentence, multiple sentences, or a composition of evidence from multiple sentences, sourced from the same page or from multiple pages of Wikipedia. This approach aligns well with the XAI principle of ‘Interpretability’, as Wikipedia is a widely accessible and easily understandable source of information. However, Wikipedia also comes with limitations that can affect the ‘Accuracy’ and ‘Fidelity’ principles of XAI and, in turn, the interpretability of models relying on Wikipedia as a primary data source. Firstly, like any other source, Wikipedia pages can contain biased and inaccurate content, and these issues (as well as outdated information) can remain undetected for long periods; this compromises the ‘Accuracy’ of any AFV model trained on these data. Secondly, despite covering a wide range of topics, Wikipedia suffers deficiencies in comprehensiveness, limiting a model’s ability to fully understand contextual information and thereby affecting ‘Interpretability’. Lastly, models trained predominantly on Wikipedia’s textual content can develop biases and limitations inherent to the nature and scope of Wikipedia’s content, impacting both ‘Fidelity’ and ‘Interpretability’ when applied to diverse real-world scenarios and varied types of unstructured data.
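For orientation, the snippet below gives a simplified, illustrative representation of how a FEVER-style instance ties a claim to sentence-level Wikipedia evidence (it is not the exact released JSONL schema); it is this sentence-level annotation that underpins the ‘Interpretability’ point above.

```python
# Simplified, illustrative FEVER-style instance (not the exact released schema).
fever_style_example = {
    "id": 12345,                          # illustrative identifier
    "claim": "Some claim about a public figure.",
    "label": "SUPPORTED",                 # SUPPORTED / REFUTED / NOT ENOUGH INFO
    "evidence": [
        # Each evidence set is a list of (Wikipedia page title, sentence index) pairs;
        # a set may draw on one page or compose sentences from several pages.
        [("Public_figure_page", 3)],
        [("Public_figure_page", 3), ("Related_page", 7)],
    ],
}
```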
Given these considerations and their misalignment with the XAI objectives of ‘Interpretability’, ‘Accuracy’, and ‘Fidelity’, it becomes evident that relying solely on Wikipedia as a training dataset may not be the most effective pathway toward explainable AFV.
Alternatively, Wikipedia can be used as an elementary corpus to train the AI model towards a general understanding of various knowledge domains for AFV, and this background or prior knowledge can then be harnessed further with additional domain data to gain deeper context (which helps the model capture global relationships and thus increases explainability). Being the largest Wikipedia-based benchmark dataset for fact verification [26,49], the FEVER dataset can unarguably be considered this elementary corpus for AFV tasks, and transformers with transfer learning are the most pragmatic technology choice for AFV according to state-of-the-art systems [29,30,50].
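A hedged sketch of this two-stage transfer-learning recipe is shown below: a generic pre-trained encoder is first fine-tuned on FEVER-style claim-evidence pairs as the elementary corpus and then fine-tuned further on a smaller domain corpus. The checkpoint, hyperparameters, and toy data are placeholders rather than settings reported by the cited systems.

```python
# Two-stage fine-tuning sketch: elementary (FEVER-style) corpus first, then additional domain data.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"                     # any general-purpose pre-trained encoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)


def encode(batch):
    # Claims and evidence are paired, as in the FEVER RTE sub-task.
    return tokenizer(batch["claim"], batch["evidence"], truncation=True, padding="max_length")


def finetune(raw_data, output_dir):
    dataset = Dataset.from_dict(raw_data).map(encode, batched=True)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                             per_device_train_batch_size=8)
    Trainer(model=model, args=args, train_dataset=dataset).train()


# Stage 1: elementary corpus (FEVER-style pairs); Stage 2: additional domain data (toy placeholders).
fever_like = {"claim": ["..."], "evidence": ["..."], "label": [0]}
domain_like = {"claim": ["..."], "evidence": ["..."], "label": [1]}
finetune(fever_like, "out/fever")
finetune(domain_like, "out/domain")
```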
The quality of the dataset we use or create for the application is a major factor in determining the explainability of a transformer-based AFV model and its ability to comprehend the underlying context. For example, the authors of [51] developed the SCIFACT dataset in order to extend the ideas of FEVER to COVID-19 applications. SCIFACT comprises 1.4K expert-written scientific claims along with 5K+ abstracts (from different scientific articles) that either support or refute each claim, and it is annotated with rationales, each consisting of a minimal collection of sentences from the abstract that imply the claim. The study demonstrated the clear advantages of using such a domain-specific dataset (it can also be called a subdomain here, since scientific claim verification is a subtask of claim verification) as opposed to using only a Wikipedia-based evidence dataset. The authors of [51] argue that the inclusion of rationales in the training dataset "facilitates the development of interpretable models" that not only predict labels but also identify the specific sentences necessary to support their decisions. However, the limited scale of the dataset, consisting of only 1.4K claims, necessitates caution in interpreting assessments of system performance and underscores the need for more expansive datasets to propel advancements in explainable fact-checking research.
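To illustrate what rationale-level annotation adds, the snippet below shows a simplified, SCIFACT-inspired instance (the field names are illustrative and not the released schema): the rationale indices point to the minimal set of abstract sentences that justify the label, which is precisely what makes the resulting models more interpretable.

```python
# Simplified, SCIFACT-inspired rationale-annotated instance (illustrative field names).
scifact_style_example = {
    "claim": "Drug X reduces the severity of condition Y.",
    "abstract_id": 42,                    # illustrative identifier of a scientific abstract
    "label": "SUPPORTS",
    "rationale_sentences": [2, 5],        # indices of the abstract sentences that imply the claim
}
```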
Building on this perspective of improving the quality and diversity of the dataset, the authors of [52] critically evaluated the FEVER corpus, emphasizing its reliance on synthetic claims from Wikipedia and advocating for a corpus that incorporates natural claims from a variety of web sources. In response to this identified need, they introduced a new, mixed-domain corpus covering domains such as blogs, news, and social media, the media often responsible for the spread of unreliable information. This corpus, which encompasses 6,422 validated claims and over 14,000 documents annotated with evidence, addresses prevalent limitations of existing corpora, including restricted size, lack of detailed annotations, and domain confinement. However, through meticulous error analysis, the authors of [52] discovered inherent challenges and biases in claim classification, attributed to the heterogeneous nature of the data and the incorporation of Fine-Grained Evidence (FGE) from unreliable sources. We infer that these findings illustrate substantial barriers to realizing the fundamental goals of XAI, particularly accuracy and fidelity. Moreover, their focus on diligently modeling meta-information related to evidence and claims can be understood as an implicit recognition of the crucial role of explainability in automated fact-checking. By suggesting the integration of diverse forms of contextual information and reliability assessments of sources, they highlight the necessity of developing models that are not only more accurate but also capable of providing reasoned and understandable decisions, a pivotal step towards fostering explainability in automated fact-checking systems.
Table 2 offers a comprehensive categorization of the datasets used in fact-verification systems, covering a variety of dataset types, each with distinctive attributes and challenges. The datasets are categorized meticulously based on their inherent nature and source: ‘Knowledge-free Systems’, ‘Knowledge-Base-Based’, ‘Wikipedia-Based’, ‘Domain(Single)-Specific-Corpus’, and ‘Mixed-domain-Corpus (non-Wikipedia-based)’. Each type is represented with illustrative studies and remarks to provide insight into the inherent limitations or challenges in relation to enhancing explainability in AFV systems. The categorization is enriched with subclassifications under ‘Knowledge Type’, ‘Text Type’, and ‘Domain Type’. ‘Knowledge-free Systems’ are denoted with dashes (-) under ‘Text Type’ and ‘Domain Type’, indicating the inherent absence of these attributes. This underscores the retrieval-free nature of such systems, which predominantly rely on the intrinsic linguistic features of the claims, thus lacking contextual understanding and making dataset-level explainability infeasible. The ‘Knowledge-Base-Based’ type can be either single-domain or multi-domain, represented by checkmarks in both subcategories under ‘Domain Type’. This illustrates the versatility of knowledge-based systems in utilizing structured information from a specialized domain or amalgamating insights from multiple domains. The ability to cater to varied domains accentuates the expansive applicability of such systems, though it also brings forth challenges related to scalability and capturing nuanced information. ‘Wikipedia-Based’ datasets, inherently multi-domain, are highlighted separately to focus on the specific challenges of using Wikipedia as the main information source, such as dealing with potential biases and inaccuracies. The ‘Domain(Single)-Specific-Corpus’ type is distinguished by its focus on a specialized or singular domain, providing depth and specificity. While this focus allows for a detailed exploration of a particular domain, it also imposes limitations due to the restricted scope and potential biases inherent to the selected domain, thereby affecting the overall evaluation and applicability of the system. Additionally, the ‘Mixed-domain-Corpus (non-Wikipedia-based)’ type emphasizes the inclusion of diverse domains, especially those not solely reliant on Wikipedia, addressing the challenges arising from data heterogeneity and reliability.
The categorization in Table 2, coupled with associated remarks, is intended to act as a resource, providing information on the various challenges and possibilities to improve explainability within AFV systems. This categorization can guide researchers and practitioners in making informed decisions regarding dataset selection and utilization, providing a clearer understanding of the implications and limitations of different dataset types in the context of Automated Fact Verification.
We acknowledge the extensive investigations conducted by [32] in Explainable NLP and by [4] in Explainable AFV, which provide meticulous lists and insightful analyses of prevalent datasets in their respective fields. It is crucial to clarify that our endeavor in this section (Section 3.3) does not aim to perform an exhaustive review of datasets, a task diligently undertaken by the aforementioned studies. Instead, our work is uniquely positioned to illuminate the distinctive attributes and inherent diversity within various dataset types in AFV. We hope that our attempt to examine the impact of different data types on explainability serves as a thoughtful addition to ongoing discussions and reflections on the subject, offering a new perspective on the multifaceted interactions between data diversity and explainability in AFV.