1. Introduction
Text similarity refers to the task of evaluating the extent of similarity between two sentences on the basis of their semantic, structural, and content-related attributes. It has substantial applications across diverse fields, including intelligent search, recommendation systems, and question-answering systems. As natural language processing (NLP) technologies continue to advance, enhancing the precision of text similarity calculations has emerged as a pivotal area of research focus, especially for short text similarity.
In recent years, deep learning algorithms trained on extensive textual data have exhibited the capacity to learn semantic representations of text. These algorithms can represent words and sentences as high-dimensional vectors, which better support natural language understanding. A notable example of such models is BERT[1], which is based on the Transformer architecture[2]. Through unsupervised pre-training on extensive text corpora, BERT effectively captures contextual representations of sentences, resulting in exceptional performance. However, despite these significant advancements, Transformer-based models still have several limitations.
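For concreteness, the following is a minimal sketch of how such a pre-trained BERT encoder can be used to score short-text similarity. It assumes the HuggingFace transformers library and the bert-base-chinese checkpoint, and serves only as an illustrative baseline, not the model proposed in this paper.

```python
# Minimal sketch: scoring short-text similarity with a pre-trained BERT encoder.
# Assumes the HuggingFace `transformers` library and the `bert-base-chinese` checkpoint;
# illustrative baseline only, not the model proposed in this paper.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the last-layer token representations into a single sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)          # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (1, 768)

def similarity(s1: str, s2: str) -> float:
    """Cosine similarity between the two mean-pooled sentence embeddings."""
    return torch.nn.functional.cosine_similarity(
        sentence_embedding(s1), sentence_embedding(s2)
    ).item()
```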
Deep learning models are heavily dependent on the data they were trained on and lack the ability to incorporate novel or external knowledge beyond their training corpus[3]. Although models such as BERT can capture contextual relationships between words, they are unable to autonomously reason over, or leverage, knowledge that lies beyond the scope of their training data. In contrast to humans, deep learning models do not possess external world knowledge and cannot infer or reason about information that goes beyond the text.
Furthermore, such models face challenges in processing information related to temporal, spatial, or external factual updates. The data used during BERT's pre-training phase is typically static, whereas real-world knowledge is continuously evolving. As time progresses and new knowledge emerges, these models fail to automatically update or integrate such new information[4]. Consequently, predictions based on outdated data may suffer reduced accuracy.
These constraints also limit the effectiveness of models on tasks involving intricate reasoning, such as common-sense reasoning or cross-domain inference, which require the system to engage in more elaborate reasoning processes than basic text-based comprehension. These obstacles present opportunities for future advancements in NLP, particularly regarding the integration of external knowledge with current deep learning models, a subject of growing interest among both academic researchers and industry professionals.
To overcome the limitations of deep learning models in reasoning with external knowledge, numerous researchers have achieved notable advancements by incorporating external knowledge into language models. Velickovic et al.[5] introduced knowledge graphs utilizing Graph Neural Networks (GNNs) to facilitate effective reasoning over complex relational structures. Lee et al.[6] investigated the possibility of treating language models as knowledge bases, thereby enhancing reasoning capabilities through the integration of structured knowledge bases. Petroni et al.[7] put forward a multi-task learning framework that integrates pre-trained language models with external knowledge bases, significantly boosting performance in open-domain question answering tasks. Wang et al.[8] systematically summarized methods for incorporating external knowledge into open-domain question answering through a combination of retrieval and reading strategies. Chen et al.[9] proposed a Retriever-Reader architecture that dynamically retrieves knowledge, resulting in substantial improvements in model accuracy. Ren et al.[10] improved model performance on domain-specific and common-sense reasoning tasks by embedding knowledge graphs into language model representations.
Furthermore, Zhou et al.[11] introduced the KEBERT-GCN model, which incorporates external knowledge bases for computing sentence similarity. The model uses BERT together with external knowledge to construct a novel adjacency matrix: the similarity matrix derived from the external knowledge base is combined with BERT's attention matrix via a Hadamard product, and the result is fed into a Graph Convolutional Network (GCN) to capture intricate relationships between words. Notably, the similarity matrix is built from semantic similarities between words in WordNet, which enhances the utilization of external knowledge within the attention layer. These efforts have significantly contributed to the advancement of external knowledge integration in NLP, yielding substantial improvements in model performance on tasks involving complex reasoning and similarity calculation.
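As an illustration of the fusion step described above (a schematic sketch, not the released KEBERT-GCN implementation), the Hadamard product of an external-knowledge similarity matrix and an attention matrix can serve as the adjacency matrix of a GCN layer:

```python
# Schematic sketch of the fusion step described for KEBERT-GCN (not the authors' code):
# an external-knowledge similarity matrix is combined with an attention matrix via a
# Hadamard product, and the result is used as the adjacency matrix of a GCN layer.
import torch
import torch.nn as nn

class KnowledgeFusedGCNLayer(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)

    def forward(self,
                token_states: torch.Tensor,   # (batch, seq_len, hidden) from BERT
                attention: torch.Tensor,      # (batch, seq_len, seq_len) attention weights
                knowledge_sim: torch.Tensor   # (batch, seq_len, seq_len) WordNet-style similarities
                ) -> torch.Tensor:
        adjacency = attention * knowledge_sim                                         # Hadamard product
        adjacency = adjacency / adjacency.sum(-1, keepdim=True).clamp(min=1e-9)       # row-normalize
        return torch.relu(self.linear(adjacency @ token_states))                      # one GCN propagation step
```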
Models incorporating external knowledge have achieved remarkable success in English NLP tasks[12], yet their application to Chinese poses distinct challenges. For example, prevalent methods such as the KEBERT-GCN model typically dissect sentences into individual words to construct similarity matrices, a technique that frequently fails to capture the overall semantics of Chinese sentences. In English, resources such as WordNet can be used to compute semantic distances between words, yielding semantic similarity matrices as external-knowledge inputs that facilitate the understanding of word relationships. This methodology has proven effective, particularly in handling intricate semantic relationships. In Chinese, however, the absence of a direct equivalent to WordNet restricts the applicability of this approach to Chinese text processing.
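For illustration, word-level similarities of the kind used to populate such matrices can be obtained from WordNet through NLTK; the specific measure (path similarity) and the tooling shown here are assumptions made for the sketch, not necessarily those used in the cited work.

```python
# Illustration only: a WordNet-based word similarity of the kind used to build
# external-knowledge similarity matrices in English. Requires NLTK with the WordNet
# corpus downloaded (nltk.download('wordnet')); the cited work may use a different measure.
from itertools import product
from nltk.corpus import wordnet as wn

def wordnet_similarity(word1: str, word2: str) -> float:
    """Maximum path similarity over all synset pairs of the two words (0 if none found)."""
    pairs = product(wn.synsets(word1), wn.synsets(word2))
    scores = [s1.path_similarity(s2) for s1, s2 in pairs]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)

print(wordnet_similarity("car", "automobile"))  # synonyms -> 1.0
print(wordnet_similarity("car", "banana"))      # unrelated -> small value
```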
Moreover, these methods exhibit certain limitations even in English. They predominantly focus on the relationships between individual words while often neglecting the actual contribution of each word to the overall semantic similarity computation. For instance, words carry varying degrees of importance within a sentence, yet existing similarity matrices may not adequately reflect this. Current methods[13, 14] also tend to prioritize global semantic information without adequately capturing intricate relationships within the local context. These issues are particularly acute in Chinese text processing, where the intricacy of word and character combinations requires models to address both global and local semantic attributes. As a result, new methods are needed to address the scarcity of external knowledge resources for Chinese and to integrate both global and local information for enhanced contextual comprehension in Chinese NLP.
To address the aforementioned challenges, this research introduces a methodology that incorporates external knowledge to enhance the performance of models in Chinese NLP tasks. The methodology utilizes vector representations derived from Tencent's 8-million-word corpus and Netease's embedding corpus as external knowledge resources for the BERT model, allowing the model to access detailed, word-level external knowledge inputs in Chinese. The proposed methodology is highly flexible, as the external knowledge base can be replaced with domain-specific word embeddings tailored to the requirements of a given task, thereby improving performance within that domain. Moreover, the structure of the proposed model is not limited to BERT and can be adapted to other pre-trained models, making it suitable for a variety of training tasks and data scenarios. Experimental results on the LCQMC dataset demonstrate that the proposed model significantly outperforms the traditional BERT model, especially in tasks involving semantic similarity calculation and contextual understanding. Compared to existing models, it achieves an accuracy of 90.16% on LCQMC, an improvement of 2.23% over ERNIE and 1.46% over the previously top-performing model, Glyce + BERT, providing an effective solution for Chinese NLP tasks.
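The sketch below illustrates one plausible way such word-level external embeddings could be loaded and fused with BERT token representations. The file path, the use of gensim, and the gated fusion are illustrative assumptions rather than the exact architecture proposed in this paper.

```python
# Minimal sketch of one way word-level external embeddings (e.g., the Tencent word
# vectors) could be fused with BERT token representations. The file path, the use of
# gensim, and the simple gated fusion are illustrative assumptions, not the exact
# architecture proposed in this paper.
import torch
import torch.nn as nn
from gensim.models import KeyedVectors

external = KeyedVectors.load_word2vec_format("tencent_word_vectors.txt")  # hypothetical path

class GatedKnowledgeFusion(nn.Module):
    """Fuse a BERT token vector with an aligned external word vector via a learned gate."""
    def __init__(self, bert_dim: int = 768, ext_dim: int = external.vector_size):
        super().__init__()
        self.project = nn.Linear(ext_dim, bert_dim)
        self.gate = nn.Linear(2 * bert_dim, bert_dim)

    def forward(self, bert_vec: torch.Tensor, ext_vec: torch.Tensor) -> torch.Tensor:
        ext = self.project(ext_vec)
        g = torch.sigmoid(self.gate(torch.cat([bert_vec, ext], dim=-1)))
        return g * bert_vec + (1 - g) * ext

def lookup(word: str) -> torch.Tensor:
    """External-knowledge lookup; out-of-vocabulary words fall back to a zero vector."""
    if word in external:
        return torch.tensor(external[word])
    return torch.zeros(external.vector_size)
```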
The key contributions of this paper are as follows:
Flexible Integration of External Knowledge: We introduce a methodology that facilitates the utilization of external knowledge, such as Tencent's 8-million-word corpus and Netease's embedding corpus, as inputs for Chinese word-level semantic understanding. This methodology offers the flexibility to incorporate domain-specific word embedding libraries based on task requirements.
Combination of Global and Local Information: By integrating external knowledge with internal attention mechanisms, the methodology effectively combines global semantic information with local contextual relationships. This addresses the limitations of previous methodologies that focused solely on global relationships.
Task Generalizability: The proposed methodology transcends the confines of short text similarity, extending its application to enhance semantic understanding and recognition accuracy across various NLP tasks. We conduct innovative experiments on Named Entity Recognition (NER), demonstrating remarkable results.
Extensive Experimental Validation: The proposed methodology undergoes extensive validation on datasets such as MSRA-NER. The results demonstrate not only superior performance in general tasks but also versatility across various language processing scenarios. These findings provide new insights for future NLP research.
The remainder of this paper is structured as follows: Section 2 presents related work; Section 3 outlines the proposed method, covering the overall model design, training details, and the integration of global and local information; Section 4 presents the experimental results; Section 5 extends the application to other tasks; and Section 6 concludes the paper.
5. Conclusion
This study addresses semantic challenges in Chinese short text matching by integrating external knowledge bases with pre-trained language models like BERT. By combining BERT’s global features with local semantic features from external lexicons (e.g., Tencent’s 8-million-word corpus), the model significantly improves accuracy and robustness in short text similarity tasks. Key contributions include the integration of external knowledge for enhanced semantic disambiguation, multi-layer feature fusion to capture nuanced meanings, validation on datasets like LCQMC and BQ where it outperformed traditional methods, and successful application to named entity recognition (NER), demonstrating cross-task utility.
Future research could focus on dynamic knowledge integration, the use of multimodal knowledge, expansion to domain-specific tasks, and optimization for resource efficiency, especially in low-resource settings. Addressing current limitations, such as noise from external knowledge bases, adaptability to dynamic updates, and constraints in specific domains, may involve strategies such as noise filtering and dynamic knowledge retrieval.