1. Introduction
In the context of large-scale, intensive pig breeding, establishing intelligent diagnostic and preventive measures for pig diseases is of great significance. Early prevention and timely diagnosis are pivotal for maintaining swine health and mitigating potential losses. Named Entity Recognition (NER) plays a critical role in this endeavor by identifying specific entities within textual corpora, serving as the cornerstone for numerous downstream tasks in natural language processing, including information retrieval, intelligent question answering, and knowledge graph construction. However, most existing entity recognition methods focus on general entities such as persons, locations, and organizations. Given the pressing need to strengthen disease surveillance and management in swine, there is an urgent need to develop specialized NER methods tailored to the lexicon of pig disease terminology in Chinese.
Early NER methods include rule-based recognition methods and statistics-based machine learning recognition methods. In recent years, with the rapid development of neural networks, deep learning methods have proved better suited to the NER task and have become the mainstream approach [1,2,3,4,5].
The rule-based NER method relies on rules formulated manually by experts. It achieves high accuracy on small datasets, but it is difficult to scale up and to transfer across domains because constructing the rules by hand is a time-consuming task [6].
The statistics-based NER method selects an appropriate training model according to the specific research background. Commonly used statistical models include the hidden Markov model (HMM), conditional random field (CRF), support vector machine (SVM), and maximum entropy (ME) model. Compared to rule-based methods, this approach omits much tedious rule design and is fast, portable, and convenient to use [7,8]. However, statistics-based methods require a large number of manually labeled examples to train model parameters, and they have gradually been replaced by deep learning methods.
The deep learning-based NER method can learn more complex features and achieves good results. In contrast to the preceding two approaches, deep learning-based NER methods do not require an abundance of hand-crafted features, and they have therefore attracted wide attention from researchers. Common deep learning models include the convolutional neural network (CNN), recurrent neural network (RNN), graph neural network (GNN), deep neural network (DNN), generative adversarial network (GAN), long short-term memory network (LSTM), Transformer, and BERT (Bidirectional Encoder Representations from Transformers) [1,9]. Compared to rule-based and statistics-based models, deep learning models are dominant and achieve state-of-the-art results in NER. However, the scalability of deep learning models in specific domains remains a significant challenge.
The lexicon-based NER method can effectively avoid segmentation errors and improve the accuracy of entity boundary recognition by integrating potential word information into feature vectors. A large number of lexicon-enhanced Chinese entity extraction methods have been proposed, with better performance than methods based on character or word embeddings alone. Lattice-LSTM [10] achieved new benchmark results on several public Chinese NER datasets. However, its architecture is complex, which limits its application in industrial settings requiring real-time NER responses. A convolutional neural network-based method that incorporates lexicons through a rethinking mechanism was proposed, which can model all the characters and potential words matching a sentence in parallel [11]. A lexicon-based graph neural network with global semantics was proposed to tackle word ambiguities; in this model, lexicon knowledge connects characters to capture local composition, while a global relay node captures global sentence semantics and long-range dependencies [12]. Lexicon Enhanced BERT (LEBERT) for Chinese sequence labeling was put forward [13]; it integrates external lexicon knowledge directly into the BERT layers through a Lexicon Adapter and outperforms both lexicon-enhanced models and the BERT baseline on Chinese datasets. Further character-word association models have also been proposed, such as SoftLexicon [14], FLAT [15], and PLTE [16].
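To make the character-word fusion shared by these lexicon-enhanced models concrete, the following is a minimal PyTorch sketch of a LEBERT-style lexicon adapter [13]. It is an illustration under stated assumptions, not the original implementation: the module name, the single attention projection, and the tensor layout are all ours.

```python
import torch
import torch.nn as nn

class LexiconAdapter(nn.Module):
    """Fuses each character vector with its matched lexicon words via
    character-to-word attention (simplified from LEBERT [13])."""
    def __init__(self, hidden_size: int, word_dim: int):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, hidden_size)  # align word dim to BERT dim
        self.attn = nn.Linear(hidden_size, hidden_size, bias=False)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, char_h, word_emb, word_mask):
        # char_h:    (B, L, H)     character hidden states from a BERT layer
        # word_emb:  (B, L, K, Dw) embeddings of up to K matched words per character
        # word_mask: (B, L, K)     1 for real matches, 0 for padding
        w = torch.tanh(self.word_proj(word_emb))                      # (B, L, K, H)
        scores = torch.einsum('blh,blkh->blk', self.attn(char_h), w)
        scores = scores.masked_fill(word_mask == 0, -1e9)             # ignore padded slots
        alpha = torch.softmax(scores, dim=-1)                         # weights over K words
        word_ctx = torch.einsum('blk,blkh->blh', alpha, w)            # attended word vector
        return self.norm(char_h + word_ctx)                           # residual fusion
```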
The pre-trained model-based NER method effectively leverages deep bidirectional contextual information, offering shorter training times, reduced labeling requirements, and better results than traditional models. Currently, BERT [17] is the most widely used, followed by ELMo [18], RoBERTa [19], ERNIE [20], ALBERT [21], and others. Pre-trained models and lexicons are now being integrated to exploit their respective strengths. Li proposed the Flat-Lattice Transformer for Chinese NER, which converts the lattice structure into a flat structure consisting of spans [15]. Li proposed the LEBERT-BiLSTM-CRF model for NER in elementary mathematics texts, which integrates external lexicon knowledge directly into the BERT layers through a lexicon adapter and performs better than other NER models [22].
Contrastive learning acquires feature representations by comparing positive and negative samples in feature space, and it has garnered significant attention in computer vision (CV) and natural language processing (NLP). The ConSERT (Contrastive Framework for Self-Supervised Sentence Representation Transfer) and SimCSE (Simple Contrastive Learning of Sentence Embeddings) models, which use different data augmentation methods and contrastive loss functions to learn sentence representations, obtain state-of-the-art results on text semantic similarity tasks [23,24]. COntrastive learning with Prompt guiding for few-shot NER (COPNER) outperforms state-of-the-art models by a significant margin in most cases; it introduces category-specific words, composed into prompts, as supervision signals for contrastive learning to optimize entity token representations [25]. Moreover, named entity recognition in low-resource scenarios based on contrastive learning has also received considerable attention [26,27,28]. He proposed a novel prompt-based contrastive learning method for few-shot NER that requires neither template construction nor label word mappings [26]. Li proposed CLINER, a multi-task learning framework for few-shot NER [27].
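To illustrate the in-batch positive/negative contrast these methods build on, here is a minimal sketch of an InfoNCE-style (NT-Xent) loss of the kind used by SimCSE [24]; the function name and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """z1[i] and z2[i] are two views of the same sentence (e.g., two
    dropout-augmented encodings, as in SimCSE); every other in-batch
    pairing serves as a negative."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / temperature                      # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(sim, labels)
```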
In the field of livestock husbandry, text mining, named entity recognition, intelligent question-answering systems, and artificial intelligence (AI) technologies have gradually been applied. However, this field faces numerous challenges, including the prevalence of technical terms, complex knowledge structures, fine knowledge granularity, and a lack of labeled datasets [29]. Seok created the BERT-DIS-NER model, which adds a CRF layer to BERT for disease named entity recognition and uses syllable-level units to reflect the characteristics of disease names; the model achieved an F1-score of 0.81 when trained on human data and fine-tuned on animal data [30]. Kung designed and implemented an intelligent knowledge question-answering system for pig farming based on bi-GRU and SNN methods, combined with the LSTM deep learning method [31].
NER methods have found extensive applications in the agricultural domain [32,33,34,35,36,37]. Nonetheless, there remains an apparent gap in current research concerning the accurate recognition of named entities in the domain of pig diseases in Chinese. Pig disease data are characterized by complex entities, fuzzy boundaries, and domain-specific vocabulary. Unlike conventional NER tasks focusing on common entities such as person and organization names, pig disease data encompass specialized terminologies drawn from animal husbandry and veterinary science.
Furthermore, resources in the field of pig diseases are confined and dispersed, exacerbating the scarcity of publicly available benchmark corpora and labeled datasets specific to this domain in Chinese. While considerable research has been devoted to NER systems in human medicine [38,39], such models cannot be directly transferred to the pig disease domain because of its domain-specific rules and vocabulary. Hence, named entity recognition in the field of pig diseases needs further exploration. This paper proposes a model for Pig Disease Chinese Named Entity Recognition (PDCNER). The main contributions of the paper are as follows:
(1) A named entity recognition model that integrates contrastive learning and an enhanced lexicon is proposed for a Chinese pig disease corpus, achieving the best recognition results among the compared models.
(2) To enrich the contextual understanding and semantic representation of pig disease data, we employed lexicon-enhanced BERT. This approach directly integrates external lexicon knowledge from the pig disease domain into the BERT layers via a Lexicon Adapter layer, seamlessly combining the characteristics of both characters and words. Furthermore, we used contrastive learning to maximize the agreement between paired representations within a batch while keeping them distinct from the other representations, which enhances model robustness and facilitates more effective feature extraction and representation learning.
(3) We constructed a comprehensive Chinese corpus and lexicon for identifying specific terms in the pig disease domain. Moreover, we built an annotated dataset covering 25 distinct types of pig diseases and 6 entity categories, comprising a total of 7,518 annotated entities.
The remainder of the paper is organized as follows: Section 2 introduces the dataset and the method proposed in this paper. The experiments and results are described in Section 3. Section 4 compares the proposed method with other commonly used methods and analyzes the experimental results. Finally, conclusions are presented in Section 5.
4. Discussion
4.1. Performance Analysis of the Proposed Model
For a better understanding of the proposed approach, we evaluate the PDCNER model separately on the six entity categories (type, disease, body parts, symptom, medicine, and control), as presented in Figure 3 and Table 3.
We found that the F1-scores for type, disease, and medicine all exceeded 90%, with type highest at 95.41%. Conversely, the lowest F1-score was for control entities, at only 63.16%. The primary reason for this disparity is that the boundaries of pig type and disease entities are very clear, whereas the boundaries of control measure entities are more ambiguous. For instance, type entities typically end with terms like 'pigs (猪)' (e.g., sick pigs (患病猪), nursery pigs (保育猪), fattening pigs (育肥猪)), while disease entities usually end with terms such as 'disease (病)', 'inflammation (炎)', and 'plague (瘟)' (e.g., porcine blue ear disease (猪蓝耳病), necrotic enteritis (坏死性肠炎), African swine fever (非洲猪瘟)). In contrast, control measure entities are generally composed of verbs and nouns, such as 'isolating infected pigs (隔离感染猪群)' and 'reducing environmental stress factors (减少环境应激因素)'. The second reason is the uneven distribution of entities: control entities make up only 7.29% of the training set, significantly fewer than the other categories, so the model could not fully learn their contextual features. Additionally, the average length of control entities is 11 Chinese characters, which further lowers their recognition rate.
On the other hand, the F1-scores of disease and medicine entities were 92.96% and 90.05%, respectively. Although both categories include a large number of technical terms, the method proposed in this paper recognizes them well. The results demonstrate that PDCNER fully utilizes both Chinese character features and lexicon knowledge in the pig disease domain at the input level, and that the lexicon adapter effectively leverages pig disease knowledge.
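Per-category scores such as those in Table 3 can be computed with the seqeval library's classification report; the tag names below are assumptions mirroring the six entity categories, not the exact labels of our dataset.

```python
from seqeval.metrics import classification_report

# y_true / y_pred are lists of BIO tag sequences, one per sentence.
y_true = [["B-DISEASE", "I-DISEASE", "O", "B-SYMPTOM", "I-SYMPTOM"]]
y_pred = [["B-DISEASE", "I-DISEASE", "O", "B-SYMPTOM", "O"]]
print(classification_report(y_true, y_pred, digits=4))  # per-category P/R/F1
```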
4.2. Comparison of Common Pre-Trained and Lexicon-Based Models
In contrast to models such as BERT-BiLSTM-CRF, BERT-BiLSTM-CRF-SoftLexicon, RoBERTa, and LEBERT, PDCNER demonstrates significant advancements, confirming its efficiency. PDCNER holds a distinct advantage over other pre-trained and lexicon-based models, illustrating the value of incorporating pig disease-related lexicon features directly into the BERT representation from the bottom layer and of using contrastive learning.
For comparison with the other models, we present the recognition results for the six entity categories in Figure 4. It can be clearly seen from Figure 4 that PDCNER achieves the best results on four entity categories (type, body parts, symptom, and medicine), which exhibit robust domain-specific features.
- (1) Effectiveness of the lexicon-enhanced BERT
Comparative analysis with BERT-BiLSTM-CRF reveals notable improvements in the precision, recall, and F1-score of PDCNER, of 5.98, 0.04, and 3.05 percentage points, respectively. PDCNER leverages the lexicon adapter to make full use of pig disease feature information, seamlessly integrating it into the BERT architecture. Specifically, the Lexicon Adapter is attached between certain Transformer layers within BERT, infusing pig disease lexicon knowledge into the model's representation (a minimal sketch of this attachment appears at the end of this subsection).
- (2) Effectiveness of contrastive learning
Through comparative evaluation on the same dataset and downstream model, PDCNER identifies pig disease entities more accurately than LEBERT, with improvements in precision, recall, and F1-score of 0.45, 0.44, and 0.44 percentage points, respectively. This underscores the efficacy of the contrastive learning loss function, which enhances the model's capacity for semantic representation of text and thus contributes to superior performance across NER tasks.
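As referenced in item (1), the following is a minimal sketch of how a lexicon adapter can be attached between BERT's Transformer layers, reusing the LexiconAdapter sketched in Section 1; the class name, attachment point, and checkpoint are illustrative assumptions, not the authors' exact implementation.

```python
import torch.nn as nn
from transformers import BertModel

class LexiconEnhancedBert(nn.Module):
    """Runs BERT layer by layer and injects lexicon knowledge after a
    chosen Transformer layer; the attachment point is a hyperparameter."""
    def __init__(self, adapter: nn.Module, attach_after: int = 1,
                 checkpoint: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        self.adapter = adapter          # e.g., the LexiconAdapter from Section 1
        self.attach_after = attach_after

    def forward(self, input_ids, attention_mask, word_emb, word_mask):
        h = self.bert.embeddings(input_ids)
        ext_mask = self.bert.get_extended_attention_mask(attention_mask, input_ids.shape)
        for i, layer in enumerate(self.bert.encoder.layer):
            h = layer(h, attention_mask=ext_mask)[0]
            if i == self.attach_after:                  # inject lexicon knowledge here
                h = self.adapter(h, word_emb, word_mask)
        return h                                        # (B, L, H), fed to a CRF/softmax head
```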
4.3. Analysis of Results for Few-Shot
To verify the reliability and robustness of PDCNER under scarce-data, few-shot conditions, we experimented with 1%, 10%, and 30% of the pig disease corpus. The results, shown in Table 4, indicate that the PDCNER model improves markedly over BERT-BiLSTM-CRF and LEBERT.
The F1-score of PDCNER reaches 84.77% with a 10% sample, only 1.22 percentage points lower than with the full sample. As the sample size increases to 30%, the F1-score further improves to 85.39%, a marginal decrease of only 0.60 percentage points compared to the full sample, while outperforming the BERT-BiLSTM-CRF and LEBERT models by 6.38 and 8.42 percentage points, respectively. These results demonstrate the PDCNER model's capability to achieve high recognition accuracy even under data scarcity. Incorporating lexical information in the bottom layers of BERT enables efficient use of BERT's representational capabilities, and the adoption of contrastive learning enhances the semantic representation space, facilitating effective feature capture without extensive training.
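For reproducibility, a fixed-ratio subset of the training sentences can be drawn as sketched below; the simple random sampling scheme and the seed are assumptions, as the exact subsampling protocol is not specified here.

```python
import random

def subsample(train_sentences: list, ratio: float, seed: int = 42) -> list:
    """Draws a fixed-ratio subset of the training corpus for the
    few-shot experiments (e.g., ratio = 0.01, 0.1, or 0.3)."""
    rng = random.Random(seed)                      # fixed seed for repeatability
    k = max(1, int(len(train_sentences) * ratio))
    return rng.sample(train_sentences, k)
```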
4.4. Experiments on Public Datasets
To assess the generalization capability of PDCNER, we conducted evaluations on three public datasets: Weibo, OntoNotes, and Resume. As illustrated in Table 5, the PDCNER model achieved the highest F1-score on all three datasets. These results indicate that PDCNER performs well not only on the pig disease corpus but also generalizes to a degree across other domains.
5. Conclusions
High-quality extraction of knowledge related to pig diseases is critical for intelligent consultation, question answering, technical recommendations, and other application scenarios.
In this study, we constructed a corpus, labeled datasets, and a lexicon for Chinese named entity recognition specific to pig diseases, encompassing 152,596 characters, 7,518 entities, and 2,391 professional terms. To tackle the challenges of entity identification in the pig disease domain, such as scarce annotated data, numerous technical terms, and fuzzy boundaries, we proposed the PDCNER model. This model integrates lexicon information from the pig disease domain into BERT's lower Transformer layers and employs contrastive learning to enhance representation quality and generalization capability. The results indicate that PDCNER surpasses BERT-BiLSTM-CRF and other mainstream models in extracting named entities related to pig diseases, achieving precision, recall, and F1-score of 86.92%, 85.08%, and 85.99%, respectively, demonstrating high-quality entity recognition in the field of pig diseases. Moreover, few-shot experiments confirm that our model remains robust with limited data, and experiments on public datasets verify its generalization ability.
In future work, we plan to utilize additional datasets from other related animal diseases, such as chicken and cow diseases, to further test the scalability and generalization ability of the model.