1. Introduction
The egg and broiler industries are pivotal to global food production, and China plays a central role as the world’s largest egg producer and a major supplier of broiler meat. China accounts for 36% of global egg production and 14% of chicken meat, making these industries important both for food security and as a source of income for millions of people in urban and rural areas. However, the rapid expansion of these industries has heightened the threat of various poultry diseases, which can severely impact production and livelihoods. As a result, effective disease prevention and control have become increasingly critical. Although information on poultry diseases is available, it is often fragmented and poorly organized, making it difficult for farmers to access the knowledge they need when they need it. This paper addresses this issue by proposing a method that uses deep learning techniques to accurately identify and extract essential information on poultry diseases from extensive text sources. This approach lays the foundation for developing a comprehensive knowledge graph, which can support advanced applications such as intelligent Q&A systems and efficient knowledge retrieval platforms [1]. These tools will equip farmers with the information they need to protect their flocks and maintain their livelihoods.
Named Entity Recognition (NER) is a critical task in natural language processing, involving the identification of specific entities within text. Its importance has grown significantly with the rapid expansion of biomedical literature and data. In the biomedical domain, Biomedical Named Entity Recognition (BioNER) targets the recognition of entities such as disease names, symptoms, drugs, and anatomical parts [2]. Traditionally, NER models have approached this task as a sequence labelling problem, employing both rule-based and machine learning techniques [3]. Rule-based methods necessitate the manual creation of extensive rule sets, which rely heavily on domain expertise and are often restricted to narrowly defined areas [4]. In contrast, machine learning approaches, including Hidden Markov Models (HMM), Maximum Entropy Models (MEM), Support Vector Machines (SVM), and Conditional Random Fields (CRF) [5,6,7,8], rely on feature engineering. This reliance poses challenges, particularly in selecting appropriate features and capturing long-term dependencies between entities [9]. The advent of deep learning has revolutionized the field: architectures such as Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) networks have been applied to NER tasks [10,11,12], significantly improving both efficiency and recognition accuracy. Notably, the BiLSTM-CRF model has gained widespread use [13,14,15,16], as it automates feature extraction, thereby enhancing both efficiency and accuracy without the large-scale feature engineering required by earlier methods [17].
As research in NER has advanced, increasing attention has been directed toward the recognition of nested entities. The concept of "nested region names" was first introduced in the task definition of the Automatic Content Extraction (ACE) research [18] and refers to entities that exist within other entities [19,20]. For instance, in the study of chicken diseases, a symptom such as 'spleen swelling' includes the nested entities 'spleen', representing a body part, and 'swelling', representing a symptom. Recognizing these nested entities, whether they are heterogeneous (of different types) or homogeneous (of the same type), is complex and presents a significant challenge for accurate identification [21]. The inclusion of nested names in the ACE task definition underscores the importance of recognizing these intricate structures within named entities.
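The 'spleen swelling' example can be made concrete with a small sketch. It assumes entities are stored as character-offset triples with an exclusive end, and the helper name find_nested is hypothetical; neither is taken from the annotation scheme of this paper.

```python
from typing import List, Tuple

# (start, end, label); end is exclusive and offsets are character positions.
# This representation is an assumption made for illustration only.
Entity = Tuple[int, int, str]


def find_nested(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Return (outer, inner) pairs where inner lies strictly inside outer."""
    pairs = []
    for outer in entities:
        for inner in entities:
            same_span = (inner[0], inner[1]) == (outer[0], outer[1])
            contained = outer[0] <= inner[0] and inner[1] <= outer[1]
            if contained and not same_span:
                pairs.append((outer, inner))
    return pairs


text = "spleen swelling"
entities = [
    (0, 15, "symptom"),   # "spleen swelling"
    (0, 6, "bodypart"),   # "spleen"
    (7, 15, "symptom"),   # "swelling"
]

nested_pairs = find_nested(entities)
for outer, inner in nested_pairs:
    inner_text = text[inner[0]:inner[1]]
    outer_text = text[outer[0]:outer[1]]
    print(f"{inner_text!r} ({inner[2]}) is nested in {outer_text!r} ({outer[2]})")
```

Here the pair with 'spleen' is heterogeneous (body part inside symptom), while the pair with 'swelling' is homogeneous (symptom inside symptom).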
To address the challenges posed by nested named entities, researchers have proposed various methods, including hypergraph-based, region-based, transition-based, and span-based techniques. For example, Huang et al. [22] introduced a Hyper Graph Network (HGN) structure to manage nested entities by representing each sentence as a hypergraph, with words as nodes and entities as hyper-edges, thereby transforming the recognition task into hyper-edge classification. Region-based methods treat nested NER as a multi-class classification problem by first representing potential regions (subsequences) and then classifying them. Jiang et al. [23] proposed a candidate region-aware model that uses binary sequence labeling followed by candidate region classification, demonstrating strong performance on public datasets. Transition-based methods, inspired by dependency parsers, incrementally construct trees through greedy decoding. Wang et al. [24] developed a neural transition model using Stack LSTM, which effectively captures character-level representations and efficiently represents the system's state. Span-based methods enumerate all possible text segments (spans) and then determine their entity status [25], which naturally suits the nested entity recognition task. Li et al. [26] enhanced this approach by developing a segment-enhanced span-based model (SESNER), which improves model performance while accurately handling complex nested entities. Additionally, the Global Pointer (GP) model, which leverages relative positions through a multiplicative attention mechanism, offers a global perspective on start and end positions for predicting entities and has demonstrated outstanding performance across various nested NER tasks [27,28,29,30].
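The span-based idea can be sketched in a few lines: enumerate candidate (start, end) pairs, then keep those whose score passes a threshold. The scores below are invented placeholders rather than outputs of a Global Pointer head, the attention-based scoring itself is not reproduced, and the decoding threshold of 0 is an assumption.

```python
from typing import List, Tuple


def enumerate_spans(n_tokens: int, max_len: int) -> List[Tuple[int, int]]:
    """All (start, end) pairs with inclusive end and at most max_len tokens."""
    return [
        (start, end)
        for start in range(n_tokens)
        for end in range(start, min(start + max_len, n_tokens))
    ]


def decode_spans(scores: List[List[float]], threshold: float = 0.0) -> List[Tuple[int, int]]:
    """Keep every span with start <= end whose score exceeds the threshold."""
    n = len(scores)
    return [
        (start, end)
        for start in range(n)
        for end in range(start, n)
        if scores[start][end] > threshold
    ]


tokens = ["spleen", "swelling"]
spans = enumerate_spans(len(tokens), max_len=2)

# Hypothetical scores for a single entity type: scores[start][end].
scores = [
    [-1.0, 2.5],
    [-9.0, 1.2],
]
predicted = decode_spans(scores)
print(spans)
print(predicted)
```

Because every span is scored independently, the overlapping spans (0, 1) and (1, 1) can both be accepted, which is why this formulation handles nesting without special machinery.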
Despite these advances, current NER models still rely heavily on large training datasets, making pre-trained models crucial for embedding layers in NER tasks [31,32]. Models such as BERT, MCBERT, RoBERTa, XLNet, and ERNIE, trained on extensive corpora, have significantly improved entity recognition accuracy [33,34,35,36]. However, challenges remain in accurately recognizing rare entities and domain-specific terms. To address these issues, researchers have sought to enhance pre-trained models with lexicon information. Techniques like Softword, ExSoftword, and SoftLexicon incorporate lexicons to improve the recognition of domain-specific named entities [37]. For instance, Zhao et al. [38] introduced the BERT-IDCNN-CRF model, which integrates the SoftLexicon method, demonstrating impressive efficiency across multiple datasets. Similarly, Zhang et al. [39] improved recognition by incorporating lexicons and similar words into character representations. Liu et al. [40] further enhanced model capabilities by dynamically updating custom lexicon segmentation methods, thereby improving the identification of domain-specific terms and new entities. Additionally, incorporating syntactic information has been shown to significantly enhance NER performance [41]. For example, Tian et al. [42] developed the BioKMNER model, which employs a Key-Value Memory Network (KVMN) to integrate syntactic information, achieving excellent results on biomedical datasets. Luoma et al. [43] demonstrated that adding context through additional sentences in BERT input systematically improves NER performance. These methods, whether introducing lexicon information, syntactic data, sentence context, or even stroke information of Chinese characters [44], are aimed at improving NER model performance in specialized domains.
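As an illustration of how lexicon information can be attached to characters, the sketch below groups matched lexicon words by the position of each character within the word (Begin, Middle, End, Single), which is the grouping SoftLexicon-style methods are generally described as using. The toy lexicon, the sentence, and the function name are hypothetical, and the later weighting and pooling of these word sets into vectors is not shown.

```python
from typing import Dict, List, Set


def bmes_word_sets(
    sentence: str, lexicon: Set[str], max_word_len: int = 4
) -> List[Dict[str, Set[str]]]:
    """For each character, collect matched words by the character's role in them."""
    char_sets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in sentence]
    for start in range(len(sentence)):
        last_end = min(start + max_word_len, len(sentence))
        for end in range(start + 1, last_end + 1):
            word = sentence[start:end]
            if word not in lexicon:
                continue
            if len(word) == 1:
                char_sets[start]["S"].add(word)
                continue
            char_sets[start]["B"].add(word)        # word begins at this character
            char_sets[end - 1]["E"].add(word)      # word ends at this character
            for mid in range(start + 1, end - 1):  # characters strictly inside
                char_sets[mid]["M"].add(word)
    return char_sets


# Hypothetical mini-lexicon: "spleen", "swelling", "spleen swelling".
lexicon = {"脾脏", "肿大", "脾脏肿大"}
char_sets = bmes_word_sets("脾脏肿大", lexicon)
for char, sets_for_char in zip("脾脏肿大", char_sets):
    print(char, sets_for_char)
```

Each character ends up with the full set of lexicon words that cover it, so domain terms that a character-level encoder might split can still inform the character representation.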
In summary, NER continues to face significant challenges in the field of biological disease research, particularly in accurately recognizing nested entities and domain-specific terms. These challenges are especially pronounced in the study of chicken diseases, where the acquisition of relevant corpora is difficult and the labour cost of data annotation is high. To address these issues, we collected and organized a chicken disease corpus comprising 20 million characters, trained a specialized word vector tailored to chicken disease terminology, and annotated a portion of this data with high precision. Building on this foundation, we developed a nested NER model, MFGFF-BiLSTM-EGP, which leverages Multiple Fine-Grained Feature Fusion (MFGFF) and the Efficient Global Pointer (EGP). The primary contributions of this paper are as follows:
We have constructed the MFGFF-BiLSTM-EGP model, which feeds the fused multi-fine-grained features into a BiLSTM layer and then, through a fully connected layer, into the EGP to predict entity positions.
In the MFGFF module we designed, the character encoder obtains character features by fine-tuning the RoBERTa pre-trained model, the word encoder acquires word features through word-character matching, word frequency weighting, and multi-head attention mechanism, and the sentence features are output using SBERT. MFGFF effectively integrates multiple fine-grained features. In addition, the introduction of EGP enables the prediction of nested entities by means of positional coding.
We have constructed a comprehensive knowledge base for chicken diseases, which includes a 20-million-character corpus, a vocabulary containing 6760 specialized terms, a 200-dimensional word vector in the field of chicken diseases, and a high-quality annotated dataset CDNER annotated under the guidance of veterinarians.
3. Results
We conducted experiments on three datasets, each divided into training, validation, and testing sets with a 6:2:2 ratio:
For the CDNER dataset, we employed chicken disease word vectors trained on a 20-million-character corpus. For the CMeEE V2 and CLUENER datasets, we utilized 200-dimensional Chinese word vectors from Tencent AI Lab [49].
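A 6:2:2 partition of this kind might be produced as in the following sketch. The shuffling, the fixed seed, and the function name split_622 are assumptions made for illustration; the text does not state how samples were assigned to each subset.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_622(samples: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once with a fixed seed, then cut into 60% / 20% / 20%."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_622(range(1000))
print(len(train), len(val), len(test))
```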
3.1. Main Results Compared with Other Models
We evaluated the performance of several mainstream NER models on these datasets:
As summarized in Table 4, the MFGFF-BiLSTM-EGP model consistently outperforms other models across all three datasets, achieving the highest F1 scores: 91.98% on CDNER, 73.32% on CMeEE V2, and 82.54% on CLUENER. The superior performance of the MFGFF-BiLSTM-EGP model can be attributed to its integration of character, word, and sentence vectors, fused with BiLSTM and EGP, which enhances recognition accuracy by incorporating specialized vocabulary and contextual information.
The SLRBC model also exhibits strong performance, particularly on the CDNER dataset, with an F1 score of 89.64%. This success is largely due to the SoftLexicon method, which enhances vocabulary representation, combined with RoBERTa's robust contextual embeddings, BiLSTM's capacity to capture sequential dependencies, and CRF's sequence tagging capabilities. Although BERT-CRF lacks the SoftLexicon and BiLSTM modules present in the SLRBC model, it still performs competitively due to BERT's powerful representation capabilities, albeit with slightly lower metrics across all datasets. BERT-MRC, which reinterprets NER as a reading comprehension task, delivers adequate but not outstanding results, with F1 scores of 82.93% on CDNER, 67.97% on CMeEE V2, and 76.89% on CLUENER. Its performance could potentially be improved by refining the descriptions of entity types used in the MRC parameter settings.
3.2. Entity Level Evaluation
Figure 7 and Table 5 present the entity-level evaluation results of the MFGFF-BiLSTM-EGP model across the CDNER, CMeEE V2, and CLUENER datasets, detailing Precision, Recall, and F1 Scores. In the CDNER dataset, the model demonstrates exceptional performance in recognizing the "symptom," "bodypart," "disease," "drug," and "type" categories, achieving F1 scores of 87.6%, 92.43%, 92.98%, 93.22%, and 93.68%, respectively. This consistency underscores the high quality of our dataset, indicating that the annotations are both balanced and precise, thereby facilitating robust model training. These results validate the model's superior performance in entity recognition within this dataset.
For the CMeEE V2 dataset, the model's performance varies across different categories. It excels in the "dis" (disease) and "dru" (drug) categories, with F1 scores of 83.81% and 81.1%, respectively. However, the model encounters difficulties in the "equ" (equipment) and "ite" (medical examination items) categories, where F1 scores drop to 62.62% and 62.04%, respectively. Notably, in the "equ" category, the model's Precision is only 55.23%, likely due to the limited representation of specialized vocabulary within the general domain word vectors, resulting in weaker recognition performance in these areas.
In the CLUENER dataset, the model shows a relatively balanced performance across categories, achieving high F1 scores in the "company," "government," and "position" categories—86.05%, 84.28%, and 84.33%, respectively. This indicates strong recognition capabilities with minimal disparity between Precision and Recall, reflecting good stability. However, the model's performance in the "movie" and "book" categories is less satisfactory, with F1 scores of 75.76% and 80.2%, respectively. Overall, the model maintains balanced performance across multiple categories.
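For reference, Precision, Recall, and F1 at the entity level can be computed as in the sketch below. It assumes exact-match scoring, in which a prediction counts only when its sentence, boundaries, and label all match a gold entity; the matching criterion is not stated in the text above, and the gold and predicted sets are invented.

```python
from typing import Set, Tuple

# (sentence_id, start, end, label); an assumed representation for illustration.
Entity = Tuple[int, int, int, str]


def precision_recall_f1(gold: Set[Entity], pred: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match scores: a prediction is correct only if it appears in gold."""
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = {
    (0, 0, 6, "bodypart"),
    (0, 0, 15, "symptom"),
    (0, 7, 15, "symptom"),
    (1, 3, 9, "disease"),
}
pred = {
    (0, 0, 6, "bodypart"),
    (0, 0, 15, "symptom"),
    (1, 3, 9, "drug"),  # right boundaries, wrong label: counted as an error
}

p, r, f1 = precision_recall_f1(gold, pred)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Per-category scores such as those in Table 5 would follow by filtering both sets to a single label before calling the function.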
3.3. Ablation Study
Table 6 shows the results of ablation experiments performed by the MFGFF-BiLSTM-EGP model on the CDNER dataset to evaluate the impact of different module combinations on the model's F1 score. The study investigates the contributions of the pre-trained model, word encoder, and sentence encoder. The results indicate that each module significantly enhances the model’s performance, with the highest F1 score of 91.98% achieved when all three modules are integrated.
3.3.1. Effect of Pre-Trained Model
The pre-trained model substantially improved overall performance. With only the pre-trained model (Model 1), the F1 score reached 88.01%, which is 5.69 percentage points higher than that of Model 2 (82.32%), which used only the word encoder and no pre-trained model. This finding underscores the pre-trained model's effectiveness in capturing underlying features. Combining the pre-trained model with the word encoder (Model 4) further increased the F1 score to 91.33%, a gain of 9.01 percentage points over Model 2. This emphasizes the significance of pre-trained models in complex feature representation. When the pre-trained model was combined with both the word and sentence encoders (Model 6), the F1 score peaked at 91.98%, 9.31 percentage points above Model 3, which combined only the word and sentence encoders. This result further demonstrates the pre-trained model’s capacity to maximize overall model performance.
3.3.2. Effect of Word Encoder
The word encoder plays a crucial role in enhancing the model’s performance, especially when integrated with other modules. The F1 score for the word encoder alone (Model 2) is 82.32%. Although this is lower than that of Model 1, which used only the pre-trained model, it still illustrates the word encoder’s value in word-level feature extraction. When the word encoder is combined with the pre-trained model (Model 4), the F1 score rises to 91.33%, a gain of 3.32 percentage points over Model 1. This finding indicates that the word encoder significantly contributes to refining the pre-trained model’s fine-grained feature representation. In Model 6, where the word encoder is combined with both the pre-trained model and the sentence encoder, the F1 score further improves to 91.98%, 3.64 percentage points above Model 5, which combined the pre-trained model and the sentence encoder. This demonstrates the pivotal role of the word encoder in a multi-module combination.
3.3.3. Effect of Sentence Encoder
The sentence encoder’s impact on model performance is more nuanced and depends on its combination with other modules. When combined with the word encoder (Model 3), the F1 score reached 82.67%, a modest increase of 0.35 percentage points over Model 2, which used only the word encoder. The gain may be small because, in the absence of a pre-trained model, the sentence encoder adds largely redundant information. When the sentence encoder was combined with the pre-trained model (Model 5), the F1 score increased slightly to 88.34%, just 0.33 percentage points higher than Model 1 (88.01%), which used only the pre-trained model. In Model 6, the F1 score reached its maximum of 91.98% when all three modules were used together, an improvement of 0.65 percentage points over Model 4. These results suggest that the sentence encoder, when used alongside the pre-trained model and word encoder, can marginally enhance the global semantic representation of the model.
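The differences quoted in Sections 3.3.1 to 3.3.3 can be reproduced from the six F1 scores reported above. The sketch below is only a bookkeeping aid: the module names used as keys and the function gains_from are labels introduced here, and only matched pairs of configurations (identical except for the module in question) are compared.

```python
from typing import Dict, FrozenSet, List, Tuple

# F1 scores (%) on CDNER for Models 1-6, as reported in the ablation text.
F1_BY_CONFIG: Dict[FrozenSet[str], float] = {
    frozenset({"pretrained"}): 88.01,                      # Model 1
    frozenset({"word"}): 82.32,                            # Model 2
    frozenset({"word", "sentence"}): 82.67,                # Model 3
    frozenset({"pretrained", "word"}): 91.33,              # Model 4
    frozenset({"pretrained", "sentence"}): 88.34,          # Model 5
    frozenset({"pretrained", "word", "sentence"}): 91.98,  # Model 6
}


def gains_from(module: str) -> List[Tuple[str, float]]:
    """F1 gain (percentage points) from adding `module` to each base config."""
    results = []
    for config, base_f1 in F1_BY_CONFIG.items():
        with_module = config | {module}
        if module in config or with_module not in F1_BY_CONFIG:
            continue
        base_name = "+".join(sorted(config))
        results.append((base_name, round(F1_BY_CONFIG[with_module] - base_f1, 2)))
    return results


for module in ("pretrained", "word", "sentence"):
    print(module, gains_from(module))
```

Read this way, adding the pre-trained model yields the largest gains (over 9 points), the word encoder adds roughly 3 to 4 points, and the sentence encoder adds less than 1 point in every pairing.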
4. Discussion
4.1. Visualization of Token Representations in Feature Space
We conducted feature visualization and analysis across three datasets: CDNER, CMeEE V2, and CLUENER. By labeling 50 entities per category and extracting their features, we applied t-SNE for dimensionality reduction to facilitate visualization. Our analysis focused on two approaches: word vectors and MFGFF.
Figure 8 illustrates that the effectiveness of MFGFF varies significantly across different datasets. In the CDNER dataset, the original word vector representations exhibited minimal distinction between feature vectors across entity categories, leading to substantial overlap and dispersion. After MFGFF, the feature representations improved markedly, with data points clustering more tightly within their respective categories. This resulted in more distinct category clusters and enhanced inter-category distinguishability. Despite these improvements, challenges remain, such as the observed similarity between categories like 'body part' and 'symptom.' This overlap likely arises from semantic similarities or nested relationships within the text.
Similarly, the CMeEE V2 dataset initially showed poor category aggregation, with blurred boundaries and significant overlap under the original word vectors. The application of MFGFF significantly clarified these boundaries and improved data point aggregation, highlighting its effectiveness in enhancing feature differentiation. However, certain categories, such as medical examination item, medical procedure, and body, still exhibited similarities, likely due to semantic overlaps in medical terminology.
In contrast, the CLUENER dataset, a general domain corpus, displayed a more even distribution of entities under the original word vectors, though noticeable category overlap was still present. The application of MFGFF greatly improved differentiation between entity types, which may be attributed to the dataset's inherent diversity in entity text, allowing fused features to better segregate categories.
In summary, our analysis highlights the complexities inherent in biomedical NER, including challenges like semantic overlap, domain-specific vocabulary, and nested entities. While MFGFF demonstrates significant advantages in enhancing feature representation, it still faces challenges, particularly in addressing category similarity. The varying performance of this technique across different datasets is closely tied to the datasets' characteristics and the semantic features of the entity categories, indicating a need for further optimization and adaptation in specific applications.
4.2. Nested Entity Prediction Analysis
Based on the experimental results presented in Figure 9, we conducted a detailed evaluation of the effectiveness of nested entity recognition, comparing the performance of Fine-Tuning (FT) RoBERTa and MFGFF across different datasets. For the CDNER dataset, Fine-Tuning RoBERTa achieved a precision of 79.68%, a recall of 72.54%, and an F1 score of 75.98%. On the CMeEE V2 dataset, these metrics were notably lower, with a precision of 42.13%, a recall of 47.15%, and an F1 score of 44.50%. Compared to overall entity recognition, the F1 scores for CDNER and CMeEE V2 decreased by 16.03% and 28.82%, respectively, highlighting the significant challenges posed by nested entities in NER.
To address this challenge, we developed a high-quality nested entity dataset to enhance model training with more representative data. Implementing the MFGFF method further improved the model’s capability to recognize nested entities. On the CDNER dataset, MFGFF achieved precision, recall, and F1 scores of 71.90%, 72.13%, and 71.91%, respectively, closely aligning with the performance of Fine-Tuning RoBERTa. On the more complex CMeEE V2 dataset, MFGFF yielded precision, recall, and F1 scores of 39.56%, 41.10%, and 40.32%, respectively. Although MFGFF's absolute values were slightly lower than those of Fine-Tuned RoBERTa, its stability and robustness in recognizing nested entities were more pronounced, particularly in the challenging CMeEE V2 dataset.
In conclusion, while nested entity recognition continues to present significant challenges in NER tasks, constructing high-quality datasets and adopting an MFGFF approach can significantly enhance model performance. The MFGFF method, with its superior stability and robustness, outperforms the fine-tuning-only strategy in recognizing nested entities, demonstrating both the effectiveness and potential application value of our proposed method.
4.3. Comparative Analysis of Pre-trained Models
The choice of a pre-trained model for embedding is crucial to the recognition effectiveness of NER models. We compared the performance of five pre-trained models (BERT, RoBERTa, XLNet, MCBERT, and ERNIE), as illustrated in Figure 10.
RoBERTa outperforms the other models across all metrics, particularly in recall (92.18%) and F1 score (91.98%), where it leads significantly. This superior performance can be attributed to RoBERTa’s use of a larger dataset, extended training time, and the removal of the Next Sentence Prediction task from BERT. These modifications enhance RoBERTa's ability to capture contextual information and manage long dependencies effectively. MCBERT, a model pre-trained specifically for the Chinese medical domain, performs comparably to BERT in medical text classification tasks, with accuracy and F1 scores of 90.25% and 90.49%, respectively. MCBERT’s strength lies in its pre-training on a vast corpus of medical domain texts, enabling it to handle medical terminology and specialized sentence structures effectively. However, compared to RoBERTa and ERNIE, MCBERT is slightly weaker in precision and recall, likely due to the more extensive pre-training data and optimization strategies employed by RoBERTa and ERNIE, which result in better performance even in specialized areas like chicken disease NER. XLNet, on the other hand, lags behind the other models in precision (88.87%), recall (90.08%), and F1 score (89.44%). Despite its autoregressive architecture, which aims to merge the benefits of BERT and Transformer-XL to capture richer contextual information, XLNet’s performance may degrade when handling shorter texts or when there is insufficient contextual information.