Wang, D.; Wang, L.; Tang, K.; Bo, Q.; Han, B. PDAM-FAQ: Paraphrasing-based Data Augmentation and Mixed-Feature Semantic Matching for Low-Resource FAQs. Preprints 2024, 2024080060. https://doi.org/10.20944/preprints202408.0060.v1
APA Style
Wang, D., Wang, L., Tang, K., Bo, Q., & Han, B. (2024). PDAM-FAQ: Paraphrasing-based Data Augmentation and Mixed-Feature Semantic Matching for Low-Resource FAQs. Preprints. https://doi.org/10.20944/preprints202408.0060.v1
Chicago/Turabian Style
Wang, D., Qile Bo and Bin Han. 2024. "PDAM-FAQ: Paraphrasing-based Data Augmentation and Mixed-Feature Semantic Matching for Low-Resource FAQs." Preprints. https://doi.org/10.20944/preprints202408.0060.v1
Abstract
Frequently Asked Questions (FAQ) systems rely on measuring the symmetric semantic similarity between two sentences. To address insufficient training data and limited domain-specific understanding in low-resource FAQs, this paper proposes a general framework, PDAM-FAQ. First, we propose a paraphrasing-based data augmentation model that integrates syntactic information and edit vectors. Using a rule-based approach, it retrieves template sentences from a corpus and masks relevant words with special characters. Edit vectors between the original and paraphrase sentences are added to a pre-trained model's encoding layer to enhance the model's ability to learn the differences between the original and reference paraphrase sentences. Second, this paper presents a mixed-feature semantic matching model based on SimBERT. The model extracts keyword features from the text and replaces these keywords with special characters to construct intent features. These intent features, along with the user question and keyword features, are then concatenated to form the model's input. Experiments were conducted separately on the paraphrasing-based data augmentation model, on the mixed-feature semantic matching model, and on their combined application in a low-resource domain-specific FAQ. The experimental results show that the proposed framework effectively improves the performance of domain-specific FAQ systems.
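The mixed-feature input described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the special characters, the keyword extractor, or the order of concatenation, so the `[MASK]` and `[SEP]` tokens, the lexicon-lookup extractor, the segment order, and all function names below are assumptions made only for illustration.

```python
import re

MASK = "[MASK]"  # assumed placeholder; the abstract only says "special characters"
SEP = "[SEP]"    # assumed separator between the three feature segments


def extract_keywords(question, lexicon):
    """Return lexicon terms found in the question, ordered by position.

    A plain lexicon lookup stands in for the paper's keyword extractor,
    which the abstract does not describe.
    """
    found = []
    for term in lexicon:
        match = re.search(r"\b" + re.escape(term) + r"\b", question, flags=re.IGNORECASE)
        if match:
            # Keep the surface form as it appears in the question.
            found.append((match.start(), match.group(0)))
    return [text for _, text in sorted(found)]


def build_intent(question, keywords):
    """Replace each keyword with the mask token to form the intent feature."""
    intent = question
    for keyword in keywords:
        intent = re.sub(r"\b" + re.escape(keyword) + r"\b", MASK, intent)
    return intent


def build_model_input(question, lexicon):
    """Concatenate question, keyword features, and intent feature (assumed order)."""
    keywords = extract_keywords(question, lexicon)
    intent = build_intent(question, keywords)
    return f" {SEP} ".join([question, " ".join(keywords), intent])


if __name__ == "__main__":
    print(build_model_input("How do I reset my router password?", ["password", "router"]))
```

With the hypothetical lexicon above, the question "How do I reset my router password?" becomes `How do I reset my router password? [SEP] router password [SEP] How do I reset my [MASK] [MASK]?`. In a real system the concatenated string would then be tokenized and passed to the matching model.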
Keywords
Paraphrasing; Data Augmentation; Semantic Matching; FAQ; Low-Resource
Subject
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.