Preprint Article Version 1 This version is not peer-reviewed

PDAM-FAQ: Paraphrasing-based Data Augmentation and Mixed-Feature Semantic Matching for Low-Resource FAQs

Version 1 : Received: 29 July 2024 / Approved: 1 August 2024 / Online: 1 August 2024 (14:51:28 CEST)

How to cite: Wang, D.; Wang, L.; Tang, K.; Bo, Q.; Han, B. PDAM-FAQ: Paraphrasing-based Data Augmentation and Mixed-Feature Semantic Matching for Low-Resource FAQs. Preprints 2024, 2024080060. https://doi.org/10.20944/preprints202408.0060.v1

Abstract

Frequently Asked Questions (FAQ) systems rely on measuring the symmetric semantic similarity between two sentences. To address insufficient training data and limited domain-specific understanding in low-resource FAQs, this paper proposes a general framework, PDAM-FAQ. First, we propose a paraphrasing-based data augmentation model that integrates syntactic information and edit vectors. Using a rule-based approach, it retrieves template sentences from a corpus and masks relevant words with special characters. Edit vectors between the original and paraphrase sentences are added to the encoding layer of a pre-trained model to strengthen its ability to learn the differences between the original and reference paraphrase sentences. Second, we present a mixed-feature semantic matching model based on SimBERT. The model extracts keyword features from the text and replaces these keywords with special characters to construct intent features. The intent features are then concatenated with the user question and the keyword features to form the model's input. Experiments were conducted separately on the paraphrasing-based data augmentation model, on the mixed-feature semantic matching model, and on their combined application in a low-resource, domain-specific FAQ system. The experimental results show that the proposed framework effectively improves the performance of domain-specific FAQ systems.
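To make the mixed-feature input concrete, the following is a minimal Python sketch of one way the three features described in the abstract could be assembled. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_mixed_input`, the `[MASK]` and `[SEP]` tokens, the dictionary-lookup keyword extraction, and the concatenation order (question, keywords, intent) are all hypothetical choices, since the abstract does not specify them.

```python
import re

MASK = "[MASK]"  # assumed special character that replaces a keyword
SEP = "[SEP]"    # assumed separator between the three features


def build_mixed_input(question, domain_keywords):
    """Concatenate the question, its keyword feature, and its intent feature.

    Keyword extraction here is a plain dictionary lookup, a stand-in for
    whatever extractor the paper actually uses.
    """
    # Longest keywords first, so multi-word terms win over their substrings.
    ordered = sorted((k for k in domain_keywords if k), key=len, reverse=True)
    if not ordered:
        keywords, intent = [], question
    else:
        pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)
        # Keyword feature: the domain terms found in the question, in order.
        keywords = [m.group(0) for m in pattern.finditer(question)]
        # Intent feature: the question with every keyword masked out.
        intent = pattern.sub(MASK, question)
    return f" {SEP} ".join([question, " ".join(keywords), intent])


example = build_mixed_input(
    "How do I reset my router password?", ["router", "password"]
)
print(example)
```

Masking the keywords leaves a template such as "How do I reset my [MASK] [MASK]?", which is what lets an intent feature generalize across questions that differ only in their domain terms.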

Keywords

Paraphrasing; Data Augmentation; Semantic Matching; FAQ; Low-Resource

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
