Version 1
Received: 16 October 2024 / Approved: 17 October 2024 / Online: 17 October 2024 (17:04:22 CEST)
How to cite:
Kumar, T.; Bhujbal, R.; Raj, K.; Roy, A. M. Navigating Complexity: A Tailored Question-Answering Approach for PDFs in Finance, Bio-Medicine, and Science. Preprints 2024, 2024101395. https://doi.org/10.20944/preprints202410.1395.v1
APA Style
Kumar, T., Bhujbal, R., Raj, K., & Roy, A. M. (2024). Navigating Complexity: A Tailored Question-Answering Approach for PDFs in Finance, Bio-Medicine, and Science. Preprints. https://doi.org/10.20944/preprints202410.1395.v1
Chicago/Turabian Style
Kumar, T., R. Bhujbal, Kislay Raj, and Arunabha M. Roy. 2024. "Navigating Complexity: A Tailored Question-Answering Approach for PDFs in Finance, Bio-Medicine, and Science." Preprints. https://doi.org/10.20944/preprints202410.1395.v1
Abstract
Understanding complex Portable Document Format (PDF) files, such as research papers, clinical reports, and scientific manuals, is often time-consuming. While significant progress has been made in developing question-answering (QA) systems that yield contextually relevant responses, building a comprehensive end-to-end machine learning model capable of answering intricate questions remains a formidable challenge. These systems typically rely on substantial labeled data to train their foundational models for specific tasks. However, assembling such datasets is particularly difficult for complex documents, such as annual reports from major technology companies. In this paper, we address this issue by developing a QA system specifically designed for PDF documents, focusing on the domains of finance, biomedicine, and scientific literature. We manually curated datasets from these areas for evaluation and used pre-trained Bidirectional Encoder Representations from Transformers (BERT) models from the Hugging Face library. The models were evaluated using the F1 score, with the BERT Large model achieving an F1 score of 44%.
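The abstract reports an F1 score but does not state how it is computed. As a minimal sketch, assuming the token-overlap F1 commonly used for extractive QA (SQuAD-style), the metric for a single predicted answer can be written as below. The function names `normalize` and `token_f1` and the normalization steps (lowercasing, stripping punctuation and English articles) are illustrative assumptions, not details taken from the paper.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the paper may use a different scheme.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall for one answer pair."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this assumed definition, a dataset-level score would be the mean of `token_f1` over all question-answer pairs, which is how a figure such as 44% would typically be obtained.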
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.