Version 1: Received: 16 July 2024 / Approved: 16 July 2024 / Online: 17 July 2024 (10:20:08 CEST)
How to cite:
Li, B.; Jiang, G.; Li, N.; Song, C. Research on Large-scale Structured and Unstructured Data Processing based on Large Language Model. Preprints 2024, 2024071364. https://doi.org/10.20944/preprints202407.1364.v1
APA Style
Li, B., Jiang, G., Li, N., & Song, C. (2024). Research on Large-scale Structured and Unstructured Data Processing based on Large Language Model. Preprints. https://doi.org/10.20944/preprints202407.1364.v1
Chicago/Turabian Style
Li, B., G. Jiang, Ningxin Li, and Chaoda Song. 2024. "Research on Large-scale Structured and Unstructured Data Processing based on Large Language Model." Preprints. https://doi.org/10.20944/preprints202407.1364.v1
Abstract
Since the beginning of the internet era, there has been explosive growth in structured data (such as numbers, symbols, and labels) as well as unstructured data (including images, videos, and text). Efficient and accurate hybrid querying across these two types of data is a key technology for high-quality information retrieval, and it remains a pressing challenge for industry. In this study, we employ a Transformer model that combines multi-task learning strategies with fine-tuning techniques. Specifically, the model is first pre-trained on a large-scale, general-purpose dataset to learn representations of different data types and basic language comprehension skills. We then fine-tune the model's parameters for specific application scenarios, such as image annotation, video content analysis, and structured data querying. At the heart of the model is the self-attention mechanism, which allows the model to automatically emphasize important parts of the input and ignore irrelevant information. In addition, we introduce task-specific adaptation layers that extend the original Transformer architecture with additional processing capabilities, such as a semantic analysis layer for unstructured text data and a relation extraction layer for structured data. This combination of general pre-training and task-specific fine-tuning allows the model to flexibly process and integrate information from different data sources, improving processing efficiency and accuracy. Experimental results show that the model performs well across a variety of data processing tasks, significantly improves the accuracy and efficiency of information retrieval, and demonstrates the strong potential and adaptability of large language models in processing mixed data types.
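To make the architecture the abstract describes concrete, the sketch below (Python/PyTorch) shows one plausible reading of it: a shared pre-trained Transformer encoder, whose self-attention layers weigh the input, topped by task-specific adaptation heads (a semantic analysis head for unstructured text, a relation extraction head for structured records). All class names, dimensions, and the pooling choice are illustrative assumptions, not details taken from the paper.

# Minimal sketch of a shared Transformer backbone with task-specific
# adaptation layers, as the abstract outlines. Hypothetical names and sizes.
import torch
import torch.nn as nn

class SharedTransformerEncoder(nn.Module):
    """Stand-in for the general-purpose pre-trained backbone."""
    def __init__(self, d_model=256, nhead=8, num_layers=4, vocab_size=30522):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, token_ids):
        # Self-attention inside each encoder layer weighs the input tokens.
        return self.encoder(self.embed(token_ids))  # (batch, seq, d_model)

class MultiTaskModel(nn.Module):
    """Shared encoder plus per-task adaptation heads, fine-tuned jointly."""
    def __init__(self, d_model=256, num_text_labels=5, num_relations=10):
        super().__init__()
        self.backbone = SharedTransformerEncoder(d_model=d_model)
        # Task-specific adaptation layers (hypothetical heads):
        self.semantic_head = nn.Linear(d_model, num_text_labels)  # unstructured text
        self.relation_head = nn.Linear(d_model, num_relations)    # structured records

    def forward(self, token_ids, task):
        h = self.backbone(token_ids).mean(dim=1)  # pooled sequence representation
        return self.semantic_head(h) if task == "text" else self.relation_head(h)

model = MultiTaskModel()
logits = model(torch.randint(0, 30522, (2, 16)), task="text")
print(logits.shape)  # torch.Size([2, 5])

In this reading, pre-training would fix the backbone's representations, while fine-tuning updates both the backbone and the lightweight heads for each downstream task; how the authors actually share or freeze parameters is not specified in the abstract.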
Keywords
Structured and Unstructured Data; Processing; Large Language Model; Transformer Model; Self-attention Mechanism.
Subject
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.