Preprint Article, Version 1. This version is not peer-reviewed.

Research on Large-scale Structured and Unstructured Data Processing based on Large Language Model

Version 1 : Received: 16 July 2024 / Approved: 16 July 2024 / Online: 17 July 2024 (10:20:08 CEST)

How to cite: Li, B.; Jiang, G.; Li, N.; Song, C. Research on Large-scale Structured and Unstructured Data Processing based on Large Language Model. Preprints 2024, 2024071364. https://doi.org/10.20944/preprints202407.1364.v1

Abstract

Since the beginning of the internet era, there has been explosive growth in structured data (such as numbers, symbols, and labels) as well as unstructured data (including images, videos, and text). Efficient and accurate mixed querying of these two data types is a key technology for high-quality information retrieval and remains a pressing challenge for industry. In this study, we employ an advanced Transformer model that combines multi-task learning strategies with fine-tuning techniques. Specifically, the model is first pre-trained on a large-scale, general-purpose dataset to learn representations of different data types and basic language comprehension skills. The model's parameters are then fine-tuned for specific application scenarios, such as image annotation, video content analysis, and structured data querying. At the heart of the model is the self-attention mechanism, which allows the model to automatically emphasize the important parts of the input and ignore irrelevant information. In addition, we introduce task-specific adaptation layers that add processing capabilities to the original Transformer architecture, such as a semantic analysis layer for unstructured text data and a relation extraction layer for structured data. This combination of general pre-training and task-specific fine-tuning allows the model to flexibly process and integrate information from different data sources, improving processing efficiency and accuracy. Experimental results show that the model performs well across a variety of data processing tasks, significantly improves the accuracy and efficiency of information retrieval, and demonstrates the strong potential and adaptability of large language models for processing mixed data types.
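To make the described architecture concrete, the following is a minimal sketch, assuming a PyTorch/Hugging Face-style implementation: a shared pre-trained Transformer backbone (with self-attention) topped by two task-specific adaptation heads, one for semantic analysis of unstructured text and one for relation extraction over structured records. The class and head names (MixedDataModel, semantic_head, relation_head) and the choice of backbone are illustrative assumptions; the paper does not publish its implementation.

```python
# Illustrative sketch (not the authors' released code): shared pre-trained
# Transformer encoder plus task-specific adaptation heads, as outlined above.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MixedDataModel(nn.Module):
    def __init__(self, backbone_name: str = "bert-base-uncased",
                 num_semantic_labels: int = 8, num_relation_labels: int = 8):
        super().__init__()
        # Shared pre-trained Transformer backbone; self-attention lives here.
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        # Task-specific adaptation layers added on top of the backbone:
        # a semantic-analysis head for unstructured text ...
        self.semantic_head = nn.Linear(hidden, num_semantic_labels)
        # ... and a relation-extraction head for structured records.
        self.relation_head = nn.Linear(hidden, num_relation_labels)

    def forward(self, input_ids, attention_mask, task: str):
        # Pooled [CLS]-style representation from the encoder.
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]
        if task == "semantic":
            return self.semantic_head(pooled)
        return self.relation_head(pooled)

# Example usage: structured records can be serialized to text before encoding.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MixedDataModel()
batch = tokenizer(["city=Paris | population=2.1M"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"], task="relation")
```

During fine-tuning, the shared backbone and the head relevant to each task would be updated on task-specific data, which is one common way to realize the "general pre-training plus task-specific fine-tuning" scheme the abstract describes.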

Keywords

Structured and Unstructured Data; Processing; Large Language Model; Transformer Model; Self-attention Mechanism.

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
