Version 1
Received: 4 November 2024 / Approved: 5 November 2024 / Online: 5 November 2024 (10:28:36 CET)
How to cite:
Tyagi, N.; Vahab, N.; Tyagi, S. Genome Language Modeling (Glm): A Beginner’s Cheat Sheet. Preprints 2024, 2024110285. https://doi.org/10.20944/preprints202411.0285.v1
APA Style
Tyagi, N., Vahab, N., & Tyagi, S. (2024). Genome Language Modeling (Glm): A Beginner’s Cheat Sheet. Preprints. https://doi.org/10.20944/preprints202411.0285.v1
Chicago/Turabian Style
Tyagi, N., Naima Vahab, and Sonika Tyagi. 2024. "Genome Language Modeling (Glm): A Beginner’s Cheat Sheet." Preprints. https://doi.org/10.20944/preprints202411.0285.v1
Abstract
Combining genomics with digital healthcare information is set to transform personalized medicine. However, this integration is challenging because the data modalities differ in nature. The large size of the genome makes it impossible to store as part of the standard electronic health record (EHR) system. To make the genome interoperable with EHR data, it must be condensed into a representation containing biomarkers and usable features. This systematic review examines both conventional and state-of-the-art methods for genome language modeling (GLM), which involves representing genomic sequences and extracting features from them. Feature extraction is an essential step for applying machine learning (ML) models to large genomic datasets, especially within integrated workflows. We first provide a step-by-step guide to genomic sequence pre-processing and representation techniques. We then explore feature extraction methods, including tokenization and the transformation of tokens using frequency-based, embedding-based, and neural network-based approaches. Finally, we discuss ML applications in genomics, focusing on classification, prediction, and language processing algorithms. We also explore the role of GLM in functional annotation, emphasizing how advanced ML models, such as Bidirectional Encoder Representations from Transformers (BERT), enhance the interpretation of genomic data. To the best of our knowledge, this is the first end-to-end analytic guide for converting complex genomic data into biologically interpretable information using GLM, thereby facilitating the development of novel data-driven hypotheses.
Keywords
Natural language processing, genomics, digital health, precision medicine, machine learning, AI
Subject
Biology and Life Sciences, Biology and Biotechnology
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.