Preprint Review Version 1 This version is not peer-reviewed

Genome Language Modeling (GLM): A Beginner’s Cheat Sheet

Version 1 : Received: 4 November 2024 / Approved: 5 November 2024 / Online: 5 November 2024 (10:28:36 CET)

How to cite: Tyagi, N.; Vahab, N.; Tyagi, S. Genome Language Modeling (Glm): A Beginner’s Cheat Sheet. Preprints 2024, 2024110285. https://doi.org/10.20944/preprints202411.0285.v1

Abstract

Combining genomics with digital healthcare information is set to transform personalized medicine. However, this integration is challenging because the two data modalities differ in nature. The large size of the genome makes it impossible to store as part of the standard electronic health record (EHR) system. Making the genome interoperable with EHR data therefore requires condensing it into a representation that contains biomarkers and usable features. This systematic review examines both conventional and state-of-the-art methods for genome language modeling (GLM), which involves representing genomic sequences and extracting features from them. Feature extraction is an essential step for applying machine learning (ML) models to large genomic datasets, especially within integrated workflows. We first provide a step-by-step guide to genomic sequence pre-processing and representation techniques. We then explore feature extraction methods, including tokenization and the transformation of tokens using frequency-based, embedding-based, and neural network-based approaches. Finally, we discuss ML applications in genomics, focusing on classification, prediction, and language processing algorithms. We also explore the role of GLM in functional annotation, emphasizing how advanced ML models, such as Bidirectional Encoder Representations from Transformers (BERT), enhance the interpretation of genomic data. To the best of our knowledge, this is the first end-to-end analytic guide for converting complex genomic data into biologically interpretable information using GLM, thereby facilitating the development of novel data-driven hypotheses.
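As an illustration of the tokenization and frequency-based transformation mentioned above, the following minimal sketch splits a DNA sequence into overlapping k-mers and converts the token counts into relative frequencies. The function names `kmer_tokenize` and `kmer_frequencies`, the choice of k = 3, and the example sequence are assumptions made for illustration; they are not taken from the review itself.

```python
from collections import Counter


def kmer_tokenize(sequence, k=3, stride=1):
    """Split a sequence into k-mers (hypothetical helper, overlapping by default)."""
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_frequencies(tokens):
    """Map each distinct k-mer to its relative frequency among the tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


# Illustrative sequence only; real input would come from a FASTA/FASTQ file.
tokens = kmer_tokenize("ATGCGATG", k=3)
freqs = kmer_frequencies(tokens)
print(tokens)
print(freqs)
```

A stride of 1 gives overlapping tokens; setting `stride=k` would give non-overlapping ones. Which of the two suits a given downstream model is a design choice that this sketch does not settle.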

Keywords

Natural language processing, genomics, digital health, precision medicine, machine learning, AI

Subject

Biology and Life Sciences, Biology and Biotechnology
