Version 1: Received: 12 June 2021 / Approved: 14 June 2021 / Online: 14 June 2021 (14:54:32 CEST)
How to cite:
Sánchez-Reyes, A.; Fernández-López, M. G. Mash Sketched Reference Dataset for Genome-Based Taxonomy and Comparative Genomics. Preprints 2021, 2021060368. https://doi.org/10.20944/preprints202106.0368.v1
APA Style
Sánchez-Reyes, A., & Fernández-López, M. G. (2021). Mash Sketched Reference Dataset for Genome-Based Taxonomy and Comparative Genomics. Preprints. https://doi.org/10.20944/preprints202106.0368.v1
Chicago/Turabian Style
Sánchez-Reyes, A. and Maikel Gilberto Fernández-López. 2021. "Mash Sketched Reference Dataset for Genome-Based Taxonomy and Comparative Genomics." Preprints. https://doi.org/10.20944/preprints202106.0368.v1
Abstract
The analysis of curated genomic, metagenomic, and proteomic data is of paramount importance in biology, medicine, education, and bioinformatics. Although this type of data is usually hosted in raw form in free international repositories, accessing it demands substantial computing, storage, and processing capacity from the end user. The purpose of this study is to offer the scientific community a comprehensive set of genomic and proteomic reference data in an accessible and easy-to-use form. A representative type-material set of genomes, proteomes, and metagenomes associated with the major groups of Bacteria, Archaea, Viruses, and Fungi was downloaded directly from https://www.ncbi.nlm.nih.gov/assembly/ and from the Genome Taxonomy Database. Sketched databases were then created with the Mash software and stored as compact, reduced representations. Our dataset condenses nearly 100 GB of disk space into 585.78 MB and represents 87,476 genomic/proteomic records from eight informative contexts, prefiltered so that they remain accessible and usable with limited computational resources. Potential uses of this dataset include, but are not limited to, microbial species delimitation, estimation of genomic distances, detection of genomic novelties, and pairwise comparisons between proteomes, genomes, and metagenomes.
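To illustrate why a sketch can stand in for a much larger sequence file, the following is a minimal Python sketch of the bottom-s MinHash idea that Mash builds on, together with the Mash distance formula D = -(1/k) ln(2j / (1 + j)), where j is the estimated Jaccard index and k the k-mer size. It is a simplified illustration, not the Mash implementation: the function names are hypothetical, the hash is BLAKE2b from the standard library rather than the MurmurHash3 used by Mash, and reverse-complement (canonical) k-mers are ignored.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=21, s=1000):
    """Bottom-s MinHash sketch: the s smallest 64-bit k-mer hash values.

    Assumption: BLAKE2b stands in for the hash function used by Mash.
    """
    hashes = {
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:s]


def jaccard_estimate(sketch_a, sketch_b, s=1000):
    """Estimate the Jaccard index from two bottom-s sketches."""
    # Take the s smallest hashes of the merged sketches, then count
    # how many of them are present in both input sketches.
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    shared = set(sketch_a) & set(sketch_b)
    if not merged:
        return 0.0
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k=21):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)); 1.0 when j is 0."""
    if j <= 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)
```

Only the s smallest hash values are kept per sequence, so the stored representation has a fixed size regardless of genome length, which is what makes a reduction on the scale reported above possible. In practice the dataset would be queried with the Mash command-line tool itself (for example its `mash sketch` and `mash dist` subcommands) rather than with code like this.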
Biology and Life Sciences, Biochemistry and Molecular Biology
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.