Mash Sketched Reference Dataset for Genome-Based Taxonomy and Comparative Genomics

Ayixon Sánchez-Reyes; Maikel Gilberto Fernández-López

doi:10.20944/preprints202106.0368.v1

Submitted: 12 June 2021

Posted: 14 June 2021


Abstract

The analysis of curated genomic, metagenomic, and proteomic data is of paramount importance in biology, medicine, education, and bioinformatics. Although such data are usually hosted in raw form in free international repositories, accessing them demands considerable computing, storage, and processing capacity from the typical user. The purpose of this study is to offer the scientific community a comprehensive set of genomic and proteomic reference data in an accessible, easy-to-use form. A representative set of type-material genomes, proteomes, and metagenomes associated with the major groups of Bacteria, Archaea, Viruses, and Fungi was downloaded directly from https://www.ncbi.nlm.nih.gov/assembly/ and from the Genome Taxonomy Database. Sketched databases were then created with the Mash software and stored as compact, reduced representations. The dataset condenses close to 100 GB of disk space into 585.78 MB and represents 87,476 genomic/proteomic records from eight informative contexts, prefiltered to make them accessible and usable with modest computational resources. Potential uses of this dataset include, but are not limited to, microbial species delimitation, estimation of genomic distances, detection of genomic novelties, and pairwise comparisons between proteomes, genomes, and metagenomes.
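To illustrate what a "sketched" representation is and how a genomic distance can be estimated from it, the following is a minimal Python sketch of the bottom-s MinHash idea that Mash is built on. It is not the authors' pipeline and not Mash's implementation: the function names are hypothetical, BLAKE2b stands in for the MurmurHash3 function Mash uses, and canonical (strand-independent) k-mers are omitted for brevity. The defaults k = 21 and s = 1000 mirror Mash's default k-mer and sketch sizes, and the distance is the published Mash formula D = -(1/k) ln(2j / (1 + j)), where j is the estimated Jaccard index.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=21, s=1000):
    """Bottom-s sketch: the s smallest 64-bit hashes of the k-mers.

    Assumption: BLAKE2b is used here only for illustration; Mash
    itself uses MurmurHash3 on canonical k-mers.
    """
    hashes = {
        int.from_bytes(
            hashlib.blake2b(km.encode(), digest_size=8).digest(), "big"
        )
        for km in kmers(seq.upper(), k)
    }
    return sorted(hashes)[:s]


def jaccard_estimate(a, b, s=1000):
    """Estimate the Jaccard index from two bottom-s sketches.

    Take the s smallest hashes of the merged sketches and count
    how many of them occur in both.
    """
    merged = sorted(set(a) | set(b))[:s]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k=21):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)), capped at 1 for j = 0."""
    if j <= 0.0:
        return 1.0
    if j >= 1.0:
        return 0.0
    return -math.log(2.0 * j / (1.0 + j)) / k
```

In use, each genome would be sketched once with `sketch(...)`, and any pair of sketches could then be compared with `mash_distance(jaccard_estimate(a, b))` without revisiting the raw sequences, which is why a sketch database can be orders of magnitude smaller than the data it summarizes.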

Keywords: Microbial Mash database; Mash distance; Genome containment; Type material; Microbial taxonomy

Subject: Biology and Life Sciences - Biochemistry and Molecular Biology

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
