Preprint Article Version 1 This version is not peer-reviewed

How Trustworthy Are Genomic Sequences of SARS-CoV-2 in GenBank?

Version 1 : Received: 27 August 2024 / Approved: 27 August 2024 / Online: 28 August 2024 (12:37:53 CEST)

How to cite: Xia, X. How Trustworthy Are Genomic Sequences of SARS-CoV-2 in GenBank? Preprints 2024, 2024081963. https://doi.org/10.20944/preprints202408.1963.v1

Abstract

Well-annotated gene and genomic sequences serve as a foundation for making inferences in molecular biology and evolution, and can directly impact public health. The first SARS-CoV-2 genome was submitted to GenBank and used to develop the two successful vaccines. Conserved protein domains are often chosen as targets for developing antiviral medicines or vaccines. Mutation and substitution patterns provide crucial information not only on functional motifs and genome/protein interactions but also for characterizing phylogenetic relationships among viral strains. These patterns, together with the collection times of viral samples, serve as the basis for addressing the question of when and where the host-switching event occurred. Unfortunately, viral genomic sequences submitted to GenBank undergo little quality control, and critical information in the annotation is frequently changed without the change being recorded. Researchers often have no choice but to place blind faith in the accuracy of the sequences. There have been reports of incorrect genome annotation, but no report has cast doubt on the genomic sequences themselves, because it seems theoretically impossible to identify genomic sequences that may not be authentic. This paper takes an innovative approach to show that some SARS-CoV-2 genomes submitted to GenBank cannot possibly be authentic. Specifically, some SARS-CoV-2 genomic sequences deposited in GenBank with collection times in 2023 and 2024, isolated from saliva, nasopharyngeal, sewage, and stool samples, are identical to the reference genome of SARS-CoV-2 (NC_045512). The probability of such an occurrence is effectively zero. I also compile SARS-CoV-2 genomes whose sample collection times have been changed. Without being aware of errors in sequences and sequence annotation, one may be led astray in bioinformatic analysis.
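The reasoning in the abstract can be illustrated with a minimal sketch. It is not the author's method: the `Record` type, the function names, and the toy data are hypothetical, and the probability estimate assumes a simple Poisson model with a substitution rate of roughly 1e-3 per site per year and a genome length of 29,903 nt for NC_045512. The first function flags records whose sequence is identical to the reference despite a late collection year; the second estimates how likely it is for a lineage to accumulate no substitutions over a given time span.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    accession: str        # hypothetical label, not a real GenBank accession
    collection_year: int  # year given in the record's annotation
    sequence: str         # genomic sequence as a string of nucleotides


def flag_suspect(records, reference, min_year=2023):
    """Return accessions whose sequence equals the reference
    even though the sample was collected in min_year or later."""
    ref = reference.upper()
    return [
        r.accession
        for r in records
        if r.collection_year >= min_year and r.sequence.upper() == ref
    ]


def prob_no_substitution(rate_per_site_year, genome_length, years):
    """Poisson probability of zero substitutions along one lineage.

    The expected number of substitutions is rate * length * years,
    so P(0 substitutions) = exp(-rate * length * years).
    """
    return math.exp(-rate_per_site_year * genome_length * years)


# Toy data only: a 4-nt "reference" stands in for the real genome.
toy_reference = "ACGT"
toy_records = [
    Record("toy1", 2023, "acgt"),  # identical to reference, late date
    Record("toy2", 2020, "ACGT"),  # identical, but an early date
    Record("toy3", 2024, "ACGA"),  # late date, one difference
]
suspects = flag_suspect(toy_records, toy_reference)

# Assumed rate and time span; both are illustrative, not from the paper.
p_zero = prob_no_substitution(1e-3, 29903, 3)
```

Under these assumed values the expected number of substitutions over three years is about 90, so the probability of observing none is vanishingly small, which is consistent with the abstract's statement that the probability is effectively zero.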

Keywords

SARS-CoV-2; COVID-19; GenBank; data validation; genome; genomic analysis

Subject

Biology and Life Sciences, Biochemistry and Molecular Biology
