Table 1.
Overview and comparison of recently updated (2022 or later) virus-related databases. A comprehensive description of sequence counts especially in the metagenomic context (vOTUs) can be found in
Section 2.3. To obtain more detailed information regarding the download process, please see Table S3.
: To access the complete range of features and download data, user authentication is required; #seq-n – number of nucleotide sequences; #seq-p – number of protein sequences; #spec – number of species; – – not known/no access; use – subjective impression of usability; workb. – workbench; own d. – own data: the possibility to work with personal data; compl. – complexity of database, ranging from simple (
) to extensive (
); tools – availability of tools and a quantitative ranking:
– a lot of tools;
– available;
– only few tools;
– no tools available; F – Findability; A – Accessibility; I – Interoperability; R – Reusability (see Table S5 and Figure S1); Down – downloadable via Web (W), FTP (F), and API (A); Click – data downloadable with one click: none (no), one dataset (one), selected data (sel), or all data (all); re3da –
re3data; Fshare –
FAIRsharing.org; DBcom–
Database Commons; elexir–
ELIXIR bio.tools; NAR –
NAR Database list; for VVR: just one of the seven VVR resources is listed.
2.1. Knowledge databases
These databases play a crucial role in research, facilitating knowledge transfer and providing foundational information. Note that these sources do not contain sequences themselves but link to external sequence records; they serve instead as repositories of centralized knowledge on viruses.
One of the fundamental tasks in virology is establishing a robust taxonomy to facilitate effective comparison and study of viruses. The
International Committee on Taxonomy of Viruses (ICTV) is the global organization tasked by the
International Union of Microbiological Societies (
IUMS) with developing, refining and maintaining the virus taxonomy down to the level of species [
30,
31,
60]. As of May 2023, the virus taxonomy curated by the
ICTV (version: MSL38) comprises 264 families, 2,818 genera, and 11,273 species. The taxonomy is periodically updated, with revisions released at least once and up to twice a year. New entries are proposed by the scientific community and reviewed by expert subcommittees within the
ICTV. The categorization of virus groups is based on various characteristics, such as genetic material, genome organization, replication strategy, and host range. Notably, the
ICTV has recently embraced the inclusion of virus groups based solely on sequence information, departing from the traditional reliance on virus morphology. Users can download the entire taxonomy as an
Excel sheet or browse it in the
visual browser. In our opinion, the website would benefit from a clearer structure. The provided "How-to Videos" are a valuable resource that helps users navigate the taxonomy effectively.
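The family, genus, and species totals quoted above can be recomputed from a downloaded copy of the taxonomy. The following is a minimal sketch, assuming the Excel sheet has been exported to CSV; the column names (`Family`, `Genus`, `Species`) and the toy rows are hypothetical placeholders and do not reproduce real MSL38 content.

```python
import csv
import io

# Toy stand-in for an exported taxonomy table. Column names and rows are
# hypothetical placeholders, not real ICTV data.
TOY_TAXONOMY_CSV = """Family,Genus,Species
FamilyA,GenusA1,Species a1
FamilyA,GenusA1,Species a2
FamilyA,GenusA2,Species a3
FamilyB,GenusB1,Species b1
"""


def count_ranks(csv_text, ranks=("Family", "Genus", "Species")):
    """Return the number of distinct non-empty names for each rank."""
    seen = {rank: set() for rank in ranks}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for rank in ranks:
            name = (row.get(rank) or "").strip()
            if name:  # skip blank cells, e.g. ranks that are not assigned
                seen[rank].add(name)
    return {rank: len(names) for rank, names in seen.items()}


rank_counts = count_ranks(TOY_TAXONOMY_CSV)
print(rank_counts)
```

Counting distinct names per column, rather than rows, matters because every species row repeats its family and genus.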
The
ViralZone is a powerful and up-to-date online encyclopedic database that provides summarized expert knowledge on various aspects of viruses, including genomic structure, virus replication cycle, host range, virus taxonomy, and molecular biology [
32]. Widely embraced by the research community, it has become a prevalent and trusted resource for obtaining information about specific viruses, serving as a key starting point for addressing novel research questions. In total, it houses detailed descriptions of over 128 families, 567 genera, and 7 virus species (e.g. Influenza A virus and SARS coronavirus 2). Every entry within the database comprises a fact sheet that presents visual representations of the virion and genome, alongside comprehensive details concerning gene expression and replication mechanisms. While the
ViralZone itself does not contain sequences in bulk download form, it does provide links to protein and nucleotide sequences of reference sequences within the fact sheet. It is a structured, user-friendly and well-connected website where users and especially virologists can quickly find the information for which they are looking.
The
VIPERdb, a specialized database dedicated to icosahedral virus capsid structures, offers a wealth of information derived from both structural and computational analyses [
33,
34]. The database provides diverse visualizations on various levels, multiple sequence alignments, relevant publications, and useful tools such as anomaly analysis and contact finder. Each of the 1,332 structures is linked to its respective protein sequence on
PubMed. The search functionality allows users to explore structures based on taxonomic classifications or specific criteria. One limitation of
VIPERdb is its focus on icosahedral virus capsid structures. However, there are ongoing efforts to expand its scope and some helical structures are already included. With its latest release,
VIPERdb has introduced a new standalone database on its website, namely
Virus World. This comprehensive database encompasses information on 181,476 viruses belonging to 158 families.
Virus World also provides the capsid protein sequences for these viruses in some instances. In our opinion, the search function of
VIPERdb could be improved as users must have prior knowledge of their virus of interest before using it. Please note that there was an additional database known as
VIPR, which has recently been incorporated into the
BV-BRC (
Bacterial and Viral Bioinformatics Resource Center) as described below.
The
Virus-Host DB is a comprehensive and manually curated database that links viruses and hosts using pairs of
NCBI taxonomy IDs [
35]. It includes viruses with complete genomes stored in
NCBI/RefSeq and
GenBank, with the accession numbers listed in
EBI Genomes. Host information is collected from various sources, including
RefSeq,
GenBank,
UniProt, and
ViralZone, and supplemented with additional data obtained from literature surveys. The database offers comprehensive information on 15,179 virus species, encompassing scientific names, lineages, Baltimore groups,
RefSeq sequences, database links, and details of 3,791 hosts, enabling users to investigate interactions from both virus and host viewpoints. The database is well interconnected, making it user-friendly and valuable for obtaining an overview of interactions. The database contains a limited amount of information, as it focuses on providing specific linkages rather than comprehensive data.
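Because the Virus-Host DB is built on pairs of taxonomy IDs, looking up interactions from the virus side and from the host side amounts to indexing the same pairs in two directions. The sketch below illustrates this idea only; the function name `build_indexes` and all IDs are hypothetical placeholders, not real NCBI taxonomy IDs and not the database's own implementation.

```python
from collections import defaultdict

# (virus_id, host_id) pairs. All IDs are made-up placeholders.
PAIRS = [
    (1001, 2001),
    (1002, 2001),
    (1002, 2002),
]


def build_indexes(pairs):
    """Index virus-host pairs in both directions."""
    hosts_of = defaultdict(set)    # virus ID -> set of host IDs
    viruses_of = defaultdict(set)  # host ID -> set of virus IDs
    for virus_id, host_id in pairs:
        hosts_of[virus_id].add(host_id)
        viruses_of[host_id].add(virus_id)
    return hosts_of, viruses_of


hosts_of, viruses_of = build_indexes(PAIRS)
print(sorted(hosts_of[1002]))    # hosts recorded for virus 1002
print(sorted(viruses_of[2001]))  # viruses recorded for host 2001
```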
2.2. Databases containing virus sequences
Genomic, transcriptomic, and proteomic virus sequences serve as a foundational element for a wide range of virus bioinformatics analyses. For example, phylogenetic analysis typically starts with a multiple sequence alignment of a collection of sequences. For host sequences, we refer to additional -omics databases (see below). Sequence databases serve as a critical starting point for examining genetic variations and functional components. When working with all or a significant portion of the sequences from a database for further analyses, e.g. in virus ecology, it is important to be aware of the imbalance in virus representation: the composition of virus sequences within a database does not reflect the natural occurrence of viruses.
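One common way to limit the effect of such imbalance in downstream analyses is to cap the number of sequences kept per taxon. The sketch below is one possible approach under stated assumptions, not a method prescribed here: the function name `cap_per_taxon`, the record layout, and the toy data are hypothetical. Capping only reduces over-representation within the sample; it does not recover the natural abundance of viruses.

```python
import random
from collections import Counter, defaultdict


def cap_per_taxon(records, max_per_taxon, seed=0):
    """Keep at most `max_per_taxon` randomly chosen records per taxon.

    `records` is a list of (sequence_id, taxon) tuples.
    """
    by_taxon = defaultdict(list)
    for seq_id, taxon in records:
        by_taxon[taxon].append(seq_id)

    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    kept = []
    for taxon in sorted(by_taxon):
        ids = by_taxon[taxon]
        if len(ids) > max_per_taxon:
            ids = rng.sample(ids, max_per_taxon)
        kept.extend((seq_id, taxon) for seq_id in ids)
    return kept


# Toy data: one heavily sequenced taxon and one sparsely sequenced taxon.
records = [(f"seqA{i}", "taxon A") for i in range(90)]
records += [(f"seqB{i}", "taxon B") for i in range(10)]

before = Counter(taxon for _, taxon in records)
balanced = cap_per_taxon(records, max_per_taxon=10)
after = Counter(taxon for _, taxon in balanced)
print(before, after)
```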
The
Bacterial and Viral Bioinformatics Resource Center (BV-BRC) is a recently merged platform that integrates various
NIAID-funded pathogen-related resources, including the
Virus Pathogen Resource (
ViPR), the
Influenza Research Database (
IRD) and
PATRIC (the
Bacterial Bioinformatics Database and Analysis Resource). With diverse computational tools, the
BV-BRC empowers researchers to analyze and interpret large genetic datasets originating from
NCBI GenBank and
RefSeq (see below) as well as specific projects. Users can search, browse, download, and analyze a multitude of data types, including metadata, taxonomy, genomes, features (ORFs), proteins, protein structures, domains and motifs, epitopes, and experimental data. It also offers a private workbench for the secure analysis and storage of private datasets. The
BV-BRC integrates datasets from mainly pathogenic bacteria, archaea, viruses, and eukaryotes. Tools and services are categorized into genomics, phylogenomics, protein analysis, metagenomics, transcriptomics, and utilities. The
BV-BRC provides access to 295,306,161 virus sequences including 9,763,946 genomes/segments, representing 106 virus families, 1,946 genera, and 24,824 species. Additionally, there are 480,376,932 protein sequences. The
BV-BRC has categorized plasmids under the virus category, resulting in the inclusion of plasmid sequences within the overall sequence count. Note that the number of species is higher than that of the official
ICTV number because the
BV-BRC includes unclassified taxa. Despite the complexity of the platform, efforts have been made to maintain user-friendliness and visual accessibility. Workshops and training opportunities are provided regularly to enhance user proficiency in utilizing the database effectively.
The recently established
NCBI Virus interface is a consolidation of various NCBI resources:
NCBI Viral Genomes (a former version of
NCBI Virus),
NCBI Nucleotide (selected for taxonomic classification to viruses), including
RefSeq,
GenBank,
Virus Variation Resource, and the old resource
NCBI Retroviruses [
37]. Virus genome sequences are submitted by users into the public sequence repositories which are part of the
International Nucleotide Sequence Database Collaboration (
INSDC). The
INSDC collaboration is composed of three organizations: the
National Center for Biotechnology Information (
NCBI)
GenBank, the
EMBL-EBI European Nucleotide Archive (
ENA) and the
DNA DataBank of Japan (
DDBJ). These three repositories contain the same data, ensuring data consistency across platforms. The sequences from these repositories are frequently utilized as a starting point by other databases, which then apply diverse analyses or visualizations to further explore the data.
NCBI Virus is regularly updated, with the core component being the
GenBank and
RefSeq sequences and well-curated metadata, and additional features being new analysis or visualization functionalities. As of June 2023, 11,345,662 virus nucleotide and 52,734,161 virus protein sequences are accessible through
NCBI Virus. These sequences are linked to the
NCBI Nucleotide database, providing extensive metadata such as organism, host, taxonomy, publication, organization (e.g. ORF or domains), as well as the corresponding nucleotide or protein sequences. The number of species in the
NCBI Virus database, which is 52,414, surpasses the official
ICTV count due to the inclusion of unclassified taxa. However, it is important to note that within the broader
NCBI GenBank database, there are instances of erroneous sequences that can potentially contribute to false-positive results in analyses (see
Section 3 for more details).
NCBI offers the
Reference Sequence (
RefSeq) collection as a comprehensive, integrated, and well-annotated dataset containing diverse data types, including 19,975 nucleotide sequences and 710,847 protein sequences of viruses. These high-quality sequences are extensively utilized by the scientific community.
The search interface is user-friendly and the results are filterable by a wide range of curated metadata, such as taxonomy, length, completeness, host, submitter, genome molecule type, and date. Users can perform a BLAST or keyword search, with example searches such as "all viruses" or "bacteriophages". Based on our experience, datasets containing up to 100,000 sequences can be readily downloaded, offering users a range of options to select from. Additionally, users can conveniently create their own custom
FASTA headers. Several tools are available to perform alignments or phylogenetic analyses with selected sequences. Over the years,
NCBI Virus resources have evolved, offering enhanced functionality. Compared to the older
NCBI Viral Genomes database [
38],
NCBI Virus is more organized, functional, and visually appealing. The
Virus Variation Resource (
VVR) covers seven viruses (Influenza Virus, Dengue Virus, Zika Virus, Rotavirus, West Nile Virus, MERS coronavirus, and Ebolavirus). Only the sub-database for Influenza Virus provides extra functionality, such as an annotation tool.
NCBI SARS-CoV-2 Resources is another specific virus database for COVID-19 only. In summary,
NCBI Virus serves as the go-to resource when working with
NCBI virus-related data, offering a visually appealing and user-friendly interface to virus sequence data and metadata.
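The custom FASTA headers mentioned above combine selected metadata fields into the header line of each record. The following is a minimal offline sketch of that idea; the function name `format_fasta`, the field names, the `|` separator, and the example record are hypothetical and do not mirror the options offered by the NCBI Virus interface.

```python
def format_fasta(record, fields=("accession", "species", "host", "collection_date"),
                 width=70, sep="|"):
    """Format one record as FASTA with a header built from chosen metadata fields.

    Missing or empty fields are written as "NA" so that every header has the
    same number of columns.
    """
    header = sep.join(str(record.get(field) or "NA") for field in fields)
    sequence = record["sequence"]
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Made-up record for illustration only.
example = {
    "accession": "ACC0001",
    "species": "Example virus",
    "host": None,
    "collection_date": "2023-01",
    "sequence": "ACGT" * 20,  # 80 nucleotides
}

fasta_text = format_fasta(example, width=60)
print(fasta_text, end="")
```

Keeping the field order fixed and filling gaps with a placeholder makes the headers easy to split again in later analysis steps.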
The
Reference Viral Database (RVDB) comprises a comprehensive collection of nucleotide sequences, encompassing viral, virus-related, and virus-like sequences (excluding bacterial viruses) [
39]. The database provides two versions: an unclustered and a clustered version based on sequence similarity. Researchers can conveniently download all sequences, although it should be noted that the large file size (approximately 20 GB) may result in longer download times. The
RVDB, a curated subset of
GenBank, is preferred by researchers for bioinformatics analyses due to its comprehensive sequence coverage and ongoing curation efforts. The database is designed with simplicity in mind, offering user-friendly functionalities. Additionally, the database provides a
BLAST tool for performing sequence searches, further enhancing its usability for various research needs.
The
Virus Orthologous Groups Database (VOGDB) is a regularly updated database that is based on
RefSeq virus genomes, providing a comprehensive representation of viral lineages in Virus Orthologous Groups (VOG) for comparative virus (meta-)genomics. The
VOGDB contains 10,327 sequences from 10,327 species, grouped into 30,218 virus-specific VOGs, allowing for multiple assignments of the same sequence to different VOGs, reflecting the small functional parts of the genome typically represented by a VOG. While
VOGDB currently supports searching for VOGs and provides taxonomic information, direct downloading of a single VOG is not available. Instead, users can access fileshare platforms or choose from 11 different (compressed) file formats for their downloads, an arrangement that may appear slightly disorganized. Surprisingly, no publication on
VOGDB has appeared to date.
The
Virxicon is a centralized knowledge base gathering information about viruses and their associated sequences [
40].
Virxicon is a database that maintains the
ICTV virus taxonomy, incorporating virus sequences from the
NCBI Viral Genomes database and
GenBank, and annotating them based on the Baltimore classification system. The database comprises a total of 599,538 sequences (the website statistics were not retrievable for the numbers of families, genera or species represented). In their research paper, the authors compare
Virxicon with other databases, such as
ViralZone,
NCBI Virus, and
ViPR (now
BV-BRC), aiming to combine the strengths of these databases.
Virxicon facilitates the bulk download of virus sequences with searchable, well-curated metadata, namely Baltimore class, molecular types, and topological resources. However, it is our impression that the database does not provide unique functionality compared to other virus databases.
NCBI Virus and
BV-BRC provide a larger number of sequences and more extensive functionality related to sequences, including search and tools, while
ViralZone serves as a more comprehensive lexicon including virus sequence download and curated simple metadata. The
Virxicon website offers an intuitive and user-friendly interface, providing search and easy access to information.
The
ZOVER, a comprehensive database of zoonotic and vector-borne viruses, aims to integrate virological, ecological, and epidemiological information to enhance understanding of animal-associated viruses and their significant impact on human and animal health [
41,
42,
43].
ZOVER is a valuable resource, offering a curated subset of
NCBI GenBank data and data manually collected from published literature, focused on four specific hosts: bats, rodents, mosquitoes, and ticks.
ZOVER resulted from merging the Database of Bat-associated Viruses and the Database of Rodent-associated Viruses. The
ZOVER database includes 64,289 sequences, combining both protein and nucleotide sequences, making it challenging to differentiate them individually.
ZOVER presents data in a well-organized, visualized and user-friendly manner, providing a comprehensive and visually appealing platform for accessing information.
ZOVER offers a valuable tool for researchers in the field, as it provides curated and easily accessible data, enhances data visualization, and offers a user-friendly interface for efficient exploration and analysis. Users can easily navigate the database using taxonomy-based searches or various search options, including sequence-based, text-based or region-based.
2.3. Omics databases
The emergence of databases dedicated to -omics data and analyses represents a remarkable advancement in the field of virology. These specialized resources go beyond traditional databases, providing a next-level platform for researchers to delve into the vast realm of -omics data sets and unlock hidden viral treasures. By focusing on -omics data, which encompass various ’-omics’ disciplines such as genomics, metagenomics, transcriptomics, and proteomics, these databases offer a comprehensive view of the viral world at a molecular level. These databases serve as central hubs for storing, organizing, and analyzing -omics data sets, enabling researchers to explore uncharted territories and uncover previously unknown virus species.
The online platform
Integrated Microbial Genomes/Virus (IMG/VR) provides its own
geNomad analysis workflow, with which
IMG/VR systematically identifies viral sequences from user-contributed and publicly available datasets, providing researchers with a comprehensive collection of 15,677,623 Uncultivated Viral Genomes (UViGs) [
44,
45] with different levels of confidence for download and analysis. The resource incorporates data from the metagenomic and metatranscriptomic
JGI database IMG/M,
RefSeq database, and three specific virus databases, while enhancing the annotation process with genome quality estimation, up-to-date taxonomic classification, and microbial host taxonomy prediction.
IMG/VR offers users a comprehensive platform with abundant information, analysis tools, and links to sub-databases, including detailed meta-information and statistics for each virus. However, we found its navigation and accessibility challenging; users need substantial time to become acquainted with its features. The 15,677,623 putative viral sequences are categorized into viral genomes and single-scaffold UViGs and organized into viral operational taxonomic units (vOTUs) across various viral families, genera, and species. Despite its importance in the study of uncultivated viral genomes (UViGs), users should be mindful of
IMG/VR’s complexity and lack of user-friendliness, requiring a login for full functionality and demanding considerable effort to effectively explore and utilize all available features.
The
Multi-omics Portal of Virus Infection (MVIP) collects and analyzes virus infection-related high-throughput sequencing data, integrating comprehensive meta-information [
46]. It enables -omics data analysis and visualization, presenting a summary table of samples for specific tissues and viruses. Users can access detailed datasets, including differential expression, pathway enrichment, and alternative splicing, which are downloadable.
MVIP provides external resource links and allows user submissions for broader analyses and database enhancement.
MVIP offers valuable information and analysis for specific biosamples, driving advancements in the understanding of virus infection, and provides users with the opportunity to suggest biosamples for integration. Currently,
MVIP boasts a dataset comprising approximately 6,586 sequencing samples derived from 77 distinct viruses, such as SARS-CoV-2, SARS-CoV, DENV, ZIKV, and IAV, across 33 host species, including
Homo sapiens and
Mus musculus.
MVIP is a visually appealing database that provides comprehensive -omics data analysis capabilities. It serves both as a resource for analyzing existing data and as a knowledge base for researchers planning their own sequencing projects, offering insights into the availability of suitable datasets for specific research questions. Pages occasionally load with a short delay.
The
Viral Host Range Database (VHRdb) is a unique resource that consolidates experimental data on the range of hosts a virus can infect [
47]. Despite the wealth of host-range experiments conducted in laboratories, this valuable data is often inaccessible and underutilized. The
VHRdb is an online platform that centralizes experimental data on viral host ranges, allowing users to browse, upload, analyze, and visualize results. Currently, it contains 17,170 interactions between 776 viruses and 2,041 hosts from 20 datasets. Among the 776 viruses in the
VHRdb, 303 are linked to the
NCBI, representing 279 species from 25 families. The comprehensive overview of virus-host interactions is presented in a visually appealing table, categorizing the relationships into "No infection," "Intermediate," and "Infection." The
VHRdb provides extensive and helpful documentation, including Quick Start guides, to assist users in navigating and utilizing the database effectively. However, a limitation of the
VHRdb is its relatively limited representation of viruses, which are not evenly distributed across viral families because the database relies heavily on available studies that may not provide comprehensive coverage. To mitigate this, users can upload their own data for public access or private use. Despite these limitations, the existing studies are presented in an excellent visual form, allowing for straightforward interpretation and analysis.
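The three response categories used in the overview table lend themselves to simple per-virus summaries of host range. The sketch below shows one way to tally them; the function name `host_range_summary`, the row layout, and the toy interactions are hypothetical and are not taken from the VHRdb or its export format.

```python
CATEGORIES = ("No infection", "Intermediate", "Infection")

# Toy (virus, host, response) rows; names and outcomes are made up.
interactions = [
    ("virus 1", "host A", "Infection"),
    ("virus 1", "host B", "No infection"),
    ("virus 1", "host C", "Intermediate"),
    ("virus 2", "host A", "No infection"),
    ("virus 2", "host B", "Infection"),
]


def host_range_summary(rows):
    """Count tested hosts and responses per virus."""
    summary = {}
    for virus, _host, response in rows:
        if response not in CATEGORIES:
            raise ValueError(f"unknown response category: {response!r}")
        entry = summary.setdefault(
            virus, {"tested": 0, **{category: 0 for category in CATEGORIES}}
        )
        entry["tested"] += 1
        entry[response] += 1
    return summary


summary = host_range_summary(interactions)
for virus, entry in sorted(summary.items()):
    share = entry["Infection"] / entry["tested"]
    print(f"{virus}: {entry['Infection']}/{entry['tested']} hosts infected ({share:.0%})")
```

Reporting the number of tested hosts alongside the infected share guards against over-reading viruses that were tested on only a few hosts.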
2.3.1. Specific databases
In addition to comprehensive databases, there are numerous virus-specific databases available. If one is working on a specific virus, it is worthwhile to explore specialized databases dedicated to that particular species. Here is a brief list of potential databases that primarily focus on a single virus species. One exception is the
NCBI VVR database, which specifically addresses seven different viruses, as mentioned earlier. For coronaviruses, we rely on the following databases: (1)
GISAID, (2)
COVID-19 Data Portal, and (3)
Stanford Coronavirus Antiviral & Resistance Database (COVDB) [
48,
49,
50,
51,
52]. For HIV, we have the following databases listed in
Table 1: (1)
LANL HIV Database, (2)
EuResist, (3)
HIV Drug Resistance DB, and (4)
PSD [
53,
54,
55,
56,
59]. Due to our inability to access
EuResist by the time of submission, the investigation could not be conducted as extensively as with other databases.
2.3.2. Non-viral specific databases
In addition to virus-specific databases, there exist numerous databases that are also important for virus research in the broader fields of biology and genomics. We also suggest that interested users keep an eye on initiatives such as the
Global Core Biodata Resources which seek to identify invaluable, long-term resources for the life sciences.
UniProt [
61] provides a vast collection of protein sequences and functional insights, including those from viral sources, enabling researchers to unravel the molecular mechanisms and biological functions of viruses. The
Rfam database [
62], widely recognized and utilized, encompasses RNA families with detailed sequence alignments, secondary structures, and covariance models, while Pfam, now hosted by the
InterPro database, serves as an extensively employed resource, offering multiple sequence alignments and hidden Markov models for protein families [
63]. The
NCBI houses additional non-virus-specific databases, such as the
Gene Expression Omnibus, which serves as an international public repository for high-throughput functional genomic data sets, or the
Sequence Read Archive (
SRA), a valuable resource that provides access to biological sequence data, fostering reproducibility and enabling new discoveries through data set comparisons within the research community [
64]. Of note is the new
NCBI datasets browser (currently in beta version), which provides easily searchable access to different
NCBI databases and
NCBI Taxonomy via fact sheets.
Kyoto Encyclopedia of Genes and Genomes (KEGG) is a comprehensive biological database that represents molecular networks and pathways, facilitates analysis of genomic data, and integrates drug labels and disease databases, making it one of the most widely used resources in the field [
65,
66,
67]. The
miRBase is the central repository for microRNA (miRNA) [
68]. It enables users to search and browse entries representing hairpin and mature miRNA sequences. Entries can be retrieved by various criteria, and both sequence and annotation data are available for download. The database currently includes 320 precursors and 510 mature miRNAs related to viruses.
2.3.3. Other databases
Additionally, there exist virus-related online platforms that link together pre-existing tools, databases, and datasets. These websites serve as valuable resources for researchers and practitioners seeking to leverage existing resources and foster collaboration within the scientific community. By linking together disparate resources, these platforms contribute to the dissemination and accessibility of scientific information, promoting efficient utilization of available resources for further research and innovation. One example is the
European Virus Bioinformatics Center (EVBC) website, on which a total of 275 entries are linked, sorted by software type (such as database, command-line tool, or similar), virus family, or functionality [
69,
70]. Another example is
iVirus.us, which provides a platform to access 27 tools and 21 datasets [
71,
72].
2.3.4. FAIR evaluation
Many virus databases aim to support the (re)use of virus data and enable processing using machine-learning methods. Both goals can be facilitated by adopting the FAIR principles. We therefore included an evaluation of FAIR properties in our database overview using the
FAIR principles checklist. Where a virus database had a table featuring one virus per row, the entries were evaluated as research objects (for the data sources of the FAIR evaluation of the databases, see Table S5). Where available, a virus sequence was considered to be "data". Note that some databases in the list were therefore excluded from the FAIR evaluation because of a lack of a comparable research object. The databases that did not have comparable research objects were
NCBI Viral Genomes (due to being a central website linking to different resources) and the
EuResist database (to which we did not have access by the time of submission). The FAIR scores are based on presence/absence for each of the checklist criteria as has been done previously in the context of data deposition of nuclear magnetic resonance data [
73]. The scores are out of four for the subcriteria in Findability, Accessibility and Reusability, and out of three for those of Interoperability. A more complete description of the FAIR principles checklist can be found in Figure S1.
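The presence/absence scoring described above can be sketched in a few lines. This is an illustration only, not the evaluation script used for Table S5: the function name `fair_scores`, the dictionary layout, and the example answers are hypothetical. Only the maxima (four subcriteria each for F, A, and R, and three for I) come from the text.

```python
# Number of checklist subcriteria per FAIR principle, as stated in the text.
MAX_SCORE = {"F": 4, "A": 4, "I": 3, "R": 4}


def fair_scores(checklist):
    """Turn presence/absence answers into one score per FAIR principle.

    `checklist` maps each principle letter to a list of booleans, one per
    subcriterion (True = criterion fulfilled).
    """
    scores = {}
    for principle, maximum in MAX_SCORE.items():
        answers = checklist[principle]
        if len(answers) != maximum:
            raise ValueError(
                f"{principle}: expected {maximum} answers, got {len(answers)}"
            )
        scores[principle] = sum(bool(answer) for answer in answers)
    return scores


# Made-up answers for a hypothetical database.
example_checklist = {
    "F": [True, True, False, True],
    "A": [True, True, True, True],
    "I": [False, True, False],
    "R": [True, False, False, True],
}

example_scores = fair_scores(example_checklist)
for principle, score in example_scores.items():
    print(f"{principle}: {score}/{MAX_SCORE[principle]}")
```

Because Interoperability has one subcriterion fewer, scores are best compared as fractions of their maximum rather than as raw counts.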
In general, the FAIR scores of the content of the active databases reviewed here (summarized in
Table 1, with the full table available in Table S5) ranged from less FAIR for the smaller or older databases to more FAIR for the larger and newer databases. An important component of the Findability score is the assignment of a database-given global and persistent identifier; while the large platforms such as
BV-BRC and
IMG/VR featured this, the smaller databases such as
HBVdb often used an external ID, e.g. the
NCBI Accession ID or TaxIDs. This might be due to the differing aims of the virus databases as some are focused on data reuse and machine-readability while others may have simpler goals such as cross-linking available knowledge. Accessibility for the databases was generally positive owing to web-accessible links and straightforward download options (see also Table S3). Further, the overall low score for Interoperability reflects the lack of standards for all virus metadata; while there exist clear ontologies, e.g. for clinical data (as for the
HIV drug resistance DB) or for pathogenic virus metadata (see the
Genomic Standards Consortium (GSC)) this is not yet the case for metadata for all viruses. This is currently a target for various groups such as the
GSC (which is responsible for the Minimum Information about any (x) Sequence (MIxS) standards used by the
INSDC repositories), the
Gene Ontology consortium, the
Genomes Online Database (GOLD), which complements the IMG databases of the Joint Genome Institute (
JGI), and other efforts such as that of Bernasconi
et al. with the Viral Conceptual Model [
74,
75]. This shows that community-wide metadata standards are poised to improve interoperability in the near future. Last, the Reusability of many of the virus databases would benefit from the inclusion of formal licenses describing the reuse of their data (see
Choose A License). Overall, this FAIR evaluation was a first for virus databases and highlighted several areas for improvement.
2.4. Catalogs of databases
To assist users in selecting appropriate databases, scholarly journals and other entities have established catalogs that index databases according to various criteria to improve their findability and accessibility. Here we describe five catalogs of databases: (1)
re3data.org, (2)
FAIRsharing [
24], (3)
The Database Commons [
25], (4)
ELIXIR bio.tools [
26], and (5)
NAR database list [
27], see
Figure 1. We analyzed a range of entries, narrowing them down to virus-specific databases and categorizing them by their up-to-date status and relevance to COVID-19, while excluding non-virus databases that did not meet the criteria.
The
re3data.org website is a web-based registry that facilitates data discovery, access, and sharing for researchers. Its comprehensive metadata on data repositories allows researchers to identify repositories that align with their specific data management needs. The platform has a particular focus on the FAIR principles. There are 3,125 entries on this platform, of which 2,181 are databases or scientific and statistical data formats in terms of content types. Among them, there are 186 virus-related entries identified using the search term "virology," of which only 24 are virus-specific, and 17 are considered up-to-date. Nine of these databases have been extensively described in our curated
Table 1, seven are dedicated to coronavirus research (see Table S4), and one database, namely
WestNile.ca.gov, was excluded due to its narrow focus.
The online platform
FAIRsharing is designed to enhance the visibility of scientific data standards, databases, and policies for the scientific community. The platform includes a registry of data standards, databases, policies, collections, and organizations that details each resource, such as its scope, history, and adoption status. In total, 3,888 entries are listed in the registry, of which 2,032 are repositories or knowledge-bases. Among these, 112 are virus-related (identified using the keyword "virology"). However, only 62 of these resources are virus-specific, and only 41 are up-to-date. These 41 resources can be further classified into 13 listed in our curated
Table 1, 24 related to coronavirus data (see Table S4), and 4 we excluded: (1)
HIV Drug Interactions, (2)
HEP Drug Interactions, (3)
Global.health, and (4)
HIV and COVID-19 Registry in Europe. These databases were excluded due to their restricted focus, such as focusing only on drug interactions of a particular virus or containing primarily epidemiological data, which did not align with our definition of a comprehensive virus database. Additionally, one of the databases resembled more of a network than a traditional database.
The
Database Commons is a curated catalog of biological databases that organizes them by data type, species, and subject matter. It provides detailed metadata for each database, including name, URL, description, hosting institution, and contact information. Within the
Database Commons, there are currently 5,902 entries listed. Among them, 355 databases fall under the "Data Object" virus category. Of these, 146 are virus-specific, and 36 are considered up-to-date. These 36 databases can be further categorized as 17 listed in our curated
Table 1, 16 coronavirus databases (see Table S4), and 3 other databases (
Disease Monitoring Dashboard,
RID, and
Virus-CKB). The additional databases were excluded due to their specific nature, such as being more tool-oriented or containing limited data with only two tables rather than meeting the criteria of a comprehensive virus database.
A comprehensive registry of bioinformatics resources is established through a community-driven curation effort supported by
ELIXIR, a European infrastructure for life-science information.
ELIXIR bio.tools serves as the dedicated registry within this infrastructure, ensuring the sustainable upkeep of the curated information [
26]. Collaborative curation, tailored to local needs and facilitated by a network of partners, enables the continuous development and accessibility of this valuable resource. In total, there are over 28,211 resources listed in the registry, including various tools. Among them are 3,664 databases, and a search using the keyword "Virology" identified 44 databases in this category. Out of these, 42 databases are virus-specific, with 11 being up-to-date. Eight of these virus-specific databases are included in our curated
Table 1. Additionally, we have identified two up-to-date coronavirus-specific databases and one particular database, namely the
United States Swine Pathogen Database, which we excluded.
To our knowledge, the Nucleic Acids Research Journal Database Summary Issue (
NAR) is the oldest known list of databases. Published annually, it provides descriptions of new and updated databases that contain nucleic acid and protein sequences and structures [
27]. The
NAR provides the links for these databases at the
Molecular Biology Database Collection. These databases are categorized into genomics, transcriptomics, proteomics, metabolomics, and structural biology. Presently, it includes a total of 1,965 databases. Each database is described in detail, including its scope, content, features, relevant citations, and links to access the resource. The most recent issue from January 2023 lists 32 databases in the "virus genome database" category. Among them, 9 are considered up-to-date and included in
Table 1.
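The counts reported for the first four catalogs can be cross-checked with simple arithmetic: for each catalog, the up-to-date databases should equal the sum of those listed in Table 1, the coronavirus databases, and the excluded ones. The sketch below uses only numbers stated in the text; the dictionary layout and the function name `check_catalogs` are hypothetical. NAR is left out because no separate virus-specific count is given for it.

```python
# Counts as reported in the text for each catalog.
CATALOGS = {
    "re3data.org": {
        "virus_related": 186, "virus_specific": 24, "up_to_date": 17,
        "breakdown": {"Table 1": 9, "coronavirus": 7, "excluded": 1},
    },
    "FAIRsharing": {
        "virus_related": 112, "virus_specific": 62, "up_to_date": 41,
        "breakdown": {"Table 1": 13, "coronavirus": 24, "excluded": 4},
    },
    "Database Commons": {
        "virus_related": 355, "virus_specific": 146, "up_to_date": 36,
        "breakdown": {"Table 1": 17, "coronavirus": 16, "excluded": 3},
    },
    "ELIXIR bio.tools": {
        "virus_related": 44, "virus_specific": 42, "up_to_date": 11,
        "breakdown": {"Table 1": 8, "coronavirus": 2, "excluded": 1},
    },
}


def check_catalogs(catalogs):
    """Check each breakdown against its total and compute up-to-date shares."""
    consistent = {}
    share_up_to_date = {}
    for name, counts in catalogs.items():
        consistent[name] = sum(counts["breakdown"].values()) == counts["up_to_date"]
        share_up_to_date[name] = counts["up_to_date"] / counts["virus_specific"]
    return consistent, share_up_to_date


consistent, share_up_to_date = check_catalogs(CATALOGS)
for name in CATALOGS:
    print(f"{name}: breakdown consistent = {consistent[name]}, "
          f"up-to-date share of virus-specific = {share_up_to_date[name]:.0%}")
```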
In conclusion, despite the availability of database catalogs that assist researchers in finding relevant resources, there are still challenges and limitations to address. These catalogs lack virus-specific content and often do not reflect the current status or usability of the databases. Furthermore, there is a need for better metadata standardization and information on the reliability and quality of the databases. Although these catalogs serve as a starting point, they may not provide comprehensive and detailed information for researchers to make informed decisions about utilizing the databases effectively.