1. Introduction
Pathogen genomics surveillance laboratories generate microbial sequence data that can be used in a variety of ways. Examples include the detection and resolution of outbreaks, development of vaccines and diagnostic tests, understanding microbial evolution including antimicrobial resistance and virulence mechanisms, detection of zoonotic events and patterns of transmission, source attribution, and more (Petrillo et al., 2022; Robinson et al., 2013; Munnink et al., 2021; World Health Organization, 2022; Cook, 2021; Hendriksen et al., 2019; Brown et al., 2021). The quality of sequence datasets greatly impacts their utility, the interpretation of analytical results, and the confidence with which decisions can be made based on those interpretations (Rick et al., 2022; Smits, 2019). The quality of sequence data can vary for many reasons, including, but not limited to, low concentrations of starting materials, expired reagents, deviations from ideal sample handling and storage conditions, errors during library preparation, overloading/underloading of flow cells in the case of long-read sequencing techniques, and contamination within and between sequencing runs (Gargis et al., 2016; Rossen et al., 2018). Quality control metrics of raw reads depend on many factors, such as the depth and breadth of coverage of generated reads compared to a reference, the presence of reads from another source (e.g., previous run, host organism, sample contamination), and the number and density of flow cell clusters in a run (Wagner et al., 2021). Quality control metrics and thresholds often differ across laboratories and surveillance networks. While public health laboratories generate and release a vast number of high-quality sequences, there will often be a proportion of datasets that fall just short of a set of prescribed baseline quality control metrics. These datasets are then excluded from many types of public health analyses and are often not publicly released.
In these cases, the issues associated with these borderline or lower-quality datasets have often been identified, and the datasets can still provide important surveillance insights or information on real-world test performance. Conversely, low-quality datasets are sometimes included in public repository submissions without being flagged, which creates issues for laboratories using the data.
As pathogen sequencing is increasingly used routinely in public health laboratories for the surveillance of different pathogens, and programs are continually developed and expanded, laboratories must carry out robust optimization of wet- and dry-lab procedures (Carrillo & Blais, 2021). Lower-quality datasets are highly useful for optimizing, validating, verifying and benchmarking the performance of algorithms, pipelines and instruments, as well as training new personnel (Figure 1). An example of the utility of high- and low-quality datasets can be seen in Xiaoli et al. (2022), in which SARS-CoV-2 Nanopore/Illumina read datasets generated from public health genomic surveillance were shared as a collection to support benchmarking tools, understanding the genomic epidemiology of different lineages, and identifying variants of concern. The collection also contained a number of SARS-CoV-2 genomes of lower quality due to recognized errors and common sequencing failures (Xiaoli et al., 2022).
Sharing sub-optimal data can be useful for the broader public health and research community, particularly when the data is carefully annotated with known issues so that it is not mistaken for better-quality information and can be more easily identified in repositories. However, there are currently no standardized attributes for tagging poor-quality datasets, which prevents them from being easily searched and made accessible, and all but ensures that they are excluded from applications that require only high-quality data. Standardized fields and terms have previously proven useful in improving data harmonization and integration as well as communication and data sharing in SARS-CoV-2 surveillance (Griffiths et al., 2022; Lusignan et al., 2020), and the implementation of genomics contextual data (metadata) standards has been encouraged repeatedly in the community (Black et al., 2020; Gozashti & Corbett-Detig, 2021; Pettengill et al., 2021; Schriml et al., 2020; Stevens et al., 2020). As such, a set of standardized attributes to describe the properties, quality and purpose of microbial sequence datasets could make quality control results more explicit in public (and private) repositories.
The Public Health Alliance for Genomic Epidemiology (PHA4GE) is a global coalition of scientists focused on improving the reproducibility, interoperability, portability, and openness of public health bioinformatics software, data and expertise (https://pha4ge.org/). As part of its mission to improve interoperability and reproducibility, PHA4GE workgroups develop, share and promote consensus data specifications, in an effort to streamline and improve data structures across public health bioinformatics resources (tools, protocols, databases, platforms, and repositories). The overarching goal of this work is the development of an open software philosophy and ecosystem that will empower more stakeholders across global public health to analyze, manage and govern their own data, regardless of resource status. The Data Structures Working Group, tasked with assessing needs and developing these specifications, operates by consensus and comprises diverse perspectives and expertise, with members representing many different countries, organizations, microbial sequencing initiatives, and standards development efforts.
To address the challenges of sharing lower-quality datasets, PHA4GE has developed a set of standardized contextual data attributes (fields and terms known as “tags”) that can be included in public repository submissions as a means of flagging pathogen sequence data with known quality issues to increase their discoverability, and to facilitate their interpretation and appropriate reuse. The contextual data tags (attributes) were developed through a series of consultations with the public health microbiology research community, including input from the International Nucleotide Sequence Database Collaboration (INSDC), and staff from multiple national, regional and local public health institutions. The development of these tags, standardized using community-based resources known as ontologies, is expected to be an iterative and participatory process with input from users and subject matter experts from across the community. Ontologies are well-defined controlled vocabularies describing a domain, structured in a hierarchy where logical relationships link the terms, and the meanings of terms are disambiguated using persistent identifiers (Smith et al., 2007). Because ontologies are developed by community consensus, applying ontology-based attributes in publicly available data avoids the pitfalls and variability of institution-specific vocabulary and free text. The standardized tags are agnostic to the organism and sequencing technique used, and can be applied to data generated from any pathogen using an array of sequencing techniques. The list of standardized tags for quality control, introduced here, is maintained by PHA4GE and can be found at https://github.com/pha4ge/contextual_data_QC_tags. Recognizing that data needs can change over time, or that different use cases can require additional vocabulary, PHA4GE accepts suggestions for new tags from the community, which can be submitted using the New Term Request System on GitHub (see PHA4GE repository linked above).
As testing and implementation are key to the success and uptake of data specifications, PHA4GE partnered with the US Food and Drug Administration Center for Food Safety and Applied Nutrition (FDA CFSAN)’s GenomeTrakr program as a test use case and early adopter of the QC tags described in this work. GenomeTrakr is a US Food and Drug Administration (FDA)-led international pathogen surveillance network through which member labs submit sequence data and minimal contextual data in real-time for the purposes of tracking and identifying outbreaks (Timme et al., 2019). GenomeTrakr has focused on surveillance of foodborne pathogens for many years and, with the onset of the COVID-19 pandemic, has also expanded to metagenomic wastewater surveillance of SARS-CoV-2. Wastewater monitoring can provide an early warning of COVID-19 detection in a community or setting (e.g., watershed, institution). An early warning of even a few days can be critical to the success of public health interventions. Metagenomic surveillance of wastewater can be challenging due to the complex nature of samples; therefore, sharing information regarding quality control assessment is of utmost importance. The GenomeTrakr network has tested and now implements the PHA4GE QC contextual data tags as part of its routine submission process, and provides a worked example of how the tags can be customized by organizations and surveillance initiatives within the public health bioinformatics community.
The sharing of lower-quality datasets and their annotation using the contextual data tags described here will enable public health labs to make use of data that would have otherwise been discarded, and in some cases, side-step the need to generate synthetic data for representing different real-world scenarios. Using these tags will enable the community to more easily establish datasets for training and testing purposes (software and human). The inclusion of ontologized PHA4GE QC tags will also make datasets and quality control results FAIR (Findable, Accessible, Interoperable, and Reusable) (Wilkinson et al., 2016).
2. Methods
The members of PHA4GE are involved in many sequencing initiatives and surveillance networks, and as a result, have a broad collective experience in developing solutions to microbial bioinformatics challenges. Requests for standardized quality control tags were made to PHA4GE from members of the wider public health and research communities via direct communication and social media. The range and types of common quality control issues were identified through a survey via member networks. The proposed list of quality control tags was circulated for feedback within the PHA4GE community and was improved based on feedback. The QC attributes were then mapped to existing ontologies, and ontology terms were created for tags with no existing equivalent and made publicly available in the Genomic Epidemiology Ontology (GenEpiO, https://github.com/GenEpiO/genepio). Definitions were also developed, along with recommendations for their use in INSDC sequence submissions (in collaboration with INSDC representatives). The QC attributes were made publicly available on GitHub in October 2022, and included in a specially designed SRA submission form for pathogen sequence data (available on GitHub).
QC contextual data tags were reviewed and evaluated by GenomeTrakr scientists for tagging known quality control issues in wastewater metagenomics datasets used for SARS-CoV-2 surveillance. The fields were added to the prescribed GenomeTrakr submission requirements, along with additional values for GenomeTrakr-specific pipelines and analyses.
3. Results
Best Practices for Use
Below are a few simple recommendations for implementing the QC tags (also provided in the Field Reference Guide on GitHub).
Providing the name of the method used for quality control is very important for interpreting the rest of the QC information. A method name should always be included (do not include additional QC tags if no method name is provided).
Method names can be provided in the form of a name of a pipeline or a link to a GitHub repository. Multiple methods should be listed and separated by a semicolon.
Updates to a method can substantially change its outputs. The version of the method used for quality control should be included.
The method version can be expressed using whatever convention the developer implements (e.g., date, semantic versioning).
If multiple methods were used, record the version numbers in the same order as the method names. Separate the version numbers using a semicolon.
If a pick list does not contain a desired value, a new term request should be submitted to PHA4GE via the New Term Request form in the QC Tag GitHub repository issue tracker (described below under “Community Development and Maintenance”).
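The method name and version recommendations above can be checked mechanically before submission. The following is a minimal Python sketch under stated assumptions, not part of the PHA4GE specification: only the field name “quality_control_method_name” appears in this work, while the version field name `quality_control_method_version`, the helper functions, and the example pipeline names and versions are hypothetical and used for illustration only.

```python
def split_values(value):
    """Split a semicolon-separated field into trimmed, non-empty items."""
    return [item.strip() for item in (value or "").split(";") if item.strip()]


def check_qc_methods(record):
    """Return a list of problems found in a record's QC method fields.

    `record` is a plain dict of contextual data. The version field name
    used here is an assumption made for this sketch.
    """
    names = split_values(record.get("quality_control_method_name"))
    versions = split_values(record.get("quality_control_method_version"))
    # Any other populated field that looks like a QC tag (hypothetical prefix).
    other_qc = [
        key for key, value in record.items()
        if key.startswith("quality_control_")
        and key != "quality_control_method_name"
        and value
    ]
    problems = []
    if not names and other_qc:
        # QC tags should not be provided without a method name.
        problems.append("QC tags present without a method name")
    if names and len(versions) != len(names):
        # Versions are expected in the same order as the method names.
        problems.append(
            f"{len(names)} method name(s) but {len(versions)} version(s)"
        )
    return problems


def pair_methods(record):
    """Pair each method name with the version listed at the same position."""
    names = split_values(record.get("quality_control_method_name"))
    versions = split_values(record.get("quality_control_method_version"))
    return list(zip(names, versions))


# Hypothetical record: two methods, one semantic version and one date version.
example = {
    "quality_control_method_name": "pipelineA; pipelineB",
    "quality_control_method_version": "1.2.0; 2023-01-15",
}
print(check_qc_methods(example))
print(pair_methods(example))
```

Keeping names and versions as two parallel semicolon-separated lists mirrors the recommendations above; a check of this kind would only catch a mismatch in the number of entries, not an error in their order.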
Annotation Limitations and Considerations
The QC tags are intended to address issues pertaining to different types of sequencing techniques (single isolate or targeted sequencing, metagenomics). Not all tags apply to all techniques; where a tag is not appropriate, it should not be used. The tags are also intended to describe QC results of sequence data rather than downstream analytical results (e.g., raw reads, consensus sequences or assemblies rather than phylogenies or lineage determinations). Owing to the wide variety of quality control software available, and the differences in criteria and thresholds, the application of these attribute tags may be subjective and dependent on the QC processes performed. To better evaluate and interpret the QC determinations proposed, it is recommended that other information pertaining to QC be included in other contextual data fields not specified in this work (e.g., choice of reference genome), and that the tags be interpreted in light of the other methodological metadata included in the record (e.g., BioSample, Experiment/SRA contextual data). The controlled vocabulary attributes are intended for high-level triage purposes rather than capturing all methods in detail. However, information affecting the selection of one tag over another can also be included in the “quality_control_details” field. It is also important to note that the quality control tags refer to a particular sample obtained at one point in time, and not the comparison of a set of samples across time or from different tissues of the same host.
Community Development and Maintenance
While the initial list of standardized QC fields and values was developed by PHA4GE through community consultation, we recognize it will need to change over time. To ensure the list reflects current QC issues across pathogens and methods (e.g., sequencing techniques, bioinformatics analyses), a mechanism for requesting additional QC tags was created via the PHA4GE QC GitHub repository Issue tracker (New Term Request (NTR) form). A template for submitting new terms is available and community members are welcome to submit suggestions for new terms by following the instructions provided with the template. Suggestions will be evaluated and periodic updates to the list will be performed.
4. Discussion
A major goal of pathogen genomics surveillance programs is to produce high-quality data for use in public health analyses and decision-making. Owing to time, personnel and resource limitations, samples that yield sequences of borderline or slightly poorer quality often cannot be re-sequenced. This sequence data, while perhaps not suitable for surveillance or outbreak analysis, is still useful for the development of tools, the optimization and validation of quality frameworks and sequencing processes, as well as for bioinformatics training purposes. Useful datasets for testing and training purposes can include sequences containing contamination, low yields and/or low average genome coverage, shorter than expected read lengths, sequence amplification artifacts, low signal-to-noise ratio, and low coverage of characteristic mutations.
Due to the lack of standardized attributes in contextual data records, purposefully identifying sub-optimal quality datasets in public repositories is difficult. The PHA4GE Contextual Data QC Tag Specification provides a set of five fields which can be included as user-defined contextual data in public repository raw read sequence submissions. While PHA4GE encourages the use of these fields and terms in any repository, not all public repositories have the mandate or the ability to include user-defined attributes. The PHA4GE tags are implementable in submissions to the INSDC (in SRA (NCBI, DDBJ) and as “Experiment” contextual data in ENA). The tags have been used by the GenomeTrakr pathogen surveillance network to flag general quality control issues (or the lack thereof), as well as to provide additional quality control methods information.
The GenomeTrakr implementation demonstrates how the generic PHA4GE tags can be customized according to initiative-specific needs. GenomeTrakr adds standardized names of QC pipelines used by different data providers in the “quality_control_method_name” field, and has created other “quality_control_issues” tags that were subsequently added to the PHA4GE prescribed list, e.g., “low coverage of characteristic mutations”. We anticipate that as the tags are implemented for different organisms and initiatives, there may be other useful tags that should be included in the PHA4GE list. PHA4GE encourages feedback and suggestions from the community via the New Term Request form on GitHub. By sharing community needs and requests with PHA4GE in this way, we are able to work with ontology developers and public repository scientists to make new standardized vocabulary available through different channels. It is also possible to create different collections of specifications so that tags are honed for particular use cases. PHA4GE also recommends updating records with known quality control issues when possible.
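To illustrate how such tags could make flagged datasets searchable, the following minimal Python sketch selects records whose “quality_control_issues” field lists a given term. It is an illustration under stated assumptions rather than a description of any repository's search interface: the record structure, the `accession` key, the placeholder accessions, the value “hypothetical issue A”, and the use of semicolons to separate multiple issues are all assumptions; only the field name and the term “low coverage of characteristic mutations” come from this work.

```python
def find_flagged(records, issue):
    """Return accessions of records whose issues field lists `issue`.

    Matching is case-insensitive and ignores surrounding whitespace.
    Multiple issues per record are assumed to be semicolon-separated.
    """
    wanted = issue.strip().lower()
    hits = []
    for record in records:
        raw = record.get("quality_control_issues", "")
        issues = {part.strip().lower() for part in raw.split(";") if part.strip()}
        if wanted in issues:
            hits.append(record["accession"])
    return hits


# Hypothetical contextual data records with placeholder accessions.
records = [
    {"accession": "EXAMPLE_0001",
     "quality_control_issues": "low coverage of characteristic mutations"},
    {"accession": "EXAMPLE_0002",
     "quality_control_issues": ""},
    {"accession": "EXAMPLE_0003",
     "quality_control_issues": "hypothetical issue A; low coverage of characteristic mutations"},
]

print(find_flagged(records, "low coverage of characteristic mutations"))
```

Exact matching against a standardized term is what makes this kind of lookup reliable; with free-text descriptions, the same search would need to anticipate every spelling and phrasing variant.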
There are many elements to standardizing quality control, including specifying types of metrics and their parameters and thresholds, selecting and documenting tools and algorithms, prescribing different checkpoints in wet and dry lab processes, and so on. The PHA4GE Contextual Data QC Tag Specification does not delve into these more in-depth aspects; rather, the attributes act as quick, searchable, downstream flags for the overall outcomes of QC assessments. Further development of standardized QC language and harmonized QC thresholds for such nuanced aspects of QC frameworks is therefore needed. We hope that these simple tags will help improve communication around quality control in public repositories, as well as make datasets of variable quality easier to identify.
5. Conclusion
Availability and Requirements: The software used in this study is available on GitHub.
Project name: PHA4GE QC Contextual Data Tags Specification
Operating system: Platform independent.
Programming language: Not applicable.
Other requirements: None.
License: MIT License.
Funding
We wish to thank the Bill & Melinda Gates Foundation for supporting the establishment and work of the PHA4GE consortium. EJG and RC were funded by a Genome Canada CanCOGeN grant E09CMA. CIM was supported by the Fundação para a Ciência e Tecnologia (grants SFRH/BD/129483/2017 and COVID/BD/152583/2022). LC acknowledges funding from the MRC Centre for Global Infectious Disease Analysis (reference MR/R015600/1), jointly funded by the UK Medical Research Council (MRC) and the UK Foreign, Commonwealth & Development Office (FCDO), under the MRC/FCDO Concordat agreement and is also part of the EDCTP2 programme supported by the European Union.
Institutional Review Board Statement
Not applicable.
Acknowledgements
The authors would like to thank the many members of the bioinformatics community who volunteer their time to provide ongoing feedback and support to PHA4GE, without which this work would not be possible.
Conflicts of Interest
The authors declare that they have no competing interests.
List of Abbreviations
CFSAN, Center for Food Safety and Applied Nutrition; DDBJ, DNA Data Bank of Japan; DSWG, Data Structures Working Group; EMBL-EBI, European Molecular Biology Laboratory-European Bioinformatics Institute; ENA, European Nucleotide Archive; FAIR, Findable, Accessible, Interoperable, Reusable; FDA, Food and Drug Administration; GENEPIO, Genomic Epidemiology Ontology; INSDC, International Nucleotide Sequence Database Collaboration; NTR, New Term Request; OBO, Open Biological and Biomedical Ontology; PHA4GE, Public Health Alliance for Genomic Epidemiology; QC, quality control; SRA, Sequence Read Archive.
References
- Black, A.; et al. Ten recommendations for supporting open pathogen genomic analysis in public health. Nat. Med. 2020, 26, 832–841.
- Brown, B.; et al. An economic evaluation of the Whole Genome Sequencing source tracking program in the U.S. PLoS ONE 2021, 16, e0258262.
- Carrillo, C.D.; Blais, B.W. Whole-Genome Sequence Datasets: A Powerful Resource for the Food Microbiology Laboratory Toolbox. Front. Sustain. Food Syst. 2021, 5, 754988.
- Cook, S. Genomic surveillance in the roll out of vaccines. PHG Foundation. 2021. Accessed Jan 12 2023 https://www.phgfoundation.
- Gargis, A.S.; et al. Assuring the Quality of Next-Generation Sequencing in Clinical Microbiology and Public Health Laboratories. J. Clin. Microbiol. 2016, 54, 2857–2865.
- Gozashti, L.; Corbett-Detig, R. Shortcomings of SARS-CoV-2 genomic metadata. BMC Res. Notes 2021, 14, 189.
- Griffiths, E.; et al. Future-proofing and maximizing the utility of metadata: The PHA4GE SARS-CoV-2 contextual data specification package. GigaScience 2022, 11.
- Hendriksen, R.S. Using Genomics to Track Global Antimicrobial Resistance. Front. Public Health 2019, 7, 242.
- Lusignan, S.; et al. COVID-19 Surveillance in a Primary Care Sentinel Network: In-Pandemic Development of an Application Ontology. JMIR Public Health Surveill. 2020, 6, e21434.
- Munnink, B.B.O.; et al. Transmission of SARS-CoV-2 on mink farms between humans and mink and back to humans. Science 2021, 371, 6525.
- Musen, M. Demand standards to sort FAIR data from foul. Nature 2022, 609.
- Petrillo, M.; et al. A roadmap for the generation of benchmarking resources for antimicrobial resistance detection using next generation sequencing [version 2; peer review: 1 approved, 2 approved with reservations]. F1000Research 2022, 10, 80.
- Pettengill, J.B. Interpretative Labor and the Bane of Nonstandardized Metadata in Public Health Surveillance and Food Safety. Clin. Infect. Dis. 2021, 73, 1537–1539.
- Rick, J.A.; et al. Reference genome choice and filtering thresholds jointly influence phylogenomic analyses. bioRxiv 2022.
- Robinson, E.R.; et al. Genomics and outbreak investigation: from sequence to consequence. Genome Med. 2013, 5, 36.
- Rossen, J.W.A.; et al. Practical issues in implementing whole-genome-sequencing in routine diagnostic microbiology. Clin. Microbiol. Infect. 2018, 24, 355–360.
- Schriml, L.; et al. COVID-19 pandemic reveals the peril of ignoring metadata standards. Sci. Data 2020, 7, 188.
- Smith, B.; et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat. Biotechnol. 2007, 25, 1251–1255.
- Smits, T.H.M. The importance of genome sequence quality to microbial comparative genomics. BMC Genomics 2019, 20, 662.
- Stevens, I.; et al. Ten simple rules for annotating sequencing experiments. PLoS Comput. Biol. 2020, 16, e1008260.
- Timme, R.E.; et al. Utilizing the Public GenomeTrakr Database for Foodborne Pathogen Traceback. Methods Mol. Biol. 2019, 1918, 201–212.
- Wagner, D.D.; et al. Evaluating whole-genome sequencing quality metrics for enteric pathogen outbreaks. PeerJ 2021, 9, e12446.
- Wilkinson, M.; et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 2016, 3, 160018.
- World Health Organization. Global genomic surveillance strategy for pathogens with pandemic and epidemic potential, 2022–2032. Available online: https://www.who.int/initiatives/genomic-surveillance-strategy.
- Xiaoli, L.; et al. Benchmark datasets for SARS-CoV-2 surveillance bioinformatics. PeerJ 2022, 10, e13821.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).