Design
The pipeline is designed to evaluate multiple genome or transcriptome assemblies in parallel. The pipeline is divided into four major sections as shown in the flowchart in
Figure 1. Within each section, data are processed in parallel where possible. In section 1, the pipeline checks the input FASTA and GFF3 annotation files using
py_fasta_validator, SekQit
rmdup and GenomeTools
gt gff3validator (Edwards, 2019; Gremme, et al., 2013; Shen, et al., 2016), to ensure integrity of the input files, detect duplicate sequences and prevent failure of subsequent tools. In section 2, the pipeline uses NCBI’s Foreign Contamination Screen (FCS) and its database to detect contaminants such as adapters and sequences from foreign organisms (Astashyn, et al., 2023). Assemblies that successfully pass these checks move on to additional quality checks in section 3. The pipeline can be configured to skip quality checks for assemblies which contain contaminants.
Section 3 executes the remaining quality assessment in parallel for each assembly. For scaffold and contig-level contiguity statistics associated with FASTA sequences, assemblathon2-analysis is used (UCDAVIS-Bioinformatics, 2012). To break the assembly into contigs, the pipeline uses 100 ‘N’ bases as the default unknown gap size. This parameter can be changed and is quoted in the AssemblyQC final report atop the contig-level statistics. Statistics related to annotations in GFF3 files are computed with GenomeTools gt stat tool (Gremme, et al., 2013). The Benchmarking Universal Single-Copy Orthologs (BUSCO) tool is used for estimating the gene-space completeness of each assembly (Seppey, et al., 2019). The pipeline can be configured to evaluate each assembly against one or more BUSCO lineages in parallel. The quality of the repeat-space is evaluated with the Long Terminal Repeat (LTR) Assembly Index (LAI) (Ou, et al., 2018). The LTR_retriever workflow consisting of LTR_FINDER and LTRharvest for de novo detection of LTRs (Ellinghaus, et al., 2008; Ou and Jiang, 2018; Ou and Jiang, 2019). The LAI statistic is independent of the BUSCO statistics and can help improve overall assembly contiguity by isolating shortcomings in the repeat-space.
The pipeline can also assess chromosome completeness and correctness through telomere, Hi-C contact-frequency, and synteny visualisations. Telomere Identification toolKit (TIDK) is employed to estimate the presence of telomeres with a specified telomeric motif (Brown, 2023). To supplement assessment with this user-specified repeat motif, the toolkit also explores the most likely data-driven repeat sequence for each assembly from the assembly sequences. The results for TIDK are sorted in ascending order by sequence length using SeqKit (Shen, et al., 2016). Presence of the telomeric repeats throughout a sequence often indicates assembly errors, and in many cases, these can be corrected by rearranging fragments in the sequence.
Where Hi-C data are provided, a contact-frequency map is generated by the pipeline and added to the report for visualisation. FASTQC and FASTP are used to trim and quality check the Hi-C reads (Andrews, 2010; Chen, et al., 2018). The reads are then mapped to the assembly sequences using Burrow-Wheeler Aligner (BWA) (Li, 2013). The mapping quality is checked with hic_qc.py (Sullivan, 2022), and finally converted into hic format through a workflow consisting of samblaster, samtools and run-assembly-visualizer.sh (Danecek, et al., 2021; Dudchenko, et al., 2017; Faust and Hall, 2014). The interactive visualisation of the Hi-C map is added to the report using the Juicebox.js JavaScript library (Robinson, et al., 2018). Atypical Hi-C contact-frequency patterns such as gaps or non-diagonal contact concentrations can indicate gaps in the assembly or misassignment of contigs to scaffolds.
Two sub-workflows, the pair-wise mode, and the chromosome-level comparison, are implemented in the pipeline to generate synteny plots between input assemblies. The pair-wise workflow creates synteny plots between each combination of input assemblies. Firstly, cross-mapping is performed with MUMmer4 (Marcais, et al., 2018). The pipeline can be configured to include or exclude many-to-many mappings detected by MUMmer4. The mappings are then plotted with Circos (Krzywinski, et al., 2009). A dot-plot can also be created which is helpful in identifying inversions (Cabanettes and Klopp, 2018). This result makes it easy to visualise gaps and large structural variations (SVs) between assemblies and helps to consistently assign chromosomal numbers to those assemblies. Chromosome-level comparison sub-workflow maps corresponding chromosomes in all the input assemblies with Minimap2 (Li, 2018), and generates synteny plots with plotsr (Goel and Schneeberger, 2022; Li, 2018), allowing chromosome-by-chromosome visualisation of the key synteny blocks and large SVs from all input. This sub-workflow is helpful for creating a single summary synteny visualisation.
Kraken2 is used to assign taxonomic labels to each assembly sequence (Wood, et al., 2019) and an interactive report is produced with Krona (Ondov, et al., 2011). This can be useful in detecting cross-domain contamination in the assemblies (Cornet and Baurain, 2022).
Merqury is also included in the pipeline to facilitate k-mer analysis. The pipeline supports both mixed-haplotype and diploid assembly assessment with and without the availability of parental reads. In addition to K-mer completeness and consensus quality, Merqury is very useful in evaluating the extent of haplotype phasing (Rhie, et al., 2020).
In section 4 of the pipeline, outputs from various assessment tools are gathered, parsed, and converted into a HyperText Markup Language (HTML) report using Jinja templating language in Python. Other than file format validation, the remaining assessment tools are optional and can be bypassed via settings in the configuration file.