1. Summary
Bacterial infections remain among the most serious problems in healthcare. The continuous growth of antimicrobial drug resistance in clinical pathogenic bacteria represents a serious threat to public health worldwide, leaving a limited set, if any, of available treatment options [1]. Making things worse, antimicrobial resistance (AMR) has already spread beyond hospitals and other healthcare institutions and become a significant problem in community settings [2], so that AMR infections have become one of the top causes of death worldwide [3].
The spread of AMR within particular bacterial species can be driven by several lineages usually called ‘global clones’ or ‘international clones of high risk’, which was shown to be the case for the most successful and widespread pathogens like Klebsiella pneumoniae [4] and Acinetobacter baumannii [5]. Thus, in order to perform epidemiological surveillance and develop effective prevention measures against multidrug-resistant (MDR) bacteria, it is essential to track the spread of such global clones and to check whether particular isolates belong to these lineages. Simply stated, you should know your enemy before you can fight it.
Isolate classification and assignment to a particular clone can be based on several characteristics revealed using molecular biology techniques, but whole genome sequencing is increasingly used for this and many other purposes owing to the unprecedented amount of information it produces and its high cost-effectiveness [6, 7].
Tens of thousands of genomes of pathogenic bacteria are now available in public databases, and this number will continue to increase. The representation of a pathogen in genomic databases reflects the level of concern it raises and the incidence of the infections it causes.
A. baumannii is responsible for a significant share of nosocomial infections worldwide [8], and the World Health Organization (WHO) listed its carbapenem-resistant isolates among the most critical pathogens, with the highest priority for new antibiotic development [9].
Currently, more than 17,000 draft A. baumannii genomes are available at NCBI (https://www.ncbi.nlm.nih.gov/datasets/taxonomy/470/, accessed on 23 October 2023), but the data provided there usually lack isolate typing information, which users have to derive themselves using various computational tools. Another commonly used database, PubMLST [10], contains epidemiological and typing information for more than 8,700 isolates, but only about 3,300 genomes are available there (https://pubmlst.org/organisms/acinetobacter-baumannii, accessed on 23 October 2023). Moreover, as we described earlier, the assignment of a particular isolate to an international clone is not always straightforward even when whole genome data are available [11], and, to the best of our knowledge, no public database currently presents such assignments for large isolate datasets.
Here we present a dataset containing typing information, including assignment to international clones, for the whole set of A. baumannii isolates available at NCBI (accessed on 23 February 2023), which included 17,546 genomes. The dataset contains multilocus sequence typing (MLST)-based sequence types (STs), intrinsic blaOXA-51-like gene variants, capsular (KL) and lipooligosaccharide outer core (OCL) locus types, core genome MLST (cgMLST) profiles, and data on the presence of CRISPR-Cas systems in each isolate. Data on the presence of AMR genes conferring resistance to various classes of antibiotics are also available.
We have already used this dataset to study the representation of known international clones in Genbank [11]. Deriving all this information from the available genomes requires an advanced level of bioinformatics expertise and significant computational resources. As was noted recently, proper global genomic epidemiology investigations of A. baumannii are very important for understanding the global dissemination of important clones [12]. We believe that the dataset will be useful for epidemiological studies of A. baumannii, including the selection of proper reference isolate sets for any type of genome-based investigation.
This precomputed data will be especially helpful for researchers starting their investigations in the emerging field of genomic epidemiology of this important and dangerous pathogen.
The dataset is available for academic use under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) 4.0 International License. Updates are scheduled to be provided at least twice a year.
2. Data Description
2.1. Data Structure
The dataset contains three tables provided in several formats: xlsx, txt (tab-delimited) and pdf (summary table only). With the xlsx format in spreadsheet software, users can create their own filters to select subsets of interest, change or add sorting parameters, build graphs, etc. The text format (txt) is intended for computational processing with bioinformatics tools, while the pdf format is provided for the summary file only, to present the main isolate typing results in human-readable form.
The tables, which are described below in more detail, include:
- summary table, which contains typing information for all isolates, such as MLST Pasteur ST, OXA-51 like variant, capsule synthesis loci (KL) and lipooligosaccharide outer core loci (OCL)-based types, assignments of an isolate to known international clones of high risk (IC1-IC9) and the possible presence and type of CRISPR-Cas system in the genome of an isolate
- AMR gene table, which contains information on the presence in a particular isolate of genes known to confer, when properly expressed, AMR to various classes of antimicrobial drugs
- the table containing cgMLST profiles for all isolates, which can be used for extended comparison and clustering purposes
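As a minimal sketch of how the tab-delimited summary table might be processed, the example below reads the rows and selects the isolates assigned to one clone. The header names (`Assembly`, `IC`, `ST` and so on) and the accession codes are assumptions made for illustration only; the actual file may label its columns differently.

```python
import csv
import io

# Illustrative rows only: the header names and accession codes below are
# assumptions for this sketch, not values copied from the dataset.
SAMPLE_TXT = (
    "Assembly\tIC\tST\tOXA-51-like\tKL\tOCL\tCRISPR-Cas\n"
    "GCA_000000001.1\tIC2\t2\tOXA-66\tKL2\tOCL1\tNF\n"
    "GCA_000000002.1\tIC1\t1\tOXA-69\tKL1\tOCL1\tI-F1\n"
    "GCA_000000003.1\tNOIC\tND\tNOT_FOUND\tKL22\tOCL3\tUNKN\n"
)


def read_summary(handle):
    """Read a tab-delimited typing summary into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def select_by_ic(rows, ic):
    """Return the rows whose IC column equals the requested clone label."""
    return [row for row in rows if row["IC"] == ic]


rows = read_summary(io.StringIO(SAMPLE_TXT))
ic2_rows = select_by_ic(rows, "IC2")
print([row["Assembly"] for row in ic2_rows])
```

To read the real file, the `io.StringIO` object would be replaced with an opened file handle, for example `open(path, newline="")`.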
2.1.1. Typing summary table
The format and example data for the typing summary table are given in Table 1.
The first column contains the assembly code assigned by Genbank, which uniquely identifies a particular A. baumannii genome assembly.
The second column, ‘IC’ (‘international clone of high risk’), shows the assignment of a particular isolate to known international clones. The procedure for such an assignment is described in the Methods section and in our previous publication [11]. If an isolate was not assigned to any IC, this column contains the ‘NOIC’ designation.
The third column contains the sequence type (ST) defined by the 7-locus typing scheme (cpn60, fusA, gltA, pyrG, recA, rplB and rpoB genes) known as ‘Pasteur MLST’ [13]. Each variant of a particular locus is numbered, and the combination of the 7 locus numbers constitutes an ST, which has its own number. For example, the combination cpn60_3, fusA_29, gltA_30, pyrG_1, recA_9, rplB_1 and rpoB_4 is defined as ST17. Locus variants and the definitions of the corresponding STs can be found in the PubMLST database (https://pubmlst.org/organisms/acinetobacter-baumannii). ‘ND’ in this column means that the ST was not determined owing to either low sequencing quality or the presence of a novel MLST allele not yet uploaded to the databases.
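The mapping from an allele combination to an ST can be sketched as a simple lookup. The example below is an illustration under stated assumptions, not the procedure used to build the dataset: the definition table holds only the ST17 profile quoted above, whereas a real table would be obtained from PubMLST, and the function name `assign_st` is hypothetical.

```python
# Loci of the Pasteur MLST scheme, in the order listed in the text.
LOCI = ("cpn60", "fusA", "gltA", "pyrG", "recA", "rplB", "rpoB")

# Only the ST17 profile quoted in the text; a complete table of
# definitions would have to be obtained from PubMLST.
ST_DEFINITIONS = {(3, 29, 30, 1, 9, 1, 4): 17}


def assign_st(alleles):
    """Return 'ST<n>' for a known allele combination, otherwise 'ND'.

    `alleles` maps each locus name to its allele number, or to None when
    the allele could not be called.
    """
    profile = tuple(alleles.get(locus) for locus in LOCI)
    if any(allele is None for allele in profile):
        return "ND"
    st_number = ST_DEFINITIONS.get(profile)
    return f"ST{st_number}" if st_number is not None else "ND"


example = {"cpn60": 3, "fusA": 29, "gltA": 30, "pyrG": 1,
           "recA": 9, "rplB": 1, "rpoB": 4}
print(assign_st(example))
```

A profile with an uncalled or unlisted allele falls through to ‘ND’, mirroring the designation used in the table.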
The fourth column shows the variant of the gene encoding the intrinsic OXA-51-like β-lactamase, which is possessed by all A. baumannii isolates and has sometimes been used for typing purposes [14]. ‘NOT_FOUND’ can appear in this column owing to low sequencing quality or low similarity to known OXA-51-like variants.
The KL-type and OCL-type columns show the typing classes based on the capsule synthesis locus and the lipooligosaccharide outer core locus, respectively. Capsular polysaccharide is an essential factor determining bacterial virulence and susceptibility to phages, which makes it a useful epidemiological marker [15]. The capsular polysaccharide gene cluster consists of about 30 genes, while the OC locus includes only five. Each distinct gene cluster found between the flanking genes is assigned a unique number identifying the locus type, and these data can be found in public databases [16].
The final column shows the presence of a CRISPR-Cas system in the isolate. Clustered regularly interspaced short palindromic repeat (CRISPR) arrays and CRISPR-associated (cas) genes constitute bacterial adaptive immune systems and function as a variable genetic element. Each CRISPR-Cas locus includes a strain-specific array of so-called spacer sequences, which can be used for strain subtyping [17]. CRISPR-Cas systems can be divided into six major types (I-VI) and several subtypes (A-I, K, U) based on a combination of evidence from phylogenetic, comparative genomic, and structural analysis [18, 19]. ‘UNKN’ in this column denotes an incomplete system, in which some genes are absent, while ‘NF’ means that no CRISPR-Cas system was detected in a particular isolate genome.
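The codes in the CRISPR-Cas column can be tallied with a few lines of code. The sketch below assumes that a detected system is written as a plain type label such as ‘I-F1’; that cell format, the status labels and the function names are assumptions made for illustration.

```python
from collections import Counter


def crispr_status(cell):
    """Map a CRISPR-Cas column value to a coarse status label."""
    value = cell.strip()
    if value == "NF":
        return "not found"
    if value == "UNKN":
        return "incomplete"
    # Any other value is assumed to be a type label such as 'I-F1'.
    return "typed"


def tally_crispr(cells):
    """Count typed systems by their label and all other cells by status."""
    counts = Counter()
    for cell in cells:
        status = crispr_status(cell)
        counts[cell.strip() if status == "typed" else status] += 1
    return counts


# Made-up column values, used only to exercise the functions.
example = ["NF", "I-F1", "UNKN", "I-F2", "I-F1", "NF"]
print(tally_crispr(example))
```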
2.1.2. AMR gene table
Another part of the dataset contains information on the presence of various genes known to confer antimicrobial resistance in the investigated A. baumannii genomes. The first column holds the assembly code, which is the same as in the typing summary table; the second column shows the number of AMR genes found in a particular isolate; the remaining columns indicate the presence of the gene named in the first row and its sequence similarity to the corresponding allele from the database. The absence of a gene is marked with a dot for better readability.
An example is provided in Table 2. Only a small subset of the available genes is presented in this example.
The presence of a particular gene does not by itself confirm resistance to the corresponding antimicrobial drug since the gene might, for example, not be expressed [20, 21]. However, this information is very useful for estimating the AMR potential within a bacterial population.
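A table laid out as described (assembly code, gene count, then one column per gene with a dot for absence) can be summarised as in the sketch below. The header names, the accession codes and the assumption that a present gene is recorded as a numeric similarity percentage are illustrative; the real table may encode these cells differently.

```python
import csv
import io
from collections import Counter

# Illustrative rows only: header names, accessions and similarity values
# are assumptions for this sketch, not values copied from the dataset.
SAMPLE_AMR = (
    "Assembly\tAMR_genes\tblaOXA-23\taph(6)-Id\taph(3'')-Ib\n"
    "GCA_000000001.1\t3\t100.00\t100.00\t99.88\n"
    "GCA_000000002.1\t1\t.\t100.00\t.\n"
)

NON_GENE_COLUMNS = ("Assembly", "AMR_genes")


def genes_present(row):
    """Return {gene: similarity} for every gene not marked absent ('.')."""
    return {
        gene: float(value)
        for gene, value in row.items()
        if gene not in NON_GENE_COLUMNS and value != "."
    }


def gene_prevalence(rows):
    """Return the fraction of isolates in which each gene was found."""
    counts = Counter()
    for row in rows:
        counts.update(genes_present(row).keys())
    return {gene: count / len(rows) for gene, count in counts.items()}


amr_rows = list(csv.DictReader(io.StringIO(SAMPLE_AMR), delimiter="\t"))
prevalence = gene_prevalence(amr_rows)
print(prevalence)
```

As noted above, such prevalence values describe the presence of genes, not phenotypic resistance.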
2.1.3. cgMLST profiles
The third part of the dataset includes cgMLST profiles for all isolates. The cgMLST typing scheme is similar to MLST in that it enumerates gene variants and uses their combination to form a profile; the difference is that cgMLST relies on all conserved genes within a particular species. Such a scheme for A. baumannii was proposed in 2017 [22] and included 2390 loci. The allele variants for these loci are available in a regularly updated public database at cgmlst.org (https://www.cgmlst.org/ncs/schema/schema/3956907).
cgMLST profiles can be used in cluster analysis of a set of isolates in order to estimate their genomic closeness and, possibly, obtain some evolutionary or epidemiologically valuable insights. A threshold of 3 differing cgMLST loci was proposed to check whether two A. baumannii isolates belong to a single strain [23], but a less stringent criterion can be used depending on the specific investigation.
In this dataset, cgMLST profiles are given in table format. The first column contains the same assembly identifier as in the other dataset tables, and the other columns show the numbers representing the variants of the genes given in the header row. Some special characters can appear besides numbers: ‘N’ indicates a novel allele variant not present in the database; ‘0?’ indicates that the locus is missing in the assembly (probably owing to low sequencing quality); ‘-’ indicates that the allele is partially covered; and ‘+’ indicates multiple possible alleles, in which case the most probable one is shown.
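A pairwise comparison of two cgMLST profiles can be sketched as follows. This is one possible treatment, not the method of any cited tool: loci are compared only when both cells are plain allele numbers, so any cell carrying one of the special characters is skipped, and the threshold of 3 is applied inclusively. Both choices, and the function names, are assumptions.

```python
def allele_distance(profile_a, profile_b):
    """Return (differences, compared) for two equally long profiles.

    Only loci where both cells are plain allele numbers are compared;
    cells with special characters (e.g. 'N', '0?', '-') are skipped.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    differences = 0
    compared = 0
    for cell_a, cell_b in zip(profile_a, profile_b):
        if not (cell_a.isdigit() and cell_b.isdigit()):
            continue  # at least one allele is uncalled or uncertain
        compared += 1
        if int(cell_a) != int(cell_b):
            differences += 1
    return differences, compared


def same_strain(profile_a, profile_b, max_differences=3):
    """Apply the differing-loci threshold (inclusive here, by assumption)."""
    differences, _ = allele_distance(profile_a, profile_b)
    return differences <= max_differences


# Toy six-locus profiles; real profiles cover far more loci.
isolate_a = ["1", "5", "N", "0?", "7", "2"]
isolate_b = ["1", "6", "3", "4", "7", "2"]
print(allele_distance(isolate_a, isolate_b))
```

Returning the number of compared loci alongside the distance makes it visible when a small distance merely reflects many skipped loci.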
2.2. General Statistics
Some general statistics based on the dataset are provided below. As was noted previously [11, 24], the Genbank set of genomes cannot be considered representative of the whole A. baumannii population since it is strongly biased towards multidrug-resistant or other clinically relevant isolates. At the same time, statistics on the representation of particular STs, ICs, AMR genes, etc. can provide useful information for reference set selection and comparison purposes.
We will refer to any Genbank assembly record containing a complete or partial genome as an ‘isolate’ for simplicity, although different assemblies could represent the same isolate, and some records could contain only a part of the isolate genome.
The distribution of ICs in Genbank is shown in
Figure 1.
The summary of top three dominating characteristics in each category is given in
Table 3.
In total, 78.5% of the isolates belonged to IC1-IC9, with IC2 accounting for 65% of all Genbank genomes and 83% of the A. baumannii isolates belonging to ICs. The second largest clone, IC1, was found in about 3.6% of all isolates and about 4.5% of the isolates belonging to ICs. These results are as expected, since IC2 is known to be the most successful and widespread clone of A. baumannii, being responsible for the majority of outbreaks worldwide [25]. The dominant ST was, not surprisingly, ST2, which constituted most of IC2 and was found in 63.3% (11108) of cases. ST1 (3.59%) and ST499 (3.4%) were also in the top three. The number of distinct STs found was 482. However, 398 of them were represented by 10 or fewer isolates each, with 236 STs represented by only a single isolate.
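The relation between the two kinds of percentages quoted above (share of all isolates versus share of IC-assigned isolates) can be checked with a short calculation. The sketch below uses only figures stated in the text; the variable and function names are illustrative.

```python
TOTAL_ISOLATES = 17546  # genomes in the dataset
ST2_ISOLATES = 11108    # isolates typed as ST2
IC_SHARE = 78.5         # % of all isolates assigned to IC1-IC9
IC2_SHARE = 65.0        # % of all isolates assigned to IC2


def percent(part, whole):
    """Express `part` as a percentage of `whole`."""
    return 100.0 * part / whole


st2_share = percent(ST2_ISOLATES, TOTAL_ISOLATES)
# Share within IC-assigned isolates = share of all / fraction assigned.
ic2_within_ics = percent(IC2_SHARE, IC_SHARE)

print(f"ST2 share of all isolates: {st2_share:.1f}%")
print(f"IC2 share of IC-assigned isolates: {ic2_within_ics:.0f}%")
```

Both results agree with the rounded values reported above (63.3% and 83%).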
The dominant OXA-51-like variant was OXA-66 (about 60%), associated with IC2, followed by OXA-82 (also associated with IC2) and OXA-69 (IC1), each accounting for about ten times fewer isolates than OXA-66. In this case, the statistics are nearly complete, since genes encoding OXA-51-like beta-lactamases were found in 98% of the isolates; in the remaining isolates they were not detected owing to low similarity caused by poor sequencing quality or low coverage of particular genomic regions. The data on intrinsic beta-lactamases clearly correlate with the IC and ST distribution, as expected [11, 14]. In total, 130 distinct OXA-51-like variants were found.
The top three KL types were KL2 (15.5% of the isolates), KL3 (11.6%) and KL22 (5.4%). KL2 was recently reported to be the most common type in A. baumannii and to be associated with increased AMR [26], which is consistent with the previously noted bias of Genbank A. baumannii isolates towards more resistant ones. In total, 229 types were present.
The diversity of OCL types was lower, with OCL1 accounting for 70% of the isolates. OCL3 and OCL6 were found in about 11% and 5%, respectively. In total, 22 types were found. These results agree with a previous study in which OCL1 was the most common type among A. baumannii isolates belonging to ST1, ST2, ST3 and ST78, which together constituted about 71% of the isolates available in Genbank [27].
The median number of AMR genes per isolate was 10, with the number ranging from 1 to 25. The number was never zero since the intrinsic blaOXA-51-like and blaADC genes were also counted. In several cases, intrinsic genes were not detected owing to low similarity or insufficient coverage of the genome region. We found 1236 (7.0%) isolates possessing only the intrinsic genes mentioned above.
The most abundant genes other than the intrinsic ones were blaOXA-23, aph(6)-Id and aph(3'')-Ib, found in about 64% of the isolates each. The first gene, encoding the OXA-23 carbapenemase, was shown to be associated with IC2 [28], and our study confirmed this: it was found in about 95% of IC2 isolates. The latter two genes, usually of plasmid origin [29], encode aminoglycoside phosphotransferases conferring resistance to streptomycin.
CRISPR-Cas systems were found in about 8% (1342) of the isolates. The dominant type was I-F, with subtype I-F1 accounting for 67% and I-F2 for about 20% of the CRISPR-Cas-positive isolates, which agrees with a previous RefSeq analysis showing that most CRISPR-Cas systems in A. baumannii belong to types I-F1 and I-F2 [30]. The systems were predominantly found in the IC1 (39%), NOIC (29%) and IC7 (17%) groups and were not found in IC2, which is also consistent with previous investigations [30]. The relatively low fraction of Genbank isolates containing an apparently functional CRISPR-Cas system could also be attributed to the overrepresentation of IC2, which usually does not possess such a system.
3. Methods
We retrieved 17,546 genomic sequences of
A. baumannii from Genbank (
https://www.ncbi.nlm.nih.gov/genbank/, accessed on 14 January 2023), for which the assembly level was defined as ‘Complete Genome’, ‘Chromosome’, or ‘Scaffold’.
Multilocus sequence typing (MLST) was performed against the PubMLST database (https://pubmlst.org/bigsdb?db=pubmlst_abaumannii_seqdef, accessed 14 February 2023) using the Pasteur [13] typing scheme. When we tried to obtain STs based on the Oxford scheme [31], a reliable assignment was obtained for only about 23% of the isolates owing to the known issue of gdhB gene paralogy and technical artifacts [32]. It was also shown that the Pasteur scheme is more appropriate for population biology and epidemiological studies of A. baumannii than the Oxford one since it allows more precise classification of isolates into clonal groups [33]. For these reasons, Oxford STs are not presented in our dataset.
The detection of capsule synthesis loci (KL) and lipooligosaccharide outer core loci (OCL) was performed using Kaptive v. 2.0.3 [34] with default parameters (databases last updated on 13 February 2023).
The presence of CRISPR-Cas systems in the analyzed genomes was investigated using CRISPRCasFinder [35] version 4.2.20 with the following parameters: ‘-fast -rcfowce -ccvRep -vicinity 1200 -cas -useProkka’. Recent classifications based on multiparametric analysis define type I loci by cas3 as the signature gene and type I-F by fused cas3 and cas2 genes [36, 37]. Types I-F1 and I-F2 both contain the cas1 and cas2-3 genes, but the remainder of the locus differs: it includes four genes, csy1 (cas8f1), csy2 (cas5f1), csy3 (cas7f1) and csy4 (cas6f), in I-F1 and three genes, cas5fv (cas5f2), cas6f and cas7fv (cas7f2), in I-F2 [30].
Additional data processing and output formatting were performed using a computational pipeline that we developed earlier [39, 40].
Assignment of the isolates to known international clones of high risk (IC) was based on the Pasteur MLST ST and the blaOXA-51-like gene variant, supported by the cgMLST profile when needed, as we described earlier [11]. To make the classification reliable, we performed cluster analysis of STs using currently available data and compared the results with information derived from literature analysis of experimentally verified IC assignments.
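The idea of combining two markers can be illustrated with a deliberately small sketch. It is not the authors' assignment procedure, which is described in [11] and also draws on cgMLST profiles. The lookup tables contain only the associations mentioned in this text (ST2, OXA-66 and OXA-82 with IC2; OXA-69 with IC1) plus the widely reported link of ST1 with IC1, which is an assumption added here. The function name and the ‘unresolved’ label are hypothetical.

```python
# Partial, illustrative lookup tables; they are NOT the full rule set
# used to build the dataset. The ST1 entry is an added assumption.
ST_TO_IC = {"1": "IC1", "2": "IC2"}
OXA_TO_IC = {"OXA-69": "IC1", "OXA-66": "IC2", "OXA-82": "IC2"}


def suggest_ic(st, oxa_variant):
    """Suggest an IC only when both markers are known and agree.

    Any other case is reported as 'unresolved', i.e. it would need
    further evidence such as a cgMLST profile comparison.
    """
    st_ic = ST_TO_IC.get(st)
    oxa_ic = OXA_TO_IC.get(oxa_variant)
    if st_ic is not None and st_ic == oxa_ic:
        return st_ic
    return "unresolved"


print(suggest_ic("2", "OXA-66"))
print(suggest_ic("2", "OXA-69"))
print(suggest_ic("ND", "NOT_FOUND"))
```

Requiring agreement between the two markers is a conservative choice for the sketch: a conflicting or missing marker is never silently turned into an assignment.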