2. Materials and Methods
2.1. Design
This is a secondary analysis of two prospective, multicenter, observational cohort studies. The first database is the GETGAG registry (Spanish Severe Influenza A Working Group), a voluntary registry created by the Spanish Society of Intensive Care Medicine (SEMICYUC) in 2009 during the A(H1N1)pdm09 influenza pandemic, in which 184 Spanish ICUs participated between June 2009 and June 2019 [6,16].
The second database is the COVID-19 registry, a voluntary registry created by SEMICYUC in 2020 during the SARS-CoV-2 pandemic, in which 74 Spanish ICUs participated between 1 July 2020 and 31 December 2021 [7,17,24,25].
We reported results in accordance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines (Supplementary Figure S1).
2.2. Study Population
A total of 8,902 consecutive patients requiring ICU admission with a diagnosis of respiratory infection by influenza A(H1N1)pdm09 or seasonal influenza A or B (n = 3,702) or SARS-CoV-2 (n = 5,200) viruses were included during the two periods described.
The presence of the virus was determined by reverse transcription-polymerase chain reaction (RT-PCR) in each hospital, according to Infectious Diseases Society of America (IDSA) recommendations for influenza [26] and World Health Organization (WHO) recommendations for SARS-CoV-2 [27]. Patients were followed up until confirmed ICU discharge or death, whichever occurred first.
2.3. Definitions
Coinfection was suspected if a patient presented with signs and symptoms of lower respiratory tract infection, with radiographic evidence of a pulmonary infiltrate with no other known cause [5,6,7].
Coinfection had to be confirmed by laboratory testing using Centers for Disease Control and Prevention (CDC) criteria [5,6,17]. Only respiratory infection microbiologically confirmed with a respiratory specimen or serology obtained within two days of ICU admission was considered community-acquired COI.
Lower respiratory tract infections (LRTI) diagnosed after two days of ICU admission were considered nosocomial and excluded from the present study.
The diagnosis of COI was considered “definitive” if respiratory pathogens were isolated from blood or pleural fluid and if serological tests confirmed a fourfold increase in titers for atypical pathogens, including Chlamydia spp., Coxiella burnetii and Mycoplasma pneumoniae. The diagnosis of COI was considered “probable” when a respiratory pathogen was isolated from a respiratory specimen (bronchoalveolar lavage [BAL] or tracheal aspirate [TA]) or when the urine antigen test was positive for S. pneumoniae or Legionella pneumophila. Sputum was not used as a respiratory specimen for the diagnosis of LRTI. Only patients with a definitive or probable diagnosis were included in the present analysis [5,6].
The diagnosis of pulmonary aspergillosis (PA) was defined according to the recently modified criteria proposed by Verweij et al. [28]. “Proven” PA was defined by lung biopsy showing invasive fungal elements and growth of Aspergillus spp. in culture or positive PCR for Aspergillus spp. in tissue. “Probable” PA was defined by pulmonary infiltrate and positive bronchoalveolar lavage (BAL) culture, or cavitating infiltrate and sputum positive for Aspergillus spp. “Possible” PA was defined by lung infiltrate and positive tracheal aspirate or mini-BAL culture [28,29].
Diagnostic procedures for the active search for Aspergillus spp. or bacteria were not standardized and were requested at the discretion of the treating physicians. Galactomannan levels in BAL or serum are not available in our general registry.
2.4. Study Variables
Demographic data, comorbidities and clinical and laboratory findings were collected during the first 24 hours of ICU admission. In addition, the need for invasive mechanical ventilation and the presence of shock on admission to the ICU were recorded.
Disease severity was determined using the Acute Physiology and Chronic Health Evaluation II (APACHE II) score and the level of organ dysfunction using the Sequential Organ Failure Assessment (SOFA) score. The variables controlled in the study are listed in Table 1.
2.5. Analysis Plan and Statistical Analysis
First: We determined the incidence of COI in the general population and compared patient characteristics according to whether or not they had COI. Qualitative variables were expressed as percentages, while quantitative variables were expressed as median and interquartile range (Q1-Q3). Chi-square or Fisher's exact tests for categorical variables and Student’s t or Mann-Whitney U tests for quantitative variables were used to determine clinical differences between groups.
Second: Binary logistic regression was used to determine which variables were independently associated with COI. All variables with statistical significance (p < 0.05) in the bivariate comparison between groups were included in the generalized linear model (GLM). Furthermore, variables identified as clinically important were included in the final model, with COI as the dependent variable.
Given the marked imbalance between the two groups, with only 14% of patients in the COI group, this imbalance had to be taken into account when developing the model to ensure optimal performance. The ROSE (Random Over-Sampling Examples) package was used for this purpose. This package provides functions to address binary classification problems in the presence of unbalanced classes. Balanced samples are generated using a smoothed bootstrap approach, allowing for estimation and accuracy evaluation of a binary classifier in the presence of a rare class. The package also includes functions implementing more traditional remedies for class imbalance and various accuracy evaluation metrics, which are estimated using holdout, bootstrap or cross-validation methods [30,31]. The “under” option performs simple under-sampling without replacement of the majority class until either the specified sample size N is reached or the positive examples have a probability p of occurring. When the method is set to “under”, the resulting sample is therefore smaller than the original.
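The under-sampling step described above can be illustrated with a minimal sketch. This is not the ROSE implementation (which is an R package); the function name `undersample_majority`, the `coi` label key and the toy data are assumptions made purely for illustration. The sketch drops majority-class rows at random, without replacement, until the rare class reaches a target proportion p.

```python
import random

def undersample_majority(rows, label_key="coi", p=0.5, seed=42):
    """Randomly drop majority-class rows (without replacement) until the
    minority class (label 1) makes up roughly a fraction p of the sample."""
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    # Number of majority rows to keep so that minority / total is about p.
    n_major = round(len(minority) * (1 - p) / p)
    n_major = min(n_major, len(majority))
    rng = random.Random(seed)
    kept = rng.sample(majority, n_major)  # sampling without replacement
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced

# Hypothetical training subset: 14 of 100 patients have COI (14%).
train = [{"id": i, "coi": 1 if i < 14 else 0} for i in range(100)]
balanced = undersample_majority(train, p=0.5)
print(len(balanced), sum(r["coi"] for r in balanced))  # 28 14
```

As in the text, the balanced sample is smaller than the original one, because all rare-class rows are kept and only part of the majority class is retained.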
The ROSE software was used exclusively for data processing of the Train subset, with the Test subset remaining untouched. Once the model had been developed in the Train set, it was applied to the Test set and its performance was evaluated. Results are presented as odds ratio (OR) with 95% confidence interval.
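For reference, the relationship between a logistic regression coefficient and the reported odds ratio can be written as follows. The Wald-type interval shown here is an assumption for illustration; the text does not state which interval method was used.

```latex
% Odds ratio for predictor j from its fitted coefficient \hat{\beta}_j,
% with a Wald-type 95% confidence interval (assumed, not stated in the text).
\[
\mathrm{OR}_j = e^{\hat{\beta}_j},
\qquad
95\%\ \mathrm{CI} =
\left(
  e^{\hat{\beta}_j - 1.96\,\mathrm{SE}(\hat{\beta}_j)},\;
  e^{\hat{\beta}_j + 1.96\,\mathrm{SE}(\hat{\beta}_j)}
\right)
\]
```

An OR above 1 with an interval that excludes 1 indicates a variable associated with a higher likelihood of COI.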
The performance of the model was assessed by determining accuracy, precision, sensitivity, specificity and area under the ROC curve (AUC). The Hosmer-Lemeshow goodness of fit was also determined and the presence of collinearity between the explanatory variables was examined using variance inflation factors (VIF).
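The performance measures listed above can be computed from predicted probabilities as in the following sketch. The function names, the 0.5 classification threshold and the toy labels and scores are illustrative assumptions, not values from the study. The AUC is computed here through its rank interpretation: the proportion of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def safe_div(a, b):
    """Divide, returning NaN when the denominator is zero."""
    return a / b if b else float("nan")

def classification_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, precision, sensitivity and specificity at a given threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": safe_div(tp, tp + fp),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
    }

def auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return safe_div(wins, len(pos) * len(neg))

# Hypothetical test-set labels (1 = COI) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.5, 0.05]
metrics = classification_metrics(y_true, y_score)
auc_value = auc(y_true, y_score)
print(metrics, round(auc_value, 3))
```

With a rare outcome, accuracy alone can look acceptable even when sensitivity is poor, which is why the metrics are reported together.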
In addition, we performed k-fold cross-validation (k = 10). The original data are first divided into two separate sets: a training (and test) set and a validation set. The training set is then divided into k subsets and, in each iteration, one of the k subsets is used as the model test set while the rest of the data are used as the training set. Once the iterations were completed, the accuracy and error were calculated for each of the models produced, and the final accuracy and error were obtained by averaging over the k trained models.
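The k-fold procedure can be sketched as follows. The helper names (`k_fold_indices`, `cross_validate`) and the placeholder majority-class classifier are assumptions chosen to keep the example self-contained; in the study the fitted model would be the GLM or the random forest.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint subsets
    for i in range(k):
        test = folds[i]
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train, test

def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return (mean accuracy, mean error) over the k trained models."""
    accuracies = []
    for train, test in k_fold_indices(len(y), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in test])
        correct = sum(int(p == y[i]) for p, i in zip(preds, test))
        accuracies.append(correct / len(test))
    mean_acc = sum(accuracies) / len(accuracies)
    return mean_acc, 1 - mean_acc

# Placeholder classifier (assumption): always predicts the majority class.
def fit_majority(X_train, y_train):
    return 1 if sum(y_train) * 2 > len(y_train) else 0

def predict_majority(model, X_test):
    return [model for _ in X_test]

# Hypothetical data: 100 patients, 14 with COI.
X = [[float(i)] for i in range(100)]
y = [1] * 14 + [0] * 86
mean_acc, mean_err = cross_validate(X, y, fit_majority, predict_majority, k=10)
print(round(mean_acc, 2), round(mean_err, 2))  # 0.86 0.14
```

The placeholder classifier never detects a COI case yet reaches 86% accuracy on this toy data, a reminder that averaged accuracy needs to be read alongside sensitivity when classes are unbalanced.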
Finally, we studied the residuals of the model. A residual measures the vertical distance between a point and the regression line; simply put, it is the error between a predicted value and the actual observed value. The most important assumption of a linear regression model is that the errors are independent and normally distributed. The normality of the residuals was assessed visually using different plots, and the functional form of the model was checked by applying the RESET (Regression Equation Specification Error Test), which assesses the adequacy of a linear regression model by including polynomial terms of the independent variables. A significant p-value rejects the linearity hypothesis, in which case a non-linear regression model should be chosen [32].
Third: We developed a Random Forest classifier model (RFc). Random forest is a powerful tree-based, non-linear machine learning technique. The model was set to grow 500 random trees, with a minimum number of 15 variables per tree.
The performance of the RFc model was evaluated using the out-of-bag (OOB) error. This method allows the prediction error of random forests, boosted decision trees and other machine-learning models that use bootstrap aggregation to be measured. We also plotted the importance of the variables to the model, which is related to the average loss of accuracy and to the Gini index for the classifier model. The Gini index is a measure of node impurity (“clutter”): values close to 0 indicate purer nodes (less clutter), and higher values indicate more clutter. Variable importance is reported as “MeanDecreaseGini”, the average reduction in Gini impurity obtained from splits on a variable, so the higher this measure, the greater the importance of the variable in the generated models.
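To make the Gini-based importance measure concrete, the following sketch computes the Gini impurity of a node and the decrease in impurity produced by one split, assuming the standard definition for a binary outcome (0 for a pure node, 0.5 for a 50/50 node). The function names and toy labels are illustrative assumptions; a random forest implementation averages such decreases over all splits on a variable and all trees to obtain a measure such as MeanDecreaseGini.

```python
def gini_impurity(labels):
    """Gini impurity of a node with binary labels coded 0/1."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Hypothetical node with 4 COI and 4 non-COI patients.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
weak_split = gini_decrease(parent, [1, 1, 1, 0], [1, 0, 0, 0])
perfect_split = gini_decrease(parent, [1, 1, 1, 1], [0, 0, 0, 0])
print(weak_split, perfect_split)  # 0.125 0.5
```

A variable whose splits separate COI from non-COI patients more cleanly yields a larger decrease and therefore ranks higher in the importance plot.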
Fourth: A similar analysis was performed separately in the subpopulations of patients with influenza and with COVID-19 to obtain the variables associated with COI in each subpopulation.
Statistical analysis was performed with R statistical software (v 4.4.1; The R Project for Statistical Computing, r-project.org).
4. Discussion
Our results suggest that data obtained at the time of ICU admission cannot be used to predict the presence of coinfection adequately. Neither the regression (GLM) nor the random forest models performed adequately, suggesting that there are uncontrolled confounders that are not included in the models. However, variables related to inflammation (PCT, CRP), severity (APACHE II, SOFA) or hemodynamic instability (shock, lactate, creatinine) appear to be important in most COI prediction models.
While the prevalence of respiratory coinfection in viral infections varies according to different reports [5,6,11,12,13,14,19,21], influenza has a higher risk of COI than SARS-CoV-2 infection [5,6,7,8,11,13]. It is important to recognize viral infection to implement appropriate isolation and infection control measures and to facilitate the use of promising antiviral treatments. However, clinicians should not overlook the possibility of respiratory bacterial or fungal COI in these patients. It is difficult for clinicians to identify respiratory coinfections early because the symptoms and signs are similar, which leads to a high rate of inappropriate prescriptions. This situation, together with the recommendations issued by scientific societies, has led to overuse of antimicrobials and the consequent emergence of antimicrobial resistance, especially in Gram-negative bacilli [9,10,33].
The ability to predict the presence of respiratory COI based on the data available during ICU admission could facilitate the optimization of antimicrobial treatment. However, despite the development of predictive models related to patient mortality, the need for ICU admission or mechanical ventilation [34,35,36,37,38,39,40], there is a paucity of studies attempting to predict the presence of respiratory bacterial or fungal coinfection at the outset of ICU care.
A significant number of studies [34,35,36,37,38,39,40] have been conducted to predict the presence of coinfection in patients with COVID-19. However, these studies have several limitations. In a retrospective analysis of 235 patients with COVID-19, Su L et al. [34] found that the presence of invasive fungal infection (IFI) was associated with the use of broad-spectrum antibiotics (aOR 4.4), fever (aOR 2.3), log IL-6 concentration (aOR 1.2) and prone ventilation (aOR 2.3). The model’s performance was good, with an AUC of 81%, significantly better than our AUC of 68%. It should be noted, however, that there are several differences between the two studies. Firstly, it is unclear whether the infections were coinfections or fungal superinfections. Secondly, the authors include Candida spp. in the definition of IFI, which was isolated in more than half of the cases (n=33). Finally, the reported performance is based on testing the model on the same population in which it was developed. Therefore, it is likely that the model is overfitted.
Recently, a retrospective multicenter study of 1,977 patients with COVID-19 [35] reported that age (OR 1.02), male sex (OR 1.7), and APACHE IV (OR 1.01) at ICU admission were the variables associated with COVID-19-associated pulmonary aspergillosis (CAPA) in the multivariate model. Although these factors were similar to those found in our study, the authors unfortunately did not report the performance of the generated model, so we cannot make a comparison or assess its true applicability.
Wang M. et al. [36] conducted a similar analysis in a population of 1,778 patients with confirmed SARS-CoV-2 infection, with a co-infection rate of 5%. The machine learning models (GLM and RF) developed to investigate risk factors associated with co-infection demonstrated robust predictive capabilities. The algorithm showed that comorbidities (diabetes, neurological diseases), invasive procedures (central venous catheter [CVC], urinary catheter [UC]), baseline levels of inflammatory markers (IL-6, PCT), and creatinine were associated with an increased risk of bacterial/fungal coinfection, with an AUC of 87% for GLM and 88% for RF, higher values than those observed in our data. The discrepancy in performance may be attributed to the inclusion of outcome and treatment variables in the models proposed by Wang et al., whereas our models considered only variables available at ICU admission. Furthermore, our study is focused exclusively on respiratory COI. The presence of risk factors such as CVC or UC in the referenced research suggests that other types of co-infection were included.
A nationwide retrospective population-based study involving more than 200,000 hospitalized patients in Spain [37], with an overall incidence of coinfection of 2%, revealed that age, male sex, smoking, obesity, COPD, and metabolic disorders were the factors associated with coinfection in multivariate analysis. Regrettably, the performance of the model is not presented, which hinders comparison with our models.
The study by Delhommeau G et al. [13] compared pulmonary bacterial COI in patients with confirmed influenza or COVID-19 on admission to the ICU in a multicenter database. The authors observed that the prevalence of COI was considerably higher in patients with influenza (24.8%) than in those with COVID-19 (8.4%). Furthermore, in the latter group, cirrhosis (OR 3.5) was identified as the sole independent risk factor for bacterial COI, suggesting that liver dysfunction significantly compromises the immune response, thereby increasing the susceptibility to bacterial infections. In contrast, immunosuppression (OR 0.34) and obesity (OR 0.29) were identified as factors negatively associated with the likelihood of COI in influenza patients. These findings highlight the complex and distinct immunopathological responses in influenza and COVID-19, underscoring the necessity for tailored approaches in managing coinfections in these conditions.
Immunosuppression, while generally increasing the risk of infections, might lead to less aggressive inflammatory responses, which could paradoxically reduce the likelihood of COI in certain viral infections.
Regrettably, the study by Delhommeau G et al. included nosocomial infections in their definition of respiratory COI and did not present data on the model’s performance, limiting the ability to generalize their findings. Future studies should differentiate between community-acquired and hospital-acquired infections to provide a clearer understanding of COI dynamics and develop more accurate predictive models. Additionally, exploring the temporal relationship between viral infections, such as influenza and COVID-19, and the onset of bacterial COI could offer insights into optimal timing for prophylactic and therapeutic interventions.
Giannella M. et al. [40] developed a COI prediction score using logistic regression, considering only three variables with different cut-off points (PCT, WBC and Charlson index). The model performs adequately (AUC 83%) in classifying patients into low, intermediate and high risk of COI. While the score is a valuable addition to the field, it should be noted that the authors include coinfections other than respiratory infections in the definition, with urinary tract infections accounting for almost half of cases. This limits the applicability of the score in predicting respiratory COI.
To the best of our knowledge, there are no reports in the literature of predictive models of pulmonary COI in pandemic influenza patients. This is because the influenza A(H1N1)pdm09 pandemic occurred when machine learning techniques were not widely available.
Our group published a prospective observational study conducted between 2009 and 2015 in a large cohort of intensive care units in Spain [5]. In just over 2,900 patients with influenza A(H1N1)pdm09, respiratory COI was diagnosed in 16.6% of patients. The likelihood of coinfection increased with age (aOR 1.01), previous HIV infection (aOR 2.6) and immunosuppressive medication (aOR 1.4), but unfortunately, we did not develop predictive models at that time.
Masia M et al. [41] showed in more than 100 patients diagnosed with pandemic influenza A that patients with pneumococcal COI were more likely to have confusion, a CURB-65 score >1 and higher CRP levels (255 mg/L vs. 89 mg/L) than those without pneumococcal coinfection.
Recently, our group [29] published data on factors associated with respiratory Aspergillus spp. COI in patients with pandemic influenza. In a population of 3,700 patients, regression modelling showed that male sex (OR: 2.81), asthma (OR: 4.56), nosocomial influenza infection (OR: 2.40), hematological malignancies (OR: 4.39), HIV (OR: 3.83) and other immunosuppression (such as chronic corticosteroid treatment or chemotherapy, OR: 4.87) were independently associated with Aspergillus spp. COI. In addition, using the CHAID (Chi-Squared Automatic Interaction Detection) automated decision tree, hematological disease, with an incidence of 3.3%, was shown to be the variable most closely associated with Aspergillus spp. COI, followed by other immunosuppression as the second most important variable.
Experts have urged clinicians not to neglect the principles of antimicrobial stewardship during pandemics to avoid a worsening public health crisis related to antimicrobial resistance [9,10,11]. The challenge for clinicians is to differentiate patients with respiratory viral infections who may benefit from prompt antibiotic treatment from those with a low risk of COI in whom antibiotic pressure should be avoided. Our group [6,15,16,17] has highlighted the potential role of biomarkers such as C-reactive protein (CRP) and procalcitonin (PCT) in this discrimination process. However, these biomarkers have been described as having low specificity and limited positive predictive value.
The use of statistical methods to estimate the risk of respiratory bacterial/fungal COI associated with influenza/COVID-19 aims to assess the likelihood that an individual has a COI at the time of initial ICU care. These models could help clinicians to intervene earlier, optimize antibiotic prescribing and provide appropriate patient care. Therefore, the establishment of accurate predictive models is of practical importance for clinical work and is beneficial for identifying high-risk patients and for their prevention and follow-up. However, developing models using machine learning techniques requires reliable data and the inclusion of all confounding factors related to the dependent variable to achieve adequate performance. In addition, many situations in medicine involve non-linear associations, so non-linear models such as Random Forest may better predict different events.
To our knowledge, our study is the first to attempt to develop a predictive model of respiratory COI in a large population of critically ill patients with respiratory failure secondary to pandemic viral infection. Despite applying machine learning techniques, the models developed did not perform adequately in the validation process. However, the development of the models has allowed the identification of variables related to the severity or inflammatory state of the patient that have a strong association with the presence of COI. With the clinical and laboratory data available at the time of ICU admission, the accuracy of coinfection prediction does not exceed 70%.
Our study has several strengths. It is a multicenter study with many patients, all with a microbiological diagnosis of coinfection, and a robust statistical analysis using machine learning techniques. However, it also has several limitations that must be acknowledged. Firstly, it is a retrospective national study, merging two large databases collected prospectively over more than 15 years. During this time, organ support techniques have evolved and changed significantly, which may impact prognosis. However, the aim of the study is not to evaluate the evolution of patients but to try to develop a predictive model of coinfection.
Second, this is a secondary analysis, so the possibility of bias related to uncontrolled confounding variables cannot be excluded. In addition, the inclusion of outcome variables (e.g., type of treatment, ventilatory support modalities, etc.) could improve the performance of the models. However, the study aimed to provide an early prediction of the risk of respiratory coinfection.
Finally, although this is a very large and homogeneous cohort of critically ill patients, it is predominantly a national sample, so these results can be transferred to other countries or regions only after prior validation.
Author Contributions
Conceptualization, Alejandro Rodríguez, Josep Gómez, Ignacio Martín-Loeches, Laura Claverias, Emili Díaz, Rafael Zaragoza, Marcio Borges-Sa, Frederic Gómez-Bertomeu, Álvaro Franquet, Carlos González Garzón, Lisset Cortes, Florencia Alés, Susana Sancho, Jordi Solé-Violán, Ángel Estella and Borja Suberviola; Data curation, Sandra Trefler, Julen Berrueta and Alejandro García-Martínez; Formal analysis, Alejandro Rodríguez, Josep Gómez, Ignacio Martín-Loeches, Laura Claverias, Rafael Zaragoza, Marcio Borges-Sa, Frederic Gómez-Bertomeu, Álvaro Franquet, Lisset Cortes, Florencia Alés, Susana Sancho, Jordi Solé-Violán, Julen Berrueta, Alejandro García-Martínez and Borja Suberviola; Investigation, Alejandro Rodríguez, Laura Claverias, Emili Díaz, Marcio Borges-Sa, Carlos González Garzón, Lisset Cortes, Florencia Alés, Susana Sancho, Jordi Solé-Violán, Ángel Estella and Julen Berrueta; Methodology, Alejandro Rodríguez, Josep Gómez, Ignacio Martín-Loeches, Laura Claverias, Emili Díaz, Rafael Zaragoza, Carlos González Garzón, Florencia Alés, Ángel Estella, Julen Berrueta, Alejandro García-Martínez and Borja Suberviola; Project administration, Alejandro Rodríguez, Sandra Trefler and María Bodí; Resources, Alejandro Rodríguez; Software, Alejandro Rodríguez, Josep Gómez, Álvaro Franquet, Julen Berrueta and Alejandro García-Martínez; Supervision, María Bodí; Validation, Alejandro Rodríguez, Josep Gómez, Emili Díaz, Álvaro Franquet, Sandra Trefler, Susana Sancho, Julen Berrueta and Alejandro García-Martínez; Writing – original draft, Alejandro Rodríguez, Ignacio Martín-Loeches, Laura Claverias and María Bodí; Writing – review & editing, Alejandro Rodríguez, Josep Gómez, Ignacio Martín-Loeches, Rafael Zaragoza, Marcio Borges-Sa, Frederic Gómez-Bertomeu, Jordi Solé-Violán, Borja Suberviola, Juan José Guardiola and María Bodí.