A peer-reviewed article of this preprint also exists.
This version is not peer-reviewed
Submitted:
04 April 2024
Posted:
05 April 2024
Step | Subject matter | References |
---|---|---|
1 | Estimand formulation under missing data | [2,19,20,21,22] |
2 | Domain-informed missing data formulation and assumptions* | [15,16] |
3 | Missing data identification theory | [2,23,24,25,26,27,28] |
4 | Estimation with missing data | [20,21,29,30,31,32] |
5 | Sensitivity analysis for unidentifiable missingness | [33,34,35,36] |
Scenario No. | Interpretation |
---|---|
1 | Since all study variables are missing in case of complete non-visit, a shared missingness indicator influences all the other ’s, such that when . Based on the potential reasons for complete non-visit, may have incoming edges from the main health outcome such as in fatal diseases (e.g., reflected in death status, possibly available in external data sources such as the Social Security Death Index database [14]), risk factors such as BMI, or non-medical factors such as socioeconomic status (e.g., reflected in the occupation and level of education variables in patient information [39]). Figure 3 presents a schematic m-graph structure for this scenario. |
2 | Similar to scenario 1, missingness of follow-up visits can be modeled using a shared indicator for the j-th visit. The simplicity of this scenario compared to scenario 1 lies in the likelihood that is influenced by observations at the -th visit. For instance, missing follow-up can be influenced by the health status upon discharge or length of stay for inpatient admissions. |
3 | Observed and partially observed causes of missingness include health markers directly related to the reasons leading to the interruption of data measurement. For instance, code-blue missingness may happen right after anomalies in vital sign readings. It is unlikely for missingness under scenario 3 to have exogenous causes since measurement standards are in place for patients during their entire inpatient stay. Causes that might confound missingness with other study variables include event location and timestamp metadata, as they highlight the decision-making conditions. |
4 | Reasons for patients’ refusal are usually predicated on personal characteristics that are not associated with other health-related variables. Thus, it is a fairly safe assumption to consider them exogenous unless disproved explicitly. A possible health-related cause for missingness under this scenario is refusal due to pain intolerance or distress, which may indicate a negative health status at the moment. In this case, indicators of the health status can be considered the causes of missingness. |
5 | The missingness indicators under this scenario are influenced by observed health variables X, mediated by the attending physician unless X itself is subjected to missingness under scenario 8. In this case, the edge is received from the counterfactual counterpart (since it influences missingness regardless of its observation status). As an extreme case, a counterfactual variable that is completely missing under scenario 8 is a latent cause for missingness (see the corresponding entry for scenario 8). Causes that might confound missingness with other study variables include the attending physician’s identifiers, which are proxies of medical practice styles. |
6, 7 | By definition, missingness under these scenarios can be predicted using the patient’s location, transfer information, or type of resource required for making the observation. This information is considered crucial, being part of the management and billing data; therefore, causing variables for this scenario are likely to be fully observed and available, especially for datasets from large healthcare facilities. |
8 | Recording of the counterfactual study variables depends on the nature of the variable, as well as the medical practice style of the attending physician and the recording capabilities of the software tool. For instance, expensive and decisive tests such as medical imaging or lab tests are generally recorded, while qualitative examination results may go unrecorded depending on the physician or if the software tool does not provide an entry for them. Overall, there is a possibility that the reasons for missingness under this scenario confound other study variables if they are also affected by the medical practice style, e.g., when a physician with a tendency to record the most variables also diagnoses and prescribes treatments more effectively. |
9 | Inclusion/exclusion criteria directly indicate the reason for missingness under this scenario. For instance, age is the direct cause of missingness if, by design, data is selected according to the age criteria. |
10 | The potential reasons for invalid entries mentioned in this paper are likely unrelated to the analysis of interest, as they are mostly related to human and software tool errors. However, one should be cautious about treating all medically unrelated variables as exogenous causes. Instead, whether these variables can realistically be confounders for other variables should be investigated. For instance, socioeconomic or occupational status may affect overall health and healthcare facility visits. |
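
The interpretations above amount to listing, for each missingness indicator, which variables send edges into it in the m-graph. The following is a minimal sketch of that bookkeeping, not the paper's method: it stores the parents of each indicator and assigns a coarse MCAR/MAR/MNAR label from them. The function name, the variable names (`R_shared`, `death_status`, `ward`, and so on), and the parent-based rule itself are illustrative assumptions; the formal definitions rely on conditional independence (d-separation) in the full m-graph, which this shortcut does not check.

```python
from typing import Dict, Set


def classify_mechanism(
    indicator_parents: Dict[str, Set[str]],
    fully_observed: Set[str],
) -> str:
    """Coarse, parent-based label for a missingness mechanism (illustrative only).

    indicator_parents maps each missingness indicator to the set of nodes
    with an edge into it. Parents that are themselves indicators (for
    example a shared non-visit indicator) are ignored when labelling,
    because indicators are always observed.
    """
    indicators = set(indicator_parents)
    # Causes of missingness that are not missingness indicators themselves.
    causes = set().union(*indicator_parents.values()) - indicators
    if not causes:
        return "MCAR"  # exogenous missingness: no incoming edges from study variables
    if causes <= fully_observed:
        return "MAR"   # every cause is a fully observed variable
    return "MNAR"      # at least one cause is partially observed or latent


# Hypothetical encodings loosely modelled on scenarios 8, 6/7 and 1.
unrecorded = {"R_exam": set()}
protocol = {"R_test": {"ward"}}
non_visit = {
    "R_shared": {"death_status"},
    "R_bmi": {"R_shared"},
    "R_bp": {"R_shared"},
}

label_unrecorded = classify_mechanism(unrecorded, fully_observed=set())
label_protocol = classify_mechanism(protocol, fully_observed={"ward"})
label_non_visit = classify_mechanism(non_visit, fully_observed=set())
print(label_unrecorded, label_protocol, label_non_visit)
```

The shared indicator `R_shared` is listed as a parent of the per-variable indicators to mirror the structure described for scenario 1; whether `death_status` counts as fully observed depends on the external data sources actually linked to the dataset.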
Scenario No. | Interpretation |
---|---|
1, 2 | Parametric shift may occur if the population distributions change. Examples include: (i) conducting analysis using hospital data but deploying it for patients in local clinics or the general healthy population, such as in a preventive healthcare plan, (ii) conducting analysis using data from a specific cohort but deploying it for another cohort, (iii) when the target population visits healthcare facilities more or less frequently than during the data acquisition stage. |
3 | Shift occurs if the transfer protocols or observation protocols during hospitalization change. Examples include (i) observing more, fewer, or different variables in different hospital wards and (ii) encountering data availability or unavailability after the transfer despite being available during the data acquisition stage. |
4 | A no-shift assumption regarding patient refusal behavior for taking a test or answering questions appears reasonable. However, it is essential to consider the possibility of a shift due to data-sharing consent. In such cases, the data available for analysis could differ from the data available to physicians at deployment. |
5, 6, 7, 8 | Shift occurs when there are changes in the observation policy of physicians, healthcare facility protocols, available equipment, or data collection software. Examples include (i) alterations in the measurement decisions resulting from the deployment of a prediction model, (ii) modifications in the utilized diagnostic flowcharts and scores, (iii) fluctuations in the level of physicians’ expertise, and (iv) enhancing data collection protocols following significant events such as an epidemic. |
9 | Inclusion/exclusion criteria typically imply a shift in missingness unless the same criteria are applied for the admission of patients, which is highly unlikely in most cases. An example of no-shift occurs when the data scientist restricts the general population to the cohort of interest for deployment, using inclusion/exclusion criteria. |
10 | Shift occurs only when the reasons behind errors and invalid entries in the data change. |
Scenario No. | Interpretation |
---|---|
1 & 2 | Non-visit possibly influences the health variables via a direct causal effect on the treatments: patients usually do not receive treatment until they are admitted (except in the case of self-medication). This means that the NDE assumption is mostly violated. Take the example of a missing follow-up after the first visit, compared with unrecorded observations in a realized follow-up visit. In the latter, missingness does not influence the health status at the end of the follow-up visit. In contrast, in the former, the health status may, for example, degrade due to discontinuation of diagnosis/treatment. |
3, 4, 5, 6, 7 | Assuming most of the variables measured under these scenarios are related to the patient’s health status, the validity of the NDE assumption under these scenarios depends on the nature of the measurement. If the measurement influences the patient’s health directly (e.g., invasive tests) or indirectly (e.g., a temporary pause of a medication), the NDE assumption is violated. Following the discussion on scenarios 1 and 2, the NDE assumption is likely violated for the treatment and medication variables since treatment decisions usually depend on the observations. |
8 | Unless disproved explicitly, missingness under this scenario admits the NDE assumption since the recording status of the variables cannot influence the variables by any conceivable means. |
9, 10 | Since missingness under these scenarios is related to the data analysis and occurs after the data collection step, the NDE assumption can be made. |
Scenario No. | Interpretation |
---|---|
1, 2, 3, 4, 5, 8 | Observations and measurements under these scenarios permit the no-interference assumption, as decisions are generally made per individual. |
6 | The healthcare facility protocols typically apply uniformly to individuals and remain consistent over a short period. Hence, it is reasonable to make the no-interference assumption in this scenario. |
7 | This scenario is the most critical and obvious example of violating the no-interference assumption. In this scenario, a prioritization scheme is usually adopted to allocate observation and measurement resources. Examples are (i) early discharge, no admission due to limited hospital capacity during the epidemic, and (ii) delayed or canceled measurements for healthier patients during staff overload. |
9, 10 | Except for particular reasons, data scientists do not induce interference through the inclusion/exclusion criteria, and the no-interference assumption holds. An example of a violation of the no-interference assumption (though not to be conceived as a meaningful scenario) is sample selection that depends on the samples selected so far from different cohorts, e.g., when we only choose up to 20 patients from an age stratum. |
Scenario No. | Interpretation |
---|---|
1, 9 | By their definitions, these scenarios induce selection bias, discarding the entire sample (of a specific sub-population) from the dataset. |
2 | Missing a follow-up under this scenario implies that the patient still has recorded data in the database. However, if analysis is limited to a specific follow-up (e.g., analysis of health status in the second hospital visit), then patients with limited data are subject to selection bias. |
3, 5, 6, 7, 8, 10 | These scenarios by default concern data entries and do not cause missingness of an entire data sample; hence, no selection bias occurs. |
4 | Patients’ refusal can lead to selection bias when they refuse to give data-sharing consent. |
Scenario No. | Interpretation |
---|---|
1, 9 | Complete non-visit and sample exclusion induce only two patterns: complete-case and all-missing. |
2 | Missing follow-ups in clinical studies induce monotone missingness, since by the study design rules, the patients who are absent, for any reason, from a visit are excluded from the remaining visits (case drop-out). However, such a rule does not apply in healthcare facilities; therefore, this scenario, in general, leads to non-monotone missingness. |
3, 4, 6, 7, 8, 10 | These scenarios do not necessarily induce a monotone missingness pattern, unless there is a specific reason related to the problem at hand. |
5 | Observation according to diagnostic flowcharts and score tables induces a monotone missingness pattern, where extensive secondary measurements are not made unless primary ones are. However, many diagnostic flowcharts are utilized across all patients in a healthcare facility dataset. The set of primary tests usually overlaps among different flowcharts; therefore, a monotone pattern may still emerge. The pattern graph framework [27] provides a powerful methodology for dealing with missing data in this situation. |
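
Whether a dataset's missingness patterns are monotone can be checked directly from the observed patterns. The sketch below is illustrative and rests on one assumption about representation: each pattern is a tuple of 0/1 flags (1 = observed) over a fixed variable order. Under that representation, a collection of patterns is monotone exactly when the sets of observed variables are nested, i.e., form a chain under inclusion. The function name and the example patterns are hypothetical.

```python
from typing import Iterable, Sequence


def is_monotone(patterns: Iterable[Sequence[int]]) -> bool:
    """Return True if the observed-variable sets of all patterns are nested."""
    observed_sets = {
        frozenset(i for i, flag in enumerate(pattern) if flag)
        for pattern in patterns
    }
    # In a chain, distinct sets have distinct sizes, so sorting by size
    # puts them in inclusion order; checking neighbours is then sufficient.
    chain = sorted(observed_sets, key=len)
    return all(small <= large for small, large in zip(chain, chain[1:]))


# One flowchart: primary test, then secondary, then tertiary.
single_flowchart = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

# Two flowcharts with different primary tests (hypothetical).
two_flowcharts = [(1, 1, 0), (0, 1, 0), (0, 0, 1)]

# Complete non-visit or sample exclusion: complete-case and all-missing only.
non_visit = [(1, 1, 1), (0, 0, 0)]

print(is_monotone(single_flowchart), is_monotone(two_flowcharts), is_monotone(non_visit))
```

When the check fails, as in the second example, methods that presume a monotone pattern do not apply as-is.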
Scenario No. | Interpretation |
---|---|
1 | Potential odds ratio information for sensitivity analysis includes (i) the difference in total hospital visits between the healthy and sick sub-populations, (ii) the difference in medical care advantages received among different socioeconomic strata, and (iii) the difference between the disease-specific death rates reported in the healthcare facility and in total. |
2 | Potential odds ratio information for sensitivity analysis includes differences in the number of visits for healthy vs. sick patients, between groups with different socioeconomic status due to insurance plans, or morbidity rate for a specific diagnosis, possibly obtained from epidemiological research. |
3 | Potential odds ratio information for sensitivity analysis includes the difference in the interrupted measurement level due to specific events, such as a code blue. |
4 | For those types of missingness due to patients’ health-related refusal (such as intolerance to pain), meaningful differences may be found for available and unavailable samples, e.g., conceivable level of infection which may cause intolerable pain. |
5 | Potential odds ratio information for sensitivity analysis directly includes the level of measured variables, e.g., in different branches of the diagnostic flowcharts. For instance, one may ask how later specific measurements may change if the results of the primary tests flip (reflecting the unavailable sub-population). |
6 | Potential odds ratio information for sensitivity analysis can be found by analyzing healthcare facility protocols for specific measurements. Since these rules are justified based on extensive research, informed sensitivity analysis might be possible via public health works that analyze such protocols. |
7 | Potential odds ratio information for sensitivity analysis includes differences in the level of measurement for situations when the availability of resources changes, e.g., comparing the waiting line for a medical test or the number of admissions. |
8, 10 | Due to complete randomness, no specific sensitivity parameter can be conceived in general for these scenarios. |
9 | The sensitivity parameters have interpretations similar to the complete non-visit scenario, except that the data scientists have induced omission under this scenario. Therefore, the sensitivity parameters and ranges may be obtained from the original dataset. |
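
To illustrate how an odds ratio of the kind listed above can drive a sensitivity analysis, the following sketch uses a simple pattern-mixture style parameterization for a binary outcome. This is an assumption for illustration and is not taken from the paper: the odds of the outcome among unobserved cases are set to `odds_ratio` times the odds among observed cases, and the marginal prevalence is then recomputed over a grid of plausible odds ratios. All names and numbers are hypothetical.

```python
from typing import Dict, Iterable


def marginal_prevalence(p_observed: float, frac_observed: float, odds_ratio: float) -> float:
    """P(Y=1) assuming odds(Y=1 | missing) = odds_ratio * odds(Y=1 | observed)."""
    odds_missing = odds_ratio * p_observed / (1.0 - p_observed)
    p_missing = odds_missing / (1.0 + odds_missing)
    # Mix the observed and unobserved strata by their population shares.
    return frac_observed * p_observed + (1.0 - frac_observed) * p_missing


def sensitivity_curve(
    p_observed: float, frac_observed: float, odds_ratios: Iterable[float]
) -> Dict[float, float]:
    """Marginal prevalence for each candidate value of the sensitivity parameter."""
    return {
        gamma: marginal_prevalence(p_observed, frac_observed, gamma)
        for gamma in odds_ratios
    }


# Made-up inputs: 20% prevalence among observed cases, 80% of cases observed.
curve = sensitivity_curve(p_observed=0.2, frac_observed=0.8, odds_ratios=[0.5, 1.0, 2.0, 4.0])
for gamma, prevalence in curve.items():
    print(f"odds ratio {gamma}: marginal prevalence {prevalence:.3f}")
```

An odds ratio of 1 reproduces the observed-case estimate; domain knowledge such as the differences listed in the table would be used to bound the range of odds ratios worth reporting.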
Specification | Description |
---|---|
Classifier | |
Model | Logistic Regression |
Software | Sklearn v1.4.1, using linear_model.LogisticRegression |
Parameters | Sklearn default parameters |
MissForest imputer | |
Software | Sklearn v1.4.1, using impute.IterativeImputer and ensemble.RandomForestRegressor |
Parameters | Sklearn default parameters for both objects |
Propensity score model | |
Model | Logistic Regression |
Software | Sklearn v1.4.1, using linear_model.LogisticRegression |
Parameters | Sklearn default parameters |
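
The specification above can be assembled roughly as follows. This is a sketch under stated assumptions, not the authors' experiment code: the synthetic data, the variable names, the use of `random_state` for reproducibility (a departure from pure defaults), and the use of the propensity scores as inverse probability weights are all illustrative. In scikit-learn, `IterativeImputer` is experimental and must be enabled through `sklearn.experimental` before it can be imported.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (must precede the IterativeImputer import)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 150

# Synthetic data (hypothetical): three features and a binary label.
X_full = rng.normal(size=(n, 3))
y = (X_full[:, 0] + X_full[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Column 2 goes missing with a probability driven by fully observed column 0.
p_observed = 1.0 / (1.0 + np.exp(-(0.5 + X_full[:, 0])))
observed = rng.random(n) < p_observed
X = X_full.copy()
X[~observed, 2] = np.nan

# MissForest-style imputer: iterative imputation with a random forest regressor.
imputer = IterativeImputer(estimator=RandomForestRegressor(random_state=0), random_state=0)
X_imputed = imputer.fit_transform(X)

# Classifier: logistic regression with default parameters on the imputed data.
classifier = LogisticRegression().fit(X_imputed, y)

# Propensity score model: probability that column 2 is observed given column 0.
ps_model = LogisticRegression().fit(X[:, [0]], observed.astype(int))
propensity = ps_model.predict_proba(X[:, [0]])[:, 1]

# Assumed use of the scores: inverse probability weights for the observed cases.
ipw_weights = observed / propensity
print(X_imputed.shape, round(classifier.score(X_imputed, y), 3))
```

With default settings the imputer may emit a convergence warning on small samples; that does not prevent it from returning a completed matrix.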
1 | For the scope of this paper, we mainly focus on identification with respect to m-graphs. See section 3 of Mohan et al. [2] for other identification approaches for missing data. |
2 | The work is also of great historical significance for missing data analysis methodology. |
Ref. | Title |
---|---|
Scenarios related to patients | |
1 | Patient complete non-visit |
2 | Missing follow-up visit due to health status |
3 | Missing measurements due to health-related events during hospitalization |
4 | Missing measurements due to patient’s refusal |
Scenarios related to physicians | |
5 | Missing measurements due to diagnostic irrelevance |
Scenarios related to healthcare facilities | |
6 | Missing measurements outside protocol requirements |
7 | Unavailability or shortage of resources |
8 | Unrecorded observations |
Scenarios related to data pre-processing | |
9 | Omission of data samples based on inclusion/exclusion criteria |
10 | Omission of invalid data entries |
Ref. | Description |
---|---|
at identification step | |
1 | What missingness mechanism is induced by a scenario |
2 | Whether a scenario is subject to a parametric distribution shift in missingness |
3 | Whether a scenario permits the no-direct-effect assumption |
4 | Whether a scenario permits the no-interference assumption |
5 | Whether a scenario induces selection bias |
at estimation step | |
6 | Whether a scenario induces monotone missingness patterns |
at sensitivity analysis step | |
7 | Whether a scenario gives informed guesses about sensitivity parameters |
© 2024 MDPI (Basel, Switzerland) unless otherwise stated