Background
Data science is a rapidly evolving research field that influences analytics, research methods, clinical practice and policies. Access to comprehensive real-world data and gathering life-course research data are primary challenges in many disease areas. Existing real-world data can be a rich source of the information required to better characterise diseases, generate cohort specifications and understand clinical practice gaps, enabling more precise, value-based research for healthcare systems. A common challenge linked to real-world and research data is a high rate of missingness. Historically, statistical methods were used to address missing data where possible, but advances in artificial intelligence techniques now provide improved and faster methods. These methods could also be used to predict disease outcomes and to improve diagnostic accuracy and treatment suitability.
These methods can be particularly useful for women’s health conditions, where the complex physical and mental health symptoms can give rise to insufficient understanding of disease pathophysiology and phenotype characteristics that play a vital role in diagnosis, treatment adherence and prevention of secondary or tertiary conditions. One such condition is endometriosis. Endometriosis is complex with an array of physical and psychological symptomatologies, often leading to multimorbidity [
1]. Multimorbidity is defined by the presence of two or more conditions in any given individual; it could therefore be prevented if the initial conditions are managed more effectively. The incidence of multimorbidity has increased with a rising ageing population and the growing burden of non-communicable diseases and mental ill health, which is particularly important for women [2]. Another important aspect of multimorbidity is disease sequelae, where a physical manifestation could correlate with a mental health impact, and vice versa. The precise causation is complex to assess due to limitations in the current understanding of the pathophysiology of disease sequelae [3]. As such, multimorbidity could be deemed highly heterogeneous. Multimorbidity impacts people of all ages, although current evidence suggests it is more common among women than men, even though it was previously thought to be most common in older adults with a high frailty index score [4]. Hence, multimorbidity is challenging to treat, and there remains a paucity of research to better understand the basic science behind the complex mechanisms that could enable better diagnosis and long-term management [4].
This undercurrent of disease complexities linked to endometriosis that could lead to multimorbidity should be explored to support clinicians and healthcare organisations in future-proofing patient care [
5]. In line with this, exploring machine learning in conjunction with synthetic data methods could yield better predictions and offer a new solution to sample size challenges.
Methods
The primary aim of our study was to develop an exploratory machine learning model that can predict multimorbidity among women with endometriosis using both real-world and synthetic data. In certain instances, real-world data may present confidentiality issues, particularly in medical research, where data often contain personal and sensitive information, and sharing such data for analysis can expose vulnerabilities. To develop these models, existing knowledge, symptomatology, comorbidities and demographic data were used. Anonymised data from an ethically approved study were provided by the Manchester and Liverpool endometriosis specialist centres in the UK. The data records included symptoms, diseases, and conditions in women with a confirmed diagnosis of endometriosis. Data curation was completed for the entire sample using the following steps:
Data pre-processing: the data were cleaned and prepared by managing missing values, encoding categorical variables, and standardising or normalising continuous variables.
Synthetic data generation: synthetic data records were generated for each centre using the widely used Synthetic Data Vault (SDV) Gaussian Copula model, based on the characteristics of the patients' records.
Model development: we trained three standard classification models - Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF) - on both real-world and synthetic data. These models were used to predict multimorbidity among women with endometriosis.
Model evaluation: we assessed the performance of the models by comparing their average accuracies on real-world and synthetic data. Accuracy, precision, recall, and F1-score were used to evaluate the models' performance.
Comparison and analysis: we compared the results of models trained on real-world data with those trained on synthetic data to determine whether synthetic data could serve as a viable alternative to real-world data in predicting multimorbidity among women with endometriosis.
For all experiments, we trained one model on real-world data and another on synthetic data. Both models were tested on the same test set, containing only real-world data, because it reflects the verified true distribution of the endometriosis population. The accuracies of these models can then provide insight into whether the use of synthetic data affects the performance of machine learning models, as sketched below.
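As an illustration of this protocol, the sketch below trains one classifier on real records and one on synthetic records and scores both on the same held-out real test set. This is a minimal sketch, not the study's exact pipeline; the file names, the response column, and the choice of random forest are all hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

real = pd.read_csv("real_records.csv")            # hypothetical file name
synthetic = pd.read_csv("synthetic_records.csv")  # hypothetical file name
target = "multimorbidity"                         # hypothetical response column

# Hold out half of the real data as the shared test set.
X_train, X_test, y_train, y_test = train_test_split(
    real.drop(columns=[target]), real[target], test_size=0.5, random_state=0)

model_real = RandomForestClassifier(random_state=0).fit(X_train, y_train)
model_synth = RandomForestClassifier(random_state=0).fit(
    synthetic.drop(columns=[target]), synthetic[target])

# Both models are evaluated on the same real-world test set.
print("real-trained:     ", accuracy_score(y_test, model_real.predict(X_test)))
print("synthetic-trained:", accuracy_score(y_test, model_synth.predict(X_test)))
```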
Ethics approval
Anonymous data used in this study was approved by the North of Scotland Research Ethics Committee 2 (LREC: 17/NS/0070) for the RLS study conducted at the University of Liverpool.
The model used age, height, weight, symptoms and comorbidities in a mathematical formulation. Let $\mathbf{x}_i$ be the vector containing these recordings for the $i$-th person and let $X = (\mathbf{x}_1, \dots, \mathbf{x}_n)^\top$ be the matrix containing the data about all $n$ people. As part of developing methodological rigour, we considered a working example to predict whether each person in the sample develops depression. Let $\mathbf{y} = (y_1, \dots, y_n)$ be the vector of response variables, where $y_i = 1$ if person $i$ develops depression and $y_i = 0$ otherwise. In this example we collect data for $n = 3$ people and have $d = 3$ recordings for each person (i.e., age, height and weight), represented by $X$ and $\mathbf{y}$ respectively. The data can be summarised in Table 1 as follows:
Table 1. Example Dataset for Predicting Depression.
| Person # | Age | Height (m) | Weight (kg) | Depression |
|---|---|---|---|---|
| 1 | 67 | 1.9 | 65 | 1 |
| 2 | 43 | 1.2 | 75 | 0 |
| 3 | 23 | 1.5 | 43 | 0 |
We created a function, $f_{\boldsymbol{\theta}}$, with parameters $\boldsymbol{\theta}$, that takes the age, height, and weight ($\mathbf{x}_i$) of person $i$ as input and outputs a prediction of whether they will develop depression. Let $\hat{y}_i$ be the prediction of whether person $i$ develops depression; then we say that

$\hat{y}_i = f_{\boldsymbol{\theta}}(\mathbf{x}_i).$

The performance of parameters $\boldsymbol{\theta}$ can be tested through a loss function, defined as $L(\boldsymbol{\theta}; X, \mathbf{y})$, which measures the difference between the true values of $\mathbf{y}$ and the predictions, $\hat{\mathbf{y}}$. The loss function imposes a penalty when incorrect predictions are made. Hence, to find the best $\boldsymbol{\theta}$, we solve the optimisation problem:

$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} L(\boldsymbol{\theta}; X, \mathbf{y}).$

The function can then be used to make predictions for patients who have not been tested for depression.
An initial observation was that our prediction function could become over-fitted to the data. This meant that the function captured the specific relationship between $X$ and $\mathbf{y}$ in the training sample very well, but if this sample did not represent the true distribution of symptoms and comorbidities, the prediction function would not generalise to other data.
The performance of the prediction function on unseen data can be estimated by separating the data into a training set, $(X_{\text{train}}, \mathbf{y}_{\text{train}})$, and a test set, $(X_{\text{test}}, \mathbf{y}_{\text{test}})$. The optimal parameters are found using the training set and the model's accuracy is then tested on the test set. This accuracy is measured by the proportion of correctly classified data, recorded in a confusion matrix of the frequencies of each possible outcome. Let $C$ be the confusion matrix defined as:

$C = \begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix} \quad (1)$

where $c_{jk}$ is the number of times $\hat{y}_i = j$ while $y_i = k$. The accuracy of our model is then

$\text{accuracy} = \frac{c_{00} + c_{11}}{c_{00} + c_{01} + c_{10} + c_{11}}. \quad (2)$
To summarise, the approach is broken down into the following three steps (a code sketch follows the list):
- 1) Solve the optimisation problem $\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} L(\boldsymbol{\theta}; X_{\text{train}}, \mathbf{y}_{\text{train}})$ on the training set, where the set of prediction values, $\hat{\mathbf{y}}$, is found by $\hat{y}_i = f_{\boldsymbol{\theta}^*}(\mathbf{x}_i)$.
- 2) Make predictions on the test set using the optimal weights $\boldsymbol{\theta}^*$.
- 3) Construct the confusion matrix, as defined in (1), and find the accuracy of the model on unseen data by equation (2).
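The same three steps can be written in a few lines of scikit-learn. The sketch below uses the toy depression example from Table 1 purely for illustration; logistic regression stands in for the generic prediction function $f_{\boldsymbol{\theta}}$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Toy data from Table 1: age, height (m), weight (kg) -> depression.
X = np.array([[67, 1.9, 65], [43, 1.2, 75], [23, 1.5, 43]])
y = np.array([1, 0, 0])

# Step 1: solve the optimisation problem (a train/test split is omitted
# here only because the toy example has three rows).
model = LogisticRegression().fit(X, y)

# Step 2: make predictions with the fitted weights.
y_hat = model.predict(X)

# Step 3: build the confusion matrix (1) and compute the accuracy (2).
C = confusion_matrix(y, y_hat)
accuracy = np.trace(C) / C.sum()
print(C, accuracy)
```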
Data preparation – Manchester
In the Manchester dataset, the presence of various symptoms and diagnoses is recorded for each woman with endometriosis. These are summarised, with descriptions, in Table 2. A total of 15 recordings (the features in Table 2) are made for each person, and so we define $\mathbf{x}_i$ to be the vector containing the recordings for person $i$.
Table 2. Manchester Data Feature Variables.
| Feature | Data Type | Description |
|---|---|---|
| Age | Integer | Age of the patient |
| Menorrhagia | Binary | Whether or not the patient has been diagnosed with menorrhagia |
| Dysmenorrhea | Binary | Whether or not the patient has been diagnosed with dysmenorrhea |
| Non-menstrual pelvic pain | Binary | Whether or not the patient experiences non-menstrual pelvic pain |
| Dysphasia | Binary | Whether or not the patient experiences dysphasia |
| Dyspareunia | Binary | Whether or not the patient experiences dyspareunia |
| Other symptoms | Binary | Whether or not the patient has any other symptoms besides the ones recorded in other features |
| Infertility | Binary | Whether or not the patient is infertile |
| No. of endo symptoms | Binary | Whether or not the patient has more than one symptom |
| Year of diagnosis | Date | The year of the patient's diagnosis of endometriosis |
| Other surgery (not related to endometriosis) | Binary | Whether or not the patient received any surgeries not related to endometriosis |
| Discharged | Binary | Whether or not the patient was discharged |
| Follow up | Binary | Whether or not the patient has follow-up clinical appointments |
| Hormonal treatment currently | Binary | Whether or not the patient is currently taking any hormonal treatment |
| No. of hormonal treatments tried | Integer | The number of hormonal treatments the patient has tried |
Additionally, for each individual $i$, three response variables are documented, which are summarised, along with their descriptions, in Table 3. These variables are defined as follows:
Table 3. Manchester Data Response Variables.
| Variable | Name | Description |
|---|---|---|
| $y^{(1)}$ | Mental Health | The presence of at least one of various mental health conditions |
| $y^{(2)}$ | IBS | The presence of irritable bowel syndrome (IBS) |
| $y^{(3)}$ | Comorbidities (Other) | The presence of at least one other disease |
| $y^{(4)}$ | Combined | The presence of at least one of the above conditions |
We examined three models of fit, one for each response variable. We defined a fourth response variable, "Combined", as shown in the final row of Table 3, which indicates the presence of at least one of the other three conditions. Formally, $y^{(4)}_i$ is defined as:

$y^{(4)}_i = \max\left\{y^{(1)}_i,\, y^{(2)}_i,\, y^{(3)}_i\right\}.$

We fitted a fourth model for this response variable.
We converted the binary variables, including our response variables, from "Yes" and "No" to 1 and 0, respectively. There was no missing data in the Manchester dataset, and as such, we made use of all observations.
Data preparation – Liverpool
The raw data from Liverpool defined 68 possible symptoms, which were considered as feature variables, and a significant rate of missing data was identified. The complete list of features, along with their percentage of missing values, can be found in Table 4.
To prepare the data, we first filtered by "Endometriosis = TRUE", retaining only patients with a diagnosis of endometriosis and leaving a sample of 315 patients. We then removed features with more than 1% missing values. The feature "Endometriosis" is a binary identifier which, after filtering, is always true, so we dropped this feature too. The final features are summarised, with descriptions, in Table 5.
Table 4. Liverpool Data Percentage Missing Data.
| Feature | NaN (%) | Feature | NaN (%) | Feature | NaN (%) | Feature | NaN (%) |
|---|---|---|---|---|---|---|---|
| Sample ID | 0.0 | Age at diagnosis | 98.5 | Pain interferes with daily activities | 0.0 | Hormones | 0.0 |
| Age | 0.1 | Endometriosis symptoms | 97.8 | Dysmenorrhoea score | 97.5 | Other information | 28.6 |
| Ethnicity | 96.7 | Endometriosis stage | 70.2 | Non-menstrual pelvic pain | 0.0 | Previous ablation | 0.0 |
| Postcode | 94.4 | VAS | 91.5 | Analgesia for pain | 0.0 | Medications | 85.9 |
| Sample type | 2.8 | FH ENDO | 98.1 | Pain prevents daily activities | 0.0 | Endometrial cancer | 0.0 |
| Hair colour | 96.7 | Adenomyosis | 0.0 | Pelvic pain score | 97.4 | Metastatic lesion | 0.0 |
| Eye colour | 96.7 | Menorrhagia | 0.0 | Miscarriages | 44.5 | Metastatic lesion location | 100.0 |
| Height (m) | 0.1 | Fibroids | 0.0 | Polycystic ovary syndrome | 0.0 | Type of cancer | 99.8 |
| Weight (kg) | 0.4 | Reason for surgery | 18.7 | Irregular cycles | 0.0 | Cancer comments | 98.7 |
| BMI | 0.0 | Previous history | 84.7 | Cu coil | 0.0 | Grade | 100.0 |
| Smoker | 0.0 | Gravidity | 97.3 | Menarche | 97.2 | Stage | 99.8 |
| Pack years | 99.1 | Parity | 8.3 | LMP | 15.7 | Pathology findings | 99.8 |
| Exercise | 97.4 | Deliveries | 96.8 | Menopause | 100.0 | Cancer staging | 0.0 |
| Alcohol | 0.0 | Infertility | 0.0 | Post-menopause | 0.0 | Dating by histology | 64.3 |
| Drinks per week | 98.5 | Dyspareunia | 0.0 | Cycle length | 17.4 | Hormonal dating | 99.8 |
| Endometriosis | 0.0 | Dysmenorrhoea | 0.0 | Days of bleeding | 18.4 | Agreement of date | 0.0 |
| Age first symptoms | 98.6 | Analgesia | 0.0 | Contraceptive/hormone treatment | 59.9 | Comments | 70.1 |
Table 5. Liverpool Data Features with Less than 1% Missing Data.
| Feature | Data Type | Description |
|---|---|---|
| Age | Integer | Age of patient |
| Height (m) | Real | Height of patient in meters |
| Weight (kg) | Real | Weight of patient in kilograms |
| BMI | Real | BMI of patient |
| Smoker | Binary | Whether or not the patient smokes |
| Alcohol | Binary | Whether or not the patient consumes alcohol |
| Adenomyosis | Binary | Whether or not the patient has been diagnosed with adenomyosis |
| Menorrhagia | Binary | Whether or not the patient has been diagnosed with menorrhagia |
| Fibroids | Binary | Whether or not the patient has been diagnosed with fibroids |
| Infertility | Binary | Whether or not the patient is infertile |
| Dyspareunia | Binary | Whether or not the patient has been diagnosed with dyspareunia |
| Dysmenorrhoea | Binary | Whether or not the patient has been diagnosed with dysmenorrhoea |
| Analgesia | Binary | Whether or not the patient takes analgesia |
| Pain interferes with daily activities | Binary | Whether or not the patient experiences pain with daily activities |
| Non-menstrual pelvic pain | Binary | Whether or not the patient experiences non-menstrual pelvic pain |
| Analgesia for pain | Binary | Whether or not the patient takes analgesia to relieve pain |
| Pain prevents daily activities | Binary | Whether or not the patient says that pain prevents them from performing daily activities |
| PCOS | Binary | Whether or not the patient has polycystic ovary syndrome |
| Irregular cycles | Binary | Whether or not the patient experiences irregular menstrual cycles |
| Cu coil | Binary | Whether or not the patient has ever had a Cu coil |
| Post-menopausal | Binary | Whether or not the patient has been through menopause |
| Hormones | Binary | Whether or not the patient is taking any hormone replacement treatments |
| Previous ablation | Binary | Whether or not the patient has had a previous ablation |
| Endometrial cancer | Binary | Whether or not the patient has or had endometrial cancer |
| Metastatic lesion | Binary | Whether or not the patient had any cancerous lesions |
| Cancer staging agreement with Pathology | Binary | Whether or not the patient had an existing involvement within the cancer pathway |
| Agreement of staging | Binary | Whether or not the patient had a staging agreement |
A total of 4 patients were identified with missing values and were subsequently removed from the dataset, resulting in a final sample size of 311 patients. We selected two diseases as our response variables for prediction (Table 6). Given our ultimate objective of predicting multimorbidity in patients, we constructed a final response variable, "Combined", as a binary variable representing the presence of at least one of the other two response variables, akin to the Manchester data. Formally, the combined response is $y^{(3)}_i = \max\{y^{(1)}_i, y^{(2)}_i\}$, with the response variables defined as follows:
Table 6. Liverpool Data Response Variables.
| Variable | Name | Description |
|---|---|---|
| $y^{(1)}$ | Adenomyosis | Whether the patient has been diagnosed with adenomyosis |
| $y^{(2)}$ | Menorrhagia | Whether the patient has been diagnosed with menorrhagia |
| $y^{(3)}$ | Combined | The presence of at least one of the above conditions |
Synthetic Data
To address this concern, we employed the Synthetic Data Vault (SDV) package in Python to create synthetic data as a substitute and assessed its similarity to the real data. By sampling from the fitted model, we could generate a dataset with an expanded sample size that more accurately represents the entire population.
During our data preparation, we eliminated numerous observations due to missing data. Synthetic data could potentially serve as a replacement for these missing values. However, in our analysis, we opted to generate entirely new observations, rather than filling in the gaps.
We utilised SDV's Gaussian Copula model, which constructs a distribution over the unit cube $[0, 1]^d$ from a multivariate normal distribution over $\mathbb{R}^d$ by using the probability integral transform. The Gaussian Copula characterises the joint distribution of the random variables representing each feature by analysing the dependencies between their marginal distributions. Once the model is fitted to our data, it can be used to sample additional instances of data.
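In code, the fit-then-sample workflow looks roughly as follows. This is a sketch assuming the SDV 1.x API (older releases exposed a different interface), and the dataframe is a hypothetical stand-in for the prepared patient records.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real = pd.read_csv("prepared_records.csv")  # hypothetical prepared dataset

# Describe the table so the synthesizer knows each column's data type.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)

# Fit the Gaussian Copula to the real records, then sample new ones.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real)
synthetic = synthesizer.sample(num_rows=1000)
```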
Manchester Data
We initiated our analysis with the Manchester data, and after fitting the Gaussian Copula to our 99 samples, we generated an additional 1000 samples.
By employing SDV's SDMetrics library, we were able to evaluate the similarity between the real and synthetic data. We examined how closely the synthetic data relates to the real data in order to determine whether we had adequately captured the true distribution. This assessment involved comparing the distribution similarities across each feature, and we adopted two approaches for this evaluation.
Initially, we measured the similarities across each feature by comparing the shapes of their frequency plots, as illustrated in
Figure 1. This comparison was conducted based on the “age” distribution for both the real and synthetic data.
Figure 1. Age distribution shape comparison.
For numerical data, SDV computed the Kolmogorov-Smirnov (KS) statistic, which represents the maximum difference between the cumulative distribution functions. The value of this distance ranges between 0 and 1, with SDV converting it to a score by:

$\text{score} = 1 - \text{KS}.$
For Boolean data, SDV calculates the Total Variation Distance (TVD) between the real and synthetic data. We determined the frequency of each category value and represented it as a probability. The TVD statistic compares the differences in probabilities, as given by:

$\delta(R, S) = \frac{1}{2} \sum_{\omega \in \Omega} \left| R_\omega - S_\omega \right|,$

where $\Omega$ represents the set of possible categories, and $R_\omega$ and $S_\omega$ are the frequencies of category $\omega$ in the real and synthetic datasets respectively. The similarity score is then given by:

$\text{score} = 1 - \delta(R, S).$
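Both per-feature scores are straightforward to reproduce outside of SDMetrics. The sketch below computes the same quantities with scipy and pandas; it is an illustrative re-implementation under the definitions above, not the library's own code.

```python
import pandas as pd
from scipy.stats import ks_2samp

def ks_score(real_col: pd.Series, synth_col: pd.Series) -> float:
    """1 - KS statistic: the maximum CDF gap, flipped so 1 means identical."""
    ks_stat, _ = ks_2samp(real_col.dropna(), synth_col.dropna())
    return 1.0 - ks_stat

def tvd_score(real_col: pd.Series, synth_col: pd.Series) -> float:
    """1 - Total Variation Distance between the category frequencies."""
    r = real_col.value_counts(normalize=True)
    s = synth_col.value_counts(normalize=True)
    categories = r.index.union(s.index)
    tvd = 0.5 * sum(abs(r.get(c, 0.0) - s.get(c, 0.0)) for c in categories)
    return 1.0 - tvd
```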
The score for each feature is summarised in
Figure 2, and we obtained an average similarity score of 0.92.
For the second measure of similarity, we constructed a heatmap to compare the distribution across all possible combinations of categorical data. This was accomplished by calculating a score for each combination of categories. To initiate this process, two normalised contingency tables were constructed: one for the real-world data and one for the synthetic data. Let $a$ and $b$ be two features; the contingency tables describe the proportion of rows that have each combination of categories in $a$ and $b$, thereby illustrating the joint distributions of these categories across the two datasets.
Figure 2. Feature Distribution Shape Comparison.
To compare the distributions, SDV computes the difference between the contingency tables using the Total Variation Distance. This distance is subsequently subtracted from 1, implying that a higher score denotes greater similarity. Let $A$ and $B$ represent the sets of categories in features $a$ and $b$ respectively; the score between features $a$ and $b$ is calculated as follows:

$\text{score}(a, b) = 1 - \frac{1}{2} \sum_{\alpha \in A} \sum_{\beta \in B} \left| S_{\alpha,\beta} - R_{\alpha,\beta} \right|, \quad (3)$

where $S_{\alpha,\beta}$ and $R_{\alpha,\beta}$ represent the proportions of categories $\alpha$ and $\beta$ occurring simultaneously, as derived from the contingency tables for the synthetic and real data, respectively. It is important to note that we did not employ a measure of association between features, such as Cramér's V, since it does not measure the direction of the bias and may consequently yield misleading results.
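Equation (3) can be computed directly from two normalised contingency tables; the following is a small sketch with pandas, where the feature names are placeholders.

```python
import pandas as pd

def contingency_similarity(real: pd.DataFrame, synth: pd.DataFrame,
                           a: str, b: str) -> float:
    """Equation (3): 1 - TVD between the normalised contingency tables."""
    r_tab = pd.crosstab(real[a], real[b], normalize=True)
    s_tab = pd.crosstab(synth[a], synth[b], normalize=True)
    # Align on the union of categories; absent combinations count as 0.
    r_tab, s_tab = r_tab.align(s_tab, join="outer", fill_value=0.0)
    return 1.0 - 0.5 * (r_tab - s_tab).abs().to_numpy().sum()
```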
A score of 1 indicates that the contingency table was identical between the two datasets, while a score of 0 indicates that the two datasets were as dissimilar as possible. These scores for all combinations of features are depicted as a heatmap (Figure 3). It is worth noting that continuous features, such as "Age", were discretised in order to utilise Equation (3) in determining a score.
Figure 3. Distribution Comparison Heatmap.
The heatmap suggests that most features exhibit a strikingly similar distribution across the two datasets, with the exception of "Year of Diagnosis". This discrepancy could potentially be attributed to the feature's inherent nature as a date, despite being treated as an integer in the model; this issue merits further investigation.
Based on these metrics, we concluded that the new data closely adhered to the distribution of the original data.
Liverpool Data
To generate synthetic data, we adhered to the same procedure as with the Manchester data. We produced 1000 additional samples from a Gaussian copula fitted to the 311 real samples and combined them with the real data to create a new dataset. Using contingency tables, we developed a heatmap by applying the formula in Equation (3) to generate scores; this heatmap is displayed in
Figure 4. A score of 1 implies that the contingency table was identical between the two datasets, whereas a score of 0 indicates that the two datasets were as distinct as possible. Our analysis revealed an average similarity of 0.94.
Figure 4. Real Vs Synthetic Data Distribution Heatmap (Liverpool Data).
We compared the shape of the distributions for each feature; for instance, the distributions for the "Height" feature are illustrated in Figure 5, where we observed that the distributions were dissimilar. To calculate similarity scores, we employed the KS statistic for numerical features and the Total Variation Distance for Boolean features. These scores are summarised in Figure 6. We found that the distributions of "Height" and "Weight" were not similar; however, the distributions of the remaining features exhibited similarity. With an average similarity of 0.75, we concluded that the data distributions were, on average, similar: the distributions of all categorical features were accurately captured, but two of the continuous features were not.
Figure 5. Height Distribution Shape Comparison (Liverpool).
Figure 6. Feature Distribution Shape Comparison Between Real and Synthetic Data (Liverpool).
Models
We evaluated three standard classification models to predict the response variables: Logistic Regression (LR), Support Vector Machines (SVM), and Random Forest (RF). They employ distinct methods of data separation and provide unique insights.
Logistic regression enables us to determine the likelihood of each class occurring. It offers straightforward interpretability of the model's coefficients, allowing us to conduct statistical tests on these coefficients to discern which features significantly impact the response variable's value. While logistic regression adopts a more statistical approach by maximising the conditional likelihood of the training data, SVMs take a more geometric approach, maximising the distance between the hyperplanes that separate the data. We fitted both logistic regression and SVMs to compare the performance of these approaches.
In contrast to SVMs and logistic regression, which attempt to separate the data using a single decision boundary, random forests employ decision trees that partition the decision space into smaller regions using multiple decision boundaries.
The performance of these models varies depending on the nature of the data's separability. Consequently, we fitted all three models and compared their accuracies to assess the usability of the synthetic data.
Logistic Regression
Let $\mathbf{y}$ be the general vector of response variables and let $\mathbf{x}_i$ be the corresponding vector of features for patient $i$. We defined the function:

$p_{\boldsymbol{\theta}}(\mathbf{x}_i) = \frac{1}{1 + e^{-\boldsymbol{\theta}^\top \mathbf{x}_i}} \quad (4)$

to be the probability of patient $i$ developing the condition corresponding to $y_i$, where $\boldsymbol{\theta}$ are some weights. The prediction function is then defined to be:

$f_{\boldsymbol{\theta}}(\mathbf{x}_i) = \begin{cases} 1 & \text{if } p_{\boldsymbol{\theta}}(\mathbf{x}_i) \geq 0.5, \\ 0 & \text{otherwise.} \end{cases}$

We determined the optimal weights by solving the optimisation problem:

$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} L(\boldsymbol{\theta}; X, \mathbf{y}),$

where, for logistic regression, the loss function $L$ took the form of the negative log-likelihood:

$L(\boldsymbol{\theta}; X, \mathbf{y}) = -\sum_{i=1}^{n} \left[ y_i \log p_{\boldsymbol{\theta}}(\mathbf{x}_i) + (1 - y_i) \log\big(1 - p_{\boldsymbol{\theta}}(\mathbf{x}_i)\big) \right].$

Finally, we incorporated a regularisation term, $\lambda \lVert \boldsymbol{\theta} \rVert^2$, to prevent overfitting; this facilitated capturing the underlying distribution of the data without the model becoming overly specific to the training data, helping to mitigate potential biases.
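In practice these models were fitted with scikit-learn rather than a hand-written solver. In scikit-learn the regularisation strength is exposed as C, which behaves like $1/\lambda$; the stand-in data below is only there to make the sketch runnable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))      # stand-in feature matrix
y_train = rng.integers(0, 2, size=100)   # stand-in binary responses

# L2-regularised logistic regression; smaller C = stronger regularisation.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_train)[:, 1]  # the modelled probabilities p(x_i)
preds = clf.predict(X_train)              # thresholded at 0.5
```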
SVMs
Next, we examined Support Vector Machines. We slightly redefined our response variables from binary $y_i \in \{0, 1\}$ to binary $\tilde{y}_i \in \{-1, 1\}$. For instance, suppose $y_i$ represents the binary response for a patient developing a mental health condition; then $\tilde{y}_i$ is defined as:

$\tilde{y}_i = 2y_i - 1.$

For SVMs, the prediction function takes the form:

$f_{\mathbf{w}, b}(\mathbf{x}_i) = \operatorname{sign}\left(\mathbf{w}^\top \mathbf{x}_i + b\right), \quad (5)$

where $\mathbf{w}$ and $b$ are some weights. We considered the hinge loss function, defined as:

$\ell(\tilde{y}_i, \mathbf{x}_i) = \max\left\{0,\; 1 - \tilde{y}_i\left(\mathbf{w}^\top \mathbf{x}_i + b\right)\right\}.$

The function $\ell$ is 0 when $\tilde{y}_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$, or in other words, when we have made a correct prediction with a sufficient margin. Conversely, when $\tilde{y}_i(\mathbf{w}^\top \mathbf{x}_i + b) < 1$, we incur some penalty. Therefore, for SVMs, the loss function takes the form:

$L(\mathbf{w}, b; X, \tilde{\mathbf{y}}) = \sum_{i=1}^{n} \max\left\{0,\; 1 - \tilde{y}_i\left(\mathbf{w}^\top \mathbf{x}_i + b\right)\right\} + \lambda \lVert \mathbf{w} \rVert^2,$

where $\lambda$ is a parameter controlling the impact of the regularisation term. Similar to logistic regression, this term manages a trade-off between capturing the distribution of the entire population and overfitting to the training data.
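scikit-learn's svm.SVC handles the relabelling from $\{0, 1\}$ to $\{-1, 1\}$ internally, so in code an SVM is used much like the logistic model. Again the data is a stand-in, and C plays the role of the inverse regularisation strength.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = rng.integers(0, 2, size=100)

# Linear SVM; larger C penalises margin violations more heavily.
svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
preds = svm.predict(X_train)  # sign(w.x + b), mapped back to {0, 1}
```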
Random Forest
The final model we fitted is the random forest predictor. Random forests classify data points through an ensemble of decision trees, which operate by separating the predictor space with a series of linear boundaries. As before, we let $\mathbf{y}$ be our set of response variables with corresponding feature vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$. To construct a random forest of $B$ trees, we followed the procedure below (a code sketch follows the list).
For $b = 1, \dots, B$:
- 1) Sample, with replacement, $X_b$ and $\mathbf{y}_b$ from $X$ and $\mathbf{y}$ respectively.
- 2) Fit a decision tree, $T_b$, to the dataset $(X_b, \mathbf{y}_b)$.
When making predictions on unseen data, the model took the majority vote across all trees.
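In the experiments this bootstrap-and-vote procedure was not hand-coded; scikit-learn's RandomForestClassifier implements it directly, as in this brief sketch with stand-in data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = rng.integers(0, 2, size=100)

# B trees, each fitted to a bootstrap resample; prediction is a majority vote.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
preds = forest.predict(X_train)
```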
For all experiments, we split the real data in half, yielding one training set of real data and one test set of real data. The entire synthetic dataset was used as training data. All models were evaluated on the real test set, thus enabling us to compare the performance between models trained on real data and models trained on synthetic data.
All the models contain hyperparameters that impact performance on unseen data. For each model, we performed hyperparameter optimisation using a grid search, measuring accuracy through cross-validation to find the optimal hyperparameter values. $k$-fold cross-validation divides the data into $k$ subsets; by training the model on $k - 1$ subsets and testing on the remaining one, we obtained an estimate of how the model will perform on unseen data. This process was repeated, holding out a different subset for testing each time, and the average performance was calculated. We performed a cross-validation grid search using the training data, selected the hyperparameter value that yielded the best average accuracy, and then retrained the model on the complete training set, as sketched below.
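The grid search with 5-fold cross-validation was run through scikit-learn's GridSearchCV, roughly as follows; the grid of C values shown here is illustrative rather than the exact grid used in the study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = rng.integers(0, 2, size=100)

# 5-fold CV over an illustrative grid; refit=True retrains the best
# model on the complete training set automatically.
search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=5, scoring="accuracy", refit=True)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```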
Manchester Data
We divided the real-world data into a training set of 50 samples and a test set of 49 samples. The synthetic data was retained as a training set of 1000 samples.
Logistic Regression
We used scikit-learn to fit logistic regression models of the form in equation (4). We performed a grid search using 5-fold cross-validation (CV) to investigate the optimal value of the regularisation strength. The CV accuracies for each response variable can be found in Figure 7. For each response variable, the value that yielded the highest CV accuracy was chosen, refitted to the entire training set, and tested on the test set. The accuracies on the test set are summarised in Table 7.
Table 7. Logistic Regression Model Comparison Across Real and Synthetic Data.
| | Real Accuracy | Synthetic Accuracy |
|---|---|---|
| IBS | | |
| Mental Health | | |
| Comorbidities (Other) | | |
| Combined | | |
| Average | 64.29% | 69.90% |
Figure 7. Logistic Regression Cross-Validation.
We can see that for all response variables, the models performed slightly better when trained on synthetic data.
SVM
We used scikit-learn's svm.SVC to train and test SVMs of the form in equation (5) on our data. Scikit-learn is a popular and well-tested choice for SVMs that has shown high performance on a variety of datasets. We performed a grid search using 5-fold cross-validation to find the optimal value of $C$; the CV accuracies are summarised in Figure 8. For each response variable, the $C$ that yielded the highest CV accuracy was chosen, refitted to the entire training set, and tested on the test set. The accuracies on the test set are summarised in Table 8. We can see that the models trained on synthetic data performed the same as the models trained on real data for all response variables except "Comorbidities (other)", where the model trained on synthetic data performed better, giving the models trained on synthetic data a better performance on average.
Table 8. SVM Comparison with Synthetic Data.
| | Real Accuracy | Synthetic Accuracy |
|---|---|---|
| IBS | | |
| Mental Health | | |
| Comorbidities (other) | | |
| Combined | | |
| Average | 67.35% | 69.39% |
Figure 8. SVM Cross-Validation.
Random Forest
Finally, we fitted random forest models to the data. For each response variable, we performed a grid search using 5-fold cross-validation to find the best number of trees; the CV accuracies are summarised in Figure 9. For each response variable, the number of trees that gave the highest CV accuracy was chosen, refitted to the whole training set, and tested on the test set. The accuracies on the test set are summarised in Table 9.
Table 9. Random Forest Comparison with Synthetic Data.
| | Real No. Trees | Real Accuracy | Synthetic No. Trees | Synthetic Accuracy |
|---|---|---|---|---|
| IBS | | | | |
| Mental Health | | | | |
| Comorbidities (other) | | | | |
| Combined | | | | |
| Average | | | | |
Figure 9. Random Forest Cross-Validation.
Upon examining the average accuracies of all our models in
Table 10, we can draw some conclusions about the performance of the models trained on synthetic data compared to those trained on real data. It is evident that on average, models trained on synthetic data performed better than those trained on real data across all response variables. This result suggests that synthetic data can effectively replace real data in training machine learning models without sacrificing performance. In some cases, it may even lead to improved performance, as demonstrated by the results.
Solver Comparison
In conclusion, the use of synthetic data proves to be a promising approach to training machine learning models when real data is limited or unavailable. The models trained on synthetic data in this study were not only able to perform at a comparable level to those trained on real data, but they often outperformed them. This finding supports the adoption of synthetic data generation methods as a viable alternative or complement to real data in machine learning applications.
Table 10. Comparison of all Models (Manchester Data).
| Data | Logistic Regression | SVM | Random Forest |
|---|---|---|---|
| Real | | | |
| Synthetic | | | |
Sensitivity Analysis
To assess our models' sensitivity, we introduced random noise to the data and measured the impact on model accuracy. We randomly selected 1% of points in each dataset and replaced their values (a sketch of this perturbation follows).
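The sketch below assumes the corrupted cells are chosen uniformly at random and each replacement is drawn uniformly from that feature's observed values, the same scheme described later for the Liverpool sensitivity analysis; the dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

def add_noise(df: pd.DataFrame, frac: float = 0.01, seed: int = 0) -> pd.DataFrame:
    """Replace a random fraction of cells with draws from each column's values."""
    rng = np.random.default_rng(seed)
    noisy = df.copy()
    n_cells = df.shape[0] * df.shape[1]
    for idx in rng.choice(n_cells, size=int(frac * n_cells), replace=False):
        row, col = divmod(idx, df.shape[1])
        # Uniform draw over the feature's observed (possible) values.
        noisy.iat[row, col] = rng.choice(df.iloc[:, col].unique())
    return noisy
```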
Table 11 summarises the accuracy of the new models and the relative percentage change in accuracy.
Table 11. Sensitivity Analysis on Manchester Data.
| Data | Logistic Regression Accuracy | Logistic Regression Change | SVM Accuracy | SVM Change | Random Forest Accuracy | Random Forest Change |
|---|---|---|---|---|---|---|
| Real | | | | | | |
| Synthetic | | | | | | |
Table 11 reveals that the accuracy of the models was impacted in some instances; the logistic regression model trained on real data was affected more than its synthetically trained counterpart. Neither dataset showed a consistent pattern in how the models were affected.
Liverpool Results
We divided the real-world data into a training set of 154 samples and a test set, also containing 154 samples. The synthetic data was retained as a training set comprising 1000 samples.
Logistic Regression
We employed scikit-learn to fit logistic regression models, as described in equation (4), and performed a grid search using 5-fold cross-validation to investigate the optimal value of the regularisation strength. The CV accuracies for each response variable can be found in Figure 10. For each response variable, the value that gave the highest CV accuracy was chosen, refitted to the whole training set, and tested on the test set. The accuracies on the test set are summarised in Table 12.
Table 12. Logistic Regression Model Comparison (Liverpool).
| | Real Accuracy | Synthetic Accuracy |
|---|---|---|
| Adenomyosis | | |
| Menorrhagia | | |
| Combined | | |
| Average | | |
Figure 10. Logistic Regression Cross-Validation.
The models in Table 12, trained on synthetic data, exhibited similar performance in predicting adenomyosis and demonstrated improved performance for the remaining response variables.
SVM
We trained and tested SVMs, as defined in equation (5), using scikit-learn's svm.SVC on our data. We employed a 5-fold cross-validation grid search to find the optimal value of $C$, summarised in Figure 11. For each response variable, we selected the $C$ yielding the highest CV accuracy, refitted it to the entire training set, and evaluated it on the test set. Test set accuracies are provided in Table 13.
Table 13. SVM Model Comparison.
| | Real Accuracy | Synthetic Accuracy |
|---|---|---|
| Adenomyosis | | |
| Menorrhagia | | |
| Combined | | |
| Average | | |
The models trained on synthetic data demonstrated improved performance for all response variables (
Table 13).
Figure 11. SVM Cross-Validation.
Random Forest
We fitted random forest models and performed a grid search using 5-fold cross-validation over the number of trees (Figure 12). For each response variable, we chose the number of trees with the highest CV accuracy, refitted the model to the entire training set, and evaluated it on the test set. Test set accuracies are summarised in Table 14.
Table 14. Random Forest Model Comparison.
| | Real No. Trees | Real Accuracy | Synthetic No. Trees | Synthetic Accuracy |
|---|---|---|---|---|
| Adenomyosis | | | | |
| Menorrhagia | | | | |
| Combined | | | | |
| Average | | 73.76% | | 74.62% |
The model trained on synthetic data performed comparably to the model trained on real data for adenomyosis and better for the remaining response variables. Consequently, the models trained on synthetic data performed better on average.
Figure 12. Random Forest Cross-Validation.
Solver Comparison
To summarise, the average accuracies of all models are presented in Table 15. The best overall-performing model was the random forest trained on the synthetic data. We also observed that, on average, the models trained on synthetic data outperformed those trained on real data. These results support the use of synthetic data in place of the real data.
Table 15. Comparison of all Models.
| Data | Logistic Regression | SVM | Random Forest |
|---|---|---|---|
| Real | | | |
| Synthetic | | | |
Sensitivity Analysis
To test the sensitivity of our models, we added random noise to the data and measured its impact on model accuracy. We randomly selected 1% of points in each dataset to introduce noise; the values at these points were replaced by random samples from a uniform distribution over the feature's possible values.
Table 16 displays the accuracy of the new models and their relative percentage change in accuracy.
Table 16. Sensitivity Analysis on Liverpool Data.
| Data | Logistic Regression Accuracy | Logistic Regression Change | SVM Accuracy | SVM Change | Random Forest Accuracy | Random Forest Change |
|---|---|---|---|---|---|---|
| Real | | | | | | |
| Synthetic | | | | | | |
From
Table 16, we can observe that the performance of the logistic regression and SVM models experienced minimal change. However, the random forest models demonstrated a significant drop in performance, indicating that they are sensitive to perturbations in the data. This suggests that for the random forest models, it is crucial for the synthetic data’s distribution to closely resemble the real data, as the models are sensitive to small variations.
Comparison Across Datasets
Table 17. Comparison of all Models.
| Data | Logistic Regression (Manchester) | Logistic Regression (Liverpool) | SVM (Manchester) | SVM (Liverpool) | Random Forest (Manchester) | Random Forest (Liverpool) |
|---|---|---|---|---|---|---|
| Real | | | | | | |
| Synthetic | | | | | | |
Table 17 compares the model accuracies across both datasets. We observed that the models trained on the Liverpool dataset consistently outperformed those trained on the Manchester dataset, for both real and synthetic data.
The two datasets documented different attributes of individuals and contained varying numbers of features and observations. The Liverpool dataset had a larger number of both features and observations, and our method performed well on both datasets. These results support the idea that our method can be applied to a diverse range of datasets. The experiments also demonstrated the effectiveness of our method with both continuous and categorical data, although the distribution analysis of the Liverpool synthetic data showed that performance was weakest on two continuous features.
Throughout the experiments, models trained on synthetic data performed similarly to or better than those trained on real data. Since all models were tested on real data, this evidence supports the argument that synthetic data can be used as a replacement for real data.
Discussion
Multimorbidity is a growing concern within the global population, particularly for those with chronic conditions like endometriosis, where treatment options are limited. Predicting multimorbidity is challenging among endometriosis patients due to late diagnoses. Therefore, employing machine learning methods to use key features to predict the possibility of multimorbidity is valuable for healthcare services, patients and clinicians. Our findings suggest that the method could be replicated for other complex women’s health conditions such as polycystic ovary syndrome, gestational diabetes or fibroids.
Our findings indicate that the real-world dataset contained one variable that was a significant indicator for developing multimorbidity, and they highlight the usefulness of synthetic data for future research, especially in cases with higher rates of missing data. Synthetic data can also provide more detailed information regarding the relationships between these variables, as they could be considered significant indicators. These indicators can be used to differentiate between samples with symptoms and those with disease sequelae, which would influence the clinical decision-making process, particularly for patients requiring excision surgery. With a larger sample size and better representation of the overall population, synthetic data has the potential to provide more detailed information about the significance of each feature.
Previous research used methods such as pairwise comparisons to assess diseases in pairs, combining results where appropriate for similar diseases. This technique may have a higher error rate, as complex chronic diseases do not follow a one-size-fits-all approach. Whilst pairwise techniques may show that observed co-occurrence frequencies and predicted frequencies are dissimilar, they can still reveal correlations, as indicated by Hidalgo and colleagues' disease network of nodes and edges [
6]. This is akin to a network meta-analysis approach. A limitation with this approach in disease prediction could be the lack of temporal data in the resulting network nodes, necessitating an additional analysis such as a correlation evaluation [
6]. This also means that data with missing data points may be entirely deleted, impacting the final analysis and any subsequent conclusions. Correlation analyses would enable researchers and clinicians to understand the spread of the diseases based on the links shown within the network that can be modelled over time [
6]. Jensen and colleagues demonstrated a similar temporal network approach, showing that a pairwise method can be combined with a correlation analysis over time [
7]. Giannoula and colleagues used this approach to reveal disease clusters, applying dynamic time warping alongside a pairwise method to mine multimorbidity patterns and phenotypes from extensive data points [
8]. In comparison, our combined approach of machine learning on a synchronised dataset can provide better multimorbidity prediction.
Another class of models used to predict multimorbidity is probabilistic methods, which focus on the relationships among diseases rather than on pairwise comparisons. Strauss and colleagues employed this method to model a small real-world dataset from the UK, evaluating multimorbidity cluster trajectories. Individual patients were grouped into clusters based on the number of chronic conditions detected within their healthcare records over a specific period. These clusters were divided into four main categories according to the presence or absence of chronic problems among the comorbidities. However, this approach did not consider patients with undiagnosed symptoms aligned with chronic conditions, which is a common observation in real-world data.
The distribution of the synthetic data captures the true distribution of the real-world data but can have an arbitrarily larger sample size, indicating that synthetic data has the potential to provide valuable insights for healthcare services. To address the increasing and complex healthcare demands of a growing population, effective clinical service design is crucial for healthcare sustainability. Moreover, our results show that the synthetic data accurately represents the real data and so can be used in its place where the real data contains sensitive or private information that cannot be shared. The accuracy measures of our models support the hypothesis that the use of synthetic data does not affect the performance of the prediction models used in this analysis.
Limitations
Model performance will need to be tested on larger and more complex datasets before a digital clinical trial can be conducted to further optimise the models.
Conclusions
Our study created an exploratory machine learning model that can predict multimorbidity among women with endometriosis using real-world and synthetic data. Before experimenting with the models developed using the real-world dataset, a quality assessment was conducted by comparing the synthetic and real-world datasets. Distribution and similarity plots suggested that the synthetic data did indeed follow the same distribution as the real-world data. Therefore, synthetic data generation shows great promise, especially for conducting high-quality clinical epidemiology and clinical trials that could devise better precision treatments for endometriosis and possibly prevent multimorbidity.
Author Contributions
FEINMAN is part of the ELEMI program developed and conceptualised by GD. GD and PP conceptualised and developed work package 1 of the FEINMAN project. GD devised the use of synthetic data to better assess chronic diseases. GD devised the hypothesis for using synthetic data modelled on clinical symptoms to develop optimal prediction models. GD, AZ and PP furthered the study protocol. GD developed the method and furthered this with PP, AZ, DB, JQS, HC, DKP and AS. GD, DB, PP and AZ designed and executed the analysis plan. All authors critically appraised, commented on and agreed the final manuscript. All authors approved the final manuscript.
Availability of Data and Material
The authors will consider sharing the dataset gathered upon receipt of reasonable requests.
Code Availability
The authors will consider sharing the code upon receipt of reasonable requests.
Conflicts of Interest
PP has received research grants from Novo Nordisk and Janssen-Cilag, educational support from Queen Mary University of London, and other support from John Wiley & Sons, outside the submitted work. All other authors report no conflict of interest. The views expressed are those of the authors and not necessarily those of the NHS, the National Institute for Health Research, the Department of Health and Social Care or the academic institutions.
References
- Delanerolle G, Ramakrishnan R, Hapangama D, Zeng Y, Shetty A, Elneil S, Chong S, Hirsch M, Oyewole M, Phiri P, Elliot K, Kothari T, Rogers B, Sandle N, Haque N, Pluchino N, Silem M, O'Hara R, Hull ML, Majumder K, Shi JQ, Raymont V. A systematic review and meta-analysis of the Endometriosis and Mental-Health Sequelae; The ELEMI Project. Womens Health (Lond). 2021 Jan-Dec;17:17455065211019717. doi:10.1177/17455065211019717.
- Alimohammadian M, Majidi A, Yaseri M, Ahmadi B, Islami F, Derakhshan M, Delavari A, Amani M, Feyz-Sani A, Poustchi H, Pourshams A. Multimorbidity as an important issue among women: results of a gender difference investigation in a large population-based cross-sectional study in West Asia. BMJ Open. 2017 May 1;7(5):e013548.
- Tripp-Reimer T, Williams JK, Gardner SE, Rakel B, Herr K, McCarthy AM, Hand LL, Gilbertson-White S, Cherwin C. An integrated model of multimorbidity and symptom science. Nursing Outlook. 2020 Jul 1;68(4):430-9. doi:10.1016/j.outlook.2020.03.003.
- Oni T, McGrath N, BeLue R, Roderick P, Colagiuri S, May CR, Levitt NS. Chronic diseases and multi-morbidity - a conceptual modification to the WHO ICCC model for countries in health transition. BMC Public Health. 2014 Dec;14(1):1-7. doi:10.1186/1471-2458-14-575.
- Delanerolle GK, Shetty S, Raymont V. A perspective: use of machine learning models to predict the risk of multimorbidity. LOJ Medical Sciences. 2021 Sep 14;5(5).
- Hassaine A, Salimi-Khorshidi G, Canoy D, Rahimi K. Untangling the complexity of multimorbidity with machine learning. Mechanisms of Ageing and Development. 2020 Sep 1;190:111325.
- Jensen AB, Moseley PL, Oprea TI, Ellesøe SG, Eriksson R, Schmock H, Jensen PB, Jensen LJ, Brunak S. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nature Communications. 2014 Jun 24;5(1):4022.
- Giannoula A, Gutierrez-Sacristán A, Bravo Á, Sanz F, Furlong LI. Identifying temporal patterns in patient disease trajectories using dynamic time warping: a population-based study. Scientific Reports. 2018 Mar 9;8(1):1-4.
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).