Socio-Economic and Demographic Data
While multiple factors contribute to dengue incidence, the availability of socioeconomic data for Zulia state is restricted to the limited information recorded in the 2011 census conducted by the National Institute of Statistics (INE), as no more recent data are available for the country. Accordingly, the socioeconomic variables within this category were aggregated at the municipality level in Zulia, as provided by the 2011 census. These variables included the proportion of households living in poverty and the proportion with access to a piped water supply.
Regarding demographic data, annual population figures were obtained from INE (2014), also based on the 2011 Venezuelan National Census. These figures were used to address gaps in the demographic data from 2008 to 2016, aggregated at the municipality level. These gaps arose because the Venezuelan national government has not conducted formal population surveys since 2011; therefore, the most recent population data available for the municipalities of Zulia state date from 2011. While this decision may limit the results, it represents the most accurate approximation available.
Support Vector Machine (SVM)
This technique, devised in the 1960s and substantially improved in the 1990s, is based on statistical learning theory and the principle of structural risk minimisation [15,16]. SVM regression (SVM-R) has become a prevalent technique for dengue prediction [15] due to its reported efficiency in generalisation performance [16]. As a prediction tool, it has shown excellent empirical results, maximising predictive accuracy while avoiding overfitting [15]. SVM-R includes three major components to be considered: (i) learning theory; (ii) the optimal hyperplane algorithm; and (iii) kernel functions [15].
We briefly examined all three primary kernel functions but settled on the radial basis function (RBF) kernel as the most effective one (Table 1).
According to specialists, no efficient, structured method for selecting the optimal hyperparameters exists. Therefore, a simple approach is used to optimise one parameter at a time [15]. Hence, the RBF kernel parameter $\gamma$ is evaluated, with the parameter $C$ fixed at the value 1.0, over a wide range of (typically) exponentially growing values, e.g. $\gamma \in \{2^{-15}, 2^{-13}, \ldots, 2^{3}\}$. Once the optimal value $\gamma^{*}$ has been found, the optimal value of the parameter $C$ is determined [15].
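As an illustration of this one-parameter-at-a-time search, the following minimal sketch fixes $C$ at 1.0, evaluates $\gamma$ over an exponential grid, and then tunes $C$ with the best $\gamma$ held fixed. It uses Python/scikit-learn purely for exposition (the study itself used MATLAB), and the data, grids and scoring metric below are placeholder assumptions:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the weekly covariates and dengue counts.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = rng.standard_normal(200)

# Step 1: fix C = 1.0 and search the RBF parameter gamma over 2^-15 ... 2^3.
gammas = 2.0 ** np.arange(-15, 4, 2)
gamma_scores = [cross_val_score(SVR(kernel="rbf", C=1.0, gamma=g), X, y,
                                cv=10, scoring="neg_mean_squared_error").mean()
                for g in gammas]
best_gamma = gammas[int(np.argmax(gamma_scores))]

# Step 2: hold the best gamma fixed and search C over 2^-5 ... 2^15.
Cs = 2.0 ** np.arange(-5, 16, 2)
C_scores = [cross_val_score(SVR(kernel="rbf", C=c, gamma=best_gamma), X, y,
                            cv=10, scoring="neg_mean_squared_error").mean()
            for c in Cs]
best_C = Cs[int(np.argmax(C_scores))]
print(f"best gamma = {best_gamma:.5g}, best C = {best_C:.5g}")
```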
Gaussian Process Regression
Gaussian process regression (GPR) [18] is a regression approach gaining significant attention in machine learning [26,27,28] due to its nonparametric and Bayesian characteristics. This methodology has shown remarkable performance on small datasets, together with the ability to provide uncertainty measurements on the predicted outcomes. In contrast to many widely used supervised machine learning algorithms, which strive to learn precise values for each parameter in a given function, the Bayesian approach takes a different path by inferring a probability distribution across all potential values.
Let the training set be $\{(x_i, y_i);\; i = 1, 2, \ldots, n\}$, where $x_i \in \mathbb{R}^{d}$ and $y_i \in \mathbb{R}$, obtained from an unknown distribution. A GPR model addresses the question of predicting the value of a response variable $y_{\mathrm{new}}$, given the new input vector $x_{\mathrm{new}}$ and the training data [29]. A linear regression model is of the form:

$$y = x^{T}\beta + \varepsilon,$$

where $\varepsilon \sim N(0, \sigma^{2})$, with $\sigma^{2}$ the error variance; the coefficients $\beta$ are estimated from the data.
A GPR model uses latent variables and explicit basis functions to explain the response. The covariance function captures the latent variables, and the basis functions project the inputs into a p-dimensional feature space.
A Gaussian Process (GP) is a collection of random variables in which any finite number follows a joint Gaussian distribution [30]. If $\{f(x),\; x \in \mathbb{R}^{d}\}$ is a GP, then given $n$ observations $x_1, x_2, \ldots, x_n$, the joint distribution of the random variables $f(x_1), f(x_2), \ldots, f(x_n)$ is Gaussian, characterised by its mean function $m(x) = E[f(x)]$ and covariance function $k(x, x') = \mathrm{Cov}[f(x), f(x')]$, that is:

$$f(x) \sim GP\bigl(m(x), k(x, x')\bigr).$$
Now consider the following model:

$$h(x)^{T}\beta + f(x),$$

where:
- $f(x) \sim GP\bigl(0, k(x, x')\bigr)$, that is, $f(x)$ comes from a zero-mean GP with covariance function $k(x, x')$;
- $h(x)$: a set of basis functions that transform the original feature vector $x \in \mathbb{R}^{d}$ into a new feature vector $h(x) \in \mathbb{R}^{p}$;
- $\beta$: a $p$-by-1 vector of basis function coefficients.

This model represents a GPR model. An instance of the response $y_i$ can be modelled as

$$P\bigl(y_i \mid f(x_i), x_i\bigr) \sim N\bigl(y_i \mid h(x_i)^{T}\beta + f(x_i),\; \sigma^{2}\bigr).$$

Therefore, a GPR model is a probabilistic model. A latent variable $f(x_i)$ is introduced for each observation $x_i$, making the GPR model nonparametric. In vector form, the GPR model can be written as follows:

$$P(y \mid f, X) \sim N\bigl(y \mid H\beta + f,\; \sigma^{2}I\bigr),$$

where

$$X = \begin{pmatrix} x_1^{T} \\ \vdots \\ x_n^{T} \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad H = \begin{pmatrix} h(x_1)^{T} \\ \vdots \\ h(x_n)^{T} \end{pmatrix}, \quad f = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{pmatrix}.$$

The joint distribution of the latent variables $f(x_1), f(x_2), \ldots, f(x_n)$ in the GPR model has the following form:

$$P(f \mid X) \sim N\bigl(f \mid 0, K(X, X)\bigr),$$

close in form to a linear regression model, where the covariance matrix $K(X, X)$ is of the following form:

$$K(X, X) = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix}.$$
The covariance function $k(x, x')$ is usually written as $k(x, x' \mid \theta)$ to explicitly indicate the dependence on $\theta$, because it is generally parameterised by a set of kernel parameters or hyperparameters $\theta$. Here, the kernel parameters are based on the signal standard deviation $\sigma_f$ and the characteristic length scale $\sigma_l$. The characteristic length scale defines the separation between input values $x_i$ at which the response values become uncorrelated. Both $\sigma_f$ and $\sigma_l$ must be greater than 0, which can be enforced by the unconstrained parameterisation vector $\theta$, such that:

$$\theta_1 = \log \sigma_l, \qquad \theta_2 = \log \sigma_f.$$
Some kernel (covariance) functions [18,31] are:

Squared Exponential Kernel
$$k(x_i, x_j \mid \theta) = \sigma_f^{2} \exp\!\left(-\frac{1}{2}\,\frac{(x_i - x_j)^{T}(x_i - x_j)}{\sigma_l^{2}}\right)$$

Rational Quadratic Kernel
$$k(x_i, x_j \mid \theta) = \sigma_f^{2}\left(1 + \frac{r^{2}}{2\alpha\sigma_l^{2}}\right)^{-\alpha},$$
where $r = \sqrt{(x_i - x_j)^{T}(x_i - x_j)}$ and $\alpha$ is a positive scale-mixture parameter.

ARD Matern 3/2
$$k(x_i, x_j \mid \theta) = \sigma_f^{2}\left(1 + \sqrt{3}\,r\right)\exp\!\left(-\sqrt{3}\,r\right),$$
where $r = \sqrt{\sum_{m=1}^{d} \frac{(x_{im} - x_{jm})^{2}}{\sigma_m^{2}}}$, with a separate length scale $\sigma_m$ for each predictor $m$.
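The kernels above have off-the-shelf counterparts in common ML libraries. As a minimal sketch (Python/scikit-learn here rather than the MATLAB toolbox used in this study; the input matrix is a placeholder), each kernel object evaluated on $n$ inputs yields the $n$-by-$n$ covariance matrix $K(X, X)$:

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic, Matern

# length_scale plays the role of sigma_l; a signal variance sigma_f^2 can be
# attached by multiplying with a ConstantKernel.
squared_exponential = RBF(length_scale=1.0)
rational_quadratic = RationalQuadratic(length_scale=1.0, alpha=1.0)
# A vector of length scales gives the ARD form: one sigma_m per predictor.
ard_matern_32 = Matern(length_scale=[1.0, 1.0, 1.0], nu=1.5)

# Evaluating a kernel on n inputs returns the n-by-n covariance matrix.
X = np.random.default_rng(0).standard_normal((4, 3))
print(ard_matern_32(X).shape)  # (4, 4)
```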
The methodology applied in the present study used weekly aggregated dengue cases in Zulia state in conjunction with a set of socioeconomic and local and global climatic variables, building on machine learning algorithms previously implemented for other parts of the world [8,15,20]. Support Vector Regression (SVR) and Gaussian process regression (GPR) were constructed as traditional ML algorithms due to their flexibility, practicality and typically excellent performance. A set of scenarios combining climatic and non-climatic factors was proposed in the models for comparative purposes, with respective validations, to obtain the optimum model.
Figure 5 shows the process implemented in this study.
Data Integration: Various preliminary steps were taken to prepare the data before it was entered into the machine learning algorithms:
Weekly epidemiological data on dengue cases in Zulia state were aggregated at the municipal level in conjunction with a set of climatic and non-climatic covariates. In this context, it was necessary to integrate the existing data because of the different sources of information (as proposed by Cabrera M & Taylor G. [6]). The present study also utilised remote satellite climatic data obtained from NASA, as described previously.
Epidemiological data were missing for Guajira municipality between 2013 and 2016, which resulted in this municipality being excluded from the study.
Some years of the data, such as 2008 and 2016, had 53 weeks due to the day on which the new year started, whereas the climatic data were always divided into 52 weeks. This was dealt with straightforwardly by repeating the previous week's climatic data for the 53rd week where this occurred. The Niño 3.4 index was aggregated at a weekly level to be consistent with the other data.
According to some authors [15], the data can be sensitive to extreme values; therefore, in some cases it is convenient to normalise or standardise the data. In this study, raw, standardised and normalised data were used to train each model, and the best resulting model was chosen. In standardisation, the software centres and scales each column of the predictor data according to the column mean and standard deviation; in normalisation, it scales each column of the predictor data to between -1 and 1.
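For concreteness, the two transformations can be sketched as follows (a minimal Python illustration of the definitions above, not the MATLAB routines actually used):

```python
import numpy as np

def standardise(X):
    """Centre and scale each column by its mean and standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def normalise(X):
    """Rescale each column linearly to the interval [-1, 1]."""
    X_min, X_max = X.min(axis=0), X.max(axis=0)
    return 2 * (X - X_min) / (X_max - X_min) - 1
```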
Model Construction: Two different machine learning algorithms were used for comparative purposes in this study, chosen for their reported accuracy in dengue forecasting in other parts of the world [8,15,20,32].
To conduct the experiments, the MATLAB Statistics and Machine Learning Toolbox [33] was used, which provides a framework for designing and implementing ML algorithms and applications. The experiments were carried out on a laptop PC with the following characteristics: Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz, 12.0 GB RAM, and an NVIDIA GeForce RTX 2060 GPU with Max-Q design. We performed 10-fold cross-validation on the database to obtain unbiased results in our experiments.
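A minimal sketch of the 10-fold cross-validation procedure follows (Python/scikit-learn for exposition; the model, data and error metric are placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Placeholder data standing in for the study's predictor matrix and response.
rng = np.random.default_rng(0)
X = rng.standard_normal((150, 4))
y = rng.standard_normal(150)

# Each of the 10 folds serves exactly once as the held-out test set.
fold_rmse = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = SVR(kernel="rbf").fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_rmse.append(mean_squared_error(y[test_idx], pred) ** 0.5)
print(f"mean RMSE over 10 folds: {np.mean(fold_rmse):.3f}")
```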
Fitting the GPR model requires estimating the following model parameters from the data (see the sketch after this list):
- the covariance function $k(x, x' \mid \theta)$, parameterised in terms of the kernel parameters in the vector $\theta$;
- the noise variance $\sigma^{2}$;
- the coefficient vector of fixed basis functions, $\beta$.
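As a hedged illustration of this fitting step, the following scikit-learn sketch estimates the kernel parameters $\theta$ and the noise variance jointly by maximising the log marginal likelihood; the MATLAB toolbox performs an analogous optimisation, and all values below are placeholders:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

# Toy one-dimensional data in place of the study's predictors and response.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(80, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

# ConstantKernel supplies sigma_f^2, RBF the length scale sigma_l, and
# WhiteKernel the noise variance sigma^2; fit() maximises the log marginal
# likelihood over all of these hyperparameters.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel).fit(X, y)
print("fitted kernel:", gpr.kernel_)
print("log marginal likelihood:", gpr.log_marginal_likelihood_value_)

# GPR returns both a mean prediction and an uncertainty estimate.
mean, std = gpr.predict(np.linspace(0, 10, 5).reshape(-1, 1), return_std=True)
print(np.c_[mean, std])
```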
Fitting the SVR model requires solving an optimisation problem that involves two parameters (see the sketch after this list):
- the regularisation parameter $C$; and
- the error-sensitivity parameter $\epsilon$.
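To illustrate the roles of these two parameters, the sketch below (Python/scikit-learn with toy data; all values are assumptions) varies $\epsilon$ at a fixed $C$: a wider insensitive tube ignores more residuals and retains fewer support vectors:

```python
import numpy as np
from sklearn.svm import SVR

# Toy data for illustrating the effect of the epsilon-insensitive tube.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 5, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + 0.2 * rng.standard_normal(60)

# epsilon: residuals smaller than epsilon incur no loss.
# C: trades training error against flatness of the regression function.
for eps in (0.05, 0.2, 0.5):
    model = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(model.support_)} support vectors")
```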