In this section, we first introduce the design of a framework that enables automated and repeatable anomaly detection on surveillance data, irrespective of space and time. We then use the components of the framework to describe how we processed the data and applied different unsupervised anomaly detection algorithms to detect patterns and anomalies of different types.
2.2. Epidemiological Feature Selection and Pre-Processing
Although there are 42 fields recorded in the SIVEP time series surveillance data for malaria, for the purpose of this study we used four fields: Date (in months), total number of tests, number of negative results and number of positive results. The last three features were converted into a single feature, the proportion of positive tests, $P_r$, which is the proportion of tests conducted that returned a positive result for malaria, for each state and health region. $P_r$ is mathematically defined as:

$$P_r = \frac{N_p}{N_t} \qquad (1)$$

where $N_p$ is the total number of positive cases per month and $N_t$ is the number of tests carried out per month. As $P_r$ is a proportion, we can compare values across time and space even if the testing capacity changes over the months and across the geographical health regions. However, we assume a uniform distribution of cases across a health region, such that we have an equal chance of detecting an infected person anywhere within a health region.
With the assumed uniform distribution of positive cases per health region, an increase in $P_r$ would then truly represent the situation where more people in the health region are becoming affected, and the reason for the rise can then be investigated. Given the same testing capacity ($N_t$), a decline in $P_r$ would represent either a naturally dying epidemic or the outcome of a deployed intervention.
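As an illustration of how the derived feature can be computed, the sketch below aggregates the raw counts to monthly totals per health region and derives $P_r$ with pandas. The toy rows, file-free setup, and column names (date, health_region, n_tests, n_positive) are placeholders for illustration, not the actual SIVEP field names.

```python
import pandas as pd

# Toy rows standing in for the SIVEP extract; column names are placeholders.
df = pd.DataFrame({
    "date": ["2019-01-15", "2019-01-20", "2019-02-10"],
    "health_region": ["Region A", "Region A", "Region A"],
    "n_tests": [120, 80, 150],
    "n_positive": [30, 10, 45],
})

# Aggregate to monthly counts per health region and derive P_r = N_p / N_t (Equation (1)).
monthly = (
    df.assign(month=pd.to_datetime(df["date"]).dt.to_period("M"))
      .groupby(["health_region", "month"], as_index=False)[["n_tests", "n_positive"]]
      .sum()
)
monthly["P_r"] = monthly["n_positive"] / monthly["n_tests"]
```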
In Figure 3, we show the state-level aggregated data from the Para state of Brazil, which we used to demonstrate the original epidemiological features of interest from which we derived $P_r$.
Similar data to Figure 3 were extracted for the 13 health regions by separating the Para state into its health regions. The extracted data were then transformed into $P_r$ for each of the health regions. An example outcome of the data transformation is shown in Figure 4. To reduce noise, we applied a moving average transformation to the derived feature $P_r$, using a window size of six months. The moving average removes some of the noise and irregularity ($\epsilon_t$ in Equation (2)) to enhance the prediction accuracy of the machine learning algorithms.
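Continuing the earlier sketch, the six-month smoothing described above could be applied per health region as follows; the column names and the choice of min_periods are assumptions for illustration rather than the authors' exact implementation.

```python
# `monthly` holds one row per health region and month, with the derived
# proportion of positive tests in column "P_r" (placeholder names).
monthly = monthly.sort_values(["health_region", "month"]).reset_index(drop=True)

# Six-month moving average of P_r to suppress the irregular component.
# min_periods=1 keeps the first months instead of NaN (an illustrative choice).
monthly["P_r_smooth"] = (
    monthly.groupby("health_region")["P_r"]
           .transform(lambda s: s.rolling(window=6, min_periods=1).mean())
)
```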
A time series, $y(t)$, can be generalised using an additive model as:

$$y(t) = g(t) + s(t) + h(t) + \epsilon_t \qquad (2)$$

where:
$g(t)$ is the trend (changes over a long period of time);
$s(t)$ is the seasonality (periodic or short-term changes);
$h(t)$ is the effect of holidays on the forecast;
$\epsilon_t$ is the error term or irregularities (the unconditional changes specific to a circumstance).
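The components of Equation (2) can be inspected with standard tooling. The sketch below uses seasonal_decompose from statsmodels, which estimates only the trend, seasonal and residual components (it has no explicit holiday term $h(t)$); the placeholder series stands in for the smoothed $P_r$ of one health region.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Placeholder monthly series standing in for the smoothed P_r of one health region.
idx = pd.date_range("2010-01-01", periods=120, freq="MS")
p_r = pd.Series(np.random.rand(120), index=idx)

result = seasonal_decompose(p_r, model="additive", period=12)
trend = result.trend        # g(t): long-term changes
seasonal = result.seasonal  # s(t): periodic (here, yearly) changes
residual = result.resid     # epsilon_t: irregular component
```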
Different learning algorithms can model time series data well, depending on which components of the time series are present in the data. The unsupervised approach to anomaly detection is exploratory in nature, and the evaluation is performed subjectively by humans. Although unsupervised anomaly detection algorithms are given certain inputs by humans that enable them to set a metric threshold for objectively and automatically detecting anomalies, the detected anomalies still need to be certified by experts.
Table 2 shows the unsupervised models that are integrated in the PyCaret framework [17]; each uses a specific distance measure to estimate point anomalies in time series data.
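As a minimal sketch of how the models in Table 2 can be trained through PyCaret's anomaly module, the snippet below sets up the experiment and fits one model; the DataFrame name features, the model id and the session_id are illustrative assumptions.

```python
from pycaret.anomaly import setup, create_model, assign_model

# `features` is assumed to be a DataFrame of the smoothed P_r values
# (one column per feature) for a single health region.
setup(data=features, session_id=123)

# Model ids follow PyCaret's naming, e.g. 'cluster', 'lof', 'cof', 'iforest',
# 'histogram', 'knn', 'svm', 'pca', 'mcd', 'sos'.
model = create_model("iforest", fraction=0.1)  # fraction = contamination rate
labelled = assign_model(model)                 # appends anomaly labels and scores
```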
Clustering-based local outlier (cluster or CBL), local outlier factor (lof) and connectivity-based local outlier (cof) are based on local outlier concepts. CBL uses a distance measure that considers both the distance of an object to its nearest cluster and the size of that cluster, so small clusters are not simply discarded as outliers as a whole. The lof algorithm uses the k-nearest neighbours of a point to define the density of its locality. The reachability distance, a non-symmetric measure of distance, is used to determine an outlier: each data point has its own reachability distance, and this distance defines its degree of anomaly. The larger the value, the more anomalous the point is relative to its local neighbours [18].
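The following sketch shows how a local outlier factor model could be fitted with scikit-learn; the placeholder array X stands in for the P_r feature matrix, and n_neighbors is an illustrative choice.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.rand(120, 2)  # placeholder feature matrix (months x features)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
labels = lof.fit_predict(X)             # -1 marks local outliers, 1 marks inliers
scores = -lof.negative_outlier_factor_  # larger values => more anomalous locally
```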
Connectivity-based local outlier (cof) is an improved version of lof. The density-based lof algorithm has a shortcoming in that it depends entirely on the density of the neighbouring points, and it performs poorly when the density of an outlier is similar to that of its nearby data points. Hence, cof recognises that anomalies need not be of a lower density than the data they deviate from [19].
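A comparable cof model can be fitted with the PyOD library (which PyCaret wraps); as before, the data array and parameter values below are placeholders for illustration.

```python
import numpy as np
from pyod.models.cof import COF

X = np.random.rand(120, 2)  # placeholder feature matrix

cof = COF(n_neighbors=20, contamination=0.1)
cof.fit(X)
labels = cof.labels_            # 1 marks outliers, 0 marks inliers
scores = cof.decision_scores_   # connectivity-based outlier factor per point
```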
The Isolation Forest (iforest) algorithm is an unsupervised variant of the decision tree: it builds binary decision trees. The iforest algorithm is based on the assumption that anomalies are few and distinct, so they fall on shallow branches where they are easily isolated from the rest of the tree. A random set of features is selected and used to build each tree from the data points; samples that travel deep into a tree are unlikely to be anomalies. The Isolation Forest algorithm is computationally efficient and very effective for anomaly detection. However, the final anomaly score depends on the contamination parameter provided while training the model - meaning that we should have an idea of what percentage of the data is anomalous in order to obtain a better prediction [20].
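A minimal Isolation Forest sketch with scikit-learn is shown below; note how the contamination argument plays the role of the contamination rate discussed later in this section. The data and parameter values are placeholders.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.rand(120, 2)  # placeholder feature matrix

iforest = IsolationForest(n_estimators=100, contamination=0.1, random_state=0)
labels = iforest.fit_predict(X)     # -1 marks anomalies, 1 marks normal points
scores = -iforest.score_samples(X)  # higher => isolated in shallower branches
```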
Histogram-based anomaly detection (histogram) assumes feature independence and builds a histogram of each feature. The anomaly score is based on the histogram-based outlier score (HBOS). HBOS can be constructed from either univariate or multivariate features of each data point. For multivariate problems, the anomaly scores of all variables are added up to rank the data. Given $d$ variables, the HBOS of a data point $p$ is calculated as [21]:

$$HBOS(p) = \sum_{i=1}^{d} \log\left(\frac{1}{hist_i(p)}\right)$$
The histogram outlier detector first constructs a histogram for each variable by choosing a bin width. The computed score for each variable is normalised to 1.0 and summed across the $d$ variables to compute the global outlier score. A data point may be anomalous in one variable but not in others; hence, a data point that is an outlier in almost all the variables is almost certainly an anomaly in the data set.
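A histogram-based detector can be fitted via PyOD as sketched below; the number of bins and the placeholder data are illustrative assumptions.

```python
import numpy as np
from pyod.models.hbos import HBOS

X = np.random.rand(120, 2)  # placeholder feature matrix

hbos = HBOS(n_bins=10, contamination=0.1)  # n_bins controls histogram resolution
hbos.fit(X)
labels = hbos.labels_            # 1 marks outliers
scores = hbos.decision_scores_   # summed log-inverse histogram heights per point
```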
K-Nearest Neighbors (KNN) is popularly used as a supervised learning algorithm. However, it can also be used as an unsupervised learning algorithm to detect outliers or anomalies in data. The assumption in this implementation for anomaly detection is that outliers are not in close proximity to other neighbours. A threshold is defined for proximity and used to determine data points that do not belong to a neighbourhood. The key parameter that determines the number of neighbours used in calculating the proximity measure is $k$.
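A KNN-based outlier detector can be fitted with PyOD as sketched below; n_neighbors corresponds to the parameter $k$ mentioned above, and the data and parameter values are placeholders.

```python
import numpy as np
from pyod.models.knn import KNN

X = np.random.rand(120, 2)  # placeholder feature matrix

knn = KNN(n_neighbors=5, method="largest", contamination=0.1)
knn.fit(X)
labels = knn.labels_            # 1 marks points far from their k nearest neighbours
scores = knn.decision_scores_   # distance to the k-th nearest neighbour
```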
The one-class SVM detector (svm) is an unsupervised version of the traditional SVM regressor or classifier. It uses either a minimum-volume hyper-sphere [23] or a maximum-margin hyperplane to separate anomalous data from normal data. The major purpose of one-class SVM is to detect novelty in data; it helps to detect rare events, and novelty and weak signals are special aspects of anomaly detection. In one-class SVM, data points that lie outside the hyper-sphere or below the hyperplane are considered anomalies.
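A one-class SVM sketch with scikit-learn is shown below; nu upper-bounds the fraction of training points treated as outliers and is loosely analogous to the contamination rate, and the data and kernel settings are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

X = np.random.rand(120, 2)  # placeholder feature matrix

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
labels = ocsvm.fit_predict(X)         # -1 marks points outside the learned boundary
scores = -ocsvm.decision_function(X)  # larger => further outside the boundary
```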
Principal Component Analysis (PCA) is a method that decomposes a signal into its major components, the first component usually being the most important, followed by the second, third and so on. The idea of using PCA for outlier detection is that data points with a high reconstruction error from their principal components are outliers in the dataset [24]. For different PCA algorithms, the way the anomaly score is calculated may differ; the residuals, leverage and influence of a data point may all be taken into consideration. However, these metrics are better utilised in a visualisation than in an automated outlier detection system. Hence, some human evaluation and domain knowledge may need to be applied in setting the outlier threshold using the metrics appropriate for the problem domain.
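The reconstruction-error idea can be sketched directly with scikit-learn's PCA; the placeholder data, the number of retained components and the 90th-percentile threshold (chosen here to mirror a 10% contamination rate) are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(120, 4)  # placeholder multivariate feature matrix

pca = PCA(n_components=2)
X_proj = pca.fit_transform(X)
X_rec = pca.inverse_transform(X_proj)

# Reconstruction error per data point; the largest errors are flagged as outliers.
errors = np.sqrt(((X - X_rec) ** 2).sum(axis=1))
threshold = np.quantile(errors, 0.9)  # flag the top 10% (mirrors eta = 0.1)
outliers = errors > threshold
```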
Minimum Covariance Determinant (MCD) is an anomaly detection method that uses the fact that tightly distributed data will have a smaller covariance determinant. So, instead of using the entire data set to calculate distribution parameters (such as the mean and standard deviation), it divides the data into sub-samples and then computes the covariance determinant of each sub-group. The sub-sample size $h$ is such that $(n + d + 1)/2 \le h \le n$, where $n$ is the total number of data points [25]. The sub-group with the minimum determinant is used as the central group for distance calculation. MCD is best suited for determining outliers in multivariate data [25]. MCD uses robust distance measures that do not rely on the unrealistic distributional assumptions that underlie the use of Mahalanobis distance for outlier detection in most other classical methods. Mahalanobis distance computation is sensitive to the presence of outliers in the data, as the outliers tend to draw the distributional statistics towards themselves. Hence, the robust distance is a robust calculation of the Mahalanobis distance such that the effect of outliers is minimised.
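The robust-distance idea can be sketched with scikit-learn's MinCovDet estimator; the support_fraction, the placeholder data and the 90th-percentile cut-off are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import MinCovDet

X = np.random.rand(120, 2)  # placeholder multivariate feature matrix

# MinCovDet fits location and covariance on the h-point subset with the smallest
# covariance determinant, yielding robust (squared) Mahalanobis distances.
mcd = MinCovDet(support_fraction=0.75, random_state=0).fit(X)
robust_dist = mcd.mahalanobis(X)

# Flag, for example, the top 10% of robust distances as outliers.
outliers = robust_dist > np.quantile(robust_dist, 0.9)
```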
The stochastic outlier selection (SOS) [26] method is a statistical modelling method for anomaly detection. It assumes that the data follow a stochastic model with some form of probability density function (pdf): normal data exist in areas of higher density, while anomalies exist in areas of lower density. Hence, the measure used to determine an anomaly is the probability density. For parametric stochastic models, a pdf is assumed a priori, with some values assumed for the model parameters. For non-parametric modelling, however, little or no assumption is made about the values of these parameters and the algorithm has to learn them directly from the data. We have largely followed non-parametric modelling in this work, focusing on discovering the model that best fits the data for the type of anomaly that is of interest to epidemiologists. In this research, we are interested in outbreak anomalies.
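An SOS detector can be fitted via PyOD as sketched below; the perplexity value (which plays a role similar to $k$ in nearest-neighbour methods) and the placeholder data are illustrative.

```python
import numpy as np
from pyod.models.sos import SOS

X = np.random.rand(120, 2)  # placeholder feature matrix

sos = SOS(perplexity=4.5, contamination=0.1)
sos.fit(X)
labels = sos.labels_            # 1 marks points with high outlier probability
scores = sos.decision_scores_   # outlier probabilities derived from point affinities
```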
A major parameter and assumption underlying the algorithms and methods employed in this work, which is based on unlabelled data, is the proportion of anomaly or contamination rate, η. The contamination rate is the fraction of the total data that we assume to be anomalous; our default value is 0.1 (10%) of the data. In our standard experiments, we therefore set η = 0.1. We also conducted a sensitivity analysis for this parameter over a range of η values.
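A sensitivity sweep over η could be run with PyCaret along the following lines; the fractions listed are examples only, not necessarily the values used in the study, and the DataFrame name features is again an assumption.

```python
from pycaret.anomaly import setup, create_model, assign_model

# `features` is assumed to be the DataFrame of smoothed P_r values.
setup(data=features, session_id=123)

results = {}
for eta in (0.05, 0.1, 0.15):  # illustrative contamination rates
    model = create_model("iforest", fraction=eta)
    labelled = assign_model(model)
    results[eta] = labelled["Anomaly"].sum()  # number of points flagged at this eta
```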