1. Introduction
The widespread use of computers, databases and the associated networks makes them vulnerable to attacks. These attacks take the form of hacking or intrusion, malicious activities intended to undermine the integrity or security of the systems or their resources. We term them anomalous activities, and the data instances associated with such activities anomalies. An anomaly is a data object [1, 2] which deviates from the previously observed ones. The detection of anomalies is now accepted as a hot area of modern research.
Intrusion Detection Systems (IDS) are security tools that provide safety, security and strength to information and communication systems [3]. Recently, anomaly-based IDSs [4] have been gaining popularity because of their extensive use and their ability to detect insider attacks or previously unknown attacks. There are several approaches to IDS, and one such approach is the data mining-based approach.
Data mining is usually an iterative and interactive pattern-finding process whose goal is to find patterns, associations, correlations, variations, anomalies, and similar statistically noteworthy structures in large datasets [5]. There are three fundamental practices of data mining, namely unsupervised, supervised and semi-supervised learning [6]. Clustering [7] is an unsupervised learning practice frequently exercised to unearth patterns and the distribution of data. It has been widely used in social science and psychology [8], and several algorithms have been developed for this purpose, namely k-means, k-medoid, CLARA, CLARANS, ROCK, CACTUS, DBSCAN [7, 8, 9, 10, 11], etc. In [12], the authors have proposed a hierarchical algorithm which can be used for both static and dynamic datasets.
Static clustering mostly deals with static datasets that are ready before the algorithm is applied. In [13], the authors have proposed several incremental clustering algorithms that can process new records or data instances as they are added to the dataset. However, there are applications such as wireless sensor networks, IoT, cloud, finance and social media where the data arrive in real time.
Clustering has been widely employed in many areas of anomaly detection. In [14], an anomaly detection technique using a weighted Euclidean distance function is proposed for traffic-anomaly detection. In [15], the authors have evaluated the performance of a k-means clustering-based anomaly detection method using the KDD Cup 1999 network dataset. Although a few works have already been completed along this line, most of the aforesaid techniques consider datasets with numeric attributes only. In [16], an algorithm is proposed which can detect anomalies in datasets with mixed attributes. Using distance and dissimilarity functions, a fuzzy c-means based clustering method is discussed in [17] which works nicely on both numeric and categorical attributes. In [18], an approach for anomaly detection in mixed-attribute datasets is proposed which follows both partitioning and hierarchical approaches. While most of the above-mentioned works deal with static datasets, anomaly detection from real-time data is found to be interesting and has caught the attention of many researchers. In [19], the authors have proposed an anomaly detection technique for such data which is applicable to both multi-dimensional and categorical data. In [20], the authors have introduced the time series data cube as a new data structure for handling multi-dimensional data in anomaly detection. Gupta et al. [21] have made a detailed study of anomaly detection on strictly temporal data for both discrete and continuous cases. In [22], the authors have presented a classification-based method for online detection of anomalies from highly unreliable data. In [23], a sequential k-means algorithm is presented where cluster centers are updated with the arrival of each data instance. In [24], an efficient online anomaly detection algorithm for data streams is proposed which takes into consideration the temporal proximity of the data instances.
In this article, we put forward an anomaly detection algorithm for real-time data with mixed attributes using clustering techniques. The algorithm follows both partitioning and agglomerative hierarchical approaches, and each cluster produced by the method has an associated fuzzy time interval as its lifetime. The algorithm starts with the partitioning approach and then follows the agglomerative hierarchical approach. As the data arrive in real time, like a data stream, the time of generation of every data instance is important, and the algorithm takes this into account. The contributions of this article are as follows. First of all, we define the data instance-cluster distance measure [10, 18] in terms of both numeric and categorical attribute-cluster distances. After this, the similarity of a pair of clusters is expressed in an equivalent manner, and a merge function based on this similarity is defined. Finally, a two-phase method for anomaly detection is proposed. In phase 1, the first k data instances are placed in k different clusters, with their time stamps (times of generation) recorded as the cluster creation times, i.e., the start times of the lifetimes. When a new data instance is assigned to any of the k clusters, the lifetime of the cluster is extended to the current time stamp, and the cluster's categorical attribute frequencies and numeric attribute means are updated. At the end of phase 1, each cluster has an associated lifetime. Then the agglomerative hierarchical phase starts. In this phase, a merge function based on the similarity measure is used to merge two highly similar clusters if their lifetimes overlap. The overlapping lifetimes are kept in a compact form using set superimposition [25, 26]. For the merger of clusters at any later stage, two superimposed intervals are superimposed based on the non-empty intersection of their cores. In this way each resulting cluster has an associated superimposed time interval [26], which produces a fuzzy time interval. The algorithm stops when no further merger is possible and outputs a set of clusters, each with an associated fuzzy time interval as its lifetime. One challenging issue in partitioning clustering is specifying the value of k, and several methods have been developed for this purpose. Our algorithm addresses this problem nicely by producing the same number of stable clusters irrespective of the number of input clusters, because the number of clusters is reduced during the merging phase. Clearly, the order of the output cluster set is less than or equal to k. Thus, the output cluster set is more stable and invariant with respect to the number of input clusters. Data instances which either belong to sparse clusters or do not belong to any cluster are considered anomalies. The method's efficacy is established through complexity analysis and experimental evaluation.
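The phase-1 steps described above can be sketched in code as follows. This is a minimal illustration in Python, assuming numeric-only data instances and a simple squared-distance assignment rule; the full method also maintains categorical value frequencies, as discussed later.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    centroid: list   # running mean of the numeric attributes
    size: int        # number of instances assigned so far
    start: float     # time stamp of the first instance (lifetime start)
    end: float       # time stamp of the latest instance (lifetime end)

def phase1(stream, k):
    """Seed k clusters with the first k instances, then assign each later
    instance to its nearest cluster, updating the centroid incrementally
    and extending that cluster's lifetime to the instance's time stamp."""
    clusters = [Cluster(list(x), 1, t, t) for x, t in stream[:k]]
    for x, t in stream[k:]:
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2
                                  for a, b in zip(x, clusters[i].centroid)))
        c = clusters[j]
        # incremental mean update of the centroid
        c.centroid = [(m * c.size + v) / (c.size + 1)
                      for m, v in zip(c.centroid, x)]
        c.size += 1
        c.end = max(c.end, t)   # extend the lifetime to the current time stamp
    return clusters
```

Each cluster's lifetime [start, end] is simply stretched to the time stamp of the latest instance assigned to it; phase 2 then merges these lifetimes by superimposition.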
The article is arranged as follows. The recent developments in this area are reviewed in Section 2. In Section 3, we discuss the terms, notations and definitions used here. In Section 4, we explain the proposed system using an algorithm and a flowchart. The complexity analysis is done in Section 5, and the results and findings of the experimental studies are given in Section 6. Finally, we wind up the paper with conclusions and lines for future work in Section 7.
2. Related Works
Anomaly detection is the finding of data objects which deviate from the previously observed ones. In [1, 2] the authors have discussed anomalies and different clustering-based techniques and algorithms to identify them. Intrusion Detection Systems (IDS) are protection measures that can assure safety and security from unauthorized access [3]. There are several well-known IDSs available, and the signature detection system [3] is one of them. Of late, anomaly-based IDSs [4] have been gaining popularity because of their widespread use and their ability to detect insider attacks or previously unknown attacks. Most anomaly-based IDSs follow clustering or classification approaches.
Data mining is a well-known area of research which includes techniques to discover patterns in large datasets. The most popular data mining tasks include pattern mining, association rule mining, clustering, classification, anomaly detection, etc. [5, 6]. Clustering [7] is a popular technique used to find patterns and data distributions in datasets, and it has been used extensively in many fields of human knowledge [7, 8, 9, 10, 11]. In [12], the authors have proposed a hierarchical algorithm which can be applied to both static and dynamic datasets. In [13], the authors have proposed several incremental clustering algorithms that can process new records or data instances as they are added to the dataset.
In [14], the authors have used a weighted Euclidean distance function in the k-means clustering algorithm for traffic-anomaly detection. In [15], the authors have studied the performance of a k-means-based anomaly detection method. In [16], a hierarchical agglomerative algorithm is proposed for anomaly detection in datasets with mixed attributes. In [27], the authors have proposed an anomaly detection technique based on minimizing cluster compactness and maximizing cluster separation. In [28], an R-based implementation of DBSCAN is proposed.
In [18], the authors have proposed a hybrid approach consisting of both partitioning and hierarchical algorithms for anomaly detection in datasets with mixed attributes. In [19], the authors have proposed an anomaly detection technique for such data which is applicable to both multi-dimensional and categorical data. In [20], the authors have introduced the time series data cube as a new data structure for handling multi-dimensional data in anomaly detection. Gupta et al. [21] have made a detailed study of anomaly detection on strictly temporal data for both discrete and continuous cases. In [22], the authors have presented a classification-based method for online detection of anomalies from highly unreliable data. In [29], the authors have put forward a hybrid semi-supervised method for finding anomalies in high-dimensional data. In [30], the authors have proposed a method using random forests to improve the anomaly detection rate on streaming datasets.
Fuzziness was brought into clustering and anomaly detection by Linquan et al. [17]. Using distance and dissimilarity functions, a fuzzy c-means based method is discussed in [17] which works nicely on both numeric and categorical attributes. In [31], the authors have made a detailed study of intrusion detection systems and fuzzy anomaly detection approaches along with their advantages and limitations. In [32], the authors have proposed an algorithm to detect anomalies in temporal data. In [33], the authors applied a real-time eGFC to the log-based anomaly detection problem with time-varying data from the Tier-1 Bologna computer center. In [34], the authors have discussed a model using the association between fuzzy logic and artificial neural networks to recognize anomalies in transactions in the context of computer networks and cyberattacks. A sequential k-means clustering algorithm which updates cluster centers with the arrival of each new data instance is proposed in [23]. An efficient online anomaly detection algorithm based on temporal proximity is proposed in [24].
In [25], the authors introduced a set operation called superimposition, which can be applied to overlapping intervals to generate superimposed intervals. Applying the Glivenko-Cantelli lemma [35] of order statistics to superimposed intervals, fuzzy intervals can be generated [25]. In [26], the authors have used set superimposition on locally frequent itemsets to generate periodically frequent itemsets from temporal datasets, which in turn gives fuzzy periodic patterns. In [11], the same operation is used, with the help of fuzzy variance, for the clustering of frequent patterns. In [36], the authors have used set superimposition to solve a simple fuzzy linear equation. In this article, we use the aforesaid operation to find the fuzzy time interval associated with each cluster as its lifetime.
3. Problem Definitions
In this section, we present some significant terms, definitions, and formulae to be used in the proposed algorithm. Since most real-life datasets are hybrid, and the k-means algorithm uses the distance between an object and a cluster, typical distance formulae do not work. Therefore, it is necessary to formulate a general distance function that is appropriate for numeric, categorical, or hybrid attributes. The formulae are given below.
In [10, 18], the authors have formulated a distance function for categorical attributes as follows. Suppose the data instances have categorical attributes A1, A2, ..., Ad. The domain dom(Ai) = {ai1, ai2, ..., aim}, i = 1, 2, ..., d, comprises the finite, unordered possible values that can be taken by attribute Ai, such that for any a, b ∈ dom(Ai), either a = b or a ≠ b. Any data instance xi is a vector (xi1, xi2, ..., xid)′, where xip ∈ dom(Ap), p = 1, 2, ..., d. The distance d(xi, Cj) between data instance xi and cluster Cj, i = 1, 2, ..., n and j = 1, 2, ..., k, as proposed in [10, 18], is given by
Here wp, the weight factor associated with each attribute, describes the importance of the attribute and controls the contribution of the attribute-cluster distance to the data instance-cluster distance. The attribute-cluster distance between xip and Cj is proposed in [10, 18] as follows.
Obviously d(xip, Cj) ∈ [0, 1]; d(xip, Cj) = 1 only if every data instance in Cj has Ap = xip, and d(xip, Cj) = 0 only if no data instance in Cj has Ap = xip. With the help of equation (2), equation (1) becomes
where d(xi, Cj) ∈ [0, 1], i = 1, 2, ..., n and j = 1, 2, ..., k.
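A small illustration of equations (1)-(3) may help. The sketch below assumes equal attribute weights wp = 1/d and represents a cluster as a plain list of categorical tuples; both simplifications are assumptions of this example, not part of the definitions above.

```python
def attr_cluster_distance(value, cluster, p):
    """d(x_ip, C_j): fraction of instances in C_j whose p-th categorical
    attribute equals value (1 if every instance matches, 0 if none)."""
    return sum(1 for inst in cluster if inst[p] == value) / len(cluster)

def instance_cluster_distance(x, cluster):
    """d(x_i, C_j): weighted sum of the attribute-cluster distances,
    here with the uniform weights w_p = 1/d."""
    d = len(x)
    return sum((1.0 / d) * attr_cluster_distance(x[p], cluster, p)
               for p in range(d))
```

Note that, as stated above, a larger value means the instance matches the cluster more closely; it is subtracted from 1 later to behave as a distance on the numeric scale.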
In [10, 37], the formula to calculate the weights of attributes is given as follows. The importance IA of an attribute is quantified by the entropy metric as follows.
where x(A) is the value of attribute A, and p(x(A)) is the distribution function of the data along the A dimension.
As the values of a categorical attribute are discrete and independent, an attribute value's probability is computed by counting the frequency of that value. Accordingly, the importance of any categorical attribute Ap (p ∈ {1, 2, ..., d}) can be evaluated by the formula [10, 37]
and

where ap ∈ dom(Ap), mp is the total number of possible values of Ap, and D is the whole dataset. From equation (5) it follows that an attribute's importance is directly proportional to the number of distinct values of the categorical attribute. In practice, however, an attribute with immensely diverse values contributes minimally to the clustering. Hence, equation (5) is further modified as
Therefore, we quantify the importance of an attribute using its average entropy over its attribute values. Hence, each attribute's weight [10, 37] is estimated by

If all the attributes make equal contributions to the cluster structure of the data, then their weights are constant, i.e., wp = 1/d, p = 1, 2, ..., d [see e.g. [10, 37]]. Consequently, the instance-cluster (or object-cluster) distance in equation (3) can be modified as [10, 37]
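The entropy-based weighting of equations (5)-(7) can be sketched as follows; the natural logarithm and the simple frequency-count probabilities are assumptions of this illustration.

```python
from math import log
from collections import Counter

def importance(column):
    """Entropy of an attribute averaged over its m_p distinct values,
    as in the modified importance of equation (6)."""
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    return entropy / len(counts)   # average over the m_p distinct values

def attribute_weights(dataset):
    """Normalize the importances so the weights sum to 1, as in
    equation (7). dataset: list of tuples of categorical values."""
    imps = [importance(col) for col in zip(*dataset)]
    total = sum(imps)
    d = len(imps)
    return [i / total for i in imps] if total > 0 else [1.0 / d] * d
```

When every attribute is equally informative, the weights collapse to the uniform case wp = 1/d mentioned above.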
The distance formula in [10] for the numeric attributes of a data instance is defined as follows. Let xi = (xi1, xi2, ..., xin) be the numeric attribute vector of a data instance xi; then the distance d(xi, Cj) between xi, i = 1, 2, ..., n, and cluster Cj, j = 1, 2, ..., k, is defined as follows.

where cj is the centroid of cluster Cj, and d(xi, Cj) ∈ [0, 1].
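As an illustration of the numeric distance above, the sketch below assumes the numeric attributes are min-max scaled into [0, 1] and divides the Euclidean distance to the centroid by the square root of the dimension so that the result stays in [0, 1]; this normalization is an assumption of the example.

```python
from math import sqrt

def numeric_distance(x, centroid):
    """Normalized Euclidean distance between a numeric instance and a
    cluster centroid; with attributes scaled into [0, 1], dividing by
    sqrt(n) keeps the value in [0, 1]."""
    n = len(x)
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, centroid))) / sqrt(n)
```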
In [10, 37], the distance for mixed attributes is proposed as follows. Suppose xi = [xci, xni] is a data instance with categorical attributes xci = (xci1, xci2, ..., xcidc) and numeric attributes xni = (xni1, xni2, ..., xnidn), where dc + dn = d. Using equations (1) and (9), the distance d1(xi, Cj) between the data instance xi and cluster Cj is defined [10, 37] as follows:

Here cnj is the centroid of cluster Cj.
Since all the data instance-cluster distances are compared and a data instance is placed in the cluster for which this distance is minimum, the formula for d(xi, Cj) [10, 37] can be rewritten as follows:
It is worth mentioning here that we subtract the distance in categorical attributes from 1 to fit it onto the same scale as the distance in numeric attributes. Obviously, d(xi, Cj) ∈ [0, 1], and if xi ∈ Cj, then d(xi, Cj) = 0. In equation (11), the numeric attributes are included as a whole in the Euclidean distance; hence, they can be treated as one indivisible component, and only one weight is assigned to it. Thus, we have dc + 1 attribute weights in total, and their sum is equal to 1. Under such settings, the attribute weights can be taken as

In this manner, the total weights of the numeric and categorical parts are 1/(dc + 1) and dc/(dc + 1), respectively. As the actual weight of each categorical attribute is adjusted by its importance as in equation (7), equation (12) gives us the weights for the mixed attributes.
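The combination just described can be sketched as follows. In this hedged illustration, the numeric part enters as a single component with weight 1/(dc + 1), each categorical frequency-based similarity is subtracted from 1, and the caller is assumed to supply the numeric distance of equation (9) and importance-adjusted categorical weights summing to dc/(dc + 1).

```python
def mixed_distance(num_dist, cat_sims, cat_weights):
    """Mixed instance-cluster distance in the spirit of equations
    (10)-(12): one weighted numeric component plus the weighted
    categorical components, each put on the distance scale as (1 - s)."""
    dc = len(cat_sims)
    numeric_part = num_dist / (dc + 1)
    categorical_part = sum(w * (1 - s) for w, s in zip(cat_weights, cat_sims))
    return numeric_part + categorical_part
```

An instance that coincides with the centroid and matches every categorical value in the cluster gets distance 0, as required.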
Let Ci and Cj, i, j = 1, 2, ..., k and i ≠ j, be two clusters obtained in the partitioning phase, and let ci and cj be their centroids. Then the similarity measure [18] S(Ci, Cj) between Ci and Cj is expressed as

where Sn(Ci, Cj) is the similarity of Ci and Cj in the numeric attributes,

and Sc(Ci, Cj) is the similarity of Ci and Cj in the categorical attributes [18].

Using equations (14) and (15), equation (13) becomes [see e.g. [18]]

In equation (16), we subtract the similarity in categorical attributes from 1 to put the measure onto the same scale as that of the numeric attributes. Since Sn(Ci, Cj) ∈ [0, 1] and Sc(Ci, Cj) ∈ [0, 1], it follows that S(Ci, Cj) ∈ [0, 1]. For identical cluster pairs, S(Ci, Cj) = 0, and for completely dissimilar pairs, S(Ci, Cj) = 1.
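The measure can be illustrated as follows; the equal halves for the numeric and categorical parts, the scaled centroid distance for Sn, and passing Sc in as a precomputed value are all assumptions of this sketch.

```python
from math import sqrt

def cluster_similarity(ci_centroid, cj_centroid, sc, n_dims):
    """S(C_i, C_j) combining a numeric part Sn (scaled centroid distance)
    with (1 - Sc); 0 for identical clusters, 1 for completely
    dissimilar ones."""
    sn = sqrt(sum((a - b) ** 2
                  for a, b in zip(ci_centroid, cj_centroid))) / sqrt(n_dims)
    return 0.5 * sn + 0.5 * (1 - sc)
```

Two clusters with coinciding centroids and identical categorical profiles (Sc = 1) thus score 0 and are prime candidates for merging.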
A fuzzy set A in a universe of discourse X is characterized by its membership function μA(x) ∈ [0, 1], x ∈ X, where μA(x) represents the grade of membership of x in A [see e.g. [38]].
A fuzzy set A is termed normal [38] if there exists at least one x ∈ X for which μA(x) = 1. For a fuzzy set A, an α-cut Aα [38] is defined by Aα = {x ∈ X; μA(x) ≥ α}. If all the α-cuts of A are convex sets, then A is said to be convex [38].
A convex normal fuzzy set A [38] on the real line R such that there exists an x0 ∈ R with μA(x0) = 1, and μA(x) is piecewise continuous, is called a fuzzy number.
Fuzzy intervals [38] are fuzzy numbers for which there exists an interval [a, b] ⊂ R such that μA(x0) = 1 for all x0 ∈ [a, b], and μA(x) is piecewise continuous.
The support of a fuzzy set A in X is the crisp set containing every element of X with membership grade greater than zero in A, denoted S(A) = {x ∈ X; μA(x) > 0}, whereas the core of A in X is the crisp set containing every element of X with membership grade 1 in A [see e.g. [38]]. Obviously core([t1, t2]) = [t1, t2], since a closed interval [t1, t2] is an equi-fuzzy interval with membership 1 [see e.g. [25]].
In [25], an operation called superimposition (S) was proposed. We rewrite the operation as follows.

A1 (S) A2 = (A1 − A2)^(1/2) (+) (A1 ∩ A2)^(1) (+) (A2 − A1)^(1/2)    (17)

where (A1 − A2)^(1/2) and (A2 − A1)^(1/2) are fuzzy sets with constant membership 1/2, and (+) denotes the union of disjoint sets. To elaborate, let A1 = [x1, y1] and A2 = [x2, y2] be two real intervals such that A1 ∩ A2 ≠ ϕ; then we get a superimposed portion.
In the superimposition of two intervals, the contribution of each interval to the superimposed interval is 1/2, so from equation (17) we get

[x1, y1] (S) [x2, y2] = [x(1), x(2)]^(1/2) (+) [x(2), y(1)]^(1) (+) [y(1), y(2)]^(1/2)    (18)

where x(1) = min(x1, x2), x(2) = max(x1, x2), y(1) = min(y1, y2), and y(2) = max(y1, y2).
Similarly, if we superimpose three intervals [x1, y1], [x2, y2], and [x3, y3] with [x1, y1] ∩ [x2, y2] ∩ [x3, y3] ≠ ϕ, the resulting superimposed interval will look like

[x(1), x(2)]^(1/3) (+) [x(2), x(3)]^(2/3) (+) [x(3), y(1)]^(1) (+) [y(1), y(2)]^(2/3) (+) [y(2), y(3)]^(1/3)    (19)

where the sequence {x(i); i = 1, 2, 3} is obtained from {xi; i = 1, 2, 3} by arranging the values in ascending order of magnitude, and {y(i); i = 1, 2, 3} is obtained from {yi; i = 1, 2, 3} in the same fashion.
Let [xi, yi], i = 1, 2, ..., n, be n real intervals such that [x1, y1] ∩ [x2, y2] ∩ ... ∩ [xn, yn] ≠ ϕ. Generalizing (19), we get

[x(1), x(2)]^(1/n) (+) [x(2), x(3)]^(2/n) (+) ... (+) [x(n), y(1)]^(1) (+) [y(1), y(2)]^((n−1)/n) (+) ... (+) [y(n−1), y(n)]^(1/n)    (20)

where the sequence {x(i)} is formed from {xi}, i = 1, 2, ..., n, in ascending order of magnitude, and similarly {y(i)} is formed from {yi} in ascending order of magnitude [25]. In (20), we observe that the membership values are a combination of the empirical probability distribution function and the complementary empirical probability distribution function, and they are given by
Using the Glivenko-Cantelli lemma of order statistics [35], equations (21) and (22) jointly give us the membership function of the fuzzy interval [see e.g. [25]].
Let
A = [x(1), x(2)]^(1/m) (+) [x(2), x(3)]^(2/m) (+) ... (+) [x(r), x(r+1)]^(r/m) (+) ... (+) [x(m), y(1)]^(1) (+) [y(1), y(2)]^((m−1)/m) (+) ... (+) [y(m−r), y(m−r+1)]^(r/m) (+) ... (+) [y(m−2), y(m−1)]^(2/m) (+) [y(m−1), y(m)]^(1/m)
be the superimposition of m intervals, and let
B = [x(1)′, x(2)′]^(1/n) (+) [x(2)′, x(3)′]^(2/n) (+) ... (+) [x(r)′, x(r+1)′]^(r/n) (+) ... (+) [x(n)′, y(1)′]^(1) (+) [y(1)′, y(2)′]^((n−1)/n) (+) ... (+) [y(n−r)′, y(n−r+1)′]^(r/n) (+) ... (+) [y(n−2)′, y(n−1)′]^(2/n) (+) [y(n−1)′, y(n)′]^(1/n)
be the superimposition of n intervals. Then A (S) B is the superimposition of (m + n) intervals and is given by
where {x((1)), x((2)), ..., x((m)), x((m+1)), ..., x((m+n))} is the sequence formed from x(1), x(2), ..., x(m), x(1)′, x(2)′, ..., x(n)′ in ascending order of magnitude, and {y((1)), y((2)), ..., y((m)), y((m+1)), ..., y((m+n))} is the sequence formed from y(1), y(2), ..., y(m), y(1)′, y(2)′, ..., y(n)′ in ascending order of magnitude. From (24), we get the membership function as
By equations (24) and (25), using the Glivenko-Cantelli lemma of order statistics [35], we get the membership function of the fuzzy interval generated from identity (23).
Let [ti, ti′] and [tj, tj′] be the lifetimes of Ci and Cj, i, j = 1, 2, ..., n, respectively, such that [ti, ti′] ∩ [tj, tj′] ≠ ϕ. Then the merge() function [18] is defined as

C = merge(Ci, Cj) = Ci ∪ Cj, if and only if S(Ci, Cj) ≤ σ, a pre-defined threshold,

where C is the cluster obtained by merging Ci and Cj. It is to be mentioned here that C is associated with the superimposed interval [ti, ti′] (S) [tj, tj′] as its lifetime. To merge clusters with superimposed time intervals, we compute the intersection of the cores of the superimposed time intervals. If it is non-empty, then the clusters are merged, and the corresponding superimposed time intervals are again superimposed to obtain a new superimposed time interval.
6. Experimental Settings and Discussions
We have made an experimental study with the help of a synthetic dataset. We generated streams of datasets of different sizes with a fixed dimension. The generated dataset is quite similar to KDD Cup 99, which has 41 attributes: 38 numeric and 3 flag attributes. We have taken the dimension of our dataset as 41, with 37 numeric attributes, 3 categorical attributes and 1 temporal attribute (the time stamp). We have also injected noise at levels from 0 to 5%. The weight of each attribute is assumed to be the same. For comparative studies, we have used two algorithms [23, 24]. We have implemented the algorithms in MATLAB. It has been observed that the aforesaid algorithms work nicely on lower-dimensional datasets, but they cannot handle categorical attribute values, and they make no explicit reference to temporal attributes. Secondly, the sequential k-means algorithm [23] produces a specified number of clusters, which is not realistic in the case of real-time clustering, and OnCAD [24] needs a couple of input parameters to be specified. Our algorithm outperforms the aforesaid algorithms in the sense that it can handle higher-dimensional data with numeric, categorical and temporal attributes. The comparative analysis is presented in tabular and graphical form in Table 1, Table 2, Figure 1 and Figure 2. Although the number of clusters k has to be specified in the partitioning phase, during phase 2 the merge function reduces the number of clusters, so even if we start with a large number of input clusters, we arrive at a much smaller number of stable clusters; the similarity threshold has to be adjusted accordingly. In this work, we have taken different values of the similarity threshold, namely 0, 0.25, 0.5 and 0.75, to justify our claim. The results are expressed graphically in Figure 3, Figure 4, Figure 5 and Figure 6.
Furthermore, our algorithm explicitly keeps track of the temporal attribute, the time of arrival of each data instance, as a time stamp, which the others do not. In the first phase, the algorithm generates clusters along with a time-interval list in which each cluster is associated with a time interval; these intervals later merge to produce clusters with fuzzy time intervals. We have also carried out a complexity analysis, which shows that the algorithm is quite efficient. Moreover, we have found that our method can extract anomalies more accurately, and it is scalable with respect to data size. The results are presented in tabular and graphical form in Table 2 and Figure 7.
Table 1. Comparative analysis in terms of different parameters.
Figure 1. Comparative analysis.
Table 2. Comparative analysis of output clusters.
Figure 2. Comparative analysis of output clusters.
Table 2. Comparative analysis of anomalies.
Figure 3. Comparative analysis of anomalies.
Figure 4. Data size vs. output clusters.
Figure 5. Input clusters vs. output clusters.
Figure 6. Similarity threshold vs. output clusters.
Figure 7. Data size vs. time of execution.