First, LogBD and the log anomaly detection methods PCA, LogCluster, DeepLog, LogAnomaly, LogBERT, and LogTransfer are tested on two datasets and the results are analyzed. Then, the influence of different log vector representation methods on anomaly detection performance is analyzed, followed by the influence of domain adaptation on anomaly detection performance.
4.2. Results and Analysis
Since PCA, LogCluster, DeepLog, LogAnomaly, and LogBERT are unsupervised models not designed for cross-system detection, they are evaluated in two cases, depending on whether the training dataset does or does not include samples from the target system, denoted W/O. First, the Thunderbird dataset is used as the source system and the BGL dataset as the target system; this scenario is abbreviated TB-BGL. LogBD obtains F1 and AUC values of 0.880 on the source system Thunderbird, and F1 and AUC values of 0.938 and 0.973 on the target system BGL. Then the BGL dataset is used as the source system and the Thunderbird dataset as the target system; this scenario is abbreviated BGL-TB. LogBD obtains an F1 value of 0.933 and an AUC value of 0.978 on the source system BGL, and an F1 value of 0.841 and an AUC value of 0.854 on the target system Thunderbird.
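For reference, the F1 and AUC values reported throughout this section follow their standard definitions; the sketch below is a generic implementation of those definitions, not code from LogBD.

```python
# Generic implementations of the two metrics reported in this section.

def f1_score(y_true, y_pred):
    # F1 is the harmonic mean of precision and recall over the anomaly class (label 1).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def auc_score(y_true, scores):
    # AUC equals the probability that a randomly chosen anomaly receives a
    # higher anomaly score than a randomly chosen normal sample (ties count half).
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))           # 0.5
print(auc_score([1, 1, 0, 0], [0.9, 0.8, 0.7, 0.1]))  # 1.0
```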
Table 3 and Table 4 show the experimental results of the six methods other than LogTransfer in the TB-BGL and BGL-TB scenarios; W/O indicates whether the training set includes samples from the target system. PCA, LogCluster, DeepLog, LogAnomaly, and LogBERT do not use domain adaptation. Whether Thunderbird is the source system and BGL the target system (TB→BGL), or BGL is the source system and Thunderbird the target system (BGL→TB), these five methods achieve good F1 and AUC values on the source system even without training on samples from the target system. When the training set does include target-system samples, they obtain better F1 and AUC values on the target system but worse values on the source system, indicating that these five methods lack cross-system adaptability. Mixing log sequences from the source and target systems merely diversifies the training-data distribution; the detection model confuses the sample distributions, and detection quality suffers. Compared with these five methods, LogBD achieves better results in every scenario.
On the source system, the anomaly detection performance of LogBD is still better than that of these five methods, which shows that LogBD addresses the key difficulties of log anomaly detection: accurate log template parsing, use of log semantic information, and the anomaly detection method itself. It can also be observed that the deep learning methods outperform the machine learning methods. The machine learning models use only the log template count vector as the input feature, ignoring the log content itself; they can still detect anomalous information in the logs to some extent, but cannot achieve good detection accuracy or coverage. For example, PCA performs anomaly detection based on log template indices; it retains only the principal features of the original data, loses much key information, and struggles to learn features from the sparse count matrix. LogCluster performs log anomaly detection based on clustering, but it handles complex log structures poorly, cannot fully learn the features in the logs, and its detection performance is weak. DeepLog treats the log sequence as a numeric sequence, replacing each log template with an index, and uses both log parameter features and log sequence features. However, it does not extract the semantic information in the log templates and tends to treat log sequences unseen in the training data as anomalies, resulting in lower accuracy and more false alarms; even so, it improves markedly over the machine learning methods PCA and LogCluster. LogAnomaly uses the semantic and syntactic information of log templates and proposes Template2Vec to handle synonyms in logs; it obtains the template vector as a weighted average of word vectors, improving on DeepLog.
However, it does not consider polysemy: each word has a single fixed vector chosen without context information, so the learned features are not comprehensive enough. LogBERT uses BERT to capture the patterns of normal log sequences and trains the model with two self-supervised tasks, masked log template prediction and hypersphere volume minimization. LogBERT uses the same hypersphere objective function as LogBD, but its performance is worse, because LogBD uses domain adaptation to obtain more training data with shared characteristics.
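The hypersphere objective shared by LogBERT and LogBD pulls the embeddings of normal log sequences toward a common center; at test time, the distance of a sequence embedding to that center serves as the anomaly score. The following is a minimal sketch of this idea in the style of Deep SVDD, with random NumPy vectors standing in for the actual encoder outputs (the encoder itself is omitted).

```python
import numpy as np

def hypersphere_center(embeddings):
    # The center c is typically fixed to the mean of the normal-sequence embeddings.
    return embeddings.mean(axis=0)

def hypersphere_loss(embeddings, c):
    # Training objective: mean squared distance of normal embeddings to the center,
    # i.e. minimize the volume of the hypersphere enclosing normal data.
    return float(np.mean(np.sum((embeddings - c) ** 2, axis=1)))

def anomaly_score(z, c):
    # At test time, the distance to the center is the anomaly score:
    # sequences far outside the hypersphere are flagged as anomalous.
    return float(np.sum((z - c) ** 2))

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.1, size=(100, 8))  # stand-in embeddings of normal sequences
c = hypersphere_center(normal)
outlier = np.ones(8)                          # a far-away embedding

assert anomaly_score(outlier, c) > anomaly_score(normal[0], c)
```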
LogTransfer is a supervised transfer learning method that uses labeled normal and abnormal data from both the source and target systems to train a cross-system anomaly detection model, and it performs well when sufficient labeled data is available. In this experiment, we tested how many labeled abnormal samples LogTransfer needs to reach performance similar to LogBD's. When trained with 100 abnormal sequences from the source system and 10 abnormal sequences from the target system, LogTransfer achieves its best performance on the source system. The detection results in the two scenarios are shown in Table 5 and Table 6.
For the TB→BGL scenario, training LogTransfer with 10 abnormal sequences from the target system is not enough to surpass LogBD: it is better than LogBD on the source system but worse on the target system. For the BGL→TB scenario, its performance is comparable to LogBD's on the source system and lower on the target system. Therefore, when labeled abnormal samples are difficult to obtain, LogBD can provide good performance using only normal data.
Unlike previous methods, LogBD uses BERT to extract the semantic information of log messages and represents log templates with semantic vectors, rather than with the log template IDs, Word2Vec vectors, or GloVe vectors used in previous methods. This paper compares model performance under four log template representation methods; the detection results in the two scenarios are shown in Table 7 and Table 8. Performance improves greatly when BERT is used for log template representation. This may be because the template-ID-based method simply numbers each log template and represents it by that number, treating the template as an index and ignoring the semantic information it contains. The word vectors of Word2Vec and GloVe are fixed, obtained by dictionary lookup; they cannot be adjusted dynamically to different contexts and lose the integrity of the template semantics. BERT, by contrast, computes sentence-level word vectors from the context of the input sentence; because the vectors it returns differ with context, it can distinguish polysemous words. Compared with the first three representations, BERT learns the deep semantics of sentences and captures the similarity between different log statements.
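The contrast between static lookup vectors and contextual vectors can be illustrated with a toy example. The `contextual_embed` function below is a deliberately crude stand-in for BERT's self-attention (a simple neighbor average over invented token vectors); it is used only to show that, unlike a dictionary lookup, a contextual encoder assigns the same word different vectors in different log templates.

```python
import numpy as np

# Toy static vocabulary: each word maps to one fixed vector (Word2Vec/GloVe style).
rng = np.random.default_rng(42)
vocab = {w: rng.normal(size=4) for w in
         ["block", "served", "to", "deleting", "namesystem", "pool"]}

def static_embed(tokens):
    # Dictionary lookup: the vector for a word never changes with context.
    return np.stack([vocab[t] for t in tokens])

def contextual_embed(tokens, window=1):
    # Crude stand-in for a contextual encoder: each word's vector is mixed
    # with its neighbors, so the same word receives different vectors in
    # different sentences (BERT achieves this with self-attention).
    base = static_embed(tokens)
    out = base.copy()
    for i in range(len(tokens)):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        out[i] = base[lo:hi].mean(axis=0)
    return out

s1 = ["block", "served", "to", "pool"]
s2 = ["deleting", "block", "namesystem", "pool"]
i1, i2 = s1.index("block"), s2.index("block")

# Static vectors for "block" are identical across the two templates...
assert np.allclose(static_embed(s1)[i1], static_embed(s2)[i2])
# ...while the contextual vectors for "block" differ.
assert not np.allclose(contextual_embed(s1)[i1], contextual_embed(s2)[i2])
```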
The cross-system log anomaly detection model LogBD proposed in this paper uses the domain adaptation method from transfer learning and achieves excellent performance. To verify the effectiveness of domain adaptation, a set of comparative experiments compares the model's performance with and without the domain adaptation method.
The detection results with and without the domain adaptation method in the two scenarios are shown in Table 9 and Table 10, where "without" indicates that domain adaptation is not used and "with" indicates that it is. The results show that domain adaptation greatly improves the performance of the anomaly detection model: it enables the model to learn the similarity between the log data of the two systems and thereby detect anomalies in both.
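A common way to realize this kind of domain adaptation is to penalize the distance between the feature distributions of source-system and target-system batches, for example with the maximum mean discrepancy (MMD); whether LogBD uses exactly this formulation is an assumption here, and the sketch below uses NumPy with random vectors standing in for encoder features.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    # Pairwise RBF kernel k(a, b) = exp(-gamma * ||a - b||^2).
    d2 = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2.0 * x @ y.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def mmd2(source, target, gamma=0.05):
    # Squared maximum mean discrepancy between two feature batches
    # (biased estimator; gamma is a bandwidth chosen for this toy scale).
    kxx = rbf_kernel(source, source, gamma).mean()
    kyy = rbf_kernel(target, target, gamma).mean()
    kxy = rbf_kernel(source, target, gamma).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(1)
src = rng.normal(0.0, 1.0, size=(64, 8))       # source-system features
tgt_near = rng.normal(0.0, 1.0, size=(64, 8))  # target features, aligned with source
tgt_far = rng.normal(2.0, 1.0, size=(64, 8))   # target features, shifted distribution

# Mismatched distributions yield a larger MMD than aligned ones.
assert mmd2(src, tgt_far) > mmd2(src, tgt_near)
```

During training, a term of the form λ·MMD² would be added to the anomaly detection loss, pushing the encoder to map both systems' logs into a shared feature space.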
This experiment also further evaluates the performance of LogBD when it is trained with different numbers of normal log sequences from the target system. The results are shown in Figure 9.
It can be observed that LogBD achieves high anomaly detection performance on the target system using only a small number of its normal sequences. In the TB→BGL scenario, about 100 normal log sequences from BGL are sufficient for good performance; in the BGL→TB scenario, only 10 normal log sequences from Thunderbird suffice. Performance continues to improve as the number of log sequences increases. In general, even when a new online system has been deployed for only a short time, normal log data from it is easy to obtain, so LogBD is both feasible and accurate for detecting anomalies in new systems.