1. Introduction
Machine condition prognostics is a critical part of an intelligent prognostics and health management (PHM) system, which aims to predict a machine's remaining useful life (RUL) from condition monitoring information [1]. The general PHM procedure comprises the construction of health indicators (HIs) and RUL prediction. The HI is a crucial variable that indicates the current machine health condition; it condenses the information extracted from sensor data and provides degradation trends for RUL prediction.
The HI construction process is called data fusion and has three categories: feature-level, decision-level, and data-level fusion [2]. Feature-level fusion methods rely on prior knowledge of degradation mechanisms and physical models. Ma [3] reported a multiple-view feature fusion method for predicting the RUL of lithium-ion batteries (LiBs). Decision-level techniques fuse high-level decisions derived from individual sensor data and do not depend on raw-signal feature extraction. Wei [4] proposed a decision-level data fusion method that maps a unique sensor signal onto reliable data, improving the capability of the quality control system in additive manufacturing and RUL estimation for aircraft engines. Data-level fusion methods find an embedding feature suitable for a task directly from raw data; they can monitor the machine system state in demanding settings such as aero-engine prognostics, and the resulting monitoring task has strong versatility. Chen [5] proposed an improved HI fusion method that generates a degradation tendency tracking strategy to predict a gear's RUL. Wang [6] extended the extreme learning machine to an interpretable neural network structure, which can automatically localize informative frequency bands and construct an HI for machine condition monitoring. RUL prediction reveals the remaining operating time before equipment requires maintenance. Prediction methods can be classified into four categories: physics model-based, statistical model-based, artificial intelligence-based, and hybrid methods [7]. Many recent studies have focused on artificial intelligence-based machine RUL prediction methods such as convolutional neural networks (CNNs) [8], long short-term memory (LSTM) recurrent networks [9], and gated recurrent unit (GRU) networks [10]. Recurrent neural networks (RNNs) have gradually become the most popular of these methods, and many scholars have adopted LSTM and GRU networks to address the vanishing gradient problem. Xiang [11] added an attention mechanism to an ordered, updated LSTM network, which further improved the robustness and accuracy of the LSTM network-based RUL prediction model.
Although these methods achieve effective machine prognostics, most artificial intelligence-based models rely on manual feature extraction (HI construction). Manual feature extraction inevitably leads to information loss, which negatively influences prognostics. Several studies have therefore focused on allowing neural networks to extract features automatically from the original input, a procedure that avoids the information loss caused by manual feature extraction. In the fault diagnosis field, artificial intelligence-based models exhibit excellent performance with the original vibration signal as input [12]; they can directly extract distinguishable fault features from unlabeled vibration signals [13]. These methods mainly utilize CNNs to realize automatic feature extraction, so several researchers have attempted to utilize CNNs to extract degradation features for predictive purposes. Xu [14] applied a dilated CNN to the field of prognostics, using five convolutional layers to extract features from the original signal and combining them with a fully connected network to realize effective prognostics. Li [15] proposed a multivariable machine predictive method based on a deep convolutional network that uses the time-window method to construct 2D data as the network input. Ren [16] built a spectrum principal energy vector from the raw vibration signal as a CNN input for bearing prognostics. CNNs demonstrate a strong capability in high-dimensional input situations but are poor at long-term series prognostics tasks, whereas RNNs can easily construct long-term relationships but cannot directly utilize the abundant long-term information owing to their limited in-network processing capacity. Thus, this study proposes a network that can directly deal with high-dimensional, long-term, time-series data for machine prognostics. The aim is to establish the long-term degradation relationship for prognostics from a large amount of raw data without relying on manual feature extraction and HI construction.
Another non-negligible defect of existing prognostics methods is the assumption that all degradation datasets satisfy the independent and identically distributed condition. Owing to variations in operating conditions and fault types, a distribution discrepancy generally exists between degradation datasets (each degradation dataset is an independent domain), leading to performance fluctuations in prognostics methods. Transfer learning (TL) is introduced to help artificial intelligence-based prognostics methods extract features across domain variation and achieve effective outcomes under cross-operating conditions. TL can apply the knowledge learned in previous tasks to new tasks by extracting domain-invariant features [17], and it is widely used in fault diagnosis tasks. In recent years, many researchers have focused on TL applications in the prognostics field to achieve effective cross-operating-condition prognostics. For example, Wen [18] utilized a domain adversarial neural network structure to solve the cross-domain prognostic problem. Roberto [19] proposed a domain adversarial LSTM neural network that achieved effective aero-engine prognosis. Mao [20] performed a transfer component analysis that sequentially adjusts the features of the current testing bearings from auxiliary bearings to enhance prognostic accuracy and numerical stability. This study introduces TL to extract a general representation of bearing degradation data across different operating conditions and final fault types to achieve prognostics under cross-operating conditions.
Figure 1 shows a general transfer learning algorithm for constructing HIs under cross-operating conditions.
The transformer [21] is a popular multi-modal universal neural network architecture. It utilizes a self-attention mechanism to capture long-term dependence (spatial dependence) between input elements in a sequence. Because it uses the full sequence as input for each inference, it is less affected by sequence length than traditional methods (RNN and LSTM), making it well suited to prognostic tasks. Zhang [22] proposed a dual-aspect transformer network that fuses time steps and sensor information for long-term machine prognostics. Su [23] proposed a bearing prognostic method consisting of a transformer and an LSTM, achieving effective RUL prediction. Thanks to the advantages of the transformer architecture in processing long series and high-dimensional features, it has the potential to become a powerful data-driven prognostic tool. Therefore, this study investigates cross-domain prognostics based on a transformer architecture.
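To make the mechanism concrete, the following minimal PyTorch sketch (illustrative only; the tensor shapes and projection sizes are assumptions, not the TSTN configuration) computes scaled dot-product self-attention over a full sequence, showing how every time frame can attend to every other frame in a single step:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a full input sequence.
    x: (seq_len, d_model). Every element attends to every other element,
    so dependence between distant time frames is captured in one step."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # linear projections
    scores = q @ k.T / (k.shape[-1] ** 0.5)       # pairwise dependence scores
    weights = F.softmax(scores, dim=-1)           # (seq_len, seq_len) attention map
    return weights @ v, weights

# Illustrative shapes only: 512 time frames of 64-dimensional spectrum features.
d_model = 64
x = torch.randn(512, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)   # torch.Size([512, 64]) torch.Size([512, 512])
```

Because the attention map is computed over the whole sequence at once, the cost of relating two frames does not grow with their temporal distance, unlike the step-by-step recurrence of an RNN or LSTM.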
To address the above issues concerning feature extraction, cross-operating conditions, and differing data distributions, this study takes the FEMTO-ST bearing dataset as an example and explores the degradation process with a transformer-based self-attention transfer learning network (TSTN). The method automatically constructs an HI from high-dimensional feature inputs and realizes long-term information association to monitor machine conditions. The innovations and contributions of this study are summarized as follows:
(1) Development of TSTN for Machine Prognostics:
We introduce the transformer-based self-attention transfer learning network (TSTN) as a dedicated solution for machine prognostics. TSTN takes long-term, high-dimensional spectrum vectors as input and directly produces a linear health index (HI) output, a numerical value ranging from 0 to 1 that is straightforwardly compared with a failure threshold of 1 (a minimal illustration of this thresholding is sketched after this list). The core transformer architecture within TSTN plays a pivotal role in extracting critical features from extended time sequences.
(2) Incorporation of Long-term and Short-term Self-Attention Mechanisms:
TSTN incorporates both long-term and short-term self-attention mechanisms, empowering it to discern short-term and long-term fluctuations in machine conditions. By analyzing historical high-dimensional feature data in conjunction with current information, TSTN excels at identifying evolving machine states.
(3) Integration of a Domain Adversarial Network (DAN) in TSTN:
To enhance TSTN's robustness and versatility, we integrate a domain adversarial network (DAN) into its architecture. The DAN effectively minimizes data disparities across operational conditions, enabling TSTN to monitor machine states consistently across different scenarios and environments. This integration significantly extends TSTN's applicability to cross-operating-condition machine state monitoring.
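As a minimal illustration of contribution (1), the following sketch (a simplified example under the assumption of a first-order HI trend, not the exact procedure of this paper) extrapolates a linear HI in [0, 1] to the failure threshold of 1 to obtain an RUL estimate:

```python
import numpy as np

def rul_from_linear_hi(t, hi, threshold=1.0):
    """Extrapolate a roughly linear HI to the failure threshold.
    t: observation times, hi: HI values in [0, 1].
    Returns the predicted remaining time until hi reaches `threshold`."""
    slope, intercept = np.polyfit(t, hi, deg=1)   # first-order fit
    if slope <= 0:
        return np.inf                             # no degradation trend yet
    t_fail = (threshold - intercept) / slope      # time where the fit crosses 1
    return max(t_fail - t[-1], 0.0)

# Hypothetical HI trajectory: linear degradation plus noise.
t = np.arange(0.0, 500.0, 10.0)
hi = 0.0015 * t + 0.05 * np.random.rand(t.size)
print(f"predicted RUL: {rul_from_linear_hi(t, hi):.1f} s")
```

The attraction of a linear HI is precisely this step: once the HI change rate is constant, the RUL follows from a one-line extrapolation to the threshold.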
The remainder of this paper is organized as follows. Section 2 introduces the preliminaries of the proposed method. The principle of the proposed algorithm is presented in Section 3. Section 4 describes the experimental study of the proposed model, and Section 5 summarizes this work.
5. Comparisons and Analysis
Then, the normalized prediction error and benchmark scores were calculated [30]. The results of all the testing sets are listed in Table 3. As presented in Table 3, except for testing sets 2-7 and 3-3, the RUL prediction results of the proposed method are reasonable. The errors in the prediction results for datasets 1-5 to 2-6 were small, and the proposed method could effectively perform bearing condition monitoring with testing sets 1-5, 1-7, 2-4, and 2-6. Compared with the RNN-based RUL prediction method [31], the convolutional LSTM network [32], the bi-directional LSTM network with an attention mechanism [33], and the traditional RUL prediction method based on vibration frequency anomaly detection and survival time ratio [34], the proposed TSTN method has higher RUL prediction accuracy. These results confirm that the proposed method is applicable to the prognostics of mechanical rotating components. For the last two datasets, the RUL predictions exhibit large deviations. The reason is that the vibration signal changes only slightly in the early degradation process, which displays a linear degradation trend; as time goes on, however, the trend becomes nonlinear, and the HI no longer has a linear change rate in the latter stage. Hence, the proposed HI is unsuitable for predicting the RUL in latter-stage degradation. Compared with the other methods, however, the proposed method has higher computational complexity, with a training time of approximately 3 hours.
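For reference, the sketch below computes a normalized percentage error and an exponential benchmark score in the style of the IEEE PHM 2012 data challenge on the FEMTO-ST dataset, which we assume is the convention of [30]; the RUL values shown are hypothetical:

```python
import numpy as np

def phm2012_score(act_rul, pred_rul):
    """Benchmark score in the style of the IEEE PHM 2012 challenge.
    Late predictions (negative error) are penalized more heavily than
    early ones, and the per-set scores A_i are averaged."""
    act_rul, pred_rul = np.asarray(act_rul, float), np.asarray(pred_rul, float)
    er = 100.0 * (act_rul - pred_rul) / act_rul       # normalized error, percent
    a = np.where(er <= 0,
                 np.exp(-np.log(0.5) * er / 5.0),     # late-prediction branch
                 np.exp(np.log(0.5) * er / 20.0))     # early-prediction branch
    return er, float(a.mean())

# Hypothetical actual/predicted RULs (in seconds) for a few test sets.
er, score = phm2012_score([5730, 2900, 1610], [5100, 2750, 1800])
print(er.round(1), round(score, 4))
```

Under this convention, a prediction that is 5% late scores the same (0.5) as one that is 20% early, reflecting that underestimating the RUL is safer than overestimating it.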
5.1. Discussions of the proposed methodology
Influence of the multi-head number. To improve the learning capability of the self-attention layer of the encoder, the keys, values, and queries are linearly projected multiple times, which is called the multi-head attention operation. In this section, the influence of the multi-head number is discussed. The predicted RUL benchmark scores for different multi-head numbers indicate that 16 heads (score 0.4017) are the most suitable for the prognostics task, scoring higher than four heads (score 0.0607) and eight heads (score 0.1124). Theoretically, the larger the multi-head number, the stronger the fitting capability. However, the rotary position embedding method requires roughly four dimensions to indicate location information; when the multi-head operation breaks up the rotary position embedding, the self-attention calculation cannot capture the time information. Accordingly, the score for 32 heads was 0.2631, and that for 64 heads was 0.0689. In summary, the multi-head number should be set to 16 for the prognostics task.
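The head-splitting step behind this trade-off can be sketched as follows (the 64-dimensional embedding is an assumed size for illustration): each head attends within a d_model // n_heads subspace, so a large head count leaves too few dimensions per head to carry the rotary position information.

```python
import torch

def split_heads(x, n_heads):
    """Reshape (seq_len, d_model) into (n_heads, seq_len, d_head).
    Each head attends in its own d_model // n_heads subspace, so a larger
    head count means fewer dimensions per head for encoding, e.g.,
    rotary position information."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    return x.view(seq_len, n_heads, d_head).transpose(0, 1)

x = torch.randn(128, 64)               # 128 time frames, d_model = 64 (assumed)
for h in (4, 8, 16, 32):
    print(h, split_heads(x, h).shape)  # d_head shrinks from 16 down to 2
```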
Discussions with/without transfer learning. The proposed method uses a domain discriminator with a gradient reversal layer to extract a domain-invariant RUL representation. We expect the TL method to improve the linearity of the estimated HI under different operating conditions. An experiment was conducted on a TSTN without TL to reflect the domain discriminator's effectiveness in cross-operating-condition monitoring. Aside from removing the domain discriminator, the other network framework settings were the same as those in Figure 9. The RUL prediction score decreased from 0.4017 to 0.0515. The prognostic results of TSTN and TSTN without a domain discriminator for test datasets 1-6, 1-7, 2-4, and 2-6 indicate the effectiveness of TL.
Figure 9 compares TSTN with TSTN without transfer learning. The blue lines represent the HI results of the classical TSTN, and the greenish-blue lines denote the HI estimates of TSTN without TL. TL improves the TSTN prognostic capability in cross-operating-condition situations.
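The gradient reversal mechanism can be written in a few lines of PyTorch; the sketch below is a minimal version of the standard gradient reversal layer of Ganin and Lempitsky, not the exact TSTN code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in
    the backward pass, so the feature extractor is trained to confuse the
    domain discriminator while the discriminator itself trains normally."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # no gradient w.r.t. lam

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage sketch: features -> GRL -> domain discriminator (2 domains assumed).
feats = torch.randn(8, 32, requires_grad=True)   # hypothetical feature batch
domain_logits = torch.nn.Linear(32, 2)(grad_reverse(feats, lam=0.5))
domain_logits.sum().backward()                   # feats.grad is sign-flipped
```

Inserting this layer between the feature extractor and the domain discriminator turns ordinary backpropagation into the adversarial min-max game that removes domain-specific information from the learned representation.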
Effectiveness of the self-attention mechanism. This study utilized test set 1-6 to generate a self-attention heatmap (shown in Figure 10) to indicate the effectiveness of the self-attention mechanism. The vertical axis of the heatmap refers to the time frames, and the horizontal axis pertains to the 16 multi-heads with eight patches. Time points at 1/3, 2/3, and 1 of the normalized operating time were selected. When a patch has a high self-attention value, the network focuses on that patch.
Figure 10 shows that only a few heads undertake the HI estimation task, although our earlier analysis indicated that a sizeable multi-head number equates to strong learning capability. A possible reason is that a large multi-head number yields a flexible feature association capability, meaning that features can be selected precisely.
The first self-attention layer is a long-term self-attention layer. In Figure 10, head 12 of the long-term self-attention layer captures the severe degradation at the end of the operating time, and head 4 focuses on the weak degradation in the early and middle operating stages. After the long-term self-attention layer, the long-term spectrum change relationship is obtained, and the local self-attention layer is used to capture abundant information within one frame. In Figure 10, a clear degradation relationship is captured: head 11 of the local self-attention layer captures the weak degradation in the early operating stage, head 10 focuses on degradation in the middle operating phase, and head 13 focuses on rapid degradation in the late operational stage. Figure 10 also shows that local self-attention plays a greater role than the long-term self-attention layer. However, the learning capability declined sharply when the order of the two layers was changed. This result indicates that the long-term self-attention layer generates the long-term relationship, which is then strengthened by the local self-attention layer.
In summary, the multi-heads in the short-term self-attention layer focus on the spectrum value, thereby making the proposed TSTN sensitive to spectrum value changes.