1. Introduction
The ongoing Industry 4.0 revolution generates an unprecedented amount of data from industrial processes. According to [1], data volumes were measured in terabytes in 2005, petabytes in 2010, exabytes in 2015 and zettabytes in 2020, which corresponds to roughly a billion-fold increase within these 15 years. New advances in Artificial Intelligence (AI) promise to make beneficial use of this data by training models for applications such as quality control, process parameter optimisation and process forecasting [2]. One important use case is predictive maintenance (PM) [3], since the condition of the manufacturing equipment plays a vital role in a company’s competitiveness. Finding the correct maintenance schedule is difficult, however, since maintenance always introduces downtime and cost [4]. "Perfect" maintenance approaches, where a machine is restored to a nearly brand-new state, are often expensive and time-consuming [22]. On the other hand, "minimal" maintenance approaches, although cheaper and faster, do not decrease the overall failure rate of the machine and can pose a risk to personal safety, product quality and machine (i.e. production) running time [22]. PM assisted by AI models enables businesses to find the optimal maintenance schedule by predicting when and which equipment requires maintenance [5].
The performance of an AI model is highly dependent on the quality of the data used during training [6]. It is therefore necessary to equip the machinery with sensor devices that are placed as physically close as possible to the manufacturing process. Sensors are normally attached to moving or rotating parts, so the use of wireless technologies in multi-sensor platforms (MSPs) is often required [23]. Previously, this prohibited high-frequency measurements, because wireless technologies were not capable of transmitting the large amounts of data with the necessary low latency. The introduction of 5G technology in industrial applications, however, allows for far greater data rates with lower latency, enabling the measurement of e.g. vibrations or acoustic emission [24]. Training AI models requires a large amount of computing resources [7]. Cloud computing currently works as a catalyst for the development and deployment of scalable Internet of Things (IoT) applications, since it provides a platform for storage, computation and communication of edge-generated data [25]. Businesses are then required to store their data on third-party servers and transmit it over the public network. This introduces additional challenges, because public infrastructure is not capable of supporting mass-scale, high-frequency communication [9], and moving and storing data off premise for cloud computing increases the risk of data breaches [10].
Federated learning (FL) was first introduced in 2016 [14] for mobile applications in the private sector and has since been considered a promising AI technology in terms of privacy [30]. Because it utilises distributed computing resources and communicates machine learning (ML) pipelines, e.g. deep learning pipelines, across them, data does not have to move off premise [31]. Additionally, it enables novel ML models to be trained on a large amount of data from multiple businesses without them having to share it with third parties, because the data is processed by each business separately. FL applications are being discussed in multiple areas. One of the most promising is its fusion with industrial IoT [26]. In the automotive industry, FL-enabled IoT can check for anomalies in production [27], support the development of intelligent transportation systems [28] and provide optimal IoT-device scheduling with regard to energy efficiency [29].
Finally, deploying ML models on resource-constrained end-devices poses multiple challenges [32]. These devices are mostly powered by microcontrollers. Although microcontroller performance has made impressive strides, their computational resources and storage capacities are still limited [33]. Additionally, microcontroller-powered devices often run without any operating system, requiring specialised bare-metal software. Their main benefits, however, are low latency and low power consumption [34]. It is therefore desirable in some applications that ML or data pre-processing models (e.g. self-calibration) are deployed on those end-devices, because data can then be processed immediately at the source without any added communication delay [35].
This paper presents a novel architecture utilising FL with 5G-enabled MSPs for PM, which was developed as part of the European project Celtic-Next: AI-NET-ANIARA. Subsections 1.1 and 1.2 present a concise background on FL and PM. Section 2 introduces the infrastructure, including the 5G MSPs. Section 3 presents the developed architecture of FL-enabled MSPs for PM of machines. Section 4 concludes with a discussion of the benefits and the potential challenges of this setup.
Before proceeding, a concise explanation of the FL and PM techniques is provided. Readers familiar with these techniques may skip this exposition and proceed directly to Section 2.
1.1. Federated Learning
From a technical point of view, FL is a decentralized ML technique that delivers ML models without centralizing data [14]. In other words, instead of gathering data in a single location for training, FL trains ML models by transmitting the model and training instructions to the distributed edge infrastructures where the data is located. The main objective of this technique is to eliminate the need to share (sensitive) data with either other data parties or cloud infrastructure providers when developing ML models [11]. This technique can therefore significantly increase data security, and it represents one of the state-of-the-art techniques for privacy-preserving ML.
The training process of FL differs from conventional ML training. Since data is not gathered on a central server, it is the ML model that must be transmitted to the clients. Once the clients are connected to a central server, one round of FL training follows these four steps:
(1) The central server sends the initial version of an ML model to the participating clients. An example of such a model is a neural network whose architecture defines the number of nodes, the activation functions, the number of hidden layers, etc.
(2) Each client fits the ML model to its own local data for a predefined number of passes (i.e., epochs). Here, fitting means that the ML model recalculates its parameters to minimize the difference between its predicted outputs and the true labels, using a loss function (e.g., cross-entropy) and an optimization algorithm (e.g., stochastic gradient descent).
(3) Each client sends back only the updated model parameters (not the actual data), and the central server keeps track of each client’s response while waiting for the necessary quorum.
(4) Once the quorum is reached, the central server aggregates the updated parameters from the responding clients using a fusion algorithm and uses the aggregated parameters to improve the ML model.
A round (or communication round) refers to a single iteration of these four steps. The entire FL training process can consist of multiple rounds, each with multiple epochs per client.
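The four steps of a round can be illustrated with a minimal, self-contained sketch. It is not the implementation used in this work: the model (a one-dimensional linear model), the squared-error loss, the sample-weighted averaging used as fusion algorithm, and all function names and data are assumptions chosen only to keep the example short.

```python
import random

def local_fit(weights, data, epochs, lr=0.05):
    """Step 2: fit the model y = w*x + b to local data with plain SGD."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # derivative of 0.5 * err**2 w.r.t. the prediction
            w -= lr * err * x
            b -= lr * err
    return (w, b)

def fuse(updates):
    """Step 4: assumed fusion algorithm, a sample-weighted parameter average."""
    total = sum(n for _, n in updates)
    w = sum(params[0] * n for params, n in updates) / total
    b = sum(params[1] * n for params, n in updates) / total
    return (w, b)

def run_round(global_weights, clients, epochs, quorum):
    """One communication round over all clients."""
    updates = []
    for data in clients:                                  # step 1: model is sent out
        params = local_fit(global_weights, data, epochs)  # step 2: local training
        updates.append((params, len(data)))               # step 3: parameters only
    if len(updates) < quorum:
        raise RuntimeError("quorum not reached")
    return fuse(updates)                                  # step 4: aggregation

def make_client(n, seed):
    """Hypothetical local dataset following y = 2x + 1 (noise-free)."""
    rng = random.Random(seed)
    return [(x, 2.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(n))]

clients = [make_client(20, 1), make_client(40, 2), make_client(30, 3)]
model = (0.0, 0.0)
for _ in range(30):                                       # multiple rounds
    model = run_round(model, clients, epochs=5, quorum=2)
```

Note that the raw `(x, y)` pairs never leave `local_fit`; only the parameter tuple and the sample count are returned to the server side.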
1.2. Predictive Maintenance
An overview of the evolution of the maintenance concept in [12] shows that in the late 1940s the focus was on fixing equipment after it broke down. This approach, known as corrective maintenance (or breakdown maintenance), is reactive and often costly, as it involves repairing equipment during unplanned downtime. The primary goal is to get the equipment up and running again as quickly as possible to minimize downtime [45]. As industrialization progressed, equipment downtime became more expensive and problematic, making corrective maintenance less practical. In response, maintenance practices evolved to focus on preventing breakdowns from occurring in the first place. This approach, known as preventive maintenance, involves regularly inspecting and maintaining equipment to identify and fix potential problems before they lead to breakdowns [46]. Preventive maintenance can be scheduled based on time, usage, or a combination of the two. Although combinations of corrective and preventive maintenance have been studied [43], they are still considered economically inefficient, since production equipment is repeatedly stopped to exchange parts even if these parts still have remaining useful life (RUL) left. In recent years, technological advancements have enabled maintenance practices that predict failures early [42]. These practices have become important, and even imperative, in industrial environments [44].
PM can be described as a maintenance scheduling-support technique that is based on monitoring the current operating condition of the equipment [16]. That is, the equipment’s RUL is forecast from its actual operating status. In general, RUL forecasting can be performed by a variety of algorithms, but ML-based algorithms have shown a significant performance advantage [47]. In PM, ML-based RUL forecasting models are trained on historical data acquired directly from the operating equipment. Sensors integrated in the relevant areas gather critical data on the operating status, which can be used to sense failure precursors [48]. As soon as enough data is available, an ML model is trained to extract information from this database. Based on this, maintenance scheduling support is provided with information on which component requires maintenance and when. This is highly beneficial for industry because of its potential to reduce costs (i.e. maintenance costs and production costs). Furthermore, it enhances efficiency by utilizing the machinery for its full intended lifetime.
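To make the idea of RUL forecasting concrete, the following sketch extrapolates a degradation trend to a failure threshold. It deliberately swaps in an ordinary least-squares line as a simple baseline instead of the ML-based forecasters discussed above; the health indicator (vibration RMS), the threshold and all numbers are hypothetical.

```python
def fit_line(ts, ys):
    """Ordinary least-squares fit of y = slope * t + intercept."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_t

def forecast_rul(ts, ys, threshold):
    """Time left until the fitted trend crosses the failure threshold.

    Returns None when no upward degradation trend is visible.
    """
    slope, intercept = fit_line(ts, ys)
    if slope <= 0:
        return None
    t_fail = (threshold - intercept) / slope
    return max(0.0, t_fail - ts[-1])

# Hypothetical history: operating hours and vibration RMS of one bearing.
hours = [0, 100, 200, 300, 400]
rms = [1.0, 1.5, 2.0, 2.5, 3.0]
rul_hours = forecast_rul(hours, rms, threshold=5.0)  # about 400 hours remain
```

In a PM application, such a forecast per component is what allows maintenance to be scheduled shortly before the predicted failure rather than at fixed intervals.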
Training ML models that enable PM applications with good performance is not yet trivial. A whole digital infrastructure (networked sensors and machines, communication protocols, storage and computational resources) is required before these models can be trained. A common obstacle that most companies approaching PM face is the need to store this enormous amount of data in a centralized location for ML training [13]. Although cloud solutions are currently the usual infrastructure for data storage and ML training, cloud environments have more than once failed to provide essential data security assurances [17]. This lack of security can lead to companies that contract cloud services being held accountable for violations of the General Data Protection Regulation (GDPR) in the event of data leaks. According to the GDPR [19], severe violations imply a fine of up to 20 million euros or up to 4 % of the company’s total global turnover of the preceding fiscal year, whichever is higher.
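The "whichever is higher" rule stated above can be written as a one-line calculation. The function name and the turnover figures below are illustrative assumptions; the sketch only computes the upper bound of the fine, not the fine actually imposed in a given case.

```python
def gdpr_max_fine_eur(global_turnover_eur):
    """Upper bound for severe violations: the higher of EUR 20 million
    and 4 % of the preceding fiscal year's total global turnover."""
    return max(20_000_000, global_turnover_eur * 4 / 100)

small_company = gdpr_max_fine_eur(100_000_000)    # the 20 million floor applies
large_company = gdpr_max_fine_eur(1_000_000_000)  # the 4 % rule applies
```

The 4 % rule becomes the binding limit once the annual turnover exceeds 500 million euros.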
According to [18], efficient bearing failure detection can be achieved from the vibration data of rotating equipment with an autoencoder-based FL approach, while using network capacity efficiently. This paper therefore assumes a high potential for utilizing operational data from industrial equipment to support maintenance scheduling through a PM application. From our perspective, RUL forecasting for the bearings of rotating equipment requires a higher degree of data privacy (and therefore an FL approach [52]), since confidential process configurations can be derived from the operational data acquired from the equipment (e.g., current, vibration, and temperature) through model inversion [49], membership inference [50] and other data-privacy attacks [51]. These threats represent a risk, especially for industrial companies, because unauthorized data access can not only allow competitors to reverse engineer equipment parameters, processes and products [49] but can also trigger legal action against businesses for violating data security regulations (e.g. GDPR) [52].
3. Architecture Setup
This section explains the architecture, its components and its interfaces. The architecture for the PM application, which embeds FL in the MSPs, is shown in Figure 1. The envisioned multi-agent environment consists of several MSPs connected to a central on-premise server. The MSPs communicate data to the on-premise central server, where the data remains stored and where the local training services run within the FL party servers. The FL party servers deployed on the on-premise central server are in turn connected to a cloud environment, where the FL server is deployed.
In this architecture, the FEC is the cloud environment hosting the FL server, the MSP is the environment hosting the data acquisition services, and the coupled NVIDIA Jetson Xavier NX hosts the FL party server. The latter runs on premise in the manufacturing environment of the vehicle paint-shop area of an automotive plant. In this scenario, the on-premise server is connected to the cloud infrastructure through VPN access.
MSPs are deployed at the bearings of air exchange systems, where they acquire, pre-process, communicate and store vibration data. These MSPs are connected to the public network through a 5G module, which allows remote connection, a VPN connection to the FEC and, therefore, embedded FL services at any time.
This architecture, in which high-frequency vibration data is stored locally and only ML model update parameters are transmitted, is expected to place less traffic load on the public network and to enhance the data privacy of the parties (i.e. clients).
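The expected reduction in traffic can be estimated with a back-of-the-envelope calculation. Every number below (sampling rate, sample width, number of axes, model size, number of rounds) is a hypothetical assumption for illustration and not a measurement from this setup; protocol overhead is ignored.

```python
def raw_upload_bytes(sample_rate_hz, bytes_per_sample, channels, seconds):
    """Traffic if all raw vibration samples were sent to the cloud."""
    return sample_rate_hz * bytes_per_sample * channels * seconds

def fl_exchange_bytes(num_parameters, bytes_per_parameter, rounds):
    """Traffic of FL: per round, one model download and one update upload."""
    return 2 * num_parameters * bytes_per_parameter * rounds

# Assumed: 10 kHz, 16-bit samples, 3 axes, one day of continuous recording.
raw = raw_upload_bytes(10_000, 2, 3, 24 * 60 * 60)
# Assumed: 100,000 float32 parameters exchanged over 50 rounds.
fl = fl_exchange_bytes(100_000, 4, 50)
reduction_factor = raw / fl
```

Under these assumptions the raw upload amounts to about 5.2 GB per sensor and day, whereas the complete FL training exchanges 40 MB, a reduction by a factor of roughly 130. The actual factor depends strongly on the model size and on how often training is repeated.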
3.1. Multi-Sensor Platform
Traditionally, sensors are integrated into the application that requires monitoring, and the analog data from these sensors is cabled out to external data acquisition systems. Cabled sensors tend to obstruct rotating parts and impair both data acquisition and the actual process. Integrating the sensors and cables into the equipment therefore requires considerable planning and cost. Specific applications, such as ML-based PM, require additional sensors integrated into the equipment to extend the feature space of the dataset.
A wireless solution seems more practical for this application. However, it introduces another challenge: battery run-time. Since the application typically needs to run for a long time, e.g. months, the MSP must run for the whole duration before being removed for a battery change. With the increased complexity of additional sensors and data processing for demanding applications, an energy-efficient MSP is needed.
The energy-efficient MSP [8] was designed with multiple sensor interfaces, complemented by hybrid communication connectivity. The MSP has a LoRa module for low-power transmission and a 5G module for enhanced connectivity with fast communication. Depending on the application, the necessary sensors are integrated using the sensor interface. The core of the MSP, the microcontroller, has a dual-core processor, whose active core can be selected at run time depending on the performance and battery run-time requirements. The MSP’s software achieves energy efficiency by optimizing the application with scheduled data acquisition, data processing, analysis, and data transmission through appropriate communication channels.
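How a transmission might be routed through an "appropriate communication channel" can be sketched as a simple selection rule. The rule, the payload limit and all names below are hypothetical and do not describe the MSP firmware; they only illustrate the trade-off between the low-power LoRa link and the fast 5G link.

```python
from dataclasses import dataclass

# Hypothetical payload limit for a single low-power LoRa message.
LORA_MAX_PAYLOAD_BYTES = 200

@dataclass
class Message:
    payload_bytes: int
    urgent: bool = False

def select_channel(msg, lora_max=LORA_MAX_PAYLOAD_BYTES):
    """Prefer LoRa to save energy; fall back to 5G for large or urgent data."""
    if msg.urgent or msg.payload_bytes > lora_max:
        return "5G"
    return "LoRa"

status_report = select_channel(Message(payload_bytes=48))        # small, not urgent
vibration_block = select_channel(Message(payload_bytes=65_536))  # large raw block
alarm = select_channel(Message(payload_bytes=16, urgent=True))   # small but urgent
```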
Furthermore, the coupled NVIDIA Jetson was set up with four systemd services, managed through systemctl commands. The first automatically connects to the internet through the 5G module and monitors connectivity. The second starts a tunneling service, enabling remote access to the NVIDIA Jetson and the MSP. The third connects to the FEC through VPN. The fourth manages the interface and data stream between the MSP and the NVIDIA Jetson.
3.2. Federated Learning Platform
This technical concept mainly combines principles from the IIRA [36], the IFL system design [38], and the three-layer collaborative and layered databus architectures [39]. Furthermore, the FL platform was developed on top of one of the most advanced open-source frameworks: IBM Federated Learning (IBMFL) [20]. This continuously developed framework currently supports multiple ML methods and fusion algorithms in a federated fashion. In addition, the IBMFL framework can apply state-of-the-art encryption methods, e.g., homomorphic encryption using secure multi-party protocols.
Figure 2 presents the schema of the FL platform, its components, core services and interactions.
The FL server is the element responsible for two main activities: the project management of decentralized projects and the execution of aggregation services.
Project management is performed through a user-friendly web application deployed in a cloud environment reachable by the project partners. It is designed for collaborative project development with data-owner partners and data scientist partners. For this activity two interfaces are foreseen: one with the data-owner parties, through which local training settings can be provided, and one with the data scientist responsible for developing the ML pipelines. The web application maintains a WebSocket connection with the FL party servers, permitting the exchange of command-line interface (CLI) commands between the FL server and the FL party server, which eases the interaction of data owners with the IBMFL framework.
Aggregation services are responsible for distributing training settings, monitoring the server-party connection and iteratively executing model fusion algorithms, the vital algorithms of FL. Aggregation services are executed when a new training round is triggered in the project, and their first task is to transmit a stored set of instructions describing the training settings to the respective FL party servers involved in the project. These decentralized FL party servers then execute a local training service and automatically register with the aggregation service through CLI commands according to these instructions. After the local training service has been configured with the aggregation service, a continuous Flask connection allows the mutual exchange of model parameters as well as IBMFL commands (e.g. TRAIN, SAVE, SYNC, EVAL).
The FL party servers are responsible for three main activities: the configuration of local training services with respective aggregation services, the execution of local training services and the iterative exchange of the partial training results for updated global parameters with the aggregation services.
The configurations arrive at the FL party server immediately after the aggregation service is triggered in the web application. These settings define the CLI commands with which the local training service registers with the aggregation service, the ML model, the data handling pipelines, and the local and global hyperparameters of each decentralized project.
Once executed, the local training service is responsible for processing local data during model training and exchanging partial training results with the aggregation service. This processing of local data takes place on the party infrastructure, where the data is safely stored at a location previously set by the data owner in the project management web application. The data owner may also assign the processing to a chosen computational resource available in their network, which must likewise have been defined beforehand in the project management web application. In short, the location of the data and the computational resource used are flexible settings for data-owner partners to designate.
The iterative exchange of partial training results occurs through the Flask connection between the local training service and the aggregation service in the FL server after the training phase is triggered. The partial training results of all local training services arrive at the aggregation service one by one as the local processing of the iteration is completed. The lack of synchronicity between parties does not affect the quality of the global model, but it does affect the total training time. Aggregation services are allowed to abandon parties for an iteration round in case of a non-responsive FL party server, ensuring the appropriate use of the data-owner resources involved in the project.
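The handling of asynchronous and non-responsive parties can be sketched as follows. This is a simulation under stated assumptions, not the IBMFL implementation: arrival times are given as plain numbers instead of real network events, a fixed timeout decides when a party is abandoned, and the fusion is an unweighted parameter average. All names are hypothetical.

```python
def collect_round(responses, quorum, timeout_s):
    """Simulate one aggregation round.

    responses maps a party name to (arrival_time_s, parameters); an arrival
    time of None marks a non-responsive party.
    """
    # Results are accepted one by one, in order of arrival, until the timeout.
    arrived = sorted(
        (t, party, params)
        for party, (t, params) in responses.items()
        if t is not None and t <= timeout_s
    )
    accepted = [party for _, party, _ in arrived]
    abandoned = sorted(set(responses) - set(accepted))

    fused = None
    if len(accepted) >= quorum:
        dim = len(arrived[0][2])
        # Averaging is order-independent: arrival order does not change the model.
        fused = [sum(p[i] for _, _, p in arrived) / len(arrived) for i in range(dim)]

    # The round lasts until the slowest accepted party, or until the timeout
    # if at least one party had to be abandoned.
    round_time_s = timeout_s if abandoned or not arrived else arrived[-1][0]
    return {"fused": fused, "accepted": accepted,
            "abandoned": abandoned, "round_time_s": round_time_s}

result = collect_round(
    {"A": (3.0, [1.0, 2.0]), "B": (1.0, [3.0, 4.0]), "C": (None, [0.0, 0.0])},
    quorum=2,
    timeout_s=5.0,
)
```

In this example party C is abandoned for the round, the global parameters are fused from A and B only, and the slow or missing responses show up solely in the round duration.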