1. Introduction
Industrial Internet of Things (IIoT) technologies have ushered in a new era of manufacturing and industrial processes, offering unprecedented levels of connectivity, automation, and data-driven decision-making. At the heart of these dynamic ecosystems lies the seamless exchange of information among interconnected devices, sensors, and control systems [1]. This intricate web of interactions, facilitated by standard industrial communication protocols such as OPC UA (Open Platform Communications Unified Architecture) [2] and MQTT (Message Queue Telemetry Transport), generates vast volumes of network data, which, until recently, remained an untapped resource for unraveling the underlying operational intricacies [3,4].
In this paper, we delve into the realm of process mining as a transformative approach to extract invaluable insights from collected network data in Industrial IoT environments. Process mining, a field at the confluence of data science, machine learning, and process management, refers to the automated discovery, monitoring, and improvement of process models from event data of IT systems [5]. Event data is used in process mining research to generate and compare process models automatically with the help of process mining algorithms. Event information can be generated by classical IT systems as well as by employees using smart devices, (production) machines, and sensors [6,7,8]. IT systems within an organization create, for example, records of activities performed, messages sent, or transaction data. These event data are compiled into event logs and form the starting point for process mining algorithms.
The application of process mining in the context of IIoT network data opens up a wealth of opportunities for industries seeking to enhance operational efficiency, reduce downtime, and improve overall performance [9]. Our research focuses on harnessing the power of process mining to analyze the communication patterns, information flows, and transactional sequences embedded within the data streams of the OPC UA protocol. Such protocols are the backbone of communication in many industrial settings, facilitating data exchange between devices, sensors, and supervisory control systems. By scrutinizing the recorded network data with advanced process mining techniques, we aim to uncover hidden process models, identify bottlenecks, and optimize data flows, ultimately enabling industries to make informed decisions for improved productivity and reliability.
Using network data to discover (business) processes is an emerging research area that has recently gained increasing attention [10,11,12]. Our approach differs from previous approaches in that it operates on network data in the IIoT and thus requires additional processing steps to transfer the raw network data into a format suitable for process mining. Moreover, to the best of our knowledge, we are the first to use real-world network traffic data instead of simulated data. This paper outlines our method for collecting, preprocessing, and analyzing network data from IIoT environments, shedding light on the challenges and intricacies of handling large-scale, heterogeneous data sources. We present a real-world use case illustrating the practical application of process mining on OPC UA data, showcasing how this approach can yield actionable insights that translate into tangible operational benefits. In summary, integrating process mining with network data in IIoT environments promises to revolutionize the way industries operate, offering a data-driven lens through which to optimize processes, enhance decision-making, and unlock the full potential of IIoT technologies. This paper comprehensively explores this innovative intersection, shedding light on its theoretical foundations, practical implementation, and the transformative impact it can have on industrial operations.
The paper is structured as follows: in Section 2, we present essential basics and related literature on process mining and network traffic data. This is followed in Section 3 by our method to discover business processes in the IIoT. In Section 4, we present the implementation of the method, which we apply to a real-world use case in Section 5. We evaluate and discuss our method in Section 6 regarding performance and quality and conclude the paper in Section 7.
2. Background and Related Work
2.1. Process Mining and Network Event Data
Event data, generated during business process execution, includes details on activities, their sequence, timestamps, and contextual information. Event data is derived from systems like databases, software applications, or sensors and is the foundation for process mining. By combining data mining, machine learning, and process management techniques, process mining analyzes and visualizes event data to reconstruct and model organizational processes. It identifies inefficiencies, bottlenecks, compliance issues, deviations, and improvement opportunities. Process mining relies on event logs as its core input, forming the basis for analyzing and optimizing processes. Network data is particularly valuable for process mining for several reasons:
Rich Information Source: Network data contains information generated by interconnected devices and systems. It captures interactions and communications between entities, providing a detailed record of activities and their sequence.
Granularity and Detail: Network data often offers granular insights into the flow of activities and dependencies among different elements within a system. This information can be valuable for reconstructing processes accurately.
Real-Time and Continuous Data: Networks continuously generate data in real-time as activities occur, providing an up-to-date and comprehensive view of ongoing processes. This real-time nature enables immediate analysis of process deviations or inefficiencies.
Comprehensive Coverage: Network data often covers various activities, including structured and unstructured data, allowing for a holistic view of processes.
Interconnection of Systems: In many cases, processes are interconnected across various systems or devices. Analyzing network data helps understand the interactions and dependencies among these systems, offering insights into end-to-end processes.
Network data, though rich, can be complex and varied, requiring specialized expertise for effective preprocessing, analysis, and interpretation in process mining. Regarding the IIoT, OPC UA is a widely used communication protocol in industrial automation that ensures secure data exchange among devices in a networked environment; its packet structure and request-response mechanism, on which our method builds, are detailed in Section 2.2.
2.2. OPC UA Protocol
OPC UA (Open Platform Communications Unified Architecture) is a machine-to-machine communication protocol that is widely used in industrial automation systems. It provides a framework for secure and reliable data exchange between various devices and applications in a networked environment. OPC UA supports multiple data encoding formats to represent information during communication. These formats include binary, JSON (JavaScript Object Notation), and XML (eXtensible Markup Language). Each format has its own characteristics and usage scenarios. Binary encoding is often preferred for performance-critical applications with limited bandwidth, while JSON and XML are more commonly used in web-based and interoperable systems where human readability and compatibility are important factors.
Table 1 provides an overview of the OPC UA packet structure. OPC UA allows establishing secure channels to ensure the confidentiality and integrity of data transmission (A). OPC UA messages are encapsulated within the secure channel (if used) or directly transmitted over the network. The message header contains essential information about the message, such as the message type, size, and encoding (B). The message body contains the actual content of the message, which can vary depending on the type of message (C). OPC UA defines a set of services that allow clients and servers to interact with each other (D). These services are transmitted via the message body and provide functionality for various operations, such as reading and writing data, subscribing to events, browsing the server’s address space, and managing sessions.
Table 1. OPC UA Packet Structure.
Secure Channel Layer (A) | optional
Message Header (B) | fixed size
  Message Type | 4 bytes
  Message Size | 4 bytes
  Secure Channel ID | 4 bytes
  Security Flag | 4 bytes
  Additional Header | variable size
Message Body (C) | variable size
  ReadRequest/ReadResponse (D) | variable size
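To make the simplified layout of Table 1 more concrete, the following Python sketch unpacks the header fields exactly as listed above. It is an illustration of the table only, not a full OPC UA binary decoder; the real binary encoding contains further details (e.g., chunk types and security headers), and the field widths used here are taken directly from Table 1.

```python
import struct

def parse_message_header(packet: bytes) -> dict:
    """Toy parser mirroring the simplified layout of Table 1.

    Assumes four 4-byte header fields (message type, message size,
    secure channel ID, security flag) followed by a variable-size body.
    """
    msg_type, msg_size, channel_id, security_flag = struct.unpack_from("<4sIII", packet, 0)
    body = packet[16:msg_size]  # message body (C), variable size
    return {
        "message_type": msg_type.decode(errors="replace"),
        "message_size": msg_size,
        "secure_channel_id": channel_id,
        "security_flag": security_flag,
        "body": body,
    }
```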
Every OPC UA handshake follows the request and response pattern, as shown by the read operation in Table 2. Besides the requested (e.g., NodesToRead) or transmitted data (e.g., Results), OPC UA packets carry a request handle located in the header, a unique identifier assigned to a client’s request message when communicating with a server. It correlates a request (see Table 2a) and a response (see Table 2b) within a session. The request handle serves three main purposes. First, it allows the client to match the response received from the server to the original request. Second, OPC UA supports asynchronous communication, where a client can send multiple requests to a server without waiting for responses. Third, in case of errors or exceptions during processing, the server includes the request handle in the error response.
Table 2. Request and Response Headers.
(a) OPC UA ReadRequest
Request Header
  Type ID | 4 bytes
  Request Handle | 4 bytes
  Timestamp | 8 bytes
NodesToRead | variable

(b) OPC UA ReadResponse
Response Header
  Type ID | 4 bytes
  Request Handle | 4 bytes
  Timestamp | 8 bytes
Results | variable
2.3. Related Work
Using network data as input for process mining is a burgeoning research area gaining significant attention (Table 3). Existing studies predominantly use simulated network data, captured with tools like Wireshark 1, and can be categorized into two event log generation techniques: rule-based and model-based.
Table 3.
Related works on network data-based process mining.
Rule-Based Techniques
Rule-based techniques transform captured network traffic into structured event logs through predefined rules, necessitating manual rule definition beforehand. For instance, Wakup and Desel [12] filter TCP dumps with predefined rules and use TCPLog2Eventlog for event log extraction. Engelberg et al. [10] focus on HR recruitment, applying the heuristic miner to network data captured for business processes. Apolinário et al. [13] introduce FingerCI, combining both techniques for ICS model construction.
Model-Based Techniques
Model-based techniques generate event logs or process models from network traffic data, typically requiring no human intervention thanks to unsupervised learning. For instance, Hadad et al. [11] propose unsupervised learning for event log generation, addressing challenges in activity recognition from network data. Lange et al. [15] introduce MONA, deriving workflows directly from network data without generating event logs.
In contrast, our paper contributes to explainable rule-based event log generation and process discovery, focusing on real-world OPC UA network data captured from a manufacturing company’s end-of-line business process. Moreover, we generate event logs without isolating processes and derive processes without relying on opaque machine learning techniques.
3. OPC UA Process Discovery Method
To address the lack of a structured approach for process discovery from OPC UA network data, we develop such a method in this paper. Process discovery typically involves obtaining data from running processes, generating event logs, and mining processes from these logs [17]. The challenge lies in abstracting multiple low-level network events into a high-level event log [18]. Following the design science research approach of Hevner et al. [19], we develop an IIoT-specific artifact based on CRISP-DM (Cross Industry Standard Process for Data Mining) [20], as illustrated in Figure 1. Further details on the individual phases follow. Before starting, it is crucial to determine the scope: the target systems or components (which?), the technique, frequency, and timing (how?), the stakeholders (who?), and the desired outcome of the discovery (what for?). In addition, suitable metrics must be defined to measure whether the scope has been achieved, for example, which data should be used for the event log, or whether there is already a process model to compare the output against. The stakeholders involved should document and agree on these metrics to ensure success [21].
Figure 1. Generic process discovery approach in the IIoT.
3.1. Data Collection and Pre-Processing
Collect
Once the scope and objectives have been defined, we recommend using a passive data collection technique (network sniffing) instead of an active one, as it does not affect the operational processes and aligns with the IIoT’s high availability requirements. Passive recording is feasible using appropriate hardware (e.g., a switch with port mirroring) and software (e.g., Wireshark). Regardless of the hardware and software in use, the collected data’s quality (e.g., completeness or encryption) is crucial. To cope with the large data volume, filtering rules (e.g., on ports) ensure alignment with the predefined scope; however, when recording an initial snapshot, a full capture is recommended to deepen the understanding of the network. Last, as the PCAP format might be difficult to handle, it can be transformed into human-readable formats (e.g., XML or JSON).
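As an illustration of this step, the sketch below shows one possible way to apply a port filter and convert a PCAP capture into JSON by shelling out to tshark, the command-line counterpart of Wireshark. The file names and the filter on the well-known OPC UA port 4840 are assumptions for the example and should be adapted to the monitored environment.

```python
import subprocess

def pcap_to_json(pcap_path: str, json_path: str,
                 display_filter: str = "tcp.port == 4840") -> None:
    """Convert a (filtered) PCAP capture to JSON using tshark.

    The display filter and file names are examples only; adjust them to the
    ports and paths of the captured OPC UA traffic.
    """
    with open(json_path, "w") as out:
        subprocess.run(
            ["tshark", "-r", pcap_path, "-Y", display_filter, "-T", "json"],
            stdout=out,
            check=True,
        )

# Example: pcap_to_json("snapshot.pcap", "snapshot.json")
```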
Understand
Before pre-processing the data, the data analyst must understand the data’s context, e.g., by collecting additional information such as existing process models, descriptions, expert interviews, asset inventories, or site visits. Afterward, it is crucial to understand the collected data itself [20]. Within the IIoT, this includes gaining insight into the network topology, IP addresses, ports, or protocols. After resolving duplicates, the data analyst can dive deeply into the packets’ structures to identify data of interest, such as the case ID for subsequent event logs. Many roads lead to Rome, so visualizing information (e.g., as a social network diagram) may also assist before heading into the data pre-processing.
Pre-Processing
Network data is selected based on scope and objectives. Iteratively approaching the scope and objectives leads to the desired outcome: data analysts can assess the model’s quality at each iteration, filtering less data and refining the selection for the event log step by step. Event log generation may involve aggregating multiple packets to form activities, especially in client-server architectures. In OPC UA, requests and responses can be matched using the so-called requestHandle (see Algorithm 1); the algorithm generates activities from low-level request-response events. Enriching activity names with human-readable labels ensures understandable process models. For example, if information on the function of a machine is available, replace the IP address and port with this information to increase readability.
Algorithm 1: Activity generation.
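The pseudocode of Algorithm 1 is not reproduced here; the following Python sketch merely illustrates the matching idea described above, pairing each response with its request via the requestHandle and emitting one activity per pair. The packet dictionary keys are assumptions introduced for the example.

```python
def generate_activities(packets):
    """Pair OPC UA requests and responses by requestHandle (sketch of Algorithm 1).

    Each packet is assumed to be a dict with 'request_handle', 'is_request',
    'timestamp', 'src', and 'dst' keys extracted during pre-processing.
    """
    open_requests, activities = {}, []
    for pkt in sorted(packets, key=lambda p: p["timestamp"]):
        handle = pkt["request_handle"]
        if pkt["is_request"]:
            open_requests[handle] = pkt
        elif handle in open_requests:
            req = open_requests.pop(handle)
            activities.append({
                "activity": f'{req["src"]} -> {req["dst"]}',  # later replaced by human-readable labels
                "start": req["timestamp"],
                "end": pkt["timestamp"],
            })
    return activities
```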
3.2. Rule-Based Event Log Generation
After generating activities from filtered, aggregated, and labeled network packets, the next step is identifying each activity’s process instance and generating an event log. The mandatory information of an event log includes (1) the case ID, (2) the timestamp, and (3) the activity name. The case ID is a unique identifier that identifies a process instance or run and is assigned to all activities involved. Timestamps indicate when an event occurred and reveal whether activities run sequentially or in parallel. Events can carry different activity names, and activity names need not be unique within the same process run. Optional information complements an event log, such as the resource, e.g., the name of the actuator executing an activity. In the IIoT, we find physical processes and machines handing products over to one another. We can refer to each product traveling through this process as a process instance; a new process instance is created when the product first appears in the network traffic. Each product has a unique identifier, which is ideal as a case ID. As not every packet carries the product identifier, the pseudocode in Algorithm 2 details the event log generation based on the case ID assignment. Automatically assigning activities ensures consistency across the process and the event logs. Experimenting with different case IDs (where appropriate candidates exist) further allows comparison across the resulting event logs.
Algorithm 2: Event log generation.
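Again, the pseudocode itself is not shown here; the sketch below illustrates the case ID assignment described in the text, propagating a product identifier to activities that do not carry it themselves and writing the result to a CSV event log. The field names and the inheritance rule are assumptions for the example.

```python
import csv

def generate_event_log(activities, csv_path):
    """Assign case IDs and write an event log (sketch of Algorithm 2).

    Activities are assumed to be ordered by time; an activity carrying a product
    identifier opens (or continues) a case, and activities without an identifier
    inherit the most recently seen case ID.
    """
    current_case = None
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["case_id", "activity", "timestamp"])
        writer.writeheader()
        for act in activities:
            if act.get("product_id"):      # e.g., an ITEM_ID observed in the packet
                current_case = act["product_id"]
            if current_case is None:
                continue                   # skip activities before the first known case
            writer.writerow({
                "case_id": current_case,
                "activity": act["activity"],
                "timestamp": act["end"],
            })
```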
3.3. Process Discovery, Visualization and Analysis
The derived event log is the basis for applying process mining techniques and enables identifying and visualizing processes and process instances. Process mining discovery techniques include, for example, the heuristic, alpha, and inductive miners, which produce different outcomes (e.g., a BPMN model or a Petri net). Each outcome, when visualized, shows a different process perspective. A directly-follows graph creates an overview of process instances and dimensions (e.g., frequency or performance), whereas the BPMN notation (notably with extended, context-specific variables) focuses more on business processes [22].
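As one possible realization of this step, the sketch below uses the pm4py library to load such an event log and apply the mentioned discovery techniques. pm4py is one option among several process mining toolkits, and the column names follow the event log format introduced in Section 3.2; both are assumptions for the example.

```python
import pandas as pd
import pm4py

# Load the generated event log and map its columns to pm4py's standard names.
df = pd.read_csv("event_log.csv", parse_dates=["timestamp"])
log = pm4py.format_dataframe(df, case_id="case_id",
                             activity_key="activity", timestamp_key="timestamp")

# Directly-follows graph with frequencies.
dfg, start_acts, end_acts = pm4py.discover_dfg(log)

# Petri net via the inductive miner and a heuristics net with a dependency threshold.
net, im, fm = pm4py.discover_petri_net_inductive(log)
heu_net = pm4py.discover_heuristics_net(log, dependency_threshold=0.8)

pm4py.view_dfg(dfg, start_acts, end_acts)  # e.g., visualize the DFG
```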
Data analysts can interpret the results regardless of the notation or process mining technique used. This way, deviations between the discovered and target processes, e.g., bottlenecks, can be identified. Visualizations also help to uncover optimization potential. For informed decision-making, stakeholders can enrich the process models with expert knowledge if required. An inaccurate model (e.g., due to inadequate data or pre-processing) may require returning to an earlier phase.
4. OPC UA Mining Implementation
This section introduces the Python implementation details of the event log generation using the OpcuaPacketAnalyzer class. The class analyzes OPC UA network packets, extracting relevant information and generating event logs. The implementation is available on GitHub 2. It loads OPC UA data from a JSON file, extracts data from packets at various ISO/OSI layers, matches request/response handles in OPC UA packets, and generates CSV event logs. These event logs serve as the foundation for subsequent analysis.
4.1. Software Design
Figure 2 presents a visual representation of the OpcuaPacketAnalyzer implementation. The class diagram (Figure 2a) shows the structure of the OpcuaPacketAnalyzer class, including its attributes and methods; the relationships between methods are depicted to provide a high-level overview of how they interact. A sequence diagram depicts the interactions and flow of control between objects and actors. In our case, we use a sequence diagram to illustrate how the OpcuaPacketAnalyzer class is invoked and how its methods interact (see Figure 2b). The sequence diagram shows the actions when a user interacts with the OpcuaPacketAnalyzer: the user initializes the class and runs the analysis, which triggers various internal methods to perform specific tasks, detailed in the following.
Figure 2. Implementation design of the OpcuaPacketAnalyzer class.
4.2. Implementation Details
We provide detailed explanations of key methods and functionalities of the OpcuaPacketAnalyzer class in the following:
- Entrypoint. The analyze_packets() method is the entry point for event log generation, orchestrating data extraction, request handle matching, case ID assignment, and event log generation. It structures OPC UA communication for process mining and analysis.
- Data Loading. The load_data() method loads OPC UA communication data from a Wireshark JSON file, ensuring availability for subsequent methods.
- Data Extraction. Utilizing extract_tcp_data(), extract_ip_data(), and extract_eth_data(), this step extracts relevant data from packets at various ISO/OSI layers.
- Request Handle Matching. The match_request_handles() method matches request handles in OPC UA packets, establishing relationships between requests and responses and creating activities.
- Event Log Generation. The write_csv() method generates CSV event logs from extracted data for process mining or visualization.
- Case ID Assignment. The add_case_id() method assigns case IDs to matched arrays of OPC UA packets based on specific keys, facilitating subsequent process mining techniques.
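A minimal usage sketch of the class described above might look as follows; the module name, constructor arguments, and file paths are assumptions, as the exact signatures are documented in the GitHub repository.

```python
# Hypothetical usage of the OpcuaPacketAnalyzer; module name, constructor
# arguments, and paths are assumptions, see the repository for exact signatures.
from opcua_packet_analyzer import OpcuaPacketAnalyzer

analyzer = OpcuaPacketAnalyzer("snapshot.json")  # Wireshark JSON export
analyzer.analyze_packets()  # extraction, request handle matching, case IDs, CSV event log
```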
5. Use Case: End-Of-Line Process
In this section, we apply the method from Section 3 to a real industrial use case to showcase its application and relevance to real-world OPC UA network data. We thereby rely on the implemented OpcuaPacketAnalyzer class. The scenario is an automotive supplier’s end-of-line process, which includes robotic inspections, laser engraving, and cleaning. We walk through each phase of the method and discuss the corresponding actions. The dataset includes activities from four machines and a central process control system and provides a real-time snapshot of the process. Through this, we aim to investigate the feasibility of identifying IIoT business processes by mining OPC UA data.
5.1. Data Collection and Pre-Processing
Collect
We collect the data in real-time using a Raspberry Pi 3 connected to the network switch responsible for internal network communication. Using port mirroring, the Raspberry Pi captures and stores 30 minutes of network traffic in plain text on a USB hard disk. This creates a snapshot of the network communication in PCAP format during live operation.
Understand
Initially, we attempt to directly read the PCAP file using the pyshark library 4 but face limitations, e.g., missing support for OPC UA. We then export the network traffic with Wireshark to JSON, which requires us to specify the relevant OPC UA service ports. The Wireshark OPC UA extension aids packet interpretation, enabling the creation of the network structure (see Figure 3a) for an overview. We identify the Process Control System (PCS) server, with the IP address ending in .31, as the central node of the network. Among 24,445 OPC UA packets, we find 24,421 OPC UA message packets and 24 OpenSecureChannelRequest packets, which we do not analyze further. In total, the PCS has sent 9,244 packets and received 9,247 packets, and 2,965 packets were sent to the protocol server. The network data reveals that the PCS requests machine information through read and write requests. Publish and response packets lack production-relevant content, possibly due to an OPC UA PubSub-based notification system. Publish request packets originate from the PCS or the log system, addressed to the cleaning, conveyor, and test systems, resulting in publish response packets. Since these packets’ timestamps lack unique polling information, we exclude publish and subscribe packets from the event generation process.
Pre-Processing
Following the contextual analysis of the packets, we initiate the pre-processing. First, we exclude communication relationships involving the protocol server and focus on OPC UA packets between the machines and the PCS. We apply the request handle matching algorithm to create activities by aggregating OPC UA packets carrying the same request handle. To improve human readability, we assign labels by mapping IP addresses to device types (IP address:Label). In Figure 3b, these labels, like .31:PCS server, serve as activity names in event logs, enhancing process model readability.
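The label assignment can be as simple as a dictionary lookup, as in the sketch below; only the .31:PCS server entry stems from the described use case, the remaining mapping entry is hypothetical.

```python
# Illustrative IP-to-label mapping; only the PCS entry is taken from the use case.
DEVICE_LABELS = {
    ".31": "PCS server",
    ".42": "cleaning system",  # hypothetical entry
}

def label_activity(ip_suffix: str) -> str:
    """Replace an IP address suffix with a human-readable activity label."""
    return f"{ip_suffix}:{DEVICE_LABELS.get(ip_suffix, 'unknown device')}"
```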
Figure 3. Representation of visualizations.
5.2. Rule-Based Event Log Generation and Process Mining
Next, we generate the event logs for our use case. We find a product identifier (CanProduce.ITEM_ID) in the network traffic, allowing us to run the event log generation algorithm. This algorithm assigns the product identifier to each process instance that has not yet been attributed a case ID. It is important to note that the choice of case ID depends highly on the specific use case, which emphasizes the need for a thorough understanding of the available data. The event log is then written to CSV files. After creating the event log, we run process mining discovery techniques on it. For visualization purposes, Appendix A shows the directly-follows graph of the event log. As the Alpha miner, Heuristic miner, and Inductive miner rely on different algorithms, their results vary. We discuss each process model with the process experts and assess whether it reflects reality; all of them do so to some extent. We observe, however, that process experts have difficulties assessing low-level network events.
6. Evaluation
Having shown that mining processes from OPC UA network data is feasible, we aim to assess the scalability and quality of our approach. To assess mining capabilities and model quality in the OPC UA context, we implement experiments within a Jupyter notebook, available on GitHub 5. Using a MacBook Pro 2021 with an Apple M1 Pro chip, 8 cores, and 16 GB of memory, we run experiments on the OpcuaPacketAnalyzer class, analyzing OPC UA packets to generate event logs. This class extracts data from different ISO/OSI stack layers, generating logs for process mining algorithms. The performance evaluation involves generating event logs of varying sizes to understand scalability. Quality metrics such as replay fitness, precision, generalization, and simplicity gauge model performance. Collaboration with a process expert validates real-world accuracy and relevance, enriching the results and strengthening their practical implications.
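The four quality dimensions can be computed with standard process mining tooling; the sketch below shows one way to do so with pm4py for a Petri net discovered from the event log. The use of pm4py and the token-based replay variants of fitness and precision are assumptions for the example, not a description of the authors’ exact evaluation code.

```python
import pm4py
from pm4py.algo.evaluation.generalization import algorithm as generalization
from pm4py.algo.evaluation.simplicity import algorithm as simplicity

def model_quality(log, net, im, fm):
    """Compute replay fitness, precision, generalization, and simplicity for a Petri net."""
    fitness = pm4py.fitness_token_based_replay(log, net, im, fm)["log_fitness"]
    precision = pm4py.precision_token_based_replay(log, net, im, fm)
    gen = generalization.apply(log, net, im, fm)
    simp = simplicity.apply(net)
    return {"fitness": fitness, "precision": precision,
            "generalization": gen, "simplicity": simp}
```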
6.1. Results
Process Model Quality
Within our setting, we compare the quality of three process discovery algorithms, the Alpha miner, Heuristic miner, and Inductive miner, across varying dependency thresholds on our OPC UA data (see Figure 5). This threshold ranges from 0 to 1 and represents the minimum required dependency between activities to establish a causal relationship. Note that the Alpha miner does not rely on dependency thresholds, resulting in horizontal lines. The Alpha miner (Figure 5a) consistently shows 0% fitness, indicating poor alignment with the event log. Precision, generalization, and simplicity remain stable but at low values (approx. 19.4%, 87.1%, and 77.8%, respectively). This consistency indicates its limited adaptability. The Heuristic miner (Figure 5b) exhibits varied performance. At a threshold of 0, it achieves 100% fitness, declining sharply at higher thresholds. Precision peaks at 56.4%, with an upward trend in generalization. Simplicity fluctuates but remains within the mid-60% to mid-70% range. The Inductive miner (Figure 5c) shows intriguing results. At lower thresholds, it has 0% fitness, comparable to the Alpha miner. Precision starts at 60.3% and declines with higher thresholds. Generalization and simplicity fluctuate but within a tight range. In summary, the Heuristic miner is highly adaptable, whereas the Inductive miner offers a balanced performance in precision, generalization, and simplicity. The Alpha miner, while stable, lacks alignment with the log. Considering these nuances is vital for selecting an optimal miner in practical applications.
Operational Insights
We also gain insights from the continuous evaluation of processes through our industrial collaboration. The process expert initially stated that he “would never have believed that we could get so close to the real process using only network data,” which led to an internal rethink about the importance of network data for operational benefits. In addition, within the analysis of the network data, we identified further potential for process optimization. For example, beyond OPC UA, we discovered that a server regularly searched the network for printers, which reduced network performance. There were also indications of typing errors in naming and variations in variables, which were rectified.
Figure 5. Process discovery algorithms’ quality with varying thresholds.
6.2. Discussion
Limitations
While our research highlights the benefits of process mining in OPC UA network data, we acknowledge limitations that may impact the generalizability of our findings. First, our paper assumes the availability of network data in plain text. Encryption is sometimes used in real-life scenarios, which could conceal important information. Secondly, the method relies on a product ID for tracing and differentiating process instances. In cases of absent or inconsistent identifiers, mined process accuracy and completeness may be limited. Lastly, our dataset, covering only 32 unique process instances, may not represent the diversity of processes in more complex industrial settings, affecting the robustness and applicability of our insights.
Scientific Impact
In the evolving realm of process mining, our paper marks a paradigm shift, breaking away from conventional approaches. We pioneer the application of process mining to OPC UA data, showcasing its feasibility and effectiveness while highlighting key challenges, notably in data availability. This revelation emphasizes that datasets suitable for process mining are more extensive than previously believed. Our findings have broad applicability, such as in cybersecurity, where process models can enhance network intrusion detection or ensure compliance [22]. Additionally, our insights into OPC UA processes offer valuable nuances for future benchmarking studies.
Practical Impact
In process mining, decoding OPC UA network data holds transformative potential for gaining insights into operational processes. While our models currently exhibit qualitative limitations, they already reflect real-world end-of-line process behavior. An increased data volume would facilitate more meaningful models. Process experts, though adept at macro-level understanding, may lack granularity at the level of network events. Precisely determining process starting points is critical for aligning mapped processes with expert understanding. Our 30-minute capture suggests that a one-week snapshot could reveal essential details for informed analysis. By bridging the gap between high-level process knowledge and complex network traffic patterns, organizations can fully leverage the potential of process mining.
7. Conclusion
Our research taps into the rich potential of network data in the IIoT, an area that has not been fully explored for generating event logs and uncovering business processes. To the best of our knowledge, we are the first to introduce a method that reveals IIoT processes based on (OPC UA) network traffic data. Our method not only advances academic research, allowing for more detailed comparisons and improvements (like benchmarking), but also shows practitioners the real value of network traffic data. We developed an open-source prototype that represents a significant shift in process mining, offering a transparent and understandable way of mining OPC UA network data. Despite facing challenges like network encryption and working with a relatively small dataset, our findings are promising. They reveal that our process models accurately reflect a real-world use case at quite high quality and with relatively good performance. In our discussion, we emphasize the importance of using larger datasets for more precise results. We are excited to follow future research in this area, confident that network traffic data is poised to unlock new opportunities in process mining and beyond.
Acknowledgments
This work is funded by the “Bavarian Ministry of Economic Affairs, Regional Development and Energy” within the project Security Iiot pRocEss Notation (SIREN).
Appendix A. Directly-Follows Graph of the End-of-Line Process
The directly-follows graph (DFG) visualizes the sequence and frequency of events or activities in a given process. In this specific DFG, nodes represent distinct activities, such as “PublishRequest” and various “ReadRequest” operations with associated parameters. Directed edges between nodes signify the order in which these activities occur. For instance, an edge from “PublishRequest” to a “ReadRequest” indicates that the “PublishRequest” activity directly precedes the “ReadRequest” activity in the process sequence. Furthermore, numerical annotations on the edges, like 1588 or 1505, represent the sequence frequency, denoting how many times one activity directly followed another in the observed data. The nodes’ parameters detail the data transferred within the respective OPC UA tuples. This DFG provides insights into the common paths and patterns of the end-of-line process but is still cluttered.
Figure A1.
Cluttered directly-follows graph of the end-of-line process.
Figure A1.
Cluttered directly-follows graph of the end-of-line process.
References
- Boyes, H.; Hallaq, B.; Cunningham, J.; Watson, T. The industrial internet of things (IIoT): An analysis framework. Computers in industry 2018, 101, 1–12. [Google Scholar] [CrossRef]
- Hästbacka, D.; Barna, L.; Karaila, M.; Liang, Y.; Tuominen, P.; Kuikka, S. Device status information service architecture for condition monitoring using OPC UA. ETFA, 2015, pp. 1–7. [CrossRef]
- Shin, S.J. An OPC UA-Compliant Interface of Data Analytics Models for Interoperable Manufacturing Intelligence. IEEE Transactions on Ind. Inf. 2021; 2588–3598. [Google Scholar] [CrossRef]
- Schönig, S.; Hornsteiner, M.; Stoiber, C. Towards Process-Oriented IIoT Security Management: Perspectives and Challenges. Enterprise, Business-Process and Information Systems Modeling; Springer International Publishing: Cham, 2022; pp. 18–26. [Google Scholar]
- van der Aalst, W.M. Process Mining: Data Science in Action; Springer, 2016. [CrossRef]
- Bertrand, Y.; De Weerdt, J.; Serral, E. A bridging model for process mining and IoT. ICPM. Springer, 2021, pp. 98–110. [CrossRef]
- Seiger, R.; Franceschetti, M.; Weber, B. An interactive method for detection of process activity executions from iot data. Future Internet 2023, 15, 77. [Google Scholar] [CrossRef]
- Mangler, J.; Grüger, J.; Malburg, L.; Ehrendorfer, M.; Bertrand, Y.; Benzin, J.V.; Rinderle-Ma, S.; Serral Asensio, E.; Bergmann, R. DataStream XES extension: embedding IoT sensor data into extensible event stream logs. Future Internet 2023, 15, 109. [Google Scholar] [CrossRef]
- Dunzer, S.; Zilker, S.; Marx, E.; Grundler, V.; Matzner, M. The Status Quo of Process Mining in the Industrial Sector. WI. AISeL, 2021, pp. 629–644. [CrossRef]
- Engelberg, G.; Hadad, M.; Soffer, P. From network traffic data to business activities: a process mining driven conceptualization. BPMDS. Springer, 2021, pp. 3–18. [CrossRef]
- Hadad, M.; Engelberg, G.; Soffer, P. From Network Traffic Data to a Business-Level Event Log. BPMDS. Springer, 2023, pp. 60–75. [CrossRef]
- Wakup, C.; Desel, J. Analyzing a TCP/IP-protocol with process mining techniques. International Conference on Business Process Management. Springer, 2014, pp. 353–364. [CrossRef]
- Apolinário, F.; Escravana, N.; Hervé, É.; Pardal, M.L.; Correia, M. FingerCI: generating specifications for critical infrastructures. SIGAPP, 2022, pp. 183–186. [CrossRef]
- Lange, M.; Kuhr, F.; Möller, R. Using a deep understanding of network activities for workflow mining. KI. Springer, 2016, pp. 177–184. [CrossRef]
- Lange, M.; Möller, R. Time series data mining for network service dependency analysis. SOCO’16-CISIS’16-ICEUTE’16. Springer, 2017, pp. 584–594. [CrossRef]
- Empl, P.; Böhm, F.; Pernul, G. Process-Aware Intrusion Detection in MQTT Networks. Proceedings of the Fourteenth ACM Conference on Data and Application Security and Privacy (CODASPY ’24); ACM: New York, NY, USA, 2024; p. 12. [Google Scholar]
- Cook, J.E.; Wolf, A.L. Automating Process Discovery Through Event-Data Analysis. ICSE. ACM, 1995, pp. 73–82. [CrossRef]
- Mannhardt, F.; de Leoni, M.; Reijers, H.A.; van der Aalst, W.M.P.; Toussaint, P.J. Guided Process Discovery - A pattern-based approach. Inf. Syst. 2018, 76, 1–18. [Google Scholar] [CrossRef]
- Hevner, A.R.; March, S.T.; Park, J.; Ram, S. Design Science in Information Systems Research. MIS Q. 2004, 28, 75–105. [Google Scholar] [CrossRef]
- Wirth, R.; Hipp, J. CRISP-DM: Towards a standard process model for data mining. Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining. Manchester, 2000, Vol. 1, pp. 29–39.
- Mirza, M.N.; Pourzolfaghar, Z.; Shahnazari, M. Significance of Scope in Project Success. Procedia Technology 2013, 9, 722–729. [Google Scholar] [CrossRef]
- Hornsteiner, M.; Schönig, S. SIREN: Designing Business Processes for Comprehensive Industrial IoT Security Management. DESRIST. Springer, 2023, Vol. 13873, LNCS, pp. 379–393.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).