1. Introduction
Scaffolds are important temporary structures for supporting workers, equipment, and materials during construction. The scaffolding assembly process is subject to strict safety regulations, including requirements for safe support systems, structural stability, and proper connection of rods. Accidents on construction sites are often related to scaffolding: insufficient support owing to deviations from the construction design, insecure rod connections, or an absence of cross bracing can lead to uneven loading and collapse. During assembly, inspectors must check for the presence of tie rods, cross-bracing rods, and base plates, in addition to ensuring that these components are properly secured. While some of these inspections can be performed visually, others require inspectors to use touch, measuring tools, or other auxiliary instruments. In high-rise or large buildings, hundreds of structural scaffolding frames may be used, and checking each frame even visually is time-consuming for inspectors. Moreover, inspectors are likely to miss some deficiencies in these frames.
In recent years, artificial intelligence (AI) has been widely used for image recognition at construction sites. In particular, deep learning models have significantly driven the uptake of AI for site monitoring and inspection. For example, Li et al. [1] applied a deep learning algorithm to detect concealed cracks from ground-penetrating radar images. Fang et al. [2] used deep learning to detect construction equipment and workers on construction sites in an attempt to create a safer work environment through real-time monitoring. Reja et al. [3] used deep learning to track construction progress by analyzing images and videos to enhance project management activities and decision-making. Shanti et al. [4] demonstrated the use of deep learning for recognizing safety violations at construction sites. Wang et al. [5] applied deep learning to detect and quantify cracks on the surfaces of concrete structures.
Since Microsoft released the HoloLens (HL) and HoloLens 2 (HL2) mixed-reality (MR) headsets [6], they have been used in various industries, including the construction industry. The newer HL2 is equipped with a computational platform that allows for on-device data processing and execution of AI applications. Moreover, it carries multiple RGB and infrared cameras, which enable spatial awareness and position tracking of the surrounding environment. Users can interact with the device manually, through eye-tracking, or by using voice commands to place augmented reality (AR) models. Meanwhile, in terms of project visualization, the construction industry has undergone a transformative shift in recent years owing to the widespread adoption of Building Information Modeling (BIM) [7]. BIM is a three-dimensional (3D) digital representation of the physical and functional aspects of a construction project, providing a comprehensive view of the project’s lifecycle, including the construction phase.
Park et al. [8] conducted a comprehensive review of the academic applications of HL across diverse domains, encompassing medical and surgical aids, medical education and simulation, industrial engineering, as well as architecture and civil engineering. In a notable example, Pratt et al. [9] employed HL to assist medical surgeons in accurately and efficiently locating perforating vessels, leveraging information extracted from preoperative computed tomography angiography images. Additionally, Al-Maeeni et al. [10] utilized HL to guide machine users in executing tasks in the correct sequence, thereby optimizing retrofit time and cost when remodeling production machinery.
In the fields of architecture and civil engineering, HL exhibits a real-time inside-out tracking capability, enabling precise visualization of virtual elements within the spatial environment. However, it necessitates a one-time localization of the AR platform within the local coordinate frame of the building model to integrate indoor surroundings with the corresponding building model data. Consequently, research has delved into fundamental spatial mapping utilizing digital models and visualization techniques (e.g., [11,12]). In the realm of construction management, Mourtzis et al. [13] utilized HL to visualize production scheduling and monitoring, while Moezzi et al. [14] concentrated on simultaneous localization and mapping (SLAM) for autonomous robot navigation, leveraging HL to facilitate control over positioning, mapping, and trajectory tracking. AR has also been applied to deficiency inspection in construction, which is the focus of this study: Karaaslan et al. [15] developed an MR framework integrated with an HL headset to assist bridge inspectors by automatically analyzing defects, such as cracks, and providing real-time dimension information along with the condition state.
Although guidelines on the safety of scaffolding have been studied (e.g., [16]), only a few researchers have conducted digitalization-related research specifically on scaffolding. For example, Baek [17] focused on improving transparency, accountability, and traceability in construction projects by applying blockchain technology to support a secure, decentralized ledger for documenting and verifying scaffolding installation processes. Sakhakarmi et al. [18] developed a machine-learning model to classify cases of scaffolding failure in buildings and predicted safety conditions based on strain datasets of scaffolding columns spanning multiple bays and stories. Similarly, Choa et al. [19] developed an Arduino module to build an Internet of Things network for collecting the boundary conditions associated with the dynamic loading conditions of scaffolding structures. In addition, they used the finite element method to estimate the structural behavior of scaffolds in real time.
As Sakhakarmi et al. [18] pointed out, although 65% of construction workers work on scaffolding structures and are often exposed to safety hazards, the existing method for monitoring scaffolding structures is inadequate. Despite regular safety inspections and safety planning, numerous fatal scaffolding-related accidents continue to occur at construction sites. Existing practices that rely on human inspection are not only ineffective but also unreliable owing to the dynamic nature of construction activities [19]. In this research work, we integrate a deep learning model with an AR model by using the HL2 as the main visual device to help superintendents perform visual inspections in conformance with the regulations governing scaffolding for building facades during the construction phase.
According to the safety regulations for inspecting construction scaffolding in Taiwan [20], the inspection requirements related to construction sites are divided into three types, namely, visual inspection, measurement inspection, and strain monitoring. We use this classification to define the scope of this research. For example, one can visually inspect and determine compliance with Article 4, which stipulates that cross-tie rods and lower-tie rods should be installed on both sides of scaffolding. One must perform measurements to determine compliance with Article 9, which stipulates that the distance between two scaffolding wall poles should be less than 5.5 m in the vertical direction and 7.5 m in the horizontal direction. In a few special dynamic situations, stress and strain gauges must be installed.
In this work, we focus only on the automation of visual inspection, which does not require measurements or the installation of stress and strain gauges and is easy to implement at construction sites. A few examples are as follows (an illustrative mapping of such clauses to inspection types appears after the list).
Article 4: “Cross-tie rods and lower-tie rods should be installed on both sides of the scaffolding,” and “There should be appropriate guardrails on the stairs going up and down the scaffolding.”
Article 6: “Brackets should be used to lay auxiliary pedals or long anti-fall nets between the scaffolding and the structure.”
Article 12: “The scaffolding should have a foundation plate that is securely placed on the ground.”
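To make this scoping concrete, the following minimal sketch enumerates sample requirements together with their inspection type and, for the visual ones, the deficiency classes used later in this article. The data structure and names are our own illustrative assumptions, not part of the actual system configuration; the article numbers follow [20].

```python
# Illustrative scoping table: which regulatory clauses are candidates for
# automated visual inspection. Structure and names are assumptions made
# for illustration only.
INSPECTION_SCOPE = [
    # (article, requirement summary, inspection type, related detector classes)
    (4, "cross-tie and lower-tie rods on both sides", "visual",
     ["Missing cross-tie rod", "Missing lower-tie rod"]),
    (6, "auxiliary pedals / anti-fall nets between scaffold and structure", "visual",
     ["Missing footboard"]),  # assumed mapping
    (12, "foundation plate securely placed on the ground", "visual",
     None),  # visually inspectable, but not among the trained classes
    (9, "wall-pole spacing < 5.5 m vertical, < 7.5 m horizontal", "measurement",
     None),  # requires measurement; outside this work's scope
]

def visually_automatable(scope):
    """Return the clauses that fall within this work's (visual) scope."""
    return [item for item in scope if item[2] == "visual"]

for article, requirement, _, classes in visually_automatable(INSPECTION_SCOPE):
    print(f"Article {article}: {requirement} -> detector classes: {classes}")
```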
3. Results
Among the three main modules of SADDS, i.e., the deficiency recognition, AR visualization, and HL2 visualization modules, it is the deep learning model of the deficiency recognition module that determines the system's recognition accuracy.
During the training phase, the mean average precision (mAP) of the trained model was 0.951, with a precision of 0.883, a recall of 0.904, and an F1 score of 0.893 (F1 = 2PR/(P + R)) after 166 epochs.
Table 2 lists the precision values of the model trained using Roboflow.
Because we did not have access to the code of the models trained in the Roboflow software environment, we recreated the YOLOv5 version of the model using the PyTorch package in Python, with a batch size of 16, 300 epochs, and a learning rate of 0.01. The expanded image dataset was used to train and test this model.
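For readers who wish to reproduce a comparable baseline, the following is a minimal sketch of such a training run using the public ultralytics/yolov5 repository. The dataset configuration file and starting checkpoint names are assumptions for illustration, not the project's actual artifacts.

```python
# Minimal reproduction sketch (assumes the ultralytics/yolov5 repository is
# cloned and a dataset config "scaffold.yaml" lists the four classes used in
# this study). File names are illustrative.
import subprocess

subprocess.run(
    [
        "python", "train.py",
        "--data", "scaffold.yaml",   # Qualified / Missing cross-tie rod /
                                     # Missing lower-tie rod / Missing footboard
        "--weights", "yolov5s.pt",   # assumed pretrained starting checkpoint
        "--img", "640",              # assumed input resolution
        "--batch", "16",             # batch size reported in the text
        "--epochs", "300",           # epochs reported in the text
        # YOLOv5's default initial learning rate (lr0) is 0.01, matching the
        # setting reported in the text, so no hyperparameter override is needed.
    ],
    check=True,
)
```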
Table 3 shows the mAP and the corresponding precisions of this self-built model for the “Qualified,” “Missing cross-tie rod,” “Missing lower-tie rod,” and “Missing footboard” classes. The test mAP was 0.89, with precision values of 0.96, 0.82, 0.90, and 0.89 for these four classes, respectively.
Table 4 and Figure 4 summarize and illustrate the losses of this model during the validation phase. Figure 5 depicts the convergence of precision, recall, and mAP in the validation phase. The precision data indicate that the results obtained using the trained model were satisfactory. The trained model was then used as the deficiency-recognition module in SADDS.
The visualization of the results of the deficiency recognition module depends on the viewing device. As described previously in the conceptual model of Figure 1, SADDS can be used in two ways, i.e., with or without the HL2 AR goggles, to help a user check scaffolding frames. When a user captures images using a mobile phone or video camera and sends the video stream to a web server, the highlights are viewed directly on the web, as shown in the left image of Figure 7. When a user wears the HL2 AR goggles, the AR visualization module synchronizes the real world and the digital Unity model based on the QR-code markers. The center image of Figure 7 shows an example of highlights projected in the Unity model. Subsequently, the HL2 visualization module projects the highlights on the HL2, as shown in the right image of Figure 7.
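As a rough illustration of the web-server path (not the actual SADDS implementation), the following sketch accepts an uploaded video frame over HTTP and returns the detections that a browser client could draw as highlights. The endpoint name and the checkpoint file "sadds_best.pt" are hypothetical.

```python
# Minimal sketch of the web-server path: a Flask endpoint that runs a YOLOv5
# checkpoint on each uploaded frame. Names are illustrative assumptions.
import io

import torch
from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)
model = torch.hub.load("ultralytics/yolov5", "custom", path="sadds_best.pt")

@app.route("/detect", methods=["POST"])
def detect():
    # Decode the uploaded frame and run one inference pass.
    img = Image.open(io.BytesIO(request.files["frame"].read()))
    results = model(img)
    # Each record holds xmin, ymin, xmax, ymax, confidence, class, and name,
    # which the browser can use to draw the colored highlight boxes.
    return jsonify(results.pandas().xyxy[0].to_dict(orient="records"))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```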
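The QR-code synchronization itself runs in Unity on the HL2. Purely as an illustration of the underlying idea, the following OpenCV sketch locates a printed marker of known size in a camera frame and estimates its pose relative to the camera, which is the kind of anchor transform needed to align the virtual model with the real scene. The function names and marker size are assumed values.

```python
# Illustrative marker-based registration (not the Unity/HL2 code used by
# SADDS): detect a QR marker and recover its pose in camera coordinates.
import cv2
import numpy as np

MARKER_SIZE_M = 0.20  # assumed edge length of the printed QR marker (meters)

def estimate_marker_pose(frame, camera_matrix, dist_coeffs):
    """Return (rvec, tvec) of the marker in camera coordinates, or None."""
    ok, corners = cv2.QRCodeDetector().detect(frame)
    if not ok or corners is None:
        return None
    s = MARKER_SIZE_M
    # Marker corners in the marker's own plane, in the detector's corner
    # order (top-left, top-right, bottom-right, bottom-left).
    object_pts = np.array(
        [[0, 0, 0], [s, 0, 0], [s, s, 0], [0, s, 0]], dtype=np.float32
    )
    image_pts = corners.reshape(-1, 2).astype(np.float32)
    ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, camera_matrix, dist_coeffs)
    return (rvec, tvec) if ok else None
```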
4. Field Test and Discussion
To field test SADDS, we deployed it at two other construction sites, namely, 7- and 14-story concrete residential buildings in Hsinchu City, Taiwan. One of the authors wore the HL2, walked slowly, and recorded the front and rear facades of these under-construction buildings from the exterior at ground level. The weather was cloudy and occasionally drizzly, which did not significantly affect image quality.
Figure 8 and Figure 9 present examples of the recognition results obtained at these two test sites. Automated detection of the target deficiencies worked successfully: most of the scaffolding frames were found to be qualified (green label), twelve had “missing lower-tie rods” (purple), two had “missing cross-tie rods” (magenta), and one had a “missing footboard” (red). The following lessons were learned from the field tests.
The camera shooting angle should be as orthogonal to the target wall face as possible. Even at oblique angles, the recognition module eventually recognized the deficiencies in some frames as the wearer approached them. However, because the attached alert frames are always orthogonal squares, a highlight may be misread by the viewer as belonging to the wrong frame. This problem can be avoided so long as the camera shooting angle is orthogonal to the target wall face.
When shooting at an oblique angle with respect to the target wall face, distant frames may not be recognized by the module owing to self-occlusion. This limitation is understandable because even humans cannot evaluate those frames in the same situation, and those frames are evaluated correctly once the camera moves toward them and the occlusions disappear.
Considering the practical use case, to enhance work efficiency and inspector safety, the tests were performed in front of the scaffolds on the ground, without actually climbing on the scaffolding boards, so that multiple frames could be captured at a glance. So long as the shooting angle was nearly orthogonal to the target wall face, images containing 20–50 frames posed no problem for SADDS. In this respect, SADDS is more efficient than inspection with the human eye. Nevertheless, in double-frame scaffold systems, most of the inner frames are not recognized by the system owing to self-occlusion by the outer frames. Although one may stand on a scaffolding board to shoot the inner frames without occlusion, the number of frames covered in an image would be very limited, and the frames would need to be checked one by one. In such a case, direct inspection with the human eye would be more convenient.
Before the field test, we were concerned about the system’s ability to recognize “missing cross-tie rod,” which had the lowest precision (0.82, compared with, for example, 0.90 for “missing lower-tie rod”) among the three types of target deficiencies. However, this did not seem to be a problem during the field test. A possible explanation is that in training and testing, precision values were calculated per image, and each misidentification was counted. During actual field use, by contrast, images were processed as a stream, and while the HL2 wearer was moving, SADDS had many chances to identify a deficiency and, eventually, alert the wearer.
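One simple way to exploit this streaming redundancy, shown below as a sketch of the general idea rather than of how SADDS actually aggregates detections, is to confirm a deficiency only after it has been detected in several recent frames, which suppresses one-off misdetections while still catching persistent ones.

```python
# Illustrative temporal filter: confirm a deficiency class once it appears in
# at least min_hits of the last n_frames frames. Thresholds are assumed
# values; a real system would also associate detections spatially.
from collections import deque

class StreamingConfirmer:
    def __init__(self, n_frames=15, min_hits=5):
        self.history = deque(maxlen=n_frames)  # per-frame sets of detected classes
        self.min_hits = min_hits

    def update(self, detected_classes):
        """Feed one frame's detections; return the classes confirmed so far."""
        self.history.append(set(detected_classes))
        counts = {}
        for frame_classes in self.history:
            for cls in frame_classes:
                counts[cls] = counts.get(cls, 0) + 1
        return {cls for cls, c in counts.items() if c >= self.min_hits}

# Example: a spurious one-frame "Missing footboard" is never confirmed,
# while a repeatedly seen "Missing lower-tie rod" is.
confirmer = StreamingConfirmer()
for frame in range(10):
    detections = ["Missing lower-tie rod"] + (["Missing footboard"] if frame == 3 else [])
    confirmed = confirmer.update(detections)
print(confirmed)  # {'Missing lower-tie rod'}
```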
The scaffolds at both test sites were enclosed by safety nets (e.g., anti-fall or dustproof nets), which did not affect the recognition accuracy of SADDS so long as human eyes could see through the net. Indeed, in the presence of safety nets, it was more difficult for humans to recognize unqualified assemblies from a distance than it was for SADDS.
The auto-generated highlights on the HL2 provide the wearer with real-time warnings about frames with unqualified assemblies, and the wearer can suggest remedial measures right away. To record such warnings, it is best to annotate the highlights on the corresponding elements of the 3D Unity model and, if necessary, export them back to the .ifc BIM format so that they can be read using BIM-compatible software, such as Revit. Figure 10 shows the recorded highlights of unqualified scaffolding assemblies on the corresponding elements of the Revit model.
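A lightweight way to carry such records toward BIM (our own illustration; the text does not specify the export mechanism beyond the .ifc format) is to key each confirmed deficiency to the GlobalId of the corresponding scaffolding element, which any IFC-aware tool can resolve. The GlobalIds and file names below are made up for illustration.

```python
# Illustrative export record: map confirmed deficiencies to IFC element
# GlobalIds so that BIM software can locate the annotated scaffolding frames.
import csv
from datetime import date

records = [
    {"global_id": "2O2Fr$t4X7Zf8NOew3FNrX", "deficiency": "Missing lower-tie rod"},
    {"global_id": "1hXz0qW9T2JwQpR5s8uVbK", "deficiency": "Missing cross-tie rod"},
]

with open("scaffold_deficiencies.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["global_id", "deficiency", "inspected_on"])
    writer.writeheader()
    for rec in records:
        writer.writerow({**rec, "inspected_on": date.today().isoformat()})
```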
Professionals at the test sites appreciated the real-time highlighting of unqualified scaffolding frames. This function helped inspectors avoid missing potential deficiencies in scaffolding assemblies even when they only glanced at such assemblies. However, they were not convinced of the need to record the highlights of unqualified frames on the 3D model. First, scaffolds are temporary structures that may be adapted frequently as construction proceeds; recording only a snapshot of such a dynamic temporary structure did not make sense to them. Second, A/E or construction contractors seldom implement scaffolding elements in BIM, and creating a scaffolding structure in a 3D model simply to annotate its deficiencies did not seem a worthwhile endeavor to them. Note that we created the scaffolding elements manually simply for the purpose of this study.
Finally, there is a multitude of possible deficiencies in construction scaffolding. When training deep learning models, it is essential to compile a diverse set of cases that represent the various deficiency patterns. This article specifically addresses the types of deficiencies outlined in Table 1. For instance, to train the model to recognize missing cross-tie rods, lower-tie rods, or footboards, engineers can capture these deficiencies in frontal views of the scaffolding and train the model on all three simultaneously. Such simultaneous training is feasible because these three deficiency types share the same appropriate camera-shooting angles and can coexist within the same captured image.
Conversely, identifying deficiencies in tie rods and fall-protection nets requires engineers to position themselves either on the scaffolding or in the gap between the scaffolding and the building for optimal shooting. To spot deficiencies in the scaffolding's base plates, engineers should focus on the scaffold's bottom and, if necessary, capture images as closely and vertically as possible toward the base plate. When identifying metal-fastener deficiencies, engineers may need to concentrate on the joints between components, capturing a range of both qualified and unqualified patterns. We recommend training the model to recognize these deficiency types separately, since the appropriate camera-shooting angles for each are distinct and they rarely coexist in a single image (one illustrative way to organize this is sketched below).
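As a concrete, purely illustrative way to organize this recommendation, one could group the deficiency classes by shared camera viewpoint and launch one training run per group; the group and class names below are our own assumptions, not an established taxonomy.

```python
# Illustrative grouping of deficiency classes by shared camera viewpoint;
# each group would get its own dataset config and training run.
VIEWPOINT_GROUPS = {
    "frontal_view": [        # shot orthogonally at the scaffold face
        "Missing cross-tie rod",
        "Missing lower-tie rod",
        "Missing footboard",
    ],
    "gap_or_on_scaffold": [  # shot from the scaffold or the building gap
        "Missing tie rod",
        "Missing fall-protection net",
    ],
    "base_close_up": ["Missing base plate"],
    "joint_close_up": ["Unqualified metal fastener"],
}

for group, classes in VIEWPOINT_GROUPS.items():
    print(f"train model '{group}' on classes: {classes}")
```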
5. Conclusions
Scaffolds are important temporary structures that support construction processes. Accidents at construction sites are often related to scaffolds, such as insufficient support owing to deviations from the construction design, insecure rod connections, or the absence of cross bracing, which lead to uneven loading and collapse, thus resulting in casualties. Herein, we proposed a deep-learning-based, AR-enabled system called SADDS to help field inspectors identify deficiencies in scaffolding assemblies. An inspector may employ a video camera, a mobile phone, or AR goggles (e.g., the HL2) when using SADDS for automated detection of deficiencies in scaffolding assemblies. The test mAP was 0.89, and the precision values for the “qualified,” “missing cross-tie rod,” “missing lower-tie rod,” and “missing footboard” cases were 0.96, 0.82, 0.90, and 0.89, respectively. The subsequent field tests conducted at two construction sites yielded satisfactory performance in practical use cases. However, the field tests highlighted a few limitations of the system. For example, the system cannot identify deficiencies in the inner wall of a double-walled scaffolding structure from the outside owing to occlusions. In addition, the detection of deficiencies in assemblies is merely the beginning of scaffolding inspection. Many unqualified scenarios that can be identified visually have not been covered in this work (e.g., deformed frames and defective metal fasteners). These scenarios must be addressed in future work. Moreover, detecting these scenarios may require video-shooting the scaffolding frames from a close distance, which raises the following efficiency issue: if an experienced inspector is willing to perform frame-by-frame inspection, why would they need SADDS? More recognition abilities will be developed and the related issues explored in future research.