1. Introduction
An intelligent mapping system is urgently required: although the rapid development of reality capture technology has stimulated documentation activities in recent years, data processing still requires extensive human intervention.
In-situ investigations are expensive in terms of labour cost and time, even though many technologies now offer fast, automatic, and complete digitalisation capabilities. The elaboration work, including data processing, 3D reconstruction, registration, and further downstream tasks such as information modelling and quality control, is mainly done off-site and remains time-consuming because it is largely manual. Therefore, the push to improve the productivity of 3D modelling raises two issues that require immediate attention: instant (real-time) investigation feedback and the interpretation of objects of interest in 3D.
AI technologies have proven capable of automating and speeding up many time-consuming manual processes. For example, semantic segmentation and labelling tools are probably the most relevant and mature aids available today for processing digitalised 3D spatial data.
Specifically, for 2D images, mature deep learning (DL) models (e.g. YOLOv10 [1]) can run at 200 frames per second (FPS) or more, achieving real-time object detection. For 3D point clouds, machine learning models have achieved over 90% classification accuracy based on geometric features computed over large neighbourhoods (e.g. DGCNN [2]). Much research in the last few years has integrated 2D and 3D automatic processing tools, mapping objects and items in space and attaching corresponding semantic information, making a quick assessment of the investigation site possible alongside the data acquisition activities.
These AI applications can greatly benefit 3D reconstruction activities and further data harvesting. In fact, 3D data with semantic meanings facilitate viewing and editing digital assets, enabling comparison and integration with previous surveys or the BIM model, and contributing to the BIM population and fieldwork monitoring. Considering the characteristics of different data types and corresponding technologies, combining artificial intelligence processing with reality capturing can help with real-time monitoring of building construction, facility control, and activities in many other scenarios.
This paper collects and organises noteworthy research relevant to the interests of the reality-capturing field in architectural scenarios. The related fields have produced many promising tools, but they are mostly constrained to specific scenarios. Listing the techniques and applications reveals some common approaches and a research trend towards a deeper integration of the two domains.
1.1. An Initial Overview of AI Techniques for Digitalisation
AI, especially machine learning, usually refers to the field of research in computer science that enables machines to learn from given data and generate responses to defined demands.
Machine learning is closely related to statistical models. It fits input data and generates classification (predicting predefined class labels), segmentation (dividing an image or data into multiple segments or regions), or regression (predicting continuous numerical values) strategies. Some widely used algorithms are Decision Trees, Random Forests (RF) [9,24], Support Vector Machines (SVM) [25], K-Nearest Neighbours (KNN), Naïve Bayes (NB), etc. Generally, these statistical models are simple and easily explainable. However, to accurately predict more complex inputs, such as images and point clouds, they must be expanded and integrated with sophisticated feature extractors. These feature extractors add context information to the data points (e.g. pixels in images), but at this stage they were mostly human-engineered.
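As a simple illustration of how these statistical models operate on fixed-length feature vectors, the following sketch (using scikit-learn; the toy digits dataset and the chosen hyperparameters are illustrative assumptions, not drawn from the cited works) trains a Random Forest and an SVM on pre-computed features:

```python
# Minimal sketch: classical ML classifiers on fixed-length feature vectors
# (illustrative only; dataset and models are placeholders, not from the cited works).
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Each sample is already a flat feature vector (8x8 pixel intensities).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (RandomForestClassifier(n_estimators=100), SVC(kernel="rbf")):
    model.fit(X_train, y_train)                              # fit the statistical model
    acc = accuracy_score(y_test, model.predict(X_test))      # evaluate on held-out samples
    print(type(model).__name__, f"accuracy: {acc:.3f}")
```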
Deep learning, as a subsector of machine learning, greatly increased model complexity and further automated the learning process, allowing models to handle intricate data efficiently. Nowadays, DL tools can be expected to perform more complicated tasks such as object detection (identifying and localising objects within images), pose estimation (determining the positions of a subject's key points, e.g., the joints of the human body), text generation (producing sequences of words to create contextually relevant text), etc.
DL models learn context information automatically. Initially, different approaches were developed for different data types, leading to two main research orientations: computer vision (CV), which processes 2D data, and Natural Language Processing (NLP), which processes text data. Concretely, considering the task and the data being analysed, inputs can be classified by their dimensionality, including:
- (1)
One-dimensional inputs: Sensor data, including data from sensors such as accelerometers, gyroscopes, and temperature sensors; audio data, including speech, music, and other audio recordings; and some time series data.
- (2)
Two-dimensional inputs: Image data, including photographs, drawings, spectrograms, etc. Image data can be analysed using CV techniques, which involve processing and analysing visual data. The sequence of image data or video can be considered multi-dimensional.
- (3)
Three-dimensional inputs: Point cloud data and 3D scans of objects or environments, such as buildings, landscapes, and industrial equipment. Point cloud data can be analysed using algorithms that are specifically designed for 3D point clouds. As addressed in the work on PointNet [3], points in Euclidean space are unordered, which greatly differentiates them from pixel arrays in 2D images or voxel arrays in volumetric grids: changing the order in which points are fed to a model does not change the point cloud itself. Another popular representative 3D format is RGB-D data, which combines RGB colour data with depth data (D). Depending on the method (either structured light or time-of-flight), the camera acquires the precise distance of the object's surfaces from the specific viewpoint where the RGB information is collected.
- (4)
Multi-dimensional inputs: Text data, including articles, reviews and more, could have many features such as word frequency, word length, and syntactic structure. Text data can be analysed using NLP techniques, which involve processing and analysing natural language data.
Generally, different types of inputs require specific data-analysing tools and feature extractors. In some cases, AI methods also generalise well to other types of inputs: for example, the “attention” mechanism from the NLP field, which uses “transformer” modules to calculate attention weights of words within a specific section of a sentence, has been applied to image object detection [4], and convolution filters have been applied to 3D point clouds [3,5].
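The following minimal sketch illustrates the scaled dot-product attention computation at the core of transformer modules; the tensor sizes and variable names are illustrative only:

```python
# Minimal sketch of scaled dot-product attention (the core of the "transformer"
# module mentioned above); shapes and names are illustrative.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays; returns attended values and attention weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise similarity of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V, weights

tokens = np.random.rand(6, 16)   # e.g. 6 word embeddings of dimension 16
out, attn = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape, attn.shape)     # (6, 16) (6, 6)
```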
With the growth of computational power, AI methods like DL have enabled data processing tasks such as classification to achieve satisfying results with less labour and time. Since the introduction of the Convolutional Neural Network (CNN), i.e., a regularised version of the multilayer perceptron, by LeCun et al., 1989 [6], DL methods have greatly enhanced feature extraction, task accuracy, and scalability with larger datasets. Unlike other machine learning methods, DL is based on artificial neural networks, with ‘deep’ referring to the depth of the neural layers. Deep networks have gradually increased their depth (and hence complexity); correspondingly, they can fit bigger datasets. As training data has become more available in recent years, thanks to developments in the data acquisition sector, the performance of AI can be expected to improve.
A typical DL workflow (illustrated in Figure 1) includes an inference phase and a relatively long training process. Generally, a DL model must first be trained by feeding it a dataset; afterwards, it can be expected to make inferences on related unseen data.
The training process of a DL model starts after data acquisition and a series of processing steps, including cleaning, pre-categorising (structuring), annotation, data augmentation, etc. The input data pass through the “backbone” network, a feature extractor. The extracted features pass through the “neck,” which collects, combines, and transforms them. Then, the “head” processes the output and makes decisions. A loss is computed to start the optimisation process, such as gradient descent. Based on the loss function, the model slightly updates the parameters in the neural layers to reduce the loss. This process runs repetitively for numerous epochs until an acceptable loss value is achieved. It can be time-consuming, but afterwards the learnable parameters are ready for prediction on unseen data, the so-called inference process. During inference, the input goes through the whole network once, without involving gradients and without updating the learnable parameters.
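The following PyTorch sketch mirrors the backbone–neck–head training loop and the gradient-free inference pass described above; the layer widths, dataset, and loss are placeholders rather than a real model:

```python
# Minimal PyTorch sketch of the backbone-neck-head workflow described above;
# layer sizes, data and loss are placeholders, not an actual detection model.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())   # feature extractor
neck     = nn.Sequential(nn.Linear(64, 32), nn.ReLU())    # combines/transforms features
head     = nn.Linear(32, 5)                                # task-specific prediction (5 classes)
model    = nn.Sequential(backbone, neck, head)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # gradient descent
criterion = nn.CrossEntropyLoss()

x = torch.randn(256, 128)              # toy annotated dataset
y = torch.randint(0, 5, (256,))

for epoch in range(100):               # repeat until the loss is acceptable
    optimizer.zero_grad()
    loss = criterion(model(x), y)      # forward pass + loss
    loss.backward()                    # back-propagate
    optimizer.step()                   # slightly update the learnable parameters

# Inference: a single forward pass, no gradients, parameters no longer updated.
with torch.no_grad():
    prediction = model(torch.randn(1, 128)).argmax(dim=1)
```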
1.2. Digitalization in Architectural Scenario: Data Collection, Fusion, and Processes
This paper focuses on architectural scenarios, particularly CH, outlining how AI has been used in this field to support the documentation processes of existing buildings. Generally, different scenarios, for example remote sensing or medical imaging, lead to different representation scales, error tolerances, and digitalisation methods. The characteristics of architectural spaces bring specific challenges to practical applications:
- (1)
Architectural elements are large and complex: architectural elements like pillars and walls cannot be represented by merely one shot of photo or scan in close range. The dataset that describes the elements of a big volume will require relatively more computer resources for storage and processing.
- (2)
Unavoidable noise and shadows: Architectural space is not an ideal space of simple geometric shapes but a space of volumetric changes. Moreover, it is usually occupied by items that cast shadows (occlusions) in 3D. Therefore, digitalisation work with full coverage is costly and time-consuming.
- (3)
Objects of interest: the objects related to architectural elements, components, structures, and joints span multiple scales of geometric features. Some objects, like walls, require global information. Others, like cracks and disintegration, need millimetre-scale representation of the geometry. Highly decorated architectural elements, especially in the case of cultural heritage, are among the most challenging objects because they are easily confused with less important components or facilities.
- (4)
Recurrent monitoring in architectural spaces: The architectural scenario is close to human activity: from daily residential usage to massive production, architectural spaces vary in functional typology over time. Moreover, architecture is constantly threatened by environmental issues, anthropogenic damage, and continuous interventions, so data acquisition is a constant need for monitoring. The acquired data are used for object tracking and for detecting translation, deformation, twisting, chromatic alteration, etc.
Understanding the current state of a building is crucial for its construction and lifelong protection, and a reality-based digitalisation process is the first mandatory step in this understanding process. Documentation has always been a key topic in architectural and building scenarios, starting from the needs derived from construction and conservation practices for historical heritage.
The abundant bibliography of the last twenty years shows that research related to documentation in architectural and building scenarios covers all the phases of digitalisation, from field survey methods and data elaboration procedures to representation and fruition methods. The high demands of as-is documentation and the complicated semantic structures of architectural elements and facilities have stimulated both research and practical applications.
Regular maintenance, restoration, and investigation missions yield valuable data over time, especially in Cultural Heritage (CH) scenarios. However, the long intervals between documentation updates often mean that measurement and representation technologies become outdated: the traditional manual methods and paper-based drawings of 30 years ago are being replaced today by digital reality-capturing techniques, enhancing the modern documentation process. Architecture has been recorded in various ways, such as writings and drawings, over the past centuries. From the 19th century, photography and the phonautograph emerged, producing photo, audio, and video records. Yet 2D photography alone is not enough for architectural projects, which are better supported with 3D information [7]. Photogrammetry and laser scanning techniques have given 3D a boost and have become the most favoured techniques for reality-capturing activities in architectural scenarios. In particular, image-based 3D reconstruction techniques (formerly called photogrammetry) have become very popular compared with laser scanning methods in the last few years because of their lower cost and flexibility. Photogrammetry is also a competitor to modern range-based mobile mapping systems: even if not yet largely tested in the architectural and construction field, it can be used in a mobile modality and gives quite good results in terms of reliability. Achille, Fassi and Fregonese, 2012 addressed their practice in the Milan Cathedral [8], and Perfetti, Polari and Fassi, 2017 even presented fisheye photogrammetry in narrow spaces [9]. The photogrammetric technique can generate spatial information from paired images and can be applied to IRT [10] and multispectral [11] images.
Though not yet widely applied in the architectural digitalisation field, RGB-D is quite relevant in AI applications and must be mentioned for its capability to integrate real-time 3D data and images [12]. This 2D representation of 3D has become an object of research in computer graphics and computer vision [13,14]. Additionally, in 3D geometric acquisition, other types of investigation can be integrated with geometric ones to better understand the heritage and its characteristics. As presented by Adamopoulos and Rinaudo, 2021 [15], infrared thermography (IRT), multispectral imaging, ground-penetrating radar (GPR), and active elastic wave techniques (sonic and ultrasonic sensing) are investigations that surpass the limits of the human eye, helping to identify pathological issues, material differences, the presence of damage, or changes in the material's physical properties behind the surface [16].
In addition to digitization, the data utilization methods, i.e., how the acquired data are made available to operators to be read, processed, and interpreted, are equally important. This is probably the most critical aspect nowadays, especially in architecture and cultural heritage. As a matter of fact, there are no standard methods or technologies for data storage and data sharing nor efficient and reliable automatic methodologies to aid data interpretation and processing.
It is precisely here that AI techniques can help in the future. Dealing with architectural topics means dealing with a large amount of data, and the manual approach has become time-consuming and unacceptable. Sharper techniques are needed to reduce human intervention and facilitate greater objectivity in analysis and results. Recent years have seen many attempts to integrate AI into architectural digitalisation scenarios. As can be expected, AI tools were not initially developed for this particular purpose; as usual, the needs emerge from practical scenes. AI algorithms used in other fields are mainly tested by applying specific adjustments to satisfy requirements such as inference speed, wide-ranging categories, noise robustness, and training efficiency.
For example, AI tasks that perform automatic semantic classification and segmentation can make data more accessible to investigators and monitoring operators. For images, a wide variety of DL models have been developed that perform classification, such as VGG16 [17], Inception [18], and Residual Networks [19]. More complicated architectures were built to deal with object detection tasks, such as YOLO [6] and Fast R-CNN [7], and with semantic segmentation, such as FCN [20]. Multiple methods have been developed to process point clouds, like Random Forest [9,10,11,12], Support Vector Machine (SVM) [13], PointNet [14,15], and other machine learning methods in post-processing.
Many researchers have been testing the AI processing of 2D and 3D data, and recently, methods that combine multiple techniques have also emerged. These methods were tested in a practical scene and adjusted to practical needs and available data acquisition solutions. Consequently, the updated techniques are expected to accelerate the construction and preservation activities. They could provide reliable assistance in assessing the general situation of the scene and finding objects with locations, contributing to 3D data interpretation and big-data management.
1.3. Data Types and AI Methods
1.3.1. 2D Data
2D photographs are the predominant kind of data used in the documentation process, which is currently undergoing digitisation. The photo has been prevalent in the sphere of "architectural reading", documentation, and representation, gradually replacing, to some extent, the traditional practice of sketching for documentation purposes. Since its inception in 1851, photogrammetric surveying has used photographs and techniques like triangulation to obtain 3D measurements. Over time, it has evolved into a crucial instrument for driving progress and development in this field. Currently, photographs serve as the foundation for photogrammetric surveys and are commonly used to integrate additional data, such as scanner point data. The major types of photos include high-resolution digital frame images captured by traditional photography devices and fisheye images obtained from inexpensive portable sensors, such as those mounted on drones. Additionally, panoramic images are commonly used to add colour to static and mobile scanner data.
2D data processing has long been a hot topic in the computer vision field. The first well-known neural network application for classification tasks dates to 1989 [6]. The researchers built a large database (MNIST) of handwritten samples of the 10 digits and a network with a limited number of layers to recognise which digit was written. Afterwards, datasets themselves became a research topic, and dataset preparation became the basis of other research [21,22,23]. Depending on the DL task, the growing sample sizes and increasing numbers of categories stimulated related research on topics such as dataset scale (the number of categories and instances), the semantic hierarchy of classes, accuracy (reliability of the annotation), and diversity (appearance, positions, viewpoints, and so on). Works have also addressed the distinction between “things” (objects with a well-defined shape, e.g., cat, person) and “stuff” (amorphous regions, e.g., sky, forest); this research emphasised the importance of stuff and discussed its contextual correlation to things [24].
Many image classification models have emerged over the years. After LeNet, typical hand-crafted networks with limited layer depth were developed, such as VGG [17] and the Inception network [18]. The Residual Network, known as ResNet [19], came out in 2015: it introduced the concept of the residual connection, solved the problem of vanishing gradients, and allowed layer depth to increase in later models.
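A minimal sketch of a residual (skip) connection, the mechanism ResNet uses to ease gradient flow in deep networks, is given below; the channel sizes are illustrative:

```python
# Minimal sketch of a residual (skip) connection as popularised by ResNet;
# channel sizes are illustrative, not a full ResNet.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # identity shortcut eases gradient flow

block = ResidualBlock(16)
print(block(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 16, 32, 32])
```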
At the same time, the object detection task was attracting attention from researchers (Figure 2). Classification merely labels the whole image, whereas object detection attaches semantic meanings to pixels or point locations, hence providing more useful information. An initial application for face detection [25] applied binary classification on sliding windows. A later trend integrated hand-engineered feature extractors [26,27,28] with machine learning methods. The solutions for object detection later developed into two mainstream approaches: the two-step approach, which first finds where objects could be and then classifies them, and the one-step approach, which integrates localisation and classification in one pass. In recent years, approaches inspired by the NLP field have also come into sight.
In 2013, R-CNN was introduced by Ross Girshick et al. It innovatively applied a selective search algorithm, which extracts regions of interest (RoI) for later classification. The selective search turned out to be time-consuming. To speed it up, in Fast R-CNN [29] the input image is first processed by a neural network; afterwards, the feature map is cropped by region and passed through the prediction head. Faster R-CNN [30] made the process faster by introducing a separate network to predict the region proposals. It uses the concept of anchors, which predefine the bounding boxes, and reshapes them using RoI-based methods before the output goes to the prediction head. Algorithms like Faster R-CNN are considered typical two-step approaches: they first make region proposals and then classify only the proposed crops of the image using convolutional networks.
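As an example of the two-step family in practice, the following hedged sketch runs torchvision's pretrained Faster R-CNN on a single image (assuming a recent torchvision release; the image path is a placeholder):

```python
# Hedged sketch: running a pretrained two-stage detector (Faster R-CNN) with
# torchvision; "facade.jpg" is a placeholder image path.
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = convert_image_dtype(read_image("facade.jpg"), torch.float)  # (C, H, W) in [0, 1]
with torch.no_grad():
    output = model([image])[0]          # region proposals + per-region classification

for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
    if score > 0.5:
        print(label.item(), float(score), box.tolist())
```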
Unlike two-stage object detectors, one-stage detectors use a fully convolutional approach in which the network can find all objects within an image in one pass. A famous example is YOLO (You Only Look Once) [31]. It is refreshingly simple: a single convolutional network outputs bounding boxes and classification probabilities at the same time. In contrast, SSD (Single-Shot Detector) [32] uses additional feature layers to predict the boxes and category confidences, allowing predictions at multiple scales (Figure 3). CornerNet [33] introduced an innovative approach that, instead of defining a bounding box by the x and y coordinates of the box centre with its height and width, uses a pair of diagonal key points. CenterNet [34], presented by Duan et al., 2019, detects objects using centre key points and corners [33]. They addressed a common defect of all one-stage approaches: without RoI extraction, the networks cannot pay attention to the internal information within the cropped region.
DETR (DEtection TRansformer) [4] uses a transformer network to perform both feature extraction and object detection simultaneously. The transformer network was initially used for sequence modelling in natural language processing tasks, but it has also recently been applied in computer vision. DETR uses the attention mechanism to adaptively focus on the important parts of an object without requiring prior bounding boxes or anchor boxes. Its performance on the COCO dataset can compete with that of an optimised Faster R-CNN. Later transformer-based approaches can be seen in [34,35,36]; they further narrow the gap between NLP and CV.
Currently, the available object detection architectures differ in terms of approach (one-step, two-step, transformer), convolutional feature extractor (VGG, Inception, ResNet), and detection head. The requirements for training time and computational resources vary, as do the performances. Huang et al., 2017 [37] discussed the speed and accuracy trade-offs. Later detection models surpass the performance of previous ones, but some facts remain: without region proposals, one-step detectors require fewer computational resources and are hence relatively faster. Later approaches (e.g. CenterNet) further accelerate the process by avoiding Non-Maximum Suppression (NMS). However, in terms of overall mean Average Precision (mAP), one-step approaches can hardly catch up with two-step approaches.
In practical architectural scenarios, the main constraint on the detector is the computational resource available during the in-situ data acquisition process, since ex-situ post-processing (e.g., 3D reconstruction) is supported with ample time and computational resources. In this case, one-stage detectors like YOLO are the preferred choice. As for CornerNet and the later CenterNet, their performance on irregular shapes remains a critical issue [34]. DETR is a recent and promising solution, but its complicated training process [4] makes it unsuitable for case-wise application in architectural fields.
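A hedged sketch of a one-stage detector running on a single acquisition frame is shown below; it assumes the ultralytics Python package and a generic pretrained checkpoint, and the image path and confidence threshold are illustrative:

```python
# Hedged sketch: one-stage, near real-time detection during acquisition,
# assuming the ultralytics package; paths and thresholds are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                      # small checkpoint suited to on-site hardware
results = model.predict("site_frame.jpg", conf=0.5)

for box in results[0].boxes:                    # one forward pass yields boxes + classes
    print(results[0].names[int(box.cls)], float(box.conf), box.xyxy.tolist())
```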
Object detection architectures are the basis of many downstream applications, such as semantic segmentation, human key point detection, and image captioning. In 2017, Kaiming He et al. introduced Mask R-CNN [38] as an extension of Faster R-CNN. They added a branch to the network that predicts each object's segmentation mask in addition to its class and bounding box. This allowed the model to perform object detection and instance segmentation with state-of-the-art accuracy.
1.3.2. 3D Data
Currently, 3D data serves as the foundation for digitalization as it is crucial to accurately and comprehensively describe the geometry of objects, particularly in the fields of architecture and cultural heritage.
The primary digital data result is 3D point cloud data, comprising a collection of 3D points depicting an object's outer surface geometry or environment. A point cloud obtained from a scanner is typically generated in real time, but a point cloud obtained from photogrammetry may require more time to be created. Many processes, including registration, referencing, filtering, down-sampling, and feature extraction, are often carried out during post-processing and are only partially automated. Specifically, extracting features required for a subsequent modelling or architectural restitution phase and data interpretation still rely on manual operations performed by individual operators. Moreover, one of the main challenges of point cloud data is its high dimensionality, which makes it difficult and time-consuming to manually process and analyse a large amount of data representing complex geometry.
For these reasons, segmentation and classification are the two most discussed and required AI applications because, based on their results, point clouds can be successfully exploited and better comprehended [39].
Great interest has also been shown in applying AI, especially DL approaches, to 3D point cloud data (Figure 4). In a supervised machine learning approach, semantic categories are learned from a manually annotated subset of data, which is used to train classification models such as SVM, RF, and NB. The input point cloud must pass through hand-crafted feature extractors that compute local and global neighbourhood features for individual points. After the model is trained, it is used to perform the semantic classification of all the points in the dataset. With proper features, machine learning models can be effective in specific tasks even with limited annotated data.
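A minimal sketch of this supervised pipeline, using covariance (eigenvalue-based) neighbourhood features and a Random Forest from scikit-learn, is given below; the feature set, neighbourhood size, and random data are illustrative assumptions:

```python
# Minimal sketch of the supervised ML pipeline described above: hand-crafted
# covariance (eigenvalue) features per point, then a Random Forest classifier.
# The feature set, neighbourhood size and random data are illustrative.
import numpy as np
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestClassifier

def covariance_features(points: np.ndarray, k: int = 20) -> np.ndarray:
    """Per-point linearity / planarity / sphericity from k-nearest neighbours."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    feats = []
    for neighbours in points[idx]:                     # (k, 3) neighbourhood
        cov = np.cov(neighbours.T)
        l3, l2, l1 = np.sort(np.linalg.eigvalsh(cov))  # ascending eigenvalues
        l1 = max(l1, 1e-12)
        feats.append([(l1 - l2) / l1, (l2 - l3) / l1, l3 / l1])
    return np.asarray(feats)

points = np.random.rand(2000, 3)                  # placeholder point cloud (N, 3)
labels = np.random.randint(0, 3, 2000)            # placeholder manual annotations

X = covariance_features(points)
clf = RandomForestClassifier(n_estimators=100).fit(X[:1500], labels[:1500])
predicted = clf.predict(X[1500:])                 # classify the remaining points
```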
Among the different DL approaches, feature learning methods were initially divided into point-based and tree-based approaches [40]. The first directly takes the raw point cloud as input for training the DL network. The second employs a k-dimensional tree (Kd-tree) structure to transform the point cloud into a regular representation (a linear representation afforded by the group action), which is then fed into DL models. The concept of using representatives saves computational resources.
In 2017, PointNet [3] was introduced by Charles et al. It is a DL architecture that directly learns point-wise features from unordered point clouds without the need for pre-processing such as voxelisation or projection onto 2D grids, and it has shown impressive performance in point cloud classification tasks. Notably, the authors added a data-dependent spatial transformer network to canonicalise the data. However, PointNet could not capture local structures, which limited its ability to recognise fine-grained patterns and to generalise to large and complex scenes. PointNet++ [15] upgraded the previous network to address this defect by introducing a hierarchical structure. The new architecture is composed of several set abstraction levels; these abstraction layers aggregate multi-scale information according to local point densities, making the learning process efficient and robust.
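The permutation-invariance idea behind PointNet can be condensed into a shared per-point MLP followed by a symmetric max pooling, as in the following simplified sketch (layer widths are illustrative and the spatial transformer is omitted):

```python
# Minimal sketch of the PointNet idea: a shared per-point MLP followed by a
# symmetric max pooling, so the global feature is invariant to point order.
# Layer widths are illustrative; the spatial transformer is omitted.
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.point_mlp = nn.Sequential(     # applied independently to every point
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU())
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, points):              # points: (batch, N, 3), unordered
        per_point = self.point_mlp(points)                # (batch, N, 256)
        global_feat = per_point.max(dim=1).values         # symmetric -> order invariant
        return self.classifier(global_feat)

model = TinyPointNet()
cloud = torch.rand(2, 1024, 3)
shuffled = cloud[:, torch.randperm(1024), :]
print(torch.allclose(model(cloud), model(shuffled), atol=1e-5))  # True: same prediction
```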
PointCNN [41] by Li et al., 2018, proposed an X-transformation to address the issues of point-wise feature weighting and the permutation of unordered point clouds. Another point-wise convolution approach is inspired by image-based convolution: KPConv [42] by Thomas et al., 2019, uses Kernel Point Convolutions to process point clouds without an intermediate representation and has shown impressive performance in point cloud segmentation tasks.
The attention-transformer approach is also used for 3D recognition. PATs (Point Attention Transformers) [43] represent points by position and neighbourhood and learn features through Multi-Layer Perceptrons (MLP). Later developments continued with PCT (Point Cloud Transformer) [44], Point Transformer [45], 3DETR [46], and Uni3DETR [47], which have also achieved state-of-the-art results.
DGCNN [2] by Wang et al., 2019, or Dynamic Graph CNN, is also inspired by PointNet. It uses a graph neural network to model the local geometric structures of point clouds and dynamically constructs a graph based on the spatial relationships between points. It then applies a series of graph convolutional layers to learn hierarchical representations of the input point cloud, followed by a max pooling operation to obtain a global feature vector. Finally, a fully connected network is used to classify the point cloud.
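The dynamic graph construction at the heart of DGCNN can be sketched as a k-nearest-neighbour search followed by the assembly of edge features, as below; the feature dimensions and neighbourhood size are illustrative:

```python
# Minimal sketch of the dynamic graph idea behind DGCNN: a k-nearest-neighbour
# graph is built from the current point features, and edge features combine
# each point with the offsets to its neighbours. Sizes are illustrative.
import torch

def knn_edge_features(x: torch.Tensor, k: int = 8) -> torch.Tensor:
    """x: (N, C) point features -> (N, k, 2C) edge features [x_i, x_j - x_i]."""
    dist = torch.cdist(x, x)                                  # (N, N) pairwise distances
    idx = dist.topk(k + 1, largest=False).indices[:, 1:]      # k neighbours, skip self
    neighbours = x[idx]                                        # (N, k, C)
    centre = x.unsqueeze(1).expand(-1, k, -1)                  # (N, k, C)
    return torch.cat([centre, neighbours - centre], dim=-1)

points = torch.rand(1024, 3)
edges = knn_edge_features(points)
print(edges.shape)   # torch.Size([1024, 8, 6])
```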
Zhou and Tuzel, 2017, proposed VoxelNet [48], which eliminates manual feature engineering of the point cloud. It unifies a deep convolutional feature extractor and a region proposal network into a single-stage, end-to-end DL network by using the Voxel Feature Encoding (VFE) concept. The encoder is widely applied in many later models, like SECOND [49], Voxel-FPN [50], and HVNet [51].
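The voxelisation step that such pipelines rely on can be sketched as a simple binning of points into a regular grid, as in the following illustrative example (the voxel size and the random scene are placeholders):

```python
# Minimal sketch of the voxelisation step used by VoxelNet-style pipelines:
# points are binned into a regular grid so each occupied voxel holds a small
# set of points for feature encoding. Voxel size and scene are placeholders.
import numpy as np
from collections import defaultdict

def voxelize(points: np.ndarray, voxel_size: float = 0.2):
    """points: (N, 3) -> dict mapping integer voxel indices to member points."""
    indices = np.floor(points / voxel_size).astype(np.int32)
    voxels = defaultdict(list)
    for pt, idx in zip(points, indices):
        voxels[tuple(idx)].append(pt)
    return {k: np.stack(v) for k, v in voxels.items()}

cloud = np.random.rand(5000, 3) * 10.0     # placeholder 10 m x 10 m x 10 m scene
voxels = voxelize(cloud)
print(len(voxels), "occupied voxels")
```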
1.3.3. RGB-D Data
RGB-D data are 2D images in which each pixel includes colour information (RGB) plus corresponding depth information. This data type can be acquired using depth sensors, such as structured light and Time-of-Flight (ToF) cameras. RGB-D data can also be obtained from the photogrammetric matching process, which estimates depth maps from oriented RGB images in space. These algorithms compare the disparities between images taken from different angles to calculate depth. An example is the Semi-Global Block Matching (StereoSGBM) algorithm: this OpenCV implementation uses semi-global matching (based on the work of StereoSGM [52]) on stereo images to produce accurate depth maps.
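A hedged sketch of the OpenCV StereoSGBM workflow, from a rectified stereo pair to a disparity and depth map, is shown below; the file names and calibration values are placeholders:

```python
# Hedged sketch of OpenCV's StereoSGBM matcher producing a disparity/depth map
# from a rectified stereo pair; file names and calibration values are placeholders.
import cv2
import numpy as np

left = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,          # must be divisible by 16
    blockSize=5,
    P1=8 * 5 * 5,                # smoothness penalties
    P2=32 * 5 * 5)

disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point output

focal_px, baseline_m = 1200.0, 0.12          # placeholder calibration
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = focal_px * baseline_m / disparity[valid]   # Z = f * B / d
```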
Thus, RGB-D data are used mainly for 3D interior reconstruction, especially in the semantic VSLAM direction [53]. However, the above-mentioned cameras are sensitive to lighting, can produce noisy, low-resolution models, and have a limited field of view and range. This type of data is mentioned here because of its growing usage and because it works better for 3D object detection than other formats [14].
In deep learning, convolutional networks were introduced to compute matching fields between two images [54]. Afterwards, research on monocular depth estimation addressed the high costs and complexity associated with acquiring data from multiple views. Algorithms have been developed to infer detailed 3D structures from single still images using Markov Random Fields (MRF) [55] and convolutional neural networks [56]. End-to-end tools have been developed to synthesise novel views directly from images; one example is the Deep3D model [57], which uses a convolutional neural network to generate the corresponding right view from an input source image.
Methods used to classify and localise objects using RGB-D data can be categorised (Figure 4) as follows:
- (1)
View-based: Some researchers make use of RGB-D data by considering the depth map as an additional channel [58]. Others process the 3D data as front-view images [59,60] or project the 3D information to a bird's-eye view [61]. Depth R-CNN [62] was inspired by the 2D object detection model R-CNN and introduced geocentric embedding to make better use of the depth maps. Similarly, 3D-SSD [63] is a 3D generalisation of the SSD framework.
- (2)
Frustum-based (2D-driven 3D): Frustum-PointNet [64], an extension of PointNet, processes RGB-D information. It extracts the 3D bounding frustum of an object by projecting 2D bounding boxes from image detectors into 3D space (a minimal back-projection sketch is given after this list). Then, within the trimmed 3D space, segmentation and box regression are performed consecutively using variants of PointNet. It addressed some previous limitations, such as local feature handling, rotation sensitivity, and feature extraction, using a hierarchical neural network architecture and a graph convolutional network. Frustum-PointNet was the breakthrough method at the time and inspired later models such as YoloV3 & F-PointNet [65] and Frustum VoxNet [66].
- (3)
Descriptor-based: Geometric descriptors for 3D object detection have been introduced. COG [67] links the 2D appearance and the 3D pose of object categories. Ren et al. also introduced latent support surfaces [68], whose locations can explain shape variation in 3D.
- (4)
Convolution-based: Sliding Shapes [5] applies a 3D sliding window to detect objects directly in 3D. It was later updated in Deep Sliding Shapes [69], which introduced a 3D region proposal network to speed up the computation. Qi et al. presented VoteNet [70], which detects objects directly in the point cloud. It addressed the challenge that the centroid of a 3D object can be far from its surfaces by applying Hough voting.
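The back-projection of a 2D bounding box into a 3D search frustum, referenced in item (2) above, can be sketched as follows; the camera intrinsics, box coordinates, and depth range are placeholder values:

```python
# Minimal sketch of the frustum idea referenced in item (2): a 2D bounding box
# is back-projected into 3D using the camera intrinsics and a depth range,
# giving a 3D search frustum. Intrinsics and box values are placeholders.
import numpy as np

def box_to_frustum(box, K, depth_range=(0.5, 10.0)):
    """box: (u_min, v_min, u_max, v_max) pixels -> (8, 3) frustum corner points."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u_min, v_min, u_max, v_max = box
    corners = []
    for z in depth_range:                          # near and far planes
        for u, v in [(u_min, v_min), (u_max, v_min), (u_max, v_max), (u_min, v_max)]:
            corners.append([(u - cx) * z / fx, (v - cy) * z / fy, z])   # pinhole model
    return np.array(corners)

K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])   # placeholder intrinsics
frustum = box_to_frustum((400, 200, 700, 500), K)
print(frustum.shape)   # (8, 3)
```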
3. Conclusion
This paper discusses numerous state-of-the-art works that use AI methods to support reality capture applications, mainly in architecture and cultural heritage. In general, reality capturing is an activity that uses multiple technologies to digitalise 3D environments. The different acquisition approaches and data formats largely shape the later elaboration process, the data interpretation and use, and the methods for integrating AI. 2D and 3D data each have their own advantages in architectural scenarios and often need to complement each other to achieve a more comprehensive, multiscale degree of knowledge. Point clouds are suitable for representing accurate geometric information of large-scale elements, like architectural elements and large 3D environments. Things with small geometric features, like cracks and nails, are more recognisable in high-resolution images.
Thanks to the advances in deep learning networks for image analysis, quick and autonomous assessment can be implemented in the data acquisition phase, which benefits photogrammetry, RGB-D-based approaches, and 3D point cloud processing. The literature shows that DL methods are more usable for 2D data, while classical machine learning is still preferred in many 3D cases. This can be explained by the shortage of point cloud datasets, considering that their collection, processing, and annotation require much more effort than photos in terms of time, cost, and availability.
Vigorous research on 3D tasks allows increasingly faster performance on 3D point cloud classification and accurate localisation. Although common elements of interest with regular shapes can be recognised well, DL integration in the reality-capturing field cannot yet be generalised to real practical scenarios.
Research has shown that integrating AI's understanding of both 2D and 3D data can significantly accelerate 3D reconstruction processes, especially in photogrammetry and the 3D semantic classification of point cloud models. Future research aims to enhance this integration for detailed, multiscale documentation of buildings and their conservation state. Practical tests are needed to evaluate the effectiveness of RGB-D-based approaches in architectural scenarios. Standardised methods for data fusion and related datasets are in demand. Further work is needed to build the bridge from complicated investigation and documentation activities to the Deep Learning techniques that reduce manual intervention in 2D images and 3D point clouds.