Semantic segmentation of point clouds of ancient buildings based on weak supervision


Semantic segmentation of point clouds of ancient buildings plays an important role in Historical Building Information Modelling (HBIM). Annotating point clouds of ancient architecture requires specialist knowledge and a heavy workload, which greatly restricts the application of point cloud semantic segmentation in this field. This paper therefore studies a weakly supervised method for semantic segmentation of point clouds of ancient architecture. To address the small differences between classes of ancient architectural components, we introduce a self-attention mechanism that can effectively distinguish similar components within a neighbourhood. We also examine the shortcomings of the positional encoding in the baseline and construct a high-precision point cloud semantic segmentation network for ancient buildings, the Semantic Query Network based on Dual Local Attention (SQN-DLA). Using only 0.1% of the annotations, the mean Intersection over Union (mIoU) reaches 66.02% on our self-built dataset and 58.03% on the Architectural Cultural Heritage (ArCH) dataset, improvements of 3.51% and 3.91%, respectively, over the baseline.


Architectural heritage has a long history and rich typology, with unique cultural connotations and historical value. However, long-term use and the effects of the natural environment, such as wood decay, weathering, and fire, have put many heritage buildings at risk of disappearing [1]. Timely preservation and restoration of architectural heritage is therefore essential to maintaining the continuity of human civilisation. Point cloud data provide geometric and radiometric information about architectural heritage, such as coordinates, colours and reflectance intensities, and accurately capture its morphology and structure, so they have become one of the most commonly used data sources for architectural heritage conservation [2]. However, raw point cloud data cannot directly provide semantic-level information. How to semantically segment architectural heritage and obtain the component information that underpins related research, such as Scan-to-Building Information Modeling (BIM) and disease monitoring of architectural heritage [3,4,5], has therefore become a research hotspot.

In recent years, the development of consumer-grade depth sensors (e.g., Light Detection and Ranging (LiDAR) [6, 7]) and advances in computer vision algorithms have made 3D data collection easier and cheaper. Impressive progress has been made in using point cloud semantic segmentation for tasks such as 3D reconstruction, autonomous driving and recognition and classification [8,9,10]. Currently, most point cloud semantic segmentation methods are based on fully supervised deep learning [11,12,13], which requires a large amount of labelled data to train the network, and whose segmentation accuracy depends on the number and quality of the labels. However, unlike 2D images, 3D point clouds are expensive to label and sometimes require expertise [8, 10], which consumes a great deal of manpower and time.

Because point cloud data are so large, some scholars have started to investigate weakly supervised methods for point cloud semantic segmentation in order to reduce the cost of data labelling [14, 15]. Weakly supervised methods train models with annotations on only part of the point cloud, or generate pseudo-labels for the unlabelled points on the basis of the partially annotated points. Semantic regions in large-scale 3D point clouds are thus segmented effectively with a limited number of manually annotated labels, with the aim of matching the effect of fully supervised learning [16]. Point clouds of ancient buildings in large scenes generally contain millions or even tens of millions of points, so labelling them with weakly supervised approaches is clearly more convenient. However, there are relatively few studies on weakly supervised semantic segmentation in the field of architectural heritage [17], a gap remains between weakly supervised and fully supervised methods, and how to improve the segmentation effect through effective labelling is still a great challenge.

For these reasons, we set out to develop a weakly supervised deep learning approach for semantic segmentation of point clouds of ancient buildings. We explore the possibility of learning point cloud segmentation from only a limited number of task-specific labelled points of ancient buildings. We build on the state-of-the-art Semantic Query Network (SQN) [9], combined with a self-attention mechanism and a simple positional encoding, to learn point cloud features.

The contributions of this paper are threefold:

  1. To the best of our knowledge, we are the first to train models for segmenting point clouds of ancient buildings using very little labelled data, which greatly reduces the cost in manpower and time.

  2. To improve the accuracy of segmentation of ancient buildings, we change the positional encoding to remove interfering information and so obtain accurate positional features of the components of ancient buildings, and we introduce a self-attention mechanism to enhance the network's ability to discriminate similar components in different neighbourhoods of ancient buildings.

  3. Our method not only achieves the best segmentation on our Chinese ancient architecture dataset, but also achieves state-of-the-art results on the Western architectural heritage dataset ArCH [18], which shows that our network generalises well across ancient architecture. In addition, our method outperforms some fully supervised methods.

The rest of the paper is organised as follows. Sect. "Related work" presents the related work on point cloud semantic segmentation in three parts: architectural heritage semantic segmentation, point cloud semantic segmentation, and weakly supervised semantic segmentation. Sect. "Methodology" describes the SQN network, our local spatial coding and self-attention modules, and the overall architecture of our weakly supervised segmentation network. Sect. "Experiments" presents the self-constructed dataset, the comparative analysis and a discussion of the results. Finally, Sect. "Conclusion and outlook" draws conclusions and discusses future directions for this research field.

Related work

Related work on semantic segmentation of architectural heritage

Currently, there are two main approaches to semantic segmentation of 3D point clouds of architectural heritage: supervised machine learning and deep learning. The machine learning classifiers used for semantic segmentation in the architectural heritage field are dominated by random forests [19,20,21,22], together with multi-class logistic regression classifiers [23]. Although such classifiers are computationally efficient, they are prone to producing noise, which affects classification accuracy. In contrast, deep learning methods can be applied to different scenarios. Following the recent breakthroughs of deep learning in segmenting publicly available point cloud datasets [24, 25], some scholars have applied these methods to point cloud data of ancient buildings [26,27,28,29]. Because the elements of ancient architecture are complex and have limited repeatability across different buildings, some scholars have investigated the use of synthetic datasets for training [30, 31] and have made relevant progress. However, these methods consume many human resources to annotate point cloud data. Moreover, annotating point cloud data of cultural heritage requires specific domain knowledge, and non-professional annotation is prone to errors. In addition, the semantic segmentation datasets currently available for point clouds of ancient buildings are limited, which hinders research on deep learning segmentation of such point clouds.

Related work on semantic segmentation of point clouds

With the rise of deep learning, more and more methods for point cloud segmentation have appeared. They can be classified into three categories: fully supervised, weakly supervised and unsupervised. Fully supervised methods require all of the training point cloud data to be labelled, which incurs a huge cost in manpower and time. In recent years, research on fully supervised networks has focused mainly on feature extractors [32,33,34,35,36], and complex feature extractors make these networks demand significant computational resources. Unsupervised networks produce very inaccurate segmentation because the data are unlabelled and the types of segments cannot be constrained [37, 38]. Weak supervision has attracted growing attention in recent years, and scholars have found that it can achieve accuracy comparable to full supervision at very little labelling cost. In this paper, our method even surpasses some fully supervised methods. Weak supervision thus avoids both the large annotation cost of full supervision and the overly fine-grained or inaccurate segmentation of unsupervised methods.

Related work on weak supervision

In recent years, weakly supervised semantic segmentation of point clouds has developed rapidly. Xie et al. [39] designed PointContrast to pre-train networks on point clouds of complex scenes, then fine-tuned the pre-trained weights and applied them to downstream tasks, which improved segmentation performance. Hou et al. [40] divided the scene point cloud into multiple regions, performed contrastive learning in each region separately, and aggregated the final loss. Both introduce a small number of labelled points on top of unsupervised pre-training, but the whole scene must be input directly so as not to lose contextual information, which requires huge computational effort. Zhang et al. [41] pre-train on unprocessed depth maps, which avoids this drawback. Furthermore, pre-training PointContrast directly on ShapeNet [42] does not improve the performance of downstream detection and segmentation tasks, because of the gap between single-object classification on ShapeNet and multi-object localisation on real datasets. To bridge this gap, Rao et al. [43] synthesised pseudo-scenes from multiple objects to construct training data that contributes to scene-level understanding, which in turn improves segmentation. Zhang et al. [44] constructed a pretext task, point cloud colourisation, that uses self-supervision to transfer prior knowledge learned from a large number of unlabelled point clouds to a weakly supervised network. Liu et al. [45] introduced an effective method for 3D scene understanding with limited labelled scenes.

Hu et al. [9] proposed SQN, which consists of two main components: (1) a point local feature extractor for learning features of different dimensions, and (2) a flexible point feature query network that collects as many relevant semantic features as possible for weakly supervised training. Compared with the previous networks, it only needs some randomly selected points as truth points and can then be trained directly, without a separate pre-training stage, which greatly reduces the cost. However, this network is not effective enough at segmenting building facades. Inspired by the success of the Dual Local Attention (DLA) network [13] in segmenting building facades, we improve SQN by introducing a self-attention mechanism based on it to enhance the network's segmentation effect on the inside of buildings.


Methodology

Ancient architecture scenes are large-scale scenes, so a network suited to large scenes must be selected for model training. Current segmentation networks generally adopt a complex positional encoding, which may mislead the network into updating the wrong parameters during training. Moreover, some neighbouring components of ancient buildings look similar from a local perspective, which makes them difficult for the model to distinguish. To address these two problems, we change the local spatial coding of SQN and introduce a self-attention mechanism into the local feature coding, which in turn improves the segmentation effect on ancient buildings.

Introduction to SQN

SQN is a relatively advanced weakly supervised network proposed in recent years. It can be applied to the segmentation of large scenes, which matches the large-scale ancient building scenes we segment. SQN consists of two main parts. The first is point local feature extraction, which is built from Local Feature Aggregation (LFA) and Random Sampling (RS). After several layers of feature extraction and random sampling, features of different dimensions are obtained. The second is the point feature query network. This part mainly queries the truth values around the obtained features of different dimensions and thus obtains the semantic feature information. Finally, all the features are concatenated and decoded by a Multilayer Perceptron (MLP).

In SQN, the local feature aggregation module consists of three parts: local spatial coding, attention pooling, and a dilated residual block. Local spatial coding increases the network's understanding of the geometric structure of the point cloud, attention pooling aggregates the features of neighbouring points, and the dilated residual block enlarges the receptive field. In the main workflow of the module, local spatial coding is first applied to the input points to obtain the features of the centre point and its neighbourhood points. Attention pooling then produces enhanced features, which are summed with the input points after they have passed through a shared MLP to give the output features, which then carry semantic information.

Local spatial coding

Local spatial coding mainly encodes, for a point cloud \(P = \{p_{1}, \ldots, p_{i}, \ldots, p_{N}\} \subset {\mathbb{R}}^{3}\), the position of each neighbourhood point \(p_{i}^{k}\) relative to its centre point \(p_{i}\). The position encoding of SQN is given by Eq. (1):

$$r_{i}^{k} = \mathrm{MLP}\left( p_{i} \oplus p_{i}^{k} \oplus \left( p_{i} - p_{i}^{k} \right) \oplus \left\| p_{i} - p_{i}^{k} \right\| \right)$$

where \(p_{i}\) and \(p_{i}^{k}\) are the xyz positions of the points, \(\oplus\) represents the concatenation operation, \(\left\| \cdot \right\|\) computes the Euclidean distance between the centre point and the neighbourhood point, and MLP is a linear transformation function. This helps the network to learn local features. However, such a complex way of encoding positions may interfere with the learning of local features. We therefore use only the relative position and the Euclidean distance between the centre point \(p_{i}\) and the neighbourhood point \(p_{i}^{k}\) for encoding, as in Eq. (2):

$$r_{i}^{k} = \mathrm{MLPs}\left( \left( p_{i} - p_{i}^{k} \right) \oplus \left\| p_{i} - p_{i}^{k} \right\| \right)$$

where the MLPs consist of two linear transformations and a Rectified Linear Unit (ReLU) activation function. We confirm the effectiveness of this encoding in the experimental part. Its structure is shown in Fig. 1.

Fig. 1 Positional encoding module
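
To make the difference between Eq. (1) and Eq. (2) concrete, the following is a minimal NumPy sketch of the two inputs to the encoding for one centre point and its K neighbours. The function names, the weight shapes, and the placement of the ReLU between the two linear layers are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def encode_inputs_full(center, neighbours):
    """Input of Eq. (1): p_i (+) p_i^k (+) (p_i - p_i^k) (+) ||p_i - p_i^k||, shape (K, 10)."""
    k = neighbours.shape[0]
    rel = center[None, :] - neighbours                  # relative positions, (K, 3)
    dist = np.linalg.norm(rel, axis=1, keepdims=True)   # Euclidean distances, (K, 1)
    tiled = np.broadcast_to(center, (k, 3))             # centre point repeated K times
    return np.concatenate([tiled, neighbours, rel, dist], axis=1)

def encode_inputs_relative(center, neighbours):
    """Input of Eq. (2): (p_i - p_i^k) (+) ||p_i - p_i^k||, shape (K, 4)."""
    rel = center[None, :] - neighbours
    dist = np.linalg.norm(rel, axis=1, keepdims=True)
    return np.concatenate([rel, dist], axis=1)

def mlps(x, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between (the ReLU placement is an assumption)."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

# Hypothetical example: K = 16 neighbours encoded into 8 feature channels.
rng = np.random.default_rng(0)
center = rng.normal(size=3)
neighbours = rng.normal(size=(16, 3))
w1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
r = mlps(encode_inputs_relative(center, neighbours), w1, b1, w2, b2)  # shape (16, 8)
```

One property visible in this sketch is that the input of Eq. (2) does not change when the whole neighbourhood is translated, whereas the input of Eq. (1) does, because it also carries the absolute coordinates.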

Self-attention block

The self-attention module is an important part of Dual Local Attention (DLA). Our main purpose in using self-attention is to make each point focus more on its positional relationship with the surrounding points. Our self-attention mechanism is shown in Fig. 2. We combine the positional features obtained from the positional encoding block with the features of the points. From a local point of view, this improves the differentiation of neighbouring similar components. Our self-attention block is expressed in Eq. (3):

$$F_{i} = \sum\limits_{k = 1}^{K} \mathrm{softmax}\left( \eta \left( \alpha \left( f_{i} \right) - \beta \left( f_{i}^{k} \right) + r_{i}^{k} \right) \right) \bullet \gamma \left( f_{i}^{k} + r_{i}^{k} \right)$$

Fig. 2 Self-attention mechanism

Let the feature of the centre point \(p_{i}\) be \(f_{i}\) and the feature of its neighbourhood point \(p_{i}^{k}\) be \(f_{i}^{k}\); \(r_{i}^{k}\) is the positional encoding feature from Eq. (2). \(\alpha\), \(\beta\) and \(\gamma\) each denote an MLP containing one linear layer, and \(\eta\) denotes a mapping MLP containing two linear layers and a ReLU activation function. \(\bullet\) denotes the Hadamard product. \(F_{i}\) denotes the output features, a new set of neighbourhood features that explicitly encode the local geometry of the centre point \(p_{i}\). The output features \(F_{i}\) are also processed by a batch normalisation operation with ReLU activation to improve the generalisation of the model.
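
As an illustration of Eq. (3), the sketch below evaluates the self-attention block for a single centre point in NumPy. It assumes that all features share one channel width C, that the softmax is normalised over the K neighbours separately for each channel, and that the ReLU in \(\eta\) sits between its two linear layers; `init_params` is a hypothetical helper with random weights, and the batch normalisation step is omitted.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(c, rng):
    """Hypothetical random weights: alpha, beta, gamma are single linear layers,
    eta is two linear layers (a ReLU is applied between them in the forward pass)."""
    def lin():
        return rng.normal(scale=0.1, size=(c, c)), np.zeros(c)
    return {"alpha": lin(), "beta": lin(), "gamma": lin(), "eta": lin() + lin()}

def self_attention(f_i, f_nb, r_nb, params):
    """Eq. (3) for one centre point.
    f_i: (C,) centre feature, f_nb: (K, C) neighbour features,
    r_nb: (K, C) positional encodings from Eq. (2). Returns F_i with shape (C,)."""
    wa, ba = params["alpha"]
    wb, bb = params["beta"]
    wg, bg = params["gamma"]
    w1, b1, w2, b2 = params["eta"]
    h = (f_i @ wa + ba)[None, :] - (f_nb @ wb + bb) + r_nb   # alpha(f_i) - beta(f_i^k) + r_i^k
    scores = np.maximum(h @ w1 + b1, 0.0) @ w2 + b2          # eta(...)
    weights = softmax(scores, axis=0)                        # normalise over the K neighbours
    values = (f_nb + r_nb) @ wg + bg                         # gamma(f_i^k + r_i^k)
    return (weights * values).sum(axis=0)                    # Hadamard product, then sum over k

rng = np.random.default_rng(0)
C, K = 8, 16
params = init_params(C, rng)
F_i = self_attention(rng.normal(size=C), rng.normal(size=(K, C)),
                     rng.normal(size=(K, C)), params)        # shape (8,)
```

Because the attention weights of each channel sum to one over the neighbourhood, the output is a per-channel weighted average of the transformed neighbour features.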

SQN-DLA network framework

The overall structure of DLA is shown in Fig. 3. Its inputs are the spatial information and the previously learnt features. The spatial information is encoded and first passes, together with the features, through the self-attention block to obtain the local attention features. It is then connected with the local attention features to form a residual, and passes through the attention pooling block to obtain the enhanced local attention features. The obtained features are summed with the original features to give the final spatial attention features. To obtain more feature information, our input features include colour information.

Fig. 3 DLA overall framework

The overall structure of our network is shown in Fig. 4. The network consists of two parts: the point local feature extractor and the point feature query network.

Fig. 4 Overall network structure

First, we bring the DLA residual module into point local feature extraction. The input consists of N points with xyz and rgb values. The input points are first passed through a fully connected layer that raises the feature dimension to 8. The encoder contains four layers, each consisting of a DLA module and a random sampling (RS) operation. After the four DLA and RS stages, the feature dimensions are (32, 128, 256, 512) and the numbers of points are (N/4, N/16, N/64, N/256), respectively.
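
The encoder sizes above can be checked with a short sketch. The helper names are hypothetical, and the value N = 40,960 is the number of points sampled per scene that is reported in the implementation details.

```python
import numpy as np

def encoder_shapes(n_points, dims=(32, 128, 256, 512), ratio=4):
    """Return (number of points, feature dimension) after each DLA + RS stage."""
    shapes = []
    n = n_points
    for d in dims:
        n //= ratio              # random sampling keeps one quarter of the points
        shapes.append((n, d))
    return shapes

def random_sample(xyz, feats, ratio, rng):
    """Random sampling (RS): keep a random 1/ratio subset of the points."""
    n_keep = xyz.shape[0] // ratio
    idx = rng.choice(xyz.shape[0], size=n_keep, replace=False)
    return xyz[idx], feats[idx]

print(encoder_shapes(40960))
# [(10240, 32), (2560, 128), (640, 256), (160, 512)]
```

With 40,960 input points, the deepest layer therefore holds only 160 points, each with a 512-dimensional feature.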

Then, in each layer, the xyz coordinates of the labelled points are used to query the xyz coordinates of their neighbouring points. The Euclidean distance between each query point and its neighbouring points is used as a weight, and the encoded features are trilinearly interpolated using this weight. Finally, the interpolated features are concatenated and fed into a series of MLPs to directly infer the semantic categories of the points. The predicted points can be used to generate weak labels.
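
The query step can be sketched as follows. This is an illustrative NumPy version that assumes each labelled point takes its three nearest neighbours in every encoder layer and weights their features by inverse Euclidean distance; the function names, the choice k = 3, and the brute-force distance computation are assumptions, not the authors' code.

```python
import numpy as np

def query_features(query_xyz, layer_xyz, layer_feats, k=3, eps=1e-8):
    """Interpolate layer features at the query (labelled) points.
    query_xyz: (Q, 3), layer_xyz: (M, 3), layer_feats: (M, C). Returns (Q, C)."""
    # Pairwise Euclidean distances between query points and layer points, (Q, M).
    d = np.linalg.norm(query_xyz[:, None, :] - layer_xyz[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]              # indices of the k nearest layer points
    nearest = np.take_along_axis(d, idx, axis=1)    # their distances, (Q, k)
    w = 1.0 / (nearest + eps)                       # closer points get larger weights
    w /= w.sum(axis=1, keepdims=True)
    return (w[:, :, None] * layer_feats[idx]).sum(axis=1)

def query_all_layers(query_xyz, layers, k=3):
    """Concatenate the interpolated features from every encoder layer."""
    return np.concatenate(
        [query_features(query_xyz, xyz, feats, k) for xyz, feats in layers], axis=1)

# Hypothetical toy layer: four points whose features are one-hot vectors.
layer_xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
layer_feats = np.eye(4)
query = np.array([[0.5, 0.5, 0.0]])   # equidistant from the first three points
print(query_features(query, layer_xyz, layer_feats))
```

In the toy example the query point is equally far from the first three layer points, so its interpolated feature is the mean of their three features.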



Experiments

This case study uses 3D laser scanning to collect complete mapping data of the Beiding Niangniang Temple (Beiding Goddess Temple). Located in Beijing, China, the temple was built between 1426 and 1435 during the Ming Dynasty, more than 500 years ago. It is one of the "five tops and eight temples" in the history of Beijing, as well as a landmark building on the central axis of Beijing. The main buildings in the Beiding Niangniang Temple include the Hall of Heavenly Kings, the East Annex Hall, the Hall of Niangniang, the Hall of Dongyue, and the Hall of Mountain Gate. The temple is a typical traditional Chinese wooden structure, and each building mainly consists of a roof, a body and a foundation. The main load-bearing framework consists of beams, columns, fang and other large wooden members. The roof consists of tiles and ridge animals on the ridge. Doors and windows are made of wood, and there are openwork patterns on their tops. The case we used is shown in Fig. 5 and contains four training scenes (room1, room2, the Hall of Niangniang, and the East Annex Hall), totalling 4.66 GB, and one test scene (the Hall of Heavenly Kings), 1.47 GB. In the training set, the Hall of Niangniang is five rooms wide and has a saddle and paraboloid roof covered with green glazed tiles with yellow edge trim. The side halls on either side of the Hall of Niangniang (room1 and room2) and the East Annex Hall have gabled roofs with simple tiles. The test scene (the Hall of Heavenly Kings) differs from the training scenes in its roof style: it is three rooms wide, with a saddle roof and simple tiles. On the front of the Hall of Heavenly Kings, there are four five-panel finial doors with four sill windows on each side of the door, and on the back, there are four five-panel finial doors. The different types of roofs are shown in Fig. 6.

The difficulty of annotating point clouds of ancient buildings lies in correctly identifying the building components. We annotated the component dataset in CloudCompare according to the actual condition of the scanned buildings. The format of the dataset is modelled on that of the Stanford 3D Indoor Spaces Dataset (S3DIS), with each point retaining x, y, z, r, g and b information. The dataset contains ten classes: clutter, door, fang, floor, roof, stair, stylobate, wall, window and column, as shown in Fig. 7. Table 1 shows the number of points in each class. The number of points varies greatly between classes. The classes with the most points are roof and wall, with about 40 million and 20 million points, respectively. The classes with the fewest points are stair, with 1.7 million points, and stylobate, with 3.3 million points, followed by column, with 7.6 million points, and fang, with 10 million points.

Fig. 5 Self-built data set

Fig. 6 Schematic diagrams of the three types of roofs: saddle and paraboloid roof (the Hall of Niangniang), saddle roof (the Hall of Heavenly Kings), and gabled roof (room1, room2, and the East Annex Hall)

Fig. 7 Schematic representation of the main categories in the dataset (clutter not shown)

Table 1 Number of points for the ten types of components in the self-built dataset

In addition, we validate our method on the public Architectural Cultural Heritage (ArCH) dataset. We select three small scenes ("SMV_1", "SMV_24", and "SMV_28") as our training data, totalling 521 MB, and the scene "B_SMV_chapel_27to35", which is 804 MB, as our test data.

Implementation details

Our network follows the dataset preprocessing approach used in RandLA-Net [12], where the source data are down-sampled on a grid with a spacing of 0.04 m. The network is then trained end-to-end on 0.1% randomly annotated points. All experiments were performed on an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz and an NVIDIA RTX 2080Ti GPU. During training, we randomly sampled 40,960 points from each scene as input. The number of epochs was set to 50, and the initial learning rate was set to 0.01, decreasing by 5% after each epoch. The number of nearest neighbour points K is set to 16, the batch size is 3, and the ratio of pseudo-labels is 0.2.
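
Two of these settings can be sketched in a few lines: the random selection of 0.1% of the points as annotated points, and the learning rate schedule. The function names are hypothetical, the 5% decrease is interpreted here as multiplying the learning rate by 0.95 after each epoch, and applying the 0.1% ratio to a single 40,960-point input is only an example of the ratio, not a statement about how the authors drew their labels.

```python
import numpy as np

def sample_weak_labels(num_points, ratio=0.001, seed=0):
    """Return a boolean mask marking a random `ratio` of the points as annotated."""
    rng = np.random.default_rng(seed)
    n_labelled = max(1, int(round(num_points * ratio)))
    mask = np.zeros(num_points, dtype=bool)
    mask[rng.choice(num_points, size=n_labelled, replace=False)] = True
    return mask

def learning_rate(epoch, initial=0.01, decay=0.95):
    """Learning rate after `epoch` epochs (assumed multiplicative 5% decay)."""
    return initial * decay ** epoch

mask = sample_weak_labels(40960)
print(mask.sum())            # number of annotated points among 40,960
print(learning_rate(1))      # learning rate after the first epoch
```

At a ratio of 0.1%, only about 41 of 40,960 points carry a label, which illustrates how sparse the supervision is.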

Competing methods and comparisons

Tested on the self-built dataset. To demonstrate the effectiveness of our method, we compare it with networks that have been popular in recent years, including the fully supervised semantic segmentation methods BAAF [46] and RandLA-Net [12] and the weakly supervised semantic segmentation methods PSD [47] and SQN [9]. These networks are chosen because they are all suitable for semantic segmentation of point clouds in large scenes and can therefore be compared fairly with our approach. Table 2 shows that our method has the best segmentation results on the self-constructed dataset, ahead of both the fully supervised and the weakly supervised networks. The mIoU of our network reaches 66.02% and the overall accuracy reaches 83.06%. Relative to SQN, our method improves mIoU by 3.51% and overall accuracy by 1.58%. The qualitative semantic segmentation results are shown in Fig. 8. Fang and roof joints, and adjacent door and window structures, look very similar from a local point of view, so they are segmented less well. However, our segmented windows cover a larger area than those of the original network (blue and green lines in Fig. 8, respectively). Our segmentation is also markedly better on the floor (red line in Fig. 8). Our network is able to segment ancient buildings with relative accuracy. In addition, Table 3 shows the quantitative evaluation results for each class. Our method outperforms the competing methods in many classes.
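
For reference, the two metrics reported here follow the standard confusion-matrix definitions: overall accuracy is the fraction of correctly classified points, and the IoU of a class is TP / (TP + FP + FN), averaged over classes to give the mIoU. The NumPy sketch below uses a made-up six-point example, not data from the paper.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = y_true * num_classes + y_pred
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def overall_accuracy(cm):
    """Fraction of points whose predicted class equals the ground truth."""
    return np.trace(cm) / cm.sum()

def per_class_iou(cm):
    """IoU = TP / (TP + FP + FN) per class; NaN for classes that never occur."""
    tp = np.diag(cm)
    denom = cm.sum(axis=0) + cm.sum(axis=1) - tp
    return np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)

def mean_iou(cm):
    """Mean of the per-class IoU values, ignoring absent classes."""
    return np.nanmean(per_class_iou(cm))

# Made-up toy labels with three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, 3)
print(per_class_iou(cm))      # IoU of the three classes: 1/3, 2/3, 1/2
print(mean_iou(cm))           # 0.5
print(overall_accuracy(cm))   # 4 of 6 points correct
```

The toy example also shows why the two metrics can diverge: overall accuracy is dominated by large classes such as roof and wall, whereas mIoU weights every class equally.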

Table 2 Our results vs. previous work on the self-constructed dataset. OA and mIoU stand for overall accuracy and mean Intersection-over-Union

Fig. 8 Qualitative results of semantic segmentation on the Hall of Heavenly Kings in the self-built dataset. Our results are markedly better on the segmentation of fang and floors

Table 3 Our per-class IoU results and class-averaged mIoU results on the Hall of Heavenly Kings compared to previous work

Tested on ArCH. To verify the generalisation of our method, we compare it with methods that have been tested on ArCH (PointNet [48], PointNet++ [11], PCNN [49], DGCNN [50] and Cao et al. [17], all of which are fully supervised segmentation methods), as well as with several popular networks that are suitable for large scenes (the fully supervised method RandLA-Net [12] and the weakly supervised segmentation methods PSD [47] and SQN [9]). The comparison results are shown in Table 4. Our approach not only requires the fewest training scenes, but also has the highest mean Intersection over Union (mIoU). Compared with the original network, our mIoU and Overall Accuracy (OA) are improved by 3.91% and 2.88%, respectively. Compared with the PSD network using 1% annotations, our OA is only 1.02% lower. The Intersection over Union (IoU) for each class is provided in Table 5. Our method outperforms the other methods in most classes.

Figure 9 shows the qualitative segmentation results on the test data. Overall, our segmentation results are much improved relative to the original network. We improve the segmentation performance on stairs considerably (yellow line in Fig. 9). This is because the original network may treat all points in the upper part of the scene as roof, leading to incorrect segmentation, while our network can better avoid such a situation thanks to our positional encoding. This is also reflected in our self-built dataset (in the red line section of Fig. 8, the original network labels a large area of the ground as roof and clutter). For similar, adjacent components such as moulding and wall, our network is able to distinguish them better (purple line in Fig. 9). Moulding is often found near doors and windows as a decorative element in buildings, but it is difficult for the network to distinguish. Because a moulding is a concave or convex line profile on a wall, it is likely to go unnoticed if the training data are not sufficiently sampled. We use a self-attention mechanism to enhance the local features of each component, so that the shape features of the component are more visible.

Table 4 Test results of the compared networks on the ArCH dataset

Table 5 Per-class IoU results on the ArCH dataset

Fig. 9 Qualitative results of semantic segmentation on ArCH. Our results are particularly effective on stairs and mouldings

Ablation experiments

In this section, we first compare the network as a whole, with all modifications applied, against the original network. Then, for further validation, we test the positional encoding module and the self-attention module separately on the ArCH dataset to analyse the benefit each brings.

First, Table 6 compares the original network with our network after the overall modification, that is, with the changed positional encoding and the added self-attention module. The initially trained model is used to generate pseudo-labels by predicting on the training dataset. Then 20% of the pseudo-labels are randomly selected as truth points (as there is no guarantee that all predictions are correct), the network is re-trained to produce a new model, and the new model is evaluated on the test set. Comparing the mIoU of the two networks over several iterations shows that the segmentation performance of our network is greatly improved. After the third iteration, the performance of our network starts to decrease and the original network can hardly be improved. We believe this is because the predictions on the training set have reached the peak of what the network can achieve, so further iterations show a decreasing trend. Nevertheless, our segmentation performance is still greatly improved compared with the original network.
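
One round of this pseudo-label procedure can be sketched as below. The sketch interprets "20% of the pseudo-labels" as a random 20% of the predictions on the currently unlabelled training points; the function name and the use of -1 for unlabelled points are assumptions made for illustration.

```python
import numpy as np

def add_pseudo_labels(labels, labelled_mask, predictions, ratio=0.2, seed=0):
    """Promote a random `ratio` of the predictions on unlabelled points to truth points.
    labels: (N,) int labels (-1 where unlabelled), labelled_mask: (N,) bool,
    predictions: (N,) int predicted classes. Returns updated copies of labels and mask."""
    rng = np.random.default_rng(seed)
    unlabelled = np.flatnonzero(~labelled_mask)
    n_new = int(round(len(unlabelled) * ratio))
    chosen = rng.choice(unlabelled, size=n_new, replace=False)
    new_labels = labels.copy()
    new_mask = labelled_mask.copy()
    new_labels[chosen] = predictions[chosen]   # pseudo-labels become training targets
    new_mask[chosen] = True
    return new_labels, new_mask

# Hypothetical toy data: 1000 points, 10 of them manually annotated as class 3.
labels = np.full(1000, -1)
labelled_mask = np.zeros(1000, dtype=bool)
labels[:10] = 3
labelled_mask[:10] = True
predictions = np.arange(1000) % 10             # stand-in for the model's predictions
new_labels, new_mask = add_pseudo_labels(labels, labelled_mask, predictions)
print(new_mask.sum())                          # 208 labelled points after one round
```

The manually annotated points are never overwritten; only unlabelled points receive pseudo-labels, and the network would then be re-trained on the enlarged label set.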

Table 6 Per-iteration results on the self-constructed dataset (the Hall of Heavenly Kings) and the public dataset ArCH

Finally, we test the positional encoding and self-attention modules separately on ArCH. The results are shown in Table 7. The improvement is particularly noticeable with the self-attention module. Because the self-attention module strengthens the network's ability to learn the surroundings of points belonging to different building components, it improves the differentiation of similar components within a neighbourhood of an ancient building. The changed positional encoding, by contrast, uses only features that focus on the surrounding points and no longer needs features of a point's own position in space. In addition, we compare several variants of the positional encoding. The results are shown in Table 8. Using the relative position and the Euclidean distance to obtain the positional encoding works better. On ancient building facades, more position information is therefore not necessarily better and may even cause interference.

Table 7 Comparison of positional encoding module and self-attention module tested on ArCH respectively
Table 8 Results of tests on ArCH using different positional information

Conclusion and outlook

In this study, we find that a weakly supervised network with a self-attention mechanism and a new positional encoding substantially improves the segmentation of ancient building facades. Our weakly supervised network surpasses some fully supervised networks in segmenting ancient buildings. This may be because full supervision requires a large number of scene samples, while weak supervision is less affected by having few scene samples. In addition, we use only 0.1% of the annotations required by full supervision, which greatly reduces the cost of manually annotating the point cloud. The self-attention mechanism we introduced into SQN greatly improves the segmentation effect of the network on ancient buildings. We also continued our experiments using the generated pseudo-labels and found that this can further improve the segmentation. This was not verified in the SQN article, but we verified that it does improve segmentation in large-scale ancient building scenes. However, the segmentation results for the Door-window class in ArCH are very poor, which may be because the attention we introduced is single-channel. In future work, we will continue our in-depth study of the segmentation of ancient buildings by introducing a multi-channel attention mechanism on the basis of weak supervision.

Availability of data and materials

All data generated or analyzed during this study are included in this published article.


References

  1. Hu Q, Wang S, Fu C, Ai M, Yu D, Wang W. Fine surveying and 3D modeling approach for wooden ancient architecture via multiple laser scanner integration. Remote Sens. 2016;8(4):270.

  2. Hu Y, Lan D, Wang J, Hou M, Li S, Li X, Zhu L. Measurement and analysis of facial features of terracotta warriors based on high-precision 3D point clouds. Herit Sci. 2022;10(1):40.

  3. Zhuo L, Zhang J, Hong X. Cultural heritage characteristics and damage analysis based on multidimensional data fusion and HBIM–taking the former residence of HSBC bank in Xiamen, China as an example. Herit Sci. 2024;12(1):128.

  4. Jo YH, Kim YH, Lee HS. Three-dimensional deviation analysis and digital visualization of shape change before and after conservation treatment of historic kiln site. Herit Sci. 2024;12(1):76.

  5. Tysiac P, Sieńska A, Tarnowska M, Kedziorski P, Jagoda M. Combination of terrestrial laser scanning and UAV photogrammetry for 3D modelling and degradation assessment of heritage building based on a lighting analysis: case study—St. Adalbert Church in Gdansk, Poland. Herit Sci. 2023;11(1):53.

  6. Cat F. Apple unveils new ipad pro with breakthrough lidar scanner and brings trackpad support to ipados. 2020.

  7. Scott S. Lidar on the iphone 12 pro. 2020.

  8. Fei H, Long Y, Wei Z, Li F, Yun T, Qin Z. Semantic based autoencoder-attention 3D reconstruction network. Graph Models. 2019;106: 101050.

  9. Hu Q, Yang B, Fang G, Guo Y, Leonardis A, Trigoni N, Markham A. SQN: weakly-supervised semantic segmentation of large-scale 3D point clouds. In: Proceedings of the European Conference on Computer Vision (ECCV). 2022. p. 600–619.

  10. Grilli E, Özdemir E, Remondino F. Application of machine and deep learning strategies for the classification of heritage point clouds. Int Arch Photogramm Remote Sens Spat Inf Sci. 2019;42:447–54.

  11. Charles RQ, Yi L, Su H, Guibas LJ. Pointnet++: deep hierarchical feature learning on point sets in a metric space. Adv Neural Inf Process Syst. 2017. p. 15520–15528.

  12. Hu Q, Yang B, Xie L, Rosa S, Guo Y, Wang Z, et al. Randla-net: efficient semantic segmentation of large-scale point clouds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020. p. 11108–11117.

  13. Su Y, Liu W, Yuan Z, Cheng M, Zhang Z, Shen X, Wang C. DLA-Net: learning dual local attention features for semantic segmentation of large-scale building facade point clouds. Pattern Recogn. 2022;123: 108372.

  14. Xu X, Lee GH. Weakly supervised semantic point cloud segmentation: towards 10× fewer labels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020. p. 13706–13715.

  15. Su Y, Xu X, Jia K. Weakly supervised 3D point cloud segmentation via multi-prototype learning. IEEE Trans Circ Syst Video Technol. 2023;33(12):7723–36.

  16. Niu Y, Yin J. Weakly supervised point cloud semantic segmentation with the fusion of heterogeneous network features. Image Vis Comput. 2024;142: 104916.

  17. Cao Y, Scaioni M. Label-efficient deep learning-based semantic segmentation of building point clouds at LOD3 level. Int Arch Photogramm Remote Sens Spat Inf Sci. 2021;43:449–56.

  18. Matrone F, Lingua A, Pierdicca R, Malinverni ES, Paolanti M, Grilli E, et al. A benchmark for large-scale heritage point cloud semantic segmentation. Int Arch Photogramm Remote Sens Spat Inf Sci. 2020;2020(43):1419–26.

  19. Teruggi S, Grilli E, Russo M, Fassi F, Remondino F. A hierarchical machine learning approach for multi-level and multiresolution 3D point cloud classification. Rem Sens. 2020;12(16):2598.

  20. Galantucci RA, Musicco A, Verdoscia C, Fatiguso F. Machine learning for the semi-automatic 3D decay segmentation and mapping of heritage assets. Int J Arch Herit. 2024.

  21. Grilli E, Remondino F. Machine learning generalisation across different 3D architectural heritage. ISPRS Int J Geo-Inf. 2020;9(6):379.

  22. Croce V, Caroti G, De LL, Jacquot K, Piemonte A, Véron P. From the semantic point cloud to heritage-building information modeling: a semiautomatic approach exploiting machine learning. Remote Sens. 2021;13(3):461.

  23. Valero E, Forster A, Bosché F, Hyslop E, Wilson L, Turmel A. Automated defect detection and classification in ashlar masonry walls using machine learning. Autom Constr. 2019;106: 102846.

  24. Dai A, Chang AX, Savva M, Halber M, Funkhouser T, Niessner M. ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: Proceedings of the conference on computer vision and pattern recognition (CVPR). 2017. p. 5828–5839.

  25. Armeni I, Sax S, Zamir A, Savarese S. Joint 2D-3D-semantic data for indoor scene understanding. In: Proceedings of the computer vision and pattern recognition. 2017. arXiv preprint arXiv:1702.01105.

  26. Haznedar B, Bayraktar R, Ozturk AE, Arayici Y. Implementing PointNet for point cloud segmentation in the heritage context. Herit Sci. 2023;11(1):2.

  27. Pan X, Lin Q, Ye S, Li L, Guo L, Harmon B. Deep learning based approaches from semantic point clouds to semantic BIM models for heritage digital twin. Herit Sci. 2024;12(1):65.

  28. Ji S, Pan J, Li L, Hasegawa K, Yamaguchi H, Thufail FI, Brahmantara US, Tanaka S. Semantic segmentation for digital archives of Borobudur reliefs based on soft-edge enhanced deep learning. Rem Sens. 2023;15(4):956.

  29. Pierdicca R, Paolanti M, Matrone F, Martini M, Morbidoni C, Malinverni E, et al. Point cloud semantic segmentation using a deep learning framework for cultural heritage. Rem Sens. 2020.

  30. Battini C, Ferretti U, De AG, Pierdicca R, Paolanti M, Quattrini R. Automatic generation of synthetic heritage point clouds: analysis and segmentation based on shape grammar for historical vaults. J Cult Herit. 2024;66:37–47.

  31. Galanakis D, Maravelakis E, Pocobelli DP, Vidakis N, Petousis MA, Konstantaras AJ, Tsakoumaki M. Svd-based point cloud 3d stone by stone segmentation for cultural heritage structural analysis—the case of the Apollo Temple at Delphi. J Cult Herit. 2023;61:177–87.

  32. Guo M, Cai J, Liu Z, Mu T, Martin R, Hu S. PCT: point cloud transformer. Comput Vis Media. 2021;7(2):187–99.

  33. Zhang C, Wan H, Shen X, Wu Z. PVT: point-voxel transformer for point cloud learning. Int J Intell Syst. 2022;37(12):11985–2008.

  34. Zhao H, Jiang L, Jia J, Torr P, Koltun V. Point transformer. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV). 2021. p. 16259–16268.

  35. Park C, Jeong Y, Cho M, Park J. Fast point transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). 2022. p. 16949–16958.

  36. Wu X, Lao Y, Jiang L, Liu X, Zhao H. Point transformer V2: grouped vector attention and partition-based pooling. Adv Neural Inf Process Syst. 2022;35:33330–42.

  37. Zhang Z, Yang B, Wang B, Li B. GrowSP: unsupervised semantic segmentation of 3D point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 17619–17629.

  38. Zhang Z, Ding J, Jiang L, Dai D, Xia G. FreePoint: unsupervised point cloud instance segmentation. 2023. arXiv preprint arXiv:2305.06973.

  39. Xie S, Gu J, Guo D, Qi C, Guibas L, Litany O. Pointcontrast: unsupervised pre-training for 3d point cloud understanding. In: Proceedings of the Computer Vision–ECCV. 2020; Part III 16: 574–591.

  40. Hou J, Graham B, Nießner M, Xie S. Exploring data-efficient 3d scene understanding with contrastive scene contexts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. p. 15587–15597.

  41. Zhang Z, Girdhar R, Joulin A, Misra I. Self-supervised pretraining of 3d features on any point-cloud. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. p. 10252–10263.

  42. Chang AX, Funkhouser T, Guibas LJ, Hanrahan P, Huang Q, Li Z, et al. ShapeNet: an information-rich 3d model repository. 2015. Technical Report arXiv:1512.03012.

  43. Rao Y, Liu B, Wei Y, Lu J, Hsieh CJ, Zhou J. Randomrooms: unsupervised pre-training from synthetic shapes and randomized layouts for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. p. 3283–3292.

  44. Zhang Y, Li Z, Xie Y, Qu Y, Li C, Mei T. Weakly supervised semantic segmentation for large-scale point cloud. Proc AAAI Conf Artif Intell. 2021;35(4):3421–9.

  45. Liu K, Zhao Y, Nie Q, Gao Z, Chen B. Weakly supervised 3d scene segmentation with region-level boundary awareness and instance discrimination. In: Proceedings of the European conference on computer vision. 2022. p. 37–55.

  46. Qiu S, Anwar S, Barnes N. Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021. p. 1757–1767.

  47. Zhang Y, Qu Y, Xie Y, Li Z, Zheng S, Li C. Perturbed self-distillation: weakly supervised large-scale point cloud semantic segmentation. In: Proceedings of the ICCV. 2021. p. 15520–15528.

  48. Charles RQ, Su H, Kaichun M, Guibas LJ. Pointnet: deep learning on point sets for 3d classification and segmentation. In: Proceedings of the CVPR. 2017. p. 652–660.

  49. Atzmon M, Maron H, Lipman Y. Point convolutional neural networks by extension operators. ACM Trans Graph. 2018;37(4):1–12.

  50. Matrone F, Grilli E, Martini M, Paolanti M, Pierdicca R, Remondino F. Comparing machine and deep learning methods for large 3d heritage semantic segmentation. ISPRS Int J Geo Inf. 2020;9(9):535.


Acknowledgements

We thank F. Matrone and all contributors for the ArCH dataset.


Funding

This work was supported by the Project of National Key R&D Program Project (2018YFC0807806); Open Fund Project of State Key Laboratory of Geographic Information Engineering (SKLGIE2019-Z-3-1); Open Fund of State Key Laboratory of Surveying, Mapping and Remote Sensing Information Engineering of Wuhan University (19E01); Open Research Fund Project of the Key Laboratory of Digital Mapping and Land Information Application of the Ministry of Natural Resources (ZRZYBWD202102); Software Science Research Project of the Ministry of Housing and Urban Rural Development (R20200287); Beijing Social Science Foundation Decision Consulting Major Project (21JCA004).

Author information

Contributions

Jianghong Zhao conceived the presented idea and put forward experimental suggestions. Haiquan Yu conducted and refined the analysis process and wrote the manuscript. Xinnan Hua performed the data processing and provided some of the network comparison experiment results. All authors approved the final manuscript.

Corresponding author

Correspondence to Xinnan Hua.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Zhao, J., Yu, H., Hua, X. et al. Semantic segmentation of point clouds of ancient buildings based on weak supervision. Herit Sci 12, 232 (2024).
