Image segmentation algorithm
Semantic image segmentation
The creation of the deep convolutional neural network AlexNet [40] drastically improved the performance of computer vision algorithms for classification. The image classification error rate in the ImageNet competition dropped dramatically from 2011 to 2017 [41]. CNNs were then adapted to computer vision tasks beyond whole-image classification, including semantic image segmentation. As shown in Fig. 1, semantic segmentation involves detecting and partitioning an image into different segments based on their class. According to a literature review [42], recent popular image segmentation models usually consist of two parts, namely a backbone and a classifier. The backbone is a deep CNN responsible for extracting the features of an image into a feature map through convolutional and pooling operations. The classifier, a relatively small neural network compared to the backbone, is responsible for making predictions based on the feature map from the backbone. Finally, the prediction is bilinearly interpolated to the size of the original image, and the two are combined to form a segmented image. This process is summarized in Fig. 2.
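As an illustration of this backbone-plus-classifier structure, the following minimal sketch uses torchvision's DeepLabV3 with a ResNet101 backbone (not necessarily the exact implementation used in this work): the backbone produces a low-resolution feature map, the classifier head predicts class scores, and the result is bilinearly interpolated back to the input resolution.

```python
import torch
import torch.nn.functional as F
from torchvision.models.segmentation import deeplabv3_resnet101

# Illustrative only: torchvision's DeepLabV3/ResNet101 stands in for a generic
# backbone + classifier segmentation model (the paper's implementation may differ).
model = deeplabv3_resnet101(pretrained=True).eval()

x = torch.randn(1, 3, 900, 900)                 # one RGB image (the paper resizes to 900 x 900)
with torch.no_grad():
    feats = model.backbone(x)["out"]            # backbone: deep CNN -> low-resolution feature map
    logits = model.classifier(feats)            # classifier: small head predicting class scores
    logits = F.interpolate(logits, size=x.shape[-2:],
                           mode="bilinear", align_corners=False)  # back to image size
pred = logits.argmax(dim=1)                     # per-pixel class label
```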
An image segmentation model classifies each pixel based on the predicted probabilities of that pixel belonging to each possible class. This is achieved by finding optimal weights of the neural network that minimize a loss function. The loss measures the difference between the ground-truth label \(y\) and the prediction \({\hat{y}}\) during the training phase. The weighted cross-entropy loss function of Eq. (1) is used for image segmentation in this case, where \(i = 1,\ldots ,C\) indexes the class of a pixel. The weight \(W_{i}\) for class \(i\) is calculated according to Eq. (2), where \(f_i\) is the frequency of pixels of class \(i\) in the training set. This weighting prevents the model from favoring small objects too heavily, or from ignoring small objects and focusing on large objects in its predictions.
$$Loss(y,{\hat{y}}) = -\sum_{i=1}^{C} W_{i} \cdot y_i \cdot \log({\hat{y}}_{i})$$
(1)
$$W_i = \frac{\log(f_i)}{\sum_{j=1}^{C} \log(f_j)}$$
(2)
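As a concrete sketch of Eqs. (1) and (2) in PyTorch, the per-class weights can be computed from the pixel counts of each class and passed to a weighted cross-entropy loss; the counts below are hypothetical and merely stand in for the class frequencies \(f_i\).

```python
import torch
import torch.nn as nn

# W_i = log(f_i) / sum_j log(f_j), Eq. (2); pixel_counts holds the (hypothetical)
# number of training pixels per class, standing in for the class frequencies f_i.
def class_weights(pixel_counts):
    log_f = torch.log(torch.as_tensor(pixel_counts, dtype=torch.float))
    return log_f / log_f.sum()

weights = class_weights([5_000_000, 800_000, 120_000])   # three hypothetical classes
criterion = nn.CrossEntropyLoss(weight=weights)          # weighted cross-entropy, Eq. (1)

logits = torch.randn(2, 3, 64, 64)                       # (batch, classes, H, W) predictions
target = torch.randint(0, 3, (2, 64, 64))                # ground-truth class per pixel
loss = criterion(logits, target)
```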
Transfer learning is a technique which allows a deep-learning model to be adjusted and applied to another, usually smaller, dataset [42]. The backbones of segmentation models can first be trained on a computer vision competition dataset, usually exceeding 100,000 labelled training examples, such as COCO [43] or the Open Images dataset [44]. These large datasets contain images of complex yet common scenes from daily life. Once a model has been trained to extract features from such general images, transfer learning saves tremendous computational cost and time. As the lack of high-quality cultural heritage datasets limits the application of machine learning methods in heritage studies [45], transfer learning becomes an important technique for bringing general knowledge learnt from other areas to bear on the data-sparsity problem in the heritage domain. This paper combines transfer learning with photogrammetry to highlight the potential of transfer learning to improve heritage monitoring when combined with existing methods.
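A minimal transfer-learning sketch is shown below, assuming a torchvision-style DeepLabV3 model: the backbone keeps its pre-trained weights while the classification head is replaced to predict the classes of interest. The number of classes and the decision to freeze the backbone are assumptions for illustration, not details of the paper's pipeline.

```python
from torchvision.models.segmentation import deeplabv3_resnet101
from torchvision.models.segmentation.deeplabv3 import DeepLabHead

NUM_CLASSES = 4                                    # hypothetical: e.g. background, masonry, plant, crack
model = deeplabv3_resnet101(pretrained=True)       # backbone weights learnt on large generic datasets
model.classifier = DeepLabHead(2048, NUM_CLASSES)  # new head for the target classes

# Optionally freeze the pre-trained backbone so that only the new head is fine-tuned at first.
for p in model.backbone.parameters():
    p.requires_grad = False
```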
Several image segmentation algorithms are based on deep convolutional neural networks and transfer learning, including the DeepLab family of models [46], FCN [47], U-Net [48] and Mask R-CNN [49]. Considering their exceptional performance in similar tasks and their relatively straightforward implementation and preparation compared to other algorithms, DeepLab family models [46] are adopted for this task.
DeepLab family and DeeplabV3
The DeepLab family is representative of dilated convolutional models. As stated in the papers introducing the DeepLab family [46, 50], image segmentation faces two main problems compared to other computer vision tasks. First, the low resolution of the feature map produced by multi-layer convolutional operations makes it difficult to detect small objects and draw clear boundaries in an image, such as plants or cracks covering small areas in this case. Second, the varying scale of the same type of object across different images can affect the performance of image segmentation models. This issue is especially prominent for tasks dealing with unstructured imagery data (e.g. crowdsourced photographs).
To address these problems, DeepLab family models use a dilated convolution operation, which adds 'holes' to the convolution kernel to skip some pixels. The rate of dilation is the number of zeros between two consecutive filter values along each spatial dimension, as illustrated in Fig. 3. On the basis of dilated convolution, DeepLab family models apply Atrous Spatial Pyramid Pooling (ASPP) to extract features at different scales and produce a higher-resolution feature map at only minimal additional computational cost, which yields outstanding performance on segmenting small objects and handling varying scales. Specifically, taking DeeplabV3 as an example, the model uses convolutional layers with 3 × 3 kernels at different rates of dilation. After processing the feature map in parallel, it concatenates the outputs of the dilated convolution layers with the output of a global average pooling layer. This is further processed by a 1 × 1 convolution layer and bilinearly interpolated to form the tensor used in the final calculation of the loss. This strategy enables the model not only to save GPU memory, since no computation is performed at the holes of the dilated kernel, but also to gain a larger field-of-view from the expanded kernel and the ability to deal with objects of varying scale thanks to the multi-grid ASPP. In this article, DeeplabV3 with a ResNet101 backbone, abbreviated as Model1, serves as the baseline model against which the other models are compared.
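The following simplified ASPP-style module sketches this idea in PyTorch: parallel 3 × 3 convolutions with different dilation rates, a 1 × 1 branch and global average pooling, concatenated and fused by a 1 × 1 convolution. Channel sizes and dilation rates are typical values, not those of the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified ASPP-style module: parallel dilated 3x3 convolutions, a 1x1 branch
# and global average pooling, concatenated and fused by a 1x1 convolution.
class SimpleASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1, bias=False)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
             for r in rates])
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False)

    def forward(self, x):
        outs = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(outs + [pooled], dim=1))

feats = torch.randn(1, 2048, 57, 57)      # hypothetical backbone feature map
print(SimpleASPP()(feats).shape)          # torch.Size([1, 256, 57, 57])
```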
DeeplabV3+
To further improve performance, DeeplabV3+ [51], the latest version of the DeepLab family of image segmentation models, recast its structure as an encoder-decoder (with the DeeplabV3 structure as the encoder) and added a more sophisticated decoder. The encoding step downsamples the image into a small prediction map, and the decoding step upsamples the prediction map back to the original size of the input image. In the decoding stage, rather than using naïve bilinear upsampling alone, the model concatenates the upsampled encoder output with low-level features from the backbone that have been processed by a 1 × 1 convolution, followed by a further 3 × 3 convolution layer before interpolation to the original image size. This gives the model a better ability to predict smoother and more precise results for a given input image.
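A rough PyTorch sketch of such a decoder is given below; the 48-channel reduction and the other sizes are assumptions for illustration, not the exact DeeplabV3+ configuration used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Decoder sketch: low-level backbone features are reduced by a 1x1 convolution,
# concatenated with the upsampled encoder output, refined by a 3x3 convolution
# and finally interpolated back to the input image size.
class SimpleDecoder(nn.Module):
    def __init__(self, low_ch=256, enc_ch=256, num_classes=4):
        super().__init__()
        self.reduce = nn.Conv2d(low_ch, 48, 1, bias=False)
        self.refine = nn.Conv2d(enc_ch + 48, num_classes, 3, padding=1)

    def forward(self, encoder_out, low_level_feats, image_size):
        low = self.reduce(low_level_feats)
        enc = F.interpolate(encoder_out, size=low.shape[-2:],
                            mode="bilinear", align_corners=False)
        out = self.refine(torch.cat([enc, low], dim=1))
        return F.interpolate(out, size=image_size,
                             mode="bilinear", align_corners=False)

decoder = SimpleDecoder()
enc = torch.randn(1, 256, 57, 57)           # encoder (DeeplabV3/ASPP) output
low = torch.randn(1, 256, 225, 225)         # low-level backbone features
print(decoder(enc, low, (900, 900)).shape)  # torch.Size([1, 4, 900, 900])
```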
Another modification of DeeplabV3+ is that, apart from ResNet, it can use a modified version of the Xception network [52] as its backbone. With several changes to the backbone network structure, it achieved exceptional image segmentation results. DeeplabV3+ with two distinct backbones, namely Xception and ResNet, was applied to this crowdsourced dataset; these are referred to as Model2 and Model3 respectively.
Two-stage model design
The previous models partition whole images into distinct segments belonging to different classes. Additionally, this paper explores an optional two-stage model that performs only a binary classification to further refine the segmentation results for a specific class. As illustrated in Fig. 4, this second model classifies pixels into plant and non-plant classes within a crop defined by the bounding box of plants in the output of the first segmentation model. This cropped image is obtained by reverse selection using density-based spatial clustering of applications with noise (DBSCAN) [53], an unsupervised clustering technique, to partition and assign labels to disjoint individual regions in the output of the image segmentation model. The eps parameter of DBSCAN controls the maximum distance between two pixels for one to be considered in the same region as the other. It is therefore a free parameter of this algorithm that can be adjusted for different scenarios.
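A minimal sketch of this cropping step is shown below, assuming scikit-learn's DBSCAN, a predicted mask in which the plant class has a known label id, and a hypothetical min_samples value; the eps value follows the \(\sqrt{\# Pixels}/20\) heuristic described in the training settings below.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Cluster the pixel coordinates predicted as "plant" and return one bounding box
# per cluster; each box defines a crop that is passed to the second-stage model.
# plant_class and min_samples are hypothetical; eps follows sqrt(#pixels) / 20.
def plant_crops(mask, plant_class=2, min_samples=10):
    eps = np.sqrt(mask.size) / 20
    coords = np.column_stack(np.nonzero(mask == plant_class))
    if len(coords) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    boxes = []
    for k in set(labels) - {-1}:                 # -1 marks DBSCAN noise points
        pts = coords[labels == k]
        (y0, x0), (y1, x1) = pts.min(axis=0), pts.max(axis=0)
        boxes.append((x0, y0, x1, y1))
    return boxes

mask = np.zeros((900, 900), dtype=int)
mask[100:200, 300:420] = 2                       # a synthetic "plant" region
print(plant_crops(mask))                         # [(300, 100, 419, 199)]
```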
The second model is also a DeeplabV3+ model with a ResNet backbone, trained with the same procedure as the first model (except that the class weights in the loss function are not log-transformed). The original image dataset was manually cropped around all the plants, so training this two-stage model does not require additional photographs. This extra step refines the prediction by sharpening the boundary between plant and non-plant objects, and provides a more flexible solution for segmenting objects in different scenes through the additional parameter eps.
Photogrammetry
Although crowdsourced images can provide insights into plant growth at heritage sites, it is inaccurate to compare areas of plant growth between photographs taken from very different angles. To address this, photogrammetry can be used to build a comprehensive, stereoscopic view of a heritage site by constructing a 3D model from crowdsourced photographs. The photogrammetry method used in this article is incremental Structure from Motion (SfM) [54]. To reconstruct a complete 3D model from 2D images, the photogrammetry process can be roughly split into three steps, namely feature matching, sparse reconstruction and dense reconstruction. The first step detects common features among the input images using algorithms such as the Scale-Invariant Feature Transform (SIFT) [55]. The second step uses bundle adjustment [56] to estimate camera poses and an optimal sparse 3D model. Finally, the last step reconstructs a dense 3D model with the PatchMatch algorithm [57].
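The sketch below illustrates only the first of these steps, SIFT feature detection and matching in OpenCV, on a hypothetical pair of photographs; the full sparse and dense reconstruction is delegated to dedicated SfM software as described in the implementation section.

```python
import cv2

# SIFT feature detection and ratio-test matching between two (hypothetical)
# crowdsourced photographs; only the feature-matching stage of SfM is shown.
img1 = cv2.imread("view_a.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view_b.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe's ratio test
print(f"{len(good)} putative correspondences between the two views")
```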
This article combines photogrammetry with image segmentation, as shown in Fig. 5. First, photogrammetry software is used to construct a 3D model of the inner corner of Bothwell Castle from the crowdsourced image dataset. Next, the same image dataset is partitioned into segments locating each class by the image segmentation model. Finally, the segmented images are re-mapped onto the 3D model, given the known camera poses and mapping scheme. This gives a 3D model with textures segmented into different regions, so that a property of an object, such as plant area, can be easily measured from any viewpoint, removing the distortion caused by the shooting angle. The flowchart in Fig. 6 summarizes the whole process of the proposed algorithm.
Software and hardware
This paper uses PyTorch as the framework for the image segmentation models, combined with its associated computer vision library Torchvision and an implementation of DeeplabV3+ [58]. Image processing was conducted using the open-source computer vision and photogrammetry tools OpenCV [59], VisualSFM [60] and OpenMVS [61]. As for hardware, the models were trained on a personal computer with an Intel i7-6850K CPU, an NVIDIA RTX 2070 GPU and 32 GB of RAM.
Training setting
The three one-stage models and three two-stage models were trained using the same hyper-parameters and dataset. Apart from the loss function and per-class weights, the same settings are used: the Adam optimizer [62] is applied with a learning rate of 0.01, the learning rate decays to one tenth of its value every 50 epochs, and the number of epochs is set to 100. The whole dataset of 113 images is randomly split into a training set (93 images) and a test set (20 images). Given that this dataset is quite small, transfer learning was applied; specifically, backbone models pre-trained on the Pascal VOC 2012, SBD and Cityscapes datasets [58] were used. The batch size in the training phase is six images, considering the GPU memory. When fed into the model, all images are resized to a resolution of 900 × 900 for the one-stage models and 200 × 200 for the two-stage models. For images of different sizes, the eps of DBSCAN is automatically set to \(\frac{\sqrt{\# Pixels}}{20}\) based on empirical observations. For data augmentation, the color parameters (brightness, saturation, hue and contrast) of the training images are randomly varied within a small range during training to avoid overfitting. As reflected in Fig. 7, all three models successfully converged after 100 epochs of training, during which the losses of Model1 and Model3 are smooth, whereas Model2's loss is volatile and consistently higher than that of the other two models. The relevant code can be found at [63].
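The following sketch reproduces the main elements of this training configuration in PyTorch on a tiny synthetic batch; the number of classes, the augmentation ranges and the synthetic data are assumptions, and the actual training uses the crowdsourced dataset with the DeeplabV3+ implementation cited above.

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader, TensorDataset
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet101

NUM_CLASSES = 4                                            # hypothetical number of classes
model = deeplabv3_resnet101(pretrained=False, num_classes=NUM_CLASSES)
criterion = nn.CrossEntropyLoss()                          # the weighted version of Eq. (1) in practice
optimizer = Adam(model.parameters(), lr=0.01)
scheduler = StepLR(optimizer, step_size=50, gamma=0.1)     # lr decays to a tenth every 50 epochs

# Colour-jitter augmentation within a small (assumed) range, applied to the real
# training images; inputs are resized to 900 x 900 (one-stage) or 200 x 200 (two-stage).
augment = transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05)

images = torch.rand(6, 3, 200, 200)                        # one synthetic batch of six images
targets = torch.randint(0, NUM_CLASSES, (6, 200, 200))
loader = DataLoader(TensorDataset(images, targets), batch_size=6)

for epoch in range(100):                                   # 100 epochs in total
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x)["out"], y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```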