In this section, we first introduce the framework of our ancient rock-art segmentation, which heavily augments the training dataset and emphasizes rock-art boundaries in UNet. We then describe the novel BEGL loss function, which aims at enhancing and refining the rock-art boundaries. Finally, we describe the segmentation network architecture in detail.

### Overview of the petroglyph segmentation framework

Segmentation of rock-art is a highly challenging task owing to the varying levels of degradation of petroglyph boundaries and the considerable scribble noise on rock panels. To segment rock-art more effectively, we concentrate on a systematic framework for petroglyph segmentation. The proposed boundary-enhancement-based rock-art image segmentation framework is presented in Fig. 1. It comprises two phases, namely the image preprocessing phase and the segmentation phase.

Because petroglyph orthophotos are generally tilted, it is necessary to apply image rotation correction based on the Fourier transform. The 2D discrete Fourier transform (2D-DFT) is defined in Eq. (1):

$$\mathrm{y}(k,l) = \sum \limits _{i = 0}^{M - 1} \sum \limits _{j = 0}^{N - 1} x(i,j)\, e^{ - i2\pi \left( \frac{ki}{M} + \frac{lj}{N}\right) }.$$

(1)

$$e^{iz} = \cos z + i\sin z.$$

(2)

where *x*(*i*, *j*) is the value in the image spatial domain, *i* and *j* are the indices of the image position, \(\mathrm{y}(k,l)\) is the value in the image frequency domain, *k* and *l* are the discrete spatial frequencies, and *M* and *N* are the numbers of pixels along the two dimensions of the image. Eq. (2) is Euler's formula, which establishes a connection between the complex exponential function and the trigonometric functions. In essence, the 2D-DFT conveniently converts signals from the spatial domain into the frequency domain. The Fourier spectrum comprises the magnitudes of the 2D-DFT complex coefficients, which are proportional to the strengths of the spatial frequencies. Next, the corrected petroglyph images are sliced into small patches that are fed into a ResNet classifier. Because large background regions are common on ancient rock-art panels and introduce severe class imbalance, ResNet is selected as the classifier of the framework to filter out patches with no pecking marks. Figure 2 shows a class activation map (CAM) obtained from ResNet, which highlights pecked regions in red and discards unpecked regions in blue. Finally, in order to extract and emphasize the geometric patterns and boundaries of the pecked marks that make up petroglyph shapes, image reversal and adaptive histogram equalization are applied in the framework.
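As a concrete illustration of the rotation-correction step, Eq. (1) can be evaluated directly, with Euler's formula (Eq. (2)) applied through the complex exponential. This is a minimal sketch on a made-up 2×2 patch; a real implementation would use a library FFT such as `numpy.fft.fft2` rather than this direct sum.

```python
import cmath

def dft2(x):
    """Direct 2D-DFT of a list-of-lists image x, as in Eq. (1)."""
    M, N = len(x), len(x[0])
    y = [[0j] * N for _ in range(M)]
    for k in range(M):
        for l in range(N):
            acc = 0j
            for i in range(M):
                for j in range(N):
                    # Euler's formula (Eq. (2)) is applied by cmath.exp.
                    acc += x[i][j] * cmath.exp(-2j * cmath.pi * (k * i / M + l * j / N))
            y[k][l] = acc
    return y

# Made-up 2x2 patch; the zero-frequency magnitude equals the pixel sum.
patch = [[1.0, 2.0],
         [3.0, 4.0]]
spectrum = dft2(patch)
print(abs(spectrum[0][0]))  # prints 10.0
```

In the framework itself, it is the dominant orientation visible in the Fourier magnitude spectrum that drives the rotation correction; the direct sum above is O(M²N²) and serves only to make Eqs. (1)–(2) concrete.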

The second phase is based on a UNet [26], an auto-encoder network with skip connections between layers of the same shape. We modify the network by introducing a novel loss function, named BEGL, that allows it to better learn rock-art boundary features.

### BEGL loss function

In order to emphasize the boundary regions, we apply the Sobel operator, which produces strong responses around boundary areas and little response elsewhere, to each point of a 2D image *x*, as in Eq. (3) and Eq. (4).

$$\mathrm{S}_\mathrm{h} = \mathrm{T}_\mathrm{h} * x$$

(3)

$$\mathrm{S}_\mathrm{v} = \mathrm{T}_\mathrm{v} * x$$

(4)

It is useful to express this as weighted density summations using weighting functions for the *h* and *v* components. The two templates \({\mathrm{T}_\mathrm{h}}\) and \({\mathrm{T}_\mathrm{v}}\) used by the Sobel operator are shown in Fig. 3a, b. Each filter can be applied individually to the input image to produce a separate measure of the gradient component in its orientation. These components can then be combined to obtain the absolute magnitude of the gradient at every point. The orientation of the spatial gradient is given by Eq. (5):

$$\begin{aligned} \theta = \mathrm{{arctan}}\left( \frac{{{S_\mathrm{{h}}}}}{{{S_\mathrm{{v}}}}}\right) \end{aligned}$$

(5)

The gradient magnitude \(\mathrm{{S}}\) is given by Eq. (6):

$$\begin{aligned} \vert \mathrm{{S}} \vert = \sqrt{S_\mathrm{{h}}^2 + S_\mathrm{{v}}^2} \end{aligned}$$

(6)
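The gradient computation of Eqs. (3)–(6) can be sketched in pure Python. This is an illustrative sketch assuming the standard 3×3 Sobel templates for \(T_h\) and \(T_v\) (the actual templates are those shown in Fig. 3a, b), evaluated at one interior pixel of a tiny synthetic image containing a vertical step edge.

```python
import math

# Assumed standard 3x3 Sobel templates (cf. Fig. 3a, b).
T_H = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal component
T_V = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical component

def conv_at(x, t, r, c):
    """Correlate the 3x3 template t with image x centred on pixel (r, c)."""
    return sum(t[u][v] * x[r - 1 + u][c - 1 + v]
               for u in range(3) for v in range(3))

# A vertical step edge: left half dark, right half bright.
img = [[0, 0, 10, 10],
       [0, 0, 10, 10],
       [0, 0, 10, 10],
       [0, 0, 10, 10]]

s_h = conv_at(img, T_H, 1, 1)              # Eq. (3): strong response across the edge
s_v = conv_at(img, T_V, 1, 1)              # Eq. (4): no response along the edge
magnitude = math.sqrt(s_h ** 2 + s_v ** 2) # Eq. (6)
theta = math.atan2(s_h, s_v)               # Eq. (5); atan2 avoids division by zero
print(s_h, s_v, magnitude)                 # prints 40 0 40.0
```

As expected for a vertical edge, only the horizontal component responds, so the gradient magnitude comes entirely from \(S_h\) and the orientation is \(\pi/2\).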

Gaussian kernels are the most widely used smoothing filters. They have been shown to play an important role in edge detection in the human visual system and to be very useful as edge and boundary detectors [28]. The 2D Gaussian filter is also the only rotationally symmetric filter that is separable in Cartesian coordinates. Separability is important for computational efficiency when the smoothing operation is implemented as convolutions in the spatial domain. The 2D Gaussian filter is defined in Eq. (7):

$$\begin{aligned} \mathrm{{G}}(i,j) = \frac{1}{{2\pi {\sigma ^2}}}{e^{ - \left( \frac{{{i^2} + {j^2}}}{{2{\sigma ^2}}}\right) }} \end{aligned}$$

(7)

where \(\sigma = 0.8\) is the standard deviation of the Gaussian function and \(\left( {i,j} \right) \) are the Cartesian coordinates of the image. The discrete Gaussian filter can be computed with a standard 2D convolution operation. Hence, we can easily compute the difference between the filtered output of the CNN predictions and the filtered output of the ground truth labels. Minimizing the divergence between the two filtered outputs narrows the gap between the CNN results and the ground truth labels. Following the analysis above, the boundary enhancement with Gaussian loss is defined as an \(L_2\)-norm in Eq. (8):

$$\mathrm{L}_\mathrm{G} = \left\| \mathrm{G}(S(y)) - \mathrm{G}(S(\hat{y})) \right\| _2$$

(8)

where \(y\) are the ground truth labels, \(\hat{y}\) are the predicted labels, \(S( \cdot )\) is the Sobel operator, and \(G( \cdot )\) is the Gaussian filter. Meanwhile, \({L_{BCE}}\) effectively suppresses false positives and remote outliers that lie far from the boundary regions. \({L_{BCE}}\) is defined in Eq. (9):

$$\mathrm{L}_{\mathrm{BCE}}(y,\hat{y}) = - \left( \beta * y\log (\hat{y}) + (1 - \beta )*(1 - y)\log (1 - \hat{y})\right) $$

(9)

Here, \(y\) are the ground truth labels, \(\hat{y}\) are the predicted labels, \(\beta \) is defined as \(1 - \frac{y}{{H*W}}\), and *H* and *W* are the height and width of the image. The overall BEGL loss function, derived from Eqs. (8) and (9), is defined in Eq. (10):

$$\begin{aligned} {L_{BEGL}} = {\lambda _1}{L_G} + {\lambda _2}{L_{BCE}} \end{aligned}$$

(10)

where \({\lambda _1}\) is set to 0.001 and \({\lambda _2}\) to 1, respectively. The BEGL loss function is thus the combination of the BCE loss [19] and the Gaussian loss.
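Putting Eqs. (8)–(10) together, the combined loss can be sketched in pure Python on a tiny mask. This is an illustrative sketch, not the training implementation: a simple finite-difference gradient magnitude stands in for the full Sobel operator \(S(\cdot)\), \(G(\cdot)\) is a 3×3 Gaussian with \(\sigma = 0.8\), the BCE term is averaged over pixels, and \(\beta\) follows the definition above with \(y\) read as the sum of positive pixels.

```python
import math

L1, L2 = 0.001, 1.0      # lambda_1 and lambda_2 from Eq. (10)
EPS = 1e-7               # numerical floor inside the logarithms

def grad_mag(x):
    """Finite-difference gradient magnitude (stand-in for the Sobel S)."""
    H, W = len(x), len(x[0])
    out = [[0.0] * W for _ in range(H)]
    for i in range(H - 1):
        for j in range(W - 1):
            dh = x[i][j + 1] - x[i][j]
            dv = x[i + 1][j] - x[i][j]
            out[i][j] = math.sqrt(dh * dh + dv * dv)
    return out

def gauss_smooth(x, sigma=0.8):
    """3x3 Gaussian smoothing G (Eq. (7)) with zero padding at the borders."""
    H, W = len(x), len(x[0])
    k = [[math.exp(-(a * a + b * b) / (2 * sigma * sigma))
          for b in (-1, 0, 1)] for a in (-1, 0, 1)]
    norm = sum(sum(row) for row in k)
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for a in range(3):
                for b in range(3):
                    r, c = i + a - 1, j + b - 1
                    if 0 <= r < H and 0 <= c < W:
                        acc += k[a][b] * x[r][c]
            out[i][j] = acc / norm
    return out

def begl_loss(y, y_hat):
    H, W = len(y), len(y[0])
    # L_G, Eq. (8): L2 norm of the difference of smoothed gradient maps.
    gy, gp = gauss_smooth(grad_mag(y)), gauss_smooth(grad_mag(y_hat))
    l_g = math.sqrt(sum((gy[i][j] - gp[i][j]) ** 2
                        for i in range(H) for j in range(W)))
    # L_BCE, Eq. (9), with beta = 1 - sum(y) / (H * W).
    beta = 1.0 - sum(map(sum, y)) / (H * W)
    l_bce = 0.0
    for i in range(H):
        for j in range(W):
            p = min(max(y_hat[i][j], EPS), 1.0 - EPS)
            l_bce -= (beta * y[i][j] * math.log(p)
                      + (1.0 - beta) * (1.0 - y[i][j]) * math.log(1.0 - p))
    return L1 * l_g + L2 * l_bce / (H * W)

y_true = [[0.0, 1.0], [0.0, 1.0]]
print(begl_loss(y_true, y_true) < begl_loss(y_true, [[0.5] * 2] * 2))  # prints True
```

A perfect prediction drives both terms toward zero, while a blurred or uncertain prediction is penalized by both the boundary term and the weighted BCE term.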

### Segmentation network

This section provides the details of the segmentation network used in our work. In order to fully leverage the spatial context and boundary information of pecked rock-art data for accurate petroglyph image segmentation, the new BEGL loss function is built into a rock-art image segmentation network (BEGL-UNet), inspired by the work in [22]. The BEGL-UNet architecture is shown in Fig. 4. It consists of an encoder-decoder structure forming a U-shape. Each encoder step applies max-pooling, which halves the image size, followed by a double convolution, which doubles the number of feature maps. Each decoder step comprises three parts: a bilinear upsampling operation that doubles the feature map size, a concatenation of the feature maps from the corresponding encoder layer via skip connections, and lastly a double convolution that halves the number of feature maps. The skip connections enable the model to use multiple scales of the input to generate the output, propagating more semantic information between the two paths and thereby enabling more accurate segmentation.
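The channel and size bookkeeping implied by this description can be sketched as follows. The input size of 256, base width of 64, and depth of 4 are illustrative assumptions, not values taken from the paper; the sketch only checks that each decoder output matches its skip connection.

```python
# Shape bookkeeping for a U-shaped encoder-decoder: each encoder step
# halves the spatial size and doubles the channels; each decoder step
# doubles the size and, after concatenation with the matching skip
# connection, a double convolution halves the channels again.
def unet_shapes(size=256, base=64, depth=4):
    enc = [(size, base)]
    s, c = size, base
    for _ in range(depth):                      # max-pool + double conv
        s, c = s // 2, c * 2
        enc.append((s, c))
    dec = []
    for skip_s, skip_c in reversed(enc[:-1]):   # upsample + concat + double conv
        s, c = s * 2, c // 2
        assert (s, c) == (skip_s, skip_c)       # skip shapes must match
        dec.append((s, c))
    return enc, dec

enc, dec = unet_shapes()
print(enc)  # [(256, 64), (128, 128), (64, 256), (32, 512), (16, 1024)]
print(dec)  # [(32, 512), (64, 256), (128, 128), (256, 64)]
```

The symmetry of the two printed lists is exactly what makes the skip-connection concatenations in Fig. 4 well-defined.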