Face repairing based on transfer learning method with fewer training samples: application to a Terracotta Warrior with facial cracks and a Buddha with a broken nose

In this paper, a method based on transfer learning is proposed to recover the three-dimensional shape of cultural relic faces from a single old photo. It simultaneously reconstructs the three-dimensional facial structure and aligns the texture of the cultural relics using few training samples. The UV position map is used to represent the three-dimensional shape in space and serves as the output of the network. A convolutional neural network is used to reconstruct the UV position map from a single 2D image. In the training process, human face data is used for pre-training, and then a small amount of artifact data is used for fine-tuning. In this way, a deep learning model with strong generalization ability is trained with little artifact data, and a three-dimensional model of the cultural relic face can be reconstructed from a single old photograph. The method can train more complex deep networks without a large amount of cultural relic data and without overfitting, which effectively addresses the shortage of cultural relic samples. The method is verified by restoring a Chinese Terracotta Warrior with facial cracks and a Buddha with a broken nose. Other applications include texture recovery, facial feature extraction, and three-dimensional model estimation of damaged cultural relics or sculptures in photos.


Introduction
Over time, many sculptures have been eroded by wind and rain or destroyed by war, suffering varying degrees of damage. In the early days, many artifacts were not scanned by structured light or archived in time. Moreover, photography has a history of only about two hundred years, far shorter than the age of many cultural relics; for example, the Chinese Terracotta Warriors, known as the "eighth wonder of the world", have a history of more than 2000 years [1]. Many cultural relics had already been damaged by the time they were photographed, which increases the difficulty of restoration. For example, Fig. 1 shows photographs of a Chinese Terracotta Warrior with facial cracks and a Buddha with a broken nose. Although some old photographs taken before the relics were damaged can be found, restoration based on the craftsmen's experience and the information in the old photographs is highly uncertain, and the results are strongly affected by differences in the individual experience of the craftsmen. Additionally, in many cases the number of photos collected is too small, and the photos are not necessarily from the same period, so it is difficult to repair the relics by multi-view methods [2][3][4]. The information obtained from a single photo is ambiguous, and it is difficult to determine the three-dimensional model of the artifact from a single photo.
Deep learning is a family of algorithms that use artificial neural networks as a framework to represent and learn from data. It has been applied in many fields, and convolutional neural networks have solved many computer vision problems [5][6][7][8][9][10]. In 1999, Blanz and Vetter proposed the 3D morphable model (3DMM) [11], but the method relies on the accuracy of feature points and detectors [12,13]. Recently, convolutional neural networks (CNNs) have been used to predict 3DMM parameters, but they are time-consuming [14][15][16][17]. In addition, unsupervised learning methods have been proposed that regress 3DMM parameters without training data, but they are not effective for occluded and non-frontal faces [18,19]. Although PRNet [12] performs well, is lightweight, and can realize real-time facial reconstruction, it requires a large amount of training data and is therefore not suitable when samples are scarce. Wang et al. [20] use information from other similar artifacts to repair missing artifacts, an approach that is difficult to apply when no similar data are available.
Facial texture alignment is a long-standing problem in computer vision. There are many methods for two-dimensional planes, such as the classic Active Appearance Model (AAM) and the Constrained Local Model (CLM) [21][22][23]. Later, neural network-based methods achieved better results but require more training data and cannot handle occlusion [24,25]. Recently, some works have begun to regress the 3DMM parameters [14][15][16][25][26][27][28][29][30][31][32][33], but they still need more data, which makes them unsuitable for cultural relics restoration. Such model-based methods achieve better three-dimensional face reconstruction and alignment, but large samples are needed for training. In the restoration of cultural relics, the small amount of data leads to overfitting, which makes it difficult to achieve good results.
Recently, convolutional neural networks have been used to estimate depth from a single photo [34][35][36][37], but this approach cannot handle occluded regions. As shown in previous works [20,38,39], a result can also be obtained by aligning and averaging other images of a similar type. This method has been used to repair sculptures that have suffered severe damage. However, it relies on many similar pictures of cultural relics and only achieves good results when the artifacts differ little from one another. Moreover, the method can only restore the image and cannot restore the three-dimensional model of the lost artifacts. Many works [26,32,33,40,41] use a large amount of training data and have achieved good results in facial reconstruction, better handling large poses and occlusion. However, these deep learning methods also require a large number of training samples, and the data of cultural relics is very limited. Even with data expansion methods such as translation, rotation, flipping, color-channel scaling, and the addition of random image noise, it is difficult to train a deep learning model with strong generalization ability.
In order to recover the three-dimensional shape of cultural relics from a single photo, this paper proposes a training method based on transfer learning. It effectively addresses the shortage of cultural relic training samples when deep learning is used to recover the three-dimensional shape of cultural relics, and it improves the credibility of cultural relics restoration. The method extracts features from a single photo and reconstructs the three-dimensional model of the artifact surface with good results. The model is first pre-trained on a face dataset so that it learns the basic facial features, which provides a good initial value for subsequent feature refinement. Then, a small amount of cultural relic data is used to fine-tune all the weights so that the model learns the difference between human faces and the faces of artifacts. The trained convolutional neural network can effectively extract the features of the facial surface, reconstruct the sculpted facial shape from a single old black-and-white photograph, and realize the texture alignment. The network trained by the method of this paper also effectively handles occlusion, side faces, and shadows. In addition, it can extract features from the face of damaged artifacts and estimate the geometry before the damage.

The study motivation and main contributions
The purpose of this work is to train a more generalized model using less training data, to reconstruct the three-dimensional shape and texture of the cultural relic using the information provided by a single old photo, and to repair the damaged parts. The main contributions of this study are: (1) A deep neural network with strong generalization ability is trained with a small amount of cultural relic data based on transfer learning, realizing the reconstruction and texture alignment of cultural relics from a single old photograph. Training with a small amount of data not only matches the current situation of limited cultural relic samples but also offers fast training and low computational cost. (2) The problems of side faces and shadows in old photos are overcome, and face reconstruction and texture alignment are realized in these situations. (3) The network trained using the method of this paper can extract features from photos of damaged artifacts and estimate the model before the damage.

Methods
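In this section, the details of the training methods proposed in this paper are described. The collected cultural relic data is first expanded by translation, rotation, flipping, color-channel scaling, and the addition of random image noise [29,42]. For the expanded data, cross-validation is used for training to improve data utilization efficiency.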

The representation of facial data
The purpose of this paper is to extract features from old photographs and to regress the three-dimensional parameters of the facial model. Therefore, the raw data has to be converted into representations suitable for the input and output of the neural network. If the three-dimensional data of the model is flattened into one-dimensional data and regressed with fully connected layers, the spatial position information of the three-dimensional model is lost, and the adjacency of points in space is no longer reflected. At the same time, the use of fully connected layers greatly increases the number of parameters that need to be trained. Moreover, because artifact data is scarce, the possibility of overfitting increases, so this representation is not used. Fan et al. [43] used a point cloud to represent the 3D model, but the maximum number of points is only 1024, which cannot effectively express the details of the face model. In order to preserve the positional relationships between points in the data set, and following previous works [12][44][45][46][47], the UV position map is used to record the facial structure information for alignment and reconstruction and also serves as the output of the neural network.
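To make the parameter argument concrete, the following back-of-the-envelope sketch compares a hypothetical fully connected output layer with a single 4 × 4 convolutional layer. It assumes the 8 × 8 × 512 feature map and the 256 × 256 × 3 position map given in the network structure section; the 512 to 256 channel pair used for the convolution is an illustrative assumption, not a value reported in this paper.

```python
# Rough parameter count: fully connected regression vs. one convolutional layer.
# Sizes come from the network description (8x8x512 features, 256x256x3 output);
# the 512 -> 256 channel pair below is an illustrative assumption.

def dense_weights(n_inputs: int, n_outputs: int) -> int:
    """Number of weights in a fully connected layer (biases ignored)."""
    return n_inputs * n_outputs


def conv_weights(kernel: int, in_channels: int, out_channels: int) -> int:
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


feature_size = 8 * 8 * 512          # flattened bottleneck feature map
position_map_size = 256 * 256 * 3   # flattened x, y, z coordinates

fc_params = dense_weights(feature_size, position_map_size)
conv_params = conv_weights(4, 512, 256)

print(f"fully connected: {fc_params:,} weights")
print(f"4x4 convolution: {conv_params:,} weights")
print(f"ratio: {fc_params // conv_params}x")
```

Under these assumptions a single fully connected layer would need several billion weights, which illustrates why an image-like output such as the UV position map is attractive when only a few artifact samples are available.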

The method of model training
The purpose of using old photos to restore the faces of cultural relic sculptures is to extract facial features from a single old photo so as to construct the parameters of the 3D face model. Extracting features from 2D images with deep learning can be achieved through convolutional neural networks. However, this paper needs to learn a neural network model with strong generalization ability from a small amount of data, so it should make full use of the limited cultural relic data, fully mine the characteristics of the image data, and introduce prior information about facial features to eliminate the ambiguity of a single photo. A relatively simple idea is to directly expand the cultural relics data set by translation, rotation, flipping, color-channel scaling, the addition of random image noise, etc. Nevertheless, the combinations of facial features in the expanded data lack diversity, and the model is still difficult to train.
In addition, the generalization ability of such a model is not high. To solve these problems, this paper proposes a transfer learning method. In the past, transfer learning has mainly been used for classification problems. Combined with a small amount of data in the training set, it can make better use of the features of the pre-trained model and classify the target more accurately [48]. This paper first uses face data to train a large network, which preserves the information contained in the human face data. The network has many different low-level and high-level kernels. The features extracted in this way are similar to the features extracted from the faces of cultural relics, which provides a good initial value for fine-tuning on the three-dimensional data of the cultural relics. After pre-training on face data, the model is trained on the artifact data to eliminate the difference between human faces and artifact faces. Because the cultural relic data is very limited, this paper uses ten-fold cross-validation when training on it in order to make full use of the data. The specific training process is shown in Fig. 2.
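The two-stage procedure can be illustrated with a deliberately small toy example. The sketch below is not the network used in this paper: it replaces the convolutional network with a linear model and uses synthetic data, so that the pre-training step, the fine-tuning of all weights, and the ten-fold split can be shown in a few lines. The function names, data sizes, learning rate, and step count are all hypothetical.

```python
import numpy as np


def k_fold_indices(n_samples, k, rng):
    """Split shuffled sample indices into k nearly equal validation folds."""
    return np.array_split(rng.permutation(n_samples), k)


def train(weights, x, y, lr=0.05, steps=200):
    """Plain gradient descent on the mean squared error of a linear model y = x @ W."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2.0 * x.T @ (x @ w - y) / len(x)
        w -= lr * grad
    return w


def mse(weights, x, y):
    return float(np.mean((x @ weights - y) ** 2))


rng = np.random.default_rng(0)
dim_in, dim_out = 8, 3

# Stage 1: pre-train on a large synthetic "source" set (stands in for face data).
w_source = rng.normal(size=(dim_in, dim_out))
x_src = rng.normal(size=(2000, dim_in))
w_pre = train(np.zeros((dim_in, dim_out)), x_src, x_src @ w_source)

# Stage 2: fine-tune on a small synthetic "target" set (stands in for artifact data)
# whose mapping differs slightly from the source mapping.
w_target = w_source + 0.3 * rng.normal(size=(dim_in, dim_out))
x_tgt = rng.normal(size=(40, dim_in))
y_tgt = x_tgt @ w_target

before, after = [], []
for val_idx in k_fold_indices(len(x_tgt), 10, rng):
    train_idx = np.setdiff1d(np.arange(len(x_tgt)), val_idx)
    # Every fold starts from the pre-trained weights and updates all of them.
    w_fold = train(w_pre, x_tgt[train_idx], y_tgt[train_idx])
    before.append(mse(w_pre, x_tgt[val_idx], y_tgt[val_idx]))
    after.append(mse(w_fold, x_tgt[val_idx], y_tgt[val_idx]))

mean_before = float(np.mean(before))
mean_after = float(np.mean(after))
print(f"validation MSE before fine-tuning: {mean_before:.4f}")
print(f"validation MSE after fine-tuning:  {mean_after:.4f}")
```

The point of the sketch is only the control flow: the pre-trained weights act as the initial value, and each held-out fold measures how well the fine-tuned model transfers to unseen target samples.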

Network structure and loss function
The network structure of this paper draws on convolutional neural networks and residual neural networks. The input of the network is a single-channel black-and-white image. This paper adopts an end-to-end learning method. The first half of the network is a convolutional residual network: a convolutional layer followed by 10 residual blocks converts the 256 × 256 × 1 single-channel input image into an 8 × 8 × 512 feature map. The second half also combines convolution and residual structures; it contains 17 convolutional layers and finally generates a 256 × 256 × 3 position map. All convolutional layers use a filter size of 4 × 4. ReLU is selected as the activation function, which effectively alleviates the vanishing gradient problem in training. All the functions described in this article are implemented by a single network with the above structure. The specific network structure is shown in Fig. 3.
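The text fixes only the input size, the number of residual blocks, and the size of the resulting feature map. One configuration consistent with these numbers is sketched below as a shape trace. It assumes, as in the PRNet design [12], that the first convolution produces 16 channels and that every other residual block uses stride 2 and doubles the channel count; these choices are assumptions, not values stated in this paper.

```python
import math


def conv_out(size: int, stride: int) -> int:
    """Spatial size after a strided convolution with 'same' padding."""
    return math.ceil(size / stride)


def trace_encoder(height=256, width=256, base_channels=16, n_blocks=10):
    """Return the (H, W, C) shape after the first conv and after every residual block."""
    shapes = []
    # Assumed first convolution: stride 1, 1 -> base_channels channels.
    h, w, c = height, width, base_channels
    shapes.append((h, w, c))
    for i in range(n_blocks):
        if i % 2 == 0:
            # Assumed: blocks 1, 3, 5, 7, 9 downsample by 2 and double the channels.
            h, w, c = conv_out(h, 2), conv_out(w, 2), c * 2
        shapes.append((h, w, c))
    return shapes


shapes = trace_encoder()
for i, shape in enumerate(shapes):
    name = "conv" if i == 0 else f"res{i}"
    print(f"{name:>5}: {shape}")
```

Five stride-2 blocks halve the 256-pixel side five times, which yields the 8 × 8 spatial size, and doubling from 16 channels five times yields 512 channels.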
MSE is generally chosen as the objective function for commonly used networks [30,49]. However, the MSE loss function treats all areas of the face equally, whereas the center of the face has more detailed features than other parts. As a loss function, MSE cannot emphasize key points and is not suitable for learning position maps [12]. Therefore, in this paper, different weights are assigned to different parts of the face. The weighting method follows [12]. Since the features of the face center are more pronounced, the center of the face receives a higher weight. The specific ratio is 16:4:3:0.
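A weighted loss of this kind can be sketched as follows. The ratio 16:4:3:0 is taken from the text; the assignment of the four weights to regions is an assumption borrowed from PRNet [12] (facial keypoints, eye/nose/mouth area, remaining face area, neck), and averaging over all pixels is one possible normalization. The function name and the label encoding are hypothetical.

```python
import numpy as np

REGION_WEIGHTS = np.array([16.0, 4.0, 3.0, 0.0])  # ratio 16:4:3:0 from the text


def weighted_position_loss(pred, target, region_labels):
    """Weighted squared error between two H x W x 3 position maps.

    region_labels is an H x W integer array with values 0..3 selecting a weight
    from REGION_WEIGHTS; which facial region gets which label is an assumption.
    """
    weights = REGION_WEIGHTS[region_labels]          # H x W weight mask
    sq_err = np.sum((pred - target) ** 2, axis=-1)   # H x W, summed over x, y, z
    return float(np.mean(weights * sq_err))


# Tiny 2 x 2 example with one pixel per region.
labels = np.array([[0, 1], [2, 3]])
target = np.zeros((2, 2, 3))
pred = np.ones((2, 2, 3))            # every coordinate is off by 1
loss_all = weighted_position_loss(pred, target, labels)

# An error confined to the weight-0 region does not contribute to the loss.
pred_ignored = np.zeros((2, 2, 3))
pred_ignored[1, 1] = 5.0
loss_ignored = weighted_position_loss(pred_ignored, target, labels)
print(loss_all, loss_ignored)
```

With a plain MSE, the second example would be penalized as heavily as an error at a facial keypoint; with the weight mask it is ignored entirely.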

Training details
The data in 300W-LP [26] was used for pre-training the model. Since the images in the training set are all color images, they are first converted to black-and-white images to match the input of the model. The corresponding 3D point cloud is generated according to the parameters given by the dataset. By meshing and parameterizing the 3D point cloud and then mapping the attributes of the point cloud (such as color and coordinates) to UV space, a UV texture map is obtained. Replacing the RGB components in the UV texture map with the Cartesian coordinates of the 3D point cloud yields the UV position map. In order to achieve better training results, the face area in each image is cropped and the large background area is removed, so that the human face is the main subject of the image.
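The idea of storing coordinates instead of colors can be shown with a minimal sketch. It assumes that each vertex already has UV coordinates in [0, 1] and writes each vertex to its nearest pixel; a full pipeline would rasterize the mesh triangles and interpolate between vertices, which is omitted here. The function names are hypothetical.

```python
import numpy as np


def uv_to_pixels(uv, size):
    """Convert per-vertex (u, v) in [0, 1] to integer (row, col) pixel indices."""
    cols = np.rint(uv[:, 0] * (size - 1)).astype(int)
    rows = np.rint(uv[:, 1] * (size - 1)).astype(int)
    return rows, cols


def build_uv_position_map(vertices, uv, size=256):
    """Scatter N x 3 vertex coordinates into a size x size x 3 UV position map.

    The three channels hold x, y, z instead of R, G, B. Pixels that receive no
    vertex stay at zero in this simplified version.
    """
    pos_map = np.zeros((size, size, 3), dtype=np.float32)
    rows, cols = uv_to_pixels(uv, size)
    pos_map[rows, cols] = vertices
    return pos_map


def read_vertices(pos_map, uv):
    """Look the vertex coordinates back up at their UV pixels."""
    rows, cols = uv_to_pixels(uv, pos_map.shape[0])
    return pos_map[rows, cols]


# Three vertices of a toy mesh and their UV coordinates.
vertices = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
uv = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])

pos_map = build_uv_position_map(vertices, uv, size=5)
recovered = read_vertices(pos_map, uv)
print(pos_map.shape, recovered)
```

Because neighboring pixels in the map correspond to neighboring points on the surface, a convolutional network can regress the map directly while the spatial adjacency of the points is preserved.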
The collected artifact data is located, cropped, and scaled to a size of 256 × 256 to match the model. Similarly, the images of 300W-LP are located, cropped, and scaled to 256 × 256. In order to increase the amount of data and improve the generalization ability of the model, several data transformations are applied. The main transformations are translation, rotation, flipping, color-channel scaling, and the addition of random image noise. The translation randomly shifts the image vertically and horizontally by up to 15%. The scaling factor for the image color channels ranges from 0.5 to 1.5. The original data is randomly rotated within a range of −45 degrees to 45 degrees. The image is flipped vertically and horizontally, so that together with the rotation range all possible orientations are covered. The variance of the random noise added to the original image is 200. The Adam optimizer is used during training. The learning rate starts from 0.001 and is halved every 5 epochs. The batch size in training is 16.
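The transformations and the learning-rate schedule can be sketched with NumPy as follows. The numeric ranges come from the text; everything else is an assumption made for illustration: intensities are taken to lie in [0, 255], flips are applied with probability 0.5 each, rotation uses nearest-neighbor sampling with zero fill, and the helper names are hypothetical. The sketch transforms only the input image; in real training the geometric transformations would also have to be applied consistently to the ground-truth position map.

```python
import numpy as np


def translate(image, dy, dx):
    """Shift a 2D image by (dy, dx) pixels, filling uncovered pixels with zero."""
    h, w = image.shape
    out = np.zeros_like(image)
    out[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)] = \
        image[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    return out


def rotate(image, degrees):
    """Rotate a 2D image about its center using nearest-neighbor inverse mapping."""
    h, w = image.shape
    theta = np.deg2rad(degrees)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # For each output pixel, find the source pixel it comes from.
    x0 = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    y0 = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    xi, yi = np.rint(x0).astype(int), np.rint(y0).astype(int)
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros_like(image)
    out[valid] = image[yi[valid], xi[valid]]
    return out


def augment(image, rng):
    """Apply one random combination of the transformations listed in the text."""
    h, w = image.shape
    out = image.astype(np.float64)
    max_dy, max_dx = int(0.15 * h), int(0.15 * w)        # shift within 15%
    out = translate(out, int(rng.integers(-max_dy, max_dy + 1)),
                    int(rng.integers(-max_dx, max_dx + 1)))
    out = rotate(out, rng.uniform(-45.0, 45.0))          # -45 to 45 degrees
    if rng.random() < 0.5:                               # vertical flip (assumed p = 0.5)
        out = out[::-1, :]
    if rng.random() < 0.5:                               # horizontal flip (assumed p = 0.5)
        out = out[:, ::-1]
    out = out * rng.uniform(0.5, 1.5)                    # intensity scaling
    out = out + rng.normal(0.0, np.sqrt(200.0), size=out.shape)  # noise variance 200
    return np.clip(out, 0.0, 255.0)


def learning_rate(epoch, base_lr=0.001, halve_every=5):
    """Learning rate that starts at base_lr and is halved every halve_every epochs."""
    return base_lr * 0.5 ** (epoch // halve_every)


rng = np.random.default_rng(0)
sample = augment(np.full((32, 32), 128.0), rng)
print(sample.shape, [learning_rate(e) for e in (0, 5, 10)])
```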

Terracotta Warriors data reconstruction and mapping test
First, the proposed method is tested on the Terracotta Warriors. The data from 300W-LP was first used for pre-training, then four Terracotta Warriors of the same period were used for fine-tuning, and finally, other data was used for testing. An example of repair is shown in Fig. 4, which presents the test of 3D reconstruction and texture alignment on the Terracotta Warriors. Figure 4a is the input picture of the Terracotta Warrior, and the three-dimensional models of the Terracotta Warrior face without and with texture are shown in Fig. 4b-e, respectively. The iterative closest point (ICP) method is an algorithm that iteratively seeks the optimal rigid transformation between two point clouds under certain constraints, aiming to align them as closely as possible. ICP is used to find the corresponding nearest points between the restored 3D model and the original 3D model, and the resulting error is normalized by the outer interocular distance. The mean square error (MSE) over all test samples was 4.19. As can be seen from the figure, the reconstructed shape and the mapped texture correspond well.
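The evaluation steps can be sketched as follows. This is a minimal point-to-point ICP with brute-force nearest neighbors and a Kabsch rigid fit, followed by one possible reading of the normalized error (the mean of squared point distances after dividing by the outer interocular distance). The exact error definition and scaling behind the reported values are not specified in the text, so this sketch does not attempt to reproduce them; the function names and the landmark indices are hypothetical.

```python
import itertools

import numpy as np


def nearest_neighbors(src, dst):
    """Index of the closest dst point for every src point (brute force)."""
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def best_rigid_transform(src, dst):
    """Least-squares rotation r and translation t with r @ src_i + t ~ dst_i (Kabsch)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    h = (src - src_c).T @ (dst - dst_c)
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))       # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return r, dst_c - r @ src_c


def icp(src, dst, iterations=20):
    """Rigidly align src to dst; returns the aligned points and their matches in dst."""
    aligned = src.copy()
    for _ in range(iterations):
        idx = nearest_neighbors(aligned, dst)
        r, t = best_rigid_transform(aligned, dst[idx])
        aligned = aligned @ r.T + t
    return aligned, nearest_neighbors(aligned, dst)


def normalized_mse(recon, truth, eye_a, eye_b):
    """Mean squared point distance after ICP, normalized by the distance between
    two ground-truth landmarks (here standing in for the outer eye corners)."""
    aligned, idx = icp(recon, truth)
    iod = np.linalg.norm(truth[eye_a] - truth[eye_b])
    dist = np.linalg.norm(aligned - truth[idx], axis=1) / iod
    return float(np.mean(dist ** 2))


# Toy check: a 4 x 4 x 4 grid, slightly rotated and shifted, should align exactly.
truth = np.array(list(itertools.product(range(4), repeat=3)), dtype=float)
angle = np.deg2rad(2.0)
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle), np.cos(angle), 0.0],
                [0.0, 0.0, 1.0]])
recon = truth @ rot.T + np.array([0.05, -0.03, 0.02])
error = normalized_mse(recon, truth, eye_a=0, eye_b=3)
print(f"normalized error after ICP: {error:.2e}")
```

Normalizing by a face-specific length makes errors comparable between statues of different sizes, which is why the interocular distance is commonly used for this purpose.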

Reconstruction map test under small sample
The method was also tested with a very small sample: the data in 300W-LP was first used for pre-training, then two Luo Han statues from the Northern Song Dynasty were used to fine-tune the model, and finally images with shadows and side faces were used to test the model. The mean square error (MSE) on all test samples was 5.73. The result of restoring the 3D model from the old photo is shown in Fig. 5. As can be seen from the figure, our model achieves good results in reconstructing the facial model from old photos with side faces and large shadows. However, the texture in occluded and shadowed regions is not well estimated.

Restoration of damaged artifacts from old photos
For the damaged Chinese objects in the old photos, we used the Terracotta Warrior data and the Buddha data for testing. In both tests, 300W-LP was used for pre-training. Then, four samples of the Terracotta Warrior and the Buddha images were used for fine-tuning, and other data were used for testing.
Figure 6 shows the restoration of a damaged Terracotta Warrior from an old photo. Figure 6a is the input picture of the Terracotta Warrior, and Fig. 6b-e illustrate the three-dimensional model of the restored Terracotta Warrior without and with texture. Figure 7 shows the restoration of a damaged Buddha from an old photo. Figure 7a is the input picture of the Buddha, and Fig. 7b-e represent the three-dimensional model of the restored Buddha without and with texture.

Discussion and conclusion
Reconstructing and repairing the faces of cultural relics is of great significance for presenting the original appearance of relics, enhancing overall aesthetics and artistic value, and promoting historical and cultural research. This paper addresses the current situation of limited cultural relic samples by training a neural network with strong generalization ability based on transfer learning, achieving reconstruction and texture alignment of cultural relics from a single photo. Compared with previous methods, the method in this paper requires fewer relic samples and can address issues such as occlusion, side faces, and shadows. Additionally, both the facially damaged Terracotta Warrior and the broken-nosed Buddha have been effectively repaired, providing a model for similar relic restoration problems. However, due to the limited number of relics, this paper only tested two types of relics. In future research, more relics with facial defects can be reconstructed to verify the algorithm's generalization ability. Furthermore, the damaged areas of the relics in this paper are not extensive, and subsequent studies can investigate cases with large areas of facial damage.
In summary, this paper proposes a method of training end-to-end networks by transfer learning. The results show that the method makes it possible to train a deep neural network model with strong generalization ability using a small amount of data. The Terracotta Warrior and Buddha are taken as examples to demonstrate that the model can extract information from a single old photo, establish a 3D face model, and align the texture. The model can also handle side faces, shadows, and occlusion in old photos. In addition, the model is further extended to feature extraction and 3D model reconstruction for old photos of damaged statues. After the restoration, texture alignment is realized with good results, which provides a reference for the restoration of cultural relics.

Fig. 1 Photograph of a Chinese Terracotta Warrior with facial cracks and a Buddha with a broken nose. The red panels indicate the damaged parts of the cultural relics

Fig. 2 Training process. The face data is first used for pre-training, the artifact training data is then used for fine-tuning, and finally the artifacts in the old photos are reconstructed

Fig. 3 The network structure of cultural relics repairing. a Input picture. b The network structure. The blue part represents the residual block, and the green part represents the transposed convolution block. c The repaired artifact model

Fig. 4 Test 3D reconstruction and texture alignment on Terracotta Warriors. a The input picture of the Terracotta Warrior. b-e Three-dimensional model of the Terracotta Warrior face without and with texture

Fig. 5 Test 3D reconstruction and texture alignment on two Luo Hans. a The input picture of the Luo Hans. b, c Three-dimensional model of the Luo Hans' faces. d, e Three-dimensional model of the Luo Hans with texture

First, the training data used in the training model is introduced. Then, the representation of the data in training is expounded. Finally, the training methods proposed in this paper are introduced, including the specific data and training methods used in different stages, as well as the data expansion method, network structure, and loss function.

three-dimensional model of the artifact. The data used is part of the data in the Shaanxi History Museum. For the collected cultural relics data, this paper first uses the methods of panning, rotating, folding, and zooming color channels, adding image random noise