DIGAN: distillation model for generating 3D-aware Terracotta Warrior faces
Heritage Science volume 12, Article number: 317 (2024)
Abstract
Utilizing Generative Adversarial Networks (GANs) to generate 3D representations of the Terracotta Warriors offers a novel approach for the preservation and restoration of cultural heritage. Through GAN technology, we can produce complete 3D models of the Terracotta Warriors’ faces, aiding in the repair of damaged or partially destroyed figures. This paper proposes a distillation model, DIGAN, for generating 3D Terracotta Warrior faces. By extracting knowledge from StyleGAN2, we train an innovative 3D generative network. G2D, the primary component of the generative network, produces detailed and realistic 2D images. The 3D generator modularly decomposes the generation process, covering texture, shape, lighting, and pose, ultimately rendering 2D images of the Terracotta Warriors’ faces. The model enhances the learning of 3D shapes through symmetry constraints and multi-view data, resulting in high-quality 2D images that closely resemble real faces. Experimental results demonstrate that our method outperforms existing GAN-based generation methods.
Introduction
Employing GAN technology to generate 3D representations of the Terracotta Warriors can significantly aid experts in the preservation and restoration of cultural heritage. Many of the warriors are damaged or partially destroyed; using GAN to create complete 3D models provides reference frameworks for restoration efforts. Digitizing the warriors’ features and creating 3D models substantially enhances methods of cultural heritage preservation and exhibition. These digital 3D models can be utilized in virtual museums, online exhibitions, and educational platforms, allowing global audiences to explore the intricate details of the Terracotta Warriors and amplifying their cultural impact. Generating 3D representations of the warriors’ faces using GAN has significant implications not only for heritage preservation and restoration, digital conservation and display, academic research, education, and the cultural creative industries but also for driving technological innovation and development. Through these applications, we can better protect and transmit invaluable cultural heritage while fostering new creativity and technological advancements.
Generative adversarial networks (GANs) [1,2,3], such as StyleGAN, can generate high-quality and diverse facial images. These networks learn features from large datasets and generate realistic faces through machine learning [4] and transfer learning [5]. These generated images are valuable across various applications, from entertainment and security to art and scientific research. However, these models have limitations. The primary issue is the randomness of the generated faces, which prevents users from directly controlling the semantic attributes of the images, such as pose and expression. This inability to specify details means users must accept the model’s randomly generated outcomes, posing a significant constraint for applications requiring precise control over image features. Current research aims to develop methods that provide control during the generation process, allowing users to adjust the semantic attributes, particularly 3D information like pose. This control is beneficial for applications such as facial recognition [6, 7] and image synthesis [8,9,10], as it enables the generation of faces with specific features, enhancing accuracy and flexibility. Most existing methods rely on external supervision [11], such as pose labels [12], landmarks, or synthetic images [13]. These external data are used to disentangle semantic factors in GAN’s latent space, enabling better control of these features during generation. Additionally, some methods explore unsupervised approaches, investigating the latent space [14,15,16] of GANs to achieve controllability of 3D features. These unsupervised methods rely on a deep understanding of data distribution and GAN’s internal mechanisms, employing clever designs to control the generated features.
Compared to solutions that only generate 2D images, constructing generative models with explicit 3D shapes offers significant advantages, including detailed adjustments to shape, pose, and texture. However, existing 3D generation solutions struggle with complex objects, particularly when generating fine-grained details and diverse facial expressions. 3D generative models [17,18,19,20] not only create realistic 3D forms but also provide stricter and more precise control over content. This control extends to detailed adjustments of object shape [21, 22], pose [23, 24], and texture [25, 26], better meeting specific application requirements. Recently, some research has attempted to train 3D generative models using 2D images, but these efforts mainly focus on relatively simple and coarse objects, like cars. For such objects, training data from 2D images can adequately support 3D shape reconstruction. However, this approach shows clear limitations when dealing with more complex objects. When applied to objects with fine-grained details, especially faces, existing 3D generation solutions encounter significant difficulties in learning reasonable 3D shapes. Faces possess rich details and varied expressions [27,28,29,30], imposing higher demands on generative models.
Despite significant advancements in GAN-based image generation techniques in recent years, particularly in generating high-quality 2D facial images, existing studies still face substantial challenges in producing 3D-aware facial models. These challenges primarily stem from the limited control over the generation of complex facial features and a lack of precise understanding of the generated images’ three-dimensional morphology. Current approaches often rely on external supervision data, which not only complicates data preparation but also restricts the broader application of these models. Additionally, these methods frequently fail to generate coherent and consistent shapes when handling large pose variations, resulting in suboptimal visual outcomes that do not meet practical application requirements.
To address these issues, we have developed a 3D generation model based on a distillation approach (DIGAN) that enables high-precision 3D facial generation of Terracotta Warriors without the need for external supervision. Our goal is to provide a more flexible and efficient technology for 3D facial generation, driving innovation in cultural heritage preservation and digital display.
In this paper, we develop a distillation model to generate 3D representations of the Terracotta Warriors’ faces. By refining and leveraging knowledge from StyleGAN2, we train a 3D generative network. Specifically, we employ an innovative approach that transfers the strengths and insights of the existing StyleGAN2 model to a novel 3D generator, enabling it to produce high-quality 3D models of the Terracotta Warriors’ faces. The G2D network, as the primary component, generates detailed and realistic 2D images through layered convolution operations and feature fusion. The main objective of the 3D generator is to modularly decompose G2D’s generation process, encompassing different 3D modules such as texture, shape, lighting, and pose. These modular 3D components, when integrated, render 2D images of the Terracotta Warriors’ faces. This decomposition and integration process results in images that are visually more realistic and three-dimensional. Through these steps, the G2D model acts as a teacher model, providing direct supervisory signals to G3D, thereby avoiding adversarial training. Additionally, the use of symmetry constraints and multi-view data effectively enhances the learning of 3D shapes. By modularizing StyleGAN2’s generation process, we construct a powerful 3D generative network capable of producing high-quality 2D images of the Terracotta Warriors. This new 3D generator not only inherits StyleGAN2’s ability to generate high-quality images but also improves image realism and three-dimensionality through modular design, resulting in visuals that closely resemble real human faces.
The main contributions of this paper can be summarized as follows:
- We developed a distillation-based framework, DIGAN, for generating 3D-aware models of the Terracotta Warriors’ faces.
- We constructed a G3D model comprising five sub-networks that together transform style codes into final rendered images.
- Our method produces high-quality 3D information for human faces and Terracotta Warriors’ faces, allowing precise control over pose and lighting.
Related work
Pose disentanglement in 2D GANs
Recent research on Generative Adversarial Networks (GANs) has demonstrated their capability to generate high-quality, realistic images. In recent years, significant advancements in the fineness and realism of image generation have been achieved, making GAN-generated images nearly indistinguishable from real photographs. Increasingly, studies are focused on disentangling various factors within images, such as pose, lighting, and background. These studies not only enable GANs to produce realistic images but also allow better control over different elements of the images, enhancing image editing and manipulation capabilities. The first category of methods introduces specific modules or loss functions during training to ensure the disentanglement of pose information. Specifically: Tran et al. [31] and Tian et al. [32] utilize pose labels during training to ensure the independence of pose information. Hu et al. [33] and Zhao et al. [34] employ landmarks and 3D Morphable Models (3DMM) to assist training, thereby improving pose disentanglement. Deng et al. [35] guide the learning of pose factors during GAN training using 3DMM, resulting in more natural and accurate pose generation. The CONFIG method [36] combines real and synthetic images for training to enhance GAN’s image quality and pose control capabilities. HoloGAN [37] uses a 3D feature projection module to generate rotatable face images, enabling manipulation in 3D space. The second category of methods manipulates pre-trained GAN networks to alter the output’s 3D information. Specifically: Labels [38] or 3DMM [39] are used to control 3D information in generated images by disentangling the latent space. Shen et al. [40] propose an unsupervised method to decompose the GAN’s latent space, achieving control over 3D information in generated images without additional label data. Despite the progress in generating high-quality images and disentangling pose information, these methods still have a notable drawback: the inability to explicitly output the object’s 3D shape. This limitation is particularly critical for applications requiring strict control over 3D information. Overall, while GANs have made significant strides in image quality and pose disentanglement, their limitation in explicitly outputting 3D shapes persists. This defect constrains their practicality in applications requiring precise 3D control. Future research may need to explore ways to overcome this limitation to achieve more comprehensive 3D image generation and control.
3D reconstruction and generation of 2D images
Research on reconstructing and generating 3D shapes from unlabelled 2D images aims to extract three-dimensional information from flat image data, producing realistic 3D shapes. In the domain of unsupervised reconstruction methods, Tulsiani et al. [41] proposed a method that uses multi-view consistency as a supervisory signal, leveraging consistency between images from different viewpoints to infer 3D shapes. Kanazawa et al. [42] used the consistency of similar objects to learn reconstruction models, improving accuracy by comparing and learning from multiple instances of the same class of objects. Wu et al. [43] presented a method that utilizes the symmetry of objects to learn detailed shape and reflectance information for more refined reconstruction. Henderson et al. [44,45,46] reconstructed 3D meshes using shadow information from synthetic images and extended this approach to generate models capable of creating new 3D shapes. In the realm of generative models, GAN architectures provide signals through adversarial loss, making the generated 3D shapes more realistic. Gadelha et al. [47] applied discriminators to the contours of generated voxels, enhancing the details of the 3D shapes produced. Lunz et al. [48] used commercial renderers to guide neural renderers in outputting shadowed images, increasing the realism of the generated images. Voxel-based methods, however, have limitations in recovering fine-grained details and 3D surface colors, often resulting in 3D shapes lacking sufficient detail and color information. Among the latest methods, Szabó et al. [49] employed vertex position mapping to train GANs with textured mesh outputs directly, yet they still encountered shape distortion issues during generation. Some studies use StyleGAN to generate multi-view synthetic data. For instance, Zhang et al. [50] improved data quality through manual annotation, while Pan et al. [51] iteratively synthesized data and trained reconstruction networks to enhance generation outcomes. This paper proposes a novel 3D generative model that innovatively addresses distortion and detail recovery issues in previous methods by simultaneously learning to manipulate StyleGAN2 generation and estimate 3D shapes. This approach produces more accurate and detailed 3D shapes, offering higher quality 3D reconstruction and generation results.
Methods
This section provides a detailed overview of the overall network architecture, the implementation details of the five sub-networks, and the loss functions used. First, the design of the entire network architecture is described, including how the various modules collaborate. Next, the specific implementation methods and technical details of each sub-network are elaborated, covering the generation of texture and shape, as well as adjustments to lighting and pose. Finally, the section discusses the loss functions employed, which play a crucial role during training to ensure the quality and realism of the generated images.
Overall framework
As shown in Fig. 1, our network first implements G2D, based on StyleGAN2, to generate 2D images of Terracotta Warriors, serving as a teacher model to supervise the learning process of G3D. For each randomly sampled style code passed from StyleGAN2 into G3D, five sub-networks are utilized: DV, DL, DS, DT, and DW. These sub-networks are each responsible for different tasks, working together to complete the transformation from style code to the final rendered image. The sub-network DV, using the KAN structure, maps the randomly sampled feature \({w}\) to a 6-dimensional viewpoint representation \({V}\). This viewpoint representation includes translation and rotation information of the image, determining the perspective of the final rendered image. The sub-network DL, also using the KAN structure, decodes the style code \({w}\) into a 4-dimensional output \({L}\). This output \({L}\) includes the x–y direction of the light, ambient light, and diffuse light intensity, thus determining the lighting effects of the image. The sub-network DW generates texture mappings through StyleGAN2, defining the detailed texture of the image. The sub-network DS generates depth maps, representing the depth information of different parts of the image. The sub-network DT generates transformation maps, describing the shape and positional transformations within the image. DS and DT map the disentangled style code \({w_0}\) to shape representation \({S}\) and transformation mapping \({T}\). This separate handling of shape and transformation makes the disentangling of styles clearer and more intuitive.
Using G3D, the style code \({w}\) is converted into three-dimensional shapes and views, and a differentiable renderer \({R}\) generates 2D images \({I_{3D}}\). The images generated by G2D, \(I_{2D}\), are used as a supervisory signal, and a loss function is defined to measure the difference between \({I_{2D}}\) and \({I_{3D}}\). By minimizing the loss function through backpropagation, the parameters of the G3D model are updated, making the generated images \(I_{3D}\) as close as possible to the images \({I_{2D}}\) generated by G2D. The renderer \(R\) uses shape representation \({S}\), transformation mapping \({T}\), viewpoint representation \({V}\), and lighting effects \({L}\) to output the final rendered image. This process integrates the outputs of all sub-networks, achieving the transformation from style code \({w}\) to rendered image \({I_w}\). Moreover, the depth maps and texture maps obtained from the aforementioned sub-networks are correlated so that each depth value can be mapped to the corresponding texture position, generating realistic 3D appearances of the Terracotta Warriors. By combining the depth maps and transformation maps, complex shape representations and transformations can be realized.
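To make the data flow concrete, the following is a minimal training-step sketch of this distillation loop. The module names (g2d for the frozen StyleGAN2 teacher, d_v, d_l, d_w, d_s, d_t for the five sub-networks, and render for the differentiable renderer) are illustrative assumptions, and the actual objective additionally includes the perceptual and regularization terms described later.

```python
import torch
import torch.nn.functional as F

def train_step(w, g2d, d_v, d_l, d_w, d_s, d_t, render, optimizer):
    """One distillation step: the frozen teacher G2D supervises the G3D sub-networks."""
    with torch.no_grad():
        i_2d = g2d(w)                      # teacher image I_2D used as the supervisory signal

    v = torch.tanh(d_v(w))                 # 6-D viewpoint V (rotation + translation)
    l = torch.tanh(d_l(w))                 # 4-D lighting L (direction, ambient, diffuse)
    tex = d_w(w)                           # texture map from DW
    shape = d_s(w)                         # depth map S from DS
    trans = d_t(w)                         # transformation map T from DT

    i_3d = render(tex, shape, trans, v, l) # differentiable rendering of the student image I_3D
    loss = F.l1_loss(i_3d, i_2d)           # reconstruction term (perceptual/regularizers omitted)

    optimizer.zero_grad()
    loss.backward()                        # gradients flow only into the G3D sub-networks
    optimizer.step()
    return loss.item()
```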
G3D student model
The primary goal of the G3D student generator is to modularize the G2D generation process, encompassing various 3D components such as texture, shape, lighting, and pose. Acting as the teacher model, G2D provides direct supervisory signals for G3D, circumventing the need for adversarial training. Additionally, the use of symmetry constraints and multi-view data effectively enhances the learning of 3D shapes. The G3D student generator not only inherits the advantage of high-quality image generation from StyleGAN2 but also improves the realism and three-dimensional effect of the images through its modular design, making the visual representation of the Terracotta Warriors more lifelike.
DV and DL Sub-networks
Through the KAN structure, sub-networks DV and DL process the input style code \({w}\), generating information related to the image’s viewpoint and lighting effects. Specifically, sub-network DV produces a 6-dimensional viewpoint representation \({V}\), which includes translation and rotation information, thus determining the image’s perspective. Sub-network DL generates a 4-dimensional lighting information \({L}\), encompassing the direction of light, ambient light, and diffuse light intensity, thereby determining the image’s lighting effects. This approach allows for flexible adjustments to the image’s perspective and lighting effects based on different style codes \({w}\), achieving highly controllable image rendering. The detailed implementation is as follows. The DV sub-network outputs a vector that contains viewpoint features. Similar to the lighting information, we also constrain the viewpoint information. This is achieved by applying the Tanh activation function to limit the values within the range of [− 1, 1]. Randomly sampled feature \({w}\) is used to generate the viewpoint information \({V}\), which is produced by the viewmlp module:
The DL sub-network generates a vector containing lighting features. To ensure that the lighting information remains within a reasonable range, we typically constrain it. The output is limited to the range of [− 1, 1] using the Tanh activation function.
Given the randomly sampled feature \({w}\), the lighting information \({L}\) is produced by the lightmlp module: \(L = \tanh (\text {lightmlp}(w))\).
The MLP is responsible for generating the latent space vectors, consisting of multiple layers and including components such as KANLinear, GroupNorm, and the LeakyReLU non-linear activation function. Inspired by the literature, the specific MLP can be represented as follows:
The overall representation of the MLP is: \(\textbf{y} = (f_n \circ f_{n-1} \circ \cdots \circ f_1)(\textbf{x}),\)
where each layer \({f_i}\) is defined as \(f_i(\textbf{x}) = \text {LeakyReLU}(\text {GroupNorm}(\text {KANLinear}_i(\textbf{x})))\).
The KAN operation is defined as: \(\text {KAN}(\textbf{x}) = (\Phi _L \circ \Phi _{L-1} \circ \cdots \circ \Phi _1)(\textbf{x}),\)
where \({\Phi _i}\) represents the \({i}\)-th layer of the entire KAN network.
For the final layer:
This representation ensures a structured and effective transformation of the input latent vector \(\textbf{x}\) through multiple layers, incorporating normalization and activation functions to achieve the desired output \(\textbf{y}\).
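As a rough illustration of this layer stack, the sketch below builds a small head that maps \(w\) to a bounded 6-dimensional viewpoint or 4-dimensional lighting vector. The depth, hidden width, and the use of nn.Linear as a stand-in for the KANLinear layer are assumptions made so the sketch runs on its own.

```python
import torch
import torch.nn as nn

class ViewOrLightHead(nn.Module):
    """KAN-style decoding head: (KANLinear -> GroupNorm -> LeakyReLU) x depth, then Tanh.

    nn.Linear is used here as a placeholder for the paper's KANLinear layer.
    """
    def __init__(self, w_dim=512, hidden=256, out_dim=6, depth=3):
        super().__init__()
        layers, d = [], w_dim
        for _ in range(depth):
            layers += [nn.Linear(d, hidden),          # placeholder for KANLinear
                       nn.GroupNorm(8, hidden),
                       nn.LeakyReLU(0.2)]
            d = hidden
        layers += [nn.Linear(d, out_dim), nn.Tanh()]  # output constrained to [-1, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, w):
        return self.net(w)

d_v = ViewOrLightHead(out_dim=6)   # viewpoint V: rotation + translation
d_l = ViewOrLightHead(out_dim=4)   # lighting L: x-y direction, ambient, diffuse intensity
v, l = d_v(torch.randn(2, 512)), d_l(torch.randn(2, 512))
```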
DS sub-network
The DS sub-network is designed to generate depth maps. The process involves an optional MLP section, followed by an upsampling section with partial information fusion, and finally a head section that produces the final output. The detailed process is as follows:
Given input \(\textbf{x}\), it first passes through multiple fully connected layers with LeakyReLU activations:
In the upsampling section, initial upsampling uses transposed convolution (ConvTranspose2d) and ReLU activation:
Next, a Partial_conv layer with ReLU activation is introduced:
Our network includes five upsampling blocks, each consisting of GroupNorm, transposed convolution (ConvTranspose2d), and ReLU activation, as well as GroupNorm, Partial_conv3, and ReLU activation.
In the injection section, a 1x1 convolution processes input \(\textbf{x}_2\) to obtain \(\textbf{h}_{\text {inject}}\):
The upsampled result is added to the injected result to get \(\textbf{h}_{\text {combined}}\):
In the head section, \(\textbf{h}_{\text {combined}}\) is processed through GroupNorm and Partial_conv, with ReLU activation:
Further processing of \(\textbf{y}\) is done through GroupNorm and Conv2d, with ReLU activation:
Finally, the output convolution layer produces the final output \(\textbf{y}\):
This structure ensures that the depth map generated by the DS sub-network is accurate and detailed, by incorporating multiple layers of upsampling, partial convolution, and normalization.
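The following is a rough PyTorch sketch of such a depth decoder under the structure described above. Channel widths, the output-range squashing, and the substitution of ordinary Conv2d layers for the Partial_conv/Partial_conv3 layers are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn

class DepthDecoder(nn.Module):
    """Sketch of DS: MLP, initial upsampling, five upsampling blocks, injection, head."""
    def __init__(self, w_dim=512, ch=256):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(w_dim, ch * 4 * 4), nn.LeakyReLU(0.2))
        self.initial = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())          # stand-in for Partial_conv
        self.blocks = nn.Sequential(*[nn.Sequential(
            nn.GroupNorm(8, ch), nn.ConvTranspose2d(ch, ch, 4, 2, 1), nn.ReLU(),
            nn.GroupNorm(8, ch), nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())  # stand-in for Partial_conv3
            for _ in range(5)])
        self.inject = nn.Conv2d(ch, ch, 1)                        # 1x1 convolution on the injected branch
        self.head = nn.Sequential(
            nn.GroupNorm(8, ch), nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.GroupNorm(8, ch), nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1))                       # single-channel depth map

    def forward(self, w, x2=None):
        h = self.blocks(self.initial(self.fc(w).view(w.shape[0], -1, 4, 4)))  # 4x4 -> 256x256
        if x2 is not None:
            h = h + self.inject(x2)                               # fuse injected features
        d = self.head(h)
        return 0.9 + 0.2 * torch.sigmoid(d)                       # confine depth to [0.9, 1.1]

depth = DepthDecoder()(torch.randn(2, 512))                       # -> (2, 1, 256, 256)
```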
DW sub-network
The DW sub-network applies the Laplacian operator during backpropagation to perform edge detection on the gradients, retaining only the gradients at edge regions while suppressing noise in non-edge regions. The process involves several steps:
Given the gradients from downstream \(\textbf{g}_{out}\), we first compute their absolute values and generate a mask \(\textbf{M}\) that marks all non-zero gradient positions:
The shape of the mask needs to be expanded for subsequent convolution operations, denoted as:
The Laplacian operator \(\textbf{L}\) is applied to the expanded mask to detect edges. We use a discrete Laplacian operator for convolution:
The result of the Laplacian convolution \(\textbf{C}\) is checked to detect edges. If the absolute value is less than a small constant \(\epsilon\), it is considered an edge point:
Here, \(\textbf{E}\) is a binary mask, \(\textbf{E} \in \mathbb {R}^{N \times H \times W}\), and it needs to be adjusted to remove extra dimensions.
Finally, using the edge mask \(\textbf{E}\), we clip the gradients to retain only the gradients at edge positions:
where \(\odot\) represents element-wise multiplication. This process helps suppress gradient noise in non-edge regions during training by detecting edges in the gradients and retaining only the gradients at edge locations.
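A minimal sketch of this gradient filtering, implemented as a tensor backward hook, is given below. The Laplacian kernel values, the channel reduction before convolution, and the epsilon threshold are assumptions.

```python
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def clip_gradients_to_edges(g_out, eps=1e-3):
    """Keep incoming gradients (N, C, H, W) only at edge positions of the gradient mask."""
    mask = (g_out.abs() > 0).float()                          # M: non-zero gradient positions
    m = mask.mean(dim=1, keepdim=True)                        # collapse channels for the convolution
    c = F.conv2d(m, LAPLACIAN.to(g_out.device), padding=1)    # Laplacian response C
    edges = (c.abs() < eps).float()                           # E: edge points per the rule above
    return g_out * edges                                      # element-wise clipping of gradients

# Usage: attach the hook to a tensor so its gradient is filtered during backpropagation.
# texture.register_hook(clip_gradients_to_edges)
```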
DT sub-network
The DT sub-network is a neural network model designed for feature transformation, primarily consisting of an optional MLP part and a progressively expanding transposed convolution network. It is suitable for mapping input data to a specified output space and enhancing the model’s expressive capability using activation functions. The detailed implementation is as follows:
Given input \(\textbf{x}\) passing through multiple fully connected layers with LeakyReLU activations:
In the upsampling part, the initial upsampling uses transposed convolution (ConvTranspose2d) and ReLU activation:
After the initial upsampling, there are five upsampling blocks. Each block contains transposed convolution (ConvTranspose2d), GroupNorm, and ReLU activation:
The final step includes upsampling, convolution (Conv2d), GroupNorm, and ReLU activation:
By following this process, the DT sub-network effectively transforms the input features into the desired output space, enhancing the expressive power of the model through carefully structured layers and activations.
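A compact sketch of such a decoder is shown below; the channel widths and the interpretation of the output as a 2-channel transformation field are assumptions.

```python
import torch
import torch.nn as nn

class TransformDecoder(nn.Module):
    """Sketch of DT: MLP, initial upsampling, five upsampling blocks, final upsample + Conv2d."""
    def __init__(self, w_dim=512, ch=256):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(w_dim, ch * 2 * 2), nn.LeakyReLU(0.2))
        blocks = [nn.Sequential(nn.ConvTranspose2d(ch, ch, 4, 2, 1),
                                nn.GroupNorm(8, ch), nn.ReLU()) for _ in range(5)]
        self.net = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, 2, 1), nn.ReLU(),   # initial upsampling: 2 -> 4
            *blocks,                                          # 4 -> 128
            nn.Upsample(scale_factor=2),                      # 128 -> 256
            nn.Conv2d(ch, 2, 3, padding=1), nn.GroupNorm(1, 2), nn.ReLU())

    def forward(self, w):
        return self.net(self.fc(w).view(w.shape[0], -1, 2, 2))

t_map = TransformDecoder()(torch.randn(2, 512))               # -> (2, 2, 256, 256)
```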
Loss function
Reconstruction loss
The reconstruction loss for each sample measures the difference between the proxy image \({I_0}\) output by StyleGAN2 and the rendered image \({I_w}\). Let \({w}\) represent the randomly sampled code; the image generated by the StyleGAN2 teacher model is \({I_0} = G2D(w)\), and the rendered image is \({I_w} = R(A, S, T, V, L)\), where \({R}\) is the renderer and its five inputs are the texture map \({A}\), shape \({S}\), transformation \({T}\), viewpoint \({V}\), and lighting \({L}\). The reconstruction loss comprises the L1 loss, which captures pixel-level differences between the proxy image \({I_0}\) and the rendered image \({I_w}\), and the perceptual loss \({L_{\text {perc}}}\), multiplied by the weight parameter \(\lambda _{\text {perc}}\), which captures higher-level feature differences:
\(L_{\text {rec}} = \frac{1}{N}\sum _{i=1}^{N}\left| I_0[i] - I_w[i]\right| + \lambda _{\text {perc}}\, L_{\text {perc}}(I_0, I_w),\)
where \({N}\) is the total number of pixels, and \({I_0[i]}\) and \({I_w[i]}\) are the values of the proxy and rendered images at pixel \({i}\), respectively. The reconstruction loss \({L_{\text {rec}}}\) combines low-level pixel differences and high-level perceptual differences to more comprehensively assess the quality of the reconstructed image. By adjusting the value of \(\lambda _{\text {perc}}\), the contributions of these two losses to the total loss can be controlled.
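A possible implementation of this combined reconstruction loss is sketched below; the choice of LPIPS (VGG variant) as the perceptual metric is an assumption, since the text does not specify the feature network used for \(L_{\text {perc}}\).

```python
import torch
import torch.nn.functional as F
import lpips  # assumed perceptual metric; the paper does not name the feature network

perc = lpips.LPIPS(net='vgg')

def reconstruction_loss(i_0, i_w, lambda_perc=1.0):
    """L1 pixel difference plus weighted perceptual difference between proxy and render."""
    l1 = F.l1_loss(i_w, i_0)               # mean absolute pixel difference
    l_perc = perc(i_w, i_0).mean()         # higher-level feature difference
    return l1 + lambda_perc * l_perc
```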
To further constrain the model’s output, we add several regularization losses, including identity variance regularization loss and albedo map regularization loss.
Identity variance regularization loss
Assuming that identity should not change under viewpoint perturbation, we achieve this through identity variance regularization loss:
where \({I_{w0}}\) is the texture image, \({I_w'}\) is the transformed image, and \({f}\) is a pre-trained face recognition network. Since the recognition network may not have pose invariance, this regularization is applied only to images within a certain rotation range.
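One plausible instantiation of this regularizer is sketched below, assuming that \(f\) is a frozen face-recognition network returning one embedding per image and that identity similarity is measured with cosine distance; the rotation-range gate mirrors the restriction described above, and the threshold value is an assumption.

```python
import torch
import torch.nn.functional as F

def identity_loss(f, i_w0, i_w_rot, yaw=None, max_yaw=0.5):
    """Penalize identity drift between the texture image and its rotated rendering."""
    if yaw is not None and yaw.abs().max() > max_yaw:
        return i_w0.new_zeros(())                    # skip large rotations (recognizer is not pose-invariant)
    e0 = F.normalize(f(i_w0), dim=-1)
    e1 = F.normalize(f(i_w_rot), dim=-1)
    return (1.0 - (e0 * e1).sum(dim=-1)).mean()      # 1 - cosine similarity of embeddings
```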
Albedo map regularization loss
To better utilize shadow information, we regularize the albedo map, obtained by removing illumination from standard Terracotta Warrior facial images: \(L_{\text {regA}} = \left\Vert KA \right\Vert _*,\)
where \({KA} \in \mathbb {R}^{B \times HW}\) is the albedo matrix, composed of filtered and vectorized albedo maps, and \(\Vert \cdot \Vert_*\) denotes the nuclear norm. The Laplacian kernel is used to filter the grayscale albedo map, retaining only high-frequency information. The nuclear norm acts as a soft approximation for low-rank regularization, encouraging sample consistency while ensuring that different albedo maps within a batch have small Laplacian values.
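A sketch of this regularizer, assuming grayscale albedo maps of shape (B, 1, H, W) and a standard 3×3 discrete Laplacian kernel:

```python
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def albedo_reg_loss(albedo):
    """Nuclear norm of the Laplacian-filtered, vectorized albedo maps (soft low-rank constraint)."""
    high_freq = F.conv2d(albedo, LAPLACIAN.to(albedo.device), padding=1)  # keep high-frequency content
    ka = high_freq.flatten(1)                                             # KA in R^{B x HW}
    return torch.norm(ka, p='nuc')                                        # nuclear norm
```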
Total loss function
Combining these losses, the total loss function for our network model training is: \(L_{\text {total}} = L_{\text {rec}} + \lambda _{\text {idt}} L_{\text {idt}} + \lambda _{\text {regA}} L_{\text {regA}}.\)
The weight parameters \(\lambda _{\text{idt}}\) and \(\lambda _{\text{regA}}\) control the contributions of the respective regularization terms to the total loss. This combined loss aims to improve the overall quality of the generated image while preserving identity and detailed features, leveraging shadow information to ensure more consistent and reliable albedo maps.
Experiments
Implementation details
All modules in this study are implemented using PyTorch 1.8, ensuring system stability and scalability. For rendering techniques, the mesh rasterizer from PyTorch3D is utilized for differentiable rendering. This technique allows fine-tuning of images during training, enhancing the precision and effectiveness of the generative model. The training phase employs a reimplementation of the StyleGAN2 model in PyTorch, specifically trained on the Terracotta Warriors dataset. StyleGAN2 is recognized as one of the leading models in generative adversarial networks, capable of producing high-quality and high-resolution images. To optimize the training data, the MTCNN face detector is used to crop images and resize them to 256 × 256 pixels. This process ensures the consistency and quality of the training images, improving the efficiency and effectiveness of model training. In the second phase of training, the pre-trained 2D StyleGAN2 model is used to further train the generator of the teacher-student 3D model. This approach enables the model to better learn complex 3D structures and textures, enhancing the realism and detail of the generated images. Additionally, we trained the 3D generator for 30,000 steps. For the 3D renderer configuration, the field of view was set to 10°, a parameter that facilitates the generation of more detailed and realistic images. The depth map’s output range was confined to 0.9 to 1.1 to ensure the precision of depth information and minimize potential errors. Furthermore, the rotation and translation ranges of the viewpoint were normalized to (\(-60^{\circ }\), \(60^{\circ }\)) and (− 0.1, 0.1), respectively, to simulate perspective variations in real-world scenarios. The lighting coefficients were normalized to (0,1) to maintain consistent lighting conditions during training.
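The renderer and sampling ranges reported above can be summarized as a small configuration sketch; the key names are illustrative, while the values are those stated in the text. A helper shows how a tanh-bounded output in [−1, 1] can be mapped back to such a physical range.

```python
# Configuration sketch of the reported renderer and training settings (key names are assumptions).
config = {
    "image_size": 256,
    "train_steps_3d": 30_000,
    "fov_degrees": 10.0,
    "depth_range": (0.9, 1.1),               # allowed output range of the depth map
    "view_rotation_range_deg": (-60.0, 60.0),
    "view_translation_range": (-0.1, 0.1),
    "light_coeff_range": (0.0, 1.0),
}

def denormalize(x, lo, hi):
    """Map a tanh-bounded network output in [-1, 1] to the physical range [lo, hi]."""
    return lo + (x + 1.0) * 0.5 * (hi - lo)
```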
Datasets
Our method was tested on three datasets to verify its broad applicability and effectiveness: the CelebA dataset [52], the cat dataset [53], and the Terracotta Warriors dataset. These datasets represent three distinct application scenarios: human faces, animals, and historical artifacts. The CelebA dataset comprises 200,000 real human face images, each annotated with bounding boxes. These images cover a wide range of poses, expressions, and lighting conditions, making it a standard benchmark in the fields of face recognition and detection. The cat dataset includes 12,000 images of cats with bounding box annotations. Each image is cropped to ensure the bounding box accurately encloses the cat’s head, making it suitable for pet image processing and animal recognition research. The Terracotta Warriors dataset consists of 300 images of the heads of excavated Terracotta Warriors. These images capture the authentic appearance and details of the Terracotta Warriors, providing valuable data for the digital preservation and study of historical artifacts.
Qualitative experimental results
The qualitative results of our method, DIGAN, are illustrated in Fig. 2. These results display the outcomes obtained through training with StyleGAN2, and the 3D shapes generated using our method. They clearly demonstrate the superiority of our approach in generating 3D shapes. The final row presents side views of the generated 3D shapes, showcasing the facial rotation effects from different angles, and highlighting our method’s ability to capture and reproduce facial details accurately. By leveraging self-predictive transformation mapping, our method maintains dynamic stability for foreground objects while keeping the background relatively static. This characteristic ensures the stability and consistency of the generated 3D shapes between the foreground and background. Our 3D generator not only produces realistic 3D human faces but also extends to other objects, such as cats and Terracotta Warriors. This demonstrates the generality and efficiency of our generator across different subjects, maintaining high-quality shape recovery even with extensive pose variations. In the generated 3D shapes, detailed features are well-preserved. Whether it is the wrinkles on human faces or the intricate details on the edges of Terracotta Warriors, our method captures and presents these details accurately, showing its precision and accuracy in handling fine details. Our method exhibits outstanding capability in generating high-quality 3D shapes and accurately preserving and restoring details during pose variations. By utilizing advanced self-predictive transformation mapping, our approach ensures dynamic stability between foreground and background and demonstrates broad applicability and detail-capturing ability across various objects. These features make our 3D generation method highly practical and valuable for a wide range of application scenarios.
Figure 3 presents sample images generated under varying lighting conditions, visually illustrating the influence of light sources on the generated images. These samples demonstrate the flexibility and robustness of our method in handling lighting variations. By generating normal maps of the shapes, our approach can create images under any lighting conditions. The normal maps not only capture subtle shape variations on the Terracotta Warriors’ surfaces but also retain detailed information of the original surfaces, ensuring the generated images maintain a high degree of realism under different lighting conditions. The second row of Fig. 3 displays shadow maps created under lighting, which are used to re-illuminate the images in the first row. This technique allows us to simulate shadow effects in real lighting environments, making the final generated images more realistic and natural. Our method showcases the ability to generate high-quality facial images of the Terracotta Warriors under any lighting condition, even for images not present in the training dataset. This indicates that our method has high generalization and adaptability, capable of handling complex and variable lighting environments, and provides reliable technical support for various application scenarios.
Quantitative experimental results
We evaluated the visual fidelity of the generated images using the Fréchet Inception Distance (FID). DIGAN was thoroughly compared with other mainstream Generative Adversarial Network (GAN) models, such as StyleGAN [52], APA [54], ADA [55], Few-Diffusion [56], FastGAN [57], and FreGAN [58]. Table 1 shows that DIGAN demonstrates competitive or superior FID scores on challenging datasets such as CelebA, cat, and Terracotta Warriors. When trained on different datasets with the same set of hyperparameters, DIGAN’s architecture remained stable, consistently generating images with high visual fidelity. Thus, DIGAN exhibited exceptional visual fidelity and robustness across multiple datasets, underscoring its superior performance in the field of image generation.
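As an illustration of the FID evaluation protocol, the snippet below uses the torchmetrics implementation; the batch tensors are placeholders standing in for the real and generated image sets, and the Inception feature dimension of 2048 is the common default rather than a value stated in the text.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

real_images = torch.rand(16, 3, 256, 256)        # placeholder batch of real images in [0, 1]
generated_images = torch.rand(16, 3, 256, 256)   # placeholder batch of generated images

fid = FrechetInceptionDistance(feature=2048, normalize=True)  # normalize=True: float inputs in [0, 1]
fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print(float(fid.compute()))                      # lower FID indicates higher visual fidelity
```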
This study provides a detailed quantitative comparison of DIGAN with four renowned open-source methods: HoloGAN [37], DiscoFaceGAN [35], CONFIGNet [59], and LiftGAN [60], as shown in Table 2, to evaluate their performance in face generation and identity preservation. For each method, 1,000 faces were randomly generated and output images were produced in multiple specified poses. The ArcFace model was used to compute the similarity between generated frontal and non-frontal faces, assessing each method’s ability to preserve identity information after facial rotation. All methods excelled in the visual quality of the generated images, producing facial images with high realism and clarity. However, DiscoFaceGAN performed relatively poorly in preserving identity information after facial rotation, indicating the challenge of maintaining identity consistency with varying facial poses. Without any supervision, DIGAN exhibited superior identity preservation capabilities compared to most methods. Within the framework of 3D controllable generation models, DIGAN’s average identity similarity across different rotation angles was significantly better than other methods, demonstrating its superior performance in maintaining identity consistency. Overall, DIGAN showcased exceptional performance in face generation and identity preservation, making it a powerful tool for relevant applications and research.
Ablation studies
In the ablation study, we thoroughly examined the impact of various loss functions on the generated results in our model, with a particular focus on their performance in terms of image quality and realism. Through experimental comparisons, we revealed the roles and effects of different loss functions in the generative model. To evaluate the importance of these loss functions, we conducted ablation experiments by sequentially removing the symmetric reconstruction loss, albedo consistency loss, and identity regularization loss, and retraining the generative model. This approach allowed us to observe the impact of each loss function on the final image quality and feature preservation. During the evaluation process, we employed both quantitative and qualitative methods. Quantitative evaluation included using specific metrics, such as FID scores, to measure the quality of the generated images. Qualitative evaluation involved visual inspection and comparison of the generated image details to assess their realism and consistency. The results of these evaluations are presented in Table 3 and Fig. 4, providing clear data support. To verify the generality and robustness of the experiments, we conducted multiple ablation studies using the Terracotta Warriors dataset. These experiments provided a more comprehensive understanding of the specific impact of different loss functions on model performance and ensured that the results have broad applicability.
We remove the symmetric reconstruction loss (i.e., wo_flip), the identity regularization loss (i.e., wo_idt), and the albedo consistency loss (i.e., wo_rega), respectively. Experiments showed that when the symmetric reconstruction loss was removed, the generative model struggled to produce reasonable facial shapes, often resulting in inconsistent and unnatural shapes. However, the model still managed to partially achieve facial rotation functionality, indicating the crucial role of symmetric reconstruction loss in shape generation. Further experiments revealed that albedo consistency loss and identity regularization loss had similar effects in regularizing the output shapes. Both loss functions helped the model generate more consistent and believable image shapes, thereby improving the quality and credibility of the generated images. By adding additional regularization losses to the output of StyleGAN2, we can significantly enhance the plausibility and reasonableness of the generated facial shapes. This approach not only improves the visual quality of the generated images but also boosts the model’s performance in complex shape generation tasks.
Conclusion
This paper explores the application of Generative Adversarial Networks (GANs) in generating 3D Terracotta Warrior faces, emphasizing their significance in cultural heritage preservation and restoration, digital archiving and exhibition, academic research, education, and the cultural and creative industries. GAN technology, particularly StyleGAN and its variants, enables the generation of high-quality and diverse Terracotta Warrior images, but existing methods offer limited control over the three-dimensional structure of the generated faces. To address these challenges, this paper proposes an innovative distillation model, DIGAN, for generating 3D Terracotta Warrior faces. By distilling knowledge from StyleGAN2, we train a novel 3D generative network. The 3D generator modularly decomposes the generation process, encompassing 3D modules for texture, shape, lighting, and pose, and ultimately renders 2D Terracotta Warrior images. The model enhances 3D shape learning through symmetric constraints and multi-view data, resulting in images with a more pronounced three-dimensional appearance and realism. The proposed method can generate high-quality 3D information for faces and Terracotta Warrior images, allowing precise control over pose and lighting. Our approach has made significant progress in generating high-quality 3D Terracotta Warrior faces, providing new technological means for cultural heritage preservation and restoration, and advancing digital archiving and exhibition. However, our work has certain limitations. Since our model is generative, there is no ground-truth 3D model corresponding to its output, making it difficult to directly evaluate the quality of the generated 3D shapes. Therefore, in future work we will pursue 3D reconstruction of the Terracotta Warriors to obtain ground-truth 3D models for evaluation.
Data availability
The data will be available upon reasonable request.
References
Wenjun Z, Benpeng S, Ruiqi F, Xihua P, Shanxiong C. EA-GAN: restoration of text in ancient Chinese books based on an example attention generative adversarial network. Herit Sci. 2023;11(1):42.
Yan M, Xiong R, Shen Y, Jin C, Wang Y. Intelligent generation of Peking opera facial masks with deep learning frameworks. Herit Sci. 2023;11(1):20.
Hu Q, Huang W, Luo Y, Cao R, Peng X, Peng J, Fan J. Sgrgan: sketch-guided restoration for traditional Chinese landscape paintings. Herit Sci. 2024;12(1):163.
Pandey A, Shivaji BA, Acharya M, Mohbey KK. Mitigating class imbalance in heart disease detection with machine learning. Multimed Tools Appl. 2024. https://doi.org/10.1007/s11042-024-19705-8.
Meena G, Mohbey KK. Sentiment analysis on images using different transfer learning models. Procedia Comput Sci. 2023;218:1640–9.
Boutros F, Struc V, Fierrez J, Damer N. Synthetic data for face recognition: current state and future prospects. Image Vis Comput. 2023;135: 104688.
Gao S, Wu R, Wang X, Liu J, Li Q, Tang X. EFR-CSTP: encryption for face recognition based on the chaos and semi-tensor product theory. Inf Sci. 2023;621:766–81.
Kang M, Zhu J-Y, Zhang R, Park J, Shechtman E, Paris S, Park T. Scaling up gans for text-to-image synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2023. p. 10124–34.
Sauer A, Karras T, Laine S, Geiger A, Aila T. Stylegan-t: unlocking the power of gans for fast large-scale text-to-image synthesis. In: International conference on machine learning. PMLR; 2023. p. 30105–18 .
Esser P, Kulal S, Blattmann A, Entezari R, Müller J, Saini H, Levi Y, Lorenz D, Sauer A, Boesel F. Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning; 2024.
Tao H, Duan Q, Lu M, Hu Z. Learning discriminative feature representation with pixel-level supervision for forest smoke recognition. Pattern Recogn. 2023;143: 109761.
Saddaoui R, Gana M, Hamiche H, Laghrouche M. Wireless tag sensor network for apnea detection and posture recognition using LSTM. In: IEEE embedded systems letters; 2024.
Yu Y, Liu X, Wang Y, Wang Y, Qing X. Lamb wave-based damage imaging of CFRP composite structures using autoencoder and delay-and-sum. Compos Struct. 2023;303: 116263.
Brophy E, Wang Z, She Q, Ward T. Generative adversarial networks in time series: a systematic literature review. ACM Comput Surv. 2023;55(10):1–31.
De Souza VLT, Marques BAD, Batagelo HC, Gois JP. A review on generative adversarial networks for image generation. Comput Graph. 2023;114:13–25.
Marano GC, Rosso MM, Aloisio A, Cirrincione G. Generative adversarial networks review in earthquake-related engineering fields. Bull Earthq Eng. 2024;22(7):3511–62.
Xie H, Chen Z, Hong F, Liu Z. Citydreamer: compositional generative model of unbounded 3d cities. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2024. p. 9666–75.
Kim G, Jang JH, Chun SY. Podia-3d: domain adaptation of 3d generative model across large domain gap using pose-preserved text-to-image diffusion. In: Proceedings of the IEEE/CVF international conference on computer vision. 2023. p. 22603–12.
Chai L, Tucker R, Li Z, Isola P, Snavely N. Persistent nature: a generative model of unbounded 3d worlds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2023. p. 20863–74.
Karnewar A, Mitra NJ, Vedaldi A, Novotny D. Holofusion: towards photo-realistic 3d generative modeling. In: Proceedings of the IEEE/CVF international conference on computer vision; 2023. p. 22976–85.
Cheng Y-C, Lee H-Y, Tulyakov S, Schwing AG, Gui L-Y. Sdfusion: multimodal 3d shape completion, reconstruction, and generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2023. p. 4456–65.
Ning X, Yu Z, Li L, Li W, Tiwari P. DILF: differentiable rendering-based multi-view image-language fusion for zero-shot 3D shape understanding. Inf Fusion. 2024;102: 102033.
Kurdi B, Charlesworth TE. A 3D framework of implicit attitude change. Trends Cogn Sci. 2023;27(8):745–58.
Noormohammadi N, Afifi D, Bateniparvar O. A simple meshfree method based on Trefftz attitude for 2D and 3D elasticity problems. Eng Anal Bound Elem. 2023;155:1186–206.
Richardson E, Metzer G, Alaluf Y, Giryes R, Cohen-Or D. Texture: text-guided texturing of 3d shapes. In: ACM SIGGRAPH 2023 conference proceedings; 2023. p. 1–11.
Carranza T, Guerrero P, Caba K, Etxabide A. Texture-modified soy protein foods: 3D printing design and red cabbage effect. Food Hydrocolloids. 2023;145: 109141.
Karnati M, Seal A, Bhattacharjee D, Yazidi A, Krejcar O. Understanding deep learning techniques for recognition of human emotions using facial expressions: a comprehensive survey. IEEE Trans Instrum Meas. 2023;72:1–31.
Adyapady RR, Annappa B. A comprehensive review of facial expression recognition techniques. Multimed Syst. 2023;29(1):73–103.
Meena G, Mohbey KK, Indian A, Khan MZ, Kumar S. Identifying emotions from facial expressions using a deep convolutional neural network-based approach. Multimed Tools Appl. 2024;83(6):15711–32.
Kumar HNN, Kumar AS, Prasad MSG, Shah MA. Automatic facial expression recognition combining texture and shape features from prominent facial regions. IET Image Proc. 2023;17(4):1111–25.
Tran L, Yin X, Liu X. Disentangled representation learning gan for pose-invariant face recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1415–24.
Tian Y, Peng X, Zhao L, Zhang S, Metaxas DN. CR-GAN: learning complete representations for multi-view generation. arXiv preprint. 2018. arXiv:1806.11191.
Hu Y, Wu X, Yu B, He R, Sun Z. Pose-guided photorealistic face rotation. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 8398–406.
Zhao J, Xiong L, Li J, Xing J, Yan S, Feng J. 3d-aided dual-agent gans for unconstrained face recognition. IEEE Trans Pattern Anal Mach Intell. 2018;41(10):2380–94.
Deng Y, Yang J, Chen D, Wen F, Tong X. Disentangled and controllable face image generation via 3d imitative-contrastive learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020. p. 5154–63.
Kowalski M, Garbin SJ, Estellers V, Baltrušaitis T, Johnson M, Shotton J. Config: controllable neural face image generation. In: Computer vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part XI 16; Springer. 2020. p. 299–315.
Nguyen-Phuoc T, Li C, Theis L, Richardt C, Yang Y-L. Hologan: unsupervised learning of 3d representations from natural images. In: Proceedings of the IEEE/CVF international conference on computer vision; 2019. p. 7588–97.
Shen Y, Gu J, Tang X, Zhou B. Interpreting the latent space of gans for semantic face editing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020. p. 9243–52.
Tewari A, Elgharib M, Bharaj G, Bernard F, Seidel H-P, Pérez P, Zollhofer M, Theobalt C. Stylerig: rigging stylegan for 3d control over portrait images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020. p. 6142–51.
Shen Y, Zhou B. Closed-form factorization of latent semantics in gans. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 1532–40.
Tulsiani S, Efros AA, Malik J. Multi-view consistency as supervisory signal for learning shape and pose prediction. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 2897–905.
Kanazawa A, Tulsiani S, Efros AA, Malik J. Learning category-specific mesh reconstruction from image collections. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 371–86.
Wu S, Rupprecht C, Vedaldi A. Unsupervised learning of probably symmetric deformable 3d objects from images in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020. p. 1–10.
Henderson P, Ferrari V. Learning to generate and reconstruct 3d meshes with only 2d supervision. arXiv preprint. 2018. arXiv:1807.09259.
Henderson P, Ferrari V. Learning single-image 3d reconstruction by generative modelling of shape, pose and shading. Int J Comput Vis. 2020;128(4):835–54.
Henderson P, Tsiminaki V, Lampert CH. Leveraging 2d data to learn textured 3d mesh generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2020. p. 7498–507.
Gadelha M, Maji S, Wang R. 3d shape induction from 2d views of multiple objects. In: 2017 international conference on 3d vision (3DV). IEEE; 2017. p. 402–11.
Lunz S, Li Y, Fitzgibbon A, Kushman N. Inverse graphics gan: learning to generate 3d shapes from unstructured 2d data. arXiv preprint. 2020. arXiv:2002.12674.
Szabó A, Meishvili G, Favaro P. Unsupervised generative 3d shape learning from natural images. arXiv preprint. 2019. arXiv:1910.00287.
Zhang W, Zhou D, Li L, Gu Q. Neural thompson sampling. arXiv preprint. 2020. arXiv:2010.00827.
Pan X, Dai B, Liu Z, Loy CC, Luo P. Do 2d gans know 3d shape? Unsupervised 3d shape reconstruction from 2d image gans. arXiv preprint. 2020. arXiv:2011.00844.
Karras T, Laine S, Aila T. A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2019. p. 4401–10.
Zhang W, Sun J, Tang X. Cat head detection—how to effectively exploit shape and texture features. In: Computer vision—ECCV 2008: 10th European conference on computer vision, Marseille, France, October 12–18, 2008, proceedings, part IV 10. Springer; 2008. p. 802–16.
Jiang L, Dai B, Wu W, Loy CC. Deceive d: adaptive pseudo augmentation for gan training with limited data. Adv Neural Inf Process Syst. 2021;34:21655–67.
Karras T, Aittala M, Hellsten J, Laine S, Lehtinen J, Aila T. Training generative adversarial networks with limited data. Adv Neural Inf Process Syst. 2020;33:12104–14.
Hu T, Zhang J, Liu L, Yi R, Kou S, Zhu H, Chen X, Wang Y, Wang C, Ma L. Phasic content fusing diffusion model with directional distribution consistency for few-shot model adaption. In: Proceedings of the IEEE/CVF international conference on computer vision; 2023. p. 2406–15.
Liu B, Zhu Y, Song K, Elgammal A. Towards faster and stabilized gan training for high-fidelity few-shot image synthesis. In: International conference on learning representations; 2020.
Wang Z, Chi Z, Zhang Y. FreGAN: exploiting frequency components for training GANs under limited data. Adv Neural Inf Process Syst. 2022;35:33387–99.
Kowalski M, Garbin SJ, Estellers V, Baltrušaitis T, Johnson M, Shotton J. Config: controllable neural face image generation. In: Computer vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part XI 16. Springer; 2020. p. 299–315.
Shi Y, Aggarwal D, Jain AK. Lifting 2d stylegan for 3d-aware face generation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition; 2021. p. 6258–66.
Acknowledgements
This research was funded by the National Natural Science Foundation of China (Grant No. 62271393 and 62302393) and the Xi'an Science and Technology Plan Project (Grant No. 24SFSF0002 and 20GXSF0005).
Author information
Contributions
Longquan Yan, Pengbo Zhou, Yangyang Liu and Kang Li designed and implemented the entire model architecture and manuscript writing. Guohua Geng provided guidance on algorithm optimization and manuscript revision. Mingquan Zhou was responsible for manuscript writing, editing and polishing.
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Yan, L., Geng, G., Zhou, P. et al. DIGAN: distillation model for generating 3D-aware Terracotta Warrior faces. Herit Sci 12, 317 (2024). https://doi.org/10.1186/s40494-024-01424-w