
SGRGAN: sketch-guided restoration for traditional Chinese landscape paintings

Abstract

Image restoration is a prominent field of research in computer vision. Restoring damaged paintings, especially ancient Chinese artworks, remains a significant challenge for current restoration models. The difficulty lies in realistically reinstating the intricate and delicate textures inherent in the original pieces while preserving the unique style and artistic characteristics of ancient Chinese paintings. To enhance the effectiveness of restoring and preserving traditional Chinese paintings, this paper presents a framework called Sketch-Guided Restoration Generative Adversarial Network, termed SGRGAN. The framework employs sketch images as structural priors, providing essential information for the restoration process. Additionally, a novel Focal block is proposed to enhance the fusion and interaction of textural and structural elements. Notably, a BiSCCFormer block, incorporating a Bi-level routing attention mechanism, is devised to comprehensively grasp the structural and semantic details of the image, including its contours and layout. Extensive experiments and ablation studies on the MaskCLP and Mural datasets demonstrate the superiority of the proposed method over previous state-of-the-art methods. Specifically, the model demonstrates outstanding visual fidelity, particularly in the restoration of landscape paintings, further underscoring its efficacy and universality in the realm of cultural heritage preservation and restoration.

Introduction

Ancient Chinese paintings represent a precious heritage of Chinese culture, reflecting the changes of the times and carrying rich cultural connotations. However, the influence of time and natural factors often leads to damage or blurring of these ancient artworks [1]. In the conservation field of ancient paintings and calligraphy, the age of the works, the fragility of the materials, and the influence of environmental factors inevitably lead to various damages during circulation, including breakage, fading, mildew, and insect infestation, with some even missing key areas.

These challenges necessitate that the restoration of paintings and calligraphy demands not only a high level of professional skills and artistic sensitivity but also a profound understanding of history, culture, traditional materials, and techniques. The restoration process encompasses both physical repair and multifaceted and complex operations such as chemical stabilization and artistic restoration reproduction [2]. Likewise, the restoration of ancient wall paintings is influenced by various factors [3].

Fig. 1 depicts three traditional Chinese paintings of various styles and historical periods, by Yuan Dong from the Five Dynasties and Ten Kingdoms period, Boju Zhao from the Southern Song Dynasty, and Hong Zhang from the Ming Dynasty. Throughout the history of Chinese painting, diverse painters have employed distinct techniques, with each artwork possessing unique artistic characteristics. Yuan Dong of the Five Dynasties and Ten Kingdoms period specialized in the ‘Pi Ma Cun’ and ‘Dian Zi Cun’ techniques. In the Southern Song Dynasty, Boju Zhao employed the meticulous ‘Outline and Filling’ technique to portray rocks and trees. Ming Dynasty painter Hong Zhang focused on realism, employing dynamic and variable ink and brush techniques characterized by ‘Cun, Ca, Dian, Ran’ (‘hooking, texturing, dotting, and dyeing’).

Fig. 1

Examples of the painters' works on MaskCLP. From left to right, (a) is a painting of ‘River Dike Evening View’ by Yuan Dong from the Five Dynasties and Ten Kingdoms period, (b) is a painting of ‘Flying Immortals Painting’ by Boju Zhao from the Southern Song Dynasty, and (c) is a painting of ‘Painting Album of Landscape’ by Hong Zhang from the Ming Dynasty

Utilizing techniques such as generative adversarial networks [4] and diffusion models [5] to repair painting images avoids causing secondary damage to the original artworks, thereby enhancing the preservation and transmission of the cultural heritage embodied in ancient Chinese paintings [6].

To enhance the understanding of contextual information within images, Pathak et al. [7] introduced an unsupervised visual feature learning algorithm based on contextual prediction for generating content in arbitrary image regions. However, standard convolution often leads to artifacts like color inconsistency and blurring in image restoration. To address this issue, Liu et al. [8] proposed an image restoration method based on partial convolution. Li et al. [9] introduced a Recurrent Feature Reasoning Network, which utilizes neighboring pixels to iteratively and recursively speculate on the restored image. However, the method lacks sufficient consideration of constraints on the central area of the missing region. Guo et al. [10] introduced MISF, which produced high-fidelity restoration results while mitigating artifacts. However, the method lacked consideration of the features and structure of the original image. To address the challenge of inpainting models struggling to capture the unique painting style and intricate brush strokes of individual artists, Xu et al. [11] proposed a Chinese landscape painting restoration method based on fine-grained style. However, this approach heavily depends on both original and imitation paintings, making the restoration results vulnerable to the influence of imitation paintings, particularly in scenarios with limited datasets. Lyu et al. [5] reconstructed Chinese landscape painting images using a diffusion probability model. They also incorporated attention and self-attention mechanisms to enhance the quality of the reconstructed images. This approach presents a novel reference method for restoring ancient Chinese painting images.

In general, existing image restoration methods have been well developed for real-world image processing, but they still suffer from the following problems when applied to ancient Chinese paintings with complex semantic information and unique forms of artistic expression: (1) The majority of existing techniques heavily depend on extensive modern image datasets for training. However, datasets specific to ancient paintings are sparse and challenging to label, thereby constraining the adaptability and accuracy of these techniques for restoring ancient paintings. (2) The uniqueness of ancient painting techniques and individualized brushstrokes poses challenges for conventional image restoration techniques, which often struggle to address intricate details. These techniques are unable to accurately simulate or reconstruct the fine textures and brushstrokes that characterize the nuanced form of artistic expression. (3) Unlike natural photographs, the color schemes, compositional layouts, and elements in ancient paintings tend to be more abstract. Traditional algorithms may struggle to express these abstractions adequately, leading to challenges in accurately reproducing the details and overall mood of the artwork. (4) Ancient paintings frequently embody profound cultural connotations and historical backgrounds. Existing technologies may struggle to comprehensively understand and capture these deep semantic associations, leading to restored works that lose the mood and connotations of the originals.

To address the aforementioned challenges in Chinese traditional painting restoration, this paper presents a framework named Sketch-Guided Restoration Generative Adversarial Network, termed SGRGAN. The model employs a dual encoder-decoder architecture and incorporates a dual discriminator to reconstruct the structure and texture of the missing region using the texture encoder-decoder and the structure encoder-decoder, respectively.

To this end, the key contributions of this paper are four-fold.

  1. We propose an Ancient Landscape Painting Restoration Dataset with Special Mask, termed MaskCLP. It is accessible at https://github.com/Makbaka1/MaskCLP.

  2. We introduce sketches as multimodal structural prior information to assist network restoration, facilitating the reconstruction of fine texture and stroke characteristics in ancient paintings.

  3. We propose a novel Focal block that effectively fuses fine-grained local features of color matching and element abstraction with coarse-grained global features of composition layout.

  4. We propose a new BiSCCFormer block based on a Bi-level routing attention mechanism to comprehensively understand the internal structure and semantic information of the image while preserving its historical style and cultural significance.

Following experimental verification, our method significantly improves the accuracy and realism of image restoration for ancient Chinese paintings. The restoration results not only preserve the artistic essence of the original paintings but also finely restore their distinctive visual structures and brushstroke textures.

Related work

Traditional Chinese painting restoration

Ancient Chinese painting restoration is primarily divided into two categories, the first being calligraphy and painting restoration. Chang et al. [12] segment the image restoration process into two layers, a structure layer and a texture layer, and introduce a dual-layer digital image restoration method that markedly enhances restoration accuracy. Zeng et al. [13] first detect damage in the painting using a damage detection method and subsequently reconstruct the image using a patch-based graphic restoration approach. Luo et al. [14] propose an ancient Chinese painting restoration method leveraging improved generative adversarial networks.

Another category is mural restoration. Wang et al. [15] propose a global and local feature weighting method based on structure guidance, which considers both global and local features of the image to complete mural image restoration. Cai et al. [16] propose a Bidirectional Feature Adaptation restoration method, which incorporates a spatial attention mechanism to adaptively enhance missing and known region features for mural restoration. Furthermore, Ge et al. [17] propose a virtual restoration network for ancient murals that utilizes global–local feature extraction and structural information guidance. This approach addresses the inadequacy of most restoration methods in filling lost mural areas with rich details and complex structures.

Traditional method

Traditional image restoration methods primarily rely on signal processing and statistical modeling, achieving some effectiveness but also exhibiting significant limitations. Chang et al. [18] propose a new simple interpolation (SI) restoration strategy for large damaged areas, which represents an improvement over conventional interpolation techniques [19]. However, accurately restoring complex textures and nonlinear distortions remains challenging, and repairing large missing regions incurs high computational complexity. Dimiccoli et al. [20] propose a perceptual filter for image restoration. While capable of enhancing image quality to some extent, it still falls short in preserving edge details and structural consistency, particularly when dealing with images featuring smooth color transitions or high-frequency, information-rich content.

Additionally, optimization methods based on partial differential equations have been widely studied. Li et al. [21] propose a combination of two variational models for image restoration. Although this method can achieve a certain restoration effect through mathematical modeling, its parameter selection is demanding and its computational complexity is high. Moreover, its adaptability is limited when facing the diverse image degradation phenomena found in real-world scenarios; in particular, it cannot achieve the ideal restoration effect for highly unstructured and randomly distributed damaged regions. Although traditional image restoration methods are practical in specific scenarios, their overall performance is unsatisfactory when handling complex image content, diverse degradation patterns, and the balance between fidelity and naturalness. This inadequacy is also a significant reason for the widespread adoption of deep learning methods.

GAN-based restoration

Image restoration methods utilizing Generative Adversarial Networks (GANs) have shown significant advantages in the domain of cultural relic image restoration. Given that cultural relic images may suffer various forms of damage such as wear, cracks, and fading, a GAN can utilize surrounding contextual information to generate content for the missing part that closely resembles the style of the original painting, thereby recovering both the details and the overall structure of the cultural relic image. Nazeri et al. [22] propose the EC model, which uses an edge generator to predict the structural contours of the missing region first and then fills in the color and texture details through an image completion network. This approach represents a breakthrough in terms of visual coherence and structural rationality. Lin et al. [23] propose the PD-GAN method, which achieves the generation of multiple, high-fidelity repairs for the same damaged region by training two co-optimized adversarial neural networks. Zheng et al. [24] introduce the Transformer into GAN and propose a multivariate image completion framework, termed VQGAN, that achieves high quality and diversity at faster inference speeds.

However, unstructured GAN-based image restoration techniques have certain limitations. When handling images of culturally complex and unique relics, relying solely on the automatic generation capability of a GAN may fail to accurately reproduce historical elements or cultural features. Moreover, GANs that lack a structure-guided strategy are prone to noise during the restoration process, resulting in discrepancies between the generated content and the intrinsic logic of the original painting, such as incorrect texture orientation and distorted shapes.

Structure-guided

Structure-guided image restoration methods ensure that the generated restoration results maintain visual consistency and coherence with the original image. Liu et al. [25] propose a framework based on monotonic transformation structure guidance, which preserves both the neighborhood coherence around the restored region and the global structural properties of the image. Similarly, Guo et al. [26] propose the CTSDG method, which, although not specifically designed for ancient painting restoration, integrates the technical concepts of structural constraints and texture synthesis. Their approach demonstrates the effectiveness of integrating structural and texture information to guide image inpainting tasks.

Structure-guided image restoration methods make effective use of the structural information inherent in the image itself, and the incorporation of structural priors imposes robust constraints that yield semantically coherent and natural restoration results. However, when faced with artworks such as ancient Chinese paintings, characterized by highly intricate structures and delicate brushstroke textures, their distinctive unstructured features, including painting techniques, color transitions, and brush-and-ink rhythms, must also be treated as critical considerations.

Transformer-based restoration

Convolutional neural networks have achieved remarkable results in image restoration tasks, but their performance in certain complex restoration scenarios is limited by their local receptive fields and the difficulty of modeling long-range dependencies. The emergence of the Transformer and attention mechanisms provides new ideas for addressing the challenges in image inpainting. Song et al. [27] propose a two-stage framework for contextual image restoration, where contextual information is learned using an attention module after the rough filling of missing regions with a GAN. Liu et al. [28] propose a novel coherent semantic attention method based on a deep generative model. This method not only preserves the contextual structure but also predicts the missing parts more efficiently by modeling the semantic correlation between hole features. Li et al. [29] propose MAT, a Transformer-based large-hole image restoration model that skillfully combines the advantages of Transformers and convolution. In the same year, Wan et al. [30] similarly combine the Transformer with convolution to propose a high-fidelity multivariate image restoration method. Additionally, Dong et al. [31] achieve high-quality restoration results by reasoning efficiently in the lower-resolution sketch space and employing an attention-based Transformer module to gradually restore the overall structure of the image.

The attention mechanism in Transformer can effectively learn the global and structural information of colors, strokes, and compositions in ancient paintings, enabling a deeper and more comprehensive understanding of the semantic information in the images, thus preserving their cultural connotations and styles.

Methodology

This section focuses on the training and test datasets utilized in this paper, as well as the research methodology.

Dataset details

Our primary training is conducted on the MaskCLP dataset, which comprises a sketch dataset, an ancient Chinese landscape painting dataset, and a Mask dataset. To assess the generalization of the model, we mix the constructed Mural dataset with the landscape painting data to evaluate the impact of various elements on the restoration results.

MaskCLP dataset

The proposed restoration dataset of ancient landscape paintings with masks comprises 12,242 images in total, divided into three parts: 5,621 ancient landscape paintings, 1,000 masks, and 5,621 grayscale sketches of the landscape paintings.

  1. Ancient landscape paintings dataset

    The collected ancient paintings, sourced from collaborating institutions and digital art painting databases, are meticulously classified by professional art workers according to style, dynasty, and color characteristics to obtain rich stylistic information. We carefully screen and categorize these paintings to ensure the diversity and representativeness of the dataset. The dataset consists of 5,621 images, with 5,061 images allocated to the training set and 560 images to the test set. Fig. 2 shows paintings by Xin Hong, Boju Zhao, Zhen Wu, Qichang Dong, and Daqian Zhang from the MaskCLP dataset.

  2. Mask dataset

    The irregular mask dataset referenced by Liu et al. [32] was released by the Nvidia team in 2018 and comprises 55,116 training mask samples and 24,866 test mask samples. However, in this study we did not use the irregular mask resources from the Nvidia team. Instead, we opted for real antique landscape painting materials provided by the cultural institutions and museums with which we collaborate. We meticulously and precisely digitally scanned these genuine, damaged landscape paintings. Subsequently, a threshold segmentation method was employed to accurately extract 50 masks from the scanned paintings. To further enhance the diversity of the training set and improve the model’s generalization ability, an additional 1,000 mask templates were derived, with sizes normalized to 256\(\times\)256 pixels through random cropping, rotation, and other image augmentation techniques. The final integrated mask dataset comprises 900 samples for training and 100 samples for testing. This unique dataset, closely resembling real-world application scenarios, forms a robust foundation for our research. Fig. 3 shows real masks from the MaskCLP dataset, which are applied to ancient Chinese paintings to simulate damaged areas.

  3. Grayscale sketch landscape painting dataset

    We extract grayscale sketches from the dataset of historical landscape paintings. In addition to retaining some of the grayscale information, the sketches also preserve the distinctive stylistic elements of the historical paintings. Fig. 4 shows grayscale sketches from the MaskCLP dataset, which serve as prior conditions to guide the restoration of the corresponding ancient paintings.

    Fig. 2

    Examples of the painting on MaskCLP. From left to right, (a) is a painting of ‘Scroll of Exquisite Colors Along Bamboo Path’ by Xin Hong from the Qing Dynasty, (b) is a painting of ‘Flying Immortals Painting’ by Boju Zhao from the Southern Song Dynasty, (c) is a painting of ‘Painting of Rain Amidst Brook and Mountain’ by Zhen Wu from the Yuan Dynasty, (d) is a painting of ‘Mountain Landscape Painting’ by Qichang Dong from the Ming Dynasty, and (e) is a painting of ‘Painting Album of Landscapes’ by Daqian Zhang from the modern era

    Fig. 3

    Examples of real masks on MaskCLP. (a–e) are damage masks extracted from authentic damaged Chinese paintings. These authentic masks are applied to the ancient Chinese paintings displayed in Fig. 2 to simulate damaged areas

    Fig. 4

    Examples of grayscale sketches on MaskCLP. (a–e) are grayscale sketch images corresponding to the landscape paintings displayed in Fig. 2

In the history of Chinese painting, numerous outstanding painters emerged across various periods, each distinguished by unique painting styles.

Xin Hong is renowned for his delicate brushwork and exquisite landscape and flower-and-bird paintings. Boju Zhao focused on rigorous composition and magnificent colors, employing the ‘Outline and Filling’ technique to depict rocks and trees. Zhen Wu inherited the style of Yuan Dong and Ran Ju, showcasing the unique Jiangnan landscape with vigorous ink and a moist artistic conception. Qichang Dong innovated the ‘Ku Shi’ painting method, characterized by dynamic brushwork and an elegant layout, reflecting the refined atmosphere of ancient literati. Daqian Zhang, a modern Chinese painter, adopted a dynamic and impactful style characterized by grandeur, powerful brushwork, and vibrant colors, blending Chinese and Western painting techniques. The artistic legacies of these painters, spanning different eras, are diverse and rich, contributing invaluable treasures to the tradition of Chinese painting.

Mural

In addition, we create a Mural dataset to verify the restoration performance of our model. The Mural dataset includes images of four distinct kinds of murals: Thangka, Temple, Cave, and Tomb.

The dataset contains various types of mural images from different locations. The Cave dataset comprises more than 300 images from Dunhuang, Yungang, Longmen, Xinjiang Ghuzi Grottoes, and Maijishan Grottoes. The Temple dataset mainly features murals from the Shanxi region and includes more than 300 images. Additionally, the Tomb dataset contains more than 300 images, while the Thangka dataset provides more than 200 images. To verify the generalization ability of the SGRGAN network, a total of 99 murals are selected from these four types of murals for inference testing in the experiment.

Fig. 5

The overall Structure of SGRGAN. (a) Generator Details: Achieving simultaneous processing and optimization of image structure information and texture features through a coupled encoder-decoder. (b) Discriminator Details: Utilizing a dual-stream discriminator to estimate texture and structure feature statistics to distinguish between restored and real images. (c) BiSCCFormer Block: A novel module based on a Bi-level routing attention mechanism, facilitating a deeper and more comprehensive understanding of the intrinsic structure and semantic information within images. (d) Focal Block: Utilizing two skip connection structures to expand the receptive field of feature maps, emphasizing feature learning in important areas while combining coarse-grained global data and fine-grained local features. (e) Feature Fusion Block: Deep fusion of extracted texture features and structural features

Pre-processing

Landscape painting pre-processing

We undertake a meticulous process of sifting and classifying a substantial number of ancient paintings. Furthermore, a variety of data augmentation techniques are applied to improve the model’s performance and robustness, including resizing, cropping, rotation, and image flipping. The original painting images are adjusted to a uniform resolution of 256 \(\times\) 256 and then aligned with the Mask data for training. To preserve the semantic integrity of the images while enhancing the model’s generalization ability and augmenting the diversity of the training data, the images are randomly cropped and rotated.

Mask pre-processing

We extract the regions with abrupt jumps in pixel value from the real broken paintings, resulting in 50 masks. After data augmentation and expansion, 1,000 mask images are generated through random cropping and rotation and then resized to 256\(\times\)256, matching the resolution of the landscape painting images.
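As a rough illustration of this pipeline, the following Python sketch (using OpenCV) thresholds a scanned damaged painting to obtain a binary mask and expands it by random rotation and cropping to 256\(\times\)256. The threshold value, rotation range, and function names are assumptions for illustration, not the exact procedure used here.

```python
import random
import cv2
import numpy as np

def extract_mask(scan_path, thresh=200):
    """Threshold segmentation of a scanned damaged painting.
    The threshold value is an assumed default and would be tuned per scan."""
    gray = cv2.imread(scan_path, cv2.IMREAD_GRAYSCALE)
    # pixels whose value jumps above the threshold are treated as damage
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    return mask

def augment_mask(mask, size=256):
    """Random rotation and cropping used to expand the 50 real masks."""
    h, w = mask.shape
    angle = random.uniform(0, 360)
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(mask, rot, (w, h))
    y = random.randint(0, max(0, h - size))
    x = random.randint(0, max(0, w - size))
    crop = rotated[y:y + size, x:x + size]
    return cv2.resize(crop, (size, size))
```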

Overall structure

The overall network framework is shown in Fig. 5. The proposed SGRGAN method adopts a generative adversarial network comprising a generator and a discriminator. Specifically, the generator adopts a dual-stream encoder-decoder structure, reconstructing the structure and texture of the missing region through separate texture and structure encoder-decoder pairs. This approach allows the high-dimensional information from each stream to complement the other during texture and structure recovery. Subsequently, the feature outputs of the two decoders are concatenated and passed through the feature fusion module, comprising the BiSCCFormer block, Focal block, MHSA module, and Multi-scale module. This configuration enhances the correlation between local features of the image and fully explores and integrates features at various levels, thereby improving the quality and detail accuracy of the final image generation and reconstruction.

Finally, the texture and structural feature statistics are estimated by a dual-stream discriminator to distinguish the real image from the restored image. (a) The output restoration image is simultaneously fed into the texture discriminator along with the input real image. (b) The grayscale sketch of the restored image is extracted using the sketch extraction method, and then simultaneously fed into the structural discriminator along with the input grayscale sketch. Finally, the outputs of the two branches are concatenated in the channel dimension, based on which we calculate the adversarial loss.

Generator details

Inspired by CTSDG [26], this paper introduces a novel dual-stream coupling network designed to reconstruct the texture and structure of missing regions in landscape paintings, tailored to the characteristics of traditional Chinese paintings, as depicted in Fig. 5a. The image is simultaneously processed and optimized for structural information, including object shape, contour, and layout, and texture features, including color, texture details, and material properties. The network comprises two encoder-decoder structures.

Texture encoder-decoder

This branch facilitates structurally constrained texture synthesis, ensuring that the generated texture aligns with the specified structure and thereby maintaining realism and diversity. The texture encoder takes mask images and the broken landscape painting as input and generates feature maps at different scales. The resulting texture feature mapping is denoted \(E_{t}\).

Structure encoder-decoder

Texture-guided structural reconstruction enables more accurate inference of structural details in deeper features by analyzing and leveraging texture cues from shallow features or corrupted images. The structure encoder takes mask images and the broken grayscale sketch image as input and produces feature maps at different scales. The resulting structural feature mapping is denoted \(E_{s}\).

Subsequently, the final-level feature output of the texture encoder is fed into the structure decoder for deconvolution. At each deconvolution level, the feature maps of the corresponding resolution from the structure encoder are skip-connected to it. Similarly, the output of the last level of the structure encoder is input into the texture decoder, with the feature maps of the corresponding resolution from the texture encoder skip-connected to it. Through these steps, each stream leverages the high-dimensional information of the other as a complement while reconstructing the texture and structure information. This process can be formulated as:

$$\begin{aligned} \begin{aligned} E_c&= concat(E_t,E_s) \end{aligned} \end{aligned}$$
(1)

\({E}_{c}\) is then input to the feature fusion module, which is shown in Fig. 5e. The quality and detail accuracy of the final image generation and reconstruction are improved by fully exploring and combining multiple levels of information to strengthen the correlation between local features of the image.

G denotes generator and D denotes discriminator. \(I_{g}\) denotes undamaged image. \(S_{g}\) denotes the grayscale sketch image of the complete original image.

\(I_{i}\) is the damaged image after Mask processing. \(S_{i}\) is the damaged grayscale sketch image after Mask processing. \(M_{i}\) denotes the initial binary mask. It can be formulated as:

$$\begin{aligned} \begin{aligned} I_{i}&= I_{g}\odot M_{i} \\ S_{i}&= S_{g}\odot M_{i} \end{aligned} \end{aligned}$$
(2)

\(I_{o}\) denotes the output image of the generator network. \(S_{o}\) denotes the output grayscale sketch image of the generator network. It can be formulated as:

$$\begin{aligned} \begin{aligned} I_{o}={G}(I_{i}) \\ S_{o}={G}(S_{i}M_{i}) \end{aligned} \end{aligned}$$
(3)
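For concreteness, Eq. (2) amounts to an element-wise (Hadamard) product of the ground truth and its grayscale sketch with the binary mask. A minimal sketch is shown below, assuming the convention that mask value 1 marks known pixels and 0 marks missing pixels.

```python
import torch

def degrade(image: torch.Tensor, sketch: torch.Tensor, mask: torch.Tensor):
    """Apply Eq. (2): produce the damaged inputs from the ground truth
    image I_g, its grayscale sketch S_g, and the binary mask M_i."""
    damaged_image = image * mask    # I_i = I_g (*) M_i
    damaged_sketch = sketch * mask  # S_i = S_g (*) M_i
    return damaged_image, damaged_sketch
```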

BiSCCFormer block

In this paper, we propose a new BiSCCFormer block based on a Bi-level routing attention mechanism [33], as depicted in Fig. 5c. The attention mechanism is introduced to facilitate a deeper and more comprehensive understanding of the internal structure and semantic information of the image. Additionally, it plays a crucial role in preserving the profound cultural connotations and historical styles embedded in ancient artistic works.

When globally modeling image information, the BiSCCFormer block can effectively reduce redundancy in the computation of the self-attention mechanism. In the initial stage, we add the Spatial and Channel Reconstruction Convolution (SCConv) [34] module. SCConv examines feature extraction in traditional CNNs from a novel perspective, reducing redundant features by simultaneously reconstructing the spatial and channel dimensions of the feature map. This approach not only decreases the model parameters but also enhances the efficacy of feature representation.

Specifically, the Bi-level Routing Attention module is an innovative dynamic sparse attention mechanism, depicted in Fig. 5c. The core idea is to filter out the least relevant key-value pairs at the coarse-grained region level, and subsequently compute token-to-token attention within the remaining regions. It is specifically crafted to dynamically learn and capture long-range dependencies between different image regions, aiming to achieve a more profound and comprehensive understanding of the intrinsic structure and semantic meanings within the image.

\(I_{g}\) denotes the original undamaged image. \(I_{s}\) denotes the result of the processing performed by these modules. This process can be formulated as:

$$\begin{aligned} \begin{aligned} I_{s}&=\sigma \left( {{\mathcal {S}}({I}_{g})}\right) \\ E_{o}&=\zeta {({I}_{g},{I}_{s})} \end{aligned} \end{aligned}$$
(4)

where \(\sigma\) denotes the Bi-level Routing Attention module, \({\mathcal {S}}\) denotes the SCConv module, \(\zeta\) denotes the BiSCCFormer module, and \(E_{o}\) denotes the output feature matrix of the BiSCCFormer module.
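To make the routing idea concrete, the following single-head PyTorch sketch illustrates coarse region-to-region routing followed by fine token-to-token attention within the selected regions. It omits the SCConv stage and the multi-head projection, and the window count and top-k values are assumptions rather than the settings used in this work.

```python
import torch
import torch.nn as nn

class BiLevelRoutingAttention(nn.Module):
    """Simplified single-head sketch: region-level affinity keeps the top-k
    most relevant regions, then token-level attention is computed only
    within those routed regions."""
    def __init__(self, dim, n_win=8, topk=4):
        super().__init__()
        self.n_win, self.topk = n_win, topk
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C); H and W are assumed divisible by n_win
        B, H, W, C = x.shape
        hw, ww = H // self.n_win, W // self.n_win
        R, T = self.n_win * self.n_win, hw * ww
        # partition the feature map into R regions of T tokens each
        x = x.view(B, self.n_win, hw, self.n_win, ww, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, R, T, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # coarse level: region descriptors are mean-pooled tokens
        affinity = q.mean(2) @ k.mean(2).transpose(-1, -2)        # (B, R, R)
        idx = affinity.topk(self.topk, dim=-1).indices            # (B, R, k)
        # gather key/value tokens of the routed regions for each query region
        gather_idx = idx.view(B, R, self.topk, 1, 1).expand(-1, -1, -1, T, C)
        k_sel = torch.gather(k.unsqueeze(1).expand(-1, R, -1, -1, -1), 2, gather_idx)
        v_sel = torch.gather(v.unsqueeze(1).expand(-1, R, -1, -1, -1), 2, gather_idx)
        k_sel = k_sel.reshape(B, R, self.topk * T, C)
        v_sel = v_sel.reshape(B, R, self.topk * T, C)
        # fine level: token-to-token attention inside the routed regions
        attn = (q @ k_sel.transpose(-1, -2)) * self.scale
        out = self.proj(attn.softmax(dim=-1) @ v_sel)             # (B, R, T, C)
        # restore the spatial layout
        out = out.view(B, self.n_win, self.n_win, hw, ww, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```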

Focal block

Ancient paintings exhibit a higher degree of abstraction in color collocation, composition layout, and elements. Traditional algorithms may struggle with such abstract expression, making it challenging to achieve accurate restoration of details and the overall artistic conception. Therefore, we propose a novel feature focusing module, termed Focal block, as depicted in Fig. 5d, which effectively combines fine-grained local features of color collocation and element abstraction with coarse-grained global features of composition layout.

Drawing inspiration from ConvNeXt [35], we incorporate Dilated Convolution into the Focal block. Additionally, we employ two skip connection structures to expand the receptive field of the feature map, accentuate the learning of important regions, and diminish the impact of background or non-key regions. This approach enhances the efficiency of attention allocation in the model, particularly when handling landscape paintings with intricate composition and layout. Furthermore, it enables the simultaneous fusion of fine-grained local features and coarse-grained global features.

This process can be formulated as:

$$\begin{aligned} \begin{aligned} E_{w}&=\delta ({I})\oplus {I} \\ E_{J}&={\mathcal {E}}(E_{w})\oplus E_{w} \\ E_{m}&=\varphi (\varphi (E_{J}))\oplus {I} \end{aligned} \end{aligned}$$
(5)

where I denotes the module input. \(\oplus\) indicates the operation of the feature fusion. \(\delta\) represents the depthwise separable convolution. \(E_{w}\) indicates the output of the depthwise separable convolution. \({\mathcal {E}}\) denotes the dilated convolution. \(E_{J}\) represents the output feature of the dilated convolution. \(\varphi\) stands for the 1 \(\times\) 1 convolution. \(E_{m}\) stands for the output feature of the Focal block.
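A minimal PyTorch sketch of Eq. (5) is given below. It assumes that the feature fusion operator \(\oplus\) is element-wise addition and uses GELU activations; both choices are assumptions for illustration rather than details confirmed by the text.

```python
import torch.nn as nn

class FocalBlock(nn.Module):
    """Sketch of Eq. (5): depthwise separable conv, dilated conv, and two
    1x1 convs, each fused with a skip connection."""
    def __init__(self, channels, dilation=2):
        super().__init__()
        # depthwise separable convolution (delta)
        self.dw = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
        )
        # dilated convolution (script E) to enlarge the receptive field
        self.dilated = nn.Conv2d(channels, channels, 3,
                                 padding=dilation, dilation=dilation)
        # two 1x1 convolutions (phi)
        self.pw1 = nn.Conv2d(channels, channels, 1)
        self.pw2 = nn.Conv2d(channels, channels, 1)
        self.act = nn.GELU()  # activation choice is an assumption

    def forward(self, x):
        e_w = self.act(self.dw(x)) + x            # E_w = delta(I) + I
        e_j = self.act(self.dilated(e_w)) + e_w   # E_J = E(E_w) + E_w
        e_m = self.pw2(self.pw1(e_j)) + x         # E_m = phi(phi(E_J)) + I
        return e_m
```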

MHSA module

The Multi-Head Self-Attention (MHSA) module is the multi-head self-attention mechanism from the Transformer [36]. Our network places the MHSA module at the end of the generator. The network employs a Bi-level routing attention mechanism at the shallow stage and a global multi-head self-attention mechanism at the deep stage. This design enables the model to fully exploit the locality and sparsity of the attention matrix when processing low-level information and to fully model long-distance dependencies when handling high-level information, thereby improving the model’s reconstruction ability.
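For illustration, a minimal global MHSA block applied to a flattened feature map might look as follows; the head count, normalisation placement, and residual connection are assumptions.

```python
import torch.nn as nn

class MHSABlock(nn.Module):
    """Global multi-head self-attention over a (B, C, H, W) feature map."""
    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        y = self.norm(tokens)
        y, _ = self.attn(y, y, y)               # global self-attention
        tokens = tokens + y                     # residual connection
        return tokens.transpose(1, 2).view(b, c, h, w)
```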

Discriminator details

In this paper, we introduce a two-stream discriminator to differentiate between the real image and the restored image by estimating the feature statistics of texture and structure.

Texture discriminator

The output restoration image is simultaneously fed into the texture discriminator along with the input real image.

The texture discriminator comprises three convolutional layers with a kernel size of 4 and a stride of 2, along with two convolutional layers in the tail with a kernel size of 4 and a stride of 1. We employ the Sigmoid non-linear activation function in the last layer and utilize the Leaky ReLU activation function with a slope of 0.2 in the remaining layers.
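Following this description, a plausible PyTorch sketch of the texture discriminator is shown below; the channel widths and the application of spectral normalization to every convolution are assumptions made for illustration.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

class TextureDiscriminator(nn.Module):
    """Three stride-2 and two stride-1 convolutions with kernel size 4,
    LeakyReLU(0.2) activations, and a final Sigmoid."""
    def __init__(self, in_channels=3):
        super().__init__()
        def conv(cin, cout, stride):
            return spectral_norm(nn.Conv2d(cin, cout, 4, stride, padding=1))
        self.net = nn.Sequential(
            conv(in_channels, 64, 2), nn.LeakyReLU(0.2, inplace=True),
            conv(64, 128, 2), nn.LeakyReLU(0.2, inplace=True),
            conv(128, 256, 2), nn.LeakyReLU(0.2, inplace=True),
            conv(256, 512, 1), nn.LeakyReLU(0.2, inplace=True),
            conv(512, 1, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
```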

Structure discriminator

The output restoration image is sent to the sketch extraction network, which extracts the grayscale sketch of the restoration result; the loss function is then calculated against the input grayscale sketch.

The convolutional layer kernel size in the structure discriminator is 1. Subsequently, the output features of the two branches are concatenated along the channel dimension to compute the adversarial loss.

Therefore, the structural branching not only assesses the authenticity of the generated structure but also ensures its alignment with the real image. Additionally, a spectral normalization layer is introduced to effectively address the training instability problem in generative adversarial networks.

Sketch extraction

We adopt a traditional approach to sketch extraction and use it within the structure encoder to extract the grayscale sketch corresponding to the restored image, a crucial step for authenticating the structure. Initially, the image is converted to grayscale and then inverted. Subsequently, a blur filter is applied to the inverted image to create a smooth, detail-blurred version. Next, all pixel coordinates of the original grayscale image are iterated over to obtain the grayscale value a at the current coordinate and the pixel value b at the corresponding position in the blurred, inverted image. A new blended pixel value is computed from a and b and written back to the corresponding position in the original grayscale image. Finally, the resulting grayscale sketch image is obtained.
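This procedure corresponds closely to the classic colour-dodge pencil-sketch technique; a compact OpenCV sketch is given below. The colour-dodge blend and the Gaussian blur kernel size are assumptions consistent with the steps described above.

```python
import cv2
import numpy as np

def extract_sketch(image_path, blur_ksize=21):
    """Grayscale-sketch extraction: grayscale, invert, blur, then blend
    each grayscale value a with the blurred inverted value b."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)        # convert to grayscale
    inverted = 255 - gray                                # invert black and white
    blurred = cv2.GaussianBlur(inverted, (blur_ksize, blur_ksize), 0)
    a = gray.astype(np.float32)
    b = blurred.astype(np.float32)
    # colour-dodge blend (assumed): brightens a according to b
    sketch = np.clip(a * 255.0 / (255.0 - b + 1e-6), 0, 255).astype(np.uint8)
    return sketch
```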

Loss function

The network is trained with a joint loss comprising reconstruction loss, perceptual loss, style loss, intermediate loss, and adversarial loss.

Reconstruction loss

We take the \(l_{1}\) distance between \(I_{o}\) and \(I_{g}\) as the reconstruction loss. It can be formulated as:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{r}={\mathbb {E}}[\Vert I_{o}- I_{g}\Vert _1] \end{aligned} \end{aligned}$$
(6)

where \({\mathcal {L}}_{r}\) denotes the Reconstruction Loss.

Perceptual loss

Since the reconstruction loss struggles to capture the high-level semantics of an image, we introduce the perceptual loss \({\mathcal {L}}_{p}\) to evaluate the global structure of the image, measuring the \(l_1\) distance between \(I_{o}\) and \(I_{g}\) in the feature space defined by a VGG-16 network [37] pre-trained on ImageNet [38]. This process can be formulated as:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{p}={\mathbb {E}}\left[ \sum _i\Vert \phi _i(I_{o})-\phi _i(I_{g})\Vert _1\right] \end{aligned} \end{aligned}$$
(7)

where i indexes the pooling layers, and \(\phi _i(\cdot )\) denotes the activation map of the i-th pooling layer of VGG-16.

Style loss

We further introduce style loss to ensure style consistency. This process can be formulated as:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{s}={\mathbb {E}}\left[ \sum _i\Vert \psi _i(I_{o})-\psi _i(I_{g})\Vert _1\right] \end{aligned} \end{aligned}$$
(8)

where \(\psi _i(\cdot )=\phi _i(\cdot )^T\phi _i(\cdot )\), and \(\psi _i(\cdot )\) denotes the Gram matrix constructed from the activation map \(\phi _i(\cdot )\).
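A compact sketch of Eqs. (7) and (8) with a frozen VGG-16 is given below; the particular pooling-layer indices and the Gram-matrix normalisation factor are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class VGGPerceptualStyleLoss(nn.Module):
    """L1 distances between VGG-16 pooling activations (perceptual loss)
    and between their Gram matrices (style loss)."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg
        self.pool_ids = {4, 9, 16, 23, 30}  # indices of the five pooling layers

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.pool_ids:
                feats.append(x)
        return feats

    @staticmethod
    def _gram(f):
        b, c, h, w = f.shape
        f = f.view(b, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)  # normalisation is assumed

    def forward(self, output, target):
        fo, fg = self._features(output), self._features(target)
        l_p = sum(F.l1_loss(a, b) for a, b in zip(fo, fg))
        l_s = sum(F.l1_loss(self._gram(a), self._gram(b)) for a, b in zip(fo, fg))
        return l_p, l_s
```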

Adversarial loss

Designed to guarantee both textural and structural consistency and visual realism in the reconstructed image, the adversarial loss can be formulated as:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{I}&= \min _G\max _D{\mathbb {E}}_{(I_{g})\sim p_{data}(I_{g})}[\log D(I_{g})]\\&+ {\mathbb {E}}_{z\sim p_z(z)}[\log (1-D(I_{o}))] \\ {\mathcal {L}}_{S}&= \min _G\max _D{\mathbb {E}}_{(S_{g})\sim p_{data}(S_{g})}[\log D(S_{g})]\\&+ {\mathbb {E}}_{{S_{o}}\sim p_{data}(S_{o})}[\log (1-D(S_{o}))] \\ {\mathcal {L}}_{a}&={\mathcal {L}}_{I} + {\mathcal {L}}_{S} \end{aligned} \end{aligned}$$
(9)

where \({\mathcal {L}}_{I}\) represents the adversarial loss between the original image and the inpainted image. \({\mathcal {L}}_{S}\) represents the adversarial loss between the grayscale sketch of the input image and the grayscale sketch of the inpainted image.

Intermediate loss

To enable the structure encoder and texture encoder to capture structural and textural features, we introduce an intermediate loss. \(F_s\) denotes the structure feature map output by the structure decoder, and \(F_t\) denotes the texture feature map output by the texture decoder. It can be formulated as:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{i}&={\mathcal {L}}_{s}+{\mathcal {L}}_{t}\\&=\ell _{1}\left( {\mathcal {S}}_{g},{\mathcal {R}}_{s}(F_{s})\right) +\ell _{1}\left( I_{g},{\mathcal {R}}_{t}(F_{t})\right) \end{aligned} \end{aligned}$$
(10)

where \({\mathcal {R}}_{s}\) denotes the structure projection function, which maps \(F_s\) to the grayscale sketch image, and \({\mathcal {R}}_{t}\) denotes the texture projection function, which maps \(F_t\) to the restored landscape painting image.

Joint loss

In summary, the joint loss is written as:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{j}&=\lambda _{r}{\mathcal {L}}_{r}+\lambda _{p}{\mathcal {L}}_{p}+\lambda _{s}{\mathcal {L}}_{s}\\&+\lambda _{a}{\mathcal {L}}_{a}+\lambda _{i}{\mathcal {L}}_{i} \end{aligned} \end{aligned}$$
(11)

where \(\lambda _{r}\), \(\lambda _{p}\), \(\lambda _{a}\), \(\lambda _{s}\), and \(\lambda _{i}\) are the weight parameters, we set \(\lambda _{r}\)=10, \(\lambda _{p}\)=0.1, \(\lambda _{a}\)=0.1, \(\lambda _{s}\)=250 and \(\lambda _{i}\)=1.
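As a direct transcription of Eq. (11) with the weights above:

```python
# Weights reported for the joint objective in Eq. (11).
LAMBDA_R, LAMBDA_P, LAMBDA_A, LAMBDA_S, LAMBDA_I = 10.0, 0.1, 0.1, 250.0, 1.0

def joint_loss(l_r, l_p, l_s, l_a, l_i):
    """Combine reconstruction, perceptual, style, adversarial, and
    intermediate losses into the joint training objective."""
    return (LAMBDA_R * l_r + LAMBDA_P * l_p + LAMBDA_S * l_s
            + LAMBDA_A * l_a + LAMBDA_I * l_i)
```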

Experiments

Supplementary dataset

Additionally, we performed a preliminary test on the public datasets Places2 and CelebA.

\(\bullet\) Places2 [39]. The dataset, released by MIT, comprises more than 1.8 million images spanning 365 scene categories.

\(\bullet\) CelebA [40]. Released by the Chinese University of Hong Kong, the dataset comprises over 180,000 labeled face images. CelebA is widely recognized as one of the most commonly used datasets in image restoration research.

Implementation details

Training setting

All image datasets and mask datasets are uniformly resized to a resolution of 256\(\times\)256 pixels. Training is conducted on an NVIDIA 4090 GPU with a batch size of 12, using the Adam optimizer. The initial training uses a learning rate of \(2\times 10^{-4}\), and the fine-tuning stage uses a learning rate of \(5\times 10^{-5}\).

Table 1 Comparison results on MaskCLP

Training process

Before training on the MaskCLP dataset, the first step is to uniformly preprocess all images in MaskCLP and resize them to the standard resolution of 256\(\times\)256 pixels. In the formal training stage, the model combines the landscape painting data from MaskCLP with masks to simulate the damaged images to be repaired, and similarly combines the sketch data with masks to simulate the damaged sketches to be repaired.

Throughout the training process, damaged images along with their masks are inputted into the texture encoder, while broken sketch images and their corresponding masks are inputted into the structure encoder. Subsequently, the entire generative network focuses on reconstructing these damaged images with high precision. The reconstructed images are then passed to the discriminator for evaluation. The discriminator consists of two branches: the texture branch, tasked with evaluating the similarity between the restored image and the original landscape paintings in MaskCLP, and the structure branch, which extracts and compares sketch information from the restored image with real sketch images in MaskCLP.

Throughout training, the network is optimized with the joint loss function until the model converges. Additionally, to assess the model’s generalization ability beyond MaskCLP, we also train it on the public restoration datasets and the Mural dataset. Since sketch data are not explicitly provided in the public restoration datasets or the Mural dataset, we use Python library functions to automatically extract the necessary sketches for training. For the mask data, we consistently use the in-house Mask dataset described above.

Testing process

During testing, we designate the test set as 10\(\%\) of the training set. Simultaneously, we partition the mask into three groups based on mask ratios: 0–15\(\%\), 15–30\(\%\), and 30–45\(\%\). Additionally, our testing method outputs images of size 256\(\times\)512. Subsequently, we evaluate and calculate metrics for different mask ratios.
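For reproducibility, assigning a test mask to one of these ratio groups can be done as in the sketch below, assuming the Eq. (2) convention that mask value 1 marks a known pixel and 0 a missing pixel.

```python
import numpy as np

def mask_ratio_bucket(mask: np.ndarray) -> str:
    """Assign a binary (or 0/255) mask to one of the evaluation groups."""
    known = (mask > 0).astype(np.float32)  # 1 = known pixel (assumed convention)
    ratio = 1.0 - known.mean()             # fraction of missing pixels
    if ratio <= 0.15:
        return "0-15%"
    if ratio <= 0.30:
        return "15-30%"
    if ratio <= 0.45:
        return "30-45%"
    return ">45%"
```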

Inference process

During the inference process, the output is required to be exactly equal to the resolution of the input image, and the average test time is about 0.027 s per image.

Evaluation metrics

In the field of AI art, to quantitatively measure the quality of the images generated by our work, we adopt three metrics: PSNR, SSIM [41], and LPIPS [42]. These metrics are employed for comparison with existing methods. Crucially, a key question when comparing with the creativity of human artists is whether AI can transcend pure technical imitation and achieve artistic creation with independent aesthetic value. Thus, the evaluation we establish not only assesses the images generated by the model in terms of image quality, realism, and diversity, but also considers their artistic beauty and visual appeal, better reflecting innovation and cultural background.
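A typical way to compute these three metrics is sketched below, using scikit-image for PSNR/SSIM and the lpips package for LPIPS; the exact preprocessing (uint8 RGB arrays, LPIPS inputs scaled to [-1, 1]) is an assumption about the evaluation pipeline.

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual similarity network

def evaluate(restored, ground_truth):
    """restored, ground_truth: uint8 RGB arrays of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(ground_truth, restored, data_range=255)
    ssim = structural_similarity(ground_truth, restored,
                                 channel_axis=-1, data_range=255)
    to_tensor = lambda a: (torch.from_numpy(a).permute(2, 0, 1)[None].float()
                           / 127.5 - 1.0)   # scale to [-1, 1] for LPIPS
    lpips_score = lpips_fn(to_tensor(restored), to_tensor(ground_truth)).item()
    return psnr, ssim, lpips_score
```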

Main results

Baselines

\(\bullet\) SI [18]: This method constructs a local best approximation using the radial basis function network algorithm and repairs the damaged image pixels through interpolation.

\(\bullet\) EC [22]: This method proposes a two-stage model that separates the inpainting problem into structure prediction and image completion.

\(\bullet\) RFR [9]: This method introduces a progressive restoration approach that proceeds from easy to difficult regions.

\(\bullet\) PD-GAN [23]: This image restoration method is built on a vanilla GAN and generates images from random noise.

\(\bullet\) CTSDG [26]: This method proposes a texture and structure coupling restoration network that divides the image restoration task into two complementary subtasks: texture synthesis and structure reconstruction.

\(\bullet\) MAT [29]: This method is the pioneering Transformer-based model for large hole inpainting.

\(\bullet\) MISF [10]: This method integrates traditional methods with deep learning techniques in image restoration.

Fig. 6

Restoration result of traditional method on MaskCLP. SI denotes the simple interpolation method. Both the simple interpolation method and the SGRGAN method are used to simulate landscape image restoration with a mask ratio of 15\(\%\). The figure illustrates the repair effects achieved by each respective method. (a) Input images, (b) Restored result of the Simple interpolation method, (c) Restored result of the SGRGAN, (d) Ground Truth

Fig. 7

Restoration results with mask ratio of 0–15\(\%\) on MaskCLP. Compared to other inpainting methods, SGRGAN restores the structure of trees and rivers more reasonably, enhances texture clarity, and produces restored results that closely resemble the original image. (a) and (b) represent ancient Chinese landscape painting images with mask ratio of 0–15%

Comparison analysis

To demonstrate the effectiveness of our proposed SGRGAN framework, we compare it with several existing state-of-the-art (SOTA) methods, including EC [22], RFR [9], PD-GAN [23], CTSDG [26], MAT [29], and MISF [10]. We quantitatively evaluate the proposed methods using three main metrics: LPIPS, PSNR, and SSIM. The irregular masking scores are categorized into 0–15\(\%\), 15–30\(\%\), and 30–45\(\%\) to compute the corresponding scores. Table 1 shows the experimental results on the MaskCLP dataset.

Table 1 presents the metric results for each restoration method on the MaskCLP dataset. To better illustrate the restoration performance of the proposed method, we compare against the second-best (sub-optimal) results for each evaluation metric. Our model achieves the best result of 30.59 in the PSNR metric, which measures image distortion, an improvement of 0.05 over the sub-optimal MAT, indicating that our model restores images with less distortion and higher quality. Regarding the SSIM metric, which evaluates structural similarity, our model also performs well, with a 0.07 improvement over the sub-optimal MAT, an advantage of 8.13\(\%\); this indicates that our model better maintains image structure and details. Additionally, our model performs well on the LPIPS metric, with a decrease of 6.25\(\%\) to 0.105 compared to the sub-optimal MAT. This analysis shows that our model achieves better restoration in terms of overall image structure, line smoothness, color levels, and detailed textures than other models.

Visualization

Through qualitative comparison with existing state-of-the-art (SOTA) methods, our approach yields satisfactory restoration results and demonstrates a comprehensive understanding of details in MaskCLP, a traditional Chinese landscape painting. The restored paintings exhibit a high level of restoration in terms of overall structure, smoothness of lines, color fidelity, and texture detail, effectively preserving and reproducing the artistic essence of the original artworks. Our method excels in addressing the limitations of existing state-of-the-art methods, effectively resolving issues such as edge blurring, artifact generation, and excessive smoothing. Additionally, it adeptly fills in missing areas while preserving the unique texture and style of the original painting.

Visual comparison of traditional method

The visualization results of simple interpolation methods [18] and SGRGAN restoration are depicted in Fig. 6.

As illustrated in Fig. 6, the outcomes of repairing with the simple interpolation algorithm exhibit excessive smoothing while filling the missing regions. Additionally, the boundaries of the restored regions lack clarity, resulting in a significant loss of details, particularly evident in the high-frequency detail section, leading to severe blurring.

Visual comparison on mask ratio of 0–15\(\%\)

The restoration results of two ancient Chinese landscape paintings on the missing area of 0–15\(\%\) are depicted in Fig. 7.

Specifically, regarding the rock depicted in Fig. 7b, it is evident that RFR, MISF, and CTSDG all exhibit a loss of texture details and display artifacts of varying degrees. In contrast, our proposed method aligns more closely with the overall style of the painting in filling the texture of the missing area, while also preserving brushstrokes consistent with those of the original painting, resulting in a more nuanced performance.

Moreover, color serves as a crucial medium for expressing emotion, atmosphere, and aesthetic conception in artworks, enriching the works with vivid flavors and rich connotations. However, the restoration results for the green grass in Fig. 7a reveal that the EC restoration yields darker results, while the MAT restoration appears yellowish, deviating significantly from the semantic information surrounding the original area and thereby compromising the color fidelity of the original painting. In contrast, our method effectively restores the color information of the original painting, ensuring that the colors align with its overall stylistic characteristics.

Visual comparison on mask ratio of 15–30\(\%\)

The restoration results of two ancient Chinese landscape paintings on the missing area of 15–30\(\%\) are depicted in Fig. 8.

Fig. 8

Restoration results with mask ratio of 15–30\(\%\) on MaskCLP. Compared with other inpainting methods, SGRGAN produces restored regions with fewer artifacts and results that are closer to the original image. (a) and (b) represent ancient Chinese landscape paintings with mask ratio of 15–30%

Specifically, in Fig. 8a, b, the restoration results of EC, RFR, and CTSDG exhibit predominantly yellow hues and severe artifacts, resulting in a significant discrepancy between the restored outcomes and the original style of the landscape painting. Moreover, the texture of the restored portions appears unnatural in relation to the surrounding environment. In contrast, the results repaired by our method do not suffer from color mismatch issues with the overall painting. Although there are minor artifacts present, they are nearly imperceptible to the human eye. Overall, our method effectively restores the original unique aesthetic essence and delicate texture of Chinese landscape painting.

Visual comparison on mask ratio of 30–45\(\%\)

The restoration results of two ancient Chinese landscape paintings on the missing areas of 30–45\(\%\) are depicted in Fig. 9.

Fig. 9

Restoration results with mask ratio of 30–45\(\%\) on MaskCLP. When the damaged area is large, the inpainting results of other methods often exhibit varying degrees of distortion and artifacts, whereas SGRGAN consistently generates high-quality inpainting results. (a) and (b) represent ancient Chinese landscape paintings with mask ratio of 30-45%

Specifically, when examining the river in the restored area of Fig. 9b, the RFR restoration results exhibit evident inappropriate artifacts and distorted lines, the PD-GAN restoration results display noticeable blurring and slight color distortion, and the MAT restoration results demonstrate improper texture filling and edge blurring. In contrast, our restoration results outperform other methods in both color fidelity and texture fidelity and effectively restore the original landscape painting with distinct layers and expansive spatial representation.

Visual comparison on mask ratio of 45\(\%\)

The restoration results of an ancient Chinese landscape painting with a 45\(\%\) missing area are shown in Fig. 10.

Table 2 Ablation experiments on MaskCLP
Table 3 Ablation experiments on Loss Function
Fig. 10

Restoration results with mask ratio of 45\(\%\) on MaskCLP. When inpainting regions with complex textures, RFR, MISF, CTSDG, and other methods exhibit significant artifacts, whereas SGRGAN produces inpainting results free of noticeable artifacts. (a) and (b) represent ancient Chinese landscape paintings with mask ratio of 45%

Specifically, serious artifacts are evident in the mountain repaired by RFR in Fig. 10b, resulting in significant inconsistencies in the structure and layout logic of the entire painting and disrupting the original compositional balance. The MISF restoration exhibits color incongruity and a degree of blurriness. CTSDG fails to restore the clear outline of the original mountain, resulting in distorted lines. In contrast, our restoration approach effectively preserves the form of the various painting elements and restores the mountain outline, tree structures, and stone veins of the original landscape painting, while maintaining stroke consistency with the original artwork.

Overall, based on the experimental results, the outcomes generated by our proposed algorithm exhibit greater semantic coherence, recover more detailed image structures, reconstruct higher-quality images, and yield superior subjective visual effects. SGRGAN effectively restores the structure, texture, color, saturation, and brightness information of the painting. This elevates the artistic expression of the image, imbuing it with richer emotion and artistic conception and bringing it closer to the artistic and stylistic characteristics of the original painting.

Ablation study

This section presents an ablation comparison analysis of the modules of SGRGAN, the loss function, and the trained dataset to verify the performance of the proposed repair method and its effectiveness.

Comparison analysis

In this section, we conduct ablation experiments and comparative analyses concerning the sub-modules, loss functions, and training datasets of SGRGAN, respectively, with the aim of validating the performance and effectiveness of the proposed restoration method.

Table 4 Ablation experiments on different dataset
Fig. 11

Restoration results of SGRGAN with mask ratio of 15\(\%\), 30\(\%\) and 45\(\%\) on MaskCLP. As the simulated damaged area increases, the inpainting performance of SGRGAN gradually decreases, resulting in slight artifacts and distortions

Comparison on module

To demonstrate the effectiveness of the proposed SGRGAN framework, ablation experiments are conducted on sketch-guided, Focal block, and BiSCCFormer block, respectively. The proposed method undergoes quantitative evaluation using three primary metrics: LPIPS, PSNR, and SSIM. Irregular masking scores were categorized into 0–15\(\%\), 15–30\(\%\), and 30–45\(\%\) intervals, and the respective scores are computed.

Table 2 presents the results obtained on the MaskCLP dataset. For a mask ratio of 30–45\(\%\), SGRGAN, compared with SGRGAN w/o Focal, exhibits a 0.16\(\%\) improvement in PSNR, which assesses image quality; a 1.04\(\%\) improvement in SSIM, which evaluates the preservation of structural information; and a reduction of 2.9\(\%\) in LPIPS, which measures perceptual image similarity. Compared to SGRGAN w/o BiSCCFormer, SGRGAN demonstrates a 1\(\%\) improvement in PSNR, a 1.36\(\%\) improvement in SSIM, and a 7.2\(\%\) reduction in LPIPS. The outcomes of these ablation experiments comprehensively demonstrate the effectiveness of the proposed sketch guidance, Focal block, and BiSCCFormer block for enhanced image restoration.

Comparison on loss function

Additionally, to assess the effectiveness of the joint loss, ablation experiments are conducted on SGRGAN w/o \({\mathcal {L}}_{p}\), SGRGAN w/o \({\mathcal {L}}_{s}\), and SGRGAN w/o \({\mathcal {L}}_{i}\), respectively. The proposed method is evaluated quantitatively using three primary metrics: LPIPS, PSNR, and SSIM. Irregular masks are grouped into 0–15\(\%\), 15–30\(\%\), and 30–45\(\%\) coverage intervals, and scores are computed for each interval.
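For concreteness, the sketch below shows one way such a joint objective can be assembled, with \({\mathcal {L}}_{p}\) as a perceptual term, \({\mathcal {L}}_{s}\) as a style term, and \({\mathcal {L}}_{i}\) as an intermediate-output term alongside the reconstruction and adversarial losses; the weighting coefficients, the hinge-style adversarial term, and the choice of feature maps are illustrative assumptions rather than the reported configuration.

```python
# Hedged sketch of a joint inpainting objective: reconstruction + perceptual
# (L_p) + style (L_s) + intermediate (L_i) + adversarial terms. Weights and
# feature choices are placeholders, not the paper's settings.
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Channel-wise Gram matrix used by the style term."""
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def joint_loss(pred, target, pred_feats, target_feats, inter_pred, inter_target,
               d_fake, weights=(1.0, 0.1, 250.0, 1.0, 0.1)):
    w_rec, w_p, w_s, w_i, w_adv = weights
    rec = F.l1_loss(pred, target)                                         # reconstruction
    l_p = sum(F.l1_loss(p, t) for p, t in zip(pred_feats, target_feats))  # perceptual
    l_s = sum(F.l1_loss(gram_matrix(p), gram_matrix(t))
              for p, t in zip(pred_feats, target_feats))                  # style
    l_i = F.l1_loss(inter_pred, inter_target)                             # intermediate output
    l_adv = -d_fake.mean()                                                # hinge generator loss
    return w_rec * rec + w_p * l_p + w_s * l_s + w_i * l_i + w_adv * l_adv
```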

Table 3 presents the results obtained on the MaskCLP dataset. For GAN-based generative networks, the adversarial loss and reconstruction loss are indispensable; consequently, the ablation covers only the remaining loss terms. The table shows that models trained with the joint loss achieve the best restoration results across all breakage ratios. In particular, at a breakage ratio of 30–45\(\%\), training SGRGAN w/o \({\mathcal {L}}_{p}\) results in a decrease of 0.49\(\%\) in PSNR, a decrease of 1\(\%\) in SSIM, and an increase of 5.19\(\%\) in LPIPS.

Additionally, when SGRGAN w/o \({\mathcal {L}}_{s}\) or SGRGAN w/o \({\mathcal {L}}_{i}\) is trained, the model fails to adequately learn the structural information of the landscape painting compared with training on the joint loss, leading to a less reasonable and less clear structure in the restored painting. This demonstrates that the joint loss is effective for the proposed SGRGAN restoration method.

Moreover, compared with SGRGAN w/o \({\mathcal {L}}_{s}\), for mask ratios of 15–30\(\%\), our method gains 0.85\(\%\) in PSNR and 1.2\(\%\) in SSIM, with a 4\(\%\) improvement in LPIPS. Compared with SGRGAN w/o \({\mathcal {L}}_{i}\), for mask ratios of 30–45\(\%\), our method gains 1.58\(\%\) in PSNR and 2.83\(\%\) in SSIM, with a 7.6\(\%\) improvement in LPIPS.

Fig. 12

Restoration results of the ablation experiment on sketch-guided. (a) Input image, (b) Restored sketch: the grayscale sketch predicted by our method, with the yellow regions marking the restored sketch contours, (c) SGRGAN w/o sketch-guided, (d) SGRGAN, (e) Ground truth

Comparison on dataset proportion

Simultaneously, to ascertain the influence of images of different styles within the training set, we conduct an ablation study by injecting 5\(\%\), 10\(\%\), and 15\(\%\) of mural data into the landscape painting training set. We then train and test the model to assess the impact of these different style elements. The proposed method is evaluated quantitatively using three primary metrics: LPIPS, PSNR, and SSIM. Irregular masks are grouped into 0–15\(\%\), 15–30\(\%\), and 30–45\(\%\) coverage intervals, and scores are computed for each interval. Table 4 presents the results obtained on the MaskCLP dataset.
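A minimal sketch of one way to realize this mixing is given below; it assumes the injected fraction is measured relative to the size of the landscape set and uses standard PyTorch dataset utilities, both of which are our illustrative choices rather than the authors' pipeline.

```python
# Illustrative sketch: inject a fixed fraction of mural samples into the
# landscape-painting training set. Interpreting the fraction relative to the
# landscape set size is an assumption.
import random
from torch.utils.data import ConcatDataset, Subset

def mix_datasets(landscape_ds, mural_ds, mural_fraction=0.05, seed=0):
    """Return a training set of landscape data plus a sampled mural subset."""
    n_extra = int(len(landscape_ds) * mural_fraction)
    rng = random.Random(seed)
    idx = rng.sample(range(len(mural_ds)), min(n_extra, len(mural_ds)))
    return ConcatDataset([landscape_ds, Subset(mural_ds, idx)])

# e.g. train_set = mix_datasets(landscape_train, mural_train, mural_fraction=0.10)
```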

As shown in Table 4, incorporating a small amount of mural data leads to minimal fluctuation in the metric results. However, as the proportion of mural data increases, the metrics change slightly. Compared with the scenario without mural data, when the mask ratio is 30–45\(\%\), the PSNR decreases by 0.66\(\%\), the SSIM decreases by 2.2\(\%\), and the LPIPS decreases by 5.9\(\%\).

Based on the comprehensive experimental results, an imbalanced distribution of sample data in the training set can confuse the model's conceptual space during training. Consequently, the model struggles to learn the features of the different data distributions, resulting in varying degrees of decline in the experimental metrics.

Visualization

Through qualitative comparisons involving various mask ratios, different methods of obtaining sketches, and the removal of individual modules, we demonstrate that our method achieves satisfactory inpainting results and captures the details of traditional Chinese landscape paintings, particularly on the MaskCLP dataset.

Ablation analysis on mask ratio

To assess the impact of varying mask ratios on the proposed method, we conduct separate experiments to examine their effect on the results. As depicted in Fig. 11, the mask ratios are 15\(\%\), 30\(\%\), and 45\(\%\) from left to right.

Fig. 11 illustrates that the proposed method produces varying results under different mask ratios. At a 15\(\%\) mask ratio, the inpainting result is nearly perfect, with minimal difference from the original image. At a 30\(\%\) mask ratio, the restoration is relatively ideal, and the texture and shape of the tree in the missing area can be roughly recovered. At a 45\(\%\) mask ratio, the restored area is generally consistent with the painting's overall style and characteristics, although the unique line texture of the rock and the vibrant form of the tree cannot be accurately restored. In general, our proposed method adapts well to different mask proportions and produces satisfactory inpainting results.
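For readers reproducing this setting, irregular masks at an approximate target coverage are often synthesized by drawing random strokes until the desired ratio is reached. The brush-stroke routine below is a generic sketch of this idea with illustrative parameters; it is not necessarily the mask generator used to build MaskCLP.

```python
# Generic sketch: draw random strokes until the mask covers roughly the target
# fraction of the image. Stroke widths and the stopping rule are assumptions.
import numpy as np
import cv2  # opencv-python

def irregular_mask(h, w, target_ratio=0.30, max_strokes=200, seed=None):
    rng = np.random.default_rng(seed)
    mask = np.zeros((h, w), np.uint8)
    for _ in range(max_strokes):
        if mask.mean() >= target_ratio:        # stop once coverage is reached
            break
        x1, y1 = int(rng.integers(0, w)), int(rng.integers(0, h))
        x2, y2 = int(rng.integers(0, w)), int(rng.integers(0, h))
        thickness = int(rng.integers(10, 40))
        cv2.line(mask, (x1, y1), (x2, y2), 1, thickness)
    return mask                                # 1 = masked (missing) pixel
```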

Fig. 13

Restoration results of overall modules on MaskCLP. Ablation experiments are performed on the BiSCCFormer block, Focal block, and sketch-guided within SGRGAN to assess the efficacy of the SGRGAN method and the necessity of these supplementary modules

Ablation analysis on sketch-guided

To assess the impact of sketch guidance on the proposed method, we conduct additional ablation experiments evaluating sketch images extracted by different methods. The experimental findings are presented in Fig. 12. Specifically, the first row displays the grayscale sketch extracted from the traditional Chinese painting, which retains grayscale information. The second row shows the edge map generated by the traditional Canny edge detector.
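As a point of reference, the Canny baseline can be produced directly with OpenCV, while a grayscale, shading-preserving sketch can be approximated with an inverted-blur (colour dodge) blend. The dodge variant below is only an illustrative stand-in for the grayscale sketch extraction used in our method; the thresholds and kernel size are assumptions.

```python
# Two ways to obtain a structural guide from a painting: a binary Canny edge
# map versus a grayscale "pencil sketch" that keeps shading information.
import cv2

def canny_sketch(img_bgr, low=100, high=200):
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, low, high)            # binary edge map

def grayscale_sketch(img_bgr, blur_ksize=21):
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    inv = 255 - gray
    blur = cv2.GaussianBlur(inv, (blur_ksize, blur_ksize), 0)
    # colour-dodge blend: bright in smooth regions, dark along strokes
    return cv2.divide(gray, 255 - blur, scale=256)
```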

Fig. 12 illustrates that, compared with SGRGAN without sketch guidance, introducing sketch guidance improves the preservation of structure and texture in the restored area, aligning it more closely with the surrounding context. In the absence of sketch guidance, restored areas often exhibit blurriness and structural and textural inconsistency. Furthermore, employing the grayscale sketch extracted from traditional Chinese paintings as a structural prior yields better texture in the restored rocks and trees, recovering mountain contours and the veins of trees and rocks while keeping the brushstrokes consistent with the original paintings. Conversely, using the sketch generated by the Canny operator as the structural guide leads to local structural discrepancies in the restored results, along with issues such as distorted lines.

The quantitative results in Table 2 indicate that the PSNR, SSIM, and LPIPS metrics all improve when sketch guidance is employed. In summary, utilizing sketch images as structural priors effectively mitigates the inconsistency between local and overall structure within the restored area, which is reflected in superior quantitative scores.

Ablation analysis on module

To evaluate the effectiveness of the proposed Focal block, BiSCCFormer block, and sketch guidance, we conduct ablation experiments that remove each component individually. The results of removing the Focal block, BiSCCFormer block, and sketch guidance are presented in Fig. 13.

The BiSCCFormer block is incorporated to enhance the correlation of local features, while the Focal block is introduced to maintain consistency in the structure and texture of the restored area. To demonstrate the reconstruction effects of our method on various elements of Chinese landscape paintings, we utilize masks to cover the mountain peaks and trees individually. Specifically, as depicted in Fig. 13, the absence of the BiSCCFormer block results in noticeable deformations and peculiar textures in the reconstructed peaks and trees, deviating from the consistent texture and structure of the surrounding area. Similarly, the absence of the Focal block leads to artifacts and distorted lines in the reconstructed peaks and trees. However, employing both the BiSCCFormer block and Focal block simultaneously enables our method to reconstruct different elements of Chinese landscape paintings effectively. The quantitative results presented in Table 2 further corroborate the effectiveness of the BiSCCFormer block and Focal block.

Ablation analysis on loss function

To assess the influence of reconstruction loss, perceptual loss, style loss, and intermediate loss on our model’s training effectiveness, we perform ablation experiments on the loss function. The experimental results are depicted in Fig. 14.

Fig. 14

Restoration results of the loss function ablation. The effectiveness of the joint loss on the training of the SGRGAN method is verified by ablation experiments on the loss function. (a) Input image, (b) Restored result of SGRGAN w/o \({\mathcal {L}}_{p}\), (c) Restored result of SGRGAN w/o \({\mathcal {L}}_{s}\), (d) Restored result of SGRGAN w/o \({\mathcal {L}}_{i}\), (e) Restored result of SGRGAN, (f) Ground truth

Fig. 15

Restoration results with varying proportions of mural images. The ablation evaluates the impact of incorporating pictures of a different style into the training set on SGRGAN restoration

From the visualization results in Fig. 14, it is evident that without \({\mathcal {L}}_{p}\), the restored image in Fig. 14b, although consistent with the surrounding environment at the pixel level, appears blurry and flat at the texture level, failing to reproduce the richness of detail and realistic texture of the original image. Similarly, without \({\mathcal {L}}_{s}\), as shown in Fig. 14c, the restoration exhibits a small number of distorted lines, producing an effect inconsistent with the surrounding style. Without \({\mathcal {L}}_{i}\), as depicted in Fig. 14d, the structure and texture are distorted, undermining the consistency of the restored object with its surroundings. These results highlight the necessity of the joint loss employed in this paper.

Ablation analysis on dataset proportion

To verify the impact of varying proportions of different style images in the training set on the model, we conducted an ablation study by mixing 5\(\%\), 10\(\%\), and 15\(\%\) of mural data into the landscape painting training set. The model was then trained and tested, with the experimental results depicted in Fig. 15.

The visualization results in Fig. 15 reveal that, as the proportion of incorporated mural data increases, the restoration results, influenced by the mural examples seen during training, exhibit characteristics inconsistent with the traditional aesthetics of landscape painting. After mixing in 15\(\%\) mural data, the repaired image displays visual effects that clash with the overall style and color palette of landscape painting.

Fig. 16

Visual comparison on Mural. The SGRGAN method demonstrates effective restoration of various mural images: (a) Thangka, (b) Temple, (c) Cave, and (d) Burial

Comparison on different datasets

Comparison on mural

Our framework is evaluated on four distinct types of mural painting datasets, namely Thangka, Temple, Cave, and Burial, to verify its generalizability to additional painting datasets. The visualization results are depicted in Fig. 16.

Our model demonstrates applicability to murals, exemplified by the intricate details of the clothing in Fig. 16b. Additionally, our method successfully restores the texture details of the murals, highlighting its ability to achieve outstanding visual effects.

Comparison on CelebA and Places2

To validate the restoration capability of our model on public datasets, we conduct experiments on the CelebA and Places2 datasets, respectively. The results are illustrated in Figs. 17 and 18.

Fig. 17

Visual comparison on CelebA. The SGRGAN method further demonstrates its applicability to modern face-damage inpainting, achieving favorable restoration outcomes. (a) shows the input image with 0–15% face damage, the restored result of SGRGAN, and the original real image. (b) and (c) follow the same layout but with different damage degrees: (b) has 30–45% damage, while (c) has 15–30%

Fig. 18

Visual comparison on Places2. The SGRGAN method also demonstrates its applicability to inpainting modern naturally damaged images, achieving satisfactory results. (a) shows the input image with 30–45% damage to the natural image, the restored result of SGRGAN, and the original real image. (b) and (c) follow the same layout but with different damage degrees: (b) has 15–30% damage, while (c) has 0–15% damage

Our model performs consistently well for both faces and modern landscape photographs. For instance, the details of the nose in Fig. 17a and the floor in Fig. 18b showcase the model’s ability to recover texture details in both scenarios. This demonstrates that our approach yields excellent visualization results across publicly available datasets.

Conclusion and discussion

In this paper, we introduce a series of innovative methods and contributions aimed at addressing the complex and challenging task of restoring ancient Chinese landscape paintings. Initially, we establish a novel dataset specifically tailored for ancient landscape painting restoration, featuring specialized Mask annotations, termed MaskCLP. This dataset serves as a valuable resource for the research community, facilitating the advancement of deep-learning-based techniques for ancient painting restoration. We will release MaskCLP on https://github.com/Makbaka1/MaskCLP to encourage more researchers to explore this field and conduct experiments.

This study introduces the sketch as a novel multimodal structural prior. By effectively guiding fine textures, stroke styles, and other artistic elements of ancient paintings into the restoration network, a more accurate and realistic reconstruction of details is achieved. A novel Focal block is designed to efficiently integrate local fine-grained features, such as color matching and element abstraction, with global compositional layout features. This enables the model to capture the intrinsic aesthetic characteristics of ancient paintings, preserving their original style while ensuring overall harmony and local texture fidelity. Additionally, the proposed BiSCCFormer block, based on a Bi-level routing attention mechanism, is a highlight of this study. Its comprehensive understanding of image structure and semantic information ensures that the restoration process not only preserves the cultural significance of ancient paintings but also faithfully reproduces the artistic styles of their respective historical periods.

Although this study has made significant strides in the field of ancient painting restoration, several avenues warrant further exploration and refinement in the future. Firstly, expanding the MaskCLP dataset could involve incorporating a broader array of ancient landscape paintings spanning various ages and regional styles to better encompass the entire spectrum of artistic diversity present in Chinese landscape painting. Secondly, although the current Focal block and BiSCCFormer block yield promising results in experiments, dynamically adjusting model parameters and attention weight distribution to align with the specific characteristics of different types of ancient paintings remains a topic deserving in-depth investigation. Moreover, while the proposed scheme enhances restoration effects considerably, integrating human expert knowledge with AI algorithms for enhanced interaction is crucial for capturing the unique expression and personal style of the artist more accurately during the restoration process. Furthermore, future research avenues may explore cross-media fusion techniques, such as incorporating textual histories or concurrently processing multiple types of image data, to augment the model’s holistic understanding and generation capabilities.

Overall, the work presented in this paper furnishes a potent tool for leveraging modern technological advancements in safeguarding and perpetuating ancient Chinese cultural heritage, while also opening up new avenues for future explorations and applications of deep learning in the realm of cultural and artistic heritage preservation and revitalization.

Limitation

This paper presents a novel restoration method named SGRGAN, which aims to restore ancient Chinese landscape paintings using sketches as guides. Experimental results demonstrate that, compared with the other methods discussed in this paper, SGRGAN effectively accounts for the structural details and texture characteristics of landscape paintings during restoration, producing high-quality restorations of damaged paintings. However, this study has the following limitations:

Firstly, in practice, the number of surviving intact ancient Chinese landscape paintings is limited and unevenly distributed. The dataset we collected from the Internet and various institutional platforms contains works from multiple historical periods, different painters, and a variety of artistic styles, but the amount of data in any single subcategory of dynasty, individual painter, or specific style is insufficient. Labeling the style, author, and dynastic information of each painting requires professional expertise, which entails a large workload and considerable time and expense. Therefore, we relax the scope of the restoration task in this paper and restore landscape paintings using only the overall screened and merged dataset, targeting general landscape painting restoration.

Secondly, the sketch-assisted landscape painting restoration strategy has yielded significant advancements. Compared with other inpainting methods, this strategy not only achieves high-fidelity restoration but also renders restoration defects less perceptible to the human eye. However, relying solely on sketch images as guidance still has shortcomings. In subsequent research, we will therefore explore integrating more specific historical characteristics of landscape painting, the unique style of the painter, detailed explanations of the painting's connotations, and other background information to enhance the restoration process.

We have already started refining the dataset, and in subsequent work we will expand the refined data and classify the dataset more carefully by dynasty, painter identity, and painting style. In this way, we will advance research on accurate restoration techniques for landscape paintings in different subcategories and explore how to effectively utilize multimodal techniques that combine the contextual information of landscape paintings for restoration. This series of efforts will help to deepen the protection and inheritance of traditional Chinese cultural heritage.

Availability of data and materials

Testing data are provided within the supplementary information files. Due to data permissions, please contact pxl@nwu.edu.cn for more information.

References

  1. Du WJ. On the digital protection of cultural relics. Cult Relics Identificat Appreciat. 2019;23:102–4 (in Chinese).

  2. Deng F. What is the "mingzhe"? Reflections on the restoration project of ancient paintings donated by Deng Tuo. Chinese Fine Arts. 2016;5:27–34 (in Chinese).

  3. Lan LR, Sang LJ. Digital protection of ancient murals and its practice. Art Educat. 2020;5:170–3 (in Chinese).

  4. Luo R, Luo R, Guo L, Yu H. An ancient Chinese painting restoration method based on improved generative adversarial network. J Phys Conf Ser. 2022;2400:012005.

  5. Lyu Q, Zhao N, Yang Y, Gong Y, Gao J. A diffusion probabilistic model for traditional Chinese landscape painting super-resolution. Herit Sci. 2024;12(1):4.

  6. Fong WC. Why Chinese painting is history. Art Bull. 2003;85(2):258–80.

  7. Pathak D, Krahenbuhl P, Donahue J, Darrell T, Efros AA. Context encoders: feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016; pp. 2536–2544.

  8. Liu G, Reda FA, Shih KJ, Wang TC, Tao A, Catanzaro B. Image inpainting for irregular holes using partial convolutions. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018;pp. 85–100.

  9. Li J, Wang N, Zhang L, Du B, Tao D. Recurrent feature reasoning for image inpainting. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. 2020; pp. 7760–7768.

  10. Li X, Guo Q, Lin D, Li P, Feng W, Wang S. Misf: Multi-level interactive siamese filtering for high-fidelity image inpainting. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022; pp. 1869–1878.

  11. Xu Z, Shang H, Yang S, Xu R, Yan Y, Li Y, Huang J, Yang HC, Zhou J. Hierarchical painter: Chinese landscape painting restoration with fine-grained styles. Visual Intelligence. 2023;1(1):19.

  12. Chang I-C, Wun Z-S, Yeh H-Y. An image inpainting technique on Chinese paintings. J Comput. 2018;29(3):121–35.

  13. Zeng Y, Gong Y. Nearest neighbor based digital restoration of damaged ancient Chinese paintings. In: 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP). 2018; pp. 1–5. IEEE.

  14. Luo R, Luo R, Guo L, Yu H. An ancient Chinese painting restoration method based on improved generative adversarial network. J Phys Conf Ser. 2022;2400:012005.

  15. Wang H, Li Q, Jia S. A global and local feature weighted method for ancient murals inpainting. Int J Mach Learn Cybern. 2020;11:1197–216.

  16. Cai X, Lu Q, Yao J, Liu Y, Hu Y. An ancient murals inpainting method based on bidirectional feature adaptation and adversarial generative networks. In: computer graphics International Conference. 2023; pp. 300–311. Springer.

  17. Ge H, Yu Y, Zhang L. A virtual restoration network of ancient murals via global-local feature extraction and structural information guidance. Herit Sci. 2023;11(1):264.

  18. Chang L, Chongxiu Y. New interpolation algorithm for image inpainting. Phys Proced. 2011;22:107–11.

  19. Ren-xi C, Xin-hui L. Fast image inpainting algorithm based on anisotropic interpolation model. Appl Res Comput. 2009;26(4):1554–6.

  20. Dimiccoli M, Salembier P. Perceptual filtering with connected operators and image inpainting. In: ISMM (1). 2007; pp. 227–238.

  21. Li S, Yao Z. Image inpainting algorithm based on partial differential equation technique. Imag Sci J. 2013;61(3):292–300.

  22. Nazeri K, Ng E, Joseph T, Qureshi F, Ebrahimi M. Edgeconnect: Structure guided image inpainting using edge prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. 2019; pp. 0–0.

  23. Liu H, Wan Z, Huang W, Song Y, Han X, Liao J. Pd-gan: Probabilistic diverse gan for image inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021; pp. 9371–9381.

  24. Zheng C, Song G, Cham TJ, Cai J, Phung D, Luo L. High-quality pluralistic image completion via code shared vqgan. 2022; arXiv preprint arXiv:2204.01931.

  25. Liu J, Yang S, Fang Y, Guo Z. Structure-guided image inpainting using homography transformation. IEEE Transact Multimed. 2018;20(12):3252–65.

  26. Guo X, Yang H, Huang D. Image inpainting via conditional texture and structure dual generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021; pp. 14134–14143.

  27. Song Y, Yang C, Lin Z, Liu X, Huang Q, Li H, Kuo CCJ. Contextual-based image inpainting: infer, match, and translate. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018; pp. 3–19.

  28. Liu H, Jiang B, Xiao Y, Yang C. Coherent semantic attention for image inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019; pp. 4170–4179.

  29. Li W, Lin Z, Zhou K, Qi L, Wang Y, Jia J. Mat: Mask-aware transformer for large hole image inpainting. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022; pp. 10758–10768.

  30. Wan Z, Zhang J, Chen D, Liao J. High-fidelity pluralistic image completion with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021; pp. 4692–4701.

  31. Dong Q, Cao C, Fu Y. Incremental transformer structure enhanced image inpainting with masking positional encoding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022; pp. 11358–11368.

  32. Liu G, Reda FA, Shih KJ, Wang TC, Tao A, Catanzaro B. Image inpainting for irregular holes using partial convolutions. In: Proceedings of the European Conference on Computer Vision (ECCV) 2018; pp. 85–100.

  33. Zhu L, Wang X, Ke Z, Zhang W, Lau RW. Biformer: vision transformer with bi-level routing attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023; pp. 10323–10333.

  34. Li J, Wen Y, He L. Scconv: spatial and channel reconstruction convolution for feature redundancy. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023; pp. 6153–6162.

  35. Liu Z, Mao H, Wu CY, Feichtenhofer C, Darrell T, Xie S. A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022; pp. 11976–11986.

  36. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.

  37. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. 2015.

  38. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. Imagenet large scale visual recognition challenge. Int J Comput Vision. 2015;115:211–52.

  39. Zhou B, Lapedriza A, Khosla A, Oliva A, Torralba A. Places: a 10 million image database for scene recognition. IEEE Trans Pattern Anal Mach Intell. 2017;40(6):1452–64.

  40. Liu Z, Luo P, Wang X, Tang X. Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision. 2015; pp. 3730–3738.

  41. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process. 2004;13(4):600–12.

  42. Zhang R, Isola P, Efros AA, Shechtman E, Wang O. The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018; pp. 586–595.

Funding

This research is supported by the National Key Research and Development Program of China (No. 2023YFF0715103), the National Natural Science Foundation of China (Grant No. 62306237 and No. 62006191), the Key Research and Development Program of Shaanxi (No. 2024GX-YBXM-149 and No. 2021ZDLGY15-04), the Northwest University Graduate Innovation Project (No. CX2023194), and the Natural Science Foundation of Shaanxi (No. 2023-JC-QN-0750). The views and opinions expressed are solely those of the authors and do not necessarily reflect those of the funding agencies; neither the funding agencies nor the authorizing agencies bear responsibility for the content.

Author information

Authors and Affiliations

Authors

Contributions

QH contributed to conceptualization, software development, validation, resource acquisition, data curation, and formal analysis. WH was responsible for preparation, methodology, and validation. YL drafted the original manuscript and conducted a review and editing. RC participated in reviewing and editing the manuscript. XP contributed to software development, while JP handled project administration. JF conducted the review and editing.

Corresponding author

Correspondence to Xianlin Peng.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Hu, Q., Huang, W., Luo, Y. et al. Sgrgan: sketch-guided restoration for traditional Chinese landscape paintings. Herit Sci 12, 163 (2024). https://doi.org/10.1186/s40494-024-01253-x


Keywords