Introduction

The cultural significance of traditional Chinese paintings is undeniable. They serve as a testament to the changing times and are imbued with rich cultural meanings. However, the passage of time and natural factors have often led to damage and deterioration of these valuable works of art. As a result, it is crucial to take measures to protect and restore them. Currently, some restoration experts attempt to restore these paintings through manual drawing, but the varying personal drawing styles of the restorers make it challenging to reproduce the authentic brushstrokes and style of the original artwork.

With the continuous improvement of deep learning in image processing, image super-resolution techniques have been used to reconstruct high-resolution details from blurry natural images. Non-contact restoration methods such as image super-resolution become particularly important when direct physical restoration of the original artwork is not feasible. This technology enables a more precise restoration of the details in ancient traditional paintings, contributing to the protection and inheritance of China’s precious cultural heritage. For example, Xiao et al. [1] construct a global correlation graph based on pre-extracted features from all samples and utilize a graph convolutional neural network with maximum mean discrepancy loss to approximate the feature distributions in two domains. This model effectively preserves the structural information between samples. A large number of single-image super-resolution (SISR) approaches rely heavily on supervised learning. To reduce this dependency, Prajapati et al. [2] employ unsupervised learning in a GAN framework and introduce a new loss function based on Mean Opinion Score (MOS) to evaluate the quality of the generated images. EaSRGAN [3] improves upon SRGAN by incorporating multi-stage training for the generator and discriminator, with a focus on edge and flat region enhancement. This approach pays attention to the perceptual edge information, resulting in fewer artifacts and higher image quality. Zhao et al. [4] propose a multi-level semantic progressive restoration approach for painting images. This method gradually shifts attention from high-level and large-scale information to increasingly fine scales, yielding better results compared to other one-step restoration methods. These existing methods have achieved excellent performance in super-resolution of real-world images.

However, traditional Chinese paintings typically exhibit complex layout structures and abstract representations of objects and scenes. The challenge lies in the faithful reconstruction of the details of original artwork while maintaining its unique artistic style. In summary, the key difficulties encountered in restoring traditional Chinese paintings include:

  1. Traditional Chinese paintings, including ink paintings and meticulous paintings, emphasize the variations in brushstrokes and lines, which deviate from the objective and realistic representation found in real-world images. The texture information embedded in these artworks encapsulates the distinctiveness of brushstrokes and the stylistic characteristics that define the artwork. Preserving their original forms during model inference is of paramount importance.

  2. Chinese traditional paintings encompass a multitude of abstract elements and symbolism. Modeling such irregular and highly abstract content presents a significant challenge. The super-resolution process may introduce distortions or deviations from the original artistic style, further complicating the reconstruction process.

  3. The development of traditional Chinese painting reconstruction techniques is hindered by the limited availability of datasets that pair high-resolution and low-resolution traditional paintings. This scarcity of data inhibits progress in this field.

To overcome the above limitations, we introduce a novel method to facilitate the super-resolution of traditional Chinese painting images, termed ConvSRGAN. Our contribution is threefold:

  • We propose a novel dataset specifically for Chinese landscape painting super-resolution, termed SRCLP. It facilitates further research and exploration in this field by providing an extensive and high-quality dataset. The dataset is available at https://github.com/LPDLG/SRCLP-Dataset

  • We propose an Enhanced Adaptive Residual Module (EARM) to extract high-level abstract features. Additionally, to address the issue of lost contour information in painting images, we design an Enhanced High-frequency Retention Module (EHRM) within the EARM to enhance the edges and textures of different-level feature maps. Furthermore, we introduce an Adaptive Deep Convolutional Block (ADCB) in the EHRM to model large-scale spatial dependencies, allowing the model to better understand and reproduce the global layout and brushstroke trends in traditional Chinese paintings.

  • We introduce a weighted combination of the MS-SSIM loss (\(\mathcal {L}_{M-S}\)) and traditional losses, which pays more attention to contours and pixel differences at various scales and suppresses color and brightness distortions.

Related work

Traditional Chinese painting restoration

Recently, significant advancements have been made in Chinese painting style transfer [5, 6], poetry-to-image [7], image-to-image translation [8], and traditional Chinese painting image generation techniques [9]. These developments have provided valuable support for the preservation and continuation of traditional art, while also paving the way for new opportunities in digital art evolution. For example, SAPGAN (Sketch-And-Paint GAN) [10] first employs SketchGAN to generate sketches of landscape paintings and then uses PaintGAN to transform the sketches into Chinese landscape paintings. Zhang et al. [11] propose a generative adversarial network-based model for automatically generating Chinese landscape paintings with styles closely resembling traditional Chinese paintings, which plays a significant role in the preservation of digital cultural heritage.

At the same time, there are also some research methods focusing on the restoration and super-resolution of traditional Chinese landscape paintings. Shi et al. [12] build the Ref-ZSSR network based on generative adversarial networks (GAN) to extract and apply the global information of images from the painting itself, successfully achieving restoration of damaged ancient paintings. Nagar et al. [13] apply diffusion models to the artistic restoration of mural images, effectively addressing various degradation issues such as noise, blur, and fading. It is worth mentioning that Lyu et al. [14] apply the diffusion model to Chinese landscape paintings, propose CLDiff, and introduce an attention mechanism. This approach achieves good performance in super-resolution tasks of traditional Chinese landscape paintings, providing high-resolution results with clear ink texture.

Although there has been some headway in utilizing deep learning for traditional Chinese landscape painting, research that specifically concentrates on super-resolution techniques for traditional Chinese paintings remains limited. Traditional Chinese painting is renowned for its unique artistic language, techniques, and aesthetic characteristics, including ink charm, vividness of artistic conception, and non-realistic composition. These attributes present new challenges for existing general image super-resolution algorithms. The process of super-resolution on landscape paintings can accurately restore the artistic effects of traditional Chinese painting, which is of great significance for the preservation of digital cultural heritage. Therefore, we will conduct comprehensive research on the super-resolution of Chinese landscape paintings.

Single image super-resolution

Since the introduction of the CNN-based super-resolution algorithm by Dong et al. [15], deep learning methods have gained significant popularity in addressing image super-resolution tasks, leading to groundbreaking advancements in this domain. Ledig et al. [16] use a generative adversarial approach to train their network and define a content loss function, achieving superior results compared to traditional methods. ESRGAN [17] builds upon this work by stacking multiple dense blocks to restore high-resolution (HR) images. It also refines the perceptual loss to reconstruct images that more closely match human perception. While dense connections help with feature reuse, the model is complex to train and requires a large amount of high-quality paired HR and low-resolution (LR) data for supervised training. Real-ESRGAN [18] utilizes a high-order degradation model to simulate complex degradation distributions in images, allowing for the generation of paired HR and LR images. However, when dealing with more complex textures and details, there may be instances of distortion and blurring.

Although CNN-based methods have achieved success in super-resolution tasks, they still face inherent limitations, and learning long-range dependencies in images has always been a key issue in the field of computer vision. Since the introduction of Vision Transformers [19], many visual tasks [20,21,22] have demonstrated the excellent performance of Transformers in addressing this issue. The SwinIR [23] model leverages the power of the Swin Transformer [24] to enhance image super-resolution. By utilizing local attention and window-based shifts, the model gains a better understanding of the overall image structure, enabling it to effectively capture long-range dependencies and improve its performance in super-resolution tasks. ESRT [25] combines the strengths of the CNN and Transformer architectures, using a CNN network to learn deep image features and introducing an efficient multi-head attention mechanism (EMHA) in the Transformer to capture dependencies between similar tokens. This approach reduces network parameters while improving feature representation. DAT [26] achieves feature aggregation and captures global contextual information by alternating between spatial and channel self-attention mechanisms within consecutive Transformer blocks, with the aim of enhancing the quality of super-resolution images. Despite the enhanced capability of Transformer-based super-resolution models in capturing global dependency information in images, the significant computational complexity arising from the matrix products of self-attention, which grows quadratically with the number of tokens, presents considerable challenges, particularly when dealing with large-scale high-resolution artistic images. To circumvent the complex operations in Transformers, we propose a novel module that achieves similar effects to attention mechanisms. This module enables learning of both image semantics and local textures in artistic images while avoiding complex calculations.

These advancements have improved the super-resolution quality of real-world images to varying extents. However, traditional Chinese paintings are not realistic depictions: they contain complex layout structures and element arrangements, and the intricate texture details within mountains, rocks, and vegetation require progressive learning. To this end, we cascade multiple EARMs to progressively extract high-level information. Additionally, to address the differences in structure and texture information caused by scale variations, we introduce \(\mathcal {L}_{M-S}\) into the combined loss function to improve the quality of super-resolution reconstruction.

Vision expansion

While Transformers have shown remarkable ability in capturing long-distance dependencies in visual tasks, their high computational requirements and intricate inference procedures have incentivized researchers to incorporate larger convolutional kernels in convolutional operations. The aim is to expand the model’s field of view, promoting the acquisition of improved contextual information. ConvNeXt [27] introduces a pure ConvNet model that achieves comparable performance to the Swin Transformer by optimizing training strategies and utilizing large-sized convolutional kernels. This demonstrates the effectiveness of large-kernel convolutions. RepLKNet [28] utilizes reparameterized depthwise convolutions to design high-performance large-kernel CNNs. By employing a \(31 \times 31\) oversized convolutional kernel, it achieves improved results in various typical downstream tasks while maintaining lower latency. VapSR [29] utilizes depth-wise separable convolutions instead of dense connection layers and implements pixel-level attention allocation with large convolutional kernels in the attention branch. This approach aims to enhance the resolution of generated images. LKDN [30] builds upon the BSRN [31] baseline structure and introduces a more efficient large-kernel attention (LKA) module to learn global image features and improve image clarity. It also reduces computational costs by distilling networks through an analysis of the computational efficiency of both BSRN and VapSR.

Inspired by these works, we consider replacing the Transformer architecture with large-kernel convolutions to avoid complex attention operations while capturing more feature information. To this end, we combine the characteristics of large-kernel convolutions and depth-wise separable convolutions to design the ADCB. This block provides the network with more contextual information, allowing for the preservation of the coherence of the painting structure while also finely retaining the artistic style and unique techniques of the artwork.

Methodology

Overall structure

Fig. 1 gives an overview of our model. (a) ConvSRGAN network: It comprises shallow feature extraction, deep feature extraction, and feature fusion. (b) Image degradation module: Our model incorporates a module specifically designed to emulate the degradation effects typically encountered in real-world landscape paintings. (c) ConvSR network: The core architecture of our model receives the input data, extracts features, and produces refined super-resolution features. (d) Discriminator: It is trained to distinguish between the synthesized images and genuine samples. By iteratively refining this process, the generator learns to produce images that increasingly resemble authentic instances, thus enhancing the super-resolution output. (e) Enhanced Adaptive Residual Module (EARM): It adjusts the residual connections and main path weights to automatically select and retain the relevant feature information. (f) Enhanced High-frequency Retention Module (EHRM): It preserves high-frequency information at the current resolution by adding average pooling layers.

Fig. 1 Overview of our framework. (a) ConvSRGAN network. (b) Image degradation module. (c) ConvSR network. (d) Discriminator. (e) Enhanced Adaptive Residual Module. (f) Enhanced High-frequency Retention Module

The ConvSRGAN begins by taking high-resolution (HR) painted images as input. It simulates real-world degradation through a second-order degradation module, producing low-resolution (LR) images downsampled by a factor of 4.

The low-resolution images are then restored in ConvSR, where they are first mapped to the feature space through a convolutional layer. The resulting features are processed as shallow and deep features: the shallow features preserve the overall structure and arrangement of the painted images, while the deep features restore the high-level texture details.

Moreover, deep features are extracted from the preliminary features by the cascaded EARMs, which capture image texture features at different depths. The initial features are also processed through convolutional layers to obtain shallow features, which are then combined with the deep features through a long skip connection to generate a high-resolution painted image.

In addition, the generated super-resolution image is compared with the original high-resolution image by the discriminator network to compute the adversarial loss. We use a U-Net discriminator with spectral normalization to provide more accurate gradient feedback for local textures.

Finally, \(\mathcal {L}_{M-S}\) is introduced to account for the characteristics of landscape painting images. We use four loss functions, namely the adversarial loss, perceptual loss, MS-SSIM loss, and L1 loss, to minimize the feature vector distances between the super-resolution image and the original high-resolution image. The network parameters are updated through gradient back-propagation. Detailed explanations of each component of the network are provided in the following sections.

Degradation module

Traditional paintings undergo various forms of deterioration over time, including climate erosion, pigment fading, and other factors, which ultimately degrade the image quality and result in blurry and distorted representations. Meanwhile, when these images are digitally preserved, the use of different storage methods and sharpening techniques often introduces undesirable artifacts. Basic techniques such as blur, resize, noise, and JPEG compression, applied once, prove inadequate for accurately simulating the intricate degradation patterns observed in traditional painted images. Consequently, a substantial disparity exists between artificially synthesized low-resolution (LR) images and genuinely degraded images.

Inspired by Real-ESRGAN [18], we have constructed a model that simulates the actual degradation process. As shown in Fig. 1b, the model can reflect various complex degradation phenomena that may occur in traditional paintings after long-term preservation, including but not limited to color loss, texture blurring, and structural distortion. The first-order degradation consists of four operations: Blur, Resize, Noise, and JPEG compression. The second-order degradation applies the first-order degradation process a second time. The formulation of this model is given in Eq. (1):

$$\begin{aligned}&I_{L}=M^{2}(I_{H}) \nonumber \\&M \in \{ Blur, Resize, Noise, JPEG \} \end{aligned}$$
(1)

where \(M^{2}\) represents second-order degradation, i.e., the first-order degradation applied twice. \(I_{H}\) and \(I_{L}\) respectively represent the images before and after the degradation process.

By using the degradation module, the generated degraded images are closer to real-world traditional paintings that have been damaged over time. This helps to bridge the significant gap between images simulated solely with basic degradation techniques and actual degraded images.
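The second-order structure of Eq. (1) can be illustrated with a minimal sketch. This is not the degradation module used in the paper: the grayscale list-of-lists image, the 3x3 box blur, the factor-2 average-pool resize per pass, the noise level, and the uniform quantization used as a crude stand-in for JPEG compression are all assumptions chosen so that the example runs with the Python standard library only.

```python
import random

def blur(img):
    """3x3 box blur with edge replication (assumed blur kernel)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += img[yy][xx]
            out[y][x] = acc / 9.0
    return out

def resize_half(img):
    """Downscale by 2 with 2x2 average pooling (assumed resize rule)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]

def add_noise(img, rng, sigma=0.02):
    """Additive Gaussian noise, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in img]

def quantize(img, levels=32):
    """Uniform quantization: a crude stand-in for JPEG, not real JPEG."""
    return [[round(v * (levels - 1)) / (levels - 1) for v in row] for row in img]

def first_order(img, rng):
    """One pass of Blur -> Resize -> Noise -> (stand-in) JPEG."""
    return quantize(add_noise(resize_half(blur(img)), rng))

def second_order(img, rng):
    """M^2 in Eq. (1): the first-order chain applied twice."""
    return first_order(first_order(img, rng), rng)

rng = random.Random(0)
hr = [[(x + y) / 30.0 for x in range(16)] for y in range(16)]  # toy HR image
lr = second_order(hr, rng)
print(len(lr), len(lr[0]))  # 4 4
```

With a factor-2 resize in each pass, the two passes together reduce a 16x16 input to 4x4, which matches the overall factor of 4 between HR and LR images.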

ConvSR network

The ConvSR network comprises shallow feature extraction, deep feature extraction, and feature fusion. Given the degraded low-resolution (LR) image as input, a convolution layer with a kernel size of \(3 \times 3\) is used to extract the structure features \(E_{c}\), which maps the low-resolution image to the feature space. This process can be formulated as:

$$\begin{aligned} {E_{c}=\Phi _{c}(I_{L})} \end{aligned}$$
(2)

where \(\Phi _{c}(\cdot )\) represents the process of the first convolutional layer.

To capture the distinctive traits of Chinese traditional painting, including brushstroke techniques, composition structure, and element arrangement, ConvSR processes the shallow features in two separate branches.

As shown in Fig. 1c, one branch focuses on texture feature extraction, while the other applies convolutional operations to preserve the structural features. The outputs of the two branches are then fused by element-wise addition. Specifically, we extract features at different depths by adjusting the number of cascaded EARMs.

By combining the outputs of the EARMs at different depths, we obtain the deep features \(E_a\).

$$\begin{aligned} E_{i}&=\xi _{i}(E_{i-1}),\quad i=1,2,3,4,5 \end{aligned}$$
(3)
$$\begin{aligned} E_{a}&=concat(E_{1},E_{2},E_{3},E_{4},E_{5}) \end{aligned}$$
(4)

where i indexes the EARMs, \({\xi }_{i}(\cdot )\) denotes the operation of the i-th EARM, and the cascade starts from the shallow features, i.e., \(E_{0}=E_{c}\). \(E_a\) denotes the fused output obtained by concatenating the features of the EARMs at different depths.

As for the deep features \(E_{a}\), they are first passed through a single convolutional layer to reduce the number of channels and then upsampled. The upsampled deep features are integrated using another convolutional layer to capture information from different depths. This process is defined as follows:

$$\begin{aligned} {E}_{m}=Upsample(\Phi (E_{a})) \end{aligned}$$
(5)

where \({\Phi (\cdot )}\) denotes the convolution layer and \({Upsample(\cdot )}\) denotes the Upsample layer.

To preserve the structure and layout features of the painting, the other branch directly upsamples the shallow feature \(E_c\) to the size of the original high-resolution image.

$$\begin{aligned} {E}_{n}=Upsample(E_{c}) \end{aligned}$$
(6)

Finally, the deep features and shallow features are merged by element-wise addition to obtain the super-resolution image.

$$\begin{aligned} {E_{SR}=\Phi ({E}_{m})+\Phi ({E}_{n})} \end{aligned}$$
(7)

where \(E_{SR}\) denotes the feature matrix of super-resolution image.

It can be concluded that during the learning of image features in the ConvSR, the shallow structural features obtained through a single convolutional layer preserve the overall layout and shape of the landscape painting images. On the other hand, processing the deep features can enhance the details and texture of the image, resulting in a super-resolution image that is clearer and more realistic.
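The data flow of Eqs. (2) to (7) can be traced at the level of tensor shapes. The following is a minimal sketch rather than the authors' implementation: the helper names, the 64 feature channels, the 64x64 RGB input, and the assumption that each EARM preserves the feature shape are illustrative; only the five cascaded EARMs, the concatenation, and the upsampling by a factor of 4 come from the text.

```python
# Shape-level trace of Eqs. (2)-(7). Shapes are (channels, height, width).
# All concrete sizes below are assumptions made for illustration.

def conv(shape, out_channels):
    """A same-padded convolution changes only the channel count."""
    _, h, w = shape
    return (out_channels, h, w)

def earm(shape):
    """Stand-in for one EARM: assumed to preserve the feature shape."""
    return shape

def concat(shapes):
    """Channel-wise concatenation; spatial sizes must agree."""
    assert all(s[1:] == shapes[0][1:] for s in shapes)
    return (sum(s[0] for s in shapes),) + shapes[0][1:]

def upsample(shape, scale=4):
    c, h, w = shape
    return (c, h * scale, w * scale)

i_l = (3, 64, 64)                 # degraded LR input (assumed RGB)
e_c = conv(i_l, 64)               # Eq. (2): shallow features
feats, e_prev = [], e_c           # assumes the cascade starts from E_c
for _ in range(5):                # Eq. (3): five cascaded EARMs
    e_prev = earm(e_prev)
    feats.append(e_prev)
e_a = concat(feats)               # Eq. (4)
e_m = upsample(conv(e_a, 64))     # Eq. (5): reduce channels, then upsample
e_n = upsample(e_c)               # Eq. (6)
e_sr = conv(e_m, 3)               # Eq. (7): deep branch after its convolution
assert e_sr == conv(e_n, 3)       # both branches agree, so addition is valid
print(e_a, e_m, e_sr)  # (320, 64, 64) (64, 256, 256) (3, 256, 256)
```

The trace makes explicit why a channel-reducing convolution precedes the upsampling in Eq. (5): concatenating five EARM outputs multiplies the channel count by five, and upsampling that many channels would be needlessly expensive.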

EARM

As shown in Fig. 1e, the EARM has three EHRMs in the main path and utilizes skip connections to learn residual information from the input. In this process, the weights of the residual connection and of the main path are adapted so that the relevant feature information is automatically selected and retained. This allows the model to adapt to different painting styles, improving its ability to handle various styles of traditional painted images and enhancing its generalization capability.

By incorporating the EARM into the model, we ensure that it not only enhances the quality of the artwork at the pixel level but also captures and conveys the artistic spirit and aesthetic mood of the original piece. Our objective is to preserve the artistic essence of Chinese traditional painting throughout the digital processing, allowing the model to reflect the true artistic essence inherent in traditional artwork. This process can be represented as follows:

$$\begin{aligned} {E_{q}=\Phi \left( \lambda _{r}\cdot \delta ^{3}\left( E_{x}\right) +\lambda _{x}\cdot E_{x}\right) } \end{aligned}$$
(8)

where \(\delta ^{3}(\cdot )\) denotes the operation of three consecutive EHRMs. \(\lambda _{r}\) and \(\lambda _{x}\) are the adaptive weights of the main path and the skip path, respectively. Finally, we use a convolution layer \(\Phi\) to adjust the output dimension.
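The weighted residual of Eq. (8) can be sketched as follows. This is an illustration under stated assumptions: features are flat lists of floats, the EHRM is replaced by a hypothetical stand-in that halves its input, the final convolution is an identity, and the weights \(\lambda _{r}\) and \(\lambda _{x}\), which are adapted during training in the model, are passed in as fixed numbers.

```python
def earm(e_x, ehrm, conv, lam_r, lam_x):
    """Eq. (8): E_q = Phi(lam_r * delta^3(E_x) + lam_x * E_x)."""
    main = e_x
    for _ in range(3):                      # delta^3: three EHRMs in sequence
        main = ehrm(main)
    mixed = [lam_r * m + lam_x * s for m, s in zip(main, e_x)]
    return conv(mixed)

# Hypothetical stand-ins, used only to make the wiring executable.
halve = lambda feat: [0.5 * v for v in feat]   # plays the role of one EHRM
identity = lambda feat: list(feat)             # plays the role of Phi

e_x = [8.0, 16.0]
both_paths = earm(e_x, halve, identity, lam_r=1.0, lam_x=1.0)
skip_only = earm(e_x, halve, identity, lam_r=0.0, lam_x=1.0)
print(both_paths, skip_only)  # [9.0, 18.0] [8.0, 16.0]
```

Setting \(\lambda _{r}=0\) reduces the module to its skip path, which shows how the two weights let the network decide how much of the EHRM output to keep relative to the input.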

EHRM

High-frequency information refers to the details and textures in an image that have high variation frequencies, such as leaves, petals, and mountain folds. Depicting these details and textures is essential for representing the imagery in a painting. To extract subtle features in painting images, such as textures, lines, and shadows, we propose the Enhanced High-frequency Retention Module, termed EHRM.

The EHRM preserves high-frequency information at the current resolution by adding average pooling layers. Within it, we introduce the Adaptive Deep Convolutional Block, termed ADCB. Increasing the size of the convolution kernel alleviates the limited field of view of traditional CNNs and enables the extraction of more contextual information.

As shown in Fig. 1f, let \(E_{b}\) denote the input feature of the EHRM. The first ADCB extracts features that serve as input to the high-pass filter, which computes the high-frequency information of these features, denoted \(E_{h}\).

$$\begin{aligned} E_{h}=A_v(\kappa _{a}(E_{b})) \end{aligned}$$
(9)

where \(\kappa _{a}\) is the operation of the ADCB. \(A_{v}\) denotes the Average Pooling layer.

After obtaining \(E_{h}\), we decrease the size of the feature map to reduce computational cost and feature redundancy. The downsampled feature map is represented as \(E_{d}\).

$$\begin{aligned} E_{d}=Downsample(E_{h}) \end{aligned}$$
(10)

We apply five ADCBs with shared weights to \(E_{d}\) to explore its latent information while keeping the number of parameters low. In parallel, an ADCB is applied to \(E_{h}\) in the feature space so that it can be aligned with the processed \(E_{d}\), yielding \(E_{w}\).

$$\begin{aligned} E_{w}&=\kappa _{a}(E_{h}) \end{aligned}$$
(11)
$$\begin{aligned} E_{u}&=\kappa _{a}^{5}(E_{d}) \end{aligned}$$
(12)

After feature extraction, \(E_{u}\) is upsampled to the original size using bilinear interpolation.

$$\begin{aligned} E_{v}=Upsample(E_{u}) \end{aligned}$$
(13)

Then, we concatenate \(E_{v}\) and \(E_{w}\) to retain the original details and obtain the feature \(E_{t}\). This operation can be represented as follows:

$$\begin{aligned} E_{t}=concat(E_{w}, E_{v}) \end{aligned}$$
(14)

To strike a satisfactory balance between model complexity and performance, we use five ADCBs in our experiments, which are denoted as \(\kappa _{a}^{5}\).

To extract more image features, \(E_{t}\) is passed through a convolution layer and then fed into the Channel Attention Layer (CAL) Module.

$$\begin{aligned} E_{r}=\omega (\Phi (E_{t})) \end{aligned}$$
(15)

where \(\omega (\cdot )\) denotes the operation of the CAL Module.

Finally, in order to maintain the shallow features of the image and further extract the deep features, \(E_{r}\) is passed through an ADCB and then added to \(E_{b}\) to obtain the output of the EHRM.

$$\begin{aligned} E_{y}=E_b+{\kappa _{a}(E_{r})} \end{aligned}$$
(16)
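The wiring of Eqs. (9) to (16) can be followed in a small one-dimensional sketch. It only illustrates the order of operations and the bookkeeping of sizes, under several assumptions: features are lists of channels with one spatial axis, the ADCB is replaced by an identity stand-in, upsampling uses nearest-neighbour repetition instead of bilinear interpolation for brevity, the convolution \(\Phi\) is replaced by averaging groups of channels, and the CAL is reduced to a single sigmoid gate per channel.

```python
import math

def adcb(feat):
    """Stand-in for the ADCB (identity), so only the wiring is shown."""
    return [list(ch) for ch in feat]

def avg_pool(feat):
    """Window-3, stride-1 average pooling with edge replication (A_v)."""
    out = []
    for ch in feat:
        n = len(ch)
        out.append([(ch[max(i - 1, 0)] + ch[i] + ch[min(i + 1, n - 1)]) / 3.0
                    for i in range(n)])
    return out

def downsample(feat):
    """Keep every second position."""
    return [ch[::2] for ch in feat]

def upsample(feat):
    """Nearest-neighbour repetition (assumption; the text uses bilinear)."""
    return [[v for v in ch for _ in range(2)] for ch in feat]

def conv_reduce(feat, out_channels):
    """Stand-in for Phi: average groups of channels down to out_channels."""
    group, n = len(feat) // out_channels, len(feat[0])
    return [[sum(feat[c * group + g][i] for g in range(group)) / group
             for i in range(n)] for c in range(out_channels)]

def cal(feat):
    """Simplified channel attention: sigmoid of each channel mean as gate."""
    out = []
    for ch in feat:
        gate = 1.0 / (1.0 + math.exp(-sum(ch) / len(ch)))
        out.append([gate * v for v in ch])
    return out

def ehrm(e_b):
    e_h = avg_pool(adcb(e_b))               # Eq. (9)
    e_d = downsample(e_h)                   # Eq. (10)
    e_w = adcb(e_h)                         # Eq. (11)
    e_u = e_d
    for _ in range(5):                      # Eq. (12): same block five times
        e_u = adcb(e_u)
    e_v = upsample(e_u)                     # Eq. (13)
    e_t = e_w + e_v                         # Eq. (14): channel concatenation
    e_r = cal(conv_reduce(e_t, len(e_b)))   # Eq. (15)
    e_k = adcb(e_r)
    return [[a + b for a, b in zip(cb, ck)] for cb, ck in zip(e_b, e_k)]  # Eq. (16)

e_b = [[float(i) for i in range(8)], [1.0] * 8]   # 2 channels, 8 positions
e_y = ehrm(e_b)
print(len(e_y), len(e_y[0]))  # 2 8
```

Calling the same `adcb` function five times mirrors the weight sharing described for Eq. (12), and the final addition requires the output of the last ADCB to have the same size as \(E_{b}\), which the sketch confirms.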

ADCB

As shown in Fig. 2, the Adaptive Deep Convolutional Block (ADCB) is composed of Deep Residual Blocks (DRBs) and a Channel Attention Layer (CAL) Module, which together better capture the complex features of traditional painted images.

Fig. 2 Overview of the Adaptive Deep Convolutional Block. We employ a \(1 \times 1\) convolutional layer to reduce the number of channels. The Channel Attention Layer Module is used to highlight channels with high activation values

Deep Residual Block. It is composed of a depthwise convolutional (DW Conv) layer and two pointwise convolutional (PW Conv) layers, and it adds the input features element-wise to the output of these three convolutions.

Inspired by ConvNeXt [27], we use a \(7 \times 7\) kernel size in the DRB instead of the more common \(3 \times 3\) size to provide a larger receptive field and achieve effects similar to non-local attention mechanisms. This design helps the model learn the spatial depth created by distance and shading in traditional paintings, as well as understand the layout and spatial structure of the artwork (Fig. 4).

To offset the increase in model parameters caused by large convolution kernels, we use depthwise separable convolution. A depthwise separable convolution consists of a DW Conv layer and a PW Conv layer, which are used in the DRB to extract input features, thus reducing the complexity of the model.
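The saving can be made concrete by counting weights. The sketch below compares a dense \(7 \times 7\) convolution with a depthwise \(7 \times 7\) convolution followed by one pointwise convolution; the channel width of 64 is an assumption for illustration, and biases are ignored.

```python
def dense_conv_params(channels, k):
    """Dense k x k convolution with equal input and output channels."""
    return k * k * channels * channels

def depthwise_separable_params(channels, k):
    """Depthwise k x k convolution plus one pointwise (1 x 1) convolution."""
    depthwise = k * k * channels          # one k x k filter per channel
    pointwise = channels * channels       # 1 x 1 channel mixing
    return depthwise + pointwise

c = 64                                    # assumed channel width
dense_7 = dense_conv_params(c, 7)
separable_7 = depthwise_separable_params(c, 7)
dense_3 = dense_conv_params(c, 3)
print(dense_7, separable_7, dense_3)      # 200704 7232 36864
print(round(dense_7 / separable_7, 1))    # 27.8
```

Under these assumptions the separable \(7 \times 7\) layer has fewer weights than even a dense \(3 \times 3\) layer, which is why the larger receptive field does not have to come with a larger model.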

\(E_{\omega }\) represents the output feature of the first DRB. The process of the second DRB can be represented as:

$$\begin{aligned} E_{k}&=\Phi _{d}(E_{\omega }) \end{aligned}$$
(17)
$$\begin{aligned} E_{z}&=\Phi _{p}(E_{k}) \end{aligned}$$
(18)
$$\begin{aligned} E_{\Omega }&=\Phi _{p}(E_{z})+E_{\omega } \end{aligned}$$
(19)

where \(E_{\Omega }\) represents the output feature of the second DRB. \(\Phi _{p}\) represents pointwise convolution. \(\Phi _{d}\) represents depthwise convolution.
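Eqs. (17) to (19) can be sketched in one spatial dimension. This is an illustrative analogue, not the block used in the paper: the 7-tap kernel stands in for the \(7 \times 7\) depthwise kernel, features are lists of channels, zero padding is assumed, and the weights are hand-picked so that the result is easy to check.

```python
def depthwise_conv(feat, kernels):
    """Each channel is filtered by its own kernel (zero padding, same length)."""
    out = []
    for ch, k in zip(feat, kernels):
        r, n = len(k) // 2, len(ch)
        out.append([sum(k[j] * ch[i + j - r]
                        for j in range(len(k)) if 0 <= i + j - r < n)
                    for i in range(n)])
    return out

def pointwise_conv(feat, weight):
    """1 x 1 convolution: mixes channels independently at every position."""
    n = len(feat[0])
    return [[sum(w * feat[c][i] for c, w in enumerate(row)) for i in range(n)]
            for row in weight]

def drb(e_in, dw_kernels, pw1, pw2):
    e_k = depthwise_conv(e_in, dw_kernels)      # Eq. (17)
    e_z = pointwise_conv(e_k, pw1)              # Eq. (18)
    e_p = pointwise_conv(e_z, pw2)              # second pointwise layer
    return [[a + b for a, b in zip(cp, ci)]     # Eq. (19): residual addition
            for cp, ci in zip(e_p, e_in)]

e_omega = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
delta = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]     # 7-tap kernel that copies its input
eye = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]

doubled = drb(e_omega, [delta, delta], eye, eye)
unchanged = drb(e_omega, [delta, delta], eye, zero)
print(doubled[0], unchanged[0])  # [2.0, 4.0, 6.0, 8.0] [1.0, 2.0, 3.0, 4.0]
```

With a zero second pointwise layer the block returns its input unchanged, which shows the role of the residual connection in Eq. (19): the convolutions only need to learn a correction to \(E_{\omega }\).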


Channel Attention Layer Module. It is composed of an average pooling layer, a downscaling layer, an upscaling layer, and a sigmoid activation function. In addition, element-wise multiplication is performed between the input features of the CAL Module and the resulting attention weights. The CAL Module is used to learn crucial channel information, focusing on important feature information in the input image and enhancing the expression of feature information to improve the accuracy of image super-resolution results.
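The sequence of operations in the CAL Module can be sketched as follows. The sketch assumes a squeeze-and-excitation style layout: the ReLU between the downscaling and upscaling layers, the reduction from 4 to 2 channels, and the hand-picked weight matrices are assumptions that the text does not specify.

```python
import math

def channel_attention(feat, w_down, w_up):
    """Average pool -> downscale -> ReLU (assumed) -> upscale -> sigmoid -> rescale."""
    pooled = [sum(ch) / len(ch) for ch in feat]                  # global average pooling
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled)))  # downscaling layer
              for row in w_down]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_up]                                    # upscaling + sigmoid
    out = [[g * v for v in ch] for g, ch in zip(gates, feat)]    # element-wise rescale
    return out, gates

# Four channels with two spatial positions each (toy values).
feat = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0], [5.0, 1.0]]
w_down = [[0.25, 0.25, 0.25, 0.25], [0.5, -0.5, 0.0, 0.0]]       # 4 -> 2 channels
w_up = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 0.0]]         # 2 -> 4 channels

out, gates = channel_attention(feat, w_down, w_up)
print([round(g, 3) for g in gates])
```

Each gate lies strictly between 0 and 1, so channels with a large gate are passed through almost unchanged while the others are attenuated.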

Then, the outputs of the two DRBs are combined through a convolutional layer that reduces the number of channels. Finally, the weights of the different paths are adjusted adaptively to better utilize hierarchical features.

Discriminator details

Because we adopt a high-order degradation model following Real-ESRGAN [18], the degradation space becomes extensive and intricate. As shown in Fig. 1d, we use a U-Net discriminator with spectral normalization, which has stronger discriminative power and can provide more accurate gradient feedback for local textures. At the same time, the use of spectral normalization not only helps to reduce artifacts and oversharpening issues in GAN training but also makes the training process more stable.

Furthermore, the \(\mathcal {L}_{a}\) is iteratively refined through a comparative computation of the super-resolution (SR) images produced by the ConvSR network with their corresponding high-resolution (HR) original images. This iterative optimization process serves to elevate the discriminative capacity of the model, thereby enhancing the authenticity of the images generated by the generator, ensuring a heightened level of fidelity in the super-resolution process.

Loss function

Traditional paintings often have rich colors and gradients. Conventional pixel-wise loss functions cannot account for the low-pass filtering and color space conversion steps used to simulate the human visual system, and they fail to capture the visual perception and artistic style of the original image. When dealing with artistic works such as landscape paintings that are rich in details, layers, and expressive concepts, using a combination of losses helps improve the performance of the network. Specifically, the traditional \(\mathcal {L}_{1}\) distance measures pixel-level differences, while the perceptual loss based on the VGG network improves the visual effect of the super-resolution image. Additionally, we introduce \(\mathcal {L}_{M-S}\) to capture structural information at different resolution levels, thus better preserving the overall structure and layout of the original image.

During the training of the ConvSR network, we use the \(\mathcal {L}_{1}\) and \(\mathcal {L}_{M-S}\) loss functions. During the training of the ConvSRGAN, in addition to these two loss functions, we also use the \(\mathcal {L}_{a}\) and \(\mathcal {L}_{p}\) loss functions, which involve the discriminator.

MS-SSIM Loss. We introduce \(\mathcal {L}_{M-S}\), which is computed as:

$$\begin{aligned} \mathcal {L}_{M-S}&=1 \nonumber \\&\quad -\prod _{m=1}^M\left( \frac{2\mu _{S}\mu _{H}+c_1}{\mu _{S}^2 +\mu _{H}^2+c_1}\right) ^{\beta _m} \biggl (\frac{2Cov(S,H)+c_2}{\sigma _{S}^2 +\sigma _{H}^2+c_2}\biggr )^{\gamma _m} \end{aligned}$$
(20)

where M is the number of scales, \(\mu\) and \(\sigma\) denote the mean and standard deviation, respectively, \(S\) and \(H\) denote the super-resolution image and the original high-resolution image, \(Cov(\cdot )\) denotes the covariance, and \(\beta _m\) and \(\gamma _m\) weight the relative importance of the two terms at scale m.
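A simplified reading of Eq. (20) is sketched below. Several choices are assumptions rather than details from the paper: statistics are computed over the whole image instead of local windows, each coarser scale is obtained by \(2 \times 2\) average pooling, the exponents are set to \(\beta _m=\gamma _m=1/M\), the constants \(c_1=0.01^2\) and \(c_2=0.03^2\) are the common SSIM defaults for images in \([0, 1]\), and negative terms are clamped to zero so that fractional powers stay real.

```python
def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Population covariance; cov(a, a) is the variance."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def pool2(img):
    """2x2 average pooling used to move to the next coarser scale."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]

def ms_ssim_loss(sr, hr, scales=3, c1=0.01 ** 2, c2=0.03 ** 2):
    """Eq. (20) with global statistics and uniform exponents (assumptions)."""
    beta = gamma = 1.0 / scales
    product = 1.0
    for m in range(scales):
        s = [v for row in sr for v in row]
        h = [v for row in hr for v in row]
        mu_s, mu_h = mean(s), mean(h)
        luminance = (2 * mu_s * mu_h + c1) / (mu_s * mu_s + mu_h * mu_h + c1)
        structure = (2 * cov(s, h) + c2) / (cov(s, s) + cov(h, h) + c2)
        product *= max(luminance, 0.0) ** beta * max(structure, 0.0) ** gamma
        if m < scales - 1:
            sr, hr = pool2(sr), pool2(hr)
    return 1.0 - product

hr = [[(x + y) / 14.0 for x in range(8)] for y in range(8)]   # toy HR image
faded = [[0.5 * v + 0.2 for v in row] for row in hr]          # lower contrast copy
loss_same = ms_ssim_loss(hr, hr)
loss_faded = ms_ssim_loss(faded, hr)
print(loss_same, loss_faded)
```

The loss vanishes when the two images coincide and becomes positive for the faded copy, since both the mean term and the covariance term drop below one when brightness and contrast differ.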

Adversarial loss. We use adversarial loss to perform adversarial training between the super-resolution results generated by the generator and the original HR images, in order to optimize the generation performance of the generator. The adversarial loss can be represented as follows:

$$\begin{aligned} {\mathcal {L}_{a}=\sum _{n=1}^{N}-\log D_{\tau _{d}}(G(I_{L}))} \end{aligned}$$
(21)

where \(D(\cdot )\) and \(G(\cdot )\) respectively represent discriminator and generator. \(D_{\tau _{d}}(G(\cdot ))\) represents the probability that the super-resolution image matches the ground truth. We achieve better super-resolution performance by minimizing \(\mathcal {L}_{a}\).
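As a small numerical illustration of Eq. (21), the generator loss can be evaluated for hypothetical discriminator outputs. The probabilities below are made up, and treating the discriminator output as one scalar per image is a simplification for this sketch.

```python
import math

def adversarial_loss(d_outputs):
    """Eq. (21): sum over samples of -log D(G(I_L))."""
    return sum(-math.log(p) for p in d_outputs)

# Hypothetical discriminator probabilities for two super-resolved images.
convincing = adversarial_loss([0.9, 0.8])   # outputs judged close to real
rejected = adversarial_loss([0.3, 0.2])     # outputs judged as generated
print(convincing < rejected)  # True
```

The loss shrinks toward zero as the discriminator assigns probabilities close to one to the generated images, which is the direction in which the generator is pushed.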

Perceptual loss. Inspired by SRGAN [16], we use a perceptual loss for training, defined as a weighted combination of a content loss and the adversarial loss:

$$\begin{aligned} {\mathcal {L}_{p}=\mathcal {L}_{\textrm{vgg}}+10^{-3}\mathcal {L}_{a}} \end{aligned}$$
(22)

where \(\mathcal {L}_{\textrm{vgg}}\) is the content loss, calculated using a pre-trained VGG19 network [32], and \(\mathcal {L}_{a}\) is the adversarial loss.

Joint loss. During the training of the ConvSRGAN network, using a combination of losses helps improve the performance of the network. To achieve better super-resolution results, we tune the weight hyperparameter of each loss term. The joint loss is:

$$\begin{aligned} {\mathcal {L}_{J}=\mathcal {L}_{p}+\alpha \mathcal {L}_{a} +\beta \mathcal {L}_{1}+\sigma \mathcal {L}_{M-S}} \end{aligned}$$
(23)

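where \(\alpha\), \(\beta\), and \(\sigma\) are hyperparameters that balance the different loss terms. We set \(\alpha =0.1\), \(\beta =1.0\), and \(\sigma =1.0\). Here \(\sigma\) is a loss weight and is unrelated to the standard deviation in Eq. (20).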
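The combination in Eqs. (22) and (23) can be sketched as plain arithmetic on already-computed scalar loss values. The function and argument names are illustrative; the default weights are the values stated above.

```python
def perceptual_loss(l_vgg, l_a):
    """Eq. (22): content loss plus a small adversarial term."""
    return l_vgg + 1e-3 * l_a


def joint_loss(l_vgg, l_a, l_1, l_ms, alpha=0.1, beta=1.0, sigma=1.0):
    """Eq. (23): L_p + alpha * L_a + beta * L_1 + sigma * L_{M-S}."""
    l_p = perceptual_loss(l_vgg, l_a)
    return l_p + alpha * l_a + beta * l_1 + sigma * l_ms
```

Because \(\mathcal {L}_{a}\) enters both through \(\mathcal {L}_{p}\) and directly, its effective coefficient under these settings is \(0.1 + 10^{-3} = 0.101\).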

Experiments

Datasets

In the experiments, we mainly train and test on the SRCLP dataset. To validate the generalization ability of our method, we additionally test on the Mural, Painter By Numbers [33], and Flickr2K [34] datasets. For each dataset, the subset selected for testing is held out and does not participate in the training process.

SRCLP. In this paper, we construct a high-quality dataset called Super-Resolution Chinese Landscape Painting, termed SRCLP. All images in this dataset are sourced from collaborating institutions and digital art databases. We enlisted professional artists to meticulously classify the collected ancient paintings, considering different styles, dynasties, and color features to obtain diverse style information. The selected paintings have undergone careful screening and categorization to ensure the diversity and representativeness of datasets.

The SRCLP dataset consists of 900 high-resolution Chinese traditional painting images. During the training phase, we applied data augmentation techniques such as random rotation and cropping, which increased the number of images from the original 900 to 2175. Of these, 2079 images were used for training and 96 were selected for testing. Figure 3 illustrates the composition of our original dataset, which covers a broad range of traditional paintings from different dynasties and the distinctive styles of several esteemed artists. In addition, we curated supplementary paintings from other dynasties and artists and added them to the test set, broadening its chronological and stylistic coverage.

Fig. 3
Examples of traditional Chinese paintings. a ‘Autumn Colors on the Wutong Tree’ from the Ming dynasty, b ‘Joy in the Heart of Heaven’ by Emperor Shunzhi of the Qing dynasty, c an imitation piece by the Qing dynasty artist Li Wu, d ‘Secluded Dwelling on Mount Shu’ by the modern and contemporary artist Daqian Zhang. These works of art range from expressive to realistic, each displaying unique artistic characteristics and technical expressions

Mural. Our Mural dataset includes four types of mural images: Cave, Temple, Tomb, and Thangka. We collected the data from relevant websites and museums. The images originate from various locations, and the dataset contains over 1200 high-resolution images. For the experiments, we selected 99 of the most representative images for testing.

Painter By Numbers [33]. Painter By Numbers is a dataset sourced from the Google Arts & Culture project, consisting of over 100,000 images of artworks. These images encompass various art styles and painting techniques. Therefore, we selected a subset of images from different artistic styles for qualitative comparisons, as they were not specifically used for training or quantitative evaluation.

Flickr2K [34]. Flickr2K is a commonly used image super-resolution dataset that is especially suitable for studying high-quality (2K resolution) image restoration tasks. It contains 2650 high-quality images in 2K resolution, covering subjects such as people, animals, and landscapes. We selected 2000 of these images to train and test ConvSRGAN.

Implementation details

In our experiments, the model is implemented in PyTorch and trained on an NVIDIA GeForce RTX 4090 GPU. Before training, we applied a series of data augmentation techniques to improve the model’s performance and robustness, including resizing, cropping, rotating, mirroring, and adding noise to the images. The original painting images were resized to a unified resolution of \(512 \times 512\), and the batch size was set to 2. To stabilize training and achieve better performance, we employed a weighted combination of multiple losses. The weights for \(\mathcal {L}_{1}\), \(\mathcal {L}_{p}\), \(\mathcal {L}_{M-S}\), and \(\mathcal {L}_{a}\) were set to 1, 1, 1, and 0.1, respectively.

In the experiment, the High-Resolution painting (HR) is first passed through a simulated degradation network to obtain the Low-Resolution painting (LR), which is then input into the network for training.

Training process. We employ a two-stage training strategy. The training time is about 25 h.

In the first stage, the ConvSR network is trained to converge quickly and generate super-resolution paintings (SR). Training runs for 200,000 iterations with an initial learning rate of \(2\times 10^{-4}\). The learning rate follows a multi-step decay schedule and is halved at each decay step, with decay steps at iterations 100,000, 150,000, and 175,000.

In the second stage, the model trained in the first stage is used as a pre-trained model to train the ConvSRGAN network. We introduce a discriminator to compute \(\mathcal {L}_{a}\) and \(\mathcal {L}_{p}\) between the SR images produced by the first-stage generator and the HR images. This helps to restore the original painting style features and provide more details, ultimately improving the output of the ConvSR generator network. The initial learning rate is set to \(1 \times 10^{-4}\) during the second stage. As in the first stage, the learning rate is halved at each decay step, with decay steps at iterations 100,000 and 150,000.
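The step-decay schedules of both stages can be expressed with one small helper. This is a sketch only: the function name is hypothetical, and it assumes the halving takes effect at the milestone iteration itself, which the text does not specify.

```python
from bisect import bisect_right


def multistep_lr(iteration, base_lr, milestones, gamma=0.5):
    """Learning rate after multiplying by gamma at every milestone already reached."""
    # bisect_right counts the milestones that are <= iteration (assumed convention).
    return base_lr * gamma ** bisect_right(milestones, iteration)


STAGE1 = dict(base_lr=2e-4, milestones=[100_000, 150_000, 175_000])
STAGE2 = dict(base_lr=1e-4, milestones=[100_000, 150_000])
```

Under this convention, `multistep_lr(120_000, **STAGE1)` gives \(1\times 10^{-4}\), and the first stage ends at \(2.5\times 10^{-5}\) after its third decay step.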

Inference process. Given a low-resolution traditional painting image that is blurry due to weathering, the ConvSR network outputs a restored and complete super-resolution image. We measured the inference time on 100 images, which took 110.18 s.

Evaluation metrics

To evaluate the quality of the SR images obtained by our method, we adopted three metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM) [35], and Learned Perceptual Image Patch Similarity (LPIPS) [36]. PSNR and SSIM evaluate the texture and structural integrity of the SR images, while LPIPS assesses their perceptual quality. Higher PSNR and SSIM values and lower LPIPS values indicate better results.
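For reference, PSNR follows directly from the mean squared error between the SR and HR images. The sketch below assumes images stored as lists of rows with pixel values in [0, 1]; the function name and the `max_val` parameter are illustrative.

```python
import math


def psnr(sr, hr, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    sr_values = [v for row in sr for v in row]
    hr_values = [v for row in hr for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(sr_values, hr_values)) / len(sr_values)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01 and therefore a PSNR of 20 dB, which puts the roughly 28 dB values reported in this section in context.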

Table 1 Comparison results on ConvSR and ConvSRGAN

Comparison with state-of-the-art methods

To demonstrate the effectiveness of our proposed ConvSRGAN framework, we conducted quantitative and qualitative comparisons with several state-of-the-art methods on the SRCLP dataset we constructed.

Baselines

We conduct qualitative and quantitative comparisons with several baseline methods to demonstrate the effectiveness of the method.

1. RRDBNet [17]: It introduces the Residual-in-Residual Dense Block without batch normalization as the basic network building unit.

2. SRResNet [16]: It is the first framework capable of inferring photo-realistic natural images for 4\(\times\) upscaling factors.

3. EDSR [37]: The significant performance improvement of this model is due to optimization by removing unnecessary modules in conventional residual networks.

4. ESRT [25]: It is a hybrid model, which consists of a Lightweight CNN Backbone and a Lightweight Transformer Backbone.

5. Real-ESRGAN [18]: A high-order degradation modeling process is introduced to better simulate complex real-world degradations.

6. BSRN [31]: It contains two efficient designs. One is the use of blueprint separable convolution, which replaces the redundant convolution operation. The other is the introduction of more effective attention modules to enhance the model’s ability.

7. LKDN [30]: It simplifies the model structure and introduces more efficient attention modules to reduce computational costs while also improving performance.

8. VapSR [29]: The large receptive field pixel attention mechanism is used with parameter reduction, pixel normalization, and intermediate attention conversion steps to enhance super-resolution performance in a lightweight manner.

Comparison analysis

We conducted quantitative comparisons of ConvSR and ConvSRGAN with other SOTA methods. The quantitative results are shown in Table 1. Both versions of our model, ConvSR and ConvSRGAN, achieved satisfactory results.

In addition, in order to make a fair comparison with ConvSR, we only train the generators of these comparison models. The output images of the generators are used for metrics computation and comparison.

In the comparison with GAN-based models, ConvSRGAN outperforms Real-ESRGAN by 0.736 dB in PSNR, reaching 28.281 dB, and by 0.008 in SSIM, reaching 0.803, indicating closer agreement with the structural information of the original images. On LPIPS, where lower is better, our model improves on ESRT by 0.0095, attaining a score of 0.2334. Together, these metrics indicate improved fidelity, structural retention, and perceptual quality in our model’s outputs.

When comparing ConvSR with models trained without a GAN, ConvSR attains a PSNR of 28.916 dB, within a marginal 0.061 dB of ESRT, indicating high reconstruction fidelity. Our model also edges ahead on the LPIPS metric by a slim margin of 0.0005, achieving 0.2914, which suggests closer agreement with human perception of image quality. Moreover, in terms of SSIM, which evaluates structural similarity, our model achieves 0.820, surpassing RRDBNet by 0.001. This underscores the effectiveness of our model in preserving structural integrity while remaining competitive in overall image quality.

It is worth noting that considering the artistic characteristics of the Chinese landscape painting images, we attempted to train the ConvSR component with a GAN to achieve a more visually pleasing perceptual effect. This was aimed at providing more visual information during the super-resolution process and thereby obtaining more accurate evaluation results.

LPIPS evaluates image quality based on learned human perception, focusing more on the subjective perception of images by the human eye. It can better reflect the human eye’s perception of artistic works compared to the PSNR and SSIM metrics. Therefore, our goal is to optimize the performance of our model on the LPIPS metric.

Based on the experimental findings, the complete ConvSRGAN network improved LPIPS by 0.058 over ConvSR (from 0.2914 to 0.2334). It also attained a prominent LPIPS value in comparison with other state-of-the-art techniques, although this trade-off may have led to more modest results on the other metrics. Nevertheless, we achieved super-resolution outcomes that align better with human perceptual preferences.

In addition, compared with Real-ESRGAN, our method improved PSNR by 0.736 dB, and compared with ESRT, it improved LPIPS by 0.0095. These values show that our model achieved better super-resolution performance (Fig. 4).

Fig. 4
Examples from the Mural dataset. From left to right: Cave, Temple, Tomb, and Thangka

Table 2 Comparison results with kernel size, loss function and loss weight

Visualization

To visually compare the performance of ConvSRGAN with other SOTA methods, we selected a set of representative painting images from the SRCLP dataset for testing and inference.

Specifically, from the pavement texture in Figs. 5a, 6f, the tree trunk in Fig. 5b, and the mountain in Fig. 6d, it can be observed that methods such as EDSR [37], SRResNet [16], and LKDN [30] have lost some of the fine texture details in their results. In contrast, ConvSRGAN preserves the brushstroke forms consistent with the original artwork and exhibits a more delicate representation.

Fig. 5
Visual comparisons of ConvSRGAN. Both (a, c) are painted by Shimin Wang, a Qing Dynasty artist known for his delicate brushwork and the harmonious blend of form and spirit in his paintings. b is a painting of ‘Leisurely Strolling with a Walking Stick’. These landscape images are from our SRCLP dataset and are used for testing. Compared with the SOTA methods, the method we proposed preserves the form of brush strokes consistent with the original landscape painting and exhibits a more refined form of expression. Zoom in for best view

Fig. 6
Visual comparisons of ConvSRGAN. a is a painting of The Relocation of Ge Zhichuan from the Yuan Dynasty. b is a painting of Strolling with a Walking Stick, illustrating a waterfall cascading down into a lake at the foot of the mountain gorge. c is a painting of Boating in a Rapid Stream by Daqian Zhang. These landscape images are from our SRCLP dataset and are used for testing. Compared with other models, our method not only deals with the texture in landscape painting, but also retains the original color and painting style of the landscape. Zoom in for best view

Furthermore, color is an essential means of expressing emotions, atmosphere, and artistic conception in artwork. In the case of the mountain in Figs. 5c, 6d, the result from BSRN [31] appears to have darker colors. It can be seen that ConvSRGAN effectively restores the color, saturation, and brightness information of the artwork, enhancing the artistic expression of the image and conveying richer emotions and artistic conception.

ConvSRGAN brings the image closer to the artistic characteristics and rhythmic style of the original artwork. Chinese traditional painting emphasizes the use of brushstrokes to depict the form and structure of the painting through contours and edges, thus giving the artwork a sense of space and layers.

Moreover, as can be seen in Fig. 5b, Real-ESRGAN [18] and LKDN fail to model the finer textures of the branches and leaves, resulting in blurred results. In contrast, our proposed EHRM block enhances the high-frequency details in the image while preserving the contours of the painting through long skip connections. This also indicates that ConvSRGAN not only maintains the different forms of painting elements but also enhances the clarity of the edge lines, making the image more vivid and three-dimensional.

As can be seen in Fig. 7, the super-resolution results from methods like LKDN lose the texture and quality of the rocks, while there are partial artifacts in the result from RRDBNet [17]. The brushstrokes and techniques are crucial for representing the natural form of objects and creating a sense of depth in traditional paintings. It can be seen that our model learns the interweaving brushwork in the artwork, enhancing the subtlety and artistic effects of the image.

Fig. 7
Visual comparisons on an ancient painting. It is a painting of Visiting a Friend with a Qin by the Qing Dynasty painter Rui Shang. The painting showcases a stream flowing horizontally, with a secluded pavilion and elegant buildings on the opposite bank. The composition includes a bridge with rich brushstroke details. In this painting, our method has a superior performance in the processing of leaf texture compared to other methods. Zoom in for best view

Ablation study

Comparison analysis

By comparing the visual results with various advanced methods, it can be observed that our proposed approach not only reduces the loss of high-frequency information in the reconstruction of painting images but also achieves more accurate super-resolution results. Additionally, our method captures a broader range of spatial dependencies, which is crucial for understanding and reproducing visual characteristics such as the overall layout, brushstroke trends, and color rhythms present in Chinese traditional painting.

To further validate our model, we devised three ablation experiments, on the kernel size in the ADCB, the use of \(\mathcal {L}_{M-S}\), and the weight of \(\mathcal {L}_{M-S}\). The results are reported in Table 2.

Comparison analysis on kernel size

For the comparison, we trained three models with different kernel sizes: ConvSRGAN w/ \(7\times 7\), ConvSRGAN w/ \(5\times 5\), and ConvSRGAN w/ \(3\times 3\).

In terms of evaluation metrics, ConvSRGAN w/ \(7\times 7\) kernel size achieved the best results across all three metrics. Compared with ConvSRGAN w/ \(3\times 3\) kernel size, our model improves PSNR by 0.261 dB, reaching 28.281 dB, and SSIM by 0.003. Compared with ConvSRGAN w/ \(5\times 5\) kernel size, our model improves LPIPS by 0.0077.

It should be noted that increasing the size of the convolutional kernel results in a larger receptive field, which covers a larger region of the image and allows our model to better capture the global structure and contextual information. In addition, a larger convolutional kernel can extract richer features, including image texture, edge information, and shape. As a result, the ConvSR network handles image details more accurately, which is crucial for super-resolution of landscape paintings.
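The growth of the receptive field with kernel size can be made concrete with the standard formula for a stack of stride-1, undilated convolutions. The stack depth used here is purely hypothetical and does not describe the actual ADCB architecture.

```python
def receptive_field(kernel_size, num_layers):
    """Receptive field (in pixels per side) of stacked stride-1, undilated convolutions."""
    # Each layer extends the field by (kernel_size - 1) pixels.
    return 1 + num_layers * (kernel_size - 1)
```

For a hypothetical stack of three layers, \(3\times 3\) kernels see a \(7\times 7\) region, \(5\times 5\) kernels a \(13\times 13\) region, and \(7\times 7\) kernels a \(19\times 19\) region, so the larger kernel covers far more context at the same depth.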

Comparison analysis on MS-SSIM loss

Initially, we performed ablation experiments by comparing ConvSRGAN with ConvSRGAN w/o MS-SSIM loss (\(\mathcal {L}_{M-S}\)). As can be seen in Table 2, the PSNR value increases by 0.498 dB and the other metrics also improve when using \(\mathcal {L}_{M-S}\). We also compare the results obtained with three loss functions: \(\mathcal {L}_{M-S}\), Gradient variance loss (\(\mathcal {L}_{GV}\)) [38], and Local Discriminative Learning loss (\(\mathcal {L}_{LDL}\)) [39].

\(\mathcal {L}_{GV}\) aims to minimize the distance between the variance maps, resulting in clearer images. \(\mathcal {L}_{LDL}\) stabilizes the model training process by computing artifact maps in the reconstructed images. In the comprehensive evaluation against four baseline models, ConvSRGAN w/ \(\mathcal {L}_{M-S}\) shows clear gains. Specifically, compared with ConvSRGAN w/o \(\mathcal {L}_{M-S}\), our model improves PSNR by 0.498 dB, reaching 28.281 dB. Compared with ConvSRGAN w/ \(\mathcal {L}_{GV}\), our model improves SSIM by 0.014, indicating better preservation of structural similarity. Our findings demonstrate that \(\mathcal {L}_{M-S}\) preserves the structure and layout of painting images in the super-resolution task by considering the structural similarity of images at different scales.

Comparison analysis on loss weight

In our study, a comparative experiment was conducted among three models in which the weight of \(\mathcal {L}_{M-S}\), denoted \(\delta\) here (it corresponds to \(\sigma\) in Eq. (23)), was set to \(\delta\)=0.5, \(\delta\)=1.0, and \(\delta\)=2.0, respectively. Each configuration was then evaluated on the same metrics to quantify the impact of this weight. The results are presented in Table 2 and show how tuning the loss weight \(\delta\) can be used to optimize model performance and mitigate overfitting.

Quantitative evaluation has revealed that ConvSRGAN w/ \(\delta\)=1.0 outperforms ConvSRGAN w/ \(\delta\)=0.5 with a notable increase of 0.147 dB in PSNR. Furthermore, ConvSRGAN w/ \(\delta\)=1.0 excels with the highest SSIM of 0.803. Collectively, these metrics affirm our model’s superiority across multiple dimensions of image assessment.

Visualization

As depicted in Figs. 8, 9, and 10, we visually analyze the effects of kernel size, MS-SSIM loss, and loss weight.

Fig. 8
Visual comparison of the model trained with different kernel sizes. All of (a, b, c) are from our SRCLP dataset. These three landscape pictures have different and prominent color styles. *ConvSRGAN w/ \(7\times 7\) is our model. The choice of kernel size for training ConvSRGAN was verified and evaluated by ablation experiments. Zoom in for best view

Fig. 9
Visual comparison of the model trained with different loss functions. Both (a, b) are from our SRCLP dataset. We chose two landscape pictures with different texture characteristics for comparison. *ConvSRGAN w/ \(\mathcal {L}_{M-S}\) is our model. The effectiveness of the joint loss on the training of the ConvSRGAN method was verified and evaluated by ablation experiments on the loss function. Zoom in for best view

Fig. 10
Visual comparison of the model trained with different weight parameters. All of (a, b, c) are from our SRCLP dataset. These three landscape pictures have different and prominent color styles.  *ConvSRGAN w/ \(\delta\)=1.0 is our model. The reliability of the \(\delta\) selection on the training of the ConvSRGAN method was verified and evaluated by ablation experiments. Zoom in for best view

Visualization on kernel size

Incorporating a larger convolution kernel expands the receptive field of the model, allowing it to capture more extensive contextual information within the input data. This enlarged receptive field can lead to improved feature learning and a heightened ability to model complex patterns, thereby increasing the model’s fitting capacity.

As depicted in Fig. 8a, with ConvSRGAN w/ \(7\times 7\) kernel size the reconstructed building textures are more natural and realistic, whereas ConvSRGAN w/ \(3\times 3\) kernel size produces excessively smooth results. Meanwhile, in the mountain texture depicted in Fig. 8b, ConvSRGAN w/ \(5\times 5\) kernel size leads to color distortion.

Based on the results of our comparative tests and the theoretical understanding of larger convolution kernels, we can confidently conclude that the adoption of larger kernel sizes, such as our \(7\times 7\) convolution kernel, indeed bolsters the reconstruction capabilities of our model. The expanded receptive field enables the model to better grasp the broader context within images, thereby improving its ability to restore fine details and intricate structures present in landscape paintings.

Visualization on MS-SSIM loss

The loss function can quantify the distance or structural differences between images. Meanwhile, loss function plays a crucial role in guiding model fitting and ultimately affects the effectiveness and performance of the model during training. To evaluate the efficacy of our proposed method, we perform ablation experiments on the ConvSRGAN w/ \(\mathcal {L}_{M-S}\) to validate the improvement in model performance.

To further test the effectiveness of ConvSRGAN w/ \(\mathcal {L}_{M-S}\), we set up several ablation models and analyzed the resulting changes. As shown in Fig. 9b, the other loss functions failed to recover the shadowed parts of the bamboo leaves and resulted in brightness distortion, whereas ConvSRGAN w/ \(\mathcal {L}_{M-S}\) effectively resolved this issue.

In the process of image reconstruction, it is imperative to account for two pivotal loss functions: \(\mathcal {L}_{1}\) and \(\mathcal {L}_{M-S}\). The former, \(\mathcal {L}_{1}\), centers on quantifying pixel-wise discrepancies, ensuring a precise replication of individual elements. Conversely, \(\mathcal {L}_{M-S}\) prioritizes the maintenance of structural similarity across varied resolutions, thereby safeguarding the coherence and integrity of image structures. Achieving an optimal equilibrium between these two measures is fundamental, as it enables the preservation of both intricate details and the overarching compositional structure, which is vital for the faithful and visually coherent restoration of images.

Visualization on loss weight

To examine the impact of the weighting coefficient of \(\mathcal {L}_{M-S}\) on the reconstruction process, we conducted experiments using three values of the weight \(\delta\). This systematic approach aimed to gauge the effect of different loss weightings on the extent of overfitting. As depicted in Fig. 10, ConvSRGAN w/ \(\delta\)=1.0 is better at retaining intricate details and texture fidelity. This visual evidence reinforces our earlier quantitative findings, demonstrating a heightened capability to preserve the subtle nuances and fine elements of the original content, thereby enhancing the overall quality and authenticity of the processed images.

It can be seen that if the value of \(\delta\) (the weight of \(\mathcal {L}_{M-S}\), written \(\sigma\) in Eq. (23)) is too large, the model may sacrifice some finer details. However, if the value of \(\delta\) is too small, an excessive focus on details may lead to a loss of overall coherence. Our findings show that a value of 1.0 yields the most favorable results.

Comparison with different datasets

Comparison analysis

We conducted some comparative experiments on the ConvSRGAN model with three datasets: Mural, Painter by Numbers [33] and Flickr2K [34], respectively. Moreover, we also compare with other models on the Mural dataset.

Comparison analysis on different datasets

We conducted comparative experiments with the ConvSRGAN model on different types of datasets to verify its performance. We selected 2000 images each from Painter By Numbers [33] and Flickr2K [34] as training sets. In addition, we selected 100 images as the testing set. The results are shown in Table 3.

Moreover, from the SRCLP dataset, we selected 100 images from different dynasties (the Song, Yuan, Ming, and Qing dynasties) for testing and analysis. In addition, we selected 68 paintings by the famous painter Daqian Zhang. As one of the greatest painters in modern Chinese art history, Daqian Zhang has a high reputation at home and abroad. All results show that our network performs well on datasets of different styles.

Table 3 Comparison of ConvSRGAN on different datasets

Comparison analysis on mural

As shown in Table 4, we test and compare our model with six state-of-the-art methods. In direct comparison with Real-ESRGAN, our model improves PSNR by 0.446 dB, reaching 24.675 dB. Although the difference in SSIM is marginal at 0.001, our model maintains a strong score of 0.770, indicative of comparable or slightly improved structural preservation. Moreover, our model achieves the best LPIPS result, surpassing the baselines. Collectively, these metrics validate our model’s advances in image restoration and enhancement. Meanwhile, the performance of our method is comparable to the state-of-the-art Real-ESRGAN, demonstrating its applicability to super-resolution restoration of murals.

Table 4 Comparison results of ConvSRGAN on Mural

Visualization

We visualize and analyze the results on three datasets: Mural, Painter By Numbers [33], and Flickr2K [34]. We also visualize results on landscape paintings from different dynasties and on paintings by Daqian Zhang.

Visualization on mural

To further validate the effectiveness of our model, we have conducted testing experiments on Mural dataset. Figure 11 visually illustrates the results, effectively demonstrating how our method delivers more aesthetically pleasing outcomes compared to Real-ESRGAN. The comparison between the ESRT and Real-ESRGAN results shows that using inappropriate loss functions, high-pass filtering, and pooling operations can result in excessive smoothing of image details, leading to less realistic super-resolution results. In contrast, our method handles the image more naturally, recovering more accurate image texture details, as shown in the facial details in Fig. 11c. Overall, our approach proves to be highly effective for restoring mural images.

Fig. 11
Visual comparison on Mural. a Cave, b Temple, c Tomb, d Thangka. Our method preserves the overall color while processing the local lines more finely on Mural. Zoom in for best view

Visualization on painter by numbers

Furthermore, we have conducted a qualitative comparison of the visual effects between ConvSRGAN and other advanced methods on Painter By Numbers dataset. As shown in Fig. 12, it can be observed that other methods still exhibit shortcomings when restoring non-real-world images. In Fig. 12a, it is apparent that BSRN, VapSR, and LKDN lose color information in their results, while the super-resolution result from RRDBNet is overly smooth, exhibiting low fidelity in the lines of buildings and the texture of flowers, thus compromising the original brushstrokes and artistic style.

Fig. 12
Visual comparisons on Painter By Numbers. All of (a, b, c, d) are from dataset Painter By Numbers. The visual effects of our method handle the color of the painting better than the SOTA model, and the lines are cleaner and clearer. Zoom in for best view

However, from the experimental results, our method did not achieve ideal results in Fig. 12a with blurred edges of the petals. We consider that this may be attributed to improper parameter adjustment in the EHRM, resulting in the loss of high-frequency information. Nonetheless, our method has achieved natural visual effects on other images, closely aligning with the artistic style and technical characteristics of the original paintings. It has demonstrated good visual performance, proving the applicability of ConvSRGAN to other artistic images.

Visualization on Flickr2K

As illustrated in Fig. 13, our model performs well on a dataset of natural images, Flickr2K. SR represents the super-resolution images produced by the ConvSRGAN. HR represents the high-resolution original images, as the ground truth.

Upon examining Fig. 13, our model preserves details well in examples (a), (b), (e), and (f), demonstrating a good capability in handling intricate textures. For example, the contours of the distant houses in Fig. 13f are restored relatively well. Nevertheless, there is room for improvement in examples (c) and (d). A case in point is the suboptimal restoration of the flower in Fig. 13d, which we attribute to the similarity in hue between the flower and its surrounding background. This low-contrast scenario makes it difficult for the network to isolate and extract distinct textural features, and further optimization is needed to handle such cases.

Fig. 13

Visualization on Flickr2K. The visual effects of our method handle the color of the painting better than the SOTA model, and the lines are cleaner and clearer. Zoom in for best view

Comparing the art images in Painter By Numbers with the natural images in Flickr2K, as exemplified by the floral arrangement in Fig. 13b and the real-world floral scene in Fig. 13d, we find that our model is proficient at handling relatively intricate textural details. However, it struggles to identify and reconstruct image characteristics that exhibit low contrast, which points to a limitation in feature extraction for such subtle visual elements.
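The low-contrast limitation discussed above can be made concrete with a simple contrast statistic. The sketch below is illustrative only and is not part of ConvSRGAN: it assumes grayscale intensities in [0, 1] as a simplification of the hue similarity described in the text, and the two patches and the function names `rms_contrast` and `michelson_contrast` are hypothetical.

```python
import math


def rms_contrast(patch):
    """Return the RMS contrast (standard deviation of intensities) of a 2D patch.

    `patch` is a list of rows holding grayscale values in [0, 1].
    """
    pixels = [value for row in patch for value in row]
    mean = sum(pixels) / len(pixels)
    variance = sum((value - mean) ** 2 for value in pixels) / len(pixels)
    return math.sqrt(variance)


def michelson_contrast(patch):
    """Return (max - min) / (max + min) over the patch, or 0.0 for an all-black patch."""
    pixels = [value for row in patch for value in row]
    lowest, highest = min(pixels), max(pixels)
    if highest + lowest == 0:
        return 0.0
    return (highest - lowest) / (highest + lowest)


# Hypothetical patches (assumed values, not taken from Flickr2K):
# a subject whose tone is close to its background, and one that stands out.
low_contrast_patch = [[0.50, 0.52],
                      [0.51, 0.53]]
high_contrast_patch = [[0.10, 0.90],
                       [0.90, 0.10]]

low_rms = rms_contrast(low_contrast_patch)
high_rms = rms_contrast(high_contrast_patch)
```

A patch such as `low_contrast_patch` yields a much smaller value under both measures than `high_contrast_patch`, which is one way to flag regions where texture extraction is likely to be difficult.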

Visualization on different dynasties

Figs. 14, 15, and 16 present a diverse selection of Chinese paintings from various dynasties, ranging from sweeping landscapes to detailed close-ups of flora. A comparative analysis shows that our model handles ancient paintings from different epochs well while preserving the details and artistic essence of the original works. The results maintain the aesthetic sensibility and spiritual depth of each piece, which attests to the effectiveness of our approach on such culturally and aesthetically rich content.

Fig. 14

Visual comparisons on paintings of the Song Dynasty. The paintings are ‘Frost Shinohan’, ‘Twilight Return’, and ‘plum and bamboo gathering birds’. They all show the relatively delicate brushwork and expressiveness of the Song Dynasty painters

Fig. 15

Visual comparisons on paintings of the Yuan Dynasty. The paintings are ‘The Qiushan Grass Hall’ by Yuan Dynasty artist Meng Wang and ‘The mountains’ by artist Mengxu Zhao. Both combine an overall landscape composition with the depiction of local details

Fig. 16

Visual comparisons on paintings of the Ming and Qing Dynasties. Both (a, b) are paintings by Ming Dynasty artist Jian Wang, illustrating the profound capacity of Ming artists to comprehend and depict the artistic conception of landscape painting. Meanwhile, (c, d) represent the Qing Dynasty through ‘Du Fu’s Poetry in Paint’ by artist Shimin Wang and ‘Peony Blossoms’ by artist Zhiqian Zhao, respectively. These latter two pieces exemplify the meticulous detail and rich artistic conception that are central to the expressive vocabulary of the Chinese painting tradition

Visualization on the painter Daqian Zhang

The artwork of Daqian Zhang is renowned for its distinctive bright, elegant, and graceful aesthetic. As illustrated in Fig. 17, comparative analyses show that our model handles the intricate details characteristic of his pieces and preserves his signature style. This highlights the ability of our model to capture and reproduce the refined nuances and aesthetic hallmarks of Zhang’s work.

Fig. 17

Visual comparisons on paintings by Daqian Zhang. Panel a depicts the painting ‘Prime Minister amidst Mountains with Immortal Essence’. Panel b shows Daqian Zhang’s interpretation of Yuanhua Dong’s artistic styles. Together, they illustrate Daqian Zhang’s profound understanding of landscape painting. His proficiency goes beyond meticulous attention to textural detail and extends to a grand, awe-inspiring artistic conception, highlighting his comprehensive mastery of the art form

Conclusion

In this paper, we propose ConvSRGAN, an innovative framework for super-resolution inpainting of traditional Chinese paintings. We use a series of Enhanced Adaptive Residual Modules (EARMs) to progressively learn the depth information of the images. Within each EARM, we introduce an Enhanced High-frequency Retention Module (EHRM) that preserves high-frequency details through a specially designed Adaptive Depthwise Convolution Block and pooling operations that broaden the model’s receptive field.
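For readers unfamiliar with the two generic operations named above, the following minimal sketch shows a plain depthwise convolution and non-overlapping average pooling in pure Python. It is not the authors' Adaptive Depthwise Convolution Block, whose adaptive design is not reproduced here; the function names `depthwise_conv2d` and `avg_pool2d`, the toy image, and the kernels are all hypothetical.

```python
def depthwise_conv2d(image, kernels):
    """Apply one k x k kernel per channel, with no mixing across channels.

    image:   list of C channels, each a list of rows (C x H x W).
    kernels: list of C kernels, each k x k.
    Uses 'valid' padding, stride 1, and cross-correlation (no kernel flip),
    as is conventional in deep learning frameworks.
    """
    if len(image) != len(kernels):
        raise ValueError("depthwise convolution needs one kernel per channel")
    output = []
    for channel, kernel in zip(image, kernels):
        k = len(kernel)
        height, width = len(channel), len(channel[0])
        out_channel = []
        for i in range(height - k + 1):
            row = []
            for j in range(width - k + 1):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        acc += channel[i + di][j + dj] * kernel[di][dj]
                row.append(acc)
            out_channel.append(row)
        output.append(out_channel)
    return output


def avg_pool2d(image, size=2):
    """Non-overlapping average pooling applied independently to each channel."""
    pooled = []
    for channel in image:
        height, width = len(channel), len(channel[0])
        out_channel = []
        for i in range(0, height - size + 1, size):
            row = []
            for j in range(0, width - size + 1, size):
                window = [channel[i + di][j + dj]
                          for di in range(size) for dj in range(size)]
                row.append(sum(window) / len(window))
            out_channel.append(row)
        pooled.append(out_channel)
    return pooled


# Toy 2-channel 4 x 4 input (assumed values, for illustration only).
image = [
    [[1, 1, 1, 1] for _ in range(4)],                   # channel 0: all ones
    [[4 * r + c for c in range(4)] for r in range(4)],  # channel 1: 0..15
]
kernels = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],  # channel 0: sums its 3 x 3 window
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],  # channel 1: copies the centre pixel
]

features = depthwise_conv2d(image, kernels)
pooled = avg_pool2d(image)
```

After 2 x 2 pooling, each value summarises a 2 x 2 region of the input, so a subsequent 3 x 3 kernel spans a 6 x 6 input region. This is the general sense in which pooling broadens the receptive field.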

To ensure that the model achieves more realistic and nuanced texture restoration, we incorporate \(\mathcal{L}_{M-S}\) into the combined loss function used for supervised training. Overall, the ConvSRGAN framework presented in this paper addresses the challenges specific to traditional Chinese painting images and provides a novel solution for enhancing image resolution while preserving the artistic style and details of the paintings.

The experimental results demonstrate that ConvSRGAN performs strongly on traditional painting and mural datasets, particularly in high-definition restoration tasks for landscape paintings, where it shows remarkable visual fidelity and vividness. This supports its effectiveness and broad applicability in cultural heritage preservation and restoration. Furthermore, the model achieves excellent visual results on other artistic datasets while preserving the unique artistic style of the paintings, which further confirms its robustness and generalizability in artistic image super-resolution tasks.

Discussion

Our future research will focus on two key areas: the conservation and utilization of cultural heritage, and the optimization of modeling technology.

First, regarding cultural heritage conservation, we will continue to study the application of image super-resolution models to cultural heritage, including but not limited to improving the models’ high-definition restoration of traditional artworks. We will also develop more refined image restoration algorithms tailored to the material and age characteristics of cultural artifacts.

Second, on the technical level, we will explore new network architectures and loss functions to achieve better inference results. We will also work to improve model performance, in terms of both the quality of the super-resolution images and the speed of inference.