A diffusion probabilistic model for traditional Chinese landscape painting super-resolution

Traditional Chinese landscape painting is prone to low-resolution image issues during the digital protection process. To reconstruct high-quality images from low-resolution landscape paintings, we propose a novel Chinese landscape painting generation diffusion probabilistic model (CLDiff), which is similar to the Langevin dynamics process and realizes the transformation of a Gaussian distribution into the empirical data distribution through multiple iterative refinement steps. The proposed CLDiff can provide super-resolution predictions with clear ink texture by gradually transforming pure Gaussian noise into a super-resolution landscape painting, conditioned on a low-resolution input, through a parameterized Markov chain. Moreover, by introducing an attention module with an energy function into the U-Net architecture, we turn the denoising diffusion probabilistic model into a powerful generator. Experimental results show that CLDiff achieves better visual results and highly competitive performance on traditional Chinese landscape painting super-resolution tasks.


Introduction
In the long history and cultural development of China, traditional landscape painting is a very important form of cultural and artistic expression. Traditional Chinese landscape painting not only shows the beauty of the Chinese land, but also integrates the painter's thinking and emotional attachment toward the universe, nature, society, and life, embodying the aesthetic thought of ancient Chinese philosophy. Figure 1 shows traditional Chinese landscape paintings with diverse styles and unique charm. However, due to unpredictable natural, human, and equipment factors, these art treasures with Oriental characteristics can suffer from problems such as low resolution and semantic loss during digital protection. This seriously hinders the inheritance of China's rich history and culture.
At present, research on the super-resolution task of traditional Chinese landscape painting is rare and mainly focuses on the fields of image generation and image translation. To exploit multi-scale image information, Lin et al. [1] proposed a multi-scale generative adversarial network (GAN) to transform sketches into Chinese paintings. To evaluate the quality of Chinese landscape paintings generated by different strategies, Lv et al. [2] investigated the influence of the network model, loss function, and training objective on the quality of Chinese landscape paintings generated by conditional generative adversarial networks. Zhou et al. [3] proposed an interactive and generative framework based on CycleGAN, which can generate Chinese landscape paintings from input sketches. A recent work is SAPGAN [4] (Sketch-And-Paint GAN), which first employs SketchGAN to generate the sketches of landscape paintings, and then uses PaintGAN to realize the transformation from sketches to Chinese landscape paintings.
In addition, for the image super-resolution task, scholars have proposed a variety of different solutions. To model the local structure of complex images and reduce the time cost, an adaptive sparse domain selection and adaptive regularization method [5] was proposed. Considering the non-local self-similarity property of images, a simple and effective non-local centralized sparse representation method [6] was proposed to solve the image super-resolution problem. These methods achieve appealing super-resolution performance but often require solving a complex iterative optimization problem, and the model lacks prior knowledge learned from large-scale datasets. Recent works have shown that deep learning methods achieve excellent performance in learning complicated empirical distributions of images. By combining the advantages of convolutions with Transformers, a strong baseline model [7] was proposed for image super-resolution. To utilize image perception information, a generative adversarial network was proposed for image super-resolution [8] (SRGAN). By improving SRGAN, the later ESRGAN [9] and Real-ESRGAN [10] further improved image SR performance. However, GAN-driven methods are prone to mode collapse [11], resulting in no diversity in the generated images. Additionally, the training process of GAN-driven methods is unstable and prone to the vanishing gradient problem [12] or the exploding gradient problem [13].
Very recently, the diffusion probabilistic model [14] has shown great potential in various low-level vision tasks [15-19]. The diffusion probabilistic model (DM) is a parameterized Markov chain with a variational inference process, which includes a diffusion process and a reverse process [20]. The diffusion process converts data samples $x_0$ into random noise $x_t$, $t \in [1, \ldots, T]$, by gradually adding noise. The reverse process runs in the opposite direction, and the generation of data samples is achieved by repeatedly executing the inverse sampling transformation, i.e., $x_{t-1} = f(x_t)$. The DM is trained by optimizing a variational lower bound on the negative log-likelihood; it does not require regularization and optimization tricks to avoid optimization instability and mode collapse.
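The two directions described above can be sketched on a toy one-dimensional "image" with NumPy. This is an illustrative sketch, not the paper's code: the number of steps, the schedule, and the placeholder denoiser `f` are our own stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10
alphas = np.linspace(0.99, 0.90, T)    # noise-retention factor per step (assumed)

x0 = np.ones(4)                        # a toy "image" of four pixels
x = x0.copy()
for t in range(T):                     # forward diffusion: add noise step by step
    x = np.sqrt(alphas[t]) * x + np.sqrt(1 - alphas[t]) * rng.normal(size=x.shape)

# Reverse process: repeatedly apply a denoiser f(x_t) = x_{t-1};
# here a trivial placeholder stands in for the learned network.
f = lambda xt, t: xt
for t in reversed(range(T)):
    x = f(x, t)
```

With a trained denoiser in place of `f`, the reverse loop would carry a pure-noise sample back toward the data distribution.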
In this paper, we present a novel diffusion probabilistic model for traditional Chinese landscape painting super-resolution (CLDiff) to enhance the visual effect of the reconstructed image. Some methods [2, 4] only consider the advantages of GANs while neglecting the potential difficulties in training, while other methods cannot balance the perceptual performance of the image. Unlike these methods, our proposed CLDiff is inspired by the denoising diffusion probabilistic model [20]. CLDiff is a conditional image generation model that learns to convert a standard normal distribution into the Chinese landscape painting data distribution through an iterative refinement process. To sum up, the main contributions of this paper are as follows: (1) a novel denoising diffusion probabilistic model for the super-resolution task of traditional Chinese landscape painting.

Denoising diffusion probabilistic model
Denoising diffusion probabilistic models are used to achieve high-quality image processing tasks, and some works indicate that the quality of their generated images has exceeded that of GANs. Recently, denoising diffusion probabilistic models have been widely used in image super-resolution and image inpainting. Saharia et al. [16] proposed a repeated-refinement image super-resolution diffusion model, which achieves high-quality image super-resolution through an iterative refinement process. Li et al. [15] proposed a super-resolution diffusion probabilistic model for faces, which transforms a pure noise image into a face super-resolution result through a Markov chain. Saharia et al. [21] implemented four different image translation tasks using diffusion models and investigated the impact of loss functions and attention mechanisms on model performance. Whang et al. [17] proposed a conditional diffusion model, which uses a predict-and-refine strategy to make sampling more effective and improve the quality of image deblurring. To address the image inpainting problem, Lugmayr et al. [18] achieved free-form inpainting only by improving the reverse diffusion iterations. Additionally, diffusion models have also been successfully applied to medical image generation and object detection. Inspired by the above works, we extend the denoising diffusion probabilistic model to the super-resolution task of traditional Chinese landscape painting for the first time.

Image Super-resolution
Image super-resolution is a low-level vision task that aims to recover a high-resolution image from a low-resolution version. As a classical ill-posed inverse problem in the field of image processing, it has attracted various solutions [22] in recent years. Dong et al. [23] first proposed a deep convolutional neural network for end-to-end low-resolution to high-resolution mapping. Ma et al. [24] proposed using GANs for super-resolution tasks; this method utilizes the structural information of images to generate visually pleasing details. To take advantage of neural architecture search (NAS), Pan et al. [25] proposed a Gaussian process-based NAS that won first place in an image super-resolution task. Considering the computational cost, Zhou et al. [26] proposed SRFormer for image super-resolution, which maintains performance while reducing resource consumption. These methods perform excellently in super-resolution scenarios for natural images, which greatly inspired the design of our model.

Attention mechanism
The attention mechanism reflects the application of biological mechanisms in artificial intelligence, and some studies [27-29] have achieved great success in applying attention mechanisms to low-level vision tasks. By adaptively adjusting the interdependencies between channels, the channel attention mechanism was introduced into the residual block to form a deep residual channel attention network [30] for image super-resolution tasks. To refine the quality of image generation, the self-attention mechanism was integrated into a generative adversarial network [31] to improve the resolution of the generated images. Considering that neurons should be adjusted dynamically based on context information, a context reasoning attention network [32] was put forward and achieved an appealing image super-resolution effect. The latent diffusion model [19] introduces cross-attention layers into the model architecture to improve the quality of generated images and the flexibility of the model. Different from these works, we design a novel attention mechanism to improve the quality of reconstructed images.

CLDiff
Inspired by the denoising diffusion probabilistic model [20], the proposed CLDiff is a conditional image generative diffusion model, which guides the reconstruction of high-quality traditional Chinese landscape paintings with a conditional low-resolution input image. Given a dataset of input-output pairs (x, y) of traditional Chinese landscape paintings, where x is the Chinese landscape painting and y is its corresponding low-resolution painting, we aim to train the model to learn an approximate conditional probability distribution p(x|y). Once training is completed, a pure noise image is transformed into a Chinese landscape painting image through an iterative refinement process under the guidance of the conditional low-resolution input painting. Specifically, the proposed CLDiff contains two processes: a forward Gaussian diffusion process and a reverse generation process, see Fig. 2.
The forward Gaussian diffusion process starts with a high-quality Chinese landscape painting image $x_0$ and gradually adds noise to $x_0$ through a T-step iterative process. This process is a forward Markov chain that transforms the data distribution $q(x_0)$ into a latent variable distribution $q(x_T)$. It can be defined as:

$$q(x_{1:T}|x_0) = \prod_{t=1}^{T} q(x_t|x_{t-1}), \tag{1}$$

where a single-step diffusion is defined as a Gaussian distribution:

$$q(x_t|x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{\alpha_t}\, x_{t-1},\, (1-\alpha_t) I\big), \tag{2}$$

where $\alpha_t \in (0, 1)$, $t \in [1, \ldots, T]$, are hyper-parameters that control the strength of the noise added at each step. However, the efficiency of the single-step diffusion process is relatively low. To improve the diffusion efficiency, $x_t$ can be estimated directly through a series of equation transformations:

$$x_t = \sqrt{\rho_t}\, x_0 + \sqrt{1-\rho_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \tag{3}$$

where $\rho_t = \prod_{i=1}^{t} \alpha_i$. There are no unknown variables to learn in Eq. (3), which allows us to obtain the intermediate hidden variable $x_t$ at any timestep. Therefore, in the forward Gaussian diffusion process, we use Eq. (3) to obtain $x_T$. When the timestep T is large enough, $x_T$ becomes indistinguishable from pure Gaussian noise.
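The closed-form jump of Eq. (3) can be sketched in a few lines of NumPy: compute the cumulative product $\rho_t$ once, then reach any timestep in a single operation instead of $t$ noising steps. The tensor shape and the chosen timestep are illustrative assumptions; the linear schedule matches the one stated in the experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
betas = np.linspace(1e-6, 1e-2, T)     # linear schedule, as in the experiments
alphas = 1.0 - betas
rhos = np.cumprod(alphas)              # rho_t = prod_{i<=t} alpha_i

x0 = rng.normal(size=(3, 8, 8))        # stand-in for a landscape painting tensor
t = 1500                               # arbitrary intermediate timestep
eps = rng.normal(size=x0.shape)
# Eq. (3): reach timestep t in one shot
xt = np.sqrt(rhos[t]) * x0 + np.sqrt(1.0 - rhos[t]) * eps
```

Note that `rhos[-1]` is vanishingly small under this schedule, so $x_T$ is effectively pure Gaussian noise, as the text states.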
The reverse generation process is a stochastic denoising process that starts from the pure noise image $x_T \sim \mathcal{N}(0, I)$ and iteratively refines the image through a T-step reverse Markov chain. This process transforms the latent variable distribution $p(x_T)$ into the Chinese landscape painting data distribution $p(x_0)$. From the transformation in Eq. (1) of the forward Gaussian diffusion process, if $x_0$ and $x_t$ are given, the posterior probability distribution of $x_{t-1}$ can be obtained:

$$q(x_{t-1}|x_t, x_0) = \mathcal{N}\big(x_{t-1};\, \mu_t(x_t, x_0),\, \beta_t I\big), \tag{5}$$

where

$$\mu_t(x_t, x_0) = \frac{\sqrt{\rho_{t-1}}\,(1-\alpha_t)}{1-\rho_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\rho_{t-1})}{1-\rho_t}\, x_t, \tag{6}$$

$$\beta_t = \frac{(1-\rho_{t-1})(1-\alpha_t)}{1-\rho_t}. \tag{7}$$

Equation (6) indicates that $\mu_t(x_t, x_0)$ depends on $x_t$ and $x_0$. There are no unknown variables in Eq. (7), so $\beta_t$ is a deterministic value. Combining Eqs. (6) and (7), a one-step reverse Markov chain can be obtained by sampling a slightly less noisy image $x_{t-1}$ from $x_t$. According to Eq. (5), it seems that the high-quality Chinese landscape painting $x_0$ can be obtained by repeating the reverse generation step T times. However, this is impractical, because $x_0$ is unknown in Eq. (5); $x_0$ is exactly what we need to estimate, as shown by the red fork in Fig. 2. To solve this problem and carry out the reverse generation process, we estimate $x_0$. Referring to [17, 19], we design a denoising network $f_\theta$ to estimate the high-quality Chinese landscape painting $x_0 = f_\theta(x_t, \rho_t)$ from the latent noisy image $x_t$. Therefore, we can use the estimate $f_\theta(x_t, \rho_t)$ to replace $x_0$ in Eq. (5), and the reverse generation process can be expressed as:

$$p_\theta(x_{t-1}|x_t) = \mathcal{N}\big(x_{t-1};\, \mu_t(x_t, f_\theta(x_t, \rho_t)),\, \beta_t I\big). \tag{8}$$

To guide this reverse image super-resolution process, we take the conditional low-resolution input image $y$ and the hidden variable $x_t = \sqrt{\rho_t}\, x_0 + \sqrt{1-\rho_t}\, \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, as the input of the denoising network. Equation (8) can then be rewritten as:

$$p_\theta(x_{t-1}|x_t, y) = \mathcal{N}\big(x_{t-1};\, \mu_t(x_t, f_\theta(y, x_t, \rho_t)),\, \beta_t I\big). \tag{9}$$
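A single posterior sampling step of the reverse generation process can be sketched as follows. The denoiser `f_theta` is a trivial placeholder standing in for the conditional U-Net, and all names and shapes are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
betas = np.linspace(1e-6, 1e-2, T)
alphas = 1.0 - betas
rhos = np.cumprod(alphas)

def reverse_step(xt, t, x0_hat):
    """Sample x_{t-1} from the Gaussian posterior with x0 replaced by its estimate."""
    rho_prev = rhos[t - 1] if t > 0 else 1.0
    mu = (np.sqrt(rho_prev) * (1 - alphas[t]) / (1 - rhos[t])) * x0_hat \
       + (np.sqrt(alphas[t]) * (1 - rho_prev) / (1 - rhos[t])) * xt
    var = (1 - rho_prev) * (1 - alphas[t]) / (1 - rhos[t])
    return mu + np.sqrt(var) * rng.normal(size=xt.shape)

f_theta = lambda y, xt, t: xt            # placeholder for the trained denoising network
y = rng.normal(size=(3, 8, 8))           # conditional low-resolution input
xt = rng.normal(size=(3, 8, 8))          # x_T: pure Gaussian noise
x_prev = reverse_step(xt, T - 1, f_theta(y, xt, T - 1))
```

Running `reverse_step` for `t = T-1, ..., 0` with a trained network would iteratively refine the noise into a super-resolution painting.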

Fig. 2 The forward Gaussian diffusion process and the reverse generation process in CLDiff

Figure 3 shows the single-step training process and the single-step inference process of our proposed model. The reverse image super-resolution procedure of our proposed model therefore depends on the input condition $y$. Finally, the training objective is:

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0, y, \epsilon, t}\Big[\big\| f_\theta(y, x_t, \rho_t) - x_0 \big\|^2\Big]. \tag{10}$$

Based on the above analysis, the denoising network $f_\theta$ is an important part of the proposed model. Inspired by [19, 33], we adopt an attention mechanism to improve the U-Net architecture. The denoising network architecture in CLDiff is shown in Fig. 4. CLDiff transforms the diffusion probabilistic model into a conditional traditional Chinese landscape painting super-resolution model by enhancing the U-Net backbone with the proposed attention mechanism. Existing attention mechanisms mainly learn a weighted feature combination along the channel or spatial dimension to refine features: channel attention generates 1-D weights and spatial attention generates 2-D weights. However, these two attention mechanisms do not fully conform to the principles of human visual neuroscience. In fact, human visual neurons are very sensitive to important features, and stimulated neurons can suppress surrounding neurons to highlight their importance [34]. Therefore, a novel attention mechanism is proposed. The proposed attention mechanism also includes a channel attention branch and a spatial attention branch. The difference is that, inspired by human visual neurons [35-37], we introduce an energy function into the channel attention mechanism and the spatial attention mechanism, respectively. The energy function can further help the attention mechanism to enhance the key features while weakening the secondary features. We adopt the same strategy as in Refs. [38, 39] to run the two kinds of attention in parallel, and then an element-wise addition operation merges the two attention mechanisms. Figure 5 shows the structure of
the attention mechanism with an energy function. The proposed attention mechanism can be expressed as:

$$Y = f_{EC}(X) \oplus f_{ES}(X), \tag{11}$$

where $X \in \mathbb{R}^{c \times h \times w}$ is the input tensor, $f_{EC}$ represents the channel attention operation with the energy function, $f_{EC}(X) = f_E(f_{CA}(X))$, and $f_{ES}$ represents the spatial attention operation with the energy function, $f_{ES}(X) = f_E(f_{SA}(X))$. $f_{CA}$ is the channel attention operation (Eq. (12)) and $f_{SA}$ is the spatial attention operation (Eq. (13)), where $W_k$, $W_v$, $W_q$ are $1 \times 1$ convolution operations, $F_{SG}$ is the sigmoid operation, $F_{SM}$ is the softmax operation, and $F_{GP}$ is the global average pooling operation. $f_E$ is the energy function [34, 37]. To simplify it and prevent overfitting, it can be expressed as a binary classification function with a regularization term:

$$e_t(w_t, b_t, o_t, o_i) = \frac{1}{N-1}\sum_{i=1}^{N-1}\big(-1 - \hat{o}_i\big)^2 + \big(1 - \hat{o}_t\big)^2 + \lambda J(w), \tag{14}$$

where $o_t$ and $o_i$ represent the output of the target neuron and the surrounding neurons in a single channel of the input tensor $X$, respectively; $\hat{o}_t = w_t o_t + b_t$ and $\hat{o}_i = w_t o_i + b_t$ are their linear transforms, where $w_t$ and $b_t$ denote the weight and bias. $N = h \times w$ is the number of neurons on the current channel, and $\lambda$ is the regularization parameter. $J(w)$ is the regularization term; it is the $\ell_2$-norm of the parameter vector, i.e., $\|w\|_F^2$. Equation (14) expresses the linear separability between the target neuron and the surrounding neurons. The stochastic gradient descent algorithm can reduce the computational burden of the energy function in each channel, which allows the linearly separable operation to be implemented in deep learning frameworks. Assuming the pixels in each channel follow the same distribution, the minimum energy can be obtained by algebraic transformation:

$$e_t^* = \frac{4(\hat{\sigma}^2 + \lambda)}{(o_t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}, \tag{15}$$

where $\hat{\mu}$ and $\hat{\sigma}^2$ are the mean and variance of the neurons in the channel. The importance of the neuron $o_t$ can be measured by $1/e_t^*$: the larger $1/e_t^*$ is, the more important the neuron $o_t$ is for capturing key features. To simulate the regulatory effect of the mammalian attention mechanism, the sigmoid function $f_s$ is used to scale extreme values:

$$f_E(X) = f_s\!\left(\frac{1}{E}\right) \odot X, \tag{16}$$

where $E$ groups all $e_t^*$ across the channel and spatial dimensions.
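The energy-based reweighting described above can be sketched as follows: compute a per-channel minimal energy from the channel mean and variance, take its reciprocal as importance, and scale the input through a sigmoid. This follows the SimAM-style closed form the text describes; CLDiff's exact module may differ, and the function and parameter names are ours.

```python
import numpy as np

def energy_attention(X, lam=1e-4):
    """Energy-function attention sketch. X: (c, h, w); lam: regularization."""
    mu = X.mean(axis=(1, 2), keepdims=True)          # per-channel mean
    var = X.var(axis=(1, 2), keepdims=True)          # per-channel variance
    # Minimal energy e_t* for every neuron; smaller energy = more distinctive
    e = 4.0 * (var + lam) / ((X - mu) ** 2 + 2.0 * var + 2.0 * lam)
    importance = 1.0 / e                             # larger = more important
    return X * (1.0 / (1.0 + np.exp(-importance)))   # sigmoid scaling

X = np.random.default_rng(0).normal(size=(4, 8, 8))
Y = energy_attention(X)
```

Because the weights come from a closed-form solution rather than learned convolutions, this reweighting adds no parameters to the branch it augments.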

Dataset and setting
The proposed model is implemented with the PyTorch framework and runs on a platform with two NVIDIA RTX 2080 Ti GPUs.
The data augmentation technique is applied to the Chinese landscape painting images to better train the proposed model. Figure 6 lists some examples of the data augmentation effects.
Our model was trained for 1e6 epochs with a mini-batch size of 1. We set the timestep T = 2000, and set the noise schedule of the forward Gaussian diffusion process to constants increasing linearly from 1e-6 to 1e-2. The U-Net adopts the Adam optimizer with a learning rate of 3e-6. The trained U-Net is used to represent the reverse generation process.
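One training step under this setup can be sketched as follows: sample a timestep, noise the clean painting with the closed-form forward jump, ask the network to recover it from the noisy image and the low-resolution condition, and penalize the reconstruction error. The toy identity "network" and the squared-error form of the loss are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
betas = np.linspace(1e-6, 1e-2, T)      # linear schedule from the setup above
rhos = np.cumprod(1.0 - betas)

def training_loss(x0, y, f_theta):
    t = rng.integers(1, T)                       # random timestep
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(rhos[t]) * x0 + np.sqrt(1 - rhos[t]) * eps   # forward jump
    x0_hat = f_theta(y, xt, rhos[t])             # network's estimate of x0
    return np.mean((x0_hat - x0) ** 2)           # reconstruction penalty

x0 = rng.normal(size=(3, 16, 16))                # high-quality painting (toy)
y = rng.normal(size=(3, 16, 16))                 # low-resolution condition (toy)
loss = training_loss(x0, y, lambda y, xt, r: xt)
```

In practice this scalar would be backpropagated through the conditional U-Net with Adam at the stated learning rate.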

Performance comparison and results
For comparison, the corresponding SR results are shown in Figs. 7 and 8. Figure 7 shows the super-resolution (×2) results at 256 × 256 → 512 × 512. One can see that the overall visual effect of all methods is good. Due to the lack of prior knowledge learned on large-scale datasets, ASDS [5] and NCSR [6] blur the details of the image and destroy its local semantic information, while SwinIR [7] and CLDiff make the SR image texture clearer and the visual effect better. From Fig. 9, we can see that the proposed CLDiff achieves higher PSNR; the main reason is that reverse diffusion inference involves an iterative denoising process. Although the SSIM values do not completely surpass those of the comparison methods, this does not affect the quality of the reconstructed images; a similar conclusion has been reported in Refs. [40, 41]. Moreover, Fig. 10 shows that the image super-resolution quality and visual effect of the proposed CLDiff are better than or close to the original image. Extensive experiments demonstrate that our proposed method achieves a good image super-resolution visual effect and avoids the problems of image over-smoothing and training instability.
In the future, we will conduct further research in two directions: (1) improving the performance of diffusion models and accelerating their inference speed; (2) further exploring the diffusion model for the restoration and editing of traditional Chinese landscape paintings.

Fig. 1 Some traditional Chinese landscape paintings

Fig. 3 Single-step training process and single-step reverse inference process
Fig. 4 The denoising network architecture in CLDiff

Fig. 5 Attention mechanism with energy function

Fig. 6 Examples of the data augmentation effects

Fig. 7 Qualitative comparisons with different methods on the ×2 super-resolution task

The training dataset of the model is a traditional Chinese landscape painting dataset. Part of this dataset [4] is based on four open-access museum galleries: the Smithsonian Freer Gallery, Harvard University Art Museum, Princeton University Art Museum, and Metropolitan Museum of Art. The other part was collected from the Baidu image search engine: we used crawler technology to obtain 1000 images from the Baidu search engine and manually selected 300 high-quality images.