Automatic damage identification of Sanskrit palm leaf manuscripts with SegFormer

Palm leaf manuscripts (PLMs) are of great importance in recording Buddhist scriptures, medicine, history, philosophy, and other subjects. Damage inevitably occurs during their use, circulation, and preservation. A comprehensive investigation of Sanskrit PLMs is a prerequisite for further conservation and restoration. However, damage identification and investigation are currently carried out manually, which requires strong professional skills and is extraordinarily time-consuming. In this study, PLM-SegFormer, a framework based on the SegFormer architecture, is developed to provide automated damage segmentation for Sanskrit PLMs. First, a digital image dataset of Sanskrit PLMs (the PLM dataset) was obtained from the Potala Palace in Tibet. Then, the hyperparameters for the pre-processing, model training, prediction, and post-processing phases were optimized to make the SegFormer model more suitable for the PLM damage segmentation task. The optimized segmentation model reaches 70.1% mHit and 51.2% mIoU. The proposed framework automates the damage segmentation of 10,064 folios of PLMs within 12 h. The PLM-SegFormer framework will facilitate the survey and recording of the preservation state of PLMs and be of great value to subsequent preservation and restoration. The source code is available at https://github.com/Ryan21wy/PLM_SegFormer.


Introduction
Palm leaf manuscripts (PLMs) were an important writing medium in many Asian countries before the invention of paper [1,2]. Sanskrit PLMs in Tibet are PLMs written in Sanskrit [3]. According to incomplete statistics, several museums and palaces in Tibet, such as the Potala Palace and Norbulingka, preserve more than 60,000 folios of Sanskrit PLMs dating from about the third century AD to the thirteenth century AD. Many of them are first-level cultural relics in China. After centuries of use, circulation, and preservation, Sanskrit PLMs have inevitably aged and been damaged [1,4,5].
There are many kinds of damage in Sanskrit PLMs. Among them, incompleteness, break, fiber delamination and warping, contamination, and improper restoration are five frequent types. Incompleteness (Fig. 1b) refers to the loss of part of the main body of a palm leaf. Break (Fig. 1c) refers to the transverse or longitudinal breaks formed along the texture of palm leaves by external force or excessive drying. Fiber delamination and warping (Fig. 1d) refers to the delamination of the fiber layers of palm leaves or the separation of the fiber layers from the leaf body. Contamination (Fig. 1e) refers to stains and traces formed on the surface of PLMs. Improper restoration (Fig. 1f) refers to the manual restoration of damaged PLMs with inappropriate materials and methods.
In 2019, the Chinese government launched the protection project of Sanskrit PLMs in Tibet. A comprehensive survey and record of the preservation state of Sanskrit PLMs is a prerequisite for their protection and restoration. The current survey of Sanskrit PLMs relies entirely on manual work, which requires strong professional skills and is extraordinarily time-consuming. Moreover, the manual identification of damages is subjective and labor-intensive. Therefore, a computer-aided damage identification method is required for an efficient survey of the preservation state of Sanskrit PLMs.
From the viewpoint of image processing, the damage identification of Sanskrit PLMs can be seen as a semantic segmentation task, i.e., assigning the correct category label to each pixel in digital images of Sanskrit PLMs. Recently, deep learning methods have become prominent and potent in semantic segmentation [6][7][8][9]. They have been introduced to historical document analysis, such as binarization [10][11][12], text line segmentation [13][14][15], page segmentation [16,17], layout analysis [18][19][20], and character recognition [21][22][23]. Based on digital images of historical handwritten documents, Xu et al. [16] applied fully convolutional networks (FCN) to classify the pixels of the documents into different categories: background, main text body, comments, and decorations. As for PLMs, several researchers have applied deep learning methods to recognize palm leaf characters [22][23][24][25][26][27]. Devi et al. [23] manually built cursive training datasets and utilized a unique convolutional neural network (CNN) technique to identify the palm leaf characters. Sudarsan et al. [26] used a combination of Log-Gabor with uniform rotational invariant LBP for feature extraction. Then, a stacked ResNet-LSTM architecture was used for the classification of palm leaf characters.
In this research, a damage segmentation dataset named the PLM dataset is established for the damage identification of Sanskrit PLMs in Tibet. It covers five common damages: incompleteness, break, fiber delamination and warping, contamination, and improper restoration. SegFormer [9] is chosen as the base segmentation network because it balances segmentation efficiency and accuracy well. Based on SegFormer, the PLM-SegFormer framework is proposed to automatically identify

Image acquisition and damage labeling
A Nikon D5300 camera was selected as the image acquisition instrument for the Sanskrit PLMs. A camera tripod was used to keep the distance and angle fixed during acquisition. A black paperboard was placed underneath the Sanskrit PLMs during the image acquisition. The horizontal and vertical resolutions of the images were both 300 dpi. The height and width of the images were in the ranges of [362, 2053] and [2411, 4739] pixels, respectively. The aspect ratio (the ratio of width to height) of the images was in the range of [1.92, 10.82]. In total, 338 images were captured.
Five frequent damages, namely incompleteness, fiber delamination and warping, break, contamination, and improper restoration, were considered as the targets in this study. All the damages of the Sanskrit PLMs were labeled manually by experts. The image polygonal annotation tool LabelMe (v4.5.6) [28] was used for damage labeling. The raw images with manual annotations formed the PLM dataset. Then, the PLM dataset was divided into the training set, validation set, and test set in the ratio of 6:2:2.

Pre-processing
The size of PLM images varies significantly from one image to another, and feeding large images directly into the model leads to out-of-memory (OOM) errors. Therefore, it is crucial to pre-process the original images to make them suitable for model training. Three pre-processing strategies were considered here: cropping, resizing, and resizing and cropping. The high-quality Lanczos filter from the Pillow package was used for image resizing.
Cropping. A common way to handle large images in semantic segmentation is to crop the original image into equal-sized patches. Then, the image patches are used to train the segmentation model [6]. All the PLM images were cropped into non-overlapping 512 × 512 patches, and the patches smaller than 512 × 512 were filled by adjacent pixels (e.g., a 512 × 768 image was cropped into two 512 × 512 patches with a 512 × 256 overlapping area). For images with a side shorter than 512 pixels, the short side was first resized to 512 while maintaining the aspect ratio. Then, the resized images were cropped into patches. In addition, a larger crop size (512 × 768) was also applied to investigate the effect of the crop size.
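The cropping rule described above can be illustrated with a small geometry sketch. It is only a minimal illustration, assuming that a trailing patch that would cross the image border is shifted back inside the image, which reproduces the 512 × 768 example (two patches sharing 256 columns). The function names `crop_starts` and `crop_boxes` are hypothetical and are not taken from the released code.

```python
def crop_starts(length, patch=512):
    """Start offsets of patches along one image side.

    Patches are laid out without overlap; if the last patch would run past
    the border, it is shifted back so that it ends exactly at the border
    (it then overlaps its neighbour instead of being padded).
    Assumes length >= patch, i.e. smaller images were resized beforehand.
    """
    if length < patch:
        raise ValueError("resize the image first so that length >= patch")
    starts = list(range(0, length - patch + 1, patch))
    if starts[-1] + patch < length:
        starts.append(length - patch)
    return starts


def crop_boxes(height, width, patch_h=512, patch_w=512):
    """(top, left, bottom, right) boxes that together cover the whole image."""
    return [
        (top, left, top + patch_h, left + patch_w)
        for top in crop_starts(height, patch_h)
        for left in crop_starts(width, patch_w)
    ]
```

For a 512 × 768 image, `crop_boxes(512, 768)` yields one patch starting at column 0 and one starting at column 256, so the two patches share a 512 × 256 strip.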
Resizing. Another way was to resize the images to a trainable size. Then, the full images were directly used to train the segmentation model. There were two methods for image resizing: (I) one was to resize the short side of the image to 512, which maintained the original aspect ratio of the image; (II) the other was to resize all the images to a fixed size, which was 512 × 976 or 512 × 3072 according to the minimum aspect ratio or average aspect ratio.
Resizing and Cropping. A strategy that combined resizing and cropping was proposed. First, the short sides of all the PLM images were resized to 512 while maintaining the aspect ratio. Then, the resized images were cropped into 512 × 512 patches. Considering the significant reduction in image size after resizing, overlapping crops with an overlap of half the patch size were used to increase the number of image patches. Similar to the cropping method, a larger crop size (512 × 768) was also applied.
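The geometry of the resizing and cropping strategy can be sketched as follows. This is a minimal sketch that only computes sizes and patch offsets; the actual pixel resampling (Lanczos filter) is not reproduced here. The helper names, the rounding of the resized side, and the handling of a leftover border strip are assumptions of this sketch.

```python
def resize_short_side(height, width, target=512):
    """New (height, width) with the short side equal to `target`,
    keeping the aspect ratio (the long side is rounded to an integer)."""
    scale = target / min(height, width)
    return round(height * scale), round(width * scale)


def overlapping_starts(length, patch=512, stride=None):
    """Patch start offsets along one side.

    By default the stride is half the patch size, so neighbouring
    patches overlap by half a patch.
    """
    stride = patch // 2 if stride is None else stride
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] + patch < length:  # cover the remaining border strip
        starts.append(length - patch)
    return starts
```

For example, a 1024 × 4096 image is resized to 512 × 2048, and 512-wide patches with a stride of 256 start at columns 0, 256, ..., 1536, giving seven patches instead of the four obtained without overlap.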

Model training
In this study, SegFormer was used for the damage segmentation of PLMs. The details of the network architecture can be found in Additional file 1. A series of binary segmentation models were developed, one for each type of damage. As shown in Table 1, the damaged area is very small compared to the non-damage area of PLMs. The pixel percentages of the five damage areas and the non-damage area in the dataset are 1.156%, 0.118%, 0.516%, 4.768%, 0.496%, and 92.946%, respectively. This extreme class imbalance biases the segmentation model toward the non-damage area, which may severely affect its performance. Here, different loss functions were compared to find a suitable way to handle the class imbalance problem.
Cross-entropy loss. Cross-entropy (CE) loss is the most commonly used loss function in semantic segmentation tasks due to its simplicity and effectiveness. It examines each pixel individually by comparing the class prediction with the one-hot coded ground truth label. It is calculated by:

$$L_{CE} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C} y_{n,c}\log(p_{n,c}) \tag{1}$$

where N is the number of pixels, C is the number of classes, $y_{n,c}$ is the one-hot coded ground truth label for the class c, and $p_{n,c}$ is the class prediction.
Weighted cross-entropy loss. Since the cross-entropy loss evaluates the class predictions for each pixel individually and then averages over all pixels, the training procedure can be dominated by the majority class if the classes are imbalanced. A common solution is to turn the standard cross-entropy loss into a weighted cross-entropy (WCE) loss by adding a weighting factor to focus more on minority classes. The WCE loss is defined as:

$$L_{WCE} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C} w_c\, y_{n,c}\log(p_{n,c}) \tag{2}$$

where $w_c$ is the class weight, $f_c$ is the pixel percentage of class c, and λ controls the weighting factor. As λ increases, the weight value of minority classes increases. In this study, λ was selected in the range of [0, 1].
Focal loss. Focal loss [29] was initially used in image classification to solve the class imbalance problem. To reduce the influence of class imbalance, a modulating factor was added to the standard cross-entropy loss function, aiming to put more focus on hard, misclassified pixels. The focal loss is defined as:

$$L_{Focal} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C} (1 - p_{n,c})^{\gamma}\, y_{n,c}\log(p_{n,c}) \tag{3}$$

where γ controls the degree of down-weighting of easy-to-classify pixels. As γ increases, the degree of down-weighting increases. In this study, γ was set to 2 according to the reference [29].
Dice loss. Dice loss is based on the Dice coefficient, which measures the overlap between two data sets and ranges from 0 to 1. A Dice coefficient of 1 denotes complete overlap. In practice, the Dice coefficient was used as the loss function [30] to minimize the non-overlap between the prediction and the ground-truth label. Dice loss can be calculated as follows:

$$L_{Dice} = 1 - \frac{2\sum_{n=1}^{N} y_{n,c}\, p_{n,c}}{\sum_{n=1}^{N} y_{n,c} + \sum_{n=1}^{N} p_{n,c}} \tag{4}$$

where c denotes the damage class.
Combo loss. The training procedure of Dice loss becomes unstable when segmenting a small foreground from a large background. Thus, a combination of CE loss and Dice loss was used to stabilize the training procedure [31]. The combo loss is defined as:

$$L_{Combo} = (1 - \lambda)\, L_{CE} + \lambda\, L_{Dice} \tag{5}$$

where λ ∈ [0, 1] controls the relative contribution of the CE loss and Dice loss terms to the overall loss function.
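To make the relation between these loss functions concrete, the following is a minimal pure-Python sketch for a single binary (damage versus non-damage) model. It is not the training code of the study: it assumes that `probs` holds the predicted damage probability of each pixel and `labels` the 0/1 ground truth, that the Dice term is computed on the damage class only, and that the combo loss is the convex combination (1 − λ)·CE + λ·Dice, which matches the limiting cases stated for Fig. 3. The WCE loss is omitted because it depends on the class-weight definition. The constant `EPS` is a numerical guard introduced for this sketch.

```python
import math

EPS = 1e-7  # numerical guard against log(0) and 0/0 (assumption of this sketch)


def _true_class_prob(p, y):
    """Probability that the model assigns to the ground-truth class of a pixel."""
    p = min(max(p, EPS), 1.0 - EPS)
    return p if y == 1 else 1.0 - p


def ce_loss(probs, labels):
    """Mean cross-entropy over all pixels."""
    total = sum(math.log(_true_class_prob(p, y)) for p, y in zip(probs, labels))
    return -total / len(probs)


def focal_loss(probs, labels, gamma=2.0):
    """Cross-entropy with the modulating factor (1 - p_t) ** gamma."""
    total = 0.0
    for p, y in zip(probs, labels):
        pt = _true_class_prob(p, y)
        total += (1.0 - pt) ** gamma * math.log(pt)
    return -total / len(probs)


def dice_loss(probs, labels):
    """One minus the soft Dice coefficient of the damage class."""
    intersection = sum(p * y for p, y in zip(probs, labels))
    denominator = sum(probs) + sum(labels)
    return 1.0 - (2.0 * intersection + EPS) / (denominator + EPS)


def combo_loss(probs, labels, lam=0.5):
    """(1 - lam) * CE + lam * Dice; lam = 0 gives CE, lam = 1 gives Dice."""
    return (1.0 - lam) * ce_loss(probs, labels) + lam * dice_loss(probs, labels)
```

With `probs = [0.9, 0.2, 0.8, 0.1]` and `labels = [1, 0, 1, 0]`, the focal loss is smaller than the CE loss because every pixel is already classified fairly confidently and is therefore down-weighted.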
Training setting. The MiT-B1 [9] network was used as the segmentation model. The last batch normalization layer was replaced with a group normalization (GN) [32] layer to improve the stability of the training procedure under a small batch size. Data augmentation was performed through random resized cropping with a ratio of 0.8-1.5, random horizontal flipping, and random vertical flipping. The Adan [33] optimizer was used to train the models for 200 epochs with a learning rate of 0.0003 and a weight decay of 0.01. The batch size was set to 8. A cosine decay learning rate scheduler was used with a linear warm-up for 10 epochs.
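The learning rate schedule described above can be sketched as a function of the epoch index. This is a minimal sketch under stated assumptions: the rate is updated once per epoch, the warm-up rises linearly from one tenth of the base rate to the base rate, and the cosine decay ends at `min_lr = 0`. The study does not specify these details, and the function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(epoch, base_lr=3e-4, warmup_epochs=10, total_epochs=200, min_lr=0.0):
    """Learning rate used during a (0-indexed) epoch.

    Linear warm-up for `warmup_epochs` epochs, then cosine decay from
    `base_lr` towards `min_lr` over the remaining epochs.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate reaches 0.0003 at the end of the warm-up, falls to half of that value midway through the decay (epoch 105), and approaches zero at epoch 200.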

Prediction
In this study, test-time augmentation (TTA) and test-time local converter (TLC) were used in the inference phase for better performance. TTA duplicated and mirrored the input image along the horizontal, vertical, and diagonal axes. Given an image as input, four augmented images were obtained. Then, the predicted results of the augmented images were averaged as the final result. For patch-based training, it was common to use the full image directly as input during prediction. However, the inconsistency in information distribution between patch-based training and full-image-based prediction led to performance degradation. TLC [34] was proposed to solve this problem by converting the spatial information aggregation operation from global to local. Before the global operation, TLC divided the feature map into patches in the spatial dimension. Each feature patch was processed separately, and then the feature patches were stitched back together according to their original spatial positions. In SegFormer, the self-attention layer is a spatially global operation, and TLC was applied to this operation during prediction.
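The TTA step can be sketched with plain nested lists standing in for images and probability maps. This is an illustrative sketch, not the inference code of the study: it assumes that the four inputs are the original image, its horizontal mirror, its vertical mirror, and the image mirrored along both axes (used here as the "diagonal" case), and that each prediction is mapped back to the original orientation before averaging. `predict` is a hypothetical callable that returns a per-pixel damage probability map of the same size as its input.

```python
def hflip(img):
    """Mirror a 2D list left-right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror a 2D list top-bottom."""
    return img[::-1]


def tta_predict(predict, image):
    """Average the predictions over four mirrored versions of the image."""
    # Each entry is (forward transform, inverse transform); flips are
    # their own inverses.
    transforms = [
        (lambda x: x, lambda x: x),
        (hflip, hflip),
        (vflip, vflip),
        (lambda x: hflip(vflip(x)), lambda x: vflip(hflip(x))),
    ]
    height, width = len(image), len(image[0])
    total = [[0.0] * width for _ in range(height)]
    for forward, inverse in transforms:
        pred = inverse(predict(forward(image)))  # back to original orientation
        for i in range(height):
            for j in range(width):
                total[i][j] += pred[i][j]
    return [[value / len(transforms) for value in row] for row in total]
```

A model that is perfectly equivariant to mirroring gives the same output with and without TTA; the averaging only helps when the predictions for the mirrored inputs differ.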

Post-processing
Due to the complexity of the damages and the limited performance of the segmentation models, the damage segmentation results usually had some misclassified regions, such as small noise regions and discontinuous regions. Therefore, post-processing was used in the inference phase to alleviate these problems. First, regions with an area smaller than a given threshold were considered noise and were removed. Then, a morphological closing operation was applied to connect adjacent areas. Finally, the small holes in the connected regions were filled.
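Two of the three post-processing steps, small-region removal and hole filling, can be sketched in pure Python with a connected-component search. This is a minimal sketch on 0/1 masks stored as nested lists, assuming 4-connectivity and user-chosen area thresholds (`min_area`, `max_area`); the thresholds and connectivity used in the study are not given here. The morphological closing step is omitted for brevity and would normally come from an image-processing library.

```python
from collections import deque


def connected_components(mask, value):
    """4-connected components of pixels equal to `value` in a 2D 0/1 mask."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for i in range(height):
        for j in range(width):
            if mask[i][j] != value or seen[i][j]:
                continue
            queue, pixels = deque([(i, j)]), []
            seen[i][j] = True
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] == value and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_small_regions(mask, min_area):
    """Set damage regions with fewer than `min_area` pixels to background."""
    out = [row[:] for row in mask]
    for pixels in connected_components(mask, 1):
        if len(pixels) < min_area:
            for y, x in pixels:
                out[y][x] = 0
    return out


def fill_small_holes(mask, max_area):
    """Fill background components enclosed by damage (not touching the
    image border) that contain at most `max_area` pixels."""
    height, width = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for pixels in connected_components(mask, 0):
        on_border = any(y in (0, height - 1) or x in (0, width - 1) for y, x in pixels)
        if not on_border and len(pixels) <= max_area:
            for y, x in pixels:
                out[y][x] = 1
    return out
```

Applied in sequence, the two functions delete isolated specks and then close pinholes inside the remaining regions, which is also why some fine detail of the raw prediction is lost.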

Evaluation metrics
Two evaluation metrics were used in PLM damage segmentation for qualitative discovery and quantitative evaluation, respectively. In damage segmentation, finding the location of the damage region is as important as the precise segmentation of the damage regions. A qualitative discovery metric named hit area ratio (Hit) was proposed to evaluate the ability of damage region localization. If the recall value between a ground-truth region and the predicted result is greater than 0.5, this region is regarded as a hit region. Hit is the ratio of the total area of hit regions to the total area of the ground truth, and mHit is the mean value of Hit over all damages. Hit and mHit are defined as:

$$\mathrm{Hit} = \frac{1}{S}\sum_{r=1}^{R} S_{region}^{(r)}\,\mathbb{1}\left[\mathrm{Recall}_r > 0.5\right], \qquad \mathrm{mHit} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{Hit}_c \tag{6}$$

where C is the number of damages, $S_{region}^{(r)}$ is the area of the r-th ground-truth region, S is the total area of the ground-truth regions, and R is the total number of regions in the ground truth.
Intersection-over-Union (IoU) and mean IoU (mIoU) were used as the precise evaluation metrics. IoU is one of the most widely used metrics in semantic segmentation. For a specific damage, it is the intersection of the ground truth and the class prediction divided by their union. The mIoU is the mean value of IoU over all damages. IoU and mIoU are defined as:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{IoU}_c \tag{7}$$

where C is the number of damages. TP, FP, TN, and FN represent the number of true positives, false positives, true negatives, and false negatives, respectively.

Table 2
Comparison of different pre-processing methods on the PLM validation set. "512 × w" indicates that the short side of the image is resized to 512, and the aspect ratio of the image is maintained. "INC", "BRE", "FIB", "CON", and "IMP" indicate incompleteness, break, fiber delamination and warping, contamination, and improper restoration, respectively. The CE loss was the default setting when comparing the different pre-processing methods.
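The Hit and IoU metrics of this section can be sketched on sets of pixel coordinates. This is a minimal sketch for one damage type: `pred` is the set of pixels predicted as damage, and `truth_regions` is a list of non-empty pixel sets, one per annotated ground-truth region. These data structures, the function names, and the convention of returning 0.0 for an empty union are assumptions of the sketch.

```python
def iou(pred, truth):
    """Intersection-over-Union of two pixel sets, TP / (TP + FP + FN)."""
    union = pred | truth
    if not union:
        return 0.0  # convention of this sketch for an empty prediction and truth
    return len(pred & truth) / len(union)


def hit(pred, truth_regions, threshold=0.5):
    """Area of ground-truth regions whose recall exceeds `threshold`,
    divided by the total ground-truth area."""
    total_area = sum(len(region) for region in truth_regions)
    hit_area = sum(
        len(region)
        for region in truth_regions
        if len(region & pred) / len(region) > threshold
    )
    return hit_area / total_area


def mean(values):
    """Arithmetic mean, used for mHit and mIoU over the damage types."""
    values = list(values)
    return sum(values) / len(values)
```

As a small example, take a four-pixel region of which three pixels are predicted and a two-pixel region of which one pixel is predicted. Only the first region counts as a hit (recall 0.75 versus exactly 0.5), so Hit is 4/6, while IoU penalises both the missed pixels and any false positives.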

Optimization of pre-processing method and loss function in the training phase
IoU and mIoU on the PLM validation set were used to compare different pre-processing methods and loss functions in the training procedure. When comparing the pre-processing methods, CE loss was used in all experiments.
The results of the different pre-processing methods are presented in Table 2. Regarding overall damage segmentation results, the resizing and cropping method with a crop size of 512 × 768 outperforms the other pre-processing methods by at least 3.8% mIoU. Compared to the cropping methods, the image patches obtained by the resizing and cropping method carry more global information about the whole image at the expense of fine, local information. The importance of global information for damage segmentation is also illustrated by the fact that the model performance can be improved by increasing the size of the image patches. Although the resizing methods preserve the most global information about the images, the small number of images in the training set (206) makes it difficult for models to learn damage features effectively. Furthermore, resizing the images to a fixed size yields poor mIoU because the images are distorted and their original structures are lost.
In terms of individual damages, break needs local and fine prediction because it often appears as elongated strips. Thus, the cropping method yields a 1.1% IoU improvement for break compared to the other pre-processing methods. However, global information is more important for the other four damage categories. Based on these results, the resizing and cropping method with a crop size of 512 × 768 is more suitable as the pre-processing method for PLM damage segmentation and is used in subsequent experiments.
When the crop size is expanded from 512 × 512 to 512 × 768, the mIoU of the resizing and cropping method improves substantially, by 4.3%. In contrast, the results of the resizing method do not improve and even become worse. The reason for this phenomenon is unclear and will be investigated in future work.
After determining the pre-processing method, different loss functions were compared to deal with the class imbalance problem. As shown in Table 3, combo loss achieves the highest mIoU of 54.1% and outperforms the CE loss in all damage classes. Compared to the CE loss, Focal loss achieves a 4.9% IoU improvement for break but gets slightly worse results for the other damages. Notably, Focal loss underperforms CE loss by 7.5% IoU for fiber delamination and warping. Dice loss performs worse than CE loss, which indicates that it is unsuitable for handling the severe class imbalance problem.
The effect of the hyperparameter values on both WCE loss and combo loss was investigated. For WCE loss and combo loss on the five damages, the optimal values of λ were [0.2, 0.1, 0.1, 0.5, 0.1] and [0.6, 0.5, 0.5, 0.9, 0.1], respectively. As shown in Fig. 3, WCE loss is more sensitive to the choice of hyperparameter than combo loss. For WCE loss, as λ increases, the IoU of each damage first increases and then decreases. This decrease may be caused by giving too large a weight to the damage class, resulting in many false positive regions. Notably, for combo loss, even when given a small weight (0.1), the CE loss can substantially improve the stability of Dice loss. Thus, combo loss was used as the default loss function.

Impact of inference phase strategy
Experiments were conducted to optimize the inference phase with several strategies, including TTA, TLC, and post-processing. IoU, mIoU, Hit, and mHit were used to evaluate the inference phase strategies. Since the inference phase takes significantly less time than the training phase, different combinations of inference phase strategies can be selected for different damage types through multiple rounds of experiments. As shown in Table 4, TTA slightly improves all damage segmentation results except improper restoration, resulting in 0.1% higher mHit and 0.4% higher mIoU than the baseline (no extra strategies). TLC shows performance gains for three damages but performance declines for the other two, resulting in a final performance slightly below the baseline. The combination of TTA and TLC brings a performance boost with a gain of 1.3% mHit and 0.8% mIoU. This combination significantly improves the performance for incompleteness by 2.8% Hit and 2.3% IoU, demonstrating an excellent synergistic effect between TTA and TLC. Integrating the post-processing methods achieves performance gains for incompleteness, contamination, and improper restoration but reduces the IoU for break and for fiber delamination and warping.
Two examples before and after post-processing are shown in Fig. 4. The post-processing method removes small noise regions, connects some discontinuous regions, and improves the visual quality of the segmentation results. However, it also removes some details of the prediction results.

Results of PLM damage identification
The inference phase strategy chosen for each damage, selected mainly according to IoU, is underlined in Table 4. To illustrate the effectiveness of the PLM-SegFormer framework, the SegFormer baseline was set using the cropping method with a crop size of 512 × 768 for data pre-processing and CE loss for model training. The results in Fig. 5 show that the PLM-SegFormer framework achieves consistent improvements over the SegFormer baseline on the five damages, especially for incompleteness, which receives a substantial improvement of 16.6% Hit and 12.1% IoU (Additional file 1: Table S1). Furthermore, the PLM-SegFormer framework brings a performance boost with a gain of 10.4% mHit and 5.9% mIoU, which shows that the PLM-SegFormer framework can be well adapted to the damage segmentation of PLMs.
As a result, the PLM-SegFormer framework reaches 70.1% mHit and 51.2% mIoU on the PLM test set, indicating that the model can be used for damage identification of Sanskrit PLMs in Tibet.

The impact of each damage's characteristics on the performance
In this section, the impact of the characteristics of each type of damage and its segmentation results (Fig. 6) are discussed.
(a) Incompleteness: Incompleteness often appears in the edge area of PLMs. When incompleteness occurs along the length of the PLM, it tends to form a long-distance damage area. Thus, enough global information should be obtained for the segmentation of incompleteness. Moreover, since the boundary of the complete PLM is manually labeled, it is difficult for the segmentation model to accurately judge the boundary of the "imaginary" complete PLM. The difficulty of determining where incompleteness occurs leads to a relatively lower Hit.
(b) Break: Break is shown as the horizontal or longitudinal fractures or cracks formed along the texture of

Automatic damage segmentation on a large number of Sanskrit PLMs
From 2021 to 2022, 10,064 digital images of Sanskrit PLMs were collected to investigate the damage distribution information in Tibet. The resizing and cropping strategy was used as the pre-processing method. The short sides of all PLM images were resized to 512 while maintaining the aspect ratio. Thus, the aspect ratio had the greatest impact on the inference speed of the segmentation model. The minimum, maximum, and average aspect ratios of the PLM images were 1.89, 11.48, and 6.84, respectively. The damage segmentation task was implemented on the PyTorch platform using a workstation with an i9-10900X CPU, 32 GB RAM, and an NVIDIA RTX 3080 10 GB GPU. The developed PLM-SegFormer framework can complete the damage segmentation of 10,064 PLM images within 12 h, significantly reducing the time required for investigating the damage information of the Sanskrit PLMs. In addition, the minimum, maximum, and average time costs for segmenting one image were 1.87 s, 6.44 s, and 4.08 s, respectively.
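The reported total time is consistent with the reported average time per image, as a short arithmetic check shows. The sketch below only multiplies the two figures quoted above; the function name `total_hours` is hypothetical.

```python
def total_hours(num_images, seconds_per_image):
    """Total processing time in hours for a given per-image cost."""
    return num_images * seconds_per_image / 3600.0


# 10,064 images at the reported average of 4.08 s per image.
estimated = total_hours(10_064, 4.08)
```

The estimate is about 11.4 h, which agrees with the statement that all 10,064 images were processed within 12 h.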
The level of automation can be seen in Additional file 2: Video S1. The entire damage segmentation process is fully automated and does not require any human intervention.
After the segmentation, the distribution of each type of damage in the PLMs can be summarized. The results (Fig. 7) show that, among all the damages, the image count and pixel percentage of contamination are the highest, while those of improper restoration are the lowest. The number of non-damage PLMs is only 73, indicating that the existing Sanskrit PLMs in the Potala Palace are seriously affected by various damages. However, the pixel percentage of all damages is low, only 2.9%, which indicates that the degree of damage of the PLMs is not high. Therefore, the preservation and restoration of Sanskrit PLMs should be carried out in time to prevent the damages from deteriorating. The results of damage segmentation will facilitate the survey and recording of the preservation state of PLMs, which is of great value to subsequent preservation and restoration.

Conclusion
In this study, a damage segmentation dataset for Sanskrit PLMs was created, and the PLM-SegFormer framework was proposed for damage identification of the Sanskrit PLMs. The presented PLM dataset annotates five frequent damages: incompleteness, break, fiber delamination and warping, contamination, and improper restoration. The PLM-SegFormer framework builds upon the SegFormer architecture and adapts it to damage segmentation of Sanskrit PLMs by optimizing the overall workflow, including pre-processing, model training, prediction, and post-processing. The experimental results show that the resizing and cropping method for pre-processing and the combo loss for model training are suitable for dealing with the inconsistent size problem and the class imbalance problem in the PLM dataset, respectively. The combination of TTA, TLC, and post-processing methods in the inference phase further boosts the performance of the damage segmentation models, which reach 70.1% mHit and 51.2% mIoU. The developed PLM-SegFormer framework can complete the damage segmentation of 10,064 PLM pages within 12 h, significantly reducing the time required for investigating the damage information of the Sanskrit PLMs. The proposed method will facilitate the survey and recording of the preservation state of PLMs and be of great value to subsequent preservation and restoration.

Limits and outlook
The most significant barrier to PLM damage semantic segmentation is the lack of accurate ground truth for the labeled damages. There are two reasons for this. First, it is hard to decide the boundary or the category of damages. The boundaries of some damages are blurred, and some damages overlap, which leads to inaccurate damage annotation. Second, the annotation is labor-intensive and time-consuming, so the number of carefully labeled PLM images is very small. In future work, weakly supervised and self-supervised learning methods could be used to handle the noisy label problem and to leverage a large amount of unlabeled data.

Fig. 1 The Sanskrit palm leaf manuscript in Tibet and five representative damages. a An example digital image of the Sanskrit PLM, b incompleteness, c break, d fiber delamination and warping, e contamination, f improper restoration. For each pair, the PLM image is shown at the top, and the corresponding manual annotation is at the bottom

Figure 2 is the flowchart of the PLM-SegFormer framework. The framework includes two parts: the training phase and the inference phase. It consists of five steps: data collection and labeling, pre-processing, model training, prediction, and post-processing.

Fig. 2 The flowchart of the PLM-SegFormer framework. a The PLM dataset is established by digital camera acquisition and manual annotation. It is subsequently divided into the training set, validation set, and test set. Then, b various pre-processing methods and c loss functions are compared to find the best way to build the damage segmentation models. Finally, d test-time enhancement methods and e post-processing methods are used to optimize the prediction results in the inference phase

Fig. 3 Results of different λ in combo loss and WCE loss on the PLM validation set. a Incompleteness; b break; c fiber delamination and warping; d contamination; e improper restoration; f mean IoU of the five damages. When λ = 0, the combo loss and WCE loss simplify to the CE loss; when λ = 1, the combo loss is equal to the Dice loss. The results show that WCE loss is more sensitive to the choice of hyperparameter λ than combo loss

Fig. 4 Examples of differences in damage segmentation results before and after post-processing. The post-processing methods can deal with the small noise regions and discontinuous regions, but they also remove some details of the prediction results

Fig. 5 Comparison of the performance of the PLM-SegFormer framework and the SegFormer baseline on the PLM test set. The SegFormer baseline models were trained with the cropping method with a crop size of 512 × 768 and the CE loss. The proposed PLM-SegFormer framework combines the resizing and cropping method for pre-processing, combo loss for training, and optimized post-processing methods with SegFormer models

Fig. 6 Visual comparisons of annotation and segmentation results of the representative samples regarding five types of damages in the PLM test set

Table 1
The image counts and pixel percentages of each damage and non-damage in the PLM dataset. Best scores are highlighted in bold

Table 3
Comparison of different loss functions on the PLM validation set. For comparing loss functions, the resizing and cropping method with a crop size of 512 × 768 was used in all experiments. Best scores are highlighted in bold

Table 4
Comparison of different inference phase strategies on the PLM validation set. "Post" indicates the post-processing method. The best scores are highlighted in bold, and the inference phase strategy used for each kind of damage is underlined