STEF: a Swin Transformer-Based Enhanced Feature Pyramid Fusion Model for Dongba character detection

The Dongba manuscripts are a unique primitive pictographic writing system that originated among the Naxi people of Lijiang, China, boasting over a thousand years of history. The uniqueness of the Dongba manuscripts stems from their pronounced pictorial and ideographic characteristics. However, the digital preservation and inheritance of Dongba manuscripts face multiple challenges, including extracting their rich semantic information, recognizing individual characters, retrieving Dongba manuscripts, and automatically interpreting their meanings. Developing efficient Dongba character detection technology has become a key research focus, wherein establishing a standardized Dongba detection dataset is crucial for training and evaluating detection techniques. In this study, we created a comprehensive Dongba manuscript detection dataset covering a wide range of commonly used Dongba characters and vocabulary. Additionally, we propose a model named STEF. First, the Swin Transformer extracts features of the complex structures and diverse shapes of the Dongba manuscripts. Then, a Feature Pyramid Enhancement Module cascades features of different sizes to preserve multi-scale information. Subsequently, all features are fused in a FUSION module, yielding features that capture various Dongba manuscript styles. Each pixel's binarisation threshold is dynamically adjusted through a differentiable binarisation operation, accurately distinguishing foreground Dongba characters from the background. Lastly, deformable convolution is introduced, allowing the model to dynamically adjust the convolution kernel's size and shape based on the size of the Dongba characters, thereby better capturing the detailed information of characters of different sizes. Experimental results show that STEF achieves a precision of 88.88%, a recall of 88.65%, and an F-measure of 88.76%, outperforming other text detection algorithms.
Visualization experiments demonstrate that STEF performs well in detecting Dongba manuscripts of various sizes, shapes, and styles, especially in cases of blurred handwriting and complex backgrounds.


Introduction
The Dongba manuscripts are a unique writing system of the Naxi minority in Yunnan, China. They are the world's only surviving systematized pictographic script, with a history spanning over a thousand years [1]. Its origins can be traced back to the 9th century, when Naxi society began to flourish. The Naxi ancestors created this distinctive script to record folk tales, historical events, and religious beliefs. Dongba manuscripts are profoundly meaningful, depicting the history, culture, and daily life of the Naxi people through vivid graphics and lines [2]. In 2003, the ancient Naxi Dongba manuscripts were listed as "Memory of the World" by UNESCO, highlighting their precious value as cultural heritage [3]. Despite the esteemed status of these ancient manuscripts, their accurate translation is challenging for non-professionals, who lack in-depth Dongba-related expertise. Therefore, digital analysis of Dongba manuscripts, such as individual character recognition, manuscript retrieval, and machine translation research, is particularly crucial. This not only aids the preservation and in-depth study of Dongba culture but also plays a vital role in the urgent protection of Dongba manuscripts. In this process, accurate detection of Dongba characters is the first step: its precision directly impacts recognition, retrieval, and translation effectiveness, making improvements in detection accuracy necessary.
The distinction between ancient Dongba manuscripts and contemporary books is pronounced. In Dongba manuscripts, the main text typically appears in three lines, each delineated by horizontal lines and subdivided into multiple independent sentences by single or double vertical lines. The form of Dongba manuscripts is notably varied, with an extensively dispersed distribution [4]. Some manuscripts feature horizontal and vertical dividing lines and are embellished with external borders or intricate patterns [5]. Due to their antiquity, certain manuscripts exhibit stains. Figure 1 illustrates an image of an ancient Dongba manuscript. While traditional text detection methodologies, such as region-based, connected-component-based, and texture-based approaches, yield satisfactory results in simple or standard text environments, they often fall short in accurately identifying text regions within the complex and diverse structures characteristic of Dongba characters. Consequently, there is an urgent need for a text detection algorithm tailored to the unique features of Dongba characters. In recent years, the rapid advancement of deep learning has significantly enhanced text detection algorithms. By training deep neural network models, these algorithms learn and comprehend text features autonomously, enabling efficient and precise text detection. Considering the complex and diverse structures of Dongba characters, deep learning-based text detection algorithms emerge as an optimal solution, offering increased adaptability to the inherent diversity and complexity of the text.
However, most text detection models have been trained on large-scale, standard text datasets, primarily consisting of text arranged in regular lines. The performance of these models in detecting Dongba script is compromised by several issues: (1) incomplete detection, as the intricate structure of Dongba characters often leads to partial recognition (refer to Fig. 2a); (2) incorrect merging, with the models failing to distinguish between closely spaced individual Dongba characters, treating them as a single entity (refer to Fig. 2b); (3) misidentification of non-text elements, where elements such as decorative frames, dividing lines, and stains are incorrectly recognized as textual content (refer to Fig. 2b, c); and (4) undetected text, arising from the distinctive layout or style of Dongba manuscripts, which results in some characters being overlooked (refer to Fig. 2a, d). The primary challenges can be summarized as follows: (1) the lack of a large-scale public dataset of Dongba manuscript images for model training; (2) the unique writing system and character features of Dongba manuscripts, which increase detection difficulty; (3) the dispersed distribution of character types, varying glyph sizes, and the presence of non-text elements like borders and decorative patterns, which further complicate text detection; (4) the fact that Dongba manuscripts were written by many individuals, leading to significant variations in writing style, which complicates text standardization and normalization; and (5) the considerable age of the manuscripts, which has left numerous images of Dongba script adversely affected by stains and water damage, partially obscuring the script and compounding the complexity of text processing. To address these issues, we propose the STEF model, designed specifically for the automatic detection of Dongba manuscripts. Compared to traditional text detection methods, the STEF model offers higher detection accuracy. The principal contributions of this paper are as follows:

1. We have established a dataset for Dongba script detection, named DBD400 (Dongba Detection), which comprises 400 images of ancient Dongba manuscripts encompassing 24,266 handwritten Dongba characters. The DBD400 dataset serves as a benchmark platform for the comparative assessment of Dongba script detection.

2. We introduce STEF (a Swin Transformer-Based Enhanced Feature Pyramid Fusion Model), a novel automated detection model designed explicitly for Dongba manuscripts. Utilizing the Swin Transformer as its backbone network, STEF captures text features across various scales to accommodate the complexity and diverse shapes of Dongba characters. Moreover, we designed a feature pyramid enhancement fusion module that integrates Dongba script feature maps from different levels layer by layer, thereby augmenting the capability to capture the nuances of Dongba characters. Additionally, through differentiable binarization, the model dynamically adjusts the binarization threshold for each pixel in the image, facilitating precise differentiation between the foreground Dongba script and the background. Finally, the model employs deformable convolution, increasing its adaptability to the various shapes and layouts in Dongba manuscripts.
3. Extensive experiments on our DBD400 dataset demonstrate the effectiveness of STEF in Dongba manuscript detection. Furthermore, the precise detection capabilities of STEF on Dongba manuscripts are illustrated through visualization.

Text detection
Early text detection algorithms relied on traditional image processing techniques and machine learning methods, primarily targeting horizontal text lines in simple scenes using edge detection and projection methods for text localization [6,7]. Researchers adopted methods based on sliding windows and machine learning classifiers such as Support Vector Machines (SVMs) [8] to address multi-directional and irregular text in natural scenes. With the increase in scene complexity, more advanced methods emerged, such as those based on the Stroke Width Transform (SWT) [9], which are particularly effective under complex backgrounds and lighting conditions. These methods distinguish text from non-text elements by analyzing the consistency of character stroke widths. Another important class of methods is based on Maximally Stable Extremal Regions (MSER) [10]. These algorithms identify regions with stable, similar colors or grayscale levels across different scales as text. While these traditional methods perform well in simple or controlled environments, they are limited in more complex natural scenes due to a lack of adaptability to text shapes and arrangements.
With the rise of deep learning, methods based on convolutional neural networks (CNNs) have become dominant, significantly improving text detection effectiveness. These methods are mainly divided into two categories: regression-based methods and segmentation-based methods.

Regression-based methods
Regression-based text detection methods primarily locate text by predicting bounding boxes, using deep learning models such as CNNs or RNNs for training. During prediction, these models process images, generate candidate boxes, and determine text positions based on confidence scores. For instance, the TextBoxes [11] model optimizes anchor boxes and position regression to enhance detection speed and accuracy. TextBoxes++ [12] further improves the detection of multi-oriented text, albeit facing challenges in complex backgrounds. RRD [13] effectively handles multi-directional long text through rotation-invariant feature classification and rotation-sensitive feature regression. DeRPN [14] introduces dimension-decomposition region proposal networks to address scale variation issues.

Segmentation-based methods
Segmentation-based text detection methods precisely recognize text positions and shapes through pixel-level image segmentation. The Mask TextSpotter series has contributed significantly to scene text detection: V1 [15] achieves end-to-end training by applying Mask R-CNN to text detection but has limitations in text sequence recognition; V2 [16] improves sequence recognition by incorporating the attention-based SAM; V3 [17] replaces the RPN with an SPN to enhance detection accuracy. TextSnake [18] optimizes the detection of curved text but is limited in handling extreme-sized text. Hence, PSENet [19] was introduced, which enhances the detection accuracy for texts of different sizes using a progressive scale expansion algorithm. PAN [20] further improves the detection accuracy of multi-scale texts through pyramid attention mechanisms. DBNet [21] introduces a differentiable binarization module, enhancing detection accuracy and robustness, particularly for complex backgrounds and irregularly shaped texts. DBNet++ [22] further optimizes the algorithm and network structure, improving precision and efficiency and marking continued progress in text detection technology.

Ancient text detection
Ancient text detection focuses on identifying characters within textual materials passed down from ancient times. Dongba script, with its long history, is regarded as one of the most significant ancient textual forms. Due to prolonged transmission, many ancient texts have become blurred, and their significant differences from modern texts further increase the difficulty of ancient text detection. In response, some researchers have explored ancient text detection extensively [23]. Garz et al. [24] developed a method for detecting text regions and decorative elements in manuscripts using Scale-Invariant Feature Transform (SIFT) descriptors for effective localization, albeit with limitations in detecting decorative elements. Subsequently, Asi et al. [25] proposed a non-learning-based coarse-to-fine analysis method for the layout analysis of ancient manuscripts. They detected significant text areas in manuscripts using texture-based filters and graph-cut algorithms. The work of Roman-Rangel and Marchand-Maillet [26] shifted focus to detecting Maya hieroglyphs, a highly visually complex writing system. They introduced a weighted bag-of-visual-words representation to enhance the performance of the visual bag-of-words in detection. In the domain of Yi script detection, Chen et al. [27] proposed a novel Yi character detection method combining Maximally Stable Extremal Regions (MSER) and Convolutional Neural Networks (CNNs), addressing precise character detection in ancient Yi script recognition, especially in documents with complex layout structures and mixed text and images. To preserve and inherit the culture of the Shui script, Tang et al. [28] established a dataset of handwritten Shui scripts and applied the Faster R-CNN algorithm to ancient Shui text recognition. This study addressed challenges in dataset establishment and sample imbalance, achieving precise page-level, end-to-end positioning and recognition of ancient Shui texts and further expanding the scope of handwritten ancient text detection. Recently, Xu et al. [29] made progress in Tibetan text detection by using relational reasoning graph networks to improve the detection of Tibetan texts of arbitrary shapes. These studies tackle specific technical challenges and complement each other, collectively driving the advancement of ancient text detection.
In the field of Dongba script detection, Xing et al. [30] proposed a multi-scale hybrid attention network based on YOLOv5s for sentence-level detection of Dongba manuscripts. Although they made some progress at this level, their method cannot yet effectively detect individual Dongba characters. Wang [31] adopted the EAST model to enhance the efficiency of feature extraction and fusion in images. Through numerical experiments comparing template matching, support vector machines, and deep learning methods, they confirmed that deep learning exhibits superior performance in Dongba script detection. Despite significant technological progress, existing script detection algorithms have yet to fully account for the distinct attributes of Dongba manuscripts, including their intricate structure, diverse font styles, variable font sizes, ink intensity and legibility challenges, and noisy backgrounds. This oversight results in incomplete detections and omissions. Therefore, detection networks must be developed to suit the specific features of Dongba manuscripts.

Materials
The ancient Dongba manuscripts
The ancient Dongba manuscripts are religious manuscripts employed within the indigenous religion of the Naxi ethnic group [32]. There are two predominant styles of ancient Dongba manuscripts [32]. The first type is rectangular, bound on the left side with hemp twine, measuring approximately thirty centimeters in length and nine centimeters in width, as depicted in Fig. 3a. The dataset in our study consists of manuscripts of this particular style. The second type is square-shaped, bound at the top, and typically used for divination texts, as illustrated in Fig. 3b. In the creation of ancient Dongba manuscripts, the initial step involves outlining with a bamboo pen, followed by the application of traditional pigments based on minerals, plants, and animals, primarily for decorative purposes, as illustrated in Fig. 4 [33]. In ancient Dongba manuscripts, each page is typically divided into three lines, with each line comprising about two to three straight segments. The writing generally adheres to a left-to-right and top-to-bottom orientation. However, the handwriting in these manuscripts is characterized by considerable irregularity and a significant degree of arbitrariness: despite the basic directional flow from left to right and from top to bottom, the placement of individual characters within each line can vary considerably.

Dataset
The original data for this study were sourced from the website of the Harvard-Yenching Library at Harvard University [34]. As illustrated in Fig. 5, the data exhibit the unique characteristics of ancient Dongba manuscripts, including variations in background and in the ink intensity of the text, differences in font sizes, diverse handwriting styles, uneven illumination, and text distortion. For this research, we created a dataset comprising 400 images of these ancient manuscripts, designated DBD400 (Dongba Detection). All images are stored in JPEG format, with resolutions ranging from a minimum of 1200 × 431 pixels to a maximum of 1201 × 530 pixels. In total, the dataset contains 24,266 handwritten Dongba characters.

Dataset annotation
We used the PPOCRLabel tool [35] to precisely annotate text regions within the images of the Dongba ancient manuscript dataset. This study employed two forms of annotation for processing the Dongba manuscripts. For Dongba characters that are uniform and regular in shape, we used a four-point rectangular annotation method. For irregular and complex characters, a multi-point annotation approach was adopted, as illustrated in Fig. 6.

Overview of the structure
The quality of the images in the ancient Dongba manuscripts is relatively low, featuring complex text structures, varying font sizes, and diverse handwriting styles. To address these challenges effectively, we designed the STEF model, whose architecture is detailed in Fig. 7. The model utilizes the Swin Transformer to extract complex features from the Dongba manuscripts. The Feature Pyramid Enhancement Fusion (FPEF) module efficiently aggregates detailed and global information by progressively merging shallow, large-scale feature maps with deep, small-scale ones. Additionally, the FUSION module within the FPEF is engineered to amalgamate features from different levels, enhancing the model's feature expression capability. In post-processing, we employ differentiable binarization, dynamically adjusting each pixel's binarization threshold to more accurately distinguish between the foreground Dongba manuscript and the background. Furthermore, to better accommodate the irregular shapes of Dongba characters, deformable convolutions are introduced in place of traditional convolutions, enhancing the model's adaptability to irregular text. Finally, the model is jointly trained using a combination of probability map, threshold map, and binary map loss functions. The following sections briefly introduce the Swin Transformer, the FPEF module, differentiable binarization, and the loss functions.

Fig. 7 The structure of the STEF model

Backbone architecture
The backbone architecture is designed to extract representative features from the raw data, providing adequate inputs for subsequent tasks. Given the complex structure and diverse styles of Dongba script, traditional Convolutional Neural Networks (CNNs) [36–40] struggle to fully capture the subtle differences and rich semantic layers between Dongba characters. This limitation is primarily due to CNNs' tendency to focus on local features when processing two-dimensional images, often overlooking the overall semantics. Consequently, our research utilizes the Swin Transformer [41] as the backbone architecture. The Swin Transformer combines the self-attention mechanism of the Transformer with the local perception capabilities of convolutional networks. This allows for better handling of the complex structures and detailed features of the Dongba manuscripts, offering a richer and more accurate feature representation.
Two consecutive Swin Transformer blocks are computed as:

$$\hat{z}^{l} = \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1}$$
$$z^{l} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}$$
$$\hat{z}^{l+1} = \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l}$$
$$z^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1}$$

where $\hat{z}^{l}$ and $z^{l}$ represent the outputs of the (S)W-MSA module and the MLP module in the $l$-th layer, respectively, and LN denotes layer normalization.
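To make the residual ordering of the two consecutive blocks concrete, the following is a minimal NumPy sketch. The windowed attention modules and the MLP are replaced by caller-supplied placeholder functions (here identities), so only the LN-then-residual structure of the equations is illustrated, not real (shifted) window attention.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    # Normalize over the channel (last) dimension, as LN does.
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def swin_block_pair(z, w_msa, sw_msa, mlp):
    """Residual ordering of two consecutive Swin blocks:
    z_hat^l     = W-MSA(LN(z^{l-1})) + z^{l-1}
    z^l         = MLP(LN(z_hat^l)) + z_hat^l
    z_hat^{l+1} = SW-MSA(LN(z^l)) + z^l
    z^{l+1}     = MLP(LN(z_hat^{l+1})) + z_hat^{l+1}
    """
    z_hat = w_msa(layer_norm(z)) + z
    z = mlp(layer_norm(z_hat)) + z_hat
    z_hat = sw_msa(layer_norm(z)) + z
    return mlp(layer_norm(z_hat)) + z_hat

# Toy stand-ins: identity "attention" and identity "MLP".
identity = lambda t: t
tokens = np.random.randn(16, 32)  # 16 tokens, 32 channels
out = swin_block_pair(tokens, identity, identity, identity)
print(out.shape)  # (16, 32): token count and channels are preserved
```

In the real model, `w_msa` and `sw_msa` would be window-based multi-head self-attention with a shifted partition in the second block.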

Feature pyramid enhancement fusion module
Feature fusion is a critical method for enhancing detection performance. During fusion, Dongba manuscript images contain a significant amount of cluttered information, leading to feature redundancy and decreasing the effectiveness of Dongba manuscript detection. Moreover, the varying receptive fields of multi-scale text feature maps result in differences among the feature information. Consequently, directly fusing multi-scale features may introduce problems such as confused localization or detection errors. Inspired by [42], we designed the feature pyramid enhancement and fusion module, comprising two components: FPEM and FUSION.
FPEM is a U-shaped structure, as depicted in Fig. 7b. It consists of two stages: up-scale enhancement and down-scale enhancement. Up-scale enhancement aggregates the lower-level Dongba manuscript features output by the Swin Transformer into the higher-level features (F2, F3, F4, C5) in a bottom-up manner, enriching higher-level features with more detailed information. This is achieved through upsampling and fusion operations, allowing features at each level to gain supplemental information from lower levels. In the down-scale enhancement stage, FPEM conveys high-level semantic information to lower-level features in a top-down manner, realized through downsampling and fusion operations that ensure lower-level features receive high-level semantic guidance. Additionally, FPEM employs depthwise separable convolutions [43] (3 × 3 depthwise convolutions followed by 1 × 1 convolutions) instead of regular convolutions to construct the FPEM connectivity part (refer to Fig. 8a). This design expands the receptive field (through the 3 × 3 depthwise convolution) and increases network depth (via the 1 × 1 convolution) with minimal computational overhead, achieving enhanced feature expression at lower computational cost. The process can be formulated as:

$$C_{i}' = \mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{1\times1}\left(\mathrm{DWConv}_{3\times3}\left(C_{i} + \mathrm{Upsample}\left(C_{i+1}'\right)\right)\right)\right)\right), \quad i = 2, 3, 4$$

where $C_i$ denotes the feature map at level $i$; ReLU is the Rectified Linear Unit, which introduces non-linearity; BN is Batch Normalization, which standardizes the inputs to a layer for each mini-batch, enabling faster and more stable training; $\mathrm{Conv}_{1\times1}$ is a convolution with a one-by-one kernel, commonly used to alter the number of feature channels; $\mathrm{DWConv}_{3\times3}$ is a 3 × 3 depthwise convolution, a computationally efficient convolution that applies a separate filter to each input channel; and Upsample increases the spatial resolution of the feature maps. $C_{i\_1}$ denotes the features enhanced by the first iteration of FPEM, and $C_{i\_2}$ the features enhanced by the second iteration.
The features enhanced by the second iteration of FPEM are fused with those enhanced by the first iteration to yield C2′, C3′, C4′, and C5′. This ensures that the final features used for Dongba script detection carry rich contextual information while retaining ample detail.
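The cascade order of one FPEM iteration can be sketched as follows. This is a schematic NumPy version: the depthwise separable convolutions, BN, and ReLU are omitted (replaced by plain addition), so only the upsample-then-merge and downsample-then-merge passes over the pyramid are shown.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x spatial upsampling of a (C, H, W) map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    # 2x2 average pooling as the down-scale step.
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def fpem(feats):
    """One FPEM iteration over pyramid features [F2, F3, F4, F5]
    (F2 is the largest map). The separable-conv/BN/ReLU block is
    replaced by identity; only the merge order is illustrated."""
    # First pass: merge each deeper (smaller) map into the next
    # shallower one via upsampling.
    for i in range(len(feats) - 2, -1, -1):
        feats[i] = feats[i] + upsample2x(feats[i + 1])
    # Second pass: merge each shallower map into the next deeper
    # one via pooling.
    for i in range(1, len(feats)):
        feats[i] = feats[i] + downsample2x(feats[i - 1])
    return feats

pyramid = [np.ones((8, 32 // 2**s, 32 // 2**s)) for s in range(4)]
enhanced = fpem(pyramid)
print([f.shape for f in enhanced])  # spatial sizes are unchanged
```

Cascading two such iterations and summing their outputs, as in the paper, yields the C2′–C5′ maps passed to FUSION.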
To leverage the multi-scale information in the different feature maps output by FPEM, most methods employ a simple add or concatenate operation for feature fusion. The FUSION module used in this paper (illustrated in Fig. 8a) is designed to integrate Dongba manuscript features from various layers, enhancing the model's ability to detect details in Dongba characters. The module first concatenates features from different layers. It then employs a 1 × 1 convolution to integrate these features without altering the spatial dimensions of the feature map, preserving the spatial structure of the Dongba characters. Batch normalization and the ReLU activation function are subsequently applied to enhance non-linear expression and stabilize network training. This is particularly important for capturing the fine stroke differences in Dongba script. Finally, the features are further transformed through pooling and fully connected layers, with the Hsigmoid function employed to output the fused features. Compared to the traditional Sigmoid function, Hsigmoid provides a more stable gradient during backpropagation, helping to prevent gradient vanishing, which is crucial for deep networks recognizing the complex symbol system of Dongba script. The process can be expressed as:

$$F = \sigma\left(\mathrm{FC}\left(P_{\mathrm{Avg}}\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{1\times1}\left(\mathrm{Cat}\left[C_2', C_3', C_4', C_5'\right]\right)\right)\right)\right)\right)\right)$$

where $\mathrm{Cat}[\cdot,\cdot]$ performs channel-wise feature concatenation, FC is a fully connected layer, $P_{\mathrm{Avg}}$ denotes average pooling, and $\sigma$ is the Hsigmoid function:

$$\sigma(x) = \mathrm{Hsigmoid}(x) = \frac{\mathrm{ReLU6}(x + 3)}{6}$$
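The Hsigmoid activation itself is a piecewise-linear approximation of the sigmoid, which is why its gradient is bounded and cheap to compute. A minimal NumPy implementation:

```python
import numpy as np

def hsigmoid(x):
    # Hsigmoid(x) = ReLU6(x + 3) / 6: clamps x + 3 to [0, 6], then
    # rescales to [0, 1]. Linear (constant gradient 1/6) on (-3, 3),
    # saturated outside, avoiding the vanishing tails of sigmoid.
    return np.clip(x + 3.0, 0.0, 6.0) / 6.0

print(hsigmoid(np.array([-4.0, 0.0, 4.0])))  # [0.  0.5 1. ]
```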

Differentiable binarization
The differentiable binarization operation dynamically adjusts each pixel's binarization threshold, thereby more accurately distinguishing between the foreground Dongba manuscript and the background. Traditional segmentation-based text detection methods usually employ a fixed threshold in the post-processing stage to convert the probability map into a binary image:

$$B_{i,j} = \begin{cases} 1, & P_{i,j} \geq th \\ 0, & \text{otherwise} \end{cases}$$

where $P$ is the probability map, $i, j$ are the pixel coordinates in the image, $th$ is a fixed threshold, and $B$ is the output binary image.
In traditional binarization, a fixed threshold $th$ processes the probability map $P$. When a pixel's value exceeds the threshold, it is considered part of the text region (i.e., a positive sample); otherwise, it is deemed background. However, this binarization generates a non-differentiable step function, limiting optimization during training. To overcome this limitation, Liao et al. [21] proposed the Differentiable Binarization (DB) module, which employs a differentiable approximation of the binarization step function, enhancing the model's trainability and optimizability:

$$\hat{B}_{i,j} = \frac{1}{1 + e^{-k\left(P_{i,j} - T_{i,j}\right)}}$$

With the DB module, the model can adaptively learn more accurate threshold maps during training, allowing for the precise distinction between text and background and effectively separating adjacent texts when predicting text boxes, as illustrated in Fig. 9.
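The DB formula is simple enough to verify numerically: for a large scaling factor k it behaves almost like the hard step function while remaining smooth. A minimal sketch:

```python
import numpy as np

def differentiable_binarization(P, T, k=50.0):
    # B_hat = 1 / (1 + exp(-k (P - T))): a smooth, differentiable
    # surrogate for thresholding P at the per-pixel threshold T.
    # Large k sharpens the transition toward a hard step.
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

P = np.array([[0.9, 0.4], [0.6, 0.1]])   # predicted probability map
T = np.full_like(P, 0.5)                 # per-pixel threshold map
print(differentiable_binarization(P, T).round(3))
```

Because the surrogate is differentiable everywhere, gradients flow through the threshold map T during training, which is what lets the network learn per-pixel thresholds.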
Here $i, j$ denote the pixel coordinates, and $\hat{B}_{i,j}$, $P_{i,j}$, and $T_{i,j}$ respectively represent the pixel values of the approximate binary map, the probability map, and the threshold map at point $(i, j)$; $k$ is a scaling factor. Due to the irregularity in the size and shape of Dongba characters, traditional fixed-size convolutions struggle to capture their feature information comprehensively. To address this issue, the convolutions in the differentiable binarization network are replaced with deformable convolutions, as shown in the Deconv module in Fig. 7c. Deformable convolution differs from traditional convolution in its introduction of offsets within the receptive field, transforming the receptive field from a rigid square into a more adaptive structure that conforms to the actual shape of objects. Consequently, deformable convolution exhibits higher adaptability when processing irregular Dongba manuscripts, better capturing the feature information of various shapes and thus enhancing the capability of feature representation.
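The core idea of deformable convolution, displacing each kernel tap by a learned fractional offset and sampling the input bilinearly, can be sketched for a single 3 × 3 location. This is an illustrative NumPy toy, not a full deformable convolution: it only performs the offset sampling; a real layer would then weight the nine samples with the kernel and predict the offsets with a parallel convolution.

```python
import numpy as np

def bilinear(img, y, x):
    # Bilinearly sample a (H, W) image at the fractional location (y, x).
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    y0, x0 = max(y0, 0), max(x0, 0)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] + (1 - dy) * dx * img[y0, x1]
            + dy * (1 - dx) * img[y1, x0] + dy * dx * img[y1, x1])

def deformable_sample(img, cy, cx, offsets):
    """Sample a 3x3 neighbourhood around (cy, cx), displacing each of
    the nine kernel taps by its own learned (dy, dx) offset. With all
    offsets zero this reduces to the rigid grid of a normal 3x3 conv."""
    grid = [(r, c) for r in (-1, 0, 1) for c in (-1, 0, 1)]
    return np.array([bilinear(img, cy + r + dy, cx + c + dx)
                     for (r, c), (dy, dx) in zip(grid, offsets)])

img = np.arange(25, dtype=float).reshape(5, 5)
zero = [(0.0, 0.0)] * 9
print(deformable_sample(img, 2, 2, zero))  # plain 3x3 neighbourhood of (2, 2)
```

In practice one would use an existing implementation such as `torchvision.ops.deform_conv2d` rather than hand-rolling the sampling.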

Loss function
The loss function Total_Loss aims to enhance the model's performance in localizing Dongba manuscript regions, ensuring the model can effectively distinguish Dongba manuscripts from the background, accurately outline their edges, and adapt to their various sizes and shapes, thereby achieving high-precision text detection. Total_Loss consists of three parts: the probability map loss (Loss_Prob), the threshold map loss (Loss_Thr), and the binary map loss (Loss_DB):

$$Total\_Loss = Loss\_Prob + \alpha \times Loss\_DB + \beta \times Loss\_Thr$$

where $\alpha$ is set to 5 and $\beta$ to 10. Loss_Prob uses the binary cross-entropy loss to recognize positive-class samples, ensuring the model can accurately distinguish between text and non-text areas:

$$Loss\_Prob = -\sum_{i \in S_l} \left[ y_i \log x_i + (1 - y_i) \log(1 - x_i) \right]$$

where $S_l$ represents the set containing the indices of all positive-class samples, $x_i$ represents the pixel values in the predicted segmentation map, and $y_i$ the pixel values in the ground truth.

Fig. 9 Visualization results
Loss_Thr utilizes the L1 loss to quantify the discrepancy between the predicted threshold map and the actual threshold, further refining the model's adaptability to the varying thresholds of different Dongba manuscript areas:

$$Loss\_Thr = \sum_{i \in R_d} \left| x_i^{*} - y_i^{*} \right|$$

where $R_d$ denotes the region between the original text box and the expanded text box, and $x_i^{*}$ and $y_i^{*}$ are the predicted and ground-truth threshold values. Loss_DB adopts the Dice loss to minimize the difference between the predicted and real segmentation maps, enhancing the model's segmentation accuracy:

$$Loss\_DB = 1 - \frac{2 \sum_{i \in S} x_i y_i}{\sum_{i \in S} x_i + \sum_{i \in S} y_i}$$

where $x_i$ represents the pixel values in the predicted segmentation map, $y_i$ the pixel values in the ground truth, and $S$ the set of all pixels.
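The three terms and their weighted combination can be sketched in NumPy. This is a simplified illustration: it averages over all pixels rather than restricting BCE to a sampled set $S_l$ or the L1 term to the border region $R_d$, and the weighting order (α on the binary-map loss, β on the threshold-map loss) follows the reconstruction above.

```python
import numpy as np

def bce_loss(x, y, eps=1e-7):
    # Binary cross-entropy for the probability map.
    x = np.clip(x, eps, 1 - eps)
    return -np.mean(y * np.log(x) + (1 - y) * np.log(1 - x))

def l1_loss(x, y):
    # L1 distance for the threshold map (here over all pixels).
    return np.mean(np.abs(x - y))

def dice_loss(x, y, eps=1e-7):
    # Dice loss for the approximate binary map.
    return 1.0 - 2.0 * np.sum(x * y) / (np.sum(x) + np.sum(y) + eps)

def total_loss(loss_prob, loss_thr, loss_db, alpha=5.0, beta=10.0):
    # Total_Loss = Loss_Prob + alpha * Loss_DB + beta * Loss_Thr
    return loss_prob + alpha * loss_db + beta * loss_thr

pred = np.array([0.9, 0.1, 0.8, 0.2])
gt = np.array([1.0, 0.0, 1.0, 0.0])
print(total_loss(bce_loss(pred, gt), l1_loss(pred, gt), dice_loss(pred, gt)))
```

A perfect prediction drives all three terms to (near) zero, so the combined loss vanishes as well.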
The training curves for the overall loss function Total_Loss, the probability map loss, the threshold map loss, and the binary map loss are illustrated in Fig. 10.

Experimental settings
The proposed STEF model is implemented in Python and trained with the PyTorch framework [44] and the MMOCR library [45]. The model runs on a Linux system and utilizes an Nvidia Tesla V100 graphics card with 16 GB of VRAM. We employed the Adam optimizer for training STEF, with the momentum parameter set to 0.9. The initial learning rate was set to 1e−3, and a Poly strategy was adopted for learning rate decay. The weight decay coefficient was set to 0.0001. The batch size was set to 6 during training, and the model was trained for 20 epochs. Five-fold cross-validation was performed to enhance the robustness and generalizability of the model.

Fig. 10 Loss curves of training
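The Poly decay strategy mentioned above follows the common schedule lr = base_lr × (1 − step / max_steps)^power; the exponent (0.9 here) is a conventional default and is an assumption, as the paper does not state it.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    # "Poly" decay: lr = base_lr * (1 - step / max_steps) ** power.
    # power=0.9 is a common default, assumed rather than taken from
    # the paper.
    return base_lr * (1.0 - step / max_steps) ** power

print(poly_lr(1e-3, 0, 100))    # 0.001 at the start of training
print(poly_lr(1e-3, 100, 100))  # 0.0 at the final step
```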

Evaluation metrics
To evaluate the detection performance of the model, this study utilizes three commonly used text detection evaluation metrics. The recall rate R represents the proportion of ground-truth text regions that are detected (Eq. 17). The precision rate P indicates the proportion of detected text regions that are correct (Eq. 18). The F-measure is the harmonic mean of precision and recall (Eq. 19):

$$R = \frac{TP}{TP + FN} \tag{17}$$

$$P = \frac{TP}{TP + FP} \tag{18}$$

$$F\text{-}measure = \frac{2 \times P \times R}{P + R} \tag{19}$$

Here, TP denotes the total number of text areas correctly predicted by the model, FP represents the total number of text areas incorrectly predicted, and FN stands for the total number of text areas that were not predicted. The Frames Per Second (FPS) metric assesses the model's real-time processing capability, calculated as the ratio of the number of frames processed to the total processing time:

$$FPS = \frac{\text{Processed Frames}}{\text{Total Processing Time}}$$
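Equations 17–19 translate directly into code; the sample counts below are made-up numbers for illustration, not results from the paper.

```python
def detection_metrics(tp, fp, fn):
    # Recall (Eq. 17), Precision (Eq. 18), F-measure (Eq. 19).
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure

# Hypothetical counts: 80 correct boxes, 10 false boxes, 10 missed.
r, p, f = detection_metrics(80, 10, 10)
print(round(r, 4), round(p, 4), round(f, 4))  # 0.8889 0.8889 0.8889
```

When precision and recall are equal, the F-measure coincides with them, since it is their harmonic mean.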

Superiority studies
Numerous text detection models have been proposed. To demonstrate the superiority of our proposed model, extensive experiments were conducted comparing these text detection models with our proposed STEF. Furthermore, comparisons were made among different modules to assess the superiority of our proposed modules.

Comparison with text detection models
To demonstrate the superiority of the proposed STEF, extensive comparative experiments were conducted with text detection models on the DBD400 dataset. Precision, Recall, F-measure, and FPS were used as metrics. We trained and tested the various models under the experimental setup described in the "Experimental settings" section, ensuring that these models could converge to their optimal values. The experimental results are presented in Table 1.
As illustrated in Table 1, the STEF introduced in this study achieves commendable detection outcomes, with a top-ranked Recall of 88.88%, Precision of 88.65%, and F-measure of 88.76%. Compared to the second-ranked model, DBNet++, STEF enhances Precision by 0.45%, Recall by 1.46%, and F-measure by 0.96%, in addition to exhibiting a higher inference speed in terms of FPS. These findings affirm the superiority of our STEF. Figure 11 demonstrates that our model strikes an optimal balance between F-measure and FPS.

Comparison with different backbone networks
We investigate the performance of various backbone networks, as detailed in Table 2 and illustrated in Fig. 12. These include traditional Convolutional Neural Networks (CNNs) such as ResNet18, ResNet50, and MobileNetV2, as well as Res2Net, ResNeXt, and Transformer-based backbones such as the Pyramid Vision Transformer (PVT) and several Swin Transformer variants.

Comparison with different feature fusion networks
Table 3 showcases the impact of different feature fusion modules on text detection. In our experiments, the proposed FPEM+FUSION module achieved significant improvements in recall, precision, and F-measure, reaching 88.88%, 88.65%, and 88.76%, respectively, with only a 0.5 M increase in the number of parameters (from the original 1.29 M to 1.79 M). These results demonstrate the effectiveness of FPEM+FUSION in enhancing text detection performance and show that significant performance improvements can be achieved in efficient and powerful text detection systems through carefully designed feature fusion strategies.

Fig. 13 The comparisons of models' performance in accuracy (F-measure) and speed

Comparison of parameter settings in loss functions
We conduct a detailed experimental analysis of the impact of α and β, the weights of the probability map loss and the threshold map loss within the loss function. As illustrated in Fig. 14, the model's performance peaked when α and β were set to 5 and 10, respectively. This produced significant improvements in precision, recall, and F-measure, highlighting the importance of appropriately setting the weights for different components of the loss function to optimize model performance.
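The weighted loss combination can be sketched as follows. The exact composition of the overall loss L is defined by the paper's Eq. 16, which is not reproduced in this section; the form below (α weighting the probability-map loss, β the threshold-map loss, with the binary-map loss unweighted) is an assumption for illustration:

```python
def total_loss(l_prob: float, l_bin: float, l_thr: float,
               alpha: float = 5.0, beta: float = 10.0) -> float:
    """Weighted sum of the three loss terms (form assumed, not from the paper).

    l_prob -- probability-map loss, weighted by alpha (best value: 5)
    l_bin  -- binary-map loss
    l_thr  -- threshold-map loss, weighted by beta (best value: 10)
    """
    return alpha * l_prob + l_bin + beta * l_thr
```

With the best-performing weights, equal unit losses combine to 5 + 1 + 10 = 16.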

Ablation study
We conduct a detailed investigation into the impact of the FUSION module, differentiable binarization (DB), and deformable convolution (deforconv) on the performance of the text detection model. As shown in Table 4, six configurations were tested to evaluate their individual and combined effects. The experimental results indicate that introducing the FUSION module alone slightly improves the model's precision, recall, and F-measure, validating the effectiveness of the FUSION module in capturing textual features. Upon incorporating differentiable binarization, a significant leap in model performance was observed, with precision increasing to 84.29%, recall to 81.23%, and F-measure reaching 82.73%. This demonstrates the crucial role of differentiable binarization in handling the structure of Dongba manuscripts. While the application of deformable convolution alone did not show as marked an improvement as differentiable binarization, it revealed the potential of deformable convolutions in processing texts with irregular shapes, with precision, recall, and F-measure reaching 80.62%, 83.97%, and 82.26%, respectively. When the FUSION module, differentiable binarization, and deformable convolution were used together, the model performance peaked at a recall of 88.88%, precision of 88.65%, and F-measure of 88.76%. These results highlight the unique value of each component in enhancing text detection performance and, more importantly, point out their synergistic effect when combined.
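The differentiable binarization step discussed above can be sketched per pixel. This follows the formulation of the original DBNet paper, B = 1 / (1 + exp(−k(P − T))); the assumption that STEF uses the same formulation and amplification factor k = 50 is ours:

```python
import math

def differentiable_binarization(p: float, t: float, k: float = 50.0) -> float:
    """Approximate binarization B = 1 / (1 + exp(-k * (P - T))).

    p -- probability-map value at a pixel
    t -- learned threshold-map value at the same pixel
    k -- amplification factor (50 in the original DBNet paper; this sketch
         assumes STEF uses the same value)
    """
    return 1.0 / (1.0 + math.exp(-k * (p - t)))
```

Because the threshold t is predicted per pixel rather than fixed globally, the binarization boundary adapts to local foreground/background contrast, and the sigmoid keeps the whole operation differentiable for end-to-end training.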

Visualization experiments
We qualitatively compare the Dongba manuscript detection results across seven models, as illustrated in Fig. 15.
In Mask R-CNN [46], the detection boundaries were unclear, and decorations and borders in the image were mistakenly identified as text.

Conclusions and future works
We propose a novel algorithm for detecting complex Dongba manuscripts, utilizing the Swin Transformer as the backbone network for feature extraction. By integrating a Feature Pyramid Enhancement and Fusion module, the algorithm adaptively selects and integrates local and global features, focusing the model's attention on text regions. Experimental results demonstrate that the proposed algorithm achieves a recall rate of 88.88%, a precision rate of 88.65%, and an F-measure of 88.76%, outperforming other algorithms. Based on this algorithm, a wide variety and large quantity of Dongba manuscripts can be detected efficiently. Future work will explore the application of the algorithm to the classification, recognition, segmentation, retrieval, and translation of Dongba manuscripts, further expanding its practicality and impact.

Fig. 1 Image of an ancient Dongba manuscript. The arrow points to the dividing line, the square highlights decorative patterns, and the circle indicates stains

Fig. 2 Examples of text detection model failures: a DBNet++ detection results; b Mask R-CNN detection results; c TextSnake detection results; d DBNet detection results

Fig. 3 a Dongba manuscripts; b pages six and seven of the Naxi Dongba ancient manuscript 'Divination: The Book of Divining by Anomalous Phenomena'. Annotation information was saved in TXT format. The dataset was divided into a training set and a test set in an 8:2 ratio, with the training set comprising 320 images containing 19,332 handwritten Dongba characters and the test set comprising 80 images with 4,934 handwritten Dongba characters.

Fig. 5 Some examples from the dataset

Fig. 6 Examples of different annotation methods: a four-point rectangular annotation; b multipoint annotation; c four-point rectangular annotation and multipoint annotation

Fig. 8 Components of the model

Fig. 11 The comparisons of backbones' performance in accuracy (F-measure) and speed

Fig. 12 The comparisons of backbones' performance (PVT: Pyramid Vision Transformer, RN18: ResNet18, RN50: ResNet50, MNV2: MobileNetV2, ST10: Swin Transformer-10, ST12: Swin Transformer-12, ST8: Swin Transformer-8, R2N: Res2Net, RX: ResNeXt, ST6: Swin Transformer-6)

The TextSnake model sometimes incorrectly detected single characters as multiple characters and similarly misidentified decorations and borders as text. DBNet, PAN, and PSENet tended to erroneously merge characters that were close together, also misidentifying decorations and borders as text. Notably, DBNet could only detect parts of characters in some instances, and both Mask R-CNN and DBNet occasionally misjudged stains as text. DBNet++ also had issues merging characters that were close together but successfully avoided misidentifying decorations and borders in the image as text. The STEF model proposed in this study demonstrated the best detection results, clearly identifying each Dongba character, although it occasionally misjudged some decorations as text.

Fig. 15 Visualization of results from different models

Table 1
Evaluation of excellent text detection models on the proposed DBD400 testing set. The best indicators are shown in bold

Table 2
Evaluation of different backbones on the proposed DBD400 testing set

Table 3
Evaluation results of different feature fusion techniques on the proposed DBD400 testing set. 'Add' denotes summation fusion and 'Concat' denotes concatenation fusion. The best performers are highlighted in bold

Fig. 14 3D surface plots of model performance metrics over different α and β parameters

Table 4
Ablation experiment