
STEF: a Swin Transformer-Based Enhanced Feature Pyramid Fusion Model for Dongba character detection


The Dongba manuscripts are a unique primitive pictographic writing system that originated among the Naxi people of Lijiang, China, with over a thousand years of history. Their uniqueness stems from their pronounced pictorial and ideographic characteristics. However, the digital preservation and inheritance of Dongba manuscripts face multiple challenges, including extracting their rich semantic information, recognizing individual characters, retrieving Dongba manuscripts, and automatically interpreting their meanings. Developing efficient Dongba character detection technology has become a key research focus, wherein establishing a standardized Dongba detection dataset is crucial for training and evaluating techniques. In this study, we have created a comprehensive Dongba manuscript detection dataset covering various commonly used Dongba characters and vocabularies. Additionally, we propose a model named STEF. First, the Swin Transformer extracts features that capture the complex structures and diverse shapes of Dongba manuscripts. Then, by introducing a Feature Pyramid Enhancement Module, features of different sizes are cascaded to preserve multi-scale information. Subsequently, all features are fused in a FUSION module, resulting in features of various Dongba manuscript styles. Each pixel’s binarization threshold is dynamically adjusted through a differentiable binarization operation, accurately distinguishing between foreground Dongba manuscripts and background. Lastly, deformable convolution is introduced, allowing the model to dynamically adjust the convolution kernel’s size and shape based on the Dongba manuscripts’ size, thereby better capturing the detailed information of Dongba characters of different sizes. Experimental results show that STEF achieves a recall rate of 88.88%, a precision rate of 88.65%, and an F-measure of 88.76%, outperforming other text detection algorithms. Visualization experiments demonstrate that STEF performs well in detecting Dongba manuscripts of various sizes, shapes, and styles, especially in blurred handwriting and complex backgrounds.


The Dongba manuscripts are a unique writing system of the Naxi minority in Yunnan, China. They are the world’s only surviving systematized pictographic script, with a history spanning over a thousand years [1]. Its origins can be traced back to the 9th century when the Naxi society began to flourish. The Naxi ancestors created this distinctive script to record folk tales, historical events, and religious beliefs. Dongba manuscripts are profoundly meaningful, depicting the history, culture, and daily life of the Naxi people through vivid graphics and lines [2]. In 2003, the ancient Naxi Dongba manuscripts were listed as “Memory of the World” by UNESCO, highlighting the precious value of the Dongba manuscripts as a cultural heritage [3]. Despite the esteemed status of these ancient manuscripts, their accurate translation is challenging for non-professionals due to a lack of in-depth knowledge of Dongba-related expertise. Therefore, digital analysis of Dongba manuscripts, such as individual character recognition, Dongba manuscripts retrieval, and machine translation research, is particularly crucial. This not only aids in the preservation and in-depth study of Dongba culture but also plays a vital role in the urgent protection of Dongba manuscripts. In this process, accurate detection of Dongba characters is the first step, as its precision directly impacts recognition, retrieval, and translation effectiveness, making the improvement of detection accuracy necessary.

The distinction between ancient Dongba manuscripts and contemporary books is pronounced. In Dongba manuscripts, the main text typically presents itself in three lines, each delineated by horizontal lines and subdivided into multiple independent sentences via single or double vertical lines. The form of Dongba manuscripts is notably varied, with an extensively dispersed distribution [4]. Some manuscripts feature horizontal and vertical dividing lines and are embellished with external borders or intricate patterns [5]. Due to their antiquity, certain manuscripts exhibit stains. Figure 1 illustrates an image of an ancient Dongba manuscript. While traditional text detection methodologies, such as region-based, connected components-based, and texture-based approaches, yield satisfactory results in simple or standard text environments, they often fall short in accurately identifying text regions within the complex and diverse structures characteristic of Dongba characters. Consequently, there is an urgent need to develop a text detection algorithm tailored to the unique features of Dongba characters. In recent years, the rapid advancement of deep learning technologies has significantly enhanced text detection algorithms. By training deep neural network models, these algorithms gain the ability to learn and comprehend text features autonomously, thus enabling efficient and precise text detection. Considering the complex and diverse structures of Dongba characters, deep learning-based text detection algorithms emerge as an optimal solution. They offer increased adaptability to the inherent diversity and complexity of the text.

Fig. 1

Image of Ancient Dongba manuscript. The arrow points to the dividing line, the square highlights decorative patterns, and the circle indicates stains

However, most text detection models have been trained on large-scale, standard text datasets consisting primarily of text arranged in regular lines. Their performance in detecting Dongba script is compromised by several issues: (1) incomplete detection, as the intricate structure of Dongba manuscripts often leads to partial recognition (refer to Fig. 2a); (2) incorrect merging, where the models fail to distinguish between closely spaced Dongba characters and treat them as a single entity (refer to Fig. 2b); (3) misidentification of non-text elements, where decorative frames, dividing lines, and stains are incorrectly recognized as textual content (refer to Fig. 2b, c); and (4) undetected text, where the distinctive layout or style of Dongba manuscripts causes some text to be overlooked (refer to Fig. 2a, d). The primary challenges can be summarized as follows: (1) there is no large-scale public dataset of Dongba manuscript images for model training; (2) the unique writing system and character features of Dongba manuscripts increase detection difficulty; (3) the dispersed distribution of character types, varying glyph sizes, and the presence of non-text elements such as borders and decorative patterns further complicate text detection; (4) Dongba manuscripts were written by many individuals, leading to significant variations in writing style that complicate text standardization and normalization; and (5) owing to their considerable age, many images of Dongba script are affected by stains and water damage, which partially obscure the script and make it harder to process. To address these issues, we propose the STEF model, designed specifically for automatically detecting Dongba manuscripts. Compared to traditional text detection methods, the STEF model offers higher accuracy and robustness and is better adapted to the characteristics of Dongba manuscripts.

Fig. 2

Examples of text detection model failures: a DBNet++ detection results; b Mask R-CNN detection results; c TextSnake detection results; d DBNet detection results

The principal contributions of this paper are as follows:

  1.

    We have established a dataset for Dongba script detection, named DBD400 (DongbaDetection), which comprises 400 images of ancient Dongba manuscripts, encompassing 24,266 handwritten Dongba characters. The DBD400 dataset is a benchmark platform for the comparative assessment of Dongba script detection.

  2.

    We introduce STEF (a Swin Transformer-Based Enhanced Feature Pyramid Fusion Model), a novel automated detection model designed explicitly for Dongba manuscripts. Utilizing the Swin Transformer as its backbone network, it captures text features across various scales to accommodate the complexity and diverse shapes of Dongba characters. Moreover, we designed a feature pyramid enhancement fusion module that integrates different levels of Dongba script feature maps layer by layer, thereby augmenting the capability to capture the nuances of Dongba characters. Additionally, through differentiable binarization technology, the model dynamically adjusts the binarization threshold for each pixel in the image, facilitating precise differentiation between the foreground Dongba script and the background. Finally, the model employed deformable convolution, thereby increasing its adaptability in handling the various shapes and layouts in the Dongba manuscript.

  3.

    Extensive experiments on our DBD400 dataset demonstrate the effectiveness of STEF in Dongba manuscript detection. Furthermore, the precise detection capabilities of STEF on the Dongba manuscripts are illustrated through visualization.

The code is publicly available at:

Related works

Text detection

Early text detection algorithms relied on traditional image processing techniques and machine learning methods, primarily targeting horizontal text lines in simple scenes using edge detection and projection methods for text localization [6, 7]. Researchers adopted methods based on sliding windows and machine learning classifiers such as Support Vector Machines (SVM) [8] to address multi-directional and irregular text in natural scenes. With the increase in scene complexity, more advanced methods emerged, such as those based on Stroke Width Transform (SWT) [9], particularly effective in complex backgrounds and lighting conditions. These methods distinguish text from non-text elements by analyzing the consistency of character stroke widths. Another important class of methods is based on Maximally Stable Extremal Regions (MSER) [10] for text detection. These algorithms identify regions with stable similar colors or grayscale levels across different scales as text. While these traditional methods perform well in simple or controlled environments, they are limited in more complex natural scenes due to a lack of adaptability to text shapes and arrangements.

With the rise of deep learning, methods based on convolutional neural networks (CNNs) have become dominant, significantly improving text detection effectiveness. These methods are mainly divided into two categories: regression-based methods and segmentation-based methods.

  1.

    Regression-based methods

Text detection methods based on regression primarily aim to locate text by predicting bounding boxes, using deep learning models such as CNNs or RNNs for training. These models process images during prediction, generating candidate boxes and determining text positions based on confidence scores. For instance, the TextBoxes [11] model optimizes anchor boxes and position regression to enhance detection speed and accuracy. TextBoxes++ [12] further improves the detection capability for multi-oriented text, albeit facing challenges in complex backgrounds. RRD [13] effectively handles multi-directional long text through rotation-invariant feature classification and rotation-sensitive feature regression. DeRPN [14] introduces dimension decomposition region proposal networks to address scale variation issues.

  2.

    Segmentation-based methods

Segmentation-based text detection methods precisely recognize text positions and shapes through pixel-level image segmentation. The Mask TextSpotter series has contributed significantly to scene text detection. The V1 version [15] achieves end-to-end training by applying Mask R-CNN to text detection, but it has limitations in text sequence recognition. The V2 version [16] improves sequence recognition by incorporating the attention mechanism-based SAM. The V3 version [17] replaces RPN with SPN to enhance detection accuracy. TextSnake [18] optimizes the detection of curved text but is limited in handling extreme-sized text. PSENet [19] was therefore introduced, which enhances the detection accuracy of texts of different sizes using a progressive scale expansion algorithm. PAN [20] further improves the detection accuracy of multi-scale texts through pyramid attention mechanisms. DBNet [21] introduces a differentiable binarization module, enhancing detection accuracy and robustness, particularly for complex backgrounds and irregularly shaped texts. DBNet++ [22] further optimizes the algorithm and network structure, improving precision and efficiency.

Ancient text detection

Ancient text detection focuses on identifying characters within the textual materials passed down from ancient times. Dongba script, with its long history, is regarded as one of the most significant ancient textual forms. Due to prolonged transmission, many ancient texts have become blurred, compounded by significant differences from modern texts, increasing the difficulty of ancient text detection. In response, some researchers have delved into ancient text detection extensively [23]. Garz et al. [24] developed a method for detecting text regions and decorative elements in manuscripts using Scale-Invariant Feature Transform (SIFT) descriptors for effective localization, albeit with limitations in detecting decorative elements. Subsequently, Asi et al. [25] proposed a non-learning-based coarse-to-fine analysis method for layout analysis of ancient manuscripts. They detected significant text areas in manuscripts using texture-based filters and graph-cut algorithms. The work by Roman-Rangel and Marchand-Maillet [26] shifted focus to detecting Maya hieroglyphs, a highly visually complex writing system. They introduced a weighted bag-of-visual-words representation to enhance the performance of visual bag-of-words in detection. In the domain of Yi script detection, Chen et al. [27] proposed a novel Yi character detection method by combining Maximally Stable Extremal Regions (MSER) and Convolutional Neural Networks (CNN), addressing precise character detection in ancient Yi script recognition, especially in documents with complex layout structures and mixed text and images. To preserve and inherit the culture of water writing, Tang et al. [28] established a dataset of handwritten water scripts and applied the Faster R-CNN algorithm to water script ancient text recognition. 
This study addressed challenges in dataset establishment and sample imbalance, achieving precise positioning and recognition of page-level end-to-end water script ancient text, further expanding the scope of handwritten ancient text detection. Recently, Xu et al. [29] made progress in Tibetan text detection by using relational reasoning graph networks to improve the detection of Tibetan texts of arbitrary shapes. These studies tackle specific technical challenges and complement each other, collectively driving the advancement of ancient text detection.

In the field of Dongba script detection, Xing et al. [30] proposed a multi-scale hybrid attention network based on YOLOv5s for sentence-level detection of Dongba manuscripts. Although they made some progress at this level, their method has not yet been able to detect individual Dongba characters effectively. Wang [31] adopted the EAST model to enhance the efficiency of feature extraction and fusion in images. Through numerical experiments comparing template matching, support vector machine, and deep learning methods, they confirmed that deep learning exhibits superior performance in Dongba script detection. Despite significant technological progress, existing script detection algorithms have yet to account for the distinct attributes of Dongba manuscripts fully. These include their intricate structure, diverse font styles, variable font sizes, ink intensity and legibility challenges, and noisy backgrounds. Consequently, this oversight results in incomplete detections and omissions. Therefore, detection networks must be developed to suit the Dongba manuscript’s specific features.


The ancient Dongba manuscripts

The ancient Dongba manuscripts are religious manuscripts employed within the indigenous religion of the Naxi ethnic group [32]. There are two predominant styles of ancient Dongba manuscripts [32]. The first type is rectangular, bound on the left side with hemp twine, measuring approximately thirty centimeters in length and nine centimetres in width, as depicted in Fig. 3a. The dataset in our study consists of manuscripts of this particular style. The second type is square-shaped, bound at the top, typically utilized for divination texts, as illustrated in Fig. 3b. In the creation of Dongba ancient manuscripts, the initial step involves outlining with a bamboo pen, followed by applying traditional pigments based on minerals, plants, and animals, primarily for decorative purposes, as illustrated in Fig. 4 [33]. In ancient Dongba manuscripts, each page is typically divided into three lines, with each line comprising about two to three straight segments. The writing generally adheres to a left-to-right and top-to-bottom orientation. However, the handwriting rule in these manuscripts is characterized by considerable irregularity and a significant degree of arbitrariness. Despite the basic directional flow from left to right and from top to bottom, the placement of characters often varies according to the scribe and the geographical context. Furthermore, even within the same manuscript, the positioning of characters in identical sentences can differ when penned by the same individual.

Fig. 3

a Dongba manuscripts; b pages six and seven of the Naxi Dongba ancient manuscript Divination: The Book of Divining by Anomalous Phenomena

Fig. 4

The Book of Genesis


The original data for this study were sourced from the website of the Harvard-Yenching Library at Harvard University [34]. As illustrated in Fig. 5, the data exhibit the unique characteristics of Dongba ancient manuscripts, including variations in the background and ink intensity of the text, differences in font sizes, diverse handwriting styles, uneven illumination, and text distortion phenomena. In this research, we created a dataset comprising 400 images of these ancient manuscripts, which we designated DBD400 (Dongba Detection). All images are stored in JPEG format, with resolutions ranging from a minimum of 1200 × 431 pixels to a maximum of 1201 × 530 pixels. In total, there are 24,266 handwritten Dongba characters.

Fig. 5

Some examples from the dataset

Dataset annotation

We used the PPOCRLabel tool [35] to precisely annotate text regions within images of the Dongba ancient manuscript dataset. This study primarily employed two forms of annotation for Dongba manuscripts. We used a four-point rectangular annotation method for Dongba characters that are uniform and regular in shape. For irregular and complex characters, a multipoint annotation approach was adopted. As illustrated in Fig. 6, annotation information was saved in TXT format. The dataset was divided into a training set and a test set in an 8:2 ratio, with the training set comprising 320 images containing 19,332 handwritten Dongba characters and the test set comprising 80 images with 4,934 handwritten Dongba characters.
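An 8:2 split of this kind can be reproduced with a seeded shuffle. The sketch below is only an illustration, not the procedure used to build DBD400: the file-name pattern, the seed, and the helper name `split_dataset` are assumptions, and only the image and character counts are taken from the text.

```python
import random

def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle a copy of `items` deterministically and split it by ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names; the real DBD400 naming scheme is not given in the text.
images = [f"dbd400_{i:03d}.jpg" for i in range(400)]
train, test = split_dataset(images)

# Character counts reported in the text for the two subsets.
train_chars, test_chars = 19_332, 4_934
total_chars = train_chars + test_chars
```

The two subsets are disjoint by construction, and the reported per-subset character counts add up to the 24,266 characters stated for the whole dataset.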

Fig. 6

Examples of different annotation methods: a four-point rectangular annotation; b multipoint annotation; c four-point rectangular annotation and multipoint annotation


Overview of the structure

The quality of images in the Dongba ancient manuscripts is relatively low, featuring complex text structures, varying font sizes, and diverse handwriting styles. To address these challenges effectively, we designed the STEF model, the architecture of which is detailed in Fig. 7. This model utilizes the Swin Transformer to extract complex features from the Dongba manuscript. The Feature Pyramid Enhancement Fusion (FPEF) module efficiently aggregates detailed and global information by progressively merging shallow, large-scale feature maps with deep, small-scale ones. Additionally, the FUSION module within the FPEF is engineered to amalgamate features from different levels, enhancing the model’s ability to express features. In post-processing, we employed differentiable binarization techniques, dynamically adjusting each pixel’s binarization threshold to more accurately distinguish between the foreground Dongba manuscript and background. Furthermore, to better accommodate the irregular shapes of the Dongba manuscript, deformable convolutions have been introduced in place of traditional convolutions, enhancing the model’s adaptability to irregular texts. Finally, the model is jointly trained using a combination of probability map loss, threshold map loss, and binary map loss functions. The following sections briefly introduce the Swin Transformer, the FPEF module, differentiable binarization, and the loss functions.

Fig. 7

The structure of STEF model

Backbone architecture

The backbone architecture is designed to extract representative features from the raw data, providing adequate inputs for subsequent tasks. Given the complex structure and diverse styles of Dongba script, traditional Convolutional Neural Networks (CNNs) [36,37,38,39,40] struggle to fully capture the subtle differences and rich semantic layers between Dongba characters. This limitation is primarily due to CNNs’ tendency to focus on extracting local features when processing two-dimensional images, often overlooking the impact of overall semantics. Consequently, our research utilizes the Swin Transformer [41] as the backbone architecture. The Swin Transformer combines the self-attention mechanism of the Transformer with the local perception capabilities of convolutional networks. This allows for better handling of the complex structures and detailed features of the Dongba manuscript, offering a richer and more accurate feature representation.

The architecture of the Swin Transformer primarily consists of Patch Partition, Linear Embedding, Swin Transformer Blocks, and Patch Merging, as shown in Fig. 7a. Patch Partition divides the input Dongba manuscript images into several fixed-size patches, enhancing the model’s capability to process local information within the image and expanding its receptive field. Linear Embedding maps each patch to a low-dimensional vector representation, transforming pixels in the image into a vector form. Swin Transformer Blocks, the core component of the Swin Transformer, comprise window multi-headed self-attention (W-MSA), shifted window-based multi-head self-attention module (SW-MSA), multi-layer perceptrons (MLP), layer normalization (LN), and residual connections. This structure primarily facilitates the extraction of Dongba manuscript features, effectively capturing the detailed and structural characteristics of Dongba characters. The Patch Merging step occurs between Swin Transformer Blocks, reducing the resolution of feature maps through downsampling and increasing the number of channels to achieve hierarchical feature extraction.
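The hierarchical effect of Patch Partition and Patch Merging can be made concrete by tracking feature-map shapes stage by stage. The sketch below assumes the Swin-T settings of the original Swin Transformer paper (4 × 4 patches, embedding dimension 96, four stages) and a hypothetical 640 × 640 input; the configuration actually used in STEF is not stated in this section, and the function name is illustrative.

```python
def swin_stage_shapes(height, width, patch=4, embed_dim=96, stages=4):
    """Return the (H, W, C) shape of the feature map after each Swin stage."""
    # Patch Partition + Linear Embedding: one token per patch, embed_dim channels.
    h, w, c = height // patch, width // patch, embed_dim
    shapes = [(h, w, c)]
    for _ in range(stages - 1):
        # Patch Merging: halve the spatial resolution, double the channels.
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes

# Hypothetical square input; the DBD400 images themselves are not square.
shapes = swin_stage_shapes(640, 640)
for stage, shape in enumerate(shapes, start=1):
    print(f"stage {stage}: {shape}")
```

Under these assumptions the four stages produce maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution, which is the multi-scale pyramid that the later fusion stages consume.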

The structure of Swin Transformer Blocks is depicted in Fig. 8. W-MSA/SW-MSA employs a shifted window-based approach for dividing sequences, reducing computational load and memory usage. The MLP layer performs nonlinear transformations on features to enhance the model’s representational power. The Layer Normalization (LN) layer normalizes features to improve model stability. Residual connections maintain the continuity of information flow and feature reuse during deep feature extraction. This facilitates the model’s ability to fully explore and utilize the rich features and semantic information of the Dongba script.

Fig. 8

Components of the model

Consecutive Swin Transformer blocks are computed as:

$$\begin{aligned} \hat{\textbf{z}}^l= & {} \text {W-MSA}\left( \text{LN}\left( \textbf{z}^{l-1}\right) \right) +\textbf{z}^{l-1} \end{aligned}$$
$$\begin{aligned} \textbf{z}^l= & {} \text {MLP}\left( \text{LN}\left( \hat{\textbf{z}}^l\right) \right) +\hat{\textbf{z}}^l \end{aligned}$$
$$\begin{aligned} \hat{\textbf{z}}^{l+1}= & {} \text {SW-MSA}\left( \text{LN}\left( \textbf{z}^l\right) \right) +\textbf{z}^l \end{aligned}$$
$$\begin{aligned} \textbf{z}^{l+1}= & {} \text {MLP}\left( \text{LN}\left( \hat{\textbf{z}}^{l+1}\right) \right) +\hat{\textbf{z}}^{l+1} \end{aligned}$$

here \(\hat{\textbf{z}}^l\) and \(\textbf{z}^l\) represent the outputs of the (S)W-MSA module and the MLP module of the \(l\)-th block, respectively.
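The alternation between W-MSA and SW-MSA in consecutive blocks matters because the shifted partition groups tokens that the regular partition keeps apart. The following sketch illustrates this on a toy 4 × 4 token grid with 2 × 2 windows, using the cyclic-shift formulation described in the Swin Transformer paper; the grid size, window size, and function names are illustrative assumptions, and no attention is computed.

```python
def roll2d(grid, shift):
    """Cyclically shift a 2-D grid by `shift` cells toward the top-left."""
    n, m = len(grid), len(grid[0])
    return [[grid[(r + shift) % n][(c + shift) % m] for c in range(m)]
            for r in range(n)]

def window_partition(grid, win):
    """Split a grid into non-overlapping win x win windows (each flattened)."""
    windows = []
    for r0 in range(0, len(grid), win):
        for c0 in range(0, len(grid[0]), win):
            windows.append([grid[r][c]
                            for r in range(r0, r0 + win)
                            for c in range(c0, c0 + win)])
    return windows

# Token indices 0..15 laid out row by row on a 4 x 4 grid.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]

regular = window_partition(grid, 2)             # windows seen by W-MSA
shifted = window_partition(roll2d(grid, 1), 2)  # windows seen by SW-MSA (shift = win // 2)
```

Here the first regular window holds tokens 0, 1, 4, 5, whereas the first shifted window holds tokens 5, 6, 9, 10, one from each of the four regular windows. Self-attention restricted to windows can therefore still exchange information across window borders once the two block types are stacked.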

Feature pyramid enhancement fusion module

Feature fusion is a critical method for enhancing detection performance. Dongba manuscript images contain a significant amount of cluttered information, which introduces redundancy into the fused features and reduces the effectiveness of Dongba manuscript detection. Moreover, multi-scale feature maps have different receptive fields, so the information they carry differs. Consequently, directly fusing multi-scale features may cause problems such as confused localization or detection errors. Inspired by [42], we design the feature pyramid enhancement fusion module, which comprises two components: FPEM and Fusion.

FPEM is a U-shaped structure, as depicted in Fig. 7b under FPEM. It consists of two stages: up-scale enhancement and down-scale enhancement. Up-scale enhancement aggregates lower-level Dongba manuscript features output by the Swin Transformer to higher-level features (F2, F3, F4, C5) in a bottom-up manner, enriching higher-level features with more detailed information. This process is achieved through upsampling and fusion operations, allowing features at each level to gain supplemental information from lower levels. In the down-scale enhancement stage, FPEM conveys high-level semantic information to lower-level features in a top-down approach, realized through downsampling and fusion operations to ensure lower-level features receive high-level semantic guidance. Additionally, FPEM employs depthwise separable convolutions [43] (3 × 3 depthwise convolutions and 1 × 1 convolutions) instead of regular convolutions to construct the FPEM connectivity part \(\oplus\) (refer to Fig. 8a). This design expands the receptive field (through 3 × 3 depthwise convolution) and increases network depth (via 1 × 1 convolution) with minimal computational overhead, achieving enhanced feature expression at lower computational costs. The specific process is formulated as follows:

$$\begin{aligned} C_i= & {} \text {ReLU}\left( \text {BN}\left( 1 \times 1 \text {Conv} \left( 3 \times 3 \text {DWConv} \left( \text {upsample}\left( F_i \right) + C_{i+1} \right) \right) \right) \right) \end{aligned}$$
$$\begin{aligned} C_{(i+1)\_1}= & {} \text {ReLU}\left( \text {BN}\left( 1 \times 1 \text {Conv} \left( 3 \times 3 \text {DWConv}\left( \text {upsample}\left( C_{i+1} \right) + C_{i\text {\_}1} \right) \right) \right) \right) \end{aligned}$$

here \(i = 2, 3, 4\), \(C_i\) denotes the feature map. ReLU stands for the Rectified Linear Unit, which introduces non-linearity in the activation functions. BN represents Batch Normalization, a technique to standardize the inputs to a layer for each mini-batch, facilitating faster and more stable training. The term 1 × 1 Conv refers to a convolution operation with a one-by-one kernel size commonly used to alter the dimensionality of the feature channels. The 3 × 3 DWConv signifies a 3 × 3 depthwise convolution, a computationally efficient convolutional method that separately applies a filter to each input channel. Lastly, upsample refers to the process of upsampling, which increases the spatial resolution of feature maps.
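The computational saving from replacing a regular 3 × 3 convolution with a 3 × 3 depthwise convolution followed by a 1 × 1 convolution can be checked by counting weights. The sketch below ignores biases and BN parameters, and the channel width of 128 is a hypothetical value chosen for illustration, not a setting reported here.

```python
def conv_params(c_in, c_out, k=3):
    """Weights in a regular k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def dw_separable_params(c_in, c_out, k=3):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv mixes the channels
    return depthwise + pointwise

channels = 128  # hypothetical channel width
standard = conv_params(channels, channels)
separable = dw_separable_params(channels, channels)
ratio = standard / separable

print(standard, separable, round(ratio, 2))
```

For equal input and output widths the ratio approaches 9 as the number of channels grows, so the FPEM connections cost roughly an order of magnitude fewer weights than regular 3 × 3 convolutions would.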

$$\begin{aligned} C'_i = C_{i\_1} + C_{i\_2} \end{aligned}$$

here \(C_{i\_1}\) denotes the features enhanced by the first iteration of FPEM, and \(C_{i\_2}\) refers to the features enhanced by the second iteration of FPEM.

Features enhanced by the second iteration of FPEM are fused with those enhanced by the first iteration of FPEM to yield C2′, C3′, C4′, and C5′. This ensures that the final features utilized for Dongba script detection are endowed with rich contextual information while retaining ample detail.

To leverage the multi-scale information from the different feature maps output by FPEM, most methods employ simple addition or concatenation for feature fusion. The FUSION module used in this paper (illustrated in Fig. 8a) is designed to integrate Dongba manuscript features from various layers, enhancing the model’s ability to detect details in Dongba characters. This module first concatenates features from different layers. Then, it employs a 1 × 1 convolution to integrate these features without altering the spatial dimensions of the feature map, preserving the spatial structure of Dongba characters. Batch normalization and the ReLU activation function are subsequently used to enhance nonlinear expression and stabilize network training. This is particularly important for capturing the fine stroke differences in the Dongba script. Finally, features are further transformed through pooling and fully connected layers, with the Hsigmoid function employed to output fused features. Compared to the traditional Sigmoid function, Hsigmoid provides a more stable gradient during backpropagation, helping to prevent gradient vanishing, which is crucial for deep networks to recognize the complex symbol system of Dongba script. The specific process can be expressed as follows:

$$\begin{aligned} F = \text {Cat}\left( C'_{2}, C'_{3}, C'_{4}, C'_{5}\right) \end{aligned}$$

here \(\text {Cat}(\cdot )\) denotes channel-wise concatenation of the feature maps.

$$\begin{aligned} F' = \sigma \left( \text {FC}\left( P_{\text {Avg}}(F) \right) \right) \end{aligned}$$

In the equation, FC represents a fully connected layer, \(P_{\text {Avg}}\) denotes average pooling, and \(\sigma\) represents the Hsigmoid function, whose formula is as follows:

$$\begin{aligned} f(\alpha , x) = {\left\{ \begin{array}{ll} 0 &{} \text {if } x < -3 \\ \alpha (x + 3) &{} \text {if } -3 \le x \le 3 \\ 1 &{} \text {if } x > 3 \end{array}\right. } \end{aligned}$$

with \(\alpha\) = \(\frac{1}{6}\)
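The pooling, fully connected, and Hsigmoid steps can be sketched in a few lines. The example assumes the usual saturating form of the hard sigmoid, that is, ReLU6(x + 3)/6 clamped to [0, 1], and uses a hypothetical two-channel feature with an identity FC layer; the function names and values are illustrative and do not reproduce the trained module.

```python
def hsigmoid(x, alpha=1 / 6):
    """Hard sigmoid: piecewise linear in [-3, 3], clamped to [0, 1] outside."""
    return min(1.0, max(0.0, alpha * (x + 3)))

def channel_gate(feature, fc_weights, fc_bias):
    """Compute hsigmoid(FC(P_avg(F))) for a feature given as a list of 2-D channels."""
    # Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    # Fully connected layer applied to the pooled vector.
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(fc_weights, fc_bias)]
    return [hsigmoid(z) for z in logits]

feature = [[[1.0, 3.0], [5.0, 7.0]],      # channel 0, mean 4.0
           [[-2.0, -2.0], [-2.0, -2.0]]]  # channel 1, mean -2.0
weights = [[1.0, 0.0], [0.0, 1.0]]        # identity FC, hypothetical
bias = [0.0, 0.0]
gate = channel_gate(feature, weights, bias)
```

With these inputs the first channel saturates at 1 and the second falls on the linear segment at 1/6, which shows why the gradient stays constant (equal to \(\alpha\)) anywhere inside the interval [-3, 3] instead of decaying as it does for the smooth Sigmoid.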

Differentiable binarization

The differentiable binarization operation dynamically adjusts each pixel’s binarisation threshold, thereby more accurately distinguishing between the foreground Dongba manuscript and the background. Traditional segmentation-based text detection methods usually employ a fixed threshold in the post-processing stage to convert the obtained probability map into a binary image, as shown in the equation:

$$\begin{aligned} B_{i,j} = {\left\{ \begin{array}{ll} 1, &{} P_{i,j} \ge t_h \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

where \(P\) represents the probability map, \(i,j\) are the pixel coordinates in the image, \(t_h\) is a fixed threshold, and \(B\) is the output binary image.

In traditional binarization, a fixed threshold \(t_h\) is applied to the probability map \(P\). When a pixel’s value reaches or exceeds the threshold, it is considered part of the text region (i.e., a positive sample); otherwise, it is deemed background. However, this binarization is a non-differentiable step function, so it cannot be optimized together with the network during training. To overcome this limitation, Liao et al. [21] proposed the Differentiable Binarization (DB) module, which employs a differentiable approximation of the binarization step function, as shown in the equation below. With the DB module, the model can adaptively learn more accurate threshold maps during training, allowing for the precise distinction between text and background and effectively separating adjacent texts when predicting text boxes, as illustrated in Fig. 9.

$$\begin{aligned} \hat{B}_{i,j} = \frac{1}{1 + e^{-k(P_{i,j}-T_{i,j})}} \end{aligned}$$

Here, \(i,j\) denotes the pixel coordinates, and \(\hat{B}_{i,j}\), \(P_{i, j}\), and \(T_{i, j}\) respectively represent the pixel values of the approximated binary image, probability map, and threshold map at the point \(i,j\); \(k\) is a scaling factor.
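The contrast between the fixed-threshold step function and its differentiable approximation can be sketched for a single pixel as follows. This is an illustrative example only: the function names are hypothetical, and the default `k = 50.0` is an assumption borrowed from the setting reported by Liao et al. for DBNet, since the scaling factor used for STEF is not stated here.

```python
import math


def hard_binarize(p: float, t_h: float) -> int:
    """Fixed-threshold step function: 1 if p >= t_h, otherwise 0."""
    return 1 if p >= t_h else 0


def differentiable_binarize(p: float, t: float, k: float = 50.0) -> float:
    """Smooth approximation 1 / (1 + exp(-k * (p - t))).

    p is the probability-map value, t the per-pixel threshold-map value,
    and k the scaling factor (assumed value, see the note above).
    """
    return 1.0 / (1.0 + math.exp(-k * (p - t)))


# The same probability can land on either side depending on the learned
# per-pixel threshold, while the output stays differentiable in p and t.
for p, t in ((0.9, 0.3), (0.5, 0.5), (0.2, 0.3)):
    print(p, t, hard_binarize(p, t), round(differentiable_binarize(p, t), 4))
```

A larger `k` makes the curve steeper and closer to the step function, at the cost of larger gradients near the threshold.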

Fig. 9 Visualization results

Because Dongba characters are irregular in size and shape, traditional fixed-size convolution struggles to capture their feature information comprehensively. To address this issue, the traditional convolution operations in the differentiable binarization network are replaced with deformable convolution, as shown in the Deconv module in Fig. 7c. Deformable convolution differs from traditional convolution in that it introduces offsets within the receptive field, transforming the receptive field from a rigid square into a more adaptive structure that conforms to the actual shape of objects. Consequently, deformable convolution adapts better to irregular Dongba manuscripts, enabling it to capture the feature information of various shapes and thus enhance feature representation.
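The idea of sampling at offset positions can be illustrated with a small pure-Python sketch that computes one output value of a 3×3 deformable convolution. It is a simplified illustration, not the authors' implementation: in a real network the offsets are predicted by an additional convolution layer and learned during training, whereas here they are passed in by hand, and the helper names `bilinear_sample` and `deformable_conv_at` are hypothetical.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D list at fractional (y, x); zero outside."""
    height, width = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < height and 0 <= xx < width:
                weight = (1.0 - abs(y - yy)) * (1.0 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value


def deformable_conv_at(feature, kernel, offsets, cy, cx):
    """One output value of a 3x3 deformable convolution centred at (cy, cx).

    offsets[i][j] = (dy, dx) shifts the sampling location of kernel tap (i, j)
    away from its regular grid position.
    """
    out = 0.0
    for i in range(3):
        for j in range(3):
            dy, dx = offsets[i][j]
            out += kernel[i][j] * bilinear_sample(
                feature, cy + (i - 1) + dy, cx + (j - 1) + dx
            )
    return out


feature = [[float(4 * r + c) for c in range(4)] for r in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]
zero_offsets = [[(0.0, 0.0)] * 3 for _ in range(3)]
shifted_offsets = [[(0.5, 0.5)] * 3 for _ in range(3)]

# With zero offsets the result equals an ordinary 3x3 convolution.
print(deformable_conv_at(feature, kernel, zero_offsets, 1, 1))
# Non-zero offsets move every sampling point off the rigid square grid.
print(deformable_conv_at(feature, kernel, shifted_offsets, 1, 1))
```

Because the fractional sampling uses bilinear interpolation, the output remains differentiable with respect to the offsets, which is what allows them to be learned.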

Loss function

The loss function Total_Loss aims to enhance the model’s performance in localizing Dongba manuscript regions, ensuring the model can effectively distinguish between Dongba manuscripts and the background, accurately outline the edges of Dongba manuscripts, and adapt to the various sizes and shapes of Dongba manuscripts, thereby achieving high-precision text detection. Total_Loss is a weighted sum of three parts: the probability map loss (Loss_Prob), the threshold map loss (Loss_Thr), and the binary map loss (Loss_DB), calculated as follows:

$$\begin{aligned} Total\_Loss = \alpha \times Loss\_Prob + \beta \times Loss\_Thr + Loss\_DB \end{aligned}$$

Here, \(\alpha\) is set to 5 and \(\beta\) to 10.

The probability map loss \(Loss\_Prob\) is a binary cross-entropy loss, which ensures the model can accurately distinguish between text and non-text areas. The formula is as follows:

$$\begin{aligned} Loss\_Prob = -\sum _{i \in S_l} \left[ y_i \log {x_i} + (1 - y_i)\log (1 - x_i) \right] \end{aligned}$$

Here, \(S_l\) represents a set containing the indices of all positive class samples, \(x_i\) represents the pixel values in the predicted segmentation map, and \(y_i\) represents the pixel values in the ground truth.

\(Loss\_Thr\) utilizes the L1 loss to quantify the discrepancy between the predicted threshold map and the actual threshold, further refining the model’s adaptability to varying threshold values for different Dongba manuscript areas. Its expression is:

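$$\begin{aligned} Loss\_Thr = \sum _{i \in R_d} |y_i^* - x_i^*| \end{aligned}$$

Here, \(R_{d}\) denotes the region between the original text box and the expanded text box, \(x_i^*\) is the predicted threshold map value, and \(y_i^*\) is the corresponding ground-truth threshold value.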

The binary map loss \(Loss\_DB\) adopts the Dice loss to minimize the difference between the predicted and real segmentation maps, enhancing the model’s segmentation accuracy. Its expression is as follows:

$$\begin{aligned} Loss\_DB = 1 - \frac{2 \times \sum _{i \in S} x_i y_i}{\sum _{i \in S} x_i^2 + \sum _{i \in S} y_i^2} \end{aligned}$$

Here, \(x_i\) represents the pixel values in the predicted segmentation map, \(y_i\) represents the pixel values in the ground truth, and \(S\) is the set of all pixels.
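The three loss terms and their weighted combination can be sketched on flattened pixel lists as follows. This is a minimal illustration under stated assumptions, not the training code: the cross-entropy is written with the conventional negative sign so that lower values are better, the small constant `EPS` is added only for numerical stability, and all function names are hypothetical.

```python
import math

EPS = 1e-6  # numerical-stability constant (an assumption, not from the paper)


def bce_loss(pred, target, indices):
    """Binary cross-entropy summed over the pixel indices in `indices`."""
    total = 0.0
    for i in indices:
        x = min(max(pred[i], EPS), 1.0 - EPS)  # keep log() finite
        total -= target[i] * math.log(x) + (1.0 - target[i]) * math.log(1.0 - x)
    return total


def l1_loss(pred_thr, target_thr, region):
    """L1 distance between predicted and ground-truth thresholds over `region`."""
    return sum(abs(target_thr[i] - pred_thr[i]) for i in region)


def dice_loss(pred, target):
    """1 - 2 * sum(x*y) / (sum(x^2) + sum(y^2)) over all pixels."""
    intersection = sum(x * y for x, y in zip(pred, target))
    denominator = sum(x * x for x in pred) + sum(y * y for y in target)
    return 1.0 - 2.0 * intersection / (denominator + EPS)


def total_loss(loss_prob, loss_thr, loss_db, alpha=5.0, beta=10.0):
    """Weighted sum alpha * Loss_Prob + beta * Loss_Thr + Loss_DB."""
    return alpha * loss_prob + beta * loss_thr + loss_db


prob_pred, prob_gt = [0.9, 0.2, 0.8], [1.0, 0.0, 1.0]
thr_pred, thr_gt = [0.3, 0.5, 0.4], [0.35, 0.5, 0.3]
binary_pred, binary_gt = [0.95, 0.05, 0.9], [1.0, 0.0, 1.0]

loss = total_loss(
    bce_loss(prob_pred, prob_gt, range(3)),
    l1_loss(thr_pred, thr_gt, range(3)),
    dice_loss(binary_pred, binary_gt),
)
print(loss)
```

With \(\alpha = 5\) and \(\beta = 10\), the threshold map term carries the largest weight per unit of loss, so errors in the predicted thresholds are penalized most strongly.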

The training curves for the overall loss function Total_Loss, the probability map loss function, the threshold map loss function, and the binary map loss function are illustrated in Fig. 10.

Fig. 10 Loss curves of training

Experiments and analysis

Experimental settings

The proposed STEF model is implemented in Python and trained using the PyTorch framework [44] and the MMOCR library [45]. The model runs on a Linux system with an Nvidia Tesla V100 graphics card with 16 GB of VRAM. We employed the Adam optimizer for training STEF, with the momentum parameter set to 0.9. The initial learning rate was set to 1e−3, and a Poly strategy was adopted for learning rate decay. The weight decay coefficient was set to 0.0001. The batch size was set to 6 during training, and the model was trained for 20 epochs. Fivefold cross-validation was performed to enhance the robustness and generalizability of the model.
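For readers unfamiliar with the Poly strategy, the decay rule can be sketched as below. This is a generic illustration rather than the configuration used for STEF: the exponent `power = 0.9` and the floor `min_lr = 0.0` are assumed values that are not reported here, and `poly_lr` is a hypothetical helper name.

```python
def poly_lr(base_lr, step, max_steps, power=0.9, min_lr=0.0):
    """Polynomial decay: (base_lr - min_lr) * (1 - step / max_steps) ** power + min_lr."""
    progress = min(step, max_steps) / max_steps
    return (base_lr - min_lr) * (1.0 - progress) ** power + min_lr


# Learning rate at the start of each of the 20 epochs, from a base rate of 1e-3.
schedule = [poly_lr(1e-3, epoch, 20) for epoch in range(20)]
for epoch, lr in enumerate(schedule):
    print(epoch, lr)
```

The rate starts at the initial value and decays smoothly towards the floor, which keeps early updates large and late updates small.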

Evaluation metrics

To evaluate the detection performance of the model, this study utilizes three commonly used text detection evaluation metrics. The recall rate \(R\) represents the proportion of the ground-truth text that has been detected, calculated as in Eq. 17. The precision rate \(P\) indicates the proportion of detected text that is correct, calculated as in Eq. 18. The F-measure, the harmonic mean of precision and recall, combines both, calculated as in Eq. 19:

$$\begin{aligned} \text {Recall}= & {} \frac{TP}{TP + FN} \end{aligned}$$
$$\begin{aligned} \text {Precision}= & {} \frac{TP}{TP + FP} \end{aligned}$$
$$\begin{aligned} F\text {-measure}= & {} \frac{2 \times \text {Recall} \times \text {Precision}}{\text {Recall} + \text {Precision}} \end{aligned}$$

Here, \(TP\) denotes the total number of text areas correctly predicted by the model, \(FP\) represents the total number of text areas incorrectly predicted, and \(FN\) stands for the total number of text areas that were not predicted.
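As a short worked example, the three metrics can be computed from the counts \(TP\), \(FP\), and \(FN\) using their standard definitions, in which precision divides by all predicted regions and recall divides by all ground-truth regions. The counts below are made-up illustrative numbers, not results from the experiments, and `detection_metrics` is a hypothetical helper name.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Return (precision, recall, f_measure) from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative counts: 80 correct detections, 10 false alarms, 20 misses.
precision, recall, f_measure = detection_metrics(tp=80, fp=10, fn=20)
print(precision, recall, f_measure)
```

Because the F-measure is a harmonic mean, it stays close to the lower of the two values, so a detector cannot score well by trading all of its recall for precision or vice versa.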

The Frames Per Second (FPS) metric assessed our model’s real-time processing capability. FPS is calculated as the ratio of the number of frames processed to the total time taken for processing, as expressed by the formula:

$$\begin{aligned} \text {FPS} = \frac{\text {Processed Frames}}{\text {Total Processing Time}} \end{aligned}$$
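A simple way to measure this ratio is to time a loop over the test images, as in the sketch below. It assumes a callable `process_frame` that runs inference on one image; both that name and the dummy workload are placeholders for illustration, not part of the authors' evaluation code.

```python
import time


def measure_fps(process_frame, frames):
    """Return processed frames divided by total wall-clock processing time."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def dummy_inference(frame):
    """Placeholder workload standing in for a real detector forward pass."""
    return sum(range(10000))


fps = measure_fps(dummy_inference, list(range(20)))
print(fps)
```

When timing a GPU model, the measurement typically needs a few warm-up iterations and device synchronization before reading the clock, otherwise asynchronous execution can make the reported FPS look higher than it is.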

Superiority studies

Numerous text detection models have been proposed. To demonstrate the superiority of our proposed model, extensive experiments were conducted, comparing these text detection models with our proposed STEF. Furthermore, comparisons were made among different modules to estimate our proposed modules’ superiority.

Comparison with text detection models

To demonstrate the superiority of the proposed STEF, extensive comparative experiments were conducted with text detection models on the DBD400 dataset. Precision, Recall, F-measure, and FPS were used as metrics. We trained and tested the models under the experimental setup described in the “Experimental settings” section, ensuring that these models could converge to their optimal values. The experimental results are presented in Table 1.

As illustrated in Table 1, the STEF introduced in this study achieves the best detection results, with a top-ranked Precision of 88.88%, Recall of 88.65%, and F-measure of 88.76%. Compared to the second-ranked model, DBNet++, STEF has enhanced Precision by 0.45%, Recall by 1.46%, and F-measure by 0.96%, in addition to exhibiting a higher inference speed in terms of FPS. These findings affirm the superiority of STEF. Figure 11 demonstrates that our model strikes a favorable balance between F-measure and FPS.

Table 1 Evaluation of excellent text detection models on the proposed DBD400 testing set

Fig. 11 The comparisons of backbones’ performance in accuracy (F-measure) and speed

Comparison with different backbone networks

We investigate the performance of various backbone networks, as detailed in Table 2 and illustrated in Fig. 12. These include traditional Convolutional Neural Networks (CNNs) such as ResNet18, ResNet50, and MobileNetV2, as well as Transformer-based architectures such as Pyramid Vision Transformer and Swin Transformer with different depth configurations (i.e., varying numbers of Transformer Blocks). Among the CNNs, Res2Net and ResNeXt achieved good accuracy, with F-measures of 87.81% and 87.73%, respectively. This demonstrates the effectiveness of advanced convolutional structures in capturing complex text features. However, the frame rates of these models, at 1.77 FPS and 1.82 FPS, respectively, were lower than those of the models based on the Swin Transformer. The number of Transformer Blocks also affects model performance. For example, Swin Transformer-6 (indicating the use of 3 Transformer Blocks) achieved an F-measure of 88.76% while maintaining a processing speed of 5.03 FPS, showing an excellent balance between accuracy and speed, as shown in Fig. 13. In contrast, Swin Transformer-8 (indicating the use of 4 Transformer Blocks) and Swin Transformer-10 (indicating the use of 5 Transformer Blocks) saw a performance drop, with F-measures of 86.67% and 84.00%, respectively, confirming that adjusting the number of Transformer Blocks significantly impacts performance.

Table 2 Evaluation of different backbones on the proposed DBD400 testing set

Fig. 12 The comparisons of backbones’ performance (PVT: PyramidVisionTransformer, RN18: ResNet18, RN50: ResNet50, MNV20: MobileNetV2, ST10: SwinTransformer-10, ST12: SwinTransformer-12, ST8: SwinTransformer-8, R2N: Res2Net, RX: ResNeXt, ST6: SwinTransformer-6)

Fig. 13 The comparisons of models’ performance in accuracy (F-measure) and speed

Comparison with different feature fusion networks

Table 3 showcases the impact of different feature fusion modules on text detection. In our experiments, the proposed FPEM+FUSION module achieved significant improvements in precision, recall, and F-measure, reaching 88.88%, 88.65%, and 88.76%, respectively, with only a 0.5 M increase in the number of parameters (from the original 1.29 M to 1.79 M). These results demonstrate the effectiveness of FPEM+FUSION in enhancing text detection performance and prove that significant performance improvements can be achieved in designing efficient and powerful text detection systems through meticulous feature fusion strategies.

Table 3 Evaluation results of different feature fusion techniques on the proposed DBD400 testing set

Comparison of parameter settings in loss functions

We experimentally analyze the impact of the weights of the probability map loss and the threshold map loss, \(\alpha\) and \(\beta\), within the loss function. As illustrated in Fig. 14, the model’s performance peaked when \(\alpha\) and \(\beta\) were set to 5 and 10, respectively. With this setting, precision, recall, and F-measure all reached their highest values, highlighting the importance of appropriately setting the weights for different components of the loss function to optimize model performance.

Fig. 14 3D surface plots of model performance metrics over different \(\alpha\) and \(\beta\) parameters

Ablation study

We conduct a detailed investigation into the impact of the FUSION module, differentiable binarization (DB), and deformable convolution (deforconv) on the performance of the text detection model. As shown in Table 4, six configurations were tested to evaluate their individual and combined effects. The experimental results indicated that introducing the FUSION module alone slightly improved the model’s precision, recall, and F-measure, validating the effectiveness of the FUSION module in capturing textual features. Upon incorporating the differentiable binarization, a significant leap in model performance was observed, with precision increasing to 84.29%, recall to 81.23%, and F-measure to 82.73%. This demonstrates the crucial role of differentiable binarization in handling the structure of Dongba manuscripts. While the application of deformable convolution alone did not show as marked an improvement as the differentiable binarization, it revealed the potential of deformable convolution in processing texts with irregular shapes, with precision, recall, and F-measure reaching 80.62%, 83.97%, and 82.26%, respectively. When the FUSION module, differentiable binarization, and deformable convolution were used together, the model performance peaked at a precision of 88.88%, recall of 88.65%, and F-measure of 88.76%. These results highlight the value of each component in enhancing text detection performance and, more importantly, point to their synergistic effect when combined.

Table 4 Ablation experiment

Visualization experiments

We qualitatively compare the Dongba manuscript detection results across seven models, as illustrated in Fig. 15. With MaskRCNN [46], the detection boundaries were unclear, and decorations and borders in the image were mistakenly identified as text. The TextSnake model sometimes incorrectly detected single characters as multiple characters and similarly misidentified decorations and borders as text. DBNet, PAN, and PSENet tended to erroneously merge characters that were close together, and they also misidentified decorations and borders as text. Notably, DBNet could only detect parts of characters in some instances, and both MaskRCNN and DBNet occasionally misjudged stains as text. DBNet++ also merged characters that were close together but successfully avoided misidentifying decorations and borders in the image as text. The STEF model proposed in this study demonstrated the best detection results, clearly identifying each Dongba character, although it occasionally misjudged some decorations as text.

Fig. 15 Visualization of results from different models

Conclusions and future works

We propose a novel algorithm for detecting complex Dongba manuscripts, utilizing the Swin Transformer as the backbone network for feature extraction. By integrating a Feature Pyramid Enhancement and Fusion module, the algorithm adaptively selects and integrates local and global features, focusing the model’s attention on regions with significant features. The differentiable binarization operation accurately distinguishes foreground Dongba manuscripts from the background. Additionally, introducing deformable convolution enhances the model’s capability to capture Dongba characters of varying sizes. Experimental results demonstrate that the proposed algorithm achieves a recall rate of 88.88%, a precision rate of 88.65%, and an F-measure of 88.76%, outperforming other algorithms. Based on this algorithm, a wide variety and large quantity of Dongba manuscripts can be detected efficiently. Future work will explore the application of the algorithm in areas such as classification, recognition, segmentation, retrieval, and translation of Dongba manuscripts, further expanding its practicality and impact.

Availability of data and materials

Not applicable.


  1. He L. Discussing the inheritance of Dongba culture. Soc Sci Yunnan. 2004;01:83–7.

  2. Goagan. Exploring the splendors of Dongba culture. Ethnic Art Studies. 1999;02:71–80.

  3. Yang Y, Kang H. Research on the extracting algorithm of dongba hieroglyphic feature curves. J Graph. 2019;40(03):591–9.

  4. Hu Y. Digital preservation of the naxi dongba manuscripts. Lantai World. 2012;02:2–3.

  5. Xing J, Bi X, Weng Y. A multi-scale hybrid attention network for sentence segmentation line detection in dongba scripture. Mathematics. 2023.

  6. Shen T, Zhuang J, Li W, Wang Y, Xia Y, Zhang Z, Zhang X, Yang J. Research on recognition of dongba script by a combination of hog feature extraction and support vector machine. J Nanjing Univ Nat Sci. 2020;56(6):870–6.

  7. Xu X, Jiang Z, Wu G, Wang H, Wang N. Research on recognition of dongba script by a combination of hog feature extraction and support vector machine. J Electr Meas Instrum. 2017;31(01):150–4.

  8. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97.

  9. Epshtein B, Ofek E, Wexler Y. Detecting text in natural scenes with stroke width transform; 2010. p. 2963–70.

  10. Matas J, Chum O, Urban M, Pajdla T. Robust wide-baseline stereo from maximally stable extremal regions. British machine vision computing 2002. Image Vis Comput. 2004;22(10):761–7.

  11. Liao M, Shi B, Bai X, Wang X, Liu W. Textboxes: a fast text detector with a single deep neural network. In: Proceedings of the thirty-first AAAI conference on artificial intelligence. AAAI’17; 2017. p. 4161–7.

  12. Liao M, Shi B, Bai X. Textboxes++: a single-shot oriented scene text detector. IEEE Trans Image Process. 2018;27(8):3676–90.

  13. Liao M, Zhu Z, Shi B, Xia G, Bai X. Rotation-sensitive regression for oriented scene text detection. In: 2018 IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2018. p. 5909–18. IEEE Computer Society, Los Alamitos, CA, USA

  14. Xie L, Liu Y, Jin L, Xie Z. Derpn: taking a further step toward more general object detection. arXiv; 2018.

  15. He T, Tian Z, Huang W, Shen C, Qiao Y, Sun C. An end-to-end textspotter with explicit alignment and attention. arXiv: computer vision and pattern recognition; 2018.

  16. Liao M, Lyu P, He M, Yao C, Wu W, Bai X. Mask textspotter: an end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE Trans Pattern Anal Mach Intell. 2021.

  17. Liao M, Pang G, Huang J, Hassner T, Bai X. Mask TextSpotter v3: segmentation proposal network for robust scene text spotting; 2020. p. 706–22.

  18. Long S, Ruan J, Zhang W, He X, Wu W, Yao C. TextSnake: a flexible representation for detecting text of arbitrary shapes; 2018. p. 19–35.

  19. Wang W, Xie E, Li X, Hou W, Lu T, Yu G, Shao S. Shape robust text detection with progressive scale expansion network. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR); 2019.

  20. Wang W, Xie E, Song X, Zang Y, Wang W, Lu T, Yu G, Shen C. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In: 2019 IEEE/CVF international conference on computer vision (ICCV); 2019.

  21. Liao M, Wan Z, Yao C, Chen K, Bai X. Real-time scene text detection with differentiable binarization. Proceedings of the AAAI conference on artificial intelligence; 2020. p. 11474–81

  22. Liao M, Zou Z, Wan Z, Yao C, Bai X. Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Trans Pattern Anal Mach Intell. 2023.

  23. Yuan J, Chen S, Mo B, Ma Y, Zheng W, Zhang C. R-gnn: recurrent graph neural networks for font classification of oracle bone inscriptions. Herit Sci. 2024;12(1):30.

  24. Garz A, Diem M, Sablatnig R. Detecting text areas and decorative elements in ancient manuscripts. In: 2010 12th international conference on frontiers in handwriting recognition; 2010.

  25. Asi A, Cohen R, Kedem K, El-Sana J, Dinstein I. A coarse-to-fine approach for layout analysis of ancient manuscripts. In: 2014 14th international conference on frontiers in handwriting recognition; 2014.

  26. Roman-Rangel E, Marchand-Maillet S. Shape-based detection of maya hieroglyphs using weighted bag representations. Pattern Recogn. 2015;48(4):1161–73.

  27. Chen S, Han X, Lin X, Liu Y, Wang M. MSER and CNN-based method for character detection in ancient YI books. J S China Univ Technol Nat Sci Ed. 2020;48(06):123–33.

  28. Tang M, Xie S, Liu X. Detection and recognition of handwritten characters in Shuishu ancient books based on faster-RCNN. J Xiamen Univ Nat Sci. 2022;61(02):272–7.

  29. Xu Z, Zhu J, Liu Y, Xu Z, Yan S, Wang C. Research on arbitrary shape tibetan text detection with graph network. In: 2022 international conference on image processing, computer vision and machine learning (ICICML); 2022. p. 452–6.

  30. Xing J, Bi X, Weng Y. A multi-scale hybrid attention network for sentence segmentation line detection in dongba scripture. Mathematics. 2023;11(15):3392.

  31. Wang Y. Research on the detection and recognition algorithm of dongba character based on deep learning. Master’s thesis, Nanjing University; 2021.

  32. Archives L, China DH. The Naxi Dongba ancient scriptures. Accessed 29 Feb 2024.

  33. Center, N.L.N.A.B.P., Welfare, B.P., Institute, L.D.C.R., Changting, T. Genesis Knowledge Database in combined Dongba and Chinese scripts. Accessed 20 Feb 2024.

  34. Library H. Naxi manuscripts. Accessed 01 Mar 2024.

  35. Baidu: paddlepaddle. Accessed 17 Jan 2024.

  36. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR); 2016.

  37. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C. Mobilenetv2: inverted residuals and linear bottlenecks. In: 2018 IEEE/CVF conference on computer vision and pattern recognition; 2018.

  38. Wang W, Xie E, Li X, Fan D-P, Song K, Liang D, Lu T, Luo P, Shao L. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: 2021 IEEE/CVF international conference on computer vision (ICCV); 2021.

  39. Gao S-H, Cheng M-M, Zhao K, Zhang X-Y, Yang M-H, Torr P. Res2net: a new multi-scale backbone architecture. IEEE Trans Pattern Anal Mach Intell. 2021.

  40. Xie S, Girshick R, Dollar P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR); 2017.

  41. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin transformer: Hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF international conference on computer vision (ICCV); 2021.

  42. Wang W, Xie E, Song X, Zang Y, Wang W, Lu T, Yu G, Shen C. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In: 2019 IEEE/CVF international conference on computer vision (ICCV); 2019.

  43. Howard A, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H. Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv: computer vision and pattern recognition; 2017.

  44. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S. Pytorch: an imperative style, high-performance deep learning library. Neural information processing systems; 2019.

  45. OpenMMLab: MMOCR: OpenMMLab text detection, recognition and understanding toolbox; 2021. Accessed 03 Jan 2024.

  46. He K, Gkioxari G, Dollar P, Girshick R. Mask r-CNN. IEEE Trans Pattern Anal Mach Intell. 2020;42:386–97.

  47. Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR); 2015.


This work was supported by the Project of Southwest University Graduate Student Research and Innovation (SWUB23053); the project of the Ministry of Education of the People’s Republic of China on the humanities and social sciences, “The collation, arrangement, and research of the inscriptions on oracle bones of Shang Dynasty based on database technology”, under Grant 22JZD036; in part by the Science and Technology Research Program of Chongqing Municipal Education Commission under Grant KJQN202203306; in part by the Chongqing Language Research Project under Grant yyk21223; in part by the Chongqing Natural Science Foundation under Grant cstc2021jcyj-msxmX0417; the Project of Chongqing Municipal Education Commission Science and Technology Research (KJZD-K202200203); the Project of Chongqing Ecological Environment Big Data Application Center (CQHJDSJYY-2023-013); the Projects of Chongqing Science and Technology Bureau (cstc2019jscx-gksbX0103, cstc2020ngzx0010); and the Fundamental Research Funds for the Central Universities of China (SWU2009107).

Author information



Y.M. designed the study, conducted the experiments and discussions, and mainly wrote the article; S.C. provided overall guidance and supervision of the study and proposed an optimized protocol; Y.L. provided experiment-related datasets and assisted in article calibration; J.L. assisted in the query and sorting of the literature; Q.Y. and W.X. helped in the proofreading of the article; H.X. and X.L. assisted in the calibration of this article. All authors reviewed the manuscript.

Corresponding author

Correspondence to Shanxiong Chen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Ma, Y., Chen, S., Li, Y. et al. STEF: a Swin Transformer-Based Enhanced Feature Pyramid Fusion Model for Dongba character detection. Herit Sci 12, 206 (2024).
