Exploring spatiotemporal changes in cities and villages through remote sensing using multibranch networks

With the rapid development of the social economy, monumental changes have taken place in the urban and rural environments. Urban and rural areas play a vital role in the interactions between humans and society. Traditional machine learning methods are used to perceive the massive changes in the urban and rural areas, though it is easy to overlook the detailed information about the changes made to the intentional target. As a result, the perception accuracy needs to be improved. Therefore, based on a deep neural network, this paper proposes a method to perceive the spatiotemporal changes in urban and rural intentional connotations through the perspective of remote sensing. The framework first uses multibranch DenseNet to model the multiscale spatiotemporal information of the intentional target and realizes the interaction of high-level semantics and low-level details in the physical appearance. Second, a multibranch and cross-channel attention module is designed to refine and converge multilevel and multiscale temporal and spatial semantics to perceive the subtle changes in the urban and rural intentional targets through the semantics and physical appearance. Finally, the experimental results show that the multibranch perception framework proposed in this paper has the best performance on the two baseline datasets A and B, and its F-Score values are 88.04% and 53.72%, respectively.


Introduction
With the continuous development of the social economy, human living standards have undergone tremendous changes. Cities and villages, as gathering places for human social interaction and activities, have also experienced massive changes in recent years. In addition, the cities and villages that people call home not only reflect their lifestyles, but also affect their physical health, mental health and social well-being. Exploring the urban and rural environmental changes from the acquired remote sensing data helps to understand the development of society and the economy in depth. At the same time, it can also effectively judge whether it is necessary to further improve the infrastructure construction and the quality of life in these urban and rural spaces.
In recent years, the application of intelligent computer interpretation technology in many fields, such as natural language processing (NLP), image classification, and object detection (OD), has provided a new way of evaluating city and village environmental changes (including buildings, infrastructure and heritage). In the early stages of urban and rural change research, people usually used a variety of methods to simulate and measure the construction, cultural heritage, infrastructure, and environment of a certain area, using digital models to obtain useful information and build urban forms and urban environments. However, with the growing complexity of urban and rural environments, digital modeling has difficulty meeting the increasing application requirements. At the same time, affected by the exponential growth of big data, it is difficult to obtain effective detailed information with this kind of simulation modeling method. Moreover, digital modeling is usually oversimplified and is therefore unsuitable for some studies, including studies of the changes in infrastructure. Additionally, because it neglects the natural landscapes of cities and villages, it has proven to be less effective.
*Correspondence: zhaomengqi@whut.edu.cn; School of Civil Engineering and Architecture, Wuhan University of Technology, Wuhan 430000, China
To obtain more detailed information from relevant data and to effectively simulate urban and rural environmental changes, many machine learning and deep learning algorithms have been developed. For instance, Naik et al. [1] fed data collected from Google into a machine learning model to rate the safety, resident wealth and vitality index of city blocks and to generate new neighborhood semantic information. Gebru et al. [2] proposed a deep learning method that uses a large number of geotagged street images to estimate, among other things, the socioeconomic situation of districts in the United States. Li et al. [3] presented an urban landscape study method combining deep convolutional neural networks (DCNNs) and street-level images, which accurately recognized different urban features from these images. Machine learning and deep learning methods also have a strong modeling ability for complex, large-scale data: applied to large-scale urban data affected by factors such as occlusion and zoom, they learn the location or category of the target object through supervised training [4,5]. Obeso et al. [6] adopted deep convolutional neural networks to train and predict visual attention in natural images to address the classification problem of Mexican cultural heritage. Morbidoni et al. [7] proposed a method for learning from synthetic point cloud data for the semantic segmentation of historical buildings, mainly to provide a first assessment of the use of synthetic data to drive convolutional network-based semantic segmentation in this context.
Although the above methods detect urban and rural environments to a certain extent, they basically focus on classification and segmentation tasks, such as the content classification and recognition of buildings, while ignoring the changes in the urban and rural spatiotemporal environment. At the same time, these methods pay little attention to the image feature extraction process, so subtle changes in the target object are missed and the temporal and spatial semantics are poorly perceived. We address these issues and explore changes in the urban and rural environment, form and infrastructure from the perspective of time and space. We present a novel spatiotemporal perception method to explore changes in cities and villages from remote sensing with multibranch networks. We aim to build visual spatiotemporal perception models that can be used to estimate environmental, form and infrastructure changes in urban areas and villages, while promoting the development of social research and improving people's lifestyles.
In summary, the main work in the paper is as follows:
• Frameworks: A methodology for exploring the spatiotemporal changes in city and village environments, forms and infrastructure from remote sensing. The main aim is to build a relationship between human visual perspectives and perceptions that helps to understand the changes in social development and improves the effectiveness of social statistics.
• Technology: We present a novel perception framework using multibranch networks. First, the method uses a multibranch attention network to model remote sensing images of the same area at different time periods, forming information sharing in time and space. Second, through this information, the model can perceive subtle changes in different targets in cities and villages, including target positions, physical structures and geometric shapes. It further establishes temporal and spatial dependence on different scales to generate better representations to complete relevant statistics and reasoning.
• Application: The framework targets subsequent tasks such as urban planning, intention target statistics and disaster evaluation. The perception framework proposed in this paper is tested and verified on baseline datasets A and B, using preprocessing methods with rotation and noise addition. The final experimental results show that our proposed framework achieves good results and perceives the average area of urban and rural intentional changes.
The rest of the paper is organized as follows: In Section 2, we elaborate on the related work on the image perception of urban and village environments, forms, infrastructure, etc. Section 3 describes our proposed perception framework in detail. Section 4 discusses the processing of the datasets and the application, and then presents the detailed experimental results and describes the changes. Section 5 provides a brief summary and possibilities for future work.

Related works
In this section, we elaborate on the related work on the image perception of form, infrastructure, etc. in cities and villages. The work is primarily divided into traditional methods and deep neural network methods for perceiving urban and village environmental forms or the architectural elements of infrastructure in visual content. It is worth noting that deep neural network methods mainly focus on tasks such as image classification, segmentation and detection.

The traditional image perception of cities and villages
Currently, image datasets are widely used in many fields of urban and village research and planning; the main application fields include regional city systems, city and village spatial structures, infrastructure service systems, transportation and travel, and collective social activities. However, with the continuous development of society and the economy, people's application needs are gradually increasing, and it is time-consuming and expensive to collect relevant information through manual statistics. Thus, many researchers have developed algorithms to perceive urban and rural areas from different perspectives and obtain effective information, such as the form and environment of the plant. For instance, Hu et al. [8] proposed an effective method of typology analysis, which used computer technology to check the content of different images and adopted clustering methods to judge the activity levels of different types of users on Instagram. Hochman et al. [9] proposed a spatiotemporal pattern analysis method based on Instagram, designed to visualize the characteristics of image content from 13 different cities around the world and make corresponding comparisons to further describe people's daily activities, culture, etc. To facilitate the interaction between users and existing image datasets and to further extend the scale of these datasets via social media, Jett et al. [10] presented a feedback framework for transferring user-generated information to institutional data providers, which can improve the service scope of the dataset center. This approach mainly targets cultural heritage institutions, which can also enhance their collections by sharing content through popular web services.
The abovementioned methods mainly use simple visual methods to analyze images of cultural heritage, residents' living conditions and their environments that circulate on social media during disasters. Although they realize quick and simple statistics and further expand the relevant databases, they cannot perceive changes at a deeper level, such as damage to residential areas, cultural buildings and other infrastructure in a disaster.
There are also many researchers who focus on identifying urban or rural building structures from user-generated natural images and analyzing the relevant characteristics of buildings. For example, Li et al. [11] addressed the sustainable development of cities and the effective identification of urban functional areas, combining multisource geographic data to establish a quantitative measurement method for urban functional areas. Bose et al. [12] took the Siliguri metropolitan area in West Bengal, India as the research object, proposed a novel study method based on the Markov chain model and analyzed the spatial distribution of urban land. To scientifically plan the urbanization layout and improve the utilization rate of land space, Liu et al. [13] identified and analyzed urban functional areas from the perspective of data mining, using taxi trajectory data as the research basis, and proposed a DTW-based K-nearest neighbor classification algorithm for the cluster recognition of urban functional areas. Although these methods can effectively identify the functional areas of the city, they do not effectively combine the temporal and spatial information of the city and the countryside in the analysis and statistics process. When the environment is complex, it is difficult to distinguish the functional areas efficiently and accurately, and the cultural heritage, buildings and roads in the functional areas are not analyzed in detail.
Other researchers pay more attention to the perception of the form and infrastructure of residential areas in urban functional areas. To evaluate building energy in the city, Tardioli et al. [14] proposed a new method to identify building clusters and provided a dataset of representative buildings. The method is mainly divided into three parts: building classification, building clustering and prediction. Gadal et al. [15] considered that hyperspectral remote sensing images can describe surface objects and landscapes more accurately and proposed a classification method based on an urban target spectral database to detect and classify specific urban targets. Manzoni et al. [16] combined synthetic aperture radar (SAR) images and geospatial information systems to develop a simple and fast method for identifying structural changes in buildings in urban environments. This method can effectively evaluate small changes after disasters.

The deep learning image perception of cities and villages
Although these methods can reduce the errors caused by handcrafted features, in a complex environment it is difficult for simple machine learning to effectively capture the detailed changes of a target (such as buildings, roads and bridges) in form, physical structure, or geometric shape in the image. Thus, deep learning techniques are widely used in tasks such as urban planning, urban building classification, and urban form perception. Llamas et al. [17] presented a novel method for the classification of architectural heritage images with deep convolutional neural networks. The main objective of that work is to introduce the application of deep learning techniques to the classification of images of architectural heritage, specifically through the use of convolutional neural networks. These methods enable better management and a more effective search of urban architectural heritage, and they are also beneficial for studying and interpreting the heritage asset in question.
With the rapid development of urban areas and villages, construction waste is widely distributed, easily confused with the surrounding environment and difficult to classify manually. At the same time, traditional single-spectral feature analysis has difficulty extracting and identifying information related to urban construction waste. Thus, Chen et al. [18], utilizing the multifeature analysis of remote sensing images, developed a method for extracting urban construction waste information from the optimal VHR image combined with a morphological index and hierarchical segmentation. Attari et al. [19] assessed the extent of damage to urban and village building structures after a disaster and, using UAV imagery, proposed a fine-grained classification method called Nazr convolutional neural networks (Nazr-CNN) to conduct damage assessment. Vetrivel et al. [20] suggested that, to improve the performance of damage detection, CNN features and 3D point cloud information of the target object in the image be extracted separately and combined in a multiple-kernel learning framework to achieve classification, finally performing damage detection on building roofs and other objects. Subsequently, Hamdi et al. [21] presented a forest damage assessment method based on deep learning, whose backbone network is mainly U-Net.
Although these methods have achieved good results in the postdisaster assessment, they mainly focus on the use of UAV images and hyperspectral remote sensing images.
In recent years, some researchers have used images collected on social media to perceive the ideology of cities and villages. For example, for disasters with a lack of labeled data, Li et al. [22] proposed a domain-adaptive countermeasure neural network method to recognize disaster images and detect damaged areas. Meng et al. [23] verified the correlation between the physical health of the elderly and the urban space using the Baidu Street View (BSV) of the Macau Peninsula as the research scene, with deep learning technology used to perceive the high-density urban street space. Kim et al. [24] proposed understanding tourists' urban images from geotagged photos using convolutional neural networks. With the continuous increase in the urban population, human gathering areas have gradually evolved into a locally dense, dynamic distribution in time and space. To better understand the urban environment, Chen et al. [25] constructed an advanced image recognition model and used tagged Flickr pictures to train the neural network to quantify the feature information of different cities. Jayasuriya et al. [26] presented a novel PMD localization method for urban streets via convolutional neural networks. The method combines two components: one uses a CNN to extract the feature information of infrastructure such as roads, lane markings, and manhole covers for localization, and the other uses a CNN to detect common environmental landmarks, such as tree trunks, for positioning. To further enhance human perception of urban and village forms, the environment and the infrastructure, Wang et al. [27] presented a new multitask and multimodal deep learning framework with automatic loss weighting to assess the damage after disastrous events. Agarwal et al. [28] proposed a multimodal damage analysis approach, called Crisis-DIAS, covering deployment, challenges and assessment.
In addition, other related two-branch neural networks have been proposed, such as the Fractional Gabor Convolutional Network (FGCN) by Zhao et al. [29,30] and the information fusion and Patch-to-Patch CNN for remote sensing image tasks by Zhang et al. [31,32], along with an approach that applies the manner of word embedding to image processing [33].
In summary, although the above methods use deep learning technology to improve people's perceptions of the social environment and form, most of them use simple deep learning methods to classify, segment, and detect the corresponding image data and are not sensitive to spatiotemporal information. At the same time, a large amount of detailed information is ignored in the process of target feature extraction, so the feature information cannot effectively describe the target (urban and rural buildings, roads, etc.), ultimately leading to large perceptual errors. Moreover, these methods do not take into account the changes in the same area at different time periods.

Our proposed methods
In this section, we will elaborate on our proposed spatiotemporal perception framework from three aspects: the feature extraction of urban and village images, the network structure of the backbone and the adjustment and optimization.

Overview
With the rapid development of society and the economy, field surveys of urban and rural residents' gathering places and other nongathering places show that there are huge differences in the forms, environments, and infrastructures presented in different regions and at different times. For example, the distribution of residential areas and functional areas is irregular, and the distribution of the environment and infrastructure also changes with changes in the gathering place. When external factors are more complex, traditional machine learning methods used to perceive changes in the same area at different periods of time are susceptible to interference from factors such as light and occlusion, resulting in larger perception errors and affecting subsequent applications. Deep learning methods have a strong self-learning ability and can use the activation state of the neurons in the network structure to capture the detailed information of the urban and rural targets in the image, as well as high-level abstract distinguishable information, to improve the perception accuracy. Therefore, we detect subtle changes in the urban and village environment, form and infrastructure in different time periods from limited remote sensing data to improve the perception accuracy and the efficiency of subsequent applications, such as the statistics of urban planning and environmental information. We propose a spatial-temporal sensing method to detect urban and rural changes from the perspective of remote sensing. The method mainly includes spatial branches and temporal branches. The temporal branch embeds the urban and village images of the same area in different time phases to enhance the interaction between images in different time phases and establish effective dependencies.
For spatial branching, the main purpose is to model the target object in the image to form a strong difference within or between classes so that it has better recognition. The network structure of our proposed spatiotemporal perception framework is shown in Fig. 1.
Fig. 1 The network structure of the proposed spatiotemporal perception framework. $x^{(1)}$ and $x^{(2)}$ indicate the input remote sensing images. $f^{Spatial}$ and $f^{Temporal}$ indicate the spatial information and temporal information obtained via the feature extraction module, which mainly contains densely connected convolutional networks (DenseNet-121). $H$, $W$ and $C$ indicate the height, width and channel, respectively. $\gamma$ and $\varepsilon$ indicate the subspaces of the temporal and spatial feature maps, with $(\frac{C}{\gamma})' = \frac{C}{8\gamma}$ and $(\frac{C}{\varepsilon})' = \frac{C}{8\varepsilon}$. $y^{(1)}$ and $y^{(2)}$ indicate the output features via STPM, where STPM indicates the layers of spatiotemporal perception. $\tau_{Total}$ indicates the total loss of our framework. C-CHA indicates the cross-channel attention component, and GNPA indicates the Group-Norm position attention component. $Conv_{7 \times 7}(\cdot)$ indicates a convolution operation with a kernel size of $7 \times 7$, $MP_{3 \times 3}(\cdot)$ indicates a max pooling operation with a kernel size of $3 \times 3$, and $Conv_{1 \times 1}(\cdot)$ indicates a convolution operation with a kernel size of $1 \times 1$. GN indicates the Group-Norm operation. $\times$ indicates the elementwise product operation.
Considering that urban and rural images in the same area at different times have both relevance and spatial and temporal differences, we set the input images to $x^{(1)} \in \mathbb{R}^{C \times H \times W}$ and $x^{(2)} \in \mathbb{R}^{C \times H \times W}$, respectively, where $H$, $W$ and $C$ indicate the height, width and channel, and the image size of the inputs is $256 \times 256 \times 3$. The feature information output by the feature extraction module is $f^{Spatial}, f^{Temporal} \in \mathbb{R}^{C \times H \times W}$, where $C$ indicates the channel dimension. The spatiotemporal feature information is refined into attention feature maps $y^{(1)}$ and $y^{(2)}$ via a spatiotemporal perception module, which is mainly composed of efficient channel attention guided squeeze-and-excitation. Then, we resize the optimized feature information to the size of the input remote sensing images.
Meanwhile, we calculate the distance of each pixel pair in the corresponding feature maps and obtain a corresponding distance map $\zeta$ in the proposed optimization update.
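The pixel-pair distance computation can be sketched as follows. This is a minimal illustration, assuming a Euclidean distance taken across channels at each pixel (the metric is not specified in this section); the function name `distance_map` and the toy inputs are hypothetical.

```python
import math


def distance_map(feat_a, feat_b):
    """Per-pixel Euclidean distance between two C x H x W feature maps (nested lists).

    Assumption: the distance is taken across the channel axis at each pixel.
    """
    channels = len(feat_a)
    height = len(feat_a[0])
    width = len(feat_a[0][0])
    zeta = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            squared = sum(
                (feat_a[c][i][j] - feat_b[c][i][j]) ** 2 for c in range(channels)
            )
            zeta[i][j] = math.sqrt(squared)
    return zeta


# Two toy 2-channel, 2 x 2 feature maps that differ at a single pixel.
y1 = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
y2 = [[[0.0, 3.0], [0.0, 0.0]], [[0.0, 4.0], [0.0, 0.0]]]
zeta = distance_map(y1, y2)
```

Large values in the resulting map mark pixels whose features differ between the two time phases; unchanged pixels stay at zero.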

Spatiotemporal feature extraction via DenseNet
In the past ten years, convolutional neural networks and their improved variants [19] have been widely used in urban and rural perception tasks owing to their strong learning ability. They extend the single dimension of the traditional spatial structure to further dimensions, such as morphological structure, intention type (city intention classification) and intention evaluation (disaster assessment) [22,27], to extract better detailed information. Compared with traditional handcrafted or manual field survey methods, methods based on convolutional neural networks are not only more efficient but also perform better. To obtain better detailed information and spatiotemporal information at different scales, we introduce DenseNet to model urban and rural images in different phases, using it as a feature extractor to capture multiscale spatiotemporal information and further enhance perception.
Owing to the large differences in the socioeconomic environments of cities and villages and their different distribution states, such as landscapes, landmark buildings, public places, and cultural function areas, there is a strong spatial correlation between them. At the same time, there are interclass or intraclass differences in a given spatial dimension, and the multiscale DenseNet can highlight these differences through properties such as feature reuse and cross-layer connections, which better represent high-level information. However, the original DenseNet was mainly designed for image classification and cannot be directly used to capture the feature information of urban and rural socioeconomic environments (including buildings, roads, etc.). Therefore, we remove the final fully connected layer and use densely connected blocks at different scales [34] to obtain multiscale information on these intentional targets. The high-level information of DenseNet [35] is semantically accurate, but it cannot effectively determine the position of the intended target, so the position of the intended target in images of the same area cannot be determined across different time phases. The low-level information contains a wealth of physical structure and appearance details. To this end, we fuse the high-level and low-level layers in the spatial dimension to generate more refined representations. We also quantify and evaluate the intentional targets of cities and villages from different angles. It is worth noting that both the temporal branch and the spatial branch use the multiscale DenseNet as the feature extractor. Assume that each densely connected convolutional block (Dense Block) is composed of $l$ layers; the extraction process of the multiscale spatiotemporal features is then … $= T$ (2), where $S$ indicates multiscale information.
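The dense connectivity behind this multiscale extraction can be illustrated by tracking channel counts. The sketch below assumes the standard DenseNet-121 layout (block sizes 6, 12, 24 and 16, 64 initial channels, transition layers that halve the channels) together with a growth rate of 32; whether the proposed multibranch variant keeps exactly this layout is an assumption, and the function names are hypothetical.

```python
def dense_block_channels(in_channels, num_layers, growth_rate):
    """Channels after a dense block: every layer appends `growth_rate` feature maps
    to the concatenation of all earlier outputs."""
    return in_channels + num_layers * growth_rate


def densenet_scale_channels(block_layers=(6, 12, 24, 16), init_channels=64, growth_rate=32):
    """Output channel count of each dense block (one entry per scale)."""
    channels = init_channels
    outputs = []
    for index, num_layers in enumerate(block_layers):
        channels = dense_block_channels(channels, num_layers, growth_rate)
        outputs.append(channels)
        if index < len(block_layers) - 1:
            channels //= 2  # transition layer, assuming a compression factor of 0.5
    return outputs


scale_channels = densenet_scale_channels()
```

Under these assumptions the four scales expose 256, 512, 1024 and 1024 channels, which is the multiscale information that the two branches would fuse.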

Spatiotemporal perceptions with cross-channel interaction attention
To further perceive the changes in the socioeconomic environments of cities and villages in recent years, to strengthen the dependence and location information between the same intentional target in different time phases, and to improve the network's perception of the intentional target, we design a squeeze-and-excitation (SE) [36] enhanced channel attention module that captures the rich global spatiotemporal relationships among the intentional individuals throughout the entire time and space. It also establishes effective long short-term dependencies to highlight the perception of subtle changes and temporal and spatial characteristics, supporting subsequent urban and rural planning, disaster evaluation, and statistics. In addition, this model is intended to provide reliable theoretical support for other tasks. The specific intention perception can be divided into the following steps:
Step 1. … where $s \in S = \{1, 2, 3, 4\}$. We can also consider $f^{Temporal}_1$ to be equal to the output features of Dense block-1 (see Fig. 1). We first fuse the captured multibranch spatiotemporal information, denoted as … (1), where $f^{Temporal}$ and $f^{Spatial}$ indicate the multibranch spatiotemporal information and $Conv_{1 \times 1}$ indicates a convolution operation with a kernel size of $1 \times 1$. In addition, according to Equations 1 and 2, feature information of different scales can be expressed as …
Step 2. We divide $f^{Temporal}$ into $\gamma$ subspaces along the channel dimension of the temporal feature maps and $f^{Spatial}$ into $\varepsilon$ subspaces along the channel dimension of the spatial feature maps. Finally, these subspaces are defined as …
Then, from the specific temporal semantic information of the urban and village intended objects, each subspace $f^{Temporal}_i \in \mathbb{R}^{\frac{C}{\gamma} \times H \times W}$ generates a corresponding coefficient. The structure of the spatial branch is similar to that of the temporal branch, namely, $f^{Spatial}_i \in \mathbb{R}^{\frac{C}{\varepsilon} \times H \times W}$, where $\varepsilon$ is the number of subspaces.
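The channel-wise split of Step 2 can be sketched as follows. This is a minimal illustration, assuming that the channels are divided into equal, contiguous groups; the function name `split_channels` and the toy feature map are hypothetical.

```python
def split_channels(feature, num_subspaces):
    """Split a C x H x W feature map (nested lists) into equal channel groups.

    Assumption: C is divisible by the number of subspaces and groups are contiguous.
    """
    channels = len(feature)
    if channels % num_subspaces != 0:
        raise ValueError("channel count must be divisible by the number of subspaces")
    size = channels // num_subspaces
    return [feature[i * size:(i + 1) * size] for i in range(num_subspaces)]


# Toy temporal feature map with C = 8 channels of size 1 x 1, split with gamma = 4.
f_temporal = [[[float(c)]] for c in range(8)]
subspaces = split_channels(f_temporal, 4)
```

Each of the four resulting subspaces holds C / gamma = 2 channels, matching the shape stated for the temporal subspaces.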
Step 3. To make the module more portable and more conducive to the statistics of global information, we use the spatiotemporal information of the urban and rural intentional targets captured by the temporal branch as the input of the cross-channel attention (CHA) component, and under the condition of no dimensionality reduction, cross dimensionality embedding is performed on the intentional object. The cross-dimensionality embedding of urban and village intended objects via the cross-channel attention component is shown.
The feature information captured by the spatial branch is used as the input of the Group-Norm position attention (GNPA) [37] component to determine the changing position of the urban and rural intentional target, which complements the output information of the cross-channel attention (CHA) component of the other branch. Here, $\delta(\cdot)$ indicates the ReLU activation function. Meanwhile, to ensure efficiency and reliability and to benefit from effective cross-channel interaction between local and global information, the band matrix $W_\gamma$ is used to further improve the cross-channel attention (CHA) component, and it can be expressed as … where $w_\gamma$ indicates the weighting factor.
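A cross-channel interaction without dimensionality reduction can be sketched in the style of efficient channel attention. The code below is an illustration under stated assumptions rather than the component itself: it assumes global average pooling, a band matrix applied as a small one-dimensional window over neighbouring channels, and a sigmoid gate; the function names and the weights are hypothetical.

```python
import math


def global_average_pool(feature):
    """Reduce a C x H x W feature map (nested lists) to one mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def band_interaction(pooled, weights):
    """Band-matrix product: each channel mixes only with its nearest neighbours.

    `weights` is an odd-length window; positions outside the channel range are skipped.
    """
    half = len(weights) // 2
    mixed = []
    for c in range(len(pooled)):
        acc = 0.0
        for offset, w in enumerate(weights):
            idx = c + offset - half
            if 0 <= idx < len(pooled):
                acc += w * pooled[idx]
        mixed.append(acc)
    return mixed


def cross_channel_attention(feature, weights):
    """Rescale every channel by a gate computed from local cross-channel interaction."""
    pooled = global_average_pool(feature)
    gates = [1.0 / (1.0 + math.exp(-v)) for v in band_interaction(pooled, weights)]
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feature)]


# Toy 3-channel, 1 x 2 feature map; the window (0, 1, 0) leaves channels unmixed.
toy = [[[0.0, 0.0]], [[2.0, 2.0]], [[0.0, 0.0]]]
attended = cross_channel_attention(toy, [0.0, 1.0, 0.0])
```

Because the channel descriptor keeps its full length throughout, no information is lost to a bottleneck, which is the property the no-dimensionality-reduction condition refers to.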
Step 4. We share these spatial and temporal branches to make the size of the feature maps the same as that of the initial inputs. The aggregation process is denoted as …
We use different branches to capture the characteristic information of the urban and rural intentional targets, which not only yields better high-level information but also preserves appearance details, establishes a dependency relationship in the spatial and temporal dimensions, and further strengthens the interactivity between humans and the urban and rural intentional targets, improving the ability of follow-up applications.

Optimization
To further improve the representations and perceptibility of this spatiotemporal information for changes in urban and village intention objects, we present a reconstruction loss function. The loss function is given as … where $\alpha$ and $\beta$ are learnable balance factors.
For the spatial and temporal branches, we use the binary cross entropy loss (BCELoss) and the cross entropy loss, namely, $\tau_{spatial}$ and $\tau_{temporal}$,
where $N$ indicates the total number of samples, $y^{(1)}_n$ is the category of the $n$-th sample, $z^{(1)}_n$ is the predicted value of the $n$-th sample, $u \in U$ indicates the number of categories, and $z^{(2)}_{nu}$ indicates the probability that the $n$-th sample belongs to category $u$.
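The two branch losses and their combination can be sketched as follows. The binary cross entropy and cross entropy follow their standard definitions; the weighted sum used for the total loss is an assumption inferred from the balance factors, and the function names and default weights are hypothetical.

```python
import math


def bce_loss(targets, preds, eps=1e-12):
    """Binary cross entropy averaged over N samples (spatial branch)."""
    total = 0.0
    for y, z in zip(targets, preds):
        z = min(max(z, eps), 1.0 - eps)  # clamp to keep the logarithm finite
        total += y * math.log(z) + (1 - y) * math.log(1.0 - z)
    return -total / len(targets)


def ce_loss(labels, probs, eps=1e-12):
    """Cross entropy averaged over N samples (temporal branch).

    `labels[n]` is the true category index and `probs[n][u]` the predicted probability.
    """
    total = 0.0
    for label, row in zip(labels, probs):
        total += math.log(max(row[label], eps))
    return -total / len(labels)


def total_loss(tau_spatial, tau_temporal, alpha=1.0, beta=1.0):
    """Assumed combination: a weighted sum of the two branch losses."""
    return alpha * tau_spatial + beta * tau_temporal


tau_s = bce_loss([1, 0], [0.5, 0.5])
tau_t = ce_loss([0], [[0.5, 0.5]])
tau_total = total_loss(tau_s, tau_t, alpha=0.5, beta=0.5)
```

In a trainable setting the two balance factors would be parameters updated together with the network weights; here they are plain arguments for clarity.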
In summary, we can better perceive changes in the content of the urban and rural intentional targets in this way, achieve as much automated processing of the content as possible, and improve the ability and efficiency of emergency response after disasters. The proposed multibranch networks for exploring spatiotemporal changes in cities and villages are shown in Algorithm 1.

Experimental discussion and analysis of the spatiotemporal perceptions
In this section, we describe our perception results for urban and village intention objects in detail and provide a discussion and analysis.

Data preparation and processing
Because there is no database specifically used to perceive the changes in the urban and rural intentional targets, we screen public baseline datasets, such as LEVIR-CD and SZAB.
LEVIR-CD [38]: The dataset has a total of 637 pairs of 1024 × 1024 remote sensing images and mainly describes the changes in urban and rural buildings in 20 different areas of several cities in Texas, USA, between 2002 and 2018, mainly concentrating on the growth of various types of buildings (such as villas, high-rise apartments, small garages and large warehouses) in cities and villages.
SZAB [39]: This dataset, the SZTAKI-Air-Change-Benchmark, contains 13 pairs of aerial images with a size of 952 × 640 pixels and a spatial resolution of 1.5 m. It mainly includes new urban areas, building construction, the planting of large numbers of trees and new cultivated land.
To better perceive the changes in the urban and rural intentional environments and provide more reliable experimental support for subsequent urban planning, intention type or disaster evaluation, we preprocessed the initial data so that the processed dataset describes the socioeconomic environment of urban and rural areas more comprehensively and is more suitable for urban and rural perception tasks.

Training configuration
To achieve a better perception of city and village intention objects when training our proposed framework, we apply a series of initial settings to the framework and enlarge the datasets with augmentation methods. Augmentation methods such as rotation, noise and color change also effectively compensate for the lack of urban and rural content data. The processing of the datasets is shown in Fig. 2.
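The augmentation step can be sketched as follows. This is a minimal illustration, not the pipeline used in the experiments: it assumes each sample consists of two co-registered images (before and after) and a change mask, that rotation is restricted to multiples of 90 degrees, and that noise and color change are applied to the images only. The function name `augment_pair` and all parameter ranges are hypothetical.

```python
import numpy as np

def augment_pair(before, after, mask, rng):
    """Rotate both dates and the mask identically; add color change and noise to images only."""
    k = int(rng.integers(0, 4))  # rotate by k * 90 degrees (assumed rotation scheme)
    before, after, mask = (np.rot90(a, k) for a in (before, after, mask))
    images = []
    for img in (before, after):
        img = img.astype(np.float32)
        # Color change: one random gain per channel (assumed range 0.8 to 1.2).
        img = img * rng.uniform(0.8, 1.2, size=(1, 1, img.shape[2]))
        # Additive Gaussian noise (assumed standard deviation of 5 gray levels).
        img = img + rng.normal(0.0, 5.0, size=img.shape)
        images.append(np.clip(img, 0, 255).astype(np.uint8))
    return images[0], images[1], np.ascontiguousarray(mask)

rng = np.random.default_rng(0)
before = np.zeros((256, 256, 3), dtype=np.uint8)
after = np.full((256, 256, 3), 200, dtype=np.uint8)
mask = np.zeros((256, 256), dtype=np.uint8)
b_aug, a_aug, m_aug = augment_pair(before, after, mask, rng)
print(b_aug.shape, a_aug.shape, m_aug.shape)
```

Applying the geometric transform to the mask as well keeps the change labels aligned with the rotated images.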
For the network structure of our proposed spatiotemporal perception framework, the scale is set as $s \in S = \{1, 2, 3, 4\}$, the growth rate of the DenseNet-121 is $k = 32$, and the learning rate is set to $1 \times 10^{-4}$. The dropout rate is 0.5 and the number of epochs is 600. To further ensure effective training of the framework, the input remote sensing images are cropped to $256 \times 256$. The datasets are divided into three subsets: training (40%), testing (60%) and validation (10%).
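As an illustration of the cropping step, the sketch below splits an image into non-overlapping 256 × 256 tiles. The grid layout is an assumption; the text states only the crop size, so random or overlapping crops are equally possible. The function name `crop_tiles` is hypothetical.

```python
import numpy as np

def crop_tiles(image, tile=256):
    """Split an H x W (x C) array into non-overlapping tile x tile patches."""
    height, width = image.shape[:2]
    tiles = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            tiles.append(image[top:top + tile, left:left + tile])
    return tiles

levir_image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # LEVIR-CD image size
szab_image = np.zeros((640, 952, 3), dtype=np.uint8)     # SZAB image size (height, width)
print(len(crop_tiles(levir_image)))  # 4 x 4 grid -> 16 tiles
print(len(crop_tiles(szab_image)))   # 2 x 3 grid -> 6 tiles, border pixels discarded
```

Under this assumed scheme, each 1024 × 1024 image yields 16 tiles, while a 952 × 640 image yields 6 tiles and loses the border strip that does not fill a whole tile.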

Evaluation coefficient
To ensure the consistency and validity of the experimental results, we test and verify our experiments with multiple evaluation coefficients: precision (P), recall (R), F-score (F1), average area (AR), parameter quantification (PQ) and time, where "time" indicates the run time of each batch. The calculation of the evaluation indices is shown in the following equations.
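Assuming the standard definitions of precision, recall and F-score, and reading AR as the predicted change area relative to the total image area (an interpretation of the symbols TR and PCR), the indices can be written as below. The counts TP, FP and FN (true positive, false positive and false negative changed pixels) are introduced here for illustration:

```latex
\begin{align}
P &= \frac{TP}{TP + FP}, &
R &= \frac{TP}{TP + FN}, \\
F1 &= \frac{2\,P\,R}{P + R}, &
AR &= \frac{PCR}{TR},
\end{align}
```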
where TR indicates the total area of the remote sensing images and PCR indicates the predicted change area.
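The evaluation indices can be computed from binary change masks as in the following sketch. It assumes pixel-wise counting, the standard definitions of precision, recall and F-score, and AR taken as the ratio PCR / TR; the function name `change_metrics` is hypothetical.

```python
import numpy as np

def change_metrics(pred, truth):
    """Pixel-wise precision, recall, F-score and changed-area ratio for binary masks."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    tp = int(np.sum(pred & truth))    # changed pixels predicted as changed
    fp = int(np.sum(pred & ~truth))   # unchanged pixels predicted as changed
    fn = int(np.sum(~pred & truth))   # changed pixels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    area_ratio = float(pred.sum()) / pred.size  # assumed AR = PCR / TR
    return {"P": precision, "R": recall, "F1": f1, "AR": area_ratio}

pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [1, 0]])
print(change_metrics(pred, truth))
```

The guards against empty denominators return 0.0 when a mask contains no changed pixels, which is a design choice of this sketch rather than something specified in the text.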

Experimental results of the different methods
To demonstrate the effectiveness of our proposed spatiotemporal perception framework, which helps to collect information on urban and rural environments and to improve the responsiveness of tasks such as urban planning, disaster evaluation and intention type judgment, we compared it with other advanced perception frameworks. The comparison was tested and verified on two datasets, LEVIR-CD and SZAB, with precision, recall and F-score as evaluation indicators. In addition, we report the change area of the intention content in the urban and rural images, namely, AR. The experimental results of the different methods are shown in Table 1.
According to Table 1, we can draw the following conclusions:
1. The perception framework we propose achieves the best results on a variety of evaluation indicators. The main reason may be that the multiscale spatiotemporal information extracted by the dual-branch DenseNet-121 describes the urban and rural targets in detail from different angles and at different levels. At the same time, attention is used to aggregate multilevel information, which further strengthens the use of the detailed information and the interaction between the temporal and spatial information. In addition, the perception framework we propose is also very competitive in terms of efficiency. The parameter amount and run time are 18.14 M and 11.95 s, respectively, and the run time is 4.14 s better than that of the KPCAMNet method. A possible reason for this lies in the squeeze excitation component used in our proposed perception framework. The number of parameters involved in training the proposed perception framework is also small.
2. Compared with the CD-UNet++, the UNetLSTM achieves a better perceptual performance. The main reason is that it not only uses the U-Net to encode and decode the local features of urban and rural targets but also uses the LSTM to describe the global semantics of the target connotation. Expressing urban and rural targets from two perspectives, local and overall, makes the two complementary. Compared with other CNN-based perception methods, the VGG-LR achieves the worst performance. The main reason for this is that the framework only uses the VGG-16 to extract local features of urban and rural targets and loses a large amount of detailed information.
3. Compared with other methods, such as the ESC-Net, SRCDNet and UNetLSTM, the two perception frameworks FGCN and PtoP CNN achieve better performance. The main reason is that different branches are used to model the local and global semantics of the target, which forms an interaction and complementarity between the global and local semantics, improving the feature representation and the ability to perceive subtle changes.
4. Compared with the perception methods based on the CNN and the U-Net, the SRCDNet, ESCNet and KPCAMNet are strongly competitive. For example, on the SZAB data, the KPCAMNet method improves by 7.0%, 6.36% and 6.69%, respectively, over the UNetLSTM. The main reason for this is that the KPCAMNet uses multiscale information and simultaneously uses attention to refine it, filtering out redundant information.

Experimental results of different components
To verify the impact of the different components on the overall performance of the proposed perception framework, the components were tested on the baseline datasets LEVIR-CD and SZAB, and the experimental results and related analysis are given. The specific experimental results are shown in Table 2.
According to Table 2, we can draw the following conclusions: Our proposed spatiotemporal perception framework achieves the best performance on the two public baseline datasets, with F-scores of 88.04% and 53.72%, respectively. The main reason for this is that the perception framework we design first uses multibranch deep neural networks to capture the deep semantics and shallow physical appearance information of the urban and rural intentional targets, describing the intentional targets at different levels and scales. Second, to further establish a spatiotemporal dependence, interaction modeling between long- and short-term distances is used to mark the position of the intentional target more accurately. At the same time, it highlights the differences between intentional target classes and further improves the network's ability to perceive the socioeconomic environment of the urban and rural areas. In addition, we find that using only the DenseNet (Ours(No-STPM)) for spatiotemporal information extraction also achieves good performance, but compared with using the STPM module (Ours), its F-score is reduced by 7.47% and 2.35%, respectively.
At different times, the urban and rural intentions in the same area showed great changes, and the average areas of change were 31.43% and 13.49%, respectively. This shows that with the continuous development of the social economy, the urban and rural forms also undergo massive changes. Measuring the changed area manually is time-consuming and labor-intensive, whereas the method we provide can effectively improve the measurement efficiency.
To show the performance of our proposed spatiotemporal perception framework more intuitively, we give the perception effects for different regions; the results are shown in Fig. 3.

Ablation studies
To further verify the influence of the different components on the proposed framework, experimental tests are carried out on the LEVIR-CD dataset, and the relevant perception results and analysis are given. The perception results are shown in Table 3.
According to Table 3, the perception accuracy of using CHA (Ours(No-GNPA)) is clearly better than that of using GNPA (Ours(No-CHA)): its F value and AP are higher by 1.39% and 0.79%, respectively. This indicates that CHA's contribution to the network is higher than that of GNPA. The main reason for this may be that CHA captures more effective specific information and is more sensitive to urban and rural objects. To better show the impact of the CHA and GNPA components on the overall framework performance, we provide visual heatmaps of the different components, shown in Fig. 4.
Fig. 4 The heatmaps of the different components. a Indicates the initial images of urban and rural areas before and after the change. b and c Indicate the spatiotemporal feature maps of the middle layers. d Indicates the heatmaps of the different components
According to Fig. 4, we can clearly see that the two components, used in conjunction, provide complementary information, which better expresses the urban and rural intentional targets and at the same time allows subtle changes to be perceived. CHA uses cross-channel interaction to capture the specific semantics of the urban and rural intentional targets, while GNPA better locates the target, and their collaboration establishes a more effective dependence.
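To make the idea of channel-wise reweighting concrete, the sketch below implements a generic squeeze-and-excitation style channel attention in NumPy. It is not the CHA module itself, whose internal design is not specified here; it only illustrates how per-channel statistics can be turned into gates that rescale a feature map. The function name `squeeze_excite`, the weight shapes and the reduction ratio are all assumptions.

```python
import numpy as np

def squeeze_excite(features, w1, w2):
    """Generic channel attention.

    features: array of shape (C, H, W)
    w1: bottleneck weights of shape (C // r, C); w2: expansion weights of shape (C, C // r)
    """
    squeezed = features.mean(axis=(1, 2))           # squeeze: global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)         # excitation: bottleneck with ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid gate per channel, in (0, 1)
    return features * gates[:, None, None], gates   # rescale every channel by its gate

rng = np.random.default_rng(0)
channels, reduction = 8, 4
features = rng.normal(size=(channels, 16, 16))
w1 = rng.normal(size=(channels // reduction, channels))
w2 = rng.normal(size=(channels, channels // reduction))
reweighted, gates = squeeze_excite(features, w1, w2)
print(reweighted.shape, gates.round(3))
```

Channels whose gate is close to 1 pass through almost unchanged, while channels with a gate near 0 are suppressed, which is one simple way to emphasize informative channels.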

The discussion of the results
To show that the proposed perception framework can effectively detect changes in the locations, forms and infrastructure of urban and rural socioeconomic environments, and that it contributes to tasks such as disaster evaluation and intention type statistics, we show the perception results for multiple intentional targets. The results are shown in Fig. 5.

Conclusions and next studies
In this paper, we perceive the changes in the socioeconomic environments of urban and rural areas and explore spatiotemporal changes in cities and villages through remote sensing using multibranch networks. The perception framework not only effectively captures the multiscale spatiotemporal information of the intended target, but also uses STPM to capture the long-term spatiotemporal correlation. The intended target is described from multiple perspectives, such as high-level semantics and low-level appearance, to learn more effective embeddings. In addition, the interaction between temporal and spatial information is strengthened, and this feature information is gradually refined during the training process, which is helpful for urban planning and construction and for disaster response. The final perception results show that our proposed perception framework performs well.
Although the framework achieves a good perceptual performance, it performs poorly on intentional targets with large scale changes, that is, cases in which the physical appearance of the same target, such as its shape and size, changes greatly between different moments or different remote sensing images. This needs to be improved. Therefore, in future work, we will introduce concepts such as superscale blocks to develop a simpler and more effective semantic framework. At the same time, we will further improve the attention network to guide the perception framework to explore large-scale changes. Finally, we will learn the important characteristics of the urban and rural socioeconomic areas.