2.2. Data Crawling and Preprocessing
To acquire Weibo data, we use the public interface of Sina Weibo, setting a time range and keywords for the relevant city events, and crawl the multimodal Weibo data with Python. These data include the Weibo creation time, text content, images (if any), videos (if any), and the IP province (available from 1 August 2022 onward). For videos, we extract stable frames as images using the optical flow method.
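As an illustration of the frame extraction step, the sketch below keeps frames whose dense optical flow magnitude is low; the Farneback variant and the motion threshold are illustrative assumptions, since the paper does not specify which optical flow method is used:

```python
import cv2
import numpy as np

def extract_stable_frames(video_path, motion_threshold=0.5):
    """Keep frames whose mean optical-flow magnitude is low (stable shots)."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    frames = []
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY) if ok else None
    while ok:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow between consecutive frames (Farneback method)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        if np.linalg.norm(flow, axis=2).mean() < motion_threshold:
            frames.append(frame)  # low average motion -> stable frame
        prev_gray = gray
    cap.release()
    return frames
```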
To process the text data efficiently, the noisy data must first be cleaned. Character-level cleaning includes removing topic tags, zero-width spaces (ZWSP), @-mentions of other users, emoji symbols, HTML tags, etc. However, not every Weibo containing an event-related keyword is actually related to the event, so a model can be trained to classify text related to the specified city events. Because an event is usually reported by multiple media outlets, overly similar posts cause data redundancy; therefore, taking one day as the time window, the texts are vectorized and highly similar Weibos are removed using a cosine similarity matrix.
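A minimal sketch of the character-level cleaning and same-day deduplication steps follows; the regular expressions, the character n-gram vectorizer, and the 0.9 similarity threshold are illustrative assumptions:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def clean_weibo_text(text: str) -> str:
    """Character-level cleaning: HTML tags, topic tags, @-mentions, emoji, ZWSP."""
    text = re.sub(r"<[^>]+>", "", text)                  # HTML tags
    text = re.sub(r"#[^#]*#", "", text)                  # Weibo topic tags (#...#)
    text = re.sub(r"@\w+", "", text)                     # @-mentions of other users
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)  # emoji
    text = text.replace("\u200b", "")                    # zero-width spaces
    return text.strip()

def deduplicate_by_day(texts, threshold=0.9):
    """Drop near-duplicate posts within one day via a cosine-similarity matrix."""
    if len(texts) < 2:
        return texts
    # Character n-grams avoid the need for Chinese word segmentation here
    tfidf = TfidfVectorizer(analyzer="char", ngram_range=(1, 2)).fit_transform(texts)
    sim = cosine_similarity(tfidf)
    keep = []
    for i in range(len(texts)):
        # keep post i only if it is not too similar to any already-kept post
        if all(sim[i, j] < threshold for j in keep):
            keep.append(i)
    return [texts[i] for i in keep]
```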
After these three preprocessing steps (cleaning, event classification, and deduplication), a city event Weibo dataset with low overall noise and high relevance to the target events is obtained. An example of a processed dataset is shown in Table 1.
2.3. Coarse-Grained Spatio-Temporal Information Extraction
Given that social media narration is highly random and diverse and lacks a unified text format, we designed a set of rigorous spatio-temporal information standardization rules to efficiently extract key spatio-temporal information from a large volume of Weibo data and lay the foundation for subsequent detailed research. These rules aim to ensure that potential spatio-temporal information at different levels is used to the greatest extent during standardization.
Before standardizing spatio-temporal information, it must first be extracted from the text. For the preprocessed city event dataset, we use NER technology to identify entities related to spatio-temporal information. To improve the efficiency of subsequent standardization, we merge similar tags. Specifically, we combine the DATE and TIME tags into a single TIME category, since both serve as material for time standardization; we keep the GPE tag as its own category without renaming it, since it anchors spatial standardization with administrative divisions; and we combine the LOC and FAC tags into a single FAC category, since both identify specific facilities or locations that can serve as concrete place names for spatial standardization.
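This tag merging can be expressed as a simple remapping; the tag names follow the scheme listed in Table 2, and the helper function is hypothetical:

```python
# Merge the NER model's built-in tags into the three categories used
# for spatio-temporal standardization (see Table 2).
TAG_MAP = {
    "DATE": "TIME",  # dates and times both feed time standardization
    "TIME": "TIME",
    "GPE":  "GPE",   # administrative divisions anchor spatial standardization
    "LOC":  "FAC",   # locations and facilities both name concrete places
    "FAC":  "FAC",
}

def merge_tags(entities):
    """Remap (text, tag) pairs from the NER model; drop tags we do not use."""
    return [(text, TAG_MAP[tag]) for text, tag in entities if tag in TAG_MAP]
```

For example, merge_tags([("昨天", "DATE"), ("郑州市", "GPE"), ("京广路隧道", "FAC")]) yields one TIME, one GPE, and one FAC entity, ready for the standardization rules below.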
Table 2 shows the built-in tags of concern for spatio-temporal information extraction and the reclassified tag categories.
For spatio-temporal standardization, particular attention must be paid to both time and space. We therefore chose the JioNLP library, which offers the best open-source time parsing tool and a convenient location parsing tool [31]. For time standardization, we standardize the Weibo posting time to the "Year-Month-Day" format, omitting the "Hour:Minute:Second" portion, because the Weibo posting time and the implicit time information in the text are rarely sufficient to pinpoint an event such as a flood down to the hour. The smallest retained time unit is therefore the day, rather than the full detail of the Weibo posting time. For spatial standardization, we transform the potential spatial information in a Weibo into the "Province-City-District (County)-Specific Geographical Location" format, which facilitates subsequent geocoding and accurate conversion to the WGS1984 latitude and longitude of the address.
For this research, it is crucial to further refine the spatial information, so data that contain no FAC entities are removed first to keep the subsequent steps tractable. On this basis, time standardization checks whether the text contains TIME tags. If not, the Weibo posting date is used directly as the final standardized time. If it does, we first screen for relative-time keywords such as "today", "yesterday", and "day", then apply the time parsing function provided by the JioNLP library, with the Weibo posting time as the base, to the entities tagged as TIME, using them as revision times for time standardization. Finally, only meaningful time-point results are retained; if none remain, the Weibo posting date is used as the final time.
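A sketch of this logic follows, assuming JioNLP's jio.parse_time interface (a time string resolved against a time_base timestamp, returning a typed result); the helper name and fallback behavior are our own:

```python
import time
import jionlp as jio

def standardize_time(time_entity, post_time):
    """Resolve a TIME entity (e.g., '昨天') against the Weibo posting time.

    post_time: posting timestamp in seconds. Falls back to the posting date
    when parsing fails or the result is not a meaningful time point.
    """
    post_date = time.strftime("%Y-%m-%d", time.localtime(post_time))
    if not time_entity:
        return post_date
    try:
        # Assumed JioNLP interface: parse relative to the posting time
        result = jio.parse_time(time_entity, time_base=post_time)
    except Exception:
        return post_date
    if result.get("type") == "time_point":
        # keep only the 'Year-Month-Day' part of the resolved span
        return result["time"][0][:10]
    return post_date
```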
More cases must be handled during spatial information standardization. First, we determine whether the text contains GPE tags. Like time standardization, address standardization needs a benchmark, which makes the GPE tag crucial. Notably, since 1 August 2022, the Cyberspace Administration of China requires internet information service providers to display users' IP address ownership information, which opens new possibilities for texts that have FAC tags but no GPE tags. However, cases whose IP ownership points to a foreign country or a region outside mainland China must be excluded. When a GPE tag (or the IP address ownership) and a FAC tag are both available, the address recognition function provided by JioNLP is used to standardize the GPE content down to the "District (County)" level.
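A corresponding sketch for the spatial branch, assuming jio.parse_location returns a dictionary with province/city/county fields; the helper and its fallback logic are illustrative:

```python
import jionlp as jio

def standardize_address(gpe_text, fac_text, ip_province=None):
    """Build 'Province-City-District (County)-Specific place' from GPE/FAC entities.

    Falls back to the post's IP province (shown since 1 August 2022) when the
    text has a FAC entity but no GPE entity; posts whose IP ownership lies
    outside mainland China are excluded upstream.
    """
    anchor = gpe_text or ip_province
    if anchor is None or not fac_text:
        return None
    parsed = jio.parse_location(anchor)          # assumed JioNLP interface
    parts = [parsed.get("province"), parsed.get("city"),
             parsed.get("county"), fac_text]
    return "".join(p for p in parts if p)
```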
The standardized result states returned by the above spatio-temporal standardization are classified into three categories: 0, 1, and 2, where 0 denotes failed parsing, 1 denotes incomplete parsing, and 2 denotes successful parsing. Only the standardized spatial information of categories 1 and 2 is converted into WGS1984 coordinates using the Baidu Maps geocoding API.
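The conversion can be performed with a plain HTTP call to the Baidu Maps geocoding v3 endpoint; note that Baidu returns BD-09 coordinates by default, so a BD-09 to WGS-84 conversion step (omitted in this sketch) is still required:

```python
import requests

BAIDU_GEOCODING_URL = "https://api.map.baidu.com/geocoding/v3/"

def geocode(address, ak):
    """Geocode a standardized address string with the Baidu Maps API.

    ak: your Baidu Maps API key. A non-zero status indicates failure.
    """
    params = {"address": address, "output": "json", "ak": ak}
    resp = requests.get(BAIDU_GEOCODING_URL, params=params, timeout=10).json()
    if resp.get("status") != 0:
        return None
    loc = resp["result"]["location"]
    return loc["lng"], loc["lat"]      # BD-09 longitude/latitude
```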
Through these steps, we achieve effective extraction of coarse-grained spatio-temporal information, laying the foundation for further research. Our overall approach to standardizing spatio-temporal information in Weibo text is visualized in Figure 2, which shows the program's response to each standardization return type. Three common standardization rule examples are shown in Figure 3.
2.4. Fine-Grained Extraction of Spatial Information
To extract fine-grained spatial information from social media images, a series of image processing techniques is needed to compare them with street view images that carry spatial information and to screen for the best match so that the information can be transferred. The degree of match between the social media image and the street view image determines the reliability of the resulting fine-grained spatial information. To maximize the credibility of this process, we designed a cascading match-extract-evaluate model named LSGL (LoFTR-Seg Geo-Localization).
In social media data, users often express location and orientation based on their own perception or understanding of the geographical environment. Hence, the spatial coordinates extracted at coarse grain may refer to a representative building or location, while vague orientation descriptions (e.g., "nearby", "around the corner", "next to") are difficult to pin down. To address this, we divide the standardized addresses into road and non-road types according to the classification returned by Baidu Maps geocoding. For road-type addresses, we generate street view sampling points at 5 m intervals along the OSM vector road network segment bearing that name. For non-road-type addresses, we create a buffer zone with a radius of 200 m centered on the address, clip the OSM vector road network within it, and likewise generate street view sampling points at 5 m intervals.
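A minimal sketch of the sampling-point generation using shapely, assuming the OSM geometries have been re-projected to a metric CRS (e.g., a UTM zone) so that the 5 m interval and 200 m radius are meaningful:

```python
from shapely.geometry import LineString, Point

def sample_points(road, interval=5.0):
    """Generate street-view sampling points every `interval` metres along a road.

    Works for LineString and MultiLineString via shapely linear referencing.
    """
    n = int(road.length // interval)
    return [road.interpolate(i * interval) for i in range(n + 1)]

def sample_around(location: Point, roads, radius=200.0):
    """Clip roads to a 200 m buffer around a non-road address, then sample."""
    buffer = location.buffer(radius)
    points = []
    for road in roads:
        clipped = road.intersection(buffer)
        if not clipped.is_empty and clipped.length > 0:
            points.extend(sample_points(clipped))
    return points
```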
Due to the randomness of social media, most user-uploaded images are somewhat blurry, which significantly affects feature point selection. To solve this problem, the LSGL model adopts the Local Feature Transformer (LoFTR) [32] feature matching method in the matching stage. This method not only extracts feature points effectively from blurred textures but also, with the help of a self-attention mechanism, maintains the relative positional relationships between feature point pairs, significantly improving street view image matching performance.
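The paper does not name a LoFTR implementation; the sketch below assumes the kornia port, whose matcher takes a pair of grayscale image tensors and returns matched keypoints with per-match confidences:

```python
import torch
import kornia.feature as KF

# LoFTR matcher; the 'outdoor' weights suit street-scene imagery.
matcher = KF.LoFTR(pretrained="outdoor")

def match_images(img0: torch.Tensor, img1: torch.Tensor):
    """Match a user image against a street-view image.

    img0, img1: grayscale tensors of shape (1, 1, H, W), values in [0, 1].
    Returns matched keypoint coordinates in each image and per-match confidence.
    """
    with torch.no_grad():
        out = matcher({"image0": img0, "image1": img1})
    return out["keypoints0"], out["keypoints1"], out["confidence"]
```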
For an ideal street view image matching task, the degree of feature matching between buildings usually reflects the similarity of the two scenes' shooting locations. In practice, however, the matching task is often disturbed by the sky, roads, vegetation, and other features with strongly similar appearance, producing a large number of feature points with no reference value. To reduce the impact of such irrelevant information, LSGL adopts the DETR model [33], which can efficiently segment and label images at relatively low performance overhead, thereby retaining only reference feature points with practical significance for further evaluation.
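Independent of the exact segmentation model, the filtering step amounts to discarding matched keypoints that fall on low-information classes; the class names and the mask format below are assumptions:

```python
import numpy as np

# Segmentation classes assumed to carry little locational information;
# the exact label set depends on the segmentation model's training data.
IRRELEVANT = {"sky", "road", "vegetation"}

def filter_keypoints(kpts, conf, seg_mask, id_to_label):
    """Keep keypoints that fall on reference-worthy classes (e.g., buildings).

    kpts: (N, 2) array of (x, y) pixel coordinates in the segmented image.
    seg_mask: (H, W) array of class ids from the segmentation model.
    id_to_label: mapping from class id to class name.
    """
    h, w = seg_mask.shape
    xs = np.clip(kpts[:, 0].astype(int), 0, w - 1)
    ys = np.clip(kpts[:, 1].astype(int), 0, h - 1)
    labels = np.array([id_to_label[seg_mask[y, x]] for x, y in zip(xs, ys)])
    keep = ~np.isin(labels, list(IRRELEVANT))
    return kpts[keep], conf[keep]
```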
To select the best matching street view image from all matching results and extract its coordinates, a quantitative indicator of image matching quality must be established. With this goal in mind, we use the reference feature points of each scene image to design this indicator along two dimensions: the feature vector matching degree and the spatial position difference of the feature points.
First, we consider the feature vector matching degree. The LoFTR method outputs the coordinates of the matched feature points together with a confidence for each match. We first screen out feature points that do not belong to the target category based on their coordinates, then count the remaining feature points, accumulate their confidences, and take the average to represent the credibility of all feature points in the image. This can be expressed mathematically as:
$M = \frac{1}{n}\sum_{i=1}^{n} c_i$ (1)
In Equation (1), $M$ represents the feature vector matching degree of the image, $n$ represents the number of retained feature points, and $c_i$ signifies the confidence of the $i$-th feature point.
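In code, Equation (1) reduces to a mean over the confidences retained by the filtering step above (the helper name is hypothetical):

```python
import numpy as np

def matching_degree(conf: np.ndarray) -> float:
    """Equation (1): average confidence of the retained feature points."""
    return float(conf.mean()) if conf.size else 0.0
```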
Second, we consider the spatial position difference of the feature points. Since user images come from Weibo and are influenced by the user's device, shooting skill, etc., the features and objects in them may be slightly offset relative to the street view images, but the spatial relationships between feature points should remain similar. Therefore, based on the coordinates of each pair of matched feature points in their respective images, we calculate their Euclidean distance and Euclidean direction as follows:
$d_i = \sqrt{(x_i^{u} - x_i^{r})^2 + (y_i^{u} - y_i^{r})^2}$ (2)
$\theta_i = \arctan\left(\frac{y_i^{u} - y_i^{r}}{x_i^{u} - x_i^{r}}\right)$ (3)
In Equations (2) and (3), $d_i$ and $\theta_i$ respectively denote the Euclidean distance and direction of the $i$-th pair of feature points in the user image and the reference image; $(x_i^{u}, y_i^{u})$ represent the coordinates of the feature point in the user image, while $(x_i^{r}, y_i^{r})$ signify the coordinates of the matched feature point in the reference image.
To assess the impact of variations in Euclidean distance and direction on the spatial position of the feature points, we calculate the root mean square error for these two indices separately, yielding $\mathrm{RMSE}_D$ and $\mathrm{RMSE}_A$. Multiplying these two values yields the spatial position discrepancy of the feature points, as shown in Equation (4):
$D = \mathrm{RMSE}_D \times \mathrm{RMSE}_A$ (4)
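A numpy sketch of Equations (2)-(4); taking the RMSE of each index against its own mean, so that a consistent global shift between the two views is not penalized, is our interpretation of the paper's description:

```python
import numpy as np

def spatial_discrepancy(k0: np.ndarray, k1: np.ndarray) -> float:
    """Equations (2)-(4): spatial position discrepancy of matched points.

    k0, k1: (N, 2) matched keypoint coordinates in the user image and the
    street-view image.
    """
    dx = k0[:, 0] - k1[:, 0]
    dy = k0[:, 1] - k1[:, 1]
    dist = np.hypot(dx, dy)                  # Equation (2): Euclidean distance
    ang = np.arctan2(dy, dx)                 # Equation (3): Euclidean direction
    rmse_d = np.sqrt(np.mean((dist - dist.mean()) ** 2))
    rmse_a = np.sqrt(np.mean((ang - ang.mean()) ** 2))
    return float(rmse_d * rmse_a)            # Equation (4): D = RMSE_D * RMSE_A
```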
Standardizing the indicators reflects the relative advantages of the evaluation results more intuitively. Therefore, the results of individual matches must be processed within each round of evaluation. The main methods are as follows:
$\hat{M} = \frac{M - M_w}{M_b - M_w}$ (5)
$\hat{D} = \frac{D_w - D}{D_w - D_b}$ (6)
In these equations, $M$ and $D$ represent the feature vector matching degree and the spatial position discrepancy of a single match, respectively; $M_b$ and $M_w$ are the optimal and worst feature vector matching degrees in a single round of matching, and $D_b$ and $D_w$ are the optimal and worst spatial position discrepancies in a single round of matching.
Given the differing impacts of these two factors on the results of feature point matching, we have constructed the following final scoring method:
$S = w_M \hat{M} + w_D \hat{D}$ (7)
Here, $w_M$ and $w_D$ are the weights assigned to the two normalized indicators. The more reliable the feature matching result, the higher the feature vector matching degree and the lower the spatial position discrepancy, so a well-matched street view image attains a high $S$.
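Putting Equations (5)-(7) together, a sketch of the per-round normalization and final selection; the weights w_m and w_d are placeholders, since the paper states only that the two factors are weighted differently:

```python
import numpy as np

def normalize(values: np.ndarray, higher_is_better: bool) -> np.ndarray:
    """Equations (5)-(6): min-max normalization within one matching round."""
    best = values.max() if higher_is_better else values.min()
    worst = values.min() if higher_is_better else values.max()
    if best == worst:
        return np.ones_like(values)
    return (values - worst) / (best - worst)

def select_best(match_degrees, discrepancies, w_m=0.5, w_d=0.5):
    """Equation (7): weighted final score; returns the index of the best view."""
    m_hat = normalize(np.asarray(match_degrees), higher_is_better=True)
    d_hat = normalize(np.asarray(discrepancies), higher_is_better=False)
    scores = w_m * m_hat + w_d * d_hat
    return int(np.argmax(scores))
```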
Finally, we select the image with the optimal $S$ value from all matching results and obtain its specific coordinates, which we return as the fine-grained spatial information. Through this series of processes, we have established a cascaded model that extracts fine-grained spatio-temporal information more reliably.
Figure 4 shows the impact of each level in this model on the image matching result.