2.2. Data Crawling and Preprocessing
To obtain Weibo data, we utilize the public interface of Sina Weibo. The multi-modal Weibo data is crawled with Python by setting a time range and keywords related to city events. The data includes the Weibo creation time, text content, images (if any), videos (if any), the province of the poster's IP location (available since August 1, 2022), etc. For videos, we extract stable frame images using an optical flow method.
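The stable-frame selection step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mean absolute inter-frame difference stands in for the optical-flow magnitude, and the function and threshold names are hypothetical (a real pipeline would substitute a dense optical flow estimator such as OpenCV's `calcOpticalFlowFarneback`).

```python
import numpy as np

def select_stable_frames(frames, motion_threshold=2.0):
    """Keep frames whose motion relative to the previous frame is small.

    `frames` is a list of grayscale images as 2-D numpy arrays. The mean
    absolute inter-frame difference is used here as a cheap proxy for the
    mean optical-flow magnitude.
    """
    stable = []
    for prev, curr in zip(frames, frames[1:]):
        motion = np.abs(curr.astype(float) - prev.astype(float)).mean()
        if motion < motion_threshold:
            stable.append(curr)
    return stable
```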
Despite initial keyword filtering, not all Weibo posts containing event-related keywords are actually related to the event. Therefore, a text classification model is used to identify posts relevant to the specified city event. Next, to process the text data efficiently, we clean the noise in it. Character-level cleaning removes topic tags, zero-width spaces (ZWSP), @-mentions of other users, emoji, HTML tags, etc. Moreover, because an event often receives coverage from multiple media sources, overly similar report posts can introduce redundancy. We therefore vectorize all texts and use a cosine similarity matrix to efficiently compute each text's similarity with all others, removing highly similar posts (similarity ≥ 0.9).
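The similarity-based deduplication step can be sketched as below. The function name and the greedy keep-first strategy are assumptions; the text vectors could come from any embedding (e.g. TF-IDF rows).

```python
import numpy as np

def deduplicate(texts, vectors, threshold=0.9):
    """Drop texts whose cosine similarity to an earlier kept text is >= threshold.

    `vectors` is an (n, d) array with one embedding row per text.
    """
    # Normalize rows so the dot-product matrix equals the cosine similarity matrix.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    kept, kept_idx = [], []
    for i, text in enumerate(texts):
        if all(sim[i, j] < threshold for j in kept_idx):
            kept.append(text)
            kept_idx.append(i)
    return kept
```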
After the above three preprocessing steps, we obtain a standardized city-event Weibo dataset that is largely noise-free and relevant to the target event. An example of the processed dataset is shown in
Table 1.
2.3. Coarse-grained spatio-temporal information extraction
Because social media narratives are highly spontaneous and diverse and lack a unified text format, we designed a set of rigorous spatio-temporal information standardization rules to efficiently extract key spatio-temporal information from large volumes of Weibo data and lay a foundation for subsequent detailed research. These rules aim to ensure that, during the standardization of spatio-temporal narratives, potential spatio-temporal information at different levels is utilized as fully as possible.
Before standardizing spatio-temporal information, we first extract it from the text. For the city-event dataset produced by preprocessing, we use Named Entity Recognition (NER) to identify entities related to spatio-temporal information. To improve the efficiency of subsequent standardization, we merge similar labels. Specifically, we combine the DATE and TIME labels into a single TIME category, as both serve as material for temporal standardization. The GPE label is kept as a separate category, as it provides the administrative-division basis for spatial standardization. We merge the LOC and FAC labels into the FAC category because both identify specific facilities or locations, which serve as specific place names for spatial standardization.
Table 2 shows the built-in labels required for extracting spatio-temporal information and the reclassified label types.
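The label merging described above amounts to a simple remapping; a minimal sketch follows, assuming spaCy-style label names on (text, label) entity pairs (the function name is hypothetical).

```python
# Merge NER labels into the three categories used for standardization,
# following the rules described above.
LABEL_MAP = {
    "DATE": "TIME",  # dates and times both feed temporal standardization
    "TIME": "TIME",
    "GPE": "GPE",    # administrative divisions anchor spatial standardization
    "LOC": "FAC",    # locations and facilities both act as specific place names
    "FAC": "FAC",
}

def reclassify(entities):
    """Map (text, label) pairs to the merged categories, dropping other labels."""
    return [(text, LABEL_MAP[label]) for text, label in entities if label in LABEL_MAP]
```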
In terms of temporal-spatial standardization, specific attention is given to both temporal and spatial aspects. We therefore utilize the JioNLP library, which currently provides high-quality open-source temporal parsing tools and convenient location parsing tools [31]. For temporal standardization, Weibo publication times are standardized to the "Year-Month-Day" format, omitting the specific "Hour:Minute:Second". Events typically occur spontaneously, and it is difficult to determine their exact time from the Weibo publication time and the implicit temporal cues in the text alone; consequently, the smallest retained unit of time is the day, rather than the specific publication time or the detailed specifics implied in the text. For spatial standardization, we transform the potential spatial information in Weibo posts into a "Province-City-District (County)-Specific Geographic Location" pattern for ease of comprehension during subsequent geocoding, and accurately convert it into WGS1984 latitude-longitude coordinates for that address.
In this study, further precise handling of spatial information is of paramount importance. First, records that contain no FAC entities are excluded to ensure the feasibility of subsequent research. Building on this, temporal standardization checks whether the text contains TIME-class labels. If not, the Weibo publication date is used directly as the final standardized time. If so, keywords such as "today", "yesterday", and "day" are selected through forward screening; using the temporal parsing function provided by the JioNLP library, with the Weibo publication time as a reference, entities labeled TIME are parsed and used as correction times for temporal standardization. In the end, only meaningful point-in-time results are retained; if there are none, the Weibo publication date is used as the final time.
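The temporal fallback logic above can be sketched as follows. The function names are hypothetical; `parse_point_in_time` stands in for a wrapper around JioNLP's time parser that uses the publication time as reference and returns `None` for non-point-in-time expressions.

```python
from datetime import datetime

def standardize_time(time_entities, publish_date, parse_point_in_time):
    """Return a 'YYYY-MM-DD' standardized date for one Weibo post.

    `time_entities` are the strings labeled TIME by NER. The first entity
    that parses to a meaningful point in time wins; otherwise we fall back
    to the Weibo publication date.
    """
    for entity in time_entities:
        parsed = parse_point_in_time(entity)
        if parsed is not None:  # keep only meaningful point-in-time results
            return parsed.strftime("%Y-%m-%d")
    return publish_date.strftime("%Y-%m-%d")
```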
During spatial information standardization, a larger number of scenarios must be handled. First, it is determined whether the text contains GPE labels. As with temporal standardization, address standardization requires a reference point, making the GPE labels critically important. Notably, as of August 1, 2022, the Cyberspace Administration of China required internet information service providers to display users' IP location information, opening a new possibility for texts with only FAC labels and no GPE labels; however, cases where the IP location is an overseas country or region must be excluded. Once a GPE label (or IP location) and a FAC label are obtained, the address recognition function provided by JioNLP is used to standardize the content of the GPE label down to the "district (county)" unit.
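The choice of spatial reference can be sketched as a small decision function; names and the return convention are hypothetical.

```python
def choose_spatial_reference(gpe, ip_province, domestic_provinces):
    """Pick the reference region for spatial standardization of one post.

    Prefer an explicit GPE entity; otherwise fall back to the poster's IP
    province (displayed since August 2022), discarding overseas locations.
    Returns None when no usable reference exists (standardization fails).
    """
    if gpe:
        return gpe
    if ip_province in domestic_provinces:
        return ip_province
    return None
```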
The standardization results returned by the above temporal-spatial standardization are categorized into three status types: 0, 1, and 2. Here, 0 represents a failure of standardization parsing, 1 represents incomplete standardization parsing, and 2 signifies successful standardization parsing. Based on these types, only the standardized spatial information of types 1 and 2 is geocoded using the Baidu Maps Geocoding API, converting the standardized addresses into WGS1984 coordinates.
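The status-based filtering before geocoding reduces to a one-line selection; a minimal sketch (record layout assumed):

```python
def geocodable(records):
    """Keep only records whose standardization status is 1 (partial) or 2 (full).

    Each record is a (status, address) tuple; status 0 means parsing failed,
    so the record is skipped before calling the geocoding API.
    """
    return [addr for status, addr in records if status in (1, 2)]
```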
Through these series of steps, we effectively extract coarse-grained temporal-spatial information, laying a foundation for further research. The overall approach for the standardization of temporal-spatial information in Weibo text is visualized in
Figure 2, demonstrating the program’s assignment of different status types based on different standardization results. Additionally, three common examples of standardization rules are shown in
Figure 3.
While coarse-grained spatio-temporal information has been effectively extracted via the steps above, social media users often express location and orientation based on their personal perception and cognition of the geographical environment. Thus, the coarse-grained spatial coordinates may reflect only a representative building or place, while orientation descriptions such as "nearby," "at the corner," or "next to" are often vague. Our solution is to categorize the standardized addresses into two main classes, roads and non-roads, according to the category returned by Baidu Maps geocoding. For road-type standardized addresses, streetscape sampling points are generated at 5-meter intervals along the OpenStreetMap (OSM) road-network vector corresponding to the road name. For non-road-type standardized addresses, a buffer zone with a radius of 200 meters is created around the address, and sampling points are likewise generated at 5-meter intervals along the OSM road network clipped to this zone.
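Generating sampling points at fixed intervals along a road vector can be sketched as polyline interpolation; this assumes vertices in a projected (metric) coordinate system and a hypothetical function name.

```python
import math

def sample_along_polyline(points, interval=5.0):
    """Generate sampling points every `interval` units along a polyline.

    `points` is a list of (x, y) vertices, as one would obtain from an
    OSM road segment reprojected to a metric CRS.
    """
    samples = [points[0]]
    carried = 0.0  # cumulative distance already covered past the last sample
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        d = interval - carried  # distance along this segment to the next sample
        while d <= seg:
            t = d / seg
            samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            d += interval
        carried = (carried + seg) % interval
    return samples
```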
However, the randomness of social media data raises another issue: within the same microblog post, the images may not be directly related to the text content. Even if spatial information is mentioned in the post, an attached image may not depict it. Additionally, given the varied quality of user-uploaded images and videos, clear, high-quality streetscape images containing potential spatial information are scarce. To further exploit these multimodal data, a semi-manual method can be adopted. First, based on semantic segmentation of streetscapes, a simple rule judges whether each user-uploaded image is a streetscape image: the image should contain at least three of the four streetscape semantic classes road, sky, building, and pole. Then, through manual screening, high-quality, relevant images are selected from the microblogs and associated with the addresses standardized during the coarse-grained extraction phase. In this way, high-quality microblog image-text data are screened out and labeled "Positive," while coarse-grained standardized address points without high-quality, relevant images are labeled "Negative."
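The streetscape-judging rule is a simple set intersection over the classes returned by the segmentation model; a minimal sketch (names hypothetical):

```python
STREETSCAPE_CLASSES = {"road", "sky", "building", "pole"}

def is_streetscape(segmentation_labels, min_classes=3):
    """Judge whether an image is a streetscape.

    `segmentation_labels` is the set of semantic classes the segmentation
    model found in the image; it qualifies when at least `min_classes` of
    the four streetscape classes are present.
    """
    return len(STREETSCAPE_CLASSES & set(segmentation_labels)) >= min_classes
```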
2.4. Fine-grained extraction of spatial information
To extract fine-grained spatial information from the high-quality microblog image-text data above, a series of image processing techniques is required to compare it with streetscape images that already carry spatial information, thereby screening out the best match for spatial information transfer. In this process, the degree of matching between the social media images and the streetscape images determines the reliability of the fine-grained spatial information. To maximize this reliability, we designed LSGL (LoFTR-Seg Geo-Localization), a cascade model built on a match-extract-evaluate pipeline.
Given the randomness of social media, most user-uploaded images are blurry, which greatly affects feature point selection. To solve this problem, the LSGL model adopts the LoFTR (Local Feature Transformer) [32] feature matching method in the matching stage. This method not only extracts feature points effectively from blurry textures but also maintains the relative positional relationships between feature-point pairs through a self-attention mechanism, significantly improving streetscape image matching performance.
For an ideal streetscape image matching task, the degree of feature matching between buildings generally represents the similarity of the two scenes' shooting locations. In practice, however, the matching task is often dominated by sky, road, vegetation, and other highly self-similar features, producing many feature points that carry no meaningful reference information. To reduce the influence of such irrelevant information, LSGL adopts the DETR model [33], which segments and labels images efficiently at relatively low computational overhead, thereby retaining practical reference feature points for further evaluation.
To select the best-matched streetscape image from all the matching results and extract its coordinates, a quantifiable indicator is required to assess the degree of image matching. With this goal in mind, we rely on the reference feature points of each scene to design this indicator from two dimensions: feature point feature vector matching degree and feature point spatial position difference.
First, we consider the feature-vector matching degree of the feature points. The LoFTR feature matching method outputs each feature point's coordinates together with a confidence score. We first filter out feature points whose coordinates fall outside the target categories, then count the remaining feature points, accumulate their confidences, and take the average of the accumulated result as the confidence of all feature points in the image. In mathematical terms, it is represented as:
$$C=\frac{1}{n}\sum_{i=1}^{n}c_i \tag{1}$$
In the formula, $C$ represents the feature-vector matching degree of the feature points, $n$ represents the number of retained feature points, and $c_i$ signifies the confidence of the $i$-th feature point.
Second, we consider the spatial position differences of the feature points. Because user images come from Weibo and are affected by device quality, shooting skill, etc., the features and objects in them may be slightly offset relative to the street view images; however, the spatial relationships between feature points should remain similar. Therefore, based on the coordinates of each pair of matched feature points in their respective images, we calculate their Euclidean distance and direction as follows:
$$d_i=\sqrt{\left(x_i^{u}-x_i^{r}\right)^2+\left(y_i^{u}-y_i^{r}\right)^2} \tag{2}$$
$$\theta_i=\arctan\frac{y_i^{u}-y_i^{r}}{x_i^{u}-x_i^{r}} \tag{3}$$
In equations (2) and (3), $d_i$ and $\theta_i$ respectively denote the Euclidean distance and direction between the $i$-th pair of matched feature points in the user image and the reference image. $\left(x_i^{u}, y_i^{u}\right)$ represent the coordinates of the feature point in the user image, while $\left(x_i^{r}, y_i^{r}\right)$ signify the coordinates of the corresponding feature point in the reference image.
To assess the impact of changes in Euclidean distance and direction on the spatial position of the feature points, we calculate the root mean square error of these two indices separately, yielding $RMSE_D$ and $RMSE_A$. Multiplying the two values gives the spatial position discrepancy $D$ of the feature points, as shown in the equation:
$$D=RMSE_D\times RMSE_A \tag{4}$$
Standardizing the indicators can more intuitively reflect the relative advantages of the evaluation results. Therefore, it is necessary to process the results of individual evaluations and round evaluations. The main methods are as follows:
$$\hat{S}=\frac{S-S_{worst}}{S_{best}-S_{worst}} \tag{5}$$
$$\hat{D}=\frac{D-D_{worst}}{D_{best}-D_{worst}} \tag{6}$$
In these equations, $S$ and $D$ represent the feature-vector matching degree and the spatial position discrepancy of a single match, respectively. $S_{best}$ and $S_{worst}$ are the optimal and worst feature-vector matching degrees in a single round of matching, and $D_{best}$ and $D_{worst}$ are the optimal and worst spatial position discrepancies in that round.
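The evaluation indicators described above can be sketched in code. This is an illustrative reading, not the paper's implementation: the RMSE terms are taken as the root mean square of the raw pairwise distance/direction values, and the normalization form is inferred from the best/worst quantities named in the text.

```python
import math

def mean_confidence(confidences):
    """Feature-vector matching degree: average confidence of retained points."""
    return sum(confidences) / len(confidences)

def spatial_discrepancy(user_pts, ref_pts):
    """RMSE of pairwise Euclidean distances times RMSE of directions."""
    dists = [math.hypot(xu - xr, yu - yr)
             for (xu, yu), (xr, yr) in zip(user_pts, ref_pts)]
    dirs = [math.atan2(yu - yr, xu - xr)
            for (xu, yu), (xr, yr) in zip(user_pts, ref_pts)]
    rms = lambda v: math.sqrt(sum(x * x for x in v) / len(v))
    return rms(dists) * rms(dirs)

def min_max(value, best, worst):
    """Normalize a single-match indicator against the round's best/worst values."""
    return (value - worst) / (best - worst)
```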
Given the differing impacts of these two factors on the results of feature point matching, we have constructed the following final scoring method:
The more reliable the feature matching result, the higher the feature-vector matching degree and the lower the spatial position discrepancy.
Finally, we select the image with the optimal score from all matching results and obtain its specific coordinates, returning these as the fine-grained spatial information. Through this series of processes, we establish a cascade model that better extracts fine-grained spatio-temporal information.
Figure 4 shows the impact of each level in this model on the image matching result.