4.1. Place name disambiguation
Although place names are extracted in advance by the tool, homonymy among place names under current naming practice makes the extraction results ambiguous and reduces their usefulness. In addition, the affiliation between place names must be considered when constructing the features of the area studied. Therefore, this paper proposes an improved heuristic disambiguation method based on a place name knowledge graph. The entity type of the knowledge graph constructed in this paper is place name, and the place name disambiguation algorithm is shown in Figure 2.
For the identified set of place names, matching retrieval is performed in the place name knowledge graph. This paper uses a 4-level place name knowledge graph, composed of the provincial, city, county, and township (town) levels, as the knowledge source for place name disambiguation. If the initial query of the graph locates a place name precisely with respect to its upper level, lower level, and same level, no disambiguation is needed. Otherwise, there are two cases.
(1) If the place name is retrieved but its meaning cannot be uniquely determined, it is considered ambiguous. The disambiguation steps are as follows. First, semantics are used to eliminate the irregularities caused by simplified expressions: if the query finds a full name for the place name, the full name is used to disambiguate. If more than one full name is retrieved, the ambiguity is eliminated with the aid of the other place names appearing in the abstract text. The place names already disambiguated in the previous step serve as markers, and disambiguation is performed sequentially using the identified place names, the superior or subordinate place names of the ambiguous place name, and the closest neighboring place name within the threshold range. If the ambiguity remains, scale disambiguation is used, that is, the full name with the highest administrative level among the candidate full names is chosen.
(2) If the place name cannot be retrieved, it is also judged to be ambiguous. The disambiguation steps are as follows. A fuzzy query is run against the place name database to retrieve all full names containing the place name. If there is a unique result, the ambiguity is eliminated. Otherwise, the ambiguity is eliminated using the identified place names. If there is still no unique location, scale disambiguation is used.
The algorithm flow is as follows.
The place names in the knowledge graph constitute the standard place name set P. For any paper $A_i \in Doc$, traverse the set $L_i$ of place names it contains, and process each place name in $L_i$ as follows.
Step 1: First, the place name is retrieved from P. If it matches a unique entry in P, it is removed from $L_i$ and added to the unambiguous place name set $T_i$.
Step 2: When the place name is not found in P, the semantic, marking, and scale disambiguation methods are applied sequentially until the geographic location is uniquely determined; the place name is then removed from $L_i$ and added to the unambiguous set $T_i$.
Step 3: When the place name is found in P but has multiple matches, the recognition, relation matching, word spacing, and scale disambiguation methods are applied sequentially until the geographic location is uniquely determined; the place name is then removed from $L_i$ and added to the unambiguous set $T_i$.
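The three steps can be illustrated with a minimal Python sketch. The toy gazetteer, the `Place` record, and the function names are hypothetical stand-ins for the knowledge graph and are not taken from the paper. Only the unique-match, fuzzy-query, marker (superior/subordinate), and scale steps are modeled; the neighboring-place-name threshold step is omitted.

```python
from dataclasses import dataclass

# Administrative levels, ordered from highest (province) to lowest (township).
LEVELS = ["province", "city", "county", "township"]


@dataclass(frozen=True)
class Place:
    full_name: str  # standard full name in the gazetteer
    level: str      # one of LEVELS
    parent: str     # full name of the superior place ("" for a province)


# Hypothetical toy gazetteer standing in for the standard place name set P.
GAZETTEER = [
    Place("Jilin Province", "province", ""),
    Place("Jilin City", "city", "Jilin Province"),
    Place("Changchun City", "city", "Jilin Province"),
    Place("Chaoyang District", "county", "Changchun City"),
    Place("Liaoning Province", "province", ""),
    Place("Chaoyang City", "city", "Liaoning Province"),
]


def candidates(name):
    """Exact match first; otherwise a fuzzy 'full name contains name' query."""
    exact = [p for p in GAZETTEER if p.full_name == name]
    return exact or [p for p in GAZETTEER if name in p.full_name]


def disambiguate(name, markers):
    """Resolve one name, using already-resolved places as markers."""
    cands = candidates(name)
    if len(cands) <= 1:
        return cands[0] if cands else None
    # Marker step: keep candidates that are the superior or subordinate
    # of a place that has already been resolved.
    marker_names = {m.full_name for m in markers}
    marker_parents = {m.parent for m in markers}
    related = [c for c in cands
               if c.parent in marker_names or c.full_name in marker_parents]
    if len(related) == 1:
        return related[0]
    # Scale step: fall back to the highest administrative level.
    pool = related or cands
    return min(pool, key=lambda p: LEVELS.index(p.level))


def disambiguate_all(names):
    """Resolve unique matches first, then use them as markers for the rest."""
    resolved, pending = {}, []
    for name in names:
        cands = candidates(name)
        if len(cands) == 1:
            resolved[name] = cands[0]
        else:
            pending.append(name)
    for name in pending:
        place = disambiguate(name, list(resolved.values()))
        if place is not None:
            resolved[name] = place
    return resolved
```

With this toy data, `disambiguate_all(["Changchun City", "Chaoyang"])` resolves "Chaoyang" to Chaoyang District because its superior place has already been resolved, while "Jilin" on its own falls through to the scale step and resolves to Jilin Province.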
4.2. Area studied recognition
The feature template designed for the characteristics of the area studied yields feature values that differ significantly between area studied and non-area studied place names, and the classifier uses this difference to separate the two with high precision and efficiency. Therefore, the feature set described in Section 3.3 is constructed for each place name in the extracted place name set, and the place names are classified into two categories according to the feature differences, area studied place names and non-area studied place names, to complete the extraction of the area studied (Figure 3).
This study targets the recognition of areas studied in the abstract texts of academic papers, and the composition of an area studied shows no regularity, so satisfactory results are difficult to achieve by relying only on the structure of the area studied itself. Once place names have been recognized accurately and meaningfully, features related to the area studied need to be constructed from the abstract text to improve the final extraction. Selecting appropriate features is therefore crucial.
(1) Title association feature
The titles of academic papers are generally concise and reflect the core content of the text, so the elements contained in the title, such as the research topic and the area studied, are correlated with the core content to different degrees. Sentences in the text that contain these elements are therefore also strongly correlated with the core content, and the place name in the most strongly correlated sentence is more likely to be the area studied. If the abstract contains a place name that appears in the title, especially the subtitle, that place name is more likely than not to be the area studied. If the title does not include a place name but a sentence in the abstract that is highly similar to the title contains one, the probability that this place name is the area studied also increases. Therefore, the degree of correlation between a sentence and the title is used as one of the features for extracting the area studied.
For any paper $A_i \in Doc$, the set of titles is $T_i$, the set of abstract sentences is $S_i$, and the set of sentences containing place names is the subset $S_i'$, that is, $S_i' = \{s \mid s \in S_i \wedge s \text{ includes a place name}\}$. For each $A_i$ with no place name in the title, the title similarity of each sentence in $S_i'$ is evaluated, the title-similar sentence is written $c(s_{i,j}) \in C$, and the modified title set $T_i'$ is obtained. The title association feature of a place name is denoted $p(\cdot)$. The calculation formula is shown in Formula 1.
The input of the feature algorithm is the abstract, and the output is the title association feature value of every place name in the abstract. Papers without a place name in the title are identified first and their title-similar sentences $c(A_i)$ obtained; $c(A_i)$ then replaces the title $T_i$ of the corresponding paper to give the updated title set $T_i'$. Finally, each place name is checked against $T_i'$: if it appears there, its feature value is 1; otherwise, it is 0. In particular, for place names related by affiliation that appear simultaneously in $T_i'$, the one with the smallest administrative division rank is given a feature value of 1 and the rest 0.
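The procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's implementation: the title similarity measure is not specified in this section, so Jaccard overlap of word tokens is used as a stand-in; place names are matched by simple substring tests; "smallest administrative division rank" is read as the most specific (lowest-level) place; and all function names are hypothetical.

```python
import re


def tokens(text):
    """Lower-cased word tokens of a text, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Assumed similarity measure: Jaccard overlap of word tokens."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


def effective_title(title, sentences, place_names):
    """Keep the title if it names a place; otherwise substitute the
    place-bearing abstract sentence most similar to the title."""
    if any(p in title for p in place_names):
        return title
    with_places = [s for s in sentences if any(p in s for p in place_names)]
    if not with_places:
        return title
    return max(with_places, key=lambda s: jaccard(title, s))


def title_association(title, sentences, parents):
    """parents maps each place name to its superior place name (or None).
    Returns a 0/1 title association value for every place name."""
    ref = effective_title(title, sentences, parents)
    present = {p for p in parents if p in ref}
    # When affiliated names co-occur, keep only the most specific one by
    # discarding every ancestor of a name that is present.
    for p in list(present):
        ancestor = parents.get(p)
        while ancestor is not None:
            present.discard(ancestor)
            ancestor = parents.get(ancestor)
    return {p: int(p in present) for p in parents}
```

For example, with `parents = {"Wuhan": "Hubei", "Hubei": None, "Beijing": None}` and the title "Land use change in Wuhan, Hubei", only "Wuhan" receives the value 1.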
(2) Location feature
In the abstracts of academic papers, it is customary to express the central idea in the first two sentences or to summarize it in the last sentence. Therefore, a place name that appears in the first two sentences or the last sentence of the abstract is more likely to be the area studied than a place name that appears in other sentences.
In this paper, the distance of a sentence is defined as follows. An article $A_i$ is split sequentially into a set of sentences $S_i = \{s_{i,1}, \ldots, s_{i,n}\}$, which does not contain the title $t_i$. The distance of $s_{i,j}$ is its distance to the title, that is, its value is $j$. The formula for calculating this feature is shown in Formula 2.
In the formula, $j$ is the sentence distance and $|S_i|$ is the size of the set $S_i$, that is, the total number of sentences. The sentence distances are normalized by this calculation.
For a set $S_i$ with $n$ elements, the first two terms $s_{i,1}$ and $s_{i,2}$ and the last term $s_{i,n}$ are treated specially: the location feature value of a place name appearing in a sentence with $j$ equal to 1, 2, or $n$ is recorded as 1; otherwise, it is recorded as 0.
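The binary rule above can be written as a short Python sketch. The function name and the substring test for place names are illustrative assumptions; sentence splitting is assumed to have been done already.

```python
def location_feature(sentences, place_names):
    """Return 1 for place names that occur in sentence 1, 2, or n of the
    abstract (title excluded), and 0 for all other place names."""
    n = len(sentences)
    key_positions = {1, 2, n}
    feature = {p: 0 for p in place_names}
    # j is the sentence distance: 1 for the first sentence after the title.
    for j, sentence in enumerate(sentences, start=1):
        if j not in key_positions:
            continue
        for p in place_names:
            if p in sentence:
                feature[p] = 1
    return feature
```

A place name mentioned only in a middle sentence therefore receives 0, while one mentioned in the opening or closing sentence receives 1.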
(3) Time association feature
Research time and research location often appear together in the abstracts of academic papers. The granularity of the research time varies from article to article, but it is always expressed with time words, such as a year. Therefore, when a sentence in the abstract contains both a time word and a place name, the probability that the place name is the area studied increases. When such a sentence contains multiple place names, the closer a place name is to the time word, the higher the probability that it is the area studied.
For any article $A_i \in Doc$ with sentence set $S_i$, the subset $S_i'$ of sentences containing both a place name and a time word is extracted, that is, $S_i' = \{s \mid s \in S_i \wedge s \text{ includes a place name and a time}\}$. The formula for calculating the time association feature designed in this paper is shown in Formula 3.
In the formula, $L(\cdot)$ applied to a place name denotes the location of the j-th place name of the i-th article, $L(\cdot)$ applied to a time word denotes the location of the p-th time word of the i-th article, $|S_i'|$ denotes the length of the i-th abstract text, and $Y(\cdot)$ denotes the time association feature of the place name.
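The idea that proximity to a time word raises the score can be sketched in Python. Formula 3 itself is not reproduced in this section, so the scoring function below is an assumed stand-in rather than the paper's formula: it takes the token gap between a place name and its nearest time word and normalizes it by the sentence length. The year pattern, the single-token place names, and the function name are also assumptions.

```python
import re

# Assumed time-word detector: four-digit years from 1900 to 2099.
YEAR = re.compile(r"^(19|20)\d{2}$")


def time_association(sentence, place_names):
    """Score each single-token place name by closeness to the nearest
    year token; 0.0 if the sentence lacks a year or the place name."""
    toks = re.findall(r"\w+", sentence)
    time_pos = [i for i, t in enumerate(toks) if YEAR.match(t)]
    scores = {}
    for p in place_names:
        place_pos = [i for i, t in enumerate(toks) if t == p]
        if not time_pos or not place_pos:
            scores[p] = 0.0
            continue
        # Smallest token distance between this place name and any year.
        gap = min(abs(a - b) for a in place_pos for b in time_pos)
        scores[p] = 1.0 - gap / len(toks)
    return scores
```

In the sentence "From 2010 to 2020 land use in Wuhan changed faster than in Beijing", "Wuhan" lies closer to a year token than "Beijing" and so receives the higher score.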
(4) Trigger word feature
Sentences that mention the area studied tend to use fixed expressions. For example, words such as "analyze", "study", and "explore" often appear before the area studied, and expressions such as "located", "for example", "for the scope of the investigation", and "for the research object" often appear after it. These words are called trigger words, and the probability that a place name is the area studied increases when a trigger word appears in the sentence containing the place name.
For an article $A_i \in Doc$ with sentence set $S_i$, the subset $S_i''$ of sentences containing both a place name and trigger words is extracted, that is, $S_i'' = \{s \mid s \in S_i \wedge s \text{ includes a place name and trigger words}\}$. The formula for calculating the trigger word feature designed in this paper is shown in Formula 4.
In the formula, the location term denotes the position of the j-th place name of the i-th article, and $T(\cdot)$ denotes the trigger word feature value of that place name.
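A minimal Python sketch of a trigger word check follows. Formula 4 is not reproduced in this section, so a binary value is assumed here: 1 if a "before" trigger precedes the place name or an "after" trigger follows it in the same sentence, and 0 otherwise. The trigger lists are taken from the examples in the text; the function name and substring matching are illustrative assumptions.

```python
# Trigger words quoted in the text, split by where they tend to appear.
BEFORE = ("analyze", "study", "explore")
AFTER = ("located", "for example", "for the scope of the investigation",
         "for the research object")


def trigger_feature(sentence, place):
    """Assumed binary trigger word feature for one place name."""
    s = sentence.lower()
    idx = s.find(place.lower())
    if idx < 0:
        return 0  # the place name does not occur in this sentence
    left = s[:idx]                 # text before the place name
    right = s[idx + len(place):]   # text after the place name
    hit = any(w in left for w in BEFORE) or any(w in right for w in AFTER)
    return int(hit)
```

Substring matching is deliberately crude (for instance, "study" also matches "studying"); a tokenized or lemmatized match would be a natural refinement.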
The template developed for the area studied therefore makes full use of the abstract text content and integrates the title association feature, location feature, time association feature, and trigger word feature; it is shown in Formula 5.
Here, the j-th feature value (j = 1, 2, 3, 4) of the i-th place name entity corresponds, in order, to the title association feature, location feature, time association feature, and trigger word feature. The label of the i-th place name entity takes one of two values, 1 and 0, indicating area studied and non-area studied, respectively.
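The template can be pictured as one record per place name entity holding the four feature values and the binary label. The Python sketch below shows one possible layout; the class name, field names, and sample values are made up for illustration and are not results from the paper.

```python
from typing import NamedTuple


class PlaceSample(NamedTuple):
    name: str
    title_assoc: int    # feature 1: title association (0 or 1)
    location: int       # feature 2: location (0 or 1)
    time_assoc: float   # feature 3: time association
    trigger: int        # feature 4: trigger word (0 or 1)
    label: int          # 1 = area studied, 0 = non-area studied

    def features(self):
        """Feature vector in the order used by the template."""
        return [self.title_assoc, self.location, self.time_assoc, self.trigger]


# Made-up illustrative values, not data from the paper.
samples = [
    PlaceSample("Wuhan", 1, 1, 0.69, 1, 1),
    PlaceSample("Beijing", 0, 0, 0.31, 0, 0),
]
X = [s.features() for s in samples]  # classifier inputs
y = [s.label for s in samples]       # classifier targets
```

The lists `X` and `y` are in the shape most binary classifiers expect, one row of four features per place name and one label per row.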
4.3. Automatic Generation of Massive Thematic Maps
The massive literature contains two categories: academic journal papers and master's theses. It covers 14 disciplinary categories, such as science, engineering, agriculture, medicine, and literature, and these major categories are subdivided into numerous research directions. This paper focuses on academic journal papers in the field of surveying, mapping, and geographic information to carry out research related to the area studied. The carrier of the area studied is the place name. In practice, the research field often does not appear directly in a paper, but the keywords express the paper's research topic and indirectly reflect the field information.
The production process of the literature knowledge thematic map is shown in Figure 4. The topic of the study is first determined quickly from the keywords of the literature. A research topic is obtained by clustering multiple semantically similar keywords around a cluster center. Since the stock of research literature is large and constantly updated, the clustering uses the leader-follower incremental clustering strategy [32]. The specific process is as follows. During clustering, the set of centers of the existing clusters is maintained, $C = \{c_1, c_2, \ldots, c_n\}$, where each $c$ denotes the cluster center obtained by combining the keywords contained in its cluster. When a new keyword $I_i$ is clustered, the spatial, temporal, and semantic similarity measures are used to calculate the spatio-temporal semantic similarity between $I_i$ and the center $c$ of each cluster [33]. If the spatio-temporal semantic similarity between the new information item $I_i$ and an existing cluster $c$, Total_sim($I_i$, $c$), is greater than a certain threshold (0.7 in this study), then $I_i$ is included in cluster $c$. The calculation is given in Formula 6.
Here, L_sim($I_i$, $c$), T_sim($I_i$, $c$), and S_sim($I_i$, $c$) denote the spatial, temporal, and semantic similarity, respectively. After spatio-temporal semantic similarity clustering has been completed for the keywords associated with a topic, information that is spatially coterminous, the same or similar in time, and semantically the same or closely similar in content is integrated and consolidated. On this basis, the scientific literature is searched using the keywords contained under each topic and is then associated indirectly with the research areas extracted from the literature; that is, the relationship between topics and research areas is constructed through the scientific literature. The geographical names of the areas studied have already been disambiguated and can be associated directly with the corresponding geographical entities in the geographical names database to complete the geographical mapping operation.
The areas studied processed in this way ensure the accuracy of the knowledge, but the huge amount of information is not conducive to knowledge mining and use, so the areas studied are counted separately at three levels: provincial, municipal, and county. That is, the areas studied are counted by topic and by administrative scale. Then, the most suitable administrative division scale is selected for each theme and displayed with priority, and the center point of the result map is also set dynamically according to the theme. The whole map is color graded: the results are displayed with graduated colors based on how frequently each area is selected as an area studied. Finally, map decorations such as the map name, output time, output unit, and legend are drawn according to the literature knowledge to complete the thematic mapping.
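The leader-follower clustering step described earlier in this section can be sketched in Python as follows. The weights in Formula 6 and the individual similarity measures are not reproduced here, so this sketch makes several assumptions: an equal-weight average of the three similarities, exact-match spatial similarity, a linear 10-year decay for temporal similarity, Jaccard overlap for semantic similarity, and assignment to the best-scoring cluster when several exceed the threshold. Only the 0.7 threshold comes from the text.

```python
THRESHOLD = 0.7  # threshold reported in the text


def jaccard(a, b):
    """Overlap of two term sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0


def total_sim(item, centre):
    """Assumed equal-weight combination of L_sim, T_sim and S_sim."""
    l_sim = 1.0 if item["place"] == centre["place"] else 0.0
    t_sim = max(0.0, 1.0 - abs(item["year"] - centre["year"]) / 10.0)
    s_sim = jaccard(item["terms"], centre["terms"])
    return (l_sim + t_sim + s_sim) / 3.0


def centre_of(members):
    """Cluster centre obtained by combining the members of a cluster."""
    return {
        "place": members[0]["place"],  # the leader's place
        "year": sum(m["year"] for m in members) / len(members),
        "terms": set().union(*(m["terms"] for m in members)),
    }


def leader_follower(items, threshold=THRESHOLD):
    """Assign each incoming item to the most similar existing cluster if
    the similarity exceeds the threshold; otherwise start a new cluster."""
    clusters = []
    for item in items:
        best, best_sim = None, threshold
        for members in clusters:
            sim = total_sim(item, centre_of(members))
            if sim > best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([item])  # the item becomes a new leader
        else:
            best.append(item)
    return clusters
```

Because each item is compared only against the current cluster centres, new keywords can be added as the literature is updated without re-clustering the existing ones, which is the property the incremental strategy is chosen for.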