3.5.1. R&D of Corpus Analysis Software
(1) Topic Recognition
To demonstrate the effect of text topic recognition clearly, a topic model was adopted to classify the texts and reduce their dimensionality. A topic model is a statistical model that clusters the implicit semantic structure of a corpus in an unsupervised manner. After comparing the existing topic models, the LDA2Vec model (a jointly trained topic model based on deep learning) was used for topic recognition on this platform. Combining the global prediction of latent Dirichlet allocation (LDA) with the local prediction of the word vector model, LDA2Vec extends the skip-gram negative-sampling loss function and jointly trains document vectors and word vectors to predict the context of each pivot word, obtaining word vector and topic vector representations that carry topic information. Additionally, this model produces sparse, interpretable document vectors, allowing for an easier understanding of the topic recognition results.
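The following minimal sketch (not the platform's implementation) illustrates how LDA2Vec composes the context vector described above: the pivot word vector is summed with a document vector that is itself a mixture of topic vectors. All sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_topics, dim = 5000, 20, 128

word_vectors = rng.normal(size=(vocab_size, dim))   # e.g. initialized from Word2Vec
topic_vectors = rng.normal(size=(n_topics, dim))    # learned topic embeddings
doc_weights = rng.normal(size=n_topics)             # unnormalized per-document weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Document vector: a mixture of topic vectors weighted by the (softmaxed) document weights.
doc_vector = softmax(doc_weights) @ topic_vectors

# Context vector for a pivot word: word vector + document vector.
pivot_id = 42
context_vector = word_vectors[pivot_id] + doc_vector

# During training, this context vector enters the skip-gram negative-sampling
# loss to predict the words surrounding the pivot word.
```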
The specific process is as follows:
First, the Word2Vec word vector model was used to generate word vectors for the words in the fused document set from the preprocessed corpus; these vectors served as one part of the model input. The Gensim library for Python, which encapsulates the Word2Vec model, was used for this step: the word vectors were trained with the skip-gram variant of Word2Vec through gensim.models.Word2Vec.
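A minimal sketch of this skip-gram training step, assuming the preprocessed corpus is stored as tokenized sentences; the parameter values are illustrative, not those used on the platform.

```python
from gensim.models import Word2Vec

# Placeholder for the preprocessed, tokenized corpus (one list per sentence).
tokenized_corpus = [
    ["corpus", "analysis", "platform"],
    ["topic", "model", "training"],
]

w2v = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=200,   # called "size" in Gensim 3.x
    window=5,
    min_count=1,
    sg=1,              # 1 = skip-gram
    negative=5,        # negative sampling
    workers=4,
)

word_vectors = w2v.wv  # e.g. word_vectors["topic"] returns the trained vector
```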
Second, the topic-word distribution matrix and document weights were generated by the LDA model as the other part of the LDA2Vec input; the same preprocessed corpus served as the LDA training dataset. Python toolkits such as Gensim and Scikit-learn (Sklearn) provide encapsulations of the LDA model; because the perplexity evaluation indicator had to be calculated in the later experiment, the LDA model was trained with Sklearn.
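A minimal sketch of this LDA step with Sklearn, assuming the preprocessed corpus is available as plain-text documents; the number of topics is illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = ["corpus analysis platform ...", "topic model training ..."]  # placeholder

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=10, learning_method="batch", random_state=0)
doc_topic = lda.fit_transform(X)   # document-topic weights
topic_word = lda.components_       # topic-word distribution matrix

# Sklearn exposes perplexity directly, which is why it was preferred here.
print(lda.perplexity(X))
```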
Finally, the obtained word vectors and document vectors were input into LDA2Vec for fusion training and topic extraction. Topics were ranked in descending order of their probability of occurrence, and the 10 highest-probability topic words under each topic were selected to elucidate its implicit semantics more clearly and accurately. pyLDAvis was used to visualize the topic recognition results, allowing more intuitive observation and analysis of the hot topics.
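A sketch of extracting the ten highest-probability words per topic and rendering the pyLDAvis view; the fitted objects mirror the LDA sketch above, and the module path pyLDAvis.sklearn is the one exposed by older pyLDAvis releases (newer releases expose pyLDAvis.lda_model instead).

```python
import pyLDAvis
import pyLDAvis.sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = ["corpus analysis platform ...", "topic model training ..."]  # placeholder
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

feature_names = vectorizer.get_feature_names_out()  # get_feature_names() in older sklearn
for topic_id, weights in enumerate(lda.components_):
    top_words = [feature_names[i] for i in weights.argsort()[::-1][:10]]
    print(topic_id, top_words)

panel = pyLDAvis.sklearn.prepare(lda, X, vectorizer)
pyLDAvis.save_html(panel, "topics.html")
```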
(2) Entity Recognition
As an information extraction technique, entity recognition can obtain entity data, such as person names and location names, from text data. A named entity recognition method based on multiple features was adopted in this system to fully exploit both the contextual features and the internal features of each entity. The morphological features cover the following cases: every character or word in the dictionary forms its own category, while person names (Per), abbreviations of person names (Aper), location names (Loc), abbreviations of location names (Aloc), organization names (Org), time words (Tim) and number words (Num) are each defined as a separate category. Morphological features and part-of-speech features were used together to establish entity recognition models according to the structural characteristics of different entities, yielding superior recognition performance and system efficiency.
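The following is an illustrative sketch, not the platform's model, of building per-token features that combine the morphological categories listed above with part-of-speech and context-window information; such feature dictionaries could then feed a sequence labeller such as a CRF. All names are assumptions.

```python
MORPH_CATEGORIES = {"Per", "Aper", "Loc", "Aloc", "Org", "Tim", "Num"}

def token_features(tokens, pos_tags, morph_tags, i):
    """Features for the i-th token: its form, POS and morphological
    category, plus those of its immediate neighbours."""
    feats = {
        "word": tokens[i],
        "pos": pos_tags[i],
        "morph": morph_tags[i] if morph_tags[i] in MORPH_CATEGORIES else "O",
    }
    if i > 0:
        feats["prev_word"] = tokens[i - 1]
        feats["prev_pos"] = pos_tags[i - 1]
    if i < len(tokens) - 1:
        feats["next_word"] = tokens[i + 1]
        feats["next_pos"] = pos_tags[i + 1]
    return feats

# Example usage with a toy sentence.
tokens = ["Zhang", "San", "visited", "Beijing", "yesterday"]
pos_tags = ["NR", "NR", "VV", "NR", "NT"]
morph_tags = ["Per", "Per", "O", "Loc", "Tim"]
print(token_features(tokens, pos_tags, morph_tags, 3))
```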
(3) Keyword Extraction
The TextRank algorithm can be used to extract keywords and summaries (key sentences) from texts. As a Python implementation of the TextRank algorithm, TextRank4ZH can extract keywords and summaries from Chinese and English articles and has been widely used because of its simplicity and effectiveness.
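A minimal usage sketch, assuming the textrank4zh package's published interface; the window size and keyword counts are illustrative.

```python
from textrank4zh import TextRank4Keyword, TextRank4Sentence

text = "Place the article text to be analysed here."  # placeholder

tr4w = TextRank4Keyword()
tr4w.analyze(text=text, lower=True, window=2)     # build the word co-occurrence graph
for item in tr4w.get_keywords(num=10, word_min_len=2):
    print(item.word, item.weight)

tr4s = TextRank4Sentence()
tr4s.analyze(text=text, lower=True)
for item in tr4s.get_key_sentences(num=3):        # summary (key sentences)
    print(item.weight, item.sentence)
```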
The TextRank4ZH algorithm split the original texts into sentences, filtered out stop words (optional) in each sentence and kept only words with specified parts of speech (optional), from which a collection of sentences and a collection of words could be obtained.
Each word acted as a node in PageRank. With the window size set to k, and assuming a sentence consists of the words w1, w2, w3, ..., wn in turn, then [w1, w2, ..., wk], [w2, w3, ..., wk+1], [w3, w4, ..., wk+2], etc. are all windows. There is an undirected and unweighted edge between the nodes corresponding to any two words in the same window.
Based on the graph constructed above, the importance of each word node can be calculated, and the most important words can be used as keywords, as in the sketch below.
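A sketch of the window-based co-occurrence graph and ranking step described above, using networkx's PageRank; the tokenization and window size are simplified for illustration.

```python
import networkx as nx

tokens = ["corpus", "analysis", "platform", "corpus", "statistics", "analysis"]
k = 3  # window size

graph = nx.Graph()  # undirected, unweighted
for start in range(len(tokens) - k + 1):
    window = tokens[start:start + k]
    # Add an edge between every pair of distinct words in the window.
    for i in range(len(window)):
        for j in range(i + 1, len(window)):
            if window[i] != window[j]:
                graph.add_edge(window[i], window[j])

scores = nx.pagerank(graph)
keywords = sorted(scores, key=scores.get, reverse=True)
print(keywords[:3])  # the most important words serve as keywords
```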
(4) Analysis of Text Similarity
To determine the degree of correlation between the text contents and the topics, the correlation was calculated with a text similarity method.
The three currently most popular similarity calculation methods were analyzed as follows:
Levenshtein (edit) distance: Although its computational complexity is high, it performs outstandingly on real texts and yields the most accurate similarity calculation.
Cosine similarity: Its computational complexity is high, and because the data are extremely sparse in practical applications, the cosine similarity calculation produces misleading results.
MinHash: Both SimHash and MinHash have locality-sensitive properties not found in general hash methods (they belong to locality-sensitive hashing, LSH), so two similar documents yield close hash values.
After analysis and comparison, the Levenshtein edit distance algorithm, which performed best on the actual texts, was adopted to calculate the similarity.
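A plain dynamic-programming sketch of the Levenshtein edit distance used for the similarity calculation; the normalization into a [0, 1] similarity score is one common convention, not necessarily the platform's exact formula.

```python
def levenshtein(a: str, b: str) -> int:
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

print(similarity("corpus analysis", "corpus analyses"))
```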
3.5.2. R&D and Embedding of Corpus Analysis Software
The most widely used corpus analysis software includes WordSmith and AntConc, which are typically used as standalone applications. However, this platform was developed on the B/S architecture and is deployed and used in a Linux environment, and the compatibility issues between the platform and these commercial software systems could not be resolved. The research and development (R&D) team therefore decided to develop customized corpus statistics and analysis software based on machine learning algorithms to realize the corpus analysis function. Drawing on the functions of existing commercial corpus analysis software, the following functions were developed and realized on the platform: research, judgment and labeling of the collected corpus with respect to part of speech, syntax and the concentration of the topics covered; extraction of high-frequency words, collocated lexical chunks and keyword tables; statistics of the type-token ratio, standardized type-token ratio, average sentence length and structural capacity; and visual analysis of emotional tendency and semantic prosody tendency. At present, natural language processing (NLP) technology based on deep learning (Goldberg, 2018; He, 2020) is well developed, so it is feasible to use it to analyze and recognize multilingual parts of speech and to count word frequencies. Moreover, it fits well with a platform developed in Python, which lays the groundwork for further deep learning algorithms at a later stage.
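A small sketch of some of the corpus statistics named above (type-token ratio, standardized type-token ratio and average sentence length); the chunk size and the regex-based tokenization are illustrative assumptions, not the platform's exact procedure.

```python
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def type_token_ratio(tokens):
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def standardized_ttr(tokens, chunk=1000):
    # Mean TTR over consecutive fixed-size chunks, as in common corpus tools.
    chunks = [tokens[i:i + chunk] for i in range(0, len(tokens) - chunk + 1, chunk)]
    if not chunks:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)

def average_sentence_length(text):
    sentences = [s for s in re.split(r"[.!?。！？]", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)

text = "Corpus analysis supports teaching. It also supports research."
tokens = tokenize(text)
print(type_token_ratio(tokens), standardized_ttr(tokens), average_sentence_length(text))
```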