3.1. General Findings
Following the procedure presented in Section 3, general data were extracted from the 194 selected articles regarding the types of research, publication years, and keywords, enabling the identification of overall publication patterns.
Figure 2 contains the distribution of these works according to the publication years from 2014 to 2021.
The largest number of works occurred in 2018 with 36 works, followed by 2019 with 35, 2017 with 34, and 2020 with 32. These four years account for 137 works (70.10% of all selected publications). These data suggest that 2017 marked a turning point in publication volume on the topics of interest, with 47.83% more works than in 2016. It is important to emphasize that the final number of works for 2021 was not higher because data collection was completed in June 2021.
Figure 3 shows the distribution of work types, with conference papers being the most common, followed by journal articles and book chapters.
Most conference papers appeared in 2017, followed by 2016 and 2020, these three years totaling 50 works. Journal articles peaked in 2018, followed by 2019 and 2020, totaling 53 works. Finally, book chapters appeared in equal numbers in 2018 and 2019, followed by 2016 and 2020, totaling 16 works. The distribution of these types by year is listed in Table 2.
From the information extracted about the publication channels, it was possible to count the works on text mining in public security published in each journal, conference, and book/series between 2014 and 2021.
Table 3 contains the journals with more than one article published on the theme in the defined years. Regarding these findings, it is worth noting that Procedia Computer Science is a journal dedicated to publishing high-quality conference proceedings. The journals in Table 3 were extracted from a list containing 64 journals.
There was an extensive list of 84 different conferences, of which only five (the 2016 Pacific Asia Conference on Information Systems, the 2017 European Intelligence and Security Informatics Conference, the 2018 IEEE International Conference on Intelligence and Security Informatics, the 22nd Americas Conference on Information Systems, and the 8th International Conference on Computing, Communication, and Networking Technologies) contributed more than one paper to the selected literature. Finally, among the seventeen books from which the selected chapters came, one (Advances in Intelligent Systems and Computing) contained three selected chapters, while all the others contained only one.
3.2. Application areas for text mining in public security
The first research question (RQ1) considers the application areas for text mining within the context of public security. Based on what the authors made evident in their texts, the analysis of the selected studies identified nineteen different application areas. These areas and the counts of related findings are listed in Table 4.
The nineteen application areas in Table 4 were defined according to the findings in the literature. Each selected journal article, conference paper, or chapter was manually tagged with the primary application area of text mining tools in public security, as identified while reading these materials.
These results suggest that "Cybersecurity" is the area with the most studies containing text mining applications. It is followed by "General crime detection/prediction", "Fraud detection", "Terrorism detection", "Cyberbullying detection", and "Digital/Cyber forensics", collectively representing 80.94% of the selected studies.
While the labels used to designate the corresponding topics seem straightforward, some areas, such as "General crime detection/prediction", "Support to Law Enforcement agencies' actions", and "Support to the Judiciary power", are more general. Others, such as "Digital/Cyber forensics", "Cyberbullying detection", and "Information security", are aligned with "Cybersecurity" but were kept separate to preserve detail about these areas.
The label "General crime detection/prediction" concerns an application area not dedicated to a specific type of crime, such as "Fraud detection" or "Drug-related crime detection." The works by Aghababaei and Makrehchi [
24], Das and Das [
25], Qazi and Wong [
26], and Lal
et al. [
27] contain examples of applications in "General crime detection/prediction".
The "Support to Law Enforcement agencies' actions" application area is related to providing complete systems architecture, methodologies, and frameworks for agencies dedicated to ensuring compliance with the laws to maintain public security and social welfare. This set is composed of the works by Badii
et al. [
28] that proposed a system architecture to provide data analytics (including a text mining and analytics module) for supporting decision-making in law enforcement agencies; Bisio
et al. [
29] that proposed an approach to allow law enforcement agencies to detect events, using Twitter traffic monitoring, that compromise public security; Basilio
et al. [
30] that presented a methodology to extract knowledge from police reports for extracting information to support activities related to law enforcement; Behmer
et al. [
31] that proposed a framework to support law enforcement agencies in the investigations and analyzes of organized crime; Basilio
et al. [
32] that developed a method for knowledge discovery in emergency response databases based on police reports; and Hou
et al. [
33] that proposed the Bidirectional Encoder Representation from Transformers based on the Chinese relation extraction algorithm for public security, for security information mining.
The "Support to the Judiciary power" area refers to developing applications that aid judiciary activities since they may also be related to crime judgments and analysis or judicial reports about crimes. This set is composed of the works by Nikolić
et al. [
34] that proposed an e-Government service for extracting information from documents related to laws (Criminal Codes, for instance); Iftikhar
et al. [
35] that proposed a system to support courts' activities with text mining to extract relevant information from legal data; Pina-Sánchez
et al. [
36] that analyzed court sentence databases to detect ethical discrimination; Pina-Sánchez
et al. [
37] that proposed an approach to access data based on mining judiciary sentence records about crime available online; Xia
et al. [
38] that evaluated if judge gender exerted some effect over the sentences concerning rape, and Gomes and Ladeira [
39] that applied an empirical evaluation of a framework for jurisprudence retrieval to ease the task of retrieval of other decisions with the same legal opinion.
The "Drug-related crimes detection and Weapons' trafficking detection" combination appears among the application areas, with only one study selected, by Al-Nabki
et al. [
40] that proposed a new feature replacing the use of external sources of knowledge, applying it to recognize named entities related to suspicious activities related to weapons and drug trafficking through the Tor Darknet. The distribution of the selected works by year is shown in
Figure 4, highlighting cybersecurity as the area with the greatest number of selected works over seven years in the defined period.
3.3. Text mining techniques and technologies applied in public security
The selected works' methodological sections were analyzed to answer the second research question (RQ2). In this case, information extraction was performed to identify the terms referring to techniques or technologies, counting their frequencies. Techniques are all the algorithms and methods used to make text mining viable, while technologies can be understood as tools such as programming languages, code libraries, and other software that contain implementations of these techniques.
For each term, the number of occurrences represents the number of works that included that specific technique or technology; note that each work could apply more than one.
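As a hedged illustration of this counting step (in Python, the most recurrent language among the selected works, as discussed below), the sketch tallies hypothetical per-work term sets; each work contributes a term at most once, so a term's count equals the number of works applying it.

```python
# Minimal sketch of the frequency counting described above; the per-work
# term sets are hypothetical placeholders, not data from the review.
from collections import Counter

terms_per_work = [
    {"support vector machines", "term frequency-inverse document frequency"},
    {"naïve bayes", "support vector machines", "python"},
    {"random forests", "python", "support vector machines"},
]

# Using sets means each work counts a term once, even if applied repeatedly.
counts = Counter(term for work in terms_per_work for term in work)
print(counts.most_common(3))
```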
Figure 5 contains the frequency of the 20 most recurrent terms.
The terms "support vector machines", "naïve Bayes", "random forests", "decision trees", "logistic regression", "
k-nearest neighbors", and "neural networks" represent machine learning techniques applied to classification problems typically related to the detection or prediction of crimes within the context of the types of applications in the security areas presented in the previous section. Of these, "support vector machines" is the most frequent technique, being a discriminative classifier [
41] and one of the most effective classification algorithms for general purposes [
42].
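As a minimal sketch (not any specific study's pipeline), the following Python snippet trains a linear support vector machine on a hypothetical toy corpus of labeled report snippets, using scikit-learn, one of the libraries discussed later in this section:

```python
# Hedged sketch: TF-IDF features + linear support vector machine.
# The corpus, labels, and query are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "vehicle stolen from parking lot overnight",
    "community festival scheduled for saturday",
    "suspect fled after armed robbery downtown",
    "library extends weekend opening hours",
]
labels = ["crime", "other", "crime", "other"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["robbery reported near the station"]))  # expected: ['crime']
```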
The term "naïve Bayes" refers to one of the simplest generative machine learning classifiers [
43], and its algorithm is based on the Bayes Theorem with independence assumptions between the predictors [
44]. It is the second machine learning technique most frequently applied by the literature selected.
The term "random forests" refers is an ensemble technique with excellent predictive performance [
42] using unpruned decision trees based on bootstrap samples of the training data [
45]. "Decision trees" refer to another popular technique based on a tree data structure that contains a set of nodes and edges to support decision-making [
43]. Both "random forests" and "decision trees" occurred the same number of times in the selected literature.
The term "logistic regression" refers to a generalized linear regression model [
42] that makes predictions using a binary or multiclass outcoming [
46]. The term "
k-nearest neighbors" refers to a popular technique that assigns elements to a class with their neighbors according to a similarity measure (as in cosine and Jaccard similarities, for instance) [
44,
47].
The term "neural networks" refers to non-linear machine learning techniques that simulate the human brain to solve problems [
43,
48]. These networks establish relationships between inputs and outputs, associating input data to their belonging classes through a series of hidden layers and the links between the created nodes [
49].
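To make the comparison concrete, a hedged sketch below fits each of the classifiers named in the preceding paragraphs on the same hypothetical TF-IDF features (training accuracy only; actual studies would use held-out data or cross-validation):

```python
# Hedged sketch comparing the classifiers discussed above on shared
# TF-IDF features; the texts and labels are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

texts = [
    "armed robbery reported at the mall",
    "stolen car recovered by patrol unit",
    "fraudulent emails target bank customers",
    "city council approves new park budget",
    "weather forecast predicts heavy rain",
    "school announces annual science fair",
]
y = [1, 1, 1, 0, 0, 0]  # 1 = crime-related, 0 = other

X = TfidfVectorizer().fit_transform(texts)
classifiers = {
    "naive Bayes": MultinomialNB(),
    "random forest": RandomForestClassifier(),
    "decision tree": DecisionTreeClassifier(),
    "logistic regression": LogisticRegression(),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "neural network (MLP)": MLPClassifier(max_iter=500),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, clf.score(X, y))  # training accuracy; cross-validate in practice
```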
The term "latent Dirichlet allocation" refers to a machine learning technique dedicated to topic modeling. It is a generative probabilistic model used to identify latent topics among the texts in a corpus, modeling each corpus item as a finite mixture over a latent set of topics [
50,
51]. Topic modeling is the process of discovering hidden topics within semantic structures that contain interrelated concepts [
52,
53,
54].
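A minimal topic-modeling sketch with latent Dirichlet allocation, assuming scikit-learn and a hypothetical four-document corpus, might look as follows:

```python
# Hedged sketch: discover two latent topics in a tiny hypothetical corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "police arrest suspect in burglary case",
    "officers investigate burglary downtown",
    "phishing scam steals banking credentials",
    "malware campaign targets online banking",
]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # LDA works on raw term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words per topic (topic-word weights from the fitted model).
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[-4:][::-1]
    print(f"topic {k}:", [terms[i] for i in top])
```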
"Term frequency-inverse document frequency" is a statistical measure applied for feature extraction and selection, which consists of reducing the original set of textual data into a new set, more readable by other techniques, such as machine learning related ones [
55,
56,
57]. Another related term is "term frequency", simply referring to counting the frequency of words in a text, being a component of "term frequency-inverse document frequency" calculation [
54]. In
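As a worked illustration of this weighting (using the classic tf × log(N/df) form; concrete implementations such as scikit-learn's add smoothing), consider:

```python
# Worked sketch of tf-idf on a tiny hypothetical corpus of tokenized docs.
import math
from collections import Counter

docs = [
    "fraud fraud report".split(),
    "crime report".split(),
    "crime crime news".split(),
]
N = len(docs)
# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)  # term frequency within the document
    idf = math.log(N / df[term])     # inverse document frequency
    return tf * idf

print(tf_idf("fraud", docs[0]))   # high: frequent here, absent elsewhere
print(tf_idf("report", docs[0]))  # lower: appears in two of three documents
```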
In Figure 5, a subset of terms refers to technologies for text mining solutions, such as programming languages, code libraries, and some programs specifically developed to apply data mining.
Among the programming languages, Python is the most recurrent. For instance, Al-Nabki et al. [40] applied Python with the Keras framework in a neural network architecture to recognize named entities in suspicious Darkweb domains. Birks et al. [58] also used this programming language, with Gensim, to identify crime clusters. Bozyiğit et al. [59] used Python with Scikit-Learn to classify cyberbullying content in texts extracted from social media.
The "Natural Language Toolkit" and "Scikit-Learn" are libraries developed in Python, the first deals with natural language processing problems, and the second contains pre-built machine learning techniques, such as many of those presented above. The "Scikit-Learn" library includes several machine learning techniques implemented with great flexibility for applications, as demonstrated in the works by Chen
et al. [
60], Dong
et al. [
61], Martín
et al. [
47], and Thao
et al. [
48]. The "Natural Language Toolkit" contains essential functions implemented to perform preprocessing tasks (Dong et al., 2018), "named entity recognition" [
25], and to apply "term frequency-inverse document frequency" [
55], for example. Preprocessing tasks involve applying natural language processing techniques to treat the texts by eliminating noise that affects the analytical process and formatting the text to perform subsequent processing. Examples of these preprocessing tasks include text cleaning and normalization, removing special characters, numbers, empty or white spaces, stop words, performing case folding, stemming, and lemmatizing, tokenization, and extraction of n-grams as evidenced by the work by Aboluwarin
et al. [
62], Chandra
et al. [
63], Gil
et al. [
64], Martín
et al. [
47], and Savaliya and Philip [
65].
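A minimal sketch of several of these preprocessing steps using the Natural Language Toolkit follows; the sample sentence is hypothetical, and the resource downloads are needed only on a first run.

```python
# Hedged sketch of common preprocessing steps with NLTK.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.util import ngrams

nltk.download("punkt")       # tokenizer models (one-time setup)
nltk.download("stopwords")   # stop-word lists (one-time setup)

text = "3 suspects were ARRESTED near the station!!"
text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()   # cleaning + case folding
tokens = nltk.word_tokenize(text)                  # tokenization
stops = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stops]     # stop-word removal
stems = [PorterStemmer().stem(t) for t in tokens]  # stemming
print(stems, list(ngrams(stems, 2)))               # stems and their bigrams
```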
The "R language" is the second most recurrent within this set, containing several functions like the Python language and its libraries. The work by Basilio et al. (2019), for instance, applied preprocessing tasks and topic modeling using the R language. In addition to performing preprocessing, Cichosz [
42] applied machine learning classification techniques from R language packages. Aboluwarin et al. (2016) applied the R language for preprocessing and several Scikit-Learn functions, using Python, to perform classifications. Other languages were detected but did not appear as regularly as Python and R, including Java [
66,
67,
68,
69]; Perl [
36,
37,
70]; PHP [
66,
71]; and C++ [
72].
Distinct from these programming languages, "WEKA" and "RapidMiner" are computer programs specifically developed for machine learning and data mining purposes. WEKA is a machine learning platform that contains several implemented techniques (Alothman & Rattadilok, 2017). The research by Das and Das [73,74] used WEKA to compare its results with those of their methodology for processing and analyzing online newspaper reports covering crime. Almehmadi et al. [75] used WEKA to perform preprocessing tasks on a retrieved corpus and to apply a machine learning technique (support vector machines). RapidMiner is software dedicated to data mining, with several functions for data manipulation, statistical analysis, and graphical presentation [76]. Noviantho et al. [77] and Samtani et al. [78] are examples from the selected literature using RapidMiner and WEKA; the first paper is dedicated to cyberbullying classification, the second to identifying and assessing vulnerabilities in Supervisory Control and Data Acquisition (SCADA) systems.
Terms like "named entity recognition", "manual annotation", and "dictionaries", refer to natural language processing subjects. The term "named entity recognition" refers to an information extraction task for detecting named entities that are related, such as people, organizations, locations, expressions of time, and money [
79,
80]. "Dictionaries" are lists composed of keywords extracted from texts with descriptions of the characteristics related to a target term or word [
81]. These textual data structures present the sensitivity of a text or document as defined by the experts in the field to which it is related [
82]. "Manual annotation" refers to the process of creating a corpus with some labels or tags, such as in sentiments' polarities, using expert people. Petrovskiy and Chikunov [
83] and Saini and Bansal [
84] applied manual annotation to create corpora, which were later used to train machine learning techniques for performing classifications.
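A minimal sketch of named entity recognition using NLTK's pretrained chunker (one of several possible implementations of this task; the example sentence is hypothetical and the output depends on the pretrained models) could be:

```python
# Hedged sketch: named entity recognition with NLTK's built-in chunker.
import nltk

# One-time downloads of the required pretrained resources.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg)

sentence = "Interpol met with Europol officials in Lyon in March."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # part-of-speech tags feed the chunker
tree = nltk.ne_chunk(tagged)    # named-entity chunking

# Collect (entity text, entity type) pairs from the parse tree.
entities = [(" ".join(word for word, _ in subtree.leaves()), subtree.label())
            for subtree in tree.subtrees()
            if subtree.label() in {"PERSON", "ORGANIZATION", "GPE"}]
print(entities)
```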
Most Frequent Techniques and Technologies by Application Area
To provide greater detail in answering RQ2, the most frequent techniques and technologies were separated by application area. Figure 6 contains an assembly of bar plots showing the counts of the most frequent techniques and technologies in the six areas with the most selected works (see Table 4). In this figure, some bars are associated with more than one term, indicating that each of those terms has exactly the number of occurrences represented by the bar.
For “Cybersecurity”, the bar with four occurrences per term comprises adaboost, named entity recognition, word clouds, and support vector machines. For “General crime detection/prediction”, the bar with two occurrences per term comprises cluster analysis, georeferencing, logistic regression, natural language toolkit, neural networks, random forests, and rapidminer. For “Fraud detection”, the bar with two occurrences per term comprises bagging, georeferencing, latent Dirichlet allocation, loss calculation, matlab, neural networks, risk calculation, scikit-learn, cosine similarity, and principal component analysis.
Figure 7 visualizes the combinations of terms across the six main areas according to the term extraction performed. Dots refer to terms appearing in isolation, and dots connected by lines indicate combinations of terms. The bars on the left side give the number of occurrences among the six main areas, and the bars at the top of the plot give the counts of terms (isolated or in combination with other terms).
The term "naïve bayes" is an interesting case to exemplify the analyses that can be done with
Figure 7: it appears five times among the terms listed in all areas, which determines that it is at the intersection of five areas; it also appears twice alone, once in combination with just the term “support vector machines”, once with "term frequency-inverse document frequency", and once with both the terms "r language" and "term frequency-inverse document frequency". The term “python” has similar behavior in this plot: it also appears five times, being in the intersection of five areas; twice it is isolated from other terms; once it is combined with “random forests” and “named entity recognition”; once it is combined with “term frequency-inverse document frequency” and “k-nearest neighbors”; and once it is combined with “support vector machines”, “named entity recognition” and “natural language toolkit”.
The most recurrent term is “support vector machines”, appearing six times, in other words, it is in the intersection of the six main areas. It is followed by “python”, “term frequency-inverse document frequency”, “random forests”, and “naïve bayes”. For more counts, see
Figure 7.