1. Introduction
In today's world, social media serves as an "integral vehicle" [
1] and as an "online community" [
2] for seeking and sharing of information, news, views, opinions, perspectives, ideas, awareness, comments, and experiences on various topics such as pandemics, global affairs, current technologies, recent events, politics, family, relationships, and career opportunities, just to name a few [
3]. Out of multiple social media platforms, Twitter is highly popular amongst all age groups. As of December 2022, Twitter's audience accounted for over 368 million monthly active users worldwide [
4]. Twitter is the most used social media platform amongst journalists [
5] and ranks amongst the most popular social media platforms on a global scale [
6].
Twitter has been highly popular amongst healthcare researchers, epidemiologists, medical practitioners, data scientists, and computer science researchers for studying, analyzing, modeling, and interpreting social media communications related to pandemics, epidemics, viruses, and diseases such as Ebola [
7], E-Coli [
8], Dengue [
9], Human papillomavirus (HPV) [
10], Middle East Respiratory Syndrome (MERS) [
11], Measles [
12], Zika virus [
13], H1N1 [
14], influenza-like illness [
15], swine flu [
16], flu [
17], Cholera [
18], Listeriosis [
19], cancer [
20], Liver Disease [
21], Inflammatory Bowel Disease [
22], kidney disease [
23], lupus [
24], Parkinson's [
25], Diphtheria [
26], and West Nile virus [
27]. The recent outbreaks of COVID-19 and MPox have served as "catalysts" for the use of Twitter to share and exchange information on diverse topics related to these viruses, which has generated tremendous amounts of Big Data. No prior work in this field has focused on studying and analyzing Tweets that focused on both these viruses simultaneously to understand and interpret the underlying paradigms of conversations. This gap serves as the main motivation for this work.
In December 2019, there was an outbreak of an unknown respiratory disease in the seafood market in Wuhan, China. This outbreak affected about 66% of the people in the market. A prompt investigation from the healthcare and medical sectors revealed that a novel coronavirus was responsible for this disease, and this virus was named severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2, 2019-nCoV) as it was observed to have a high homology of about 80% with SARS-CoV [
28]. The disease humans suffer from after getting infected by this virus is known as COVID-19 [
29]. Despite best efforts by the Chinese Government to contain the spread of this virus, it soon spread to other parts of the world while undergoing multiple mutations and several variants such as Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2), Epsilon (B.1.427 and B.1.429), Eta (B.1.525), Iota (B.1.526), Kappa (B.1.617.1), Zeta (P.2), Mu (B.1.621 and B.1.621.1), and Omicron (B.1.1.529, BA.1, BA.1.1, BA.2, BA.3, BA.4, and BA.5) [
34]. At present, there have been a total of 681,518,412 cases and 6,811,869 deaths on account of COVID-19 on a global scale [
35]. Respiratory systems are the primary target of the SARS-CoV-2 virus, although infections in other organs of the body have been reported in some cases. The symptoms of COVID-19 usually include fever, dry cough, dyspnea, headache, dizziness, exhaustion, vomiting, and diarrhea [
36]. However, studies have shown that symptoms can vary from person to person based on diversity characteristics such as age group, preexisting conditions, disabilities, etc. [
37,
38].
Mpox (Monkeypox) is a re-emerging zoonotic disease. It is caused by the Mpox (monkeypox) virus, which belongs to the Poxviridae family, Chordopoxvirinae subfamily, and Orthopoxvirus genus [
39]. This virus was originally identified in monkeys in 1958 [
40], and the first case of this virus in humans was recorded in 1970. The Mpox virus is closely related to the variola virus and causes a smallpox-like disease in humans. The common symptoms of Mpox include fever, headache, and myalgia. A distinguishing feature of Mpox is the presence of swelling at the maxillary, cervical, or inguinal lymph nodes [
41,
42]. The Mpox virus was endemic in the Democratic Republic of the Congo (DRC) and a few African countries for a long time, and a few cases outside these geographic regions were recorded only twice—first in 2003 [
43] and then in 2017–2018 [
44,
45]. However, since May 2022, the world has also been experiencing an outbreak of the Mpox virus. At present, there have been a total of 86,231 cases of Mpox, with 84,858 of these cases being recorded in regions that have not historically reported Mpox [
46].
In the context of recent works related to Twitter data mining and analysis, a number of works have focused on sentiment analysis of Tweets. Sentiment Analysis is the computational analysis of people's attitudes, views, and sentiments regarding an entity that may represent an individual, concept, topic, event, or scenario. Sentiment Analysis can be considered a classification process. The three primary classification levels in Sentiment Analysis are document level, sentence level, and aspect level. The goal of document-level Sentiment Analysis is to categorize an opinion document as expressing a positive or negative sentiment. The entire document is viewed as a single fundamental informational unit in this process. Sentence-level Sentiment Analysis seeks to categorize the sentiment that each sentence expresses. In order to categorize the sentiment in relation to particular features of entities, aspect-level Sentiment Analysis is used. While the prior works in this field focused on performing sentiment analysis, the works were focused on studying either Tweets about COVID-19 or Tweets about MPox and did not include Tweets that focused on both these viruses simultaneously. The outbreak of MPox during the ongoing outbreak of COVID-19 has resulted in several Tweets involving the views, opinions, concerns, and perspectives of the public regarding both these viruses. Examples of a few such Tweets (obtained by using the Advanced Search feature of Twitter) are shown in
Table 1.
As can be seen from these Tweets, these two ongoing virus outbreaks prompted the sharing and exchange of views, information, concerns, and perspectives on a wide range of topics that reflect various sentiments regarding these viruses and how these viruses may affect aspects of people's lives. No prior work in this field thus far has focused on studying and analyzing Tweets that involved conversations about both COVID-19 and MPox. This work aims to address this research gap. The work of this paper involved performing sentiment analysis and text analysis on 61,862 Tweets that focused on Mpox and COVID-19 simultaneously, posted between May 7, 2022, and March 3, 2023. The findings are summarized as follows:
The results of sentiment analysis using the VADER (Valence Aware Dictionary for sEntiment Reasoning) approach show that nearly half the Tweets (46.88%) had a negative sentiment. These were followed by Tweets that had a positive sentiment (31.97%) and Tweets that had a neutral sentiment (21.14%).
Using concepts of text analysis, the top 50 hashtags associated with these Tweets were obtained. These hashtags are presented in the paper.
Using concepts of text analysis, the top 100 most frequently used words featured in these Tweets were obtained. The findings show that some of the commonly used words involved Twitter users directly referring to either or both viruses. In addition to this, the presence of words such as "Polio", "Biden", "Ukraine", "HIV", "climate", and "Ebola" in the list of the top 100 most frequent words indicates that topics of conversations on Twitter in the context of COVID-19 and MPox also included a high level of interest related to other viruses, President Biden, and Ukraine.
In addition to the above, a comprehensive comparative study that compares the contributions of this paper with 49 prior works in this field to uphold its relevance and novelty is also presented in this paper. This paper is organized as follows. In
Section 2, an overview of recent works related to sentiment analysis in the context of COVID-19 and MPox is presented.
Section 3 outlines the detailed methodology and the specific steps that were followed for this work. In
Section 4, the results are presented and discussed.
Section 5 presents the conclusion and scope for future work along these lines. It is followed by references.
3. Materials and Method
This section outlines the methodology that was followed for the development and implementation of the proposed framework for performing sentiment analysis and text analysis of Tweets that focused on COVID-19 and MPox simultaneously.
First, a relevant Twitter dataset had to be selected. The dataset that was selected for this study is MonkeyPox2022Tweets [107]. This dataset presents more than 600,000 Tweet IDs of Tweets about the 2022 outbreak of MPox. These Tweets were posted between May 7, 2022, and March 3, 2023. The dataset comprises Tweets in 34 languages, with English being the most common language in which the Tweets are available. The Tweets in the dataset include 5470 distinct hashtags related to MPox, out of which #monkeypox is the most frequent hashtag. As this dataset comprises only Tweet IDs, the Hydrator App [108] was used to hydrate this dataset. Hydration refers to the process of obtaining the Tweets and related information corresponding to each of the Tweet IDs. The Hydrator App works by complying with the policies for accessing the Twitter API as well as the specific rate limits that apply to accessing the Twitter API. The following steps were followed for hydrating the Tweet IDs present in this dataset:
The desktop version of Hydrator was downloaded and installed on a computer running the Microsoft Windows 10 Pro operating system (Version 10.0.19043 Build 19043) with an Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz, 2904 MHz, 2 Core(s), and 4 Logical Processor(s).
The Hydrator app was then connected to the Twitter API by clicking on the "Link Twitter Account" button on the app's interface.
The next step involved uploading a dataset file to the Hydrator app for hydration. As the Hydrator App allows only one file to be uploaded at a time, all the dataset files were merged to create one .txt file, which was uploaded to the app.
Then specific information about the uploaded dataset file (such as Title, Creator, Publisher, and URL) was entered in the Hydrator app, and the "Add Dataset" button was clicked to complete the process of dataset upload.
Thereafter, in the "Datasets" tab of the Hydrator App, the "Start" button was clicked to initiate the process of hydration.
Figure 1 is a screenshot from the Hydrator App obtained after the completion of this hydration task.
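The merging step described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure the authors actually used: the function names, the one-ID-per-line file layout, and the removal of duplicate IDs are all assumptions made for the example.

```python
from pathlib import Path
from typing import Iterable, List


def merge_tweet_ids(sources: Iterable[Iterable[str]]) -> List[str]:
    """Merge several sequences of Tweet ID lines into one ordered list.

    Assumption: each line holds one numeric Tweet ID. Blank or
    non-numeric lines are skipped, and duplicate IDs are kept only once.
    """
    merged, seen = [], set()
    for lines in sources:
        for line in lines:
            tweet_id = line.strip()
            if tweet_id.isdigit() and tweet_id not in seen:
                seen.add(tweet_id)
                merged.append(tweet_id)
    return merged


def merge_id_files(input_dir: str, output_file: str) -> int:
    """Merge all .txt files in input_dir (hypothetical layout) into output_file.

    Returns the number of IDs written.
    """
    paths = sorted(Path(input_dir).glob("*.txt"))
    sources = (p.read_text(encoding="utf-8").splitlines() for p in paths)
    ids = merge_tweet_ids(sources)
    Path(output_file).write_text("\n".join(ids) + "\n", encoding="utf-8")
    return len(ids)
```

The resulting single .txt file is the kind of input that could then be uploaded to the Hydrator app in one step.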
The output of the Hydrator app provided 509,248 Tweets about MPox. Upon obtaining these Tweets, it was crucial to perform text filtering to obtain those Tweets that contained keywords related to COVID-19. The specific keywords that were selected for the text filtering were "COVID", "COVID19", "coronavirus", "coronavirus pandemic", "COVID-19", "corona", "corona outbreak", "omicron variant", "SARS-CoV-2", "corona virus", and "Omicron". These keywords were selected based on the findings of [
93]. The text filtering task produced a set of 61,862 Tweets, i.e., each of these Tweets focused on MPox and COVID-19 at the same time. This set of 61,862 Tweets was selected for performing sentiment analysis and text analysis.
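The text filtering step can be illustrated with a short Python sketch. The keyword list is taken from the text above; the matching rule (case-insensitive substring matching) and the function names are assumptions, since the exact rule applied in the study is not stated here.

```python
# Keywords listed in the text; the matching rule below is an assumption.
COVID_KEYWORDS = [
    "COVID", "COVID19", "coronavirus", "coronavirus pandemic", "COVID-19",
    "corona", "corona outbreak", "omicron variant", "SARS-CoV-2",
    "corona virus", "Omicron",
]


def mentions_covid(text, keywords=COVID_KEYWORDS):
    """Return True if the text contains any keyword, ignoring case."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def filter_tweets(tweets):
    """Keep only the Tweets whose text mentions a COVID-19 keyword."""
    return [tweet for tweet in tweets if mentions_covid(tweet)]
```

Under substring matching, several keywords are redundant (for example, any text containing "COVID19" also contains "COVID"), and short keywords such as "corona" may also match unrelated words, so a stricter word-boundary rule may be preferable in practice.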
There are various approaches for performing sentiment analysis, such as manual classification, Linguistic Inquiry and Word Count (LIWC), Affective Norms for English Words (ANEW), the General Inquirer (GI), SentiWordNet, and machine learning-oriented techniques relying on Naive Bayes, Maximum Entropy, and Support Vector Machine (SVM) algorithms. However, the specific approach that was used for this study was VADER (Valence Aware Dictionary for sEntiment Reasoning). VADER was used because it has been reported to outperform manual classification and because it addresses the limitations of similar approaches for sentiment analysis, as outlined below [
94]:
- a) VADER distinguishes itself from LIWC as it is more sensitive to sentiment expressions in social media contexts.
- b) The General Inquirer suffers from a lack of coverage of sentiment-relevant lexical features common to social text.
- c) The ANEW lexicon is also insensitive to common sentiment-relevant lexical features in social text.
- d) The SentiWordNet lexicon is very noisy; a large majority of synsets have no positive or negative polarity.
- e) The Naïve Bayes classifier involves the naive assumption that feature probabilities are independent of one another.
- f) The Maximum Entropy approach makes no conditional independence assumption between features and thereby accounts for information entropy (feature weightings).
- g) In general, machine learning classifiers require (often extensive) training data, which are, as with validated sentiment lexicons, sometimes troublesome to acquire.
- h) In general, machine learning classifiers also depend on the training set to represent as many features as possible.
VADER uses sparse rule-based modeling to build a computational sentiment analysis engine that performs well on social media style text while generalizing easily to multiple domains. It needs no training data, as it is built from a generalizable, valence-based, human-curated gold standard sentiment lexicon. It is fast enough to be used online with streaming data, does not suffer significantly from a speed-performance tradeoff, has a time complexity of O(N), and is freely available without any subscription or purchase costs. In addition to detecting the polarity (positive, negative, and neutral), VADER is also able to detect the intensity of the sentiment expressed in the texts. For developing the system architecture for sentiment analysis and text analysis, RapidMiner was used. RapidMiner, formerly known as Yet Another Learning Environment (YALE) [
95], is a data science platform that enables the development, implementation, and utilization of several algorithms and models related to Machine Learning, Data Science, Artificial Intelligence, and Big Data. RapidMiner is utilized for both academic research and the creation of business-related applications and solutions. RapidMiner is available as an integrated development environment that consists of—(1) RapidMiner Studio, (2) RapidMiner Auto Model, (3) RapidMiner Turbo Prep, (4) RapidMiner Go, (5) RapidMiner Server, and (6) RapidMiner Radoop. For all the work related to the methodologies proposed in this paper, RapidMiner Studio was used. For the remainder of this paper, wherever the phrase "RapidMiner" has been used, it refers to "RapidMiner Studio" and not any of the other development environments associated with this software tool. RapidMiner is created as an open-core model with a powerful Graphical User Interface (GUI) that enables developers to create numerous applications and workflows and develop and implement algorithms. In the RapidMiner development environment, specific operations or functions are referred to as "operators," and a collection of "operators" (connected linearly or hierarchically or a combination of both) to achieve a desired task or goal is referred to as a "process". For the creation of a particular "process," RapidMiner offers a variety of built-in "operators" that may be utilized straight away with or without any changes. A particular class of "operators" can also be utilized to change the distinguishing qualities of other "operators". Moreover, the development environment also allows developers to construct their own "operators," which can then be shared and made accessible to all other RapidMiner users via the RapidMiner Marketplace.
The VADER approach for performing sentiment analysis is available as an "operator" in RapidMiner, which can be directly used in a "process." This "operator" calculates and then outputs the sum of all sentiment word scores in a given text(s) by following the VADER approach. If the advanced output option of this "operator" is selected, then it also outputs a nominal attribute with all words taking part in the scoring, the sum of positive components, the sum of negative components, and the number of used and unused tokens. The "process" that was developed in RapidMiner involving the use of this "operator" and other "operators" connected to it is shown in
Figure 2.
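To make the outputs described above more concrete, the following Python sketch computes a sum of sentiment word scores together with the positive sum, the negative sum, and the counts of used and unused tokens. It is only an illustration of lexicon-based scoring: the lexicon entries and their values are hypothetical, the tokenizer is a simple assumption, and VADER's rules for negation, intensifiers, punctuation, and capitalization are omitted. It is not the implementation used by the RapidMiner "operator".

```python
import re

# Hypothetical valence scores, for illustration only. The real VADER lexicon
# holds several thousand human-rated entries on a scale from -4 to +4.
TOY_LEXICON = {
    "good": 2.0, "great": 3.0, "safe": 1.5,
    "bad": -2.5, "worried": -1.5, "scary": -2.0,
}


def score_text(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of the words in a text.

    Returns the total score, the positive and negative sums, the words
    that took part in the scoring, and the used/unused token counts.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    used = [token for token in tokens if token in lexicon]
    positive = sum(lexicon[t] for t in used if lexicon[t] > 0)
    negative = sum(lexicon[t] for t in used if lexicon[t] < 0)
    return {
        "score": positive + negative,
        "positive": positive,
        "negative": negative,
        "scoring_words": used,
        "used_tokens": len(used),
        "unused_tokens": len(tokens) - len(used),
    }
```

The processing time of such a scorer grows linearly with the number of tokens, which is in line with the O(N) time complexity noted for VADER.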
The description of all the "operators" used in this "process" is presented next. The "Dataset" "operator" was used to import the original dataset of 509,248 Tweets about MPox (obtained from the output of the Hydrator app). The "Filter Tweets" "operator" was used to perform text filtering on the text of the Tweets. Specifically, the Tweets that contained the keywords "COVID", "COVID19", "coronavirus", "coronavirus pandemic", "COVID-19", "corona", "corona outbreak", "omicron variant", "SARS-CoV-2", "corona virus", and "Omicron" were filtered. Thereafter, the "Select Attributes" "operator" was used to select only that specific attribute from the dataset that would be used for sentiment analysis. The specific attribute in this context was the text of the Tweets. The output of this "operator" was provided as an input to the "Extract Sentiment" "operator", which performed sentiment analysis according to the VADER approach. The output of this "operator" comprised a score associated with each Tweet, classifying it as a positive, neutral, or negative Tweet. To compute the number of positive, neutral, and negative Tweets, additional data filters were used. However, this required creating multiple copies of the output. To achieve this, the "Multiply" "operator" was used. Specifically, three copies of the output from the VADER "operator" were created by using this "operator". Each of these copies was passed to a data filter that was set up to filter out the positive, neutral, or negative Tweets according to rules derived from the working of the VADER approach: a Tweet with a score greater than 0 was filtered as a positive Tweet, a Tweet with a score equal to 0 was filtered as a neutral Tweet, and a Tweet with a score less than 0 was filtered as a negative Tweet. Thereafter, an analysis of the number of Tweets from these respective data filters was performed to infer the percentages of positive, neutral, and negative Tweets. These results are discussed in
Section 4.
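The filtering rules stated above (score greater than 0, equal to 0, and less than 0) and the subsequent percentage computation can be sketched as follows. The function names are hypothetical, and the sketch assumes that one numeric score per Tweet is already available from the sentiment step.

```python
from collections import Counter


def label_from_score(score):
    """Apply the rule from the text: >0 positive, 0 neutral, <0 negative."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def sentiment_distribution(scores):
    """Return the percentage of positive, neutral, and negative scores."""
    counts = Counter(label_from_score(score) for score in scores)
    total = len(scores)
    return {
        label: round(100 * counts[label] / total, 2) if total else 0.0
        for label in ("positive", "neutral", "negative")
    }
```

A single pass with a counter gives the same counts as the three parallel data filters described above; the three-copy design in RapidMiner follows from how its "operators" are connected rather than from the rule itself.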
In addition to performing sentiment analysis, this study also involved the detection of some of the commonly used hashtags and words in the 61,862 Tweets that were considered for this study. The RapidMiner "process" that was developed to implement the same is shown in
Figure 3.
The description of all the "operators" used in this "process" is presented next. The "Dataset" "operator" was used to import the original dataset of 509,248 Tweets about MPox (obtained from the output of the Hydrator app). The "Filter Tweets" "operator" was used to perform text filtering on the text of the Tweets. Specifically, the Tweets that contained the keywords "COVID", "COVID19", "coronavirus", "coronavirus pandemic", "COVID-19", "corona", "corona outbreak", "omicron variant", "SARS-CoV-2", "corona virus", and "Omicron" were filtered. Thereafter, the "Select Attributes" "operator" was used to select only that specific attribute from the dataset that would be used for text analysis. The specific attribute in this context was the text of the Tweets. The output of this "operator" was provided as an input to the "Nominal to Text" "operator". Thereafter, the "sub-process" "Process Documents" was used. This "sub-process" comprised specific "operators" to perform tokenization and elimination of stop words. The output of this "sub-process" was provided as an input to the "WordList to Data" "operator" to display the results for detection and analysis of the commonly used hashtags and words in these Tweets. The results of this "process" are discussed in
Section 4. It is worth mentioning here that the VADER "operator" performs tokenization and elimination of stop words automatically, so the "sub-process" "Process Documents" was not used in the RapidMiner "process" for performing sentiment analysis shown in
Figure 2.
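The tokenization, stop word elimination, and frequency counting performed by this "process" can be sketched in Python as follows. The stop word list, the regular expressions, and the function names are assumptions for illustration; the RapidMiner "operators" may tokenize and filter differently.

```python
import re
from collections import Counter

# A small illustrative stop word list; real lists are much longer.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
    "for", "on", "with", "this", "that",
}


def top_hashtags(tweets, n=50):
    """Return the n most frequent hashtags, compared case-insensitively."""
    counts = Counter(
        tag.lower() for tweet in tweets for tag in re.findall(r"#\w+", tweet)
    )
    return counts.most_common(n)


def top_words(tweets, n=100):
    """Return the n most frequent alphabetic tokens after stop word removal."""
    counts = Counter()
    for tweet in tweets:
        tokens = re.findall(r"[a-z]+", tweet.lower())
        counts.update(token for token in tokens if token not in STOP_WORDS)
    return counts.most_common(n)
```

Calling `top_hashtags(tweets, 50)` and `top_words(tweets, 100)` on the filtered Tweets would give lists comparable in form to the top 50 hashtags and top 100 words reported in this paper, although the exact counts depend on the tokenization rules chosen.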
5. Conclusions
The Big Data of Twitter conversations holds the potential for inference of the views, opinions, perspectives, mindset, sentiment, and feedback of the general public towards pandemics, epidemics, viruses, and diseases. This has attracted the attention of researchers in the fields of computer science, big data, data science, epidemiology, healthcare, medicine, and their interrelated areas in the last few years. Various forms of analysis of this Big Data, such as sentiment analysis, hashtag analysis, and frequent keyword analysis, can be seen in prior works in this field that focused on studying Tweets involving some of the virus outbreaks of the past, such as Ebola, E-Coli, Dengue, Human papillomavirus (HPV), Middle East Respiratory Syndrome (MERS), Measles, Zika virus, H1N1, influenza-like illness, swine flu, flu, Cholera, COVID, Listeriosis, cancer, Liver Disease, Inflammatory Bowel Disease, kidney disease, lupus, Parkinson's, Diphtheria, and West Nile virus. The recent outbreaks of COVID-19 and MPox have escalated the use of Twitter for conversations related to these respective viruses. While there have been a few works published in the last few months that focused on performing sentiment analysis of Tweets related to either COVID-19 or MPox, none of the prior works in this field thus far focused on the analysis of Tweets focusing on both COVID-19 and MPox and performing sentiment analysis of the same. To address this challenge, this study presents the findings from a comprehensive sentiment analysis study involving 61,862 Tweets that focused on Mpox and COVID-19 at the same time. The VADER approach was used for performing the sentiment analysis. The results show that almost half the Tweets (actual percentage being 46.88%) involving COVID-19 and MPox had a negative sentiment. It was followed by Tweets that had a positive sentiment (31.97%) and Tweets that had a neutral sentiment (21.14%). 
This study also presents the findings from hashtag analysis and keyword analysis of these Tweets. The top 50 hashtags featured in these Tweets were detected and are presented in this paper. The top 100 most frequently used words featured in these Tweets were also detected using concepts of tokenization and are presented. The findings of frequent word analysis show that some of the commonly used words directly refer to either or both of these viruses. In addition to this, the presence of words such as "Polio", "Biden", "Ukraine", "HIV", "climate", and "Ebola" in the list of the top 100 most frequent words indicates that topics of conversations on Twitter in the context of COVID-19 and MPox also included a high level of interest related to other viruses, President Biden, and Ukraine. Future work in this area would involve collecting more Tweets over the coming months and repeating this study to infer any potential evolution of public sentiment related to these viruses over time.