1. Introduction
Several infectious diseases, such as COVID-19 [1], the plague [2], the Spanish Flu [3], HIV [4], and Ebola [5], have spread across different countries, leading to loss of life, the destruction of political regimes, disruption of various sectors of the global economy, and financial and psychosocial burdens, the likes of which the world had not witnessed in centuries [6]. In response, various organizations and policy-making bodies around the world have begun investigating approaches to learn from such outbreaks, with the aim of not repeating past mistakes during future outbreaks. “Disease X” is a placeholder name that the World Health Organization (WHO) adopted in February 2018 on its shortlist of blueprint priority diseases to represent a hypothetical, unknown pathogen that could cause a future epidemic [7,8]. The WHO used the placeholder term “Disease X” to ensure that its planning (such as relevant tests, expanded vaccinations, and production capabilities for vaccines) was robust, versatile, and equipped to deal with an unidentified virus [9]. The idea of Disease X, according to Anthony Fauci (the director of the US National Institute of Allergy and Infectious Diseases at that time), was to motivate the WHO’s investigations of entire classes of viruses rather than just specific strains, with the aim of strengthening the WHO’s preparedness for dealing with such outbreaks [10].
Thus, it is crucial to plan and adopt a holistic approach to predict and prevent a new pandemic in the future. Prior works [11,12,13,14] in this field have discussed various means by which Disease X might start. For instance, deadly pathogens released from melting glaciers could start a new pandemic. Alternatively, with continued global warming and climate change, viruses that are currently dormant may become active, mutate, and lead to the next pandemic. Furthermore, human and animal contact has become increasingly common, and the lack of proper protocols in this regard has led to outbreaks of zoonotic viruses in the past. A well-known example is H1N1, which contained genetic material of human, avian, and swine origin, and whose emergence involved wildlife, pig farming, animal movement, and farm workers [15]. Therefore, in the last few years, works in this field have also focused on predicting what type of pathogen might be responsible for Disease X, with the aim of creating, implementing, and evaluating countermeasures that would help control the potential pandemic faster than previous pandemics such as COVID-19 [16,17,18]. Simpson et al. [19] stated that Disease X is likely to occur due to one or more of the following risk factors: human interactions with wildlife, the production of goods derived from animals with minimal oversight of workers and an unclear supply chain, bug and tick vectors, extremely high population densities, and limited surveillance and laboratory capacities. Simpson et al. [19] also state that Disease X will probably be caused by the zoonotic spread of a highly infectious RNA virus from a region where the confluence of risk factors and population dynamics will lead to prolonged person-to-person transmission.
A widely shared view within the research community is that the world is currently not prepared with the countermeasures, policies, and procedures that would be necessary to control and contain Disease X. Multiple factors need to be taken into consideration when creating new response, control, and preparation measures, including vaccine development and distribution, country and state responses, political stances, as well as cultural and environmental factors. Global preparation, coordination, and communication are crucial so that each of these factors is considered and managed in coordination with the others, allowing a new pandemic to be controlled and contained [20].
One of the overarching issues observed on a global scale during the COVID-19 pandemic was the lack of efficiency, coordination, agreement, and organization in the timely production and distribution of vaccines and COVID-19 tests [21]. While various organizations and labs were working to create a vaccine, different countries were scrambling even to set up testing centers and mass-produce enough COVID-19 tests. It took far longer than ideal to ensure easy access to COVID-19 tests, which allowed the virus to continue to spread at an alarming rate because symptoms were not guaranteed to be noticeable in all population groups [22]. Testing is one of the first lines of defense against viruses because it allows the extent of the spread to be determined and suitable actions to be taken based on the positive cases that are reported. The delays in testing reflected issues with the supply chain and with communication across agencies during the outbreak and rapid spread of COVID-19. The same supply chain issues were reported during the rollout of the COVID-19 vaccines, which proceeded at an even slower rate than that of the tests. Research labs in different geographic regions seemed less prepared to mass-produce and distribute the vaccines, which slowed down response rates and prevented the spread of COVID-19 from being contained in a timely manner [23,24]. Another issue associated with the COVID-19 pandemic was the lack of coordination and cooperation between countries in their responses [25]. During the outbreak of COVID-19, some countries implemented measures (such as partial or complete lockdowns) immediately, while others did not implement such measures at the same pace [26,27]. Finally, a major issue seen specifically during the COVID-19 pandemic was political stances standing in the way of scientific progress. There was a lot of misinformation throughout the pandemic, concerning the effectiveness and safety of the vaccines, the accuracy of test results, approaches for treatment, and the severity of the virus [28,29].
During the outbreak of COVID-19 and similar virus outbreaks of the past, Google Trends attracted significant attention from researchers across disciplines such as Big Data, Data Mining, Healthcare, Epidemiology, Information Retrieval, and Data Analysis, as it helps to mine, analyze, and obtain real-time insights related to web behavior, and its features offer advantages over traditional surveys [30]. In the last few years, Google Trends has been widely used by healthcare researchers to analyze patterns of web behavior related to different virus outbreaks [31,32,33,34,35,36,37]. Ginsberg et al. [38] discussed the significance of seasonal influenza and the potential threat of a pandemic caused by a new strain of the influenza virus using Google Trends. The work proposed a method to enhance early disease detection by monitoring Google search queries, which reflect health-seeking behavior. By analyzing these queries, the researchers accurately estimated weekly cases of influenza in different regions of the United States, allowing for rapid detection of and response to influenza with only a one-day reporting lag. The work by Kapitány-Fövény et al. [39] focused on analyzing Google search volumes using Google Trends to forecast Lyme disease incidence. By integrating Google Trends data into a seasonal autoregressive integrated moving average (SARIMA) model, the researchers compared their predictions with the actual reported values of Lyme disease incidence in Germany. The objective of the work by Verma et al. [40] was to assess the potential of using Google Trends data for predicting disease outbreaks. Focusing on malaria, dengue fever, chikungunya, and enteric fever in two regions of India, Chandigarh and Haryana, the research compared Google search trends with Integrated Disease Surveillance Programme (IDSP) data. The analysis revealed a temporal correlation between the two datasets, particularly with a lag of 2 to 3 weeks for chikungunya and dengue fever, indicating the feasibility of using Google Trends to predict disease outbreaks at both local and regional levels. Young et al. [41] explored the potential of using relevant Google search queries from Google Trends to monitor and predict syphilis cases at the state level. The study investigated the relationship between weekly reported syphilis cases and online search activity related to risk factors. By employing linear mixed models, the study established associations between search query data and syphilis cases, achieving accurate predictions for a significant number of weeks. The results indicated a strong correlation between web behavior and reported syphilis cases, suggesting the feasibility of integrating such data into public health monitoring systems for disease surveillance and prediction. Another work by Young et al. [42] focused on using Google search data to monitor and predict new HIV diagnoses in the United States. The researchers collected HIV-related search volume data using Google Trends, identified significant predictor keywords through LASSO, and combined this data with actual HIV case reports from the CDC to develop a predictive model. The model demonstrated strong predictive capabilities, achieving an average R² value of 0.99 and an average root-mean-square error (RMSE) of 108.75 when comparing predicted and actual HIV cases. Morsy et al. [43] focused on predicting Zika virus cases using Google search queries from Google Trends. The researchers developed a prediction model based on time-series regression (TSR) that used Zika search volume from Google Trends to anticipate confirmed Zika cases in Brazil and Colombia. The model with a 1-week lag of Zika queries and a 1-week lag of Zika cases, the latter as a control for autocorrelation, was found to be the most effective in predicting Zika cases. The results demonstrated the potential to forecast Zika cases a week ahead of outbreaks, offering healthcare authorities an early indicator for outbreak evaluation and precautionary measures. Using Google Trends, Ortiz-Martínez et al. [44] showed that there was a high correlation between COVID-19 incidence in Colombia and Google searches on COVID-19 in Colombia (R² = 0.8728 and p < 0.0001).
In summary, prior works in this field have used Google Trends to mine, analyze, and investigate multimodal components of web behavior during various virus outbreaks. However, these works have two major limitations. First, none of them focused on Disease X, which features in the WHO's shortlist of blueprint priority diseases. Second, they analyzed Google Trends data from a very limited number of regions. To address these limitations and to contribute to the timely advancement of research and development in this field, this work presents a dataset that comprises web behavior data related to Disease X from 94 regions, covering February 2018 to August 2023. These 94 regions were selected because all of them recorded a significant level of interest in Disease X during this timeframe. The dataset was developed by collecting this data from Google Trends.
The rest of this paper is organized as follows. Section 2 presents the detailed methodology that was followed for the development of this dataset. Section 3 describes the dataset and presents a brief analysis of specific features of this dataset to support its applicability, relevance, and significance for the investigation of different research questions. Section 4 concludes the paper, which is followed by references.
2. Methodology
Google Trends [45], a tool developed by Google, allows the mining and analysis of real-time and historical information associated with Google search queries, enabling researchers to uncover valuable insights into the interests of individuals across different domains and topics [46]. Google Trends analyzes aggregate search behavior on Google and can thus provide unique insights into web behavior. This feature is particularly valuable in health informatics, where understanding public engagement and interest in health-related topics and predicting disease outbreaks are of paramount importance [47].
Google Trends presents several advantages over traditional survey methods, positioning it as a potent tool for the research and analysis of multimodal characteristics of web behavior. First, as the web behavior data available via Google Trends is anonymous, it allows researchers to explore forms of data analysis that might otherwise have been difficult due to the privacy concerns of the general public [47]. Second, Google Trends is cost-effective. Unlike traditional surveys, which frequently entail significant expenses for participant recruitment, data collection, and analysis, Google Trends is a cost-free resource, which allows researchers to channel resources into more focused areas of investigation or toward enhancing the research process itself. Third, Google Trends captures broad and diverse data. Conducting regular surveys on a global scale is a logistical challenge, often constrained by geographic and demographic limitations, whereas Google Trends aggregates web behavior data on a global scale that can be used for in-depth study and analysis. This global perspective enhances the generalizability of findings and facilitates cross-cultural comparisons, making Google Trends a valuable resource for understanding web behavior across different geographic regions. Fourth, Google Trends offers near real-time access to search trends as they unfold, which is also far less time-consuming than conducting a survey. This enables timely analysis, decision-making, and trend detection, which is particularly advantageous in fields that require a quick response, such as public health and policy formulation. In contrast, traditional surveys often face time delays due to the labor-intensive nature of participant recruitment and adherence to inclusion criteria. These delays can hinder the ability to capture real-time insights, potentially impacting the accuracy and relevance of the findings. The immediate accessibility of Google Trends data addresses this limitation, enabling researchers to adapt and react promptly to emerging trends or shifts in user interests related to a topic, as evidenced by relevant web behavior.
Google Trends presents the frequency at which a specific search term is entered into Google’s search engine relative to the overall search volume on the site during a specific time frame. Mathematically, if n(q, l, t) represents the number of searches for the query q in the location l during the period t, the relative popularity (RP) of the query is computed as shown in Equation (1). In Equation (1), Q(l, t) is the set of all queries made from location l during period t, and Π(n(q, l, t) > τ) is a dummy variable with value 1 when n(q, l, t) > τ (i.e., the query is popular) and 0 otherwise. The resulting numbers are then scaled to the range of 0 to 100 based on the proportion of the topic relative to the total number of search topics. This defines the Google Trends Index (GTI), as shown in Equation (2).
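Equations (1) and (2) are not reproduced in this text. One formulation that is consistent with the definitions given above is sketched below. It is a reconstruction under stated assumptions rather than a verbatim copy of the original equations: the summation index q', the period index t', and the number of periods T in the selected time frame are symbols introduced here for illustration, and the scaling in Equation (2) assumes that the index is normalized by the maximum relative popularity over the selected time frame.

```latex
\begin{equation}
RP(q, l, t) = \frac{n(q, l, t)}{\sum_{q' \in Q(l, t)} n(q', l, t)} \times \Pi_{(n(q, l, t) > \tau)}
\end{equation}

\begin{equation}
GTI(q, l, t) = \frac{RP(q, l, t)}{\max_{t' \in \{1, \dots, T\}} RP(q, l, t')} \times 100
\end{equation}
```

Under this reading, the period with the highest relative popularity receives an index of 100, and queries whose volume does not exceed the threshold τ receive an index of 0.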
These index values can be generated by Google Trends for any period starting from January 1, 2004, up to 36 hours prior to the time of the search. Google Trends excludes searches made by very few users and highlights popular search topics, assigning 0 to terms with low search volumes [48]. The following is an overview of the features of Google Trends:
Search Term Trends: This feature allows users to see how the popularity of a specific search term or keyword has changed over time. Google Trends provides a graphical representation to highlight these trends.
Related Queries: Google Trends displays related queries that are frequently searched alongside the user’s primary search term. This can help identify related topics or terms relevant for data analysis.
Regional Interest: Users can view the geographical regions where a specific search term is most popular using Google Trends. Google Trends provides insights into regional differences in search interest for search terms.
Trending Searches: This feature of Google Trends highlights the current and popular search queries or topics, providing real-time insights into what people are searching for on Google.
Year in Search: Google Trends often releases a “Year in Search” report summarizing the top search queries from the past year. In this report, it offers an overview of significant events and trends.
Category Comparison: Users can compare the search interest of different categories or topics on Google using Google Trends. This can be useful for understanding the relative popularity of various topics.
Time Period Selection: Google Trends allows users to specify the time period for which they desire to query and analyze the data. This can range from a few hours to multiple years.
Data Visualization: Google Trends provides interactive charts and graphs to visualize search data.
Real-Time Data: Google Trends often updates in near real-time, making it valuable for tracking ongoing events.
Data Export: Google Trends allows different options to export data related to search interests, related queries, and related topics for a search term on Google for further analysis.
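The index values reported by these features follow the relative-popularity scaling described before the feature list. A minimal Python sketch of that scaling is given below for illustration. It is not Google's implementation: the function names, the toy counts, and the threshold value are hypothetical, and scaling by the maximum over the selected periods is an assumption about how the 0 to 100 range is obtained.

```python
def relative_popularity(query_counts, total_counts, tau=0):
    """Share of all searches per period; 0 when the query count does not exceed tau."""
    return [
        (n / total) if (n > tau and total > 0) else 0.0
        for n, total in zip(query_counts, total_counts)
    ]


def trends_index(rp_values):
    """Scale relative popularity to 0-100 (assumed: relative to the peak period)."""
    peak = max(rp_values, default=0.0)
    if peak == 0:
        return [0 for _ in rp_values]
    return [round(100 * value / peak) for value in rp_values]


# Toy data (hypothetical): searches for one query and all searches, per period.
query_counts = [5, 50, 200, 100]
total_counts = [1000, 1000, 2000, 1000]

rp = relative_popularity(query_counts, total_counts, tau=10)
index = trends_index(rp)
print(index)  # [0, 50, 100, 100]
```

The first period is reported as 0 because its count falls below the hypothetical threshold, which mirrors how low-volume terms are assigned 0.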
To develop this dataset, web behavior data in the form of search interests related to Disease X (as a topic) was collected using Google Trends for the period from February 2018 to August 2023. February 2018 was selected as the start of the timeframe because the WHO added Disease X to its shortlist of blueprint priority diseases in that month. August 2023 was the most recent month at the time of data collection. First, the global search trends related to Disease X (as a topic) during this timeframe were analyzed using Google Trends. The result provided by Google Trends is shown in Figure 1. Thereafter, using the “Regional Interest” feature of Google Trends, the list of regions that recorded significant search interests related to Disease X was compiled and exported. This list of regions is shown in Table 1.
Thereafter, with Google Trends as the data source, search interests related to Disease X (as a topic) for all 94 regions between February 2018 and August 2023 were collected and exported as .CSV files. To consolidate the 94 .CSV files into one Microsoft Excel workbook, the Power Query interface in Excel was employed. Power Query treats each individual file as a data source and imports each file’s data into the Excel workbook. Each region’s search interest for “Disease X” is presented on a distinct sheet in this workbook, which was uploaded to IEEE Dataport [49] as a dataset. The flowchart in Figure 2 shows the step-by-step process that was followed for the development of this dataset. This dataset is described in Section 3.
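The consolidation step above was performed with Power Query in Excel. For readers who prefer a scripted route, the same idea can be sketched in Python using only the standard library. This is an illustrative alternative, not the procedure used for this dataset. The layout of the exported file (a "Category" line, a blank line, and a two-column header such as "Month,Disease X: (United States)") and the handling of "<1" values are assumptions, and the function names are hypothetical.

```python
import csv
import io


def parse_trends_csv(text):
    """Parse one exported file into (region, [(month, interest), ...])."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    # Assumed layout: an optional "Category: ..." line, then a two-column header.
    header_idx = next(i for i, row in enumerate(rows) if len(row) == 2)
    label = rows[header_idx][1]
    # Assumed header form "Disease X: (United States)"; fall back to the raw label.
    if "(" in label and ")" in label:
        region = label[label.find("(") + 1 : label.rfind(")")]
    else:
        region = label
    series = []
    for month, value in rows[header_idx + 1 :]:
        # "<1" marks very low but non-zero interest; stored as 0 here by choice.
        series.append((month, 0 if value.startswith("<") else int(value)))
    return region, series


def consolidate(texts):
    """Combine many exports into {region: series}, analogous to one sheet per region."""
    return dict(parse_trends_csv(text) for text in texts)


# Hypothetical file contents for two regions (values are made up).
sample_a = "Category: All categories\n\nMonth,Disease X: (United States)\n2018-02,0\n2018-03,<1\n2018-04,12\n"
sample_b = "Category: All categories\n\nMonth,Disease X: (Singapore)\n2018-02,3\n2018-03,100\n"

workbook = consolidate([sample_a, sample_b])
print(sorted(workbook))  # ['Singapore', 'United States']
```

In practice, each string would be read from one of the 94 exported files, and the resulting mapping could then be written to a workbook with a spreadsheet library of choice.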
3. Data Description and Analysis
This section describes the dataset, which is available at https://dx.doi.org/10.21227/ht7f-rx42. The dataset contains one Microsoft Excel workbook comprising 94 sheets, where each sheet presents the search interests related to Disease X (as a topic) from February 2018 to August 2023 for a different region. The search interest data for all the regions listed in Table 1 is available in this dataset. For each region, the dataset presents the search interests related to Disease X (as a topic) for each month in this timeframe. This data can be analyzed to obtain the trends in search interests during this timeframe for each of these 94 regions. For instance, the analysis of this data for the United States is presented in Figure 3. In this figure, the X-axis represents the months, and the Y-axis represents the search interest related to Disease X on a scale of 0 to 100.
This analysis of the data for the United States shows that the search interest related to Disease X was highest in August 2023. Similar trends and insights for other geographic regions can be obtained by analyzing the search interest data for that region in this dataset.
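The kind of per-region analysis described above, such as locating the month with the highest search interest, can be sketched in a few lines of Python. The function name and the sample values below are hypothetical and are not taken from the dataset. The sketch assumes that each region's data is available as a chronologically ordered list of (month, interest) pairs.

```python
def peak_month(series):
    """Return the (month, interest) pair with the highest interest.

    For a chronologically ordered series, the earliest month wins ties,
    because max() returns the first maximal element it encounters.
    """
    return max(series, key=lambda pair: pair[1])


# Hypothetical monthly values for one region (not real data).
example_series = [("2023-05", 4), ("2023-06", 9), ("2023-07", 9), ("2023-08", 100)]

print(peak_month(example_series))  # ('2023-08', 100)
```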
Figure 4 shows a world map-based analysis of the search interests related to Disease X for all 94 regions during this timeframe. The intensity of the color in Figure 4 represents the value of search interest related to Disease X from a certain region. Thus, a region that recorded a very high value of search interest related to Disease X during this timeframe is indicated by a darker shade of blue than a region that recorded a very low value. This analysis shows that the 10 regions that recorded the highest search interests related to Disease X during this timeframe are Singapore, Honduras, Haiti, Nicaragua, Guatemala, El Salvador, Brunei, Panama, Cuba, and the United Arab Emirates. This analysis also helps to identify the regions that recorded the lowest (but still significant) search interests related to Disease X during this timeframe. These 10 regions are Finland, Romania, Czechia, Ukraine, Poland, Türkiye, Vietnam, Iran, Russia, and Japan.
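A ranking of this kind can be reproduced from any table of per-region interest values. The sketch below is illustrative only: the function name, the region labels, and the scores are hypothetical placeholders, and it assumes that one interest value per region (for example, as exported from the “Regional Interest” feature) is available as a dictionary.

```python
def rank_regions(scores, k=10):
    """Return (top_k, bottom_k) region names by interest value.

    top_k is ordered from highest to lowest, bottom_k from lowest to highest.
    Ties in the descending order are broken alphabetically.
    """
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    top = [name for name, _ in ordered[:k]]
    bottom = [name for name, _ in ordered[-k:][::-1]]
    return top, bottom


# Hypothetical per-region interest values (placeholders, not real data).
example_scores = {"Region A": 100, "Region B": 40, "Region C": 7, "Region D": 63}

top, bottom = rank_regions(example_scores, k=2)
print(top)     # ['Region A', 'Region D']
print(bottom)  # ['Region C', 'Region B']
```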
During the development of this dataset, it was observed that online searches on Google related to Disease X during this timeframe (February 2018 to August 2023) had several related queries. The ‘rising’ keywords associated with these related queries were collected using the “Related Queries” feature of Google Trends, as described in Section 2. Figure 5 shows a word cloud-based representation of these queries related to Disease X during this timeframe. In this context, it is worth mentioning that the data was mined from Google Trends for the development of this dataset on August 8, 2023. Google Trends provided the search interest for August 2023 for each of the 94 regions by taking into account the relevant Google searches recorded from August 1, 2023, to August 8, 2023. Therefore, if the data collection is repeated at the end of August 2023 or at a later date, the search interest for August 2023 for some of these regions might change, as Google Trends would then report the value for August 2023 by taking into account all relevant Google searches recorded from August 1, 2023, to August 31, 2023.
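When reusing the data, the caveat about the partially observed final month can be made explicit by computing how much of a month was covered on the collection date. The following sketch is a hypothetical helper, not part of the dataset or of Google Trends. It assumes months are labeled as YYYY-MM and ignores any reporting delay on the side of Google Trends.

```python
import calendar
from datetime import date


def month_coverage(month, collected_on):
    """Fraction of the days in `month` ('YYYY-MM') elapsed by `collected_on`."""
    year, mon = map(int, month.split("-"))
    days_in_month = calendar.monthrange(year, mon)[1]
    first_day = date(year, mon, 1)
    last_day = date(year, mon, days_in_month)
    if collected_on < first_day:
        return 0.0
    if collected_on >= last_day:
        return 1.0
    return collected_on.day / days_in_month


collection_date = date(2023, 8, 8)  # collection date stated in the text
print(month_coverage("2023-07", collection_date))  # 1.0
print(round(month_coverage("2023-08", collection_date), 3))  # 0.258
```

A coverage value below 1.0 flags a month whose reported interest may change if the data is collected again later.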
In the remainder of this section, the compliance of this dataset with the FAIR principles of scientific data management [50] is explained. The FAIR principles are a framework designed to increase the accessibility and utility of scientific data and research outcomes. The acronym “FAIR” stands for four fundamental principles of scientific data management: Findability, Accessibility, Interoperability, and Reusability. These principles underscore the significance of making data easily discoverable, openly accessible, compatible with other datasets, and comprehensively documented for the sake of reproducibility. In essence, the FAIR principles aim to foster a more cooperative and transparent research landscape, facilitating the exchange of knowledge and strengthening the lasting influence of scientific investigations related to database development and database management. Several prior works in the field of dataset development have discussed how the developed datasets complied with the FAIR principles of scientific data management. Examples include the human metabolome database for 2022 [51], the WikiPathways dataset [52], datasets of Tweets about COVID-19 [53,54], a dataset of Tweets about MPox [55], the computational 2D materials database (C2DB) [56], the open reaction database [57], the RCSB Protein Data Bank [58], and PHI-base, the pathogen–host interactions database [59].
This dataset, available at https://dx.doi.org/10.21227/ht7f-rx42, is findable as it has a unique and permanent DOI assigned by IEEE Dataport. This DOI can be used by researchers from any discipline to find this dataset online. The dataset satisfies the accessibility property as it can be accessed via the DOI by any user on any device, as long as the device is connected to the internet and functioning properly. The dataset is interoperable as its data is available in a standard format (.xlsx file) that can be downloaded, read, and analyzed across different computer systems, frameworks, and applications. Finally, the dataset satisfies the reusability property as the data can be reused any number of times for the study and investigation of different research questions that focus on the analysis of search interests related to Disease X.
Author Contributions
Conceptualization, N.T.; methodology, N.T., K.A.P, and Y.N.D.; software, N.T., K.A.P, and Y.N.D.; validation, N.T.; formal analysis, N.T.; investigation, N.T.; resources, N.T.; data curation, N.T., and K.A.P; writing—original draft preparation, N.T., I.H., K.A.P, Y.N.D. and S.Q.; writing—review and editing, N.T. and I.H.; visualization, N.T.; supervision, N.T.; project administration, N.T.; funding acquisition, Not Applicable. All authors have read and agreed to the published version of the manuscript.