Version 1
Received: 22 March 2021 / Approved: 24 March 2021 / Online: 24 March 2021 (12:03:46 CET)
How to cite:
Batra, R.; Kastrati, Z.; Imran, A. S.; Daudpota, S. M.; Ghafoor, A. A Large-Scale Tweet Dataset for Urdu Text Sentiment Analysis. Preprints 2021, 2021030572
APA Style
Batra, R., Kastrati, Z., Imran, A. S., Daudpota, S. M., & Ghafoor, A. (2021). A Large-Scale Tweet Dataset for Urdu Text Sentiment Analysis. Preprints. https://doi.org/
Chicago/Turabian Style
Batra, R., Sher Muhammad Daudpota and Abdul Ghafoor. 2021. "A Large-Scale Tweet Dataset for Urdu Text Sentiment Analysis." Preprints. https://doi.org/
Abstract
This article presents a dataset of tweets in the Urdu language. The dataset contains 1,140,824 tweets collected from Twitter during September and October 2020. The corpus was produced by pre-processing the collected tweets: removing columns containing user information, retweet counts, and follower information; removing duplicate tweets; removing unnecessary punctuation, links, symbols, and spaces; and finally extracting emojis, if present, from the tweet text. In the final dataset, each tweet record contains columns for the tweet ID, the text, and any emoji extracted from the text together with a sentiment score. The emojis are extracted to help validate machine learning models used for multilingual sentiment and behavior analysis. They are extracted using a Python script that searches each tweet for emojis from a list of the 751 most frequently used emojis. If an emoji is present in the text, a column with the emoji description and sentiment score is added.
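The pre-processing described in the abstract might look roughly like the following Python sketch. It is not the authors' script: the function names, the input field names (`id`, `text`), and the three-entry `EMOJI_TABLE` with its descriptions and scores are hypothetical placeholders standing in for the 751-emoji list. Punctuation and symbol removal is omitted because the abstract does not specify which characters count as unnecessary.

```python
import re

# Hypothetical stand-in for the 751-entry emoji list; the descriptions and
# scores below are illustrative placeholders, not values from the dataset.
EMOJI_TABLE = {
    "\U0001F602": ("face with tears of joy", 0.2),
    "\u2764": ("red heart", 0.7),
    "\U0001F62D": ("loudly crying face", -0.1),
}

URL_RE = re.compile(r"https?://\S+")
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove links and collapse runs of whitespace into single spaces."""
    text = URL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def find_emojis(text, table=EMOJI_TABLE):
    """Return the table emojis present in text, ordered by first appearance."""
    found = [emoji for emoji in table if emoji in text]
    return sorted(found, key=text.index)


def build_records(raw_tweets, table=EMOJI_TABLE):
    """Keep only id, text, and emoji columns; skip duplicate tweet texts."""
    seen = set()
    records = []
    for tweet in raw_tweets:
        text = clean_text(tweet["text"])
        if text in seen:
            continue  # duplicate tweet after cleaning
        seen.add(text)
        records.append({
            "id": tweet["id"],
            "text": text,
            # Each entry is (emoji, description, sentiment score).
            "emoji": [(e, *table[e]) for e in find_emojis(text, table)],
        })
    return records
```

Calling `build_records` on a list of dictionaries that also carry user or retweet fields returns records without those fields, since only `id`, `text`, and `emoji` are copied into the output.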
Keywords
Urdu Twitter Dataset; Urdu Natural language processing (NLP); Urdu text Sentiments and Emoticons
Subject
Computer Science and Mathematics, Algebra and Number Theory
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.