A Large-Scale Tweet Dataset for Urdu Text Sentiment Analysis

Rakhi Batra; Zenun Kastrati; Ali Shariq Imran; Sher Muhammad Daudpota; Abdul Ghafoor

Submitted: 22 March 2021

Posted: 24 March 2021


Abstract

This article presents a dataset of tweets in the Urdu language. The dataset contains 1,140,824 tweets collected from Twitter during September and October 2020. The corpus was produced by pre-processing the raw collection: removing columns containing user information, retweet counts, and follower information; removing duplicate tweets; removing unnecessary punctuation, links, symbols, and spaces; and finally extracting any emojis present in the tweet text. In the final dataset, each tweet record contains columns for the tweet ID, the text, and the emoji extracted from the text, together with a sentiment score. Emojis are extracted so that they can be used to validate machine learning models for multilingual sentiment and behavior analysis. They are extracted using a Python script that searches the text for any emoji from a list of the 751 most frequently used emojis. If an emoji is present in the text, a column with the emoji description and sentiment score is added.
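The pre-processing steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' script. The field names (`id`, `text`, `emoji_description`, `sentiment_score`), the three-entry `EMOJI_TABLE` with its descriptions and scores, the punctuation set, the choice to de-duplicate on cleaned text, and the choice to report only the first matching emoji are all assumptions made for illustration. The real list has 751 entries, and the abstract does not state how multiple emojis in one tweet are handled.

```python
import re
import string

# Hypothetical stand-in for the 751-emoji lookup table.
# Descriptions and scores below are illustrative values, not real data.
EMOJI_TABLE = {
    "\U0001F602": ("face with tears of joy", 0.2),
    "\U0001F62D": ("loudly crying face", -0.1),
    "\U0001F60D": ("smiling face with heart-eyes", 0.7),
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Assumed punctuation set: ASCII punctuation plus a few common Urdu marks.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation + "،؛؟۔") + "]+")
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove links, then punctuation, then collapse repeated whitespace."""
    text = URL_RE.sub(" ", text)      # links first, before ':' and '/' are stripped
    text = PUNCT_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def find_emoji(text):
    """Return the first character of text found in EMOJI_TABLE, or None."""
    for ch in text:
        if ch in EMOJI_TABLE:
            return ch
    return None


def preprocess(records):
    """Keep only id and text, drop duplicates, and attach emoji columns."""
    seen = set()
    rows = []
    for rec in records:
        text = clean_text(rec["text"])
        if text in seen:              # assumption: duplicates compared on cleaned text
            continue
        seen.add(text)
        emoji = find_emoji(text)
        description, score = EMOJI_TABLE.get(emoji, (None, None))
        rows.append({
            "id": rec["id"],          # user, retweet, and follower fields are dropped
            "text": text,
            "emoji": emoji,
            "emoji_description": description,
            "sentiment_score": score,
        })
    return rows
```

Calling `preprocess` on a list of dictionaries that each hold at least an `id` and a `text` key returns one row per unique cleaned tweet. Rows without a listed emoji carry `None` in the three emoji-related fields. The sketch leaves emojis in the cleaned text, since the abstract does not say whether they are stripped after extraction.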

Keywords: Urdu Twitter Dataset; Urdu Natural language processing (NLP); Urdu text Sentiments and Emoticons

Subject: Computer Science and Mathematics - Algebra and Number Theory

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
