Preprint
Article

Retweet Prediction based on Heterogeneous Data Sources: The Combination of Text and Multilayer Network Features

A peer-reviewed article of this preprint also exists.

Submitted:

29 September 2022

Posted:

08 October 2022

Abstract
Retweet prediction is an important task related to problems such as information spreading analysis, automatic fake news detection, and social media monitoring. In this study, we explore retweet prediction based on heterogeneous data sources. To classify a tweet according to its number of retweets, we combine features extracted from a multilayer network with features extracted from the text. More specifically, we introduce a framework that represents Twitter as a multilayer network. This formalism captures different user actions and complex relationships, as well as other key properties of communication on Twitter. We select a set of local network measures from each layer and construct a set of multilayer network features. In addition, we adopt a BERT-based language model, namely Cro-CoV-cseBERT, to capture the high-level semantics and structure of tweets as a set of text features. We then train six machine learning (ML) algorithms on the retweet prediction task: random forest, multilayer perceptron, light gradient boosting machine, category embedding model, neural oblivious decision ensembles, and attentive interpretable tabular learning model. We compare the performance of all six algorithms in three setups: (i) using only text features, (ii) using only multilayer network features, and (iii) using both sets of features. We evaluate all setups with standard evaluation measures, i.e., precision, recall, F1-score, and accuracy. For this task, we prepare and use an empirical dataset of 199,431 tweets in the Croatian language posted between January 1, 2020 and May 31, 2021. Our results indicate that integrating multilayer network features with text features yields a prediction model that performs better than one using only a single set of features.
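The three feature setups and the four evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `combine_features` and `binary_metrics`, the use of plain concatenation to merge the two feature sets, the binary labelling, and the toy values are all assumptions made for illustration. The abstract does not state how the retweet classes are defined or how the features are merged.

```python
from typing import Dict, List, Sequence


def combine_features(text: Sequence[float],
                     network: Sequence[float],
                     setup: str) -> List[float]:
    """Build one tweet's feature vector for a given setup (hypothetical helper)."""
    if setup == "text":        # setup (i): text features only
        return list(text)
    if setup == "network":     # setup (ii): multilayer network features only
        return list(network)
    if setup == "both":        # setup (iii): assumed here to be plain concatenation
        return list(text) + list(network)
    raise ValueError(f"unknown setup: {setup}")


def binary_metrics(y_true: Sequence[int],
                   y_pred: Sequence[int],
                   positive: int = 1) -> Dict[str, float]:
    """Precision, recall, F1-score and accuracy for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against empty denominators so the sketch never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Toy values only; real text features would be BERT-style embeddings and
    # real network features would be local measures from each layer.
    text_vec = [0.12, -0.40, 0.88]
    net_vec = [5.0, 0.25]
    for setup in ("text", "network", "both"):
        print(setup, combine_features(text_vec, net_vec, setup))

    # Made-up labels and predictions, used only to exercise the metrics.
    print(binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
```

In practice, each setup's feature matrix would be passed to each of the six classifiers, and the resulting predictions scored with the same four measures, so that the setups can be compared on equal terms.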
Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2024 MDPI (Basel, Switzerland) unless otherwise stated