On modelling and predicting the popularity of online content

Activists can be seen as bridges to a broader public that can be mobilized. However, to make a noticeable impact, large social movements and campaigns require the participation of a vast number of people [82, 235, 245, 283]. Most of the time, those people have no prior experience with, or even knowledge of, the tools and platforms that facilitate mobilization and the spread of knowledge. Owing to the wide reach of social networks such as Facebook and Twitter, users can support (like, click, sign) for the first time a digital invitation to a social movement, public campaign, online petition or political uprising.

Predicting the popularity of an online item is an active field of research. Many different types of online items have been studied, such as YouTube videos, online news, social networking campaigns, crowdfunding campaigns and online petitions. On social media, a user's participation or interest can be registered as an activity (tweet, retweet, “like”, follow action). Most works in the area of popularity prediction focus on answering the following questions:

(i) Can an online item become successful? This question involves predicting whether the total popularity of the online item will exceed a threshold [55, 69, 126, 138, 187]. We recommend reading the work of Hong et al. [126] as a starting point for work in this direction.

(ii) How would popularity evolve over time? This question relates to time-series forecasting, i.e. modelling the popularity dynamics over time [6, 92, 156, 182, 240, 246, 266]. The works of Gao et al. [92] and Rizoiu et al. [246] are recommended for an overview of the best approaches to this problem.

(iii) Can we predict the final popularity of an item? This question corresponds to the regression task in which the final amount of attention is to be predicted [26, 120, 126, 162, 163]. We recommend Kupavskii et al. [162] to gain some initial insights into the problem.
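The three questions above can be read as three different prediction targets derived from the same event log. The following is a minimal sketch, assuming a cascade is given as a list of event timestamps in hours since publication. The function names, the threshold and the bin width are illustrative assumptions and are not taken from the cited works.

```python
from bisect import bisect_right
from typing import List


def final_popularity(timestamps: List[float]) -> int:
    """Question (iii): regression target, the total number of events."""
    return len(timestamps)


def is_successful(timestamps: List[float], threshold: int) -> bool:
    """Question (i): binary label, total popularity exceeds a threshold."""
    return final_popularity(timestamps) > threshold


def popularity_series(timestamps: List[float], bin_width: float, horizon: float) -> List[int]:
    """Question (ii): cumulative popularity at the end of each time bin."""
    ordered = sorted(timestamps)
    n_bins = int(horizon / bin_width)
    # Count the events that occurred up to and including the end of each bin.
    return [bisect_right(ordered, (k + 1) * bin_width) for k in range(n_bins)]


# Hypothetical cascade: retweet times in hours after the original tweet.
events = [0.1, 0.4, 0.5, 1.2, 1.3, 2.7, 5.0]
print(is_successful(events, threshold=5))
print(popularity_series(events, bin_width=1.0, horizon=6.0))
print(final_popularity(events))
```

In practice a predictor only observes an early prefix of the event list and has to estimate these targets from it; the sketch only shows how the targets themselves are defined.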

Regardless of the task and the particular phenomenon being modelled, two types of methods have been developed to solve these problems. First, machine learning techniques rely on an exhaustive list of potential features extracted from the phenomenon’s online traces, including structural and


| Task | Approach | Domain | Short description |
|---|---|---|---|
| Classification | Logistic regression | Twitter | [126] studies the classification of tweet cascades, where the most prominent features include the history of retweets and the number of users' followers. |
| Classification | Feature based | Twitter | [187] uses various features for the classification, such as the number of tweets containing a certain hashtag during a certain time period and the number of unique users that post messages with a certain hashtag in the same time period. [138] compares a Generalized Linear Model and Naive Bayes and uses the number of followers, tweet length, sentiment, URLs and the number of hashtags in a tweet as features. [55] focuses on the prediction of the structure of reshare cascades using temporal and structural features. [69] constructs a temporal analysis of hashtags in order to discover breaking events in real time and tries to distinguish the hashtags of social events from the hashtags of virtual topics or memes. |
| Classification | Social transfer | Multiple | [251] utilizes external information to model video popularity. |
| Classification | Model based | Google Trends | [67] studies the effects of external shocks on the time series evolution and thus classifies the content as one of three burst patterns. |
| Final number | Feature based | Twitter | [163] compares the prediction of the retweet cascades as well as shows cascades using multiple social, content and infection features. [162] examines the number of retweets a tweet might obtain using the flow of the retweet cascade and the PageRank score on the retweet graph. [26] predicts if a tweet is retweeted more than a certain threshold based on the structural characteristics of the networks spanned by early retweeters. |
| Final number | SVM, KNN. Feature based | News | [25] focuses on prediction prior to the release of the item of interest. |
| Final number | Logistic regression. Bipartite graph | Twitter | [120, 126] study the prediction of absolute content popularity based on a single source of information. |
| Final number | Social dynamics | Digg | [171] utilizes social influence and the Digg web site layout to predict content popularity (being promoted on the friends page). |
| Final number | Model based | YouTube | [230] proposes adaptive model selection based on the similarity to previously seen examples. |
| Final number | Model based | Twitter | [324] uses a self-excitation component (point process) that allows predicting whether a post will become popular and what its final number of reshares will be. |
| Final number | Model based | Earthquakes, neurons, crimes | [205, 213, 229] utilize point process models that predict space-time earthquake patterns, the activity of neurons and crime rates, respectively. |
| Final number | Model based | Multiple | [56] employs retweet data to perform timely query expansion based on temporal information, i.e. retweets of documents are used to boost the documents' relevance over a period of time. |
| Time evolution | Time series clustering | YouTube, Digg, Vimeo | [6] focuses on clustering content based on the evolution of its popularity and on predicting the popularity of the content based on its transitions between various evolution patterns. |
| Time evolution | Model based | Twitter | In some cases, retweets are modelled as a point process due to the instantaneous nature of tweets [92, 266]. The model assumes the multiplicative nature of the diffusion, as a tweet tends to trigger further ones. [156] also incorporates the circadian nature of the underlying phenomenon into the model. |
| Time evolution | Model based | Multiple | [182, 245, 246, 310, 311] analyse the cross-platform effect on content popularity, for example the structures of the influence networks between various processes as a result of Granger causality [109], or the effect of breaking news, posts from social friends and users' intrinsic interests on content popularity. |
| Time evolution | Model based | Search queries | [240] introduces the Dynamics Model Learners algorithm, which incorporates an internal trend and the periodicity of the time series. |
| Time evolution | Temporal clustering. K-Spectral Centroid | Twitter | [312] proposes a new metric that is invariant to scaling and shifting of the time series. |

Table 2.3 – Overview of the most prominent approaches and applications of the popularity predictions on social media.

Chapter 2. Online Activism on Social Media and Beyond

temporal characteristics and features from other sources that affect the cascade. Learning methods are then applied for the purpose of classification or regression. These methods have drawbacks, including a strong dependence on the quality of the features, the computing power needed for exhaustive training and, in some cases, limited interpretability of the model. Second, model-based techniques aim at calibrating a specific parametric model that is assumed to drive the phenomenon. Their main drawback is that such models are in some cases hard to formulate; however, they are more interpretable.
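To illustrate what calibrating a parametric model can mean, the sketch below uses a self-exciting (Hawkes) point process, a common choice when each event may trigger further ones. The exponential kernel and the parameter names `mu`, `alpha` and `beta` are assumptions made for this example; the works cited in this section use their own kernels and estimation procedures.

```python
import math
from typing import Sequence


def hawkes_intensity(t: float, history: Sequence[float],
                     mu: float, alpha: float, beta: float) -> float:
    """Conditional intensity mu + sum over past events of alpha * exp(-beta * (t - t_i))."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in history if ti < t)


def branching_ratio(alpha: float, beta: float) -> float:
    """Expected number of events directly triggered by one event (integral of the kernel)."""
    return alpha / beta


def expected_cascade_size(alpha: float, beta: float) -> float:
    """Expected total size of a cascade started by one event, with no background rate.

    This is the geometric series 1 + n + n^2 + ... = 1 / (1 - n), valid only for n < 1.
    """
    n = branching_ratio(alpha, beta)
    if n >= 1.0:
        raise ValueError("supercritical process: the expected size is infinite")
    return 1.0 / (1.0 - n)


# Hypothetical retweet times (hours) and hand-picked, not fitted, parameters.
retweets = [0.0, 0.5, 0.6]
print(hawkes_intensity(1.0, retweets, mu=0.0, alpha=0.8, beta=1.0))
print(expected_cascade_size(alpha=0.8, beta=1.0))
```

The interpretability mentioned above shows up directly in the parameters: the branching ratio states how many follow-up events one event triggers on average, which a feature-based classifier does not expose.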

In the following, we focus on predicting, modelling and describing various online and offline real-world phenomena using data sets of online digital traces of human behaviour collected from various sources, such as videos [6, 230, 246], posts [25, 26, 55, 126, 324], blogs [6, 171], Google Trends [67], search queries [240], memes [172], online petitions [233, 235, 313], campaigns [81, 237] and natural phenomena [205, 213, 229]. A detailed description of the works on online and offline content popularity prediction on the web and social media is given in Table 2.3.

In document Profiling, Modelling and Facilitating Online Activism (Page 38-40)