3.3 Keyword Adaptation Algorithm (KwAA)
3.3.2 Traffic Pattern based Approach (TP-KwAA)
According to the evaluation results presented in Section 3.6.1, initial attempts show that additional event content is identified when using TF-KwAA. However, the dataset collected through TF-KwAA also contains a large number of noisy tweets (sometimes it is even noisier than the stream retrieved by the sample function of the Twitter Streaming API). Moreover, the longer the crawler runs, the larger the proportion of noisy tweets becomes. The noise, namely event-irrelevant tweets, eventually overwhelms the event-relevant content, which results in a chaotic and meaningless dataset. This issue arises because the algorithm relies on the collected content: a clean keyword set helps the KwAA adapt correctly, while a polluted keyword set confuses the KwAA with noisy hashtags that have been wrongly considered event-relevant keywords.
As a result, the problem is how to modify TF-KwAA so that the adaptive crawler collects more event-associated data without significantly increasing the dataset noise. To reduce the impact of noisy information on the adaptive dataset, the traffic pattern of hashtags, i.e. the frequency count distribution of the hashtags, is exploited to identify new search terms. The basic assumption of this KwAA is that the frequency trend of any event-related hashtag should be similar to that of the initial keywords. In other words, the frequency distribution of a new hashtag should be positively correlated with that of the initial keywords. The higher the correlation, the more similar the two terms are.
The refined version, TP-KwAA, first automatically obtains the hashtag list H(t_n) generated by TF-KwAA. The list is then passed to an extended part of the keyword adaptation algorithm, which assesses each element’s relevance to the event. Although the ideal situation would be to pass the full hashtag list H_all(t_n) to the extended part, this research uses only the subset H(t_n) to avoid frequent queries to the Twitter Streaming API, which are restricted by Twitter rate limits. To measure relevance, the correlation coefficient is exploited. In order to calculate the correlation between two hashtags, the original time frame is subdivided into m time slots (as illustrated in Figure 3.5).
Figure 3.5: Time frames and Time slots for Hashtag Frequency
As defined previously, the total frequency count of hashtag h_k at t_n is represented by f(h_k, t_n). The frequency counts of hashtag h_k over all the slots at t_n can therefore be represented by F(h_k, t_n) = {f(h_k, t_n, s_1), f(h_k, t_n, s_2), ..., f(h_k, t_n, s_m)}. Instead of using H(t_n) as the input for querying tweets in the next time frame, H_fin(t_n), a subset of H(t_n), is used as the keyword set. The updated pseudocode is given in Algorithm 3.
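The slot-level frequency vector F(h_k, t_n) described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `slot_frequencies`, the use of numeric timestamps, and the choice of m equal-width slots are assumptions made here, not details taken from the crawler implementation.

```python
def slot_frequencies(timestamps, frame_start, frame_length, m):
    """Return [f(h, t_n, s_1), ..., f(h, t_n, s_m)] for one hashtag.

    timestamps   -- times (e.g. seconds) at which the hashtag was observed
    frame_start  -- start time of the time frame t_n
    frame_length -- duration of the time frame, in the same unit
    m            -- number of slots; equal-width slots are an assumption
    """
    slot_length = frame_length / m
    counts = [0] * m
    for ts in timestamps:
        offset = ts - frame_start
        # Ignore observations that fall outside the current time frame.
        if 0 <= offset < frame_length:
            # min() guards against floating-point rounding at the frame end.
            index = min(int(offset // slot_length), m - 1)
            counts[index] += 1
    return counts


# Hypothetical observations for one hashtag in a 600-second frame, m = 5.
print(slot_frequencies([5, 130, 140, 599, 600, -1], 0, 600, 5))
# prints [1, 2, 0, 0, 1]
```

Since the slots partition the time frame, the slot counts sum to the total frequency count f(h_k, t_n) of the hashtag within that frame.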
The relationship between the initial keywords H_seed and the keyword set at the beginning
Algorithm 3 Traffic Pattern based Keyword Adaptation Algorithm (TP-KwAA)
Require: H_seed, H_fin(t_n) = ∅, H(t_n)
Execute Algorithm 2
for ∀ h_x ∈ H(t_n) do
    for ∀ h_y ∈ {H_seed ∪ H_fin(t_n)} do
        if h_y ∈ H_BL and cor(F(h_x, t_n), F(h_y, t_n)) > Thres_1 then
            if h_x ∉ {H_seed ∪ H_fin(t_n)} then
                add h_x to H_fin(t_n)
            end if
        else if h_y ∉ H_BL and cor(F(h_x, t_n), F(h_y, t_n)) > Thres_2 then
            if h_x ∉ {H_seed ∪ H_fin(t_n)} then
                add h_x to H_fin(t_n)
            end if
        end if
    end for
end for
of each time frame is based on the following assumptions:
Assumption 1 the initial keywords used for both baseline crawler and adaptive crawler are the most representative words that describe the event of interest.
Assumption 2 keywords for an event during one particular or several sequential time frames are likely to exhibit similar traffic patterns.
Assumption 2.1 the frequency counts of two event-related hashtags should positively correlate with each other. Namely, when keyword A appears more frequently, the frequency of keyword B will also increase, and vice versa.
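Under these assumptions, the selection loop of Algorithm 3 can be sketched in Python as follows. This is a minimal sketch rather than the implementation used in this research: the names `pearson` and `tp_kwaa`, the dictionary `slot_freq`, the threshold values, and all slot counts are hypothetical, and H_BL is simply passed in as a set because it is defined elsewhere. Treating a constant series as uncorrelated is likewise an assumption of the sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's r between two equal-length slot-frequency vectors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat series is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def tp_kwaa(candidates, seeds, baseline, slot_freq, thres1, thres2):
    """Select H_fin(t_n) from the candidate list H(t_n).

    candidates -- H(t_n), hashtags proposed by TF-KwAA (list)
    seeds      -- H_seed, the initial keywords (list)
    baseline   -- H_BL, passed in as a set
    slot_freq  -- maps each hashtag to its slot counts F(h, t_n)
    """
    final = []  # H_fin(t_n) starts empty
    for hx in candidates:
        if hx in seeds or hx in final:
            continue  # only hashtags outside H_seed and H_fin can be added
        # The reference set H_seed ∪ H_fin grows as hashtags are accepted.
        for hy in list(seeds) + final:
            threshold = thres1 if hy in baseline else thres2
            if pearson(slot_freq[hx], slot_freq[hy]) > threshold:
                final.append(hx)
                break  # hx is now in H_fin, so further checks change nothing
    return final


# Made-up slot counts, for illustration only.
slot_freq = {
    "#olympics": [10, 20, 40, 30],
    "#london2012": [12, 22, 38, 33],
    "#teamgb": [5, 9, 21, 14],
    "#100aday": [30, 28, 31, 29],
}
seeds = ["#olympics", "#london2012"]
selected = tp_kwaa(["#teamgb", "#100aday", "#olympics"], seeds, set(seeds),
                   slot_freq, thres1=0.8, thres2=0.9)
print(selected)  # prints ['#teamgb']
```

In this toy data the hashtag that rises and falls with the seeds is accepted, while the hashtag with a flat, unrelated pattern stays below both thresholds and is left out.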
The initial keywords used by the baseline crawler and by the adaptive crawler with TF-KwAA are also selected as the initial keywords in TP-KwAA. To measure the correlation between the traffic patterns of hashtags, this research tests the selection of potential keywords with three correlation coefficients, i.e. Pearson’s r, Kendall’s τ and Spearman’s ρ. Through a series of experiments (more details in Section 3.5), the results show that r and ρ achieve similar performance, and both perform better than τ. Since Pearson’s r gives slightly better results, this research chose the Pearson correlation coefficient to measure the similarity between keywords. The Pearson correlation ranges between +1 and -1 inclusive, where +1 represents a perfect positive correlation, 0 represents no linear correlation, and -1 represents a perfect negative correlation. It is defined by Equation 3.1:

\[
\mathrm{cor}(h_x, h_y) = \frac{\sum_{i=1}^{m} \left[ f(h_x, t_n, s_i) - \bar{F}(h_x, t_n) \right] \cdot \left[ f(h_y, t_n, s_i) - \bar{F}(h_y, t_n) \right]}{\sqrt{\sum_{i=1}^{m} \left[ f(h_x, t_n, s_i) - \bar{F}(h_x, t_n) \right]^2} \, \sqrt{\sum_{i=1}^{m} \left[ f(h_y, t_n, s_i) - \bar{F}(h_y, t_n) \right]^2}} \tag{3.1}
\]

where \bar{F}(h_x, t_n) denotes the mean of the m slot frequencies in F(h_x, t_n). The equation calculates the Pearson correlation coefficient between the traffic pattern of hashtag h_x and that of hashtag h_y. Algorithm 3 ensures that the input keyword set for the next time frame t_{n+1} is a list of hashtags h_k ∈ H(t_n) whose traffic patterns are highly correlated with those of the initial keywords. For example, #100aday was a trending hashtag during the 2012 London Olympic Games but was irrelevant to the event. It is detected as a keyword by TF-KwAA, but successfully excluded by TP-KwAA because of its low correlation with the initial seeds.
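As a quick check on Equation 3.1, the following worked example evaluates the coefficient by hand for m = 4 slots. The slot counts are made up purely for illustration and are not taken from the collected datasets; \bar{F} denotes the mean of the slot frequencies.

```latex
% Illustrative (made-up) slot counts for two hashtags, m = 4
\begin{align*}
F(h_x, t_n) &= \{2, 4, 6, 8\}, & \bar{F}(h_x, t_n) &= 5 \\
F(h_y, t_n) &= \{1, 3, 2, 6\}, & \bar{F}(h_y, t_n) &= 3
\end{align*}
% Numerator: sum of products of the deviations from the means
\begin{align*}
\sum_{i=1}^{4} \left[ f(h_x, t_n, s_i) - 5 \right] \left[ f(h_y, t_n, s_i) - 3 \right]
  &= (-3)(-2) + (-1)(0) + (1)(-1) + (3)(3) = 14 \\
% Denominator: sums of squared deviations
\sum_{i=1}^{4} \left[ f(h_x, t_n, s_i) - 5 \right]^2 &= 9 + 1 + 1 + 9 = 20 \\
\sum_{i=1}^{4} \left[ f(h_y, t_n, s_i) - 3 \right]^2 &= 4 + 0 + 1 + 9 = 14 \\
\mathrm{cor}(h_x, h_y) &= \frac{14}{\sqrt{20}\,\sqrt{14}} = \frac{14}{\sqrt{280}} \approx 0.84
\end{align*}
```

The two hypothetical hashtags rise and fall largely together, so the coefficient is close to +1. Whether such a value is high enough for selection depends on the thresholds Thres_1 and Thres_2 used in Algorithm 3.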