Term Frequency based Approach (TF-KwAA) - Keyword Adaptation Algorithm (KwAA)

3.3 Keyword Adaptation Algorithm (KwAA)

3.3.1 Term Frequency based Approach (TF-KwAA)

TF-KwAA first identifies all the hashtags that co-occurs with the collection of initial keywords (or initial seeds, represented by Hseed = {h1, h2, ...}) in the nth time frame

tn, represented as Hall(tn) = {h1, h2, ..., hk, ...}, where hk(k = 1, 2, ). In this thesis, the

term “keywords” only refers to the hashtags that are used in the search query for tweets retrieval. The keywords set, sent back to Twitter API in Figure 3.4, in the same time

frame, is a subset of Hall(tn) and is represented as H(tn) = {h1, h2, ...}. H(tn) satisfies

specific criteria, high frequency of co-occurrence with initial keywords. Apart from the hashtag lists, the algorithm also keeps two hashtags frequency lists; Fall(tn) and F (tn).

Fall(tn) = {f (h1, tn), f (h2, tn), ...} is the individual frequencies at all observed hashtags

at the end of nth time frame tn. The frequency list F (tn), as a subset of Fall(tn), is

used to record the frequency of the keywords. The hashtag list and the frequency list have a one-to-one correspondence, i.e. the frequency count of a hashtag hk at nth time

Algorithm 1.

Algorithm 1 Frequency List Update Require: Hall(tn), Fall(tn)

1: for ∀hin in the incoming tweets do

2: if ∃hk= hin: hk∈ Hall(tn) then 3: f (hk, tn) = f (hk, tn) + 1; 4: else 5: add hk to Hall(tn); 6: f (hk, tn) = 1 7: add f (hk, tn) to Fall(tn); 8: end if 9: end for

When a hashtag hk appears, the Algorithm 1 is executed to check whether the hashtag

already exists in the hashtag list Hall(tn) for the nth time frame. If this hashtag has

already emerged, its corresponding frequency f (hk, tn) is incremented by 1. Otherwise,

both the hashtag list Hall(tn) and the frequency list Fall(tn) are updated to include this

new hashtag hk with a frequency f (hk, tn) = 1.

To enable an efficient keyword adaptation, a minimum frequency (fmin), as a threshold

for being a keyword, and an array of blacklist hashtags (Hblack) are also used in this

TF-KwAA (this will be explained later this section). The pseudo code in Algorithm 2

explains the details of this KwAA.

Algorithm 2 Term Frequency based Keyword Adaptation Algorithm (TF-KwAA) Require: Hall(tn) and Fall(tn) from Algorithm1

1: for ∀hk∈ Hall(tn) do

2: if ∃hk: f (hk, tn) < fmin or hk∈ Hblacklist or

{h_k∈ H₍tn−1) and f (hk, tn), ..., f (hk, tn−n0) = 0} then

3: remove hk from Hall(tn);

4: remove f (hk, tn)} from Fall(tn)

5: else if f (hk, tn) ∈ top N [Fall(tn)] then

6: add hk to H(tn);

7: add f (hk, tn) to F (tn)

8: end if

9: end for

This algorithm keeps at most N keywords when query Twitter Streaming API. By default, the value is set to be N = 400, as it is the maximum number of terms can be used to filter the Twitter Streaming API2. When the timer expires (at the end of each 2_{This track limit is defined in Twitter API version 1.0, and has been adopted in the current version}

time frame), the hashtags in the hashtags list are sorted according to their frequency. Top ones will be added to the keyword set, while those with low frequency are ignored. This is because that the number of hashtags that co-occurred with the initial keywords can be huge. A random sample of 10 independent time slots from London Olympic dataset shows that, on average, 355 new hashtags emerge in every minute. In order to reduce the number of potential keywords, this research only keeps keywords with a frequency count higher than the median hashtag in the initial filtering step. This preliminary filtering strategy is used to reduce the number of keyword comparisons needed. The 50% threshold value was not investigated in further detail, as in practice it does not impact the results as this bottom 50% always contained hashtags which were also did not adhere to our minimum frequency restriction (described below).

To avoid the overwhelming of non-related keywords in the new keyword set, the following three noise reduction steps are employed in the TF-KwAA:

1. Minimum frequency

This threshold, fmin, helps to filter out the unusual and non-related hashtags,

especially when the crawler first starts. The introduction of low frequency hashtags will significantly increase the calculation cost, both in space and time, and are very unlikely to introduce useful amount of event-tweets. As a result, fminis empirically

set to be one per minute.

2. Rare keywords discarding mechanism

Some newly identified keywords are popular in a specific period of time, but fade away quickly after that. Since the number of keyword is limited to N, a lot of space will be wasted if the algorithm keeps track on these keywords. By discarding the long-term-low-frequency items, the crawler can improve the utility of N keywords. This mechanism functions as follows: any hashtag hk whose frequency is lower

than x for a long period (f (hk, tn), f (hk, tn−1), ..., f (hk, tn−n0)) will be removed from the keywords set.

3. Modifiable keyword blacklist.

The introduction of the keyword blacklist allows noisy keyword to be manually filtered. The blacklist is empty when the crawler is started. Users can identify and add non-related words to the blacklist during the collection period. The algorithm will check this list every time when it identifies new search terms so it can discard the words that are in the blacklist. For the experiments in this paper, the blacklist words are either the abbreviation of news channels (e.g. #BBC for British Broadcasting Corporation, #CNN for Cable News Network and etc.) or hashtags used by follow up and follow back activities (e.g. #teamfollow and #followback).

After all the above steps, the number of keywords is expected to be lower than the N = 400 limit. If the number of keywords is higher than 400, the TF-KwAA only sends the top 400 keywords with highest frequency back to the Twitter API.

In document Real-time Content Identification for Events and Sub-Events from Microblogs. (Page 56-59)