3.3 Keyword Adaptation Algorithm (KwAA)
3.3.1 Term Frequency based Approach (TF-KwAA)
TF-KwAA first identifies all the hashtags that co-occur with the collection of initial keywords (or initial seeds, represented by Hseed = {h1, h2, ...}) in the nth time frame tn. This set is represented as Hall(tn) = {h1, h2, ..., hk, ...}, where hk (k = 1, 2, ...) denotes the kth observed hashtag. In this thesis, the term “keywords” refers only to the hashtags that are used in the search query for tweet retrieval. The keyword set that is sent back to the Twitter API (Figure 3.4) in the same time frame is a subset of Hall(tn) and is represented as H(tn) = {h1, h2, ...}. The hashtags in H(tn) satisfy specific criteria, namely a high frequency of co-occurrence with the initial keywords. Apart from the hashtag lists, the algorithm also keeps two hashtag frequency lists, Fall(tn) and F(tn). Fall(tn) = {f(h1, tn), f(h2, tn), ...} holds the individual frequencies of all observed hashtags at the end of the nth time frame tn. The frequency list F(tn), a subset of Fall(tn), records the frequencies of the keywords. The hashtag list and the frequency list have a one-to-one correspondence, i.e. the frequency count of a hashtag hk in the nth time frame is f(hk, tn). The frequency list is updated as described in Algorithm 1.
Algorithm 1 Frequency List Update
Require: Hall(tn), Fall(tn)
for each hashtag hin in the incoming tweets do
    if ∃ hk ∈ Hall(tn) such that hk = hin then
        f(hk, tn) = f(hk, tn) + 1
    else
        add hin to Hall(tn) as a new entry hk
        f(hk, tn) = 1
        add f(hk, tn) to Fall(tn)
    end if
end for
When a hashtag hk appears in an incoming tweet, Algorithm 1 checks whether the hashtag already exists in the hashtag list Hall(tn) for the nth time frame. If the hashtag has already been observed, its corresponding frequency f(hk, tn) is incremented by 1. Otherwise, both the hashtag list Hall(tn) and the frequency list Fall(tn) are updated to include this new hashtag hk with a frequency f(hk, tn) = 1.
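The update in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the crawler's actual code: the function name `update_frequencies` and the use of a single dictionary (keys playing the role of Hall(tn), values the role of Fall(tn)) are assumptions made here to keep the one-to-one correspondence between the two lists explicit.

```python
def update_frequencies(freq_all, incoming_hashtags):
    """Apply the Algorithm 1 update for the hashtags seen in incoming tweets.

    freq_all maps each observed hashtag to its count in the current
    time frame (hypothetical layout: keys ~ Hall(tn), values ~ Fall(tn)).
    """
    for h_in in incoming_hashtags:
        if h_in in freq_all:
            # Hashtag already observed in this time frame: increment its count.
            freq_all[h_in] += 1
        else:
            # New hashtag: add it with an initial frequency of 1.
            freq_all[h_in] = 1
    return freq_all


# Example: three hashtag occurrences arriving within one time frame.
freq_all = {}
update_frequencies(freq_all, ["#olympics", "#london2012", "#olympics"])
```

Storing both lists in one mapping means a hashtag and its frequency can never fall out of step, which is the property the text requires of Hall(tn) and Fall(tn).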
To enable efficient keyword adaptation, TF-KwAA also uses a minimum frequency (fmin), which serves as the threshold for being a keyword, and a blacklist of hashtags (Hblack); both are explained later in this section. The pseudocode in Algorithm 2 gives the details of this KwAA.
Algorithm 2 Term Frequency based Keyword Adaptation Algorithm (TF-KwAA)
Require: Hall(tn) and Fall(tn) from Algorithm 1
for each hk ∈ Hall(tn) do
    if f(hk, tn) < fmin or hk ∈ Hblack or {hk ∈ H(tn−1) and f(hk, tn), ..., f(hk, tn−n0) = 0} then
        remove hk from Hall(tn)
        remove f(hk, tn) from Fall(tn)
    else if f(hk, tn) ∈ top N [Fall(tn)] then
        add hk to H(tn)
        add f(hk, tn) to F(tn)
    end if
end for
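As a rough illustration of Algorithm 2, the selection step might be written as follows. The function name `tf_kwaa`, the parameter names, and the `history` layout (a per-hashtag list of recent frame counts, most recent last and including the current frame) are all hypothetical. The sketch also assumes that the top-N ranking is taken over the hashtags that survive the removal conditions, and it returns a filtered copy rather than modifying the lists in place.

```python
def tf_kwaa(freq_all, prev_keywords, history, f_min, blacklist, n0, top_n=400):
    """Select the keyword set for the current frame (sketch of Algorithm 2).

    freq_all      -- dict: hashtag -> frequency in the current frame
    prev_keywords -- set of keywords used in the previous frame, H(tn-1)
    history       -- dict: hashtag -> list of counts for recent frames,
                     most recent last (assumed layout)
    """
    candidates = {}
    for h, f in freq_all.items():
        recent = history.get(h, [])[-(n0 + 1):]
        # A previous keyword is stale if it has n0 + 1 recorded frames
        # and every one of them has a count of zero.
        stale = (
            h in prev_keywords
            and len(recent) == n0 + 1
            and all(count == 0 for count in recent)
        )
        if f < f_min or h in blacklist or stale:
            continue  # dropped from the candidate lists
        candidates[h] = f
    # Keep only the top_n most frequent surviving hashtags.
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_n])


# Example with toy values (not taken from the experiments).
keywords = tf_kwaa(
    freq_all={"#olympics": 50, "#bbc": 40, "#gold": 12, "#typo": 1, "#old": 0},
    prev_keywords={"#olympics", "#old"},
    history={"#old": [0, 0, 0]},
    f_min=2,
    blacklist={"#bbc"},
    n0=2,
    top_n=2,
)
```

In this toy run, `#bbc` is blacklisted, `#typo` and `#old` fall below `f_min`, and the two remaining hashtags fill the two available keyword slots.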
This algorithm keeps at most N keywords when querying the Twitter Streaming API. By default, N = 400, as this is the maximum number of terms that can be used to filter the Twitter Streaming API². When the timer expires (at the end of each time frame), the hashtags in the hashtag list are sorted according to their frequency. The top ones are added to the keyword set, while those with low frequency are ignored. This is because the number of hashtags that co-occur with the initial keywords can be huge. A random sample of 10 independent time slots from the London Olympic dataset shows that, on average, 355 new hashtags emerge every minute. In order to reduce the number of potential keywords, this research only keeps keywords with a frequency count higher than that of the median hashtag in the initial filtering step. This preliminary filtering strategy reduces the number of keyword comparisons needed. The 50% threshold value was not investigated in further detail because, in practice, it does not affect the results: the bottom 50% always contained hashtags that also failed to meet our minimum frequency restriction (described below).
² This track limit is defined in Twitter API version 1.0, and has been adopted in the current version
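The median-based pre-filter described above could look like the following sketch. The function name `median_prefilter` is hypothetical, and the comparison is assumed to be strict (only counts strictly above the median are kept), following the wording "higher than the median".

```python
from statistics import median


def median_prefilter(freq_all):
    """Keep only hashtags whose count is strictly above the median count."""
    if not freq_all:
        return {}
    cutoff = median(freq_all.values())
    return {h: f for h, f in freq_all.items() if f > cutoff}


# Toy example: the median of [10, 5, 1, 1, 3] is 3.
prefiltered = median_prefilter({"#a": 10, "#b": 5, "#c": 1, "#d": 1, "#e": 3})
```

Because the discarded half consists of the lowest counts, the later minimum-frequency check has fewer hashtags to compare.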
To prevent unrelated keywords from overwhelming the new keyword set, TF-KwAA employs the following three noise reduction steps:
1. Minimum frequency
This threshold, fmin, helps to filter out unusual and unrelated hashtags, especially when the crawler first starts. Introducing low-frequency hashtags significantly increases the computational cost, in both space and time, and is very unlikely to bring in a useful number of event tweets. As a result, fmin is empirically set to one per minute.
2. Rare keywords discarding mechanism
Some newly identified keywords are popular for a specific period of time but fade away quickly afterwards. Since the number of keywords is limited to N, many slots are wasted if the algorithm keeps tracking these keywords. By discarding items whose frequency stays low over the long term, the crawler makes better use of the N keywords. This mechanism functions as follows: any hashtag hk whose frequency is lower than x for a long period (f(hk, tn), f(hk, tn−1), ..., f(hk, tn−n0)) is removed from the keyword set.
3. Modifiable keyword blacklist
The keyword blacklist allows noisy keywords to be filtered manually. The blacklist is empty when the crawler starts. Users can identify unrelated words and add them to the blacklist during the collection period. The algorithm checks this list every time it identifies new search terms, so that it can discard the words that are in the blacklist. For the experiments in this paper, the blacklisted words are either abbreviations of news channels (e.g. #BBC for British Broadcasting Corporation and #CNN for Cable News Network) or hashtags used in follow-up and follow-back activities (e.g. #teamfollow and #followback).
After all the above steps, the number of keywords is expected to be lower than the N = 400 limit. If the number of keywords is still higher than 400, TF-KwAA sends only the 400 keywords with the highest frequency back to the Twitter API.
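The rare keywords discarding mechanism needs the counts of each keyword over the last n0 + 1 time frames. One possible way to maintain that window is sketched below. The class name `RareKeywordTracker`, its method names, and the choice of a fixed-length deque per keyword are assumptions made for illustration; the values of x and n0 in the example are toy values, not those used in the experiments.

```python
from collections import defaultdict, deque


class RareKeywordTracker:
    """Track per-keyword counts over the last n0 + 1 frames (hypothetical helper)."""

    def __init__(self, x, n0):
        self.x = x              # frequency threshold below which a frame counts as "low"
        self.window = n0 + 1    # number of consecutive frames inspected
        self.history = defaultdict(lambda: deque(maxlen=self.window))

    def end_of_frame(self, keywords, freq):
        """Record this frame's counts and return the keywords worth keeping."""
        kept = set()
        for h in keywords:
            counts = self.history[h]
            counts.append(freq.get(h, 0))  # a keyword not seen this frame counts as 0
            # Discard only once a full window of low counts has been observed.
            rare = len(counts) == self.window and all(c < self.x for c in counts)
            if rare:
                del self.history[h]
            else:
                kept.add(h)
        return kept


# Toy run: "#fad" is popular in the first frame and then disappears.
tracker = RareKeywordTracker(x=1, n0=2)
keywords = {"#gold", "#fad"}
frames = [{"#gold": 5, "#fad": 3}, {"#gold": 4}, {"#gold": 6}, {"#gold": 2}]
for frame in frames:
    keywords = tracker.end_of_frame(keywords, frame)
```

In this run `#fad` survives the third frame, because its window still contains the initial count of 3, and is discarded after the fourth, when all three counts in its window are zero.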