Evaluation Metrics - Comparison to Baselines

5.3 Comparison to Baselines

5.4.3 Evaluation Metrics

We apply the topic coherence metrics (introduced in Chapter 4) to automatically evaluate the coherence level of the generated topics (see Section 5.4.3.1). We introduce a topic mixing degree metric (see Section 5.4.3.2), which indicates the extent to which the generated topics are mixed together. Since both the ToT and TVB approaches estimate the topical trends, we also use the trend estimation error (see Section 5.4.3.3) to compute the distance between a real topical trend and its estimated topical trend (introduced in Section 5.2.1).

5.4.3.1 Metric 1: Coherence Metrics

Following the conclusion of Chapter 4, we use the T-WE coherence metric (i.e. Metric (18) in Table 4.6) to evaluate the coherence of the generated topics. In order to capture the semantic similarity of the latest hashtags and Twitter handle names, we crawl 200 million English tweets posted from 01/08/2015 to 30/08/2016 using the Twitter Streaming API. This Twitter dataset covers a different time period from the Twitter background dataset used in Chapter 4 (see Section 4.5.1). Indeed, the time period of this newly crawled Twitter background dataset covers the time period of the GT dataset as well as a 13-month time period before the US Election 2016 date (i.e. 08/11/2016). Therefore, using the new background dataset, T-WE can effectively assess the coherence of the topics generated from both the GT and USE datasets. We train a WE model using this Twitter background dataset and obtain word embedding vectors for 5 million tokens. The trained WE model is used in our WE-based coherence metric.

To evaluate the coherence of topic models, we apply our proposed coherence@n metric (denoted as c@n, see Section 4.6). Note that the c@n metric calculates the average coherence score of the top n ranked topics in a topic model, where the coherence of each topic is computed using the WE-based coherence metric. For the GT dataset, we examine the top 2 and top 7 most coherent topics from a generated topic model, i.e. the c@2 and c@7 metrics. Considering that the number of topics is 10, we argue that the top 2 and 7 most coherent topics are reasonable choices to evaluate the coherence of the generated topic models. For the USE dataset, we use the c@10, c@20 and c@30 metrics, as the number of topics is larger. We also apply the average (Aver) coherence to evaluate all topics for both Twitter datasets, i.e. the average coherence score of all topics in a topic model (recall that Aver is used as a baseline for c@n in Section 4.6).
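As an illustration of how c@n and Aver relate, the following minimal sketch assumes that each topic already has a coherence score (for instance from a WE-based metric). The function names and the score values are hypothetical and are not taken from the experiments.

```python
def coherence_at_n(topic_coherences, n):
    """Average coherence of the n most coherent topics (c@n)."""
    if not 1 <= n <= len(topic_coherences):
        raise ValueError("n must be between 1 and the number of topics")
    # Rank topics from most to least coherent, then average the top n.
    ranked = sorted(topic_coherences, reverse=True)
    return sum(ranked[:n]) / n


def average_coherence(topic_coherences):
    """Average coherence over all topics (Aver)."""
    return sum(topic_coherences) / len(topic_coherences)


# Made-up per-topic scores for a 10-topic model (illustrative only).
scores = [0.31, 0.12, 0.27, 0.08, 0.22, 0.19, 0.05, 0.25, 0.15, 0.10]

c_at_2 = coherence_at_n(scores, 2)
c_at_7 = coherence_at_n(scores, 7)
aver = average_coherence(scores)
```

By construction, c@n can never fall below Aver, and the two coincide when n equals the number of topics.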

5.4. Experimental Setup

5.4.3.2 Metric 2: Mixing Degree Metric

As discussed in Section 5.2.1, Topic (a) (“currency, GBP, weaker”) can be mixed with Topic (b) (“currency, Scotland, euro”) because they use similar words, such as “currency” in Figure 5.3. Let us assume that Topic (c) is represented by “scotland, economy, finance”. Topic (b) can also be mixed with Topic (c), since Topic (c) has the words “scotland, economy”, which are semantically related to “scotland, currency”. Topics (a), (b) and (c) are mixed, in the sense that they have overlapping or semantically related words. We introduce a new metric to capture the similarities of all pairs of topics generated by a given topic modelling approach. The higher the overall similarity, the more mixed the generated topics are. More formally, we use cosine similarity to compute the average similarity among all the generated topics, which we call the topic mixing degree (denoted as MD). We use Equation (5.14) to calculate the MD score of a topic model:

\[
\mathrm{MD}(\beta) = \frac{\sum_{k, k'} \mathrm{cosine}(\beta_k, \beta_{k'})}{|K|^2} \tag{5.14}
\]

where β_k is a topic term distribution and K is the total number of topics (see Table 5.1). The higher the MD score, the more mixed the topic model is, i.e. the topic modelling approach generated more mixed topics. A similar methodology is used in AlSumait et al. (2009) to identify the background topics.
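The mixing degree can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in the experiments: it assumes that the sum runs over all ordered pairs of topics, including each topic paired with itself, which is consistent with the |K|^2 normaliser but is not stated explicitly. The function names and the toy distributions are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic term distributions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mixing_degree(beta):
    """MD: mean cosine similarity over all ordered topic pairs.

    Assumption: pairs (k, k) are included, matching the |K|^2 normaliser.
    """
    k = len(beta)
    total = sum(cosine(beta[i], beta[j]) for i in range(k) for j in range(k))
    return total / (k * k)


# Toy topic term distributions over a 3-word vocabulary (illustrative only).
separate = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mixed = [[0.5, 0.4, 0.1], [0.4, 0.5, 0.1], [0.4, 0.4, 0.2]]

md_separate = mixing_degree(separate)
md_mixed = mixing_degree(mixed)
```

Under this assumption, topics with no shared vocabulary give the minimum score of 1/|K|, while identical topics give a score of 1.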

5.4.3.3 Metric 3: Trend Estimation Error

Both the ToT and our TVB approaches estimate the topical trends. To evaluate the topical trends over time, we calculate the distance/error between the real topic trends and the estimated topical trends (using the Beta distribution in ToT and TVB). The error is calculated using the method shown in Equation (5.15):

\[
\mathrm{ERR}(\tau) = \frac{\sum_{k} \int_{0}^{1} \left| \tau_k(t) - \mathrm{PDF}_k(t) \right| \, dt}{K} \tag{5.15}
\]

where PDF_k(t) is the probability density function of the real timestamps of topic k, which is obtained from the GT dataset. The ERR score ranges from 0 to 2, since both τ_k and PDF_k are densities that integrate to 1 over [0, 1]. A generated topic is matched to a ground-truth topic if its top 10 words share at least 3 words with the top 10 words of a hashtag event.
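The sketch below illustrates one way to compute such an error numerically. It rests on several assumptions that are not taken from the text: timestamps are normalised to [0, 1], the real density is approximated by a histogram, and the integral is approximated with the midpoint rule. All function names, the bin count and the step count are hypothetical choices.

```python
import math


def beta_pdf(t, a, b):
    """Density of a Beta(a, b) distribution at t, for 0 < t < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))


def histogram_pdf(timestamps, bins):
    """Piecewise-constant density of timestamps normalised to [0, 1]."""
    counts = [0] * bins
    for t in timestamps:
        counts[min(int(t * bins), bins - 1)] += 1
    return [c * bins / len(timestamps) for c in counts]


def trend_error(beta_params, real_timestamps, bins=20, steps_per_bin=50):
    """Mean over topics of the integral of |estimated - real| on [0, 1]."""
    steps = bins * steps_per_bin
    width = 1.0 / steps
    total = 0.0
    for (a, b), stamps in zip(beta_params, real_timestamps):
        real = histogram_pdf(stamps, bins)
        for i in range(steps):
            t = (i + 0.5) * width  # midpoint, so t is never exactly 0 or 1
            total += abs(beta_pdf(t, a, b) - real[i // steps_per_bin]) * width
    return total / len(beta_params)


def shares_enough_words(topic_top_words, event_top_words, min_shared=3):
    """Matching rule: at least min_shared words in common."""
    return len(set(topic_top_words) & set(event_top_words)) >= min_shared
```

For example, a Beta(1, 1) estimate compared against uniformly spread timestamps gives an error close to 0, whereas an estimate concentrated early in the period compared against timestamps concentrated late gives an error close to the upper bound of 2.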

For the GT dataset, we apply the three aforementioned metrics: the topic coherence metrics (c@n and Aver), the topic mixing degree metric (MD) and the trend estimation error metric (ERR). However, there are no ground-truth labels in the USE dataset. Hence, only the c@n, Aver and MD metrics are used for the USE dataset.

5.4.4 Research Questions

We aim to answer four research questions in this chapter:

• RQ1. Does our TVB approach outperform ToT and TLDA in terms of topic coherence and topic mixing degree?

• RQ2. Does the time dimension help to improve the coherence of topics in our TVB approach?

• RQ3. What is the impact of the balance parameter on both the coherence and the mixing degree of the generated topics?

• RQ4. Does our TVB approach more accurately estimate the trends of the generated topics compared to ToT?

In document Analysing political events on Twitter: topic modelling and user community classification (Page 113-115)