Evaluating Models’ Ability to Generate Correct Tagging Patterns

4.4 Evaluating the Topic Model

4.4.3 Evaluating Models’ Ability to Generate Correct Tagging Patterns


The learned models are evaluated by comparing the likelihood of generating the tagging patterns observed in a set of held-out documents against the likelihood of generating the tagging patterns in sets whose tags were assigned to documents randomly. Two variants of randomly tagged document sets are considered. In the first, the original set of held-out documents is used, and for each document 50% of its tags are randomly selected and replaced by a new set of randomly selected tags. In the second, for each document in the original held-out set, all the tags are replaced by a new set of tags selected uniformly at random. Table 4.6 shows that the model is more likely to generate the tagging patterns observed in the held-out documents than the tagging patterns that result from adding noise. We therefore conclude that the models were successfully trained to generate the tagging patterns that appear in the collections examined.
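The two noise variants described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function names are hypothetical, and it assumes that the replacement set has the same size as the set it replaces, that "50%" rounds down for odd-sized tag sets, and that replacement tags in the first variant are drawn uniformly from tags the document did not already have.

```python
import random


def replace_half(tags, vocabulary, rng):
    """Variant 1: replace 50% of a document's tags with randomly drawn tags."""
    tags = list(tags)
    n_replace = len(tags) // 2  # assumption: round down for odd-sized tag sets
    kept = rng.sample(tags, len(tags) - n_replace)
    # Assumption: replacements come from tags the document did not already carry.
    candidates = [t for t in vocabulary if t not in tags]
    return set(kept) | set(rng.sample(candidates, n_replace))


def replace_all(tags, vocabulary, rng):
    """Variant 2: replace every tag with tags drawn uniformly at random."""
    # Assumption: the new tag set has the same size as the original one.
    return set(rng.sample(list(vocabulary), len(tags)))
```

Given a function that returns the model's likelihood for a tagged document (not shown, since it depends on the trained model), the held-out set would be scored once as observed and once under each noise variant, and the resulting likelihoods compared.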

4.5 Properties of Tags Generated by the Models

The learned models were used to generate tag assignments, which were then compared to the original collections, with respect to the properties described in Chapter 3:

Figure 4.6: Tag assignments in generated vs. original collections a) NYT and b) ACM

1. the tag frequency distribution,

2. the tag count per document distribution,

3. the pairwise tag co-occurrence patterns,

4. the distribution of higher-order tag co-occurrence, and

5. the number of distinct tag sets with a large amount of similarity.

Due to the large size of the NYT collection, tags were generated for a smaller collection (10% of the size of the NYT), and, as with the ACM, the threshold limit for online document aggregation is assumed to be 5 instead of 50 (the value for the whole NYT collection).

Figure 4.6 overlays the normalized tag frequency distributions observed in the original collections (Figure 3.2) with those of the learned models, showing that they track the original distributions fairly closely for all but the least frequent tags. The learned model has a bias towards selecting the popular tags for assignment to documents, which in turn leads to some of the less popular tags not being assigned to any documents. For the NYT collection, out of 1,015 available tags, the generative model assigned only the 744 most popular tags to the generated documents. For the ACM collection, out of 6,109 available tags, the generative model assigned only 4,890 tags. (The ACM generative model can generate only 6,109 tags because only that many tags were observed in its training set.)
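A comparison of this kind could be computed along the following lines. The sketch is illustrative only: the function names are hypothetical, and normalizing by the total number of tag assignments is an assumption, since the normalization used for the figures is not restated here.

```python
from collections import Counter


def normalized_tag_frequencies(documents):
    """Return (tag, share) pairs, most frequent first.

    Assumption: each tag's count is divided by the total number of
    tag assignments in the collection.
    """
    counts = Counter(tag for tags in documents for tag in set(tags))
    total = sum(counts.values())
    return [(tag, n / total) for tag, n in counts.most_common()]


def unassigned_tags(documents, vocabulary):
    """Tags available in the vocabulary that no document received."""
    used = set().union(*documents)
    return set(vocabulary) - used
```

With the counts reported above, the number of tags left unassigned is 1,015 − 744 = 271 for the NYT and 6,109 − 4,890 = 1,219 for the ACM.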

Figure 4.7: Distribution of the number of tags for synthetic tag assignment generated using CTM for a) NYT and b) ACM

The tag count per document distributions observed in the original sampled collections (Figure 3.3) can be compared to the distributions in the learned models (Figure 4.7). For the learned model of the NYT, the mean tag count per document is close to that of the original sampled collection, and the general shapes of the curves are similar. For the learned model of the ACM, the mean tag count per document is close to that of the original sampled collection, but the shape of the curve differs. However, a similar distribution shape can be achieved by reducing the threshold F, at the expense of increasing the mean tag count per document. All the observed distributions of tag count per document resemble the zero-truncated Poisson distribution.
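For reference, the zero-truncated Poisson distribution has the standard form below. Here λ denotes the rate of the underlying Poisson distribution; the symbol is introduced only for this illustration, and no fitted value is implied for either collection.

```latex
P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!\,\bigl(1 - e^{-\lambda}\bigr)}, \qquad k = 1, 2, 3, \dots
\qquad\qquad
\mathbb{E}[X] = \frac{\lambda}{1 - e^{-\lambda}}
```

The truncation removes the outcome k = 0, which matches the setting where every document carries at least one tag, and it makes the mean slightly larger than λ.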

Comparing the pairwise tag co-occurrence observed in the original sampled collections (Figure 3.4) to that of the learned model (Figure 4.8) shows that the top tags co-occur with many other tags and that the co-occurrence pattern appears to resemble a Zipf-like distribution.
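The pairwise counts behind a co-occurrence graph of the most frequent tags could be gathered as in the sketch below. The function name and the default of 10 top tags (chosen to match the figure captions) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def top_tag_cooccurrence(documents, k=10):
    """Count how often each pair of the k most frequent tags shares a document."""
    freq = Counter(tag for tags in documents for tag in set(tags))
    top = {tag for tag, _ in freq.most_common(k)}
    pairs = Counter()
    for tags in documents:
        # Sort so that each unordered pair is always stored under one key.
        present = sorted(set(tags) & top)
        pairs.update(combinations(present, 2))
    return pairs
```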

The amount of higher-order tag co-occurrence is shown in Table 4.7 (which repeats the ACM data from Table 3.1 for convenience). The learned models include many documents that have many tags in common, although the sets of shared tags do not grow as large as in the original collections, especially for the ACM-Model.
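One way to obtain counts of this kind is sketched below. It assumes that a conjunction of n tags is counted when at least `threshold` documents carry all n tags; that reading of the threshold limit, and the function name, are assumptions rather than the definition used in Chapter 3.

```python
from collections import Counter
from itertools import combinations


def conjunction_counts(documents, threshold=5, max_n=None):
    """For each n, count the tag n-sets shared by at least `threshold` documents."""
    support = Counter()
    for tags in documents:
        tags = sorted(set(tags))
        limit = len(tags) if max_n is None else min(max_n, len(tags))
        # Brute force: enumerate every subset of the document's tags up to size `limit`.
        for n in range(1, limit + 1):
            support.update(combinations(tags, n))
    per_size = Counter()
    for combo, count in support.items():
        if count >= threshold:
            per_size[len(combo)] += 1
    return dict(sorted(per_size.items()))
```

The enumeration is exponential in the number of tags per document, so it is only practical when documents carry few tags or `max_n` is set; an Apriori-style search that prunes infrequent subsets would be the usual alternative for larger collections.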

Similarity among document sets in the NYT and ACM (Table 3.2) can be compared to that in the corresponding generated collections (Table 4.8). Although there is some similarity

Figure 4.8: Tag to tag co-occurrence graph of the 10 most frequent tags for synthetic collection generated by CTM in a) NYT and b) ACM

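Table 4.7: Number of conjunctions of n tags that contribute to high multi-way co-occurrence (with threshold limit 5)

| Collection | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NYT Fragment | 1,015 | 18,657 | 26,199 | 16,995 | 7,733 | 3,423 | 1,757 | 951 | 423 | 130 | 24 | 2 | 77,309 |
| NYT Model | 744 | 15,795 | 18,524 | 7,244 | 1,646 | 310 | 62 | 11 | 1 | | | | 44,337 |
| ACM Original | 9,098 | 14,262 | 5,280 | 3,860 | 3,700 | 3,199 | 2,390 | 1,520 | 776 | 297 | 79 | 13 | 44,474 |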

Table 4.8: Percentage overlap of cells corresponding to two different query lengths using Ok(A, B) in the a) NYT-Model and b) ACM-Model

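a) NYT-Model

|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| 1 |   | 0 | 0 | 2 | 8 | 27 | 58 | 82 | 100 |
| 2 |   |   | 20 | 12 | 27 | 56 | 81 | 91 | 100 |
| 3 |   |   |   | 70 | 54 | 73 | 87 | 91 | 100 |
| 4 |   |   |   |   | 96 | 94 | 95 | 100 | 100 |

b) ACM-Model

|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 |   | 0 | 0 | 0 | 0 |
| 2 |   |   | 16 | 2 | 0 |
| 3 |   |   |   | 76 | 71 |
| 4 |   |   |   |   | 100 |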

observed between different tag intersection levels in the synthetic collections, it is not as pronounced as in the original collections, especially for the ACM-Model.

In developing the generative tagging models, our goal was not to create perfect generative models for the NYT and ACM collections, but rather to create a generative tagging model that produces synthetic collections exhibiting realistic tag assignment patterns. As a result, the various differences between the generated tag assignment patterns and the tag assignments found in the original collections do not indicate a failure of the approach; the generated tag assignment patterns do have many properties that are similar to those observed in the original data.

In document An Online Analytical System for Multi-Tagged Document Collections (Page 73-77)