In this section we review aliases of news topic references. We briefly define aliases and describe an approach to generating them. Aliases are different news topic references that refer to the same news topic events. For example, the news topic reference “Mortgage meltdown” is an alias of “Subprime crisis” because both refer to the same events in late 2007. Aliases can therefore be defined as alternate names or synonyms for a news topic.
Aliases have three uses. They can be used to generate multiple queries for a news topic. They can reduce a result set by consolidating multiple aliases into one resulting news topic reference. They may also be used to classify a phrase as a news topic reference: if an alias can be generated for a phrase and that alias is a news topic reference, then the original phrase can be considered a news topic reference.
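The consolidation and classification uses described above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this section: the alias table `ALIASES`, the set `KNOWN_REFERENCES`, and all function names are hypothetical and introduced only for this example.

```python
# Hypothetical alias table: alias -> canonical news topic reference.
ALIASES = {
    "Mortgage meltdown": "Subprime crisis",
}
# Hypothetical set of phrases already known to be news topic references.
KNOWN_REFERENCES = {"Subprime crisis"}


def canonical(phrase):
    """Return the canonical reference for a phrase, or the phrase itself."""
    return ALIASES.get(phrase, phrase)


def consolidate(results):
    """Collapse aliases in a result list to one reference each, keeping order."""
    seen = []
    for phrase in results:
        ref = canonical(phrase)
        if ref not in seen:
            seen.append(ref)
    return seen


def is_news_topic_reference(phrase):
    """A phrase qualifies if it, or an alias generated for it, is a known reference."""
    return canonical(phrase) in KNOWN_REFERENCES
```

With this table, a result list containing both “Mortgage meltdown” and “Subprime crisis” collapses to the single entry “Subprime crisis”, and “Mortgage meltdown” is classified as a news topic reference because its alias is one.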
4.5.1 Wikipedia Redirect
One approach to generating aliases is based on the structured information found in Wikipedia, namely redirect pages. Redirect pages are defined in Wikipedia as “a page which has no content itself, but sends the reader to another article, section of an article, or page, usually from an alternative title”. For example, if you search for “Subprime crisis”, or select an internal Wikipedia link with the same term, the result is the article titled “Subprime mortgage crisis”. Both titles therefore lead to the same page, which indicates that they are aliases of one another. In essence, multiple phrases are directed to the same article because these phrases refer to the same concept.
The titles of the redirect pages are considered aliases if they link to the same article, and that article is titled with a news topic reference. Using the redirects information from Wikipedia it is possible to obtain aliases for many former news topic references, though it may not be possible to obtain aliases for all news topic references.
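The grouping rule above can be sketched in a few lines. The sketch assumes that redirect information has already been extracted into `(redirect_title, target_title)` pairs and that a set of article titles known to be news topic references is available; how either is obtained is outside the scope of the example. The function name, the toy data, and the choice to include the article title itself in each group are assumptions made for illustration.

```python
from collections import defaultdict


def aliases_from_redirects(redirects, news_topic_titles):
    """Group redirect titles by target article.

    redirects: iterable of (redirect_title, target_article_title) pairs.
    news_topic_titles: set of article titles known to be news topic references.
    Only targets that are news topic references produce alias groups.
    """
    groups = defaultdict(set)
    for source, target in redirects:
        if target in news_topic_titles:
            groups[target].add(source)
    # Assumption: the article title itself is also counted as a name for the topic.
    return {target: sorted(names | {target}) for target, names in groups.items()}


# Toy input for illustration only; not an extract of real Wikipedia data.
redirects = [
    ("Subprime crisis", "Subprime mortgage crisis"),
    ("Mortgage meltdown", "Subprime mortgage crisis"),
    ("Some other title", "Unrelated article"),
]
alias_groups = aliases_from_redirects(redirects, {"Subprime mortgage crisis"})
```

Redirects whose target is not a news topic reference, such as the third pair in the toy input, produce no alias group.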
An implementation of this approach was evaluated, and the results showed an overall accuracy of 65%. The evaluation was done using an online survey system (see footnote 13). In this survey, users were given a set of phrase pairs that are news topic reference aliases, see Figure 4.9. Users were then asked to confirm whether both phrases refer to the same event or activities. In 65% of the cases the users answered yes, in 27% of the cases the users answered no, and in 8% of the cases the users answered unknown. After reviewing the results, we found that many pairs were answered no or unknown because of obscure terminology or terms given in a language other than English.
Figure 4.9: Screen capture of online survey for alias evaluation
4.5.2 Additional Approaches
The task of finding news topic reference aliases is a subtask of finding semantically similar phrases. It follows that additional approaches to finding aliases can be based on approaches to finding semantically similar phrases. One proposed approach uses the correlation between query patterns to find similar phrases. An example of query patterns is given in Figure 4.10. In the figure, the X axis is time and the Y axis is a normalized value. The value is based on the number of times the term is used as a query in a search engine, and it is normalized so that patterns with different scales can be compared. This approach is based on the assumption that the query patterns of news topic reference aliases are similar, or closely correlated to one another.
Figure 4.10: Example query pattern of “Mortgage meltdown” and “Subprime mortgages”
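The text does not state which normalization is applied to the query counts. As one plausible reading, the sketch below scales each series by its own peak, so that a frequently searched term and a rarely searched term can be plotted on the same axis. The function name and the peak-scaling choice are assumptions.

```python
def normalize(counts):
    """Scale a query-count series so that its peak equals 1.0.

    Assumption: peak scaling; the source does not specify the normalization.
    """
    if not counts:
        return []
    peak = max(counts)
    if peak == 0:
        # A term that was never queried stays flat at zero.
        return [0.0 for _ in counts]
    return [c / peak for c in counts]
```

For instance, the raw counts `[10, 40, 20]` and `[100, 400, 200]` both normalize to the same shape, which is what allows their patterns to be compared.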
If we assume that the queries used to search for a news topic are news topic reference aliases, then different queries for the same topic are aliases of one another. If we further assume that the query patterns of aliases follow a similar distribution over time, then we can use the correlation between query distributions to find news topic reference aliases. For example, in Figure 4.10 the news topic reference aliases “Mortgage meltdown” and “Subprime mortgages” are correlated queries, because their query patterns have similar spikes and valleys. Therefore, if we use the news topic reference “Mortgage meltdown” to find all the correlated queries, one of the resulting queries will be “Subprime mortgages”. Other results may not be aliases, and these must be filtered out. Google Correlate (see footnote 14) provides access to correlation information for search queries.
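To make the idea of correlated query patterns concrete, the following sketch ranks candidate queries by the Pearson correlation of their time series with that of a reference query. It does not use Google Correlate; it assumes the query patterns are already available as equal-length lists of values. The function names, the threshold of 0.9, and the toy series are assumptions chosen for illustration, and Pearson correlation is used here as one common measure rather than the measure confirmed by the source.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no defined correlation; treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlated_queries(reference, patterns, threshold=0.9):
    """Return (query, correlation) pairs at or above the threshold, best first."""
    target = patterns[reference]
    scored = [
        (query, pearson(target, series))
        for query, series in patterns.items()
        if query != reference
    ]
    kept = [(q, r) for q, r in scored if r >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy query patterns for illustration only; these are not real search data.
patterns = {
    "Mortgage meltdown": [1, 2, 8, 3, 1],
    "Subprime mortgages": [2, 4, 16, 6, 2],
    "Holiday recipes": [8, 3, 1, 2, 8],
}
candidates = correlated_queries("Mortgage meltdown", patterns)
```

In the toy data, “Subprime mortgages” has the same spikes and valleys as the reference and is returned as a candidate, while “Holiday recipes” peaks at different times and is not. A high correlation only marks a query as a candidate alias, which is why a filtering step is still needed.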
We assume that the popularity of a news topic will change over time, and that this change will give a news topic a characteristic distribution, and that news topic reference aliases will follow this characteristic distribution. Therefore if we plot the distribution of a news topic reference and find all the distributions that are similar, the queries used for these similar distributions could be news topic reference aliases.
A preliminary evaluation of this approach has shown that some relevant results are generated, though some filtering is required to remove results that are not aliases. A possible way to filter out these results is to compare the search results returned for each query, on the expectation that queries which are aliases of one another would return similar results from a search engine.
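One way to realize such a filter is to measure the overlap between the result sets of the reference query and each candidate query. The sketch below uses Jaccard similarity for this purpose. The source does not name a similarity measure, so Jaccard, the cut-off of 0.5, the function names, and the placeholder result identifiers are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_candidates(reference_results, candidate_results, min_overlap=0.5):
    """Keep candidate queries whose result set overlaps the reference's enough.

    reference_results: result identifiers (e.g. URLs) for the reference query.
    candidate_results: dict mapping each candidate query to its result identifiers.
    """
    return [
        query
        for query, results in candidate_results.items()
        if jaccard(reference_results, results) >= min_overlap
    ]


# Placeholder result identifiers for illustration only.
reference_results = ["u1", "u2", "u3"]
candidate_results = {
    "Subprime mortgages": ["u1", "u2", "u4"],
    "Holiday recipes": ["u9"],
}
kept = filter_candidates(reference_results, candidate_results)
```

Here the candidate that shares two of its three results with the reference is kept, and the candidate with no shared results is discarded.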
4.5.3 Outlook
In this chapter we have given a definition of news topic reference aliases and potential approaches that have in part been implemented, but not fully evaluated.