3.3 Additional Modules
3.3.2 Control Module
This module is the main interface for the administrator of the system. It is composed of the task manager and the processing components, which are grouped together because their functionality is highly interdependent, although considerably different.
3.3.3 Task Manager
Because the processing components are modular, their tasks can be distributed to different processing nodes. The intention of distributing the processing load is to maintain timely performance of the system. Depending on the processing load, this component may choose to update recently extracted information or to present previously extracted information. It is therefore responsible for information freshness, that is, for keeping the information up to date. Because many of the processing components enable parallelization, this module may assign the same task to multiple processing nodes and consolidate the results once they are generated. To illustrate, consider the process of extracting causal relations from a collection of texts: the collection can be partitioned and processed by several nodes, and the results of the extraction can then be combined.
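The partition-and-consolidate idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in Forest: worker threads stand in for processing nodes, and the names `extract_causal_relations`, `partition`, and `run_distributed` are hypothetical. The toy extractor only splits sentences on the cue word "because" so that the example is self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_causal_relations(texts):
    """Toy stand-in for a causal relation extractor (an assumption).

    Returns (cause, effect) pairs for sentences containing " because ".
    """
    relations = []
    for text in texts:
        for sentence in text.split("."):
            effect, cue, cause = sentence.partition(" because ")
            if cue:  # the cue word was found in this sentence
                relations.append((cause.strip(), effect.strip()))
    return relations


def partition(collection, n_parts):
    """Split the collection into n_parts nearly equal slices."""
    return [collection[i::n_parts] for i in range(n_parts)]


def run_distributed(collection, n_nodes):
    """Give one partition to each worker, then consolidate the results."""
    parts = partition(collection, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        per_node = list(pool.map(extract_causal_relations, parts))
    # Consolidation: merge the per-node lists and drop duplicates.
    merged = set()
    for relations in per_node:
        merged.update(relations)
    return sorted(merged)


corpus = [
    "Flights were cancelled because a storm hit the coast.",
    "Prices rose because supply fell. Markets reacted.",
    "The match was postponed.",
]
relations = run_distributed(corpus, n_nodes=2)
```

Because each partition is processed independently, the consolidated result in this sketch does not depend on how many workers are used; only the elapsed time changes.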
3.3.4 Processing Components
In the previous section we presented all the processing components in a flow format. The main components of this flow are the causal relation extraction and the news topic reference validation and extraction. There are additional components that are necessary to provide complete results and also to evaluate the hypotheses. This module is the repository of the methods and logic of each component. Each component can be deployed to different processing nodes to complete a task.
3.3.5 Processing Nodes
This module represents the available hardware resources to process the information. These resources are registered with the control module. This module addresses the need for resources to maintain scalability and near real-time performance of the system.
3.3.6 Data Sources
The processing components of Forest are used to extract information, so it is critical to have a clearly defined source from which the information is extracted. This module provides access to those sources. To evaluate the system it is important to replicate inputs so that the results of different processing strategies can be compared; in this case a static document corpus is an appropriate source of information. To obtain the most up-to-date information, an Internet source is well suited. This component is also responsible for filtering out sources that are out of scope, for example video sources or sources in a language other than English. This module also provides access to supporting data sources such as Wikipedia, which are used to aid different processes, such as NTR validation.
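The scope filter mentioned above can be sketched as a simple predicate over source metadata. This is only an illustration: the `Source` record, its `media_type` and `language` fields, and the example URLs are assumptions made for the sketch, and the text does not specify how Forest detects the media type or language of a source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical description of a candidate information source."""
    url: str
    media_type: str  # assumed values such as "text" or "video"
    language: str    # assumed ISO 639-1 code such as "en"


def in_scope(source, allowed_media=("text",), allowed_languages=("en",)):
    """Return True when the source is textual and in an accepted language."""
    return (source.media_type in allowed_media
            and source.language in allowed_languages)


def filter_sources(sources):
    """Keep only the sources that are in scope, preserving their order."""
    return [s for s in sources if in_scope(s)]


candidates = [
    Source("https://example.org/news/1", "text", "en"),
    Source("https://example.org/clip/2", "video", "en"),
    Source("https://example.org/noticias/3", "text", "es"),
]
kept = filter_sources(candidates)
```

In this example only the first candidate is kept; the video source and the Spanish-language source are filtered out.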
3.3.7 Information Repository
This component stores all the extracted information and its associated metadata. When the information is extracted by multiple nodes, it is the task of the control module to compile the information and store it in this module. The information may be updated at different intervals depending on configured requirements. This component will choose the NTRs that are presented to the user.
3.3.8 Configuration
The performance of the system may be tuned with different parameters for different configurations. For example, the configuration for testing causal relation strategies is different from the configuration for evaluating the scalability of the system. This module is intended to keep these configurations as well as the configurations of the user. Examples of user configuration are the information granularity, such as the maximum number of results, and an information freshness threshold, that is, how often the information should be updated.
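As an illustration of the two user settings named above, the sketch below models them as a small configuration record. The class name `UserConfig`, the field names, the default values, and the helper functions are all assumptions made for the example; the text does not prescribe a concrete format for the configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class UserConfig:
    """Hypothetical user configuration with illustrative defaults."""
    max_results: int = 10                                 # information granularity
    freshness_threshold: timedelta = timedelta(hours=1)   # maximum accepted age


def needs_update(last_extracted, now, config):
    """Return True when stored information is older than the threshold."""
    return now - last_extracted > config.freshness_threshold


def limit_results(results, config):
    """Truncate a result list to the configured maximum number of results."""
    return results[:config.max_results]


config = UserConfig(max_results=2, freshness_threshold=timedelta(minutes=30))
now = datetime(2020, 1, 1, 12, 0)
stale = needs_update(datetime(2020, 1, 1, 11, 0), now, config)
fresh = needs_update(datetime(2020, 1, 1, 11, 45), now, config)
shown = limit_results(["a", "b", "c"], config)
```

A check such as `needs_update` is one way the task manager could decide between re-extracting information and presenting previously extracted information.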
3.4 Conclusion
In this chapter we have presented the conceptual architecture of Forest. We have presented the processing steps to extract information and we have shown the supporting components that help the system to generate results. The architecture also presents components that are reviewed in detail in the following chapters.
Chapter 4
News Topic References
To model topics, specifically news topics, we require a topic model that can be found in a causal relation. In this chapter we present a topic model that addresses these requirements; we call this topic model a news topic reference (NTR). To better understand news topic references we provide a definition and some examples. We also present several approaches to extract and generate news topic references, along with their corresponding evaluations. We continue by demonstrating how these approaches can be composed and present the best approaches for different use cases. In conclusion we will show how the performance of these approaches is suitable for evaluating Hypothesis 3: by analyzing news articles at a sentence level it is possible to extract explicit causal relations between news topic references with 95% accuracy.
4.1 Introduction
The abundance of online information makes it difficult or impossible for a person to process it all; this problem is known as information overload. It is readily evident in online news: there are so many sources generating content that tools are needed to process all this information. These tools can include clustering similar news articles and organizing the articles into predefined categories, such as sports or finance. In this work we propose to organize the news into a network of causally related news topics. The system we developed to automatically extract these causal relations between news topics is called Forest. The goal of Forest is to facilitate understanding and navigation of the news. We facilitate understanding by giving context to a news topic in the form of a causal network for that topic. We believe that causally related news topics allow the user to navigate from one topic to the next while maintaining a coherent relation between the topics.
To extract a causal network of news topics we require two main components: a component to extract causal relations from natural language text, and a component to extract news topics from causal relations. In this chapter we review the methods to extract news topic references found in natural language text.