2.2 A Brief Overview of Information Retrieval and Machine Learning Techniques

Here, we briefly discuss information retrieval and machine learning techniques, since machine learning is closely related to some previous work and our research adopts information retrieval techniques to improve the scalability of our entity coreference system.

2.2.1 Information Retrieval

Given a large collection of documents, information retrieval techniques allow users to quickly look up the documents that contain given words. Rather than traversing the whole collection for every query, an inverted index is built by pre-processing all documents in the collection, which speeds up query processing. Figure 2.7 shows an example of an inverted index.

Figure 2.7: An Example of an Inverted Index

The upper part of this example shows a document collection of three documents, identified by IDs 1 to 3, each with its own contents.

The lower part of the diagram shows the resulting inverted index, which has two primary components: the Term List and the Posting Lists. The Term List contains all tokens extracted from the contents of all documents, with stop words filtered out. Each term in the term list is associated with a posting list that contains every document whose contents include that term. In our example, “Dezhao” is a term that appears in both Document 1 and Document 3; therefore, the posting list of “Dezhao” contains both documents. Note that some tokens, such as “in”, “the”, and “of”, are not included in the term list, since in real-world scenarios they appear in many documents and are not sufficiently representative of any one of them. In addition, a token is sometimes changed to another form. For example, in Figure 2.7, the token “graduated” was reduced to its root form “graduate” when building the index. This is typically referred to as Stemming, a pre-processing or normalization step applied when building an inverted index.
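The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the indexing code of our system: the document texts, the stop word set, and the one-entry `STEMS` table (a stand-in for a real stemmer) are all hypothetical, chosen so that the posting lists for “Dezhao” and “Lehigh” match those discussed in the text. Lowercasing the terms is an additional assumption.

```python
from collections import defaultdict

# Hypothetical documents; the actual contents shown in Figure 2.7 are not reproduced here.
DOCS = {
    1: "Dezhao graduated from Lehigh",
    2: "The library of Lehigh is open",
    3: "Dezhao works in the lab of Lehigh",
}

# Hypothetical stop word list.
STOP_WORDS = {"in", "the", "of", "from", "is"}

# Stand-in for a real stemmer: maps a surface form to its root form.
STEMS = {"graduated": "graduate"}


def normalize(token):
    """Lowercase a token and replace it with its root form if one is known."""
    token = token.lower()
    return STEMS.get(token, token)


def build_inverted_index(docs):
    """Map each non-stop-word term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            term = normalize(token)
            if term in STOP_WORDS:
                continue  # stop words are left out of the term list
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index(DOCS)
print(index["dezhao"])  # [1, 3]
print(index["lehigh"])  # [1, 2, 3]
```

Keeping each posting list sorted by document ID is a common design choice, since it allows posting lists to be combined with a single linear pass at query time.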

With this inverted index, users can quickly look up the documents that satisfy certain constraints. As a concrete example, suppose we have the following query:

Find documents that contain the term “Dezhao” and the term “Lehigh”.

Given this query, we first obtain the posting lists of both terms: Documents 1 and 3 for “Dezhao”, and Documents 1, 2, and 3 for “Lehigh”. Next, we intersect the two posting lists, which gives Documents 1 and 3 as the answer to the query. Disjunctive queries are processed in a similar way; instead of an intersection, a union merges the posting lists of the query terms. The information retrieval literature also describes techniques for index optimization, such as index compression, and for adding location information to the index in order to support more types of queries [28].
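The conjunctive and disjunctive queries described above can be illustrated with a short Python sketch. The `POSTINGS` dictionary simply hard-codes the two posting lists from the example; the function names are our own, and the merge-style intersection assumes that each posting list is sorted by document ID.

```python
# Posting lists from the example in the text (terms lowercased here by assumption).
POSTINGS = {
    "dezhao": [1, 3],
    "lehigh": [1, 2, 3],
}


def intersect(a, b):
    """Intersect two sorted posting lists in a single linear pass."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def union(a, b):
    """Merge two posting lists, keeping each document ID once."""
    return sorted(set(a) | set(b))


and_result = intersect(POSTINGS["dezhao"], POSTINGS["lehigh"])
or_result = union(POSTINGS["dezhao"], POSTINGS["lehigh"])
print(and_result)  # [1, 3]
print(or_result)   # [1, 2, 3]
```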

2.2.2 Machine Learning

Machine Learning is a well-known technique for making predictions from known information and knowledge. According to Alpaydin [29], “Machine learning is programming computers to optimize a performance criterion using example data of past experience. We have a model defined up to some parameters, and learning is the execution of a computer program to optimize the parameters of the model using the training data or past experience. The model may be predictive to make predictions in the future, or descriptive to gain knowledge from data, or both.” Overall, machine learning uses existing data and knowledge to predict something that is currently unknown. There are primarily three types of machine learning techniques: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

For supervised learning, in general, we have training data and a mathematical model whose parameters are learned from that training data. For a given machine learning problem, we also need to define a set of features designed specifically for that problem. The values of these features are extracted from the training data and used by the mathematical model to learn the values of its parameters. To evaluate a machine learning algorithm, we also have testing data, from which we extract values for the same set of features and let the learned model make predictions using the learned parameter values. Models for such approaches include Naive Bayes, Decision Tree, Random Forest, etc. As a concrete example, we could try to predict whether a customer will buy diapers by considering several factors or features: 1) Does this customer have a baby? 2) What is the gender of this customer? and 3) Is there a football game tonight? With these features, we would probably be able to learn the probability that a customer buys diapers under each feature and combine these probabilities for prediction. Note that all of these probabilities are learned automatically from the training data.
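One possible reading of the diaper example is a Naive Bayes classifier, which estimates a probability for each feature value under each outcome and multiplies them together with the prior probability of the outcome. The sketch below is written from scratch for illustration; the six training examples are invented, and the add-one smoothing assumes that every feature has exactly two possible values.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (has_baby, gender, football_tonight) -> bought diapers?
TRAIN = [
    (("yes", "male", "yes"), True),
    (("yes", "female", "no"), True),
    (("yes", "male", "no"), True),
    (("no", "male", "yes"), False),
    (("no", "female", "no"), False),
    (("no", "female", "yes"), False),
]


def train(examples):
    """Count how often each label and each (feature, value, label) combination occurs."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for features, label in examples:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts


def predict(model, features):
    """Return the label with the highest Naive Bayes score for the given feature values."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the label
        for i, value in enumerate(features):
            # Add-one smoothing, assuming two possible values per feature.
            score *= (feature_counts[(i, label)][value] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(predict(model, ("yes", "female", "yes")))  # True
print(predict(model, ("no", "male", "no")))      # False
```

In this toy data set the "has a baby" feature separates the two outcomes perfectly, so it dominates the prediction; with real data, the learned probabilities would be far less clear-cut.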

In contrast to supervised approaches, there are also unsupervised approaches. Instead of trying to exploit training data or labeled data, such approaches operate directly on the testing data and try to discover the structure of the data. Clustering algorithms are good examples of this category. The advantage of unsupervised approaches is that they do not require any labeled data, which can be difficult to obtain in certain domains. However, because there is no formal training process, unsupervised approaches can only rely on similarity and distance measures for prediction.
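As an illustration of how a clustering algorithm relies only on a distance measure, the following is a minimal sketch of k-means, one common clustering algorithm that we pick here purely as an example. The six two-dimensional points are invented, Euclidean distance is assumed, and taking the first k points as initial centroids is a simplification; practical implementations usually initialize more carefully.

```python
import math

# Hypothetical unlabeled data: two visually separate groups of 2-D points.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]


def kmeans(points, k, iterations=10):
    """Cluster points into k groups using Euclidean distance only (no labels)."""
    centroids = [points[i] for i in range(k)]  # simplistic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


labels, centroids = kmeans(POINTS, k=2)
print(labels)
```

No label is ever supplied to `kmeans`; the grouping emerges solely from the distances between points.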

Finally, reinforcement learning, unlike the previous two types of machine learning algorithms, is concerned with how intelligent agents should act in an environment to maximize some notion of reward. The agent executes actions that cause the observable state of the environment to change. Through a sequence of actions, the agent attempts to gather knowledge about how the environment responds to its actions, and attempts to synthesize a sequence of actions that maximizes a cumulative reward. In contrast to supervised learning, reinforcement learning algorithms learn how to act by considering the outcomes of previous actions in order to achieve the highest reward.
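The interaction loop described above can be sketched with tabular Q-learning, one widely used reinforcement learning algorithm; the text does not prescribe a particular algorithm, so this choice is ours. The environment is a hypothetical five-state corridor in which the agent starts at state 0 and receives a reward of 1 on reaching state 4. The learning rate, discount factor, exploration rate, and random seed are arbitrary illustrative values.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the goal
ACTIONS = (-1, +1)  # action index 0 = move left, 1 = move right


def step(state, action_index):
    """Apply an action; the corridor walls keep the agent inside 0..4."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values from the rewards observed after each action."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            # Explore at random with probability epsilon, or when both actions look equal.
            explore = rng.random() < epsilon or q[state][0] == q[state][1]
            action = rng.randrange(2) if explore else q[state].index(max(q[state]))
            next_state, reward = step(state, action)
            # Move the estimate toward reward plus discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
greedy_policy = [row.index(max(row)) for row in q_table[:-1]]
print(greedy_policy)  # [1, 1, 1, 1]: move right in every non-goal state
```

The agent is never told that moving right is correct; it infers this from the reward that eventually follows its own actions.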