3.2 Background Knowledge
3.2.1 Knowledge Sources
Two knowledge sources are used as initial inputs for discovering concept labels. A structured collection of teaching materials provides a source for extracting important topics identified by teaching experts in the domain, while a domain lexicon provides broader and more detailed coverage of the relevant topics in the domain. Books are highlighted as a contributing factor that is often linked to the success of teaching and learning (Agrawal, Chakraborty, Gollapudi, Kannan & Kenthapadi 2012). The authors of books are usually experts in their respective domains, and they carefully design books to contain important topics that learners should be interested in. The domain lexicon is used to verify that the concept labels identified from the teaching materials are directly relevant. Thereafter, an encyclopedia source, such as Wikipedia, is searched to provide the relevant text that forms a pseudo-document for each verified concept label. The final output of this process is a set of concepts, each comprising a concept label and an associated pseudo-document.
Figure 3.2: An overview of the background knowledge creation process
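The process described above (verify candidate labels against the lexicon, then attach encyclopedia text to each verified label) can be sketched as follows. This is a minimal illustration rather than the implementation used in this thesis: the function names, the exact-match comparison after lowercasing, and the `fetch_text` lookup are all assumptions, and a toy dictionary stands in for the encyclopedia search.

```python
from typing import Callable, Dict, Iterable, Optional, Set


def normalise(phrase: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalisation step)."""
    return " ".join(phrase.lower().split())


def build_concepts(
    toc_labels: Iterable[str],
    lexicon: Set[str],
    fetch_text: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Keep TOC labels found in the lexicon and attach a pseudo-document."""
    lexicon_norm = {normalise(p) for p in lexicon}
    concepts: Dict[str, str] = {}
    for label in toc_labels:
        key = normalise(label)
        if key not in lexicon_norm or key in concepts:
            continue  # not verified by the lexicon, or already handled
        text = fetch_text(key)  # hypothetical encyclopedia lookup
        if text:
            concepts[key] = text  # concept = label + pseudo-document
    return concepts


# Toy stand-in for an encyclopedia search (illustrative text only).
toy_encyclopedia = {
    "decision tree": "A decision tree is a tree-shaped model of decisions ...",
    "clustering": "Clustering groups similar objects together ...",
}

concepts = build_concepts(
    toc_labels=["Decision  Tree", "Preface", "Clustering", "Exercises"],
    lexicon={"decision tree", "clustering", "neural network"},
    fetch_text=toy_encyclopedia.get,
)
print(sorted(concepts))
```

In this sketch, TOC entries such as "Preface" and "Exercises" are dropped because they do not appear in the lexicon, which mirrors the verification role the lexicon plays in the process.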
The approach is demonstrated with learning materials from Machine Learning and Data Mining. The teaching materials used are e-Books, and the Tables-of-Contents (TOCs) of the e-Books are used as a structured knowledge source. A summary of the e-Books used is shown in Table 3.1. The first column contains the title of each e-Book and the surnames of its authors, while the second column contains the number of Google Scholar citations for each e-Book at the time this research was done. Two Google Scholar queries, “Introduction to data mining textbook” and “Introduction to machine learning textbook”, guided the selection process, and the 20 e-Books that met all three of the following criteria were chosen. First, the book should be about the domain. Second, there should be Google Scholar citations for the book. Finally, the book should be accessible on the Web.
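The three selection criteria can be expressed as a simple filter. The sketch below is illustrative only: the `Candidate` record and its field names are assumptions, "there should be citations" is interpreted as a citation count greater than zero, and the two rejected candidates are invented to show the filter at work.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    title: str
    about_domain: bool    # criterion 1: the book is about the domain
    citations: int        # criterion 2: Google Scholar citation count
    web_accessible: bool  # criterion 3: the book is accessible on the Web


def meets_criteria(book: Candidate) -> bool:
    """A book is kept only if all three criteria hold."""
    return book.about_domain and book.citations > 0 and book.web_accessible


def select_books(candidates: List[Candidate]) -> List[Candidate]:
    return [b for b in candidates if meets_criteria(b)]


candidates = [
    Candidate("Machine learning; Mitchell", True, 264, True),  # row from Table 3.1
    Candidate("Hypothetical uncited text", True, 0, True),     # invented example
    Candidate("Hypothetical offline text", True, 12, False),   # invented example
]
selected = select_books(candidates)
print([b.title for b in selected])
```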
Table 3.1: Summary of e-Books used
Book Title & Author Cites
Machine learning; Mitchell 264
Introduction to machine learning; Alpaydin 2621
Machine learning a probabilistic perspective; Murphy 1059
Introduction to machine learning; Kodratoff 159
Gaussian processes for machine learning; Rasmussen & Williams 5365
Introduction to machine learning; Smola & Vishwanathan 38
Machine learning, neural and statistical classification; Michie, Spiegelhalter, & Taylor 2899
Introduction to machine learning; Nilsson 155
A First Encounter with Machine Learning; Welling 7
Bayesian reasoning and machine learning; Barber 271
Foundations of machine learning; Mohri, Rostamizadeh, & Talwalkar 197
Data mining-practical machine learning tools and techniques; Witten & Frank 27098
Data mining concepts models and techniques; Gorunescu 244
Web data mining; Liu 1596
An introduction to data mining; Larose 1371
Data mining concepts and techniques; Han & Kamber 22856
Introduction to data mining; Tan, Steinbach, & Kumar 6887
Principles of data mining; Bramer 402
Introduction to data mining for the life sciences; Sullivan 15
Data mining concepts methods and applications; Yin, Kaku, Tang, & Zhu 23
Wikipedia is used to create a domain lexicon because it contains articles for many domains (Völkel, Krötzsch, Vrandecic, Haller & Studer 2006) and contributions from many people (Yang & Lai 2010), which provides the coverage needed for the lexicon. Yang & Lai (2010) set out to find what motivated users to contribute freely to Wikipedia to create such a large knowledge base. They found that one of the motivating factors was the sense of achievement users had when making a contribution to the knowledge base. Their study confirms a similar conclusion by Nov (2007) that most users are pleased to share what they know with others. These findings help to explain the growth of this encyclopedia source.
Gabrilovich & Markovitch (2007) use Wikipedia concepts to provide meaning for natural language texts because of the large number of concepts available. Zheng, Li, Huang & Zhu (2010) also exploit Wikipedia as a knowledge base, linking entities found in unstructured text to Wikipedia articles in order to provide descriptive information for the entities. Similarly, in this thesis, Wikipedia is exploited to build background knowledge that can be used for representing unstructured text from learning materials to enable the recommendation of relevant documents. The lexicon is generated from all the available Wikipedia sources for the Machine Learning and Data Mining domain. Two sources are available for this domain, and both are used. First, the phrases in the contents and overview sections of the chosen domain are extracted to form a topic list. Second, a list containing the titles of articles related to the domain is added to the topic list to assemble the lexicon. Overall, the domain lexicon consists of a set of 664 Wiki-phrases.
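The two-step assembly of the lexicon can be sketched as a union of two phrase lists. This is a minimal sketch under stated assumptions: the function name, the lowercasing, and the removal of duplicates are not described in the text above, and the example phrases are placeholders rather than entries taken from the actual lexicon.

```python
from typing import Iterable, List


def assemble_lexicon(
    topic_phrases: Iterable[str],
    article_titles: Iterable[str],
) -> List[str]:
    """Merge the topic list and the article titles into one lexicon.

    Phrases are lowercased and whitespace-normalised, and duplicates are
    dropped while preserving first-seen order (both are assumptions).
    """
    seen = set()
    lexicon: List[str] = []
    for phrase in list(topic_phrases) + list(article_titles):
        key = " ".join(phrase.lower().split())
        if key and key not in seen:
            seen.add(key)
            lexicon.append(key)
    return lexicon


# Placeholder inputs standing in for the two Wikipedia sources.
topic_list = ["Supervised learning", "Cluster analysis", "Decision tree"]
titles = ["Decision tree", "Association rule learning", "  "]

lexicon = assemble_lexicon(topic_list, titles)
print(len(lexicon), lexicon)
```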
Using Wikipedia as a knowledge source raises the question of the provenance of some of its contributors. However, in this research the TOCs of e-Books are used as the starting point for identifying concepts for the background knowledge. Because the e-Books have known authors, the concepts themselves come from attributable sources, and Wikipedia only provides descriptions of concepts already identified from the e-Books. In addition, the articles on Wikipedia are open to review, as other contributors can edit content that is not consistent. Further, any disputed article entries can be settled through a discussion page associated with each entry. The study by Giles (2005) suggests that the editing feature helps to improve the quality of articles in the encyclopedia. These findings give us confidence in the sources chosen for creating our background knowledge.