A multi-agent collaborative personalized web mining system model.


A MULTI-AGENT COLLABORATIVE PERSONALIZED WEB MINING SYSTEM MODEL
by
OCKMER LOUREN OOSTHUIZEN
DISSERTATION
submitted in the fulfilment of the requirements for the degree MASTER OF SCIENCE in COMPUTER SCIENCE in the FACULTY OF SCIENCE at the RAND AFRIKAANS UNIVERSITY
SUPERVISOR: PROF. E.M. EHLERS
JUNE 2004

Abstract

Keywords: Agent Based Systems, Web Mining, Knowledge Mining

The Internet and world wide web (WWW) have, in recent years, grown exponentially in size and in the volume of information available on them. To deal effectively with the huge amount of information on the web, so-called web search engines have been developed to retrieve useful and relevant information for their users. Unfortunately, these web search engines have not kept pace with the rapid growth and commercialization of the web.

The main goal of this dissertation is the development of a model for a collaborative personalized meta-search agent (COPEMSA) system for the WWW. This model will enable the personalization of web search for users. Furthermore, the model aims to leverage current search engines on the web and to enable collaboration between users of the search system so that they can share useful resources. The model also employs multiple intelligent agents and web content mining techniques. This enables the model to autonomously retrieve useful information for its user(s) and to present this information in an effective manner.

To achieve these goals, COPEMSA consists of five core components: a user agent, a query agent, a community agent, a content mining agent and a directed web spider. The user agent learns about the user in order to introduce personal preference into user queries. The query agent is a scaled-down meta-search engine with the task of submitting the personalized queries it receives from the user agent to multiple search services on the WWW. The community agent enables the search system to communicate with, and leverage the search experiences of, a community of searchers. The content mining agent is responsible for the analysis of the results retrieved from the WWW and the presentation of these results to the system user. Finally, a directed web spider is used by the content mining agent to retrieve from the WWW the actual web pages it analyzes.

This dissertation also presents an additional model that deals with a specific problem all web spidering software must handle, namely content and link encapsulation.

Opsomming (Summary, translated from Afrikaans)

Keywords: Agent Based Systems, Web Mining, Knowledge Mining

The Internet and world wide web (WWW) have grown exponentially over the past few years in terms of size as well as volume of information. To deal effectively with the large amount of information on the web, so-called web search engines have been developed with the aim of locating relevant and useful information for their users. Unfortunately, these web search engines have not kept pace with the phenomenal growth and commercialization of the web.

The main goal of this dissertation is the development of a model for a collaborative personalized meta-search agent (COPEMSA) system for the WWW. This model will make the personalization of web searches possible for users. Furthermore, the model strives to build on current web search engines and to enable collaboration between users of the search system for the purpose of exchanging helpful resources. The model also makes use of multiple intelligent agents and web content mining techniques. This enables the model to locate useful information for its user(s) automatically and to present this information in an effective manner.

To achieve the above goals, the COPEMSA model uses multiple intelligent agents. COPEMSA consists of five core components: a user agent, a query agent, a community agent, a content mining agent and a directed web spider. The user agent learns about the user in order to introduce personal preference into user queries. The query agent is a scaled-down meta-search engine with the aim of sending the personalized queries it receives from the user agent to various search engines on the WWW. The community agent enables the search system to benefit from the search experiences of a community of users. The content mining agent is responsible for the analysis of results received from the WWW and the presentation of these results to the end user. Finally, the directed web spider is used by the content mining agent to download from the WWW the respective pages it analyzes.

In this dissertation an additional model is also proposed to solve a specific problem faced by all web spidering software, namely content and link encapsulation.

Contents

1 Introduction
  1.1 Information vs. knowledge
  1.2 Information retrieval on the world wide web (WWW)
  1.3 Intelligent agents and web mining
  1.4 Design of a model for a collaborative personalized meta-search agent system for the WWW

2 The concept of knowledge
  2.1 On the philosophy of knowledge
  2.2 Knowledge defined
    2.2.1 Data
    2.2.2 Information
    2.2.3 Knowledge
    2.2.4 Wisdom
    2.2.5 Understanding
  2.3 Conclusion

3 Information representation and storage
  3.1 Introduction
  3.2 Data
    3.2.1 Data representation
    3.2.2 Data storage
  3.3 Information
    3.3.1 Information representation
    3.3.2 Information storage
  3.4 Conclusion

4 Knowledge representation and storage
  4.1 Knowledge representation background
    4.1.1 The knowledge representation problem
  4.2 Knowledge representation strategies
    4.2.1 Knowledge representation using logic
    4.2.2 Knowledge representation using networks
    4.2.3 Frames
    4.2.4 Rule based systems
    4.2.5 Ontologies
    4.2.6 Topic maps
  4.3 Knowledge storage
  4.4 Conclusion

5 The world wide web: content, structure and information retrieval
  5.1 The world wide web
    5.1.1 Content on the WWW
    5.1.2 Structure of the WWW
    5.1.3 Protocols on the WWW
    5.1.4 Information on the web
  5.2 Searching the WWW
    5.2.1 Information retrieval on the web
    5.2.2 Web search engines
    5.2.3 Challenges for web search engines
  5.3 Conclusion

6 Web mining agents
  6.1 Intelligent agents
    6.1.1 A brief history
    6.1.2 What is an agent?
    6.1.3 Design considerations in agent systems
    6.1.4 Types of agents
    6.1.5 Agent application domains
  6.2 Web mining
    6.2.1 What is web mining?
    6.2.2 Web structure mining
    6.2.3 Web usage mining
    6.2.4 Web content mining
    6.2.5 Web mining components
  6.3 Web mining agents
    6.3.1 SoftBot
    6.3.2 Letizia
    6.3.3 WebWatcher
    6.3.4 Syskill and Webert
    6.3.5 PAINT
    6.3.6 WebAce
  6.4 Conclusion

7 Design of a prototype collaborative personalized meta-search agent (COPEMSA) system model for the world wide web
  7.1 Agent-based personalized autonomous web mining
    7.1.1 Personalization through user modelling
    7.1.2 Communities of interest
    7.1.3 The agent paradigm
    7.1.4 Web content mining
  7.2 Meta-searching the web
    7.2.1 What is a meta-search engine?
    7.2.2 Improving web mining agent coverage through meta-searching
    7.2.3 Collaborative ranking of multiple search engine results
  7.3 Collaborative personalized meta-search agent system
    7.3.1 COPEMSA system design goals
    7.3.2 Benefits and limitations of a client-side based approach
  7.4 The COPEMSA system model
    7.4.1 Collaborative personalized meta-search agent (COPEMSA) system model
    7.4.2 User agent
    7.4.3 Collaborative meta-search unit
    7.4.4 Web content mining unit
  7.5 Conclusion

8 The COPEMSA user agent
  8.1 The role of the user agent in the COPEMSA model
  8.2 User profiling
    8.2.1 User profiling strategy
    8.2.2 User interface
    8.2.3 Search sessions
    8.2.4 Query augmentation
    8.2.5 Initial results from query agent
  8.3 User profile database
    8.3.1 ODP tree structure
    8.3.2 Individual ODP tree
    8.3.3 User profile
  8.4 User feedback
    8.4.1 Implicit user feedback
    8.4.2 Explicit user feedback
  8.5 Conclusion

9 The COPEMSA collaborative meta-search unit
  9.1 Query agent
    9.1.1 The role of the query agent in the COPEMSA model
    9.1.2 Search engine selection
    9.1.3 Query composition, restructuring and modification
    9.1.4 Results scoring and fusion process
    9.1.5 Post-retrieval processing and re-ranking of results
    9.1.6 Presentation of results
    9.1.7 Topics-links database
  9.2 Community agent
    9.2.1 The role of the community agent in the COPEMSA model
    9.2.2 Server-based architecture
    9.2.3 Community ODP-tree access and submission
    9.2.4 Specialist search engine lookup
    9.2.5 Community URL rating service
    9.2.6 Community profile database
  9.3 Conclusion

10 The COPEMSA web content mining unit
  10.1 The results analysis agent and directed web spider in the COPEMSA model
  10.2 Results analysis agent
    10.2.1 Document representation and feature extraction
    10.2.2 Neighbouring pages
    10.2.3 Document classification
    10.2.4 Topics-links database
    10.2.5 Results returned to the user agent
  10.3 Directed web spider
    10.3.1 Directed web spider concepts
  10.4 The information burial and URL encapsulation problems
  10.5 Conclusion

11 Conclusions and further research
  11.1 The next generation internet (NGI) and Internet2
  11.2 Further research
    11.2.1 Personalization through observation
    11.2.2 Meta-searching improvements
    11.2.3 Further exploitation of search community collaboration
    11.2.4 Visualization of results

List of Figures

2.1 Views of “information”, adapted from [3]
2.2 Transitions between views of information [5]
3.1 Von Neumann architecture [6]
4.1 Inheritance-style semantic network
4.2 Expanded inheritance-style semantic network
4.3 Part of figure 4.2 as a frame system
4.4 Ontology life cycle [22]
4.5 An example of a simple conceptual graph, adapted from [29]
5.1 Relationship between hubs and authorities [37]
5.2 Adapted classic model for information retrieval [41]
5.3 Adapted classical model for information retrieval augmented for the web [39]
6.1 IBM intelligent agent graph [49, 50]
6.2 Adapted taxonomy of web mining techniques [52, 54]
6.3 Web mining subtasks, adapted from [57, 58]
7.1 Meta-search engine process, adapted from [81, 82, 83]
7.2 Collaborative personalized meta-search agent (COPEMSA) system model
8.1 Role of the user agent in the COPEMSA system
8.2 Example of the ODP-tree approach
8.3 Internal query representation using XML
8.4 ODP tree node to XML query mapping
9.1 Role of the query agent in the COPEMSA model
9.2 Query restructuring
9.3 Role of the community agent in the COPEMSA system
10.1 The results analysis agent and the directed web spider in the COPEMSA system model
10.2 Document representation with a frequency-word vector [91]
10.3 The k-Means algorithm [93]
10.4 Website structure annotation model using XML
10.5 Example XML DTD for site structure annotation
10.6 Example usage of site structure annotation

List of Tables

8.1 User query syntax elements
8.2 User profile structure
9.1 Link-query lookup table
9.2 Context-common words listing table

Chapter 1

Introduction

The Internet has revolutionized the way in which we collect, view and classify information. Its speed and ease of use have made it a part of everyday life and work for most people. Unfortunately, much of the information on the Internet is highly unstructured and does not have any sort of predefined schema, type or pattern. The presentation of information has also become geared more towards visual aesthetics and user friendliness, which makes it more pleasing to human readers but at the same time makes it hard for users to locate information on the Internet.

This unstructured and unorganized nature of the Internet, as well as the sheer vastness of the information available on it, has made it quite difficult for users to identify and extract “useful” or specific information on various topics from this large information base and to view it in a structured way. As a result, users spend large amounts of time manually searching for information on the web and interpreting the results of their search efforts.

An obvious question to ask is whether this process of resource discovery and structuring of search results can be automated in some way in order to aid users in finding and interpreting information on the World Wide Web, thereby saving users valuable time that can be used more productively on other endeavours.

The problem of automatically finding and interpreting information for a specific user is addressed in this dissertation. More specifically, the main focus is the development of a model for the automatic discovery of resources, and the classification thereof, for a specific user. To gain better insight into the unique problem of designing a system that can assist users with the retrieval of information and its presentation in a structured way, a few key concepts must be taken into account. These concepts are briefly discussed below.

1.1 Information vs. knowledge

At the very core of any type of retrieval system lies the question of how data, information and knowledge are defined and how they can be represented in a form understandable by machines. Only then can an understanding be reached about what type of information is stored on the Internet and the ways in which to retrieve and manage it.

Chapter 2 considers the concept of knowledge. The philosophies behind the concept are briefly introduced, followed by a discussion of the various forms of information (i.e. data, information, knowledge and wisdom) and the differences between them. Chapter 3 looks at how data and information are represented and encoded in a machine-readable form. Chapter 4 discusses the principles behind knowledge representation and some strategies for the capturing, representation and storage of knowledge structures.

1.2 Information retrieval on the world wide web (WWW)

Once an understanding of the forms of information and their representation is reached, attention can turn to the contents of the world wide web and how information is searched for and retrieved on it.

Chapter 5 briefly introduces the different types of content on the WWW, its link structure and the protocols available on the web for document retrieval. A discussion of the form of information found on the Internet is also given. The focus then shifts to how the web is searched for information. A background to the theory of information retrieval (IR) on the web is given, and so-called web search engines are discussed as the primary method of IR on the web today. The chapter concludes with some challenges facing modern web search engines.

1.3 Intelligent agents and web mining

With the contents of the WWW described, as well as the principles of searching and information retrieval on the WWW, the fields of software agents and web mining can be considered as a means of addressing some of the challenges search engines face.

Chapter 6 starts with a background to intelligent agents, defines what intelligent agents are, examines the design considerations in agent-based systems and discusses different types of agents and their application in various domains.

Web mining can be described as the process of extracting knowledge from the WWW by analyzing the properties of the WWW itself. These properties can include the content of the web, the link structure between pages and even the usage patterns generated by people using the web in their daily lives. The chapter briefly discusses these three classes of web mining (content, structure and usage mining) and presents an overview of the web mining process. The chapter concludes with a discussion of web mining agents, which combine the fields of intelligent agents and web mining.

1.4 Design of a model for a collaborative personalized meta-search agent system for the WWW

With the background of knowledge representation, information retrieval on the web, intelligent agents and web mining in place, the design of a personalized autonomous web mining system can be considered.

Chapter 7 introduces the key concepts used in the design of the model presented in that chapter. Users are unique regarding their habits and interests. With this in mind, the chapter discusses personalization of the search experience through user modelling as well as the idea of a community of searchers collaborating in order to achieve a common search goal. The chapter also discusses the agent paradigm and its applicability to the problem of personalized web mining. Not all web search engines index the same web pages and, as a result, the idea of meta-searching the web in order to improve coverage is also discussed in the chapter. Finally, the design goals of the collaborative personalized meta-search agent (COPEMSA) system are listed, the proposed model is given and its components are briefly discussed.

Chapter 8 introduces the user profiling strategy used in the COPEMSA model. Chapter 9 discusses how the web can be meta-searched to automatically discover resources that match a certain user query, the issues involved in such a system, and the collaboration infrastructure proposed for the model. Finally, Chapter 10 discusses how results found through the meta-search process are retrieved, represented and structured using a technique from web content mining called clustering. The chapter concludes with the presentation of an additional model proposed for solving two specific problems content mining systems may encounter when analyzing results from the WWW.
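The coverage argument for meta-searching made in section 1.4 can be illustrated with a minimal sketch. It assumes two hypothetical search services, represented here by stub functions that return made-up URLs, and merges their ranked results while dropping duplicates. The rank-by-rank interleaving is only one simple merging policy chosen for illustration; it is not the results scoring and fusion process of the COPEMSA model, which is the subject of Chapter 9.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for real search services: each "engine" maps a query
# string to a ranked list of URLs. The URLs are invented for this example.
def engine_a(query: str) -> List[str]:
    return ["http://example.org/a", "http://example.org/b", "http://example.org/c"]

def engine_b(query: str) -> List[str]:
    return ["http://example.org/b", "http://example.org/d"]

def meta_search(query: str, engines: Dict[str, Callable[[str], List[str]]]) -> List[str]:
    """Send the query to every engine and merge the results without duplicates.

    Results are interleaved by rank (all rank-1 hits first, then rank-2 hits,
    and so on). This is an assumed policy, not the dissertation's method.
    """
    result_lists = [search(query) for search in engines.values()]
    merged: List[str] = []
    seen = set()
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

merged = meta_search("web mining", {"A": engine_a, "B": engine_b})
```

Because the two stub engines index different pages, the merged list contains more distinct URLs than either engine returns alone, which is the coverage benefit the chapter refers to.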

(17) Chapter 2 The concept of knowledge. “Without sensibility no object would be given to us, without understanding no object would be thought. Thoughts without content are empty, intuitions without concepts are blind.” -Immanuel Kant (“Critique of Pure Reason”). 2.1 On the philosophy of knowledge The quest for knowledge can be described as one of the single most important of human endeavours. Ancient Greek philosophers such as Plato and Aristotle where followed by Kant and Hegel and, in the 20th century, even more strived to answer one of the oldest questions posed to thinkers over the ages. What is Knowledge ? Recent technological advances, especially ones in Information Technology, have renewed interest in this question. Stenmark states that if this is the case, plausible questions to ask include: How does knowledge relate to technology ? Can technology be used to process knowledge at all, and if so, what types of knowledge ? What types of knowledge are there [1]? From the above questions, it is clear that a concise definition of what is considered to be knowl5.

(18) edge is desperately needed. This definition has however eluded philosophers and systems theorists, as different views of what exactly knowledge is exist. In the following sections, an attempt will be made to offer a definition of this concept and the underlying principles for the purposes of this discussion.. 2.2 Knowledge defined The concept of knowledge is a difficult one to understand, let alone define. From the literature, it is clear however that knowledge cannot be described as a singular concept. Terms like data and information have been used in the context of knowledge, but their suitability for use still remains in debate. According to Stenmark, many researchers use the terms data, information and knowledge very casually. In particular, knowledge and information are often used interchangeably, even though they are far from identical [1]. According to renowned systems theorist, Russel Ackoff, the content of the human mind can be classified into five distinct categories [2]: • Data. • Information. • Knowledge. • Wisdom. • Understanding. These terms, and the relationships between them will be discussed in the following paragraphs.. 2.2.1 Data Even though there are many definitions of data, all of them have the following element in common : Data is the most basic building block of the thought process. It is generally described as 6.

(19) “raw” or more clearly, not yet interpreted symbols or facts. It can exist in any form, usable or not. It has absolutely no meaning beyond its own existence, in other words, it has no meaning on itself. This implies that for data to have any useful meaning, it has to be transformed by some sort of process or action to something more useful [1, 3]. As an example, consider a basic mathematical system consisting of numbers(0...9) and operators(+, −, ×, ÷, =). If these two classifications were to be considered as only numbers and symbols on their own, they would be utterly useless. Most of us have some sort of prior knowledge as to how they are used, but for someone who has never had any experience with these symbols or their application they would simply be considered as noise, that is having no meaning on their own.. 2.2.2 Information Information is data that has been given meaning by means of relational or contextual connection or association. In other words, “raw” data is transformed into something new by some sort of process, and has been given meaning by this process . In the case of the example mentioned above, if we were to state that numbers and operators were to be used in the following order: <number><operator><number>, we would be giving meaning to the raw numbers and operators by means of specifying the context in which they relate or connect to each other [1, 3].. 2.2.3 Knowledge At present, many different views of knowledge exist. McQueen discusses four popular views of knowledge in his article, stating that [4]:. • Knowledge can be viewed as access to information. • Knowledge can be stored in repositories of information. 7.

(20) • Knowledge is sets of rules. • Knowledge is “knowing”,“understanding”.. The above views differ fundamentally from each other. The first view defines that knowledge must already be in an explicit form and perceives that it can be extracted through access to documents and databases containing information and data. In the second view, knowledge is perceived as understanding in a given area of expertise. It also recognizes that the capture of dialogue among experts may be an important technique in the discovery of new knowledge. Thus, according to this view, experts are perceived to be the holders of knowledge, and their dialogues are stored in archives of messages organized by contextual area to aid in the discovery of new knowledge or use by persons less knowledgeable about the given context. The third view, can be summarized by stating that knowledge consists of sets of rules. In the fourth and final view knowledge is perceived as something that only occurs in humans, and thus not possible to mechanize. The view of particular interest for the purpose of this study, is the view that usable explicit knowledge can be best represented as an appropriate collection of contextual information connected, via relevant sets of rules, to information in different contexts.. 2.2.4 Wisdom Bellinger, Castro and Mills defines wisdom as an extrapolative, non-deterministic, non-probabilistic process [5]. It’s derivation is reliant upon many factors, including morals, ethical and social codes and previous experience. Wisdom is sometimes linked to a specific society or individual and is synthesized from previous knowledge, beliefs and/or experience. Something that is common to all views or classifications of wisdom, is that it is always linked to a human or group of humans. It can be deduced that wisdom is therefore something that is inherently human, and the process 8.

by which it is attained is defined by the human experience itself. This obviously makes it an extremely difficult process to replicate with a computer, a feat which would require the construction of machines that can think and make decisions as well as or better than any human can. Pressman summarizes the terms discussed above as follows [3]:

• Data. A collection of facts that must be processed to be useful.
• Information. Derived from the association of facts within a given context.
• Knowledge. Associates information obtained in one context with other information obtained in a different context.
• Wisdom. Occurs when generalized principles are derived from unrelated knowledge.

The above concepts are illustrated in figure 2.1. With the differences between these concepts now defined, the next logical questions are how data becomes information, how information becomes knowledge, and whether transitions between the different forms of information are possible at all. The answers to these questions are discussed next.

[Figure 2.1: Views of “information”, adapted from [3]. Data: no associativity. Information: associativity within one context. Knowledge: associativity within multiple contexts. Wisdom: creation of generalised principles based on existing knowledge from different sources.]
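The step from data to information can be made concrete with the <number><operator><number> example of section 2.2.2. The following is a minimal sketch, not part of the original study: the function name `interpret` and the `OPERATORS` table are hypothetical, and ASCII operator symbols stand in for ×, ÷ and −.

```python
import operator

# Hypothetical lookup table: which operation each raw symbol stands for.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def interpret(tokens):
    """Give meaning to raw tokens by reading them in the context
    <number><operator><number>."""
    if len(tokens) != 3:
        raise ValueError("expected exactly three tokens")
    left, op, right = tokens
    if op not in OPERATORS:
        raise ValueError(f"unknown operator: {op!r}")
    return OPERATORS[op](float(left), float(right))


raw_data = ["3", "+", "4"]          # data: symbols with no meaning on their own
information = interpret(raw_data)   # information: the same symbols read in context
```

The list `raw_data` is only a sequence of symbols until `interpret` supplies the context that relates them to each other.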

2.2.5 Understanding

Understanding can be described as an interpolative, probabilistic process. It must be stressed that understanding is not a separate form of information, but rather a process that translates one form of information into another. Understanding draws upon previous information, knowledge and understanding to produce new information [1, 5].

Stenmark assumes in his paper that the relationship between the different forms of information is asymmetrical, suggesting that data may be transformed into information, which in turn may be transformed into knowledge, and so on. He also states that it does not seem possible to reverse the process. If, however, the view is taken that understanding is a process that instigates and facilitates the change, it is only logical to deduce that every process has certain actions or operations that it performs. These actions or operations could be reversed to reproduce the original state [1]. The asymmetrical nature of the defined relationship also suggests that certain forms of information are more valuable than others. This, of course, is not always the case. It is sometimes desirable to use knowledge to derive information, and to create data out of this information. It may also be the case that knowledge is deconstructed, refined, and then reconstructed into a purer form of knowledge. This kind of refining also becomes a theoretical possibility if the view of this chapter is taken [3, 5].

Artificial intelligence systems are especially inclined to possess understanding, in the sense that they are able to synthesize new knowledge from previously stored data, information and knowledge. This makes them ideal for applications whose primary function is information state translation. Such systems might even be able to assist humans in the acquisition of wisdom, through the understanding of knowledge.
Figure 2.2 summarizes the entire information cycle, as defined in this chapter. It illustrates the relationships between the forms of information and their connectedness.

[Figure 2.2: Transitions between views of information [5]. Axes: connectedness and understanding. Data to information: understanding association rules. Information to knowledge: understanding contextual pattern rules. Knowledge to wisdom: understanding the rules for the derivation of generalized principles.]

2.3 Conclusion

In this chapter, an attempt was made to clear the fuzzy boundaries between the definitions of the various terms used to describe information. Data was separated from information, information from knowledge and knowledge from wisdom. The concept of understanding was also introduced to explain how the actual transition between the forms of information takes place. The value of each form is truly in the eye of the beholder.

Chapter 3

Information representation and storage

“I do not fear computers. I fear lack of them.” -Isaac Asimov

3.1 Introduction

The concepts and definitions discussed in the previous chapter provide a brief overview of the forms of information relevant to this study. This knowledge is, however, not very useful if it cannot be practically implemented. With this in mind, it is now appropriate to discuss how the concepts of information and knowledge are represented, implemented and ultimately stored within an electronic storage and/or retrieval system. In this chapter, the representation and storage of two of the four forms of information, namely data and information, are discussed.

3.2 Data

As defined in the previous chapter, data is something that has no meaning in itself. It is the statement of some fact or collection of facts that must be processed or changed in some way to be of any practical use. This leads to the question of what the smallest piece of information is that an electronic system can process. What can be considered “data” in the computing context?

3.2.1 Data representation

The key to understanding how data is represented and stored in a computer system is a thorough understanding of the principles and architecture modern computers are built upon. A computer is defined as a combination of memory, processor and I/O subsystems [6]. These three components can be viewed as a set of nested state machines. The basic working of this machine can be summarized by stating that the memory holds instructions and data, and that these are modified by the processor logic to transform a certain set of inputs into a desired set of outputs.

The architecture described above is known as the von Neumann model of computer architecture and is considered the standard architecture for modern computers. It is represented graphically in figure 3.1. The main feature of this architecture, for the purposes of this discussion, is that memory holds both program instructions and the data to be operated upon. This property is known as the stored program concept and allows programs and data to be changed easily. It also implies that instructions are themselves considered data, because instructions can be operated on as data, which makes self-modifying code possible [6]. Having defined what data means in the computing context, an explanation of how it is represented is still needed.

Binary representation. Computers perform all their operations using the binary number system. All instructions and data are stored and manipulated in binary format [7]. Each binary

digit is known as a bit and can assume one of two values, 0 or 1.

[Figure 3.1: Von Neumann architecture [6]. Labels: Memory, Instruction Processing Unit, Control, Arithmetic Unit, Addresses, Data.]

Bit groupings. A single bit can only differentiate between two distinct values. This is unfortunately not very efficient for representing human-usable data. A practical solution is to use groups of bits, rather than single bits, to represent data. The number of distinct combinations that can be represented by a grouping of $n$ bits is $2^n$.

Alphanumeric data. Much of the user data supplied to the computer, and the data expected from the computer after processing, will be in some human-usable form. This typically implies that the data will be supplied in the form of alphabetic characters, numbers and punctuation. Data represented in this form is known as alphanumeric data [7]. To accurately represent the vast variety of human data, it becomes necessary to use bit groupings for the purposes of representation. The question arises, however, of how, given all the possible groupings and combinations, these different formats will be understood by every computer manufactured.
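As a rough illustration of bit groupings, the sketch below counts the distinct values an n-bit group can hold and shows the 8-bit groups that represent a short piece of alphanumeric data. The function names are hypothetical, and the choice of ASCII (one of the standard codes the text goes on to introduce) with 8-bit groups is an assumption made for the example.

```python
def distinct_values(n_bits):
    """Number of distinct combinations an n-bit grouping can represent (2^n)."""
    return 2 ** n_bits


def to_bit_groups(text):
    """Return one 8-bit group per character, assuming ASCII encoding."""
    return [format(byte, "08b") for byte in text.encode("ascii")]


groups = to_bit_groups("Hi")   # 'H' is 72 and 'i' is 105 in ASCII
```

A single bit gives `distinct_values(1) == 2` combinations, while one 8-bit group gives 256, enough for the letters, digits and punctuation of alphanumeric data.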

Standard formats. The problem discussed above is addressed by the creation of standards for data exchange. Three major standards or codes for data representation exist: the American Standard Code for Information Interchange (ASCII), Unicode, and the Extended Binary Coded Decimal Interchange Code (EBCDIC). More information about these codes can be found in [7] and [8]. The main feature of all three codes is that they provide a translation between a pre-defined bit pattern in that specific code and an alphanumeric character. This enables consistent representation across various platforms.

Multimedia data. In recent years, that which has come to be considered “data” is much more than just alphanumeric characters. The multimedia revolution of the 1990s enabled computers to represent and process audio and video data as well. The principles behind the representation of audio and visual data are briefly discussed in the next few paragraphs.

Audio data. The key to understanding how audio data is represented by a computer lies in understanding the physics of sound. A sound wave in its original form is analog in nature. Thus, it must be converted to a binary form to be used by a computer. This conversion can be achieved by sampling the analog wave at regular intervals. Each time a sample is taken, the amplitude of the sample is measured by an electronic circuit, and this circuit then converts the analog value to a binary equivalent. This technique is called Pulse Amplitude Modulation [9]. Another possible analog-to-digital conversion technique is Pulse Code Modulation. These techniques are discussed in the book by Shay [9]. In addition to storing the actual waveform of the audio data, it is also necessary to store supplementary data about the waveform, such as its maximum amplitude, sampling rate and number of samples [7].
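The sampling idea described above can be sketched in a few lines. This is an illustration under stated assumptions rather than an account of any particular audio format: a pure sine tone stands in for the analog wave, each sample is quantized to an 8-bit integer code, and the function name and dictionary keys are hypothetical.

```python
import math


def sample_wave(frequency_hz, sample_rate_hz, n_samples, bits=8):
    """Sample a sine wave at regular intervals and convert each measured
    amplitude to an integer code that fits in the given number of bits."""
    levels = 2 ** bits
    samples = []
    for k in range(n_samples):
        t = k / sample_rate_hz                                  # time of the k-th sample
        amplitude = math.sin(2 * math.pi * frequency_hz * t)    # analog value in [-1, 1]
        code = round((amplitude + 1) / 2 * (levels - 1))        # map onto 0 .. levels - 1
        samples.append(code)
    # Supplementary data is stored alongside the waveform itself.
    return {
        "sample_rate_hz": sample_rate_hz,
        "n_samples": n_samples,
        "bits_per_sample": bits,
        "samples": samples,
    }


recording = sample_wave(frequency_hz=1, sample_rate_hz=4, n_samples=4)
```

Raising `sample_rate_hz` or `bits` gives a closer approximation of the original wave at the cost of more stored data.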
Image and video data. The representation techniques for image and video data are closely related, as a video sequence consists of a series of images. The representation of images can be classified into two approaches: bit map representation and vector representation.

In bit map representation, an image is represented by a series of binary numbers called pixels. These pixels are arranged in a two-dimensional grid to represent the original image. By using the data stored in this two-dimensional array of binary numbers, a reproduction of the original image is possible [7]. In vector or object representation, the original image is defined in terms of mathematical shapes such as straight lines, Bezier curves and circles. Each of these shapes can be defined by storing only a small number of parameters. These parameters can then be used as input to functions that recreate the shapes the image consisted of, thus recreating the image as a whole.

Video data is usually represented as a series of bit maps. The large number of bit maps needed to recreate a single animation or video sequence makes the binary representation of video data quite large. A solution to this problem is to compress the sequence by storing only keyframes, that is, the images in the sequence that represent the most change. The missing frames are calculated by interpolating between keyframes. This is the basic idea behind formats like the Moving Picture Experts Group (MPEG) format [7].

Standard audio, image and video data formats. Some popular audio formats include Musical Instrument Digital Interface (MIDI), waveform (WAV) and MPEG audio layer 3 (MP3). The MIDI format is primarily used for the coordination of sounds and signals between a computer and connected musical instruments. The WAV format was designed by Microsoft and is intended as a general-purpose format for audio wave data [7]. The MP3 specification forms part of the MPEG audio standards and was developed largely at the Fraunhofer Institute in Germany. It is primarily intended as a compressed representation for wave data, and is used quite extensively to store large amounts of audio data [7].

There is an abundant array of formats available for image data. Some of the more popular include the bitmap (BMP), Graphics Interchange Format (GIF) and Joint Photographic Experts Group (JPEG) formats. The BMP format is a basic format for storing bit maps. The GIF format and the compressed JPEG format are also variations on storing bit-mapped images. Video storage formats are dominated by the audio video interleave (AVI) and MPEG formats. The AVI format stores a sequence of images, whereas MPEG is a compressed format as described above [7].
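The difference between bit map and vector representation can be illustrated with a small sketch. It assumes a monochrome image (one bit per pixel) and a single circle as the vector shape; the function name and the 5 × 5 image size are hypothetical choices for the example.

```python
def rasterize_circle(width, height, cx, cy, radius):
    """Recreate a bit map (a two-dimensional grid of pixels) from the three
    parameters that define a circle in vector form."""
    return [
        [1 if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0
         for x in range(width)]
        for y in range(height)
    ]


# Vector form: three parameters. Bit map form: width * height pixel values.
vector_circle = {"cx": 2, "cy": 2, "radius": 1}
bitmap = rasterize_circle(5, 5, **vector_circle)
```

Here the vector form stores three numbers while the equivalent bit map stores 25 pixel values, which is why vector formats stay compact for images built from simple shapes.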

3.2.2 Data storage

File storage. Data is typically stored as a sequence of binary numbers on some storage device. This sequence of binary numbers is called a file, and the storage device is either the computer's primary memory or secondary memory such as a hard disk drive or an optical disk like a CD-ROM. Files are accessed either sequentially or randomly. With sequential access, the stream of bits is read from the start of the file to the end of the file; that is, the entire file is read in one pass. With random access, it is possible to seek to a specific position in a file and read the data from that point on [7]. Files are the most primitive form of data storage in modern computer systems. As we will see in the next section, they form the basis of storage for higher forms of information.

3.3 Information

Information was defined in chapter 2 as data that has been given meaning by means of relational or contextual connection or association. The concept of information is discussed further below.

3.3.1 Information representation

Database systems. A database can be defined as a collection of data, typically describing the activities of one or more related organizations [10]. In other words, a database can be seen as a grouping of data for a specific goal in a given context. For example, a database for a secondary school might contain entities such as students, teachers and subjects. It may also contain relationships between these entities, such as which students attend which subjects. This is a property of relational databases, and it fits the definition of information given in chapter 2. Thus we can use the term “Information Base” to describe the modern relational database more accurately.
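The secondary school example can be sketched with Python's built-in `sqlite3` module. The table names, column names and sample rows below are hypothetical and serve only to show entities (students, subjects) and a relationship between them (which students attend which subjects).

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE subject (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE attends (
        student_id INTEGER NOT NULL REFERENCES student(id),
        subject_id INTEGER NOT NULL REFERENCES subject(id),
        PRIMARY KEY (student_id, subject_id)
    );
    INSERT INTO student VALUES (1, 'Thandi'), (2, 'Pieter');
    INSERT INTO subject VALUES (1, 'Mathematics'), (2, 'History');
    INSERT INTO attends VALUES (1, 1), (1, 2), (2, 1);
""")


def subjects_of(student_name):
    """Follow the 'attends' relationship from a student to their subjects."""
    rows = conn.execute(
        """SELECT subject.title
           FROM attends
           JOIN student ON student.id = attends.student_id
           JOIN subject ON subject.id = attends.subject_id
           WHERE student.name = ?
           ORDER BY subject.title""",
        (student_name,),
    ).fetchall()
    return [title for (title,) in rows]
```

The rows in `student` and `subject` are data on their own; it is the `attends` relationship that places them in context and turns them into information in the sense of chapter 2.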

The heart of the relational database lies in its data model. A data model is a collection of high-level data description constructs that abstract away low-level storage details. The relational database system is based upon the relational data model, which is briefly discussed in the next paragraph [10].

The relational data model. The main data representation construct in the relational model is called a relation. A relation consists of a relation schema and a relation instance [10]. The schema specifies the relation's name, as well as the name of each field (or attribute) and the domain (or data type) of each field. A domain is referred to in a relation schema by the domain name and has a set of associated values. An instance of a relation is a set of unique tuples (records or rows) such that each tuple has the same number of fields as the relation schema. The relation schema specifies the domain of each field or column in the relation instance. These domain constraints specify a condition that every relation instance must satisfy. The relational model can be summarized formally as follows. Let $R(f_1 : D_1, \ldots, f_n : D_n)$ be a relation schema, and for each $f_i$, $1 \le i \le n$, let $Dom_i$ be the set of values associated with the domain named $D_i$. An instance of $R$ that satisfies the domain constraints in the schema is a set of tuples with $n$ fields such that:

$\{\, \langle f_1 : d_1, \ldots, f_n : d_n \rangle \mid d_1 \in Dom_1, \ldots, d_n \in Dom_n \,\}$ [10]

3.3.2 Information storage

Database management systems. In practice, a database management system (DBMS) is used to implement the model described above. The main functions of a DBMS are [11]:

• Data dictionary management.

• Data storage management.
• Data transformation and presentation.
• Security management.
• Multi-user access control.
• Backup and recovery management.
• Data integrity management.
• Database access languages and application programming interfaces.
• Database communication interfaces.

As can be seen from the above functions, the DBMS manages and physically stores the information supplied to it. The actual physical storage is achieved in terms of binary files.

3.4 Conclusion

In this chapter, the discussion focused on what data and information mean in the computing context and how both are physically stored and managed. The question still remains how the higher forms of information, namely knowledge and wisdom, can be stored and managed. These topics are discussed in the following chapter.

Chapter 4

Knowledge representation and storage

“Knowledge will forever govern ignorance; and a people who mean to be their own governors must arm themselves with the power which knowledge gives.” -James Madison

In previous sections, knowledge was defined as the association of information obtained in one context with other information obtained in different contexts. This association is achieved by defining a set of relevant rules that connect the various pieces of contextual information to each other. With this definition, it becomes possible to explore various knowledge representation and storage strategies. In this chapter, a few approaches to knowledge representation are discussed, including logic, semantic networks, conceptual graphs, frames, rule-based systems, topic maps and ontologies.

4.1 Knowledge representation background

Knowledge representation (KR) can be defined as the area of artificial intelligence that deals with the representation, maintenance and manipulation of knowledge about a certain application domain [12]. Every computer program and virtually all artificial intelligence systems must possess

knowledge about their application domain. With this in mind, it is clear that KR is one of the central subfields of artificial intelligence and that any project with knowledge-based content must choose a scheme for representing that knowledge in an explicit and declarative way. Explicitness of representation means that the knowledge being represented is stored in some sort of knowledge base consisting of a set of formal descriptions of that knowledge which is direct and unambiguous. Declarativeness implies that the meaning of the representation scheme can be specified without reference to how the knowledge is applied procedurally (i.e. there is a form of logical methodology behind it) [12]. Shapiro also notes that any knowledge representation scheme is useless without the ability to reason with the aid of the scheme [13]. Therefore the derivation of new conclusions from the information represented (i.e. reasoning with it) is of critical importance in the field of KR as well.

4.1.1 The knowledge representation problem

The primary reason for representing knowledge is so that a computerized system can come to new conclusions about its environment through manipulation of the representation [14]. In other words, the representation aids the system in solving problems about its environment without any pre-programmed human expertise. With these assumptions in mind, three components of knowledge representation can be identified:

• The knowledge representation language.
• A component of the knowledge representation that can perform inferences from the knowledge base.
• Capturing and incorporation of new knowledge into the system's understanding of its domain.

The representation language could be described as a formal language rooted in logic, in which sentences can be interpreted as propositions about a domain or world. Once a representation is secured, a reasoning mechanism is needed to aid in accessing facts stored explicitly in the knowledge base as well as uncovering facts implied by the knowledge representation language. This aids in the performance of inferences for the user of the intelligent system. Finally, knowledge bases are hardly ever static, and devising methods for the incorporation of new knowledge is of paramount importance for any knowledge-based system. Solving this problem, often referred to as belief revision, enables the system to adapt to a changing environment [14]. In their paper, Duce and Ringland state some issues that arise in knowledge representation and must be considered when choosing a representation scheme [14]:

• Expressive adequacy relates to the suitability of the representation scheme to the application domain. What are the restrictions of the scheme, if any?
• Reasoning efficiency has to do with the usability of the scheme. A scheme that represents all knowledge of interest and allows for sufficient inference does not necessarily guarantee that the inference can be made in an acceptable time frame. There is usually a trade-off between expressiveness and accuracy on the one hand and efficiency on the other.
• Primitives asks what the primitives of the representation scheme are, which of them should be provided in a system, and at what level this should be done.
• Meta-representation deals with the structuring of knowledge in a knowledge base and the representation of this structure in the knowledge base itself.
• Incompleteness has to do with what can be left unsaid about a certain domain and how to perform inferences over incomplete knowledge. How to revise earlier inferences in light of later, more complete, knowledge is also of interest.
• Real-world knowledge deals with attitudes such as beliefs, desires and intentions and how to deal with them.

With these issues in mind, we can now look at accepted representation strategies.

4.2 Knowledge representation strategies

4.2.1 Knowledge representation using logic

As the use of logic and logical languages in artificial intelligence is of paramount importance, any discussion of knowledge representation would not be complete without a discussion of logic. We must, however, first define what is meant by a logic. A logic can be said to contain two parts: a language and a method of reasoning with the language. The language in turn consists of two parts, namely a syntax and a semantics. With this in mind, a logic can be defined as [13, 15]:

• Syntax that specifies legal expressions in the language. This includes the atomic symbols of the language as well as the rules for constructing well-formed, non-atomic expressions of the logic.
• Semantics for the association of language elements with elements of some subject matter. In other words, the specification of the meaning attached to atomic symbols as well as the rules for determining the meanings of non-atomic expressions.
• Inference rules for determining a subset of logical expressions (sentences) and the manipulation thereof.

If we compare the definition of a logic to the components of knowledge representation discussed earlier, it is clear that logic is a very suitable representation strategy. In this subsection, two very important logical languages used in artificial intelligence, namely propositional calculus and predicate calculus, are briefly discussed. It is not the intent of this discussion to cover these two topics in detail, but rather to illustrate their relevance and importance as representation schemes.

Propositional calculus

The language. The elements of the language can be defined as [15, 16]:

• Atoms. Two distinguished atoms T (TRUE) and F (FALSE) and the countably infinite set of strings of characters beginning with a capital letter.
• Connectives: ∨ (OR), ∧ (AND), ⊃ (IMPLIES), ¬ (NOT) and ≡ (LOGICAL EQUIVALENCE).
• Literals. Atoms and atoms with a ¬ in front of them.
• Sentences. Called well-formed formulas (wffs). The syntax of wffs is:
  - Any atom is a wff.
  - If ω1 and ω2 are wffs, so are ω1 ∨ ω2 (disjunction), ω1 ∧ ω2 (conjunction), ω1 ⊃ ω2 (implication) and ¬ω1 (negation).
  - There are no other wffs.

Semantics. The semantics of a logical language associates the elements of the language with the elements of some particular application domain. In propositional logic, we associate atoms with propositions (or sentences) about the domain. This is called interpretation. In an interpretation, the proposition associated with a given atom is called the denotation (or value) of that atom. In propositional logic the denotation of a proposition is either True or False. In his paper, Shapiro labels the above as extensional semantics [13]. He also defines intensional semantics, which depend only on the particular domain being modelled. This is used to form the base atomic propositions of our conceptualization of the domain. The denotation of atomic propositions then follows, which is the allocation of a truth value to each atomic proposition. Each way of doing this forms a situation. The truth values for the situations are then summarized in a truth table. Well-formed formulas (wffs) have semantic properties which can be used to reduce domain knowledge about a certain application domain to a set of situations that abide by these properties [13].
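The idea of assigning truth values to atoms and summarizing the resulting situations in a truth table can be sketched as follows. This is a minimal illustration, not Shapiro's formulation: wffs are represented as ordinary Python functions over an interpretation, and all names are hypothetical.

```python
from itertools import product


def implies(p, q):
    """Material implication: p ⊃ q is false only when p is true and q is false."""
    return (not p) or q


def truth_table(atoms, wff):
    """Evaluate a wff in every situation, i.e. under every possible
    assignment of True/False to the given atoms."""
    table = []
    for values in product([True, False], repeat=len(atoms)):
        interpretation = dict(zip(atoms, values))
        table.append((interpretation, wff(interpretation)))
    return table


def example_wff(i):
    """The wff (P ∧ (P ⊃ Q)) ⊃ Q."""
    return implies(i["P"] and implies(i["P"], i["Q"]), i["Q"])


table = truth_table(["P", "Q"], example_wff)
is_valid = all(value for _, value in table)   # true in every situation?
```

With two atoms there are four situations; a wff that evaluates to True in all of them, as this one does, holds regardless of the domain being modelled.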

Rules of inference. Rules of inference aid in drawing conclusions. Each rule consists of two parts: a set of conditions and a conclusion that is guaranteed given the conditions. After defining a set of propositions about the domain (i.e. a knowledge base about the domain), the idea is to use these inference rules to determine automatically what other propositions about the domain are True and to integrate them into the knowledge base [13]. One of the shortcomings of propositional logics is that they do not analyze below the level of a proposition (or sentence). First-order logic, also known as predicate logic, continues the analysis down to objects, classes, properties and relations [17]. A brief discussion of predicate calculus follows.

Predicate calculus

The language. Syntactic expressions in predicate calculus consist of terms, atomic formulas and non-atomic well-formed formulas. Terms consist of [13]:

• Individual constants.
• Variables.
• Arbitrary individuals.
• Undetermined individuals.
• Functional terms.

Non-atomic symbols consist of:

• Functional terms.
• Atomic formulas.
• Well-formed formulas (wffs).

Atomic symbols are:

• Individual constants.
• Variables.
• Arbitrary and undetermined individuals.
• Function symbols.
• Predicate symbols.

All of the above have syntactical rules that govern their construction, as listed in the work by Shapiro [13].

Semantics. As with propositional calculus, there are a number of semantic rules that govern how we connect the elements of the language with the domain being conceptualized. They are listed and discussed in detail in [13].

Rules of inference. Inference in predicate calculus is similar to inference in propositional calculus, with the addition of rules for the universal (∀) and existential (∃) quantifiers. The idea of using these rules to determine what is true about the application domain and integrating it into the knowledge base essentially stays the same as for propositional calculus.

4.2.2 Knowledge representation using networks

In this subsection, the use of a network or graph structure for the purposes of knowledge representation is briefly discussed. The particular network structure of interest is the semantic network structure (also known as a concept graph).

Semantic networks (concept graphs)

The study of semantics attempts to describe the concepts behind word meanings and the ways in which these meanings interact. Semantic networks were designed to provide such a description. A semantic network is defined as a labelled, directed, acyclic graph where nodes denote entities and labelled directed arcs denote relations between the nodes they connect [17]. In short, a semantic network depicts relationships between concepts in a specific domain. To demonstrate how this works, consider the following information:

A human being consists of a number of sections. A human has a head, arms, torso, midsection, legs and feet. The head, arms and midsection are connected to the torso, the legs are connected to the midsection and the feet are attached to the legs.

Figure 4.1 below shows an inheritance network that represents the above.

[Figure 4.1: Inheritance-style semantic network. Human has-a Head, Torso, Midsection, Arms, Legs and Feet; Head, Arms and Midsection connect to Torso; Legs connect to Midsection; Feet connect to Legs.]

Inheritance networks do have one significant drawback in the sense that the semantics of the relations are not very clear [17]. To demonstrate this, let figure 4.1 be modified as in figure 4.2. In figure 4.2, we introduce two new facts: humans have two arms and two legs. We also introduce an entity called “Bob” who is a human. The problem with this is that these facts, although generally true, are not universally true. What if Bob lost a limb in an accident? This would exclude him from being part of the concept

of human as defined in figure 4.2, which is obviously incorrect.

[Figure 4.2: Expanded inheritance-style semantic network. As figure 4.1, with the additions: Bob is-a Human; Arms has-no 2; Legs has-no 2.]

Semantic networks also have difficulty in representing belief about a particular domain. This semantic unclarity, which is characteristic of semantic networks, is addressed by a variation of inheritance networks called description logics. Description logics can be seen as the combination of network-based representation schemes (similar to semantic networks and frames) and logic. They allow for the definition of categories (called “concepts”) and relations (called “roles”) without the semantic confusions of earlier inheritance networks. The main benefit of description logics over conventional inheritance networks is that they can represent information about relations between nodes using a language with precise semantics. A brief introduction to description logics can be found in the work by Nardi and Brachman [18].

Propositional semantic networks add the ability to represent information about beliefs. A propositional semantic network can be defined as a semantic network where every proposition (belief) represented in the network is represented by a node (called a propositional node), rather than by an arc [19]. As with traditional semantic networks, there are nodes that represent entities as well.
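The inheritance network of figures 4.1 and 4.2 can be sketched as a list of labelled, directed arcs. This is only an illustration: storing arcs as (source, relation, target) triples and the helper names `related` and `parts_of` are assumptions made for the example.

```python
# Each arc is a (source node, relation label, target node) triple.
ARCS = [
    ("Bob", "is-a", "Human"),
    ("Human", "has-a", "Head"),
    ("Human", "has-a", "Torso"),
    ("Human", "has-a", "Midsection"),
    ("Human", "has-a", "Arms"),
    ("Human", "has-a", "Legs"),
    ("Human", "has-a", "Feet"),
    ("Head", "connects-to", "Torso"),
    ("Arms", "connects-to", "Torso"),
    ("Midsection", "connects-to", "Torso"),
    ("Legs", "connects-to", "Midsection"),
    ("Feet", "connects-to", "Legs"),
]


def related(node, relation):
    """Targets reachable from node over one arc with the given label."""
    return [t for (s, r, t) in ARCS if s == node and r == relation]


def parts_of(node):
    """has-a parts of a node, including those inherited over is-a arcs."""
    parts = list(related(node, "has-a"))
    for parent in related(node, "is-a"):
        parts.extend(parts_of(parent))
    return parts
```

Calling `parts_of("Bob")` returns every part of Human, which also shows the drawback discussed above: nothing in the arcs can express that a particular human is an exception.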

4.2.3 Frames

Frame-based knowledge representation systems use frames as their primary mechanism for representing domain knowledge. Frames are a way of grouping information in terms of a record consisting of “slots” and “fillers” [14]. Frame representation systems are quite similar to semantic networks: individual “frames” correspond to nodes in the network, describing some object or class of objects, and relationships between frames correspond to arcs in the network, called “slots”. Each slot also has a value associated with it. The slots describe the object in question, with each slot-value pair corresponding to some common attribute of the object and its allowable value or range of values. A slot may also contain a pointer to another frame, just as nodes are pointed to by arcs in a semantic network. These slots are called “frame fillers”. In short, a frame can be thought of as a record data structure with a number of slots, each of which has an associated value. As an example, consider figure 4.3 below. The figure represents part of the semantic network of figure 4.2 as a frame-based system.

[Figure 4.3: Part of figure 4.2 as a frame system. Frames shown: Bob (is-a: Human), Human (has-a: Head, Torso, Midsection, Legs, Feet, Arms), Head, Midsection and Legs, with connected-to and has-no slots.]

Frame systems reason about a particular class or classes of objects by using stereotypical representations of different situations, objects and events. Type hierarchies provide mechanisms for inheritance. In a frame-based system the uppermost class hierarchy is fixed and can, through inheritance, provide default values to individual frames inherited from an ancestor. This implies that no slot values are ever left empty within a frame. These default values form the stereotypical representation of a situation and are overwritten by values that are better suited to a specific case.
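Default values and their overriding can be sketched with a small record-like class. This is a minimal illustration under stated assumptions: the `Frame` class, its `get` method and the slot names `arms` and `legs` are hypothetical, and a single parent pointer stands in for the type hierarchy.

```python
class Frame:
    """A record of slots with an optional parent frame supplying defaults."""

    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.slots = dict(slots)

    def get(self, slot):
        """Return the slot value, falling back to inherited defaults."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(slot)


# Stereotypical human: two arms and two legs by default.
human = Frame("Human", arms=2, legs=2)

# Bob inherits every default; a specific case overrides only what differs.
bob = Frame("Bob", parent=human)
bob_after_accident = Frame("Bob", parent=human, arms=1)
```

Unlike the inheritance network of figure 4.2, the overridden slot lets a frame remain a Human while departing from the stereotype.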
One of the advantages of frame-based modelling is that it is quite similar to object-based modelling, which is a familiar and popular technique in the realm of software engineering. A disadvantage is that these descriptions are somewhat restrictive, and modifications to the scheme are usually necessary to adequately capture the complexities of the real world.

4.2.4 Rule based systems

One of the easiest ways to represent knowledge is with the use of IF/THEN rules. Formally, systems using rules are known as production systems. Generally a production system has three main components: working memory, rule memory and an inference engine. The working memory and rule memory can be collectively seen as the knowledge base of the system. The working memory contains a set of symbol structures representing facts about the domain. The rule memory contains a set of pattern-action rules that govern the system’s behaviour. Each rule has a ‘left-hand side’ or antecedent, which defines a condition that should be matched against the content of the working memory. If such a match is found, an action is performed. Actions are the consequents or ‘right-hand side’ of the rule, and define changes to the working memory or interaction with the user (output) [14].

The inference engine selects rules from the rule memory whose antecedents match the working memory and performs the associated actions (called firing a rule). A useful consequence of this design is that the knowledge base can be expressed as sets of rules, each of which can be validated independently. Firing rules may change working memory and therefore may influence which other rules are triggered. If multiple rules are triggered, the ultimate behaviour of the system could depend on which rule is fired first. This is determined by a conflict resolution strategy. More information can be found in the work by Duce and Ringland [14].

4.2.5 Ontologies

It has been the quest of philosophers for hundreds of years to define the nature of existence. These philosophers spoke of ontology as the science of being: what it means to exist.
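Looking back briefly at the production systems of section 4.2.4, the match, select and fire cycle can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not an implementation taken from [14]: facts are plain strings, rule actions only add facts to working memory, the conflict resolution strategy is assumed to be "first matching rule in rule-memory order", and all rule and fact names are hypothetical.

```python
def run(rules, facts, max_cycles=100):
    """Forward-chain until no rule can add anything new to working memory."""
    working_memory = set(facts)
    fired = []
    for _ in range(max_cycles):
        # Match: rules whose antecedent holds and whose consequent adds something new.
        conflict_set = [
            rule for rule in rules
            if rule["if"] <= working_memory and not rule["then"] <= working_memory
        ]
        if not conflict_set:
            break
        # Conflict resolution (assumed strategy): the first rule in rule memory wins.
        rule = conflict_set[0]
        # Fire: apply the consequent, which changes working memory.
        working_memory |= rule["then"]
        fired.append(rule["name"])
    return working_memory, fired


# Hypothetical rule memory: antecedents and consequents are sets of facts.
RULES = [
    {"name": "r1", "if": {"has_head", "has_torso"}, "then": {"is_human"}},
    {"name": "r2", "if": {"is_human"}, "then": {"can_search_web"}},
]

memory, fired = run(RULES, {"has_head", "has_torso"})
```

Firing `r1` adds `is_human` to working memory, which is what triggers `r2` on the next cycle. A different conflict resolution strategy would only require changing how a rule is picked from `conflict_set`.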

Today, ontologies are seen as a form of knowledge representation with uses in three main areas [20]:

• Communication between (i) implemented computational systems, (ii) humans and (iii) humans and implemented computational systems.
• Computational inference for (i) internally representing and manipulating plans and planning information and (ii) analyzing the internal structures, algorithms, inputs and outputs of implemented systems in theoretical and conceptual terms.
• Reuse and organization of knowledge. This is the structuring or organizing of libraries containing planning and domain information (the knowledge base).

The uses listed above serve as an affirmation of ontologies as a knowledge representation technique, but do not supply an adequate definition. A popular definition of an ontology is that it is an “explicit specification of a conceptualization” [21]. Conceptualization, in this context, refers to the expression of knowledge about a certain application domain in terms of entities. These entities (things in the world, the relationships they hold and the constraints between them) provide an abstract model of how people and/or machines think about the domain. The model is usually restricted to some subject area such as manufacturing, biology or the web. Specification is the representation of this conceptualization in some sort of concrete form. One of the steps in specification necessarily involves encoding the conceptualization in a knowledge representation language. We will discuss the representations commonly used later in this section.

Components of an ontology

The main components of an ontology are concepts, relations, instances and axioms [22]. Concepts represent a set or class of entities within a domain. Concepts can be classified into two groups:

• Primitive concepts possess only the necessary properties for admission to the class. So if we have the primitive concept α with property x, all concepts belonging to the class that α belongs to will have property x. There may, however, be other concepts with property x that do not belong to that class.
• Defined concepts possess properties that are both necessary and sufficient for admission to the class. If the defined concept β has property y, then every concept that has property y belongs to the same class as β.

Relations describe the interactions between concepts or their properties. They can also be classified into two broad categories:

• Taxonomies organize concepts into a tree structure, with child nodes called sub-concepts and parent nodes called super-concepts. These relationships structure the tree and further define the relationships between the concepts. Specialization relationships exist to indicate that one concept is a kind of another, more general concept. Partitive relationships describe a typical whole-part relationship, where a concept (whole) is made up of other (part) concepts. It is interesting to note that the definitions of specialization and partitive relationships correspond exactly to the definitions of generalization and aggregation relationships defined in the Unified Modelling Language (UML) [22].
• Associative relationships relate concepts across the tree structures mentioned above. Associative relationships describe the properties concepts have, for instance their names (nominative), their location with respect to another concept (locative), or the processes they are involved in or contain internally. Many of these types of relationships exist and are used to relate concepts with each other [22].

Relations can be organized into taxonomies as well. They can also have properties that more precisely define their nature and how they relate concepts with each other. These properties can include the cardinality of the relation, the transitivity of the relation, whether or not the relation is universally applicable to a concept, and restrictions on the relation.
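As an illustration of the two taxonomic relationships and of transitivity as a relation property, the following Python sketch stores specialization and partitive links and answers simple queries over them. It is only a sketch: the class `Taxonomy`, its method names and the example concepts are hypothetical, and letting sub-concepts inherit the parts of their super-concepts is an assumption borrowed from object-oriented modelling rather than a claim made in [22].

```python
class Taxonomy:
    """A concept tree with specialization (kind-of) and partitive (part-of) links."""

    def __init__(self):
        self.kind_of = {}  # specialization: sub-concept -> super-concept
        self.part_of = {}  # partitive: part concept -> whole concept

    def add_kind(self, sub, sup):
        self.kind_of[sub] = sup

    def add_part(self, part, whole):
        self.part_of[part] = whole

    def ancestors(self, concept):
        """Follow specialization links upward (assumes a tree, so no cycles)."""
        chain = []
        while concept in self.kind_of:
            concept = self.kind_of[concept]
            chain.append(concept)
        return chain

    def is_kind_of(self, sub, sup):
        """Specialization is treated as transitive: A kind-of B kind-of C."""
        return sup in self.ancestors(sub)

    def parts(self, whole):
        """Parts declared on a concept or (by assumption) on its super-concepts."""
        wholes = [whole] + self.ancestors(whole)
        return sorted(p for p, w in self.part_of.items() if w in wholes)


# Hypothetical example concepts.
taxonomy = Taxonomy()
taxonomy.add_kind("MetaSearchEngine", "SearchEngine")
taxonomy.add_kind("SearchEngine", "WebService")
taxonomy.add_part("QueryInterface", "SearchEngine")
taxonomy.add_part("ResultMerger", "MetaSearchEngine")
```

With these links, `taxonomy.is_kind_of("MetaSearchEngine", "WebService")` holds through transitivity, and `taxonomy.parts("MetaSearchEngine")` lists both the part declared on the concept itself and the one declared on its super-concept.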
Once the conceptualization has been made concrete (as described in the next section), an ontology has been produced. Instances are the entities that are represented by a concept. In their paper, Stevens, Goble and Bechhofer note that an ontology should not contain instances, as it is meant to be a conceptualization of the domain [22]. The combination of an ontology with its associated instances makes up the knowledge base for that domain. Axioms are used to place constraints on values for classes or instances. They also include general rules about the domain [22].

With the components of an ontology defined, it is still not clear how ontologies for different domains are created. The process of building an ontology is called ontological engineering and is introduced in the next section.

Building ontologies: ontological engineering

As is the case with software, ontologies are engineered, and a methodology is needed to define the stages involved in building an ontology, the guidelines and principles that outline the activities of each stage, and a life-cycle that shows the relationship between the stages. The emerging discipline of ontological engineering is concerned with the ontology life cycle, which describes the steps involved in the development of an ontology and the relationships between these steps. The end goal of ontological engineering is the use and support of ontologies throughout this life cycle [20].

As ontological engineering is still an emerging discipline, it borrows some ideas from more mature engineering fields such as software engineering. It is therefore not surprising that methodologies for the development of ontologies can be divided into stage-based approaches (e.g. TOVE [23]), similar to the classic waterfall model in software engineering, and iterative prototype-refining approaches (e.g. METHONTOLOGY [24]), similar to the spiral model in software engineering. In both approaches there is a distinction between an informal stage, where the ontology is described using diagrams or natural language, and a formal stage, where these high-level descriptions are encoded in a formal knowledge representation language so as to ensure that the ontology is understandable and processable by a machine.
In their paper, Stevens et al. propose a skeletal life-cycle for building ontologies [22]. The key stages of this approach are: specification, design, conceptualization, integration, formalization, and evaluation. These are illustrated in figure 4.4 below.

Figure 4.4: Ontology life cycle [22] (stages shown: specification, conceptualization, design, integration, formalization and evaluation, grouped under an informal stage and a formal stage).

The specification stage attempts to develop a requirements specification for the ontology by identifying the intended purpose and scope of the ontology. This is done through the acquisition of domain knowledge from which the ontology will be built. Sources of domain knowledge can range from experts in the domain to research papers to other ontologies. This breadth contributes to making the building of ontologies a difficult, time-consuming and expensive task, as different members of a community may have conflicting opinions on how the domain they are considering should be modelled. It is therefore crucial that all interested parties (people and computer systems) commit to an agreed-upon ontology. With this in mind, Holsapple and Joshi name five approaches to ontology design to ensure an acceptable design for an ontology [25]. They are:

• Inspiration, which constitutes an individual's viewpoint about the domain being modelled.
• Induction, where modelling is guided by analyzing specific cases in the domain.
• Deduction, where general principles about a domain are accepted and applied adaptively to ontology construction.
• Synthesis, which accepts non-overlapping base ontologies that each provide a partial characterization of the domain.
• Collaboration of individuals, reflecting their experience and viewpoints. An initial ontology can be defined as a starting point for discussion.

There are numerous steps involved in all five approaches, and they all have certain distinct advantages and disadvantages. In the article by Holsapple and Joshi, these steps are discussed in detail [25].
