Chapter 3: Development Research Approach
3.4 Model Building Component
The model building component is the core of the proposed research approach. It includes four iterative steps: keyword extraction, clustering, visualization, and review. First, keywords are selectively extracted from the web resources. Next, the web resources are clustered based on the extracted keywords. The clustering results are then visualized in a two-dimensional map and presented to the user. Finally, the user reviews the clusters and may make manual adjustments. Based on the user's refinements, additional iterations of keyword extraction, clustering, and visualization may be performed to make the necessary modifications. The whole process can be repeated several times.
Keyword Extraction. In this step, all the text content of the identified web pages is extracted using the “View Source” feature of a web browser. These web pages may contain links to other web pages. In theory, the extractor (e.g., a search engine) can follow links as deep as it is allowed to; for the sake of simplicity, I extracted information only from the first page the crawler accessed.
The extracted web page text is first filtered based on the user's specification. For example, a database web page from the NDG website may contain many kinds of information, such as description, species, database URL, and project primary investigator. If the user wants to categorize the databases from the perspective of database contents, only the description section of a web page is kept. In this way, only the information related to the user's specific perspective is extracted for subsequent clustering. This in turn makes the clustering more accurate.
The filtered web page text is then tokenized and a set of keywords is generated. The keyword list is further filtered by removing common words such as “is” and “a.” In the end, each web page (or URL) is associated with a set of clean keywords that describe the web page.
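The tokenization and common-word filtering described above can be sketched as follows. This is a minimal illustration, not the extractor used in this dissertation: the function name `extract_keywords`, the short `STOP_WORDS` list, the regular expression used for tokenizing, and the example URLs and descriptions are all assumptions made for the example.

```python
import re

# Illustrative stop-word list (an assumption); a real system would use a longer one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}


def extract_keywords(text, stop_words=STOP_WORDS):
    """Tokenize text and return its keywords in first-seen order, without duplicates."""
    # Lowercase the text and keep runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    keywords = []
    for token in tokens:
        # Drop common words and repeated tokens.
        if token not in stop_words and token not in keywords:
            keywords.append(token)
    return keywords


# Hypothetical, already-filtered description text keyed by URL.
pages = {
    "http://example.org/db1": "A database of mouse brain gene expression.",
    "http://example.org/db2": "The atlas is a collection of brain images.",
}

# Each URL ends up associated with a clean keyword list.
page_keywords = {url: extract_keywords(text) for url, text in pages.items()}
```

Keeping the keywords in first-seen order makes the output easy to inspect by hand; a set would serve equally well as input to clustering.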
Clustering. In this step, the extracted keywords are used as the feature set for clustering. Through clustering, the input web pages are organized into related groups based on the feature set. In this dissertation, I apply neural-network-based Self-Organizing Maps (SOM) as the clustering algorithm. The clustering result of SOM is sensitive to the selection of its parameter values (Kohonen 1995). Liang et al. (Liang, Vaishnavi et al. 2006) researched SOM parameter values for their directory project by trying different permutations of such values. In this dissertation, I apply a genetic algorithm (GA) to discover near-optimal SOM parameter values. A Grid computing infrastructure is used to provide computing power for the system, since SOM and GA are both computation-intensive.
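To make the clustering step concrete, the sketch below trains a very small SOM on binary keyword vectors and assigns each page to its best-matching map node. It is a simplified illustration under stated assumptions, not the system's implementation: the binary keyword encoding, the function names, the parameter names (`rows`, `cols`, `epochs`, `lr0`, `radius0`), the linear decay schedule, and the example pages are all hypothetical. The parameters shown are merely examples of the kind of values to which a SOM result is sensitive.

```python
import math
import random


def to_vectors(page_keywords):
    """Encode each page as a binary vector over the sorted keyword vocabulary."""
    vocab = sorted({k for kws in page_keywords.values() for k in kws})
    vectors = {page: [1.0 if word in kws else 0.0 for word in vocab]
               for page, kws in page_keywords.items()}
    return vocab, vectors


def best_matching_unit(grid, vec):
    """Return the grid coordinate whose weight vector is closest to vec."""
    return min(grid, key=lambda node: sum((w - x) ** 2
                                          for w, x in zip(grid[node], vec)))


def train_som(vectors, rows=2, cols=2, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a small SOM and return its weights as {(row, col): weight list}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius decay linearly (an assumption).
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-3)
        for vec in vectors:
            bmu = best_matching_unit(grid, vec)
            for node, weights in grid.items():
                # Nodes near the winner on the map are pulled harder toward vec.
                dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-dist2 / (2 * radius ** 2))
                for i in range(dim):
                    weights[i] += lr * influence * (vec[i] - weights[i])
    return grid


# Hypothetical pages with already-extracted keywords.
page_keywords = {
    "page_a": ["brain", "atlas", "image"],
    "page_b": ["brain", "atlas", "image"],
    "page_c": ["gene", "expression", "mouse"],
}
vocab, vectors = to_vectors(page_keywords)
grid = train_som(list(vectors.values()))

# Each page is assigned to the map node that best matches its vector.
clusters = {page: best_matching_unit(grid, vec) for page, vec in vectors.items()}
```

Pages with identical keyword sets always land on the same node, because the best-matching unit depends only on the vector. How well dissimilar pages separate depends on the parameter values, which is the sensitivity noted above.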
One issue in this step is the integration of the different techniques, which is discussed in detail in the next chapter. The other issue is the evaluation of the knowledge model (clustering result) generated by the system. In the experiments of this dissertation, I selected a website that had already organized its web pages from various perspectives following a rigorous process. Those existing categorization schemes (knowledge models) are used to evaluate the knowledge model created by the system.
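One simple way to compare a generated clustering against an existing categorization scheme is cluster purity, sketched below. The text above does not state which measure is used, so purity is an assumption chosen for illustration, and the page names, cluster ids, and category labels are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_of, category_of):
    """Fraction of pages that fall in the majority reference category of their cluster."""
    members = defaultdict(list)
    for page, cluster in cluster_of.items():
        members[cluster].append(category_of[page])
    # For each cluster, count the pages that belong to its most common category.
    majority_total = sum(Counter(cats).most_common(1)[0][1]
                         for cats in members.values())
    return majority_total / len(cluster_of)


# Hypothetical system output: page -> cluster id.
cluster_of = {"p1": 0, "p2": 0, "p3": 1, "p4": 1}
# Hypothetical reference scheme from the website: page -> category.
category_of = {"p1": "anatomy", "p2": "anatomy", "p3": "genetics", "p4": "anatomy"}

score = purity(cluster_of, category_of)
```

A purity of 1.0 means every cluster contains pages from a single reference category. The measure rewards many small clusters, so in practice it would be read alongside the number of clusters.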
Visualization. Another challenge is the visualization of the clustering result. SOM can generate a two-dimensional map, but the map is not interactive. I adapted a Java-based visualization toolkit to provide an interactive user interface. The user can easily review the information about the SOM map and can also drag and drop an object (a node that represents a web page) on the SOM map, which facilitates the modification (review) step.
Review. The final component is the user evaluation of the produced model. I argue that user participation is critical to generating a user-centric knowledge model, especially when the knowledge model needs to be created entirely from scratch. In this step, a web user reviews a generated model and makes the necessary modifications. The system then reruns the model building process based on the user's guidance until the model matches the user's perspective.
3.5 Research Findings
In this chapter, equipped with the tentative design (research findings) from the suggestion phase, I introduced a research approach for generating user-centric, dynamic, and adaptable knowledge models for web-based resources. I described in detail the functionality of each component in the architecture. In the research approach, a clustering technique, specifically Self-Organizing Maps (SOM), was chosen as the main driving engine for creating knowledge models. I proposed to integrate SOM with other enabling techniques, namely a genetic algorithm and a Grid computing infrastructure, to improve the effectiveness and efficiency of the approach.
The research approach is unique as a type of web user adaptation system because it focuses on adaptable and flexible user adaptation. Current adaptive systems have been very successful in performing adaptation autonomously, but the adaptation is done on the back end using many server-side resources, and the system often cannot react accurately to a user's specific adaptation needs, especially when such needs cannot be modeled by the user. Existing adaptable systems can accurately reflect user needs, but often offer only low-level adaptation. To my knowledge, this study is the first research attempt to offer flexible and adaptable user adaptation to web information systems.
The research approach also provides a systematic and efficient way to create a dynamic knowledge model for a web-based resource by integrating a clustering technique with a genetic algorithm and Grid technology.
The proposed approach is generic with respect to knowledge presentation and discovery. In this dissertation, I focus on a web-based dataset, and the research approach fundamentally deals with textual data. The approach could be adapted to other application domains, such as databases and project portfolio management, by converting the new datasets into a textual format.