First Annual Technical Report


Data Mining Center of Excellence

Organizational Web Mining


For the Period starting September 2005 and ending August 2006
TR/COE_WM/9.0/12/2006

Prepared By Prof. Dr. Ahmed Rafea and Dr. Samhaa El-Beltagy


    Web Mining
      Web Mining and Information Retrieval
      Web Mining and Information Extraction
    Web Content Mining
    Approaches of Web Content Mining
      Unstructured Text data mining (Text Mining)
      Semi-Structured and Structured data mining
    Applications
      IBM Intelligent Miner for Text
      Customer Relationship Management
    Conclusions

  Segmenting Web Documents: A Survey
    Introduction
    Historical Background
    The Nature of Web Pages
    Segmentation Approaches
      Text-based Segmentation
      Tag-based Segmentation
      Vision-based Segmentation
      Content & Visual based Segmentation
    Applications
      Browsing web pages designed for PCs with non-PC terminals
      Text summarization
      Web document ranking
      Cost-effective caching
    Conclusion

  An Overview of Approaches for Labeling Training Sets for Text Categorization
    Introduction
    Historical Background on Text Categorization
    Text Categorization Using a Machine Learning Approach
      Manual Labeling
      Semi-automatic Labeling
      Automatic Labeling
    Text Categorization as an act of Semantic Annotation
      Manual Annotations
      Semi-automatic Annotations
      Automatic Annotations
    Conclusion

Overview of Developed Components

  Extracting the Latent Hierarchical Structure of Web Documents
    Introduction
    Related Work
    Heading Detection Algorithms
      The heading detection phase
      Heading Level Detection
      Results Fine Tuning
    Evaluation
    Conclusion and Future work

  Ontology Based Annotation of Text Segments
    Introduction
    Related Work
    Initial Analysis
    The Annotation System
    The Annotation Algorithm
    Evaluation
    Conclusion and Future Work

  Evaluation and Analysis of an Ontology Based Annotation Algorithm
    Introduction
    The Ontology based Annotation Algorithm
    Evaluation and Analysis of the results of the Algorithm
    Modified Learning Algorithm
    Conclusion

Component Design

  The Design of the Labeling Module
    Introduction
    The Abstract Architecture
    Class Descriptions
      Basic Classes
      Database Related Classes
    Database Tables

  The Design of the Segmentor Component
    Introduction
    System Overview
    Requirements Elicitation
      Use Case Diagram
      Use Case Descriptions
    Requirements Analysis
    Class Descriptions
    Sequence Diagrams
      Segment Web Documents
      Extract Document Features
      Detect Headings
    Class Diagram


Introduction

The primary aim of the Organizational Data Web Mining Project is to augment sections or segments of organizational electronic publications and web pages with domain metadata, with as little human intervention as possible. Towards this end, four different scenarios have been identified. In the first scenario, a domain ontology covering the subjects addressed by organizational documents already exists, and the documents under consideration contain segments with well-defined headings (or headings that can be identified using heading detection methods). The approach applied in this first scenario is to segment documents based on their headings and then apply pattern matching techniques to the heading text in order to match each heading with one or more concepts in the ontology. Once the concepts are determined, they can be used to annotate the segment whose heading was analyzed. This is considered the simplest of the four scenarios.
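As a rough illustration of the first scenario, the sketch below matches heading text against an ontology by whole-word lookup. It is not the project's annotation algorithm: the toy ontology, its concept names and the function names `match_concepts` and `annotate` are all hypothetical, and the matching rule is a simple assumption.

```python
import re

# Hypothetical toy ontology: concept name -> surface terms that signal it.
ONTOLOGY = {
    "Irrigation": ["irrigation", "watering"],
    "Pest Control": ["pest", "insect"],
    "Fertilization": ["fertilizer", "fertilization"],
}


def match_concepts(heading, ontology=ONTOLOGY):
    """Return the concepts whose terms occur as whole words in the heading."""
    text = heading.lower()
    matched = []
    for concept, terms in ontology.items():
        if any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms):
            matched.append(concept)
    return matched


def annotate(segments, ontology=ONTOLOGY):
    """Attach matched concepts to each (heading, body) segment."""
    return [(heading, body, match_concepts(heading, ontology))
            for heading, body in segments]
```

A heading that matches no term is left without concepts, which is where the other scenarios would have to take over.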

In the second scenario, a domain ontology is not available or exists only partially. As in the first scenario, the documents under consideration contain segments with well-defined (or identifiable) headings. For the system to work in this scenario, it must be able to learn a taxonomic ontology from existing documents. Once an ontology is learned, pattern matching techniques can be applied to heading text to match it with one or more concepts in the created ontology and to annotate segments as described before.

During the first year of the Organizational Data Web Mining Project, focus was placed on the design and implementation of components related to the first scenario, which are the foundation on which the other scenarios can be built. Specifically, emphasis was placed on developing a segmentor and a segment annotation component. The segmentor component segments an input HTML document based on its headings, which can be represented explicitly using HTML heading tags or implicitly using any kind of special formatting.
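The explicit case can be sketched with the standard library HTML parser. This is a minimal sketch, not the project's segmentor: it handles only headings marked with `h1` to `h6` tags, ignores text before the first heading, and does not attempt to detect implicit headings expressed through formatting. The names `HeadingSegmenter` and `segment_html` are hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingSegmenter(HTMLParser):
    """Collect [heading text, body text] pairs from explicit heading tags."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self.segments.append(["", ""])  # a new segment starts at each heading

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.segments:
            return  # text before the first heading is ignored in this sketch
        slot = 0 if self._in_heading else 1
        self.segments[-1][slot] += data + " "


def segment_html(html):
    """Return a list of (heading, body) tuples with whitespace normalized."""
    parser = HeadingSegmenter()
    parser.feed(html)
    parser.close()
    return [(" ".join(h.split()), " ".join(b.split())) for h, b in parser.segments]
```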

Before any work on these components began, current work in the surrounding area was reviewed. This review is presented in the second section of this report. Once the review was completed, the components were designed; the design is presented in the fourth and final section of this report. An overview of the implemented components and their evaluation results is presented in the third section. The overview of the developed components is presented before the design because it provides sufficient information about the systems and how they work without going into too much low-level detail.

Literature Reviews

In this section, three different literature reviews are presented. The first was carried out to familiarize our research staff with the general area in which work will be carried out. The second and third were conducted to investigate current work in the areas of segmentation and labeling, respectively. Each of these reviews was issued as a technical report.


Web Content Mining Research: A Survey


Abstract. The rapid expansion of the web is causing the constant growth of information, leading to several problems such as an increased difficulty of extracting potentially useful knowledge. Web content mining confronts this problem by gathering explicit information from different web sites for its access and knowledge discovery. In this paper we survey the research in the area of Web content mining. First, we situate the area by defining it and the fields related to it. Then we summarize the main problems Web content mining faces due to the nature of the data it deals with, and use this data to categorize the different approaches to Web content mining. For each approach we list the different techniques utilized. We also list one of the real-life applications that employ Web content mining. The survey uses representation issues as its organizing criterion. We conclude the paper with some research issues.

Web Mining

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services (Etzioni, 1996). This area of research has grown very large, partly because of the interests of various research communities, the tremendous growth of information sources available on the Web, and the recent interest in e-commerce.

The Web mining field consists of three main categories: Web usage mining, Web structure mining, and Web content mining. Web usage mining refers to the discovery of user access patterns from Web usage logs. Web structure mining tries to discover useful knowledge from the structure of hyperlinks. Web content mining aims to extract or mine useful information or knowledge from Web page contents.

Web mining is often associated with information retrieval (IR) or information extraction (IE). However, Web mining is not the same as IR or IE.

Web Mining and Information Retrieval

IR is the automatic retrieval of all relevant documents while at the same time retrieving as few non-relevant documents as possible (Rijsbergen, 1979). Some have claimed that resource or document discovery (IR) on the Web is an instance of Web content mining, while others associate Web mining with intelligent IR. In fact, the primary goals of IR are indexing text and searching for useful documents in a collection, and current IR research includes modeling, document classification and categorization, user interfaces, data visualization, filtering, etc. (Baeza-Yates & Berthier, 1999). The task that can be considered an instance of Web mining is Web document classification or categorization, which could be used for indexing. Viewed in this respect, Web mining is part of the (Web) IR process (Kosala & Blockeel, 2000).

Web Mining and Information Extraction

IE has the goal of transforming a collection of documents, usually with the help of an IR system, into information that is more readily digested and analyzed (Cowie & Lehnert, 1996). IE aims to extract relevant facts from the documents while IR aims to select relevant documents (Pazienza, 1997).


While IE is interested in the structure or representation of a document, IR views the text in a document just as a bag of unordered words (Wilks, 1997). Thus, in general, IE works on documents at a finer level of granularity than IR does.

Building IE systems manually is neither feasible nor scalable for a medium as dynamic and diverse as Web content (Muslea, Minton & Knoblock, 1998). Because of this, most IE systems focus on extracting from specific web sites. Others use machine learning or data mining techniques to learn the extraction patterns or rules for Web documents semi-automatically or automatically (Kushmerick, 1999). In this view, Web mining is used to improve Web IE, that is, Web mining is part of IE (Kosala & Blockeel, 2000).

An example of IE without Web mining is the work of El-Beltagy, Rafea & Abdelhamid, who built a model for automatically augmenting segments of documents with metadata using dynamically acquired background domain knowledge, in order to help users easily locate information within these documents through a structured front end.

Web Content Mining

Web content mining describes the discovery of useful information from Web contents, data, and documents (Kosala & Blockeel, 2000). However, what constitutes Web content could encompass a very broad range of data. In this section we begin by reviewing some of the important problems that Web content mining aims to solve. We then list some of the different approaches in this field, classified according to the different types of Web content data. For each approach we list some of the most used techniques.

It is often said that the Web offers an unprecedented opportunity and challenge for data mining. We believe that this is so due to the following characteristics of the Web (Liu & Chang, 2004):

1. The amount of data/information on the Web is huge and still growing rapidly. Web data is also easily accessible.

2. The coverage of Web information is wide and diverse. One can find information about almost anything on the Web.

3. Data of all types exist on the Web, e.g., structured tables, texts, multimedia data (e.g., images and movies), etc.

4. Information on the Web is heterogeneous. Multiple Web pages may present the same or similar information using completely different formats or syntaxes, which makes integration of information a challenging task.

5. Much of the Web information is semi-structured due to the nested structure of HTML code and the need of Web page designers to present information in a simple and regular fashion to facilitate human viewing and browsing.

6. Much of the Web information is linked. There are links among pages within a site, and across different sites. These links serve as an information organization tool and also as indications of trust/authority in the linked pages and sites.


7. Much of the Web information is redundant. The same piece of information or its variations may appear in many pages or sites. This property has been explored in many Web data mining tasks.

8. The Web is noisy. A Web page typically contains a mixture of many kinds of information, e.g., main content, advertisements, navigation panels, copyright notices, etc. For a particular application only part of the information is useful, and the rest is noise.

9. The Web consists of surface Web and deep Web. Surface Web is composed of pages that can be browsed using a normal Web browser. Surface Web is also searchable through popular search engines. Deep Web is mainly composed of databases that can only be accessed through parameterized queries using query forms.

10. The Web is also about services. Many Web sites and pages enable people to perform operations with input parameters, i.e., they provide services.

11. Above all, the Web is a virtual society. It is not only about data, information and services, but also about interactions among people, organizations and automatic systems.

12. The Web is dynamic. Information on the Web changes constantly. Keeping up with the changes and monitoring the changes are important issues for many applications.

We can see why the Web is such a fascinating place and why it offers so many opportunities for data mining.

Approaches of Web Content Mining

Web content data consist of unstructured data such as free text, semi-structured data such as HTML documents, and more structured data such as data in tables or database-generated HTML pages. Two main approaches to Web content mining therefore arise: (1) the unstructured text mining approach and (2) the semi-structured and structured mining approach.

Unstructured Text data mining (Text Mining)

Much of the Web content data is unstructured text data (Etzioni, 1996). By unstructured text we mean free text such as news stories. Research on applying data mining techniques to unstructured text is termed knowledge discovery in texts (KDT) (Feldman & Dagan, 1995), text data mining (Hearst, 1999), or text mining (Tan, 1999). Hence we could consider text mining as an instance of Web content mining.

Text mining, or KDT, was first proposed by Feldman & Dagan (1995). They suggest structuring text documents by means of information extraction, text categorization, or NLP techniques as a pre-processing step before performing any kind of KDT. The reason is that mining unprepared documents does not provide effectively exploitable results (Rajman & Besancon, 1998).

Table 1 summarizes some of the research done on unstructured text (Kosala & Blockeel, 2000). It tries to give a taste of the variety of representations, methods and applications that have been used.


Representations: Most of the research in Table 1 uses a bag of words to represent unstructured documents. The bag of words representation (Salton & McGill, 1983) takes single words found in the training corpus as features. This representation ignores the sequence in which the words occur and is based on statistics about single words in isolation. The features could be Boolean (a word either occurs or does not occur in a document) or frequency based (frequency of the word in a document).
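The two feature types can be illustrated with a minimal sketch. Whitespace tokenization and the function names below are simplifying assumptions, not taken from any of the surveyed systems.

```python
from collections import Counter


def build_vocabulary(corpus):
    """Single words found in the training corpus, in a fixed (sorted) order."""
    return sorted({word for doc in corpus for word in doc.lower().split()})


def frequency_vector(doc, vocab):
    """Frequency-based features: how often each vocabulary word occurs."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]


def boolean_vector(doc, vocab):
    """Boolean features: 1 if the word occurs in the document, else 0."""
    return [1 if count > 0 else 0 for count in frequency_vector(doc, vocab)]
```

Because only counts are kept, any reordering of a document's words yields the same vector, which is exactly the loss of sequence information described above.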

Preprocessing variations of feature selection include removing case distinctions, punctuation, infrequent words, and stop words. The features can be reduced further by applying other feature selection techniques, such as information gain, cross entropy, or odds ratio (Mladenic & Grobelnik, 1999).
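A minimal sketch of these preprocessing steps follows. The stop-word list and the frequency threshold `min_count` are hypothetical values chosen for illustration; the statistical selection measures mentioned above are not implemented here.

```python
import string
from collections import Counter

# Hypothetical, deliberately short stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Lower-case the text and strip ASCII punctuation before splitting."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def select_features(corpus, min_count=2, stop_words=STOP_WORDS):
    """Keep words that are frequent enough and are not stop words."""
    counts = Counter(word for doc in corpus for word in tokenize(doc))
    return sorted(word for word, count in counts.items()
                  if count >= min_count and word not in stop_words)
```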

Other preprocessing includes latent semantic indexing (LSI) (Deerwester, Dumais, Landauer & Harshman, 1990), which seeks to transform the original bag of words document vectors to a lower dimensional space by analyzing the correlational structure of terms in the document collection, such that similar documents that do not share terms are placed in the same topic, and stemming, which reduces words to their morphological roots. For example, the words "informing", "information", "informer", and "informed" would be stemmed to their common root "inform", and only the root is used as the feature instead of the four original words.
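The stemming example can be reproduced with a toy suffix stripper. This is only a sketch: the suffix list is a hypothetical choice that happens to cover the four example words, and it is far cruder than the stemmers used in practice.

```python
# Illustrative suffix list only; real stemmers use much richer rules.
SUFFIXES = ("ation", "ing", "er", "ed")


def toy_stem(word):
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```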

Other feature representations are also possible, such as using information about word positions in the document (Cohen, 1995), using an n-gram representation (word sequences of length up to n) (Honkela, Kaski, Lagus & Kohonen, 1997) (for example, "the morphological roots" is a tri-gram), using phrases (Dumais, Platt, Heckerman & Sahami, 1998) such as "the quick brown fox that runs away", using hypernyms (the linguistic term for the "is a" relation: a dog is an animal, thus "animal" is a hypernym of "dog") (Scott & Matwin, 1999), etc.
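As a small illustration of the n-gram representation, the sketch below lists all word sequences of length up to n. The function name and the whitespace tokenization are assumptions made for the example.

```python
def word_ngrams(text, n):
    """Return all word sequences of length 1..n, shortest first."""
    tokens = text.split()
    return [" ".join(tokens[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(tokens) - k + 1)]
```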

In the relational representation one may use relationships between different words and their positions, e.g. "word X is to the left of word Y in the same sentence".

Current techniques are mainly based on machine learning and natural language processing to learn extraction rules from manually labeled examples (Bunescu & Mooney, 2004).

Recently, a number of researchers have also made use of common language patterns (common sentence structures used to express certain facts or relations) and the redundancy of information on the Web to find concepts, relations among concepts, and named entities (Cimiano, Handschuh & Staab, 2004). The patterns can be automatically learnt or supplied by human users.

Another direction of research in this area is Web question-answering. Although question-answering was first studied in the information retrieval literature, it has become very important on the Web, as the Web offers the largest source of information and the objective of many Web search queries is to obtain answers to simple questions. Kwok, Etzioni & Weld (2000) extend question-answering to the Web by query transformation, query expansion, and then selection.


Table 1: Web content mining for unstructured documents

Author | Document Representation | Method | Application
------ | ----------------------- | ------ | -----------
(Ahonen, 1998) | Bag of words and word positions | Episode rules | Finding keywords and keyphrases; Discovering grammatical rules and collocations
(Billsus & Pazzani, 1999) | Bag of words | TFIDF; Naïve Bayes | Text classification
(Cohen, 1995) | Relational | Propositional rule based system; Inductive Logic Programming | Text classification
(Dumais, 1998) | Bag of words; Phrases | TFIDF; Decision trees; Naïve Bayes; Bayes nets; Support Vector Machines | Text classification
(Feldman & Dagan, 1995) | Concept categories | Relative entropy | Finding patterns between concept distributions in textual data
(Feldman, 1998) | Terms | Association rules | Finding patterns across terms in textual data
(Frank, 1998) | Phrases and their positions | Naïve Bayes | Extracting keyphrases from text documents
(Freitag & McCallum, 1999) | Bag of words | Hidden Markov Models | Learning extraction models
(Hoffmann, 1999) | Bag of words | Unsupervised statistical method | Hierarchical clustering
(Honkela, 1997) | Bag of words with n-grams | Self-Organizing Maps | Text and document clustering
(Junker, 1999) | Relational | Inductive Logic Programming | Text categorization; Learning extraction rules
(Kargupta, 1999) | Bag of words with n-grams | Unsupervised hierarchical clustering; Decision trees; Statistical analysis | Text classification and hierarchical clustering
(Nahm & Mooney, 2000) | Bag of words | Decision trees | Predicting (words) relationship
(Nigam, 1999) | Bag of words | Maximum entropy | Text classification
(Scott & Matwin, 1999) | Bag of words; Phrases; Hypernyms and | |
(Soderland, 1996) | Sentences, and clauses | Rule learning | Learning extraction rules
(Weiss, 1999) | Bag of words | Boosted decision trees | Text categorization
(Weiner, 1995) | Bag of words | Neural Network; Logistic Regression | Text categorization
(Witten, 1999) | Named entity | Text compression | Named entity classifier
(Yang, 1999) | Bag of words and phrases | Clustering algorithms; K-Nearest Neighbor; Decision tree | Event detection and tracking

Semi-Structured and Structured data mining

This is perhaps the most widely studied research topic in Web content mining. One of the reasons for its importance and popularity is that structured data on the Web are often very important, as they represent their host pages' essential information, e.g., lists of products and services. Extracting such data allows one to provide value-added services, e.g., comparative shopping and meta-search. Structured data is also easier to extract than unstructured text.

Semi-structured data is a point of convergence (DeRose, 1999) for the Web and database communities: the former deals with documents, the latter with data. The form of that data is evolving from rigidly structured relational tables with numbers and strings to enable the natural representation of complex real-world objects like books, papers, movies, jet engine components, and chip designs without sending the application writer into contortions.

Emergent representations for semi-structured data (such as XML) are variations on the Object Exchange Model (OEM) (Goldman, McHugh & Widom, 1999). In OEM, data is in the form of atomic or compound objects: atomic objects may be integers or strings; compound objects refer to other objects through labeled edges. HTML is a special case of such 'intra-document' structure.
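The object model just described can be sketched with two small classes. This is an illustrative sketch only: the class names `Atomic` and `Compound`, their methods, and the sample book record are hypothetical and do not reproduce any particular OEM implementation.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Atomic:
    """An atomic object holding an integer or a string."""
    value: Union[int, str]


@dataclass
class Compound:
    """A compound object that refers to other objects through labeled edges."""
    edges: list = field(default_factory=list)  # list of (label, object) pairs

    def add(self, label, obj):
        self.edges.append((label, obj))
        return self  # returning self allows chained construction

    def children(self, label):
        return [obj for edge_label, obj in self.edges if edge_label == label]


# Hypothetical example: a book with a nested author object.
book = (Compound()
        .add("title", Atomic("Web Mining"))
        .add("year", Atomic(2000))
        .add("author", Compound().add("name", Atomic("A. Author"))))
```

Because labels live on the edges rather than in a fixed schema, two books in the same collection may carry different sets of labels, which is the flexibility the passage above describes.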

Information Retrieval view vs. Data Base view

We can differentiate the research done in Web content mining for semi-structured and structured data from two points of view: IR and DB (Cooley & Mobasher, 1997). From the IR view, the goal of Web content mining is mainly to assist or improve information finding, or to filter information for users, usually based on either inferred or solicited user profiles. From the DB view, the goal is mainly to model Web data so that queries more sophisticated than keyword-based search can be performed (Kosala & Blockeel, 2000). These viewpoints are discussed further in the next sections.


We can see from Table 2 that the works surveyed use richer representations than the works surveyed in Table 1. This is due to the additional structure (HTML and hyperlink) information in hypertext documents. All the works surveyed utilize the HTML structures inside the documents, and some utilize the hyperlink structure between documents for document representation. The methods used are common data mining methods: classification or categorization and clustering, learning relations between Web documents, learning extraction patterns or rules, and finding patterns in semi-structured data.

Basic clustering techniques

Clustering is a fundamental operation in structured data domains, and has been intensely studied. Some of the existing techniques, such as k-means (Jain & Dubes, 1988) and hierarchical agglomerative clustering (Cutting, Karger, Pedersen & Tukey, 1992) can be applied to documents. Typically, documents are represented in unweighted or TFIDF vector space, and the similarity between two documents is the cosine of the angle between their corresponding vectors, or the distance between the vectors, provided their lengths are normalized.
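The vector-space similarity described above can be sketched as follows. The TF-IDF variant used here (raw term frequency times the natural logarithm of N/df) is one common choice and is an assumption, as are the function names.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using tf * ln(N / df) weights."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vocab = sorted(df)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[word] * math.log(n_docs / df[word]) for word in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For unit-length vectors the squared Euclidean distance equals 2 - 2·cosine, so ranking by distance and ranking by cosine agree, which is why the two measures are interchangeable once lengths are normalized.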

k-means clustering

The number k of clusters desired is input to the k-means algorithm, which then picks k 'seed documents' whose coordinates in vector space are initially set arbitrarily. Iteratively, each input document is 'assigned to' the most similar seed. The coordinate of the seed in the vector space is recomputed to be the centroid of all the documents assigned to that seed. This process is repeated until the seed coordinates stabilize.

The extremely high dimensionality of text creates two problems for the top-down k-means approach. Even if each of 30000 dimensions has only two possible values, the number of input documents will always be too small to populate each of the 2^30000 possible cells in the vector space. Hence most dimensions are unreliable for similarity computations. But since this is not a supervised problem, it is hard to detect them a priori. There exist bottom-up techniques to determine orthogonal subspaces of the original space (obtained by projecting out other dimensions) such that the clustering in those subspaces can be characterized as 'strong' (Agrawal, Gehrke, Gunopulos & Raghavan, 1998). However, these techniques cannot deal with tens of thousands of dimensions.

Agglomerative clustering

In agglomerative or bottom-up clustering, documents are continually merged into super-documents or groups until only one group is left; the merge sequence generates a hierarchy based on similarity. The algorithm initially places each document into a group by itself.
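The k-means and agglomerative procedures described above can be sketched side by side. Both functions are minimal illustrations under stated assumptions: similarity is the cosine between vectors, the k-means seeds are supplied by the caller rather than chosen arbitrarily, and the agglomerative step compares groups by the cosine of their centroids, a linkage choice that the text does not specify.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors, seeds, max_iter=100):
    """Assign each vector to its most similar seed until the seeds stabilize."""
    seeds = [list(seed) for seed in seeds]
    assignment = []
    for _ in range(max_iter):
        assignment = [max(range(len(seeds)), key=lambda j: cosine(v, seeds[j]))
                      for v in vectors]
        new_seeds = []
        for j, seed in enumerate(seeds):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            # An empty cluster keeps its previous seed in this sketch.
            new_seeds.append(centroid(members) if members else seed)
        if new_seeds == seeds:
            break
        seeds = new_seeds
    return assignment, seeds


def agglomerative(vectors):
    """Merge the two most similar groups until one is left; return the merges."""
    groups = {i: [i] for i in range(len(vectors))}  # each document starts alone

    def group_centroid(key):
        return centroid([vectors[m] for m in groups[key]])

    merges = []
    while len(groups) > 1:
        keys = list(groups)
        a, b = max(((x, y) for i, x in enumerate(keys) for y in keys[i + 1:]),
                   key=lambda pair: cosine(group_centroid(pair[0]),
                                           group_centroid(pair[1])))
        merges.append((sorted(groups[a]), sorted(groups[b])))
        groups[a] = groups[a] + groups.pop(b)
    return merges
```

The list returned by `agglomerative` is the merge sequence, read from the most similar pair to the final single group, and so encodes the hierarchy mentioned above.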

Table 2: Web content mining for semi-structured documents

Author | Document Representation | Method | Application
------ | ----------------------- | ------ | -----------
(Craven, 1998) | Relational and ontology | Modified Naïve Bayes; Inductive Logic Programming | Hypertext classification; Learning Web page relation; Learning extraction rules
(… 1999) | … information | … supervised classification algorithms | … graphical classification; Clustering
(Furnkranz, 1999) | Bag of words and hyperlinks information | Rule learning | Hypertext classification
(Joachims, 1997) | Bag of words and hyperlinks information | TFIDF; Reinforcement learning | Hypertext prediction
(Muslea, 1998) | Bag of words, tags, and word positions | Rule learning | Learning extraction rules
(Shavlik & Eliassi-Rad, 1999) | Localized bag of words, and relational | Neural networks with reinforcement learning | Hypertext (homepage) classification
(Singh, 1998) | Concepts and Named entity | Mod. association rule; Classification algorithm | Finding patterns in semi-structured texts
(Soderland, 1996) | Sentences, phrases, and named entity | Rule learning | Learning extraction rules

Data Base View for Semi-Structured Documents

Database approaches to Web mining have focused on techniques for organizing the semi-structured data on the Web into more structured collections of resources, and on using standard database querying mechanisms and data mining techniques to analyze it (Cooley, Mobasher & Srivastava, 1997).

Multilevel Databases:

The main idea behind this approach is that the lowest level of the database contains semi-structured information stored in various Web repositories, such as hypertext documents. At the higher level(s), metadata or generalizations are extracted from the lower levels and organized in structured collections, i.e., relational or object-oriented databases. For example, (Zaiane & Han, 1995) use a multi-layered database where each layer is obtained via generalization and transformation operations performed on the lower layers. (Khosla, Kuhn & Soparkar, 1996) propose the creation and maintenance of meta-databases at each information-providing domain and the use of a global schema for the meta-database. (King & Novak, 1996) propose the incremental integration of a portion of the schema from each information source, rather than relying on a global heterogeneous database schema. The ARANEUS system (Merialdo, Atzeni & Mecca, 1997) extracts relevant information from hypertext documents and integrates it into higher-level derived Web hypertexts, which are generalizations of the notion of database views.

Web Query Systems:

Many Web-based query systems and languages utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing for the queries that are used in World Wide Web searches. W3QL (Konopnicki & Shmueli, 1995) combines structure queries, based on the organization of hypertext documents, and content queries, based on information retrieval techniques. WebLog (Lakshmanan, Sadri & Subramanian, 1996) is a logic-based query language for restructuring information extracted from Web information sources. (Quass, Rajaraman, Sagiv, Ullman & Widom, 1995) query heterogeneous and semi-structured information on the Web using a labeled graph data model. TSIMMIS (Chawathe, Hammer & Widom, 1994) extracts data from heterogeneous and semi-structured information sources and correlates them to generate an integrated database representation of the extracted information.

Agent-Based view

Web mining is often viewed from, or implemented within, an agent paradigm, so it has a close relationship with software agents or intelligent agents. Indeed, some of these agents perform data mining tasks to achieve their goals. According to (Green, Hurst, Nangle, Cunningham, Somers & Evans, 1997) there are three sub-categories of software agents: user interface agents, distributed agents, and mobile agents. The sub-categories relevant to data mining tasks are user interface agents and distributed agents. User interface agents try to maximize the productivity of the current user's interaction with the system by adapting their behavior, which is where issues of personalization arise (Kosala & Blockeel, 2000). User interface agents that can be classified in the web mining agent category are:

Intelligent Search Agents

Several intelligent Web agents have been developed that search for relevant information using domain characteristics and user profiles to organize and interpret the discovered information. Agents such as Harvest (Brown, Danzig, Hardy, Manber & Schwartz, 1994), FAQ-Finder (Hammond & Lytinen, 1995), and Information Manifold (Kirk, 1995) rely either on pre-specified domain information about particular types of documents, or on hard-coded models of the information sources, to retrieve and interpret documents. Agents such as ShopBot (Doorenbos, 1996) and ILA (Internet Learning Agent) interact with and learn the structure of unfamiliar information sources. ShopBot retrieves product information from a variety of vendor sites using only general information about the product domain. ILA learns models of various information sources and translates these into its own concept hierarchy.

Information Filtering/Categorization

A number of Web agents use various information retrieval techniques (Frakes & Baeza-Yates, 1992) and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. HyPursuit (Weiss & Velez, 1996) uses semantic information embedded in link structures and document content to create cluster hierarchies of hypertext documents, and structure an information space. BO (Bookmark Organizer) (Maarek & Shaul, 1996) combines hierarchical clustering techniques and user interaction to organize a collection of Web documents based on conceptual information.

Personalized Web Agents

There are two frequently used approaches for developing intelligent agents that help users find and retrieve relevant information from the Web, namely the content-based and collaborative approaches. In the content-based approach, the system searches for items that match the user's preferences, based on an analysis of item content. In the collaborative approach, the system tries to find users with similar interests and uses them to make recommendations. It does this by analyzing user profiles and sessions or transactions, on the assumption that if some users rate an item highly, other users with similar interests will rate it highly as well. This approach therefore relies mainly on usage data (user ratings). Viewed in this light, the content-based methods can be categorized as Web content mining and the collaborative approach as Web usage mining. However, collaborative approaches might also be used or combined with the Web content (Kosala & Blockeel, 2000).

This category of Web agents learns user preferences and discovers Web information sources based on these preferences (using the content-based approach) and on those of other individuals with similar interests (using the collaborative approach) (Cooley, Mobasher & Srivastava, 1997). A few examples of such agents include WebWatcher (Armstrong & Freitag, 1995), Syskill & Webert (1996), and others (Balabanovic & Shoham, 1995). For example, Syskill & Webert utilizes a user profile and learns to rate Web pages of interest using a Bayesian classifier.
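As a rough illustration of the collaborative approach described above, the following sketch predicts a user's rating for an unseen page as a similarity-weighted average of other users' ratings. It is a minimal sketch and not the method of any agent cited here: the function names, the cosine measure over co-rated items, and the toy ratings are all assumptions made for the example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_rating(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    weighted_sum = weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine(ratings[user], their_ratings)
        weighted_sum += sim * their_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None


# Hypothetical ratings on a 1-5 scale (illustrative data only).
ratings = {
    "ann": {"page_a": 5, "page_b": 1},
    "bob": {"page_a": 5, "page_b": 1, "page_c": 5},
    "eve": {"page_a": 1, "page_b": 5, "page_c": 1},
}
prediction = predict_rating(ratings, "ann", "page_c")
```

Because ann's ratings agree with bob's and disagree with eve's, the prediction for page_c is pulled towards bob's high rating.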

Manual vs. Semi-automatic vs. Automatic view

There are several approaches to structured data extraction, which is also called wrapper generation. The first approach is to manually write an extraction program for each Web site based on the observed format patterns of the site. This approach is very labor intensive and time consuming, and thus does not scale to a large number of sites. The second approach is wrapper induction or wrapper learning, which is currently the main technique. Wrapper learning works as follows: the user first manually labels a set of training pages. A learning system then generates rules from the training pages. The resulting rules are then applied to extract target items from Web pages. Example wrapper induction systems include WIEN (Kushmerick, 2000), Stalker (Muslea, 1999), BWI (Freitag, 2000), WL2 (Cohen, 2002), etc. The third approach is the automatic approach. Since structured data objects on the Web are normally database records retrieved from underlying databases and displayed in Web pages with some fixed templates, automatic methods aim to find patterns/grammars from the Web pages and then use them to extract data. Examples of automatic systems include IEPAD, MDR, RoadRunner (Crescenzi, Mecca & Merialdo, 2001), EXALG (Arasu & Garcia-Molina, 2003), etc.
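The wrapper learning loop described above (label training pages, generate rules, apply the rules) can be illustrated with a toy delimiter-based wrapper. This is a deliberately simplified sketch and not the algorithm of any of the systems named above: it assumes a single target field per record, and the helper names and sample pages are hypothetical.

```python
def common_prefix(strings):
    """Longest string that every element of `strings` starts with."""
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def learn_wrapper(examples):
    """Learn a (left, right) delimiter pair from (page, labeled_value) pairs."""
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)          # first occurrence of the label
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    return common_suffix(lefts), common_prefix(rights)


def extract(page, left, right):
    """Apply the learned delimiters to pull every matching item from a page."""
    if not left or not right:
        raise ValueError("delimiters must be non-empty")
    items, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        items.append(page[start:end])
        pos = end + len(right)
    return items


# Hypothetical training pages with one manually labeled value each.
training = [
    ("<li><b>Widget</b> $5</li>", "Widget"),
    ("<p>Sale!</p><li><b>Gadget</b> $7</li>", "Gadget"),
]
left, right = learn_wrapper(training)
names = extract("<li><b>Gizmo</b> $3</li><li><b>Doohickey</b> $9</li>", left, right)
```

The learned rule is only as general as the labeled pages allow, which is why such systems depend on the training set covering the site's format variations.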

Applications

This section gives a few examples of some real life applications that employ web content mining.

IBM Intelligent Miner for Text

In 1998, IBM for the first time introduced a product in the area of text mining: the Intelligent Miner for Text. It is a software development toolkit - not a ready-to-run application - for building text mining applications. It addresses system integrators, solution providers, and application developers. The toolkit contains the necessary components for “real text mining”: feature extraction, clustering, categorization, and more. But there are also more traditional components, e.g., the IBM Text Search Engine, the IBM Web Crawler, and drop-in Intranet search solutions. These components are essential to build applications that use the information generated in a mining process, e.g., in an Intranet portal for the company (Dorre, Gerstl, & Seiffert, 1999).

Customer Relationship Management

Based on the Intelligent Miner for Text product, IBM offers an application called CRI (Customer Relationship Intelligence) that is designed specifically to help companies better understand what their customers want and what they think about the company itself. After selecting the appropriate set of input documents (e.g. customer complaint letters, phone call transcriptions, e-mail conversations) and converting them to a common standard format, the CRI application uses the feature extraction and clustering tools to derive a database of documents which are grouped according to the similarity of their content. Depending on the purpose of the data analysis, the user might select different parameters for the preprocessing (e.g. concentrate on names and dates) and for the clustering step (e.g. use a more or less restrictive similarity measure). When clustering customer feedback information, the result exposes groups of feedback that share important linguistic elements, e.g. descriptions of a difficulty customers have with a certain product or support organization. Information of this type can be used to identify problematic areas that need to be addressed. Sometimes the cluster itself may provide clues about how the problem could be solved, e.g. do the documents have something in common which is independent of the problem description, such as the location, background, etc. of the customers that raised the issue? As a separate step, after a set of useful clusters has been identified and, probably, manually enhanced, the categorization tool can be used to assign new incoming customer feedback to the identified categories. The CRI application incorporates both the distillation and discovery aspects of text mining. A typical usage scenario starts with an unstructured collection of documents (transcripts, mails, and scanned letters) and creates some interpretable structure by means of clustering, which corresponds to the aspect of discovery. The refinement and extension of the clustering results by means of interpreting the results, tuning the clustering process, and selecting meaningful clusters emphasizes the aspect of distillation (Dorre, Gerstl, & Seiffert, 1999).

Conclusions

The term Web mining has been used to refer to techniques that encompass a broad range of issues. However, while meaningful and attractive, this very broadness has caused Web mining to mean different things to different people (Han, 1996), and there is a need to develop a common vocabulary. Towards this goal, we proposed a definition of Web mining and compared it with the related fields with which it is often confused. Next, we presented a detailed survey of the research in this area, concentrating on Web content mining. We provided different views through which to understand Web content mining, and identified the issues and problems in this area.

References

Agrawal, R., Gehrke, J., Gunopulos, D. & Raghavan, P (1998). Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD Conference on Management of Data, Seattle.

Arasu, A. and Garcia-Molina, H (2003). Extracting Structured Data from Web Pages. SIGMOD-03.

Armstrong, R & Freitag, D (1995). Webwatcher: A learning apprentice for the world wide web. In Proc. AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments.


Brown, C. M, Danzig, B. B, Hardy, D., Manber, U. & Schwartz, M. F (1994). The harvest information discovery and access system. In Proc. 2nd International World Wide Web Conference.

Bunescu, R., Mooney, R (2004). Collective Information Extraction with Relational Markov Networks. ACL-2004.

Chakrabarti, S (2000). Data mining for hypertext: A tutorial survey. SIGKDD Explorations.

Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J. & Widom, J (1994). The tsimmis project: Integration of heterogeneous information sources. In Proc. IPSJ Conference, Tokyo.

Cimiano, P., Handschuh, S. & Staab, S (2004). Towards the Self-Annotating Web. WWW-04.

Cohen, W. W (1995). Learning to classify English text with ilp methods. In Advances in Inductive Logic Programming (Ed. L. De Raedt).

Cooley, R., Mobasher, B. & Srivastava, J (1997). Web Mining: Information and Pattern Discovery on the World Wide Web. ICTAI.

Cowie, J & Lehnert, W (1996). Information extraction. Communications of the ACM.

Crescenzi, V., Mecca, G. & Merialdo, P (2001). ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites. VLDB-01.

Cutting, D. R, Karger, D. R, Pedersen, J. O. & Tukey, J. W (1992). Scatter/gather: A cluster-based approach to browsing large document collections. In Annual International Conference on Research and Development in Information Retrieval, Denmark.

Deerwester, S, Dumais, S, Furnas, G, Landauer, T & Harshman, R (1990). Indexing by latent semantic analysis.

DeRose, S (1999). What do those weird XML types want, anyway? Keynote address, VLDB 1999, Edinburgh, Scotland, Sept.

Doorenbos, R. B. & Etzioni, O. & Weld, D. S (1996). A scalable comparison shopping agent for the world wide web. Technical Report 96-01-03, University of Washington, Dept. of Computer Science and Engineering.

Dumais, S, Platt, J, Heckerman, D & Sahami, M (1998). Inductive algorithms and representations for text categorization.

El-Beltagy, S, Rafea, A & Abdelhamid, Y. Using Dynamically Acquired Background Knowledge For Information Extraction And Intelligent Search Management. Central Lab for Agricultural Expert Systems, Agricultural Research Center, Ministry of Agriculture and Land Reclamation.

Etzioni, O (1996). The world wide web: Quagmire or gold mine. Communication of the ACM, 39(11):65-68.

Feldman, R & Dagan, I (1995). Knowledge discovery in textual databases (KDT). In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95).

Frakes, W. B. & Baeza-Yates, R (1992). Information Retrieval Data Structures and Algorithms. Prentice Hall, Englewood Cliffs, NJ.

Goldman, R, McHugh, J & Widom, J (1999). From semistructured data to XML: Migrating the Lore data model and query language. In Proceedings of the 2nd International Workshop on the Web and Databases (WebDB '99), pages 25-30, Philadelphia. Online at http://www-db.stanford.edu/pub/pape

Green, S., Hurst, L., Nangle, B., Cunningham, P., Somers, F & Evans, R (1997). A review technical report.

Hammond, K, Burke, R., Martin, C. & Lytinen, S (1995). Faq-finder: A case-based approach to knowledge navigation. In Working Notes of the AAAI Spring Symposium: Information Gathering from Heterogeneous, Distributed Environments. AAAI Press.

Hearst, M. A (1999). Untangling text data mining. In Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics.

Honkela, T, Kaski, S, Lagus, K & Kohonen, T (1997). Websom – self-organizing maps of document collections.

Jain, A. K & Dubes, R. C (1988). Algorithms for Clustering Data. Prentice Hall

Khosla, I., Kuhn, B. & Soparkar, N (1996). Database search using information mining. In Proc. of 1996 ACM-SIGMOD Int. Conf. on Management of Data.

King, R. & Novak, M (1996). Supporting information infrastructure for distributed, heterogeneous knowledge discovery. In Proc. SIGMOD 96 Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada.

Kirk, T, Levy, A. Y, Sagiv, Y. & Srivastava, D (1995). The information manifold. In Working Notes of the AAAI Spring Symposium: Information Gathering from Heterogeneous, Distributed Environments. AAAI Press.

Konopnicki, D. & Shmueli, O (1995). W3qs: A query system for the world wide web. In Proc. of the 21st VLDB Conference, pages 54-65, Zurich.


Kosala, R & Blockeel, H (July 2000). Web Mining Research: A Survey. SIGKDD Explorations.

Kushmerick, N (1999). Gleaning the web. IEEE intelligent systems

Kwok, C., Etzioni, O. & Weld, D (2000). Scaling Question Answering to the Web. WWW-00.

Lakshmanan, L., Sadri, F. & Subramanian, I. N (1996). A declarative language for querying and restructuring the web. In Proc. 6th International Workshop on Research Issues in Data Engineering: Interoperability of Nontraditional Database Systems (RIDE-NDS'96).

Liu, B & Chang, K (2004). Editorial: Special Issue on Web Content Mining. SIGKDD Explorations special issue on Web Content Mining, Dec, 2004.


Segmenting Web Documents: A Survey

2

Introduction

Web documents can be viewed as complex objects which often contain multiple entities, each of which can represent a standalone unit. However, most information processing applications developed for the web consider web pages to be the smallest indivisible units. This is best illustrated by web information retrieval engines, whose results are presented as links to web documents rather than to the exact regions within these documents that directly match a user’s query. Better retrieval performance can be achieved by treating the page not as an indivisible unit but as having an underlying semantic structure with topically coherent segments as atoms, as this results in a more targeted model of search with very specific results being presented. In fact, many other applications can benefit from operating on the level of semantically independent units rather than on an entire web document (which may contain many such units). Such applications include browsers for non-PC terminals such as cell phones and PDAs, as well as text summarization applications. It is essential, however, for any application that operates on such a level to first segment a web document into its standalone components. The objective of this report is thus to review the various approaches that have been adopted in carrying out the segmentation task and to identify the advantages and disadvantages associated with each.

Historical Background

Research on means for segmenting text documents commenced at the beginning of the nineties for the purpose of achieving passage retrieval for question answering systems and boundary detection for news streams. Research in this area had to deal with mostly unstructured text, so despite the extensive work carried out, the accuracy of the segmentation results it produced can be considered limited.

With the enormous move towards publishing information on the web came a widespread adoption of HTML for document presentation. As a result, new approaches for segmentation were developed to make use of the new presentation features found in these documents. Roughly speaking, attempts at segmenting web documents began around 1997 and are still in progress.

The Nature of Web Pages

Usually a Web page is an HTML document that contains a sequence of tag delimited data elements. Initially, these tags emphasized semantic aspects of a document rather than its physical aspects. They identified parts of a document according to the meaning of those parts in relation to the document as a whole. When you applied the tags, it was as if you were saying, this is a title, and this is a header, and this is a paragraph, and this is an ordered list, etc. So most of the tags did not address the way a document or its parts should appear.


But major browser manufacturers soon introduced "extensions" to the HTML language in the form of new tags that gave users more physical formatting power. While these have been very popular and useful, they have also led to new problems like increasing levels of clutter and complexity within HTML. Also, because of the flexibility of the HTML syntax, a lot of web pages do not obey the W3C HTML specifications, which might cause the HTML tags to poorly reflect the actual semantic structure of a page.

When adopting a segmentation approach, the challenge is thus to somehow utilize HTML tags while addressing the limitations that currently exist.

Segmentation Approaches

Text-based Segmentation

Segmentation techniques that fall under this heading can be divided into three groups: similarity-based, lexical chain-based, and feature-based techniques. The following subsections give a brief overview of each.

Similarity-based Techniques

In similarity-based techniques, a similarity measure is used to compute similarities between text blocks. Blocks can be sentences, paragraphs or a predefined number of words. Section boundaries are then identified where dips in the similarities occur. An example of this approach is TextTiling (Hearst, 1994), an algorithm for partitioning expository texts into coherent units which reflect the subtopic structure of the texts. The algorithm uses the cosine similarity measure between term vectors to define similarity between blocks. Two important parameters for the algorithm are the pseudosentence length w and the block size k. For the first parameter, the text is subdivided into pseudosentences of a predefined size w rather than actual syntactically determined sentences, to avoid normalization problems. Setting w to 20 tokens per pseudosentence was found to work best for many texts.

The second parameter, the block size k, is the number of pseudosentences that are grouped together into a block to be compared against an adjacent group of pseudosentences. The work showed that this value varies from text to text. It can be set to the average paragraph length (in pseudosentences); setting k to 6 pseudosentences per block was found to work well for many texts. Actual paragraphs are not used, to avoid the effect of differences in length on the comparison.

A sliding window (with size equal to the predefined block size) that moves one pseudosentence per step is used to compare adjacent blocks. A similarity value is computed for every pseudosentence gap: it measures how similar the block ending with the pseudosentence before the gap is to the adjacent block that follows it. Segment boundaries are then assigned to pseudosentence gaps according to how sharp a change occurs on both sides of the gap. To compute this, the algorithm looks at the scores of the gaps to the left of the current gap as long as their values are increasing. When the values to the left peak, the difference between the score at the peak and the score at the current gap is recorded. The same procedure is applied to the gaps to the right of the current gap. The relative height of the peak to the right is added to the relative height of the peak to the left, and the sum is used as an indicator of how sharp a change occurs on both sides of the gap.
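The gap scoring procedure just described can be sketched in a few lines. This is a minimal illustration rather than Hearst's implementation: it assumes pre-tokenized input, uses raw term counts as the term vectors, climbs over plateaus as well as strict increases when looking for peaks, and leaves out any smoothing or boundary cutoff. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def pseudosentences(tokens, w=20):
    """Split a token list into fixed-size pseudosentences of w tokens."""
    return [tokens[i:i + w] for i in range(0, len(tokens), w)]


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_scores(pseudo, k=6):
    """Similarity of the k-pseudosentence blocks on either side of each gap."""
    scores = []
    for gap in range(1, len(pseudo)):
        left = Counter(t for ps in pseudo[max(0, gap - k):gap] for t in ps)
        right = Counter(t for ps in pseudo[gap:gap + k] for t in ps)
        scores.append(cosine(left, right))
    return scores


def depth_scores(scores):
    """Sum of the heights of the nearest left and right peaks above each gap."""
    depths = []
    for i, score in enumerate(scores):
        left = i
        while left > 0 and scores[left - 1] >= scores[left]:
            left -= 1
        right = i
        while right < len(scores) - 1 and scores[right + 1] >= scores[right]:
            right += 1
        depths.append((scores[left] - score) + (scores[right] - score))
    return depths


# Toy text with an abrupt vocabulary change after 40 tokens.
tokens = ["cat"] * 40 + ["dog"] * 40
pseudo = pseudosentences(tokens, w=10)
depths = depth_scores(gap_scores(pseudo, k=2))
deepest_gap = depths.index(max(depths))
```

In this toy input the deepest gap is the one between the fourth and fifth pseudosentences, which is where the vocabulary changes.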

To evaluate this work, Hearst compares the results of the TextTiling algorithm against judgments made by seven human readers for each of thirteen magazine articles which satisfied the length criteria (between 1800 and 2500 words) and which contained little structural demarcation. The algorithm is evaluated according to how many true boundaries are selected out of the total selected (precision) and how many true boundaries are found out of the total possible (recall). The algorithm achieves an average precision of 0.66 (standard deviation 0.18) and an average recall of 0.61 (standard deviation 0.13), compared with an average precision of 0.81 (standard deviation 0.06) and an average recall of 0.71 (standard deviation 0.06) for the readers’ judgments.

(Brants et al, 2002) define a method that combines the probabilistic latent semantic analysis (PLSA) model with the method based on similarity between pairs of adjacent blocks. In the previous work, Hearst computes similarities between adjacent blocks based on term frequency vectors. In contrast, this work uses the probability distribution of words as the term weights to compute the similarity between blocks. The PLSA model is used here to calculate the probability distribution of the words in a block b using this equation:

$$P(w \mid b) = \sum_{z} P(w \mid z)\, P(z \mid b)$$

Where:

z is a latent variable representing the possible topics in a corpus.

P(w|z) and P(z|b) are the two parameters of the PLSA model, where P(w|z) is the probability of a word w given a topic z and P(z|b) is the probability of a topic z given a block b.

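The two parameters of the PLSA model, P(w|z) and P(z|b), are estimated using the iterative Expectation-Maximization (EM) algorithm, which alternates between computing the posterior probability of each topic for every observed word-block pair (E-step) and re-estimating P(w|z) and P(z|b) from those posteriors (M-step).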

The idea behind the PLSA model is that it converts the term frequency vector f(b,w) of each block b into the two parameters of the model, P(w|z) and P(z|b), by introducing a latent variable z that represents the possible topics. This connects words with topics and so gives the representation a conceptual (topic-level) dimensionality. The two parameters are then combined back into a single vector P(w|b), so that information about semantically similar words contributes to the probability distribution of words in a block.
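To make the EM estimation of P(w|z) and P(z|b) more concrete, the following is a minimal pure-Python sketch of PLSA fitting on block term counts. It is an illustration under simplifying assumptions (random initialization, a fixed number of iterations, no tempering or held-out data) and is not the training procedure of (Brants et al, 2002); the function names and toy counts are hypothetical.

```python
import random


def normalize(values):
    """Scale non-negative numbers so that they sum to 1."""
    total = sum(values) or 1.0
    return [v / total for v in values]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit P(w|z) and P(z|b) with EM; counts is one {word: frequency} dict per block."""
    rng = random.Random(seed)
    vocab = sorted({w for block in counts for w in block})
    p_w_z = [dict(zip(vocab, normalize([rng.random() + 0.01 for _ in vocab])))
             for _ in range(n_topics)]
    p_z_b = [normalize([rng.random() + 0.01 for _ in range(n_topics)])
             for _ in counts]
    for _ in range(n_iter):
        acc_w_z = [dict.fromkeys(vocab, 0.0) for _ in range(n_topics)]
        acc_z_b = [[0.0] * n_topics for _ in counts]
        for b, block in enumerate(counts):
            for w, freq in block.items():
                # E-step: posterior P(z | b, w) for this word-block pair.
                post = normalize([p_w_z[z][w] * p_z_b[b][z] for z in range(n_topics)])
                for z in range(n_topics):
                    acc_w_z[z][w] += freq * post[z]
                    acc_z_b[b][z] += freq * post[z]
        # M-step: re-estimate both parameters from the expected counts.
        p_w_z = [dict(zip(vocab, normalize([acc[w] for w in vocab]))) for acc in acc_w_z]
        p_z_b = [normalize(row) for row in acc_z_b]
    return p_w_z, p_z_b


def word_distribution(p_w_z, p_z_b, b):
    """P(w|b) = sum over z of P(w|z) * P(z|b)."""
    return {w: sum(p_w_z[z][w] * p_z_b[b][z] for z in range(len(p_w_z)))
            for w in p_w_z[0]}


# Hypothetical term counts for four blocks.
counts = [{"cat": 5, "fur": 3}, {"cat": 4, "fur": 4},
          {"stock": 5, "bond": 3}, {"stock": 4, "bond": 4}]
p_w_z, p_z_b = plsa(counts, n_topics=2)
dist_block0 = word_distribution(p_w_z, p_z_b, 0)
```

The resulting P(w|b) is a smoothed distribution over the whole vocabulary, so a block can assign probability to words it never contains if they share a topic with words it does contain.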

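In contrast to TextTiling, which uses cosine similarity, this work uses a Clarity-based similarity metric (Croft et al, 2001) to compute the similarity between two adjacent blocks using this equation:

$$\mathrm{sim}(b_l, b_r) = -\mathrm{KL}(P_l \,\|\, P_r) + \mathrm{KL}(P_l \,\|\, GE)$$

Where:

P_l and P_r are the probability distributions of words in the left block b_l and the right block b_r.

GE is the probability distribution of words for “general English” as derived from the training corpus.

KL is the Kullback-Leibler divergence (Kullback et al, 1951), which measures the distance between two probability distributions P and Q as $\mathrm{KL}(P \,\|\, Q) = \sum_{w} P(w) \log \frac{P(w)}{Q(w)}$; it is zero when the two distributions are identical and grows as they diverge.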

The idea behind the clarity measure is to give credit to similar pairs of blocks with term distributions that are very different from general English, and to discount similar pairs of documents with term distributions that are close to general English, which can be interpreted as being the noise. The distance to general English is named clarity.
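The effect of the clarity term can be checked numerically. The sketch below assumes that the clarity-based similarity takes the form -KL(P_l || P_r) + KL(P_l || GE), where P_l and P_r are the word distributions of the left and right blocks; that form, the function names, and the toy distributions are assumptions made for illustration.

```python
from math import log


def kl(p, q):
    """KL(p || q) = sum_w p(w) * log(p(w) / q(w)); q must be > 0 wherever p is."""
    return sum(pw * log(pw / q[w]) for w, pw in p.items() if pw > 0)


def clarity_similarity(p_left, p_right, general):
    """Assumed form: -KL(P_l || P_r) + KL(P_l || GE)."""
    return -kl(p_left, p_right) + kl(p_left, general)


# Hypothetical word distributions over a three-word vocabulary.
general_english = {"the": 0.5, "cat": 0.25, "stock": 0.25}
topical = {"the": 0.2, "cat": 0.7, "stock": 0.1}

score_topical = clarity_similarity(topical, topical, general_english)
score_generic = clarity_similarity(general_english, general_english, general_english)
```

Both pairs consist of identical blocks, yet the topical pair scores higher than the pair that looks like general English, which is the discounting behaviour described above.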

This work was evaluated on two corpora that have been used before for the same task. The first is the Brown Corpus, a collection of 500 American English text samples of approximately 2000 words each. The second is the Reuters-21578 corpus.

The evaluation metric used is the error probability, which is an estimate of the probability that a randomly chosen pair of words a fixed number of words apart is erroneously classified, i.e. placed in the same segment when the two words belong to different segments, or in different segments when they belong to the same one. The results of this work are compared with those of TextTiling on the Reuters-21578 corpus. This work achieves an error rate of 1.25% against 4.30% for TextTiling, a relative reduction of 71% ((4.30 - 1.25) / 4.30). Two main differences between this PLSA-based model and TextTiling may explain the large difference:

1. TextTiling uses the cosine similarity metric while this work uses the Clarity-based metric, which was found to be the most effective metric for this work.

2. TextTiling depends only on term frequency to find the similarity between two adjacent blocks, which ignores semantic relationships between terms, like synonymy and polysemy, that in turn affect the similarity between blocks. The PLSA-based model, on the other hand, aims to discover the meaning behind the words by clustering them into classes.
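The error probability used in this evaluation, and the relative error reduction quoted for the two systems, can be sketched as follows. This is an illustrative reading of the metric: it assumes each word is labeled with the identifier of the segment it belongs to and that the distance k is fixed in advance and smaller than the text length. The function names are hypothetical.

```python
def error_probability(reference, hypothesis, k):
    """Fraction of word pairs k apart whose same-segment status is misjudged."""
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same words")
    pairs = len(reference) - k
    errors = 0
    for i in range(pairs):
        same_in_reference = reference[i] == reference[i + k]
        same_in_hypothesis = hypothesis[i] == hypothesis[i + k]
        if same_in_reference != same_in_hypothesis:
            errors += 1
    return errors / pairs


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


# Segment identifiers for eight words; the true boundary is after word 4.
reference = [0, 0, 0, 0, 1, 1, 1, 1]
no_boundary = [0] * 8
missed = error_probability(reference, no_boundary, k=2)

# Error rates reported in the text: 4.30% for TextTiling, 1.25% for this work.
reduction = relative_reduction(4.30, 1.25)
```

Unlike exact-match precision and recall, this metric penalizes a boundary placed slightly off less than one that is missed altogether, because fewer word pairs straddle the discrepancy.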

Other similarity metrics that can be used to compute the similarity between two adjacent blocks are the variational or L1 distance (Li & Yamanishi, 2000), the Hellinger distance, also known as Bhattacharyya distance (Kailath, 1967), and the Jensen-Shannon divergence (Kullback et al, 1951).

Lexical Chain Techniques

Lexical chain techniques use relationships between words such as synonymy, specialization/generalization and part/whole to construct lexical chains. (Stokes et al, 2002) define a lexical chain as a sequence of lexicographically related word occurrences where every word occurs within a set distance from the previous word. The chaining algorithm proceeds as follows: the first token in the input stream forms the first lexical chain, and each subsequent token is added to an existing chain if it is related to at least one other token in that chain by a semantic relationship such as synonymy, specialization/generalization or part/whole; otherwise the token becomes the seed of a new chain. This process continues until all keywords in the text have been chained. All the associations between words are obtained from the WordNet taxonomy. A relationship between two tokens is only considered valid if they appear no more than 600 words away from each other in the original text for the synonymy relationship, and no more than 500 words away from each other for all other relationships, since these are weaker associations than synonymy. Further constraints are imposed on relationships between words that have a path length greater than 1 in the WordNet taxonomy. One reason behind the distance constraints may be to lessen the effect of misidentified word associations due to the ambiguous nature of word forms, e.g. associating bank with money when bank refers to a river bank. Each constructed chain is annotated with the positions of the sentences at which it begins and ends. Segment boundaries are selected based on the hypothesis that a high concentration of chain begin and end points exists on the boundary between two distinct segments. Based on this hypothesis, the boundary strength w(n, n+1) between each pair of adjacent sentences in a text is computed as the product of the number of lexical chains that end at sentence n and the number of chains that begin at sentence n+1, and is then used to select segment boundaries. This work was evaluated on a test set created by concatenating distinct CNN broadcast news stories from various CNN news programs (Night Time, DayLight, International News etc.).
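The boundary strength computation described above is simple enough to show directly. The sketch assumes the chains have already been built and are represented only by the sentence indices at which they begin and end; the function name and the sample chains are hypothetical.

```python
from collections import Counter


def boundary_strengths(chains, n_sentences):
    """w(n, n+1) = (#chains ending at sentence n) * (#chains beginning at n+1)."""
    begins = Counter(first for first, _ in chains)
    ends = Counter(last for _, last in chains)
    return [ends[n] * begins[n + 1] for n in range(n_sentences - 1)]


# Hypothetical chains as (first_sentence, last_sentence) pairs over six sentences.
chains = [(0, 2), (0, 2), (1, 2), (3, 5), (3, 4), (4, 5)]
strengths = boundary_strengths(chains, n_sentences=6)
```

Here three chains end at sentence 2 and two begin at sentence 3, so the gap between them receives the only non-zero strength.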

The evaluation metrics used are precision and recall. It has been noted in the text segmentation literature that there are a number of drawbacks to using recall and precision in the text segmentation domain, since they fail to take into consideration near-boundary misses, i.e. if a suggested system boundary is just one sentence away from the true segment boundary, the system is penalized just as heavily as if it had missed the boundary by 10 sentences. The evaluation methodology is therefore based on work done by Reynar (Reynar, 1998), who considers a system boundary correct if it lies within a certain fixed window of allowable error around a true boundary.
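A window-tolerant version of precision and recall along these lines can be sketched as below. The greedy one-to-one matching of system boundaries to true boundaries is an assumption made for the example and may differ from the exact matching rule used in the cited evaluations; the function name is hypothetical.

```python
def windowed_precision_recall(reference, hypothesis, window):
    """Precision and recall where a boundary within `window` sentences counts as correct."""
    unmatched = sorted(reference)
    hits = 0
    for boundary in sorted(hypothesis):
        # Each true boundary may be credited to at most one system boundary.
        match = next((r for r in unmatched if abs(r - boundary) <= window), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    precision = hits / len(hypothesis) if hypothesis else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Boundaries expressed as sentence indices (illustrative values).
true_boundaries = [10, 20, 30]
system_boundaries = [11, 20, 45]
exact = windowed_precision_recall(true_boundaries, system_boundaries, window=0)
tolerant = windowed_precision_recall(true_boundaries, system_boundaries, window=1)
```

With a window of zero only the boundary at sentence 20 is counted, while a window of one also accepts the boundary placed at sentence 11.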

For comparison, the results of this work (the SeLeCT system) are compared with the results of TextTiling and of a random segmentation. The recall and precision values below are reported for increasing sizes of the window of allowable error in boundary position on the concatenated CNN news stories.

Error | SeLeCT Recall | SeLeCT Precision | TextTiling Recall | TextTiling Precision | Random Segmentation Recall | Random Segmentation Precision
+/-0 | 62.7 | 36.6 | 19.7 | 13.3 | 7.1 | 7.1
+/-1 | 69.2 | 40.4 | 61.6 | 41.5 | 18.4 | 18.4
+/-2 | 77.4 | 45.2 | 79.9 | 53.8 | 29.4 | 29.4
+/-3 | 81.3 | 47.5 | 88.4 | 59.5 | 39.1 | 39.1
+/-4 | 84.3 | 49.2 | 94.1 | 63.4 | 45.9 | 45.9
+/-5 | 85.4 | 49.9 | 96.2 | 64.8 | 51.5 | 51.5
+/-6 | 87.4 | 51.0 | 97.3 | 65.5 | 55.7 | 55.7
+/-7 | 88.3 | 52.7 | 97.8 | 65.9 | 59 | 59
+/-8 | 89.3 | 52.1 | 98.2 | 66.1 | 62.4 | 62.4
+/-9 | 89.9 | 52.5 | 98.3 | 66.2 | 64 | 64


Feature-based Techniques

Feature-based techniques utilize a set of features for text blocks in order to detect boundaries. For example, (Beeferman et al, 1999) use an exponential model and generate features using a maximum entropy selection criterion. Most of the features learned are cue-based features that identify a boundary based on the occurrence of words or phrases. They also include a feature that compares the performance of a “long range” language model, which predicts the next word from a long history of preceding text, with that of a “short range” model, which uses only the few most recent words. When the short range model outperforms the long range model, this indicates a boundary, since the earlier text has stopped being a useful predictor. Their method performed well on a number of broadcast news data sets, including the CNN data set from TDT 1997.

(Reynar, 1999) describes a maximum entropy model that combines hand-selected features, including: broadcast news domain cues, number of content word bigrams, number of named entities, number of content words that are WordNet synonyms in the left and right regions, percentage of content words in the right segment that are first uses, whether pronouns occur in the first five words, and whether a word frequency based algorithm predicts a boundary. He found that for the HUB-4 corpus, which is composed of transcribed broadcasts, the combined feature model performed better than TextTiling (Hearst, 1994).

(Kauchak et al, 2005) observed that previous work on text segmentation performed well on broadcast news, expository and synthetic data, which have many properties that help segmentation. These properties include cue phrases, such as “welcome back” and “joining us” found in broadcast news, and strong topic shifts, as in synthetic documents created by concatenating newswire articles. However, no previous work had targeted narrative documents, which have more complicated properties:

1. Small sub-topic shifts, which cause techniques based on word relationships, such as lexical chains (Stokes et al, 2002), to fail, because when topics are similar, words tend to relate too easily.

2. The varied use of words, which results in many unseen terms in the test set. This causes problems for methods that learn a model from the training data, such as those of (Brants et al, 2002) and (Beeferman et al, 1999).

3. The absence of consistent cue phrases at boundaries.

They thus presented work that targets the problem of text segmentation on narrative text. A combination of several features that were found to be useful was used. These include:

Word groups: A word group is the set of all words that have the same parent in the WordNet hierarchy. A binary feature is used for each learned group, based on the occurrence of at least one of the words in the group. This feature is used to better generalize (from the training data) cue words that are similar and tend to occur at boundaries.

Entity groups: For each entity group (e.g. person, city or disease) that occurs significantly often at a boundary, a feature indicating whether or not an entity of that group occurs in the sentence is used.

Pronoun: If the sentence contains a pronoun within five words of the beginning of the sentence, then it is likely to be connected to the previous sentence.

Conversation: If the sentence is part of a conversation, then it will not be the beginning of a new segment.


Paragraph: If the sentence is not at the beginning of a paragraph, then it is not likely to be the beginning of a new segment.

A support vector machine (SVM) classifier is then trained on the above features to decide, for each sentence boundary, whether it is also a segment boundary.
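To make the sentence-level features more concrete, the following is a minimal sketch of how the pronoun, conversation and paragraph features might be computed for one sentence. It is not the implementation from the cited work: the pronoun list, the use of quotation marks as a stand-in for "part of a conversation", and the name `sentence_features` are all assumptions made for illustration. The resulting feature dictionaries would then be passed to a classifier such as an SVM.

```python
import re

# Assumption: a small illustrative pronoun list, not the one used in the cited work.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "its", "their"}

# Assumption: a quotation mark anywhere in the sentence is treated as a
# crude signal that the sentence belongs to a conversation.
QUOTE_MARKS = '"\u201c\u201d'


def sentence_features(sentence, starts_paragraph):
    """Return three binary features for one sentence as a dict."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    return {
        # Pronoun within the first five words of the sentence.
        "pronoun_in_first_five": any(w in PRONOUNS for w in words[:5]),
        # Sentence contains a quotation mark.
        "in_conversation": any(q in sentence for q in QUOTE_MARKS),
        # Sentence is the first one in its paragraph (supplied by the caller).
        "starts_paragraph": bool(starts_paragraph),
    }
```

Word group and entity group features are omitted here because they depend on WordNet and on a named-entity recognizer.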

This work was evaluated using 1000 articles from Groliers Encyclopedia. The evaluation metrics used were:

Word error probability: Estimates the probability that a randomly chosen pair of words k words apart is incorrectly classified, i.e. that the hypothesized segmentation disagrees with the reference segmentation on whether the two words belong to the same segment.

Sentence error probability: Estimates the probability that a randomly chosen pair of sentences s sentences apart is incorrectly classified. This is the same measure as above, computed over sentences instead of words.

WindowDiff: Uses a sliding window over the data and measures the difference between the number of hypothesized boundaries and the number of actual boundaries within the window. This metric addresses the criticism that the word error probability metric is biased, for example in penalizing false negatives (missed boundaries) more heavily than false positives (erroneous additional boundaries) (Pevzner and Hearst, 2002).
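The metrics above can be illustrated with a short sketch. It assumes that a segmentation is given as a list of segment lengths (in words or sentences) and that the window size k is supplied by the caller; a common choice is half the average reference segment length. The function names `error_probability` and `window_diff` are illustrative only.

```python
def to_segment_ids(segment_lengths):
    """Expand segment lengths, e.g. [3, 2], into per-unit ids [0, 0, 0, 1, 1]."""
    ids = []
    for seg_id, length in enumerate(segment_lengths):
        ids.extend([seg_id] * length)
    return ids


def _prepare(ref_lengths, hyp_lengths, k):
    ref, hyp = to_segment_ids(ref_lengths), to_segment_ids(hyp_lengths)
    if len(ref) != len(hyp):
        raise ValueError("reference and hypothesis must cover the same units")
    if not 0 < k < len(ref):
        raise ValueError("k must be between 1 and the number of units minus 1")
    return ref, hyp, len(ref) - k


def error_probability(ref_lengths, hyp_lengths, k):
    """Fraction of unit pairs k apart on which the two segmentations disagree
    about whether both units lie in the same segment."""
    ref, hyp, windows = _prepare(ref_lengths, hyp_lengths, k)
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k]) for i in range(windows)
    )
    return errors / windows


def window_diff(ref_lengths, hyp_lengths, k):
    """Fraction of windows of size k in which the number of boundaries differs.
    Segment ids increase by one at each boundary, so the id difference across
    a window equals the number of boundaries inside it."""
    ref, hyp, windows = _prepare(ref_lengths, hyp_lengths, k)
    errors = sum(
        (ref[i + k] - ref[i]) != (hyp[i + k] - hyp[i]) for i in range(windows)
    )
    return errors / windows
```

For a reference of two segments of three units each and a hypothesis that adds one spurious boundary (segment lengths 3, 1, 2), with k = 2, `error_probability` gives 0.25 while `window_diff` gives 0.5, because one window contains one reference boundary but two hypothesized ones, a disagreement that only the boundary count detects.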

For comparison, this work was evaluated against TextTiling and the PLSA model. The results of the evaluation are as follows:

[Results table not recoverable from the source.]

Advantages of Text-based methods

Text-based methods do not rely on any formatting on the part of the information producer in order to work. Instead, they make use of the information that can be extracted from the text itself to detect boundaries between segments. This information includes repetition of character sequences, patterns of word and word n-gram repetition, word frequency, the presence of cue words and phrases, and the use of synonyms. These methods can thus be used whenever text is available in an unstructured form.

Disadvantages of Text-based methods

1. Using these methods, it is difficult to detect segment boundaries with a high degree of accuracy (Rupesh, 2005).

2. Different types of text, such as narrative text, expository text and broadcast news stories, need different methods of segmentation (Kauchak et al, 2005).


Tag-based Segmentation

Tag-based segmentation refers to an approach that divides a page based on the types of tags it includes. Utilized tags include <P> (paragraph), <TABLE> (table), <UL> (list), <H1> to <H6> (heading), etc.
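As a rough illustration of the idea, the sketch below splits an HTML string into flat segments using Python's standard `html.parser` module. It is a minimal sketch under stated assumptions, not any of the systems surveyed here: the set of segment tags is taken from the list above, the markup is assumed to close its tags properly, and nested segment tags are folded into their outermost segment. The names `TagSegmenter` and `segment_page` are hypothetical.

```python
from html.parser import HTMLParser

# Tags treated as segment delimiters (HTMLParser reports tag names in lower case).
SEGMENT_TAGS = {"p", "table", "ul", "h1", "h2", "h3", "h4", "h5", "h6"}


class TagSegmenter(HTMLParser):
    """Collects (tag, text) pairs for each outermost segment tag."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._open = []    # stack of currently open segment tags
        self._buffer = []  # text pieces seen inside the current segment

    def handle_starttag(self, tag, attrs):
        if tag in SEGMENT_TAGS:
            if not self._open:
                self._buffer = []
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in SEGMENT_TAGS and self._open and self._open[-1] == tag:
            self._open.pop()
            if not self._open:
                # Join text pieces and normalize whitespace.
                text = " ".join(" ".join(self._buffer).split())
                if text:
                    self.segments.append((tag, text))

    def handle_data(self, data):
        if self._open:
            self._buffer.append(data)


def segment_page(html):
    parser = TagSegmenter()
    parser.feed(html)
    parser.close()
    return parser.segments
```

Real web pages often omit closing tags, so a production system would need a more tolerant parser than this sketch assumes.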

(Diao et al, 2000) aim to build a Web query processing system with learning capabilities that retrieves only those segments of a web page that meet the requirements of an entered query. User queries are provided in the form of keywords, and search engines are employed to find URLs of Web sites that might contain the required information. The first few URLs are presented to the user for browsing. Through these first few URLs, the query processor learns both the information required by the user and the way that the user navigates through hyperlinks to locate such information.

The web page is presented to the user as a segment tree, in which the web page is partitioned into segments, each of which serves as a candidate answer to the entered query. The four major segment types used are paragraph, table, list and heading. Segments can be nested, that is, a segment can include a number of sub-segments. An HTML document is the largest segment. Hyperlinks are treated in this tree as segments and are associated with their parent segments, i.e. the smallest segments that contain them. If the user chooses a hyperlink, the system proceeds to process the new page. The process is repeated until the user marks a query segment or rejects the whole site. User behaviors, either choosing a link or marking a segment, are recorded. The query processor then processes the rest of the URLs using the recorded user behavior and produces accurate query results in the form of segments of Web pages without user involvement.
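The segment tree described above can be sketched as a small nested data structure. The following is an illustrative sketch only and not the cited system: the `Segment` class, the tag-to-type mapping and the `build_segment_tree` function are assumptions, and the markup is assumed to be well formed. Each hyperlink becomes a child of the smallest segment that contains it, as in the description.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumed mapping from HTML tags to the four segment types plus hyperlinks.
KIND = {"p": "paragraph", "table": "table", "ul": "list", "a": "hyperlink"}
KIND.update({f"h{i}": "heading" for i in range(1, 7)})


@dataclass
class Segment:
    kind: str
    text: str = ""
    href: str = ""
    children: list = field(default_factory=list)


class SegmentTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        # The whole HTML document is the largest segment (the root).
        self.root = Segment("document")
        self._stack = [(None, self.root)]  # (tag, segment) pairs

    def handle_starttag(self, tag, attrs):
        if tag in KIND:
            href = dict(attrs).get("href") or ""
            seg = Segment(KIND[tag], href=href)
            # Attach to the smallest enclosing segment, then descend into it.
            self._stack[-1][1].children.append(seg)
            self._stack.append((tag, seg))

    def handle_endtag(self, tag):
        if tag in KIND and len(self._stack) > 1 and self._stack[-1][0] == tag:
            self._stack.pop()

    def handle_data(self, data):
        # Text is stored on the innermost open segment.
        self._stack[-1][1].text += data


def build_segment_tree(html):
    builder = SegmentTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

A query processor in the spirit of the description could then walk this tree, offering each segment as a candidate answer and following a hyperlink segment's `href` when the user selects it.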

(Buyukkokten et al, 2001) define a method to transform a web page into a hierarchy of individual content units called Semantic Textual Units, or STUs. First, STUs are built by analyzing syntactic features of an HTML document, such as text contained within paragraph (<P>), table cell (<TD>), and frame component (<FRAME>) tags. These features are then arranged into a hierarchy based on the HTML formatting of each STU. STUs that contain HTML header tags (<H1>, <H2>, and <H3>) or bold text (<B>) are given a higher level in the hierarchy than plain text.

Heading tags are among the most important tags for the semantic structuring of web pages. They gain this value because they not only mark the boundaries between sections in the document but also represent each section. Exploitation of heading tags was presented in the work by (El-Beltagy et al, 2004) and (Tatsumi et al, 2005).

(El-Beltagy et al, 2004) detected headings based on tag names and then used these headings to delimit sections in a web document in order to achieve targeted search.
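The general idea of heading-based sectioning can be sketched as follows. This is a simplified illustration and not the method of the cited work: it recognizes headings only through the <H1> to <H6> tag names, treats all text up to the next heading as the body of the section, and uses the hypothetical names `split_by_headings` and `find_sections`.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingSectioner(HTMLParser):
    def __init__(self):
        super().__init__()
        # The first entry collects any text that appears before the first heading.
        self.sections = [{"heading": [], "body": []}]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            # Every heading opens a new section.
            self.sections.append({"heading": [], "body": []})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        key = "heading" if self._in_heading else "body"
        self.sections[-1][key].append(data)


def split_by_headings(html):
    """Return a list of {'heading': str, 'body': str} dicts, one per section."""
    parser = HeadingSectioner()
    parser.feed(html)
    parser.close()
    result = []
    for sec in parser.sections:
        heading = " ".join(" ".join(sec["heading"]).split())
        body = " ".join(" ".join(sec["body"]).split())
        if heading or body:
            result.append({"heading": heading, "body": body})
    return result


def find_sections(sections, keyword):
    """Targeted search: keep sections whose heading mentions the keyword."""
    return [s for s in sections if keyword.lower() in s["heading"].lower()]
```

Restricting a keyword search to the sections returned by `find_sections` is one simple way in which headings could support targeted search.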

(Tatsumi et al, 2005) proposed an algorithm that accurately extracts web page headings across various presentation styles, in order to solve the usability problem of browsing web pages designed for PCs on non-PC terminals. For example, a terminal can show only the heading of each section within its display and give direct access to a section's content when its heading is selected. This can be further extended to serve the goal of segmentation.
