An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them
Vangelis Karkaletsis and Constantine D. Spyropoulos
NCSR Demokritos, Institute of Informatics & Telecommunications, 15310 Aghia Paraskevi Attikis, Athens, Greece
{vangelis, costass}@iit.demokritos.gr iit.demokritos.gr/skel
Abstract. The paper presents a platform that facilitates the use of tools for collecting domain specific web pages as well as for extracting information from them. It also supports the configuration of such tools to new domains and languages. The platform provides a user-friendly interface through which the user can specify the domain-specific resources (ontology, lexica, corpora for the training and testing of the tools), train the collection and extraction tools using these resources, and test the tools with various configurations. The platform design is based on the methodology proposed for web information retrieval and extraction in the context of the R&D project CROSSMARC.
1 Introduction
The growing volume of web content in various languages and formats, together with its diversity and lack of structure, has made information and knowledge management a real challenge in the effort to support the information society. Enabling large-scale information extraction (IE) from the Web is a crucial issue for the future of the Internet.
The traditional approach to Web IE is to create wrappers, i.e. sets of extraction rules, either manually or automatically. At run-time, wrappers extract information from unseen collections of Web pages of known layout, and fill the slots of a predefined template. The manual creation of wrappers presents many shortcomings, due to the overhead of writing and maintaining them. On the other hand, the automatic creation of wrappers (wrapper induction) also presents problems, since a re-training of the wrappers is necessary when changes occur in the formatting of the targeted Web site or when pages from a "similar" Web site are to be analyzed. Training an effective site-independent wrapper is an attractive solution in terms of scalability, since any
V. Karkaletsis and C.D. Spyropoulos: An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them, StudFuzz 185, 145–154 (2005)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2005
domain-specific page could be processed, without relying heavily on the hypertext structure.
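The run-time behaviour of a hand-written wrapper can be sketched as a small set of extraction rules applied to a page of known layout. The HTML snippet, slot names and regular expressions below are hypothetical illustrations, not the format of any real site:

```python
import re

# Extraction rules for one known page layout (hypothetical example):
# each template slot maps to a rule that locates its value in the HTML.
TEMPLATE_RULES = {
    "name":  re.compile(r'<h2 class="product">(.*?)</h2>'),
    "price": re.compile(r'<span class="price">([\d.,]+)</span>'),
}

def apply_wrapper(page_html):
    """Fill the slots of a predefined template from a page of known layout."""
    record = {}
    for slot, rule in TEMPLATE_RULES.items():
        match = rule.search(page_html)
        record[slot] = match.group(1) if match else None
    return record

page = '<h2 class="product">Laptop X200</h2> <span class="price">999.00</span>'
print(apply_wrapper(page))  # {'name': 'Laptop X200', 'price': '999.00'}
```

The sketch also shows why maintenance is costly: any change to the site's formatting (e.g. renaming the `price` class) silently breaks the rules.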
The collection of the application-specific web pages that will be processed by the wrappers is also a crucial issue. A collection mechanism is necessary to locate the application-specific web sites and to identify interesting pages within them.
The design and development of web page collection and extraction systems must consider requirements such as enabling adaptation to new domains and languages, facilitating maintenance for an existing domain, providing strategies for effective site navigation, ensuring personalized access, and handling structured, semi-structured or unstructured data. The implementation of a web page collection and extraction mechanism that addresses these issues effectively was the motivation for the R&D project CROSSMARC¹, which was partially funded by the EC. The CROSSMARC work resulted in a system for web information retrieval and extraction that can be trained for new applications and languages, and a customization infrastructure that supports configuration of the system to new domains and languages.
Based on the methodology proposed in CROSSMARC, we started the development of a new platform to facilitate the use of collection and extraction tools as well as their customization. The platform provides a user-friendly interface through which the user can specify the domain-specific resources (ontology, lexica, corpora for the training and testing of the tools), train the collection and extraction tools using these resources, and test the tools with various configurations. The current version of the platform incorporates mainly CROSSMARC tools for the case studies in which it is being tested. However, its open architecture design also enables the incorporation of new tools.
The paper first outlines the CROSSMARC work in relation to other work in the area. It then presents the first version of the platform, as well as some first results from its use in case studies.
2 Related Work
Collection of domain-specific web pages involves the use of focused web crawling and spidering technologies. The motivation for focused crawling comes from the poor performance of general-purpose search engines, which depend on the results of generic Web crawlers. The aim is to adapt the behavior of the search engine to the user requirements. The term "focused crawling" was introduced in [1], where the presented system starts with a set of representative pages and a topic hierarchy and tries to find more instances of interesting topics in the hierarchy by following the links in the seed pages. Another interesting approach to focused crawling is adopted by the InfoSpiders system [4],
¹ http://www.iit.demokritos.gr/skel/crossmarc
a multi-agent focused crawler which uses as starting points a set of keywords and a set of root pages. The crawler implemented according to the CROSSMARC approach involves three different crawlers, which exploit topic hierarchies, keywords from domain ontologies and lexica, and a set of representative pages [8].
While in focused crawling the aim is to adapt the behavior of the search engine to the requirements of a user, in site-specific spidering the spider navigates within a Web site, following best-scored-first links. Each Web page visited is evaluated in order to decide whether it is really relevant to the topic, and its hyperlinks are scored in order to decide whether they are likely to lead to useful pages. Therefore, site-specific spidering involves two decision functions: one that classifies Web pages as being interesting (e.g. laptop offers) or not, and one that scores hyperlinks according to their potential usefulness.
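The spidering loop described above can be sketched as follows. The two decision functions are passed in as callables, and the toy in-memory "site" in the usage example is purely illustrative:

```python
import heapq

def spider(start_url, fetch, get_links, is_relevant, score_link, max_pages=50):
    """Site-specific spidering with the two decision functions described
    above: a binary page classifier (is_relevant) and a hyperlink scoring
    function (score_link). Links are followed best-scored-first via a
    priority queue (scores are negated because heapq is a min-heap)."""
    frontier = [(0.0, start_url)]
    visited, relevant = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if is_relevant(page):            # decision function 1: classification
            relevant.append(url)
        for link in get_links(page):
            if link not in visited:      # decision function 2: link scoring
                heapq.heappush(frontier, (-score_link(link), link))
    return relevant

# Hypothetical in-memory site: url -> (page text, outgoing links).
site = {
    "/":         ("index page", ["/offers", "/about"]),
    "/offers":   ("list of laptop offers", ["/offers/1"]),
    "/offers/1": ("a laptop offer page", []),
    "/about":    ("about us", []),
}
found = spider("/",
               fetch=lambda u: site[u],
               get_links=lambda p: p[1],
               is_relevant=lambda p: "offer" in p[0],
               score_link=lambda l: 1.0 if "offer" in l else 0.0)
print(found)  # ['/offers', '/offers/1']
```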
Thus, the input to the first decision function is a Web page visited by the spider, and its output is a binary decision. This is a typical text classification task. Various machine learning methods have been used for constructing such text classifiers; an up-to-date survey of such approaches is provided in [6].
In CROSSMARC we examined a large number of classification approaches in order to find the most appropriate one for each domain and language.
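As one illustration of this classification task, a minimal multinomial Naive Bayes page filter could look like the following. This is just one of many possible approaches, not the specific classifiers examined in CROSSMARC, and the training snippets are hypothetical:

```python
import math
from collections import Counter

class NaiveBayesFilter:
    """Minimal multinomial Naive Bayes for the page-filtering task:
    classify a page's text as interesting (True) or not (False)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in (True, False):
            score = math.log(self.doc_counts[label] / total_docs)
            n = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Laplace smoothing over the shared vocabulary
                score += math.log((self.word_counts[label][word] + 1)
                                  / (n + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = NaiveBayesFilter()
clf.train("laptop offer price discount", True)
clf.train("cheap laptop offer", True)
clf.train("about our company history", False)
clf.train("contact us company news", False)
print(clf.classify("laptop price offer"))  # True
```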
The second decision function in site-specific spidering is a regression function: its input is the hyperlink, together with its anchor and possibly surrounding text, and its output is a score corresponding to the probability of reaching a product page quickly through this link. As with classification, a variety of machine learning methods are available for learning regression functions. However, in contrast to text classification, the task of hyperlink scoring has not been studied extensively in the literature. Most of the work on scoring and ordering of links refers to Web-wide crawlers rather than site-specific spiders, and is based on the popularity of the pages pointed to by the links being examined. This approach is inappropriate for the spider implemented in CROSSMARC. The only closely relevant work identified in the literature is [5], which uses a type of simplified reinforcement learning to score the hyperlinks met by a Web spider. A reinforcement-learning link scoring methodology was also examined in CROSSMARC and was compared against a rule-based methodology.
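A simplified reinforcement-learning style of link scoring can be sketched with value iteration over a site graph: the value of a page is its immediate reward plus the discounted value of the best page reachable from it, and a link is then scored by the value of its target page. This is only a sketch of the general idea, not the method of [5] or of CROSSMARC; the graph, rewards and discount factor are hypothetical:

```python
def score_links(graph, rewards, gamma=0.5, iterations=20):
    """Value iteration over a site graph: value(page) = reward(page) +
    gamma * max(value of pages linked from it). A hyperlink is scored
    by the value of the page it points to."""
    value = {page: 0.0 for page in graph}
    for _ in range(iterations):
        for page, links in graph.items():
            future = max((value[t] for t in links), default=0.0)
            value[page] = rewards.get(page, 0.0) + gamma * future
    return value

# Hypothetical site graph: only "/offer" carries an immediate reward,
# but "/a" scores well because it leads there in one step.
site_graph = {"/": ["/a", "/b"], "/a": ["/offer"], "/b": [], "/offer": []}
value = score_links(site_graph, rewards={"/offer": 1.0})
print(value)  # {'/': 0.25, '/a': 0.5, '/b': 0.0, '/offer': 1.0}
```

The discount factor gamma encodes the "quickly" in the problem statement: links whose useful pages are several hops away receive exponentially smaller scores.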
Concerning information extraction from web pages, a number of systems have been developed to extract structured data from them. A recent survey of existing web extraction tools can be found in [3], where a classification of these tools is proposed based on the technologies used for wrapper creation or induction. According to [3], tools can be classified into the following categories:
• Languages for wrapper development: these are languages designed for assisting the manual creation of wrappers.
Fig. 1. Classification of web extraction tools (taken from [3]) and CROSSMARC position
• HTML-aware tools: these tools convert a web page into a tree representation that reflects the HTML tag hierarchy. Extraction rules are then applied to the tree representation.
• Wrapper Induction tools: these tools generate delimiter-based rules relying on page formatting features and not on linguistic ones. They present similarities with the HTML-aware tools.
• NLP-based tools: these tools employ natural language processing (NLP) techniques, such as part-of-speech tagging and phrase chunking, to learn extraction rules.
• Ontology-based tools: these tools employ a domain-specific ontology to locate ontology instances in the web page, which are then used to fill the template slots.
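The ontology-based strategy can be sketched as follows; the ontology content (slots, instances, lexical variants) is a hypothetical illustration:

```python
# Toy ontology: template slot -> {normalized instance: lexical variants}.
# Instances located in the page text fill the corresponding template slots.
ONTOLOGY = {
    "manufacturer": {"IBM": ["ibm"], "Toshiba": ["toshiba"]},
    "processor":    {"Pentium 4": ["pentium 4", "p4"]},
}

def extract(page_text):
    """Fill template slots with the ontology instances found in the page."""
    text = page_text.lower()
    template = {}
    for slot, instances in ONTOLOGY.items():
        for instance, variants in instances.items():
            if any(v in text for v in variants):
                template[slot] = instance  # store the normalized instance
                break
    return template

print(extract("New Toshiba laptop with P4 processor"))
# {'manufacturer': 'Toshiba', 'processor': 'Pentium 4'}
```

Note that the output is normalized against the ontology ("P4" becomes "Pentium 4"), which is what makes the extracted records comparable across sites and languages.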
CROSSMARC employs most of the categories of web extraction tools presented in [3] (see Fig. 1). It uses:
• Wrapper Induction (WI) techniques in order to exploit the formatting features of the web pages.
• NLP techniques to exploit linguistic features of the web pages, enabling the processing of domain-specific web pages in different sites and in different languages (multilingual, site-independent).
• Ontology engineering to enable the creation and maintenance of ontologies, language-specific lexica as well as other application-specific resources.
Details on the CROSSMARC extraction tools are presented in [2]. More relevant publications can be found at the project's web site.
Fig. 2. System's agent-based architecture
3 The Platform
The CROSSMARC work resulted in a core system for web information retrieval and extraction that can be trained for new applications and languages, and a customization infrastructure that supports configuration of the system to new domains and languages. The core system implements a distributed, multi-agent, open and multilingual architecture, which is depicted in Fig. 2. It involves components for the identification of interesting web sites (focused crawling), the location of domain-specific web pages within these sites (spidering), the extraction of information about product/offer descriptions from the collected web pages, and the storage and presentation of the extracted information to the end-user according to his/her preferences.
The infrastructure for configuring the system for new domains and languages involves:

• an ontology management system for the creation and maintenance of the ontology, the lexicons and other ontology-related resources;
• a methodology and a tool for the formation of the corpus necessary for the training and testing of the modules in the spidering component;
• a methodology and a tool for the collection and annotation of the corpus necessary for the training and testing of the information extraction components.
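The kind of resource specification such an infrastructure handles might be sketched as follows; the file layout, names and languages are hypothetical illustrations, not the CROSSMARC file formats:

```python
import os

# Hypothetical layout of the domain-specific resources a user might
# specify for one application (here, a "laptop offers" domain).
DOMAIN_RESOURCES = {
    "ontology": "laptop/ontology.xml",
    "lexica":   {"en": "laptop/lexicon_en.xml", "el": "laptop/lexicon_el.xml"},
    "corpora":  {"spidering":  "laptop/corpus/spidering",
                 "extraction": "laptop/corpus/extraction"},
}

def missing_resources(resources, exists=os.path.exists):
    """Report resource paths that are absent, so training is not started
    against an incomplete domain specification."""
    paths = [resources["ontology"],
             *resources["lexica"].values(),
             *resources["corpora"].values()]
    return [p for p in paths if not exists(p)]
```

A check like this would run when the user moves from resource specification to the training step, reporting any path that does not yet exist.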
Based on this work, we started the development of a platform that enables the integration, training and testing of collection and extraction tools (such as the ones developed in CROSSMARC) under a common interface. The experience of building three different applications using the CROSSMARC tools assisted the platform design significantly. These applications concerned the extraction of information from:
• laptop offers in e-retailers' web sites (in four languages),
• job offers in IT companies' web sites (in four languages),
• holiday packages in the sites of travel agencies (in two languages).

Fig. 3. Ontology tab: invoking the ontology management system
According to the CROSSMARC methodology, building an application involves two main stages. The first concerns the creation of the application-specific resources using the customization infrastructure, whereas the second concerns the training of the integrated system using the application-specific resources, and the configuration of the system.
The first stage is realized in our platform by the “Ontology” and “Corpora” tabs. Through the “Ontology” tab (see Fig. 3), the user can invoke an ontology management system in order to create or update the domain-specific ontology, the lexicons under the domain ontology, the important entities and fact types for the domain, and the definitions of user stereotypes according to the ontology. In the current version, the ontology management system of CROSSMARC is used.
The “Ontology” tab also enables the user to specify the location of the ontology-related resources he/she wants to use in the next steps of the application building (see Fig. 4).
Through the “Corpora” tab the user can perform several tasks. The user can invoke the Corpus Formation Tool (CFT), which helps users build a corpus of positive and negative pages with respect to a given domain (see Fig. 5). This corpus is then used for the training and testing of the “Page Filtering” component of the spidering tool.
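Once formed, such a corpus of positive and negative pages might be split into training and test sets along the following lines; this is a generic sketch, not the CFT's actual behaviour:

```python
import random

def split_corpus(positive_pages, negative_pages, test_fraction=0.3, seed=0):
    """Label the pages, shuffle them reproducibly, and split them into
    a training set and a held-out test set for the page filter."""
    labelled = ([(p, True) for p in positive_pages]
                + [(p, False) for p in negative_pages])
    random.Random(seed).shuffle(labelled)
    cut = int(len(labelled) * (1 - test_fraction))
    return labelled[:cut], labelled[cut:]

# Hypothetical corpus of 7 positive and 3 negative pages.
train_set, test_set = split_corpus([f"pos{i}" for i in range(7)],
                                   [f"neg{i}" for i in range(3)])
print(len(train_set), len(test_set))  # 7 3
```

Holding out a fixed fraction of both classes is what allows the reported precision/recall of the trained filter to be measured on pages it has never seen.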
In addition, the user can specify the folder(s) where the corpora for the training and testing of the “Information Extraction” components are stored, and also invoke the annotation tool. The current version of the platform employs a different annotation tool from the one that was included in the CROSSMARC distribution. The new tool is one of those provided by the Ellogon language engineering platform of our laboratory. However, the platform also supports the use of the CROSSMARC Web annotation tool [7].

Fig. 4. Ontology tab: specifying the ontology related resources

Fig. 5. Corpora tab: invoking the Corpus Formation Tool
The second processing stage is realized by the “Training” and “Extraction” tabs. Through the “Training” tab (see Fig. 6), the user can invoke the machine-learning-based training tools for the “Page Filtering”, “Link Scoring”, and “Information Extraction” components. In the case of “Information Extraction”, training involves two separate modules: the “Named entity recognition & classification” (NERC) module and the “Fact extraction” (FE) module. The current version of the platform employs the Ellogon-based NERC and FE modules developed by our laboratory. The platform can also support