An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them
Vangelis Karkaletsis and Constantine D. Spyropoulos
NCSR Demokritos, Institute of Informatics & Telecommunications, 15310 Aghia Paraskevi Attikis, Athens, Greece
{vangelis, costass}@iit.demokritos.gr iit.demokritos.gr/skel
Abstract. The paper presents a platform that facilitates the use of tools for collecting domain specific web pages as well as for extracting information from them. It also supports the configuration of such tools to new domains and languages. The platform provides a user-friendly interface through which the user can specify the domain-specific resources (ontology, lexica, corpora for the training and testing of the tools), train the collection and extraction tools using these resources, and test the tools with various configurations. The platform design is based on the methodology proposed for web information retrieval and extraction in the context of the R&D project CROSSMARC.
1 Introduction
The growing volume of web content in various languages and formats, together with its diversity and lack of structure, has made information and knowledge management a real challenge in the effort to support the information society. Enabling large-scale information extraction (IE) from the Web is a crucial issue for the future of the Internet.
The traditional approach to Web IE is to create wrappers, i.e. sets of extraction rules, either manually or automatically. At run-time, wrappers extract information from unseen collections of Web pages of known layout, and fill the slots of a predefined template. The manual creation of wrappers presents many shortcomings, due to the overhead of writing and maintaining them. On the other hand, the automatic creation of wrappers (wrapper induction) also presents problems, since a re-training of the wrappers is necessary when changes occur in the formatting of the targeted Web site or when pages from a "similar" Web site are to be analyzed. Training an effective site-independent wrapper is an attractive solution in terms of scalability, since any
V. Karkaletsis and C.D. Spyropoulos: An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them, StudFuzz 185, 145–154 (2005)
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2005
domain-specific page could be processed, without relying heavily on the hypertext structure.
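The run-time behaviour of a hand-written wrapper can be sketched as a small set of extraction rules applied to a page of known layout. The HTML snippet, slot names and regular expressions below are hypothetical illustrations, not the format of any real site:

```python
import re

# Extraction rules for one known page layout (hypothetical example):
# each template slot maps to a rule that locates its value in the HTML.
TEMPLATE_RULES = {
    "name":  re.compile(r'<h2 class="product">(.*?)</h2>'),
    "price": re.compile(r'<span class="price">([\d.,]+)</span>'),
}

def apply_wrapper(page_html):
    """Fill the slots of a predefined template from a page of known layout."""
    record = {}
    for slot, rule in TEMPLATE_RULES.items():
        match = rule.search(page_html)
        record[slot] = match.group(1) if match else None
    return record

page = '<h2 class="product">Laptop X200</h2> <span class="price">999.00</span>'
print(apply_wrapper(page))  # {'name': 'Laptop X200', 'price': '999.00'}
```

The sketch also shows why maintenance is costly: any change to the site's formatting (e.g. renaming the `price` class) silently breaks the rules.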
The collection of the application-specific web pages that will be processed by the wrappers is also a crucial issue. A collection mechanism is necessary to locate the application-specific web sites and to identify interesting pages within them.
The design and development of web page collection and extraction systems must consider requirements such as enabling adaptation to new domains and languages, facilitating maintenance for an existing domain, providing strategies for effective site navigation, ensuring personalized access, and handling structured, semi-structured or unstructured data. The implementation of a web page collection and extraction mechanism that addresses these issues effectively was the motivation for the R&D project CROSSMARC¹, which was partially funded by the EC. The CROSSMARC work resulted in a system for web information retrieval and extraction that can be trained for new applications and languages, and a customization infrastructure that supports configuration of the system to new domains and languages.
Based on the methodology proposed in CROSSMARC, we started the development of a new platform to facilitate the use of collection and extraction tools as well as their customization. The platform provides a user-friendly interface through which the user can specify the domain-specific resources (ontology, lexica, corpora for the training and testing of the tools), train the collection and extraction tools using these resources, and test the tools with various configurations. The current version of the platform incorporates mainly CROSSMARC tools for the case studies in which it is being tested. However, its open architecture design also enables the incorporation of new tools.
The paper first outlines the CROSSMARC work in relation to other work in the area. It then presents the first version of the platform, as well as some first results from its use in case studies.
2 Related Work
Collection of domain-specific web pages involves the use of focused web crawling and spidering technologies. The motivation for focused crawling comes from the poor performance of general-purpose search engines, which depend on the results of generic Web crawlers. The aim is to adapt the behavior of the search engine to the user requirements. The term "focused crawling" was introduced in [1], where the presented system starts with a set of representative pages and a topic hierarchy and tries to find more instances of interesting topics in the hierarchy by following the links in the seed pages. Another interesting approach to focused crawling is adopted by the InfoSpiders system [4],
¹ http://www.iit.demokritos.gr/skel/crossmarc
a multi-agent focused crawler which uses as starting points a set of keywords and a set of root pages. The crawler implemented according to the CROSSMARC approach involves three different crawlers, which exploit topic hierarchies, keywords from domain ontologies and lexica, and a set of representative pages [8].
While in focused crawling the aim is to adapt the behavior of the search engine to the requirements of a user, in site-specific spidering the spider navigates within a Web site, following best-scored-first links. Each Web page visited is evaluated in order to decide whether it is really relevant to the topic, and its hyperlinks are scored in order to decide whether they are likely to lead to useful pages. Therefore, site-specific spidering involves two decision functions: one that classifies Web pages as being interesting (e.g. laptop offers) or not, and one that scores hyperlinks according to their potential usefulness.
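The spidering loop described above can be sketched as follows. The two decision functions are passed in as callables, and the toy in-memory "site" in the usage example is purely illustrative:

```python
import heapq

def spider(start_url, fetch, get_links, is_relevant, score_link, max_pages=50):
    """Site-specific spidering with the two decision functions described
    above: a binary page classifier (is_relevant) and a hyperlink scoring
    function (score_link). Links are followed best-scored-first via a
    priority queue (scores are negated because heapq is a min-heap)."""
    frontier = [(0.0, start_url)]
    visited, relevant = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if is_relevant(page):            # decision function 1: classification
            relevant.append(url)
        for link in get_links(page):
            if link not in visited:      # decision function 2: link scoring
                heapq.heappush(frontier, (-score_link(link), link))
    return relevant

# Hypothetical in-memory site: url -> (page text, outgoing links).
site = {
    "/":         ("index page", ["/offers", "/about"]),
    "/offers":   ("list of laptop offers", ["/offers/1"]),
    "/offers/1": ("a laptop offer page", []),
    "/about":    ("about us", []),
}
found = spider("/",
               fetch=lambda u: site[u],
               get_links=lambda p: p[1],
               is_relevant=lambda p: "offer" in p[0],
               score_link=lambda l: 1.0 if "offer" in l else 0.0)
print(found)  # ['/offers', '/offers/1']
```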
Thus, the input to the first decision function is a Web page visited by the spider, and its output is a binary decision. This is a typical text classification task. Various machine learning methods have been used for constructing such text classifiers; an up-to-date survey of such approaches is provided in [6].
In CROSSMARC we examined a large number of classification approaches in order to find the most appropriate one for each domain and language.
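As one illustration of this classification task, a minimal multinomial Naive Bayes page filter could look like the following. This is just one of many possible approaches, not the specific classifiers examined in CROSSMARC, and the training snippets are hypothetical:

```python
import math
from collections import Counter

class NaiveBayesFilter:
    """Minimal multinomial Naive Bayes for the page-filtering task:
    classify a page's text as interesting (True) or not (False)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in (True, False):
            score = math.log(self.doc_counts[label] / total_docs)
            n = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Laplace smoothing over the shared vocabulary
                score += math.log((self.word_counts[label][word] + 1)
                                  / (n + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = NaiveBayesFilter()
clf.train("laptop offer price discount", True)
clf.train("cheap laptop offer", True)
clf.train("about our company history", False)
clf.train("contact us company news", False)
print(clf.classify("laptop price offer"))  # True
```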
The second decision function in site-specific spidering is a regression function: its input is the hyperlink, together with its anchor and possibly surrounding text, and its output is a score corresponding to the probability of reaching a product page quickly through this link. As with classification, a variety of machine learning methods are available for learning regression functions. However, in contrast to text classification, the task of hyperlink scoring has not been studied extensively in the literature. Most of the work on scoring and ordering of links refers to Web-wide crawlers rather than site-specific spiders, and is based on the popularity of the pages pointed to by the links being examined. This approach is inappropriate for the spider implemented in CROSSMARC. The only closely relevant work identified in the literature is [5], which uses a type of simplified reinforcement learning to score the hyperlinks met by a Web spider. A reinforcement-learning link scoring methodology was also examined in CROSSMARC and was compared against a rule-based methodology.
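A simplified reinforcement-learning style of link scoring can be sketched with value iteration over a site graph: the value of a page is its immediate reward plus the discounted value of the best page reachable from it, and a link is then scored by the value of its target page. This is only a sketch of the general idea, not the method of [5] or of CROSSMARC; the graph, rewards and discount factor are hypothetical:

```python
def score_links(graph, rewards, gamma=0.5, iterations=20):
    """Value iteration over a site graph: value(page) = reward(page) +
    gamma * max(value of pages linked from it). A hyperlink is scored
    by the value of the page it points to."""
    value = {page: 0.0 for page in graph}
    for _ in range(iterations):
        for page, links in graph.items():
            future = max((value[t] for t in links), default=0.0)
            value[page] = rewards.get(page, 0.0) + gamma * future
    return value

# Hypothetical site graph: only "/offer" carries an immediate reward,
# but "/a" scores well because it leads there in one step.
site_graph = {"/": ["/a", "/b"], "/a": ["/offer"], "/b": [], "/offer": []}
value = score_links(site_graph, rewards={"/offer": 1.0})
print(value)  # {'/': 0.25, '/a': 0.5, '/b': 0.0, '/offer': 1.0}
```

The discount factor gamma encodes the "quickly" in the problem statement: links whose useful pages are several hops away receive exponentially smaller scores.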
Concerning information extraction from web pages, a number of systems have been developed to extract structured data from them. A recent survey of existing web extraction tools can be found in [3], where a classification of these tools is proposed based on the technologies used for wrapper creation or induction. According to [3], tools can be classified into the following categories:
• Languages for wrapper development: these are languages designed for assisting the manual creation of wrappers.
Fig. 1. Classification of web extraction tools (taken from [3]) and CROSSMARC position
• HTML-aware tools: these tools convert a web page into a tree representation that reflects the HTML tag hierarchy. Extraction rules are then applied to the tree representation.
• Wrapper Induction tools: these tools generate delimiter-based rules relying on page formatting features and not on linguistic ones. They present similarities with the HTML-aware tools.
• NLP-based tools: these tools employ natural language processing (NLP) techniques, such as part-of-speech tagging and phrase chunking, to learn extraction rules.
• Ontology-based tools: these tools employ a domain-specific ontology to locate ontology instances in the web page, which are then used to fill the template slots.
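The ontology-based strategy can be sketched as follows; the ontology content (slots, instances, lexical variants) is a hypothetical illustration:

```python
# Toy ontology: template slot -> {normalized instance: lexical variants}.
# Instances located in the page text fill the corresponding template slots.
ONTOLOGY = {
    "manufacturer": {"IBM": ["ibm"], "Toshiba": ["toshiba"]},
    "processor":    {"Pentium 4": ["pentium 4", "p4"]},
}

def extract(page_text):
    """Fill template slots with the ontology instances found in the page."""
    text = page_text.lower()
    template = {}
    for slot, instances in ONTOLOGY.items():
        for instance, variants in instances.items():
            if any(v in text for v in variants):
                template[slot] = instance  # store the normalized instance
                break
    return template

print(extract("New Toshiba laptop with P4 processor"))
# {'manufacturer': 'Toshiba', 'processor': 'Pentium 4'}
```

Note that the output is normalized against the ontology ("P4" becomes "Pentium 4"), which is what makes the extracted records comparable across sites and languages.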
CROSSMARC employs most of the categories of web extraction tools presented in [3] (see Fig. 1). It uses:
• Wrapper Induction (WI) techniques in order to exploit the formatting features of the web pages.
• NLP techniques to exploit linguistic features of the web pages, enabling the processing of domain-specific web pages in different sites and in different languages (multilingual, site-independent).
• Ontology engineering to enable the creation and maintenance of ontologies, language-specific lexica as well as other application-specific resources.
Details on the CROSSMARC extraction tools are presented in [2]. More relevant publications can be found at the project's web site.
Fig. 2. System's agent-based architecture
3 The Platform
The CROSSMARC work resulted in a core system for web information retrieval and extraction that can be trained for new applications and languages, and a customization infrastructure that supports configuration of the system to new domains and languages. The core system implements a distributed, multi-agent, open and multilingual architecture, which is depicted in Fig. 2. It involves components for the identification of interesting web sites (focused crawling), the location of domain-specific web pages within these sites (spidering), the extraction of information about product/offer descriptions from the collected web pages, and the storage and presentation of the extracted information to the end-user according to his/her preferences.
The infrastructure for configuring the system for new domains and languages involves:

• an ontology management system for the creation and maintenance of the ontology, the lexicons and other ontology-related resources;
• a methodology and a tool for the formation of the corpus necessary for the training and testing of the modules in the spidering component;
• a methodology and a tool for the collection and annotation of the corpus necessary for the training and testing of the information extraction components.
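The kind of resource specification such an infrastructure handles might be sketched as follows; the file layout, names and languages are hypothetical illustrations, not the CROSSMARC file formats:

```python
import os

# Hypothetical layout of the domain-specific resources a user might
# specify for one application (here, a "laptop offers" domain).
DOMAIN_RESOURCES = {
    "ontology": "laptop/ontology.xml",
    "lexica":   {"en": "laptop/lexicon_en.xml", "el": "laptop/lexicon_el.xml"},
    "corpora":  {"spidering":  "laptop/corpus/spidering",
                 "extraction": "laptop/corpus/extraction"},
}

def missing_resources(resources, exists=os.path.exists):
    """Report resource paths that are absent, so training is not started
    against an incomplete domain specification."""
    paths = [resources["ontology"],
             *resources["lexica"].values(),
             *resources["corpora"].values()]
    return [p for p in paths if not exists(p)]
```

A check like this would run when the user moves from resource specification to the training step, reporting any path that does not yet exist.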
Based on this work, we started the development of a platform that enables the integration, training and testing of collection and extraction tools (such as the ones developed in CROSSMARC) under a common interface. The experience of building three different applications using the CROSSMARC tools assisted the platform design significantly. These applications concerned the extraction of information from:
• laptop offers in e-retailers' web sites (in four languages),
• job offers in IT companies' web sites (in four languages),
• holiday packages in the sites of travel agencies (in two languages).

Fig. 3. Ontology tab: invoking the ontology management system
According to the CROSSMARC methodology, building an application involves two main stages. The first concerns the creation of the application-specific resources using the customization infrastructure, whereas the second concerns the training of the integrated system using the application-specific resources, and the configuration of the system.
The first stage is realized in our platform by the “Ontology” and “Corpora” tabs. Through the “Ontology” tab (see Fig. 3), the user can invoke an ontology management system in order to create or update the domain-specific ontology, the lexicons under the domain ontology, the important entities and fact types for the domain, and the definitions of user stereotypes according to the ontology. In the current version, the ontology management system of CROSSMARC is used.
The “Ontology” tab also enables the user to specify the location of the ontology-related resources he/she wants to use in the next steps of the application building (see Fig. 4).
Through the “Corpora” tab the user can perform several tasks. The user can invoke the Corpus Formation Tool (CFT), which helps users build a corpus of positive and negative pages with respect to a given domain (see Fig. 5). This corpus is then used for the training and testing of the “Page Filtering” component of the spidering tool.
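Once formed, such a corpus of positive and negative pages might be split into training and test sets along the following lines; this is a generic sketch, not the CFT's actual behaviour:

```python
import random

def split_corpus(positive_pages, negative_pages, test_fraction=0.3, seed=0):
    """Label the pages, shuffle them reproducibly, and split them into
    a training set and a held-out test set for the page filter."""
    labelled = ([(p, True) for p in positive_pages]
                + [(p, False) for p in negative_pages])
    random.Random(seed).shuffle(labelled)
    cut = int(len(labelled) * (1 - test_fraction))
    return labelled[:cut], labelled[cut:]

# Hypothetical corpus of 7 positive and 3 negative pages.
train_set, test_set = split_corpus([f"pos{i}" for i in range(7)],
                                   [f"neg{i}" for i in range(3)])
print(len(train_set), len(test_set))  # 7 3
```

Holding out a fixed fraction of both classes is what allows the reported precision/recall of the trained filter to be measured on pages it has never seen.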
In addition, the user can specify the folder(s) where the corpora for the training and testing of the “Information Extraction” components are stored, and also invoke the annotation tool. The current version of the platform employs a different annotation tool from the one that was included in the CROSSMARC distribution. The new tool is one of those provided by the Ellogon language engineering platform of our laboratory. However, the platform also supports the use of the CROSSMARC Web annotation tool [7].

Fig. 4. Ontology tab: specifying the ontology related resources

Fig. 5. Corpora tab: invoking the Corpus Formation Tool
The second processing stage is realized by the “Training” and “Extraction” tabs. Through the “Training” tab (see Fig. 6), the user can invoke the machine-learning-based training tools for the “Page Filtering”, “Link Scoring”, and “Information Extraction” components. In the case of “Information Extraction”, training involves two separate modules: the “Named entity recognition & classification” (NERC) module and the “Fact extraction” (FE) module. The current version of the platform employs the Ellogon-based NERC and FE modules developed by our laboratory. The platform can also support