A SURVEY ON WEB CONTENT MINING


Available Online at www.ijpret.com 292

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

A PATH FOR HORIZING YOUR INNOVATIVE WORK


DEVEN KENE1, DR. PRADEEP K. BUTEY2

1 Research Scholar, Department of Computer Science, Vidyabharati Mahavidyalaya, Amravati, MH, India.

2 Head, Department of Computer Science, Kamala Neharu College, Nagpur, MH, India.

Accepted Date: 05/03/2015; Published Date: 01/05/2015


Abstract: Today most people use web search engines to find and retrieve information. The enormous growth of the web, together with its diverse, dynamic and unstructured nature, makes it extremely difficult to search for and retrieve relevant information and to present query results. The web is a treasure of information, where a large amount of data is available in different formats and structures. Finding useful data on the web is a complex task, and this problem has given rise to the development of web content mining. This paper presents a study and evaluation of the various approaches, tools and techniques available for web content mining. We begin by reviewing some of the important problems in web content mining.

Keywords: Web Mining, Web Content Mining, Search Engines

Corresponding Author: MR. DEVEN KENE


INTRODUCTION


2. WEB CONTENT MINING (WCM)

Web content mining describes the discovery of useful information from web contents, data and documents. However, what constitutes web content can encompass a very broad range of data. In this section we begin by reviewing some of the important problems that web content mining aims to solve. We then list some of the different approaches in this field, classified according to the different types of web content data. For each approach we list some of the most widely used techniques. It is often said that the Web offers an unprecedented opportunity and challenge for data mining. This is due to the following characteristics of the Web:

1. Web data is easily accessible, and the amount of data/information on the Web is huge and still growing rapidly.

2. The coverage of Web information is wide and diverse. One can find information about almost anything on the Web.

3. All types of data and information exist on the Web, e.g., structured tables, texts, multimedia data (such as images and movies), etc.

4. Information available on the Web is heterogeneous. Multiple Web pages may present the same or similar information using completely different formats or syntaxes, which makes integration of information a challenging task.

5. Much of the Web information is semi-structured due to the nested structure of HTML code and the need of Web page designers to present information in a simple and regular fashion to facilitate human viewing and browsing.

6. Most of the Web information is linked. There are links among pages within a site, and across different sites. These links serve as an information organization tool and also as indications of trust/authority in the linked pages and sites.

7. The same piece of information or its variations may appear in many pages or sites, which means that much of the Web information is redundant. This property has been exploited in many Web data mining tasks.
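As an illustration of the redundancy noted in point 7, the following minimal Python sketch compares pages by the overlap of their word shingles (Jaccard similarity). It is only one common way to detect near-duplicate text and is not a method taken from the surveyed literature; the function names and the three page texts are hypothetical and were invented for this example.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical page texts, invented for illustration only.
page_a = "the new phone has a bright screen and a long battery life"
page_b = "the new phone has a bright screen and a very long battery life"
page_c = "stock markets closed higher after the central bank announcement"

sim_ab = jaccard(shingles(page_a), shingles(page_b))  # near-duplicate pair
sim_ac = jaccard(shingles(page_a), shingles(page_c))  # unrelated pair
```

Here page_a and page_b share most of their shingles, so sim_ab is high, while page_a and page_c share none, so sim_ac is zero. A threshold on this score could be used to group redundant pages.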

3. WEB CONTENT MINING PROBLEMS:-


• Web information integration and schema matching: Although the Web contains a huge amount of data, each web site (or even each page) represents similar information differently. How to identify or match semantically similar data is a very important problem with many practical applications. Some existing techniques and problems are examined.

• Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs and chat rooms. Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking. We will introduce a few tasks and techniques to mine such sources.

• Data/information extraction: Our focus will be on the extraction of structured data from Web pages, such as products and search results. Extracting such data allows one to provide services. Two main types of techniques, machine learning and automatic extraction, are covered.

• Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications. However, generating them manually is very time consuming. A few existing methods that exploit the information redundancy of the Web will be presented. The main application is to synthesize and organize the pieces of information on the Web to give the user a coherent picture of the topic domain.

• Segmenting Web pages and detecting noise: In many Web applications, one only wants the main content of a Web page, without advertisements, navigation links and copyright notices. Automatically segmenting a Web page to extract its main content is an interesting problem. A number of interesting techniques have been proposed in the past few years.
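To make the noise detection problem in the last item concrete, the following minimal Python sketch drops text nested inside tags that commonly hold navigation, advertisements or copyright notices. It is a simple tag-name heuristic and not one of the proposed segmentation techniques referred to above. The NOISE_TAGS set, the class and function names, and the sample page are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags assumed (for this sketch only) to hold noise rather than main content.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any tag listed in NOISE_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.noise_depth = 0   # how many noise tags are currently open
        self.chunks = []       # text fragments kept as main content

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the page text with the assumed noise regions removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page, invented for illustration only.
sample_page = """
<html><body>
  <nav><a href="/">Home</a> <a href="/deals">Deals</a></nav>
  <div id="content"><h1>Camera review</h1><p>The lens is sharp.</p></div>
  <aside>Advertisement: buy now!</aside>
  <footer>Copyright 2015 Example Shop</footer>
</body></html>
"""

main_text = extract_main_text(sample_page)
print(main_text)  # Camera review The lens is sharp.
```

Real pages often place noise in generic div or table elements, so a practical system would need layout or content cues in addition to tag names.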

4. WEB CONTENT MINING TECHNIQUES:-

The main focus of this research paper is web content mining.

Web content mining involves techniques for the summarization, classification and clustering of web contents. It is mainly based on research in information retrieval and text mining, such as information extraction, text classification and clustering, and information visualization. It targets knowledge discovery in which the main objects are traditional collections of text documents as well as collections of multimedia documents, such as images, videos and audio, that are embedded in or linked to Web pages. Some of the prominent web content mining techniques are as follows:-

1. Unstructured data mining techniques

2. Structured data mining techniques

3. Semi structured data mining techniques

4. Multimedia data mining techniques

1. Unstructured data mining techniques:

Unstructured data mining is one of the techniques of web content mining. Many web pages are in the form of text. With this technique, the text data is searched and retrieved. The retrieved data is not necessarily meaningful; it may contain previously unknown information. We have to use some tools or techniques to get relevant data/information from that data.
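One widely used way to pick out relevant terms from retrieved text is TF-IDF weighting, which scores a term highly when it is frequent in one document but rare across the collection. The sketch below is a minimal illustration under that assumption and is not a tool described in this paper; the function names and the three short documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs: list) -> list:
    """Return one {term: score} dict per document.

    score = (term count / document length) * log(N / document frequency)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(score: dict, k: int = 3) -> list:
    """Return the k highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical documents, invented for illustration only.
docs = [
    "The camera has a sharp lens and the camera is light",
    "The hotel room is clean and the hotel staff is friendly",
    "The laptop battery lasts long and the laptop is fast",
]

scores = tf_idf(docs)
best = [top_terms(s, 1)[0] for s in scores]
print(best)  # ['camera', 'hotel', 'laptop']
```

Words such as "the" and "and" occur in every document, so their score is zero and they drop out of the ranking without a separate stop-word list.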

2. Structured data mining techniques:-

Structured data extraction is the process of extracting information from web pages. A program for extracting such data is usually called a wrapper. Structured data are typically data records retrieved from an underlying database and displayed in web pages following some template. Sometimes the template is a table; sometimes it is a form. Extracting such data records is useful because it enables us to obtain and integrate data from multiple sources (Web sites and pages) to provide value-added services, e.g., customizable Web information gathering, comparative shopping, meta-search, etc.
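The idea of a wrapper for a table template can be sketched as follows. This minimal, hand-written Python example assumes a page whose records sit in a plain HTML table with a header row; the class name, function name and the sample table are hypothetical, and wrappers learned automatically from examples are not covered by this sketch.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Turn the rows of an HTML table into dicts keyed by the header text."""

    def __init__(self) -> None:
        super().__init__()
        self.headers = []            # column names taken from <th> cells
        self.records = []            # one dict per data row
        self._row = []               # cell texts of the row being read
        self._cell = None            # text pieces of the open cell, if any
        self._row_is_header = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._row_is_header = False
        elif tag in ("th", "td"):
            self._cell = []
            if tag == "th":
                self._row_is_header = True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row:
            if self._row_is_header:
                self.headers = self._row
            else:
                self.records.append(dict(zip(self.headers, self._row)))


def extract_records(html: str) -> list:
    """Run the wrapper over a page and return its data records."""
    wrapper = TableWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records


# Hypothetical product listing, invented for illustration only.
sample_table = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Camera</td><td>199</td></tr>
  <tr><td>Tripod</td><td>25</td></tr>
</table>
"""

records = extract_records(sample_table)
print(records)
# [{'Product': 'Camera', 'Price': '199'}, {'Product': 'Tripod', 'Price': '25'}]
```

Records extracted in this uniform form from several sites could then be merged for a service such as comparative shopping.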

3. Semi structured data mining techniques:

Semi-structured data is a point of convergence for the Web and database communities. The form of that data is evolving from rigidly structured relational tables of numbers and strings towards forms that enable the natural representation of complex real-world objects such as books, papers and movies. Emergent representations for semi-structured data (such as XML) are variations on the Object Exchange Model (OEM).
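The labelled, nested objects described above can be illustrated with a short Python sketch that reads XML into (label, value) pairs, where a value is either an atomic string or a list of further labelled objects. This is a simplified, OEM-like view and omits the object identifiers that OEM also carries. The function names and the sample catalog are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def to_labelled(element):
    """Convert an XML element into a (label, value) pair.

    The value is a string for a leaf element (an atomic object) or a list
    of child (label, value) pairs for a nested element (a complex object).
    """
    children = list(element)
    if not children:
        return (element.tag, (element.text or "").strip())
    return (element.tag, [to_labelled(child) for child in children])


def find_all(obj, label: str) -> list:
    """Collect the values of every object that carries the given label."""
    tag, value = obj
    found = [value] if tag == label else []
    if isinstance(value, list):
        for child in value:
            found.extend(find_all(child, label))
    return found


# Hypothetical catalog, invented for illustration only. The two entries
# do not share the same fields, which a fixed relational schema would
# not accommodate directly.
catalog_xml = """
<catalog>
  <book><title>Example Book</title><author>A. Writer</author><year>2007</year></book>
  <movie><title>Example Film</title><director>B. Director</director></movie>
</catalog>
"""

catalog = to_labelled(ET.fromstring(catalog_xml.strip()))
titles = find_all(catalog, "title")
print(titles)  # ['Example Book', 'Example Film']
```

The query by label works even though the book and the movie have different fields, which is the flexibility that semi-structured representations are meant to provide.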

4. Multimedia data mining techniques:-


5. CONCLUSION

The World Wide Web is the universe of network-accessible information, an embodiment of human knowledge. The web continues to increase in size and complexity over time, making it difficult to extract relevant information. Various data mining techniques and web content mining tools are therefore used to extract useful information or knowledge from web page contents. By using these techniques we can make the search for content on the web faster and more accurate. This paper has focused on web content mining problems, tools and techniques.
