STUDY OF FAMILIAR WEB DATA EXTRACTION BY WEB AND SCREEN SCRAPING TECHNIQUES


Mrs. Sarada Devi.Ch¹, Dr Kumanan.T², Mrs. Prameela Devi Chillakuru³, Dr S.K.Muthusundar⁴

1 Research Scholar, Meenakshi Academy for Higher Education and Research, K.K Nagar, Chennai, TN, India.

2 Professor, Meenakshi Academy for Higher Education and Research, K.K Nagar, Chennai, TN, India

3 Research Scholar, Meenakshi Academy for Higher Education and Research, K.K Nagar, Chennai, TN, India.

4 Professor, Sharad Institute of Technology COE, Kolhapur.

Abstract

Discovering potentially useful and previously unknown historical knowledge from heterogeneous e-commerce web site data, in order to answer comparative queries such as “list all CPU prices from Flipkart and Staples between 2017 and mid-2018, as well as amendment, variety, protection size, CPU cognition, year of make”, requires the difficult tasks of finding the schema of web documents from different web pages, extracting the target content, performing web data integration, building a virtual or physical data warehouse and mining from it. Limitations of existing systems include their use of complicated matching techniques, such as tree matching, non-deterministic finite state automata, and domain ontology and knowledge, to answer complex comparative historical and derived queries.

Web data mining is an important platform for retrieving useful data. Users increasingly rely on the WWW to upload and download data. As the amount of data on the Internet grows, obtaining relevant information becomes difficult and time consuming. Different mining techniques are used to fetch relevant information from the web through hyperlinks, content and web usage logs. Web mining is a sub-discipline of data mining that deals mainly with the web. Web data mining is divided into three types: web structure mining, web content mining and web usage mining.

The automatic web scraping and screen scraping approach eliminates the need for any supervised training or for updating the crawler for each new business-to-customer web page, making the approach simpler, more easily extendable and automated. Experiments show 100% recall in identifying the product records, and 95.55% precision and 100% recall in identifying the data columns.

Keywords— Content Mining; Web Data Extraction; Data combination; crawling; scraping;

1. Introduction:

With the dramatically fast and explosive growth of information available over the Internet, the World Wide Web has become a powerful platform to store, disseminate and retrieve information as well as to mine useful knowledge. Due to the huge, diverse, dynamic and unstructured nature of Web data, Web data research has encountered many challenges, such as scalability, multimedia and temporal issues. As a result, Web users are constantly drowning in an “ocean” of information and facing the problem of information overload when interacting with the Web.

Typically, the following problems are often mentioned in Web-related research and applications:

(1) Finding relevant information.

(2) Finding needed information.

(3) Learning useful knowledge.

(4) Recommendation/personalization of information.

The above problems place existing search engines and other Web applications under significant stress. A variety of efforts have been made to deal with these difficulties by developing advanced machine intelligence techniques or algorithms from different research domains, such as databases, data mining, machine learning, information retrieval and knowledge management.

Characteristics of Web data: data on the Web has its own distinctive features compared with data in conventional database management systems. Web data usually exhibits the following characteristics:

• The data on the Web is huge in amount.

• The data on the Web is distributed and heterogeneous.

• The data on the Web is unstructured.

• The data on the Web is dynamic.

Web search: Web search engine technology has emerged to cater for the rapid growth and exponential flux of Web data on the Internet and to help Web users find desired information. It has resulted in various commercial Web search engines available online, such as Yahoo!, Google, AltaVista and so on. Search engines can be categorized into two types: general-purpose search engines and specific-purpose search engines.

2. Data Mining

Data mining (DM) processes are used extensively in a variety of fields. When designing a network intrusion detection system (IDS), it is vital to detect and respond to attacks in as little time as possible and to raise the most appropriate alert.

DM techniques are among the most interesting and productive methods that can be used to design an IDS. DM-based intrusion detection approaches generally fall into one of two classes: anomaly detection and misuse detection.
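The difference between the two classes can be illustrated with a minimal sketch. It is not taken from any particular IDS: the signature list, the z-score test and the threshold of 3.0 are all assumptions chosen for illustration. Misuse detection matches traffic against known attack signatures, while anomaly detection flags values that deviate strongly from a learned baseline.

```python
from statistics import mean, stdev

# Hypothetical signatures of known attacks, used only for illustration.
KNOWN_BAD_PATTERNS = ("' OR 1=1", "../..", "<script>")


def is_misuse(request_line: str) -> bool:
    """Misuse detection: flag a request that contains a known attack signature."""
    return any(pattern in request_line for pattern in KNOWN_BAD_PATTERNS)


def is_anomaly(value: float, baseline: list[float], threshold: float = 3.0) -> bool:
    """Anomaly detection: flag a value that lies far from the baseline mean.

    Uses a simple z-score test; the threshold is an assumed default.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold
```

Misuse detection can only recognise attacks it already has a signature for, whereas anomaly detection can flag unseen behaviour at the cost of false alarms whenever normal traffic shifts.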

2.1 Information Extraction

Information extraction is the process of extracting, computing and compiling information from the text of a large corpus using machine learning. Information extraction systems are generally used to extract information about a page, storing it in a form that makes queries and retrieval of the data as easy and efficient as possible.

2.2 Web Data Model and Matrix Expression

For efficient Web data management, the Web data model is essential and crucial, on which a variety of data mining and machine learning techniques are employed. To achieve the desired mining tasks discussed above, there are different Web data models in the forms of feature vectors engaged in pattern discovery and knowledge application. According to the three identified categories of Web mining methods [14], three types of Web data/sources, namely content data, structure data and usage data, are mostly considered in the context of Web mining. Before we start to propose different Web data models, we first give a brief discussion of these three data types in the following paragraphs. Web content data is a collection of objects used to convey content information of Web pages to users. In most cases, it is composed of textual material and other types of multimedia content, which include static HTML/XML pages, images, sound and video files, and dynamic pages generated from scripts and databases. The content data also includes semantic or structured meta-data embedded within the site or individual pages.

In addition, the domain ontology might be considered as a complementary type of content data hidden in the site implicitly or explicitly. The underlying domain knowledge could be incorporated into Web site designs in an implicit manner, or be represented in some explicit forms. The explicit form of domain ontology can be a conceptual hierarchy, e.g. product category, or a structural hierarchy such as the Yahoo! directory. Web structure [6] data is a representation of linking relationships between Web pages, which reflects the organization concept of a site from the viewing point of the designer. It is normally captured by the inter-page linkage structure within the site and is thus called linkage data. In particular, the structure data of a site is usually represented by a specific Web component, called the “site map”, which is generated automatically when the site is completed.

For dynamically generated pages, site mapping becomes more difficult to perform, since additional techniques are needed to deal with the dynamic environment.
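As an illustration of linkage data, the sketch below builds a simple link graph from a set of already downloaded pages using only the Python standard library. The `LinkCollector` class, the `build_link_graph` function and the example URLs are hypothetical names introduced here, and the sketch only covers static HTML; it does not handle dynamically generated links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute targets of all <a href="..."> links on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, href))


def build_link_graph(pages):
    """Map each page URL to the sorted list of URLs it links to.

    `pages` is a dict of URL -> HTML source.
    """
    graph = {}
    for url, html in pages.items():
        collector = LinkCollector(url)
        collector.feed(html)
        graph[url] = sorted(set(collector.links))
    return graph
```

For example, `build_link_graph({"http://example.com/": '<a href="/a.html">A</a>'})` maps the home page to `["http://example.com/a.html"]`. The resulting adjacency structure is one possible representation of the inter-page linkage described above.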


Figure 1: The Illustration of web data model

Web usage data is mainly sourced from Web log files, which include Web server access logs and application server logs [10]. The log data collected at Web access or application servers reflects navigational behavior knowledge of users in terms of access patterns. In the context of Web usage mining, the usage data that we need to deal with is transformed and abstracted at different levels of aggregation, namely Web page sets and user session collections. A Web page is a basic unit of Web site organization, which contains a number of meaningful units serving the main functionality of the page [11].

Physically, a page is a collection of Web items, generated statically or dynamically, contributing to the display of the results in response to a user action. A page set is a collection of whole pages within a site.

A user session is a sequence of Web pages clicked by a single user during a specific period.
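The notion of a user session can be made concrete with a small sketch. The text does not say how the "period" is delimited; the version below assumes the common heuristic of an inactivity timeout (30 minutes here), and the function name and tuple layout are hypothetical.

```python
def sessionize(clicks, timeout=1800):
    """Group (user, timestamp_seconds, url) clicks into per-user sessions.

    A new session starts when a user has been idle for more than `timeout`
    seconds (an assumed heuristic). Returns a list of (user, [urls]) tuples.
    """
    sessions = []
    last_seen = {}  # user -> (timestamp of last click, index of open session)
    for user, ts, url in sorted(clicks, key=lambda c: (c[0], c[1])):
        previous = last_seen.get(user)
        if previous is None or ts - previous[0] > timeout:
            sessions.append((user, [url]))      # open a new session
            index = len(sessions) - 1
        else:
            index = previous[1]
            sessions[index][1].append(url)      # extend the open session
        last_seen[user] = (ts, index)
    return sessions
```

With this heuristic, two clicks by the same user that are more than half an hour apart end up in different sessions, even though they come from the same visitor.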

Matrix expression has been widely used to model co-occurrence activities like Web data. The illustration of a matrix expression for Web data is shown in Figure 1. In this scheme, the rows and columns correspond to various Web objects which are dependent on various Web data mining tasks.
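A minimal sketch of such a matrix expression is given below, assuming the mining task is usage analysis, so that rows are user sessions and columns are pages. The function name and the choice of visit counts as cell values are assumptions; other tasks would use other Web objects and weights.

```python
def session_page_matrix(sessions):
    """Build a session-by-page matrix from a list of sessions.

    Each session is a list of page URLs. Returns (pages, matrix), where
    matrix[i][j] is the number of times pages[j] was visited in session i.
    """
    pages = sorted({url for session in sessions for url in session})
    column = {url: j for j, url in enumerate(pages)}
    matrix = [[0] * len(pages) for _ in sessions]
    for i, session in enumerate(sessions):
        for url in session:
            matrix[i][column[url]] += 1
    return pages, matrix
```

Each row can then be treated as a feature vector for a session, which is the form in which clustering or pattern discovery techniques usually consume the data.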

3. Web Mining

Web mining consists of applying data mining techniques to discover and automatically extract information from web documents and services. In particular, the creation, extraction and maintenance of user models in recommender systems on the Internet improve the user experience in terms of the relevance of information, avoiding the issues related to information overload [1].

The mined data in this category is the data found in the web server logs as a result of user interactions. These log files and other information sources must be processed first in order to obtain the user models. Web mining is composed of three tasks: the discovery of the data sources, the selection and processing of the information, and the discovery of patterns from the websites.

3.1 Phases of the Web Mining

Viewed in more detail, web mining can be decomposed into four tasks: discovery of the data sources, selection, generalization and analysis. The following illustration describes those tasks [2].

Figure 2: Phases of the web mining

3.2 Areas and Innovations of Web Mining

Web mining is able to obtain data from the server side, the client side, proxy servers or the database which contains the web content [3].

These data are categorized into three types:

3.2.1 Content

This is the real data delivered to the users. There are plenty of formats such as images, texts, or other media formats. This is the most important data type, and the most difficult to process, because it is based on multimedia format.


3.2.2 Structure

This is the data which describes the content structure and how it is organized within the web structure and the social networks pages [7]. This structure includes the organization within the web pages, the distribution of the internal and external hyperlinks, and the hierarchy of the whole web document.

3.2.3 Utilization

This is the data which describes the usage registered by the web document. It is recorded in the access logs of the web server.
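To show what usage data in an access log looks like once parsed, the sketch below assumes the server writes the widely used Common Log Format; the source does not specify a log format, and the pattern and function name are introduced here for illustration only.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Parse one access-log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means that no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record
```

Records parsed this way (host, time and requested path) are the raw material from which page sets and user sessions are later derived.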

3.2.4 Innovations of Web Mining

Web mining is data mining using data from the web. Within this field, there are the following major research areas [12]:

1) Information extraction: Finding, extracting and compiling information from a large corpus.

2) Wrapper induction: The process of finding general structural information about a set of web pages and, with this in mind, extracting only the relevant information from each page.

3) Web link mining: Mining the spatial link structure of the web for information.

4) Web log mining: Mining for knowledge in web logs, otherwise known as clickstream data.

4. Web Data Extraction Methods

Web search engines have existed for over two generations and have evolved from straightforward full-text search engines into advanced systems that analyze web page content along with the links that point to them. Still, even the most modern web search engine has three main components: web crawling and data acquisition, which extracts link and text data; data storage and online processing, which aggregates the acquired data and updates the search index; and query processing, which searches the index and ranks results based on search relevance.

As mentioned in the introduction, this work focuses on web content [4] extraction; therefore only web crawling and data acquisition are covered in this section. The other parts required to build a fully functional product search engine (data aggregation, product classification, duplicate matching) are not discussed, as these are implemented as separate systems.

In the next subsections we look at content extraction and web crawling in more detail.

4.1 Semantic Web

The Semantic Web was designed to change website markup and to allow information to be shared beyond the originating website. The Semantic Web makes it possible not only to relate data within a single website but also to link different websites together in a meaningful way, thereby creating a web of data.

Embedding semantic information in web pages can be done in different ways. The most popular is microdata, which is shown in Figure 1. Although Schema.org is becoming the de facto standard for defining schemas for the Semantic Web, it is not necessary to use their schemas.

But consolidating schemas makes their use easier and allows them to spread. For commercial products, the GoodRelations schema was created by Martin Hepp and is now partially merged into Schema.org. When defining your own schema, it is best to use one that is persistent over time. Together with linked data, semantic elements were introduced in the HTML5 standard. These are elements like <header> and others.

These were designed to give meaning to website structure and to replace generic HTML markup. As the Semantic Web becomes more widespread (17% of domains crawled by Common Crawl used semantic annotation) and with the growth of different APIs, it is becoming questionable whether we still need data extraction from fuzzy semi-structured web pages, rather than focusing on structured data extraction and its meaningful analysis. There are also other ways to include semantic data, such as JSON-LD (JSON for Linked Data) embeddings. But unfortunately, at the current time (2016) we still cannot do without extracting data from non-semantic sites.

Even when semantic markup is used, it is often used only partially and describes only a portion of the data made available. Syntactical mistakes are also common.
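When a page does carry microdata, extraction becomes comparatively simple. The following sketch reads `itemprop` name/value pairs from a single flat item using the Python standard library. The `MicrodataParser` class is a hypothetical helper, not a full microdata processor: it does not handle nested items or properties whose elements contain further markup.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collect itemprop name/value pairs from one flat (non-nested) microdata item."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None  # itemprop whose text content is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        if "content" in attrs:
            # Value supplied as an attribute, e.g. <meta itemprop=... content=...>.
            self.properties[name] = attrs["content"]
        else:
            # Value is the text content of the element.
            self._current = name
            self.properties[name] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        self._current = None
```

Feeding it a fragment such as `<span itemprop="name">Example CPU</span>` yields `{"name": "Example CPU"}` in `properties`. Real pages often mix partial or malformed annotations with unannotated content, which is why the extraction methods in the next subsection are still needed.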

4.2 Web data extraction

Web pages contain messy data. Human-readable web content does not consist solely of the main text and image content; additional blocks are also added, such as headers, footers and sidebars. These blocks contain navigation links and sometimes advertisements. Web content is also decorated with markup whose sole aim is to change its visual appearance. A vast amount of web content is generated automatically from relational databases. This includes content management systems like WordPress, web forums, and online shops. This means that the data in its original form is structured [13].

This structured information is embedded into an HTML template and decorated with visual content. The figure shows the underlying database schema (left) of a common open source e-commerce solution and the timefy.com online shop. In order to acquire the original data, the template must be removed or the data must be extracted from it.

Together with linked data, semantic elements were introduced in the HTML5 standard. These are elements like <header> and others. These were designed to give meaning to website structure and to replace generic HTML markup. As the Semantic Web becomes more widespread (18% of domains crawled by Common Crawl used semantic annotation) and with the growth of different APIs, it is becoming questionable whether we still need data extraction from fuzzy semi-structured web pages, rather than concentrating on structured data extraction and its meaningful analysis.

But unfortunately, at the current time (2017) we still cannot do without extracting data from non-semantic web pages. Even when semantic markup is employed, it is often used only partially and describes only a portion of the data made available. Syntactical mistakes are also common. A wrapper is a technique that implements a family of algorithms which finds the information that a person needs, extracts it from an unstructured source and transforms it into structured data.

Over the last generation, a great deal of research has been carried out and many tools have been created to support web mining tasks. Several studies have analyzed existing solutions; these include newer surveys of web mining, including web data extraction. Along with web content mining, some of these surveys also give an overview of other parts of web mining, such as web structure mining [8] and web usage mining.
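A hand-written wrapper for one known template can be sketched as follows. The template (list items with class `product` containing `name` and `price` spans), the `ProductWrapper` class and the `extract_products` function are all hypothetical and chosen only to illustrate the idea; this is not the extraction algorithm proposed in this work, which discovers the template automatically.

```python
from html.parser import HTMLParser


class ProductWrapper(HTMLParser):
    """Wrapper for an assumed listing template:

    <li class="product"><span class="name">..</span><span class="price">..</span></li>
    """

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # field whose text is expected next

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.records.append({})  # a new product record starts here
        elif tag == "span" and css_class in ("name", "price") and self.records:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self.records[-1][self._field] = data.strip()
            self._field = None


def extract_products(html):
    """Turn the templated HTML back into a list of structured records."""
    wrapper = ProductWrapper()
    wrapper.feed(html)
    for record in wrapper.records:
        if "price" in record:
            record["price"] = float(record["price"].lstrip("$"))
    return wrapper.records
```

The weakness of such a wrapper is visible in the code itself: any change to the class names or nesting of the template silently breaks extraction, which is the maintenance burden that automatic approaches try to remove.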

4.3 Advanced web data extraction

There are businesses that function solely on data; others use it for business intelligence, competitor analysis and market research, among innumerable other use cases. However, extracting large amounts of data from the web remains a major roadblock for many companies, all the more so because they are not taking the optimal route. Here is a detailed overview of the different ways by which you can extract data from the web [15].

4.3.1 DaaS

Outsourcing your web data extraction project to a DaaS provider is by far the best way to extract data from the web. When depending on a data provider, you are completely relieved of the responsibility for crawler setup, maintenance and quality inspection of the data being extracted. Since DaaS companies have the necessary expertise and infrastructure required for smooth and seamless data extraction, you can avail of their services at a much lower cost than what you’d incur by doing it yourself. Providing the DaaS provider with your exact requirements is all you need to do, and the rest is taken care of.

You would have to send across details like the data points, source websites, frequency of crawl, data format and delivery methods. With DaaS, you get the data exactly the way you want it, and you can instead focus on utilizing the data to improve your business bottom line, which should ideally be your priority.

Since they are experienced in scraping and possess the domain knowledge to get the data efficiently and at scale, going with a DaaS provider is the right option if your requirement is large and recurring. One of the biggest benefits of outsourcing is data quality assurance. Since the web is highly dynamic in nature, data extraction requires constant monitoring and maintenance to work smoothly.

Web data extraction services tackle all these challenges and deliver noise-free data of high quality. Another benefit of going with a data extraction service is customization and flexibility. Since these services are meant for enterprises, the offering is completely customizable to your specific needs.

4.3.2. In house data extraction

You can go with in-house data extraction if your company has strong technical resources.

Web scraping is a technically niche process and demands a team of skilled programmers to code the crawlers, deploy them on servers, debug, monitor and do the post-processing of extracted data. Apart from a team, you would also need high-end infrastructure to run the crawling jobs. Maintaining the in-house crawling setup can be a much bigger challenge than building it. Web crawlers tend to be very fragile.

They break with even tiny changes or updates to the target websites.

You would have to set up a monitoring system to know when something goes wrong with the crawling task, so that it can be fixed to avoid data loss. You will have to dedicate time and labour to the maintenance of the in-house crawling setup. Apart from this, the complexity associated with building an in-house crawling setup goes up significantly if the number of websites you need to scrape is high or the target sites use dynamic coding practices. An in-house crawling setup would also take a toll on your focus and dilute your results, as web scraping itself is something that needs specialization. If you aren’t cautious, it could easily hog your resources and cause friction in your operational workflow.

4.3.3 Vertical specific solutions

There are data providers that cater to only a specific industry vertical. Vertical-specific data extraction solutions are great if you can find one that caters to the domain you’re targeting and covers all your necessary data points. The benefit of going with a vertical-specific solution is the comprehensiveness of the data that you would get. Since these solutions cater to only one specific domain, their expertise in that domain would be very high.

The schema of the data sets you would get from vertical-specific data extraction solutions is usually fixed and won’t be customizable. Your data project will be limited to the data points provided by such solutions, but this may or may not be a deal breaker depending on your requirements.

These solutions usually give you datasets that are already extracted and ready to use.

A good example of a vertical-specific data extraction solution is JobsPikr, which is a job listings data solution that extracts data directly from the career pages of company websites from across the world.

4.3.4. DIY data extraction tools

If you don’t have the budget for building an in-house crawling setup or outsourcing your data extraction process to a vendor, you are left with DIY tools. These tools are easy to learn and often provide a point-and-click interface to make data extraction simpler than you could ever imagine. These tools are an ideal choice if you are just starting out with no budget for data acquisition. DIY web scraping tools are usually priced very low and a few are even free to use. However, there are serious downsides to using a DIY tool to extract data from the web. Since these tools are not able to handle complex websites, they are very limited in terms of functionality, scale, and the efficiency of data extraction. Maintenance will also be a challenge with DIY tools, as they are built in a rigid and less flexible manner.

You will have to make sure that the tool is working and even make changes from time to time. The only good side is that it doesn’t take much technical expertise to configure and use such tools, which might be right for you if you aren’t a technical person. Since the solution is readymade, you will also save the costs associated with building your own infrastructure for scraping. Downsides aside, DIY tools can cater to simple and small-scale data requirements.

Best practices in web data extraction

As a great tool for deriving powerful insights, web data extraction has become imperative for businesses in this competitive market. As is the case with most powerful things, web scraping must be used responsibly.

Here is a compilation of the best practices that you should follow while scraping websites [16].

1. Respect the robots.txt

You should always check the robots.txt file of a website you are planning to extract data from. Websites set rules on how bots should interact with the site in their robots.txt file. Some sites even block crawler access completely in their robots file.

Extracting data from sites that disallow crawling can lead to legal ramifications and should be avoided.

Apart from outright blocking, a site may also set rules on good behavior on the site in its robots.txt.

You are bound to follow these rules while extracting data from the target website.
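A crawler can check these rules programmatically before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module; the helper name `make_robot_checker`, the example rules and the user agent string are assumptions for illustration. The robots.txt content is passed in as text so that the example does not depend on network access.

```python
from urllib.robotparser import RobotFileParser


def make_robot_checker(robots_txt: str, user_agent: str):
    """Return a function that tells whether `user_agent` may fetch a given URL.

    `robots_txt` is the text of the site's robots.txt file, which the crawler
    is assumed to have downloaded beforehand.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed
```

A crawler would call `allowed(url)` before every request and skip any URL for which it returns `False`.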

2. Do not hit the servers too frequently

Web servers are susceptible to downtime if the load is very high. Just like human users, bots also add load to the website’s server. If the load exceeds a certain limit, the server may slow down or crash, rendering the website unresponsive for its users. This creates a bad user experience for the human visitors on the website, which defeats the whole purpose of that site.

It should be noted that human visitors are of higher priority for the website than bots.

To avoid such problems, you should set your crawler to hit the target site at a reasonable interval and limit the number of parallel requests. This will give the website some breathing space, which it should indeed have.
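One simple way to enforce a reasonable interval is to wait between consecutive requests to the same site. The `PoliteScheduler` class below is a hypothetical sketch of that idea; the interval value is left to the caller, since the text does not prescribe one. Used from a single sequential loop, it also keeps the number of parallel requests at one.

```python
import time


class PoliteScheduler:
    """Enforce a minimum delay between consecutive requests to one site."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
```

A crawl loop would create one scheduler per target site and call `scheduler.wait()` immediately before each request.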

3. Scrape during off peak hours

To make sure that the target website doesn’t slow down because of high traffic from humans as well as bots, it is better to schedule your web crawling tasks to run in the off-peak hours. The off-peak hours of the site can be determined from the geolocation of where the majority of the site’s traffic comes from. You can avoid possible overload on the website’s servers by scraping during off-peak hours. This will also have a positive effect on the speed of your data extraction process, as the server will respond faster during this time.
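A minimal sketch of such a scheduling check is shown below. It assumes the site's main audience can be summarised by a single UTC offset and that off-peak means 01:00 to 06:00 local time; both the window and the function name `is_off_peak` are assumptions, not values from the source.

```python
from datetime import datetime, timedelta, timezone


def is_off_peak(utc_now, utc_offset_hours, start_hour=1, end_hour=6):
    """Return True if it is currently off-peak in the site's main time zone.

    `utc_now` must be a timezone-aware datetime. `utc_offset_hours` is the
    assumed UTC offset of the region where most of the site's traffic
    originates; the default window of 01:00-06:00 is also an assumption.
    """
    local = utc_now.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return start_hour <= local.hour < end_hour
```

For a site whose visitors are mostly in India (UTC+5:30), a crawl triggered at 21:00 UTC falls at 02:30 local time and would be allowed to run, while one triggered at 06:00 UTC would be postponed.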

4. Use the scraped data responsibly

Extracting data from the web has become an important business process. However, this doesn’t mean you own the data you extracted from a website on the Internet. Publishing the data elsewhere without the consent of the website you are scraping can be considered unethical, and you may be violating copyright laws. Using the data responsibly and in line with the target website’s policies is something you should practice while extracting data from the web.

Finding reliable sources

1. Avoid sites with too many broken links

Links are like the connecting tissue of the Internet. A website that has too many broken links is a bad choice for a web data extraction project. This is an indicator of poor maintenance of the site, and crawling such a site won’t be a good experience for you. For one, a scraping setup can come to a halt if it encounters a broken link during the fetching process.

This would eventually degrade the data quality, which should be a deal breaker for anyone who is serious about the data project. You are better off with a different source website that has similar data and better maintenance.

2. Avoid sites with highly dynamic coding practices

This may not always be an option; however, it is better to avoid sites with complex and dynamic coding practices in order to have a stable crawling job running. Since dynamic sites tend to be difficult to extract data from and change very frequently, maintenance could become a huge bottleneck. It’s always better to seek out less complex sites when it comes to web crawling.

3. Quality and freshness of the Data

The quality and freshness of data should be one of your most important criteria when choosing sources for data extraction. The data that you acquire should be fresh and relevant to the current time period for it to be of any use at all.

Always look for sites that are updated frequently with fresh and relevant data when selecting sources for your data extraction project. You could check the last modified date in the site’s source code to get an idea of how fresh the data is.

Legal aspects of web crawling

Web data extraction is sometimes viewed with suspicion by those who aren’t very familiar with the concept. To clear the air, web scraping/crawling is not an unethical or illegal activity. The way a crawler bot fetches information from a website is no different from a human visitor consuming the content on a webpage. Google Search, for example, runs on web crawling and we don’t see anyone accusing Google of doing something even remotely illegal.

However, there are some ground rules you should follow while scraping websites.

If you follow these rules and operate as a good bot on the internet, you aren’t doing anything illegal. Here are the rules to follow:

1. Respect the robots.txt file of the target site

2. Make sure you stay compliant with the TOS page

3. Do not reproduce the data elsewhere, online or offline, without prior permission from the site

If you follow these rules while crawling a website, you are completely in the safe zone.

Conclusion and future work

The proposed method works by dynamically discovering the schema of every given website and does not require any training data. In this work, we have used web content [5] and web structure mining techniques to detect the frequently repeated blocks of HTML tags in order to find the product data and their schemas. We covered the important aspects of web data extraction here, such as the different routes you can take to obtain web data, best practices, various business applications and the legal aspects of the process. As the business world is quickly moving towards a data-centric operational model, it is high time to evaluate your data requirements and get started with extracting relevant data from the web to improve your business efficiency and boost revenues.

As for future work, our attempt at mining product data records automatically in a simple manner leaves a lot of room for improvement. Our algorithms are currently only capable of extracting the metadata available on the “product listing page”; they can be modified and extended to extract complete product specifications and information from the “product detail page”. The detail page can be identified by finding the link tag in the frequent structure generated by our fourth module, and our algorithm can be extended to extract data from it. The crawler module can likewise be extended to identify the target product page automatically using a focused crawler and to extract data using natural language processing.

References

[1] Anurag Kumar and Ravi Kumar Singh, "Web Mining Overview, Techniques, Tools and Applications: A Survey," International Research Journal of Engineering and Technology (IRJET), vol. 03, no. 12, pp. 1543-1547, December 2016.

[2] Raymond Kosala and Hendrik Blockeel, "Web Mining Research: A Survey," SIGKDD Explorations, vol. 2, no. 1, pp. 1-15, July 2000.

[3] Faustina Johnson and Kumar Santosh Gupta, "Web Content Mining Techniques: A Survey," International Journal of Computer Applications (0975 – 888), vol. 47, no. 11, pp. 44-50, June 2012.

[4] Anurag Kumar and Kumar Ravi Singh, "A Study on Web Content Mining," International Journal of Engineering and Computer Science, vol. 6, no. 1, pp. 20003-20006, January 2017.

[5] Dr. S. Vijayarani and Ms. A. Sakila, "MULTIMEDIA MINING RESEARCH – AN OVERVIEW," International Journal of Computer Graphics & Animation (IJCGA), vol. 5, pp. 69-77, January 2015.

[6] Miguel Gomes da Costa Júnior and Zhiguo Gong, "Web Structure Mining: An Introduction," International Conference on Information Acquisition, pp. 590-595, June 27 - July 3 2005.

[7] Anurag Kumar and Kumar Ravi Singh, "A Study on Web Structure Mining," International Research Journal of Engineering and Technology (IRJET), vol. 04, no. 1, pp. 715-720, January 2017.

[8] B. L. Shivakumar and T. Mylsami, "SURVEY ON WEB STRUCTURE MINING," ARPN Journal of Engineering and Applied Sciences, vol. 9, pp. 1914-1923, October 2014.

[9] Pranit Bari and P.M. Chawan, "Web Usage Mining," Journal of Engineering, Computers & Applied Sciences (JEC&AS), vol. 2, pp. 34-38, June 2013.

[10] Saša Bošnjak, Mirjana Marić, and Zita Bošnjak, "The Role of Web Usage Mining in Web Applications Evaluation," Management Information Systems, vol. 5, October 2009.

[11] Prabha.K and Suganya.T, "A Guesstimate on Web Usage Mining Algorithms and Techniques," International Journals of Advanced Research in Computer Science and Software Engineering, vol. 7, no. 6, pp. 518-521, June 2017.


[12] Yan Wang, Web Mining and Knowledge Discovery of Usage Patterns, February 2000.

[13] Parth Suthar and Prof. Bhavesh Oza, "A Survey of Web Usage Mining Techniques," (IJCSIT) International Journal of Computer Science and Information Technologies, vol. 6, pp. 5073-5076, 2015.

[14] Nasrin Jokar, Reza Ali Honarvar, Shima Aghamirzadeh, and Khadijeh Esfandiari, "Web mining and Web usage mining," Bulletin de la Société des Sciences de Liège, vol. 85, pp. 321-328, 2016.

[15] Ajith Abraham, "Business Intelligence from Web Usage Mining," Journal of Information & Knowledge Management, vol. 2, no. 4, December 2003.

[16] M. Santhanakumar and C. Christopher Columbus, "Web Usage Based Analysis of Web Pages Using RapidMiner," WSEAS Transactions on Computers, vol. 14, pp. 455-464, 2015.