Design and Implementation of Web Scraping Platform Using Spring Framework Based on Distributed Hadoop Ecosystem


Vol. 28, No. 5, (2019), pp. 174-182


Jung-Sang Yoo¹, Myeong-Ho Lee*²

¹Professor, Dept. of Industrial and Management Eng., Gachon University, Seongnam, Gyeonggi, 461-701, Korea

*²Professor, Dept. of eCommerce, Semyung University, Jecheon, Chungbuk, 390-711, Korea

¹[email protected], *²[email protected]

Abstract

Background/Objectives: Recently, the rapid growth of structured and unstructured data and the speed of data generation in all fields have greatly affected how data are utilized. Demand is increasing for using Big Data to support important organizational decisions by interpreting patterns in large volumes of heterogeneous data and predicting the future.

Methods/Statistical analysis: It is also necessary to provide services quickly based on up-to-date information. However, most research has concentrated on the analysis of data and has paid insufficient attention to its collection, loading and processing.

Findings: This research therefore collects data searched with keywords through the Spring Framework, using next-generation web standards, and through web scraping based on the Hadoop 2.0 Ecosystem, and loads the collected data onto the Hadoop Distributed File System (HDFS) and HBase. It then designs and implements a Big Data utilization system that extracts contents and nouns from the loaded data with a Twitter morpheme analyzer and visualizes the analysis results for keywords, titles, contents and morphemes as a word cloud.

Improvements/Applications: This research intends to provide a platform reference model, applicable to enterprise groupware, that applies the Distributed Hadoop Ecosystem and the Spring Framework under next-generation web standards.

Keywords: Big Data, Spring Framework, Web Scraping, HDFS, HBase, Distributed Hadoop Ecosystem.

1. Introduction

The core technologies of the fourth industrial revolution now in the spotlight include Big Data statistical analysis, cloud computing, artificial intelligence, robotics, blockchain, quantum cryptography, the Internet of Things (IoT), autonomous vehicles and 3D printing. However, these technologies would be difficult to realize without the technology to process Big Data [1,2]. A new era is also beginning in which data become economic assets: the data generated by consumer devices such as PCs, smartphones, cameras, CCTV, RFID and sensors, and the explosively growing data of machine-to-machine communication and IoT, which is the basis of the fourth industrial revolution. The IDC white paper 'DATA AGE 2025' forecasts that, in 2025, the volume of global data will reach 163 ZB (zettabytes), which is 10 times the current volume. This explosive growth is expected because intelligent decisions will be based not only on the important data generated by these devices but also on the information that each IoT sensor collects. IDC also forecasts that, as about 75% of the world population will be connected to the Internet through mobile devices in 2025, mobile data will increase in real time accordingly [3].

The explosive growth of structured and unstructured data and the speed of data generation in all fields also greatly affect how data are utilized. Demand is increasing for using Big Data to support important organizational decisions by interpreting patterns in large volumes of heterogeneous data and predicting the future. Furthermore, it is necessary to provide services quickly based on up-to-date information [4-6].

Hadoop is the Big Data processing technology that Doug Cutting, a developer at Yahoo, created after Google disclosed its service platform based on the Google File System (GFS). Hadoop 1.0 is an open source framework that implements MapReduce, a technology for distributed data processing, together with the Hadoop Distributed File System (HDFS), a distributed file system that takes the place of GFS [7,8]. YARN (MapReduce version 2, MR v2) is the resource management platform of Hadoop 2.0 and remedies the shortcomings of Hadoop 1.0. YARN lets various applications share the resources of a Hadoop cluster by allocating the necessary resources to each application and by focusing on monitoring [8,9].

Now that the front-end HTML5/CSS3 standards have been finalized, a new multi-device platform has been completed as the next-generation web standard in an N-tier environment [10]. In particular, with the standardization of ECMAScript, high-performance web applications can be developed in JavaScript on both the client side and the server side [11].

Furthermore, as smartphones proliferate rapidly, the mobile-first strategy has become the core strategy of web services, and support for mobile environments has become more necessary [12]. In Korea, an electronic government standard framework based on the open source Spring Framework has been implemented in preparation for the next-generation web standards, to establish development standards for national informatization. The Electronic Government Standard Framework version 3.7.0 and eGovFrame Lite 3.7.0 were released in February and December 2018, respectively [13]. Although Big Data are utilized in various ways across industries, most research has focused on introducing practical cases. There is at present a serious shortage of Hadoop Ecosystem systems that apply Big Data technologies to real-time web scraping, and of enterprise groupware Big Data utilization systems based on the Spring Framework. This research therefore collects data searched with keywords through the Spring Framework, using next-generation web standards, and through web scraping based on the Hadoop 2.0 Ecosystem, and loads the collected data onto the Hadoop Distributed File System (HDFS) and HBase. It then designs and implements a Big Data utilization system that extracts contents and nouns from the loaded data with a Twitter morpheme analyzer and visualizes the analysis results for keywords, titles, contents and morphemes as a word cloud. This research intends to provide a platform reference model, applicable to enterprise groupware, that applies the Distributed Hadoop Ecosystem and the Spring Framework under the next-generation web standards.

2. Review of Existing Research

2.1. Web Scraping

Web scraping refers to a technology that extracts only the needed information from the HTML document displayed on the web browser screen and then processes, stores and provides it to the user. Web scraping can be used to collect commodity prices from online marketplaces in order to build one's own commodity catalog, and to gather news articles, blog and cafe posts, real estate listings, company profiles and financial data. Web scraping is also used in text mining [14].

Web scraping, also called web harvesting or web data extraction, is data scraping applied to websites. Although web scraping can be done manually by a software user, the term typically refers to automated processes implemented with a bot or web crawler. Scraping a web page involves fetching it and then extracting data from it. Fetching is the downloading of a page; web crawling, which fetches pages for later processing, is therefore a main component of web scraping. Web scraping is used for contact scraping and as a component of applications for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping, gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashups and web data integration [15].
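The extraction step described above can be illustrated with a minimal Java sketch. It is not the paper's implementation: the class name `ScrapeSketch` and its methods are hypothetical, the HTML is an inline sample rather than a fetched page, and regular expressions are used only to keep the example dependency-free (a production scraper would normally rely on a real HTML parser).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: shows only the "extract" half of fetch-and-extract.
final class ScrapeSketch {
    // Matches the text between <title> and </title>, ignoring case and line breaks.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Matches the double-quoted href value of an <a ...> tag.
    private static final Pattern LINK =
            Pattern.compile("<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the trimmed page title, or an empty string if none is found.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Returns every href value in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Inline sample standing in for a downloaded page.
        String html = "<html><head><title> Big Data Blog </title></head>"
                + "<body><a href=\"https://example.com/post/1\">Post 1</a>"
                + "<a class=\"x\" href=\"/post/2\">Post 2</a></body></html>";
        System.out.println(extractTitle(html));
        System.out.println(extractLinks(html));
    }
}
```

Keeping extraction separate from fetching makes the extraction logic testable without network access, which is why the sketch works on a string rather than a URL.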

2.2. HTML5/CSS3

HTML5 is the fifth full version of HTML, the core markup language of the World Wide Web. It was proposed as the successor standard to HTML 4.01, XHTML 1.0 and DOM Level 2 HTML. On October 28, 2014, the W3C finalized and announced the HTML5 standard [16]. CSS is the standard style sheet language applied to HTML documents and brings the existing concept of style sheets to the Web. Since CSS1 became a W3C recommendation in December 1996, there has been no single official CSS3 standard; instead, each module has been standardized independently [17].

2.3. Architecture of Spring MVC

In Spring MVC, the roles of the components, such as DispatcherServlet, HandlerMapping, Controller, Interceptor, ViewResolver and View, are clearly separated. DispatcherServlet is the Front Controller of the Spring MVC Framework and supervises the life cycle of each web request and response. HandlerMapping determines which Controller will process the URL of a web request. The Controller performs the business logic and stores the resulting data in a ModelAndView. A ModelAndView consists of a Model data object, in which the Controller stores the result of its processing, and the information of the page to which the data object is to be passed. ViewResolver determines which View is selected. The View displays the Model object, which holds the result data. Figure 1 illustrates the flow among the Spring MVC components [18,19].
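The request flow described above can be mimicked in plain Java to make the division of roles concrete. The sketch below does not use Spring: `MvcFlowSketch`, its nested `ModelAndView` class, and the `/WEB-INF/views/` prefix with a `.jsp` suffix are stand-ins chosen for illustration, not the actual Spring classes or the configuration used in this research.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Plain-Java stand-in for the DispatcherServlet flow; no Spring dependency.
final class MvcFlowSketch {
    // Stand-in for ModelAndView: a logical view name plus the model data.
    static final class ModelAndView {
        final String viewName;
        final Map<String, Object> model = new TreeMap<>();
        ModelAndView(String viewName) { this.viewName = viewName; }
    }

    // Stand-in for HandlerMapping: URL path -> controller function.
    private final Map<String, Function<Map<String, String>, ModelAndView>> handlerMapping =
            new HashMap<>();

    void register(String path, Function<Map<String, String>, ModelAndView> controller) {
        handlerMapping.put(path, controller);
    }

    // Stand-in for ViewResolver: logical view name -> physical view path (assumed layout).
    static String resolveView(String viewName) {
        return "/WEB-INF/views/" + viewName + ".jsp";
    }

    // Stand-in for DispatcherServlet: look up the controller, run it, resolve the view.
    String dispatch(String path, Map<String, String> params) {
        Function<Map<String, String>, ModelAndView> controller = handlerMapping.get(path);
        if (controller == null) {
            return "404 " + path;
        }
        ModelAndView mav = controller.apply(params);
        // "Rendering" is reduced to returning the view path and the model contents.
        return resolveView(mav.viewName) + " " + mav.model;
    }

    public static void main(String[] args) {
        MvcFlowSketch dispatcher = new MvcFlowSketch();
        dispatcher.register("/search", params -> {
            ModelAndView mav = new ModelAndView("searchResult");
            mav.model.put("keyword", params.get("keyword"));
            return mav;
        });
        System.out.println(
                dispatcher.dispatch("/search", Collections.singletonMap("keyword", "hadoop")));
    }
}
```

In real Spring MVC the same steps are carried out by framework classes and configured through annotations or XML; the sketch only shows the order in which the responsibilities are handed over.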

Figure 1. Flow of Spring MVC Components

2.4. Hadoop Ecosystem

Currently, the Hadoop Ecosystem can be composed as illustrated in Figure 2 [20]. However, the system architecture for this research is composed of three Linux servers. Thus, the Hadoop Ecosystem is realized in a system architecture that utilizes Hadoop 2.0, HBase and ZooKeeper.


Figure 2. Architecture of Hadoop Ecosystem

In Hadoop, the NameNode acts as the master of HDFS and assigns I/O operations to the DataNodes, which act as slaves. The Secondary NameNode is the node used to merge the block information of the NameNode, periodically combining the NameNode's file system image with its edit log. Figure 3 illustrates the architecture of Apache Hadoop 2.0 and YARN [21].

Figure 3. Architecture of Apache Hadoop 2.0 and YARN

In YARN, each cluster has its own Resource Manager, which manages the resources of the cluster and schedules tasks. The Scheduler, one of the three main components of the Resource Manager, manages the resources of the Node Managers and allocates resources where they are lacking. The Application Manager launches the Application Master, which performs a specific operation on a Node Manager, and manages the status of the Application Master. To verify whether a Container is still operational, the Resource Tracker holds setup information such as the maximum number of retries of the Application Master and how long to wait before deeming a Node Manager non-operational. The actual data are stored in the DataNodes. Each Node Manager is responsible for exactly one node. The Application Master acts as the master of one program and is allocated appropriate Containers by the Scheduler. Furthermore, the Application Master monitors and manages the execution status of the program. A Container is defined by attributes such as CPU, disk and memory.

HBase is a distributed column-oriented database built on the HDFS of Hadoop. The Master Server assigns Regions to the Region Servers with the help of ZooKeeper. Regions are spread across the Region Servers for load balancing. The Region Server communicates with clients, manages data-related operations, and handles read and write requests for its Regions [22].

ZooKeeper is an open source project that provides distributed coordination services. Its purpose is to let developers focus on core business logic rather than coordination logic. ZooKeeper is based on a master-slave architecture composed of a Leader and Followers. More specifically, ZooKeeper is built around its data model and three concepts: the Ensemble, a group of ZooKeeper servers; the Quorum, which prevents inconsistencies in Ensemble data; and the Znode, the unit of the distributed data model. The data model composed of Znodes is the core of ZooKeeper and resembles the Linux file system. The architecture and techniques of ZooKeeper aim to provide a stable data model, which can be used to implement global locks, cluster information management, leader election and so on [23].
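How a Quorum protects Ensemble data can be shown with a small calculation. The sketch below assumes ZooKeeper's default majority quorum, in which an update must be acknowledged by more than half of the servers; the class and method names are hypothetical.

```java
// Hypothetical helper illustrating majority-quorum arithmetic for a ZooKeeper ensemble.
final class QuorumSketch {
    // Smallest number of servers that forms a strict majority of the ensemble.
    static int quorumSize(int ensembleSize) {
        if (ensembleSize < 1) {
            throw new IllegalArgumentException("ensemble must contain at least one server");
        }
        return ensembleSize / 2 + 1;
    }

    // Number of servers that may fail while a majority is still reachable.
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 4, 5}) {
            System.out.println(n + " servers: quorum=" + quorumSize(n)
                    + ", tolerable failures=" + tolerableFailures(n));
        }
    }
}
```

Under this assumption a three-server ensemble, such as the three-server architecture described in this section, keeps working after one server fails, while a fourth server adds no extra fault tolerance. This is why odd ensemble sizes are usually chosen.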

3. Analysis and Design of Big Data Ecosystem

3.1. Development Environment

In the next-generation web standard framework environment, this research places the Spring Framework, which has a lightweight container structure, in the middle tier and constructs three virtual machines (Server 1, Server 2 and Server 3) using Oracle VirtualBox [24]. The configuration of the Big Data development environment, built on Ubuntu Linux servers, is specified in Table 1.

Table 1. Development Environment of Big Data Ecosystem

Items                   Contents
Server O/S              Linux Ubuntu 16.04
IDE Tool                Eclipse Neon (4.6)
Web Container           Apache Tomcat 9.0.13
Java Development Kit    Linux x64 Java SE 8.x
Framework               Spring Framework 4.3.14
Hadoop Ecosystem        Hadoop 2.6.5, HBase 1.2.6, ZooKeeper 3.4

Server 1 runs Hadoop (NameNode, Secondary NameNode and Resource Manager), ZooKeeper and HBase on a Linux virtual server. Server 2 and Server 3 run Hadoop (DataNode and Node Manager), ZooKeeper and HBase. A Big Data hardware architecture should be designed from the viewpoint of the 3Vs (Volume, Velocity and Variety). However, as illustrated in Figure 4, the Big Data pilot system hosts all three virtual machines on a single PC. The pilot system is implemented in a structure in which the various technologies and functions of Big Data can be utilized.

Figure 4. Architecture of Big Data Pilot System

3.2. Collection of Blog Data

A Big Data pilot system begins with the collection of Big Data. Big Data are collected from various sources, including structured internal sources and external systems such as portals, blogs, SNS, news, weather services and government organizations. Public data portal sites provide data in various forms, such as file data, open APIs and visualizations.

In particular, Naver, one of the most widely used portal sites in Korea, provides blogs and various other information, such as news, through the Naver Open API. This research is designed to collect, for a certain period of time, unstructured data based on keyword searches of Naver blogs, which constitute an external system, and to apply the collected data to the Big Data pilot system [25,26]. The collected Big Data are then distributed and stored through Hadoop. Hadoop by itself is not effective for permanently storing large-scale message data generated in real time, so HBase, a NoSQL database, is used for this purpose. ZooKeeper lets the Hadoop and HBase servers share information easily and safely, so that the distributed Big Data environment can be managed more efficiently.
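The keyword-based collection step can be sketched as the construction of a search request. The paper does not list its request code, so the following is only an illustration: the endpoint path and the `query`, `display` and `start` parameters reflect the publicly documented Naver blog search API as best understood here and should be checked against [26], and the class name is hypothetical. The sketch builds the URL only; an actual call would also need the client ID and client secret issued by Naver, which the documentation describes as request headers.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a blog search URL for a keyword (no network access).
final class BlogSearchRequestSketch {
    // Assumed endpoint of the Naver blog search Open API; verify against the official docs.
    static final String ENDPOINT = "https://openapi.naver.com/v1/search/blog.json";

    // display = number of results per page, start = 1-based index of the first result.
    static String buildUrl(String keyword, int display, int start) {
        try {
            // Keywords (often Korean text) must be percent-encoded as UTF-8.
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            return ENDPOINT + "?query=" + encoded + "&display=" + display + "&start=" + start;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a mandatory charset on every Java platform.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("big data", 10, 1));
    }
}
```

Collecting "for a certain period of time" then amounts to repeating such requests on a schedule and advancing `start` to page through the results.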

3.3. Application of Morpheme Analyzer

The term "morpheme analysis" refers to the operation of dividing source-language text into morpheme units and then attaching part-of-speech information to each morpheme. The morpheme analyzers currently in wide use include Kkma [27], Komoran [28], Hannanum [29], MeCab-ko [30], Twitter-Korean-Text Processor [31] and khaiii (Kakao Hangul Analyzer III) [32]. This research is designed to implement the Big Data pilot system with the Twitter morpheme analyzer, which is fast and relatively efficient in normalizing words and extracting jargon [33].

4. Implementation of Big Data Pilot System

The Big Data pilot system of this research is a Big Data utilization system based on the Hadoop Ecosystem and the Spring Framework under the next-generation web standard framework environment. As its first step, it collects blog data searched with keywords from a web browser through the Naver Open API. Next, as the step that stores the Big Data in Hadoop, the morpheme analysis step processes the blog data and organizes the stored data into contents, nouns and other items extracted for each keyword by the Twitter morpheme analyzer. The integration step then summarizes and displays, for each keyword, the title, contents, morpheme analysis results, date and so on. Lastly, the pilot system visualizes the extracted Big Data as a Word Cloud. The start-up screen of the Big Data pilot system is illustrated in Figure 5.

Figure 5. Start View of Big Data Pilot System

Figure 6 shows the results of Hadoop’s processing of the Big Data collected from blogs through the Naver Open API.


Figure 6. Start View of Big Data Pilot System

Figure 7 shows the results of the analysis and extraction performed by the Twitter morpheme analyzer on the blog data stored in Hadoop.

Figure 7. POS-Tagging View of Big Data Pilot System

Figure 8 shows the final results of the analysis performed by the Twitter morpheme analyzer after the blog data searched with keywords were loaded into Hadoop.

Figure 8. Main View of Big Data Pilot System

Figure 9 shows the visualization of the data as a Word Cloud, so that the results of the keyword-search Big Data analysis described above can be easily understood.


Figure 9. Word Cloud View of Big Data Pilot System
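A word cloud such as the one in Figure 9 is typically driven by term frequencies. The following minimal sketch counts how often each noun occurs and keeps the most frequent ones; it assumes the nouns have already been extracted by the morpheme analyzer (here they are simply a hard-coded sample list), and the class and method names are hypothetical rather than taken from the pilot system.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: turns extracted nouns into ranked frequencies for a word cloud.
final class WordCloudSketch {
    // Returns the `limit` most frequent nouns, highest count first, ties in alphabetical order.
    static Map<String, Long> topNouns(List<String> nouns, int limit) {
        Map<String, Long> counts = nouns.stream()
                .collect(Collectors.groupingBy(n -> n, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                // LinkedHashMap preserves the ranked order for the caller.
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // Sample input standing in for nouns produced by the morpheme analysis step.
        List<String> nouns = Arrays.asList("hadoop", "spring", "hadoop", "hbase", "hadoop", "spring");
        System.out.println(topNouns(nouns, 2));
    }
}
```

The resulting counts can be passed to any word cloud renderer, which usually maps a higher count to a larger font size.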

5. Conclusion

In the future, as web technologies are standardized and various devices emerge under the mobile-first strategy, the full-stack technology standards of the industry at large and of enterprise system environments will change in many ways within a short time. Accordingly, research should be conducted in the following fields: UX in the next-generation web standard framework environment; infographic-based visualization; hybrid-app-based responsive web design; and adaptive web design.

Furthermore, as structured and unstructured data are rapidly generated by various types of intelligent equipment and mobile devices, more and more businesses intend to utilize Big Data in their decision-making processes. Accordingly, services must be provided quickly based on up-to-date Big Data technologies. However, most work on Big Data utilization focuses on practical cases. Because of the information-security constraints of businesses and public institutions, there is a serious shortage of Hadoop Ecosystem utilization systems, based on the electronic government standard platform in a distributed environment, that collect, load and process data with Big Data technology.

Thus, this research collects data searched with keywords in Naver blogs on the basis of Hadoop 2.0 through the Spring Framework, which underlies the electronic government standard framework, and loads the collected data onto the Hadoop Distributed File System and HBase. Furthermore, this research designs a Big Data utilization system that visualizes, as a Word Cloud, the analysis results for keywords, titles, contents and morphemes on the basis of the contents and nouns extracted from the loaded data with a Twitter morpheme analyzer, and implements a reference model.

In future work, research will continue on analysis, exploration, inference and forecasting using technologies such as Impala, Zeppelin, Mahout and Sqoop, in order to analyze and apply Big Data through processing and exploration after collecting and loading them in real time by expanding the Hadoop Ecosystem.

References

[1] Klaus Schwab. The Fourth Industrial Revolution, World Economic Forum, (2016).

[2] Rahman, Hamid and Chin. Emerging Technologies with Disruptive Effects: A Review. PERINTIS eJournal. (2017). 7(2):111-128.

[3] D. Reinsel, J. Gantz and J. Rydning. DATA AGE 2025: The Evolution of Data to Life-Critical. An IDC White Paper. (2017).

[4] Lee, M. Y. and Choi, W. BigData Processing Technology Trend for BigData Analysis. Korea Information Processing Society Review. (2012). 19(2):20-28.

[5] Kim, G. W. BigData Technology to Learn in Practice. Wikibooks. (2017).

[6] Lee, M. H. Design and Implementation of Big Data Utilization System based on Hadoop Ecosystem and Spring Framework. Journal of the Korean Institute of Plant Engineering. (2018). 22(2):15-21.

[7] Wikipedia. https://en.wikipedia.org/wiki/Apache_Hadoop

[8] Korea Database Agency. A Practical Guide to Big Data Technology. (2015). p.4-655.

[9] edureka!. Hadoop 2.0. https://www.edureka.in/blog/introduction-to-hadoop-2-0-and-advantages-of-hadoop-2-0/

[10] MDN web docs. Modules and the standardization process. https://developer.mozilla.org/en-US/docs/Web/CSS/CSS3

[11] Wikipedia. ECMAScript. https://en.wikipedia.org/wiki/ECMAScript

[12] Luke Wroblewski. Mobile First. A Book Apart. (2011). p.1-120.

[13] eGovFrame. Release Note. http://www.egovframe.go.kr/

[14] Y. S. Jang and H. G. Kang. Coding to Learn with R. Life and Power Press. (2017). p.276-293.

[15] Wikipedia. https://en.wikipedia.org/wiki/Web_scraping

[16] Wikipedia. https://en.wikipedia.org/wiki/HTML5

[17] MDN web docs. https://developer.mozilla.org/ko/docs/Web/CSS/CSS3

[18] Java2Blog. https://java2blog.com/spring-mvc-tutorial/

[19] Lee, I. M. Spring 3.0 of Toby. Acorn. (2010). p.1014-1116.

[20] edureka!. https://www.edureka.co/blog/hadoop-ecosystem

[21] edureka!. https://www.edureka.co/blog/introduction-to-hadoop-2-0-and-advantages-of-hadoop-2-0/

[22] Apache HBase. https://hbase.apache.org/

[23] Apache ZooKeeper. https://zookeeper.apache.org/

[24] VirtualBox. https://www.virtualbox.org/

[25] Public Data Portal. https://www.data.go.kr/

[26] Naver Developers. https://developers.naver.com/docs/search/blog/

[27] Kokoma Project. https://kkma.snu.ac.kr/

[28] KOMORAN. https://shineware.tistory.com/entry/KOMORAN-ver-24

[29] Hannanum. http://semanticweb.kaist.ac.kr/hannanum/

[30] mecab-ko. https://bitbucket.org/eunjeon/mecab-ko

[31] open-korean-text. https://github.com/open-korean-text/open-korean-text

[32] khaiii. https://github.com/kakao/khaiii

[33] KoNLPy. http://konlpy.org/ko/latest/morph/#comparison-between-pos-tagging-classes