OPTIMIZATION OF THE SEARCH GRAPH USING HADOOP AND LINUX OPERATING SYSTEM

(1)

Sherine et al. World Journal of Engineering Research and Technology

OPTIMIZATION OF THE SEARCH GRAPH USING HADOOP AND

LINUX OPERATING SYSTEM

P. M. Dayana Sherine*1, Gooty Dadameer1 and Manasa K.2

1

Assistant Professor, Koshys Institute of Management Studies, Bangalore. 2

Assistant Professor, Adrasha Institute of Technology, Bangalore.

Article Received on 24/07/2017 Article Revised on 14/08/2017 Article Accepted on 03/09/2017

ABSTRACT

The evolution of Social networking sites has lot of challenges for

technology firms and researchers. The Social networking sites are

gaining popularity among the users across the globe and networking of

individuals is increasing very rapidly. People search on the Social

networking sites to find old friends and other interesting people, this search operation runs in

the background of Social network. The search operation needs to be optimized to enhance the

user‟s experience. In this paper a MapReduce based search algorithm was designed and

processed on a selected Hadoop cluster. The search operation on Hadoop framework could be

affected by system parameters like memory, I/O and CPU performance etc. In this research

paper one of the functionality of memory i.e. swappiness (which handles a Swap space) is

changed & all the other parameters are kept constant to achieve the optimization of the search

operation. The search operation is optimized by processing the BigData of different sizes and

varying the swappiness which behave as a virtual memory of the system.

KEYWORDS: Social Network; Search Graph algorithms; Big Data; Hadoop; Map Reduce; Swappiness; Linux; Operating System.

INTRODUCTION

With the several advancements in computing and networking technologies are manifested to

change the way of computing in business. Many of these changes/trends emphasized through

managerial computing likewise Cloud computing, Distributed computing, and many more.

wjert, 2017, Vol. 3, Issue 5, 438-448.

World Journal of Engineering Research and Technology

WJERT

www.wjert.org

ISSN 2454-695X

Original Article

SJIF Impact Factor: 4.326

*Corresponding Author

P. M. Dayana Sherine

(2)

Social network initiated at the grassroots level has been growing quickly in several sectors

and it‟s a combination of many computing technologies like Cloud computing, distributed

computing &, parallel computing. Social network leading to real business models, and

provide efficient platform for new emerging businesses. Some examples of Social network

platforms are MySpace, Bebo, FaceBook, and LinkedIn.

The internet users are empowered to connect and access any information through social

networking by rapid development and integration of various computing concepts. The

number of users increased at a fast pace and at the same time the evolvement of users is

increasing rapidly. The development & emergence of social network posed various research

problems and issues for researchers.

Humongous data is stored across the globe in the social networking sites; this data is

interacting, assimilating and coordinating. Hence this huge data may pose issues related to

fragmentation of the social graph on the networks, data security & ethical issues in the

networked globe. The data available on social networks could be used for various business

solutions like marketing, data management and other fields.

In Social networking one of the main backbone technologies is graph search algorithm. It

runs in the background of the Social network for searching and networking operation. Today,

the main problems faced by graph search algorithm are Inefficiency of the system,

unstructured graph, Poor locality, Big Data access to computation etc.

In this research, two problem areas- Inefficiency of the system and Big Data access to

computation are targeted. The Big Data computation is optimized by selecting an appropriate

configuration of the system. The system is configured in form of Hadoop cluster and

computation of Big Data is processed by designed graph search algorithm. The sample of Big

Data size is 2GB, 4GB, 8GB & 10GB & system configuration is constant for processing of

all size of data. The processing of Big Data is optimized, in terms of time complexity, by

selecting an optimum value of swappiness (form of virtual memory).The search algorithm is

designed for processing the Big Data on the Hadoop cluster and able to achieve the optimized

performance of algorithm for the designed cluster using operating system memory

(3)

Development of Search Algorithm for Optimization on Hadoop Framework

The process of finding a solution can be divided in various steps, like developing an

algorithm, executing it on Hadoop framework, finding the parameters affecting the

processing, selecting the parameters for optimization & finally concluding on the efficient &

effective method of optimizing the developed algorithm.

Before understanding further, it is necessary to know the basic terminologies related to the

developed algorithm.

A. What is Distributed System?

The data is distributed in different systems and the data search operations are carried out by

connecting various systems (defined as nodes). The advantage of this system is that data can

be retrieved & stored on the network.

“The distributed is a collection of various computers, connected through a network with the

help of software called middleware. The system helps to share various resources of the all the

computers and is perceived as a single system with good, fast and integrated computing

facility”.

B. What is Parallel Process?

The concept is focused on the processing of the huge data simultaneously on the same

network. This concept is widely applicable in Big Data processing in social networks.

“Parallel processing solves Big Data problems by fragmenting them into smaller ones and

solving them at the same time. Parallel processing was considered mainly for solving

data-intensive problems encountered in computing problems related to the field of engineering,

science”.

C. What is Cloud Computing?

The principle of Cloud computing is a type of bigger platform of Distributed systems on the

internet. The nodes can be connected on a large scale; this concept has a vital usability in the

Social networking sites.

“Cloud computing: is based on a principle of sharing computing resources on internet, cloud

computing concept promotes cost effective solution for business to set virtual office with

(4)

D. What is MapReduce?

Map/Reduce was first implemented by the Google search engine; the technology helps in

indexing and analyzing of the BigData. Map/Reduce distributes the data across a large.

Fig. 2.1: Concept of emerging technologies in flow-diagram.

Cluster can divide the tasks to work independently, & in Parallel. This helps in swift &

efficient processing of the Big Data.[11]

E. What is Hadoop?

Hadoop was developed to support the open- source web search engine project called Nutch.

Today Hadoop is the best– known MapReduce framework in the market. Its java base

program but it also supports python and ruby programming for the development of the

algorithms.

F. Development of Algorithm

Social networking sites have become prominent tool to connect with people across the globe.

Users keep on searching people to expand their network, this search operation happens after a

(5)

to enhance the user‟s experience; therefore all technology companies working on Cloud

computing; are striving to optimize the search operation.

In this research, the search algorithm is developed on MapReduce concept & processed on

the Hadoop framework. As the Hadoop framework is an open source and is a combination of

Distributed systems and Parallel processing.

Algorithm I. Mapper

Step-1- Enter the keyword to search.

Step-2- The keyword will be searched on the networking sites.

Step-3 Searched keyword data is stored in a variable Step-4 Restore all URL and text data in

new variable.

II. Reducer

Step-5 Calculate the number of search of that particular keyword on webpage/networking

site.

The Social networking site/webpage data will be stored on the Distributed system; the stored

data is further used for optimization of the search operation.

Following steps are undertaken to develop the algorithm; it is divided in two stages- Mapper

& Reducer:

The developed algorithm was processed in a Hadoop cluster and parameters like system

memory, Input-Output system & CPU are monitored to optimize the algorithm.

Figure 2.1 is the flow diagram to understand the concept of all technologies used in research.

It was observed that the system memory is the key parameter affecting the processing time of

BigData; hence in the next step, the system memory was microscopically focused for the

optimization of the algorithm. Further study shows that the swappiness of the system memory

could help in understanding the advance level optimization of the designed search algorithm.

(6)

A. Swappiness

Swappiness is an important parameter of Linux kernel and its value is between “0” to

„100”.This function is basically a devoted space in the hard drive and that is generally two

times the capacity random access memory ( RAM) of the system.

Linux kernel uses this swap function by swapping the chucks from the RAM to the swap; this

allows the availability of RAM for other active and important processes.

The value of swappiness derives that, how much and how frequently the linux kernel will

copy the RAM contents to the swap memory. The system defined default value of swappiness

is “60” and it can vary between “0” to “100”.The higher value of swappiness denotes that the

kernel will be more aggressive to ummap the mapped pages and on the counter side a lower

value of swappiness shows that kernel will not tend to unmap the mapped pages.

B. Why to Change Swappiness?

The default value cannot be a perfect fit solution for all the individual cases; it depends upon

hardware specifications and user needs. Hence, it is very important to understand the

functionality of swappiness; this function derives the effcieincy of the operating system in

BigData processing, having said that, it is required to focus on swappiness function for

different system configurations to optimize the user experience on the social networks .

The applicability of swappiness with an example

Example: BigData of 10 GB is taken for processing; the system configuration has 4GB RAM

and 10 GB swap memory. The 10GB data is executed, the default value of swappiness is

“60”; in the next step it was observed that when 40% of the RAM memory is used the

processing will be handled by the swap memory. In the second iteration the swappiness is

changed from default to “100”; this resulted in slow processing of the data. The basic reason

of slow processing of the data is because swap memory act as a virtual memory, but

originally it opt space from the hard disk only.

When swappiness value is “10”; it will consume 90% of memory and 10% of swap space.

This means processing will be faster than “vm.swappiness=100”.

Therefore an experiment is set by varying the value of swappiness from “0” to “100” in the

(7)

and various results of the all the set experiments on the swappiness will be discussed in the

section.

IV. Optimized Algorithm’s Results Based on Swappiness in Hadoop Cluster

To know the swappiness affect on the optimized performance of the algorithm, a cluster of

five systems was created; three out of the five performed the task of mapper/reducer (Task

trackers); one is job tracker (Controller of mapper/reducers) and one is name node (Hadoop

HDFS).

The configuration of three mapper/reducer system, job tracker and name node is 2GB RAM 2

Core processor & 100 mbps network. Big Data of different size viz. a viz. 2GB, 4GB, 8GB &

10GB are processed on the above defined Hadoop cluster. The swappiness is varied from “0” to “100” and results are obtained in terms of time (to the precision of milliseconds) taken to

process the Big Data.

Table 4.1; present the variation in swappiness and processing time of various size of the Big

Data. The Table 4.1 is presenting consolidated results of all the iterations and the details of

the each iteration are represented in the graphical format in the next few paragraphs. It can be

depicted from the table that minimum processing time of Big Data of different sizes, is falling

in the range of 45-50 swappiness. Each of the iterations is executed on the Hadoop cluster for

3 times and average Big Data processing is taken.

Table 4.1: Three Task tracker with 2GB Ram and Data are 2GB, 4GB, 8GB and 10GB. Sr. No. Swap Piness 2GB Data 4GB Data 8GB Data 10GB Data

1 0 1m28.45s* 2m32.77s 4m23.08s 14m5.75s 2 10 1m30.59s 2m31.98s 4m25.13s 11m32.31s 3 20 1m26.23s 2m32.49s 4m25.93s 12m17.16s 4 30 1m31.55s 2m32.82s 4m23.28s 11m43.57s 5 40 1m30.33s 2m33.10s 4m22.94s 12m39.60s 6 50 1m26.26s 2m28.65s 4m15.93s 10m59.60s 7 60 1m30.16s 2m31.65s 4m24.90s 11m12.18s 8 70 1m30.41s 2m31.62s 4m24.99s 11m17.37s 9 80 1m30.69s 2m31.55s 4m20.98s 13m3.28s 10 90 1m26.61s 2m33.69s 4m23.97s 10m47.51s 11 100 1m30.58s 2m36.81s 4m23.97s 11m0.26s *m- minutes, s- seconds

In the first set of experiment, 2GB BigData is processed on a Hadoop cluster of five systems

(8)

the processing time of the Data. It could be depicted from the Fig. 4.1 that for swappiness

range of 45-50 (Data point is shown with green colour) the processing time reduced, while

the time taken at default value of swappiness is more and shown in blue colour in the graph.

The graph also shows some dips in the processing time at 20 % and 80% of swappiness but

these values could be rejected due to extreme lower & higher values of the swappiness; as on

lower swappiness value may face the problem during multitasking of the system due to

incapability of the Memory and on higher swappiness values may slow down the process

since swap memory use is at its peak.

The fourth set of the experiment is on 10GB Big Data processing on the same Hadoop

cluster. Fig. 4.4 shows the result of the 10GB data. The processing time presented on y-axis

against the % virtual memory (Swappiness) in Fig. 4.1 to 4.4, shows that the processing time

of all four size BigData viz. a viz. 2GB, 4GB, 8 GB & 10GB is reduced on the configured

Hadoop cluster taken for the experiment.

Fig. 4.2: Processing time for 4GB Data against swappiness.

(9)

Fig. 4.4: Processing time for 10GB Data against swappiness.

The second set of the experiment is also processed on the same Hadoop cluster to swappiness

is changed in similar fashion as done in the first experiment; Fig. 4.2 shows the results of the

experiment.

The third set of the experiment is on 8GB BigData processing on the same Hadoop cluster.

Fig. 4.3 shows the result of 8GB data.

Table 4.2 presents the comparison of the processing time of BigData of 2GB, 4GB, 8GB &

10GB size with default swappiness of the Hadoop cluster and optimal value of swappiness.

The task of optimization (reduction in the processing time of BigData) is achieved in all the

sets of the experiment & the optimal range of swappiness coming for the experimental set up

is in the range 45- 50 The processing time improvement in 2GB, 4GB, 8GB & 10GB data is

4.32%, 1.98%, 3.38% & 1.87% respectively in case of optimal range of swappiness i.e.

45-50% as compared to default system value of swappiness (60%).

Table 4.2: Comparison of Processing time for Swappiness at Default & Modified value for 2gb, 4gb, 8gb and 10gb Bigdata for Hadoop Cluster.

Sr. No.

Data size

Time taken (sec) to Data process at Default Swappiness (60)%

Time taken (sec) to process Data at Swappiness range

(45- 50)%

% Improvement in processing time

(Optimization)

1 2 GB 86.269 86.785 4.32%

2 4 GB 151.658 148.654 1.98%

3 8 GB 264.903 255.938 3.38%

(10)

CONCLUSION

All the experiments set to run the designed Search MapReduce algorithm on the configured

Hadoop cluster shows that optimization of BigData processing achieved by changing the

virtual memory (swappiness) in the optimal range.

These results could be extrapolated in the search operation of the Social networking sites, and

searching time can be optimized to enhance the user‟s experience. The results also shows that

range of improvement in processing time vary between 1.5 to 4.5 % for the given size of Big

Data by fixing the swappiness in the optimal range.

The sample data taken for processing varied from 2 to 10 GB, the data size is small as

compared to actual Big Data problems of the real world; but due to resource constraint the

bigger size of the data cannot be taken for the experiment. The processing of only 10 GB data

required a Hadoop Cluster of 5 systems with 2GB RAM, 2 core processor and 100mbps

network connection. The optimal swappiness range in bigger size of data would fetch us

more fruitful results in terms of optimization, of the search operation on the Social

networking sites.

The extreme lower ( below 30%) & higher ( above 70%) swappiness have given reduction in

processing time but not considered since lower swappiness create hindrance in multitasking

and higher swappiness will slow down the BigData processing.

The above fact of lower and higher % of swappiness for BigData processing; can be attained

by going in depth of the memory & processor‟s functionalities, of the operating system. For

future work researchers may take other functions of the system memory and processor; for

optimization of the BigData processing.

REFERENCE

1. Major Parameswaran and Andrew B. Whinston, “Resaearch Issues in Social Computing”,

In: Journal of the Association for Information System, 2007; 8(6): 22.

2. Preeti Narooka and Dr. Sunita Chaodhary, “Graph Search Process in Social Networks and

it's Challenges”, In IJCSET, June 2016; 6(6): 228-232.

3. Firat Tekiner and John A. Keane, “Big Data Framework”, In: International Conference on

(11)

4. Preeti Narooka, Sunita Chaodhary, “Paradigm Shift of Big-Data Application in Cloud

Computing”, International Journal of Advanced Research in Computer and

Communication Engineering, 2016; 5(5): 515-521.

5. Bo Li and Prof. Raj Jain, “Survey of Recent research Progress and issues in Big Data”,

2013.

6. Harshawardhan S. Bhosale, Prof. Devendra P. Gadekar “A Review Paper on Big Data

and Hadoop” in International Journal of Scientific and Research Publications, October

2014; 4(10): 1-7.

O‟reilly Media Inc & Associates Sebastopol, CA, 2006.

8. Book: Wadkar, Sameer, and Madhu Siddalingaiah. "Hadoop Concepts", Pro Apache

Hadoop, 2014.

9. Book: Tom White, “Hadoop: The Definitive Guide”, publication is Oreilly and Yahoo

press, 2009.

10.Whitepaper: next MEDIA, CSA, “Social Networks Overview: Current Trends and

Research Challenges”, 2012.

11.Whitepaper: Jean Yan, “Big Data, Bigger Opportunities”.

Available:http://www.meritalk.com/pdfs/bdx/bdx-whitepaper-090413.pdf, 2013.

12.Whitepaper: Novell‟s Technical Whitepaper, “Determining the Correct Usage of Swap in

Linux* 2.6 Kernels”, 2007.

13.Whitepaper: Matthews, Bob & Murray, Norm, “Virtual Memory Behavior in Red Hat

Linux A.S. 2.1”, Red Hat whitepaper, Raleigh, NC, 2001.

14.Article: International Telecommunication Union – “Distributed Computing: Utilities,

Grids & Clouds Utilities, Grids & Clouds”, ITU-T Technology Watch Report, March

2009

15.Website: Bhavin Turakhia, “Understanding and Optimizing Memory Utilization”,

http://careers.directi.com/display/tu/Understanding+and +optimizing

+Memory+utilization, 2013.

16.Website:

http://unix.stackexchange.com/questions/265713/how-to-configure-swappiness-in-linux-memory-management.