Average Weight based Pattern Frequency for Performing Outlier Mining in Web Documents


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 7, Issue 9, September 2017)


S. Sathya Bama¹, M. S. Irfan Ahmed², A. Saravanan³

¹,³ Sri Krishna College of Technology, Coimbatore, Tamil Nadu, India

² Nehru Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India

Abstract: With the enormous amount of information available online, the World Wide Web has become a prominent field of research. Web mining research sits at the intersection of several research communities, including data mining, information retrieval, text mining, and subareas of artificial intelligence. Search engines are one of the major applications; they focus on retrieving relevant documents from the web based on the user's query. However, the results they produce are not accurate, since they contain many uninteresting documents along with the relevant ones. Outlier (anomaly) detection has therefore become an interesting problem in web mining research. This paper presents a pattern frequency model to mine web documents based on the user's needs. The experimental analysis shows that the proposed average weight based pattern frequency approach correctly identifies 90% of the outliers and ranks the relevant documents accurately for the given query.

Keywords: Web Mining, Outlier Detection, Pattern Frequency, Web Documents, Search Engine.

I. INTRODUCTION

Owing to the characteristics of the web, such as its huge size, diversified content, and heterogeneous, dynamic information, a web query can return millions of web pages, many of them irrelevant or redundant. Most users lose patience navigating a large number of links without finding the exact information they need. Web mining has therefore been a central focus of attention in recent years, leading to several research projects and articles. Web mining is broadly classified into three categories [1]: web usage mining, web structure mining, and web content mining [2], [3].

A traditional search engine is a multifaceted system for extracting and integrating data [4]. It returns thousands of results with voluminous data to the user; however, many of them are not pertinent to the user query and do not deliver the best outcome [5].

Currently, there are several search engines from different multinational companies. These search engines do not all use the same procedure to retrieve interesting documents from the web; each employs different techniques on top of some basic common functionality. However, they all perform three basic tasks. Unfortunately, the information presented to the user for a given query is not always effective; it may contain inappropriate and duplicate documents from one or more locations.

Also, a search engine extracts tens to hundreds of web pages by comparing the terms of a single user query, and it receives millions of queries from web users every hour. Besides, owing to technological improvement, the web allows any user to publish anything based on their needs. This causes web resources to grow uncontrollably, by at least 10 documents a minute. Yet only minimal research has been carried out on the results produced by search engines. Implementing a search engine is a challenging task in itself, but producing better results is equally important. As a result, presenting only significant information from the web has become an intricate and demanding task for search engines, owing to the increasing quantity of data stored on the web.

II. LITERATURE SURVEY

Web content outlier mining focuses on discovering outliers from web contents of similar classification. Many researchers have worked on general outlier mining, where most concentrate on mining outliers in numeric and other data types from traditional data sets rather than in web data sets. A few researchers have moved on to mining the web by proposing novel algorithms. A survey of outlier mining techniques, their advantages and disadvantages, and open issues was presented in [12].


The authors also partitioned the taxonomy of web outliers into web content outliers, web usage outliers, and web structure outliers, and described each. They presented a common framework for web content outlier mining that assumes the existence of a dictionary containing the important words of a particular domain, which acts as a base for further research. They also extended their work to focus on the distance between objects and their neighbours using the k-nearest neighbour algorithm [7].

The authors extended their work by taking advantage of the HTML tag structure of web pages and the n-gram method for partial matching of strings, implementing an n-gram based procedure for mining web content outliers [8]. To reduce processing time, the proposed procedure uses only the data enclosed in <Meta> and <Title> tags. Moreover, the authors showed that the method using text taken from <Meta> and <Title> tags gives results equivalent to the method that uses text captured from <Meta>, <Title>, and <Body> tags.

The authors extended the above work by explaining the significance of exception mining and its real-time applications [9]. They proposed and documented a general model that supports the development of content-oriented procedures for mining web outliers, and explained the WCO-Mine algorithm for mining web content outliers. They also introduced the WCOND-Mine procedure for detecting web content outliers based on n-grams without a domain dictionary [10], and presented a novel algorithm called HyCOQ, a hybrid process that combines the strengths of n-gram oriented and word oriented systems [11].

Poonkuzhali et al. focused on mining web content outliers by proposing a signed approach and full word matching against an organized domain dictionary [13]. Because the dictionary is organized by the number of characters in a word, searching and retrieving the documents takes less time and less space. Poonkuzhali et al. [14] extended their work by proposing a new procedure for mining web documents that identifies redundant links in web content using set-theoretic operations such as subset, union, and intersection. The method removes the redundant links from the original web content before displaying the results to the user.

Poonkuzhali et al. extended the work with a new process that mines web documents by combining clustering with set-theoretic concepts such as union and intersection to detect outliers [15]. Finally, the noisy data are removed from the resulting web documents to obtain the essential content for the user.

Poonkuzhali et al. introduced a novel method that uses a rectangular representation and a signed method to improve the results of search engines by detecting and removing duplicate web documents [16]. The authors continued their research by incorporating two statistical methodologies [17]. These methods are based on proportions (Z-test) and the chi-square test (T-test) to extract outliers from web content. Rejecting these outlier documents during a search improves the quality of the results produced by search engines. The authors showed that the proposed statistical methods improve relevancy and support users by producing valuable results from the available web resources for their search query.

Poonkuzhali et al. continued their work by developing a mathematical model built on a two-way rectangular representation coupled with a signed approach for trust rating, together with a correlation scheme for obtaining the right information without redundancy from both structured and unstructured documents stored on the web [18]. The authors introduced a correlation algorithm for web content mining with which relevance ranking is calculated and redundant documents are also detected [19]. The authors adopted the normalized discounted cumulative gain scheme to evaluate the proposed ranking algorithm.

Poonkuzhali et al. concentrated further on mining web content outliers, removing unrelated web documents from a collection of documents belonging to similar fields [20]. Their work outlines a novel mathematical method based on a signed-with-weight procedure for extracting web content outliers from both structured and unstructured web pages. A novel method that uses clustering techniques coupled with set theory to mine web content outliers by compiling a domain term dictionary was also proposed [20]. The authors extended their work by applying mathematical approaches such as the chi-square test, a statistical method using hypothesis testing, and a correlation algorithm for web content mining [17], [18], [19]. The experimental results of these methods show that they correctly identify about 70% of the outliers.


The authors extended their work by introducing a new outlier detection approach that discovers low-hit web pages using sequential frequent pattern mining to improve a website's design [24]. An algorithm called the web page hit detection algorithm has been proposed to identify the outliers [25], along with the efficient page surfing algorithm, which addresses navigation problems that occur while surfing web pages [26]. However, these methods focus only on HTML web pages, partitioning the pages into links, title, meta, and other tags.

Most of the above algorithms for web content outlier mining require a dictionary to be compiled for all the words in each domain, and each word in the web page is searched in the dictionary to find a match, which makes them tedious to implement in practice. Moreover, all these methods weight words in only two categories: one for words found in the domain term dictionary and another for words not found in it. No method considers the keywords in the search query.

These issues call for an algorithm that detects web content outliers in both structured and unstructured data with improved performance. To that end, the enhanced weighted frequency approach [21] and the proximity based term frequency approach [22] have been introduced. In this paper, the novel idea of generating patterns from the key terms is proposed to compute the relevance score. The next section presents the pattern frequency approach, which assigns an average weight to the subsets generated from the key term set given by the user.

III. ARCHITECTURE OF PATTERN FREQUENCY APPROACH

The architecture of the proposed average weight based pattern frequency approach is depicted in Figure I. The user query is pre-processed to remove stop words, which carry little meaning. The query string is then tokenized and the key terms are extracted, forming the superset K with n terms. The superset is processed to find all its sequential subsets Pi; the number of sequential subsets that can be generated from a superset of n elements is n(n+1)/2.

A set of order n is a collection of n elements. For a set of order n, 2^n subsets can be generated, of which n(n+1)/2 are sequential subsets. These sequential subsets are called patterns in this method. For example, let the superset of keywords be K = {x, y, z}. Its subsets are {{x}, {y}, {z}, {x, y}, {y, z}, {x, z}, {x, y, z}, {}}, 2^n = 8 in total, with Pi ⊆ K for every Pi. The proposed algorithm, however, needs only the sequential subsets to find the appropriate relevance score of a document: {{x}, {y}, {z}, {x, y}, {y, z}, {x, y, z}}. The subset {x, z} is left out because x and z are not adjacent in the query. In place of the empty set {}, the insignificant words other than the key terms are considered and used in computing the relevance score.
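The count n(n+1)/2 can be checked by grouping the sequential subsets by their length m. The short derivation below assumes that "sequential" means contiguous in query order, so that a run of m adjacent key terms can start at any of n - m + 1 positions.

```latex
\[
\sum_{m=1}^{n} (n - m + 1) \;=\; \sum_{j=1}^{n} j \;=\; \frac{n(n+1)}{2},
\qquad \text{e.g. } n = 3:\quad 3 + 2 + 1 = 6 \;<\; 2^{3} = 8 .
\]
```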

FIGURE I ARCHITECTURE OF PATTERN FREQUENCY APPROACH

The subsets can be grouped by their order, that is, by the number of elements in the subset. Subset group 1 comprises {x}, {y}, {z}; subset group 2 comprises {x, y}, {y, z}, {x, z}, of which only {x, y} and {y, z} are sequential patterns; subset group 3 comprises {x, y, z}. Weights are assigned to each subset group using the ratio property: the weight assigned to a pattern is the ratio of the order of the subset group in which the pattern resides to the order of the superset. Thus, the weight Pw = O(Pi)/O(K).

In this example, the superset {x, y, z} has order 3, so the weight of the patterns in the subset group of order 1 is 1/3, the weight of those in the group of order 2 is 2/3, and the weight of those in the group of order 3 is 3/3.
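The pattern generation and ratio-property weighting described above can be sketched in a few lines of Python. This is only an illustration, assuming that a sequential subset is a contiguous run of key terms in query order; the function names `sequential_patterns` and `pattern_weight` are hypothetical and do not come from the paper.

```python
from fractions import Fraction


def sequential_patterns(key_terms):
    """Return every contiguous run of key terms, highest order first."""
    n = len(key_terms)
    patterns = []
    for length in range(n, 0, -1):            # order n down to order 1
        for start in range(n - length + 1):   # every start position for this order
            patterns.append(tuple(key_terms[start:start + length]))
    return patterns


def pattern_weight(pattern, key_terms):
    """Ratio property: Pw = O(Pi) / O(K)."""
    return Fraction(len(pattern), len(key_terms))


K = ["x", "y", "z"]
patterns = sequential_patterns(K)
weights = {p: pattern_weight(p, K) for p in patterns}

for p in patterns:
    print(p, weights[p])
```

For K = {x, y, z} this yields the six patterns listed above, with {x, z} absent, and weights of 1/3, 2/3, and 3/3 for orders 1, 2, and 3.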

[Blocks shown in Figure I: pre-process the query and extract sequential key terms; sequential pattern/subset generation; subset/pattern grouping and weight computation; extracted web documents; pre-process the documents; compute the pattern term count; relevance score computation; rearrange the processed documents.]


After the weights are assigned to the key patterns, the input documents are pre-processed. The total frequency of key terms Fkt, the total frequency of non-key terms Fnkt, and the total term frequency Tf are computed for further processing.

Each sequential pattern generated from the key terms is matched against the document to increment its pattern frequency Pf. The pattern of higher order is processed first to avoid duplication in the pattern frequency, and each matched pattern is deleted from the document after its frequency is incremented. The relevance score is then calculated using Equation 1.

$$RS(D_i) = \frac{F_{kt} + 0.5\,F_{nkt} + \sum_{j=1}^{k} \left(P_{f_j} \times P_{w_j}\right)}{T_f} \qquad (1)$$

Because non-key terms should carry less weight than key terms, and to normalize the relevance score, the total frequency of non-key terms is multiplied by 0.5. The documents with the lowest scores are then eliminated from the result set. This procedure is given as an algorithm in the next section.
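As a small worked check, the sketch below computes a relevance score from the four counts. It assumes that Equation 1 has the form RS = (Fkt + 0.5 * Fnkt + Σ Pf * Pw) / Tf, a form inferred here because it reproduces the relevance scores reported in Table II; the function name `score_from_counts` is hypothetical.

```python
def score_from_counts(fkt, fnkt, weighted_pattern_sum, tf):
    """Relevance score under the assumed form of Equation 1.

    fkt: frequency of key terms
    fnkt: frequency of non-key terms (down-weighted by 0.5)
    weighted_pattern_sum: sum of Pf * Pw over all patterns
    tf: total term frequency of the document
    """
    return (fkt + 0.5 * fnkt + weighted_pattern_sum) / tf


# Counts for document D3 as listed in Table II.
d3_score = score_from_counts(fkt=188, fnkt=583, weighted_pattern_sum=77.5, tf=771)
print(round(d3_score, 4))
```

With the D3 counts from Table II (188 key terms, 583 non-key terms, weighted pattern sum 77.5, 771 terms in total), the numerator is 188 + 291.5 + 77.5 = 557, which gives 557 / 771 ≈ 0.7224, the value reported for D3.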

IV. AVERAGE WEIGHT BASED PATTERN FREQUENCY ALGORITHM

The Average Weight based Pattern Frequency Algorithm generates the patterns from the given search query and assigns the weight to the patterns. The algorithm is explained below.

Input: Set of documents Di extracted from the web for the given user query.

Output: Documents with relevancy scores.

Step 1: Pre-process the given user query and extract the set of sequential keywords. Let the set of key terms be K = {k1, k2, k3, ..., kn}, where n is the number of terms, that is, the order of the set.

Step 2: Generate all n*(n+1)/2 sequential subsets of the key terms, excluding the empty set {φ}; these are the patterns P = {P1, P2, ..., Pk}.

Step 3: Group the subsets by the number of elements in each subset and assign each group the weight Pw = m/n, where m is the number of elements in a subset and 1 ≤ m ≤ n.

Step 4: For each document Di in the input dataset, where 1 ≤ i ≤ r and r is the total number of input documents, perform the following steps.

4.1 Pre-process the document Di by removing the stop words.

4.2 Calculate the total number of terms in the document Di as Tf, the total number of key terms as Fkt, and the total number of non-key terms as Fnkt.

4.3 For each pattern P, calculate the pattern frequency Pf by searching the document for exact matches, with the constraint that the pattern with the most elements is searched first.

4.4 When a match occurs, increase the pattern count by one and delete the matched pattern from the document.

4.5 Calculate the relevancy score RSi for the document Di using Equation 1.

4.6 Repeat Step 4 for all the documents in the input set.

Step 5: Sort the documents by RSi in descending order to obtain the documents relevant to the user's query.
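The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions: the stop-word list `STOP_WORDS` is a tiny hypothetical one, tokenization is a simple lowercase alphanumeric split, sequential subsets are contiguous runs of key terms, "deleting" a match is modelled by blanking the matched tokens, and Equation 1 is taken to be (Fkt + 0.5 * Fnkt + Σ Pf * Pw) / Tf.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words (Steps 1 and 4.1)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def sequential_patterns(keys):
    """Contiguous runs of key terms, highest order first (Step 2)."""
    n = len(keys)
    return [tuple(keys[s:s + m]) for m in range(n, 0, -1) for s in range(n - m + 1)]


def relevance_score(query, document):
    """Score one document against the query (Steps 3 to 4.5)."""
    keys = tokenize(query)
    tokens = tokenize(document)
    if not keys or not tokens:
        return 0.0
    n = len(keys)
    key_set = set(keys)
    tf = len(tokens)                                  # Tf
    fkt = sum(1 for t in tokens if t in key_set)      # Fkt
    fnkt = tf - fkt                                   # Fnkt
    weighted = 0.0                                    # running sum of Pf * Pw
    for pattern in sequential_patterns(keys):         # longest patterns first
        m = len(pattern)
        i = 0
        while i + m <= len(tokens):
            if tuple(tokens[i:i + m]) == pattern:
                weighted += m / n                     # one match, weight m/n
                tokens[i:i + m] = [None] * m          # delete the match (Step 4.4)
                i += m
            else:
                i += 1
    return (fkt + 0.5 * fnkt + weighted) / tf


def rank(query, documents):
    """Sort documents by relevance score, highest first (Step 5)."""
    scored = [(name, relevance_score(query, text)) for name, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {
    "a": "web content mining research",
    "b": "cooking recipes for pasta",
}
ranking = rank("web content mining", docs)
print(ranking)
```

Under these assumptions a document containing no key terms at all scores exactly 0.5, since only the down-weighted non-key terms contribute; this is in line with the low score of D9 in Table II.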

V. EXPERIMENTAL ANALYSIS

The experiment is conducted using classified documents as input. The top 10 documents retrieved for the query 'Recent Research in Web Content Mining', listed in Table I, are the input for the proposed approach. The input documents are pre-processed by removing stop words.

TABLE I

INPUT DOCUMENTS

Did   Retrieved Documents
D1    www.cs.uic.edu/~liub/publications/editorial.pdf
D2    dmr.cs.umn.edu/Papers/P2004_4.pdf
D3    www.ijarcsse.com/docs/papers/Volume_3/11_November2013/V3I11-0352.pdf
D4    www.ijcsit.com/docs/Volume%205/vol5issue03/ijcsit20140503316.pdf
D5    ebiquity.umbc.edu/_file_directory_/papers/214.pdf
D6    esatjournals.org/Volumes/IJRET/2014V03/I03/IJRET20140303009.pdf
D7    www.kdd.org/sites/default/files/issues/2-1-2000-06/kosala.pdf
D8    citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.258.8941&rep=rep1&type=pdf
D9    www.upet.ro/annals/economics/pdf/2012/part1/Dinuca-Ciobanu.pdf
D10   arxiv.org/pdf/cs/0011033.pdf


Then the patterns are matched to compute their frequency along with the weights. The total term frequency is also calculated along with a key term count and a non-key term count. Finally, relevance scores for all the documents are calculated. Table II shows the relevance score based on pattern frequency.

TABLE II

RELEVANCE SCORES BASED ON PATTERN FREQUENCY

Did   Fkt   Fnkt   Tf     Σ(Pf*Pw)   Relevance Score   Doc Rank
D1    452   1896   2348   98.4       0.6382            4
D2    356   1255   1611   88.4       0.6654            3
D3    188   583    771    77.5       0.7224            1
D4    139   553    692    70.2       0.7019            2
D5    236   1256   1492   58.2       0.6181            5
D6    492   2541   3033   68.2       0.6036            8
D7    225   1360   1585   55.6       0.6061            7
D8    172   1067   1239   50.5       0.6102            6
D9    18    2578   2596   5.4        0.5055            9
D10   225   1360   1585   55.6       0.6061            7

Fkt: frequency of key terms; Fnkt: frequency of non-key terms; Tf: total frequency; Σ(Pf*Pw): total pattern frequency with weights.

The precision at each position is calculated to evaluate the performance of relevance ranking. Precision can be measured at any point along the ranked list, which makes it a useful metric for verifying search results, since a user looks for good results in the first entry or the first few entries. This is known as precision at k.

To determine the performance of the proposed pattern frequency approach, precision is calculated at each position. Figure II shows the precision comparison at each position of the retrieved documents after relevance scoring.
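Precision at k can be illustrated with a short Python sketch. It uses the standard definition (the fraction of the top k ranked documents that are relevant); the document identifiers and relevance labels below are made up for illustration and are not the paper's data.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the first k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical ranked list and relevance judgments.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}

for k in range(1, len(ranked) + 1):
    print(f"P@{k} = {precision_at_k(ranked, relevant, k):.2f}")
```

Because the measure only looks at the top of the list, a ranking that places relevant documents first keeps P@k high for small k, which is the behaviour a user scanning the first few results cares about.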

FIGURE II COMPARISON OF PRECISION AT EACH POSITION

The precision graph in Figure II compares the performance of the proposed Pattern Frequency Approach with the Enhanced Weighted Approach [21] and the Proximity based Approach [22].

In addition, a dataset of 100 web documents related to web content outlier mining is named the relevant document set (RDS), and a dataset of 100 web documents not related to web content outlier mining is named the outlier document set (ODS). Three test datasets are formed with different ratios of relevant to outlier documents:

Dataset D1: minimum number of outlier documents, with the ratio 80% : 20%

Dataset D2: equal number of outlier documents, with the ratio 50% : 50%

Dataset D3: maximum number of outlier documents, with the ratio 20% : 80%

The experiment was performed on the three datasets D1, D2, and D3, and the precision of the three methods, Enhanced Weighted Term Frequency (EWTF) [21], Proximity based Term Frequency (PTF) [22], and the proposed Average Weight based Pattern Frequency (AWPF), was compared. The comparison is shown in Figure III, Figure IV, and Figure V.

FIGURE III COMPARISON OF PRECISION WITH DATASET D1

FIGURE IV COMPARISON OF PRECISION WITH DATASET D2

[Figure II plots precision against position (P@1 to P@8) for the Pattern Frequency Approach, the Proximity Approach (PTF), and the Enhanced Weighted Approach (EWTF). Figures III and IV plot precision (%) against trial number (1 to 10) for EWTF, PTF, and AWPF on datasets D1 and D2.]


FIGURE V COMPARISON OF PRECISION WITH DATASET D3

The labelled dataset is created by submitting search queries to the Google search engine. For each query, a sample set of 100 documents returned by the search engine is extracted; the queries are listed in Table III.

The proposed methods are compared with the existing methods on this real-world dataset. These labelled datasets contain a few outliers, which consist of redundant and irrelevant documents.

TABLE III

LIST OF INPUT QUERIES

Q#   Query title
1    Applications and Techniques of Web Content Mining
2    Web Mining Techniques for Recommendation and Personalization
3    Web Usage Mining using Fuzzy Logic Techniques
4    Web Usage Mining with Semantic Analysis
5    Web Content Mining Issues and Challenges

A comparative study with the existing N-Gram method [8] and Weighted Approach [20] is given in Table IV.

TABLE IV

COMPARATIVE STUDY

Method                       Measure              Q1     Q2      Q3      Q4      Q5
                             Relevant Documents   95     93      94      93      92
                             Actual Outliers      5      7       6       7       8
EWTF                         Outliers Detected    4      6       5       6       7
                             Accuracy %           80     85.71   83.33   85.71   87.50
                             False Rate %         20     14.29   16.67   14.29   12.50
PTF                          Outliers Detected    5      6       6       7       7
                             Accuracy %           100    85.71   100     100     87.50
                             False Rate %         0      14.29   0       0       12.50
AWPF                         Outliers Detected    5      7       6       7       6
                             Accuracy %           100    100     100     100     75.0
                             False Rate %         0      0       0       0       25.0
Existing N-Gram Approach     Outliers Detected    2      4       3       4       4
                             Accuracy %           40     57.14   50      57.14   50
                             False Rate %         60     42.86   50      42.86   50
Existing Weighted Approach   Outliers Detected    3      4       3       5       4
                             Accuracy %           60     57.14   50      71.43   50
                             False Rate %         40     42.85   50      28.57   50
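The percentages in Table IV appear to follow from the detected and actual outlier counts. The sketch below assumes that accuracy is the share of actual outliers that were detected and that the false rate is its complement; these definitions are inferred from the table values rather than stated in the text, and the function names are hypothetical.

```python
def outlier_accuracy(detected, actual):
    """Percentage of actual outliers that were detected (assumed definition)."""
    if actual <= 0:
        raise ValueError("actual must be a positive count")
    return 100.0 * detected / actual


def false_rate(detected, actual):
    """Complement of the accuracy, in percent (assumed definition)."""
    return 100.0 - outlier_accuracy(detected, actual)


# EWTF on query 2 in Table IV: 6 outliers detected out of 7 actual outliers.
acc = outlier_accuracy(6, 7)
print(round(acc, 2), round(false_rate(6, 7), 2))
```

For example, 6 detected out of 7 actual outliers gives 600 / 7 ≈ 85.71% accuracy and a false rate of about 14.29%, matching the EWTF entries for query 2.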

The Precision Comparison is given in Figure VI.

FIGURE VI PRECISION COMPARISON

[Figure V plots precision (%) against trial number (1 to 10) for EWTF, PTF, and AWPF on dataset D3. Figure VI plots precision (%) against query number (Q1 to Q5) for EWTF, PTF, AWPF, and the existing N-Gram approach.]


From the result analysis, it is inferred that the proposed PTF approach offers 98% accuracy in detecting outliers as well as in ranking the documents based on the user's interest, and so it is found to be the best algorithm when compared with the other approaches. Similarly, the EWTF and AWPF approaches provide 97% accuracy in detecting outliers, with less execution time than the other methods.

VI. CONCLUSION

This research work focuses on removing web outliers, which has important applications such as improving the quality of search engine results and plagiarism detection. It addresses web content outlier mining with a mathematical approach based on pattern frequency.

Research in this field has led to several new ideas and innovations. Further research could include mining outliers in heterogeneous web documents containing hypertext, image, audio, and video data. Other mathematical tools can be explored to further improve the results. A benchmark dataset for comparing web content outlier mining algorithms could also be provided.

REFERENCES

[1] Kosala, R.; Blockeel, H. (2000): Web Mining Research: A Survey, ACM SIGKDD Explorations, 2(1), pp. 1-15.

[2] Madria, S.; Bhowmick, S. S.; Ng, W. K.; Lim, E. P. (1999): Research issues in web data mining, Proceedings of the Conference on Data Warehousing and Knowledge Discovery, pp. 303-319.

[3] Pal, S.; Talwar, V.; Mitra, P. (2002): Web mining in soft computing framework: relevance, state of the art and future directions, IEEE Transactions on Neural Networks, 13(5), pp. 1163-1177.

[4] Gou, J. (2012): Web Content Mining and Structured Data Extraction and Integration: An Implement of Vertical Search Engine System, Research Report.

[5] Chau, M.; Chen, H. (2003): Comparison of Three Vertical Search Spiders, Computer (Journal), 36(5), pp. 56-62.

[6] Agyemang, M.; Barker, K.; Alhajj, R. S. (2004a): Framework for mining web content outliers, Symposium on Applied computing, pp. 590-594.

[7] Agyemang, M.; Ezeife, C. I. (2004b): LSC-Mine: Algorithm for mining local outliers, Proceedings of the International Conference on Information Resource Management Association, pp. 5-8.

[8] Agyemang, M.; Barker, K.; Alhajj, R. S. (2005a): Mining web content outliers using structure oriented weighting techniques and N-grams, Proceedings of the ACM symposium on applied computing, pp. 482-487.

[9] Agyemang, M.; Barker, K.; Alhajj, R. (2005b): Web outlier mining: Discovering outliers from web datasets, Intelligent Data Analysis, 9(5), pp. 473-486.

[10] Agyemang, M.; Barker, K.; Alhajj, R. S. (2005c): WCOND-Mine: algorithm for detecting web content outliers from Web documents, Proceedings of Symposium on Computers and Communications, pp. 885-890.

[11] Agyemang, M.; Barker, K.; Alhajj, R. (2005d): Hybrid approach to web content outlier mining without query vector, Data Warehousing and Knowledge Discovery, Springer Berlin Heidelberg, pp. 285-294.

[12] Agyemang, M.; Barker, K.; Alhajj, R. (2006): A comprehensive survey of numeric and symbolic outlier mining techniques, Intelligent Data Analysis, 10(6), pp. 521-538.

[13] Poonkuzhali, G.; Thiagarajan, K.; Sarukesi, K.; Uma, G. V. (2009a): Signed approach for mining web content outliers, World Academy of Science, Engineering and Technology, 56(9), pp. 820-824.

[14] Poonkuzhali, G.; Thiagarajan, K.; Sarukesi, K. (2009b): Elimination of redundant links in web pages: Mathematical Approach, World Academy of Science, Engineering and Technology, 52, pp. 562-565.

[15] Poonkuzhali, G.; Thiagarajan, K.; Sarukesi, K. (2009c): Set theoretical Approach for mining web content through Outliers detection, International journal on research and industrial applications, 2, pp. 131-138.

[16] Poonkuzhali, G.; Uma, G. V.; Sarukesi, K. (2010): Detection and Removal of redundant web content through rectangular and signed approach, International Journal of Engineering Science and Technology, 2(9), pp. 4026-4032.

[17] Poonkuzhali, G.; Kumar, R. K.; Keshav, R. K.; Thiagarajan, K.; Sarukesi, K. (2011a): Effective Algorithms for Improving the Performance of Search Engine Results, International Journal of Applied Mathematics and Informatics, 5(3), pp. 216-223.

[18] Poonkuzhali, G.; Sarukesi, K.; Uma, G. V. (2011b): Web content outlier mining through mathematical approach and trust rating, Recent Researches In Applied Computer And Applied Computational Science, ISI/SCI Web of Science and Web of Knowledge, Italy, pp.77-82.

[19] Poonkuzhali, G.; Kishore Kumar, R.; Krip Keshav, R.; Sudhakar, P.; Sarukesi, K. (2011c): Correlation Based Method to Detect and Remove Redundant Web Document, Advanced Materials Research, 171, pp. 543-546.

[20] Poonkuzhali, S.; Sudhakar, P.; Sarukesi, K. (2012): Signed – With - Weight Technique for Mining Web Content Outliers, International Conference on Communication, Computing and Information Technology, pp. 40-45.

[21] Sathya Bama, S.; Irfan Ahmed, M. S.; Saravanan, A. (2015): Enhancing The Search Engine Results Through Web Content Ranking, International Journal of Applied Engineering Research, 10(5), pp. 13625-13635.

[22] Sathya Bama, S.; Irfan Ahmed, M. S.; Saravanan, A. (2017): Relevance Re-ranking through Proximity based Term Frequency Model, In G. Stojanov and A. Kulakov Eds., ICT Innovations 2016, Springer, pp. 217-228. (Yet to Publish)

[23] Vasuki, S.; Subramanian, K. (2014): Clustering Based Outlier Detection Using K-Means Strategy, Software Engineering and Technology, 6(8), pp. 226-231.


[25] Vasuki, S.; Subramanian, K. (2016a): An Innovative Outlier Detection Scheme to Identify the Web Page Usage Strategies, International Journal of Advance Research in Computer Science and Management Studies, 4(5), pp. 208-217.