Inter and Intra Feature Based Document Clustering And Summarization approach in the Distributed P2P Overlays

(1)

International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 5, Issue 6, June 2015)

203

Inter and Intra Feature Based Document Clustering And

Summarization approach in the Distributed P2P Overlays

A. Srinivasa Rao

1

, Dr. Ch. Divakar

2

, Dr. A. Govardhan

3

1

Research Scholar, Dept of CSE JNTUK Kakinada, A.P, India

2_{Principal, Pydah College of Engineering and Technology,Visakhapatnam, A.P, India}

3_{Professor and Director, School of Information Technology, JNTUH, Hyderabad, Telangana state, India}

Abstract-- With the increase in the quantity and size of the online text documents in recent years, demand for the automatic text clustering based summarization is growing rapidly. Most of the traditional algorithms perform the clustering based on predefined rules, sentence or phrase position, term frequency, etc. Since the large number of text documents are collected from different sources, these predefined patters are greatly affected on the accuracy. Context identification is one of the major problems in traditional term based or sentence based clustering algorithms. In this proposed approach, topic relevance based context identification algorithm is used for text clustering and summarization approach. This system analyzes sentence to clustering, sentence to summarization and phrase to sentence relationships to select optimal meaningful phrases/sentences from the given large text documents and reduce the duplication in the summarization process. Experimental results prove that the proposed context based clustering and summarization approach has got improvement compared to traditional graph and probability based approaches.

Keywords-- Contextual information, clustering, summarization, term-frequency, medical documents.

I. INTRODUCTION

Automatic document clustering and summarization has become a big issue of interest in document representation with the rapid growth of documents from different sources.Extractive summarization and abstractive summarization are two commonly used approaches to this problem. Extractive summarization selects phrase or sentence from the source documents.Abstractive summarization extracts high probable sentences from the sources which do not appear in the original documents.

Various approaches have been implemented on text documents are based on the features of the document sentences, for example term frequency or word frequency in the literature cue phrase technique[1],optimization techniques[2],probability based methods[3],rank based methods [4], graph based approaches [5] have been implemented. For a large set clustering technique, document summarization can be viewed as a special kind of data reduction which minimizes the size of the documents.

Most of the literature work performs an operation on the document content which is difficult to handle ideal summary for a text document. Traditional web based text summarization runs on the remote server [6] as shown in Fig1.

Fig 1: Cloud based document summarization

Automatic document clustering can be categorized according to document features. It can be categorized according to the document size: multidocument vs single document,Derivation: Abstract vs Extract, Specificity: general vs domain specific, etc. Traditional works were applied to improve the relevant sentence extraction using different term frequency ,textual characteristics and key phrases. The use of natural language processing and machine learning techniques to generate summarized documents from the large collection of documents. The motivation for implementing document clustering in automatic document summarization is to detect the various topics in the source documents. This can be used to reduce duplicates using the most relevant sentence in each sentence. Partitional based document clustering algorithm and its similarity measures are used to find the sentence or phrase ranking process [8]. Similarity measures are used to define the goodness of the clusters. Therefore, the similarity criteria have to be chosen based on the document or sentence context and characteristics of objects specific to that application. Optimization functions have to be chosen based on the textual context, to ensure its mathematical formulation[1-5].

Browser

Summarization Controller

Runtime Engine

(2)

International Journal of Emerging Technology and Advanced Engineering

204

II. RELATED WORK

NetSum is a neural network based technique where they use a set of document characteristics to identify the sentence importance in the document and also incorporating position information along with the term frequency of content words and formulating the content selection process as an optimization problem instead of greedy search[9]. In [6],authors enhanced the context based summarization process for web document summarization by analyzing the relevance of the contents of hyperlinks in a document.

Wang implemented an application which is based on phrase level symmetric non-negative matrix formulation and semantic analysis for clustering the sentences or phrases. In [10] each text sentence is considered as a group of phrases or sentences and the main objective of extractive summarization is denoted the sentences in the group with 0 or 1, where a tag of 1 denotes that a phrase or sentence is a summary document while 0 denotes a non summary sentence content. To overcome static label approach, a conditional random field is implemented in [11].

Supervised summarization approaches treat the summarization task as a two fold classification issue at the phrase level,which the non summary phrases are negatively grouped. After the document representation algorithm, each phrase by a group of characteristics the classification procedure can be tested in different methods like Genetic techniques, Support vector machines(SVM), Feed forward neural network, mathematical logistic regression, Gaussian mixture models and probabilistic neural networks.

For keyword based multi-document summarization, the key sentence can be considered as a key phrase, so the phrase was selected as cluster membership, it is different from conventional clustering such as term and document.Mani[9-12] developed a graph based model for text summarization. Edges between nodes represent the relationships of words,phrases, such as the adjacency and semantic relationships. Each node denotes a topic corresponding to a word. The graph search technique is used to detect the cluster differences between documents to summarization and sentence similarities.

Text analyzer is designed to derive the structure of the input document text using a rule reduction approach in three phases:

1) Feature identification 2) Token creation

3) Categorization and Summarization

The text analyzer parses the given input into different pools and group them into noun phrase or verb phrase and final summary is generated.

An improved symmetric non negative factorization and feature based sentence level semantic analysis are implemented in [10]. Based on the semantic role analysis and word relation, pairwise sentence similarity is computed. In the second phase, matrix factorization is used to cluster the sentences. The most important sentence is chosen from the clusters using a similarity measure which combines the internal and external information.

Various documents clustering techniques have been implemented in literature. Hierarchical clustering techniques are generally viewed as more reliable than other types of techniques [6]. The cause is that it is a bottom up procedure which primarily implies that each document is a cluster and thus merges the best similar clusters in an iterative procedure. The total number of iterations relies upon k, the number of required clusters. Due to its quadratic computational complexity, hierarchical clustering techniques are un-practical for large collection of documents. A number of techniques that attempt parallelism in hierarchical clustering techniques have been introduced in research[5-10]. Instructive statistics are utilized to decrease the illustration of cluster centroids and thus decrease communication expense. The random cluster centroid based technique is the most well-known summarization technique used to find the inter and intra document relationships in the large corpus. MEAD is a clustering technique based on the cluster-centroid approach for the multidocument summarization process.

III. PROBLEM STATEMENT

Assuming that we have a graph G with v Nodes

1 2 3 4

{ ,

n n n n

,

,.. }

n

_v and edges

{e , e , e , e ,..e }

₁ ₂ ₃ ₄ _v_₁ . Let the document collection in the graph peer’s can be denoted by

D



{d , d ...d }

₁ ₂ _N . Let

D

_l and

D

_u be the labelled and unlabeled cluster documents. Our goal is to cluster the featured documents based on a keyword or phrase in the labelled documents. Let the featured clustering documents from the unlabeled documents can be represented as

f



{ ,

d d d

₁i ₂i

,

₃i

....

d

_{| |}i_f

}

where

1  

i

k

.

By optimizing the feature document clustering and minimizing the squared error.

1 2 3 | |

min{{ ,

d d d

i i

,

i

....

d

i_f

} e }



_i

IV. PROPOSED APPROACH

(3)

International Journal of Emerging Technology and Advanced Engineering

[image:3.595.118.215.186.474.2]

205

In this approach, a user specified keyword clustering operation is performed on the graph based clustered mechanism. The proposed architecture is shown in the figure 2.

Using the graph based clustering algorithm and summarization approach [1], the initial document collection is clustered and summarized. After the initial document clustering, user selected feature, or phrase based clustering is applied on the clustered overlays to highlight the essential features in the clustered documents.

Feature based Inter and Intra Document Clustering and Summarization algorithm

Input:

Documents Collection D

,

o c

D

: Overlay Document Cluster.

, i

p c

D

: ith Peer document in the overlay cluster. Output:

Feature based Clustering with Summarization. Procedure:

Initialization:

Initialize the documents collection. Map each node to the overlay network.

Collect all the document corpus that is required. All the document corpus [D] are tokenized/parsed. Identify the words Word List [WL] of the each document corpus and give the words as a test set to the training set of [WD] and [FL]

3.2. For the classification, we will apply Bayes Classification as Classify (true or negative);

(f / X) :

_i

(

/

_i

). ( ) / P(X)

_i

P



P X

f P f

P(X/ f )

_i denotes the probability that the feature belong to the document .

(f )

_i

P

denotes the probability of having feature f.

(X)

P

:Probability of the feature belongs to the clustered set X.

Then, the unwanted reviews are eliminated from the required corpus and the new review corpus is named as Dnew

Separate the entire review documents Dnew according to the features identified.

Phase1:

Apply graph based clustering with Summarization algorithm using [1].

Phase2:

For each overlay cluster

D

_{o c}_,

Do

For each peer cluster document _, i

p c

D

in the overlay cluster

Do

, , ,

(

,

) :

(

)

i i

p c o c p c

Tk D

D



Tokenize D

Done Done

Find the inter and intra feature similarity in the overlay cluster.

Inter-Cluster Feature Similarity Index:

D

_{o c}_,

Do

p c

D

Do

Find the relationship between the user specified query and the tokens or phrase within the clustered document. Let Query Q and Document Tokens/Phrases

dp t

_i

( )

_j Document

Collection

Graph based Clustering

Summarization algorithm

Feature Based Clustering

(4)

International Journal of Emerging Technology and Advanced Engineering

206

InterD(Q,

dp t

_i

( )

_j ):=

| Prob( )

Q





prob dp t

(

_i

( )) | / |

_j

dp t

_i

( ) |

_j Done

Done

Sort(Q,

dp t

_i

( )

_j )

Find the nearest distance similarity documents using InterDistance.

Find the graph path between the user specified query and the similar features in the clustered documents represented as

Path

_i

Intra-Cluster Feature Similarity Index:

D

_{o c}_,

Do

p c

D

Do

Find the relationship between the user specified query and the tokens or phrase within the clustered document. Let Query Q and Document Tokens/Phrases

dp t

i

( )

j

IntraD(Q,

dp t

_i

( )

_j ):=

{| Pr

ob Q

( )



InterD

*



prob dp t

(

_i

( )) |} / |

_j



dp t

_i

( ) |

_j Done

Done

Sort(Q,

dp t

_i

( )

_j )

Find the nearest distance similarity documents using InterDistance.

Find the graph path between the user specified query and the similar features in the clustered documents represented as

Path

_j

Finally, Highlight the path in the graph using

i j

Path



Path

.

V. ACCURACY MEASURES

The class accuracy rate A, F-Measure rate and Recall Rate R measures are used to find the performance of the proposed algorithm on the p2p document clustering algorithm.

Accuracy Rate

A D

(

_c

, F ) |

_i



D

_c



F | / |

_i

D

_c

|

Recall Rate

R(

D

c

, F ) |

i



D

c



F | / | F |

i i

F-Measure Rate:

1 1

|

| . ( ) /

|

k k

i i i

i i

F



F

 



where

1

( ) :

_max

(2. ( , F ).R( , F ) / ( ( , F ) R( , F ))

k

i c i c i c i c i i

F A D D A D D





 

VI. EXPERIMENTAL RESULTS

(5)

International Journal of Emerging Technology and Advanced Engineering

(6)

International Journal of Emerging Technology and Advanced Engineering

208

Performance Metrics

Algorithm 5

PEER-Accuracy

20 PEER-Accuracy

30-PEER Accuracy

40-PEE

R Accu racy

HP2PC 83 81.65 83.23 82

P2P K-means 68.45 74 76.34 76.89

MEAD 87.89 86.34 85.67 87 NeuralNetwo

rks 85.89 79.05 74.56 77.12 GA_SVM 74 75.78 76.45 79.45 GraphBased

CluSumm 91.56 88.12 88.97 90.29

Proposed 94.67 93.7 91.7 95.78

VII. CONCLUSION

The proposed algorithm executed on a large set of medical documents to find the relationship between the medical documents. Context identification is one of the major problems in traditional term based or sentence based clustering algorithms. In this proposed approach, topic relevance based context identification algorithm is used for text clustering and summarization approach. This system analyses sentence to clustering, sentence to summarization and phrase to sentence relationships to select optimal meaningful phrases/sentences from the given large text documents and reduce the duplication in the summarization process.

Experimental results prove that the proposed context based clustering and summarization approach has got improvement compared to traditional graph and probability based approaches.

REFERENCES

[5] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? : Sentiment Classification Using Machine Learning Techniques,” Proc. Conf.Empirical Methods in Natural Language Processing, pp. 79-86,2002.

[6] “A framework for text summarization in mobile web browsers”,Bose, Joy; Puthenveettil, Dipin Kollencheri; Kasi, Sainath Gadhamsetty; Bhide, Aditya Mohan,IEEE, Dec 26, 2013. [7] Hu Minqing, Liu Bing, “Mining and summarizing customer

reviews”, 10th ACM SIGKDD International Conference on Knowledge discovery and data mining, Seattle, WA, USA. [8] Tsytsarau, Mikalai, Palpanas, Themis “Survey on mining

subjective data on the web” Data Mining Knowledge Discovery 2011.

[9] “A comprehensive tool for text categorization and text summarization in bioinformatics”, Kamal,Md. Mustofa, Sultana,kazi zakia, IEEE dec 22,2012.

0 100 200 300 400

40-PEER Accuracy

30-PEER Accuracy

20 PEER-Accuracy