A REVIEW ON IDENTIFYING INTERESTING USAGE PATTERNS IN TEXT COLLECTIONS

Available Online at www.ijpret.com 416

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

A PATH FOR HORIZING YOUR INNOVATIVE WORK

MISS. SHRADDHA S. GUPTA1, DR. H. R. DESHMUKH2, PROF. V. K. LIKHITKAR3

1. M.E in Computer Science and Engg, I.B.S.S.C.O.E, Amravati.

2. Head of Department, Computer Science and Engg, I.B.S.S.C.O.E, Amravati.

3. Assistant Professor, Computer Science and Engg, I.B.S.S.C.O.E, Amravati.

Accepted Date: 05/03/2015; Published Date: 01/05/2015

Abstract: Text mining can be defined as the extraction of non-trivial and interesting information from unstructured text. In the last decade, many data mining techniques have been proposed for knowledge discovery tasks, with the goal of retrieving useful information for users. Text mining resembles search, but with one major difference: search requires the user to know what he or she is looking for, whereas text mining attempts to find patterns that are not known beforehand. In this paper we present a literature study of text mining and discuss it at the intersection of related areas such as machine learning, computational linguistics, information retrieval, statistics and, importantly, data mining. Additionally, we discuss the analysis tasks of text mining, such as preprocessing, classification, clustering, information extraction and, finally, visualization.

Keywords: KDD, Clustering, IR, IE, NLP

Corresponding Author: MISS. SHRADDHA S. GUPTA

I. INTRODUCTION

Critical interpretation of literary works is difficult. With the development of digital libraries, researchers can easily search and retrieve large bodies of text, images and multimedia materials online for their research. These archives provide the raw material, but researchers still have to rely on their notes, files and their own memories to find “interesting” facts that will support or contradict existing hypotheses. In the humanities, computers are used mainly to access text documents but rarely to support their interpretation or the development of new hypotheses. As stated in the abstract, text mining is about analyzing unstructured text and extracting relevant patterns and characteristics. Based on these patterns and characteristics, better search results and deeper data analysis become possible, providing fast retrieval of information that would otherwise remain hidden inside the unstructured text. While the ability to search for keywords or phrases in a collection is now widespread, such search only marginally supports discovery because the user has to decide which words to look for. Text mining results, on the other hand, can suggest “interesting” patterns to look at, which the user can then accept or reject. Unfortunately, text mining algorithms typically return a large number of patterns that are difficult to interpret out of context [7].

II. LITERATURE SURVEY

2.1 Text Mining

Text mining is the discovery of interesting knowledge in text documents. Finding accurate knowledge in text documents to help users find what they want is a challenging problem. Many applications, such as market analysis and business management, can benefit from the information and knowledge extracted from large amounts of data. Knowledge discovery techniques are used to update discovered patterns and are applied to the field of text mining.

2.2 KDD

Knowledge Discovery in Databases (KDD) is the nontrivial extraction of information from large databases: information that is implicitly present in the data, previously unknown and potentially useful for users. Knowledge discovery can be defined as follows:

Given a set of facts F, a language L, and some measure of certainty C, a pattern is a statement S in L that describes relationships among a subset Fs of F with a certainty c, such that S is simpler than the enumeration of all facts in Fs. A pattern is called knowledge if it is interesting and certain enough, according to the user's imposed criteria [9].

2.3 Statistics and Machine Learning

Statistics has its grounds in mathematics and deals with the science and practice of analyzing empirical data. It is based on statistical theory, which is a branch of applied mathematics. Within statistical theory, randomness and uncertainty are modelled by probability theory. Today, many statistical methods are used in the field of KDD.

Machine Learning (ML) is an area of artificial intelligence concerned with the development of techniques that allow computers to learn by analyzing data sets. The focus of most machine learning methods is on symbolic data. ML is also concerned with the algorithmic complexity of computational implementations [1][5].

III. RELATED RESEARCH WORK

Current research in the area of text mining tackles the problems of text representation, classification, clustering and information extraction, as well as the search for and modelling of hidden patterns. In this context, the selection of characteristics and the influence of domain knowledge and domain-specific procedures play an important role [3][6][7].

3.1 Information Retrieval (IR)

... referred to as information retrieval systems. Information retrieval is the finding of documents that contain answers to questions, not the finding of the answers themselves. To achieve this goal, statistical measures and methods are used for the automatic processing of text data and its comparison with the given question [2][10].

3.2 Natural Language Processing (NLP)

The general goal of NLP is to achieve a better understanding of natural language by the use of computers. Some approaches also employ simple and robust techniques for the fast processing of text. In addition, linguistic analysis techniques are used, among other things, for the processing of text [3][8].

3.3 Information Extraction (IE)

The goal of information extraction methods is the extraction of specific information from text documents. The extracted items are stored in database-like patterns and are then available for further use [2].

IV. ANALYSIS OF TEXT ENCODING

For mining large document collections it is necessary to pre-process the text documents and store the information in a data structure that is more appropriate for further processing than a plain text file. Although several methods now exist that also try to exploit the syntactic structure and semantics of text, most text mining approaches are based on the idea that a text document can be represented by a set of words, i.e. a text document is described by the set of words contained in it (the bag-of-words representation). To define the importance of a word within a given document, a vector representation is usually used, in which a numerical importance value is stored for each word. The currently predominant approaches based on this idea are the vector space model, the probabilistic model and the logical model [4][9].
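The bag-of-words idea described above can be illustrated with a minimal sketch. The tokenizer (lowercase alphabetic runs), the function names and the two sample documents are assumptions made for illustration only; raw word counts stand in for the importance value, which is the simplest possible choice.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Map each distinct word to the number of times it occurs."""
    return Counter(tokenize(text))

def to_vector(bag, vocabulary):
    """Project a bag onto a fixed, ordered vocabulary (0 for absent words)."""
    return [bag.get(word, 0) for word in vocabulary]

# Hypothetical two-document collection.
docs = ["Text mining finds patterns in text.",
        "Search needs a known query."]

bags = [bag_of_words(d) for d in docs]
vocabulary = sorted(set().union(*bags))   # all distinct words, in fixed order
vectors = [to_vector(b, vocabulary) for b in bags]

for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Word order is discarded in this representation: each document is reduced to one count per vocabulary word, so every document in the collection has a vector of the same length.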

4.1 Text Preprocessing

4.1.1 Filtering, Lemmatization and Stemming

To reduce the size of the dictionary, and thus the dimensionality of the description of documents within the collection, the set of words describing the documents can be reduced by filtering and by lemmatization or stemming methods [4].
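A rough sketch of filtering followed by stemming is given here. The stop-word list and the suffix rules are deliberately tiny and hypothetical; they are not the Porter algorithm or any published stemmer. Lemmatization is not shown, since it requires a dictionary or a morphological analyzer.

```python
import re

# Illustrative stop-word list (an assumption, far from complete).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

# Naive suffix rules, tried in this order (an assumption, not a real stemmer).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words (filtering), then stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

terms = preprocess("The miner is mining the mined documents")
print(terms)  # ['miner', 'min', 'min', 'document']
```

In this example, seven tokens are reduced to three distinct terms. The sketch also shows a typical weakness of crude stemming: "mining" and "mined" collapse to "min", which is not a word, and "miner" is not merged with them.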

4.1.2 Index Term Selection

To further decrease the number of words that should be used, indexing or keyword selection algorithms can also be applied. In this case, only the selected keywords are used to describe the documents. A simple method for keyword selection is to extract keywords based on their entropy [1][5].
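The text does not state the entropy formula, so the sketch below assumes one common formulation: W(t) = 1 + (1 / log2 |D|) * sum over documents d of P(d|t) * log2 P(d|t), where P(d|t) is the share of all occurrences of term t that fall in document d. Under this assumption, W(t) is 0 for a term spread evenly over all documents and 1 for a term that occurs in a single document. The function names, the threshold and the sample collection are hypothetical.

```python
import math
from collections import Counter

def entropy_weights(docs):
    """docs: list of token lists. Returns {term: weight in [0, 1]}."""
    n_docs = len(docs)
    if n_docs < 2:
        raise ValueError("need at least two documents")
    counts = [Counter(d) for d in docs]
    totals = Counter()
    for c in counts:
        totals.update(c)                  # total occurrences per term
    weights = {}
    for term, total in totals.items():
        s = 0.0
        for c in counts:
            tf = c[term]
            if tf:                        # treat 0 * log(0) as 0
                p = tf / total            # P(d|t)
                s += p * math.log2(p)
        weights[term] = 1.0 + s / math.log2(n_docs)
    return weights

def select_keywords(docs, threshold=0.5):
    """Keep terms whose entropy weight reaches the (assumed) threshold."""
    weights = entropy_weights(docs)
    return sorted(t for t, w in weights.items() if w >= threshold)

docs = [["text", "mining"], ["text", "search"]]
weights = entropy_weights(docs)
keywords = select_keywords(docs)
print(weights)
print(keywords)
```

Here "text" occurs equally often in both documents and so receives weight 0, while "mining" and "search" each occur in only one document and are kept as keywords.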

4.2 The Vector Space Model

The vector space model was originally introduced for indexing and information retrieval, but it is now also used in several text mining approaches as well as in most currently available document retrieval systems. Despite its simple data structure, which uses no explicit semantic information, the vector space model enables very efficient analysis of huge document collections [3].
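As an illustration of how documents can be compared in the vector space model, the sketch below weights terms with tf-idf and compares documents by cosine similarity. Both choices are assumptions: the text names neither a weighting scheme nor a similarity measure, and tf-idf with cosine is simply a widely used combination. The sample collection is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of tf-idf vectors)."""
    n_docs = len(docs)
    vocabulary = sorted({t for d in docs for t in d})
    # Document frequency: in how many documents each term appears.
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [["text", "mining", "patterns"],
        ["text", "mining", "clustering"],
        ["text", "search", "query"]]

vocabulary, vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])   # documents share "mining"
sim_02 = cosine(vectors[0], vectors[2])   # documents share only "text"
print(sim_01, sim_02)
```

Because "text" appears in every document, its idf is log(1) = 0 and it contributes nothing to any vector. The first two documents therefore remain similar only through "mining", while the first and third have similarity 0.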

4.3 Linguistic Preprocessing

Text mining methods can often be applied without further preprocessing. Sometimes, however, additional linguistic preprocessing may be used to enhance the available information about terms [2].

V. CONCLUSION

VI. REFERENCES

1. Y. Li and N. Zhong, “Interpretations of Association Rules by Granular Computing,” Proc. IEEE Third Int’l Conf. Data Mining (ICDM ’03), pp. 593-596, 2003.

2. Y. Li and N. Zhong, “Mining Ontology for Automatically Acquiring Web User Information Needs,” IEEE Trans. Knowledge and Data Eng., vol. 18, no. 4, pp. 554-568, Apr. 2006.

3. Y. Li, X. Zhou, P. Bruza, Y. Xu, and R.Y. Lau, “A Two-Stage Text Mining Model for Information Filtering,” Proc. ACM 17th Conf. Information and Knowledge Management (CIKM ’08), pp. 1023-1032, 2008.

4. H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, “Text Classification Using String Kernels,” J. Machine Learning Research, vol. 2, pp. 419-444, 2002.

5. A. Maedche, Ontology Learning for the Semantic Web. Kluwer Academic, 2003.

6. J. Allan, editor. Topic Detection and Tracking. Kluwer Academic Publishers, Norwell, MA, 2002.

7. N. Jindal and B. Liu, “Identifying Comparative Sentences in Text Documents,” Proc. 29th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’06), pp. 244-251, 2006.

8. T. Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features,” Proc. European Conf. Machine Learning (ECML ’98), pp. 137-142, 1998.

9. T. Joachims, “Transductive Inference for Text Classification Using Support Vector Machines,” Proc. 16th Int’l Conf. Machine Learning (ICML ’99), pp. 200-209, 1999.

10. W. Lam, M.E. Ruiz, and P. Srinivasan, “Automatic Text Categorization and Its Application to