International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 4, Issue 9, September 2014)
A Survey on Various Topic Mining Techniques & Applications
Sakshi Jain¹, Virendra Raghuwanshi², Anurag Jain³
Abstract— Text mining commonly refers to the process of extracting interesting patterns and non-trivial information from unstructured text. It draws on several computer science disciplines with a strong orientation towards artificial intelligence in general, including but not limited to pattern recognition, NLP, machine learning, neural networks, and information retrieval. An essential difference from database search is that search requires users to know what they are looking for, whereas text mining attempts to discover information in patterns that were not known beforehand. This paper presents a survey of techniques related to topic analysis and text mining, so that a new technique can be developed on the basis of them in future work.
Index Terms— KDT, Information Retrieval, Machine Learning, Text Mining, Data Mining, Support Vector Machine.
I. INTRODUCTION
The information era has made it easy to store huge quantities of text data. The number of documents available on the internet, through information channels, on corporate intranets, and elsewhere is overwhelming. However, while the amount of information available to us is constantly increasing, our capacity to absorb and process this information remains constant.
Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. The goal of text mining is to discover unknown information in collections of text documents.
Text mining is the data analysis of text resources so that new, previously unknown knowledge is discovered [5].
It is an interdisciplinary field that borrows techniques from the general field of data mining. Text mining, also known as knowledge discovery from text (KDT), refers to the process of extracting interesting patterns from very large text databases for the purpose of discovering knowledge [1].
Text mining applies the same analytical functions as data mining, but also applies analytical functions from natural language processing and information retrieval (IR).
Figure: The Text Mining Process
Text mining methods will play a crucial role in this ongoing development in the coming years. As a result of ongoing globalization, there is also much interest in multi-language text mining: the gaining of insights from multi-language collections. The recent availability of machine translation methods is a significant improvement in that context. Multi-language text mining is much more complex than it appears: in addition to differences in character sets and words, text mining makes intensive use of statistics as well as the linguistic properties (such as conjugation, grammar, senses or meanings) of a language.
Many basic assumptions about capitalization and tokenization do not hold for other languages. When text mining methods are applied to non-English data collections, additional challenges concerning linguistic properties have to be addressed.
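As a small illustration of why such assumptions break, the following sketch applies whitespace tokenization and a capitalization heuristic, both reasonable for English, to other languages. The sample sentences and both helper functions are assumptions made for this example and are not taken from the surveyed work.

```python
def whitespace_tokens(text):
    """Split on whitespace: a common tokenization assumption for English."""
    return text.split()


def capitalized_mid_sentence(tokens):
    """Naive heuristic: capitalized tokens after the first are proper nouns."""
    return [t for t in tokens[1:] if t[:1].isupper()]


english = "We visited Berlin last summer"
chinese = "我们去年夏天去了柏林"  # similar meaning; written without spaces
german = "Wir haben im Sommer die Stadt Berlin besucht"

# Whitespace tokenization finds 5 English tokens but only 1 Chinese "token".
print(len(whitespace_tokens(english)), len(whitespace_tokens(chinese)))

# The capitalization heuristic works for English ...
print(capitalized_mid_sentence(whitespace_tokens(english)))
# ... but German capitalizes all nouns, so common nouns are flagged too.
print(capitalized_mid_sentence(whitespace_tokens(german)))
```

In practice a language-specific tokenizer or segmenter would replace `whitespace_tokens`; the point here is only that the English-oriented defaults cannot be reused unchanged.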
Text mining is about analyzing unstructured information and extracting relevant patterns and characteristics from it. Using these patterns and characteristics, better search results and deeper data analysis become possible, giving quick access to information that would otherwise remain hidden. Text mining is the discovery of interesting knowledge in text documents. It is a challenging issue to find accurate information or features in text documents to help users find what they want. To address this challenge, Information Retrieval (IR) provides many term-based techniques, such as Rocchio and probabilistic models [4], rough set models [2], BM25 and support vector machine (SVM) [3] based filtering models.
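To make one of these term-based techniques concrete, here is a minimal sketch of BM25 scoring over a toy collection. The parameter values `k1=1.5` and `b=0.75` are commonly used defaults chosen here as assumptions, the smoothed IDF form is one of several variants in use, and the three sample documents are invented for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [
    "text mining finds patterns in text".split(),
    "data mining works on structured databases".split(),
    "support vector machines classify documents".split(),
]
scores = bm25_scores("text mining".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The first document ranks highest because it matches both query terms, the second matches only "mining", and the third matches nothing and scores zero.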
The advantages of these term-based techniques include efficient computational performance as well as mature theories for term weighting, which have emerged over the last couple of decades from the Information Retrieval and machine learning communities. However, term-based techniques suffer from the problems of synonymy and polysemy, where synonymy means that multiple words have the same meaning and polysemy means that a word has multiple meanings. As a result, the semantic meaning of many discovered terms is uncertain for answering what users want.
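The effect of synonymy and polysemy on term-based matching can be seen in a short sketch. It uses a plain TF-IDF weighting with cosine similarity, picked here as one representative term-based scheme; the four toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs)
    df = Counter(term for d in docs for term in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "car engine repair".split(),
    "automobile motor maintenance".split(),  # same topic, different terms
    "river bank flood".split(),
    "bank loan interest".split(),            # same term, different meaning
]
vecs = tfidf_vectors(docs)

synonymy_sim = cosine(vecs[0], vecs[1])  # no shared terms, so no match
polysemy_sim = cosine(vecs[2], vecs[3])  # shared "bank" gives a spurious match
print(synonymy_sim, round(polysemy_sim, 3))
```

The two documents about vehicles share no terms and get a similarity of zero, while the two unrelated documents that both contain "bank" get a positive similarity.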
II. KEY CHARACTERISTICS OF TEXT DATA
The key element of text mining is the linking together of extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation. Whereas conventional data mining extracts patterns from structured databases, text mining has to deal with the difficulties of natural language. The main difference between text mining and data mining is in the preprocessing phase [2].
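A minimal sketch of such a preprocessing phase is given below: it lowercases the text, tokenizes it, removes stop words, and produces a bag-of-words count that structured mining algorithms can consume. The stop-word list and the regular expression are simplifying assumptions suitable only for plain English text.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def preprocess(text):
    """Lowercase, tokenize on runs of ASCII letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(text):
    """Turn free text into term counts, a structured representation."""
    return Counter(preprocess(text))


sentence = "Text mining is the discovery of knowledge in text."
bow = bag_of_words(sentence)
print(preprocess(sentence))
print(bow)
```

Once text is reduced to term counts like these, standard data mining methods can be applied to it in much the same way as to database records.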
The text mining phase is the core process of knowledge discovery in text documents. There are several types of text mining tasks:
Text classification: assigning documents to pre-defined categories (e.g. decision tree generation).
Text clustering: a descriptive activity that groups related documents together (e.g. self-organizing maps).
Concept mining: modeling and discovering concepts, sometimes combining classification and clustering approaches with concept/logic-based ideas in order to find concepts and their relations in text collections (e.g. the formal concept analysis approach for building a concept hierarchy).
Information retrieval: retrieving the documents relevant to the user's query for given keywords.
Information extraction: question answering.
In any text mining process, the first three types are commonly the essential building blocks that also support information retrieval or extraction.
Text data can be analyzed at different levels of representation.
III. THEORETICAL BACKGROUND
In text mining, manual techniques were used first, during the 1980s. It quickly became apparent that these manual techniques were labor-intensive and therefore expensive. It also cost too much time to manually process the already growing amount of information. Over time there was increasing success in creating programs to process the information automatically, and in the last 10 years there has been much progress in this research area.
At present, the study of text mining concerns the development of various mathematical, statistical, linguistic and pattern-recognition methods which allow automatic analysis of unstructured information as well as the extraction of high-quality and relevant data, and which make the text as a whole more searchable. High quality refers here, in particular, to the combination of relevance (i.e. finding a needle in a haystack) and the acquisition of new and interesting insights.
A text document contains characters that together form words, which can be combined to form phrases. These are all syntactic properties that together represent defined categories, concepts, senses or meanings. Text mining must recognize, extract and use all of this information. With the help of text mining, instead of searching for words, we can search for linguistic word patterns, and this is therefore searching at a higher level.
IV. OVERVIEW ON TEXT MINING
Most work in knowledge discovery and data mining has been concerned with transactional or structured databases. However, a large portion of the available data appears in collections of text articles. Text mining is used to denote all tasks that try to extract useful information by finding potential patterns in large quantities of text. It combines many disciplines such as information retrieval, information extraction, machine learning, text categorization, text clustering and data mining [6].
Text mining is used in various fields and can be viewed from different perspectives:
Text Mining as Information Mining: Text mining can be used for a variety of applications such as information mining. Information mined from a text document can be used for knowledge extraction and hence for analysis.
Text Mining as Text Data Mining: Text mining can also be defined, similarly to data mining, as the application of algorithms and techniques from the fields of machine learning and statistics to texts with the goal of finding useful patterns. For this purpose it is necessary to pre-process the texts accordingly. Many researchers use information extraction techniques, natural language processing or some simple pre-processing steps in order to extract data from texts. Data mining algorithms can then be applied to the extracted data [7, 8].
Text Mining as KDD Process: Following the knowledge discovery process model [9], text mining is often described in the literature as a process with a series of partial steps, including, among others, information extraction as well as the use of data mining or statistical methods.
Text classification or categorization (TC) is an instance of text mining. TC is a supervised learning task that assigns a Boolean value to each pair (d_i, c_j) ∈ D × C, where D is a domain of documents and C is a set of predefined categories. The task is to approximate the true function ϕ: D × C → {1, 0} by means of a function ϕ̂: D × C → {1, 0}, such that ϕ and ϕ̂ coincide as much as possible [10]. The function ϕ̂ is called a classifier. To assess a classifier, this coincidence has to be precisely defined and estimated.
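The definition above can be illustrated with a small sketch in which the true function and the classifier are both ordinary functions over (document, category) pairs, and their coincidence is measured as the fraction of pairs on which they agree. The documents, categories, keyword lists, and the keyword-matching rule are all hypothetical and chosen only for illustration; agreement rate is one simple way to quantify the coincidence, not necessarily the measure used in [10].

```python
from itertools import product

# Hypothetical documents D and categories C.
DOCS = {
    "d1": "the team won the match",
    "d2": "the bank cut interest rates",
    "d3": "the club sold shares to fund a new stadium",
}
CATEGORIES = ["sports", "finance"]

# Hypothetical ground truth: the pairs for which the true function is 1.
TRUE_PAIRS = {
    ("d1", "sports"), ("d2", "finance"), ("d3", "sports"), ("d3", "finance"),
}

# Hypothetical keyword lists used by the toy classifier.
KEYWORDS = {
    "sports": {"team", "match", "stadium"},
    "finance": {"bank", "interest", "rates"},
}


def phi(doc_id, category):
    """True function: 1 if the document belongs to the category, else 0."""
    return 1 if (doc_id, category) in TRUE_PAIRS else 0


def phi_hat(doc_id, category):
    """Toy classifier: 1 if the document contains any category keyword."""
    words = set(DOCS[doc_id].split())
    return 1 if words & KEYWORDS[category] else 0


pairs = list(product(DOCS, CATEGORIES))
agreement = sum(phi(d, c) == phi_hat(d, c) for d, c in pairs) / len(pairs)
print(round(agreement, 3))
```

The classifier misses the pair ("d3", "finance") because "shares" is not among its keywords, so it agrees with the true function on five of the six pairs.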
V. LITERATURE REVIEW
In [15], the authors view the problem of selecting association rules as a classification task. They set up a framework for a binary probabilistic classifier that uses ontologies to estimate whether, and to what degree, a rule expresses a simple taxonomic relationship. First, they present the overall formal framework that underlies their approach to selecting association rules, with examples that use simple ontologies built upon the hypernym-hyponym relation. Second, they introduce a probabilistic framework and use it to compute the degree of fit between each rule of the initially extracted rule set and a model of the chosen domain. Third, they present an example taken from a text mining study that shows how extracted knowledge, i.e., ontologies, supports the selection of association rules. The knowledge-based approach presented in that work performs a negative selection, or selection by elimination (i.e. rejection), of association rules: the overall and possibly large set of rules produced as output of association rule mining is reduced by those rules that meet a particular criterion. Whether and to what degree an association rule meets a given criterion (i.e., expressing a simple taxonomic relationship) is estimated by a probabilistic approach. In this way, the initial rule set can be compacted, since trivial or uninformative association rules, i.e., rules that merely reveal conceptual or taxonomic relations between concepts, are identified and discarded.
The paper concludes with a discussion of possible further extensions of the knowledge-based approach to association rule selection.
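The idea of selection by elimination can be sketched as follows. Note that this is a deliberately simplified, deterministic stand-in: [15] estimates the degree of taxonomic fit with a probabilistic classifier, whereas this sketch only makes a yes/no decision by walking hypernym links. The ontology, the rules, and all function names are hypothetical.

```python
# Hypothetical ontology: each concept maps to its direct hypernym (parent).
HYPERNYM = {
    "penicillin": "antibiotic",
    "antibiotic": "drug",
    "aspirin": "drug",
}


def is_hypernym_of(ancestor, concept):
    """Return True if `ancestor` is reachable from `concept` via hypernym links."""
    current = HYPERNYM.get(concept)
    while current is not None:
        if current == ancestor:
            return True
        current = HYPERNYM.get(current)
    return False


def is_taxonomic(rule):
    """A rule (a, b), read as a -> b, is taxonomic if a and b are hypernym-related."""
    a, b = rule
    return is_hypernym_of(b, a) or is_hypernym_of(a, b)


# Hypothetical association rules mined from text, as (antecedent, consequent).
rules = [
    ("penicillin", "antibiotic"),
    ("penicillin", "drug"),
    ("aspirin", "headache"),
    ("antibiotic", "resistance"),
]

# Negative selection: discard the rules that only restate the taxonomy.
kept = [r for r in rules if not is_taxonomic(r)]
print(kept)
```

The two rules that merely restate that penicillin is an antibiotic (and hence a drug) are eliminated, leaving the rules that are not already explained by the ontology.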
As the volume of electronic information increases, there is growing interest in developing tools to help people better find, filter, and manage these resources. Text categorization [14], the assignment of natural language texts to one or more predefined categories based on their content, is an important component in many information organization and management tasks. Machine learning techniques, including Support Vector Machines (SVMs), have tremendous potential for helping people organize electronic resources more effectively. Text mining frequently involves the extraction of keywords with respect to some measure of importance.
Weblog data is textual content with a clear and significant temporal aspect. Text categorization [11] is the task of automatically sorting a set of documents into categories from a predefined set. This task has a number of applications, including automated indexing of scientific articles according to predefined sets of scientific terms, filing patents into patent directories, selective dissemination of information to information consumers, automated population of hierarchical catalogues of Web resources, authorship attribution, survey coding, spam filtering, identification of document type, and even automated essay grading. Automated text classification is attractive because it frees organizations from the need to organize document bases manually, which can be too expensive or simply not feasible given the time constraints of the application or the number of documents involved. The accuracy of modern text classification systems rivals that of trained human professionals, thanks to a combination of information retrieval (IR) technology and machine learning (ML) technology. The authors outline the fundamental characteristics of the technologies involved, of the applications that can feasibly be tackled through text classification, and of the tools and resources that are available to the developer and researcher wishing to take up these technologies for deploying real-world applications.
Web mining [12] extracts statistical information and discovers interesting user patterns, clusters users into groups according to their navigational behavior, discovers potential relationships between user groups and web pages, identifies potential customers for e-commerce, improves the quality and delivery of Internet information services to the end user, improves web server system performance and site design, and facilitates personalization.
Comparison is one of the most convincing methods of evaluation, and extracting comparative sentences from text is useful for many applications [13]. Experimental results on three types of documents (news articles, customer reviews of products, and Internet forum postings) show a precision of 79% and a recall of 81%.
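For readers unfamiliar with the two measures, the following sketch shows how precision and recall are computed from a set of predicted items and a set of truly relevant items. The sentence identifiers are hypothetical and unrelated to the experiments reported above.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for a predicted set against the true set."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: share of predicted items that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of truly relevant items that were found.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical ids of sentences that truly are comparative.
actual = {1, 2, 3, 4}
# Hypothetical ids of sentences that a system labeled as comparative.
predicted = {2, 3, 4, 7, 8}

p, r = precision_recall(predicted, actual)
print(p, r)
```

Here three of the five predictions are correct (precision 0.6) and three of the four comparative sentences are found (recall 0.75).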
In the existing system [17], an effective pattern discovery technique is presented which first computes the specificities of discovered patterns and then evaluates term weights according to the distribution of terms in the discovered patterns rather than their distribution in documents, in order to address the misinterpretation problem. The technique also considers the influence of patterns from negative training examples to find ambiguous (noisy) patterns and tries to reduce their influence, which addresses the low-frequency problem. The process of updating ambiguous patterns is referred to as pattern evolution. This approach can improve the accuracy of evaluating term weights because discovered patterns are more specific than whole documents. The method uses two processes, pattern deploying and pattern evolving, to refine the patterns discovered in text documents. However, it does not consider time series when ranking the given sets of documents. In the proposed scheme, a temporal text mining approach is introduced, and the system is evaluated in terms of its ability to predict forthcoming events in the documents. The optimal decomposition of the time period associated with the given document set is determined, where each subinterval consists of consecutive time points having indistinguishable information content. Extraction of sequences of events from news and other documents based on the publication times of these documents has been shown to be extremely useful in tracking past events.
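The notion of splitting a time period into subintervals of consecutive time points with the same information content can be sketched as below. This is a simplified illustration that treats "information content" as the exact set of terms observed at each time point, which is an assumption made here; the optimal decomposition procedure of the surveyed work is not reproduced. The timeline data is invented.

```python
from itertools import groupby


def decompose(timeline):
    """Merge consecutive time points whose term sets are identical.

    `timeline` is a list of (time_point, frozenset_of_terms) sorted by time.
    Returns a list of (start, end, terms) subintervals.
    """
    intervals = []
    for terms, group in groupby(timeline, key=lambda item: item[1]):
        points = [t for t, _ in group]
        intervals.append((points[0], points[-1], terms))
    return intervals


# Hypothetical timeline: terms seen in documents published at each time point.
timeline = [
    (1, frozenset({"election"})),
    (2, frozenset({"election"})),
    (3, frozenset({"election", "result"})),
    (4, frozenset({"election", "result"})),
    (5, frozenset({"election"})),
]

intervals = decompose(timeline)
for start, end, terms in intervals:
    print(start, end, sorted(terms))
```

The five time points collapse into three subintervals, and the change points between them (here at times 3 and 5) are candidates for events worth tracking.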
VI. CONCLUSION
Because it lacks structure, text data is typically handled through a search engine rather than a database. Data collections are becoming too large to be reviewed sequentially. They need to be pre-classified and pre-analyzed so that review can be carried out more efficiently and deadlines can be met more easily.
The challenge for these data collections will be to convince courts of the appropriateness of these new tools. Consequently, a combined approach is suggested in which computers make the initial selection and classification of documents and investigation directions, and human reviewers perform quality control and evaluate the suggestions. In this way, computers can focus on recall and humans can focus on precision.
There are many other applications where this approach has led both to greater efficiency and to acceptance of the technology by society.
REFERENCES
[1] Moty Ben-Dov and Ronen Feldman, Text Mining and Information Extraction, The Data Mining and Knowledge Discovery Handbook, 2005, pp. 801-831, Springer.
[2] Y. Li, C. Zhang, and J.R. Swan, “An Information Filtering Model on the Web and Its Application in Jobagent,” Knowledge-Based Systems, vol. 13, no. 5, pp. 285-296, 2000.
[3] S. Robertson and I. Soboroff, “The Trec 2002 Filtering Track Report,” TREC, 2002, trec.nist.gov/pubs/trec11/papers/OVER.FILTERING.ps.gz.
[4] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, 1999.
[5] M.A. Hearst. Untangling text data mining. In Proc. of the 37th ACL, College Park, MD, pp. 3-10, 1999.
[6] C-H Lee and H-C Yang. A multilingual text mining approach based on self-organizing maps. Applied Intelligence, 18(3):295-310, 2003.
[7] U. Nahm and R. Mooney. Text mining with information extraction. In Proceedings of the AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, 2002.
[8] R. Gaizauskas. An information extraction perspective on text mining: Tasks, technologies and prototype applications. http://www.itri.bton.ac.uk/projects/euromap/TextMiningEvent/Rob_Gaizauskas.pdf, 2003.
[9] Cross industry standard process for data mining. http://www.crisp-dm.org/ , 1999.
[10] G.P.C Fung, J.X. Yu, H. Lu and P.S. Yu. Text classification without negative examples revisit. IEEE Transactions on Knowledge and Data Engineering, 18(1):6-20, 2006.
[11] M.F. Caropreso, S. Matwin, and F. Sebastiani. Statistical Phrases in Automated Text Categorization, Technical Report IEI-B4-07-2000, Istituto di Elaborazione dell’Informazione, 2000.
[12] J. Han and K.C.-C. Chang. Data Mining for Web Intelligence, Computer, Vol. 35, No. 11, pp. 64-70, Nov. 2002.
[13] N. Jindal and B. Liu. Identifying Comparative Sentences in Text Documents, Proc. 29th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR ’06), pp. 244-251, 2006.
[15] Dietmar Janetzko, Hacène Cherfi, Roman Kennke, Amedeo Napoli and Yannick Toussaint, “Knowledge-based Selection of Association Rules for Text Mining,” 2003.
[16] Rayid Ghani and Andrew E. Fano, “Using Text Mining to Infer Semantic Attributes for Retail Data Mining,” 2003.
[17] K. Mythili and K. Yasodha, “A Pattern Taxonomy Model with New Pattern Discovery Model for Text Mining,” International Journal of Science and Applied Information Technology, ISSN No. 2278-3083, Volume 1, No. 3, July-August 2012.
[18] Luhn, H. P.: A Statistical Approach to Mechanized Encoding and Searching of Literary Information, in IBM Journal of Research and Development, 4:309-317, 1957.
AUTHOR’S PROFILE
Sakshi Jain, Computer Science dept., Radharaman Institute of Technology, RGPV University, India.
Virendra Raghuwanshi, Computer Science dept., Radharaman Institute of Technology, India.