Rising of Text Mining Technique: As Unforeseen-part of Data Mining

Param Deep Singh, M.Tech, S.A.T.I., Vidisha (M.P.)

Jitendra Raghuvanshi, M.Tech, B.U.I.T., Bhopal (M.P.)

Abstract—

Text data mining, or knowledge discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining, also known as Intelligent Text Analysis (ITA), is a variation on the field of data mining, which tries to find interesting patterns in large databases. It is a young interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics and computational linguistics. Text Mining Technique (TMT) is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. Text mining, roughly equivalent to text analytics, derives high-quality information from text, typically by identifying patterns and trends through means such as statistical pattern learning. It usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. In this paper, we introduce the rise of the Text Mining Technique as an unforeseen part of data mining and data warehouse methodologies, its role in improving their performance and productivity, and its use in different research areas.

Key Words—

Text Mining Technique (TMT), Text Data Mining (TDM), Knowledge Discovery in Text (KDT), Intelligent Text Analysis (ITA).

1. INTRODUCTION:

Text mining is a new and exciting area of computer science research that tries to solve the problem of information overload by combining techniques from data mining, machine learning, natural language processing, information retrieval, and knowledge management. The problem of Knowledge Discovery from Text (KDT) is to extract explicit and implicit concepts, and the semantic relations between them, using Natural Language Processing (NLP) techniques. Its aim is to gain insight into large quantities of text data. KDT, while deeply rooted in NLP, draws on methods from statistics, machine learning, reasoning, information extraction, knowledge management, and others for its discovery process. Text mining plays an increasingly significant role in emerging applications such as text understanding.

Because the most natural form of storing and exchanging information is the written word, text mining (TM) has a very high commercial potential. In fact, a recent study indicated that 80% of a company's information is contained in text documents, such as emails, memos, customer correspondence, and reports. Traditional document and text management tools are inadequate to meet these needs. Even the best Internet search tools suffer from poor precision and recall. (Precision is the fraction of the documents returned by a search that actually meet the intended query criterion. Recall is the fraction of the documents that should have been returned that actually were returned.) Text mining gives a company a competitive edge in processing and taking advantage of a large quantity of textual information. The potential applications are countless.
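To make the two measures concrete, the following minimal sketch computes them for a single query. The document identifiers and the relevance judgments are hypothetical and serve only as an illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: documents the search tool returned
    relevant:  documents that actually satisfy the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents returned, three truly relevant.
returned_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5"]
precision, recall = precision_recall(returned_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four returned documents are relevant, giving a precision of 0.5, and two of the three relevant documents were returned, giving a recall of about 0.67.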

2. KEY NOTIONS OF TEXT MINING:

2.1 Knowledge Discovery

"Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data". The analysis of data in KDD aims at finding hidden patterns and connections in these data. By data we understand a quantity of facts, which can be, for instance, data in a database, but also data in a simple text file. Knowledge discovery in databases is a process that is defined by several processing steps that have to be applied to a data set of interest in order to extract useful patterns.

2.2 Data Mining

Research in the area of data mining and knowledge discovery is still in a state of great flux. One indicator of this is the sometimes confusing use of terms. On the one hand, data mining is used as a synonym for KDD, meaning that data mining covers all aspects of the knowledge discovery process. This definition is particularly common in practice and frequently makes it difficult to distinguish the terms clearly. The second view considers data mining to be one part of the KDD process, namely the modeling phase, i.e. the application of algorithms and methods for the calculation of the searched patterns or models.

2.3 What is Text Mining?

* Uncover information hidden in text.

* Application of data mining to unstructured or less structured text files.

* Entails the generation of meaningful numerical indices from the unstructured text and then processing these indices using various data mining algorithms.

* Attempts to categorize textual data and not to understand its contents.

* Uses supervised learning for a classification problem with a binary output (i.e. whether or not a document is about a specific topic).
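The points above can be sketched in a few lines: word counts serve as the numerical indices, and a supervised learner turns them into a yes/no decision about a topic. The paper does not prescribe a particular classifier; a multinomial Naive Bayes model with add-one smoothing is assumed here, and the tiny training set and the "finance" topic are hypothetical. The sketch also assumes that both classes have at least one training document.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Turn (text, on_topic) pairs into per-class word counts."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, on_topic in labeled_docs:
        word_counts[on_topic].update(tokenize(text))
        doc_counts[on_topic] += 1
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocabulary


def is_on_topic(text, model):
    """Binary decision: is the document about the topic or not?"""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        scores[label] = score
    return scores[True] > scores[False]


# Hypothetical training set; the topic is "finance".
training_docs = [
    ("stock market prices fell", True),
    ("bank raises interest rates", True),
    ("team wins football match", False),
    ("player scores late goal", False),
]
model = train(training_docs)
print(is_on_topic("interest rates and stock prices", model))
print(is_on_topic("football team scores goal", model))
```

Note that the program never interprets the documents; it only compares word frequencies between the two labeled groups.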

2.4 Definition of Text Mining

"Text mining is the study and practice of extracting information from text using the principles of computational linguistics". Text mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation. Text mining or knowledge discovery in text (KDT), first mentioned in Feldman et al. [FD95], deals with the machine-supported analysis of text. It uses techniques from information retrieval, information extraction and natural language processing (NLP), and connects them with the algorithms and methods of KDD, data mining, machine learning and statistics.

Text mining can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools. In a manner analogous to data mining, text mining seeks to extract useful information from data sources through the identification and exploration of interesting patterns. In the case of text mining, however, the data sources are document collections, and interesting patterns are found not among formalized database records but in the unstructured textual data in the documents in these collections.

3. NEED OF TEXT MINING:

Text mining has attracted increasing interest and has been actively applied in knowledge management. Finding useful facts or 'nuggets' of knowledge in databases of text is the essence of text mining, an analysis process that attempts to uncover hidden patterns in unstructured text data. TM is currently being used in knowledge discovery and business intelligence applications ranging from human resource management to market intelligence to research and development. Its techniques are also being used to extend conventional information retrieval systems with features that create a more interactive and contextually aware search experience. This technique helps to:

1. Substantially enhance the retrieval of useful information from global databases;

2. Identify the technology infrastructure (authors, journals, organizations) of a technical domain;

3. Identify experts for innovation-enhancing technical workshops and review panels;

4. Provide roadmaps for tracking myriad research impacts across time and application areas;

5. Estimate global levels of emphasis in targeted technical areas;

6. Generate methodological taxonomies (classification schemes) with human-based and computer-based clustering methods;

7. Develop site visitation strategies for assessment of prolific organizations globally.

Text mining illuminates the trans-citation thematic relationships, and provides insight into the diffusion of knowledge to other intra-discipline research, advanced intra-discipline development and extra-discipline research and development. The addition of text mining to citation bibliometrics makes feasible the large-scale multi-generation citation studies that are necessary to display the full impacts of research.

4. APPLICATIONS OF TEXT MINING:

* Automatic detection of e-mail spam or phishing through analysis of the document content

* Automatic processing of messages or e-mails to route a message to the most appropriate party to process that message

* Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses

* Filter & match resumes to open positions

* Security applications

* Biomedical applications

A range of text mining applications has been described in the biomedical literature. One example is PubGene, which combines biomedical text mining with network visualization as an Internet service.

* Software and Applications

Text mining methods and software are also being researched and developed by major firms, including IBM and Microsoft, to further automate the mining and analysis processes, and by different firms working in the area of search and indexing in general as a way to improve their results. Within the public sector, much effort has been concentrated on creating software for tracking and monitoring terrorist activities.

* Online media applications

Text mining is being used by large media companies to provide readers with better search experiences, which in turn increases site "stickiness" and revenue. Additionally, editors benefit by being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.

* Marketing applications

Text mining is starting to be used in marketing as well, more specifically in analytical customer relationship management, where it is applied to improve predictive analytics models for customer churn (customer attrition).

Sentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a movie; such an analysis may need a labeled data set or labeling of the affectivity of words.
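The second option mentioned above, labeling the affectivity of words, can be illustrated with a minimal sketch. The word list and its scores are hypothetical and far too small for practical use; a real system would rely on a much larger lexicon or on a labeled data set.

```python
import re

# Hypothetical affect lexicon: +1 for favorable words, -1 for unfavorable ones.
AFFECT = {
    "great": 1, "wonderful": 1, "enjoyable": 1,
    "boring": -1, "awful": -1, "dull": -1,
}


def sentiment_score(review):
    """Sum the affect labels of the words in a review.

    A positive total suggests a favorable review, a negative total an
    unfavorable one; words missing from the lexicon count as neutral.
    """
    tokens = re.findall(r"[a-z']+", review.lower())
    return sum(AFFECT.get(token, 0) for token in tokens)


print(sentiment_score("A wonderful, enjoyable film"))
print(sentiment_score("Boring plot and awful acting."))
```

Such word counting ignores negation and context ("not boring" would still score as unfavorable), which is one reason a labeled data set may be needed instead.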

* Academic applications

The issue of text mining is of importance to publishers who hold large databases of information that need indexing for retrieval. This is especially true in scientific disciplines. Some further important applications of text data mining are shown in the figure.

Many text mining software packages are marketed for security applications, especially the analysis of plain text sources such as Internet news. Text mining is also involved in the study of text encryption.

5. MAJOR RESEARCH AREAS:

Current research areas of text mining deal with the problems of text representation, classification, clustering, information extraction or the search for and modeling of hidden patterns. In this context the selection of characteristics and also the influence of domain knowledge and domain-specific procedures play an important role. Text mining always involves:

(a) Getting some texts relevant to the domain of interest (traditional IR);

(b) Representing the content of the text in some medium useful for processing (natural language processing, statistical modeling);

(c) Doing something with the representation (finding associations, dominant themes, etc.)

In order to achieve this, one frequently relies on the experience and results of research in information retrieval, natural language processing and information extraction. These areas also apply data mining methods and statistics to handle their specific tasks:

5.1 Information Retrieval (IR):

Information retrieval is the finding of documents which contain answers to questions, not the finding of the answers themselves [Hea99]. To achieve this goal, statistical measures and methods are used for the automatic processing of text data and its comparison to the given question. Information retrieval in the broader sense deals with the entire range of information processing, from data retrieval to knowledge retrieval. Although information retrieval is a relatively old research area, where first attempts at automatic indexing were made in 1975, it gained increased attention with the rise of the World Wide Web and the need for sophisticated search engines. Even though the definition of information retrieval is based on the idea of questions and answers, systems that retrieve documents based on keywords, i.e. systems that perform document retrieval like most search engines, are frequently also called information retrieval systems.
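As an illustration of how a statistical measure can compare documents with a keyword query, the following minimal sketch ranks documents by cosine similarity over TF-IDF weights. TF-IDF is a common choice that we assume here; the paper does not name a specific measure, and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def tf_idf_vectors(documents):
    """Weight each term by its count in the document times its rarity."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return indices of matching documents, best match first."""
    vectors, idf = tf_idf_vectors(documents)
    query_vec = {
        term: count * idf[term]
        for term, count in Counter(tokenize(query)).items()
        if term in idf  # query terms absent from the collection are dropped
    }
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored if score > 0.0]


docs = [
    "text mining finds patterns in text",
    "search engines retrieve documents by keywords",
    "data mining finds patterns in databases",
]
print(rank("text mining", docs))
print(rank("retrieve documents", docs))
```

The system returns documents, not answers: it is left to the reader to find the answer inside the top-ranked texts.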

5.2 Natural Language Processing (NLP):

Linguistic analysis techniques are used, among other things, for the processing of text.

5.3 Information Extraction (IE):

The goal of information extraction methods is the extraction of specific information from text documents. The extracted items are stored in database-like patterns and are then available for further use.
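A minimal sketch of this idea is given below, assuming a single hand-written pattern of the form "PERSON joined COMPANY in YEAR". The pattern, the field names and the sample sentence are hypothetical; practical systems use many patterns or learned extractors.

```python
import re

# Hypothetical extraction pattern with one named group per database field.
JOINED = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) joined "
    r"(?P<company>[A-Z][A-Za-z]+) in (?P<year>\d{4})"
)


def extract_records(text):
    """Return one dict per match, i.e. one database-like row per fact."""
    return [match.groupdict() for match in JOINED.finditer(text)]


sample = (
    "Alice Smith joined Acme in 2019. "
    "Later, Bob Jones joined Initech in 2021 as an engineer."
)
for record in extract_records(sample):
    print(record)
```

Each resulting record has the same fields, so it could be inserted into a database table and queried like any other structured data.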

6. CHALLENGES & LIMITATIONS OF TEXTUAL DATA MINING:

There are numerous challenges to the statistical community that reside within this discipline area. As in any data mining or exploratory data analysis effort, visualization of textual data is an essential part of the problem.

* More difficult textual mining problems involve the analysis of free-form text found in e-mail documents and recorded telephone transcripts

* The challenge of textual data mining is handling ambiguities such as spelling and grammar errors

* Text contains acronyms, abbreviations, misspellings (customer, cust, customar, csmr)

* Semantic analysis: understanding the meaning of words (e.g. book = to reserve vs. book = a manual)

* Syntax analysis: understanding a sentence's structure and the roles of words (e.g. subject, verb, preposition and noun).
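One way to approach the abbreviation and misspelling problem listed above is approximate string matching against a list of canonical terms. The sketch below uses `difflib.get_close_matches` from the Python standard library; the vocabulary and the similarity cutoff of 0.6 are assumptions chosen for this example, not values from the paper.

```python
import difflib

# Hypothetical controlled vocabulary of canonical terms.
VOCABULARY = ["customer", "invoice", "payment"]


def normalize(token, vocabulary=VOCABULARY, cutoff=0.6):
    """Map a noisy token to its closest canonical term.

    Tokens that are not similar enough to any vocabulary entry
    are returned unchanged.
    """
    matches = difflib.get_close_matches(
        token.lower(), vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else token


variants = ["customer", "cust", "customar", "csmr"]
print([normalize(variant) for variant in variants])
```

The cutoff is a trade-off: a lower value catches more abbreviations but also risks merging unrelated words, so it would need tuning on real data.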

The fundamental limitations of text mining are, first, that we will not be able to write programs that fully interpret text for a very long time, and second, that the information one needs is often not recorded in textual form. If we tried to write a program that detected when and where a new word came into existence and how it spread by analyzing web pages, we would miss important clues relating to usage in spoken conversations, email, on the radio and TV, and so on. Text data mining (TDM) is therefore limited in what it can recover when mining data or information from a data warehouse.

7. CONCLUSION:

In text mining, the goal is to discover heretofore unknown information, something that no one yet knows and so could not have yet written down. Simple statistical methods can yield surprisingly good results, even though the meanings of the texts are not being discerned by the programs. Rather, the text is treated like a "bag of words".
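What treating text as a "bag of words" means can be shown with a minimal sketch: only the words and their counts are kept, while word order, and with it much of the meaning, is discarded. The sample sentences are hypothetical.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count each word in the text, ignoring case, punctuation and order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


first = bag_of_words("Dog bites man")
second = bag_of_words("Man bites dog")
print(first == second)
```

The two sentences mean different things, yet their bags are identical, which is why such methods can find patterns without understanding the text.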

In this paper, we tried to give a brief introduction to the rise of the text mining technique and its wide-ranging field as the unforeseen part of data mining. To that end, we motivated this field of research, gave a more formal definition of the terms used herein and presented a brief overview of currently available text mining methods, their research areas and their application to specific problems in data mining. Even though it was impossible to describe all text mining concepts, methods, approaches, research fields, challenges and applications in detail within the size limits of an article, we think that the ideas discussed and the references provided should give the interested reader a rough overview of this field and several starting points for further study.

8. REFERENCES:

[1] Ah-Hwee Tan (1999). Text Mining: The State of the Art and the Challenges.

[2] Srivastava, A., and Sahami, M. (2009). Text Mining: Classification, Clustering, and Applications. Boca Raton, FL: CRC Press.

[3] Daniel Tkach (1998). Text Mining Technology: Turning Information into Knowledge. IBM.

[4] Helena Ahonen, Oskari Heinonen, A. Inkeri Verkamo (1997). Applying Data Mining Techniques in Text Analysis. Report, 1997.

[5] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2000.

[6] R. Feldman, J. Sanger. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2007.

[7] S. Weiss, N. Indurkhya, T. Zhang, F. Damerau. Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer, 2005.

[8] Bilisoly, R. (2008). Practical Text Mining with Perl. New York: John Wiley & Sons.

[9] Andreas Hotho, Andreas Nürnberger, Gerhard Paaß. A Brief Survey of Text Mining, 2005.

[11] Lucas, M. 1999-2000. Mining in textual mountains, an interview with Marti Hearst. Mappa Mundi Magazine, [Online],Available http://mappa.mundi.net/tripm/hearst/.

[12] Nasukawa, T. and Nagano, T. 2001. Text analysis and knowledge mining system. IBM Systems Journal 40(4):967–984.

[13] Mack, R. & Hehenberger, M. 2002. Text-based knowledge discovery: search and mining of life-science documents, S89–S98.

[14] http://en.wikipedia.org/wiki/Text_mining