TEXT MINING: CONCEPT, TECHNIQUES, APPLICATIONS, CHALLENGES AND OPPORTUNITIES


ANSHIKA SINGH

Assistant Professor, Department of Computer Science,

Bhaskaracharya College of Applied Sciences, Dwarka, University of Delhi, Delhi, India

anshika1807@gmail.com

Abstract: The volume of data on the internet and in databases keeps growing, and most of it is unstructured or semi-structured. Such data can only be utilized once it has been located and synthesized into valuable information or knowledge. This has created demand for systems and tools that can automatically discover or extract useful knowledge from unstructured text. Text mining has emerged as an efficient way to meet this demand. Text mining is the process of analyzing and structuring text data to extract interesting or meaningful patterns and so generate usable knowledge from previously unstructured data. This paper provides an overview of text mining in the context of its techniques and applications.

Keywords: Text mining, Data mining, Classification, Clustering, Text analysis.

1. Introduction

With the rapid advancement of information technology and networked applications, the internet has become an integral part of everyday life. The amount of unstructured or semi-structured data on websites is increasing dramatically in the form of technical documentation, articles, blogs, forms, forum posts etc. Text is unstructured and imprecise in nature, which makes it difficult to handle. Text is presently the most common medium for exchanging information, which has given rise to the demand for analysis of text data in order to discover or extract meaningful and useful information.

Text mining is a process which attempts to glean significant information, or simply knowledge, from natural language text for a specific purpose. It is also known as Text Data Mining, Knowledge Discovery in Text (KDT) [Han et al. (2012)] or Intelligent Text Analysis, and can be defined as the task of extracting hidden, previously unknown, reasonable and latent patterns from a very large unstructured text corpus or database. Equivalently, text mining can be described as the discovery by computer of new, previously unknown information through the automatic extraction of non-trivial knowledge from a massive collection of unstructured text resources [Han et al. (2012)].

Text mining is a variation on the field of data mining [Navathe et al. (2000)], which strives to explore interesting and meaningful patterns in large databases. The two processes are quite similar, except that data mining tools are intended to handle structured data, whereas text mining can be applied to unstructured or semi-structured data sets such as HTML or XML files, full-text documents, emails etc. As a huge amount of organisational information is stored in the form of text documents, text mining is believed to have a higher commercial value than data mining and to serve as a better solution for big organisations.


Fig.1. A Text mining process model.

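The Term-Document Matrix is a vector space model representation of the documents and acts as an input for various mining processes like information extraction and retrieval, text categorisation, text summarization etc. A document $d_j$ in text mining is represented as

\[
d_j \triangleq \begin{pmatrix} f_{1j} \\ f_{2j} \\ \vdots \\ f_{nj} \end{pmatrix},
\]

where $f_{ij}$ is the number of occurrences of the $i$th term in document $j$, and the Term-Document matrix, with one row per word and one column per document, is represented as

\[
D \triangleq \begin{pmatrix}
f_{11} & f_{12} & \cdots & f_{1n} \\
f_{21} & f_{22} & \cdots & f_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
f_{n1} & f_{n2} & \cdots & f_{nn}
\end{pmatrix}.
\]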
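As an illustration of the representation above, the following minimal Python sketch builds a term-document matrix of raw counts. The function name `term_document_matrix`, the whitespace tokenisation and the two sample documents are assumptions made for this example and are not part of the original model.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] is the count of term i in document j."""
    # Assumption: lower-casing and whitespace splitting stand in for real pre-processing.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # One row per term, one column per document.
    matrix = [[doc_counts[term] for doc_counts in counts] for term in vocabulary]
    return vocabulary, matrix


docs = ["text mining finds patterns", "data mining finds patterns in data"]
vocab, D = term_document_matrix(docs)
for term, row in zip(vocab, D):
    print(f"{term:10s} {row}")
```

Each printed row is one term and each column one document, so the column for a document is its vector of term counts.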

Term Frequency Inverse Document Frequency (TF-IDF) is used to calculate the weights of the words or terms, where Term frequency (TF) is defined as the frequency of word or term in a document and Inverse Document Frequency (IDF) is defined as the measure of the rarity of a word or term in the complete document set.

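\[
\mathrm{TF}_i = \frac{n_i}{\sum_k n_k} \quad \text{and} \quad \mathrm{IDF}_i = \log \frac{N}{df_i},
\]

where $n_i$ = no. of occurrences of the considered term in the document, $n_k$ = no. of occurrences of term $k$, so that $\sum_k n_k$ is the no. of occurrences of all terms in the document, $N$ = total no. of documents in the collection, and $df_i$ = no. of documents that contain term $i$. The weight of a term is the product $\text{TF-IDF}_i = \mathrm{TF}_i \times \mathrm{IDF}_i$. This method measures the statistical strength of the given term in reference to the query.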
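The weighting scheme can be sketched in a few lines of Python. This is a minimal sketch assuming the common definitions TF = n_i / (sum of all term counts in the document) and IDF = log(N / df_i), where N is the total number of documents; the natural logarithm, the whitespace tokenisation and the sample documents are assumptions of this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # df[t]: number of documents that contain term t at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = sum(counts.values())  # occurrences of all terms in this document
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


documents = [
    "text mining extracts knowledge",
    "data mining handles structured data",
    "text is unstructured",
]
weights = tf_idf(documents)
for doc, w in zip(documents, weights):
    top = max(w, key=w.get)
    print(f"{doc!r}: highest-weighted term is {top!r}")
```

A term that occurs in every document receives weight zero under this definition, which is why very common words contribute little to the representation.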

2. Text mining Techniques and methodologies

Text mining is a methodology for analysing unstructured text data and extracting relevant and meaningful patterns, structures and characteristics. According to Zorn et al. (1999), “Text mining offers powerful possibilities for creating knowledge and relevance out of the massive amounts of unstructured information available on the Internet and corporate intranets”. They regard text mining as a knowledge creation tool. Text mining is a pragmatic tool which is capable of identifying meaningful and hidden information in a document collection.

Textual data can be examined to derive the structure and inherent meanings “hidden” in the documents, or to summarize the documents based upon the words present in them. We can analyse words or clusters of words in the documents, or analyse the documents themselves and determine similarities between them. The major objective of text mining is to discover useful and meaningful information without duplicating it across several documents that carry the same meaning.

Text Mining Process consists of the following Steps:

1) Text pre-processing; syntactic/semantic analysis of text,

3) Text Representation; feature selection in the document to further reduce the dimensionality, and eradicating irrelevant features or missing information, e.g. sampling, statistics etc.,

4) Data Mining; application of various data mining methods like association rule mining, classification, clustering etc. This step is solely dependent on the type of application,

5) Interpretation/Evaluation; analysing and visualising the results.

Text mining utilises data mining techniques to identify or discover patterns or information in textual data. It intrinsically requires methods and techniques from other areas such as information retrieval, computational linguistics and data mining [Bolasco et al. (2002)], as shown in fig. 2. Text mining techniques also help organisations to withstand tough competition in the marketplace by providing an efficient Business Intelligence solution.

Fig.2. Text Mining described as an Interdisciplinary Field

2.1. Information Retrieval

Information Retrieval is defined as the set of methods used for representing, organising and accessing information items [Gerard et al. (1983)], where the information is usually in the form of text documents, research papers, articles, books etc. that are retrieved from databases to answer queries based on the user’s interest. Information Retrieval (IR) systems identify the documents in a collection which match a user’s query. An Information Retrieval system employs one of two main approaches:

1) Boolean; a Boolean query is built from several atomic query terms, such as words or phrases, using logical operators like AND, OR and NOT. This approach divides the database into two sets: one containing the documents that match the user’s query and the other consisting of the remaining documents. The matching set is then inspected by the user, who has no a priori knowledge of where the useful or relevant documents lie within it.

2) Ranked output or Best-match; compares the set of terms extracted from the user query with the set of terms for each document in the database, calculates a measure of similarity between them by applying a numerical algorithm, and finally ranks the documents by decreasing degree of similarity, i.e. likelihood of relevance to the user query. The user can then browse down the ranked list. This approach depends not only on the user query but can also consider prior knowledge and the items previously retrieved and examined in that search.

With either approach, however, the required information in the relevant documents still has to be identified by the user, by reading the documents or the highlighted terms in the retrieved text. IR tools based on traditional keyword search, such as Google or PubMed, identify relevant documents on the World Wide Web that contain certain keywords, while the Excite web search engine identifies relevant documents based on the correlation of their concepts as well as keywords, by considering the entire text of each page.
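The two retrieval approaches can be contrasted in a short Python sketch. It assumes an inverted index for the Boolean approach and cosine similarity over raw term counts as the numerical measure for the ranked approach; the source does not prescribe either choice, and the collection `DOCS` and all function names are hypothetical.

```python
import math
from collections import Counter

DOCS = {
    "d1": "text mining extracts patterns from text",
    "d2": "data mining works on structured data",
    "d3": "information retrieval ranks text documents",
}


def build_index(docs):
    """Inverted index: term -> set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and_not(index, required, excluded=()):
    """Boolean query: every term in `required` (AND), no term in `excluded` (NOT)."""
    result = None
    for term in required:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
    result = set() if result is None else result
    for term in excluded:
        result -= index.get(term, set())
    return result


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(docs, query):
    """Best-match query: document ids sorted by decreasing similarity to the query."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(text.lower().split())), doc_id) for doc_id, text in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]


index = build_index(DOCS)
print(boolean_and_not(index, ["mining"], excluded=["data"]))
print(rank(DOCS, "text mining"))
```

The Boolean query returns an unordered set that either matches or does not, whereas the ranked query orders every partially matching document, which is the practical difference between the two approaches described above.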

2.2. Information Extraction


Information Extraction addresses the issue of text processing, i.e., transforming a corpus of textual documents into a more structured form such as a structured record or template, which can then be provided to the data mining module for the extraction of interesting patterns or rules that represent valuable information or knowledge, as illustrated in fig. 3. Hence Information Extraction can play an important role in text mining.

For example, VIE (Vanilla IE system) is an IE system which performs the following tasks:

1) Named entity recognition; recognise and classify the definite entities,

2) Coreference resolution; identify identity relationships between entities, i.e. expressions that refer to the same entity,

3) Template element construction; filling the information into a predefined structure like records,

4) Scenario template construction; detecting relationships between template elements relevant to a specific purpose and constructing a fixed-format structure containing all the entities and their relationships.

VIE carries out all these tasks by using NLP techniques to build a single rich representation of the text, in three stages:

1) Lexical preprocessing; reads the raw text and creates tokens, performs “parts-of-speech” tagging, morphological analysis, phrasal matching etc.,

2) Parsing; grammatical parsing,

3) Discourse interpretation; adds additional information presumed by the input to the world model, performs coreference resolution between newly added instances and those already present in the world model, etc.

Fig.3. Overview of IE-based text mining framework

While IR systems narrow the database down to the set of documents that are relevant to a user query or specific problem, IE systems extract the relevant and meaningful information from those documents for some specific purpose, such as summarizing a patient record in health care delivery by extracting symptoms, test results and therapeutic treatments. Information Extraction and Information Retrieval are complementary to each other, and their combination acts as an efficient solution for text processing, i.e., the automatic construction of structured data repositories from large natural language text collections. Hence, a combination of IE and IR can be used to improve user query precision and to speed up the analysis significantly by narrowing down the set of documents relevant to the user query.

The efficiency and effectiveness of both IR and IE systems are mainly assessed by the twin measures of recall and precision. If a search retrieves “A” relevant documents out of the “B” relevant documents in the database, and “C” documents are retrieved in total, then

\[
\text{Recall} = \frac{A}{B} \times 100 \quad \text{and} \quad \text{Precision} = \frac{A}{C} \times 100 .
\]
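As a worked example of the two measures, the sketch below computes recall as A/B x 100 and precision as A/C x 100 from sets of document identifiers. The function name and the sample identifiers are hypothetical.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) in percent for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)  # A: relevant documents that were retrieved
    b = len(relevant)              # B: relevant documents in the database
    c = len(retrieved)             # C: documents retrieved in total
    recall = 100.0 * a / b if b else 0.0
    precision = 100.0 * a / c if c else 0.0
    return recall, precision


recall, precision = recall_precision(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d2", "d5"},
)
print(f"recall = {recall:.1f}%, precision = {precision:.1f}%")
```

Here A = 2, B = 3 and C = 4, so recall is about 66.7% and precision is 50%.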

2.3. Computational Linguistics


“Text mining is the study and practice of extracting information from text using the principles of computational linguistics” [Sullivan (2000)]. Text mining techniques share methods with natural language processing (NLP) to handle unstructured textual information hidden in natural language text databases. The goal of NLP is to design and build computer systems that analyze, understand and generate natural language. NLP is useful for summarizing a text document after understanding it and for understanding and analysing commands and queries, but efficient methods for processing natural language text and extracting useful knowledge patterns have still not been achieved. The patterns generated by these methods and techniques are further analyzed by the computer in order to construct information which can be used as an input for data mining algorithms to produce an analysis or desired result. In addition, linguistic analysis techniques are used for the processing of text. Text mining techniques can therefore offer benefits in processing human natural language information with speed and accuracy.

2.4. Pattern Recognition

Pattern Recognition is a branch of machine learning which aims to recognize patterns and regularities in data. Applied to text, it is the process of identifying predefined patterns or sequences (supervised learning). In a text mining scenario it is taken as a process of matching patterns using words as well as morphological and syntactic properties, considering their statistical variation. Pattern recognition involves two types of methods:

1) Term or word matching; easier to implement but needs manual effort as well,

2) Relevancy signatures; based on morphological and syntactic information processing techniques.

A common example of pattern recognition is the regular expression, which searches for patterns of a given type in text data and is applied in the search capability of various word processors and text editors. Pattern recognition forms the basis for CAD (computer-aided diagnosis) systems, text classification, automatic recognition of speech, handwritten text or images, etc.
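The regular expression example can be made concrete with Python's standard `re` module. The two patterns below (a simplified e-mail address and a four-digit year) and the sample sentence are assumptions chosen for illustration; real-world patterns are usually more elaborate.

```python
import re

# Simplified pattern for e-mail addresses; it does not cover every valid address.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Four-digit years from 1900 to 2099, delimited by word boundaries.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

text = "Write to alice@example.org about the 2012 edition; the 1983 book is also cited."

emails = EMAIL.findall(text)
years = YEAR.findall(text)
print(emails)
print(years)
```

Both patterns describe a type of string rather than a fixed word, which is what distinguishes this kind of pattern matching from plain term matching.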

2.5. Text Categorization

Categorization refers to a data-driven and iterative process which involves grouping related concepts, classes, or common threads. Text categorization is a supervised learning process which automatically determines the category of a text, based on its content, using the chosen classification algorithm. It identifies similarities between documents in order to find the category of a document within a pre-defined set of categories. The process of categorization relies on representing the whole document as a “bag of words”, and meaningful information is extracted based upon the count of each word in the document. Documents having a high percentage of content on a specific topic are arranged in order and ranked according to that content.

Document categorization or text classification may be defined as the task of ascertaining an unknown target function $\Phi : D \times C \rightarrow \{0, 1\}$, represented as a decision matrix which determines how documents are classified, as illustrated in fig. 4, where $D = \{d_1, \ldots, d_n\}$ represents the set of documents for categorization (possibly infinite) and $C = \{c_1, \ldots, c_m\}$ represents the set of predefined categories. The value $a_{ij}(d_j, c_i)$ is 1 if $d_j$ is a member of $c_i$, and 0 if $d_j$ is a non-member of $c_i$.

Feature selection plays an important role in document categorization by selecting a subset of features according to predetermined measures. Text categorization has applications in many fields, such as automatic indexing for Boolean information retrieval systems, document organization, document filtering, word sense disambiguation, Yahoo-style search tree categorization etc.

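\[
\begin{array}{c|ccccc}
 & d_1 & \cdots & d_j & \cdots & d_n \\ \hline
c_1 & a_{11} & \cdots & a_{1j} & \cdots & a_{1n} \\
\vdots & \vdots & & \vdots & & \vdots \\
c_i & a_{i1} & \cdots & a_{ij} & \cdots & a_{in} \\
\vdots & \vdots & & \vdots & & \vdots \\
c_m & a_{m1} & \cdots & a_{mj} & \cdots & a_{mn}
\end{array}
\]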

Fig.4. Decision matrix for document categorization
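To show how a decision matrix of this kind can be filled in practice, the sketch below trains a small multinomial Naive Bayes classifier on bag-of-words counts and assigns each new document to one category. Naive Bayes with add-one smoothing is an assumption of this example (the text does not name a specific classification algorithm), and the training data, category names and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """labeled_docs: list of (text, category). Returns the counts Naive Bayes needs."""
    class_docs = Counter()             # number of training documents per category
    term_counts = defaultdict(Counter) # bag-of-words counts per category
    vocabulary = set()
    for text, category in labeled_docs:
        tokens = text.lower().split()
        class_docs[category] += 1
        term_counts[category].update(tokens)
        vocabulary.update(tokens)
    return class_docs, term_counts, vocabulary


def classify(model, text):
    """Return the category with the highest log-probability for `text`."""
    class_docs, term_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for category in sorted(class_docs):
        total_terms = sum(term_counts[category].values())
        score = math.log(class_docs[category] / total_docs)
        for token in text.lower().split():
            if token in vocabulary:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((term_counts[category][token] + 1)
                                  / (total_terms + len(vocabulary)))
        if score > best_score:
            best, best_score = category, score
    return best


training = [
    ("cheap loans win money now", "spam"),
    ("win a free prize now", "spam"),
    ("meeting agenda for project review", "ham"),
    ("project report and meeting notes", "ham"),
]
model = train(training)
new_docs = ["free money prize", "project meeting"]
categories = sorted(model[0])
# Decision matrix: one row per category c_i, one column per document d_j.
decision = [[1 if classify(model, d) == c else 0 for d in new_docs] for c in categories]
print(categories)
print(decision)
```

Each column of `decision` contains exactly one 1 because this sketch assigns a single category per document; a multi-label setting would allow several.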

2.6. Text Clustering

Text clustering is based on the concept of placing texts with similarities into the same cluster. Each cluster consists of a number of documents. A basic clustering algorithm creates a vector of topics for each document and measures the weights of how well the document fits into each cluster. The elementary idea behind clustering analysis is to use the significant features of objects to calculate the degree of similarity among them and so achieve automatic classification. A clustering is considered better when the documents within a cluster are more similar to each other than to the documents in other clusters.

Document clustering is useful for document organization and summarization in various information management systems. It supports topic extraction and information retrieval efficiently and finds application in categorizing the results returned by search engines in answer to user queries. Document clustering also facilitates hierarchical grouping of documents, which enhances browsing and document organization by arranging the clusters into a meaningful hierarchy or tree-like structure.
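A very small clustering sketch is given below. It uses single-pass "leader" clustering with cosine similarity over word counts, a deliberately simple technique chosen here for brevity rather than one named in the text; the similarity threshold of 0.3 and the sample documents are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def leader_clustering(documents, threshold=0.3):
    """Assign each document to the first cluster whose leader is similar enough."""
    vectors = [Counter(doc.lower().split()) for doc in documents]
    clusters = []  # each cluster is a list of document indices; the first is its leader
    for i, vector in enumerate(vectors):
        for cluster in clusters:
            if cosine(vector, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar leader found: start a new cluster
    return clusters


documents = [
    "stock market prices fall",
    "football match ends in draw",
    "stock prices rise as market recovers",
    "football team wins the match",
]
clusters = leader_clustering(documents)
print(clusters)
```

Unlike categorization, no category labels are supplied; the groups emerge only from the similarity between documents.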

2.7. Text Summarization

A summary can be defined as a text which is created from one or more source texts and which is significantly less than half the length of the original text(s) while preserving the important information in the original text(s) [Hovy (2005)]. Text summarization is the process or task of extracting the significant information from one or more text sources to produce a shorter version for a specific user or task [Mani et al. (1999)]. It helps to reduce the content of documents whilst preserving the sense of the topic communicated in them. In practice, humans read through the text, understand its meaning, and mention or highlight the main topics or concepts discussed in it.

Text summarization methods can be categorized into extractive and abstractive summarization. Extractive summarization consists of selecting important sentences, paragraphs etc. from the original document and concatenating them into a shorter form. The significance of each sentence is decided by its statistical and linguistic features. Abstractive summarization develops an understanding of the main concepts in a document and then expresses those concepts in natural language. It uses linguistic methods which examine and interpret the text in order to discover new concepts and expressions, and describes them by generating a new, shorter text which communicates the most significant and essential information from the original document.
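The extractive approach can be sketched as follows: each sentence is scored by the average corpus frequency of its non-stop words, and the highest-scoring sentences are returned in their original order. The scoring rule, the tiny stop-word list and the sample text are assumptions of this example; they stand in for the statistical features mentioned above.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "from", "that", "are"}


def content_words(text):
    """Lower-cased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Return the `max_sentences` highest-scoring sentences in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(frequency[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


text = ("Text mining extracts knowledge from text. "
        "The weather was pleasant. "
        "Mining text collections reveals patterns.")
print(summarize(text, max_sentences=1))
```

An abstractive summarizer would instead generate new sentences, which requires language generation and is outside the scope of this sketch.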

2.8. Association Rule mining

Association rule mining is one of the important data mining techniques; it discovers frequent patterns or structures representing associations and correlations among the sets of items or feature words in transaction databases or information repositories [Srikant et al. (1996)]. Let us consider a collection of documents $D = \{d_1, d_2, \ldots, d_n\}$ and a set of items $I = \{W_1, W_2, \ldots, W_n\}$, which consists of keywords, concepts, phrases or terms. A document $d_i$ contains $W_i$ iff $W_i \subseteq d_i$. An association rule between $W_i$ and $W_j$ is an implication of the form $W_i \Rightarrow W_j$, where $W_i \subset I$, $W_j \subset I$ and $W_i \cap W_j = \emptyset$. There are two basic measures for association rules: support and confidence.

The association rule $W_i \Rightarrow W_j$ holds in $D$ with support $S$ if $S\%$ of the documents in $D$ contain $W_i \cup W_j$, and with confidence $C$ if $C\%$ of the documents that contain $W_i$ also contain $W_j$.

The support and confidence are calculated as:

\[
\text{Support}(W_i \Rightarrow W_j) = \frac{\text{no. of documents containing } W_i \cup W_j}{\text{total no. of documents in } D} \times 100
\]

\[
\text{Confidence}(W_i \Rightarrow W_j) = \frac{\text{no. of documents containing } W_i \cup W_j}{\text{no. of documents containing } W_i} \times 100
\]

A rule satisfies minimum support if its support is greater than a user-defined support threshold. Association rule extraction can be decomposed into two steps:

1) Generate frequent itemsets; itemsets whose support is greater than the user-specified support, called minimum support (min_sup),

2) Generate the association rules; use the frequent itemsets to generate the association rules whose confidence satisfies a user-specified confidence, called minimum confidence (min_conf).

Association rule mining has applications in knowledge discovery and decision making in industry, the agriculture sector etc.
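The two measures and the two-step extraction can be illustrated with the sketch below, which treats each document as a set of keywords. It only considers rules between single keywords and uses "at least" for both thresholds; a full frequent-itemset miner such as Apriori handles larger itemsets. These simplifications, the function names and the sample documents are assumptions of this example.

```python
from itertools import combinations


def support(documents, itemset):
    """Percentage of documents that contain every item in `itemset`."""
    itemset = set(itemset)
    return 100.0 * sum(itemset.issubset(doc) for doc in documents) / len(documents)


def confidence(documents, antecedent, consequent):
    """Percentage of documents containing `antecedent` that also contain `consequent`."""
    base = support(documents, antecedent)
    both = support(documents, set(antecedent) | set(consequent))
    return 100.0 * both / base if base else 0.0


def mine_rules(documents, min_sup, min_conf):
    """Return (antecedent, consequent, support, confidence) for single-keyword rules."""
    items = sorted(set().union(*documents))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support(documents, {a, b})
        if pair_support >= min_sup:            # step 1: frequent itemset
            for x, y in ((a, b), (b, a)):      # step 2: rules from the itemset
                conf = confidence(documents, {x}, {y})
                if conf >= min_conf:
                    rules.append((x, y, pair_support, conf))
    return rules


documents = [
    {"text", "mining", "clustering"},
    {"text", "mining"},
    {"data", "mining"},
    {"text", "clustering"},
]
rules = mine_rules(documents, min_sup=50, min_conf=80)
print(rules)
```

In this collection the rule clustering => text has support 50% and confidence 100%, while text => mining also reaches 50% support but only about 66.7% confidence and is therefore filtered out.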

2.9. Text Visualization


Trend analysis and association analysis are used to find or predict future patterns or trends from time-dependent data and to associate these patterns with other extracted patterns. Trend analysis is applied to financial reports, business reports, scientific reports or literature, current news etc. There are many approaches to determining and summarizing the evolutionary patterns of themes in streams of text [Mei et al. (2005)]. Evolutionary theme graphs generate clusters and then discover coherent themes over time by using the Kullback-Leibler divergence measure. In the Markov model approach, globally interesting themes are discovered first, and the strength of each theme in each period of time is then computed; this shows the trend of strength variation and allows the relative strength of all themes to be compared over time. Statistics-based models label documents using keyword distributions and determine the changing trend of text topics by calculating the distance between the keyword distributions of collections from different points in time [Feldman et al. (1998)].
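The idea of comparing keyword distributions from two points in time can be sketched as below. The sketch assumes the Kullback-Leibler divergence as the distance measure and add-one smoothing so that no keyword has zero probability; the keyword vocabulary, the two small collections and the function names are hypothetical.

```python
import math
from collections import Counter


def keyword_distribution(documents, vocabulary, smoothing=1.0):
    """Smoothed probability of each keyword in `vocabulary` over a collection."""
    counts = Counter(w for doc in documents for w in doc.lower().split() if w in vocabulary)
    total = sum(counts.values()) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keywords."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


vocabulary = {"neural", "networks", "text", "classification", "rule", "summarization"}
old_docs = ["neural networks for text classification", "rule based text classification"]
new_docs = ["deep neural networks", "neural networks for summarization"]

p_old = keyword_distribution(old_docs, vocabulary)
p_new = keyword_distribution(new_docs, vocabulary)
drift = kl_divergence(p_old, p_new)
print(f"KL(old || new) = {drift:.3f}")
```

A divergence of zero means the keyword usage did not change between the two periods; larger values indicate a stronger shift in topics.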

3. Applications

In Natural Language Processing (NLP), text mining is used in the construction of websites which support question answering in natural language. Text mining applications are also used to analyze web pages published in different languages.

In the area of knowledge management and HR, text mining techniques are used in decision support and competitive intelligence, automatically reading data about the company and its competitors and selecting only the relevant information. They are mainly used in applications designed to manage human resources strategically, which involve analysing staff opinions, monitoring the level of employee satisfaction, and reading and storing CVs for the selection of new personnel.

In biomedical applications, text mining is used for the identification and classification of technical terms in the domain of molecular biology that correspond to concepts.

In customer profile and relationship management, companies use text mining to draw out the occurrences and instances of key terms in large blocks of text such as articles, web pages and complaint forums [Shantanu et al. (2008)]. Text mining helps in determining a company’s image through the analysis of press reviews and other relevant sources. It also helps in rerouting specific requests automatically to the appropriate service or in supplying immediate answers to the most frequently asked questions.

In the area of technology watch, text mining techniques are used extensively to identify the relevant science and technology literature and to efficiently extract the essential information required from it [Ronald (2003)].

Several text mining products and applications are available in the marketplace; they are summarized in Table 1.

Table 1. A summary of Text mining applications and products.

Text Mining Product/Application | Key Functions

IBM Intelligent Miner | Document-based: clustering, text summarization, text categorization/classification

Inxight Linguist | Document-based: information retrieval, text analysis, summarization

TextWise DR_LINK / CINDOR / CHESS | Concept-based: information retrieval, information extraction

Megaputer TextAnalyst | Document-based: semantic net representation; performs information retrieval, text summarization, classification

Cambio Data Junction | Concept-based: information extraction in the form of relational attributes

IBM / Synthema Technology Watch | Document-based: clustering and visualization of technical publications and patent databases in map representation

Canis's cMap | Document-based: clustering and visualization of documents using self-organizing map (SOM)

Cartia's ThemeScape | Document-based: clustering and visualization of documents in landscape representation

Inxight VizControls | Document-based: clustering the documents into groups and visualization in hyperbolic tree representation

Semio Corp's SemioMap | Concept-based: visualization in a three-dimensional graphical interface which maps the relationships between concepts in the collection of documents

4. Benefits, Challenges and Opportunities


One of the common issues in text mining is representing the text corpora in an intermediate form suitable for mining. Semantic analysis is one solution, as it yields a rich representation for analysing the relationships between the objects or concepts in the documents. However, it operates at only a few words per second, which makes it computationally expensive, and achieving efficient and scalable semantic analysis for large text corpora remains a challenge.

Currently, most text mining tools process text data in the English language, but mining documents in multiple languages can uncover untouched, valuable information useful for specific purposes, and offers an opportunity to develop multilingual refining algorithms for text mining.

Domain knowledge plays an integral role in the knowledge distillation process. It helps in improving parsing efficiency as well as the learning or mining efficiency of the text mining model, and is another area of concern for research. Presently, text mining tools are designed for trained specialists and are usually used by technical users and management executives. There is therefore a need for an intelligent personal assistant or miner which first learns a user’s profile and then performs text mining operations and extracts information relevant to that profile without it being explicitly specified in the user query. Hence, personalized autonomous mining is another burgeoning area in text mining.

5. Conclusion

With the explosive growth of stored information in almost every area of the real world, there is a great demand for new, powerful tools for transforming data into useful knowledge. The problem of information overload on the web and in databases is aggravated by the fact that the majority of the data is in unstructured, textual form. Text represents a vast, rich collection of information, but encodes this information in a form that is difficult to decipher automatically.

Text mining, also known as Text Data Mining or KDT, refers generally to the process or task of extracting interesting, meaningful and non-trivial information or knowledge from unstructured text. It is an interdisciplinary field which relies on information retrieval, information extraction, computational linguistics, machine learning, statistics and data mining. As most of the information in organizations (over 80%) is stored as text, text mining is believed to have a high potential value. Knowledge may be discovered or explored from several sources of information; however, unstructured texts remain the largest readily available source of knowledge. Text mining unlocks the hidden information and extracts new knowledge, leading to an improved research process and quality, and in turn to cost savings and productivity gains.

References

[1] Han J.; Kamber M.; Pei J. (2012). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, pp. 7-8.

[2] Navathe, Shamkant B.; Elmasri, Ramez (2000). Data Warehousing and Data Mining, Fundamentals of Database Systems. Pearson Education Pvt Inc, Singapore, pp. 841-872.

[3] Miller, Thomas W. (2005). Data and Text Mining: a Business Applications Approach. Pearson Edition.

[4] Zorn P.; Emanoil M.; Marshall L.; Panek M. (1999). Mining meets the Web. Online, vol. 23, no. 5, pp. 17–28.

[5] Bolasco, S.; Canzonetti, A.; Ratta-Rinaldi, F. D.; Singh, B. K. (2002). Understanding Text Mining. Roma, Italy.

[6] Salton G.; McGill M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, USA.

[7] Wilks, Yorick (1997). Information Extraction as a Core Language Technology. International Summer School, SCIE-97.

[8] Armstrong, Susan, editor (1994). Using Large Corpora. MIT Press.

[9] Sullivan, D. (2000). The need for text mining in business intelligence. DM Review. Available at: http://www.dmreview.com/master.cfm.

[10] Liritano S.; Ruffolo M. (2001). Managing the Knowledge Contained in Electronic Documents: a Clustering Method for Text Mining, IEEE, Italy, pp. 454-458.

[11] Hovy, E. H. (2005). Automated Text Summarization. The Oxford Handbook of Computational Linguistics, Oxford University Press, pp. 583–598.

[12] Mani, I.; House, D.; Klein, G., et al. (1999). The TIPSTER SUMMAC Text Summarization Evaluation. In Proceedings of EACL.

[13] Srikant R.; Agrawal R. (1996). Mining quantitative association rules in large relational tables. Proc. Conf. Management Data ACM SIGMOD, pp. 1–12.

[14] Mei, Qiaozhu; Zhai, C. X. (2005). Discovering evolutionary theme patterns from text: an exploration of temporal text mining. Proceedings of KDD, pp. 198-207.

[15] Feldman, Ronen; Dagan I.; Hirsh H. (1998). Mining Text Using Keyword Distributions. Journal of Intelligent Information Systems 10.3, pp. 281-300.

[16] Godbole, Shantanu; Roy, Shourya (2008). Text to Intelligence: Building and Deploying a Text Mining Solution in the Services Industry for Customer Satisfaction Analysis, IEEE, pp. 441-448.

[17] Kostoff, Ronald Nell (2003). Text Mining For Global Technology Watch. Article, Office of Naval Research, Quincy St. Arlington, 1-27.

[18] Rao, R. (2003). From unstructured data to actionable intelligence. Proceedings of the IEEE Computer Society.

[19] Karanikas H.; Tjortjis C.; Theodoulidis B. (2000). An approach to text mining using information extraction. Proceedings of Workshop of Knowledge Management: Theory and Applications in Principles of Data Mining and Knowledge Discovery 4th European Conference.

[20] Zaïane, O. R. (1999). Principles of Knowledge Discovery in Databases. University of Alberta.

[21] Singh, Anshika (2017). A Framework To Automatically Categorize The Unstructured Text Documents, Indian Journal of Science and