Vol 8, No 1 (2018)

(1)

Research Article

a

January

2018

Computer Science and Software Engineering

ISSN: 2277-128X (Volume-8, Issue-1)

A Brief Survey on Text Classification Using Various

Machine Learning Techniques

Padmavathi.S

Ph.D., Research Scholar of Computer Science, PG and Research Department of Computer Science, AVVM Sri

Puspam College, Poondi, Tamilnadu, India

Email: [email protected]

Dr. M. Chidambaram

Assistant Professor in Computer Science, Rajah Serfoji Government College (Autonomous),

Thanjavur, Tamilnadu, India

Email: [email protected]

Abstract— Text classification has grown into more significant in managing and organizing the text data due to tremendous growth of online information. It does classification of documents in to fixed number of predefined categories. Rule based approach and Machine learning approach are the two ways of text classification. In rule based approach, classification of documents is done based on manually defined rules. In Machine learning based approach, classification rules or classifier are defined automatically using example documents. It has higher recall and quick process. This paper shows an investigation on text classification utilizing different machine learning techniques.

Keywords— Text Classification, Machine Learning, Multiview Boosting, cMFDR, Visual Classifiers, Text Segmentation.

I. INTRODUCTION

Text classification is the process of separating a bunch of text records in to various categories in the defined order. It will so sorting and arranging the records in the folders hierarchies so that identification of topic become easy that would support topic specific operation. But if this work done by manually means, it‟s a time consuming work, error phone and needs expert of domain in predefined category. It is necessary to use machine learning techniques automate the classification work. The text classification process has many steps. Collect all format documents and do preprocessing. It is done by token and stem word generation. Removing the insignificant words by token generation and apply the stemmey procedure used for convert the different word format to canonical formal. Clear format is created at the final stage of preprocessing.

Machine learning is band on algorithm that will define classification rules or classifiers automatically using example documents. Two types of learning algorithm is used in the machine learning-supervised learning algorithm or unsupervised learning algorithm. Earlier days, Machine learning was used of binary classifications the classifier rules determines whether the record belongs to defined set or not. For eg. Using machine learning, in email/binary classification is done by whether the mail belongs to legitimate or spam. Similarly machine learning is applied to multi label classification.

Here we analyzed seven different machine learning techniques after text classifications by different authors for many applications. In order to handle it in easy way apply indexing to the record collection and create record vector. In order to create

vector space,

feature selection is the vital step of text classification and utilized for selection of features subset from original records Finally classification of records is prepared by either manually or automatically.

II. AUTOMATIC FEATURE SUBSETS ANALYSER

Developed by Rogerio C. P. Fragoso et al. Basic idea:

The increasing accessibility of text documents in digital form is leading to a need for easy, flexible and automated ways of accessing their contents. In this context, text categorization (TC) is a crucial tool for content organization and information retrieval. The AFSA enhances the already present Class-dependent Maximum Features per Document (cMFDR) procedure and spontaneously describes the good count of features per document.

(2)

ISSN(E): 2277-128X, ISSN(P): 2277-6451, pp. 14-20 Outline of concept:

Many of the text categorization techniques use the Bag of Words for representation. In this method, every word of a textual document is considered a feature. Thereby, a standard sized dataset should have thousands of features, which makes text categorization a great dimensional problem. Calculation costs of categorization are directly affected by the dataset‟s dimensionality. Furthermore, the excessive features can negatively impact on the classification accuracy, especially in datasets having a minimal number of instances. As many of the features in text documents are redundant or irrelevant for classification, these issues can be addressed by restricting the feature‟s count in the dataset, this method to be called as dimensionality reduction.

Dropping the feature‟s count in the dataset can improve computational efficiency while managing or enhancing classification performance. Dimensionality reduction techniques are feature extraction and feature selection. The first technique based on extracting approach that dynamically determines the best value for the f parameter, i.e., the count of features to be chosen per document. The cMFDR method presents good results, however, as its predecessors, it requires a value for the parameter f. In a production environment, as new documents are presented, it is necessary, from time to time, to reanalyze the dataset including the new documents in the feature selection process. Every time this analysis is performed, the client needs to test manually the subsets generated by cMFDR and choose the best one. So, in a working environment, different values of the parameter “f” must be evaluated to determine which one generates the best subset. This process is time-consuming and requires humaninteraction. Automatic Feature Subsets Analyzer (AFSA), the proposed method, aims to establish, in a data-driven way, the best value for f parameter and, hence, the feature‟s count is taken.

The proposed method adopts a validation set strategy to generate a sequence of subsets using cMFDR, preassesses the performance of each set and then choose the value for f that produces the effective set. So, the best value of f is provided as input for cMFDR to build the final feature set, using the test set.

III. MINING USING LATENT DIRICHLET ALLOCATION (LDA)

Developed by Carina Silberer et al. Basic Idea:

LDA based topic place and to filter the high level representation in mapped text reports. But this method is highly sensitive to high hierarchy parameters related to the number of classes or topics and there is no processed way to forecast correct configuration.

In the same manner, the relevant variability yields from the content common to the document and the content of each classes composing the topic place. The proposed -vector representation is evaluated in the context of the theme identification of automatic transcription of human-human telephone conversations and of the categorization of textual newswire collection. This representation is built from a set of feature vectors. Each one is composed of scores of discriminative words. Then, the metric used to associate a document to a class is the Mahalanobis metric.

Outline of concept:

The first step is to build a set of topic places from a training corpus of dialogues D={d1,d2,…dn}. LDA algorithm with different high hierarchy parameters used to learn these topic places. Then, each dialogue is mapped into each topic place to obtain a set of topic-based representations of the same document. This representation is heterogeneous. Indeed, the feature space is not the same from one LDA hyper-parameter configuration to another.

To tackle this issue, we propose to map each topic-based representation into a common discriminative set of words. Thus, we obtain a set of homogeneous feature vectors to represent each document, and each feature is the probability that a discriminative word is associated with a given document. Then, these multiple views are compacted with the -vector framework. This process allows us to remove most of the useless and irrelevant information from the topic-based presentations of the document.

In this research, deal various representation of reports or speech transcriptions and fusion process with factor analysis process called i-vector. The first step is to present a report in multiple/various topic places of different sizes. Then a compact representation of the report from is used to compensate the vocabulary and variability of the topic named presentation. This method was developed for speaker new algorithm. Then it is applied to text categorization task. All original topic band presentation of documents (c-vector) is joined with EFR normalization also provides good solution to report variabilities. This method results 86.5% accuracy in classification compared to other LDA Speaker.

IV. MACHINE LEARNING AND CONCEPTUAL REASONING FOR INCONSISTENCY DETECTION

Developed by Jameela Al Otaibi et al. Basic Idea:

(3)

ISSN(E): 2277-128X, ISSN(P): 2277-6451, pp. 14-20

continuously increasing text-based content. Most existing methods deal with this problem using complicated textual analysis which is known for not being accurate enough. The proposed method consists of two steps first is a machine learning step that does multilevel text categorization. By using a machine learning approach, discovered a path of labels associated to any text. By this way, derive a logical rule for each text induced by the associate path. The next step used to notice the inconsistency by conceptual reasoning on the predicted classes. This study has been shown on a set of Islamic advisory opinions (fatwas). This domain is gaining a large interest with users continuously checking the authenticity and relevance of such content. It provides us the accurate results and can pair existing methods using linguistic analysis.

Outline of algorithm:

It works by first classifying each fatwa to its corresponding category. The Hyper conceptual feature extraction algorithm is used for extracting keywords at each level. These keywords are then fed into a classifier to predict their categories. A propositional logic prover is used to detect inconsistencies. Propositional logic formula is used for representing the fatwa. If the goal is proved, the two compared fatwas are considered consistent, otherwise they are considered inconsistent. To find the correct category of fatwas, extracted keywords put into a classifier. The classifier used for this experiment is the Random Forest classifier experiment was conducted on the category of Vow related fatwa. The text of the fatwa goes through different classifiers at different levels of the classification tree. The next classifier is decided based on the output label given to the text at the current level. Depending on the classification results, each fatwa is assigned a list of labels. The labels are used in the next step to construct the propositional formula.

Classification is performed for the following three levels:

 In the first level, the fatwa is classified into vow related and non-vow-related fatwas.

 In the second level, the vow fatwa is classified into the four vow categories: Anger related vows, vows done for a reason, allowed vows, invalid vows.

 In the third level, the anger related vow fatwas are classified into those that require atonement and those that give the choice between atonement and fulfillment.

The labels that are given to the fatwa at each classification level are used to construct the propositional formula. For example, if the fatwa was classified as Vow related! Anger Fatwa! Fulfillment. The corresponding formula will be V &&A&&J where ”„V”‟ is the symbol for vow related fatwa, ”„A”‟ is the symbol for anger fatwa and ”„J”‟ indicates the judgment of the fatwa which in this case is fulfillment. A second fatwa that is a vow related, anger vow fatwa, but has a different judgment would be represented by the formula V &&A&&!J. By placing the formula corresponding to the first fatwa in the formula section of the tool and the formula corresponding to the second fatwa in the goal section, if both fatwas have the same category and the same judgment, the goal will be proven and the two fatwas are consistent. Otherwise, if both fatwas belong to the same category but have different judgments then the goal will not be proven. The strings of labels of each fatwa are fed to ConProve to detect inconsistencies between pairs of fatwas. There are four potential cases after following these steps: The two fatwas are consistent and detected as consistent (true positive). The two fatwas are inconsistent and detected as inconsistent (true negative). The two fatwas are consistent but detected as inconsistent (false negative). The two fatwas are inconsistent but detected as consistent (false positive). The keyword extraction method based on hyper rectangular decomposition shows that this method successfully extracts discriminative keywords which are representative of each category or subcategory of documents. The way has been authenticated on a set of Islamic advisory opinions, but can be generalized to other domains.

V. MULTIVIEW BOOSTING WITH INFORMATION PROPAGATION

Developed by Jing Peng et al. Basic idea:

Multiview learning has shown promising potential in many applications. Most techniques are focused on either view consistency, or view diversity. In many machine learning applications, it is important to develop algorithms that take input from the multiple views of an object of interest to make a prediction. These algorithms require an effective way of combining the multiple views, and are expected to provide better generalization performance on unseen data, especially when the multiple views are independent given class labels. These algorithms are mainly concentrating on agreement among the views or complementary views. Boost.SH is a boosting algorithm with shared weight distribution used in the proposed method.

Outline of algorithm:

(4)

ISSN(E): 2277-128X, ISSN(P): 2277-6451, pp. 14-20

views of classifiers. So, shared weight distribution in view used for catching the consistency and randomly chosen views to make base classifiers used to ensure the diversity.

Given a set of training examples represented by M views Boost.SH learns weak classifiers from each view independently. However, all the views share the same weight distribution w = (w1,...,wn) with i wi = 1, which is

determined by the view having the largest edge. Randomized Boost.SH (r Boost.SH) maintains a probability distribution over the views, and chooses the winning view probabilistically according to the distribution. This distribution is updated according to forecaster strategies employed in multiarmed bandits. As a result, randomized Boost.SH not only ensures diversity, but also achieves convergence guarantees. The significance of the convergence result is twofold. First, the bound on the training error plays a role in bounds on generalization error. Second, the bound on the training error shows that rBoost.SH is a strong learning algorithm.

Learning Expert Strategy: Consider a strategy as advice given to the algorithm by an “expert”.

Each strategy makes its assumptions regarding the reliability of a view and allocates a different probability mass to a view at different times, and in different situations. The main objective is to find the best strategy so that algorithms belief in every view of classification comes near to it. Most repeated views involves in making classification decision. As a result, the most consistent views contribute to its classification decision. A version of the Boost.SH is eboost.SH algorithm for learning the best expert strategy from a pool of strategies. The strategy set has both inverse and uniform strategy used in this experiment. For the inverse variance strategy, the consensus function is simply true training labels. Therefore, Boost.SH uses a same weight distribution among the views and helps to improve performance of multi-view learning.

VI. SCENE TEXT DETECTION AND SEGMENTATION

Developed by Youbao Tang, and Xiangqian Wu Basic idea:

The objective of scene text detection is to locate the positions of text in different scenes, for example guideposts, store marks, and warning signs; it is one of the most important steps for end-to-end scene text recognition. It can enhance the performance of various multimedia applications, e.g. mobile visual searches, content-based image retrieval, and automatic sign translation.VGGNet-16 is modified for framing the framing the CNN based models. The boundaries and the entire region of text is used for training and designing of a detection network called DNet which is a CNN based text-aware candidate text region used for detecting the coarse CTR. For segmenting the detected coarse CTR into the refined CTRs, SNet a segmentation network which is based on CTR refinement model has been applied. When compared with the traditional approaches, these SNet and DNet networks used for extracting very few CTRs so more true text regions are kept. To get the final text region from refined CTRs, CTR classification network CNet is defined.

Most existing scene text detection approaches generate text bounding boxes containing a lot of background, which makes scene text recognition difficult. To deal with this issue, scene text segmentation methods were proposed to obtain more precise text regions. These methods were hand-crafted and cannot be trained by an end-to-end process. To solve these problems, in this method, a text-aware CTR extraction model and a CTR refinement model are devised to extract CTRs and obtain precise text segmentation results, which can overcome the above problems.

(5)

ISSN(E): 2277-128X, ISSN(P): 2277-6451, pp. 14-20

Trending towards the one word vector space and its usage in many language applications like retrieving the information, inquiry developments, record categorization and question replying. Vector space have been well known in study of mind science mainly applied in simulations of human conduct.

Another model which utilizes stacked auto encoders to take in top level presentations from printed and visual info. The visual modality is encoded by means of vectors of variables got automatically from pictures. Make another substantial scale scientific categorization of 600 visual variables speaking to more than 500 ideas and 700K pictures. This dataset is utilized to prepare variable classifiers and coordinate their forecasts with content based distributional models of word meaning. Assess the model on its capacity to stimulate word same judgments and idea categorization. On the two works, this model yields a superior fit to behavioral information contrasted with baselines and related models which either depend on a single methodology.

In this article, presented a model, which draws components from connectionist, characteristic based, and distributional models to learn grounded meaning presentation. This model uses (stacked) auto encoders, a variation of multilayer neural systems, to learn high level meaning presentation by mapping words and pictures into a typical invisible space. The writing depicts a few effective ways to deal with multimodal getting the utilization of distinctive variations of profound systems and information sources including content, pictures, sound, and video. Not at all like most past work, this model processes presentation for singular words and is one of a kind in its utilization of variables as a methods for speaking to the visual methodology.

The point was to build up an arrangement of visual variables that are both perceptive and cognitively conceivable, i.e., people would utilize them to represent a solid idea. As a beginning stage, utilizes the visual variables from McRae's norming study. Variables catching other essential sensitive data (smell, sound), working properties, or all-encompassing data were not considered. For instance, is purple is a legitimate visual characteristic for an eggplant, while a vegetable isn't, since it can't be visualized. The methodology implemented here is to choose characteristics supported by the pictures. Also, by taking a look at the real pictures is first step and next step is take out the mistakes in McRae's standards. For instance, eight members mistakenly imagined that a catfish has scales. Thirdly, amid the explanation procedure, a standardized synonymous variables and characteristics that displayed immaterial varieties in meaning. At last, the point was to gather a visual variables for every idea which is constant over all individuals from a category. Second the Automatic Extraction of Visual Attributes, learned one classifier for each property gave that the credit had been appointed to no less than two ideas in the dataset. A key essential in depicting pictures by their characteristics is the accessibility of preparing information for learning quality classifiers. Beginning work on automatic variable extraction from pictures concentrated on basic shading and surface properties (e.g., blue, stripes) and demonstrated that these can be learned in a weekly administered setting from pictures returned by a web crawler when utilizing the variable as a question.

At last, a current surge of enthusiasm for the improvement of neural system structures that learn word portrayals comparing to vectors of enactment of system units. A noticeable distinction between these models and prior models sketched out above is that learning is performed on unlabeled content corpora utilizing different strategies from neural-organize language displaying. The continuous skip-gram demonstrate is outstanding among other known word-embedding models. It not just delivers valuable word presentation, it is additionally productive to prepare, and scales to billions of words.

VII. VISUAL CLASSIFIERS FROM UNSTRUCTURED TEXT

Developed by Mohamed Elhoseiny, Ahmed Elgammal, and Babak Saleh Basic Idea:

Generally people gain the new knowledge through visual concepts along with the verbal descriptions. Even children also learn new things by seeing the objects with their descriptions. This way of teaching will make us to think about the learning process should be designed to learn visual classifiers.

The principle question of this work is the way to use simply text depiction of visual classes with no training pictures to learn the visual classifiers for them. The creator propose and research on two pattern details in view of regression and domain exchange that anticipates a linear classifier. At that point, the writer proposed another optimization methodology that merges regression and knowledge transfer function with extra requirements to anticipate the linear classifier parameters.

(6)

ISSN(E): 2277-128X, ISSN(P): 2277-6451, pp. 14-20

in the visual space and content space, separately. At last the creator proposes a kernel work between unstructured content that expands on distributional semantics, which could be helpful for different applications. One of the principle challenges for scaling up this frameworks is the absence of commented on pictures for the current category.

Regularly there are few pictures accessible for training classifiers for the greater part of these classifications. This is reflected in the quantity of pictures per class accessible for training in many classification datasets, which, as pointed out in, demonstrates a Zipf dissemination. The issue of absence of training pictures turns out to be considerably more serious when target recognition issues inside a general classification, i.e., fine-grained arrangement, for instance constructing classifiers for various flying creature species or bloom sorts. An organized expectation regressor would be more reasonable since it would take in the relationship between the info and yield domain. Notwithstanding, even an organized forecast model will just take in the connection between the text and visual domain through the data accessible in the information yield sets (tk; ck).

Here the visual domain data is encapsulated in the pre-learned classifiers and prediction does not approach to the original data in the visual domain. In particular, an approach for learning cross domain transformation was introduced. In this work a regularized asymmetric transformation between points in two domains were learned.

The approach was applied to transfer learned categories between different data distributions, both in the visual domain. A specific characteristic related with other domain models, is that the source and target domains do not have parallel feature spaces or the same dimensionality.

VIII. CONCLUSION

We have been studied the different machine learning techniques executed for text classification. This investigation gave a review of probably the most principal design and procedures which are broadly utilized in the text area. Each methodology have their own remark in their applications.Boost.SH algorithms performs better than the other procedure implemented in multi view learning applications. AFSA achieves better results than existing cMFDR with less training data. Inconsistency detection can yields high performance when done using multilevel text categorization followed by conceptual reasoning. Scene text detection method by using multiple convolutional neural networks provides comparable precision on the bench mark datasets. The proposed kernelized version generated by multiple kernel learning provides better results. The model using stacked autoencoders provides better fit to behavioral data compared to baselines. The i-vector method system provides good classification accuracy.

REFERENCES

[1] C.L.Liu, W.H.Hsaio, C.H.Lee, T.H.Chang, T.H.Kuo, “Semi-Supervised Text Classification with Universum Learning”, IEEE Trans. on Cybernetics, Vol.46, Issue.2, P.462– 473, 2016.

[2] J.Peng, A.J.Aved, G.Seetharaman, K.Palaniappan, “Multiview Boosting With Information Propagation for Classification”, IEEE Trans. on Neural Networks and Learning Systems, 2017.

[3] F.Zhuang, P.Luo, C.Du, Q.He, Z.Shi, H.Xiong, “Triplex Transfer Learning: Exploiting Both Shared and Distinct Concepts for Text Classification” IEEE Trans. on Cybernetics, Vol.44, Issue.7, PP.1191–1203, 2014. [4] R.C.P.Fragoso, R.H.W.Pinheiro, G.D.C.Cavalcanti, “A method for automatic determination of the feature vector

size for text categorization”, 5th Brazilian Con. on Intelligent Systems, 2016.

[5] J.A.Otaibi, Z.Safi, A.Hassaine, F.Islam and A.jaoua, “Machine Learning and Conceptual Reasoning for Inconsistency Detection”, IEEE Access, 2017, Vol.5, PP.338–346.

[6] S.Baccianella, A.Esuli, F.Sebastiani, “Feature Selection for Ordinal Text Classification”, Neural Computation, Vol.26, Issue.3, Pages.557–591, 2014.

[7] B.Zhang, A.Marin, B.Hutchinson, M.Ostendorf, “Learning Phrase Patterns for Text Classification”, IEEE Transactions on Audio, Speech, and Language Processing, Vol.21, Issue.6, PP.1180–1189, 2013.

[8] J.Y.Jiang, R.J.Liou, S.J.Lee, “A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification”, IEEE Transactions on Knowledge and Data Engineering, Vol.23, Issue.3, PP.335 – 349, 2011.

[9] C.Silva, U.Lotric, B.Ribeiro, A.Dobnikar, “Distributed Text Classification With an Ensemble Kernel-Based Learning Approach”, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) Vol.40, Issue.3, PP.287 – 297, 2010.

[10] Y.Tang, & X.Wu, “Scene Text Detection and Segmentation based on Cascaded Convolution Neural Networks”, IEEE Transactions on Image Processing, 2017, Vol.26, Issue.3, PP.1509 – 1520.

(7)

ISSN(E): 2277-128X, ISSN(P): 2277-6451, pp. 14-20

[12] C.Silberer, V.Ferrari & M.Lapata, “Visually Grounded Meaning Representations”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, Vol.39, Issue.11, PP.2284 – 2297.