Analysis and discussion - Text classification Based on Machine Learning Methods

The experiment has 3 steps, except the first step ’preprocess text’, there are several choices of the second step ’word embedding’ and the third step ’classification models, in the whole experiment, there are 24 different combinations in total.

For the word embedding part, Doc2vec word embedding has the lowest accuracy which is only around 20%, it is too low to be accepted. Thus for short text, Doc2vec is not a good choice for word embedding. Tfidf method will a produce very sparse matrix which takes much disk space. Furthermore, if the vocabulary has a new word, it have to be trained again, and the dimen- sion will also change. Word2vec has the same drawback, a new word will

CHAPTER 4. EXPERIMENT 43

cause the retraining of the whole vocabulary. However, it can be avoided by assigning initial values to a new word, for example, all zeros. However, this method will impact the correlation of the new word with its context.

Compared with machine learning models such as Naive Bayes and SVM, all the deep learning models have higher accuracy. However, naive bayes is much less time consuming compared to SVM and other deep learning models, but it still has the accuracy around 70%.

For deep learning models, there are 2 input ways, which are pretrained word2vec embeddings, and the embedding layer needs to be trained. From the result we can see, It’s hard to conclude which is better, every method has its advantages.

MLP model has the simplest structure, but from the results we can see, the simplest model is not the worst. The structure of a model determines the model complexity, which means, the running time and space will also be affected.

2 layer GRU with pretrained word2vec embedding has the highest accuracy within all the models using the normal sized dataset. It is the same for half sized dataset and double sized dataset. It performs very well for the Sohu news dataset.

The limitation of the experiment is, it is hard to execute the experiment in all the Chinese news dataset, thus it is a hint for Chinese news classification, but strictly speaking, it is not the best for all the Chinese news. The dataset size, the length of news, the complexity of models will also have an impact on the result.

Chapter 5

Conclusions

With the development of machine learning and deep learning, text classification comes into a new generation, the accuracy is higher and the time consuming is lower. In this thesis, the process of text classification is introduced, which includes 3 parts, preprocess text, word embeddings and classification models. Chinese news is chosen to be the dataset, which is a collection of news from different channels from Sohu company .The preprocess part includes removing punctuation, English letters, numbers and stopwords. After that, the news will be segmented to a list of words. There are 4 ways to do word embedding which are doc2vec, word2vec, tfidf and embedding layer. In the classification models, 2 machine learning models and 8 deep learning models are taken to do classification. At the same time, There is a comparison between whether adding feature selection, whether using pretrained word embeddings, which also produce more combinations of the models. The ’2 layer GRU model with pretrained word2vec embeddings’ model gets the highest accuracy.

In this thesis, the hypothesis of whether the volume of the dataset will impact the accuracy is also discussed. The same models are used for half sized and double sized dataset from the same source. The results show the accuracy of half sized dataset is lower compared to normal sized dataset. On the contrary, double sized dataset gets higher accuracy in most of the models˜a The dataset is relatively small compared to the data size in a real company. From the time consuming, the accuracy, the space the matrix takes, the dataset, all the aspects will affect the final choices in the real industry. Text classification not only can be used in news classification as described in this thesis, but also in other areas such as spam detection and sentiment

CHAPTER 5. CONCLUSIONS 45

analysis. It will surely help people save time and easy to find what they need. The development of natural language processing and big data also have a good impact on text classification, which will make the classification faster and more accurate.

Bibliography

[1] H.P. Luhn. Key word–in–context index for technical literature (kwic index). American Documentation, pages 288–295, 1960.

[2] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, November 1975.

[3] Eva Gibaja and Sebasti´an Ventura. A tutorial on multilabel learning.

ACM Comput. Surv., 47(3):52:1–52:38, April 2015.

[4] Yoshua Bengio, Holger Schwenk, Jean-Sébastien Senécal, Fréderic Morin, and Jean-Luc Gauvain. Neural Probabilistic Language Models, pages 137–186. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. [5] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient

estimation of word representations in vector space. ICLR(International Conference on Learning Representations), 2013.

[6] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pages II–1188–II–1196. JMLR.org, 2014.

[7] Fabrizio Sebastiani. Machine learning in automated text categorization.

ACM Comput. Surv., 34(1):1–47, March 2002.

[8] Wissal Farsal, Samir Anter, and Mohammed Ramdani. Deep learning: An overview. In Proceedings of the 12th International Conference on Intelligent Systems: Theories and Applications, SITA’18, pages 38:1– 38:6, New York, NY, USA, 2018. ACM.

[9] T. Young, D. Hazarika, S. Poria, and E. Cambria. Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine, 13(3):55–75, Aug 2018.

BIBLIOGRAPHY 47

[10] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis 3rd Edition. Chapman and Hall/CRC, 2013.

[11] Eibe Frank and Remco R. Bouckaert. Naive bayes for text classification with unbalanced classes. In Johannes F¨urnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors, Knowledge Discovery in Databases: PKDD 2006, pages 503–510, Berlin, Heidelberg, 2006. Springer Berlin Heidel- berg.

[12] Fabrice Colas and Pavel Brazdil. Comparison of svm and some older classification algorithms in text classification tasks. In Max Bramer, editor, Artificial Intelligence in Theory and Practice, pages 169–178, Boston, MA, 2006. Springer US.

[13] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multi- class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, March 2002.

[14] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic op- timization. International Conference on Learning Representations, 12 2014.

[15] Yann Lecun, Leon Bottou, Y Bengio, and Patrick Haffner. Gradient- based learning applied to document recognition. Proceedings of the IEEE, 86:2278 – 2324, 12 1998.

[16] Kyunghyun Cho, Bart van Merri¨enboer, Caglar Gulcehre, Dzmitry Bah- danau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October 2014. Association for Computational Linguistics. [17] Sepp Hochreiter and J¨urgen Schmidhuber. Long short-term memory.

Neural Comput., 9(8):1735–1780, November 1997.

[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander- plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch- esnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

BIBLIOGRAPHY 48

[19] Wen Zhang, Taketoshi Yoshida, and Xijin Tang. A comparative study of tf*idf, lsi and multi-words for text classification. Expert Systems with Applications, 38(3):2758 – 2765, 2011.

[20] Yoon Kim. Convolutional neural networks for sentence classification.

In document Text classification Based on Machine Learning Methods (Page 42-48)