An improved Artificial Fish Swarm Optimization based Feature Selection in big data Sentimental Analysis Classification

ISSN: 2005-4238 IJAST 927

Dr. J.K. Kanimozhi¹, C. Suresh Kumar²

¹Research Department, ²Assistant Professor

¹,²Arts & Science College, Trichengode, Namakkal, Tamil Nadu, India.

Abstract

Big data is a term for immense volumes of data, which may be structured or unstructured. Because big data are generated from many different fields and sources, they are too complex to handle with conventional data processing methods. In big data processing, feature selection plays an important role in eliminating redundancy. Several feature selection methods, such as greedy feature selection with cat swarm optimization, are now available. However, greedy feature selection is difficult to verify and, even with an efficient algorithm, can produce the worst possible solution. To overcome this problem, we propose an improved artificial fish swarm optimization based feature selection method for big data sentiment analysis classification. Data pre-processing is the first step, followed by feature selection through Improved Artificial Fish Swarm Optimization (IAFSO).

Keywords: Big data, Classification, Features and Random Forest Classifier.

INTRODUCTION

Big data is defined as a collection of data so large that it is difficult to process with conventional computing systems and frameworks. Big data can be captured or produced at high speed in many real-time applications, yielding high volumes of both valuable and unsafe data [2]. High data volumes are now generated every day by pervasive communication, imaging and mobile devices such as cell phones, cameras and drones, by medical and e-commerce facilities, and by social networking [3]. The term big data increasingly refers to the challenges and advantages of gathering and processing data at large scale [4]. Big data has five characteristic features: volume, velocity, variety, veracity and value [5].

Volume relates to the size of the data to be analysed. The growth and use of data are measured in terms of volume.

Variety refers to the different types and formats of data used for analysis. Veracity determines the accuracy of the results of data analysis. Value concerns the benefit added by, and the quality of, data processing and interpretation. Such huge volumes of data are too difficult to analyse with conventional processing applications. Big data now comprises vast, complex, structured and unstructured data generated and collected from many fields and sources. Feature selection is the task of choosing the attributes that best characterise the problem at hand, and it is a critical phase in data processing [6]. The goal is to find a subset of the attributes (features) of the original data that preserves certain properties [7].

BIG DATA ANALYTICS

Big data analytics is the process by which vast collections of data (big data) are gathered, processed and analysed to find correlations and other useful information. It usually requires state-of-the-art tools and techniques for data storage, processing and analysis [8]. Big data analytics is important for extracting hidden knowledge from data in order to support business decisions [9][10].

BIG DATA SENTIMENTAL ANALYSIS

Sentiment analysis is defined as the process of identifying the sentiment or opinion expressed in a piece of data. Sentiment analysis of online social networking content is a tedious job, since it requires comprehensive knowledge of semantics and linguistics and of the regular and irregular rules of language use. Sentiment analysis is a multi-step process that involves feature selection, collection of sentiment data, sentiment classification and detection of sentiment polarity.

Feature selection plays an important role in sentiment analysis by identifying the key features and increasing system accuracy. Because of the volume, complexity and unstructured nature of Web-based social network content, it is difficult to automate sentiment recognition and classification. Feature selection is therefore important for evaluating features in big-data-based sentiment analysis.

CHALLENGES OF FEATURE SELECTION FOR BIG DATA ANALYTICS

The curse of dimensionality is a critical issue when data mining and machine learning algorithms are applied to high-dimensional data. It refers to the fact that data become sparse in high-dimensional space, which adversely affects algorithms designed for low-dimensional space. In addition, high-dimensional features require significantly more storage and computing resources. Feature selection is one kind of dimensionality reduction technique for managing high-dimensional data.
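One way to see why high-dimensional data is hard for algorithms designed for low-dimensional space is that distances between points tend to concentrate as the number of dimensions grows, so "near" and "far" neighbours become hard to tell apart. The following minimal sketch illustrates this on uniformly random points; the function name, the point counts and the choice of 2 versus 500 dimensions are illustrative assumptions, not part of the proposed method.

```python
import math
import random


def distance_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random
    query point to n_points uniformly random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# Hypothetical settings chosen only to make the effect visible.
low = distance_contrast(n_points=200, n_dims=2)
high = distance_contrast(n_points=200, n_dims=500)
print(f"relative contrast in 2 dimensions:   {low:.2f}")
print(f"relative contrast in 500 dimensions: {high:.2f}")
```

The relative contrast is much smaller in the high-dimensional case, which is one reason why reducing the number of features before learning can help.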

In feature selection, a relevant subset of features is chosen directly for model building. Because feature selection retains a subset of the original features, it preserves the physical meaning of the original feature set and gives the model better interpretability and readability. It is therefore widely used in real-world applications such as text mining and gene analysis, where it retains the useful features by eliminating redundant and irrelevant ones. Removing redundant and irrelevant features reduces computation and storage costs without significant loss of information or degradation of learning performance.

LITERATURE REVIEW

Ramírez-Gallego et al. [11] proposed an information theory-based feature selection framework for big data built on Apache Spark. Their goal was to show that standard feature selection methods can be parallelised on big data platforms such as Apache Spark, improving both performance and accuracy. They validated the framework on real-world data sets with up to O(10^7) features and instances. Their approach was efficient on data that is big in both dimensions, handling data sets with an extremely large number of instances as well as data sets with an extremely large number of features.

Ahmad et al. [12] introduced a system architecture that uses the Artificial Bee Colony (ABC) algorithm to select features from big data in the social Internet of Things. Their solution is based on a four-layer architecture that aggregates data, removes inaccurate or redundant data and selects features, and it runs on the high-performance Hadoop computing platform. First, features are extracted from the IoT big data in a hierarchical manner. A Kalman filter is then used to improve the efficiency of the big data analysis. The ABC algorithm is mainly used to select features from the big data.

Fong et al. [13] introduced accelerated swarm search feature selection for data stream mining of big data. The feature selection is designed specifically for data streaming on the fly; by using Accelerated Particle Swarm Optimization (APSO) it achieves better accuracy within a reasonable processing time. Their research is aimed at anyone interested in developing data stream mining software for big data analytics with a lightweight feature selection method that combines Swarm Search and APSO.

Barbu et al. [14] introduced Feature Selection with Annealing (FSA) for computer vision and big data learning. They combined regularisation techniques with a sequential algorithm design to obtain a feature selection scheme that is suited to big data learning and supported by theory. The proposed selection method can handle large data sets (possibly noisy and inaccurate) without being online. Experiments on extensive synthetic and real data (including facet recognition and action segmentation) offer empirical evidence that FSA performs as well as or better than current penalisation and improvement approaches while running considerably more efficiently on large data sets.

Ramírez-Gallego et al. [15] proposed fast-mRMR, a Fast Minimum Redundancy Maximum Relevance algorithm for high-dimensional big data. Many machine learning (ML) techniques cannot handle a large number of input features effectively, and some of these algorithms are also affected by large numbers of samples. In this new scenario, where data sets are huge in both number of samples and number of dimensions, existing learning techniques must be adapted. Their method was evaluated on data sets from the popular libSVM data set repository, with sizes of up to over 29 million.
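The minimum redundancy maximum relevance idea can be illustrated with a small greedy sketch. This is not the fast-mRMR implementation of [15]; it is a plain, unoptimised version of the criterion in its difference form (relevance minus mean redundancy) on discrete features, and the toy feature names and values below are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr(features, target, k):
    """Greedily pick k features, each time maximising relevance to the
    target minus the mean redundancy with the features already chosen."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected = []
    while len(selected) < k:
        best_name, best_score = None, -math.inf
        for name, col in features.items():
            if name in selected:
                continue
            redundancy = 0.0
            if selected:
                redundancy = sum(mutual_information(col, features[s])
                                 for s in selected) / len(selected)
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Hypothetical toy data: "near_dup" is almost a copy of "strong".
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "strong":   [0, 0, 0, 0, 1, 1, 1, 0],
    "near_dup": [0, 0, 0, 1, 1, 1, 1, 0],
    "weak":     [0, 1, 0, 1, 0, 1, 1, 1],
}
print(mrmr(features, target, k=2))
```

On this toy data, ranking by relevance alone would place "near_dup" second, whereas the mRMR criterion prefers "weak" because it adds information that "strong" does not already carry.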

Lei et al. [16] implemented an intelligent fault diagnosis method based on unsupervised feature learning. In conventional intelligent diagnosis approaches, features are selected manually on the basis of prior knowledge and diagnostic expertise. Such processes draw on human ingenuity but are time-consuming and labour-intensive.

Inspired by unsupervised feature learning, in which artificial intelligence techniques learn features from raw data, they introduced a two-stage learning approach for machine diagnosis. In the first stage, sparse filtering, an unsupervised two-layer neural network, learns features directly from mechanical vibration signals. In the second stage, softmax regression classifies health conditions according to the learned features. The approach was validated on a motor bearing data set and a locomotive bearing data set. The results show that the method achieves fairly high diagnostic accuracy and compares well with existing approaches on the motor bearing data set.

Yamada et al. [17] established the first nonlinear feature selection method for supervised learning problems that could scale to large, high-dimensional biological data. Their approach uses the Hilbert-Schmidt Independence Criterion Lasso (HSIC Lasso) to handle millions of features and hundreds of thousands of samples. The approach promises greater predictive power and better interpretability by finding an optimal subset of predictive features with minimum redundancy. Using just 20 out of 1 million features, they achieve high accuracy with a dimensionality reduction of 99.998 per cent. Such a dramatic reduction in the number of features could contribute to the widespread use of sophisticated prediction models in mobile health applications.

For large, high-dimensional data sets, Gong et al. [18] implemented BSSReduce, an O(|U|) incremental feature selection method. They presented a new rationale for feature selection based on bijective soft sets. The computation time of the O(|U|) selection technique grows only linearly with the number of instances. Experiments were performed on UCI data sets and on huge, high-dimensional data sets comprising four million instances and over three million features. Their methodology can therefore be used on huge, high-dimensional data sets that traditional methods cannot handle.

RESEARCH METHODOLOGY

PROBLEM DESCRIPTION

In 2018, Alarifi et al. [19] introduced greedy feature selection with cat swarm optimization-based long short-term memory neural networks. However, the correctness of a greedy feature selection approach is hard to establish: even when the algorithm is correct, that is difficult to prove, and a greedy algorithm may sometimes even produce the worst possible solution. Neural network-based optimization is also complex to understand and does not always produce good accuracy.

PROPOSED METHODOLOGY

To overcome these issues, we propose Improved Artificial Fish Swarm Optimisation (IAFSO) for feature selection, followed by a Random Forest (RF) classifier. First, noise is removed from the sentiment data by pre-processing. Features are then selected from the pre-processed data using IAFSO.
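The pre-processing steps are not specified in this paper. As a minimal sketch, assuming the sentiment data are short social media posts, noise removal could look like the following; the stop-word list, the regular expressions and the function name are illustrative assumptions only.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it", "this"}


def preprocess(text):
    """Lower-case a post, strip URLs, mentions and punctuation,
    and drop stop words. Returns the remaining tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    return [tok for tok in text.split() if tok not in STOP_WORDS]


tokens = preprocess("@bob This phone is AMAZING!!! see https://example.com #happy")
print(tokens)
```

The resulting tokens would then be turned into candidate features for the selection stage.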

AFSO [20] is the basic algorithm used to select features. By modifying its search phase, we propose IAFSO to reduce the time needed to find the best solution. In the selection stage, candidate solutions are generated by crossover operations. After feature selection, the data are classified with the RF [21] classifier.
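The paper does not give the operators of IAFSO in detail, so the following is only a rough sketch of how a crossover-based fish swarm search over binary feature masks might be organised. The uniform crossover with the current best fish, the single bit flip, the swarm size, the iteration count and the toy fitness function are all assumptions made for illustration; in a real pipeline the fitness could, for example, be the accuracy of a classifier trained on the selected features.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each bit is taken from one of the two parents."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def fish_swarm_select(n_features, fitness, n_fish=10, n_iter=30, seed=0):
    """Search for a binary feature mask that maximises `fitness`."""
    rng = random.Random(seed)
    swarm = [[rng.randint(0, 1) for _ in range(n_features)]
             for _ in range(n_fish)]
    best = max(swarm, key=fitness)
    for _ in range(n_iter):
        for i, fish in enumerate(swarm):
            # Follow step (assumed): recombine the fish with the best mask so far.
            candidate = crossover(fish, best, rng)
            # Prey step (assumed): flip one random bit to keep exploring.
            j = rng.randrange(n_features)
            candidate[j] = 1 - candidate[j]
            # Move only if the candidate is better than the current position.
            if fitness(candidate) > fitness(fish):
                swarm[i] = candidate
        best = max(swarm + [best], key=fitness)
    return best


# Hypothetical toy problem: features 0, 3 and 5 are informative.
RELEVANT = {0, 3, 5}


def toy_fitness(mask):
    """Reward informative features and penalise every extra feature."""
    hits = sum(bit for i, bit in enumerate(mask) if i in RELEVANT)
    extras = sum(bit for i, bit in enumerate(mask) if i not in RELEVANT)
    return hits - 0.5 * extras


best_mask = fish_swarm_select(n_features=8, fitness=toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

Because a fish moves only when the candidate improves its fitness and the best mask is carried over between iterations, the returned mask is never worse than the best mask in the initial swarm.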

DISCUSSION

Feature selection and classification in the proposed work are simulated using open-source Python. We assess the performance of our work in comparison with existing systems. The performance measures determined include accuracy, precision, recall and efficiency.
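The classification measures named above can be computed from the counts of a binary confusion matrix. The sketch below also reports precision, the usual companion of recall; the label vectors are made-up illustrative values, not results from this work.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when no positives are predicted/present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels: 1 = positive sentiment, 0 = negative sentiment.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here the toy predictions contain 3 true positives, 2 false positives, 1 false negative and 2 true negatives, giving an accuracy of 0.625, a precision of 0.6 and a recall of 0.75.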

CONCLUSION

The use of social media such as WhatsApp and Twitter is rapidly increasing the volume of data everywhere. People want to record the moments of their lives and share them with family, friends and others. In this situation it is very difficult to maintain, analyse and, especially, classify the data, which calls for the selection of a specific subset of features or variables. We therefore propose an improved artificial fish swarm optimization based feature selection for big data sentiment analysis classification. After feature selection, the data are classified with a Random Forest classifier. Finally, the results are evaluated by comparison with the performance of existing systems.

REFERENCES

[1] Lopez D, Gunasekaran M (2015). Assessment of Vaccination Strategies Using Fuzzy Multi-criteria Decision Making. In Proceedings of the Fifth International Conference on Fuzzy and Neuro Computing (FANCCO-2015) (pp. 195–208). Springer.

[2] V. Snášel, J. Nowaková, F. Xhafa and L. Barolli, "Geometrical and topological approaches to Big Data", Future Generation Computer Systems, vol. 67, pp. 286-296, 2017.

[3] K. Slavakis, G. Giannakis and G. Mateos, "Modeling and Optimization for Big Data Analytics: (Statistical) learning tools for our era of data deluge", IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 18-31, 2014.

[4] A. Katal, M. Wazid and R. Goudar, "Big data: Issues, challenges, tools and Good practices", 2013 Sixth International Conference on Contemporary Computing (IC3), 2013.

[5] H. Chen, R.H.L. Chiang, V.C. Storey, Business intelligence and analytics: From big data to big impact, MIS Q. 36 (4) (2012) 1165–1188.

[6] Roman W Swiniarski and Andrzej Skowron. Rough set methods in feature selection and recognition. Pattern recognition letters, 24(6):833–849, 2003.

[7] Fachao Li, Zan Zhang, and Chenxia Jin. Feature selection with partition differentiation entropy for large-scale data sets. Information Sciences, 329:690–700, 2016.

[8] G. Manogaran, D. Lopez, C. Thota, K. Abbas, S. Pyne and R. Sundarasekar, "Big Data Analytics in Healthcare Internet of Things", Understanding Complex Systems, pp. 263-284, 2017.

[9] P. Colombo and E. Ferrari, "Privacy Aware Access Control for Big Data: A Research Roadmap", Big Data Research, vol. 2, no. 4, pp. 145-154, 2015.

[10] B. Hazen, J. Skipper, J. Ezell and C. Boone, "Big data and predictive analytics for supply chain sustainability: A theory-driven research agenda", Computers & Industrial Engineering, vol. 101, pp. 592-598, 2016.

[11] S. Ramirez-Gallego, H. Mourino-Talin, D. Martinez-Rego, V. Bolon-Canedo, J. Benitez, A. Alonso-Betanzos and F. Herrera, "An Information Theory-Based Feature Selection Framework for Big Data Under Apache Spark", IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, no. 9, pp. 1441-1453, 2018.

[12] A. Ahmad, M. Khan, A. Paul, S. Din, M. Rathore, G. Jeon and G. Choi, "Toward modeling and optimization of features selection in Big Data based social Internet of Things", Future Generation Computer Systems, vol. 82, pp. 715-726, 2018.

[13] S. Fong, R. Wong and A. Vasilakos, "Accelerated PSO Swarm Search Feature Selection for Data Stream Mining Big Data", IEEE Transactions on Services Computing, pp. 1-1, 2015.

[14] A. Barbu, Y. She, L. Ding and G. Gramajo, "Feature Selection with Annealing for Computer Vision and Big Data Learning", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 2, pp. 272-286, 2017.

[15] S. Ramírez-Gallego, I. Lastra, D. Martínez-Rego, V. Bolón-Canedo, J. Benítez, F. Herrera and A. Alonso-Betanzos, "Fast-mRMR: Fast Minimum Redundancy Maximum Relevance Algorithm for High-Dimensional Big Data", International Journal of Intelligent Systems, vol. 32, no. 2, pp. 134-152, 2016.

[16] Y. Lei, F. Jia, J. Lin, S. Xing and S. Ding, "An Intelligent Fault Diagnosis Method Using Unsupervised Feature Learning Towards Mechanical Big Data", IEEE Transactions on Industrial Electronics, vol. 63, no. 5, pp. 3137-3147, 2016.

[17] M. Yamada, J. Tang, J. Lugo-Martinez, E. Hodzic, R. Shrestha, A. Saha, H. Ouyang, D. Yin, H. Mamitsuka, C. Sahinalp, P. Radivojac, F. Menczer and Y. Chang, "Ultra High-Dimensional Nonlinear Feature Selection for Big Biological Data", IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 7, pp. 1352-1365, 2018.

[18] K. Gong, Y. Wang, M. Xu and Z. Xiao, "BSSReduce an O(|U|) incremental feature selection approach for large-scale and high dimensional data", IEEE Transactions on Fuzzy Systems, pp. 1-1, 2018.

[19] A. Alarifi, A. Tolba, Z. Al-Makhadmeh and W. Said, "A big data approach to sentiment analysis using greedy feature selection with cat swarm optimization-based long short-term memory neural networks", The Journal of Supercomputing, 2018.

[20] R. Manikandan and A. Kalpana, "Feature selection using fish swarm optimization in big data", Cluster Computing, 2017.

[21] S. del Río, V. López, J. Benítez and F. Herrera, "On the use of MapReduce for imbalanced big data using Random Forest", Information Sciences, vol. 285, pp. 112-137, 2014.