Multi-Terms Association Approach to Generate Patterns for Efficient Data Categorization for Web Information Mining


¹Sunil Kumar Thota, ²Dr. Tummala Sita Mahalakshmi

¹Research Scholar, CSE Department, Gitam University, Vishakapatnam

²Professor, CSE Department, Gitam University, Vishakapatnam

Abstract

The demand for classifying heterogeneous data such as text, image, music, movie, and medical datasets is growing in real-world applications. Learning the classes of a single object that is associated with multiple term sets is a key difficulty for multi-term datasets. Existing methods learn from the feature discrimination observed among similar term sets, but this discrimination measures the deviation of class values rather than the associations between terms. Such methods may not be suitable for classification, because each term carries its own specific features.

This paper proposes a Multi-Term Association (MTA) approach that describes data through term features and uses association rules to discover term correlations for data classification. MTA aims to find accurate classes for data objects using a "knowledge class" structure, and then uses binary associations between terms to build term patterns for classifying multi-term datasets. A comprehensive experimental analysis was performed on a set of MULAN datasets to compare the efficiency of MTA with other recent classifiers. The results indicate improvements across the different case studies.

Keywords: Multi-Terms, Association Rules, Pattern, Categorization, Information mining.

___________________________________________________________________________________________

1. INTRODUCTION

In distributed practical applications, heterogeneous data are common and can be treated as multi-term objects. These data may come from different fields, such as "education", "sports", "multimedia", "politics", or "medicine" [1], [2], [3]. The content of such data may carry multiple meanings and, because of its multiple terms, may belong to multiple categories. High-dimensional data of this kind is an obstacle to accurate classification in information processing [4], [5] and machine learning [6].

However, in all cases, the cause of the problem has been found to be multifaceted. We address this problem by learning terms and classifying data objects with association rules generated from multi-term datasets, with the aim of improving classification accuracy.

In data learning and mining research, building an effective classifier for a heterogeneous dataset with multiple annotations is a difficult task. In the literature, most work has focused on feature selection [7], [8], feature reduction [9], [10], and associative classification [11], [12] to build classifiers. A classifier predicts the class of a data object after being trained on collected data. In the multi-term case, however, classifier construction has not been explored sufficiently, which has a great impact on the prediction of class values, and research on this issue remains scarce.

Earlier proposals [14] used feature reduction or selection methods [8], [9], [13] for multi-term classification. Most of them analyze the correlations between features and discard those features that do not contribute useful knowledge for class prediction. The classifier is then trained on the reduced or selected features to improve classification. The difficulty lies in objects with multiple terms and in how to transform such objects so that they suit improved classification. Even though these selection methods work well for some classifiers in multi-term learning [15], they may not be optimal for each class term, because each class term has its own specific characteristics. For example, the words in a set of documents may relate to entertainment, politics, sports, stocks, and so on [16].

The fundamental aim of this paper is to analyze and propose a new term-based multi-term learning method for data objects. The method uses an association rule algorithm for ad hoc classification and correlates terms using term density calculations and association algorithms. In the first stage, it identifies an Accurate Class (AC), the class that best fits a data object among the candidate classes. In the second stage, it applies an association mechanism to the multi-term attributes, learning binary associations between the attribute terms to build a classification pattern. The generated pattern is then used to classify multi-term datasets during the classification process.

The rest of this paper is organized as follows. Section 2 reviews work related to multi-term classification. Section 3 presents the proposed multi-term association method, including the problem description and multi-term association learning. Section 4 introduces the datasets and evaluation measures. Section 5 presents the experimental evaluation on multi-term datasets. Finally, Section 6 concludes the paper.

2. RELATED WORKS

Accurate data classification is central to providing the necessary information in data mining [17], [18], [19], [20]. A classifier works mainly by identifying the characteristic features of an object and assigning it to one of a set of learned classes [21]. Consider a collection of datasets consisting of a number of data records, where each record consists of a collection of attributes, and these attributes are used for class identification. The classification process classifies unidentified data objects using the constructed class knowledge. The main function of the classifier is to predict the class of unidentified data objects accurately enough to meet real-time needs in various applications.

Supervised learning has been used effectively in several learning tasks to recognize unidentified objects. However, it does not fit current real-world data objects well, because a data object may carry multiple semantic meanings. For example, the news content of a text document might relate to "politics", "sports", "economy", "education", and so on, which produces a set of multi-term features that complicates traditional supervised learning systems.

G. Tsoumakas and I. Katakis [22] identified the problem of multi-value classification and described solutions based on data transformation and on adapting the classification algorithm. Data transformation converts a multi-value problem into one or more single-value problems. This approach can use an off-the-shelf single-value classifier, which reduces the requirements on the classifier. Alternatively, the classification algorithm is modified to handle multi-value classification directly for a given field in specific situations, at the cost of high computational complexity.

K. Dembczynski et al. [23] studied the implications of value dependence in multi-value classification. They distinguish conditional and unconditional dependence, with a primary focus on value dependencies. Multi-value classification through unconditional dependency modeling was shown to perform well, but it performs less well in the presence of conditional dependencies. X. Kong et al. [24] also explore multi-value classification based on different types of dependencies between objects and their values, in a method known as PIPL. The proposal focuses primarily on heterogeneous information to facilitate classification. Their evaluation shows improved performance, but the method is limited to heterogeneous network datasets.

Charte et al. [17] propose a classification technique for multi-value data objects. It aims to address the usual problems in categorizing multi-value data with many attribute values. Features are identified by transforming the data and applying association rules that capture value dependencies. These value dependencies determine the feature choice in the multi-value classification algorithm. The approach can be successful if there is linear variance in the data objects from which value dependencies can be discovered, but it can produce inaccurate information for highly dispersed data distributions in multi-value data objects.

M. Zhang and L. Wu [25] aim to solve multi-value learning problems through feature selection. Their strategy learns discriminative features in order to distinguish the different class values. The proposed algorithm, named LIFT, performs cluster analysis on the positive and negative instances of each value to build value-specific features. Classification rules for training and testing are then derived from the clustering results. Although the approach shows promise for learning multi-value assignments, the importance of individual features relative to one another still needs to be examined in order to achieve further improvements.

Based on the review above, we recognize the importance of multiple values in classification, and of feature choice for classification accuracy. However, learning the most discriminative features for classification is a difficult problem. Building on the reviewed work and its limitations, we propose a new method for categorizing multi-term datasets, the MTA approach, which is based on association rules over multi-term datasets. The learning algorithm generalizes the choice of feature dependencies according to the needs and requirements of the domain. The details of the system are discussed in the next section.


3. PROPOSED MULTI-TERMS ASSOCIATION (MTA) APPROACH

3.1. Problem

Traditional learning systems are mostly studied as supervised machine learning systems. In these systems, data objects are associated with terms, and the learning system is trained on the labeled dataset to learn the properties of the attributes that will be used for classification, as shown in Fig. 1.

Fig.1: Supervised Learning Mechanism

This learning methodology fits single-term objects well, but complexity arises when an object has multiple terms. The literature contains improvements that adapt traditional supervised learning to multi-term data objects [5], [8], [15]. However, most of the proposed solutions rely on feature dependence or relate terms by calculating their frequencies, which may not apply to domains where this kind of term information is not available. In several cases, the dependencies and basic relationships between terms are determined by means of an "association rules algorithm", but such methods fail to handle multi-variable datasets and their conditions across different domains. We aim to create a classifier based on the new MTA approach that can work on multi-domain, multi-term datasets and provide accurate and fast classification.

3.2. Multi-Terms Association Learning

The classification approach relies on the accuracy of feature selection and term specification. When more than one term appears in the data of a field, the terms exhibit various levels of association with one another. This "correlation learning" can be extremely useful for classifying multi-term data objects. We propose a two-phase learning scheme: the first phase predicts an accurate class (AC) that is most appropriate for the instance, and the second phase finds other terms that support the AC in order to build useful classification patterns for categorizing multi-term datasets.

Suppose that the training set D is made up of object instances, "D = {d₁, . . . , d_n}", each described by a vector of k terms, "V = {m₁, . . . , m_k}". The first task of the MTA learning system is to locate the AC using the vector V. To do this, we build a "Knowledge Class Table (KCT)", which lists the terms related to each base category class of a domain, as shown in Table 2.

Table-2: Knowledge Class Table (KCT)

To find the AC for an instance, we compute the density of its associated terms with respect to the KCT. The term density value is denoted TDV and is estimated using Eq. 1. TDV ranges from 0 to 1, and the higher the value, the closer the instance is to the class.


(1)

where k is the number of terms, and KC is the knowledge class values vector.

The method for finding the AC of an instance using the KCT and TDV values is given in Algorithm 1.
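The AC selection step can be pictured with the short sketch below. It is only an illustration under stated assumptions: TDV is taken here to be the fraction of an instance's k terms that appear among a class's terms in the KCT, which is one reading consistent with a value between 0 and 1 and may differ from Eq. 1. The KCT contents, the class names, and the function names `term_density` and `accurate_class` are hypothetical.

```python
def term_density(instance_terms, class_terms):
    """Assumed TDV: share of the instance's k terms found in one KCT class entry."""
    k = len(instance_terms)
    if k == 0:
        return 0.0
    matched = sum(1 for term in instance_terms if term in class_terms)
    return matched / k


def accurate_class(instance_terms, kct):
    """Return (class_name, tdv) for the KCT class with the highest assumed TDV.

    Ties are resolved in favour of the class listed first in the table.
    """
    scores = {name: term_density(instance_terms, terms) for name, terms in kct.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical KCT: class name -> terms associated with that class.
kct = {
    "sports": {"match", "team", "score", "league"},
    "politics": {"election", "party", "vote", "minister"},
}
instance = ["team", "score", "vote", "stadium"]
print(accurate_class(instance, kct))
```

In this toy case two of the four instance terms belong to "sports" and one to "politics", so "sports" is returned as the AC with a density of 0.5.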

Selecting a single category for a multi-term instance results in the loss of basic information [5], [14]. To overcome this problem, AC learning is extended to learn the additional terms that form patterns through association rules, which reduces the Hamming loss in categorization. Consider an example where "V = {v₁, v₂, v₃, v₄, v₅}" and "D = {(1,0,0,1,0), (0,1,0,1,0), (1,0,0,1,1), (1,1,0,0,0), (0,1,0,0,1)}".


Fig.-2: Multiple-terms Generating from Instance Term data

Suppose the training dataset is "D = { (d₁,l₁), (d₂,l₂), . . . , (d_n,l_k)}", where "d_i ϵ D, v_k ⊆ V". To find the multiple terms that are relevant to classification accuracy, we consider the binary term values of each instance.

Fig. 2 shows an example of the association process that finds multiple terms and creates the pattern for classification. The learning mechanism scans all the binary values of D to build a list of candidate term sets that meet the minimum support.

Here, it uses an absolute "support count of 2", so the corresponding minimum support is "40% (2/5)".

The collection "C1" contains the items that meet the minimum support requirement; other items are removed. To identify the most frequent and relevant terms, "C1" is joined with itself, "C1⋈C1", to produce "C2", which consists of 2-item sets. The iteration continues until the sets of multiple terms that satisfy the minimum support have been found. The term sets finally obtained are considered the most appropriate and relevant. At this stage, the classification rules are generated using the AC class, C, and the multiple terms.
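The level-wise search described above can be sketched on the example D with a support count of 2. This is a minimal Apriori-style illustration, not the authors' implementation: the function names are illustrative, the join simply unions pairs of frequent sets, and support is recounted for every candidate instead of applying a separate pruning step.

```python
from itertools import combinations

# Example data from the text: five instances over terms v1..v5.
TERMS = ["v1", "v2", "v3", "v4", "v5"]
D = [
    (1, 0, 0, 1, 0),
    (0, 1, 0, 1, 0),
    (1, 0, 0, 1, 1),
    (1, 1, 0, 0, 0),
    (0, 1, 0, 0, 1),
]
MIN_SUPPORT_COUNT = 2  # 2 of 5 instances, i.e. 40%


def to_transactions(rows, terms):
    """Turn each binary row into the set of terms it contains."""
    return [frozenset(t for t, bit in zip(terms, row) if bit) for row in rows]


def support_count(itemset, transactions):
    """Number of transactions that contain every term in itemset."""
    return sum(1 for tx in transactions if itemset <= tx)


def frequent_term_sets(rows, terms, min_count):
    """Level-wise search; returns {set_size: {term_set: support_count}}."""
    transactions = to_transactions(rows, terms)
    # C1: single terms that meet the minimum support.
    current = {}
    for term in terms:
        single = frozenset([term])
        count = support_count(single, transactions)
        if count >= min_count:
            current[single] = count
    levels = {}
    size = 1
    while current:
        levels[size] = current
        size += 1
        # Join step: union pairs of frequent sets into candidates one term larger.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == size}
        current = {}
        for candidate in candidates:
            count = support_count(candidate, transactions)
            if count >= min_count:
                current[candidate] = count
    return levels


levels = frequent_term_sets(D, TERMS, MIN_SUPPORT_COUNT)
for size, sets in levels.items():
    for term_set, count in sorted(sets.items(), key=lambda kv: sorted(kv[0])):
        print(size, sorted(term_set), count)
```

On this data, v1, v2, and v4 each occur in three instances and v5 in two, so all four survive as single terms while v3 is dropped. The only pair that reaches a support count of 2 is {v1, v4}, and no 3-term set qualifies.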

Table-3: Rules obtain Using Multi-Term Association

4. DATASETS AND EVALUATION MEASURES

To evaluate the effectiveness of the proposal, we calculated popular multi-term classification measures proposed by Tsoumakas et al. [26], including Hamming loss (HL) and accuracy. As shown in Table 4, the datasets used for the analysis were downloaded from the MULAN [27] data repository.

4.1. Multi-Terms Datasets

The multi-term classification problem appears in various real-world situations and applications. The datasets in the experimental setup cover three major application domains where multi-term data is often observed: "text classification", "multimedia classification", and "bioinformatics". All datasets were retrieved from the "MULAN data repository" [27], as shown in Table 4, which lists each dataset's domain, number of instances, attributes, and values, and its L_Card.

L_Card is the average number of values (terms) per instance in a dataset. The L_Card [24], [25] for each dataset D = "{ (d_n,V_k)| 1≤ n ≤ k}" is given by,

(2)
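As a small illustration, the sketch below computes L_Card under the definition commonly used in the multi-label literature, namely the mean number of terms attached to an instance. Whether Eq. 2 is written in exactly this form is an assumption, and the function name and sample data are hypothetical.

```python
def label_cardinality(term_sets):
    """Mean number of terms per instance (0.0 for an empty dataset)."""
    if not term_sets:
        return 0.0
    return sum(len(terms) for terms in term_sets) / len(term_sets)


# Hypothetical dataset: the term set attached to each of three instances.
sample = [{"a", "b"}, {"a"}, {"b", "c", "d"}]
print(label_cardinality(sample))
```

Here the three instances carry 2, 1, and 3 terms, so the cardinality is 6 / 3 = 2.0.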

Table-4: Datasets Used for Experiment Evaluation with L_Card

4.2. Evaluation Measures

1). Hamming Loss (HL): This is the most commonly used measure in multi-term classification. It evaluates how often an instance-term pair is misclassified, that is, how often a relevant term is missed or an irrelevant term is predicted. When HL = 0, the performance is perfect.

(3)

where 𝛿 denotes the symmetric difference between the actual and predicted term sets of an instance, N is the number of test instances, and V is the set of class values that may be related to the dataset.
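The measure can be illustrated with the sketch below, which follows the usual definition of Hamming loss in the multi-label literature: the size of the symmetric difference between the actual and predicted term sets, averaged over the N test instances and divided by the number of possible terms. That Eq. 3 has exactly this form is an assumption, and the function and variable names are illustrative.

```python
def hamming_loss(true_sets, predicted_sets, num_terms):
    """Mean symmetric-difference size, normalised by the number of possible terms."""
    if len(true_sets) != len(predicted_sets):
        raise ValueError("true_sets and predicted_sets must have the same length")
    if not true_sets or num_terms <= 0:
        raise ValueError("need at least one instance and one possible term")
    # set(t) ^ set(p) holds the terms that are missed or wrongly predicted.
    errors = sum(len(set(t) ^ set(p)) for t, p in zip(true_sets, predicted_sets))
    return errors / (len(true_sets) * num_terms)


# Hypothetical test set with two instances and five possible terms.
true_sets = [{"v1", "v4"}, {"v2"}]
predicted_sets = [{"v1"}, {"v2", "v5"}]
hl = hamming_loss(true_sets, predicted_sets, num_terms=5)
print(hl)
```

The first prediction misses v4 and the second adds v5, giving two errors over 2 × 5 instance-term pairs, so the loss is 0.2.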

2). Accuracy (A): It measures the percentage of correctly identified terms in a given dataset. Accuracy is computed using Eq. 4, given below.

(4)

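3). Precision: Precision is the ratio of the "number of true positives" to the sum of the "number of true positives" and the "number of false positives".

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{5}$$

where TP is the number of true positives and FP is the number of false positives.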

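4). Recall: Recall is the ratio of the "number of true positives" to the sum of the "number of true positives" and the "number of false negatives".

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{6}$$

where TP is the number of true positives and FN is the number of false negatives.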
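The two ratios above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, the counts are made up, and returning 0.0 when the denominator is zero is a convention chosen here, not something the paper specifies.

```python
def precision(true_positives, false_positives):
    """TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    denominator = true_positives + false_positives
    return true_positives / denominator if denominator else 0.0


def recall(true_positives, false_negatives):
    """TP / (TP + FN); returns 0.0 when there are no actual positives."""
    denominator = true_positives + false_negatives
    return true_positives / denominator if denominator else 0.0


# Hypothetical counts: 8 correct terms, 2 spurious terms, 4 missed terms.
print(precision(8, 2))
print(recall(8, 4))
```

With these counts, precision is 8 / 10 and recall is 8 / 12, which shows how a classifier can predict few spurious terms and still miss a third of the relevant ones.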

4.3. Algorithm Evaluated

We use the proposed multi-term association (MTA) method to classify multi-term data. The multi-term classification schemes considered are "Binary Relevance (BR)", "Label Powerset (LP)", "Calibrated Label Ranking (CLR)", and "Random k-Labelsets (RAkEL)" [22], [28]. These classification methods are applied with the proposed approach, and the results of the model are compared with traditional multi-term learning.

Experiments were performed on the MULAN datasets using Weka with 10-fold cross-validation.

5. EXPERIMENTAL EVALUATION

This section presents the results of the evaluation experiments. We first learned the AC using the KCT, and then discovered multiple terms using term associations from the dataset instances. The knowledge learned by the MTA classifier is compared with that of conventional multi-term classifiers. Table 5 lists the outcomes obtained for each dataset in Table 4.

Table-5: Number of MTA Pairs Identified for the Classification

The pairs identified in Table 5 are then used for classification by traditional multi-term classifiers. The evaluation results are compared in terms of Hamming loss (HL) in Figure 3 and accuracy in Figure 4.


Fig.3: Classifier HL Performance (The Lower The Better)

Fig.4: Classifier Accuracy Performance (The Higher The Better)

Based on the HL assessment, we found improvement in one-third of cases. In many cases, the difference in either direction is small, and the significance of such small differences is doubtful. Where improvement does occur, however, it is considerable. Similarly, a clear improvement in accuracy can be observed.

Fig-5: Classifiers Precision Comparison


Fig-6: Classifiers Recall Comparison

Figures 5 and 6 show the precision and recall of the classifiers for each dataset. By taking the knowledge class structure into account, the proposed MTA classifier achieves better precision and recall. Compared with the other classifiers, it shows improved performance on every dataset, although RAkEL performs comparably on the scene dataset. This makes MTA well suited to the classification of multi-term datasets for future application needs.

6. CONCLUSION

In this paper, a Multi-Terms Association (MTA) method that uses associations between multiple terms is proposed. The learning procedure first determines the accurate class (AC) that best matches an instance, and in the second stage finds other AC-supporting terms to construct a classification model for diverse classification queries. To learn the AC of an instance, the instance is compared with the Knowledge Class Table (KCT) to calculate the related term density. The contributions of this proposal can be applied to learning on different multi-term datasets. The experimental evaluation shows that the method is a promising way to find multiple terms for effective classification with diverse algorithms. The statistical measures summarize the usefulness and improvement of multi-term classification. In future work, related terms together with fuzzy and Bayesian factors can be investigated to further improve and speed up multi-term classification.

REFERENCES

[1]. Y. Zhao, Y. Ming, X. Liu, E. Zhu, K. Zhao, and J. Yin, "Large-scale k-means clustering via variance reduction", Neurocomputing, vol. 307, pp. 184-194, Sep. 2018.

[2]. H. Ye, B. Cao, Z. Peng, T. Chen, Y. Wen, J. Liu, "Web Services Classification Based on Wide & Bi-LSTM Model", IEEE Access, Vol. 7, 2019.

[3]. T. N. Rubin, A. Chambers, P. Smyth, and M. Steyvers, "Statistical topic models for multi-value document classification", Mach. Learn., vol. 88, no. 1-2, pp. 157-208, 2012.

[4]. S. Zhou, X. Xu, Y. Liu, R. Chang, Y. Xiao, "Text Similarity Measurement of Semantic Cognition Based on Word Vector Distance Decentralization With Clustering Analysis", IEEE Access, Vol. 7, 2019.

[5]. M. R. Boutell, X. Shen, J. Luo and C.M. Brown, "Learning multi-value scene classification", Pattern Recognition, vol. 37, no. 9, pp. 1757-1771, 2004.

[6]. K. Jayamalini and M. Ponnavaikko, "Research on Web data mining concepts, techniques and applications", in Proc. Int. Conf. Algorithms, Methodology, Models Appl. Emerg. Technol. (ICAMMAET), 2017, pp. 1-5.

[7]. R. S. Cabral, F. De la Torre, J. P. Costeira, and A. Bernardino, "Matrix completion for multi-value image classification", in Advances in Neural Information Processing Systems, pp. 190-198, 2011.

[8]. P. Mitra, C. A. Murthy, and S. K. Pal, "Unsupervised feature selection using feature similarity", IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 301-312, Mar. 2002.

[9]. N. Spolaor, E. A. Cherman, M. C. Monard, and H. D. Lee, "A comparison of multi-value feature selection methods using the problem transformation approach", Electron. Notes Theoretical Comput. Sci., vol. 292, pp. 135-151, Mar. 2013.


[10]. C. Sung Ferng and Hsuan-Tien Lin, "Multi-value Classification with Error-correcting Codes", 20th Asian Conference on Machine Learning, Journal of Machine Learning Research, 281-295, 2011.

[11]. M. Passent El-Kafrawy, M. Amr Sauber, Awad Khalil, "Multi-Value classification for Mining Big Data", International Conference on Advances in Big Data Analytics, 2015.

[12]. D. Sasirekha and A. Punitha, "A Comprehensive Analysis on Associative Classification in Medical Datasets", Indian Journal of Science and Technology, Vol 8(33), DOI: 10.17485/ijst/2015/v8i33/80081, December 2015.

[13]. F. Ali, P. Khan, K. Riaz, D. Kwak, T. Abuhmed, D. Park, K. S. Kwak, "A Fuzzy Ontology and SVM-Based Web Content Classification System", IEEE Access, Vol. 5, 2017.

[14]. L. Chekina, D. Gutfreund, A. Kontorovich, L. Rokach, and B. Shapira, "Exploiting value dependencies for improved sample complexity", Machine Learning, vol. 91, no. 1, pp. 1-42, 2013.

[15]. M. Ling Zhang and Zhi-Hua Zhou, "A Review on Multi-Value Learning Algorithms", IEEE Transactions On Knowledge And Data Engineering, Vol. 26, No. 8, August 2014.

[16]. K. Kasemsap, "Mastering Web mining and information retrieval in the digital age", Web Usage Mining Techniques and Applications Across Industries. Hershey, PA, USA: IGI Global, 2017, pp. 1-28.

[17]. F. Charte, Antonio J. Rivera, María J. del Jesus, and Francisco Herrera, "LI-MLC: A Value Inference Methodology for Addressing High Dimensionality in the Value Space for Multivalue Classification", IEEE Transactions On Neural Networks And Learning Systems, Vol. 25, No. 10, October 2014.

[18]. M. Elkano, M. Galar, J. Antonio Sanz, A. Fernandez, E. Barrenechea, F. Herrera and H. Bustince, "Enhancing Multiclass Classification in FARC-HD Fuzzy Classifier: On the Synergy Between n-Dimensional Overlap Functions and Decomposition Strategies", IEEE Transactions On Fuzzy Systems, Vol. 23, No. 5, October 2015.

[19]. N. Cesa-Bianchi, M. Re, and G. Valentini, "Synergy of multi-value hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference", Mach. Learn., vol. 88, nos. 1/2, pp. 209-241, 2012.

[20]. A. Elisseeff and J. Weston, "A kernel method for multi-valueled classification", in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. Cambridge, MA, USA: MIT Press, 2002, pp. 681-687.

[21]. J. Read, B. Pfahringer, G. Holmes, and E. Frank, "Classifier chains for multi-value classification", Mach. Learn., vol. 85, no. 3, pp. 333-359, 2011.

[22]. G. Tsoumakas and I. Katakis, "Multi value classification: An overview", International Journal of Data Warehouse and Mining, vol. 3, pp. 1-13, 2007.

[23]. K. Dembczynski, W. Waegeman, W. Cheng, and E. Hullermeier, "On value dependence in multi-value classification", in Workshop proceedings of learning from multi-value data. Citeseer, pp. 5-12, 2010.

[24]. X. Kong, B. Cao, and P. S. Yu, "Multi-value classification by mining value and instance correlations from heterogeneous information networks", in Proceedings of the 19th ACM SIGKDD KDD'13. New York, NY, USA: ACM, pp. 614-622, 2013.

[25]. M. Ling Zhang and Lei Wu, "LIFT: Multi-Value Learning with Value-Specific Features", IEEE Transactions On Pattern Analysis And Machine Intelligence, Vol. 37, No. 1, January 2015.

[26]. G. Tsoumakas, M.-L. Zhang, and Z.-H. Zhou, "Tutorial on learning from multi-value data", in Proc. Eur. Conf. Mach. Learn. Principles Practice Knowl. Discov. Databases, Bled, Slovenia, 2009.

[27]. G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas, "MULAN: A java library for multi-value learning", J. Mach. Learn. Res., vol. 12, no. Jul, pp. 2411-2414, 2011.

[28]. J. Read, B. Pfahringer, G. Holmes, and E. Frank, "Classifier chains for multi-value classification", in Proceedings of ECML PKDD '09: Part II. Berlin, Heidelberg: Springer-Verlag, pp. 254-269, 2009.