3.3 Experiment Design:
3.3.2 Experiments for Variance Ranking Attribute Selection
tion
The highlight of this session is to articulate all the experiments to demonstrate Variance Ranking Attributes Selection by following this sequence;
• the raw formula that would initiate the variances of each class and ultimately the Variance Ranking Processes.
• A clear process flow Algorithm in form of a flow chart
• The tabulations of the experiments, Binary and multi classed data set.
All the itemised will be presented in this section. When some process has been carried out in other sections or chapters it would be referred and properly signed post. The proposed method of attributes selections is for discrete and continuous numeric data for a binary class and multi-classed (decomposed into n binary see section 2.3.2) represented by 1 and 0.
The data preparation have been explained in sections 2.3.5 and 3.3.1 . Each of the data set in Table A.2 was first split into two subsets of class 0 and class 1 and the Variance of each attribute deduced using equation 3.13 and 3.14 respectively. The Variance Significant is deduced using the equation of 3.15. In all the total number of the majority and minority class is maintained through the number of the data items as nmaj and nminj. In general the variance of each of the subsections; class
target 1 and class target 0, of dataset was computed using the following formula Variance v =
P
(x−¯x2)
(n−1) . If The Variance subsection of class 0 is given by:
V0 =
(x0−x¯02)
(nmaj −1)
(3.13)
If then Variance subsection of class 1 is given by:
V1 =
(x1−x¯12)
(nmin−1)
(3.14)
The Variance Significant Attribute Selection is then deduced by:
V R =
(x0−x¯02) (nmaj−1) /(x1−x¯1 2) (nmin−1) 2 = V0 V1 2 (3.15)The total number of data items is inversely proportional to the variance or spread from the mean position, that is V ariance ∝ n1
m, this relationship shows that the
formula is generic, therefore if the ranking is done in either order it would remain consistent. In Tables3.1,3.2,3.3and3.4, the columnV1 andV0 is the results of the
variance of each subsection class (positive=1 and negatives=0) for each attribute.
Binary classed Experiments
The results of the experiment for the binary classed data is in Tables3.1,3.2,3.3and
3.4. The serial numbers in the tables are not mistakes but show how the attributes were numbered before from the original data sets descriptions in TableA.2and how the (VR) techniques have ranked them.
Table 3.1: Variance Ranking attribute selection using Pima India data
The result of the experiment using the Pima India data in Table3.1have a serial number that has identified age followed by Body Mass Index (bmass) and plama glucose as the three most significant. The ranking continued until the last attribute which is insutest.
Table 3.2: Variance Ranking attribute selection using Bupa data
The Bupa data have also been ranked in the order from 4,3, 6, 5, 1 and 2 in Table 3.2. While the Wisconsin and Cod-rna data have been ranked in Table 3.3
Table 3.3: Variance Ranking attribute selection using Wisconsin Breast Cancer data
Table 3.4: Variance Ranking attribute selection using Cod-rna data
Multi-classed Experiments (n)Binary
The above three experiments demonstrated the (VR) technique for binary classed data and is straight forward, but for multi-classed distributed dataset each of the minority classes will be taken in turn as the positive or 1 while the rest will be negative or class 0. This will result in more multiple experimentation equal to the number of target classes. For example, if there are three classes in the dataset, this would result into (n)Binary, whilen = 3. The subsequent sessions would show the experiment with Iris data set that has n= 3.
The result for Iris data in Table3.5is very obvious, the classes are label originally as Iris Versicolor, Satosa and Virginica. For Iris Versicolor, the Petal width has been identified as the most significant attribute that distinguishes it from the rest, fol- lowed by the Petal length, Sepal length and Sepal width. The result for Satosa and Virginica are the most interesting, even if the experiment were performed differently both results are similar thereby showing that both flowers share more similarity to
Table 3.5: Variance Ranking attribute selection using Iris data
each other than Versicolor.
The Glass data set that has n = 6 classes for details on this dataset please see Table A.2. The imbalanced classes are from 1 to 7; notice that class 4 is not in this dataset, so the total number of available classes is six, and they are labeled 1, 2, 3, 4, 5, and 7. Using the ”one-versus-all” process, as explained in sections 2.8 each of these classes will be taken in turn as class 1 (minority class) and others as class 0 (majority class)
Table 3.6: Glass data set details showing highly imbalance classes
The Glass data imbalanced contents proportion is shown in Figure 3.5, which represents the chemical elements compositions and the refractive index (RI) which is a physical property of glass that measureS the bending of light as it passes through.
Figure 3.5: Glass data contents proportion
The amount of the chemical compositions of a glass determines its application, type, and classes; for example, class 1 is ”Building window float processed”, class 2 is the ”Building window non-float processed”, class 3 is ”vehicle window float processed”, class 4 (not available in this dataset) is ”vehicle window non-float pro- cessed”, class 5 is ”container”, class 6 is ”tableware,” and finally, class 7 is ”head- lamps.” Therefore, the experiment will be conducted with class 1 as the minority while the rest will be class 0 as the majority class. Each of the classes in the mi- nority, as shown in Figure 3.5 will consequently be relabeled as class 1 and class 0. Table 3.7 is the implementation of the one-versus-all approach; it shows the rela- beled table with the actual numbers of minority classes as 70, 76, 17, 13, 9 and 29 and the corresponding majority classes are 144, 138, 197, 201, 205 and 185
Table 3.7: Glass data class relabel to One-vs-All
The final result of using the (VR) techniques for the glass dataset is shown in Table 3.10. Again, notice that the serial number as ranked by the experiment is different from that presented in Table 3.6, excluding the ID number. Each of the sub-tables in Table 3.10 is a representation of each class relabelled as class 1 and the rest as class 0; for example, class 2 is relabelled as class 1 and the rest as class 0, (one versus all). This process is continued for all the classes in the dataset; see Table 3.7.
463 429 244 163 51 4437 30 20 5
Target class of Yeast data set
CYT (cytosolic or cytoskeletal) NUC (nuclear)
MIT (mitochondrial) ME3 (membrane protein, no N-terminal signal)
ME2 (membrane protein, uncleaved signal) ME1 (membrane protein, cleaved signal)
EXC (extracellular) VAC (vacuolar)
POX (peroxisomal) ERL (endoplasmic reticulum lumen)
Figure 3.6: Yeast data contents proportion
Table 3.9: Yeast data class relabel to One-vs-All
For the Yeast data, the imbalanced classes could be seen in Table 3.8 and the contents proportions in Figure 3.6. This dataset is one of the most popular ones, and it has been used in various work for imbalanced multi-class data. The data are numeric measurements of different protein in the nucleus and cell materials in Yeast unicellular organisms. The objective of the dataset is using this physical protein descriptor for ascertaining the localization, which in turn, may provide help explaining the growth, health, and other physical and chemical properties of Yeast. The data are made up of 1,484 instances. The re-coding of the dataset to one versus all is shown in Table3.9. For example, the recoding proceeds as ”CYT (463) as class 1, 1023 as class 0”; this continues until the last minority class, which is ”ERL(5) as class 1, 1481 as class 0.
CHAPTER 3. V ARIANCE RAN KING A TTRIBUTE SELECTION TECHNIQUE
Table 3.10: Experiment on Glass data
The first sub-table in table 3.10 is class 1, and the rest classes (class 2,3,5,6 and 7; notice no class 4) are class 0. The (VR) technique ranks the most significant attributes as follows Ba, Mg, K and so forth. The second experiment has class 2 relabeled as class 1 and the other classes (1, 3, 5, 6 and 7) as class 0. They are ranked K, Al, Ba and so on, as the most significant. The next experiment is class 3 relabeled as class 1 while the rest as class 0; they ranked Ba, Mg, Ca, and so on. All six classes are taken in turn.
The general postulate here from the experimental result is that the type of glass depends on the amount of chemical element that the glass contains; this has been captured by the (VR)technique.
3. V ARIANCE RAN KING A TTRIBUTE SELECTION TECHNIQUE
Table 3.11: Experiment on Yeast data
3. V ARIANCE RAN KING A TTRIBUTE SELECTION TECHNIQUE
Table 3.12: Experiment on Yeast data continue
Summary
This chapter presented one of the main assertions of the research, it started with a clear theoretical basis of the nature of numeric data and the techniques for transform- ing from one data types to another within the conditions of the same range density formalism. Then a demonstration of the framework which lays the background for the heuristic axiomatic inferences that established the relationship between class distribution and the variances of the data items in the domain of discourse vis a vis the sample space were presented. The variance of the data items in the domain space and how this is related to their class distribution was deduced and explained. An introduction of variance comparison testing which culminated into (VR) tech- nique was derived therein. The main research design like the processes of data sampling for experimental validity and reliability were explained in details. A clear process demonstrations for decomposing Multi-classed into Binary classed data were explained in detailed with clear diagrams. The process of avoiding overfitting using the state of the art cross-validation and the justifications as a process of producing more dependable results were also explained.
The results of the experiments carried out in this chapter present a significant contri- bution to the overall and major insight into some of the discovery and many claims made by this research and also would become an input into later chapters. It is also a whole knowledge in his own right that could lead to future work by any researcher. This work may be seen as pioneering new field because, in most academics and in- dustries, the correlations of the effects of the variances of data points and their classes have not been investigated before. But, this work showed the correlation with a concise inferences and presenting an axiomatic open field for new avenue of knowledge insight and opportunities for further researches. The implication of the knowledge discovered therein is in no doubt and limitless.
Comparison of Variance Ranking
Attribute Selection With Other
Attribute Selections
4.1
Introduction
In this chapter, comparison of(VR)technique and two main state of the art features selection techniques (algorithm) will be made, these two are (PC) and (IG) belong to the categories of features selections known as ”filter” methods the other(s) are ”wrapper” and ”hybrid” which are just a combination of filter and wrapper meth- ods. Please refer to section 2.2.7 for the details of these methods and reasons for comparing the(VR) with(PC) and (IG). But just for a hint the (PC) and (IG)are two most popular and state of the art feature selections.
Most feature selection results are heuristics [192][193][194][195], meaning that no two feature selection on the same dataset will produce the same result perfectly, especially in the filter algorithm; instead, each attribute identified are most likely to be ranked slightly differently by different filter feature selection algorithms. To estimate or measure the similarities in these results, the order of ranking of the attributes becomes the metric used to quantify the similarities. Some of the iden- tified attributes may be in the same position in the order of ranking, while others may share similarities by proximity to the attribute’s positions. The result of (VR)
obtained in sections3.1 will be paired with the result of (PC)and (IG)on the same data to investigate the extents of their similarities and differences. A novel method of quantifying similarity called ”Ranked Order Similarity Index”(ROS)will be dealt with extensively in subsequent sections. The (ROS) is a similarity index quantifier to assess two or more sets that may contain the same object but ranked differently. The necessity of (ROS) came about from the fact that the existing similarity index
measure appeared to be inadequate when quantifying the similarity of sets of the same items that are ranked differently, please see sections 4.3.