Chapter 7 Experiments and Results
7.2 Datasets
This section introduces the datasets used in this thesis. All datasets used in the experiments come from the UCI Machine Learning Repository (Lichman, 2013) and the Feature Selection Datasets (FSD) collection (Li et al., 2016).
According to the categorization of problem sizes described in (Oh et al., 2004) and (Kudo and Sklansky, 2000), datasets with fewer than 20 features are considered small, those with more than 50 features are considered large, and those with between 20 and 50 attributes are considered medium. The experiments use datasets from all three categories: 2 small, 4 medium and 2 large datasets.
Because the above categorization was proposed some 15 years ago, this research adds a further category, the “extremely large dataset”, which contains more than 100 attributes. Four extremely large datasets are used to test the performance of the feature selection algorithms.
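The size categories above can be written as a small helper. This is only an illustrative sketch: the function name `size_category` is hypothetical, and the handling of the boundary values (20 and 50 counted as medium, exactly 100 counted as large) is an assumption read off the wording "fewer than 20", "more than 50" and "more than 100". The feature counts are the ones reported for each dataset in this section.

```python
def size_category(n_features: int) -> str:
    """Map a feature count to the size category used in this section.

    Boundary handling is an assumption: 20 and 50 count as medium,
    and exactly 100 counts as large.
    """
    if n_features < 20:
        return "small"
    if n_features <= 50:
        return "medium"
    if n_features <= 100:
        return "large"
    return "extremely large"


# Feature counts as reported in the dataset descriptions of this section.
feature_counts = {
    "German": 20, "Aus": 14, "Wf": 40, "Io": 34,
    "Bio": 41, "So": 60, "Lc": 56, "Ve": 18,
    "Ld": 326, "Co": 2000, "Ly": 4026, "Leukemia": 7071,
}

# Group the dataset names by category.
by_category = {}
for name, n in feature_counts.items():
    by_category.setdefault(size_category(n), []).append(name)

for category, names in by_category.items():
    print(f"{category}: {len(names)} ({', '.join(names)})")
```

Under these assumptions the grouping reproduces the counts stated above: 2 small, 4 medium, 2 large and 4 extremely large datasets.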
1. German Credit (Statlog)
This dataset contains 20 features and one binary label attribute. It classifies people described by a set of attributes as good or bad credit risks. Of the 20 attributes, 3 are continuous and 17 are nominal. Of its 1000 instances, 700 are classed as “good” and 300 as “bad”. It is therefore a slightly imbalanced dataset with a large number of nominal attributes.
2. Australian Credit (Aus)
This dataset concerns credit card applications. There are 6 numerical and 8 categorical attributes. There is a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.
3. Waveform (Wf)
This dataset contains 3 classes of waves and 40 attributes. The last 19 attributes are all noise attributes with mean 0 and variance 1. It has 5000 instances. In tests that consider only binary classification, the instances of classes 1 and 2 form the binary waveform dataset.
4. Ionosphere (Io)
This dataset concerns the classification of radar returns from the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere. "Bad" returns are those that do not; their signals pass through the ionosphere. All 34 attributes are continuous. 225 of the 351 instances are in class “g”, so it can be seen as an imbalanced dataset.
5. QSAR biodegradation (Bio)
The data have been used to develop QSAR (Quantitative Structure Activity Relationships) models for studying the relationship between chemical structure and the biodegradation of molecules. This dataset contains values for 41 attributes (molecular descriptors) used to classify 1055 chemicals into 2 classes: ready biodegradable (“RB”) and not ready biodegradable (“NRB”). About one third of the instances belong to the “RB” class and the rest belong to “NRB”.
6. Sonar (So)
The task of Sonar is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. It contains 208 instances and 60 attributes, with all values in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time. The class labels are “R” (rock) and “M” (metal cylinder). It contains 111 patterns labeled “M” and 97 labeled “R”.
7. Lung cancer (Lc)
The lung cancer dataset describes 3 types of pathological lung cancer. It contains 32 instances and 56 attributes, all of which are nominal. The class label takes 3 values, from 1 to 3, with 9, 13 and 10 instances respectively.
8. Vehicle (Ve)
The task of the Vehicle dataset is to recognize a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The original purpose was to find a method of distinguishing 3D objects within a 2D image by applying an ensemble of shape feature extractors to the 2D silhouettes of the objects. The dataset contains 18 attributes and 946 instances. The four types of vehicle are OPEL, SAAB, BUS and VAN. All attribute values are continuous.
9. Lung discrete (Ld)
This is one of the four extremely large datasets obtained from Feature Selection Datasets (FSD) (Li et al., 2016). It contains 326 attributes and 1 target class with 3 labels. All attribute values are discrete.
10. Colon (Co)
Colon is another extremely large dataset from FSD (Li et al., 2016). It contains 2000 attributes and 62 instances, with 2 different class labels. It is a nominal dataset and each attribute has 3 possible values. The two classes contain 40 and 22 instances respectively.
11. Lymphoma (Ly)
Lymphoma is another extremely large dataset from FSD (Li et al., 2016). It contains 4026 candidate attributes and one class attribute. There are 96 instances with 9 class labels. It is a typical multi-class recognition problem with very high dimensionality.
12. Leukemia
The Leukemia dataset is another extremely large dataset from FSD (Li et al., 2016). It contains 7071 attributes and 72 instances with 2 class labels (47 instances labeled “-1” and 25 labeled “1”). All attribute values are nominal.
13. Pima Indians Diabetes (Pima)
The Pima dataset is from the National Institute of Diabetes and Digestive and Kidney Diseases. The goal is to forecast the onset of diabetes mellitus. There are 8 numeric features and one binary target class (0 and 1). It contains 768 instances. This dataset will be used in the HONN experiments.
14. Blood Transfusion Service Center (Blood)
The Blood dataset was taken from the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. There are 4 numeric features and one binary class label. This dataset will be used in the HONN experiments.
15. Liver Disorders (Liver)
This dataset contains 6 numeric features and 345 instances. The features are all blood tests that are thought to be sensitive to liver disorders which might arise from excessive alcohol consumption. It is a binary classification problem and will be used in the HONN experiments.
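Several of the descriptions above report per-class instance counts and describe a dataset as imbalanced. As a rough way to compare them, the sketch below computes the ratio of the largest class to the smallest class from those reported counts. The ratio and the helper name `imbalance_ratio` are illustrative assumptions, not a measure defined in this thesis. The second Ionosphere count is derived as 351 - 225.

```python
def imbalance_ratio(class_counts):
    """Largest class size divided by smallest class size (1.0 means balanced)."""
    return max(class_counts) / min(class_counts)


# Per-class instance counts as reported in the dataset descriptions above.
class_counts = {
    "German": [700, 300],
    "Io": [225, 351 - 225],  # 225 instances in class "g", the rest in the other class
    "So": [111, 97],
    "Lc": [9, 13, 10],
    "Co": [40, 22],
    "Leukemia": [47, 25],
}

# Print the datasets from most to least imbalanced under this measure.
ranked = sorted(class_counts, key=lambda name: imbalance_ratio(class_counts[name]), reverse=True)
for name in ranked:
    print(f"{name}: {imbalance_ratio(class_counts[name]):.2f}")
```

By this measure German Credit has the largest ratio of the datasets listed and Sonar is the closest to balanced.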