
3.2 Algorithms and Datasets

3.2.3 Datasets and Data Correlation

This work was motivated by studies on two proteomics datasets: the ovarian cancer dataset from Petricoin E, et al. (2002), which we denote OV, and the pancreatic cancer dataset from Hingorani S, et al. (2003), which we denote PA. OV is relatively highly correlated: when we look at the simple statistical correlation of each feature with the target feature, the values tend to be quite high. In contrast, PA has rather low correlation values. These differences are reflected in the performance typically achieved on these datasets in machine learning studies: 95-100% accuracy on test data for OV, but only 60-65% for PA.

a) Ovarian Cancer:

This research is particularly significant for women who are at elevated risk of ovarian cancer because of their family history. The purpose of research on this dataset is to find a proteomic pattern that distinguishes ovarian cancer patients from normal subjects. The dataset (each value representing mass/charge ratios from a spectrometer) consists of 91 control samples (Normal) and 162 ovarian cancer samples. The task is to train a classifier (in our case, learn a set of rules) that correctly predicts the class of an unseen sample. The dataset is separated into 128 training samples and 125 test samples.

b) Pancreatic Cancer:

This dataset comes from mouse samples, collected as part of research that first generated a mouse model of PanIN (pancreatic intraepithelial neoplasia). That work led to a reliable means of detecting PanINs in the serum proteome of mutant animals, and its results are pertinent to building an accurate prediction model for the earliest stages of human neoplasia. The PA data has 181 samples, which we divided randomly into training and test sets, and there are 6776 genes (features) in each sample.

When we consider the individual statistical correlation values (correlation with the target feature) for features in these two datasets, the difference is clear. Note that we consider the absolute value, so that high values (near 1) mean a strong correlation or anti-correlation with the target feature, and low values (near 0) mean very weak correlation. In the OV dataset, the highest individual feature correlation coefficient (which we call the Dataset Correlation Value (DCV)) is 0.896, while in the PA dataset it is 0.185. Two suggestions follow from such a clear distinction between these datasets. First, we should not in general expect the ideal analysis method for the OV dataset to be the ideal analysis method for PA. Second, the potential accuracy of predictive models may be limited at the outset by these correlation values, although it is entirely possible that strong predictive models exist for the PA dataset which exploit underlying patterns that are obscured or ignored by simple pairwise correlation.

Our specific interest here is feature selection, and how the choice of feature selection method might be guided by a simple measure of the inherent correlations in the data. As mentioned, we characterise a dataset’s inherent correlation by the highest individual (absolute) correlation coefficient of its non-target features with the target feature, and we call this the Dataset Correlation Value (DCV). Thus, the OV dataset has a DCV of 0.896, and PA has a DCV of 0.185. This characterisation is sufficient for the purposes of this chapter; however, it is an open question whether the median correlation coefficient or some other averaging measure would be more generally useful. We investigate that question in a later chapter. In the cases studied here, the maximal value tended to be a good guide rather than an outlier.
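The DCV defined above can be illustrated with a minimal sketch. It assumes the Pearson correlation coefficient, a samples-by-features matrix `X`, and a numerically coded two-class target `y`; the function names and the toy data are hypothetical and are not taken from the experiments reported here.

```python
import numpy as np


def feature_target_correlations(X, y):
    """Absolute Pearson correlation of each column of X with the target y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # centre each feature
    yc = y - y.mean()                # centre the target
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Assumption: a constant feature (zero variance) is given correlation 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.where(den > 0, num / den, 0.0)
    return np.abs(r)


def dataset_correlation_value(X, y):
    """DCV: the highest absolute feature-target correlation in the dataset."""
    return float(feature_target_correlations(X, y).max())


# Toy usage on synthetic data (illustrative only, not OV or PA).
rng = np.random.default_rng(0)
y_toy = np.repeat([0.0, 1.0], 50)
X_toy = rng.normal(size=(100, 20))
X_toy[:, 3] += 2.0 * y_toy           # make one feature informative
r_toy = feature_target_correlations(X_toy, y_toy)
print("DCV (max):", dataset_correlation_value(X_toy, y_toy))
print("median |r|:", float(np.median(r_toy)))
```

Printing the median alongside the maximum shows how the alternative summary mentioned above would be obtained from the same per-feature correlations.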

In order to investigate the relationship between feature selection method and dataset correlation value, a number of other many-attribute datasets were obtained in addition to OV and PA. To keep things straightforward, we looked for many-attribute datasets that had only real-valued features and a natural two-class classification task. However, the range of DCVs among these datasets was still quite small. After this brief investigation, two datasets were added to this study, namely the Ionosphere and Optical Digit datasets from the UCI repository (Asuncion A and Newman D J (2008)). We decided that a fast way to obtain test datasets with a wide range of DCVs, spanning from very low to very high, was to artificially add noise to an existing dataset that itself had a high DCV. The best candidate for that was the OV dataset. We therefore generated several variants of the OV dataset by adding different amounts of noise to the attributes. The result was an additional 11 datasets that we called rOV1, rOV2, …, rOV11. These are shown in Table 3.1, along with the four original datasets, listed in ascending order of DCV.

DCV      Dataset
0.099    rOV1, 1000 fields
0.185    Pancreatic, 8,642 fields
0.335    rOV2, 1000 fields
0.349    rOV3, 1000 fields
0.378    Opt digit, 64 features
0.399    rOV4, 1000 fields
0.449    rOV5, 1000 fields
0.496    rOV6, 1000 fields
0.51     Ionosphere, 32 features
0.539    rOV7, 1000 fields
0.598    rOV8, 1000 fields
0.618    rOV9, 1000 fields
0.699    rOV10, 1000 fields
0.784    rOV11, 1000 fields
0.896    OV, 15,143 fields

Table 3.1. Datasets used in the experiment of chapter 3.

In order to generate the rOV datasets, we first chose the top 1000 features from OV according to their correlation coefficient with the target class. Then, we produced a new randomised OV (rOV) dataset by adding a small random value to each field of each feature, and calculated the new dataset’s DCV. This entire process was repeated for increasing values of the random value’s range parameter, until datasets were acquired with correlation values close to 0.1, 0.3, 0.6, 0.7 and 0.8 (existing datasets were already available with values close to 0.2, 0.4, 0.5 and 0.9). In this way we obtained the eleven additional datasets used later in this chapter and later in this thesis.
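The procedure just described can be sketched as follows. This is an illustration under stated assumptions rather than the code used to produce the rOV datasets: the noise is assumed to be drawn uniformly from [-noise_range, +noise_range], correlation is assumed to be Pearson correlation, and all function and parameter names are hypothetical.

```python
import numpy as np


def abs_corr(X, y):
    """Absolute Pearson correlation of each column of X with y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.where(den > 0, (Xc.T @ yc) / den, 0.0)
    return np.abs(r)


def make_noisy_variant(X, y, noise_range, n_top=1000, seed=0):
    """Keep the n_top most correlated features, add noise, return (data, DCV)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    top = np.argsort(-abs_corr(X, y))[:n_top]    # indices of the top features
    X_top = X[:, top]
    # Assumption: uniform noise; the source only states "a small random value".
    noise = rng.uniform(-noise_range, noise_range, size=X_top.shape)
    X_noisy = X_top + noise
    return X_noisy, float(abs_corr(X_noisy, y).max())


def sweep_noise_ranges(X, y, noise_ranges, n_top=1000, seed=0):
    """DCV of the noisy variant for each noise range, in increasing order."""
    return [(nr, make_noisy_variant(X, y, nr, n_top, seed)[1])
            for nr in sorted(noise_ranges)]
```

In use, one would call `sweep_noise_ranges` on the selected data with a grid of range values and keep the variants whose DCV falls closest to each desired target (for example 0.1, 0.3, 0.6, 0.7 and 0.8).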