The multiview datasets used in the experiments belong to three categories: text documents, images and biological data. Within each category, however, the datasets have different characteristics that make them suitable for testing different aspects of the performance of the algorithms.
The criteria used to select these datasets have been the following. First, they are multiview datasets, i.e. they are published with multiple views or feature matrices, thus allowing anyone interested in reproducing the present experiments to use exactly the same data. Second, they are provided to the community by recognized institutions and research teams and are backed by peer-reviewed publications. Finally, they are used by other state-of-the-art methods, so the results can be compared.
2.2.1 Text datasets
Multiple views can be obtained from text documents in several ways, and each of the datasets used in the experiments reflects a different multiview approach. Their quantitative details are given in Table 2.1.
The first is the BBC News multiview text collection [44, 43]. It comprises 2,225 news articles labelled with one of five possible topics (business, entertainment, politics, sport or tech). The input texts are split into several segments, and the term frequencies of each segment become the different input views. The original data set contains several subsets; the two-segment subset has been chosen to allow direct comparison with the results in the literature. The number of terms in each view, i.e. the number of attributes, is 6,838 and 6,790 respectively, although only the 500 most frequent terms of each segment are used, as the less frequent terms do not contribute to the quality of text classification [51]. The tf.idf (term frequency / inverse document frequency) weighting [78] is computed on each of the input segments, and the cosine similarity is used instead of the Euclidean distance because of the high sparsity of the feature matrices.

                     BBC                 Reuters               Cora
View 1               Seg. A (500/6,838)  English (500/21,531)  Bag of words (1,433)
View 2               Seg. B (500/6,790)  French (500/24,892)   References (2,708)
View 3               -                   German (500/34,251)   -
View 4               -                   Italian (500/15,506)  -
View 5               -                   Spanish (500/11,547)  -
No. of samples       2,112               18,758                2,708
No. of samples used  2,112               6,000                 2,708
No. of classes       5                   6                     7
Feature name (used variables/number of variables in the feature matrix). A single number means that all available variables have been used.
Table 2.1: Summary of the text multiview datasets
The second text dataset is the Reuters multilingual corpus [3],² a set of 18,758 news articles available in five different languages (English, French, German, Italian and Spanish). The subset of original English news articles has been used; the term matrices of the remaining languages come from machine-translated texts. The texts belong to one out of six news categories. For each input view (language), a matrix with term frequencies is given. As with the BBC news dataset, only the 500 most frequent terms of each language have been used. Their tf.idf value has been computed and finally the cosine similarity has been employed to find the similarity matrices.
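The preprocessing described for the BBC and Reuters views (keep the most frequent terms, weight them with tf.idf, compare documents with the cosine similarity) can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `tfidf_cosine_similarity` is hypothetical, and the plain log(N/df) form of the inverse document frequency is an assumption, since the text does not state which tf.idf variant was applied.

```python
import numpy as np

def tfidf_cosine_similarity(counts, n_terms=500):
    """Cosine-similarity matrix of tf.idf vectors for one view.

    counts: (n_documents, n_vocabulary) array of raw term frequencies.
    """
    counts = np.asarray(counts, dtype=float)
    # Keep the n_terms terms with the highest total frequency in this view.
    totals = counts.sum(axis=0)
    keep = np.argsort(totals)[::-1][:n_terms]
    tf = counts[:, keep]
    # Assumed idf variant: log(N / df). A term that never occurs has
    # zero tf everywhere, so its weight is zero whatever its idf.
    n_docs = tf.shape[0]
    df = np.count_nonzero(tf, axis=0)
    idf = np.log(n_docs / np.maximum(df, 1))
    weights = tf * idf
    # Cosine similarity: scale rows to unit length, then take dot products.
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    unit = weights / np.maximum(norms, 1e-12)
    return unit @ unit.T

# Toy view with three documents and three terms (made-up counts).
toy_counts = [[2, 0, 1],
              [0, 3, 1],
              [1, 0, 0]]
similarity = tfidf_cosine_similarity(toy_counts, n_terms=2)
```

Applying the function once per view yields one similarity matrix per segment or language. Normalising the rows makes the comparison independent of document length, which is one reason the cosine similarity suits sparse term matrices.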
The third text dataset used in the experiments is the Cora dataset [79], which contains 2,708 scientific publications classified into one of seven classes. This dataset has two views. The first one is a bag of words with 1,433 words. The second view is a reference graph that represents 5,429 links between the documents.
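To illustrate how a reference graph can serve as a feature matrix with one column per document, the sketch below builds an adjacency matrix from a list of links. The function name `citation_view`, the toy link list and the choice to treat links as undirected are assumptions made for illustration; the text does not specify how the links are encoded.

```python
import numpy as np

def citation_view(links, n_documents):
    """Build a symmetric 0/1 adjacency matrix from (citing, cited) index pairs."""
    adjacency = np.zeros((n_documents, n_documents), dtype=int)
    for citing, cited in links:
        adjacency[citing, cited] = 1
        adjacency[cited, citing] = 1  # assumption: a link counts in both directions
    return adjacency

# Toy graph with four documents and three links (made-up indices).
links = [(0, 1), (1, 2), (3, 1)]
view = citation_view(links, n_documents=4)
```

Each row of the resulting matrix describes a document by the documents it is linked to, so the view has as many variables as there are samples.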
2.2.2 Image datasets
Although the number of image datasets used in the literature is huge, few of them are specifically multiview or multifeature in the sense of providing different sets of features for each image; often these datasets simply contain raw images. The two multiview image datasets selected, on the contrary, provide different image features. The main difference between these two datasets stems from the original images: the first dataset (Digits) derives from handwritten numerals in grayscale, while the second dataset (Animals with Attributes, or AWA) contains features extracted from real-world color photographs. As a consequence, the specific feature types extracted and their values differ greatly from one dataset to the other. The details of these datasets are given in Table 2.2.

Table 2.2: Summary of the image datasets
                     Digits                       AWA
View 1               Pixels (240)                 CQ (2,688)
View 2               Fourier coeffs. (76)         LSS (2,000)
View 3               Profile correl. (216)        PHOG (252)
View 4               Zernike moments (47)         SIFT (2,000)
View 5               Karhunen-Loève coeffs. (64)  RGSIFT (2,000)
View 6               Morph. feats. (6)            SURF (2,000)
No. of samples       2,000                        30,475
No. of samples used  2,112                        4,000
No. of classes       10                           50
Feature name (number of variables in the feature matrix).

² https://archive.ics.uci.edu/ml/datasets/Reuters+RCV1+RCV2+Multilingual+Multiview+Text+Categorization+Test+collection
The University of California at Irvine (UCI) multiple features digits dataset [9], available at the UCI machine learning repository,⁴ is created from a set of handwritten numerals (from ’0’ to ’9’), scanned as 15 × 16 grayscale pixel images. There are 200 samples of each numeral, resulting in a total of 2,000 samples. The data set provides six different views or feature sets of the original image data: (1) 240 pixel averages in 2 × 3 windows, (2) 76 Fourier coefficients of the character shapes, (3) 216 profile correlations, (4) 64 Karhunen-Loève coefficients [99], (5) 47 Zernike moments [71], and (6) 6 morphological features (not specified).
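As an illustration of the first view, averaging pixels in 2 × 3 windows amounts to block-averaging the image. The sketch below assumes a hypothetical binary scan of 48 rows by 30 columns, a size the text does not state; it is chosen only because it reduces to 240 values. The function name `window_averages` is also hypothetical.

```python
import numpy as np

def window_averages(image, win_rows=3, win_cols=2):
    """Average an image over non-overlapping win_rows x win_cols windows."""
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    if rows % win_rows or cols % win_cols:
        raise ValueError("image size must be a multiple of the window size")
    # Split each axis into (number of windows, window size) and average
    # over the two window-size axes.
    blocks = image.reshape(rows // win_rows, win_rows, cols // win_cols, win_cols)
    return blocks.mean(axis=(1, 3))

rng = np.random.default_rng(0)
scan = rng.integers(0, 2, size=(48, 30))      # hypothetical binary scan
pixels_view = window_averages(scan).ravel()   # 16 x 15 grid flattened to 240 values
```

The reshape-based formulation avoids explicit loops and raises an error when the image cannot be tiled exactly by the window.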
The other image dataset used in the experiments is the Animals with Attributes (AWA) data set [65], which is a multiple-feature data set with six standard image features extracted from animal photographs. This dataset includes photographs of 50 different animal species, which become the classes of the data samples. Due to the high number of classes, it is particularly hard to achieve high evaluation scores with this dataset in its original configuration.
⁴ https://archive.ics.uci.edu/ml/datasets/Multiple+Features