
The findings in this paper open up new future research directions, for example: (1) Model compression. How would removing low-correlation, rare filters affect performance? (2) Optimizing ensemble formation. The results show some features (and subspaces) are shared between independently trained DNNs, and some are not. This suggests testing how feature correlation among different DNNs in an ensemble affects ensemble performance. For example, the “shared cores” of multiple networks could be deduplicated, but the unique features in the tails of their feature sets could be kept. (3) Similarly, one could (a) post-hoc assemble ensembles with greater diversity, or even (b) directly encourage ensemble feature diversity during training. (4) Certain visualization techniques, e.g., deconv [197], DeepVis [195], have revealed neurons with multiple functions (e.g., detectors that fire for wheels and faces). The proposed matching methods could reveal more about why these arise. Are these units consistently learned because they are helpful or are they just noisy, imperfect features found in local optima? (5) Model combination: can multiple models be combined by concatenating their features, deleting those with high overlap, and then fine-tuning? (6) Apply the analysis to networks with different architectures (for example, networks with different numbers of layers or different layer sizes) or networks trained on different subsets of the training data. (7) Study the correlations of features in the same network, but across training iterations, which could show whether some features are trained early and not changed much later, versus perhaps others being changed in the later stages of fine-tuning. This could lead to complementary insights on learning dynamics to those reported by [52]. (8) Study whether particular regularization or optimization strategies (e.g., dropout, ordered dropout, path SGD, etc.) increase or decrease the convergent properties of the representations to facilitate different goals (more convergent would be better for data-parallel training, and less convergent would be better for ensemble formation and compilation).

[Figure 5.3 plot. x-axis: unit index (sorted by correlation of semi-matching assignment); y-axis: correlation with assigned unit; series: semi-matching, matching.]

Figure 5.3: Correlations between paired conv1 units in Net1 and Net2. Pairings are made via semi-matching (light green), which allows the same unit in Net2 to be matched with multiple units in Net1, or matching (dark green), which forces a unique Net2 neuron to be paired with each Net1 neuron. Units are sorted by their semi-matching values. See text for discussion.
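The difference between the two pairing rules in the caption can be illustrated with a minimal sketch. The 3-by-3 correlation matrix below is invented for illustration, and the brute-force search over permutations is only an assumption that keeps the example dependency-free; it is feasible only for a handful of units. For realistic layer sizes, an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations

# Hypothetical correlation matrix (values invented for illustration):
# corr[i][j] is the correlation between unit i of Net1 and unit j of Net2.
corr = [
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.2],
    [0.1, 0.4, 0.7],
]


def semi_matching(corr):
    """Pair each Net1 unit with its most correlated Net2 unit.

    The same Net2 unit may be chosen by several Net1 units.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in corr]


def matching(corr):
    """One-to-one pairing that maximizes the summed correlation.

    Brute force over all permutations; only practical for tiny examples.
    """
    n = len(corr)
    best = max(permutations(range(n)),
               key=lambda p: sum(corr[i][p[i]] for i in range(n)))
    return list(best)


semi = semi_matching(corr)
full = matching(corr)
semi_values = [corr[i][j] for i, j in enumerate(semi)]
full_values = [corr[i][j] for i, j in enumerate(full)]

print("semi-matching pairs:", semi, "correlations:", semi_values)
print("matching pairs:     ", full, "correlations:", full_values)
```

In this toy case Net1 units 0 and 1 both prefer Net2 unit 0 under semi-matching, while the one-to-one matching forces unit 1 onto a weaker partner. For every Net1 unit the semi-matching correlation is at least as large as the matching correlation, because semi-matching picks each row's maximum.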

[Figure 5.4 plot. x-axis: convolutional layers (conv1 to conv5); y-axis: average correlation; series: semi-matching, matching.]

Figure 5.4: Average correlations between paired units in Net1 and Net2 for each convolutional layer. Both semi-matching (light green) and matching (dark green) methods suggest that features learned in different networks are most convergent on conv1 and least convergent on conv4.

Sparse Prediction Loss (after 4,500 iterations)

        decay 0   decay 10^-5   decay 10^-4   decay 10^-3   decay 10^-2   decay 10^-1
conv1   0.170     0.169         0.162         0.172         0.484         0.517
conv2   0.372     0.368         0.337         0.392         0.518         0.514
conv3   0.434     0.427         0.383         0.462         0.497         0.496
conv4   0.478     0.470         0.423         0.477         0.489         0.488
conv5   0.484     0.478         0.439         0.436         0.478         0.477

Table 5.1: Average prediction error for mapping layers with varying L1 penalties (i.e., decay terms). Larger decay parameters enforce stronger sparsity in the learned weight matrix. Notably, on conv1 and conv2, imposing a sparsity penalty does not raise the prediction error much above the dense (decay = 0) case until the L1 penalty weight exceeds 10^-3. This region of roughly constant performance despite increasing sparsity pressure is shown in bold. That such extreme sparsity does not hurt performance implies that each neuron in one network can be predicted by only one or a few neurons in another network. For the conv3, conv4, and conv5 layers, the overall error is higher, so it is difficult to draw any strong conclusions regarding those layers. The high errors could be because of the uniqueness of the learned representations, or the optimization could be learning a suboptimal mapping layer for other reasons.
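The effect described in the caption can be reproduced on a toy problem. The sketch below is not the mapping layer used for Table 5.1: the loss (mean squared error plus an L1 term weighted by `decay`), the proximal-gradient (ISTA) solver, the data, and the names `fit_sparse_map` and `mse` are all assumptions chosen to keep the example small. It predicts one target unit from three source units, where the target is dominated by a single source unit.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def predict(row, w):
    return sum(wj * xj for wj, xj in zip(w, row))


def fit_sparse_map(X, y, decay, lr=0.5, steps=200):
    """Minimize (1/2n) * ||Xw - y||^2 + decay * ||w||_1 with ISTA."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        resid = [predict(row, w) - t for row, t in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n
                for j in range(d)]
        w = [soft_threshold(wj - lr * gj, lr * decay)
             for wj, gj in zip(w, grad)]
    return w


def mse(X, y, w):
    return sum((predict(row, w) - t) ** 2 for row, t in zip(X, y)) / len(X)


# Invented data: activations of three source units on four inputs.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
# Target unit = 1.0 * source unit 0 + 0.05 * source unit 1.
y = [1.05, 0.95, -0.95, -1.05]

for decay in (0.0, 0.1, 2.0):
    w = fit_sparse_map(X, y, decay)
    active = sum(1 for v in w if abs(v) > 1e-6)
    print(f"decay={decay}: nonzero weights={active}, error={mse(X, y, w):.4f}")
```

With a moderate penalty the fit keeps only the dominant source unit and the error rises only slightly, while a very large penalty drives every weight to zero and the error jumps. This mirrors, in miniature, the pattern the caption describes for conv1 and conv2.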

CHAPTER 6

TRAINING NEURAL NETWORKS ENSEMBLES THROUGH COMMON SUBSPACE

This section is written in collaboration with Shuang Li, Matt Kusner, Karthik Sridharan, Kilian Weinberger and John Hopcroft.

6.1 Introduction

Averaging classifiers is an effective method to reduce model variance by approximating the expected classifier. Ensemble methods [28, 34, 67, 103, 199] leverage this fact to obtain improved generalization performance [97]. For example, the Random Forests [28] algorithm elevates the modest CART [29] algorithm to one of the most competitive classifiers within machine learning. Here, slightly modified CART trees are bagged [28], where each tree is trained on a different subset of the data (drawn uniformly with replacement) and splits are randomized. The additional randomization of the decision trees leads to high variance classifiers. This is advantageous as the ensemble size can be very large (in practice often exceeding 10,000 CART trees) due to the extremely efficient ID3 algorithm [150]. The large number of classifiers ensures that even in the presence of high variance the ensemble average approaches the expected model (which would have zero model variance).
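The variance argument above can be checked with a small simulation. This is a minimal sketch under a strong assumption: each ensemble member is modeled as the true value plus independent Gaussian noise, whereas real ensemble members are correlated and so benefit less from averaging. The function names and the ensemble sizes are illustrative.

```python
import random
import statistics

random.seed(0)


def noisy_prediction(true_value=1.0, noise_std=1.0):
    """Stand-in for one high-variance classifier's output on a fixed input."""
    return random.gauss(true_value, noise_std)


def ensemble_prediction(size):
    """Average the outputs of `size` independent noisy classifiers."""
    return sum(noisy_prediction() for _ in range(size)) / size


trials = 2000
var_single = statistics.variance(ensemble_prediction(1) for _ in range(trials))
var_small = statistics.variance(ensemble_prediction(5) for _ in range(trials))
var_large = statistics.variance(ensemble_prediction(100) for _ in range(trials))

print(f"variance, 1 member:    {var_single:.4f}")
print(f"variance, 5 members:   {var_small:.4f}")
print(f"variance, 100 members: {var_large:.4f}")
```

Under the independence assumption, the variance of the average shrinks roughly in proportion to the ensemble size, which is why a small ensemble still sits far from the expected model.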

In the wake of deep learning, ensembling is just as important as it has ever been. Nowadays most high-profile competitions (e.g., ImageNet [47] or Kaggle) are won by ensembles of deep learning architectures. As neural networks are often initialized with random weights, there appears to be a sufficient amount of natural variation that allows all networks to be trained on the entire data set. Training deep networks is computationally expensive and can last for days or even weeks, even on high-performance hardware with GPU acceleration. Training ensembles of them increases the cost linearly and quickly becomes prohibitive for most researchers without access to industrial-scale computational resources. Although the training of deep net ensembles can be trivially parallelized, few have access to sufficient GPU servers that can be deployed in parallel for long durations. As a result, ensembles of deep nets are typically small, averaging only a handful of classifiers. Consequently, the ensemble average still has high variance and does not approach the expected model.

In this paper we propose Subspace Ensemble Networks (SEN), a method that improves the generalization performance of small deep network ensembles. SEN trades off the variance of individual deep networks for increased model bias (along a carefully chosen direction). A key component of the success of deep learning is that neural networks automatically learn their own feature representation of the data. Similar to the standard ensemble approach of reducing the variance of the output by averaging the predictions, we also reduce the variance of the internal representation by aligning the learned features.

Neural networks that are trained on the same data set but with different initializations will learn partially correlated feature representations [122]. We reinforce this trend by decomposing the feature representation of the final layer into two components: (a) the common low-dimensional subspaces that are aligned across all ensemble members; and (b) the corresponding null spaces, orthogonal to the subspaces, which learn features specific to each particular neural network.

The aligned subspace distills the commonality between the individual networks’ features and improves their respective feature quality but biases them towards the same representation. The null spaces allow each individual network to learn features that are unique to itself and make different generalization mistakes. The decomposition of representation space is therefore a natural way to trade off the bias and variance of the ensemble members.
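The split into a shared subspace and its orthogonal complement can be pictured with plain linear algebra. The sketch below is not the SEN training procedure, which this section does not specify; it only assumes that the shared subspace is given by a few direction vectors (invented here) and shows how a final-layer feature vector `h` separates into a common part and a network-specific part. All names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > 1e-12:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def split_features(h, basis):
    """Split h into its projection onto span(basis) and the orthogonal rest."""
    common = [0.0] * len(h)
    for b in basis:
        c = dot(h, b)
        common = [ci + c * bi for ci, bi in zip(common, b)]
    specific = [hi - ci for hi, ci in zip(h, common)]
    return common, specific


# Hypothetical directions spanning the subspace shared by all members.
shared_directions = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
basis = orthonormalize(shared_directions)

# Hypothetical final-layer feature vector of one ensemble member.
h = [3.0, 1.0, 2.0, 5.0]
common, specific = split_features(h, basis)

print("common part:  ", [round(v, 6) for v in common])
print("specific part:", [round(v, 6) for v in specific])
print("inner product:", round(dot(common, specific), 6))
```

The two parts sum back to `h` and are orthogonal to each other, so constraining the common part across networks leaves the specific part free to differ between ensemble members.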

We evaluate SEN on six benchmark data sets and demonstrate that it yields significant performance gains over standard ensembles. In particular, it reduces the required ensemble sizes drastically. In 5 out of 6 data sets SEN with only 2 classifiers significantly outperforms ensembles of 20 deep nets.