Subspace Ensemble Networks - Representation Learning for Web Intelligence

[Figure 6.2: Illustration of SEN. Left panel: Standard Ensemble. Right panel: Subspace Ensemble.]

In general, we will not be able to ensemble more than 3 or 5 deep networks due to the massive computational cost of days or even weeks required to train a single network. In this section we describe an efficient technique to create an accurately biased small deep network ensemble that in many cases outperforms much larger network ensembles.

Figure 6.2 illustrates the high-level intuition behind our subspace ensemble method. In a standard ensemble (left plot), classifiers $\sigma_1, \dots, \sigma_M$ are drawn from a distribution around the expected classifier $\sigma_\mu = E[\sigma]$. The ensemble average $\bar\sigma$ approximates this expected classifier because $\bar\sigma \to \sigma_\mu$ as $M \to \infty$. If the classifier variance is high and the ensemble size $M$ is small, this approximation will be poor. We therefore propose to sample classifiers around a biased center $\sigma_{\bar\mu} \neq E[\sigma]$, but at low variance (right plot). We obtain the biased classifier $\sigma_{\bar\mu}$ by enforcing a shared subspace across all networks. More precisely, the learned feature vector $h_\alpha$ is decomposed into two vectors within orthogonal subspaces, $s_\alpha$ and $s^\perp_\alpha$. The first shares a classifier $V$ across all neural networks, which forces all ensemble members to be close to a common "center classifier" $\sigma_{\bar\mu}$. The latter has classifier weights unique to each neural network and ensures a controlled amount of variance across the ensemble members. To facilitate this decomposition, we add a subspace decomposition layer to each neural network, as illustrated in Figure 6.3. In the following, we describe its individual components in detail.
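The trade-off described above can be checked numerically with a toy model. The sketch below is only an illustration under strong assumptions: each "classifier" is reduced to a single scalar drawn from a Gaussian, the true center is taken to be 0, and the function name `ensemble_center_error` and all parameter values are hypothetical. For such draws the mean squared error of the ensemble average is $\text{bias}^2 + \text{variance}/M$, so a small bias can be worth paying when $M$ is small.

```python
import numpy as np

def ensemble_center_error(M, sigma, bias=0.0, trials=20000, seed=0):
    """Mean squared distance between an M-member ensemble average and the true center 0.

    Each ensemble member is a scalar drawn from N(bias, sigma^2). This is a toy
    stand-in for a classifier, not the networks described in the text.
    """
    rng = np.random.default_rng(seed)
    members = rng.normal(loc=bias, scale=sigma, size=(trials, M))
    ensemble_avg = members.mean(axis=1)          # one ensemble average per trial
    return float(np.mean(ensemble_avg ** 2))     # squared error w.r.t. the center 0

# Unbiased but high-variance members (standard ensemble, small M).
high_var = ensemble_center_error(M=3, sigma=1.0)
# Slightly biased but low-variance members (the subspace-ensemble idea).
low_var = ensemble_center_error(M=3, sigma=0.2, bias=0.1)
print(high_var, low_var)
```

With these assumed values the theoretical errors are $1/3$ for the unbiased ensemble and $0.1^2 + 0.2^2/3 \approx 0.023$ for the biased one.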

Subspace Decomposition. We decompose the feature representation $h_\alpha$ learned by $\mathrm{Net}_\alpha$ into two orthogonal subspaces, $s_\alpha = \Theta_\alpha h_\alpha$ and $s^\perp_\alpha = \hat\Theta_\alpha h_\alpha$. The two projection matrices are of dimensions $\Theta_\alpha \in \mathbb{R}^{d \times D}$ and $\hat\Theta_\alpha \in \mathbb{R}^{(D-d) \times D}$ and are constrained to have (approximately) orthonormal rows, i.e., $\Theta_\alpha\Theta_\alpha^T \approx I$ and $\hat\Theta_\alpha\hat\Theta_\alpha^T \approx I$. The two subspaces are forced to be approximately orthogonal, i.e., $\Theta_\alpha\hat\Theta_\alpha^T \approx 0$. Essentially, we divide the feature space into two different components. The first will be used for a shared classifier, the second for a network-specific classifier.
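To make the three constraints concrete, the following sketch builds one pair of projection matrices that satisfies them exactly. It is an illustration only: in the text the matrices are learned, whereas here they are taken from the QR factorization of a random matrix, and the helper name `make_projections` and the sizes `D = 8`, `d = 3` are assumptions.

```python
import numpy as np

def make_projections(D, d, seed=0):
    """Return Theta (d x D) and Theta_hat ((D-d) x D) with orthonormal,
    mutually orthogonal rows, taken from a random orthogonal matrix."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((D, D)))  # Q is D x D orthogonal
    theta = Q[:, :d].T       # first d columns span the shared subspace
    theta_hat = Q[:, d:].T   # remaining D-d columns span its orthogonal complement
    return theta, theta_hat

D, d = 8, 3
theta, theta_hat = make_projections(D, d)

h = np.random.default_rng(1).standard_normal(D)  # stand-in for a feature vector h_alpha
s = theta @ h               # shared-subspace component
s_perp = theta_hat @ h      # network-specific component

# With exact orthonormality the two components together retain all of h.
h_rebuilt = theta.T @ s + theta_hat.T @ s_perp
print(np.allclose(h_rebuilt, h))
```

Because the rows of the two matrices together form an orthonormal basis of $\mathbb{R}^D$, the decomposition loses no information about $h_\alpha$; it only separates it into the part seen by the shared classifier and the part seen by the network-specific one.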

[Figure 6.3: The Subspace Ensemble Network (SEN). The final hidden layer of each network is decomposed into two types of activations: (1) an aligned activation set $s_1, s_2$ and (2) a set of orthogonal activations $s^\perp_1, s^\perp_2$. The aligned activations capture generalized information about the classification task, and the orthogonal activations explain the variance of the model class. The final output of each network is the softmax $\sigma(\cdot)$ over a weighted combination of the aligned and orthogonal activations. Diagram labels: $\mathrm{Net}_1$ and $\mathrm{Net}_2$ produce $h_1$ and $h_2$; each network combines the shared $V$ and its own $U_\alpha$ through $W_\alpha = \lambda V\Theta_\alpha + (1-\lambda)U_\alpha\hat\Theta_\alpha$ and outputs $\sigma(W_\alpha h_\alpha)$.]

We refer to the first component of the representation, $s_\alpha$, as the shared space and the second component, $s^\perp_\alpha$, as the network-specific space. For the shared space $s$, we learn a shared weight matrix $V \in \mathbb{R}^{d \times k}$ which produces predictions $V\Theta_\alpha h_\alpha$. Each row of $V$ is a linear classifier for one of the $k$ classes, applied and averaged over the $d$ dimensions of the shared space. As these linear classifiers are shared across the ensemble members, the feature representation becomes aligned such that the transformed data points $s^i_\alpha$ lie on the "right side" of the hyperplanes. In practice, the fully connected layer leading to the softmax can have hundreds or even thousands of dimensions, resulting in a high-dimensional representation space. The dimensionality of the subspace $d$ can be much smaller than the original feature dimensionality $D$. The shared subspace distills the most valuable predictive structure across the ensemble networks. It further increases the similarity across the networks and benefits the classifier $V$, which is effectively trained on a much larger set of data points ($M \times n$) and therefore generalizes better to unseen test data.

Network Specific Null Space. For the ensemble compilation to work, it is important to allow classifiers to make independent mistakes (which are averaged out). It is therefore important to also have a feature representation and classifier weights that are unique to the individual model. We meet this requirement by maintaining a second, network-specific weight matrix $U_\alpha$ for the null space $s^\perp_\alpha$. The weight matrix $U_\alpha \in \mathbb{R}^{(D-d) \times k}$ is trained to produce a prediction, $U_\alpha\hat\Theta_\alpha h_\alpha$.

Loss Function. The softmax layer takes as input the weighted sum of the linear subspace predictor $V\Theta_\alpha h_\alpha$ and the null space predictor $U_\alpha\hat\Theta_\alpha h_\alpha$. We define the matrix

$$W_\alpha = \lambda V\Theta_\alpha + (1-\lambda)\,U_\alpha\hat\Theta_\alpha$$

and obtain the final weighted classifier, $W_\alpha^T h_\alpha$, where we use $\lambda \in [0, 1]$ to control the weighting between the shared subspace and the null space. The parameter $\lambda$ controls the inherent tradeoff between variance reduction (through the shared subspace) and model independence (through the null spaces). In the extreme case $\lambda = 1$, the predictors rely completely on the common subspace features, whereas with $\lambda = 0$ the commonality across the ensemble members is removed and all networks become unrelated. The final training objective becomes

$$\mathrm{Objective}(V, U_{1:M}, \Theta_{1:M}, \hat\Theta_{1:M}, \mathrm{Net}_{1:M}) = \frac{1}{nM} \sum_{\alpha=1}^{M} \sum_{i=1}^{n} \ell\big(y_i, \sigma(W_\alpha^T h^i_\alpha)\big), \tag{6.1}$$

with a slight abuse of notation, where $h^i_\alpha$ denotes $\mathrm{Net}_\alpha(x_i)$.
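The combined classifier and the averaged loss can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the loss $\ell$ is assumed to be cross-entropy, the features $h^i_\alpha$ are random stand-ins for network outputs, and $V$ and $U_\alpha$ are stored with shapes $k \times d$ and $k \times (D-d)$ so that the product $V\Theta_\alpha$ conforms and the logits are a $k$-vector. The function names `combined_weights` and `sen_objective` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def combined_weights(V, theta, U, theta_hat, lam):
    """W_alpha = lam * V Theta_alpha + (1 - lam) * U_alpha Theta_hat_alpha.

    Assumed shapes: V (k x d), theta (d x D), U (k x (D-d)), theta_hat ((D-d) x D).
    The result is k x D.
    """
    return lam * (V @ theta) + (1.0 - lam) * (U @ theta_hat)

def sen_objective(H, y, V, thetas, Us, theta_hats, lam):
    """Cross-entropy averaged over M members and n examples.

    H: array (M, n, D) of features h_alpha^i; y: integer labels of shape (n,).
    """
    M, n, _ = H.shape
    total = 0.0
    for a in range(M):
        W = combined_weights(V, thetas[a], Us[a], theta_hats[a], lam)
        probs = softmax(H[a] @ W.T)                        # n x k class probabilities
        total += -np.log(probs[np.arange(n), y]).sum()     # assumed loss: cross-entropy
    return total / (n * M)

rng = np.random.default_rng(0)
M, n, D, d, k = 2, 5, 6, 2, 3
H = rng.standard_normal((M, n, D))
y = rng.integers(0, k, size=n)
V = rng.standard_normal((k, d))
thetas = [rng.standard_normal((d, D)) for _ in range(M)]
theta_hats = [rng.standard_normal((D - d, D)) for _ in range(M)]
Us = [rng.standard_normal((k, D - d)) for _ in range(M)]
loss = sen_objective(H, y, V, thetas, Us, theta_hats, lam=0.5)
print(loss)
```

Setting `lam=1.0` makes the result independent of every `Us[a]`, which matches the extreme case in which the predictors rely only on the shared subspace.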

6.3.1 Soft Orthonormality Constraint

In practice, training deep neural networks while ensuring the orthonormality of $\{\Theta_\alpha, \hat\Theta_\alpha\}_{\alpha=1..M}$ is hard, since it requires the costly computation of a singular value decomposition (SVD), especially considering that the dimensionality of $h_\alpha$ can be in the thousands in real networks and the entire training can take up to hundreds of thousands of iterations. We use the soft penalty for constraining orthonormality introduced in [112], and integrate the optimization of $\{\Theta_\alpha, \hat\Theta_\alpha\}_{\alpha=1..M}$ into the learning objective as a whole. Specifically, the soft orthonormality constraint for any projection matrix $\Theta$ is given by

$$\min \|\Theta\Theta^T - I\|_F^2.$$

Furthermore, for any individual network $\alpha$, the orthogonality between the subspace and the null space can be achieved similarly through the soft constraint

$$\min \|\Theta_\alpha\hat\Theta_\alpha^T\|_F^2,$$

so that each component in $\hat\Theta_\alpha$ is orthogonal to $\Theta_\alpha$. The entire regularizer can then be formulated as a sum of orthonormality constraints on each individual network, with an additional penalty on the model complexity.
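The effect of the soft penalty can be seen on a single matrix. The sketch below is a toy under stated assumptions: it minimizes only $\|\Theta\Theta^T - I\|_F^2$ with plain gradient descent, without any classification loss, and the matrix size, step size, and iteration count are arbitrary choices. The gradient used is $4(\Theta\Theta^T - I)\Theta$, which follows from differentiating the squared Frobenius norm.

```python
import numpy as np

def orthonormality_penalty(theta):
    """Soft penalty ||Theta Theta^T - I||_F^2 for a d x D matrix."""
    d = theta.shape[0]
    return float(np.sum((theta @ theta.T - np.eye(d)) ** 2))

def orthonormality_grad(theta):
    """Gradient of the penalty with respect to Theta: 4 (Theta Theta^T - I) Theta."""
    d = theta.shape[0]
    return 4.0 * (theta @ theta.T - np.eye(d)) @ theta

rng = np.random.default_rng(0)
theta = 0.3 * rng.standard_normal((3, 10))   # arbitrary non-orthonormal start
before = orthonormality_penalty(theta)

for _ in range(500):                          # plain gradient descent, step size 0.01
    theta -= 0.01 * orthonormality_grad(theta)

after = orthonormality_penalty(theta)
print(before, after)
```

No SVD is computed at any point; the rows are pulled toward orthonormality purely through gradient steps, which is what allows the constraint to be folded into ordinary network training.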

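Regularizer. The regularizer becomes

$$R(\Theta_{1:M}, \hat\Theta_{1:M}, U_{1:M}, V) = \frac{\gamma}{M} \sum_{\alpha=1}^{M} \Big( \underbrace{\|\Theta_\alpha\Theta_\alpha^T - I\|_F^2}_{\text{Orthonormality}} + \underbrace{\|\Theta_\alpha\hat\Theta_\alpha^T\|_F^2 + \|\hat\Theta_\alpha\hat\Theta_\alpha^T - I\|_F^2}_{\text{Null space constraint}} \Big) + \frac{1}{M} \sum_{\alpha=1}^{M} \Omega(\Theta_\alpha, \hat\Theta_\alpha, U_\alpha, V), \tag{6.2}$$

where $\Omega(\cdot)$ is a standard weight-decay regularization term on the weight parameters that controls the model complexity, and $\gamma$ is the coefficient regulating the orthonormality.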
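A direct transcription of the regularizer into NumPy might look as follows. It is a sketch under assumptions: $\Omega$ is taken to be a plain L2 weight decay with a single hypothetical coefficient `weight_decay`, and the function name `regularizer` and the test matrices are illustrative only.

```python
import numpy as np

def sq_frob(A):
    """Squared Frobenius norm."""
    return float(np.sum(A ** 2))

def regularizer(thetas, theta_hats, Us, V, gamma, weight_decay):
    """Orthonormality terms averaged over M networks plus an assumed L2 weight decay."""
    M = len(thetas)
    ortho, decay = 0.0, 0.0
    for theta, theta_hat, U in zip(thetas, theta_hats, Us):
        d, d_perp = theta.shape[0], theta_hat.shape[0]
        ortho += sq_frob(theta @ theta.T - np.eye(d))                  # rows of Theta orthonormal
        ortho += sq_frob(theta @ theta_hat.T)                          # subspace orthogonal to null space
        ortho += sq_frob(theta_hat @ theta_hat.T - np.eye(d_perp))     # rows of Theta_hat orthonormal
        # Omega is assumed here to be an L2 penalty on all weight matrices.
        decay += weight_decay * (sq_frob(theta) + sq_frob(theta_hat) + sq_frob(U) + sq_frob(V))
    return gamma / M * ortho + decay / M

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
theta, theta_hat = Q[:, :2].T, Q[:, 2:].T          # exactly orthonormal split, d = 2, D = 6
U, V = np.zeros((3, 4)), np.zeros((3, 2))

r_exact = regularizer([theta], [theta_hat], [U], V, gamma=0.5, weight_decay=0.0)
r_scaled = regularizer([2 * theta], [theta_hat], [U], V, gamma=0.5, weight_decay=0.0)
print(r_exact, r_scaled)
```

For an exactly orthonormal split the orthonormality terms vanish. Doubling $\Theta$ gives $\|4I - I\|_F^2 = 9d$ for the first term while the cross term stays zero, so with $d = 2$ and $\gamma = 0.5$ the penalty is $9$.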

The entire set of projection matrices can be optimized directly with any non-convex optimization algorithm used in deep architectures (e.g., stochastic gradient descent). This advances previous work on structural learning in the setting of linear models [13, 12, 26], which uses an alternating structural optimization (ASO) procedure (i.e., alternating between optimizing the learning objective and computing an SVD).

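Training Objective. Our final training objective can be formulated as the sum of the objective (6.1) and the regularizer (6.2):

$$\arg\min_{V,\, U_{1:M},\, \Theta_{1:M},\, \hat\Theta_{1:M},\, \mathrm{Net}_{1:M}} \; \mathrm{Objective}(V, U_{1:M}, \Theta_{1:M}, \hat\Theta_{1:M}, \mathrm{Net}_{1:M}) + R(\Theta_{1:M}, \hat\Theta_{1:M}, U_{1:M}, V). \tag{6.3}$$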
