Once the filters are pruned, traditional methods resort to retraining the network to regain accuracy. Early works reported 2x more time spent on retraining than on training [139]. Until recently, it still took 20-40 epochs of retraining for a pruned specialist model to regain accuracy [140]. If the objective is to create a model that classifies a subset of labels for every ad hoc situation, a method that requires retraining is equivalent to training a new model every time, which is certainly inefficient and most likely infeasible. Therefore, an online method that requires no retraining is desired in this setting. This is where the process we call compensation comes into play. The objective of compensation is to restore the activations of the filters that are pruned away from the model. The closer the restored values are to the original ones, the more accurate our network will be.
We adopted three methods of compensation. The first is to simply use the mean of the feature map. The second is to use the feature map of the most similar filter. The third is to compensate using a linear combination of the remaining filters.
6.4.1 Compensation with Mean
Since we are compensating for only a subset of classes, when calculating the mean of the feature maps we should select only the samples that belong to the subset of classes $Y_{\text{target}}$.
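The mean activation for the $j$th channel, $\bar{\mathbf{X}}_j$, is calculated by:
$$\bar{\mathbf{X}}_j = \frac{1}{|\{i \mid y_i \in Y_{\text{target}}\}|} \sum_{\{i \mid y_i \in Y_{\text{target}}\}} \mathbf{X}_{ij}$$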
Once we obtain $\bar{\mathbf{X}}_j$, we use it as the feature map for channel $j$ regardless of the input data.
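The mean compensation above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names, the `(num_samples, num_channels, height, width)` array layout, and the idea of overwriting pruned channels in an already-computed activation tensor are assumptions made for the example, not details taken from the implementation described here.

```python
import numpy as np

def mean_feature_maps(feature_maps, labels, target_classes):
    """Average feature maps over samples whose label is in target_classes.

    feature_maps: array of shape (num_samples, num_channels, height, width)
    labels:       array of shape (num_samples,)
    Returns an array of shape (num_channels, height, width).
    """
    mask = np.isin(labels, list(target_classes))
    if not mask.any():
        raise ValueError("no samples belong to the target classes")
    return feature_maps[mask].mean(axis=0)

def compensate_with_mean(feature_maps, mean_maps, pruned_channels):
    """Return a copy in which every pruned channel holds its precomputed mean map."""
    out = feature_maps.copy()
    # The same mean map is broadcast to every sample in the batch.
    out[:, pruned_channels] = mean_maps[pruned_channels]
    return out
```

Because the mean map does not depend on the input, it only needs to be computed once per target subset and can then be reused for every incoming sample.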
6.4.2 Compensation with Correlated Filters
The first step of this method is to find the most correlated channel for each channel to be pruned. To obtain the correlation measurement, we unroll the feature map of every channel into a vector. Let $C_i(j', j)$ represent the Pearson correlation between the vectors of channels $j'$ and $j$ given input sample $i$. The correlation between channels $j'$ and $j$ is calculated by averaging the per-sample correlations over the samples that belong to the target subset of classes:
$$C(j', j) = \frac{1}{|\{i \mid y_i \in Y_{\text{target}}\}|} \sum_{\{i \mid y_i \in Y_{\text{target}}\}} C_i(j', j)$$
To compensate for a channel $j$ that is to be pruned, we choose the channel $j'$ with the highest correlation value $C(j', j)$.
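The selection step can be illustrated with the following NumPy sketch. The function names and array layout are hypothetical. The sketch also assumes that the replacement is searched only among channels that are kept, and that no channel is constant within a sample (otherwise `np.corrcoef` returns NaN for that pair).

```python
import numpy as np

def mean_channel_correlation(feature_maps, labels, target_classes):
    """Return C with C[a, b] = mean over target samples of the Pearson
    correlation between the unrolled feature maps of channels a and b.

    feature_maps: array of shape (num_samples, num_channels, height, width)
    """
    mask = np.isin(labels, list(target_classes))
    selected = feature_maps[mask]
    n, c = selected.shape[:2]
    flat = selected.reshape(n, c, -1)  # unroll each channel into a vector
    # np.corrcoef treats each row as one variable, so every sample
    # contributes one (c, c) correlation matrix.
    per_sample = np.stack([np.corrcoef(sample) for sample in flat])
    return per_sample.mean(axis=0)

def most_correlated_replacement(corr, pruned_channels):
    """Map each pruned channel j to the kept channel with the highest C(j', j)."""
    pruned = set(pruned_channels)
    kept = [k for k in range(corr.shape[0]) if k not in pruned]
    return {j: kept[int(np.argmax(corr[kept, j]))] for j in pruned_channels}
```

Note that the sketch follows the text in ranking by the signed correlation rather than its absolute value.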
When we compensate channel $j$ using the feature map of another channel $j'$, we are effectively calculating the convolution on channel $j'$ twice: the first time using its own weights $\mathbf{w}_{j'k}$, the second time using the weights of the pruned channel $\mathbf{w}_{jk}$. Such a replacement inevitably introduces errors. However, there is no easy way of modifying the new feature map so that its values are closer to those of the original feature map. By easy, we mean that it could be implemented using operations that already exist in convolutional neural networks. One workaround is to adjust the weights so that the next convolution layer produces closer results.
Following the notation in Section 6.2, let $\mathbf{X}_{j'}$ represent the feature map of the replacement channel that compensates for $\mathbf{X}_j$, and let $\mathbf{w}'_{jk}$ represent the new set of weights for output channel $k$ that we want to solve for analytically. We set our objective to be minimizing the mean squared error after we use the new feature map to calculate the convolution:
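$$\operatorname*{argmin}_{\mathbf{w}'_{jk}} L(\mathbf{w}'_{jk}) = \sum_{\{i \mid y_i \in Y_{\text{target}}\}} \sum_{h=1}^{H} \left(\mathbf{x}_{ijh}^{\top}\mathbf{w}_{jk} - \mathbf{x}_{ij'h}^{\top}\mathbf{w}'_{jk}\right)^2$$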
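By setting the derivative of the loss function to 0,
$$\frac{\partial L(\mathbf{w}'_{jk})}{\partial \mathbf{w}'_{jk}} = -2 \sum_{\{i \mid y_i \in Y_{\text{target}}\}} \sum_{h=1}^{H} \left(\mathbf{x}_{ijh}^{\top}\mathbf{w}_{jk} - \mathbf{x}_{ij'h}^{\top}\mathbf{w}'_{jk}\right)\mathbf{x}_{ij'h} = 0$$
we obtain $W$ equations for the vector $\mathbf{w}'_{jk}$, which has $W$ elements. We can solve for the new weights $\mathbf{w}'_{jk}$ using standard techniques for systems of linear equations.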
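As an illustration, the linear system can be solved with an ordinary least-squares routine. The sketch below assumes that the patch vectors of the pruned and replacement channels have already been stacked row by row into two `(M, W)` matrices, where `M` counts every (sample, position) pair drawn from the target classes. The function name and this stacking convention are hypothetical.

```python
import numpy as np

def solve_adjusted_weights(patches_pruned, patches_replacement, w_jk):
    """Solve for the adjusted weights w'_jk in the least-squares sense.

    patches_pruned:      (M, W) array, rows are the patch vectors of pruned channel j
    patches_replacement: (M, W) array, rows are the patch vectors of replacement channel j'
    w_jk:                (W,) array, original weights from channel j to output channel k
    """
    # Responses the original channel would have produced: x_{ijh}^T w_jk per row.
    target = patches_pruned @ w_jk
    # lstsq minimises ||patches_replacement @ w - target||^2; its stationary
    # point satisfies the W linear equations obtained from the zero derivative.
    w_new, *_ = np.linalg.lstsq(patches_replacement, target, rcond=None)
    return w_new
```

Using `np.linalg.lstsq` rather than explicitly forming and inverting the normal-equation matrix is a design choice of this sketch; it remains usable when the stacked patch matrix is rank deficient, in which case it returns the minimum-norm solution.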
6.4.3 Compensating with Linear Combination of Filters
Approximating the output of filter $j$ with a linear combination of the remaining filters is expressed as follows:
$$\hat{\mathbf{x}}_j = \sum_{k=1}^{N} z_k \beta_{kj} \mathbf{x}_k$$
Here $z_k \in \{0, 1\}$ represents whether or not filter $k$ is to be pruned ($z_k = 0$ means that filter $k$ will be pruned), $N$ is the total number of filters, and $\beta_{kj}$ is the coefficient in the linear combination. We can solve for the best set of coefficients by minimizing the Mean Squared Error (MSE) over the subset of samples belonging to the target classes:
$$\operatorname*{argmin}_{\beta_{kj}} \sum_{\{i \mid y_i \in Y_{\text{target}}\}} \sum_{j=1}^{N} (1 - z_j) \cdot \Big\| \sum_{k=1}^{N} z_k \beta_{kj} \mathbf{x}_{ik} - \mathbf{x}_{ij} \Big\|_2^2 \tag{6.1}$$
Intuitively, the expression represents the difference between $\mathbf{x}_j$ and $\hat{\mathbf{x}}_j$. Further, we only take into consideration samples from the classes that we care about. This minimization can be solved easily by setting the derivatives of the MSE to zero. The seemingly complex process can actually be precomputed. We will talk more about it in Section 6.5.1.
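Because the objective in (6.1) decouples across pruned filters, each pruned filter amounts to an independent least-squares problem over the kept filters. The following NumPy sketch illustrates this under assumed conventions: the function name, the `(num_samples, num_channels, height, width)` layout, and the choice to treat every pixel of every target sample as one row of the regression are illustrative and not taken from the precomputation scheme of Section 6.5.1.

```python
import numpy as np

def linear_combination_coefficients(feature_maps, labels, target_classes, keep_mask):
    """Return beta with beta[k, j] = coefficient of kept filter k when
    approximating pruned filter j. All other entries are zero.

    keep_mask: boolean sequence, True where z_k = 1 (filter k is kept).
    """
    keep_mask = np.asarray(keep_mask, dtype=bool)
    selected = feature_maps[np.isin(labels, list(target_classes))]
    n, c = selected.shape[:2]
    # Columns are filters; rows run over every (sample, pixel) pair.
    flat = selected.reshape(n, c, -1).transpose(0, 2, 1).reshape(-1, c)
    beta = np.zeros((c, c))
    # One lstsq call handles all pruned filters at once, one column each.
    solution, *_ = np.linalg.lstsq(flat[:, keep_mask], flat[:, ~keep_mask], rcond=None)
    beta[np.ix_(keep_mask, ~keep_mask)] = solution
    return beta
```

With `beta` in hand, the approximated maps for a batch `X` can be formed as `np.einsum('nkhw,kj->njhw', X, beta)`; only the columns of pruned filters are non-zero.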