
Wavelet-Based Nonlinearities

compression layer with shape $g_c \in \mathbb{C}^{M \times C_l \times k \times k}$ followed by an expansion layer with shape $g_e \in \mathbb{R}^{C_{l+1} \times M \times 1 \times 1}$ with $M < C_{l+1}$.

6.4.3.2 DeConvolution and Filter Sensitivity

To visualize what the gain layer responds to, we build a deconvolutional system similar to the one described in chapter 4. We present the entire CIFAR-100 validation set to both the reference architecture and the gain1_2_3 architecture, recording which image most strongly excites each channel. We then present that image again, storing the ReLU switches and max pooling locations, zero out all but the single strongest value in the given channel, zero out all other channels, and deconvolve to see the input pattern.
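As a rough illustration of the bookkeeping in this procedure, the sketch below finds, for each channel, the image and spatial position with the strongest activation, and then builds the masked activation tensor that would seed the deconvolution. It is a minimal NumPy sketch and not the implementation used for the experiments: the array layout `(N, C, H, W)` and the helper names `top_activations` and `isolate_activation` are assumptions made for illustration, and the backward pass through the ReLU switches and pooling locations is omitted.

```python
import numpy as np

def top_activations(acts):
    """For each channel, locate the strongest activation over the dataset.

    acts: array of shape (N, C, H, W); this layout is an assumption.
    Returns an int array of shape (C, 3) holding (image, row, col) per channel.
    """
    n, c, h, w = acts.shape
    # Put channels first and flatten the rest so argmax runs per channel.
    flat = acts.transpose(1, 0, 2, 3).reshape(c, -1)
    idx = flat.argmax(axis=1)
    img, row, col = np.unravel_index(idx, (n, h, w))
    return np.stack([img, row, col], axis=1)

def isolate_activation(act, channel, row, col):
    """Zero every value except one entry of one channel.

    act: activations of a single image, shape (C, H, W).
    """
    masked = np.zeros_like(act)
    masked[channel, row, col] = act[channel, row, col]
    return masked

rng = np.random.default_rng(0)
acts = rng.standard_normal((8, 4, 5, 5))   # toy stand-in for layer activations
locs = top_activations(acts)
img, row, col = locs[2]                    # strongest response of channel 2
seed = isolate_activation(acts[img], 2, row, col)
```

Here `seed` contains a single nonzero entry, which is the quantity that would be propagated back to the input.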

The resulting visualizations for the first two layers are shown in Figure 6.10. We show only the top activation for each filter, rather than the top-9. For the second layer filters, we show only 64 of the 128 filter responses.

It is reassuring to see that despite the performance difference between the reference architecture and the gain1_2_3 architecture, the filters are responding to similar shapes. Note that for both the first and second layer responses, the gain layer has a smoother roll-off at the edges of the visualization, whereas the convolutional architecture has more blocky regions of support.

6.5 Wavelet-Based Nonlinearities

Returning to the goals from subsection 6.2.4, the experiments from the previous section have shown that while it is possible to use a wavelet gain layer ($G$) in place of a convolutional layer ($H$), this may come with a small performance penalty. Ignoring this effect for the moment, in this section we continue our investigation into learning in the wavelet domain. In particular, is it possible to replace a pixel domain nonlinearity $\sigma$ with a wavelet-based one $\sigma_w$?

But what sensible nonlinearity should be used? Two particular options are good initial candidates:

1. The ReLU: this is a mainstay of most modern neural networks and has proved invaluable in the pixel domain. Its pseudo-nonlinearity ($\mathrm{ReLU}(Ax) = A\,\mathrm{ReLU}(x)$ for a scale factor $A \geq 0$) makes learning less dependent on signal amplitudes. Perhaps its sparsifying properties will work well on wavelet coefficients too.

2. Thresholding: a technique commonly applied to wavelet coefficients for denoising and compression. Many proponents of compressed sensing and dictionary learning even like to compare soft thresholding to a two-sided ReLU [148], [149].

[Figure 6.10 panels: (a) First layer, (b) Second layer; each panel shows conv (left) and gain (right).]

Figure 6.10: Deconvolution reconstructions for the reference architecture and purely gain layer architecture. Visualizations using a DeConvNet method similar to the one described in chapter 4. Here we find the input images in the CIFAR-100 validation set that most highly activate each filter. Each image is then re-shown to the network and the meta-information is used to prime the DeConvNet to create the visualizations seen here. The left column has visualizations for the first and second layer filters for the all convolutional method, and the right column has visualizations for the first and second layer filters for the all gain layer method. Note the smoother roll-off at the edge of visualizations in the gain layer compared to the rectangular support regions for the conv layers. Aside from that, the two networks appear to be learning similar shapes.

In this section, we will look at both and see if they improve the gain layer. If they do, it would then be possible to connect multiple layers in the wavelet domain, avoiding the need for inverse wavelet transforms after learning.

6.5.1 ReLUs in the Wavelet Domain

Applying the ReLU to the real lowpass coefficients is not difficult, but it does not generalize so easily to complex coefficients. The simplest option is to apply it independently to the real and imaginary coefficients, effectively only selecting one quadrant of the complex plane:

$$u_{lp} = \max(0, v_{lp}) \tag{6.5.1}$$

$$u_j = \max(0, \operatorname{Re}(v_j)) + j\max(0, \operatorname{Im}(v_j)) \tag{6.5.2}$$

Another option is to apply it to the magnitude of the bandpass coefficients. Of course, these are all non-negative, so the ReLU on its own would not do anything. However, they can be arbitrarily scaled and shifted by a batch normalization layer. The magnitude could then shift to (invalid) negative values, which are rectified by the ReLU.
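The first option, (6.5.1) and (6.5.2), can be sketched in a few lines of NumPy. This is only an illustration under the assumption that the coefficients are held in ordinary real and complex arrays; the function names `relu_lowpass` and `relu_real_imag` are hypothetical.

```python
import numpy as np

def relu_lowpass(v_lp):
    """ReLU on real lowpass coefficients, as in (6.5.1)."""
    return np.maximum(0.0, v_lp)

def relu_real_imag(v):
    """ReLU applied separately to real and imaginary parts, as in (6.5.2)."""
    return np.maximum(0.0, v.real) + 1j * np.maximum(0.0, v.imag)

v = np.array([1 + 2j, -1 + 2j, -1 - 2j, 1 - 2j])  # one point per quadrant
u = relu_real_imag(v)
```

Only the first-quadrant point passes through unchanged; the others are projected onto the positive imaginary axis, the origin, and the positive real axis respectively.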

Dropping the scale subscript $j$ for clarity (we need $j$ for the square root of $-1$), let a bandpass coefficient at a given scale be $v = r_v e^{j\theta_v}$ and define $\mu_r = \mathbb{E}[r_v]$ and $\sigma_r^2 = \mathbb{E}[(r_v - \mu_r)^2]$. Applying batch normalization and the ReLU to the magnitude of $v$ then gives:

$$r_u = \mathrm{ReLU}(\mathrm{BN}(r_v)) = \max\left(0,\; \gamma\,\frac{r_v - \mu_r}{\sigma_r} + \beta\right) \tag{6.5.3}$$

$$u = r_u e^{j\theta_v} \tag{6.5.4}$$
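A minimal sketch of (6.5.3) and (6.5.4) follows. It assumes an `(N, C, H, W)` array layout with per-channel statistics taken over the batch and spatial axes, as is usual for batch normalization; the small constant `eps` and the function name `magnitude_bn_relu` are additions for illustration and do not appear in the equations above.

```python
import numpy as np

def magnitude_bn_relu(v, gamma, beta, eps=1e-5):
    """Batch-normalize and rectify the magnitude of v, keeping its phase.

    v: complex array, shape (N, C, H, W) (assumed layout).
    gamma, beta: per-channel scale and shift, shape (C,).
    eps: numerical-stability constant (an assumption, not in (6.5.3)).
    """
    r = np.abs(v)
    theta = np.angle(v)
    # Per-channel statistics over the batch and spatial axes.
    mu = r.mean(axis=(0, 2, 3), keepdims=True)
    sigma = np.sqrt(r.var(axis=(0, 2, 3), keepdims=True) + eps)
    g = gamma.reshape(1, -1, 1, 1)
    b = beta.reshape(1, -1, 1, 1)
    r_u = np.maximum(0.0, g * (r - mu) / sigma + b)   # (6.5.3)
    return r_u * np.exp(1j * theta)                   # (6.5.4)

rng = np.random.default_rng(0)
v = rng.standard_normal((16, 3, 4, 4)) + 1j * rng.standard_normal((16, 3, 4, 4))
u = magnitude_bn_relu(v, gamma=np.ones(3), beta=np.zeros(3))
```

With `gamma = 1` and `beta = 0`, every coefficient whose magnitude is below the channel mean is set to zero, and the surviving coefficients keep their original phase.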

This works equivalently on the lowpass coefficients, although $v_{lp}$ can be negative, unlike $r_v$:

$$u_{lp} = \mathrm{ReLU}(\mathrm{BN}(v_{lp})) = \max\left(0,\; \gamma\,\frac{v_{lp} - \mu_{lp}}{\sigma_{lp}} + \beta'\right) \tag{6.5.5}$$

6.5.2 Thresholding

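For $t \in \mathbb{R}$ and $z = re^{j\theta} \in \mathbb{C}$, the pointwise hard thresholding is:

$$H(z, t) = \begin{cases} z & r \geq t \\ 0 & r < t \end{cases} \tag{6.5.6}$$

and the pointwise soft thresholding is:

$$S(z, t) = \begin{cases} (r - t)e^{j\theta} & r \geq t \\ 0 & r < t \end{cases} \tag{6.5.8}$$

$$= \max(0, r - t)\,e^{j\theta} \tag{6.5.9}$$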
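The two thresholding rules can be written directly from their definitions. The following is a small NumPy sketch with hypothetical function names `hard_threshold` and `soft_threshold`, applied to three sample coefficients.

```python
import numpy as np

def hard_threshold(z, t):
    """Keep z where |z| >= t and zero it elsewhere, as in (6.5.6)."""
    return np.where(np.abs(z) >= t, z, 0)

def soft_threshold(z, t):
    """Shrink the magnitude of z by t and keep the phase, as in (6.5.9)."""
    r = np.abs(z)
    return np.maximum(0.0, r - t) * np.exp(1j * np.angle(z))

z = np.array([3 + 4j, 0.3 + 0.4j, -6j])  # magnitudes 5, 0.5, 6
h = hard_threshold(z, 1.0)
s = soft_threshold(z, 1.0)
```

Both rules zero the middle coefficient, whose magnitude is below the threshold. Hard thresholding leaves the other two untouched, while soft thresholding reduces their magnitudes from 5 and 6 to 4 and 5 without changing their phase.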

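Note that (6.5.9) is very similar to (6.5.3) and (6.5.4). We can rewrite (6.5.3) by taking the strictly positive terms $\gamma$, $\sigma_r$ outside of the max operator:

$$r_u = \max\left(0,\; \gamma\,\frac{r_v - \mu_r}{\sigma_r} + \beta\right) \tag{6.5.10}$$

$$= \frac{\gamma}{\sigma_r}\max\left(0,\; r_v - \left(\mu_r - \frac{\sigma_r \beta}{\gamma}\right)\right) \tag{6.5.11}$$

Then, if $t' = \mu_r - \frac{\sigma_r \beta}{\gamma} > 0$, doing batch normalization followed by a ReLU on the magnitude of the complex coefficients is the same as soft shrinkage with threshold $t'$, scaled by a factor $\frac{\gamma}{\sigma_r}$.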
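The equivalence between (6.5.10) and (6.5.11) can be checked numerically. The sketch below uses randomly generated magnitudes and arbitrary values of `gamma` and `beta`, chosen only so that `gamma > 0` and the resulting threshold is positive; none of these values come from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
r_v = np.abs(rng.standard_normal(1000) + 1j * rng.standard_normal(1000))

mu_r, sigma_r = r_v.mean(), r_v.std()
gamma, beta = 1.5, -0.3          # arbitrary illustrative values, gamma > 0

# Batch normalization followed by a ReLU on the magnitude, as in (6.5.10).
bn_relu = np.maximum(0.0, gamma * (r_v - mu_r) / sigma_r + beta)

# Soft shrinkage with threshold t' and scale gamma / sigma_r, as in (6.5.11).
t_prime = mu_r - sigma_r * beta / gamma
soft = (gamma / sigma_r) * np.maximum(0.0, r_v - t_prime)
```

The arrays `bn_relu` and `soft` agree to floating-point precision, which is the identity stated above.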

The same analogy does not apply to the lowpass coefficients, as $v_{lp}$ is not strictly positive.

While soft thresholding is similar to batch normalization followed by a ReLU, we would also like to test how well it performs as a sparsity-inducing wavelet nonlinearity. To do this, we can:

• Learn the threshold t

• Adapt t as a function of the distribution of activations to achieve a desired sparsity level.

In early experiments, we found that trying to set desired sparsity levels by tracking the standard deviation of the statistics and setting a threshold as a function of it performed very poorly (causing a drop in top-1 accuracy of at least 10%). Instead, we choose to learn a threshold t. We make this an unconstrained optimization problem by changing (6.5.9) to:

$$S(v, t) = \max(0, r - |t|)\,e^{j\theta} \tag{6.5.12}$$

Learning a threshold is only possible for soft thresholding, as $\frac{\partial L}{\partial t}$ is not defined for hard thresholding. Like batch normalization, we learn independent thresholds $t$ for each channel.
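To make the per-channel learned threshold concrete, here is a minimal NumPy sketch of the forward pass of (6.5.12) together with the gradient with respect to $t$ of a toy objective, the sum of output magnitudes. The toy objective, the `(N, C, H, W)` layout, and the function names are assumptions made for illustration; in practice the gradient would come from the training loss via automatic differentiation.

```python
import numpy as np

def soft_threshold_learned(v, t):
    """Soft thresholding with one unconstrained threshold per channel, (6.5.12).

    v: complex array, shape (N, C, H, W) (assumed layout).
    t: real array, shape (C,); only |t| is used, so t may take either sign.
    """
    r = np.abs(v)
    thresh = np.abs(t).reshape(1, -1, 1, 1)
    return np.maximum(0.0, r - thresh) * np.exp(1j * np.angle(v))

def grad_magnitude_sum_wrt_t(v, t):
    """Gradient of sum(|S(v, t)|) with respect to t, per channel.

    Each coefficient with r > |t| contributes -sign(t); the rest contribute 0.
    """
    r = np.abs(v)
    active = r > np.abs(t).reshape(1, -1, 1, 1)
    return -np.sign(t) * active.sum(axis=(0, 2, 3))

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 3, 8, 8)) + 1j * rng.standard_normal((4, 3, 8, 8))
t = np.array([0.5, -1.0, 2.0])
u = soft_threshold_learned(v, t)
g = grad_magnitude_sum_wrt_t(v, t)
```

Because only $|t|$ enters the forward pass, `t` and `-t` give the same output, so the optimizer is free to move `t` anywhere on the real line.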

6.6 Gain Layer Nonlinearity Experiments

Taking the same ‘gain1_2_3’ architecture used for CIFAR-100, we expand the wavelet gain layer by including nonlinearities as described in Algorithm 6.1. In this layer, we have three different nonlinearities: the pixel, the lowpass, and the bandpass nonlinearity.

For these experiments, we test over a grid of possible options for these three functions: where:
