
6.2.1 Related Work

Fujieda et al. use a DWT in combination with a CNN to do texture classification and image annotation [143], [144]. They take a multiscale wavelet transform of the input image, combine the activations at each scale independently with learned weights, and feed these back into the network at locations where the activation resolution matches the subband resolution. The architecture block diagram is shown in Figure 6.1, taken from the original paper. They found that their ‘Wavelet-CNN’ could outperform competitive non-wavelet-based CNNs on both texture classification and image annotation.

Several works also use wavelets in deep neural networks for super-resolution [145] and for adding detail back into dense pixel-wise segmentation tasks [146]. These typically save wavelet coefficients and use them for the reconstruction phase.

In [147], Rippel, Snoek, and Adams parameterize filters in the DFT domain. Rather than having a pixel-domain filter $w \in \mathbb{R}^{F \times C \times K \times K}$, they learn a set of Fourier coefficients $\hat{w} \in \mathbb{C}^{F \times C \times K \times \lceil K/2 \rceil}$ (the reduced spatial size is a result of enforcing that the inverse DFT of the filter is real, so the parameterization is symmetric). On the forward pass of the neural network, they take the inverse DFT of $\hat{w}$ to obtain $w$ and then convolve this with the input $x$ as a normal CNN would do.¹
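The forward-pass step described above can be sketched in NumPy. This is only an illustration under stated assumptions, not the implementation from [147]: the sizes `F`, `C`, `K` and the random initialization are hypothetical, and `K` is chosen odd so that NumPy's half-spectrum length `K // 2 + 1` coincides with $\lceil K/2 \rceil$ (for even `K` NumPy stores one more column than $\lceil K/2 \rceil$).

```python
import numpy as np

rng = np.random.default_rng(0)
F, C, K = 4, 3, 3  # hypothetical sizes; K odd so K // 2 + 1 == ceil(K / 2)

# Learnable parameters: complex Fourier coefficients, half spectrum on the last axis.
shape = (F, C, K, K // 2 + 1)
w_hat = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Forward pass: the inverse real DFT over the two spatial axes gives a real K x K filter,
# which would then be used in an ordinary pixel-domain convolution.
w = np.fft.irfft2(w_hat, s=(K, K), axes=(-2, -1))

# Going to the half spectrum and back reproduces the same real filter.
w_roundtrip = np.fft.irfft2(np.fft.rfft2(w), s=(K, K))
```

The half spectrum is sufficient because a real filter has a conjugate-symmetric DFT, which is the symmetry referred to in the text.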

An important point about reparameterizing filters in either the wavelet or Fourier domain deserves emphasis: many linear transforms of the parameter space will not change the parameter updates if a linear optimization scheme is used (for example, standard GD or SGD with momentum). Rippel, Snoek, and Adams mention in their work that this holds

¹ The convolution may be done by taking both the image and filter back into the Fourier space, but this is typically decided by the framework, which selects the optimal convolution strategy for the filter and input size. Note that there is not necessarily a saving to be gained by forcing it to do convolution by a product of FFTs, as the FFT size needed for the filter will likely be larger than $K \times K$, which would require resampling the coefficients.

[Figure 6.1 diagram: the input image is decomposed over four scales by low-pass filters $k_{l,1}, \ldots, k_{l,4}$ and high-pass filters $k_{h,1}, \ldots, k_{h,4}$; the subbands are processed by convolutional layers and merged by channel-wise concatenation (+) into a CNN of Conv 64 to Conv 512 blocks with projection shortcuts, followed by average pooling and a fully connected layer.]

Figure 6.1: Architecture using the DWT as a frontend to a CNN. Figure 1 from [144]. Fujieda et al. take a multiscale wavelet decomposition of the input before passing the input through a standard CNN. They learn convolutional layers independently on each subband and feed these back into the network at different depths, where the resolution of the subband and the network activations match.

for all invertible transforms, but this is not strictly true, and we prove in appendix C that it only holds for tight frames. We make this point because a natural extension of the work in [147] would be to parameterize filters in the wavelet domain, taking inverse transforms and then doing normal convolution.
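A small numerical check can make the tight-frame condition concrete. The sketch below is not the proof from appendix C; it assumes the simplest setting of a square real synthesis matrix `A` with $w = A\hat{w}$ and a single plain gradient descent step. By the chain rule the step taken on $\hat{w}$ moves $w$ by $-\eta A A^T \nabla_w L$, which matches the pixel-domain step only when $A A^T$ is a multiple of the identity. All names here are hypothetical.

```python
import numpy as np

def step_in_transform_domain(A, w, grad, lr):
    """One GD step on w_hat, where w = A @ w_hat; returns the resulting w."""
    w_hat = np.linalg.solve(A, w)        # current parameters in the transform domain
    grad_hat = A.T @ grad                # chain rule: dL/dw_hat = A^T dL/dw
    return A @ (w_hat - lr * grad_hat)   # equals w - lr * (A @ A.T) @ grad

rng = np.random.default_rng(0)
n, lr = 6, 0.1
w = rng.standard_normal(n)
grad = rng.standard_normal(n)            # stands in for dL/dw at the current w
pixel_step = w - lr * grad

# Orthogonal matrix: Q Q^T = I, the square case of a tight frame.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
# Invertible but not tight: M M^T = Q D^2 Q^T with unequal diagonal entries.
M = Q @ np.diag(np.linspace(0.5, 2.0, n))

same_for_orthogonal = np.allclose(step_in_transform_domain(Q, w, grad, lr), pixel_step)
same_for_generic = np.allclose(step_in_transform_domain(M, w, grad, lr), pixel_step)
```

With an orthogonal transform the two updates coincide, whereas the merely invertible `M` rescales the gradient differently along different directions.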

While [147] was an inspiration for this chapter, the work we present here is not a reparameterization in the wavelet domain with convolution in the pixel domain. Instead, we learn wavelet filters and perform filtering in the wavelet domain too.

6.2.2 Notation

We make use of the 2-D Z-transform to simplify our analysis:

$$X(\mathbf{z}) = \sum_{n_1} \sum_{n_2} x[n_1, n_2]\, z_1^{-n_1} z_2^{-n_2} = \sum_{\mathbf{n}} x[\mathbf{n}]\, \mathbf{z}^{-\mathbf{n}} \tag{6.2.1}$$

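As we are working with three-dimensional and four-dimensional arrays but are only doing convolution in two dimensions, we introduce a slightly modified 2-D Z-transform which includes the channel index $c$ and the filter number $f$:

$$X(c, \mathbf{z}) = \sum_{n_1} \sum_{n_2} x[c, n_1, n_2]\, z_1^{-n_1} z_2^{-n_2} = \sum_{\mathbf{n}} x[c, \mathbf{n}]\, \mathbf{z}^{-\mathbf{n}} \tag{6.2.2}$$

$$H(f, c, \mathbf{z}) = \sum_{n_1} \sum_{n_2} h[f, c, n_1, n_2]\, z_1^{-n_1} z_2^{-n_2} = \sum_{\mathbf{n}} h[f, c, \mathbf{n}]\, \mathbf{z}^{-\mathbf{n}} \tag{6.2.3}$$

We then define the product of these new Z-transform signals to be the channel-wise convolution. For example, for the 4-D filter $h[f, c, \mathbf{n}]$ with Z-transform $H(f, c, \mathbf{z})$ and the 3-D signal $x[c, \mathbf{n}]$ with Z-transform $X(c, \mathbf{z})$, we define the product of the two Z-transforms as:

$$X(c, \mathbf{z}) H(f, c, \mathbf{z}) = \sum_{\mathbf{n}} \left( \sum_{\mathbf{k}} h[f, c, \mathbf{n} - \mathbf{k}]\, x[c, \mathbf{k}] \right) \mathbf{z}^{-\mathbf{n}} \tag{6.2.4}$$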
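The identity between the Z-transform product and convolution can be checked numerically for a single channel and filter. The sketch below is only an illustration: the helper names are hypothetical, and it assumes finite-support arrays indexed from zero with negative exponents in the Z-transform (the identity holds equally with positive exponents).

```python
import numpy as np

def conv2d_full(x, h):
    """Direct evaluation of y[n] = sum_k h[n - k] x[k] for two 2-D arrays."""
    H, W = x.shape
    K1, K2 = h.shape
    y = np.zeros((H + K1 - 1, W + K2 - 1))
    for k1 in range(H):
        for k2 in range(W):
            # x[k] contributes x[k] * h[m] to the output index n = k + m
            y[k1:k1 + K1, k2:k2 + K2] += x[k1, k2] * h
    return y

def ztransform(a, z1, z2):
    """Evaluate A(z) = sum_n a[n1, n2] z1^(-n1) z2^(-n2) at a single point (z1, z2)."""
    n1 = np.arange(a.shape[0], dtype=float)[:, None]
    n2 = np.arange(a.shape[1], dtype=float)[None, :]
    return np.sum(a * z1 ** (-n1) * z2 ** (-n2))

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 6))
h = rng.standard_normal((3, 3))
z1, z2 = 0.7 + 0.2j, 1.1 - 0.4j   # arbitrary nonzero evaluation point

lhs = ztransform(x, z1, z2) * ztransform(h, z1, z2)   # X(z) H(z)
rhs = ztransform(conv2d_full(x, h), z1, z2)           # Z-transform of the convolution
```

The two values `lhs` and `rhs` agree to numerical precision, which is the single-channel case of the product defined in the text.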

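Recall from subsection 2.4.1 that a typical convolutional layer in a standard CNN computes the next layer’s output in a two-step process:

$$y^{(l+1)}[f, \mathbf{n}] = \sum_{c=0}^{C_l - 1} x^{(l)}[c, \mathbf{n}] \ast h^{(l)}[f, c, \mathbf{n}] \tag{6.2.5}$$

$$x^{(l+1)}[f, \mathbf{n}] = \sigma\left(y^{(l+1)}[f, \mathbf{n}]\right) \tag{6.2.6}$$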
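The two-step process can be written out directly in NumPy. This is a deliberately slow, loop-based sketch for clarity rather than an efficient implementation; it assumes zero padding with odd kernel sizes so that the output keeps the input resolution, and it uses a ReLU as an assumed choice of $\sigma$. The function names are hypothetical.

```python
import numpy as np

def conv2d_same(x, h):
    """Same-size 2-D convolution of one channel, zero padded (odd kernel sizes assumed)."""
    K1, K2 = h.shape
    xp = np.pad(x, ((K1 // 2, K1 // 2), (K2 // 2, K2 // 2)))
    hf = h[::-1, ::-1]  # flip the kernel: true convolution, not cross-correlation
    y = np.zeros(x.shape)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            y[i, j] = np.sum(xp[i:i + K1, j:j + K2] * hf)
    return y

def conv_layer(x, h):
    """x: (C, H, W), h: (F, C, K, K). Returns sigma(y) with y[f] = sum_c x[c] * h[f, c]."""
    F, C = h.shape[:2]
    y = np.zeros((F,) + x.shape[1:])
    for f in range(F):
        for c in range(C):
            y[f] += conv2d_same(x[c], h[f, c])   # step 1: sum of channel-wise convolutions
    return np.maximum(y, 0.0)                    # step 2: pointwise nonlinearity (ReLU assumed)

rng = np.random.default_rng(0)
x = rng.random((2, 5, 5))              # non-negative input, so the ReLU leaves sums unchanged
delta = np.zeros((3, 2, 3, 3))
delta[:, :, 1, 1] = 1.0                # unit impulse at the kernel centre
out = conv_layer(x, delta)             # each output channel is the sum over input channels
```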

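If we define the nonlinearity $\sigma_z$ to be the action of $\sigma$ on each z-coefficient of the polynomial $Y(f, \mathbf{z})$, then we can rewrite (6.2.5) and (6.2.6) as:

$$Y^{(l+1)}(f, \mathbf{z}) = \sum_{c=0}^{C_l - 1} X^{(l)}(c, \mathbf{z})\, H^{(l)}(f, c, \mathbf{z}) \tag{6.2.7}$$

$$X^{(l+1)}(f, \mathbf{z}) = \sigma_z\left(Y^{(l+1)}(f, \mathbf{z})\right) \tag{6.2.8}$$

6.2.3 DTCWT Notation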

In this chapter we work with many DTCWT coefficients, so we define some slightly new notation here.

A $J$-scale 2-D DTCWT gives $6J + 1$ sets of coefficients: 6 sets of complex bandpass coefficients for each scale (representing the oriented bands from 15 to 165 degrees) and 1 set of real lowpass (lp) coefficients.

[Figure 6.2 diagram: three panels, each showing pixel-domain signals $x^{(1)}, y^{(2)}, x^{(2)}, y^{(3)}, x^{(3)}$ above their wavelet-domain counterparts $u^{(1)}, v^{(2)}, u^{(2)}, v^{(3)}, u^{(3)}$, linked by $W$ and $W^{-1}$. (a) A regular convolutional layer. (b) Gain applied in the wavelet domain. (c) Gain and nonlinearity applied in the wavelet domain.]

Figure 6.2: Proposed new forward pass in the wavelet domain. Two network layers with some possible options for processing. Solid lines denote the evaluation path and dashed lines indicate relationships. In (a) we see a regular convolutional neural network. We have included the dashed lines to make clear what we are denoting as $u$ and $v$ with respect to their equivalents $x$ and $y$. In (b) we get to $y^{(2)}$ through a different path. First, we take the wavelet transform of $x^{(1)}$ to give $u^{(1)}$, apply a wavelet gain layer $G^{(1)}$, and take the inverse wavelet transform to give $y^{(2)}$. The dotted line for $H^{(1)}$ indicates that this path is no longer present. Note that there may not be any possible $G^{(1)}$ to make $y^{(2)}$ from (b) equal $y^{(2)}$ from (a). In (c) we have stayed in the wavelet domain longer and applied a wavelet nonlinearity $\sigma_w$ to give $u^{(2)}$. We then return to the pixel domain to give $x^{(2)}$ and continue on from there in the pixel domain.

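These coefficients have sizes:

$$u_{lp} \in \mathbb{R}^{N \times C \times \frac{H}{2^{J-1}} \times \frac{W}{2^{J-1}}} \tag{6.2.10}$$

$$u_{j,k} \in \mathbb{C}^{N \times C \times \frac{H}{2^{j}} \times \frac{W}{2^{j}}} \tag{6.2.11}$$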

Recall that the lowpass coefficients are twice as large as in a fully decimated transform due to the interleaving of the four lowpass terms in Algorithm 3.3.
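As a quick sanity check on these sizes, the following sketch tabulates the subband shapes for a given input. It is an illustration only: the function name is hypothetical, it assumes $H$ and $W$ are divisible by $2^J$, and it assumes that the bandpass subbands at scale $j$ are decimated by $2^j$ while the lowpass is decimated by $2^{J-1}$.

```python
def dtcwt_coeff_shapes(N, C, H, W, J):
    """Return (lowpass_shape, bandpass_shapes) for a J-scale 2-D DTCWT.

    bandpass_shapes maps (j, k) -> shape, for scales j = 1..J and orientations k = 1..6.
    Assumes H and W are divisible by 2**J.
    """
    lowpass = (N, C, H // 2 ** (J - 1), W // 2 ** (J - 1))   # real, twice the decimated size
    bandpass = {
        (j, k): (N, C, H // 2 ** j, W // 2 ** j)             # complex
        for j in range(1, J + 1)
        for k in range(1, 7)
    }
    return lowpass, bandpass

lowpass, bandpass = dtcwt_coeff_shapes(N=1, C=3, H=32, W=32, J=2)
num_sets = 1 + len(bandpass)   # 6J + 1 sets of coefficients in total
```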

If we ever want to refer to all the subbands at a given scale, we will drop the $k$ subscript and write $u_j$.

6.2.4 Learning in Multiple Spaces

At the beginning of each layer $l$ of a neural network, we have the activations $x^{(l)}$. Naturally, all of these activations have their equivalent wavelet coefficients $u^{(l)}$.

From (6.2.5), convolutional layers also have intermediate activations $y^{(l)}$. Let us distinguish these from the $x$ coefficients and modify (6.2.9) to say that the DTCWT of $y^{(l)}$ gives $v^{(l)}$.

We now propose the wavelet gain layer $G$. The gain layer $G$ can be used instead of a convolutional layer. It is designed to work on the wavelet coefficients of an activation, $u$, to give wavelet-domain outputs $v$.

This can be seen as breaking the convolutional path in Figure 6.2 and taking a new route to get the next layer’s coefficients. From here, we can return to the pixel domain by taking the corresponding inverse wavelet transform $W^{-1}$. Alternatively, we can stay in the wavelet domain and apply wavelet-based nonlinearities, $\sigma_{lp}$ and $\sigma_{bp}$ for the lowpass and bandpass coefficients respectively, to give $u^{(l+1)}$.
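The route through the wavelet domain, that is transform, gain, then inverse transform, can be sketched end to end. The code below is a toy illustration under strong assumptions and is not the layer developed in this chapter: it substitutes a single-level orthonormal Haar transform for the DTCWT, and it takes the gain to be a real channel-mixing matrix applied independently at every position of each subband. All function names and shapes are hypothetical.

```python
import numpy as np

def haar_forward(x):
    """Single-level orthonormal 2-D Haar transform of x with shape (C, H, W), H and W even.
    A stand-in for W; this is NOT the DTCWT. Returns subbands of shape (4, C, H/2, W/2)."""
    a, b = x[:, 0::2, 0::2], x[:, 0::2, 1::2]
    c, d = x[:, 1::2, 0::2], x[:, 1::2, 1::2]
    return np.stack([a + b + c + d, a - b + c - d, a + b - c - d, a - b - c + d]) / 2.0

def haar_inverse(u):
    """Inverse of haar_forward; u has shape (4, C, h, w)."""
    ll, lh, hl, hh = u
    C, h, w = ll.shape
    x = np.empty((C, 2 * h, 2 * w))
    x[:, 0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[:, 0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[:, 1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[:, 1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def gain_layer(u, g):
    """Mix channels independently in each subband (an assumed, simplified form of G).
    u: (S, C, h, w) subbands; g: (S, F, C), one gain matrix per subband."""
    return np.einsum('sfc,schw->sfhw', g, u)

def wavelet_gain_forward(x, g):
    """x -> W -> G -> W^{-1} -> y, replacing the pixel-domain convolution."""
    return haar_inverse(gain_layer(haar_forward(x), g))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
g = rng.standard_normal((4, 5, 3))        # 4 subbands, 5 output channels, 3 input channels
y = wavelet_gain_forward(x, g)            # shape (5, 8, 8)

g_identity = np.stack([np.eye(3)] * 4)    # identity gains: the layer should return x
y_identity = wavelet_gain_forward(x, g_identity)
```

With identity gains the output equals the input, confirming that the stand-in transform is perfectly reconstructing; any learned behaviour therefore comes entirely from the gains.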

Ultimately we would like to explore architecture design with arbitrary sections in the wavelet and pixel domain, but to do this we must first explore:

1. How effective is a wavelet gain layer $G$ at replacing a standard convolutional layer $H$?

2. What are effective wavelet nonlinearities $\sigma_{lp}$ and $\sigma_{bp}$?