
the smallest non-zero eigenvalue and 0) is important to protect against overfitting.

This is something that we have not enforced or considered in the design of our wavelet-based layers, and it would be an interesting extension.

7.3

Final Remarks

It is our intuition that complex wavelets in a ScatterNet style system perform well in CNNs at locations where we want to reduce the sample rate, as they can nicely demodulate regions of the frequency space to lower frequencies. We also believe that using complex wavelets without taking the complex modulus is beneficial at locations where we want filters with large spatial support, something that is particularly useful in the early layers of CNNs. However, the current trend in CNNs is shifting away from these uses. Modern architectures typically build many layers with very small spatial support filters (usually 3×3 and often 1×1) and lots of mixing and combining of the channels. For example, the recent Wide ResNet [70] (one of the best modern methods) has close to 1000 channels in the later layers.

We believe that the work in this thesis has opened up a rich vein of ideas and a new perspective on modern CNN methods. We have found there to be some performance advantages to redesigning CNNs with complex wavelets, as well as other, less measurable advantages, such as the ability to determine that certain orientations and frequency regions are less important than others (see subsection 4.6.2 and subsection 6.4.3) or the ability to have smooth roll-off in the support of filters. There is still much more work to be done: the learning efficiency of CNNs must be improved, and we need a greater understanding of their operation and outputs, if they are to be widely used in the future.

Appendix A

Architecture Used for Experiments

The experiments for this thesis were run on a single server with 8 GPUs and 14-core CPUs. The GPUs were NVIDIA GeForce GTX 1080 cards (released in May 2016), each with 8 GiB of RAM, 2560 CUDA cores and 320 GB/s memory bandwidth. The CPUs were Intel(R) Xeon(R) E5-2660 models.

At the completion of the project, we were running CUDA 10.0 with cuDNN 7.6 and PyTorch version 1.1.

For hyperparameter search we used the Tune package [139], which we highly recommend, as it makes running trials in parallel very easy.

A.1

Run Times of some of the Proposed Layers

Throughout the main body of the thesis, we derive theoretical computational costs for many of our methods and compare these to convolutional operations. While this is useful to give a rough guide about the cost of our methods, we give experimental values here.

The numbers in the tables are calculated by running the specified input through our layer five times and then averaging the values. Timings were done by using NVIDIA’s ‘nvprof’ command, which allows us to get millisecond timing on kernel execution times.
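The averaging procedure described above can be sketched in a few lines. This is only an illustration of "run five times and average": it uses a host-side wall-clock timer from the Python standard library rather than `nvprof`, and the function name `average_runtime_ms` and its arguments are hypothetical, not part of the thesis code.

```python
import time
from statistics import mean


def average_runtime_ms(fn, n_runs=5):
    """Call fn() n_runs times and return the mean wall-clock time in ms.

    Assumption: fn is a zero-argument callable that runs the layer once.
    This measures host-side elapsed time, not GPU kernel execution time.
    """
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1e3)
    return mean(timings)


# Example: time a small CPU-only workload as a stand-in for a layer.
elapsed = average_runtime_ms(lambda: sum(i * i for i in range(10_000)))
print(f"mean over 5 runs: {elapsed:.3f} ms")
```

Because CUDA kernels are launched asynchronously, a host timer like this one would not capture kernel execution times on a GPU without explicit synchronisation, which is one reason a profiler is the more appropriate tool for the measurements reported here.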

We test the effect of changing the spatial size for a constant batch and channel size in Table A.1, and we test the effect of changing the channel dimension size for constant batch and spatial size in Table A.2. Our reference is a 10×10 convolutional layer that does not do mixing across the channels. We compare the run time of this operation to each of our layers on an input of size C×H×W.
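To make the reference operation concrete, the following is a minimal pure-Python sketch of a convolution that filters each channel with its own kernel and never mixes channels. The function name `depthwise_conv2d`, the 'valid' boundary handling and the absence of a kernel flip are all illustrative assumptions; the text does not say how the reference layer was implemented.

```python
def depthwise_conv2d(x, kernels):
    """Filter each channel of x with its own kernel; channels are not mixed.

    x:       list of C channels, each an H x W list of lists
    kernels: list of C kernels, each a k x k list of lists
    Assumptions: 'valid' boundaries and no kernel flip (cross-correlation).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        rows = len(channel) - k + 1
        cols = len(channel[0]) - k + 1
        out.append([
            [
                # k * k multiplies per output sample, all from one channel
                sum(channel[i + a][j + b] * kernel[a][b]
                    for a in range(k) for b in range(k))
                for j in range(cols)
            ]
            for i in range(rows)
        ])
    return out


# Two 3x3 channels, each filtered by its own 2x2 kernel.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
kernels = [
    [[1, 0], [0, 1]],
    [[2, 2], [2, 2]],
]
y = depthwise_conv2d(x, kernels)
```

Each output sample costs k×k multiplies, so a 10×10 kernel needs 100 multiplies per sample regardless of the number of channels. In PyTorch such a layer would typically be expressed through the `groups` argument of `torch.nn.Conv2d`; that mapping is our assumption rather than something stated in the text.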

Using results from section 3.5 (for the DTCWT and ScatterNet), subsection 5.5.3 (for the invariant layer) and subsubsection 6.3.4.3 (for the gain layer), the theoretical computational costs for the tested layers for an input with size C×H×W are:

Table A.1: Run times for different layers with increasing spatial size. Input size is 32×32×H×H where H is the column heading listed below. Run times are in milliseconds, averaged over five runs.

| Spatial Size | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|
| Conv10x10 | 0.2 | 0.8 | 6.2 | 22.4 | 112 |
| DTCWT | 0.5 | 2.0 | 7.6 | 29.4 | 118 |
| DTCWT⁻¹ | 0.6 | 2.1 | 8.1 | 33.3 | 123 |
| Scatter | 0.6 | 2.1 | 8.7 | 31.8 | 125 |
| Invariant | 0.7 | 2.4 | 9.6 | 37.4 | 144 |
| Gain (J = 1) | 1.5 | 5.7 | 21.6 | 80 | 336 |

• 10×10 Convolution: 100 multiplies per input pixel

• DTCWT with J = 1: 36 multiplies per input pixel (see Algorithm 3.3)

• DTCWT⁻¹ with J = 1: 36 multiplies per input pixel (see Algorithm B.2)

• DTCWT ScatterNet with J = 1: 39 multiplies per input pixel (see Algorithm 3.5)

• Invariant Layer with square A matrix: 74C + 36 multiplies per input pixel (see Algorithm 5.1)

• DTCWT Gain Layer with J = 1: 7C + 72 multiplies per input pixel (see Algorithm 6.1)

While we were able to create a reasonably fast method for calculating the DTCWT, it is still slower than what we believe it ought to be, with it often running 1 to 2 times slower than a 10×10 convolution. As it is the core for the other layers in this thesis, these are also


Table A.2: Run times for different layers with increasing channel size. Input size is 32×C×64×64 where C is the column heading listed below. Run times are in milliseconds, averaged over five runs.

| Channel Size | 3 | 10 | 32 | 64 | 128 |
|---|---|---|---|---|---|
| Conv10x10 | 2 | 4.6 | 15.8 | 28.4 | 70 |
| DTCWT | 3.2 | 10.5 | 30.0 | 58.6 | 126 |
| DTCWT⁻¹ | 4.1 | 13.3 | 37.0 | 79.4 | 152 |
| Scatter | 3.4 | 11.0 | 31.4 | 65.8 | 133 |
| Invariant | 3.6 | 11.8 | 34.6 | 73.6 | 164 |
| Gain Layer | 9.7 | 28.4 | 79.4 | 158 | 371 |
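The measured run times in Tables A.1 and A.2 can be converted into slowdown factors relative to the reference convolution. The sketch below simply copies the table values and divides each row by the Conv10x10 row; the helper name `slowdown` and the dictionary layout are our own choices for illustration.

```python
# Run times in milliseconds, copied from Table A.1 (columns: H = 16..256).
table_a1 = {
    "Conv10x10": [0.2, 0.8, 6.2, 22.4, 112],
    "DTCWT": [0.5, 2.0, 7.6, 29.4, 118],
    "DTCWT inverse": [0.6, 2.1, 8.1, 33.3, 123],
    "Scatter": [0.6, 2.1, 8.7, 31.8, 125],
    "Invariant": [0.7, 2.4, 9.6, 37.4, 144],
    "Gain": [1.5, 5.7, 21.6, 80, 336],
}

# Run times in milliseconds, copied from Table A.2 (columns: C = 3..128).
table_a2 = {
    "Conv10x10": [2, 4.6, 15.8, 28.4, 70],
    "DTCWT": [3.2, 10.5, 30.0, 58.6, 126],
    "DTCWT inverse": [4.1, 13.3, 37.0, 79.4, 152],
    "Scatter": [3.4, 11.0, 31.4, 65.8, 133],
    "Invariant": [3.6, 11.8, 34.6, 73.6, 164],
    "Gain": [9.7, 28.4, 79.4, 158, 371],
}


def slowdown(table, reference="Conv10x10"):
    """Return each layer's run time divided by the reference layer's."""
    ref = table[reference]
    return {
        name: [t / r for t, r in zip(times, ref)]
        for name, times in table.items()
        if name != reference
    }


for label, table in (("Table A.1", table_a1), ("Table A.2", table_a2)):
    print(label)
    for name, ratios in slowdown(table).items():
        print(f"  {name:14s}", " ".join(f"{r:5.2f}" for r in ratios))
```

Reading the ratios column by column shows how the gap to the reference convolution changes with spatial size and with channel count.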

Appendix B

Extra Proofs and Algorithms

We derive proofs for the gradients of decimation, interpolation, and for the forward 2-D DWT. These are needed for subsection 3.4.3.

We have also listed some of the algorithms that are not included in the main text for the interested reader. In particular, these are the inverse DWT (used in subsection 3.4.5), the inverse DTCWT (used in section 3.5), and the smooth magnitude operation (used in subsection 3.6.1).

B.1

Gradients of Sample Rate Changes

Consider 1D decimation and interpolation of a signal x. The results we prove here extend easily to 2D, but for simplicity we have kept to the 1D case. Decimation of a signal x by M ∈ Z is defined as:

\[
y[n] = x[Mn] \tag{B.1}
\]

and interpolation by M as:

\[
y[n] =
\begin{cases}
x[n/M] & n = Mk,\ k \in \mathbb{Z} \\
0 & \text{otherwise}
\end{cases}
\tag{B.2}
\]
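The two definitions can be illustrated on a finite-length signal. The sketch below is a minimal example only: the function names `decimate` and `interpolate` are ours, signals are plain Python lists starting at index 0 (an assumption, since the definitions are stated for signals indexed over all integers), and interpolation is implemented as placing x[k] at index Mk with zeros elsewhere.

```python
def decimate(x, M):
    """Keep every M-th sample: y[n] = x[M * n]."""
    return x[::M]


def interpolate(x, M):
    """Insert M - 1 zeros after each sample: y[M * k] = x[k], else 0."""
    y = [0] * (len(x) * M)
    y[::M] = x
    return y


x = [1, 2, 3, 4, 5, 6]
down = decimate(x, 2)       # samples at indices 0, 2, 4
up = interpolate(down, 2)   # zeros re-inserted at odd indices
```

Decimating an interpolated signal by the same factor returns the original samples, whereas interpolating a decimated signal does not, because the discarded samples are replaced by zeros.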