Proof - Uses of Complex Wavelets in Deep Convolutional Neural Networks

C.2 Proof

Consider a single filter parameterized in both the pixel space and the frame space. In one system, the original filter parameters are updated. In a second system, their frame representation is updated. We want to track the evolution of the two filters and compare them when the same data are presented to both. We give them the same $\ell_2$ regularization rate $\lambda$ and the same learning rate $\eta$.

Let us call the pixel-space weights at time $t$ $w_t$ and the frame-parameterized weights $\hat{w}_t$, and assume that:

$$\hat{w}_t = \tilde{\Phi}^* w_t \tag{C.3}$$

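It follows from (C.2) and our definition of $\hat{w}_t$ that:

$$\frac{\partial L}{\partial \hat{w}_t} = \frac{\partial w_t}{\partial \hat{w}_t} \frac{\partial L}{\partial w_t} = \Phi^* \frac{\partial L}{\partial w_t} \tag{C.4}$$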

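After presenting both systems with the same minibatch of samples $\mathcal{D}$ and calculating the gradient $\frac{\partial L}{\partial w_t}$, we update both parameters:

\begin{align}
w_{t+1} &= w_t - \eta \left( \frac{\partial L}{\partial w_t} + \lambda w_t \right) \tag{C.5} \\
&= (1 - \eta\lambda) w_t - \eta \frac{\partial L}{\partial w_t} \tag{C.6} \\
\hat{w}_{t+1} &= (1 - \eta\lambda) \hat{w}_t - \eta \frac{\partial L}{\partial \hat{w}_t} \tag{C.7}
\end{align}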

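If we left-multiply (C.6) by the analysis operator $\tilde{\Phi}^*$ and use (C.3), we get:

\begin{align}
\tilde{\Phi}^* w_{t+1} &= \tilde{\Phi}^* \left( (1 - \eta\lambda) w_t - \eta \frac{\partial L}{\partial w_t} \right) \tag{C.8} \\
&= (1 - \eta\lambda) \hat{w}_t - \eta \tilde{\Phi}^* \frac{\partial L}{\partial w_t} \tag{C.9}
\end{align}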

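In general, this does not reduce further. However, if $\tilde{\Phi} = \Phi$, as is the case with tight frames [20], then we can use (C.4) and this last line simplifies to:

\begin{align}
\tilde{\Phi}^* w_{t+1} &= (1 - \eta\lambda) \hat{w}_t - \eta \frac{\partial L}{\partial \hat{w}_t} \tag{C.10} \\
&= \hat{w}_{t+1} \tag{C.11}
\end{align}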

This shows that the two parameterizations remain related at time $t+1$, given that they were related at time $t$.
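The induction step above can be checked numerically. The following is a minimal sketch, not the thesis code: it assumes real-valued frames (so $\Phi^*$ is a matrix transpose), a toy least-squares loss, and a small random Parseval tight frame built from a QR factorization. The function name `sgd_gap`, the sizes `N` and `K`, and the non-tight comparison frame are all illustrative choices.

```python
import numpy as np

def sgd_gap(phi, analysis, steps=20, eta=0.1, lam=0.01, seed=0):
    """Run both SGD systems in lockstep; return the largest ||w_hat - analysis @ w||."""
    rng = np.random.default_rng(seed)
    n = phi.shape[0]
    w = rng.standard_normal(n)          # pixel-space filter
    w_hat = analysis @ w                # frame-space filter, related by (C.3)
    worst = 0.0
    for _ in range(steps):
        # Same minibatch for both systems (toy least-squares loss, an assumption).
        X = rng.standard_normal((32, n))
        y = rng.standard_normal(32)
        grad_w = X.T @ (X @ w - y) / 32
        # Second system: reconstruct the filter, then apply the chain rule (C.4).
        grad_w_hat = phi.T @ (X.T @ (X @ (phi @ w_hat) - y) / 32)
        w = (1 - eta * lam) * w - eta * grad_w                # (C.6)
        w_hat = (1 - eta * lam) * w_hat - eta * grad_w_hat    # (C.7)
        worst = max(worst, float(np.linalg.norm(w_hat - analysis @ w)))
    return worst

rng = np.random.default_rng(1)
N, K = 6, 10
# U has orthonormal columns, so phi_tight = U.T satisfies phi_tight @ phi_tight.T = I.
U, _ = np.linalg.qr(rng.standard_normal((K, N)))
phi_tight = U.T
gap_tight = sgd_gap(phi_tight, phi_tight.T)

# Non-tight frame: rescale the rows and use the pseudo-inverse as the dual analysis operator.
phi_general = np.diag(np.linspace(0.5, 1.5, N)) @ phi_tight
gap_general = sgd_gap(phi_general, np.linalg.pinv(phi_general))

print(f"tight frame gap:     {gap_tight:.2e}")
print(f"non-tight frame gap: {gap_general:.2e}")
```

Under these assumptions the tight-frame gap should stay at floating-point rounding level, while the non-tight frame drifts away from the relation (C.3) after the first update.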

This proves the simpler case of SGD, but the same result holds when momentum terms are added, due to the linearity of the update equations. It does not hold for the Adam [49] or Adagrad [50] optimizers, which automatically rescale the learning rate of each parameter based on estimates of the parameter’s variance.
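The contrast between momentum and Adam can be illustrated with a small experiment. This is a hedged sketch under the same assumptions as a toy setting would need: a real Parseval tight frame, a least-squares loss, and the $\ell_2$ term folded into the gradient. The helpers `momentum_step`, `adam_step` and `max_gap` are hypothetical names written for this illustration, and `adam_step` is a simplified textbook form of the Adam update rather than any particular library's implementation.

```python
import numpy as np

def momentum_step(p, g, state, eta=0.05, mu=0.9):
    """SGD with momentum: linear in the gradient history."""
    state["v"] = mu * state.get("v", 0.0) + g
    return p - eta * state["v"]

def adam_step(p, g, state, eta=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Simplified Adam: elementwise rescaling makes the update nonlinear in g."""
    state["t"] = state.get("t", 0) + 1
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    state["s"] = b2 * state.get("s", 0.0) + (1 - b2) * g ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    s_hat = state["s"] / (1 - b2 ** state["t"])
    return p - eta * m_hat / (np.sqrt(s_hat) + eps)

def max_gap(step, phi, steps=20, lam=0.01, seed=0):
    """Largest ||w_hat - phi.T @ w|| seen while both systems use the same optimizer."""
    rng = np.random.default_rng(seed)
    n = phi.shape[0]
    w = rng.standard_normal(n)
    w_hat = phi.T @ w                       # tight frame: analysis operator is phi.T
    state_w, state_hat = {}, {}
    worst = 0.0
    for _ in range(steps):
        X = rng.standard_normal((32, n))    # shared minibatch
        y = rng.standard_normal(32)
        g_w = X.T @ (X @ w - y) / 32 + lam * w
        g_hat = phi.T @ (X.T @ (X @ (phi @ w_hat) - y) / 32) + lam * w_hat
        w = step(w, g_w, state_w)
        w_hat = step(w_hat, g_hat, state_hat)
        worst = max(worst, float(np.linalg.norm(w_hat - phi.T @ w)))
    return worst

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((10, 6)))   # orthonormal columns
phi = U.T                                           # phi @ phi.T = I
gap_momentum = max_gap(momentum_step, phi)
gap_adam = max_gap(adam_step, phi)
print(f"momentum gap: {gap_momentum:.2e}, Adam gap: {gap_adam:.2e}")
```

In this sketch the momentum buffers of the two systems stay related by the same analysis operator, so the gap remains at rounding level, whereas the per-parameter normalization in the Adam-style step breaks the relation from the first update.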

We mention in subsubsection 2.6.7.3 that when the DTCWT uses orthogonal wavelet transforms, as is the case with the q-shift filters [95], then it forms a tight frame. If the biorthogonal filters are used (as is often the case for the first scale of the transform), it does not form a tight frame.

Appendix D

DTCWT Single Subband Gains

This appendix proves that the DTCWT gain layer proposed in chapter 6 maintains the shift-invariant properties of the DTCWT.

Recall that with multirate systems, upsampling by $M$ takes $X(z)$ to $X(z^M)$ and downsampling by $M$ takes $X(z)$ to $\frac{1}{M} \sum_{k=0}^{M-1} X(W_M^k z^{1/M})$, where $W_M^k = e^{j 2\pi k / M}$. We will drop the $M$ subscript below unless the sample rate change is unclear, simply using $W^k$.
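Both multirate identities can be checked on the unit circle with the DFT. The snippet below is an illustrative sketch that assumes a finite, periodically extended signal whose length `N` is divisible by `M`; the variable names are not from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 24, 4
x = rng.standard_normal(N)
X = np.fft.fft(x)

# Upsampling by M: insert M-1 zeros between consecutive samples.
up = np.zeros(N * M)
up[::M] = x
# X(z^M) on the unit circle is the spectrum of x repeated M times.
up_ok = np.allclose(np.fft.fft(up), np.tile(X, M))

# Downsampling by M: keep every M-th sample.
down = x[::M]
# (1/M) * sum_k X(W^k z^(1/M)): average the M spectral segments of length N/M.
down_ok = np.allclose(np.fft.fft(down), X.reshape(M, N // M).mean(axis=0))

print(up_ok, down_ok)
```

The reshape groups DFT bins that differ by multiples of `N // M`, which are exactly the rotated copies $X(W^k z^{1/M})$ sampled on the DFT grid.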

D.1 Revisiting the Shift-Invariance of the DTCWT

It is easiest to prove the shift-invariance of the gain layer by expanding on the proofs of the shift-invariance of the DTCWT given in [19].

Let us consider one subband of the DTCWT. This includes the coefficients from both tree A and tree B. For simplicity in this analysis we will consider the 1-D DTCWT without the channel parameter $c$.

If we only keep coefficients from a given subband and set all the others to zero, then we have a reduced tree as shown in Figure D.1. The output $Y(z)$ is:

$$Y(z) = \frac{1}{M} \sum_{k=0}^{M-1} X(W^k z) \left[ A(W^k z) C(z) + B(W^k z) D(z) \right] \tag{D.1}$$

where the aliasing terms are formed from the addition of the rotated z-transforms, i.e. when $k \neq 0$.
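Equation (D.1) can be sanity-checked by simulating the reduced two-tree filter bank directly. The sketch below makes several simplifying assumptions that are not in the source: the filters are arbitrary random taps rather than DTCWT filters, all filtering is circular (so the DFT gives exact samples of the z-transforms on the unit circle), and the helper names are hypothetical.

```python
import numpy as np

def keep_every_mth(v, M):
    """Downsample by M then upsample by M: zero every sample except each M-th one."""
    out = np.zeros_like(v)
    out[::M] = v[::M]
    return out

def circ_filter(h, v):
    """Circular convolution of v with taps h, computed in the DFT domain."""
    return np.fft.ifft(np.fft.fft(h, len(v)) * np.fft.fft(v))

rng = np.random.default_rng(0)
N, M = 64, 4
x = rng.standard_normal(N)
a, b, c, d = (rng.standard_normal(8) for _ in range(4))   # arbitrary taps

# Time-domain simulation: tree A (a then c) plus tree B (b then d).
y_time = (circ_filter(c, keep_every_mth(circ_filter(a, x), M))
          + circ_filter(d, keep_every_mth(circ_filter(b, x), M)))

# Frequency-domain formula (D.1): W^k z corresponds to a shift of k*N/M DFT bins.
X, A, B, C, D = (np.fft.fft(v, N) for v in (x, a, b, c, d))
terms = []
for k in range(M):
    s = k * N // M
    terms.append(np.roll(X, -s) * (np.roll(A, -s) * C + np.roll(B, -s) * D) / M)
Y_formula = sum(terms)
alias_energy = float(np.linalg.norm(sum(terms[1:])))      # contribution of k != 0

print(np.allclose(np.fft.fft(y_time), Y_formula), alias_energy)
```

With random filters the $k \neq 0$ terms carry substantial energy, which is the aliasing that the Hilbert-pair design of the DTCWT is intended to suppress.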

As is standard for filter design in the real DWT, it is possible to make $A$ and $C$ have similar frequency responses. We can also make $A(W^{\pm 2} z) C(z)$ near zero if their stopbands can be made reasonably small. It is not possible, however, to make the terms $A(W^{\pm 1} z) C(z)$ zero, as the transition bands of the shifted analysis filter $A(W^{\pm 1} z)$ overlap with those of the

A(z) yM Xa(z) x M C(z) B(z) yM Xb(z) x M D(z) X(z) Ya(z) Yb(z) Y(z)

Figure D.1: Filter bank diagram of 1-D DTCWT. Note the top and bottom paths are through the wavelet or scaling functions from just level m (M= 2m). Figure based on Figure 4 in [19].

Theorem D.1. The odd $k$ aliasing terms in (D.1) cancel out if the impulse responses of $B$ and $D$ are Hilbert transforms of the impulse responses of $A$ and $C$ respectively.

Proof. See [19, section 4] for the full proof of this. The full cancellation of aliasing terms for all $k \neq 0$ makes the DTCWT nearly shift-invariant (also see [19, section 7] for the bounds on what ‘nearly’ shift-invariant means).

Now, consider the complex filters defined as:

\begin{align}
P(z) &= \frac{1}{2} \left( A(z) + jB(z) \right) \tag{D.2} \\
Q(z) &= \frac{1}{2} \left( C(z) - jD(z) \right) \tag{D.3}
\end{align}

and define $P^*(z) = \sum_n p^*[n] z^{-n}$ as the Z-transform of $p$ after taking the complex conjugate of the filter taps.

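From this, we can rewrite the filters $A$, $B$, $C$ and $D$ as:

\begin{align}
A(z) &= P(z) + P^*(z) \tag{D.4} \\
B(z) &= -j \left( P(z) - P^*(z) \right) \tag{D.5} \\
C(z) &= Q(z) + Q^*(z) \tag{D.6} \\
D(z) &= j \left( Q(z) - Q^*(z) \right) \tag{D.7}
\end{align}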

Substituting these into (D.1) gives:

$$A(W^k z) C(z) + B(W^k z) D(z) = 2 P(W^k z) Q(z) + 2 P^*(W^k z) Q^*(z) \tag{D.8}$$

This result is important as it shows that the $P^* Q$ and $P Q^*$ terms, which are the terms that would cause significant aliasing, cancel out when $BD$ is added to $AC$.
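The decomposition (D.4) to (D.7) and the identity (D.8) are purely algebraic, so they can be verified for any real filter taps. The following sketch uses arbitrary random real taps (not Hilbert pairs and not DTCWT filters) and a hypothetical helper `ztransform` that evaluates a finite Z-transform at one point.

```python
import numpy as np

def ztransform(taps, z):
    """Evaluate sum_n taps[n] * z**(-n) at a single complex point z."""
    n = np.arange(len(taps), dtype=float)
    return np.sum(taps * z ** (-n))

rng = np.random.default_rng(0)
a, b, c, d = (rng.standard_normal(10) for _ in range(4))   # real taps
p = 0.5 * (a + 1j * b)      # (D.2)
q = 0.5 * (c - 1j * d)      # (D.3)

# (D.4)-(D.7) hold tap by tap; conjugating the taps gives P* and Q*.
taps_ok = (np.allclose(a, p + p.conj()) and np.allclose(b, -1j * (p - p.conj()))
           and np.allclose(c, q + q.conj()) and np.allclose(d, 1j * (q - q.conj())))

M = 4
W = np.exp(2j * np.pi / M)
max_err = 0.0       # worst violation of (D.8)
max_cross = 0.0     # size of the P Q* and P* Q terms present in AC alone
for k in range(M):
    for theta in rng.uniform(0.0, 2.0 * np.pi, size=5):
        z = np.exp(1j * theta)
        zk = W ** k * z
        ac = ztransform(a, zk) * ztransform(c, z)
        bd = ztransform(b, zk) * ztransform(d, z)
        pq = ztransform(p, zk) * ztransform(q, z)
        pq_conj = ztransform(p.conj(), zk) * ztransform(q.conj(), z)
        max_err = max(max_err, abs(ac + bd - 2 * pq - 2 * pq_conj))
        max_cross = max(max_cross, abs(ac - pq - pq_conj))

print(taps_ok, max_err, max_cross)
```

The quantity `max_cross` measures the cross terms that remain when only $AC$ is used; they are generally far from zero, while the sum $AC + BD$ matches the right-hand side of (D.8) to rounding error.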

Using (D.2) and (D.3), Kingsbury showed that if $B$ is the Hilbert pair of $A$ then $P$ has support only on the right-hand side of the frequency plane. Similarly, if $D$ is the Hilbert pair

