5.4 Estimation of the Evolutionary Wavelet Spectrum
5.4.2 Periodogram Smoothing
We here detail several approaches to smoothing the wavelet periodogram as discussed in the literature. As noted by Fryzlewicz et al. (2006), the signal-to-noise ratio of the wavelet periodogram is always relatively low; asymptotically we obtain
$$\mathbb{E}_X[I_j[k]] \,/\, \mathrm{Var}_X[I_j[k]]^{1/2} = 2^{-1/2} .$$
Not only is this signal-to-noise ratio low, but the periodogram sequences will typically also be correlated over nearby points in time, owing to the non-orthogonality of the non-decimated basis functions. The distance over which these correlations are significant depends on the ground-truth structure of the spectrum, and is thus generally not known in advance. An example of the raw wavelet periodogram can be seen in Figure 5.4.1.
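The value $2^{-1/2}$ can be recovered from a short calculation. As a sketch, assume that the wavelet coefficient $d_{j,k}$ is zero-mean Gaussian with variance $s_{j,k}^2$ (a symbol introduced here purely for illustration) and that the raw periodogram is $I_j[k] = |d_{j,k}|^2$. The second and fourth moments of a Gaussian then give:

```latex
\begin{align*}
\mathbb{E}_X[I_j[k]] &= \mathbb{E}[d_{j,k}^2] = s_{j,k}^2 , \\
\mathrm{Var}_X[I_j[k]] &= \mathbb{E}[d_{j,k}^4] - \big(\mathbb{E}[d_{j,k}^2]\big)^2
  = 3 s_{j,k}^4 - s_{j,k}^4 = 2 s_{j,k}^4 , \\
\frac{\mathbb{E}_X[I_j[k]]}{\mathrm{Var}_X[I_j[k]]^{1/2}}
  &= \frac{s_{j,k}^2}{\sqrt{2}\, s_{j,k}^2} = 2^{-1/2} .
\end{align*}
```

Under these assumptions the ratio does not depend on $s_{j,k}^2$, so the noise is of the same order as the signal at every time point, whatever the local power of the process.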
Sliding Window Smoothing
Perhaps the simplest form of smoothing is to utilise a sliding window approach whereby the spectrum is estimated at the centre of some localised period in time. In some sense, this is similar to the localised Fourier periodogram (5.1.6), which averages in the context of a data taper function.
[Figure 5.4.1: five panels, j = 1, ..., 5, each showing the wavelet periodogram I_j[k] against time k.]
Figure 5.4.1 – Raw, unsmoothed wavelet periodogram, for the process realisation in Figure 5.3.1.
Definition 5.7. Kernel Smoothed Periodogram Estimator
Let $[t - M, t + M]$ be an interval of time on the discrete grid $t = 1, \ldots, T$, and let $h[k]$ be a weighting function for the data at point $k \in [t - M, t + M]$. A smoothed periodogram estimator can be constructed such that:
$$\hat{S}_j(t/T) = \frac{1}{H} \sum_{k=t-M}^{t+M} \sum_{l=1}^{J} A^{-1}_{jl} \, |h[k]\, d_{l,k}|^2 , \tag{5.4.5}$$
where $H = \sum_{k=t-M}^{t+M} h[k]^2$ is the integrated weighting function.
We note that the weighting function in the above plays a similar role to that in the local Fourier process models, for example via Eq. 5.1.6. However, unlike the Fourier equivalent, this is defined to be the same across all scale levels. In the particular case where
$$h[k] = \begin{cases} 1 & \text{if } k \in [t - M, t + M] \\ 0 & \text{otherwise} \end{cases} ,$$
the resulting estimator is equivalent to the central moving average estimator of Stevens (2013). It is also related to the smoothing technique employed in Park et al. (2014). In practice, the uniform kernel is preferred here, both for simplicity of analysis and because the wavelets are already localised in time.
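As a minimal sketch of Definition 5.7, the following Python function evaluates Eq. 5.4.5 directly. The function name `smooth_periodogram`, its arguments, and the convention of passing the kernel as a function of the offset k − t are illustrative assumptions rather than part of the original text. The coefficients d and the matrix A^{-1} are assumed to have been computed elsewhere, and time points within M of either boundary are simply left unestimated.

```python
def smooth_periodogram(d, A_inv, M, h=None):
    """Kernel-smoothed spectral estimate in the spirit of Eq. 5.4.5.

    d     : list of J lists, d[l][k] is the wavelet coefficient at scale l, time k
    A_inv : J x J list of lists holding the inverse matrix A^{-1}
    M     : half-width of the window [t - M, t + M]
    h     : weight as a function of the offset k - t (assumed convention);
            defaults to the uniform kernel h = 1
    Returns S with S[j][t] set for t = M, ..., T - M - 1 and None elsewhere.
    """
    if h is None:
        h = lambda offset: 1.0
    J, T = len(d), len(d[0])
    offsets = list(range(-M, M + 1))
    sq_weights = [h(o) ** 2 for o in offsets]   # |h[k] d|^2 = h[k]^2 d^2
    H = sum(sq_weights)                         # integrated weighting function
    S = [[None] * T for _ in range(J)]
    for t in range(M, T - M):
        for j in range(J):
            total = 0.0
            for w, o in zip(sq_weights, offsets):
                k = t + o
                # bias-corrected raw periodogram at time k: sum_l A^{-1}_{jl} d_{l,k}^2
                total += w * sum(A_inv[j][l] * d[l][k] ** 2 for l in range(J))
            S[j][t] = total / H
    return S
```

With the uniform kernel, H = 2M + 1 and the estimate reduces to a centred moving average of the corrected periodogram, which is the special case discussed above.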
Second Stage Wavelet Smoothing
A slightly more elaborate approach to periodogram smoothing is to smooth using a second-stage wavelet transform of the periodogram (G.P. Nason et al. 2000; Sachs, G.P. Nason, et al. 1997). Thresholding, or wavelet shrinkage, methods can then be applied to this second-stage transform to act as a denoising step. Transforming back to the original periodogram results in a smoothed and asymptotically consistent estimator.
More precisely, one takes the raw periodogram and then performs a second stage of wavelet analysis, this time with a decimated set of wavelets $\{\bar{\psi}_{l,m}\}$. Taking the transform at scale $l$ and position $m$, we have the set of wavelet coefficients $\hat{v}_{l,m} = \sum_k I_j[k]\, \bar{\psi}_{l,m}[k]$ for scales $2^l = o(T)$. Below, the results of G.P. Nason et al. (2000) are quoted:
Theorem 5.1. Properties of Second-Stage DWT Coefficients (G.P. Nason et al. 2000)
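The DWT wavelet coefficients $\{\hat{v}_{l,m}\}$ of the periodogram from a Gaussian LSW process at position $z = k/T$, with $2^l = o(T)$, obey uniformly in $m$,
$$\mathbb{E}_X[\hat{v}_{l,m}] = \int_0^1 \sum_{i=1}^{J_T} A_{j,i} S_i(z)\, \bar{\psi}_{l,m}(z)\, dz + O(2^{l/2}/T) ,$$
and
$$\mathrm{Var}_X[\hat{v}_{l,m}] = \frac{2}{T} \int_0^1 \Big( \sum_{i=1}^{J_T} A_{j,i} S_i(z) \Big)^2 \bar{\psi}^2_{l,m}(z)\, dz + O(2^l T^{-2}) .$$
Furthermore, let $\hat{S}_j^{\bar{\psi}}(z)$ be the estimator obtained from the inverse DWT of the coefficients $\hat{v}_{l,m}$, thresholded with $\lambda^2(l, m; j, T) = \mathrm{Var}_X(\hat{v}_{l,m}) \log^2(T)$. For each fixed $j$, the estimate $\hat{S}_j^{\bar{\psi}}$ obeys:
$$\int_0^1 \mathbb{E}_X\big[\hat{S}_j^{\bar{\psi}}(z) - S_j(z)\big]^2 dz = O(\log^2(T)/T^{2/3}) .$$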
The proof of the consistency result above relies on results obtained throughout the 1990s relating to function denoising via wavelet shrinkage. In fact, the behaviour of such thresholded estimators is very closely related to the thresholding properties of estimators such as the lasso (cf. Sec. 2.1.5). To avoid detracting from the main topic, i.e. smoothing the LSW periodogram, I have added some notes on denoising via wavelet shrinkage in Appendix 5.5.
Finally, while Theorem 5.1 holds for the case of the DWT smoothed periodogram, in practice it is desirable to use a method that allows for time-invariance; that is, when we shift the data in time, the periodogram estimate should also shift. To this end, it is often suggested (G.P. Nason et al. 2000; Sachs and Schneider 1996) to apply a cycle spinning method to perform translation-invariant denoising; see Coifman et al. (1995) for details of such a scheme. Briefly, the cycle spinning method works by shifting the data, in this case the periodogram at scale $j$, a random number of positions (while maintaining ordering), smoothing the shifted periodogram by thresholding the coefficients $\hat{v}_{l,m}$, and then performing the inverse DWT on these estimates; the result is shifted back, and the estimates obtained from several such shifts are averaged. A modified second-stage transform known as the Haar-Fisz transform (Fryzlewicz et al. 2006) has also gained popularity for smoothing the non-decimated periodogram; we discuss this further in Chapter 6.
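The cycle spinning idea can be illustrated with a small, self-contained sketch. It assumes an orthonormal Haar DWT, soft thresholding with a single threshold for all scales, circular shifts, and an input whose length is a power of two; none of these choices is prescribed by the text, and the function names (`haar_dwt`, `haar_idwt`, `denoise`, `cycle_spin`) are hypothetical. The shifts are supplied by the caller rather than drawn at random.

```python
import math

def haar_dwt(x):
    """Orthonormal Haar DWT of a length-2^n list: returns (approx, details),
    with the detail levels ordered from finest to coarsest."""
    approx, details = list(x), []
    while len(approx) > 1:
        pairs = range(0, len(approx), 2)
        s = [(approx[i] + approx[i + 1]) / math.sqrt(2) for i in pairs]
        d = [(approx[i] - approx[i + 1]) / math.sqrt(2) for i in pairs]
        details.append(d)
        approx = s
    return approx, details

def haar_idwt(approx, details):
    """Inverse of haar_dwt: rebuild the signal from the coarsest level upwards."""
    x = list(approx)
    for d in reversed(details):
        nxt = []
        for s_i, d_i in zip(x, d):
            nxt.append((s_i + d_i) / math.sqrt(2))
            nxt.append((s_i - d_i) / math.sqrt(2))
        x = nxt
    return x

def soft(v, lam):
    """Soft thresholding: shrink v towards zero by lam."""
    return math.copysign(max(abs(v) - lam, 0.0), v)

def denoise(x, lam):
    """Threshold the detail coefficients and transform back."""
    approx, details = haar_dwt(x)
    details = [[soft(v, lam) for v in d] for d in details]
    return haar_idwt(approx, details)

def cycle_spin(x, lam, shifts):
    """Average of shift -> denoise -> unshift over the supplied circular shifts."""
    shifts = list(shifts)
    n = len(x)
    out = [0.0] * n
    for s in shifts:
        shifted = [x[(i + s) % n] for i in range(n)]
        den = denoise(shifted, lam)
        for i in range(n):
            out[i] += den[(i - s) % n]   # undo the shift before accumulating
    return [v / len(shifts) for v in out]
```

When all n circular shifts are used, the output of `cycle_spin` shifts along with the input, which is the translation-invariance property motivating the method; with only a few shifts this holds approximately at best.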
5.5 Summary
In this chapter, a set of spectral models and estimators for the representation of time-series were introduced. One key benefit of adopting a spectral approach to modelling time-series is that, in the frequency/scale domain, the process can often be considered to be in some sense sparse; that is, only a subset of frequencies is required to describe the process. However, as discussed, the traditional Fourier representation is not appropriate for describing non-stationary processes. Instead, one can either allow the Fourier spectra to vary over time, cf. Priestley's oscillatory processes, or adopt a localised wavelet-like basis. The LSF and LSW processes (5.1, 5.4) form classes of stochastic processes respectively constructed over Fourier and wavelet basis functions. In both cases, a connection is made from the increasing set of time points $t = 1, \ldots, T$ to a continuous function over the restricted interval $z = t/T \in (0, 1)$. In this sense, they allow us to asymptotically represent, and recover, the second-order autocovariance properties of a process, even if these are non-stationary. However, while the processes can now be non-stationary, they must maintain appropriate smoothness constraints to enable spectral identification. For example, in this chapter, we assumed that the underlying spectral transfer function $W(z)$ is Lipschitz smooth. Such assumptions place limits on the range of processes LSW models can represent; for example, they do not permit sharp jumps in the wavelet spectra. However, in real dynamic systems, such sharp jumps may be present, and are important to detect; for instance, one may consider searching for structural breaks in financial time-series, or edges in textured images. In the following chapters, such smoothness assumptions are considered in the context of regularised estimators for the LSW spectra.
Appendix D
D.1 Wavelet Thresholding
Wavelet thresholding typically refers to the method of selecting certain active coefficients in a wavelet decomposition by thresholding the empirically obtained coefficients. Traditionally, such thresholding is performed in order to recover a function $f_t$ from noisy measurements $\{x_t\}_{t=1}^{T}$. For example, we may observe the process $\{X_{t;T}\}$, i.e.:
$$X_{t;T} = f(t/T) + Z_{t;f} , \tag{D.1}$$
where $Z_{t;f}$ is an independently sampled noise term, e.g. $Z_{t;f} \sim N(0, \sigma^2)$.
Let $v_{j,k}$ be the discrete wavelet coefficients of an observed sequence $\{x_t\}$. A somewhat ideal estimate of the function from the wavelet coefficients $v_{j,k}$ is defined via the selective wavelet construction as
$$\hat{f}_t = T_{SW}(x_t, \delta) := \sum_{(j,k) \in S_0} v_{j,k} \psi_{j,k} ,$$
where $S_0$ is a finite list of $(j, k)$ pairs. In reality, we do not know $S_0$ and need to estimate this set; thresholding provides one way to achieve this. However, as when dealing with parameter selection in linear regression, there are many different thresholding functions available and several ways to set appropriate thresholds. In this section, I aim to give a brief review of wavelet thresholding methods proposed by Donoho, Johnstone, Neumann et al. throughout the 1990s (D. L. Donoho et al. 1995; D. Donoho 1995; D. Donoho et al. 1994, 1998; M. Neumann et al. 1995).
Remark. Second-stage smoothing vs general functional recovery
The discussion here is aimed at the general recovery of a function $f(\cdot)$ in the presence of noise $\{Z_{t;f}\}$. Traditionally, theory on denoising is developed assuming additive Gaussian noise. However, results for non-Gaussian smoothing also exist, such as the work by M. Neumann et al. (1995) (these are covered briefly in Remark .4). With regards to LSW spectral estimation (as discussed in Sec. 5.4.2), we treat the periodogram $I_{j,k}$ at each scale level as a noisy realisation of the spectral function $S_j(k/T)$. The setup here would mimic that of Eq. D.1 such that $I_{j,k} = S_j(k/T) + Z_{j,k;S}$.
For the purposes of this discussion, assume that a DWT of the process is performed alongside that of the true function, i.e. $\{Y_{j,k}\} = \mathrm{DWT}(\{X_{t;T}\})$ and $\{v_{j,k}\} = \mathrm{DWT}(\{f(t/T)\})$. At each scale level $j$ of the DWT, we now assume that $Y_{j,k} = v_{j,k} + Z_{j,k;v}$, where $Z_{j,k;v}$ is a noise term for time $k = 1, \ldots, 2^{J-j}$ and $\epsilon$ describes the scale of this noise. Note: the exact distribution of $Z_{j,k;v}$ may be different from that of $Z_{t;f}$; the two variables correspond to noise sources in the wavelet and time domains respectively. In the following, let us temporarily drop the scale index $j$, and consider the noisy wavelet coefficients $\{Y_k\}_{k=1}^{N}$ with $N = 2^{J-j}$. If we assume a Gaussian noise source $Z_{k;v} \sim N(0, \epsilon^2)$, then in the limit of large $N$ we obtain $\lim_{N \to \infty} P[\max_k |Z_{k;v}| > \epsilon\sqrt{2 \log N}] = 0$. Taking a threshold $\lambda = \sqrt{2 \log N}$, measured in units of the noise scale $\epsilon$, it is therefore unlikely that any contribution from the noise will breach this level. A particularly striking fact is that, when using such a threshold in conjunction with hard/soft thresholding for the DWT, the risk of the associated functional estimator is bounded to within a logarithmic factor of the oracle risk.
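The hard and soft thresholding rules, together with the $\sqrt{2 \log N}$ threshold, can be sketched in a few lines. The function names and the example coefficients below are made up for illustration, and the noise scale (`eps`) is assumed known.

```python
import math

def soft(y, lam):
    """Soft thresholding: zero anything within lam of zero, shrink the rest by lam."""
    return math.copysign(max(abs(y) - lam, 0.0), y)

def hard(y, lam):
    """Hard thresholding: keep y unchanged if |y| > lam, otherwise set it to zero."""
    return y if abs(y) > lam else 0.0

def universal_threshold(N, eps=1.0):
    """Threshold eps * sqrt(2 log N) for N coefficients with noise scale eps."""
    return eps * math.sqrt(2.0 * math.log(N))

# Illustrative (made-up) noisy coefficients with unit noise scale.
coeffs = [0.3, -4.2, 0.1, 2.5, -0.7, 0.0, 5.0, -0.2]
lam = universal_threshold(len(coeffs))
kept_hard = [hard(y, lam) for y in coeffs]
kept_soft = [soft(y, lam) for y in coeffs]
```

Both rules select the same set of active coefficients; they differ only in that soft thresholding also shrinks the retained coefficients towards zero by the threshold.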
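Proposition D.7. Theorem 1 (D. Donoho et al. 1994)
Let $Y_k = v_k + \epsilon Z_{k;v}$ where $Z_{k;v} \sim N(0, 1)$ and $\epsilon > 0$. Defining the risk as $R(\hat{v}, v) := \mathbb{E}[\|\hat{v} - v\|_2^2]$, for $\hat{v} = \mathrm{soft}(y; \epsilon\sqrt{2 \log N})$ we obtain
$$R_{\mathrm{univ}}(\hat{v}, v) \le (2 \log N + 1) \Big( \epsilon^2 + \sum_{k=1}^{N} \min(|v_k|^2, \epsilon^2) \Big) .$$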
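The bound can be checked numerically for a given coefficient vector. The sketch below assumes a unit noise scale, evaluates the per-coefficient soft-thresholding risk by the trapezoidal rule directly from its definition as a Gaussian expectation, and compares the summed risk with the right-hand side of the proposition. The function names and the integration settings are illustrative choices.

```python
import math

def soft(x, lam):
    """Soft thresholding: shrink x towards zero by lam."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def rho_soft(lam, mu, half_width=10.0, steps=20000):
    """E[(soft(G, lam) - mu)^2] for G ~ N(mu, 1), via the trapezoidal rule
    on [mu - half_width, mu + half_width]."""
    a = mu - half_width
    dx = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        x = a + i * dx
        density = math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)
        weight = 0.5 if i in (0, steps) else 1.0   # trapezoid end-point weights
        total += weight * (soft(x, lam) - mu) ** 2 * density
    return total * dx

def risk_and_bound(v):
    """Summed soft-thresholding risk at lam = sqrt(2 log N), and the bound
    (2 log N + 1) * (1 + sum_k min(v_k^2, 1)), both for unit noise scale."""
    N = len(v)
    lam = math.sqrt(2.0 * math.log(N))
    risk = sum(rho_soft(lam, vk) for vk in v)
    bound = (2.0 * math.log(N) + 1.0) * (1.0 + sum(min(vk ** 2, 1.0) for vk in v))
    return risk, bound

# A sparse, made-up coefficient vector: most entries are exactly zero.
risk, bound = risk_and_bound([0.0, 0.0, 0.0, 0.0, 5.0, 0.5, 0.0, -3.0])
```

For sparse vectors such as this one, most of the risk comes from the few large coefficients, each of which is shrunk by the threshold.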
While setting $\lambda = \sqrt{2 \log N}$ allows us to bound the risk, it is interesting to ask whether there is, in some sense, a more optimal setting of $\lambda$. If we only have one observation of a random variable $\Gamma \sim N(\mu, 1)$, then defining $\rho_{ST}(\lambda, \mu) := \mathbb{E}[\{\mathrm{soft}(\Gamma, \lambda) - \mu\}^2]$ one can proceed by introducing the minimax quantities
$$\Lambda^*_N = \inf_{\lambda} \sup_{\mu} \frac{\rho_{ST}(\lambda, \mu)}{N^{-1} + \min(\mu^2, 1)} .$$
Selecting a threshold $\lambda^*_N$ which is the largest $\lambda$ attaining $\Lambda^*_N$ enables the tighter bound:
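Defining the soft-thresholding estimator $\hat{v}^* = \mathrm{soft}(y, \epsilon\lambda^*_N)$, with known $\epsilon$ one can obtain the bound
$$R(\hat{v}^*, v) \le \Lambda^*_N \Big( \epsilon^2 + \sum_{k=1}^{N} \min(v_k^2, \epsilon^2) \Big) .$$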
As one might expect, utilising this optimised threshold we find that both the resulting multiplier and the threshold itself are reduced in comparison to those of Prop. D.7. In particular, $\Lambda^*_N \le 2 \log N + 1$ and $\lambda^*_N \le \sqrt{2 \log N}$. However, asymptotically as $N \to \infty$, for any $\epsilon > 0$ one finds results broadly similar to Prop. D.7, i.e.
$$\Lambda^*_N \sim 2 \log N , \qquad \lambda^*_N \sim \sqrt{2 \log N} .$$
With regards to recovering the function $f(t/T)$, we are interested not in the recovery of the second-stage wavelet coefficients $v_{j,k}$, but in the function itself. If we define $\hat{f}^*(t/T)$ as the inverse wavelet transform of the minimax thresholded coefficients $\hat{v}^*$, then it is possible to translate results on the estimation of the coefficients $v_{j,k}$ to the function itself.
Corollary. Universal Thresholding
Following the above definitions, and Props. D.7, D.8, the risk $\|\hat{f}^* - f\|_2^2$ can be bounded for all $f$ and $T = 2^{J+1}$ according to
$$R(\hat{f}^*, f) \le \Lambda^*_T \Big( \frac{\sigma^2}{T} + R_0(SW, f) \Big) ,$$
where $R_0(SW, f) = \inf_S R_{T,S}(T_{SW}(x, S), f)$ is the oracle risk (it selects the best subset of coefficients for reconstruction). The asymptotic limit of the minimax threshold motivates the application of what is known as the universal threshold:
$$\lambda_{\mathrm{univ}} = \hat{\sigma} \sqrt{2 \log T} .$$
Remark .4. Application to periodogram smoothing
With regards to estimation of the LSW spectrum, specific studies by H. Neumann (1996) and M. Neumann et al. (1995) establish error bounds over functions in Besov balls $\mathcal{F} = B^m_{p,q}$. In the Gaussian case, setting $\lambda = 2 \log(T)/T^{1/2}$ for all $j$ leads to:
for all j leads to: sup
f ∈F
n
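In the case of non-Gaussian time-series, let $\mathcal{J}_T = \{(j, k) \mid 2^j \le T^{1-\eta}\}$ for some $\eta > 0$, and set the universal threshold as $\lambda = \max_{(j,k) \in \mathcal{J}_T}\{\sigma_{j,k}\} \sqrt{2 \log |\mathcal{J}_T|}$. Then
$$\sup_{f \in \mathcal{F}} \Big\{ \mathbb{E}\big[\|\hat{f} - f\|^2_{L_2}\big] \Big\} = O\big( (\log(T)/T)^{2m/(2m+1)} \big) .$$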
For more detail on the above results, the reader is referred to Theorem 3.2 a), b) in M. Neumann et al. (1995). The threshold given in Theorem 5.1, as suggested by G.P. Nason et al. (2000), satisfies the Gaussian case above, and uses the result with $m = 1$, $p = 1$.
It is interesting to note the relation between wavelet denoising and the lasso (Sec. 2.1.5), whereby, in the orthogonal design situation, the lasso becomes a thresholding operation. Indeed, the DWT is equivalent to this situation, where given the orthogonality of the wavelet basis the wavelet model is simply a linear regression model with orthogonal design. The λ in the lasso is therefore directly related to the thresholds discussed above. In the next chapter, this relation is highlighted in greater detail in the context of smoothing the LSW spectra.
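The correspondence with the lasso can be checked on a toy problem. The sketch below assumes the lasso objective $\tfrac{1}{2}\|y - Xb\|_2^2 + \lambda\|b\|_1$, takes as design a 4 × 4 orthonormal Haar matrix, and compares a plain coordinate-descent solution with soft thresholding of $X^\top y$. The solver `lasso_cd`, the data, and the value of λ are illustrative and not taken from the text.

```python
import math

def soft(x, lam):
    """Soft thresholding: shrink x towards zero by lam."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def lasso_cd(X, y, lam, sweeps=5):
    """Coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # partial residual with the contribution of column j left out
            r = [y[i] - sum(X[i][m] * b[m] for m in range(p) if m != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            norm2 = sum(X[i][j] ** 2 for i in range(n))
            b[j] = soft(rho, lam) / norm2
    return b

# Orthonormal Haar design: columns are the scaling vector and three wavelets.
q = 1.0 / math.sqrt(2.0)
X = [[0.5,  0.5,   q, 0.0],
     [0.5,  0.5,  -q, 0.0],
     [0.5, -0.5, 0.0,   q],
     [0.5, -0.5, 0.0,  -q]]
y = [3.0, 1.0, -2.0, 0.5]
lam = 0.8

b_cd = lasso_cd(X, y, lam)
# Transform of y (X^T y), then soft thresholding of each coefficient.
coeffs = [sum(X[i][j] * y[i] for i in range(4)) for j in range(4)]
b_soft = [soft(c, lam) for c in coeffs]
```

Because the columns are orthonormal, each coordinate update decouples from the others and the two solutions agree up to rounding error; for a non-orthogonal design the coordinate-descent iterates would generally differ from simple thresholding.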