Threshold selection - Algorithmic advances in learning from large dimensional matrices and scie

The method described so far requires a threshold parameter $\varepsilon$ that separates the small eigenvalues, those assumed to be perturbations of the zero eigenvalue, from the relevant larger eigenvalues that contribute to the rank. We now describe a method to select $\varepsilon$ based on the spectral density, which was introduced in the previous chapter.

[Figure 3.3 panels (A), (B), (C): each titled "Exact DOS by KPM, deg = 30", plotting $\phi(\lambda)$ against $\lambda$; legend: KPM (Chebyshev).]

Figure 3.3: Exact DOS plots for three different types of matrices.

3.4.1 DOS plot analysis

For motivation, let us first consider a matrix that is exactly of low rank and observe the typical shape of its DOS function plot. As an example we take an $n \times n$ PSD matrix with rank $k < n$ that has $k$ eigenvalues uniformly distributed between 0.2 and 2.5, and whose remaining $n-k$ eigenvalues are equal to zero. An approximate DOS function plot of this low rank matrix is shown in figure 3.3(A). The DOS is generated using KPM with degree $m = 30$, where the coefficients $\mu_k$ are estimated using the exact trace of the Chebyshev polynomial functions of the matrix. Jackson damping is used to eliminate oscillations in the plot. The plot begins with a high value at zero, indicating the presence of a zero eigenvalue with high multiplicity. Following this, it quickly drops to almost zero, indicating a region where there are no eigenvalues. This corresponds to the region just above zero and below 0.2. The DOS increases at 0.2, indicating the presence of new eigenvalues. Because the eigenvalues are uniformly distributed between 0.2 and 2.5, the DOS plot has a constant positive value in this interval.
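The construction of the plot in figure 3.3(A) can be sketched as follows. This is a minimal illustration, not the code used to produce the figure: it assumes the eigenvalues are known, so that the exact moments $\mu_k$ can be formed directly from them, and it uses one common form of the Jackson damping factors. The function names, the matrix size ($n = 1000$, $k = 300$) and the 1% widening of the spectral interval are all assumptions made for the example.

```python
import numpy as np

def jackson_coefficients(m):
    """Jackson damping factors g_0..g_m for a degree-m Chebyshev expansion."""
    N = m + 1
    k = np.arange(N)
    a = np.pi / (N + 1)
    return ((N - k + 1) * np.cos(a * k) + np.sin(a * k) / np.tan(a)) / (N + 1)

def kpm_dos(eigs, m=30, num_points=400):
    """Degree-m KPM density of states built from exact Chebyshev moments."""
    lmin, lmax = eigs.min(), eigs.max()
    c = 0.5 * (lmax + lmin)
    d = 0.5 * (lmax - lmin) * 1.01          # widen slightly so no eigenvalue maps to +/-1
    theta_eigs = np.arccos((eigs - c) / d)
    k = np.arange(m + 1)
    # Exact moments: mu_k = (1/n) trace(T_k(B)) = mean over eigenvalues of cos(k * theta_j)
    mu = np.cos(np.outer(k, theta_eigs)).mean(axis=1)
    w = jackson_coefficients(m) * mu
    w[1:] *= 2.0
    # Evaluate on interior Chebyshev points, sorted in increasing order
    t = np.cos(np.pi * (np.arange(num_points) + 0.5) / num_points)[::-1]
    series = w @ np.cos(np.outer(k, np.arccos(t)))
    phi_t = series / (np.pi * np.sqrt(1.0 - t**2))
    # Map back to the original variable: lambda = c + d*t, phi(lambda) = phi_t / d
    return c + d * t, phi_t / d

# Hypothetical example: rank-300 PSD spectrum of size 1000
n, rank = 1000, 300
eigs = np.concatenate([np.zeros(n - rank), np.linspace(0.2, 2.5, rank)])
lam, phi = kpm_dos(eigs, m=30)
```

Plotting `phi` against `lam` should show the features described above: a tall peak at zero, a dip below 0.2, and a plateau of height close to $(k/n)/2.3$ between 0.2 and 2.5. With degree 30 the dip is smoothed and does not reach exactly zero.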

To estimate the rank $k$ of this matrix, we can count the number of eigenvalues in the interval $[\varepsilon, \lambda_1] \equiv [0.2, 2.5]$ by integrating the DOS function over the interval. The value $\lambda_1 = 2.5$ can be replaced by an estimate of the largest eigenvalue. The initial value $\varepsilon = 0.2$ can be estimated as the point immediately following the initial sharp drop, or as the midpoint of the valley. For low rank matrices such as the one considered here, we should expect to see this sharp drop followed by a valley. The cutoff point between zero eigenvalues and relevant ones should be at the location where the curve ceases to decrease, i.e., the point where the derivative of the spectral density function becomes zero (local minimum) for the first time. Thus, the threshold can be selected as

$$\varepsilon = \min\{\, t : \phi'(t) = 0,\ \lambda_n \le t \le \lambda_1 \,\}. \tag{3.7}$$
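The two steps just described, locating the first local minimum of the DOS and integrating the DOS from that point to $\lambda_1$, can be sketched on a grid as follows. To keep the example short, the DOS here is a Gaussian-smoothed spectral density rather than a KPM estimate; that substitution, the smoothing width `sigma = 0.03`, the grid, and the function names are all assumptions made for illustration.

```python
import numpy as np

def first_local_minimum(t, phi):
    """Grid version of eq. (3.7): first point where phi' goes from negative to >= 0."""
    dphi = np.gradient(phi, t)
    hits = np.where((dphi[:-1] < 0) & (dphi[1:] >= 0))[0]
    if hits.size == 0:
        raise ValueError("no local minimum found; the DOS has no valley")
    return t[hits[0] + 1]

def count_eigenvalues(t, phi, n, a, b):
    """Approximate the eigenvalue count in [a, b] as n times the integral of phi."""
    mask = (t >= a) & (t <= b)
    ts, ps = t[mask], phi[mask]
    return n * np.sum(0.5 * (ps[1:] + ps[:-1]) * np.diff(ts))  # trapezoidal rule

# Hypothetical example: the low rank spectrum described above, n = 1000, k = 300
n, k = 1000, 300
eigs = np.concatenate([np.zeros(n - k), np.linspace(0.2, 2.5, k)])

# Stand-in DOS: average of Gaussians centered at the eigenvalues (not KPM)
sigma = 0.03
t = np.linspace(0.0, 2.7, 2701)
phi = np.exp(-0.5 * ((t[:, None] - eigs[None, :]) / sigma) ** 2).sum(axis=1)
phi /= n * sigma * np.sqrt(2.0 * np.pi)

eps = first_local_minimum(t, phi)
rank_estimate = count_eigenvalues(t, phi, n, eps, t[-1])
print(f"threshold = {eps:.3f}, estimated rank = {rank_estimate:.1f}")
```

In this setting the threshold should land inside the valley between 0 and 0.2, and the integral should recover a count close to $k$.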

For more general numerically rank deficient matrices, the same idea based on the DOS plot can be employed to determine the approximate rank. Defining a cut-off value between the relevant singular values and insignificant ones in this way works when there is a gap in the matrix spectrum. This corresponds to matrices that have a cluster of eigenvalues close to zero, which are zero eigenvalues perturbed by noise/errors, followed by an interval with few or no eigenvalues, a gap, and then clusters of relevant eigenvalues, which contribute to the approximate rank. Two types of DOS plots are often encountered depending on the number of relevant eigenvalues and whether they are in clusters or spread out wide.

Figures 3.3(B) and (C) show two sample DOS plots that belong to these two categories, respectively. Both plots were estimated using KPM and the exact trace of the matrices, as in the previous low rank matrix case. The middle plot (figure 3.3(B)) is a typical DOS curve for a matrix that has a large number of noise-related eigenvalues close to zero and a number of larger relevant eigenvalues grouped in a few clusters. The spectral density curve decreases quickly from its high value near the zero eigenvalues because of the gap in the spectrum, and then increases again where the clusters of large eigenvalues appear. In this case, we can use equation (3.7) to estimate the threshold $\varepsilon$.

Algorithm 2 Numerical rank estimation by polynomial filtering

Input: An $n \times n$ symmetric PSD matrix $A$, $\lambda_1$ and $\lambda_n$ of $A$, and the number $n_v$ of sample vectors to be used.

Output: The numerical rank $r_\varepsilon$ of $A$.

1. Generate the random starting vectors $v_l$, $l = 1, \dots, n_v$, such that $\|v_l\|_2 = 1$.

2. Transform the matrix $A$ to $B = A/\lambda_1$, choose the degree $m$ for the DOS, and form the matvecs $B^k v_l$, $l = 1, \dots, n_v$, $k = 0, \dots, m$.

3. Form the scalars $v_l^\top T_k(B) v_l$ using the above matvecs and obtain the DOS $\tilde{\phi}(t)$.

4. Estimate the threshold $\varepsilon$ from $\tilde{\phi}(t)$ using eq. (3.8).

5. McWeeny filter: Estimate $m_1$ and $\tau_1$ from $\varepsilon$. Compute $\Theta_{[m_0,m_1]} v_l$ using the above matvecs (compute additional matvecs if required). Estimate the numerical rank $r_\varepsilon$.

   Chebyshev filter: Compute the degree $m$ and estimate the coefficients $\gamma_k$ for the interval $[\varepsilon, \lambda_1]$. Compute the numerical rank $r_\varepsilon$ using the above matvecs.

In the last DOS plot, in figure 3.3(C), the matrix again has a large number of noise-related eigenvalues close to zero, but the number of relevant eigenvalues is smaller, and these eigenvalues are spread farther and farther apart as their values increase (for example, when $\lambda_i = K(n-i)^2$). The DOS curve has a similar high value near the zero eigenvalues and displays a sharp drop, but it does not increase again and tends to hover near zero. In this case there is no valley or local minimum, so the derivative of the DOS function may not reach zero. The best we can do here is detect the point at which the derivative first exceeds a certain negative value, indicating a significant slow-down of the initial fast decrease. In summary, the threshold for all three cases can be selected as

$$\varepsilon = \min\{\, t : \phi'(t) \ge \mathrm{tol},\ \lambda_n \le t \le \lambda_1 \,\}. \tag{3.8}$$

Our sample codes use tol = −0.01, which seems to work well in practice.
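A grid-based reading of eq. (3.8) can be sketched as below. This is an illustration under stated assumptions, not the sample code referred to above. In particular, the search is started only after the derivative has first dropped below tol, so that a flat spot at the very top of the initial peak is not mistaken for the threshold; that starting rule, the function name and the two synthetic DOS curves are assumptions. The value of tol is in the units of the DOS derivative, so it presupposes a particular scaling of the curve.

```python
import numpy as np

def select_threshold(t, phi, tol=-0.01):
    """First grid point, after the initial steep drop begins, where phi'(t) >= tol."""
    dphi = np.gradient(phi, t)
    steep = np.where(dphi < tol)[0]
    if steep.size == 0:
        raise ValueError("the DOS never decreases faster than tol")
    start = steep[0]                       # beginning of the initial fast decrease
    flat = np.where(dphi[start:] >= tol)[0]
    if flat.size == 0:
        raise ValueError("the DOS derivative never rises back above tol")
    return t[start + flat[0]]

t = np.linspace(0.0, 4.0, 4001)

# Synthetic knee-shaped curve, as in figure 3.3(C): sharp drop, then hovering near zero
phi_knee = 5.0 * np.exp(-t / 0.05) + 0.02
eps_knee = select_threshold(t, phi_knee)

# Synthetic valley-shaped curve, as in figure 3.3(B): the density rises again near t = 1
phi_valley = 5.0 * np.exp(-t / 0.05) + 0.15 * (1.0 + np.tanh((t - 1.0) / 0.1))
eps_valley = select_threshold(t, phi_valley)

print(eps_knee, eps_valley)
```

For the valley-shaped curve the rule stops slightly before the local minimum of eq. (3.7), since the derivative reaches tol just before it reaches zero.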

When the input matrix does not have a large gap between the relevant and noisy eigenvalues (that is, when the numerical rank is not well defined), the DOS plot of that matrix behaves like the plot in figure 3.3(C), except that it does not go to zero. That is, the DOS curve has a knee similar to the one in figure 3.3(C).

3.4.2 Algorithm

Algorithm 2 describes our approach for estimating the approximate rank $r_\varepsilon$ by the two polynomial filtering methods discussed earlier.
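The Chebyshev-filter branch of Algorithm 2 can be sketched in a few lines. This is a hedged illustration, not the authors' implementation, and it departs from the algorithm as stated in three labeled ways: the threshold `eps` is passed in rather than estimated from the DOS (step 4 is assumed done); the matrix is mapped to $[-1, 1]$ by $B = (A - cI)/d$ with $c = (\lambda_1 + \lambda_n)/2$ and $d = (\lambda_1 - \lambda_n)/2$ instead of $B = A/\lambda_1$; and the vectors $T_k(B)v_l$ are formed with the three-term recurrence rather than from the powers $B^k v_l$. The Jackson damping of the step-function coefficients and all names are likewise assumptions.

```python
import numpy as np

def jackson_coefficients(m):
    """Jackson damping factors g_0..g_m for a degree-m Chebyshev expansion."""
    N = m + 1
    k = np.arange(N)
    a = np.pi / (N + 1)
    return ((N - k + 1) * np.cos(a * k) + np.sin(a * k) / np.tan(a)) / (N + 1)

def chebyshev_filter_rank(A, lam_min, lam_max, eps, m=50, nv=30, seed=0):
    """Estimate the number of eigenvalues of A in [eps, lam_max]."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    c = 0.5 * (lam_max + lam_min)
    d = 0.5 * (lam_max - lam_min)

    def apply_B(X):
        # Matvecs with B = (A - c I) / d, whose spectrum lies in [-1, 1]
        return (A @ X - c * X) / d

    # Step 1: random unit-norm starting vectors, one per column
    V = rng.standard_normal((n, nv))
    V /= np.linalg.norm(V, axis=0)

    # Steps 2-3: scalars v_l^T T_k(B) v_l via T_{k+1}(B)v = 2 B T_k(B)v - T_{k-1}(B)v
    moments = np.empty((m + 1, nv))
    W_prev, W = V, apply_B(V)
    moments[0] = np.sum(V * W_prev, axis=0)
    moments[1] = np.sum(V * W, axis=0)
    for k in range(2, m + 1):
        W_prev, W = W, 2.0 * apply_B(W) - W_prev
        moments[k] = np.sum(V * W, axis=0)

    # Step 5: Chebyshev coefficients gamma_k of the step function on [eps, lam_max]
    theta = np.arccos((eps - c) / d)
    j = np.arange(1, m + 1)
    gamma = np.empty(m + 1)
    gamma[0] = theta / np.pi
    gamma[1:] = 2.0 * np.sin(j * theta) / (np.pi * j)

    # Per-vector estimates n * v_l^T h(B) v_l; their mean is the rank estimate
    per_vector = n * ((jackson_coefficients(m) * gamma) @ moments)
    return per_vector.mean(), per_vector

# Hypothetical test matrix: n = 400 with 60 eigenvalues in [0.2, 2.5], the rest zero
rng = np.random.default_rng(1)
n, k = 400, 60
eigs = np.concatenate([np.zeros(n - k), np.linspace(0.2, 2.5, k)])
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (Q * eigs) @ Q.T
rank_est, _ = chebyshev_filter_rank(A, 0.0, 2.5, eps=0.1)
print(f"estimated rank: {rank_est:.1f} (exact: {k})")
```

The rows of `moments` are the same scalars that step 3 uses to build the DOS, which is why the matvecs only need to be computed once. Because `A` enters only through `A @ X`, the same sketch should also accept a SciPy sparse matrix.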

Computational cost. The core of the computation in the two rank estimation methods is the matrix-vector product of the form $T_k(A)v_l$, or in general $A^k v_l$, for $l = 1, \dots, n_v$, $k = 0, \dots, m$ (step 3). Note that no matrix-matrix products or factorizations are required. In addition, the matrix-vector products $A^k v_l$ computed for the spectral density during the estimation of the threshold can be saved and reused for the rank estimation, so the related matrix-by-vector products are computed only once. All remaining steps of the algorithm are essentially based on these 'matvec' operations. For an $n \times n$ dense symmetric PSD matrix, the computational cost of Algorithm 2 is $O(n^2 m n_v)$. For a sparse matrix, the computational cost is $O(\mathrm{nnz}(A)\, m\, n_v)$, where $\mathrm{nnz}(A)$ is the number of nonzero entries of $A$. This cost is linear in the number of nonzero entries of $A$ for large matrices, and it is generally quite low when $A$ is very sparse, e.g., when $\mathrm{nnz}(A) = O(n)$. These methods are very inexpensive compared to methods that require matrix factorizations such as QR or SVD.
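As a rough illustration of these estimates, consider hypothetical sizes that are not taken from the text: $n = 10^6$, $\mathrm{nnz}(A) = 10n$, $m = 50$ and $n_v = 30$. The $O(n^3)$ figure for a dense factorization is the usual order-of-magnitude cost, quoted only for comparison.

```latex
\begin{align*}
\text{sparse matvecs:}\quad & \mathrm{nnz}(A)\, m\, n_v = 10^{7} \cdot 50 \cdot 30 = 1.5 \times 10^{10},\\
\text{dense matvecs:}\quad  & n^{2}\, m\, n_v = 10^{12} \cdot 50 \cdot 30 = 1.5 \times 10^{15},\\
\text{dense QR or SVD:}\quad & O(n^{3}) \sim 10^{18}.
\end{align*}
```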

Remark 3 In some rank estimation applications, it may also be necessary to estimate the corresponding eigenpairs or singular triplets of the matrix after the approximate rank has been estimated. These can be easily computed using a Rayleigh-Ritz projection type method, again exploiting the vectors $A^k v_l$ generated for estimating the rank.

On the convergence. The convergence analysis of the trace estimator was discussed in the previous chapter. The best known convergence rate for (2.4) is $O(1/\sqrt{n_v})$ for Hutchinson and Gaussian distributions; see Theorem 3.
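The $O(1/\sqrt{n_v})$ behavior can be observed empirically with a small experiment. The sketch below uses Rademacher (random $\pm 1$) sample vectors, the usual choice for the Hutchinson estimator, on a hypothetical random PSD matrix; the matrix, the sample sizes and the number of repetitions are assumptions chosen only to make the trend visible.

```python
import numpy as np

def hutchinson_trace(matvec, n, nv, rng):
    """Stochastic trace estimate: average of v^T (A v) over nv Rademacher vectors."""
    V = rng.choice([-1.0, 1.0], size=(n, nv))
    return np.mean(np.sum(V * matvec(V), axis=0))

rng = np.random.default_rng(0)
n = 500
M = rng.standard_normal((n, n))
A = M @ M.T / n                      # hypothetical dense PSD test matrix
exact = np.trace(A)

# Mean relative error over 20 repetitions for increasing numbers of sample vectors
mean_rel_err = {}
for nv in (10, 40, 160, 640):
    errs = [abs(hutchinson_trace(lambda X: A @ X, n, nv, rng) - exact) / exact
            for _ in range(20)]
    mean_rel_err[nv] = float(np.mean(errs))
    print(nv, mean_rel_err[nv])
```

Each fourfold increase in $n_v$ should roughly halve the error, which is the rate stated above.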

Theoretical analysis of the approximation of a step function as in (2.2) is not straightforward, since the function being approximated is discontinuous. Convergence analysis for approximating a step function is documented in [83]. A convergence rate of $O(1/m)$ can be achieved with any polynomial approximation [83]. However, this rate is obtained from a point-by-point analysis (in the vicinity of the discontinuity points), and uniform convergence cannot be achieved due to the Gibbs phenomenon.
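The Gibbs phenomenon is easy to see numerically. The sketch below expands the step function that equals 1 on $[0, 1]$ and 0 on $[-1, 0)$ in Chebyshev polynomials of degree $m = 50$, once with the raw coefficients and once with Jackson damping. The location of the step, the degree, the particular form of the Jackson factors and the function names are assumptions made for illustration.

```python
import numpy as np

def jackson_coefficients(m):
    """Jackson damping factors g_0..g_m for a degree-m Chebyshev expansion."""
    N = m + 1
    k = np.arange(N)
    a = np.pi / (N + 1)
    return ((N - k + 1) * np.cos(a * k) + np.sin(a * k) / np.tan(a)) / (N + 1)

def step_coefficients(m, a):
    """Chebyshev coefficients of the indicator function of [a, 1] on [-1, 1]."""
    theta = np.arccos(a)
    k = np.arange(1, m + 1)
    gamma = np.empty(m + 1)
    gamma[0] = theta / np.pi
    gamma[1:] = 2.0 * np.sin(k * theta) / (np.pi * k)
    return gamma

m = 50
t = np.linspace(-1.0, 1.0, 4001)
T = np.cos(np.outer(np.arange(m + 1), np.arccos(t)))   # row k holds T_k(t)

gamma = step_coefficients(m, 0.0)
plain = gamma @ T                                      # truncated expansion
damped = (jackson_coefficients(m) * gamma) @ T         # Jackson-damped expansion

print("max without damping:", plain.max())
print("max with Jackson damping:", damped.max())
```

The undamped expansion overshoots the value 1 next to the jump, and this overshoot does not shrink as $m$ grows, whereas the damped expansion stays between 0 and 1 at the price of a wider transition region.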

Improved theoretical results can be obtained if we first replace the step function by a piecewise linear approximation, and then employ polynomial approximation. Article [81] shows that uniform convergence can be achieved using Hermite polynomial approximation (as in sec. 3.3.1) when the filter is constructed as a spline (piecewise linear) function. For example,

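$$\psi(t) = \begin{cases} 0 & \text{for } t \in [0, \varepsilon_0), \\ \Theta_{[m_0,m_1]} & \text{for } t \in [\varepsilon_0, \varepsilon_1), \\ 1 & \text{for } t \in [\varepsilon_1, 1]. \end{cases} \tag{3.9}$$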

[Figure 3.4 panels: (left) "DOS by LSM deg = 50", plotting $\phi(\lambda)$ against $\lambda$, legend "Spectroscopic"; (middle) "McWeeny filter method" and (right) "Chebyshev filter method", each plotting the estimated number of eigenvalues in the interval against the number of vectors (1 to 30), with legend entries including $(r_\varepsilon)_\ell$, a cumulative average, and the exact value.]

Figure 3.4: Numerical ranks estimated for the example ukerbe1.

It is well known that uniform convergence can be achieved with Chebyshev polynomial approximation if the function approximated is continuous and differentiable [57]. Further improvement in the convergence rate can be obtained if the step function is replaced by an analytic function; for example, $\psi(t)$ can be a shifted version of the $\tanh(\alpha t)$ function. In this case, an exponential convergence rate can be achieved with Chebyshev polynomial approximation [57]. However, such complicated implementations are unnecessary in practice. The bounds obtained for both the trace estimator and the approximation of step functions discussed above are too pessimistic, since in practice we can get accurate ranks for $m \sim 50$ and $n_v \sim 30$.
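The fast convergence for an analytic replacement of the step function can be checked with a short experiment. The sketch below interpolates a smooth step $\tfrac{1}{2}(1 + \tanh(\alpha t))$ on $[-1, 1]$ with NumPy's `Chebyshev.interpolate` and records the maximum error for several degrees. The value $\alpha = 10$, the placement of the transition at $t = 0$, and the use of interpolation at Chebyshev points (rather than a truncated expansion) are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Chebyshev

alpha = 10.0  # hypothetical steepness of the smooth step

def smooth_step(t):
    """Analytic surrogate for a step at t = 0, built from tanh."""
    return 0.5 * (1.0 + np.tanh(alpha * t))

grid = np.linspace(-1.0, 1.0, 2001)
errors = {}
for m in (10, 20, 40, 80):
    p = Chebyshev.interpolate(smooth_step, m)      # degree-m Chebyshev interpolant
    errors[m] = float(np.max(np.abs(p(grid) - smooth_step(grid))))
    print(m, errors[m])
```

The maximum error should fall roughly geometrically with the degree, in contrast with the $O(1/m)$ pointwise rate quoted for the discontinuous step. A larger $\alpha$ gives a sharper step but a slower geometric rate.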

In document Algorithmic advances in learning from large dimensional matrices and scientific data (Page 61-66)