
Multi-Grid Lanczos

M. A. Clark¹, Chulwoo Jung², and Christoph Lehner²

¹ NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara, CA 95050, USA
² Physics Department, Brookhaven National Laboratory, Upton, NY 11973, USA

Abstract. We present a Lanczos algorithm utilizing multiple grids that reduces the memory requirements both on disk and in working memory by one order of magnitude for RBC/UKQCD’s 48I and 64I ensembles at the physical pion mass. The precision of the resulting eigenvectors is on par with exact deflation.

1 Introduction

In recent years RBC/UKQCD has benefited significantly from the generation of the 2000 lowest eigenvectors of the preconditioned normal (z)Möbius Domain Wall Fermion Dirac operator for the light quarks on the 48I ($a^{-1} = 1.7$ GeV) and 64I ($a^{-1} = 2.3$ GeV) ensembles at near physical pion mass. These eigenvectors were used for deflation and volume averages over the low-mode space and were a key ingredient in the ongoing $g-2$ projects [1–3]; they have also found additional use in the calculation of $\Delta M_K$ [4].

The storage cost for these vectors is substantial, with 9.3 TB and 36 TB per configuration for the 48I and 64I ensembles, respectively. These high storage requirements, both on disk and in RAM, are addressed in this contribution, allowing for usage of these methods at even larger volumes. Our approach makes deflation much more applicable to architectures with limited amounts of high-bandwidth memory, such as GPUs, and allows for running on small-scale clusters.

We note that related ideas were recently successfully used in the context of Monte-Carlo estimation of the trace of a matrix inverse [5].

2 Eigenvector compression

We first explore the compression of existing eigenvectors computed with a Chebyshev-accelerated implicitly restarted Lanczos (IRL) on the original lattice. To this end, we create a spatially-blocked basis out of the lowest N modes and write all eigenmodes in this basis [6]. For the figures shown below, we have used N = 400 for the 48I ensemble and N = 250 for the 64I ensemble. The blocking allows us to create a coarse-grid representation of the eigenmodes. Figs. 1 and 2 illustrate the efficacy of this blocking for the eigenvector compression. The squared relative error is the squared norm of the difference of original and reconstructed vector divided by the squared norm of the original vector. In all cases shown here, we only have a single block in the fifth dimension.
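The blocking and the error measure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a one-dimensional toy lattice with random complex entries in place of Dirac eigenvectors, and the helper names (`block_basis`, `to_coarse`, `to_fine`, `squared_relative_error`) as well as the sizes are hypothetical choices made for illustration.

```python
import math
import random

def inner(a, b):
    """Hermitian inner product <a, b>."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

def block_basis(vectors, block_size):
    """Restrict each vector to each block and orthonormalize block by block."""
    n_sites = len(vectors[0])
    assert n_sites % block_size == 0, "toy sketch: blocks must tile the lattice"
    basis = []  # basis[b] holds the orthonormal vectors local to block b
    for start in range(0, n_sites, block_size):
        local = []
        for v in vectors:
            w = list(v[start:start + block_size])
            for _ in range(2):  # second Gram-Schmidt pass for numerical stability
                for q in local:
                    c = inner(q, w)
                    w = [wi - c * qi for wi, qi in zip(w, q)]
            norm = math.sqrt(inner(w, w).real)
            if norm > 1e-12:
                local.append([wi / norm for wi in w])
        basis.append(local)
    return basis

def to_coarse(v, basis, block_size):
    """Coarse-grid representation: one coefficient per block and basis vector."""
    return [[inner(q, v[b * block_size:(b + 1) * block_size]) for q in local]
            for b, local in enumerate(basis)]

def to_fine(coarse, basis, block_size):
    """Reconstruct a fine-grid vector from its coarse-grid coefficients."""
    v = []
    for coeffs, local in zip(coarse, basis):
        block = [0j] * block_size
        for c, q in zip(coeffs, local):
            block = [bi + c * qi for bi, qi in zip(block, q)]
        v.extend(block)
    return v

def squared_relative_error(original, reconstructed):
    """|original - reconstructed|^2 / |original|^2."""
    diff = [a - b for a, b in zip(original, reconstructed)]
    return inner(diff, diff).real / inner(original, original).real

# Toy data (assumption): 5 random complex vectors, the first 4 define the basis.
random.seed(0)
n_sites, block_size, n_basis = 32, 8, 4
vecs = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n_sites)]
        for _ in range(n_basis + 1)]
basis = block_basis(vecs[:n_basis], block_size)

def roundtrip(v):
    return to_fine(to_coarse(v, basis, block_size), basis, block_size)

err_in = squared_relative_error(vecs[0], roundtrip(vecs[0]))
err_out = squared_relative_error(vecs[n_basis], roundtrip(vecs[n_basis]))
print("vector inside the basis: ", err_in)
print("vector outside the basis:", err_out)
```

In this toy setting a vector that was used to build the basis is reproduced up to rounding, whereas an unrelated random vector is not. The compression discussed in the text relies on the low modes being locally similar, which random vectors are not.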

Speaker, e-mail: [email protected]

© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).

EPJ Web of Conferences 175, 14023 (2018) https://doi.org/10.1051/epjconf/201817514023


We furthermore reduce storage cost by expressing the eigenvectors in terms of a two-byte fixed-precision representation, where all spin-color elements for a given five-dimensional position share a common two-byte exponent. We use a single-precision representation for the first 100 basis vectors and this two-byte representation for 101, ..., N to reduce precision loss, see Fig. 3.
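A shared-exponent encoding of this kind can be sketched as follows. This is only an illustration under stated assumptions: the exact bit layout of the authors' format is not given in the text, so the choice of signed 16-bit mantissas, a power-of-two exponent, little-endian packing, and the function names `encode_site` and `decode_site` are all hypothetical.

```python
import math
import struct

def encode_site(values):
    """Pack real numbers as int16 mantissas plus one shared int16 exponent."""
    peak = max(abs(v) for v in values)
    # frexp writes peak = m * 2**e with 0.5 <= m < 1, so |v| / 2**e < 1 for all v.
    exponent = math.frexp(peak)[1] if peak > 0.0 else 0
    scale = math.ldexp(1.0, 15 - exponent)  # map (-1, 1) onto the int16 range
    # Clamp because rounding can reach 32768, which does not fit in int16.
    mantissas = [max(-32767, min(32767, round(v * scale))) for v in values]
    return struct.pack(f"<h{len(values)}h", exponent, *mantissas)

def decode_site(blob):
    """Inverse of encode_site: rescale the mantissas by the shared exponent."""
    count = len(blob) // 2 - 1
    exponent, *mantissas = struct.unpack(f"<h{count}h", blob)
    scale = math.ldexp(1.0, exponent - 15)
    return [m * scale for m in mantissas]
```

For one site with 12 complex spin-color components (24 real numbers), this sketch stores 2 × (24 + 1) = 50 bytes instead of the 96 bytes needed in single precision. The absolute error of each element is bounded relative to the largest element at that site, which is why small components lose relative precision.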

Fig. 1: Squared relative error of eigenvector n + 1 when reconstructed from blocked eigenvectors 1, ..., n. (Logarithmic vertical axis; curves: $48^3$ with 4.4.3.4 blocks, $48^3$ with 2.2.3.2 blocks, $64^3$ with 4.4.4.4 blocks.)

Fig. 2: Squared relative error of eigenvector n when reconstructed from blocked eigenvectors 1, ..., 400. (Logarithmic vertical axis; curves: $48^3$ with 4.4.3.4 blocks, $48^3$ with 2.2.3.2 blocks, $64^3$ with 4.4.4.4 blocks.)

Fig. 3: Effect of keeping the first 100 basis vectors in single precision instead of keeping all vectors in two-byte fixed-point precision. (Logarithmic vertical axis, horizontal axis n; curves: $48^3$ with 4.4.3.4 blocks and 100 single-precision vectors, $48^3$ with 4.4.3.4 blocks and 0 single-precision vectors.)

In the following, we show that the precision loss from this compression technique is minimal. The effects on a sloppy CG solve and on a full low-mode volume average are negligible, see Figs. 4 and 5. In the case of a single point source, the effects become visible at long distances, see Fig. 6, but they are sufficiently small that the statistical advantage of a low-mode subtraction is not reduced.

Fig. 4: Squared CG residual for point source on $48^3$ ensemble. (Logarithmic vertical axis; curves: Original, Compressed.)

Fig. 5: 0 0 correlator $C(t)$ times $t^4$ on $48^3$ ensemble for the full low-mode average on a single configuration. (Curves: Original, Compressed.)

Fig. 6: 0 0 correlator $C(t)$ times $t^4$ on $48^3$ ensemble and its low-mode approximation for a single point source on a single configuration. (Curves: Original Low, Compressed Low, Original Full, Compressed Full.)

The numerical experiments presented here were performed with an open-source stand-alone compression tool that is available at Ref. [6].

Multi-Grid Lanczos

In this section we demonstrate that we can also generate the eigenvector data directly in its compressed representation. To this end, we have developed a Multi-Grid Lanczos method that is now publicly available at Ref. [7].

The basic steps are as follows:

1. Compute the N basis vectors with a first round of Chebyshev-accelerated IRL. We have found significant precision benefits from creating a precise basis through the Lanczos algorithm compared to the use of an imprecise basis. The use of other methods, such as the Jacobi-Davidson iteration, to create the basis is currently being investigated.

2. For a given blocking, create a locally orthogonal basis using the results of step 1. This defines the mapping between the coarse and fine grids.

3. Run a second round of Chebyshev-accelerated IRL on the coarse grid to obtain the full set of eigenvectors.

4. Reconstruct an approximation of the eigenvalues by locally inverting the Chebyshev polynomial of the Lanczos eigenvalues.

5. The first eigenvalues outside of the basis, N + 1, N + 2, ..., may lack sufficient precision, which we correct by smoothing the corresponding eigenvectors (currently with low-iteration CG) and then determining the precise fine-grid eigenvalues.
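Step 4 can be illustrated with a small Python sketch. It rests on assumptions not stated in the text: the Chebyshev polynomial is taken as $T_n$ with the interval [lo, hi] mapped onto [-1, 1], the wanted eigenvalues lie below lo where $T_n$ is monotonic, and the inversion is done by bisection. The function names and all parameter values are hypothetical, and the actual implementation in Ref. [7] may proceed differently.

```python
def chebyshev(x, lo, hi, order):
    """Evaluate T_order(y), where y maps the interval [lo, hi] onto [-1, 1]."""
    y = (2.0 * x - hi - lo) / (hi - lo)
    t_prev, t_curr = 1.0, y
    if order == 0:
        return t_prev
    for _ in range(order - 1):
        # Three-term recurrence T_{k+1}(y) = 2 y T_k(y) - T_{k-1}(y)
        t_prev, t_curr = t_curr, 2.0 * y * t_curr - t_prev
    return t_curr

def invert_locally(value, lo, hi, order, left=0.0, tol=1e-14):
    """Find x in [left, lo] with chebyshev(x) == value by bisection.

    Below lo the mapped argument is smaller than -1, where T_order is
    monotonic, so the polynomial can be inverted on this interval.
    """
    a, b = left, lo
    fa = chebyshev(a, lo, hi, order) - value
    fb = chebyshev(b, lo, hi, order) - value
    if fa * fb > 0:
        raise ValueError("value is not bracketed on [left, lo]")
    while b - a > tol * max(1.0, abs(b)):
        m = 0.5 * (a + b)
        fm = chebyshev(m, lo, hi, order) - value
        if fa * fm <= 0:
            b, fb = m, fm
        else:
            a, fa = m, fm
    return 0.5 * (a + b)

# Hypothetical numbers: a low eigenvalue, its image under the polynomial,
# and the eigenvalue recovered from that image.
lam = 0.003
image = chebyshev(lam, lo=0.02, hi=6.0, order=20)
recovered = invert_locally(image, lo=0.02, hi=6.0, order=20)
print(lam, image, recovered)
```

The inversion is only local: inside [lo, hi] the polynomial oscillates and many arguments share the same value, so the search has to be restricted to the region where the low modes are expected.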

The importance of precise fine-grid eigenvalues is illustrated in Fig. 7.

Fig. 7: Squared CG residual for volume source on $64^3$ ensemble using Multi-Grid Lanczos. (Logarithmic vertical axis; curves: Undeflated, Deflated with basis on fine grid, Deflated on coarse grid without smoothed eigenvalues, Deflated on coarse grid with smoothed eigenvalues.)

Summary

By using both local coherence of eigenvectors [5] and a two-byte fixed-precision representation of eigenvectors, we are able to reduce the memory footprint of the $48^3$ eigenvectors by 85%, from 9.3 TB to 1.4 TB, and of the $64^3$ eigenvectors by 90%, from 36 TB to 3.5 TB. Both a stand-alone compression tool [6] and a Multi-Grid Lanczos implementation [7] are available.
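The quoted percentages follow directly from the storage figures. The short Python check below uses only the numbers stated above; the function name `reduction_percent` is a hypothetical helper.

```python
def reduction_percent(original_tb, compressed_tb):
    """Relative reduction in storage, in percent."""
    return 100.0 * (1.0 - compressed_tb / original_tb)

# Storage per configuration in TB, before and after compression.
for name, before, after in [("48^3", 9.3, 1.4), ("64^3", 36.0, 3.5)]:
    print(f"{name}: {reduction_percent(before, after):.1f}% smaller, "
          f"factor {before / after:.1f}")
```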

Acknowledgments. CL acknowledges support through a DOE Early Career Award and we thank our colleagues in RBC/UKQCD for helpful discussions.

[1] T. Blum, N. Christ, M. Hayakawa, T. Izubuchi, L. Jin and C. Lehner, Phys. Rev. D 93 (2016), no. 1, 014503.

[2] T. Blum, P. A. Boyle, T. Izubuchi, L. Jin, A. Jüttner, C. Lehner, K. Maltman, M. Marinkovic, A. Portelli and M. Spraggs, Phys. Rev. Lett. 116 (2016), no. 23, 232002.

[3] T. Blum, N. Christ, M. Hayakawa, T. Izubuchi, L. Jin, C. Jung and C. Lehner, Phys. Rev. Lett. 118 (2017), no. 2, 022005.

[4] Z. Bai, N. H. Christ, T. Izubuchi, C. T. Sachrajda, A. Soni and J. Yu, Phys. Rev. Lett. 113 (2014) 112003.

[5] M. Lüscher, JHEP 07 (2007) 081.

[6] https://github.com/lehner/eigen-comp/.

[7] The Multi-Grid code is available at https://github.com/lehner/Grid/, which adds this feature to Peter Boyle’s Grid library https://github.com/paboyle/Grid/.



2

EPJ Web of Conferences 175, 14023 (2018) https://doi.org/10.1051/epjconf/201817514023

(3)

Multi-Grid Lanczos

Kate Clark

, Chulwoo Jung

, and Christoph Lehner

†,+

NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara, CA 95050, USA; Physics Department, BNL, Upton, NY 11973, USA

Introduction

In recent years RBC/UKQCD has benefited sig-nificantly from the generation of the 2000 low-est eigenvectors of the preconditioned normal (z)Mobius Domain Wall Fermion Dirac operator for the light quarks on the 483 (a−1 = 1.7 GeV) and 643 (a−1 = 2.3 GeV) ensembles at near phys-ical pion mass. These eigenvectors were used for deflation and volume averages over the low-mode space and were a key ingredient in the on-going

g 2 projects [1, 2, 3]; they have also found additional use in the calculation of ∆MK [4].

The storage cost for these vectors is substantial with 9.3 TB and 36 TB per configuration for the 483 and 643 ensembles respectively. These high storage requirements both on disk and in RAM are addressed in this contribution allowing for us-age of these methods at even larger volumes. Our approach makes deflation much more applicable to architectures with limited amounts of high-bandwidth memory such as GPUs and allows for running on small-scale clusters.

Eigenvector compression

We first explore the compression of existing eigen-vectors computed with a Chebyshev-accelerated implicitly restarted Lanczos (IRL) on the original lattice.

To this end, we create a spatially-blocked basis out of the lowest N modes and write all eigen-modes in this basis [5]. For the figures shown be-low, we have used N = 400 for the 483 ensemble and N = 250 for the 643 ensemble. The blocking allows us to create a coarse-grid representation of the eigenmodes. Figs. 1 and 2 illustrate the efficacy of this blocking for the eigenvector com-pression. In all cases shown here, we only have a single block in the fifth dimension.

We furthermore reduce storage cost by express-ing the eigenvectors in terms of a two-byte fixed-precision representation where all spin-color ele-ments for a given five-dimensional position share a common two-byte exponent. We use a single-precision representation for the first 100 ba-sis vectors and this two-byte representation for 101, . . . ,N to reduce precision loss, see Fig. 3.

1e-10 1e-09 1e-08 1e-07 1e-06 1e-05 1e-04 1e-03 1e-02 1e-01 1e+00

0 50 100 150 200 250 300 350 400 n

483, 4.4.3.4 block 483, 2.2.3.2 block 643, 4.4.4.4 block

Fig. 1: Squared relative error of eigenvector n+ 1 when reconstructed from blocked eigenvectors 1, . . . ,n.

1e-16 1e-14 1e-12 1e-10 1e-08 1e-06 1e-04 1e-02 1e+00

0 200 400 600 800 1000 1200 1400 1600 1800 2000 n

483, 4.4.3.4 block 483, 2.2.3.2 block 643, 4.4.4.4 block

Fig. 2: Squared relative error of eigenvector n when reconstructed from blocked eigenvectors 1, . . . ,400.

1e-16 1e-14 1e-12 1e-10 1e-08 1e-06 1e-04 1e-02 1e+00

0 200 400 600 800 1000 1200 1400 1600 1800 2000 n

483, 4.4.3.4 block, 100 single precision 483, 4.4.3.4 block, 0 single precision

Fig. 3: E↵ect of keeping the first 100 basis vectors in single precision instead of keeping all vectors in two-byte

fixed-point precision.

In the following, we show that the precision loss from this compression technique is minimal. The e↵ects on a sloppy CG solve and on a full low-mode volume average are negligible, see Figs. 4 and 5. In case of a single point source the ef-fects become visible at long distances, see Fig. 6, however, are sufficiently small that the statisti-cal advantage of a low-mode subtraction is not reduced. 1e-08 1e-07 1e-06 1e-05 1e-04 1e-03 1e-02 1e-01 1e+00

0 50 100 150 200 250

n

Original Compressed

Fig. 4: Squared CG residual for point source on483 ensemble. 0e+00 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06 7e+06 8e+06 9e+06

0 5 10 15 20 25 30 35

n Original

Compressed

Fig. 5: 0 0 correlator C(t) times t4 on 483 ensemble for the full low-mode average on a single configuration.

0e+00 1e+01 2e+01 3e+01 4e+01 5e+01 6e+01

0 5 10 15 20 25 30 35 n

Original Low Compressed Low Original Full Compressed Full

Fig. 6: 0 0 correlator C(t) times t4 on 483 ensemble and its low-mode approximation for a single point source

on a single configuration.

The numerical experiments presented here were performed with an open-source stand-alone com-pression tool that is available at Ref. [6].

Multi-Grid Lanczos

In this section we demonstrate the we can also generate the eigenvector data directly in its com-pressed representation. To this end, we have de-veloped a Multi-Grid Lanczos method that is now publicly available at Ref. [7].

The basic steps are as follows:

1. Compute the N basis vectors with a first round of Chebyshev-accelerated IRL. We have found significant precision benefits by creating a precise basis through the Lanczos algorithm compared to the use of an

imprecise basis. The use of other methods such as the Jacobi-Davidson iteration to create the basis is currently being

investigated.

2. For a given blocking, create a locally

orthogonal basis using the results of step 1. This defines the mapping between coarse and fine grid.

3. Solve a second round of

Chebyshev-accelerated IRL on the coarse Grid to obtain the full set of eigenvectors.

4. Reconstruct an approximation of the eigenvalues by locally inverting the Chebyshev polynomial of the Lanczos eigenvalues.

5. The first eigenvalues outside of the basis

N + 1,N + 2,. . . may lack sufficient precision which we correct by smoothening the

corresponding eigenvectors (currently with low-iteration CG) and then determining the precise fine-grid eigenvalues.

The importance of precise fine-grid eigenvalues is illustrated in Fig. 7.

1e-14 1e-12 1e-10 1e-08 1e-06 1e-04 1e-02 1e+00

0 200 400 600 800 1000 1200 1400 1600 1800 2000 n

Undeflated Deflated with basis on fine grid Defl. on coarse grid w/o smoothed eigenvalues Defl. on coarse grid w/ smoothed eigenvalues

Fig. 7: Squared CG residual for volume source on643 ensemble using Multi-Grid Lanczos.

Summary

By using both local coherence of eigenvectors [5] and a two-byte fixed-precision representation of eigenvectors we are able to reduce the memory footprint of the 483 eigenvectors by 85%, from 9.3 TB to 1.4 TB, and of the 643 eigenvectors by 90%, from 36 TB to 3.5 TB. Both a stand-alone compression tool [6] and a Multi-Grid Lanczos implementation [7] are available.

Acknowledgments. CL acknowledges support through a DOE Early Career Award and we thank our col-leagues in RBC/UKQCD for helpful discussions.

[1] T. Blum, N. Christ, M. Hayakawa, T. Izubuchi, L. Jin and C. Lehner, Phys. Rev.D93

(2016), no. 1, 014503.

[2] T. Blum, P. A. Boyle, T. Izubuchi, L. Jin, A. Jttner, C. Lehner, K. Maltman,

M. Marinkovic, A. Portelli and M. Spraggs, Phys. Rev. Lett.116(2016), no. 23, 232002.

[3] T. Blum, N. Christ, M. Hayakawa, T. Izubuchi, L. Jin, C. Jung and C. Lehner, Phys. Rev. Lett.118(2017), no. 2, 022005.

[4] Z. Bai, N. H. Christ, T. Izubuchi, C. T. Sachrajda, A. Soni and J. Yu, Phys. Rev. Lett.113

(2014) 112003.

[5] M. Luscher, JHEP07(2007) 081.

[6] https://github.com/lehner/eigen-comp/.

[7] The Multi-Grid code is available athttps://github.com/lehner/Grid/which adds this feature to Peter Boyle’s Grid libraryhttps://github.com/paboyle/Grid/.

+presenter

Figure 1.Squared relative error of eigenvectorn+1 when reconstructed from blocked eigenvectors 1, . . . ,n.

Multi-Grid Lanczos

Kate Clark

, Chulwoo Jung

, and Christoph Lehner

†,+

NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara, CA 95050, USA; Physics Department, BNL, Upton, NY 11973, USA

Introduction

In recent years RBC/UKQCD has benefited sig-nificantly from the generation of the 2000 low-est eigenvectors of the preconditioned normal (z)Mobius Domain Wall Fermion Dirac operator for the light quarks on the 483 (a−1 = 1.7 GeV) and 643 (a−1 = 2.3 GeV) ensembles at near phys-ical pion mass. These eigenvectors were used for deflation and volume averages over the low-mode space and were a key ingredient in the on-going

g 2 projects [1, 2, 3]; they have also found additional use in the calculation of ∆MK [4].

The storage cost for these vectors is substantial with 9.3 TB and 36 TB per configuration for the 483 and 643 ensembles respectively. These high storage requirements both on disk and in RAM are addressed in this contribution allowing for us-age of these methods at even larger volumes. Our approach makes deflation much more applicable to architectures with limited amounts of high-bandwidth memory such as GPUs and allows for running on small-scale clusters.

Eigenvector compression

We first explore the compression of existing eigen-vectors computed with a Chebyshev-accelerated implicitly restarted Lanczos (IRL) on the original lattice.

To this end, we create a spatially-blocked basis out of the lowest N modes and write all eigen-modes in this basis [5]. For the figures shown be-low, we have used N = 400 for the 483 ensemble and N = 250 for the 643 ensemble. The blocking allows us to create a coarse-grid representation of the eigenmodes. Figs. 1 and 2 illustrate the efficacy of this blocking for the eigenvector com-pression. In all cases shown here, we only have a single block in the fifth dimension.

We furthermore reduce storage cost by express-ing the eigenvectors in terms of a two-byte fixed-precision representation where all spin-color ele-ments for a given five-dimensional position share a common two-byte exponent. We use a single-precision representation for the first 100 ba-sis vectors and this two-byte representation for 101, . . . ,N to reduce precision loss, see Fig. 3.

1e-10 1e-09 1e-08 1e-07 1e-06 1e-05 1e-04 1e-03 1e-02 1e-01 1e+00

0 50 100 150 200 250 300 350 400 n

483, 4.4.3.4 block 483, 2.2.3.2 block 643, 4.4.4.4 block

Fig. 1: Squared relative error of eigenvector n+ 1when reconstructed from blocked eigenvectors 1, . . . ,n.

1e-16 1e-14 1e-12 1e-10 1e-08 1e-06 1e-04 1e-02 1e+00

0 200 400 600 800 1000 1200 1400 1600 1800 2000 n

483, 4.4.3.4 block 483, 2.2.3.2 block 643, 4.4.4.4 block

Fig. 2: Squared relative error of eigenvector n when reconstructed from blocked eigenvectors 1, . . . ,400.

1e-16 1e-14 1e-12 1e-10 1e-08 1e-06 1e-04 1e-02 1e+00

0 200 400 600 800 1000 1200 1400 1600 1800 2000 n

483, 4.4.3.4 block, 100 single precision 483, 4.4.3.4 block, 0 single precision

Fig. 3: E↵ect of keeping the first 100 basis vectors in single precision instead of keeping all vectors in two-byte

fixed-point precision.

In the following, we show that the precision loss from this compression technique is minimal. The e↵ects on a sloppy CG solve and on a full low-mode volume average are negligible, see Figs. 4 and 5. In case of a single point source the ef-fects become visible at long distances, see Fig. 6, however, are sufficiently small that the statisti-cal advantage of a low-mode subtraction is not reduced. 1e-08 1e-07 1e-06 1e-05 1e-04 1e-03 1e-02 1e-01 1e+00

0 50 100 150 200 250

n

Original Compressed

Fig. 4: Squared CG residual for point source on 483 ensemble. 0e+00 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06 7e+06 8e+06 9e+06

0 5 10 15 20 25 30 35

n Original

Compressed

Fig. 5: 0 0 correlator C(t)times t4 on 483 ensemble for the full low-mode average on a single configuration.

0e+00 1e+01 2e+01 3e+01 4e+01 5e+01 6e+01

0 5 10 15 20 25 30 35 n

Original Low Compressed Low Original Full Compressed Full

Fig. 6: 0 0 correlator C(t)times t4 on 483 ensemble and its low-mode approximation for a single point source

on a single configuration.

The numerical experiments presented here were performed with an open-source stand-alone com-pression tool that is available at Ref. [6].

Multi-Grid Lanczos

In this section we demonstrate the we can also generate the eigenvector data directly in its com-pressed representation. To this end, we have de-veloped a Multi-Grid Lanczos method that is now publicly available at Ref. [7].

The basic steps are as follows:

1. Compute the N basis vectors with a first round of Chebyshev-accelerated IRL. We have found significant precision benefits by creating a precise basis through the Lanczos algorithm compared to the use of an

imprecise basis. The use of other methods such as the Jacobi-Davidson iteration to create the basis is currently being

investigated.

2. For a given blocking, create a locally

orthogonal basis using the results of step 1. This defines the mapping between coarse and fine grid.

3. Solve a second round of

Chebyshev-accelerated IRL on the coarse Grid to obtain the full set of eigenvectors.

4. Reconstruct an approximation of the eigenvalues by locally inverting the Chebyshev polynomial of the Lanczos eigenvalues.

5. The first eigenvalues outside of the basis

N + 1,N + 2,. . . may lack sufficient precision which we correct by smoothening the

corresponding eigenvectors (currently with low-iteration CG) and then determining the precise fine-grid eigenvalues.

The importance of precise fine-grid eigenvalues is illustrated in Fig. 7.

1e-14 1e-12 1e-10 1e-08 1e-06 1e-04 1e-02 1e+00

0 200 400 600 800 1000 1200 1400 1600 1800 2000 n

Undeflated Deflated with basis on fine grid Defl. on coarse grid w/o smoothed eigenvalues Defl. on coarse grid w/ smoothed eigenvalues

Fig. 7: Squared CG residual for volume source on 643 ensemble using Multi-Grid Lanczos.

Summary

By using both local coherence of eigenvectors [5] and a two-byte fixed-precision representation of eigenvectors we are able to reduce the memory footprint of the 483 eigenvectors by 85%, from 9.3 TB to 1.4 TB, and of the 643 eigenvectors by 90%, from 36 TB to 3.5 TB. Both a stand-alone compression tool [6] and a Multi-Grid Lanczos implementation [7] are available.

Acknowledgments. CL acknowledges support through a DOE Early Career Award and we thank our col-leagues in RBC/UKQCD for helpful discussions.

[1] T. Blum, N. Christ, M. Hayakawa, T. Izubuchi, L. Jin and C. Lehner, Phys. Rev. D93 (2016), no. 1, 014503.

[2] T. Blum, P. A. Boyle, T. Izubuchi, L. Jin, A. Jüttner, C. Lehner, K. Maltman, M. Marinkovic, A. Portelli and M. Spraggs, Phys. Rev. Lett. 116 (2016), no. 23, 232002.

[3] T. Blum, N. Christ, M. Hayakawa, T. Izubuchi, L. Jin, C. Jung and C. Lehner, Phys. Rev. Lett. 118 (2017), no. 2, 022005.

[4] Z. Bai, N. H. Christ, T. Izubuchi, C. T. Sachrajda, A. Soni and J. Yu, Phys. Rev. Lett. 113 (2014) 112003.

[5] M. Lüscher, JHEP 07 (2007) 081.

[6] https://github.com/lehner/eigen-comp/.

[7] The Multi-Grid code is available at https://github.com/lehner/Grid/, which adds this feature to Peter Boyle's Grid library, https://github.com/paboyle/Grid/.

Figure 2. Squared relative error of eigenvector n when reconstructed from blocked eigenvectors 1, ..., 400.

We furthermore reduce storage cost by expressing the eigenvectors in terms of a two-byte fixed-precision representation, where all spin-color elements for a given five-dimensional position share a common two-byte exponent. We use a single-precision representation for the first 100 basis vectors and this two-byte representation for vectors 101, ..., N to reduce precision loss, see Fig. 3.
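A minimal sketch of such a shared-exponent encoding is given below. It is not the on-disk format of the compression tool, which the text does not spell out; it assumes that each real or imaginary component is stored as a signed 16-bit integer and that the shared exponent is the binary exponent of the largest component at that position. The function names are hypothetical.

```python
import math


def compress_site(values):
    """Quantize the components of one site to 16-bit integers with one shared exponent."""
    largest = max(abs(v) for v in values)
    # frexp writes largest = m * 2**exponent with 0.5 <= m < 1 (exponent is 0 for an all-zero site).
    exponent = math.frexp(largest)[1]
    # Scale so that every component falls inside the signed 16-bit range.
    scale = math.ldexp(1.0, 15 - exponent)
    mantissas = [max(-32767, min(32767, round(v * scale))) for v in values]
    return exponent, mantissas


def decompress_site(exponent, mantissas):
    """Undo compress_site up to the quantization error."""
    scale = math.ldexp(1.0, exponent - 15)
    return [m * scale for m in mantissas]


# Toy site with components of very different magnitude.
site = [0.75, -0.001, 0.3, 1e-6]
exponent, mantissas = compress_site(site)
restored = decompress_site(exponent, mantissas)
max_abs_error = max(abs(a - b) for a, b in zip(site, restored))
```

Because the exponent is shared, the absolute error is set by the largest component at the position: components much smaller than it lose relative precision, and the smallest one in this toy example is rounded to zero.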

In the following, we show that the precision loss from this compression technique is minimal. The effects on a sloppy CG solve and on a full low-mode volume average are negligible, see Figs. 4 and 5.

Figure 3. Effect of keeping the first 100 basis vectors in single precision instead of keeping all vectors in two-byte fixed-point precision.

Figure 4. Squared CG residual as function of iteration number for point source on 48I ensemble.

In the case of a single point source, the effects become visible at long distances, see Fig. 6, but they are sufficiently small that the statistical advantage of a low-mode subtraction is not reduced.

The numerical experiments presented here were performed with an open-source stand-alone compression tool that is available at Ref. [7].

EPJ Web of Conferences 175, 14023 (2018) https://doi.org/10.1051/epjconf/201817514023

Figure 5. γ0–γ0 correlator C(t) times t^4 on 48I ensemble for the full low-mode average on a single configuration.

Figure 6. γ0–γ0 correlator C(t) times t^4 on 48I ensemble and its low-mode approximation for a single point source on a single configuration.

3 Multi-Grid Lanczos

In this section we demonstrate that we can also generate the eigenvector data directly in its compressed representation. To this end, we have developed a Multi-Grid Lanczos method that is now publicly available at Ref. [8].

The basic steps are as follows:
