Hierarchically Parallel FE Software for Assembly Structures : FrontISTR - Parallel Performance Evaluation and Its Industrial Applications

(1)


Hiroshi Okuda

The University of Tokyo [email protected]

CO-DESIGN 2012, October 23-25, 2012, Peking University, Beijing

(2)

Outline

Background : Towards peta/exascale computing

Necessary advances in programming models and software

FrontISTR : as an HPC tool for industry

Large-grain Parallelism

Assembly structures under hierarchical gridding

Small-grain Parallelism

Blocking with padding for multicore CPU (and GPUs)

(3)

Necessary advances in programming models and software

Fast SpMV for unstructured grid

A programming model with at most two nested levels, i.e. message passing and loop decomposition

Keep consistency and stability

B/F required by the program should be consistent with the H/W

Automatic generation of compiler directives that take the “dependency” into account, after trial runs.

“Middleware (mid-level interface)”, bridging the application and the BLAS-based numerical libraries.

Uncertainty or risk in the physical model

Hierarchical consistency between H/W configuration and the physical modeling, particularly in engineering fields.

(4)

FrontISTR

HEC-MW

Nonlinear structural analysis functions have been deployed on a parallel FEM basis: HEC-MW.

FrontISTR built on HEC-MW

Advanced features of parallel FEM: hierarchical mesh refinement, assembly structure, up to O(10^5) nodes

Portability: from MPP to PC, CAE cloud

Nonlinear analysis functions

Hyper-elasticity/Thermal-elastic-plastic/Visco-elastic/Creep, Combined hardening rule

Total/Updated Lagrangian

Finite slip contact, Friction

(5)

Structural Analysis Functions Supported in FrontISTR

| Function | | Supported contents |
|---|---|---|
| Static | Material | Elastic/Hyper-elasticity/Thermal-Elastic-Plastic/Visco-Elastic/Creep, Combined hardening rule |
| | Geometry | Total Lagrangian/Updated Lagrangian |
| | Boundary | Augmented Lagrangian/Lagrangian multiplier method, Finite slip contact, Friction |
| Dynamic | | Linear/Nonlinear, Explicit/Implicit |
| Eigenvalue | | Lanczos method |
| Heat | | Steady / Non-steady (implicit), Nonlinear |

(6)


(7)

Thermal-Elastic-Plastic Analysis of Welding Residual Stress

Temperature

Joint research with IHI

Heat source transfer along a welding line

Residual stress induced by plastic deformation

(8)

Cupping press simulation / Elasto-plasticity and friction on contact faces

A punch is pressed into a blank, which is held between a die and a blank holder. The blank is formed into a cylindrical shape as the punch advances.

(9)

(10)

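[Figure labels: rubber, cord, canvas, pulley surface (rigid body), belt, driven pulley, driving pulley, shaft load, load torque, rotation. By symmetry, only half of the belt is modeled in the width direction.]

V belt

Needs for large-scale analysis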

Friction of power transmission belt

Joint research with Mitsuboshi Belt

Active

(11)

Courtesy of Advanced Simulation Technology of Mechanics R&D (先端力学シミュレーション研究所)

(12)

(13)


(14)

Mesh generation / Domain decomposition

Large-grain Parallelism :

Parallelization based on domain decomposition

[Figure: four subdomains, each with its own Local Data, FEM Code and Solver Subsystem, coupled through MPI]

(15)

Data structure for assembly structures with parallel and hierarchical gridding

A) Partitioning ( MPI ranks )

B) Hierarchical level

C) Assembly model

[Figure: Assembly_1 and Assembly_2 are coupled by MPC and partitioned at Level_1; both are refined to Level_2, where they are again coupled by MPC and partitioned]

(16)

Iterative solvers with MPC* preconditioning

| | CG itrs. | CPU (sec) | CPU/CG itr. (sec) |
|---|---|---|---|
| MPCCG | 14,203 | 3,456 | 2.43×10^-1 |
| Penalty+CG | 171,354 | 40,841 | 2.38×10^-1 |

[Figure: Mises stress]

Algorithm

$r_0 = T^T f - T^T K T u'_0, \quad p_0 = r_0$

For $k = 0, 1, \ldots$

$\alpha_k = \dfrac{(r_k, r_k)}{(p_k, T^T K T p_k)}$

$u'_{k+1} = u'_k + \alpha_k p_k$

$r_{k+1} = r_k - \alpha_k T^T K T p_k$

Check convergence

$\beta_k = \dfrac{(r_{k+1}, r_{k+1})}{(r_k, r_k)}$

$p_{k+1} = r_{k+1} + \beta_k p_k$

end

$u = T u'$

*) Multi-point constraint

T: Sparse MPC matrix
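The MPC-CG iteration on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not FrontISTR's implementation: the function name `mpc_cg`, the dense arrays, the relative-residual stopping test and the tiny 3-DOF spring chain are all assumptions made for the example, and no preconditioner is applied.

```python
import numpy as np

def mpc_cg(K, T, f, tol=1e-10, max_iter=1000):
    """Solve (T^T K T) u' = T^T f by plain CG and return u = T u'."""
    def apply_reduced(p):
        # T^T K T p, evaluated right to left so the triple product is never formed
        return T.T @ (K @ (T @ p))

    u_red = np.zeros(T.shape[1])            # u'_0 = 0
    b = T.T @ f
    b_norm = np.linalg.norm(b)
    if b_norm == 0.0:                        # trivial right-hand side
        return T @ u_red, 0
    r = b - apply_reduced(u_red)             # r_0 = T^T f - T^T K T u'_0
    p = r.copy()                             # p_0 = r_0
    rr = r @ r
    iterations = 0
    for k in range(max_iter):
        Ap = apply_reduced(p)
        alpha = rr / (p @ Ap)
        u_red = u_red + alpha * p
        r = r - alpha * Ap
        iterations = k + 1
        rr_new = r @ r
        if np.sqrt(rr_new) <= tol * b_norm:  # check convergence (assumed criterion)
            break
        beta = rr_new / rr
        p = r + beta * p
        rr = rr_new
    return T @ u_red, iterations

# Hypothetical example: a 3-DOF spring chain with the constraint u[2] = u[1].
K = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 1.0]])
f = np.array([0.0, 0.0, 1.0])
T = np.array([[1.0, 0.0],    # u[0] = u'[0]
              [0.0, 1.0],    # u[1] = u'[1]
              [0.0, 1.0]])   # u[2] = u'[1]  (the multi-point constraint)
u, iterations = mpc_cg(K, T, f)
```

Because the constraint is built into T, the returned displacement satisfies it exactly, whereas a penalty formulation only enforces it approximately and worsens the conditioning of the system.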

(17)

Assembled Structure: Piping composed of many parts

- 2nd order tet-mesh
- 3,093,453 elements
- 5,433,029 nodes
- Num. of MPC : 70,166

[Figure: Mises stress; labels: fixed, 10mm]

Piping system composed of many parts is easily handled.

5 pipes & 32 bolts


(18)

Strong Scaling with Refiner

- Static linear analysis of machine part
- 2nd order tetra element
- PCG (eps=10^-6)

FX10@UT

SPARC64 IXfx (1.848 GHz) × 1 CPU (16 cores) / node

(19)

Performance of 1 node is a crucial factor

Intra-node parallel

- Remove ‘dependency’ by ordering
- OpenMP and/or vectorization

Access of innermost loop

Experience

(20)

Blocking with padding for multicore CPU and GPUs

SpMV in iterative solvers is crucial for unstructured grids.

For improving B/F ratio: Blocking, Padding, ELL+CSR.

[Figure: Breakdown of CPU time in CG operations (SpMV, AXPY & AYPX, DOT)]

(21)

SpMV with CSR:

Flop/Byte = 1/{6*(1+m/nnz)} = 0.08～0.16

SpMV with BCSR:

Flop/Byte = 1/{4*(1+fill/nnz) + 2/c + 2m/nnz*(2+1/r)} = 0.18～0.21

nnz: number of non-zero components

m: number of columns,

r, c: block size,

fill: number of ‘zero’s for blocking
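The two Flop/Byte formulas can be evaluated directly. The sketch below is only a calculator for the expressions on this slide: the function names and the sample matrix sizes are hypothetical, and the term `2m/nnz*(2+1/r)` is read as `(2*m/nnz)*(2+1/r)`.

```python
def csr_flop_per_byte(m, nnz):
    """Flop/Byte of SpMV with CSR: 1 / {6 * (1 + m/nnz)}."""
    return 1.0 / (6.0 * (1.0 + m / nnz))

def bcsr_flop_per_byte(m, nnz, fill, r, c):
    """Flop/Byte of SpMV with BCSR for r x c blocks with `fill` padded zeros."""
    return 1.0 / (4.0 * (1.0 + fill / nnz)
                  + 2.0 / c
                  + (2.0 * m / nnz) * (2.0 + 1.0 / r))

# Limiting cases of the CSR formula:
csr_sparse_rows = csr_flop_per_byte(m=1, nnz=1)         # m/nnz = 1   -> 1/12
csr_dense_rows = csr_flop_per_byte(m=1, nnz=10**9)      # m/nnz -> 0  -> about 1/6

# Hypothetical matrix: 100,000 columns, 8,000,000 non-zeros, 3x3 blocks, no fill.
csr_example = csr_flop_per_byte(m=100_000, nnz=8_000_000)
bcsr_example = bcsr_flop_per_byte(m=100_000, nnz=8_000_000, fill=0, r=3, c=3)
```

The CSR limits 1/12 ≈ 0.083 and 1/6 ≈ 0.167 bracket the 0.08～0.16 range quoted above, and for the same hypothetical matrix the BCSR value is higher because one column index is shared by a whole block.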

(22)

Acceleration of SpMV Computation

Parallelization

Rows are distributed among threads.

Load balancing

Reallocate rows to balance loads.

Blocking

Matrix format is crucial.

CSR: Compressed Sparse Row, Flops/Byte = 0.08～0.16

BCSR: Blocked CSR, Flops/Byte = 0.18～0.21

[Figure: rows of an example sparse matrix distributed among Thread 1, Thread 2 and Thread 3, before and after balancing; the matrix stored in CSR as the arrays value, colindx and rowptr]
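Row distribution with load balancing can be illustrated with a small CSR example. This is a sketch under stated assumptions, not the FrontISTR code: the array names `value`, `colindx` and `rowptr` follow the slide, while `dense_to_csr`, `balanced_row_ranges`, `csr_spmv` and the 6x6 matrix are hypothetical, and the "threads" are simulated by looping over row ranges.

```python
import numpy as np

def dense_to_csr(a):
    """Build the CSR arrays (value, colindx, rowptr) from a dense matrix."""
    value, colindx, rowptr = [], [], [0]
    for row in a:
        for j, v in enumerate(row):
            if v != 0:
                value.append(v)
                colindx.append(j)
        rowptr.append(len(value))
    return np.array(value, dtype=float), np.array(colindx), np.array(rowptr)

def balanced_row_ranges(rowptr, n_threads):
    """Split rows into contiguous ranges holding roughly equal numbers of non-zeros."""
    nnz = rowptr[-1]
    targets = [nnz * t / n_threads for t in range(1, n_threads)]
    cuts = np.searchsorted(rowptr, targets, side="left").tolist()
    bounds = [0, *cuts, len(rowptr) - 1]
    return list(zip(bounds[:-1], bounds[1:]))

def csr_spmv(value, colindx, rowptr, x, row_begin, row_end):
    """y = A x restricted to rows [row_begin, row_end)."""
    y = np.zeros(row_end - row_begin)
    for i in range(row_begin, row_end):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i - row_begin] += value[k] * x[colindx[k]]
    return y

# Unbalanced example: rows 0-3 hold one non-zero each, rows 4-5 hold four each.
dense = np.array([[1, 0, 0, 0, 0, 0],
                  [0, 2, 0, 0, 0, 0],
                  [0, 0, 3, 0, 0, 0],
                  [0, 0, 0, 4, 0, 0],
                  [1, 1, 0, 0, 5, 1],
                  [1, 0, 1, 0, 1, 6]], dtype=float)
value, colindx, rowptr = dense_to_csr(dense)
x = np.arange(1.0, 7.0)

ranges = balanced_row_ranges(rowptr, n_threads=3)
nnz_per_thread = [int(rowptr[e] - rowptr[b]) for b, e in ranges]
y = np.concatenate([csr_spmv(value, colindx, rowptr, x, b, e) for b, e in ranges])
```

An equal-row split of this matrix would give the three threads 2, 2 and 8 non-zeros; splitting on the cumulative count in `rowptr` gives each thread 4.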

(23)

Performance Test (1/3)

Load Balancing

• 10,000 SpMV operations on unbalanced matrices from the library[2]

• Left: without load balancing

• Right: with load balancing

[2] T. A. Davis. University of Florida sparse matrix collection, 1997.

(24)

Performance Test (2/3)

Parallelization and Matrix Format

• Performance of SpMV on Nehalem (Core i7 975)

• CSR / BCSR format

[Figure: performance [MFLOPS] versus matrix #]

(25)

Performance Test (3/3)

Overall CG Solver

• CG solver’s performance relative to CSR single thread on Nehalem (Core i7 975)

[Figure: performance over CSR single thread versus matrix #]

(26)

Performance Model (1/2)

The K computer’s roofline model, based on Williams’ model[1]. Sustained performance can be predicted from the application’s Flop/Byte ratio.

[1] S. Williams. Auto-tuning Performance on Multicore Computers. Univ. of California, 2008.

[Figure: roofline plot of sustained performance; estimated 8 / 128, to-peak 6.25%]

(27)

Performance Model (2/2)

| Machine | Node performance | BW (catalog) | BW (STREAM) | B/F | To-peak, SpMV with CSR | To-peak, SpMV with BCSR |
|---|---|---|---|---|---|---|
| K | 128 Gflops | 64 GB/s | 46.6 GB/s | 0.36 | 2.9～5.8 % | 4.9～7.6 % |
| FX10 | 236.5 Gflops | 85 GB/s | 64 GB/s | 0.27 | 2.2～4.3 % | 3.7～5.7 % |

B/F of FISTR:

- SpMV with CSR (Flop/Byte = 0.08～0.16): Byte/Flop = 6.25～12.5
- SpMV with BCSR (Flop/Byte = 0.18～0.21): Byte/Flop = 4.76～5.56
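The to-peak estimates follow from a roofline-style bound: attainable performance is the smaller of the node peak and (Flop/Byte × memory bandwidth). The sketch below applies that bound using the STREAM bandwidth; the function name `to_peak` and the dictionary layout are hypothetical, and the machine figures are the ones quoted on this slide.

```python
def to_peak(flop_per_byte, stream_bw_gbs, peak_gflops):
    """Fraction of node peak attainable by a kernel with the given Flop/Byte ratio."""
    attainable_gflops = min(peak_gflops, flop_per_byte * stream_bw_gbs)
    return attainable_gflops / peak_gflops

# (node peak [Gflops], STREAM bandwidth [GB/s]) as listed on the slide
machines = {"K": (128.0, 46.6), "FX10": (236.5, 64.0)}

# Machine balance B/F = STREAM bandwidth / node peak
balance = {name: bw / peak for name, (peak, bw) in machines.items()}

# SpMV with CSR, Flop/Byte = 0.08 to 0.16
csr_to_peak = {
    name: (100.0 * to_peak(0.08, bw, peak), 100.0 * to_peak(0.16, bw, peak))
    for name, (peak, bw) in machines.items()
}
```

With these inputs the CSR bounds come out at about 2.9 to 5.8 % for K and 2.2 to 4.3 % for FX10, and the machine balance at about 0.36 and 0.27. Other kernels can be checked the same way by passing their Flop/Byte ratio.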

Measured performance by profiler on FX10

(28)

Hierarchical approaches:

- Memory : register → cache → main memory

- Granularity : threads among cores → message passing among nodes

- Algorithm ( information transfer ) : near field → far field

ex) iterative solvers with multi-grid preconditioning, FETI with balancing preconditioning, fast multipole method, homogenization method, zooming methods, Saint-Venant theorem

If algorithms are made multi-level, an iterative process is generally introduced. A “magic number”, which controls the convergence and/or the accuracy, then appears. With non-linearity, different solutions may be found at each hierarchical level.

(29)

Summary

A parallel structural analysis system for tackling the assembly structure as a whole is being developed.

A hierarchical gridding strategy is used for increasing the problem size and enhancing the accuracy.

Within a node, multithreading that takes blocking/padding into account has been explored on multicore CPUs and GPUs (currently on a small number of nodes).

Future work

- Intra-node ILU preconditioning
- Hiding data translation time
- Performance model

(30)
