Numerical libraries solving large-scale problems developed at IT4Innovations Research Programme Supercomputing for Industry☆
Michal Merta a,b,∗, Jan Zapletal a,b, Tomas Brzobohaty a, Alexandros Markopoulos a, Lubomir Riha a, Martin Cermak a, Vaclav Hapla a,b, David Horak a,b, Lukas Pospisil a,b, Alena Vasatova a,b

a IT4Innovations National Supercomputing Center, 17. listopadu 15/2172, 708 00 Ostrava, Czech Republic
b Department of Applied Mathematics, VSB - Technical University of Ostrava, 17. listopadu 15/2172, 708 33 Ostrava, Czech Republic
Received 26 October 2015; accepted 11 November 2015; available online 15 December 2015.
KEYWORDS: FETI; TFETI; BEM; Domain decomposition; Quadratic programming; HPC
Summary The team of the Research Programme Supercomputing for Industry at IT4Innovations National Supercomputing Center focuses on the development of highly scalable algorithms for the solution of linear and non-linear problems arising from different engineering applications. As the main parallelisation technique, domain decomposition methods (DDM) of the FETI type are used. These methods are combined with finite element (FEM) or boundary element (BEM) discretisation methods and quadratic programming (QP) algorithms. All these algorithms were implemented in our in-house software packages BEM4I, ESPRESO and PERMON, which demonstrate high scalability up to tens of thousands of cores.
© 2015 Published by Elsevier GmbH. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
☆ This article is part of a special issue entitled ''Proceedings of the 1st Czech-China Scientific Conference 2015''.
∗ Corresponding author at: IT4Innovations National Supercomputing Center, 17. listopadu 15/2172, 708 33 Ostrava, Czech Republic. E-mail address: michal.merta@vsb.cz (M. Merta).
Introduction
High performance of contemporary computers results from an increasing number of compute nodes in clusters and of processor cores per node. While the current most powerful petascale or multi-petascale computers contain hundreds of thousands of CPU cores, future exascale systems will comprise millions of them. For efficient use of such systems, algorithms with high parallel scalability have to be developed.

http://dx.doi.org/10.1016/j.pisc.2015.11.023
Discretisation of most engineering problems describable by partial differential equations (PDE) leads to large sparse linear systems of equations. However, problems that can be expressed as elliptic variational inequalities, such as those describing the equilibrium of elastic bodies in mutual contact, lead to quadratic programming (QP) problems.
Finite element tearing and interconnecting (FETI) and boundary element tearing and interconnecting (BETI) (Langer and Steinbach, 2003; Of and Steinbach, 2009) methods form a successful subclass of domain decomposition methods (DDM). They belong to the non-overlapping methods and combine sparse iterative and direct solvers. FETI was first introduced by Farhat and Roux (1991, 1992). The key ingredient of the FETI method is the decomposition of the spatial domain into non-overlapping subdomains that are ''glued together'' by Lagrange multipliers. Elimination of the primal variables reduces the original linear problem to a smaller, relatively well conditioned, equality constrained QP. If the FETI procedure is applied to a contact problem (Dostál et al., 1998, 2000, 2005, 2010, 2012; Dostál and Horák, 2004), the resulting QP has additional bound constraints. FETI methods allow highly accurate computations scaling up to tens of thousands of processors.
Our team was successful in adapting the FETI approach for contact problems and designed new variants. One of them is Total-FETI (TFETI), developed by Dostal et al. (Dostál et al., 2006, 2010; Kruis et al., 2002; Čermák et al., 2015), which uses Lagrange multipliers to enforce the Dirichlet boundary conditions. This enables a simpler building of the stiffness matrix kernel, as all subdomains are floating and the associated subdomain stiffness matrices have the same kernel, obtained without any computation. Hybrid-TFETI (HTFETI) reduces the coarse problem (CP) size by aggregating the subdomains into clusters, i.e. TFETI is applied twice.
The resulting QP problems can then be solved by means of the efficient MPRGP and SMALBE algorithms, designed again by Dostal et al. (Dostál et al., 2003; Dostál and Schöberl, 2005; Dostál, 2009), with a known rate of convergence given by the spectral properties of the solved system.
We develop several software packages dealing with FETI: PERMON based on PETSc, and ESPRESO based on Intel MKL and Cilk. The BEM4I library implements the BEM discretisation and, together with the other two packages, the BETI method. The paper is organised as follows. After the introduction, we describe the main principles of the FETI and BETI methods. Then the particular libraries and their modules are introduced, with the achieved highlights from various areas.
Numerical methods

FETI methods
FETI-1 (Farhat and Roux, 1991, 1992; Farhat et al., 1994; Kruis, 2006) is a non-overlapping DDM (Gosselet and Rey, 2006) which is based on decomposing the original spatial domain into non-overlapping subdomains. They are ''glued together'' by Lagrange multipliers, which have to satisfy certain equality constraints discussed later. The original FETI-1 method assumes that the boundary subdomains inherit the Dirichlet conditions from the original problem, where the conditions are embedded into the linear system arising from FEM. Physically this means that subdomains whose interfaces intersect the Dirichlet boundary are fixed while the others are kept floating; in linear algebra terms, the corresponding subdomain stiffness matrices are non-singular and singular, respectively.
The basic idea of the Total-FETI (TFETI) method (Dostál et al., 2006, 2010; Čermák et al., 2015) is to keep all the subdomains floating and enforce the Dirichlet boundary conditions by means of a constraint matrix and Lagrange multipliers, similarly to the gluing conditions along the subdomain interfaces. This simplifies the implementation of the stiffness matrix generalised inverse. The key point is that the kernels $R^s$ of the subdomain stiffness matrices $K^s$ are known a priori, have the same dimension, and can be formed without any computation from the mesh data, so that the matrix $R$ ($\mathrm{Im}\,R = \mathrm{Ker}\,K$) also possesses a nice block-diagonal layout. Furthermore, each local stiffness matrix can be regularised cheaply, and the inverse of the resulting nonsingular matrix is at the same time a generalised inverse of the original singular one (Dostál et al., 2011; Brzobohatý et al., 2011).
FETI methods use the Lagrange multipliers to enforce both the equality and inequality constraints (gluing and non-penetration conditions) in the original primal problem

$$\min \tfrac{1}{2} u^T K u - u^T f \quad \text{s.t.} \quad B_E u = o \ \text{and}\ B_I u \le c_I.$$

The primal problem is then transformed using duality into a significantly smaller and better conditioned dual problem with an equality constraint and a nonnegativity bound

$$\min \tfrac{1}{2} \lambda^T F \lambda - \lambda^T d \quad \text{s.t.} \quad G\lambda = e, \ \lambda_I \ge o$$

with $F = BK^+B^T$, $G = R^T B^T$, $d = BK^+ f$, $e = R^T f$.
After homogenisation using the particular solution $\tilde{\lambda} = G^T(GG^T)^{-1}e$, where $\lambda = \hat{\lambda} + \tilde{\lambda}$, $\hat{\lambda} \in \mathrm{Ker}\,G$, $\tilde{\lambda} \in \mathrm{Im}\,G^T$, and enforcing the homogenised equality constraint by means of the projector $P = I - Q$ onto $\mathrm{Ker}\,G$, where $Q = G^T(GG^T)^{-1}G$ is the projector onto $\mathrm{Im}\,G^T$, the SMALSE algorithm can be applied to the problem

$$\min \tfrac{1}{2} \hat{\lambda}^T PFP \hat{\lambda} - \hat{\lambda}^T P(d - F\tilde{\lambda}) \quad \text{s.t.} \quad G\hat{\lambda} = o, \ \hat{\lambda}_I \ge -\tilde{\lambda}_I.$$
For this dual problem the classical estimate of the spectral condition number is valid, i.e. $\kappa(PFP|_{\mathrm{Im}\,P}) \le C\,H/h$, with $H$ denoting the decomposition and $h$ the discretisation parameter. A natural effort when using massively parallel computers is to maximise the number of subdomains (decrease $H$) so that the sizes of the subdomain stiffness matrices are reduced, which accelerates not only their factorisation and the subsequent generalised inverse application but also improves the conditioning and reduces the number of iterations. The negative effect is an increase of the dual and null space dimensions, which decelerates the coarse problem (CP) solution, i.e. the solution of the system $GG^T x = y$, so that the bottleneck of the TFETI method is the application of the projector, which dominates the solution time.
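The cost structure just described can be illustrated with a minimal sketch; the data below are a hypothetical toy, and production codes factorise $GG^T$ only once and keep $G$ distributed. Each application of the projector $P = I - G^T(GG^T)^{-1}G$ costs one coarse problem solve $GG^Tx = y$, which is why it dominates the solution time.

```python
# Toy illustration of the TFETI projector P = I - G^T (G G^T)^{-1} G.
# Hypothetical small dense data; production codes keep G distributed and
# factorise the coarse problem matrix G G^T only once.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmat(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    """Gaussian elimination with partial pivoting (dense, for the tiny CP)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def apply_projector(G, v):
    """One application of P costs one coarse problem solve G G^T x = G v."""
    Gt = transpose(G)
    GGt = matmat(G, Gt)
    x = solve(GGt, matvec(G, v))          # the coarse problem (CP) solve
    Qv = matvec(Gt, x)                    # Q v = G^T (G G^T)^{-1} G v
    return [vi - qi for vi, qi in zip(v, Qv)]

G = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
v = [1.0, 2.0, 3.0, 4.0]
Pv = apply_projector(G, v)
# P v lies in Ker G, i.e. G (P v) = o
print([round(c, 12) for c in matvec(G, Pv)])   # → [0.0, 0.0]
```

The sketch makes the bottleneck explicit: increasing the number of subdomains grows $G$, so the CP solve inside `apply_projector` grows with it, exactly the trade-off discussed above.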
Hybrid FETI method
Although there are several efficient coarse problem parallelisation strategies (Hapla and Horák, 2012; Kozubek et al., 2012, 2013), there are still size limitations of the coarse problem. Therefore, several hybrid (multilevel) methods were proposed (Lee, 2009; Klawonn and Rheinbach, 2010). The key idea is to aggregate a small number of neighbouring subdomains into clusters (see Fig. 1), which naturally results in a smaller coarse problem. In our HTFETI, the aggregation of subdomains into clusters is again enforced by Lagrange multipliers. Thus the TFETI method is used on both the cluster and subdomain levels. This approach simplifies the implementation of hybrid FETI methods and enables extending the parallelisation of the original problem up to tens of thousands of cores due to lower memory requirements. This is the positive effect of reducing the coarse space. The negative one is a worse convergence rate compared with the original TFETI. To improve it, the transformation of basis originally introduced by Klawonn and Widlund (2006), Klawonn and Rheinbach (2006), and Li and Widlund (2006) is applied to the derived hybrid algorithm.
Boundary element method and BETI
The boundary element method (BEM) is well-suited for the solution of exterior problems such as sound or electromagnetic wave scattering, or shape optimisation problems. The boundary integral formulation of the given problem leads to the discretisation of the boundary only, thus effectively reducing the problem dimension.
The method is applicable to problems for which the fundamental solution is known, which is the case, e.g., for the Laplace or Helmholtz equations. In 3D, the respective fundamental solutions read

$$v(x,y) := \frac{1}{4\pi}\,\frac{1}{\|x-y\|}, \qquad v_\kappa(x,y) := \frac{1}{4\pi}\,\frac{e^{i\kappa\|x-y\|}}{\|x-y\|}.$$
The solution to the boundary value problem under consideration is given by the representation formula

$$u(x) := \int_{\partial\Omega} \gamma_1 u(y)\, v(x,y)\,\mathrm{d}s_y - \int_{\partial\Omega} \gamma_0 u(y)\, \frac{\partial v}{\partial n_y}(x,y)\,\mathrm{d}s_y,$$

where $\gamma_0$ and $\gamma_1$ represent the Dirichlet and Neumann trace operators. The unknown Cauchy data can be obtained from the appropriate system of boundary integral equations. Applying the Dirichlet and Neumann trace operators to the representation formula leads to the boundary integral equations

$$(V\gamma_1 u)(x) = \tfrac{1}{2}\,\gamma_0 u(x) + (K\gamma_0 u)(x) \quad \text{for } x \in \partial\Omega, \quad (1)$$

$$(D\gamma_0 u)(x) = \tfrac{1}{2}\,\gamma_1 u(x) - (K^{*}\gamma_1 u)(x) \quad \text{for } x \in \partial\Omega \quad (2)$$

with $V$, $K$, $K^{*}$, and $D$ denoting the single-layer, double-layer, adjoint double-layer, and hypersingular boundary integral operators, respectively. The Galerkin discretisation of the single-layer operator equation (1) leads to the system of linear equations

$$Vt = \left(\tfrac{1}{2}M + K\right)u$$
with the boundary element matrices

$$V[k,\ell] := \int_{\tau_k}\int_{\tau_\ell} v(x,y)\,\mathrm{d}s_y\,\mathrm{d}s_x, \qquad K[k,j] := \int_{\tau_k}\int_{\partial\Omega} \frac{\partial v}{\partial n_y}(x,y)\,\varphi_j(y)\,\mathrm{d}s_y\,\mathrm{d}s_x$$

and the sparse identity matrix $M$.
The assembly of the full matrices is of quadratic complexity with respect to the number of degrees of freedom on the surface. Moreover, advanced numerical quadrature methods must be applied to treat the singularities occurring in the integrals in the case of identical or adjacent elements (see Rjasanow and Steinbach (2007) or Sauter and Schwab (2010)). Therefore, an efficient implementation and parallelisation of the method is necessary to allow the solution of large-scale problems.
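The quadratic cost can be illustrated with a toy sketch; the one-point quadrature and the flat-panel data below are illustrative assumptions, not the semi-analytic or Sauter-Schwab quadratures used in BEM4I, and singular pairs are simply skipped here.

```python
# Toy illustration of the quadratic-cost assembly of the single-layer
# matrix V for the 3D Laplace kernel v(x, y) = 1 / (4*pi*|x - y|).
# One-point quadrature over hypothetical flat panels; singular pairs
# (k == l) would need the special quadratures cited in the text.
import math

def assemble_single_layer(midpoints, areas):
    n = len(midpoints)
    V = [[0.0] * n for _ in range(n)]
    for k in range(n):            # quadratic complexity: all element pairs
        for l in range(n):
            if k == l:
                continue          # singular integral: needs special quadrature
            d = math.dist(midpoints[k], midpoints[l])
            V[k][l] = areas[k] * areas[l] / (4.0 * math.pi * d)
    return V

# four hypothetical panel midpoints on a unit square, area 0.25 each
mids = [(0.25, 0.25, 0.0), (0.75, 0.25, 0.0),
        (0.25, 0.75, 0.0), (0.75, 0.75, 0.0)]
V = assemble_single_layer(mids, [0.25] * 4)
print(len(V), len(V[0]))   # → 4 4: a full n x n matrix
```

The nested loop over all element pairs is exactly the $O(n^2)$ behaviour that FMM and ACA sparsification reduce to almost linear.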
The FETI domain decomposition methodology combined with the BEM discretisation results in the so-called BETI (boundary element tearing and interconnecting) method.
MPRGP and SMALBE algorithms
The combination of the SemiMonotonic Augmented Lagrangian algorithm for Bound and Equality constraints (SMALBE) and the Modified Proportioning with Reduced Gradient Projection (MPRGP) algorithm (Dostál et al., 2003; Dostál and Schöberl, 2005; Dostál, 2009) was developed and tested for the solution of QP problems resulting from the discretisation of contact problems of mechanics, but can be used for any other QP problems as well. They have a theoretically supported rate of convergence given by the spectral properties of the solved system. General linear inequality constraints must be converted to bound constraints by applying dualisation, which also typically improves the conditioning and reduces the dimension. MPRGP is an active set based algorithm. The main idea of MPRGP is the splitting of the gradient, based on the active set, into the free and chopped gradients whose sum yields the projected gradient. The algorithm exploits a test to decide about leaving the face and three types of steps to generate a sequence of iterates approximating the solution:
1 The expansion step, applied if the iterate is proportional, may expand the current active set using a fixed steplength related to the matrix norm and the reduced free gradient.
2 The proportioning step may remove indices from the active set using the chopped gradient.
3 The conjugate gradient step.
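The gradient splitting that drives these steps can be sketched as follows; this is a minimal pure-Python illustration with hypothetical toy data, not the PERMON implementation. For min ½xᵀAx − bᵀx subject to x ≥ ℓ, the free gradient keeps the components of g = Ax − b on the free set, the chopped gradient keeps the negative components on the active set, and the proportionality test compares the norms of the two parts.

```python
# Sketch of the MPRGP gradient splitting for min 1/2 x^T A x - b^T x, x >= l.
# Hypothetical tiny data; the real algorithm combines this test with the
# expansion, proportioning and CG steps listed above.

def split_gradient(A, b, l, x, tol=1e-12):
    g = [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    free, chopped = [], []
    for gi, xi, li in zip(g, x, l):
        active = xi - li <= tol            # component sits on its bound
        free.append(0.0 if active else gi)              # free gradient
        chopped.append(min(gi, 0.0) if active else 0.0) # chopped gradient
    return free, chopped

def is_proportional(free, chopped, Gamma=1.0):
    """Proportionality test: chopped part small => stay in the face / expand."""
    norm = lambda v: sum(c * c for c in v) ** 0.5
    return norm(chopped) <= Gamma * norm(free)

A = [[2.0, 0.0], [0.0, 2.0]]
b = [2.0, -2.0]
l = [0.0, 0.0]
x = [1.0, 0.0]                  # second component active at its bound
free, chopped = split_gradient(A, b, l, x)
print(free, chopped)            # projected gradient = free + chopped
```

At the chosen point both parts vanish, i.e. the projected gradient is zero, so this x is the constrained minimiser; a nonzero chopped part would trigger the proportioning step instead.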
Figure 1 Cube prepared for TFETI and HFETI.
The algorithm has been proved to enjoy an R-linear rate of convergence in terms of the spectral condition number. SMALBE is an algorithm based on augmented Lagrangians. It takes care of the equality constraints, while in each of its iterations the inner problem, consisting in the bound-constrained minimisation of the augmented Lagrangian, is solved by any suitable solver such as the MPRGP described above.
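The outer loop of an augmented Lagrangian method of this type can be sketched as follows; this is an illustrative toy assuming exact inner solves and a fixed penalty, whereas SMALBE additionally adapts the penalty and balances the precision of the inner MPRGP solves.

```python
# Sketch of an augmented-Lagrangian outer loop of SMALBE type for
# min 1/2 x^T A x - b^T x  s.t.  C x = o.  Toy data; a direct 2x2 solve
# stands in for the bound-constrained inner MPRGP solve.

def solve2(M, r):
    """Direct solve of a 2x2 system via Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - r[1] * M[0][1]) / det,
            (M[0][0] * r[1] - M[1][0] * r[0]) / det]

def smalbe_sketch(A, b, C, rho=10.0, iters=30):
    n = len(b)
    mu = [0.0] * len(C)
    for _ in range(iters):
        # inner problem: min_x 1/2 x^T (A + rho C^T C) x - (b - C^T mu)^T x
        M = [[A[i][j] + rho * sum(C[k][i] * C[k][j] for k in range(len(C)))
              for j in range(n)] for i in range(n)]
        r = [b[i] - sum(C[k][i] * mu[k] for k in range(len(C))) for i in range(n)]
        x = solve2(M, r)
        # multiplier update driven by the equality-constraint violation C x
        mu = [m + rho * sum(c * xi for c, xi in zip(row, x))
              for m, row in zip(mu, C)]
    return x, mu

A = [[2.0, 0.0], [0.0, 2.0]]
b = [2.0, 4.0]
C = [[1.0, -1.0]]               # equality constraint x1 = x2
x, mu = smalbe_sketch(A, b, C)
print([round(c, 6) for c in x])   # → [1.5, 1.5]
```

Each outer iteration tightens the constraint violation geometrically; here the iterates converge to x = (1.5, 1.5) with multiplier μ = −1, which satisfies the KKT conditions of the toy problem.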
The BEM4I library
Overview
The boundary element library BEM4I concentrates on the efficient assembly of the boundary element matrices for the 3D Laplace, Helmholtz, Lamé, and time-domain wave equations. It employs sparsification methods, namely the fast multipole method (FMM) (Greengard and Rokhlin, 1987; Of, 2007) and the adaptive cross approximation (ACA) (Bebendorf, 2008; Rjasanow and Steinbach, 2007), to reduce the computational effort to almost linear.
The core of the library consists of three main sets of classes:

1 BESpace: the classes inheriting from the BESpace class are responsible for the approximation of the continuous function spaces. The stored information includes the order of the polynomial test and Ansatz functions or the data necessary to approximate matrices using the ACA or FMM methods.
2 BEBilinearForm: the main purpose of this class and its descendants is to assemble the boundary element system matrices (in both full and sparsified formats). The element-wise assembly is performed using the BEIntegrator class. The assembly is parallelised using OpenMP and MPI at this level.
3 BEIntegrator: the classes responsible for the local system matrix assembly inherit from the BEIntegrator class. Several types of numerical quadratures are employed by these classes, including the classical Gaussian quadrature schemes over pairs of distant elements, and the semi-analytical approach (Rjasanow and Steinbach, 2007; Zapletal and Bouchala, 2014) and fully numerical schemes (Sauter and Schwab, 2010) to treat the singularities in the integrals over pairs of close elements. The computation is vectorised to reduce the computational time using the SSE or AVX instruction sets (Fig. 3).

In addition to these classes the library also contains supportive classes representing full, sparse, and sparsified matrices, iterative and direct solvers, preconditioners, surface meshes, etc. The library structure together with the results of the scalability tests has been presented in Merta and Zapletal (2015, accepted for publication) and Čermák et al. (2015) (Fig. 2).
Figure 3 Concurrent summation of scalars using vector instructions.
Intel Xeon Phi utilisation
To reduce the computational time, the code of the library is accelerated by the Intel Xeon Phi coprocessors. The computationally most demanding parts of the code are offloaded to the coprocessor using the offload pragmas of the Intel compiler, and the computation is carried out using the 60 physical (240 logical) cores available on the coprocessor (see Fig. 4). The computation consists of several steps.
1 Pack the data (mainly the nodes and elements of a surface mesh) and send it to the coprocessor.
2 Perform simultaneous computation on the coprocessor and the host.
3 Send the results from the coprocessor to the host processor.
4 Combine the data from the coprocessor and the processor and assemble the global system matrix.
The results of the numerical benchmarks focused on the assembly of the full single-layer operator matrix for the Laplace equation show a significant reduction in the computational time (see Fig. 5). The main bottleneck is currently the data transfer from the coprocessor to the host processor (Fig. 6).
ExaScale PaRallel FETI SOlver (ESPRESO)
Overview
The ESPRESO library is implemented in C++. A significant part of the development effort was devoted to the development of a C++ wrapper for (1) the selected sparse and dense BLAS routines and (2) the sparse direct solvers (the MKL and original versions of the PARDISO direct solvers) of the Intel MKL library. The solver is developed to support current and future multi- and many-core architectures, for instance Intel Xeon Phi or Nvidia Tesla. Therefore, for the CPU and Xeon Phi versions we are using the Intel MKL library, and CUDA libraries (cuBLAS, cuSPARSE, cuSolver) are used for the GPU version.

Figure 4 Offload of the computation to the Intel Xeon Phi coprocessor.

Figure 5 Comparison of the assembly of the single layer operator matrix.
Communication layer optimisation
ESPRESO-H is mainly focused on the scalability of the communication layer for large computer systems with thousands and tens of thousands of compute nodes. All the processing is done by the CPUs. The solver uses hybrid parallelisation, which is well suited for multi-socket and multi-core compute nodes, as this is the architecture of most of today's supercomputers.
The first level of parallelisation is designed for the parallel processing of the clusters of subdomains. Individual clusters are processed per node. It is possible to process multiple clusters per node, but not the other way around. The distributed memory parallelisation is done using MPI, in particular the MPI 3.0 standard, which is implemented in most modern MPI distributions. MPI 3.0 is used because the communication hiding techniques implemented in the communication layer require the non-blocking collective operations.
The communication layer is identical for both the TFETI and HTFETI solvers in ESPRESO. It uses novel communication hiding techniques for the main iterative solver. In particular we have implemented: (1) the Pipelined Conjugate Gradient (PipeCG) solver, which hides the communication of the global dot products behind the local matrix-vector multiplications; (2) distributed CP processing, which merges two global communication operations (Gather and Scatter) into one (AllGather) and parallelises the CP processing using the distributed inverse matrix of the CP; and (3) an optimised version of the global gluing matrix multiplication (matrix B for FETI and B1 for HFETI), written as stencil communication which is fully scalable, see Fig. 7.

Figure 7 The stencil communication for a simple decomposition into four subdomains. The Lagrange multipliers (LMs) that connect different neighbouring subdomains are depicted in different colours. In every iteration, when the LMs are updated, an exchange is performed between the neighbouring subdomains to finish the update. This affinity also controls the distribution of the data for the main distributed iterative solver, which iterates over local LMs only. In our implementation each MPI process modifies only those elements of the vectors used by the CG solver that match the LMs associated with the particular domain in case of FETI or the set of domains in a cluster in case of hybrid FETI.
Inter-cluster processing
The second level of parallelisation is designed for the parallel processing of the subdomains in a cluster. Our implementation enables oversubscription of CPU cores so that each core can process multiple subdomains, and therefore the size of the cluster is not limited by the hardware configuration. This shared memory parallelisation is implemented using Intel Cilk Plus. We have chosen Cilk Plus due to its advanced support for the C++ language. In particular, we take advantage of the functionality that allows us to create custom parallel reduction operations on top of C++ objects, which in our case are sparse matrices.
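The custom reduction idea can be sketched schematically; the snippet below is a Python analogue with hypothetical names, whereas the actual implementation uses Cilk Plus reducer hyperobjects over C++ sparse matrix objects.

```python
# Schematic analogue of a custom parallel reduction over sparse matrices
# (dict-of-keys format). Each worker assembles a local partial matrix and
# a user-defined associative merge combines the partial results, which is
# the role played by reducer hyperobjects in the Cilk Plus implementation.
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def merge(a, b):
    """Associative reduction operation: entry-wise sum of two sparse matrices."""
    out = dict(a)
    for key, val in b.items():
        out[key] = out.get(key, 0.0) + val
    return out

def local_assembly(chunk):
    """Each worker assembles its own sparse partial matrix."""
    part = {}
    for (i, j, v) in chunk:
        part[(i, j)] = part.get((i, j), 0.0) + v
    return part

# hypothetical element contributions (row, col, value), split among workers
contributions = [[(0, 0, 1.0), (0, 1, 2.0)],
                 [(0, 0, 3.0), (1, 1, 4.0)]]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(local_assembly, contributions))

A = reduce(merge, partials)
print(A[(0, 0)])   # → 4.0: contributions from both workers merged
```

Because `merge` is associative, the partial matrices can be combined in any order, which is what makes the reduction safe to run in parallel.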
Numericalresults
ESPRESO is designed to solve large problems using the world's largest supercomputers. In this paper we present results measured on the largest European machine, the CSCS Piz Daint in Lugano, Switzerland. Piz Daint is a Cray XC30 machine with 5272 compute nodes, each with one 8-core Sandy Bridge CPU (E5-2670), 32 GB of RAM and one K20X GPU accelerator. All the following tests use the synthetic 3D cube linear elasticity benchmark. For this benchmarking we are developing a massively parallel in-memory problem generator, which eliminates I/O bottlenecks and allows evaluating the efficiency and scalability of the solver routines more precisely.
The first set of results is shown in Fig. 8. This figure presents the weak scalability of the HTFETI solver in ESPRESO. Due to the limited amount of memory per node, the solver is able to process 2.7 million unknowns per single node. The amount of work per node is then kept fixed while we increase the number of nodes from 1 to 2197, which sets the maximum problem size to 5.8 billion unknowns. This is so far the largest problem we were able to solve on the Piz Daint machine. The important message from this measurement is the flattening characteristic from 343 to 2197 nodes, which is the expected result of the good weak scalability of the solver.
The next tests show the strong scaling of the HTFETI method in ESPRESO. In Fig. 9 we can see the strong scalability of the single iteration time. This experiment decouples the numerical scalability of the HTFETI method from the scalability of the implementation itself. We can see that ESPRESO achieves super-linear scalability per iteration when solving a 2.6 billion unknown problem, starting from 1000 nodes and scaling to 4913 nodes. The per-iteration time is shown in the figure next to each point (the second line), while the number of nodes is given in the first line. The blue line shows the linear scaling based on the processing time on 1000 nodes.
The last test using the synthetic benchmark shows the strong scalability of the entire iterative solver in ESPRESO. This involves the per-iteration time as well as the number of iterations (the numerical scalability). We can see that even in this test the solver achieved linear scaling. Please note that for both strong scalability tests we keep the cluster configuration identical; in other words, the number of domains per node remains the same and we reduce the domain size while increasing the number of nodes/clusters (Fig. 10).
ESPRESO-GPU and ESPRESO-MIC
In parallel with ESPRESO-H we are developing two more flavours of ESPRESO which are designed to take advantage of modern many-core accelerators. ESPRESO-GPU uses CUDA and its libraries to run on Nvidia Tesla GPUs. ESPRESO-MIC is developed under the Intel Parallel Computing Center (IPCC) at IT4Innovations, and its main focus is to fully utilise the potential of the Xeon Phi accelerators based on the Knights Corner architecture. This is an essential research direction for IT4Innovations, as it has the largest European Xeon Phi accelerated system, called Salomon.
Figure 8 The weak scaling evaluation of the ESPRESO solver on the largest European supercomputer, the CSCS Piz Daint. The solver is able to process 2.7 million unknowns per node. The scalability is evaluated from 1 to 2197 nodes. The flattening shape of the total execution time shows the potential of ESPRESO to scale even further.
Figure 9 Strong scalability of a single iteration time of the ESPRESO solver. In this test ESPRESO is solving a 2.6 billion unknown problem, scaling from 1000 to 4913 nodes.
PERMON
Overview
Since 2011 we have been developing a novel software package based on PETSc, using TFETI for the solution of QP: the PERMON (Parallel, Efficient, Robust, Modular, Object-oriented, Numerical) toolbox. It makes use of theoretical results in discretisation techniques, QP algorithms, and DDM. It incorporates our own codes and makes use of renowned open source libraries. The solver layer, discussed here, consists of three modules: PermonFLLOP, PermonQP, and PermonIneq. Other modules are problem-specific, such as PermonPlasticity for plasticity, PermonImage for image recognition, PermonMultiBody for particle dynamics, and others.
Figure 10 Strong scalability of the iterative solver of ESPRESO. In this test ESPRESO is solving a 2.6 billion unknown problem, scaling from 1000 to 4913 nodes.
Figure 11 Doubly linked list of QPs.
PermonQP
PermonQP is a package providing a base for the solution of QP problems. Its main idea is the separation of the concepts of QP problems, transforms, and solvers, which are abstracted by the three basic classes QP, QPT, and QPS, respectively. A QP transform derives a new QP from the given QP, so that a doubly linked list (QP chain) is generated where every node is a QP (Fig. 11). The programming interface (API) of PermonQP is carefully designed to be easy to use and, at the same time, efficient and suitable for HPC. The solution process is, from the user's point of view, divided into the following sequence of actions:

1 QP problem specification;
2 QP transforms, which reformulate the original problem and create a chain of QP problems where the last one is passed to the solver;
3 automatic or manual choice of an appropriate QP solver;
4 QP solution.
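The QP chain idea can be sketched as follows; the class and function names below are illustrative stand-ins, not the actual PERMON API.

```python
# Schematic sketch of the PermonQP "QP chain": transforms derive new QPs
# linked in a doubly linked list; the solver sees only the last node.
# Class and function names here are illustrative, not the PERMON API.

class QP:
    def __init__(self, data, parent=None, transform=None):
        self.data = data              # problem description (schematic)
        self.parent = parent          # link back to the QP this one was derived from
        self.child = None             # link forward to the derived QP
        self.transform = transform    # name of the transform that produced it
        if parent is not None:
            parent.child = self

def apply_transform(qp, name):
    """A QP transform (QPT analogue) derives a new QP and appends it to the chain."""
    return QP(data=f"{qp.data} -> {name}", parent=qp, transform=name)

def last(qp):
    """Walk to the end of the chain: the QP actually passed to the solver."""
    while qp.child is not None:
        qp = qp.child
    return qp

root = QP("primal contact problem")
for t in ["dualisation", "homogenisation", "projector preconditioning"]:
    apply_transform(last(root), t)

chain = []
node = root
while node is not None:
    chain.append(node.transform or "root")
    node = node.child
print(chain)
```

The doubly linked structure lets a solution of the last QP be mapped back through the parent links to a solution of the original problem, which is the point of keeping the chain rather than discarding intermediate problems.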
PermonQP as a stand-alone package allows solving unconstrained QP problems (i.e. linear systems with a positive semidefinite matrix) or equality constrained ones. In both cases it makes use of the PETSc KSP package, which includes both direct and iterative solvers, including interfaces to many external solvers. Examples of equality constraints are, for instance, multipoint constraints, or the alternative enforcement of Dirichlet boundary conditions using a separate constraint matrix. This module is being prepared for publishing under the BSD 2-Clause license.
PermonIneq
PermonQP capabilities can be further extended with the PermonIneq package, which adds several concrete solvers for inequality constrained QPs, e.g. the already mentioned MPRGP and SMALBE algorithms.
PermonFLLOP
PermonFLLOP is a wrapper of PermonQP implementing FETI. It assembles the FETI-specific constraint matrix B and null space matrix R. They are passed internally to PermonQP together with the subdomain-wise stiffness matrices and load vectors, which can be assembled with an arbitrary FEM library such as PermonCube or libMesh. The FETI method itself consists here just in calling the proper sequence of QP transformations: primal scaling, dualisation, dual scaling, homogenisation of the equality constraints, and preconditioning by the orthogonal projector onto the kernel of the dual equality constraint matrix.
Numericalexperiments
As a benchmark, an elastic cube was subjected to volume forces pressing it against an obstacle. There were two reasons for this decision. The elastic cube is a numerical model which can be fully controlled, and the obtained results are not affected by the complexity of the geometry. Another reason is that it is very difficult or even impossible to create very large meshes on complex geometries using existing meshing tools. With our mesh generator PermonCube we were able to prepare large-scale problems decomposed into thousands of subdomains.

Table 1 Results for the cube contact linear elasticity problem.

X | NS (#decomp.) | DOF        | Solution time [s] | Outer iters | Inner iters
4 | 64            | 3,000,000  | 2.68E+01          | 3           | 94
6 | 216           | 10,125,000 | 5.38E+01          | 3           | 147
8 | 512           | 24,000,000 | 1.21E+02          | 4           | 250

Figure 12 Parallel weak scalability of the TFETI method implementation in the PermonFLLOP code for the linear elasticity cube benchmark at the HECToR supercomputer.

Figure 13 Numerical scalability of the TFETI method within the PermonFLLOP code for the linear elasticity cube benchmark. Note that from a certain point we get an almost constant number of iterations, allowing good parallel scalability.
The R and G matrices were orthonormalised using the iterative classical Gram-Schmidt process; the K matrix was factorised using the Cholesky factorisation from the MUMPS library. Currently, each computational core owns one and only one subdomain. The norm of the projected gradient compared with the 10⁻⁵ multiple of the projected dual RHS was used as the stopping criterion. The results are summarised in Table 1.
The weak scalability for 13,824; 8000 and 4096 elements per subdomain and the numerical scalability for these configurations (corresponding to the fixed ratios H/h = 24, 20, 16) are illustrated in Figs. 12 and 13. To investigate the strong scalability we selected the discretisation with 32,768,000 elements (approx. 100,000,000 unknowns); it was demonstrated up to 8000 cores (41.5 s using 2197 cores; 19.8 s using 4096 cores; 15.7 s using 8000 cores).
Conclusion
Efficient variants of the BEM discretisation method, scalable QP algorithms, and FETI-type domain decomposition methods (BETI, TFETI and HTFETI) were implemented in our in-house software packages. These solvers were optimised employing available state-of-the-art external libraries, communication hiding and avoiding techniques, hybrid MPI-OpenMP programming, GPU and MIC accelerators, etc. Scalability was proven for both huge model problems and complicated engineering problems up to tens of thousands of cores.
The presented BEM4I library for the boundary element discretisation of engineering problems has been tested up to more than a thousand cores. Currently, its acceleration using the Intel Xeon Phi coprocessors is under development. The initial results suggest a significant reduction in the computational time in the case of full system matrices for the Laplace equation; therefore, the acceleration of the assembly of matrices sparsified by ACA, as well as the assembly of the system matrices for the Lamé equation, is being considered.
The presented ESPRESO library brings highly optimised TFETI and HTFETI implementations. ESPRESO-H is oriented to large computer systems with thousands and tens of thousands of compute nodes. ESPRESO-GPU and ESPRESO-MIC are developed to exploit the power of GPU and MIC accelerators.
We have also presented our PERMON toolbox, mainly its solver packages based on PETSc. They uniquely combine FETI DDM with QP algorithms. PermonFLLOP is used to generate the FETI-specific objects for a contact problem of elasticity, while the FEM objects are provided by any FEM code for each subdomain independently. PermonFLLOP wraps PermonQP and PermonIneq, which solve the resulting QP problem. Results for the contact problem of an elastic cube, generated by the PermonCube package, were shown.
Conflict of interest

The authors declare that there is no conflict of interest.
Acknowledgements
This work was supported by the European Regional Development Fund in the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070); by the project of major infrastructures for research, development and innovation of the Ministry of Education, Youth and Sports with reg. num. LM2011033; by the EXA2CT project funded from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 610741; by the internal student grant competition project SP2015/186 ''PERMON toolbox development''; by the project POSTDOCI II reg. no. CZ.1.07/2.3.00/30.0055 within the Operational Programme Education for Competitiveness; and by the Grant Agency of the Czech Republic (GACR) project no. 15-18274S. We thank CSCS (www.cscs.ch) for the support in using the Piz Daint supercomputer.
References
Bebendorf, M., 2008. Hierarchical Matrices: A Means to Efficiently Solve Elliptic Boundary Value Problems. Lecture Notes in Computational Science and Engineering. Springer.
Brzobohatý, T., Dostál, Z., Kozubek, T., Kovář, P., Markopoulos, A., 2011. Cholesky decomposition with fixing nodes to stable computation of a generalized inverse of the stiffness matrix of a floating structure. Int. J. Numer. Methods Eng. 88 (5), 493-509.
Dostál, Z., Friedlander, A., Santos, S.A., 2003. Augmented Lagrangians with adaptive precision control for quadratic programming with simple bounds and equality constraints. SIAM J. Optim. 13 (4), 1120-1140.
Dostál, Z., Friedlander, A., Santos, S.A., 1998. Solution of contact problems of elasticity by FETI domain decomposition. Contemp. Math. 218, 82-93.
Dostál, Z., Horák, D., Kučera, R., 2006. Total FETI - an easier implementable variant of the FETI method for numerical solution of elliptic PDE. Commun. Numer. Methods Eng. 22 (12), 1155-1162.
Dostál, Z., Kozubek, T., Markopoulos, A., Menšík, M., 2011. Cholesky decomposition of a positive semidefinite matrix with known kernel. Appl. Math. Comput. 217 (13), 6067-6077.
Dostál, Z., Kozubek, T., Vondrák, V., Brzobohatý, T., Markopoulos, A., 2010. Scalable TFETI algorithm for the solution of multibody contact problems of elasticity. Int. J. Numer. Methods Eng. 82 (11), 1384-1405.
Dostál, Z., Neto, F.A.G., Santos, S.A., 2000. Solution of contact problems by FETI domain decomposition with natural coarse space projections. Comput. Methods Appl. Mech. Eng. 190 (13-14), 1611-1627.
Dostál, Z., 2009. Optimal Quadratic Programming Algorithms, with Applications to Variational Inequalities. SOIA, Springer, New York, US.
Dostál, Z., Horák, D., 2004. Scalable FETI with optimal dual penalty for a variational inequality. Numer. Linear Algebra Appl. 11, 455-472.
Dostál, Z., Horák, D., 2007. Theoretically supported scalable FETI for numerical solution of variational inequalities. SIAM J. Numer. Anal. 45 (2), 500-513.
Dostál, Z., Horák, D., Kučera, R., Vondrák, V., Haslinger, J., Dobiáš, J., Pták, S., 2005. FETI based algorithms for contact problems: scalability, large displacements and 3D Coulomb friction. Comput. Methods Appl. Mech. Eng. 194 (2-5), 395-409.
Dostál, Z., Kozubek, T., Brzobohatý, T., Markopoulos, A., Vlach, O., 2012. Scalable TFETI with optional preconditioning by conjugate projector for transient contact problems of elasticity. Comput. Methods Appl. Mech. Eng. 247-248, 37-50.
Dostál, Z., Schöberl, J., 2005. Minimizing quadratic functions subject to bound constraints. Comput. Optim. Appl. 30 (1), 23-43.
Farhat, C., Mandel, J., Roux, F.X., 1994. Optimal convergence properties of the FETI domain decomposition method. Comput. Methods Appl. Mech. Eng. 115, 365-385.
Farhat, C., Roux, F.X., 1991. A method of finite element tearing and interconnecting and its parallel solution algorithm. Int. J. Numer. Methods Eng. 32 (6), 1205-1227.
Farhat, C., Roux, F.X., 1992. An unconventional domain decomposition method for an efficient parallel solution of large-scale finite element systems. SIAM J. Sci. Stat. Comput. (1).
Gosselet, P., Rey, C., 2006. Non-overlapping domain decomposition methods in structural mechanics. Arch. Comput. Methods Eng. 13 (4), 515-572.
Greengard, L., Rokhlin, V., 1987. A fast algorithm for particle simulations. J. Comput. Phys. 73 (2), 325-348.
Hapla, V., Horák, D., 2012. TFETI coarse space projectors parallelization strategies. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (Eds.), Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, pp. 152-162.
Klawonn, A., Rheinbach, O., 2010. Highly scalable parallel domain decomposition methods with an application to biomechanics. ZAMM Z. Angew. Math. Mech. 90 (1), 5-32.
Klawonn, A., Widlund, O.B., 2006. Dual-primal FETI methods for linear elasticity. Commun. Pure Appl. Math. 59 (11), 1523-1572.
Klawonn, A., Rheinbach, O., 2006. A parallel implementation of dual-primal FETI methods for three-dimensional linear elasticity using a transformation of basis. SIAM J. Sci. Comput. 28 (5), 1886-1906, http://dx.doi.org/10.1137/050624364.
Kozubek, T., Horák, D., Hapla, V., 2012. FETI coarse problem parallelization strategies and their comparison. Tech. Rep. http://www.prace-project.eu/IMG/pdf/feticoarseproblemparallelization.pdf.
Kozubek, T., Vondrák, V., Menšík, M., Horák, D., Dostál, Z., Hapla, V., Kabelíková, P., Čermák, M., 2013. Total FETI domain decomposition method and its massively parallel implementation. Adv. Eng. Softw. 60-61, 14-22.
Kruis, J., 2006. Domain Decomposition Methods for Distributed Computing. Saxe-Coburg Publications.
Kruis, J., Matouš, K., Dostál, Z., 2002. Solving laminated plates by domain decomposition. Adv. Eng. Softw. 33, 445-452.
Langer, U., Steinbach, O., 2003. Boundary element tearing and interconnecting methods. Computing 71 (3), 205-228.
Lee, J., 2009. A hybrid domain decomposition method and its applications to contact problems in mechanical engineering. New York University (Ph.D. thesis).
Li, J., Widlund, O.B., 2006. FETI-DP, BDDC, and block Cholesky methods. Int. J. Numer. Methods Eng. 66, 250-271.
Merta, M., Zapletal, J., 2015. A parallel library for boundary element discretization of engineering problems. Math. Comput. Simul. (accepted for publication).
Merta, M., Zapletal, J., 2015. Acceleration of boundary element method by explicit vectorization. Adv. Eng. Softw. 86, 70-79.
Of, G., Steinbach, O., 2009. The all-floating boundary element tearing and interconnecting method. J. Numer. Math. 17 (4), 277-298.
Of, G., 2007. Fast multipole methods and applications. In: Schanz, M., Steinbach, O. (Eds.), Boundary Element Analysis, Lecture Notes in Applied and Computational Mechanics. Springer, Berlin, Heidelberg, pp. 135-160.
Amestoy, P., et al., 2015. MUMPS: a Multifrontal Massively Parallel sparse direct Solver. http://mumps.enseeiht.fr/.
Rjasanow, S., Steinbach, O., 2007. The Fast Solution of Boundary Integral Equations. Mathematical and Analytical Techniques with Applications to Engineering. Springer.
Sauter, S., Schwab, C., 2010. Boundary Element Methods. Springer Series in Computational Mathematics. Springer.
Čermák, M., Hapla, V., Horák, D., Merta, M., Markopoulos, A., 2015. Total-FETI domain decomposition method for solution of elasto-plastic problems. Adv. Eng. Softw. 84, 48-54.
Čermák, M., Merta, M., Zapletal, J., 2015. A novel boundary element library with applications. In: Simos, T., Tsitouras, C. (Eds.), Proceedings of ICNAAM 2014. AIP Conference Proceedings, vol. 1648.
Zapletal, J., Bouchala, J., 2014. Effective semi-analytic integration for hypersingular Galerkin boundary integral equations for the Helmholtz equation in 3D. Appl. Math. 59 (5), 527-542.