Numerical libraries solving large-scale problems developed at IT4Innovations Research Programme Supercomputing for Industry☆
Michal Merta a,b,∗, Jan Zapletal a,b, Tomas Brzobohaty a, Alexandros Markopoulos a, Lubomir Riha a, Martin Cermak a, Vaclav Hapla a,b, David Horak a,b, Lukas Pospisil a,b, Alena Vasatova a,b

a IT4Innovations National Supercomputing Center, 17. listopadu 15/2172, 708 00 Ostrava, Czech Republic
b Department of Applied Mathematics, VSB - Technical University of Ostrava, 17. listopadu 15/2172, 708 33 Ostrava, Czech Republic
Received 26 October 2015; accepted 11 November 2015; available online 15 December 2015.
KEYWORDS: FETI; TFETI; BEM; Domain decomposition; Quadratic programming; HPC
Summary The team of the Research Programme Supercomputing for Industry at IT4Innovations National Supercomputing Center focuses on the development of highly scalable algorithms for the solution of linear and non-linear problems arising from different engineering applications. As the main parallelisation technique, domain decomposition methods (DDM) of the FETI type are used. These methods are combined with finite element (FEM) or boundary element (BEM) discretisation methods and quadratic programming (QP) algorithms. All these algorithms were implemented in our in-house software packages BEM4I, ESPRESO and PERMON, which demonstrate high scalability up to tens of thousands of cores.
© 2015 Published by Elsevier GmbH. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
☆ This article is part of a special issue entitled ''Proceedings of the 1st Czech-China Scientific Conference 2015''.
∗ Corresponding author at: IT4Innovations National Supercomputing Center, 17. listopadu 15/2172, 708 33 Ostrava, Czech Republic. E-mail address: michal.merta@vsb.cz (M. Merta).
Introduction
High performance of contemporary computers results from an increasing number of compute nodes in clusters and of processor cores per node. While the current most powerful petascale or multi-petascale computers contain hundreds of thousands of CPU cores, future exascale systems will comprise millions of them. For efficient use of such systems, algorithms with high parallel scalability have to be developed.

http://dx.doi.org/10.1016/j.pisc.2015.11.023
Discretisation of most engineering problems describable by partial differential equations (PDE) leads to large sparse linear systems of equations. However, problems that can be expressed as elliptic variational inequalities, such as those describing the equilibrium of elastic bodies in mutual contact, lead to quadratic programming (QP) problems.
Finite element tearing and interconnecting (FETI) and boundary element tearing and interconnecting (BETI) (Langer and Steinbach, 2003; Of and Steinbach, 2009) methods form a successful subclass of domain decomposition methods (DDM). They belong to the non-overlapping methods and combine sparse iterative and direct solvers. FETI was first introduced by Farhat and Roux (1991, 1992). The key ingredient of the FETI method is the decomposition of the spatial domain into non-overlapping subdomains that are ''glued together'' by Lagrange multipliers. Elimination of the primal variables reduces the original linear problem to a smaller, relatively well conditioned, equality constrained QP. If the FETI procedure is applied to a contact problem (Dostál et al., 1998, 2000, 2005, 2010, 2012; Dostál and Horák, 2004), the resulting QP has additional bound constraints. FETI methods allow highly accurate computations scaling up to tens of thousands of processors.
Our team was successful in adapting the FETI approach for contact problems and designed new variants. One of them is Total-FETI (TFETI), developed by Dostal et al. (Dostál et al., 2006, 2010; Kruis et al., 2002; Čermák et al., 2015), which uses Lagrange multipliers to enforce the Dirichlet boundary conditions. This enables a simpler building of the stiffness matrix kernel, as all subdomains are floating and the associated subdomain stiffness matrices have the same kernel, obtained without any computation. Hybrid-TFETI (HTFETI) reduces the coarse problem (CP) size by aggregating the subdomains into clusters, i.e. TFETI is applied twice.
The resulting QP problems can then be solved by means of the efficient MPRGP and SMALBE algorithms, designed again by Dostal et al. (Dostál et al., 2003; Dostál and Schöberl, 2005; Dostál, 2009), with a known rate of convergence given by the spectral properties of the solved system.
We develop several software packages dealing with FETI: PERMON based on PETSc, and ESPRESO based on Intel MKL and Cilk. The BEM4I library implements the BEM discretisation and, together with the other two packages, the BETI method. The paper is organised as follows. After the introduction, we describe the main principles of the FETI and BETI methods. Then the particular libraries and their modules are introduced, with the achieved highlights from various areas.
Numerical methods

FETI methods
FETI-1 (Farhat and Roux, 1991, 1992; Farhat et al., 1994; Kruis, 2006) is a non-overlapping DDM (Gosselet and Rey, 2006) which is based on decomposing the original spatial domain into non-overlapping subdomains. They are ''glued together'' by Lagrange multipliers, which have to satisfy certain equality constraints discussed later. The original FETI-1 method assumes that the boundary subdomains inherit the Dirichlet conditions from the original problem, where the conditions are embedded into the linear system arising from FEM. Physically this means that subdomains whose interfaces intersect the Dirichlet boundary are fixed while the others are kept floating; in linear algebra terms, the corresponding subdomain stiffness matrices are non-singular and singular, respectively.
The basic idea of the Total-FETI (TFETI) method (Dostál et al., 2006, 2010; Čermák et al., 2015) is to keep all the subdomains floating and enforce the Dirichlet boundary conditions by means of a constraint matrix and Lagrange multipliers, similarly to the gluing conditions along the subdomain interfaces. This simplifies the implementation of the stiffness matrix generalised inverse. The key point is that the kernels $R^s$ of the subdomain stiffness matrices $K^s$ are known a priori, have the same dimension, and can be formed without any computation from the mesh data, so that the matrix $R$ ($\mathrm{Im}\,R = \mathrm{Ker}\,K$) also possesses a nice block-diagonal layout. Furthermore, each local stiffness matrix can be regularised cheaply, and the inverse of the resulting nonsingular matrix is at the same time a generalised inverse of the original singular one (Dostál et al., 2011; Brzobohatý et al., 2011).
FETI methods use the Lagrange multipliers to enforce both the equality and inequality constraints (gluing and non-penetration conditions) in the original primal problem

$$\min \tfrac{1}{2} u^T K u - u^T f \quad \text{s.t.} \quad B_E u = o \ \text{and}\ B_I u \le c_I.$$

The primal problem is then transformed using duality into a significantly smaller and better conditioned dual problem with an equality constraint and a nonnegativity bound

$$\min \tfrac{1}{2} \lambda^T F \lambda - \lambda^T d \quad \text{s.t.} \quad G\lambda = e, \ \lambda_I \ge o$$

with $F = BK^+B^T$, $G = R^T B^T$, $d = BK^+ f$, $e = R^T f$.
After homogenisation using the particular solution $\tilde{\lambda} = G^T(GG^T)^{-1}e$, where $\lambda = \hat{\lambda} + \tilde{\lambda}$, $\hat{\lambda} \in \mathrm{Ker}\,G$, $\tilde{\lambda} \in \mathrm{Im}\,G^T$, and enforcing the homogenised equality constraint by means of the projector $P = I - Q$ onto $\mathrm{Ker}\,G$, where $Q = G^T(GG^T)^{-1}G$ is the projector onto $\mathrm{Im}\,G^T$, the SMALSE algorithm can be applied to the problem

$$\min \tfrac{1}{2} \hat{\lambda}^T PFP \hat{\lambda} - \hat{\lambda}^T P(d - F\tilde{\lambda}) \quad \text{s.t.} \quad G\hat{\lambda} = o, \ \hat{\lambda}_I \ge -\tilde{\lambda}_I.$$
For this dual problem the classical estimate of the spectral condition number is valid, i.e. $\kappa(PFP|_{\mathrm{Im}\,P}) \le C\,H/h$, with $H$ denoting the decomposition and $h$ the discretisation parameter. A natural effort when using massively parallel computers is to maximise the number of subdomains (decrease $H$) so that the sizes of the subdomain stiffness matrices are reduced, which accelerates not only their factorisation and the subsequent generalised inverse application but also improves the conditioning and reduces the number of iterations. The negative effect is an increase of the dual and null space dimensions, which decelerates the coarse problem (CP) solution, i.e. the solution of the system $GG^T x = y$, so that the bottleneck of the TFETI method is the application of the projector, which dominates the solution time.
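The cost structure just described can be illustrated with a minimal sketch; the data below are a hypothetical toy, and production codes factorise $GG^T$ only once and keep $G$ distributed. Each application of the projector $P = I - G^T(GG^T)^{-1}G$ costs one coarse problem solve $GG^Tx = y$, which is why it dominates the solution time.

```python
# Toy illustration of the TFETI projector P = I - G^T (G G^T)^{-1} G.
# Hypothetical small dense data; production codes keep G distributed and
# factorise the coarse problem matrix G G^T only once.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmat(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    """Gaussian elimination with partial pivoting (dense, for the tiny CP)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def apply_projector(G, v):
    """One application of P costs one coarse problem solve G G^T x = G v."""
    Gt = transpose(G)
    GGt = matmat(G, Gt)
    x = solve(GGt, matvec(G, v))          # the coarse problem (CP) solve
    Qv = matvec(Gt, x)                    # Q v = G^T (G G^T)^{-1} G v
    return [vi - qi for vi, qi in zip(v, Qv)]

G = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
v = [1.0, 2.0, 3.0, 4.0]
Pv = apply_projector(G, v)
# P v lies in Ker G, i.e. G (P v) = o
print([round(c, 12) for c in matvec(G, Pv)])   # → [0.0, 0.0]
```

The sketch makes the bottleneck explicit: increasing the number of subdomains grows $G$, so the CP solve inside `apply_projector` grows with it, exactly the trade-off discussed above.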
Hybrid FETI method
Although there are several efficient coarse problem parallelisation strategies (Hapla and Horák, 2012; Kozubek et al., 2012, 2013), there are still size limitations of the coarse problem. Therefore, several hybrid (multilevel) methods were proposed (Lee, 2009; Klawonn and Rheinbach, 2010). The key idea is to aggregate a small number of neighbouring subdomains into clusters (see Fig. 1), which naturally results in a smaller coarse problem. In our HTFETI, the aggregation of subdomains into clusters is again enforced by Lagrange multipliers. Thus the TFETI method is used on both the cluster and subdomain levels. This approach simplifies the implementation of hybrid FETI methods and enables extending the parallelisation of the original problem up to tens of thousands of cores due to lower memory requirements. This is the positive effect of reducing the coarse space. The negative one is a worse convergence rate compared with the original TFETI. To improve it, the transformation of basis originally introduced by Klawonn and Widlund (2006), Klawonn and Rheinbach (2006), and Li and Widlund (2006) is applied to the derived hybrid algorithm.
Boundary element method and BETI
The boundary element method (BEM) is well-suited for the solution of exterior problems such as sound or electromagnetic wave scattering, or shape optimisation problems. The boundary integral formulation of the given problem leads to the discretisation of the boundary only, thus effectively reducing the problem dimension.
The method is applicable to problems for which the fundamental solution is known, which is the case, e.g., for the Laplace or Helmholtz equations. In 3D, the respective fundamental solutions read

$$v(x,y) := \frac{1}{4\pi}\,\frac{1}{\|x-y\|}, \qquad v_\kappa(x,y) := \frac{1}{4\pi}\,\frac{e^{i\kappa\|x-y\|}}{\|x-y\|}.$$
The solution to the boundary value problem under consideration is given by the representation formula

$$u(x) := \int_{\partial\Omega} \gamma_1 u(y)\, v(x,y)\,\mathrm{d}s_y - \int_{\partial\Omega} \gamma_0 u(y)\, \frac{\partial v}{\partial n_y}(x,y)\,\mathrm{d}s_y,$$

where $\gamma_0$ and $\gamma_1$ represent the Dirichlet and Neumann trace operators. The unknown Cauchy data can be obtained from the appropriate system of boundary integral equations. Applying the Dirichlet and Neumann trace operators to the representation formula leads to the boundary integral equations

$$(V\gamma_1 u)(x) = \tfrac{1}{2}\,\gamma_0 u(x) + (K\gamma_0 u)(x) \quad \text{for } x \in \partial\Omega, \quad (1)$$

$$(D\gamma_0 u)(x) = \tfrac{1}{2}\,\gamma_1 u(x) - (K^{*}\gamma_1 u)(x) \quad \text{for } x \in \partial\Omega \quad (2)$$

with $V$, $K$, $K^{*}$, and $D$ denoting the single-layer, double-layer, adjoint double-layer, and hypersingular boundary integral operators, respectively. The Galerkin discretisation of the single-layer operator equation (1) leads to the system of linear equations

$$Vt = \left(\tfrac{1}{2}M + K\right)u$$
with the boundary element matrices

$$V[k,\ell] := \int_{\tau_k}\int_{\tau_\ell} v(x,y)\,\mathrm{d}s_y\,\mathrm{d}s_x, \qquad K[k,j] := \int_{\tau_k}\int_{\partial\Omega} \frac{\partial v}{\partial n_y}(x,y)\,\varphi_j(y)\,\mathrm{d}s_y\,\mathrm{d}s_x$$

and the sparse identity matrix $M$.
The assembly of the full matrices is of quadratic complexity with respect to the number of degrees of freedom on the surface. Moreover, advanced numerical quadrature methods must be applied to treat the singularities occurring in the integrals in the case of identical or adjacent elements (see Rjasanow and Steinbach (2007) or Sauter and Schwab (2010)). Therefore, an efficient implementation and parallelisation of the method is necessary to allow the solution of large-scale problems.
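The quadratic cost can be illustrated with a toy sketch; the one-point quadrature and the flat-panel data below are illustrative assumptions, not the semi-analytic or Sauter-Schwab quadratures used in BEM4I, and singular pairs are simply skipped here.

```python
# Toy illustration of the quadratic-cost assembly of the single-layer
# matrix V for the 3D Laplace kernel v(x, y) = 1 / (4*pi*|x - y|).
# One-point quadrature over hypothetical flat panels; singular pairs
# (k == l) would need the special quadratures cited in the text.
import math

def assemble_single_layer(midpoints, areas):
    n = len(midpoints)
    V = [[0.0] * n for _ in range(n)]
    for k in range(n):            # quadratic complexity: all element pairs
        for l in range(n):
            if k == l:
                continue          # singular integral: needs special quadrature
            d = math.dist(midpoints[k], midpoints[l])
            V[k][l] = areas[k] * areas[l] / (4.0 * math.pi * d)
    return V

# four hypothetical panel midpoints on a unit square, area 0.25 each
mids = [(0.25, 0.25, 0.0), (0.75, 0.25, 0.0),
        (0.25, 0.75, 0.0), (0.75, 0.75, 0.0)]
V = assemble_single_layer(mids, [0.25] * 4)
print(len(V), len(V[0]))   # → 4 4: a full n x n matrix
```

The nested loop over all element pairs is exactly the $O(n^2)$ behaviour that FMM and ACA sparsification reduce to almost linear.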
The FETI domain decomposition methodology combined with the BEM discretisation results in the so-called BETI (boundary element tearing and interconnecting) method.
MPRGP and SMALBE algorithms
The combination of the SemiMonotonic Augmented Lagrangian algorithm for Bound and Equality constraints (SMALBE) and the Modified Proportioning with Reduced Gradient Projection (MPRGP) algorithm (Dostál et al., 2003; Dostál and Schöberl, 2005; Dostál, 2009) was developed and tested for the solution of QP problems resulting from the discretisation of contact problems of mechanics, but can be used for any other QP problems as well. They have a theoretically supported rate of convergence given by the spectral properties of the solved system. General linear inequality constraints must be converted to bound constraints by applying dualisation, which also typically improves the conditioning and reduces the dimension. MPRGP is an active set based algorithm. The main idea of MPRGP is the splitting of the gradient, based on the active set, into the free and chopped gradients whose sum yields the projected gradient. The algorithm exploits a test to decide about leaving the face and three types of steps to generate a sequence of iterates approximating the solution:
1 The expansion step, applied if the iterate is proportional, may expand the current active set using a fixed steplength related to the matrix norm and the reduced free gradient.
2 The proportioning step may remove indices from the active set using the chopped gradient.
3 The conjugate gradient step.
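The gradient splitting that drives these steps can be sketched as follows; this is a minimal pure-Python illustration with hypothetical toy data, not the PERMON implementation. For min ½xᵀAx − bᵀx subject to x ≥ ℓ, the free gradient keeps the components of g = Ax − b on the free set, the chopped gradient keeps the negative components on the active set, and the proportionality test compares the norms of the two parts.

```python
# Sketch of the MPRGP gradient splitting for min 1/2 x^T A x - b^T x, x >= l.
# Hypothetical tiny data; the real algorithm combines this test with the
# expansion, proportioning and CG steps listed above.

def split_gradient(A, b, l, x, tol=1e-12):
    g = [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    free, chopped = [], []
    for gi, xi, li in zip(g, x, l):
        active = xi - li <= tol            # component sits on its bound
        free.append(0.0 if active else gi)              # free gradient
        chopped.append(min(gi, 0.0) if active else 0.0) # chopped gradient
    return free, chopped

def is_proportional(free, chopped, Gamma=1.0):
    """Proportionality test: chopped part small => stay in the face / expand."""
    norm = lambda v: sum(c * c for c in v) ** 0.5
    return norm(chopped) <= Gamma * norm(free)

A = [[2.0, 0.0], [0.0, 2.0]]
b = [2.0, -2.0]
l = [0.0, 0.0]
x = [1.0, 0.0]                  # second component active at its bound
free, chopped = split_gradient(A, b, l, x)
print(free, chopped)            # projected gradient = free + chopped
```

At the chosen point both parts vanish, i.e. the projected gradient is zero, so this x is the constrained minimiser; a nonzero chopped part would trigger the proportioning step instead.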
Figure 1 Cube prepared for TFETI and HFETI.
The algorithm has been proved to enjoy an R-linear rate of convergence in terms of the spectral condition number. SMALBE is an algorithm based on augmented Lagrangians. It takes care of the equality constraints, while in each of its iterations the inner problem, consisting in the bound-constrained minimisation of the augmented Lagrangian, is solved by any suitable solver such as the MPRGP described above.
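The outer loop of an augmented Lagrangian method of this type can be sketched as follows; this is an illustrative toy assuming exact inner solves and a fixed penalty, whereas SMALBE additionally adapts the penalty and balances the precision of the inner MPRGP solves.

```python
# Sketch of an augmented-Lagrangian outer loop of SMALBE type for
# min 1/2 x^T A x - b^T x  s.t.  C x = o.  Toy data; a direct 2x2 solve
# stands in for the bound-constrained inner MPRGP solve.

def solve2(M, r):
    """Direct solve of a 2x2 system via Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - r[1] * M[0][1]) / det,
            (M[0][0] * r[1] - M[1][0] * r[0]) / det]

def smalbe_sketch(A, b, C, rho=10.0, iters=30):
    n = len(b)
    mu = [0.0] * len(C)
    for _ in range(iters):
        # inner problem: min_x 1/2 x^T (A + rho C^T C) x - (b - C^T mu)^T x
        M = [[A[i][j] + rho * sum(C[k][i] * C[k][j] for k in range(len(C)))
              for j in range(n)] for i in range(n)]
        r = [b[i] - sum(C[k][i] * mu[k] for k in range(len(C))) for i in range(n)]
        x = solve2(M, r)
        # multiplier update driven by the equality-constraint violation C x
        mu = [m + rho * sum(c * xi for c, xi in zip(row, x))
              for m, row in zip(mu, C)]
    return x, mu

A = [[2.0, 0.0], [0.0, 2.0]]
b = [2.0, 4.0]
C = [[1.0, -1.0]]               # equality constraint x1 = x2
x, mu = smalbe_sketch(A, b, C)
print([round(c, 6) for c in x])   # → [1.5, 1.5]
```

Each outer iteration tightens the constraint violation geometrically; here the iterates converge to x = (1.5, 1.5) with multiplier μ = −1, which satisfies the KKT conditions of the toy problem.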
The BEM4I library
Overview
The boundary element library BEM4I concentrates on the efficient assembly of the boundary element matrices for the 3D Laplace, Helmholtz, Lamé, and time-domain wave equations. It employs sparsification methods, namely the fast multipole method (FMM) (Greengard and Rokhlin, 1987; Of, 2007) and the adaptive cross approximation (ACA) (Bebendorf, 2008; Rjasanow and Steinbach, 2007), to reduce the computational effort to almost linear.
The core of the library consists of three main sets of classes:

1 BESpace: the classes inheriting from the BESpace class are responsible for the approximation of the continuous function spaces. The stored information includes the order of the polynomial test and Ansatz functions or the data necessary to approximate matrices using the ACA or FMM methods.
2 BEBilinearForm: the main purpose of this class and its descendants is to assemble the boundary element system matrices (in both full and sparsified formats). The element-wise assembly is performed using the BEIntegrator class. The assembly is parallelised using OpenMP and MPI at this level.
3 BEIntegrator: the classes responsible for the local system matrix assembly inherit from the BEIntegrator class. Several types of numerical quadratures are employed by these classes, including the classical Gaussian quadrature schemes over pairs of distant elements, and the semi-analytical approach (Rjasanow and Steinbach, 2007; Zapletal and Bouchala, 2014) and fully numerical schemes (Sauter and Schwab, 2010) to treat the singularities in the integrals over pairs of close elements. The computation is vectorised to reduce the computational time using the SSE or AVX instruction sets (Fig. 3).

In addition to these classes the library also contains supportive classes representing full, sparse, and sparsified matrices, iterative and direct solvers, preconditioners, surface meshes, etc. The library structure together with the results of the scalability tests has been presented in Merta and Zapletal (2015, accepted for publication) and Čermák et al. (2015) (Fig. 2).
Figure 3 Concurrent summation of scalars using vector instructions.
Intel Xeon Phi utilisation
To reduce the computational time, the code of the library is accelerated by the Intel Xeon Phi coprocessors. The computationally most demanding parts of the code are offloaded to the coprocessor using the offload pragmas of the Intel compiler, and the computation is carried out using the 60 physical (240 logical) cores available on the coprocessor (see Fig. 4). The computation consists of several steps.
1 Pack the data (mainly the nodes and elements of a surface mesh) and send it to the coprocessor.
2 Perform simultaneous computation on the coprocessor and the host.
3 Send the results from the coprocessor to the host processor.
4 Combine the data from the coprocessor and the processor and assemble the global system matrix.
The results of the numerical benchmarks focused on the assembly of the full single-layer operator matrix for the Laplace equation show a significant reduction in the computational time (see Fig. 5). The main bottleneck is currently the data transfer from the coprocessor to the host processor (Fig. 6).
ExaScale PaRallel FETI SOlver (ESPRESO)
Overview
The ESPRESO library is implemented in C++. A significant part of the development effort was devoted to the development of a C++ wrapper for (1) the selected sparse and dense BLAS routines and (2) the sparse direct solvers (the MKL and original versions of the PARDISO direct solvers) of the Intel MKL library. The solver is developed to support current and future multi- and many-core architectures, for instance Intel Xeon Phi or Nvidia Tesla. Therefore, for the CPU and Xeon Phi versions we are using the Intel MKL library, and CUDA libraries (cuBLAS, cuSPARSE, cuSolver) are used for the GPU version.

Figure 4 Offload of the computation to the Intel Xeon Phi coprocessor.

Figure 5 Comparison of the assembly of the single layer operator matrix.
Communication layer optimisation
ESPRESO-H is mainly focused on the scalability of the communication layer for large computer systems with thousands and tens of thousands of compute nodes. All the processing is done by the CPUs. The solver uses hybrid parallelisation, which is well suited for multi-socket and multi-core compute nodes, as this is the architecture of most of today's supercomputers.
The first level of parallelisation is designed for the parallel processing of the clusters of subdomains. Individual clusters are processed per node. It is possible to process multiple clusters per node, but not the other way around. The distributed memory parallelisation is done using MPI, in particular the MPI 3.0 standard, which is implemented in most modern MPI distributions. MPI 3.0 is used because the communication hiding techniques implemented in the communication layer require the non-blocking collective operations.
The communication layer is identical for both the TFETI and HTFETI solvers in ESPRESO. It uses novel communication hiding techniques for the main iterative solver. In particular we have implemented: (1) the Pipelined Conjugate Gradient (PipeCG) solver, which hides the communication of the global dot products behind the local matrix-vector multiplications; (2) distributed CP processing, which merges two global communication operations (Gather and Scatter) into one (AllGather) and parallelises the CP processing using the distributed inverse matrix of the CP; and (3) an optimised version of the global gluing matrix multiplication (matrix B for FETI and B1 for HFETI), written as stencil communication which is fully scalable, see Fig. 7.

Figure 7 The stencil communication for a simple decomposition into four subdomains. The Lagrange multipliers (LMs) that connect different neighbouring subdomains are depicted in different colours. In every iteration, when the LMs are updated, an exchange is performed between the neighbouring subdomains to finish the update. This affinity also controls the distribution of the data for the main distributed iterative solver, which iterates over local LMs only. In our implementation each MPI process modifies only those elements of the vectors used by the CG solver that match the LMs associated with the particular domain in case of FETI or the set of domains in a cluster in case of hybrid FETI.
Inter-cluster processing
The second level of parallelisation is designed for the parallel processing of the subdomains in a cluster. Our implementation enables oversubscription of CPU cores so that each core can process multiple subdomains, and therefore the size of the cluster is not limited by the hardware configuration. This shared memory parallelisation is implemented using Intel Cilk Plus. We have chosen Cilk Plus due to its advanced support for the C++ language. In particular, we take advantage of the functionality that allows us to create custom parallel reduction operations on top of C++ objects, which in our case are sparse matrices.
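The custom reduction idea can be sketched schematically; the snippet below is a Python analogue with hypothetical names, whereas the actual implementation uses Cilk Plus reducer hyperobjects over C++ sparse matrix objects.

```python
# Schematic analogue of a custom parallel reduction over sparse matrices
# (dict-of-keys format). Each worker assembles a local partial matrix and
# a user-defined associative merge combines the partial results, which is
# the role played by reducer hyperobjects in the Cilk Plus implementation.
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def merge(a, b):
    """Associative reduction operation: entry-wise sum of two sparse matrices."""
    out = dict(a)
    for key, val in b.items():
        out[key] = out.get(key, 0.0) + val
    return out

def local_assembly(chunk):
    """Each worker assembles its own sparse partial matrix."""
    part = {}
    for (i, j, v) in chunk:
        part[(i, j)] = part.get((i, j), 0.0) + v
    return part

# hypothetical element contributions (row, col, value), split among workers
contributions = [[(0, 0, 1.0), (0, 1, 2.0)],
                 [(0, 0, 3.0), (1, 1, 4.0)]]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(local_assembly, contributions))

A = reduce(merge, partials)
print(A[(0, 0)])   # → 4.0: contributions from both workers merged
```

Because `merge` is associative, the partial matrices can be combined in any order, which is what makes the reduction safe to run in parallel.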
Numericalresults
ESPRESO is designed to solve large problems using the world's largest supercomputers. In this paper we present results measured on the largest European machine, the CSCS Piz Daint in Lugano, Switzerland. Piz Daint is a Cray XC30 machine with 5272 compute nodes, each with one 8-core Sandy Bridge CPU (E5-2670), 32 GB of RAM and one K20X GPU accelerator. All the following tests use the synthetic 3D cube linear elasticity benchmark. For this benchmarking we are developing a massively parallel in-memory problem generator, which eliminates I/O bottlenecks and allows evaluating the efficiency and scalability of the solver routines more precisely.
The first set of results is shown in Fig. 8. This figure presents the weak scalability of the HTFETI solver in ESPRESO. Due to the limited amount of memory per node, the solver is able to process 2.7 million unknowns per single node. The amount of work per node is then kept fixed while we increase the number of nodes from 1 to 2197, which sets the maximum problem size to 5.8 billion unknowns. This is so far the largest problem we were able to solve on the Piz Daint machine. The important message from this measurement is the flattening characteristic from 343 to 2197 nodes, which is the expected result of the good weak scalability of the solver.
The next tests show the strong scaling of the HTFETI method in ESPRESO. In Fig. 9 we can see the strong scalability of the single iteration time. This experiment decouples the numerical scalability of the HTFETI method from the scalability of the implementation itself. We can see that ESPRESO achieves super-linear scalability per iteration when solving a 2.6 billion unknown problem, starting from 1000 nodes and scaling to 4913 nodes. The per-iteration time is shown in the figure next to each point (the second line), while the number of nodes is given in the first line. The blue line shows the linear scaling based on the processing time on 1000 nodes.
The last test using the synthetic benchmark shows the strong scalability of the entire iterative solver in ESPRESO. This involves the per-iteration time as well as the number of iterations (the numerical scalability). We can see that even in this test the solver achieved linear scaling. Please note that for both strong scalability tests we keep the cluster configuration identical; in other words, the number of domains per node remains the same and we reduce the domain size while increasing the number of nodes/clusters (Fig. 10).
ESPRESO-GPU and ESPRESO-MIC
In parallel with ESPRESO-H we are developing two more flavours of ESPRESO which are designed to take advantage of modern many-core accelerators. ESPRESO-GPU uses CUDA and its libraries to run on Nvidia Tesla GPUs. ESPRESO-MIC is developed under the Intel Parallel Computing Center (IPCC) at IT4Innovations, and its main focus is to fully utilise the potential of the Xeon Phi accelerators based on the Knights Corner architecture. This is an essential research direction for IT4Innovations, as it has the largest European Xeon Phi accelerated system, called Salomon.
Figure 8 The weak scaling evaluation of the ESPRESO solver on the largest European supercomputer, the CSCS Piz Daint. The solver is able to process 2.7 million unknowns per node. The scalability is evaluated from 1 to 2197 nodes. The flattening shape of the total execution time shows the potential of ESPRESO to scale even further.
Figure 9 Strong scalability of a single iteration time of the ESPRESO solver. In this test ESPRESO is solving a 2.6 billion unknown problem, scaling from 1000 to 4913 nodes.
PERMON
Overview
Since 2011 we have been developing a novel software package based on PETSc, using TFETI for the solution of QP: the PERMON (Parallel, Efficient, Robust, Modular, Object-oriented, Numerical) toolbox. It makes use of theoretical results in discretisation techniques, QP algorithms, and DDM. It incorporates our own codes and makes use of renowned open source libraries. The solver layer, discussed here, consists of three modules: PermonFLLOP, PermonQP, and PermonIneq. Other modules are problem-specific, such as PermonPlasticity for plasticity, PermonImage for image recognition, PermonMultiBody for particle dynamics, and others.
Figure 10 Strong scalability of the iterative solver of ESPRESO. In this test ESPRESO is solving a 2.6 billion unknown problem, scaling from 1000 to 4913 nodes.
Figure 11 Doubly linked list of QPs.
PermonQP
PermonQP is a package providing a base for the solution of QP problems. Its main idea is the separation of the concepts of QP problems, transforms, and solvers, which are abstracted by the three basic classes QP, QPT, and QPS, respectively. A QP transform derives a new QP from the given QP, so that a doubly linked list (QP chain) is generated where every node is a QP (Fig. 11). The programming interface (API) of PermonQP is carefully designed to be easy to use and, at the same time, efficient and suitable for HPC. The solution process is, from the user's point of view, divided into the following sequence of actions:

1 QP problem specification;
2 QP transforms, which reformulate the original problem and create a chain of QP problems where the last one is passed to the solver;
3 automatic or manual choice of an appropriate QP solver;
4 QP solution.
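The QP chain idea can be sketched as follows; the class and function names below are illustrative stand-ins, not the actual PERMON API.

```python
# Schematic sketch of the PermonQP "QP chain": transforms derive new QPs
# linked in a doubly linked list; the solver sees only the last node.
# Class and function names here are illustrative, not the PERMON API.

class QP:
    def __init__(self, data, parent=None, transform=None):
        self.data = data              # problem description (schematic)
        self.parent = parent          # link back to the QP this one was derived from
        self.child = None             # link forward to the derived QP
        self.transform = transform    # name of the transform that produced it
        if parent is not None:
            parent.child = self

def apply_transform(qp, name):
    """A QP transform (QPT analogue) derives a new QP and appends it to the chain."""
    return QP(data=f"{qp.data} -> {name}", parent=qp, transform=name)

def last(qp):
    """Walk to the end of the chain: the QP actually passed to the solver."""
    while qp.child is not None:
        qp = qp.child
    return qp

root = QP("primal contact problem")
for t in ["dualisation", "homogenisation", "projector preconditioning"]:
    apply_transform(last(root), t)

chain = []
node = root
while node is not None:
    chain.append(node.transform or "root")
    node = node.child
print(chain)
```

The doubly linked structure lets a solution of the last QP be mapped back through the parent links to a solution of the original problem, which is the point of keeping the chain rather than discarding intermediate problems.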
PermonQP as a stand-alone package allows solving unconstrained QP problems (i.e. linear systems with a positive semidefinite matrix) or equality constrained ones. In both cases it makes use of the PETSc KSP package, which includes both direct and iterative solvers, including interfaces to many external solvers. Examples of equality constraints are, for instance, multipoint constraints, or the alternative enforcement of Dirichlet boundary conditions using a separate constraint matrix. This module is being prepared for publishing under the BSD 2-Clause license.
PermonIneq
PermonQP capabilities can be further extended with the PermonIneq package, which adds several concrete solvers for inequality constrained QPs, e.g. the already mentioned MPRGP and SMALBE algorithms.
PermonFLLOP
PermonFLLOP is a wrapper of PermonQP implementing FETI. It assembles the FETI-specific constraint matrix B and null space matrix R. They are passed internally to PermonQP together with the subdomain-wise stiffness matrices and load vectors, which can be assembled with an arbitrary FEM library such as PermonCube or libMesh. The FETI method itself consists here just in calling the proper sequence of QP transformations: primal scaling, dualisation, dual scaling, homogenisation of the equality constraints, and preconditioning by the orthogonal projector onto the kernel of the dual equality constraint matrix.
Numericalexperiments
As a benchmark, an elastic cube was subjected to volume forces pressing it against an obstacle. There were two reasons for this decision. The elastic cube is a numerical model which can be fully controlled, and the obtained results are not affected by the complexity of the geometry. Another reason is that it is very difficult or even impossible to create very large meshes on complex geometries using existing meshing tools. With our mesh generator PermonCube we were able to prepare large-scale problems decomposed into thousands of subdomains.

Table 1 Results for the cube contact linear elasticity problem.

X | NS (#decomp.) | DOF        | Solution time [s] | Outer iters | Inner iters
4 | 64            | 3,000,000  | 2.68E+01          | 3           | 94
6 | 216           | 10,125,000 | 5.38E+01          | 3           | 147
8 | 512           | 24,000,000 | 1.21E+02          | 4           | 250

Figure 12 Parallel weak scalability of the TFETI method implementation in the PermonFLLOP code for the linear elasticity cube benchmark at the HECToR supercomputer.

Figure 13 Numerical scalability of the TFETI method within the PermonFLLOP code for the linear elasticity cube benchmark. Note that from a certain point we get an almost constant number of iterations, allowing good parallel scalability.
The R and G matrices were orthonormalised using the iterative classical Gram-Schmidt process; the K matrix was factorised using the Cholesky factorisation from the MUMPS library. Currently, each computational core owns one and only one subdomain. The norm of the projected gradient compared with the 10⁻⁵ multiple of the projected dual RHS was used as the stopping criterion. The results are summarised in Table 1.
The weak scalability for 13,824; 8000 and 4096 elements per subdomain and the numerical scalability for these configurations (corresponding to the fixed ratios H/h = 24, 20, 16) are illustrated in Figs. 12 and 13. To investigate the strong scalability we selected the discretisation with 32,768,000 elements (approx. 100,000,000 unknowns); it was demonstrated up to 8000 cores (41.5 s using 2197 cores; 19.8 s using 4096 cores; 15.7 s using 8000 cores).
Conclusion
Efficient variants of the BEM discretisation method, scalable QP algorithms, and FETI-type domain decomposition methods (BETI, TFETI and HTFETI) were implemented in our in-house software packages. These solvers were optimised employing available state-of-the-art external libraries, communication hiding and avoiding techniques, hybrid MPI-OpenMP programming, GPU and MIC accelerators, etc. Scalability was proven for both huge model problems and complicated engineering problems up to tens of thousands of cores.
The presented BEM4I library for the boundary element discretisation of engineering problems has been tested up to more than a thousand cores. Currently, its acceleration using the Intel Xeon Phi coprocessors is under development. The initial results suggest a significant reduction in the computational time in the case of full system matrices for the Laplace equation; therefore, the acceleration of the assembly of matrices sparsified by ACA, as well as the assembly of the system matrices for the Lamé equation, is being considered.
The presented ESPRESO library brings highly optimised TFETI and HTFETI implementations. ESPRESO-H is oriented to large computer systems with thousands and tens of thousands of compute nodes. ESPRESO-GPU and ESPRESO-MIC are developed to exploit the power of GPU and MIC accelerators.
We have also presented our PERMON toolbox, mainly its solver packages based on PETSc. They uniquely combine FETI DDM with QP algorithms. PermonFLLOP is used to generate the FETI-specific objects for a contact problem of elasticity, while the FEM objects are provided by any FEM code for each subdomain independently. PermonFLLOP wraps PermonQP and PermonIneq, which solve the resulting QP problem. Results for the contact problem of an elastic cube, generated by the PermonCube package, were shown.
Conflict of interest

The authors declare that there is no conflict of interest.
Acknowledgements
This work was supported by the European Regional Development Fund in the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070); by the project of major infrastructures for research, development and innovation of the Ministry of Education, Youth and Sports with reg. num. LM2011033; by the EXA2CT project funded from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 610741; by the internal student grant competition project SP2015/186 ''PERMON toolbox development''; by the project POSTDOCI II reg. no. CZ.1.07/2.3.00/30.0055 within the Operational Programme Education for Competitiveness; and by the Grant Agency of the Czech Republic (GACR) project no. 15-18274S. We thank CSCS (www.cscs.ch) for the support in using the Piz Daint supercomputer.
References
Bebendorf, M., 2008. Hierarchical Matrices: A Means to Efficiently Solve Elliptic Boundary Value Problems. Lecture Notes in Computational Science and Engineering. Springer.
Brzobohatý, T., Dostál, Z., Kozubek, T., Kovář, P., Markopoulos, A., 2011. Cholesky decomposition with fixing nodes to stable computation of a generalized inverse of the stiffness matrix of a floating structure. Int. J. Numer. Methods Eng. 88 (5), 493-509.
Dostál, Z., Friedlander, A., Santos, S.A., 2003. Augmented Lagrangians with adaptive precision control for quadratic programming with simple bounds and equality constraints. SIAM J. Optim. 13 (4), 1120-1140.
Dostál, Z., Friedlander, A., Santos, S.A., 1998. Solution of contact problems of elasticity by FETI domain decomposition. Contemp. Math. 218, 82-93.
Dostál, Z., Horák, D., Kučera, R., 2006. Total FETI - an easier implementable variant of the FETI method for numerical solution of elliptic PDE. Commun. Numer. Methods Eng. 22 (12), 1155-1162.
Dostál, Z., Kozubek, T., Markopoulos, A., Menšík, M., 2011. Cholesky decomposition of a positive semidefinite matrix with known kernel. Appl. Math. Comput. 217 (13), 6067-6077.
Dostál, Z., Kozubek, T., Vondrák, V., Brzobohatý, T., Markopoulos, A., 2010. Scalable TFETI algorithm for the solution of multibody contact problems of elasticity. Int. J. Numer. Methods Eng. 82 (11), 1384-1405.
Dostál, Z., Neto, F.A.G., Santos, S.A., 2000. Solution of contact problems by FETI domain decomposition with natural coarse space projections. Comput. Methods Appl. Mech. Eng. 190 (13-14), 1611-1627.
Dostál, Z., 2009. Optimal Quadratic Programming Algorithms, with Applications to Variational Inequalities. SOIA, Springer, New York, US.
Dostál, Z., Horák, D., 2004. Scalable FETI with optimal dual penalty for a variational inequality. Numer. Linear Algebra Appl. 11, 455-472.
Dostál, Z., Horák, D., 2007. Theoretically supported scalable FETI for numerical solution of variational inequalities. SIAM J. Numer. Anal. 45 (2), 500-513.
Dostál, Z., Horák, D., Kučera, R., Vondrák, V., Haslinger, J., Dobiáš, J., Pták, S., 2005. FETI based algorithms for contact problems: scalability, large displacements and 3D Coulomb friction. Comput. Methods Appl. Mech. Eng. 194 (2-5), 395-409.
Dostál, Z., Kozubek, T., Brzobohatý, T., Markopoulos, A., Vlach, O., 2012. Scalable TFETI with optional preconditioning by conjugate projector for transient contact problems of elasticity. Comput. Methods Appl. Mech. Eng. 247-248, 37-50.
Dostál, Z., Schöberl, J., 2005. Minimizing quadratic functions subject to bound constraints. Comput. Optim. Appl. 30 (1), 23-43.
Farhat, C., Mandel, J., Roux, F.X., 1994. Optimal convergence properties of the FETI domain decomposition method. Comput. Methods Appl. Mech. Eng. 115, 365-385.
Farhat, C., Roux, F.X., 1991. A method of finite element tearing and interconnecting and its parallel solution algorithm. Int. J. Numer. Methods Eng. 32 (6), 1205-1227.
Farhat, C., Roux, F.X., 1992. An unconventional domain decomposition method for an efficient parallel solution of large-scale finite element systems. SIAM J. Sci. Stat. Comput. (1).
Gosselet, P., Rey, C., 2006. Non-overlapping domain decomposition methods in structural mechanics. Arch. Comput. Methods Eng. 13 (4), 515-572.
Greengard, L., Rokhlin, V., 1987. A fast algorithm for particle simulations. J. Comput. Phys. 73 (2), 325-348.
Hapla, V., Horák, D., 2012. TFETI coarse space projectors parallelization strategies. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (Eds.), Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, pp. 152-162.
Klawonn, A., Rheinbach, O., 2010. Highly scalable parallel domain decomposition methods with an application to biomechanics. ZAMM Z. Angew. Math. Mech. 90 (1), 5-32.
Klawonn, A., Widlund, O.B., 2006. Dual-primal FETI methods for linear elasticity. Commun. Pure Appl. Math. 59 (11), 1523-1572.
Klawonn, A., Rheinbach, O., 2006. A parallel implementation of dual-primal FETI methods for three-dimensional linear elasticity using a transformation of basis. SIAM J. Sci. Comput. 28 (5), 1886-1906, http://dx.doi.org/10.1137/050624364.
Kozubek, T., Horák, D., Hapla, V., 2012. FETI coarse problem parallelization strategies and their comparison. Tech. Rep. http://www.prace-project.eu/IMG/pdf/feticoarseproblemparallelization.pdf.
Kozubek, T., Vondrák, V., Menšík, M., Horák, D., Dostál, Z., Hapla, V., Kabelíková, P., Čermák, M., 2013. Total FETI domain decomposition method and its massively parallel implementation. Adv. Eng. Softw. 60-61, 14-22.
Kruis, J., 2006. Domain Decomposition Methods for Distributed Computing. Saxe-Coburg Publications.
Kruis, J., Matouš, K., Dostál, Z., 2002. Solving laminated plates by domain decomposition. Adv. Eng. Softw. 33, 445-452.
Langer, U., Steinbach, O., 2003. Boundary element tearing and interconnecting methods. Computing 71 (3), 205-228.
Lee, J., 2009. A hybrid domain decomposition method and its applications to contact problems in mechanical engineering. New York University (Ph.D. thesis).
Li, J., Widlund, O.B., 2006. FETI-DP, BDDC, and block Cholesky methods. Int. J. Numer. Methods Eng. 66, 250-271.
Merta, M., Zapletal, J., 2015. A parallel library for boundary element discretization of engineering problems. Math. Comput. Simul. (accepted for publication).
Merta, M., Zapletal, J., 2015. Acceleration of boundary element method by explicit vectorization. Adv. Eng. Softw. 86, 70-79.
Of, G., Steinbach, O., 2009. The all-floating boundary element tearing and interconnecting method. J. Numer. Math. 17 (4), 277-298.
Of, G., 2007. Fast multipole methods and applications. In: Schanz, M., Steinbach, O. (Eds.), Boundary Element Analysis, Lecture Notes in Applied and Computational Mechanics. Springer, Berlin, Heidelberg, pp. 135-160.
Amestoy, P., et al., 2015. MUMPS: a Multifrontal Massively Parallel sparse direct Solver. http://mumps.enseeiht.fr/.
Rjasanow, S., Steinbach, O., 2007. The Fast Solution of Boundary Integral Equations. Mathematical and Analytical Techniques with Applications to Engineering. Springer.
Sauter, S., Schwab, C., 2010. Boundary Element Methods. Springer Series in Computational Mathematics. Springer.
Čermák, M., Hapla, V., Horák, D., Merta, M., Markopoulos, A., 2015. Total-FETI domain decomposition method for solution of elasto-plastic problems. Adv. Eng. Softw. 84, 48-54.
Čermák, M., Merta, M., Zapletal, J., 2015. A novel boundary element library with applications. In: Simos, T., Tsitouras, C. (Eds.), Proceedings of ICNAAM 2014. AIP Conference Proceedings, vol. 1648.
Zapletal, J., Bouchala, J., 2014. Effective semi-analytic integration for hypersingular Galerkin boundary integral equations for the Helmholtz equation in 3D. Appl. Math. 59 (5), 527-542.