
Numerical libraries solving large-scale problems developed at IT4Innovations Research Programme Supercomputing for Industry

Michal Merta a,b,*, Jan Zapletal a,b, Tomas Brzobohaty a, Alexandros Markopoulos a, Lubomir Riha a, Martin Cermak a, Vaclav Hapla a,b, David Horak a,b, Lukas Pospisil a,b, Alena Vasatova a,b

a IT4Innovations National Supercomputing Center, 17. listopadu 15/2172, 708 00 Ostrava, Czech Republic
b Department of Applied Mathematics, VSB - Technical University of Ostrava, 17. listopadu 15/2172, 708 33 Ostrava, Czech Republic

Received 26 October 2015; accepted 11 November 2015; available online 15 December 2015
http://dx.doi.org/10.1016/j.pisc.2015.11.023

Keywords: FETI; TFETI; BEM; Domain decomposition; Quadratic programming; HPC

Summary: The team of the Research Programme Supercomputing for Industry at the IT4Innovations National Supercomputing Center focuses on the development of highly scalable algorithms for the solution of linear and non-linear problems arising from various engineering applications. As the main parallelisation technique, domain decomposition methods (DDM) of the FETI type are used. These methods are combined with finite element (FEM) or boundary element (BEM) discretisation methods and quadratic programming (QP) algorithms. All these algorithms were implemented in our in-house software packages BEM4I, ESPRESO and PERMON, which demonstrate high scalability up to tens of thousands of cores.

© 2015 Published by Elsevier GmbH. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

This article is part of a special issue entitled "Proceedings of the 1st Czech-China Scientific Conference 2015".

* Corresponding author at: IT4Innovations National Supercomputing Center, 17. listopadu 15/2172, 708 33 Ostrava, Czech Republic. E-mail address: michal.merta@vsb.cz (M. Merta).

Introduction

High performance of contemporary computers results from the increasing number of compute nodes in clusters and of processor cores per node. While the current most powerful petascale and multi-petascale computers contain hundreds of thousands of CPU cores, future exascale systems will comprise millions of them. For the efficient use of such systems, algorithms with high parallel scalability have to be developed.

Discretisation of most engineering problems describable by partial differential equations (PDE) leads to large sparse linear systems of equations. However, problems that can be expressed as elliptic variational inequalities, such as those describing the equilibrium of elastic bodies in mutual contact, lead to quadratic programming (QP) problems.

Finite element tearing and interconnecting (FETI) and boundary element tearing and interconnecting (BETI) (Langer and Steinbach, 2003; Of and Steinbach, 2009) methods form a successful subclass of domain decomposition methods (DDM). They belong to the non-overlapping methods and combine sparse iterative and direct solvers. FETI was first introduced by Farhat and Roux (Farhat and Roux, 1991, 1992). The key ingredient of the FETI method is the decomposition of the spatial domain into non-overlapping subdomains that are "glued together" by Lagrange multipliers. Elimination of the primal variables reduces the original linear problem to a smaller, relatively well conditioned, equality constrained QP. If the FETI procedure is applied to a contact problem (Dostál et al., 1998, 2000, 2005, 2010, 2012; Dostál and Horák, 2004), the resulting QP has additional bound constraints. FETI methods allow highly accurate computations scaling up to tens of thousands of processors.

Our team was successful in adapting the FETI approach for contact problems and designed new variants. One of them is Total-FETI (TFETI), developed by Dostal et al. (Dostál et al., 2006, 2010; Kruis et al., 2002; Čermák et al., 2015), which uses Lagrange multipliers to enforce the Dirichlet boundary conditions. This enables a simpler construction of the stiffness matrix kernel, as all subdomains are floating and the associated subdomain stiffness matrices have the same kernel, obtained without any computation. Hybrid-TFETI (HTFETI) reduces the coarse problem (CP) size by aggregating the subdomains into clusters, i.e. TFETI is applied twice.

The resulting QP problems can then be solved by means of the efficient MPRGP and SMALBE algorithms, also designed by Dostal et al. (Dostál et al., 2003; Dostál and Schöberl, 2005; Dostál, 2009), with a known rate of convergence given by the spectral properties of the solved system.

We develop several software packages dealing with FETI: PERMON based on PETSc, and ESPRESO based on Intel MKL and Cilk. The BEM4I library implements the BEM discretisation and, together with the other two packages, the BETI method. The paper is organised as follows. After the introduction, we describe the main principles of the FETI and BETI methods. Then the particular libraries and their modules are introduced, together with highlights achieved in various areas.

Numerical methods

FETI methods

FETI-1 (Farhat and Roux, 1991, 1992; Farhat et al., 1994; Kruis, 2006) is a non-overlapping DDM (Gosselet and Rey, 2006) which is based on decomposing the original spatial domain into non-overlapping subdomains. They are "glued together" by Lagrange multipliers which have to satisfy certain equality constraints that will be discussed later. The original FETI-1 method assumes that the boundary subdomains inherit the Dirichlet conditions from the original problem, where the conditions are embedded into the linear system arising from FEM. Physically this means that subdomains whose interfaces intersect the Dirichlet boundary are fixed while the others are kept floating; in linear algebra terms, the corresponding subdomain stiffness matrices are non-singular and singular, respectively.

The basic idea of the Total-FETI (TFETI) method (Dostál et al., 2006, 2010; Čermák et al., 2015) is to keep all the subdomains floating and enforce the Dirichlet boundary conditions by means of a constraint matrix and Lagrange multipliers, similarly to the gluing conditions along the subdomain interfaces. This simplifies the implementation of the stiffness matrix generalised inverse. The key point is that the kernels $R_s$ of the subdomain stiffness matrices $K_s$ are known a priori, have the same dimension, and can be formed without any computation from the mesh data, so that the matrix $R$ ($\mathrm{Im}\,R = \mathrm{Ker}\,K$) also possesses a convenient block-diagonal layout. Furthermore, each local stiffness matrix can be regularised cheaply, and the inverse of the resulting nonsingular matrix is at the same time a generalised inverse of the original singular one (Dostál et al., 2011; Brzobohatý et al., 2011).

FETI methods use the Lagrange multipliers to enforce both equality and inequality constraints (gluing and non-penetration conditions) in the original primal problem

$$\min\ \tfrac{1}{2} u^{T} K u - u^{T} f \quad \text{s.t.} \quad B_{E} u = o \ \text{ and } \ B_{I} u \leq c_{I}.$$

The primal problem is then transformed using duality into a significantly smaller and better conditioned dual problem with an equality constraint and a nonnegativity bound

$$\min\ \tfrac{1}{2} \lambda^{T} F \lambda - \lambda^{T} d \quad \text{s.t.} \quad G\lambda = e, \ \ \lambda_{I} \geq o,$$

with $F = B K^{+} B^{T}$, $G = R^{T} B^{T}$, $d = B K^{+} f$, $e = R^{T} f$.
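For orientation, the following is a standard FETI duality sketch (not reproduced from the paper) showing where $F$, $G$, $d$, and $e$ come from; here $B$ stacks the rows of $B_E$ and $B_I$, $\alpha$ collects the amplitudes of the kernel basis $R$, and the gap vector is taken as zero so that $d = BK^{+}f$ matches the expression above.

$$
\begin{aligned}
L(u,\lambda) &= \tfrac{1}{2}u^{T}Ku - u^{T}f + \lambda^{T}Bu,\\
\nabla_{u}L = 0 \ &\Rightarrow\ Ku = f - B^{T}\lambda \ \Rightarrow\ u = K^{+}\!\left(f - B^{T}\lambda\right) + R\alpha,\\
R^{T}\!\left(f - B^{T}\lambda\right) = 0 \ &\Leftrightarrow\ G\lambda = e \quad \text{with } G = R^{T}B^{T},\ e = R^{T}f,\\
\text{substituting } u \text{ back} \ &\Rightarrow\ \min_{\lambda}\ \tfrac{1}{2}\lambda^{T}F\lambda - \lambda^{T}d \quad \text{with } F = BK^{+}B^{T},\ d = BK^{+}f.
\end{aligned}
$$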

After homogenisation using the particular solution $\tilde{\lambda} = G^{T}(GG^{T})^{-1} e$, with $\lambda = \hat{\lambda} + \tilde{\lambda}$, $\hat{\lambda} \in \mathrm{Ker}\,G$, $\tilde{\lambda} \in \mathrm{Im}\,G^{T}$, and enforcing the homogenised equality constraint by means of the projector $P = I - Q$ onto $\mathrm{Ker}\,G$, where $Q = G^{T}(GG^{T})^{-1}G$ is the projector onto $\mathrm{Im}\,G^{T}$, the SMALSE algorithm can be applied to the problem

$$\min\ \tfrac{1}{2} \hat{\lambda}^{T} P F P \hat{\lambda} - \hat{\lambda}^{T} P (d - F\tilde{\lambda}) \quad \text{s.t.} \quad G\hat{\lambda} = o, \ \ \hat{\lambda}_{I} \geq -\tilde{\lambda}_{I}.$$

For this dual problem the classical estimate of the spectral condition number is valid, i.e.

$$\kappa\!\left(PFP|_{\mathrm{Im}\,P}\right) \leq C\,\frac{H}{h},$$

with $H$ denoting the decomposition and $h$ the discretisation parameter. A natural effort when using massively parallel computers is to maximise the number of subdomains (decrease $H$) so that the sizes of the subdomain stiffness matrices are reduced, which not only accelerates their factorisation and the subsequent generalised inverse application but also improves conditioning and reduces the number of iterations. The negative effect is an increase of the dual and null space dimensions, which decelerates the coarse problem (CP) solution, i.e. the solution of the system $GG^{T}x = y$, so that the bottleneck of the TFETI method is the application of the projector, which dominates the solution time.
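Since the projector application is the performance-critical kernel just described, a minimal dense sketch may help fix ideas. It uses Eigen, assumes $G$ is stored densely (real implementations keep it sparse and distribute the $(GG^{T})^{-1}$ application), and the function and variable names are ours, not taken from ESPRESO or PERMON.

```cpp
#include <Eigen/Dense>

// Apply P = I - G^T (G G^T)^{-1} G to a dual vector lambda.
// The LLT factor of the (small, dense here) coarse matrix G G^T is reused
// in every iteration, so only triangular solves remain in the hot loop.
Eigen::VectorXd apply_projector(const Eigen::MatrixXd& G,
                                const Eigen::LLT<Eigen::MatrixXd>& cpFactor,
                                const Eigen::VectorXd& lambda) {
  Eigen::VectorXd y = G * lambda;          // restrict to the coarse space
  Eigen::VectorXd x = cpFactor.solve(y);   // coarse problem: (G G^T) x = y
  return lambda - G.transpose() * x;       // subtract the Im G^T component
}

int main() {
  const int nDual = 200, nCoarse = 12;     // toy sizes
  Eigen::MatrixXd G = Eigen::MatrixXd::Random(nCoarse, nDual);
  Eigen::LLT<Eigen::MatrixXd> cpFactor((G * G.transpose()).eval());
  Eigen::VectorXd lambda = Eigen::VectorXd::Random(nDual);
  Eigen::VectorXd Pl = apply_projector(G, cpFactor, lambda);
  return (G * Pl).norm() < 1e-8 ? 0 : 1;   // P lambda must lie in Ker G
}
```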

Hybrid FETI method

Although there are several efficient coarse problem parallelisation strategies (Hapla and Horák, 2012; Kozubek et al., 2012, 2013), there are still size limitations of the coarse problem. Therefore several hybrid (multilevel) methods were proposed (Lee, 2009; Klawonn and Rheinbach, 2010). The key idea is to aggregate a small number of neighbouring subdomains into clusters (see Fig. 1), which naturally results in a smaller coarse problem. In our HTFETI, the aggregation of subdomains into clusters is again enforced by Lagrange multipliers. Thus the TFETI method is used on both the cluster and subdomain levels. This approach simplifies the implementation of hybrid FETI methods and makes it possible to extend the parallelisation of the original problem up to tens of thousands of cores due to lower memory requirements. This is the positive effect of reducing the coarse space. The negative one is a worse convergence rate compared with the original TFETI. To improve it, the transformation of basis originally introduced by Klawonn and Widlund (Klawonn and Widlund, 2006), Klawonn and Rheinbach (Klawonn and Rheinbach, 2006), and Li and Widlund (Li and Widlund, 2006) is applied to the derived hybrid algorithm.

Boundary element method and BETI

The boundary element method (BEM) is well suited for the solution of exterior problems such as sound or electromagnetic wave scattering, or shape optimisation problems. The boundary integral formulation of the given problem leads to the discretisation of the boundary only, thus effectively reducing the problem dimension.

The method is applicable to problems for which the fundamental solution is known, which is the case, e.g., of the Laplace or Helmholtz equations. In 3D, the respective fundamental solutions read

$$v(x,y) := \frac{1}{4\pi}\,\frac{1}{\|x-y\|}, \qquad v_{\kappa}(x,y) := \frac{1}{4\pi}\,\frac{\mathrm{e}^{\mathrm{i}\kappa\|x-y\|}}{\|x-y\|},$$

with $\kappa$ denoting the wave number in the Helmholtz case.

The solution to the boundary value problem under consideration is given by the representation formula

$$u(x) := \int_{\partial\Omega} \gamma_{1} u(y)\, v(x,y)\, \mathrm{d}s_{y} - \int_{\partial\Omega} \gamma_{0} u(y)\, \frac{\partial v}{\partial n_{y}}(x,y)\, \mathrm{d}s_{y},$$

where $\gamma_{0}$ and $\gamma_{1}$ represent the Dirichlet and Neumann trace operators. The unknown Cauchy data can be obtained from the appropriate system of boundary integral equations. Applying the Dirichlet and Neumann trace operators to the representation formula leads to the boundary integral equations

$$(V \gamma_{1} u)(x) = \tfrac{1}{2}\,\gamma_{0} u(x) + (K \gamma_{0} u)(x) \quad \text{for } x \in \partial\Omega, \tag{1}$$
$$(D \gamma_{0} u)(x) = \tfrac{1}{2}\,\gamma_{1} u(x) - (K^{*} \gamma_{1} u)(x) \quad \text{for } x \in \partial\Omega,$$

with $V$, $K$, $K^{*}$, and $D$ denoting the single-layer, double-layer, adjoint double-layer, and hypersingular boundary integral operators, respectively. The Galerkin discretisation of the single-layer operator equation (1) leads to the system of linear equations

$$V t = \left(\tfrac{1}{2} M + K\right) u$$

with the boundary element matrices

$$V[k,\ell] := \int_{\tau_{k}} \int_{\tau_{\ell}} v(x,y)\, \mathrm{d}s_{y}\, \mathrm{d}s_{x}, \qquad K[k,j] := \int_{\tau_{k}} \int_{\partial\Omega} \frac{\partial v}{\partial n_{y}}(x,y)\, \varphi_{j}(y)\, \mathrm{d}s_{y}\, \mathrm{d}s_{x}$$

and the sparse identity matrix $M$.

The assembly of the full matrices is of quadratic complexity with respect to the number of degrees of freedom on the surface. Moreover, advanced numerical quadrature methods must be applied to treat the singularities occurring in the integrals in the case of identical or adjacent elements (see Rjasanow and Steinbach (2007) or Sauter and Schwab (2010)). Therefore, an efficient implementation and parallelisation of the method is necessary to allow the solution of large scale problems.
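To make the cost argument concrete, every pair of boundary elements contributes one entry of V, hence the quadratic complexity. The sketch below approximates one entry for a single regular (well-separated) pair of flat triangles with piecewise constant basis functions, using the standard 3-point Gauss rule of degree 2 on each triangle; it deliberately ignores the singular and nearly singular cases handled by the semi-analytic and fully numerical schemes cited above, and all names are illustrative.

```cpp
#include <array>
#include <cmath>

struct Vec3 { double x, y, z; };

// Map reference coordinates (u, v) on the unit triangle to a point of triangle t.
static Vec3 mapToTriangle(const std::array<Vec3, 3>& t, double u, double v) {
  const double w = 1.0 - u - v;
  return { w * t[0].x + u * t[1].x + v * t[2].x,
           w * t[0].y + u * t[1].y + v * t[2].y,
           w * t[0].z + u * t[1].z + v * t[2].z };
}

static double area(const std::array<Vec3, 3>& t) {
  const double ax = t[1].x - t[0].x, ay = t[1].y - t[0].y, az = t[1].z - t[0].z;
  const double bx = t[2].x - t[0].x, by = t[2].y - t[0].y, bz = t[2].z - t[0].z;
  const double cx = ay * bz - az * by, cy = az * bx - ax * bz, cz = ax * by - ay * bx;
  return 0.5 * std::sqrt(cx * cx + cy * cy + cz * cz);
}

// V[k,l] for p0 basis functions and the 3D Laplace kernel 1/(4*pi*|x-y|);
// valid only for well-separated triangles (regular integrand).
double singleLayerEntry(const std::array<Vec3, 3>& tauK,
                        const std::array<Vec3, 3>& tauL) {
  const double pi = 3.14159265358979323846;
  // Degree-2 Gauss rule on the unit triangle: 3 points with equal weights 1/3
  // (the triangle areas below absorb the reference-element scaling).
  const double pts[3][2] = { {1.0 / 6, 1.0 / 6}, {2.0 / 3, 1.0 / 6}, {1.0 / 6, 2.0 / 3} };
  const double w = 1.0 / 3.0;
  double sum = 0.0;
  for (int i = 0; i < 3; ++i) {
    for (int j = 0; j < 3; ++j) {
      const Vec3 x = mapToTriangle(tauK, pts[i][0], pts[i][1]);
      const Vec3 y = mapToTriangle(tauL, pts[j][0], pts[j][1]);
      const double dx = x.x - y.x, dy = x.y - y.y, dz = x.z - y.z;
      const double r = std::sqrt(dx * dx + dy * dy + dz * dz);
      sum += w * w / (4.0 * pi * r);
    }
  }
  return sum * area(tauK) * area(tauL);
}
```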

The FETI domain decomposition methodology combined with the BEM discretisation results in the so-called BETI (boundary element tearing and interconnecting) method.

MPRGP and SMALBE algorithms

The combination of the SemiMonotonic Augmented Lagrangian algorithm for Bound and Equality constraints (SMALBE) and the Modified Proportioning with Reduced Gradient Projection (MPRGP) algorithm (Dostál et al., 2003; Dostál and Schöberl, 2005; Dostál, 2009) was developed and tested for the solution of QP problems resulting from the discretisation of contact problems of mechanics, but it can be used for any other QP problems as well. These algorithms have a theoretically supported rate of convergence given by the spectral properties of the solved system. General linear inequality constraints must be converted to bound constraints by applying dualisation, which also typically improves conditioning and reduces the dimension. MPRGP is an active set based algorithm. The main idea of MPRGP is the splitting of the gradient, based on the active set, into the free and chopped gradients whose sum yields the projected gradient. The algorithm exploits a test to decide about leaving the face and three types of steps to generate a sequence of iterates that approximate the solution:

1 The expansion step, taken if the iterate is proportional, may expand the current active set using a fixed steplength related to the matrix norm and the reduced free gradient.
2 The proportioning step may remove indices from the active set using the chopped gradient.
3 The conjugate gradient step.
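To make the splitting and the proportioning test above concrete, the following is a small Eigen-based sketch for the bound-constrained problem min (1/2) x^T A x - b^T x subject to x >= l. It shows only the gradient splitting and the test that decides between the CG/expansion branch and the proportioning branch; the CG recurrences and the expansion steplength logic of the full MPRGP are omitted, and all names are illustrative.

```cpp
#include <algorithm>
#include <Eigen/Dense>

struct Split {
  Eigen::VectorXd free;     // free gradient: gradient entries on the free set
  Eigen::VectorXd chopped;  // chopped gradient: negative part on the active set
  Eigen::VectorXd reduced;  // reduced free gradient used by the expansion step
};

Split splitGradient(const Eigen::VectorXd& x, const Eigen::VectorXd& g,
                    const Eigen::VectorXd& lb, double alphaBar) {
  const int n = static_cast<int>(x.size());
  Split s{Eigen::VectorXd::Zero(n), Eigen::VectorXd::Zero(n), Eigen::VectorXd::Zero(n)};
  const double tol = 1e-12;
  for (int i = 0; i < n; ++i) {
    if (x[i] > lb[i] + tol) {                       // free index
      s.free[i] = g[i];
      s.reduced[i] = std::min(g[i], (x[i] - lb[i]) / alphaBar);
    } else {                                        // active index
      s.chopped[i] = std::min(g[i], 0.0);
    }
  }
  return s;
}

// Proportioning test: the chopped gradient must be dominated by the free one;
// otherwise MPRGP takes a proportioning step to release indices from the active set.
bool isProportional(const Split& s, double Gamma = 1.0) {
  return s.chopped.squaredNorm() <= Gamma * Gamma * s.reduced.dot(s.free);
}
```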


Figure 1: Cube prepared for TFETI and HFETI.

The algorithm has been proved to enjoy an R-linear rate of convergence in terms of the spectral condition number. SMALBE is an algorithm based on augmented Lagrangians. It takes care of the equality constraints, while in each of its iterations the inner problem, consisting in the bound-constrained minimisation of the augmented Lagrangian, is solved by any suitable solver such as the MPRGP described above.

The BEM4I library

Overview

The boundary element library BEM4I concentrates on the efficient assembly of the boundary element matrices for the 3D Laplace, Helmholtz, Lamé, and time-domain wave equations. It employs sparsification methods, namely the fast multipole method (FMM) (Greengard and Rokhlin, 1987; Of, 2007) and the adaptive cross approximation (ACA) (Bebendorf, 2008; Rjasanow and Steinbach, 2007), to reduce the computational effort to almost linear.

The core of the library consists of three main sets of classes:

1 BESpace: the classes inheriting from the BESpace class are responsible for the approximation of the continuous function spaces. The stored information includes the order of polynomial test and Ansatz functions or data necessary to approximate matrices using the ACA or FMM methods.

2 BEBilinearForm: the main purpose of this class and its descendants is to assemble the boundary element system matrices (in both full and sparsified formats). The element-wise assembly is performed using the BEIntegrator class. The assembly is parallelised using OpenMP and MPI at this level.

3 BEIntegrator: the classes responsible for the local system matrix assembly inherit from the BEIntegrator class. Several types of numerical quadratures are employed by these classes, including the classical Gaussian quadrature schemes over pairs of distant elements, the semi-analytical approach (Rjasanow and Steinbach, 2007; Zapletal and Bouchala, 2014), and fully numerical schemes (Sauter and Schwab, 2010) to treat the singularities in the integrals over pairs of close elements. The computation is vectorised to reduce the computational time using the SSE or AVX instruction sets (Fig. 3).

In addition to these classes the library also contains supportive classes representing full, sparse, and sparsified matrices, iterative and direct solvers, preconditioners, surface meshes, etc. The library structure together with the results of the scalability tests has been presented in Merta and Zapletal (2015, accepted for publication) and Čermák et al. (2015) (Fig. 2).


Figure 3: Concurrent summation of scalars using vector instructions.
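Figure 3 refers to the vectorised summation mentioned in the BEIntegrator description. A minimal AVX sketch of the idea, summing precomputed Laplace kernel values four lanes at a time, is given below; it illustrates the technique only and is not BEM4I code (compile with AVX enabled, e.g. -mavx).

```cpp
#include <immintrin.h>
#include <cstddef>

// Sum 1/(4*pi*r_i) over precomputed distances r[0..n), four lanes at a time (AVX).
double sumLaplaceKernels(const double* r, std::size_t n) {
  const double pi = 3.14159265358979323846;
  __m256d acc = _mm256_setzero_pd();
  const __m256d c = _mm256_set1_pd(1.0 / (4.0 * pi));
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    __m256d ri = _mm256_loadu_pd(r + i);            // load four distances
    acc = _mm256_add_pd(acc, _mm256_div_pd(c, ri)); // accumulate c / r_i per lane
  }
  double lanes[4];
  _mm256_storeu_pd(lanes, acc);
  double sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
  for (; i < n; ++i) sum += 1.0 / (4.0 * pi * r[i]); // scalar remainder loop
  return sum;
}
```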

Intel Xeon Phi utilisation

To reduce the computational time, the code of the library is accelerated by the Intel Xeon Phi coprocessors. The computationally most demanding parts of the code are offloaded to the coprocessor using the offload pragmas of the Intel compiler, and the computation is carried out using the 60 physical (240 logical) cores available on the coprocessor (see Fig. 4). The computation consists of several steps:

1 Pack the data (mainly nodes and elements of a surface mesh) and send it to the coprocessor.
2 Perform simultaneous computation on the coprocessor and the host.
3 Send the results from the coprocessor to the host processor.
4 Combine the data from the coprocessor and the processor and assemble the global system matrix.
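A minimal sketch of the offload pattern behind these four steps, using the Intel compiler's offload pragmas with a signal for asynchronous host/coprocessor execution; the function names and the row-wise work split are assumptions, not the actual BEM4I code.

```cpp
// Illustrative only; requires the Intel C++ compiler. 'mic:0' targets the first card.
__attribute__((target(mic)))
void assembleRows(double* block, int firstRow, int lastRow, int nCols);  // hypothetical worker

void assembleOnHostAndMic(double* micBlock, double* hostBlock,
                          int splitRow, int nRows, int nCols) {
  int sig = 0;
  // Steps 1+3: the out() clause allocates micBlock on the card and copies it back.
  #pragma offload target(mic:0) signal(&sig) out(micBlock : length(splitRow * nCols))
  {
    assembleRows(micBlock, 0, splitRow, nCols);      // step 2: coprocessor part
  }
  assembleRows(hostBlock, splitRow, nRows, nCols);   // step 2: host part, runs concurrently
  #pragma offload_wait target(mic:0) wait(&sig)      // block until the card is done
  // Step 4: the caller combines micBlock and hostBlock into the global matrix.
}
```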

The results of the numerical benchmarks focused on the assembly of the full single-layer operator matrix for the Laplace equation show a significant reduction in the computational time (see Fig. 5). The main bottleneck is currently the data transfer from the coprocessor to the host processor (Fig. 6).

ExaScale PaRallel FETI SOlver (ESPRESO)

Overview

The ESPRESO library is implemented in C++. A significant part of the development effort was devoted to a C++ wrapper for (1) selected sparse and dense BLAS routines and (2) the sparse direct solvers (MKL and original versions of the PARDISO direct solvers) of the Intel MKL library. The solver is developed to support current and future multi- and many-core architectures, for instance Intel Xeon Phi or Nvidia Tesla. Therefore, for the CPU and Xeon Phi versions we are using the Intel MKL library, while the CUDA libraries (cuBLAS, cuSPARSE, cuSolver) are used for the GPU version.

Figure 4: Offload of the computation to the Intel Xeon Phi coprocessor.

Figure 5: Comparison of the assembly of the single-layer operator matrix.

Communication layer optimisation

ESPRESO-H is mainly focused on the scalability of the communication layer for large computer systems with thousands and tens of thousands of compute nodes. All the processing is done by the CPUs. The solver uses hybrid parallelisation, which is well suited for multi-socket and multi-core compute nodes, as this is the architecture of most of today's supercomputers.

The first level of parallelisation is designed for the parallel processing of the clusters of subdomains. Individual clusters are processed per node. It is possible to process multiple clusters per node, but not the other way around. The distributed memory parallelisation is done using MPI. In particular, we are using the MPI 3.0 standard, which is implemented in most modern MPI distributions. MPI 3.0 is used because the communication hiding techniques implemented in the communication layer require the non-blocking collective operations.

Figure 7: The stencil communication for a simple decomposition into four subdomains. The Lagrange multipliers (LMs) that connect different neighbouring subdomains are depicted in different colours. In every iteration, when the LMs are updated, an exchange is performed between the neighbouring subdomains to finish the update. This affinity also controls the distribution of the data for the main distributed iterative solver, which iterates over local LMs only. In our implementation each MPI process modifies only those elements of the vectors used by the CG solver that match the LMs associated with the particular domain in the case of FETI, or with the set of domains in a cluster in the case of hybrid FETI.

The communication layer is identical for both the TFETI and HTFETI solvers in ESPRESO. It uses novel communication hiding techniques for the main iterative solver. In particular, we have implemented: (1) the Pipelined Conjugate Gradient (PipeCG) solver, which hides the communication of the global dot products behind the local matrix-vector multiplications; (2) distributed CP processing, which merges two global communication operations (Gather and Scatter) into one (AllGather) and parallelises the CP processing using the distributed inverse matrix of the CP; and (3) an optimised version of the global gluing matrix multiplication (matrix B for FETI and B1 for HFETI), written as a stencil communication which is fully scalable, see Fig. 7.
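A minimal illustration of the non-blocking collectives (MPI 3.0) that make this communication hiding possible: a global dot-product reduction is started, local work proceeds, and the result is harvested afterwards, in the spirit of the pipelined CG described above (schematic only, not the ESPRESO communication layer).

```cpp
#include <mpi.h>
#include <functional>
#include <numeric>
#include <vector>

// Overlap a global dot-product reduction with local work (e.g. the local SpMV).
double overlappedDot(const std::vector<double>& a, const std::vector<double>& b,
                     const std::function<void()>& localWork, MPI_Comm comm) {
  double localDot = std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
  double globalDot = 0.0;
  MPI_Request req;
  MPI_Iallreduce(&localDot, &globalDot, 1, MPI_DOUBLE, MPI_SUM, comm, &req); // start reduction
  localWork();                        // hide the collective behind local computation
  MPI_Wait(&req, MPI_STATUS_IGNORE);  // harvest the global value
  return globalDot;
}
```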

Inter-cluster processing

The second level of parallelisation is designed for the parallel processing of the subdomains in a cluster. Our implementation enables oversubscription of CPU cores so that each core can process multiple subdomains, and therefore the size of the cluster is not limited by the hardware configuration. This shared memory parallelisation is implemented using Intel Cilk+. We have chosen Cilk+ due to its advanced support for the C++ language. In particular, we are taking advantage of the functionality that allows us to create custom parallel reduction operations on top of C++ objects, which in our case are sparse matrices.
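As an illustration of such a custom reduction, the sketch below accumulates per-subdomain triplet contributions into one object inside a cilk_for loop, assuming the Intel Cilk Plus reducer interface (cilk::monoid_base and cilk::reducer from <cilk/reducer.h>); the TripletMatrix type is a toy stand-in, not the ESPRESO sparse matrix class.

```cpp
#include <vector>
#include <cilk/cilk.h>
#include <cilk/reducer.h>

// Toy stand-in for a sparse-matrix object (the real ESPRESO type is an assumption here).
struct TripletMatrix {
  std::vector<int> row, col;
  std::vector<double> val;
  void append(const TripletMatrix& other) {
    row.insert(row.end(), other.row.begin(), other.row.end());
    col.insert(col.end(), other.col.begin(), other.col.end());
    val.insert(val.end(), other.val.begin(), other.val.end());
  }
};

// Custom monoid: how two strand-local partial matrices are merged.
struct TripletMonoid : cilk::monoid_base<TripletMatrix> {
  static void reduce(TripletMatrix* left, TripletMatrix* right) {
    left->append(*right);
  }
};

TripletMatrix assemble(int nSubdomains) {
  cilk::reducer<TripletMonoid> acc;
  cilk_for (int s = 0; s < nSubdomains; ++s) {
    // Each iteration contributes its subdomain block; acc.view() is race-free.
    acc.view().row.push_back(s);
    acc.view().col.push_back(s);
    acc.view().val.push_back(1.0);
  }
  return acc.view();  // after the loop, the view holds the fully reduced matrix
}
```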

Numerical results

ESPRESO is designed to solve large problems using the world's largest supercomputers. In this paper we present results measured on Europe's largest machine, the CSCS Piz Daint in Lugano, Switzerland. Piz Daint is a Cray XC30 machine with 5272 compute nodes, each with one 8-core Sandy Bridge CPU (E5-2670), 32 GB of RAM, and one K20X GPU accelerator. All the following tests use the synthetic 3D cube linear elasticity benchmark. For this benchmarking we are developing a massively parallel in-memory problem generator, which eliminates I/O bottlenecks and allows us to evaluate the efficiency and scalability of the solver routines more precisely.

The first set of results is shown in Fig. 8. This figure presents the weak scalability of the HTFETI solver in ESPRESO. Due to the limited amount of memory per node, the solver is able to process 2.7 million unknowns per single node. The amount of work per node is then kept fixed while the number of nodes is increased from 1 to 2197, which gives a maximum problem size of 5.8 billion unknowns. This is so far the largest problem we were able to solve on the Piz Daint machine. The important message from this measurement is the flattening characteristic from 343 to 2197 nodes, which is the expected result of the good weak scalability of the solver.

The next test shows the strong scaling of the HTFETI method in ESPRESO. In Fig. 9 we can see the strong scalability of the single-iteration time. This experiment decouples the numerical scalability of the HTFETI method from the scalability of the implementation itself. We can see that ESPRESO achieves super-linear scalability per iteration when solving a 2.6 billion unknown problem, starting from 1000 nodes and scaling to 4913 nodes. The per-iteration time is shown in the figure next to each point (the second line), while the number of nodes is given in the first line. The blue line shows linear scaling based on the processing time on 1000 nodes.

The last test using the synthetic benchmark shows the strong scalability of the entire iterative solver in ESPRESO. This involves the per-iteration time as well as the number of iterations (the numerical scalability). We can see that even in this test the solver achieved linear scaling. Please note that for both strong scalability tests we keep the cluster configuration identical; in other words, the number of domains per node remains the same and we are reducing the domain size while increasing the number of nodes/clusters (Fig. 10).

ESPRESO-GPU and ESPRESO-MIC

In parallel with ESPRESO-H we are developing two more flavours of ESPRESO which are designed to take advantage of modern many-core accelerators. ESPRESO-GPU uses CUDA and its libraries to run on Nvidia Tesla GPUs. ESPRESO-MIC is developed under the Intel Parallel Computing Center (IPCC) at IT4Innovations, and its main focus is to fully utilise the potential of the Xeon Phi accelerators based on the Knights Corner architecture. This is an essential research direction for IT4Innovations, as it operates Europe's largest Xeon Phi accelerated system, called Salomon.


Figure 8: The weak scaling evaluation of the ESPRESO solver on Europe's largest supercomputer, the CSCS Piz Daint. The solver is able to process 2.7 million unknowns per node. The scalability is evaluated from 1 to 2197 nodes. The flattening shape of the total execution time shows the potential of ESPRESO to scale even further.

Figure 9: Strong scalability of the single-iteration time of the ESPRESO solver. In this test ESPRESO solves a 2.6 billion unknown problem, scaling from 1000 to 4913 nodes.

PERMON

Overview

Since 2011 we have been developing a novel software package called the PERMON (Parallel, Efficient, Robust, Modular, Object-oriented, Numerical) toolbox, based on PETSc and using TFETI for the solution of QP. It makes use of theoretical results in discretisation techniques, QP algorithms, and DDM. It incorporates our own codes and makes use of renowned open source libraries. The solver layer, discussed here, consists of three modules: PermonFLLOP, PermonQP, and PermonIneq. Other modules are problem-specific, such as PermonPlasticity for plasticity, PermonImage for image recognition, PermonMultiBody for particle dynamics, and others.

Figure 10: Strong scalability of the iterative solver of ESPRESO. In this test ESPRESO solves a 2.6 billion unknown problem, scaling from 1000 to 4913 nodes.


Figure 11: Doubly linked list of QPs.

PermonQP

PermonQP is a package providing a base for the solution of QP problems. Its main idea is the separation of the concepts of QP problems, transforms, and solvers, which are abstracted by the three basic classes QP, QPT, and QPS, respectively. A QP transform derives a new QP from the given QP, so that a doubly linked list (QP chain) is generated where every node is a QP (Fig. 11). The programming interface (API) of PermonQP is carefully designed to be easy to use and, at the same time, efficient and suitable for HPC. From the user's point of view, the solution process is divided into the following sequence of actions:

1 QP problem specification;
2 QP transforms, which reformulate the original problem and create a chain of QP problems where the last one is passed to the solver;
3 automatic or manual choice of an appropriate QP solver;
4 QP solution.

PermonQP as a stand-alone package allows solving unconstrained QP problems (i.e. linear systems with a positive semidefinite matrix) or equality constrained ones. In both cases it makes use of the PETSc KSP package, which includes both direct and iterative solvers, including interfaces to many external solvers. Examples of equality constraints are, for instance, multipoint constraints, or the alternative enforcement of Dirichlet boundary conditions using a separate constraint matrix. This module is being prepared for publishing under the BSD 2-Clause license.
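The QP chain of Fig. 11 can be pictured as a doubly linked list in which each transform allocates a child problem and records how to map its solution back to the parent. The sketch below is a toy model of that design, not the PermonQP data structures; all names are illustrative.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Toy model of the QP chain: every transform derives a child QP and stores
// the postprocessing needed to map the child's solution back to its parent.
struct QPNode {
  std::string description;                      // e.g. "original", "dualised"
  std::vector<double> solution;
  QPNode* parent = nullptr;
  std::unique_ptr<QPNode> child;
  std::function<void(QPNode&)> mapSolutionUp;   // child -> parent reconstruction
};

QPNode& applyTransform(QPNode& qp, const std::string& name,
                       std::function<void(QPNode&)> mapUp) {
  qp.child = std::make_unique<QPNode>();
  qp.child->description = name;
  qp.child->parent = &qp;
  qp.child->mapSolutionUp = std::move(mapUp);
  return *qp.child;                             // the last node is handed to the solver
}

void propagateSolution(QPNode& last) {
  // After the solver finishes on the last QP, walk the chain back to the original.
  for (QPNode* node = &last; node->parent != nullptr; node = node->parent)
    node->mapSolutionUp(*node);
}
```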

PermonIneq

PermonQP capabilities can be further extended with the PermonIneq package, which adds several concrete solvers for inequality constrained QPs, e.g. the already mentioned MPRGP and SMALBE algorithms.

PermonFLLOP

PermonFLLOP is a wrapper of PermonQP implementing FETI. It assembles the FETI-specific constraint matrix B and the null space matrix R. They are passed internally to PermonQP together with the subdomain-wise stiffness matrices and load vectors, which can be assembled with an arbitrary FEM library such as PermonCube or libMesh. The FETI method itself consists here just in calling the proper sequence of QP transformations: primal scaling, dualisation, dual scaling, homogenisation of the equality constraints, and preconditioning by the orthogonal projector onto the kernel of the dual equality constraint matrix.

Numerical experiments

As a benchmark, an elastic cube was subjected to volume forces pressing it against an obstacle. There were two reasons for this decision. The elastic cube is a numerical model which can be fully controlled, and the obtained results are not affected by the complexity of the geometry. Another reason is that it is very difficult or even impossible to create very large meshes on complex geometries using existing meshing tools. With our mesh generator PermonCube we were able to prepare large scale problems decomposed into thousands of subdomains.

Table 1: Results for the cube contact linear elasticity problem.

X | NS (# decomp.) | DOF | Solution time [s] | Outer iters | Inner iters
4 | 64 | 3,000,000 | 2.68E+01 | 3 | 94
6 | 216 | 10,125,000 | 5.38E+01 | 3 | 147
8 | 512 | 24,000,000 | 1.21E+02 | 4 | 250

Figure 12: Parallel weak scalability of the TFETI method implementation in the PermonFLLOP code for the linear elasticity cube benchmark on the HECToR supercomputer.

Figure 13: Numerical scalability of the TFETI method within the PermonFLLOP code for the linear elasticity cube benchmark. Note that from a certain point we get an almost constant number of iterations, allowing good parallel scalability.

The R and G matrices were orthonormalised using the iterative classical Gram-Schmidt process; the K matrix was factorised using the Cholesky factorisation from the MUMPS library. Currently, each computational core owns one and only one subdomain. The norm of the projected gradient compared with the 10^{-5} multiple of the projected dual right-hand side was used as the stopping criterion. The results are summarised in Table 1.
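Written out, with P the projector and d the dual right-hand side from the FETI section and $g^{k}$ the gradient of the dual cost function at iteration $k$, our reading of this criterion is

$$\|P g^{k}\| \leq 10^{-5}\, \|P d\|.$$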

The weak scalability for 13,824, 8000, and 4096 elements per subdomain and the numerical scalability for these configurations (corresponding to the fixed ratios H/h = 24, 20, 16) are illustrated in Figs. 12 and 13. To investigate the strong scalability we selected a discretisation with 32,768,000 elements (approx. 100,000,000 unknowns). The strong scalability was demonstrated up to 8000 cores (41.5 s using 2197 cores; 19.8 s using 4096 cores; 15.7 s using 8000 cores).

Conclusion

Efficient variants of the BEM discretisation method, scalable QP algorithms, and FETI-type domain decomposition methods (BETI, TFETI and HTFETI) were implemented in our in-house software packages. These solvers were optimised employing available state-of-the-art external libraries, communication hiding and avoiding techniques, hybrid MPI-OpenMP programming, GPU and MIC accelerators, etc. Scalability was proven for both huge model problems and complicated engineering problems up to tens of thousands of cores.

The presented BEM4I library for the boundary element discretisation of engineering problems has been tested up to more than a thousand cores. Currently, its acceleration using the Intel Xeon Phi coprocessors is under development. The initial results suggest a significant reduction in computational time in the case of full system matrices for the Laplace equation; therefore, the acceleration of the assembly of matrices sparsified by ACA, as well as of the assembly of the system matrices for the Lamé equation, is being considered.

The presented ESPRESO library brings highly optimised TFETI and HTFETI implementations. ESPRESO-H is oriented towards large computer systems with thousands and tens of thousands of compute nodes. ESPRESO-GPU and ESPRESO-MIC are developed to exploit the power of GPU and MIC accelerators.

We have also presented our PERMON toolbox, mainly its solver packages based on PETSc. They uniquely combine the FETI DDM with QP algorithms. PermonFLLOP is used to generate the FETI-specific objects for a contact problem of elasticity, while the FEM objects are provided by any FEM code for each subdomain independently. PermonFLLOP wraps PermonQP and PermonIneq, which solve the resulting QP problem. Results for the contact problem of an elastic cube, generated by the PermonCube package, were shown.

Conflict of interest

The authors declare that there is no conflict of interest.

Acknowledgements

This work was supported by the European Regional Development Fund in the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070); by the project of major infrastructures for research, development and innovation of the Ministry of Education, Youth and Sports with reg. num. LM2011033; by the EXA2CT project funded from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 610741; by the internal student grant competition project SP2015/186 "PERMON toolbox development"; by the project POSTDOCI II reg. no. CZ.1.07/2.3.00/30.0055 within the Operational Programme Education for Competitiveness; and by the Grant Agency of the Czech Republic (GACR) project no. 15-18274S. We thank CSCS (www.cscs.ch) for the support in using the Piz Daint supercomputer.

References

Bebendorf, M., 2008. Hierarchical Matrices: A Means to Efficiently Solve Elliptic Boundary Value Problems. Lecture Notes in Computational Science and Engineering. Springer.

Brzobohatý, T., Dostál, Z., Kozubek, T., Kovář, P., Markopoulos, A., 2011. Cholesky decomposition with fixing nodes to stable computation of a generalized inverse of the stiffness matrix of a floating structure. Int. J. Numer. Methods Eng. 88 (5), 493-509.

Dostál, Z., Friedlander, A., Santos, S.A., 2003. Augmented Lagrangians with adaptive precision control for quadratic programming with simple bounds and equality constraints. SIAM J. Optim. 13 (January (4)), 1120-1140.

Dostál, Z., Friedlander, A., Santos, S.A., 1998. Solution of contact problems of elasticity by FETI domain decomposition. Contemp. Math. 218, 82-93.

Dostál, Z., Horák, D., Kučera, R., 2006. Total FETI - an easier implementable variant of the FETI method for numerical solution of elliptic PDE. Commun. Numer. Methods Eng. 22 (12), 1155-1162.

Dostál, Z., Kozubek, T., Markopoulos, A., Menšík, M., 2011. Cholesky decomposition of a positive semidefinite matrix with known kernel. Appl. Math. Comput. 217 (13), 6067-6077.

Dostál, Z., Kozubek, T., Vondrák, V., Brzobohatý, T., Markopoulos, A., 2010. Scalable TFETI algorithm for the solution of multibody contact problems of elasticity. Int. J. Numer. Methods Eng. 82 (11), 1384-1405.

Dostál, Z., Neto, F.A.G., Santos, S.A., 2000. Solution of contact problems by FETI domain decomposition with natural coarse space projections. Comput. Methods Appl. Mech. Eng. 190 (13-14), 1611-1627.

Dostál, Z., 2009. Optimal Quadratic Programming Algorithms, with Applications to Variational Inequalities. SOIA, Springer, New York, US.

Dostál, Z., Horák, D., 2004. Scalable FETI with optimal dual penalty for a variational inequality. Numer. Linear Algebra Appl. 11, 455-472.

Dostál, Z., Horák, D., 2007. Theoretically supported scalable FETI for numerical solution of variational inequalities. SIAM J. Numer. Anal. 45 (2), 500-513.

Dostál, Z., Horák, D., Kučera, R., Vondrák, V., Haslinger, J., Dobiáš, J., Pták, S., 2005. FETI based algorithms for contact problems: scalability, large displacements and 3D Coulomb friction. Comput. Methods Appl. Mech. Eng. 194 (2-5), 395-409.

Dostál, Z., Kozubek, T., Brzobohatý, T., Markopoulos, A., Vlach, O., 2012. Scalable TFETI with optional preconditioning by conjugate projector for transient contact problems of elasticity. Comput. Methods Appl. Mech. Eng. 247-248, 37-50.

Dostál, Z., Schöberl, J., 2005. Minimizing quadratic functions subject to bound constraints. Comput. Optim. Appl. 30 (January (1)), 23-43.

Farhat, C., Mandel, J., Roux, F.X., 1994. Optimal convergence properties of the FETI domain decomposition method. Comput. Methods Appl. Mech. Eng. 115, 365-385.

Farhat, C., Roux, F.X., 1991. A method of finite element tearing and interconnecting and its parallel solution algorithm. Int. J. Numer. Methods Eng. 32 (6), 1205-1227.

Farhat, C., Roux, F.X., 1992. An unconventional domain decomposition method for an efficient parallel solution of large-scale finite element systems. SIAM J. Sci. Stat. Comput. (1).

Gosselet, P., Rey, C., 2006. Non-overlapping domain decomposition methods in structural mechanics. Arch. Comput. Methods Eng. 13 (4), 515-572.

Greengard, L., Rokhlin, V., 1987. A fast algorithm for particle simulations. J. Comput. Phys. 73 (2), 325-348.

Hapla, V., Horák, D., 2012. TFETI coarse space projectors parallelization strategies. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (Eds.), Parallel Processing and Applied Mathematics, Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, pp. 152-162.

Klawonn, A., Rheinbach, O., 2010. Highly scalable parallel domain decomposition methods with an application to biomechanics. ZAMM Z. Angew. Math. Mech. 90 (1), 5-32.

Klawonn, A., Widlund, O.B., 2006. Dual-primal FETI methods for linear elasticity. Commun. Pure Appl. Math. 59 (11), 1523-1572.

Klawonn, A., Rheinbach, O., 2006. A parallel implementation of dual-primal FETI methods for three-dimensional linear elasticity using a transformation of basis. SIAM J. Sci. Comput. 28 (January (5)), 1886-1906, http://dx.doi.org/10.1137/050624364.

Kozubek, T., Horák, D., Hapla, V., 2012. FETI coarse problem parallelization strategies and their comparison. Tech. Rep. http://www.prace-project.eu/IMG/pdf/feticoarseproblemparallelization.pdf.

Kozubek, T., Vondrák, V., Menšík, M., Horák, D., Dostál, Z., Hapla, V., Kabelíková, P., Čermák, M., 2013. Total FETI domain decomposition method and its massively parallel implementation. Adv. Eng. Softw. 60-61, 14-22.

Kruis, J., 2006. Domain Decomposition Methods for Distributed Computing. Saxe-Coburg Publications.

Kruis, J., Matouš, K., Dostál, Z., 2002. Solving laminated plates by domain decomposition. Adv. Eng. Softw. 33, 445-452.

Langer, U., Steinbach, O., 2003. Boundary element tearing and interconnecting methods. Computing 71 (3), 205-228.

Lee, J., 2009. A hybrid domain decomposition method and its applications to contact problems in mechanical engineering. New York University (Ph.D. thesis).

Li, J., Widlund, O.B., 2006. FETI-DP, BDDC, and block Cholesky methods. Int. J. Numer. Methods Eng. 66, 250-271.

Merta, M., Zapletal, J., 2015. A parallel library for boundary element discretization of engineering problems. Math. Comput. Simul. (accepted for publication).

Merta, M., Zapletal, J., 2015. Acceleration of boundary element method by explicit vectorization. Adv. Eng. Softw. 86, 70-79.

Of, G., Steinbach, O., 2009. The all-floating boundary element tearing and interconnecting method. J. Numer. Math. 17 (4), 277-298.

Of, G., 2007. Fast multipole methods and applications. In: Schanz, M., Steinbach, O. (Eds.), Boundary Element Analysis, Lecture Notes in Applied and Computational Mechanics. Springer, Berlin, Heidelberg, pp. 135-160.

Amestoy, P., and others, 2015. MUMPS: a Multifrontal Massively Parallel sparse direct Solver. http://mumps.enseeiht.fr/.

Rjasanow, S., Steinbach, O., 2007. The Fast Solution of Boundary Integral Equations. Mathematical and Analytical Techniques with Applications to Engineering. Springer.

Sauter, S., Schwab, C., 2010. Boundary Element Methods. Springer Series in Computational Mathematics. Springer.

Čermák, M., Hapla, V., Horák, D., Merta, M., Markopoulos, A., 2015. Total-FETI domain decomposition method for solution of elasto-plastic problems. Adv. Eng. Softw. 84, 48-54.

Čermák, M., Merta, M., Zapletal, J., 2015. A novel boundary element library with applications. In: Simos, T., Tsitouras, C. (Eds.), Proceedings of ICNAAM 2014. AIP Conference Proceedings, vol. 1648.

Zapletal, J., Bouchala, J., 2014. Effective semi-analytic integration for hypersingular Galerkin boundary integral equations for the Helmholtz equation in 3D. Appl. Math. 59 (5), 527-542.
