
High Performance Computing for Mechanical Simulations using ANSYS

Jeff Beisheim, ANSYS, Inc.

HPC Defined

High Performance Computing (HPC) at ANSYS: an ongoing effort designed to remove computing limitations from engineers who use computer-aided engineering in all phases of design, analysis, and testing.

It is a hardware and software initiative!

Need for Speed

- Assemblies, CAD to mesh, capture fidelity
- Impact product design
- Enable large models
- Allow parametric studies
- Modal, nonlinear, multiphysics, dynamics

A History of HPC Performance

- 1980s: Vector processing on mainframes
- 1990: Shared memory multiprocessing (SMP) available
- 1994: Iterative PCG solver introduced for large analyses
- 1999-2000: 64-bit large memory addressing
- 2004: 1st company to solve 100M structural DOF
- 2005-2007: Distributed PCG solver; Distributed ANSYS (DMP) released; distributed sparse solver; Variational Technology; support for clusters using Windows HPC
- 2007-2009: Optimized for multicore processors; teraflop performance at 512 cores
- 2010: GPU acceleration (single GPU; SMP)
- 2012: GPU acceleration (multiple GPUs; DMP)

HPC Revolution

Recent advancements have revolutionized the computational speed available on the desktop:
- Multicore processors (every core is really an independent processor)
- Large amounts of RAM and SSDs
- GPUs

Parallel Processing – Hardware

2 types of memory systems:
- Shared memory parallel (SMP): single box, workstation/server
- Distributed memory parallel (DMP): multiple boxes, cluster

[Figure: a single workstation versus a multi-node cluster]

Parallel Processing – Software

2 types of parallel processing for Mechanical APDL, plus GPU acceleration (launch-line examples follow this list):
- Shared memory parallel (-np > 1)
  - First available in v4.3
  - Can only be used on a single machine
- Distributed memory parallel (-dis -np > 1)
  - First available in v6.0 with the DDS solver
  - Can be used on a single machine or a cluster
- GPU acceleration (-acc)
  - First available in v13.0 using NVIDIA GPUs
  - Supports using either a single GPU or multiple GPUs
  - Can be used on a single machine or a cluster
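For reference, a sketch of how these modes map onto the MAPDL launch line. The executable name (ansys145 for R14.5) and the input/output file names are stand-ins for your own; consult the ANSYS documentation for the exact flag spellings on your release:

    # Shared memory parallel (SMP) on one machine, 8 cores
    ansys145 -b -np 8 -i input.dat -o output.out

    # Distributed memory parallel (DMP) on one machine, 8 MPI processes
    ansys145 -b -dis -np 8 -i input.dat -o output.out

    # DMP across two cluster nodes (node1/node2 are hypothetical hostnames)
    ansys145 -b -dis -machines node1:4:node2:4 -i input.dat -o output.out

    # Either mode with NVIDIA GPU acceleration enabled
    ansys145 -b -dis -np 8 -acc nvidia -i input.dat -o output.out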

Distributed ANSYS – Design Requirements

No limitation in simulation capability
- Must support all features
- Continually working to add more functionality with each release

Reproducible and consistent results
- Same answers achieved using 1 core or 100 cores
- Same quality checks and testing are done as with the SMP version
- Uses the same code base as the SMP version of ANSYS

Support all major platforms
- Most widely used processors, operating systems, and interconnects
- Supports the same platforms that the SMP version supports
- Uses the latest versions of MPI software, which support the latest interconnects

Distributed ANSYS – Design

Distributed steps (-dis -np N):
- At the start of the first load step, the FEA model is decomposed into N pieces (domains)
- Each domain goes to a different core to be solved
- The solution is not independent!
  - Lots of communication is required to achieve the solution
  - Lots of synchronization is required to keep all processes together
- Each process writes its own set of files (file0*, file1*, file2*, ..., file[N-1]*); see the sketch after this list
- Results are automatically combined at the end of the solution
  - Facilitates postprocessing in /POST1, /POST26, or Workbench
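As a hypothetical illustration of the file naming (the exact file types written depend on the analysis), a 4-way run launched with ansys145 -b -dis -np 4 might leave a working directory like:

    file0.*   file1.*   file2.*   file3.*    <- one set of scratch/result files per process
    file.rst                                 <- combined results file, ready for /POST1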

Distributed ANSYS – Capabilities

A wide variety of features & analysis capabilities are supported:
- Static linear or nonlinear analyses
- Buckling analyses
- Modal analyses
- Harmonic response analyses using the FULL method
- Transient response analyses using the FULL method
- Single-field structural and thermal analyses
- Low-frequency electromagnetic analyses
- High-frequency electromagnetic analyses
- Coupled-field analyses
- All widely used element types and materials
- Superelements (use pass)
- NLGEOM, SOLC, LNSRCH, AUTOTS, IC, INISTATE, linear perturbation
- Multiframe restarts
- Cyclic symmetry analyses
- User Programmable Features (UPFs)

Distributed ANSYS – Equation Solvers

(A short APDL selection sketch follows this list.)

Sparse direct solver (default)
- Supports SMP, DMP, and GPU acceleration
- Can handle all analysis types and options
- Foundation for the Block Lanczos, Unsymmetric, Damped, and QR damped eigensolvers

PCG iterative solver
- Supports SMP, DMP, and GPU acceleration
- Symmetric, real-valued matrices only (i.e., static/full transient)
- Foundation for the PCG Lanczos eigensolver

JCG/ICCG iterative solvers
- Supports SMP only
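A minimal APDL sketch of selecting among these solvers with the EQSLV command (the PCG tolerance argument shown is optional and 1E-8 is its usual default):

    /SOLU
    EQSLV,SPARSE        ! sparse direct solver (the default for most analyses)
    ! EQSLV,PCG,1E-8    ! PCG iterative solver with an explicit tolerance
    ! EQSLV,JCG         ! JCG iterative solver (SMP only)
    SOLVE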

Distributed ANSYS – Eigensolvers

(A short APDL selection sketch follows this list.)

Block Lanczos eigensolver (including QR damp)
- Supports SMP and GPU acceleration

PCG Lanczos eigensolver
- Supports SMP, DMP, and GPU acceleration
- Great for large models (>5M DOF) with relatively few modes (<50)

Supernode eigensolver
- Supports SMP only
- Optimal choice when requesting hundreds or thousands of modes

Subspace eigensolver
- Supports SMP, DMP, and GPU acceleration
- Currently only supports buckling analyses; beta for modal in R14.5

Unsymmetric/Damped eigensolvers
- Support SMP, DMP, and GPU acceleration
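A sketch of picking an eigensolver for a modal run via MODOPT (the mode counts here are arbitrary placeholders):

    /SOLU
    ANTYPE,MODAL
    MODOPT,LANB,40       ! Block Lanczos, 40 modes
    ! MODOPT,LANPCG,20   ! PCG Lanczos: large models, relatively few modes
    ! MODOPT,SNODE,500   ! Supernode: hundreds or thousands of modes
    MXPAND,40            ! expand the modes for postprocessing
    SOLVE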

Distributed ANSYS – Benefits

Better architecture
- More computations performed in parallel, so faster solution time

Better speedups than SMP
- Can achieve >10x on 16 cores (try getting that with SMP!)
- Can be used for jobs running on 1000+ CPU cores

Can take advantage of resources on multiple machines
- Memory usage and bandwidth scale
- Disk (I/O) usage scales
- A whole new class of problems can be solved!

Distributed ANSYS – Performance

Need fast interconnects to feed fast processors
- Two main characteristics for each interconnect: latency and bandwidth
- Distributed ANSYS is highly bandwidth bound

    +---------- D I S T R I B U T E D   A N S Y S   S T A T I S T I C S -----------+

    Release: 14.5      Build: UP20120802     Date Run: 08/09/2012   Time: 23:07
    Platform: LINUX x64
    Processor Model: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
    MPI Type: INTELMPI

    Total number of cores available    : 32
    Number of physical cores available : 32
    Number of cores requested          :  4 (Distributed Memory Parallel)

    Core  Machine Name   Working Directory
    ----------------------------------------------------
      0   hpclnxsmc00    /data1/ansyswork
      1   hpclnxsmc00    /data1/ansyswork
      2   hpclnxsmc01    /data1/ansyswork
      3   hpclnxsmc01    /data1/ansyswork

    Latency time from master to core 1 = 1.171 microseconds   (same machine)
    Latency time from master to core 2 = 2.251 microseconds   (QDR InfiniBand)
    Latency time from master to core 3 = 2.225 microseconds   (QDR InfiniBand)

    Communication speed from master to core 1 = 7934.49 MB/sec
    Communication speed from master to core 2 = 3011.09 MB/sec
    Communication speed from master to core 3 = 3235.00 MB/sec

Distributed ANSYS – Performance

Need fast interconnects to feed fast processors

[Figure: Interconnect Performance. Rating (runs/day, up to ~50) at 8, 16, 32, 64, and 128 cores, comparing Gigabit Ethernet with DDR InfiniBand.]
- Turbine model, 2.1 million DOF
- SOLID187 elements
- Nonlinear static analysis
- Sparse solver (DMP)
- Linux cluster (8 cores per node)

Distributed ANSYS – Performance

Need fast hard drives to feed fast processors

Check the bandwidth specs
- ANSYS Mechanical can be highly I/O bandwidth bound
- The sparse solver in the out-of-core memory mode does lots of I/O (an in-core sketch follows this list)

Distributed ANSYS can be highly I/O latency bound
- Seek time to read/write each set of files causes overhead

Consider SSDs
- High bandwidth and extremely low seek times

Consider RAID configurations
- RAID 0 for speed
- RAID 1, 5 for redundancy
- RAID 10 for speed and redundancy
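When enough RAM is available, forcing the sparse solver to run in-core sidesteps most of that I/O. A hedged APDL sketch; check the BCSOPTION/DSPOPTION documentation for your release before relying on these options:

    BCSOPTION,,INCORE    ! request the in-core memory mode for the SMP sparse solver
    DSPOPTION,,INCORE    ! same request for the distributed sparse solver
    ! leaving these at their defaults lets ANSYS pick a memory mode from available RAM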

Distributed ANSYS – Performance

Need fast hard drives to feed fast processors

[Figure: Hard Drive Performance. Rating (runs/day, up to ~25-30) at 1, 2, 4, and 8 cores, comparing a single HDD with a single SSD.]
- 8 million DOF
- Linear static analysis
- Sparse solver (DMP)
- Dell T5500 workstation (12 Intel Xeon X5675 cores, 48 GB RAM, single 7200 rpm HDD, single SSD, Win7)

Beisheim, J.R., "Boosting Memory Capacity with SSDs," ANSYS Advantage Magazine, Volume IV, Issue 1, pp. 37, 2010.

Distributed ANSYS – Performance

Avoid waiting for I/O to complete!

Check to see if the job is I/O bound or compute bound by comparing the CPU and Elapsed times in the output file (a one-liner for extracting them follows below).

When Elapsed time >> main thread CPU time, the job is I/O bound:

    Total CPU time for main thread             :      167.8 seconds
    Elapsed Time (sec) =        388.000        Date = 08/21/2012

- Consider adding more RAM or a faster hard drive configuration

When Elapsed time ~ main thread CPU time, the job is compute bound:
- Consider moving the simulation to a machine with faster processors
- Consider using Distributed ANSYS (DMP) instead of SMP
- Consider running on more cores or possibly using GPU(s)
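One quick way to pull those two timing lines out of a solve output file (a shell one-liner sketch; file.out stands in for your own output file name):

    grep -E "Total CPU time for main thread|Elapsed Time" file.out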

Distributed ANSYS – Performance

Timings for a thermomechanical simulation workflow across releases (results courtesy of MicroConsult Engineering GmbH):

- ANSYS 11.0: Thermal (full model, 3M DOF): 4 hours, 8 cores. Thermomechanical simulation (full model, 7.8M DOF): ~5.5 days, 163 iterations, 8 cores. Interpolation of boundary conditions: 37 hours, 16 load steps. Submodel creep strain analysis (5.5M DOF): ~5.5 days, 492 iterations, 18 cores. Total time: 2 weeks.
- ANSYS 12.0: Thermal: 4 hours, 8 cores. Thermomechanical: 34.3 hours, 164 iterations, 20 cores. Interpolation: 37 hours, 16 load steps. Creep submodel: 38.5 hours, 492 iterations, 16 cores. Total time: 5 days.
- ANSYS 12.1: Thermal: 4 hours, 8 cores. Thermomechanical: 12.5 hours, 195 iterations, 64 cores. Interpolation: 37 hours, 16 load steps. Creep submodel: 8.5 hours, 492 iterations, 76 cores. Total time: 2 days.
- ANSYS 13.0 SP2: Thermal: 4 hours, 8 cores. Thermomechanical: 9.9 hours, 195 iterations, 64 cores. Interpolation: 0.2 hour (improved algorithm). Creep submodel: 6.1 hours, 488 iterations, 128 cores. Total time: 1 day.
- ANSYS 14.0: Thermal: 1 hour, 8 cores + 1 GPU. Thermomechanical: 7.5 hours, 195 iterations, 128 cores. Interpolation: 0.2 hour, 16 load steps. Creep submodel: 5.9 hours, 498 iterations, 64 cores + 8 GPUs. Total time: 0.5 day. (Additional 14.0 data points: creep submodel 4.2 hours, 498 iterations, 256 cores; 0.8 hour, 32.)

Notes:
- All runs with the sparse solver
- Hardware, 12.0: dual X5460 (3.16 GHz Harpertown Intel Xeon), 64 GB RAM per node
- Hardware, 12.1 + 13.0: dual X5570 (2.93 GHz Nehalem Intel Xeon), 72 GB RAM per node
- ANSYS 12.0 to 14.0 runs with DDR InfiniBand interconnect
- ANSYS 14.0 creep runs with NROPT,CRPL + DDOPT,METIS

Distributed ANSYS – Performance

Minimum time to solution is more important than scaling

[Figure: Solution Scalability. Speedup (up to ~25) vs. number of cores (0 to 64).]
- Turbine model, 2.1 million DOF
- Nonlinear static analysis
- 1 load step, 7 substeps, 25 equilibrium iterations
- Linux cluster (8 cores per node)

Distributed ANSYS – Performance

Minimum time to solution is more important than scaling

[Figure: Solution Scalability. Solution elapsed time (seconds, up to 45,000) vs. number of cores (0 to 64), with annotated points at 11 hrs 48 mins, 1 hr 20 mins, and 30 mins.]
- Turbine model, 2.1 million DOF
- Nonlinear static analysis
- 1 load step, 7 substeps, 25 equilibrium iterations
- Linux cluster (8 cores per node)

GPU Accelerator Capability

Graphics processing units (GPUs)
- Widely used for gaming and graphics rendering
- Recently been made available as general-purpose accelerators
  - Support for double precision computations
  - Performance exceeding the latest multicore CPUs

So how can ANSYS make use of this new technology to reduce the overall time to solution?

GPU Accelerator Capability

Accelerate the sparse direct solver (SMP & DMP)
- The GPU is used to factor many dense frontal matrices
- The decision on when to send data to the GPU is made automatically
  - Frontal matrix too small: too much overhead, stays on the CPU
  - Frontal matrix too large: exceeds GPU memory, only partially accelerated

Accelerate the PCG/JCG iterative solvers (SMP & DMP)
- The GPU is only used for the sparse matrix-vector multiply (SpMV kernel)
- The decision on when to send data to the GPU is made automatically
  - Model too small: too much overhead, stays on the CPU
  - Model too large: exceeds GPU memory, only partially accelerated
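No solver-level changes are needed for any of this; acceleration is requested at launch. A sketch, where the -na flag for selecting the number of GPUs is assumed from the R14.5 documentation:

    # DMP run on 16 cores with 4 NVIDIA GPUs
    ansys145 -b -dis -np 16 -acc nvidia -na 4 -i input.dat -o output.out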

GPU Accelerator Capability

Supported hardware
- Currently support NVIDIA Tesla 20-series, Quadro 6000, and Quadro K5000 cards
- Next-generation NVIDIA Tesla cards (Kepler) should work with R14.5
- Installing a GPU requires the following:
  - Larger power supply (a single card needs ~250 W)
  - Open 2x form factor PCIe x16 2.0 (or 3.0) slot

Supported platforms
- Windows and Linux 64-bit platforms only
- Does not include the Linux Itanium (IA-64) platform

GPU Accelerator Capability

Targeted hardware:

    Card                  Power (W)  Memory      Memory Bandwidth (GB/s)  Peak Speed SP/DP (GFlops)
    NVIDIA Tesla C2075    225        6 GB        144                      1030/515
    NVIDIA Tesla M2090    250        6 GB        177.4                    1331/665
    NVIDIA Quadro 6000    225        6 GB        144                      1030/515
    NVIDIA Quadro K5000   122        4 GB        173                      2290/95
    NVIDIA Tesla K10      250        8 GB        320                      4577/190
    NVIDIA Tesla K20      250        6 to 24 GB  288                      5184/1728

The NVIDIA Kepler-based products are not released yet, so specifications may be incorrect.

GPU Accelerator Capability

GPUs can offer significantly faster time to solution

[Figure: GPU Performance. Relative speedup versus a 2-core run with no GPU: 2.6x at 8 cores (no GPU) and 3.8x at 8 cores (1 GPU).]
- 6.5 million DOF
- Linear static analysis
- Sparse solver (DMP)
- 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7

GPU Accelerator Capability

GPUs can offer significantly faster time to solution

[Figure: GPU Performance. Relative speedup versus a 2-core run with no GPU: 2.7x at 8 cores (1 GPU) and 5.2x at 16 cores (4 GPUs).]
- 11.8 million DOF
- Linear static analysis
- PCG solver (DMP)
- 2 Intel Xeon E5-2670 (2.6 GHz, 16 cores total), 128 GB RAM, SSD, 4 Tesla C2075, Win7

GPU Accelerator Capability

Supports the majority of ANSYS users
- Covers both the sparse direct and PCG iterative solvers
- Only a few minor limitations

Ease of use
- Requires at least one supported GPU card to be installed
- Requires at least one HPC Pack license
- No rebuild, no additional installation steps

Performance
- ~10-25% reduction in time to solution when using 8 CPU cores
- Should never slow down your simulation!

Design Optimization

How will you use all of this computing power?
- Higher fidelity
- Full assemblies
- More nonlinear
- Design optimization studies

HPC Licensing

ANSYS HPC Packs enable high-fidelity insight
- Each simulation consumes one or more packs
- Parallel capability increases quickly with added packs

[Figure: Parallel enabled (cores) vs. packs per simulation. Capacity grows geometrically with pack count, reaching 32, 128, 512, and 2048 cores at the higher pack counts, with GPUs (+1, +4, +16, +64, +256) counting toward the enabled totals.]

- Single solution for all physics and any level of fidelity
- Flexibility as your HPC resources grow
  - Reallocate packs as resources allow

HPC Parametric Pack Licensing

Scalable, like ANSYS HPC Packs
- Enhances the customer's ability to include many design points as part of a single study
- Ensures sound product decision making

Amplifies the complete workflow
- Design points can include execution of multiple products (pre, solve, HPC, post)
- Packaged to encourage adoption of the path to robust design!

[Figure: Number of simultaneous design points enabled (4, 8, 16, 32, 64) vs. number of HPC Parametric Pack licenses.]

HPC Revolution

HDDs vs. SSDs. SMP vs. DMP. Clusters. GPUs. Interconnects.

The right combination of algorithms and hardware leads to maximum efficiency.

HPC Revolution

Every computer today is a parallel computer.
Every simulation in ANSYS can benefit from parallel processing.
