Software Development Kit for Multicore Acceleration
Version 3.0
Programmer's Guide
Note: Before using this information and the product it supports, read the general information in Appendix D, “Notices,” on page 101.
Edition Notice
This edition applies to version 3, release 0 of the IBM Software Development Kit for Multicore Acceleration (Product number 5724-S84) and to all subsequent releases and modifications until otherwise indicated in new editions.
This edition replaces SC33-8325-01.
Contents
Preface
  About this book
  What's new in this book
  Supported platforms
  Supported languages
  Beta-level (unsupported) environments
  Getting support
  Related documentation
Chapter 1. SDK 3.0 overview
  GNU tool chain
  IBM XL C/C++ compiler
  IBM Full-System Simulator
  System root image for the simulator
  Linux kernel
  Cell BE libraries
  SPE Runtime Management Library Version 2.2
  SIMD math libraries
  Mathematical Acceleration Subsystem (MASS) libraries
  ALF library
  DaCS library
  Prototype libraries
  Fast Fourier Transform (FFT) library
  Monte Carlo libraries
  Code examples and example libraries
  Performance support libraries and utilities
  SPU timing tool
  OProfile
  SPU profiling restrictions
  SPU report anomalies
  Cell-perf-counter tool
  IBM Eclipse IDE for the SDK
  Hybrid-x86 programming model overview
Chapter 2. Programming with the SDK
  System root directories
  Running the simulator
  The callthru utility
  Read and write access to the simulator sysroot image
  Enabling Symmetric Multiprocessing support
  Enabling xclients from the simulator
  Specifying the processor architecture
  PPE address space support on SPE
  SDK programming examples and demos
  Overview of the build environment
  Changing the default compiler
  Building and running a specific program
  Compiling and linking with the GNU tool chain
  Support for huge TLB file systems
  SDK development best practices
  Using a shared development environment
  Performance considerations
  NUMA
  Preemptive context switching
Chapter 3. Debugging Cell BE applications
  Overview
  GDB for SDK 3.0
  Compiling with GCC or XLC
  Using the debugger
  Debugging PPE code
  Debugging SPE code
  Source level debugging
  Assembler level debugging
  How spu-gdb manages SPE registers
  SPU stack analysis
  SPE stack debugging
  Overview
  Stack overflow checking
  Stack management strategies
  Debugging in the Cell BE environment
  Debugging multithreaded code
  Debugging architecture
  Switching architectures within a single thread
  Viewing symbolic and additional information
  Using scheduler-locking
  Using the combined debugger
  Setting pending breakpoints
  Using the set spu stop-on-load command
  Disambiguation of multiply-defined global symbols
  New command reference
  info spu event
  info spu signal
  info spu mailbox
  info spu dma
  info spu proxydma
  Setting up remote debugging
  Remote debugging overview
  Using remote debugging
  Starting remote debugging
Chapter 4. Cell BE Performance Debugging Tool
  Introduction
  Components High Level Description
  Tracing Facility
  Trace Processing
  Visualization
  Setting up the PDT tracing facility
  Configuring the PDT kernel module
  PDT example usage
  Enabling the PDT tracing facility for a new application
  Compilation and application building
  SPE compilation
  Running a trace-enabled program using the PDT libraries
  Running a program using SPE profiling
  Configuring the PDT for an application run
  Using the Tracing API
  Essential definitions
  Application programmer API
  User-defined events
  Dynamic trace control
  Library developer API
  Trace facility control
  Events recording
  Restrictions
  Installing and using the PDT trace facility on Hybrid-x86
  PDT on Hybrid-x86 example usage
  Using the PDTR tool (pdtr command)
Chapter 5. Analyzing Cell BE SPUs with kdump and crash
  Installation requirements
  Production system
  Analysis system
Chapter 6. Feedback Directed Program Restructuring (FDPR-Pro)
  Introduction
  Input files
  Instrumentation and profiling
  Optimizations
  Instrumentation and optimization options
  Profiling SPE executable files
  Processing PPE/SPE executable files
  Integrated mode
  Standalone mode
  Human-readable output
  Running fdprpro from the IDE
  Cross-development with FDPR-Pro
Chapter 7. SPU code overlays
  What are overlays
  How overlays work
  Restrictions on the use of overlays
  Planning to use overlays
  Overview
  Sizing
  Scaling considerations
  Overlay tree structure example
  Length of an overlay program
  Segment origin
  Overlay processing
  Call stubs
  Segment and region tables
  Overlay graph structure example
  Specification of an SPU overlay program
  Coding for overlays
  Migration/Co-Existence/Binary-Compatibility Considerations
  Compiler options (XLC only)
  SDK overlay examples
  Simple overlay example
  Overview overlay example
  Large matrix overlay example
  Using the GNU SPU linker for overlays
Appendix A. Changes to SDK for this release
  Changes to the directory structure
  Selecting the compiler
  Synching code into the simulator sysroot image
Appendix B. PDT troubleshooting
Appendix C. Related documentation
Appendix D. Notices
  Trademarks
Glossary
Preface
The IBM Software Development Kit for Multicore Acceleration Version 3.0 (SDK 3.0) is a complete package of tools to enable you to program applications for the Cell Broadband Engine™ (Cell BE) processor. The SDK 3.0 is composed of development tool chains, software libraries and sample source files, a system simulator, and a Linux® kernel, all of which fully support the capabilities of the Cell BE.
About this book
This book describes how to use the SDK 3.0 to write applications. How to install SDK 3.0 is described in a separate manual, Software Development Kit for Multicore Acceleration Version 3.0 Installation Guide, and there is also a programming tutorial to help get you started.
Each section of this book covers a different topic:
• Chapter 1, “SDK 3.0 overview,” on page 1 describes the components of the SDK 3.0
• Chapter 2, “Programming with the SDK,” on page 17 explains how to program applications for the Cell BE platform
• Chapter 3, “Debugging Cell BE applications,” on page 29 describes how to debug your applications
• Chapter 4, “Cell BE Performance Debugging Tool,” on page 51 describes how to use the performance debugging tool
• Chapter 5, “Analyzing Cell BE SPUs with kdump and crash,” on page 65 describes a means of debugging kernel data related to SPUs through specific crash commands, by using a dumped kernel image
• Chapter 6, “Feedback Directed Program Restructuring (FDPR-Pro),” on page 69 describes how to use the FDPR-Pro tool to optimize your applications
• Chapter 7, “SPU code overlays,” on page 75 describes how to use overlays
What's new in this book
This book includes information about the new functionality delivered with the SDK 3.0, and completely replaces the previous version of this book. This new information includes:
• PPE address space support on SPE
• SPU stack analysis
• How to optimize code using FDPR-Pro, see Chapter 6, “Feedback Directed Program Restructuring (FDPR-Pro),” on page 69
• How to use the performance debugging tool, see Chapter 4, “Cell BE Performance Debugging Tool”
• Enhancements to GDB, see “Switching architectures within a single thread” on page 39 and “Disambiguation of multiply-defined global symbols” on page 44
• How to debug kernel data related to SPUs, see Chapter 5, “Analyzing Cell BE SPUs with kdump and crash,” on page 65
For information about differences between SDK 3.0 and previous versions, see Appendix A, “Changes to SDK for this release,” on page 93.
Supported platforms
Cell BE applications can be developed on the following platforms:
• x86
• x86-64
• 64-bit PowerPC® (PPC64)
• BladeCenter® QS20
• BladeCenter QS21
Supported languages
The supported languages are:
• C/C++
• Assembler
• Fortran
• ADA (Power Processing Element (PPE) only)
Note: Although C++ and Fortran are supported, take care when you write code for the Synergistic Processing Units (SPUs) because many of the C++ and Fortran libraries are too large for the 256 KB local storage memory available.
Beta-level (unsupported) environments
This publication contains documentation that may be applied to certain environments on an "as-is" basis. Those environments are not supported by IBM, but wherever possible, workarounds to problems are provided in the respective forums.
Getting support
The SDK 3.0 is available through Passport Advantage® with full support at:
http://www.ibm.com/software/passportadvantage
You can locate documentation and other resources on the World Wide Web. Refer to the following Web sites:
• IBM BladeCenter systems, optional devices, services, and support information at
  http://www.ibm.com/bladecenter/
  For service information, select Support.
• developerWorks® Cell BE Resource Center at:
  http://www.ibm.com/developerworks/power/cell/
  To access the Cell BE forum on developerWorks, select Community.
• The Barcelona Supercomputing Center (BSC) Web site at
  http://www.bsc.es/projects/deepcomputing/linuxoncell
• There is also support for the Full-System Simulator and XL C/C++ Compiler through their individual alphaWorks® forums. If in doubt, start with the Cell BE architecture forum.
• The GNU Project debugger, GDB, is supported through many different forums on the Web, but primarily at the GDB Web site
  http://www.gnu.org/software/gdb/gdb.html
This version (SDK 3.0) of the SDK supersedes all previous versions of the SDK.
Related documentation
For a list of documentation referenced in this Programmer's Guide, see Appendix C, “Related documentation.”
Chapter 1. SDK 3.0 overview
This section describes the contents of the SDK 3.0, where it is installed on the system, and how the various components work together. It covers the following topics:
• “GNU tool chain”
• “IBM XL C/C++ compiler” on page 2
• “IBM Full-System Simulator” on page 3
• “System root image for the simulator” on page 4
• “Linux kernel” on page 5
• “Cell BE libraries” on page 5
• “Prototype libraries” on page 8
• “Performance support libraries and utilities” on page 11
• “IBM Eclipse IDE for the SDK” on page 14
• “Hybrid-x86 programming model overview” on page 14
GNU tool chain
The GNU tool chain contains the GCC C-language compiler (GCC compiler) for the PPU and the SPU. For the PPU it is a replacement for the native GCC compiler on PowerPC (PPC) platforms and it is a cross-compiler on x86. The GCC compiler for the PPU is the default and the Makefiles are configured to use it when building the libraries and samples.
The GCC compiler also contains a separate SPE cross-compiler that supports the standards defined in the following documents:
• C/C++ Language Extensions for Cell Broadband Engine Architecture V2.5. The GCC compiler shipped in SDK 3.0 supports all language extensions described in the specification except for the following:
  – The GCC compilers currently do not support alignment of stack variables greater than 16 bytes as described in section 1.3.1.
  – The GCC compilers currently do not support the optional alternate vector literal format specified in section 1.4.6.
  – The GCC compilers currently support mapping between SPU and VMX intrinsics as defined in section 5 only in C++ code.
  – The recommended vector printf format controls as specified in section 8.1.1 are not supported because of library restrictions.
  – The GCC compiler does not support the optional Altivec style of vector literal construction using parentheses ("(" and ")"). The standard C method of array initialization using curly braces ("{" and "}") should be used.
  – The C99 complex math library as specified in section 8.1.1 is not supported because of library restrictions.
• SPU Application Binary Interface (ABI) Specification V1.8
• SPU Instruction Set Architecture V1.2
The associated assembler and linker additionally support the SPU Assembly Language Specification V1.6. The assembler and linker are common to both the GCC compiler and the IBM XL C/C++ compiler.
GDB support is provided for both PPU and SPU debugging, and the debugger client can be in the same process or a remote process. GDB also supports combined (PPU and SPU) debugging.
On a non-PPC system, the install directory for the GNU tool chain is /opt/cell/toolchain. There is a single bin subdirectory, which contains both PPU and SPU tools.
On a PPC64 or BladeCenter QS21, both tool chains are installed into /usr. See “System root directories” on page 17 for further information.
IBM XL C/C++ compiler
IBM XL C/C++ for Multicore Acceleration for Linux is an advanced, high-performance cross-compiler that is tuned for the CBEA. The XL C/C++ compiler, which is hosted on an x86, IBM PowerPC technology-based system, or a BladeCenter QS21, generates code for the PPU or SPU. The compiler requires the GCC tool chain for the CBEA, which provides tools for cross-assembling and cross-linking applications for both the PPE and SPE.
IBM XL C/C++ supports the revised 2003 International C++ Standard ISO/IEC 14882:2003(E), Programming Languages -- C++ and the ISO/IEC 9899:1999, Programming Languages -- C standard, also known as C99. The compiler also supports:
• The C89 Standard and K&R style of programming
• Language extensions for vector programming
• Language extensions for SPU programming
• Numerous GCC C and C++ extensions to help users port their applications from GCC.
The XL C/C++ compiler available for the SDK 3.0 supports the language extensions as specified in the IBM XL C/C++ Advanced Edition for Multicore Acceleration for Linux V9.0 Language Reference.
The XL compiler also contains a separate SPE cross-compiler that supports the standards defined in the following documents:
• C/C++ Language Extensions for Cell Broadband Engine Architecture V2.5. The XL compiler shipped in SDK 3.0 supports all language extensions described in the specification except for the following:
  – The XL compilers currently do not support alignment of stack variables greater than 16 bytes as described in section 1.3.1
  – The XL compilers currently do not support Operator Overloading for Vector Data Types as described in section 10
  – The XL compilers currently do not support VMX functions vec_extract, vec_insert, vec_promote, and vec_splats as described in section 7
  – The XL compilers currently do not provide the PPE address space support on SPE as described in “GNU tool chain” on page 1
  – The XL compilers currently do not support the __builtin_expect_call builtin function call
  – The XL compilers currently support mapping between SPU and VMX intrinsics as defined in section 5 only in C++ code
  – The recommended vector printf format controls as specified in section 8.1.1 are not supported
  – The C99 complex math library as specified in section 8.1.1 is not supported because of library restrictions
• SPU Application Binary Interface (ABI) Specification Version 1.8
• SPU Instruction Set Architecture Version 1.2
For information about the XL C/C++ compiler invocation commands and a complete list of options, refer to the IBM XL C/C++ Advanced Edition for Multicore Acceleration for Linux V9.0 Compiler Reference.
Program optimization is described in IBM XL C/C++ Advanced Edition for Multicore Acceleration for Linux V9.0 Programming Guide.
The XL C/C++ for Multicore Acceleration for Linux compiler is installed into the /opt/ibmcmp/xlc/cbe/<compiler version number> directory. Documentation is located on the following Web site:
http://publib.boulder.ibm.com/infocenter/cellcomp/v9v111/index.jsp
IBM Full-System Simulator
The IBM Full-System Simulator (referred to as the simulator in this document) is a software application that emulates the behavior of a full system that contains a Cell BE processor. You can start a Linux operating system on the simulator and run applications on the simulated operating system. The simulator also supports the loading and running of statically-linked executable programs and standalone tests without an underlying operating system.
The simulator infrastructure is designed for modeling processor and system-level architecture at levels of abstraction, which vary from functional to performance simulation models with a number of hybrid fidelity points in between:
• Functional-only simulation: Models the program-visible effects of instructions without modeling the time it takes to run these instructions. Functional-only simulation assumes that each instruction can be run in a constant number of cycles. Memory accesses are synchronous and are also performed in a constant number of cycles.
  This simulation model is useful for software development and debugging when a precise measure of execution time is not significant. Functional simulation proceeds much more rapidly than performance simulation, and so is also useful for fast-forwarding to a specific point of interest.
• Performance simulation: For system and application performance analysis, the simulator provides performance simulation (also referred to as timing simulation). A performance simulation model represents internal policies and mechanisms for system components, such as arbiters, queues, and pipelines. Operation latencies are modeled dynamically to account for both processing time and resource constraints. Performance simulation models have been correlated against hardware or other references to acceptable levels of tolerance.
  The simulator for the Cell BE processor provides a cycle-accurate SPU core model that can be used for performance analysis of computationally-intense applications. The simulator for SDK 3.0 provides additional support for performance simulation. This is described in the IBM Full-System Simulator Users Guide.
The simulator can also be configured to fast-forward the simulation, using a functional model, to a specific point of interest in the application and to switch to a timing-accurate mode to conduct performance studies. This means that various types of operational details can be gathered to help you understand real-world hardware and software systems.
See the /opt/ibm/systemsim-cell/doc subdirectory for complete documentation including the simulator user's guide. The prerelease name of the simulator is “Mambo” and this name may appear in some of the documentation.
The simulator for the Cell BE processor is also available as an independent technology at
http://www.alphaworks.ibm.com/tech/cellsystemsim
System root image for the simulator
The system root image for the simulator is a file that contains a disk image of Fedora 7 files, libraries, and binaries that can be used within the system simulator. This disk image file is preloaded with a full range of Fedora 7 utilities and also includes all of the Cell BE Linux support libraries described in “Performance support libraries and utilities” on page 11.
This RPM file is the largest of the RPM files and when it is installed, it takes up to 1.6 GB on the host server's hard disk. See also “System root directories” on page 17.
The system root image for the simulator must be located either in the current directory when you start the simulator or the default /opt/ibm/systemsim-cell/images/cell directory. The cellsdk script automatically puts the system root image into the default directory.
You can mount the system root image to see what it contains. Assuming a mount point of /mnt/cell-sdk-sysroot, which is the mount point used by the cellsdk_sync_simulator script, the command to mount the system root image is:
mount -o loop /opt/ibm/systemsim-cell/images/cell/sysroot_disk /mnt/cell-sdk-sysroot/
The command to unmount the image is:
umount /mnt/cell-sdk-sysroot/
Do not attempt to mount the image on the host system while the simulator is running. You should always unmount the system root image before you start the simulator. You should not mount the system root image to the same point as the root on the host server because the system can become corrupted and fail to boot.
You can change files on the system root image disk in the following ways:
• Mount it as described above. Then change directory (cd) to the mount point directory or below and use host system tools, such as vi or cp to modify the file. Do not attempt to use the RPM utility on an x86 platform to install packages to the sysroot disk, because the RPM database formats are not compatible between the x86 and PPC platforms.
• Use the /opt/cell/cellsdk_sync_simulator command to synchronize the system root image with the /opt/cell/sysroot directory for libraries and samples (see “System root directories” on page 17) that have been cross-compiled and linked on a host system and need to be copied to the target system.
• Use the callthru mechanism (see “The callthru utility” on page 20) to source or sink the host system file when the simulator is running. This is the only method that can be used while the simulator is running.
Linux kernel
For the BladeCenter QS21, the kernel is installed into the /boot directory, yaboot.conf is modified and a reboot is required to activate this kernel. The cellsdk install task is documented in the SDK 3.0 Installation Guide.
Note: The cellsdk uninstall command does not automatically uninstall the kernel. This avoids leaving the system in an unusable state.
Cell BE libraries
The following libraries are described in this section:
• “SPE Runtime Management Library Version 2.2” on page 5
• “SIMD math libraries” on page 5
• “Mathematical Acceleration Subsystem (MASS) libraries” on page 6
• “ALF library” on page 6
• “DaCS library” on page 7
SPE Runtime Management Library Version 2.2
The SPE Runtime Management Library (libspe) constitutes the standardized low-level application programming interface (API) for application access to the Cell BE SPEs. This library provides an API to manage SPEs that is neutral with respect to the underlying operating system and its methods. Implementations of this library can provide additional functionality that allows for access to operating system or implementation-dependent aspects of SPE runtime management. These capabilities are not subject to standardization and their use may lead to non-portable code and dependencies on certain implemented versions of the library.
The elfspe is a PPE program that allows an SPE program to run directly from a Linux command prompt without needing a PPE application to create an SPE thread and wait for it to complete.
For the BladeCenter QS21, the SDK installs the libspe headers, libraries, and binaries into the /usr directory and the standalone SPE executive, elfspe, is registered with the kernel during boot by commands added to /etc/rc.d/init.d using the binfmt_misc facility.
For the simulator, the libspe and elfspe binaries and libraries are preinstalled in the same directories in the system root image and no further action is required at install time.
SPE Runtime Management Library version 2.2 is an upgrade to version 2.1. For more information, see the SPE Runtime Management Library Reference.
SIMD math libraries
The traditional math functions are scalar instructions, and do not take advantage of the powerful Single Instruction, Multiple Data (SIMD) vector instructions available in both the PPU and SPU in the Cell BE Architecture. SIMD instructions perform computations on short vectors of data in parallel, instead of on individual scalar data elements. They often provide significant increases in program speed because more computation can be done with fewer instructions.
The SIMD math library provides short vector versions of the math functions. The MASS library provides long vector versions. These vector versions conform as closely as possible to the specifications set out by the scalar standards.
The SIMD math library is provided by the SDK as both a linkable library archive and as a set of inline function headers. The names of the SIMD math functions are formed from the names of the scalar counterparts by appending a vector type suffix to the standard scalar function name. For example, the SIMD version of the absolute value function abs(), which acts on a vector of long integers, is called absi4(). Inline versions of functions are prefixed with the character "_" (underscore), so the inline version of absi4() is called _absi4().
For more information about the SIMD math library, refer to SIMD Math Library Specification for Cell Broadband Engine Architecture Version 1.1.
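The naming convention described above can be illustrated with a small portable sketch. It does not use the SDK headers: the struct type and both function names are hypothetical stand-ins, and the real library presumably operates on native compiler vector types rather than a struct of four integers.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a 128-bit vector holding four 32-bit integers.
 * This is an assumption for illustration, not an SDK type. */
typedef struct {
    int lane[4];
} int4_sketch;

/* "Inline header" flavour. The SDK marks this flavour with a leading
 * underscore (_absi4); a suffix is used here instead because names with a
 * leading underscore at file scope are reserved in portable C. */
static inline int4_sketch absi4_inline_sketch(int4_sketch v)
{
    int4_sketch r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = abs(v.lane[i]);  /* scalar abs() applied to each lane */
    }
    return r;
}

/* "Linkable archive" flavour: an ordinary function wrapping the same body. */
int4_sketch absi4_sketch(int4_sketch v)
{
    return absi4_inline_sketch(v);
}
```

Both flavours compute the same result; the inline form trades code size for the removal of call overhead, which is the usual reason a library offers both.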
Mathematical Acceleration Subsystem (MASS) libraries
The Mathematical Acceleration Subsystem (MASS) consists of libraries of mathematical intrinsic functions, which are tuned specifically for optimum performance on the Cell BE processor. Currently the 32-bit, 64-bit PPU, and SPU libraries are supported.
These libraries:
• Include both scalar and vector functions
• Are thread-safe
• Support both 32- and 64-bit compilations
• Offer improved performance over the corresponding standard system library routines
• Are intended for use in applications where slight differences in accuracy or handling of exceptional values can be tolerated
You can find information about using these libraries on the MASS Web site:
http://www.ibm.com/software/awdtools/mass
ALF library
The Accelerated Library Framework (ALF) provides a programming environment for data and task parallel applications and libraries. The ALF API provides library developers with a set of interfaces to simplify library development on heterogeneous multi-core systems. Library developers can use the provided framework to offload computationally intensive work to the accelerators. More complex applications can be developed by combining several function offload libraries. Application programmers can also choose to implement their applications directly to the ALF interface.
ALF supports the multiple-program-multiple-data (MPMD) programming model where multiple programs can be scheduled to run on multiple accelerator elements at the same time.
The ALF functionality includes:
• Data transfer management
• Parallel task management
• Double buffering
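Double buffering, listed above, can be sketched in portable C. This is only an illustration of the buffer-swapping pattern and not ALF code: memcpy stands in for an asynchronous DMA transfer, and the function name and CHUNK size are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* elements per buffer; arbitrary for this sketch */

/* Sums n ints from src while alternating between two local buffers.
 * memcpy stands in for an asynchronous DMA transfer into local storage. */
long sum_double_buffered(const int *src, size_t n)
{
    int buf[2][CHUNK];
    size_t len[2] = {0, 0};
    size_t next_off;
    long total = 0;
    int cur = 0;

    if (n == 0) {
        return 0;
    }

    /* Prime the first buffer before entering the loop. */
    len[cur] = n < CHUNK ? n : CHUNK;
    memcpy(buf[cur], src, len[cur] * sizeof(int));
    next_off = len[cur];

    while (len[cur] > 0) {
        int nxt = 1 - cur;
        size_t remaining = n - next_off;

        /* Fetch the next chunk into the idle buffer. */
        len[nxt] = remaining < CHUNK ? remaining : CHUNK;
        memcpy(buf[nxt], src + next_off, len[nxt] * sizeof(int));
        next_off += len[nxt];

        /* Compute on the buffer that is already filled. */
        for (size_t i = 0; i < len[cur]; i++) {
            total += buf[cur][i];
        }

        cur = nxt;  /* swap the roles of the two buffers */
    }
    return total;
}
```

In this sequential sketch the fetch and the computation do not actually overlap. On hardware with asynchronous transfers, the fetch into the idle buffer would proceed while the loop body computes, which is where the benefit comes from.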
With the provided platform-independent API, you can also create descriptions for multiple compute tasks and define their execution order by defining task dependencies. Task parallelism is accomplished by having tasks without direct or indirect dependencies between them. The ALF runtime provides an optimal parallel scheduling scheme for the tasks based on given dependencies.
From the application or library programmer's point of view, ALF consists of the following two runtime components:
• A host runtime library
• An accelerator runtime library
The host runtime library provides the host APIs to the application. The accelerator runtime library provides the APIs to the application's accelerator code, usually the computational kernel and helper routines. This division of labor enables programmers to specialize in different parts of a given parallel workload.
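The relationship between task dependencies and parallelism described above can be sketched as follows. This is not the ALF scheduler: the function name, the dependency-matrix representation, and the notion of a "wave" are assumptions chosen for illustration. Each task is assigned the length of its longest dependency chain, so two tasks in the same wave have no direct or indirect dependency between them and could run in parallel.

```c
#define MAX_TASKS 8  /* capacity of the sketch; arbitrary */

/* dep[i][j] != 0 means task i depends on task j.
 * The dependency graph is assumed to be acyclic.
 * On return, wave[i] is the earliest wave in which task i can run. */
void assign_waves(int n, int dep[MAX_TASKS][MAX_TASKS], int wave[MAX_TASKS])
{
    for (int i = 0; i < n; i++) {
        wave[i] = 0;
    }
    /* A chain in an acyclic graph of n tasks has at most n - 1 links,
     * so n relaxation passes are enough to settle every wave number. */
    for (int pass = 0; pass < n; pass++) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (dep[i][j] && wave[i] < wave[j] + 1) {
                    wave[i] = wave[j] + 1;
                }
            }
        }
    }
}
```

For example, if task 2 depends on tasks 0 and 1, and task 3 depends on task 2, then tasks 0 and 1 share wave 0, task 2 lands in wave 1, and task 3 in wave 2.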
The ALF design enables a separation of work. There are three distinct types of task within a given application:
Application
You develop programs only at the host level. You can use the provided accelerated libraries without direct knowledge of the inner workings of the underlying system.
Accelerated library
You use the ALF APIs to provide the library interfaces to invoke the computational kernels on the accelerators. You divide the problem into the control process, which runs on the host, and the computational kernel, which runs on the accelerators. You then partition the input and output into work blocks, which ALF can schedule to run on different accelerators.
Computational kernel
You write optimized accelerator code at the accelerator level. The ALF API provides a common interface for the compute task to be invoked automatically by the framework.
The runtime framework handles the underlying task management, data movement, and error handling, which means that the focus is on the kernel and the data partitioning, not the direct memory access (DMA) list creation or the lock management on the work queue.
The ALF APIs are platform-independent and their design is based on the fact that many applications targeted for Cell BE or multi-core computing follow the general usage pattern of dividing a set of data into self-contained blocks, creating a list of data blocks to be computed on the SPE, and then managing the distribution of that data to the various SPE processes. This type of control and compute process usage scenario, along with the corresponding work queue definition, are the fundamental abstractions in ALF.
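The usage pattern described above, dividing a data set into self-contained blocks and building a list of those blocks, can be sketched in portable C. The descriptor type and function name are hypothetical and do not come from the ALF API; the sketch only shows the partitioning arithmetic.

```c
#include <stddef.h>

/* Hypothetical descriptor for one self-contained block of work. */
typedef struct {
    size_t offset;  /* index of the first element in this block */
    size_t count;   /* number of elements in this block */
} work_block_sketch;

/* Splits `total` elements into blocks of at most `block_elems` elements.
 * Writes up to `max_blocks` descriptors to `out` and returns how many
 * were written. The last block may be shorter than the others. */
size_t partition_into_blocks(size_t total, size_t block_elems,
                             work_block_sketch *out, size_t max_blocks)
{
    size_t n = 0;

    if (block_elems == 0) {
        return 0;  /* nothing sensible to do with an empty block size */
    }
    for (size_t off = 0; off < total && n < max_blocks; off += block_elems) {
        size_t remaining = total - off;
        out[n].offset = off;
        out[n].count = remaining < block_elems ? remaining : block_elems;
        n++;
    }
    return n;
}
```

A host-side control process could then hand each descriptor to whichever accelerator is free, which is the role the text assigns to the work queue.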
DaCS library
The Data Communication and Synchronization (DaCS) library provides a set of services for handling process-to-process communication in a heterogeneous multi-core system. In addition to the basic message passing service these include:
• Mailbox services
• Resource reservation
• Process and data synchronization
• Remote memory services
• Error handling
The DaCS services are implemented as a set of APIs providing an architecturally neutral layer for application developers. They structure the processing elements, referred to as DaCS Elements (DE), into a hierarchical topology. This includes general purpose elements, referred to as Host Elements (HE), and special processing elements, referred to as Accelerator Elements (AE). Host elements usually run a full operating system and submit work to the specialized processes which run in the Accelerator Elements.
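The hierarchical topology of host and accelerator elements can be pictured with a small data-structure sketch. None of these names come from the DaCS API; the types and the function are assumptions that only model the parent and child relationship described above.

```c
#include <stddef.h>

/* Hypothetical model of a DaCS Element (DE): either a Host Element (HE)
 * or an Accelerator Element (AE). */
typedef enum { DE_HOST, DE_ACCELERATOR } de_kind_sketch;

typedef struct {
    de_kind_sketch kind;
    int parent;  /* index of the parent element, or -1 for the root */
} de_node_sketch;

/* Counts the accelerator elements whose direct parent is `host`. */
size_t count_accelerators(const de_node_sketch *nodes, size_t n, int host)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (nodes[i].kind == DE_ACCELERATOR && nodes[i].parent == host) {
            count++;
        }
    }
    return count;
}
```

In this model a host element at index 0 with two accelerator children would report a count of 2, while a nested host element is not counted as an accelerator.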
Prototype libraries
This section provides an overview of the following prototype libraries, which are shipped with SDK 3.0:
• “Fast Fourier Transform (FFT) library”
• “Monte Carlo libraries”
Fast Fourier Transform (FFT) library
This prototype library handles a wide range of FFTs, and consists of the following:
• API for the following routines used in single precision:
  – FFT Real -> Complex 1D
  – FFT Complex -> Complex 1D
  – FFT Complex -> Real 1D
  – FFT Complex -> Complex 2D for frequencies (from 1000x1000 to 2500x2500)
  The implementation manages sizes up to 10000 and handles multiples of 2, 3, and 5 as well as powers of those factors, plus one arbitrary factor. User code running on the PPU makes use of the CBE FFT library by calling one of either 1D or 2D streaming functions.
• Power-of-two-only 2D FFT code for complex-to-complex single and double precision processing.
Both parts of the library run using a common interface that contains an initialization and termination step, and an execution step which can process “one-at-a-time” requests (streaming) or entire arrays of requests (batch).
Enter the following to view additional documentation for the prototype FFT library:
man /opt/cell/sdk/prototype/usr/include/libfft.3
Monte Carlo libraries
The Monte Carlo libraries are a Cell BE implementation of Random Number Generator (RNG) algorithms and transforms. The objective of this library is to provide functions needed to perform Monte Carlo simulations.
The following RNG algorithms are implemented:
• Hardware-based
• Kirkpatrick-Stoll
• Mersenne Twister
The following transforms are provided:
• Box-Muller
• Moro's Inversion
• Polar Method
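Of the generator algorithms listed above, Kirkpatrick-Stoll is compact enough to sketch. The following is a minimal portable version of the classic R250 shift-register recurrence, x[n] = x[n-103] XOR x[n-250]. It is not the SDK implementation: the type and function names are hypothetical, and the seeding by a simple linear congruential step is an assumption that is adequate for illustration only, not for real simulations.

```c
#include <stdint.h>

#define R250_LEN 250  /* length of the shift register */
#define R250_TAP 103  /* distance of the second feedback tap */

typedef struct {
    uint32_t buf[R250_LEN];  /* the last 250 outputs, used as a ring */
    int pos;                 /* slot holding the oldest value, x[n-250] */
} r250_sketch;

/* Fills the register from a seed using a linear congruential step.
 * Illustrative seeding only; a careful implementation would also ensure
 * the initial words are linearly independent. */
void r250_seed(r250_sketch *g, uint32_t seed)
{
    for (int i = 0; i < R250_LEN; i++) {
        seed = seed * 1664525u + 1013904223u;
        g->buf[i] = seed;
    }
    g->pos = 0;
}

/* Returns the next value: x[n] = x[n-103] XOR x[n-250]. */
uint32_t r250_next(r250_sketch *g)
{
    /* buf[pos] holds x[n-250]; x[n-103] sits 147 slots further on. */
    int tap = (g->pos + (R250_LEN - R250_TAP)) % R250_LEN;
    uint32_t x = g->buf[g->pos] ^ g->buf[tap];

    g->buf[g->pos] = x;                 /* newest value replaces oldest */
    g->pos = (g->pos + 1) % R250_LEN;
    return x;
}
```

The uniform integers produced this way would then be fed to a transform such as those listed above to obtain normally distributed samples.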
Code examples and example libraries
The example libraries package provides a set of optimized library routines that greatly reduce the development cost and enhance the performance of Cell BE programs.
To demonstrate the versatility of the Cell BE architecture, a variety of application-oriented libraries are included, such as:
• Fast Fourier Transform (FFT)
• Image processing
• Software managed cache
• Game math
• Matrix operation
• Multi-precision math
• Synchronization
• Vector
Additional examples and demos show how you can exploit the on-chip computational capacity.
Both the binary and the source code are shipped in separate RPMs. The RPM names are:
• cell-libs
• cell-examples
• cell-demos
• cell-tutorial
For each of these, there is one RPM that has the binaries (already built versions), which are installed into /opt/cell/sdk/usr, and one RPM that has the source in a tar file. For example, cell-demos-source-3.0-1.rpm has demos_source.tar and this tar file contains all of the source code.
The default installation process installs the binaries and installs the source tar files. You need to decide into which directory you want to untar those files, either into /opt/cell/sdk/src, or into a 'sandbox' directory.
The libraries and examples RPMs have been partitioned into the following subdirectories.
Table 1. Subdirectories for the libraries and examples RPM
Subdirectory
    Description
/opt/cell/sdk/buildutils
    Contains a README and the make include files (make.env, make.header, make.footer) that define the SDK build environment.
/opt/cell/sdk/docs
    Contains all documentation, including information about SDK 3.0 libraries and tools.
/opt/cell/sdk/usr/bin
/opt/cell/sdk/usr/spu/bin
    Contains executable programs for that platform. On an x86 system, this includes the SPU Timing tool. On a PPC system, this also includes all of the prebuilt binaries for the SDK examples (if installed). In the SDK build environment (that is, with buildutils/make.footer) the $SDKBIN_<target> variables point to these directories.
/opt/cell/sdk/usr/include
/opt/cell/sdk/usr/spu/include
    Contains header files for the SDK libraries and examples on a PPC system. In the SDK build environment (that is, with the buildutils/make.footer) the $SDKINC_<target> variables point to these directories.
/opt/cell/sdk/usr/lib
/opt/cell/sdk/usr/lib64
/opt/cell/sdk/usr/spu/lib
    Contains library binary files for the SDK libraries on a PPC system. In the SDK build environment (that is, with the buildutils/make.footer) the $SDKLIB_<target> variables point to these directories.
/opt/cell/sdk/src
    Contains the tar files for the libraries and examples (if installed). The tar files are unpacked into the subdirectories described in the following rows of this table. Each directory has a README that describes its contents and purpose.
/opt/cell/sdk/src/lib
    Contains a series of libraries and reusable header files. Complete documentation for all library functions is in the /opt/cell/sdk/docs/lib/SDK_Example_Library_API_v3.0.pdf file.
/opt/cell/sdk/src/examples
    The examples directory contains examples of Cell BE programming techniques. Each program shows a particular technique, or set of related techniques, in detail. You can review these programs when you want to perform a specific task, such as double-buffered DMA transfers to and from a program, performing local operations on an SPU, or providing access to main memory objects to SPU programs. Some subdirectories contain multiple programs. The sync subdirectory has examples of various synchronization techniques, including mutex operations and atomic operations.
    The spulet model is intended to encourage testing and refinement of programs that need to be ported to the SPUs; it also provides an easy way to build filters that take advantage of the huge computational capacity of the SPUs, while reading and writing standard input and output.
    Other samples worth noting are:
    • Overlay samples
    • SW managed cache samples
/opt/cell/sdk/src/demos The demo directory provides a handful of examples that can be used to better understand the performance characteristics of the Cell BE processor. There are sample programs, which contain insights into how real-world code should run.
Note: Running these examples using the simulator takes much longer than on the native Cell BE-based hardware. The performance characteristics in wall-clock time using the simulator are extremely inaccurate, especially when running on multiple SPUs. You need to examine the emulator CPU cycle counts instead.
For example, the matrix_mul program lets you perform matrix multiplications on one or more SPUs. Matrix multiplication is a good example of a function which the SPUs can accelerate dramatically. Unlike some of the other example programs, these examples have been tuned to get the best performance. This makes them harder to read and understand, but it gives an idea of the type of performance code that you can write for the Cell BE processor.
/opt/cell/sdk/src/benchmarks The benchmarks directory contains sample benchmarks for various operations that are commonly performed in Cell BE applications. The intent of these benchmarks is to guide you in the design, development, and performance analysis of applications for systems based on the Cell BE processor. The benchmarks are provided in source form to allow you to understand in detail the actual operations that are performed in the benchmark. This also provides you with a basis for creating your own benchmark codes to characterize performance for operations that are not currently covered in the provided set of benchmarks.
/opt/cell/sdk/prototype/src Contains the tar files for examples and demos for various prototype packages that ship with the SDK. Each has a README that describes its contents and purpose.
/opt/cell/sysroot Contains the header files and libraries used during cross-compiling and contains the compiled results of the libraries and examples on an x86 system. The compiled libraries and examples (everything under /opt/cell/sysroot/opt/cell/sdk) can be synched up with the simulator system root image by using the command:
/opt/cell/cellsdk_sync_simulator
Performance support libraries and utilities
The following support libraries and utilities are provided by the SDK to help you with development and performance testing of your Cell BE applications.
SPU timing tool
The SPU static timing tool, spu_timing, annotates an SPU assembly file with scheduling, timing, and instruction issue estimates assuming a straight, linear execution of the program. The tool generates a textual output of the execution pipeline of the SPE instruction stream from this input assembly file. Run spu_timing --help to see its usage syntax.
OProfile
OProfile is a tool for profiling user and kernel level code. It uses the hardware performance counters to sample the program counter every N events. You specify the value of N as part of the event specification. The system enforces a minimum value on N to ensure the system does not get completely swamped trying to capture a profile.
Make sure you select a large enough value of N to ensure the overhead of collecting the profile is not excessively high.
The opreport tool produces the output report. Reports can be generated based on the file names that correspond to the samples, symbol names or annotated source code listings.
How to use OProfile and the postprocessing tool is described in the user manual available at:
http://oprofile.sourceforge.net/doc/
The current SDK 3.0 version of OProfile for Cell BE supports profiling on the POWER™ processor events and SPU cycle profiling. These events include cycles as well as the various processor, cache and memory events. It is possible to profile on up to four events simultaneously on the Cell BE system. There are restrictions on which of the PPU events can be measured simultaneously. (The tool now verifies that multiple events specified can be profiled simultaneously. In the previous release it was up to you to verify that.) When using SPU cycle profiling, events must be within the same group due to restrictions in the underlying hardware support for the performance counters. You can use the opcontrol --list-events command to view the events and which group contains each event.
There is one set of performance counters for each node, shared between the two CPUs on the node. For a given profile period, only half of the time is spent collecting data for the even CPUs and half of the time for the odd CPUs. You may need to allow more time to collect the profile data across all CPUs.
Notes:
1. Before you issue an opcontrol --start, you should issue the following command:
opcontrol --start-daemon
2. To produce a report with Linux kernel symbol information you should install the corresponding kernel debuginfo RPM.
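The effect of the sampling interval N and of the counter sharing between even and odd CPUs can be estimated with simple arithmetic. The following is a minimal sketch, not part of OProfile; the function names are hypothetical and the even split of profiling time is taken from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Expected number of samples when one sample is taken every n events. */
uint64_t expected_samples(uint64_t total_events, uint64_t n)
{
    assert(n > 0); /* the sampling interval N must be at least 1 */
    return total_events / n;
}

/*
 * Seconds during which one CPU is actually sampled, assuming the node's
 * counters alternate evenly between the even and the odd CPUs.
 */
double sampled_seconds_per_cpu(double profile_seconds)
{
    return profile_seconds / 2.0;
}
```

Under these assumptions, a larger N lowers the sample count (and the overhead) proportionally, and a profile run must last twice as long to give each CPU the sampling time you want.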
SPU profiling restrictions
When SPU cycle profiling is used, the opcontrol command is configured for separating the profile based on SPUs and on the library. This corresponds to you specifying --separate=cpu and --separate=lib. The separate CPU is required because it is possible to have multiple SPU binary images embedded into the executable file or into a shared library. So for a given executable, the various SPUs may be running different SPU images.
With --separate=cpu, the image and corresponding symbols can be displayed for each SPU. You can use the opreport --merge command to create a single report for all SPUs that shows the counts for each symbol in the various embedded SPU binaries. By default, opreport does not display the app name column when it reports samples for a single application, such as when it profiles a single SPU application. For opreport to attribute samples to a binary image, the opcontrol script defaults to using --separate=lib when profiling SPU applications so that the image name column is always displayed in the generated reports.
SPU report anomalies
The report file uses the term CPUs when the event is SPU_CYCLES. In this case, CPUs actually refer to the various SPUs in the system. For all other events, the CPU term refers to the virtual PPU processors.
With SPU profiling, opreport’s --long-filenames option may not print the full path of the SPU binary image for which samples were collected. Short image names are used for SPU applications that employ the technique of embedding SPU images in another file (executable or shared library). The embedded SPU ELF data contains only the file name and no path information to the SPU binary file being embedded because this file may not exist or be accessible at runtime. You must have sufficient knowledge of the application’s build process to be able to correlate the SPU binary image names found in the report to the application’s source files.
Tip
Compile the application with -g and generate the OProfile report with -g to facilitate finding the right source file(s) to focus on.
Generally, when the report contains information about a single application, opreport does not include the report column for the application name. It is assumed that the performance analyst knows the name of the application being profiled.
Cell-perf-counter tool
The cell-perf-counter (cpc) tool is used for setting up and using the hardware performance counters in the Cell BE processor. These counters allow you to see how many times certain hardware events are occurring, which is useful if you are analyzing the performance of software running on a Cell BE system. Hardware events are available from all of the logical units within the Cell BE processor, including the PPE, SPEs, interface bus, and memory and I/O controllers. Four 32-bit counters, which can also be configured as pairs of 16-bit counters, are provided in the Cell BE performance monitoring unit (PMU) for counting these events.
The cpc tool also makes use of the hardware sampling capabilities of the Cell BE PMU. This feature allows the hardware to collect very precise counter data at programmable time intervals. The accumulated data can be used to monitor the changes in performance of the Cell BE system over longer periods of time.
The cpc tool provides a variety of output formats for the counter data. Simple text output is shown in the terminal session, HTML output is available for viewing in a Web browser, and XML output can be generated for use by higher-level analysis tools such as the Visual Performance Analyzer (VPA).
You can find details in the documentation and manual pages included with the cellperfctr-tools package, which can be found in the /usr/share/doc/cellperfctr-<version>/ directory after you have installed the package.
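The trade-off between four 32-bit counters and pairs of 16-bit counters can be illustrated with a small software model. This is only a sketch of the arithmetic, not the PMU register layout or the cpc interface; all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of usable counters when some of the four 32-bit counters are
 * each split into a pair of 16-bit counters. */
unsigned counters_available(unsigned split_counters)
{
    assert(split_counters <= 4);
    return (4 - split_counters) + 2 * split_counters;
}

/* A 32-bit counter wraps to zero after 2^32 events. */
uint32_t count32(uint32_t counter, uint32_t events)
{
    return counter + events;
}

/* A 16-bit counter wraps to zero after only 65536 events. */
uint16_t count16(uint16_t counter, uint32_t events)
{
    return (uint16_t)(counter + events);
}
```

Splitting counters lets you watch more events at once, at the cost of a much smaller range before each counter wraps.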
IBM Eclipse IDE for the SDK
IBM Eclipse IDE for the SDK is built upon the Eclipse and C Development Tools (CDT) platform. It integrates the GNU tool chain, compilers, the Full-System Simulator, and other development components to provide a comprehensive, Eclipse-based development platform that simplifies development. The key features include the following:
v A C/C++ editor that supports syntax highlighting, a customizable template, and an outline window view for procedures, variables, declarations, and functions that appear in source code
v A visual interface for the PPE and SPE combined GDB (GNU debugger)
v Seamless integration of the simulator into Eclipse
v Automatic builder, performance tools, and several other enhancements
v Remote launching, running and debugging on a BladeCenter QS21
v ALF source code templates for programming models within IDE
v An ALF Code Generator to produce an ALF template package with C source code and a readme.txt file
v A configuration option for both the Local Simulator and Remote Simulator target environments that allows you to choose between launching a simulation machine with the Cell BE processor or an enhanced CBEA-compliant processor with a fully pipelined, double precision SPE processor
v Remote Cell BE and simulator BladeCenter support
v SPU timing integration
v Automatic makefile generation for both GCC and XLC projects
For information about how to install and remove the IBM Eclipse IDE for the SDK, see the SDK 3.0 Installation Guide.
For information about using the IDE, a tutorial is available. The IDE and related programs must be installed before you can access the tutorial.
Hybrid-x86 programming model overview
The Cell Broadband Engine Architecture (CBEA) is an example of a multi-core hybrid system on a chip, that is, heterogeneous cores integrated on a single processor with an inherent memory hierarchy. Specifically, the synergistic processing elements (SPEs) can be thought of as computational accelerators for a more general purpose PPE core. These concepts of hybrid systems, memory hierarchies and accelerators can be extended more generally to coupled I/O devices, and examples of those systems exist today, for example, GPUs in PCIe slots for workstations and desktops. Similarly, the Cell BE processor is being used in systems as an accelerator, where computationally intensive workloads well suited for the CBEA are off-loaded from a more standard processing node. There are potentially many ways to move data and functions from a host processor to an accelerator processor and vice versa.
In order to provide a consistent methodology and set of application programming interfaces (APIs) for a variety of hybrid systems, including the Cell BE SoC hybrid system, the SDK has implementations of the Cell BE multi-core data communication and programming model libraries, Data and Communication Synchronization (DaCS) and Accelerated Library Framework (ALF), which can be used on x86/Linux host process systems with Cell BE-based accelerators. A prototype implementation over sockets is provided so that you can gain experience with this programming style and focus on how to manage the distribution of processing and data decomposition. For example, in the case of hybrid programming when moving data point to point over a network, care must be taken to maximize the computational work done on accelerator nodes, potentially with asynchronous or overlapping communication, given the potential cost in communicating input and results.
For more information about the DaCS and ALF programming APIs, refer to Data and Communication Synchronization Library for Hybrid-x86 Programmer's Guide and API Reference and Accelerated Library Framework for Hybrid-x86 Programmer's Guide and API Reference.
Chapter 2. Programming with the SDK
This section is a short introduction about programming with the SDK. It covers the following topics:
v “System root directories”
v “Running the simulator”
v “Specifying the processor architecture”
v “PPE address space support on SPE”
v “SPU stack analysis”
v “SDK programming examples and demos”
v “Support for huge TLB file systems”
v “SDK development best practices”
v “Performance considerations”
Refer to the Cell BE Programming Tutorial, the Full-System Simulator User’s Guide, and other documentation for more details.
System root directories
Because of the cross-compile environment and simulator in the SDK, there are several different system root directories. Table 2 describes these directories.
Table 2. System root directories
Directory name Description
Host The system root for the host system is “/”. The SDK is installed relative to this host system root.
GCC Toolchain The system root for the GCC tool chain depends on the host platform. For PPC platforms including the BladeCenter QS21, this directory is the same as the host system root. For x86 and x86-64 systems this directory is /opt/cell/sysroot. The tool chain PPU header and library files are stored relative to the GCC Toolchain system root in directories such as usr/include and usr/lib. The tool chain SPU header and library files are stored relative to the GCC Toolchain system root in directories such as usr/spu/include and usr/spu/lib.
Simulator The simulator runs using a 2.6.22 kernel and a Fedora 7 system root image. This system root image has a root directory of “/”. When this system root image is mounted into a host-based directory such as /mnt/cell-sdk-sysroot. This directory is the
Examples and Libraries The Examples and Libraries system root directory is /opt/cell/sysroot. When the samples and libraries are compiled and linked, the resulting header files, libraries and binaries are placed relative to this directory in directories such as usr/include, usr/lib, and /opt/cell/sdk/usr/bin. The libspe library is also installed into this system root. After you have logged in as root, you can synchronize this sysroot directory with the simulator sysroot image file. To do this, use the cellsdk_sync_simulator script with the synch task. The command is:
/opt/cell/cellsdk_sync_simulator
This command is very useful whenever a library or sample has been recompiled. This script reduces user error because it provides a standard mechanism to mount the system root image, rsync the contents of the two corresponding directories, and unmount the system root image.
Running the simulator
To verify that the simulator is operating correctly and then run it, issue the following commands:
export PATH=/opt/ibm/systemsim-cell/bin:$PATH
systemsim -g
The systemsim script found in the simulator’s bin directory launches the simulator. The -g parameter starts the graphical user interface.
Note: It is no longer necessary to have a local copy of .systemsim.tcl. The simulator looks in the local directory first (as it always did), but if it is not there, it uses the systemsim.tcl (no leading dot) in lib/cell of the
Notes:
1. You must be on a graphical console, or at least have the DISPLAY environment variable pointed to an X server to run the simulator's graphical user interface (GUI).
2. If an error message about libtk8.4.so is displayed, you must load the TK package as described in SDK 3.0 Installation Guide.
When the GUI is displayed, click Go to start the simulator.
Note: To make the simulator run in fast mode, you can click Mode and then Fast Mode. This forces the simulator to bypass its standard analysis and statistics collection features. Fast mode is useful if you want to advance the simulator through setup or initialization functions that are not the focus of analysis, such as the Linux boot processing. You should disable fast mode when you reach the point at which you wish to do detailed analysis or debug the application. You can also select Simple Mode or Cycle Mode.
You can use the simulator's GUI to get a better understanding of the Cell BE architecture. For example, the simulator shows two sets of PPE state. This is because the PPE processor core is dual-threaded and each thread has its own registers and context. You can also look at the state of the SPEs, including the state of their Memory Flow Controller (MFC).
Figure: Full System Simulator GUI and the console window for the system running on the simulator. The console shows the tutorial program simple, run from /opt/cell/sdk/usr/bin/tutorial, printing eight “Hello Cell” lines followed by “The program has successfully executed.”
The systemsim command syntax is:
systemsim [-f file] [-g] [-n]
where:
Parameter Description
-f <filename> specifies an initial run script (TCL file)
-g specifies GUI mode, otherwise the simulator starts in command-line mode
-n specifies that the simulator should not open a separate console window
You can find documentation about the simulator including the user’s guide in the /opt/ibm/systemsim-cell/doc directory.
The callthru utility
The callthru utility allows you to copy files to or from the simulated system while it is running. The utility is located in the simulator system root image in the /usr/bin directory.
If you call the utility as:
v callthru sink <filename>, it writes its standard input into <filename> on the host system
v callthru source <filename>, it writes the contents of <filename> on the host system to standard output.
Redirecting appropriately lets you copy files to and from the host. For example, when the simulator is running on the host, you could copy a Cell BE application into /tmp:
cp matrix_mul /tmp
Then, in the console window of the simulated system, you could access it as follows:
callthru source /tmp/matrix_mul > matrix_mul
chmod +x matrix_mul
./matrix_mul
The /tmp directory is shown as an example only.
The source files for the callthru utility are in /opt/ibm/systemsim-cell/sample/callthru. The callthru utility is built and installed onto the sysroot disk as part of the SDK installation process.
Read and write access to the simulator sysroot image
By default the simulator does not write changes back to the simulator system root (sysroot) image. This means that the simulator always begins in the same initial state of the sysroot image. When necessary, you can modify the simulator configuration so that any file changes made by the simulated system to the sysroot image are stored in the sysroot disk file so that they are available to subsequent simulator sessions.
To specify that you want to update the sysroot image file with any changes made in the simulator session, change the newcow parameter on the mysim bogus disk init command in .systemsim.tcl to rw (specifying read/write access) and remove the last two parameters. The following is the changed line from .systemsim.tcl:
mysim bogus disk init 0 $sysrootfile rw
When running the simulator with read/write access to the sysroot image file, you must ensure that the file system in the sysroot image file is not corrupted by incomplete writes or a premature shutdown of the Linux operating system running in the simulator. In particular, you must be sure that Linux writes any cached data out to the file system before exiting the simulator. To do this, issue "sync; sync" in the Linux console window just before you exit the simulator.
Enabling Symmetric Multiprocessing support
By default the simulator provides an environment that simulates one Cell BE processor. To simulate an environment where two Cell BE processors exist, similar to a BladeCenter QS21, you must enable Symmetric Multiprocessing (SMP) support. A tcl run script, config_smp.tcl, is provided with the simulator to configure it for SMP simulation. For example, the following sequence of commands starts the simulator configured with a graphical user interface and SMP.
export PATH=$PATH:/opt/ibm/systemsim-cell/bin
systemsim -g -f config_smp.tcl
When the simulator is started, it has access to sixteen SPEs across two Cell BE processors.
Enabling xclients from the simulator
To enable xclients from the simulator, you need to configure BogusNet (see the BogusNet HowTo), and then perform the following configuration steps:
1. Enable ip-forward:
echo 1 > /proc/sys/net/ipv4/ip_forward
2. Configure IPTABLES:
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -A FORWARD -i eth0 -o tap0 -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -i tap0 -o eth0 -j ACCEPT
Notes:
1. The IPTABLES commands need to use the correct tap# interface configured with BogusNet.
2. The first iptables command fails unless the Linux kernel was configured to allow the NAT feature. To enable your kernel for the NAT feature, you need to rebuild the kernel and reboot your simulator host system.
Specifying the processor architecture
Many of the tools provided in SDK 3.0 support multiple implementations of the CBEA. These include the Cell BE processor and a future processor. This future processor is a CBEA-compliant processor with a fully pipelined, enhanced double precision SPU.
The processor supports five optional instructions to the SPU Instruction Set Architecture. These include:
v DFCEQ
v DFCGT
v DFCMEQ
v DFCMGT
v DFTSV
Detailed documentation for these instructions is provided in version 1.2 (or later) of the Synergistic Processor Unit Instruction Set Architecture specification. The future processor also supports improved issue and latency for all double precision instructions.
The SDK compilers support compilation for either the Cell BE processor or the future processor.
Table 3. spu-gcc compiler options
Options Description
-march=<cpu type> Generate machine code for the SPU architecture specified by the CPU type. Supported CPU types are either cell (default) or celledp, corresponding to the Cell BE processor or future processor, respectively.
-mtune=<cpu type> Schedule instructions according to the pipeline model of the specified CPU type. Supported CPU types are either cell (default) or celledp, corresponding to the Cell BE processor or future processor, respectively.
Table 4. spu-xlc compiler options
Option Description
-qarch=<cpu type> Generate machine code for the SPU architecture specified by the CPU type. Supported CPU types are either spu (default) or edp, corresponding to the Cell BE processor or future processor, respectively.
-qtune=<cpu type> Schedule instructions according to the pipeline model of the specified CPU type. Supported CPU types are either spu (default) or edp, corresponding to the Cell BE processor or future processor, respectively.
The simulator also supports simulation of the future processor. The simulator installation provides a tcl run script to configure it for such simulation. For example, the following sequence of commands starts the simulator configured for the future processor with a graphical user interface.
export PATH=$PATH:/opt/ibm/systemsim-cell/bin
systemsim -g -f config_edp_smp.tcl
The static timing analysis tool, spu_timing, also supports multiple processor implementations. The command line option -march=celledp can be used to specify that the timing analysis be done corresponding to the future processor's enhanced pipeline model. If the architecture is unspecified or the tool is invoked with the command line option -march=cell, then analysis is done corresponding to the Cell BE processor's pipeline model.
PPE address space support on SPE
When you develop SPE programs using the SDK, you may wish to reference variables in the PPE address space from code running on an SPE. This is achieved through an extension to the C language syntax.
It might be desirable to share data in this way between an SPE and the PPE. This extension makes it easier to pass pointers so that you can use the PPE to perform certain functions on behalf of the SPE. You can readily share data between all SPEs through variables in the PPE address space.
The compiler recognizes an address space identifier __ea that can be used as an extra type qualifier like const or volatile in type and variable declarations. You can qualify variable declarations in this way, but not variable definitions.
The following are examples.
/* Variable declared on the PPE side. */
extern __ea int ppe_variable;
/* Can also be used in typedefs. */
typedef __ea int ppe_int;
/* SPE pointer variable pointing to memory in the PPE address space */
__ea int *ppe_pointer;
Pointers in the SPE address space can be cast to pointers in the PPE address space. Doing this transforms an SPE address into an equivalent address in the mapped SPE local store (in the PPE address space). The following is an example.
int x;
__ea int *ppe_pointer_to_x = &x;
These pointer variables can be passed to the PPE process by way of a mailbox and used by PPE code. With this method, you can perform operations in the PPE execution context such as copying memory from one region of the SPE local store to another.
In the same way, these pointers can be converted between the two address spaces, as follows:
int *spe_x;
spe_x = (int *) ppe_pointer_to_x;
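The address arithmetic behind these casts can be pictured as adding or subtracting the effective address at which the SPE local store is mapped. The following portable C sketch models that idea only; it is not the code the compiler generates, the function names are hypothetical, and the base address passed in is an assumed value (only the 256 KB local store size is a property of the Cell BE SPE).

```c
#include <assert.h>
#include <stdint.h>

#define LS_SIZE 0x40000u /* 256 KB SPE local store */

/* Local store address -> effective address in the mapped window.
 * ls_base_ea is a hypothetical mapping base supplied by the caller. */
uint64_t ls_to_ea(uint64_t ls_base_ea, uint32_t ls_addr)
{
    assert(ls_addr < LS_SIZE);
    return ls_base_ea + ls_addr;
}

/* Effective address -> local store address. Returns 1 on success, or 0
 * if ea does not fall inside this SPE's mapped local store window. */
int ea_to_ls(uint64_t ls_base_ea, uint64_t ea, uint32_t *ls_addr)
{
    if (ea < ls_base_ea || ea - ls_base_ea >= LS_SIZE)
        return 0;
    *ls_addr = (uint32_t)(ea - ls_base_ea);
    return 1;
}
```

The second function makes explicit why converting a PPE pointer back to an SPE pointer is only meaningful when the address lies inside the mapped local store.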
References to __ea variables cause decreased performance. The implementation performs software caching of these variables, but there are much higher overheads when the variable is accessed for the first time. Modifications to __ea variables are also cached. The writeback of such modifications to PPE address space may be delayed until the cache line is flushed, or the SPU context terminates.
GCC for the SPU provides the following command line options to control the runtime behavior of programs that use the __ea extension. Many of these options specify parameters for the software-managed cache. In combination, these options cause GCC to link your program to a single software-managed cache library that satisfies those options. Table 5 describes these options.
Table 5. Options
Option Description
-mea32 Generate code to access variables in 32-bit PPU objects. The compiler defines a preprocessor macro __EA32__ to allow applications to detect the use of this option. This is the default.
-mea64 Generate code to access variables in 64-bit PPU objects. The compiler defines a preprocessor macro __EA64__ to allow applications to detect the use of this option.
-mcache-size=8 Specify an 8 KB cache size.
-mcache-size=16 Specify a 16 KB cache size.
-mcache-size=32 Specify a 32 KB cache size.
-mcache-size=64 Specify a 64 KB cache size.
-matomic-updates Use DMA atomic updates when flushing a cache line back to PPU memory. This is the default.
-mno-atomic-updates This negates the -matomic-updates option.
Accessing an __ea variable from an SPU program creates a copy of this value in the local storage of the SPU. Subsequent modifications to the value in main storage are not automatically reflected in the copy of the value in local store. It is your responsibility to ensure data coherence for __ea variables that are accessed by both SPE and PPE programs.
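The coherence issue described above can be reproduced with a toy, single-line software cache in portable C. This is a simplified model for illustration, not the software-managed cache library that GCC links in; the type and function names and the 128-byte line size are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_SIZE 128 /* illustrative line size, an assumption */

typedef struct {
    unsigned char *home;            /* line's location in "main storage" */
    unsigned char  copy[LINE_SIZE]; /* cached copy in "local store" */
    int valid;
    int dirty;
} toy_cache_line;

/* Write a dirty line back to main storage and invalidate the copy. */
void toy_flush(toy_cache_line *c)
{
    if (c->valid && c->dirty)
        memcpy(c->home, c->copy, LINE_SIZE);
    c->valid = 0;
    c->dirty = 0;
}

/* Read byte i of a line; the first access fetches the whole line. */
unsigned char toy_read(toy_cache_line *c, unsigned char *home, size_t i)
{
    assert(i < LINE_SIZE);
    if (!c->valid || c->home != home) {
        toy_flush(c); /* evict whatever line was cached before */
        memcpy(c->copy, home, LINE_SIZE);
        c->home = home;
        c->valid = 1;
    }
    return c->copy[i];
}

/* Write byte i of a line; main storage is not updated until a flush. */
void toy_write(toy_cache_line *c, unsigned char *home, size_t i,
               unsigned char value)
{
    (void)toy_read(c, home, i); /* make sure the line is cached */
    c->copy[i] = value;
    c->dirty = 1;
}
```

In this model, a value changed in main storage after the first read stays invisible to the cached copy, and a cached write stays invisible to main storage until toy_flush runs, which mirrors the delayed writeback and stale-copy behavior that you must account for.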
A complete example using __ea qualifiers to implement a quick sort algorithm on the SPU accessing PPE memory can be found in the examples/ppe_address_space directory provided by the SDK 3.0 cell-examples tarball.
SDK programming examples and demos
Each of the examples and demos has an associated README.txt file. There is also a top-level readme in the /opt/cell/sdk/src directory, which introduces the structure of the example code source tree.
Almost all of the examples run both within the simulator and on the BladeCenter QS21. Some examples include SPU-only programs that can be run on the simulator in standalone mode.
The source code, which is specific to a given Cell BE processor unit type, is in the corresponding subdirectory within a given example’s directory:
v ppu for code compiled to run on the PPE
v ppu64 for code specifically compiled for 64-bit ABI on the PPE
v spu for code compiled to run on an SPE
v spu_sim for code compiled to run on an SPE under the system simulator in standalone environment
Overview of the build environment
In /opt/cell/sdk/buildutils there are some top level Makefiles that control the build environment for all of the examples. Most of the directories in the libraries and examples contain a Makefile for that directory and everything below it. All of the examples have their own Makefile but the common definitions are in the top level Makefiles.
The build environment Makefiles are documented in /opt/cell/sdk/buildutils/README_build_env.txt.
Changing the default compiler
Environment variables in the /opt/cell/sdk/buildutils/make.* files are used to determine which compiler is used to build the examples.
The /opt/cell/sdk/buildutils/cellsdk_select_compiler script can be used to switch the compiler. The syntax of this command is:
where the xlc flag selects the XLC/C++ compiler and the gcc flag selects the GCC compiler. The default, if unspecified, is to compile the examples with the GCC compiler.
After you have selected a particular compiler, that same compiler is used for all future builds, unless it is specifically overridden by the shell environment variables SPU_COMPILER, PPU_COMPILER, PPU32_COMPILER, or PPU64_COMPILER.
Building and running a specific program
You do not need to build all the example code at once; you can build each program separately. To start from scratch, issue a make clean using the Makefile in the /opt/cell/sdk/src directory or anywhere in the path to a specific library or sample.
If you have performed a make clean at the top level, you need to rebuild the include files and libraries first before you compile anything else. To do this, run a make in the src/include and src/lib directories.
Note: In SDK 3.0, the top-level Makefiles for Cell BE applications have been moved into the subdirectory buildutils under the main SDK directory /opt/cell/sdk. If you developed Makefiles using previous versions of the SDK, you may need to modify them to reference this new location for the top-level Makefiles.
Compiling and linking with the GNU tool chain
This release of the GNU tool chain includes a GCC compiler and utilities that optimize code for the Cell BE processor. These are:
v The spu-gcc compiler for creating an SPU binary
v The ppu32-embedspu tool
v The ppu-gcc compiler
v The ppu-embedspu tool which enables an SPU binary to be linked with a PPU binary into a single executable program
v The ppu32-gcc compiler for compiling the PPU binary and linking it with the SPU binary
The example below shows the steps required to create the executable program simple which contains SPU code, simple_spu.c, and PPU code, simple.c.
1. Compile and link the SPE executable.
/usr/bin/spu-gcc -g -o simple_spu simple_spu.c
2. Optionally run embedspu to wrap the SPU binary into a CESOF (CBE Embedded SPE Object Format) linkable file. This contains additional PPE symbol information.
/usr/bin/ppu32-embedspu simple_spu simple_spu simple_spu-embed.o
3. Compile the PPE side and link it together with the embedded SPU binary.
/usr/bin/ppu32-gcc -g -o simple simple.c simple_spu-embed.o -lspe
4. Or, compile the PPE side and link it directly with the SPU binary. The linker will invoke embedspu, using the file name of the SPU binary as the name of the program handle struct.
Notes:
1. This section only highlights 32-bit ABI compilation. To compile for 64-bit, use ppu-gcc (instead of ppu32-gcc) and use ppu-embedspu (instead of ppu32-embedspu).
2. You are strongly advised to use the -g switch as shown in the examples. This embeds extra debugging information into the code for later use by the GDB debuggers supplied with the SDK. See Chapter 3, “Debugging Cell BE applications,” on page 29 for more information.
Support for huge TLB file systems
The SDK supports the huge translation lookaside buffer (TLB) file system, which allows you to reserve 16 MB huge pages of pinned, contiguous memory. This feature is particularly useful for some Cell BE applications that operate on large data sets, such as the FFT16M workload sample.
To configure the BladeCenter QS21 for 20 huge pages (320 MB), run the following commands:
mkdir -p /huge
echo 20 > /proc/sys/vm/nr_hugepages
mount -t hugetlbfs nodev /huge
If you have difficulties configuring adequate huge pages, it could be that the memory is fragmented and you need to reboot.
You can add the command sequence shown above to a startup initialization script, such as /etc/rc.d/rc.sysinit, so that the huge TLB file system is configured during the system boot.
To verify the large memory allocation, run the command cat /proc/meminfo. The output is similar to:
MemTotal: 1010168 kB
MemFree: 155276 kB
. . .
HugePages_Total: 20
HugePages_Free: 20
Hugepagesize: 16384 kB
Huge pages are allocated by invoking mmap of a /huge file of the specified size. For example, the following code sample allocates 32 MB of private huge paged memory:
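/* Requires <fcntl.h>, <stdio.h>, <stdlib.h>, and <sys/mman.h>. */
int fmem;
void *ptr;
char *mem_file = "/huge/myfile.bin";
/* Create the backing file in the huge TLB file system. */
if ((fmem = open(mem_file, O_CREAT | O_RDWR, 0755)) == -1) { perror("open"); exit(1); }
/* Unlink the name; the pages stay allocated while the file is open or mapped. */
remove(mem_file);
ptr = mmap(0, 0x2000000, PROT_READ | PROT_WRITE, MAP_PRIVATE, fmem, 0);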
mmap succeeds even if there are insufficient huge pages to satisfy the request. On first access to a page that cannot be backed by the huge TLB file system, the application is "killed". That is, the process is terminated and the message "killed" is emitted. You must ensure that the number of huge pages requested does not exceed the number available. Furthermore, on a BladeCenter QS20 and BladeCenter QS21, the huge pages are equally distributed across both nodes. Applications that restrict memory allocation to a specific node find that the number of available huge pages for the specific node is half of what is reported in /proc/meminfo.
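Because mmap does not report a shortage of huge pages, one way to guard against the "killed" termination is to compare the number of pages you need with HugePages_Free before mapping. The following is a minimal sketch of that check; the helper names are hypothetical, and it parses text in the /proc/meminfo layout shown earlier.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns the number that follows key (for example "HugePages_Free:")
 * in /proc/meminfo-style text, or -1 if the key is not present. */
long meminfo_value(const char *text, const char *key)
{
    const char *p = strstr(text, key);
    long value;

    if (p == NULL)
        return -1;
    if (sscanf(p + strlen(key), " %ld", &value) != 1)
        return -1;
    return value;
}

/* Number of huge pages needed to back the given number of bytes,
 * rounded up, for a huge page size given in kilobytes. */
long huge_pages_needed(unsigned long bytes, unsigned long page_kb)
{
    unsigned long page_bytes = page_kb * 1024UL;

    assert(page_kb > 0);
    return (long)((bytes + page_bytes - 1) / page_bytes);
}
```

A caller would read /proc/meminfo into a buffer, look up "HugePages_Free:" and "Hugepagesize:", and proceed with mmap only if huge_pages_needed does not exceed the free count (or half of it when allocation is restricted to one node, as described above). For the 32 MB request in the example, two 16 MB pages are needed.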
SDK development best practices
This section documents some best practices in terms of developing applications using the SDK. See also developerWorks articles about programming tips and best practices for writing Cell BE applications at
http://www.ibm.com/developerworks/power/cell/
Using a shared development environment
Multiple users should not update the common simulator sysroot image file by mounting it read-write in the simulator. For shared development environments, the callthru utility (see “The callthru utility” on page 20) can be used to get files in and out of the simulator. Alternatively, users can copy the sysroot image file to their own sandbox area and then mount this version with read/write permissions to make persistent updates to the image.
If multiple users need to run Cell BE applications on a BladeCenter QS21, you need a machine reservation mechanism to reduce collisions between two people who are using SPEs at the same time. This is because SPE threads are not fully preemptable in this version of the SDK.
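The SDK does not prescribe a reservation mechanism. One simple possibility, shown here purely as an illustrative sketch, is an advisory lock on a file that every user agrees to take before starting SPE work. The function names and the lock file path are hypothetical; the lock is cooperative and does not prevent anyone from ignoring it.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Tries to reserve the machine without blocking. Returns a descriptor
 * that must stay open for the duration of the reservation, or -1 if the
 * lock file cannot be opened or someone else holds the reservation. */
int reserve_machine(const char *lock_path)
{
    int fd;

    assert(lock_path != NULL);
    fd = open(lock_path, O_CREAT | O_RDWR, 0666);
    if (fd == -1)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Releases a reservation obtained from reserve_machine(). */
void release_machine(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}
```

A wrapper program could call reserve_machine with an agreed path, run the Cell BE application, and then call release_machine. The lock is also dropped automatically if the holder exits.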
Performance considerations
The following sections discuss performance considerations that you should take into account when you are developing applications:
v “NUMA”
v “Preemptiveco