Is OpenCL a suitable platform for algorithm development in health care systems?


UPTEC IT 12 011

Degree project, 15 credits

August 2012


Faculty of Science and Technology, UTH unit. Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, House 4, Floor 0. Postal address: Box 536, 751 21 Uppsala. Telephone: 018 – 471 30 03. Fax: 018 – 471 30 00. Website: http://www.teknat.uu.se/student

Abstract


Mattias Larsson

This thesis reviews whether OpenCL is a suitable and cost-effective platform for algorithm development in health care systems. Aspects such as maintainability, performance, portability and integration with high-level languages (in this case Python) are analyzed. The review is done by implementing one part of a dose calculation algorithm that is complex enough to provide a realistic case. The vision is that OpenCL can replace multiple platforms for both multi-core CPU and GPU computing, removing the need to implement an optimized version of an algorithm for every platform. To achieve performance portability, automatic optimization is done using parameter tuning. Its effects on both performance and code structure are analyzed. The conclusion is that OpenCL coupled with auto tuning is not a suitable platform, due to problems with code structure, language limitations, programming portability, tool support, and the effort and difficulty of implementing auto tuning.

Examiner: Arnold Pears

Subject reviewer: David Black-Schaffer
Supervisor: Anders Edin


Summary

In modern radiation therapy, powerful computers are used to plan the treatment as well as possible. The need for computational power makes the possibility of using specialized hardware such as GPUs interesting. Using specialized hardware, however, requires that software be rewritten for a platform that supports that hardware. One such platform is OpenCL, an open standard supported by most common hardware manufacturers. The vision is that an algorithm written in OpenCL can run with reasonable performance both on ordinary hardware (CPU) and on specialized hardware (GPU), and thereby replace the need for multiple platforms.

This thesis investigates how OpenCL, together with techniques for automatically adapting the software to the current hardware, affects factors such as maintainability, performance, portability and integration with high-level languages. This is investigated by implementing a part of a dose calculation algorithm that is complex enough to correspond to a real case. To adapt the software automatically to the current hardware, the behavior of the software can be adjusted by means of parameters. How well the automatic adaptation works is analyzed with respect to both code structure and performance. The conclusion is that OpenCL as a platform, together with automatic adaptation of the software, is not a suitable path at present. This is due to negative effects on code structure, limitations in the programming language, problems with portability, lack of tool support, and the difficulty of implementing automatic adaptation of the software.


Contents

1 Introduction
  1.1 Background
  1.2 Goal
  1.3 Scope
2 Method and theory
  2.1 Calculation of the fluence map
    2.1.1 Fluence map basics
    2.1.2 Ray tracing
  2.2 Non-functional software requirements
    2.2.1 Safety
    2.2.2 Portability
    2.2.3 Maintainability and performance
  2.3 OpenCL
    2.3.1 The OpenCL architecture
    2.3.2 The OpenCL programming language
    2.3.3 Tool support
  2.4 Automatic tuning
    2.4.1 Related work
    2.4.2 Model based optimization
    2.4.3 Empirical optimization
3 Design and implementation
  3.1 Program structure and implementation details
    3.1.1 Modules
    3.1.2 OpenCL C language considerations
    3.1.3 Accuracy adjustment
    3.1.4 Parallelization and concurrency
    3.1.5 General optimizations
  3.2 Optimization parameters and automatic tuning
    3.2.1 Work-group size and shape
    3.2.2 Address spaces
    3.2.3 Structure
    3.2.4 Scene
    3.2.5 Intersection algorithms
    3.2.6 Automatic tuning
  3.3 Integration with Python
    3.3.1 PyOpenCL
    3.3.2 C structures and alignment
4 Results and analysis
  4.1 Test setup
    4.1.1 Hardware platforms
    4.1.2 Test scene
    4.1.3 Search heuristic
  4.2 Test results and parameter analysis
    4.2.1 Performance results
    4.2.2 Parameter search statistics
    4.2.3 Work-group parameters
    4.2.4 Address space parameters
    4.2.5 Algorithm, scene and structure parameters
    4.2.6 Parameter importance
  4.3 Structure analysis
5 Conclusion
6 Discussion and future work
  6.1 Ray tracing improvements
    6.1.1 Intersection algorithms
    6.1.2 Integration and sampling techniques
    6.1.3 Hierarchies of bounding volumes
  6.2 Automatic optimization
  6.3 The future of OpenCL
7 References


1 Introduction

1.1 Background

Radiation therapy is a type of cancer treatment where high-energy radiation is used to kill cancer cells. The radiation damages the DNA and stops the cells' ability to divide. Both cancer cells and healthy cells are affected by the radiation, so it is essential to minimize the radiation to the healthy cells. Developments in medical informatics have enabled better treatment of cancer patients with radiation therapy, largely with the help of powerful computers and smart software. Treatments are planned in advance and computers simulate a patient's expected radiation dose. The exact anatomy of a patient is known through computed tomography, so even the radiation dose on individual organs can be simulated. The radiation can then be shaped to fit and only affect the tumor and to minimize the radiation dose on important organs, much like using your hands in front of a lamp to form shadows on a wall. The lamp in this case is radiation from an accelerator and the hands are decimeter-thick blocks of tungsten, called a collimator, that refract and absorb most of the radiation. The simulation of radiation dose is done in two steps: first, simulate the shape formed by the collimator onto a virtual plane called a fluence map, which captures the shape and intensity of the radiation; second, use the fluence map to calculate the dose in the patient [5].

The simulation of radiation dose on a patient's body is a computationally expensive operation and is not done in real time. This limitation influences how medical staff plan the treatment and the quality of the planning. If the simulation could be done in real time, the hope is that planning could become more effective. There is an endless need for computational power, which can be used to increase either speed or accuracy. Recent hardware and software developments have started to expose the computing power of multi-core CPUs, GPUs and other types of specialized hardware. This gives hope of being able to do both faster and more accurate simulations of radiation dose.

OpenCL is a platform for getting access to the computing power of the new hardware and is defined by a non-profit group that is supported by all major hardware developers. OpenCL supports execution on several types of hardware (CPUs, GPUs etc.) without any modification of the code. OpenCL introduces a C99-based programming language in which portable computation kernels can be written. The kernels are compiled at run time and can be run on any available and supported hardware. Even though OpenCL supports programming portability, the performance is not portable [2]. To get good performance, the kernels often have to be optimized for a particular hardware. Recent studies have shown that auto tuning of optimizations can be used to provide more general code that can be optimized depending on the hardware it currently executes on, and can possibly fix the problem of performance portability [3, 4].

1.2 Goal

The goal of this thesis is to review whether OpenCL is a suitable and cost-effective platform for lower-level algorithms by implementing the ray-tracing algorithm in OpenCL. This is done by analyzing maintainability, performance portability through auto tuning, and how it can interact with high-level platforms such as C#/.NET and Python. The first part of the dose simulation (the calculation of the fluence map) is implemented and then the solution is analyzed.

1.3 Scope

The implementation and study are limited to using a single OpenCL device. The implementation is only of proof-of-concept quality and does not aim for precision or adherence to medical standards. The implementation and data should be complex enough and realistic enough to test the limitations of OpenCL together with auto tuning.


2 Method and theory

2.1 Calculation of the fluence map

2.1.1 Fluence map basics

As described in the introduction, the expected radiation dose from a treatment is calculated in a simulation. One way of doing this is in a two-step manner [5]. The calculation of the effect of the collimators is separated from the step of calculating how the radiation is spread in the patient.

Figure 1. A patient in radiation treatment. A marks where the ray source is located and B marks where the collimator is located. Figure from [48].

The collimators block the radiation, generated by an accelerator, from hitting the patient. The goal is to only hit the tumor, but radiation leakage is unavoidable. This is due to limitations in the collimator. A collimator is constructed with a set of leaves made out of a radiation-blocking material, typically tungsten. The number and width of the leaves determine what shapes can be created and at which precision. The thickness determines how much radiation will pass through to the blocked areas of the patient. Sometimes a backup single-leaf collimator is positioned in line with the most open leaf to reduce the radiation leakage even further. The leaves are movable and allow the collimator to change shape to best fit a tumor for different angles around a patient. There is also another single-leaf collimator in the other direction, orthogonal to the multi-leaf collimator, which is called the jaw. Usually, the tumor is exposed to radiation from a couple of directions around the patient, but there are also treatment types where it is exposed to radiation from all angles around the patient in two dimensions.


Figure 2. A two-step dose calculation model. Figure from [5].

As seen in Figure 2, the radiation from a radiation source is projected onto a plane with the effects of the collimators, which shape the radiation beam. This is called a fluence map and contains the fluence in a two-dimensional plane of points. The fluence map is then used to calculate the radiation dose in the patient. The fluence Ψ at the point (x, y) is calculated as a sum of the contributions of multiple radiation sources:

$$\Psi(x, y) = \Psi_{\text{direct source}}(A, x, y) + \cdots$$

where A represents the collimator settings and x and y are coordinates in the fluence map. In this case only the radiation directly from the source is considered because it accounts for the major part of the fluence.

The dose D in the patient at the point x is then calculated as a function of the fluence:

$$D(x, P) = D(\Psi(A), P)$$

where A represents the collimator settings and P represents the body of the patient.

This thesis focuses on the first step in the dose calculation. The problem is complex enough to test and analyze the suitability of the OpenCL platform.

Since the goal is to calculate a fluence map, the problem is similar to rendering a scene using ray tracing in computer graphics. In ray tracing, every pixel in a two-dimensional virtual camera plane casts a ray which interacts with the scene and eventually hits a light source or goes to infinity. It is also called backward ray tracing because the rays are cast in the opposite direction of the actual photons. In forward ray tracing, rays are cast from the light sources and interact with the scene until they hit a pixel on the virtual camera plane or go to infinity. Both methods are in fact equivalent, but the implementation


details differ [42]. Backward ray tracing is more common in the literature, so it is the method used in this case study.

2.1.2 Ray tracing

Recursive backward ray tracing was first introduced by Whitted [7]. Whitted's model is based on Phong's model, but Phong's model only supports point lights infinitely far away from the objects in the scene [7, 8]. Whitted's model supports point light sources in the scene and uses recursion: when an object is hit, new rays are cast recursively from the point of intersection. The rendering equation is defined by Whitted as:

$$I = I_a + k_d \sum_{j=1}^{ls} (N \cdot L_j) + k_s S + k_t T$$

where I is the intensity, I_a is the intensity due to ambient light, k_d is the diffuse intensity coefficient, N is the unit surface normal, L_j is the vector in the direction of the j:th light source, k_s is the specular intensity coefficient, S is the intensity of light from the specular reflection, k_t is the transmission coefficient and T is the intensity of light from transmission.

The resulting intensity is composed of four parts: ambient, diffuse, specular and transmitted intensity. The ambient and specular intensities are removed in this model of the fluence map calculation. Whitted's model has one disadvantage: it supports only point light sources. In the calculation of the fluence map, it is important to account for the area of the ray source. That means that Whitted's model alone is not sufficient.

Accounting for the area of the light source is important in the case where only a part of the light source is visible. A natural way of calculating the area is to integrate over the visible area of the light source. Analytic integration is not a feasible technique in this case, so a numerical method has to be used. Here, a disc-shaped light source is used, which is not trivial to integrate over. By sampling over a simpler shape like a rectangle, which is easier to integrate over, the visible area of the disc can be determined. The simplest way is to do a uniform sampling over the smallest rectangle that fits the disc using the midpoint rule in two dimensions. This is done by subdividing the source into small rectangular parts:

$$\iint f(x, y)\,dx\,dy \approx A \sum_i \sum_j f(x_i, y_j)$$

where A is the area of each part and f(x_i, y_j) is the intensity at the center of each part [38].
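The midpoint rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis implementation: the function name `midpoint_disc_area`, the uniform source intensity (f is 1 inside the disc and 0 outside) and the n-by-n grid are all hypothetical choices made for the example.

```python
import math


def midpoint_disc_area(radius, n):
    """Estimate the area of a disc by midpoint sampling of its bounding square.

    Hypothetical helper: the bounding square is split into n x n equal
    sub-rectangles and the integrand is evaluated at each center.
    """
    cell = (2.0 * radius) / n        # side length of each sub-rectangle
    cell_area = cell * cell          # "A" in the midpoint formula
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Center of sub-rectangle (i, j), relative to the disc center.
            x = -radius + (i + 0.5) * cell
            y = -radius + (j + 0.5) * cell
            # f(x_i, y_j): assumed uniform source, 1 inside the disc, 0 outside.
            if x * x + y * y <= radius * radius:
                total += 1.0
    return cell_area * total


if __name__ == "__main__":
    estimate = midpoint_disc_area(1.0, 200)
    print(estimate, math.pi)  # the estimate approaches pi as n grows
```

Replacing the inside/outside test with a visibility test per sample point would turn the same loop into an estimate of the visible part of the source.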

In a scene where objects can hide a light source, visibility is also an important aspect. This matters when a point x in a pixel cannot see the point x' on the light source. A visibility function can encode this property:

$$s(x \mid x') = \begin{cases} 1 & \text{if there is a direct path between point } x \text{ and } x' \\ 0 & \text{if an object hides point } x \text{ from point } x' \end{cases}$$


The effect of the collimators can be determined by the amount of material a ray has to go through. Since the scene consists of several collimators and each collimator consists of one or more leaves, each leaf has to be tested for intersection by the ray. If the ray intersects a leaf, the amount of material (the thickness) it has to pass will affect the intensity of the ray that comes out of the material. The intensity can be calculated by the Beer-Lambert law:

$$I = I_0 e^{-\alpha d}$$

where I_0 is the initial intensity, α is the attenuation coefficient of the material and d is the thickness of the material [11]. The attenuation of a ray consists of both scatter and absorption, but it is assumed that if a ray is attenuated, it loses all its importance in the scene and can be omitted. The attenuation coefficient is determined by the material of the collimator leaf and the energy of the ray. One example of a collimator leaf is one made out of tungsten with a thickness of 7.8 cm [12]. The total absorption can be described as:

$$s_{\text{absorption}} = \prod_{\text{all collimator leaves}} e^{-C_{\text{absorption}} d_i}$$

where C_absorption is the absorption coefficient and d_i is the distance the ray has to pass through leaf i. This replaces the visibility function s for the case where the visibility is blocked by a collimator leaf.

The distance from the pixel to the light source is also a factor to take into account, because a ray source loses intensity as a function of distance. By projecting the light source onto a unit half sphere centered at the ray origin, the intensity loss with distance can be calculated. The shape of the light source is approximated by a rectangle. The distance decay is calculated by:

$$I_{\text{distance decay}} = \frac{\alpha_x \cdot \alpha_y}{2\pi}$$

where α_x is the angle around the x-axis and α_y is the angle around the y-axis.
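To make the attenuation and distance-decay factors concrete, the sketch below evaluates both in Python. It assumes that the total absorption is the product of one Beer-Lambert factor per leaf and that the distance decay is the product of the two angles divided by 2π. The function names and the numeric values in the usage part are made up for illustration and are not taken from the thesis or from any measured material data.

```python
import math


def transmitted_fraction(absorption_coefficient, path_lengths):
    """Fraction of intensity left after the ray has crossed every leaf.

    path_lengths holds the distance travelled inside each intersected leaf,
    in the same length unit as 1 / absorption_coefficient (assumption).
    """
    fraction = 1.0
    for d in path_lengths:
        # One Beer-Lambert factor exp(-C * d_i) per intersected leaf.
        fraction *= math.exp(-absorption_coefficient * d)
    return fraction


def distance_decay(alpha_x, alpha_y):
    """Distance decay from the angular extent of the source (radians)."""
    return (alpha_x * alpha_y) / (2.0 * math.pi)


if __name__ == "__main__":
    # Illustrative numbers only: coefficient 0.5 per cm, two leaves crossed.
    print(transmitted_fraction(0.5, [1.0, 2.0]))
    print(distance_decay(0.1, 0.05))
```

Because the exponentials multiply, the result depends only on the total path length through the material, so a ray that misses every leaf keeps a fraction of 1.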

The resulting intensity in a pixel is described by:

$$I(x', y') = I_{\text{distance decay}} \cdot A \sum_i \sum_j f(x_i, y_j)\, s(x_i, y_j \mid x', y')$$

where f(x_i, y_j) is the intensity in the point (x_i, y_j) at the source and s(x_i, y_j | x', y') is the visibility between the point (x', y') and the point (x_i, y_j) at the source. The total intensity in a pixel is the integral over the entire source, using the midpoint rule.

2.1.2.1 Intersection algorithms

If and where a ray hits an object on its way towards a light source is an integral question in ray tracing. Therefore, the selection of algorithms for finding the intersections between rays and objects is important. All the following algorithms are standard algorithms in ray tracing. They have probably been developed for a CPU and not for any specialized hardware.
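As an illustration of such a standard algorithm, the following is a minimal Python sketch of the Möller-Trumbore ray-triangle test [13], the triangle test used in this case study. It is a plain CPU version written for readability, not the OpenCL kernel from the thesis. The helper names, the tuple-based vectors and the `eps` tolerance are assumptions made for the example. Besides the hit distance it also returns the intersection point, computed as the ray origin plus the direction scaled by the distance.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Return (distance, point) if the ray hits the triangle, else None."""
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(direction, edge2)
    det = dot(edge1, pvec)
    if abs(det) < eps:
        return None                      # ray is parallel to the triangle plane
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, pvec) * inv_det        # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    qvec = cross(tvec, edge1)
    v = dot(direction, qvec) * inv_det   # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge2, qvec) * inv_det       # distance along the ray
    if t < 0.0:
        return None                      # triangle lies behind the ray origin
    point = tuple(o + t * d for o, d in zip(origin, direction))
    return t, point


if __name__ == "__main__":
    tri = ((0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 1.0, 5.0))
    print(ray_triangle((0.25, 0.25, 0.0), (0.0, 0.0, 1.0), *tri))
```

The function works directly on the triangle vertices and the ray, with no precomputed plane equation, which matches the low memory requirement that motivates this choice of algorithm.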


A scene can have three kinds of primitive objects: triangles, axis-aligned boxes and discs. A disc is only used for the ray source, axis-aligned boxes are used for bounding volumes, and all other objects are built out of triangles. That means that algorithms for intersection checks are needed between rays and triangles, axis-aligned boxes and discs. There is quite a lot of research on fast intersection algorithms, especially for axis-aligned boxes and triangles, because they are common in ray tracing. It is worth looking into these rather than using the naïve approach.

The ray-triangle intersection used in this case study is the one from Möller and Trumbore [13]. Its performance is good and the required memory is relatively low. It also requires no precomputation of the plane equation, inverse direction vectors or ray type. That makes it a good fit for this implementation, because it uses the data that is available. The standard form of this intersection algorithm gives a true or false intersection result and the distance from the ray origin to the intersection point. With an adjustment to the intersection algorithm, the intersection point itself can be calculated:

$$ip = p_0 + v_{\text{direction}} \cdot \text{distance}$$

where p_0 is the ray origin, v_direction is the normalized ray direction and distance is the distance from p_0 to the intersection point ip given by the intersection algorithm. Getting the intersection point is necessary to enable refraction of a ray when the ray enters a material.

The intersection of a ray and an axis-aligned box is also an essential intersection test. In this case the intersection algorithm from Williams et al. [14] is used, but without the precomputed inverted ray direction, to save memory. This algorithm relies on some of the properties of the IEEE-754 floating point standard: when a positive number is divided by zero the result is +∞ and when a negative number is divided by zero the result is -∞. OpenCL supports the IEEE-754 floating point standard [15, p. 248].

2.1.2.2 Bounding Volumes

Instead of testing every triangle in the scene for intersection with a ray, triangles can be grouped in a bounding volume which can be checked for intersection. If a ray intersects a bounding volume, all its triangles are checked for intersection. If it does not intersect a bounding volume, none of the triangles have to be tested for intersection with the ray. That makes it possible to skip intersection tests between a specific ray and potentially most triangles in the scene, depending on how the bounding volumes are constructed.

Any type of volume can be used as a bounding volume, but axis-aligned boxes are common because of their low memory requirements (two points, min and max) and the fast intersection algorithms that are available.

In scenes with a large number of triangles and a large number of bounding volumes it is also common to use hierarchies of bounding volumes. The hierarchy forms a tree structure and if a node is intersected, then its child nodes


are also tested for intersection. In this case study, hierarchies of bounding volumes are not used.

2.2 Non-functional software requirements

Non-functional software requirements describe desired non-functional characteristics of a system. They describe a property or a quality a system must have to make its functionality usable [39].

2.2.1 Safety

A software failure in a cancer treatment planning system can result in injuries or even death. Such a system is defined as a safety-critical system [1, p. 300]. In a system as complex as a cancer treatment planning system, formal verification is not viable, so verification has to be done through testing.

2.2.2 Portability

If the same code base for a performance critical algorithm could be used to run on different kinds of hardware

Radiation therapy planning software is used in all parts of the world and the resources of each individual hospital can be very different. Imposing too strict hardware requirements can be a disadvantage when selling. One such example is to require a GPU that supports OpenCL and has error-correcting code (ECC) memory, which is required today for medical hardware of this kind. In a treatment facility, several computers are often used to access the treatment planning software. If some of the computers have cheaper hardware and can still run the software, but with lower performance, that is a good selling point. Medical staff with lower salaries can use the slower computers when the faster and more expensive computers are occupied by doctors doing the treatment verification. Radiation equipment such as accelerators and collimators is bought separately from information and planning systems. In Sweden the procurements for these different categories of hardware are required to be separate. In practice that makes the hardware costs of information and planning systems a more important factor, since the cost of information and planning systems does not get hidden by the cost of the other radiation equipment. For the reasons given above, portability is an important factor when developing medical software of this kind.

One can distinguish between several types of portability. Two of them are discussed here: functional portability and performance portability. Functional portability is when software is portable across several platforms. Even if software is designed and tested on one platform, it can be run on another platform with the same functionality. If the functionality is to multiply two matrices, the result of multiplying the same two matrices should be the same on all platforms, apart from floating point differences. Performance portability is when performance is portable across platforms. This is not the case for heavily optimized software. Studies have shown that auto tuning can accomplish at least some level of performance portability [2, 3].

2.2.3 Maintainability and performance

Maintenance is defined by Sommerville as one or more of the following activities on existing software: repairing faults, adapting to a changed environment


or introducing new functionality. Writing code that is easy to maintain is essential for keeping down software development costs. Maintenance costs usually take up two thirds of the total cost in an IT project [1, p. 242]. In medical applications, maintenance costs are expected to be higher because of a greater need for verification testing.

Performance is described by van Vliet as: speed, efficiency, resource consumption, throughput and response time [39]. The performance that matters in this case is how fast a fluence map can be calculated given a precision requirement, which depends on the number of samples per second (throughput) and techniques for minimizing the number of samples (efficiency). Later, in section 4.2.1, the performance is measured as throughput.

The common way of programming for GPUs is to write performance-focused code that is optimized for a single specific hardware architecture. With that kind of optimized code, the performance is often not portable across hardware architectures from different manufacturers or even across architectures from the same manufacturer [2, 4]. Hardware architectures are in general updated every second or third year [6, 16]. To adapt to the changed environment and support all the new capabilities of the newest architecture while still supporting the older architectures, several code bases have to be maintained, one for each hardware architecture. Duplicated code is considered by Fowler to be the worst problem in code [41, p. 76]. If maintenance is done on the software, all code bases have to be updated. All code bases also have to be tested separately. This is both expensive and complex. It is much preferred to only have to maintain a single code base for the performance critical algorithms.

A system designed with a focus on maintainability is potentially less costly to maintain and test [40, p. 459]. Since maintenance costs are a large part of the total cost of a system, choosing not to focus on maintenance can be costly. On the other hand, focusing on performance gives a better product that can bring in higher revenue through more sales. Unfortunately, strategies for creating maintainable code and well-performing code can be opposing [41, p. 69]. For instance, large software components can give better performance but are also harder to maintain [1, p. 153]. On the other hand, a well-structured and easy-to-maintain program can be easier to tune for performance [41, p. 69], so not all strategies are necessarily opposing. What the trade-off between maintainability and performance comes down to is costs and potential revenue gains.

2.3 OpenCL

OpenCL is a framework for programming a collection of heterogeneous hardware resources including CPUs and GPUs. It includes a programming language, an API, libraries and a runtime system. The programming language is based on C99 with some restrictions and some extensions [15].

2.3.1 The OpenCL architecture

The OpenCL architecture consists of four models: the platform model, the execution model, the memory model and the programming model [15, Section 3].


The platform model consists of a host device and one or more OpenCL devices. Every OpenCL device can then include one or more compute units, each of which includes one or more processing elements.

The execution model defines how execution is done. A host device sets up a context with OpenCL devices, kernels, program objects and memory objects. A kernel is a function that can be run on an OpenCL device, initiated by the host device. Each OpenCL device has its own program object to implement a kernel, which is usually compiled and linked at runtime. Version 1.2 of the specification separates compilation and linking and supports off-line compilation of kernels. In that way kernels can be precompiled and distributed with an executable without the need to compile the kernel at runtime [17]. Memory objects map objects in the memory of the host device to the memory of an OpenCL device. Sometimes (especially for GPUs) the memory objects have to be transferred to the memory on the device, which incurs an overhead.

The execution model also defines how parallel work of kernels is structured. It is structured into an N-dimensional index space (called NDRange) of one to three dimensions. The index space consists of work-groups with the same number of dimensions as the index space. The smallest part is called a work-item and is one running instance of a kernel. Each work-item has both a global ID in the index space and a local ID in its work-group. Each work-group also has a work-group ID. All work-items in a work-group are executed concurrently [15, section 3.2].

Figure 3. The execution model with index space, work-groups and work-items. Figure from [15, p. 24].

The memory model has four distinct memory spaces: global, constant, local and private memory. The global memory is accessible by every work-item in the index space. The constant memory is also accessible by every work-item in the index space but is read-only. Local memory is only accessible from work-items within the same work-group. Private memory is accessible only by its work-item.
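The relationship between global IDs, work-group IDs and local IDs can be illustrated without any OpenCL runtime. The sketch below is plain Python that only mimics the index arithmetic of a two-dimensional NDRange with no global offset. The function name `work_item_ids` is hypothetical, and the check that the global size is a multiple of the work-group size reflects the requirement in the OpenCL 1.x versions discussed here.

```python
def work_item_ids(global_size, local_size):
    """Yield (global_id, group_id, local_id) for every work-item in a 2-D NDRange."""
    gx, gy = global_size
    lx, ly = local_size
    if gx % lx or gy % ly:
        raise ValueError("global size must be a multiple of the work-group size")
    for y in range(gy):
        for x in range(gx):
            group_id = (x // lx, y // ly)   # which work-group the item belongs to
            local_id = (x % lx, y % ly)     # position inside that work-group
            yield (x, y), group_id, local_id


if __name__ == "__main__":
    for global_id, group_id, local_id in work_item_ids((4, 4), (2, 2)):
        print(global_id, group_id, local_id)
```

In every dimension the global ID equals the work-group ID times the work-group size plus the local ID, which is the same relation a kernel sees through `get_global_id`, `get_group_id` and `get_local_id`.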


Table 1. Memory spaces and their allocation and accessibility capabilities. Table from [15, p. 27].

The programming model in OpenCL explicitly supports both the data parallel programming model and the task parallel programming model, but the data parallel model is the only one that gives good performance on today's GPUs [15, 18].

The architecture of OpenCL largely reflects the architecture of GPUs [18].

2.3.2 The OpenCL programming language

The OpenCL programming language is based on C99, but has both restrictions and extensions to it [15, chapter 6].

Some of the extensions include:
• Implementation of four disjoint address spaces: global, local, constant and private.
• The __kernel function qualifier.
• The __attribute__ qualifier.

The address space qualifiers are used to define in what region of memory a variable is allocated upon variable declaration. The __kernel qualifier declares functions as kernels which can be executed on an OpenCL device, initiated by a host device.

The padding of structures can be adjusted by using the __attribute__ qualifier. The attribute packed is used to minimize the required memory. For alignment purposes the packed attribute is not always the most appropriate. Sometimes extra padding can make a data type better aligned to fit the hardware [19].
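The effect of padding can be seen from the host side in Python, which is relevant when structures are passed from Python to a kernel. The sketch below uses the standard `struct` module as a stand-in for a C structure holding one 8-bit flag followed by one 32-bit float. The format strings and the function name `layout_sizes` are assumptions made for this example, and the native size depends on the platform ABI, so no particular padded size is guaranteed.

```python
import struct

# One unsigned 8-bit field followed by one 32-bit float.
NATIVE = "@Bf"   # native alignment: padding may be inserted before the float
PACKED = "=Bf"   # standard sizes, no padding, comparable to a packed struct


def layout_sizes():
    """Return (native_size, packed_size) in bytes for the example layout."""
    return struct.calcsize(NATIVE), struct.calcsize(PACKED)


if __name__ == "__main__":
    native_size, packed_size = layout_sizes()
    print("native:", native_size, "bytes")   # typically 8 on common platforms
    print("packed:", packed_size, "bytes")   # 1 + 4 = 5
    print(struct.pack(PACKED, 1, 2.0).hex())
```

If the host packs data with one layout and the kernel declares the structure with the other, the fields are read at the wrong offsets, so both sides have to agree on whether the structure is packed.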


Some of the restrictions include:
• No recursion.
• No dynamic memory or variable-sized arrays.
• Pointers to functions are not allowed.
• A pointer pointing to one address space cannot be cast to point to another address space.

OpenCL C supports the IEEE-754 floating point standard for single precision floating point numbers. Double precision floating point numbers are available as an optional extension up to version 1.1 of the OpenCL standard and as an optional core feature in version 1.2 [15, 17]. OpenCL C also supports vectors of dimensions 2, 3, 4, 8 and 16 with common vector operations such as addition, subtraction, multiplication and division by vector or scalar, dot product, cross product, normalization and length.

2.3.3 Tool support

Tools can be an important part of software development. Tools for debugging and profiling are a great help for writing good programs [1 p. 197, 40]. Most vendors with their own implementation of the OpenCL standard supply their own set of tools.

2.3.3.1 NVIDIA

NVIDIA supplies a set of tools for their proprietary but free platform CUDA. Some of them also support OpenCL, but often with a limited set of features. This includes the Visual Studio plugin Parallel Nsight, which can debug and profile CUDA kernels. For OpenCL, Parallel Nsight is limited to profiling, and only with a heavily limited set of information such as memory usage and timings of kernels.

NVIDIA also supplies the cross-platform tool Visual Profiler, which supports profiling of CUDA as well as OpenCL kernels. The profiler can give information on memory usage, kernel timings, register usage, number of threads, number of divergent threads, reads and writes to global memory, and occupancy. It can also give hints on optimization areas and on what is limiting performance for individual kernels.

NVIDIA's compiler can also give some useful statistics at compile time. With the compiler option -cl-nv-verbose, the compiler can output stack, register and shared memory usage for individual kernels [35].

2.3.3.2 Intel

Intel supplies a Visual Studio plugin for debugging OpenCL kernels. Unfortunately it only supports C/C++ projects and cannot be used for this study because Python projects are used.

Intel also supplies an offline kernel compiler which can output assembly code [36].

2.3.3.3 AMD

AMD supplies several Visual Studio plugins for OpenCL. gDebugger is for debugging and APP Profiler is for profiling OpenCL kernels. Unfortunately they both

(21)

only!supports!C/C++!projects!and!cannot!be!used!for!this!study.!Parts!of!the! profiling!features!are!accessible!through!a!commandWline!utility!that!can!profile! kernels!executed!from!any!environment.!The!commandWline!utility!can!generate! a!data!file!which!can!be!opened!by!the!Visual!Studio!plugin!to!show!the! information!in!a!more!appealing!way.!Unfortunately!the!information!is!sparse.! AMD!also!supplies!an!offline!kernel!compiler!which!can!output!assembly!code!for! CPU’s!and!different!families!of!GPU’s.![37]! 2.4)Automatic)tuning) Finding!good!optimizations!for!a!software!in!an!environment!can!be!hard!and! tedious!work!and!often!requires!expert!knowledge.!In!this!case!it!would!require! expert!knowledge!of!rayWtracing,!hardware!architectures!and!programming.!A! framework!for!automatic!tuning!can!replace!an!expert!in!finding!good! optimizations.!It!also!scales!better!because!for!every!new!instance!of!a!system! that!needs!to!be!tuned,!one!simply!needs!to!copy!and!include!the!auto!tuning! framework.!A!human!expert!can!only!work!with!one!instance!at!a!time.!For!an! installation!of!a!medical!specialist!application!it!is!reasonable!that!the!automatic! tuning!is!part!of!the!installation!process!and!that!it!may!be!allowed!to!run! overnight.!Any!more!than!that!might!cause!inconvenience!for!the!staff.! 2.4.1)Related)work) There!are!a!couple!of!implementations!and!articles!using!auto!tuning!where! FFTW!and!ATLAS!are!the!most!famous!ones.! FFTW!is!an!implementation!of!the!discrete!Fourier!transform.!It!contains! fragments!that!can!be!composed!to!an!implementation!that!calculates!the! Fourier!transform.!Different!fragments!contain!different!optimizations!and!the! combination!of!fragments!construct!a!search!space!of!Fourier!transform! calculators.!The!search!space!is!then!explored!with!dynamic!programming!to! find!the!fastest!one![32].! ATLAS!(Automatically!Tuned!Linear!Algebra!Software)!is!a!project!to!produce!a! 
performance!portable!linear!algebra!library.!As!FFTW!it!generates!optimized! code!from!an!abstract!description.!The!performance!of!the!different! optimizations!is!timed!to!find!out!which!one!is!the!best.!ATLAS!changes!the! blocking!factor!and!loopWunrolling!among!other!things!that!are!hard!to!predict.! To!lower!the!searchWspace!of!linear!algebra!libraries,!it!uses!a!search!heuristic! that!determines!the!best!value!for!one!optimization!at!a!time!which!finds!a!local! optimum![33].! GATLAS!is!an!attempt!at!implementing!ATLAS!on!a!GPU!using!OpenCL.!OpenCL! source!code!are!generated!from!C++!template!classes.!Optimizations!are!applied! to!a!base!class!using!inheritance!and,!where!the!C++!metaWprogramming!facilities! are!not!sufficient,!the!mixin!pattern.!It!uses!expectationWmaximization!and! dynamic!programming!to!find!the!best!optimizations.!Optimizations!are!workW group!size,!data!layout,!inner!blocking!among!others![34].! Maestro!is!an!open!source!library!for!data!orchestration!on!one!or!more!OpenCL! devices.!It!uses!empirical!autoWtuning!to!tune!workWgroup!sizes,!buffer!chunk! sizes!and!load!balancing!between!multiple!OpenCL!devices![3].!!
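The one-optimization-at-a-time heuristic attributed to ATLAS above can be illustrated with a minimal sketch. It is not taken from ATLAS itself: the function name `coordinate_search`, the parameter names `block` and `unroll`, and the cost function `toy_cost` are hypothetical stand-ins for real optimizations and measured run times.

```python
def coordinate_search(space, cost):
    """Tune one parameter at a time while the others stay fixed.

    space: dict mapping parameter name -> list of candidate values
    cost:  callable taking a full configuration dict, lower is better
    Returns the chosen configuration and the number of cost evaluations.
    """
    # Start from the first candidate of every parameter.
    config = {name: values[0] for name, values in space.items()}
    evaluations = 0
    for name, values in space.items():
        best_value, best_cost = config[name], float("inf")
        for value in values:
            trial = {**config, name: value}
            trial_cost = cost(trial)
            evaluations += 1
            if trial_cost < best_cost:
                best_value, best_cost = value, trial_cost
        # Freeze this parameter before moving on to the next one.
        config[name] = best_value
    return config, evaluations


def toy_cost(config):
    # Hypothetical stand-in for a measured run time.
    return abs(config["block"] - 32) + abs(config["unroll"] - 4)


space = {"block": [8, 16, 32, 64], "unroll": [1, 2, 4, 8]}
best, evaluations = coordinate_search(space, toy_cost)
print(best, evaluations)
```

The search evaluates the sum of the candidate counts (here 4 + 4) instead of their product (4 * 4). In this toy case the parameters do not interact, so the result is also the global optimum; when parameters interact, only a local optimum can be expected.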


2.4.2 Model based optimization

Model based optimization is when a model is supplied to the auto tuning framework. The model can, for instance, be a model of the hardware architecture, with a map of the different memories, their speeds and properties, and the arithmetic units. Based on that map, the auto tuning framework can tune the program so that it utilizes the available hardware to the maximum. One example is setting buffers to exactly fit the available memory. On a GPU, that could mean setting the size of a buffer to exactly fit the local memory size reported by OpenCL for the device. A problem is that a model is not always available, and if it is available, it can be wrong. If a model is missing, tuning cannot be done at all. If the parameters are tuned to the wrong model, the result is not optimal for that hardware. An advantage of model based optimization is that it enables precalculation of optimal optimizations without access to the actual environment where the system is installed [31]. This can shorten the installation process of the software that is being optimized.

2.4.3 Empirical optimization

Empirical optimization of software is when the software runs tests to figure out which optimizations work best for a given environment. Essentially no information about the environment is needed beforehand, but access to run tests in the environment is required. Therefore no precalculation of optimizations can be done.

ATLAS sets up a set of requirements for automatic empirical optimization of software [33]:

• isolation of performance-critical code
• a method of adapting software to differing environments
• robust and context sensitive timers
• appropriate search heuristic

If the search space is big, a search heuristic is needed to find a solution in reasonable time. A heuristic is only guaranteed to find a local optimum, but that is probably good enough. If the search space is small, all combinations of optimizations can be tested and, given correct timing data, the globally optimal set of optimizations in the search space will be found.
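The exhaustive case can be sketched as follows. This is a minimal illustration and not the tuner used in this thesis: the parameter names are borrowed from the optimization parameters discussed later, while `fake_measure` and its numbers are invented so that the example is deterministic. The helper `wall_clock` shows one simple way to approach the "robust timer" requirement by keeping the best of several repeated timings.

```python
import itertools
import time


def wall_clock(run, repeats=3):
    """Turn run(config) into a measure returning the best of several timings."""
    def measure(config):
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run(config)
            timings.append(time.perf_counter() - start)
        return min(timings)
    return measure


def exhaustive_search(space, measure):
    """Measure every combination in the search space and return the fastest."""
    names = list(space)
    results = []
    for values in itertools.product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        results.append((measure(config), config))
    best_time, best_config = min(results, key=lambda item: item[0])
    return best_config, best_time, len(results)


def fake_measure(config):
    # Synthetic timings, for illustration only.
    penalty = 0.0 if config["algorithm"] == "MT2" else 0.2
    return 1.0 + 0.5 * config["depth_first"] + 0.1 * abs(config["pieces"] - 2) + penalty


space = {
    "depth_first": [True, False],
    "pieces": [1, 2, 4],
    "algorithm": ["DS", "MT1", "MT2", "MT3"],
}
best_config, best_time, tested = exhaustive_search(space, fake_measure)
print(best_config, best_time, tested)
```

In a real tuner, `fake_measure` would be replaced by something like `wall_clock(run_kernel)`, where `run_kernel` is a hypothetical function that builds and executes the kernel for a given configuration.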


3 Design and implementation

3.1 Program structure and implementation details

To test if an acceptable trade-off between maintainability and performance can be achieved, the structure is initially focused more on being maintainable than on giving good performance, and gradually shifts towards performance as optimizations are applied. A number of strategies are used to achieve maintainable code. These include separation of concerns, reuse, modularity, understandable structure and clarity in code.

3.1.1 Modules

The OpenCL code is structured into modules according to functionality. A module consists of a header file, a source file and a unit test file. There is one module for primitive scene objects and intersection algorithms, one for collimator objects and one for ray tracing and algorithms. This gives a hierarchical dependency graph. Each module can be tested individually, given that its dependencies are fulfilled. It is the ray tracing module that exposes functionality to the host.

The Primitives module contains all the primitive scene objects and their intersections. Primitive scene objects are: Ray, Triangle, Rectangle, Plane, Disc, BoundingBox and Box. A Ray is represented by an origin and a direction vector. A Triangle is represented by three vertex points. A Rectangle is made up of two triangles. The difference between a BoundingBox and a Box is that a BoundingBox is an axis-aligned box represented by a minimum and a maximum point, whereas a Box is made up of 10 triangles, with the back face missing.

The Collimator module contains the definition of a collimator and the different kinds of scene representations of it. It also contains intersection tests that depend on the Primitives module.

The Ray tracing module contains all ray tracing functionality and exposes OpenCL kernels to the host. The complete fluence map calculation is separated into three steps, each implemented as a separate kernel. First, the intensity for each ray is calculated when it is cast from a point on the fluence plane towards the ray source. Second, the intensity decay is calculated. Third, all the intensities that make up the total intensity of a pixel are summed up and multiplied with the intensity decay factor. Because the ray source is not a point but a disc, several rays are cast from each pixel to integrate over the visible area of the ray source as described in 2.12.

3.1.2 OpenCL C language considerations

OpenCL C does not support classes, so all objects are defined by C structs. All object specific functionality is located in functions within the respective module. Functionality is reused whenever possible. One example is the intersection test between a Ray and a Box, which uses the intersection test between a Ray and a Triangle. Since OpenCL C does not support variable-sized arrays, all array sizes have to be known before execution. Kernels are compiled at runtime, so constants that define array sizes can be set by the host before compilation. OpenCL supports macros being set as a compiler option, which can be used to solve this problem [15].
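A minimal sketch of how the host could turn such constants into compiler options is shown below. The macro names `MAX_TRIANGLES` and `DEPTH_FIRST`, the struct `TriangleBuffer` and the helper `build_options` are hypothetical and only illustrate the mechanism; the `-D name=definition` form is the preprocessor option defined by the OpenCL specification.

```python
# Hypothetical kernel fragment: the array size is a macro that the host
# defines at build time, since OpenCL C has no variable-sized arrays.
KERNEL_SOURCE = """
typedef struct {
    float4 vertices[MAX_TRIANGLES * 3];
} TriangleBuffer;
"""


def build_options(defines):
    """Format a dict of macro definitions as an OpenCL build option string."""
    parts = []
    for name, value in sorted(defines.items()):
        if isinstance(value, bool):
            value = int(value)  # True/False become 1/0 so that #if works
        parts.append(f"-D {name}={value}")
    return " ".join(parts)


options = build_options({"MAX_TRIANGLES": 10, "DEPTH_FIRST": False})
print(options)

# With PyOpenCL, the string would then be passed to the build step, e.g.
#   pyopencl.Program(context, KERNEL_SOURCE).build(options=options)
# (not executed here, since it requires an OpenCL device).
```

Because the options are part of the build, changing a value means recompiling the kernel, which is acceptable here since kernels are compiled at runtime anyway.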


3.1.3 Accuracy adjustment

The fluence can be calculated with different degrees of accuracy. When the positions of the collimator leaves are being optimized to fit a tumor, only a rough estimation is needed, and the execution time of the calculation should be as low as possible to enable the optimizer to try many possible positions. On the other hand, when a final setup of the collimator is decided on, the calculation should be as accurate as possible, within limits. There is therefore a need to adjust the degree of accuracy to obtain either a fast or an accurate fluence calculation. This can be done in a number of ways: the accuracy of the geometry of the scene objects, the accuracy of the integration of the light source, the resolution of the fluence map or the numerical accuracy can be adjusted. The primary way to adjust the accuracy here is chosen to be the representation of the collimator leaves. This implementation supports three different types of representations for the collimator: one where the leaves are infinitely flat rectangles, one where the leaves are axis-aligned boxes and one where the leaves are focused boxes. Focused in this context means that they are shaped to minimize soft shadows made by the collimator on the fluence plane. All representations are generated from a more general description of a collimator which has the minimum amount of information. That allows the approximated representation of the collimator to differ depending on, for instance, the design of the collimator blade. All generation of collimator geometry is done by the host device for simplicity, but it could as well be done on an OpenCL device.

3.1.4 Parallelization and concurrency

The ray tracing is implemented so that the intensity from each ray is calculated independently of every other ray. That enables flexible use of auto tuning. Grouping rays together using packet traversal is common in ray tracing, but the results by Aila and Laine [28] show that independent ray traversal is more efficient on GPUs than packet traversal.

The three steps in the calculation of the fluence map are each separated into their own kernel. This gives an implicit synchronization between the steps, since the kernels execute one at a time. The last summation step depends on this synchronization, so that it does not start before all calculations have been done. Explicit synchronization is only done when global data is copied to a local buffer in a work-group.

3.1.5 General optimizations

Depending on the type of OpenCL device, the transfer of memory objects from the host to the device can be a considerable overhead. Typically the transfer is not needed on a CPU, but it is needed on a GPU because of its separate memory. Therefore the size of the memory objects should be minimized [19]. In this case, the memory objects (scene information and result) are small, so the overhead is negligible (see Table 13).

The majority of the scene data is in the vectors that represent vertices in triangles. Therefore that data is separated into its own array. That is good both because it makes it easier to cache that data in local memory and because it makes the accesses more aligned.


A common way of structuring data is to have one array that contains several C structs of the same type. This pattern is called array of structures. Commonly only one variable in a structure is read at a time, in a loop over all structures in an array. This can prevent aligned access. A way to make the memory access aligned is to instead create one C struct with an array for each variable, containing the values for all objects. This pattern is called structure of arrays. All C structs are structured according to the structure of arrays pattern. This is the way to structure data recommended by both NVIDIA and Intel because it uses a memory access pattern that is more cache-friendly [19, 20].

Array of structures:  Object 0: X, Y, Z | Object 1: X, Y, Z | Object 2: X, Y, Z
Structure of arrays:  X: 0 1 2 | Y: 0 1 2 | Z: 0 1 2

Figure 5. Illustration of the difference between array of structures and structure of arrays.

3.2 Optimization parameters and automatic tuning

Optimization parameters are parameters that can be changed to alter the behavior of the program in order to maximize the performance for a specific platform and hardware. The optimization parameters are grouped into categories according to what kind of behavior they change. The categories are: work-group size and shape, use of address spaces, structure, scene and intersection algorithms.


| Category | Parameter | Valid values | Notes |
|---|---|---|---|
| Work-group size and shape | X | integer in [1, ∞) | Has to be lower than the index space. |
| | Y | integer in [1, ∞) | |
| | Z | integer in [1, ∞) | |
| Address spaces | Ray | private, local | |
| | Scene information | constant, global | |
| | Triangle data | local, constant, global | |
| | Triangle data buffer | private, local, constant, global | |
| Structure | Depth-first | [True, False] | False means breadth-first. |
| Scene | Pieces | integer in [1, ∞) | Max is the number of collimator leaves. |
| Intersection algorithms | Triangle intersection algorithm | [DS, MT1, MT2, MT3] | |

Table 2. Summary of the optimization parameters.

3.2.1 Work-group size and shape

The work-group size decides how many work-items are grouped together in the same work-group. Typically, work-items in the same work-group have access to a fast on-chip memory where shared data can be stored. If shared data is copied from the global address space to the shared on-chip address space, a performance gain can be expected, provided that the data is reused often enough to outweigh the cost of copying it to the on-chip memory. The size of the work-group often decides how much data is allocated in the shared memory space. The best size depends on many factors, and the following considerations have to be taken into account:

• Buffering of data in the on-chip shared memory space reduces the number of accesses needed to the slower global memory space.
• If too much data is allocated in the on-chip memory, the program fails to run.
• The on-chip memory is sometimes used as an automatic cache. More memory allocated by the program can mean a smaller cache and less automatic caching of data from the global address space [25].
• Registers can be stored in the on-chip memory. Each work-item needs a certain number of registers, and if the on-chip memory is full, the registers are spilled over to slower memory or the program fails to run.

On GPUs, the work-group size and shape can have a big impact on the performance because it is usually what decides how much data is allocated to the controllable on-chip memory. A common approach is to try to make use of the on-chip memory as much as possible, but without overflowing it. On CPUs the work-group size is not as important because OpenCL typically does not have control over the faster caches. Intel suggests not setting the work-group size at all [19, 20, 24, 25]. This implementation can adjust the work-group size and shape in three dimensions. OpenCL implementations set limits on the size and shape of a work-group, where the X and Y dimensions typically support a larger width than the Z dimension.

| Vendor | X size | Y size | Z size |
|---|---|---|---|
| NVIDIA compute capability 1.0 to 2.x | 65535 | 65535 | 64 |
| NVIDIA compute capability 3.0 | 2^31 - 1 | 65535 | 64 |
| AMD GPU with SDK 2.1 | Product of all dimensions ≤ 256 | | |
| Apple Mac OS X | Dependent on hardware | | |

Table 3. Overview of maximum allowed size per dimension and implementation [24, 26, 27].

3.2.2 Address spaces

OpenCL supports allocation in four different memory spaces: global, constant, local and private [15]. Since no guarantees are made on the performance of the memory spaces, it is not known in advance how and where data should reside to make the best use of the hardware.

| Data | Type | Size | Available memory spaces |
|---|---|---|---|
| Rays | R/W | 32 bytes | private, local |
| Scene information | Read only | 1512-16172 bytes | constant, global |
| Triangle data | Read only | 39360 bytes | local, constant, global |
| Triangle data buffer | R/W* | 480-19200 bytes | private, local, constant, global |

Table 4. Data, type, size and where it can be allocated for each work-item using the test scene described in 4.1.2.

Rays are allocated and created on the OpenCL device, so only the private and local address spaces are available. The host could allocate global memory to store rays in, but this is not done, for simplicity.

The triangle data buffer caches the triangle data of a single scene object. When the buffer is constant or global, it is just a pointer to constant or global memory and nothing is copied. When both the triangle data and the triangle data buffer are in the local address space, all scene objects are copied to the triangle buffer at the start. When the buffer is private or local and the triangle data is constant or global, data has to be copied to the chosen address space on the device, because data cannot be copied to that address space directly from the host device. Triangle data is then copied to the buffer when needed. The sizes of the scene information and the triangle data buffer are variable because a scene object can be split into smaller parts. The smallest division creates a triangle data buffer of 480 bytes and the full buffer is 19200 bytes. This is for a scene with two single leaf collimators and two forty leaf collimators as described in the test scene in 4.1.2. Multi-leaf collimators with 80 leaves also exist, which would increase the triangle data to 77760 bytes, which in turn would not fit in the local memory of an NVIDIA GPU with compute capability 2.0 (see Table 6).
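The sizes quoted above can be checked with a short calculation. The sketch below rests on two assumptions that are not stated in the text: that every scene object (leaf or jaw) takes the same number of bytes, inferred here by dividing the 39360 bytes of the test scene by its 82 objects, and that 1 kB means 1024 bytes and the whole local memory is available to a single work-group. The local memory sizes are those listed in Table 6.

```python
KIB = 1024  # assumption: 1 kB = 1024 bytes

# Local/shared memory sizes as listed in Table 6.
LOCAL_MEMORY_BYTES = {
    "NVIDIA compute capability 1.0-1.3": 16 * KIB,
    "NVIDIA compute capability 2.0+": 48 * KIB,
    "AMD GPU": 32 * KIB,
}

# Test scene: two 40-leaf collimators and two single-block jaws.
OBJECTS_IN_TEST_SCENE = 2 * 40 + 2
# Inferred per-object size (assumes all objects are equally large).
BYTES_PER_OBJECT = 39360 // OBJECTS_IN_TEST_SCENE


def triangle_data_bytes(leaves_per_collimator, collimators=2, jaws=2):
    """Total triangle data for a scene with the given number of leaves."""
    objects = collimators * leaves_per_collimator + jaws
    return objects * BYTES_PER_OBJECT


def fits_in_local_memory(size_bytes):
    """Map each device class to whether the data fits in its local memory."""
    return {device: size_bytes <= limit
            for device, limit in LOCAL_MEMORY_BYTES.items()}


for leaves in (40, 80):
    size = triangle_data_bytes(leaves)
    print(leaves, size, fits_in_local_memory(size))
```

Under these assumptions the per-object size comes out at 480 bytes, which agrees with the smallest triangle data buffer in Table 4, and the 80-leaf scene reproduces the 77760 bytes mentioned in the text.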


Private memory is usually allocated as registers in on-chip memory. It is fast, but takes up registers, which are a limited resource. It should be used for small amounts of data which need fast read and write access [19, 24].

| Vendor | Register memory |
|---|---|
| NVIDIA compute capability 1.0-1.1 | 32 kB / Multiprocessor |
| NVIDIA compute capability 1.2-1.3 | 64 kB / Multiprocessor |
| NVIDIA compute capability 2.x | 128 kB / Multiprocessor |
| NVIDIA compute capability 3.0 | 256 kB / Multiprocessor |
| AMD GPU | 128-256 kB / Compute unit |

Table 5. Table of register memory for different vendors [24, 26].

Local memory is also usually allocated in fast on-chip memory and is shared among work-items in a work-group [15, 20, 24, 26].

| Vendor | Size of local/shared memory |
|---|---|
| NVIDIA compute capability 1.0-1.3 | 16 kB / Multiprocessor |
| NVIDIA compute capability 2.0+ | 48 kB / Multiprocessor |
| AMD GPU | 32 kB / Compute unit |

Table 6. Table of local/shared memory for different vendors [24, 26].

Constant memory usually resides in the global space, but with different techniques for fast broadcasting [19, 24]. Both AMD and NVIDIA use a special cache for constants [24, 25].

| Vendor | Total amount of constant memory | Constant cache |
|---|---|---|
| NVIDIA | 64 kB | 8 kB / Multiprocessor |
| AMD GPU | Size of global memory | 4-48 kB / Compute unit |

Table 7. Table of amounts of constant memory and its caches [24, 26].

Global memory is usually the slowest, but also usually the biggest. It is cached on some hardware, but the size, speed and levels of cache differ [20, 24, 26].

CPUs do not have controllable on-chip memory, so buffering triangle data in local memory should only cause an overhead.

3.2.3 Structure

Aila [27] suggests that the structure of how rays are traced can have a big impact on performance for GPUs. He investigates different ways of looping through the rays and the nodes in a scene in the trace() function: either a while-while trace() or an if-if trace(). More descriptive names for the two concepts would be to trace rays in a breadth-first (while-while) or depth-first (if-if) manner. In the depth-first variant, the ray searches for the closest object that it intersects, then calculates the intersection intensity loss, then continues past the intersected object, and repeats this all the way to the ray source. In the breadth-first variant, the ray searches for all objects that it intersects and then calculates the intensity loss from each of the intersected objects.

3.2.4 Scene

All objects in the scene (except for the ray source) are embedded in bounding boxes. A bounding box can contain one or more objects. In this case every collimator has a surrounding bounding box. A collimator can be split into several distinct collimators in the scene, which effectively sets how many bounding boxes a collimator has. This also affects how big the objects that have to be cached in the triangle data buffer are. The Pieces parameter sets how many pieces a collimator is split into.

3.2.5 Intersection algorithms

For triangle intersection, there are four algorithm variants to choose from. They are all based on the MT triangle intersection algorithm [13], but with different optimizations that have been shown to perform differently on different hardware (with the exception of DS) [30].

| Name | Characteristic |
|---|---|
| DS | Cleaner and more compact code than original MT [29] |
| MT 1 | Original version from paper [13, 30] |
| MT 2 | Division at end [30] |
| MT 3 | Division early [30] |

Table 8. Table of ray triangle intersection algorithms and their characteristics.

3.2.6 Automatic tuning

An empirical optimization strategy is chosen. There are several reasons for that: ease of implementation, the lack of models for model based optimization, the writer's greater experience in that area, the possibility that it is less biased and more portable, and the greater amount of research in that area.

Since a code generating framework for generating optimized code from abstract descriptions is outside the scope of this master's thesis, a method of controlling optimizations by parameters is chosen. Parameters are sent to the OpenCL kernels as defines at compile time through the compilation options. This enables the compiler to control the structure of the code and different settings in kernels in the absence of C++ meta-programming features and a code generating framework. The OpenCL kernels contain a general unoptimized implementation, or sometimes several optimizations that can be turned on or off or chosen by a parameter.

Parameters such as work-group size can be specified at kernel execution through the standard OpenCL API. Parameters that control the generation of the scene can be applied directly by the host device, since the scene is generated on the host device.

Creation of memory objects, transfer of memory objects from the host device to the OpenCL device, execution of kernels and transfer of the result from the OpenCL device back to the host device are all timed. The execution time of the kernels is the basis for scoring a set of optimizations, since all other events should be independent of the optimizations applied. The first of the three kernels is the only kernel being auto tuned and optimized, because it accounts for more than 90% of the execution time.

A module for auto tuning is implemented for setting up the environment, creating the search space, searching it exhaustively, and calculating and saving statistics. Search heuristics can only be applied manually by modifying the valid values of individual parameters. The auto tuning module runs on the host device and has the potential to find the globally optimal set of optimizations in the search space.


3.3 Integration with Python

The environment in which the scene is set up and all initialization is done is Python. That is also the environment from which the OpenCL host code executes.

3.3.1 PyOpenCL

The Python module PyOpenCL is used to expose the OpenCL runtime and API. It can set up a context, create a queue, create memory objects, create program objects and add tasks to the queue [21]. This simplifies the process of initializing the OpenCL environment. A feature of PyOpenCL is caching of compiled program objects. A limitation of that feature is that only the supplied source file is checked for changes, not any source files included by it. So when any of the modules it depends on are changed, one needs to make sure that the top module is also changed so that the program object is updated in the cache.

3.3.2 C structures and alignment

There is a module in the Python Standard Library for integration of Python and C called ctypes [22]. Among other things it supports constructing C structures and C arrays. Those C structures can then be seamlessly transferred into C structures in OpenCL. It also handles the alignment of variables in a structure. That can otherwise be a problem because the alignment is platform and compiler dependent [23].
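As an illustration of the ctypes approach, the sketch below builds a ray structure on the host. The layout is an assumption: a `Ray` made of two four-component float vectors, which happens to give the 32 bytes listed for rays in Table 4, but the actual structures of the implementation are not shown in the text. The `Float4` and `Mixed` types are likewise hypothetical.

```python
import ctypes


class Float4(ctypes.Structure):
    # Host-side stand-in for an OpenCL float4 (four 32-bit floats).
    _fields_ = [("x", ctypes.c_float), ("y", ctypes.c_float),
                ("z", ctypes.c_float), ("w", ctypes.c_float)]


class Ray(ctypes.Structure):
    # Assumed layout: an origin and a direction vector.
    _fields_ = [("origin", Float4), ("direction", Float4)]


class Mixed(ctypes.Structure):
    # Shows the padding that ctypes inserts to align the float member.
    _fields_ = [("flag", ctypes.c_ubyte), ("value", ctypes.c_float)]


rays = (Ray * 3)()                       # a C array of three rays
rays[0].origin = Float4(0.0, 0.0, 1000.0, 0.0)
rays[0].direction = Float4(0.0, 0.0, -1.0, 0.0)

raw = bytes(rays)                        # contiguous bytes, ready for a buffer
print(ctypes.sizeof(Ray), len(raw), Mixed.value.offset)
```

ctypes follows the native alignment rules of the host C compiler. OpenCL vector types such as float4 have their own alignment requirement (16 bytes for float4), so for structures that mix scalars and vectors it is worth checking the field offsets on both sides rather than assuming they agree.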


4 Results and analysis

4.1 Test setup

4.1.1 Hardware platforms

Tests are run on multiple hardware platforms to figure out if and how the automatic optimization performs.

| Platform | Type | Description |
|---|---|---|
| NVIDIA GTX 470 | GPU | 448 CUDA cores, core clock 607 MHz, shader clock 1215 MHz, memory clock 1674 MHz, compute capability 2.0 |
| NVIDIA Quadro FX 1800 | GPU | 64 CUDA cores, core clock 550 MHz, shader clock 1375 MHz, memory clock 800 MHz, compute capability 1.1 |
| Intel Xeon E5520 | CPU | 4 cores, 2.26 GHz, 2.53 GHz turbo frequency |
| Intel Core 2 Duo Mac OSX | CPU | 2 cores, 2.4 GHz |
| AMD Phenom II X6 1055T | CPU | 6 cores, 2.8 GHz |

Table 9. The tested platforms.

4.1.2 Test scene

The test scene is set up with two 40 leaf collimators and two jaws, which are single block collimators. The collimators are focused but without rounded edges. They are modeled to approximate the Elekta MLCi 2 primary collimator and Y secondary collimator. The fluence plane is 600 mm * 600 mm and is located 1000 mm below the ray source. The multi-leaf collimators are located at 295 mm and the jaws at 451 mm below the ray source [12].


Figure 6. The test scene with collimators, jaws and the fluence plane.

The ray source is located at the top of Figure 6. Its size has been exaggerated to make it visible in the picture. The radius is only 1 mm, so it would otherwise be too small to be seen. The multi-leaf collimators are the two top blocks and the jaws are the two bottom blocks. The bottom rectangle is the fluence plane. As can be seen from Figure 6, the back faces of the collimators and jaws are removed because they will never be hit by any rays of importance in the fluence map calculation.

The fluence map is calculated with a resolution of 128*128 pixels, and 20*20 samples are sent from every pixel to integrate the ray source. That gives a total of 6553600 samples for this test scene. Table 4 shows an overview of the memory requirements per work-item.
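The sample count follows directly from the resolution and the number of samples per pixel, as the short calculation below shows. The helper names are hypothetical, and the throughput value used in the second step is the one reported for the NVIDIA GTX 470 in the performance results; it is included only to show how a throughput translates into time per fluence map.

```python
def total_samples(pixels_x, pixels_y, samples_x, samples_y):
    """Number of samples for one fluence map."""
    return pixels_x * pixels_y * samples_x * samples_y


def seconds_per_map(samples, samples_per_second):
    """Time for one fluence map at a given throughput."""
    return samples / samples_per_second


samples = total_samples(128, 128, 20, 20)
print(samples)

# Throughput reported for the NVIDIA GTX 470 after auto tuning.
print(seconds_per_map(samples, 73_635_966))
```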


Figure 7. Normalized fluence map calculated from the test scene.

4.1.3 Search heuristic

The search space with this test scene contains a maximum of 491520 states, which is too big for most platforms to search overnight. Some search heuristic has to be applied to meet the requirement for the search time specified in 2.5. The search is divided into two steps: searching for work-group size and shape, and searching for all the other parameters. For CPUs, the work-group size is set to 1 and the best combination of all the other parameters is determined. Then the work-group size is determined for the best parameters that were found. For GPUs the search is done the other way around. The work-group size, together with the Pieces parameter, which influences the size of the triangle buffer, is determined first for an estimated good set of parameters. Then the rest of the parameters are searched for the best work-group size and Pieces value found. The estimated good set of parameters was found during manual experimental testing.

There are some incompatibilities that need to be taken into consideration, since they would otherwise cause errors.


| Platform | Limitations |
|---|---|
| NVIDIA GTX 470 | Scene information has to be set to constant. Allocating scene information in the global address space causes the program to crash. The reason for this is unknown. It can be due to a bug in the ray tracer, PyOpenCL, the compiler or the OpenCL implementation. |
| NVIDIA Quadro FX 1800 | Triangle data has to be set to global. Otherwise it fails due to lack of constant memory. That makes constant for triangle data buffer invalid. Scene information has to be set to constant. |
| Intel Xeon E5520 | None |
| Intel Core 2 Duo Mac OSX | Ray address space has to be set to private. Setting the address space to local causes the program to crash for an unknown reason. Work-group size has to be 1 in the Y and Z dimensions. |
| AMD Phenom II X6 1055T | None |

Table 10. Platforms and their limitations.

There is also a problem with the axis-aligned bounding box intersection algorithm. If the bounding box is located in any address space other than private, the intersection test fails. This problem appears on all the CPUs and none of the GPUs. The reason for this is unknown. The workaround is to explicitly copy the bounding box to the private address space on the CPUs, but not on the GPUs, because it wastes registers.

| Platform | Intersection algorithm | Ray address space | Triangle data address space | Triangle data buffer address space | Scene address space | Depth-first |
|---|---|---|---|---|---|---|
| NVIDIA GTX 470 | MT2 | private | local | constant | constant | False |
| NVIDIA Quadro FX 1800 | MT2 | private | local | global | constant | False |

Table 11. Estimated good starting values for work-group size search on the GPUs.

4.2 Test results and parameter analysis

4.2.1 Performance results

In ray tracing the standard measure of performance is samples per second or rays per second. Every sample usually consists of several rays, so the rays per second figure should be higher. Here, only samples per second is measured.


[Figure 8. Performance results after auto tuning, in samples/s. NVIDIA GTX 470: 73 635 966; NVIDIA Quadro FX 1800: 4 289 006; Intel Xeon E5520: 3 189 100; Intel Core 2 Duo Mac OSX: 1 704 350; AMD Phenom II X6 1055T: 6 001 465.]

The quad-core Intel Xeon CPU is roughly twice as fast as the dual-core Intel CPU. The AMD CPU has 6 cores and a higher clock speed than the Intel processors, which might explain why it reaches almost double the performance of the Intel Xeon CPU. Compilers also play a role, through how well they optimize and apply auto vectorization. For the Intel Xeon CPU, the Intel compiler reports that it fails to apply auto vectorization to the most compute-demanding kernel (see Table 12). The Apple and AMD compilers do not report whether auto vectorization was successful. The NVIDIA GTX GPU is roughly 17 times faster than the NVIDIA Quadro GPU. That can be explained by the NVIDIA GTX GPU's 7 times more CUDA cores and its newer architecture with larger fast on-chip memory, which affects the number of registers and the caches for global memory.

It is hard to compare these results with other ray tracers for GPUs and CPUs. The scenes and implementations are often very different and serve different purposes.

The highly optimized GPU ray tracer by Aila and Laine [28] gets results in the same order of magnitude, although their scenes are more complex. Aila and Laine both work for NVIDIA, and their paper is probably written with the intention of promoting the performance of NVIDIA's GPUs. Their results are measured on an NVIDIA GTX 285, which should perform roughly the same as or slightly worse than the NVIDIA GTX 470. They use hierarchical bounding volumes, which help in scaling to complex scenes. One similarity is that their ray tracer also does not support colored rays, which would require a larger amount of memory per ray. The performance of that ray tracer is not expected to be portable at all, because even the machine code is optimized by hand.

There is also an open source ray tracer called LuxRender that aims to be physically correct [47]. It has an OpenCL version called smallluxgpu2, which is commonly used for benchmarking GPU compute performance. For similar scenes it reaches 5-10 M samples/s on the NVIDIA GTX 470 GPU and 1 M samples/s on the Intel Xeon CPU. LuxRender uses colored rays.
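The relative speeds quoted above can be checked directly against the values in Figure 8. The following is a minimal sketch of that arithmetic; the dictionary and the `speedup` helper are illustrative names, not part of the thesis code.

```python
# Samples per second after auto tuning, copied from Figure 8.
samples_per_second = {
    "NVIDIA GTX 470": 73_635_966,
    "NVIDIA Quadro FX 1800": 4_289_006,
    "Intel Xeon E5520": 3_189_100,
    "Intel Core 2 Duo": 1_704_350,
    "AMD Phenom II X6 1055T": 6_001_465,
}


def speedup(fast, slow, results=samples_per_second):
    """Return how many times more samples/s `fast` achieves than `slow`."""
    return results[fast] / results[slow]


# The three comparisons made in the text.
xeon_vs_core2 = speedup("Intel Xeon E5520", "Intel Core 2 Duo")
amd_vs_xeon = speedup("AMD Phenom II X6 1055T", "Intel Xeon E5520")
gtx_vs_quadro = speedup("NVIDIA GTX 470", "NVIDIA Quadro FX 1800")

print(f"Xeon / Core 2 Duo:  {xeon_vs_core2:.2f}x")
print(f"Phenom II / Xeon:   {amd_vs_xeon:.2f}x")
print(f"GTX 470 / Quadro:   {gtx_vs_quadro:.2f}x")
```

The first two ratios come out just below 2 and the third just above 17, consistent with "roughly twice", "almost double" and "roughly 17 times".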


The performance of the auto-tuned ray tracer seems to be reasonable compared to the two other ray tracers.

4.2.1.1 Timing results for kernels

| Platform | 1 Ray intensity | 2 Decay factor | 3 Summation |
|---|---|---|---|
| NVIDIA GTX 470 | 83.9 ms | 0.0 ms | 5.6 ms |
| Intel Xeon E5520 | 2222.0 ms | 0.0 ms | 29.0 ms |

Table 12. Sample of kernel timing results for different kinds of hardware.

The ray intensity kernel is the most demanding kernel and the only one that is optimized and auto tuned. The decay factor kernel contributes essentially nothing to the total time. The summation kernel gives a relatively small contribution to the total time and has therefore been left unoptimized. Interestingly, for the summation kernel the Intel Xeon CPU compares fairly well to the NVIDIA GPU. One explanation is that the Intel compiler manages to auto vectorize the kernel (which it reports that it does), while the NVIDIA Visual Profiler reports a high global memory and cache replay overhead for that kernel, which suggests that its global memory accesses are not coalesced. Not even a work-group size is set for it. Manual experiments show that when a work-group size is set, the contribution from the summation kernel drops from 6.5% to 0.3% of the total computation time on the NVIDIA GTX GPU.

4.2.1.2 Analysis of GPU performance

NVIDIA's Visual Profiler can show profiling information, but different GPUs support different amounts of profiling data.

| GPU | Copy to device | Registers/Work-item | Branch divergence overhead | Achieved occupancy | Theoretical occupancy | Global memory replay overhead |
|---|---|---|---|---|---|---|
| NVIDIA GTX 470 | 0.2 % | 46 | 15.3 % | 41.1 % | 41.7 % | 1.1 % |
| NVIDIA Quadro FX 1800 | 0.0 % | 40 | - | - | 25 % | - |

Table 13. GPU profiling statistics.

The cost of transferring data between the host and the OpenCL device is in this case very low. The number of registers per work-item is high and is probably one of the limiting factors, especially for the NVIDIA Quadro GPU, where the register memory is smaller. This lowers the occupancy, which is the ratio of the number of active warps to the maximum number of warps per multiprocessor. To further increase the performance on GPUs, more care has to be put into choosing intersection algorithms that require fewer registers.

Divergence can be a big problem in ray tracing on GPUs, because the paths of rays in a scene diverge. In this case the divergence is relatively low, because the rays never change direction once they are cast. Divergence happens when a work-item in a work-group has a ray that hits an object in the scene which the ray of another work-item in the same work-group does not hit.

The low global memory replay overhead hints that the global memory accesses are coalesced on the NVIDIA GTX GPU. It is hard to say much about coalesced global memory access for the NVIDIA Quadro GPU.
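The link between register usage and occupancy can be illustrated with a simplified model. The sketch below is not taken from the thesis: it assumes the commonly documented per-multiprocessor limits for the two architectures (32768 registers and 48 resident warps for the GTX 470, 8192 registers and 24 resident warps for the Quadro FX 1800, at most 8 resident work-groups on both), assumes that the profiled runs used the best work-group sizes from Table 15 (128 and 64), and ignores register allocation granularity and local memory limits.

```python
def theoretical_occupancy(regs_per_item, group_size, regs_per_sm,
                          max_warps_per_sm, max_groups_per_sm=8, warp_size=32):
    """Simplified occupancy model: active warps / maximum warps per multiprocessor.

    Only the register file, the warp limit and the work-group limit are
    considered; allocation granularity and local memory are ignored.
    """
    # Warps needed by one work-group (rounded up).
    warps_per_group = -(-group_size // warp_size)
    # Registers consumed by one resident work-group.
    regs_per_group = regs_per_item * warp_size * warps_per_group
    # Number of work-groups that fit, taking the tightest limit.
    groups = min(regs_per_sm // regs_per_group,
                 max_warps_per_sm // warps_per_group,
                 max_groups_per_sm)
    return groups * warps_per_group / max_warps_per_sm


# Register counts from Table 13; hardware limits and work-group sizes are assumptions.
gtx_470 = theoretical_occupancy(46, 128, regs_per_sm=32768, max_warps_per_sm=48)
quadro_fx_1800 = theoretical_occupancy(40, 64, regs_per_sm=8192, max_warps_per_sm=24)
print(f"GTX 470: {gtx_470:.1%}, Quadro FX 1800: {quadro_fx_1800:.1%}")
```

Under these assumptions the model gives 41.7 % and 25.0 %, which agrees with the theoretical occupancy column in Table 13 and supports the reading that registers are the limiting resource on both GPUs.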


4.2.2 Parameter search statistics

| Platform | W-G time | W-G space / Successful | Param. time | Param. space / Successful | Total time |
|---|---|---|---|---|---|
| NVIDIA GTX 470 | 1030 s | 1920 / 1062 | 158 s | 192 / 112 | 1188 s |
| NVIDIA Quadro FX 1800 | 6324 s | 1920 / 550 | 631 s | 192 / 96 | 6955 s |
| Intel Xeon E5520 | 550 s | 320 / 210 | 6925 s | 1152 / 864 | 7475 s |
| Intel Core 2 Duo Mac OSX | 47 s | 320 / 8 | 4317 s | 432 / 288 | 4364 s |
| AMD Phenom II X6 1055T | 261 s | 320 / 210 | 3238 s | 1152 / 864 | 3499 s |

Table 14. Optimization parameter search statistics.

Unsuccessful runs occur when the compiler fails or the program fails to run. No automatic checking for valid work-group sizes is done, although such a check could reduce the number of failed runs. Failed runs are not a major part of the search time. Part of the search time is spent compiling OpenCL C code. pyOpenCL caches compiled kernels, which saves search time. A compilation usually takes from one to a few seconds, depending on the compiler and the input code, so a considerable amount of time can be saved by caching compiled kernels.

4.2.3 Work-group parameters

| Platform | Size | Sizes within 10% of best | # of sets within 10% of best |
|---|---|---|---|
| NVIDIA GTX 470 | 128 | 64-256 | 23 of 1920 |
| NVIDIA Quadro FX 1800 | 64 | Only 64 | 13 of 1920 |
| Intel Xeon E5520 | - | 1-1024 | 210 of 210 |
| Intel Core 2 Duo Mac OSX | - | 1-128 | 8 of 8 |
| AMD Phenom II X6 1055T | - | 1-1024 | 209 of 210 |

Table 15. Best work-group sizes.
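Since the search does not check candidate work-group sizes for validity beforehand, many candidates fail at run time. A pre-filter along the following lines could remove candidates that can never run. This is a sketch under stated assumptions, not the thesis implementation: the function name and the example limits are hypothetical, and in pyOpenCL the limits would typically be read from `device.max_work_group_size` and `device.max_work_item_sizes`. The divisibility check reflects the OpenCL 1.x rule that the global size must be a multiple of the work-group size in every dimension.

```python
from itertools import product


def valid_work_group_shapes(candidates, max_group_size, max_item_sizes, global_size):
    """Yield (x, y, z) shapes that satisfy the device limits.

    candidates     -- three lists with the candidate sizes per dimension
    max_group_size -- maximum number of work-items in one work-group
    max_item_sizes -- per-dimension maximum work-group extent
    global_size    -- global work size per dimension
    """
    for shape in product(*candidates):
        # Total number of work-items in the group must fit the device limit.
        if shape[0] * shape[1] * shape[2] > max_group_size:
            continue
        # Each dimension must respect its own limit.
        if any(s > m for s, m in zip(shape, max_item_sizes)):
            continue
        # OpenCL 1.x: the global size must be divisible by the work-group size.
        if any(g % s for g, s in zip(global_size, shape)):
            continue
        yield shape


# Hypothetical candidate lists, limits and global size, for illustration only.
powers = [1, 2, 4, 8, 16, 32, 64, 128]
shapes = list(valid_work_group_shapes(
    (powers, powers, [1, 2, 4, 8, 16]),
    max_group_size=1024,
    max_item_sizes=(1024, 1024, 64),
    global_size=(128, 128, 16),
))
print(len(shapes), "of", len(powers) * len(powers) * 5, "candidate shapes remain")
```

A filter like this only removes shapes that violate static limits. A kernel that uses many registers may still fail for shapes that pass, so run-time failures would have to be handled as before.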

| Platform | X | Y | Z | X within 10% of best | Y within 10% of best | Z within 10% of best |
|---|---|---|---|---|---|---|
| NVIDIA GTX 470 | 2 | 32 | 2 | 1-4 | 4-64 | 1-16 |
| NVIDIA Quadro FX 1800 | 1 | 64 | 1 | 1, 4 | 4-64 | 1-16 |
| Intel Xeon E5520 | - | - | - | 1-128 | 1-128 | 1-16 |
| Intel Core 2 Duo Mac OSX | - | - | - | 1-128 | Only 1 | Only 1 |
| AMD Phenom II X6 1055T | - | - | - | 1-128 | 1-128 | 1-16 |

Table 16. Best work-group shapes.

For CPUs, the size and shape of the work-group have little effect on the performance. All or almost all valid shapes and sizes are within ten percent of the best performing shape and size (see Figure 9).
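The "within 10% of best" criterion used in Tables 15 and 16 can be sketched as follows. The thesis does not spell out the exact definition, so this assumes that a configuration counts if its run time is at most 1.1 times the best run time. The function name and the timings are made up for illustration.

```python
def within_fraction_of_best(timings, fraction=0.10):
    """Return the fastest configuration and all configurations close to it.

    timings  -- dict mapping a work-group shape (x, y, z) to its run time in seconds
    fraction -- allowed slowdown relative to the best time (0.10 means 10%)
    """
    best_config = min(timings, key=timings.get)
    limit = timings[best_config] * (1.0 + fraction)
    close = sorted(config for config, t in timings.items() if t <= limit)
    return best_config, close


# Made-up timings for three shapes, for illustration only.
example_timings = {(2, 32, 2): 0.084, (1, 64, 1): 0.090, (4, 4, 1): 0.300}
best, close = within_fraction_of_best(example_timings)
print("best:", best, "within 10%:", close)
```

On a CPU nearly every successful shape would end up in the second list, whereas on the GPUs only a small subset of the search space would.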


[Figure 9. Computation time for different work-group dimensions for the Intel Xeon CPU. Axes: Size (x), Time in s (y); series: Size X, Size Y, Size Z.]

It is clear that GPUs are more sensitive to the work-group size and shape (see Figure 10). The older NVIDIA Quadro GPU seems to be more sensitive than the newer NVIDIA GTX 470 GPU, judging by the narrower ranges of parameter values in Tables 15 and 16. A possible explanation is that the newer GPU has larger and more forgiving caches when copying from global memory to local or private memory than the older card. It also seems that a large Y dimension is preferred over a large X dimension. One explanation could lie in how matrices are stored in memory on the NVIDIA GPUs, where a larger Y dimension possibly results in more aligned reads from global memory to local memory than a large X dimension.

[Figure 10. Computation time for different work-group dimensions for the NVIDIA GTX 470 GPU. Axes: Size (x), Time in s (y); series: Size X, Size Y, Size Z.]