
A first attempt at learning about optimizing the TigerSHARC code

TigerSHARC assembly syntax

What we NOW KNOW!

• Can we return from an assembly language routine without crashing the processor?

• Return a parameter from assembly language routine – (Is it same for ints and floats?)

• Pass parameters into assembly language – (Is it same for ints and floats?)

• Do IF THEN ELSE statements

• Read and write values to memory

• Read and write values in a loop

• Do some mathematics on the values fetched from memory

All this stuff is demonstrated by coding HalfWaveRectifyASM( )

10/13/2010

TigerSHARC assemble code 3, M. Smith, ECE, University of Calgary,

Canada


Next Sprint stage

• Debug mode for C++ function
– Works
• Release mode (optimized) for C++ function
– Works
• First attempt at integer ASM function
– Works
• Next stage – Test for speed
• Then test for difference between integer and floating point speed

Tests for timing test code

Not bad for a first effort


Faster than compiler in debug mode

Cut and paste float version


Where did the float ASM code suddenly appear from?

We know more than the compiler in this example

• Integer 0 has bit pattern 0x0000 0000
• Float 0.0 has bit pattern 0x0000 0000
• Integer +6 has format PLUS FORMAT b 0??? ???? ???? ???? ???? ???? ???? ????
• Float +6.0 has format PLUS FORMAT b 0### #### #??? ???? ???? ???? ???? ????
• Integer -6 has format MINUS FORMAT b 1??? ???? ???? ???? ???? ???? ???? ????
• Float -6.0 has format MINUS FORMAT b 1### #### #??? ???? ???? ???? ???? ????

FLOATING POINT EXPONENT SHOWN AS ####

• Formats are very different, but the sign bit is in the same place
• Float algorithm – if S == 1 (negative) set to zero. Otherwise leave unchanged – same as integer algorithm
• Just re-use integer algorithm with a change of name
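The sign-bit argument above can be tried out in C++. This is only a sketch: the function names are made up for illustration, and it assumes 32-bit IEEE-754 floats.

```cpp
#include <cstdint>
#include <cstring>

// Integer half-wave rectify: negative values become 0, others pass through.
inline std::int32_t halfWaveRectifyInt(std::int32_t value) {
    return (value < 0) ? 0 : value;
}

// Float version that reuses the integer rule. Only the sign bit of the
// 32-bit pattern decides the result (assumes IEEE-754 single precision).
inline float halfWaveRectifyFloat(float value) {
    static_assert(sizeof(float) == sizeof(std::int32_t), "32-bit float assumed");
    std::int32_t bits;
    std::memcpy(&bits, &value, sizeof bits);    // view the float as an integer
    bits = halfWaveRectifyInt(bits);            // sign bit set -> 0x00000000
    float result;
    std::memcpy(&result, &bits, sizeof result); // 0x00000000 reads back as +0.0f
    return result;
}
```

A negative float has its sign bit set, so the integer test sees a negative number and writes the all-zero pattern, which is also 0.0 as a float.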


Interesting observations

• “C” Debug float and integer are about the same in timing
• “C” Release float is much slower than Release integer
• Our float ASM is slightly slower than integer ASM
• Extra jump (10 cycles ?) split across 160 operations – not very much

How does 4.5 OLD compiler do it faster?

Look at C++ source code and use mixed mode to show

• Warning – out of order instructions displayed

???????

How does LATEST 5.0 compiler do it faster?

Look at source code and use mixed mode to show

MINOR DIFFERENCES BETWEEN COMPILERS?

???????

Many new and parallel instructions. Ones inside loop are key – the ones with the biggest bang for the buck for each change

How important is it to code whether a conditional jump is predicted or not (NP)?

0.0926 uS / pt 0.0752 uS / pt – BIG but data dependent

Many new instructions. Many parallel instructions. Ones inside loop are key

– JMP (NP): 0.092 to 0.075
– XR1 not J1: 0.075 to 0.074

How important is not using J registers as destination when reading from memory?

XR1 rather than J1. Now need
– Condition XALT rather than JLT
– PASS rather than COMP with 0

Many new instructions. Many parallel instructions. Ones inside loop are key

– JMP (NP): 0.092 to 0.075
– XR1 not J1: 0.075 to 0.074
– and ++ operator: 0.074 to 0.072

How important is not using J registers as a destination when reading from memory, and using pointers (*pt++) rather than array (pt[count])?

XR1 rather than J1. Now need
– Condition XALT rather than JLT
– PASS (MOVE) rather than COMP WITH 0 (MATH)
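The array form (pt[count]) and the pointer form (*pt++) can be put side by side in C++. This is a sketch with made-up function names. Whether the pointer form is faster depends on the compiler and optimization level, so it needs timing rather than assuming.

```cpp
#include <cstddef>

// Array-index form: the element is addressed through 'count' each pass.
void rectifyIndexed(const int* in, int* out, std::size_t n) {
    for (std::size_t count = 0; count < n; ++count) {
        int v = in[count];
        out[count] = (v < 0) ? 0 : v;
    }
}

// Pointer form: *pt++ reads (or writes) and then steps the pointer,
// which maps naturally onto a post-modify memory access.
void rectifyPointer(const int* in, int* out, std::size_t n) {
    const int* end = in + n;
    while (in != end) {
        int v = *in++;
        *out++ = (v < 0) ? 0 : v;
    }
}
```

Both routines must give identical results; only the addressing style differs.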


Redoing our code to this point.

Note new instructions using XR2 and R2

Try a little thing. R2 = 0 is a constant – move outside loop

Data dependent. Will make a difference 1 time in 5 with this data

0.072 to 0.0717 -- OPTIMIZATION TECHNIQUE – HOW MUCH TIME SAVED?

The IF THEN JUMPS in the loop are killing pipeline. Rewrite C++ code into optimized form

• Reduce loop size from 6 cycles if > 0 and 7 cycles if < 0 to 4 either way.

The jumps were costing us 9 cycles by disrupting the TigerSHARC pipeline

FLOATasm 0.072 uS to 0.038 uS

INTEGERasm = 0.038 uS too – but release C++ = 0.019 uS, so our ASM is still too slow

Need to get rid of this jump and counter increment.
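One way to see "same cost either way" in C++ is to replace the IF with a mask. This is a sketch only, with made-up names. It is not the code from the slide, and an optimizing compiler may already turn the first form into something jump-free.

```cpp
#include <cstdint>

// Jump-based form: one path for negative values, another for the rest.
inline std::int32_t rectifyWithBranch(std::int32_t v) {
    if (v < 0) {
        v = 0;
    }
    return v;
}

// Jump-free form: build an all-ones or all-zeros mask from the comparison,
// then AND it with the value. The same instructions run for every input.
inline std::int32_t rectifyBranchless(std::int32_t v) {
    const std::int32_t keep = -static_cast<std::int32_t>(v >= 0); // -1 or 0
    return v & keep;
}
```

The second form does slightly more arithmetic, but nothing in it depends on a jump being predicted correctly.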


Blackfin has hardware loops. Does the TigerSHARC? – Duh!!

Many new and parallel instructions. Ones inside loop are key – biggest bang per change

– JMP (NP): 0.092 to 0.075
– XR1 not J1: 0.075 to 0.074
– and ++ operator: 0.074 to 0.072
– Remove IF THEN: 0.038

Hardware loop instructions

LC0 = loop counter 0 – may only be a few hardware loops possible

SHARC ADSP-21061 – allows 6, Blackfin ADSP-BF5XX – allows 2, so need to still understand software loops

IF NLC0E – if NOT hardware loop count 0 expired

Line 124 -- IF LC0E – if hardware loop expired – MM – why used!!

Insert hardware loop – check code passes test -- my new code

Failure indicates Excellent result

Some ideas on making code more parallel in general -- Step 1

Standard code

For
    X = read memory
    X1 = use X
    X2 = use X1
    write memory X2
EndFor JUMP to For

Rearrange loop

X = read memory
For
    X1 = use X
    X2 = use X1
    write memory X2
EndFor JUMP to For with X = read memory done in parallel

Some ideas on making code more parallel in general -- Step 2

Standard code

For
    X1 = use X
    X2 = use X1
    write memory X2
EndFor JUMP to For with X = read memory

Rearrange loop

For
    X2 = use X1 with X = read memory
    write memory X2 with X1 = use X
EndFor JUMP to For
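The Step 1 rearrangement can be sketched in C++. stage1 and stage2 are made-up stand-ins for "use X" and "use X1". C++ cannot say "in parallel", so the sketch only shows the reordering that makes the overlap possible.

```cpp
#include <cstddef>

// Stand-ins for the "use X" and "use X1" steps (hypothetical work).
inline int stage1(int x)  { return x * 2; }
inline int stage2(int x1) { return x1 + 1; }

// Standard loop: read, compute, compute, write in strict sequence.
void standardLoop(const int* in, int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        int x  = in[i];
        int x1 = stage1(x);
        int x2 = stage2(x1);
        out[i] = x2;
    }
}

// Rearranged loop: the first read is done before the loop, and each pass
// issues the read for the NEXT pass, which does not depend on x2.
void rearrangedLoop(const int* in, int* out, std::size_t n) {
    if (n == 0) return;
    int x = in[0];                        // read moved ahead of the loop
    for (std::size_t i = 0; i + 1 < n; ++i) {
        int x1 = stage1(x);
        int x2 = stage2(x1);
        x = in[i + 1];                    // next value fetched early
        out[i] = x2;
    }
    out[n - 1] = stage2(stage1(x));       // finish the last value after the loop
}
```

The price of the rearrangement is the extra code before and after the loop, which has to be counted when timing short loops.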

Rearrange the loop – standard approach

0.0319 uS, 0.0239 uS – changed the stalls when reading memory

Need to have a closer look at what compiler is doing better

USING NXALE and XALE

Got worse when we did that

Need to have a 2nd closer look at what compiler is doing better

• ALSO USING DIFFERENT ADDRESSING MODE

That causes a problem when we try it with our code

Still not better. What else is different?

That improvement was unexpected. Perhaps outside loop now a problem

Before we continue with the optimization

• C already works better than ours for int
• Ours works better than C for float
• Even if we found “all possible” optimizations (and we probably can’t), what is the best possible speed for this processor?
• Just how fast do we need to go?
• Typical target: the time for all the DSP algorithms added together must be less than 0.5 times the interval between samples
– Why 0.5 times and not 0.95 times?
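The 0.5 rule turns into a cycle budget once the clock rate and sample rate are known. A small sketch; the function name and any numbers plugged in are assumptions, not values from these slides.

```cpp
// Cycles available per sample if all DSP work must fit in a given
// fraction of the interval between samples.
inline double cycleBudgetPerSample(double coreClockHz, double sampleRateHz,
                                   double fraction) {
    return fraction * coreClockHz / sampleRateHz;
}
```

For example, cycleBudgetPerSample(clock, rate, 0.5) gives half of the cycles between samples, leaving the other half for everything that is not DSP.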


What is the theoretical maximum speed?

• This is something I always work out BEFORE optimizing.
– I have a target to meet – normally finish all processing before next sample comes in.
– If my code (in theory) can’t meet that target, I need to find a different approach, not spend days optimizing useless code.
• In theory – if I have written the code with no hidden stalls – 1 cycle per instruction
– 6 instructions outside the loop
– 4 instructions inside the loop – N * 4 cycles
– Very short loop – read that getting out of very short loop stalls the pipeline – let's add 5 cycles for that
– 6 + 24 * 4 + 5 = 107 in theory, 138 in practice
– Difference 21 – close enough to being 24, or 1 stall per time round the loop
– Can use the pipeline viewer to find out where the problem is occurring. In a long loop, done 4096 times, might be worth it.
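The "in theory" count can be written as a one-line model, so it is easy to redo when the loop changes. A sketch with a made-up function name, using the same assumption as above of 1 cycle per instruction and no hidden stalls.

```cpp
// Theoretical cycle count: instructions outside the loop, plus
// instructions per pass times the number of passes, plus a fixed
// allowance for leaving a very short loop.
constexpr int theoreticalCycles(int outsideLoop, int insideLoop,
                                int passes, int loopExitPenalty) {
    return outsideLoop + passes * insideLoop + loopExitPenalty;
}

// The numbers used above: 6 outside, 4 inside, 24 passes, 5 for loop exit.
static_assert(theoreticalCycles(6, 4, 24, 5) == 107, "6 + 24 * 4 + 5");
```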

Change tests to “remove time needed to ‘time the test’”. IMPORTANT VALIDATION

“timer overhead removed”

Now using tests to explore (and document) “system behaviour”

Unexpected behaviour. Error is 300 larger than measured times

What we forgot – averaged timing error over 160 times round loop

2nd mistake

This indicates that not removing the time causes 12% error

We need to know variability in time measurement

Variability is roughly (max – min) / 2

• Precision of time measure seems good – much lower than the changes we are seeing
• Let's save Test file under different name and try different testing method
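Both corrections, removing the time needed to time the test and estimating variability as (max – min) / 2, fit in a few lines of C++. A sketch with made-up names. The measurements themselves would come from whatever timer or cycle counter the test uses.

```cpp
#include <algorithm>
#include <vector>

// Time per point once the cost of "timing the test" has been subtracted.
inline double timePerPoint(double measuredTotal, double timerOverhead,
                           int points) {
    return (measuredTotal - timerOverhead) / points;
}

// Rough variability of repeated measurements: (max - min) / 2.
inline double variability(const std::vector<double>& runs) {
    if (runs.empty()) return 0.0;
    const auto extremes = std::minmax_element(runs.begin(), runs.end());
    return (*extremes.second - *extremes.first) / 2.0;
}
```

If the variability is much smaller than the change between two versions of the code, the change is probably real and not just measurement noise.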

Unclear what number to expect (in uS) – activate cycle counter instead

Doing print wrong (try convert to int)

Now printing “acceptable = 5”

4th Error – Did not turn on timer

• Timer seems to be running – error = 0x40 / 160
– Temporarily moved local array to be global array and then display TigerSHARC memory

Look at cycle counter values

Break lecture here

Key times for integer

• C++ debug 12022 cycles / function
• C++ release 1814 cycles / function
• First ASM 1352 cycles / function
• Do we believe these numbers?
– Let me count the cycles (mis-quote from Shakespeare)

Program flow -- assumptions

• Each simple instruction line = 1 cycle
• Each jump taken – break pipeline
– Enter subroutine BP cycles
– Exit subroutine BP cycles
• Predicted jumps
– Break pipeline first time happens BP cycles
– Break pipeline if not taken BP cycles

Operations

• Memory reads take extra MR cycles if fetched value used immediately
• Register to register moves take no extra time over cycle time for instruction
• Possible that math operations take extra time if result used immediately
• Can’t do two accesses to same memory bank (reads or read / write) in one cycle
• External memory operations take longer

Code outside loop lines 25 to 44

• Enter subroutine BP cycles (9?)
• 9 instructions + 2 instructions where result used immediately (COMPJ6 then use COMP result) and memory access

Code outside loop lines 57 to 59

• 2 instructions
• + break pipeline when return BP
• Total cycles outside loop – at least
– BP + 9 + 2 + BP -- at least 30 cycles

Code in loop

• 4 instructions + memory fetch where value used immediately + possible pipeline break when exits routine
• Total cycles = 30 + 4 * N = 670 cycles predicted
• Actual cycles 1352
• Difference = 682 cycles – about 4 extra each time around the loop
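The prediction and the gap to the measured count can be checked with a few lines of C++. A sketch with made-up names, using N = 160 times round the loop as in the timing tests.

```cpp
// Simple model: fixed cost outside the loop plus cycles per pass.
constexpr int predictedCycles(int outsideLoop, int cyclesPerPass, int passes) {
    return outsideLoop + cyclesPerPass * passes;
}

// Cycles per pass that the model does not explain, given a measured total.
constexpr double extraCyclesPerPass(int measured, int predicted, int passes) {
    return static_cast<double>(measured - predicted) / passes;
}
```

With 30 cycles outside, 4 per pass and 160 passes the prediction is 670. Against the measured 1352 the model leaves a little over 4 cycles per pass unexplained, which is what the pipeline viewer is then used to track down.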


Switch to “cycle accurate” TigerSHARC simulator

• Takes much more time to simulate than emulate

WRONG SIMULATOR?

Pipeline viewer (Simulator debug)

Extra cycle on fetch

+1 cycles extra

+3 cycles extra

New prediction

• Old prediction
– Outside loop 30 cycles
– Inside loop N * 4
• New prediction
– Outside loop 30 cycles
– Inside loop 3 (first time hardware loop jump back) + N * 4 + N * memory stall when value used immediately on fetch

Trying to understand what we have done

• Most TigerSHARC instructions can be made conditional.
• WHY? Because doing a NOP instruction (if condition not met) is much less disruptive to the instruction pipeline than doing a JUMP (loss of 9 cycles if jump taken – probably more because of code format)

Why mostly conditional instructions?

• TigerSHARC has a very deep pipeline, so conditional jumps cause a potentially large disruption of the pipeline
• Better to use non-jump instructions which don’t disrupt pipeline, even if instruction is not executed (acts as nop)

    if (N < 1) return_value = NULL;
    else return_value = value;

With jumps

    COMP(N, 1);;
    IF NJLT, JUMP _ELSE;;
    J5 = NULL;;
    JUMP _END_IF;;
_ELSE:
    J5 = value;;
_END_IF:

With conditional instructions

    COMP(N, 1);;
    IF JLT; DO, J5 = NULL;;
    IF NJLT; DO, J5 = value;;

Concept is there – we need to check on whether syntax is correct

Trying to understand what we have done

• Use J registers for address operations, but store values from memory in XR1 and YR1
• WHY? Instructions like this [J1] = XR1;; have the potential to be put in parallel with more operations

Hardware – zero overhead loop.

About 4 * N cycles better (N is times round the loop)

    LC0 = N;;                              // Load counter 0 with value N
Start_of_loop_LABEL:
    Loop code here ;;
    IF NLC0E, JUMP Start_of_loop_LABEL;;

NLC0E – Not LC0 expired – essentially compare LC0 with 2. If less than 2, continue (don’t jump). If 2 or more, then decrement LC0 and jump.

All sorts of stall issues if instruction is not properly aligned – TigerSHARC manual 8-23

CAN’T USE WHEN THERE IS A FUNCTION CALL IN THE LOOP? WHY NOT? – WHAT HAPPENS – NEED TO EXPLORE MORE.

Hardware – zero overhead loop. BIG WARNING

    LC0 = N;;                              // Load counter 0 with value N

LC0 uses UNSIGNED ARITHMETIC – MAKE SURE N is not negative, as a negative number has the same bit pattern as a VERY large unsigned number, and the processor will go around the loop for a week

We did a check for N <= 0 before entering the hardware loop as another part of our code – so we lucked in – otherwise could have big problems.

This issue is so important (and time wasting in the laboratories) that I will be deducting marks in quizzes and exams
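The unsigned trap is easy to demonstrate in C++. A sketch with made-up names, assuming the loop counter is treated as a 32-bit unsigned value.

```cpp
#include <cstdint>

// What an unsigned 32-bit loop counter "sees" when loaded with a signed N.
inline std::uint32_t asLoopCounter(std::int32_t n) {
    return static_cast<std::uint32_t>(n);
}

// Check to make BEFORE loading the counter: refuse N <= 0.
inline bool safeToStartHardwareLoop(std::int32_t n) {
    return n > 0;
}
```

asLoopCounter(-1) is 4294967295, so a loop "counted" by -1 runs more than four billion times. The N <= 0 check has to happen before the counter is loaded, not inside the loop.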

What’s this XR1, YR1 and R1 stuff

• TigerSHARC is designed to do many things at once
• So you need appropriate syntax to control it

What’s this XR1, YR1 and R1 stuff

    XYR1 = R2 + R3;;

does 2 adds: XR1 = XR2 + XR3 and YR1 = YR2 + YR3

You can add the X values and not the Y values with this syntax

    XR1 = R2 + R3;;

and NOT with XR1 = XR2 + XR3;;

Ugly – but they (ADI) will not change the syntax (DAMY)
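A toy C++ model can make the X / Y prefix idea concrete. This is purely an illustration with made-up names. It models two register files of 32 integer registers each and says nothing about timing.

```cpp
#include <array>
#include <cstdint>

// Toy model: two compute blocks, each with its own register file.
struct ComputeBlocks {
    std::array<std::int32_t, 32> xr{};  // XR0..XR31
    std::array<std::int32_t, 32> yr{};  // YR0..YR31

    // Models "XYRd = Ra + Rb;;" : the same add runs in both blocks,
    // each block using its own copies of Ra and Rb.
    void addXY(int d, int a, int b) {
        xr[d] = xr[a] + xr[b];
        yr[d] = yr[a] + yr[b];
    }

    // Models "XRd = Ra + Rb;;" : only the X block does the add.
    void addX(int d, int a, int b) {
        xr[d] = xr[a] + xr[b];
    }
};
```

The prefix on the destination picks which block (or both) executes. The source registers never carry a prefix because each block can only see its own registers.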

What’s this XR1, YR1 and R1 stuff

    XYR1 = [J0 += 0x1];;

Does a 32-bit fetch and puts the same value into XR1 and YR1. Same as doing XR1 = [J0 += 0];; AND YR1 = [J0 += 1];; at the same time

    XYR1 = L[J0 += 0x2];;

Does a dual (64-bit) fetch and is the same as doing XR1 = [J0 += 1];; AND YR1 = [J0 += 1];; at the same time

What’s this XR1, YR1 and R1 stuff

    XYR1 = [J0 += 0x1];;
means XR1 = [J0 += 0];; AND YR1 = [J0 += 1];;

    XYR1 = L[J0 += 0x2];;
means XR1 = [J0 += 1];; AND YR1 = [J0 += 1];; at the same time

    XR1:0 = L[J0 += 0x2];;
means XR0 = [J0 += 1];; AND XR1 = [J0 += 1];;

    XYR1:0 = L[J0 += 0x2];;
means XR0 = [J0 += 0];; AND YR0 = [J0 += 1];; AND XR1 = [J0 += 0];; AND YR1 = [J0 += 1];;

    XR3:0 = Q[J0 += 0x4];;
means XR0 = [J0 += 1];; AND XR1 = [J0 += 1];; AND XR2 = [J0 += 1];; AND XR3 = [J0 += 1];;

    XYR3:0 = Q[J0 += 0x4];;
means XR0 = [J0 += 0];; AND YR0 = [J0 += 1];; AND XR1 = [J0 += 0];; AND YR1 = [J0 += 1];; AND XR2 = [J0 += 0];; AND YR2 = [J0 += 1];; AND XR3 = [J0 += 0];; AND YR3 = [J0 += 1];;
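Three of the fetch forms can be modelled in C++ to check the "means" lines by hand. This is a toy sketch with made-up names that follows the descriptions on these slides. It ignores any alignment requirements of the real hardware.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct Regs {
    std::array<std::int32_t, 4> xr{};  // XR0..XR3
    std::array<std::int32_t, 4> yr{};  // YR0..YR3
};

// Models "XYR1 = [J0 += 1];;" : one 32-bit word goes to both XR1 and YR1.
inline void loadBroadcast(Regs& r, const std::int32_t*& j0) {
    r.xr[1] = *j0;
    r.yr[1] = *j0;
    j0 += 1;
}

// Models "XYR1 = L[J0 += 2];;" : first word to XR1, second word to YR1.
inline void loadLongSplit(Regs& r, const std::int32_t*& j0) {
    r.xr[1] = j0[0];
    r.yr[1] = j0[1];
    j0 += 2;
}

// Models "XR3:0 = Q[J0 += 4];;" : four consecutive words to XR0..XR3.
inline void loadQuadX(Regs& r, const std::int32_t*& j0) {
    for (std::size_t i = 0; i < 4; ++i) {
        r.xr[i] = j0[i];
    }
    j0 += 4;
}
```

In each case J0 moves on by the number of words fetched, so one instruction replaces several single-word reads.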

Float release generated by C++ compiler – identify new instructions

• I see 1 new instruction

Difference between integer and float math operations

    XYR1 = R2 + R3;;

does 2 INTEGER adds: XR1 = XR2 + XR3 and YR1 = YR2 + YR3

SYNTAX XR1 = R2 + R3;; and NOT XR1 = XR2 + XR3;;

Use F syntax to make it a float operation

    XYFR1 = R2 + R3;;

does 2 FLOATING adds: XFR1 = R2 + R3 and YFR1 = R2 + R3

Exercise 1 – needed for Lab. 1

• FIR filter operation -- data and filter-coefficients are both integer arrays – Write in C++
• New_value from Audio A/D, output sent to Audio D/A

    for j = 1 to N - 1
        data[N - j] = data[N - j - 1];
    data[0] = newvalue;

$$\text{output} = \sum_{j=0}^{N-1} \text{data}[j] \cdot \text{filter\_coeffs}[j]$$
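One possible C++ starting point for the exercise is sketched below. The function name and the use of std::vector are assumptions made for the sketch, not the Lab. 1 interface. It shifts the stored data from the oldest end so nothing is overwritten before it has been moved.

```cpp
#include <cstddef>
#include <vector>

// One FIR step: shift the delay line, put the new sample at data[0],
// then form the sum of products with the coefficients.
int firStep(std::vector<int>& data, const std::vector<int>& filterCoeffs,
            int newValue) {
    const std::size_t n = data.size();
    if (n == 0 || filterCoeffs.size() != n) return 0;  // sketch: reject bad sizes
    for (std::size_t j = 1; j < n; ++j) {
        data[n - j] = data[n - j - 1];   // move each value one place older
    }
    data[0] = newValue;
    int output = 0;
    for (std::size_t j = 0; j < n; ++j) {
        output += data[j] * filterCoeffs[j];
    }
    return output;
}
```

Called once per sample, with the same data vector kept between calls, this gives one output value per input value.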

Exercise – needed for Lab. 1

• FIR filter operation -- data and filter-coefficients are both integer arrays -- ASM

    ReadAudioSource(&newvalue);
    for j = 1 to N - 1
        data[N - j] = data[N - j - 1];
    data[0] = newvalue;

$$\text{output} = \sum_{j=0}^{N-1} \text{data}[j] \cdot \text{filter\_coeffs}[j]$$

    WriteAudioSource(output);

Insert C++ code – for Lab. 1

Insert assembler code version (Lab. 2)

What we NOW KNOW – EVERYTHING FOR THE FINAL (REALLY -- ALMOST)!

• Can we return from an assembly language routine without crashing the processor?
• Return a parameter from assembly language routine – (Is it same for ints and floats?)
• Pass parameters into assembly language – (Is it same for ints and floats?)
• Do IF THEN ELSE statements
• Read and write values to memory
• Read and write values in a loop
• Do some mathematics on the values fetched from memory

All this stuff was demonstrated by coding HalfWaveRectifyASM( ) -- ☺