Geometry stage

matmul4x4_s block, which accepts streaming data. Looking at Equation 8, which shows the vector-matrix multiplication, we can see that the four rows of the result are simply sums of partial products. For example, when only vx is available, the partial products vx·m11, vx·m12, vx·m13, and vx·m14 can already be calculated.

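\[
\begin{bmatrix} v_x & v_y & v_z & v_w \end{bmatrix}
\cdot
\begin{bmatrix}
m_{11} & m_{12} & m_{13} & m_{14} \\
m_{21} & m_{22} & m_{23} & m_{24} \\
m_{31} & m_{32} & m_{33} & m_{34} \\
m_{41} & m_{42} & m_{43} & m_{44}
\end{bmatrix}
=
\begin{bmatrix}
v_x \cdot m_{11} + v_y \cdot m_{21} + v_z \cdot m_{31} + v_w \cdot m_{41} \\
v_x \cdot m_{12} + v_y \cdot m_{22} + v_z \cdot m_{32} + v_w \cdot m_{42} \\
v_x \cdot m_{13} + v_y \cdot m_{23} + v_z \cdot m_{33} + v_w \cdot m_{43} \\
v_x \cdot m_{14} + v_y \cdot m_{24} + v_z \cdot m_{34} + v_w \cdot m_{44}
\end{bmatrix}
\tag{8}
\]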

The method of streaming in data and calculating partial results has several benefits. The first benefit is that less logic is required. The dual-ring interconnect can only transfer one word of data per clock cycle, so only one vertex component is available at the input at a time. For each vertex component, one partial result for each of the four rows can be calculated, which requires four multipliers and four adders. Calculating the full result in one go would require 16 multipliers and 12 adders.
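The streaming scheme can be illustrated in software. The following is a minimal Python sketch, not the hardware description of the block: the function name `matmul4x4_streaming` and the example matrix are hypothetical, and each loop iteration stands in for one clock cycle in which a single vertex component arrives.

```python
from typing import Iterable, List, Sequence


def matmul4x4_streaming(components: Iterable[float],
                        matrix: Sequence[Sequence[float]]) -> List[float]:
    """Multiply a row vector by a 4x4 matrix, consuming one component at a time."""
    acc = [0.0, 0.0, 0.0, 0.0]          # four accumulator words, one per result row
    for i, v in enumerate(components):  # one vertex component per "cycle"
        for n in range(4):              # models four multipliers and four adders
            acc[n] += v * matrix[i][n]  # add partial product v_i * m_(i+1)(n+1)
    return acc


# Hypothetical example matrix, chosen only to make the result easy to check.
example_matrix = [[2, 0, 0, 0],
                  [0, 3, 0, 0],
                  [0, 0, 4, 0],
                  [1, 1, 1, 1]]
result = matmul4x4_streaming([1.0, 1.0, 1.0, 1.0], example_matrix)
print(result)  # [3.0, 4.0, 5.0, 1.0]
```

Only the four accumulators persist between iterations, which mirrors the point that four multipliers and four adders suffice when components arrive one by one.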

The second benefit is the inherent pipelining, which enables running at higher clock frequencies. Figure 26 shows how the partial results are calculated in a pipelined fashion. Every clock cycle, at most four multiplications and four additions are performed, and the final result is obtained in six clock cycles. If the calculation only started once all vertex components were available, it would have to complete in at most two clock cycles to still deliver the result in six clock cycles. This would require a longer combinatorial path, lowering the potential clock frequency at which this block could run.

[Figure 26: timing diagram with the signals clk, valid in, data in, mul result 1 to 4, add result 1 to 4, and valid out.]

Figure 26: Pipelining of partial results in the matmul4x4_s IP. r_n^m is the cumulative result of the m partial products of row n.

The third benefit is that streaming requires less memory. The four components of a vertex do not have to be stored before they are processed, and less memory is needed for intermediate results as well. In the matmul4x4_s block, only four words of memory are required to store the results of the adders and, consequently, the final result.

After multiplying the three vertices with the MVP matrix, the results Vₙ·MVP are fed to the div3 and recp blocks. The w component of each result contains the view space depth of the vertex, of which we want the reciprocal. The reciprocal is calculated by the recp block, and the results are processed by the fp_to_q8_24 block, which converts the floating-point input to the Q8.24 fixed-point format.
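As an illustration of the reciprocal and fixed-point conversion, here is a minimal Python sketch. It assumes that Q8.24 means a signed 32-bit word with 24 fractional bits, and that the conversion rounds to nearest and saturates on overflow; the source states neither the rounding nor the overflow behaviour, so both are assumptions. The function names are hypothetical and only loosely mirror the recp and fp_to_q8_24 blocks.

```python
Q_FRACTION_BITS = 24                      # assumed: 24 fractional bits
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def float_to_q8_24(value: float) -> int:
    """Convert a float to a raw Q8.24 integer (assumed: round, then saturate)."""
    scaled = round(value * (1 << Q_FRACTION_BITS))
    return max(INT32_MIN, min(INT32_MAX, scaled))


def q8_24_to_float(raw: int) -> float:
    """Interpret a raw Q8.24 integer as a real number."""
    return raw / (1 << Q_FRACTION_BITS)


def reciprocal_depth_q8_24(w: float) -> int:
    """Reciprocal of the depth w, converted to Q8.24 (w == 0 is not handled)."""
    return float_to_q8_24(1.0 / w)


raw = reciprocal_depth_q8_24(4.0)
print(raw, q8_24_to_float(raw))  # 4194304 0.25
```

Under these assumptions the format covers roughly the range from -128 to just below 128 with a resolution of 2^-24.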

Additionally, the perspective divide is performed for each vertex Vₙ·MVP using the div3 block, which divides the x, y, and z components by the w component. After this division, the vertices are in NDC space. The screenspace block transforms the vertices to screen space by scaling and translating the x and y components to match the viewport resolution that is stored in the configuration memory of the accelerator. It also drops the z and w components, and converts the x and y components into 16-bit signed integers.
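The perspective divide and the screen-space transform can be sketched as follows. This is a hedged Python illustration, not the behaviour of the div3 and screenspace blocks themselves: the mapping of the NDC range [-1, 1] onto [0, width] and [0, height], the absence of a y flip, truncation toward zero, and saturation to the 16-bit range are all assumptions, and the function names are hypothetical.

```python
from typing import Tuple

INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def perspective_divide(clip: Tuple[float, float, float, float]) -> Tuple[float, float, float]:
    """Divide x, y, and z by w, yielding NDC coordinates."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)


def to_screen_space(ndc: Tuple[float, float, float],
                    width: int, height: int) -> Tuple[int, int]:
    """Scale and translate NDC x and y to the viewport; z is dropped."""
    x, y, _z = ndc
    sx = int((x + 1.0) * 0.5 * width)    # assumed mapping: [-1, 1] -> [0, width]
    sy = int((y + 1.0) * 0.5 * height)   # assumed mapping: [-1, 1] -> [0, height]
    clamp = lambda v: max(INT16_MIN, min(INT16_MAX, v))  # fit into signed 16 bits
    return (clamp(sx), clamp(sy))


ndc = perspective_divide((2.0, -2.0, 1.0, 4.0))
print(ndc)                             # (0.5, -0.5, 0.25)
print(to_screen_space(ndc, 640, 480))  # (480, 120)
```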

The last calculation involves obtaining the reciprocal of twice the triangle area. First, twice the area of the triangle is calculated by the edge_function block, giving us the result in 32-bit signed integer
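For reference, a commonly used form of the edge function is sketched below in Python. It assumes integer screen-space coordinates and one particular sign convention; the source does not state which vertex order the edge_function block uses, so the sign is an assumption, and the helper `reciprocal_of_double_area` is hypothetical.

```python
from typing import Optional, Tuple

Point = Tuple[int, int]


def edge_function(a: Point, b: Point, c: Point) -> int:
    """Signed value equal to twice the area of triangle (a, b, c)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def reciprocal_of_double_area(a: Point, b: Point, c: Point) -> Optional[float]:
    """Reciprocal of twice the area, or None for a degenerate triangle."""
    double_area = edge_function(a, b, c)
    if double_area == 0:
        return None  # zero-area triangle: no reciprocal exists
    return 1.0 / double_area


print(edge_function((0, 0), (4, 0), (0, 3)))  # 12, twice the area of 6
print(edge_function((0, 0), (0, 3), (4, 0)))  # -12, opposite vertex order
```

The sign of the result flips with the vertex order, so the value also encodes the winding of the triangle on screen.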

4.2.1.2 Back-face Culling

Back-face culling is performed by checking whether the normal of the triangle points towards the camera. This can be done in three steps (illustrated in the bottom path of Figure 25):

• Define a (normalized) vector with its base at the camera origin, pointing to one of the vertices.

• Calculate the (normalized) normal vector of the triangle.

• Check if the angle between the two vectors lies between 90° and 270°, i.e. if the dot product is negative.
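The three steps above can be sketched in Python. This is a simplified illustration under stated assumptions: the vertices are taken to be in view space already, with the camera at the origin, and the normal is computed directly from the view-space edges. Normalization is skipped because it does not change the sign of the dot product. All function names are hypothetical.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def is_front_facing(v0: Vec3, v1: Vec3, v2: Vec3) -> bool:
    """True if the triangle normal points towards a camera at the origin."""
    to_vertex = v0                            # step 1: camera origin -> vertex
    normal = cross(sub(v1, v0), sub(v2, v0))  # step 2: triangle normal
    return dot(to_vertex, normal) < 0         # step 3: negative dot product


# Hypothetical triangle in front of a camera looking along the negative z axis.
print(is_front_facing((0, 0, -5), (1, 0, -5), (0, 1, -5)))  # True
print(is_front_facing((0, 0, -5), (0, 1, -5), (1, 0, -5)))  # False
```

Swapping two vertices reverses the normal, which is why the second call reports a back-facing triangle.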

Performing these steps is easiest in view space. Because the camera sits at the origin of view space, the first step is simply to multiply v₀ with the MV matrix to obtain v₀·MV. This multiplication is performed by another instance of matmul4x4_s, in parallel with the calculation of v₀·MVP described above.

Calculating the normal vector of the triangle involves determining two edges e₁₀ = v₁ − v₀ and e₂₀ = v₂ − v₀, then performing a cross product to obtain the normal vector in model space, N_model = e₁₀ × e₂₀. Finally, N_model is multiplied with the MV matrix using the matmul3x3 IP, resulting in the normal vector in view space N_view =

In document Real Time rasterization on the Starburst MPSoC (Page 47-49)