Geometry stage
matmul4x4_s block, which accepts streaming data. Looking at Equation 8, which shows the vector-matrix multiplication, we can see that the
four rows of the result are simply sums of partial products. For example, when only vx is available, the partial products vx·m11, vx·m12, vx·m13, and vx·m14 can already be calculated.
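\[
\begin{bmatrix} v_x & v_y & v_z & v_w \end{bmatrix}
\cdot
\begin{bmatrix}
m_{11} & m_{12} & m_{13} & m_{14} \\
m_{21} & m_{22} & m_{23} & m_{24} \\
m_{31} & m_{32} & m_{33} & m_{34} \\
m_{41} & m_{42} & m_{43} & m_{44}
\end{bmatrix}
=
\begin{bmatrix}
v_x \cdot m_{11} + v_y \cdot m_{21} + v_z \cdot m_{31} + v_w \cdot m_{41} \\
v_x \cdot m_{12} + v_y \cdot m_{22} + v_z \cdot m_{32} + v_w \cdot m_{42} \\
v_x \cdot m_{13} + v_y \cdot m_{23} + v_z \cdot m_{33} + v_w \cdot m_{43} \\
v_x \cdot m_{14} + v_y \cdot m_{24} + v_z \cdot m_{34} + v_w \cdot m_{44}
\end{bmatrix}
\tag{8}
\]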
The method of streaming in data and calculating partial results has several benefits. The first benefit is that less logic is required. The dual-ring interconnect can transfer only one word of data per clock cycle, so only one vertex component is available at the input at a time. For each vertex component, one partial product for each of the four rows can be calculated, requiring four multipliers and four adders. Calculating the full result in one go would require 16 multipliers and 12 adders.
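As a rough illustration of the streaming scheme, the following Python sketch models the accumulation in software. It is not the hardware description of matmul4x4_s: the function name matmul4x4_stream and the list-of-rows matrix layout are assumptions made for this example only.

```python
def matmul4x4_stream(components, m):
    """Software model of a streaming vector-matrix multiply.

    components: iterable yielding vx, vy, vz, vw one at a time.
    m: 4x4 matrix as a list of four rows (assumed layout), so m[i][n]
       corresponds to m_(i+1)(n+1) in Equation 8.
    """
    acc = [0.0, 0.0, 0.0, 0.0]  # four words of state, one per result row
    for i, v in enumerate(components):  # one component per "clock cycle"
        for n in range(4):  # four multiplies and four adds per component
            acc[n] += v * m[i][n]
    return acc


if __name__ == "__main__":
    matrix = [
        [1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0],
    ]
    print(matmul4x4_stream(iter([1.0, 2.0, 3.0, 4.0]), matrix))
```

The inner loop performs the four multiplications and four additions that one incoming component allows, which is why the model never needs more than the four accumulator words.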
The second benefit is the inherent pipelining, which enables running at higher clock frequencies. Figure 26 shows how the partial results are calculated in a pipelined fashion. Every clock cycle, at most four multiplications and four additions are performed, and the final result is obtained in six clock cycles. If the calculation instead started only once all four vertex components were available, it would have to complete in at most two clock cycles to still deliver the result in six clock cycles, because four cycles are spent receiving the components. This would require a longer combinatorial path, lowering the potential clock frequency at which this block could run.
[Figure 26: timing diagram showing the signals clk, valid in, data in, mul result 1 to 4, add result 1 to 4, and valid out.]
Figure 26: Pipelining of partial results in the matmul4x4_s IP. r^n_m is the cumulative result of the m partial products of row n.
The third benefit is that streaming requires less memory. The four components of a vertex do not have to be stored before they are processed, and less memory is needed for intermediate results. In the matmul4x4_s block, only four words of memory are required to store the results of the adders and, consequently, the final result.
After multiplying the three vertices with the MVP matrix, the results
Vn·MVP are fed to the div3 and recp blocks. The w component of each result contains the view-space depth of the vertex, of which we want the reciprocal. The reciprocal is calculated by the recp block, and the results are processed by the fp_to_q8_24 block, which converts the floating-point input to the Q8.24 fixed-point format.
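The reciprocal and fixed-point conversion can be sketched in Python as follows. This is a software model under stated assumptions, not the behaviour of the actual recp and fp_to_q8_24 blocks: it assumes Q8.24 is stored as a signed 32-bit two's-complement word, that out-of-range values saturate, and that rounding is to nearest. None of these details are specified in the text above.

```python
Q_FRAC_BITS = 24            # Q8.24: 24 fractional bits
Q_MIN = -(1 << 31)          # assumed signed 32-bit storage
Q_MAX = (1 << 31) - 1


def recp(w):
    """Reciprocal of the view-space depth w (w must be non-zero)."""
    return 1.0 / w


def fp_to_q8_24(x):
    """Convert a float to Q8.24, saturating on overflow (assumed behaviour)."""
    raw = int(round(x * (1 << Q_FRAC_BITS)))  # Python rounds ties to even
    return max(Q_MIN, min(Q_MAX, raw))


def q8_24_to_float(q):
    """Inverse conversion, included only to inspect results."""
    return q / (1 << Q_FRAC_BITS)


if __name__ == "__main__":
    q = fp_to_q8_24(recp(4.0))
    print(hex(q), q8_24_to_float(q))
```

With 24 fractional bits the resolution is 2^-24, and under the signed 32-bit assumption the representable range is roughly -128 to just under 128.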
Additionally, the perspective divide is performed for each vertex Vn·MVP using the div3 block, which divides the x, y, and z components by the w component. After this division, the vertices are in NDC space. The screenspace block transforms the vertices to screen space by scaling and translating the x and y components to match the viewport resolution that is stored in the configuration memory of the accelerator. It also drops the z and w components and converts the x
and y components into 16-bit signed integers.
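The perspective divide and viewport transform described above can be modelled as in the sketch below. The exact mapping used by the screenspace block is not given in the text, so the following are assumptions for illustration: NDC x and y lie in [-1, 1], the y axis is flipped so that the screen origin is at the top left, results are rounded down, and values are clamped to the signed 16-bit range.

```python
import math

I16_MIN, I16_MAX = -32768, 32767


def div3(v):
    """Perspective divide: divide x, y, z by w (w must be non-zero)."""
    x, y, z, w = v
    return (x / w, y / w, z / w)


def clamp_i16(value):
    """Clamp an integer to the signed 16-bit range."""
    return max(I16_MIN, min(I16_MAX, value))


def screenspace(ndc, width, height):
    """Map NDC x, y to integer pixel coordinates and drop z.

    Assumed mapping: x in [-1, 1] -> [0, width], y in [-1, 1] -> [height, 0].
    """
    x, y, _z = ndc
    sx = math.floor((x + 1.0) * 0.5 * width)
    sy = math.floor((1.0 - y) * 0.5 * height)
    return (clamp_i16(sx), clamp_i16(sy))


if __name__ == "__main__":
    clip = (1.0, 1.0, 0.5, 2.0)  # hypothetical result of Vn·MVP
    print(screenspace(div3(clip), 640, 480))
```

The width and height arguments stand in for the viewport resolution held in the configuration memory.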
The last calculation involves obtaining the reciprocal of twice the triangle area. First, twice the area of the triangle is calculated by the
edge_function block, giving us the result in 32-bit signed integer
4.2.1.2 Back-face Culling
Back-face culling is performed by checking whether the normal of the triangle is pointing towards the camera. This can be done in three steps (illustrated in the bottom path of Figure 25):
• Define a (normalized) vector with its base at the camera origin, pointing to one of the vertices.
• Calculate the (normalized) normal vector of the triangle.
• Check if the angle between the two vectors lies between 90° and 270°, i.e. if the dot product is negative.
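The three steps above can be sketched in Python as follows. This is a simplified model, not the accelerator's data path: it assumes the three vertices are already in view space with the camera at the origin, and it takes the cross product of two triangle edges as the normal. Normalization is skipped because scaling either vector by a positive factor does not change the sign of the dot product. The helper names are hypothetical.

```python
def sub(a, b):
    """Component-wise difference of two 3D vectors."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    """Cross product a x b."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def is_front_facing(v0, v1, v2):
    """True if the triangle normal points towards the camera.

    v0, v1, v2: view-space positions (assumption: camera at the origin).
    """
    view = v0                                  # step 1: camera origin to v0
    normal = cross(sub(v1, v0), sub(v2, v0))   # step 2: unnormalized normal
    return dot(view, normal) < 0.0             # step 3: negative dot product


if __name__ == "__main__":
    # Hypothetical triangle in front of a camera looking down -z.
    print(is_front_facing((0, 0, -5), (1, 0, -5), (0, 1, -5)))
```

Swapping two vertices reverses the winding, flips the normal, and therefore flips the result of the test.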
Performing these steps is easiest in view space, where the camera sits at the origin, so the first step reduces to multiplying v0 with the MV matrix to obtain v0·MV.
This multiplication is performed by another instantiation of matmul4x4_s, in parallel with the calculation of v0·MVP described above.
Calculating the normal vector of the triangle involves determining two edges e10 = v1 − v0 and e20 = v2 − v0, then performing a cross product to obtain the normal vector in model space Nmodel = e10 × e20. Finally, Nmodel is multiplied with the MV matrix using the
matmul3x3 IP, resulting in the normal vector in view space Nview =