But I'm running into trouble with the MACx computations, which aren't clear to me. Can someone give me some hints?
To illustrate, I'm taking the AVSZ4 instruction as an example:
FLAG = 0;
MAC0 = F( ( (s64)(s16)ZSF4 * SZ0 ) + ( (s16)ZSF4 * SZ1 ) + ( (s16)ZSF4 * SZ2 ) + ( (s16)ZSF4 * SZ3 ) );
OTZ = Lm_D( (s32)MAC0 >> 12 );
If I track the fixed-point formats through that calculation, I get:
x[1:19:12] = ZSF4[1:3:12] * SZ0[0:16:0]
y[1:19:12] = ZSF4[1:3:12] * SZ1[0:16:0]
z[1:19:12] = ZSF4[1:3:12] * SZ2[0:16:0]
w[1:19:12] = ZSF4[1:3:12] * SZ3[0:16:0]
MAC0[1:21:12] = x+y+z+w
which explains why we get OTZ[0:16:0] = clamp(0, MAC0[1:21:12]>>12, 65535).
But the description says the output MAC0 is [1:31:0], so I'm totally lost.
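For comparison, here is a minimal integer-only sketch of the pseudocode above in C. It assumes that MAC0 simply keeps the raw 32-bit sum (so its bits are laid out as [1:19:12]) and that only OTZ gets the >> 12. That reading is an assumption, not something the description confirms. The names avsz4_ref and avsz4_result are made up for illustration, and the FLAG bit positions (16, 15, 18) are inferred from the FLAG comment in the assembly further down.

```c
#include <stdint.h>

/* Assumed PSX FLAG bit positions (inferred, not confirmed by the description). */
#define FLAG_MAC0_POS (1u << 16)  /* sum does not fit in 32 bits, positive */
#define FLAG_MAC0_NEG (1u << 15)  /* sum does not fit in 32 bits, negative */
#define FLAG_OTZ_SAT  (1u << 18)  /* OTZ was clamped to 0..65535 */

/* Hypothetical result type, for illustration only. */
typedef struct {
    int32_t  mac0;
    uint16_t otz;
    uint32_t flag;
} avsz4_result;

static avsz4_result avsz4_ref(int16_t zsf4, const uint16_t sz[4])
{
    avsz4_result r = { 0, 0, 0 };
    int64_t sum = 0;

    /* [1:3:12] * [0:16:0] -> [1:19:12]; four terms need up to [1:21:12]. */
    for (int i = 0; i < 4; i++)
        sum += (int64_t)zsf4 * (int64_t)sz[i];

    /* F(): only report the overflow, the value itself is truncated to 32 bits. */
    if (sum > INT32_MAX) r.flag |= FLAG_MAC0_POS;
    if (sum < INT32_MIN) r.flag |= FLAG_MAC0_NEG;
    r.mac0 = (int32_t)(uint32_t)(uint64_t)sum;

    /* Lm_D(): drop the 12 fraction bits, then clamp to 0..65535.
       (>> on a negative int32_t is assumed to be an arithmetic shift.) */
    int32_t z = r.mac0 >> 12;
    if (z < 0)     { z = 0;     r.flag |= FLAG_OTZ_SAT; }
    if (z > 65535) { z = 65535; r.flag |= FLAG_OTZ_SAT; }
    r.otz = (uint16_t)z;

    return r;
}
```

With ZSF4 = 0x400 (0.25 in [1:3:12]) and SZ = 100, 200, 300, 400, this gives MAC0 = 1024000 and OTZ = 250, which is the plain average of the four values.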
My code:
/*
Name     Cycles  Command          Description
AVSZ4    6       cop2 0x168002E   Average of four Z values

Fields:
  in:   SZ1,SZ2,SZ3,SZ4   Z-Values   [0,16,0]
        ZSF4              Divider    [1,3,12]
  out:  OTZ               Average.   [0,16,0]
        MAC0              Average.   [1,31,0]

Calculation:
  [1,31,0]  MAC0 = F[ZSF4*SZ0 + ZSF4*SZ1 + ZSF4*SZ2 + ZSF4*SZ3]   [1,31,12]
  [0,16,0]  OTZ  = Lm_D[MAC0]                                     [1,31,0]
*/
.set OTZ, $s013.s // [0:16:0] but stored as float internally
.set SZ0SZ1SZ2SZ3, $c100.q // [0:16:0] but stored as float internally
.set MAC0, $s120.s // [1:31:0] or [1:27:4] or [1:19:12]
.set ZSF4, $s332.s // [1:3:12] but stored as float internally
.set FLAG, $s333.s // bit set but stored as float internally
.set DQBZSF3ZSF4FLAG, $c330.q
.extern host_gte_avsz4
.ent host_gte_avsz4
//
// OTZ = MAC0 = ZSF4*SZ0 + ZSF4*SZ1 + ZSF4*SZ2 + ZSF4*SZ3
vdot.q $s700.s, DQBZSF3ZSF4FLAG[z,z,z,z], SZ0SZ1SZ2SZ3 // 7-cycle latency: don't read $s700.s in the next 6 instructions, so that vdot.q effectively costs 1 cycle.
// constants for limits (and conveniently no stall because of vdot.q)
lui $at, %hi(0x4F000000) // 2147483648.0
vfim.s $s413.s, 65536.0
mtv $at, $s423.s // 2147483648.0
vzero.s $s433.s // 0.0
mtv $at, $s403.s // 2147483648.0
viim.s $s432.s, 65535.0
// (Fn,Dz) = (MAC0 < -2^31, OTZ < 0) <==> (Fn,Dz) = (2^31 < -MAC0, 0 < -OTZ)
vslt.p $c602.p, $r423.p, $c700.p[-x,-x]
// (Fp,Dp) = (MAC0 >= 2^31, OTZ >= 2^16)
vsge.p $c600.p, $c700.p[x,x], $r403.p
// constants for FLAG
viim.s $s400.s, 32768 # F pos = 1<<(31-16), to match the FLAG formula below
vmov.s $s401.s, $s413.s # F neg = 1<<(31-15) = 65536.0
viim.s $s402.s, 8192 # D
// save MAC0 in fixed point (1:31:0) !?
vf2in.s MAC0, $s700.s, 0
// clamp OTZ between 0 and 65535 (part I)
vmin.s $s700.s, $s700.s, $s432.s
// FLAG = Fp*(1<<(31-16)) + Dp*(1<<(31-18)) + Fn*(1<<(31-15)) + Dz*(1<<(31-18))
vdot.q FLAG, $c400.q[x,z,y,z], $c600.q
// clamp OTZ between 0 and 65535 (part II) and save it
jr $ra
vmax.s OTZ, $s700.s, $s433.s
.end host_gte_avsz4
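The FLAG formula in the comments above (a dot product of four 0.0/1.0 comparison results with four bit weights) can be modelled in scalar C as a rough sketch. The function name deferred_flag is hypothetical, and the weights are the ones from the FLAG comment, that is bits in reversed order: 1<<(31-16) for Fp, 1<<(31-15) for Fn and 1<<(31-18) for both D cases. MAC0 and OTZ are passed separately here so the sketch takes no position on where the >> 12 belongs.

```c
/* Scalar model of the "FLAG as a weighted sum of predicates" trick.
   mac0: unclamped sum as a float; otz: unclamped OTZ candidate as a float.
   Returns the bit-reversed FLAG, still stored as a float. */
static float deferred_flag(float mac0, float otz)
{
    /* Each comparison yields 1.0 or 0.0, like vsge/vslt on the VFPU. */
    float fp = (mac0 >= 2147483648.0f)  ? 1.0f : 0.0f;  /* MAC0 >= 2^31 */
    float fn = (mac0 <  -2147483648.0f) ? 1.0f : 0.0f;  /* MAC0 < -2^31 */
    float dp = (otz  >= 65536.0f)       ? 1.0f : 0.0f;  /* OTZ >= 2^16  */
    float dz = (otz  <  0.0f)           ? 1.0f : 0.0f;  /* OTZ < 0      */

    /* Dp and Dz share one weight; they can never both be 1. */
    return fp * 32768.0f + dp * 8192.0f + fn * 65536.0f + dz * 8192.0f;
}
```

All weights are powers of two and at most two terms are non-zero at once, so the float sum is exact and can be converted back to an integer bit set later.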
FLAG contains various bits that indicate overflows. I'm not handling FLAG the way a real PSX does, in order to cut down on bit manipulation in every GTE instruction. Instead, I defer that extra work until the FLAG register is read into a GP register. To retrieve the exact FLAG:
host_gte_mfc2_flag:
vf2in.s $s400.s, FLAG, 0 // convert float to integer; the bits are held in reversed order (FLAG only uses 20 bits, so they fit exactly in a 32-bit float)
lui $at, %hi(OVERFLOW_BITS_WHICH_SET_BIT31)
ori $at, %lo(OVERFLOW_BITS_WHICH_SET_BIT31)
mfv $v0, $s400.s // put bits into a GP register
and $at, $at, $v0
sltu $at, $0, $at // $at = 1 if any of those overflow bits is set
or $v0, $v0, $at // set bit 0 in that case; it becomes bit 31 after the bitrev below
jr $ra
bitrev $v0, $v0 // reverse the bit order to reflect a true PSX FLAG
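The same read-back step can be sketched in portable C. This is only an illustration: bitrev32 and flag_to_psx are hypothetical names, and the summary mask is passed in as a parameter because OVERFLOW_BITS_WHICH_SET_BIT31 is not defined in this post. As in the assembly, the mask is applied to the bits while they are still in reversed order.

```c
#include <stdint.h>

/* Reverse the order of all 32 bits (what the bitrev instruction does). */
static uint32_t bitrev32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* stored: bit-reversed FLAG kept as a float.
   summary_mask_reversed: assumed set of (reversed) bits that raise bit 31. */
static uint32_t flag_to_psx(float stored, uint32_t summary_mask_reversed)
{
    uint32_t bits = (uint32_t)stored;                  /* like vf2in.s + mfv */
    uint32_t any  = (bits & summary_mask_reversed) != 0;
    bits |= any;                                       /* bit 0 = summary bit */
    return bitrev32(bits);                             /* bit 0 ends up as bit 31 */
}
```

For example, a stored value of 32768.0 (reversed bit 15) with a mask of 0x8000 comes back as 0x80010000: PSX bit 16 plus the summary bit 31.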