[PSX -> PSP / GTE -> VFPU] need some hints

hlide · Post by **hlide** » Mon Jun 30, 2008 12:36 am

I'm coding GTE instructions into functions using VFPU.

But I've run into some trouble with the MACx computations, which are not clear to me. Can someone give me some hints?

To illustrate, I'm taking the AVSZ4 instruction as an example:

Code:

      FLAG = 0;

      MAC0 = F( ( (s64)(s16)ZSF4 * SZ0 ) + ( (s16)ZSF4 * SZ1 ) + ( (s16)ZSF4 * SZ2 ) + ( (s16)ZSF4 * SZ3 ) );
      OTZ = Lm_D( (s32)MAC0 >> 12 );

What I don't understand is that the output MAC0 is said to be in 1:31:0 format (sign:integer:fraction), whereas ZSF4 is 1:3:12, so I would expect MAC0 to be:

x[1:19:12] = ZSF4[1:3:12] * SZ0[0:16:0]
y[1:19:12] = ZSF4[1:3:12] * SZ1[0:16:0]
z[1:19:12] = ZSF4[1:3:12] * SZ2[0:16:0]
w[1:19:12] = ZSF4[1:3:12] * SZ3[0:16:0]

MAC0[1:21:12] = x+y+z+w

which explains why we get OTZ[0:16:0] = clamp(0, MAC0[1:21:12]>>12, 65535).

But the description says the output MAC0 is [1:31:0], so I'm totally lost.
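The bit-width argument above can be checked with plain integer arithmetic. The following is only an illustrative sketch in C, not code from any emulator; the helper name `avsz4_sum_q12` is made up for this example. It forms the four products in a 64-bit integer, which shows that the raw sum still carries 12 fraction bits and can exceed 32 bits, while the value shifted right by 12 is small.

```c
#include <stdint.h>

/* Hypothetical helper: raw AVSZ4 sum, still carrying 12 fraction bits.
   zsf4 is 1:3:12 (signed 16-bit); each sz is 0:16:0 (unsigned 16-bit). */
int64_t avsz4_sum_q12(int16_t zsf4, uint16_t sz0, uint16_t sz1,
                      uint16_t sz2, uint16_t sz3)
{
    /* Each product is at most 1:19:12; four of them need up to 1:21:12,
       so the accumulator must be wider than 32 bits. */
    return (int64_t)zsf4 * sz0
         + (int64_t)zsf4 * sz1
         + (int64_t)zsf4 * sz2
         + (int64_t)zsf4 * sz3;
}
```

With ZSF4 = 0x0400 (0.25) and four Z values of 1000, the raw sum is 4096000, which is 1000.0 with 12 fraction bits, and 4096000 >> 12 gives 1000. With ZSF4 = 0x7FFF and all four Z values at 65535, the raw sum is 8589541380, which no longer fits in a signed 32-bit register.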

My code:

Code:

/*
Name	Cycles	Command	Description
AVSZ4	6	cop2 0x168002E	Average of four Z values
Fields:
in:      SZ1,SZ2,SZ3,SZ4   Z-Values                            [0,16,0]
         ZSF4              Divider                             [1,3,12]
out:     OTZ               Average.                            [0,16,0]
         MAC0              Average.                            [1,31,0]

Calculation:
[1,31,0] MAC0=F[ZSF4*SZ0 + ZSF4*SZ1 + ZSF4*SZ2 + ZSF4*SZ3]     [1,31,12]
[0,16,0] OTZ=Lm_D[MAC0]                                        [1,31,0]
*/

.set OTZ,             $s013.s // [0:16:0] but stored as float internally
.set SZ0SZ1SZ2SZ3,    $c100.q // [0:16:0] but stored as float internally
.set MAC0,            $s120.s // [1:31:0] or [1:27:4] or [1:19:12]
.set ZSF4,            $s332.s // [1:3:12] but stored as float internally
.set FLAG,            $s333.s // bit set but stored as float internally
.set DQBZSF3ZSF4FLAG, $c330.q

.extern host_gte_avsz4
.ent    host_gte_avsz4

    // 
    // OTZ = MAC0 = ZSF4*SZ0 + ZSF4*SZ1 + ZSF4*SZ2 + ZSF4*SZ3
    vdot.q      $s700.s, DQBZSF3ZSF4FLAG[z,z,z,z], SZ0SZ1SZ2SZ3 // 7 cycles of latency, so $s700.s shouldn't be used within the next 6 instructions if this is to run as a 1-cycle instruction.

    // constants for limits (and conveniently no stall because of vdot.q)
    lui         $at, %hi(0x4F000000) // 2147483648.0
    vfim.s      $s413.s, 65536.0
    mtv         $at, $s423.s // 2147483648.0
    vzero.s     $s433.s // 0.0
    mtv         $at, $s403.s // 2147483648.0
    viim.s      $s432.s, 65535.0    
    
    // (Fn,Dz) = (MAC0 < -2^31, OTZ < 0) <==> (Fn,Dz) = (2^31 < -MAC0, 0 < -OTZ)
    vslt.p      $c602.p, $r423.p, $c700.p[-x,-x]
    // (Fp,Dp) = (MAC0 >= 2^31, OTZ >= 2^16)
    vsge.p      $c600.p, $c700.p[x,x], $r403.p

    // constants for FLAG
    vmov.s      $s400.s, $s413.s # F pos
    viim.s      $s401.s, 32778   # F neg
    viim.s      $s402.s, 8192    # D

    // save MAC0 in fixed point (1:31:0) !?
    vf2in.s     MAC0, $s700.s, 0

    // clamp OTZ between 0 and 65535 (part I)
    vmin.s      $s700.s, $s700.s, $s432.s

    // FLAG = Fp*(1<<(31-16)) + Dp*(1<<(31-18)) + Fn*(1<<(31-15)) + Dz*(1<<(31-18))
    vdot.q      FLAG, $c400.q[x,z,y,z], $c600.q

    // clamp OTZ between 0 and 65535 (part II) and save it
    jr          $ra
    vmax.s      OTZ, $s700.s, $s433.s

.end    host_gte_avsz4

NOTE:
FLAG contains various bits that indicate overflows. However, I'm not handling FLAG the way a real PSX does, in order to reduce the bit computations in each GTE instruction. Instead, I defer those extra computations until the FLAG register is read into a GP register. To retrieve the exact FLAG:

Code:

host_gte_mfc2_flag:
    vf2in.s $s400.s, FLAG, 0 // convert float into integer to get reversed bits (only 20 bits are used in FLAG so it is okay to have them stored as a 32-bit float)
    lui $at, %hi(OVERFLOW_BITS_WHICH_SET_BIT31)
    ori $at, %lo(OVERFLOW_BITS_WHICH_SET_BIT31)
    mfv $v0, $s400.s // put bits into a GP register 
    and $at, $at, $v0
    sltu $at, $0, $at // check if one of overflow bits is set
    or $v0, $v0, $at // "bit 32" is set if one of overflow bit is set
    jr $ra
    bitrev $v0, $v0 // reverse the bit order to reflect a true PSX FLAG
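
For readers who don't follow MIPS assembly, the deferred FLAG read above can be sketched in C roughly as follows. This is only an illustration: the names `bitrev32` and `gte_read_flag` are made up, and the mask value `FLAG_ERROR_MASK` is an assumption standing in for `OVERFLOW_BITS_WHICH_SET_BIT31`, whose actual value is not given in the post (here it is assumed to cover bits 30..23 and 18..13 of the real FLAG layout).

```c
#include <stdint.h>

/* ASSUMPTION: FLAG bits that feed summary bit 31, in the real
   (non-reversed) layout: bits 30..23 and 18..13. */
#define FLAG_ERROR_MASK 0x7F87E000u

/* Reverse the bit order of a 32-bit word (the job done by bitrev). */
uint32_t bitrev32(uint32_t v)
{
    v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);
    v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);
    v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);
    v = ((v >> 8) & 0x00FF00FFu) | ((v & 0x00FF00FFu) << 8);
    return (v >> 16) | (v << 16);
}

/* stored_reversed holds the flag bits in reversed order, as the VFPU
   routines accumulate them; the summary bit is computed only here. */
uint32_t gte_read_flag(uint32_t stored_reversed)
{
    uint32_t mask_reversed = bitrev32(FLAG_ERROR_MASK);
    uint32_t any = (stored_reversed & mask_reversed) != 0; /* 0 or 1 */
    /* Bit 0 of the reversed word becomes bit 31 after the reversal. */
    return bitrev32(stored_reversed | any);
}
```

For example, a stored word holding only the reversed form of bit 16 reads back as 0x80010000 (bit 16 plus the summary bit), while a stored bit 12, which is outside the assumed mask, reads back as 0x00001000.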

J.F. · Post by **J.F.** » Mon Jun 30, 2008 5:19 am

Well, looking at the Padua gte.txt doc, it looks like some operations are carried out with temporaries of 1:31:12, so if a source were 1:3:12, it would be promoted to 1:31:12 for the duration of the calculation, then truncated to 1:31:0 when stored to MACn.

For example, look how they describe the RTPS operation:

Code:

RTPS     15       Perspective transformation          
Fields:  none
Opcode:  cop2 $0180001

In:      V0       Vector to transform.                         [1,15,0]
         R        Rotation matrix                              [1,3,12]
         TR       Translation vector                           [1,31,0]
         H        View plane distance                          [0,16,0]
         DQA      Depth que interpolation values.              [1,7,8]
         DQB                                                   [1,7,8]
         OFX      Screen offset values.                        [1,15,16]
         OFY                                                   [1,15,16]
Out:     SXY fifo Screen XY coordinates.(short)                [1,15,0]
         SZ fifo  Screen Z coordinate.(short)                  [0,16,0]
         IR0      Interpolation value for depth queing.        [1,3,12]
         IR1      Screen X (short)                             [1,15,0]
         IR2      Screen Y (short)                             [1,15,0]
         IR3      Screen Z (short)                             [1,15,0]
         MAC1     Screen X (long)                              [1,31,0]
         MAC2     Screen Y (long)                              [1,31,0]
         MAC3     Screen Z (long)                              [1,31,0]

Calculation:
[1,31,0] MAC1=A1[TRX + R11*VX0 + R12*VY0 + R13*VZ0]            [1,31,12]
[1,31,0] MAC2=A2[TRY + R21*VX0 + R22*VY0 + R23*VZ0]            [1,31,12]
[1,31,0] MAC3=A3[TRZ + R31*VX0 + R32*VY0 + R33*VZ0]            [1,31,12]
[1,15,0] IR1= Lm_B1[MAC1]                                      [1,31,0]
[1,15,0] IR2= Lm_B2[MAC2]                                      [1,31,0]
[1,15,0] IR3= Lm_B3[MAC3]                                      [1,31,0]
         SZ0<-SZ1<-SZ2<-SZ3
[0,16,0] SZ3= Lm_D(MAC3)                                       [1,31,0]
         SX0<-SX1<-SX2, SY0<-SY1<-SY2
[1,15,0] SX2= Lm_G1[F[OFX + IR1*(H/SZ)]]                       [1,27,16]
[1,15,0] SY2= Lm_G2[F[OFY + IR2*(H/SZ)]]                       [1,27,16]
[1,31,0] MAC0= F[DQB + DQA * (H/SZ)]                           [1,19,24]
[1,15,0] IR0= Lm_H[MAC0]                                       [1,31,0]

Notes:
Z values are limited downwards at 0.5 * H. For smaller z values you'll have
to write your own routine.

Note that the first three MAC calculations are done at [1,31,12], then stored as [1,31,0]. Similarly, you see SX and SY later done as [1,27,16] before being stored as [1,15,0]. So the GTE seems to do most calculations as fixed point, signed, 44 bit numbers that are later shifted and truncated to store in various registers.
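
As a rough illustration of that reading of the doc, one row of the RTPS matrix step could be written in C as below. This is a sketch under stated assumptions, not a verified model of the hardware: the helper name `rtps_mac_row` is made up, a 64-bit integer stands in for the wide temporary, and the A1..A3 overflow flags are left out.

```c
#include <stdint.h>

/* Hypothetical helper: MACn = (TRn + Rn1*VX + Rn2*VY + Rn3*VZ), computed
   with 12 fraction bits and then stored as 1:31:0.
   tr is 1:31:0, r1..r3 are 1:3:12, vx/vy/vz are 1:15:0. */
int32_t rtps_mac_row(int32_t tr, int16_t r1, int16_t r2, int16_t r3,
                     int16_t vx, int16_t vy, int16_t vz)
{
    /* Promote the translation to 12 fraction bits so every term matches.
       Multiplying by 4096 avoids left-shifting a negative value. */
    int64_t acc = (int64_t)tr * 4096;
    acc += (int64_t)r1 * vx;
    acc += (int64_t)r2 * vy;
    acc += (int64_t)r3 * vz;
    /* Drop the 12 fraction bits. ASSUMPTION: >> on a negative int64_t is
       an arithmetic shift, as it is on common compilers. */
    return (int32_t)(acc >> 12);
}
```

With an identity row (R11 = 4096, i.e. 1.0), VX0 = 100 and TRX = 5, the accumulator holds 430080 and the stored result is 105.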

hlide · Post by **hlide** » Mon Jun 30, 2008 6:29 am

Thanks J.F., your explanation makes sense. The only problem I see is this:

FLAG = 0;
MAC0 = F( ( (s64)(s16)ZSF4 * SZ0 ) + ( (s16)ZSF4 * SZ1 ) + ( (s16)ZSF4 * SZ2 ) + ( (s16)ZSF4 * SZ3 ) );
OTZ = Lm_D( (s32)MAC0 >> 12 );

As you can see, the result stored in MAC0 at the end is still [1:19:12] and not [1:31:0]. So I guess this GTE emulation code, which I found in several PSX emulators, is not totally accurate about the result left in MAC0. Sure, if this result is never used, it is probably okay.

But I like your explanation and I'll apply it this way, as it costs just 1 cycle and allows me to avoid some stalls anyway.

According to your explanation, the accurate code should be:

FLAG = 0;
s64 tmp = ( ( (s64)(s16)ZSF4 * SZ0 ) + ( (s16)ZSF4 * SZ1 ) + ( (s16)ZSF4 * SZ2 ) + ( (s16)ZSF4 * SZ3 ) ) >> 12;
MAC0 = F( tmp );
OTZ = Lm_D( (s32)tmp );
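
Spelled out as compilable C, the variant proposed above might look like the sketch below. It only mirrors the pseudocode's ordering (shift first, then F and Lm_D) and makes no claim about what real hardware leaves in MAC0. The struct and function names are made up for illustration, and the FLAG bit positions (16, 15, 18) are taken from the comment in the VFPU routine earlier in the thread.

```c
#include <stdint.h>

/* FLAG bits as listed in the VFPU routine's comment (true PSX layout). */
#define FLAG_MAC0_POS (1u << 16)  /* F: MAC0 positive overflow */
#define FLAG_MAC0_NEG (1u << 15)  /* F: MAC0 negative overflow */
#define FLAG_OTZ_SAT  (1u << 18)  /* Lm_D: OTZ saturated */

typedef struct {
    int32_t  mac0;
    uint16_t otz;
    uint32_t flag;
} avsz4_result;  /* hypothetical container for the three outputs */

avsz4_result gte_avsz4(int16_t zsf4, uint16_t sz0, uint16_t sz1,
                       uint16_t sz2, uint16_t sz3)
{
    avsz4_result r = { 0, 0, 0 };  /* FLAG = 0 */
    int64_t sum = (int64_t)zsf4 * sz0 + (int64_t)zsf4 * sz1
                + (int64_t)zsf4 * sz2 + (int64_t)zsf4 * sz3;
    /* ASSUMPTION: >> on a negative int64_t is an arithmetic shift. */
    int64_t tmp = sum >> 12;

    /* F(): with the shift applied first these checks cannot fire for
       16-bit inputs; they are kept to mirror the pseudocode. */
    if (tmp > INT32_MAX) r.flag |= FLAG_MAC0_POS;
    if (tmp < INT32_MIN) r.flag |= FLAG_MAC0_NEG;
    r.mac0 = (int32_t)tmp;

    /* Lm_D(): clamp to 0..65535 and flag saturation. */
    if (tmp < 0)          { r.otz = 0;      r.flag |= FLAG_OTZ_SAT; }
    else if (tmp > 65535) { r.otz = 65535;  r.flag |= FLAG_OTZ_SAT; }
    else                  { r.otz = (uint16_t)tmp; }
    return r;
}
```

With ZSF4 = 0x0400 (0.25) and Z values 100, 200, 300, 400, both MAC0 and OTZ come out as 250 and FLAG stays 0.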

J.F. · Post by **J.F.** » Mon Jun 30, 2008 7:52 am

Well, it's more Padua's explanation than mine, but I imagine that some emulators took shortcuts when they could to save time. Many of these PSX emulators were written when doing a PSX emu was pushing the hardware to the limit, and not converting MAC0 from [1,19,12] to [1,31,0] when it wasn't needed by any games (that they knew of) was probably considered a valid speedup.