VFPU diggins
You are right, I just checked them again, and it seems those were wrong. Actually vone, vzero and vidt all take 1 cycle.
vmone/vmzero/vmidt take 4/3/2 cycles depending on the variant used (.q/.t/.p). Needs update.
<Don't push the river, it flows.>
http://wordpress.fx-world.org - my devblog
http://wiki.fx-world.org - VFPU documentation wiki
Alexander Berl
Default values (in hex) of RCX0-7 before any use of vrndf1:
3f800001
3f800002
3f800004
3f800008
3f800000
3f800000
3f800000
3f800000
Functions to save/restore them, and the code to test vrndf1, follow.
I got: min = 0.0, max = 1.99999
Code: Select all
void vfpu_save_rcx(float context[8])
{
__asm__ volatile
(
".set push;"
".set noreorder;"
"vmfvc S000, $136;"
"vmfvc S001, $137;"
"vmfvc S002, $138;"
"vmfvc S003, $139;"
"vmfvc S010, $140;"
"vmfvc S011, $141;"
"vmfvc S012, $142;"
"vmfvc S013, $143;"
"usv.q C000, 0(%0);"
"usv.q C010, 16(%0);"
".set pop"
:
: "r"(context)
: "memory"
);
}
void vfpu_load_rcx(float context[8])
{
__asm__ volatile
(
".set push;"
".set noreorder;"
"ulv.q C000, 0(%0);"
"ulv.q C010, 16(%0);"
"vmtvc $136, S000;"
"vmtvc $137, S001;"
"vmtvc $138, S002;"
"vmtvc $139, S003;"
"vmtvc $140, S010;"
"vmtvc $141, S011;"
"vmtvc $142, S012;"
"vmtvc $143, S013;"
".set pop"
:
: "r"(context)
);
}
Code: Select all
float min = 0.0, max = 0.0;
int i;
for (i = 0; i < 100000; ++i)
{
float val;
asm volatile
(
"vrndf1.s S000;"
"sv.s S000, %0;"
: "=m"(val)
);
if (min > val) min = val;
if (val > max) max = val;
}
pspDebugScreenPrintf("%f\n%f\n", min, max);
wait();
So I guess vrndf1 gives us: 0.0 <= value < 2.0.
Now I must test vrndf2 and vrndi.
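One caveat about the test loop above: min starts at 0.0, so it can never report a lower bound above zero. If every sample is positive, min simply stays at 0.0. A minimal sketch in plain C (no VFPU involved; `sample_range` is a hypothetical helper, not something from the original post) that seeds both bounds from the first sample instead:

```c
#include <stddef.h>

/* Hypothetical helper: tracks the range of a batch of samples, seeding both
 * bounds from the first sample so that neither bound is biased by an
 * arbitrary starting value such as 0.0. */
static void sample_range(const float *samples, size_t count,
                         float *min_out, float *max_out)
{
    size_t i;
    if (count == 0)
        return;                     /* nothing to measure; outputs untouched */
    *min_out = samples[0];
    *max_out = samples[0];
    for (i = 1; i < count; ++i) {
        if (samples[i] < *min_out) *min_out = samples[i];
        if (samples[i] > *max_out) *max_out = samples[i];
    }
}
```

With this kind of seeding, a generator that only ever produces values in [1, 2[ would show a minimum of at least 1.0 rather than 0.0.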
Well, it SEEMS that vrndi gives us: -2^31 <= value < 2^31.
By the way, the result mustn't be used as a float, otherwise some crashes are expected.
Code: Select all
int min = 0, max = 0;
int i;
for (i = 0; i < 100000; ++i)
{
int val;
asm volatile
(
"vrndi.s S000;"
"mfv %0, S000"
: "=r"(val)
);
if (min > val) min = val;
if (val > max) max = val;
}
I created a DokuWiki on my webspace to use as a documentation place for the further diggins, since the SVN request doesn't seem to be getting answered.
Most likely a DokuWiki is even better than a single txt file in an SVN.
Unfortunately I didn't have much time to add a lot of ops to the wiki, so it's rather sparse right now. If you feel like it, you're free to register and start adding the information from our documents to it. The same goes for anyone else with knowledge about the VFPU.
http://wiki.fx-world.org/
LMK what you think
Raphael wrote: I created a docuwiki on my webspace to use that as a documentation place for the further diggins, since the SVN request seems to not get answered.
Most likely a docuwiki is even better than a single txt file in a SVN.
Unfortunately I didn't have much time to add a lot of ops to the wiki, so it's rather sparse right now. If you feel like it, you're free to register and start adding our documents informations into it. The same goes for anyone else with knowledge about VFPU.
http://wiki.fx-world.org/
LMK what you think
I will tell you when I'm at home; I cannot register from here.
Should work normally. I made you admin anyway.
Here's a page with links to most pages: http://wiki.fx-world.org/doku.php?id=general:cycles
You can then create them with the button at the bottom left.
Re: VFPU diggins
hlide wrote:
Code: Select all
vasin.q/t/p/s vd, vs 4/?/?/? 3
{
for (i = 0; i < |q/t/p/s|; ++i)
vd[i] = asin(vs[i]) * 2/PI; // not sure about this conversion
}
This function returns values in the range of -1 to 1, whereas libm's asinf returns values from -pi/2 to pi/2. I wrote a small routine to perform acos (needed for a quaternion function to return axis/angle from a quat):
Code: Select all
float vacos(float x) {
float result;
__asm__ volatile (
"mtv %1, S000\n" // load x (operand 1 is the input)
"vcst.s S001, VFPU_PI_2\n" // S001 = PI/2
"vasin.s S002, S000\n" // S002 = asin(x) * 2/PI
"vmul.s S000, S002, S001\n" // S000 = asin(x) in radians (vfpu returns -1 to 1, we need -PI/2 to PI/2)
"vsub.s S002, S001, S000\n" // S002 = acos(x) = PI/2 - asin(x)
"mfv %0, S002\n" // store result (operand 0 is the output)
: "=r"(result) : "r"(x));
return result;
}
Code: Select all
1.000000000000000: acosf = 0.000000000000000, vacos = 0.000000000000000
0.999998331069946: acosf = 0.001826981431805, vacos = 0.000026941299438
0.999983608722687: acosf = 0.005725612863898, vacos = 0.000263690948486
0.999941170215607: acosf = 0.010847153142095, vacos = 0.000943064689636
0.999869108200073: acosf = 0.016179904341698, vacos = 0.002095341682434
0.999770760536194: acosf = 0.021412530913949, vacos = 0.003669381141663
0.999642193317413: acosf = 0.026751749217510, vacos = 0.005728125572205
0.999485075473785: acosf = 0.032092638313770, vacos = 0.008242845535278
0.999303340911865: acosf = 0.037329345941544, vacos = 0.011150956153870
0.999089717864990: acosf = 0.042671307921410, vacos = 0.014570236206055
0.998847484588623: acosf = 0.048015348613262, vacos = 0.018447875976562
0.998576760292053: acosf = 0.053358737379313, vacos = 0.022781252861023
0.998277544975281: acosf = 0.058701783418655, vacos = 0.027570843696594
0.997949719429016: acosf = 0.064046569168568, vacos = 0.032818794250488
0.997593402862549: acosf = 0.069391109049320, vacos = 0.038522601127625
0.997216641902924: acosf = 0.074627749621868, vacos = 0.044555068016052
0.996803939342499: acosf = 0.079972051084042, vacos = 0.051162123680115
0.996362805366516: acosf = 0.085315905511379, vacos = 0.058223485946655
0.995893180370331: acosf = 0.090660177171230, vacos = 0.065743565559387
0.995395064353943: acosf = 0.096004940569401, vacos = 0.073717951774597
0.994868516921997: acosf = 0.101349666714668, vacos = 0.082148909568787
0.994313597679138: acosf = 0.106693953275681, vacos = 0.091034412384033
0.993730247020721: acosf = 0.112038522958755, vacos = 0.100376486778259
0.993118524551392: acosf = 0.117382980883121, vacos = 0.110171675682068
0.992478370666504: acosf = 0.122727967798710, vacos = 0.120423436164856
0.991823673248291: acosf = 0.127964779734611, vacos = 0.127515912055969
0.991127431392670: acosf = 0.133309572935104, vacos = 0.132169842720032
0.990402877330780: acosf = 0.138654336333275, vacos = 0.137006640434265
0.989650011062622: acosf = 0.143999248743057, vacos = 0.142025470733643
0.988868892192841: acosf = 0.149344027042389, vacos = 0.147226572036743
0.988059520721436: acosf = 0.154688835144043, vacos = 0.152607440948486
NOTE: vacos is almost 10 times faster than libm's acosf :)
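The arithmetic behind vacos can be checked without any VFPU code. The sketch below assumes, as stated earlier in the thread, that vasin returns asin(x) * 2/PI (a value in [-1, 1]). The helper name `acos_from_vasin` and the `PI_2` macro (standing in for VFPU_PI_2) are illustrative only:

```c
/* Assumption (from the thread): vasin returns asin(x) * 2/PI, i.e. a value
 * in [-1, 1]. This helper takes that normalised value and rebuilds acos(x)
 * in radians. PI_2 stands in for the VFPU_PI_2 constant. */
#define PI_2 1.57079632679489661923f

static float acos_from_vasin(float vasin_result)
{
    /* acos(x) = PI/2 - asin(x) = PI/2 - vasin(x) * PI/2 */
    return PI_2 - vasin_result * PI_2;
}
```

For example, asin(0.5) is PI/6, so vasin would give 1/3 under this assumption, and `acos_from_vasin(1.0f / 3.0f)` comes out close to PI/3, which is acos(0.5).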
quaternion -> to_axis_angle
I coded it but I didn't test it:
Code: Select all
# typedef struct quaternion_s { float i, j, k, r; } __attribute__((aligned(16))) quaternion_t;
# typedef struct axis_angle_s { float x, y, z, theta; } __attribute__((aligned(16))) axis_angle_t;
# void quaternion_to_axis_angle(quaternion_t *q, axis_angle_t *aa)
# {
# quaternion_normalise(q);
#
# aa->theta = acos(q->r) * 2;
#
# vx = q->i;
# vy = q->j;
# vz = q->k;
#
# norm = sqrt(vx * vx + vy * vy + vz * vz);
# if (norm > 0.0005)
# {
# aa->x = vx / norm;
# aa->y = vy / norm;
# aa->z = vz / norm;
# }
# }
.global quaternion_to_axis_angle
quaternion_to_axis_angle:
lv.q C000, 0($a0) # C000.q = (q.i, q.j, q.k, q.r)
vcst.s S102, VFPU_PI # S102.s = PI
vdot.q S100, C000, C000 # S100.s = (q.i^2 + q.j^2 + q.k^2 + q.r^2)
vrsq.s S100, S100 # S100.s = 1/sqrt(q.i^2 + q.j^2 + q.k^2 + q.r^2)
vscl.q C000, C000, S100 # C000.q = (vx, vy, vz, r) = (q.i, q.j, q.k, q.r)*(1/sqrt(q.i^2 + q.j^2 + q.k^2 + q.r^2))
vasin.s S101, S003 # S101.s = vasin(q.r) = 2*asin(q.r)/PI (q.r is in S003)
vfim.s S100, 0.00005 # S100.s = epsilon
vdot.t S103, C000, C000 # S103.s = (vx^2 + vy^2 + vz^2)
vocp.s S101, S101 # S101.s = 1 - 2*asin(q.r)/PI
vrsq.s S103, S103 # S103.s = 1/norm = 1/sqrt(vx^2 + vy^2 + vz^2)
vmul.s S003, S101, S102 # S003.s = 2*acos(q.r) = 2*(PI/2 - asin(q.r)) = PI*(1 - 2*asin(q.r)/PI)
vcmp.s LT, S103, S100 # VFPU_CC[0] = norm < epsilon
bvtl 0, 0f
vzero.t C000 # if (VFPU_CC[0] == true) C000.t = (0, 0, 0);
vscl.t C000, C000, S103 # if (VFPU_CC[0] == false) C000.t = (vx, vy, vz)/norm
0: jr ra
sv.q C000, 0($a1) # aa.x = S000.s, aa.y = S001.s, aa.z = S002.s, aa.theta = S003.s
Hmm, I made a mistake for the case norm < epsilon. I tend to set a default value when I cannot compute the value, but here I made a big mistake, so I changed it to something more appropriate.
Either remove "vzero.t" and transform "bvtl" into "bvfl" (first listing below), or replace "bvtl" with "vcmovf" (second listing below):
Code: Select all
bvfl 0, 0f
vscl.t C000, C000, S103 # if (VFPU_CC[0] == false) C000.t = (vx, vy, vz)/norm
0: jr ra
sv.q C000, 0($a1) # aa.x = S000.s, aa.y = S001.s, aa.z = S002.s, aa.theta = S003.s
Code: Select all
vscl.t C100, C000, S103
vcmovf.t C000, C100, 0 # if (VFPU_CC[0] == false) C000.t = (vx, vy, vz)/norm
0: jr ra
sv.q C000, 0($a1) # aa.x = S000.s, aa.y = S001.s, aa.z = S002.s, aa.theta = S003.s
Bloody brilliant, man...that's some tight code; the normalization code is certainly much shorter than what I had =)
As to freenode, I'm there right now...try connecting to one of the alternatives here: http://freenode.net/irc_servers.shtml
very bad news ! after some tests it appears that you cannot use a VFPU instruction in a branch delay slot, that is :
- after a bvt
- after a bvf
- after a bvtl
- after a bvfl
- after a beq
- after a bne
- after a b...z
- after a bal...z
- after a bal
- after a b
- after a jr
- after a jalr
- after a j
:((((
Okay, this version totally rocks and shows how you can efficiently use VFPU flag 4 (at least one component matches the condition) to detect any NaN or Inf value in VFPU registers:
NOTE: NaN and Inf don't raise an exception, so we don't need to exit early, and it is preferable to run all the instructions in the normal case without taking any branch. To set a default value, we only test at the end whether at least one component of our result register is NaN or Inf. We cannot make it simpler.
Code: Select all
# bool vfpuQuaternionToAxisAngle(vfpu_quaternion_t quaternion, vfpu_axis_angle_t axis_angle)
.global vfpuQuaternionToAxisAngle
.p2align 4
vfpuQuaternionToAxisAngle:
lv.q C000, ($a0)
vcst.s S011, VFPU_PI
vdot.q S012, C000, C000
vrsq.s S013, S012
vscl.q C000, C000, S013
vdot.t S013, C000, C000
vrsq.s S012, S013
vscl.t C000, C000, S012
vcmp.q ES, C000
vasin.s S012, S003
vocp.s S013, S012
vmul.s S003, S013, S011
vcmovt.q C000, C000[1, 0, 0, 0], 4
sv.q C000, ($a1)
jr $ra
nop
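The "test once at the end" idea can be sketched in plain C. This is only a scalar stand-in for the compare on ES plus the conditional move on flag 4, not a model of the hardware. The function names are hypothetical, and the default value (1, 0, 0, 0) is taken from the listing above:

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the float is NaN or +/-Inf: both have all exponent bits set. */
static int is_nan_or_inf(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bits & 0x7f800000u) == 0x7f800000u;
}

/* Hypothetical scalar stand-in for the compare + conditional move pair:
 * if ANY of the four components is NaN/Inf, overwrite the whole vector with
 * the default (1, 0, 0, 0); otherwise leave it alone. Returns 1 on fallback. */
static int fallback_if_invalid(float v[4])
{
    int any = 0, i;
    for (i = 0; i < 4; ++i)
        any |= is_nan_or_inf(v[i]);
    if (any) {
        v[0] = 1.0f; v[1] = 0.0f; v[2] = 0.0f; v[3] = 0.0f;
    }
    return any;
}
```

The point of the design is that the normal path runs straight through with no branch; only the final select depends on the "any component" flag.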
Oh ! MrMr[iCE], fear the madness of VFPU prefix here ;P
Code: Select all
void sceQuatToMatrix(ScePspQuatMatrix *q, ScePspFMatrix4 *m) {
/* x2 = SQR(x); y2 = SQR(y); z2 = SQR(z); w2 = SQR(w);
xy = x * y;
xz = x * z;
yz = y * z;
wx = w * x;
wy = w * y;
wz = w * z;
matrix[0] = float(1 - 2*(y2 + z2));
matrix[1] = float(2 * (xy + wz));
matrix[2] = float(2 * (xz - wy));
matrix[4] = float(2 * (xy - wz));
matrix[5] = float(1 - 2*(x2 + z2));
matrix[6] = float(2 * (yz + wx));
matrix[8] = float(2 * (xz + wy));
matrix[9] = float(2 * (yz - wx));
matrix[10] =float(1 - 2*(x2 + y2));*/
__asm__ volatile (
"lv.q C000, %1\n" // C000 = [x, y, z, w ]
"vmul.q C010, C000, C000\n" // C010 = [x2, y2, z2, w2]
"vcrs.t C020, C000, C000\n" // C020 = [yz, xz, xy ]
"vmul.q C030, C000, C000[w,w,w,0]\n" // C030 = [wx, wy, wz ]
"vadd.q C100, C020[0,z,y,0], C030[0,z,-y,0]\n" // C100 = [0, xy+wz, xz-wy]
"vadd.s S100, S011, S012\n" // C100 = [y2+z2, xy+wz, xz-wy]
"vadd.q C110, C020[z,0,x,0], C030[-z,0,x,0]\n" // C110 = [xy-wz, 0, yz+wx]
"vadd.s S111, S010, S012\n" // C110 = [xy-wz, x2+z2, yz+wx]
"vadd.q C120, C020[y,x,0,0], C030[y,-x,0,0]\n" // C120 = [xz+wy, yz-wx, 0 ]
"vadd.s S122, S010, S011\n" // C120 = [xz+wy, yz-wx, x2+y2]
"vmov.s S033, S033[2]\n"
"vscl.t C100, C100, S033\n" // C100 = [2*(y2+z2), 2*(xy+wz), 2*(xz-wy)]
"vscl.t C110, C110, S033\n" // C110 = [2*(xy-wz), 2*(x2+z2), 2*(yz+wx)]
"vscl.t C120, C120, S033\n" // C120 = [2*(xz+wy), 2*(yz-wx), 2*(x2+y2)]
"vocp.s S100, S100\n" // C100 = [1-2*(y2+z2), 2*(xy+wz), 2*(xz-wy) ]
"vocp.s S111, S111\n" // C110 = [2*(xy-wz), 1-2*(x2+z2), 2*(yz+wx) ]
"vocp.s S122, S122\n" // C120 = [2*(xz+wy), 2*(yz-wx), 1-2*(x2+y2)]
"vidt.q C130\n" // C130 = [0, 0, 0, 1]
"sv.q R100, 0 + %0\n"
"sv.q R101, 16 + %0\n"
"sv.q R102, 32 + %0\n"
"sv.q R103, 48 + %0\n"
: "=m"(*m) : "m"(*q));
}
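For checking the VFPU version against known inputs, the formulas in the comment at the top of sceQuatToMatrix can be written out as plain C. This is a reference sketch only: `quat_t` and `quat_to_matrix_ref` are hypothetical names, the matrix is a flat array of 16 floats indexed as in the comment, and the entries the comment leaves out (3, 7, 11 to 15) are assumed to form the usual [0, 0, 0, 1] padding, matching the vidt.q in the listing:

```c
/* Plain-C reference for the formulas in the comment above. Names are
 * hypothetical; m uses the same indices as the comment (matrix[0..10]). */
typedef struct { float x, y, z, w; } quat_t;

static void quat_to_matrix_ref(const quat_t *q, float m[16])
{
    float x2 = q->x * q->x, y2 = q->y * q->y, z2 = q->z * q->z;
    float xy = q->x * q->y, xz = q->x * q->z, yz = q->y * q->z;
    float wx = q->w * q->x, wy = q->w * q->y, wz = q->w * q->z;

    m[0]  = 1.0f - 2.0f * (y2 + z2);
    m[1]  = 2.0f * (xy + wz);
    m[2]  = 2.0f * (xz - wy);
    m[3]  = 0.0f;                  /* assumed padding */
    m[4]  = 2.0f * (xy - wz);
    m[5]  = 1.0f - 2.0f * (x2 + z2);
    m[6]  = 2.0f * (yz + wx);
    m[7]  = 0.0f;                  /* assumed padding */
    m[8]  = 2.0f * (xz + wy);
    m[9]  = 2.0f * (yz - wx);
    m[10] = 1.0f - 2.0f * (x2 + y2);
    m[11] = 0.0f;                  /* assumed padding */
    m[12] = 0.0f; m[13] = 0.0f; m[14] = 0.0f;
    m[15] = 1.0f;                  /* last vector is [0, 0, 0, 1] */
}
```

The identity quaternion (0, 0, 0, 1) should give the identity matrix, which makes a quick sanity test for either implementation.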
hlide wrote: VFPU has control registers and some are relative to random seed i guess. They are documented in groepaz's document.
Code: Select all
128 VFPU_PFXS Source prefix stack
129 VFPU_PFXT Target prefix stack
130 VFPU_PFXD Destination prefix stack
131 VFPU_CC Condition information
132 VFPU_INF4 VFPU internal information 4
133 VFPU_RSV5 Not used (reserved)
134 VFPU_RSV6 Not used (reserved)
135 VFPU_REV VFPU revision information
136 VFPU_RCX0 Pseudorandom number generator information 0
137 VFPU_RCX1 Pseudorandom number generator information 1
138 VFPU_RCX2 Pseudorandom number generator information 2
139 VFPU_RCX3 Pseudorandom number generator information 3
140 VFPU_RCX4 Pseudorandom number generator information 4
141 VFPU_RCX5 Pseudorandom number generator information 5
142 VFPU_RCX6 Pseudorandom number generator information 6
143 VFPU_RCX7 Pseudorandom number generator information 7
hlide and I did a bit of testing with vrnds, vrndf1 and vrndf2.
Turns out the 8 RCX registers are most likely part of a 'shift register' algorithm. They are updated every time a random value is queried with vrndf1/2, and there is a noticeable pattern between RCX2 and RCX3, and between RCX6 and RCX7, across iterations.
Before vrnds or any vrndfX instruction is run, the state of the RCX registers is:
Code: Select all
3f800001 3f800002 3f800004 3f800008 3f800000 3f800000 3f800000 3f800000
1.0000001192 1.0000002384 1.0000004768 1.0000009537 1.0000000000 1.0000000000 1.0000000000 1.0000000000
Code: Select all
3f833333 3f833333 3f833333 3f833333 3f833fb3 3f8b3fb3 3f8f3fb3 3f833fb3
1.0249999762 1.0249999762 1.0249999762 1.0249999762 1.0253814459 1.0878814459 1.1191314459 1.0253814459
Code: Select all
rand: 1.155536293983
3f8096d8 3f8084f9 3f803333 3f80cccc 3f804f4c 3f80637a 3f803fb3 3f80fecc
1.0046033859 1.0040580034 1.0015624762 1.0062499046 1.0024199486 1.0030357838 1.0019439459 1.0077757835
rand: 1.911504626274
3f80c2f9 3f801c6b 3f80cccc 3f80cccb 3f80fad5 3f804f52 3f80fecc 3f803d4c
1.0059500933 1.0008672476 1.0062499046 1.0062497854 1.0076547861 1.0024206638 1.0077757835 1.0018706322
Also found vrndf1 returns floats in the range of 1.0 to 2.0, and vrndf2 returns in the range of 2.0 to 4.0...anyone have any idea what the 2->4 range would be useful for?
MrMr[iCE] wrote: Also found vrndf1 returns floats in the range of 1.0 to 2.0, and vrndf2 returns in the range of 2.0 to 4.0...anyone have any idea what the 2->4 range would be useful for?
Note their differences:
- vrndf1 --> [1, 2[ --> interval width |2.0 - 1.0| = 1.0
- vrndf2 --> [2, 4[ --> interval width |4.0 - 2.0| = 2.0
To get [0, 1[ --> (vrndf1() - 1.0):
Code: Select all
vrndf1.s S000
vsub.s S000, S000, S000[1]
Code: Select all
vrndf2.s S000
vsub.s S000, S000, S000[3]
Looking at the float format of a vrndf1 result:
- sign bit always 0: always a positive number,
- exponent fixed to 127,
- significand (1.mantissa) is always in [1, 2[
(-1)^0 x 2^(127-127) x (1.mantissa)
with 00000000000000000000000b <= mantissa <= 11111111111111111111111b
so it would explain the range of [1, 2[ for "vrndf1".
Note also that vrndf2() <=> 2.0*vrndf1().
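The bit-level argument is easy to check on any machine with IEEE 754 single-precision floats. The sketch below only illustrates the float format, not how the VFPU actually produces its random bits. The helper names are made up for this example:

```c
#include <stdint.h>
#include <string.h>

/* Builds a float with sign 0, biased exponent 127 and the given 23 mantissa
 * bits. Whatever the mantissa, the value lies in [1, 2[. */
static float float_from_mantissa(uint32_t mantissa23)
{
    uint32_t bits = 0x3f800000u | (mantissa23 & 0x007fffffu);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Maps the [1, 2[ value to [0, 1[, as "vrndf1 - 1.0" does. */
static float unit_from_mantissa(uint32_t mantissa23)
{
    return float_from_mantissa(mantissa23) - 1.0f;
}
```

A mantissa of 0 gives exactly 1.0 (bit pattern 3f800000), an all-ones mantissa gives the largest float below 2.0, and a mantissa of 0x400000 gives 1.5, so `unit_from_mantissa(0x400000)` is 0.5.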
Re: VFPU diggins
Code: Select all
vi2uc.q vd.s, vs.q 1 0
{
vd.s[0]( 0.. 7) = vs.q[0] & 0xFF;
vd.s[0]( 8..15) = vs.q[1] & 0xFF;
vd.s[0](16..23) = vs.q[2] & 0xFF;
vd.s[0](24..31) = vs.q[3] & 0xFF;
}
Code: Select all
M000 before doing vi2uc (note C020, this is the data i want to pack)
C000 C010 C020 C030
R000: 43200000 42be0000 000000f5 7f800001
R001: 43200000 42be0000 000000e3 7f800001
R002: 43200000 42be0000 000000a7 7f800001
R003: 7f800001 7f800001 000000ff 7f800001
after vi2uc.q S000, C020 i get:
C000 C010 C020 C030
R000: 00000000 42be0000 000000f5 7f800001
R001: 43200000 42be0000 000000e3 7f800001
R002: 43200000 42be0000 000000a7 7f800001
R003: 7f800001 7f800001 000000ff 7f800001
Edit: I think I see what's happening...the VFPU is actually doing this:
Code: Select all
vi2uc.q vd.s, vs.q
{
vd.s[0]( 0.. 7) = (vs.q[0] & 0x7F800000) >> 23;
vd.s[0]( 8..15) = (vs.q[1] & 0x7F800000) >> 23;
vd.s[0](16..23) = (vs.q[2] & 0x7F800000) >> 23;
vd.s[0](24..31) = (vs.q[3] & 0x7F800000) >> 23;
}
Also, I think vf2iX.s/p/t/q does a fixed point conversion. If you provide a non-0 shift immediate, it will fill the bits to the right of the decimal point with fraction bits converted from the original float value. I will test some more.
Hmm, you're right, the text file is not up to date; I was aware of the shift to do. That said, I never tested with negative values.
It looks like:
Code: Select all
vi2uc.q vd.s, vs.q
{
vd.s[0]( 0.. 7) = max_integer(0, vs.q[0] >> 23);
vd.s[0]( 8..15) = max_integer(0, vs.q[1] >> 23);
vd.s[0](16..23) = max_integer(0, vs.q[2] >> 23);
vd.s[0](24..31) = max_integer(0, vs.q[3] >> 23);
}
It means the integer source is laid out like:
S = [s:1][i:8][f:23]
that is, [s:1][i:8] would be a value between -256 and 255, not -128 and 127.
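The packing described in the posts above can be emulated in plain C to compare against register dumps. It follows the hypothesised max(0, v >> 23) behaviour literally and is not a verified hardware model. The function name `vi2uc_emulated` is made up for this sketch:

```c
#include <stdint.h>

/* Emulates the behaviour hypothesised in this thread for vi2uc.q: each of the
 * four signed 32-bit inputs is clamped to >= 0, shifted right by 23, and the
 * resulting byte is packed into one 32-bit word (element 0 in the low byte).
 * This is a reading of the posts above, not a verified hardware model. */
static uint32_t vi2uc_emulated(const int32_t vs[4])
{
    uint32_t packed = 0;
    int i;
    for (i = 0; i < 4; ++i) {
        /* max_integer(0, v): negative inputs become 0 before the shift */
        uint32_t v = vs[i] < 0 ? 0u : (uint32_t)vs[i];
        packed |= ((v >> 23) & 0xffu) << (8 * i);
    }
    return packed;
}
```

Fed with the C020 column from the dump (000000f5, 000000e3, 000000a7, 000000ff), this returns 00000000, which agrees with the value S000 ended up with after the instruction.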
Found a bug with prefixes and vbfy1 instruction:
S000 = 1.0, so S020 should be 2.0, but as you can see, not quite...so far only tested with the 1 prefix, will try others
Code: Select all
vadd.s S010, S000[1]
vsub.s S011, S000[1]
vbfy1.p C020, C000[x, 1]
should produce:
S010 = C000[x] + 1
S011 = C000[x] - 1
S020 = C000[x] + 1
S021 = C000[x] - 1
but we get a numerical error:
C000 C010 C020 C030
vvvvvvvv
R000: 1.000000 2.000000 2.442695 0.000000
^^^^^^^^
R001: 1.442695 0.000000 0.000000 0.000000
R002: 0.000000 0.000000 0.000000 0.000000
R003: 0.000000 0.000000 0.000000 0.000000
workaround:
do a vone.s S001 and perform vbfy1 without the prefix
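For reference, the expected two-element butterfly can be written in plain C, taking the semantics from the "should produce" lines above. The function name is hypothetical:

```c
/* Expected vbfy1.p semantics as described in the post: a two-element
 * butterfly (sum in element 0, difference in element 1). */
static void bfy1_pair(const float vs[2], float vd[2])
{
    float a = vs[0], b = vs[1];   /* copy first so vd may alias vs */
    vd[0] = a + b;
    vd[1] = a - b;
}
```

With the [x, 1] prefix the source pair should be (1.0, 1), giving (2.0, 0.0). The 2.442695 in the dump equals 1.000000 + 1.442695, which is what this function returns for the unprefixed pair (S000, S001). One possible reading, not confirmed in the thread, is that the constant was not substituted for the second operand.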
You should also test whether the swizzle prefix works:
1) swizzle operation: x, y, z, w
2) absolute operation: ?, |?| where ? is one of 1)
3) negation operation: ?, -? where ? is one of 2)
Now that we know that the constant-insert operation does not work well with vbfy1.p, we may also suppose it would be the same for vbfy1/2.q.
OK, for those who are interested, I wrote several asm macros to measure the pitch and latency of a VFPU instruction. Here are the results, followed by the stubs.S where I put all the tests, and the main.c.
NOTE:
The pitch represents resource-occupying cycles: an instruction using the same resources can only be issued after those cycles have elapsed. We call these cycles the pitch of the instruction.
An instruction has a "Read After Write" hazard with the next instruction if the latter has a shorter latency; the stall delay is latency(insn1) - 1.
An instruction has a "Write After Write" hazard with the next instruction if the latter has a shorter latency; the stall delay is latency(insn1) - latency(insn2).
Code: Select all
INSTRUCTION PITCH LATENCY
--------------- ------- -------
lv.s 1 3
lv.q 1 3
mfv 6 0
mfvc 6 0
mtv 1 3
mtvc 1 3
sv.s 5 0
sv.q 5 0
svl.q 5 0
svr.q 5 0
vabs.s 1 3
vabs.p 1 3
vabs.t 1 3
vabs.q 1 3
vadd.s 1 5
vadd.p 1 5
vadd.t 1 5
vadd.q 1 5
vasin.s 1 7
vasin.p 2 8
vasin.t 3 9
vasin.q 4 10
vavg.p 1 7
vavg.t 1 7
vavg.q 1 7
vbfy1.p 1 5
vbfy1.q 1 5
vbfy2.q 1 5
vcmovf.s 1 5
vcmovf.p 1 5
vcmovf.t 1 5
vcmovf.q 1 5
vcmovt.s 1 5
vcmovt.p 1 5
vcmovt.t 1 5
vcmovt.q 1 5
vcmp.s 1 3
vcmp.p 1 3
vcmp.t 1 3
vcmp.q 1 3
vcos.s 1 7
vcos.p 2 8
vcos.t 3 9
vcos.q 4 10
vcrs.t 1 5
vcrsp.t 3 9
vcst.s 1 3
vcst.p 1 3
vcst.t 1 3
vcst.q 1 3
vdet.p 1 7
vdiv.s 14 17
vdiv.p 28 31
vdiv.t 42 45
vdiv.q 56 59
vdot.t 1 7
vdot.q 1 7
vexp2.s 1 7
vexp2.p 2 8
vexp2.t 3 9
vexp2.q 4 10
vf2h.p 1 5
vf2h.q 1 5
vf2id.s 1 5
vf2id.p 1 5
vf2id.t 1 5
vf2id.q 1 5
vf2in.s 1 5
vf2in.p 1 5
vf2in.t 1 5
vf2in.q 1 5
vf2iu.s 1 5
vf2iu.p 1 5
vf2iu.t 1 5
vf2iu.q 1 5
vf2iz.s 1 5
vf2iz.p 1 5
vf2iz.t 1 5
vf2iz.q 1 5
vfad.p 1 7
vfad.t 1 7
vfad.q 1 7
vh2f.s 1 5
vh2f.p 1 5
vhdp.p 1 7
vhdp.q 1 7
vhtfm2.p 2 8
vhtfm3.t 3 9
vhtfm4.q 4 10
vi2c.q 1 3
vi2f.s 1 5
vi2f.p 1 5
vi2f.t 1 5
vi2f.q 1 5
vi2s.p 1 3
vi2s.q 1 3
vi2uc.q 1 3
vi2us.p 1 3
vi2us.q 1 3
vidt.p 1 3
vidt.q 1 3
viim.s 1 5
vlgb.s 1 5
vlog2.s 1 7
vlog2.p 2 8
vlog2.t 3 9
vlog2.q 4 10
vmax.s 1 5
vmax.p 1 5
vmax.t 1 5
vmax.q 1 5
vmfvc 1 3
vmidt.p 2 4
vmidt.t 3 5
vmidt.q 4 6
vmin.s 1 5
vmin.p 1 5
vmin.t 1 5
vmin.q 1 5
vmmov.p 2 4
vmmov.t 3 5
vmmov.q 4 6
vmmul.p 4 10
vmmul.t 9 15
vmmul.q 16 22
vmone.p 2 4
vmone.t 3 5
vmone.q 4 6
vmscl.p 2 6
vmscl.t 3 7
vmscl.q 4 8
vmtvc 1 3
vmul.s 1 5
vmul.p 1 5
vmul.t 1 5
vmul.q 1 5
vmzero.p 2 4
vmzero.t 3 5
vmzero.q 4 6
vneg.s 1 3
vneg.p 1 3
vneg.t 1 3
vneg.q 1 3
vnrcp.s 1 7
vnrcp.p 2 8
vnrcp.t 3 9
vnrcp.q 4 10
vnsin.s 1 7
vnsin.p 2 8
vnsin.t 3 9
vnsin.q 4 10
vocp.s 1 5
vocp.p 1 5
vocp.t 1 5
vocp.q 1 5
vone.s 1 3
vone.p 1 3
vone.t 1 3
vone.q 1 3
vqmul.q 4 10
vrcp.s 1 7
vrcp.p 2 8
vrcp.t 3 9
vrcp.q 4 10
vrexp2.s 1 7
vrexp2.p 2 8
vrexp2.t 3 9
vrexp2.q 4 10
vrndf1.s 3 5
vrndf1.p 6 8
vrndf1.t 9 11
vrndf1.q 12 14
vrndf2.s 3 5
vrndf2.p 6 8
vrndf2.t 9 11
vrndf2.q 12 14
vrndi.s 3 5
vrndi.p 6 8
vrndi.t 9 11
vrndi.q 12 14
vrnds.s 1 3
vrot.p 2 8
vrot.t 2 8
vrot.q 2 8
vrsq.s 1 7
vrsq.p 2 8
vrsq.t 3 9
vrsq.q 4 10
vs2i.s 1 3
vs2i.p 1 3
vsat0.s 1 3
vsat0.p 1 3
vsat0.t 1 3
vsat0.q 1 3
vsat1.s 1 3
vsat1.p 1 3
vsat1.t 1 3
vsat1.q 1 3
vsbn.s 1 5
vsbz.s 1 5
vscl.p 1 5
vscl.t 1 5
vscl.q 1 5
vscmp.s 1 5
vscmp.p 1 5
vscmp.t 1 5
vscmp.q 1 5
vsge.s 1 5
vsge.p 1 5
vsge.t 1 5
vsge.q 1 5
vsgn.s 1 5
vsgn.p 1 5
vsgn.t 1 5
vsgn.q 1 5
vsin.s 1 7
vsin.p 2 8
vsin.t 3 9
vsin.q 4 10
vslt.s 1 5
vslt.p 1 5
vslt.t 1 5
vslt.q 1 5
vsocp.s 1 5
vsocp.p 1 5
vsqrt.s 1 7
vsqrt.p 2 8
vsqrt.t 3 9
vsqrt.q 4 10
vsrt1.q 1 5
vsrt2.q 1 5
vsrt3.q 1 5
vsrt4.q 1 5
vsub.s 1 5
vsub.p 1 5
vsub.t 1 5
vsub.q 1 5
vt4444.q 1 3
vt5551.q 1 3
vt5650.q 1 3
vtfm2.p 2 8
vtfm3.t 3 9
vtfm4.q 4 10
vus2i.s 1 3
vus2i.p 1 3
vwbn.s 1 5
vzero.p 1 3
vzero.t 1 3
vzero.q 1 3
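To get a feel for what the latencies in the table mean for scheduling, the two hazard rules from the NOTE can be turned into small helpers. This takes the rules literally as written and is only a rough estimate, not a pipeline model. The function names are invented for this sketch:

```c
/* Stall estimates taken literally from the two NOTE rules. */

/* "Read After Write": stall delay is latency(insn1) - 1. */
static int raw_stall(int latency_first)
{
    return latency_first - 1;
}

/* "Write After Write": applies only when the second instruction has the
 * shorter latency; stall delay is latency(insn1) - latency(insn2). */
static int waw_stall(int latency_first, int latency_second)
{
    return latency_second < latency_first ? latency_first - latency_second : 0;
}
```

Using the table, vdot.q (latency 7) followed by vadd.q (latency 5) writing the same register would stall for 7 - 5 = 2 cycles under the second rule, and a dependent read right after vdot.q would wait 7 - 1 = 6 cycles under the first.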
Code: Select all
.set noreorder
.text
// use it for a single-cycle instruction (pitch = 1)
// or for macro-instruction which reiterates the same single-cycle instruction (pitch > 1)
.macro test1 insn
.p2align 6
mfc0 $v1, $9
nop
vnop # stall next VFPU instruction
\insn
0: mfc0 $v0, $9
sync
subu $v0, $v0, $v1 # cycles1
move $a1, $v0
.p2align 6
mfc0 $v1, $9
nop
vnop # stall next VFPU instruction
\insn
vnop # block next CPU instructions so we can count the stall cycles
0: mfc0 $v0, $9
sync
subu $v0, $v0, $v1 # cycles2
subu $a2, $v0, $a1
sb $a2, 0($a0) # pitch = cycles2 - cycles1
.p2align 6
mfc0 $v1, $9
nop
vnop # stall next VFPU instruction
\insn
vsync # our latency is somewhere here !
vnop # block next CPU instructions so we can count the stall cycles
0: mfc0 $v0, $9
sync
subu $v0, $v0, $v1 # cycles3, note that cycles3 - cycles1 >= 2
subu $a2, $v0, $a1
addiu $a2, $a2, -2 # if no latency, cycles3 - cycles1 = 2 so we need to adjust here
sb $a2, 1($a0) # latency = cycles3 - cycles1 - 2
addu $a0, $a0, 2
.endm
// use it for a macro-instruction which reiterates the same multi-cycle instruction
// NOTE : we cannot directly compute the pitch but we can guess it by this way :
// pitch(vdiv.s) = latency(vdiv.p) - latency(vdiv.s)
// pitch(vdiv.p) = pitch(vdiv.s) + pitch(vdiv.s)
// pitch(vdiv.t) = pitch(vdiv.s) + pitch(vdiv.p)
// pitch(vdiv.q) = pitch(vdiv.s) + pitch(vdiv.t)
.macro test2a insn
.p2align 6
mfc0 $v1, $9
nop
vnop # stall next VFPU instruction
\insn
0: mfc0 $v0, $9
sync
subu $v0, $v0, $v1 # cycles1
move $a1, $v0
move $t9, $0
.p2align 6
mfc0 $v1, $9
nop
vnop # stall next VFPU instruction
\insn
vsync # our latency is somewhere here !
vnop # block next CPU instructions so we can count the stall cycles
0: mfc0 $v0, $9
sync
subu $v0, $v0, $v1 # cycles3, note that cycles3 - cycles1 >= 2
subu $a2, $v0, $a1
addiu $a2, $a2, -2 # if no latency, cycles3 - cycles1 = 2 so we need to adjust here
sb $a2, 1($a0) # latency' = cycles3 - cycles1 - 2
move $t8, $a0
addu $a0, $a0, 2
.endm
// use it for a macro-instruction which reiterates the same multi-cycle instruction
// NOTE : we cannot directly compute the pitch but we can guess it by this way :
// pitch(vdiv.s) = latency(vdiv.p) - latency(vdiv.s)
// pitch(vdiv.p) = latency(vdiv.t) - latency(vdiv.p) + pitch(vdiv.s)
// pitch(vdiv.t) = latency(vdiv.q) - latency(vdiv.t) + pitch(vdiv.p)
// pitch(vdiv.q) = pitch(vdiv.t) + pitch(vdiv.s)
.macro test2b insn
subu $t9, $t9, $a2
.p2align 6
mfc0 $v1, $9
nop
vnop # stall next VFPU instruction
\insn
0: mfc0 $v0, $9
sync
subu $v0, $v0, $v1
move $a1, $v0
.p2align 6
mfc0 $v1, $9
nop
vnop # stall next VFPU instruction
\insn
vsync
vnop # block next CPU instructions so we can count the stall cycles
0: mfc0 $v0, $9
sync
subu $v0, $v0, $v1 # cycles3, note that cycles3 - cycles1 >= 2
subu $a2, $v0, $a1
addiu $a2, $a2, -2 # if no latency, cycles3 - cycles1 = 2 so we need to adjust here
sb $a2, 1($a0) # latency = cycles3 - cycles1 - 2
addu $t9, $t9, $a2
sb $t9, -2($a0) # pitch = latency - latency'
sb $a2, 1($a0) # latency
addu $a0, $a0, 2
.endm
// use it for a macro-instruction which reiterates the same multi-cycle instruction
// NOTE : we cannot directly compute the pitch but we can guess it by this way :
// pitch(vdiv.s) = latency(vdiv.p) - latency(vdiv.s)
// pitch(vdiv.p) = latency(vdiv.t) - latency(vdiv.p) + pitch(vdiv.s)
// pitch(vdiv.t) = latency(vdiv.q) - latency(vdiv.t) + pitch(vdiv.p)
// pitch(vdiv.q) = pitch(vdiv.t) + pitch(vdiv.s)
.macro test2c insn
subu $t9, $t9, $a2
.p2align 6
mfc0 $v1, $9
nop
vnop # stall next VFPU instruction
\insn
0: mfc0 $v0, $9
sync
subu $v0, $v0, $v1
move $a1, $v0
.p2align 6
mfc0 $v1, $9
nop
vnop # stall next VFPU instruction
\insn
vsync
vnop # block next CPU instructions so we can count the stall cycles
0: mfc0 $v0, $9
sync
subu $v0, $v0, $v1 # cycles3, note that cycles3 - cycles1 >= 2
subu $a2, $v0, $a1
addiu $a2, $a2, -2 # if no latency, cycles3 - cycles1 = 2 so we need to adjust here
sb $a2, 1($a0) # latency = cycles3 - cycles1 - 2
addu $t9, $t9, $a2
sb $t9, -2($a0) # pitch = latency - latency'
sb $a2, 1($a0) # latency
lbu $v0, ($t8)
addu $t9, $t9, $v0
sb $t9, 0($a0) # pitch
addu $a0, $a0, 2
.endm
// use it for an instruction which needs to deal with writing into CPU register or memory
// since there is no latency
.macro test3 insn
.p2align 6
mfc0 $v1, $9
nop
vnop # stall next VFPU instruction
vsync
vnop # block next CPU instructions so we can count the stall cycles
0: mfc0 $v0, $9
sync
subu $a1, $v0, $v1 # cycles1
.p2align 6
mfc0 $v1, $9
nop
vnop # stall next VFPU instruction
\insn
vsync
vnop # block next CPU instructions so we can count the stall cycles
0: mfc0 $v0, $9
sync
subu $v0, $v0, $v1
subu $a2, $v0, $a1
sb $a2, 0($a0) # pitch = cycles2 - cycles1
sb $0, 1($a0) # latency = 0
addu $a0, $a0, 2
.endm
.global test_vfpu_cycles
test_vfpu_cycles:
test1 "lv.s $0,($a3)"
test1 "lv.q $0,($a3)"
test3 "mfv $t0, $0"
test3 "mfvc $t0, $131"
test1 "mtv $0, $0"
test1 "mtvc $0, $131"
test3 "sv.s $0,($a3)"
test3 "sv.q $0,($a3)"
test3 "svl.q $0,($a3)"
test3 "svr.q $0,($a3)"
test1 "vabs.s $1, $0"
test1 "vabs.p $1, $0"
test1 "vabs.t $1, $0"
test1 "vabs.q $1, $0"
test1 "vadd.s $2, $1, $0"
test1 "vadd.p $2, $1, $0"
test1 "vadd.t $2, $1, $0"
test1 "vadd.q $2, $1, $0"
test1 "vasin.s $1, $0"
test1 "vasin.p $1, $0"
test1 "vasin.t $1, $0"
test1 "vasin.q $1, $0"
test1 "vavg.p $1, $0"
test1 "vavg.t $1, $0"
test1 "vavg.q $1, $0"
test1 "vbfy1.p $1, $0"
test1 "vbfy1.q $1, $0"
test1 "vbfy2.q $1, $0"
test1 "vcmovf.s $1, $0, 0"
test1 "vcmovf.p $1, $0, 0"
test1 "vcmovf.t $1, $0, 0"
test1 "vcmovf.q $1, $0, 0"
test1 "vcmovt.s $1, $0, 0"
test1 "vcmovt.p $1, $0, 0"
test1 "vcmovt.t $1, $0, 0"
test1 "vcmovt.q $1, $0, 0"
test1 "vcmp.s EQ, $1, $0"
test1 "vcmp.p EQ, $1, $0"
test1 "vcmp.t EQ, $1, $0"
test1 "vcmp.q EQ, $1, $0"
test1 "vcos.s $1, $0"
test1 "vcos.p $1, $0"
test1 "vcos.t $1, $0"
test1 "vcos.q $1, $0"
test1 "vcrs.t $2, $1, $0"
test1 "vcrsp.t $2, $1, $0"
test1 "vcst.s $0, VFPU_PI"
test1 "vcst.p $0, VFPU_PI"
test1 "vcst.t $0, VFPU_PI"
test1 "vcst.q $0, VFPU_PI"
test1 "vdet.p $2, $1, $0"
test2a "vdiv.s $2, $1, $0"
test2b "vdiv.p $2, $1, $0"
test2b "vdiv.t $2, $1, $0"
test2c "vdiv.q $2, $1, $0"
test1 "vdot.t $2, $1, $0"
test1 "vdot.q $2, $1, $0"
test1 "vexp2.s $1, $0"
test1 "vexp2.p $1, $0"
test1 "vexp2.t $1, $0"
test1 "vexp2.q $1, $0"
test1 "vf2h.p $1, $0"
test1 "vf2h.q $1, $0"
test1 "vf2id.s $1, $0, 0"
test1 "vf2id.p $1, $0, 0"
test1 "vf2id.t $1, $0, 0"
test1 "vf2id.q $1, $0, 0"
test1 "vf2in.s $1, $0, 0"
test1 "vf2in.p $1, $0, 0"
test1 "vf2in.t $1, $0, 0"
test1 "vf2in.q $1, $0, 0"
test1 "vf2iu.s $1, $0, 0"
test1 "vf2iu.p $1, $0, 0"
test1 "vf2iu.t $1, $0, 0"
test1 "vf2iu.q $1, $0, 0"
test1 "vf2iz.s $1, $0, 0"
test1 "vf2iz.p $1, $0, 0"
test1 "vf2iz.t $1, $0, 0"
test1 "vf2iz.q $1, $0, 0"
test1 "vfad.p $1, $0"
test1 "vfad.t $1, $0"
test1 "vfad.q $1, $0"
test1 "vh2f.s $1, $0"
test1 "vh2f.p $1, $0"
test1 "vhdp.p $2, $1, $0"
test1 "vhdp.q $2, $1, $0"
test1 "vhtfm2.p $8, $4, $0"
test1 "vhtfm3.t $8, $4, $0"
test1 "vhtfm4.q $8, $4, $0"
test1 "vi2c.q $1, $0"
test1 "vi2f.s $1, $0, 0"
test1 "vi2f.p $1, $0, 0"
test1 "vi2f.t $1, $0, 0"
test1 "vi2f.q $1, $0, 0"
test1 "vi2s.p $1, $0"
test1 "vi2s.q $1, $0"
test1 "vi2uc.q $1, $0"
test1 "vi2us.p $1, $0"
test1 "vi2us.q $1, $0"
test1 "vidt.p $0"
test1 "vidt.q $0"
test1 "viim.s $0, 0"
test1 "vlgb.s $1, $0"
test1 "vlog2.s $1, $0"
test1 "vlog2.p $1, $0"
test1 "vlog2.t $1, $0"
test1 "vlog2.q $1, $0"
test1 "vmax.s $2, $1, $0"
test1 "vmax.p $2, $1, $0"
test1 "vmax.t $2, $1, $0"
test1 "vmax.q $2, $1, $0"
test1 "vmfvc $0, $131"
test1 "vmidt.p $0"
test1 "vmidt.t $0"
test1 "vmidt.q $0"
test1 "vmin.s $2, $1, $0"
test1 "vmin.p $2, $1, $0"
test1 "vmin.t $2, $1, $0"
test1 "vmin.q $2, $1, $0"
test1 "vmmov.p $4, $0"
test1 "vmmov.t $4, $0"
test1 "vmmov.q $4, $0"
test1 "vmmul.p $8, $4, $0"
test1 "vmmul.t $8, $4, $0"
test1 "vmmul.q $8, $4, $0"
test1 "vmone.p $0"
test1 "vmone.t $0"
test1 "vmone.q $0"
test1 "vmscl.p $8, $4, $0"
test1 "vmscl.t $8, $4, $0"
test1 "vmscl.q $8, $4, $0"
test1 "vmtvc $131, $0"
test1 "vmul.s $2, $1, $0"
test1 "vmul.p $2, $1, $0"
test1 "vmul.t $2, $1, $0"
test1 "vmul.q $2, $1, $0"
test1 "vmzero.p $0"
test1 "vmzero.t $0"
test1 "vmzero.q $0"
test1 "vneg.s $1, $0"
test1 "vneg.p $1, $0"
test1 "vneg.t $1, $0"
test1 "vneg.q $1, $0"
test1 "vnrcp.s $1, $0"
test1 "vnrcp.p $1, $0"
test1 "vnrcp.t $1, $0"
test1 "vnrcp.q $1, $0"
test1 "vnsin.s $1, $0"
test1 "vnsin.p $1, $0"
test1 "vnsin.t $1, $0"
test1 "vnsin.q $1, $0"
test1 "vocp.s $1, $0"
test1 "vocp.p $1, $0"
test1 "vocp.t $1, $0"
test1 "vocp.q $1, $0"
test1 "vone.s $0"
test1 "vone.p $0"
test1 "vone.t $0"
test1 "vone.q $0"
test1 "vqmul.q $2, $1, $0"
test1 "vrcp.s $1, $0"
test1 "vrcp.p $1, $0"
test1 "vrcp.t $1, $0"
test1 "vrcp.q $1, $0"
test1 "vrexp2.s $1, $0"
test1 "vrexp2.p $1, $0"
test1 "vrexp2.t $1, $0"
test1 "vrexp2.q $1, $0"
test2a "vrndf1.s $0"
test2b "vrndf1.p $0"
test2b "vrndf1.t $0"
test2c "vrndf1.q $0"
test2a "vrndf2.s $0"
test2b "vrndf2.p $0"
test2b "vrndf2.t $0"
test2c "vrndf2.q $0"
test2a "vrndi.s $0"
test2b "vrndi.p $0"
test2b "vrndi.t $0"
test2c "vrndi.q $0"
test1 "vrnds.s $0"
test1 "vrot.p $1, $0, [c,s]"
test1 "vrot.t $1, $0, [c,s,0]"
test1 "vrot.q $1, $0, [c,s,0,0]"
test1 "vrsq.s $1, $0"
test1 "vrsq.p $1, $0"
test1 "vrsq.t $1, $0"
test1 "vrsq.q $1, $0"
test1 "vs2i.s $1, $0"
test1 "vs2i.p $1, $0"
test1 "vsat0.s $1, $0"
test1 "vsat0.p $1, $0"
test1 "vsat0.t $1, $0"
test1 "vsat0.q $1, $0"
test1 "vsat1.s $1, $0"
test1 "vsat1.p $1, $0"
test1 "vsat1.t $1, $0"
test1 "vsat1.q $1, $0"
test1 "vsbn.s $2, $1, $0"
test1 "vsbz.s $1, $0"
test1 "vscl.p $2, $1, $0"
test1 "vscl.t $2, $1, $0"
test1 "vscl.q $2, $1, $0"
test1 "vscmp.s $2, $1, $0"
test1 "vscmp.p $2, $1, $0"
test1 "vscmp.t $2, $1, $0"
test1 "vscmp.q $2, $1, $0"
test1 "vsge.s $2, $1, $0"
test1 "vsge.p $2, $1, $0"
test1 "vsge.t $2, $1, $0"
test1 "vsge.q $2, $1, $0"
test1 "vsgn.s $1, $0"
test1 "vsgn.p $1, $0"
test1 "vsgn.t $1, $0"
test1 "vsgn.q $1, $0"
test1 "vsin.s $1, $0"
test1 "vsin.p $1, $0"
test1 "vsin.t $1, $0"
test1 "vsin.q $1, $0"
test1 "vslt.s $2, $1, $0"
test1 "vslt.p $2, $1, $0"
test1 "vslt.t $2, $1, $0"
test1 "vslt.q $2, $1, $0"
test1 "vsocp.s $1, $0"
test1 "vsocp.p $1, $0"
test1 "vsqrt.s $1, $0"
test1 "vsqrt.p $1, $0"
test1 "vsqrt.t $1, $0"
test1 "vsqrt.q $1, $0"
test1 "vsrt1.q $1, $0"
test1 "vsrt2.q $1, $0"
test1 "vsrt3.q $1, $0"
test1 "vsrt4.q $1, $0"
test1 "vsub.s $2, $1, $0"
test1 "vsub.p $2, $1, $0"
test1 "vsub.t $2, $1, $0"
test1 "vsub.q $2, $1, $0"
test1 "vt4444.q $1, $0"
test1 "vt5551.q $1, $0"
test1 "vt5650.q $1, $0"
test1 "vtfm2.p $1, $4, $0"
test1 "vtfm3.t $1, $4, $0"
test1 "vtfm4.q $1, $4, $0"
test1 "vus2i.s $1, $0"
test1 "vus2i.p $1, $0"
test1 "vwbn.s $1, $0, 0"
test1 "vzero.p $0"
test1 "vzero.t $0"
test1 "vzero.q $0"
jr $ra
nop
Code: Select all
/*
* PSP Software Development Kit - http://www.pspdev.org
* -----------------------------------------------------------------------
* Licensed under the BSD license, see LICENSE in PSPSDK root for details.
*
* main.c - Basic ELF template
*
* Copyright (c) 2005 Marcus R. Brown <[email protected]>
* Copyright (c) 2005 James Forshaw <[email protected]>
* Copyright (c) 2005 John Kelley <[email protected]>
*
* $Id: main.c 1888 2006-05-01 08:47:04Z tyranid $
* $HeadURL$
*/
#include <pspkernel.h>
#include <pspdebug.h>
#include <pspctrl.h>
#include <psptypes.h>
#include <math.h>
#include <stdio.h>
#include <string.h> /* strlen() is used when formatting the table */
#define printf pspDebugScreenPrintf
/* Define the module info section */
PSP_MODULE_INFO("template", 0x1000, 1, 1);
/* Define the main thread's attribute value (optional) */
PSP_MAIN_THREAD_ATTR(THREAD_ATTR_VFPU);
static void exception_handler(PspDebugRegBlock *regs)
{
pspDebugScreenInit();
pspDebugScreenSetBackColor(0x00FF0000);
pspDebugScreenSetTextColor(0xFFFFFFFF);
pspDebugScreenClear();
pspDebugScreenPrintf("\nSC - Exception Details:\n");
pspDebugDumpException(regs);
pspDebugScreenPrintf("\n\nPress 'cross' button to exit.");
wait();
sceKernelExitGame();
}
void save_file(const char *data, unsigned int n, const char *name)
{
int fdout;
fdout = sceIoOpen(name, PSP_O_WRONLY | PSP_O_CREAT | PSP_O_TRUNC, 0777);
sceIoWrite(fdout, data, n);
sceIoClose(fdout);
}
static char *vfpu_insn[] =
{
"lv.s",
"lv.q",
"mfv",
"mfvc",
"mtv",
"mtvc",
"sv.s",
"sv.q",
"svl.q",
"svr.q",
"vabs.s",
"vabs.p",
"vabs.t",
"vabs.q",
"vadd.s",
"vadd.p",
"vadd.t",
"vadd.q",
"vasin.s",
"vasin.p",
"vasin.t",
"vasin.q",
"vavg.p",
"vavg.t",
"vavg.q",
"vbfy1.p",
"vbfy1.q",
"vbfy2.q",
"vcmovf.s",
"vcmovf.p",
"vcmovf.t",
"vcmovf.q",
"vcmovt.s",
"vcmovt.p",
"vcmovt.t",
"vcmovt.q",
"vcmp.s",
"vcmp.p",
"vcmp.t",
"vcmp.q",
"vcos.s",
"vcos.p",
"vcos.t",
"vcos.q",
"vcrs.t",
"vcrsp.t",
"vcst.s",
"vcst.p",
"vcst.t",
"vcst.q",
"vdet.p",
"vdiv.s",
"vdiv.p",
"vdiv.t",
"vdiv.q",
"vdot.t",
"vdot.q",
"vexp2.s",
"vexp2.p",
"vexp2.t",
"vexp2.q",
"vf2h.p",
"vf2h.q",
"vf2id.s",
"vf2id.p",
"vf2id.t",
"vf2id.q",
"vf2in.s",
"vf2in.p",
"vf2in.t",
"vf2in.q",
"vf2iu.s",
"vf2iu.p",
"vf2iu.t",
"vf2iu.q",
"vf2iz.s",
"vf2iz.p",
"vf2iz.t",
"vf2iz.q",
"vfad.p",
"vfad.t",
"vfad.q",
"vh2f.s",
"vh2f.p",
"vhdp.p",
"vhdp.q",
"vhtfm2.p",
"vhtfm3.t",
"vhtfm4.q",
"vi2c.q",
"vi2f.s",
"vi2f.p",
"vi2f.t",
"vi2f.q",
"vi2s.p",
"vi2s.q",
"vi2uc.q",
"vi2us.p",
"vi2us.q",
"vidt.p",
"vidt.q",
"viim.s",
"vlgb.s",
"vlog2.s",
"vlog2.p",
"vlog2.t",
"vlog2.q",
"vmax.s",
"vmax.p",
"vmax.t",
"vmax.q",
"vmfvc",
"vmidt.p",
"vmidt.t",
"vmidt.q",
"vmin.s",
"vmin.p",
"vmin.t",
"vmin.q",
"vmmov.p",
"vmmov.t",
"vmmov.q",
"vmmul.p",
"vmmul.t",
"vmmul.q",
"vmone.p",
"vmone.t",
"vmone.q",
"vmscl.p",
"vmscl.t",
"vmscl.q",
"vmtvc",
"vmul.s",
"vmul.p",
"vmul.t",
"vmul.q",
"vmzero.p",
"vmzero.t",
"vmzero.q",
"vneg.s",
"vneg.p",
"vneg.t",
"vneg.q",
"vnrcp.s",
"vnrcp.p",
"vnrcp.t",
"vnrcp.q",
"vnsin.s",
"vnsin.p",
"vnsin.t",
"vnsin.q",
"vocp.s",
"vocp.p",
"vocp.t",
"vocp.q",
"vone.s",
"vone.p",
"vone.t",
"vone.q",
"vqmul.q",
"vrcp.s",
"vrcp.p",
"vrcp.t",
"vrcp.q",
"vrexp2.s",
"vrexp2.p",
"vrexp2.t",
"vrexp2.q",
"vrndf1.s",
"vrndf1.p",
"vrndf1.t",
"vrndf1.q",
"vrndf2.s",
"vrndf2.p",
"vrndf2.t",
"vrndf2.q",
"vrndi.s",
"vrndi.p",
"vrndi.t",
"vrndi.q",
"vrnds.s",
"vrot.p",
"vrot.t",
"vrot.q",
"vrsq.s",
"vrsq.p",
"vrsq.t",
"vrsq.q",
"vs2i.s",
"vs2i.p",
"vsat0.s",
"vsat0.p",
"vsat0.t",
"vsat0.q",
"vsat1.s",
"vsat1.p",
"vsat1.t",
"vsat1.q",
"vsbn.s",
"vsbz.s",
"vscl.p",
"vscl.t",
"vscl.q",
"vscmp.s",
"vscmp.p",
"vscmp.t",
"vscmp.q",
"vsge.s",
"vsge.p",
"vsge.t",
"vsge.q",
"vsgn.s",
"vsgn.p",
"vsgn.t",
"vsgn.q",
"vsin.s",
"vsin.p",
"vsin.t",
"vsin.q",
"vslt.s",
"vslt.p",
"vslt.t",
"vslt.q",
"vsocp.s",
"vsocp.p",
"vsqrt.s",
"vsqrt.p",
"vsqrt.t",
"vsqrt.q",
"vsrt1.q",
"vsrt2.q",
"vsrt3.q",
"vsrt4.q",
"vsub.s",
"vsub.p",
"vsub.t",
"vsub.q",
"vt4444.q",
"vt5551.q",
"vt5650.q",
"vtfm2.p",
"vtfm3.t",
"vtfm4.q",
"vus2i.s",
"vus2i.p",
"vwbn.s",
"vzero.p",
"vzero.t",
"vzero.q",
0
};
ScePspFQuaternion res[4];
char g_data[4*2*512];
char g_text[32*1024];
int main(int argc, char *argv[])
{
pspDebugInstallErrorHandler(exception_handler);
pspDebugScreenInit();
sceCtrlSetSamplingCycle(0);
sceCtrlSetSamplingMode(PSP_CTRL_MODE_DIGITAL);
// touch res up front so the timing run does not start with a cache miss
res->x = 0;
res->y = 0;
res->z = 0;
res->w = 0;
test_vfpu_cycles(g_data, g_data, g_data, res);
int i = 0, len = 0;
len = len + sprintf(g_text + len, "INSTRUCTION\t\tPITCH\tLATENCY\n");
len = len + sprintf(g_text + len, "---------------\t-------\t-------\n");
while (vfpu_insn[i])
{
char const *q = vfpu_insn[i];
len = len + sprintf(g_text + len, "%s%s%d\t%d\n", q, (strlen(q) < 8 ? "\t\t" : "\t"), g_data[i*2], g_data[i*2+1]);
i++;
}
save_file(g_text, len, "ms0:/cycles.txt");
sceKernelExitGame();
return 0;
}
Last edited by hlide on Sat Jan 27, 2007 8:16 am, edited 1 time in total.
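The table in main() is built with unbounded sprintf calls into g_text and a tab heuristic for alignment. A minimal sketch of a bounded alternative, assuming fixed-width columns are acceptable in cycles.txt; append_row is a hypothetical helper, not part of the original main.c:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not in the original listing): append one
 * "mnemonic pitch latency" row to buf without writing past cap bytes.
 * Returns the new length, or the old length if the row does not fit. */
static size_t append_row(char *buf, size_t len, size_t cap,
                         const char *name, int pitch, int latency)
{
    int n;

    assert(buf != NULL && name != NULL);
    if (len >= cap)
        return len;

    /* %-16s left-aligns the mnemonic in a 16-character column,
     * %-8d does the same for the pitch in an 8-character column. */
    n = snprintf(buf + len, cap - len, "%-16s%-8d%d\n", name, pitch, latency);
    if (n < 0 || (size_t)n >= cap - len) {
        buf[len] = '\0'; /* drop the truncated row */
        return len;
    }
    return len + (size_t)n;
}
```

In the loop this would be called as len = append_row(g_text, len, sizeof g_text, q, g_data[i*2], g_data[i*2+1]); with len declared as size_t.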
Some of the cycles look pretty synthetic though.
vdiv.s 1 2
vdiv.p 1 2
vdiv.t 1 3
vdiv.q 1 4
vrcp.s 1 2
vrcp.p 1 3
vrcp.t 1 4
vrcp.q 1 5
Makes me wonder if those are reliable for real-world usage. I doubt the VFPU does a full quad-vector divide in 5 cycles while the reciprocal takes 6 cycles, especially in comparison to my tests.
So I wonder if that method returns the real cycles that the op takes on the VFPU, or just the cycles it takes the CPU to submit the ops to the VFPU. Also, what happens if you insert another instruction instead of the vnop to check for latency?
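The 5- and 6-cycle figures in this post appear to be pitch plus latency from the quoted rows (1 + 4 for vdiv.q, 1 + 5 for vrcp.q). A minimal sketch of that arithmetic, assuming the two numeric columns are the PITCH and LATENCY columns of cycles.txt; the struct and function names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* One row of the quoted table. Names are illustrative, not from main.c. */
struct cycle_row {
    const char *insn;
    int pitch;   /* PITCH column */
    int latency; /* LATENCY column */
};

/* The eight rows quoted in the post above. */
static const struct cycle_row quoted_rows[] = {
    { "vdiv.s", 1, 2 }, { "vdiv.p", 1, 2 },
    { "vdiv.t", 1, 3 }, { "vdiv.q", 1, 4 },
    { "vrcp.s", 1, 2 }, { "vrcp.p", 1, 3 },
    { "vrcp.t", 1, 4 }, { "vrcp.q", 1, 5 },
};

/* Assumed reading: the total quoted in the post is pitch + latency. */
static int total_cycles(const struct cycle_row *row)
{
    assert(row != NULL);
    return row->pitch + row->latency;
}
```

Under that reading, total_cycles gives 5 for the vdiv.q row and 6 for the vrcp.q row, which matches the numbers being questioned.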
<Don't push the river, it flows.>
http://wordpress.fx-world.org - my devblog
http://wiki.fx-world.org - VFPU documentation wiki
Alexander Berl
/!\ UPDATED /!\
OK, I think pitch and latency look more accurate now, though I still cannot really confirm it.
vdiv and vrndX seem to be special because they are not single-cycle instructions, so I was forced to tweak my macro to get their pitch indirectly.
I call them macro-instructions because it seems some instructions iterate the same operation a number of times according to the suffix .p, .t or .q.
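One way to read the macro-instruction idea: if a .p, .t or .q form repeats the scalar operation once per element, latency should grow by one cycle per extra element. A minimal sketch under that assumption, checked only against the vrcp rows quoted earlier in the thread (latency 2, 3, 4, 5); the vdiv rows (2, 2, 3, 4) do not fit this simple model. The function names are hypothetical:

```c
#include <assert.h>

/* Element count implied by a VFPU size suffix:
 * s = single, p = pair, t = triple, q = quad. Returns 0 if unknown. */
static int suffix_lanes(char suffix)
{
    switch (suffix) {
    case 's': return 1;
    case 'p': return 2;
    case 't': return 3;
    case 'q': return 4;
    default:  return 0;
    }
}

/* Assumed model, not a measured fact: the scalar form has latency
 * scalar_latency and every additional element adds one cycle. */
static int modeled_latency(int scalar_latency, char suffix)
{
    int lanes = suffix_lanes(suffix);

    assert(lanes > 0);
    return scalar_latency + (lanes - 1);
}
```

With the quoted scalar latency of 2 for vrcp.s, the model yields 3, 4 and 5 for the .p, .t and .q forms, the same as the quoted vrcp table. Whether the hardware actually works this way is still the open question in this thread.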