GCC optimisation flags
Hi there,
(I've done a quick search for this but didn't seem to find anything else relevant)
When I first ported Daedalus over to the PSP I started compiling with -O1 as using -O2 and -O3 would result in random crashes in code which compiled fine with -O1.
I've recently updated to the latest toolchain so I re-compiled Daedalus with -O2 and -O3 and I'm pleased to say that I no longer seem to be getting random crashes. Unfortunately it's not all sweetness and light, because I'm getting various odd bugs (they mostly seem to be related to my transform and lighting code, and the results are different between -O2 and -O3).
Before I investigate any further, are there any known issues with -O2/-O3 that I should be aware of? Assuming I do figure out what's going on, and can provide a repro, who should I send this to?
Cheers,
StrmnNrmn
Apologies for the double post. It turns out to be pretty much self-inflicted, but I thought I'd post with my findings as I can imagine lots of people have similar code and it might save some hassle in the future.
This is the offending function:
Code:
    static float InvSqrt(float x)
    {
        float xhalf = 0.5f*x;
        int i = *(int*)&x;
        i = 0x5f3759df - (i>>1);
        x = *(float*)&i;
        x = x*(1.5f-xhalf*x*x);
        return x;
    }
You may recognise this as the famous reciprocal square root from the Quake source (amongst other places), which is why I wouldn't be surprised to see other people with the same problem :)
Anyway, it looks like with O2 and O3 gcc is inlining this function 'incorrectly'. It looks like a pointer aliasing problem - it's passing x in $fp12, but the inlined code is expecting it to be present on the stack, and it's not. I used 'incorrectly' because the c-style float/int coercion is particularly horrible and I'm guessing it falls into the 'undefined behaviour' of ISO C.
It's easy enough to fix - using a union should work with GCC, or I'll rewrite in assembly. Hopefully that's the only issue remaining with enabling -O3, as Daedalus is around 10% faster as a result :)
StrmnNrmn
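The union fix mentioned above might look roughly like the sketch below. GCC documents reading a union member other than the one last written as supported type punning, so this should survive -O2/-O3. The name InvSqrtUnion and the use of uint32_t are illustrative choices, not code from Daedalus, and the sketch assumes a 32-bit IEEE 754 float and a positive input.

```c
#include <stdint.h>

/* Hypothetical union-based variant of InvSqrt (the name is illustrative).
   Assumes float is a 32-bit IEEE 754 value and x is positive. */
static float InvSqrtUnion(float x)
{
    union { float f; uint32_t i; } u;
    float xhalf = 0.5f * x;

    u.f = x;                           /* store the value as a float ...      */
    u.i = 0x5f3759dfu - (u.i >> 1);    /* ... and adjust its bit pattern      */
    x = u.f;                           /* read back the initial approximation */
    x = x * (1.5f - xhalf * x * x);    /* one Newton-Raphson refinement step  */
    return x;
}
```

With a single refinement step the result is only approximate; for example, InvSqrtUnion(4.0f) comes out a little below 0.5.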
O_o !? you still using that kind of code to optimize Daedalus !? come on !! why don't you use the VFPU !?
"vrsq.s" is your friend, dude !
Code:
    static float InvSqrt(float x)
    {
        float result;
        __asm__ volatile (
            "mtv %1, S000\n"        /* move x into VFPU register S000 */
            "vrsq.s S000, S000\n"   /* reciprocal square root         */
            "mfv %0, S000\n"        /* move the result back           */
            : "=r"(result) : "r"(x));
        return result;
    }
:)
hlide wrote: O_o !? you still using that kind of code to optimize Daedalus !? come on !! why don't you use the VFPU !?
I use the VFPU quite heavily for transform and lighting, clipping etc, but there's lots of old code that I haven't gotten around to updating yet (most of the codebase was written around 2001-2003).
I have to admit I'd not realised I could rewrite InvSqrt too until you mentioned it - thanks :D
I'm having similar problems atm with random crashes whenever I change a single line in my project. Until your post I was pretty clueless, as I was pretty sure it wasn't a buffer/stack overflow. But that aliasing issue gave me a new idea of where to look.
BTW: I took a look over your VFPU code and I found a few things that you could improve on:
a) you use code like this often: vmul.q R000, R100, R001[x,x,x,x]
The prefix operation is not free though and you could get the same result with a vscl.q R000, R100, S001
b) You load and convert your int vectors with some overhead:
Code:
    lb $v0,15($a2) // x
    mtv $v0, S200
    lb $v1,14($a2) // y
    mtv $v1, S210
    lb $v0,13($a2) // z
    mtv $v0, S220
    li $v1, 0
    mtv $v1, S230
    # Convert to float and transform
    vi2f.q R201, R200, 0 // int -> float (obliterates world transform)
That code could be done in less ops:
Code:
    lv.s S200, 12($a2) // load word [?,z,y,x]
    vuc2i.s R200, S200 // R200 = [?,z*0x10101010,y*0x10101010,x*0x10101010]
    vi2f.q R201, R200[w,z,y,0], 24 // make last component 0 and swap x/z
The vuc2i.s is the op from hlide's last finding here: http://forums.ps2dev.org/viewtopic.php?p=50637#50637
If that op isn't yet in psp-asm then you could directly use the opcode through .word (OPCODE|VD|VS) as hlide did. I'm sure he can help you get the masks for the registers you need or take a look in mips-op.c.
Note that for your short -> float load/convert you can use the vus2i.p op and use the constant 1 in the register prefix to avoid the li/mtv, and this op IS in psp-asm :)
(I suppose the swizzle of z/x components is a product of N64's way of storage?).
<Don't push the river, it flows.>
http://wordpress.fx-world.org - my devblog
http://wiki.fx-world.org - VFPU documentation wiki
Alexander Berl
Raphael wrote: BTW: I took a look over your VFPU code and I found a few things that you could improve on:
Wow - that's incredibly helpful - thanks! The TnL is currently pretty expensive so any cycles I can save are a bonus (unfortunately to get the correct results I have to do this all on the CPU :( ). I'll give that a shot tonight and let you know how I get on.
vuc2i is a new one for me - I remember doing a bit of searching at the time for something that did what I wanted but came up blank. I should probably revisit the various VFPU threads here as there are probably a number of new discoveries since I originally wrote that code.
Raphael wrote: (I suppose the swizzle of z/x components is a product of N64's way of storage?).
That's right - the N64's VR4300 runs in big-endian mode. I byteswap everything so that all 32bit reads are direct, but I have to fiddle with the address or data for all other size reads.
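As a rough illustration of the byteswapping described above: if every 32-bit word is byte-reversed once up front (assuming the stored data and the host have opposite byte orders), a byte that used to live at a given address ends up at that address XOR 3. The helper names swap_words and read8 are hypothetical and are not taken from Daedalus.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the bytes of every 32-bit word in place.
   Any trailing bytes that do not form a full word are left untouched. */
static void swap_words(uint8_t *mem, size_t len)
{
    for (size_t i = 0; i + 3 < len; i += 4) {
        uint8_t b0 = mem[i];
        uint8_t b1 = mem[i + 1];
        mem[i]     = mem[i + 3];
        mem[i + 1] = mem[i + 2];
        mem[i + 2] = b1;
        mem[i + 3] = b0;
    }
}

/* Read the byte that lived at `addr` before the swap:
   within each word, offsets 0,1,2,3 now sit at 3,2,1,0. */
static uint8_t read8(const uint8_t *mem, size_t addr)
{
    return mem[addr ^ 3];
}
```

Under this scheme, bytes originally at offsets 12, 13 and 14 are found at 15, 14 and 13, which would be consistent with the lb offsets in the load/convert listing quoted earlier in the thread.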
And another thing I found:
Code:
    lv.q R200, 0($a1) // Load mat project
    lv.q R201, 16($a1)
    lv.q R202, 32($a1)
    lv.q R203, 48($a1)
    # Produce world*project in M100
    vmmul.q M100, M000, M200
    ...
    vtfm4.q R201, M000, R200 // World transform
    vtfm4.q R202, M100, R200 // Projection transform
You can avoid the costly matrix-matrix multiplication, since you need the concatenated transform only once anyway (in both TnL functions) and calculate the intermediate (World transform) result too.
Code:
    lv.q R100, 0($a1) // Load mat project
    lv.q R101, 16($a1)
    lv.q R102, 32($a1)
    lv.q R103, 48($a1)
    vtfm4.q R201, M000, R200 // World transform
    vtfm4.q R202, M100, R201 // World*Projection transform
This might cause a stall though, so I'm not sure if that's really faster in the end. In that case, you could move the following load/convert vector code up to fill the gap.
And another small idea:
Code:
    # Load 1/255
    la $v0, recip_255
    lv.q R203, 0($v0)
    ...
    vmul.q R200, R200, R203 // R200 = [r * 1/255, g * 1/255, b * 1/255, a * 1/255]
Could be done with these two ops to avoid the quadword memory load:
Code:
    vfim.s S203, 0.003921568627450980392156862745098
    ...
    vscl.q R200, R200, S203
Hope that helps a little :)
<Don't push the river, it flows.>
Alexander Berl
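The matrix suggestion above relies on associativity: transforming a vertex by the world matrix and then by the projection matrix gives the same result as transforming it once by the concatenated matrix, so when the world-space result is needed anyway the matrix-matrix product can be skipped. Below is a minimal plain C sketch of that identity. The Mat4 and Vec4 types, the row-major storage and the column-vector convention are assumptions of this sketch and say nothing about the VFPU register layout.

```c
/* Illustrative 4x4 helpers (hypothetical types, not from Daedalus). */
typedef struct { float m[4][4]; } Mat4;
typedef struct { float v[4]; } Vec4;

/* out = a * b, costing 64 multiplies */
static Mat4 mat_mul(const Mat4 *a, const Mat4 *b)
{
    Mat4 out;
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++)
                sum += a->m[r][k] * b->m[k][c];
            out.m[r][c] = sum;
        }
    }
    return out;
}

/* out = m * v, costing 16 multiplies */
static Vec4 transform(const Mat4 *m, const Vec4 *v)
{
    Vec4 out;
    for (int r = 0; r < 4; r++) {
        float sum = 0.0f;
        for (int k = 0; k < 4; k++)
            sum += m->m[r][k] * v->v[k];
        out.v[r] = sum;
    }
    return out;
}
```

With project p, world w and vertex v, transform(&p, &wv) where wv = transform(&w, &v) matches transform(&pw, &v) where pw = mat_mul(&p, &w). In floating point the two routes can differ by rounding in general, which is unlikely to matter for TnL.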
Raphael wrote: The vuc2i.s is the op from hlide's last finding here: http://forums.ps2dev.org/viewtopic.php?p=50637#50637
If that op isn't yet in psp-asm then you could directly use the opcode through .word (OPCODE|VD|VS) as hlide did.
well i added two new opcodes in my psp-as but i didn't give the source to be committed in ps2dev.org. vuc2i.s is one. If TyRaNiD is okay to commit it, i will give the source.
Raphael wrote: Hope that helps a little :)
Code:
    vfim.s S203, 0.003921568627450980392156862745098
    ...
    vscl.q R200, R200, S203
this is okay if 1/255 can fit in a half-float word
maybe a suggestion ?
Code:
    vf2iz.q R200, R200, 22 // float 0.0 to 255.0 --> int [s:1][0][i:8][f:22]
    vi2f.q R200, R200, 30 // int [s:1][0][f:30] --> float 0.0 to 0.9...
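Going by the comments in the two-op suggestion above, the first op converts to fixed point with 22 fractional bits and the second reads the same integer back as if it had 30 fractional bits. The plain C sketch below emulates that arithmetic only; it is not the VFPU instructions themselves, and the function name is hypothetical. The net effect under this reading is a divide by 256 rather than by 255.

```c
#include <stdint.h>

/* Emulates the fixed-point round trip in plain C, assuming
   0.0 <= x < 256.0 so the intermediate fits in 32 bits. */
static float scale_via_fixed_point(float x)
{
    int32_t fixed = (int32_t)(x * 4194304.0f);   /* x * 2^22, truncated   */
    return (float)fixed / 1073741824.0f;         /* reinterpret as / 2^30 */
}
```

So full intensity 255.0 would map to 255/256 = 0.99609375 rather than 1.0, which fits the "0.0 to 0.9..." range in the comment.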
Good point there. For full single precision accuracy
Code:
    vfim.s S203, 255.0; vrcp.s S203, S203
would do the job better, but I think a half-float 1/255 should be good enough. It is effectively stored as 0.003921509 (0.00111.0000000100b), so it's a precision loss of <0.002% and full intensity input would translate to 0.9999848 as a single precision float. But it's his decision as to what precision he needs for the conversion.
hlide wrote: well i added two new opcodes in my psp-as but i didn't give the source to be committed in ps2dev.org. vuc2i.s is one. If TyRaNiD is okay to commit it, i will give the source.
Would be nice :)
<Don't push the river, it flows.>
Alexander Berl
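The half-float figures quoted above can be checked with a small C sketch that rounds a single precision value to the 10 stored mantissa bits of a half float. The helper name is hypothetical, and the sketch assumes a 32-bit IEEE 754 float, uses round-half-up instead of ties-to-even, and ignores the narrower half-float exponent range, none of which matters for 1/255.

```c
#include <stdint.h>
#include <string.h>

/* Keep only the top 10 of the 23 stored mantissa bits, rounding to nearest.
   memcpy is used to move the bits between float and integer. */
static float round_to_half_mantissa(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += 0x1000u;       /* add half of the lowest kept bit (2^12) */
    bits &= ~0x1fffu;      /* clear the 13 dropped mantissa bits     */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For 1.0f / 255.0f this yields 257/65536, about 0.0039215088, and multiplying that by 255 gives roughly 0.99998, in line with the numbers in the post.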
StrmnNrmn wrote: [snip]
Code:
    static float InvSqrt(float x)
    {
        float xhalf = 0.5f*x;
        int i = *(int*)&x;
        i = 0x5f3759df - (i>>1);
        x = *(float*)&i;
        x = x*(1.5f-xhalf*x*x);
        return x;
    }
Anyway, it looks like with O2 and O3 gcc is inlining this function 'incorrectly'.
No, as surprising as it may sound GCC is inlining the code correctly, it is you that have made a mistake. (That sounds harsh, but it is a sad reality, read on.)
StrmnNrmn wrote: It looks like a pointer aliasing problem
That is precisely what it is.
C99 is a broken piece of crap language specification. I quote:
Here are the aliasing rules:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
- a type compatible with the effective type of the object,
- a qualified version of a type compatible with the effective type of the object,
- a type that is the signed or unsigned type corresponding to the effective type of the object,
- a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
- an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
- a character type.
So doing reasonable things like casting an int pointer to a float pointer (and back again) and reading through it is not allowed. Moreover the compiler is free to assume the two pointers do not point to the same thing and optimize accordingly.
I think this part of the C99 spec is BADLY BROKEN, and recent versions of GCC want to be C99 compliant. The only way to ensure your program won't break in random unpredictable ways from compile to compile if you do lots of mixed type casting is to use -fno-strict-aliasing.
Also note that it is possible that code that breaks the strict aliasing rules will work at -O3; there is just no way to know whether it will or not, and adding 1 line of code could change the ability of the optimizer to break your code. At -O1 or below strict aliasing optimization is inhibited.
Sad but true.
Strontium Dog
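For code that has to reinterpret bits and should stay valid without -fno-strict-aliasing, one commonly used alternative is to copy the object representation with memcpy, since memcpy copies the bytes as characters, which the rules quoted above permit. The helper names below are hypothetical, and the sketch assumes float is a 32-bit IEEE 754 type.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits as an integer without a pointer cast. */
static uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Reinterpret an integer's bits as a float without a pointer cast. */
static float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

Compilers typically turn a fixed-size memcpy like this into a plain register move when optimising, so the cost is usually the same as the cast it replaces.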
Thanks Raphael/hlide - some great suggestions from the both of you.
Raphael wrote: Good point there. For full single precision accuracy
Code:
    vfim.s S203, 255.0; vrcp.s S203, S203
would do the job better, but I think a half-float 1/255 should be good enough. It is effectively stored as 0.003921509 (0.00111.0000000100b), so it's a precision loss of <0.002% and full intensity input would translate to 0.9999848 as a single precision float. But it's his decision as to what precision he needs for the conversion.
This really doesn't need to be all that accurate - it's just for computing the diffuse colour for the vertex, so it doesn't really matter if I end up with R=240 or R=241 etc - you won't be able to tell the difference onscreen.
I've not tried the vuc2i.s load or separate world/project matrices yet (that's my next job :). I have converted all the vmul.q ops with the selection mask into vscl.q ops though and I'm seeing up to a 5% speedup in certain places (which is excellent for the amount of effort involved).
Thanks again guys!
StrontiumDog wrote: I think this part of the C99 spec is BADLY BROKEN, and recent versions of GCC want to be C99 compliant. The only way to ensure your program won't break in random unpredictable ways from compile to compile if you do lots of mixed type casting is to use -fno-strict-aliasing.
That's really useful information - thanks for digging that up. Rewriting InvSqrt seems to have fixed all my problems, but -fno-strict-aliasing is a good tip to remember for the future.
Thanks.