GCC optimisation flags

StrmnNrmn · Post by **StrmnNrmn** » Tue Mar 13, 2007 6:19 pm

Hi there,

(I've done a quick search for this but didn't seem to find anything else relevant)

When I first ported Daedalus over to the PSP I started compiling with -O1 as using -O2 and -O3 would result in random crashes in code which compiled fine with -O1.

I've recently updated to the latest toolchain so I re-compiled Daedalus with -O2 and -O3 and I'm pleased to say that I no longer seem to be getting random crashes. Unfortunately it's not all sweetness and light, because I'm getting various odd bugs (they mostly seem to be related to my transform and lighting code, and the results are different between -O2 and -O3).

Before I investigate any further, are there any known issues with -O2/-O3 that I should be aware of? Assuming I do figure out what's going on, and can provide a repro, who should I send this to?

Cheers,
StrmnNrmn

StrmnNrmn · Post by **StrmnNrmn** » Tue Mar 13, 2007 9:55 pm

Apologies for the double post. It turns out to be pretty much self-inflicted, but I thought I'd post with my findings as I can imagine lots of people have similar code and it might save some hassle in the future.

This is the offending function:

Code: Select all

static float InvSqrt(float x)
{
	float xhalf = 0.5f*x;
	int i = *(int*)&x;
	i = 0x5f3759df - (i>>1);
	x = *(float*)&i;
	x = x*(1.5f-xhalf*x*x);
	return x;
}

You may recognise this as the famous reciprocal square root from the Quake source (amongst other places), which is why I wouldn't be surprised to see other people with the same problem :)

Anyway, it looks like with -O2 and -O3 gcc is inlining this function 'incorrectly'. It looks like a pointer aliasing problem - it's passing x in $f12, but the inlined code is expecting it to be present on the stack, and it's not. I used 'incorrectly' because the C-style float/int coercion is particularly horrible and I'm guessing it falls into the 'undefined behaviour' of ISO C.

It's easy enough to fix - using a union should work with GCC, or I'll rewrite in assembly. Hopefully that's the only issue remaining with enabling -O3, as Daedalus is around 10% faster as a result :)
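A minimal sketch of the union approach mentioned here, assuming GCC (which documents type punning through a union as supported) and a 32-bit `float`. The name `InvSqrtUnion` is made up for illustration; the algorithm is unchanged from the function above.

```c
#include <stdint.h>

/* Hypothetical name. Same maths as InvSqrt, but the float/int
   reinterpretation goes through a union instead of pointer casts. */
static float InvSqrtUnion(float x)
{
	union { float f; int32_t i; } u;
	float xhalf = 0.5f * x;

	u.f = x;                          /* store as float ...               */
	u.i = 0x5f3759df - (u.i >> 1);    /* ... read back and adjust as int  */
	x = u.f;                          /* initial estimate of 1/sqrt(x)    */
	x = x * (1.5f - xhalf * x * x);   /* one Newton-Raphson refinement    */
	return x;
}
```

The result is only an approximation (within a fraction of a percent), exactly as with the original version.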

StrmnNrmn

hlide · Post by **hlide** » Wed Mar 14, 2007 5:50 am

O_o !? you're still using that kind of code to optimize Daedalus !? come on !! why don't you use the VFPU !?

Code: Select all

static float InvSqrt(float x)
{
	float result;
	__asm__ volatile (
		"mtv     %1, S000\n"	// move x into VFPU register S000
		"vrsq.s S000, S000\n"	// S000 = 1/sqrt(S000)
		"mfv     %0, S000\n"	// move the result back out
	: "=r"(result) : "r"(x));
	return result;	// the original snippet was missing this return
}

"vrsq.s" is your friend, dude !

StrmnNrmn · Post by **StrmnNrmn** » Wed Mar 14, 2007 8:48 am

hlide wrote:O_o !? you're still using that kind of code to optimize Daedalus !? come on !! why don't you use the VFPU !?

:)

I use the VFPU quite heavily for transform and lighting, clipping etc, but there's lots of old code that I haven't gotten around to updating yet (most of the codebase was written around 2001-2003).

I have to admit I'd not realised I could rewrite InvSqrt too until you mentioned it - thanks :D

Raphael · Post by **Raphael** » Wed Mar 14, 2007 9:52 pm

I'm having similar problems atm with random crashes whenever I change a single line in my project. Until your post I was pretty clueless, as I was pretty sure it wasn't down to a buffer/stack overflow. But that aliasing issue gave me a new idea of where to look.

BTW: I took a look over your VFPU code and I found a few things that you could improve on:
a) you use code like this often: vmul.q R000, R100, R001[x,x,x,x]
The prefix operation is not free though and you could get the same result with a vscl.q R000, R100, S001
b) You load and convert your int vectors with some overhead:

Code: Select all

lb			$v0,15($a2)			// x
mtv			$v0, S200
lb			$v1,14($a2)			// y
mtv			$v1, S210
lb			$v0,13($a2)			// z
mtv			$v0, S220
li			$v1, 0
mtv			$v1, S230
# Convert to float and transform
vi2f.q		R201, R200, 0				// int -> float (obliterates world transform)

That code could be done in fewer ops:

Code: Select all

lv.s S200, 12($a2)  // load word [?,z,y,x]
vuc2i.s R200, S200 // R200 = [?,z*0x10101010,y*0x10101010,x*0x10101010]
vi2f.q R201, R200[w,z,y,0], 24  // make last component 0 and swap x/z

The vuc2i.s is the op from hlide's last finding here: http://forums.ps2dev.org/viewtopic.php?p=50637#50637
If that op isn't yet in psp-asm then you could directly use the opcode through .word (OPCODE|VD|VS) as hlide did. I'm sure he can help you get the masks for the registers you need or take a look in mips-op.c.
Note that for your short -> float load/convert you can use the vus2i.p op and use the constant 1 in the register prefix to avoid the li/mtv, and this op IS in psp-asm :)
(I suppose the swizzle of z/x components is a product of N64's way of storage?).

StrmnNrmn · Post by **StrmnNrmn** » Thu Mar 15, 2007 2:49 am

Raphael wrote:BTW: I took a look over your VFPU code and I found a few things that you could improve on:

Wow - that's incredibly helpful - thanks! The TnL is currently pretty expensive so any cycles I can save are a bonus (unfortunately to get the correct results I have to do this all on the CPU :( ) I'll give that a shot tonight and let you know how I get on.

vuc2i is a new one for me - I remember doing a bit of searching at the time for something that did what I wanted but came up blank. I should probably revisit the various VFPU threads here as there are probably a number of new discoveries since I originally wrote that code.

Raphael wrote:(I suppose the swizzle of z/x components is a product of N64's way of storage?).

That's right - the N64's VR4300 runs in big-endian mode. I byteswap everything so that all 32-bit reads are direct, but I have to fiddle with the address or data for reads of any other size.
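To illustrate the kind of address fiddling described here, a rough sketch, assuming guest memory is stored with the bytes of every 32-bit word reversed so that 32-bit reads need no fix-up. The function names and the buffer layout are hypothetical, not taken from Daedalus. After such a swap, the byte the guest program expects at address `a` sits at `a ^ 3`.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the bytes of each 32-bit word in place.
   Assumes len is a multiple of 4. */
static void SwapWords(uint8_t *mem, size_t len)
{
	for (size_t i = 0; i + 3 < len; i += 4) {
		uint8_t b0 = mem[i], b1 = mem[i + 1];
		mem[i]     = mem[i + 3];
		mem[i + 1] = mem[i + 2];
		mem[i + 2] = b1;
		mem[i + 3] = b0;
	}
}

/* Read the byte the guest expects at 'addr' from word-swapped memory. */
static uint8_t ReadGuestU8(const uint8_t *swapped, size_t addr)
{
	return swapped[addr ^ 3];
}
```

This is also why three consecutive guest bytes x, y, z come out in reversed order when loaded as a single word.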

Raphael · Post by **Raphael** » Thu Mar 15, 2007 3:40 am

And another thing I found:

Code: Select all

lv.q		R200, 0($a1)		// Load mat project
lv.q		R201, 16($a1)
lv.q		R202, 32($a1)
lv.q		R203, 48($a1)

# Produce world*project in M100
vmmul.q		M100, M000, M200
...
vtfm4.q		R201, M000, R200	// World transform
vtfm4.q		R202, M100, R200	// Projection transform

You can avoid the costly matrix-matrix multiplication: the concatenated transform is only used once anyway (in both TnL functions), and you have to calculate the intermediate (world transform) result too, so you can feed that result straight into the projection transform.

Code: Select all

lv.q		R100, 0($a1)		// Load mat project
lv.q		R101, 16($a1)
lv.q		R102, 32($a1)
lv.q		R103, 48($a1)

vtfm4.q		R201, M000, R200	// World transform
vtfm4.q		R202, M100, R201	// World*Projection transform

This might cause a stall though, so I'm not sure if that's really faster in the end. In that case, you could move the following load/convert vector code up to fill the gap.
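The rewrite relies on matrix multiplication being associative: transforming by the concatenated matrix gives the same result as transforming by each matrix in turn. A plain C sketch of that equivalence follows (not VFPU code; the column-vector convention and all names are assumptions for illustration).

```c
typedef struct { float m[4][4]; } Mat4;
typedef struct { float v[4]; } Vec4;

/* r = a * b */
static Mat4 MatMul(const Mat4 *a, const Mat4 *b)
{
	Mat4 r;
	for (int i = 0; i < 4; i++)
		for (int j = 0; j < 4; j++) {
			r.m[i][j] = 0.0f;
			for (int k = 0; k < 4; k++)
				r.m[i][j] += a->m[i][k] * b->m[k][j];
		}
	return r;
}

/* r = a * v, with v treated as a column vector */
static Vec4 Transform(const Mat4 *a, const Vec4 *v)
{
	Vec4 r;
	for (int i = 0; i < 4; i++) {
		r.v[i] = 0.0f;
		for (int k = 0; k < 4; k++)
			r.v[i] += a->m[i][k] * v->v[k];
	}
	return r;
}
```

`MatMul` needs 64 multiply-adds, while one extra `Transform` on the already computed world-space vector needs 16, which is where the saving would come from when the concatenated matrix is not reused.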

And another small idea:

Code: Select all

# Load 1/255
la			$v0, recip_255
lv.q		R203, 0($v0)
...
vmul.q		R200, R200, R203	// R200 = [r * 1/255, g * 1/255, b * 1/255, a * 1/255]

Could be done with these two ops to avoid the quadword memory load:

Code: Select all

vfim.s S203, 0.003921568627450980392156862745098
...
vscl.q     R200, R200, S203

Hope that helps a little :)

hlide · Post by **hlide** » Thu Mar 15, 2007 5:08 am

Raphael wrote: The vuc2i.s is the op from hlide's last finding here: http://forums.ps2dev.org/viewtopic.php?p=50637#50637
If that op isn't yet in psp-asm then you could directly use the opcode through .word (OPCODE|VD|VS) as hlide did. I'm sure he can help you get the masks for the registers you need or take a look in mips-op.c.
Note that for your short -> float load/convert you can use the vus2i.p op and use the constant 1 in the register prefix to avoid the li/mtv, and this op IS in psp-asm :)
(I suppose the swizzle of z/x components is a product of N64's way of storage?).

well i added two new opcodes in my psp-as but i didn't give the source to be committed in ps2dev.org. vuc2i.s is one. If TyRaNiD is okay to commit it, i will give the source.

hlide · Post by **hlide** » Thu Mar 15, 2007 5:44 am

Raphael wrote:
Code: Select all
vfim.s S203, 0.003921568627450980392156862745098
...
vscl.q     R200, R200, S203
Hope that helps a little :)

this is okay if 1/255 can fit in a half-float word

maybe a suggestion ?

Code: Select all

vf2iz.q R200, R200, 22 // float 0.0 to 255.0 --> int [s:1][0][i:8][f:22]
vi2f.q R200, R200, 30 // int [s:1][0][f:30] --> float 0.0 to 0.9...

i'm not sure if this is correct however (never tested)
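A scalar C sketch of what this pair of ops would compute, assuming `vf2iz` multiplies by 2^n and truncates toward zero and `vi2f` divides by 2^n (these semantics are an assumption here, and the function name is hypothetical). Under that assumption the net effect is a division by 2^8 = 256 rather than by 255, so an input of 255.0 maps to 0.99609375.

```c
#include <stdint.h>

/* Hypothetical scalar model of the vf2iz/vi2f pair above. */
static float ScaleViaFixedPoint(float x)
{
	/* vf2iz ..., 22: float -> int with 22 fractional bits, truncating */
	int32_t fixed = (int32_t)(x * 4194304.0f);     /* 2^22 */

	/* vi2f ..., 30: treat the same int as having 30 fractional bits */
	return (float)fixed / 1073741824.0f;           /* 2^30 */
}
```

The input range 0.0 to 255.0 matters: 255 * 2^22 still fits in a signed 32-bit integer, whereas 512.0 or more would overflow.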

Raphael · Post by **Raphael** » Thu Mar 15, 2007 7:59 am

Good point there. For full single precision accuracy

Code: Select all

vfim.s S203, 255.0; vrcp.s S203, S203

would do the job better, but I think a half-float 1/255 should be good enough. It is effectively stored as 0.003921509 (sign.exponent.mantissa = 0.00111.0000000100b), so it's a precision loss of <0.002% and a full intensity input would translate to 0.9999848 as a single precision float. But it's his decision what precision he needs for the conversion.
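These figures can be checked by decoding the half-float bit pattern by hand. A small sketch, assuming the layout given in the post (1 sign bit, 5 exponent bits with a bias of 15, 10 mantissa bits) and a normalised value; the function name is hypothetical. The bit pattern 0.00111.0000000100b is 0x1C04.

```c
#include <stdint.h>

/* Decode a normalised half-float: 1 sign, 5 exponent (bias 15),
   10 mantissa bits. Zero, denormals, infinities and NaNs are not handled. */
static float HalfToFloat(uint16_t h)
{
	int sign     = (h >> 15) & 0x1;
	int exponent = (h >> 10) & 0x1f;                    /* biased by 15 */
	int mantissa = h & 0x3ff;
	float value  = (float)(1024 + mantissa) / 1024.0f;  /* 1.mantissa */

	if (exponent >= 15)
		value *= (float)(1 << (exponent - 15));
	else
		value /= (float)(1 << (15 - exponent));
	return sign ? -value : value;
}
```

`HalfToFloat(0x1C04)` works out to (1 + 4/1024) / 256, roughly 0.0039215088, against an exact 1/255 of roughly 0.0039215686.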

hlide wrote:well i added two new opcodes in my psp-as but i didn't give the source to be committed in ps2dev.org. vuc2i.s is one. If TyRaNiD is okay to commit it, i will give the source.

Would be nice :)

StrontiumDog · Post by **StrontiumDog** » Fri Mar 16, 2007 7:35 am

StrmnNrmn wrote:
Code: Select all
static float InvSqrt(float x)
{
	float xhalf = 0.5f*x;
	int i = *(int*)&x;
	i = 0x5f3759df - (i>>1);
	x = *(float*)&i;
	x = x*(1.5f-xhalf*x*x);
	return x;
}
[snip]

Anyway, it looks like with -O2 and -O3 gcc is inlining this function 'incorrectly'.

No, as surprising as it may sound, GCC is inlining the code correctly; it is you who has made a mistake. (That sounds harsh, but it is a sad reality, read on.)

StrmnNrmn wrote: It looks like a pointer aliasing problem

That is precisely what it is.

C99 is a broken piece of crap language specification. I quote:

Here are the aliasing rules:

An object shall have its stored value accessed only by an lvalue
expression that has one of the following types:

- a type compatible with the effective type of the object,

- a qualified version of a type compatible with the effective type of
the object,

- a type that is the signed or unsigned type corresponding to the
effective type of the object,

- a type that is the signed or unsigned type corresponding to a
qualified version of the effective type of the object,

- an aggregate or union type that includes one of the aforementioned
types among its members (including, recursively, a member of a
subaggregate or contained union), or

- a character type.

So doing reasonable things like reinterpreting ints as floats and back again through pointer casts is not allowed. Moreover, the compiler is free to assume the two pointers do not point to the same thing and optimize accordingly.

I think this part of the C99 spec is BADLY BROKEN, and recent versions of GCC want to be C99 compliant. The only way to ensure your program won't break in random unpredictable ways from compile to compile, if you do lots of mixed type casting, is to use -fno-strict-aliasing.

Also note that code which breaks the strict aliasing rules may still happen to work at -O3; there is just no way to know in advance, and adding one line of code could change whether the optimizer breaks your code. At -O1 or below, strict aliasing optimization is not enabled.
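As an alternative to switching the optimisation off globally, the reinterpretation itself can be written so that it stays within the rules quoted above. A sketch using `memcpy`, which copies the object's bytes rather than reading it through an incompatible lvalue; the helper names are hypothetical and a 32-bit `float` is assumed. Compilers commonly turn such small fixed-size copies into plain register moves, though that is not guaranteed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: reinterpret a float's bits without pointer casts. */
static uint32_t FloatToBits(float f)
{
	uint32_t bits;
	memcpy(&bits, &f, sizeof bits);
	return bits;
}

static float BitsToFloat(uint32_t bits)
{
	float f;
	memcpy(&f, &bits, sizeof f);
	return f;
}
```

With these, the InvSqrt trick becomes `BitsToFloat(0x5f3759df - (FloatToBits(x) >> 1))` followed by the usual refinement step, and it behaves the same with or without -fno-strict-aliasing.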

Sad but true.

Strontium Dog

StrmnNrmn · Post by **StrmnNrmn** » Fri Mar 16, 2007 9:03 am

Thanks Raphael/hlide - some great suggestions from the both of you.

Raphael wrote:Good point there. For full single precision accuracy
Code: Select all
vfim.s S203, 255.0; vrcp.s S203, S203
would do the job better, but I think a half-float 1/255 should be good enough. It is effectively stored as 0.003921509 (sign.exponent.mantissa = 0.00111.0000000100b), so it's a precision loss of <0.002% and a full intensity input would translate to 0.9999848 as a single precision float. But it's his decision what precision he needs for the conversion.

This really doesn't need to be all that accurate - it's just for computing the diffuse colour for the vertex, so it doesn't really matter if I end up with R=240 or R=241 etc - you won't be able to tell the difference onscreen.

I've not tried the vuc2i.s load or separate world/project matrices yet (that's my next job :). I have converted all the vmul.q ops with the selection mask into vscl.q ops though and I'm seeing up to a 5% speedup in certain places (which is excellent for the amount of effort involved).

Thanks again guys!

StrmnNrmn · Post by **StrmnNrmn** » Fri Mar 16, 2007 9:06 am

StrontiumDog wrote:I think this part of the C99 spec is BADLY BROKEN, and recent versions of GCC want to be C99 compliant. The only way to ensure your program won't break in random unpredictable ways from compile to compile, if you do lots of mixed type casting, is to use -fno-strict-aliasing.

That's really useful information - thanks for digging that up. Rewriting InvSqrt seems to have fixed all my problems, but -fno-strict-aliasing is a good tip to remember for the future.

Thanks.