PSP assembly coding and compiler optimizations

plankton
Posts: 9
Joined: Tue Jan 10, 2006 11:03 pm
Location: USA
Contact:

PSP assembly coding and compiler optimizations

Post by plankton »

hi - i'm looking for some pointers on how to optimize code written for the PSP. while i am a fairly experienced programmer, i haven't done much work with gcc or any work with the MIPS core. specifically:

1. are there any gcc build switches i really should/shouldn't use? right now i'm using '-O3 -finline-functions-called-once -finline-functions -floop-optimize2'. this gives about a 50% increase in performance over not using any of these switches. most of the gains seem to come from -O3. what am i missing?
2. are there any examples of PSP programs that mix C and assembly? where would i find doc on the C run time model? any recommended tutorials on R4000 asm? i'd really like to port the linear interpolator to asm but i don't know how to progress down that path.
3. i've found that when i cross some unknown barrier in code/data size, performance degrades significantly. i'm guessing that there are some caching issues to work around. does anyone know how to lay out code & data in memory for optimal performance? are there fast memories where i should put frequently accessed buffers & modules? if so, what's the best way of getting the linker to place the modules there?
4. any general thoughts on fixed vs floating point? most of my code is floating point, but i do have a lot of fixed point experience so if porting down will greatly increase performance i'd be willing to give it a shot. also, is 32 bit fixed point the same performance as 16 bit?
5. what's the best place to read up about the hardware/software architecture of the PSP (memory map, latencies, etc)?

oh as for me i'm working on an audio synthesizer/sequencer ala fruityloops. based on my previous experience with these algorithms, i think i'm 2-3x slower than a fully optimized solution. nearly all the processing power goes into the audio synthesis code - it's a reasonable mixture of math and logical/branching operations. so the more you help me, the more channels of audio and crazy fx you'll have to play with when it's done. :)

thanks everyone!

ethan
jsgf
Posts: 254
Joined: Tue Jul 12, 2005 11:02 am
Contact:

Re: PSP assembly coding and compiler optimizations

Post by jsgf »

plankton wrote:1. are there any gcc build switches i really should/shouldn't use? right now i'm using '-O3 -finline-functions-called-once -finline-functions -floop-optimize2'. this gives about a 50% increase in performance over not using any of these switches. most of the gains seem to come from -O3. what am i missing?
You might want to try -Os instead. It does save on code size (= fewer icache misses), and is still pretty optimised. Try it and see. Also -fsingle-precision-constant, since doubles have no hardware support, and a floating-point literal without an f suffix has double type in C (which promotes the expression around it to double).
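To make the literal-promotion point concrete, here is a minimal sketch in portable C (nothing PSP-specific; all function names are made up for illustration). An unsuffixed constant such as 0.5 has type double, so a float multiplied by it is converted to double first; the f suffix keeps the arithmetic in single precision. As far as I understand it, -fsingle-precision-constant is meant to make the first form behave like the second without touching the source.

```c
#include <stddef.h>

/* Hypothetical gain function using a double literal: x is converted to
   double, multiplied, and the result converted back to float on return. */
float scale_double_literal(float x)
{
    return x * 0.5;
}

/* Same function with a float literal: the multiply stays in float. */
float scale_float_literal(float x)
{
    return x * 0.5f;
}

/* sizeof does not evaluate its operand; it only reports the size of the
   type the expression would have, which shows the promotion directly. */
size_t size_with_double_literal(float x)
{
    return sizeof(x * 0.5);   /* size of double */
}

size_t size_with_float_literal(float x)
{
    return sizeof(x * 0.5f);  /* size of float */
}
```

Both scale functions return the same value for simple inputs; the difference is only in which type the multiply is carried out in, and so which instructions (or software routines) the compiler has to emit.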
plankton wrote:2. are there any examples of PSP programs that mix C and assembly? where would i find doc on the C run time model? any recommended tutorials on R4000 asm? i'd really like to port the linear interpolator to asm but i don't know how to progress down that path.
Not too many, but you have a lot of options for your app. The most important one is that there's a whole second CPU dedicated to DSP stuff, which seems to have more DSP-oriented instruction extensions. But I don't really know anything about it. Have a look around here for "media engine" discussions.
plankton wrote:3. i've found that when i cross some unknown barrier in code/data size, performance degrades significantly. i'm guessing that there are some caching issues to work around. does anyone know how to lay out code & data in memory for optimal performance? are there fast memories where i should put frequently accessed buffers & modules? if so, what's the best way of getting the linker to place the modules there?
The wiki has a memory map description, but I don't see any reason why there would be a cliff beyond a certain code/data size, unless your cache misses are going way up.
plankton wrote:4. any general thoughts on fixed vs floating point? most of my code is floating point, but i do have a lot of fixed point experience so if porting down will greatly increase performance i'd be willing to give it a shot. also, is 32 bit fixed point the same performance as 16 bit?
Single precision FP seems pretty quick, and the VFPU extensions can do matrix ops very efficiently as well.
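For comparison, here is a rough sketch of the same table lookup with linear interpolation written once in single-precision float and once in 16.16 fixed point. The Q16 position format, the int16_t sample type and the function names are assumptions chosen for illustration, not taken from plankton's code. Which version is faster on the PSP would have to be measured.

```c
#include <stdint.h>

/* Float version. pos is a non-negative fractional read position; the
   caller must make sure table[(int)pos + 1] is a valid element. */
float lerp_float(const float *table, float pos)
{
    int   i    = (int)pos;        /* integer part */
    float frac = pos - (float)i;  /* fractional part in [0, 1) */
    return table[i] + frac * (table[i + 1] - table[i]);
}

/* Fixed-point version. pos is 16.16 fixed point (upper 16 bits index,
   lower 16 bits fraction); samples are signed 16-bit. */
int16_t lerp_q16(const int16_t *table, uint32_t pos)
{
    uint32_t i     = pos >> 16;                 /* integer part */
    int32_t  frac  = (int32_t)(pos & 0xFFFFu);  /* fraction, 0..65535 */
    int32_t  a     = table[i];
    int32_t  delta = (int32_t)table[i + 1] - a; /* needs up to 17 bits */

    /* delta * frac can exceed 32 bits, so widen to 64 before the shift.
       Right-shifting a negative value is implementation-defined in C;
       gcc sign-extends, which is what this sketch relies on. */
    int32_t step = (int32_t)(((int64_t)delta * frac) >> 16);
    return (int16_t)(a + step);
}
```

Note the 64-bit intermediate in the fixed-point version: with a full 16-bit fraction and 16-bit samples the product does not fit in 32 bits, so either the multiply is widened (as here) or the fraction is reduced to 15 bits.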
plankton wrote:5. what's the best place to read up about the hardware/software architecture of the PSP (memory map, latencies, etc)?
The wiki: http://wiki.ps2dev.org/
plankton wrote:oh as for me i'm working on an audio synthesizer/sequencer ala fruityloops. based on my previous experience with these algorithms, i think i'm 2-3x slower than a fully optimized solution. nearly all the processing power goes into the audio synthesis code - it's a reasonable mixture of math and logical/branching operations. so the more you help me, the more channels of audio and crazy fx you'll have to play with when it's done. :)
Cool!
plankton
Posts: 9
Joined: Tue Jan 10, 2006 11:03 pm
Location: USA
Contact:

Re: PSP assembly coding and compiler optimizations

Post by plankton »

jsgf wrote:You might want to try -Os instead. It does save on code size (= fewer icache misses), and is still pretty optimised. Try it and see. Also -fsingle-precision-constant, since doubles have no hardware support, and a floating-point literal without an f suffix has double type in C (which promotes the expression around it to double).
ah good points. i'm used to programming systems which have small pipelines and fast memory access. i'll try both of these out. i imagine that switching to SP floats will be huge. i'm also going to look into some general architectural changes to the code; there's got to be some low hanging fruit in there that i can implement in C.

oh are there any pragmas in gcc for branch prediction? are they working on the PSP?
jsgf wrote:Not too many, but you have a lot of options for your app. The most important one is that there's a whole second CPU dedicated to DSP stuff, which seems to have more DSP-oriented instruction extensions. But I don't really know anything about it. Have a look around here for "media engine" discussions.
hmmm, i think i checked this out in the past and didn't think it was too useful, but i will take another look.
jsgf wrote:The wiki has a memory map description, but I don't see any reason why there would be a cliff beyond a certain code/data size, unless your cache misses are going way up.
i'm pretty sure it's just a cache miss/data layout issue. however, it's annoying when everything works fine & you add in another struct and the GUI goes totally unresponsive. :\ i'm used to projects where you specify exactly where you want your code and data segments to lie, which can help a lot in situations like this.

and thanks for pointing out the wiki - i read it really early on and i don't think i realized the implications of everything i saw. i do a ton of mallocs; those will be migrating to memalign quite soon!
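A minimal sketch of what the malloc-to-memalign change could look like. The 64-byte cache line size is an assumption here and should be checked against the wiki, and alloc_audio_buffer / round_up_to_line are hypothetical helpers, not part of PSPSDK. memalign itself is declared in <malloc.h> on newlib and glibc.

```c
#include <malloc.h>   /* memalign */
#include <stdlib.h>   /* size_t, NULL */
#include <string.h>   /* memset */

#define CACHE_LINE 64  /* ASSUMED cache line size in bytes */

/* Round n up to the next multiple of CACHE_LINE (a power of two). */
size_t round_up_to_line(size_t n)
{
    return (n + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
}

/* Allocate a zeroed float buffer that starts on a cache line boundary
   and whose size is a whole number of lines, so the buffer never shares
   a line with an unrelated allocation. Release it with free(). */
float *alloc_audio_buffer(size_t nsamples)
{
    size_t bytes = round_up_to_line(nsamples * sizeof(float));
    float *buf = memalign(CACHE_LINE, bytes);
    if (buf != NULL)
        memset(buf, 0, bytes);
    return buf;
}
```

Padding the size as well as aligning the start is the part that is easy to forget: an aligned buffer whose tail shares a line with the next heap block can still cause trouble when that line is written back or invalidated.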

cheers!
jsgf
Posts: 254
Joined: Tue Jul 12, 2005 11:02 am
Contact:

Re: PSP assembly coding and compiler optimizations

Post by jsgf »

plankton wrote:oh are there any pragmas in gcc for branch prediction? are they working on the PSP?
There's __builtin_expect(). I just tried scattering some around in PSPGL, but I didn't get the results I was hoping for. I was hoping it would put the unlikely branches out of line to make sure the hot-path is in icache, but it didn't seem to do that. I haven't looked into it in detail yet.
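For reference, __builtin_expect is usually wrapped in a pair of macros like the ones below. This is a generic sketch, not PSPGL code: the likely/unlikely names, mix_sample, handle_clip and clip_count are all illustrative. Moving the rare path into a separate noinline function is one way to try to keep it out of the hot path, but whether a given gcc version actually lays the code out that way has to be checked in the generated assembly (gcc -S).

```c
/* Branch hints: the second argument is the value the expression is
   expected to have most of the time. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Counts how often the rare path was taken (illustrative only). */
int clip_count = 0;

/* Rare path, kept out of line so the caller's hot path stays small.
   Only called when v is outside the signed 16-bit range. */
__attribute__((noinline)) int handle_clip(int v)
{
    clip_count++;
    return v > 32767 ? 32767 : -32768;
}

/* Hot path: add two samples, clipping to 16 bits only when needed. */
int mix_sample(int a, int b)
{
    int sum = a + b;
    if (unlikely(sum > 32767 || sum < -32768))
        return handle_clip(sum);
    return sum;
}
```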
plankton wrote:hmmm, i think i checked this out in the past and didn't think it was too useful, but i will take another look.
If nothing else, it's still a whole other CPU for running MIPS instructions.
plankton wrote:i'm pretty sure it's just a cache miss/data layout issue. however, it's annoying when everything works fine & you add in another struct and the GUI goes totally unresponsive. :\ i'm used to projects where you specify exactly where you want your code and data segments to lie, which can help a lot in situations like this.
That sounds very strange. The PSP doesn't seem to have any major precipices like that. I wouldn't be surprised about a 10-20% decline, but it sounds like you're seeing much larger slowdowns. How big is your code/data?
TyRaNiD
Posts: 907
Joined: Sun Jan 18, 2004 12:23 am

Post by TyRaNiD »

It might be worth enabling the cpu profiler and taking a look at the numbers; that will tell you whether you are getting a high incidence of icache/dcache misses etc. in your code.
jsgf
Posts: 254
Joined: Tue Jul 12, 2005 11:02 am
Contact:

Post by jsgf »

Are there any docs/demos on what profiling facilities exist, and how to use them?

Edit: Hm, I see src/debug/profiler.c, but I was wondering if there's anything like gprof. I.e., a statistical sampler.
plankton
Posts: 9
Joined: Tue Jan 10, 2006 11:03 pm
Location: USA
Contact:

Re: PSP assembly coding and compiler optimizations

Post by plankton »

jsgf wrote:That sounds very strange. The PSP doesn't seem to have any major precipices like that. I wouldn't be surprised about a 10-20% decline, but it sounds like you're seeing much larger slowdowns. How big is your code/data?
not very big, only a couple hundred kbytes. i first ran into this when i tried to add in a ~25k structure; however i've also now seen it when i've added in small innocuous functions that aren't even executed. this lack of control over performance is quite disconcerting.
plankton
Posts: 9
Joined: Tue Jan 10, 2006 11:03 pm
Location: USA
Contact:

Re: PSP assembly coding and compiler optimizations

Post by plankton »

jsgf wrote:You might want to try -Os instead. It does save on code size (= fewer icache misses), and is still pretty optimised. Try it and see. Also -fsingle-precision-constant, since doubles have no hardware support, and a floating-point literal without an f suffix has double type in C (which promotes the expression around it to double).
ok i just tried these and it appears i got a small improvement from -Os but nothing from -fsingle-precision-constant. which is surprising because i do a lot of FP calculations in inner loops. ah well. going to look into the aligned memory allocations, see if that helps out more.
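When comparing build flags like this, it can help to time one inner loop in isolation rather than judging by overall responsiveness. Below is a small sketch using only standard C. render_block is a hypothetical stand-in for a synthesis loop, not plankton's code, and clock() is fairly coarse, so a finer platform timer may be preferable on the PSP.

```c
#include <time.h>

/* Hypothetical stand-in for an audio inner loop: fills out[0..n-1]
   and returns the sum of what it wrote. */
float render_block(float *out, int n, float gain)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        out[i] = (float)i * gain;
        acc += out[i];
    }
    return acc;
}

/* Run render_block reps times and return the CPU time used, in seconds.
   The checksum is handed back to the caller so the compiler cannot
   discard the work as dead code. */
double time_render(float *out, int n, int reps, float *checksum)
{
    float sum = 0.0f;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sum += render_block(out, n, 0.5f);
    clock_t end = clock();
    *checksum = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Building the same file with each set of flags and comparing the returned times gives a more controlled before/after number, as long as reps is large enough for the run to last well beyond the timer's resolution.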

if you have any other ideas, i'm happy to listen! :)