Anyone had problems with flickering triangles?

turkeyman · Post by **turkeyman** » Thu Jul 14, 2005 9:31 pm

Hey again,

Does anyone know if sceGuDrawArray() copies the array to the display list, or does it just reference the array, requiring you to manage, not overwrite, and double buffer the data in dynamic arrays?

I'm having loads of problems with flickering triangles and loads of other crap when i use dynamic vertex arrays..

I am quad buffering (just to be sure).. but it still behaves differently to when i render from a static buffer...

yes my array is aligned... (anticipated question) :P

ReKleSS · Post by **ReKleSS** » Thu Jul 14, 2005 10:12 pm

Firstly, sceGuDrawArray only references the array. Secondly, are you flushing the cache before drawing? sceKernelDcacheWritebackAll(); should do it.
-ReK
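
A toy model may help show why the writeback matters when the draw call only references the array. This is plain C that runs on any machine, not PSP code: `ToyMemory`, `cpu_write`, `gpu_read` and `writeback_all` are made-up names, and the model assumes a write-back data cache that the GPU cannot see.

```c
#include <assert.h>
#include <stddef.h>

enum { WORDS = 4 };

/* Toy memory system (an assumption for illustration): CPU writes land in a
   write-back cache, while the GPU reads RAM directly and never sees the cache. */
typedef struct {
    int ram[WORDS];
    int cache[WORDS];
    int dirty[WORDS];   /* 1 while cache[i] is newer than ram[i] */
} ToyMemory;

/* CPU store: only the cached copy changes. */
static void cpu_write(ToyMemory *m, size_t i, int value)
{
    assert(i < WORDS);
    m->cache[i] = value;
    m->dirty[i] = 1;
}

/* GPU fetch: reads RAM, so it may see stale data. */
static int gpu_read(const ToyMemory *m, size_t i)
{
    assert(i < WORDS);
    return m->ram[i];
}

/* Plays the role sceKernelDcacheWritebackAll() has in the post above:
   copy every dirty entry out to RAM. */
static void writeback_all(ToyMemory *m)
{
    for (size_t i = 0; i < WORDS; ++i) {
        if (m->dirty[i]) {
            m->ram[i] = m->cache[i];
            m->dirty[i] = 0;
        }
    }
}
```

In this model, a value written with `cpu_write` is invisible to `gpu_read` until `writeback_all` runs, no matter how much time passes in between. A real cache also writes lines out when they are evicted, but that timing is not something to rely on.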

turkeyman · Post by **turkeyman** » Fri Jul 15, 2005 12:09 am

why should i need to flush the cache?
surely sceGuDrawArray() doesn't render it immediately? ..
it just adds it to the display list and it gets kicked off when i sceGuFinish()??
in which case, the dcache should have LOADS of time between writing the data and rendering at the end of the frame for the cache to be flushed naturally... :/

although you might be on to something... the flickering gets less bad the longer i leave it running..

so whats the best way to manage a display list?
i imagine something like this:

program start:
init...

set display list buffer 1
begin frame
render some things (add to the display list)
end frame (display list begins rendering)
set display list buffer 2
begin frame
render some things (adding them to the second display list buffer)
sceGuSync() (allow gpu to finish rendering the previous frame if it hasn't already)
end frame (kick this frame's display list)
set display list buffer 1
begin frame
....
etc etc etc

that should allow asynchronous behaviour of CPU and GPU..
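
The frame loop outlined above can be sketched as a small host-side simulation. This is not libGu code: `Renderer`, `wait_for_gpu`, `kick` and `render_frame` are hypothetical stand-ins, and the "GPU" is just a field recording which list it currently owns. The point is only to check the invariant the scheme relies on, namely that the CPU never writes the list the GPU is still reading.

```c
#include <assert.h>
#include <stddef.h>

enum { LIST_WORDS = 256, NUM_LISTS = 2 };

/* Hypothetical stand-in for the renderer state. */
typedef struct {
    unsigned lists[NUM_LISTS][LIST_WORDS]; /* two display-list buffers */
    int gpu_busy_with;                     /* list the GPU is reading, or -1 */
    int frames_kicked;
} Renderer;

static void renderer_init(Renderer *r)
{
    r->gpu_busy_with = -1;
    r->frames_kicked = 0;
}

/* Stand-in for a blocking sync: once it returns, the GPU holds no list. */
static void wait_for_gpu(Renderer *r)
{
    r->gpu_busy_with = -1;
}

/* Stand-in for ending the frame and handing the list to the GPU. */
static void kick(Renderer *r, int list)
{
    assert(r->gpu_busy_with == -1);
    r->gpu_busy_with = list;
    r->frames_kicked++;
}

/* Build and submit one frame; returns the list index the CPU wrote to. */
static int render_frame(Renderer *r, unsigned frame_no)
{
    int list = (int)(frame_no % NUM_LISTS);

    /* The CPU must never write the list the GPU is still reading. */
    assert(list != r->gpu_busy_with);
    r->lists[list][0] = frame_no;  /* "render some things" */

    wait_for_gpu(r);               /* previous frame must finish before the kick */
    kick(r, list);
    return list;
}
```

Calling `render_frame` with frame numbers 0, 1, 2, ... alternates between the two lists, and the CPU builds frame N+1 while the GPU still owns the list for frame N. Since the draw call only references vertex arrays, any dynamic vertex data would presumably need the same alternation as the lists themselves.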

the odd thing is, in the cube demo for example, sceGuSync is called AFTER sceGuFinish() ..
this would not allow any asynchronous processing by the GPU..
seems strange..

perhaps i've got this all wrong..
someone want to clarify the way to properly manage display list buffers and the rendering pipeline?

turkeyman · Post by **turkeyman** » Fri Jul 15, 2005 12:20 am

interestingly, flushing the cache seems to have solved the problem..

i'm still not exactly sure why that should have affected it, since the GPU shouldn't access that memory for quite a long time which should give the cache plenty of time to flush....

another question for you all..
what is the PSP's clip space? exactly how does the PSP expect data post projection? and what's the story with the divide by w? is it just performed by the hardware after projection as usual?

also, are there any rules regarding filling the zbuffer?

i want to change the projection to be z into the screen (LH) and '0' being the nearest zbuffer value... i have tried it, but it doesn't seem to work.. any other gotchas i should be aware of?
(i have considered cull mode, zcmp function, and camera inversion)

sorry for all the questions :P
there's no real hard docs to refer to :)

crazyc · Post by **crazyc** » Fri Jul 15, 2005 2:48 am

turkeyman wrote:that should allow asynchronous behaviour of CPU and GPU..

http://svn.ps2dev.org/filedetails.php?r ... rev=0&sc=0

ector · Post by **ector** » Fri Jul 15, 2005 3:03 am

Well the way the GPU interface works is that Flush (IIRC, I call it Flush in my library which is not identical to libGu) sets the GPU "Stall" address, i.e. where it should stop reading data. So to get the GPU and CPU running asynchronously, the CPU writing display lists for the GPU to consume, you should call the flush function from time to time to let the GPU know that it can safely start reading the list. I don't know if libGu does this implicitly anywhere, I keep things like that explicit in my library.
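
The stall-address idea can be illustrated with a toy model in plain C. Nothing here is a real libGu or hardware interface: `ToyList`, `list_push`, `list_flush` and `gpu_run` are invented names, and the "GPU" is a loop that consumes commands up to the stall index and then stops.

```c
#include <assert.h>
#include <stddef.h>

enum { LIST_CAPACITY = 64 };

/* Toy display list shared by a producer (CPU) and a consumer (GPU). */
typedef struct {
    unsigned cmds[LIST_CAPACITY];
    size_t write;   /* CPU: next free slot */
    size_t stall;   /* GPU may execute cmds[read .. stall) and must stop there */
    size_t read;    /* GPU: next command to execute */
} ToyList;

/* CPU appends a command; the GPU cannot see it yet. */
static void list_push(ToyList *l, unsigned cmd)
{
    assert(l->write < LIST_CAPACITY);
    l->cmds[l->write++] = cmd;
}

/* Publish everything written so far by moving the stall index forward. */
static void list_flush(ToyList *l)
{
    l->stall = l->write;
}

/* GPU consumes commands until it reaches the stall index.
   Returns how many commands it executed. */
static size_t gpu_run(ToyList *l)
{
    size_t executed = 0;
    while (l->read < l->stall) {
        l->read++;
        executed++;
    }
    return executed;
}
```

In this model, commands pushed after the last `list_flush` stay invisible to `gpu_run`, which is why flushing from time to time lets the consumer start early instead of waiting for the whole list.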

It makes perfect sense to call Finish and then Sync because Finish will write an "End of display list" command to the list, and Sync will (presumably) set the stall address and then wait for the GPU to catch up and get to the End of display list command.

I'm not 100% sure if it's the fastest way, but the easiest way I've found to get good data to the Gu without having to writeback the entire cache is to "uncache" my pointers before writing vertex data, that is OR-ing them with 0x40000000. This will disable caching of writes through them and make them write directly to real RAM.
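
A minimal sketch of that pointer trick, assuming only what the paragraph above says (OR-ing 0x40000000 into the address gives an uncached view of the same memory). The helper names `to_uncached` and `to_cached` and the macro `UNCACHED_BIT` are hypothetical, not taken from any SDK header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed from the post: this address bit selects the uncached alias. */
#define UNCACHED_BIT ((uintptr_t)0x40000000u)

/* Return the uncached alias of a pointer. Only meaningful on the PSP;
   on any other machine the result must not be dereferenced. */
static void *to_uncached(void *cached)
{
    assert(cached != NULL);
    return (void *)((uintptr_t)cached | UNCACHED_BIT);
}

/* Strip the bit again, e.g. before handing the pointer back to free(). */
static void *to_cached(void *uncached)
{
    assert(uncached != NULL);
    return (void *)((uintptr_t)uncached & ~UNCACHED_BIT);
}
```

The idea would be to allocate the vertex buffer normally, convert the pointer once with `to_uncached`, and write vertices through the converted pointer from then on. If the buffer was touched through the cached pointer earlier, a one-time writeback before switching is probably still needed.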

By the way, if you want optimal performance with textures, swizzle them (if you are going to draw them at any angle other than 0) and put them in VRAM. The speed difference between linear textures in RAM and swizzled textures in VRAM seems to be orders of magnitude, something like 100x!! Swizzled textures in RAM appear to have acceptable performance, though you really should put your most common textures in VRAM if you have room.

Swizzled textures (all formats) are stored as blocks of 16 bytes X 8 rows. (so 32-bit texture data is swizzled into 4x8 blocks).
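
A sketch of what that layout could look like in code. The block size (16 bytes x 8 rows) comes from the paragraph above; the order in which blocks are stored (left to right, then top to bottom) is an assumption, and `swizzle` is a hypothetical helper rather than an SDK function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK_W = 16, BLOCK_H = 8 };  /* block size: bytes per row x rows */

/* Rearrange a linear image into 16-byte x 8-row blocks stored one after
   another. pitch is the length of one image row in bytes, so a 32-bit
   texture that is 64 pixels wide has pitch 256.
   Assumption: blocks run left to right, then top to bottom. */
static void swizzle(uint8_t *dst, const uint8_t *src, size_t pitch, size_t rows)
{
    assert(pitch % BLOCK_W == 0 && rows % BLOCK_H == 0);

    for (size_t by = 0; by < rows / BLOCK_H; ++by) {
        for (size_t bx = 0; bx < pitch / BLOCK_W; ++bx) {
            for (size_t r = 0; r < BLOCK_H; ++r) {
                /* 16 bytes of one row inside the current block */
                const uint8_t *line =
                    src + (by * BLOCK_H + r) * pitch + bx * BLOCK_W;
                memcpy(dst, line, BLOCK_W);
                dst += BLOCK_W;
            }
        }
    }
}
```

Because the function works on bytes rather than pixels, the same code covers every pixel format: with 32-bit texels each block holds 4x8 pixels, with 16-bit texels 8x8.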

By the way, no warranty is offered for the above info, which is the way I understand how things work which I've found out by my own experimentation and may be all wrong :)

chp · Post by **chp** » Fri Jul 15, 2005 4:45 am

crazyc wrote:
turkeyman wrote:that should allow asynchronous behaviour of CPU and GPU..

http://svn.ps2dev.org/filedetails.php?r ... rev=0&sc=0

Do note however that callbacks are not currently enabled within pspgu, as I just get crashes if I add them in sceGuInit(). I'll revisit this & a lot of other stuff when I get back to work on this.

chp · Post by **chp** » Fri Jul 15, 2005 5:03 am

ector wrote:I don't know if libGu does this implicitly anywhere, I keep things like that explicit in my library.

It does. More exactly, sceGuGetMemory(), sceGuCallList(), sceGuDrawArray(), sceGuDrawArrayN(), sceGuFinish() and sceGuSignal() all update the stall-address.

ector wrote:It makes perfect sense to call Finish and then Sync because Finish will write an "End of display list" command to the list, and Sync will (presumably) set the stall address and then wait for the GPU to catch up and get to the End of display list command.

Actually, sceGuFinish() sets the stall-address, sceGuSync() only waits for the list to finish (default behaviour, haven't explored the rest). One approach I have been pondering would be to have two lists running as some kind of ring-buffer and then listening to interrupts to know when the screen can be flipped (wiring it together with a vbl-interrupt), which could get rid of the sceGuSync() as long as the rendering isn't fast enough. There are more advanced approaches than this, but I think it could be a problem fitting it into the pspgu-approach. :)

ector wrote:I'm not 100% sure if it's the fastest way, but the easiest way I've found to get good data to the Gu without having to writeback the entire cache is to "uncache" my pointers before writing vertex data, that is OR-ing them with 0x40000000. This will disable caching of writes through them and make them write directly to real RAM.

If there are any "uncached accelerated" modes available, I'd pick those (like there is on the PS2, and it kicks ASS when it comes to filling a lot of data in a sequential array), but I'm not getting my hopes up. :)

ector wrote:Swizzled textures (all formats) are stored as blocks of 16 bytes X 8 rows. (so 32-bit texture data is swizzled into 4x8 blocks).

Nice, seems Sony has ditched the idea of having one swizzle-mode per pixel-size then... Good move.

turkeyman · Post by **turkeyman** » Fri Jul 15, 2005 10:59 am

chp wrote:
ector wrote:I'm not 100% sure if it's the fastest way, but the easiest way I've found to get good data to the Gu without having to writeback the entire cache is to "uncache" my pointers before writing vertex data, that is OR-ing them with 0x40000000. This will disable caching of writes through them and make them write directly to real RAM.
If there are any "uncached accelerated" modes available, I'd pick those (like there is on the PS2, and it kicks ASS when it comes to filling a lot of data in a sequential array), but I'm not getting my hopes up. :)

are you suggesting that 0x40000000 is not a valid cache bypass mode?
i'll use that for my immediate renderer, doing a dcache flush kinda sucks..

is there a way to insert immediate data directly into the main display list rather than having a separate immediate buffer and inserting calls into the main display list to reference it?

what's the story with double buffering display lists?
neither of you mentioned double buffering the display list itself.. are you using some kind of ring buffer technique? how are you stalling the CPU when it catches up to the GPU's display list pointer? or are you just syncing the display list every frame before the CPU starts writing to it?

ector · Post by **ector** » Fri Jul 15, 2005 11:32 am

0x40000000 is sure a valid cache bypass, the question is if it has write combiners so it would be as fast as writing normally and then flushing...

turkeyman · Post by **turkeyman** » Tue Jul 26, 2005 9:36 pm

ector wrote:0x40000000 is sure a valid cache bypass, the question is if it has write combiners so it would be as fast as writing normally and then flushing...

Do you know the cached and uncached bus width to memory? i can write combine manually if i need to, provided the uncached write is the same width as the cache...

What does the PS2 do? I'm pretty sure it has an 'uncached accelerated' mode yeah? Does that have write combining? and does it write more than 16 bytes out at a time? (PPro writes 32bytes at a time yeah?)

Even if 0x40000000 doesn't have write combining, if i'm writing out 16 bytes at a time, it's gotta be faster than trashing and then flushing the dcache every time i write out some vertex/texture data..