As part of my work on porting Mesa to the PSP, I've been playing with more efficient models for submitting commands to the PSP hardware.
The current libgu pretty much assumes you create a single large command buffer, you fill it up, submit it, and synchronously wait for it to complete.
This works well and is very simple, but it has the disadvantage of using a fairly large amount of memory (since a single buffer has to fit a whole frame's worth of commands), and it makes it hard to get useful parallelism between the 3D hardware and the CPU.
I've written a simple demo/prototype of a multiqueue system (http://goop.org/~jeremy/psp/multiqueue/), which keeps a ring of relatively small command queues (with associated vertex buffers). As commands are submitted, they are added to the current buffer until it is full. When it is full, it is submitted, and commands are added to the next buffer (if the next buffer has previously been submitted, then it waits until the hardware has finished with it).
To clarify: each buffer is a command buffer paired with a vertex buffer; it is submitted to hardware when either fills up. You can scale their relative sizes so that they get full at about the same time to make sure memory isn't wasted. I'm not using vertex indexes, but an index buffer could easily be added and handled in a similar way (though you'd need to do vertex and index allocations atomically with respect to each other). Nothing stops you from also using external vertex arrays for long-lived pieces of geometry.
This means that you can start submitting stuff to the hardware early (to get it working ASAP) while minimising the amount of time waiting for hardware. You can tune the buffer sizes and number of buffers to trade off wait time against memory use and latency. It also has constant memory use.
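The ring described above can be sketched roughly as follows. This is a minimal host-side model, not the code from the demo: `NUM_QUEUES`, `QUEUE_WORDS`, `hw_submit()` and `hw_wait()` are hypothetical names, the two `hw_*` functions are stand-ins for the real hardware interface, and the paired vertex buffer is left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_QUEUES  4    /* hypothetical tuning knobs */
#define QUEUE_WORDS 256

/* One slot in the ring: a small command buffer plus its bookkeeping. */
struct cmd_queue {
    uint32_t cmds[QUEUE_WORDS];
    size_t   used;       /* command words filled so far */
    int      in_flight;  /* submitted and not yet known to be finished */
};

struct queue_ring {
    struct cmd_queue q[NUM_QUEUES];
    unsigned cur;      /* index of the queue currently being filled */
    unsigned submits;  /* counters, handy when tuning the sizes */
    unsigned waits;
};

/* Stand-ins for the hardware: a real version would enqueue the list
 * and later block until the hardware has finished with it. */
static void hw_submit(struct cmd_queue *q) { q->in_flight = 1; }
static void hw_wait(struct cmd_queue *q)   { q->in_flight = 0; }

static void ring_init(struct queue_ring *r)
{
    memset(r, 0, sizeof *r);
}

/* Submit the current queue and advance to the next one, waiting only
 * if that next queue was submitted earlier and may still be in use. */
static void ring_flush(struct queue_ring *r)
{
    struct cmd_queue *q = &r->q[r->cur];

    if (q->used == 0)
        return;
    hw_submit(q);
    r->submits++;

    r->cur = (r->cur + 1) % NUM_QUEUES;
    q = &r->q[r->cur];
    if (q->in_flight) {
        hw_wait(q);
        r->waits++;
    }
    q->used = 0;
}

/* Append one command word, flushing first if the current queue is full. */
static void ring_emit(struct queue_ring *r, uint32_t cmd)
{
    if (r->q[r->cur].used == QUEUE_WORDS)
        ring_flush(r);
    struct cmd_queue *q = &r->q[r->cur];
    q->cmds[q->used++] = cmd;
}
```

An end-of-frame call to `ring_flush()` would push out whatever is left in the partially filled queue. Memory use stays fixed at `NUM_QUEUES * QUEUE_WORDS` words no matter how many commands a frame issues.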
One problem where the PSP differs from other hardware is that there doesn't seem to be a way to insert a "wait for vblank" or "swap buffers" command into the command stream, which means that the CPU has to sync with the GU at the end of each frame. It would be nice to keep pipelining across frame breaks (i.e., start working on the next frame rather than sleeping while waiting for VSYNC).
You could do something with multiple threads and/or polling to mitigate this, and logic dictates there should be a VSYNC callback registration function, like there is for all the other interrupt sources (an undiscovered entrypoint?).
One of the other differences from libgu is that I'm using cached memory for command and vertex buffers, and then explicitly flushing each one with sceKernelDcacheWritebackRange() before submitting them to hardware. This turns out to be noticeably faster because cached writes are more efficient than uncached ones (in general, uncached memory ops are a performance killer). I think it's easier to get correct too, because accidentally intermixing cached and uncached accesses to the same memory will cause really strange bugs.
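The ordering that matters here (fill through the cache, write back, then submit) can be sketched as below. The `submit_hooks` struct and the stub functions are hypothetical and exist only so the sketch runs off-target; on the PSP the `writeback` hook would presumably be a thin wrapper around sceKernelDcacheWritebackRange() and `enqueue` a wrapper around whatever call hands the list to the GE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks so the same code can be exercised off-target. */
struct submit_hooks {
    void (*writeback)(const void *addr, unsigned int size);
    void (*enqueue)(const void *list, unsigned int size);
};

/* Write back the command and vertex data from the data cache, and only
 * then let the hardware see the list. Submitting first would let the
 * hardware read stale memory. */
static void flush_and_submit(const struct submit_hooks *h,
                             const uint32_t *cmds, size_t ncmds,
                             const void *verts, size_t vbytes)
{
    unsigned int cbytes = (unsigned int)(ncmds * sizeof *cmds);

    h->writeback(cmds, cbytes);
    if (vbytes)
        h->writeback(verts, (unsigned int)vbytes);
    h->enqueue(cmds, cbytes);
}

/* Stubs that just record the call order: 'W' = writeback, 'E' = enqueue. */
static char   trace[8];
static size_t trace_len;

static void note(char c)
{
    if (trace_len < sizeof trace)
        trace[trace_len++] = c;
}

static void stub_writeback(const void *addr, unsigned int size)
{
    (void)addr; (void)size;
    note('W');
}

static void stub_enqueue(const void *list, unsigned int size)
{
    (void)list; (void)size;
    note('E');
}
```

Keeping every access to a buffer on the cached side and doing one writeback per submit is what avoids the mixed cached/uncached accesses mentioned above.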
rendering with multiple command queues
mrbrown wrote: Have you seen Holger Waechtler's pspgl that went into SVN last week? It's OpenGL ES with a few bits of immediate mode from OpenGL. You may want to help out there instead of duplicating his work.
I'd been tossing up whether to do the Mesa port or to start from scratch. Mesa seemed like the easier option, but pspgl seems to be pretty far advanced.
All this would apply equally well to pspgl.
Yes, libgu takes a rather naive approach to buffer management. Or rather, it doesn't try to manage anything: much like the rest of the system, it is up to you to manage everything to do with buffers. OpenGL and similar approaches hide the buffer actually used, which can be good in some ways and bad in others, but it's a different approach more or less.
Adding a callback when the buffer has been filled could be one approach to handle this. Should this give you enough flexibility when rendering and using multiple small queues?
GE Dominator
chp wrote: Yes, libgu does a rather naive approach to buffer-management. Or rather, it doesn't try to manage anything, much like the rest of the system, it is up to you to manage everything that comes to buffers. OpenGL and similar approaches hides the buffer actually used, which can be good in a way and bad in others, but it's a different approach more or less. Adding a callback when the buffer has been filled could be one approach to handle this. Should this give you enough flexibility when rendering and using multiple small queues?
Not quite. The buffer management code would also need to know when the command buffer is full, which means that all the uses of sendCommand would need to check for enough buffer space.
If you really want to abstract out buffer management from libgu, then the way to do it would be to define a formal buffer management api which allows different buffer managers to be plugged in. But it all seems like overkill to me; one good scheme is enough (maybe with a few app-specific tunable parameters).
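For illustration only, a pluggable buffer manager interface of the kind mentioned above might look something like this. Nothing here exists in libgu: `buf_mgr`, `single_mgr` and `emit()` are hypothetical names, and the one manager shown is the trivial "single fixed buffer" scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The interface a drawing library would code against. */
struct buf_mgr {
    /* Return room for 'words' command words, or NULL if none is available. */
    uint32_t *(*reserve)(struct buf_mgr *self, size_t words);
    /* Hand everything written so far to the hardware. */
    void (*kick)(struct buf_mgr *self);
};

/* Simplest possible manager: one fixed buffer that is never recycled. */
struct single_mgr {
    struct buf_mgr base;  /* first member, so a buf_mgr* can be cast back */
    uint32_t *buf;
    size_t cap, used, kicked;
};

static uint32_t *single_reserve(struct buf_mgr *self, size_t words)
{
    struct single_mgr *m = (struct single_mgr *)self;

    if (words > m->cap - m->used)
        return NULL;
    uint32_t *p = m->buf + m->used;
    m->used += words;
    return p;
}

static void single_kick(struct buf_mgr *self)
{
    struct single_mgr *m = (struct single_mgr *)self;
    m->kicked = m->used;  /* stand-in for a real hardware submit */
}

static void single_mgr_init(struct single_mgr *m, uint32_t *buf, size_t cap)
{
    m->base.reserve = single_reserve;
    m->base.kick = single_kick;
    m->buf = buf;
    m->cap = cap;
    m->used = 0;
    m->kicked = 0;
}

/* Client code only ever sees the interface. */
static int emit(struct buf_mgr *mgr, uint32_t cmd)
{
    uint32_t *p = mgr->reserve(mgr, 1);

    if (!p)
        return -1;
    *p = cmd;
    return 0;
}
```

A ring-of-queues manager would implement the same two functions and recycle buffers inside `reserve()`, which is roughly the extra indirection being weighed against "one good scheme" here.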
The thing which works well in my version is that vertex buffers are managed like command buffers. For lots of apps, this is very useful. For example, I think there's a memory leak in sceGuClear which results from not having a way to manage the lifetime of vertex buffers.
libgu also seems to try to manage buffers around the use of calls to other command buffers, but I don't quite understand what's going on there.
Re: rendering with multiple command queues
jsgf wrote: The current libgu pretty much assumes you create a single large command buffer, you fill it up, submit it, and synchronously wait for it to complete.
No, this is not how it works. Have you seen the function sceGeListUpdateStallAddr? This tells the Gu that it can begin drawing, up to this address. So if you just call this function periodically (sceGu does it for you somewhere), you will get parallelism.
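The stall-address idea can be modelled on the host with a toy consumer that is only allowed to read up to a moving limit. This is a simulation under stated assumptions, not the real GE interface: `sim_list`, `sim_enqueue()`, `sim_update_stall()` and `sim_run()` are hypothetical names, with `sim_update_stall()` playing the role that sceGeListUpdateStallAddr plays on the PSP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a list that the "hardware" consumes while it is still
 * being written. */
struct sim_list {
    const uint32_t *pc;     /* next command the hardware will read */
    const uint32_t *stall;  /* hardware must stop before this address */
};

/* Queue an empty list: nothing may be executed yet. */
static void sim_enqueue(struct sim_list *l, const uint32_t *list)
{
    l->pc = list;
    l->stall = list;
}

/* CPU side: publish how far the list is now valid. */
static void sim_update_stall(struct sim_list *l, const uint32_t *stall)
{
    l->stall = stall;
}

/* Hardware side: consume commands up to the stall address and report
 * how many were executed. */
static size_t sim_run(struct sim_list *l)
{
    size_t n = 0;

    while (l->pc < l->stall) {
        l->pc++;
        n++;
    }
    return n;
}
```

In this model the CPU keeps appending to one buffer and advances the stall address every so often, so the consumer can work on the commands already written while the rest are still being generated.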
http://www.dtek.chalmers.se/~tronic/PSPTexTool.zip Free texture converter for PSP with source. More to come.
jsgf wrote: Not quite. The buffer management code would also need to know when the command buffer is full, which means that all the uses of sendCommand would need to check for enough buffer space.
Well that's not so hard to fix, just add a parameter to sceGuStart with the size of the buffer, and then in the callback when the buffer has been filled, you just set a new buffer.
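A rough sketch of that suggestion, assuming a start call that takes a size plus a "buffer full" callback. None of these names are libgu functions: `gu_list`, `list_start()`, `list_send()`, `list_set_buffer()` and the ping-pong callback are hypothetical, and the callback here simply flips between two buffers where a real one would also submit the full buffer and wait for the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct gu_list;

/* Called when the list is full; expected to install a fresh buffer. */
typedef void (*list_full_cb)(struct gu_list *l, void *user);

struct gu_list {
    uint32_t    *buf;
    size_t       size, used;  /* capacity and fill level, in words */
    list_full_cb on_full;
    void        *user;
};

static void list_set_buffer(struct gu_list *l, uint32_t *buf, size_t size)
{
    l->buf = buf;
    l->size = size;
    l->used = 0;
}

/* Like starting a list, but with the buffer size and a full callback. */
static void list_start(struct gu_list *l, uint32_t *buf, size_t size,
                       list_full_cb cb, void *user)
{
    list_set_buffer(l, buf, size);
    l->on_full = cb;
    l->user = user;
}

/* The single place that checks for space before writing a command. */
static int list_send(struct gu_list *l, uint32_t cmd)
{
    if (l->used == l->size) {
        if (l->on_full)
            l->on_full(l, l->user);
        if (l->used == l->size)
            return -1;  /* the callback could not provide room */
    }
    l->buf[l->used++] = cmd;
    return 0;
}

/* Example callback: alternate between two small buffers. */
struct pingpong {
    uint32_t a[4], b[4];
    unsigned swaps;
};

static void pingpong_full(struct gu_list *l, void *user)
{
    struct pingpong *pp = user;
    uint32_t *next = (l->buf == pp->a) ? pp->b : pp->a;

    pp->swaps++;
    list_set_buffer(l, next, 4);
}
```

With the check living in one send function, callers do not each need to test for space, which is the concern raised in the quoted post.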
jsgf wrote: If you really want to abstract out buffer management from libgu, then the way to do it would be to define a formal buffer management api which allows different buffer managers to be plugged in. But it all seems like overkill to me; one good scheme is enough (maybe with a few app-specific tunable parameters). The thing which works well in my version is that vertex buffers are managed like command buffers. For lots of apps, this is very useful. For example, I think there's a memory leak in sceGuClear which results from not having a way to manage the lifetime of vertex buffers.
You mean the "memory-leak" of the amazing 24 bytes if you're not using the striped clear? Bad example imho. You are basing all of this on how the examples render, but all of them are based on the first code ever written using the GU, which means they might not use the most optimal approach. On a system like the PSP, as soon as you do dynamic vertex-buffers, you have "lost" anyway. You should keep it in static buffers as long as you can, just as on the PS2, and then you do not need any vertex buffer handling. Dynamic effects can usually be pulled off using skinning and/or morphing.
There are several commands that are not yet explored (sceGuBreak(), sceGuContinue(), sceGuSetCallback(), sceGuSignal()) (note that the callbacks are not enabled because of a crash when initializing). Also, adding too much code on a low level is a bad idea imho; better that you position a call further out that "pre-allocates" memory on the buffer or pushes a new buffer for rendering if it doesn't have enough room.
jsgf wrote: libgu also seems to try to manage buffers around the use of calls to other command buffers, but I don't quite understand what's going on there.
You have three ways of using sceGuStart(). One way is the normal way, where you let the GE render as you fill the buffer (so if the GE executes faster than you can fill the buffer, you are cpu-bound). The two others are for building lists that can be called from the main list using sceGuCallList(), and for lists that are to be pre-built and then use sceGuSendList() on it.
Well, you're free to do your own solution, which is great. :) Just breaking the API we have now too much is not a good thing imho.