GE Signals

ReJ · Post by **ReJ** » Wed Oct 12, 2005 8:57 pm

Hi,

http://rej.50megs.com/psp/signals
(back to psp coding) This experiment shows possible uses of GU signals, both for double buffering and for command-list calls.

EDIT: Sample fully using libgu API is here: http://rej.50megs.com/psp/libgu_signals/

GU signals can be used to implement double buffering when sending dynamic geometry to the GPU. Double buffering saves memory when large amounts of dynamic data are sent. The experiment renders 64K dynamic billboards per frame (25fps) at 24 bytes per vertex; a naive allocation would require 3MB.

At the start of each frame, 2 vertex buffers are allocated, filled with data and submitted to the command buffer. As soon as the GPU finishes rendering from one of the buffers, a signal is raised and the interrupt handler can reuse that buffer by refilling it with data and submitting it to the command buffer again. Meanwhile the GPU is busy rendering from the other buffer.

Such an approach can be useful for managing a texture cache in VRAM as well. A texture can be evicted as soon as the rendering job that uses it is done, without the CPU waiting for the GPU to finish the job.
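The recycling scheme described above can be sketched on the host in plain C, without any PSP SDK calls. This is only a simulation under stated assumptions: the "GPU" is a loop that consumes queued batches in order, and `on_signal` stands in for the real signal interrupt handler. All names here (`Recycler`, `fill_and_submit`, `on_signal`, `run_frame`) are hypothetical and not taken from the sample.

```c
#include <stddef.h>
#include <string.h>

#define NUM_BUFFERS 2      /* double buffering */
#define BATCH_BYTES 64     /* tiny stand-in for a real vertex buffer */
#define MAX_BATCHES 64     /* queue capacity of this sketch */

typedef struct {
    unsigned char buffers[NUM_BUFFERS][BATCH_BYTES];
    int total_batches;       /* batches the frame needs in total */
    int next_batch;          /* next batch index still to be filled */
    int submitted;           /* batches queued for the simulated GPU */
    int queue[MAX_BATCHES];  /* which buffer each queued batch reads from */
} Recycler;

/* Fill one buffer with the data of the next batch and queue it. */
static void fill_and_submit(Recycler *r, int buffer_index)
{
    if (r->next_batch >= r->total_batches)
        return;              /* nothing left to send this frame */
    memset(r->buffers[buffer_index], r->next_batch & 0xFF, BATCH_BYTES);
    r->queue[r->submitted++] = buffer_index;
    r->next_batch++;
}

/* Stand-in for the signal handler: the GPU is done with buffer_index,
   so it can be refilled and resubmitted right away. */
static void on_signal(Recycler *r, int buffer_index)
{
    fill_and_submit(r, buffer_index);
}

/* Runs one frame (total_batches <= MAX_BATCHES) and returns the number
   of batches the simulated GPU rendered. */
static int run_frame(Recycler *r, int total_batches)
{
    int rendered = 0;
    r->total_batches = total_batches;
    r->next_batch = 0;
    r->submitted = 0;

    /* Start of frame: fill and submit both buffers. */
    for (int i = 0; i < NUM_BUFFERS; i++)
        fill_and_submit(r, i);

    /* Simulated GPU: consume batches in order and raise the "signal"
       after each one so that its buffer gets recycled. */
    while (rendered < r->submitted) {
        int buffer_index = r->queue[rendered];
        rendered++;
        on_signal(r, buffer_index);
    }
    return rendered;
}
```

With 8 batches per frame the two buffers are reused in turn (0, 1, 0, 1, ...), so the frame needs 2 x BATCH_BYTES of buffer memory instead of 8 x BATCH_BYTES.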

Some notes on signals:
* User data passed as an argument to signal-listening callbacks must be defined as a global variable, otherwise the app will crash (presumably in sceGeSetCallback() ).

* In order to send a signal, code 14 followed by code 12 must be pushed into the command buffer (pushing only 14 doesn't raise a signal):

14  0E  SIGNAL      Raise Signal Interrupt
                             0-15: hi 16 bits of signal arg
                             16-23: signal id

12  0C  END         Stop execution (Finish Raise Signal Interrupt)
                             0-15: lo 16 bits of signal arg

* Following signal ids seem to work:

0x01..0x03          Custom signals (looking at the sceGuSignal, 0x03 has special path sending FINISH code after signal)
0x11                Call command list &#40;arg specifies address of command list to be executed&#41;
0x12                Return from command list

Sending signals with ids other than these so far only led to app hangs.

* Only signals in the range 0x01..0x03 are passed to the custom signal handler. The signal id is passed to the handler as the 1st argument. I don't know whether it is possible to pass the signal argument to the handler, or whether it is meant to be used with 0x11 only.

* sceGuCallMode(1) sets signals to be used when calling pregenerated command lists. Since these signals are 'invisible' to the custom signal handler, I wonder what the difference is between the CALL command and using the 0x11 signal instead.
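The notes on the SIGNAL/END pair and the signal ids can be sketched as a few encoding helpers. This assumes the usual GE command word layout (8-bit command code in the top byte, 24-bit argument below it). The helper names are hypothetical, and the id classification only reflects what was observed in this experiment, not a documented specification.

```c
#include <stdint.h>

#define GE_CMD_END    0x0Cu   /* 12: END */
#define GE_CMD_SIGNAL 0x0Eu   /* 14: SIGNAL */

/* SIGNAL word: command code, signal id in bits 16-23,
   high 16 bits of the signal argument in bits 0-15. */
static uint32_t ge_signal_word(uint8_t id, uint32_t arg)
{
    return (GE_CMD_SIGNAL << 24) | ((uint32_t)id << 16) | (arg >> 16);
}

/* END word: command code, low 16 bits of the argument in bits 0-15. */
static uint32_t ge_end_word(uint32_t arg)
{
    return (GE_CMD_END << 24) | (arg & 0xFFFFu);
}

/* Appends the SIGNAL + END pair to a command list and returns the
   advanced write pointer. SIGNAL alone did not raise a signal. */
static uint32_t *ge_push_signal(uint32_t *list, uint8_t id, uint32_t arg)
{
    *list++ = ge_signal_word(id, arg);
    *list++ = ge_end_word(arg);
    return list;
}

typedef enum { SIG_UNKNOWN, SIG_CUSTOM, SIG_CALL, SIG_RETURN } SignalKind;

/* Classification of signal ids as observed in the experiment. */
static SignalKind ge_signal_kind(uint8_t id)
{
    if (id >= 0x01 && id <= 0x03)
        return SIG_CUSTOM;    /* reaches the custom signal handler */
    if (id == 0x11)
        return SIG_CALL;      /* arg is the command list address */
    if (id == 0x12)
        return SIG_RETURN;    /* return from command list */
    return SIG_UNKNOWN;       /* only led to hangs so far */
}
```

For example, signal id 0x01 with argument 0x12345678 would encode as the words 0x0E011234 and 0x0C005678.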

PS: source code is based on chp's sprite sample.

holger · Post by **holger** » Wed Oct 12, 2005 11:22 pm

cool! this greatly simplifies buffer and VRAM management...
great job!

ReJ · Post by **ReJ** » Thu Oct 13, 2005 7:37 am

Couple bits about performance of this experiment:

* Number of geometry batches sent to GPU slightly affects performance:
4 batches approximately 39ms per frame
64 batches approximately 41ms per frame
Need more testing to detect if that is due to CPU or GPU.

* Performance of the double-buffered approach compared with a simple single buffer sent via one batch seems to be approximately the same (except the overhead mentioned before).

* Seems that performance in the current experiment is actually limited by the CPU memory access (code reads from template pregenerated torus geometry in order to place billboards). Several tests with different vertex sizes:

a) 12 bytes-per-vertex (vertex transform seems to be bottleneck here)
42fps
5.5Mvert/s
63MB/s data transfer

b) 16 bytes-per-vertex
36fps
4.7Mvert/s
72MB/s data transfer

c) 24 bytes-per-vertex (bottleneck: CPU memory reads)
25.5fps
3.34Mvert/s
75MB/s data transfer

However, by minimizing CPU memory access, or by just storing 0s to the vertex buffer (8-byte writes seem to be fastest, why?), a transfer rate of 87MB/s is achieved:

d) 24 bytes-per-vertex filled with 0s (or some simple runtime generated data)
29fps
3.8Mvert/s
87MB/s data transfer

So far 87MB/s, still quite far from the theoretical 150MB/s transfer rate when uploading to VRAM.
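The figures above can be cross-checked with a little arithmetic. The sketch below assumes 64K = 65536 billboards per frame and 2 vertices per billboard; the latter is an inference from the 3MB naive allocation figure (65536 x 2 x 24 bytes), not something stated explicitly. It also assumes the transfer rates are in megabytes of 2^20 bytes. Function names are hypothetical.

```c
#define BILLBOARDS          65536   /* 64K billboards per frame */
#define VERTS_PER_BILLBOARD 2       /* assumption, see text above */

/* Vertices processed per second at a given frame rate. */
static double verts_per_second(double fps)
{
    return fps * BILLBOARDS * VERTS_PER_BILLBOARD;
}

/* Vertex data written per second, in MB (2^20 bytes). */
static double mbytes_per_second(double fps, int bytes_per_vertex)
{
    return verts_per_second(fps) * bytes_per_vertex / (1024.0 * 1024.0);
}

/* Bytes needed to hold one whole frame of vertices at once. */
static long frame_bytes(int bytes_per_vertex)
{
    return (long)BILLBOARDS * VERTS_PER_BILLBOARD * bytes_per_vertex;
}
```

Under these assumptions cases a), b) and d) come out at exactly 63, 72 and 87 MB/s, and a full frame at 24 bytes per vertex is 3145728 bytes (3MB). Case c) comes out at 76.5 MB/s against the reported 75, which may just be rounding in the measured frame rate.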

ector · Post by **ector** » Thu Oct 13, 2005 11:09 am

Do you really gain much by putting display lists/command buffers in VRAM? Mine are in main memory and i'm getting decent perf..

jsgf · Post by **jsgf** » Thu Oct 13, 2005 11:31 am

ector wrote:Do you really gain much by putting display lists/command buffers in VRAM? Mine are in main memory and i'm getting decent perf..

I notice about 20% improvement or so for some loads; but it only matters if the program is really GE-bound. I guess it depends on whether the GE is using more memory bandwidth for fetching vertices or textures.

I think the sample program does actually put the vertex buffers in system memory; it uses sceGuGetMemory, which just returns chunks of the memory you originally passed it.

What I'm wondering is whether filling the vertex arrays with cached operations would help or not. The sample program uses all uncached memory for the vertex buffers. In general, I've found using uncached memory for write-only streams is more efficient, but that probably depends on getting good bursting/write gathering, which will probably require contiguous word writes.
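For reference, a small sketch of how cached and uncached views of the same memory relate on the PSP: the uncached mirror of an address is obtained by setting bit 0x40000000. The helpers below work on plain 32-bit address values so that they also run on a host machine; the names are hypothetical. When writing through the cached view instead, the data cache would have to be written back (for example with PSPSDK's sceKernelDcacheWritebackRange) before the GE reads the buffer.

```c
#include <stdint.h>

#define PSP_UNCACHED_BIT 0x40000000u

/* Uncached mirror of a PSP address. */
static uint32_t to_uncached(uint32_t addr)
{
    return addr | PSP_UNCACHED_BIT;
}

/* Cached view of a PSP address. */
static uint32_t to_cached(uint32_t addr)
{
    return addr & ~PSP_UNCACHED_BIT;
}

/* Non-zero if the address refers to the uncached mirror. */
static int is_uncached(uint32_t addr)
{
    return (addr & PSP_UNCACHED_BIT) != 0;
}
```

For example, a buffer in user memory at 0x08800000 would be written through 0x48800000 to bypass the cache.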

jsgf · Post by **jsgf** » Thu Oct 13, 2005 11:41 am

ReJ wrote:Couple bits about performance of this experiment:

* Number of geometry batches sent to GPU slightly affects performance:
4 batches approximately 39ms per frame
64 batches approximately 41ms per frame
Need more testing to detect if that is due to CPU or GPU.

* Performance of the double-buffered approach compared with a simple single buffer sent via one batch seems to be approximately the same (except the overhead mentioned before).

* Seems that performance in the current experiment is actually limited by the CPU memory access (code reads from template pregenerated torus geometry in order to place billboards). Several tests with different vertex sizes:

It seems you're using uncached pointers to system memory for the vertex array. I wonder what would happen if you used cached pointer and explicitly flush each batch before drawing; it might get better write bursts (but also increase cache pressure).

Have you tried seeing what happens if you multibuffer with multiple command queues? This is the approach PSPGL uses, and it seems OK, though I haven't really done any performance tests on it lately (it certainly isn't a bottleneck so far).

ReJ · Post by **ReJ** » Sun Oct 16, 2005 6:13 pm

ector wrote:Do you really gain much by putting display lists/command buffers in VRAM? Mine are in main memory and i'm getting decent perf..

Oops, my fault (made a wrong assumption about GuGetMemory() function without looking into the code), the sample actually uses uncached system memory for buffers.

The sample code isn't transfer bound - putting vertices into the video memory didn't help performance (and actually made it worse :) ).

I think that the main point of signals is not to improve rendering speed, but to be able to manage the system and video memory required for rendering, thus reducing memory consumption.

jsgf wrote: It seems you're using uncached pointers to system memory for the vertex array. I wonder what would happen if you used cached pointer and explicitly flush each batch before drawing; it might get better write bursts (but also increase cache pressure).

I've tried cached pointers with explicit flush - it is slower in this case. So far it seems that sequentially writing to uncached memory (storing in 8 byte chunks, dunno why, looks a bit strange to me) is an optimal way.

As well, I've tried allocating vertex buffers in video memory and writing using uncached pointers - this actually results in worse performance (approx 50% drop). I presume that I get no write bursts in this case.

Another scenario I've tried is allocating buffers both in system and video memory, transferring data from system memory to VRAM using the GuCopyImage() function and rendering from VRAM. That actually gave some speed improvement compared to buffers in video memory only, but it was still much slower than the original version. However, the main issue here is rendering artifacts. I suppose it might prove useful to pre-transfer vertex data to VRAM using GuCopyImage() (for example if the same geometry is used several times and the app is somewhat transfer bound).

jsgf wrote:Have you tried seeing what happens if you multibuffer with multiple command queues?

Not yet. I'll try that as soon as I have a little more free time.

PS: What is the peak main memory throughput on PSP? (could it be as low as ~100MB/s?)

jsgf · Post by **jsgf** » Mon Oct 17, 2005 5:52 pm

ReJ wrote:I've tried cached pointers with explicit flush - it is slower in this case. So far it seems that sequentially writing to uncached memory (storing in 8 byte chunks, dunno why, looks a bit strange to me) is an optimal way.

That could be the depth of the writebuffers. 2 words seems like a reasonable depth. By "8 byte chunks", do you mean two back-to-back 32-bit writes?

ReJ wrote:As well, I've tried allocating vertex buffers in video memory and writing using uncached pointers - this actually results in worse performance (approx 50% drop). I presume that I get no write bursts in this case.

Presumably that only applies to streaming vertex data. If you copied it there once and rendered from it multiple times, I'm assuming it would be better to use local memory...

ReJ wrote:Another scenario I've tried is allocating buffers both in system and video memory, transferring data from system memory to VRAM using the GuCopyImage() function and rendering from VRAM. That actually gave some speed improvement compared to buffers in video memory only, but it was still much slower than the original version. However, the main issue here is rendering artifacts.

You need to issue a sceGuTexSync before using the results of a CopyImage; it seems the copying is async with respect to other command execution.

ReJ wrote:I suppose it might prove useful to pre-transfer vertex data to VRAM using GuCopyImage() (for example if the same geometry is used several times and the app is somewhat transfer bound).

Or the CPU needs its memory bandwidth for other things...

ReJ wrote:PS: What is the peak main memory throughput on PSP? (could it be as low as ~100MB/s?)

That seems a bit low. I'd expect around 500Mbyte/s, but I haven't measured it. It might be that the VFPU gets better memory bandwidth than the main CPU.

ReJ · Post by **ReJ** » Mon Oct 17, 2005 6:37 pm

jsgf wrote:By "8 byte chunks", do you mean two back-to-back 32-bit writes?

Doing crazy thing :)
/* two adjacent 32-bit floats, written together as one 8-byte chunk */
struct dfloat {
    float hi;
    float lo;
};
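A minimal sketch of how such a struct could be used to fill a buffer in 8-byte steps. The function name `fill_zero_pairs` is hypothetical, and whether the compiler emits one 64-bit store or two back-to-back 32-bit stores per element is not something this code guarantees; that depends on the compiler and target.

```c
#include <stddef.h>

typedef struct {
    float hi;
    float lo;
} dfloat;

/* Zero-fills `count` 8-byte elements sequentially and returns the
   number of bytes written. */
static size_t fill_zero_pairs(dfloat *dst, size_t count)
{
    const dfloat zero = { 0.0f, 0.0f };
    for (size_t i = 0; i < count; i++)
        dst[i] = zero;   /* one struct assignment per 8 bytes */
    return count * sizeof(dfloat);
}
```

On the PSP, `dst` would point at the (uncached) vertex buffer; a 24-byte vertex then takes three such stores.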

Btw, filling VRAM with 0s sequentially gives me a 144MB/s throughput, while filling uncached system memory gives 135MB/s. There is some additional overhead in the test, so the actual figures should be a little higher.

jsgf wrote:You need to issue a sceGuTexSync before using the results of a CopyImage; it seems the copying is async with respect to other command execution.

Yeah, doing that. It seems to be async - that's exactly why I'm trying to use it.

jsgf · Post by **jsgf** » Mon Oct 17, 2005 8:15 pm

ReJ wrote:Yeah, doing that. It seems to be async - that's exactly why I'm trying to use it.

Well, I wonder if TexSync is exactly that: it syncs texture accesses with respect to CopyImage, but if you try to use the results as vertex data, no sync happens (or happens too late).