Ok, the time has come to look at vertex performance, and as usual I have a few interesting numbers and a sample that lets you experiment for yourself.
First up: DO NOT USE INDEX BUFFERS. They are the succubus of performance for your applications. It has been suspected, and people have mentioned on the forum, that they got a lot more speed by not using index buffers. Well, here are the cold, hard facts: if your vertices are 12 bytes in size (the optimal size) and the vertex buffer is in VRAM, your performance drops from 13.67 million vertices per second to 6.31(!) million vertices per second. That's a 53% performance drop! And it gets even worse if you use system RAM: then it drops to 3.16(!!!) million vertices per second. That's 76 per cent, and a real system killer. Why Sony added index buffers is a complete mystery, since they really suck.
Now, after this public announcement, here are a few numbers:
From the test I have running, kicking as many batches containing 1536 vertices as possible, the maximum T&L that the PSP can push is 13.67 million vertices per second. This is nowhere close to the 35 million vertices per second that they have stated before, so these numbers aren't final (I will research some more to make sure I haven't broken something). This kind of performance is reached when the vertex is around 8-12 bytes in size, and it seems to be very memory-sensitive: when you grow beyond that size, performance starts dropping rapidly, ending at 0.48mv/s (544 bytes with full skinning & morphing).
Raw numbers for transform (vram-numbers in parentheses):
4 bytes: 13.03mv/s (13.67mv/s)
8 bytes: 13.67mv/s (13.67mv/s)
10 bytes: 13.67mv/s (13.67mv/s)
12 bytes: 13.67mv/s (13.67mv/s)
16 bytes: 13.62mv/s (13.67mv/s)
20 bytes: 11.76mv/s (13.67mv/s)
24 bytes: 11.76mv/s (12.81mv/s)
28 bytes: 9.81mv/s (11.37mv/s)
32 bytes: 7.36mv/s (10.25mv/s)
36 bytes: 6.54mv/s (9.31mv/s)
Lighting, skinning, and morphing were all disabled for these numbers.
Skinning & Morphing
These operations are really power-hungry and should be used with care. The numbers below are for optimal vertices with only weights or morphs added:
Skinning
Disabled: 13.67mv/s
2 weights: 6.42mv/s
3 weights: 4.55mv/s
4 weights: 3.53mv/s
5 weights: 2.88mv/s
6 weights: 2.43mv/s
7 weights: 2.10mv/s
8 weights: 2.10mv/s
Morphing
1 vertex (disabled): 13.67mv/s
2 vertices: 9.80mv/s
3 vertices: 6.55mv/s
4 vertices: 4.92mv/s
5 vertices: 3.93mv/s
6 vertices: 3.28mv/s
7 vertices: 2.81mv/s
8 vertices: 2.46mv/s
Combining both skinning & morphing gives 0.58mv/s and a vertex-size of 352 bytes (not really usable).
The sample I used for these values is available at gu/vertex/vertex.c. Hack away! I'm going to add a few real-world examples to this sample, but it shouldn't be too hard to do it yourself if you want to test your own code.
Please note that these values are raw performance values, and no pixels have been rendered. Your application will not see the same benefits, but this may act as a guide.
Vertex Performance (Revisiting GE)
Re: Vertex Performance (Revisiting GE)
chp wrote: Well, here are the cold, hard facts: If your vertices are 12 bytes in size (optimal size) and vertex buffer in vram, your performance drops from 13.67 million vertices per second to 6.31(!) million vertices per second. That's a 53% performance drop! And it gets even worse if you use system ram, then it drops to 3.16(!!!) million vertices per second. That's 76 per cent, and a real system killer. Why Sony added index-buffers is a complete mystery, since they really suck.
Yeah, that's really bad. What order were you reading vertices in? Does linear versus random make much difference? I expect it would.
Also, what primitive were you using for this? I'm wondering whether strip vs fan vs independent triangles vs points makes a difference. I wonder if there's any evidence of a transform cache (i.e., independent triangles presented in strip order get better performance than completely independent tris).
chp wrote: From the test I have running, kicking as many batches containing 1536 vertices as possible, the maximum T&L that the PSP can push is 13.67 million vertices per second. This is nowhere close to the 35 million vertices per second that they have stated before, so these numbers aren't final (will research some more to make sure I haven't broken something).
I suspect vertices generated by the subdivision operators are dealt with much more quickly than explicitly specified ones; I think you'll find that a bezier patch will approach 35Mvert/s.
Re: Vertex Performance (Revisiting GE)
jsgf wrote: [chp wrote: ... Why Sony added index-buffers is a complete mystery, since they really suck.] Yeah, that's really bad. What order were you reading vertices in? Does linear versus random make much difference? I expect it would.
This was linear access to vertices, which should have shown maximum performance if there's any startup-cost for starting to read from memory.
jsgf wrote: Also, what primitive were you using for this? I'm wondering whether strip vs fan vs independent triangles vs points makes a difference. I wonder if there's any evidence of a transform cache (i.e., independent triangles presented in strip order get better performance than completely independent tris).
I used simple points, to avoid any possible issues with primitive assembly. Cache sizes tested were 4, 8, 16 and 32, and I saw no sign of improvement from running with a completely linear index buffer. It seems there's no transform cache, or it works in a way that we haven't figured out yet.
jsgf wrote: I suspect vertices generated by the subdivision operators are dealt with much more quickly than explicitly specified ones; I think you'll find that a bezier patch will approach 35Mvert/s.
Yes, that might be true. Should probably do some benchmarking on those kinds of primitives too.
GE Dominator
Again, interesting tests, chp!
I looked at the test code, and it is hard to see what could be optimized to reach the "Graphics sub-system running at 166 MHz on a 512-bit bus with 2 MB of DRAM, rendering [...] 35 million polygons per second" (sic). (What's a polygon? 3 vertices?)
Or maybe, as on the PS2, the PSP has different data paths with different priorities to the GE... The scratchpad could also help (as it really helps keep the "bad" path 3 from being that ridiculous)...
Yep, other primitives could be interesting too... 16-bit draw buffer... Or maybe they bypass the transform engine :D and go directly to the rasterizer... It's quite difficult to see how to double the number of vertices...
By the way, good work! Very interesting...
- TiTAN Art Division -
http://www.titandemo.org
Bypassing the transform pipe doesn't help; it even lowers performance to just below 13mv/s. Also, running 2D vertices from VRAM is a really bad idea: performance drops to 5mv/s for some reason.
Using an ortho-projection would really be a better way than blitting in 2D, since it seems to give better vertex-performance, and it gives access to all fun things like transforms, texture scaling, etc.
GE Dominator
Re: Vertex Performance (Revisiting GE)
Nice benchmarks. You said you've used pointsprites; could you try (backfacing) triangle strips too? Maybe pointsprites generate some overhead.
I cannot try it, 'cause I have no PSP yet.
chp wrote: Why Sony added index-buffers is a complete mystery, since they really suck.
I guess indices are used to accelerate expensive vertices; if you enable all lights and all the fancy stuff with really big vertices, indexed drawing could be a lot faster... maybe you could benchmark it.