First of all, refilling the texture page behaves quite differently from what I had assumed before. By rewriting large parts of the blit sample I have managed to take more precise measurements than the first ones I did 6 months ago (on which I based a lot of my conclusions). Refilling seems to be optimal with 32-pixel-wide strips, not 64. This applies to all pixel formats. Someone tested and mentioned this on the forums recently, and apologies to that someone (can't remember who it was): you were RIGHT. :)
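To make the strip idea concrete, here is a minimal sketch of how a blit could be cut into vertical strips of a given width. The `Strip` type and `make_strips` helper are hypothetical names made up for this illustration; they are not taken from the blit sample.

```c
#include <stddef.h>

/* Hypothetical helper type: one vertical strip of a wider blit. */
typedef struct {
    int x;      /* left edge of the strip, in pixels */
    int width;  /* strip width, in pixels (the last strip may be narrower) */
} Strip;

/*
 * Splits the range [0, total_width) into strips of at most strip_width
 * pixels and writes them to out. Returns the number of strips written,
 * which never exceeds max_out.
 */
static size_t make_strips(int total_width, int strip_width,
                          Strip *out, size_t max_out)
{
    size_t n = 0;

    if (total_width <= 0 || strip_width <= 0)
        return 0;

    for (int x = 0; x < total_width && n < max_out; x += strip_width) {
        int w = total_width - x;   /* pixels left to cover */
        if (w > strip_width)
            w = strip_width;
        out[n].x = x;
        out[n].width = w;
        ++n;
    }
    return n;
}
```

With a 480-pixel-wide blit and a strip width of 32 this gives 15 strips. Each strip would then be drawn as its own sprite with matching texture coordinates (in PSPSDK terms, presumably one `GU_SPRITES` vertex pair per strip passed to `sceGuDrawArray`), but that part depends on how your own blit code is set up.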
The maximum fillrate varies quite a bit depending on the pixel format selected. In 16-bit it seems to level out at 523MB/s or 2100 frames per second (35 fullscreen passes per frame), but 32-bit manages to push 716MB/s worth of pixels, or 1438 frames per second (24 fullscreen passes per frame). Enabling or disabling depth writes/reads does nothing to these results, which implies that the work is performed anyway and only the results are discarded. Also, changing the strip width when filling doesn't seem to do much either, which means that the PSP either has a smarter page write, or I'm missing something. :) This test was done with textures disabled, because with texturing you won't ever reach these numbers (more on that later).
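For anyone who wants to check the arithmetic, the conversion between MB/s, frames per second and passes per frame can be sketched as below. The assumptions are mine and are not spelled out above: a 480x272 screen, 1 MB = 1024*1024 bytes, and a 60 Hz refresh. The function names are made up for this sketch.

```c
#include <stdint.h>

/* Assumptions for this sketch: 480x272 screen, 60 Hz refresh,
 * and 1 MB counted as 1024*1024 bytes. */
enum { SCREEN_W = 480, SCREEN_H = 272, REFRESH_HZ = 60 };

/* Bytes written when filling the whole screen once at the given depth. */
static uint32_t frame_bytes(uint32_t bits_per_pixel)
{
    return (uint32_t)SCREEN_W * (uint32_t)SCREEN_H * bits_per_pixel / 8u;
}

/* Fullscreen fills per second for a throughput given in MB/s. */
static double fills_per_second(double mb_per_s, uint32_t bits_per_pixel)
{
    return mb_per_s * 1024.0 * 1024.0 / (double)frame_bytes(bits_per_pixel);
}

/* Fullscreen passes that fit into one displayed frame at REFRESH_HZ. */
static double passes_per_frame(double mb_per_s, uint32_t bits_per_pixel)
{
    return fills_per_second(mb_per_s, bits_per_pixel) / (double)REFRESH_HZ;
}
```

Under those assumptions, 523MB/s at 16 bits per pixel comes out at roughly 2100 fills per second, and 716MB/s at 32 bits per pixel at roughly 1438, which lines up with the figures quoted above.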
Another surprise (for me at least) is that with 16-bit or 32-bit textures, VRAM->VRAM blits are FASTER with linear textures than with swizzled textures. This means that render targets carry no speed penalty if dealt with correctly. Swizzling does have an impact with 4- and 8-bit CLUT though, but when you use these formats and copy optimally, you are not limited by the texture cache. When reading from system RAM, swizzling is a must.
Texture reads seem to peak at around 289MB/s, but unless you use 32-bit RGBA you'll never reach these speeds, since pixel-shading limitations seem to kick in once your memory reads get really optimal. This means that texturing cuts pixel throughput by more than half, so if you can, you should avoid it (you can do a lot with non-textured geometry, especially on lower LODs or distant geometry).
Doing no optimizations at all (no striping, no swizzling, system RAM) gives you a maximum speed of 7MB/s, as figured out earlier. DO NOT DO THIS! Swizzling will boost it to 30MB/s, but seriously, ignoring the texture cache is not an option.
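If you have not swizzled a texture before, here is a deliberately slow but readable sketch of the reordering. It assumes the layout commonly described for PSP swizzled textures, blocks of 16 bytes by 8 rows stored one after another; that layout is not stated in this post, so treat it as an assumption and check it against the SDK samples. The function name `swizzle` is just for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Reorders a linear texture into 16-byte x 8-row blocks (assumed layout).
 * width_bytes is the row size in bytes (pixel width * bytes per pixel) and
 * must be a multiple of 16; height must be a multiple of 8.
 * Returns 0 on success, -1 if the dimensions do not fit the block size.
 */
static int swizzle(uint8_t *dst, const uint8_t *src,
                   size_t width_bytes, size_t height)
{
    if (width_bytes % 16 != 0 || height % 8 != 0)
        return -1;

    size_t blocks_per_row = width_bytes / 16;

    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width_bytes; ++x) {
            /* Which 16x8 block this byte falls into. */
            size_t block = (y / 8) * blocks_per_row + (x / 16);
            /* Offset of the byte inside that 128-byte block. */
            size_t within = (y % 8) * 16 + (x % 16);
            dst[block * 128 + within] = src[y * width_bytes + x];
        }
    }
    return 0;
}
```

A real implementation would copy 16 bytes at a time instead of going byte by byte, and this only needs to run once per static texture, at load time or offline.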
These tests were done on a 32-bit drawbuffer, with a strip-width of 32:
4-bit CLUT
- VRAM->VRAM:
  - Linear: 80MB/s (FPS: 1295)
  - Swizzled: 97MB/s (FPS: 1570)
- RAM->VRAM:
  - Linear: 48MB/s (FPS: 779)
  - Swizzled: 83MB/s (FPS: 1337)
8-bit CLUT
- VRAM->VRAM:
  - Linear: 116MB/s (FPS: 938)
  - Swizzled: 155MB/s (FPS: 1247)
- RAM->VRAM:
  - Linear: 54MB/s (FPS: 442)
  - Swizzled: 128MB/s (FPS: 1034)
16-bit
- VRAM->VRAM:
  - Linear: 233MB/s (FPS: 940)
  - Swizzled: 220MB/s (FPS: 885)
- RAM->VRAM:
  - Linear: 54MB/s (FPS: 217)
  - Swizzled: 167MB/s (FPS: 672)
32-bit
- VRAM->VRAM:
  - Linear: 290MB/s (FPS: 582)
  - Swizzled: 283MB/s (FPS: 569)
- RAM->VRAM:
  - Linear: 57MB/s (FPS: 115)
  - Swizzled: 204MB/s (FPS: 410)
From these numbers, swizzled 4/8-bit CLUT textures are the way to go for static textures, and linear 16/32-bit textures for anything used as a render target. I have not done any tests on compressed textures yet, but the sample has been committed to SVN (samples/gu/speed), and you are welcome to extend it (or find errors in my conclusions :D).
Enjoy!