What is the fastest way to copy from VRAM to VRAM?

chiwaw
Posts: 15
Joined: Sun Jul 24, 2005 7:12 am

Post by chiwaw »

I need to copy blocks of 32 bytes (16 pixels at 2 bytes each) and was wondering what the fastest way to do that is.

The best approach I've found is to use struct pointers like this:

typedef struct
{
    u16 val00;
    u16 val01;
    u16 val02;
    u16 val03;
    u16 val04;
    u16 val05;
    u16 val06;
    u16 val07;
    u16 val08;
    u16 val09;
    u16 val10;
    u16 val11;
    u16 val12;
    u16 val13;
    u16 val14;
    u16 val15;
} FAST_COPY_32_BYTES;

Then I copy through pointers:

FAST_COPY_32_BYTES *fast_trg = (FAST_COPY_32_BYTES *)TARGET_ADDRESS;
FAST_COPY_32_BYTES *fast_src = (FAST_COPY_32_BYTES *)SOURCE_ADDRESS;

/* One struct assignment copies all 32 bytes */
*fast_trg = *fast_src;

That gives decent performance, but not as much as I'd hoped. I tried a struct with half as many fields, all u32, but got weird results. Can VRAM only be accessed in 16-bit chunks?

Does anyone know of a faster way?
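For anyone who wants to try the struct-assignment copy above on a desktop compiler first, here is a self-contained sketch. The `u16` typedef stands in for the PSPSDK type, the array member replaces the sixteen named fields (same 32-byte layout), and `copy_block_u16` is a hypothetical helper name, not part of any SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Desktop stand-in for the PSPSDK type (assumption: on the PSP, u16 comes
   from <psptypes.h>). */
typedef uint16_t u16;

/* Same layout as the 16-field struct: 16 pixels x 2 bytes = 32 bytes. */
typedef struct
{
    u16 val[16];
} FAST_COPY_32_BYTES;

static_assert(sizeof(FAST_COPY_32_BYTES) == 32, "block must be 32 bytes");

/* Hypothetical helper: copy one 32-byte block with a single struct
   assignment. Both pointers must be at least 2-byte aligned. */
static void copy_block_u16(void *dst, const void *src)
{
    FAST_COPY_32_BYTES *fast_trg = (FAST_COPY_32_BYTES *)dst;
    const FAST_COPY_32_BYTES *fast_src = (const FAST_COPY_32_BYTES *)src;

    /* The compiler decides which load/store instructions to emit here. */
    *fast_trg = *fast_src;
}
```

The struct assignment leaves the choice of instructions to the compiler, which is why the same source can produce different code on ARM and on the PSP's MIPS CPU.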
chiwaw
Posts: 15
Joined: Sun Jul 24, 2005 7:12 am

Post by chiwaw »

A related question: how can I use the GE to copy blocks of bytes from VRAM to VRAM?
ector
Posts: 195
Joined: Thu May 12, 2005 10:22 pm

Post by ector »

memcpy() should beat that silly struct method of yours.

To copy stuff insanely quickly inside VRAM, set up a render target at your destination, bind your source as a texture, and draw a quad.
chiwaw
Posts: 15
Joined: Sun Jul 24, 2005 7:12 am

Post by chiwaw »

Thanks for the quick reply!
ector wrote: memcpy() should beat that silly struct method of yours.
I don't know about the PSP CPU, but I come from GBA programming, and on ARM32 this struct method gives us better binary code, saving some cycles, than a looping memcpy().

I just don't know whether the compiled output is as efficient on the PSP CPU (I would think so).
ector wrote: To copy stuff insanely quickly inside vram, set up a render target at your destination, your source as texture, and draw a quad.
Where can I find sample code for that? I have never used 3D quads to draw 2D (as I said, coming from a GBA background, I'm not very experienced with anything 3D).

Is there an easy function I can call with a source pointer and a target pointer (both in VRAM) to copy X bytes, letting the GPU do the work instead of the main CPU?
chiwaw
Posts: 15
Joined: Sun Jul 24, 2005 7:12 am

Post by chiwaw »

Sorry for the double post, but I just got it working using a struct of u32s, and it turns out to be twice as fast as the u16 struct OR memcpy(). That's simply because, when copying 16 bits at a time, the 32-bit CPU has to zero-fill the remaining 16 bits of each register, whereas transferring 32-bit blocks wastes nothing (it also takes half as many loads and stores: 8 instead of 16 per 32-byte block).

Surely I'd be better off using the GPU pipeline, but until I figure that out, this u32 struct should be fine.

typedef struct
{
    u32 val00;
    u32 val01;
    u32 val02;
    u32 val03;
    u32 val04;
    u32 val05;
    u32 val06;
    u32 val07;
} FAST_COPY_32;

Only problem though: using it on a non-aligned address causes the PSP to hang... ugh.
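The hang is consistent with the PSP's MIPS-based CPU refusing 32-bit loads and stores at addresses that aren't multiples of 4. A minimal guard, assuming both buffers are valid for 32 bytes, is to take the u32 struct path only when both pointers are word-aligned and fall back to memcpy() otherwise. `copy_32_bytes_safe` and `is_word_aligned` are hypothetical names, and the `u32` typedef stands in for the PSPSDK type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Desktop stand-in for the PSPSDK type (assumption). */
typedef uint32_t u32;

/* Same layout as FAST_COPY_32: eight 32-bit words = 32 bytes. */
typedef struct
{
    u32 val[8];
} FAST_COPY_32;

static_assert(sizeof(FAST_COPY_32) == 32, "block must be 32 bytes");

/* Nonzero if p can be accessed with 32-bit loads and stores. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Hypothetical helper: use the u32 struct copy only when both addresses are
   4-byte aligned, and fall back to a byte-safe memcpy otherwise. */
static void copy_32_bytes_safe(void *dst, const void *src)
{
    if (is_word_aligned(dst) && is_word_aligned(src))
        *(FAST_COPY_32 *)dst = *(const FAST_COPY_32 *)src;
    else
        memcpy(dst, src, sizeof(FAST_COPY_32));
}
```

Keeping VRAM buffers 4-byte (or more) aligned in the first place means the slow fallback never runs.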
ector
Posts: 195
Joined: Thu May 12, 2005 10:22 pm

Post by ector »

Yeah unaligned accesses are to be avoided :)

Strange that your struct method is so much faster. I must be too used to MSVC and its memcpy implementations (yes, it has several, from just inserting MOVs to various unrolled loops). Maybe GCC isn't as good at memcpy intrinsic optimization, or something is not configured right.
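One way to test the theory that GCC isn't optimizing memcpy well here is to give it a compile-time constant size plus an explicit alignment promise, then time it against the u32 struct. This sketch uses GCC's `__builtin_assume_aligned` (Clang accepts it too); `copy_32_aligned` is a hypothetical name, and whether it beats the struct copy with the PSP toolchain would have to be measured.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: a fixed-size copy the compiler is free to expand
   inline. Whether it actually does so depends on the GCC version and the
   optimization flags (e.g. -O2); otherwise it may call library memcpy. */
static void copy_32_aligned(void *dst, const void *src)
{
    /* Breaking the alignment promise below is undefined behaviour, so
       check it in debug builds. */
    assert(((uintptr_t)dst & 3u) == 0 && ((uintptr_t)src & 3u) == 0);

    void *d = __builtin_assume_aligned(dst, 4);
    const void *s = __builtin_assume_aligned(src, 4);
    memcpy(d, s, 32);
}
```

Compiling with `-S` and reading the generated assembly is a quick way to see whether the memcpy was expanded into word loads and stores or left as a call.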
chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Post by chp »

chiwaw wrote: Where can I find sample code for that? I have never used 3D quads to draw 2D (as I said, coming from a GBA background, I'm not very experienced with anything 3D).

Is there an easy function I can call with a source pointer and a target pointer (both in VRAM) to copy X bytes, letting the GPU do the work instead of the main CPU?
Look at the 'gu/rendertarget' and 'gu/blit' samples. They should give you enough information to figure out how to do your GPU-assisted blit.
GE Dominator
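Note that a GPU blit works on rectangles inside a framebuffer with a buffer width (stride), not on a flat run of X bytes. PSPSDK's pspgu.h also declares sceGuCopyImage(), which takes this kind of parameter set (pixel format, source/destination x/y, width, height, buffer widths); check the header for the exact signature. As a CPU-side reference for what such a rectangle copy should produce, here is a strided copy for 16-bit pixels. The 512-pixel buffer width is an assumption (it is the common PSP framebuffer width), and `blit_rect_16` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_WIDTH 512 /* assumed buffer width (stride) in pixels */

/* Hypothetical helper: copy a w x h rectangle of 16-bit pixels from
   (sx, sy) in src to (dx, dy) in dst. Both buffers use a row stride of
   BUF_WIDTH pixels, so rows are copied one at a time. Preconditions:
   sx + w and dx + w must not exceed BUF_WIDTH, and the rectangles must
   not overlap. */
static void blit_rect_16(uint16_t *dst, int dx, int dy,
                         const uint16_t *src, int sx, int sy,
                         int w, int h)
{
    assert(w >= 0 && h >= 0);
    assert(sx + w <= BUF_WIDTH && dx + w <= BUF_WIDTH);

    for (int row = 0; row < h; row++) {
        const uint16_t *s = src + (size_t)(sy + row) * BUF_WIDTH + sx;
        uint16_t *d = dst + (size_t)(dy + row) * BUF_WIDTH + dx;
        memcpy(d, s, (size_t)w * sizeof(uint16_t));
    }
}
```

A 16x1 rectangle is the 32-byte block from the first post; the GE earns its keep on larger rectangles, where the CPU loop above has to touch every row.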