How to use VFPU??

cooleyes · Post by **cooleyes** » Mon Nov 13, 2006 5:23 pm

I have some code like this:

#define EXPAND_16_TIMES(CODE) CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE

void Adapt(short * pM, const short * pAdapt, int nDirection, int nOrder)
{
    nDirection = -nDirection;
    nOrder >>= 4;

    if (nDirection < 0)
    {
        while (nOrder--)
        {
            EXPAND_16_TIMES(*pM++ += *pAdapt++;)
        }
    }
    else if (nDirection > 0)
    {
        while (nOrder--)
        {
            EXPAND_16_TIMES(*pM++ -= *pAdapt++;)
        }
    }
}

I want to use VFPU code instead, so I wrote this:

Code:

#define vfpuadd16 \
	__asm__ volatile( \
	".set push\n" \
	".set noreorder\n" \
	"lv.q R100, 0+%0\n" \
	"lv.q R000, 0+%1\n" \
	"vadd.q R100, R100, R000\n" \
	"sv.q R100, 0+%0\n" \
	"lv.q R101, 16+%0\n" \
	"lv.q R001, 16+%1\n" \
	"vadd.q R101, R101, R001\n" \
	"sv.q R101, 16+%0\n" \
	"lv.q R102, 32+%0\n" \
	"lv.q R002, 32+%1\n" \
	"vadd.q R102, R102, R002\n" \
	"sv.q R102, 32+%0\n" \
	"lv.q R103, 48+%0\n" \
	"lv.q R003, 48+%1\n" \
	"vadd.q R103, R103, R003\n" \
	"sv.q R103, 48+%0\n" \
	".set pop\n" \
	: "+m" (blockM32), \
	  "+m" (blockAdapt32) ) ;

#define vfpusub16 \
	__asm__ volatile( \
	".set push\n" \
	".set noreorder\n" \
	"lv.q R100, 0+%0\n" \
	"lv.q R000, 0+%1\n" \
	"vsub.q R100, R100, R000\n" \
	"sv.q R100, 0+%0\n" \
	"lv.q R101, 16+%0\n" \
	"lv.q R001, 16+%1\n" \
	"vsub.q R101, R101, R001\n" \
	"sv.q R101, 16+%0\n" \
	"lv.q R102, 32+%0\n" \
	"lv.q R002, 32+%1\n" \
	"vsub.q R102, R102, R002\n" \
	"sv.q R102, 32+%0\n" \
	"lv.q R103, 48+%0\n" \
	"lv.q R003, 48+%1\n" \
	"vsub.q R103, R103, R003\n" \
	"sv.q R103, 48+%0\n" \
	".set pop\n" \
	: "+m" (blockM32), \
	  "+m" (blockAdapt32) ) ;

static inline void AdaptVFPUAdd(short * pM, const short * pAdapt) {
	float __attribute__((aligned(64))) blockM32[16];
	float __attribute__((aligned(64))) blockAdapt32[16];
	int i;
	for(i = 0; i < 16; i++)
	{
		blockM32[i] = *(pM+i);
		blockAdapt32[i] = *(pAdapt+i);
	}
	vfpuadd16;
	for(i = 0; i < 16; i++)
	{
		*(pM+i) = (short)blockM32[i];
	}
}

static inline void AdaptVFPUSub(short * pM, const short * pAdapt) {
	float __attribute__((aligned(64))) blockM32[16];
	float __attribute__((aligned(64))) blockAdapt32[16];
	int i;
	for(i = 0; i < 16; i++)
	{
		blockM32[i] = *(pM+i);
		blockAdapt32[i] = *(pAdapt+i);
	}
	vfpusub16;
	for(i = 0; i < 16; i++)
	{
		*(pM+i) = (short)blockM32[i];
	}
}

void Adapt(short * pM, const short * pAdapt, int nDirection, int nOrder)
{
    nDirection = -nDirection;
    nOrder >>= 4;

    if (nDirection < 0)
    {
        while (nOrder--)
        {
            AdaptVFPUAdd(pM, pAdapt);
            pM+=16;
            pAdapt+=16;
            //EXPAND_16_TIMES(*pM++ += *pAdapt++;)
        }
    }
    else if (nDirection > 0)
    {
        while (nOrder--)
        {
            AdaptVFPUSub(pM, pAdapt);
            pM+=16;
            pAdapt+=16;
            //EXPAND_16_TIMES(*pM++ -= *pAdapt++;)
        }
    }
}

It compiles, but it doesn't work: it makes my PSP hang. :(

hlide · Post by **hlide** » Mon Nov 13, 2006 6:37 pm

The address used by lv.q/sv.q must be 16-byte aligned. Remember that they load/store 4 floats, that is, 16 bytes.

I don't know whether ulv.q/usv.q, for unaligned access, would be the right solution (they may be slower).

I have to leave, so I haven't taken much time to read all your code.

EDIT:
Sorry, I didn't read your code carefully. You are aligning your floats to 64 bytes; that seems like a lot, but if it is for cache reasons, I guess you're right.

While I'm trying to understand your code, I'd point out that you can use the vs2i/vi2s and vi2f/vf2i instructions to convert your shorts to and from floats more efficiently than you are doing now.
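
To make that conversion chain concrete, here is a scalar C model of one lane. It is only a sketch under assumptions: that vs2i places each short in the upper 16 bits of a 32-bit word, that vi2f/vf2iz with a scale of 16 divide/multiply by 2^16, that vf2iz saturates on overflow, and that vi2s keeps the upper 16 bits. The `model_*` names are made up for illustration and are not part of any SDK.

```c
#include <stdint.h>

/* Assumed vs2i behaviour: the short ends up in the upper 16 bits. */
static int32_t model_vs2i(int16_t s)
{
    return (int32_t)((uint32_t)(uint16_t)s << 16);
}

/* Assumed vi2f with scale 16: integer divided by 2^16. */
static float model_vi2f16(int32_t i)
{
    return (float)i / 65536.0f;
}

/* Assumed vf2iz with scale 16: multiply by 2^16, truncate toward zero,
   saturate to the 32-bit range. */
static int32_t model_vf2iz16(float f)
{
    double scaled = (double)f * 65536.0;
    if (scaled >= 2147483647.0)
        return INT32_MAX;
    if (scaled <= -2147483648.0)
        return INT32_MIN;
    return (int32_t)scaled;
}

/* Assumed vi2s behaviour: keep the upper 16 bits of the 32-bit integer. */
static int16_t model_vi2s(int32_t i)
{
    return (int16_t)(uint16_t)((uint32_t)i >> 16);
}

/* short -> int -> float, add, float -> int -> short, for a single lane. */
static int16_t model_add(int16_t a, int16_t b)
{
    float fa = model_vi2f16(model_vs2i(a));
    float fb = model_vi2f16(model_vs2i(b));
    return model_vi2s(model_vf2iz16(fa + fb));
}
```

If those assumptions hold, the float path saturates on overflow, whereas the plain `*pM++ += *pAdapt++` wraps around, so the two versions can differ at the edges of the short range.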

Raphael · Post by **Raphael** » Mon Nov 13, 2006 7:43 pm

From my own recent problems with the VFPU, it seems that the aligned attribute isn't honored for stack variables, so your accesses end up unaligned and crash the PSP.
Two possible solutions:
- memalign the two buffers instead of declaring them on the stack [or write your own stack-align function] (bad)
- use unaligned access (ulv.q/usv.q) and drop the buffers completely (good)

From my findings, unaligned accesses are no slower when the data is fetched from or written to memory; they take 2 cycles instead of 1 for cached reads and 14 instead of 7 cycles for cached writes. That's not much waste if you take into account that reads/writes from memory take 68/111 cycles regardless of alignment.

hlide · Post by **hlide** » Mon Nov 13, 2006 7:58 pm

Can we at least force a GCC option to align the stack to 16 bytes, for instance?

cooleyes · Post by **cooleyes** » Mon Nov 13, 2006 8:46 pm

Thanks for your help.

When I use ulv.q/usv.q instead of lv.q/sv.q, it works, with no crash,

but it is slower than the code that doesn't use the VFPU. :(

hlide · Post by **hlide** » Mon Nov 13, 2006 9:01 pm

I haven't tested it, but you can do the same thing without the temporary float buffers. Please keep in mind that there may be some bugs:

Code:

static inline void AdaptVFPUAdd16(short *pM, const short *pAdapt)
{
   __asm__ volatile(
   ".set push;"
   ".set noreorder;"
   "ulv.q R100, 0(%0);"
   "ulv.q R000, 0(%1);"
   "ulv.q R101, 16(%0);"
   "ulv.q R001, 16(%1);"
   "vs2i.q R100, R100;"
   "vs2i.q R101, R120;"
   "vs2i.q R102, R101;"
   "vs2i.q R103, R121;"
   "vs2i.q R000, R000;"
   "vs2i.q R001, R020;"
   "vs2i.q R002, R001;"
   "vs2i.q R003, R020;"
   "vi2f.q R100, R100, 16;"
   "vi2f.q R101, R101, 16;"
   "vi2f.q R102, R102, 16;"
   "vi2f.q R103, R103, 16;"
   "vi2f.q R000, R000, 16;"
   "vi2f.q R001, R001, 16;"
   "vi2f.q R002, R002, 16;"
   "vi2f.q R003, R003, 16;"
   "vadd.q R100, R100, R000;"
   "vadd.q R101, R101, R001;"
   "vadd.q R102, R102, R002;"
   "vadd.q R103, R103, R003;"
   "vf2iz.q R100, R100, 16;"
   "vf2iz.q R101, R101, 16;"
   "vf2iz.q R102, R102, 16;"
   "vf2iz.q R103, R103, 16;"
   "vi2s.p R100, R100;"
   "vi2s.p R120, R101;"
   "vi2s.p R101, R102;"
   "vi2s.p R121, R103;"
   "usv.q R100, 0(%0);"
   "usv.q R101, 16(%0);"
   ".set pop" : : "r"(pM), "r"(pAdapt) : "memory");
}

...

void Adapt(short * pM, const short * pAdapt, int nDirection, int nOrder)
{
    nDirection = -nDirection;
    nOrder >>= 4;

    if (nDirection < 0)
    {
        while (nOrder--)
        {
            AdaptVFPUAdd16(pM, pAdapt);
            pM+=16;
            pAdapt+=16;
        }
    }
    else if (nDirection > 0)
    {
        while (nOrder--)
        {
            AdaptVFPUSub16(pM, pAdapt);
            pM+=16;
            pAdapt+=16;
        }
    }
}

By the way, I didn't try to reorder the VFPU instructions for better scheduling, to keep the code easy to read.

hlide · Post by **hlide** » Mon Nov 13, 2006 10:20 pm

cooleyes wrote: but it is slower than the code that doesn't use the VFPU. :(

No wonder!

1) the shorts are copied into a float buffer using the FPU (not the VFPU!)
2) the VFPU computes on the temporary buffers
3) the float buffer is copied back into the short buffer using FPU conversion (again, not the VFPU!)

That's definitely not the fast path!

cooleyes · Post by **cooleyes** » Mon Nov 13, 2006 10:42 pm

To hlide:

Thanks for the help.

I have read the code you posted and changed a few things so that it compiles, but it crashes. :(

Code:

#define vfpuadd16ex \
	__asm__ volatile( \
	".set push\n" \
	".set noreorder\n" \
	"ulv.q R100, 0+%0\n" \
	"ulv.q R000, 0+%1\n" \
	"ulv.q R101, 16+%0\n" \
	"ulv.q R001, 16+%1\n" \
	"vs2i.p R100, R100\n" \
	"vs2i.p R101, R120\n" \
	"vs2i.p R102, R101\n" \
	"vs2i.p R103, R121\n" \
	"vs2i.p R000, R000\n" \
	"vs2i.p R001, R020\n" \
	"vs2i.p R002, R001\n" \
	"vs2i.p R003, R020\n" \
	"vi2f.q R100, R100, 16\n" \
	"vi2f.q R101, R101, 16\n" \
	"vi2f.q R102, R102, 16\n" \
	"vi2f.q R103, R103, 16\n" \
	"vi2f.q R000, R000, 16\n" \
	"vi2f.q R001, R001, 16\n" \
	"vi2f.q R002, R002, 16\n" \
	"vi2f.q R003, R003, 16\n" \
	"vadd.q R100, R100, R000\n" \
	"vadd.q R101, R101, R001\n" \
	"vadd.q R102, R102, R002\n" \
	"vadd.q R103, R103, R003\n" \
	"vf2iz.q R100, R100, 16\n" \
	"vf2iz.q R101, R101, 16\n" \
	"vf2iz.q R102, R102, 16\n" \
	"vf2iz.q R103, R103, 16\n" \
	"vi2s.q R100, R100\n" \
	"vi2s.q R120, R101\n" \
	"vi2s.q R101, R102\n" \
	"vi2s.q R121, R103\n" \
	"usv.q R100, 0+%0\n" \
	"usv.q R101, 16+%0\n" \
	".set pop\n" \
	: "+m" (pM), \
	  "+m" (pAdapt) );


static inline void AdaptVFPUAdd(short * pM, const short * pAdapt) {
	vfpuadd16ex;
}

hlide · Post by **hlide** » Mon Nov 13, 2006 10:44 pm

As I told you, I didn't test it, and I may be wrong about the row naming too. You can look at the VFPU diggings for the purpose of each instruction, and you may find the bugs.

hlide · Post by **hlide** » Mon Nov 13, 2006 10:47 pm

Key instructions to look up:
"v(u)s2i"
"vi2(u)s"
"vi2f"
"vf2iz"

The rest should be okay for you.

hlide · Post by **hlide** » Mon Nov 13, 2006 11:06 pm

I had two pairs of instructions swapped:

Code:

#define vfpuadd16ex \
   __asm__ volatile( \
   ".set push\n" \
   ".set noreorder\n" \
   "ulv.q R100, 0+%0\n" \
   "ulv.q R000, 0+%1\n" \
   "ulv.q R101, 16+%0\n" \
   "ulv.q R001, 16+%1\n" \
   "vs2i.p R100, R100\n" \
   >>>"vs2i.p R102, R101\n"<<< \
   >>>"vs2i.p R101, R120\n"<<< \
   "vs2i.p R103, R121\n" \
   "vs2i.p R000, R000\n" \
   >>>"vs2i.p R002, R001\n"<<< \
   >>>"vs2i.p R001, R020\n"<<< \
   "vs2i.p R003, R021\n" \ <<< R020 should be R021
   "vi2f.q R100, R100, 16\n" \
   "vi2f.q R101, R101, 16\n" \
   "vi2f.q R102, R102, 16\n" \
   "vi2f.q R103, R103, 16\n" \
   "vi2f.q R000, R000, 16\n" \
   "vi2f.q R001, R001, 16\n" \
   "vi2f.q R002, R002, 16\n" \
   "vi2f.q R003, R003, 16\n" \
   "vadd.q R100, R100, R000\n" \
   "vadd.q R101, R101, R001\n" \
   "vadd.q R102, R102, R002\n" \
   "vadd.q R103, R103, R003\n" \
   "vf2iz.q R100, R100, 16\n" \
   "vf2iz.q R101, R101, 16\n" \
   "vf2iz.q R102, R102, 16\n" \
   "vf2iz.q R103, R103, 16\n" \
   "vi2s.q R100, R100\n" \
   "vi2s.q R120, R101\n" \
   "vi2s.q R101, R102\n" \
   "vi2s.q R121, R103\n" \
   "usv.q R100, 0+%0\n" \
   "usv.q R101, 16+%0\n" \
   ".set pop\n" \
   : "+m" (pM), \
     "+m" (pAdapt) );

I don't know whether that is the reason it crashes. I suppose the crash happens inside this code at run time, not at compile time? Or does it crash later because of this function's result?

cooleyes · Post by **cooleyes** » Mon Nov 13, 2006 11:10 pm

OK, I have found the error. The new code is below: no crash, but it is still slower.

Code:

#define vfpuadd16ex \
	__asm__ volatile( \
	".set push\n" \
	".set noreorder\n" \
	"ulv.q R100, 0+%0\n" \
	"ulv.q R000, 0+%1\n" \
	"ulv.q R101, 16+%0\n" \
	"ulv.q R001, 16+%1\n" \
	"vs2i.p R300, R100\n" \
	"vs2i.p R301, R120\n" \
	"vs2i.p R302, R101\n" \
	"vs2i.p R303, R121\n" \
	"vs2i.p R200, R000\n" \
	"vs2i.p R201, R020\n" \
	"vs2i.p R202, R001\n" \
	"vs2i.p R203, R020\n" \
	"vi2f.q R300, R300, 16\n" \
	"vi2f.q R301, R301, 16\n" \
	"vi2f.q R302, R302, 16\n" \
	"vi2f.q R303, R303, 16\n" \
	"vi2f.q R200, R200, 16\n" \
	"vi2f.q R201, R201, 16\n" \
	"vi2f.q R202, R202, 16\n" \
	"vi2f.q R203, R203, 16\n" \
	"vadd.q R300, R300, R200\n" \
	"vadd.q R301, R301, R201\n" \
	"vadd.q R302, R302, R202\n" \
	"vadd.q R303, R303, R203\n" \
	"vf2iz.q R300, R300, 16\n" \
	"vf2iz.q R301, R301, 16\n" \
	"vf2iz.q R302, R302, 16\n" \
	"vf2iz.q R303, R303, 16\n" \
	"vi2s.q R100, R300\n" \
	"vi2s.q R120, R301\n" \
	"vi2s.q R101, R302\n" \
	"vi2s.q R121, R303\n" \
	"usv.q R100, 0+%0\n" \
	"usv.q R101, 16+%0\n" \
	".set pop\n" \
	: "+m" (pM), \
	  "+m" (pAdapt) );

hlide · Post by **hlide** » Mon Nov 13, 2006 11:17 pm

cooleyes wrote: OK, I have found the error. The new code is below: no crash, but it is still slower.

Code:

#define vfpuadd16ex \
	__asm__ volatile( \
	".set push\n" \
	".set noreorder\n" \
	"ulv.q R100, 0+%0\n" \
	"ulv.q R000, 0+%1\n" \
	"ulv.q R101, 16+%0\n" \
	"ulv.q R001, 16+%1\n" \
	"vs2i.p R300, R100\n" \
	"vs2i.p R301, R120\n" \
	"vs2i.p R302, R101\n" \
	"vs2i.p R303, R121\n" \
	"vs2i.p R200, R000\n" \
	"vs2i.p R201, R020\n" \
	"vs2i.p R202, R001\n" \
	"vs2i.p R203, R020\n" \ <<<<<<< should be R021
	"vi2f.q R300, R300, 16\n" \
	"vi2f.q R301, R301, 16\n" \
	"vi2f.q R302, R302, 16\n" \
	"vi2f.q R303, R303, 16\n" \
	"vi2f.q R200, R200, 16\n" \
	"vi2f.q R201, R201, 16\n" \
	"vi2f.q R202, R202, 16\n" \
	"vi2f.q R203, R203, 16\n" \
	"vadd.q R300, R300, R200\n" \
	"vadd.q R301, R301, R201\n" \
	"vadd.q R302, R302, R202\n" \
	"vadd.q R303, R303, R203\n" \
	"vf2iz.q R300, R300, 16\n" \
	"vf2iz.q R301, R301, 16\n" \
	"vf2iz.q R302, R302, 16\n" \
	"vf2iz.q R303, R303, 16\n" \
	"vi2s.q R100, R300\n" \
	"vi2s.q R120, R301\n" \
	"vi2s.q R101, R302\n" \
	"vi2s.q R121, R303\n" \
	"usv.q R100, 0+%0\n" \
	"usv.q R101, 16+%0\n" \
	".set pop\n" \
	: "+m" (pM), \
	  "+m" (pAdapt) );

First, you may need to reorder the instructions to hide latencies, because I'm sure the order here is not optimal.

But anyway, why do we need to use floats to add shorts!? I'm coding something stupid!

Raphael · Post by **Raphael** » Tue Nov 14, 2006 1:36 am

hlide wrote: Can we at least force a GCC option to align the stack to 16 bytes, for instance?

Not sure about that. The last time I needed it, I wrote a workaround like this:

Code:

float myarray[SIZE + 4];
float* myarray16 = (float*)(((int)myarray+16)&~0xF);

which worked (but is ugly).
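
The same over-allocate-and-round-up idea can be written a little more defensively. The sketch below is a variant, not the exact workaround above: it uses `uintptr_t` instead of `int` and adds 15 rather than 16, so an already aligned address is left where it is. `align_up_16`, `is_aligned_16`, `aligned_block_fits` and `BLOCK_FLOATS` are hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_FLOATS 16  /* hypothetical block size: 16 floats = 64 bytes */

/* Round p up to the next 16-byte boundary (p itself if already aligned). */
static void *align_up_16(void *p)
{
    return (void *)(((uintptr_t)p + 15u) & ~(uintptr_t)15u);
}

/* Non-zero when p sits on a 16-byte boundary. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Over-allocate by 4 floats (16 bytes), then use the aligned window.
   Returns 1 when the window is aligned and lies inside the raw array. */
static int aligned_block_fits(void)
{
    float raw[BLOCK_FLOATS + 4];
    float *block = (float *)align_up_16(raw);
    int i;

    for (i = 0; i < BLOCK_FLOATS; i++)
        block[i] = (float)i;

    return is_aligned_16(block)
        && (char *)(block + BLOCK_FLOATS) <= (char *)(raw + BLOCK_FLOATS + 4);
}
```

A pointer produced this way is what an aligned lv.q/sv.q would need; the spare 16 bytes are the price of not knowing where the compiler puts the array.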

hlide wrote: First, you may need to reorder the instructions to hide latencies

Unfortunately, from my findings this seems hardly possible, if at all. I suppose the VFPU isn't pipelined, or if it is, the pipeline is very short and most ops use all its stages. You can, however, hide MIPS code inside the VFPU latencies.

cooleyes wrote: OK, I have found the error. The new code is below: no crash, but it is still slower.

The problem is that you only want to add shorts together, which requires you to load the data into VFPU registers, convert it, add it, convert it back, and write it back to memory. That is a lot of overhead for such simple functionality, so you won't get it any faster with the VFPU.

cooleyes · Post by **cooleyes** » Tue Nov 14, 2006 12:30 pm

To hlide:

I made a mistake: the new code doesn't work.
I found that my demo app was still using the old code last night,
which is why it ran without crashing.

When I use the new code, it crashes.

But you are right, using the VFPU for this is stupid; it's too slow.

Code:

#define vfpuadd16ex \
	__asm__ volatile( \
	".set push\n" \
	".set noreorder\n" \
	"ulv.q R100, 0+%0\n" \
	"ulv.q R000, 0+%1\n" \
	"ulv.q R101, 16+%0\n" \
	"ulv.q R001, 16+%1\n" \
	"vs2i.p R300, R100\n" \
	"vs2i.p R301, R120\n" \
	"vs2i.p R302, R101\n" \
	"vs2i.p R303, R121\n" \
	"vs2i.p R200, R000\n" \
	"vs2i.p R201, R020\n" \
	"vs2i.p R202, R001\n" \
	"vs2i.p R203, R021\n" \
	"vi2f.q R300, R300, 16\n" \
	"vi2f.q R301, R301, 16\n" \
	"vi2f.q R302, R302, 16\n" \
	"vi2f.q R303, R303, 16\n" \
	"vi2f.q R200, R200, 16\n" \
	"vi2f.q R201, R201, 16\n" \
	"vi2f.q R202, R202, 16\n" \
	"vi2f.q R203, R203, 16\n" \
	"vadd.q R300, R300, R200\n" \
	"vadd.q R301, R301, R201\n" \
	"vadd.q R302, R302, R202\n" \
	"vadd.q R303, R303, R203\n" \
	"vf2iz.q R300, R300, 16\n" \
	"vf2iz.q R301, R301, 16\n" \
	"vf2iz.q R302, R302, 16\n" \
	"vf2iz.q R303, R303, 16\n" \
	"vi2s.q R100, R300\n" \
	"vi2s.q R120, R301\n" \
	"vi2s.q R101, R302\n" \
	"vi2s.q R121, R303\n" \
	"usv.q R100, 0+%0\n" \
	"usv.q R101, 16+%0\n" \
	".set pop\n" \
	: "+m" (pM), \
	  "+m" (pAdapt) );

cooleyes · Post by **cooleyes** » Tue Nov 14, 2006 12:34 pm

To Raphael:

I just wanted to test whether I can use VFPU code in place of "MMX code" on the PSP.

But I think it is impossible for now. :(

Raphael · Post by **Raphael** » Tue Nov 14, 2006 5:12 pm

cooleyes wrote: To Raphael:

I just wanted to test whether I can use VFPU code in place of "MMX code" on the PSP.

But I think it is impossible for now. :(

Yeah, you simply cannot compare the VFPU to MMX :) MMX is integer-based and not really a vector processing scheme.
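
For illustration of what "integer-based" packed arithmetic means, here is a plain C sketch that adds two 16-bit lanes held in one 32-bit word using ordinary integer operations. Nobody in this thread proposed it, and no speed claim is made for the PSP; `add_2x16` and `pack_2x16` are hypothetical helper names.

```c
#include <stdint.h>

/* Pack two shorts into one 32-bit word: lo in bits 0..15, hi in bits 16..31. */
static uint32_t pack_2x16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Add both 16-bit lanes at once, each lane wrapping independently.
   The low 15 bits of each lane are added normally; the top bit of each
   lane is then fixed up with XOR so no carry crosses into the next lane. */
static uint32_t add_2x16(uint32_t a, uint32_t b)
{
    uint32_t low15 = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
    return low15 ^ ((a ^ b) & 0x80008000u);
}
```

Each lane wraps the same way a `short` addition does in the original loop, with no conversion to float and back.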

hlide · Post by **hlide** » Tue Nov 14, 2006 5:45 pm

cooleyes wrote: To hlide:

I made a mistake: the new code doesn't work.
I found that my demo app was still using the old code last night,
which is why it ran without crashing.

When I use the new code, it crashes.

It would be interesting to know exactly where it crashes: inside the "new" code itself, or when the result is used? Those are quite different. And when you say it crashes, I assume it compiles fine and then crashes when you run it, right?

Normally the short->int->float conversion should at least work, since I tested it by coding it into the RTPS function (GTE) of a PCSX-like emulator source for the PSP and running a PSX game that uses RTPS. I never tested the reverse conversion (I mean the VFPU int->short conversion), so I'm less confident about it.

Raphael · Post by **Raphael** » Tue Nov 14, 2006 6:19 pm

hlide wrote: I never tested the reverse conversion (I mean the VFPU int->short conversion), so I'm less confident about it.

It should be OK; I converted the short blocks to floats and back the same way for the iDCT in ffmpeg.

chp · Post by **chp** » Tue Nov 14, 2006 6:25 pm

You can also use alloca() and align the returned address, since it also allocates from the stack and is a bit cleaner than aligning a local array (it is how I align buffers in gum/vfpu). Also, there is no point in using memalign() to allocate memory, since malloc() is already quad-word aligned these days.
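
A minimal sketch of the alloca() approach, assuming a toolchain that declares alloca() in `<alloca.h>` (some declare it elsewhere). This is not the actual gum/vfpu code; `ALLOCA_ALIGNED_16` and `roundtrip_on_aligned_stack` are hypothetical names. It is a macro because memory from alloca() is released when the calling function returns, so the allocation has to happen in the function that uses the buffer.

```c
#include <alloca.h>  /* assumption: header name varies between toolchains */
#include <stddef.h>
#include <stdint.h>

/* Allocate bytes + 15 on the stack and round the address up to 16 bytes. */
#define ALLOCA_ALIGNED_16(bytes) \
    ((void *)(((uintptr_t)alloca((bytes) + 15) + 15) & ~(uintptr_t)15))

/* Copies n shorts into a 16-byte-aligned float buffer on the stack and back.
   Returns 1 when the buffer is aligned and every value survives the trip. */
static int roundtrip_on_aligned_stack(const short *src, int n)
{
    float *tmp = (float *)ALLOCA_ALIGNED_16((size_t)n * sizeof(float));
    int i;

    if (((uintptr_t)tmp & 15) != 0)
        return 0;
    for (i = 0; i < n; i++)
        tmp[i] = (float)src[i];
    for (i = 0; i < n; i++)
        if ((short)tmp[i] != src[i])
            return 0;
    return 1;
}
```

The 15 extra bytes play the same role as the 4 spare floats in the local-array workaround.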

hlide · Post by **hlide** » Tue Nov 14, 2006 7:37 pm

chp wrote: You can also use alloca() and align the returned address, since it also allocates from the stack and is a bit cleaner than aligning a local array (it is how I align buffers in gum/vfpu). Also, there is no point in using memalign() to allocate memory, since malloc() is already quad-word aligned these days.

GE dominator? Are we speaking about the Graphics Engine? Ooooh, you may interest me.

Raphael · Post by **Raphael** » Tue Nov 14, 2006 8:13 pm

chp wrote: Also, there is no point in using memalign() to allocate memory, since malloc() is already quad-word aligned these days.

I only mentioned memalign to make clear that I was talking about aligned memory (a lot of people still aren't aware of malloc's alignment).

hlide wrote: GE dominator? Are we speaking about the Graphics Engine? Ooooh, you may interest me.

Yes, GE as in Graphics Engine :) He's the one behind all the GU SDK samples and the most knowledgeable person about the GE/GU in the whole scene ;)