How to use VFPU??

Discuss the development of new homebrew software, tools and libraries.

Moderators: cheriff, TyRaNiD

Post Reply
cooleyes
Posts: 123
Joined: Thu May 18, 2006 3:30 pm

How to use VFPU??

Post by cooleyes »

I have some code like

Code: Select all

#define EXPAND_16_TIMES(CODE) CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE CODE

void Adapt(short * pM, const short * pAdapt, int nDirection, int nOrder)
{
    nDirection = -nDirection;
    nOrder >>= 4;
    
    if &#40;nDirection < 0&#41; 
    &#123;    
        while &#40;nOrder--&#41;
        &#123;
            EXPAND_16_TIMES&#40;*pM++ += *pAdapt++;&#41;  
        &#125;
    &#125;
    else if &#40;nDirection > 0&#41;
    &#123;
        while &#40;nOrder--&#41;
        &#123;
            EXPAND_16_TIMES&#40;*pM++ -= *pAdapt++;&#41;
        &#125;
    &#125;
&#125;
I want to use VFPU code to instead of

so I wrote that

Code: Select all


#define vfpuadd16 \
	__asm__ volatile&#40; \
	".set push\n" \
	".set noreorder\n" \
	"lv.q R100, 0+%0\n" \
	"lv.q R000, 0+%1\n" \
	"vadd.q R100, R100, R000\n" \
	"sv.q R100, 0+%0\n" \
	"lv.q R101, 16+%0\n" \
	"lv.q R001, 16+%1\n" \
	"vadd.q R101, R101, R001\n" \
	"sv.q R101, 16+%0\n" \
	"lv.q R102, 32+%0\n" \
	"lv.q R002, 32+%1\n" \
	"vadd.q R102, R102, R002\n" \
	"sv.q R102, 32+%0\n" \
	"lv.q R103, 48+%0\n" \
	"lv.q R003, 48+%1\n" \
	"vadd.q R103, R103, R003\n" \
	"sv.q R103, 48+%0\n" \
	".set pop\n" \
	&#58; "+m" &#40;blockM32&#41;, \
	  "+m" &#40;blockAdapt32&#41; &#41; ; 
	
#define vfpusub16 \
	__asm__ volatile&#40; \
	".set push\n" \
	".set noreorder\n" \
	"lv.q R100, 0+%0\n" \
	"lv.q R000, 0+%1\n" \
	"vsub.q R100, R100, R000\n" \
	"sv.q R100, 0+%0\n" \
	"lv.q R101, 16+%0\n" \
	"lv.q R001, 16+%1\n" \
	"vsub.q R101, R101, R001\n" \
	"sv.q R101, 16+%0\n" \
	"lv.q R102, 32+%0\n" \
	"lv.q R002, 32+%1\n" \
	"vsub.q R102, R102, R002\n" \
	"sv.q R102, 32+%0\n" \
	"lv.q R103, 48+%0\n" \
	"lv.q R003, 48+%1\n" \
	"vsub.q R103, R103, R003\n" \
	"sv.q R103, 48+%0\n" \
	".set pop\n" \
	&#58; "+m" &#40;blockM32&#41;, \
	  "+m" &#40;blockAdapt32&#41; &#41; ; 

static inline void AdaptVFPUAdd&#40;short * pM, const short * pAdapt&#41; &#123;
	float __attribute__&#40;&#40;aligned&#40;64&#41;&#41;&#41; blockM32&#91;16&#93;; 
	float __attribute__&#40;&#40;aligned&#40;64&#41;&#41;&#41; blockAdapt32&#91;16&#93;;
	int i;
	for&#40;i = 0; i < 16; i++&#41; 
        &#123;
          	blockM32&#91;i&#93; = *&#40;pM+i&#41;; 
          	blockAdapt32&#91;i&#93; = *&#40;pAdapt+i&#41;;
        &#125; 
        vfpuadd16;
        for&#40;i = 0; i < 16; i++&#41; 
        &#123;
            *&#40;pM+i&#41; = &#40;short&#41;blockM32&#91;i&#93;; 
        &#125; 
&#125;

static inline void AdaptVFPUSub&#40;short * pM, const short * pAdapt&#41; &#123;
	float __attribute__&#40;&#40;aligned&#40;64&#41;&#41;&#41; blockM32&#91;16&#93;; 
	float __attribute__&#40;&#40;aligned&#40;64&#41;&#41;&#41; blockAdapt32&#91;16&#93;;
	int i;
	for&#40;i = 0; i < 16; i++&#41; 
        &#123;
          	blockM32&#91;i&#93; = *&#40;pM+i&#41;; 
          	blockAdapt32&#91;i&#93; = *&#40;pAdapt+i&#41;;
        &#125; 
        vfpusub16;
        for&#40;i = 0; i < 16; i++&#41; 
        &#123;
            *&#40;pM+i&#41; = &#40;short&#41;blockM32&#91;i&#93;; 
        &#125; 
&#125;

void Adapt&#40;short * pM, const short * pAdapt, int nDirection, int nOrder&#41;
&#123;
    nDirection = -nDirection;
    nOrder >>= 4;
    
    if &#40;nDirection < 0&#41; 
    &#123;    
        while &#40;nOrder--&#41;
        &#123;
            AdaptVFPUAdd&#40;pM, pAdapt&#41;;
            pM+=16;
            pAdapt+=16;
            //EXPAND_16_TIMES&#40;*pM++ += *pAdapt++;&#41;  
        &#125;
    &#125;
    else if &#40;nDirection > 0&#41;
    &#123;
        while &#40;nOrder--&#41;
        &#123;
            AdaptVFPUSub&#40;pM, pAdapt&#41;;
            pM+=16;
            pAdapt+=16;
            //EXPAND_16_TIMES&#40;*pM++ -= *pAdapt++;&#41;
        &#125;
    &#125;
&#125;
It can be complied, But not work, It let's my psp halt. :(
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

address in lv.q/sv.q must be aligned to 16-byte region. Remember they load/store 4 floats, that is 16 bytes.

I don't know if ulv.q/usv.q for unligned access may be the right solution (but slower).

I'm must leave so i didn't take too much time to read all your code.

EDIT:
sorry i didn't read very well your code, you're aligning your floats at 64-bytes it sounds a bit much but i guess if it is for cache reason you're right.

while i'm trying to understand your code, I may point out the fact that you may use vs2i/vi2s and vi2f/vf2i instructions to convert your shorts into/from floats in a more efficient way that you're doing.
Last edited by hlide on Mon Nov 13, 2006 8:00 pm, edited 2 times in total.
User avatar
Raphael
Posts: 646
Joined: Tue Jan 17, 2006 4:54 pm
Location: Germany
Contact:

Post by Raphael »

To my last own problems with the VFPU, it seems that the align macro doesn't apply to stack variables, therefore causing your unaligned accesses to crash the psp.
Two possible solutions:
- memalign the two buffers instead of declaring them on stack [or write your own stack align function] (bad)
- use unaligned access (ulv.q/usv.q) and drop the buffers completely (good)


To my findings, unaligned accesses also aren't slower when the data is fetched from/written to memory, and takes 2 cycles instead of 1 for cached reads and 14 instead of 7 cycles for cached writes. Not that much [waste], if you take into account that reads/writes from memory will take 68/111 cycles independant of unaligned/aligned.
<Don't push the river, it flows.>
http://wordpress.fx-world.org - my devblog
http://wiki.fx-world.org - VFPU documentation wiki

Alexander Berl
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

Raphael wrote:To my last own problems with the VFPU, it seems that the align macro doesn't apply to stack variables, therefore causing your unaligned accesses to crash the psp.
Two possible solutions:
- memalign the two buffers instead of declaring them on stack [or write your own stack align function] (bad)
- use unaligned access (ulv.q/usv.q) and drop the buffers completely (good)


To my findings, unaligned accesses also aren't slower when the data is fetched from/written to memory, and takes 2 cycles instead of 1 for cached reads and 14 instead of 7 cycles for cached writes. Not that much [waste], if you take into account that reads/writes from memory will take 68/111 cycles independant of unaligned/aligned.
can we at least force an GCC options to align stack to 16-byte for isntance ?
cooleyes
Posts: 123
Joined: Thu May 18, 2006 3:30 pm

Post by cooleyes »

Raphael wrote:To my last own problems with the VFPU, it seems that the align macro doesn't apply to stack variables, therefore causing your unaligned accesses to crash the psp.
Two possible solutions:
- memalign the two buffers instead of declaring them on stack [or write your own stack align function] (bad)
- use unaligned access (ulv.q/usv.q) and drop the buffers completely (good)


To my findings, unaligned accesses also aren't slower when the data is fetched from/written to memory, and takes 2 cycles instead of 1 for cached reads and 14 instead of 7 cycles for cached writes. Not that much [waste], if you take into account that reads/writes from memory will take 68/111 cycles independant of unaligned/aligned.

thanks for your help

when I use ulv.q/usv.q instead of lv.q/sv.q, it can work, no crash

but it was slower than the code not use VFPU. :(
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

I didn't test it but you may do the same thing without temporary float buffer, please keep in mind there may be some bugs :

Code: Select all

static inline void AdaptVFPUAdd16&#40;short *pM, const short *pAdapt&#41;
   __asm__ volatile&#40;
   ".set push;"
   ".set noreorder;"
   "ulv.q R100, 0&#40;%0&#41;;"
   "ulv.q R000, 0&#40;%1&#41;;"
   "ulv.q R101, 16&#40;%0&#41;;"
   "ulv.q R001, 16&#40;%1&#41;;"
   "vs2i.q R100, R100;"
   "vs2i.q R101, R120;"
   "vs2i.q R102, R101;"
   "vs2i.q R103, R121;"
   "vs2i.q R000, R000;"
   "vs2i.q R001, R020;"
   "vs2i.q R002, R001;"
   "vs2i.q R003, R020;"
   "vi2f.q R100, R100, 16;"
   "vi2f.q R101, R101, 16;"
   "vi2f.q R102, R102, 16;"
   "vi2f.q R103, R103, 16;"
   "vi2f.q R000, R000, 16;"
   "vi2f.q R001, R001, 16;"
   "vi2f.q R002, R002, 16;"
   "vi2f.q R003, R003, 16;"
   "vadd.q R100, R100, R000;"
   "vadd.q R101, R101, R001;"
   "vadd.q R102, R102, R002;"
   "vadd.q R103, R103, R003;"
   "vf2iz.q R100, R100, 16;"
   "vf2iz.q R101, R101, 16;"
   "vf2iz.q R102, R102, 16;"
   "vf2iz.q R103, R103, 16;"
   "vi2s.p R100, R100;"
   "vi2s.p R120, R101;"
   "vi2s.p R101, R102;"
   "vi2s.p R121, R103;"
   "usv.q R100, 0&#40;%0&#41;;"
   "usv.q R101, 16&#40;%0&#41;;"
   ".set pop" &#58; &#58; "r"&#40;pM&#41;, "r"&#40;pAdapt&#41; &#58; "memory"&#41;;

...

void Adapt&#40;short * pM, const short * pAdapt, int nDirection, int nOrder&#41;
&#123;
    nDirection = -nDirection;
    nOrder >>= 4;
   
    if &#40;nDirection < 0&#41;
    &#123;   
        while &#40;nOrder--&#41;
        &#123;
            AdaptVFPUAdd16&#40;pM, pAdapt&#41;;
            pM+=16;
            pAdapt+=16;
        &#125;
    &#125;
    else if &#40;nDirection > 0&#41;
    &#123;
        while &#40;nOrder--&#41;
        &#123;
            AdaptVFPUSub16&#40;pM, pAdapt&#41;;
            pM+=16;
            pAdapt+=16;
        &#125;
    &#125;
&#125; 
By the way, i didn't try to reorder vfpu instructions for better scheduling to ease the reading.
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

cooleyes wrote:
Raphael wrote:To my last own problems with the VFPU, it seems that the align macro doesn't apply to stack variables, therefore causing your unaligned accesses to crash the psp.
Two possible solutions:
- memalign the two buffers instead of declaring them on stack [or write your own stack align function] (bad)
- use unaligned access (ulv.q/usv.q) and drop the buffers completely (good)


To my findings, unaligned accesses also aren't slower when the data is fetched from/written to memory, and takes 2 cycles instead of 1 for cached reads and 14 instead of 7 cycles for cached writes. Not that much [waste], if you take into account that reads/writes from memory will take 68/111 cycles independant of unaligned/aligned.
no wonder !

1) copy of shorts in a float buffer using FPU (not VFPU !)
2) VFPU computation temporary buffer
3) copy of float buffer in the short buffers using FPU conversion (not VFPU again !)

That's definitely not the fast path to do !
cooleyes
Posts: 123
Joined: Thu May 18, 2006 3:30 pm

Post by cooleyes »

to hlide:

thanks for help

I have read the code you posted, and change some to make it can be compiled, but it crash , :(

Code: Select all

#define vfpuadd16ex \
	__asm__ volatile&#40; \
	".set push\n" \
	".set noreorder\n" \
	"ulv.q R100, 0+%0\n" \
	"ulv.q R000, 0+%1\n" \
	"ulv.q R101, 16+%0\n" \
	"ulv.q R001, 16+%1\n" \
	"vs2i.p R100, R100\n" \
	"vs2i.p R101, R120\n" \
	"vs2i.p R102, R101\n" \
	"vs2i.p R103, R121\n" \
	"vs2i.p R000, R000\n" \
	"vs2i.p R001, R020\n" \
	"vs2i.p R002, R001\n" \
	"vs2i.p R003, R020\n" \
	"vi2f.q R100, R100, 16\n" \
	"vi2f.q R101, R101, 16\n" \
	"vi2f.q R102, R102, 16\n" \
	"vi2f.q R103, R103, 16\n" \
	"vi2f.q R000, R000, 16\n" \
	"vi2f.q R001, R001, 16\n" \
	"vi2f.q R002, R002, 16\n" \
	"vi2f.q R003, R003, 16\n" \
	"vadd.q R100, R100, R000\n" \
	"vadd.q R101, R101, R001\n" \
	"vadd.q R102, R102, R002\n" \
	"vadd.q R103, R103, R003\n" \
	"vf2iz.q R100, R100, 16\n" \
	"vf2iz.q R101, R101, 16\n" \
	"vf2iz.q R102, R102, 16\n" \
	"vf2iz.q R103, R103, 16\n" \
	"vi2s.q R100, R100\n" \
	"vi2s.q R120, R101\n" \
	"vi2s.q R101, R102\n" \
	"vi2s.q R121, R103\n" \
	"usv.q R100, 0+%0\n" \
	"usv.q R101, 16+%0\n" \
	".set pop\n" \
	&#58; "+m" &#40;pM&#41;, \
	  "+m" &#40;pAdapt&#41; &#41;;  


static inline void AdaptVFPUAdd&#40;short * pM, const short * pAdapt&#41; &#123;
	vfpuadd16ex;
&#125;
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

cooleyes wrote:to hlide:

thanks for help

I have read the code you posted, and change some to make it can be compiled, but it crash , :(

Code: Select all

#define vfpuadd16ex \
	__asm__ volatile&#40; \
	".set push\n" \
	".set noreorder\n" \
	"ulv.q R100, 0+%0\n" \
	"ulv.q R000, 0+%1\n" \
	"ulv.q R101, 16+%0\n" \
	"ulv.q R001, 16+%1\n" \
	"vs2i.p R100, R100\n" \
	"vs2i.p R101, R120\n" \
	"vs2i.p R102, R101\n" \
	"vs2i.p R103, R121\n" \
	"vs2i.p R000, R000\n" \
	"vs2i.p R001, R020\n" \
	"vs2i.p R002, R001\n" \
	"vs2i.p R003, R020\n" \
	"vi2f.q R100, R100, 16\n" \
	"vi2f.q R101, R101, 16\n" \
	"vi2f.q R102, R102, 16\n" \
	"vi2f.q R103, R103, 16\n" \
	"vi2f.q R000, R000, 16\n" \
	"vi2f.q R001, R001, 16\n" \
	"vi2f.q R002, R002, 16\n" \
	"vi2f.q R003, R003, 16\n" \
	"vadd.q R100, R100, R000\n" \
	"vadd.q R101, R101, R001\n" \
	"vadd.q R102, R102, R002\n" \
	"vadd.q R103, R103, R003\n" \
	"vf2iz.q R100, R100, 16\n" \
	"vf2iz.q R101, R101, 16\n" \
	"vf2iz.q R102, R102, 16\n" \
	"vf2iz.q R103, R103, 16\n" \
	"vi2s.q R100, R100\n" \
	"vi2s.q R120, R101\n" \
	"vi2s.q R101, R102\n" \
	"vi2s.q R121, R103\n" \
	"usv.q R100, 0+%0\n" \
	"usv.q R101, 16+%0\n" \
	".set pop\n" \
	&#58; "+m" &#40;pM&#41;, \
	  "+m" &#40;pAdapt&#41; &#41;;  


static inline void AdaptVFPUAdd&#40;short * pM, const short * pAdapt&#41; &#123;
	vfpuadd16ex;
&#125;
as I told you I didn't test it. And I may be wrong on row naming too... so... you can have a look on vfpu diggings for the purpose of each instruction and you may find the bugs.
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

cooleyes wrote:to hlide:

thanks for help

I have read the code you posted, and change some to make it can be compiled, but it crash , :(

Code: Select all

#define vfpuadd16ex \
	__asm__ volatile&#40; \
	".set push\n" \
	".set noreorder\n" \
	"ulv.q R100, 0+%0\n" \
	"ulv.q R000, 0+%1\n" \
	"ulv.q R101, 16+%0\n" \
	"ulv.q R001, 16+%1\n" \
	"vs2i.p R100, R100\n" \
	"vs2i.p R101, R120\n" \
	"vs2i.p R102, R101\n" \
	"vs2i.p R103, R121\n" \
	"vs2i.p R000, R000\n" \
	"vs2i.p R001, R020\n" \
	"vs2i.p R002, R001\n" \
	"vs2i.p R003, R020\n" \
	"vi2f.q R100, R100, 16\n" \
	"vi2f.q R101, R101, 16\n" \
	"vi2f.q R102, R102, 16\n" \
	"vi2f.q R103, R103, 16\n" \
	"vi2f.q R000, R000, 16\n" \
	"vi2f.q R001, R001, 16\n" \
	"vi2f.q R002, R002, 16\n" \
	"vi2f.q R003, R003, 16\n" \
	"vadd.q R100, R100, R000\n" \
	"vadd.q R101, R101, R001\n" \
	"vadd.q R102, R102, R002\n" \
	"vadd.q R103, R103, R003\n" \
	"vf2iz.q R100, R100, 16\n" \
	"vf2iz.q R101, R101, 16\n" \
	"vf2iz.q R102, R102, 16\n" \
	"vf2iz.q R103, R103, 16\n" \
	"vi2s.q R100, R100\n" \
	"vi2s.q R120, R101\n" \
	"vi2s.q R101, R102\n" \
	"vi2s.q R121, R103\n" \
	"usv.q R100, 0+%0\n" \
	"usv.q R101, 16+%0\n" \
	".set pop\n" \
	&#58; "+m" &#40;pM&#41;, \
	  "+m" &#40;pAdapt&#41; &#41;;  


static inline void AdaptVFPUAdd&#40;short * pM, const short * pAdapt&#41; &#123;
	vfpuadd16ex;
&#125;
keys :
"v(u)s2i"
"vi2(u)s"
"vi2f"
"vf2iz"

the rest should be okay for you
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

I inverted two pairs of instructions :

Code: Select all

#define vfpuadd16ex \
   __asm__ volatile&#40; \
   ".set push\n" \
   ".set noreorder\n" \
   "ulv.q R100, 0+%0\n" \
   "ulv.q R000, 0+%1\n" \
   "ulv.q R101, 16+%0\n" \
   "ulv.q R001, 16+%1\n" \
   "vs2i.p R100, R100\n" \
   >>>"vs2i.p R102, R101\n"<<< \
   >>>"vs2i.p R101, R120\n"<<< \
   "vs2i.p R103, R121\n" \
   "vs2i.p R000, R000\n" \
   >>>"vs2i.p R002, R001\n"<<< \
   >>>"vs2i.p R001, R020\n"<<< \
   "vs2i.p R003, R021\n" \ <<< R020 should be R021
   "vi2f.q R100, R100, 16\n" \
   "vi2f.q R101, R101, 16\n" \
   "vi2f.q R102, R102, 16\n" \
   "vi2f.q R103, R103, 16\n" \
   "vi2f.q R000, R000, 16\n" \
   "vi2f.q R001, R001, 16\n" \
   "vi2f.q R002, R002, 16\n" \
   "vi2f.q R003, R003, 16\n" \
   "vadd.q R100, R100, R000\n" \
   "vadd.q R101, R101, R001\n" \
   "vadd.q R102, R102, R002\n" \
   "vadd.q R103, R103, R003\n" \
   "vf2iz.q R100, R100, 16\n" \
   "vf2iz.q R101, R101, 16\n" \
   "vf2iz.q R102, R102, 16\n" \
   "vf2iz.q R103, R103, 16\n" \
   "vi2s.q R100, R100\n" \
   "vi2s.q R120, R101\n" \
   "vi2s.q R101, R102\n" \
   "vi2s.q R121, R103\n" \
   "usv.q R100, 0+%0\n" \
   "usv.q R101, 16+%0\n" \
   ".set pop\n" \
   &#58; "+m" &#40;pM&#41;, \
     "+m" &#40;pAdapt&#41; &#41;;  
i don't know if it is a the reason why it crashes. I suppose you a crash into this code when running and not at compiling ? or do you crash later because of the result of this function ?
Last edited by hlide on Mon Nov 13, 2006 11:12 pm, edited 1 time in total.
cooleyes
Posts: 123
Joined: Thu May 18, 2006 3:30 pm

Post by cooleyes »

en, I have found the error, new code like this, no crash, but also slower.

Code: Select all


#define vfpuadd16ex \
	__asm__ volatile&#40; \
	".set push\n" \
	".set noreorder\n" \
	"ulv.q R100, 0+%0\n" \
	"ulv.q R000, 0+%1\n" \
	"ulv.q R101, 16+%0\n" \
	"ulv.q R001, 16+%1\n" \
	"vs2i.p R300, R100\n" \
	"vs2i.p R301, R120\n" \
	"vs2i.p R302, R101\n" \
	"vs2i.p R303, R121\n" \
	"vs2i.p R200, R000\n" \
	"vs2i.p R201, R020\n" \
	"vs2i.p R202, R001\n" \
	"vs2i.p R203, R020\n" \
	"vi2f.q R300, R300, 16\n" \
	"vi2f.q R301, R301, 16\n" \
	"vi2f.q R302, R302, 16\n" \
	"vi2f.q R303, R303, 16\n" \
	"vi2f.q R200, R200, 16\n" \
	"vi2f.q R201, R201, 16\n" \
	"vi2f.q R202, R202, 16\n" \
	"vi2f.q R203, R203, 16\n" \
	"vadd.q R300, R300, R200\n" \
	"vadd.q R301, R301, R201\n" \
	"vadd.q R302, R302, R202\n" \
	"vadd.q R303, R303, R203\n" \
	"vf2iz.q R300, R300, 16\n" \
	"vf2iz.q R301, R301, 16\n" \
	"vf2iz.q R302, R302, 16\n" \
	"vf2iz.q R303, R303, 16\n" \
	"vi2s.q R100, R300\n" \
	"vi2s.q R120, R301\n" \
	"vi2s.q R101, R302\n" \
	"vi2s.q R121, R303\n" \
	"usv.q R100, 0+%0\n" \
	"usv.q R101, 16+%0\n" \
	".set pop\n" \
	&#58; "+m" &#40;pM&#41;, \
	  "+m" &#40;pAdapt&#41; &#41;;  
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

cooleyes wrote:en, I have found the error, new code like this, no crash, but also slower.

Code: Select all


#define vfpuadd16ex \
	__asm__ volatile&#40; \
	".set push\n" \
	".set noreorder\n" \
	"ulv.q R100, 0+%0\n" \
	"ulv.q R000, 0+%1\n" \
	"ulv.q R101, 16+%0\n" \
	"ulv.q R001, 16+%1\n" \
	"vs2i.p R300, R100\n" \
	"vs2i.p R301, R120\n" \
	"vs2i.p R302, R101\n" \
	"vs2i.p R303, R121\n" \
	"vs2i.p R200, R000\n" \
	"vs2i.p R201, R020\n" \
	"vs2i.p R202, R001\n" \
	"vs2i.p R203, R020\n" \ <<<<<<< should be R021
	"vi2f.q R300, R300, 16\n" \
	"vi2f.q R301, R301, 16\n" \
	"vi2f.q R302, R302, 16\n" \
	"vi2f.q R303, R303, 16\n" \
	"vi2f.q R200, R200, 16\n" \
	"vi2f.q R201, R201, 16\n" \
	"vi2f.q R202, R202, 16\n" \
	"vi2f.q R203, R203, 16\n" \
	"vadd.q R300, R300, R200\n" \
	"vadd.q R301, R301, R201\n" \
	"vadd.q R302, R302, R202\n" \
	"vadd.q R303, R303, R203\n" \
	"vf2iz.q R300, R300, 16\n" \
	"vf2iz.q R301, R301, 16\n" \
	"vf2iz.q R302, R302, 16\n" \
	"vf2iz.q R303, R303, 16\n" \
	"vi2s.q R100, R300\n" \
	"vi2s.q R120, R301\n" \
	"vi2s.q R101, R302\n" \
	"vi2s.q R121, R303\n" \
	"usv.q R100, 0+%0\n" \
	"usv.q R101, 16+%0\n" \
	".set pop\n" \
	&#58; "+m" &#40;pM&#41;, \
	  "+m" &#40;pAdapt&#41; &#41;;  
first you may need to reorder instuctions to hide latencies, because i'm sure it is not optimal here.

but anyway why do you need to use float to add shorts !?!? i'm coding something stupid !
User avatar
Raphael
Posts: 646
Joined: Tue Jan 17, 2006 4:54 pm
Location: Germany
Contact:

Post by Raphael »

hlide wrote: can we at least force an GCC options to align stack to 16-byte for isntance ?
Not sure about that. Last time I needed that, I wrote a work-around like that:

Code: Select all

float myarray&#91;SIZE + 4&#93;;
float* myarray16 = &#40;float*&#41;&#40;&#40;&#40;int&#41;myarray+16&#41;&~0xF&#41;;
which worked (but is ugly).
hlide wrote: first you may need to reorder instuctions to hide latencies
Unfortunately to my findings this seems hardly possible, if at all. I would suppose the VFPU isn't pipelined, or if it is, the pipeline is very short and most ops use all it's stages. You can however hide MIPS code inside the VFPU latencies.
cooleyes wrote:en, I have found the error, new code like this, no crash, but also slower.
The problem is that you only want to add shorts together, which requires you to load the data into VFPU registers, convert them, add them, reconvert them and write them back to memory. A lot of overhead for a simple functionality like that, so you won't get it faster with VFPU.
<Don't push the river, it flows.>
http://wordpress.fx-world.org - my devblog
http://wiki.fx-world.org - VFPU documentation wiki

Alexander Berl
cooleyes
Posts: 123
Joined: Thu May 18, 2006 3:30 pm

Post by cooleyes »

to hlide:

I have made a mistake, the new code didn't work.
I found my demo app use the old code last night,
so it can work no crash.

but when I use the new code , it crashed.

but you are right, use vfpu to do this is stupid, too slower

Code: Select all

#define vfpuadd16ex \
	__asm__ volatile&#40; \
	".set push\n" \
	".set noreorder\n" \
	"ulv.q R100, 0+%0\n" \
	"ulv.q R000, 0+%1\n" \
	"ulv.q R101, 16+%0\n" \
	"ulv.q R001, 16+%1\n" \
	"vs2i.p R300, R100\n" \
	"vs2i.p R301, R120\n" \
	"vs2i.p R302, R101\n" \
	"vs2i.p R303, R121\n" \
	"vs2i.p R200, R000\n" \
	"vs2i.p R201, R020\n" \
	"vs2i.p R202, R001\n" \
	"vs2i.p R203, R021\n" \
	"vi2f.q R300, R300, 16\n" \
	"vi2f.q R301, R301, 16\n" \
	"vi2f.q R302, R302, 16\n" \
	"vi2f.q R303, R303, 16\n" \
	"vi2f.q R200, R200, 16\n" \
	"vi2f.q R201, R201, 16\n" \
	"vi2f.q R202, R202, 16\n" \
	"vi2f.q R203, R203, 16\n" \
	"vadd.q R300, R300, R200\n" \
	"vadd.q R301, R301, R201\n" \
	"vadd.q R302, R302, R202\n" \
	"vadd.q R303, R303, R203\n" \
	"vf2iz.q R300, R300, 16\n" \
	"vf2iz.q R301, R301, 16\n" \
	"vf2iz.q R302, R302, 16\n" \
	"vf2iz.q R303, R303, 16\n" \
	"vi2s.q R100, R300\n" \
	"vi2s.q R120, R301\n" \
	"vi2s.q R101, R302\n" \
	"vi2s.q R121, R303\n" \
	"usv.q R100, 0+%0\n" \
	"usv.q R101, 16+%0\n" \
	".set pop\n" \
	&#58; "+m" &#40;pM&#41;, \
	  "+m" &#40;pAdapt&#41; &#41;;  
cooleyes
Posts: 123
Joined: Thu May 18, 2006 3:30 pm

Post by cooleyes »

to Raphael:

I just want to test that can I use some vfpu code to instead of "MMX code" in PSP.

but I think it is impossible now. :(
User avatar
Raphael
Posts: 646
Joined: Tue Jan 17, 2006 4:54 pm
Location: Germany
Contact:

Post by Raphael »

cooleyes wrote:to Raphael:

I just want to test that can I use some vfpu code to instead of "MMX code" in PSP.

but I think it is impossible now. :(
Yeah, you simply cannot compare VFPU to MMX :) MMX is int based and not really a vector processing scheme.
<Don't push the river, it flows.>
http://wordpress.fx-world.org - my devblog
http://wiki.fx-world.org - VFPU documentation wiki

Alexander Berl
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

cooleyes wrote:to hlide:

I have made a mistake, the new code didn't work.
I found my demo app use the old code last night,
so it can work no crash.

but when I use the new code , it crashed.
it would be interesting to say where it crashed, precisely in the "new" code or when exploiting the result ? this is quite different. And when you say crash you are supposedly having it compile well then running it, are you ?

Normally the conversion short->int->float should at least work since I have tested it by coding it in RTPS function (GTE) with a PCSX-like emulator source for psp and test it with a psx game using RTPS. I never tested the reverse conversion (i mean vfpu int->short conversion), so I'm less confident.
User avatar
Raphael
Posts: 646
Joined: Tue Jan 17, 2006 4:54 pm
Location: Germany
Contact:

Post by Raphael »

hlide wrote:I never tested the reverse conversion (i mean vfpu int->short conversion), so I'm less confident.
It should be ok, I used the same way for converting the short blocks to floats and vice versa for the iDCT in ffmpeg.
<Don't push the river, it flows.>
http://wordpress.fx-world.org - my devblog
http://wiki.fx-world.org - VFPU documentation wiki

Alexander Berl
chp
Posts: 313
Joined: Wed Jun 23, 2004 7:16 am

Post by chp »

You can also use alloca() and align that address, since it also allocates from the stack and is a bit more clean than aligning a local array (it is how I align buffers in gum/vfpu). Also, there is no point using memalign() to allocate memory since malloc() is already quad-word aligned these days.
GE Dominator
hlide
Posts: 739
Joined: Sun Sep 10, 2006 2:31 am

Post by hlide »

chp wrote:You can also use alloca() and align that address, since it also allocates from the stack and is a bit more clean than aligning a local array (it is how I align buffers in gum/vfpu). Also, there is no point using memalign() to allocate memory since malloc() is already quad-word aligned these days.
GE dominator ? are we speaking about the Graphics Engine ? Ooooh you may interest me.
User avatar
Raphael
Posts: 646
Joined: Tue Jan 17, 2006 4:54 pm
Location: Germany
Contact:

Post by Raphael »

chp wrote: Also, there is no point using memalign() to allocate memory since malloc() is already quad-word aligned these days.
I just mentioned memalign to make clear that I was going towards aligned memory (a lot of people still aren't aware about the malloc alignment)
hlide wrote: GE dominator ? are we speaking about the Graphics Engine ? Ooooh you may interest me.
Yes, GE as in Graphics Engine :) He's the one behind all the GU SDK samples and the most knowledged person about GE/GU in the whole scene ;)
<Don't push the river, it flows.>
http://wordpress.fx-world.org - my devblog
http://wiki.fx-world.org - VFPU documentation wiki

Alexander Berl
Post Reply