SPE Media Lib

unsolo · Post by **unsolo** » Mon Apr 16, 2007 3:08 am

Hello,

I'm a Norwegian guy hanging out in #PS3Dev and #gentoo-ppc64 on irc.freenode.net.

I am nearly finished (it works but is not released) with a colorspace converter, YV420p -> ARGB (more or less the same as YV12 -> ARGB).

It runs on one SPE at more than 60 FPS for 1920x1080.
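For reference, a plain scalar C version of the per-pixel arithmetic such a converter performs might look like the sketch below. It is not the SPU code from this project: the function names are made up for illustration, and the BT.601 video-range integer coefficients are an assumption, since the thread does not say which matrix is used. For YV12 input the U and V plane pointers would simply be swapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate value to the 0..255 range of one colour channel. */
static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Convert one pixel. The coefficients are the common BT.601 video-range
 * integer approximation (an assumption, not taken from spu-medialib).
 * Alpha is forced to 0xFF. Division is used instead of a right shift so
 * that negative intermediates are well defined; they clamp to 0 anyway. */
static uint32_t yuv_to_argb(uint8_t y, uint8_t u, uint8_t v)
{
    int c = (int)y - 16;
    int d = (int)u - 128;
    int e = (int)v - 128;
    int r = (298 * c + 409 * e + 128) / 256;
    int g = (298 * c - 100 * d - 208 * e + 128) / 256;
    int b = (298 * c + 516 * d + 128) / 256;
    return 0xFF000000u | ((uint32_t)clamp_u8(r) << 16)
                       | ((uint32_t)clamp_u8(g) << 8)
                       |  (uint32_t)clamp_u8(b);
}

/* Convert a planar 4:2:0 frame: one U and one V sample per 2x2 block of Y. */
void yuv420p_to_argb(const uint8_t *y_plane, const uint8_t *u_plane,
                     const uint8_t *v_plane, uint32_t *argb,
                     int width, int height)
{
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    for (int row = 0; row < height; row++) {
        const uint8_t *y = y_plane + (size_t)row * width;
        const uint8_t *u = u_plane + (size_t)(row / 2) * (width / 2);
        const uint8_t *v = v_plane + (size_t)(row / 2) * (width / 2);
        uint32_t *out = argb + (size_t)row * width;
        for (int col = 0; col < width; col++)
            out[col] = yuv_to_argb(y[col], u[col / 2], v[col / 2]);
    }
}
```

An SPU version would do the same arithmetic on whole vectors of pixels and stream the planes in and out by DMA; this scalar form is only useful as a correctness reference.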

The next logical step is up/down scaling, then maybe some extra filtering and decoding.

So here's my plan.

We create an SPU Media Lib project here on PS2Dev.org where we define the inputs and outputs (and make a reference project on SourceForge), standards for where binaries live and so on, and how to handshake and communicate between all the running SPEs. Then we add subprojects for the necessary SPU programs as we see fit as the lib grows.

All SPU code needs to be compatible with both 64-bit and 32-bit userland.

So please help me create the project and help me write the code.

Thanks
Kristian

Post by **Oobles** » Mon Apr 16, 2007 8:18 am

Sounds like a good project. If you would like to host the code at the subversion repository here (svn.ps2dev.org), then please send me a private message with the userid/password you would like and I will create an account for you.

The same goes for anyone else with project ideas for the ps3. The very few rules I have for subversion access are listed at:

http://ps2dev.org/Site_Information/Subversion

David. aka Oobles.

unsolo · Post by **unsolo** » Wed Apr 18, 2007 8:02 pm

project is up at
http://wiki.ps2dev.org/ps3:spu-medialib

svn
http://svn.pspdev.org/listing.php?repna ... rev=0&sc=0

mbf · Post by **mbf** » Sat May 05, 2007 12:04 am

very very nice one unsolo :)

Now, in order to benefit from this in the greatest number of applications without having to tweak each one of them for the PS3, I think the best thing would be to implement this in something like SDL or DirectFB or any other similar media layer (ggi? xv via a custom ps3fb-based X server?). The idea is to provide transparent SPE-based hardware acceleration and vsync support for any app using those backends.

What do you guys think?

jimparis · Post by **jimparis** » Sat May 05, 2007 9:36 am

I think making an SPU-accelerated Xv driver with XvMC could be a good place to do it.

popper · Post by **popper** » Sat May 05, 2007 12:12 pm

Is it all going to be SPU code, or will there be AltiVec additions too, seeing as there's one there currently (mostly) unused ;) until Lu_zero chips in and does his magic LOL.

ldesnogu · Post by **ldesnogu** » Sat May 05, 2007 8:09 pm

jimparis wrote:I think making an SPU-accelerated Xv driver with XvMC could be a good place to do it.

I also think it is the best option:
- SDL has support for xv
- mplayer lib vo can use xv.

However I don't know how easy (or difficult) it is to add xv into the X server (can it be done without touching the server or does it have to be put into it?).

mbf · Post by **mbf** » Sat May 05, 2007 11:46 pm

and MPlayer supports -vo sdl ;)

After posting this yesterday, I did some digging on xv but I hit a problem with vblank sync. Not easily done under X it seems. Anyone got an idea on how to do this properly? Does MPlayer or VLC use vsync in conjunction with xv and how?

jimparis · Post by **jimparis** » Sun May 06, 2007 6:16 am

mbf wrote:After posting this yesterday, I did some digging on xv but I hit a problem with vblank sync. Not easily done under X it seems. Anyone got an idea on how to do this properly? Does MPlayer or VLC use vsync in conjunction with xv and how?

Some applications (mythtv) have an option to use OpenGL to do the vsync. But from what I can gather, Xv drivers are already supposed to handle vsync internally. For example, if you look at the i810_video.c source in the intel X video driver, it does double buffering inside I810DisplaySurface() and waits for vsync before flipping:

Code: Select all


    /* wait for the last rendered buffer to be flipped in */
    while (((INREG(DOV0STA)&0x00100000)>>20) != pI810Priv->currentBuf) {
      if(loops == 200000) {
        xf86DrvMsg(pScrn->scrnIndex, X_INFO, "Overlay Lockup\n");
        break;
      }
      loops++;
    }

    /* buffer swap */
    if (pI810Priv->currentBuf == 0)
      pI810Priv->currentBuf = 1;
    else
      pI810Priv->currentBuf = 0;

    I810ResetVideo(pScrn);

    I810DisplayVideo(pScrn, surface->id, surface->width, surface->height,
                     surface->pitches[0], x1, y1, x2, y2, &dstBox,
                     src_w, src_h, drw_w, drw_h);

In other words, the application just feeds video data to the Xv driver and it's up to the driver to decide how to best display the video without tearing. Makes sense.

ldesnogu wrote:However I don't know how easy (or difficult) it is to add xv into the X server (can it be done without touching the server or does it have to be put into it?).

Well, you'll need to write a new Xv-capable display driver, but with the new modular X.org, that no longer involves rebuilding the whole X tree.

jockyw2001 · Post by **jockyw2001** » Tue May 08, 2007 12:34 am

@unsolo a.o.
See:
http://lists.mplayerhq.hu/pipermail/ffm ... 28757.html

I can imagine we do something similar with a bunch of interested ps2dev devs.

ralferoo · Post by **ralferoo** » Tue May 08, 2007 12:48 am

Also, I noticed on the wiki that a52 development was being looked at. In actual fact, the PS3 implements a regular ALSA driver which supports A52 passthrough, so surround sound from DVDs should work as-is. You don't need to do any decoding, just pass the bitstream straight through.

As part of my python library, I'm looking at writing an SPE sound system, with several goals. Primarily standard sound effects and MP3 decoding for 2-channels but also porting some of liba52 to the SPU (or re-implementing completely) so that I can have spatially located sound effects for those with DTS amps. I'll do my best to keep that part of the library usable from C too!

digihoe · Post by **digihoe** » Tue May 08, 2007 1:20 am

While there have been some demonstrations where the SPEs decode HD AVC, will encoding work just as well (full-speed HD AVC encoding)?

Has anyone seen such demonstrations?

Best regards!

laichung · Post by **laichung** » Tue May 08, 2007 6:05 pm

The latest Cell add-on now has a document called "Cell Programming Primer", which has a section containing a sample rgb2y program that uses the SPE (Chapter 3, Basics of SPE Programming).

For those who want to learn more about how to use SPE, check that out and you will find it is really informative.

Cell Programming Primer

mbf · Post by **mbf** » Thu May 10, 2007 4:52 am

nice find laichung :)

@jockyw2001: that was the point of my initial question. Better optimize the lower layers of the OS in order to improve performance for a broader range of applications in one single go. IMHO. However, optimizing MPlayer directly would certainly be more straightforward and fit the needs of most. I'm game for it anyway :)

@ralferoo: do you mean that there is no need to decode AC3/DTS, whatever audio system your PS3 outputs to? So far with all distros and kernels I tried, the ALSA driver sucked big time for standard stereo output, only cracks and hisses.

@digihoe: that's doable, but it won't work without optimizing MEncoder or x264 specifically for the CellBE.

ralferoo · Post by **ralferoo** » Thu May 10, 2007 6:58 am

mbf wrote:@ralferoo: do you mean that there is no need to decode AC3/DTS, whatever audio system your PS3 outputs to? So far with all distros and kernels I tried, the ALSA driver sucked big time for standard stereo output, only cracks and hisses.

I used to have FC5 installed which worked OK playing WAV files with aplay. I've manually installed a base version of Ubuntu 7.04 which also seems to work fine with both aplay and mpg123.

There is talk on the forums about cracks and hisses, but I haven't heard any evidence of it myself. So far, all my tests have been up to about 3 minutes as that's how long the MP3s I've tried are.

I might be wrong about the DTS passthrough as "aplay -l" doesn't list an iec958 device, although I was pretty sure I read someone had got it working. This also suggests it doesn't work:

Code: Select all

root@ps3:~# aplay -Dspdif ~ralf/test.dts
ALSA lib pcm.c:2145:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.iec958
aplay: main:550: audio open error: No such file or directory

There's still some hope though because it's possible for most amps to recognise a DTS bitstream even without the "None PCM data" option set in the stream. I'll let you know how I get on...

ralferoo · Post by **ralferoo** » Sun May 13, 2007 3:11 am

ralferoo wrote:I might be wrong about the DTS passthrough as "aplay -l" doesn't list an iec958 device ... There's still some hope though because it's possible for most amps to recognise a DTS bitstream even without the "None PCM data" option set in the stream. I'll let you know how I get on...

Well, I've done some digging and DTS pass-through is definitely not supported by the current kernel. However...

In sound/ppc/snd_ps3_reg.h we see lots of internal hardware definitions including

Code: Select all

S/PDIF Audio Output Channel Channel Status Setting Registers.
Configures channel status bit settings for each block &#40;192 bits&#41;.
Output is performed from the MSB&#40;AO_SPDCS0 register bit 31&#41;.
The same value is added for subframes within the same frame.

Now, this fills me with a lot of hope, as it's precisely this 192-bit block of S/PDIF subcode information that's used to signal to the amp that it's not PCM data but AC3/DTS data.
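To make the signalling concrete, here is a minimal sketch of a 192-bit channel status block with the "not linear PCM" flag set. The bit positions follow the consumer channel status layout of IEC 60958 as commonly documented (byte 0, bit 1 is the non-audio flag). How these bytes would have to be packed into the AO_SPDCS registers is not known from this thread, so the helper name and the byte-array representation are purely illustrative, and every other field is left at zero as a simplification.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CS_BLOCK_BYTES   24    /* 192 channel status bits per block        */
#define CS0_PROFESSIONAL 0x01  /* byte 0, bit 0: 0 = consumer format       */
#define CS0_NON_AUDIO    0x02  /* byte 0, bit 1: 1 = payload is not PCM    */

/* Fill a channel status block for consumer output. When non_pcm is set,
 * the receiver is told the stream is a compressed bitstream (for example
 * AC3 or DTS) instead of linear PCM samples. Hypothetical helper, not
 * part of the PS3 ALSA driver. */
void spdif_channel_status(uint8_t cs[CS_BLOCK_BYTES], int non_pcm)
{
    assert(cs != NULL);
    memset(cs, 0, CS_BLOCK_BYTES);   /* consumer mode, all fields default */
    if (non_pcm)
        cs[0] |= CS0_NON_AUDIO;
}
```

A real driver would also have to fill in fields such as the sample rate and honour the MSB-first register ordering described in the header comment.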

So, whilst the current kernel driver doesn't support this, it's feasible that we could implement this in the future and without requiring a hypervisor fix from Sony.

See http://www.hardwarebook.info/S/PDIF for more about SPDIF if you're interested.

mbf · Post by **mbf** » Mon Jun 04, 2007 10:31 pm

Unsolo, I've started working on a proof-of-concept clone of ffplay that uses the SPU media lib. While browsing through your code, I noticed that you align the memory allocations to 128-byte boundaries. For both memalign() and __attribute__ ((aligned(xy))), the alignment value is in bytes, not bits. DMA transfers require alignment to 128-bit (16-byte) boundaries, so the memalign calls should be changed to memalign(16,xyz).

Edit: looks like "minimum requirement" doesn't mean "best performance". The CBE Architecture reference document states that:

For optimal performance of transfers of 128 bytes or more, the source and destination transfer addresses should be 128-byte aligned (bits 25 through 31 set to 0).

Fair enough!

So basically, the choice depends on which one takes longer, the DMA transfer or the actual data processing, and on how significant the loss of available memory due to fragmentation is (with large alignments).
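As a sketch of what a 128-byte-aligned allocation looks like on the PPU side, the snippet below uses POSIX posix_memalign() in place of the memalign() call discussed above (the alignment argument is in bytes for both). dma_alloc is a hypothetical helper name, not a function from spu-medialib.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DMA_MIN_ALIGN   16u  /* 128 bits: minimum for 16-byte-multiple transfers */
#define DMA_FAST_ALIGN 128u  /* 128 bytes: the alignment the CBE document recommends */

/* Allocate a buffer suitable as a DMA source or destination.
 * Returns NULL on failure; release the buffer with free(). */
void *dma_alloc(size_t size, size_t align)
{
    void *p = NULL;
    assert(align == DMA_MIN_ALIGN || align == DMA_FAST_ALIGN);
    if (posix_memalign(&p, align, size) != 0)
        return NULL;
    return p;
}
```

A buffer from dma_alloc(n, DMA_FAST_ALIGN) has its low seven address bits clear, which is what the quoted "bits 25 through 31 set to 0" means in big-endian bit numbering of a 32-bit address.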

ldesnogu · Post by **ldesnogu** » Tue Jun 05, 2007 12:12 am

mbf wrote:Edit: looks like "minimum requirement" doesn't mean "best performance". The CBE Architecture reference document states that:
For optimal performance of transfers of 128 bytes or more, the source and destination transfer addresses
should be 128-byte aligned (bits 25 through 31 set to 0).
Fair enough!

Yes, that 128 bytes comes from L2 cache line sizes.

mbf wrote:So basically, the choice depends which one is faster: the DMA transfer or the actual data processing AND how significant is the loss of available memory due to fragmentation (with large alignments).

I *guess* it would be enough to align small dynamically allocated memory chunks to the hardware requirements (which depend on DMA packet size) and big chunks to the fastest requirement (128 bytes).

The rationale is that anyway for small DMA transfers a significant proportion of time is lost in the setup of the transfer, so a few cycles lost probably matters less than the fragmentation of memory.

One should also take care of allocating in the right order to minimize holes of unallocatable memory :)
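The small-versus-big policy suggested here could be sketched as follows. The 128-byte threshold is an assumption taken from the "transfers of 128 bytes or more" wording of the quoted document, the function names are invented for illustration, and transfers shorter than 16 bytes (which only need natural alignment) are rounded up to 16 for simplicity.

```c
#include <assert.h>
#include <stddef.h>

#define DMA_MIN_ALIGN    16u  /* hardware minimum for 16-byte-multiple transfers */
#define DMA_FAST_ALIGN  128u  /* fastest alignment                                */
#define BIG_CHUNK_BYTES 128u  /* assumed threshold between "small" and "big"      */

/* Pick the alignment to request for a buffer of the given size. */
size_t pick_dma_alignment(size_t size)
{
    assert(size > 0);
    return size >= BIG_CHUNK_BYTES ? DMA_FAST_ALIGN : DMA_MIN_ALIGN;
}

/* Worst-case number of bytes skipped in front of a block to reach the
 * requested alignment, starting from an arbitrary byte address. */
size_t worst_case_padding(size_t align)
{
    assert(align > 0);
    return align - 1;
}
```

With these numbers a small chunk wastes at most 15 bytes of padding where a 128-byte-aligned one could waste up to 127, which is the fragmentation cost being weighed against transfer speed.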

mbf · Post by **mbf** » Tue Jun 05, 2007 1:15 am

Yes, forgot to mention the cache line size.

ldesnogu wrote: One should also take care of allocating in the right order to minimize holes of unallocatable memory :)

I might think about this, well sometime :P

Question: YUV->RGB conversion then RGB to RGB scaling, or YUV to YUV scaling first then YUV to RGB conversion... or YUV to RGB and scaling at the same time? Which would be the fastest? YUV scaling first would seem to be the fastest, since there is less data to process and scaling is the most CPU-intensive step, but that's only a guesstimate and I haven't benchmarked it yet.
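The "less data to process" part of that guess can at least be quantified. The sketch below only counts bytes per frame for planar 4:2:0 and for 32-bit ARGB; it says nothing about actual speed, which would still need benchmarking. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one planar YUV 4:2:0 frame: a full-size Y plane plus
 * quarter-size U and V planes (1.5 bytes per pixel). */
size_t yuv420p_frame_bytes(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    return width * height + 2 * ((width / 2) * (height / 2));
}

/* Bytes in one packed ARGB frame (4 bytes per pixel). */
size_t argb_frame_bytes(size_t width, size_t height)
{
    return width * height * 4;
}
```

For 1920x1080 this gives 3,110,400 bytes of YUV against 8,294,400 bytes of ARGB, so a scaler working on ARGB input reads roughly 2.67 times as much data as one working on the YUV planes.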

Pizza67 · Post by **Pizza67** » Tue Jun 05, 2007 2:41 am

unsolo wrote: I am nearly finished(it works but is not released) with a colorspace converter YV420p ->ARGB.. (more or less the same as YV12->ARGB)

That runs on a spe at more than 60FPS for 1920x1080.

Hi, I'm doing some tests on the PS3 with your converter, but I'm a bit confused about how it is meant to be used.

What I did:

- I downloaded this file
ftp://ftp.ldv.e-technik.tu-muenchen.de/ ... lm_ter.yuv
that is an uncompressed 576i YUV video of 252 frames @25fps.

- I replicated the file 20 times to finally obtain a 5040 frames video

- I modified the number of frames to run through in yuv2rgb.cpp

Code: Select all

int ftot = 5040;

- I ran

Code: Select all

# ./yuv2rgb 576i25_stockholm_ter_x20.yuv 720 576

obtaining about 40 FPS, that is worse than your 60FPS@1920x1080.

The video also plays with a lot of stutter.

Maybe I'm missing something, how do you explain these results?

mbf · Post by **mbf** » Tue Jun 05, 2007 10:17 am

Pizza67 wrote:obtaining about 40 FPS, that is worse than your 60FPS@1920x1080.

It shouldn't be that bad considering that this conversion takes about 20% of the CPU (PPU) time when playing this kind of stuff with MPlayer. Have you tried with MPlayer?

ldesnogu · Post by **ldesnogu** » Tue Jun 05, 2007 5:46 pm

Pizza67 wrote:- I downloaded this file
ftp://ftp.ldv.e-technik.tu-muenchen.de/ ... lm_ter.yuv
that is an uncompressed 576i YUV video of 252 frames @25fps.

- I replicated the file 20 times to finally obtain a 5040 frames video

- I modified the number of frames to run through in yuv2rgb.cpp

Code: Select all

int ftot = 5040;

- I ran

Code: Select all

# ./yuv2rgb 576i25_stockholm_ter_x20.yuv 720 576

obtaining about 40 FPS, that is worse than your 60FPS@1920x1080.

The video also plays with a lot of stutter.

Maybe I'm missing something, how do you explain these results?

Well the original file is 153,090 KB x 20 = 3,061,800 KB.
3,061,800 KB / 5040 x 40 = 24,300 KB/s.
You are hard drive speed limited I guess :)
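The arithmetic above generalises to a one-line helper: divide the file size by the frame count to get the size of one raw frame, then multiply by the frame rate. The function name is made up for illustration; units are whatever the file size is given in (KB here, to match the post).

```c
#include <assert.h>

/* Sustained read rate needed to feed raw frames at the given frame rate.
 * total_kb is the size of the whole file, frames the number of frames in it. */
double required_read_rate_kb(double total_kb, unsigned frames, double fps)
{
    assert(frames > 0 && fps > 0.0);
    return total_kb / frames * fps;
}
```

Calling required_read_rate_kb(3061800.0, 5040, 40.0) reproduces the 24,300 KB/s figure, and each raw 720x576 frame works out to 607.5 KB.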

Pizza67 · Post by **Pizza67** » Tue Jun 05, 2007 6:03 pm

mbf wrote:
Pizza67 wrote:obtaining about 40 FPS, that is worse than your 60FPS@1920x1080.
It shouldn't be that bad considering that this conversion takes about 20% of the CPU (PPU) time when playing this kind of stuff with MPlayer. Have you tried with MPlayer?

MPlayer works fine with high-definition MPEG-2 streams: I tried 1080i@50FPS.

It plays OK, so a throughput of 40 FPS with a 576i video seems really bad in comparison with MPlayer, which uses just the PPU.

My concern is that it might be a problem with presentation on the ps3fb. I mean, the conversion on the SPU should be very fast, but the frame swap may slow down execution, perhaps because of waiting for VSync from the hypervisor or something else.

Could that be an explanation?

ldesnogu · Post by **ldesnogu** » Tue Jun 05, 2007 6:20 pm

Pizza67 wrote:MPlayer works fine with high-definition MPEG-2 streams: I tried 1080i@50FPS.

It plays OK, so a throughput of 40 FPS with a 576i video seems really bad in comparison with MPlayer, which uses just the PPU.

My concern is that it might be a problem with presentation on the ps3fb. I mean, the conversion on the SPU should be very fast, but the frame swap may slow down execution, perhaps because of waiting for VSync from the hypervisor or something else.

Could that be an explanation?

Read my post just above yours.
Then look at how the file is read in yuv2rgb and compare this to file reading in MPlayer. See the difference? :)

The file reading in yuv2rgb is primitive and inefficient; it's only there to demonstrate the use of the library.

I'm not saying this is the only explanation; yours might be part of the problem too. But there surely is a bottleneck in file reading.

Pizza67 · Post by **Pizza67** » Tue Jun 05, 2007 6:55 pm

ldesnogu wrote:But there surely is a bottleneck in file reading.

I read your post after I posted mine, sorry :)

You're totally right, I forgot to compute the disk throughput. That's definitely the problem.

MPlayer reads a compressed stream, so it doesn't hit the disk throughput limit.

The best way to test the yuv2rgb converter is probably to always use the same frame cached in RAM. I think this is done by launching the program without params. ;)

Thanks.

ldesnogu · Post by **ldesnogu** » Tue Jun 05, 2007 6:59 pm

Pizza67 wrote:MPlayer reads a compressed stream, so it doesn't hit the disk throughput limit.

Mplayer also uses mmap which might be more efficient than using stdc++ iostream. Also reading the file in a different thread may help...
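A minimal sketch of the mmap approach for a raw frame file, using only POSIX calls. The mapped_file type and the function names are invented for illustration and are not taken from yuv2rgb or MPlayer. Mapping the whole file at once assumes it fits in the process address space, which would not hold for a roughly 3 GB file in a 32-bit userland; there the file would have to be mapped in windows.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    const unsigned char *data;  /* start of the read-only mapping */
    size_t size;                /* file size in bytes             */
} mapped_file;

/* Map a whole file read-only. Returns 0 on success, -1 on failure. */
int map_file(const char *path, mapped_file *out)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;
    out->data = p;
    out->size = (size_t)st.st_size;
    return 0;
}

/* Pointer to frame number index, or NULL if it lies past the end of the file. */
const unsigned char *frame_at(const mapped_file *f, size_t frame_bytes, size_t index)
{
    assert(frame_bytes > 0);
    if (index >= f->size / frame_bytes)
        return NULL;
    return f->data + index * frame_bytes;
}

void unmap_file(mapped_file *f)
{
    munmap((void *)f->data, f->size);
    f->data = NULL;
    f->size = 0;
}
```

Frames are then handed to the converter straight from the page cache without an intermediate copy. This does not make the disk any faster, so a raw stream that needs more bandwidth than the drive delivers will still stall.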

unsolo · Post by **unsolo** » Wed Jun 06, 2007 4:02 am

You can easily achieve 60/50 fps, but keep in mind that you need double-buffered input and output.

It runs at 300 FPS at 1920x1080: if you load two images into RAM and test with only those, you will see the results.

The yuvscaler will achieve from 150 to 299 FPS depending on your scale factor.

PS: it's very important to compile with spu-elf-gcc -O2 -fno-exceptions -g to achieve good performance, and I suggest spu-elf-gcc-4.1.1 with the Barcelona patches or spu-elf-gcc-4.3.

hope this helps

unsolo

unsolo · Post by **unsolo** » Fri Oct 12, 2007 6:12 am

OK, time to recruit.

Who wants to help? Please go to the spu-medialib section in the forums.

I need more people, and I don't mind helping train them in how to think SPU.

Basic concept:
Offloading anything to the SPEs gives better overall performance, so why not do it?

Currently I'm looking into whether it's possible to make Xv work,
and there's a working mplayer-vo using spu-medialib.

IronPeter · Post by **IronPeter** » Fri Oct 12, 2007 4:37 pm

128-byte alignment for the DMA is optimal in terms of speed.

It's not obvious, but you also need that alignment in the local store. DMAs with addresses that are aligned in main memory but not in the local store are slow. Probably each memory line is accessed twice in that case.
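A check for this condition is cheap to add as a debug assertion before issuing a transfer. The sketch below is plain C with invented names; it only inspects the two addresses and makes no claim about how much slower a misaligned transfer actually is.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES       128u           /* alignment discussed in this thread */
#define LOCAL_STORE_SIZE (256u * 1024u) /* SPE local store is 256 KB          */

/* Offset of an address within its 128-byte line. */
unsigned line_offset(uint64_t addr)
{
    return (unsigned)(addr & (LINE_BYTES - 1));
}

/* True when both the effective (main memory) address and the local store
 * address start on a 128-byte boundary. */
int dma_fully_aligned(uint64_t ea, uint32_t ls)
{
    assert(ls < LOCAL_STORE_SIZE);
    return line_offset(ea) == 0 && line_offset(ls) == 0;
}
```

On the SPU side, declaring the local store buffer with __attribute__ ((aligned(128))) is the usual way to satisfy the second half of the check.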

unsolo · Post by **unsolo** » Thu Oct 25, 2007 1:29 pm

Threw an experimental SPU Xv driver on SVN.
It does 1080p in X using 1 SPU, so it looks good, but there are lots of TODOs with it.
Expect an install guideline within days.
