Who wants 252MB more RAM for PS3 homebrew.
- StrontiumDog
- Posts: 55
- Joined: Wed Jun 01, 2005 1:41 pm
- Location: Somewhere in the South Pacific
Hi All,
Early fruit of the Hypervisor documentation project is available in:
http://wiki.ps2dev.org/ps3:hypervisor:l ... y_allocate
Basically up to 252MB of memory can be allocated for use by an app from the DDR connected to the RSX.
Caveats :
It only appears to like 64-bit or 32-bit accesses (this aspect needs more research). But even with that limitation, having a further 252MB available for homebrew development, AND having access to the memory on the other side of the RSX, is well worthwhile.
The memory is a finite resource: two apps cannot both request 252MB, which would have some effect on homebrew apps playing nicely together.
Thanks go to urchin for his work on this function. I have confirmed his results and added the material on accessing the memory.
A useful set of functions for the hypervisor module/library would be:
uint64_t *malloc_gpu_memory(int MB); /* Allocate specified number of MB for this application */
void free_gpu_memory(uint64_t *memory); /* Free the memory allocated by the malloc_gpu_memory function */
Strontium Dog
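For anyone who wants to play with the proposed interface before a real hypervisor library exists, here is a minimal user-space mock. It is only a sketch: ordinary heap memory stands in for the mapped GPU RAM, the 252MB budget and whole-megabyte sizes come from the post above, and the bookkeeping table and its size (GPU_MAX_ALLOC) are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define GPU_POOL_MB   252   /* largest request reported to succeed */
#define GPU_MAX_ALLOC 8     /* hypothetical cap on live allocations in this mock */

struct gpu_alloc { uint64_t *base; int mb; };

static struct gpu_alloc gpu_allocs[GPU_MAX_ALLOC];
static int gpu_mb_in_use;

/* Mock: hand out whole megabytes from a fixed 252 MB budget.
 * Heap memory stands in for the real mapped GPU RAM. */
uint64_t *malloc_gpu_memory(int MB)
{
    if (MB < 1 || MB > GPU_POOL_MB - gpu_mb_in_use)
        return NULL;                      /* bad size or budget exhausted */

    for (int i = 0; i < GPU_MAX_ALLOC; i++) {
        if (gpu_allocs[i].base == NULL) {
            uint64_t *p = calloc((size_t)MB << 20, 1);
            if (p == NULL)
                return NULL;
            gpu_allocs[i].base = p;
            gpu_allocs[i].mb = MB;
            gpu_mb_in_use += MB;
            return p;
        }
    }
    return NULL;                          /* no free bookkeeping slot */
}

/* Release a block returned by malloc_gpu_memory; unknown pointers are ignored. */
void free_gpu_memory(uint64_t *memory)
{
    for (int i = 0; i < GPU_MAX_ALLOC; i++) {
        if (memory != NULL && gpu_allocs[i].base == memory) {
            gpu_mb_in_use -= gpu_allocs[i].mb;
            assert(gpu_mb_in_use >= 0);
            free(memory);
            gpu_allocs[i].base = NULL;
            gpu_allocs[i].mb = 0;
            return;
        }
    }
}
```

With this mock, a request that would push the total past 252MB returns NULL until something is freed, which is the "finite resource" behaviour described in the caveats.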
laichung wrote: Anyway, if my memory is correct, the PPU <-> GPU RAM access is very slow compared to PPU <-> main RAM (and now with a hypervisor process in between).
I would be surprised if the hypervisor by itself slowed down memory access.
I guess it would only do so if it played with the MMU to force software R/W to some memory areas, which hopefully is not the case :)
StrontiumDog, how much have you tested this? It doesn't seem to work very well for me. I wrote a driver to expose the extra ram as a MTD block device so that it could be used as swap/whatever. If I write a value to the allocated RAM and read it back quickly, it works. But if I wait a few milliseconds before reading it, the RAM changes back to some other values.
As for speeds, quick tests on my mtd device show about 131MB/s reads and 10.9MB/sec writes. Far from spectacular, but still could be useful as a higher priority swap.
- StrontiumDog
jimparis wrote: StrontiumDog, how much have you tested this?
Not exhaustively. I agree it needs more testing. I did see some strange things, but mostly to do with 8 & 16 bit operations. I will do more testing.
jimparis wrote: It doesn't seem to work very well for me. I wrote a driver to expose the extra ram as a MTD block device so that it could be used as swap/whatever. If I write a value to the allocated RAM and read it back quickly, it works. But if I wait a few milliseconds before reading it, the RAM changes back to some other values.
There seem to be some (as a guess) caching issues that might need to be addressed. What I did find, however, is that if I allocate the memory, write it, read it (it verifies), free it and re-allocate it, the contents revert to 0. But the allocate seems way too fast to be clearing all bits of RAM to 0 (unless the GPU has some really fast way to do it that is being employed). Even then, 252MB is quite a lot to write.
jimparis wrote: As for speeds, quick tests on my mtd device show about 131MB/s reads and 10.9MB/sec writes. Far from spectacular, but still could be useful as a higher priority swap.
Couldn't speculate on this, as I haven't benchmarked it. There are other calls that may be relevant to this though, notably the enigmatic: http://wiki.ps2dev.org/ps3:hypervisor:lv1_gpu_attribute
Although, http://www.watch.impress.co.jp/game/doc ... dps303.htm shows the Cell having 20GB/s write bandwidth to the GPU and 15GB/s read bandwidth, with the GPU having 22.4GB/s bandwidth to its memory. Also, at 10.9MB/sec the http://wiki.ps2dev.org/ps3:hypervisor:l ... te:fb_blit function would not be able to keep up with the frame rate of the Linux console, so the write bandwidth between Cell and GPU (at least using DMA) must be at least 9MB * 60Hz = 540MB/s. [That is, the console frame buffer is in XDR and needs to be blitted to the GPU DDR 60 times a second so it can be displayed by the GPU. 18MB is allocated in Linux for the frame buffers in XDR and it is double buffered, so a single (biggest) frame is ~9MB.]
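The 540MB/s lower bound above is just frame size times refresh rate. A small sketch of that arithmetic, plus how far the measured 10.9MB/s falls short of it; the function names are made up for illustration and decimal megabytes are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second needed to copy one frame of frame_bytes on every refresh. */
uint64_t blit_bandwidth(uint64_t frame_bytes, unsigned refresh_hz)
{
    return frame_bytes * refresh_hz;
}

/* How many times faster than a measured rate the blit path has to be. */
double speedup_needed(uint64_t required_bps, double measured_bps)
{
    assert(measured_bps > 0.0);
    return (double)required_bps / measured_bps;
}
```

For a 9MB frame at 60Hz, blit_bandwidth(9000000, 60) gives 540,000,000 bytes/s, roughly 50 times the direct write speed measured through the MTD driver.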
My next tests were going to be with fb_blit to see if I can blit into mapped GPU DDR memory. Even if direct writes are slow, this could be used to write large amounts of data quickly to GPU DDR memory.
StrontiumDog wrote: Not exhaustively. I agree it needs more testing. I did see some strange things, but mostly to do with 8 & 16 bit operations. I will do more testing.
My accesses are all 64-bit.
StrontiumDog wrote: There seems to be some (as a guess) caching issues that might need to be addressed. What I did find however, is if I allocate the memory, write it, read it (it verifies)
If I allocate it, ioremap it, write a single value to the first location, and then read that back in a loop with some udelay(), I'm seeing it reset to some other value after 15ms or so. Definitely weird.
StrontiumDog wrote: free it and re-allocate it the contents revert to 0. but the allocate seems way to fast to be clearing all bits of ram to 0 on the allocate (unless the GPU has some really fast way to do it that is being employed. Even then, 252MB is quite a lot to write.
I'm not seeing it set to 0 on allocate -- I'm seeing it full of other data that looks like it's coming from main RAM (but writing over the full 252MB doesn't overwrite main RAM).
Regarding speed, my driver is writing quadwords in a for() loop because I was first aiming for correctness and wanted to ensure I was always doing 64-bit reads/writes. I'm sure DMA/blit/whatever can improve things.
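A copy loop restricted to 64-bit accesses, as described above, might look roughly like this. It is a sketch, not the ps3vram code: copy_words64 is a made-up name, and plain pointers stand in for the ioremapped region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n_words 64-bit words, one aligned 64-bit load and store per word.
 * volatile asks the compiler to perform every access as written; on a
 * 64-bit target each one is normally a single 8-byte load or store. */
void copy_words64(volatile uint64_t *dst, const volatile uint64_t *src,
                  size_t n_words)
{
    assert(((uintptr_t)dst % sizeof(uint64_t)) == 0);
    assert(((uintptr_t)src % sizeof(uint64_t)) == 0);
    for (size_t i = 0; i < n_words; i++)
        dst[i] = src[i];
}
```

A loop like this favours correctness over speed, which fits the slow write figures reported earlier.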
Anyway, it's waaaay too late for me to be coding here, I'll test it a bit further tomorrow and make sure I'm doing everything right.
- StrontiumDog
jimparis wrote: My accesses are all 64-bit. If I allocate it, ioremap it, write a single value to the first location, and then read that back in a loop with some udelay(), I'm seeing it reset to some other value after 15ms or so. Definitely weird.
I didn't do any data longevity tests, so I believe what you are seeing is occurring. It sounds like the sort of strange problems I was getting when testing 8 and 16 bit reads.
jimparis wrote: I'm not seeing it set to 0 on allocate -- I'm seeing it full of other data that looks like it's coming from main RAM (but writing over the full 252MB doesn't overwrite main RAM).
That is weird.
Seems I spoke too soon. This needs a lot more messing with to try and work out what the heck is happening.
- StrontiumDog
jimparis wrote: I'm seeing it reset to some other value after 15ms or so. Definitely weird.
ldesnogu wrote: 15 ms looks suspiciously close to the framerate. BTW doesn't the 256 MB area also contain GPU IO address space?
It doesn't appear to, no. See: http://wiki.ps2dev.org/ps3:hypervisor:l ... device_map
Audio registers are in GPU IO space, and they are not in the 256MB area. It would seem to be a separate memory block.
The current method used to display things by all linux clones installed on PS3 is the "framebuffer" software method, right?
So, it's a memory area in cpu-side ram (theoretically).
My guess is what you see on screen may be in gpu-side ram.
So there might be some permanent, massive data transfers from the cpu-side ram to gpu-side ram (probably a bitblt gpu operation, triggered at each vertical blank event, which is just a standard dma access controlled by the GPU).
That may explain "strange things" if you mess with the real gpu-side frame buffer area (it gets overwritten once per frame with data coming from cpu-side frame buffer area). Well, just a theory for now...
But if you notice changes, compare them with the exact instant the vertical blank event occurs...
Another thing: If you detect strange patterns (16 dwords blocks -4x4pixels-, more likely 1 dword set every 4 ones on a pixel line, then 3 empty pixel lines) it may be a depth stencil buffer (same size as frame buffer, but used to know the Z depth of pixel when 3D is rendered).
I doubt it's used when Linux runs, but if ps3 menu does 3D and memory is not cleaned up you can see some left traces. The 4x4 blocks are an automatic way to reduce bandwidth, an automatic compression algorithm used in NVidia GPU's (most of the time only 1 dword is read instead of 16).
If you are interested in how a GPU works, see the open source drivers for the xbox1 gpu (nv2A). Everything has been worked out and all dma accesses are under control, so you can learn how they work, even if it's an older NVidia chip (RSX is probably of the nv40 family).
However I may be wrong, and hypervisor may trick us nicely...
ps2devman wrote: So there might be some permanent, massive data transfers from the cpu-side ram to gpu-side ram (probably a bitblt gpu operation, triggered at each vertical blank event, which is just a standard dma access controlled by the GPU).
Right, the blitting is done by a kernel thread, so your explanation looks good.
But I think the GPU won't use the whole 256MB of RAM for the bitblt process. The RAM for the framebuffer should be a fixed allocation made when the kernel starts. The point of a malloc function is to allocate free memory from the RAM space. Also, if the RAM we allocated is overwritten by the framebuffer, I think the value should be something other than 0.
So, actually, how much memory can we allocate from the hypervisor? Maybe the function only allows allocating memory from the FB area, or there is a bug in the function that makes it allocate memory which overlaps the FB area.
ps2devman wrote: So there might be some permanent, massive data transfers from the cpu-side ram to gpu-side ram (probably a bitblt gpu operation, triggered at each vertical blank event, which is just a standard dma access controlled by the GPU).
ldesnogu wrote: Right, the blitting is done by a kernel thread, so your explanation looks good.
We have the code for the kernel, so there are some things that we do not need to speculate about.
18 MB (+ 1MB for alignment) of memory is allocated by default on startup by the kernel for the local framebuffer. The kernel also takes care of initialising the GPU, mapping the GPU IO interface into memory, setting up the GPU framebuffer and registering for the GPU vsync interrupt to know when to perform blits and buffer flips.
The curious part here is a define called DDR_SIZE, which is defined as 0 in the code and passed to lv1_gpu_memory_allocate during setup. Through a bit of trial and error, it was determined that values of 1MB to 252MB (in 1MB increments) can be "allocated". How the GPU will let us use this memory is the interesting part.
- StrontiumDog
Problem Solved (i think).
The first n words keep getting overwritten, almost certainly by the screen refresh blit. My PS3 is running in PAL* mode (6), which has a frame buffer of 720*576*4 bytes = 1,658,880 bytes, double buffered = 3,317,760 bytes.
My system can not verify memory written to the first 0x32A00 64-bit words, which equals exactly 0x32A00*16 = 3,317,760 bytes.
Notice the correspondence between these 2 numbers. Coincidence? I think not.
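The numbers above are easy to re-derive. The sketch below (helper names are made up) computes the PAL frame buffer sizes and converts the 0x32A00 count to bytes. Note that the post multiplies the count by 16 although a 64-bit word is 8 bytes: with 16-byte units the corrupted region matches the double buffer exactly, while with 8-byte units it would match a single buffer. The post does not say which unit the test really counted in, so the helper takes the unit size as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of `buffers` frame buffers of width x height pixels. */
uint64_t framebuffer_bytes(uint32_t width, uint32_t height,
                           uint32_t bytes_per_pixel, uint32_t buffers)
{
    return (uint64_t)width * height * bytes_per_pixel * buffers;
}

/* Convert a count of fixed-size units into bytes. */
uint64_t units_to_bytes(uint64_t count, uint32_t unit_bytes)
{
    return count * unit_bytes;
}
```

framebuffer_bytes(720, 576, 4, 2) and units_to_bytes(0x32A00, 16) both give 3,317,760; framebuffer_bytes(720, 576, 4, 1) and units_to_bytes(0x32A00, 8) both give 1,658,880.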
My test wrote a non-aliasing pattern through memory, waited 1 second then verified. After offset 3,317,760 bytes from the start of GPU memory, the memory still held the contents written during the write test.
So even assuming the worst case frame buffer size of 18MB that still leaves 234 MB to play with.
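The write, wait, verify test described above can be mimicked in user space. This is only a sketch under stated assumptions: an ordinary array stands in for the mapped GPU memory, simulate_refresh_blit is a made-up stand-in for the screen refresh overwriting the start of the region, and the pattern function is just one way to give every word a distinct value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index-derived pattern: the multiplier is odd, so every word gets a
 * distinct, non-zero value. */
static uint64_t pattern_for(size_t index)
{
    return 0x9E3779B97F4A7C15ull * (uint64_t)(index + 1);
}

void fill_pattern(uint64_t *mem, size_t n_words)
{
    for (size_t i = 0; i < n_words; i++)
        mem[i] = pattern_for(i);
}

/* Stand-in for the screen refresh: clobber the first n_words words. */
void simulate_refresh_blit(uint64_t *mem, size_t n_words)
{
    for (size_t i = 0; i < n_words; i++)
        mem[i] = 0;
}

/* Index of the first word from which everything up to the end still
 * verifies. Returns n_words if even the last word is corrupted. */
size_t first_stable_word(const uint64_t *mem, size_t n_words)
{
    size_t stable = n_words;
    for (size_t i = n_words; i-- > 0; ) {
        if (mem[i] != pattern_for(i))
            break;
        stable = i;
    }
    return stable;
}
```

On real hardware the value returned by first_stable_word, converted to bytes, would be the offset below which the memory should be left alone.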
Next task is to benchmark this memory, and try and write to it using blits, to benchmark the blits, and try and move the Frame buffer offset .........
Strontium Dog
* My fricken 1920x1200 LCD monitor with DVI doesn't support HDCP, so I have to use its composite input :(
So this means the hypervisor function only returns a pointer to a DDR address without any verification or standard malloc processing. I think this is a serious bug.
And we know that the Linux kernel allocated some memory for the FB when it started, so why does the function still return an address which overlaps the FB? I mean, if we call malloc_gpu_memory() twice, it should return two different starting addresses that do not overlap each other (this is why we need malloc). So I wonder, does the function check the usage of the RAM or not? If not, I think it is a little dangerous to use this RAM in our programs, or we need to code something ourselves to deal with the problem.
Anyway, thanks for what you found.
StrontiumDog wrote:Problem Solved (i think).
My test wrote a non-aliasing pattern through memory, waited 1 second then verified. After offset 3,317,760 bytes from the start of GPU memory, the memory still held the contents written during the write test.
So even assuming the worst case frame buffer size of 18MB that still leaves 234 MB to play with.
Next task is to benchmark this memory, and try and write to it using blits, to benchmark the blits, and try and move the Frame buffer offset .........
lv1_gpu_memory_allocate returns a logical partition address for the allocated memory. This lpar address must be iomapped before the memory can be accessed. The allocator does seem to perform some checking as it is not possible, for example, to allocate 252 MB a second time without freeing the first allocation.
StrontiumDog wrote: Problem Solved (i think).
Agreed, it seems to work much better now. I can now use my ps3vram driver for anything and it seems to be fine. It's still a little messy and slow, but the driver is simple and it should be straightforward to implement any speedups we can find.
I've added it to the ps3dev svn under /linux/ps3vram. Here is the README:
Code:
PS3vram by Jim Paris <[email protected]>.

Description
-----------
This is a Linux kernel module to use the extra PS3 GDDR video RAM as
a MTD block device. It requires MTD block device support in the
kernel (CONFIG_MTD && CONFIG_MTD_BLOCK). It is not currently very
fast, but could be useful as swap space or additional storage for
LiveCD systems.

It has been tested on the 2.6.16-ps3 kernel (20061110) but should work
on others.

Installation
------------
To build, have your kernel source or headers installed properly and
then run:

    make
    sudo make install

To use it, load the appropriate modules:

    sudo modprobe mtdblock
    sudo modprobe ps3vram

This should create a new block device /dev/mtdblock0.

Using it
--------
The block device behaves like any other disk. To use it as swap:

    sudo mkswap /dev/mtdblock0
    sudo swapon /dev/mtdblock0 -p 10

To create and mount a filesystem:

    sudo mkfs.ext3 /dev/mtdblock0
    sudo mkdir -p /mnt/tmp
    sudo mount /dev/mtdblock0 /mnt/tmp

Acknowledgements
----------------
StrontiumDog and urchin for figuring out how to map and use the RAM.
See also http://forums.ps2dev.org/viewtopic.php?p=53486
Code:
Filename          Type        Size      Used   Priority
/dev/sda2         partition   6958072   112    -1
/dev/mtdblock0    partition   241840    5792   10
Code:
Filesystem        Size   Used   Avail   Use%   Mounted on
/dev/mtdblock0    229M   6.1M   211M    3%     /mnt/tmp
- StrontiumDog
laichung wrote: So this means the hypervisor function only returns a pointer to a DDR address without any verification or standard malloc processing. I think this is a serious bug.
It isn't a malloc type function. Memory must be requested in 1MB blocks. It is really just a function that maps the GPU memory into the Linux Logical Partition memory space. It simply returns a pointer to some GPU memory.
laichung wrote: And we know that the Linux kernel allocated some memory for the FB when it started,
Actually the Linux kernel allocates 18MB of memory in the XDR connected to the Cell for a "virtual" frame buffer. No memory is allocated in the GPU by any Linux kernel I have seen. In fact the comment in the kernel says "don't allocate memory in the GPU".
laichung wrote: so why does the function still return an address which overlaps the FB? I mean, if we call malloc_gpu_memory() twice, it should return two different starting addresses that do not overlap each other (this is why we need malloc). So I wonder, does the function check the usage of the RAM or not? If not, I think it is a little dangerous to use this RAM in our programs, or we need to code something ourselves to deal with the problem.
The function will do what you say, but it isn't a malloc routine per se. The Linux kernel never calls it to allocate space for the GPU side of the frame buffer. Sony assume that no one can access the GPU memory, so they don't need to reserve any for a frame buffer. It's not a bug, just something we need to be aware of.
laichung wrote: Anyway, thanks for what you found.
No problem.
blit write
Why not submit this to the Linux kernel, so that any distribution can take advantage of the extra PS3 GPU memory?
But on second thought, you probably need a standard for allocating
memory on external devices, which is different from disk fread and fwrite.
Can you use part of the framebuffer for fast writes?
You can use the frame buffer memory in cpu side
memory as a fast blit for fast write operations to the gpu side memory.
Just make sure when you initialize the screen you initialize it larger than
your current screen size. So for 9MB screen you purposely set 18MB
screen (so now you have a guarantee that the framebuffer is not at the bottom of the 256 in cpu-side memory, and leaves at least 9MB for your
blitting). After that is done, you set your screen to 9MB using realloc.
This frees up the bottom 9MB for regular use. Then when you need to
fast blit, just set your screen size to 18MB using realloc just for the
blit, and when done reset to 9MB. Only three caveats...
One is that only 9MB of your total 240MB of GPU memory can be fast-write blitted, and that 9MB must be located (using realloc) at the bottom of the framebuffer in cpu-side memory. No memory is wasted because, after the blit, you can free that bottom 9MB for use again. I don't think the screen will flicker if it is done during the vsync and you switch the resolution back
before the real lcd output.
Caveat two is that before the blit, you have to move whatever previously occupied that cpu-side 9MB space to another cpu-side memory location, if it held data.
Caveat three is that your screen must be less than 1080p? Don't know,
maybe it will work... RGBA (4bytes) * 1920 * 1080 = 8MB, why is so
much space wasted for 720p?
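The staging idea above boils down to pushing a large transfer through a fixed-size window, one chunk at a time. A rough user-space sketch of just that chunking logic follows. Everything here is an assumption for illustration: fake_blit is a memcpy standing in for whatever hypervisor blit call would really move the window to GPU memory, and none of the screen-size juggling is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical blit: here just a memcpy from the staging window to the target. */
static void fake_blit(unsigned char *dst, const unsigned char *staging, size_t len)
{
    memcpy(dst, staging, len);
}

/* Move len bytes from src to dst in window-sized chunks via a staging buffer.
 * Returns the number of blits issued. */
size_t staged_copy(unsigned char *dst, const unsigned char *src, size_t len,
                   unsigned char *staging, size_t window)
{
    size_t blits = 0;

    assert(window > 0);
    for (size_t done = 0; done < len; done += window) {
        size_t chunk = (len - done < window) ? len - done : window;
        memcpy(staging, src + done, chunk);     /* CPU fills the window */
        fake_blit(dst + done, staging, chunk);  /* window is moved across */
        blits++;
    }
    return blits;
}
```

With a 9MB window, filling the 234MB left after the frame buffer would take 26 such blits.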
Tell me if I'm wrong...
CPU writing directly to gpu-side ram (alias vram, alias local memory):
=> around 4GB/s
CPU reading directly from gpu-side ram (alias vram, alias local memory):
=> around 16MB/s (this isn't a typo...)
But of course the GPU can use DMA (i.e. the CPU can issue a command to the GPU, named bitblt) to move data at around 15-22GB/s from anywhere to anywhere.
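To put those rates in perspective, here is a tiny sketch that turns a size and a bandwidth into a transfer time. The helper name is made up, and binary megabytes and gigabytes are assumed.

```c
#include <assert.h>

#define BYTES_PER_MB (1024.0 * 1024.0)
#define BYTES_PER_GB (1024.0 * BYTES_PER_MB)

/* Seconds needed to move `bytes` at `bytes_per_second`. */
double transfer_seconds(double bytes, double bytes_per_second)
{
    assert(bytes_per_second > 0.0);
    return bytes / bytes_per_second;
}
```

Reading all 252MB back at 16MB/s takes transfer_seconds(252 * BYTES_PER_MB, 16 * BYTES_PER_MB), about 15.75 seconds, while writing it at 4GB/s would take roughly 0.06 seconds.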
This was noticed by "The Inquirer" in June 2006, and they were shocked. However, professional developers don't worry about it at all; they consider that the CPU doesn't have to read directly from gpu-side ram.
http://www.theinquirer.net/default.aspx?article=32171
Last thing...
The NV2A needed to store a few KB of data in vram for the internal GPU system code. The driver also stored data there for the "context switching" that lets two programs ignore each other's GPU settings and just retrieve their own when needed (btw, if we can write to that before the context switch, that means we can alter their GPU settings...).
So my question is... has none of you noticed strange data stored there, besides frame buffers? If not, maybe the RSX nvidia chip has additional internal memory... I haven't studied nv40 seriously yet.
ps2devman wrote: However, professional developers don't worry about it at all; they consider that the CPU doesn't have to read directly from gpu-side ram.
In theory, they shouldn't. The GPU memory is exactly that: dual-ported memory to provide quick writes from the CPU and quick reads/writes from the GPU. This is commonly known as VRAM and even on PCs, it's well known that performance sucks for any VRAM->CPU operations.
The only time the CPU is supposed to be using GPU memory is for uploading textures or vertices for use by the GPU. It was never meant for program data storage.
ps2devman wrote: CPU reading directly from gpu-side ram => around 16MB/s
If my last post sounded negative, then the bright side is that this speed is comparable to most drives' true throughput from a couple of years ago and only about half what modern drives can do, so it's not that bad for use as swap space. On a live CD where disk might not be available for swap, it's fantastic.
- StrontiumDog
Could anyone doing any SPU programming try and DMA data from an SPU to GPU memory and benchmark that??
The reason why I ask is the SPU's are supposed to be tuned to do multimedia type operations. It would be really stupid if they had crap DMA write and read performance to GPU memory.
I haven't done any SPU coding or even looked at that aspect yet, otherwise I would throw something together myself, but it seemed likely that someone else who is already doing SPU stuff could test this easier and quicker than I could get up to speed on the SPU details.
StrontiumDog wrote: Could anyone doing any SPU programming try and DMA data from an SPU to GPU memory and benchmark that??
The main problem is that the memory is allocated in kernel space, so we'd need an extension so that it can be accessed in userland. Of course, this may be exactly what the hypervisor module is for, but I haven't yet looked at it, as I still haven't got my standard kernel quite as I want it.
However, given that the framebuffer is exposed in the same way, we can probably assume results from that are similar. I haven't benchmarked reading from the screen, but certainly the SPU can write to the graphics memory very quickly indeed. If you recall, my julia demo was blitting to graphics memory at ~800 FPS for 480p and 155 FPS for 1080p and that involved lots of floating point calculations too.
ralferoo wrote: The main problem is that the memory is allocated in kernel space, so we'd need an extension so that it can be accessed in userland. Of course, this may be exactly what the hypervisor module is for, but I haven't yet looked at it, as I still haven't got my standard kernel quite as I want it.
What we need is a kernel module that accepts mmap requests. I am no expert, so I will leave that to jimp :)
ralferoo wrote: However, given that the framebuffer is exposed in the same way, we can probably assume results from that are similar. I haven't benchmarked reading from the screen, but certainly the SPU can write to the graphics memory very quickly indeed. If you recall, my julia demo was blitting to graphics memory at ~800 FPS for 480p and 155 FPS for 1080p and that involved lots of floating point calculations too.
What you measured here is the bandwidth from SPE to XDR, not to VRAM. Only the kernel ps3fbd is doing XDR to VRAM transfers, and I am ready to bet it is not doing it at 800 FPS :)
BTW the maximum transfer rate I saw between SPEs and XDR is about 20GB/s, which is more than 10k 480p "FPS", with no computation of course.
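The "more than 10k FPS" figure can be checked with a one-line division. In this sketch the helper name is made up, and 480p is assumed to mean 720x480 at 4 bytes per pixel with 20GB/s taken as 20,000,000,000 bytes/s; none of those specifics are stated in the post.

```c
#include <assert.h>
#include <stdint.h>

/* Whole frames per second that fit in bytes_per_second of bandwidth. */
uint64_t frames_per_second(uint64_t bytes_per_second,
                           uint32_t width, uint32_t height,
                           uint32_t bytes_per_pixel)
{
    uint64_t frame_bytes = (uint64_t)width * height * bytes_per_pixel;

    assert(frame_bytes > 0);
    return bytes_per_second / frame_bytes;
}
```

Under those assumptions, frames_per_second(20000000000, 720, 480, 4) comes out at 14467, comfortably above 10k.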
ldesnogu wrote: What you measured here is the bandwidth from SPE to XDR, not to VRAM. Only the kernel ps3fbd is doing XDR to VRAM transfers, and I am ready to bet it is not doing it at 800 FPS :)
Doh! You're absolutely right!
However, we do know that it's one of the SPUs that's blitting from XDR memory to VRAM, and it must be capable of blitting at least 160MB/s (1280*1080*4 bytes per pixel*30fps). That's just one SPU, and already it's an order of magnitude higher than the limit we're worrying about. So, clearly write performance isn't bad from the SPUs.
ralferoo wrote: However, we do know that it's one of the SPUs that's blitting from XDR memory to VRAM.
We are not sure it's an SPU that does the job... Anyway I am ready to bet the hypervisor makes the transfer using DMA, and that's why the question StrontiumDog asks should get answered :)
ralferoo wrote: Given that it must be capable to blitting at least 160Mb/s (1280*1080*4bpp*30fps). And that's just one SPU and already it's an order of magnitude higher than the limit we're worrying about. So, clearly write performance isn't bad from the SPUs.
I think the data you can see on the Inquirer link posted above is accurate.
But it is only achievable using DMA: I quickly wrote a memcpy optimized with usual stuff (manual unrolling, using altivec registers, L2 cache prefetching) and could only reach about 8 GB/s of R+W; when using DMA I saw above 20 GB/s.
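For anyone wanting to repeat the CPU-copy side of that comparison, a rough throughput measurement could be sketched as below. This is not the optimized memcpy mentioned above: it times the standard library memcpy, uses clock() (CPU time) for portability, and counts only bytes copied rather than read plus write traffic. The function name is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Rough memcpy throughput in bytes copied per second of CPU time.
 * Returns 0.0 if the run was too short for clock() to resolve. */
double memcpy_throughput(void *dst, const void *src, size_t len, unsigned repeats)
{
    volatile unsigned char sink = 0;  /* read back so the copies stay live */
    clock_t start, end;

    assert(len > 0 && repeats > 0);
    start = clock();
    for (unsigned r = 0; r < repeats; r++) {
        memcpy(dst, src, len);
        sink ^= ((unsigned char *)dst)[r % len];
    }
    end = clock();
    (void)sink;

    if (end <= start)
        return 0.0;
    return (double)len * repeats / ((double)(end - start) / CLOCKS_PER_SEC);
}
```

Buffers should be much larger than the cache for the result to say anything about memory bandwidth; small buffers mostly measure cache speed.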
ralferoo wrote: However, we do know that it's one of the SPUs that's blitting from XDR memory to VRAM. Given that it must be capable to blitting at least 160Mb/s (1280*1080*4bpp*30fps). And that's just one SPU and already it's an order of magnitude higher than the limit we're worrying about. So, clearly write performance isn't bad from the SPUs.
It's actually the GPU that does the blitting (upon receiving a command from the SPU running the hypervisor).
@ps2devman: on that slide shown in The Inquirer's article, do you know what "Local Memory" means? I'm a little bit intrigued by this 16MB/s read from the Cell. Do they mean VRAM?
mbf wrote: It's actually the GPU that does the blitting (upon receiving a command from the SPU running the hypervisor).
Where did you get that information from? All we know is that the kernel makes a hypervisor call to transfer its FB (in XDR) to GPU VRAM.
I bet it's done using the MFC DMA from the Cell.
mbf wrote: @ps2devman: on that slide shown in The Inquirer's article, do you know what "Local Memory" means? I'm a little bit intrigued by this 16MB/s read from the Cell. Do they mean VRAM?
Local memory is indeed GPU local memory, that is, VRAM.