R5900 bug: executing instruction after loop before end
Posted: Tue Dec 28, 2004 12:47 pm
by Shine
The following code caused a bug:
Code:
lui v1, 0x1000
lui a3, 0x1f00
ori v1, v1, 0x3020
nop
WaitF: lw v0, 0x0000(v1)
and v0, v0, a3
nop # filling with nops to avoid
nop # "Loop length is too short for r5900."
nop # warning
bne v0, zero, WaitF
# nop
li v1, 0x000e
It looks like v1 is set to 0x000e after the first iteration but before the loop ends, which causes a crash at "lw v0, 0x0000(v1)" on the next iteration. If I include the commented-out last nop, everything works. I also tried adding another nop to the loop body, because the restrictions documentation says a loop must contain more than 6 instructions, but that doesn't fix the bug.
Is there an errata document somewhere listing all known R5900 bugs and how to avoid them?
Posted: Tue Dec 28, 2004 1:52 pm
by pixel
Err....
That is not a bug. Please go back to your MIPS documentation. The instruction immediately following a branch or jump is always executed, whether or not the branch is taken.
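To make the delay slot concrete, here is a minimal toy model in Python. It is not R5900-accurate: the tuple instruction format, the `run` and `loop` helpers and the counter register `n` are all made up for illustration. It replays the shape of the loop from the first post and shows that the instruction right after the branch runs on every pass.

```python
def run(program, max_steps=100):
    """Toy MIPS-like interpreter with a single branch delay slot.

    Returns (regs, loads), where loads lists every address used by "lw".
    """
    regs = {"zero": 0}
    loads = []
    pc = 0
    pending = None  # target of a taken branch, applied after the delay slot
    for _ in range(max_steps):
        if pc >= len(program):
            break
        op, *args = program[pc]
        # Fall through by default; if the previous instruction was a taken
        # branch, this one is its delay slot and the jump happens afterwards.
        next_pc = pc + 1 if pending is None else pending
        pending = None
        if op == "li":
            regs[args[0]] = args[1]
        elif op == "addi":
            regs[args[0]] = regs[args[1]] + args[2]
        elif op == "lw":
            loads.append(regs[args[0]])  # only record the address used
        elif op == "bne":
            if regs[args[0]] != regs[args[1]]:
                pending = args[2]
        elif op != "nop":
            raise ValueError(f"unknown op {op!r}")
        pc = next_pc
    return regs, loads


def loop(delay_slot):
    """Two-pass loop shaped like the WaitF loop, with a chosen delay slot."""
    return [
        ("li", "v1", 0x10003020),   # 0: address to poll
        ("li", "n", 2),             # 1: hypothetical pass counter
        ("lw", "v1"),               # 2: WaitF: load through v1
        ("addi", "n", "n", -1),     # 3
        ("bne", "n", "zero", 2),    # 4: branch back to WaitF
        delay_slot,                 # 5: delay slot, always executed
        ("li", "v1", 0x000E),       # 6: code after the loop
    ]


_, buggy = run(loop(("li", "v1", 0x000E)))
_, fixed = run(loop(("nop",)))
print([hex(a) for a in buggy])  # second load uses the clobbered v1
print([hex(a) for a in fixed])  # both loads use the original address
```

With the li sitting in the delay slot, the second load goes through 0x000e; with a nop there, both loads use the original address and v1 is only changed after the loop.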
Posted: Tue Dec 28, 2004 2:33 pm
by Shine
Thanks, you are right. After searching a bit, I found it: it is called a delay slot instruction. But I don't think it is very useful.
Posted: Tue Dec 28, 2004 2:45 pm
by pixel
That is the delay slot, yeah. Useful or not, it's here. I haven't really dug into the reasons behind the creation of delay slots, but I know the issues on the good old 586, which had no delay slots and where any jump killed the whole pipeline: it was hell to roll back any change already started in the ALU, a lot of cycles were lost, etc. I may be wrong, but delay slots would keep execution timing consistent.
Posted: Tue Dec 28, 2004 7:59 pm
by cheriff
Maybe not so useful for most programmers, but it probably made life easier for the chip designers.
Basically, by the time the CPU has decided whether or not to take the branch (2nd or 3rd stage of the pipeline), the very next instruction has already been loaded into the pipeline.
In the event that the branch is taken, a system could nullify the unneeded instruction into a nop, or do what MIPS does and make it a feature of the architecture.
It's a Good Thing. Imagine a tight loop where 1 in 6 cycles is wasted on a hardware-inserted nop (may as well underclock the EE to 250 MHz) ...
This way you can stick the nop in there yourself, as you noticed, but you also have the option to place another instruction there to maximize throughput.
All that being said, it's not usually something you'd worry about too much until your app is (mostly) finished and works... debugging optimized code is... uh, interesting.
PS. If the extra nop is not your thing, look at the "branch likely" variants of the regular branches. These assume the branch will be taken: the delay slot instruction is executed only when the branch is taken and is nullified when it falls through. Note that in a loop like yours the delay slot would still run on every pass that branches back, so it is no substitute for the nop there.
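A small note on the branch-likely variants, as a hedged sketch in Python: as far as the MIPS definition goes, they run the delay slot only when the branch is taken and cancel it on fall-through. The helper name and the "normal"/"likely" tags below are invented for illustration.

```python
def delay_slot_runs(kind, taken):
    """Return True if the delay-slot instruction executes.

    kind is "normal" (e.g. bne) or "likely" (e.g. bnel); these tags are
    made up for this sketch.
    """
    if kind == "normal":
        return True       # delay slot always executes
    if kind == "likely":
        return taken      # nullified when the branch falls through
    raise ValueError(f"unknown branch kind {kind!r}")


# A loop that branches back twice and then falls through.
outcomes = [True, True, False]
for kind in ("normal", "likely"):
    print(kind, [delay_slot_runs(kind, t) for t in outcomes])
```

So in a loop like WaitF, a likely branch would still run the slot on every pass that loops back; it only skips it on the final fall-through.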
Posted: Tue Dec 28, 2004 8:04 pm
by blackdroid
MIPS CPUs implement a delay slot for branch instructions. Branches require extra cycles to complete before they exit the pipeline. For this reason, the instruction after the branch is executed while awaiting completion of the branch instruction. (AFAIR this also goes for load instructions on the R2000 and R3000.)
So thankfully your insight about it not being useful hasn't stopped real engineers from thinking further than their noses. This is a cycle saver!
As a side note on your code: do not use li (or any synthetic instruction) after a branch instruction. Although you have a 16-bit number at the moment, once the value overflows 16 bits the synthetic li instruction will expand into 2 instructions, and the code will most likely not work as expected.
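To illustrate the li point, here is a rough Python sketch of how an assembler might expand a 32-bit li. It is modelled on typical gas output, not taken from its source, and `expand_li` is a hypothetical helper. Only the first instruction of a two-instruction expansion would land in the delay slot.

```python
def expand_li(rd, value):
    """Return the real instructions a 32-bit `li rd, value` could become."""
    value &= 0xFFFFFFFF
    # Reinterpret the 32-bit pattern as a signed number.
    signed = value - (1 << 32) if value & 0x80000000 else value
    if -0x8000 <= signed <= 0x7FFF:
        return [f"addiu {rd}, zero, {signed}"]      # fits a signed 16-bit immediate
    if value <= 0xFFFF:
        return [f"ori {rd}, zero, 0x{value:04x}"]   # fits an unsigned 16-bit immediate
    hi, lo = value >> 16, value & 0xFFFF
    if lo == 0:
        return [f"lui {rd}, 0x{hi:04x}"]            # only the upper half is set
    return [f"lui {rd}, 0x{hi:04x}", f"ori {rd}, {rd}, 0x{lo:04x}"]


for v in (0x000E, 0x8001, 0x10000000, 0x12345):
    print(hex(v), expand_li("v1", v))
```

The last value needs a lui/ori pair, which is the case that breaks when li sits right after a branch.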
The "loop length is too short" bug is something I've never experienced so far, but it might exist on older machines.
Posted: Tue Dec 28, 2004 10:34 pm
by Shine
blackdroid wrote: As a side note on your code: do not use li (or any synthetic instruction) after a branch instruction. Although you have a 16-bit number at the moment, once the value overflows 16 bits the synthetic li instruction will expand into 2 instructions, and the code will most likely not work as expected.
Thanks, I didn't know that li is a synthetic instruction, because in PS2DIS it looks like a normal command, but decoding the opcode reveals that it is sometimes an ori and sometimes an addiu.
This makes my code easier. Instead of writing things like this:
Code:
li v1, 0x000e
lui v0, 0x1000
dsll32 v0, v0, 0
ori v0, v0, 0x8001
pcpyld v1, v1, v0
now I can write it like this:
Code:
li v1, 0x000000000000000e
li v0, 0x1000000000008001
pcpyld v1, v1, v0
Is there a synthetic instruction for loading a qword? If you set 3 or fewer words in the qword, it expands to 5 opcodes, which take 20 bytes, the same as an lq plus the data stored in memory. But if I want to set more words, the synthetic instruction could expand to data plus an lq, I think, or an assembler could analyze the data and perhaps expand it to some optimized combination of the fancy parallel extend and exchange instructions.
Posted: Tue Dec 28, 2004 11:42 pm
by pixel
You have to know several things btw. If you are using gas to assemble your asm code, it has a shitload of aliases, such as li, and many others I won't list here. Some of them can expand to up to three instructions. For example, a load from a symbol address with a register offset
will expand to
Code:
lui $t1, HI16(outside)
addu $t1, $t2
lw $t1, LO16(outside)($t1)
Also, if you are using gas and do NOT put the .set noreorder pseudo-instruction, or if you put the .set reorder pseudo-instruction, AND pass the -O option on the gas command line, it will take care of all the delay slots for you and will swap instructions as needed. You just have to write the code in the "natural" code flow, and gas will take care of everything.
Finally, note there's also a delay problem (I don't remember whether this is also called a delay slot or not) if you do, for example:
Code:
lw $t1, 0($t0)
addu $t3, $t1
You can't use the loaded register in the instruction right after a load from memory; gas will put a nop there, or try to swap code.
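The load-use case can be spotted mechanically. Below is a minimal Python sketch; the (op, dest, sources) tuple format and the `load_use_hazards` name are assumptions made for illustration, not anything gas exposes. It flags an instruction that reads the register loaded by the instruction immediately before it.

```python
def load_use_hazards(instrs):
    """Return indices of instructions that read the register loaded
    by the instruction immediately before them.

    Each instruction is a hypothetical (op, dest, sources) tuple.
    """
    hazards = []
    for i in range(1, len(instrs)):
        prev_op, prev_dest, _ = instrs[i - 1]
        _, _, sources = instrs[i]
        if prev_op == "lw" and prev_dest in sources:
            hazards.append(i)
    return hazards


code = [
    ("lw", "t1", ("t0",)),          # lw   $t1, 0($t0)
    ("addu", "t3", ("t3", "t1")),   # reads $t1 right after the load
]
padded = [code[0], ("nop", None, ()), code[1]]

print(load_use_hazards(code))    # the addu is flagged
print(load_use_hazards(padded))  # nothing flagged once a nop separates them
```

On a CPU without load interlocks the flagged instruction would need a nop or an unrelated instruction in front of it; on an interlocked core the same pattern should only cost a stall.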
Posted: Wed Dec 29, 2004 12:07 am
by blackdroid
You can do
lw t1, 0(t0)
addu t3, t1
and it will work. However, it will stall, and in the case of the R5900 the two instructions will be executed one after the other in the same pipeline, instead of potentially in pipeline 1 and pipeline 2.
So a nop would make it look like this:
pipe1              pipe2
------             ------
lw t1, 0(t0)       nop
addu t3, t1
( ok I tried to format it nicely but no.. )
Posted: Wed Dec 29, 2004 12:19 am
by Shine
if you put the .set reorder pseudo-instruction, AND the -O option to gas command line, it will take care for you of all the delay slots and will swap instructions for you as needed.
That's nice. Now the assembler even optimizes away the nop after the last "jr ra" and uses the preceding "sq v1, 0x0000(v0)" instruction to fill the delay slot instead. The disassembled code looks a bit strange, but it works and it is faster.