R5900 bug: executing instruction after loop before end

Post by **Shine** » Tue Dec 28, 2004 12:47 pm

The following code caused a bug:

	lui	v1, 0x1000
	lui	a3, 0x1f00
	ori	v1, v1, 0x3020
	nop
WaitF:	lw	v0, 0x0000(v1)
	and	v0, v0, a3
	nop	# filling with nops to avoid
	nop	#   "Loop length is too short for r5900."
	nop	#   warning
	bne	v0, zero, WaitF
#	nop
	li	v1, 0x000e

It looks like v1 is set to 0x000e after the first iteration, before the loop has ended, which causes a crash at "lw v0, 0x0000(v1)" on the next iteration. If I include the commented-out final nop, everything works. I also tried adding another nop to the loop body, because the restriction documentation says a loop must contain more than 6 instructions, but this doesn't fix the bug.

Is there an errata document somewhere that lists all known R5900 bugs and how to avoid them?

Post by **pixel** » Tue Dec 28, 2004 1:52 pm

Err....

That is not a bug. Please go back to your MIPS documentation. Every instruction after a branch or jump is systematically executed.
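
For the loop from the first post, this means the slot after the bne has to hold something that is safe to run on every iteration. A minimal sketch, assuming gas syntax with $-prefixed register names (the original uses bare names) and `.set noreorder` so the assembler leaves the slot alone:

```asm
	.set	noreorder
	lui	$v1, 0x1000
	lui	$a3, 0x1f00
	ori	$v1, $v1, 0x3020	# v1 = 0x10003020
	nop
WaitF:	lw	$v0, 0x0000($v1)
	and	$v0, $v0, $a3
	nop				# padding kept from the original post
	nop
	nop
	bne	$v0, $zero, WaitF
	nop				# branch delay slot: executed on every iteration
	li	$v1, 0x000e		# now reached only after the loop has exited
	.set	reorder
```

With the nop in the slot, v1 still holds the address whenever the lw runs again.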

Post by **Shine** » Tue Dec 28, 2004 2:33 pm

Thanks, you are right. After searching a bit, I found it: it is called the delay slot instruction. But I don't think it is very useful.

Post by **pixel** » Tue Dec 28, 2004 2:45 pm

That is the delay slot, yeah. Useful or not, it's there. I haven't really dug into the reasons behind the creation of delay slots, but I know the earlier problems on the good old 586, which had no delay slots: any jump killed the whole pipeline, it was hell to roll back any change already started in the ALU, a lot of cycles were lost, etc. I may be wrong, but delay slots would ensure consistent, predictable timing.

Post by **cheriff** » Tue Dec 28, 2004 7:59 pm

Maybe not so useful for most programmers, but it probably made life easier for the chip designers.
Basically, by the time the CPU has decided whether or not to take the branch (2nd or 3rd stage of the pipeline), the very next instruction has already been loaded into the pipeline.
In the event that the branch is taken, a system could nullify the unneeded instruction into a nop, or do what MIPS does and make it a feature of the architecture.
It's a Good Thing. Imagine a tight loop where 1 in 6 cycles is wasted on a hardware-inserted nop (you may as well underclock the EE to 250 MHz) ...
This way you can stick the nop in there yourself, as you noticed, but you also have the option to place another instruction there to maximize throughput.
All that being said, it's not usually something you'd worry about too much until your app is (mostly) finished and works... debugging optimized code is... uh, interesting.

PS. If the extra nop is not your thing, look at the "branch likely" variants of the regular branches. These assume the branch will be taken: the delay slot instruction is executed only when the branch is taken and is nullified otherwise. You still have to think about the delay slot, but you can often fill it with an instruction from the loop body instead of a nop.
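
A note on the semantics, as far as the MIPS manuals describe them: a branch-likely instruction still has a delay slot, but the slot instruction runs only when the branch is taken and is skipped on fall-through. A minimal sketch of the polling loop rewritten that way, assuming gas syntax with $-prefixed registers and the addresses from the first post:

```asm
	.set	noreorder
	lui	$v1, 0x1000
	lui	$a3, 0x1f00
	ori	$v1, $v1, 0x3020
	lw	$v0, 0x0000($v1)	# first read, done before entering the loop
WaitF:	and	$v0, $v0, $a3
	nop				# padding, as in the original post
	nop
	nop
	nop
	bnel	$v0, $zero, WaitF
	lw	$v0, 0x0000($v1)	# delay slot: re-read, executed only if the branch is taken
	li	$v1, 0x000e		# reached only after the loop has exited
	.set	reorder
```

Note that leaving the li in the slot of a bnel would still run it on every taken iteration, so swapping bne for bnel alone would not fix the original code.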

Post by **blackdroid** » Tue Dec 28, 2004 8:04 pm

MIPS CPUs implement a delay slot for branch instructions. Branches require extra cycles to complete before they exit the pipeline. For this reason, the instruction after the branch is executed while awaiting completion of the branch instruction. (As far as I recall, the same goes for load instructions on the R2000 and R3000 as well.)

So thankfully your insight about it not being useful hasn't stopped real engineers from thinking further than their noses. This is a cycle saver!

As a side note for your code, do not use li (or any other synthetic instruction) after a branch instruction. Your constant fits in 16 bits at the moment, but once it overflows 16 bits the synthetic li instruction will expand into 2 instructions, and most likely the code will not work as expected.

The "loop length is too short" bug is something I've never experienced so far, but it might exist on older machines.
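
To illustrate the point about li, here is a rough sketch of what happens once the constant no longer fits in 16 bits. The register $t0 and the value 0x12345678 are arbitrary choices for the example, and the syntax assumes gas with $-prefixed registers:

```asm
	.set	noreorder
WaitF:	lw	$v0, 0x0000($v1)
	and	$v0, $v0, $a3
	nop				# padding, as in the original post
	nop
	nop
	bne	$v0, $zero, WaitF
	li	$t0, 0x12345678		# macro: does not fit in one instruction

# What the assembler actually emits for the last two lines:
#	bne	$v0, $zero, WaitF
#	lui	$t0, 0x1234		# delay slot: executed on every iteration
#	ori	$t0, $t0, 0x5678	# executed only after the loop falls through
	.set	reorder
```

Only the first half of the expansion lands in the delay slot. I believe gas prints a warning about a macro expanded in a branch delay slot in this situation, but do not rely on it.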

Post by **Shine** » Tue Dec 28, 2004 10:34 pm

blackdroid wrote:As a side note for your code, do not use li (or any other synthetic instruction) after a branch instruction. Your constant fits in 16 bits at the moment, but once it overflows 16 bits the synthetic li instruction will expand into 2 instructions, and most likely the code will not work as expected.

Thanks, I didn't know that li is a synthetic instruction, because in PS2DIS it looks like a normal command, but decoding the opcode reveals that it is sometimes an ori and sometimes an addiu.

This makes my code easier. Instead of writing things like this:

Code:

	li	v1, 0x000e
	lui	v0, 0x1000
	dsll32	v0, v0, 0
	ori	v0, v0, 0x8001
	pcpyld	v1, v1, v0

now I can write it like this:

Code:

	li	v1, 0x000000000000000e
	li	v0, 0x1000000000008001
	pcpyld	v1, v1, v0

Is there a synthetic instruction for loading a qword? If 3 or fewer words of the qword are set, it could be expanded into 5 opcodes, which takes 20 bytes, the same as an lq plus the data stored in memory. If I want to set more words, the synthetic instruction could expand to data plus an lq, I think, or an assembler could analyze the data and perhaps expand it to an optimized combination of the fancy parallel extend and exchange instructions.
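
The "data plus lq" variant can be written by hand. This is only a sketch, assuming gas syntax with $-prefixed registers, a little-endian target, and a hypothetical label `qconst`; it builds the same 128-bit value as the li/li/pcpyld sequence above:

```asm
	.data
	.align	4				# 2^4 = 16-byte alignment, as lq expects
qconst:	.word	0x00008001, 0x10000000		# low 64 bits:  0x1000000000008001
	.word	0x0000000e, 0x00000000		# high 64 bits: 0x000000000000000e

	.text
	la	$v0, qconst			# address of the constant (la is itself a macro)
	lq	$v1, 0x0000($v0)		# one 128-bit load instead of li/li/pcpyld
```

The 20-byte figure only holds if the address is already in a register; the la adds its own instructions on top of that.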

Post by **pixel** » Tue Dec 28, 2004 11:42 pm

You have to know several things btw. If you are using gas to compile your asm code, it will have a shitload of aliases, such as li, or many others I won't list here. Some of them can expand up to three instructions, like

Code:

lw  $t1, outside($t2)

which will expand to

Code:

        lui $t1, HI16(outside)
        addu $t1, $t1, $t2
        lw  $t1, LO16(outside)($t1)

Also, if you are using gas, and if you do NOT put the

Code:

.set noreorder

pseudo-instruction, or if you put the

Code:

.set reorder

pseudo-instruction, and pass the -O option on the gas command line, it will take care of all the delay slots for you and will swap instructions as needed. You just have to write the code in the "natural" code flow, and gas will take care of everything.
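
As a rough sketch of what that looks like for the loop from the first post, assuming gas syntax with $-prefixed registers: the code is written in program order with no explicit delay slot, and the assembler supplies one.

```asm
	.set	reorder
	lui	$v1, 0x1000
	lui	$a3, 0x1f00
	ori	$v1, $v1, 0x3020
WaitF:	lw	$v0, 0x0000($v1)
	and	$v0, $v0, $a3
	nop				# padding, as in the original post
	nop
	nop
	bne	$v0, $zero, WaitF	# no slot written here; gas fills it
	li	$v1, 0x000e		# executed only after the loop has exited
```

Here the bne depends on the result of the and, so the assembler has nothing safe to move into the slot and should fall back to a nop. Disassembling the output is the easiest way to check what it really did.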

Finally, note that there is also a delay problem with loads (I don't remember whether this is also called a delay slot or not) if you do, for example:

Code:

lw    $t1, 0($t0)
addu  $t3, $t3, $t1

You can't use the loaded register in the instruction right after the load from memory; gas will put a nop there, or try to swap code.

Post by **blackdroid** » Wed Dec 29, 2004 12:07 am

You can do

lw t1, 0(t0)
addu t3, t3, t1

and it will work. However, it will stall, and in the case of the R5900 the two instructions will be executed in the same pipeline one after the other, instead of potentially being issued to pipeline 1 and pipeline 2.
So a nop would make it look like this:

pipe1              pipe2
---------------    ------
lw t1, 0(t0)       nop
addu t3, t3, t1

( ok I tried to format it nicely but no.. )
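
Rather than a nop, an independent instruction can be placed between the load and its first use. A minimal sketch, assuming gas syntax with $-prefixed registers; the pointer increment is a made-up example of useful work, and whether one instruction is enough to hide the stall depends on the core's load latency:

```asm
	.set	noreorder
	lw	$t1, 0($t0)		# start the load
	addiu	$t0, $t0, 4		# independent work: advance the pointer
	addu	$t3, $t3, $t1		# first use of t1, one instruction later
	.set	reorder
```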

Post by **Shine** » Wed Dec 29, 2004 12:19 am

pixel wrote:if you put the .set reorder pseudo-instruction, and pass the -O option on the gas command line, it will take care of all the delay slots for you and will swap instructions as needed.

That's nice. Now the assembler even optimizes away the nop after the last "jr ra" and moves the preceding "sq v1, 0x0000(v0)" instruction into the delay slot instead. The disassembled code looks a bit strange, but it works and it is faster.