Stupid assembly tricks

If you’re tight on registers, instead of using a third register to swap values, you can

eor r1, r2
eor r2, r1
eor r1, r2

and that’ll swap r1 and r2. It’s the same three instructions as swapping through a third register, so it’s no slower, and it saves you the push/pop you’d otherwise need to free a register up.
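To see why the three eor instructions end up swapping the values, here is a minimal Python sketch that models the registers as 32-bit integers. The function name `eor_swap` is made up for illustration; it is a model of the trick, not GBA code.

```python
def eor_swap(r1, r2):
    """Model the three-instruction eor swap on 32-bit register values."""
    mask = 0xFFFFFFFF
    r1 = (r1 ^ r2) & mask  # eor r1, r2 -> r1 holds old_r1 ^ old_r2
    r2 = (r2 ^ r1) & mask  # eor r2, r1 -> r2 now holds the original r1
    r1 = (r1 ^ r2) & mask  # eor r1, r2 -> r1 now holds the original r2
    return r1, r2


print([hex(v) for v in eor_swap(0x12345678, 0xDEADBEEF)])
```

One caveat: this only works on two different registers. If both operands name the same register, the first eor zeroes it and the value is lost.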


Yeah, it is useful.

Another way I can come up with:
push {r1, r2}
pop {r2, r1}

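See, the problem with that is that, according to the ARM ARM (the ARM Architecture Reference Manual), push/pop always transfer your registers in numerical order, lowest-numbered register at the lowest address, no matter what order you write them in. The register list is just a bitmask in the opcode. If you assemble the snippet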

.thumb
.org 0x0
pop {r1, r3, r2}
pop {r1, r2, r3}

and then put the output in a hex editor, you’ll find that it’s

0E BC 0E BC

meaning that the two instructions are the same. To swap them, you’d have to

push {r1, r2}
pop {r2}
pop {r1}
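As a rough check of the above, this Python sketch encodes a Thumb pop from its register list and models the stack as a list whose last element is the top. The helper names (`thumb_pop_encoding`, `push`, `pop`) are hypothetical, and the model only covers low registers r0-r7.

```python
def thumb_pop_encoding(regs):
    """Encode a Thumb `pop {low registers}` as its 16-bit opcode."""
    rlist = 0
    for r in regs:
        assert 0 <= r <= 7, "only r0-r7 fit in the register-list bitmask"
        rlist |= 1 << r  # written order is lost: each register is one bit
    return 0xBC00 | rlist


def push(stack, regs, values):
    # Highest register is stored first, so the lowest-numbered
    # register ends up on top of the stack (lowest address).
    for r in sorted(set(regs), reverse=True):
        stack.append(values[r])


def pop(stack, regs, values):
    # Lowest-numbered register is loaded first, from the top of the stack.
    for r in sorted(set(regs)):
        values[r] = stack.pop()


regs = {1: "A", 2: "B"}
stack = []
push(stack, [1, 2], regs)
pop(stack, [2], regs)   # r2 takes the old r1
pop(stack, [1], regs)   # r1 takes the old r2
print(regs)
```

Both orderings of the register list give the same opcode, which is why a single `pop {r2, r1}` cannot swap anything and the two separate pops are needed.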

Yeah, you are right, thanks~

Building on the fact that a bx pc in Thumb code lands you two instructions ahead, you can use that to switch to ARM state, and then

ldr rm, =0xAnywhere
bx rm

to jump anywhere in the address space (4 GB, from 0x00000000 to 0xFFFFFFFF; the GBA address space is technically 0x00000000 to 0x0FFFFFFF, although gamepak SRAM stops at 0x0E00FFFF). When you get there, whether you execute ARM or Thumb code depends on bit 0 of the address you loaded into rm.
[spoiler=technical note]
The ARM7TDMI implements ARMv4T, where bx rm copies bit 0 of rm into the Thumb state bit: an even address starts executing in ARM state, and an odd address (target + 1) starts executing in Thumb state at the target. A bx pc from Thumb code always lands in ARM state, because pc reads back with bit 0 clear. [/spoiler]
So if you’re trying to modify a function in place, and need more room, bx rm your way anywhere.

I make no guarantees as to whether this will work or not.
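For anyone unsure which state a bx ends up in, here is a small Python sketch of how an ARMv4T core such as the ARM7TDMI treats the register value, as I understand the architecture manual. The function name `bx_target` and the example addresses are purely illustrative.

```python
def bx_target(rm):
    """Return (state, address) reached by `bx rm` on an ARMv4T core."""
    rm &= 0xFFFFFFFF
    if rm & 1:
        # Bit 0 set: switch to Thumb state, bit 0 is cleared in pc.
        return "thumb", rm & ~1
    # Bit 0 clear: ARM state. The address should also be word-aligned
    # (bit 1 clear); the manual calls the other case unpredictable.
    return "arm", rm


for value in (0x08000001, 0x03000000):
    state, address = bx_target(value)
    print(state, hex(address))
```

So to continue in Thumb code somewhere else, load the destination address plus one before the bx.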

You might want to read the previous two threads on longcalling:

bx rm (where rm is not sp/lr/pc) is basically just a long… goto, which might be all you need if the point is just to “continue” the function somewhere else. But things can get messy rather quickly if you make little patches like that all the time. Better to replace the entire original function with just the bx, and then bx lr back out of that (the idea being that lr still holds the ‘original’ return address).

But yeah, it’d be good to have some kind of gathered-together guide on writing good asm code, starting with things like the standard calling convention, setup/cleanup code for each function etc. We could get it started ITT, even.

The simple .align command.

.align n

This’ll align the next address to a multiple of 2^n bytes. By default, .align is .align 2, which’ll align to word length.
.align 2 aligns to the next address ending in 0x0, 0x4, 0x8, or 0xC
.align 3 aligns to the next address ending in 0x0 or 0x8
.align 4 aligns to the next address ending in 0x0

Although that’s more an assembler trick than an assembly trick.
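If the 2^n rule is hard to picture, this Python sketch computes where the next instruction or data would land after `.align n`. The name `align_up` and the sample addresses are just for illustration.

```python
def align_up(address, n):
    """Round address up to the next multiple of 2**n, as `.align n` does."""
    boundary = 1 << n
    return (address + boundary - 1) & ~(boundary - 1)


for n in (2, 3, 4):
    print(n, hex(align_up(0x08000009, n)))
```

An address that is already on the boundary stays where it is, so no padding is emitted in that case.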

ldrb r1, [something]
cmp r1, #0x0
beq specialcase
mov r0, #0x0
mov r2, #0x1 
mov r4, #0x0

start:
add r0, r0, #0x1
mov r3, r2
mov r2, r4
add r4, r3, r2
cmp r1, r0
bne start
strb r4, [somewhere]
bx @wherever we came from

specialcase: @r1 is 0
lrdb r4, #0x0
strb r4, [somewhere]
bx @back where we came from

Fibonacci sequence. In case, you know, you want to make a weapon that does damage based on it to confuse your player. I’m particularly pleased by the fact that the main loop is six instructions.

Where you wrote lrdb r4 #0x0 I assume that was a typo for ldrb, but presumably you actually meant mov since the default return value should be the value 0, not the contents of address 0 (which the GBA architecture protects anyway). So in a sense you don’t need a ‘special case’; just do the initial mov r4, #0x0 before cmp r1, #0x0, and then you find that you have two copies of the same cleanup code and you can just merge them.

But we can do better than that… :smile: That takes us down to 14 instructions with 6 in the main loop, but with some ingenuity we can factor the code into two separate functions (one that loads/stores to memory as before, and one that takes a parameter and returns a value normally) and make a main loop that’s still 6 instructions, but computes two values each time through the loop. Also we can do it with 2 fewer registers, which in turn means not worrying about saving and restoring r4. There are 15 total instructions here; we avoid the need for a branch that skips the loop entirely, but we add a bl/bx pair for the factoring into two functions.

fibonacci:
@ setup: use r2 as counter, r0 and r1 store a pair of successive terms
@ hack for optimization: set up so that we start computation from F(-2).
@ This lets us reorder the loop to minimize branch and compare instructions.
add r2, r0, #0x2
mov r1, #0x1
neg r0, r1 @ "mov r0, #-0x1"
@ Compute terms two at a time and check if we're done.
loop:
add r0, r0, r1
add r1, r0, r1 @ first time through, we get r0 = 0, r1 = 1
sub r2, #0x2
beq return_r0
cmp r2, #0x1
bne loop
mov r0, r1 @ If we exited the loop "at the end", return r1 (the return value goes in r0).
return_r0:
bx lr
@  
@ Wrapper to match original interface.
ldrb r0, @ source addr
bl fibonacci
strb r0, @ dest addr - but why limit it to a byte? >_< 
bx @ origin
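To double-check the two-terms-per-pass loop, here is a line-by-line Python model of it. It assumes n is a non-negative integer and ignores 32-bit wraparound, so it only matches the hardware for small inputs.

```python
def fibonacci(n):
    """Mirror the two-terms-per-pass loop; n is assumed to be >= 0."""
    counter = n + 2          # add r2, r0, #0x2
    a, b = -1, 1             # r0 = F(-2), r1 = F(-1)
    while True:
        a = a + b            # add r0, r0, r1
        b = a + b            # add r1, r0, r1
        counter -= 2         # sub r2, #0x2
        if counter == 0:     # beq return_r0
            return a
        if counter == 1:     # cmp r2, #0x1 / bne loop not taken
            return b         # the value in r1 is the answer


print([fibonacci(n) for n in range(10)])
```

Even n leaves through the beq with the answer already in r0; odd n falls out of the bottom of the loop with the answer in r1, which is why that path needs the extra mov.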

We could also whip it into one function.
Given a memory address in r5 that holds the value, and r0-r4 cleared to zero:

ldr r1, [r5] @source
add r1, #0x1 @add 1 to remove special casing
mov r2, #0x1 @first term

loop:
add r0, #0x1 @loop starts at 1 because r1 will never be less than 1
mov r3, r2
mov r2, r4
add r4, r3, r2
cmp r1, r0 @check if our loop counter equals our term
bne loop
sub r4, r4, r3 @this subtracts the one we added to r1
str r4, @somewhere else
bx lr @and we're done!

Because of the way it processes, we don’t actually need any long jumps: the conditional branch can use a relative offset, since its target is only six instructions back. You can also save lines by just falling through when the values are equal, because bne is not taken in that case.
Also, if you let whatever called it figure out what to do with r4 at the end, let the assembler assume where to start assembling, and change the label, you can fit it in 140 characters.
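Here is a Python model of the one-function version, mainly to show why the final sub gives the right term after the counter was bumped by one. It assumes r0-r4 start at zero, as stated above, and it ignores 32-bit wraparound.

```python
def fibonacci_single(n):
    """Mirror the single-function loop; registers r0-r4 start at zero."""
    r0 = r3 = r4 = 0
    r1 = n + 1               # ldr r1, [r5] then add r1, #0x1
    r2 = 1                   # mov r2, #0x1
    while True:
        r0 += 1              # add r0, #0x1
        r3 = r2              # mov r3, r2
        r2 = r4              # mov r2, r4
        r4 = r3 + r2         # add r4, r3, r2
        if r1 == r0:         # cmp r1, r0 / bne loop falls through
            break
    return r4 - r3           # sub r4, r4, r3


print([fibonacci_single(n) for n in range(10)])
```

After n + 1 passes r4 holds F(n+1) and r3 holds F(n-1), so the subtraction steps back one term to F(n).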