(TL;DR: Look at the very last code snippet. Just fucking look at it. Also, I can’t seem to get code tags working inside quote tags, so I’ve resorted to using backticks instead. Halp.)
Continuing the discussion from [ASM] A (short) Treatise On Longcalling:
A lot of the time when we write ASM, we want to “call functions”, which in THUMB code is the natural purpose of the BL instruction*. However, there are two obvious limitations:
It can only jump forward or backward 4MB, so it can’t cover the entirety of a 16MB ROM, let alone a 32MB one.
The target is PC-relative, which means that if we’re calling from existing code to something new that we’re writing, or vice-versa, we have a problem unless we specify where our hack will be inserted in the ROM. It would be very nice not to have to do that, for the various patching tools that we’re currently working on (and, you know, just in general; because it hurts reusability of a hack if the person who wants to “install” the hack has to plan ahead around specific bytes not being used for something else).
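To put numbers on the first limitation, here is a minimal Python sketch of the range check, using the range GBATEK gives for BL, (PC+4)-400000h to +3FFFFEh. The function name `bl_reachable` and the example addresses are made up for illustration; the only assumption about the hardware is that the ROM is mapped at 0x08000000, as on the GBA.

```python
def bl_reachable(bl_addr: int, target: int) -> bool:
    """Return True if a THUMB BL located at bl_addr can branch to target.

    Uses the range quoted from GBATEK: (PC+4)-0x400000 ... +0x3FFFFE,
    with a halfword-aligned target.
    """
    base = bl_addr + 4              # the base the BL offset is measured from
    offset = target - base
    return offset % 2 == 0 and -0x400000 <= offset <= 0x3FFFFE


# Hypothetical example: existing code near the start of the ROM,
# new code inserted 11MB further in.
existing_code = 0x08001000
new_hack = 0x08B00000
print(bl_reachable(existing_code, new_hack))        # too far for a BL
print(bl_reachable(existing_code, 0x08002000))      # nearby target is fine
```

Anything that fails this check needs one of the longcalling tricks discussed in the rest of the post.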
What we want to do, essentially, is be able to load a constant that stores a pointer to the code we want to call, and branch to it, while setting the appropriate value in the LR. This works fine for our purposes: when we write a new function that wants to call existing code, we can generally hard-code the location; and when we need to “install” a hack by making the existing code call our new function, we can set up the patcher to write a pointer in the appropriate place, and it can automatically write a pointer value that corresponds to where the patch was placed. (If we wanted to use BLs for this, not only would we be constrained to putting patches in the first quarter of the ROM - where we’d have to go to special lengths to find or make freespace - we’d also have to make the patcher able to compute the corresponding BL instruction, and Cam hates me enough already.)
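For a feel of the difference from the patcher's side, here is a rough Python sketch of both jobs: writing a pointer, and computing a BL pair from the bit layout GBATEK describes. The names `encode_pointer` and `encode_bl` are hypothetical and not taken from any existing patching tool. Whether the stored pointer should carry the +1 THUMB bit depends on the call sequence that consumes it, so that is left as a parameter.

```python
import struct


def encode_pointer(target: int, thumb_bit: bool = True) -> bytes:
    """Little-endian 32-bit pointer, optionally with bit 0 set for BX."""
    return struct.pack("<I", target | 1 if thumb_bit else target)


def encode_bl(bl_addr: int, target: int) -> bytes:
    """Encode a two-halfword THUMB BL placed at bl_addr that calls target."""
    offset = target - (bl_addr + 4)
    if offset % 2 or not -0x400000 <= offset <= 0x3FFFFE:
        raise ValueError("target not reachable by BL from this address")
    offset &= 0x7FFFFF                          # 23-bit two's complement
    first = 0xF000 | ((offset >> 12) & 0x7FF)   # 11110b + upper 11 bits
    second = 0xF800 | ((offset >> 1) & 0x7FF)   # 11111b + lower 11 bits
    return struct.pack("<HH", first, second)


# The pointer depends only on where the hack ended up...
print(encode_pointer(0x08B00000).hex())
# ...while the BL depends on both ends, and can simply fail.
print(encode_bl(0x08000000, 0x08000000).hex())
```

The pointer version is one line and never fails; the BL version needs the caller's address too and has to refuse anything out of range.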
Now, the way people handled this approximately forever ago in our community was described by Colorz in the first treatise:
That is: load the target address into a register, store (current PC + 5) in the LR, then BX to the target address. This example is decorated with some register saving/restoring in case you’re already using those registers, and puts the constant data right there, with an unconditional branch over it.
Why `add r2, #0x5`? Well, because of pipelining, the PC value that gets stored in R2 in this code actually refers to two THUMB instructions (4 bytes) ahead of the line that actually copies the PC value - i.e., to `mov lr, r2`. We need to skip ahead two more instructions - that one itself, and the BX, so that makes +4. The extra +1 is so that BXing doesn’t switch over to ARM mode (I’m not sure if a direct MOV PC will treat this “flag” the same way, but it’s not really a good idea anyway). Ugly as sin. That +1 is something that BL normally takes care of for us automatically, after all.
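The arithmetic is easy to get wrong, so here it is spelled out as a tiny Python sketch. It assumes the instruction layout described in the paragraph above (the instruction that copies PC, then the add, then `mov lr, r2`, then the BX, each 2 bytes); the function name and the example address are made up.

```python
def manual_return_address(copy_pc_addr: int) -> int:
    """LR value produced by the copy-PC-then-add-5 sequence.

    copy_pc_addr is the address of the instruction that copies PC.
    """
    pc_seen = copy_pc_addr + 4      # pipelining: PC reads 4 bytes ahead
    skip = 2 + 2                    # skip `mov lr, r2` and the BX itself
    thumb_bit = 1                   # keep the CPU in THUMB mode on return
    return pc_seen + skip + thumb_bit


addr = 0x08001000                   # hypothetical address of the PC copy
lr = manual_return_address(addr)
# Same thing BL would have produced: the instruction after the BX, with bit 0 set.
assert lr == (addr + 8) | 1
print(hex(lr))
```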
So, Colorz did a bit of research, and discovered what I’ve been calling a “BX ladder”:
So that gives us the way to do what we want, right? Except that in our own code, we won’t have a handy BX ladder to BL to, unless we make one ourselves. We’d have the same problem we started out with, of trying to reach an absolute address that might be over 4MB away. But we don’t really need the ladder; we just need the BX to be “out of the way”, so that when the code returns from the call, we don’t trigger the BX again.
Colorz demoed it thus:
Which I would probably reorder this way:
    ldr r1, constant
    b skip_bx
    .align
    .long constant
    do_bx:
        bx r1
    skip_bx:
        bl do_bx
Of course, you might still be forced to save a register, either way, but it turns out there’s really no avoiding it. Better one register than two, and this also saves opcodes.
SO. Why am I bringing this up? Because it’s 2015, motherfucker, and I found something better. I want to draw your attention to what GBATEK has to say about BL for a second:
This may be used to call (or jump) to a subroutine, return address is saved in LR (R14). Unlike all other THUMB mode instructions, this instruction occupies 32bit of memory which are split into two 16bit THUMB opcodes.
First Instruction - LR = PC+4+(nn SHL 12)
    Bit    Expl.
    15-11  Must be 11110b for BL/BLX type of instructions
    10-0   nn - Upper 11 bits of Target Address
Second Instruction - PC = LR + (nn SHL 1), and LR = PC+2 OR 1 (and BLX: T=0)
    Bit    Expl.
    15-11  Opcode
             11111b: BL label   ;branch long with link
             11101b: BLX label  ;branch long with link switch to ARM mode (ARM9)
    10-0   nn - Lower 11 bits of Target Address (BLX: Bit0 Must be zero)
The destination address range is (PC+4)-400000h…+3FFFFEh, ie. PC+/-4M. Target must be halfword-aligned. As Bit 0 in LR is set, it may be used to return by a BX LR instruction (keeping CPU in THUMB mode).
Return: No flags affected, PC adjusted, return address in LR.
Execution Time: 3S+1N (first opcode 1S, second opcode 2S+1N).
Exceptions may or may not occur between first and second opcode, this is “implementation defined” (unknown how this is implemented in GBA and NDS).
Using only the 2nd half of BL as “BL LR+imm” is possible (for example, Mario Golf Advance Tour for GBA uses opcode F800h as “BL LR+0”).[/quote]
Okay, pay close attention to that last bit. You can use the second instruction on its own. In the VBA disassembler, it will represent this as “BLH” (I guess it’s supposed to stand for “BL with Halfword”?).
When I first read this, I was left wondering why you’d ever want to BL LR. I was imagining some funky mutual-recursion or coroutine thing, I dunno. But that’s not really the point.
Look at the detailed explanation of the instructions again. The first instruction stores a PC-relative value, which we don’t want. The second instruction sets the PC to a possibly-tweaked LR value, and puts the necessary value in LR for returning to the next instruction in sequence. Which we do want. And which we can do by itself.
Penny dropped yet?
All we do is, load an absolute value into LR, and then let that second instruction do its magic. It takes care of the offsetting of the stored LR value, the way we’re accustomed to BL working, and we don’t have to worry about pipelining or PCs or whatever because we already put the exact absolute address we want in LR.
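If it helps to see that as plain arithmetic, here is a small Python model of the second instruction on its own, following the GBATEK description literally (PC = LR + (nn SHL 1), LR = address of the next instruction OR 1). It is a sketch of the documented behaviour, not an emulator; the function name and addresses are invented, and the example uses an even target address to sidestep the question of how bit 0 of the target is treated.

```python
def bl_second_half(instr_addr: int, lr: int, nn: int = 0):
    """Model the lone second BL halfword sitting at instr_addr.

    Returns (new_pc, new_lr).
    """
    new_pc = (lr + (nn << 1)) & 0xFFFFFFFF  # jump to the absolute address in LR
    new_lr = (instr_addr + 2) | 1           # return to the next instruction, THUMB bit set
    return new_pc, new_lr


# Hypothetical call site at 0x08001004, target function at 0x08B00000.
pc, lr = bl_second_half(instr_addr=0x08001004, lr=0x08B00000)
print(hex(pc), hex(lr))
```

No PC-relative term shows up anywhere in the jump target, which is the whole point.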
Of course, we can’t actually LDR directly into LR, because the load PC-relative opcode only uses 3 bits to represent the destination register. But that’s trivial to work around. The result is stunningly simple:
    ldr r1, constant
    mov lr, r1
    bl lr + 0 @ or whatever syntax the assembler expects
    more_code:
        @ ...
    .align
    .long constant
Notice that this is actually 4 bytes shorter: we no longer have to BX, since our BL goes directly to where we want. We no longer have to use an extra B to skip over anything (assuming our LDR can reach the function’s constant pool), because there’s no longer a BX to avoid on the return trip. And while we’ve added a MOV, it’s offset by using a 2-byte instruction for the BL magic instead of a 4-byte one. As for clarity, I don’t think anyone can really argue the point.
Boom. The only remaining question is support from whatever assembler you’re using, but the CPU obviously supports it. Hell, hack it in as a short constant if you have to. But I am at least guessing that `bl lr+0` works with the a22i assembler, since that’s the syntax the No$GBA guy is using in the discussion, and he’s also responsible for that assembler, so yeah.
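For the "short constant" fallback, the value follows from the GBATEK bit layout: 11111b in bits 15-11 plus an 11-bit nn, so "BL LR+0" is F800h, the same opcode GBATEK mentions for Mario Golf. Here is a minimal Python sketch of producing the raw bytes, e.g. for a patcher to write; the function name is hypothetical. In an assembler with a 16-bit data directive (`.short` in GNU as), the equivalent would presumably be `.short 0xF800`.

```python
import struct


def bl_lr_halfword(nn: int = 0) -> bytes:
    """Raw little-endian bytes of the lone second BL halfword, "BL LR+(nn*2)"."""
    if not 0 <= nn <= 0x7FF:
        raise ValueError("nn must fit in 11 bits")
    return struct.pack("<H", 0xF800 | nn)


print(bl_lr_halfword().hex())   # the bytes as they would appear in the ROM
```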
Of course, this loses out on the other important thing BX can do, i.e. switch to ARM mode. If you really need that, then stick with the previous magic. Or I guess you could use the new magic to reach the built-in BX ladder, if you really want.
* Wow, you read the whole post! Congratulations. Still remember at the top where I was talking about the “natural use” of BL? Just for what it’s worth - BL also sometimes gets used within the same function, to do branches that are further than a plain old B, BEQ etc. can reach. The value stored in LR ends up just being ignored. Yes, the base FE7 code contains functions like this. Ph34r. This made me sadface when I figured it out, because it’s yet another reason why “automatically detect where functions begin and end” is a hard problem.