Review Arm Assembly
January 4, 2021
There is no other way to learn something than playing with it.
Take assembly code, read it and predict what it will do. Then test it.
Those mistakes, those mismatches between what you think and what really happens, those surprises are what move us forward, deeper into learning.
In this post I will dig into Arm, assisted by an interactive assembler.
GCC generated Arm assembly
We will see the assembly of the following C code compiled as:
pi@raspberrypi:~$ gcc -S -O0 -o asm1.asm asm1.c
raspberrypi is a QEMU virtual machine for Arm running Raspbian Stretch. The setup is explained in my previous post QEMUlating a Rasbian (ARM).
The code is quite simple:
int rand() {
    return 0x42;
}

int sum(int a, int b) {
    return a + b;
}

int main() {
    int r = rand();
    if (r == 0)
        return 0;
    else if (r > 0x4041)
        return sum(r, 0x4041);
    else
        return -1;
}
Let’s dig into the assembly. I will use an interactive assembler.
The rand function
.align 2
.global rand
.arch armv6
.syntax unified
.arm
.fpu vfp
.type rand, %function
rand:
; link register save eliminated.
str fp, [sp, #-4]!
add fp, sp, #0
mov r3, #66
mov r0, r3
add sp, fp, #0
ldr fp, [sp], #4
bx lr
.size rand, .-rand
First, the code is aligned (.align 2 means a 2^2 = 4 byte boundary) and the symbol is marked as “global”. .arm says that the code uses the 32-bit Arm instruction set (aka .code 32) rather than Thumb.
Prologue
The function begins by saving the frame pointer fp on the stack.
The str fp, [sp, #-4]! is a pre-indexed store: the address used is sp - 4, so fp is saved 4 bytes “up” the stack (the stack grows towards lower addresses).
The ! asks for write-back: sp is updated (decremented by 4) to that address as part of the same instruction.
sp always points to the latest valid value on the stack. That’s why sp is decremented before performing the store.
The add fp, sp, #0 is an alternative to mov fp, sp.
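The pre-indexed store with write-back described above can be modeled in a few lines of C. This is only a sketch: the Machine struct and the helper names are made up for illustration (they are not part of iasm), and memory is assumed to be little-endian, as on Raspbian.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy machine: 16 registers plus a small byte-addressed
 * memory covering addresses 0x0000..0x1fff. Not part of iasm. */
enum { FP = 11, SP = 13, MEM_SIZE = 0x2000 };

typedef struct {
    uint32_t r[16];
    uint8_t mem[MEM_SIZE];
} Machine;

/* Store a 32-bit word in little-endian byte order. */
void store_word(Machine *m, uint32_t addr, uint32_t value) {
    assert(addr % 4 == 0 && addr <= MEM_SIZE - 4);
    for (int i = 0; i < 4; i++)
        m->mem[addr + i] = (uint8_t)(value >> (8 * i));
}

/* Load a 32-bit little-endian word. */
uint32_t load_word(const Machine *m, uint32_t addr) {
    assert(addr % 4 == 0 && addr <= MEM_SIZE - 4);
    uint32_t value = 0;
    for (int i = 0; i < 4; i++)
        value |= (uint32_t)m->mem[addr + i] << (8 * i);
    return value;
}

/* str rt, [rn, #offset]! : compute rn + offset, store there and
 * write the new address back to rn. */
void str_pre_indexed(Machine *m, int rt, int rn, int32_t offset) {
    uint32_t addr = m->r[rn] + (uint32_t)offset;
    store_word(m, addr, m->r[rt]);
    m->r[rn] = addr;
}

/* The two-instruction prologue of rand. */
void rand_prologue(Machine *m) {
    str_pre_indexed(m, FP, SP, -4);   /* str fp, [sp, #-4]! */
    m->r[FP] = m->r[SP] + 0;          /* add fp, sp, #0     */
}
```

Starting from sp = 0x2000 and fp = 0xbbbbbbbb, calling rand_prologue leaves both sp and fp at 0x1ffc, with the old fp stored at that address.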
At the beginning of the call:
------ - ------ ---- ------ --------- ------ ---------
r0 0 r1 0 r2 0 r3 0
r4 0 r5 0 r6 0 r7 0
r8 0 r9/sb 0 r10 0 r11/fp bbbb:bbbb
r12/ip 0 r13/sp 2000 r14/lr aaaa:aaaa r15/pc 100:0
------ - ------ ---- ------ --------- ------ ---------
After the fp and sp update:
------ - ------ ---- ------ --------- ------ -----
r0 0 r1 0 r2 0 r3 0
r4 0 r5 0 r6 0 r7 0
r8 0 r9/sb 0 r10 0 r11/fp 1ffc
r12/ip 0 r13/sp 1ffc r14/lr aaaa:aaaa r15/pc 100:4
------ - ------ ---- ------ --------- ------ -----
iasm, the interactive assembler, lets you explore the memory with the M object. M[sp:] means: show the memory from the address stored in sp up to the last address of the mapped page.
In other words: show the stack.
And the state of the stack is:
100:4> ;! M[sp:]
[\xbb\xbb\xbb\xbb]
(fp)
100:4> ;! M[fp:]
[\xbb\xbb\xbb\xbb]
(fp)
sp always points to the latest value on the stack; fp points to the previous fp value (0xbbbbbbbb in this case).
Body
The compiler didn’t optimize the code (it was built with -O0): it stored the immediate value #66 (0x42) in r3 and then copied it to r0 (the register used for return values). A single mov r0, #66 would be shorter.
Epilogue
Then sp is restored to the current fp (add sp, fp, #0) and fp is restored to its previous value with ldr fp, [sp], #4.
This load uses post-indexed addressing: fp is loaded with the value pointed to by sp and only then is sp incremented by 4 bytes (aka a pop).
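The post-indexed load can be sketched in C in the same way. The Machine struct and function names below are hypothetical, used only to illustrate the order of the load and the register update; memory is assumed little-endian.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy machine, for illustration only. */
enum { FP = 11, SP = 13, MEM_SIZE = 0x2000 };

typedef struct {
    uint32_t r[16];
    uint8_t mem[MEM_SIZE];   /* addresses 0x0000..0x1fff */
} Machine;

void store_word(Machine *m, uint32_t addr, uint32_t value) {
    assert(addr % 4 == 0 && addr <= MEM_SIZE - 4);
    for (int i = 0; i < 4; i++)
        m->mem[addr + i] = (uint8_t)(value >> (8 * i));
}

uint32_t load_word(const Machine *m, uint32_t addr) {
    assert(addr % 4 == 0 && addr <= MEM_SIZE - 4);
    uint32_t value = 0;
    for (int i = 0; i < 4; i++)
        value |= (uint32_t)m->mem[addr + i] << (8 * i);
    return value;
}

/* The three addressing forms, side by side:
 *   [rn, #off]    offset:       address rn + off, rn unchanged
 *   [rn, #off]!   pre-indexed:  address rn + off, rn = rn + off
 *   [rn], #off    post-indexed: address rn,       rn = rn + off
 */

/* ldr rt, [rn], #offset : load from the current rn, then update rn. */
void ldr_post_indexed(Machine *m, int rt, int rn, int32_t offset) {
    uint32_t value = load_word(m, m->r[rn]);
    m->r[rn] += (uint32_t)offset;
    m->r[rt] = value;
}

/* The epilogue of rand, without the final bx lr. */
void rand_epilogue(Machine *m) {
    m->r[SP] = m->r[FP] + 0;          /* add sp, fp, #0   */
    ldr_post_indexed(m, FP, SP, 4);   /* ldr fp, [sp], #4 */
}
```

With sp and fp at 0x1ffc and the old fp (0xbbbbbbbb) stored there, rand_epilogue brings sp back to 0x2000 and fp back to 0xbbbbbbbb, undoing the prologue.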
An optimizing build would drop all of this: the stack is not used at all, so saving and restoring fp has no value. With -O0 the compiler keeps it.
What the compiler did do is skip saving the link register lr.
That register holds the address to return to after a call. Because rand doesn’t call anything, lr is never overwritten, so there is no need to save it on the stack.
bx lr returns to the caller.
The sum function
sum:
str fp, [sp, #-4]!
add fp, sp, #0
sub sp, sp, #12
str r0, [fp, #-8]
str r1, [fp, #-12]
ldr r2, [fp, #-8]
ldr r3, [fp, #-12]
add r3, r2, r3
mov r0, r3
add sp, fp, #0
ldr fp, [sp], #4
bx lr
.size sum, .-sum
Prologue
In this case the function allocates 12 bytes to hold local variables (sub sp, sp, #12).
The second argument r1 is stored at the top of the stack; the first argument r0 is stored just below it, at the next higher address. Arguments are stored from left (r0) to right (r1).
The calling convention says that arguments are passed in registers (up to 4 args). They are set by the caller and, if needed, the callee preserves them on the stack.
It is not really needed here because sum does not call any other function, but the compiler follows the cookbook anyway.
The function allocated 12 bytes, room for 3 variables of 32 bits. We stored 2, the arguments, but the third slot is never set.
The registers at the beginning of the call were:
------ --------- ------ --------- ------ --------- ------ ---------
r0 cccc:cccc r1 dddd:dddd r2 0 r3 0
r4 0 r5 0 r6 0 r7 0
r8 0 r9/sb 0 r10 0 r11/fp bbbb:bbbb
r12/ip 0 r13/sp 2000 r14/lr aaaa:aaaa r15/pc 100:0
------ --------- ------ --------- ------ --------- ------ ---------
And after the stores, the stack has:
100:10> ;! M[sp:]
[\xdd\xdd\xdd\xdd\xcc\xcc\xcc\xcc\x00\x00\x00\x00\xbb\xbb\xbb\xbb]
(r1) (r0) (??) (fp)
100:10> ;! M[fp:]
[\xbb\xbb\xbb\xbb]
(fp)
I presume that the unused space (??) is for the lr register.
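The stack layout shown above can be reproduced with a small C model of the first five instructions of sum. The SumFrame struct, word_at and sum_prologue are hypothetical names chosen for this sketch; only the last 8 words below address 0x2000 are modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: a few registers and the 8 words just below 0x2000. */
enum { STACK_TOP = 0x2000, STACK_WORDS = 8 };

typedef struct {
    uint32_t r0, r1, fp, sp;
    uint32_t words[STACK_WORDS];   /* words[0] lives at STACK_TOP - 32 */
} SumFrame;

/* Return a pointer to the modeled word stored at address addr. */
uint32_t *word_at(SumFrame *f, uint32_t addr) {
    uint32_t base = STACK_TOP - 4 * STACK_WORDS;
    assert(addr % 4 == 0 && addr >= base && addr < STACK_TOP);
    return &f->words[(addr - base) / 4];
}

/* The prologue of sum plus the two argument stores. */
void sum_prologue(SumFrame *f) {
    f->sp -= 4;                         /* str fp, [sp, #-4]! */
    *word_at(f, f->sp) = f->fp;
    f->fp = f->sp;                      /* add fp, sp, #0     */
    f->sp -= 12;                        /* sub sp, sp, #12    */
    *word_at(f, f->fp - 8) = f->r0;     /* str r0, [fp, #-8]  */
    *word_at(f, f->fp - 12) = f->r1;    /* str r1, [fp, #-12] */
}
```

Starting with the registers from the dump (r0 = 0xcccccccc, r1 = 0xdddddddd, fp = 0xbbbbbbbb, sp = 0x2000), the four words from sp upwards end up as r1, r0, an untouched slot and the saved fp, which is the order iasm printed.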
The main function
main:
push {fp, lr}
add fp, sp, #4
sub sp, sp, #8
bl rand
str r0, [fp, #-8]
ldr r3, [fp, #-8]
cmp r3, #0
bne .L6
mov r3, #0
b .L7
.L6:
ldr r3, [fp, #-8]
ldr r2, .L9
cmp r3, r2
ble .L8
ldr r1, .L9
ldr r0, [fp, #-8]
bl sum
mov r3, r0
b .L7
.L8:
mvn r3, #0
.L7:
mov r0, r3
sub sp, fp, #4
pop {fp, pc}
.L10:
.align 2
Prologue
The function saves fp and lr with a single push {fp, lr}.
The {r, r} notation is a set, not a list: registers are pushed in the inverse of their numeric order (r0 to r15) regardless of how the push is written.
In our case fp is r11 and lr is r14, so that is the natural order; the inverse order then applies: r14 is pushed first, r11 later.
In short: r14 will be at the bottom of the stack (higher address) while r11 will be at the top (lower address).
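The ordering rule for push can be sketched in C by treating the register set as a bit mask. The Machine struct and the push function are hypothetical; the model reproduces the final memory layout, not the exact sequence the hardware uses, and it ignores the special case of sp being in the set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy machine with word-addressed storage for 0x0000..0x1fff. */
enum { FP = 11, SP = 13, LR = 14, MEM_SIZE = 0x2000 };

typedef struct {
    uint32_t r[16];
    uint32_t mem[MEM_SIZE / 4];   /* mem[a / 4] is the word at address a */
} Machine;

/* push {reglist}: bit n of reglist set means rn is in the set.
 * Walking from r15 down to r0 puts the highest-numbered register at
 * the highest address, whatever order the source code lists them in. */
void push(Machine *m, uint16_t reglist) {
    for (int n = 15; n >= 0; n--) {
        if (reglist & (1u << n)) {
            m->r[SP] -= 4;
            assert(m->r[SP] % 4 == 0 && m->r[SP] < MEM_SIZE);
            m->mem[m->r[SP] / 4] = m->r[n];
        }
    }
}

/* The first two instructions of main. */
void main_prologue(Machine *m) {
    push(m, (1u << FP) | (1u << LR));   /* push {fp, lr}  */
    m->r[FP] = m->r[SP] + 4;            /* add fp, sp, #4 */
}
```

Because the set is a mask, push {lr, fp} and push {fp, lr} are the same operation here: lr lands at 0x1ffc and fp at 0x1ff8 when sp starts at 0x2000.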
fp is then updated to the base of the stack frame for the current function call. Here the frame begins at the first pushed word, so the current fp points to the saved lr.
The fp update is done with add fp, sp, #4 (at this point sp points to the saved fp, 4 bytes below the saved lr).
The registers at the beginning of the call were:
------ --------- ------ --------- ------ --------- ------ ---------
r0 cccc:cccc r1 dddd:dddd r2 0 r3 0
r4 0 r5 0 r6 0 r7 0
r8 0 r9/sb 0 r10 0 r11/fp bbbb:bbbb
r12/ip 0 r13/sp 2000 r14/lr aaaa:aaaa r15/pc 100:0
------ --------- ------ --------- ------ --------- ------ ---------
And after the push and add, the registers were:
------ --------- ------ --------- ------ --------- ------ -----
r0 cccc:cccc r1 dddd:dddd r2 0 r3 0
r4 0 r5 0 r6 0 r7 0
r8 0 r9/sb 0 r10 0 r11/fp 1ffc
r12/ip 0 r13/sp 1ff8 r14/lr aaaa:aaaa r15/pc 100:4
------ --------- ------ --------- ------ --------- ------ -----
And the stack:
100:4> ;! M[sp:]
[\xbb\xbb\xbb\xbb\xaa\xaa\xaa\xaa]
(fp) (lr)
100:4> ;! M[fp:]
[\xaa\xaa\xaa\xaa]
(lr)
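This is not consistent with what we saw in rand and sum: fp points to the saved fp in those functions but points to the saved lr in main.
Also, in sum we believed that the 4 unused bytes were reserved to store lr, but here we see that the local space is reserved later with sub sp, sp, #8 and does not include space for lr.
A likelier explanation for the unused bytes is alignment padding: the Arm procedure call standard requires sp to be 8-byte aligned at public interfaces, and both frames (4 + 12 bytes in sum, 8 + 8 bytes in main) come out as multiples of 8.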
Comparisons
The call to rand (which takes no parameters) is done with bl, branch with link.
The return value is in r0 and, because the code is not optimized, it is stored on the stack and immediately loaded back into r3.
M[fp - 8] is the slot for the local variable r: it is used here and for every subsequent reference to the value returned by rand.
Two comparisons are made for the if-else if statement:
ldr r3, [fp, #-8]
cmp r3, #0
...
ldr r3, [fp, #-8]
ldr r2, .L9
cmp r3, r2
The first compares r3 (the value returned by rand) with an immediate value of 0 (cmp r3, #0).
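To see how cmp and the conditional branches in main fit together, here is a minimal C sketch of the flags. The Flags struct and the function names are hypothetical; cmp computes a subtraction and keeps only the N, Z, C and V flags, bne branches when Z is clear, and ble (a signed comparison) branches when Z is set or N differs from V.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } Flags;

/* cmp a, b : compute a - b and keep only the condition flags. */
Flags cmp(uint32_t a, uint32_t b) {
    uint32_t diff = a - b;
    Flags f;
    f.n = (diff >> 31) != 0;                      /* result is negative   */
    f.z = diff == 0;                              /* result is zero       */
    f.c = a >= b;                                 /* no borrow happened   */
    f.v = (((a ^ b) & (a ^ diff)) >> 31) != 0;    /* signed overflow      */
    return f;
}

bool cond_ne(Flags f) { return !f.z; }                /* bne */
bool cond_le(Flags f) { return f.z || f.n != f.v; }   /* ble */

/* Cross-check the flag-based ble against C's signed <=.
 * Assumes the usual two's complement conversion to int32_t. */
void check_le(uint32_t a, uint32_t b) {
    assert(cond_le(cmp(a, b)) == ((int32_t)a <= (int32_t)b));
}

/* The control flow of main, for a given return value r of rand. */
int32_t main_result(uint32_t r) {
    if (!cond_ne(cmp(r, 0)))         /* cmp r3, #0 ; bne .L6 not taken */
        return 0;                    /* mov r3, #0                     */
    if (cond_le(cmp(r, 0x4041)))     /* cmp r3, r2 ; ble .L8 taken     */
        return -1;                   /* mvn r3, #0                     */
    return (int32_t)(r + 0x4041);    /* bl sum                         */
}
```

Since rand returns 0x42, which is neither 0 nor greater than 0x4041, main_result(0x42) takes the ble branch and yields -1.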
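The second compares two registers, r3 and r2, where r2 also holds a fixed value, but one that is too large to fit in the cmp instruction as an immediate value.
An Arm immediate is an 8-bit value rotated right by an even number of bits, and 0x4041 does not fit that pattern.
In this case the value is loaded into the r2 register from a literal stored next to the code (label .L9).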
.L9:
.word 16449
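Why 66 fits in an instruction while 16449 (0x4041) needs a literal can be checked with a short C function. This sketch assumes the classic Arm (A32) data-processing encoding, where an immediate is an 8-bit value rotated right by an even amount; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n taken modulo 32). */
uint32_t rotate_left(uint32_t v, unsigned n) {
    n &= 31;
    return n == 0 ? v : (v << n) | (v >> (32 - n));
}

/* True if v can be written as an 8-bit value rotated right by an even
 * amount (0, 2, ..., 30). Rotating v left by the same amount must then
 * give back a value that fits in 8 bits. */
bool is_arm_immediate(uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2)
        if (rotate_left(v, rot) <= 0xffu)
            return true;
    return false;
}

/* A few values taken from the listing above. */
void check_examples(void) {
    assert(is_arm_immediate(66));        /* mov r3, #66 is fine      */
    assert(is_arm_immediate(0));         /* cmp r3, #0 is fine       */
    assert(!is_arm_immediate(0x4041));   /* 16449 needs ldr r2, .L9  */
}
```

The set bits of 0x4041 (bits 0, 6 and 14) span 15 bits, so no rotation can squeeze them into an 8-bit window.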
Function call
A function call is done with bl, branch with link.
Arguments are passed in the r0 to r3 registers from left to right. More than 4 arguments require the stack.
; call to sum(r, 0x4041)
ldr r1, .L9 ; second arg
ldr r0, [fp, #-8] ; first arg
bl sum
bl saves the address of the next instruction (the return address) in the link register lr (r14) and sets the destination address in the program counter pc (r15).
bx lr (branch and exchange) is used to return to the caller.
main returns in a different way: pop {fp, pc} loads the saved lr directly into pc.
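The call and return pair can be sketched in C as two small register updates. The Cpu struct and the addresses in the usage note are hypothetical; the model assumes Arm state (4-byte instructions) and treats pc simply as the address of the next instruction to execute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { LR = 14, PC = 15 };

/* Hypothetical CPU state: registers plus the current instruction set. */
typedef struct {
    uint32_t r[16];
    bool thumb;
} Cpu;

/* bl target, executed by the instruction located at address `at`:
 * remember where to come back to, then jump. */
void bl(Cpu *c, uint32_t at, uint32_t target) {
    assert(at % 4 == 0);      /* Arm instructions are word aligned */
    c->r[LR] = at + 4;        /* return address: next instruction  */
    c->r[PC] = target;
}

/* bx rm : branch to the address in rm. Bit 0 of that address selects
 * Thumb state (the "exchange" part) and is not part of the target. */
void bx(Cpu *c, int rm) {
    uint32_t addr = c->r[rm];
    c->thumb = (addr & 1u) != 0;
    c->r[PC] = addr & ~1u;
}
```

For example, a bl placed at 0x100 that targets 0x200 leaves lr = 0x104 and pc = 0x200; a later bx lr sets pc back to 0x104 and stays in Arm state.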
Arm directives
Two more fragments remain that are not part of any function.
These are directives for the GNU Assembler, see also this:
.arch armv6
.eabi_attribute 28, 1
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 6
.eabi_attribute 34, 1
.eabi_attribute 18, 4
.file "asm1.c"
.text
.ident "GCC: (Raspbian 8.3.0-6+rpi1) 8.3.0"
.section .note.GNU-stack,"",%progbits
Final thoughts
I have been reading documentation and write-ups about Arm during the last weeks.
When I started, my idea was to use a QEMU virtual machine for testing: code a little assembly, compile it, debug it with GDB and see the effects.
It turned out to be tedious very quickly.
I then relied more on the documentation and the instruction set reference, but when I reviewed real code (like the one in this post) some things made no sense.
Obviously there were errors in my interpretation of the code.
That’s why I coded an interactive assembler: to get quick feedback on what each instruction does without requiring a compile-upload-debug cycle.
It really helped me to “smooth out certain rough edges” and understand the code better, especially the indexing flavors and how things are pushed to and popped from the stack.