How stack trace on ARM works

Some time ago I ran into a small problem in Embox RTOS: gdb did not display the stack trace correctly on Cortex-M when the program was inside an interrupt handler. That made me want to find out which ways exist to get a stack trace on ARM, which compilation flags affect the ability to unwind the stack, and how this is implemented in Linux. I decided to write this article based on the results of that research.

Let’s start with a simple approach that can be found in the Linux kernel, but which is currently marked ‘deprecated’ in GCC.

Imagine that a program is running, and at some point we interrupt it and want to display the stack trace. We have the pointer to the instruction currently being executed by the processor (PC), as well as the current pointer to the top of the stack (SP). Now, in order to “jump” up the stack to the previous frame, we need to know which function called the current one and where its frame is. On ARM, we can use the Link Register (LR) for this purpose:

The Link Register (LR) is register R14. It stores the return information for subroutines, function calls, and exceptions. On reset, the processor sets the LR value to 0xFFFFFFFF.

Next, we need to step up through the stack, loading a new LR value from each stack frame. The compiler lays out the stack frame like this:

/* The stack backtrace structure is as follows:
fp points to here:
| save code pointer | [fp]
| return link value | [fp, #-4]
| return sp value | [fp, #-8]
| return fp value | [fp, #-12]
[| saved r10 value |]
[| saved r9 value |]
[| saved r8 value |]
[| saved r7 value |]
[| saved r6 value |]
[| saved r5 value |]
[| saved r4 value |]
[| saved r3 value |]
[| saved r2 value |]
[| saved r1 value |]
[| saved r0 value |]
r0-r3 are not normally saved in a C function. */

This description is from the GCC source file ‘gcc/gcc/config/arm/arm.h’.

That is, in the prologue of each function the compiler builds an auxiliary structure on the stack. Note that this structure contains the “next” value of the LR register that we need and, most importantly, the address of the next frame: “| return fp value | [fp, #-12]”.
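To make the walk concrete, here is a minimal sketch in C of how such frames could be traversed. It is not the Linux implementation: it runs against a simulated stack, and the names STACK_BASE, stack_words, read_word, write_word and backtrace_apcs are hypothetical helpers I made up for the illustration. It relies only on the two slots from the GCC comment above: the return link at [fp, #-4] and the caller's fp at [fp, #-12].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simulated stack: word i lives at address STACK_BASE + 4*i. */
#define STACK_BASE  0x20000000u
#define STACK_WORDS 32u

static uint32_t stack_words[STACK_WORDS];

/* Read one word from the simulated stack; 0 on success, -1 on a bad address. */
static int read_word(uint32_t addr, uint32_t *out)
{
    if (addr < STACK_BASE || addr >= STACK_BASE + 4u * STACK_WORDS || (addr & 3u))
        return -1;
    *out = stack_words[(addr - STACK_BASE) / 4u];
    return 0;
}

/* Store one word into the simulated stack (only used to build demo frames). */
static void write_word(uint32_t addr, uint32_t value)
{
    assert(addr >= STACK_BASE && addr < STACK_BASE + 4u * STACK_WORDS && !(addr & 3u));
    stack_words[(addr - STACK_BASE) / 4u] = value;
}

/*
 * Walk APCS frames starting at fp and collect the saved LR of each frame.
 * Assumed layout: [fp, #-4] = return link, [fp, #-12] = caller's fp.
 * Returns the number of return addresses stored in lrs[].
 */
static size_t backtrace_apcs(uint32_t fp, uint32_t *lrs, size_t max)
{
    size_t n = 0;

    while (n < max) {
        uint32_t lr, prev_fp;

        if (read_word(fp - 4u, &lr) != 0 || read_word(fp - 12u, &prev_fp) != 0)
            break;              /* fp does not point into the stack: stop */
        lrs[n++] = lr;
        if (prev_fp <= fp)
            break;              /* stack grows down, so a caller's frame must be higher */
        fp = prev_fp;
    }
    return n;
}
```

On real hardware read_word would be replaced by a bounds-checked load from the task's stack, and the starting fp would be taken from r11. The check that each caller's frame lies at a higher address is a simple guard against looping forever on a corrupted chain.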

To enable this compiler mode, use the ‘-mapcs-frame’ option. The option description mentions “Specifying -fomit-frame-pointer with this option causes the stack frames not to be generated for leaf functions.” Here, ‘leaf’ functions are those that do not call any other functions, so their prologue can be made a little lighter.

You may also wonder what to do with assembler functions in this case. In fact, nothing tricky: you need to insert the special macros ENTRY/ENDPROC, which are described in the Linux objtool documentation:

Each callable function must be annotated as such with the ELF function type. In asm code, this is typically done using the ENTRY/ENDPROC macros.

The same document points out that having to apply the macros by hand is an obvious disadvantage, since it is easy to make a mistake. An additional tool is therefore required: the ‘objtool’ utility checks whether all functions in the kernel are written in the correct format for the stack trace.

Below is the function to unwind the stack from the Linux kernel:

I want to draw attention to the line “defined (CONFIG_ARM_UNWIND)”. It hints that the Linux kernel also uses another unwind_frame implementation, and we will talk about it a little later.

The ‘-mapcs-frame’ option is only valid for the ARM instruction set. However, ARM cores also have another instruction set, ‘Thumb’ (Thumb-1 and Thumb-2), which is used mainly on the Cortex-M series. To enable frame generation in Thumb mode, use the ‘-mtpcs-frame’ and ‘-mtpcs-leaf-frame’ flags. In effect, they are the Thumb analogue of ‘-mapcs-frame’.

Let’s take a look at the following code:

static int my_func1(int a) {
    my_func2(7);
    return 0;
}

If we compile it for Thumb mode and disassemble it, we see the following prologue:

00008134 <my_func1>:
8134: b084 sub sp, #16
8136: b580 push {r7, lr}
8138: aa06 add r2, sp, #24
813a: 9203 str r2, [sp, #12]
813c: 467a mov r2, pc
813e: 9205 str r2, [sp, #20]
8140: 465a mov r2, fp
8142: 9202 str r2, [sp, #8]
8144: 4672 mov r2, lr
8146: 9204 str r2, [sp, #16]
8148: aa05 add r2, sp, #20
814a: 4693 mov fp, r2
814c: b082 sub sp, #8
814e: af00 add r7, sp, #0

For comparison, here is the disassembly of the same function compiled for the ARM instruction set:

000081f8 <my_func1>:
81f8: e1a0c00d mov ip, sp
81fc: e92dd800 push {fp, ip, lr, pc}
8200: e24cb004 sub fp, ip, #4
8204: e24dd008 sub sp, sp, #8

At first glance, it may seem that these are completely different things. However, the frames are equivalent. The point is that in ‘Thumb’ mode the ‘push’ instruction can only operate on the low registers (r0-r7) and the ‘lr’ register. All other registers have to be stored in two steps, through the ‘mov’ and ‘str’ instructions, as in the example above.

By the way, it turns out that these options currently work only for Cortex-M0/M1. I spent some time trying to figure out why I couldn’t build a correct image for Cortex-M3/M4/…. After re-checking all GCC options for ARM and searching the Internet, I suspected that this was a bug in GCC. So I went straight to the sources of the arm-none-eabi-gcc compiler, and after studying how it generates frames for ARM, Thumb-1, and Thumb-2, I came to the conclusion that at the moment frames are generated correctly for Thumb-1 and ARM but not for Thumb-2. I filed a bug report, and the GCC developers explained to me that the ARM standard had already changed several times and that these flags are very outdated, but for some reason they all still exist in the compiler.

An alternative approach to stack unwinding is based on the “Exception Handling ABI for the ARM Architecture” (EHABI) standard. This approach is mainly used for exception handling in languages such as C++. However, the information prepared by the compiler for handling exceptions can be used for the stack trace as well. This mode is enabled with the GCC option -fexceptions (or -funwind-tables).

Let’s take a look at how this is done in more detail. The document (EHABI) requires the compiler to generate the auxiliary tables ‘.ARM.exidx’ and ‘.ARM.extab’. Below you can see how the ‘.ARM.exidx’ section is defined in the Linux kernel sources (file arch/arm/kernel/vmlinux.lds.h).

/* Stack unwinding tables */
#define ARM_UNWIND_SECTIONS \
. = ALIGN(8); \
.ARM.unwind_idx : { \
__start_unwind_idx = .; \
*(.ARM.exidx*) \
__stop_unwind_idx = .; \
} \

The “Exception Handling ABI for the ARM Architecture” standard defines each element of the ‘.ARM.exidx’ table like the following structure:

struct unwind_idx {
    unsigned long addr_offset;
    unsigned long insn;
};

The first element is a 31-bit, place-relative offset (prel31) to the start of the function, and the second element is the address of the entry in the instructions table that has to be specially interpreted in order to unwind the stack to the next frame (for short sequences, the instructions can instead be packed inline into this word). Each element of the instructions table is a sequence of words and half words that encode a sequence of instructions. The first word indicates the number of instructions that need to be executed in order to unwind the stack to the next frame.
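As an illustration of how such an index can be searched, here is a small sketch in C. It assumes that addr_offset holds a prel31 value (a signed 31-bit offset relative to the address of the word itself) and that the table is sorted by function address. The helper names prel31_to_addr, addr_to_prel31 and find_idx, and the idea of passing the table's load address as tab_addr, are my own assumptions for this example and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct unwind_idx {
    uint32_t addr_offset;   /* prel31 offset to the start of the function */
    uint32_t insn;          /* unwind instructions, or a reference to them */
};

/* Decode a prel31 value stored at address word_addr into an absolute address. */
static uint32_t prel31_to_addr(uint32_t word, uint32_t word_addr)
{
    uint32_t offset = word & 0x7fffffffu;

    if (offset & 0x40000000u)       /* bit 30 is the sign bit of the 31-bit value */
        offset |= 0x80000000u;
    return word_addr + offset;      /* unsigned arithmetic wraps as intended */
}

/* Inverse helper, only needed to build a demo table. */
static uint32_t addr_to_prel31(uint32_t target, uint32_t word_addr)
{
    return (target - word_addr) & 0x7fffffffu;
}

/*
 * Binary search for the last entry whose function starts at or before pc.
 * tab_addr is the (hypothetical) address at which tab[0] is loaded.
 * Returns NULL if pc lies before the first function in the table.
 */
static const struct unwind_idx *find_idx(const struct unwind_idx *tab, size_t n,
                                         uint32_t tab_addr, uint32_t pc)
{
    const struct unwind_idx *found = NULL;
    size_t lo = 0, hi = n;

    assert(tab != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t word_addr = tab_addr + (uint32_t)(mid * sizeof(*tab));
        uint32_t start = prel31_to_addr(tab[mid].addr_offset, word_addr);

        if (start <= pc) {
            found = &tab[mid];
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return found;
}
```

The entry that is found only says where the function starts. Whether pc really belongs to that function still has to be decided by the caller, for example by comparing against the start of the next entry.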

The description of the instructions from the standard:

The main interpreter implementation on Linux is in arch/arm/kernel/unwind.c.

This is an implementation of the ‘unwind_frame’ function, which is used if the ‘CONFIG_ARM_UNWIND’ option is enabled. I inserted the comments with explanations directly into the source code.

Below is an example of what the ‘.ARM.exidx’ table element looks like for the kernel_start() function in Embox:

$ arm-none-eabi-readelf -u build/base/bin/embox
Unwind table index '.ARM.exidx' at offset 0xaa6d4 contains 2806 entries:
<...>
0x1c3c <kernel_start>: @0xafe40
Compact model index: 1
0x9b vsp = r11
0x40 vsp = vsp - 4
0x84 0x80 pop {r11, r14}
0xb0 finish
0xb0 finish
<...>

And its disassembly:

00001c3c <kernel_start>:
void kernel_start(void) {
1c3c: e92d4800 push {fp, lr}
1c40: e28db004 add fp, sp, #4
<...>

Let’s take a closer look. We see the assignment ‘vsp = r11’ (R11 is FP) and then ‘vsp = vsp - 4’. Together they undo the instruction `add fp, sp, #4`. Next comes ‘pop {r11, r14}’, which undoes the instruction `push {fp, lr}`. The last `finish` instruction indicates the end of execution (to be honest, I still don’t understand why there are two finish instructions; the second one may simply be padding that fills out the last 32-bit word of the entry).
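To check this reading of the table, here is a sketch in C of an interpreter for just the opcodes that appear in the readelf output above (plus the symmetric ‘vsp = vsp + …’ form). It is a simplified illustration and not the code from arch/arm/kernel/unwind.c: the stack is simulated, and struct vregs, stack_words, pop_word and run_unwind_ops are hypothetical names. Any opcode outside this subset makes the sketch give up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simulated stack: word i lives at address STACK_BASE + 4*i. */
#define STACK_BASE  0x20000000u
#define STACK_WORDS 16u

struct vregs {
    uint32_t r[16];             /* r11 = fp, r13 = sp, r14 = lr, r15 = pc */
};

static uint32_t stack_words[STACK_WORDS];

/* Pop one word at *vsp from the simulated stack; 0 on success, -1 on a bad address. */
static int pop_word(uint32_t *vsp, uint32_t *out)
{
    uint32_t addr = *vsp;

    if (addr < STACK_BASE || addr >= STACK_BASE + 4u * STACK_WORDS || (addr & 3u))
        return -1;
    *out = stack_words[(addr - STACK_BASE) / 4u];
    *vsp = addr + 4u;
    return 0;
}

/* Execute a small subset of EHABI unwind opcodes on the virtual registers. */
static int run_unwind_ops(const uint8_t *ops, size_t len, struct vregs *regs)
{
    uint32_t vsp;
    int pc_loaded = 0;
    size_t i = 0;

    assert(ops != NULL && regs != NULL);
    vsp = regs->r[13];
    while (i < len) {
        uint8_t op = ops[i++];

        if ((op & 0xc0u) == 0x00u) {            /* 00xxxxxx: vsp += (x << 2) + 4 */
            vsp += ((uint32_t)(op & 0x3fu) << 2) + 4u;
        } else if ((op & 0xc0u) == 0x40u) {     /* 01xxxxxx: vsp -= (x << 2) + 4 */
            vsp -= ((uint32_t)(op & 0x3fu) << 2) + 4u;
        } else if ((op & 0xf0u) == 0x80u) {     /* 1000iiii iiiiiiii: pop r4-r15 by mask */
            uint32_t mask;
            unsigned reg;

            if (i >= len)
                return -1;                      /* second byte is missing */
            mask = ((uint32_t)(op & 0x0fu) << 8) | ops[i++];
            if (mask == 0u)
                return -1;                      /* 0x80 0x00 means "refuse to unwind" */
            for (reg = 4; reg <= 15; reg++) {
                if ((mask & (1u << (reg - 4u))) && pop_word(&vsp, &regs->r[reg]) != 0)
                    return -1;
            }
            if (mask & (1u << 9))
                vsp = regs->r[13];              /* sp itself was restored from the stack */
            if (mask & (1u << 11))
                pc_loaded = 1;
        } else if ((op & 0xf0u) == 0x90u) {     /* 1001nnnn: vsp = r[nnnn] */
            unsigned n = op & 0x0fu;

            if (n == 13u || n == 15u)
                return -1;                      /* reserved encodings */
            vsp = regs->r[n];
        } else if (op == 0xb0u) {               /* finish */
            break;
        } else {
            return -1;                          /* opcode outside this sketch */
        }
    }
    if (!pc_loaded)
        regs->r[15] = regs->r[14];              /* the return address is in lr */
    regs->r[13] = vsp;
    return 0;
}
```

Feeding it the byte sequence 0x9b 0x40 0x84 0x80 0xb0 0xb0 from the kernel_start entry restores fp and lr from the two words saved by `push {fp, lr}` and leaves sp just above them, which is the state of the caller at the moment of the call.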

I measured how much additional memory is needed if we use the ‘-funwind-tables’ flag. I compiled Embox for the STM32F4-Discovery platform in both modes (with ‘-funwind-tables’ and without it). Here are the objdump results:

With ‘-funwind-tables’:

Sections:
Idx Name Size VMA LMA File off Algn
0 .text 0005a600 08000000 08000000 00004000 2**14
CONTENTS, ALLOC, LOAD, CODE
1 .ARM.exidx 00003fd8 0805a600 0805a600 0005e600 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
2 .ARM.extab 000049d0 0805e5d8 0805e5d8 000625d8 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
3 .rodata 0003e380 08062fc0 08062fc0 00066fc0 2**5

Without ‘-funwind-tables’:

Sections:
Idx Name Size VMA LMA File off Algn
0 .text 00058b1c 08000000 08000000 00004000 2**14
CONTENTS, ALLOC, LOAD, CODE
1 .ARM.exidx 00000008 08058b1c 08058b1c 0005cb1c 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
2 .rodata 0003e380 08058b40 08058b40 0005cb40 2**5

It is easy to calculate that the combined size of the ‘.ARM.exidx’ and ‘.ARM.extab’ sections is about 1/10 of the ‘.text’ size. After that, I built an image with more features for the ARM Integrator-CP platform. In this case, the ratio between the additional sections and ‘.text’ was 1/12. Clearly, this ratio can vary from project to project.
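For the record, the arithmetic for the STM32F4-Discovery image is as follows: 0x3fd8 + 0x49d0 = 16344 + 18896 = 35240 bytes of unwind tables against 0x5a600 = 370176 bytes of ‘.text’, which is roughly 9.5%. The same calculation as a tiny C helper (the name unwind_overhead is hypothetical) that can be fed the sizes from any objdump listing:

```c
#include <assert.h>
#include <stdint.h>

/* Ratio of the unwind table sizes to the .text size (all sizes in bytes). */
static double unwind_overhead(uint32_t text, uint32_t exidx, uint32_t extab)
{
    assert(text != 0u);
    return ((double)exidx + (double)extab) / (double)text;
}
```

Calling unwind_overhead(0x5a600, 0x3fd8, 0x49d0) reproduces the ratio for the listing above.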

It also turned out that the image built with the ‘-mapcs-frame’ flag is smaller than the one built with exception tables, which is quite expected. For example, with a ‘.text’ section of 600 KB, ‘.ARM.exidx’ + ‘.ARM.extab’ took 50 KB in total, while the ‘-mapcs-frame’ flag added only 10 KB. However, this does not apply to Thumb mode. Recall the large prologue (built from mov/str pairs) that was generated for the Cortex-M1. Because of it, there will be practically no difference in this case.

The third approach is to get the stack trace through a debugger. It seems that many operating systems for microcontrollers currently rely on this approach or suggest looking at the disassembly (see, for example, this answer for FreeRTOS).

As a result, we came to the conclusion that stack tracing at run time is actually not used anywhere on ARM microcontrollers. This is probably a consequence of the desire to make the code as efficient as possible at run time and to move debugging work (which includes stack unwinding) to compile time. However, if the OS uses C++ code, then it is quite possible to use the tracing implementation based on ‘.ARM.exidx’.

The incorrect stack unwinding inside interrupt handlers in Embox was fixed very simply: it turned out to be enough to save the LR register on the stack.