Debugging SOS

When developing an operating system on top of seL4 you do not have the luxury of using a debugger such as gdb. Your best bet is a combination of dprintf and objdump. If you are doing the project on a Linux machine (using our linux toolchain) you can also use the backtrace function, combined with a script for parsing objdump output to obtain a primitive stack trace.

Debugging Faults

All exceptions result in a context switch to the seL4 kernel. The kernel exports fault handling policies to user level by sending an IPC to the appropriate IPC endpoint (see Chapter 5 of the seL4 manual). For applications such as tty_test, the assigned endpoint will usually be the IPC endpoint that SOS waits on in the syscall_loop. SOS on the other hand is initialised with no fault endpoint and hence is expected to ensure that it does not cause a fault.

SOS faults

Below is an example of code that triggers a page fault from the SOS main function.

int main(void)
  {
      init_muslc();

      /* bootinfo was set as an environment variable in _sel4_start */
      char *bi_string = getenv("bootinfo");
      ZF_LOGF_IF(!bi_string, "Could not parse bootinfo from env.");

      /* register the location of the unwind_tables -- this is required for
       * backtrace() to work */
      __register_frame(&__eh_frame_start);

      seL4_BootInfo *boot_info;
      if (sscanf(bi_string, "%p", &boot_info) != 1) {
          ZF_LOGF("bootinfo environment value '%s' was not valid.", bi_string);
      }

      debug_print_bootinfo(boot_info);

      printf("\nSOS Starting...\n");

      NAME_THREAD(seL4_CapInitThreadTCB, "SOS:root");
      // cause a fault at 0xDEADBEEF
      *((char*)0xDEADBEEF) = 0;

When this code runs, the following output is displayed in the picocom terminal:

SOS Starting...
Caught cap fault in send phase at address (nil)
while trying to handle:
vm fault on data at address 0xdeadbeef with status 0x92000045
in thread 0xffffff807efd6400 "SOS:root" at address 0x407c50
With stack:
0x4e59c0: 0x0
0x4e59c8: 0x48966c
0x4e59d0: 0x4e5a78
0x4e59d8: 0x0
0x4e59e0: 0x0
0x4e59e8: 0x0
0x4e59f0: 0x5ea000
0x4e59f8: 0x4895ec
0x4e5a00: 0x4e5a78
0x4e5a08: 0x4ac025
0x4e5a10: 0x48965c
0x4e5a18: 0x0
0x4e5a20: 0x4e5af0
0x4e5a28: 0x0
0x4e5a30: 0x0
0x4e5a38: 0x0

What's going on here? First, a page fault is triggered by the dereference of invalid address 0xDEADBEEF. Then, seL4 attempts to send a page fault IPC to the fault handling endpoint of SOS. However, SOS doesn't have a fault handling endpoint, as it is set to the default value of 0, an empty slot in the cspace. As a result, a double fault is sent, which means the thread is stopped by the kernel, and in debug mode, the above output is produced.

From the output, observe that SOS caused a vm fault with the following important values:

The fault address

The fault address is the invalid address which caused a vm fault (a TLB translation error). This is 0xdeadbeef, the address we tried to write to. Note that invalid addresses will not always trigger a vm fault: if you are unlucky enough to accidentally derefence a mapped address no fault will occur.

The exception status register

The exception status register is referred to as far_el1 on 64-bit ARM architectures (aarch64) and DFSR on 32-bit ARM architectures (aarch32). To decode the value, look at section D.10.4 of the ARMv8 Architecture reference manual

The fault address register

The most useful address is the fault address register, which contains the value of the instruction pointer at the time of the fault. This value can then be looked up with the objdump tool to to determine the context of fault.

Application faults

When an application thread is created, an appropriate fault handler IPC endpoint should be set in the thread's TCB (see Chapter 5 of the seL4 manual). If the application causes a fault and IPC will be sent to the assigned fault endpoint. Message registers provide important information regarding the cause and location of the fault. objdump may then be used to determine the context of fault.

Using objdump

Whenever you cause a fault, you should be able to see the current program counter printed out. You can use objdump to find where the offending code is. Pass to objdump the location of the binary that faulted: in this case, it's the sos binary, located in the build directory:

aarch64-linux-gnu-objdump -Dlx build/projects/aos/sos/sos | less

You can now use the searching facility in less to search for the faulting address. In this case I find the following fragment of output:

 407c44:       97fff6dc        bl      4057b4 
/home/alyons/aos-2019-test/build/../projects/aos/sos/src/main.c:516
  407c48:       d297dde0        mov     x0, #0xbeef                     // #48879
  407c4c:       f2bbd5a0        movk    x0, #0xdead, lsl #16
  407c50:       3900001f        strb    wzr, [x0]
/home/alyons/aos-2019-test/build/../projects/aos/sos/src/main.c:519
  407c54:       f9401ba1        ldr     x1, [x29,#48]

We can see that the error has occurred on code at line 516 of main.c, exactly where we triggered a fault by trying to write to 0xDEADBEEF.

If the fault occurs in an application (such as tty_test or sosh) rather than SOS, the following command should be used, with APP_NAME substituted for the name of the application you are debugging:

aarch64-linux-gnu-objdump -Dlx projects/aos/apps/APP_NAME/APP_NAME  | less

More on objdump

objdump is a very handy utility for working out exactly what is where in an executable so you can work out what exactly is going wrong.

The two standard incantations for objdump are:

aarch64-linux-gnu-objdump -dl binary_file | less

and

aarch64-linux-gnu-objdump -lx binary_file | less

The first command (-dl) disassembles the text segment and shows you all the instructions and at what address. Using this information you can find out things such as:

NB: In case it's not yet obvious, you will need to get up to speed on your ARM aarch64 assembly and not be afraid to get your hands dirty if you want to minimise the time you spend debugging.

The second objdump command (-lx) is useful for when addresses appear inside an object file but outside of the text segment. This is especially useful when debugging ELF loading. The -lx option displays section and symbol information. Further options can be added to dump data segments etc. man objdump is your friend.

Generating a Stack Trace

Warning: Backtrace only supported for the Linux toolchain

We have provided a function print_backtrace in SOS which allows you to print a primitive stacktrace from anywhere in SOS. Note that backtrace does not work across random jumps, including from within muslc system calls.

You can call print_backtrace() from within any arbitrary place in your code. Backtrace will traverse the eh_frame constructed by gcc during compilation, which provides stack offsets, and print out the addresses of each function in the call stack (as long as gcc can recognise the function call) This can be useful if you need to know what your code is doing.

Consider this example, where print_backtrace is called in the network_init function. The output is as follows:

SOS Started!
test_cspace@tests.c:53 Test cspace
libsel4muslcsys: Error attempting syscall 233
run_tests@tests.c:116 Root CSpace test passed!
test_cspace@tests.c:53 Test cspace
run_tests@tests.c:124 Single level cspace test passed!
test_cspace@tests.c:53 Test cspace
libsel4muslcsys: Error attempting syscall 233
run_tests@tests.c:137 Double level cspace test passed!
run_tests@tests.c:141 DMA test passed!
Network init
Backtracing stack PCs:
0x408fa4
0x409288
0x407a24
0x418fec

Usage of the backtrace script is shown below.

$ ./projects/aos/backtrace.py build/projects/aos/sos/sos
Enter whitespace separated addresses, <Ctrl>+D to process and exit
>
> 0x408fa4
0x409288
0x407a24
0x418fec
> > > > Processing....
--------------------------------------------------------------------------------
0           408fa4 in print_backtrace
  408f9c:  52800141   mov  w1, #0xa                     // #10
  408fa0:  94000868   bl  40b140 
  408fa4:  b9006ba0   str  w0, [x29,#104]
  408fa8:  b9406ba0   ldr  w0, [x29,#104]
  408fac:  7100001f   cmp  w0, #0x0
--------------------------------------------------------------------------------
1           409288 in network_init
  409280:  f9000fa0   str  x0, [x29,#24]
  409284:  97ffff38   bl  408f64 
  409288:  d2a00022   mov  x2, #0x10000                 // #65536
  40928c:  d2b92821   mov  x1, #0xc9410000              // #3376480256
  409290:  f9400fa0   ldr  x0, [x29,#24]
--------------------------------------------------------------------------------
2           407a24 in main_continued
  407a1c:  911f8000   add  x0, x0, #0x7e0
  407a20:  94000616   bl  409278 
  407a24:  900004a0   adrp  x0, 49b000 
  407a28:  91112000   add  x0, x0, #0x448
  407a2c:  94021680   bl  48d42c 
--------------------------------------------------------------------------------
3           418fec in utils_run_on_stack
  418fe4:  aa0203e0   mov  x0, x2
  418fe8:  d63f0020   blr  x1
  418fec:  9100029f   mov  sp, x20
  418ff0:  aa0003e0   mov  x0, x0
  418ff4:  f90027a0   str  x0, [x29,#72]