Debugging SOS

This guide first introduces some low-level techniques for debugging an OS personality on the seL4 kernel. The second section describes the GDB support we have added to SOS to enable debugging of itself.

seL4 Low-level Debugging

When developing an operating system on top of seL4 you may not have the luxury of using a debugger such as gdb. In this case, your best bet is a combination of dprintf and objdump. If you are doing the project on a Linux machine (using our linux toolchain) you can also use the backtrace function, combined with a script for parsing objdump output to obtain a primitive stack trace.

Debugging Faults

All exceptions result in a context switch to the seL4 kernel. The kernel exports fault handling policies to user level by sending an IPC to the appropriate IPC endpoint (see Chapter 5 of the seL4 manual). For applications such as console_test, the assigned endpoint will usually be the IPC endpoint that SOS waits on in the syscall_loop. SOS on the other hand is initialised with no fault endpoint and hence is expected to ensure that it does not cause a fault.

SOS faults

Below is an example of code that triggers a page fault from the SOS main function.

int main(void)
  {
      init_muslc();

      /* bootinfo was set as an environment variable in _sel4_start */
      char *bi_string = getenv("bootinfo");
      ZF_LOGF_IF(!bi_string, "Could not parse bootinfo from env.");

      /* register the location of the unwind_tables -- this is required for
       * backtrace() to work */
      __register_frame(&__eh_frame_start);

      seL4_BootInfo *boot_info;
      if (sscanf(bi_string, "%p", &boot_info) != 1) {
          ZF_LOGF("bootinfo environment value '%s' was not valid.", bi_string);
      }

      debug_print_bootinfo(boot_info);

      printf("\nSOS Starting...\n");

      NAME_THREAD(seL4_CapInitThreadTCB, "SOS:root");
      // cause a fault at 0xDEADBEEF
      *((char*)0xDEADBEEF) = 0;

When this code runs, the following output is displayed in the serial terminal:

SOS Starting...
Caught cap fault in send phase at address (nil)
while trying to handle:
vm fault on data at address 0xdeadbeef with status 0x92000045
in thread 0xffffff807efd6400 "SOS:root" at address 0x407c50
With stack:
0x4e59c0: 0x0
0x4e59c8: 0x48966c
0x4e59d0: 0x4e5a78
0x4e59d8: 0x0
0x4e59e0: 0x0
0x4e59e8: 0x0
0x4e59f0: 0x5ea000
0x4e59f8: 0x4895ec
0x4e5a00: 0x4e5a78
0x4e5a08: 0x4ac025
0x4e5a10: 0x48965c
0x4e5a18: 0x0
0x4e5a20: 0x4e5af0
0x4e5a28: 0x0
0x4e5a30: 0x0
0x4e5a38: 0x0

What's going on here? First, a page fault is triggered by the dereference of invalid address 0xDEADBEEF. Then, seL4 attempts to send a page fault IPC to the fault handling endpoint of SOS. However, SOS doesn't have a fault handling endpoint, as it is set to the default value of 0, an empty slot in the cspace. As a result, a double fault is sent, which means the thread is stopped by the kernel, and in debug mode, the above output is produced.

From the output, observe that SOS caused a vm fault with the following important values:

The fault address

The fault address is the invalid address which caused a vm fault (a TLB translation error). This is 0xdeadbeef, the address we tried to write to. Note that invalid addresses will not always trigger a vm fault: if you are unlucky enough to accidentally derefence a mapped address no fault will occur.

The exception status register

The exception status register is referred to as far_el1 on 64-bit ARM architectures (aarch64) and DFSR on 32-bit ARM architectures (aarch32). To decode the value, look at section D.10.4 of the ARMv8 Architecture reference manual

The fault address register

The most useful address is the fault address register, which contains the value of the instruction pointer at the time of the fault. This value can then be looked up with the objdump tool to to determine the context of fault.

Application faults

When an application thread is created, an appropriate fault handler IPC endpoint should be set in the thread's TCB (see Chapter 5 of the seL4 manual). If the application causes a fault and IPC will be sent to the assigned fault endpoint. Message registers provide important information regarding the cause and location of the fault. objdump may then be used to determine the context of fault.

Using objdump

Whenever you cause a fault, you should be able to see the current program counter printed out. You can use objdump to find where the offending code is. Pass to objdump the location of the binary that faulted: in this case, it's the sos binary, located in the build directory:

aarch64-linux-gnu-objdump -Dlx build/projects/aos/sos/sos | less

You can now use the searching facility in less to search for the faulting address. In this case I find the following fragment of output:

 407c44:       97fff6dc        bl      4057b4 
/home/alyons/aos-2019-test/build/../projects/aos/sos/src/main.c:516
  407c48:       d297dde0        mov     x0, #0xbeef                     // #48879
  407c4c:       f2bbd5a0        movk    x0, #0xdead, lsl #16
  407c50:       3900001f        strb    wzr, [x0]
/home/alyons/aos-2019-test/build/../projects/aos/sos/src/main.c:519
  407c54:       f9401ba1        ldr     x1, [x29,#48]

We can see that the error has occurred on code at line 516 of main.c, exactly where we triggered a fault by trying to write to 0xDEADBEEF.

If the fault occurs in an application (such as console_test or sosh) rather than SOS, the following command should be used, with APP_NAME substituted for the name of the application you are debugging:

aarch64-linux-gnu-objdump -Dlx projects/aos/apps/APP_NAME/APP_NAME  | less

More on objdump

objdump is a very handy utility for working out exactly what is where in an executable so you can work out what exactly is going wrong.

The two standard incantations for objdump are:

aarch64-linux-gnu-objdump -dl binary_file | less

and

aarch64-linux-gnu-objdump -lx binary_file | less

The first command (-dl) disassembles the text segment and shows you all the instructions and at what address. Using this information you can find out things such as:

NB: In case it's not yet obvious, you will need to get up to speed on your ARM aarch64 assembly and not be afraid to get your hands dirty if you want to minimise the time you spend debugging.

The second objdump command (-lx) is useful for when addresses appear inside an object file but outside of the text segment. This is especially useful when debugging ELF loading. The -lx option displays section and symbol information. Further options can be added to dump data segments etc. man objdump is your friend.

Generating a Stack Trace

Warning: Backtrace only supported for the Linux toolchain

We have provided a function print_backtrace in SOS which allows you to print a primitive stacktrace from anywhere in SOS. Note that backtrace does not work across random jumps, including from within muslc system calls.

You can call print_backtrace() from within any arbitrary place in your code. Backtrace will traverse the eh_frame constructed by gcc during compilation, which provides stack offsets, and print out the addresses of each function in the call stack (as long as gcc can recognise the function call) This can be useful if you need to know what your code is doing.

Consider this example, where print_backtrace is called in the network_init function. The output is as follows:

SOS Started!
test_cspace@tests.c:53 Test cspace
libsel4muslcsys: Error attempting syscall 233
run_tests@tests.c:116 Root CSpace test passed!
test_cspace@tests.c:53 Test cspace
run_tests@tests.c:124 Single level cspace test passed!
test_cspace@tests.c:53 Test cspace
libsel4muslcsys: Error attempting syscall 233
run_tests@tests.c:137 Double level cspace test passed!
run_tests@tests.c:141 DMA test passed!
Network init
Backtracing stack PCs:
0x408fa4
0x409288
0x407a24
0x418fec

Usage of the backtrace script is shown below.

$ ./projects/aos/backtrace.py build/projects/aos/sos/sos
Enter whitespace separated addresses, <Ctrl>+D to process and exit
>
> 0x408fa4
0x409288
0x407a24
0x418fec
> > > > Processing....
--------------------------------------------------------------------------------
0           408fa4 in print_backtrace
  408f9c:  52800141   mov  w1, #0xa                     // #10
  408fa0:  94000868   bl  40b140 
  408fa4:  b9006ba0   str  w0, [x29,#104]
  408fa8:  b9406ba0   ldr  w0, [x29,#104]
  408fac:  7100001f   cmp  w0, #0x0
--------------------------------------------------------------------------------
1           409288 in network_init
  409280:  f9000fa0   str  x0, [x29,#24]
  409284:  97ffff38   bl  408f64 
  409288:  d2a00022   mov  x2, #0x10000                 // #65536
  40928c:  d2b92821   mov  x1, #0xc9410000              // #3376480256
  409290:  f9400fa0   ldr  x0, [x29,#24]
--------------------------------------------------------------------------------
2           407a24 in main_continued
  407a1c:  911f8000   add  x0, x0, #0x7e0
  407a20:  94000616   bl  409278 
  407a24:  900004a0   adrp  x0, 49b000 
  407a28:  91112000   add  x0, x0, #0x448
  407a2c:  94021680   bl  48d42c 
--------------------------------------------------------------------------------
3           418fec in utils_run_on_stack
  418fe4:  aa0203e0   mov  x0, x2
  418fe8:  d63f0020   blr  x1
  418fec:  9100029f   mov  sp, x20
  418ff0:  aa0003e0   mov  x0, x0
  418ff4:  f90027a0   str  x0, [x29,#72]

GDB-based SOS Debugging

We have built some support for GDB-based debugging of SOS itself. The following describes how to enable support and connect GDB to a gdbserver running within sos itself on the ODroid-C2. The later section describe a little of the software architecture used that might be relevant to your debugging.

Enabling and Connecting GDB

Building for GDB

The gdbserver support is compiled into the system itself. To include the support, ensure the following flag is ON in settings.cmake.

# Enable debugger
set(SosGDBSupport ON CACHE BOOL "" FORCE)

In addition to the following variables remaining as supplied (and shown below).

set(CMAKE_BUILD_TYPE "Debug" CACHE STRING "" FORCE)
set(KernelDebugBuild ON CACHE BOOL "" FORCE)
set(HardwareDebugAPI ON CACHE BOOL "" FORCE)

The codebase is then built and run as per usual.

Running with GDB support

When GDB support is enabled, the SOS boot sequence pauses and waits for GDB to connect before completing the usual startup activities. An example of the boot sequence waiting for a GDB connection is shown below.

SOS Starting...
SOS Started!
test_cspace@tests.c:59 Test cspace
run_tests@tests.c:173 Root CSpace test passed!
test_cspace@tests.c:59 Test cspace
run_tests@tests.c:181 Single level cspace test passed!
test_cspace@tests.c:59 Test cspace
cspace_alloc_slot@cspace.c:335 Cspace is full!

run_tests@tests.c:194 Double level cspace test passed!
run_tests@tests.c:198 DMA test passed!
run_tests@tests.c:202 Frame table test passed!
Network init
dwmac.b0002000 Waiting for PHY auto negotiation to complete....... done
Speed: 1000, full duplex
Protocol ethernet registered (layer: 2).
Protocol ipv4 registered (layer: 3).
Protocol ipv6 registered (layer: 3).
Protocol icmp4 registered (layer: 4).
Protocol icmp6 registered (layer: 4).
Protocol igmp registered (layer: 4).
Protocol udp registered (layer: 4).
Protocol tcp registered (layer: 4).
Assigned ipv4 0.0.0.0 to device sos picotcp
DHCP client: update link
Assigned ipv4 192.168.168.44 to device sos picotcp
DHCP client: renewal time (T1) 3163553
DHCP client: rebinding time (T2) 1241251
DHCP client: lease time 6327107
DHCP client: netmask 255.255.255.0
DHCP client: gateway 192.168.168.1
Awaiting GDB connection...

Connecting GDB

Start a new terminal window and change to your build directory, run the aarch64-linux-gnu-gdb version of GDB, with projects/aos/sos/sos as the executable to debug. Note: if you have your own development environment, you need version 15.1 of GDB and a version that supports aarch64. Older versions may not work correctly.

To connect GDB to the gdbserver on the Odroid-C2, enter target remote | ../odroid serial_raw in GDB. GDB should connect to your SOS instance in a similar manner to the example shown below. Set the GDB scheduler-locking mode with set scheduler-locking step. You can then enter c to continue your system.

~/.../build$ aarch64-linux-gnu-gdb projects/aos/sos/sos
GNU gdb (GDB) 15.1
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-pc-linux-gnu --target=aarch64-none-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from projects/aos/sos/sos...
(gdb) target remote | ../odroid serial_raw
Remote debugging using | ../odroid serial_raw
arm_sys_send_recv (sys=18446744073709551615, dest=903898, out_badge=0xc0063c78, info_arg=12326, out_info=0xc0063ca8,
    in_out_mr0=0xc0063ca0, in_out_mr1=0xc0063c98, in_out_mr2=0xc0063c90, in_out_mr3=0xc0063c88, reply=0)
    at ../kernel/libsel4/sel4_arch_include/aarch64/sel4/sel4_arch/syscalls.h:169
169         *out_info = info;
(gdb) set scheduler-locking step
(gdb)

A sample .gdbinit file

The following is a sample build/.gdbinit file that simplifies startup. It simply allows you to type connect each time.

  define connect
  target remote | odroid serial_raw
  set scheduler-locking step
  end

GDB Connection Architecture

The following diagram illustrates the overall system architecture of a SOS being debugged with GDB.

Development Host

The dev. host is the machine hosting your work (e.g. a CSE machine or your personal laptop). It connects to one of the Odroid-C2s via aos-2.

aos-2

aos-2.cse.unsw.edu.au is a host that authenticates and manages all the connections from students to Odroid-C2s in a cluster numbering approximately 60. It attempts to filter out gdb related content from the odroid serial stream to make the stream more intelligible. Note that it is possible that some artifacts may appear occasionally.

aos-2 also provides a private NFS export and TFTP service to each Odroid-C2

Odroid-C2

The Odroid-C2 is where the action is. It boots a system image from aos-2 containing seL4 and SOS.

Observations

The gdbserver and SOS debug printing share the same serial port. There is the potential to cross the streams resulting in unexpected output for intrepid Ghostbusters. That said, sharing seems to work well with GDB and the odroid mulitplexer keeping the streams separate.

The gdbserver and SOS share the same address space. The gdbserver has access to everything SOS has in order to debug SOS, however SOS can also corrupt the gdbserver. In practice, the gdbserver seems reliable, it's just not completely immune to SOS memory safety violations.

The gdbserver only has access to SOS. It cannot be used to debug clients of SOS, i.e. SOS processes.