# μ-Kernel Construction



## **Fundamental Abstractions**

- Thread
  - What is a thread?
  - How to implement?
  - What conclusions can we draw from our analysis with respect to µK construction?
- Address Space























Universität Karlsruhe (TH)



### Construction conclusion

From the view of the designer there are two alternatives.

#### **Single Kernel Stack**

Only one stack is used all the time.

#### **Per-Thread Kernel Stack**

Every thread has a kernel stack.



# Single Kernel Stack

per Processor, event model

- either continuations
  - complex to program
  - must be conservative in state saved (any state that *might* be needed)
  - Mach (Draves), L4Ka::Strawberry, NICTA Pistachio, OKL4
- or stateless kernel
  - no kernel threads, kernel not interruptible, difficult to program
  - request all potentially required resources prior to execution
  - blocking syscalls must always be re-startable
  - processor-provided stack management can get in the way
  - system calls need to be kept simple ("atomic")
  - advantage: kernel can be exchanged on-the-fly
  - e.g. the Fluke kernel from Utah
- low cache footprint
  - always the same stack is used!
  - reduced memory footprint
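
The continuation variant of the single-stack model can be sketched in C. This is a minimal illustration, not code from Mach or any L4 kernel: `struct thread`, `receive_block`, `receive_continue`, and `run_continuation` are hypothetical names. A blocking operation is split into two halves; the first half saves the state that might be needed and records where to continue, so the shared stack can be handed to the next thread.

```c
#include <assert.h>
#include <stddef.h>

struct thread;
typedef int (*continuation_fn)(struct thread *);

struct thread {
    continuation_fn resume;  /* where to continue once the thread is woken */
    int saved_arg;           /* state saved conservatively before blocking */
};

/* Second half of a blocking operation; runs on the shared stack after wake-up. */
int receive_continue(struct thread *t) {
    return t->saved_arg * 2;   /* stand-in for "finish the operation" */
}

/* First half: save what might be needed, record the continuation, then return
 * so the single kernel stack can be reused by the next thread. */
void receive_block(struct thread *t, int arg) {
    assert(t != NULL);
    t->saved_arg = arg;
    t->resume = receive_continue;
}

/* Scheduler side: resume a blocked thread by calling its continuation. */
int run_continuation(struct thread *t) {
    assert(t != NULL && t->resume != NULL);
    continuation_fn k = t->resume;
    t->resume = NULL;
    return k(t);
}
```

Nothing of the blocked thread survives on the stack; everything it needs later has to live in `struct thread`, which is why the saved state must be chosen conservatively.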



## Per-Thread Kernel Stack

simple, flexible

- kernel can always use threads, no special techniques required for keeping state while interrupted / blocked
- no conceptual difference between kernel mode and user mode
- e.g. L4
- but larger cache footprint
- larger memory footprint

#### **Conclusion:**

We have to look for a solution that minimizes the kernel stack size!

# Kernel Entry/Exit

- A look at mechanics of kernel entry and exit
- Optimisations
- Context switching



### enter kernel (IA32)












- trap / fault occurs (*INT n* / exception / interrupt); hardware programmed, single instruction:
  - push user esp onto kernel stack, load kernel esp
  - push user eflags, reset flags (I=0, S=0)
  - push user eip, load kernel entry eip
- push X: error code (hw, at exception) or kernel-call type
- push registers (optional)
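
The frame these pushes leave on the kernel stack can be written down as a C struct. This is a sketch under assumptions: 32-bit words, lowest address first, and the `cs` and `ss` slots taken from the kernel-stack diagram that appears later in these slides. The names `ia32_entry_frame`, `frame_offset_eip`, `frame_offset_esp`, and `frame_word` are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the kernel stack right after entry, lowest address first. */
struct ia32_entry_frame {
    uint32_t x;       /* error code (hw, at exception) or kernel-call type */
    uint32_t eip;     /* user instruction pointer */
    uint32_t cs;      /* user code segment selector */
    uint32_t eflags;  /* user flags */
    uint32_t esp;     /* user stack pointer */
    uint32_t ss;      /* user stack segment selector */
};

/* Byte offsets of the saved user eip and esp relative to the kernel esp. */
size_t frame_offset_eip(void) { return offsetof(struct ia32_entry_frame, eip); }
size_t frame_offset_esp(void) { return offsetof(struct ia32_entry_frame, esp); }

/* Read one saved word, given the kernel stack pointer and a byte offset. */
uint32_t frame_word(const uint32_t *kernel_esp, size_t byte_offset) {
    assert(kernel_esp != NULL && byte_offset % sizeof(uint32_t) == 0);
    return kernel_esp[byte_offset / sizeof(uint32_t)];
}
```

With this layout the saved eip sits at offset 4 and the saved esp at offset 16 from the kernel stack pointer.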







## Sysenter/Sysexit

- Fast kernel entry/exit
  - Only between ring 0 and 3
  - Avoid memory references specifying kernel entry point and saving state
- Use Model Specific Register (MSR) to specify kernel entry
  - Kernel IP, Kernel SP
  - Flat 4GB segments
  - Saves no state for exit
- Sysenter
  - EIP = MSR(Kernel IP)
  - ESP = MSR(Kernel SP)
  - Eflags.I = 0, Eflags.S = 0
  - User level has to provide IP and SP, by convention in registers (ECX, EDX?)
- Sysexit
  - ESP = ECX
  - EIP = EDX
  - Eflags.S = 3
  - Flags undefined; kernel has to re-enable interrupts



# Sysenter/Sysexit

Emulate int instruction (ECX=USP, EDX=UIP)

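```
sub  $20, %esp        # reserve 5 words: X, eip, cs, eflags, esp
mov  %ecx, 16(%esp)   # user SP (passed in ECX) into the esp slot
mov  %edx, 4(%esp)    # user IP (passed in EDX) into the eip slot
mov  $5, (%esp)       # X slot
```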

Emulate iret instruction

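```
mov  16(%esp), %ecx   # user SP for sysexit
mov  4(%esp), %edx    # user IP for sysexit
sti                   # sysexit does not restore the flags, so re-enable interrupts
sysexit
```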

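(Figure: kernel stack within the tcb after the emulated entry; ESP points at the pushed 5, followed by the eip slot.)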



#### Kernel-stack state

#### Uniprocessor:

- Any kstack ≠ myself is current!
  - (my kstack below [esp] is also current when in kernel mode.)

Only one thread is running; all other threads are in their kernel state, so their kernel stacks can be analyzed. All threads except the running one are in kernel mode.

Kernel stack layout (ascending addresses): tcb | edi ... eax | X | eip | cs | flg | esp | ss



#### Remember:

- We need to find
  - any thread's tcb starting from its uid
  - the currently executing thread's tcb
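
One common way to find the current thread's tcb is to derive it from the kernel stack pointer. The sketch below assumes that each tcb is aligned to its size and contains the thread's kernel stack; the value 512 for `TCB_SIZE` and the name `current_tcb` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TCB_SIZE 512u   /* assumed: a power of two, tcbs aligned to this size */

/* Round the kernel stack pointer down to the start of the enclosing tcb. */
uintptr_t current_tcb(uintptr_t kernel_esp) {
    assert(kernel_esp != 0);
    return kernel_esp & ~(uintptr_t)(TCB_SIZE - 1u);
}
```

No memory access is needed: a single `and` on esp yields the tcb address while running in kernel mode on that thread's stack.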










## Thread switch (IA32)






- int 0x32, push registers of the green (outgoing) thread
- switch kernel stacks (store and load esp)
- set esp0 to new kernel stack
- pop registers of the orange (incoming) thread, return to new user thread
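
The two middle steps of the switch can be modeled in plain C. This is only a simulation under assumptions: `struct cpu` stands in for the real esp register and the esp0 field, and `struct tcb`, `saved_esp`, `kstack_top`, and `switch_threads` are hypothetical names. Pushing and popping the registers happens around this function in the real entry and exit paths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct tcb {
    uintptr_t saved_esp;   /* kernel stack pointer saved while not running */
    uintptr_t kstack_top;  /* address just above this thread's kernel stack */
};

struct cpu {
    uintptr_t esp;         /* stands in for the esp register */
    uintptr_t esp0;        /* stack pointer loaded on the next kernel entry */
    struct tcb *current;
};

/* Switch kernel stacks from the current thread to `next`. */
void switch_threads(struct cpu *cpu, struct tcb *next) {
    assert(cpu != NULL && cpu->current != NULL && next != NULL);
    cpu->current->saved_esp = cpu->esp;  /* store esp of the outgoing thread */
    cpu->esp = next->saved_esp;          /* load esp of the incoming thread */
    cpu->esp0 = next->kstack_top;        /* next kernel entry uses the new stack */
    cpu->current = next;
}
```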



## Mips R4600

- 32 Registers
- no hardware stack support
- special registers
  - exception IP, status, etc.
  - single registers, unstacked!
- Soft TLB !!
  - Kernel has to parse page table.

(Figure: register file r0 ... r31 with r0 = 0 and k0, k1 marked, plus the special registers Exc PC and Status.)

## **Exceptions on MIPS**

- On an exception (syscall, interrupt, ...)
  - Loads Exc PC with the address of the faulting instruction
  - Sets status register
    - Kernel mode, interrupts disabled, in exception.
  - Jumps to 0xffffffff80000180




### To switch to kernel mode

- Save relevant user state
- Set up a safe kernel execution environment
  - Switch to kernel stack
  - Able to handle kernel exceptions
  - Potentially enable interrupts


### **Problems**

- No stack pointer???
  - Defined by convention sp (r29)
- Load/Store Architecture: no registers to work with???
  - By convention k0, k1 (r26, r27) for kernel use only




# System Calls - Kernel Side

- Things left to do
  - Change to kernel stack
  - Preserve registers by saving to memory (the stack)
  - Leave saved registers somewhere accessible to
    - Read arguments
    - Store return values
  - Do the "read()"
  - Restore registers
  - Switch back to user stack
  - Return to application




Note: the k0 and k1 registers are available for kernel use.



```
exception:
   move k1, sp              /* Save previous stack pointer in k1 */
   mfc0 k0, c0_status       /* Get status register */
   andi k0, k0, CST_KUp     /* Check the we-were-in-user-mode bit */
   beq  k0, $0, 1f          /* If clear, from kernel, already have stack */
   nop                      /* delay slot */

   /* Coming from user mode - load kernel stack into sp */
   la   k0, curkstack       /* get address of "curkstack" */
   lw   sp, 0(k0)           /* get its value */
   nop                      /* delay slot for the load */
1:
   mfc0 k0, c0_cause        /* Now, load the exception cause. */
   j    common_exception    /* Skip to common code */
   nop                      /* delay slot */
```



#### common_exception:

```
/*
 * At this point:
 *      Interrupts are off. (The processor did this for us.)
 *      k0 contains the exception cause value.
 *      k1 contains the old stack pointer.
 *      sp points into the kernel stack.
 *      All other registers are untouched.
 */

/*
 * Allocate stack space for 37 words to hold the trap frame,
 * plus four more words for a minimal argument block.
 */
addi sp, sp, -164
```



These six stores are a "hack" to avoid confusing GDB. You can ignore the details of why and how.

```
/* The order here must match mips/include/trapframe.h. */
sw ra, 160(sp)     /* dummy for gdb */
sw s8, 156(sp)     /* save s8 */
sw sp, 152(sp)     /* dummy for gdb */
sw gp, 148(sp)     /* save gp */
sw k1, 144(sp)     /* dummy for gdb */
sw k0, 140(sp)     /* dummy for gdb */

/* The real work starts here */
sw k1, 152(sp)     /* real saved sp */
nop                /* delay slot for store */
mfc0 k1, c0_epc    /* Copr.0 reg 13 == PC for exception */
sw k1, 160(sp)     /* real saved PC */
```



## Save all the registers on the kernel stack

```
sw t9, 136(sp)
```



```
/*
 * Save special registers.
 * We can now use the other registers (t0, t1) that we have
 * preserved on the stack.
 */
mfhi t0
mflo t1
sw t0, 32(sp)
sw t1, 28(sp)

/*
 * Save remaining exception context information.
 */
sw k0, 24(sp)          /* k0 was loaded with cause earlier */
mfc0 t1, c0_status     /* Copr.0 reg 11 == status */
sw t1, 20(sp)
mfc0 t2, c0_vaddr      /* Copr.0 reg 8 == faulting vaddr */
sw t2, 16(sp)

/*
 * Pretend to save $0 for gdb's benefit.
 */
sw $0, 12(sp)
```



Create a pointer to the base of the saved registers and state in the first argument register



```
struct trapframe {
   u_int32_t tf_vaddr;   /* vaddr register */
   u_int32_t tf_status;  /* status register */
   u_int32_t tf_cause;   /* cause register */
   u_int32_t tf_lo;
   u_int32_t tf_hi;
   u_int32_t tf_ra;      /* Saved register 31 */
   u_int32_t tf_at;      /* Saved register 1 (AT) */
   u_int32_t tf_v0;      /* Saved register 2 (v0) */
   u_int32_t tf_v1;      /* etc. */
   u_int32_t tf_a0;
   u_int32_t tf_a1;
   u_int32_t tf_a2;
   u_int32_t tf_a3;
   u_int32_t tf_t0;
   /* ... */
   u_int32_t tf_t7;
   u_int32_t tf_s0;
   /* ... */
   u_int32_t tf_s7;
   u_int32_t tf_t8;
   u_int32_t tf_t9;
   u_int32_t tf_k0;      /* dummy (see exception.S comments) */
   u_int32_t tf_k1;      /* dummy */
   u_int32_t tf_gp;
   u_int32_t tf_sp;
   u_int32_t tf_s8;
   u_int32_t tf_epc;     /* coprocessor 0 epc register */
};
```

By creating a pointer of type `struct trapframe *` to the base of the saved registers, we can access the user's saved registers as normal variables within C.

Kernel stack (highest address first): epc, s8, sp, gp, k1, k0, t9, t8, ..., at, ra, hi, lo, cause, status, vaddr.
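
Once the trap frame is reachable from C, reading arguments and storing return values become plain field accesses. The sketch below uses a reduced, hypothetical `struct trapframe_sketch` rather than the full struct; the call number `SYS_ADD` is made up, and passing the call number in v0, the arguments in a0/a1, and the result back in v0 is an assumed convention, not something these slides specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced, hypothetical subset of the trap frame. */
struct trapframe_sketch {
    uint32_t tf_v0;   /* assumed: call number in, result out */
    uint32_t tf_a0;   /* assumed: first argument */
    uint32_t tf_a1;   /* assumed: second argument */
    uint32_t tf_epc;  /* address of the instruction that trapped */
};

#define SYS_ADD 1u    /* made-up call number for illustration */

void handle_syscall(struct trapframe_sketch *tf) {
    assert(tf != NULL);
    switch (tf->tf_v0) {
    case SYS_ADD:
        tf->tf_v0 = tf->tf_a0 + tf->tf_a1;   /* store the return value */
        break;
    default:
        tf->tf_v0 = UINT32_MAX;              /* unknown call */
        break;
    }
    tf->tf_epc += 4;  /* resume after the trapping instruction */
}
```

When the saved registers are restored, the application sees the modified v0 as the result of the call.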



## enter kernel: (Mips)

```
        mov   k1, C0_status
        and   k0, k1, exc_code_mask
        sub   k0, syscall_code
        IFNZ  k0                        /* no syscall trap */
              mov   k0, kernel_base
              jmp   other_exception
        FI
        mov   t0, k1
        srl   k1, 5                     /* clear IE, EXL, ERL, KSU */
        sll   k1, 5
        mov   C0_status, k1
```

Load kernel stack pointer if trap from user mode, then push old sp (t2), ip (t1), and status (t0):

```
        and   k1, t0, st_ksu_mask
        IFNZ  k1
              mov   t2, sp
              mov   sp, kernel_stack_bottom(k0)
        FI
        mov   t1, C0_exception_ip
        mov   [sp-8], t2
        add   t1, t1, 4
        mov   [sp-16], t1
        mov   [sp-24], t0
        IFZ   AT, zero
              sub   sp, 24
              jmp   k_ipc
        FI
```







## Construction Conclusions (1)

- Thread state must be saved / restored on thread switch.
- We need a thread control block (TCB) per thread.
- TCBs must be kernel objects.
  - TCBs implement threads.
- We need to find
  - any thread's TCB starting from its uid
  - the currently executing thread's TCB (per processor)



## Thread ID

- thread number
  - to find the tcb
- thread version number
  - to make thread ids "unique" in time
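
A thread ID built from these two parts can be sketched as a packed 32-bit word. The split used here (18 bits of thread number above 14 bits of version) is an assumption chosen to match the 256K potential threads mentioned later; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define VERSION_BITS   14u   /* assumed width of the version field */
#define THREAD_NO_BITS 18u   /* assumed: 2^18 = 256K thread numbers */
#define VERSION_MASK   ((1u << VERSION_BITS) - 1u)

uint32_t make_thread_id(uint32_t number, uint32_t version) {
    assert(number < (1u << THREAD_NO_BITS) && version <= VERSION_MASK);
    return (number << VERSION_BITS) | version;
}

uint32_t thread_number(uint32_t tid)  { return tid >> VERSION_BITS; }
uint32_t thread_version(uint32_t tid) { return tid & VERSION_MASK; }
```

Reusing thread number 5 with a new version yields a different ID, so a stale ID held by another thread no longer matches.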



## Thread ID → TCB (a)



jnz invalid\_thread\_id



## Thread ID → TCB (b)

Direct address (the thread ID consists of thread number and version):

mov thread\_id, %eax

mov %eax, %ebx

and mask thread\_no, %eax

add offset tcb\_array, %eax

cmp %ebx, OFS\_TCB\_MYSELF(%eax)

jnz invalid\_thread\_id
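
The same lookup can be sketched in C. Assumptions for this illustration: 512-byte TCBs, a tiny 16-entry `tcb_array`, and a thread ID whose thread-number field sits at bit 9 so that masking the ID directly yields the byte offset of the TCB. The names `tcb_from_id` and `myself` mirror the assembly but are otherwise hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCB_SIZE       512u
#define MAX_THREADS    16u                               /* tiny, for the sketch */
#define THREAD_NO_MASK ((MAX_THREADS - 1u) * TCB_SIZE)   /* bits 9..12 of the ID */

struct tcb {
    uint32_t myself;                             /* full thread ID of the owner */
    uint8_t  rest[TCB_SIZE - sizeof(uint32_t)];  /* remaining state, kernel stack */
};

static struct tcb tcb_array[MAX_THREADS];

/* Returns the TCB for tid, or NULL where the assembly jumps to invalid_thread_id. */
struct tcb *tcb_from_id(uint32_t tid) {
    size_t offset = tid & THREAD_NO_MASK;              /* and mask_thread_no */
    struct tcb *t = &tcb_array[offset / TCB_SIZE];     /* add offset tcb_array */
    assert(offset % TCB_SIZE == 0);
    return (t->myself == tid) ? t : NULL;              /* cmp with OFS_TCB_MYSELF */
}
```

The comparison against `myself` is what rejects IDs with an outdated version number.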



### Thread ID translation

- Via table
  - no MMU
  - table access per TCB
  - TLB entry for table
  - TCB pointer array requires 1M virtual memory for 256K potential threads
- Via MMU
  - MMU
  - no table access
  - TLB entry per TCB
  - virtual resource *TCB array* required; 256K potential threads need 128M virtual space for TCBs



### Trick:



TCB pointer array requires 1M virtual memory for 256K potential threads.

- Allocate physical parts of the table on demand, dependent on the max number of allocated TCBs.
- Map all remaining parts to a 0-filled page.
  - Any access to corresponding threads will result in "invalid thread id".
- However: requires 4K pages in this table.
  - TLB working set grows: 4 entries to cover 4000 threads.
  - Nevertheless much better than 1 TLB entry for 8 threads, as in direct address.
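
The numbers above follow from a few divisions. The sketch below recomputes them, assuming 4-byte table entries, 512-byte TCBs, and 4K pages; the macro and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_THREADS  (256u * 1024u)
#define POINTER_SIZE 4u      /* assumed: 32-bit TCB pointers */
#define TCB_BYTES    512u    /* assumed: 128M / 256K threads */
#define PAGE_BYTES   4096u

uint32_t pointer_table_bytes(void)    { return MAX_THREADS * POINTER_SIZE; }
uint32_t tcb_array_bytes(void)        { return MAX_THREADS * TCB_BYTES; }
uint32_t threads_per_table_page(void) { return PAGE_BYTES / POINTER_SIZE; }
uint32_t threads_per_tcb_page(void)   { return PAGE_BYTES / TCB_BYTES; }

/* Threads reachable through `pages` TLB entries that map the pointer table. */
uint32_t threads_covered(uint32_t pages) {
    assert(pages <= pointer_table_bytes() / PAGE_BYTES);
    return pages * threads_per_table_page();
}
```

Four table pages cover 4096 thread numbers, while one 4K page of the TCB array holds only 8 TCBs.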



## AS Layout 32bits, virt tcb, entire PM

Regions of the address space (from the figure): user regions, other kernel tables, physical memory, kernel code, shared system regions, per-space system regions.

#### Limitations

32bits, virt tcb, entire PM

- number of threads
- physical mem size



(Figure: address-space regions of size 3 G, 512 M, 256 M, 256 M; phys mem.)




## **FPU Context Switching**

Strict switching

- Thread switch:
  - Store current thread's FPU state
  - Load new thread's FPU state

- Extremely expensive
  - IA-32's full SSE2 state is 512 Bytes
  - IA-64's floating point state is ~1.5KB
- May not even be required
  - Threads do not always use FPU







## **IPC**

Functionality & Interface



# What IPC primitives do we need to communicate?

- Send to (a specified thread)
- Receive from (a specified thread)

- Two threads can communicate
- Can create specific protocols without fear of interference from other threads
- Other threads block until it's their turn
- Problem:
  - How to communicate with a thread unknown a priori (e.g., a server's clients)?



# What IPC primitives do we need to communicate?

- Send to (a specified thread)
- Receive from (a specified thread)
- Receive (from any thread)

#### Scenario:

- A client thread sends a message to a server expecting a response.
- The server replies expecting the client thread to be ready to receive.
- Issue: The client might be preempted between the send to and the receive from, so the server's reply can arrive before the client is ready to receive.



# What IPC primitives do we need to communicate?

- Send to (a specified thread)
- Receive from (a specified thread)
- Receive (from any thread)
- Call (send to, receive from specified thread)
  - Atomic operation to ensure that the server's (callee's) reply cannot arrive before the client (caller) is ready to receive
- Send to & Receive (send to, receive from any thread)
  - Atomic operation for optimization reasons. Typically used by servers to reply and wait for the next request (from anyone).
- Send to, Receive from (send to, receive from specified different threads)

Are other combinations appropriate?



# What message types are appropriate?

- Register
  - Short messages we hope to make fast by avoiding memory access to transfer the message during IPC
  - Guaranteed to avoid user-level page faults during IPC
- Direct string (optional)
  - In-memory message we construct to send
- Indirect strings (optional)
  - In-memory messages sent in place
  - Can be combined
- Map pages (optional)
  - Messages that map pages from sender to receiver



# What message types are appropriate?

[Version 4, Version X.2]

- Register
  - Short messages we hope to make fast by avoiding memory access to transfer the message during IPC
  - Guaranteed to avoid user-level page faults during IPC
- Strings (optional)
  - In-memory message we construct to send
- Indirect strings (optional)
- Map pages (optional)
  - Messages that map pages from sender to receiver



- Operations
  - Send to
  - Receive from
  - Receive
  - Call
  - Send to & Receive
  - Send to, Receive from

- Message Types
  - Registers
  - Strings
  - Map pages



## **Problem**

- How do we deal with threads that are:
  - Uncooperative
  - Malfunctioning
  - Malicious
- That might result in an IPC operation never completing?






- Timeouts (V2, V X.0)
  - snd timeout, rcv timeout
    - snd-pf timeout
      - specified by sender

Attack through receiver's pager:





- Timeouts (V2, V X.0)
  - snd timeout, rcv timeout
    - snd-pf / rcv-pf timeout
      - specified by receiver

Attack through sender's pager:





### Timeout Issues

- What timeout values are typical or necessary?
- How do we encode timeouts to minimize the space needed to specify all four values?

- Timeout values
  - Infinite
    - Client waiting for a server
  - 0 (zero)
    - Server responding to a client
    - Polling
  - Specific time
    - 1 µs ... 19 h (log)



## To Compact the Timeout Encoding

- Assume short timeouts need finer granularity than long timeouts
  - Timeouts can always be combined to achieve long fine-grain timeouts



- Assume page fault timeout granularity can be much coarser than send/receive granularity

send/receive timeout =
$$\begin{cases} \infty & \text{if } e = 0 \\ 4^{15-e}\,m\ \mu\text{s} & \text{if } e > 0 \\ 0 & \text{if } m = 0,\ e \neq 0 \end{cases}$$

Page fault timeout has no mantissa

page fault timeout =
$$\begin{cases} \infty & \text{if } p = 0 \\ 4^{15-p}\ \mu\text{s} & \text{if } 0 < p < 15 \\ 0 & \text{if } p = 15 \end{cases}$$
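
The send/receive formula can be checked with a small decoder. This is a sketch, assuming a 4-bit exponent `e`, an 8-bit mantissa `m`, and a result in microseconds; the names `snd_rcv_timeout_us` and `TIMEOUT_INFINITE` are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define TIMEOUT_INFINITE UINT64_MAX   /* stand-in value for "never times out" */

/* Send/receive timeout in microseconds: 4^(15-e) * m, with the special cases
 * e == 0 (infinite) and m == 0 (zero timeout). */
uint64_t snd_rcv_timeout_us(unsigned e, unsigned m) {
    assert(e <= 15 && m <= 255);
    if (e == 0) return TIMEOUT_INFINITE;
    if (m == 0) return 0;
    return ((uint64_t)1 << (2 * (15 - e))) * m;   /* 4^(15-e) == 2^(2*(15-e)) */
}
```

For example, `e = 1, m = 255` gives 68451041280 µs, roughly 19 h, and `e = 15, m = 1` gives 1 µs.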



## Timeout Range of Values (seconds) [V2, V X.0]

| e  | m = 1      | m = 255     |
|----|------------|-------------|
| 0  | $\infty$   | $\infty$    |
| 1  | 268.435456 | 68451.04128 |
| 2  | 67.108864  | 17112.76032 |
| 3  | 16.777216  | 4278.19008  |
| 4  | 4.194304   | 1069.54752  |
| 5  | 1.048576   | 267.38688   |
| 6  | 0.262144   | 66.84672    |
| 7  | 0.065536   | 16.71168    |
| 8  | 0.016384   | 4.17792     |
| 9  | 0.004096   | 1.04448     |
| 10 | 0.001024   | 0.26112     |
| 11 | 0.000256   | 0.06528     |
| 12 | 0.000064   | 0.01632     |
| 13 | 0.000016   | 0.00408     |
| 14 | 0.000004   | 0.00102     |
| 15 | 0.000001   | 0.000255    |

- e = 1: up to 19 h with ~4.4 min granularity
- e = 15: 1 µs – 255 µs with 1 µs granularity



- Timeouts (v2, v x.0)
  - snd timeout, rcv timeout
    - snd-pf / rcv-pf timeout

- timeout values
  - 0
  - infinite
  - 1us ... 19 h (log)
- Compact 32-bit encoding



- Timeouts (v x.2, v 4)
  - snd timeout, rcv timeout, xfer timeout snd, xfer timeout rcv





- Send to
- Receive from
- Receive
- Call
- Send to & Receive
- Send to, Receive from
- Destination thread ID
- Source thread ID
- Send registers
- Receive registers
- Number of send strings
- Send string start for each string
- Send string size for each string
- Number of receive strings
- Receive string start for each string
- Receive string size for each string

- Number of map pages
- Page range for each map page
- Receive window for mappings
- IPC result code
- Send timeout
- Receive timeout
- Send Xfer timeout
- Receive Xfer timeout
- Receive from thread ID
- Specify deceiting IPC
- Thread ID to deceit as
- Intended receiver of deceited IPC



#### Ideally Encoded in Registers

- Parameters in registers whenever possible
- Make frequent/simple operations simple and fast









#### Send and Receive Encoding

- 0 (Nil ID) is a reserved thread ID
- Define -1 as a wildcard thread ID
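
With a nil ID and a wildcard ID, a single system call taking a destination and a receive specifier can express all six operations. The sketch below shows one possible decoding; the enum, the constant names, and `classify_ipc` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NIL_THREAD 0u           /* reserved thread ID */
#define ANY_THREAD UINT32_MAX   /* -1 as a 32-bit wildcard */

enum ipc_op {
    IPC_NONE, IPC_SEND, IPC_RECEIVE_FROM, IPC_RECEIVE,
    IPC_CALL, IPC_SEND_AND_RECEIVE, IPC_SEND_RECEIVE_FROM
};

/* Decide which operation a (destination, receive specifier) pair denotes. */
enum ipc_op classify_ipc(uint32_t to, uint32_t from) {
    assert(to != ANY_THREAD);   /* a send needs a specific destination */
    if (to == NIL_THREAD) {     /* no send phase */
        if (from == NIL_THREAD) return IPC_NONE;
        return from == ANY_THREAD ? IPC_RECEIVE : IPC_RECEIVE_FROM;
    }
    if (from == NIL_THREAD) return IPC_SEND;              /* no receive phase */
    if (from == ANY_THREAD) return IPC_SEND_AND_RECEIVE;  /* reply and wait */
    return from == to ? IPC_CALL : IPC_SEND_RECEIVE_FROM;
}
```

A server loop would typically use the `IPC_SEND_AND_RECEIVE` combination, a client the `IPC_CALL` combination.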





## Why use a single call instead of many?

- The implementation of the individual send and receive is very similar to the combined send and receive
  - We can use the same code
    - We reduce cache footprint of the code
    - We make applications more likely to be in cache




#### Message Transfer

- Assume that 64 extra registers are available
  - Name them MR<sub>0</sub> ... MR<sub>63</sub> (message registers 0 ... 63)
  - All message registers are transferred during IPC




#### Message construction

- Messages are stored in registers (MR<sub>0</sub> ... MR<sub>63</sub>)
- First register (MR<sub>0</sub>) acts as message tag
- Subsequent registers contain:
  - Untyped words (u), and
  - Typed words (t) (e.g., map item, string item)
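
A message tag that carries the counts u and t can be sketched as a packed 32-bit word. The bit positions used here (label in bits 31..16, flags in 15..12, t in 11..6, u in 5..0) are an assumption modeled on the L4 X.2 message tag and should be checked against the reference manual; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tag layout: label[31:16] flags[15:12] t[11:6] u[5:0]. */
uint32_t make_msg_tag(uint32_t label, uint32_t flags, uint32_t t, uint32_t u) {
    assert(label <= 0xFFFFu && flags <= 0xFu);
    assert(u + t <= 63u);   /* MR0 holds the tag, MR1..MR63 hold the words */
    return (label << 16) | (flags << 12) | (t << 6) | u;
}

uint32_t tag_untyped(uint32_t tag) { return tag & 0x3Fu; }
uint32_t tag_typed(uint32_t tag)   { return (tag >> 6) & 0x3Fu; }
uint32_t tag_label(uint32_t tag)   { return tag >> 16; }

/* Number of message registers the message occupies, including MR0. */
uint32_t tag_total_words(uint32_t tag) {
    return 1u + tag_untyped(tag) + tag_typed(tag);
}
```

The kernel only needs to look at MR<sub>0</sub> to know how many further message registers to transfer.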










#### Message construction

- Typed items occupy one or more words
- Three currently defined items:
  - Map item (2 words)
  - Grant item (2 words)
  - String item (2+ words)
- Typed items can have arbitrary order





#### Map and Grant items

Two words:





#### String items

- Max size 4MB (per string)
- Compound strings supported
  - Allows scatter-gather
- Incorporates cacheability hints
  - Reduce cache pollution for long copy operations













#### **Timeouts**

- Send and receive timeouts are the important ones
  - Xfer timeouts only needed during string transfer
  - Store Xfer timeouts in predefined memory location






#### String Receival

- Assume that 34 extra registers are available
  - Name them BR<sub>0</sub> ... BR<sub>33</sub> (buffer registers 0 ... 33)
  - Buffer registers specify
    - Receive strings
    - Receive window for mappings



#### Receiving messages

- Receiver buffers are specified in registers (BR<sub>0</sub> ... BR<sub>33</sub>)
- First BR (BR<sub>0</sub>) contains "Acceptor"
  - May specify receive window (if not nil-fpage)
  - May indicate presence of receive strings/buffers (if s-bit set)

BR<sub>0</sub> (Acceptor) layout: receive window | 000s









#### **IPC** Result

- Error conditions are exceptional
  - I.e., not common case
  - No need to optimize for error handling
- Bit in received message tag indicates error
  - Fast check
- Exact error code stored in predefined memory location
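
The split between a fast flag check and a slow memory lookup can be sketched as follows. The position of the error bit (`TAG_ERROR_BIT`, bit 15) is an assumption, and `error_code_slot` merely models the predefined, thread-local memory location; all names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TAG_ERROR_BIT (1u << 15)   /* assumed position of the error flag */

struct ipc_result {
    bool     ok;
    uint32_t error_code;   /* only meaningful when ok is false */
};

/* Inspect the received tag; touch memory only in the exceptional case. */
struct ipc_result check_ipc(uint32_t received_tag, const uint32_t *error_code_slot) {
    struct ipc_result r = { true, 0 };
    assert(error_code_slot != NULL);
    if (received_tag & TAG_ERROR_BIT) {     /* fast check on a register value */
        r.ok = false;
        r.error_code = *error_code_slot;    /* slow path: fetch the details */
    }
    return r;
}
```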



#### **IPC** Result

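- IPC errors flagged in MR<sub>0</sub>
- Sender's thread ID stored in register

#### Sender Registers

| Register | Content           |
|----------|-------------------|
| EAX      | destination       |
| ECX      | timeouts          |
| EDX      | receive specifier |
| EBX      |                   |
| EBP      |                   |
| ESI      |                   |
| EDI      |                   |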




#### **IPC** Redirection

- Redirection/deceiting IPC flagged by bit in the message tag
  - Fast check



- When redirection bit set
  - Thread ID to deceit as and intended receiver ID stored in predefined memory locations




#### Virtual Registers

- What about message and buffer registers?
  - Most architectures do not have enough physical registers
- What about predefined memory locations?
  - Must be thread local

Define both as Virtual Registers



#### What are Virtual Registers?

- Virtual registers are backed by either
  - Physical registers, or
  - Non-pageable memory
- UTCBs hold the memory backed registers
  - UTCBs are thread local
  - UTCB can not be paged
    - No page faults
    - Registers always accessible
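
A virtual register file can be sketched as an accessor that picks the backing store by register index. Assumptions for this illustration: MR<sub>0</sub> to MR<sub>2</sub> are backed by CPU registers and all others by the UTCB; the field layout of `struct utcb` and the names `load_mr` and `store_mr` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_MR      64
#define NUM_BR      34
#define NUM_PHYS_MR 3    /* assumed: MR0..MR2 travel in CPU registers */

/* Hypothetical UTCB layout; lives in thread-local, non-pageable memory. */
struct utcb {
    uint32_t mr[NUM_MR];
    uint32_t br[NUM_BR];
    uint32_t error_code;
    uint32_t xfer_timeouts;
};

/* Stand-in for the physical registers that carry the first MRs. */
struct cpu_regs {
    uint32_t mr[NUM_PHYS_MR];
};

uint32_t load_mr(const struct cpu_regs *regs, const struct utcb *utcb, unsigned i) {
    assert(regs != NULL && utcb != NULL && i < NUM_MR);
    return i < NUM_PHYS_MR ? regs->mr[i] : utcb->mr[i];
}

void store_mr(struct cpu_regs *regs, struct utcb *utcb, unsigned i, uint32_t value) {
    assert(regs != NULL && utcb != NULL && i < NUM_MR);
    if (i < NUM_PHYS_MR)
        regs->mr[i] = value;   /* register-backed: no memory access */
    else
        utcb->mr[i] = value;   /* memory-backed: cannot page fault */
}
```

User code sees one uniform set of message registers; only the accessor knows which ones are real.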





#### Other Virtual Register Motivation

- Portability
  - Common IPC API on different architectures
- Performance
  - Historically register only IPC was fast but limited to 2-3 registers on IA-32, memory based IPC was significantly slower but of arbitrary size
  - Needed something in between



#### Switching UTCBs (IA-32)

- Locating UTCB must be fast (avoid using system call)
- Use separate segment for UTCB pointer
  - mov %gs:0, %edi
- Switch pointer on context switches










#### Message Registers and UTCB

- Some MRs are mapped to physical registers
- Kernel will need the UTCB pointer anyway, so pass it

#### Sender Registers

| Register | Content           |
|----------|-------------------|
| EAX      | destination       |
| ECX      | timeouts          |
| EDX      | receive specifier |
| EBX      | MR<sub>1</sub>    |
| EBP      | MR<sub>2</sub>    |
| ESI      | MR<sub>0</sub>    |
| EDI      | UTCB              |

Receiver registers:

| Register | Content        |
|----------|----------------|
| EAX      | from           |
| ECX      |                |
| EDX      |                |
| EBX      | MR<sub>1</sub> |
| EBP      | MR<sub>2</sub> |
| ESI      | MR<sub>0</sub> |
| EDI      | UTCB           |



### Free Up Registers for Temporary Values

- Kernel needs registers for temporary values
- MR<sub>1</sub> and MR<sub>2</sub> are the only registers that the kernel may not need

#### Sender Registers





### Free Up Registers for Temporary Values

- Sysexit instruction requires:
  - ECX = user SP
  - EDX = user IP

#### Sender Registers

| Register | Content           |
|----------|-------------------|
| EAX      | destination       |
| ECX      | timeouts          |
| EDX      | receive specifier |
| EBX      | ~                 |
| EBP      | ~                 |
| ESI      | MR<sub>0</sub>    |
| EDI      | UTCB              |





#### **IPC Register Encoding**

- Parameters in registers whenever possible
- Make frequent/simple operations simple and fast

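#### **Sender Registers**

| Register | Content           |
|----------|-------------------|
| EAX      | destination       |
| ECX      | timeouts          |
| EDX      | receive specifier |
| EBX      | ~                 |
| EBP      | ~                 |
| ESI      | MR<sub>0</sub>    |
| EDI      | UTCB              |

Receiver registers:

| Register | Content        |
|----------|----------------|
| EAX      | from           |
| ECX      | ~              |
| EDX      | ~              |
| EBX      | MR<sub>1</sub> |
| EBP      | MR<sub>2</sub> |
| ESI      | MR<sub>0</sub> |
| EDI      | UTCB           |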

