Learning Outcomes

- An understanding of virtual linear array page tables, and their use on the MIPS R3000.
- Exposure to alternative page table structures beyond two-level and inverted page tables.

Two-level Translation

- Can be optimised for TLB refill only
  - Does not need to check the exception type
  - Does not need to save any registers
    - It uses a specialised assembly routine that only uses k0 and k1.
    - Does not check if PTE exists
      - Assumes virtual linear array – see extended OS notes
  - With careful data structure choice, exception handler can be made very fast

R3000 TLB Refill

- An example routine
  
mfc0 k1, C0_CONTEXT  
mfc0 k0, C0_EPC  
# mfc0 delay  
lw $k1, 0($k1)  
# slot  
fault ($k)  
nop  
mfc0 k1, C0_EPC  
# orig EPC  
mtc0 k1, C0_ENTRYLO  
nop  
mtc0 k1, C0_ENTRYHI  

How does this work?

Virtual Linear Array page table

- Assume a 2-level PT
- Assume 2nd-level PT nodes are in virtual memory
- Assume all 2nd-level nodes are allocated contiguously ➔ 2nd-level nodes form a contiguous array indexed by page number
Virtual Linear Array Operation

- Index into 2nd level page table without referring to root PT!
- Simply use the full page number as the PT index!
- Leave unused parts of PT unmapped!
- If access is attempted to unmapped part of PT, a secondary page fault is triggered
  - This will load the mapping for the PT from the root PT
  - Root PT is kept in physical memory (cannot trigger page faults)

Virtual Linear Array Page Table

- Use Context register to simply load PTE by indexing a PTE array in virtual memory
- Occasionally, will get double faults
  - A TLB miss, while servicing a TLB miss
  - Handled by general exception handler

PTEbase in virtual memory in kseg2
- Protected from user access
- Use Context register to simply load PTE by indexing a PTE array in virtual memory
- Occasionally, will get double faults
- A TLB miss, while servicing a TLB miss
- Handled by general exception handler

Code for VLA TLB refill handler

- Load PTE address from context register
- Move the PTE into EntryLo.
- Write EntryLo into random TLB entry.
- Return from the exception

Software-loaded TLB

- Pros
  - Can simplify hardware design
  - Provide greater flexibility in page table structure
- Cons
  - Typically have slower refill times than hardware managed TLBs.
Design Tradeoffs for Software-Managed TLBs

David Nagle, Richard Uhlig, Tim Stanley, Stuart Sechrest, Trevor Mudge, & Richard Brown

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture

Trends at the time

- Operating systems
  - moving functionality into user processes
  - making greater use of virtual memory for mapping data structures held within the kernel.
- RAM is increasing
  - TLB capacity is relatively static
- Statement:
  - Trends place greater stress upon the TLB by increasing miss rates and hence, decreasing overall system performance.
  - True/False? How to evaluate?

Software Traps on TLB Misses

Figure 1: Tapeworm

The Tapeworm TLB simulator is built into the operating system and is invoked whenever there is a real TLB miss. The simulator uses the real TLB misses to simulate its own TLB configuration(s). Because the simulator resides in the operating system, Tapeworm captures the dynamic nature of the system and avoids the problems associated with simulators driven by static traces.

Table 3: Costs for Different TLB Types

<table>
<thead>
<tr>
<th>TLB Miss Type</th>
<th>UTRix</th>
<th>OS/2</th>
<th>Mach 3.0</th>
</tr>
</thead>
<tbody>
<tr>
<td>L1K</td>
<td>354</td>
<td>355</td>
<td>294</td>
</tr>
<tr>
<td>L2</td>
<td>494</td>
<td>511</td>
<td>407</td>
</tr>
<tr>
<td>L3</td>
<td>404</td>
<td>546</td>
<td>296</td>
</tr>
<tr>
<td>Notudy</td>
<td>375</td>
<td>436</td>
<td>496</td>
</tr>
<tr>
<td>Invalid</td>
<td>396</td>
<td>272</td>
<td>267</td>
</tr>
</tbody>
</table>

Note the TLB miss costs

- What is expected to be the common case?
Measurement Results

Table 5: Number of TLB Misses

<table>
<thead>
<tr>
<th>Feature</th>
<th>Total TLB Misses (mm)</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
</tr>
</thead>
<tbody>
<tr>
<td>User</td>
<td>51.14</td>
<td>15.46</td>
<td>40.16</td>
<td>15.67</td>
<td>54.30</td>
<td>15.78</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Other</td>
<td>54.40</td>
<td>15.78</td>
<td>40.18</td>
<td>15.67</td>
<td>54.30</td>
<td>15.78</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Mach/AF/4</td>
<td>106.00</td>
<td>86.15</td>
<td>40.66</td>
<td>15.68</td>
<td>57.65</td>
<td>2.10</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Mach/AF/2</td>
<td>106.71</td>
<td>85.25</td>
<td>40.65</td>
<td>15.68</td>
<td>57.65</td>
<td>2.10</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Table 6: Time Spent Handling TLB Misses

<table>
<thead>
<tr>
<th>Feature</th>
<th>Total TLB Misses (mm)</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
<th>L2 L1K</th>
</tr>
</thead>
<tbody>
<tr>
<td>User</td>
<td>51.14</td>
<td>15.46</td>
<td>40.16</td>
<td>15.67</td>
<td>54.30</td>
<td>15.78</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Other</td>
<td>54.40</td>
<td>15.78</td>
<td>40.18</td>
<td>15.67</td>
<td>54.30</td>
<td>15.78</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Mach/AF/4</td>
<td>106.00</td>
<td>86.15</td>
<td>40.66</td>
<td>15.68</td>
<td>57.65</td>
<td>2.10</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Mach/AF/2</td>
<td>106.71</td>
<td>85.25</td>
<td>40.65</td>
<td>15.68</td>
<td>57.65</td>
<td>2.10</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Specialising the L2/L1K miss vector

<table>
<thead>
<tr>
<th>Type of PTE Miss</th>
<th>Counts</th>
<th>Previous Total Cost (mm)</th>
<th>New Total Cost (mm)</th>
<th>Time Saved (mm)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mach/AF/4</td>
<td>106.00</td>
<td>86.15</td>
<td>40.66</td>
<td>15.68</td>
</tr>
<tr>
<td>Mach/AF/2</td>
<td>106.71</td>
<td>85.25</td>
<td>40.65</td>
<td>15.68</td>
</tr>
</tbody>
</table>

Table 7: Reconstructed Cost of TLB Misses Given Additional Miss Vectors (Mach 3.2)

Other performance improvements?

- In Paper
  - Pinned slots
  - Increased TLB size
  - TLB associativity
- Other options
  - Bigger page sizes
  - Multiple page sizes
Itanium Page Table

- Takes a bet each way
- Loading
  - software
  - two different format hardware walkers
- Page table
  - software defined
  - Virtual linear array
  - Hashed

what about the P4?

- i.e. 32-bit x86 architecture.

P4

- Sophisticated, supports:
  - demand paging
  - pure segmentation
  - segmentation with paging
- Heart of the VM architecture
  - Local Descriptor Table (LDT)
  - Global Descriptor Table (GDT)
- LDT
  - 1 per process
  - describes segments local to each process (code, stack, data, etc.)
- GDT
  - shared by all programs
  - describes system segments (including OS itself)

P4

- To access a segment P4
  - loads a selector in 1 of the segment registers

P4

- a P4 selector:
• a P4 selector:

- when selector is in register, corresponding segment descriptor is
  - fetched by MMU
  - loaded in internal MMU registers
- Next, segment descriptor is used to handle memory reference (discussed later)

• finds a a P4 code segment descriptor

- calculating a linear address from selector+offset

P4 with paging
- every process has page directory
  - 1024 32bit entries
  - each entry points to page table
  - page table contains 1024 32bit entries
  - each entry points to page frame

IF no paging used: we are done
  - this is the physical address
ELSE
  - linear address interpreted as virtual address
  - paging again!
P4

- Many OSs:
  - BASE=0
  - LIMIT=MAX

- no segmentation at all

That is it!