

## Cache Organization

- Data transfer unit between registers and L1 cache: ≤ 1 word (1–16B)
- Cache *line* is transfer unit between cache and RAM (or lower cache)
  typically 16–32 bytes, sometimes 128 bytes and more
- Line is also unit of storage allocation in cache
- Each line has associated control info:
  - valid bit
  - modified bit
  - tag
- Cache improves memory access by:
  - absorbing most reads (increases bandwidth, reduces latency)
  - making writes asynchronous (hides latency)
  - clustering reads and writes (hides latency)

5 COMP9242 S2/2018 W03 © 2016 Gernot Heiser. Distributed under CC Attribution License 💀 UNSA

# Cache Indexing



- The *tag* is used to distinguish lines of a set...
- Consists of high-order bits not used for indexing





- Virtually indexed: looked up by virtual address
  - operates concurrently with address translation
- Physically indexed: looked up by physical address
  - requires result of address translation
- Usually a hierarchy: L1 (on core), L2, ..., LLC (last-level cache, next to RAM)
  - L1 may use virtual address, all others use physical only

6 COMP9242 S2/2018 W03 © 2016 Gernot Heiser. Distributed under CC Attribution License

# Cache Indexing



- · Address hashed to produce index of line set
- · Associative lookup of line within set
- *n* lines per set: *n*-way set-associative cache
  - typically n = 1 ... 5, some embedded processors use 32-64
  - n = 1 is called *direct mapped*
  - 2 ≤ n ≤ ∞ is called set associative
  - $n = \infty$  is called *fully associative* (unusual for I/D caches)
- Hashing must be simple (complex hardware is slow)
  - generally use least-significant bits of address (except L3 on recent x86)
- 8 COMP9242 S2/2018 W03 © 2016 Gernot Heiser. Distributed under CC Attribution License





#### Cache Indexing: Fully Associative



## Cache Indexing: 2-Way Associative



## **Cache Mapping Implications**

Multiple memory locations map to the same cache line



- Locations mapping to cache set *i* are said to be of colour *i*
- n-way associative cache can hold n lines of the same colour
- Types of cache misses ("the four Cs"):
  - Compulsory miss: data cannot be in the cache (if infinite size)
    o first access (after flush)
  - Capacity miss: all cache entries are in use by other data  $_{\odot}$  would not miss on infinite-size cache
  - Conflict miss: all lines of the correct colour are in use by other data
    would not miss on fully-associative cache
  - Coherence miss: miss forced by hardware coherence protocol

12 COMP9242 S2/2018 W03 © 2016 Gernot Heiser. Distributed under CC Attribution License



## Cache Replacement Policy

- Indexing (using address) points to specific line set
- On miss (all lines of set are valid): replace existing line
- Replacement strategy must be simple (hardware!)
  - dirty bit determines whether line must be written back
  - typical policies:

13 COMP9242 S2/2018 W03



© 2016 Gernot Heiser. Distributed under CC Attribution License

## Cache Addressing Schemes

- · For simplicity assumed so far that cache only sees one type of address: virtual or physical
- However, indexing and tagging can use different addresses!
- Four possible addressing schemes:
  - virtually-indexed, virtually-tagged (VV) cache
  - virtually-indexed, physically-tagged (VP) cache
  - physically-indexed, virtually-tagged (PV) cache
  - physically-indexed, physically-tagged (PP) cache
- PV caches can make sense only with unusual MMU designs
  - not considered any further

# Cache Write Policy

- Treatment of store operations
  - write back: Stores only update cache; memory is updated once dirty line is replaced (flushed) ✓ clusters writes
    - # memory inconsistent with cache
    - # multi-processor cache-coherency challenge
  - write through: stores update cache and memory immediately memory is always consistent with cache ₭ increased memory/bus traffic
- On store to a line not presently in cache (write miss):
  - write allocate: allocate a cache line and store there typically requires reading line into cache first!
  - **no allocate:** store directly to memory, bypassing the cache
- Typical combinations: ٠
  - write-back & write-allocate
  - write-through & no allocate

14 COMP9242 S2/2018 W03 © 2016 Gernot Heiser. Distributed under CC Attribution License

# **UNSV**

# Virtually-Indexed, Virtually-Tagged Cache

- Also called virtually-addressed cache
- Various incorrect names in use:
  - virtual cache
  - virtual address cache
- Uses virtual addresses only ٠
- Can operate concurrently ٠ with MMU
- Still needs MMU lookup VD Tag for access rights VD Tag
- Writeback needs PA - TLB lookup?
- Used for on-core L1

Taq

CPU

tag<sub>(26</sub>

Word 0

Word 0

Word 0

Word (

index,2

Word 1

Word 1

Word 1

Word

MMU

byte<sub>(4)</sub>

Word 2

Word 2

Word 2

Word 3

Word 3

Word 3

Word 3

Word 3









## Summary: VP Caches

- Medium speed
  - ☑ lookup in parallel with address translation
  - $\ensuremath{^{\ensuremath{\ensuremath{\mathbb{H}}}}}$  tag comparison after address translation
- ☑ No homonym problem
- **#** Potential synonym problem
- Bigger tags (cannot leave off set-number bits)
  increases area, latency, power consumption
- Used on most contemporary architectures for L1 cache

© 2016 Gernot Heiser. Distributed under CC Attribution License

# Summary: PP Caches

#### ₭ Slowest

- # requires result of address translation before lookup starts
- No synonym problem
- ☑ No homonym problem
- Easy to manage
- ☑ If small or highly associative index can be in parallel with translation
  - all index bits come from page offset
  - combines advantages of VV and PP cache
  - useful for on-core L1 cache (Itanium, recent x86)
- ☑ Cache can use *bus snooping* to receive/supply DMA data
- ☑ Usable as post-MMU cache with any architecture

For an in-depths coverage see [Wiggins 03]

30 COMP9242 S2/2018 W03 © 2016 Gernot Heiser. Distributed under CC Attribution License

## Write Buffer

29 COMP9242 S2/2018 W03

- · Store operations can take a long time to complete
  - eg if a cache line must be read or allocated
- Can avoid stalling the CPU by buffering writes
- Write buffer is a FIFO queue of incomplete stores
  - Also called store buffer or write-behind buffer
  - Typically between cache levels, cache and memory
- Can also read intermediate values out of buffer
  - to service lead of a value that is still in write buffer
  - avoids unnecessary stalls of load operations
- · Implies that memory contents are temporarily stale
  - on a multiprocessor, CPUs see different order of writes!
  - "weak store order", to be revisited in SMP context



UNS\

# Cache Hierarchy

- Hierarchy of caches to balance memory accesses:
  - small, fast, virtually-indexed L1
  - large, slow, physically indexed L2–L5
- Each level reduces and clusters traffic
- L1 typically split into I- and D-caches
  - "Harvard architecture"
  - requirement of pipelining
- Other levels generally unified
- Chip multiprocessors:
  - Usually L3 shared chip-wide
  - L2 private, clustered or shared chip-wide







## ODROID-C2 (Cortex A53) System Architecture



#### TLB Size (I-TLB + D-TLB)

| Architecture  | TLB Size<br>inst + data | TLB<br>Assoc | Page Size       | Coverage<br>(base page) |
|---------------|-------------------------|--------------|-----------------|-------------------------|
| VAX-11        | 64–256                  | 2            | 0.5 KiB         | 32–128 KiB              |
| ix86          | 32i + 64d               | 4            | 4 KiB + 4 MiB   | 128 KiB                 |
| MIPS          | 96–128                  | full         | 4 KiB – 16 MiB  | 384–512 KiB             |
| SPARC         | 64                      | full         | 8 KiB – 4 MiB   | 512 KiB                 |
| Alpha         | 32–128i + 128d          | full         | 8 KiB – 4 MiB   | 256 KiB                 |
| RS/6000       | 32i + 128d              | 2            | 4 KiB           | 256 KiB                 |
| Power-4 (G5)  | 1024                    | 4            | 4 KiB           | 512 KiB                 |
| PA-8000       | 96i + 96d               | full         | 4 KiB – 64 MiB  | 384 KiB                 |
| Itanium       | 64i + 96d               | full         | 4 KiB – 4 GiB   | 384 KiB                 |
| ARMv7 (A9)    | 64–128                  | 1–2          | 4 KiB – 16 MiB  | 256–512 KiB             |
| x86 (Skylake) | L1:128i+64d; L2:1536    | 4            | 4 KiB + 2/4 MiB | 1 MiB                   |

#### Not much growth in 40 years!

### Translation Lookaside Buffer (TLB)

• TLB is a (VV) cache for page-table entries

ASID

VPN

- TLB can be
  - hardware loaded, transparent to OS
- software loaded, maintained by OS
- TLB can be:
  split: I- and D-TLBs

unified

s

ASID

VPN

PFN

34 COMP9242 S2/2018 W03 © 2016 Gernot Heiser. Distributed under CC Attribution License

## TLB Size (I-TLB + D-TLB)

#### TLB coverage

- · Memory sizes are increasing
- · Number of TLB entries are roughly constant
- · Page sizes are steady
  - 4 KiB (SPARC, Alpha used 8KiB)
  - OS designers have trouble using superpages effectively
- · Consequences:
  - total amount of RAM mapped by TLB is not changing much
  - fraction of RAM mapped by TLB is shrinking dramatically!
  - Modern architectures have very low TLB coverage!
- The TLB can become a bottleneck



flags

flags

UNSV



### Intel Core i7 (Haswell) Cache Structure

