

## **Learning Outcomes**

- An understanding of the structure and limits of multiprocessor hardware.
- An appreciation of approaches to operating system support for multiprocessor machines.
- An understanding of issues surrounding and approaches to construction of multiprocessor synchronisation primitives.

- I INSW

1

2



## Multiprocessor System

- We will look at shared-memory multiprocessors
  - More than one processor sharing the same memory
- A single CPU can only go so fast
  - Use more than one CPU to improve performance
  - Assumes
    - Workload can be parallelised
  - Workload is not I/O-bound or memory-bound
- Disks and other hardware can be expensive
  - Can share hardware between CPUs

TT I NOW

3

4





6





7





9 10





11 12



Summary • Multiprocessors can • Increase computation power beyond that available from a single CPU • Share resources such as disk and memory However • Assumes parallelizable workload to be effective Assumes not I/O bound · Shared buses (bus bandwidth) limits scalability Can be reduced via hardware design • Can be reduced by carefully crafted software behaviour  $\bullet$  Good cache locality together with limited data sharing where possible

13 14



Each CPU has its own OS? • Statically allocate physical memory to each CPU • Each CPU runs its own independent OS • Share peripherals • Each CPU (OS) handles its processes system calls CPU 1 CPU 2 CPU 3 CPU 4 Memory I/O Has private OS Has private OS Has Has private OS private OS

15 16



Issues • Each processor has its own scheduling queue • We can have one processor overloaded, and the rest idle • Each processor has its own memory partition • We can a one processor thrashing, and the others with free memory • No way to move free memory from one OS to another CPU 1 CPU 2 CPU 3 CPU 4 Has Has Has private OS private OS private OS private OS

18





Symmetric Multiprocessors (SMP) Better alternative: identify largely independent parts of the kernel and make each of them their own critical section · Allows more parallelism in the kernel • Issue: Difficult task • Code is mostly similar to uniprocessor code • Hard part is identifying independent parts that don't interfere with each other Remember all the inter-dependencies between OS subsystems CPU 1 CPU 2 CPU 3 CPU 4 Runs users and shared O Runs Runs Runs sers and nared OS os 🔳

21

23

Symmetric Multiprocessors (SMP)

• Example:

• Associate a mutex with independent parts of the kernel

• Some kernel activities require more than one part of the kernel

• Need to acquire more than one mutex

• Great opportunity to deadlock!!!!

• Results in potentially complex lock ordering schemes that must be adhered to

CPU 1

Runs

users and shared OS

Runs
users and shared OS

Locks

Symmetric Multiprocessors (SMP)

• Example:

• Given a "big lock" kernel, we divide the kernel into two independent parts with a lock each

• Good chance that one of those locks will become the next bottleneck

• Leads to more subdivision, more locks, more complex lock acquisition rules

• Subdivision in practice is (in reality) making more code multithreaded (parallelised)

CPU 1

Runs

users and
shared OS

Runs
users and
shared OS

Locks

Real life Scalability Example

• Early 1990's, CSE wanted to run 80 X-Terminals off one or more server machines

• Winning tender was a 4-CPU bar-fridge-sized machine with 256M of RAM

• Eventual confige-CPU and 512M of RAM

• Machine ran fine in all pre-session testing

24

22





25 26



Lesson Learned

• Building scalable multiprocessor kernels is hard

• Lock contention can limit overall system performance

27 28



Multiprocessor Synchronisation

• Given we need synchronisation, how can we achieve it on a multiprocessor machine?

• Unlike a uniprocessor, disabling interrupts does not work.

• It does not prevent other CPUs from running in parallel

• Need special hardware support

29 30



Test-and-Set

• Hardware guarantees that the instruction executes atomically on a CPU.

• Atomically: As an indivisible unit.

• The instruction can not stop half way through

31 32



Test-and-Set on SMP A solution: • Hardware blocks all other CPUs from accessing the bus during the TSL instruction to prevent memory accesses by any other CPU. • TSL has mutually exclusive access to memory for duration of instruction. Word CPU 2 CPU 1 Memory 1000 is initially 0 CPU 1 reads a 0 2. CPU 2 reads a 0 3. CPU 1 writes a 1 4. CPU 2 writes a 1 34

33

Test-and-Set on SMP

• Test-and Set is a busy-wait synchronisation primitive
• Called a spinlock
• Issue:

Lock contention leads to spinning on the lock
• Spinning on a lock requires blocking the bus which slows all other CPUs down
• Independent of Weather other CPUs need a lock or not

Causes bus contention

Test-and-Set on SMP

Caching does not help reduce bus contention
Either TSL still blocks the bus
Or TSL requires exclusive access to an entry in the local cache
Requires invalidation of same entry in other caches, and loading entry into local cache
Many CPUs performing TSL simply bounce a single exclusive entry between all caches using the bus

35 36



Thomas Anderson, "The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors", *IEEE Transactions on Parallel and* Distributed Systems, Vol 1, No. 1, 1990

37 38

```
Compares Simple Spinlocks

• Test and Set

void lock (volatile lock_t *1) {

while (test_and_set(1));
}

• Read before Test and Set

void lock (volatile lock_t *1) {

while (*1 == BUSY || test_and_set(1));
}
```

Benchmark

for i = 1 .. 1,000,000 {
 lock(1)
 crit\_section()
 unlock()
 compute()
}

• Compute chosen from uniform random distribution
 of mean 5 times critical section

• Measure elapsed time on Sequent Symmetry (20
 CPU 30386, coherent write-back invalidate caches)

39 40



• Test and set performs poorly once there is enough CPUs to cause contention for lock
• Expected

• Read before Test and Set performs better
• Performance less than expected
• Still significant contention on lock when CPUs notice release and all attempt acquisition

• Critical section performance degenerates
• Critical section requires bus traffic to modify shared structure
• Lock holder competes with CPU that's waiting as they test and set, so the lock holder is slower
• Slower lock holder results in more contention

41 42





43

Spinning versus Blocking and Switching

• Spinning (busy-waiting) on a lock makes no sense on a uniprocessor

- The was no other running process to release the lock
- Blocking and (eventually) switching to the lock holder is the only sensible option.
- On SMP systems, the decision to spin or block is not as clear.
  - The lock is held by another running CPU and will be freed without necessarily switching away from the requestor



45



Spinning versus Switching Blocking and switching • to another process takes time Save context and restore another
 Cache contains current process not new process
 Adjusting the cache working set also takes time TLB is similar to cache
 Switching back when the lock is free encounters the same again Spinning wastes CPU time directly Trade off • If lock is held for less time than the overhead of switching to and back ⇒It's more efficient to spin ⇒Spinlocks expect critical sections to be short ⇒No waiting for I/O within a spinlock ⇒No nesting locks within a spinlock 48

## Preemption and Spinlocks

- Critical sections synchronised via spinlocks are expected to be short
   Avoid other CPUs wasting cycles spinning
- What happens if the spinlock holder is preempted at end of holder's timeslice

  - Mutual exclusion is still guaranteed
     Other CPUs will spin until the holder is scheduled again!!!!!
- ⇒ Spinlock implementations disable interrupts in addition to acquiring locks to avoid lock-holder preemption

