









HyperThreading allows two threads to run concurrently, with one using the execution units that the other doesn't need. The back end of the CPU is largely the same; instructions only need to write back to the correct register file.





[Figure: instruction stream at addresses 0x114, 0x118, 0x11c and 0x120, including `ori` and `sub` instructions]



## Unfortunately...

Implementing this functionality on top of sim-outorder.c within the SimpleScalar test suite was a much larger undertaking than originally anticipated.

### Currently:

The simulator can fork a context upon a branch instruction and split incoming instructions 50/50 between the two child contexts. When the branch reaches writeback, the appropriate context is selected and its modified registers are written back to the parent.
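The fork, split and merge steps described here can be sketched roughly as follows. This is a minimal illustration, not the code added to sim-outorder.c: `context_t`, `ctx_fork`, `ctx_write_reg`, `ctx_fetch_owner` and `ctx_resolve` are hypothetical names, the per-register dirty mask is an assumed way of tracking "modified registers", and register 0 is assumed to be hard-wired to zero.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 32

/* Hypothetical per-context state; not SimpleScalar's actual structures. */
typedef struct {
    uint32_t regs[NUM_REGS];
    uint32_t dirty;  /* bit i set => regs[i] was written since the fork */
    uint32_t pc;
} context_t;

/* Fork: both children start as copies of the parent, one per branch path. */
static void ctx_fork(const context_t *parent,
                     context_t *taken, context_t *not_taken,
                     uint32_t target_pc, uint32_t fallthrough_pc)
{
    *taken = *parent;
    *not_taken = *parent;
    taken->dirty = 0;
    not_taken->dirty = 0;
    taken->pc = target_pc;
    not_taken->pc = fallthrough_pc;
}

/* Register write inside a child; writes to r0 are ignored (assumed zero register). */
static void ctx_write_reg(context_t *c, unsigned r, uint32_t value)
{
    if (r == 0 || r >= NUM_REGS)
        return;
    c->regs[r] = value;
    c->dirty |= (uint32_t)1 << r;
}

/* 50/50 split: even cycles fetch for child 0, odd cycles for child 1. */
static int ctx_fetch_owner(uint64_t cycle)
{
    return (int)(cycle & 1);
}

/* Branch reaches writeback: copy only the winner's modified registers
 * back to the parent; the losing child is simply discarded. */
static void ctx_resolve(context_t *parent, const context_t *winner)
{
    for (unsigned r = 1; r < NUM_REGS; r++) {
        if (winner->dirty & ((uint32_t)1 << r))
            parent->regs[r] = winner->regs[r];
    }
    parent->pc = winner->pc;
}
```

Note that this sketch covers registers only. Memory written by a child would need similar buffering until the branch resolves, and it is the memory side that is currently going wrong.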

#### But:

Execution does not run to completion. Memory reads/writes across contexts are being corrupted, which leads to an incorrect address being loaded and an attempted read from 0x00000000, crashing the app. However:

This is after 4043 cycles, or 3626 instructions, so I will attempt to draw what conclusions I can.

## Observations...

Some things I noticed whilst stepping through traces:

- This will only ever be worthwhile if we fork only on the branches we would mis-predict. It is probably not necessary to fork on every branch.
- Still quite useful during compulsory misses in the branch predictor.
- Can aid performance by prematurely warming the cache for the exit code of a loop. We can absorb the cost of a TLB/cache miss on this code during the 2<sup>nd</sup> and later iterations of a loop.
- It might be beneficial to take advantage of known compiler quirks, e.g. `beq r0 r0 XXX` should be treated as an unconditional branch and not be forked. It is advantageous that forking isn't currently done for J insts.
- It is allowable in the PISA architecture to have 2 adjacent branch insts. Quite often one or both child contexts stall when they too come across a branch and cannot fork. This indicates that more contexts would allow increased performance (and troubles).
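The first and fourth observations suggest a filter in front of the fork logic. The sketch below is one possible shape for it, under stated assumptions: the opcode classes, `branch_inst_t`, and the confidence counter with its threshold are all hypothetical, and low predictor confidence is used only as a stand-in for "likely to mis-predict". The `bne rX, rX` case (never taken) is my own generalisation of the `beq r0 r0` quirk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opcode classes for control-flow instructions. */
typedef enum { OP_BEQ, OP_BNE, OP_J, OP_JAL, OP_JR, OP_OTHER } op_class_t;

typedef struct {
    op_class_t op;
    unsigned rs, rt;  /* source register numbers compared by beq/bne */
} branch_inst_t;

/* True when the branch can only ever go one way, so forking is pointless. */
static bool has_single_outcome(const branch_inst_t *b)
{
    switch (b->op) {
    case OP_J:
    case OP_JAL:
    case OP_JR:
        return true;              /* unconditional jumps */
    case OP_BEQ:                  /* beq rX, rX is always taken (e.g. beq r0 r0) */
    case OP_BNE:                  /* bne rX, rX is never taken */
        return b->rs == b->rt;
    default:
        return false;
    }
}

/* Fork only on genuinely conditional branches the predictor is unsure about.
 * `confidence` and `threshold` are assumed to come from a confidence
 * estimator attached to the branch predictor. */
static bool should_fork(const branch_inst_t *b,
                        unsigned confidence, unsigned threshold)
{
    if (b->op == OP_OTHER)
        return false;
    if (has_single_outcome(b))
        return false;
    return confidence < threshold;
}
```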

### Stats...

Num branches encountered: 781

% cycles in forked state: 64.4% (2603 / 4043)

avg num insts in context[0]:

avg num insts in context[1]:

% time stalled context[0]:

% time stalled context[1]:

% time stalled context[2]:

avg number of registers / mem locations written back per context:
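The derived figures can be recomputed from the raw counters with a couple of helpers. This is only a sketch; `percent` and `ipc` are hypothetical helper names, not SimpleScalar's stats interface. Fed the counts reported above, the same helpers also give an IPC of roughly 0.90 (3626 instructions over 4043 cycles).

```c
#include <assert.h>

/* Share of `whole` taken up by `part`, in percent (0 if `whole` is 0). */
static double percent(unsigned long part, unsigned long whole)
{
    return whole ? 100.0 * (double)part / (double)whole : 0.0;
}

/* Instructions per cycle (0 if no cycles have elapsed). */
static double ipc(unsigned long insts, unsigned long cycles)
{
    return cycles ? (double)insts / (double)cycles : 0.0;
}
```

For example, `percent(2603, 4043)` gives the forked-state share and `ipc(3626, 4043)` the throughput before the crash.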

## Wishlist...

Other things to implement (in increasing order of need):

- Varying priorities for each context (e.g. 25/75), based on the confidence level of the branch predictor.
- Support for more than 1 level of forking, so that a forked context that encounters another branch no longer needs to stall.
- Smarter handling of JAL/JR combinations. Currently these can only be handled in the root context, to avoid corrupting the return address stack in the branch predictor.
- Better reporting / accounting.
- Complete program correctness.
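The first wishlist item, confidence-weighted priorities, could be approached along these lines. This is a sketch under assumptions, not a design that exists in the simulator: the 2-bit confidence counter, the share table and the credit-based arbiter are all hypothetical choices.

```c
#include <assert.h>

/* Map an assumed 2-bit confidence counter (0..3) to the percentage of
 * fetch cycles given to the predicted path. The table values are arbitrary. */
static unsigned predicted_path_share(unsigned confidence)
{
    static const unsigned share[4] = { 50, 60, 75, 90 };
    return share[confidence > 3 ? 3 : confidence];
}

/* Decide who fetches this cycle: 0 = predicted path, 1 = alternate path.
 * `credit` accumulates the predicted path's share; each time it reaches
 * 100 the predicted path gets the cycle. This is deterministic and
 * spreads the alternate path's cycles out evenly. */
static int pick_fetch_context(unsigned share_pct, unsigned *credit)
{
    *credit += share_pct;
    if (*credit >= 100) {
        *credit -= 100;
        return 0;
    }
    return 1;
}
```

With a share of 75 the predicted path gets three of every four fetch cycles, and with a share of 50 the arbiter degenerates to the current alternating split.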

Some of these can/will be achieved before the report is due.

## Conclusions...

- In all likelihood, this idea is not worth implementing, considering the cost:benefit ratio.
- Have read other papers doing similar things that reached the same conclusion.
- Implementing a new idea and seeing how it affects the program trace was a plus.
- Yet still immensely useful as a learning exercise: actually seeing register, control and data dependencies work themselves out in an out-of-order environment brings home the ideas learned in class.
- Also built upon the skills involved in working on a large, foreign codebase.