

## Presentation Overview

- Current Research Direction
- Related Work
- **Experiments**
- What Next?

## **Current Research Direction**

- Wide superscalar, out-of-order execution processor core
- Exploits ILP
- But true data dependencies are inherent in application programs
- MIPS R10k, NetBurst, AMD etc. use a *bypass network* to forward the just-computed result → allows back-to-back issue of dependent instructions
- Complexity of the bypass network grows quadratically w.r.t. issue width
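
The quadratic growth can be illustrated with a simple counting model. This is a minimal sketch, not taken from the slides: it assumes a fully connected bypass network in which every functional unit can forward its result to each operand input of every functional unit, from each pipeline stage that still holds an unwritten result. The function name `bypass_paths` and its parameters are hypothetical.

```python
def bypass_paths(issue_width: int,
                 stages_after_execute: int = 1,
                 operands_per_unit: int = 2) -> int:
    """Count forwarding paths in a fully connected bypass network.

    Assumption: one functional unit per issue slot, and every unit can
    receive a result from every unit (including itself) on each of its
    operand inputs, from each stage that still holds a result.
    """
    return operands_per_unit * issue_width ** 2 * stages_after_execute


# Doubling the issue width quadruples the number of paths.
for width in (2, 4, 8):
    print(f"issue width {width}: {bypass_paths(width)} bypass paths")
```

Under these assumptions, the path count depends on the square of the issue width, so each doubling of width multiplies the wiring by four.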



#### **Current Research Direction**

Solution 1: Multi-cycle broadcast [Sassone04]

- Wire delays: accounted for in Intel NetBurst
- Allows higher processor clock frequency at the cost of reduced IPC

[Figure: ALUs with bypass levels 1 to 3 and register-file Rd/Wr ports]

Observation 2: FP execution unit is idle most of the time, even in FP-intensive applications (5-10%)

[Figure: Proportion of Functional Unit Type Requested (FP\_ALU, Int\_MULT/DIV, Int\_ALU), by application]













#### Experiment: Chaining pairs of dependent instructions

- 2 CIALUs sufficient
- IPC improvement of ~8%, solely due to savings of broadcast cycles
- Reduces utilization of IALUs by ~50%
- Reduces queue entries waiting for a result by up to 45%
- Up to 25% speedup when broadcast cycles = 4
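
Why chaining saves broadcast cycles can be sketched with a toy model. This is an illustration under stated assumptions, not the simulator behind the results above: it considers a single, fully serial chain of single-cycle instructions, treats `broadcast_cycles` as extra delay between a producer finishing and its consumer starting, and assumes a chained unit forwards the result inside a pair with no broadcast. The function `chain_cycles` is hypothetical, and the speedup it prints is an upper bound for a fully serial chain, not a reproduction of the measured figures.

```python
def chain_cycles(n_instructions: int,
                 broadcast_cycles: int,
                 chained_pairs: bool) -> int:
    """Cycles to execute a serial chain of single-cycle dependent instructions.

    Without chaining, every producer-to-consumer hop pays `broadcast_cycles`.
    With pair chaining, instructions are grouped as (0,1), (2,3), ... and the
    hop inside each pair stays local to the chained unit (assumed free).
    """
    if n_instructions <= 0:
        return 0
    hops = n_instructions - 1
    if chained_pairs:
        internal_hops = n_instructions // 2   # one local hop per complete pair
        broadcast_hops = hops - internal_hops
    else:
        broadcast_hops = hops
    return n_instructions + broadcast_hops * broadcast_cycles


baseline = chain_cycles(8, broadcast_cycles=4, chained_pairs=False)
chained = chain_cycles(8, broadcast_cycles=4, chained_pairs=True)
print(f"baseline: {baseline} cycles, chained pairs: {chained} cycles")
print(f"speedup on this serial chain: {baseline / chained:.2f}x")
```

In this model the saving grows with the broadcast latency, which is consistent with the larger gains reported at higher broadcast cycle counts; real programs have parallel work that hides part of the broadcast delay, so measured speedups are smaller.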





# What Next?

- Chaining sequences of 3 dependent instructions, and other patterns out of the 80.
- Architectural impact of adding chained units
  - Complexity of local bypass network etc.
- Replace chained units by xALUs converted from the CSA trees in a FP multiply/divide unit
  - Need to explore the hardware circuits of FP multiply/divide
- Develop an adaptive configuration scheme to best match the interconnections of the swappable xALUs to the patterns of in-flight instructions.
  - Need to determine the most frequent subset of patterns
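
Determining the most frequent patterns could start from a simple count over an instruction trace. The sketch below is hypothetical: the trace format (opcode, destination register, source registers), the function `dependent_pair_patterns`, and the sample trace are all assumptions made for illustration, and it only counts producer-consumer opcode pairs rather than the full pattern set mentioned above.

```python
from collections import Counter


def dependent_pair_patterns(trace):
    """Count (producer opcode, consumer opcode) pairs linked by a true dependency.

    `trace` is an iterable of (opcode, dest_reg, src_regs) tuples in program
    order. This format is an assumption for the sketch.
    """
    last_writer = {}      # register -> opcode of the latest instruction writing it
    patterns = Counter()
    for opcode, dest, srcs in trace:
        # Use a set so a register read twice by one instruction counts once.
        for src in set(srcs):
            if src in last_writer:
                patterns[(last_writer[src], opcode)] += 1
        if dest is not None:
            last_writer[dest] = opcode
    return patterns


# Made-up trace for illustration only.
trace = [
    ("add", "r1", ("r2", "r3")),
    ("sll", "r4", ("r1",)),
    ("add", "r5", ("r4", "r1")),
    ("sub", "r6", ("r5", "r2")),
    ("add", "r7", ("r6", "r6")),
]

for pattern, count in dependent_pair_patterns(trace).most_common():
    print(pattern, count)
```

Ranking the counts with `most_common()` gives a candidate subset of patterns to support first; longer sequences would need the dependency walk extended beyond one hop.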


#### References

- [Vassiliadis96] *High-Performance 3-1 Interlock Collapsing ALUs*. James Phillips and Stamatis Vassiliadis.
- [Yeager96] *The MIPS R10000 Superscalar Microprocessor*. Kenneth C. Yeager. IEEE Micro, 1996.
- [Palacharla97] *Complexity-Effective Superscalar Processors*. Subbarao Palacharla, Norman P. Jouppi, J. E. Smith. ISCA 1997.
- [Intel01] *The Microarchitecture of the Pentium 4 Processor*. Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, Patrice Roussel. Intel Technology Journal, Q1 2001.
- [Epalza04] *Dynamic Reallocation of Functional Units in Superscalar Processors*. Marc Epalza, Paolo Ienne, Daniel Mlynek. 9th Asia-Pacific Computer Systems Architecture Conference (ACSAC), 2004.
- [Yehia04] *From Sequences of Dependent Instructions to Functions: A Complexity-Effective Approach for Improving Performance without ILP or Speculation*. Sami Yehia and Olivier Temam.
- [Sassone04] *Multicycle Broadcast Bypass: Too Readily Overlooked*. Peter G. Sassone and D. Scott Wills. Proceedings of the Workshop on Complexity-Effective Design (WCED), May 2004.



## Overview of Research Topic



### Motivations

- Improved execution performance by exploiting parallelism and redundancy in hardware.
- Adaptation of hardware resources based on the dynamic behaviour of programs.
- Availability of a runtime profile enables optimizations that are otherwise difficult to exploit at compile time.
- Compilation at the binary level allows execution of legacy software binaries.
- Runtime compilation allows transparent migration of software code to hardware.
