Scalar processors fetch and issue at most one operation per clock cycle. Multiple-issue processors come in two kinds:
Superscalar: issue a varying number of instructions per clock cycle.
VLIW: issue a fixed number of instructions per clock cycle.
Superscalar Processors
Issue from 1 to 8 instructions per clock cycle. If an instruction depends on an earlier one in the same group, only the instructions preceding it are issued (in-order issue). This decision is made at run time by the processor.
=> Variability in the issue rate.
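The in-order issue rule above can be sketched in a few lines of Python. This is a simplified model, not any real processor's issue logic: the `Instr` tuple, the `issue_count` function and the register names are hypothetical, and only RAW dependences inside one issue group are checked (latencies, structural hazards and WAW/WAR hazards are ignored).

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # registers read


def issue_count(window, max_issue=8):
    """Return how many instructions at the head of `window` issue this cycle.

    In-order issue: stop at the first instruction that reads a register
    written by an earlier instruction of the same issue group.
    """
    written = set()
    issued = 0
    for instr in window[:max_issue]:
        if any(src in written for src in instr.srcs):
            break                      # dependent: this one and all later ones wait
        if instr.dest is not None:
            written.add(instr.dest)
        issued += 1
    return issued


window = [
    Instr("LD", "F0", ("R1",)),
    Instr("LD", "F6", ("R1",)),
    Instr("ADDD", "F4", ("F0", "F2")),   # needs F0 from the first LD
    Instr("SD", None, ("R1", "F4")),
]
print(issue_count(window))
```

Here the two loads issue together and the ADDD blocks the rest of the group, which is the source of the variable issue rate.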
Superscalar processors can be:
Statically scheduled: instructions behind a stall are not allowed to issue.
Dynamically scheduled and speculative: instructions behind a RAW hazard are allowed to proceed.
The loop is unrolled 4 times (load/addd/store). RAW hazards are reduced, but resource conflicts on the pipelines remain:
Loop: LD F0,0(R1)
      LD F6,-8(R1)
      LD F10,-16(R1)
      LD F14,-24(R1)
      ADDD F4,F0,F2
      ADDD F8,F6,F2
      ADDD F12,F10,F2
      ADDD F16,F14,F2
      SD 0(R1),F4
      SD -8(R1),F8
      SD -16(R1),F12
      SUBI R1,R1,#32
      BNEZ R1,LOOP
      SD 8(R1),F16 ; 8-32 = -24
Loop: LD F0,0(R1)
      LD F6,-8(R1)
      LD F10,-16(R1)
      LD F14,-24(R1)
      LD F18,-32(R1)
      SD 0(R1),F4
      SD -8(R1),F8
      SD -16(R1),F12
      SD -24(R1),F16
      SUBI R1,R1,#40
      BNEZ R1,LOOP
      SD -32(R1),F20
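As a sketch of what the unrolled loops compute, assuming (as the LD/ADDD/SD pattern suggests) that the loop adds the scalar held in F2 to every element of an array, the 4-times-unrolled version corresponds roughly to the Python below. The function name is hypothetical, the array length is assumed to be a multiple of 4, and the address arithmetic of the assembly is not modelled exactly.

```python
def add_scalar_unrolled(x, s):
    """In-place x[i] += s, four elements per iteration, walking down from the end.

    Assumes len(x) is a multiple of 4, as the unrolled loop does.
    """
    i = len(x) - 1
    while i >= 0:
        # four independent loads (LD)
        a, b, c, d = x[i], x[i - 1], x[i - 2], x[i - 3]
        # four independent adds (ADDD): no add depends on another add
        a, b, c, d = a + s, b + s, c + s, d + s
        # four stores (SD)
        x[i], x[i - 1], x[i - 2], x[i - 3] = a, b, c, d
        i -= 4                         # one index update per four elements (SUBI)
    return x


print(add_scalar_unrolled([0, 1, 2, 3, 4, 5, 6, 7], 10))
```

Grouping the loads, adds and stores is what removes most RAW stalls: each add finds its operand loaded several instructions earlier.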
Fetch, issue and completion of up to 4 instructions per clock cycle. Six separate execution units buffered with reservation stations.
1 BRU, completes branches and informs the fetch unit of mispredictions. Includes the condition register used for conditional branches.
PowerPC Architecture
Speculative Tomasulo with register renaming. An extended register file holds the speculative result of an instruction until the instruction commits. The ROB enforces only in-order commit. Advantage: operands are available from a single location (no additional complex logic is needed to read result values from the ROB).
PowerPC Pipeline
Fetch: The Fetch unit loads the decode queue with instructions from the cache. Next address is predicted through a 256-entry, two-way set associative BTB. A BPB is used if there is a miss in the BTB.
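A 256-entry, two-way set-associative BTB has 128 sets of 2 entries. The sketch below illustrates such a lookup with a fall-back to a BPB on a miss. It is a toy model under stated assumptions, not the PowerPC design: the index/tag split, the replacement policy (evict the entry updated least recently), the 4-byte instruction alignment, and the BPB modelled as a dictionary of 2-bit counters are all hypothetical.

```python
NUM_SETS = 128   # 256 entries / 2 ways
WAYS = 2


class BTB:
    def __init__(self):
        # each set holds up to WAYS (tag, target) pairs, most recently updated last
        self.sets = [[] for _ in range(NUM_SETS)]

    @staticmethod
    def _index_tag(pc):
        word = pc >> 2                    # assumption: 4-byte aligned instructions
        return word % NUM_SETS, word // NUM_SETS

    def lookup(self, pc):
        idx, tag = self._index_tag(pc)
        for entry_tag, target in self.sets[idx]:
            if entry_tag == tag:
                return target
        return None                       # BTB miss

    def update(self, pc, target):
        idx, tag = self._index_tag(pc)
        ways = [e for e in self.sets[idx] if e[0] != tag]
        ways.append((tag, target))
        self.sets[idx] = ways[-WAYS:]     # keep only the two newest entries


def predict(pc, btb, bpb_counters):
    """Return ("BTB", target) on a hit, else ("BPB", predicted_taken)."""
    target = btb.lookup(pc)
    if target is not None:
        return ("BTB", target)
    # BTB miss: use the BPB direction (2-bit counter, taken if >= 2; an assumption)
    return ("BPB", bpb_counters.get(pc, 0) >= 2)


btb = BTB()
btb.update(0x100, 0x200)
print(predict(0x100, btb, {}))
print(predict(0x104, btb, {0x104: 3}))
```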
Instruction decode: Instructions are decoded and inserted into an 8-entry instruction queue.
Instruction issue: Up to 4 instructions are taken from the 8-entry instruction queue and issued to the RS. A rename register and a reorder buffer entry are allocated for each issued instruction; if either is unavailable, issue stalls.
Execution: Proceeds with execution when all operands are available. At the end, the result is written on the result bus. The completion unit is notified that the instruction has completed.
If the instruction is a mispredicted branch, the IFU and the ICU (instruction completion unit) are notified. Instruction fetch restarts, and the ICU discards all speculated instructions after the branch and frees their rename buffers.
Commit: When all previous instructions have committed, the result is committed into the RF and the rename buffer is freed. Stores also commit from the store buffer to memory.
Performance results
IPC ranges from under 1 to 1.8. IPC=4 is not reached because:
FUs are not replicated for each instruction (structural hazards).
Limited instruction-level parallelism or limited buffering (insufficient buffers).
P6 Pipeline
Fetch/Decode Unit: decodes instructions in order, converts them into micro-ops, and puts the micro-ops in the instruction pool.
Dispatch/Execute Unit: issues micro-ops out of order from the instruction pool through the reservation station and executes them out of order.
Retire Unit: reorders the instructions and commits speculative results to the architectural state.
P6 Instruction Decode
The decoder fetches 16 bytes per clock cycle from the cache.
3 parallel decoders convert most instructions into one or more triadic micro-ops.
Some instructions need microcode (several micro-ops) to be executed.
The Register Alias Table unit converts logical register references into physical register references in the ROB (register renaming).
P6 Instruction Dispatch/Execute
The dispatch unit dispatches micro-ops out of order from the instruction pool through the reservation station unit. A micro-op is dispatched when:
All its operands are ready.
The resource it needs is ready.
P6 Instruction Retire
The retire unit looks for micro-ops that have been executed and can be removed from the pool. The original architectural target of each micro-op is written. This is done in order: an instruction commits only if:
All previous instructions have been committed.
The instruction has been executed.
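The two retire conditions amount to draining a queue from its oldest end and stopping at the first entry that has not finished executing. A minimal sketch, assuming a reorder buffer modelled as a deque of dictionaries; the field names (`dest`, `value`, `done`), the `retire` function and the limit of 3 retirements per cycle are illustrative assumptions, not details taken from the slides.

```python
from collections import deque


def retire(rob, arch_regs, max_retire=3):
    """Retire completed entries in program order; return how many retired.

    rob: deque of entries, oldest first, each {"dest", "value", "done"}.
    arch_regs: dict holding the architectural register state.
    """
    retired = 0
    while rob and retired < max_retire and rob[0]["done"]:
        entry = rob.popleft()             # oldest entry: everything before it has committed
        if entry["dest"] is not None:
            arch_regs[entry["dest"]] = entry["value"]   # write the architectural target
        retired += 1
    return retired


rob = deque([
    {"dest": "EAX", "value": 1, "done": True},
    {"dest": "EBX", "value": 2, "done": False},   # still executing
    {"dest": "ECX", "value": 3, "done": True},    # executed, but must wait for EBX
])
regs = {}
print(retire(rob, regs), regs)
```

The third entry has executed out of order but stays in the buffer until the second one completes, which is what keeps the architectural state precise.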
Pentium 4
New NetBurst micro-architecture
20 pipeline stages (hyper-pipeline).
1.4 GHz to 2 GHz.
3 prefetching mechanisms
Hardware instruction prefetcher (based on the BTB).
Software-controlled data cache prefetching.
L3->L2 data and instruction hardware prefetcher.
Execution Trace Cache
The TC stores decoded IA-32 instructions (micro-ops), which removes decoding costs.
12K micro-ops, 3 micro-ops per cycle fetch bandwidth.
It stores traces built across predicted branches.
However, some instructions need microcode from ROM.
The branch penalty can be much more than 10 cycles.
Uses a BTB.
On a BTB miss, static prediction is used (backward = taken, forward = not taken).
Software branch hints, used during trace construction, override the static prediction.
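The priority order described above (BTB first, then a software hint if present, then the static backward-taken/forward-not-taken rule) can be written as a small decision function. This is only an illustrative sketch: the function name, the BTB modelled as a dictionary from branch address to a taken/not-taken prediction, and the `hint` parameter are hypothetical.

```python
def predict_taken(branch_pc, target_pc, btb, hint=None):
    """Return True if the branch is predicted taken.

    btb: dict mapping branch address -> bool (a simplification of a real BTB).
    hint: optional software branch hint (True = taken, False = not taken).
    """
    if branch_pc in btb:
        return btb[branch_pc]          # BTB hit: use the dynamic prediction
    if hint is not None:
        return hint                    # software hint overrides static prediction
    return target_pc < branch_pc       # static: backward = taken, forward = not taken


print(predict_taken(0x100, 0x080, {}))             # backward branch, e.g. a loop
print(predict_taken(0x100, 0x180, {}))             # forward branch
print(predict_taken(0x100, 0x180, {}, hint=True))  # hint overrides the static rule
```

Backward branches are predicted taken because they usually close loops, which execute the branch many times before falling through.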
Execution Units and Issue Ports
1 load and 1 store issued per cycle.
Loads can be reordered w.r.t. other loads and stores.
Loads can be executed speculatively.
Up to 4 outstanding load misses.
Load/store forwarding.
AMD Athlon K7
Nine-issue (micro-ops), super-pipelined, superscalar x86 processor.
Multiple x86 instruction decoders (into triadic micro-ops).
Three out-of-order, superscalar, fully pipelined floating point execution units.
Three out-of-order, superscalar, pipelined integer units.
Three out-of-order, superscalar, pipelined address calculation units.
72-entry instruction control unit (ROB).
The Instruction Control Unit contains a reorder buffer and distributed reservation stations to hold operands while OPs wait to be scheduled. The Integer Instruction Scheduler is an instruction scheduling logic that picks OPs for execution based on their operand availability and issues them to functional units or address generation units. The function units perform transformations on data and return their results to the reorder buffer, while the address-generation units send calculated memory addresses to the Load/Store Unit for further processing.
Clustered VLIW
[Figure: register file ports Dout1A, Dout1B, Read2A, Read2B, Dout2A, Dout2B]
To solve the bottleneck, create partitioned register files, each connected to a small number of execution units (EUs).
[Figure: three register files, each with its own EU, connected by a global bus]
Remote instructions:
Have one or more operands in a non-local RF.
Copying remote operands to the local RF takes clock cycles.
Because copying is an atomic part of the remote instruction, the execution unit is idle while the copy is done => performance loss.
Instruction Compression
Embedded processors often put a limit on code size. How to reduce size?
NOPs are common: use only a few bits (2-3) to represent a NOP.
Mark explicitly the start and stop of each long instruction and do not insert NOPs.
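The second technique (mark where each long instruction ends and store no NOPs) can be illustrated with a small round-trip sketch. The encoding here is an assumption made for the example, not a real VLIW format: each stored operation carries its slot index and a stop flag, an all-NOP word is kept as a single explicit NOP, and the 8-slot width and the function names are hypothetical.

```python
WIDTH = 8  # assumed number of slots in a long instruction


def compress(word):
    """Turn one long instruction (list of WIDTH ops, None = NOP) into a
    list of (slot, op, stop) tuples with the NOPs removed."""
    ops = [(slot, op) for slot, op in enumerate(word) if op is not None]
    if not ops:
        return [(0, None, True)]       # all-NOP word: keep one explicit NOP
    last = len(ops) - 1
    return [(slot, op, i == last) for i, (slot, op) in enumerate(ops)]


def decompress(stream):
    """Rebuild the full-width long instructions from a compressed stream."""
    words, current = [], [None] * WIDTH
    for slot, op, stop in stream:
        current[slot] = op
        if stop:                       # stop flag marks the end of a long instruction
            words.append(current)
            current = [None] * WIDTH
    return words


word = [None, "ADD", None, None, "LD", None, None, None]
stream = compress(word)
print(len(stream), "stored operations instead of", WIDTH)
print(decompress(stream) == [word])
```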
Instruction Decompression
On Instruction Cache fill
ICache has to hold uncompressed instructions - limits cache size
On instruction fetch
Decompression in critical path of fetch stage, may have to add one or more pipeline stages just for decompression
TMS320C6X CPU
8 independent execution units. Execution unit types:
L: integer adder, logical, bit counting, FP adder, FP conversion.
S: integer adder, logical, bit manipulation, shifting, constant, branch/control, FP compare.
D: integer adder, load-store.
M: integer multiplier, FP multiplier.
Split into two identical datapaths, each containing the same four units (L, S, D, M).
[Figure: two datapaths, each with a 16 x 32 RF; M and D units labeled]
Instruction Encoding
The internal execution path is 256 bits wide.
Each operation is 32 bits wide => 8 operations per clock.
A fetch packet is a group of instructions fetched simultaneously. A fetch packet has 8 instructions.
An execute packet is a group of instructions beginning execution in parallel. An execute packet has up to 8 instructions.
Instructions in ICache have an associated P-bit (Parallel-bit).
A fetch packet is expanded into 1 to 8 execute packets during the fetch stage, depending on the P-bits.
Execute Packet
n|n|A|n|n|n|n|n
n|B|n|n|n|n|n|n
n|n|n|n|n|C|n|n
n|n|n|n|n|D|n|n
n|n|n|E|n|n|n|n
F|n|n|n|n|n|n|n
n|n|n|n|n|n|G|n
n|n|n|n|n|n|n|H
64 instructions
P-bits: A||B||C, D||E, F, G||H.
A string of P-bits equal to 1 followed by a 0 means those instructions execute in parallel. A string starting with 0 indicates sequential execution.
40 instructions
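The P-bit rule can be made concrete by splitting one fetch packet into execute packets. In this sketch each instruction is paired with its P-bit, where 1 means "the next instruction executes in parallel with this one" and 0 ends the current execute packet. The function name and the tuple representation are hypothetical, and a trailing run of 1s is simply closed at the end of the fetch packet, which is an assumption of the sketch.

```python
def split_execute_packets(fetch_packet):
    """Split a fetch packet into execute packets using the P-bits.

    fetch_packet: list of (name, p_bit) pairs in program order.
    """
    packets, current = [], []
    for name, p_bit in fetch_packet:
        current.append(name)
        if p_bit == 0:                 # a 0 closes the current execute packet
            packets.append(current)
            current = []
    if current:                        # assumption: close a trailing chain of 1s
        packets.append(current)
    return packets


# P-bits 1,1,0 | 1,0 | 0 | 1,0 encode A||B||C, D||E, F, G||H
fetch = [("A", 1), ("B", 1), ("C", 0), ("D", 1), ("E", 0),
         ("F", 0), ("G", 1), ("H", 0)]
print(split_execute_packets(fetch))
```

With all P-bits at 0 the same fetch packet yields 8 sequential execute packets, and with the first seven at 1 it yields a single 8-wide packet, matching the 1 to 8 range.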
Philips Trimedia
Five execution units => five operations issued per clock.
15 read ports and 5 write ports on the register file.
15 read ports are needed for 5 execution units because each operation requires two operands and a guard operand.
The guard operand makes each operation conditional on the value of the LSB of the guard operand => predicated execution.
128 registers (r0 always 0, r1 always 1).
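Both the port count and the guard behaviour can be checked with a short sketch. The register file is modelled as a plain dictionary, and the function and register names are hypothetical; this illustrates guarded (predicated) execution in general rather than the actual TriMedia instruction set.

```python
EXECUTION_UNITS = 5
READS_PER_OP = 2 + 1     # two source operands plus one guard operand
WRITES_PER_OP = 1

READ_PORTS = EXECUTION_UNITS * READS_PER_OP     # 15
WRITE_PORTS = EXECUTION_UNITS * WRITES_PER_OP   # 5


def guarded_add(regs, guard, dst, src1, src2):
    """dst = src1 + src2, but only if the LSB of the guard register is 1."""
    if regs[guard] & 1:
        regs[dst] = regs[src1] + regs[src2]
    # otherwise the operation has no effect, with no branch involved


regs = {"r2": 1, "r3": 0, "r4": 5, "r5": 7, "r6": 0}
guarded_add(regs, "r2", "r6", "r4", "r5")   # guard LSB = 1: executes
guarded_add(regs, "r3", "r6", "r4", "r4")   # guard LSB = 0: squashed
print(READ_PORTS, WRITE_PORTS, regs["r6"])
```

Every operation reads its guard along with its two sources, which is why the third read port per execution unit is needed even for operations that always execute.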