
Computer Architecture

Architecture - Interface between a user and an object


Computer Architecture is the science and art of selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals.
(Source: WWW Computer Architecture Page)

Five computing classes


Sales in 2010: about 1.8 billion PMDs (90% cell phones), 350 million desktop PCs, and 20
million servers. The total number of embedded processors sold was about 19 billion.
ARM (Advanced RISC Machine) is the most popular RISC example.
About 6.1 billion chips with ARM processors shipped in 2010, roughly 20 times as many as
shipped with 80x86 processors.
Tablets and smart phones mark the PostPC era: PC → personal mobile device (PMD)
Eight Great Ideas in Computer Architecture
1
Design for Moore's Law
Integrated circuit resources double every 18-24 months (1965 prediction by Gordon Moore,
one of the founders of Intel).
Computer designs can take years, so computer architects must anticipate where the
technology will be when the design finishes.
2
Use Abstraction to Simplify Design
A major productivity technique for hardware and software is to use abstractions to
represent the design at different levels of representation; lower-level details are hidden to
offer a simpler model at higher levels.
3
Make the Common Case Fast
Making the common case fast pays off more than optimizing the rare case.
Finding out what the common case is requires careful experimentation and measurement.
4
Performance via Parallelism
More performance - by performing operations in parallel
Data-level parallelism - data width
Instruction-level parallelism (ILP) - pipeline stages
Thread-level parallelism (TLP) - multithreading, common functional units
Processor-level parallelism - multicore, multiprocessor, separate processors
5
Performance via Pipelining
A particular pattern of parallelism is so prevalent that it merits its own name: pipelining.
6
Performance via Prediction
It can be faster to guess and start working rather than wait until you know for sure,
assuming that the mechanism to recover from a misprediction is not too expensive and
your prediction is relatively accurate.
Branch prediction, Speculative execution
7
Hierarchy of Memories

The fastest, smallest, and most expensive memory per bit at the top of the hierarchy and
the slowest, largest, and cheapest per bit at the bottom.
Illusion that main memory is nearly as fast as the top of the hierarchy and nearly as big and
cheap as the bottom of the hierarchy.
8
Dependability via Redundancy
Redundant components - can take over when a failure occurs and help to detect failures.
Abstraction levels

Abstraction Hierarchy
Layout/silicon level
Circuit level
Logic (gate) level
Register-transfer level (RTL)
Algorithmic level
System level
ISA - instruction set architecture
An abstract interface between the hardware and the lowest-level software: the
information necessary to write a machine language program that will run correctly, including instructions, registers, memory access, I/O, and so on.

Abstraction levels for a computer system. Abstraction levels in Gajski's Y-chart.


Behavioral Domain: what the design is supposed to do
Structural Domain: mapping of a behavioral representation to a set of components
Physical Domain: bind the structure to silicon
The same ISA can have different organizations, e.g. MIPS: single-cycle, multi-cycle, pipelined
A particular architecture can be implemented by different microarchitectures
- maintaining the instruction set architecture lets them run identical software
Modern instruction set architectures: IA-32, PowerPC, MIPS, SPARC, ARM, ...
ISA components:
Organization of Programmable Storage: Memory, Registers
Data Types: Encodings & Representations.
Instruction Set, Formats.

Modes of Addressing and Accessing Data Items and Instructions.


Exceptional Conditions (Interrupts, Protections, I/O).
CISC (Complex Instruction Set Computing), e.g. x86
Examples: x86, VAX, Motorola 68000, etc.
Intel's modern x86: RISC inside
RISC (Reduced Instruction Set Computer)
Examples: PowerPC, ARM, SPARC, Alpha, PA-RISC, MIPS
RISC won the technology battles - embedded computing space
CISC won the high-end commercial space (1990s to today)
RISC vs. CISC
RISC characteristics
Single-cycle execution possible
Hardwired (simple) control
Load/store architecture
Few memory addressing modes
Fixed-length instruction format
Many registers

CISC characteristics
Many multi-cycle operations
Microcode for multi-cycle operations
Register-memory and memory-memory operations
Many memory addressing modes
Many instruction formats and lengths
Few registers

Classifying Instruction Set Architectures - four ISA classes -- Instruction Formats


Zero address formats: operands on a stack
Add     M[sp-1] ← M[sp] + M[sp-1]
Load    M[sp] ← M[M[sp]]
Stack can be in registers or in memory
(usually the top of the stack is cached in registers)
One address formats: Accumulator Machines
The accumulator is always the other, implicit operand
Two address formats: the destination is the same as one of the operand sources
Reg ← (Reg op Reg)     RI ← (RI) op (RJ)
Reg ← (Reg op Mem)     RI ← (RI) op M[x]
Three address formats: one destination and up to two operand sources per instruction
Reg ← (Reg op Reg)     RI ← (RJ) op (RK)
Reg ← (Reg op Mem)     RI ← (RJ) op M[x]

Operand locations for four instruction set architecture classes.


In (a), a Top Of Stack register (TOS) points to the top input operand, which is combined with the
operand below. The first operand is removed from the stack, the result takes the place of
the second operand, and TOS is updated to point to the result. All operands are implicit.
In (b), the Accumulator is both an implicit input operand and a result.
In (c), one input operand is a register, one is in memory, and the result goes to a register.
In (d), all operands are registers and, like the stack architecture, can be transferred to
memory only via separate instructions: push or pop for (a) and load or store for (d).
0, 1, 2, 3 address machines

The code sequence for C = A+B for four classes of instruction sets.
The Add instruction has implicit operands for stack and accumulator architectures, and
explicit operands for register architectures. A, B, and C are in memory.
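The figure with the code sequences is not reproduced in these notes. As a rough sketch, the four classes can be mimicked in Python, one statement per machine instruction, using the instruction sequences usually given in textbooks for C = A + B. The function names, the register names R1-R3 and the dictionary standing in for memory are illustrative assumptions, not part of the original.

```python
# Sketch: C = A + B on the four ISA classes. Each Python statement stands for
# one machine instruction; mem is a dict standing in for main memory.

def stack_machine(mem):
    stack = []
    stack.append(mem["A"])                    # Push A
    stack.append(mem["B"])                    # Push B
    stack.append(stack.pop() + stack.pop())   # Add   (both operands implicit)
    mem["C"] = stack.pop()                    # Pop C

def accumulator_machine(mem):
    acc = mem["A"]                            # Load A
    acc = acc + mem["B"]                      # Add B (accumulator implicit)
    mem["C"] = acc                            # Store C

def register_memory_machine(mem):
    reg = {}
    reg["R1"] = mem["A"]                      # Load  R1, A
    reg["R3"] = reg["R1"] + mem["B"]          # Add   R3, R1, B
    mem["C"] = reg["R3"]                      # Store R3, C

def load_store_machine(mem):
    reg = {}
    reg["R1"] = mem["A"]                      # Load  R1, A
    reg["R2"] = mem["B"]                      # Load  R2, B
    reg["R3"] = reg["R1"] + reg["R2"]         # Add   R3, R1, R2
    mem["C"] = reg["R3"]                      # Store R3, C

def run_all(a, b):
    """Run every machine on the same inputs and collect the value left in C."""
    results = {}
    for machine in (stack_machine, accumulator_machine,
                    register_memory_machine, load_store_machine):
        mem = {"A": a, "B": b, "C": None}
        machine(mem)
        results[machine.__name__] = mem["C"]
    return results
```

All four produce the same C; what differs is how many instructions are needed and how many operands each instruction names explicitly.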
Princeton/Harvard architectures

Processor: Datapath + Control

Princeton (Von Neumann) architecture - single storage for instructions and data
Harvard architecture - separate storage for instructions and data

RTL (Register Transfer Level) Description


A design methodology in which the system operation is described by how the data is
manipulated and moved among registers.

Standard model for CPUs, micro-controllers ....


Datapath
Abstract RTL - a behavioral specification; it doesn't take into account the structure of
the digital system and isn't tied to resource and timing constraints.
Concrete RTL - an implementation of the behavioral specification based on a selected
organization at clock period granularity; it is subject to resource and timing constraints.
Example RTL operations

←                   Assignment.
=, ≠                Tests for equality and inequality.
||                  Bit string concatenation.
X ← Y               Data transfer of contents of regY to regX
X ← 0               Clears regX
X ← Y + Z           Adds contents of regY and regZ, loads the result into regX
X ← Y v Z           ORs contents of regY with regZ, loads the result into regX
DR ← M[MAR]         Loads into DR the contents of memory pointed to by MAR
R1 ← >> R1          Register R1 one-bit right shift, with 0 into the left-most bit
R2 ← << R1          Register R1 one-bit left shift, with 0 into the right-most bit
X ← Y, A ← B        Parallel transfers
(cond) A ← B        If cond = 1 then transfer contents of regB into regA
S0: A ← B           When in state S0, load regA with contents of regB
P: (a·b) R2 ← R3    When in state P, if a AND b is true then load regR2 with contents
                    of regR3; a, b are signals.
If (A) then B else C    Conditional Jump -- Branch

Use ; to separate transfers that occur on separate cycles.
Use , to separate transfers that occur on the same cycle.
Example (2 cycles):
regA ← regB, regB ← 0;
regC ← regA;
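A small sketch of how the two-cycle example behaves, assuming the usual register-transfer semantics: all transfers of one cycle read the register values from before the clock edge and are committed together. The helper name run_cycle and the initial register values are illustrative only.

```python
# Sketch of register-transfer semantics: transfers separated by "," happen in
# the same cycle and all read the OLD register values; ";" ends the cycle.

def run_cycle(regs, transfers):
    """Apply one clock cycle. transfers is a list of (dest, function(old_regs))."""
    old = dict(regs)            # values visible before the clock edge
    new = dict(regs)
    for dest, source in transfers:
        new[dest] = source(old) # every right-hand side reads the old values
    return new

def two_cycle_example(regs):
    # Cycle 1:  regA <- regB, regB <- 0;
    regs = run_cycle(regs, [("A", lambda r: r["B"]), ("B", lambda r: 0)])
    # Cycle 2:  regC <- regA;
    regs = run_cycle(regs, [("C", lambda r: r["A"])])
    return regs

def parallel_swap(regs):
    # X <- Y, Y <- X in one cycle swaps the registers; no temporary is needed.
    return run_cycle(regs, [("X", lambda r: r["Y"]), ("Y", lambda r: r["X"])])
```

Starting from A=5, B=9, C=0, two_cycle_example leaves A=9, B=0, C=9: regC receives the value regA obtained in the first cycle.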
Algorithmic State Machine (ASM)
Abstract State Machine (ASM)
ASM chart describes sequence of events
ASM flowchart
ASM specifies time at clock level

State diagram

Notation for a state.

ASM chart

State diagram and ASM chart conversion


ASM chart - One FF/State transformation rules -- One-Hot encoding
The design starts with the ASM chart, and replaces
1. State Boxes with flip-flops,
2. Decision Boxes with a corresponding decoder
3. Junctions with an OR gate, and
4. Conditional Outputs with AND gates.

State box transforms to a D Flip-Flop

Decision box transforms to a Decoder

Decision Box Transformation Rules


Entry point (coming from a state) - the Enable input
The conditions are the Select inputs

Junction Transformation Rules


One Hot FSM Coding

Conditional output

Ring counter
One-Hot implementation - Reset: 0001
One-hot: on reset, one flip-flop is set to 1 instead of resetting all flip-flops to 0.
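As an illustration of the one-hot reset rule, here is a minimal sketch of a ring counter: the state register holds exactly one 1, reset loads 0001, and each clock rotates the bit by one position. The function name and the rotate-left direction are assumptions made for the example.

```python
# Sketch: one-hot ring counter. Reset loads 0001 (one flip-flop set to 1),
# and every clock edge rotates the single 1 to the next flip-flop.

def one_hot_ring(n_states, steps):
    mask = (1 << n_states) - 1
    state = 1                                   # reset value: 0...01
    sequence = []
    for _ in range(steps):
        sequence.append(format(state, "0{}b".format(n_states)))
        # rotate left by one position within n_states bits
        state = ((state << 1) | (state >> (n_states - 1))) & mask
    return sequence
```

For example, one_hot_ring(4, 5) yields 0001, 0010, 0100, 1000, 0001. Resetting all flip-flops to 0 instead would leave the counter stuck, since no flip-flop would ever hold the active state.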
High-Level Synthesis (HLS) Synthesis of Digital Architectures
HLS: translation process from a behavioral description to a structural description

Behavioral specification → HLS → RTL datapath and controller

Xilinx Vivado High-Level Synthesis
Wireless, medical, defense, and consumer applications are more sophisticated than
ever before. Vivado Design Suite accelerates IP creation by enabling C, C++ and SystemC
specifications to be directly targeted into Xilinx programmable devices without the need
to manually create RTL.
DATA-FLOW MODEL OF COMPUTATION
Data-flow graphs (DFGs), also called data-dependency graphs (DDGs), represent precedence
relationships and parallelism in computations. A data-flow graph is built from:
Nodes: representing computation
Edges: representing precedence relations.

Original Data flow graph


single-assignment form
COMPILATION - OPTIMIZATION TECHNIQUE EXAMPLES

Tree-height reduction
Time and space reduction
Operator strength reduction
b = 3 * x;  →  t = x << 1; b = x + t;
multiplication replaced by shift and addition
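Both transformations can be checked with a short sketch. The first pair of functions compares the multiplication with its shift-and-add replacement; the second part represents a four-operand sum as a nested tuple and measures the tree height before and after rebalancing. The tuple encoding and all names are assumptions made for this illustration.

```python
# Sketch: operator strength reduction and tree-height reduction.

def times3_multiply(x):
    return 3 * x

def times3_shift_add(x):
    t = x << 1          # t = 2 * x, a shift instead of a multiplication
    return x + t        # x + 2x = 3x

def height(expr):
    """Height of an expression tree given as ("+", left, right) or a leaf name."""
    if isinstance(expr, tuple):
        _, left, right = expr
        return 1 + max(height(left), height(right))
    return 0

def evaluate(expr, env):
    if isinstance(expr, tuple):
        _, left, right = expr
        return evaluate(left, env) + evaluate(right, env)
    return env[expr]

# a + b + c + d as a chain of additions and as a balanced tree
CHAIN = ("+", ("+", ("+", "a", "b"), "c"), "d")
BALANCED = ("+", ("+", "a", "b"), ("+", "c", "d"))
```

The chain has height 3 and the balanced tree height 2, so with two adders the balanced form finishes one step earlier while computing the same sum.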

Hierarchical combination of dataflow graphs.


Example of hierarchical graph
DFG entities model the data flow; the hierarchical linkage of the entities models the control.
Hierarchical graphs accommodate calls, branching and iteration

CONDITIONAL DATA-FLOW symbols


HLS Basic Issues
Resource Allocation - Selection of the types and number of hardware components
Scheduling - Assignment of operations to time slots (clock cycles) such that no
precedence constraint is violated.
Module Binding - Assignment of operations to the allocated hardware components
Controller Synthesis - Design of control unit.
Basic scheduling concepts

Sequential Ops

Parallel scheduling
CHAINING

Chaining

Multi-cycle unit

Pipelined unit

As-Soon-As-Possible

As-late-As-Possible

No chaining
Two chained additions
A multicycle multiplication
Chained and multicycle operations - time balancing along clock periods
EXAMPLE 1
DFG
program fragment
repeat
xl = x + dx;
ul = u - (3 * x * u * dx) - (3 * y * dx);
yl = y + u * dx;
c = xl < a;
x = xl; u = ul; y = yl;
until (c);
DFG operations for each statement:
xl = x + dx                          v10
ul = u - (3*x*u*dx) - (3*y*dx)       v1-v7
yl = y + u*dx                        v8, v9
c = xl < a                           v11

Example 1 -- specification
SCHEDULING AND BINDING.

Scheduled sequencing graph for Example 1;


Allocation and Binding
6 Multipliers and 5 ALUs
4 multipliers and 2 ALUs
Resource sharing
Resource binding may associate one resource with more than one operation
Operations bound to a shared resource cannot be scheduled concurrently
SCHEDULING METHODS
As soon as possible (ASAP), minimal latency

- an operation can be scheduled when all its predecessors have been scheduled.
- very simple; the DFG only needs to be traversed from input(s) to output(s).

ASAP Scheduling
List scheduling
- an extra criterion (priority) is used for scheduling
Critical-path list scheduling
- sorting criterion - the length of the longest path from operation to output.
- good results in practice
- since operation 3 has a higher priority than operation 1, it is scheduled first
As late as possible (ALAP)
- scheduling performed from output(s) to input(s).
Freedom-based scheduling
- compute both ASAP and ALAP schedules
- the difference in scheduling position gives the freedom or mobility of an operation
(operations in the critical path have mobility zero).
- take advantage of mobility to find a good position within scheduling range.

Example 1:

ASAP Schedule

ALAP Schedule, resources

Mobility: operations (1-5) = 0, (6, 7) = 1, (8, 9, 10, 11) = 2.
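The mobility values can be reproduced with a short sketch. The dependency edges below are an assumption: they are reconstructed from the program fragment of Example 1 (operations 1-7 compute ul, 8-9 compute yl, 10 computes xl, 11 the comparison), with every operation taking one cycle. The dictionary PREDS and the function names are illustrative.

```python
# Sketch: ASAP and ALAP schedules and mobility for Example 1 (unit latency).
# PREDS maps each operation to its predecessors (assumed DFG edges):
#   1: 3*x   2: u*dx   3: v1*v2   4: u - v3   5: v4 - v7
#   6: 3*y   7: v6*dx  8: u*dx    9: y + v8   10: x + dx   11: xl < a
PREDS = {1: [], 2: [], 3: [1, 2], 4: [3], 5: [4, 7], 6: [], 7: [6],
         8: [], 9: [8], 10: [], 11: [10]}

def asap(preds):
    """Earliest step for each operation: one step after its latest predecessor."""
    start = {}
    def visit(v):
        if v not in start:
            start[v] = 1 + max((visit(p) for p in preds[v]), default=0)
        return start[v]
    for v in preds:
        visit(v)
    return start

def alap(preds, latency):
    """Latest step for each operation that still meets the given latency."""
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    start = {}
    def visit(v):
        if v not in start:
            start[v] = min((visit(s) for s in succs[v]), default=latency + 1) - 1
        return start[v]
    for v in preds:
        visit(v)
    return start

def mobility(preds):
    early = asap(preds)
    late = alap(preds, max(early.values()))
    return {v: late[v] - early[v] for v in preds}
```

With these edges the ASAP latency is 4 steps and mobility(PREDS) gives 0 for operations 1-5, 1 for operations 6 and 7, and 2 for operations 8-11, in agreement with the values listed above.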


SCHEDULING WITH RESOURCE CONSTRAINTS (1)

Example 1; 2 multipliers and 2 ALUs


1 multiplier and 1 ALU
ALU (addition/subtraction and comparison). ALU and multiplier latency - 1 cycle.
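A resource-constrained schedule for Example 1 can be sketched with critical-path list scheduling. As before, the DFG edges and the split into multiplier and ALU operations are assumptions reconstructed from the program fragment, every unit has a latency of one cycle, and the names PREDS, KIND and list_schedule are illustrative.

```python
# Sketch: critical-path list scheduling under resource limits (unit latency).
PREDS = {1: [], 2: [], 3: [1, 2], 4: [3], 5: [4, 7], 6: [], 7: [6],
         8: [], 9: [8], 10: [], 11: [10]}
KIND = {1: "mul", 2: "mul", 3: "mul", 6: "mul", 7: "mul", 8: "mul",
        4: "alu", 5: "alu", 9: "alu", 10: "alu", 11: "alu"}

def list_schedule(preds, kind, limits):
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)

    priority = {}
    def path_to_output(v):
        # priority = length of the longest path from the operation to an output
        if v not in priority:
            priority[v] = 1 + max((path_to_output(s) for s in succs[v]), default=0)
        return priority[v]

    start, step = {}, 0
    while len(start) < len(preds):
        step += 1
        # ready = unscheduled operations whose predecessors finished earlier
        ready = [v for v in preds
                 if v not in start and all(p in start for p in preds[v])]
        ready.sort(key=lambda v: (-path_to_output(v), v))
        free = dict(limits)
        for v in ready:
            if free[kind[v]] > 0:       # a unit of the right type is still free
                free[kind[v]] -= 1
                start[v] = step
    return start
```

Under these assumptions, list_schedule(PREDS, KIND, {"mul": 2, "alu": 2}) needs 4 steps, and with {"mul": 1, "alu": 1} it needs 7 steps. Tie-breaking by operation number is an arbitrary choice; other tie-breaks can give different but equally long schedules.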

Register number Optimization:

Variable lifetime - register allocation and binding: 5 registers


Lifetime of a variable - the number of clock cycles during which that variable is alive
Analyze the lifetimes of all variables
Establish the required number of registers
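Lifetime-based register binding can be sketched as a greedy, first-fit pass over the lifetimes sorted by start time, in the spirit of the left-edge algorithm. The lifetimes below are hypothetical and are not those of Example 1, whose scheduled graph is not reproduced here; a variable with lifetime (birth, death) is assumed to be written at the end of step birth and last read in step death.

```python
# Sketch: register binding from variable lifetimes (greedy first-fit).
# A register can be reused once its previous variable has died.

def bind_registers(lifetimes):
    """lifetimes: name -> (birth, death). Returns name -> register index."""
    free_at = []      # free_at[r] = death time of the variable last bound to r
    binding = {}
    for name, (birth, death) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        for reg, last_death in enumerate(free_at):
            if last_death <= birth:          # register r is free again
                free_at[reg] = death
                binding[name] = reg
                break
        else:                                # no free register: allocate a new one
            free_at.append(death)
            binding[name] = len(free_at) - 1
    return binding

# Hypothetical lifetimes, for illustration only
HYPOTHETICAL = {"a": (0, 2), "b": (0, 1), "c": (1, 3), "d": (2, 4), "e": (3, 4)}
```

For these lifetimes two registers suffice: b, c and e share one register and a and d the other, which equals the maximum number of variables alive at the same time.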
Datapath and Control synthesis

Component allocation and binding:
Functional Units
Registers
Reg. Files
BUSes
MUXs
Three-state (tri-state) drivers

CONTROL UNIT SYNTHESIS

Example 1; ASAP Scheduling


FSM - state transition diagram
The numbers next to the vertices of the FSM are the FU activation signals.
Example 1; data path and control design, constraints: 1 multiplier and 1 ALU

Example 1: DFG 1 multiplier, 1 ALU


Lifetime based register binding

State transition diagram


One state for each clock cycle

DATAPATH

Example 1: 1 multiplier, 1 ALU; the corresponding data path

Laboratory

Basys2 board
Xilinx Spartan-3E FPGA and Atmel AT90USB2 USB controller
Programming circuits
Flash select - Mode Jumper (JP3) - ROM.

Basys2 I/O circuits


Basys2 Pmod connector circuits

Metastability and synchronization

Simple synchronizer
This synchronizer works correctly if the clock period is longer than the period of
metastability (Tclock > Tmeta).

Synchronized falling-edge detector circuit


Synchronized Monopulse generator
Debouncing Switches

Schmitt trigger, RC, Vthreshold

Rising-edge detector

debouncing a switch.

Debouncing circuit with delay counter
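The delay-counter idea can be sketched in software: the debounced output follows the raw switch input only after the input has stayed at the new value for a given number of consecutive samples. The function name, the parameter stable_ticks and the initial output value are assumptions for this illustration, not taken from the circuit in the figure.

```python
# Sketch: debouncing with a delay counter. The output changes only after the
# input has differed from it for stable_ticks consecutive samples.

def debounce(samples, stable_ticks, initial=0):
    out = initial
    count = 0
    result = []
    for s in samples:
        if s == out:
            count = 0                    # bounce ended or nothing changed
        else:
            count += 1                   # input differs: keep counting
            if count >= stable_ticks:
                out = s                  # stable long enough: accept new value
                count = 0
        result.append(out)
    return result
```

With stable_ticks = 3, the bouncing input 0,0,1,0,1,1,1,1,0,1,1 produces a single clean 0-to-1 transition at the seventh sample, and the later one-sample glitch is ignored.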


Pulse width modulation (PWM) circuit - duty cycle

Block diagram of a PWM circuit.


Output register for LED
A PWM circuit generates an output pulse with an adjustable duty cycle.
Used to control the on-off time of an external system; the average value of the output
acts as a simple D/A converter.
Applications: LED intensity, motor speed control
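A minimal sketch of the duty-cycle relationship, assuming the common structure of a free-running counter compared against a duty register; the block diagram itself is not reproduced here, so the structure, the function names and the 4-bit default resolution are assumptions.

```python
# Sketch: counter-based PWM. A free-running counter of resolution_bits bits is
# compared with the duty value; the output is 1 while counter < duty.

def pwm_wave(duty, resolution_bits=4, periods=1):
    top = 1 << resolution_bits                 # counter counts 0 .. top-1
    return [1 if (t % top) < duty else 0 for t in range(periods * top)]

def average_value(wave):
    """Average of the output = duty cycle (the D/A converter interpretation)."""
    return sum(wave) / len(wave)
```

With 4-bit resolution, duty = 4 gives a duty cycle of 4/16 = 25%, so an LED driven by this output appears at roughly a quarter of full intensity.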

Four 7-segment display digits

LEDs in a common anode 7-segment display


Multiplexed 7seg display timing

Imax/Iavg luminosity

32-bit 4x8-binary display

32-bit 2 x 4-hex digit display
[Block diagram: a 32-bit register whose bits 31-16 or 15-0 are selected by SW; a frequency
divider (Div Freq); a hex-to-7seg decoder; a 2-to-4 decoder (DEC) selecting digits 3-0 of
the 7-segment display; keypad]

Keypad scan: columns - out; rows - in

1. Initialization
2. Is a key being pressed?
   C ← 0000; idle: R = 1111
   C ← 0000; key pressed: R != 1111 (polling)
   C ← 0000; key pressed: R1.R2.R3.R4 → 0 (negative-edge interrupt)
3. Key scan to determine which key is being pressed
   Drive each column low, one at a time, and see if any of the row inputs becomes low.
   C ← 0111, if (R != 1111) Rin ← R;
   C ← 1011, 1101, 1110, ...
4. Wait for the key to be released
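Steps 2 and 3 can be sketched with a small model of a 4x4 keypad with active-low columns and pulled-up rows. The function read_rows is a hypothetical stand-in for the hardware: it returns the row pattern seen for a given column pattern when the key at (row, col) is held down. The bit order (leftmost bit is row or column 0) is an assumption.

```python
# Sketch: keypad scanning. Columns are outputs (active low), rows are inputs
# with pull-ups, so a row reads 0 only if its key connects it to a low column.

COLUMN_PATTERNS = (0b0111, 0b1011, 0b1101, 0b1110)   # one column low at a time

def read_rows(col_pattern, pressed):
    """Hardware model: pressed is (row, col) of the held key, or None."""
    rows = 0b1111
    if pressed is not None:
        r, c = pressed
        if ((col_pattern >> (3 - c)) & 1) == 0:      # column c is driven low
            rows &= ~(1 << (3 - r)) & 0b1111         # row r is pulled low
    return rows

def scan(pressed):
    """Return (row, col) of the pressed key, or None if the keypad is idle."""
    if read_rows(0b0000, pressed) == 0b1111:         # step 2: C <- 0000, R = 1111
        return None
    for col, pattern in enumerate(COLUMN_PATTERNS):  # step 3: drive columns low
        rows = read_rows(pattern, pressed)
        if rows != 0b1111:
            return (COLUMN_PATTERNS.index(rows), col)
    return None
```

For instance, with the key at row 2, column 1 held down, only the pattern 1011 produces a row reading different from 1111 (namely 1101), which identifies the key.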
UART - Universal asynchronous receiver and transmitter

RS232 standard

PC -- HyperTerminal
RS232 - MAX232 voltage converter
TTL: logic 1 = +5V; RS232: logic 1 = from -3V to -15V

ASCII codes

Data format. Error detection: parity, overrun (receiver not emptied in time), framing (stop bit)

Oversampling RxD with BclkX16

UART receiver
The ASMD chart for the receiver
D-BIT - the number of data bits
SB-TICK - the number of ticks needed for the stop bits, which is 16, 24, and 32 for
1, 1.5, and 2 stop bits, respectively
The s-tick signal (RXx16) is the enable tick from the baud rate generator; there are 16
ticks in a bit interval.
The FSM stays in the same state unless the s-tick signal is asserted.
The s counter keeps track of the number of sampling ticks and counts to
- 7 in the start state,
- 15 in the data state, and
- SB-TICK in the stop state.
The n counter keeps track of the number of data bits received in the data state.
The retrieved bits are shifted into and reassembled in the b register.
The status signal rx-done-tick is asserted for one clock cycle after the receiving process is
completed.
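The receiver description can be turned into a small behavioral sketch. It processes one RxD sample per sampling tick (16 per bit), so the condition "unless the tick is asserted" is implicit. D_BIT and SB_TICK are Python spellings of the parameters above; the helper frame_samples, the LSB-first bit order, the choice to end the stop state when s reaches SB_TICK - 1, and the absence of error checks are assumptions of this sketch, not statements about the original ASMD chart.

```python
# Sketch: oversampling UART receiver, one loop iteration per sampling tick.
D_BIT = 8        # number of data bits
SB_TICK = 16     # ticks for the stop bit(s): 16 = 1 stop bit

def frame_samples(byte, idle=4):
    """Hypothetical test helper: RxD samples for one frame, 16 samples per bit."""
    bits = [0] + [(byte >> i) & 1 for i in range(D_BIT)] + [1]  # start, data, stop
    samples = [1] * idle
    for bit in bits:
        samples += [bit] * 16
    return samples + [1] * idle

def uart_receive(samples):
    state, s, n, b = "idle", 0, 0, 0
    received = []                       # one entry per rx-done-tick
    for rx in samples:
        if state == "idle":
            if rx == 0:                 # falling edge: start bit begins
                state, s = "start", 0
        elif state == "start":
            if s == 7:                  # middle of the start bit
                state, s, n = "data", 0, 0
            else:
                s += 1
        elif state == "data":
            if s == 15:                 # middle of the current data bit
                s = 0
                b = (rx << (D_BIT - 1)) | (b >> 1)   # shift in, LSB arrives first
                if n == D_BIT - 1:
                    state = "stop"
                else:
                    n += 1
            else:
                s += 1
        else:                           # stop state
            if s == SB_TICK - 1:
                state = "idle"
                received.append(b)      # corresponds to asserting rx-done-tick
            else:
                s += 1
    return received
```

Counting to 7 in the start state places the first sample in the middle of the start bit; counting to 15 afterwards keeps every later sample in the middle of its bit, which is what makes 16x oversampling tolerant of small baud-rate mismatches.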

Complete UART System


FINITE STATE MACHINE

FSM ; Moore, Mealy outputs


+ A Mealy machine typically uses fewer states
+ A Mealy machine responds faster, but may be transparent to glitches
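Both points can be seen on a rising-edge detector, sketched here as a Mealy and as a Moore machine. The state names and function names are illustrative. Each loop iteration is one clock cycle: the output is computed first (from state and input for Mealy, from state only for Moore), then the state is updated.

```python
# Sketch: rising-edge detector as Mealy (2 states) and Moore (3 states).

def mealy_edge(inputs):
    state, out = "zero", []
    for x in inputs:
        # Mealy output depends on state AND input: asserted in the same cycle
        out.append(1 if (state == "zero" and x == 1) else 0)
        state = "one" if x == 1 else "zero"
    return out

def moore_edge(inputs):
    state, out = "zero", []
    for x in inputs:
        # Moore output depends on the state only: needs the extra "edge" state
        out.append(1 if state == "edge" else 0)
        if x == 0:
            state = "zero"
        elif state == "zero":
            state = "edge"
        else:
            state = "one"
    return out
```

For the input 0, 1, 1, 0, 1 the Mealy version outputs 0, 1, 0, 0, 1 and the Moore version 0, 0, 1, 0, 0: the Mealy output appears one cycle earlier with one state fewer, but because it depends directly on the input, a glitch on the input can pass straight to the output.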
FSM REPRESENTATION
State diagram
Algorithmic State Machine (ASM)/ Abstract State Machine (ASM)

FSM partitioning

FSM before partitioning


Communicating FSMs

key idea: introduction of idle states

FSM after partitioning


Four-phase handshaking protocol


ASM chart

data ready (start) four-phase handshaking protocol

Write TL using the four-phase handshaking protocol

Read TL using the four-phase handshaking protocol.

One-Hot circuit

Handshaking system with synchronizers

Talker FSM
Listener FSM
ASM charts of the talker and listener of the four-phase handshaking protocol.
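The four phases can be sketched as two small FSMs stepped in lockstep, each seeing the other's output from the previous clock cycle (a stand-in for the synchronizers). The state names, the req/ack signal names and the shared bus variable are assumptions for this illustration; the original ASM charts are not reproduced here.

```python
# Sketch: four-phase handshake between a talker and a listener FSM.
# Phases: req=1 -> ack=1 -> req=0 -> ack=0, then the next word may be sent.

def four_phase_transfer(words):
    req = ack = 0
    bus = None
    pending = list(words)
    received, trace = [], []
    talker, listener = "idle", "idle"
    while pending or talker != "idle" or listener != "idle":
        req_seen, ack_seen = req, ack          # values from the previous cycle
        # Talker FSM
        if talker == "idle" and pending:
            bus = pending.pop(0)               # place data, then raise req
            req, talker = 1, "wait_ack"
        elif talker == "wait_ack" and ack_seen:
            req, talker = 0, "wait_release"    # listener has the data
        elif talker == "wait_release" and not ack_seen:
            talker = "idle"                    # handshake complete
        # Listener FSM
        if listener == "idle" and req_seen:
            received.append(bus)               # latch data, then raise ack
            ack, listener = 1, "wait_release"
        elif listener == "wait_release" and not req_seen:
            ack, listener = 0, "idle"
        if not trace or trace[-1] != (req, ack):
            trace.append((req, ack))           # record each change of (req, ack)
    return received, trace
```

For a single word the recorded (req, ack) trace is (1,0), (1,1), (0,1), (0,0), i.e. the four phases in order; the data is guaranteed stable while req is high, which is why the listener can latch it safely.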
