Computer Architecture Introduction

Computer Architecture
Architecture - Interface between a user and an object

Computer Architecture is the science and art of selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals
(WWW Computer Architecture Page).
Five computing classes

Sales in 2010 ~ 1.8 billion PMDs (90% cell phones), 350 million desktop PCs, and 20
million servers. The total number of embedded processors sold ~ 19 billion.
ARM (Advanced RISC Machine) -- the most popular RISC example.
ARM processors ~ 6.1 billion chips shipped in 2010, ~ 20 times as many as the chips
shipped with 80x86 processors.
Tablets and smart phones -- the PostPC era: PC → personal mobile device (PMD)
Eight Great Ideas in Computer Architecture
1. Design for Moore's Law
Integrated circuit resources double every 18–24 months - 1965 prediction by Gordon Moore,
one of the founders of Intel.
Computer designs can take years - computer architects must anticipate where the
technology will be when the design finishes
2. Use Abstraction to Simplify Design
A major productivity technique for hardware and software is to use abstractions to
represent the design at different levels of representation; lower-level details are hidden to
offer a simpler model at higher levels.
3. Make the Common Case Fast
Making the common case fast - better than optimizing the rare case.
What the common case is - found by careful experimentation and measurement.
4. Performance via Parallelism
More performance - by performing operations in parallel
Data-level parallelism - data width
Instruction-level parallelism (ILP) - pipeline stages
Thread-level parallelism (TLP) - multithreading, common functional units
Processor-level parallelism - multicore, multiprocessor, separate processors
5. Performance via Pipelining
A particular pattern of parallelism is so prevalent that it merits its own name: pipelining.
6. Performance via Prediction
It can be faster to guess and start working rather than wait until you know for sure,
assuming that the mechanism to recover from a misprediction is not too expensive and
your prediction is relatively accurate.
Branch prediction, Speculative execution
7. Hierarchy of Memories
The fastest, smallest, and most expensive memory per bit at the top of the hierarchy and
the slowest, largest, and cheapest per bit at the bottom.
Illusion that main memory is nearly as fast as the top of the hierarchy and nearly as big and
cheap as the bottom of the hierarchy.
8. Dependability via Redundancy
Redundant components - can take over when a failure occurs and help to detect failures.
Abstraction levels - abstraction hierarchy
Layout/silicon level
Circuit level
Logic (gate) level
Register-transfer level (RTL)
Algorithmic level
System level
ISA - instruction set architecture
An abstract interface between the hardware and the lowest-level software - the
information necessary to write a machine language program that will run correctly,
including instructions, registers, memory access, I/O, and so on.
Abstraction levels for a computer system. Abstraction levels in Gajski's Y-chart.

Behavioral Domain: what the design is supposed to do
Structural Domain: mapping of a behavioral representation to a set of components
Physical Domain: bind the structure to silicon
The same ISA - different organizations; e.g. MIPS: single-cycle, multi-cycle, pipelined
A particular architecture can be implemented by different microarchitectures
o Maintaining the instruction set architecture to run identical software
Modern instruction set architectures: IA-32, PowerPC, MIPS, SPARC, ARM,
ISA components:
Organization of Programmable Storage - Memory, Registers.
Data Types: Encodings & Representations.
Instruction Set, Formats.
Modes of Addressing and Accessing Data Items and Instructions.

Exceptional Conditions (Interrupts, Protections, I/O).
CISC (Complex Instruction Set Computer)
Examples: x86, VAX, Motorola 68000, etc.
Intel's modern x86: RISC inside
RISC (Reduced Instruction Set Computer)
Examples: PowerPC, ARM, SPARC, Alpha, PA-RISC, MIPS
RISC won the technology battles - embedded computing space
CISC won the high-end commercial space (1990s to today)
RISC vs. CISC
RISC characteristics
Single-cycle execution possible
Hardwired (simple) control
Load/store architecture
Few memory addressing modes
Fixed-length instruction format
Many registers
CISC characteristics
Many multicycle operations
Microcode for multi-cycle operations
Register-memory and memory-memory
Many modes
Many formats and lengths
Few registers
Classifying Instruction Set Architectures - four ISA classes -- Instruction Formats

Zero address formats: operands on a stack
Add: M[sp-1] ← M[sp] + M[sp-1]
Load: M[sp] ← M[M[sp]]
Stack can be in registers or in memory
usually top of stack cached in registers
One address formats: Accumulator Machines
The accumulator is always the other, implicit operand
Two address formats: the destination is the same as one of the operand sources
Reg ← (Reg op Reg): RI ← (RI) op (RJ)
Reg ← (Reg op Mem): RI ← (RI) op M[x]
Three address formats: one destination and up to two operand sources per instruction
Reg ← (Reg op Reg): RI ← (RJ) op (RK)
Reg ← (Reg op Mem): RI ← (RJ) op M[x]
Operand locations for four instruction set architecture classes.

(a) A Top Of Stack register (TOS) points to the top input operand, which is combined with the
operand below. The first operand is removed from the stack, the result takes the place of
the second operand, and TOS is updated to point to the result. All operands are implicit.
(b) The accumulator is both an implicit input operand and a result.
(c) One input operand is a register, one is in memory, and the result goes to a register.
(d) All operands are registers and, like the stack architecture, can be transferred to
memory only via separate instructions: push or pop for (a) and load or store for (d).
0, 1, 2, 3 address machines
The code sequence for C = A+B for four classes of instruction sets.
The Add instruction has implicit operands for stack and accumulator architectures, and
explicit operands for register architectures. A, B, and C are in memory.
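The code sequences for C = A + B can be sketched as four toy interpreters, one per ISA class. This is a minimal illustration, not a model of any real machine: the mnemonics in the comments (Push, Pop, Load, Add, Store), the register names R1-R3 and the function names are assumptions chosen for readability.

```python
def run_stack(mem):
    """Zero-address (stack): Push A; Push B; Add; Pop C."""
    stack = []
    stack.append(mem["A"])         # Push A
    stack.append(mem["B"])         # Push B
    top = stack.pop()              # Add: the first operand is removed ...
    stack[-1] = stack[-1] + top    # ... and the result replaces the second
    mem["C"] = stack.pop()         # Pop C
    return 4                       # number of instructions executed


def run_accumulator(mem):
    """One-address (accumulator): Load A; Add B; Store C."""
    acc = mem["A"]                 # Load A
    acc = acc + mem["B"]           # Add B (the accumulator is implicit)
    mem["C"] = acc                 # Store C
    return 3


def run_register_memory(mem):
    """Register-memory: Load R1,A; Add R3,R1,B; Store R3,C."""
    reg = {}
    reg["R1"] = mem["A"]               # Load R1, A
    reg["R3"] = reg["R1"] + mem["B"]   # Add R3, R1, B (one operand in memory)
    mem["C"] = reg["R3"]               # Store R3, C
    return 3


def run_load_store(mem):
    """Register-register (load/store): Load R1,A; Load R2,B; Add R3,R1,R2; Store R3,C."""
    reg = {}
    reg["R1"] = mem["A"]               # Load R1, A
    reg["R2"] = mem["B"]               # Load R2, B
    reg["R3"] = reg["R1"] + reg["R2"]  # Add R3, R1, R2 (registers only)
    mem["C"] = reg["R3"]               # Store R3, C
    return 4


MACHINES = {
    "stack": run_stack,
    "accumulator": run_accumulator,
    "register-memory": run_register_memory,
    "load-store": run_load_store,
}


def compare(a, b):
    """Run C = A + B on every machine; return {name: (C, instruction count)}."""
    results = {}
    for name, machine in MACHINES.items():
        mem = {"A": a, "B": b, "C": None}
        count = machine(mem)
        results[name] = (mem["C"], count)
    return results
```

Calling compare(2, 3) gives the same value of C on every machine; only the number of instructions and the location of the operands differ.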
Princeton/Harvard architectures
Processor Datapath + Control

Princeton (Von Neumann) architecture - single storage for instructions and data
Harvard architecture - separate storage for instructions and data
RTL (Register Transfer Level) Description

A design methodology in which the system operation is described by how the data is
manipulated and moved among registers.
Standard model for CPUs, micro-controllers ....

Datapath
Abstract RTL - a behavioral specification; it doesn't take into account the structure of
the digital system; it isn't related to resource and timing constraints.
Concrete RTL - an implementation of the behavioral specification based on a selected
organization at clock period granularity; it is related to resource and timing constraints.
Example RTL operations
←                      Assignment.
=, ≠                   Tests for equality and inequality.
||                     Bit string concatenation.
X ← Y                  Data transfer of contents of regY to regX
X ← 0                  Clears regX
X ← Y + Z              Adds contents of regY with regZ, load into regX
X ← Y v Z              Ors contents of regY with regZ, load into regX
DR ← M[MAR]            Load into DR the contents of memory pointed to by MAR
R1 ← >> R1             Register R1 one bit right shift, with 0 into left-most bit
R2 ← << R1             Register R1 one bit left shift, with 0 into right-most bit
X ← Y, A ← B           Parallel transfers
(cond) A ← B           If cond = 1 then transfer contents of regB into regA
S0: A ← B              When in state S0, load regA with contents of regB
P (ab): R2 ← R3        When in state P, if a AND b is true then load regR2 with contents
                       of regR3; a, b are signals.
If (A) then B else C   Conditional Jump -- Branch
Use ; to separate transfers that occur on separate cycles.
Use , to separate transfers that occur on the same cycle.
Example (2 cycles):
regA ← regB, regB ← 0;
regC ← regA;
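The two-cycle example can be traced with a short sketch. The helper name step and the initial register values are assumptions for illustration. The point it shows is that transfers written in the same cycle all read the register values from before the clock edge, so regA receives the old regB even though regB is cleared in that cycle.

```python
def step(regs, transfers):
    """Apply one clock cycle of parallel register transfers.

    transfers maps a destination register to a function of the OLD register
    values, so every right-hand side is evaluated before any register changes.
    """
    new_values = {dst: src(regs) for dst, src in transfers.items()}
    updated = dict(regs)
    updated.update(new_values)
    return updated


# Illustrative initial contents (assumed values)
regs = {"regA": 1, "regB": 7, "regC": 3}

# Cycle 1: regA <- regB, regB <- 0   (same cycle, separated by ",")
after_cycle1 = step(regs, {"regA": lambda r: r["regB"],
                           "regB": lambda r: 0})

# Cycle 2: regC <- regA              (next cycle, after ";")
after_cycle2 = step(after_cycle1, {"regC": lambda r: r["regA"]})
```

With these starting values regA holds 7 and regB holds 0 after the first cycle, and regC holds 7 after the second.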
Algorithmic State Machine (ASM)
Abstract State Machine (ASM)
ASM chart describes sequence of events
ASM flowchart
ASM specifies time at clock level
State diagram
Notation for a state.
ASM chart
State diagram and ASM chart conversion

ASM chart - One FF/State transformation rules -- One-Hot encoding
The design starts with the ASM chart, and replaces
1. State Boxes with flip-flops,
2. Decision Boxes with a corresponding decoder,
3. Junctions with an OR gate, and
4. Conditional Outputs with AND gates.
State box transforms to a D Flip-Flop
Decision box transforms to a Decoder
Decision Box Transformation Rules

The entry point (coming from a state) is the Enable input
The conditions are the Select inputs
Junction Transformation Rules

One Hot FSM Coding
Conditional output
Ring counter
One-Hot implementation - reset to 0001
One-hot reset - one flip-flop is set to 1 instead of resetting all flip-flops to 0.
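A one-hot ring counter can be modelled in a few lines. This is a sketch under assumptions: the function name, the 4-bit width and the rotate-left direction are illustrative choices, not taken from the slides. It shows why the reset value must contain a single 1: a counter reset to all zeros never leaves that state.

```python
def ring_counter_states(n_bits=4, cycles=8, reset=0b0001):
    """Return the state sequence of an n-bit ring counter as bit strings."""
    mask = (1 << n_bits) - 1
    state = reset & mask
    sequence = []
    for _ in range(cycles):
        sequence.append(format(state, f"0{n_bits}b"))
        # Rotate left by one position: the top bit wraps around to bit 0
        state = ((state << 1) | (state >> (n_bits - 1))) & mask
    return sequence
```

With the default reset of 0001 exactly one flip-flop is set in every state, which is the one-hot property; ring_counter_states(reset=0) stays at 0000 forever.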
High-Level Synthesis (HLS) - Synthesis of Digital Architectures
HLS: translation process from a behavioral description to a structural description
Behavioral specification → HLS → RTL datapath and controller

Xilinx Vivado High-Level Synthesis
Wireless, medical, defense, and consumer applications are more sophisticated than
ever before. Vivado Design Suite accelerates IP creation by enabling C, C++ and SystemC
specifications to be directly targeted into Xilinx programmable devices without the need
to manually create RTL.
DATA-FLOW MODEL OF COMPUTATION
Data-flow graphs (DFGs), also called data-dependency graphs (DDGs), represent precedence
relationships and parallelism in computations. A data-flow graph is built from:
Nodes: representing computation
Edges: representing precedence relations.
Original Data flow graph

single-assignment form
COMPILATION - OPTIMIZATION TECHNIQUE EXAMPLES
Tree-height reduction
Time and space reduction
Operator strength reduction - multiplication replaced by shift and addition
b = 3 * x;  →  t = x << 1; b = x + t;
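Both transformations can be checked on integers with a small sketch. The function names and the four-operand sum used for tree-height reduction are assumptions for illustration. Each pair of functions computes the same value; only the operator cost or the depth of the expression tree changes.

```python
def times3_multiply(x):
    """Original form: one multiplication."""
    return 3 * x


def times3_shift_add(x):
    """Strength-reduced form: one shift and one addition."""
    t = x << 1        # t = 2 * x
    return x + t      # x + 2 * x = 3 * x


def sum_chain(a, b, c, d):
    """Unbalanced expression tree: three additions in sequence (height 3)."""
    return ((a + b) + c) + d


def sum_tree(a, b, c, d):
    """Tree-height reduced form: (a + b) and (c + d) can run in parallel (height 2)."""
    return (a + b) + (c + d)
```

The balanced form still needs three additions in total, but two of them are independent, so the result is available after two adder delays instead of three.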
Hierarchical combination of dataflow graphs.

Example of hierarchical graph
DFG entities model the data flow; the hierarchical linkage of the entities models the control.
Hierarchical graphs accommodate calls, branching and iteration
CONDITIONAL DATA-FLOW symbols

HLS Basic Issues
Resource Allocation - Selection of the types and number of hardware components
Scheduling - Assignment of operations to time slots (clock cycles) such that no
precedence constraint is violated.
Module Binding - Assignment of operations to the allocated hardware components
Controller Synthesis - Design of control unit.
Basic scheduling concepts
Sequential Ops
Parallel scheduling
Chaining
Multi-cycle unit
Pipelined unit
As-Soon-As-Possible
As-late-As-Possible
No chaining
Two chained additions
A multicycle multiplication
Chained and multicycle operations - time balancing along clock periods
EXAMPLE 1
DFG
program fragment
repeat
xl = x + dx;
ul = u - (3 * x * u * dx) - (3 * y * dx);
yl = y + u * dx;
c = xl < a;
x = xl; u = ul; y = yl;
until (c);
DFG vertices for each statement:
xl = x + dx                                  v10
ul = u - (3 * x * u * dx) - (3 * y * dx)     v1-v7
yl = y + u * dx                              v8, v9
c = xl < a                                   v11
Example 1 -- specification
SCHEDULING AND BINDING.
Scheduled sequencing graph for Example 1;

Allocation and Binding
6 Multipliers and 5 ALUs
4 multipliers and 2 ALUs
Resource sharing
Resource binding may associate a resource type with more than one operation.
Operations bound to a shared resource cannot be scheduled concurrently.
SCHEDULING METHODS
As soon as possible (ASAP), minimal latency
- an operation can be scheduled when all its predecessors have been scheduled.
- very simple, the DFG need only be traversed from input(s) to output(s).
List scheduling
- extra criterion used for scheduling
o Critical-path list scheduling
- sorting criterion - the length of the longest path from operation to output.
- good results in practice
- example: since operation 3 has a higher priority than operation 1, it is scheduled first
As late as possible (ALAP)
- scheduling performed from output(s) to input(s).
Freedom-based scheduling
- compute both ASAP and ALAP schedules
- the difference in scheduling position gives the freedom or mobility of an operation
(operations in the critical path have mobility zero).
- take advantage of mobility to find a good position within scheduling range.
Example 1:
ASAP Schedule
ALAP Schedule, resources
Mobility: operations (1–5) = 0, (6, 7) = 1, (8, 9, 10, 11) = 2.
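The mobility values can be reproduced with a short sketch of ASAP and ALAP scheduling. The predecessor lists below are an assumption: one decomposition of the Example 1 statements into unit-latency operations v1..v11 that is consistent with the vertex grouping given for the program fragment (v1-v7 for ul, v8 and v9 for yl, v10 for xl, v11 for c). The function names are illustrative.

```python
# Assumed data-flow graph for Example 1: operation -> list of predecessors
PREDS = {
    1: [],        # 3 * x
    2: [],        # u * dx
    3: [1, 2],    # (3 * x) * (u * dx)
    4: [3],       # u - v3
    5: [4, 7],    # v4 - v7            -> ul
    6: [],        # 3 * y
    7: [6],       # (3 * y) * dx
    8: [],        # u * dx
    9: [8],       # y + v8             -> yl
    10: [],       # x + dx             -> xl
    11: [10],     # xl < a             -> c
}


def asap(preds):
    """Earliest step of each operation: one step after its latest predecessor."""
    start = {}

    def visit(v):
        if v not in start:
            start[v] = 1 + max((visit(p) for p in preds[v]), default=0)
        return start[v]

    for v in preds:
        visit(v)
    return start


def alap(preds, latency):
    """Latest step of each operation that still finishes within `latency` steps."""
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    start = {}

    def visit(v):
        if v not in start:
            start[v] = min((visit(s) for s in succs[v]), default=latency + 1) - 1
        return start[v]

    for v in preds:
        visit(v)
    return start


asap_steps = asap(PREDS)
latency = max(asap_steps.values())          # minimal latency from ASAP
alap_steps = alap(PREDS, latency)
mobility = {v: alap_steps[v] - asap_steps[v] for v in PREDS}
```

Under this assumed graph the ASAP latency is 4 steps, and the computed mobility is 0 for operations 1 to 5, 1 for operations 6 and 7, and 2 for operations 8 to 11.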

SCHEDULING WITH RESOURCE CONSTRAINTS (1)
Example 1; 2 multipliers and 2 ALUs

1 multiplier and 1 ALU
ALU (addition/subtraction and comparison). ALU and multiplier latency - 1 cycle.
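Scheduling under resource constraints can be sketched with critical-path list scheduling. The graph and the split into multiplier and ALU operations are assumptions: the same decomposition of Example 1 into eleven unit-latency operations, with 1, 2, 3, 6, 7, 8 as multiplications and 4, 5, 9, 10, 11 as ALU operations. Function names and the tie-break by operation number are also illustrative choices.

```python
# Assumed data-flow graph for Example 1: operation -> list of predecessors
PREDS = {1: [], 2: [], 3: [1, 2], 4: [3], 5: [4, 7], 6: [], 7: [6],
         8: [], 9: [8], 10: [], 11: [10]}
# Assumed resource type of each operation
KIND = {1: "mul", 2: "mul", 3: "mul", 6: "mul", 7: "mul", 8: "mul",
        4: "alu", 5: "alu", 9: "alu", 10: "alu", 11: "alu"}


def path_length(preds):
    """Length of the longest path from each operation to an output."""
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    length = {}

    def visit(v):
        if v not in length:
            length[v] = 1 + max((visit(s) for s in succs[v]), default=0)
        return length[v]

    for v in preds:
        visit(v)
    return length


def list_schedule(preds, kind, limits):
    """Critical-path list scheduling with unit latency.

    limits maps a resource type to the number of available units.
    Returns operation -> control step.
    """
    priority = path_length(preds)
    start = {}
    step = 0
    while len(start) < len(preds):
        step += 1
        for unit, limit in limits.items():
            # Ready: unscheduled, right type, all predecessors done in earlier steps
            ready = [v for v in preds
                     if v not in start and kind[v] == unit
                     and all(p in start and start[p] < step for p in preds[v])]
            ready.sort(key=lambda v: (-priority[v], v))   # longest path first
            for v in ready[:limit]:
                start[v] = step
    return start


two_two = list_schedule(PREDS, KIND, {"mul": 2, "alu": 2})
one_one = list_schedule(PREDS, KIND, {"mul": 1, "alu": 1})
```

Under these assumptions the schedule takes 4 steps with 2 multipliers and 2 ALUs, and 7 steps with 1 multiplier and 1 ALU.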
Register number optimization:
Variable lifetimes - register allocation and binding: 5 registers

Lifetime of a variable - the number of clock cycles in which that variable is alive
Analyze the lifetimes of all variables
Establish the required number of registers
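One common way to go from lifetimes to a register count is the left-edge algorithm. The slides do not name the method used, so this is a swapped-in technique shown only as a sketch. The variable names and lifetimes below are hypothetical and are not the lifetimes of Example 1.

```python
def left_edge(lifetimes):
    """Bind variables to registers with the left-edge algorithm.

    lifetimes maps a variable name to (birth, death), meaning the variable is
    alive in cycles birth .. death-1. Returns variable -> register index.
    """
    registers = []   # registers[i] = first cycle in which register i is free again
    binding = {}
    # Process variables in order of increasing birth time (the "left edge")
    for name, (birth, death) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        for idx, free_from in enumerate(registers):
            if free_from <= birth:           # lifetimes do not overlap: reuse
                registers[idx] = death
                binding[name] = idx
                break
        else:
            registers.append(death)          # no free register: allocate a new one
            binding[name] = len(registers) - 1
    return binding


# Hypothetical lifetimes for five temporaries
example_lifetimes = {"t1": (1, 3), "t2": (1, 2), "t3": (2, 4),
                     "t4": (3, 5), "t5": (4, 5)}
example_binding = left_edge(example_lifetimes)
register_count = len(set(example_binding.values()))
```

For these hypothetical lifetimes at most two variables are alive in any cycle, and the sketch binds the five variables to two registers.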
Datapath and Control synthesis
Component allocation and binding.
Functional Units, Registers, Reg. Files, BUSes, MUXs, Three-state (tri-state) drivers
CONTROL UNIT SYNTHESIS
Example 1; ASAP Scheduling

FSM - state transition diagram
The numbers by the vertices of the FSM are the FUs' activation signals.
Example 1; data path and control design, constraints: 1 multiplier and 1 ALU
Example 1: DFG 1 multiplier, 1 ALU

Lifetime based register binding
State transition diagram

One state for each clock cycle
DATAPATH
Example 1: 1 multiplier, 1 ALU; the corresponding data path
Laboratory
Basys2 board
Xilinx Spartan-3E FPGA and Atmel AT90USB2 USB controller
Programming circuits
Flash select - Mode Jumper (JP3) - ROM.
Basys2 I/O circuits
Basys2 Pmod connector circuits

Meta-stability and synchronization
Simple synchronizer
This synchronizer will work correctly if the clock period > period of metastability
(Tclock > Tmeta).
Synchronized falling-edge detector circuit

Synchronized Monopulse generator
Debouncing Switches
Schmitt trigger, RC, Vthreshold
Rising-edge detector
debouncing a switch.
Debouncing circuit with delay counter
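The idea of a debouncing circuit with a delay counter can be sketched as a sample-by-sample model. The function name, the sample sequence and the delay of 3 samples are assumptions for illustration. The output follows the input only after the input has stayed at the new level for the whole delay, so short bounces are ignored.

```python
def debounce(samples, delay):
    """Return the debounced level for each input sample.

    The output changes only after the raw input has differed from it for
    `delay` consecutive samples; any return to the old level restarts the count.
    """
    out = samples[0] if samples else 0
    count = 0
    result = []
    for s in samples:
        if s == out:
            count = 0                 # input agrees with output: reset the counter
        else:
            count += 1                # input differs: keep counting
            if count >= delay:
                out = s               # stable long enough: accept the new level
                count = 0
        result.append(out)
    return result


# A bouncing press followed by one glitch while the key is held (assumed data)
raw = [0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]
clean = debounce(raw, delay=3)
```

For this input the output has a single rising edge, so a rising-edge detector placed after it would produce one pulse per key press.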

Pulse width modulation (PWM) circuit - duty cycle
Block diagram of a PWM circuit.

Output register for LED
A PWM circuit generates an output pulse with an adjustable duty cycle.
Used to control the on-off time of an external system - average value, D/A converter
Applications: LED intensity, motor speed control
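A PWM output can be modelled as a free-running counter compared against a duty value. The function names and the 4-bit resolution are assumptions for this sketch. The average of the waveform equals duty / 2^bits, which is why the circuit can act as a simple D/A converter.

```python
def pwm_wave(duty, resolution_bits=4, periods=1):
    """One sample per clock: output is 1 while the counter is below `duty`."""
    period = 1 << resolution_bits
    return [1 if (t % period) < duty else 0 for t in range(period * periods)]


def average(wave):
    """Average value of the waveform, i.e. the fraction of time it is high."""
    return sum(wave) / len(wave)
```

With 4-bit resolution, pwm_wave(4) is high for 4 of 16 clocks, so its average is 0.25; a duty value of 0 keeps the output low and 16 keeps it high.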
Four 7-segment display digits
LEDs in a common anode 7-segment display
Multiplexed 7seg display timing
Imax/Iavg luminosity
32-bit 4x8-binary display
32-bit 2 x 4-hex digit display
(Block diagram labels: 32-bit register with bits 31-16 and 15-0, SW, Div Freq, 2 to 4 DEC, Hex to 7seg, 7segment display, Keypad)
Keypad scan: columns - out; rows - in

1. Initialization
2. A key is being pressed?
C ← 0000; idle -- R = 1111;
C ← 0000; key pressed -- R != 1111; polling
C ← 0000; key pressed -- R1.R2.R3.R4 = 0; negative edge - interrupt
3. Key scan to determine which key is being pressed
Drive each column low, one at a time, and see if any of the row inputs becomes low.
C ← 0111, if (R != 1111) Rin ← R;
C ← 1011, 1101, 1110, ...
4. Wait for the key to be released
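Steps 2 and 3 of the scan can be sketched against a small software model of the keypad. The 4x4 size, the active-low model in read_rows and the (row, column) return value are assumptions made for illustration. Debouncing and the wait for release in step 4 are left out.

```python
IDLE = [1, 1, 1, 1]


def read_rows(col_pattern, pressed):
    """Keypad model: a row reads 0 only if a pressed key ties it to a column driven low.

    pressed is a set of (row, column) positions of keys currently held down.
    """
    rows = list(IDLE)
    for row, col in pressed:
        if col_pattern[col] == 0:
            rows[row] = 0
    return rows


def scan(pressed):
    """Return (row, column) of a pressed key, or None if no key is pressed."""
    # Step 2: drive all columns low; rows stay at 1111 while the keypad is idle
    if read_rows([0, 0, 0, 0], pressed) == IDLE:
        return None
    # Step 3: drive the columns low one at a time (0111, 1011, 1101, 1110)
    for col in range(4):
        pattern = [1, 1, 1, 1]
        pattern[col] = 0
        rows = read_rows(pattern, pressed)
        if rows != IDLE:
            return rows.index(0), col
    return None
```

For example, scan({(2, 1)}) identifies the key in row 2, column 1, and scan(set()) reports that no key is pressed.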
UART - Universal Asynchronous Receiver and Transmitter
RS232 standard
PC -- HyperTerminal
RS232 - MAX232 voltage converter
TTL: logic 1 = +5V; RS232: logic 1 = from -3V to -15V
ASCII codes
Data format. Error detection: parity, overrun (receiver empty), frame (stop bit)
OverSampling RxD with BclkX16;
UART receiver
The ASMD chart for the receiver
D_BIT - the number of data bits
SB_TICK - the number of ticks needed for the stop bits, which is 16, 24, and 32 for
1, 1.5, and 2 stop bits, respectively
The s_tick signal (RXx16) is the enable tick from the baud rate generator; there are 16
ticks in a bit interval.
The FSM stays in the same state unless the s_tick signal is asserted.
The s counter -- keeps track of the number of sampling ticks and counts to
o 7 in the start state, to
o 15 in the data state, and to
o SB_TICK in the stop state.
The n counter -- keeps track of the number of data bits received in the data state.
The retrieved bits are shifted into and reassembled in the b register.
Status signal, rx_done_tick -- asserted for one clock cycle after the receiving process is
completed.
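The receiver described above can be sketched as a software model that advances once per sampling tick. This is an illustration under assumptions, not the lab's HDL: the function names, the LSB-first bit order, and counting the stop state from 0 up to SB_TICK - 1 are choices made here. Appending a byte to the result list stands in for the done status pulse.

```python
def uart_receive(samples, d_bit=8, sb_tick=16):
    """Decode bytes from a line sampled 16 times per bit (one sample per tick)."""
    state, s, n, b = "idle", 0, 0, 0
    received = []
    for rx in samples:
        if state == "idle":
            if rx == 0:                      # falling edge: start bit begins
                state, s = "start", 0
        elif state == "start":
            if s == 7:                       # middle of the start bit
                state, s, n = "data", 0, 0
            else:
                s += 1
        elif state == "data":
            if s == 15:                      # middle of the current data bit
                s = 0
                b = (rx << (d_bit - 1)) | (b >> 1)   # shift the new bit in from the left
                if n == d_bit - 1:
                    state = "stop"
                else:
                    n += 1
            else:
                s += 1
        else:                                # stop state
            if s == sb_tick - 1:
                state = "idle"
                received.append(b)           # stands in for the done pulse
            else:
                s += 1
    return received


def frame(byte, d_bit=8, stop_bits=1):
    """Build a 16x oversampled frame: start bit, data bits LSB first, stop bits."""
    bits = [0] + [(byte >> i) & 1 for i in range(d_bit)] + [1] * stop_bits
    return [bit for bit in bits for _ in range(16)]


line = [1] * 4 + frame(0x55) + frame(0xA3) + [1] * 4
decoded = uart_receive(line)
```

Counting to 7 in the start state places the first sample in the middle of the start bit, and every later count to 15 lands 16 ticks further on, in the middle of each data bit.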
Complete UART System

FINITE STATE MACHINE
FSM: Moore, Mealy outputs

+ Mealy machine uses fewer states
+ Mealy machine responds faster - may be transparent to glitches
FSM REPRESENTATION
State diagram
Algorithmic State Machine (ASM)/ Abstract State Machine (ASM)
FSM partitioning
FSM before partitioning

Communicating FSMs
key idea: introduction of idle states
FSM after partitioning

Four-phase handshaking protocol
ASM chart
data ready (start) four-phase handshaking protocol
Write TL using the four-phase handshaking protocol
Read TL using the four-phase handshaking protocol.
One-Hot circuit
Handshaking system with synchronizers
Talker FSM
Listener FSM
ASM charts of the talker and listener of the four-phase handshaking protocol.
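The four-phase protocol can be sketched as two small state machines stepped by a common clock. The state names (idle, req1, req0, ack1), the signal names req and ack, and the shared bus variable are assumptions for illustration, and the synchronizers are omitted. Each machine sees the other's output from the previous cycle, as it would with registered outputs.

```python
def simulate(items, max_cycles=100):
    """Transfer items from a talker FSM to a listener FSM; return (received, trace)."""
    talker, listener = "idle", "idle"
    pending = list(items)
    bus = None
    received, trace = [], []
    for _ in range(max_cycles):
        # Moore outputs, derived from the current states only
        req = 1 if talker == "req1" else 0
        ack = 1 if listener == "ack1" else 0
        trace.append((req, ack))
        if not pending and talker == "idle" and listener == "idle":
            break
        # Talker next state
        if talker == "idle":
            if pending:
                bus = pending.pop(0)         # place data, then raise req
                next_talker = "req1"
            else:
                next_talker = "idle"
        elif talker == "req1":
            next_talker = "req0" if ack else "req1"    # wait for ack = 1, then drop req
        else:
            next_talker = "req0" if ack else "idle"    # wait for ack = 0
        # Listener next state
        if listener == "idle":
            if req:
                received.append(bus)         # latch data, then raise ack
                next_listener = "ack1"
            else:
                next_listener = "idle"
        else:
            next_listener = "ack1" if req else "idle"  # wait for req = 0, then drop ack
        talker, listener = next_talker, next_listener
    return received, trace


got, signals = simulate([42])
```

For a single item the (req, ack) trace passes through (1, 0), (1, 1), (0, 1) and back to (0, 0), which are the four phases of the handshake.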
