
E&CE 427: Digital Systems Engineering

Course Notes
Mark Aagaard
2006t3 (Fall)
University of Waterloo
Dept of Electrical and Computer Engineering
September 18, 2006
Contents
I Course Notes 1
1 VHDL 3
1.1 Introduction to VHDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Levels of Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 VHDL Origins and History . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.4 Synthesis of a Simulation-Based Language . . . . . . . . . . . . . . . . . 7
1.1.5 Solution to Synthesis Sanity . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.6 Standard Logic 1164 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Comparison of VHDL to Other Hardware Description Languages . . . . . . . . . 9
1.2.1 VHDL Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.2 VHDL Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.3 VHDL and Other Languages . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.3.1 VHDL vs Verilog . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.3.2 VHDL vs SystemC . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.3.3 VHDL vs Other Hardware Description Languages . . . . . . . . 10
1.2.3.4 Summary of VHDL Evaluation . . . . . . . . . . . . . . . . . . 11
1.3 Overview of Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.1 Syntactic Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 Library Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.3 Entities and Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.4 Concurrent Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.5 Component Declaration and Instantiations . . . . . . . . . . . . . . . . . . 16
1.3.6 Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.7 Sequential Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3.8 A Few More Miscellaneous VHDL Features . . . . . . . . . . . . . . . . 18
1.4 Concurrent vs Sequential Statements . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.1 Concurrent Assignment vs Process . . . . . . . . . . . . . . . . . . . . . . 18
1.4.2 Conditional Assignment vs If Statements . . . . . . . . . . . . . . . . . . 18
1.4.3 Selected Assignment vs Case Statement . . . . . . . . . . . . . . . . . . . 19
1.4.4 Coding Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Overview of Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.5.1 Combinational Process vs Clocked Process . . . . . . . . . . . . . . . . . 22
1.5.2 Latch Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.5.3 Combinational vs Flopped Signals . . . . . . . . . . . . . . . . . . . . . . 24
1.6 Details of Process Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.6.1 Intuition Behind Delta-Cycle Simulation . . . . . . . . . . . . . . . . . . 24
1.6.2 Definitions and Algorithm . . . . . . . . . . . . . . . . . . . . . . 25
1.6.2.1 Temporal Granularities of Simulation . . . . . . . . . . . . . . . 25
1.6.2.2 Process Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.6.2.3 Simulation Algorithm . . . . . . . . . . . . . . . . . . . . . . . 26
1.6.2.4 Delta-Cycle Definitions . . . . . . . . . . . . . . . . . . 28
1.6.3 Example 1: Process Execution (Bamboozle) . . . . . . . . . . . . . . . . . 29
1.6.4 Example 2: Process Execution (Flummox) . . . . . . . . . . . . . . . . . 38
1.6.5 Example: Need for Provisional Assignments . . . . . . . . . . . . . . . . 40
1.6.6 Delta-Cycle Simulations of Flip-Flops . . . . . . . . . . . . . . . . . . . . 42
1.7 Register-Transfer Level Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.7.1 Technique for Register-Transfer Level Simulation . . . . . . . . . . . . . . 45
1.7.2 Examples of RTL Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 46
1.8 VHDL and Hardware Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . 51
1.8.1 Basic Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
1.8.2 Deprecated Building Blocks for RTL . . . . . . . . . . . . . . . . . . . . 52
1.8.2.1 An Aside on Flip-Flops and Latches . . . . . . . . . . . . . . . 52
1.8.2.2 Deprecated Hardware . . . . . . . . . . . . . . . . . . . . . . . 52
1.8.3 Hardware and Code for Flops . . . . . . . . . . . . . . . . . . . . . . . . 53
1.8.3.1 Flops with Waits and Ifs . . . . . . . . . . . . . . . . . . . . . . 53
1.8.3.2 Flops with Synchronous Reset . . . . . . . . . . . . . . . . . . 53
1.8.3.3 Flops with Chip-Enable . . . . . . . . . . . . . . . . . . . . . . 54
1.8.3.4 Flop with Chip-Enable and Mux on Input . . . . . . . . . . . . . 54
1.8.3.5 Flops with Chip-Enable, Muxes, and Reset . . . . . . . . . . . . 55
1.8.4 An Example Sequential Circuit . . . . . . . . . . . . . . . . . . . . . . . 55
1.9 Arrays and Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
1.10 Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
1.10.1 Arithmetic Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.10.2 Shift and Rotate Operations . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.10.3 Overloading of Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.10.4 Different Widths and Arithmetic . . . . . . . . . . . . . . . . . . . . . . . 62
1.10.5 Overloading of Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . 62
1.10.6 Different Widths and Comparisons . . . . . . . . . . . . . . . . . . . . . . 62
1.10.7 Type Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
1.11 Synthesizable vs Non-Synthesizable Code . . . . . . . . . . . . . . . . . . . . . . 64
1.11.1 Unsynthesizable Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.11.1.1 Initial Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.11.1.2 Wait For . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.11.1.3 Different Wait Conditions . . . . . . . . . . . . . . . . . . . . . 65
1.11.1.4 Multiple if rising_edge's in Same Process . . . . . . . . . . 66
1.11.1.5 if rising_edge and wait in Same Process . . . . . . . . . 66
1.11.1.6 if rising_edge with else Clause . . . . . . . . . . . . . . 67
1.11.1.7 if rising_edge Inside a for Loop . . . . . . . . . . . . . . 67
1.11.1.8 wait Inside of a for loop . . . . . . . . . . . . . . . . . . . 68
1.11.2 Synthesizable, but Undesirable Hardware . . . . . . . . . . . . . . . . . . 69
1.11.2.1 Asynchronous Reset . . . . . . . . . . . . . . . . . . . . . . . . 69
1.11.2.2 Combinational if-then Without else . . . . . . . . . . . . . 70
1.11.2.3 Bad Form of Nested Ifs . . . . . . . . . . . . . . . . . . . . . . 70
1.11.2.4 Deeply Nested Ifs . . . . . . . . . . . . . . . . . . . . . . . . . 70
1.11.3 Synthesizable, but Unpredictable Hardware . . . . . . . . . . . . . . . . . 71
1.12 Synthesizable VHDL Coding Guidelines . . . . . . . . . . . . . . . . . . . . . . . 71
1.12.1 Signal Declarations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
1.12.2 Flip-Flops and Latches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
1.12.3 Inputs and Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
1.12.4 Multiplexors and Tri-State Signals . . . . . . . . . . . . . . . . . . . . . . 72
1.12.5 Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
1.12.6 State Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
1.12.7 Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
1.13 VHDL Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
P1.1 IEEE 1164 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
P1.2 VHDL Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
P1.3 Flops, Latches, and Combinational Circuitry . . . . . . . . . . . . . . . . 78
P1.4 Counting Clock Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
P1.5 Arithmetic Overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
P1.6 Delta-Cycle Simulation: Pong . . . . . . . . . . . . . . . . . . . . . . . . 82
P1.7 Delta-Cycle Simulation: Baku . . . . . . . . . . . . . . . . . . . . . . . . 82
P1.8 Clock-Cycle Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
P1.9 VHDL vs VHDL Behavioural Comparison: Teradactyl . . . . . . . . . . . 85
P1.10 VHDL vs VHDL Behavioural Comparison: Ichtyostega . . . . . . . . . . 86
P1.11 Waveform vs VHDL Behavioural Comparison . . . . . . . . . . . . . . 88
P1.12 Hardware vs VHDL Comparison . . . . . . . . . . . . . . . . . . . . . 90
P1.13 8-Bit Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
P1.13.1 Asynchronous Reset . . . . . . . . . . . . . . . . . . . . . . . . 91
P1.13.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
P1.13.3 Testbench for Register . . . . . . . . . . . . . . . . . . . . . . . 91
P1.14 Synthesizable VHDL and Hardware . . . . . . . . . . . . . . . . . . . . . 92
P1.15 Datapath Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
P1.15.1 Correct Implementation? . . . . . . . . . . . . . . . . . . . . . 94
P1.15.2 Smallest Area . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
P1.15.3 Shortest Clock Period . . . . . . . . . . . . . . . . . . . . . . . 97
2 RTL Design with VHDL 99
2.1 Prelude to Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
2.1.1 A Note on EDA for FPGAs and ASICs . . . . . . . . . . . . . . . . . . . 99
2.2 FPGA Background and Coding Guidelines . . . . . . . . . . . . . . . . . . . . . . 100
2.2.1 Generic FPGA Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
2.2.1.1 Generic FPGA Cell . . . . . . . . . . . . . . . . . . . . . . . . 100
2.2.2 Area Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
2.2.2.1 Interconnect for Generic FPGA . . . . . . . . . . . . . . . . . . 104
2.2.2.2 Blocks of Cells for Generic FPGA . . . . . . . . . . . . . . . . 104
2.2.2.3 Clocks for Generic FPGAs . . . . . . . . . . . . . . . . . . . . 106
2.2.2.4 Special Circuitry in FPGAs . . . . . . . . . . . . . . . . . . . . 106
2.2.3 Generic-FPGA Coding Guidelines . . . . . . . . . . . . . . . . . . . . . . 107
2.2.4 Altera APEX20K Information and Coding Guidelines . . . . . . . . . . . 108
2.3 Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
2.3.1 Generic Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
2.3.2 Implementation Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
2.3.3 Design Flow: Datapath vs Control vs Storage . . . . . . . . . . . . . . . . 111
2.3.3.1 Classes of Hardware . . . . . . . . . . . . . . . . . . . . . . . . 111
2.3.3.2 Datapath-Centric Design Flow . . . . . . . . . . . . . . . . . . 112
2.3.3.3 Control-Centric Design Flow . . . . . . . . . . . . . . . . . . . 113
2.3.3.4 Storage-Centric Design Flow . . . . . . . . . . . . . . . . . . . 113
2.4 Algorithms and High-Level Models . . . . . . . . . . . . . . . . . . . . . . . . . 113
2.4.1 Flow Charts and State Machines . . . . . . . . . . . . . . . . . . . . . . . 114
2.4.2 Data-Dependency Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 114
2.4.3 High-Level Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
2.5 Finite State Machines in VHDL . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
2.5.1 Introduction to State-Machine Design . . . . . . . . . . . . . . . . . . . . 116
2.5.1.1 Mealy vs Moore State Machines . . . . . . . . . . . . . . . . . 116
2.5.1.2 Introduction to State Machines and VHDL . . . . . . . . . . . . 116
2.5.1.3 Explicit vs Implicit State Machines . . . . . . . . . . . . . . . . 117
2.5.2 Implementing a Simple Moore Machine . . . . . . . . . . . . . . . . . . . 118
2.5.2.1 Implicit Moore State Machine . . . . . . . . . . . . . . . . . . . 119
2.5.2.2 Explicit Moore with Flopped Output . . . . . . . . . . . . . . . 120
2.5.2.3 Explicit Moore with Combinational Outputs . . . . . . . . . . . 121
2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment . . . 122
2.5.2.5 Explicit-Current+Next Moore with Combinational Process . . . 123
2.5.3 Implementing a Simple Mealy Machine . . . . . . . . . . . . . . . . . . . 124
2.5.3.1 Implicit Mealy State Machine . . . . . . . . . . . . . . . . . . . 125
2.5.3.2 Explicit Mealy State Machine . . . . . . . . . . . . . . . . . . . 126
2.5.3.3 Explicit-Current+Next Mealy . . . . . . . . . . . . . . . . . . . 127
2.5.4 Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
2.5.5 State Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
2.5.5.1 Constants vs Enumerated Type . . . . . . . . . . . . . . . . . . 130
2.5.5.2 Encoding Schemes . . . . . . . . . . . . . . . . . . . . . . . . . 131
2.6 Dataflow Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2.6.1 Dataflow Diagrams Overview . . . . . . . . . . . . . . . . . . . . . . 132
2.6.2 Dataflow Diagrams, Hardware, and Behaviour . . . . . . . . . . . . . 135
2.6.3 Area Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
2.6.4 Dataflow Diagram Execution . . . . . . . . . . . . . . . . . . . . . . . 137
2.6.5 Performance Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
2.6.6 Design Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
2.6.7 Area / Performance Tradeoffs . . . . . . . . . . . . . . . . . . . . . . . . 139
2.7 Memory Arrays and RTL Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
2.7.1 Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
2.7.2 Memory Arrays in VHDL . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.7.2.1 Using a Two-Dimensional Array for Memory . . . . . . . . . . 142
2.7.2.2 Memory Arrays in Hardware . . . . . . . . . . . . . . . . . . . 143
2.7.2.3 VHDL Code for Single-Port Memory Array . . . . . . . . . . . 144
2.7.2.4 Using Library Components for Memory . . . . . . . . . . . . . 145
2.7.2.5 Build Memory from Slices . . . . . . . . . . . . . . . . . . . . 146
2.7.2.6 Dual-Ported Memory . . . . . . . . . . . . . . . . . . . . . . . 148
2.7.3 Data Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
2.7.4 Memory Arrays and Dataflow Diagrams . . . . . . . . . . . . . . . . . 150
2.7.5 Example: Memory Array and Dataflow Diagram . . . . . . . . . . . . 153
2.8 Input / Output Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
2.9 Design Example: Massey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
2.9.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
2.9.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
2.9.3 Initial Dataflow Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 158
2.9.4 Dataflow Diagram Scheduling . . . . . . . . . . . . . . . . . . . . . . 158
2.9.5 Optimize Inputs and Outputs . . . . . . . . . . . . . . . . . . . . . . . . . 160
2.9.6 Input/Output Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
2.9.7 Register Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
2.9.8 Datapath Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
2.9.9 Datapath for DP+Ctrl Model . . . . . . . . . . . . . . . . . . . . . . . . . 166
2.10 Design Example: Vanier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
2.10.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
2.10.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
2.10.3 Initial Dataflow Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 172
2.10.4 Reschedule to Meet Requirements . . . . . . . . . . . . . . . . . . . . . . 172
2.10.5 Optimize Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
2.10.6 Assign Names to Registered Values . . . . . . . . . . . . . . . . . . . . . 175
2.10.7 Input/Output Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
2.10.8 Tangent: Combinational Outputs . . . . . . . . . . . . . . . . . . . . . . . 177
2.10.9 Register Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
2.10.10 Datapath Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
2.10.11 Hardware Block Diagram and State Machine . . . . . . . . . . . . . . . . 180
2.10.11.1 Control for Registers . . . . . . . . . . . . . . . . . . . . . . . 180
2.10.11.2 Control for Datapath Components . . . . . . . . . . . . . . . . . 181
2.10.11.3 Control for State . . . . . . . . . . . . . . . . . . . . . . . . . . 182
2.10.11.4 Complete State Machine Table . . . . . . . . . . . . . . . . . . 182
2.10.12 VHDL Code with Explicit State Machine . . . . . . . . . . . . . . . . . . 183
2.10.13 Peephole Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
2.10.14 Notes and Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
2.11 Design Example: Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
2.11.1 Stack: Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
2.11.1.1 Entity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
2.11.1.2 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
2.11.1.3 Instruction Encoding . . . . . . . . . . . . . . . . . . . . . . . 189
2.11.1.4 Miscellaneous Requirements . . . . . . . . . . . . . . . . . . . 190
2.11.2 Stack: Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
2.11.3 Stack: Dataflow Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 192
2.11.3.1 Data-Dependency Graphs . . . . . . . . . . . . . . . . . . . . . 192
2.11.3.2 Partition into Clock Cycles . . . . . . . . . . . . . . . . . . . . 193
2.11.4 Stack: High-Level Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
2.11.5 Stack: Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
2.11.5.1 Individual Block Diagrams . . . . . . . . . . . . . . . . . . . . 197
2.11.5.2 Complete Block Diagram . . . . . . . . . . . . . . . . . . . . . 199
2.11.6 Stack: Register Transfer Level . . . . . . . . . . . . . . . . . . . . . . . . 200
2.11.6.1 Stack: Separate Control, Datapath and Storage . . . . . . . . . . 200
2.11.6.2 Stack: Datapath Operations . . . . . . . . . . . . . . . . . . . . 205
2.11.6.3 Stack: Explicit State Machine . . . . . . . . . . . . . . . . . . . 207
2.12 Optimization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
2.12.1 Strength Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
2.12.1.1 Arithmetic Strength Reduction . . . . . . . . . . . . . . . . . . 210
2.12.1.2 Boolean Strength Reduction . . . . . . . . . . . . . . . . . . . . 210
2.12.2 Replication and Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
2.12.2.1 Mux-Pushing . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
2.12.2.2 Common Subexpression Elimination . . . . . . . . . . . . . . . 211
2.12.2.3 Computation Replication . . . . . . . . . . . . . . . . . . . . . 211
2.12.3 Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
2.12.4 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
2.13 Design Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
P2.1 Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
P2.1.1 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . 213
P2.1.2 Own Code vs Libraries . . . . . . . . . . . . . . . . . . . . . . 213
P2.2 Design Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
P2.3 Dataflow Diagram Optimization . . . . . . . . . . . . . . . . . . . . . . . 214
P2.3.1 Resource Usage . . . . . . . . . . . . . . . . . . . . . . . . . . 214
P2.3.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
P2.4 Dataflow Diagram Design . . . . . . . . . . . . . . . . . . . . . . . . . . 215
P2.4.1 Maximum Performance . . . . . . . . . . . . . . . . . . . . . . 215
P2.4.2 Minimum area . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
P2.5 Michener: Design and Optimization . . . . . . . . . . . . . . . . . . . . . 216
P2.6 Dataflow Diagrams with Memory Arrays . . . . . . . . . . . . . . . . . . 216
P2.6.1 Algorithm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
P2.6.2 Algorithm 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
P2.7 2-bit adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
P2.7.1 Generic Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
P2.7.2 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
P2.8 Sketches of Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
3 Functional Verification 219
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
3.1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
3.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
3.2.1 Terminology: Validation / Verification / Testing . . . . . . . . . . . . . 220
3.2.2 The Difficulty of Designing Correct Chips . . . . . . . . . . . . . . . . 221
3.2.2.1 Notes from Kenn Heinrich (UW E&CE grad) . . . . . . . . . . 221
3.2.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys) . . . 221
3.3 Test Cases and Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
3.3.1 Test Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
3.3.2 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
3.3.3 Floating Point Divider Example . . . . . . . . . . . . . . . . . . . . . . . 223
3.4 Testbenches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
3.4.1 Overview of Test Benches . . . . . . . . . . . . . . . . . . . . . . . . . . 225
3.4.2 Reference Model Style Testbench . . . . . . . . . . . . . . . . . . . . . . 226
3.4.3 Relational Style Testbench . . . . . . . . . . . . . . . . . . . . . . . . . . 227
3.4.4 Coding Structure of a Testbench . . . . . . . . . . . . . . . . . . . . . . . 227
3.4.5 Datapath vs Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
3.4.6 Verification Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
3.5 Functional Verification for Datapath Circuits . . . . . . . . . . . . . . . . . . . 229
3.5.1 A Spec-Less Testbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
3.5.2 Use an Array for Test Vectors . . . . . . . . . . . . . . . . . . . . . . . . 231
3.5.3 Build Spec into Stimulus . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
3.5.4 Have Separate Specification Entity . . . . . . . . . . . . . . . . . . . . 233
3.5.5 Generate Test Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
3.5.6 Relational Specification . . . . . . . . . . . . . . . . . . . . . . . . . 235
3.6 Functional Verification of Control Circuits . . . . . . . . . . . . . . . . . . . . 236
3.6.1 Overview of Queues in Hardware . . . . . . . . . . . . . . . . . . . . . . 236
3.6.2 VHDL Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
3.6.2.1 Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
3.6.2.2 Other VHDL Coding . . . . . . . . . . . . . . . . . . . . . . . 241
3.6.3 Code Structure for Verification . . . . . . . . . . . . . . . . . . . . . . 241
3.6.4 Instrumentation Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
3.6.5 Coverage Monitors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
3.6.6 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
3.6.7 VHDL Coding Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
3.6.8 Queue Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
3.6.9 Queue Testbench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
3.7 Functional Verification Problems . . . . . . . . . . . . . . . . . . . . . . . . . 250
P3.1 Carry Save Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
P3.2 Traffic Light Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
P3.2.1 Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
P3.2.2 Boundary Conditions . . . . . . . . . . . . . . . . . . . . . . . 250
P3.2.3 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
P3.3 State Machines and Verification . . . . . . . . . . . . . . . . . . . . . . . 251
P3.3.1 Three Different State Machines . . . . . . . . . . . . . . . . . . 251
P3.3.2 State Machines in General . . . . . . . . . . . . . . . . . . . . . 252
P3.4 Test Plan Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
P3.4.1 Early Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
P3.4.2 Corner Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
P3.5 Sketches of Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
4 Performance Analysis and Optimization 255
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
4.2 Defining Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
4.3 Comparing Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
4.3.1 General Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
4.3.2 Example: Performance of Printers . . . . . . . . . . . . . . . . . . . . . . 257
4.4 Clock Speed, CPI, Program Length, and Performance . . . . . . . . . . . . . . . . 261
4.4.1 Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
4.4.2 Example: CISC vs RISC and CPI . . . . . . . . . . . . . . . . . . . . . . 261
4.4.3 Effect of Instruction Set on Performance . . . . . . . . . . . . . . . . . . . 263
4.4.4 Effect of Time to Market on Relative Performance . . . . . . . . . . . . . 265
4.4.5 Summary of Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
4.5 Performance Analysis and Dataflow Diagrams . . . . . . . . . . . . . . . . . . 267
4.5.1 Dataflow Diagrams, CPI, and Clock Speed . . . . . . . . . . . . . . . 267
4.5.2 Examples of Dataflow Diagrams for Two Instructions . . . . . . . . . . 268
4.5.2.1 Scheduling of Operations for Different Clock Periods . . . . . . 269
4.5.2.2 Performance Computation for Different Clock Periods . . . . . . 269
4.5.2.3 Example: Two Instructions Taking Similar Time . . . . . . . . . 270
4.5.2.4 Example: Same Total Time, Different Order for A . . . . . . . . 271
4.5.3 Example: From Algorithm to Optimized Dataflow . . . . . . . . . . . 272
4.6 Performance Analysis and Optimization Problems . . . . . . . . . . . . . . . . . . 280
P4.1 Farmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
P4.2 Network and Router . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
P4.2.1 Maximum Throughput . . . . . . . . . . . . . . . . . . . . . . . 281
P4.2.2 Packet Size and Performance . . . . . . . . . . . . . . . . . . . 281
P4.3 Performance Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . . 281
P4.4 Microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
P4.4.1 Average CPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
P4.4.2 Why not you too? . . . . . . . . . . . . . . . . . . . . . . . . . 282
P4.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
P4.5 Dataflow Diagram Optimization . . . . . . . . . . . . . . . . . . . . . . . 282
P4.6 Performance Optimization with Memory Arrays . . . . . . . . . . . . . . 283
P4.7 Multiply Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
P4.7.1 Highest Performance . . . . . . . . . . . . . . . . . . . . . . . 284
P4.7.2 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . 285
5 Timing Analysis 287
5.1 Delays and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
5.1.1 Background Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 287
5.1.2 Clock-Related Timing Definitions . . . . . . . . . . . . . . . . . . . . 288
5.1.2.1 Clock Skew . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
5.1.2.2 Clock Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
5.1.2.3 Clock Jitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
5.1.3 Storage Related Timing Definitions . . . . . . . . . . . . . . . . . . . 290
5.1.3.1 Setup Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
5.1.3.2 Hold Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
5.1.3.3 Clock-to-Q Time . . . . . . . . . . . . . . . . . . . . . . . . . . 291
5.1.4 Propagation Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
5.1.4.1 Load Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
5.1.4.2 Interconnect Delays . . . . . . . . . . . . . . . . . . . . . . . . 291
5.1.5 Summary of Delay Factors . . . . . . . . . . . . . . . . . . . . . . . . . . 292
5.1.6 Timing Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
5.1.6.1 Minimum Clock Period . . . . . . . . . . . . . . . . . . . . . . 293
5.1.6.2 Hold Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . 294
5.1.6.3 Example Timing Violations . . . . . . . . . . . . . . . . . . . . 294
5.2 Timing Analysis of Latches and Flip Flops . . . . . . . . . . . . . . . . . . . . . . 296
5.2.1 Review: Latch, Flip-Flop, Setup, Hold, Clock-to-Q . . . . . . . . . . . . . 296
5.2.2 Simple Multiplexer Latch . . . . . . . . . . . . . . . . . . . . . . . . . . 297
5.2.2.1 Structure and Behaviour of Multiplexer Latch . . . . . . . . . . 297
5.2.2.2 Strategy for Timing Analysis of Storage Devices . . . . . . . . . 298
5.2.2.3 Clock-to-Q Time of a Multiplexer Latch . . . . . . . . . . . . . 300
5.2.2.4 Setup Timing of a Multiplexer Latch . . . . . . . . . . . . . . . 300
5.2.2.5 Hold Time of a Multiplexer Latch . . . . . . . . . . . . . . . . . 304
5.2.2.6 Example of a Bad Latch . . . . . . . . . . . . . . . . . . . . . . 307
5.2.3 Timing Analysis of Transmission-Gate Latch . . . . . . . . . . . . . . . . 307
5.2.3.1 Structure and Behaviour of a Transmission Gate (Smith 2.4.3) . . 308
5.2.3.2 Structure and Behaviour of Transmission-Gate Latch (Smith 2.5.1) 308
5.2.3.3 Clock-to-Q Delay for Transmission-Gate Latch . . . . . . . . . 309
5.2.3.4 Setup and Hold Times for Transmission-Gate Latch . . . . . . . 309
5.2.4 Falling Edge Flip Flop (Smith 2.5.2) . . . . . . . . . . . . . . . . . . . . . 309
5.2.4.1 Structure and Behaviour of Flip-Flop . . . . . . . . . . . . . . . 310
5.2.4.2 Clock-to-Q of Flip-Flop . . . . . . . . . . . . . . . . . . . . . . 311
5.2.4.3 Setup of Flip-Flop . . . . . . . . . . . . . . . . . . . . . . . . . 312
5.2.4.4 Hold of Flip-Flop . . . . . . . . . . . . . . . . . . . . . . . . . 313
5.2.5 Timing Analysis of FPGA Cells (Smith 5.1.5) . . . . . . . . . . . . . . . . 313
5.2.5.1 Standard Timing Equations . . . . . . . . . . . . . . . . . . . . 314
5.2.5.2 Hierarchical Timing Equations . . . . . . . . . . . . . . . . . . 314
5.2.5.3 Actel Act 2 Logic Cell . . . . . . . . . . . . . . . . . . . . . . . 314
5.2.5.4 Timing Analysis of Actel Sequential Module . . . . . . . . . . . 316
5.2.6 Exotic Flop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
5.3 Critical Paths and False Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
5.3.1 Introduction to Critical and False Paths . . . . . . . . . . . . . . . . . . . 317
5.3.1.1 Example of Critical Path in Full Adder . . . . . . . . . . . . . . 318
5.3.1.2 Preliminaries for Critical Paths . . . . . . . . . . . . . . . . . . 320
5.3.1.3 Longest Path and Critical Path . . . . . . . . . . . . . . . . . . 321
5.3.1.4 Timing Simulation vs Static Timing Analysis . . . . . . . . . . . 323
5.3.2 Longest Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
5.3.3 Detecting a False Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
5.3.3.1 Preliminaries for Detecting a False Path . . . . . . . . . . . . . 325
5.3.3.2 Almost-Correct Algorithm to Detect a False Path . . . . . . . . . 328
5.3.3.3 Examples of Detecting False Paths . . . . . . . . . . . . . . . . 329
5.3.4 Finding the Next Candidate Path . . . . . . . . . . . . . . . . . . . . . . . 333
5.3.4.1 Algorithm to Find Next Candidate Path . . . . . . . . . . . . . . 333
5.3.4.2 Examples of Finding Next Candidate Path . . . . . . . . . . . . 334
5.3.5 Correct Algorithm to Find Critical Path . . . . . . . . . . . . . . . . . . . 341
5.3.5.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
5.3.5.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
5.3.6 Further Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
5.4 Analog Timing Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
5.4.1 Timing Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
5.4.1.1 Equation for Output Voltage . . . . . . . . . . . . . . . . . . . . 352
5.4.1.2 Extrinsic / Intrinsic Delays . . . . . . . . . . . . . . . . . . . . 354
5.5 Elmore Delay Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
5.5.1 Elmore Time Constant . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
5.5.2 Interconnect with Single Fanout . . . . . . . . . . . . . . . . . . . . . . . 356
5.5.3 Interconnect with Multiple Gates in Fanout . . . . . . . . . . . . . . . . . 358
5.6 Practical Usage of Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 361
5.6.1 Speed Binning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
5.6.1.1 FPGAs, Interconnect, and Synthesis . . . . . . . . . . . . . . . 363
5.6.2 Worst Case Timing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
5.6.2.1 Fanout delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
5.6.2.2 Derating Factors . . . . . . . . . . . . . . . . . . . . . . . . . . 363
5.7 Timing Analysis Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
P5.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
P5.2 Hold Time Violations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
P5.2.1 Cause . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
P5.2.2 Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
P5.2.3 Rectification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
P5.3 Latch Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
P5.4 Critical Path and False Path . . . . . . . . . . . . . . . . . . . . . . . . . 367
P5.5 Critical Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
P5.5.1 Longest Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
P5.5.2 Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
P5.5.3 Missing Factors . . . . . . . . . . . . . . . . . . . . . . . . . . 368
P5.5.4 Critical Path or False Path? . . . . . . . . . . . . . . . . . . . . 368
P5.6 Timing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
P5.7 Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
P5.7.1 Wires in FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . 370
P5.7.2 Age and Time . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
P5.7.3 Temperature and Delay . . . . . . . . . . . . . . . . . . . . . . 370
P5.8 Worst Case Conditions and Derating Factor . . . . . . . . . . . . . . . . . 370
P5.8.1 Worst-Case Commercial . . . . . . . . . . . . . . . . . . . . . . 370
P5.8.2 Worst-Case Industrial . . . . . . . . . . . . . . . . . . . . . . . 370
P5.8.3 Worst-Case Industrial, Non-Ambient Junction Temperature . . . 370
6 Power Analysis and Power-Aware Design 371
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
6.1.1 Importance of Power and Energy . . . . . . . . . . . . . . . . . . . . . . . 371
6.1.2 Industrial Names and Products . . . . . . . . . . . . . . . . . . . . . . . . 371
6.1.3 Power vs Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
6.1.4 Batteries, Power and Energy . . . . . . . . . . . . . . . . . . . . . . . . . 373
6.1.4.1 Do Batteries Store Energy or Power? . . . . . . . . . . . . . . . 373
6.1.4.2 Battery Life and Efficiency . . . . . . . . . . . . . . . . . . . 373
6.1.4.3 Battery Life and Power . . . . . . . . . . . . . . . . . . . . . . 374
6.2 Power Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
6.2.1 Switching Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
6.2.2 Short-Circuited Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
6.2.3 Leakage Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
6.2.4 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
6.2.5 Note on Power Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
6.3 Overview of Power Reduction Techniques . . . . . . . . . . . . . . . . . . . . . . 379
6.4 Voltage Reduction for Power Reduction . . . . . . . . . . . . . . . . . . . . . . . 380
6.5 Data Encoding for Power Reduction . . . . . . . . . . . . . . . . . . . . . . . . . 381
6.5.1 How Data Encoding Can Reduce Power . . . . . . . . . . . . . . . . . . . 381
6.5.2 Example Problem: Sixteen Pulser . . . . . . . . . . . . . . . . . . . . . . 381
6.5.2.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 381
6.5.2.2 Additional Information . . . . . . . . . . . . . . . . . . . . . . 382
6.5.2.3 Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
6.6 Clock Gating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
6.6.1 Introduction to Clock Gating . . . . . . . . . . . . . . . . . . . . . . . . . 386
6.6.2 Implementing Clock Gating . . . . . . . . . . . . . . . . . . . . . . . . . 387
6.6.3 Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
6.6.4 Effectiveness of Clock Gating . . . . . . . . . . . . . . . . . . . . . . . . 389
6.6.5 Example: Reduced Activity Factor with Clock Gating . . . . . . . . . . . 391
6.6.6 Clock Gating with Valid-Bit Protocol . . . . . . . . . . . . . . . . . . . . 391
6.6.6.1 Valid-Bit Protocol . . . . . . . . . . . . . . . . . . . . . . . . . 391
6.6.6.2 How Many Clock Cycles for Module? . . . . . . . . . . . . . . 392
6.6.6.3 Adding Clock-Gating Circuitry . . . . . . . . . . . . . . . . . . 393
6.6.7 Example: Pipelined Circuit with Clock-Gating . . . . . . . . . . . . . . . 395
6.7 Power Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
P6.1 Short Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
P6.1.1 Power and Temperature . . . . . . . . . . . . . . . . . . . . . . 401
P6.1.2 Leakage Power . . . . . . . . . . . . . . . . . . . . . . . . . . 401
P6.1.3 Clock Gating . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
P6.1.4 Gray Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
P6.2 VLSI Gurus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
P6.2.1 Effect on Power . . . . . . . . . . . . . . . . . . . . . . . . . . 401
P6.2.2 Critique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
P6.3 Advertising Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
P6.4 Vary Supply Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
P6.5 Clock Speed Increase Without Power Increase . . . . . . . . . . . . . . . 403
P6.5.1 Supply Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
P6.5.2 Supply Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
P6.6 Power Reduction Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 403
P6.6.1 Supply Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
P6.6.2 Transistor Sizing . . . . . . . . . . . . . . . . . . . . . . . . . . 403
P6.6.3 Adding Registers to Inputs . . . . . . . . . . . . . . . . . . . . 403
P6.6.4 Gray Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
P6.7 Power Consumption on New Chip . . . . . . . . . . . . . . . . . . . . . . 404
P6.7.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
P6.7.2 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
P6.7.3 Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
7 Fault Testing and Testability 405
7.1 Faults and Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
7.1.1 Overview of Faults and Testing . . . . . . . . . . . . . . . . . . . . . . . 405
7.1.1.1 Faults (Smith 14.3) . . . . . . . . . . . . . . . . . . . . . . . . 405
7.1.1.2 Causes of Faults (Smith 14.3) . . . . . . . . . . . . . . . . . . . 405
7.1.1.3 Testing (Smith 14) . . . . . . . . . . . . . . . . . . . . . . . . . 406
7.1.1.4 Burn In (Smith 14.3.1) . . . . . . . . . . . . . . . . . . . . . . . 406
7.1.1.5 Bin Sorting (Smith 5.1.6) . . . . . . . . . . . . . . . . . . . . . 406
7.1.1.6 Testing Techniques (Smith 14) . . . . . . . . . . . . . . . . . . 407
7.1.1.7 Design for Testability (DFT) (Smith 14.6) . . . . . . . . . . . . 407
7.1.2 Example Problem: Economics of Testing (Smith 14.1) . . . . . . . . . . . 408
7.1.3 Physical Faults (Smith 14.3.3) . . . . . . . . . . . . . . . . . . . . . . . . 409
7.1.3.1 Types of Physical Faults . . . . . . . . . . . . . . . . . . . . . . 409
7.1.3.2 Locations of Faults . . . . . . . . . . . . . . . . . . . . . . . . 409
7.1.3.3 Layout Affects Locations . . . . . . . . . . . . . . . . . . . . . 410
7.1.3.4 Naming Fault Locations . . . . . . . . . . . . . . . . . . . . . . 410
7.1.4 Detecting a Fault . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
7.1.4.1 Which Test Vectors will Detect a Fault? . . . . . . . . . . . . . 411
7.1.5 Mathematical Models of Faults (Smith 14.3.4) . . . . . . . . . . . . . . . 412
7.1.5.1 Single Stuck-At Fault Model . . . . . . . . . . . . . . . . . . . 412
7.1.6 Generate Test Vector to Find a Mathematical Fault (Smith 14.4) . . . . . . 413
7.1.6.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
7.1.6.2 Example of Finding a Test Vector . . . . . . . . . . . . . . . . . 414
7.1.7 Undetectable Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
7.1.7.1 Redundant Circuitry . . . . . . . . . . . . . . . . . . . . . . . . 414
7.1.7.2 Curious Circuitry and Fault Detection . . . . . . . . . . . . . . 416
7.2 Test Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
7.2.1 A Small Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
7.2.2 Choosing Test Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
7.2.2.1 Fault Domination . . . . . . . . . . . . . . . . . . . . . . . . . 418
7.2.2.2 Fault Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . 419
7.2.2.3 Gate Collapsing . . . . . . . . . . . . . . . . . . . . . . . . . . 419
7.2.2.4 Node Collapsing . . . . . . . . . . . . . . . . . . . . . . . . . . 420
7.2.2.5 Fault Collapsing Summary . . . . . . . . . . . . . . . . . . . . 420
7.2.3 Fault Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
7.2.4 Test Vector Generation and Fault Detection . . . . . . . . . . . . . . . . . 421
7.2.5 Generate Test Vectors for 100% Coverage . . . . . . . . . . . . . . . . . . 421
7.2.5.1 Collapse the Faults . . . . . . . . . . . . . . . . . . . . . . . . 422
7.2.5.2 Check for Fault Domination . . . . . . . . . . . . . . . . . . . . 424
7.2.5.3 Required Test Vectors . . . . . . . . . . . . . . . . . . . . . . . 425
7.2.5.4 Faults Not Covered by Required Test Vectors . . . . . . . . . . . 425
7.2.5.5 Order to Run Test Vectors . . . . . . . . . . . . . . . . . . . . . 426
7.2.5.6 Summary of Technique to Find and Order Test Vectors . . . . . 427
7.2.5.7 Complete Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 428
7.2.6 One Fault Hiding Another . . . . . . . . . . . . . . . . . . . . . . . . . . 429
7.3 Scan Testing in General . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
7.3.1 Structure and Behaviour of Scan Testing . . . . . . . . . . . . . . . . . . . 430
7.3.2 Scan Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
7.3.2.1 Circuitry in Normal and Scan Mode . . . . . . . . . . . . . . . 430
7.3.2.2 Scan in Operation . . . . . . . . . . . . . . . . . . . . . . . . . 431
7.3.2.3 Scan in Operation with Example Circuit . . . . . . . . . . . . . 432
7.3.3 Summary of Scan Testing . . . . . . . . . . . . . . . . . . . . . . . . . . 437
7.3.4 Time to Test a Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
7.3.4.1 Example: Time to Test a Chip . . . . . . . . . . . . . . . . . . . 438
7.4 Boundary Scan and JTAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
7.4.1 Boundary Scan History . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
7.4.2 JTAG Scan Pins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
7.4.3 Scan Registers and Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
7.4.4 Scan Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
7.4.5 TAP Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
7.4.6 Other Descriptions of JTAG/IEEE 1149.1 . . . . . . . . . . . . . . . . 442
7.5 Built In Self Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
7.5.1 Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
7.5.1.1 Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
7.5.1.2 Linear Feedback Shift Register (LFSR) . . . . . . . . . . . . . . 445
7.5.1.3 Maximal-Length LFSR . . . . . . . . . . . . . . . . . . . . . . 446
7.5.2 Arithmetic over Binary Fields . . . . . . . . . . . . . . . . . . . . . . . . 446
7.5.3 Shift Registers and Characteristic Polynomials . . . . . . . . . . . . . . . 447
7.5.3.1 Circuit Multiplication . . . . . . . . . . . . . . . . . . . . . . . 449
7.5.4 Bit Streams and Characteristic Polynomials . . . . . . . . . . . . . . . . . 449
7.5.5 Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
7.5.6 Signature Analysis: Math and Circuits . . . . . . . . . . . . . . . . . . . . 450
7.5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
7.6 Scan vs Self Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
7.7 Problems on Faults, Testing, and Testability . . . . . . . . . . . . . . . . . . . . . 453
P7.1 Based on Smith q14.9: Testing Cost . . . . . . . . . . . . . . . . . . . . . 453
P7.2 Testing Cost and Total Cost . . . . . . . . . . . . . . . . . . . . . . . . . 453
P7.3 Minimum Number of Faults . . . . . . . . . . . . . . . . . . . . . . . . . 454
P7.4 Smith q14.10: Fault Collapsing . . . . . . . . . . . . . . . . . . . . . . . 454
P7.5 Mathematical Models and Reality . . . . . . . . . . . . . . . . . . . . . . 454
P7.6 Undetectable Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
P7.7 Test Vector Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
P7.7.1 Choice of Test Vectors . . . . . . . . . . . . . . . . . . . . . . . 455
P7.7.2 Number of Test Vectors . . . . . . . . . . . . . . . . . . . . . . 455
P7.8 Time to do a Scan Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
P7.9 BIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
P7.9.1 Characteristic Polynomials . . . . . . . . . . . . . . . . . . . . 455
P7.9.2 Test Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 456
P7.9.3 Signature Analyzer . . . . . . . . . . . . . . . . . . . . . . . . 456
P7.9.4 Probability of Catching a Fault . . . . . . . . . . . . . . . . . . 456
P7.9.5 Probability of Catching a Fault . . . . . . . . . . . . . . . . . . 456
P7.9.6 Detecting a Specic Fault . . . . . . . . . . . . . . . . . . . . . 456
P7.9.7 Time to Run Test . . . . . . . . . . . . . . . . . . . . . . . . . 456
P7.10 Power and BIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
P7.11 Timing Hazards and Testability . . . . . . . . . . . . . . . . . . . . . . . 457
P7.12 Testing Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
P7.12.1 Are there any physical faults that are detectable by scan testing
but not by built-in self testing? . . . . . . . . . . . . . . . . . . 457
P7.12.2 Are there any physical faults that are detectable by built-in self
testing but not by scan testing? . . . . . . . . . . . . . . . . . . 457
P7.13 Fault Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
P7.13.1 Design test generator . . . . . . . . . . . . . . . . . . . . . . . 458
P7.13.2 Design signature analyzer . . . . . . . . . . . . . . . . . . . . . 458
P7.13.3 Determine if a fault is detectable . . . . . . . . . . . . . . . . . 458
P7.13.4 Testing time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
8 Review 459
8.1 Overview of the Term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
8.2 VHDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
8.2.1 VHDL Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
8.2.2 VHDL Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 460
8.3 RTL Design Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
8.3.1 Design Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
8.3.2 Design Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 461
8.4 Functional Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
8.4.1 Verification Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
8.4.2 Verification Example Problems . . . . . . . . . . . . . . . . . . . . . . 462
8.5 Performance Analysis and Optimization . . . . . . . . . . . . . . . . . . . . . . . 463
8.5.1 Performance Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
8.5.2 Performance Example Problems . . . . . . . . . . . . . . . . . . . . . . . 463
8.6 Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
8.6.1 Timing Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
8.6.2 Timing Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 464
8.7 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
8.7.1 Power Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
8.7.2 Power Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 465
8.8 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
8.8.1 Testing Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
8.8.2 Testing Example Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 466
8.9 Formulas to be Given on Final Exam . . . . . . . . . . . . . . . . . . . . . . . . . 467
II Solutions to Assignment Problems 1
5 VHDL Problems 3
P5.1 IEEE 1164 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
P5.2 VHDL Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
P5.3 Flops, Latches, and Combinational Circuitry . . . . . . . . . . . . . . . . . . . . . 7
P5.4 Counting Clock Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
P5.5 Arithmetic Overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
P5.6 Delta-Cycle Simulation: Pong . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
P5.7 Delta-Cycle Simulation: Baku . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
P5.8 Clock-Cycle Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
P5.9 VHDL vs VHDL Behavioural Comparison: Teradactyl . . . . . . . . . . . . . 19
P5.10 VHDL vs VHDL Behavioural Comparison: Ichtyostega . . . . . . . . . . . . 20
P5.11 Waveform vs VHDL Behavioural Comparison . . . . . . . . . . . . . . . . . 22
P5.12 Hardware vs VHDL Comparison . . . . . . . . . . . . . . . . . . . . . . . . 24
P5.13 8-Bit Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
P5.13.1 Asynchronous Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
P5.13.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
P5.13.3 Testbench for Register . . . . . . . . . . . . . . . . . . . . . . . . . . 27
P5.14 Synthesizable VHDL and Hardware . . . . . . . . . . . . . . . . . . . . . . . 29
P5.15 Datapath Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
P5.15.1 Correct Implementation? . . . . . . . . . . . . . . . . . . . . . . . . . 31
P5.15.2 Smallest Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
P5.15.3 Shortest Clock Period . . . . . . . . . . . . . . . . . . . . . . . . . . 36
6 Design Problems 37
P6.1 Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
P6.1.1 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
P6.1.2 Own Code vs Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
P6.2 Design Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
P6.3 Dataflow Diagram Optimization . . . . . . . . . . . . . . . . . . . . . . . . 40
P6.3.1 Resource Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
P6.3.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
P6.4 Dataflow Diagram Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
P6.4.1 Maximum Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
P6.4.2 Minimum area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
P6.5 Michener: Design and Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 45
P6.6 Dataflow Diagrams with Memory Arrays . . . . . . . . . . . . . . . . . . . 46
P6.6.1 Algorithm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
P6.6.2 Algorithm 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
P6.7 2-bit adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
P6.7.1 Generic Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
P6.7.2 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
P6.8 Sketches of Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7 Functional Verification Problems 53
P7.1 Carry Save Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
P7.2 Traffic Light Controller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
P7.2.1 Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
P7.2.2 Boundary Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
P7.2.3 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
P7.3 State Machines and Verification . . . . . . . . . . . . . . . . . . . . . . . . 55
P7.3.1 Three Different State Machines . . . . . . . . . . . . . . . . . . . . . . . 55
P7.3.1.1 Number of Test Scenarios . . . . . . . . . . . . . . . . . . . . . 55
P7.3.1.2 Length of Test Scenario . . . . . . . . . . . . . . . . . . . . . . 56
P7.3.1.3 Number of Flip Flops . . . . . . . . . . . . . . . . . . . . . . . 56
P7.3.2 State Machines in General . . . . . . . . . . . . . . . . . . . . . . . . . . 57
P7.4 Test Plan Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
P7.4.1 Early Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
P7.4.2 Corner Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
P7.5 Sketches of Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
8 Performance Analysis and Optimization Problems 61
P8.1 Farmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
P8.2 Network and Router . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
P8.2.1 Maximum Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
P8.2.2 Packet Size and Performance . . . . . . . . . . . . . . . . . . . . . . . . . 64
P8.3 Performance Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
P8.4 Microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
P8.4.1 Average CPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
P8.4.2 Why not you too? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
P8.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
P8.5 Dataflow Diagram Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 68
P8.6 Performance Optimization with Memory Arrays . . . . . . . . . . . . . . . . . . . 68
P8.7 Multiply Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
P8.7.1 Highest Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
P8.7.2 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
9 Timing Analysis Problems 77
P9.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
P9.2 Hold Time Violations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
P9.2.1 Cause . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
P9.2.2 Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
P9.2.3 Rectification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
P9.3 Latch Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
P9.4 Critical Path and False Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
P9.5 Critical Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
P9.5.1 Longest Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
P9.5.2 Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
P9.5.3 Missing Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
P9.5.4 Critical Path or False Path? . . . . . . . . . . . . . . . . . . . . . . . . . . 83
P9.6 Timing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
P9.7 Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
P9.7.1 Wires in FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
P9.7.2 Age and Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
P9.7.3 Temperature and Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
P9.8 Worst Case Conditions and Derating Factor . . . . . . . . . . . . . . . . . . . . . 87
P9.8.1 Worst-Case Commercial . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
P9.8.2 Worst-Case Industrial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
P9.8.3 Worst-Case Industrial, Non-Ambient Junction Temperature . . . . . . . . . 87
10 Power Problems 89
P10.1 Short Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
P10.1.1 Power and Temperature . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
P10.1.2 Leakage Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
P10.1.3 Clock Gating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
P10.1.4 Gray Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
P10.2 VLSI Gurus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
P10.2.1 Effect on Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
P10.2.2 Critique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
P10.3 Advertising Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
P10.4 Vary Supply Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
P10.5 Clock Speed Increase Without Power Increase . . . . . . . . . . . . . . . . . . 93
P10.5.1 Supply Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
P10.5.2 Supply Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
P10.6 Power Reduction Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
P10.6.1 Supply Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
P10.6.2 Transistor Sizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
P10.6.3 Adding Registers to Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
P10.6.4 Gray Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
P10.7 Power Consumption on New Chip . . . . . . . . . . . . . . . . . . . . . . . . . 96
P10.7.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
P10.7.2 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
P10.7.3 Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
11 Problems on Faults, Testing, and Testability 99
P11.1 Based on Smith q14.9: Testing Cost . . . . . . . . . . . . . . . . . . . . . . . . 99
P11.2 Testing Cost and Total Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
P11.3 Minimum Number of Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
P11.4 Smith q14.10: Fault Collapsing . . . . . . . . . . . . . . . . . . . . . . . . . . 103
P11.5 Mathematical Models and Reality . . . . . . . . . . . . . . . . . . . . . . . . . 103
P11.6 Undetectable Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
P11.7 Test Vector Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
P11.7.1 Choice of Test Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
P11.7.2 Number of Test Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
P11.8 Time to do a Scan Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
P11.9 BIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
P11.9.1 Characteristic Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
P11.9.2 Test Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
P11.9.3 Signature Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
P11.9.4 Probability of Catching a Fault . . . . . . . . . . . . . . . . . . . . . . . . . 111
P11.9.5 Probability of Catching a Fault . . . . . . . . . . . . . . . . . . . . . . . . . 112
P11.9.6 Detecting a Specific Fault . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
P11.9.7 Time to Run Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
P11.10 Power and BIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
P11.11 Timing Hazards and Testability . . . . . . . . . . . . . . . . . . . . . . . . . . 114
P11.12 Testing Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
P11.12.1 Are there any physical faults that are detectable by scan testing but not by
built-in self testing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
P11.12.2 Are there any physical faults that are detectable by built-in self testing but
not by scan testing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
P11.13 Fault Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
P11.13.1 Design test generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
P11.13.2 Design signature analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
P11.13.3 Determine if a fault is detectable . . . . . . . . . . . . . . . . . . . . . . . . 118
P11.13.4 Testing time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Part I
Course Notes
Chapter 1
VHDL: The Language
1.1 Introduction to VHDL
1.1.1 Levels of Abstraction
There are many different levels of abstraction for working with hardware:
Quantum: Schrödinger's equations describe movement of electrons and holes through material.
Energy band: 2-dimensional diagrams that capture essential features of Schrödinger's equations. Energy-band diagrams are commonly used in nano-scale engineering.
Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network. Overall behaviour is defined by differential equations in terms of the resistors and capacitors. Spice is a typical simulation tool.
Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equations are used, rather than differential equations. A rising edge may be modeled as a linear rise over some range of time, or the time between a definite low value and a definite high value may be modeled as having an undefined or rising value.
Gate: Transistors are grouped together into gates (e.g. AND, OR, NOT). Voltages are discrete values such as pure Boolean (0 or 1) or IEEE Standard Logic 1164, which has representations for different types of unknown or undefined values. Time may be continuous or may be discrete. If discrete, a common unit is the delay through a single inverter (e.g. a NOT gate has a delay of 1 and an AND gate has a delay of 2).
Register transfer level: The essential characteristic of the register transfer level is that the behaviour of hardware is modeled as assignments to registers and combinational signals. Equations are written where a register signal is a function of other signals (e.g. c = a and b;). The assignments may be either combinational or registered. Combinational assignments happen instantaneously and registered assignments take exactly one clock cycle. There are variations on the pure register-transfer level. For example, time may be measured in clock phases rather than clock cycles, so as to allow assignments on either the rising or falling edge of a clock. Another variation is to have multiple clocks that run at different speeds: a clock on a bus might run at half the speed of the primary clock for the chip.
Transaction level: The basic unit of computation is a transaction, such as executing an instruction on a microprocessor, transferring data across a bus, or accessing memory. Time is usually measured as an estimate (e.g. a memory write requires 15 clock cycles, or a bus transfer requires 250 ns). The building blocks of the transaction level are processors, controllers, memory arrays, busses, and intellectual property (IP) blocks (e.g. UARTs). The behaviour of the building blocks is described with software-like models, often written in behavioural VHDL, SystemC, or SystemVerilog. The transaction level has many similarities to a software model of a distributed system.
Electronic-system level: Looks at an entire electronic system, with both hardware and soft-
ware.
In this course, we will focus on the register-transfer level. In the second half of the course, we will look at how analog phenomena, such as timing and power, affect the register-transfer level. In these chapters we will occasionally dip down into the transistor, switch, and gate levels.
1.1.2 VHDL Origins and History
VHDL = VHSIC Hardware Description Language
VHSIC = Very High Speed Integrated Circuit
The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems. Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware.
Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)
Development, verification, synthesis, testing, hardware designs, communication, maintenance, modification, procurement: VHDL is a lot more than synthesis of digital hardware.
VHDL History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Developed by the United States Department of Defense as part of the very high speed integrated
circuit (VHSIC) program in the early 1980s.
The Department of Defense intended VHDL to be used for the documentation, simulation and
verification of electronic systems.
Goals:
improve design process over schematic entry
standardize design descriptions amongst multiple vendors
portable and extensible
Inspired by the ADA programming language
large: 97 keywords, 94 syntactic rules
verbose (designed by committee)
static type checking, overloading
complicated syntax: parentheses are used for both expression grouping and array indexing
Example:
a <= b * (3 + c); -- integer
a <= (3 + c); -- 1-element array of integers
Standardized by IEEE in 1987 (IEEE 1076-1987), revised in 1993, 2000.
In 1993 the IEEE standard VHDL package for model interoperability, STD_LOGIC_1164
(IEEE Standard 1164-1993), was developed.
std_logic_1164 defines 9 different values for signals
In 1997 the IEEE standard packages for arithmetic over std_logic and bit signals were defined (IEEE Standard 1076.3-1997).
numeric_std defines arithmetic over std_logic vectors and integers.
Note: This is the package that you should use for arithmetic. Don't use std_logic_arith: it has less uniform support for mixed integer/signal arithmetic and has a greater tendency for differences between tools. (A short sketch of numeric_std usage follows below.)
numeric_bit defines arithmetic over bit vectors and integers. We won't use bit signals in this course, so you don't need to worry about this package.
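To make the arithmetic-package advice concrete, here is a minimal sketch of numeric_std arithmetic on std_logic_vector signals. The entity name and port names are illustrative, not from the notes.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add_one is
  port (
    d : in  std_logic_vector(7 downto 0);
    q : out std_logic_vector(7 downto 0)
  );
end add_one;

architecture main of add_one is
begin
  -- numeric_std style: convert to unsigned, add an integer literal,
  -- then convert the result back to std_logic_vector
  q <= std_logic_vector(unsigned(d) + 1);
end main;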
1.1.3 Semantics
The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.
[Figure: simulation of c <= a AND b; drawn as an AND gate with inputs a and b and output c.]
But now, VHDL is used in simulation and synthesis. Synthesis is concerned with the structure of
the circuit.
Synthesis: converts one type of description (behavioural) into another, lower level, description
(usually a netlist).
[Figure: synthesis of c <= a AND b; into an AND gate with inputs a and b and output c.]
Synthesis is a computer-aided design (CAD) technique that transforms a designer's concise, high-level description of a circuit into a structural description of a circuit.
CAD Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
CAD Tools allow designers to automate lower-level design processes in implementing the desired
functionality of a system.
NOTE: EDA = Electronic Design Automation. In digital hardware design EDA = CAD.
Synthesis vs Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
For synthesis, we want the code we write to define the structure of the hardware that is generated.
[Figure: synthesis of c <= a AND b; into an AND gate with inputs a and b and output c.]
The VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware. The scenario below complies with the semantics of VHDL, because the two synthesized circuits produce the same behaviour. If the two synthesized circuits had different behaviour, then the scenario would not comply with the VHDL Standard.
[Figure: c <= a AND b; is synthesized into two circuits with different structure; simulating either circuit produces the same behaviour as simulating the original code.]
1.1.4 Synthesis of a Simulation-Based Language
Not all of VHDL is synthesizable
c <= a AND b; (synthesizable)
c <= a AND b AFTER 2ns; (NOT synthesizable)
how do you build a circuit with exactly 2ns of delay through an AND gate?
more examples of non-synthesizable code are in section 1.11
See section 1.11 for more details
Different synthesis tools support different subsets of VHDL
Some tools generate erroneous hardware for some code
behaviour of hardware differs from VHDL semantics
Some tools generate unpredictable hardware (hardware that has the correct behaviour, but undesirable or weird structure).
There is an IEEE standard (1076.6) for a synthesizable subset of VHDL, but tool vendors don't yet conform to it. (Most vendors still don't have full support for the 1993 extensions to VHDL!)
For more info, see http://www.vhdl.org/siwg/.
1.1.5 Solution to Synthesis Sanity
Pick a high-quality synthesis tool and study its documentation thoroughly
Learn the idioms of the tool
Different VHDL code with same behaviour can result in very different circuits
Be careful if you have to port VHDL code from one tool to another
KISS: Keep It Simple Stupid
VHDL examples will illustrate reliable coding techniques for the synthesis tools from Synop-
sys, Mentor Graphics, Altera, Xilinx, and most other companies as well.
Follow the coding guidelines and examples from lecture
As you write VHDL, think about the hardware you expect to get.
Note: If you can't predict the hardware, then the hardware probably won't be very good (small, fast, correct, etc.)
1.1.6 Standard Logic 1164
At the core of VHDL is a package named STANDARD that defines a type named bit with values of '0' and '1'. For simulation, it is helpful to have additional values, such as "undefined" and "high impedance". Many companies created their own (incompatible) definitions of signal types for simulation. To regain compatibility amongst packages from different companies, the IEEE defined std_logic_1164 to be the standard type for signal values in VHDL simulation.
'U'  uninitialized
'X'  strong unknown
'0'  strong 0
'1'  strong 1
'Z'  high impedance
'W'  weak unknown
'L'  weak 0
'H'  weak 1
'-'  don't care
The most common values are: 'U', 'X', '0', '1'.
If you see 'X' in a simulation, it usually means that there is a mistake in your code.
Every VHDL file that you write should begin with:
library ieee;
use ieee.std_logic_1164.all;
Note: std_logic vs boolean. The std_logic values '1' and '0' are not the same as the boolean values true and false. For example, you must write if a = '1' then .... The code if a then ... will not typecheck if a is of type std_logic.
From a VLSI perspective, a weak value will come from a smaller gate. One aspect of VHDL that we don't touch on in ece427 is resolution, which describes how to determine the value of a signal if the signal is driven by more than one process. (In ece427, we restrict ourselves to having each signal be driven by (be the target of) exactly one process.) The std_logic_1164 library provides a resolution function to deal with the situation where different processes drive the same signal with different values. In this situation, a strong value (e.g. '1') will overpower a weak value (e.g. 'L'). If two processes drive the signal with different strong values (e.g. '1' and '0') the signal resolves to a strong unknown ('X'). If a signal is driven with two different weak values (e.g. 'H' and 'L'), the signal resolves to a weak unknown ('W').
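For illustration only (multiple drivers are not used in E&CE 427 designs), a minimal sketch of two concurrent assignments driving the same resolved std_logic signal; the entity name is illustrative:

library ieee;
use ieee.std_logic_1164.all;

entity resolve_demo is
end resolve_demo;

architecture main of resolve_demo is
  signal s : std_logic;
begin
  s <= '1';   -- driver 1: strong '1'
  s <= 'L';   -- driver 2: weak '0'
  -- the resolution function picks '1', because a strong value overpowers a weak value;
  -- driving '1' and '0' from the two drivers instead would resolve to 'X'
end main;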
1.2 Comparison of VHDL to Other Hardware Description Languages
1.2.1 VHDL Disadvantages
Some VHDL programs cannot be synthesized
Different tools support different subsets of VHDL.
Different tools generate different circuits for same code
VHDL is verbose
Many characters to say something simple
VHDL is complicated and confusing
Many different ways of saying the same thing
Constructs that have similar purpose have very different syntax (case vs. select)
Constructs that have similar syntax have very different semantics (variables vs signals)
Hardware that is synthesized is not always obvious (when is a signal a flip-flop vs latch vs combinational)
The infamous latch inference problem (See section 1.5.2 for more information)
1.2.2 VHDL Advantages
VHDL supports unsynthesizable constructs that are useful in writing high-level models, test-
benches and other non-hardware or non-synthesizable artifacts that we need in hardware design.
VHDL can be used throughout a large portion of the design process in different capacities, from
specication to implementation to verication.
VHDL has static typechecking many errors can be caught before synthesis and/or simulation.
(In this respect, it is more similar to Java than to C.)
VHDL has a rich collection of datatypes
VHDL is a full-featured language with a good module system (libraries and packages).
VHDL has a well-defined standard.
1.2.3 VHDL and Other Languages
1.2.3.1 VHDL vs Verilog
Verilog is a simpler language: smaller language, simple circuits are easier to write
VHDL has more features than Verilog
richer set of data types and strong type checking
VHDL offers more exibility and expressivity for constructing large systems.
The VHDL Standard is more standard than the Verilog Standard
VHDL and Verilog have simulation-based semantics
Simulation vendors generally conform to VHDL standard
Some Verilog constructs don't simulate the same in different tools
VHDL is used more than Verilog in Europe and Japan
Verilog is used more than VHDL in North America
South-East Asia, India, South America: ?????
1.2.3.2 VHDL vs SystemC
SystemC looks like C: familiar syntax
C is often used in algorithmic descriptions of circuits, so why not try to use it for synthesizable
code as well?
If you think VHDL is hard to synthesize, try C....
SystemC simulation is slower than advertised
1.2.3.3 VHDL vs Other Hardware Description Languages
Superlog: A proposed language that was based on Verilog and C. The basic core comes from Verilog. C-like extensions were included to make the language more expressive and powerful. Developed by the Co-Design company, but no longer under active development. Superlog has been superseded by SystemVerilog, see below.
SystemVerilog: A language originally proposed by Co-Design and now being standardized by Accellera, an organization aimed at standardizing EDA languages. SystemVerilog is inspired by Verilog, Superlog, and SystemC. SystemVerilog is a superset of Verilog aimed to support both high-level design and verification.
Esterel: A language evolving from academia to commercial viability. Very clean semantics. Aimed at state machines, limited support for datapath operations.
1.2.3.4 Summary of VHDL Evaluation
VHDL is far from perfect and has lots of annoying characteristics
VHDL is a better language for education than Verilog because the static typechecking enforces
good software engineering practices
The richness of VHDL will be useful in creating concise high-level models and powerful test-
benches
1.3 Overview of Syntax
This section is just a brief overview of the syntax of VHDL, focusing on the constructs that are most commonly used. For more information, read a book on VHDL and use online resources. (Look for VHDL under the Documentation tab in the E&CE 427 web pages.)
1.3.1 Syntactic Categories
There are five major categories of syntactic constructs.
(There are many, many minor categories and subcategories of constructs.)
Library units (section 1.3.2)
Top-level constructs (packages, entities, architectures)
Concurrent statements (section 1.3.4)
Statements executed at the same time (in parallel)
Sequential statements (section 1.3.7)
Statements executed in series (one after the other)
Expressions
Arithmetic (section 1.10), Boolean, Vectors , etc
Declarations
Components , signals, variables, types, functions, ....
1.3.2 Library Units
Library units are the top-level syntactic constructs in VHDL. They are used to define and include libraries, declare and implement interfaces, define packages of declarations, and otherwise bind together VHDL code. (A small sketch of a package and use clause follows the list below.)
Package body
define the contents of a library
Packages
determine which parts of the library are externally visible
Use clause
use a library in an entity/architecture or another package
technically, use clauses are part of entities and packages, but they precede the entity/package keyword, so we list them as top-level constructs
Entity (section 1.3.3)
define interface to circuit
Architecture (section 1.3.3)
define internal signals and gates of circuit
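A minimal sketch of how these library units fit together; the package name and its contents are illustrative, not from the notes.

library ieee;
use ieee.std_logic_1164.all;

package my_pkg is
  -- the package determines what is externally visible
  constant data_width : natural := 8;
  function parity(v : std_logic_vector) return std_logic;
end my_pkg;

package body my_pkg is
  -- the package body defines the contents of the library unit
  function parity(v : std_logic_vector) return std_logic is
    variable p : std_logic := '0';
  begin
    for i in v'range loop
      p := p xor v(i);
    end loop;
    return p;
  end parity;
end my_pkg;

-- In another file, a use clause makes the package visible:
--   use work.my_pkg.all;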
1.3.3 Entities and Architecture
Each hardware module is described with an Entity/Architecture pair
[Figure 1.1: an entity (outer box) wraps an architecture (inner box).]
Figure 1.1: Entity and Architecture
Entity: interface
names, modes (in / out), types of
externally visible signals of circuit
Architecture: internals
structure and behaviour of module
library ieee;
use ieee.std_logic_1164.all;
entity and_or is
port (
a, b, c : in std_logic ;
z : out std_logic
);
end and_or;
Figure 1.2: Example of an entity
The syntax of VHDL is defined using a variation on Backus-Naur forms (BNF).
[ { use_clause } ]
entity ENTITYID is
[ port (
{ SIGNALID : (in | out) TYPEID [ := expr ] ; }
);
]
[ { declaration } ]
[ begin
{ concurrent_statement } ]
end [ entity ] ENTITYID ;
Figure 1.3: Simplied grammar of entity
architecture main of and_or is
signal x : std_logic;
begin
x <= a AND b;
z <= x OR (a AND c);
end main;
Figure 1.4: Example of architecture
[ { use_clause } ]
architecture ARCHID of ENTITYID is
[ { declaration } ]
begin
[ { concurrent_statement } ]
end [ architecture ] ARCHID ;
Figure 1.5: Simplied grammar of architecture
1.3.4 Concurrent Statements
Architectures contain concurrent statements
Concurrent statements execute in parallel (Figure 1.6)
Concurrent statements make VHDL fundamentally different from most software languages.
Hardware (gates) naturally executes in parallel; VHDL mimics the behaviour of real hardware.
At each infinitesimally small moment of time, each gate:
1. samples its inputs
2. computes the value of its output
3. drives the output
architecture main of bowser is
begin
x1 <= a AND b;
x2 <= NOT x1;
z <= NOT x2;
end main;
architecture main of bowser is
begin
z <= NOT x2;
x2 <= NOT x1;
x1 <= a AND b;
end main;
[Figure 1.6 circuit: a AND b drives x1, an inverter drives x2 from x1, and a second inverter drives z from x2.]
Figure 1.6: The order of concurrent statements doesn't matter
conditional assignment . . . <= . . . when . . . else . . .;
normal assignment (. . . <= . . .)
if-then-else style (uses when)
selected assignment
with . . . select
. . . <= . . . when . . . | . . . ,
. . . when . . . | . . . ,
. . .
. . . when . . . | . . . ;
case/switch style assignment
component instantiation . . . : . . . port map ( . . . => . . . , . . . );
use an existing circuit
section 1.3.5
for-generate
. . . : for . . . in . . . generate
. . .
end generate;
replicate some hardware
if-generate
. . . : if . . . generate
. . .
end generate;
conditionally create some hardware
process
process . . . begin
. . .
end process;
the body of a process is executed sequentially
Sections 1.3.6, 1.6
Figure 1.7: The most commonly used concurrent statements
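As a concrete illustration of the for-generate statement from Figure 1.7, here is a minimal sketch that replicates one bit-slice of hardware four times; the entity name and signal names are illustrative, not from the notes.

library ieee;
use ieee.std_logic_1164.all;

entity and4 is
  port (
    a, b : in  std_logic_vector(3 downto 0);
    z    : out std_logic_vector(3 downto 0)
  );
end and4;

architecture main of and4 is
begin
  -- replicate the single-bit assignment once per index
  g_bit : for i in 3 downto 0 generate
    z(i) <= a(i) AND b(i);
  end generate;
end main;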
1.3.5 Component Declaration and Instantiations
There are two different syntaxes for component declaration and instantiation. The VHDL-93 syntax is much more concise than the VHDL-87 syntax.
Not all tools support the VHDL-93 syntax. For E&CE 427, some of the tools that we use do not support the VHDL-93 syntax, so we are stuck with the VHDL-87 syntax. A sketch of the two styles is shown below.
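A minimal sketch of the two styles, instantiating the and_or entity from Figure 1.2; the enclosing entities top87 and top93 and their port names are illustrative, not from the notes.

library ieee;
use ieee.std_logic_1164.all;

entity top87 is
  port ( p, q, r : in std_logic; s : out std_logic );
end top87;

-- VHDL-87 style: declare the component, then instantiate it
architecture main of top87 is
  component and_or
    port ( a, b, c : in std_logic; z : out std_logic );
  end component;
begin
  u1 : and_or port map (a => p, b => q, c => r, z => s);
end main;

library ieee;
use ieee.std_logic_1164.all;

entity top93 is
  port ( p, q, r : in std_logic; s : out std_logic );
end top93;

-- VHDL-93 style: instantiate the entity directly, with no component declaration
architecture main of top93 is
begin
  u1 : entity work.and_or port map (a => p, b => q, c => r, z => s);
end main;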
1.3.6 Processes
Processes are used to describe complex and potentially unsynthesizable behaviour
A process is a concurrent statement (Section 1.3.4).
The body of a process contains sequential statements (Section 1.3.7)
Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)
process (a, b, c)
begin
  y <= a AND b;
  if (a = '1') then
    z1 <= b AND c;
    z2 <= NOT c;
  else
    z1 <= b OR c;
    z2 <= c;
  end if;
end process;

process
begin
  y <= a AND b;
  z <= '0';
  wait until rising_edge(clk);
  if (a = '1') then
    z <= '1';
    y <= '0';
    wait until rising_edge(clk);
  else
    y <= a OR b;
  end if;
end process;
Figure 1.8: Examples of processes
Processes must have either a sensitivity list or at least one wait statement on each execution path
through the process.
Processes cannot have both a sensitivity list and a wait statement.
Sensitivity List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The sensitivity list contains the signals that are read in the process.
A process is executed when a signal in its sensitivity list changes value.
An important coding guideline to ensure consistent synthesis and simulation results is to include
all signals that are read in the sensitivity list. If you forget some signals, you will either end up
with unpredictable hardware and simulation results (different results from different programs) or
undesirable hardware (latches where you expected purely combinational hardware). For more on
this topic, see sections 1.5.2 and 1.6.
There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list; other signals may be included, but are not needed.
[ PROCLAB : ] process ( sensitivity_list )
[ { declaration } ]
begin
{ sequential_statement }
end process [ PROCLAB ] ;
Figure 1.9: Simplied grammar of process
1.3.7 Sequential Statements
Used inside processes and functions.
wait wait until . . . ;
signal assignment . . . <= . . . ;
if-then-else if . . . then . . . elsif . . . end if;
case case . . . is
when . . . | . . . => . . . ;
when . . . => . . . ;
end case;
loop loop . . . end loop;
while loop while . . . loop . . . end loop;
for loop for . . . in . . . loop . . . end loop;
next next . . . ;
Figure 1.10: The most commonly used sequential statements
1.3.8 A Few More Miscellaneous VHDL Features
Some constructs that are useful and will be described in later chapters and sections (a small example of assert and report follows the list):
report : print a message on stderr while simulating
assert : assertions about behaviour of signals, very useful with report statements.
generics : parameters to an entity that are defined at elaboration time.
attributes : predefined functions for different datatypes. For example: high and low indices of a vector.
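A minimal sketch of report and assert in a testbench-style process; the entity name and signal are illustrative, not from the notes.

library ieee;
use ieee.std_logic_1164.all;

entity report_demo is
end report_demo;

architecture main of report_demo is
  signal a : std_logic := '0';
begin
  process
  begin
    a <= '1';
    wait for 10 ns;
    -- assert prints its report message only when the condition is false
    assert (a = '1')
      report "a was expected to be '1' at 10 ns"
      severity warning;
    report "finished the check";  -- unconditional message
    wait;                         -- suspend forever
  end process;
end main;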
1.4 Concurrent vs Sequential Statements
All concurrent assignments can be translated into sequential statements. But, not all sequential
statements can be translated into concurrent statements.
1.4.1 Concurrent Assignment vs Process
The two code fragments below have identical behaviour:
architecture main of tiny is
begin
b <= a;
end main;
architecture main of tiny is
begin
process (a) begin
b <= a;
end process;
end main;
1.4.2 Conditional Assignment vs If Statements
The two code fragments below have identical behaviour:
Concurrent Statements
t <= <val1> when <cond>
else <val2>;
Sequential Statements
if <cond> then
t <= <val1>;
else
t <= <val2>;
end if
1.4.3 Selected Assignment vs Case Statement
The two code fragments below have identical behaviour
Concurrent Statements
with <expr> select
t <= <val1> when <choices1>,
<val2> when <choices2>,
<val3> when <choices3>;
Sequential Statements
case <expr> is
when <choices1> =>
t <= <val1>;
when <choices2> =>
t <= <val2>;
when <choices3> =>
t <= <val3>;
end case;
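A concrete instance of the equivalence, assuming sel is a 2-bit std_logic_vector and a, b, t are std_logic signals declared in an enclosing architecture (the names are illustrative, not from the notes):

-- concurrent: selected assignment
with sel select
  t <= a   when "00",
       b   when "01" | "10",
       '0' when others;

-- sequential: case statement inside a process
process (sel, a, b)
begin
  case sel is
    when "00"        => t <= a;
    when "01" | "10" => t <= b;
    when others      => t <= '0';
  end case;
end process;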
1.4.4 Coding Style
Code that's easy to write with sequential statements, but difficult with concurrent:
Sequential Statements
case <expr> is
when <choice1> =>
if <cond> then
o <= <expr1>;
else
o <= <expr2>;
end if;
when <choice2> =>
. . .
end case;
Concurrent Statements
Overall structure:
with <expr> select
t <= ... when <choice1>,
... when <choice2>;
Failed attempt:
with <expr> select
t <= -- want to write:
-- <val1> when <cond>
-- else <val2>
-- but conditional assignment
-- is illegal here
when c1,
. . .
when c2;
Concurrent statement with correct behaviour, but messy:
t <= <expr1> when (expr = <choice1> AND <cond>)
else <expr2> when (expr = <choice1> AND NOT <cond>)
else . . .
;
1.5 Overview of Processes
Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.
Within a process, statements are executed almost sequentially
Among processes, execution is done in parallel
Remember: a process is a concurrent statement!
entity ENTITYID is
interface declarations
end ENTITYID;
architecture ARCHID of ENTITYID is
begin
concurrent statements =
process begin
sequential statements =
end process;
concurrent statements =
end ARCHID;
Figure 1.11: Sequential statements in a process
Key concepts in VHDL semantics for processes:
VHDL mimics hardware
Hardware (gates) execute in parallel
Processes execute in parallel with each other
All possible orders of executing processes must produce the same simulation results (wave-
forms)
If a signal is not assigned a value, then it holds its previous value
All orders of executing concurrent statements must
produce the same waveforms
It doesn't matter whether you are running on a single-threaded operating system, on a multithreaded operating system, on a massively parallel supercomputer, or on a special hardware emulator with one FPGA chip per VHDL process: all simulations must be the same.
These concepts are the motivation for the semantics of executing processes in VHDL (Section 1.6)
and lead to the phenomenon of latch-inference (Section 1.5.2).
architecture
  procA: process
    stmtA1;
    stmtA2;
    stmtA3;
  end process;
  procB: process
    stmtB1;
    stmtB2;
  end process;
[Figure 1.12 shows three execution sequences for statements A1, A2, A3 (procA) and B1, B2 (procB): single threaded with procA before procB, single threaded with procB before procA, and multithreaded with procA and procB in parallel.]
Figure 1.12: Different process execution sequences
Figure 1.13: All execution orders must have same behaviour
Sections 1.5.1-1.5.3 discuss the hardware generated by processes.
Sections 1.6-1.6.5 discuss the behaviour and execution of processes.
1.5.1 Combinational Process vs Clocked Process
Each well-written synthesizable process is either combinational or clocked. Some synthesizable processes that do not conform to our coding guidelines are both combinational and clocked. For example, in a flip-flop with an asynchronous reset, the output is a combinational function of the reset signal and a clocked function of the data input signal. We will deal only with processes that follow our coding conventions, and so we will continue to say that each process is either combinational xor clocked.
Combinational process: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Executing the process takes part of one clock cycle
Target signals are outputs of combinational circuitry
A combinational process must have a sensitivity list
A combinational process must not have any wait statements
A combinational process must not have any rising_edges, or falling_edges
The hardware for a combinational process is just combinational circuitry
Clocked process: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Executing the process takes one (or more) clock cycles
Target signals are outputs of flops
Process contains one or more wait or if rising_edge statements
Hardware contains combinational circuitry and flip-flops
Note: Clocked processes are sometimes called sequential processes, but this can be easily confused with sequential statements, so in E&CE 427 we'll refer to synthesizable processes as either combinational or clocked.
Example Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Combinational Process
process (a, b, c)
begin
  p1 <= a;
  if (b = c) then
    p2 <= b;
  else
    p2 <= a;
  end if;
end process;
Clocked Processes
process
begin
wait until rising_edge(clk);
b <= a;
end process;
process (clk)
begin
if rising_edge(clk) then
b <= a;
end if;
end process;
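A further sketch, with illustrative signal names: a clocked process with a synchronous reset. Because the reset is sampled only inside the rising_edge test, the process is still purely clocked.

process (clk)
begin
  if rising_edge(clk) then
    if reset = '1' then
      b <= '0';      -- synchronous reset value
    else
      b <= a;
    end if;
  end if;
end process;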
1.5.2 Latch Inference
The semantics of VHDL require that if a signal is assigned a value on some passes through a
process and not on other passes, then on a pass through the process when the signal is not assigned
a value, it must maintain its value from the previous pass.
process (a, b, c)
begin
  if (a = '1') then
    z1 <= b;
    z2 <= b;
  else
    z1 <= c;
  end if;
end process;
[Figure 1.14 circuit: combinational logic drives z1; an inferred latch, enabled by a, drives z2.]
Figure 1.14: Example of latch inference
When a signal's value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value.
If you want a latch or a flip-flop for the signal, then latch inference is good.
If you want combinational circuitry, then latch inference is bad.
Loop, Latch, Flop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figures: three circuits, each with inputs a and b and output z: a combinational loop, a latch (with enable EN), and a flip-flop (with D and Q ports).]
Question: Write VHDL code for each of the above circuits
Causes of Latch Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Usually, latch inference refers to the unintentional creation of latches.
The most common cause of unintended latch inference is missing assignments to signals in if-then-else and case statements.
Latch inference happens during elaboration. When using the Synopsys tools, look for:
Inferred memory devices
in the output or log files.
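If combinational hardware was intended in Figure 1.14, a common fix is sketched below: give every target signal a value on every path through the process, for example with default assignments at the top.

process (a, b, c)
begin
  -- default assignments: no path leaves a signal unassigned, so no latch is inferred
  z1 <= c;
  z2 <= c;
  if (a = '1') then
    z1 <= b;
    z2 <= b;
  end if;
end process;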
1.5.3 Combinational vs Flopped Signals
Signals assigned to in combinational processes are combinational.
Signals assigned to in clocked processes are outputs of flip-flops.
1.6 Details of Process Execution
1.6.1 Intuition Behind Delta-Cycle Simulation
Zero-delay simulation might appear to be simpler than simulation with delays through gates (timing simulation), but in reality, zero-delay simulation algorithms are more complicated than algorithms for timing simulation. The reason is that in zero-delay simulation, a sequence of dependent events must appear to happen instantaneously (in zero time). In particular, the effect of an event must propagate instantaneously through the combinational circuitry.
Two fundamental rules for zero-delay simulation:
1. events appear to propagate through combinational circuitry instantaneously.
2. all of the gates appear to operate in parallel
To make it appear that events propagate instantaneously, VHDL introduces an artificial unit of time, the delta cycle, to represent an infinitesimally small amount of time. In each delta cycle, every gate in the circuit will sample its inputs, compute its result, and drive its output signal with the result.
Because software executes in serial, a simulator cannot run/simulate multiple gates in parallel. Instead, the simulator must simulate the gates one at a time, but make the waveforms appear as if all of the gates were simulated in parallel. In each delta cycle, the simulator will simulate any gate whose input changed in the previous delta cycle. To preserve the illusion that the gates ran in parallel, the effect of simulating a gate remains invisible until the end of the delta cycle.
1.6.2 Definitions and Algorithm
1.6.2.1 Temporal Granularities of Simulation
This begins our discussion of the behaviour and execution of processes.
There are several different granularities of time to analyze VHDL behaviour. In this course, we
will discuss three major granularities: clock cycles, timing simulation, and delta cycles.
clock-cycle
smallest unit of time is a clock cycle
combinational logic has zero delay
flip-flops have a delay of one clock cycle
used for simulation early in the design cycle
fastest simulation run times
timing simulation
smallest unit of time is a nano, pico, or femto second
combinational logic and wires have delay as computed by timing analysis tools
flip-flops have setup, hold, and clock-to-Q timing parameters
used for simulation when fine-tuning the design and confirming that timing constraints are satisfied
slow simulation times for large circuits
delta cycles
units of time are artifacts of VHDL semantics and simulation software
simulation cycles, delta cycles, and simulation steps are infinitesimally small amounts of time
VHDL semantics are defined in terms of these concepts
In assignments and exams, you will need to be able to simulate VHDL code at each of the three
different levels of temporal granularity. In the laboratories and project, you will use simulation
programs for both clock-cycle simulation and timing simulation. We don't have access to a program that will produce delta-cycle waveforms, but if anyone is looking for a challenging co-op job or fourth-year design project....
For the remainder of section 1.6, we'll look at only the delta-cycle view of the world.
1.6.2.2 Process Modes
An architecture contains a set of processes. Each process is in one of the following modes: active,
suspended, or postponed.
Note: postponed. This use of the word "postponed" differs from that in the VHDL Standard. We won't be using postponed processes as defined in the Standard.
Note: postponed. "Postponed" in VHDL terminology is a synonym for some operating-systems usage of "ready" to describe a process that is ready to execute.
[Figure 1.15 state diagram: a suspended process resumes and becomes postponed; a postponed process is activated and becomes active; an active process suspends and becomes suspended.]
Suspended
Nothing to currently execute
A process stays suspended until the event
that it is waiting for occurs: either a
change in a signal on its sensitivity list
or the condition in a wait statement
Postponed
Wants to execute, but not currently active
A process stays postponed until the sim-
ulator chooses it from the pool of post-
poned processes
Active
Currently executing
A process stays active until it hits a wait
statement or sensitivity list, at which
point it suspends
Figure 1.15: Process modes
1.6.2.3 Simulation Algorithm
The algorithm presented here is a simplification of the actual algorithm in Section 12.6 of the VHDL Standard. The most significant simplification is that this algorithm does not support delayed assignments. To support delayed assignments, each signal's provisional value would be generalized to an event wheel, which is a list containing the times and values for multiple provisional assignments in the future.
A somewhat ironic note: only six of the two hundred pages in the VHDL Standard are devoted to the semantics of executing processes.
The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simulations start at step 1 with all processes postponed and all signals with a default value (e.g., 'U' for std_logic).
1. While there are postponed processes:
(a) Pick one or more postponed processes to execute (become active).
(b) As a process executes, assignments to signals are provisional new values do
not become visible until step 3
(c) A process executes until it hits its sensitivity list or a wait statement, at which point
it suspends. At a wait statement, the process will suspend even if the condition is
true during the current simulation cycle.
(d) Processes that become suspended stay suspended until there are no more post-
poned or active processes.
2. Each process looks at signals that changed value (provisional value differs from visible value) and at the simulation time. If a signal in a process's sensitivity list changed value, or if the wait condition on which a process is suspended became true, then the process resumes (becomes postponed).
3. Each signal that changed value is updated with its provisional value (the provisional
value becomes visible).
4. If there are no postponed processes, then increment simulation time to the next sched-
uled event.
Note: Parallel execution In n-threaded execution, at most n processes are
active at a time
1.6.2.4 Delta-Cycle Definitions
Definition simulation step: Executing one sequential assignment or process mode change.
Definition simulation cycle: The operations that occur in one iteration of the simulation algorithm.
Definition delta cycle: A simulation cycle that does not advance simulation time. Equivalently: a simulation cycle with zero-delay assignments where the assignment causes a process to resume.
Definition simulation round: A sequence of simulation cycles that all have the same simulation time. Equivalently: a contiguous sequence of zero or more delta cycles followed by a simulation cycle that increments time (i.e., the simulation cycle is not a delta cycle).
Note: Official and unofficial terminology. "Simulation cycle" and "delta cycle" are official definitions in the VHDL Standard. "Simulation step" and "simulation round" are not standard definitions. They are used in E&CE 427 because we need words to associate with the concepts that they describe.
1.6.3 Example 1: Process Execution (Bamboozle)
This example (Bamboozle) and the next example (Flummox, section 1.6.4) are very similar. The
VHDL code for the circuit is slightly different, but the hardware that is generated is the same. The
stimulus for signals a and b also differs.
entity bamboozle is
begin
end bamboozle;

architecture main of bamboozle is
  signal a, b, c, d, e : std_logic;
begin
  procA : process (a, b) begin
    c <= a AND b;
  end process;

  procB : process (b, c, d)
  begin
    d <= NOT c;
    e <= b AND d;
  end process;

  procC : process
  begin
    a <= '0';
    b <= '1';
    wait for 10 ns;
    a <= '1';
    wait for 2 ns;
    b <= '0';
    wait for 3 ns;
    a <= '0';
    wait for 20 ns;
  end process;
end main;
Figure 1.16: Example bamboozle circuit for process execution
The slides step through a delta-cycle simulation of bamboozle one step at a time. The annotated circuit, process, and waveform diagrams are shown in the slides, not in the notes; only the sequence of steps is listed here.

First simulation cycle (0 ns):
Initial conditions; Step 1(a): Activate procA; Step 1(c): Suspend procA; Step 1(a): Activate procC; Step 1(b): Provisional assignment to a; Step 1(b): Provisional assignment to b; Step 1(a): Activate procB; Step 1(b): Provisional assignment to d; Step 1(b): Provisional assignment to e; Step 1(c): Suspend procB; all processes suspended; Step 3: Update signal values; Step 4: Simulation time remains at 0 ns (delta cycle).

Second simulation cycle (still 0 ns):
Step 1(a): Activate procA; Step 1(b): Provisional assignment to c; Step 1(c): Suspend procA; Step 1(a): Activate procB; Step 1(b): Provisional assignment to d; Step 1(b): Provisional assignment to e; Step 1(c): Suspend procB; all processes suspended; Step 3: Update signal values; Step 4: Simulation time remains at 0 ns (delta cycle).

Third simulation cycle (still 0 ns):
Step 1(a): Activate procB; Step 1(b): Provisional assignment to d; Step 1(b): Provisional assignment to e; Step 1(c): Suspend procB; all processes suspended; Step 3: Update signal values.

Fourth simulation cycle (still 0 ns):
Step 1(a): Activate procB; Step 1(b): Provisional assignment to d; Step 1(b): Provisional assignment to e; Step 1(c): Suspend procB; Step 3: Update signal values.

Fifth simulation cycle:
Step 1: No postponed processes; simulation time advances to 10 ns.

Sixth simulation cycle (10 ns):
Step 1(a): Activate procC; Step 1(b): Provisional assignment to a; Step 1(c): Suspend procC; Step 2: Check sensitivity list and resume processes; Step 3: Update signal values.
1.6.4 Example 2: Process Execution (Flummox)
This example is a variation of the Bamboozle example from section 1.6.3.
entity flummox is
begin
end flummox;

architecture main of flummox is
  signal a, b, c, d, e : std_logic;
begin
  proc1 : process (a, b, c) begin
    c <= a AND b;
    d <= NOT c;
  end process;

  proc2 : process (b, d)
  begin
    e <= b AND d;
  end process;

  proc3 : process
  begin
    a <= '1';
    b <= '0';
    wait for 3 ns;
    b <= '1';
    wait for 99 ns;
  end process;
end main;
Figure 1.17: Example flummox circuit for process execution
[Delta-cycle simulation trace of flummox: process modes for proc1, proc2, proc3 and waveforms for a, b, c, d, e across the simulation rounds at 0 ns, 3 ns, and 102 ns.]
1.6.4 Example 2: Process Execution (Flummox) 39
To get a more natural view of the behaviour of the signals, we draw just the waveforms and use a
timescale of nanoseconds plus delta cycles:
[Waveforms for a, b, c, d, e drawn against a timescale of nanoseconds plus delta cycles (+1, +2, +3) at 0 ns, 3 ns, and 102 ns.]
Finally, we draw the behaviour of the signals using the standard time scale of nanoseconds. Notice that the delta cycles within a simulation round all collapse to the left, so the signals change value exactly at the nanosecond boundaries. Also, the glitch on e disappears.
Answer:
[Waveforms for a, b, c, d, e drawn against the standard time scale from 0 ns to 102 ns.]
Note and Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Note: If a signal is updated with the same value it had in the previous sim-
ulation cycle, then it does not change, and therefore does not trigger processes
to resume.
Question: What are the different granularities of time that occur when doing
delta-cycle simulation?
Answer:
simulation step, delta cycle, simulation cycle, simulation round
Question: What is the order of granularity, from finest to coarsest, amongst the
different granularities related to delta-cycle simulation?
Answer:
Same order as listed just above. Note: delta cycles have a finer granularity
than simulation cycles, because delta cycles do not advance time, while
simulation cycles that are not delta cycles do advance time.
1.6.5 Example: Need for Provisional Assignments
This is an example of processes where updating signals during a simulation cycle leads to different
results for different process execution orderings.
architecture main of swindle is
begin
p_c: process (a, b) begin
c <= a AND b;
end process;
p_d: process (a, c) begin
d <= a XOR c;
end process;
end main;
Figure 1.18: Circuit to illustrate need for provisional assignments
1. Start with all signals at 0.
2. Simultaneously change to a = 1 and b = 1.
If assignments are not visible within same simulation cycle (correct: i.e. provisional
assignments are used)
[Waveform diagram: a, b, c, d all start at 0; process modes of p_c and p_d are shown.]
If p_c is scheduled before p_d, then d will
have a 1 pulse.
[Waveform diagram: the same simulation with the opposite scheduling order.]
If p_d is scheduled before p_c, then d will
have a 1 pulse.
If assignments are visible within same simulation cycle (incorrect)
[Waveform diagram: a, b, c, d all start at 0; process modes of p_c and p_d are shown.]
If p_c is scheduled before p_d, then d will
stay constant 0.
[Waveform diagram: the same simulation with the opposite scheduling order.]
If p_d is scheduled before p_c, then d will
have a 1 pulse.
With provisional assignments, both orders of scheduling processes result in the same behaviour
on all signals. Without provisional assignments, different scheduling orders result in different
behaviour.
1.6.6 Delta-Cycle Simulations of Flip-Flops
This example illustrates the delta-cycle simulation of a flip-flop. Notice how the delta-cycle simu-
lation captures the expected behaviour of the flip-flop: the signal q changes at the same time (10 ns)
as the rising edge on the clock.
p_a : process begin
a <= '0';
wait for 15 ns;
a <= '1';
wait for 20 ns;
end process;
p_clk : process begin
clk <= '0';
wait for 10 ns;
clk <= '1';
wait for 10 ns;
end process;
flop : process ( clk ) begin
if rising_edge( clk ) then
q <= a;
end if;
end process;
[Waveform diagram: delta-cycle simulation of a, clk, and q and the modes of p_a, p_clk, and flop across simulation rounds, simulation cycles, and delta cycles, at 0 ns, 10 ns, 15 ns, 20 ns, 30 ns, and 35 ns.]
Redraw with Normal Time Scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
To clarify the behaviour, we redraw the same simulation using a normal time scale.
[Waveform diagram: a, clk, and q redrawn on the normal time scale from 0 ns to 35 ns.]
Back-to-Back Flops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
In the previous simulation, the input to the flip-flop (a) changed several nanoseconds before the
rising-edge on the clock. In zero delay simulation, the output of a flip-flop changes exactly on
the rising edge of the clock. This means that the input to the next flip-flop will change at exactly
the same time as a rising edge. This example illustrates how delta-cycle simulation handles the
situation correctly.
p_a : process begin
a <= '0';
wait for 15 ns;
a <= '1';
wait for 20 ns;
end process;
p_clk : process begin
clk <= '0';
wait for 10 ns;
clk <= '1';
wait for 10 ns;
end process;
flops : process ( clk ) begin
if rising_edge( clk ) then
q1 <= a;
q2 <= q1;
end if;
end process;
[Waveform diagram: delta-cycle simulation of a, clk, q1, and q2 and the modes of p_a, p_clk, and flops, from 10 ns to 35 ns.]
Redraw with Normal Time Scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
To clarify the behaviour, we redraw the same simulation using a normal time scale.
[Waveform diagram: a, clk, q1, and q2 redrawn on the normal time scale from 0 ns to 35 ns.]
Testbenches and Clock Phases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
env : process begin
a <= '1';
clk <= '0';
wait for 10 ns;
a <= '0';
clk <= '1';
wait for 10 ns;
end process;
flop : process ( clk ) begin
if rising_edge( clk ) then
q1 <= a;
end if;
end process;
[Waveform diagram: delta-cycle simulation of a, clk, and q1 and the modes of the env and flop processes, from 0 ns to 20 ns.]
Redraw with Normal Time Scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Waveform diagram: a, clk, and q1 on the normal time scale from 0 ns to 20 ns.]
Note: Testbench signals. For consistent results across different simulators,
simulation scripts vs test benches, and timing simulation vs zero-delay simula-
tion, do not change signals in your testbench or script at the same time as the
clock changes.
a is output of a clocked or combinational process:
[Waveform diagram: a, clk, q1 from 0 ns to 60 ns.]
a is output of a timed process (testbench or environment), POOR DESIGN:
[Waveform diagram: a, clk, q1 from 0 ns to 60 ns.]
a is output of a timed process (testbench or environment), GOOD DESIGN:
[Waveform diagram: a, clk, q1 from 0 ns to 60 ns.]
1.7 Register-Transfer Level Simulation
1.7.1 Technique for Register-Transfer Level Simulation
The register-transfer level is a coarser level of temporal abstraction than the delta-cycle level.
In delta-cycle simulation, many delta-cycles can elapse without an increment in real time (e.g.
nanoseconds). In register-transfer-level simulation, all of the events that take place in the same
moment of real time take place at the same moment in the simulation. In other words, all of the events
that take place at the same time are drawn in the same column of the waveform diagram.
Register-transfer-level simulation can be done for legal VHDL code, either synthesizable or unsyn-
thesizable, so long as the code does not contain combinational loops. For any piece of VHDL code
without combinational loops, the register-transfer-level simulation and the delta-cycle simulation
will have the same value for each signal at the end of each simulation round.
A combinational loop is a circuit that contains a cyclic path through the circuit that includes only
combinational gates. Combinational loops can cause signals to oscillate, which in delta-cycle
simulation with zero-delay assignments, corresponds to an infinite sequence of delta cycles.
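For example, a single concurrent assignment such as the one below (the signal names are only illustrative) forms a combinational loop, because x feeds back into its own combinational logic and, with zero-delay assignments, can oscillate forever:

x <= a NAND x;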
RTL Simulation Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1. Pre-processing
(a) Separate processes into combinational and non-combinational (clocked and timed)
(b) Decompose each combinational process into separate processes with one target signal
per process
(c) Sort processes into topological order based on dependencies
2. For each clock cycle or unit of time:
(a) Run non-combinational processes in any order. Non-combinational assignments read
from earlier clock cycle / time step.
(b) Run combinational processes in topological order. Combinational assignments read
from current clock cycle / time step.
1.7.2 Examples of RTL Simulation
Combinational Process Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
process (a, b, c) begin
if a = '1' then
d <= b;
e <= c;
else
d <= not b;
e <= b and c;
end if;
end process;
Original code
process (a, b, c) begin
if a = '1' then
d <= b;
else
d <= not b;
end if;
end process;
process (a, b, c) begin
if a = '1' then
e <= c;
else
e <= b and c;
end if;
end process;
After decomposition into separate processes
for d and e
RTL Simulation Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Revisit an earlier example, but do register-transfer-level simulation, rather than delta-cycle simu-
lation.
1. Original code:
proc1: process (a, b, c) begin
c <= a AND b;
d <= NOT c;
end process;
proc2: process (b, d) begin
e <= b AND d;
end process;
proc3: process begin
a <= '1';
b <= '0';
wait for 3 ns;
b <= '1';
wait for 99 ns;
end process;
2. Decompose combinational processes into single-target processes:
proc1c: process (a, b) begin
c <= a AND b;
end process;
proc1d: process (c) begin
d <= NOT c;
end process;
proc2: process (b, d) begin
e <= b AND d;
end process;
proc3: process begin
a <= '1';
b <= '0';
wait for 3 ns;
b <= '1';
wait for 99 ns;
end process;
3. Combinational processes are already in topological order, because each signal is assigned a
value before it is read.
4. Run timed process (proc3) until suspend at wait for 3 ns;.
The signal a gets 1 from 0 to 3 ns.
The signal b gets 0 from 0 to 3 ns.
5. Run proc1c
The signal c gets a AND b (1 AND 0 = 0) from 0 to 3 ns.
6. Run proc1d
The signal d gets NOT c (NOT 0 = 1) from 0 to 3 ns.
7. Run proc2
The signal e gets b AND d (0 AND 1 = 0) from 0 to 3 ns.
8. Run the timed process until suspend at wait for 99 ns;, which takes us from 3ns to
102ns.
9. Run combinational processes in topological order to calculate values on c, d, e from 3ns to
102ns.
Question: Draw the RTL waveforms that correspond to the delta-cycle waveform
below.
[Waveform diagram: delta-cycle simulation of a, b, c, d, and e and the modes of proc1, proc2, and proc3, with delta cycles marked 0ns+1, 0ns+2, 3ns+1, 3ns+2, 3ns+3, and simulation rounds at 0 ns, 3 ns, and 102 ns.]
Answer:
[Waveform diagram: register-transfer-level simulation of a, b, c, d, and e, with one column per unit of time at 0 ns, 1 ns, 2 ns, 3 ns, and 102 ns.]
Example: Communicating State Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Note: It is easier to do a simulation by hand if you start your clock at 0
and use the first clock phase in the waveform diagram for the first values that
your VHDL code assigns to signals.
Simulate If-Then-Else, Wait Until . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
huey: process
begin
clk <= '1';
wait for 10 ns;
clk <= '0';
wait for 10 ns;
end process;
dewey: process
begin
a <= to_unsigned(0,4);
wait until re(clk);
while (a < 4) loop
a <= a + 1;
wait until re(clk);
end loop;
end process;
louie: process
begin
wait until re(clk);
d <= '1';
if (a < 2) then
d <= '0';
wait until re(clk);
end if;
end process;
[Blank waveform for clk, a, and d, to be filled in.]
A Related Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Small changes to the code can cause significant changes to the behaviour.
riri: process
begin
clk <= '1';
wait for 10 ns;
clk <= '0';
wait for 10 ns;
end process;
fifi: process
begin
a <= to_unsigned(0,4);
wait until re(clk);
while (a < 4) loop
a <= a + 1;
wait until re(clk);
end loop;
end process;
loulou: process
begin
wait until re(clk);
d <= '1';
if (a < 2) then
d <= '0';
wait until re(clk);
end if;
end process;
[Blank waveform for clk, a, and d, with a time axis from 0 to 120 ns.]
1.8 VHDL and Hardware Building Blocks
This section outlines the building blocks for register transfer level design and how to write VHDL
code for the building blocks.
1.8.1 Basic Building Blocks
[Schematic symbols: a 2:1 mux (also: n-to-1 muxes), a memory array with WE, A, DI, and DO ports (and a dual-port variant with A0/DI0/DO0 and A1/DO1), and a flip-flop with D, Q, CE, S, and R pins.]

Hardware                             VHDL
AND, OR, NAND, NOR, XOR, XNOR        and, or, nand, nor, xor, xnor
multiplexer                          if-then-else, case statement,
                                     selected assignment, conditional assignment
adder, subtracter, negater           +, -, -
shifter, rotater                     sll, srl, sla, sra, rol, ror
flip-flop                            wait until, if-then-else, rising_edge
memory array, register file, queue   2-d array or library component

Figure 1.19: RTL Building Blocks
1.8.2 Deprecated Building Blocks for RTL
Some of the common gates you have encountered in previous courses should be avoided when
synthesizing register-transfer-level hardware, particularly if FPGAs are the implementation tech-
nology.
1.8.2.1 An Aside on Flip-Flops and Latches
flip-flop  Edge sensitive: output only changes on rising (or falling) edge of clock
latch      Level sensitive: output changes whenever clock is high (or low)
A common implementation of a flip-flop is a pair of latches (master/slave flop).
Latches are sometimes called transparent latches, because they are transparent (input directly
connected to output) when the clock is high.
The clock to a latch is sometimes called the enable line.
There is more information in the course notes on timing analysis for storage devices (Section 5.2).
1.8.2.2 Deprecated Hardware
Latches
Use flops, not latches
Latch-based designs are susceptible to timing problems
The transparent phase of a latch can let a signal leak through a latch, causing the
signal to affect the output one clock cycle too early
It's possible for a latch-based circuit to simulate correctly, but not work in real hardware,
because the timing delays on the real hardware don't match those predicted in synthesis
T, JK, SR, etc flip-flops
Limit yourself to D-type flip-flops
Some FPGA and ASIC cell libraries include only D-type flip-flops. Others, such as Al-
tera's APEX FPGAs, can be configured as D, T, JK, or SR flip-flops.
Tri-State Buffers
Use multiplexers, not tri-state buffers
Tri-state designs are susceptible to stability and signal integrity problems
Getting tri-state designs to simulate correctly is difficult; some library components don't
support tri-state signals
Tri-state designs rely on the code never letting two signals drive the bus at the same time
It can be difficult to check that bus arbitration will always work correctly
Manufacturing and environmental variability can make real hardware not work correctly
even if it simulates correctly
Typical industrial practice is to avoid use of tri-state signals on a chip, but allow tri-state
signals at the board level
Note: Unfortunately and surprisingly, PalmChip has been awarded a
US patent for using uni-directional busses (i.e. multiplexers) for system-
on-chip designs. The patent was filed in 2000, so all fourth-year design
projects since 2000 that use muxes on FPGAs will need to pay royalties to
PalmChip.
1.8.3 Hardware and Code for Flops
1.8.3.1 Flops with Waits and Ifs
The two code fragments below synthesize to identical hardware (flops).
If
process (clk)
begin
if rising_edge(clk) then
q <= d;
end if;
end process;
Wait
process
begin
wait until rising_edge(clk);
q <= d;
end process;
1.8.3.2 Flops with Synchronous Reset
The two code fragments below synthesize to identical hardware (flops with synchronous reset).
Notice that the synchronous reset is really nothing more than an AND gate on the input.
If
process (clk)
begin
if rising_edge(clk) then
if (reset = '1') then
q <= '0';
else
q <= d;
end if;
end if;
end process;
Wait
process
begin
wait until rising_edge(clk);
if (reset = '1') then
q <= '0';
else
q <= d;
end if;
end process;
1.8.3.3 Flops with Chip-Enable
The two code fragments below synthesize to identical hardware (flops with chip-enable lines).
If
process (clk)
begin
if rising_edge(clk) then
if (ce = '1') then
q <= d;
end if;
end if;
end process;
Wait
process
begin
wait until rising_edge(clk);
if (ce = '1') then
q <= d;
end if;
end process;
1.8.3.4 Flop with Chip-Enable and Mux on Input
The two code fragments below synthesize to identical hardware (flops with chip-enable lines and
muxes on inputs).
If
process (clk)
begin
if rising_edge(clk) then
if (ce = '1') then
if (sel = '1') then
q <= d1;
else
q <= d0;
end if;
end if;
end if;
end process;
Wait
process
begin
wait until rising_edge(clk);
if (ce = '1') then
if (sel = '1') then
q <= d1;
else
q <= d0;
end if;
end if;
end process;
1.8.3.5 Flops with Chip-Enable, Muxes, and Reset
The two code fragments below synthesize to identical hardware (flops with chip-enable lines,
muxes on inputs, and synchronous reset). Notice that the synchronous reset is really nothing
more than a mux, or an AND gate, on the input.
Note: The specific combination and order of tests is important to guarantee
that the circuit synthesizes to a flop with a chip enable, as opposed to a level-
sensitive latch testing the chip enable and/or reset followed by a flop.
Note: The chip-enable pin on the flop is connected to both ce and reset.
If the chip-enable pin was not connected to reset, then the flop would ignore
reset unless chip-enable was asserted.
If
process (clk)
begin
if rising_edge(clk) then
if (ce = '1' or reset = '1') then
if (reset = '1') then
q <= '0';
elsif (sel = '1') then
q <= d1;
else
q <= d0;
end if;
end if;
end if;
end process;
Wait
process
begin
wait until rising_edge(clk);
if (ce = '1' or reset = '1') then
if (reset = '1') then
q <= '0';
elsif (sel = '1') then
q <= d1;
else
q <= d0;
end if;
end if;
end process;
1.8.4 An Example Sequential Circuit
There are many ways to write VHDL code that synthesizes to the schematic in Figure 1.20. The
major choices are:
1. Categories of signals
(a) All signals are outputs of flip-flops or inputs (no combinational signals)
(b) Signals include both flopped and combinational
2. Number of flopped signals per process
(a) All flopped signals in a single process
(b) Some processes with multiple flopped signals
(c) Each flopped signal in its own process
3. Style of flop code
(a) Flops use if statements
(b) Flops use wait statements
Some examples of these different options are shown in Figures 1.21-1.24.
[Schematic for and_not_reg: flip-flops with S/R pins, inputs sel, reset, and clk, internal signal a, and output c.]
entity and_not_reg is
port (
reset,
clk,
sel : in std_logic;
c : out std_logic
);
end;
Schematic and entity for examples of different code organizations in Figures 1.21-1.24
Figure 1.20: Schematic and entity for and_not_reg
One Process, Flops, Wait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
architecture one_proc of and_not_reg is
signal a : std_logic;
begin
process begin
wait until rising_edge(clk);
if (reset = '1') then
a <= '0';
elsif (sel = '1') then
a <= NOT a;
else
a <= a;
end if;
c <= NOT a;
end process;
end one_proc;
Figure 1.21: Implementation of Figure 1.20: all signals are flops, all flops in one process, flops use waits
Two Processes, Flops, Wait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
architecture two_proc_wait of and_not_reg is
signal a : std_logic;
begin
process begin
wait until rising_edge(clk);
if (reset = '1') then
a <= '0';
elsif (sel = '1') then
a <= NOT a;
else
a <= a;
end if;
end process;
process begin
wait until rising_edge(clk);
c <= NOT a;
end process;
end two_proc_wait;
Figure 1.22: Implementation of Figure 1.20: all signals are flops, one flop per process, flops use waits
Two Processes with If-Then-Else . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
architecture two_proc_if of and_not_reg is
signal a : std_logic;
begin
process (clk)
begin
if rising_edge(clk) then
if (reset = '1') then
a <= '0';
elsif (sel = '1') then
a <= NOT a;
else
a <= a;
end if;
end if;
end process;
process (clk)
begin
if rising_edge(clk) then
c <= NOT a;
end if;
end process;
end two_proc_if;
Figure 1.23: Implementation of Figure 1.20: all signals are flops, one flop per process, flops use if-then-else
Concurrent Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
architecture comb of and_not_reg is
signal a, b, d : std_logic;
begin
process (clk) begin
if rising_edge(clk) then
if (reset = '1') then
a <= '0';
else
a <= d;
end if;
end if;
end process;
process (clk) begin
if rising_edge(clk) then
c <= NOT a;
end if;
end process;
d <= b when (sel = '1') else a;
b <= NOT a;
end comb;
Figure 1.24: Implementation of Figure 1.20: flopped and combinational signals, one flop per process, flops use if-then-else
1.9 Arrays and Vectors
VHDL supports multidimensional arrays over elements of any type. The most common array is an
array of std_logic signals, which has a predefined type: std_logic_vector. Throughout
the rest of this section, we will discuss only std_logic_vector, but the rules apply to arrays
of any type.
VHDL supports reading from and assigning to slices (aka discrete subranges) of vectors. The
rules for working with slices of vectors are listed below and illustrated in Figure 1.25.
1. The ranges on both sides of the assignment must be the same.
2. The direction (downto or to) of each slice must match the direction of the signal declara-
tion.
3. The direction of the target and expression may be different.
Declarations
----------------------------------------------------
a, b : in std_logic_vector(15 downto 0);
c, d, e : out std_logic_vector(15 downto 0);
----------------------------------------------------
ax, bx : in std_logic_vector(0 to 15);
cx, dx, ex : out std_logic_vector(0 to 15);
----------------------------------------------------
m, n : in unsigned(15 downto 0);
p, q, r : out unsigned(15 downto 0);
----------------------------------------------------
w, x : in signed(15 downto 0);
y, z : out signed(15 downto 0)
----------------------------------------------------
Legal code
c(3 downto 0) <= a(15 downto 12);
cx(0 to 3) <= a(15 downto 12);
(e(3), e(4)) <= bx(12 to 13);
(e(5), e(6)) <= b(13 downto 12);
Illegal code
d(0 to 3) <= a(15 to 12); -- slice dirs must be same as decl
e(3) & e(2) <= b(12 to 13); -- syntax error on &
p(3 downto 0) <= (m + n)( 3 downto 0); -- syntax error on )(
z(3 downto 0) <= m(15 downto 12); -- types on lhs and rhs must match
Figure 1.25: Illustration of Rules for Slices of Vectors
1.10 Arithmetic
VHDL includes all of the common arithmetic and logical operators.
Use the VHDL arithmetic operators and let the synthesis tool choose the better implementation for
you. It is almost impossible for a hand-coded implementation to beat vendor-supplied arithmetic
libraries.
To use the operators, you must choose which arithmetic package you wish to use (section 1.10.1).
The arithmetic operators are overloaded, and you can usually use any mixture of constants and sig-
nals of different types that you need (Section 1.10.3). However, you might need to convert a signal
from one type (e.g. std_logic_vector) to another type (e.g. integer) (Section 1.10.7).
1.10.1 Arithmetic Packages
Rushton Ch-7 covers arithmetic packages. Rushton Appendix A.5 has the code listing for the
numeric_std package.
To do arithmetic with signals, use the numeric_std package. This package defines the types
signed and unsigned, which are std_logic vectors on which you can do signed or un-
signed arithmetic.
numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.
Use only one arithmetic package, otherwise the different definitions will clash and you can get
strange error messages.
1.10.2 Shift and Rotate Operations
Shift and rotate operations are described with three character acronyms:
(shift/rotate) (left/right) (arithmetic/logical)
The shift right arithmetic (sra) operation preserves the sign of the operand, by copying the most
significant bit into lower bit positions.
The shift left arithmetic (sla) does the analogous operation, except that the least significant bit is
copied.
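As a small sketch (the signal names are only illustrative), the predefined shift operators can be applied to a bit_vector operand; for the numeric_std signed and unsigned types, the shift_left and shift_right functions provide the corresponding behaviour:

signal v, y1, y2 : bit_vector(7 downto 0);
...
v  <= "10010110";
y1 <= v srl 2;   -- logical shift right:    "00100101" (zero fill)
y2 <= v sra 2;   -- arithmetic shift right: "11100101" (leftmost bit copied)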
1.10.3 Overloading of Arithmetic
The arithmetic operators +, -, and * are overloaded on signed vectors, unsigned vectors, and
integers. Tables 1.1-1.4 show the different combinations of target and source types and widths that
can be used.
Table 1.1: Overloading of Arithmetic Operations (+, -)
target     src1/2     src2/1
unsigned   unsigned   integer    OK
-          unsigned   signed     fails in analysis
In these tables, "-" means don't care. Also, src1/2 and src2/1 mean first or second operand, and
respectively second or first operand. The first line of the table means that either the first operand is
unsigned and the second is an integer, or the second operand is unsigned and the first is an integer.
Or, more concisely: one of the operands is unsigned and the other is an integer.
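As a small sketch of the first row (the signal name is only illustrative), an unsigned signal and an integer can be added and assigned to an unsigned target:

signal count : unsigned(7 downto 0);
...
count <= count + 1;   -- OK: unsigned + integer, assigned to an unsigned target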
1.10.4 Different Widths and Arithmetic
Table 1.2: Different Vector Widths and Arithmetic Operations (+, -)
target    src1/2    src2/1
narrow    wide      -         fails in elaboration
wide      narrow    int       fails in elaboration
wide      wide      -         OK
narrow    narrow    narrow    OK
narrow    narrow    int       OK
Example vectors:
wide      unsigned(7 downto 0)
narrow    unsigned(4 downto 0)
1.10.5 Overloading of Comparisons
Table 1.3: Overloading of Comparison Operations (=, /=, >=, >, <)
src1/2     src2/1
unsigned   integer   OK
signed     integer   OK
unsigned   signed    fails in analysis
1.10.6 Different Widths and Comparisons
Table 1.4: Different Vector Widths and Comparison Operations (=, /=, >=, >, <)
src1/2    src2/1
wide      -         OK
narrow    -         OK
1.10.7 Type Conversion
The functions unsigned, signed, to_integer, to_unsigned, and to_signed are used
to convert between integers, std_logic vectors, signed vectors, and unsigned vectors.
If you convert between two types of the same width, then no additional hardware will be generated.
The listing below summarizes the types of these functions.
unsigned( val : std_logic_vector ) return unsigned;
signed( val : std_logic_vector ) return signed;
to_integer( val : signed ) return integer;
to_integer( val : unsigned ) return integer;
to_unsigned( val : integer; width : natural) return unsigned;
to_signed( val : integer; width : natural) return signed;
The most common need to convert between two types arises when using a signal as an index into
an array. To use a signal as an index into an array, you must convert the signal into an integer
using the function to_integer (Figure 1.26).
signal i : unsigned( 3 downto 0);
signal a : std_logic_vector(15 downto 0);
...
... a(i) ... -- BAD: won't typecheck
... a( to_integer(i) ) ... -- OK
Avoid (or at least take care when) converting a signal into an integer and then performing arithmetic
on the signal. The default size for integers is 32 bits, so sometimes when a signal is converted into
an integer, the resulting signals will be 32 bits wide.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
...
signal bit_sig : std_logic;
signal uns_sig : unsigned(7 downto 0);
signal vec_sig : std_logic_vector(255 downto 0);
...
bit_sig <= vec_sig( to_integer(uns_sig) );
...
Figure 1.26: Using an unsigned signal as an index to array
To convert a std_logic_vector signal into an integer, you must first say whether the signal
should be interpreted as signed or unsigned. As illustrated in Figure 1.27, this is done by:
1. Convert the std_logic_vector signal to signed or unsigned, using the function
signed or unsigned
2. Convert the signed or unsigned signal into an integer, using to_integer
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
...
signal bit_sig : std_logic;
signal std_sig : std_logic_vector(7 downto 0);
signal vec_sig : std_logic_vector(255 downto 0);
...
bit_sig <= vec_sig( to_integer( unsigned( std_sig ) ) );
...
Figure 1.27: Using a std logic vector as an index to array
1.11 Synthesizable vs Non-Synthesizable Code
Synthesis is done by matching VHDL code against templates or patterns. It's important to use
idioms that your synthesis tool recognizes. If you aren't careful, you could write code that has
the same behaviour as one of the idioms, but which results in inefficient or incorrect hardware.
Section 1.8 described common idioms and the resulting hardware.
Most synthesis tools agree on a large set of idioms, and will reliably generate hardware for these
idioms. This section is based on the idioms that Synopsys, Xilinx, Altera, and Mentor Graphics are
able to synthesize. One exception is that Altera's Quartus does not support implicit state machines
(as of v5.0).
We consider combinational loops to be unsynthesizable. Although it is obviously possible to build
a circuit with a combinational loop, in most cases the behaviour of such a circuit is undened.
Section 1.11.1 gives rules for unsynthesizable VHDL code. Section 1.11.2 gives rules for syn-
thesizable, but undesirable, VHDL code for FPGAs. Undesirable code for FPGAs will produce
circuits that contain latches, asynchronous resets, or are particularly inefficient in either area or
performance.
We consider code that is synthesizable but produces undesirable hardware for FPGAs to be bad
practice. We limit our definition of bad practice to code that produces undesirable hardware.
Poor coding style that does not affect the hardware, for example, including extraneous signals in a
sensitivity list, should certainly be avoided, but falls into the general realm of software engineering
and programming, so will not be discussed.
1.11.1 Unsynthesizable Code
1.11.1.1 Initial Values
Initial values on signals (UNSYNTHESIZABLE)
signal bad_signal : std_logic := '0';
Reason: In most implementation technologies, when a circuit powers up, the values on signals
are completely random. Some FPGAs are an exception to this. For some FPGAs, when a chip is
powered up, all flip-flops will be '0'. For other FPGAs, the initial values can be programmed.
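A sketch of the usual alternative (signal names are only illustrative): give the flop a known value through a synchronous reset, as in Section 1.8.3.2, instead of an initial value:

process (clk)
begin
  if rising_edge(clk) then
    if (reset = '1') then
      good_signal <= '0';
    else
      good_signal <= next_value;
    end if;
  end if;
end process;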
1.11.1.2 Wait For
Wait for length of time (UNSYNTHESIZABLE)
wait for 10 ns;
Reason: Delays through circuits are dependent upon both the circuit and its operating environment,
particularly supply voltage and temperature.
1.11.1.3 Different Wait Conditions
wait statements with different conditions in a process (UNSYNTHESIZABLE)
-- different clock signals
process
begin
wait until rising_edge(clk1);
x <= a;
wait until rising_edge(clk2);
x <= a;
end process;
-- different clock edges
process
begin
wait until rising_edge(clk);
x <= a;
wait until falling_edge(clk);
x <= a;
end process;
Reason: processes with multiple wait statements are turned into finite state machines. The wait
statements denote transitions between states. The target signals in the process are outputs of flip-
flops. Using different wait conditions would require the flip-flops to use different clock signals
at different times. Multiple clock signals for a single flip-flop would be difficult to synthesize,
inefficient to build, and fragile to operate.
1.11.1.4 Multiple if rising edges in Same Process
Multiple if rising edge statements in a process (UNSYNTHESIZABLE)
process (clk)
begin
if rising_edge(clk) then
q0 <= d0;
end if;
if rising_edge(clk) then
q1 <= d1;
end if;
end process;
Reason: The idioms for synthesis tools generally expect just a single if rising edge state-
ment in each process. The simpler the VHDL code is, the easier it is to synthesize hardware.
Programmers of synthesis tools make idiomatic restrictions to make their jobs simpler.
1.11.1.5 if rising edge and wait in Same Process
An if rising edge statement and a wait statement in the same process (UNSYNTHESIZ-
ABLE)
process (clk)
begin
if rising_edge(clk) then
q0 <= d0;
end if;
wait until rising_edge(clk);
q0 <= d1;
end process;
Reason: The idioms for synthesis tools generally expect just a single type of flop-generating state-
ment in each process.
1.11.1.6 if rising edge with else Clause
The if statement has a rising edge condition and an else clause (UNSYNTHESIZABLE).
process (clk)
begin
if rising_edge(clk) then
q0 <= d0;
else
q0 <= d1;
end if;
end process;
Reason: Generally, an if-then-else statement synthesizes to a multiplexer. The condition that is
tested in the if-then-else becomes the select signal for the multiplexer. In an if rising_edge
with else, the select signal would need to detect a rising edge on clk, which isn't feasible to
synthesize.
1.11.1.7 if rising edge Inside a for Loop
An if rising edge statement in a for-loop (UNSYNTHESIZABLE-Synopsys)
process (clk) begin
for i in 0 to 7 loop
if rising_edge(clk) then
q(i) <= d;
end if;
end loop;
end process;
Reason: just an idiom of the synthesis tool.
Some loop statements are synthesizable (Rushton Section 8.7). For-loops in general are de-
scribed in Ashenden. Examples of for loops in E&CE will appear when describing testbenches for
functional verication (Chapter 3).
Synthesizable Alternative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A synthesizable alternative to an if rising edge statement in a for-loop is to put the if-rising-
edge outside of the for loop.
process (clk) begin
if rising_edge(clk) then
for i in 0 to 7 loop
q(i) <= d;
end loop;
end if;
end process;
1.11.1.8 wait Inside of a for loop
wait statements in a for loop (UNSYNTHESIZABLE)
process
begin
for i in 0 to 7 loop
wait until rising_edge(clk);
x <= to_unsigned(i,4);
end loop;
end process;
Reason: Unknown. Clocked for-loops are generally unsynthesizable, but while-loops with
the same behaviour are synthesizable.
Note: Combinational for-loops Combinational for-loops are usually
synthesizable. They are often used to build a combinational circuit for each
element of an array.
Note: Clocked for-loops Clocked for-loops are not synthesizable,
but are very useful in simulation, particularly to generate test vectors for test
benches.
Synthesizable Alternative to Wait-Inside-For . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
while loop (synthesizable)
This is the synthesizable alternative to the wait statement in a for loop above.
process
begin
-- output values from 0 to 4 on i
-- sending one value out each clock cycle
i <= to_unsigned(0,4);
wait until rising_edge(clk);
while (4 > i) loop
i <= i + 1;
wait until rising_edge(clk);
end loop;
end process;
1.11.2 Synthesizable, but Undesirable Hardware
Note: For some of the results in this section, the results are highly depen-
dent upon the synthesis tool that you use and the target technology library.
1.11.2.1 Asynchronous Reset
In an asynchronous reset, the test for reset occurs outside of the test for the clock edge.
process (reset, clk)
begin
if (reset = '1') then
q <= '0';
elsif rising_edge(clk) then
q <= d1;
end if;
end process;
Asynchronous resets are bad, because if a reset occurs very close to a clock edge, some parts of
the circuit might be reset in one clock cycle and some in the subsequent clock cycle. This can lead
the circuit to be out of sync as it goes through the reset sequence, potentially causing erroneous
internal state and output values.
1.11.2.2 Combinational if-then Without else
process (a, b)
begin
if (a = '1') then
c <= b;
end if;
end process;
Reason: This code synthesizes c to be a latch, and latches are undesirable.
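A sketch of the latch-free alternative: assign c in every branch, so that c synthesizes to combinational logic (the default value '0' is only an example):

process (a, b)
begin
  if (a = '1') then
    c <= b;
  else
    c <= '0';
  end if;
end process;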
1.11.2.3 Bad Form of Nested Ifs
if rising edge statement inside another if (BAD HARDWARE)
In Synopsys, with some target libraries, this design results in a level-sensitive latch whose input is
a flop.
process (ce, clk)
begin
if (ce = '1') then
if rising_edge(clk) then
q <= d1;
end if;
end if;
end process;
1.11.2.4 Deeply Nested Ifs
Deeply chained if-then-else statements can lead to long chains of dependent gates, rather
than checking different cases in parallel.
Slow (maybe)
if cond1 then
stmts1
elsif cond2 then
stmts2
elsif cond3 then
stmts3
elsif cond4 then
stmts4
end if;
Fast (hopefully)
if only one of the conditions can be true at a
time, then try using a case statement or some
other technique that allows the conditions to
be evaluated in parallel.
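For example, if the four conditions are known to be mutually exclusive and can be encoded in a selector signal (sel below is only illustrative), a case statement lets the branches be checked in parallel:

case sel is
  when "00"   => stmts1
  when "01"   => stmts2
  when "10"   => stmts3
  when others => stmts4
end case;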
1.11.3 Synthesizable, but Unpredictable Hardware
Some coding styles are synthesizable and might produce desirable hardware with a particular syn-
thesis tool, but either be unsynthesizable or produce undesirable hardware with another tool.
variables
level-sensitive wait statements
missing signals in sens list
If you are using a single synthesis tool for an extended period of time, and want to get the full
power of the tool, then it can be advantageous to write your code in a way that works for your tool,
but might produce undesirable results with other tools.
1.12 Synthesizable VHDL Coding Guidelines
This section gives guidelines for building robust, portable, and synthesizable VHDL code. Porta-
bility is both for different simulation and synthesis tools and for different implementation tech-
nologies.
Remember, there is a world of difference between getting a design to work in simulation and
getting it to work on a real FPGA. And there is also a huge difference between getting a design
to work in an FPGA for a few minutes of testing and getting thousands of products to work for
months at a time in thousands of different environments around the world.
The coding guidelines here are designed both for helping you to get your E&CE 427 project to
work as well as all of the subsequent industrial designs.
Finally, note that there are exceptions to every rule. You might find yourself in a circumstance
where your particular situation (e.g. choice of tool, target technology, etc) would benefit from
bending or breaking a guideline here. Within E&CE 427, of course, there won't be any such
circumstances.
1.12.1 Signal Declarations
Use signals, do not use variables
reason The intention of the creators of VHDL was for signals to be wires and variables to be
just for simulation. Some synthesis tools allow some uses of variables, but when using
variables, it is easy to create a design that works in simulation but not in real hardware.
Use std_logic signals, do not use bit or Boolean
reason std_logic is the most commonly used signal type across synthesis tools, simulation
tools, and cell libraries
Use in or out, do not use inout
reason inout signals are tri-state.
note If you have an output signal that you also want to read from, you might be tempted to
declare the mode of the signal to be inout. A better solution is to create a new, internal,
signal that you both read from and write to. Then, your output signal can just read from
the internal signal.
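A sketch of this internal-signal pattern, assuming an entity widget with inputs a and b and an output port result (all names are only illustrative):

architecture main of widget is
  signal result_int : std_logic;   -- internal copy of the output
begin
  result_int <= a AND b;           -- drive the internal signal
  -- result_int can also be read anywhere else in the architecture
  result <= result_int;            -- the output port only reads the internal signal
end main;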
Declare the primary inputs and outputs of chips as either std_logic or std_logic_vector.
Do not use signed or unsigned for primary inputs or outputs.
reason Both the Altera tool Quartus and the Xilinx tool ngd2vhdl convert signed and unsigned
vectors in entities into std-logic-vectors. If you want your same testbench to work for both
functional simulation and timing simulation, you must not use signed or unsigned signals
in the top-level entity of your chip.
note Signed and unsigned signals are fine inside testbenches, for non-top-level entities, and
inside architectures. It is only the top-level entity that should not use signed or unsigned
signals.
1.12.2 Flip-Flops and Latches
Use flops, not latches (see section 1.8.2).
Use D-flops, not T, JK, etc (see section 1.8.2).
For every signal in your design, know whether it should be a flip-flop or combinational. Before
simulating your design, examine the log file (e.g. LOG/dc_shell.log) to see if the flip-
flops in your circuit match your expectations, and to check that you don't have any latches in
your design.
Do not assign a signal to itself (e.g. a <= a; is bad). If the signal is a flop, use a chip enable
to cause the signal to hold its value. If the signal is combinational, then assigning a signal to
itself will cause combinational loops, which are bad.
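A sketch of the flop-with-chip-enable alternative to a <= a (signal names are only illustrative):

process (clk)
begin
  if rising_edge(clk) then
    if (ce = '1') then
      a <= next_a;
    end if;    -- when ce = '0', the flop holds its value without a <= a
  end if;
end process;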
1.12.3 Inputs and Outputs
Put flip-flops on primary inputs and outputs of a chip
reason Creates more robust implementations. Signal delays between chips are unpredictable.
Signal integrity can be a problem (remember transmission lines from E&CE 324?). Putting
flip-flops on inputs and outputs of a chip provides clean boundaries between circuits.
note This only applies to primary inputs and outputs of a chip (the signals in the top-level
entity). Within a chip, you should adopt a standard of putting flip-flops on either inputs or
outputs of modules. Within a chip, you do not need to put flip-flops on both inputs and
outputs.
1.12.4 Multiplexors and Tri-State Signals
Use multiplexors, not tri-state buffers (see section 1.8.2).
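As a sketch (signal names are only illustrative), a bus shared by two drivers becomes a conditional assignment, i.e. a multiplexer, instead of two tri-state drivers:

bus_sig <= driver_a when (grant_a = '1') else driver_b;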
1.12.5 Processes
For a combinational process, the sensitivity list should contain all of the signals that are read in
the process.
reason Gives consistent results across different tools. Many synthesis tools will implicitly
include all signals that a process reads in its sensitivity list. This differs from the VHDL
Standard. A tool that adheres to the standard will introduce latches if not all signals that
are read from are included in the sensitivity list.
exception In a clocked process using an if rising edge, it is acceptable to have only the
clock in the sensitivity list
For a combinational process, every signal that is assigned to, must be assigned to in every branch
of if-then and case statements.
reason If a signal is not assigned a value in a path through a combinational process, then that
signal will be a latch.
note For a clocked process, if a signal is not assigned a value in a clock cycle, then the flip-flop
for that signal will have a chip-enable pin. Chip-enable pins are fine; they are available on
flip-flops in essentially every cell library.
Each signal should be assigned to in only one process.
reason Multiple processes driving the same signal is the same as having multiple gates driving
the same wire. This can cause contention, short circuits, and other bad things.
exception Multiple drivers are acceptable for tri-state busses or if your implementation tech-
nology has wired-ANDs or wired-ORs. FPGAs don't have wired-ANDs or wired-ORs.
Separate unrelated signals into different processes
reason Grouping assignments to unrelated signals into a single process can complicate the
control circuitry for that process. Each branch in a case statement or if-then-else adds a
multiplexor or chip-enable circuitry.
reason Synthesis tools generally optimize each process individually; the larger a process is, the
longer it will take the synthesis program to optimize the process. Also, larger processes
tend to be more complicated and can cause synthesis programs to miss helpful optimiza-
tions that they would notice in smaller processes.
1.12.6 State Machines
In a state machine, illegal and unreachable states should transition to the reset state
reason Creates more robust implementations. In the field, your circuit will be subjected to
illegal inputs, voltage spikes, temperature fluctuations, clock speed variations, etc. At
some point in time, something weird will happen that will cause it to jump into an illegal
state. Having a system reset and reboot is much better than having it generate incorrect
outputs that aren't detected.
If your state machine has less than 16 states, use a one-hot encoding.
reason For n states, a one-hot encoding uses n flip-flops, while a binary encoding uses log2(n)
flip-flops. One-hot signals are simpler to decode, because only one bit must be checked to
determine if the circuit is in a particular state. For small values of n, a one-hot signal results
in a smaller and faster circuit. For large values of n, the number of signals required for a
one-hot design is too great of a penalty to compensate for the simplicity of the decoding
circuitry.
note Using an enumerated type for states allows the synthesis tool to choose state encodings
that it thinks will work well to balance area and clock speed. Quartus uses a modified
one-hot encoding, where the bit that denotes the reset state is inverted. That is, when the
reset bit is '0', the system is in the reset state and when the reset bit is a '1' the system
is not in the reset state. The other bits have the normal polarity. The result is that when the
system is in the reset state, all bits are '0' and when the system is in a non-reset state, two
bits are '1'.
note Using your own encoding allows you to leverage knowledge about your design that the
synthesis tool might not be able to deduce.
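A sketch of declaring states with an enumerated type, which leaves the encoding to the synthesis tool (the state names are only illustrative):

type state_ty is (S_RESET, S_RUN, S_DONE);
signal state : state_ty;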
1.12.7 Reset
Include a reset signal in all clocked circuits.
reason For most implementation technologies, when you power-up the circuit, you do not
know what state it will start in. You need a reset signal to get the circuit into a known state.
reason If something goes wrong while the circuit is running, you need a way to get it into a
known state.
For implicit state machines (section 2.5.1.3), check for reset after every wait statement.
reason Missing a wait statement means that your circuit might not notice a reset signal, or
different signals could reset in different clock cycles, causing your circuit to get out of
synch.
Connect reset to the important control signals in the design, such as the state signal. Do not reset
every flip-flop.
reason Using reset adds area and delay to a circuit. The fewer signals that need reset, the
faster and smaller your design will be.
note Connect the reset signal to critical flip-flops, such as the state signal. Datapath signals
rarely need to be reset. You do not need to reset every signal.
Use synchronous, not asynchronous, reset
reason Creates more robust implementations. Signal propagation delays mean that asyn-
chronous resets cause different parts of the circuit to be reset at different times. This can
lead to glitches, which then might cause the circuit to move to an illegal state.
Covering All Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
When writing case statements or selected assignments that test the value of std_logic signals,
you will get an error unless you include a provision for non-'1'/'0' values.
For example:
signal t : std_logic;
...
case t is
when '1' => ...
when '0' => ...
end case;
will result in an error message about missing cases. You must provide for t being 'H', 'U', etc.
The simplest thing to do is to make the last test when others.
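A sketch of the fix, with a final when others branch covering the remaining std_logic values:

case t is
  when '1'    => ...
  when '0'    => ...
  when others => ...
end case;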
1.13 VHDL Problems
P1.1 IEEE 1164
For each of the values in the list below, answer whether or not it is defined in the ieee.std_logic_1164
library. If it is part of the library, write a 2-3 word description of the value.
Values: '-', '#', '0', '1', 'A', 'h', 'H', 'L', 'Q', 'X', 'Z'.
P1.2 VHDL Syntax
Answer whether each of the VHDL code fragments q2a through q2f is legal VHDL code.
NOTES: 1) ... represents a fragment of legal VHDL code.
2) For full marks, if the code is illegal, you must explain why.
3) The code has been written so that, if it is illegal, then it is illegal for both
simulation and synthesis.
q2a
architecture main of anchiceratops is
signal a, b, c : std_logic;
begin
process begin
wait until rising_edge(c);
a <= if (b = '1') then
...
else
...
end if;
end process;
end main;
q2b
architecture main of tulerpeton is
begin
lab: for i in 15 downto 0 loop
...
end loop;
end main;
q2c
architecture main of metaxygnathus is
signal a : std_logic;
begin
lab: if (a = '1') generate
...
end generate;
end main;
q2d
architecture main of temnospondyl is
component compa
port (
a : in std_logic;
b : out std_logic
);
end component;
signal p, q : std_logic;
begin
coma_1 : compa
port map (a => p, b => q);
...
end main;
q2e
architecture main of pachyderm is
function inv(a : std_logic)
return std_logic is
begin
return(NOT a);
end inv;
signal p, b : std_logic;
begin
p <= inv(b => a);
...
end main;
q2f
architecture main of apatosaurus is
type state_ty is (S0, S1, S2);
signal st : state_ty;
signal p : std_logic;
begin
case st is
when S0 | S1 => p <= '0';
when others => p <= '1';
end case;
end main;
P1.3 Flops, Latches, and Combinational Circuitry
For each of the signals p...z in the architecture main of montevido, answer whether the signal
is a latch, combinational gate, or flip-flop.
entity montevido is
port (
a, b0, b1, c0, c1, d0, d1, e0, e1 : in std_logic;
l : in std_logic_vector (1 downto 0);
p, q, r, s, t, u, v, w, x, y, z : out std_logic
);
end montevido;
architecture main of montevido is
signal i, j : std_logic;
begin
i <= c0 XOR c1;
j <= c0 XOR c1;
process (a, i, j) begin
if (a = '1') then
p <= i AND j;
else
p <= NOT i;
end if;
end process;
process (a, b0, b1) begin
if rising_edge(a) then
q <= b0 AND b1;
end if;
end process;
process
(a, c0, c1, d0, d1, e0, e1)
begin
if (a = '1') then
r <= c0 OR c1;
s <= d0 AND d1;
else
r <= e0 XOR e1;
end if;
end process;
process begin
wait until rising_edge(a);
t <= b0 XOR b1;
u <= NOT t;
v <= NOT x;
end process;
process begin
case l is
when "00" =>
wait until rising_edge(a);
w <= b0 AND b1;
x <= '0';
when "01" =>
wait until rising_edge(a);
w <= '-';
x <= '1';
when "1-" =>
wait until rising_edge(a);
w <= c0 XOR c1;
x <= '-';
end case;
end process;
y <= c0 XOR c1;
z <= x XOR w;
end main;
P1.4 Counting Clock Cycles
This question refers to the VHDL code shown below.
NOTES:
1. ... represents a legal fragment of VHDL code
2. assume all signals are properly declared
3. the VHDL code is intended to be legal, synthesizable code
4. all signals are initially 'U'
entity bigckt is
port (
a, b : in std_logic;
c : out std_logic
);
end bigckt;
architecture main of bigckt is
begin
process (a, b)
begin
if (a = '0') then
c <= '0';
else
if (b = '1') then
c <= '1';
else
c <= '0';
end if;
end if;
end process;
end main;
entity tinyckt is
port (
clk : in std_logic;
i : in std_logic;
o : out std_logic
);
end tinyckt;
architecture main of tinyckt is
component bigckt ( ... );
signal ... : std_logic;
begin
p0 : process begin
wait until rising_edge(clk);
p0_a <= i;
wait until rising_edge(clk);
end process;
p1 : process begin
wait until rising_edge(clk);
p1_b <= p1_d;
p1_c <= p1_b;
p1_d <= s2_k;
end process;
p2 : process (p1_c, p3_h, p4_i, clk) begin
if rising_edge(clk) then
p2_e <= p3_h;
p2_f <= p1_c = p4_i;
end if;
end process;
p3 : process (i, s4_m) begin
p3_g <= i;
p3_h <= s4_m;
end process;
p4 : process (clk, i) begin
if (clk = '1') then
p4_i <= i;
else
p4_i <= '0';
end if;
end process;
huge : bigckt
port map (a => p2_e, b => p1_d, c => h_y);
s1_j <= s3_l;
s2_k <= p1_b XOR i;
s3_l <= p2_f;
s4_m <= p2_f;
end main;
For each of the pairs of signals below, what is the minimum length of time between when a change
occurs on the source signal and when that change affects the destination signal?
src     dst     Num clock cycles
i       p0_a
i       p1_b
i       p1_b
i       p1_c
i       p2_e
i       p3_g
i       p4_i
s4_m    h_y
p1_b    p1_d
p2_f    s1_j
p2_f    s2_k
P1.5 Arithmetic Overflow
Implement a circuit to detect overflow in 8-bit signed addition.
An overflow in addition happens when the carry into the most significant bit is different from the
carry out of the most significant bit.
When performing addition, for overflow to happen, both operands must have the same sign. Pos-
itive overflow occurs when adding two positive operands results in a negative sum. Negative
overflow occurs when adding two negative operands results in a positive sum.
P1.6 Delta-Cycle Simulation: Pong
Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram.
INSTRUCTIONS:
1. The simulation is to be done at the granularity of simulation-steps.
2. Show all changes to process modes and signal values.
3. Each column of the timing diagram corresponds to a simulation step that changes a signal or
process.
4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation
round by writing in the appropriate row a B at the beginning and an E at the end of the cycle
or round.
5. End your simulation just before 20 ns.
architecture main of pong_machine is
signal ping_i, ping_n, pong_i, pong_n : std_logic;
begin
reset_proc: process
begin
reset <= '1';
wait for 10 ns;
reset <= '0';
wait for 100 ns;
end process;
clk_proc: process
begin
clk <= '0';
wait for 10 ns;
clk <= '1';
wait for 10 ns;
end process;
next_proc: process (clk)
begin
if rising_edge(clk) then
ping_n <= ping_i;
pong_n <= pong_i;
end if;
end process;
comb_proc: process (pong_n, ping_n, reset)
begin
if (reset = '1') then
ping_i <= '1';
pong_i <= '0';
else
ping_i <= pong_n;
pong_i <= ping_n;
end if;
end process;
end main;
P1.7 Delta-Cycle Simulation: Baku
Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram.
INSTRUCTIONS:
1. The simulation is to be done at the granularity of simulation-steps.
2. Show all changes to process modes and signal values.
3. Each column of the timing diagram corresponds to a simulation step.
4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation
round by writing in the appropriate row a B at the beginning and an E at the end of the cycle
or round.
5. Write t=5ns and t=10ns at the top of columns where time advances to 5 ns and 10 ns.
6. Begin your simulation at 5 ns (i.e. after the initial simulation cycles that initialize the signals
have completed).
7. End your simulation just before 15 ns;
entity baku is
port (
clk, a, b : in std_logic;
f : out std_logic
);
end baku;
architecture main of baku is
signal c, d, e : std_logic;
begin
proc_clk: process
begin
clk <= '0';
wait for 10 ns;
clk <= '1';
wait for 10 ns;
end process;
proc_extern : process
begin
a <= '0';
b <= '0';
wait for 5 ns;
a <= '1';
b <= '1';
wait for 15 ns;
end process;
proc_1 : process (a, b, c)
begin
c <= a and b;
d <= a xor c;
end process;
proc_2 : process
begin
e <= d;
wait until rising_edge(clk);
end process;
proc_3 : process (c, e) begin
f <= c xor e;
end process;
end main;
P1.8 Clock-Cycle Simulation
Given the VHDL code for anapurna and waveform diagram below, answer what the values of
the signals y, z, and p will be at the given times.
entity anapurna is
port (
clk, reset, sel : in std_logic;
a, b : in unsigned(15 downto 0);
p : out unsigned(15 downto 0)
);
end anapurna;
architecture main of anapurna is
type state_ty is (mango, guava, durian, papaya);
signal y, z : unsigned(15 downto 0);
signal state : state_ty;
begin
proc_herzog: process
begin
top_loop: loop
wait until (rising_edge(clk));
next top_loop when (reset = '1');
state <= durian;
wait until (rising_edge(clk));
state <= papaya;
while y < z loop
wait until (rising_edge(clk));
if sel = '1' then
wait until (rising_edge(clk));
next top_loop when (reset = '1');
state <= mango;
end if;
state <= papaya;
end loop;
end loop;
end process;
proc_hillary: process (clk)
begin
if rising_edge(clk) then
if (state = durian) then
z <= a;
else
z <= z + 2;
end if;
end if;
end process;
y <= b;
p <= y + z;
end main;
P1.9 VHDL-VHDL Behavioural Comparison: Teradactyl
For each of the VHDL architectures q3a through q3c, does the signal v have the same behaviour
as it does in the main architecture of teradactyl?
NOTES: 1) For full marks, if the code has different behaviour, you must explain
why.
2) Ignore any differences in behaviour in the rst few clock cycles that is
caused by initialization of ip-ops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.
entity teradactyl is
port (
a : in std_logic;
v : out std_logic
);
end teradactyl;
architecture main of teradactyl is
signal m : std_logic;
begin
m <= a;
v <= m;
end main;
architecture q3a of teradactyl is
signal b, c, d : std_logic;
begin
b <= a;
c <= b;
d <= c;
v <= d;
end q3a;
architecture q3b of teradactyl is
signal m : std_logic;
begin
process (a, m) begin
v <= m;
m <= a;
end process;
end q3b;
architecture q3c of teradactyl is
signal m : std_logic;
begin
process (a) begin
m <= a;
end process;
process (m) begin
v <= m;
end process;
end q3c;
P1.10 VHDL-VHDL Behavioural Comparison: Ichthyostega
For each of the VHDL architectures q4a through q4c, does the signal v have the same behaviour
as it does in the main architecture of ichthyostega?
NOTES: 1) For full marks, if the code has different behaviour, you must explain
why.
2) Ignore any differences in behaviour in the rst few clock cycles that is
caused by initialization of ip-ops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.
entity ichthyostega is
port (
clk : in std_logic;
b, c : in signed(3 downto 0);
v : out signed(3 downto 0)
);
end ichthyostega;
architecture main of ichthyostega is
signal bx, cx : signed(3 downto 0);
begin
process begin
wait until (rising_edge(clk));
bx <= b;
cx <= c;
end process;
process begin
wait until (rising_edge(clk));
if (cx > 0) then
v <= bx;
else
v <= to_signed(-1, 4);
end if;
end process;
end main;
architecture q4a of ichthyostega is
signal bx, cx : signed(3 downto 0);
begin
process begin
wait until (rising_edge(clk));
bx <= b;
cx <= c;
end process;
process begin
if (cx > 0) then
wait until (rising_edge(clk));
v <= bx;
else
wait until (rising_edge(clk));
v <= to_signed(-1, 4);
end if;
end process;
end q4a;
architecture q4b of ichthyostega is
signal bx, cx : signed(3 downto 0);
begin
process begin
wait until (rising_edge(clk));
bx <= b;
cx <= c;
wait until (rising_edge(clk));
if (cx > 0) then
v <= bx;
else
v <= to_signed(-1, 4);
end if;
end process;
end q4b;
architecture q4c of ichthyostega is
signal bx, cx, dx : signed(3 downto 0);
begin
process begin
wait until (rising_edge(clk));
bx <= b;
cx <= c;
end process;
process begin
wait until (rising_edge(clk));
v <= dx;
end process;
dx <= bx when (cx > 0)
else to_signed(-1, 4);
end q4c;
P1.11 Waveform VHDL Behavioural Comparison
Answer whether each of the VHDL code fragments q3a through q3f has the same behaviour as
the timing diagram.
NOTES: 1) Same behaviour means that the signals a, b, and c have the same values at
the end of each clock cycle in steady-state simulation (ignore any irregularities
in the first few clock cycles).
2) For full marks, if the code does not match, you must explain why.
3) Assume that all signals, constants, variables, types, etc are properly defined
and declared.
4) All of the code fragments are legal, synthesizable VHDL code.
[Figure: timing diagram for signals clk, a, b, and c]
q3a
architecture q3a of q3 is
begin
process begin
a <= '1';
loop
wait until rising_edge(clk);
a <= NOT a;
end loop;
end process;
b <= NOT a;
c <= NOT b;
end q3a;
q3b
architecture q3b of q3 is
begin
process begin
b <= '0';
a <= '1';
wait until rising_edge(clk);
a <= b;
b <= a;
wait until rising_edge(clk);
end process;
c <= a;
end q3b;
q3c
architecture q3c of q3 is
begin
process begin
a <= '0';
b <= '1';
wait until rising_edge(clk);
b <= a;
a <= b;
wait until rising_edge(clk);
end process;
c <= NOT b;
end q3c;
q3d
architecture q3d of q3 is
begin
process (b, clk) begin
a <= NOT b;
end process;
process (a, clk) begin
b <= NOT a;
end process;
c <= NOT b;
end q3d;
q3e
architecture q3e of q3 is
begin
process
begin
b <= '0';
a <= '1';
wait until rising_edge(clk);
a <= c;
b <= a;
wait until rising_edge(clk);
end process;
c <= not b;
end q3e;
q3f
architecture q3f of q3 is
begin
process begin
a <= '1';
b <= '0';
c <= '1';
wait until rising_edge(clk);
a <= c;
b <= a;
c <= NOT b;
wait until rising_edge(clk);
end process;
end q3f;
P1.12 Hardware VHDL Comparison
For each of the circuits q2a–q2d, answer
whether the signal d has the same behaviour
as it does in the main architecture of q2.
entity q2 is
port (
a, clk, reset : in std_logic;
d : out std_logic
);
end q2;
architecture main of q2 is
signal b, c : std_logic;
begin
b <= '0' when (reset = '1')
else a;
process (clk) begin
if rising_edge(clk) then
c <= b;
d <= c;
end if;
end process;
end main;
[Figure: candidate circuits q2a, q2b, q2c, and q2d, drawn with signals clk, a, '0', reset, and d]
P1.13 8-Bit Register
Implement an 8-bit register that has:
clock signal clk
input data vector d
output data vector q
synchronous active-high input reset
synchronous active-high input enable
P1.13.1 Asynchronous Reset
Modify your design so that the reset signal is asynchronous, rather than synchronous.
P1.13.2 Discussion
Describe the tradeoffs in using synchronous versus asynchronous reset in a circuit implemented on
an FPGA.
P1.13.3 Testbench for Register
Write a test bench to validate the functionality of the 8-bit register with synchronous reset.
P1.14 Synthesizable VHDL and Hardware
For each of the fragments of VHDL q4a...q4f, answer whether the code is synthesizable. If the
code is synthesizable, draw the circuit most likely to be generated by synthesizing the datapath of
the code. If the code is not synthesizable, explain why.
q4a
process begin
wait until rising_edge(a);
e <= d;
wait until rising_edge(b);
e <= NOT d;
end process;
q4b
process begin
while (c /= '1') loop
if (b = '1') then
wait until rising_edge(a);
e <= d;
else
e <= NOT d;
end if;
end loop;
e <= b;
end process;
q4c
process (a, d) begin
e <= d;
end process;
process (a, e) begin
if rising_edge(a) then
f <= NOT e;
end if;
end process;
q4d
process (a) begin
if rising_edge(a) then
if b = '1' then
e <= '0';
else
e <= d;
end if;
end if;
end process;
q4e
process (a,b,c,d) begin
if rising_edge(a) then
e <= c;
else
if (b = '1') then
e <= d;
end if;
end if;
end process;
q4f
process (a,b,c) begin
if (b = '1') then
e <= '0';
else
if rising_edge(a) then
e <= c;
end if;
end if;
end process;
P1.15 Datapath Design
Each of the three VHDL fragments q4a–q4c is intended to be the datapath for the same circuit.
The circuit is intended to perform the following sequence of operations (not all operations are
required to use a clock cycle):
read in source and destination addresses from i_src1, i_src2, i_dst
read operands op1 and op2 from memory
compute sum of operands sum
write sum to memory at destination address dst
write sum to output o_result
[Figure: block diagram of the circuit with ports i_src1, i_src2, i_dst, o_result, and clk]
P1.15.1 Correct Implementation?
For each of the three fragments of VHDL q4a–q4c, answer whether it is a correct implementation
of the datapath. If the datapath is not correct, explain why. If the datapath is correct, answer in
which cycle you need load='1'.
NOTES:
1. You may choose the number of clock cycles required to execute the sequence of operations.
2. The cycle in which the addresses are on i_src1, i_src2, and i_dst is cycle #0.
3. The control circuitry that controls the datapath will output a signal load, which will be '1'
when the sum is to be written into memory.
4. The code fragment with the signal declarations, connections for inputs and outputs, and the
instantiation of memory is to be used for all three code fragments q4a–q4c.
5. The memory has registered inputs and combinational (unregistered) outputs.
6. All of the VHDL is legal, synthesizable code.
-- This code is to be used for
-- all three code fragments q4a--q4c.
signal state : std_logic_vector(3 downto 0);
signal src1, src2, dst, op1, op2, sum,
mem_in_a, mem_out_a, mem_out_b,
mem_addr_a, mem_addr_b
: unsigned(7 downto 0);
...
process (clk)
begin
if rising_edge(clk) then
src1 <= i_src1;
src2 <= i_src2;
dst <= i_dst;
o_result <= sum;
end if;
end process;
mem : ram256x16d
port map (clk => clk,
i_addr_a => mem_addr_a,
i_addr_b => mem_addr_b,
i_we_a => mem_we,
i_data_a => mem_in_a,
o_data_a => mem_out_a,
o_data_b => mem_out_b);
q4a
op1 <= mem_out_a when state = "0010"
else (others => '0');
op2 <= mem_out_b when state = "0010"
else (others => '0');
sum <= op1 + op2 when state = "0100"
else (others => '0');
mem_in_a <= sum when state = "1000"
else (others => '0');
mem_addr_a <= dst when state = "1000"
else src1;
mem_we <= '1' when state = "1000"
else '0';
mem_addr_b <= src2;
process (clk)
begin
if rising_edge(clk) then
if (load = '1') then
state <= "1000";
else
-- rotate state vector one bit to left
state <= state(2 downto 0) & state(3);
end if;
end if;
end process;
q4b
process (clk) begin
if rising_edge(clk) then
op1 <= mem_out_a;
op2 <= mem_out_b;
end if;
end process;
sum <= op1 + op2;
mem_in_a <= sum;
mem_we <= load;
mem_addr_a <= dst when load = '1'
else src1;
mem_addr_b <= src2;
q4c
process
begin
wait until rising_edge(clk);
op1 <= mem_out_a;
op2 <= mem_out_b;
sum <= op1 + op2;
mem_in_a <= sum;
end process;
process (load, dst, src1) begin
if load = '1' then
mem_addr_a <= dst;
else
mem_addr_a <= src1;
end if;
end process;
mem_addr_b <= src2;
P1.15.2 Smallest Area
Of all of the circuits (q4a–q4c), including both correct and incorrect circuits, predict which will
have the smallest area.
If you don't have sufficient information to predict the relative areas, explain what additional
information you would need to predict the area prior to synthesizing the designs.
P1.15.3 Shortest Clock Period
Of all of the circuits (q4a–q4c), including both correct and incorrect circuits, predict which will
have the shortest clock period.
If you don't have sufficient information to predict the relative periods, explain what additional
information you would need to predict the period prior to performing any synthesis or timing
analysis of the designs.
Chapter 2
RTL Design with VHDL: From
Requirements to Optimized Code
2.1 Prelude to Chapter
2.1.1 A Note on EDA for FPGAs and ASICs
The following is from John Cooley's column The Industry Gadfly from 2003/04/30. The title of
this article is: "The FPGA EDA Slums."
For 2001, Dataquest reported that the ASIC market was US$16.6 billion while the
FPGA market was US$2.6 billion.
What's more interesting is that the 2001 ASIC EDA market was US$2.2 billion while
the FPGA EDA market was US$91.1 million. Nope, that's not a mistake. It's ASIC
EDA and billion versus FPGA EDA and million. Do the math and you'll see that for
every dollar spent on an ASIC project, roughly 12 cents of it goes to an EDA vendor.
For every dollar spent on a FPGA project, roughly 3.4 cents goes to an EDA vendor.
Not good.
It's the old "free milk and a cow" story according to Gary Smith, the Senior EDA
Analyst at Dataquest. "Altera and Xilinx have fouled their own nest. Their free tools
spoil the FPGA EDA market," says Gary. EDA vendors know that there's no money
to be made in FPGA tools.
2.2 FPGA Background and Coding Guidelines
2.2.1 Generic FPGA Hardware
2.2.1.1 Generic FPGA Cell
Cell = Logic Element (LE) in Altera
= Configurable Logic Block (CLB) in Xilinx
[Figure: generic FPGA cell, showing a combinational block (comb) and a D flip-flop with clock enable (CE), set (S), and reset (R); signals comb_data_in, ctrl_in, carry_in, carry_out, comb_data_out, flop_data_in, flop_data_out]
2.2.2 Area Estimation
We estimate the number of FPGA cells required for a design by counting the number of flip-flops
and primary inputs that are in the fanin of each flip-flop. Only flip-flops count, because
combinational signals are collapsed into the circuitry within an FPGA cell. The circuitry for any
flip-flop signal with up to four source flip-flops can be implemented on a single FPGA cell. If a
flip-flop signal is dependent upon five source flip-flops, then two FPGA cells are required.
Source flops/inputs   Minimum cells
 1                    1
 2                    1
 3                    1
 4                    1
 5                    2
 6                    2
 7                    2
 8                    3
 9                    3
10                    3
11                    4
For a single target signal, this technique gives a lower bound on the number of cells needed. For
example, some functions of seven inputs require more than two cells. As a particular example, a
four-to-one multiplexer has six inputs and requires three cells.
When dealing with multiple target signals, this technique might be an overestimate, because a
single cell can drive several other cells (common subexpression elimination).
PLA and Flop for Different Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: generic FPGA cell configured so that the combinational block and the flip-flop hold different functions]
PLA and Flop for Same Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: generic FPGA cell configured so that the combinational block feeds the flip-flop (same function)]
PLA and Flop for Same Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: generic FPGA cell configured so that the combinational block feeds the flip-flop (same function)]
Estimate Area for Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Map the combinational circuits below onto generic FPGA cells.
[Figure: two combinational circuits over signals a, b, c, d, e, f, g, h, i with intermediate and output signals w, x, y, z, together with blank generic FPGA cells onto which to map them]
2.2.2.1 Interconnect for Generic FPGA
Note: In these slides, the space between tightly grouped wires sometimes
disappears, making a group of wires appear to be a single large wire.
There are two types of wires that connect a cell to the rest of the chip:
General purpose interconnect (configurable, slow)
Carry chains and cascade chains (vertically adjacent cells, fast)
2.2.2.2 Blocks of Cells for Generic FPGA
Cells are organized into blocks. There is a great deal of interconnect (wires) between cells within
a single block. In large FPGAs, the blocks are organized into larger blocks. These large blocks
might themselves be organized into even larger blocks. Think of an FPGA as a bunch of nested
for-generate statements that replicate a single component (cell) hundreds of thousands of
times.
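As a loose sketch of that analogy (not how a real device is actually described: the generic_cell component, the constants, and the array types below are all made up for illustration), a two-level nested for-generate replicating a single cell would look like this:
-- Hypothetical sketch of the "nested for-generate" view of an FPGA fabric.
-- The entity declaration and library clauses are omitted.
architecture fabric of fpga_sketch is
  constant NUM_BLOCKS      : natural := 16;
  constant CELLS_PER_BLOCK : natural := 10;
  type cell_bus  is array (0 to CELLS_PER_BLOCK-1) of std_logic;
  type block_bus is array (0 to NUM_BLOCKS-1) of cell_bus;
  signal cell_in, cell_out : block_bus;
  component generic_cell
    port (comb_data_in  : in  std_logic;
          comb_data_out : out std_logic);
  end component;
begin
  blocks: for b in 0 to NUM_BLOCKS-1 generate
    cells: for c in 0 to CELLS_PER_BLOCK-1 generate
      -- One identical cell instance per (block, cell) position.
      cell : generic_cell
        port map (comb_data_in  => cell_in(b)(c),
                  comb_data_out => cell_out(b)(c));
    end generate cells;
  end generate blocks;
end fabric;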
Cells not used for computation can be used as wires to shorten the length of a path between cells.
2.2.2.3 Clocks for Generic FPGAs
Characteristics of clock signals:
High fanout (drive many gates)
Long wires (destination gates scattered all over chip)
Characteristics of FPGAs:
Very few gates that are large (strong) enough to support a high fanout.
Very few wires that traverse the entire chip and can be connected to every flip-flop.
2.2.2.4 Special Circuitry in FPGAs
Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
For more than five years, FPGAs have had special circuits for RAM and ROM. In Altera FPGAs,
these circuits are called ESBs (Embedded System Blocks). These special circuits are possible
because many FPGAs are fabricated on the same processes as SRAM chips. So, the FPGAs simply
contain small chunks of SRAM.
Microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A new feature to appear in FPGAs in 2001 and 2002 is hardwired microprocessors on the same
chip as programmable hardware.
                        Hard                           Soft
Altera                  Arm 922T with 200 MIPs         Nios with ?? MIPs
Xilinx: Virtex-II Pro   Power PC 405 with 420 D-MIPs   Microblaze with 100 D-MIPs
The Virtex-II Pro has 4 Power PCs and enough programmable hardware to implement the first-
generation Intel Pentium microprocessor.
Arithmetic Circuitry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A new feature to appear in FPGAs in 2001 and 2002 is hardwired circuits for multipliers and
adders.
Altera: Mercury 16×16 at 130MHz
Xilinx: Virtex-II Pro 18×18 at ???MHz
Using these resources can significantly improve both the area and performance of a design.
Input / Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Recently, high-end FPGAs have started to include special circuits to increase the bandwidth of
communication with the outside world.
Product
Altera True-LVDS (1 Gbps)
Xilinx Rocket I/O (3 Gbps)
2.2.3 Generic-FPGA Coding Guidelines
Flip-flops are almost free in FPGAs
reason In FPGAs, the area consumed by a design is usually determined by the amount of
combinational circuitry, not by the number of flip-flops.
Aim for using 80–90% of the cells on a chip.
reason If you use more than 90% of the cells on a chip, then the place-and-route program
might not be able to route the wires to connect the cells.
reason If you use less than 80% of the cells, then probably:
there are optimizations that will increase performance and still allow the design to fit
on the chip;
or you spent too much human effort on optimizing for low area;
or you could use a smaller (cheaper!) chip.
exception In E&CE 427 (unlike in real life), the mark is based on the actual number of cells
used.
Use just one clock signal
reason If all flip-flops use the same clock, then the clock does not impose any constraints on
where the place-and-route tool puts flip-flops and gates. If different flip-flops used different
clocks, then flip-flops that are near each other would probably be required to use the same
clock.
Use only one edge of the clock signal
reason There are two ways to use both rising and falling edges of a clock signal: have rising-
edge and falling-edge flip-flops, or have two different clock signals that are inverses of
each other. Most FPGAs have only rising-edge flip-flops. Thus, using both edges of a
clock signal is equivalent to having two different clock signals, which is deprecated by the
preceding guideline.
2.2.4 Altera APEX20K Information and Coding Guidelines
APEX20K Block Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Chip
52 Mega Logic Array Blocks (MegaLABs)
1 Embedded System Block (ESB)
Memory and wide combinational
functions
16 Logic Array Blocks (LABs)
10 Logic Elements (LEs)
4-input lookup table
Carry and cascade
Flip-flop
Each level of hierarchy has its own interconnect (wires).
LE Computation and Storage . . . . . . . . .
4-input lookup table (LUT)
Carry-chain computation circuitry
Cascade-chain computation circuitry
Flip-flop with load, clear, clock-enable
LE Interconnect . . . . . . . . . . . . . . . . . . . . . .
4 data inputs
2 data outputs
Carry in, carry out
Cascade in, cascade out
Clock, clock-enable
Async clear, synch set (load), synch clear
(reset)
Global reset
Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The Altera APEX20K chips initialize all flip-flops to '0' at startup. To mimic this behaviour in
simulation, you should put an initial value of '0' on all flip-flops. If you are doing your own
encoding for a state machine, choose the reset state to be encoded as all zeroes.
You should not put initial values on inputs or combinational signals.
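For example (a small sketch; the signal names are arbitrary and state_ty is assumed to be declared elsewhere), initial values can be given directly in the signal declarations:
-- Flip-flop signals start at '0', mimicking the APEX20K power-up state.
signal count : unsigned(7 downto 0) := (others => '0');
signal state : state_ty := s0;   -- assumes s0 is the all-zeroes reset state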
2.3 Design Flow
2.3.1 Generic Design Flow
Most people agree on the general terminology and process for a digital hardware design flow.
However, each book and course has its own particular way of presenting the ideas. Here we will
lay out the consistent set of definitions that we will use in E&CE 427. This might be different from
what you have seen in other courses or on a work term. Focus on the ideas and you will be fine
both now and in the future.
The design flow presented here focuses on the artifacts that we work with, rather than the
operations that are performed on the artifacts. This is because the same operations can be performed at
different points in the design flow, while the artifacts each have a unique purpose.
[Figure: generic design flow from Requirements through Algorithm, High-Level Model, DP+Ctrl Code, Opt. RTL Code, and Implementation to Hardware, with an analyze/modify loop at each stage]
Figure 2.1: Generic Design Flow
Table 2.1: Artifacts in the Design Flow
Requirements: Description of what the customer wants.
Algorithm: Functional description of computation. Probably not synthesizable. Could be a
flowchart, software, diagram, mathematical equation, etc.
High-Level Model: HDL code that is not necessarily synthesizable, but divides the algorithm into
signals and clock cycles. Possibly mixes datapath and control. In VHDL, could be a single
process that captures the behaviour of the algorithm. Usually synthesizable; resulting
hardware is usually big and slow compared to optimized RTL code.
Dataflow Diagram: A picture that depicts the datapath computation over time, clock-cycle by
clock-cycle (Section 2.6).
Hardware Block Diagram: A picture that depicts the structure of the datapath: the components
and the connections between the components (e.g., netlist or schematic).
State Machine: A picture that depicts the behaviour of the control circuitry over time (Section 2.5).
DP+Ctrl RTL Code: Synthesizable HDL code that separates the datapath and control into
separate processes and assignments.
Optimized RTL Code: HDL code that has been written to meet design goals (high performance,
low power, small, etc.).
Implementation Code: A collection of files that includes all of the information needed to build
the circuit: HDL program targeted for a particular implementation technology (e.g. a
specific FPGA chip), constraint files, script files, etc.
Note: Recommendation Spend the time up front to plan a good design on
paper. Use dataflow diagrams and state machines to predict performance and
area. The E&CE 427 project might appear to be sufficiently small and simple
that you can go straight to RTL code. However, you will probably produce
a more optimal design with less effort if you explore high-level optimizations
with dataflow diagrams and state machines.
2.3.2 Implementation Flows
Synopsys Design Compiler and FPGA Compiler are general-purpose synthesis programs. They
have very few, if any, technology-specific algorithms. Instead, they rely on libraries to describe
technology-specific parameters of the primitive building blocks (e.g. the delay and area of
individual gates, PLAs, CLBs, flops, memory arrays).
Mentor Graphics' product Leonardo Spectrum, Cadence's product BuildGates, and Synplicity's
product Synplify are similar. In comparison, Avant! (now owned by Synopsys) and Cadence sell
separate tools that do place-and-route and other low-level (physical design) tasks.
These general-purpose synthesis tools do not (generally) do the final stages of the design, such as
place-and-route and timing analysis, which are very specific to a given implementation technology.
The implementation-technology-specific tools generally also produce a VHDL file that accurately
models the chip. We will refer to this file as the implementation VHDL code.
With Synopsys and the Altera tool Quartus, we compile the VHDL code into an EDIF file for
the netlist and a TCL file for the commands to Quartus. Quartus then generates a sof (SRAM
Object File), which can be downloaded to an Altera SRAM-based FPGA. The extension of the
implementation VHDL file is often .vho, for VHDL output.
With the Synopsys and Xilinx tools, we compile VHDL code into a Xilinx-specific design file
(xnf, Xilinx netlist file). We then use the Xilinx tools to generate a bit file, which can be
downloaded to a Xilinx FPGA. The name of the implementation VHDL file is often suffixed with
routed.vhd.
Terminology: Behavioural and Structural . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Note: behavioural and structural models The phrases "behavioural model"
and "structural model" are commonly used for what we'll call high-level
models and synthesizable models. In most cases, what people call structural
code contains both structural and behavioural code. The technically correct
definition of a structural model is an HDL program that contains only
component instantiations and generate statements. Thus, even a program with
c <= a AND b; is, strictly speaking, behavioural.
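A strictly structural architecture, in that narrow sense, would look something like the sketch below. The entity small_circuit and the components and2_gate and inv_gate are hypothetical names made up for illustration; a, b, and c are assumed to be ports of the entity.
-- A purely structural architecture: only component instantiations,
-- no signal assignments and no processes.
architecture structural of small_circuit is
  component and2_gate
    port (a, b : in std_logic; y : out std_logic);
  end component;
  component inv_gate
    port (a : in std_logic; y : out std_logic);
  end component;
  signal n1 : std_logic;
begin
  u0 : and2_gate port map (a => a, b => b, y => n1);
  u1 : inv_gate  port map (a => n1, y => c);
end structural;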
2.3.3 Design Flow: Datapath vs Control vs Storage
2.3.3.1 Classes of Hardware
Each circuit tends to be dominated by either its datapath, control (state machine) or storage
(memory).
Datapath
Purpose: compute output data based on input data
Each parcel of input produces one parcel of output
Examples: arithmetic, decoders
Storage
Purpose: hold data for future use
Data is not modied while stored
Examples: register les, FIFO queues
Control
Purpose: modify internal state based on inputs, compute outputs from state and inputs
Mostly individual signals, few data (vectors)
Examples: bus arbiters, memory-controllers
All three classes of circuits (datapath, control, and storage) follow the same generic design flow
(Figure 2.1) and use dataflow diagrams, hardware block diagrams, and state machines. The
differences in the design flows appear in the relative amount of effort spent on each type of description
and the order in which the different descriptions are used. The differences are most pronounced
in the transition from the high-level model to the model that separates the datapath and control
circuitry.
2.3.3.2 Datapath-Centric Design Flow
[Figure: datapath-centric flow relating High-Level Model, Dataflow Diagram, Block Diagram, State Machine, and DP+Ctrl RTL Code, with analyze/modify loops]
Figure 2.2: Datapath-Centric Design Flow
2.3.3.3 Control-Centric Design Flow
[Figure: control-centric flow relating High-Level Model, State Machine, Dataflow Diagram, Block Diagram, and DP+Ctrl RTL Code, with analyze/modify loops]
Figure 2.3: Control-Centric Design Flow
2.3.3.4 Storage-Centric Design Flow
In E&CE 427, we won't be discussing storage-centric design. Storage-centric design differs from
datapath- and control-centric design in that storage-centric design focusses on building many
replicated copies of small cells.
Storage-centric designs include a wide range of circuits, from simple memory arrays to
complicated circuits such as register files, translation lookaside buffers, and caches. The complicated
circuits can contain large and very intricate state machines, which would benefit from some of the
techniques for control-centric circuits.
2.4 Algorithms and High-Level Models
For designs with significant control flow, algorithms can be described in software languages,
flowcharts, abstract state machines, algorithmic state machines, etc.
For designs with trivial control flow (e.g. every parcel of input data undergoes the same
computation), data-dependency graphs (section 2.4.2) are a good way to describe the algorithm.
For designs with a small amount of control flow (e.g. a microprocessor, where a single decision is
made based upon the opcode) a set of data-dependency graphs is often a good choice.
Software executes in series;
hardware executes in parallel
When creating an algorithmic description of your hardware design, think about how you can repre-
sent parallelism in the algorithmic notation that you are using, and how you can exploit parallelism
to improve the performance of your design.
2.4.1 Flow Charts and State Machines
Flow charts and various flavours of state machines are covered well in many courses. Generally,
everything that you've learned about these forms of description is also applicable in hardware
design.
In addition, you can exploit parallelism in state machine design to create communicating finite state
machines. A single complex state machine can be factored into multiple simple state machines that
operate in parallel and communicate with each other.
2.4.2 Data-Dependency Graphs
In software, the expression: (((((a + b) + c) + d) + e) + f) takes the same amount
of time to execute as: (a + b) + (c + d) + (e + f).
But, remember: hardware runs in parallel. In algorithmic descriptions, parentheses can guide
parallel vs serial execution.
Data-dependency graphs capture algorithms of datapath-centric designs.
Datapath-centric designs have few, if any, control decisions: every parcel of input data undergoes
the same computation.
                  Serial                        Parallel
Expression        (((((a+b)+c)+d)+e)+f)         (a+b)+(c+d)+(e+f)
Longest path      5 adders (slower)             3 adders (faster)
Adders used       5 (equal area)                5 (equal area)
[Figure: data-dependency graphs for the serial and parallel expressions]
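The same choice can be written directly in RTL through the parenthesization of the expression; many synthesis tools will build the adder tree that the parentheses describe, although some re-balance arithmetic on their own. A small sketch (a–f, z_serial, and z_parallel are assumed to be numeric signals of the same width):
-- Serial chain: five adders on the longest path.
z_serial   <= ((((a + b) + c) + d) + e) + f;
-- Balanced tree: three adders on the longest path.
z_parallel <= (a + b) + ((c + d) + (e + f));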
2.4.3 High-Level Models
There are many different types of high-level models, depending upon the purpose of the model
and the characteristics of the design that the model describes. Some models may capture power
consumption, others performance, others data functionality.
High-level models are used to estimate the most important design metrics very early in the design
cycle. If power consumption is more important than performance, then you might write high-
level models that can predict the power consumption of different design choices, but which have
no information about the number of clock cycles that a computation takes, or which predict the
latency inaccurately. Conversely, if performance is important, you might write clock-cycle accurate
high-level models that do not contain any information about power consumption.
Conventionally, performance has been the primary design metric. Hence, high-level models that
predict performance are more prevalent and more well understood than other types of high-level
models. There are many research and entrepreneurial opportunities for people who can develop
tools and/or languages for high-level models for estimating power, area, maximum clock speed,
etc.
In E&CE 427 we will limit ourselves to the well-understood area of high-level models for
performance prediction.
2.5 Finite State Machines in VHDL
2.5.1 Introduction to State-Machine Design
2.5.1.1 Mealy vs Moore State Machines
Moore Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Outputs are dependent upon only the state
No combinational paths from inputs to outputs
[Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0 and transitions on a and !a]
Mealy Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Outputs are dependent upon both the state and the inputs
Combinational paths from inputs to outputs
[Figure: Mealy state diagram with states s0, s1, s2, s3 and transitions labelled a/1, !a/0, and /0]
2.5.1.2 Introduction to State Machines and VHDL
A state machine is generally written as a single clocked process, or as a pair of processes, where
one is clocked and one is combinational.
Design Decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Moore vs Mealy (Sections 2.5.2 and 2.5.3)
Implicit vs Explicit (Section 2.5.1.3)
State values in explicit state machines: Enumerated type vs constants (Section 2.5.5.1)
State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.5.5)
VHDL Constructs for State Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The following VHDL control constructs are useful to steer the transition from state to state:
if ... then ... else
case
for ... loop
while ... loop
loop
next
exit
2.5.1.3 Explicit vs Implicit State Machines
There are two broad styles of writing state machines in VHDL: explicit and implicit. Explicit
and implicit refer to whether there is an explicit state signal in the VHDL code. Explicit state
machines have a state signal in the VHDL code. Implicit state machines do not contain a state
signal. Instead, they use VHDL processes with multiple wait statements to control the execution.
In the explicit style of writing state machines, each process has at most one wait statement. For
the explicit style of writing state machines, there are two sub-categories: current state and cur-
rent+next state.
In the explicit-current style of writing state machines, the state signal represents the current state
of the machine and the signal is assigned its next value in a clocked process.
In the explicit-current+next style, there is a signal for the current state and another signal for the
next state. The next-state signal is assigned its value in a combinational process or concurrent state-
ment and is dependent upon the current state and the inputs. The current-state signal is assigned
its value in a clocked process and is just a flopped copy of the next-state signal.
For the implicit style of writing state machines, the synthesis program adds an implicit register to
hold the state signal and combinational circuitry to update the state signal. In Synopsys synthesis
tools, the state signal defined by the synthesizer is named multiple_wait_state_reg.
In Mentor Graphics, the state signal is named STATE_VAR.
We can think of implicit state machines as having 0 state signals, explicit-current state machines
as having 1 state signal, and explicit-current+next state machines as having 2 state signals.
As with all topics in E&CE 427, there are tradeoffs between these different styles of writing state
machines. Most books teach only the explicit-current+next style. This style is the closest to
the hardware, which means that it is more amenable to optimization through human intervention,
rather than relying on a synthesis tool for optimization. The advantage of the implicit style is
that it is concise and readable for control flows consisting of nested loops and branches (e.g.
the type of control flow that appears in software). For control flows that have less structure, it
can be difficult to write an implicit state machine. Very few books or synthesis manuals describe
multiple-wait statement processes, but they are relatively well supported among synthesis tools.
Because implicit state machines are written with loops, if-then-elses, cases, etc., it is difficult to
write some state machines with complicated control flows in an implicit style. The following
example illustrates the point.
[Figure: state diagram with states s0/0, s1/1, s2/0, s3/0 and transitions on a and !a whose control flow does not follow a simple nested-loop structure]
Note: The terminology of explicit and implicit is somewhat standard,
in that some descriptions of processes with multiple wait statements describe
the processes as having implicit state machines.
There is no standard terminology to distinguish between the two explicit styles:
explicit-current+next and explicit-current.
2.5.2 Implementing a Simple Moore Machine
[Figure: state diagram for simple: from s0/0, a leads to s1/1 and !a leads to s2/0; both s1 and s2 go to s3/0, which returns to s0]
entity simple is
port (
a, clk : in std_logic;
z : out std_logic
);
end simple;
2.5.2.1 Implicit Moore State Machine
architecture moore_implicit of simple is
begin
process
begin
z <= '0';
wait until rising_edge(clk);
if (a = '1') then
z <= '1';
else
z <= '0';
end if;
wait until rising_edge(clk);
z <= '0';
wait until rising_edge(clk);
end process;
end moore_implicit;
Flops
Gates
Delay
2.5.2.2 Explicit Moore with Flopped Output
architecture moore_explicit_v1 of simple is
type state_ty is (s0, s1, s2, s3);
signal state : state_ty;
begin
process (clk)
begin
if rising_edge(clk) then
case state is
when s0 =>
if (a = '1') then
state <= s1;
z <= '1';
else
state <= s2;
z <= '0';
end if;
when s1 | s2 =>
state <= s3;
z <= '0';
when s3 =>
state <= s0;
z <= '1';
end case;
end if;
end process;
end moore_explicit_v1;
Flops
Gates
Delay
2.5.2.3 Explicit Moore with Combinational Outputs
architecture moore_explicit_v2 of simple is
type state_ty is (s0, s1, s2, s3);
signal state : state_ty;
begin
process (clk)
begin
if rising_edge(clk) then
case state is
when s0 =>
if (a = '1') then
state <= s1;
else
state <= s2;
end if;
when s1 | s2 =>
state <= s3;
when s3 =>
state <= s0;
end case;
end if;
end process;
z <= '1' when (state = s1)
else '0';
end moore_explicit_v2;
Flops
Gates
Delay
2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment
architecture moore_explicit_v3 of simple is
type state_ty is (s0, s1, s2, s3);
signal state, state_nxt : state_ty;
begin
process (clk)
begin
if rising_edge(clk) then
state <= state_nxt;
end if;
end process;
state_nxt <= s1 when (state = s0) and (a = '1')
else s2 when (state = s0) and (a = '0')
else s3 when (state = s1) or (state = s2)
else s0;
z <= '1' when (state = s1)
else '0';
end moore_explicit_v3;
Flops
Gates
Delay
The hardware synthesized from this architecture is the same as that synthesized from moore_explicit_v2,
which is written in the explicit-current style.
2.5.2.5 Explicit-Current+Next Moore with Combinational Process
architecture moore_explicit_v4 of simple is
type state_ty is (s0, s1, s2, s3);
signal state, state_nxt : state_ty;
begin
process (clk)
begin
if rising_edge(clk) then
state <= state_nxt;
end if;
end process;
process (state, a)
begin
case state is
when s0 =>
if (a = '1') then
state_nxt <= s1;
else
state_nxt <= s2;
end if;
when s1 | s2 =>
state_nxt <= s3;
when s3 =>
state_nxt <= s0;
end case;
end process;
z <= '1' when (state = s1)
else '0';
end moore_explicit_v4;
For this architecture, we change the selected assignment to state into a combinational process
using a case statement.
Flops
Gates
Delay
The hardware synthesized from this architecture is the same as that synthesized from
moore_explicit_v2 and v3.
2.5.3 Implementing a Simple Mealy Machine
Mealy machines have a combinational path from inputs to outputs, which often violates good
coding guidelines for hardware. Thus, Moore machines are much more common. You should
know how to write a Mealy machine if needed, but most of the state machines that you design will
be Moore machines.
This is the same entity as for the simple Moore state machine. The behaviour of the Mealy machine
is the same as the Moore machine, except for the timing relationship between the output (z) and
the input (a).
[Figure: Mealy state diagram for simple: the transitions out of s0 are labelled a/1 and !a/0, and the remaining transitions output 0]
entity simple is
port (
a, clk : in std_logic;
z : out std_logic
);
end simple;
2.5.3.1 Implicit Mealy State Machine
Note: An implicit Mealy state machine is nonsensical.
In an implicit state machine, we do not have a state signal. But, as the example below illustrates,
to create a Mealy state machine we must have a state signal.
An implicit style is a nonsensical choice for Mealy state machines. Because the output is
dependent upon the input in the current clock cycle, the output cannot be a flop. For the output to be
combinational and dependent upon both the current state and the current input, we must create a
state signal that we can read in the assignment to the output. Creating a state signal obviates the
advantages of using an implicit style of state machine.
architecture implicit_mealy of simple is
type state_ty is (s0, s1, s2, s3);
signal state : state_ty;
begin
process
begin
state <= s0;
wait until rising_edge(clk);
if (a = '1') then
state <= s1;
else
state <= s2;
end if;
wait until rising_edge(clk);
state <= s3;
wait until rising_edge(clk);
end process;
z <= '1' when (state = s0) and a = '1'
else '0';
end implicit_mealy;
Flops
Gates
Delay
2.5.3.2 Explicit Mealy State Machine
architecture mealy_explicit of simple is
type state_ty is (s0, s1, s2, s3);
signal state : state_ty;
begin
process (clk)
begin
if rising_edge(clk) then
case state is
when s0 =>
if (a = '1') then
state <= s1;
else
state <= s2;
end if;
when s1 | s2 =>
state <= s3;
when others =>
state <= s0;
end case;
end if;
end process;
z <= '1' when (state = s0) and a = '1'
else '0';
end mealy_explicit;
Flops
Gates
Delay
2.5.3.3 Explicit-Current+Next Mealy
architecture mealy_explicit_v2 of simple is
type state_ty is (s0, s1, s2, s3);
signal state, state_nxt : state_ty;
begin
process (clk)
begin
if rising_edge(clk) then
state <= state_nxt;
end if;
end process;
state_nxt <= s1 when (state = s0) and a = '1'
else s2 when (state = s0) and a = '0'
else s3 when (state = s1) or (state = s2)
else s0;
z <= '1' when (state = s0) and a = '1'
else '0';
end mealy_explicit_v2;
Flops
Gates
Delay
2.5.4 Reset
All circuits should have a reset signal that puts the circuit back into a good initial state. However,
not all flip-flops within the circuit need to be reset. In a circuit that has a datapath and a state
machine, the state machine will probably need to be reset, but the datapath may not need to be reset.
There are standard ways to add a reset signal to both explicit and implicit state machines.
It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or
your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.
Reset with Implicit State Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
With an implicit state machine, we need to insert a loop in the process and test for reset after each
wait statement.
Here is the implicit Moore machine from section 2.5.2.1 with reset code added in bold.
architecture moore_implicit of simple is
begin
process
begin
init : loop -- outermost loop
z <= '0';
wait until rising_edge(clk);
next init when (reset = '1'); -- test for reset
if (a = '1') then
z <= '1';
else
z <= '0';
end if;
wait until rising_edge(clk);
next init when (reset = '1'); -- test for reset
z <= '0';
wait until rising_edge(clk);
next init when (reset = '1'); -- test for reset
end loop init;
end process;
end moore_implicit;
Reset with Explicit State Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Reset is often easier to include in an explicit state machine, because we need only put a test for
reset = '1' in the clocked process for the state.
The pattern for an explicit-current style of machine is:
process (clk) begin
if rising_edge(clk) then
if reset = '1' then
state <= S0;
else
if ... then
state <= ...;
elsif ... then
... -- more tests and assignments to state
end if;
end if;
end if;
end process;
Applying this pattern to the explicit Moore machine from section 2.5.2.3 produces:
architecture moore_explicit_v2 of simple is
type state_ty is (s0, s1, s2, s3);
signal state : state_ty;
begin
process (clk)
begin
if rising_edge(clk) then
if (reset = '1') then
state <= s0;
else
case state is
when s0 =>
if (a = '1') then
state <= s1;
else
state <= s2;
end if;
when s1 | s2 =>
state <= s3;
when s3 =>
state <= s0;
end case;
end if;
end if;
end process;
z <= '1' when (state = s1)
else '0';
end moore_explicit_v2;
The pattern for an explicit-current+next style is:
process (clk) begin
if rising_edge(clk) then
if reset = '1' then
state_cur <= reset state;
else
state_cur <= state_nxt;
end if;
end if;
end process;
2.5.5 State Encoding
When working with explicit state machines, we must address the issue of state encoding: what
bit-vector value to associate with each state?
With implicit state machines, we do not need to worry about state encoding. The synthesis program
determines the number of states and the encoding for each state.
2.5.5.1 Constants vs Enumerated Type
Using an enumerated type, the synthesis tools chooses the encoding:
type state_ty is (s0, s1, s2, s3);
signal state : state_ty;
Using constants, we choose the encoding:
subtype state_ty is std_logic_vector(1 downto 0);
constant s0 : state_ty := "11";
constant s1 : state_ty := "10";
constant s2 : state_ty := "00";
constant s3 : state_ty := "01";
signal state : state_ty;
Providing Encodings for Enumerated Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Many synthesizers allow the user to provide hints on how to encode the states, or allow the user to
provide explicitly the desired encoding. These hints are done either through VHDL attributes
or special comments in the code.
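For example, several synthesis tools (Synopsys among them) recognize an enum_encoding attribute. Whether the attribute is honoured, and its exact name, are tool-dependent, so treat the fragment below as a sketch and check your synthesizer's documentation.
type state_ty is (s0, s1, s2, s3);
-- Hint to the synthesis tool: encode s0, s1, s2, s3 with the listed codes.
attribute enum_encoding : string;
attribute enum_encoding of state_ty : type is "11 10 00 01";
signal state : state_ty;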
Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
When doing functional simulation with enumerated types, simulators often display waveforms
with pretty-printed values rather than bits (e.g. s0 and s1 rather than "11" and "10"). However,
when simulating a design that has been mapped to gates, the enumerated type disappears and you
are left with just bits. If you don't know the encoding that the synthesis tool chose, it can be very
difficult to debug the design.
However, this opens you up to potential bugs if the enumerated type you are testing grows to
include more values, which then end up unintentionally executing your when others branch,
rather than having a special branch of their own in the case statement.
Unused Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
If the number of values you have in your datatype is not a power of two, then you will have some
unused values that are representable.
For example:
subtype state_ty is std_logic_vector(2 downto 0);
constant s0 : state_ty := "011";
constant s1 : state_ty := "000";
constant s2 : state_ty := "001";
constant s3 : state_ty := "010";
constant s4 : state_ty := "101";
signal state : state_ty;
This type only needs five unique values, but can represent eight different values. What should we
do with the three representable values that we don't need? The safest thing to do is to code your
design so that if an illegal value is encountered, the machine resets or enters an error state.
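One sketch of that approach, reusing the constants above and assuming the usual next-state logic drives a signal state_nxt, is to override the state register whenever it holds an encoding that is not one of the five legal values:
process (clk) begin
  if rising_edge(clk) then
    if state /= s0 and state /= s1 and state /= s2 and
       state /= s3 and state /= s4 then
      state <= s0;        -- illegal encoding: recover to the reset state
    else
      state <= state_nxt; -- normal operation
    end if;
  end if;
end process;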
2.5.5.2 Encoding Schemes
Binary: Conventional binary counter.
One-hot: Exactly one bit is asserted at any time.
Modified one-hot: Altera's Quartus synthesizer generates an almost-one-hot encoding where the
bit representing the reset state is inverted. This means that the reset state is all 0s and all other
states have two 1s: one for the reset state and one for the current state.
Gray: Transition between adjacent values requires exactly one bit flip.
Custom: Choose encoding to simplify combinational logic for a specific task.
Tradeoffs in Encoding Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Gray is good for low-power applications where consecutive data values typically differ by 1 (e.g.
no random jumps).
One-hot usually has less combinational logic and runs faster than binary for machines with up
to a dozen or so states. With more than a dozen states, the extra flip-flops required by one-hot
encoding become too expensive.
Custom is great if you have lots of time and are incredibly intelligent, or have deep insight into
the guts of your design.
Note: Don't care values When we don't care what the value of a signal is, we
assign the signal '-', which is "don't care" in VHDL. This should allow the
synthesis tool to use whatever value is most helpful in simplifying the Boolean
equations for the signal (e.g. Karnaugh maps). In the past, some groups in
E&CE 427 have used '-' quite successfully to decrease the area of their design.
However, a few groups found that using '-' increased the size of their design,
when they were expecting it to decrease the size. So, if you are tweaking your
design to squeeze out the last few unneeded FPGA cells, pay close attention as
to whether using '-' hurts or helps.
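As a small sketch of how '-' is typically used, an output that only matters in some states can be written so that the remaining states are left to the tool (z is assumed to be a std_logic signal):
-- z must be '1' in s1 and '0' in s2; in every other state the synthesis
-- tool may choose whichever value gives the smallest logic.
z <= '1' when (state = s1)
else '0' when (state = s2)
else '-';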
2.6 Dataflow Diagrams
2.6.1 Dataflow Diagrams Overview
Dataflow diagrams are data-dependency graphs where the computation is divided into clock
cycles.
Purpose:
Provide a disciplined approach for designing datapath-centric circuits
Guide the design from algorithm, through high-level models, and finally to register transfer
level code for the datapath and control circuitry.
Estimate area and performance
Make tradeoffs between different design options
Background
Based on techniques from high-level synthesis tools
Some similarity between high-level synthesis and software compilation
Each dataflow diagram corresponds to a basic block in software compiler terminology.
[Figure: data-dependency graph for z = a + b + c + d + e + f, a chain of five adders with intermediate signals x1 to x4]
[Figure: dataflow diagram for z = a + b + c + d + e + f, the same chain of adders divided into clock cycles]
[Figure: dataflow diagram for z = a + b + c + d + e + f; horizontal lines mark clock cycle boundaries]
The use of memory arrays in dataflow diagrams is described in section 2.7.4.
2.6.2 Dataflow Diagrams, Hardware, and Behaviour
Primary Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram, hardware, and waveform for a primary input: i drives x directly]
Register Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram, hardware, and waveform for a register input: i is registered to produce x]
Register Signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram, hardware, and waveform for a register signal: x is the registered sum of i1 and i2]
Combinational-Component Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram, hardware, and waveform for a combinational-component output: x is the unregistered sum of i1 and i2]
2.6.3 Area Estimation
The maximum number of blocks of a component in any clock cycle is the total number of that
component that is needed.
The maximum number of signals that cross a cycle boundary is the total number of registers that
are needed.
The maximum number of unconnected signal tails in a clock cycle is the total number of inputs
that are needed.
The maximum number of unconnected signal heads in a clock cycle is the total number of outputs
that are needed.
The information above is only for estimating the number of components that are needed. In fact,
these estimates give lower bounds. There might be constraints on your design that will force you
to use more components (e.g., you might need to read all of your inputs at the same time).
Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath
components, might force you to make tradeoffs that increase the number of datapath components
to decrease the overall area of the circuit.
Of particular relevance to FPGAs:
With some FPGA chips, a 2:1 multiplexer has the same area as an adder.
With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell
per bit.
In FPGAs, registers are usually free, in that the area consumed by a circuit is limited by the
amount of combinational logic, not the number of flip-flops.
In comparison, with ASICs and custom VLSI, 2:1 multiplexers are much smaller than adders, and
registers are quite expensive in area.
2.6.4 Dataflow Diagram Execution
Execution with Registers on Both Inputs and Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram for z = a + b + c + d + e + f with registers on both inputs and outputs, and its clock-by-clock execution waveform (cycles 0 to 6)]
Execution Without Output Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: the same dataflow diagram executed without output registers, and its execution waveform (cycles 0 to 5)]
2.6.5 Performance Estimation
Performance Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Performance = 1 / TimeExec
TimeExec = Latency × ClockPeriod
Latency = Number of clock cycles from inputs to outputs
There is much more information on performance in Chapter 4, which is devoted to performance.
Performance of Dataflow Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Latency: count horizontal lines in diagram
Min clock period (Max clock speed) limited by longest path in a clock cycle
2.6.6 Design Analysis
[Figure: dataflow diagram for z = a + b + c + d + e + f, one add per clock cycle, intermediate signals x1 to x4]
num inputs         6
num outputs        1
num registers      6
num adders         1
min clock period   delay through a flop and one adder
latency            6 clock cycles
2.6.7 Area / Performance Tradeoffs
[Figure: dataflow diagrams compared: one add per clock cycle (cycles 0 to 6) versus two adds per clock cycle (cycles 0 to 4)]
Note: In the Two-add design, half of the last clock cycle is wasted.
Two Adds per Clock Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram and execution waveform for the two-adds-per-clock-cycle design]
Design Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagrams for the one-add-per-clock-cycle and two-adds-per-clock-cycle designs]
                One add per clock cycle   Two adds per clock cycle
inputs          6                         6
outputs         1                         1
registers       6                         6
adders          1                         2
clock period    flop + 1 add              flop + 2 adds
latency         6                         4
Question: Under what circumstances would each design option be fastest?
Answer:
time = latency * clock period
compare execution times for both options
T1 = 6(Tf + Ta)
T2 = 4(Tf + 2Ta)
One-add will be faster when T1 < T2:
6(Tf + Ta) < 4(Tf + 2Ta)
6Tf + 6Ta < 4Tf + 8Ta
2Tf < 2Ta
Tf < Ta
Sanity check: If an add is slower than a flop, then we want to minimize the number of
adds. One-add has fewer adds, so one-add will be faster when an add is slower
than a flop.
2.7 Memory Arrays and RTL Design
2.7.1 Memory Operations
Read of Memory with Registered Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram, hardware, and timing for a read of memory M with registered inputs (signals clk, we, a, do)]
Write to Memory with Registered Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram, hardware, and timing for a write to memory M with registered inputs (signals clk, we, a, di, do)]
Dual-Port Memory with Registered Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram, hardware, and timing for a dual-port memory with registered inputs: port 0 writes (a0, di0, we, do0) while port 1 reads (a1, do1)]
Sequence of Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram and timing for a sequence of memory operations: a write followed by reads on both ports of the dual-port memory]
2.7.2 Memory Arrays in VHDL
2.7.2.1 Using a Two-Dimensional Array for Memory
A memory array can be written in VHDL as a two-dimensional array:
subtype data is std_logic_vector(7 downto 0);
type data_vector is array( natural range <> ) of data;
signal mem : data_vector(31 downto 0);
These two-dimensional arrays can be useful in high-level models and in specifications. However,
it is possible to write code using a two-dimensional array that cannot be synthesized. Also, some
synthesis tools (including Synopsys Design Compiler and FPGA Compiler) will synthesize
two-dimensional arrays very inefficiently.
The example below illustrates: lack of interface protocol, combinational write, multiple write
ports, multiple read ports.
architecture main of mem_not_hw is
subtype data is std_logic_vector(7 downto 0);
type data_vector is array( natural range <> ) of data;
signal mem : data_vector(31 downto 0);
begin
y <= mem( a );
mem( a ) <= b; -- comb write
process (clk) begin
if rising_edge(clk) then
mem( c ) <= w; -- write port #1
end if;
end process;
process (clk) begin
if rising_edge(clk) then
mem( d ) <= v; -- write port #2
end if;
end process;
u <= mem( e ); -- read port #2
end main;
2.7.2.2 Memory Arrays in Hardware
Most simple memory arrays are single- or dual-ported, support just one write operation at a
time, and have an interface protocol using a clock and write-enable.
[Figure: symbols for a single-port memory (WE, A, DI, DO) and a dual-port memory (WE, A0, DI0, DO0, A1, DO1)]
2.7.2.3 VHDL Code for Single-Port Memory Array
package mem_pkg is
subtype data is std_logic_vector(7 downto 0);
type data_vector is array( natural range <> ) of data;
end;
entity mem is
port (
clk : in std_logic;
we : in std_logic; -- write enable
a : in unsigned(4 downto 0); -- address
di : in data; -- data_in
do : out data -- data_out
);
end mem;
architecture main of mem is
signal mem : data_vector(31 downto 0);
begin
do <= mem( to_integer( a ) );
process (clk) begin
if rising_edge(clk) then
if we = '1' then
mem( to_integer( a ) ) <= di;
end if;
end if;
end process;
end main;
The VHDL code above is accurate in its behaviour and interface, but might be synthesized as
distributed memory (a large number of flip-flops in FPGA cells), which will be very large and very
slow in comparison to a block of memory.
Synopsys synthesis tools implement each bit in a two-dimensional array as a flip-flop.
Each FPGA and ASIC vendor supplies libraries of memory arrays that are smaller and faster than
a two-dimensional array of flip-flops. These libraries exploit specialized hardware on the chips to
implement the memory.
Note: To synthesize a reasonable implementation of a memory array with
Synopsys, you must instantiate a vendor-supplied memory component.
Some other synthesis tools, such as Xilinx XST, can infer memory arrays from two-dimensional
arrays and synthesize efficient implementations.
Recommended Design Process with Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1. high-level model with two-dimensional array
2. two-dimensional array packaged inside memory entity/architecture
3. vendor-supplied component
2.7.2.4 Using Library Components for Memory
Altera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Altera uses MegaFunctions to implement RAM in VHDL. A MegaFunction is a black-box de-
scription of hardware on the FPGA. There are tools in Quartus to generate VHDL code for RAM
components of different sizes. In E&CE 427 we will provide you with the VHDL code for the
RAM components that you will need in Lab-3 and the Project.
The APEX20KE chips that we are using have dedicated SRAM blocks called Embedded System
Blocks (ESBs). Each ESB can store 2048 bits and can be configured in any of the following sizes:
Number of Elements    Word Size (bits)
2048                   1
1024                   2
 512                   4
 256                   8
 128                  16
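For example, a 256×8 configuration uses all 2048 bits of one ESB, so a 512×8 memory would need to be built from two ESBs using the slicing techniques of Section 2.7.2.5.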
Xilinx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Use component instantiation to get these components
ram16x1s: 16×1 single-ported memory
ram16x1d: 16×1 dual-ported memory
Other sizes are also available; consult the datasheet for your chip.
2.7.2.5 Build Memory from Slices
If the vendors libraries of memory components do not include one that is the correct size for your
needs, you can construct your own component from smaller ones.
[Figure 2.4: An N×2W memory from N×W components. Two N×W blocks share WriteEn, Addr, and Clk; one stores DataIn[W-1..0] and drives DataOut[W-1..0], the other stores DataIn[2W-1..W] and drives DataOut[2W-1..W].]
[Figure 2.5: A 2N×W memory from N×W components. Two N×W blocks share DataIn and Clk; Addr[logN-1..0] addresses both blocks, and Addr[logN] selects which block is write-enabled and which block's output drives DataOut through a mux.]
A 16×4 Memory from 16×1 Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity ram16x4s is
port (
clk, we : in std_logic;
data_in : in std_logic_vector(3 downto 0);
addr : in unsigned(3 downto 0);
data_out : out std_logic_vector(3 downto 0)
);
end ram16x4s;
architecture main of ram16x4s is
component ram16x1s
port (d : in std_logic; -- data in
a3, a2, a1, a0 : in std_logic; -- address
we : in std_logic; -- write enable
wclk : in std_logic; -- write clock
o : out std_logic -- data out
);
end component;
begin
mem_gen:
for i in 0 to 3 generate
ram : ram16x1s
port map (
we => we,
wclk => clk,
----------------------------------------------
-- d and o are dependent on i
a3 => addr(3), a2 => addr(2),
a1 => addr(1), a0 => addr(0),
d => data_in(i),
o => data_out(i)
----------------------------------------------
);
end generate;
end main;
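The same idea scales in depth as well as width. The sketch below applies the Figure 2.5 scheme to the ram16x4s component above to build a 32-word memory; the entity name ram32x4s and the internal signal names are made up for illustration.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ram32x4s is
  port (
    clk, we  : in  std_logic;
    data_in  : in  std_logic_vector(3 downto 0);
    addr     : in  unsigned(4 downto 0);
    data_out : out std_logic_vector(3 downto 0)
  );
end ram32x4s;

architecture main of ram32x4s is
  component ram16x4s
    port (
      clk, we  : in  std_logic;
      data_in  : in  std_logic_vector(3 downto 0);
      addr     : in  unsigned(3 downto 0);
      data_out : out std_logic_vector(3 downto 0)
    );
  end component;
  signal we_lo, we_hi : std_logic;
  signal do_lo, do_hi : std_logic_vector(3 downto 0);
begin
  -- write only the slice selected by the address MSB
  we_lo <= we and not addr(4);
  we_hi <= we and addr(4);

  lo : ram16x4s
    port map (clk => clk, we => we_lo, data_in => data_in,
              addr => addr(3 downto 0), data_out => do_lo);
  hi : ram16x4s
    port map (clk => clk, we => we_hi, data_in => data_in,
              addr => addr(3 downto 0), data_out => do_hi);

  -- read mux selected by the address MSB
  data_out <= do_lo when addr(4) = '0' else do_hi;
end main;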
2.7.2.6 Dual-Ported Memory
Dual ported memory is similar to single ported memory, except that it allows two simultaneous
reads, or a simultaneous read and write.
When doing a simultaneous read and write to the same address, the read will usually not see the
data currently being written.
Question: Why do dual-ported memories usually not support writes on both ports?
Answer:
What should your memory do if you write different values to the same
address in the same clock cycle?
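For reference, a dual-ported memory in the style of the single-port code from Section 2.7.2.3 can be sketched as follows. Port 0 is read/write and port 1 is read-only, matching the WE/A0/DI0/DO0/A1/DO1 block diagram above; the entity name mem_dp is made up for illustration, and real vendor components may differ in interface and read timing.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.mem_pkg.all;

entity mem_dp is
  port (
    clk : in  std_logic;
    we  : in  std_logic;
    a0  : in  unsigned(4 downto 0);  -- read/write port
    di0 : in  data;
    do0 : out data;
    a1  : in  unsigned(4 downto 0);  -- read-only port
    do1 : out data
  );
end mem_dp;

architecture main of mem_dp is
  signal mem : data_vector(31 downto 0);
begin
  -- combinational reads on both ports
  do0 <= mem( to_integer( a0 ) );
  do1 <= mem( to_integer( a1 ) );
  -- single clocked write port
  process (clk) begin
    if rising_edge(clk) then
      if we = '1' then
        mem( to_integer( a0 ) ) <= di0;
      end if;
    end if;
  end process;
end main;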
2.7.3 Data Dependencies
Definition of Three Types of Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
There are three types of data dependencies. The names come from pipeline terminology in com-
puter architecture.
Read after Write         Write after Write        Write after Read
(True dependency)        (Load dependency)        (Anti dependency)
M[i] := ...              M[i] := ...              ...  := M[i]
...  := M[i]             M[i] := ...              M[i] := ...
Instructions in a program can be reordered, so long as the data dependencies are preserved.
Purpose of Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: a producer write W1 (R3 := ......) and a consumer read R1 (... := ... R3 ...), together with an earlier write W0 and a later write W2 to R3.]
RAW ordering prevents R1 from happening before W1.
WAW ordering prevents W0 from happening after W1.
WAR ordering prevents W2 from happening before R1.
Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose
in ensuring that producer-consumer relationships are preserved.
Ordering of Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: dataflow diagram for the program below, with initial memory contents M[3]=30, M[2]=20, M[1]=10, M[0]=0.]
Initial Program with Dependencies
M[2] := 21
M[3] := 31
A := M[2]
B := M[0]
M[3] := 32
M[0] := 01
C := M[3]
[Figure: two reorderings of the program: a valid modification that preserves the data dependencies, and a second modification (valid or bad?) in which C := M[3] is moved ahead of the write M[3] := 32.]
Answer:
Bad modification: M[3] := 32 must happen before C := M[3].
2.7.4 Memory Arrays and Dataflow Diagrams
Legend for Dataflow Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Legend: symbols for input port, output port, state signal, array read (name(rd)), and array write (name(wr)).]
Basic Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: the two basic memory operations.
Memory Read: data := mem[addr]; mem(rd) consumes mem and addr, produces data, and has an anti-dependency arrow producing mem.
Memory Write: mem[addr] := data; mem(wr) consumes mem, addr, and data, and produces mem.]
Dataflow diagrams show the dependencies between operations. The basic memory operations are
similar, in that each arrow represents a data dependency.
There are a few aspects of the basic memory operations that are potentially surprising:
The anti-dependency arrow producing mem on a read.
Reads and writes are dependent upon the entire previous value of the memory array.
The write operation appears to produce an entire memory array, rather than just updating an
individual element of an existing array.
Normally, we think of a memory array as stationary. To do a read, an address is given to the array
and the corresponding data is produced. In dataflow diagrams, it may be somewhat surprising to
see the read and write operations consuming and producing memory arrays.
Our goal is to support memory operations in dataflow diagrams. We want to model memory operations
similarly to datapath operations. When we do a read, the data that is produced is dependent
upon the contents of the memory array and the address. For write operations, the apparent dependency
on, and production of, an entire memory array is because we do not know which address
in the array will be read from or written to. The anti-dependency for memory reads is related to
Write-after-Read dependencies, as discussed in Section 2.7.3. There are optimizations that can be
performed when we know the address (Section 2.7.4).
Dataflow Diagrams and Data Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Algo: mem[wr_addr] := data_in;
      data_out := mem[rd_addr];
[Dataflow diagram]
Read after Write

Algo: mem[wr_addr] := data_in;
      data_out := mem[rd_addr];
[Dataflow diagram]
Optimization when rd_addr ≠ wr_addr

Algo: mem[wr1_addr] := data1;
      mem[wr2_addr] := data2;
[Dataflow diagram]
Write after Write

Algo: mem[wr1_addr] := data1;
      mem[wr2_addr] := data2;
[Dataflow diagram]
Scheduling option when wr1_addr ≠ wr2_addr

Algo: rd_data := mem[rd_addr];
      mem[wr_addr] := wr_data;
[Dataflow diagram]
Write after Read

Algo: rd_data := mem[rd_addr];
      mem[wr_addr] := wr_data;
[Dataflow diagram]
Optimization when rd_addr ≠ wr_addr
2.7.5 Example: Memory Array and Dataflow Diagram
[Figure 2.6: Memory array example code and initial dataflow diagram. The program is the one from Section 2.7.3 (M[2]:=21; M[3]:=31; A:=M[2]; B:=M[0]; M[3]:=32; M[0]:=01; C:=M[3]); the operations are numbered 1-7 and every read and write is linked by dependency and anti-dependency arrows.]
The dependency and anti-dependency arrows in the dataflow diagram in Figure 2.6 are based solely
upon whether an operation is a read or a write. The arrows do not take into account the address
that is read from or written to.
In Figure 2.7, we have used knowledge about which addresses we are accessing to remove unneeded
dependencies. These are the real dependencies and match those shown in the code fragment for
Figure 2.6. In Figure 2.8 we have placed an ordering on the read operations and an ordering on the
write operations. The ordering is derived by obeying data dependencies and then rearranging the
operations to perform as many operations in parallel as possible.
[Figure 2.7: Memory array with minimal dependencies.]
[Figure 2.8: Memory array with orderings.]
[Figure 2.9: Final version of Figure 2.6.]
Put as many parallel operations into the same clock cycle as allowed by resources (one write + one
read, two reads, or one write for dual-port RAM). Preserve dependencies by putting dependent operations
in separate clock cycles.
2.8 Input / Output Protocols
An important aspect of hardware design is choosing an input/output protocol that is easy to implement
and suits both your circuit and your environment. Here are a few simple and common
protocols.
[Figure 2.10: Four-phase handshaking protocol (signals rdy, data, ack).]
Used when timing of communication between producer and consumer is unpredictable. The dis-
advantage is that it is cumbersome to implement and slow to execute.
[Figure 2.11: Valid-bit protocol (signals clk, data, valid).]
A low overhead (both in area and performance) protocol. Consumer must always be able to accept
incoming data. Often used in pipelined circuits. More complicated versions of the protocol can
handle pipeline stalls.
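As a minimal sketch of the valid-bit protocol, the pipeline stage below registers the data and its valid flag together. The entity name valid_stage, the 8-bit data width, and the increment used as the stage's computation are all made up for illustration.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity valid_stage is
  port (
    clk       : in  std_logic;
    valid_in  : in  std_logic;
    data_in   : in  unsigned(7 downto 0);
    valid_out : out std_logic;
    data_out  : out unsigned(7 downto 0)
  );
end valid_stage;

architecture main of valid_stage is
begin
  process (clk) begin
    if rising_edge(clk) then
      valid_out <= valid_in;          -- the valid flag travels with the data
      if valid_in = '1' then
        data_out <= data_in + 1;      -- stand-in for the stage's real computation
      end if;
    end if;
  end process;
end main;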
[Figure 2.12: Start/Done protocol (signals clk, data_in, start, done, data_out).]
A low overhead (both in area and performance) protocol. Useful when a circuit works on one piece
of data at a time and the time to compute the result is unpredictable.
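A start/done circuit can be sketched as a small counter-based state machine: capture the operand when start is asserted, work for some number of cycles, then pulse done for one cycle. The entity name start_done, the data width, the fixed cycle count, and the increment used as a stand-in computation are all made up for illustration.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity start_done is
  port (
    clk, reset : in  std_logic;
    start      : in  std_logic;
    data_in    : in  unsigned(7 downto 0);
    done       : out std_logic;
    data_out   : out unsigned(7 downto 0)
  );
end start_done;

architecture main of start_done is
  signal busy  : std_logic;
  signal count : unsigned(3 downto 0);
  signal acc   : unsigned(7 downto 0);
begin
  process (clk) begin
    if rising_edge(clk) then
      done <= '0';                       -- done is a one-cycle pulse
      if reset = '1' then
        busy <= '0';
      elsif busy = '0' then
        if start = '1' then
          busy  <= '1';
          count <= to_unsigned(9, 4);    -- work for a fixed number of cycles
          acc   <= data_in;
        end if;
      else
        acc <= acc + 1;                  -- stand-in for the real computation
        if count = 0 then
          busy <= '0';
          done <= '1';                   -- result is ready
        else
          count <= count - 1;
        end if;
      end if;
    end if;
  end process;
  data_out <= acc;
end main;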
2.9 Design Example: Massey
We'll go through the following artifacts:
1. requirements
2. algorithm
3. dataow diagram
4. high-level models
5. hardware block diagram
6. RTL code for datapath
7. state machine
8. RTL code for control
Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1. Scheduling (allocate operations to clock cycles)
2. I/O allocation
3. First high-level model
4. Register allocation
5. Datapath allocation
6. Connect datapath components, insert muxes where needed
7. Design implicit state machine
8. Optimize
9. Design explicit-current state machine
10. Optimize
2.9.1 Requirements
Functional requirements:
Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f
Use registers on both inputs and outputs
Performance requirements:
Maximum clock period: unlimited
Maximum latency: four
Cost requirements:
Maximum of two adders
Small miscellaneous hardware (e.g. muxes) is unlimited
Maximum of three inputs and one output
Design effort is unlimited
Note: In reality multiplexers are not free. In FPGAs, a 2:1 mux is more ex-
pensive than a full-adder. A 2:1 mux has three inputs while an adder has only
two inputs (the carry-in and carry-out signals usually use the special verti-
cal connections on the FPGA cell). In FPGAs, sharing an adder between two
signals can be more expensive than having two adders. In a generic-gate
technology, a multiplexor contains three two-input gates, while a full-adder
contains fourteen two-input gates.
2.9.2 Algorithm
We'll use parentheses to group operations so as to maximize our opportunities to perform the work
in parallel:
z = (a + b) + (c + d) + (e + f)
This results in the following data-dependency graph:
[Figure: data-dependency graph for (a + b) + (c + d) + (e + f).]
2.9.3 Initial Dataflow Diagram
[Figure: initial dataflow diagram; all six inputs are read in the first clock cycle.]
This dataflow diagram violates the requirement to use at most three inputs.
2.9.4 Dataflow Diagram Scheduling
We can potentially optimize the inputs, outputs, area, and performance of a dataflow diagram by
rescheduling the operations, that is, allocating the operations to different clock cycles.
Parallel algorithms have higher performance and greater scheduling flexibility than serial algorithms.
Serial algorithms tend to have less area than parallel algorithms.
Serial: (((((a+b)+c)+d)+e)+f)          Parallel: (a+b)+(c+d)+(e+f)
[Figure: data-dependency graphs for the serial and parallel algorithms.]
Scheduling to Optimize Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Original parallel vs. parallel after scheduling:
[Figure: the parallel dataflow diagram before and after rescheduling.]
                Original parallel    Parallel after scheduling
inputs          6                    4
outputs         1                    1
registers       6                    4
adders          3                    2
clock period    flop + 1 add         flop + 1 add
latency         3                    3
Scheduling to Optimize Inputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Rescheduling the dataflow diagram from the parallel algorithm reduced the area from three adders
to two. However, it still violates the restriction of a maximum of three inputs. We can reschedule
the operations to keep the same area, but reduce the number of inputs.
The tradeoff is that reducing the number of inputs causes an increase in the latency from four to
five.
[Figure: rescheduled parallel dataflow diagram with the inputs spread across the clock cycles (a, b; then c, d; then e, f).]
A latency of five violates the design requirement of a maximum latency of four clock cycles. In
comparing the dataflow diagram above with the design requirements, we notice that the requirements
allow a clock cycle that includes two additions and three inputs.
It appears that the parallel algorithm will not lead us to a design that satisfies the requirements.
We revisit the algorithm and try a serial algorithm:
z = ((((a + b) + c) + d) + e) + f
The corresponding dataflow diagram is shown below.
[Figure: serial dataflow diagram with intermediate values x1-x4 and output z.]
2.9.5 Optimize Inputs and Outputs
When we rescheduled the parallel algorithm, we rescheduled the input values. This requires renegotiating
the schedule of input values with our environment. Sometimes the environment of our
circuit will be willing to reschedule the inputs, but in other situations the environment will impose
a non-negotiable schedule upon us.
If you are currently storing all inputs and can change the environment's behaviour to delay sending
some inputs, then you can reduce the number of inputs and registers.
We will illustrate this on both the one-add and the two-add designs.
One-add before I/O opt vs. one-add after I/O opt:
[Figure: the one-add dataflow diagram before and after I/O optimization.]
            before I/O opt   after I/O opt
inputs      6                2
regs        6                2
Two-add before I/O opt vs. two-add after I/O opt:
[Figure: the two-add dataflow diagram before and after I/O optimization.]
            before I/O opt   after I/O opt
inputs      6                2
regs        6                2
Design Comparison Between One and Two Add . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
One-add after I/O opt vs. two-add after I/O opt:
[Figure: the one-add and two-add dataflow diagrams after I/O optimization.]
                One-add after I/O opt   Two-add after I/O opt
inputs          2                       3
outputs         1                       1
registers       2                       3
adders          1                       2
clock period    flop + 1 add            flop + 2 add
latency         6                       4
Hardware Recipe for Two-Add . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
We return now to the two-add design, with the dataflow diagram:
[Figure: two-add dataflow diagram with intermediate values x1-x4 and output z.]
Based on the dataflow diagram, we can determine the hardware resources required for
the datapath.
Table 2.2: Hardware Recipe for Two-Add
inputs 3
adders 2
registers 3
output 1
registered inputs YES
registered outputs YES
clock cycles from inputs to outputs 4
2.9.6 Input/Output Allocation
Our first step after settling on a hardware recipe is I/O allocation, because that determines the
interface between our circuit and the outside world.
From the hardware recipe, we know that we need only three inputs and one output. However, we
have six different input values. We need to allocate these input values to input signals before we
can write a high-level model that performs the computation of our design.
Based on the input and output information in the hardware recipe, we can define our entity:
entity massey is
port (
clk : in std_logic;
i1, i2, i3 : in unsigned(7 downto 0);
o1 : out unsigned(7 downto 0)
);
end massey;
[Figure 2.13: Dataflow diagram and hardware block diagram with I/O port allocation (inputs i1, i2, i3; output o1).]
Based upon the dataflow diagram after I/O allocation, we can write our first high-level model
(hlm_v1).
In the high-level model the entire circuit will be implemented in a single process. For larger
circuits it may be beneficial to have separate processes for different groups of signals.
In the high-level model, the code between wait statements describes the work that is done in a
clock cycle.
The hlm_v1 architecture uses an implicit state machine.
Because the process is clocked, all of the signals that are assigned to in the process are registers.
Combinational signals would need to be done using concurrent assignments or combinational
processes.
architecture hlm_v1 of massey is
...internal signal decls...
process begin
wait until rising_edge(clk);
a <= i1;
b <= i2;
c <= i3;
wait until rising_edge(clk);
x2 <= (a + b) + c;
d <= i2;
e <= i3;
wait until rising_edge(clk);
x4 <= (x2 + d) + e;
f <= i2;
wait until rising_edge(clk);
z <= (x4 + f);
end process;
o1 <= z;
end hlm_v1;
2.9.7 Register Allocation
The next step after I/O allocation could be either register allocation or datapath allocation. The
benefit of doing register allocation first is that it is possible to write VHDL code after register
allocation is done but before datapath allocation is done, while the inverse (datapath done but
register allocation not done) does not make sense if written in a hardware description language.
In this example, we will do register allocation before datapath allocation, and show the resulting
VHDL code.
[Figure: dataflow diagram and hardware block diagram after register allocation (registers r1, r2, r3).]
I/O Allocation          Register Allocation
i1: a                   r1: a, x2, x4
i2: b, d, f             r2: b, d, f
i3: c, e                r3: c, e
o1: z
architecture hlm_v2 of massey is
...internal signal decls...
process begin
wait until rising_edge(clk);
r1 <= i1;
r2 <= i2;
r3 <= i3;
wait until rising_edge(clk);
r1 <= (r1 + r2) + r3;
r2 <= i2;
r3 <= i3;
wait until rising_edge(clk);
r1 <= (r1 + r2) + r3;
r2 <= i2;
wait until rising_edge(clk);
r3 <= (r1 + r2);
end process;
o1 <= r3;
end hlm_v2;
Figure 2.14: Block diagram after I/O and register allocation
2.9.8 Datapath Allocation
In datapath allocation, we allocate each of the data operations in the dataow diagram to one of
the datapath components in the hardware block diagram.
[Figure: dataflow diagram and hardware block diagram after I/O, register, and datapath allocation (adders a1 and a2).]
I/O Allocation          Register Allocation     Datapath Allocation
i1: a                   r1: a, x2, x4           a1: x1, x3, z
i2: b, d, f             r2: b, d, f             a2: x2, x4
i3: c, e                r3: c, e
o1: z
architecture hlm_dp of massey is
...internal signal decls...
process begin
wait until rising_edge(clk);
r1 <= i1;
r2 <= i2;
r3 <= i3;
wait until rising_edge(clk);
r1 <= a2;
r2 <= i2;
r3 <= i3;
wait until rising_edge(clk);
r1 <= a2;
r2 <= i2;
wait until rising_edge(clk);
r3 <= a1;
end process;
a1 <= r1 + r2;
a2 <= a1 + r3;
o1 <= r3;
end hlm_dp;
Figure 2.15: Block diagram after I/O, register, and datapath allocation
2.9.9 Datapath for DP+Ctrl Model
We will now evolve from an implicit state machine to an explicit state machine. The first step is to
label the states in the dataflow diagram and then construct tables to find the values for chip-enable
and mux-select signals.
[Figure: dataflow diagram labelled with states S0-S3, and the corresponding hardware block diagram (adders a1, a2; registers r1, r2, r3; inputs i1, i2, i3; output o1).]
Datapath for DP+Ctrl Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
     r1               r2               r3
S0   ce=1, d=i1       ce=1, d=i2       ce=1, d=i3
S1   ce=1, d=a2       ce=1, d=i2       ce=1, d=i3
S2   ce=1, d=a2       ce=1, d=i2       ce=-, d=-
S3   ce=-, d=-        ce=-, d=-        ce=1, d=a1

     a1                       a2
S0   src1=-,  src2=-          src1=-,  src2=-
S1   src1=r1, src2=r2         src1=a1, src2=r3
S2   src1=r1, src2=r2         src1=a1, src2=r3
S3   src1=r1, src2=r2         src1=-,  src2=-
Choose Don't-Care Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
r1 r2 r3
S0 ce=1, d=i1 ce=1, d=i2 ce=1, d=i3
S1 ce=1, d=a2 ce=1, d=i2 ce=1, d=i3
S2 ce=1, d=a2 ce=1, d=i2 ce=1, d=i3
S3 ce=1, d=a2 ce=1, d=i2 ce=1, d=a1
a1 a2
S0 src1=r1, src2=r2 src1=a1, src2=r3
S1 src1=r1, src2=r2 src1=a1, src2=r3
S2 src1=r1, src2=r2 src1=a1, src2=r3
S3 src1=r1, src2=r2 src1=a1, src2=r3
Simplify . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
     r1       r2 = i2     r3
S0   d=i1                 d=i3
S1   d=a2                 d=i3
S2   d=a2                 d=i3
S3   d=a2                 d=a1

     a1                      a2
     src1=r1, src2=r2        src1=a1, src2=r3
VHDL Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
architecture explicit_v1 of massey is
signal r_1, r_2, r_3, a_1, a_2 : unsigned(7 downto 0);
subtype state_ty is std_logic_vector(3 downto 0);
constant s0 : state_ty := "0001"; constant s1 : state_ty := "0010";
constant s2 : state_ty := "0100"; constant s3 : state_ty := "1000";
signal state : state_ty;
begin
----------------------
-- r1
process (clk) begin
if rising_edge(clk) then
if state = S0 then
r_1 <= i_1;
else
r_1 <= a_2;
end if;
end if;
end process;
----------------------
-- r_2
process (clk) begin
if rising_edge(clk) then
r_2 <= i_2;
end if;
end process;
----------------------
-- r_3
process (clk) begin
if rising_edge(clk) then
if state = S3 then
r_3 <= a_1;
else
r_3 <= i_3;
end if;
end if;
end process;
----------------------
-- combinational datapath
a_1 <= r_1 + r_2;
a_2 <= a_1 + r_3;
o_1 <= r_3;
----------------------
-- state machine
process (clk) begin
if rising_edge(clk) then
if reset = '1' then
state <= S0;
else
case state is
when S0 => state <= S1;
when S1 => state <= S2;
when S2 => state <= S3;
when S3 => state <= S0;
end case;
end if;
end if;
end process;
end explicit_v1;
Peephole Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Peephole optimizations are localized optimizations to code, in that they affect only a few lines of
code. In hardware design, peephole optimizations are usually done to decrease the clock period,
although some optimizations might also decrease area. There are many different types of opti-
mizations, and many optimizations that designers do by hand are things that you might expect a
synthesis tool to do automatically.
In a comparison such as state = S0, when we use a one-hot state encoding, we need to compare
only one of the bits of the state. The comparison can be simplified to state(0) = '1'.
Without this optimization, many synthesis tools will produce hardware that tests all of the bits of
the state signal. This increases the area, because more bits are required as inputs to the comparison,
and increases the clock period, because the wider comparison leads to a tree-like structure of
combinational logic, or an increased number of FPGA cells.
In this example, we will take advantage of our state encoding to optimize the code for r_1, r_3, and
the state machine.
-- r_1
process (clk) begin
if rising_edge(clk) then
if state = S0 then
r_1 <= i_1;
else
r_1 <= a_2;
end if;
end if;
end process;
-- r_1 (optimized)
process (clk) begin
if rising_edge(clk) then
if state(0) = '1' then
r_1 <= i_1;
else
r_1 <= a_2;
end if;
end if;
end process;
The code for r_2 remains unchanged.
-- r_3
process (clk) begin
if rising_edge(clk) then
if state = S3 then
r_3 <= a_1;
else
r_3 <= i_3;
end if;
end if;
end process;
-- r_3 (optimized)
process (clk) begin
if rising_edge(clk) then
if state(3) = '1' then
r_3 <= a_1;
else
r_3 <= i_3;
end if;
end if;
end process;
-- state machine
process (clk) begin
if rising_edge(clk) then
if reset = '1' then
state <= S0;
else
case state is
when S0 => state <= S1;
when S1 => state <= S2;
when S2 => state <= S3;
when S3 => state <= S0;
end case;
end if;
end if;
end process;
-- state machine (optimized)
-- NOTE: "st" = "state"
process (clk) begin
if rising_edge(clk) then
if reset = '1' then
st <= S0;
else
for i in 0 to 3 loop
st( (i + 1) mod 4 ) <= st( i );
end loop;
end if;
end if;
end process;
The hardware-block diagram that corresponds to the tables and VHDL code is:
[Figure: hardware block diagram with the two adders, registers r1, r2, r3, inputs i1, i2, i3, output o1, and control from State(0), State(1), State(2), State(3), and reset.]
2.10 Design Example: Vanier
We'll go through the following artifacts:
1. requirements
2. algorithm
3. dataow diagram
4. high-level models
5. hardware block diagram
6. RTL code for datapath
7. state machine
8. RTL code for control
Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1. Scheduling (allocate operations to clock cycles)
2. I/O allocation
3. First high-level model
4. Register allocation
5. Datapath allocation
6. Connect datapath components, insert muxes where needed
7. Design implicit state machine
8. Optimize
9. Design explicit-current state machine
10. Optimize
2.10.1 Requirements
Functional requirements: compute the following formula:
output = (a × d) + c + (d × b) + b
Performance requirement:
Max clock period: flop plus (2 adds or 1 multiply)
Max latency: 4
Cost requirements
Maximum of two adders
Maximum of two multipliers
Unlimited registers
Maximum of three inputs and one output
Maximum of 5000 student-minutes of design effort
Registered inputs and outputs
2.10.2 Algorithm
Create a data-dependency graph for the algorithm.
[Figure: data-dependency graph for the formula: the two multiplications (a × d and d × b) feed a tree of three additions producing z.]
2.10.3 Initial Dataflow Diagram
Schedule operations into clock cycles. Use an as-soon-as-possible schedule, obeying the performance
requirement of a maximum clock period of one multiply or two additions. In this initial
diagram, we ignore the resource requirements. This allows us to establish a lower bound on
the latency, which gives us the maximum performance that we can hope to achieve.
[Figure: initial dataflow diagram with all four inputs (a, b, c, d) in the first clock cycle.]
2.10.4 Reschedule to Meet Requirements
We have four inputs, but the requirements allow a maximum of three. We need to move one input
into the second clock cycle. We want to choose an input that can be delayed by one clock cycle
without violating a requirement and with minimal degradation of performance (clock period and
latency).
If delaying an input by a clock cycle causes a requirement to be violated, we can often reschedule
the operations to remove the violation. So, we sometimes create an intermediate dataflow diagram
that violates a requirement, then reschedule the operations to bring the dataflow diagram back into
compliance.
The critical path is from d and b, through a multiplier, the middle adder, the final adder, and then
out through z. Because the inputs d and b are on the critical path, it would be preferable to choose
another input (either a or c) as the input to move into the second clock cycle.
If we move c, we will move the first addition into the second clock cycle, which will force us to use
three adders, which violates our resource requirement of a maximum of two adders.
By process of elimination, we have settled on a as our input to be delayed. This causes one of
the multiply operations to be moved into the second clock cycle, which is good because it reduces
our resources from two multipliers to just one.
[Figure: dataflow diagram with input a moved into the second clock cycle.]
Moving a into the second clock cycle has caused a clock period violation, because our clock period
is now a register, a multiply, and an add. This forces us to add an additional clock cycle, which
gives us a latency of four.
[Figure: dataflow diagram with a third clock cycle added; the latency is now four.]
2.10.5 Optimize Resources
We can exploit the additional clock cycle to reschedule our operations to reduce the number of
inputs from three to two. The disadvantage is that we have increased the number of registers
from four to five.
[Figure: rescheduled dataflow diagram with at most two inputs per clock cycle.]
Two side comments:
Moving the second addition from the third clock cycle to the second will not improve the performance
or the area. The number of adders will remain at two, the number of registers will
remain at five, and the clock period will remain at the maximum of a multiply or two additions.
In hindsight, if we had chosen originally to move c, rather than a, into the second clock cycle,
we would likely have produced this same dataflow diagram. After moving c, we would see
the resource violation of three adders in the second clock cycle. This violation would cause us
to add a third clock cycle, and give us an opportunity to move a into the second clock cycle.
The lesson is that there are usually several different ways to approach a design problem, and it
is infeasible to predict which approach will result in the best design. At best, we have many
heuristics, or rules of thumb, that give us guidelines for techniques that usually work well.
Having finalized our input/output scheduling, we can write our entity. Note: we will add a reset
signal later, when we design the state machine to control the datapath.
entity vanier is
port (
clk : in std_logic;
i_1, i_2 : in std_logic_vector(15 downto 0);
o_1 : out std_logic_vector(15 downto 0)
);
end vanier;
2.10.6 Assign Names to Registered Values
We must assign a name to each registered value. Optionally, we may also assign names to combinational
values. Registers require names, because in VHDL each register (except implicit state
registers) is associated with a named signal. Combinational signals do not require names, because
VHDL allows anonymous (unnamed) combinational signals. For example, in the expression
(a+b)+c we do not need to provide a name for the sum of a and b.
If a single value spans multiple clock cycles, it only needs to be named once. In our example,
x_1, x_2, and x_4 each cross two boundaries.
[Figure: dataflow diagram with the registered values named x1-x8.]
2.10.7 Input/Output Allocation
Now that we have names for all of our registered signals, we can allocate input and output ports to
signals.
After the input and output ports have been allocated to signals, we can write our first model. We
use an implicit state machine and define only the registered values. In each state, we define the
values of the registered values that are computed in that state.
[Figure: dataflow diagram with I/O ports allocated (inputs i1, i2; output o1) and registered values x1-x8.]
architecture hlm_v1 of vanier is
signal x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8
: std_logic_vector(15 downto 0);
begin
process begin
------------------------------
wait until rising_edge(clk);
------------------------------
x_1 <= i_1;
x_2 <= i_2;
------------------------------
wait until rising_edge(clk);
------------------------------
x_3 <= i_1;
x_4 <= x_1 * x_2;
x_5 <= i_2;
------------------------------
wait until rising_edge(clk);
------------------------------
x_6 <= x_3 * x_1;
x_7 <= x_2 + x_5;
------------------------------
wait until rising_edge(clk);
------------------------------
x_8 <= x_6 + (x_4 + x_7);
end process;
o_1 <= x_8;
end hlm_v1;
The model hlm_v1 is synthesizable. If we are happy with the clock speed and area, we can stop
now! The remaining steps of the design process seek to optimize the design by reducing the area
and clock period. For area, we will reduce the number of registers, datapath components, and
multiplexers. Reducing the clock period will occur as we reduce the number of multiplexers and
potentially perform peephole (localized) optimizations, such as Boolean simplification.
2.10.8 Tangent: Combinational Outputs
To demonstrate a high-level model where the output is combinational, we modify hlm_v1 so that
the output is combinational, rather than a register (see hlm_v1_2). To make the output (x_8) combinational,
we move the assignment to x_8 out of the main clocked process and into a concurrent
statement.
architecture hlm_v1_2 of vanier is
signal x_1, x_2, x_3, x_4, x_5, x_6, x_7
: std_logic_vector(15 downto 0);
begin
process begin
------------------------------
wait until rising_edge(clk);
------------------------------
x_1 <= i_1;
x_2 <= i_2;
------------------------------
wait until rising_edge(clk);
------------------------------
x_3 <= i_1;
x_4 <= x_1 * x_2;
x_5 <= i_2;
------------------------------
wait until rising_edge(clk);
------------------------------
x_6 <= x_3 * x_1;
x_7 <= x_2 + x_5;
end process;
o_1 <= x_6 + (x_4 + x_7);
end hlm_v1_2;
[Figure: dataflow diagram for hlm_v1_2; the final addition drives o1 combinationally instead of being registered as x8.]
2.10.9 Register Allocation
Our previous model (hlm_v1) uses eight registers (x_1 ... x_8). However, our analysis of the
dataflow diagrams says that we can implement the diagram with just five registers. Also, the code
for hlm_v1 contains two occurrences of the multiplication symbol (*) and three occurrences of the
addition symbol (+). Our analysis of the dataflow diagram showed that we need only one multiply
and two adds. In hlm_v1 we are relying on the synthesis tool to recognize that even though the
code contains two multiplies and three adds, the hardware needs only one multiply and two adds.
Register allocation is the task of assigning each of our registered values to a register signal. Datapath
allocation is the task of assigning each datapath operation to a datapath component. Only
high-level synthesis tools (and software compilers) do register allocation. So, as hardware designers,
we are stuck with the task of doing register allocation ourselves if we want to further optimize
our design. Some register-transfer-level synthesis tools do datapath allocation. If your synthesis
tool does datapath allocation, it is important to learn the idioms and limitations of the tool so that
you can write your code in a style that allows the tool to do a good job of allocation and optimization.
In most cases where area or clock speed are important design metrics, design engineers do
datapath allocation by hand or with ad-hoc software and spreadsheets.
We will now step through the tasks of register allocation and datapath allocation. In our eight-register
model, each register holds a unique value; we do not reuse registers. To reduce the
number of registers from eight to five, we will need to reuse registers, so that a register potentially
holds different values in different clock cycles.
When doing register allocation, we assign a register to each signal that crosses a clock cycle boundary.
When creating the hardware block diagram, we will need to add multiplexers to the inputs of
modules that are connected to multiple registers. To reduce the number of multiplexers, we try to
allocate the same registers to the same inputs of the same type of module. For example, x_7 is an
input to an adder, so we allocate r_5 to x_7, because r_5 was also an input to an adder in another
clock cycle. Also, in the third clock cycle, we allocate r_2 to x_6, because in the second clock
cycle the inputs to an adder were r_2 and r_5. In the last clock cycle, we allocate r_5 to x_8,
because previously r_5 was used as the output of r_2 + r_5.
We update our model to reflect register allocation by replacing the signals for registered values
(x_1 ... x_8) with the registers r_1 ... r_5.
[Figure: dataflow diagram after register allocation; the registered values x1-x8 are mapped onto the registers r1-r5, with r2 and r5 reused in later clock cycles.]
architecture hlm_v2 of vanier is
signal r_1, r_2, r_3, r_4, r_5
: std_logic_vector(15 downto 0);
begin
process begin
------------------------------
wait until rising_edge(clk);
------------------------------
r_1 <= i_1;
r_2 <= i_2;
------------------------------
wait until rising_edge(clk);
------------------------------
r_3 <= i_1;
r_4 <= r_1 * r_2;
r_5 <= i_2;
------------------------------
wait until rising_edge(clk);
------------------------------
r_2 <= r_3 * r_1;
r_5 <= r_2 + r_5;
------------------------------
wait until rising_edge(clk);
------------------------------
r_5 <= r_2 + (r_4 + r_5);
end process;
o_1 <= r_5;
end hlm_v2;
Both of our models so far (hlm_v1 and hlm_v2) have used implicit state machines. The optimization
from hlm_v1 to hlm_v2 was done to reduce the number of registers by performing register
allocation. Most of the remaining optimizations require an explicit state machine. We will construct
an explicit state machine using a methodical procedure that gradually adds more information
to the dataflow diagram. The first step in this procedure is datapath allocation, which is similar
to register allocation, except that we allocate datapath components to datapath operations, rather
than allocate registers to names.
To control the datapath, we need to provide the following signals for registers and datapath components:
registers: chip-enable and mux-select signals
datapath components: instruction (e.g. add, sub, etc. for ALUs) and mux-select
After we determine the chip-enable, mux-select, and instruction signals, and then calculate what
value each signal needs in each clock cycle, we can build the explicit state machine to control the
datapath.
After we build the state machine, we will add a reset to the design.
2.10.10 Datapath Allocation
In datapath allocation, we allocate an adder (either a1 or a2) to each addition operation and a
multiplier (either m1 or m2) to each multiplication operation. As with register allocation, we
attempt to reduce the number of multiplexers that will be required by connecting the same datapath
component to the same register in multiple clock cycles.
[Figure: dataflow diagram after I/O, register, and datapath allocation (multiplier m1; adders a1, a2).]
2.10.11 Hardware Block Diagram and State Machine
To build an explicit state machine, we first determine what states we need. In this circuit, we need
four states, one for each clock cycle in the dataflow diagram. If our algorithmic description had
included control flow, such as loops and branches, then it becomes more difficult to determine the
states that are needed.
We will use four states, S0..S3, where S0 corresponds to the first clock cycle (during which the
input is read) and S3 corresponds to the last clock cycle.
2.10.11.1 Control for Registers
To determine the chip enable and mux select signals for the registers, we build a table where each
state corresponds to a row and each register corresponds to a column.
For each register and each state, we note whether the register loads in a new value (ce) and what
signal is the source of the loaded data (d).
     r1              r2              r3              r4              r5
S0   ce=1, d=i1      ce=1, d=i2      ce=-, d=-       ce=-, d=-       ce=-, d=-
S1   ce=0, d=-       ce=0, d=-       ce=1, d=i1      ce=1, d=m1      ce=1, d=i2
S2   ce=-, d=-       ce=1, d=m1      ce=-, d=-       ce=0, d=-       ce=1, d=a1
S3   ce=-, d=-       ce=-, d=-       ce=-, d=-       ce=-, d=-       ce=1, d=a1
Eliminate unnecessary chip enables and muxes.
A chip enable is needed if a register must hold a single value for multiple clock cycles (ce=0).
A multiplexer is needed if a register loads in values from different sources in different clock
cycles.
The register simplifications are as follows:
r1: Chip-enable, because S1 has ce=0. No multiplexer, because i1 is the only input.
r2: Chip-enable, because S1 has ce=0. Multiplexer to choose between i2 and m1.
r3: No chip-enable, no multiplexer. The register r3 simplifies to be just r3=i1 without a multiplexer
or chip-enable, because there is only one state where we care about its behaviour (S1);
all of the other states are don't-cares for both chip enable and mux.
r4: Chip-enable, because S2 has ce=0. No multiplexer, because m1 is the only input.
r5: No chip-enable, because we do not have any states with ce=0. Multiplexer between i2 and a1.
The simplified register table is shown below. For registers that do not have multiplexers, we show
their input on the top row. For registers that need neither a chip enable nor a mux (e.g. r3), we
write the assignment in the first row and leave the other rows blank.
     r1=i1     r2               r3=i1     r4=m1     r5
S0   ce=1      ce=1, d=i2                 ce=-      d=-
S1   ce=0      ce=0, d=-                  ce=1      d=i2
S2   ce=-      ce=1, d=m1                 ce=0      d=a1
S3   ce=-      ce=-, d=-                  ce=-      d=a1
The chip-enable and mux-select signals that are needed for the registers are: r1_ce, r2_ce,
r2_sel, r4_ce, and r5_sel.
2.10.11.2 Control for Datapath Components
Analogous to the table for registers, we build a table for the datapath components. Each of our
components has two inputs (src1 and src2). Each component performs a single operation (either
addition or multiplication), so we do not need to define operation or instruction signals for the
datapath components.
     a1               a2               m1
     src1    src2     src1    src2     src1    src2
S0   -       -        -       -        -       -
S1   -       -        -       -        r1      r2
S2   r2      r5       -       -        r3      r1
S3   r2      a2       r4      r5       -       -
Based on the table above, the adder a1 will need a multiplexer for src2. The multiplier m1 will
need two multiplexers: one for each input.
Note that the operands to addition and multiplication are commutative, so we can choose which
signal goes to src1 and which to src2 so as to minimize the need for multiplexers.
We notice that for m1, we can reduce the number of multiplexers from 2 to 1 by swapping the
operands in the second clock cycle. This makes r1 the only source of operands for the src1 input.
This optimization is reected in the table below.
     a1               a2               m1
     src1    src2     src1    src2     src1    src2
S0   -       -        -       -        -       -
S1   -       -        -       -        r1      r2
S2   r2      r5       -       -        r1      r3
S3   r2      a2       r4      r5       -       -
The mux-select signals for the datapath components are: a1_src2_sel and m1_src2_sel.
2.10.11.3 Control for State
We need to control the transition from one state to the next. For this example, the transition is very
simple: each state transitions to its successor: S0 → S1 → S2 → S3 → S0 → ...
2.10.11.4 Complete State Machine Table
The state machine table is shown below. Note that the state signal is a register; the table shows the
next value of the signal.
     r1_ce  r2_ce  r2_sel  r4_ce  r5_sel  a1_src2_sel  m1_src2_sel  state
S0   1      1      i2      -      -       -            -            S1
S1   0      0      -       1      i2      -            r2           S2
S2   -      1      m1      0      a1      r5           r3           S3
S3   -      -      -       -      a1      a2           -            S0
We now choose instantiations for the don't-care values so as to simplify the circuitry. Different
state encodings will lead to different simplifications. For fully-encoded states, Karnaugh maps are
helpful in doing simplifications. For a one-hot state encoding, it is usually better to create situations
where conditions are based upon a single state. The reason for this heuristic with one-hot encodings
will be clear when we get to explicit_v2.
r1_ce: We first choose 0 as the don't-care instantiation, because that leaves just one state where
we need to load. (At the end of the don't-care allocation, we'll revisit this decision and
change our mind.)
r2_ce: We choose 1 for S3, so that we have just one state where we do not do a load. If we
had chosen 0 for r2_ce in S3, we would have two states where we do a load and two where
we do not load. If we were using fully-encoded states, this even separation might have left
us with a very nice Karnaugh map; or it might have left us with a Karnaugh map that has a
checkerboard pattern, which would not simplify. This helps illustrate why state encoding is
a difficult problem.
r2_sel: We choose m1 arbitrarily. The choice of i2 would have also resulted in three assignments
from one signal and one assignment from the other signal.
r4_ce: We choose 1 because it is conceptually cleaner to do an assignment in just the one clock
cycle where we care about the value, rather than not do an assignment in the one clock cycle
where we must hold the value.
r5_sel: Choose a1 so that we have three assignments from the same signal and just one assignment
from the other signal.
a1_src2: Choose a2 arbitrarily.
m1_src2: Choose r3 arbitrarily.
r1_ce (again): We examine r1_ce and r2_ce and see that if we choose 1 for the don't-care
instantiation of r1_ce, we will have the same choices for both chip enables. This will
simplify our state machine. Also, r4_ce is the negation of r2_ce, so we can use just an
inverter to control r4_ce.
     r1_ce  r2_ce  r2_sel  r4_ce  r5_sel  a1_src2_sel  m1_src2_sel  state
S0   1      1      i2      0      a1      a2           r3           S1
S1   0      0      m1      1      i2      a2           r2           S2
S2   1      1      m1      0      a1      r5           r3           S3
S3   1      1      m1      0      a1      a2           r3           S0
2.10.12 VHDL Code with Explicit State Machine
VHDL code can be written directly from the tables and the dataflow diagram that shows register
allocation, input allocation, and datapath allocation. As a simplification, rather than write explicit
signals for the chip-enable and mux-select signals, we use selected and conditional assignment statements
that test the state in the condition.
We chose a one-hot encoding of the state, which usually results in small and fast hardware for state
machines with sixteen or fewer states.
architecture explicit_v1 of vanier is
signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
subtype state_ty is std_logic_vector(3 downto 0);
constant s0 : state_ty := "0001"; constant s1 : state_ty := "0010";
constant s2 : state_ty := "0100"; constant s3 : state_ty := "1000";
signal state : state_ty;
begin
----------------------
-- r_1
process (clk) begin
if rising_edge(clk) then
if state = S0 then
r_1 <= i_1;
end if;
end if;
end process;
----------------------
-- r_2
process (clk) begin
if rising_edge(clk) then
if state = S0 or state = S2 then
if state = S0 then
r_2 <= i_2;
else
r_2 <= m_1;
end if;
end if;
end if;
end process;
----------------------
-- r_3
process (clk) begin
if rising_edge(clk) then
r_3 <= i_1;
end if;
end process;
----------------------
-- r_4
process (clk) begin
if rising_edge(clk) then
if state = S1 then
r_4 <= m_1;
end if;
end if;
end process;
----------------------
-- r_5
process (clk) begin
if rising_edge(clk) then
if state = S1 then
r_5 <= i_2;
else
r_5 <= a_1;
end if;
end if;
end process;
----------------------
-- combinational datapath
with state select
a1_src2 <= r_5 when S2,
a_2 when others;
with state select
m1_src2 <= r_2 when S1,
r_3 when others;
a_1 <= r_2 + a1_src2;
a_2 <= r_4 + r_5;
m_1 <= r_1 * m1_src2;
o_1 <= r_5;
----------------------
-- state machine
process (clk) begin
if rising_edge(clk) then
if reset = '1' then
state <= S0;
else
case state is
when S0 => state <= S1;
when S1 => state <= S2;
when S2 => state <= S3;
when S3 => state <= S0;
end case;
end if;
end if;
end process;
----------------------
end explicit_v1;
The hardware-block diagram that corresponds to the tables and VHDL code is:
[Figure: dataflow diagram and hardware block diagram for the explicit state machine: multiplier m1, adders a1 and a2, registers r1-r5, inputs i1 and i2, output o1, with the states S0-S3 marked on the dataflow diagram.]
2.10.13 Peephole Optimizations
We will illustrate several peephole optimizations that take advantage of our state encoding.
-- r_1
process (clk) begin
if rising_edge(clk) then
if state = S0 then
r_1 <= i_1;
end if;
end if;
end process;
-- r_1 (optimized)
process (clk) begin
if rising_edge(clk) then
if state(0) = '1' then
r_1 <= i_1;
end if;
end if;
end process;
Analogous optimizations can be used when comparing against multiple states:
-- r_2
process (clk) begin
if rising_edge(clk) then
if state = S0 or state = S2 then
if state = S0 then
r_2 <= i_2;
else
r_2 <= m_1;
end if;
end if;
end if;
end process;
-- r_2 (optimized)
process (clk) begin
if rising_edge(clk) then
if (state(0) or state(2)) = '1' then
if state(0) = '1' then
r_2 <= i_2;
else
r_2 <= m_1;
end if;
end if;
end if;
end process;
Next-state assignment for a one-hot state machine can be done with a simple shift register:
-- state machine
process (clk) begin
if rising_edge(clk) then
if reset = '1' then
state <= S0;
else
case state is
when S0 => state <= S1;
when S1 => state <= S2;
when S2 => state <= S3;
when S3 => state <= S0;
end case;
end if;
end if;
end process;
-- state machine (optimized)
-- NOTE: "st" = "state"
process (clk) begin
if rising_edge(clk) then
if reset = '1' then
st <= S0;
else
for i in 0 to 3 loop
st( (i+1) mod 4 ) <= st( i );
end loop;
end if;
end if;
end process;
The resulting optimized code is shown below.
architecture explicit_v2 of vanier is
signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
subtype state_ty is std_logic_vector(3 downto 0);
constant s0 : state_ty := "0001"; constant s1 : state_ty := "0010";
constant s2 : state_ty := "0100"; constant s3 : state_ty := "1000";
signal state : state_ty;
begin
----------------------
-- r_1
process (clk) begin
if rising_edge(clk) then
if state(0) = '1' then
r_1 <= i_1;
end if;
end if;
end process;
----------------------
-- r_2
process (clk) begin
if rising_edge(clk) then
if (state(0) or state(2)) = '1' then
if state(0) = '1' then
r_2 <= i_2;
else
r_2 <= m_1;
end if;
end if;
end if;
end process;
----------------------
-- r_3
process (clk) begin
if rising_edge(clk) then
r_3 <= i_1;
end if;
end process;
----------------------
-- r_4
process (clk) begin
if rising_edge(clk) then
if state(1) = '1' then
r_4 <= m_1;
end if;
end if;
end process;
----------------------
-- r_5
process (clk) begin
if rising_edge(clk) then
if state(1) = '1' then
r_5 <= i_2;
else
r_5 <= a_1;
end if;
end if;
end process;
----------------------
-- combinational datapath
a1_src2 <= r_5 when state(2) = '1'
else a_2;
m1_src2 <= r_2 when state(1) = '1'
else r_3;
a_1 <= r_2 + a1_src2;
a_2 <= r_4 + r_5;
m_1 <= r_1 * m1_src2;
o_1 <= r_5;
----------------------
-- state machine
process (clk) begin
if rising_edge(clk) then
if reset = '1' then
state <= S0;
else
for i in 0 to 3 loop
state( (i + 1) mod 4 ) <= state( i );
end loop;
end if;
end if;
end process;
----------------------
end explicit_v2;
2.10.14 Notes and Observations
Our functional requirements were written as:
output = (a × d) + (d × b) + b + c
Alternatively, we could have achieved exactly the same functionality with the functional requirements
written as (the two statements are mathematically equivalent):
output = (a × d) + b + (d × b) + c
The naive data dependency graph for the alternative formulation is much messier than the data
dependency graph for the original formulation:
Original: (a × d) + (d × b) + b + c          Alternative: (a × d) + c + (d × b) + b
[Figure: data-dependency graphs for the original and alternative formulations.]
An observation: it can be helpful to explore several equivalent formulations of the mathematical
equations while constructing the data dependency graph. A mathematical formulation that places
occurrences of the same identifier close to each other often results in a simpler data dependency
graph. The simpler the data dependency graph, the easier it will be to identify helpful optimizations
and efficient schedules.
2.11 Design Example: Stack
The purpose of the stack example is to illustrate the design techniques on a slightly larger example
than Vanier and Massey. There are not any new concepts in this section.
2.11.1 Stack: Requirements
2.11.1.1 Entity
VHDL entity for the stack:
entity stack is
port (
reset, clk : in std_logic;
inp : in std_logic_vector(3 downto 0);
outp : out std_logic_vector(3 downto 0)
);
end stack;
The input signal inp is used for both instructions and data.
2.11.1.2 Instructions
push: put a new piece of data onto the top of the stack
pop: remove the top piece of data from the stack
swap: swap the top two pieces of data
tos: output the current data on the top of the stack
2.11.1.3 Instruction Encoding
VHDL package dening stack instructions:
package stack_instr is
constant pop : std_logic_vector(3 downto 0) := "0001";
constant push : std_logic_vector(3 downto 0) := "0010";
constant tos : std_logic_vector(3 downto 0) := "0100";
constant swap : std_logic_vector(3 downto 0) := "1000";
end stack_instr;
2.11.1.4 Miscellaneous Requirements
The stack shall have 16 elements
The inputs shall be registered.
When a push operation is done, in the clock cycle following the push instruction, inp shall have
the data that is to be pushed onto the stack.
Popping from an empty stack or pushing onto a full stack results in undefined behaviour.
When doing a tos or pop operation, the output outp shall have the tos data in the clock cycle
after the tos instruction is input. At all other times the output is unconstrained.
In the clock cycle following reset being asserted (set to 1), the stack shall be empty.
2.11.2 Stack: Algorithm
A simple Perl program to implement an algorithmic description of the stack.
Note: You don't need to know Perl in E&CE 427. Perl is just one example of
the many different software programming languages that can be used to create
algorithmic descriptions of circuits.
Stack Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
#! /usr/bin/perl -w
my ($line, %stack, $tos, $tmp);
$tos = 0;
while ($line = <STDIN>) {
chop( $line );
if ( $line eq "tos") {
print( $stack{$tos} );
} elsif ( $line eq "pop") {
print( $stack{$tos} );
$tos = $tos - 1;
} elsif ( $line eq "push" ) {
$tos = $tos + 1;
$line = <STDIN>;
chop( $line );
$stack{$tos} = $line;
} elsif ( $line eq "swap" ) {
$tmp = $stack{$tos};
$stack{$tos} = $stack{$tos-1};
$stack{$tos-1} = $tmp;
}
}
Usage of Perl Stack . . . . . . . . . . . . . .
push
3
tos
3
push
4
tos
4
pop
4
tos
3
2.11.3 Stack: Dataflow Diagram
2.11.3.1 Data-Dependency Graphs
Do one data-dependency graph for each operation. Convert each data-dependency graph into a
dataflow diagram by adding clock-cycle boundaries.
[Figure: data-dependency graphs for the four operations. Pop: stack(rd) and tos-1. Push: tos+1 and stack(wr) of data_in. Tos: stack(rd). Swap: two stack(rd) operations followed by two stack(wr) operations, using tos and tos-1.]
Note: scheduling decision and anti-dependency arrows
2.11.3.2 Partition into Clock Cycles
Note: The memory array used in this example supports combinational
reads, hence read operations can be done in the middle of a clock cycle. For the
Altera memory arrays used in E&CE 427 the read operations are registered.
[Figure: the data-dependency graphs partitioned into clock cycles, with resource requirements:]
Pop:  2 registers (stack, tos), 1 ALU
Push: 3 registers (stack, tos, data_in), 1 ALU
Tos:  2 registers (stack, tos)
Swap version 1: 5 registers (stack, tos, stack[tos], stack[tos-1], tos-1), 1 ALU
Swap version 2 (optimized): 4 registers (stack, tos, stack[tos], stack[tos-1]), 1 ALU
2.11.4 Stack: High-Level Model
This high-level model is taken directly from the dataflow diagrams and block diagrams.
There is one process that combines control, datapath, and storage, except for the output (outp),
which is done with a concurrent assignment statement.
Notice that there is a next init when (reset = '1'); after every wait statement. This
is needed to get the circuit back to its initial state in the next clock cycle when reset is asserted.
First, we'll see the overall structure of the hlm architecture, and then the gory details.
architecture hlm of stack is
...declarations...
begin
-----------------------------------------------
process
begin
init : loop
...reset assignments...
loop
--------------------------------
wait until rising_edge(clk);
next init when (reset = '1');
--------------------------------
case inp is
when pop =>
...pop code...
when push =>
...push code...
when swap =>
...swap code...
when tos =>
...tos code...
when others =>
next init;
end case;
end loop;
end loop;
end process;
-----------------------------------------------
outp <= stack(to_integer(tos));
-----------------------------------------------
end hlm;
Now for the actual code.
architecture hlm of stack is
-----------------------------------------------
subtype data_ty is std_logic_vector(3 downto 0);
type stack_ty is array (15 downto 0) of data_ty;
-----------------------------------------------
signal tos : unsigned(3 downto 0);
signal tmp1, tmp2 : data_ty;
signal stack : stack_ty;
signal empty : std_logic;
-----------------------------------------------
begin
---------------------------------------------------------
process
begin
init : loop
--------------------------------
tos <= to_unsigned(0,4);
empty <= '1';
--------------------------------
loop
--------------------------------
wait until rising_edge(clk);
next init when (reset = '1');
--------------------------------
case inp is
when pop =>
tos <= tos - 1;
when push =>
if (empty = '0') then
tos <= tos + 1;
end if;
--------------------------------
wait until rising_edge(clk);
next init when (reset = '1');
--------------------------------
stack(to_integer(tos)) <= inp;
empty <= '0';
Continued...
...continued
when swap =>
tmp1 <= stack(to_integer(tos-1));
--------------------------------
wait until rising_edge(clk);
next init when (reset = '1');
--------------------------------
tmp2 <= stack(to_integer(tos));
--------------------------------
wait until rising_edge(clk);
next init when (reset = '1');
--------------------------------
stack(to_integer(tos-1)) <= tmp2;
--------------------------------
wait until rising_edge(clk);
next init when (reset = '1');
--------------------------------
stack(to_integer(tos)) <= tmp1;
when tos =>
null;
when others =>
next init;
end case;
end loop;
end loop;
end process;
-----------------------------------------------
outp <= stack(to_integer(tos));
-----------------------------------------------
end hlm;
The high-level model is synthesizable, but might be large and slow.
It uses a 2-d array for the stack, rather than specialized memory components from the library.
We are relying on the synthesis tool to build a state machine to drive the datapath. Sometimes,
by writing code that is closer to gate-level hardware, we can improve performance and/or area.
2.11.5 Stack: Block Diagram
2.11.5.1 Individual Block Diagrams
Build one block diagram for each operation.
[Figure: individual block diagrams for the Pop, Push, and Tos operations. Each diagram shows the stack memory (ports we, a, di, do), the tos register, an increment/decrement adder, the control block driven by inp, and the outp output.]
[Figure: block diagram for the Swap operation, which adds the tmp1 and tmp2 registers to hold the two stack entries being exchanged.]
2.11.5.2 Complete Block Diagram
Merge all of the block diagrams together, reusing components wherever possible.
[Figure: complete block diagram merging all of the operations. The datapath contains the stack memory (we, a, di, do), the tos, tmp1, and tmp2 registers, a +1/-1 adder, and multiplexers; outp is driven from the stack's read port. The control block takes inp and reset and drives the control signals tos_inc_dec_sel, stack_addr_sel, tos_ce, stack_we, stack_data_sel, tmp1_ce, and tmp2_ce.]
All Operations
2.11.6 Stack: Register Transfer Level
Structuring RTL Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
There are four different ways to structure your RTL code:
[Figure: the four structuring options, showing how control, storage, and datapath are grouped into processes.]
Single process
Separate datapath
Separate control, storage, and datapath
Fully disassembled (separate next-state functions)
Section 1.8.4 described a variety of options for coding the individual modules in the above diagram.
For example: whether to use both flopped and combinational signals, the number of target signals
per process, and whether to use if or wait statements for flip-flops (see the sketch below).
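As a reminder, here is a minimal sketch of the two flip-flop coding styles (the signal names clk, d, and q are illustrative, not part of the stack design):

-- "if rising_edge" style: clocked process with a sensitivity list
process (clk)
begin
  if rising_edge(clk) then
    q <= d;
  end if;
end process;

-- "wait until" style: no sensitivity list; the wait statement marks the clock edge
process
begin
  wait until rising_edge(clk);
  q <= d;
end process;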
Stack RTL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
To write the RTL code for the stack, consider the following options:
Replacing the stack as an array with a component instantiation of a memory array from the
FPGA libraries
Defining a state machine and signals to control the datapath
(e.g. define a state type and a signal of type state and do assignments to current and next-state
signals)
Question to ponder: does an explicit state machine result in better hardware?
2.11.6.1 Stack: Separate Control, Datapath and Storage
This design is derived directly from the hardware block diagram.
We separate the state machine and datapath using the control signals that drive the datapath (mux
select lines, chip enables, etc).
The state machine drives signals that control the datapath.
The state machine is very similar to that in the high level model.
In every state we assign values to the signals that control the datapath.
The datapath is done with concurrent statements. By using concurrent statements, rather than
processes, for the datapath, we eliminate the need for the datapath assignments to have sensitivity
lists, which simplifies the code.
This style works best when there are a large number of states and a small number of datapath
components.
The outline of the code is:
architecture sepfsm of stack is
...declarations...
begin
...component instantiation for memory...
...clocked process for state machine...
...clocked process for tmp1...
...clocked process for tmp2...
...clocked process for tos...
...concurrent assignment for tos adj...
...concurrent assignment for stack addr...
...concurrent assignment for stack data in...
end sepfsm;
We now step through the code in detail, beginning with signal declarations:
architecture sepfsm of stack is
signal tos,
tos_adj,
stack_addr : unsigned(3 downto 0);
signal inp_intern,
stack_data_in,
stack_data_out,
tmp1,
tmp2 : std_logic_vector(3 downto 0);
signal synch_reset,
empty,
tos_inc_dec_sel,
stack_addr_sel,
tos_ce,
stack_we,
tmp1_ce,
tmp2_ce : std_logic;
signal stack_data_sel : std_logic_vector(1 downto 0);
...ram component instantiation...
Continued...
...continued
process begin
init : loop
--------------------------------
empty <= '1';
tos_inc_dec_sel <= '-';
stack_addr_sel <= '-';
tos_ce <= '1';
stack_we <= '0';
stack_data_sel <= "--";
tmp1_ce <= '-';
tmp2_ce <= '-';
--------------------------------
loop
--------------------------------
wait until rising_edge(clk);
next init when (reset = '1');
--------------------------------
case inp is
when pop =>
tos_inc_dec_sel <= '0';
stack_addr_sel <= '1';
tos_ce <= '1';
stack_we <= '0';
stack_data_sel <= "--";
tmp1_ce <= '-';
tmp2_ce <= '-';
when push =>
if (empty = '1') then
tos_inc_dec_sel <= '-';
stack_addr_sel <= '0';
tos_ce <= '0';
else
tos_inc_dec_sel <= '1';
stack_addr_sel <= '1';
tos_ce <= '1';
end if;
stack_data_sel <= "--";
stack_we <= '0';
tmp1_ce <= '-';
tmp2_ce <= '-';
--------------------------------
wait until rising_edge(clk);
next init when (reset = '1');
--------------------------------
empty <= '0'; ...more assignments...
when swap => ...
end case;
end loop;
end loop;
end process;
Continued...
...continued
------------------------------------------------------
process (clk)
begin
if rising_edge(clk) then
if (tmp1_ce = '1') then
tmp1 <= stack_data_out;
end if;
end if;
end process;
... tmp2 assignment ...
------------------------------------------------------
process (clk)
begin
if rising_edge(clk) then
if (reset = '1') then
tos <= to_unsigned(0, 4);
elsif (tos_ce = '1') then
tos <= tos_adj;
end if;
end if;
end process;
------------------------------------------------------
tos_adj <= tos + 1 when (tos_inc_dec_sel = '1')
else tos - 1 ;
...
...tos_adj, stack_addr, and stack_data_in...
end sepfsm;
2.11.6.2 Stack: Datapath Operations
The state machine in Section 2.11.6.1 controlled each datapath component individually.
An alternative style is for the state machine to tell the datapath what state it is in, or what global
collection of operations to perform, then each part of the datapath decodes this and takes the
appropriate action.
This style works best when there are a small number of states and a large number of datapath
components.
architecture dp_op of stack is
----------------------------------------------------
-- define the states
type dp_op_ty is
(init_op,
pop_op,
push1_op,
push2_op,
swap_wr_tmp1_op,
swap_wr_tmp2_op,
swap_rd_tmp1_op,
swap_rd_tmp2_op,
nop_op
);
signal dp_op : dp_op_ty;
signal tos,
tos_adj,
stack_addr : unsigned(3 downto 0);
signal inp_intern,
stack_data_in,
stack_data_out,
tmp1,
tmp2 : std_logic_vector(3 downto 0);
signal empty,
stack_we : std_logic;
begin
Continued ...
...continued
---------------------------------------------------------
process begin
init : loop
--------------------------------
empty <= '1';
dp_op <= init_op;
loop
--------------------------------
wait until rising_edge(clk);
next init when (reset = '1');
--------------------------------
case inp is
when pop =>
dp_op <= pop_op;
when push =>
dp_op <= push1_op;
--------------------------------
wait until rising_edge(clk);
next init when (reset = '1');
--------------------------------
-- stack(to_integer(tos)) <= inp;
dp_op <= push2_op;
empty <= '0';
when swap =>
...
end case;
end loop;
end loop;
end process;
-----------------------------------------------------
process (clk)
begin
if rising_edge(clk) then
inp_intern <= inp;
end if;
end process;
Continued...
...continued
------------------------------------------------------
process (clk)
begin
if rising_edge(clk) then
if (dp_op = init_op) then
tos <= to_unsigned(0,4);
elsif ( (dp_op = pop_op)
OR (dp_op = push1_op and (empty = '0'))
)
then
tos <= tos_adj;
end if;
end if;
end process;
------------------------------------------------------
tos_adj
<= tos + to_unsigned(1,3) when (dp_op = push1_op)
else tos - to_unsigned(1,3)
;
------------------------------------------------------
stack_addr
<= tos_adj
when
( (dp_op = pop_op)
OR ((dp_op = push1_op) AND (empty = '0'))
OR (dp_op = swap_wr_tmp1_op)
OR (dp_op = swap_rd_tmp2_op)
)
else tos
;
...stack_data_in, stack_we, out, ram ...
end dp_op;
2.11.6.3 Stack: Explicit State Machine
Here we drop the loop ... wait ... style of implicit state machines and build an explicit
state machine with current and next state signals.
Notice that the stack is such a simple design that each datapath operation in the Dp-Op architecture
is used in only one state. This is a sign that the Dp-Op style is not well-suited to the stack.
This example also illustrates the use of a function to capture common code. The function is used
here to determine which state to go to next when a new input instruction arrives.
architecture state of stack is
type state_ty is
(init_st,
pop_st,
push1_st, push2_st,
swap_wr_tmp1_st, swap_wr_tmp2_st,
swap_rd_tmp1_st, swap_rd_tmp2_st,
nop_st
);
signal state, state_n : state_ty;
...
...
--------------------------------------------------------
function restart
(inp : std_logic_vector(3 downto 0))
return state_ty
is
begin
case inp is
when pop =>
return(pop_st);
when push =>
return(push1_st);
when swap =>
return(swap_wr_tmp1_st);
when others =>
return(nop_st);
end case;
end restart;
begin
------------------------------------------------------
process (clk) begin
if rising_edge(clk) then
if (reset = '1') then
state <= init_st;
empty_n <= '1';
else
state <= state_n;
empty_n <= empty;
end if;
end if;
end process;
Continued...
...continued
------------------------------------------------------
process (state, inp) begin
case state is
when init_st | pop_st | push2_st
| swap_wr_tmp2_st | nop_st
=>
state_n <= restart(inp);
when push1_st =>
state_n <= push2_st;
when swap_rd_tmp1_st =>
state_n <= swap_rd_tmp2_st;
when swap_rd_tmp2_st =>
state_n <= swap_wr_tmp1_st;
when swap_wr_tmp1_st =>
state_n <= swap_wr_tmp2_st;
end case;
end process;
...
process (clk)
begin
if rising_edge(clk) then
if (state = init_st) then
tos <= to_unsigned(0,4);
elsif ( (state = pop_st)
OR (state = push1_st and (empty = '0'))
)
then
tos <= tos_adj;
end if;
end if;
end process;
------------------------------------------------------
tos_adj
<= tos + to_unsigned(1,3) when (state = push1_st)
else tos - to_unsigned(1,3)
;
------------------------------------------------------
stack_addr <= tos_adj when
( (state = pop_st)
OR ((state = push1_st) AND (empty = '0'))
OR (state = swap_wr_tmp1_st)
OR (state = swap_rd_tmp2_st)
)
else tos ;
...
end state;
2.12 Optimization Techniques
2.12.1 Strength Reduction
Strength reduction replaces one operation with another that is simpler.
2.12.1.1 Arithmetic Strength Reduction
Multiply by a constant power of two → wired shift logical left
Multiply by a power of two → shift logical left
Divide by a constant power of two → wired shift logical right
Divide by a power of two → shift logical right
Multiply by 3 → wired shift and addition
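For example, here is a minimal sketch (the entity and signal names are made up for illustration, not from the notes) of multiplying by the constant 4 with a wired shift; the synthesized circuit contains no multiplier, only wiring:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity times4 is
  port (
    x : in  unsigned(7 downto 0);
    y : out unsigned(9 downto 0)
  );
end times4;

architecture main of times4 is
begin
  -- same value as x * 4, but implemented purely as wiring
  y <= x & "00";
end main;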
2.12.1.2 Boolean Strength Reduction
Boolean tests that can be implemented as wires
is odd, is even
is neg, is pos
By choosing your encodings carefully, you can sometimes reduce a vector comparison to a wire.
For example, if your state uses a one-hot encoding, then the comparison state = S3 reduces
to state(3) = '1'. You might expect a reasonable logic-synthesis tool to do this reduction
automatically, but most tools do not do this reduction.
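The fragment below sketches this reduction inside an architecture (a hypothetical 4-state one-hot state register; this is not code from the stack example):

signal state : std_logic_vector(3 downto 0);  -- one-hot: S0 = "0001", ..., S3 = "1000"
signal in_s3 : std_logic;
...
-- full-vector comparison: a 4-bit equality check
-- in_s3 <= '1' when (state = "1000") else '0';
-- strength-reduced version: just a wire from bit 3
in_s3 <= state(3);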
When using encodings other than one-hot, Karnaugh maps can be useful tools for optimizing vector
comparisons. By carefully choosing our state assignments, when we use a full binary encoding for
8 states, the comparison:
(state = S0 or state = S3 or state = S4) = 1
can be reduced to a single bit comparison, such as state(2) = '1'.
2.12.2 Replication and Sharing
2.12.2.1 Mux-Pushing
Pushing multiplexors into the fanin of a signal can reduce area.
Before
z <= a + b when (w = '1')
else a + c;
After
tmp <= b when (w = '1')
else c;
z <= a + tmp;
The first circuit will have two adders, while the second will have one adder. Some synthesis tools
will perform this optimization automatically, particularly if all of the signals are combinational.
2.12.2.2 Common Subexpression Elimination
Introduce new signals to capture subexpressions that occur in multiple places in the code.
Before
y <= a + b + c when (w = '1')
else d;
z <= a + c + d when (w = '1')
else e;
After
tmp <= a + c;
y <= b + tmp when (w = '1')
else d;
z <= d + tmp when (w = '1')
else e;
Note: Clocked subexpressions Care must be taken when doing common
subexpression elimination in a clocked process. Putting the temporary sig-
nal in the clocked process will add a clock cycle to the latency of the com-
putation, because the tmp signal will be a flip-flop. The tmp signal must be
combinational to preserve the behaviour of the circuit.
2.12.2.3 Computation Replication
To improve performance
If the same result is needed at two very distant locations and wire delays are significant, it might
improve performance (increase clock speed) to replicate the hardware
To reduce area
If the same result is needed at two different times that are widely separated, it might be cheaper to
reuse the hardware component to repeat the computation than to store the result in a register
Note: Muxes are not free Each time a component is reused, multiplexors
are added to inputs and/or outputs. Too much sharing of a component can cost
more area in additional multiplexors than would be spent in replicating the
component.
2.12.3 Arithmetic
VHDL is left-associative. The expression a + b + c + d is interpreted as (((a + b) +
c) + d). You can use parentheses to suggest parallelism.
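A small sketch of the difference (the signal names are illustrative):

-- default left-to-right grouping: three adders in series
z_chain <= a + b + c + d;
-- explicit parentheses: two adders in parallel feeding a third,
-- so the critical path is two adder delays instead of three
z_tree  <= (a + b) + (c + d);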
Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a
result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller
and faster design than computing all 16 bits of the result and trimming the result to 12 bits.
2.12.4 Pipelining
You can turn a dataflow diagram into a pipeline by making each clock cycle of the dataflow diagram
a separate pipe stage. However, this can be complicated and error-prone. You need to worry
about data hazards if you have state-holding registers in your algorithm. You need to worry about
structural hazards if different instructions have different latencies.
A rough description of the technique to turn a dataflow diagram into a pipeline:
Group one or more consecutive clock cycles of computation for all instructions into each stage.
Each stage becomes a single module. Hardware is not shared between stages. So, moving from a
non-pipelined implementation to a pipelined implementation will increase the area of the design.
For pipelines, the most important measure of performance is usually throughput, which is the
inverse of the number of clock cycles that are grouped into a single stage. For example, if each clock
cycle becomes a single stage, then the throughput (as measured in clock cycles) is 1 parcel/clock-
cycle. As another example, if two clock cycles are grouped into a single stage, then a new parcel
can enter the pipeline once every two clock cycles.
2.13 Design Problems
P2.1 Synthesis
This question is about using VHDL to implement memory structures on FPGAs.
P2.1.1 Data Structures
If you have to write your own code (i.e. you do not have a library of memory components or a
special component generation tool such as LogiBlox or CoreGen), what data structures in VHDL
would you use when creating a register le?
P2.1.2 Own Code vs Libraries
When using VHDL for an FPGA, under what circumstances is it better to write your own VHDL
code for memory, rather than instantiate memory components from a library?
P2.2 Design Guidelines
While you are grocery shopping you encounter your co-op supervisor from last year. She's now
forming a startup company in Waterloo that will build digital circuits. She's writing up the de-
sign guidelines that all of their projects will follow. She asks for your advice on some potential
guidelines.
What is your response to each question?
What is your justication for your answer?
What are the tradeoffs between the two options?
0. Sample Should all projects use silicon chips, or should all use biological chips, or should
each project choose its own technique?
Answer: All projects should use silicon-based chips, because biological chips don't
exist yet. The tradeoff is that if biological chips existed, they would probably con-
sume less power than silicon chips.
1. Should all projects use an asynchronous reset signal, or should all use a synchronous reset
signal, or should each project choose its own technique?
2. Should all projects use latches, or should all projects use flip-flops, or should each project
choose its own technique?
3. Should all chips have registers on the inputs and outputs or should chips have the inputs
and outputs directly connected to combinational circuitry, or should each project choose
its own technique? By register we mean either flip-flops or latches, based upon your
answer to the previous question. If your answer is different for inputs and outputs, explain
why.
4. Should all circuit modules on all chips have flip-flops on the inputs and outputs or should
chips have the inputs and outputs directly connected to combinational circuitry, or
should each project choose its own technique? By register we mean either flip-flops or
latches, based upon your answer to the previous question. If your answer is different for
inputs and outputs, explain why.
5. Should all projects use tri-state buffers, or should all projects use multiplexors, or should
each project choose its own technique?
P2.3 Dataflow Diagram Optimization
Use the dataflow diagram below to answer problems P2.3.1 and P2.3.2.
[Figure: dataflow diagram for P2.3, with inputs a, b, c, d, e and computations using f and g components.]
P2.3.1 Resource Usage
List the number of items for each resource used in the dataflow diagram.
P2.3.2 Optimization
Draw an optimized dataflow diagram that improves the performance and produces the same output
values. Or, if the performance cannot be improved, describe the limiting factor on the performance.
NOTES:
you may change the times when signals are read from the environment
you may not increase the resource usage (input ports, registers, output ports, f components,
g components)
you may not increase the clock period
P2.4 Dataflow Diagram Design
Your manager has given you the task of implementing the following pseudocode in an FPGA:
if is_odd(a + d)
p = (a + d)*2 + ((b + c) - 1)/4;
else
p = (b + c)*2 + d;
NOTES: 1) You must use registers on all input and output ports.
2) p, a, b, c, and d are to be implemented as 8-bit signed signals.
3) A 2-input 8-bit ALU that supports both addition and subtraction takes 1
clock cycle.
4) A 2-input 8-bit multiplier or divider takes 4 clock cycles.
5) A small amount of additional circuitry (e.g. a NOT gate, an AND gate, or a
MUX) can be squeezed into the same clock cycle(s) as an ALU operation,
multiply, or divide.
6) You can require that the environment provides the inputs in any order and
that it holds the input signals at the same value for multiple clock cycles.
P2.4.1 Maximum Performance
What is the minimum number of clock cycles needed to implement the pseudocode with a circuit
that has two input ports?
What is the minimum number of ALUs, multipliers, and dividers needed to achieve the minimum
number of clock cycles that you just calculated?
P2.4.2 Minimum area
What is the minimum number of datapath storage registers (8, 6, 4, and 1 bit) and clock cycles
needed to implement the pseudocode if the circuit can have at most one ALU, one multiplier, and
one divider?
P2.5 Michener: Design and Optimization
Design a circuit named michener that performs the following operation: z = (a+d) + ((b -
c) - 1)
NOTES:
1. Optimize your design for area.
2. You may schedule the inputs to arrive at any time.
3. You may do algebraic transformations of the specification.
P2.6 Dataflow Diagrams with Memory Arrays
Component Delay
Register 5 ns
Adder 25 ns
Subtracter 30 ns
ALU with +, -, >, =, AND, XOR 40 ns
Memory read 60 ns
Memory write 60 ns
Multiplication 65 ns
2:1 Multiplexor 5 ns
NOTES:
1. The inputs of the algorithms are a and b.
2. The outputs of the algorithms are p and q.
3. You must register both your inputs and outputs.
4. You may choose to read your input data values at any time and produce your outputs at any
time. For your inputs, you may read each value only once (i.e. the environment will not send
multiple copies of the same value).
5. Execution time is measured from when you read your first input until the latter of producing
your last output or the completion of writing a result to memory
6. M is an internal memory array, which must be implemented as dual-ported memory with one
read/write port and one read port.
7. M supports synchronous write and asynchronous read.
8. Assume all memory address and other arithmetic calculations are within the range of repre-
sentable numbers (i.e. no overflows occur).
9. If you need a circuit not on the list above, assume that its delay is 30 ns.
10. You may sacrifice area efficiency to achieve high performance, but marks will be deducted
for extra hardware that does not contribute to performance.
P2.6.1 Algorithm 1
Algorithm
q = M[b];
M[a] = b;
p = M[b+1] * a;
Assuming a ≤ b, draw a dataflow diagram that is optimized for the fastest overall execution
time.
P2.6.2 Algorithm 2
q = M[b];
M[a] = q;
p = (M[b-1] * b) + M[b];
Assuming a > b, draw a dataflow diagram that is optimized for the fastest overall execution
time.
P2.7 2-bit adder
This question compares an FPGA and generic-gates implementation of a 2-bit full adder.
P2.7.1 Generic Gates
Show the implementation of a 2 bit adder using NAND, NOR, and NOT gates.
P2.7.2 FPGA
Show the implementation of a 2 bit adder using generic FPGA cells; show the equations for the
lookup tables.
[Figure: implementation using two generic FPGA cells, each a lookup table (comb) feeding a flip-flop (D, Q, CE, S, R). Inputs: a[0], b[0], a[1], b[1], c_in; outputs: sum[0], sum[1], c_out; internal carry signal carry_1.]
P2.8 Sketches of Problems
1. calculate resource usage for a dataflow diagram (input ports, output ports, registers, datapath
components)
2. calculate performance data for a dataflow diagram (clock period and number of cycles to
execute (CPI))
3. given a dataflow diagram, calculate the clock period that will result in the optimum perfor-
mance
4. given an algorithm, design a dataflow diagram
5. given a dataflow diagram, design the datapath and finite state machine
6. optimize a dataflow diagram to improve performance or reduce resource usage
7. given an fsm diagram, pick the VHDL code that best implements the diagram (correct behaviour;
simple, fast hardware), or critique the hardware
Chapter 3
Functional Verification
3.1 Introduction
3.1.1 Purpose
The purpose of this chapter is to illustrate techniques to quickly and reliably detect bugs in datapath
and control circuits.
Section 3.5 discusses verification of datapath circuits and introduces the notions of testbench, spec-
ification, and implementation. In section 3.6 we discuss techniques that are useful for debugging
control circuits.
The verification guild website:
http://www.janick.bergeron.com/guild/default.htm
is a good source of information on functional verification.
3.2 Overview
The purpose of functional verification is to detect and correct errors that cause a system to produce
erroneous results. The terminology for validation, verification, and testing differs somewhat from
discipline to discipline. In this section we outline some of the terminology differences and describe
the terminology used in E&CE 427. We then describe some of the reasons that chips tend to work
incorrectly.
3.2.1 Terminology: Validation / Verification / Testing
functional validation
Comparing the behaviour of a design against the customer's expectations. In validation, the
specification is the customer. There is no specification that can be used to evaluate the
correctness of the design (implementation).
functional verification
Comparing the behaviour of a design (e.g. RTL code) against a specification (e.g. high-level
model) or collection of properties
usually treats combinational circuitry as having zero-delay
usually done by simulating circuit with test vectors
big challenges are simulation speed and test generation
formal verification
checking that a design has the correct behaviour for every possible input and internal state
uses mathematics to reason about circuit, rather than checking individual vectors of 1s and
0s
capacity problems: only usable on detailed models of small circuits or abstract models of
large circuits
mostly a research topic, but some practical applications have been demonstrated
tools include model checking and theorem proving
formal verification is not a guarantee that the circuit will work correctly
performance validation
checking that implementation has (at least) desired performance
power validation
checking that implementation has (at most) desired power
equivalence verification (checking)
checking that the design generated by a synthesis tool has same behaviour as RTL code.
timing verification
checking that all of the paths in a circuit meet the timing constraints
Hardware vs Software Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Note: in software, testing refers to running programs with specific inputs and checking if the
program does the right thing. In hardware, testing usually means manufacturing testing, which
is checking the circuits that come off of the manufacturing line.
3.2.2 The Difficulty of Designing Correct Chips
3.2.2.1 Notes from Kenn Heinrich (UW E&CE grad)
Everyone should get a lecture on why their first industrial design won't work in the field.
Here are a few reasons why getting a single system to work correctly for a few minutes in a university lab
is much easier than getting thousands of systems to work correctly for months at a time in dozens
of countries around the world.
1. You forgot to make your unreachable states transition to the initial (reset) state. Clock
glitches, power surges, etc will occasionally cause your system to jump to a state that isn't
defined or produce an illegal data value. When this happens, your design should reset itself,
rather than crash or generate illegal outputs.
2. You have internal registers that you can't access or test. If you can set a register, you must
have some way of reading the register from outside the chip.
3. Another chip controls your chip, and the other chip is buggy. All of your external control
lines should be able to be disabled, so that you can isolate the source of problems.
4. Not enough decoupling capacitors on your board. The analog world is cruel and un-
usual. Voltage spikes, current surges, crosstalk, etc can all corrupt the integrity of digital
signals. Trying to save a few cents on decoupling capacitors can cause headaches and sig-
nificant financial costs in the future.
5. You only tested your system in the lab, not in the real world. As a product, systems will
need to run for months in the field; simulation and simple lab testing won't catch all of the
weirdness of the real world.
6. You didn't adequately test the corner cases and boundary conditions. Every corner case is as
important as the main case. Even if some weird event happens only once every six months,
if you do not handle it correctly, the bug can still make your system unusable and unsellable.
3.2.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)
More than 60% of the ASIC designs that are fabricated have at least one error, issue, or problem
whose severity forced the design to be reworked.
Even experienced designers have difficulty building chips that function correctly on the first pass
(Figure 3.1).
[Figure 3.1 (bar chart): problems found on first spins of new chip designs. 61% of new chip designs
require at least one re-spin (at least one error, issue, or problem). The problem categories are:
functional logic error, analog tuning issue, signal integrity issue, clock scheme error, reliability
issue, mixed-signal problem, timing issue (slow paths), timing issue (fast paths), IR drop issues,
firmware error, uses too much power, and other problems; the individual category rates range from
43% down to 3%.]
Source: Aart de Geus, Chairman and CEO of Synopsys. Keynote address. Synopsys Users
Group Meeting, Sep 9 2003, Boston USA.
Figure 3.1: Problems found on first-spins of new chip designs
3.3 Test Cases and Coverage
3.3.1 Test Terminology
Test case / test vector :
A combination of inputs and internal state values. Represents one possible test of the system.
Boundary conditions / corner cases :
A test case that represents an unusual situation on input and/or internal state signals. Corner
cases are likely to contain bugs.
Test scenario :
A sequence of test vectors that, together, exercise a particular situation (scenario) on a circuit.
For example, a scenario for an elevator controller might include a sequence of button pushes
and movements between floors.
Test suite :
A collection of test vectors that are run on a circuit.
3.3.2 Coverage
To be absolutely certain that an implementation is correct, we must check every combination of
values. This includes both input values and internal state (flip-flops).
If we have n_i bits of inputs and n_s bits in flip-flops, we have to test 2^(n_i + n_s) different cases when
doing functional verification.
Question: If we have n_c combinational signals, why don't we have to test
2^(n_i + n_s + n_c) different cases?
Definition Coverage: The coverage that a suite of tests achieves on a circuit is the
percentage of cases that are simulated by the tests. 100% coverage means that the
circuit has been simulated for all combinations of values for input signals and internal
signals.
Note: Coverage Terminology There are many different types of coverage,
which measure everything from percentage of cases that are exercised to num-
ber of output values that are exercised.
There are many different commercial software programs that measure code and other types of
coverage.
Company Tool Coverage
Cadence Affirma Coverage Analyzer
Cadence DAI Coverscan code, expressions, fsm
Cadence Codecover code, expressions, fsm
Fintronic FinCov code
Summit Design HDLScore code, events, variables
Synopsys CoverMeter code coverage (dead?)
TransEDA Verification Navigator code and fsm
Verisity SureCov code, block, values, fsm
Veritools Express VCT, VeriCover code, branch
Aldec Riviera code, block
3.3.3 Floating Point Divider Example
This example illustrates the difficulty of achieving significant coverage on realistic circuits.
Consider doing the functional simulation for a double precision (64-bit) floating-point divider.
Given Information
Data width 64 bits
Number of gates in circuit 10 000
Number of assembly-language instructions to simulate one
gate for one test case
100
Number of clock cycles required to execute one assembly
language instruction on the computer that is running the
simulation
0.5
Clock speed of computer that is running the simulation 1 Gigahertz
Number of Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: How many cases must be considered?
Simulation Run Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: How long will it take to simulate all of the different possible cases using a
single computer?
Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: If you can run simulations non-stop for one year on ten computers, what
coverage will you achieve?
Simulation vs the Real World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
From Validating the Intel(R) Pentium(R) 4 Microprocessor by Bob Bentley, Design Automation Con-
ference 2001. (Link on E&CE 427 web page.)
Simulating the Pentium 4 Processor on a Pentium 3 Processor ran at about 15 MHz.
By tapeout, over 200 billion simulation cycles had been run on a network of computers.
All of these simulations represent less than two minutes of running a real processor.
3.4 Testbenches
A test bench (also known as a test rig, test harness, or test jig) is a collection of code used
to simulate a circuit and check if it works correctly.
Testbenches are not synthesized. You do not need to restrict yourself to the synthesizable subset of
VHDL. Use the full power of VHDL to make your testbenches concise and powerful.
3.4.1 Overview of Test Benches
[Figure: a testbench contains a stimulus, the implementation, a specification, and a check.]
Implementation Circuit that you're checking for bugs
also known as: design under test or unit under test
Stimulus Generates test vectors
Specification Describes desired behaviour of implementation
Check Checks whether implementation obeys specification
Notes and observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Testbenches usually do not have any inputs or outputs.
Inputs are generated by stimulus
Outputs are analyzed by check and relevant information is printed using report statements
Different circuits will use different stimuli, specications, and checks.
The roles of the specification and check are somewhat flexible.
Most circuits will have complex specifications and simple checks.
However, some circuits will have simple specifications and complex checks.
If two circuits are supposed to have the same behaviour, then they can use the same stimuli,
specification, and check.
If two circuits are supposed to have the same behaviour, then one can be used as the specification
for the other.
Testbenches are restricted to stimulating only primary inputs and observing only primary out-
puts. To check the behaviour of internal signals, use assertions.
3.4.2 Reference Model Style Testbench
[Figure: reference-model style testbench. The stimulus drives both the implementation and the specification, and the check compares their outputs.]
Specification has same inputs and outputs as implementation.
Specification is a clock-cycle accurate description of desired behaviour of implementation.
Check is an equality test between outputs of specification and implementation.
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Execution modules: output is sum, difference, product, quotient, etc. of inputs
DSP lters
Instruction decoders
Note: Functional specification vs Reference model Functional specification
and reference model are often used interchangeably.
3.4.3 Relational Style Testbench
[Figure: relational style testbench. The stimulus drives the implementation, and the check examines the implementation's inputs and outputs directly.]
Relational testbenches, or relational specifications, are used when we do not want to specify the
specific output values that the implementation must produce.
Instead, we want to check that some relationship holds between the output and the input, or
that some relationship holds amongst the output values (independent of the values of the input
signals).
Specification is usually just wires to feed the input signals to the check.
Check is the brains and encodes the desired behaviour of the circuit.
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Carry-save adders: the two outputs are the sum of the three inputs, but do not specify exact
values of each individual output.
Arbiters: every request is eventually granted, but do not specify in which order requests are
granted.
One-hot encoding: exactly one bit of vector is a 1, but do not specify which bit is a 1.
Note: Relational specification vs relational testbench Relational specification
and relational testbench are often used interchangeably.
3.4.4 Coding Structure of a Testbench
architecture main of athabasca_tb is
component declaration for implementation;
other declarations
begin
implementation instantiation;
stimulus process;
specification process (or component instantiation);
check process;
end main;
3.4.5 Datapath vs Control
Datapath and control circuits tend to use different styles of testbenches.
Datapath circuits tend to be well-suited to reference-model style testbenches:
Each set of inputs generates one set of outputs
Each set of outputs is a function of just one set of inputs
Control circuits often pose problems for testbenches,
Many more internal signals than outputs.
The behaviour of the outputs provides a view into only a fragment of the current state of the
circuit.
It may take many clock cycles from when a bug is exercised inside the circuit until it generates
a deviation from the correct behaviour on the outputs.
When the deviation on the outputs is observed, it is very difficult to pinpoint the precise cause
of the deviation (the root cause of the bug).
Assertions can be used to check the behaviour of internal signals. Control circuits tend to use
assertions to check correctness and rely on testbenches only to stimulate inputs.
3.4.6 Verification Tips
Suggested order of simulation for functional verification.
1. Write high-level model.
2. Simulate high-level model until it has the correct functionality and latency.
3. Write synthesizable model.
4. Use zero-delay simulation (uw-sim) to check behaviour of synthesizable model against
high-level model.
5. Optimize the synthesizable model.
6. Use zero-delay simulation (uw-sim) to check behaviour of optimized model against high-
level model.
7. Use timing-simulation (uw-timsim) to check behaviour of optimized model against high-
level model.
Section 3.5 describes a series of testbenches that are particularly useful for debugging datapath
circuits in the early phases of the design cycle.
3.5 Functional Verification for Datapath Circuits
In this section we will incrementally develop a testbench for a very simple circuit: an AND gate.
Although the example circuit is trivial in size, the process scales well to very large circuits. The
process allows verification to begin as soon as a circuit is simulatable, even before a complete
specification has been written.
Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
entity and2 is
port (
a, b : in std_logic;
c : out std_logic
);
end and2;
architecture main of and2 is
begin
c <= '1' when (a = '1' AND b = '1')
else '0';
end main;
3.5.1 A Spec-Less Testbench
(NOTE: this code has been reviewed manually but has not been simulated. The concepts are
illustrated correctly, but there might be typographical errors in the code.)
First, use waveform viewer to check that implementation generates reasonable outputs for a small
set of inputs.
entity and2_tb is
end and2_tb;
architecture main_tb of and2_tb is
component and2
port (
a, b : in std_logic;
c : out std_logic
);
end component;
signal ta, tb, tc_impl : std_logic;
signal ok : boolean;
begin
---------------------------------------------
impl : and2 port map (a => ta, b => tb, c => tc_impl);
---------------------------------------------
stimulus : process
begin
ta <= '0'; tb <= '0';
wait for 10 ns;
ta <= '1'; tb <= '1';
wait for 10 ns;
end process;
---------------------------------------------
end main_tb;
Use the spec-less testbench until the implementation generates solid Boolean values (no X or U data)
and you have checked that a few simple test cases generate correct outputs.
3.5.2 Use an Array for Test Vectors
Writing code to drive inputs and repetitively typing wait for 10 ns; can get tedious, so code
up test vectors in an array.
(NOTE: this code has not been checked for correctness)
architecture main_tb of and2_tb is
...
begin
...
stimulus : process
type test_datum_ty is record
ra, rb : std_logic;
end record;
type test_vectors_ty is
array(natural range <>) of test_datum_ty
;
constant test_vectors : test_vectors_ty :=
-- a b
( ('0', '0'),
('1', '1')
);
begin
for i in test_vectors'low to test_vectors'high loop
ta <= test_vectors(i).ra;
tb <= test_vectors(i).rb;
wait for 10 ns;
end loop;
end process;
end main_tb;
Use this testbench until checking the correctness of the outputs by hand using the waveform viewer
becomes difficult.
3.5.3 Build Spec into Stimulus
(NOTE: this code has not been checked for correctness)
After a few test vectors appear to be working correctly (via a manual check of waveforms on
simulation), begin automatically checking that outputs are correct.
Add expected result to stimulus
Add check process
architecture main_tb of and2_tb is
...
begin
------------------------------------------
impl : and2 port map (a => ta, b => tb, c => tc_impl);
------------------------------------------
stimulus : process
type test_datum_ty is record
ra, rb, rc : std_logic;
end record;
type test_vectors_ty is array(natural range <>) of test_datum_ty;
constant test_vectors : test_vectors_ty :=
-- a, b: inputs
-- c : expected output
-- a b c
( ('0', '0', '0'),
('0', '1', '0'),
('1', '1', '1')
);
begin
for i in test_vectors'low to test_vectors'high loop
ta <= test_vectors(i).ra;
tb <= test_vectors(i).rb;
tc_spec <= test_vectors(i).rc;
wait for 10 ns;
end loop;
end process; ------------------------------------------
check : process (tc_impl, tc_spec)
begin
ok <= (tc_impl = tc_spec);
end process;
------------------------------------------
end main_tb;
Use this testbench until it becomes tedious to calculate manually the correct result for each test
case.
3.5.4 Have Separate Specification Entity
Rather than write the specification as part of stimulus, create a separate specification entity/architecture.
The specification component then calculates the expected output values.
(NOTE: if your simulation tool supports configurations, the spec and impl can share the same
entity; we'll see this in section 3.6)
entity and2_spec is
...(same as and2 entity)...
end and2_spec;
architecture spec of and2_spec is
begin
c <= a AND b;
end spec;
architecture main_tb of and2_tb is
component and2 ...;
component and2_spec ...;
signal ta, tb, tc_impl, tc_spec : std_logic;
signal ok : boolean;
begin
------------------------------------------
impl : and2 port map (a => ta, b => tb, c => tc_impl);
spec : and2_spec port map (a => ta, b => tb, c => tc_spec);
------------------------------------------
stimulus : process
type test_datum_ty is record
ra, rb : std_logic;
end record;
type test_vectors_ty is array(natural range <>) of test_datum_ty;
constant test_vectors : test_vectors_ty :=
-- a b
( ('0', '0'),
('1', '1')
);
begin
for i in test_vectors'low to test_vectors'high loop
ta <= test_vectors(i).ra;
tb <= test_vectors(i).rb;
wait for 10 ns;
end loop;
end process;
------------------------------------------
check : process (tc_impl, tc_spec)
begin
ok <= (tc_impl = tc_spec);
end process;
------------------------------------------
end main_tb;
3.5.5 Generate Test Vectors
When it becomes tedious to write out each test vector by hand, we can automatically compute them.
This example uses a pair of nested for loops to generate all four permutations of input values
for two signals.
architecture main_tb of and2_tb is
...
begin
...
stimulus : process
subtype std_test_ty is std_logic range '0' to '1';
begin
for va in std_test_ty'low to std_test_ty'high loop
for vb in std_test_ty'low to std_test_ty'high loop
ta <= va;
tb <= vb;
wait for 10 ns;
end loop;
end loop;
end process;
...
end main_tb;
3.5.6 Relational Specification
architecture main_tb of and2_tb is
...
begin
------------------------------------------
impl : and2 port map (a => ta, b => tb, c => tc_impl);
------------------------------------------
stimulus : process
...
end process;
------------------------------------------
check : process (tc_impl, tc_spec)
begin
ok <= NOT (tc_impl = '1' AND (ta = '0' OR tb = '0'));
end process;
------------------------------------------
end main_tb;
3.6 Functional Verification of Control Circuits
Control circuits are often more challenging to verify than datapath circuits.
Control circuits have many internal signals. Testbenches are unable to access key information
about the behaviour of a control circuit.
Many clock cycles can elapse between when a bug causes an internal signal to have an incorrect
value and when an output signal shows the effect of the bug.
In this section, we will explore the functional verification of state machines via a First-In First-Out
queue.
The VHDL code for the queue is on the web at:
http://www.ece.uwaterloo.ca/ece427/exs/queue
3.6.1 Overview of Queues in Hardware
[Figure: a queue; data is written into one end and read from the other.]
Figure 3.2: Structure of queue
[Figure: writing value A into an empty queue (before and after the write).]
Figure 3.3: Write Sequence
[Figure: writing value B into a queue that already holds A (before and after the write).]
Figure 3.4: A Second Example Write
[Figure: two reads from the queue (before and after each read).]
Figure 3.5: Example Read Sequence
[Figure: a write in which the write index wraps around to the start of the storage array.]
Figure 3.6: Write Illustrating Index Wrap
[Figure: a write into a queue that is full.]
Figure 3.7: Write Illustrating Full Queue
[Figure: queue signals: empty, mem, wr_idx, rd_idx, data_wr, data_rd, do_wr, do_rd.]
Figure 3.8: Queue Signals
[Figure: queue blocks: a dual-ported memory (WE, A0, DI0, DO0, A1, DO1) connected to wr_idx, rd_idx, data_wr, data_rd, do_wr, do_rd, and empty.]
Figure 3.9: Incomplete Queue Blocks
Control circuitry not shown.
3.6.2 VHDL Coding
3.6.2.1 Package
Things to notice in queue package:
1. separation of package and body
package queue_pkg is
subtype data is std_logic_vector(3 downto 0);
function to_data(i : integer) return data;
end queue_pkg;
package body queue_pkg is
function to_data(i : integer) return data is
begin
return std_logic_vector(to_unsigned(i, 4));
end to_data;
end queue_pkg;
3.6.2.2 Other VHDL Coding
VHDL coding techniques to notice in queue implementation:
1. type declaration for vectors
2. attributes
(a) 'low, 'high, 'length
3. functions (reduce overall implementation and maintenance effort)
(a) reduce redundant code
(b) hide implementation details
(c) (just like software engineering....)
3.6.3 Code Structure for Verification
Verification things to notice in queue implementation:
1. instrumentation code
2. coverage monitors
3. assertions
architecture ... is
...
begin
... normal implementation ...
process (clk)
begin
if rising_edge(clk) then
... instrumentation code ...
prev_signame <= signame;
end if;
end process;
... assertions ...
... coverage monitors ...
end;
3.6.4 Instrumentation Code
Added to implementation to support verication
Usually keeps track of previous values of signals
Does not create hardware (Optimized away during synthesis)
Does not feed any output signals
Must use synthesizable subset of VHDL
process (clk) begin
if rising_edge(clk) then
prev_rd_idx <= rd_idx;
prev_wr_idx <= wr_idx;
prev_do_rd <= do_rd;
prev_do_wr <= do_wr;
end if;
end process;
Note: Naming convention for instrumentation For assertions, signals are
named prev_signame and signame, rather than next_signame and
signame as is done for state machines. This is because for assertions we
use the prev signals as history signals, to keep track of past events. In con-
trast, for state machines, we name the signals next, because the state machine
computes the next values of signals.
3.6.5 Coverage Monitors
The goal of a coverage monitor is to check if a certain event is exercised in a simulation run. If a
test suite does not trigger a coverage monitor, then we probably want to add a test vector that will
trigger the monitor.
For example, for a circuit used in a microwave oven controller, we might want to make sure that
we simulate the situation when the door is opened while the power is on.
1. Identify important events, conditions, transitions
2. Write instrumentation code to detect event
3. Use report to write when event happens
4. When run simulation, report statements will print when coverage condition detected
5. Pipe simulation results to a log file
6. Examine the log file and coverage monitors to find cases and transitions not tested by existing
test vectors
7. Add test vectors to exercise missing cases
8. Idea: automate detection of missing cases using a Perl script to find coverage messages in the
VHDL code that aren't in the log file
9. Real world: most commercial synthesis tools come with add-on packages that provide dif-
ferent types of coverage analysis
10. Research/entrepreneurial idea: based on missing coverage cases, find new test vectors to
exercise the case
Coverage Events for Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: three before/after snapshots of rd_idx and wr_idx, illustrating the read and write indices moving apart, meeting, and catching each other.]
Question: What events should we monitor to estimate the coverage of our functional
tests?
Answer:
wr_idx and rd_idx are far apart
wr_idx and rd_idx are equal
wr_idx catches rd_idx
rd_idx catches wr_idx
rd_idx wraps
wr_idx wraps
Coverage Monitor Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
process (signals read)
begin
if (condition) then
report "coverage: message";
elsif (condition) then
report "coverage: message";
else
report "error: case fall through on message"
severity warning;
end if;
end process;
Coverage Monitor Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Events related to rd_idx equalling wr_idx.
process (prev_rd_idx, prev_wr_idx, rd_idx, wr_idx)
begin
if (rd_idx = wr_idx) then
if ( prev_rd_idx = prev_wr_idx ) then
report "coverage: read = write both moved";
elsif ( rd_idx /= prev_rd_idx ) then
report "coverage: Read caught write";
elsif ( wr_idx /= prev_wr_idx ) then
report "coverage: Write caught read";
else
report "error: case fall through on rd/wr catching"
severity warning;
end if;
end if;
end process;
Events related to rd_idx wrapping.
process (rd_idx)
begin
if (rd_idx = low_idx) then
report "coverage: rd mv to low";
elsif (rd_idx = high_idx) then
report "coverage: rd mv to high";
else
report "coverage: rd mv normal";
end if;
end process;
3.6.6 Assertions
Assertions for Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1. If rd_idx changes, then it increments or wraps.
2. If rd_idx changes, then do_rd was '1', or reset is '1'.
3. If wr_idx changes, then it increments or wraps.
4. If wr_idx changes, then do_wr was '1', or reset is '1'.
5. And many others....
Assertion Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
process (signals read) begin
assert (required condition)
report "error: message" severity warning;
end process;
Assertions: Read Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
process (rd_idx) begin
assert ((rd_idx > prev_rd_idx) or (rd_idx = low_idx))
report "error: rd inc" severity warning;
assert ((prev_do_rd = '1') or (reset = '1'))
report "error: rd imp do_rd" severity warning;
end process;
Assertions: Write Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
process (wr_idx) begin
assert ((wr_idx > prev_wr_idx) or (wr_idx = low_idx))
report "error: wr inc" severity warning;
assert ((prev_do_wr = '1') or (reset = '1'))
report "error: wr imp do_wr" severity warning;
end process;
3.6.7 VHDL Coding Tips
Vector Type Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
type data_array_ty is array(natural range <>) of data;
signal data_array : data_array_ty(7 downto 0);
Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
function to_idx
(i : natural range data_array'low to data_array'high)
return idx_ty
is
begin
return to_unsigned(i, idx_ty'length);
end to_idx;
Conversion to Index
Without function: rd_idx <= to_unsigned(5, 3);
With function: rd_idx <= to_idx(5);
The function code is verbose, but is very maintainable, because neither the function itself nor uses
of the function need to know the width of the index vector.
Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
function inc_idx (idx : idx_ty) return idx_ty is
begin
if idx < data_array'high then
return (idx + 1);
else
return (to_idx(data_array'low));
end if;
end inc_idx;
Feedback Loops, and Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Coding guideline: use functions. Don't use procedures.
inc as a function: wr_idx <= inc_idx(wr_idx);
inc as a procedure: inc_idx(wr_idx);
Functions clearly distinguish between reading from a signal and writing to a signal. By examining
the use of a procedure, you cannot tell which signals are read from and which are written to. You
must examine the declaration or implementation of the procedure to determine modes of signals.
Modifying a signal within a procedure results in a tri-state signal. This is bad.
File I/O (textio package) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
TEXTIO defines read, write, readline, writeline functions.
Described in:
http://www.eng.auburn.edu/department/ee/mgc/vhdl.html#textio
These functions can be used to read test vectors from a file and write results to a file.
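As a minimal sketch (the file name, file format, and signal name are hypothetical), a stimulus process could read one character per line from a text file and drive a test signal from it:

-- at the top of the testbench file:
use std.textio.all;

-- inside the testbench architecture:
stimulus : process
  file vec_file : text open read_mode is "vectors.txt";
  variable l  : line;
  variable ch : character;
begin
  while not endfile(vec_file) loop
    readline(vec_file, l);
    read(l, ch);          -- one character per line: '0' or '1'
    if ch = '1' then
      ta <= '1';
    else
      ta <= '0';
    end if;
    wait for 10 ns;
  end loop;
  wait;                   -- done: suspend forever
end process;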
3.6.8 Queue Specification
Most bugs in queues are related to the queue becoming full, becoming empty, and/or wrap of
indices.
The specification should be obviously correct. Avoid bugs in the specification by making the specification
queue larger than the maximum number of writes that we will do in the test suite. Thus, the specification
queue will never become full or wrap. However, the implementation queue will become full and
wrap.
Write Index Update in Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
We increment write-index on every write, we never wrap.
process (clk) begin
if rising_edge(clk) then
if (reset = '1') then
wr_idx <= 0;
elsif (do_wr = '1') then
wr_idx <= wr_idx + 1;
end if;
end if;
end process;
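The read index in the specification follows the same pattern; a minimal sketch (assuming the same
reset signal and a do_rd signal) is:
process (clk) begin
  if rising_edge(clk) then
    if (reset = '1') then
      rd_idx <= 0;
    elsif (do_rd = '1') then
      rd_idx <= rd_idx + 1;
    end if;
  end if;
end process;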
Things to Notice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Things to notice in the queue specification:
1. don't care conditions ('-')
2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)
Don't Care . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
rd_data <= data_array(rd_idx) when (do_rd = '1')
           else (others => '-');
3.6.9 Queue Testbench
Things to notice in the queue testbench:
1. running multiple test sequences
2. uninitialized data ('U')
3. std_match to compare spec and impl data
std_match treats the following values as matching:
    '0'  matches  '0' and 'L'
    '1'  matches  '1' and 'H'
    '-'  matches  everything
    everything else does not match everything
With equality, '-' /= '1', but we want to use '-' to mean "don't care" in the specification.
The solution is to use std_match, rather than =, to check implementation signals against
the specification.
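As a minimal sketch, a checker process could use std_match (from ieee.numeric_std) to compare the
implementation output against the specification output; the signal names impl_rd_data and
spec_rd_data are assumptions.
process (clk) begin
  if rising_edge(clk) then
    assert std_match(impl_rd_data, spec_rd_data)
      report "error: impl rd_data does not match spec" severity warning;
  end if;
end process;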
Stimulus Process Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The stimulus process runs multiple test vectors in a single simulation run.
stimulus : process
type test_datum_ty is
record
r_reset, ... normal fields ...
end record;
type test_vectors_ty is array(natural range <>) of test_datum_ty;
constant test_vectors : test_vectors_ty :=
( -- reset ...
( '1', normal fields),
( '0', normal fields),
...
-- wr_idx passes rd_idx (overwrite entries)
-- reset ...
( '1', normal fields),
( '0', normal fields),
...
);
begin
for i in test_vectors'range loop
if (test_vectors(i).r_reset = '1') then
... reset code ...
end if;
reset <= '0';
... normal sequence ...
wait until rising_edge(clk);
end loop;
end process;
After reset is asserted, set signals to 'U'.
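For example, the stimulus process might drive the write data to 'U' at that point so that stale values
are easy to spot in the waveforms (the signal name wr_data is an assumption):
wr_data <= (others => 'U');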
3.7 Functional Verification Problems
P3.1 Carry Save Adder
1. Functionality Briefly describe the functionality of a carry-save adder.
2. Testbench Write a testbench for a 16-bit combinational carry save adder.
3. Testbench Maintenance Modify your testbench so that it is easy to change the width of the
adder and the latency of the computation.
NOTES:
(a) You do not need to support pipelined adders.
(b) VHDL generics might be useful.
P3.2 Traffic Light Controller
P3.2.1 Functionality
Briefly describe the functionality of a traffic-light controller that has sensors to detect the presence
of cars.
P3.2.2 Boundary Conditions
Make a list of boundary conditions to check for your traffic light controller.
P3.2.3 Assertions
Make a list of assertions to check for your traffic light controller.
P3.3 State Machines and Verification
P3.3.1 Three Different State Machines
(Figure 3.10: A very simple machine -- state-transition diagram with states s0-s3.)
(Figure 3.11: A very big machine -- state-transition diagram with states s0-s9.)
(Figure 3.12: A concurrent machine -- two state-transition diagrams, with states s0-s2 and q0-q4.)
(Figure 3.13: Legend -- edges are labelled input/output; * = don't care.)
Answer each of the following questions for the three state machines in figures 3.10-3.12.
Number of Test Scenarios How many test scenarios (sequences of test vectors) would you
need to fully validate the behaviour of the state machine?
Length of Test Scenario What is the maximum length (number of test vectors) in a test scenario
for the state machine?
Number of Flip Flops Assuming that neither the inputs nor the outputs are registered, what is
the minimum number of flip-flops needed to implement the state machine?
P3.3.2 State Machines in General
If a circuit has i 1-bit signals that are inputs, f 1-bit signals that are outputs of flip-flops,
and c 1-bit signals that are the outputs of combinational circuitry, what is the maximum number of
states that the circuit can have?
P3.4 Test Plan Creation
You're on the functional verification team for a chip that will control a simple portable CD
player. Your task is to create a plan for the functional verification of the signals in the entity
cd_digital.
You've been told that the player behaves just like all of the other CD players out there. If your
test plan requires knowledge about any potential non-standard features or behaviour, you'll need
to document your assumptions.
(Figure: CD-player front panel with buttons prev, stop, play, next, pwr and a display showing
track, min, and sec.)
entity cd_digital is
port (
----------------------------------------------------
-- buttons
prev,
stop,
play,
next,
pwr : in std_logic;
----------------------------------------------------
-- detect if player door is open
open : in std_logic;
----------------------------------------------------
-- output display information
track : out std_logic_vector(3 downto 0);
min : out unsigned(6 downto 0);
sec : out unsigned(5 downto 0)
);
end cd_digital;
P3.4.1 Early Tests
Describe five tests that you would run as soon as the VHDL code is simulatable. For each test,
describe your specification, stimulus, and check. Summarize why your collection of tests should be
the first tests that are run.
P3.4.2 Corner Cases
Describe five corner cases or boundary conditions, and explain the role of corner cases and
boundary conditions in functional verification.
NOTES:
1. You may reference your answer for problem P3.4.1 in this question.
2. If you do not know what a corner case or boundary condition is, you may earn partial
credit by checking this box and explaining five things that you would do in functional
verification.
P3.5 Sketches of Problems
1. Given a circuit, VHDL code, or circuit size information, calculate the simulation run time to
achieve n% coverage.
2. Given a fragment of VHDL code, list things to do to make it more robust (e.g. illegal data
and states go to the initial state).
3. Smith Problem 13.29
Chapter 4
Performance Analysis and Optimization
4.1 Introduction
Hennessy and Patterson's Computer Architecture: A Quantitative Approach (textbook for E&CE 429)
has good information on performance. We will use some of the same definitions and formulas as
Hennessy and Patterson, but we will move away from generic definitions of performance for computer
systems and focus on performance for digital circuits.
4.2 Defining Performance

    Performance = Work / Time
You can double your performance by:
doing twice the work in the same amount of time
OR doing the same amount of work in half the time
Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Measuring time is easy, but how do we accurately measure work?
The game of "benchmarketing" is finding a definition of work that makes your system appear to get
the most work done in the least amount of time.
Measure of Work Measure of Performance
clock cycle MHz
instruction MIPs
synthetic program Whetstone, Dhrystone, D-MIPs (Dhrystone MIPs)
real program SPEC
travel 1/4 mile drag race
The SPEC benchmarks are among the most respected and accurate predictors of real-world
performance.
Definition SPEC: Standard Performance Evaluation Corporation. Mission: "To establish, maintain,
and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of
modern computer systems." http://www.spec.org
The SPEC organization has different benchmarks for integer software, floating-point software,
web-serving software, etc.
4.3 Comparing Performance
4.3.1 General Equations
Equation for "Big is n% greater than Small":

    n% = (Big - Small) / Small

For the above equation, it can be difficult to remember whether the denominator is the larger
number or the smaller number. To see why Small is the only sensible choice, consider the situation
where a is 100% greater than b. This means that the difference between a and b is 100% of
something. Our only variables are a and b. It would be nonsensical for the difference to be a,
because that would mean a - b = a. However, if a - b = b, then for a to be 100% greater than b
simply means that a = 2b.
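As a quick illustrative example (numbers made up): with a = 150 and b = 100,

    n% = (150 - 100) / 100 = 0.5

so a is 50% greater than b.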
Using the "n% greater" formula, the phrase "the performance of A is n% greater than the performance
of B" is:

    n% = (Performance_A - Performance_B) / Performance_B
Performance is inversely proportional to time:

    Performance = 1 / Time

Substituting this into the equation for "the performance of A is n% greater than the performance of
B" gives:

    n% = (Time_B - Time_A) / Time_A

In general, the equation for a fast system to be n% faster than a slow system is:

    n% = (T_Slow - T_Fast) / T_Fast

Another useful formula is the average time to do one of k different tasks, where task i happens %i of
the time and takes time T_i each time it is done:

    T_Avg = sum over i = 1..k of (%i)(T_i)

We can measure the performance of practically anything (cars, computers, vacuum cleaners,
printers, ...).
4.3.2 Example: Performance of Printers
Black and White Colour
printer1 9ppm 6ppm
printer2 12ppm 4ppm
Question: Which printer is faster at B&W and how much faster is it?
Answer:
BW Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    n% faster = (T_Slow - T_Fast) / T_Fast

    BW_1 = 1 / 9 ppm  = 0.1111 min/page
    BW_2 = 1 / 12 ppm = 0.0833 min/page

    BW Faster = (BW_1 - BW_2) / BW_2
              = (0.1111 - 0.0833) / 0.0833
              = 33% faster
Performance for Different Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: If average workload is 90% BW and 10% Colour, which printer is faster
and how much faster is it?
Answer:

    T_Avg1 = %BW x BW_1 + %C x C_1
           = (0.90 x 0.1111) + (0.10 x 0.1667)
           = 0.1167 min/page

    T_Avg2 = %BW x BW_2 + %C x C_2
           = (0.90 x 0.0833) + (0.10 x 0.2500)
           = 0.1000 min/page

    Avg Faster = (T_Avg1 - T_Avg2) / T_Avg2
               = (0.1167 - 0.1000) / 0.1000
               = 16.7% faster
Optimizing Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: If we want to optimize printer1 to match performance of printer2, should
we optimize BW or Colour printing?
Answer:
Colour printing is slower, so it appears that we can save more time by optimizing
colour printing.
However, look at the extreme case of optimizing colour printing to be
instantaneous for printer 1:
(Bar chart: average time per page, 0.000 to 0.150 min/page, for P1 and P2.)
Even if colour printing were made instantaneous for printer 1 and kept the same for
printer 2, printer 1 would not be measurably faster.
Amdahl's law: make the common case fast. Optimizations need to take into account
both run time and frequency of occurrence.
We should optimize black and white printing.
Question: If you have to fire all of the engineers because your stock price plummeted,
how can you get printer1 to be faster than printer2?
Note: This question was actually humorous during the high-tech bubble of
2000...
Answer:
Hire more marketing people!
Notice that colour printing on printer 1 is faster than on printer 2. So, have
marketing convince people to increase the percentage of printing that
is done in colour.
Question: Revised question: what percentage of printing must be done in colour for
printer1 to beat printer2?
Answer:

    T_Avg1 <= T_Avg2
    %BW x BW_1 + %C x C_1 <= %BW x BW_2 + %C x C_2

    Using %BW = 1 - %C:

    (1 - %C) BW_1 + %C x C_1 <= (1 - %C) BW_2 + %C x C_2
    BW_1 + %C (C_1 - BW_1)   <= BW_2 + %C (C_2 - BW_2)

    %C >= (BW_1 - BW_2) / (BW_1 - BW_2 + C_2 - C_1)
    %C >= (0.1111 - 0.0833) / (0.1111 - 0.0833 + 0.2500 - 0.1667)
    %C >= 0.25
4.4 Clock Speed, CPI, Program Length, and Performance
4.4.1 Mathematics
CPI Cycles per instruction
NumInsts Number of instructions
ClockSpeed Clock speed
ClockPeriod Clock period
    Time = NumInsts x CPI x ClockPeriod

    Time = (NumInsts x CPI) / ClockSpeed
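As a quick illustrative example (the numbers are made up, not from a real system): a program of
2 x 10^9 instructions with an average CPI of 1.5 on a 1 GHz clock takes

    Time = (2 x 10^9 x 1.5) / (10^9 cycles/s) = 3 seconds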
4.4.2 Example: CISC vs RISC and CPI
Clock Speed SPECint
AMD Athlon 1.1GHz 409
Fujitsu SPARC64 675MHz 443
The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu
SPARC64 is a RISC microprocessor (it uses Sun's SPARC instruction set). Assume that it requires
20% more instructions to write a program in the Sparc instruction set than the same program re-
quires in IA-32.
Question: Which of the two processors has higher performance?
Answer:
SPECint, SPECfp, and SPEC are measures of performance. Therefore, the
higher the SPEC number, the higher the performance. The Fujitsu SPARC64
has higher performance
Question: What is the ratio between the CPIs of the two microprocessors?
Answer:
We will use a as the subscript for the Athlon and s as the subscript for the
Sparc.
    Time = (NumInsts x CPI) / ClockSpeed

    CPI = (Time x ClockSpeed) / NumInsts

    CPI = ClockSpeed / (Perf x NumInsts)

    CPI_A / CPI_S = [ ClockSpeed_A / (Perf_A x NumInsts_A) ] x [ (Perf_S x NumInsts_S) / ClockSpeed_S ]

    ClockSpeed_A = 1.1 GHz          Perf_A = 409
    ClockSpeed_S = 0.675 GHz        Perf_S = 443
    NumInsts_S   = 1.2 x NumInsts_A

    CPI_A / CPI_S = [ 1.1 / (409 x NumInsts_A) ] x [ (443 x 1.2 x NumInsts_A) / 0.675 ]
                  = 2.1

Executing the average Athlon instruction requires about 2.1 times as many clock cycles (110% more)
as executing the average SPARC instruction.
Question: Can you determine the absolute (actual) CPI of either microprocessor?
Answer:
To determine the absolute CPI, we would need to know the actual number of
instructions executed by at least one of the processors.
4.4.3 Effect of Instruction Set on Performance
Your group designs a microprocessor and you are considering adding a fused multiply-accumulate
to the instruction set. (A fused multiply accumulate is a single instruction that does both a multiply
and an addition. It is often used in digital signal processing.)
Your studies have shown that, on average, half of the multiply operations are followed by an add
instruction that could be done with a fused multiply-add.
Additionally, you know:
cpi %
ADD 0.8 CPIavg 15%
MUL 1.2 CPIavg 5%
Other 1.0 CPIavg 80%
You have three options:
option 1 : no change
option 2 : add the MAC instruction, increase the clock period by 20%, and MAC has the same
CPI as MUL.
option 3 : add the MAC instruction, keep the clock period the same, and the CPI of a MAC is
50% greater than that of a multiply.
Question: Which option will result in the highest overall performance?
Answer:
    Time = (NumInsts x CPI) / ClockSpeed

    Perf = ClockSpeed / (NumInsts x CPI)

We need to find NumInsts, CPI, and ClockSpeed for each of the three
options. Option 1 is the baseline, so we will define values for variables in
Options 2 and 3 in terms of the Option 1 variables.
Options 2 and 3 will have the same number of instructions. Half of the
multiply instructions are followed by an add that can be fused.
In questions that involve changing both CPI and NumInsts, it is often easiest
to work with the product of CPI and NumInsts, which represents the total
number of clock cycles needed to execute the program. Additionally, set the
problem up with an imaginary program of 100 instructions on the baseline
system.
    NumMAC_2 = 0.5 x NumMUL_1 = 0.5 x 5 = 2.5
    NumMUL_2 = 0.5 x NumMUL_1 = 0.5 x 5 = 2.5
    NumADD_2 = NumADD_1 - 0.5 x NumMUL_1 = 15 - 0.5 x 5 = 12.5

Find the total number of clock cycles for each option.

    Cycles_1 = NumMUL_1 x CPI_MUL + NumADD_1 x CPI_ADD + NumOth_1 x CPI_Oth
             = (5 x 1.2) + (15 x 0.8) + (80 x 1.0)
             = 98

    Cycles_2 = (NumMAC_2 x CPI_MAC) + (NumMUL_2 x CPI_MUL)
             + (NumADD_2 x CPI_ADD) + (NumOth_2 x CPI_Oth)
             = (2.5 x 1.2) + (2.5 x 1.2) + (12.5 x 0.8) + (80 x 1.0)
             = 96

    Cycles_3 = (NumMAC_3 x CPI_MAC) + (NumMUL_3 x CPI_MUL)
             + (NumADD_3 x CPI_ADD) + (NumOth_3 x CPI_Oth)
             = (2.5 x (1.5 x 1.2)) + (2.5 x 1.2) + (12.5 x 0.8) + (80 x 1.0)
             = 97.5

Calculate the performance for each option using the formula:

    Performance = 1 / (Cycles x ClockPeriod)

    Performance_1 = 1 / (98 x 1)   = 1/98
    Performance_2 = 1 / (96 x 1.2) = 1/115
    Performance_3 = 1 / (97.5 x 1) = 1/97.5
The third option is the fastest.
4.4.4 Effect of Time to Market on Relative Performance
Assume that performance of the average product in your market segment doubles every 18 months.
You are considering an optimization that will improve the performance of your product by 7%.
Question: If you add the optimization, how much can you allow your schedule to slip
before the delay hurts your relative performance compared to not doing the
optimization and launching the product according to your current schedule?
Answer:
    P(t) = performance at time t
         = P_0 x 2^(t/18)

From the problem statement:

    P(t) = 1.07 x P_0

Equate the two expressions for P(t), then solve for t:

    1.07 x P_0 = P_0 x 2^(t/18)
    2^(t/18)   = 1.07
    t/18       = log_2 1.07
    t          = 18 x log_2 1.07

Using log_b x = (log x) / (log b):

    t = 18 x (log 1.07 / log 2) = 1.76 months
4.4.5 Summary of Equations
Time to perform a task:

    Time = (NumInsts x CPI) / ClockSpeed

Average time to do one of k different tasks:

    T_Avg = sum over i = 1..k of (%i)(T_i)

Performance:

    Performance = Work / Time

Speedup:

    Speedup = T_Slow / T_Fast

T_Fast is n% faster than T_Slow:

    n% faster = (T_Slow - T_Fast) / T_Fast

Performance at time t if performance increases by a factor of k every n units of time:

    Perf(t) = Perf(0) x k^(t/n)
4.5 Performance Analysis and Dataflow Diagrams
4.5.1 Dataflow Diagrams, CPI, and Clock Speed
One of the challenges in designing a circuit is to choose the clock speed. Increasing the clock
speed of a circuit might not improve its performance. In this section we will work through several
example dataflow diagrams to pick a clock speed for the circuit and schedule operations into clock
cycles.
When partitioning dataflow diagrams into clock cycles, we need to choose a clock period. Choosing
a clock period affects many aspects of the design, not just the overall performance. Different
design goals might put conflicting pressure on the clock period: some goals will tend toward short
clock periods and some goals will tend toward long clock periods. For performance, not only is
clock period a poor indicator of the relative performance of two different systems, but even for the
same system, decreasing the clock period might not increase the performance.
    Goal                                 Action                 Effect
    Minimize area                        decrease clock period  fewer operations per clock cycle, so
                                                                fewer datapath components and more
                                                                opportunities to reuse hardware
    Increase scheduling flexibility      increase clock period  more flexibility in grouping operations
                                                                into clock cycles
    Decrease percentage of clock cycle   increase clock period  decreases the number of flops that
    spent in flops (overhead -- time in                         data traverses through
    flops is not doing useful work)
    Decrease time to execute an          ????                   depends on dataflow diagram
    instruction

Our general plan to find the clock period for maximum performance is:
1. Pick the clock period to be the delay through the slowest component + the delay through a flop.
2. For each instruction, for each operation, schedule the operation in the earliest clock cycle
possible without violating clock-period timing constraints.
3. Calculate the average time to execute an instruction as:
    Combine:   Time = (NumInsts x CPI) / ClockSpeed

    and:       CPI_avg = sum over i = 1..k of (%i x CPI_i)

    to derive: Time = NumInsts x ( sum over i = 1..k of (%i x CPI_i) ) / ClockSpeed

4. If the maximum latency through the dataflow diagram is greater than 1, then increase the clock
period by the minimum amount needed to decrease the latency by one clock period and return to
Step 2.
5. If the maximum latency through the dataflow diagram is 1, then the clock period for highest
performance is the clock period resulting in the fastest Time.
6. If possible, adjust the schedule of operations to reduce the maximum number of occurrences
of a component per instruction per clock cycle without increasing the latency for any instruction.
4.5.2 Examples of Dataflow Diagrams for Two Instructions
The circuit supports two instructions, A and B (e.g. multiply and divide). At any point in time, the
circuit is doing either A or B -- it does not need to support doing A and B simultaneously.
The diagrams below show the flow for each instruction and the delay through the components
(f, g, h, i) that the instructions use.
The delay through a register is 5 ns.
Each operation (A and B) occurs 50% of the time.
Our goal is to find a clock period and dataflow diagram for the circuit that will give us the highest
overall performance.

    Instruction A:  f (30 ns) -> g (50 ns) -> h (20 ns) -> g (50 ns)
    Instruction B:  i (40 ns) -> g (50 ns)
4.5.2.1 Scheduling of Operations for Different Clock Periods
(Figures: schedules of Instr A (f, g, h, g) and Instr B (i, g) packed into clock cycles for clock
periods of 55 ns, 75 ns, 85 ns, 95 ns, and 155 ns.)
4.5.2.2 Performance Computation for Different Clock Periods
Question: Which clock speed will result in the highest overall performance?
Answer:

    Clock Period   CPI_A   CPI_B   T_avg
    55 ns          4       2       55 x (0.5x4 + 0.5x2) = 165 ns
    75 ns          3       2       75 x (0.5x3 + 0.5x2) = 187.5 ns
    85 ns          2       2       85 x (0.5x2 + 0.5x2) = 170 ns
    95 ns          2       1       95 x (0.5x2 + 0.5x1) = 142.5 ns
    155 ns         1       1       155 x (0.5x1 + 0.5x1) = 155 ns
4.5.2.3 Example: Two Instructions Taking Similar Time
Question: For the flow below, which clock speed will result in the highest overall
performance?

    Instruction A   Instruction B
    30 ns           40 ns
    50 ns           50 ns
    20 ns           40 ns
    50 ns
Answer:
(Figures: schedules of Instr A and Instr B packed into clock cycles for clock periods of 55 ns,
75 ns, 85 ns, 95 ns, 105 ns, 135 ns, and 155 ns. The 105 ns schedule can be skipped, because it has
the same latency as the 95 ns schedule.)

    Clock Period   CPI_A   CPI_B   T_avg
    55 ns          4       3       193 ns
    75 ns          3       3       225 ns
    85 ns          2       3       213 ns
    95 ns          2       2       190 ns
    105 ns         2       2       NO GAIN
    135 ns         2       1       203 ns
    155 ns         1       1       155 ns

A clock period of 155 ns results in the highest performance.
For a clock period of 105 ns, we did not calculate the performance, because we could see that it
would be worse than the performance with a clock period of 95 ns. The dataflow diagram with a
105 ns clock period has the same latency as the diagram with a clock period of 95 ns. If the
dataflow diagram with the longer clock period has the same latency as the diagram with the shorter
clock period, then the diagram with the longer clock period will have lower performance.
4.5.2.4 Example: Same Total Time, Different Order for A
Question: For the flow below, which clock speed will result in the highest overall
performance?

    Instruction A   Instruction B
    30 ns           40 ns
    20 ns           50 ns
    50 ns           40 ns
    50 ns
Answer:
    Clock Period   CPI_A   CPI_B   T_avg
    55 ns          3       3       165 ns
    95 ns          3       2       238 ns
    105 ns         2       2       210 ns
    135 ns         2       1       203 ns
    155 ns         1       1       155 ns

A clock period of 155 ns results in the lowest average execution time, and hence the highest
performance.
This is the same answer as the previous problem, but the total times for the higher clock
frequencies differ significantly between the two problems.
4.5.3 Example: From Algorithm to Optimized Dataflow
This question involves doing some of the design work for a circuit that implements InstP and InstQ
using the components described below.

    Instruction   Algorithm                    Frequency of Occurrence
    InstP         a*b*((a*b) + (b*d) + e)      75%
    InstQ         (i + j + k + l) * m          25%

    Component      Delay
    2-input Mult   40 ns
    2-input Add    25 ns
    Register       5 ns

NOTES
There is a resource limitation of a maximum of 3 input ports. (There are no other resource
limitations.)
You must put registers on your inputs; you do not need to register your outputs.
The environment will directly connect your outputs (its inputs) to registers.
Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once -- if you need to use a value
in multiple clock cycles, you must store it in a register.
Question: What clock period will result in the best overall performance?
Answer:
Algorithm Answers (InstP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(Figures: InstP data-dependency graph; the graph after common-subexpression elimination of a*b;
and an alternative data-dependency graph.)
Both options have a critical path of 2 mults + 2 adds. The first option allows three operations to
be done with just three inputs (a, b, d). The second option requires all four inputs to do three
operations.
(Figures: InstP dataflow diagrams for clock=50ns, lat=4, T=200ns; clock=55ns, lat=3, T=165ns;
clock=70ns, lat=2, T=140ns; and an illegal schedule that would need 4 inputs in one clock cycle.)
(Figure: InstP dataflow diagram using the alternative data-dependency graph.)
The alternative graph adds a third clock cycle without any gain in clock speed. From the diagram,
it is clear that it is better to put a*b in the first clock cycle and e in the second, because a*b can
be done in parallel with b*d.
The fastest option for InstP is a 70 ns clock, which gives a total execution time of 140 ns.
Algorithm Answers (InstQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(Figures: InstQ data-dependency graph with maximum parallelism, and an alternative
data-dependency graph.)
The alternative graph is able to do two operations with three inputs, while the first graph required
four inputs to do two operations. We are limited to three inputs, so choose the alternative graph for
the dataflow diagrams.
(Figures: InstQ dataflow diagrams for clock=50ns, lat=4, T=200ns; clock=55ns, lat=3, T=165ns;
clock=70ns, lat=2, T=140ns; a schedule that is irrelevant because the latency did not decrease;
clock=120ns, lat=1, T=120ns; and the final 70 ns schedule.)
The fastest usable option for InstQ is a 70 ns clock, which gives a total execution time of 140 ns.
(The 120 ns single-cycle schedule would need all five of InstQ's inputs in one clock cycle, which
violates the three-input-port limit.)
Both InstP and InstQ need a 70ns clock period to maximize their
performance. So, use a 70ns clock, which gives a latency of 2 clock cycles for
both instructions.
Fastest execution time 140ns
Clock period 70ns
Question: Find a minimal set of resources that will achieve the performance you
calculated.
Answer:
Final dataflow graphs for InstP and InstQ:
(Figures: InstP dataflow diagram, clock=70ns, lat=2, T=140ns; InstQ dataflow diagram,
clock=70ns, lat=2, T=140ns.)
We need to do only one of InstP and InstQ at any time, so simply take the max of each resource.

                  InstP   InstQ   System
    Inputs        3       3       3
    Outputs       1       1       1
    Registers     3       3       3
    Adders        2       2       2
    Multipliers   2       1       2
Question: Design the datapath and state machine for your design.
Answer:
(Figures: the InstP and InstQ dataflow diagrams (clock=70ns, lat=2, T=140ns) annotated with the
datapath resources -- registers r1, r2, r3; multipliers m1, m2; adders a1, a2; inputs i1, i2, i3;
output o1 -- and the states S0 and S1 in which each operation executes.)
Control Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
r1 r2 r3 m1 m2 a1 a2
ce mux ce mux ce mux src1 src2 src1 src2 src1 src2 src1 src2
InstP S0 1 i1 1 i2 1 i3 r1 r2 r3 a1 m1 m2
InstP S1 1 a2 1 i2 1 m1 r2 r3 r1 r2
InstQ S0 1 i1 1 i2 1 i3 a1 r3 r1 r2 a1 r3
InstQ S1 1 a2 1 i2 1 i3 r1 r2
Optimize Control Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
r1 r2 r3 m1 m2 a1 a2
mux mux mux src1 src2 src1 src2 src1 src2 src1 src2
InstP S0 i1 i2 i3 r1 r2 a1 r3 r1 r2 m1 m2
InstP S1 a2 i2 m1 r1 r2 r2 r3 r1 r2 m1 m2
InstQ S0 i1 i2 i3 r1 r2 a1 r3 r1 r2 a1 r3
InstQ S1 a2 i2 i3 r1 r2 r2 r3 r1 r2 a1 r3
Write VHDL Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Use the optimized control table as the basis for the VHDL code.
process (clk) begin
if rising_edge(clk) then
if state=S0 then
r1 <= i1;
else
r1 <= a2;
end if;
end if;
end process;
process (clk) begin
if rising_edge(clk) then
r2 <= i2;
end if;
end process;
process (clk) begin
if rising_edge(clk) then
if inst=instP and state=S0 then
r3 <= m1;
else
r3 <= i3;
end if;
end if;
end process;
m1 <= r1 * r2;
m2_src1 <= r2 when state=S0
else a1;
m2 <= m2_src1 * r3;
a1 <= r1 + r2;
process (inst, m1, m2, a1, r3) begin
if inst=instP then
a2_src1 <= m1;
a2_src2 <= m2;
else
a2_src1 <= a1;
a2_src2 <= r3;
end if;
end process;
4.6 Performance Analysis and Optimization Problems
P4.1 Farmer
A farmer is trying to decide which of his two trucks to use to transport his apples from his orchard
to the market.
Facts:

                  capacity     speed when loaded   speed when unloaded
                  of truck     with apples         (no apples)
    big truck     12 tonnes    15 kph              38 kph
    small truck   6 tonnes     30 kph              70 kph

    distance to market   120 km
    amount of apples     85 tonnes

NOTES:
1. All of the loads of apples must be carried using the same truck.
2. Elapsed time is counted from beginning to deliver the first load to returning to the orchard after
the last load.
3. Ignore time spent loading and unloading apples, coffee breaks, refueling, etc.
4. For each trip, a truck travels at either its fully loaded speed or its empty speed.
Question: Which truck will take the least amount of time and what percentage faster
will the truck be?
Question: In planning ahead for next year, is there anything the farmer could do to
decrease his delivery time with little or no additional expense? If so, what is it, if not,
explain.
P4.2 Network and Router
In this question there is a network that runs a protocol called BigLan. You are designing a router
called the DataChopper that routes packets over the network running BigLan (i.e. they're BigLan
packets).
The BigLan network protocol runs at a data rate of 160 Mbps (Mega bits per second). Each BigLan
packet contains 100 Bytes of routing information and 1000 Bytes of data.
You are working on the DataChopper router, which has the following performance numbers:

    75 MHz    clock speed
    4         clock cycles to process a byte of either data or header
    500       additional clock cycles to process the routing information for a packet

P4.2.1 Maximum Throughput
Which has a higher maximum throughput (as measured in data bits per second -- that is, only the
payload bits count as useful work), the network or your router, and how much faster is it?
P4.2.2 Packet Size and Performance
Explain the effect of an increase in packet length on the performance of the DataChopper (as
measured in the maximum number of bits per second that it can process) assuming the header
remains constant at 100 bytes.
P4.3 Performance Short Answer
If performance doubles every two years, by what percentage does performance go up every month?
This question is similar to compound growth from your economics class.
P4.4 Microprocessors
The Yme microprocessor is very small and inexpensive. One performance sacrifice the designers
have made is to not include a multiply instruction. Multiplies must be written in software using
loops of shifts and adds.
The Yme currently ships at a clock frequency of 200MHz and has an average CPI of 4.
A competitor sells the Y!v1 microprocessor, which supports exactly the same instructions as the
Yme. The Y!v1 runs at 150MHz, and the average program is 10% faster on the Yme than it is on
the Y!v1.
P4.4.1 Average CPI
Question: What is the average CPI for the Y!v1? If you don't have enough
information to answer this question, explain what additional information you need
and how you would use it.
A new version of the Y!, the Y!u2, has just been announced. The Y!u2 includes a multiply
instruction and runs at 180MHz. The Y!u2 publicity brochures claim that using their multiply
instruction, rather than shift/add loops, can eliminate 10% of the instructions in the average
program. The brochures also claim that the average performance of the Y!u2 is 30% better than
that of the Y!v1.
P4.4.2 Why not you too?
Question: Assuming the advertising claims are true, what is the average CPI for the
Y!u2? If you don't have enough information to answer this question, explain what
additional information you need and how you would use it.
P4.4.3 Analysis
Question: Which of the following do you think is most likely, and why?
1. the Y!u2 is basically the same as the Y!v1 except for the multiply
2. the Y!u2 designers made performance sacrifices in their design in order to include a multiply
instruction
3. the Y!u2 designers performed other significant optimizations in addition to creating a multiply
instruction
P4.5 Dataflow Diagram Optimization
Draw an optimized dataflow diagram that improves the performance and produces the same output
values. Or, if the performance cannot be improved, describe the limiting factor on the performance.
NOTES:
you may change the times when signals are read from the environment
you may not increase the resource usage (input ports, registers, output ports, f components, g
components)
you may not increase the clock period
(Figures: dataflow diagrams "Before Optimization" and "After Optimization", using inputs a, b, c,
d, e and components f and g.)
P4.6 Performance Optimization with Memory Arrays
This question deals with the implementation and optimization for the algorithm and library of
circuit components shown below.
Algorithm
q = M[b];
if (a > b) then
M[a] = b;
p = (M[b-1] * b) + M[b];
else
M[a] = b;
p = M[b+1] * a;
end;
Component Delay
Register 5 ns
Adder 25 ns
Subtracter 30 ns
ALU with +, , >, =, , AND, XOR 40 ns
Memory read 60 ns
Memory write 60 ns
Multiplication 65 ns
2:1 Multiplexor 5 ns
NOTES:
1. 25% of the time, a > b
2. The inputs of the algorithm are a and b.
3. The outputs of the algorithm are p and q.
4. You must register both your inputs and outputs.
5. You may choose to read your input data values at any time and produce your outputs at any
time. For your inputs, you may read each value only once (i.e. the environment will not send
multiple copies of the same value).
6. Execution time is measured from when you read your first input until the latter of producing
your last output or the completion of writing a result to memory.
7. M is an internal memory array, which must be implemented as dual-ported memory with one
read/write port and one write port.
8. Assume all memory address and other arithmetic calculations are within the range of
representable numbers (i.e. no overflows occur).
9. If you need a circuit not on the list above, assume that its delay is 30 ns.
10. Your dataflow diagram must include circuitry for computing a > b and using the result to
choose the value for p.
Draw a dataflow diagram for each operation that is optimized for the fastest overall execution time.
NOTE: You may sacrifice area efficiency to achieve high performance, but marks will be deducted
for extra hardware that does not contribute to performance.
P4.7 Multiply Instruction
You are part of the design team for a microprocessor implemented on an FPGA. You currently im-
plement your multiply instruction completely on the FPGA. You are considering using a special-
ized multiply chip to do the multiplication. Your task is to evaluate the performance and optimality
tradeoffs between keeping the multiply circuitry on the FPGA or using the external multiplier chip.
If you use the multiplier chip, it will reduce the CPI of the multiply instruction, but will not
change the CPI of any other instruction. Using the multiplier chip will also force the FPGA to run
at a slower clock speed.

                                      FPGA option   FPGA + MULT option
    average CPI                       5             ???
    % of instrs that are multiplies   10%           10%
    CPI of multiply                   20            6
    Clock speed                       200 MHz       160 MHz

P4.7.1 Highest Performance
Which option, FPGA or FPGA+MULT, gives the higher performance (as measured in MIPs), and
what percentage faster is the higher-performance option?
P4.7.2 Performance Metrics
Explain whether MIPs is a good choice for the performance metric when making this decision.
Chapter 5
Timing Analysis
5.1 Delays and Definitions
In this section we will look at the different timing parameters of circuits. Our focus will be on
those parameters that limit the maximum clock speed at which a circuit will work correctly.
5.1.1 Background Definitions
Definition fanin: The fanin of a gate or signal x is all of the gates or signals y where an
input of x is connected to an output of y.
Definition fanout: The fanout of a gate or signal x is all of the gates or signals y where
an output of x is connected to an input of y.
(Figure 5.1: Immediate fanin of x.)
(Figure 5.2: Immediate fanout of x.)
Definition immediate fanin/fanout: The phrases "immediate fanout" and "immediate fanin"
mean that there is a direct connection between the gates.
(Figure 5.3: Transitive fanin of x.)
(Figure 5.4: Transitive fanout of x.)
Definition transitive fanin/fanout: The phrases "transitive fanout" and "transitive fanin"
mean that there is either a direct or indirect connection between the gates.
Note: Immediate vs transitive fanin and fanout. Be careful to distinguish between immediate
fan(in/out) and transitive fan(in/out). If fanin or fanout is not qualified with "immediate" or
"transitive", check whether immediate or transitive is meant. In E&CE 427, fan(in/out) will mean
immediate fan(in/out).
5.1.2 Clock-Related Timing Definitions
5.1.2.1 Clock Skew
(Figure: a clock tree driving clk1-clk4, with waveforms showing the skew between the arrival times
of the same clock edge.)
Definition Clock Skew: The difference in arrival times for the same clock edge at
different flip-flops.
Clock skew is caused by the difference in interconnect delays to different points on the chip.
Clock tree design is critical in high-performance designs to minimize clock skew. Sophisticated
synthesis tools put lots of effort into clock tree design, and the techniques for clock tree design still
generate PhD theses.
5.1.2.2 Clock Latency
(Figure: a clock tree from the master clock through an intermediate clock to the final clock, with
waveforms showing the latency between levels.)
Definition Clock Latency: The difference in arrival times for the same clock edge at
different levels of interconnect along the clock tree. (Intuitively: at different points in
the clock generation circuitry.)
Note: Clock latency does not affect the limit on the minimum clock period.
5.1.2.3 Clock Jitter
(Figure: waveforms of an ideal clock and a clock with jitter.)
Definition Clock Jitter: The difference between the actual clock period and the ideal clock period.
Clock jitter is caused by:
temperature and voltage variations over time
temperature and voltage variations across different locations on a chip
manufacturing variations between different parts
etc.
5.1.3 Storage-Related Timing Definitions
Storage devices (latches, flip-flops, memory arrays, etc.) define setup, hold, and clock-to-Q times.
(Figure 5.5: Setup, hold, and clock-to-Q times for a flip-flop -- waveforms of clk, d, and q.)
Note: Require / Guarantee. Setup and hold times are requirements that the
storage device imposes upon its environment. Clock-to-Q is a guarantee that
the storage device provides to its environment. If the environment satisfies the
setup and hold times, then the storage device guarantees that it will satisfy the
clock-to-Q time.
In this section, we will use the definitions of setup, hold, and clock-to-Q. Section 5.2 will show how
to calculate setup, hold, and clock-to-Q times for flip-flops, latches, and other storage devices.
5.1.3.1 Setup Time
Definition Setup Time (T_SUD): The latest time before the arrival of the clock edge (flip-flop), or
the deasserting of the enable line (latch), that the input data is required to be stable in order for
the storage device to work correctly.
If the setup time is violated, the current input data will not be stored; the input data from the
previous clock cycle might remain stored.
5.1.3.2 Hold Time
Definition Hold Time (T_HO): The latest time after the arrival of the clock edge (flip-flop), or the
deasserting of the enable line (latch), that the input data is required to remain stable in order for
the storage device to work correctly.
If the hold time is violated, the current input data will not be stored; the input data from the next
clock cycle might slip through and be stored.
5.1.3.3 Clock-to-Q Time
Definition Clock-to-Q Time (T_CO): The earliest time after the arrival of the clock edge
(flip-flop), or the asserting of the enable line (latch), when the output data is guaranteed to be
stable.
5.1.4 Propagation Delays
Propagation delay is the time it takes a signal to travel from the source (driving) flop to the
destination flop. The two factors that contribute to propagation delay are the load of the
combinational gates between the flops and the delay along the interconnect (wires) between the
gates.
5.1.4.1 Load Delays
Load delay is proportional to load capacitance.
(Figure: a simple inverter driving a load capacitance; an input transition from 1 to 0 charges the
output capacitance, and an input transition from 0 to 1 discharges it.)
Load capacitance depends on the fanout (how many other gates a gate drives) and how big the
other gates are.
Section 5.4.1 goes into more detail on timing models and equations for load delay.
5.1.4.2 Interconnect Delays
Wires, also known as interconnect, have resistance, and there is capacitance between parallel
wires. Both of these factors increase delay.
Wire resistance depends on the material and geometry of the wire.
Wire capacitance depends on the wire geometry, the geometry of neighbouring wires, and the
materials.
Shorter wires are faster.
Fatter wires are faster.
FPGAs have special routing resources for long wires.
CMOS processes use higher metal layers for long wires; these layers have wires with much larger
cross-sections than lower levels of metal.
More on this in section 5.5.
5.1.5 Summary of Delay Factors

    Name           Symbol    Definition
    Skew                     Difference in arrival times for different clock signals
    Jitter                   Difference in clock period over time
    Clock-to-Q     T_CO      Delay from clock signal to Q output of flop
    Setup          T_SUD     Length of time prior to clock/enable that data must be stable
    Hold           T_HO      Length of time after clock/enable that data must be stable
    Load                     Delay due to load (fanout/consumers/readers)
    Interconnect             Delay along wire

    Table 5.1: Summary of delay factors

5.1.6 Timing Constraints
For a circuit to operate correctly, the clock period must be longer than the sum of the delays shown
in Table 5.1.
Definition Margin: The difference between the required value of a timing parameter and the actual
value. A negative margin means that there is a timing violation. A margin of zero means that the
timing parameter is just barely satisfied: changing the timing of the signals (which would affect the
actual value of the parameter) could violate the timing parameter. A positive margin means that the
constraint for the timing parameter is more than satisfied: the timing of the signals could be changed
at least a little bit without violating the timing parameter.
Note: Margin is often called "slack". Both terms are commonly used.
5.1.6.1 Minimum Clock Period
(Figure: two flops a and b clocked by clk1 and clk2, with waveforms showing how skew, jitter,
clock-to-Q, propagation (interconnect + load), setup, and slack fit within one clock period.)

    ClockPeriod > Skew + Jitter + T_CO + Interconnect + Load + T_SUD
Note: The minimum clock period is independent of hold time.
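As an illustrative example with assumed values (skew 0.2 ns, jitter 0.1 ns, T_CO 0.5 ns,
interconnect + load 2.8 ns, T_SUD 0.4 ns):

    ClockPeriod > 0.2 + 0.1 + 0.5 + 2.8 + 0.4 = 4.0 ns

which corresponds to a maximum clock speed of 250 MHz.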
5.1.6.2 Hold Constraint
(Figure: waveforms of clk1, clk2, a, and b showing skew, jitter, clock-to-Q, propagation, hold, and
slack for the hold check.)

    Skew + Jitter + T_HO  <=  T_CO + Interconnect + Load
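Continuing the illustrative values above and assuming a hold time of 0.3 ns:

    Skew + Jitter + T_HO        = 0.2 + 0.1 + 0.3 = 0.6 ns
    T_CO + Interconnect + Load  = 0.5 + 2.8       = 3.3 ns

so the hold constraint is satisfied with 2.7 ns of margin.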
5.1.6.3 Example Timing Violations
The figures below illustrate correct timing behaviour of a circuit and then two types of violations:
a setup violation and a hold violation. In the figures, the black rectangles identify the point where
the violation happens.
(Figure 5.6: Good timing -- waveforms of clk, a, b, c, and d satisfying clock-to-Q, propagation,
setup, and hold.)
(Figure 5.7: Setup violation -- the data arrives too close to the clock edge, so the flop output is
unknown.)
(Figure 5.8: Hold violation -- the data changes too soon after the clock edge, so the flop output is
unknown.)
5.2 Timing Analysis of Latches and Flip Flops
In this section, we show how to find the clock-to-Q, setup, and hold times for latches, flip-flops,
and other storage elements.
5.2.1 Review: Latch, Flip-Flop, Setup, Hold, Clock-to-Q
(Figure: waveforms contrasting flop behaviour and latch behaviour for the same d and clk inputs.)
Review: Timing Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Setup: Time before the arrival of the clock edge (flip-flop), or the deasserting of the enable line
(latch), that the input data is required to start being stable.
Hold: Time after the arrival of the clock edge (flip-flop), or the deasserting of the enable line
(latch), that the input data is required to remain stable.
Clock-to-Q: Time after the arrival of the clock edge (flip-flop), or the asserting of the enable line
(latch), when the output data is guaranteed to start being stable.
5.2.2 Simple Multiplexer Latch
We begin our study of timing analysis for storage devices with a simple latch built from an inverter
ring and multiplexer. There are many better ways to build latches, primarily by doing the design
at the transistor level. However, the simplicity of this design makes it ideal for illustrating timing
analysis.
5.2.2.1 Structure and Behaviour of Multiplexer Latch
Two modes for storage devices:
loading data:
loads input data into storage circuitry
input data passes through to output
using stored data
input signal is disconnected from output
storage circuitry drives output
(Figure: multiplexer-latch schematic, shown in loading / pass-through mode (select = 1) and in
storage mode (select = 0).)
Unfold Multiplexer to Simple Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(Figure: multiplexer symbol and its two-AND/one-OR implementation, and the latch implementation
built by feeding the output back through the multiplexer, with clk driving the select through two
inverters.)
Note: inverters on clk Both of the inverters on the clk signal are needed.
Together, they prevent a glitch on the OR gate when clk is deasserted. If
there was only one inverter, a glitch would occur. For more on this, see sec-
tion 5.2.2.6
(Figures: node values in the latch when loading a 0, loading a 1, storing a 0, and storing a 1.)
5.2.2.2 Strategy for Timing Analysis of Storage Devices
The key to calculating the setup and hold times of a latch, flop, etc. is to identify:
1. how the data is stored when not connected to the input (often a pair of inverters in a loop)
2. the gate(s) that the clock uses to cause the stored data to drive the output (often a transmission
gate or multiplexor)
3. the gate(s) that the clock uses to cause the input to drive the output (often a transmission gate
or multiplexor)
(Figure: the latch in store mode (clk=0) and in load mode (clk=1).)
Note: Clock-to-Q for latches For latches, clock-to-Q times are measured
with respect to the clock edge that connects the data input to the output. For
active-high latches, this is a rising edge.
Setup and hold timing constraints ensure that, when the storage device transitions from load mode
to store mode, the input data is stored correctly in the storage device. Thus, the setup and hold
timing constraints come into play when the storage device transitions from load mode to store
mode.
Note: Setup and hold time for latches For latches, hold time and setup time
are measured with respect to the clock edge that disconnects the data input
from the output. For active-high latches, this is a falling edge.
Hold time is concerned with the next data value sneaking in before the latch goes into storage
mode.
Setup time is concerned with the previous data value still being in the storage circuitry when the
input is disconnected.
Note: Storage devices vs. Signals We can talk about the setup and hold
time of a signal or of a storage device. For a storage device, the setup and
hold times are requirements that it imposes upon all environments in which it
operates. For an individual signal in a circuit, there is a setup and hold time,
which is the amount of time that the signal is stable before and after a clock
edge.
5.2.2.3 Clock-to-Q Time of a Multiplexer Latch
(Figure 5.9: Latch for clock-to-Q analysis, with internal signals d, l1, l2, qn, q, s1, s2, clk, cn, c2.)
(Figure 5.10: Waveforms of the latch showing clock-to-Q timing.)
Assume that the input is stable, and then the clock signal transitions to cause the circuit to move
from storage mode to load mode.
Calculate the clock-to-Q time by finding the delay of the critical path from where the clock signal
enters the storage circuit to where q exits the storage circuit.
The path is: clk -> cn -> c2 -> l2 -> qn -> q, which has a delay of 5 (assuming each gate has a
delay of exactly one time unit).
5.2.2.4 Setup Timing of a Multiplexer Latch
Storage device transitions from load mode to store mode. Setup is time that input must be stable
before clock changes.
(Figure 5.11: Latch for setup analysis.)
(Figure 5.12: Setup with margin -- the goal is to store the new value of d.)
Step-by-step animation of the latch transitioning from load to store mode:
(Figures: node values at each step --
 before the transition, the circuit is stable in load mode;
 t=0 and t=1: clk transitions from load to store and propagates through the inverters;
 t=2: s1 propagates to s2, because cn turns on its AND gate;
 t=3: l2 is set to 0, because c2 turns off its AND gate;
 t=4: the value from the store path propagates to q;
 t=5: the value from the store path completes the cycle around the store loop.)
The value on s1 at t=1 will propagate from the store loop to the output and back through the store
loop. At t=1, s1 must have the value that we want to store. Or, equivalently, the value to store must
have saturated the store loop by t=1. It takes 5 time units for a value on the input d to propagate to
s1 (d -> l1 -> l2 -> qn -> q -> s1).
The setup time is the difference between the delay from d to s1 and the delay from clk to cn:
5 - 1 = 4, so the setup time for this latch is 4 time units.
(Figure 5.13: Setup violation -- setup with negative margin.)
(Figure 5.14: Minimum setup time.)
The new data value must arrive at s1 before cn is asserted. Otherwise, the previous value will still
be in the storage circuitry when the data input is disconnected.
5.2.2.5 Hold Time of a Multiplexer Latch
(Figure 5.15: Latch for hold analysis.)
(Figure 5.16: Hold time satisfied -- the goal is to store the old value of d.)
(Figure 5.17: Animation of hold analysis -- node values as clk transitions from load to store:
 the circuit is stable in load mode; t=0: clk transitions from load to store; t=5: the clk transition
 propagates to cn; t=6: the clk transition propagates to c2, so l1 may now change without affecting
 the storage device; t=7: the clk transition propagates to l2.)
It takes 6 time units for a change on the clock signal to propagate to the input of the AND gate that
controls the load path. It takes 1 time unit for a change on d to propagate to its input of this AND
gate. The data input must remain stable for 6 - 1 = 5 time units after the clock transitions from
load to store mode, or else the new data value will slip into the storage loop and corrupt the value
that we are trying to store.
(Figure 5.18: Hold violation -- the new value slips through to q.)
(Figure 5.19: Minimum hold time.)
We can't let the new data value affect l1 before c2 deasserts.
The hold time is the difference between the path from clk to c2 and the path from d to l1.
5.2.2.6 Example of a Bad Latch
This latch is very similar to the one from section 5.2.2.5; however, this one does not work correctly.
The difference between this latch and the one from section 5.2.2.5 is the location of the inverter
that determines whether l2 or s2 is enabled. When the clock signal is deasserted, c2 turns off the
AND gate l2 before the AND gate s2 turns on. In this interval, when both l2 and s2 are turned
off, a glitch is allowed to enter the feedback loop.
The glitch on the feedback loop is independent of the timing of the signals d and clk.
(Figure: schematic and waveforms of the bad latch, showing the glitch entering the feedback loop
when clk is deasserted.)
5.2.3 Timing Analysis of Transmission-Gate Latch
The latch that we now examine is more realistic than the simple multiplexer-based latch. We
replace the multiplexer with a transmission gate.
5.2.3.1 Structure and Behaviour of a Transmission Gate (Smith 2.4.3)
(Figure: transmission-gate symbol and implementation; the gate is open or closed depending on the
select signal, and when closed it transmits either a 1 or a 0. A transmission gate behaves as a
switch.)
5.2.3.2 Structure and Behaviour of Transmission-Gate Latch (Smith 2.5.1)
(Figure: transmission-gate latch schematic, shown loading data into the latch and using the stored
data from the latch.)
5.2.3.3 Clock-to-Q Delay for Transmission-Gate Latch
(Figure: the clock-to-Q path through the transmission-gate latch.)
5.2.3.4 Setup and Hold Times for Transmission-Gate Latch
(Figure: setup-time paths through the latch, labelled path1 and path2.)

    Setup time = path1 - path2

(Figure: hold-time paths through the latch, labelled path1 and path2.)

    Hold time = path1 - path2

5.2.4 Falling Edge Flip Flop (Smith 2.5.2)
We combine two active-high latches to create a falling-edge, master-slave flip-flop. The analysis of
the master-slave flip-flop illustrates how to do timing analysis for hierarchical storage devices.
Here, we use the timing information for the active-high latch to compute the timing information of
the flip-flop. We do not need to know the primitive structure of the latch in order to derive the
timing information for the flip-flop.
5.2.4.1 Structure and Behaviour of Flip-Flop
(Figure: master-slave flip-flop built from two active-high latches, with the master enabled by clk
and the slave enabled by clk_b; waveforms of d, clk, m, clk_b, and q, annotated with the latch
clock-to-Q time, the inverter delay T_Inv, the latch setup time, and T_md, the propagation delay
from m to the slave's d input.)
5.2.4.2 Clock-to-Q of Flip-Flop
(Figure: waveforms showing the flop clock-to-Q time as the inverter delay plus the latch clock-to-Q
time.)

    T_CO(Flop) = T_Inv + T_CO(Latch)

5.2.4.3 Setup of Flip-Flop
(Figure: waveforms showing that the flop setup time equals the master latch setup time.)

    T_SUD(Flop) = T_SUD(Latch)

The setup time of the flip-flop is the same as the setup time of the master latch. This is because,
once the data is stored in the master latch, it will be held for the slave latch.
5.2.4.4 Hold of Flip-Flop
(Figure: waveforms showing that the flop hold time equals the master latch hold time.)

    T_HO(Flop) = T_HO(Latch)

The hold time of the flip-flop is the same as the hold time of the master latch. This is because, once
the data is stored in the master latch, it will be held for the slave latch.
5.2.5 Timing Analysis of FPGA Cells (Smith 5.1.5)
We can apply hierarchical analysis to structures that include both datapath and storage circuitry.
We use an Actel FPGA cell to illustrate. The description of the Actel FPGA cell in the course notes
is incomplete; refer to Smith's book for additional material.
5.2.5.1 Standard Timing Equations

    T_PD   = delay from the D-inputs to the storage element
    T_CLKD = delay from the clk input to the storage element
    T_OUT  = delay from the storage element to the output

    T_SUD = setup time
          = slowest D path - fastest clk path
          = T_PD(Max) - T_CLKD(Min)

    T_HO  = hold time
          = slowest clk path - fastest D path
          = T_CLKD(Max) - T_PD(Min)

    T_CO  = delay from clk to Q
          = clk path + output path
          = T_CLKD + T_OUT
5.2.5.2 Hierarchical Timing Equations
Add combinational logic to inputs, clock, and outputs of storage element.
(Figure: a storage element with combinational logic on its data inputs (t_PD), clock input (t_CLKD), and output (t_OUT); the storage element itself has t_SUD, t_HO, and t_CO.)

    T_SUD = T_SUD' + T_PD,max - T_CLKD,min
    T_HO  = T_HO'  + T_CLKD,max - T_PD,min
    T_CO  = T_CO'  + T_CLKD,max + T_OUT,max

where the primed parameters are those of the storage element itself.
5.2.5.3 Actel Act 2 Logic Cell
Timing analysis of Actel Act 2 logic cell (Smith 5.1.5).
Actel ACT
- Basic logic cells are called Logic Modules.
- ACT 1 family: one type of Logic Module (see Figure 5.1, Smith's p. 192).
- ACT 2 and ACT 3 families: two different types of Logic Module (see Figure 5.4, Smith's p. 198).
  - C-Module (Combinatorial Module): combinational logic similar to the ACT 1 Logic Module, but capable of implementing a five-input logic function.
  - S-Module (Sequential Module): C-Module + Sequential Element (SE) that can be configured as a flip-flop.
Actel Timing
- ACT family (see Figure 5.5, Smith's p. 200).
- Simple. Why? Only logic inside the chip.
- Not an exact delay (no place and route or physical layout yet, hence not accounting for interconnect delay).
Non-Deterministic Actel Architecture
- All primed parameters inside the S-Module are assumed given. Calculate t_SUD, t_H, and t_CO.
- Of the 3 ns of combinational-logic delay, 0.4 ns went into increasing the setup time, t_SUD, and
2.6 ns went into increasing the clock-to-output delay, t_CO. From the outside we can say that the
combinational-logic delay is buried in the flip-flop setup time.
(Figures: a simple Actel-style latch; an Actel latch with active-low clear; an Actel flip-flop with active-low clear; and the Actel sequential module, a C-Module with inputs d00, d01, d10, d11, a1, b1, a0, b0 driving an SE-Module with internal clocks se_clk and se_clk_n.)
5.2.5.4 Timing Analysis of Actel Sequential Module
Timing parameters for the Actel latch with active-low clear:
    T_SUD   0.4 ns
    T_HO    0.9 ns
    T_CO    0.4 ns

Other given timing parameters:
    C-Module delay (T_PD')                       3 ns
    T_CLKD (from clk to se_clk and se_clk_n)     2.6 ns
Question: What are the setup, hold, and T_CO times for the entire Actel sequential module?
5.2.6 Exotic Flop
As a contrast to the gate-level implementations of latches that we looked at previously, the figure
below is the schematic for a state-of-the-art high-performance latch circa 2001.
(Figure: high-performance latch with signals d, clk, q; an inverter chain, two precharge nodes, and two keepers.)
The inverter chain creates an evaluation window in time when the clock has just risen and the p transistors
are turned on.
When the clock is 0, the left precharge node charges to 1 and the right precharge node discharges
to 0.
If d is 1 during the evaluation window, the left precharge node discharges to 0. The left
precharge node goes through an inverter to the second precharge node, which will charge from
0 to 1, resulting in a 0 on q.
If d is 0 during the evaluation window, the left precharge node stays at the precharge value of
1. The left precharge node goes through an inverter to the second precharge node, which will
stay at 0, resulting in a 1 on q.
The two inverter loops are keepers, which provide energy to keep the precharge nodes at their
values after the evaluation window has passed and the clock is still 1.
5.3 Critical Paths and False Paths
5.3.1 Introduction to Critical and False Paths
In this section we describe how to find the critical path through a circuit: the path that limits the
maximum clock speed at which the circuit will work correctly. A complicating factor in finding the
critical path is the existence of false paths: paths through the circuit that appear to be the critical
path, but in fact will not limit the clock speed of the circuit. The reason that a path is false is that
the behaviour of the gates prevents a transition (either 0→1 or 1→0) from travelling along the
path from the source node to the destination node.
To confirm that a path is a true critical path, and not a false path, we must find a pair of input
vectors that exercise the critical path. The two input vectors differ only in their value for the input
signal on the critical path. The change on this signal (either 0→1 or 1→0) must propagate along
the candidate critical path from the input to the output.
Usually the two input vectors will produce different output values. However, a critical path might
produce a glitch (0→1→0 or 1→0→1) on the output, in which case the path is still the critical
path, but the two input vectors both result in the same final value on the output signal. Glitches must not be
ignored, because they may result in setup violations: if the glitching value is inside the destination
flop or latch at the end of the clock period, then the storage element will not store a stable value.
The algorithm that we present comes from McGeer and Brayton in the DAC 198? paper. The
algorithm to find the critical path through a circuit is presented in several parts.
1. Section 5.3.2: Find the longest path, ignoring the possibility of false paths.
2. Section 5.3.3: Almost-correct algorithm to identify if a candidate critical path is a false path.
3. Section 5.3.4: If a candidate path is a false path, then find the next candidate path, and repeat
the false-path detection algorithm.
4. Section 5.3.5: Correct, complete, and complex algorithm to find the critical path in a circuit.
Note: The analysis of critical paths and false paths assumes that all inputs
change values at exactly the same time. Timing differences between inputs are
modelled by the skew parameter in timing analysis.
Note: To exercise a path, only one input needs to change. Stated another
way, if a path cannot be exercised by toggling one input, then the path cannot
be exercised by toggling more than one input.
Throughout our discussion of critical paths, we will use the delay values for gates shown in the
table below.
gate delay
NOT 2
AND 4
OR 4
XOR 6
5.3.1.1 Example of Critical Path in Full Adder
Question: Find the critical path through the full-adder circuit shown below.
(Figure: full-adder circuit with inputs ci, a, b and outputs co, s.)
Answer:
Annotate with Max Distance to Destination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(Figure: the full adder annotated with the maximum distance from each signal to the destination signals.)
Find Candidate Critical Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(Figure: the full adder with the candidate critical path of length 14 highlighted.)
There are two paths of length 14: a→co and b→co. We arbitrarily choose
a→co.
Test if Candidate is Critical Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(Figure: the full adder with input values assigned to test whether the candidate path is the critical path.)
Yes, the candidate path is the critical path.
The assignment of ci=1, a=0, b=0, followed in the next clock cycle by ci=1,
a=1, b=0, will exercise the critical path. As a shortcut, we write the pair of
assignments as: ci=1, a=↑, b=0.
Question: Do the input values of ci=0, b=1, with an edge on a, exercise the critical path?
Answer:
(Figure: the full adder with the alternative input values applied.)
The alternative does not exercise the critical path. Instead, the alternative
excitation follows a shorter path, so the output stabilizes sooner.
Lesson: not all transitions on the inputs will exercise the critical path.
Using timing simulation to find the maximum clock speed of a circuit might
overestimate the clock speed, because the input values that you simulate
might not exercise the critical path.
5.3.1.2 Preliminaries for Critical Paths
Definition critical path: The slowest path on the chip between flops, or between flops and pins.
The critical path limits the maximum clock speed.
There are three classes of paths on a chip:
entry path: from an input to a flop.
Quartus does not report this by default. When Quartus reports this path, it is reported as the
period associated with System fmax. In Xilinx timing reports, this is reported as Maximum Delay.
stage path: from one flop to another flop.
In Quartus timing reports, this is reported as the period associated with Internal fmax.
In Xilinx timing reports, this is reported as Clock to Setup and Maximum Frequency.
exit path: from a flop to an output.
Quartus does not report this by default. When Quartus reports this path, it is reported as the
period associated with System fmax. In Xilinx timing reports, this is reported as Maximum Delay.
5.3.1.3 Longest Path and Critical Path
The longest path through the circuit might not be the critical path, because the behaviour of the
gates might prevent an edge (0→1 or 1→0) from travelling along the path.
If an edge cannot travel along a path, the path is called a false path.
Example False Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Determine whether the longest path in the circuit below is a false path
(Figure: circuit with inputs a and b and output y.)
Answer:
For this example, we use a very naive approach, simply to illustrate the
phenomenon of false paths. Sections 5.3.2-5.3.5 present a better algorithm
to detect false paths and find the real critical path.
In the circuit above, the longest path is from b to y.
The four possible scenarios for the inputs are:
(a = 0, b = 0→1)
(a = 0, b = 1→0)
(a = 1, b = 0→1)
(a = 1, b = 1→0)
(Figure: the circuit annotated for each of the four scenarios: a=0 with b=0→1, a=0 with b=1→0, a=1 with b=0→1, and a=1 with b=1→0.)
In each of the four scenarios, the edge is blocked at either the AND gate or
the OR gate. None of the four scenarios results in an edge on the output y, so
the path from b to y is a false path.
Question: How can we determine analytically that this is a false path?
Answer:
The value on a will always force either the AND gate to be a 0 (when a is
0) or the OR gate to be a 1 (when a is 1). For both a=0 and
a=1, a change on b will be unable to propagate to y. The algorithm to
detect false paths is based upon this type of analysis.
Preview of Complete Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
This example illustrates all of the concepts in analysing critical paths. Here, we explore the circuit
informally. In section 5.3.5, we will revisit this circuit and analyse it according to the complete,
correct, and complex algorithm.
Question: Find the critical path through the circuit below.
(Figure: circuit with primary inputs a, b, c, d, internal signals e and f, and output g.)
Answer:
Even though the equation for this circuit reduces to false, the output signal (g)
is not a constant 0. Instead, glitches can occur on g. To explore the
behaviour of the circuit, we will stimulate the circuit first with a falling edge,
then a rising edge.
Stimulate the circuit with a falling edge and see which path the edge follows.
(Figure: the circuit annotated with arrival times for the falling-edge stimulus.)
The longest path through the circuit is the middle path.
At g, the side input a has a controlling value before the falling edge arrives on
the path input e. Thus, a falling edge is unable to excite the longest path
through the circuit.
Stimulate the circuit with a rising edge and see which path the edge follows.
(Figure: the circuit annotated with arrival times for the rising-edge stimulus.)
At f, the side input c has a controlling value before the falling edge arrives on
the path input e. Thus, a rising edge is unable to excite the longest path
through the circuit.
Of the two scenarios, the falling edge follows a longer path through the circuit
than the rising edge. The critical path is the lower path through the circuit.
When we develop our first algorithm to detect false paths (section 5.3.3), we
will assume that at each gate, the input that is on the critical path will arrive
after the other inputs. Not all circuits satisfy this assumption. At f, when a is a
falling edge, the path input c arrives before the side input e.
5.3.1.4 Timing Simulation vs Static Timing Analysis
The delay through a component is usually dependent upon the values on its signals. This is because
different paths in the circuit have different delays and some input values will prevent some paths
from being exercised. Here are two simple examples:
In a ripple-carry adder, if a carry out of the MSB is generated from the least significant bit,
then it will take longer for the output to stabilize than if no carries are generated at all.
In a state machine using a one-hot state encoding, false paths might exist when more than
one state bit is a 1.
Because of these effects, static timing analysis might be overly conservative and predict a delay
that is greater than you will experience in practice. Conversely, a timing simulation may not
demonstrate the actual slowest behaviour of your circuit: if you don't ever generate a carry from
LSB to MSB, then you'll never exercise the critical path in your adder. The most accurate delay
analysis requires looking at the complete set of actual data values that will occur in practice.
5.3.2 Longest Path
The following is an algorithm to find the longest path from a set of source signals to a set of
destination signals. We first provide a high-level, intuitive description, and then present the actual
algorithm.
Outline of Algorithm to Find Longest Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Start at destination signals and traverse through fanin to source signals, annotating each inter-
mediate signal with the maximum delay from the intermediate signal to the destination signals.
The source signal with the maximum delay is the start of the longest path. The delay annotation
of this signal is the delay of the longest path.
The longest path is found by working from the source signal to the destination signals, picking
the fanout signal with the maximum delay at each step.
Algorithm to Find Longest Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1. Set current time to 0
2. Start at destination signals
3. For each input to a gate that drives a destination signal, annotate the input with the current
time plus the delay through the gate
4. For each gate that has times on all of its fanout but not a time for itself,
(a) annotate each input to the gate with the maximum time on the fanout plus the delay
through the gate
(b) go to step 4
5. To find the longest path, start at the source node that has the maximum delay. Work forward
through the fanout. For signals that fan out to multiple signals, choose the fanout signal with
the maximum delay.
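A minimal sketch of this algorithm in Python, assuming a small invented netlist format (each gate is a tuple of output signal, gate delay, and input signals); the format and names are illustrative only, not the notes' notation:

    def annotate_max_delay(gates, destinations):
        """For every signal, the maximum delay from that signal to any
        destination signal.  gates: list of (output, gate_delay, inputs)."""
        fanout = {}
        for out, delay, ins in gates:
            for sig in ins:
                fanout.setdefault(sig, []).append((delay, out))
        dist = {}
        def visit(sig):
            if sig not in dist:
                if sig in destinations:
                    dist[sig] = 0
                else:
                    dist[sig] = max(d + visit(out) for d, out in fanout[sig])
            return dist[sig]
        for out, _, ins in gates:
            for sig in ins:
                visit(sig)
        return dist

    def longest_path(gates, sources, destinations):
        dist = annotate_max_delay(gates, destinations)
        fanout = {}
        for out, delay, ins in gates:
            for sig in ins:
                fanout.setdefault(sig, []).append((delay, out))
        sig = max(sources, key=lambda s: dist[s])      # worst-case source signal
        path = [sig]
        while sig not in destinations:
            _, sig = max(fanout[sig], key=lambda f: f[0] + dist[f[1]])
            path.append(sig)
        return path, dist[path[0]]

    # Toy usage on an invented circuit (NOT delay 2, other gates delay 4):
    gates = [("d", 2, ["b"]), ("e", 4, ["a", "d"]), ("y", 4, ["e", "a"])]
    print(longest_path(gates, sources={"a", "b"}, destinations={"y"}))
    # (['b', 'd', 'e', 'y'], 10)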
Longest Path Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Find the longest path through the circuit below.
(Figure: example circuit with signals a, b, c, d, e, f, g, h, i, j, k, l, m.)
Answer:
Annotate signals with the maximum delay to an output:
(Figure: the circuit annotated with the maximum delay from each signal to the outputs; the annotations on a, b, and c are 16, 12, and 10.)
Find longest path:
(Figure: the circuit with the longest path, starting at a, highlighted.)
The longest path, which starts at a, has a delay of 16.
5.3.3 Detecting a False Path
In this section, we will explore a simple and almost correct algorithm to determine if a path is a
false path. The simple algorithm in this section sometimes gives incorrect results if the candidate
path intersects false paths. For all of the example circuits in this section, the algorithm gives
the correct result. The purpose of presenting this almost-correct algorithm is that it is relatively
easy to understand and introduces one of the key concepts used in the complicated, correct, and
complete algorithm for finding the critical path in section 5.3.5.
5.3.3.1 Preliminaries for Detecting a False Path
Controlling Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The controlling value of a gate is the value such that if one of the inputs has this value, the output
can be determined independently of the other inputs.
For an AND gate, the controlling value is 0, because when one of the inputs is a 0, we know
that the output will be 0 regardless of the values of the other inputs.
The controlled output value of a gate is the value produced by the controlling input value.
Gate Controlling Value Controlled Output
AND 0 0
OR 1 1
NAND 0 1
NOR 1 0
XOR none none
Path Input, Side Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
For a gate on a path (either a candidate critical path, or a real critical path), the path input is the
input signal that is on the path.
For a gate on a path (either a candidate critical path, or a real critical path), the side inputs are the
input signals that are not on the path.
The key idea behind the almost-correct algorithm is that: for an edge to propagate along a path,
the side inputs to each gate on the path must have non-controlling values. The complete, correct,
and complicated algorithm generalizes this constraint to handle circuits where the side inputs are
on false paths.
Reconvergent Fanout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Most of the difficulties, both with critical paths and with testing circuits for manufacturing faults
(Chapter 7), are caused by reconvergent fanout: the wires that fan out from a gate reconverge at
another gate.
(Figure: circuit with two sets of reconvergent fanout; one set of reconvergent paths goes from a to y and one set goes from d to z.)
There are two sets of reconvergent paths in the circuit above. One set of reconvergent paths goes
from a to y and one set goes from d to z.
If a candidate path has reconvergent fanout, then the rising or falling edge on the input to the path
might cause a side input along the path to have a rising or falling edge, rather than a stable 0 or
1.
To support reconvergent fanout, we extend the rule for side inputs having non-controlling values
to say that side inputs must have either non-controlling values or have edges that stabilize in non-
controlling values.
Rules for Propagating an Edge Along a Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
These rules assume that side inputs arrive before path inputs. Section 5.3.5 relaxes this constraint.
(Figure: edge-propagation rules for NOT, AND, OR, and XOR gates, showing the side-input values required for an edge to pass through each gate.)
Question: Why do the rules not have falling edges for AND gates or rising edges for
OR gates?
Answer:
(Figure: AND and OR gates with an edge on one input and the side input shown, illustrating the answer below.)
For an AND gate, a falling edge on a side input will force the output to change
and prevent the path input from affecting the output. This is because the final
value of a falling edge is the controlling value for an AND gate. Similarly, for an
OR gate, the final value of a rising edge is the controlling value for the gate.
Analyzing Rules for Propagating Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The pictures below show all combinations of output edge (rising or falling) and input values (constant
1, constant 0, rising edge, falling edge) for AND and OR gates. These pictures assume that
the side input arrives before the path input. The pictures that are crossed out illustrate situations
that prevent the path input from affecting the output. In these situations the inputs cause either a
constant value on the output, or the side input affects the output but the path input does not. The
pictures that are not crossed out correspond to the rules above for pushing edges through AND and
OR gates.
(Figure: all side-input/path-input combinations for AND and OR gates; the crossed-out cases produce a constant output because the side input has the controlling value, 0 for AND and 1 for OR.)
5.3.3.2 Almost-Correct Algorithm to Detect a False Path
The rules above for propagating an edge along a candidate path assume that the values on side
inputs always arrive before the value on the path input. This is always true when the candidate
path is the longest path in the circuit. However, if the longest path is a false path, then when we are
testing subsequent candidate paths, there is the possibility that a side input will be on a false path
and the side input value will arrive later than the value from the path input.
This almost-correct algorithm assumes that values on side inputs always arrive before values on
path inputs. The correct, complex, and complete critical path algorithm in section 5.3.5 extends
the almost correct algorithm to remove this assumption.
To determine if a path through a circuit is a false path:
1. Annotate each side input along the path with its non-controlling value. These annotations
are the constraints that must be satisfied for the candidate path to be exercised.
2. Propagate the constraints backward from the side inputs of the path to the inputs of the circuit
under consideration.
3. If there is a contradiction amongst the constraints, then the candidate path is a false path.
4. If there is no contradiction, then the constraints on the inputs give the conditions under which
an edge will traverse along the candidate path from input to output.
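A minimal sketch of the almost-correct test in Python. It assumes the same invented netlist format as the earlier longest-path sketch, and it checks the side-input constraints by brute-force enumeration of the primary inputs rather than by symbolic back-propagation:

    from itertools import product

    CONTROLLING = {"AND": 0, "OR": 1}        # XOR and NOT have no controlling value

    def evaluate(gates, assignment):
        """Evaluate every signal; gates must be listed in topological order."""
        values = dict(assignment)
        for out, kind, ins in gates:
            vals = [values[i] for i in ins]
            if kind == "AND":
                values[out] = int(all(vals))
            elif kind == "OR":
                values[out] = int(any(vals))
            elif kind == "NOT":
                values[out] = 1 - vals[0]
            elif kind == "XOR":
                values[out] = vals[0] ^ vals[1]
        return values

    def is_false_path(gates, inputs, path):
        """Almost-correct test: the path is false if no input assignment gives
        every side input along the path its non-controlling value."""
        by_output = {out: (kind, ins) for out, kind, ins in gates}
        constraints = []                     # (side-input signal, required value)
        for prev, cur in zip(path, path[1:]):
            kind, ins = by_output[cur]
            if kind in CONTROLLING:
                noncontrolling = 1 - CONTROLLING[kind]
                constraints += [(s, noncontrolling) for s in ins if s != prev]
        for bits in product([0, 1], repeat=len(inputs)):
            values = evaluate(gates, dict(zip(inputs, bits)))
            if all(values[s] == v for s, v in constraints):
                return False                 # satisfiable: the path can be exercised
        return True                          # contradictory constraints: false path

On a toy circuit in the spirit of the b-to-y example from section 5.3.1.3 (not the notes' exact netlist), gates = [("bn", "NOT", ["b"]), ("n", "AND", ["a", "bn"]), ("y", "OR", ["a", "n"])], the call is_false_path(gates, ["a", "b"], ["b", "bn", "n", "y"]) returns True, because the side input a would have to be 1 at the AND gate and 0 at the OR gate.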
5.3.3.3 Examples of Detecting False Paths
False-Path Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Determine if the longest path in the circuit below is a false path.
(Figure: the example circuit from section 5.3.2, annotated with delays; the longest path is the candidate path.)
Answer:
Compute constraints for side inputs to have non-controlling values:
(Figure: the candidate path with the non-controlling values annotated on the side inputs; two of the constraints are contradictory.)
side input | non-controlling value | constraint
g[b]       | 1                     | b
i[e]       | 0                     | c
k[h]       | 1                     | b'
Found a contradiction between g[b] needing b and k[h] needing b', therefore the
candidate path is a false path.
Analyze cause of contradiction:
(Figure: the circuit with the two contradictory side inputs highlighted.)
These side inputs will always have opposite values. Both side inputs feed the same type of gate (AND),
so it will always be the case that one of the side inputs has a controlling value (0).
False-Path Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Determine if the longest path through the circuit below is a critical path.
If the longest path is a critical path, find a pair of input vectors that will exercise the
path.
(Figure: circuit with primary inputs a, b, c and signals d, e, f, g, h.)
Answer:
(Figure: the longest path with the non-controlling values annotated on the side inputs.)
side input | non-controlling value | constraint
e[a]       | 1                     | a
g[b]       | 0                     | b'
h[f]       | 1                     | a'+b
The complete constraint is the conjunction of the constraints: a·b'·(a'+b), which reduces to
false. Therefore, the candidate path is a false path.
5.3.3 Detecting a False Path 331
False-Path Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
This example illustrates a candidate path that is a true path.
Question: Determine if the longest path through the circuit below is a critical path. If
the longest path is a critical path, find a pair of input vectors that will exercise the
path.
(Figure: circuit with primary inputs a, b, c and signals d, e, f, g, h.)
Answer:
Find longest path; label side inputs with non-controlling values:
(Figure: the longest path with the non-controlling values annotated on the side inputs.)
Table of side inputs, non-controlling values, and constraints on primary inputs:
side input | non-controlling value | constraint
e[a]       | 0                     | a'
g[b]       | 0                     | b'
h[b]       | 1                     | a'+b'
The complete constraint is a'·b'·(a'+b'), which reduces to a'·b'. Thus, for an edge
to propagate along the path, a must be 0 and b must be 0.
Critical path c, e, g, h
Delay 14
Input vector a=0, b=0, c=rising edge
Illustration of rising edge propagating along path:
(Figure: a rising edge on c propagating along the path c, e, g, h with a=0 and b=0.)
Illustration of falling edge propagating along path:
(Figure: a falling edge on c propagating along the same path with a=0 and b=0.)
False-Path Example 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
This example illustrates reconvergent fanout.
(Figure: circuit with reconvergent fanout; primary inputs a, b and signals c, d, e, f, g.)
Answer:
(Figure: the candidate path with the non-controlling values annotated on the side inputs.)
side input | non-controlling value | constraint
e[b]       | 1                     | b
g[d]       | 1                     | a
The complete constraint is ab.
The constraint includes the input to the path (a), which indicates that not all
edges will propagate along the path. The polarity of the path input indicates
the final value of the edge. In this case, the constraint of a means that we
need a rising edge.
Critical path a, c, e, f, g
Delay 12
Input vector a=rising edge, b=1
Illustration of rising edge propagating along path:
(Figure: a rising edge on a propagating along the path a, c, e, f, g with b=1.)
If we try to propagate a falling edge along the path, the falling edge on the
side input d forces the output g to fall before the arrival of the falling edge on
the path input f. Thus, the edge does not propagate along the candidate
path.
(Figure: a falling edge on a fails to propagate; the falling edge on the side input d forces g to fall first.)
Patterns in False Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
After analyzing these examples, you might have begun to observe some patterns in how false paths
arise. There are several patterns in the types of reconvergent fanout that lead to false paths. For
example, if the candidate path has an OR gate and an AND gate that are both controlled by the same
signal, and the candidate has an even number of inverters between these gates, then the candidate
path is almost certainly a false path. The reason is the same as illustrated in the first example of a
false path: the side input will always have a controlling value for either the OR gate or the AND
gate.
5.3.4 Finding the Next Candidate Path
If the longest path is a false path, we need to find the next longest path in the circuit, which will be
our next candidate critical path. If this candidate fails, we continue to find the next longest of the
remaining paths, ad infinitum.
5.3.4.1 Algorithm to Find Next Candidate Path
To find the next candidate path, we use a path table, which keeps track of the partial paths that
we have explored, their maximum potential delay, and the signals that we can follow to extend a
partial path toward the outputs. We keep the path table sorted by the maximum potential delay of
the paths. We delete a path from the table if we discover that it is a false path.
The key to the path table is how to update the potential delay of the partial paths after we discover
a false path. All partial paths that are prefixes of the false path will need to have their potential
delay values recomputed. The updated delay is found by following the unexplored signals in the
fanout of the end of the partial path.
1. Initialize path table with primary inputs, their potential delay, and fanout.
2. Sort path table by potential delay (path with greatest potential delay at bottom of table)
3. If the partial path with the maximum potential delay has just one unused fanout signal,
then extend the partial path with this signal.
Otherwise:
(a) Create a new entry in the path table for the partial path extended by the unused fanout
signal with the maximum potential delay.
(b) Delete this fanout signal from the list of unused fanout signals for the partial path.
4. Compute the constraint that the side input of the new signal does not have a controlling value,
and update the constraint table.
5. If the new constraint does not cause a contradiction,
then return to step 3.
Otherwise:
(a) Mark this partial path as false.
(b) For each partial path that is a prex of the false path:
reduce the potential delay of the path by the difference between the potential delay
of the fanout that was followed and the unused fanout with next greatest delay value.
(c) Return to step 2
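The path table can also be viewed as a best-first search over partial paths. A compact Python sketch of that view (it reuses the max-delay-to-destination annotation from the longest-path sketch above; this is an alternative framing, not the notes' exact bookkeeping):

    import heapq

    def candidate_paths(gates, sources, destinations, dist):
        """Yield complete paths in order of non-increasing potential delay.
        dist[s] is the maximum delay from signal s to any destination."""
        fanout = {}
        for out, delay, ins in gates:
            for sig in ins:
                fanout.setdefault(sig, []).append((delay, out))
        heap = [(-dist[s], 0, [s]) for s in sources]   # (-potential, accumulated delay, path)
        heapq.heapify(heap)
        while heap:
            neg_potential, acc, path = heapq.heappop(heap)
            last = path[-1]
            if last in destinations:
                yield path, -neg_potential             # a complete candidate path
                continue
            for gate_delay, out in fanout.get(last, []):
                new_acc = acc + gate_delay
                heapq.heappush(heap, (-(new_acc + dist[out]), new_acc, path + [out]))

Each yielded candidate would then be run through the false-path test; the first candidate that is not false is the critical path.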
5.3.4.2 Examples of Finding Next Candidate Path
Next-Path Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Starting from the initial delay calculation and longest path, find the next
candidate path and test if it is a false path.
(Figure: the example circuit annotated with delays, as in the previous section.)
Answer:
Initial state of path table:
potential delay | unused fanout | path
10              | e             | c
12              | h, g          | b
16              | d             | a
Extend the path with the maximum potential delay until we find a contradiction or reach the
end of the path. Add an entry in the path table for each intermediate path with
multiple fanouts.
Path table after detecting that the longest path is a false path:
potential delay | unused fanout | path
10              | e             | c
12              | h, g          | b
16              | j, i          | a, d, f, g
false           |               | a, d, f, g, i, k
The longest path is a false path. Recompute the potential delay of all paths in the
path table that are prefixes of the false path.
The one path that is a prefix of the false path is a, d, f, g. The remaining
unused fanout of this path is j, which has a potential delay on its input of 2.
The previous potential delay of g was 8, thus the potential delay of the prefix
is reduced by 8 - 2 = 6, giving the path a potential delay of 16 - 6 = 10.
Path table after updating with new potential delays:
potential delay | unused fanout | path
false           |               | a, d, f, g, i, k
10              | e             | c
10              | i             | a, d, f, g
12              | h, g          | b
Extend b through g, because g has greater potential delay than the other
fanout signal (h).
potential delay | unused fanout | path
false           |               | a, d, f, g, i, k
10              | e             | c
10              | i             | a, d, f, g
12              | h, g          | b
12              | i, j          | b, g
side input | non-controlling value | constraint
g[a]       | 1                     | a
From g, we will follow i, because it has greater potential delay than j.
potential delay | unused fanout | path
false           |               | a, d, f, g, i, k
10              | e             | c
10              | i             | a, d, f, g
12              | h, g          | b
12              | i, j          | b, g
12              |               | b, g, i, k
side input | non-controlling value | constraint
g[a]       | 1                     | a
i[e]       | 0                     | c
k[h]       | 1                     | b'
We have reached an output without encountering a contradiction in our
constraints. The complete constraint is a·b'·c.
Critical path b, g, i, k
Delay 12
Input vector a=1, b=falling edge, c=1
Illustrate the propagation of a falling edge:
(Figure: a falling edge on b propagating along the path b, g, i, k with a=1 and c=1.)
At k, the rising edge on the side input (h) arrives before the falling edge on
the path input (i). For a brief moment in time, both the side input and path
input are 1, which produces a glitch on k.
Next-Path Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Find the critical path in the circuit below.
(Figure: circuit with primary inputs a, b, c, d, e and signals f, g, h, i, j, k, l, m.)
Answer:
Find the longest path:
(Figure: the circuit annotated with the maximum delay from each signal to the outputs; the annotations on a, b, c, d, and e are 10, 14, 20, 22, and 4.)
Initial state of path table:
potential delay | unused fanout | path
4               | k             | e
10              | j, l          | a
14              | i             | b
20              | g             | c
22              | f             | d
Extend the path with the maximum potential delay until we find a contradiction or reach the
end of the path. Add an entry in the path table for each intermediate path with
multiple fanouts.
potential delay | unused fanout | path
4               | k             | e
10              | j, l          | a
14              | i             | b
20              | g             | c
22              | j, k          | d, f, g, h, i
false           |               | d, f, g, h, i, j, l
side input | non-controlling value | constraint
g[c]       | 1                     | c
i[b]       | 0                     | b'
j[a]       | 0                     | a'
l[a]       | 1                     | a
Contradiction between j[a] and l[a], therefore the longest path (d to m) is a
false path.
To find the next candidate, begin by recomputing the delays along the candidate path.
The last intermediate path with unused fanouts before the contradiction is i.
Cut the candidate path at this signal. The remainder of the candidate path is:
d, f, g, h, i. The only unused fanout of this path is k. The potential delay of
this path is reduced by 6, because the delay of j, l, m is 10 and the delay of k
is 4. This brings the potential delay of d, f, g, h, i down to 22 - 6 = 16.
The partial path with the maximum potential delay is c. The new critical path
candidate will be: c, g, h, i, j, l, m.
Update the path table with the delay of 16 for the previous candidate path. Extend c
along the path with maximum potential delay until we find a contradiction or reach the end
of the path. Add an entry in the path table for each intermediate path with multiple
fanouts.
potential delay | unused fanout | path
false           |               | d, f, g, h, i, j, l
4               | k             | e
10              | j, l          | a
14              | i             | b
16              | k             | d, f, g, h, i
20              | k             | c, f, g, h, i
false           |               | c, f, g, h, i, j, l
We encounter the same contradiction as with the previous candidate, and so
we have another false path. We could have detected this false path without
working through the path table if we had recognized that our current
candidate path overlaps with the section (j, l) of the previous candidate that
caused the false path.
As with the previous candidate, we reduce the potential delay of the path up
through i by 6, giving us a potential delay of 20 - 6 = 14 for c, f, g, h, i. The
next candidate path is d, f, g, h, i, k with a delay of 16.
potential delay | unused fanout | path
false           |               | d, f, g, h, i, j, l
false           |               | c, f, g, h, i, j, l
4               | k             | e
10              | j, l          | a
14              | i             | b
14              | k             | c, f, g, h, i
16              | k             | d, f, g, h, i
side input | non-controlling value | constraint
g[c]       | 1                     | c
i[b]       | 0                     | b'
k[e]       | 0                     | e'
The complete constraint is b'·c·e'. There is no constraint on a, and d may be
either a rising edge or a falling edge.
Critical path d, f, g, h, i, k
Delay 16
Input vector a=0, b=0, c=1, d=rising edge, e=0
Next Path Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Find the critical path in the circuit below.
(Figure: circuit with primary inputs a, b, c, d and signals e, f, g, h, i, j, k, l, m, n, o, p.)
Answer:
(Figure: the circuit annotated with the maximum delay from each signal to the outputs; the annotations on a, b, c, and d are 12, 14, 16, and 8.)
Initial state of path table:
potential delay | unused fanout | path
8               | n, o          | d
12              | j, k          | a
14              | e             | b
16              | f             | c
Extend c through f:
potential delay | unused fanout | path
8               | n, o          | d
12              | j, k          | a
14              | e             | b
16              | m, n          | c, f, g, h, i
false           |               | c, f, g, h, i, n, p
side input | non-controlling value | constraint
n[d]       | 1                     | d
p[o]       | 1                     | d'
The first candidate is a false path. Recompute the potential delay of c, f, g, h, i,
which reduces it from 16 to 12.
potential delay | unused fanout | path
false           |               | c, f, g, h, i, n, p
8               | n, o          | d
12              | j, k          | a
12              | m             | c, f, g, h, i
14              | e             | b
Extend b through e:
potential delay | unused fanout | path
false           |               | c, f, g, h, i, n, p
8               | n, o          | d
12              | j, k          | a
12              | m             | c, f, g, h, i
false           |               | b, e, k, l
side input | non-controlling value | constraint
k[a]       | 1                     | a
l[j]       | 1                     | a'
The second candidate is a false path. There is no unused fanout signal from
l for the path b, e, k, l, so this partial path is a false path and there is no new
delay information to compute.
There are two paths with a potential delay of 12. Choose c, f, g, h, i,
because the end of the path is closer to an output, so there will be less work
to do in analyzing the path.
potential delay | unused fanout | path
false           |               | c, f, g, h, i, n, p
false           |               | b, e, k, l
8               | n, o          | d
12              | j, k          | a
12              |               | c, f, g, h, i, m
side input | non-controlling value | constraint
m[l]       | 0                     | a+ab
Critical path c,f,g,h,i,m
Delay 12
Input vector a=0, b=1, c=rising edge, d=0
5.3.5 Correct Algorithm to Find Critical Path
In this section, we remove the assumption that values on side inputs always arrive earlier than the
value on the path input.
5.3.5.1 Algorithm
If we find a contradiction on the path, check for side inputs that are on previously discovered false paths.
If a side input to the candidate path is on a previously discovered false path, then the side input
defines a prefix of that false path.
Compute the constraint to excite the prefix (this is called the viability constraint of the prefix).
To the row of the late-arriving side input in the constraint table, add as a disjunction the
constraint that the prefix is viable and the path input has a controlling value.
5.3.5.2 Examples
Complete Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Find the critical path in the circuit below.
(Figure: circuit with primary inputs a, b, c and signals d, e, f, g.)
Answer:
(Figure: the circuit annotated with delays; the longest path, starting at a, has a potential delay of 14.)
potential delay | unused fanout | path
14              | g, b, c       | a
false           |               | a, b, d, e, f, g
side input | non-controlling value | constraint
f[c]       | 1                     | a'
g[a]       | 1                     | a
First false path, pursue next candidate.
potential delay | unused fanout | path
false           |               | a, b, d, e, f, g
10              | g, c          | a
10              |               | a, c, f, g
side input | non-controlling value | constraint
f[e]       | 1                     | a'
g[a]       | 1                     | a
At first, this path appears to be false, but the side input e is on the prefix of
the false path a, b, d, e, f, g, and the start of the false path is the primary input to
the current candidate. Thus, f[e] is a late-arriving side input.
The candidate path will be a true path if the side input arrives late and the
path input is a controlling value. The viability condition for the path a, b, d, e is
true. The constraint for the path input (c) to have a controlling value for f is a.
Together, the viability constraint of true and the controlling-value constraint of
a give us a late-side constraint of a.
Updating the constraint table with the late-arriving side input constraint gives
us:
side input | non-controlling value | constraint
f[e]       | 1                     | a' + a = true
g[a]       | 1                     | a
The constraint reduces to a. A rising edge will exercise the path.
Critical path a, c, f, g
Delay 10
Input vector a=rising edge
Complete Example 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Find the critical path in the circuit below.
(Figure: circuit for Complete Example 2, with primary inputs a, b, c, d, e.)
(Figure: the circuit annotated with the maximum delay from each signal to the outputs.)
potential delay | unused fanout | path
8               | k             | a
12              | i             | d
16              | h             | e
18              | f             | b
20              | f, h          | c, e
false           |               | c, e, h, j, l, m
side input | non-controlling value | constraint
f[b]       | 0                     | b
i[d]       | 1                     | d
l[j]       | 1                     | ce
m[k]       | 0                     | a(b+c)
Contradiction. This is the first false path; pursue the next candidate.
potential delay | unused fanout | path
false           |               | c, e, h, j, l, m
8               | k             | a
12              | i             | d
16              | h             | e
18              | f             | b
18              | h             | c, e
18              |               | c, e, h, j, l, m
side input | non-controlling value | constraint
h[e]       | 0                     | e
l[i]       | 1                     | bcd
m[k]       | 0                     | a(b+c)
Initially, we found a contradiction, but c, f, g, i is a prefix of a false path and l[i] is
a side input to the candidate path. We have a late side input. The viability
constraint for this prefix is bd. The constraint for the path input (j) to have a
controlling value of 0 for l is c + e. Combining the two constraints together
gives us a constraint for the late side input l[i] of bd(c+e).
Adding the constraint for the late side input to the constraint table gives us:
side input | non-controlling value | constraint
h[e]       | 0                     | e
l[i]       | 1                     | bcd + bd(c+e) = bd
m[k]       | 0                     | a(b+c)
The constraint for the candidate path reduces to: abcde.
Critical path c, e, h, j, l, m
Delay 18
Input vector a=0, b=0, c=falling edge, d=1, e=0
Demonstrate excitation of path:
(Figure: excitation of the path c, e, h, j, l, m with a=0, b=0, d=1, e=0, and a falling edge on c.)
Complete Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Find the critical path in the circuit below.
(Figure: circuit for Complete Example 3, with primary inputs a, b, c.)
Answer:
(Figure: the circuit annotated with the maximum delay from each signal to the outputs.)
potential delay | unused fanout | path
12              | h, k          | a
14              | e             | c
false           |               | b, d, f, h, j, k
side input | non-controlling value | constraint
h[a]       | 1                     | a
j[i]       | 0                     | c
k[a]       | 0                     | a'
First false path, pursue next candidate.
potential delay | unused fanout | path
false           |               | b, d, f, h, j, k
12              | h, k          | a
14              |               | c, e, g, i, j, k
side input | non-controlling value | constraint
j[h]       | 0                     | a'+b
k[a]       | 0                     | a'
The constraint reduces to a'.
Because the minimum delay from an input to the side input h is greater than
the delay to the path input i, we might be tempted (incorrectly!) to treat h as
a late-arriving side input to j. This would be a mistake. The primary input to
the path (c) does not fan out to h, thus h will have a stable value. (Remember:
to detect whether a candidate path is false, the only input to the circuit that
changes value is the primary input to the critical path.) Late-arriving side
inputs are relevant only to signals that are affected by the primary input to the
path.
Complete Example 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Find the critical path in the circuit below.
(Figure: circuit for Complete Example 4, with primary inputs a, b, c.)
Answer:
potential delay | unused fanout | path
12              | g, h          | a
16              | e, j          | c
16              | d, e          | b
false           | e, d          | b, e, g, i
side input | non-controlling value | constraint
e[c]       | 0                     | c'
g[a]       | 1                     | a
i[a]       | 0                     | a'
First false path, pursue next candidate.
potential delay | unused fanout | path
false           | e, d          | b, e, g, i
12              | g, h          | a
14              | d             | b
16              | e, j          | c
false           |               | c, e, g, i
Second false path, pursue next candidate.
potential delay | unused fanout | path
false           | e, d          | b, e, g, i
false           |               | c, e, g, i
8               | j             | c
12              | g, h          | a
14              | d             | b
side input | non-controlling value | constraint
j[c]       | 1                     | c
k[i]       | 0                     | a
Third candidate is a true path.
If the initial analysis suggested that the candidate was a false path, we might
be tempted to use k[i] as a late side input. However, this is not a late side
input, because b-i and c-i are false paths. There is no true path to i that
has a delay longer than the delay of the current candidate path c,d,f,h,j.
If we did not see immediately that k[i] is not a late side input, we would
discover this when we tried but failed to construct the viability condition for the
paths b-i and c-i.
Although the path a-i is a true path, it does not contribute to making
k[i] a late side input, because the delay from a to i is less than the delay
along the candidate path.
5.3.6 Further Extensions
McGeer and Brayton's paper includes two extensions to the critical path algorithm presented here
that we will not cover:
gates with more than two inputs
finding all input values that will exercise the critical path
5.4 Analog Timing Model
There are many different models used to describe the timing of circuits. In the section on critical
paths, we used a timing model that was based on the size of the gate. The timing model ignored
interconnect delays and treated all gates as if they had the same fanout. For example, the delay
through an AND gate was 4, independent of how many gates were in its immediate fanout.
In this section and the next (section 5.5) we discuss two timing models. In this section, we discuss
the detailed analog timing model, which reflects quite accurately the actual voltages on different
nodes. The SPICE simulation program uses very detailed analog models of transistors (dozens of
parameters to describe a single transistor). In the next section, we describe the Elmore delay
model, which achieves greater simplicity than the analog model, but at a loss of accuracy.
(Figure: p-transistor and n-transistor shown at the transistor level, mask level, cross-section of the fabricated transistor, and switch level, with gate, source, drain, poly, diffusion, contact, and substrate labelled.)
Different Levels of Abstraction for Inverter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(Figure: an inverter at different levels of abstraction: gate level, transistor level, mask level, and the RC network used for timing analysis with Rpu, Rpd, Cp, and CL.)
Contacts (vias) have resistance (R_V).
Metal areas (wires) have resistance (R_W) and capacitance (C_W).
The resistance is dependent upon the geometry of the wire.
The capacitance is dependent upon the geometry of the wire and the other wires adjacent to it.
For most circuits, the via resistance is much greater than the wire resistance (R_V ≫ R_W).
A Pair of Inverters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(Figure: a pair of inverters, signals a, b, c, at the gate level, transistor level, and mask level.)
A Pair of Inverters (Cont'd) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(Figure: the pair of inverters at the mask level and as an RC network for timing analysis, including the wire resistance R_W, via resistance R_V, and wire capacitance C_W between the two inverters.)
A Circuit with Fanout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(Figure: a circuit with fanout, signals a, b, c, d, at the gate level and in its physical layout.)
(Figure: the fanout circuit at the transistor level, mask level, and as an RC network for timing analysis, with separate wire segments R_W1/C_W1, R_W2/C_W2, R_W3/C_W3, via resistances R_V, and a load C_L at each fanout gate.)
5.4.1 Timing Model
(Figure: timing model; a driver with pull-up resistance Rpu, pull-down resistance Rpd, parasitic capacitance Cp, input Vi, output Vo, and load capacitance Cout.)

R_pu: pull-up resistance of the p-transistor
R_pd: pull-down resistance of the n-transistor
C_p: parasitic capacitance
C_out: load capacitance
5.4.1.1 Equation for Output Voltage
Output voltage when V_o discharges through R_pd:

    V_o = V_DD e^( -t / (R_pd (C_p + C_out)) )
Measuring Delay Through an Inverter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(Figure: a pair of inverters, signals a, b, c, at the gate level and as an RC network, with the analog waveforms on a and b.)
How do we use the analog waveforms to determine the discrete delay through the inverter?
Trip Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
To measure delay through inverter, what voltage levels should we use?
Definition Trip Points: A high or 1 trip point is the voltage level where an upwards
transition means the signal represents a 1.
A low or 0 trip point is the voltage level where a downwards transition means the
signal represents a 0.
In the figure below, the gray line represents the actual voltage on a signal, and the black line is the
digital discretization of the analog signal.
(Figure: analog waveforms on a and b with their digital discretizations.)
We need to pick our trip points; these then determine the start and stop times for measuring delay.
Pick the trip points to simplify the delay equation.
Pick trip points of 0.35/0.65:
low-voltage (0) trip point of 0.35 Vdd
high-voltage (1) trip point of 0.65 Vdd
Set up the delay equation for T_PD to be the time for V_o to fall from V_DD to the low trip point of
0.35 V_DD:

Original equation:
    V_o = V_DD e^( -t / (R_pd (C_p + C_out)) )

0.35 V_DD trip point:
    0.35 V_DD = V_DD e^( -T_PD / (R_pd (C_p + C_out)) )

T_PD represents the propagation delay, which is the sum of the interconnect and load delays.

Solving for T_PD, using ln(1/0.35) ≈ 1, and doing some more approximations:
    T_PD ≈ R_pd (C_p + C_out)
Some Rough Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A larger transistor has a lower resistance, but a higher capacitance.
Resistance affects timing of source (driving) signals.
Capacitance affects (mostly) timing of destination (load) signals.
Decreasing resistance increases the current through drivers.
Increasing capacitance slows down (dis)charging of load capacitors.
5.4.1.2 Extrinsic / Intrinsic Delays
Definition intrinsic delay: Delay resulting from the pull-up/pull-down resistance and the parasitic
capacitance.
Definition extrinsic delay: Delay resulting from the load capacitance.
5.5 Elmore Delay Model
The Elmore delay model is an appealing tradeoff between the cumbersome detail of the accurate
analog delay model and the simplistic inaccuracy of models that use average interconnect and
load delays.
5.5.1 Elmore Time Constant
Elmore time constants are used to analyze interconnect and load delay with intermediate
connections and/or fanout.
Original equation:
    V_o = V_DD e^( -t / (R_pd (C_p + C_out)) )

0.35 V_DD trip point:
    0.35 V_DD = V_DD e^( -T_PD / (R_pd (C_p + C_out)) )

Introduce the Elmore delay constant:
    0.35 V_DD = V_DD e^( -T_PD / tau_Di )
V_i(t): the voltage on node i (capacitor i) at time t
    V_i(t) = e^( -t / tau_Di )

tau_Di: Elmore time constant for node i
    tau_Di = sum_{k=1}^{n} ER_{k,i} C_k        (n is the number of nodes in the circuit)

ER_{k,i} = the resistance along the path from node i to the source-ground node that
is also on the path from node k to the source-ground node (source ground is the
ground node below the pull-down resistor of the source).

If we:
  approximate V_i(t) as an exponential waveform, and
  use 0.35/0.65 trip points,
then the delay from the source to node i is tau_Di seconds.
5.5.2 Interconnect with Single Fanout
(Figure: gate G1 driving gate G2 through an interconnect of antifuses and wire segments, and the corresponding RC tree with Rpu, Rpd, Cp at the driver and CG2 at the load.)

G*: gate
C*: capacitance on wire
Ra*: resistance through antifuse
Rw*: resistance through wire
Question: Calculate delay from gate 1 to gate 2
Answer:
Gate 2 represents node 4 on the RC tree.
    tau_D4 = sum_{k=1}^{4} ER_{k,4} C_k
           = ER_{1,4} C_1 + ER_{2,4} C_2 + ER_{3,4} C_3 + ER_{4,4} C_4
           = (Ra1 + Rw1 + Ra2 + Rw2 + Ra3 + Rw3 + Ra4) CG2
             + (Ra1 + Rw1 + Ra2 + Rw2 + Ra3 + Rw3) C_3
             + (Ra1 + Rw1 + Ra2 + Rw2) C_2
             + (Ra1 + Rw1) C_1

Approximate Ra ≫ Rw:
           ≈ (Ra1) C_1 + (Ra1 + Ra2) C_2 + (Ra1 + Ra2 + Ra3) C_3 + (Ra1 + Ra2 + Ra3 + Ra4) CG2

Approximate Ra_i = Ra_j = Ra:
           ≈ 4(Ra) CG2 + 3(Ra) C_3 + 2(Ra) C_2 + (Ra) C_1
Question: If you double the number of antifuses and wires needed to connect two
gates, what will be the approximate effect on the wire delay between the gates?
Answer:
    tau_Di = sum_{k=1}^{n} ER_{k,i} C_k

Assume all resistances and capacitances have the same values (R and C), and assume that all
intermediate nodes are along the path between the two gates of interest. Then

    ER_{k,i} = k R
    tau_Di = ( sum_{k=1}^{n} k ) R C

Using the mathematical identity

    sum_{i=1}^{n} i = n(n+1)/2 ≈ n²/2

we simplify the delay equation:

    tau_Di = ( sum_{k=1}^{n} k ) R C ≈ (n²/2) R C

We see that the delay is proportional to the square of the number of antifuses
along the path.
5.5.3 Interconnect with Multiple Gates in Fanout
(Figure: gate G1 driving gates G2 and G3 through an interconnect tree, shown as a schematic and as a layout.)
Question: Assuming that wire resistance is much less than antifuse resistance and
that all antifuses have equal resistance, calculate the delay from the source inverter
(G1) to G2
Answer:
1. There are a total of 7 nodes in the circuit (n = 7).
2. Label the interconnect with resistance and capacitance identifiers.
(Figure: the interconnect labelled with resistances R1-R6 and capacitances C1-C7; G2 and G3 are the fanout gates.)
3. Draw RC tree
(Figure: the RC tree with nodes n1 through n7; G2 is at node 5 and G3 is at node 7.)
4. G2 is node 5 in the circuit (i = 5).
5. Elmore delay equations
    tau_D5 = sum_{k=1}^{7} ER_{k,5} C_k
           = ER_{1,5} C_1 + ER_{2,5} C_2 + ER_{3,5} C_3 + ER_{4,5} C_4 + ER_{5,5} C_5 + ER_{6,5} C_6 + ER_{7,5} C_7
6. Elmore resistances
    ER_{1,5} = R1 = R
    ER_{2,5} = R1 + R2 = 2R
    ER_{3,5} = R1 + R2 = 2R
    ER_{4,5} = R1 + R2 + R3 = 3R
    ER_{5,5} = R1 + R2 + R3 + R4 = 4R
    ER_{6,5} = R1 + R2 = 2R
    ER_{7,5} = R1 + R2 = 2R
7. Plug the resistances into the delay equation
    tau_D5 = (R) C_1 + (2R) C_2 + (2R) C_3 + (3R) C_4 + (4R) C_5 + (2R) C_6 + (2R) C_7
Delay from G1 to G3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Assuming that wire resistance is much less than antifuse resistance and
that all antifuses have equal resistance, calculate the delay from the source inverter
(G1) to G3
Answer:
1. G3 is node 7 in the circuit (i = 7).
2. Elmore delay equations
    tau_Di = sum_{k=1}^{n} ER_{k,i} C_k
    tau_D7 = sum_{k=1}^{7} ER_{k,7} C_k
           = ER_{1,7} C_1 + ER_{2,7} C_2 + ER_{3,7} C_3 + ER_{4,7} C_4 + ER_{5,7} C_5 + ER_{6,7} C_6 + ER_{7,7} C_7
3. Elmore resistances
    ER_{1,7} = R1 = R
    ER_{2,7} = R1 + R2 = 2R
    ER_{3,7} = R1 + R2 = 2R
    ER_{4,7} = R1 + R2 = 2R
    ER_{5,7} = R1 + R2 = 2R
    ER_{6,7} = R1 + R2 + R5 = 3R
    ER_{7,7} = R1 + R2 + R5 + R6 = 4R
4. Plug the resistances into the delay equation
    tau_D7 = (R) C_1 + (2R) C_2 + (2R) C_3 + (2R) C_4 + (2R) C_5 + (3R) C_6 + (4R) C_7
Delay to G2 vs G3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Question: Assuming all wire segments at same level have roughly the same
capacitance, which is greater, the delay to G2 or the delay to G3?
Answer:
1. Equations for the delay to G2 (tau_D5) and to G3 (tau_D7):
    tau_D5 = (R) C_1 + (2R) C_2 + (2R) C_3 + (3R) C_4 + (4R) C_5 + (2R) C_6 + (2R) C_7
    tau_D7 = (R) C_1 + (2R) C_2 + (2R) C_3 + (2R) C_4 + (2R) C_5 + (3R) C_6 + (4R) C_7

2. Difference in delays:
    tau_D5 - tau_D7 = R C_4 + 2R C_5 - R C_6 - 2R C_7

3. Compare capacitances:
    C_4 ≈ C_6
    C_5 ≈ C_7
4. Conclusion: delays are approximately equal.
5.6 Practical Usage of Timing Analysis
Speed Grading
Fabs sort chips according to their speed (sorting is known as speed grading or speed binning).
Faster chips are more expensive.
In FPGAs, sorting is usually based on the propagation delay through an FPGA cell. As wires
become a larger portion of the delay, some analysis of wire delays is also being done.
Propagation delay is the average of the rising and falling propagation delays.
Typical speed grades for FPGAs:
    Std   standard speed grade
    -1    15% faster than Std
    -2    25% faster than Std
    -3    35% faster than Std
Worst-Case Timing
Maximum delay in CMOS occurs when:
    minimum voltage
    maximum temperature
    slow-slow conditions (a process variation/corner which results in a slow p-channel and a
    slow n-channel). We could also have fast-fast, slow-fast, and fast-slow process corners.
Increasing temperature increases delay:
    ↑ temperature ⇒ ↑ resistivity: as temperature goes up, atoms vibrate more, collide more
    often with the current-carrying electrons, and so increase delay.
Increasing supply voltage decreases delay:
    ↑ supply voltage ⇒ ↑ current ⇒ ↓ load-capacitor charge time ⇒ ↓ total delay.
A derating factor is a number used to adjust a timing number to account for voltage and
temperature conditions.
ASIC manufacturers define classes based on a variety of environments:
                 VDD        TA (ambient temp)   TC (case temp)
    Commercial   5V ± 5%    0 to +70°C
    Industrial   5V ± 10%   -40 to +85°C
    Military     5V ± 10%   -55 to +125°C
What is important is the transistor temperature inside the chip, TJ (junction temperature).
5.6.1 Speed Binning
Speed binning is the process of testing each manufactured part to determine the maximum clock
speed at which it will run reliably.
Manufacturers sell chips off of the same manufacturing line at different prices based on how fast
they will run.
A speed bin is the clock speed that chips will be labeled with when sold.
Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your
software crashes more frequently than your over-stressed hardware will).
5.6.1.1 FPGAs, Interconnect, and Synthesis
On FPGAs, 40-60% of the clock cycle is consumed by interconnect.
When synthesizing, increasing the effort (number of iterations) of place and route can significantly
reduce the clock period on large designs.
5.6.2 Worst Case Timing
5.6.2.1 Fanout delay
In Smith's book, Table 5.2 (fanout delay) combines two separate parameters, capacitive load delay
and interconnect delay, into a single parameter (fanout). This is common, and fine.
But when reading a table such as this, you need to know whether the fanout delay is combining both
capacitive load delay and interconnect delay, or is just capacitive load.
5.6.2.2 Derating Factors
Delays are dependent upon supply voltage and temperature:
    ↑ temperature ⇒ ↑ delay
    ↑ supply voltage ⇒ ↓ delay
Temperature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
↑ temperature ⇒ ↑ delay, because ↑ temperature ⇒ ↑ resistivity of wires: as the temperature goes up,
atoms vibrate more, and so have a greater probability of colliding with the electrons flowing with
the current.
↑ supply voltage ⇒ ↓ delay: ↑ supply voltage ⇒ ↑ current (V = IR), and
↑ current ⇒ ↓ time to charge the load capacitors to the threshold voltage.
Derating Factor Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A derating factor is a number to adjust timing numbers to account for different temperature and
voltage conditions.
Excerpt from Table 5.3 in Smith's book (Actel ACT 3 derating factors):
    Derating factor   Temp     Vdd
    1.17              125°C    4.5V
    1.00              70°C     5.0V
    0.63              -55°C    5.5V
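A small illustration of applying a derating factor; the factors come from the excerpt above, while the nominal delay value is made up:

    # Derating factors indexed by (temperature in °C, Vdd in volts),
    # from the Smith Table 5.3 excerpt above.
    DERATING = {
        (125, 4.5): 1.17,
        (70,  5.0): 1.00,
        (-55, 5.5): 0.63,
    }

    nominal_delay_ns = 3.0                              # hypothetical nominal delay
    worst_case = nominal_delay_ns * DERATING[(125, 4.5)]
    print(f"worst-case delay = {worst_case:.2f} ns")    # 3.51 ns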
5.7 Timing Analysis Problems
P5.1 Terminology
For each of the terms clock skew, clock period, setup time, hold time, and clock-to-Q, answer
which time periods (one or more of t1-t9, or NONE) are examples of the term.
NOTES:
1. The timing diagram shows the limits of the allowed times (either minimum or maximum).
2. All timing parameters are non-negative.
3. The signal a is the input to a rising-edge flop and b is the output. The clock is clk1.
(Timing diagram: clocks clk1 and clk2 and signals a and b, with intervals t1 through t11 marked; the legend distinguishes "signal may change" from "signal is stable".)
clock skew:
clock period:
setup time:
hold time:
P5.2 Hold Time Violations
P5.2.1 Cause
What is the cause of a hold time violation?
P5.2.2 Behaviour
What is the bad behaviour that results if a hold time violation occurs?
P5.2.3 Rectification
If a circuit has a hold time violation, how would you correct the problem with minimal effort?
P5.3 Latch Analysis
Does the circuit below behave like a latch? If not, explain why not. If so, calculate the clock-to-Q,
setup, and hold times; and answer whether it is active-high or active-low.
Gate delays: AND = 4, OR = 2, NOT = 1

[Circuit schematic with inputs en and d and output q.]
P5.4 Critical Path and False Path
Find the critical path through the following circuit:
[Circuit schematic with primary inputs a, b, c, d, and e and internal signals f through m.]
P5.5 Critical Path
[Circuit schematic with primary inputs a through e and internal signals f through m.]

Gate delays: NOT = 2, AND = 4, OR = 4, XOR = 6
Assume all delay and timing factors other than combinational logic delay are negligible.
P5.5.1 Longest Path
List the signals in the longest path through this circuit.
P5.5.2 Delay
What is the combinational delay along the longest path?
P5.5.3 Missing Factors
What factors that affect the maximum clock speed does your analysis for parts 1 and 2 not take
into account?
P5.5.4 Critical Path or False Path?
Is the longest path that you found a real critical path, or a false path? If it is a false path, find the
real critical path. If it is a critical path, find a set of assignments to the primary inputs that
exercises the critical path.
P5.6 Timing Models
In your next job, you have been told to use a fanout timing model, which states that the delay
through a gate increases linearly with the number of gates in the immediate fanout. You dimly
recall that a long time ago you learned about a timing model named Elmo, Elmwood, Elmore,
El-Morre, or something like that.
For the circuit shown below as a schematic and as a layout, answer whether the fanout timing
model closely matches the delay values predicted by the Elmore delay model.
[Figure: the circuit of gates G1 to G5 shown as a schematic and as a layout, with two levels of interconnect. The accompanying table lists the gate capacitance Cg, the interconnect capacitances Cx and Cy, and the antifuse resistance R; the numeric values are not reproduced here.]
Assumptions:
The capacitance of a node on a wire is independent of where the node is located on the wire.
P5.7 Short Answer
P5.7.1 Wires in FPGAs
In an FPGA today, what percentage of the clock period is typically consumed by wire delay?
P5.7.2 Age and Time
If you were to compare a typical digital circuit from 5 years ago with a typical digital circuit
today, would you find that the percentage of the total clock period consumed by capacitive load
has increased, stayed the same, or decreased?
P5.7.3 Temperature and Delay
As temperature increases, does the delay through a typical combinational circuit increase, stay
the same, or decrease?
P5.8 Worst Case Conditions and Derating Factor
Assume that we have a Std speed grade Actel A1415 (an ACT 3 part) Logic Module that drives
4 other Logic Modules:
P5.8.1 Worst-Case Commercial
Estimate the delay under worst-case commercial conditions (assume that the junction temperature
is the same as the ambient temperature)
P5.8.2 Worst-Case Industrial
Find the derating factor for worst-case industrial conditions and calculate the delay (assume that
the junction temperature is the same as the ambient temperature).
P5.8.3 Worst-Case Industrial, Non-Ambient Junction Temperature
Estimate the delay under the worst-case industrial conditions (assuming that the junction
temperature is 105C).
Chapter 6
Power Analysis and Power-Aware Design
6.1 Overview
6.1.1 Importance of Power and Energy
Laptops, PDAs, cell phones, etc.: obvious!
For microprocessors in personal computers, every watt above 40W adds $1 to manufacturing cost
Approx 25% of the operating expense of a server farm goes to energy bills
(Dis)comfort of the Unix labs in E2
Sandia Labs had to build a special sub-station when they took delivery of the Teraflops massively
parallel supercomputer (over 9000 Pentium Pros)
High-speed microprocessors today can run so hot that they will damage themselves: Athlon
reliability problems, Pentium 4 processor thermal throttling
In 2000, information technology consumed 8% of the total power in the US.
Future power viruses: cell phone viruses that cause the cell phone to run in full-power mode and
consume the battery very quickly; PC viruses that cause the CPU to melt down
6.1.2 Industrial Names and Products
All of the articles and papers below are linked to from the Documentation page on the E&CE 427
web site.
Overview white paper by Intel:
PC Energy-Efficiency Trends and Technologies: an 8-page overview of energy and power trends,
written in 2002. Available from the web at an intolerably long URL.
AMD's Athlon PowerNow!
Reduce power consumption in laptops when running on battery by allowing software to
reduce clock speed and supply voltage when performance is less important than battery life.
Intel Speedstep
Reduce power consumption in laptops when running on battery by reducing clock speed to
70-80% of normal.
Intel X-Scale
An ARM5-compatible microprocessor for low-power systems:
http://developer.intel.com/design/intelxscale/
Synopsys PowerMill
A simulator that estimates power consumption of the circuit as it is simulated:
http://www.synopsys.com/products/etg/powermill ds.html
DEC / Compaq / HP Itsy: a tiny but powerful PDA-style computer running Linux and
X-windows. Itsy was created in 1998 by DEC's Western Research Laboratory to be an
experimental platform in low-power, energy-efficient computing. Itsy led to the iPAQ
PocketPC.
www.hpl.hp.com/techreports/Compaq-DEC/WRL-2000-6.html
www.hpl.hp.com/research/papers/2003/handheld.html
Satellites: satellites run on solar power and batteries. They travel great distances doing very
little, then have a brief period of very intense activity as they pass by an astronomical object of
interest. Satellites need efficient means to gather and store energy while they are flying
through space. Satellites need powerful, but energy-efficient, computing and
communication devices to gather, process, and transmit data. Designing computing devices
for satellites is an active area of research and business.
6.1.3 Power vs Energy
Most people talk about power reduction, but sometimes they mean power and sometimes
energy.
Power minimization is usually about heat removal
Energy minimization is usually about battery life or energy costs
Type     Units    Equivalent      Equations
Energy   Joules   Work            = Volts × Coulombs = (1/2) × C × Volts^2
Power    Watts    Energy / Time   = Volts × I = Joules/sec
6.1.4 Batteries, Power and Energy
6.1.4.1 Do Batteries Store Energy or Power?
Energy = Volts × Coulombs
Power = Energy / Time
Batteries are rated in Amp-hours at a voltage.

battery = Amps × Seconds × Volts
        = (Coulombs / Seconds) × Seconds × Volts
        = Coulombs × Volts
        = Energy

Batteries store energy.
6.1.4.2 Battery Life and Efficiency
To extend battery life, we want to increase the amount of work done and/or decrease the energy
consumed.
Work and energy have the same units; therefore, to extend battery life, we truly want to improve
efficiency.
Power efficiency of microprocessors is normally measured in MIPS/Watt. Is this a real measure of
efficiency?

MIPs / Watts = (millions of instructions / Seconds) × (Seconds / Energy)
             = millions of instructions / Energy

Both instructions executed and energy are measures of work, so MIPs/Watt is a measure of
efficiency.
(This assumes that all instructions perform the same amount of work!)
6.1.4.3 Battery Life and Power
Question: Running a VHDL simulation requires executing an average of 1 million
instructions per simulation step. My computer runs at 700MHz, has a CPI of 1.0, and
burns 70W of power. My battery is rated at 10V and 2.5AH. Assuming all of my
computers clock cycles go towards running VHDL simulations, how many
simulation steps can I run on one battery charge?
Question: If I use the SpeedStep feature of my computer, my computer runs at
600MHz with 60W of power. With SpeedStep activated, how much longer can I keep the
computer running on one battery?
Question: With SpeedStep activated, how many more simulation steps can I run on
one battery?
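The Python sketch below is not from the notes; it shows one way to set up the arithmetic for these questions, assuming CPI = 1, that every clock cycle goes toward simulation work, and that battery energy = volts × amp-hours × 3600.

def battery_energy_joules(volts: float, amp_hours: float) -> float:
    # Amp-hours * 3600 gives coulombs; coulombs * volts gives joules.
    return volts * amp_hours * 3600.0

def sim_steps_per_charge(freq_hz: float, power_w: float,
                         volts: float, amp_hours: float,
                         instrs_per_step: float = 1e6, cpi: float = 1.0) -> float:
    seconds = battery_energy_joules(volts, amp_hours) / power_w
    instructions = (freq_hz / cpi) * seconds
    return instructions / instrs_per_step

normal    = sim_steps_per_charge(700e6, 70.0, 10.0, 2.5)
speedstep = sim_steps_per_charge(600e6, 60.0, 10.0, 2.5)
print(normal, speedstep)   # both are 9.0e5 steps: SpeedStep extends the running
                           # time, but not the number of steps per battery charge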
6.2 Power Equations
Power = SwitchPower + ShortPower + LeakagePower
        (DynamicPower = SwitchPower + ShortPower;  StaticPower = LeakagePower)

Dynamic Power: dependent upon clock speed
  Switching Power: useful; charges up transistors
  Short-Circuit Power: not useful; both N and P transistors are on
Static Power: independent of clock speed
  Leakage Power: not useful; leaks around the transistor
Dynamic power is proportional to how often signals change their value (switch).
Roughly 20% of signals switch during a clock cycle.
Need to take glitches into account when calculating activity factor. Glitches increase the
activity factor.
Equations for dynamic power contain clock speed and activity factor.
6.2.1 Switching Power
[Figures: an inverter output charging its load capacitance CapLoad on a 0→1 transition, and discharging it on a 1→0 transition.]
energy to (dis)charge a capacitor = (1/2) × CapLoad × VoltSup^2

When a capacitor C is charged to a voltage V, the energy stored in the capacitor is (1/2) × C × V^2.
The energy required to charge the capacitor from 0 to V is C × V^2. Half of this energy ((1/2) × C × V^2) is
dissipated as heat through the pullup resistance. Half of the energy is transferred to the capacitor.
When the capacitor discharges from V to 0, the energy stored in the capacitor ((1/2) × C × V^2) is
dissipated as heat through the pulldown resistance.

f: frequency at which the inverter goes through a complete charge-discharge cycle (eqn 15.4 in Smith)

average switching power = f × CapLoad × VoltSup^2

ClockSpeed: clock speed
ActFact: average number of times that a signal switches from 0→1 or from 1→0 during a clock cycle

average switching power = (1/2) × ActFact × ClockSpeed × CapLoad × VoltSup^2
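A minimal Python sketch (not from the notes) of the per-signal switching-power equation above, using the course's variable names; the numbers in the example call are made up.

def switching_power(act_fact: float, clock_speed_hz: float,
                    cap_load_f: float, volt_sup: float) -> float:
    """PwrSw = 1/2 * ActFact * ClockSpeed * CapLoad * VoltSup^2  (watts)."""
    return 0.5 * act_fact * clock_speed_hz * cap_load_f * volt_sup ** 2

# e.g. a signal with a 20% activity factor, 100 MHz clock, 100 fF load, 1.8 V supply:
print(switching_power(0.2, 100e6, 100e-15, 1.8))   # ~3.2e-6 W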
6.2.2 Short-Circuited Power
[Figure: a CMOS inverter with input Vi and output Vo. While the gate voltage is between VoltThresh and VoltSup - VoltThresh, both the P and N transistors are on, and a short-circuit current IShort flows from VoltSup to GND for a time TimeShort.]

PwrShort = ActFact × ClockSpeed × TimeShort × IShort × VoltSup
6.2.3 Leakage Power
[Figures: cross-section of an inverter (input Vi, output Vo) showing the parasitic diode formed by the diffusion regions and substrate, and the diode I-V curve showing the leakage current ILeak.]

PwrLk = ILeak × VoltSup

ILeak ∝ e^( -q × VoltThresh / (k × T) )
6.2.4 Glossary
ClockSpeed (aka f): clock speed

ActFact (aka A): activity factor
  = NumTransitions / (NumSignals × NumClockCycles)
  = per signal: percentage of clock cycles in which the signal changes value
  = per clock cycle: percentage of signals that change value per clock cycle
  Note: when measuring per circuit, this is sometimes approximated by looking only at flops,
  rather than at every single signal.

TimeShort: short-circuit time
  = time that both the N and P transistors are turned on when a signal changes value

MaxClockSpeed (aka f_max): maximum clock speed that an implementation technology can support
  ∝ (VoltSup - VoltThresh)^2 / VoltSup

VoltSup (aka V): supply voltage

VoltThresh (aka V_th): threshold voltage
  = voltage at which P transistors turn on

ILeak (aka I_S, the reverse-bias saturation current): leakage current
  ∝ e^( -q × VoltThresh / (k × T) )

IShort (aka I_short): short-circuit current
  = current that goes through the transistor network while both the N and P transistors are turned on

CapLoad (aka C_L): load capacitance

PwrSw: switching power (dynamic)
  = (1/2) × ActFact × ClockSpeed × CapLoad × VoltSup^2

PwrShort: short-circuit power (dynamic)
  = ActFact × ClockSpeed × TimeShort × IShort × VoltSup

PwrLk: leakage power (static)
  = ILeak × VoltSup

Power: total power
  = PwrSw + PwrShort + PwrLk
q: electron charge = 1.60218 × 10^-19 C
k: Boltzmann's constant = 1.38066 × 10^-23 J/K
T: temperature in Kelvin
6.2.5 Note on Power Equations
The power equation:

Power = DynamicPower + StaticPower
      = PwrSw + PwrShort + PwrLk
      = (ActFact × ClockSpeed × (1/2) × CapLoad × VoltSup^2)
      + (ActFact × ClockSpeed × TimeShort × IShort × VoltSup)
      + (ILeak × VoltSup)

is for an individual signal.
To calculate dynamic power for n signals with different CapLoad, TimeShort, and IShort:

DynamicPower = ( Σ_{i=1..n} ActFact_i × (1/2) × CapLoad_i × ClockSpeed × VoltSup^2 )
             + ( Σ_{i=1..n} ActFact_i × ClockSpeed × TimeShort_i × IShort_i × VoltSup )

If we know the average CapLoad, TimeShort, and IShort for a collection of n signals, then the
above formula simplifies to:

DynamicPower = ( n × ActFact_AVG × (1/2) × CapLoad_AVG × ClockSpeed × VoltSup^2 )
             + ( n × ActFact_AVG × ClockSpeed × TimeShort_AVG × IShort_AVG × VoltSup )

If the capacitances and short-circuit parameters don't have an even distribution, then don't average
them. If high-capacitance signals have high activity factors, then averaging the equations will
result in erroneously low predictions for power.
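The Python sketch below (not from the notes) illustrates this warning with made-up signal data in which the high-capacitance signal also has a high activity factor: the per-signal sum and the averaged formula give noticeably different answers.

# Per-signal sum versus averaged CapLoad/ActFact (switching power only).
signals = [                       # (ActFact, CapLoad) pairs; correlated on purpose
    (0.5, 200e-15),               # high-activity, high-capacitance signal
    (0.05, 20e-15),
    (0.05, 20e-15),
]
CLOCK, VDD = 100e6, 1.8

per_signal = sum(0.5 * a * CLOCK * c * VDD**2 for a, c in signals)

n = len(signals)
a_avg = sum(a for a, _ in signals) / n
c_avg = sum(c for _, c in signals) / n
averaged = n * 0.5 * a_avg * CLOCK * c_avg * VDD**2

print(per_signal, averaged)       # the averaged result is erroneously low because the
                                  # high-capacitance signal also has high activity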
6.3 Overview of Power Reduction Techniques
We can divide power reduction techniques into two classes: analog and digital.

analog
  Parameters to work with:
    capacitance: for example, silicon-on-insulator (SOI)
    resistance: for example, copper wires
    voltage: low-voltage circuits
  Techniques:
    dual-VDD: two different supply voltages: a high voltage for performance-critical
      portions of the design, a low voltage for the remainder of the circuit. Alternatively, can vary the
      voltage over time: high voltage when running performance-critical software and
      low voltage when running software that is less sensitive to performance.
    dual-Vt: two different threshold voltages: transistors with a low threshold voltage for
      performance-critical portions of the design (can switch more quickly, but more
      leakage power), transistors with a high threshold voltage for the remainder of the circuit
      (switch more slowly, but reduce leakage power).
    exotic circuits: special flops, latches, and combinational circuitry that run at a high
      frequency while minimizing power
    adiabatic circuits: special circuitry that consumes power on 0→1 transitions, but
      not 1→0 transitions. These sacrifice performance for reduced power.
    clock trees: up to 30% of total power can be consumed in clock generation and the
      clock tree

digital
  Parameters to work with:
    capacitance (number of gates)
    activity factor
    clock frequency
  Techniques:
    multiple clocks: put a high-speed clock in performance-critical parts of the design and a
      low-speed clock in the remainder of the circuit
    clock gating: turn off the clock to portions of a chip when they are not being used
    data encoding: Gray coding vs one-hot vs fully encoded vs ...
    glitch reduction: adjust circuit delays or add redundant circuitry to reduce or
      eliminate glitches
    asynchronous circuits: get rid of clocks altogether....

Additional low-power design techniques for RTL from a Qualis engineer:
http://home.europa.com/~celiac/lowpower.html
6.4 Voltage Reduction for Power Reduction
If our goal is to reduce power, the most promising approach is to reduce the supply voltage,
because, from:

Power = (ActFact × ClockSpeed × (1/2) × CapLoad × VoltSup^2)
      + (ActFact × ClockSpeed × TimeShort × IShort × VoltSup)
      + (ILeak × VoltSup)

we observe:

Power ∝ VoltSup^2
Reducing Difference Between Supply and Threshold Voltage
As the supply voltage decreases, it takes longer to charge up the capacitive load, which increases
the load delay of a circuit.
In the chapter on timing analysis, we saw that increasing the supply voltage will decrease the
delay through a circuit. (From V = IR, increasing V causes an increase in I, which causes the
capacitive load to charge more quickly.) However, it is more accurate to take into account both
the value of the supply voltage, and the difference between the supply voltage and the threshold
voltage.
MaxClockSpeed ∝ (VoltSup - VoltThresh)^2 / VoltSup
Question: If the delay along the critical path of a circuit is 20 ns, the supply voltage
is 2.8 V, and the threshold voltage is 0.7 V, calculate the critical path delay if the
supply voltage is dropped to 2.2 V.
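The proportionality above gives delay ∝ VoltSup / (VoltSup - VoltThresh)^2, so the question can be answered by scaling the 20 ns delay by the ratio of the two operating points. A Python sketch (not from the notes):

def scaled_delay(delay, v_old, v_new, v_th):
    speed = lambda v: (v - v_th) ** 2 / v        # proportional to max clock speed
    return delay * speed(v_old) / speed(v_new)   # slower speed => longer delay

print(scaled_delay(20e-9, 2.8, 2.2, 0.7))        # ~3.1e-8 s, i.e. roughly 31 ns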
Reducing Threshold Voltage Increases Leakage Current
If we reduce the supply voltage, we want to also reduce the threshold voltage, so that we do not
increase the delay through the circuit. However, as threshold voltage drops, leakage current
increases:
ILeak ∝ e^( -q × VoltThresh / (k × T) )
And increasing the leakage current increases the power:

Power ∝ ILeak

So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on reducing
power) and increasing ILeak (which has a linear effect on increasing power).
6.5 Data Encoding for Power Reduction
6.5.1 How Data Encoding Can Reduce Power
Data encoding is a technique that chooses data values so that normal execution will have a low
activity factor.
The most common example is Gray coding where exactly one bit changes value each clock
cycle when counting.
Decimal Gray Binary
0 0000 0000
1 0001 0001
2 0011 0010
3 0010 0011
4 0110 0100
5 0111 0101
6 0101 0110
7 0100 0111
8 1100 1000
9 1101 1001
10 1111 1010
11 1110 1011
12 1010 1100
13 1011 1101
14 1001 1110
15 1000 1111
Question: For an eight-bit counter, how much more power will a binary counter
consume than a gray-code counter?
Question: For completely random eight-bit data, how much more power will a binary
circuit consume than a gray-code circuit?
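A Python sketch (not from the notes) for estimating the two ratios: it counts bit transitions over one full period of an n-bit counter for the binary and Gray encodings; the random-data case is summarized in the final comment.

def transitions(seq):
    # Hamming distance between consecutive values, including the wrap-around step.
    return sum(bin(a ^ b).count("1") for a, b in zip(seq, seq[1:] + seq[:1]))

n = 8
binary = list(range(2 ** n))
gray   = [i ^ (i >> 1) for i in range(2 ** n)]

print(transitions(binary) / transitions(gray))   # ~2.0: the binary counter switches about twice as often
# For uniformly random 8-bit data, each bit changes with probability 1/2 in either
# encoding, so the expected activity is about n/2 = 4 transitions per cycle for both.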
6.5.2 Example Problem: Sixteen Pulser
6.5.2.1 Problem Statement
Your task is to do the power analysis for a circuit that should send out a one-clock-cycle pulse on
the done signal once every 16 clock cycles. (That is, done is 0 for 15 clock cycles, then 1 for
one cycle, then repeat with 15 cycles of 0 followed by a 1, etc.)
[Waveform "Required behaviour": clk and done, with done high for one clock cycle at cycles 16, 32, ... and low otherwise.]
You have been asked to consider three different types of counters: a binary counter, a Gray-code
counter, and a one-hot counter. (The table below shows the values from 0 to 15 for the different
encodings.)
Question: What is the relative amount of power consumption for the different
options?
6.5.2.2 Additional Information
Your implementation technology is an FPGA where each cell has a programmable combinational
circuit (a PLA) and a flip-flop. The combinational circuit has 4 inputs and 1 output. The capacitive load of
the combinational circuit is twice that of the flip-flop.

[Figure: an FPGA cell containing a PLA and a flip-flop.]
1. You may neglect power associated with clocks.
2. You may assume that all counters:
(a) are implemented on the same fabrication process
(b) run at the same clock speed
(c) have negligible leakage and short-circuit currents
6.5.2.3 Answer
Outline of Thinking
The factors to consider that distinguish the options are capacitance and activity factor.
Capacitance is dependent upon the number of signals, and on whether a signal is combinational or a
flop.
Sketch out the circuitry to evaluate capacitance.
Sketch the Circuitry
Name the output done and the count digits d().

[Block diagrams: the Gray and Binary counters use four PLA + flip-flop cells for d(0) to d(3) plus one PLA for done; the One-Hot counter uses sixteen flip-flops d(0) to d(15), with done taken directly from one of the flops.]
Observation:
The Gray and Binary counters have the same design, and the Gray counter will have
the lower activity factor. Therefore, the Gray counter will have lower power than the
Binary counter.
However, we don't know how much lower the power of the Gray counter will be, and
we don't know how much power the One-Hot counter will consume.
Capacitance
                       cap   number   subtotal cap
Gray    d()    PLAs     2       4          8
               Flops    1       4          4
        done   PLAs     2       1          2
               Flops    1       0          0
1-Hot   d()    PLAs     2       0          0
               Flops    1      16         16
        done   PLAs     2       0          0
               Flops    1       0          0
Binary  d()    PLAs     2       4          8
               Flops    1       4          4
        done   PLAs     2       1          2
               Flops    1       0          0
Activity Factors

[Waveforms over 16 clock cycles, with per-signal transition counts:
Gray coding:    d(0) = 8/16, d(1) = 4/16, d(2) = 2/16, d(3) = 2/16, done = 2/16
One-hot coding: each d(i) = 2/16, done = 2/16
Binary coding:  d(0) = 16/16, d(1) = 8/16, d(2) = 4/16, d(3) = 2/16, done = 2/16]
                       act fact
Gray    d()    PLAs    1/4 of the signals change in each clock cycle
               Flops   1/4 of the signals change in each clock cycle
        done   PLAs    2 transitions / 16 clock cycles
               Flops   --
1-Hot   d()    PLAs    --
               Flops   2 transitions / 16 clock cycles
        done   PLAs    --
               Flops   --
Binary  d()    PLAs    (16 + 8 + 4 + 2 transitions) / (4 signals × 16 clock cycles) = 0.47
               Flops   (16 + 8 + 4 + 2 transitions) / (4 signals × 16 clock cycles) = 0.47
        done   PLAs    2 transitions / 16 clock cycles
               Flops   --

Note: activity factor for the One-Hot counter. Because all signals have the same
capacitance, and all clock cycles have the same number of transitions for the
One-Hot counter, we could have calculated the activity factor as two transitions per
sixteen signals.
Putting it all Together

                       subtotal cap   act fact   power
Gray    d()    PLAs        8            1/4       2
               Flops       4            1/4       1
        done   PLAs        2            2/16      4/16
               Flops       0                      0
        Total                                     3.25
1-Hot   d()    PLAs        0                      0
               Flops      16            2/16      2
        done   PLAs        0                      0
               Flops       0                      0
        Total                                     2
Binary  d()    PLAs        8            0.47      3.76
               Flops       4            0.47      1.88
        done   PLAs        2            2/16      0.25
               Flops       0                      0
        Total                                     5.87

If we choose Binary counting as the baseline, then the relative amounts of power are:
  Gray     54%
  One-Hot  35%
  Binary   100%
If we choose One-Hot counting as the baseline, then the relative amounts of power are:
  Gray     156%
  One-Hot  100%
  Binary   288%
6.6 Clock Gating
The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn't
needed. This reduces the activity factor.
6.6.1 Introduction to Clock Gating
Examples of Clock Gating

Condition                                            Circuitry turned off
O/S in standby mode                                  Everything except core state (PC, registers, caches, etc)
No floating point instructions for k clock cycles    Floating point circuitry
Instruction cache miss                               Instruction decode circuitry
No instruction in pipe stage i                       Pipe stage i
Design Tradeoffs
+ Can significantly reduce the activity factor (Synopsys PowerCompiler claims that it can cut power
  to 50-80% of the ungated level)
- Increases design complexity
    design effort
    bugs!
- Increases area
- Increases clock skew

Functional Validation and Clock Gating
It's a functional bug to turn a clock off when it's needed for valid data.
It's functionally OK, but wasteful, to turn a clock on when it's not needed.
(About 5% of the bugs caught on Willamette (Intel Pentium 4 Processor) were related to clock
gating.) Nicolas Mokhoff, EE Times, June 27, 2001.
http://www.edtn.com/story/OEG20010621S0080
6.6.2 Implementing Clock Gating
Clock gating is implemented by adding a component that disables the clock when the circuit isn't
needed.
[Block diagrams. Without clock gating: a module with inputs i_data and i_valid and outputs o_data and o_valid, clocked directly by clk. With clock gating: a Clock Enable State Machine, driven by clk and i_wakeup, produces clk_en; clk_en gates clk to produce cool_clk, which clocks the module.]
The total power of a circuit with clock gating is the sum of the power of the main circuit with a
reduced activity factor and the power of the clock gating state machine with its activity factor.
The clock-gating state machine must always be on, so that it will detect the wakeup signal. Do
not make the mistake of gating the clock to your clock gating circuit!
6.6.3 Design Process
Design Decisions
What level of granularity for gated clocks?
entire module?
individual pipe stages?
something in between?
When should the clocks turn off?
When should the clocks turn on?
Protocol for incoming wakeup signal?
Protocol for outgoing wakeup signal?
Wakeup Protocol
Designers negotiate incoming and outgoing wakeup protocol with environment.
An example wakeup protocol:
wakeup in will arrive 1 clock cycle before valid data
wakeup in will stay high until have at least 3 cycles of invalid data
Design Strategy
When designing clock gating circuitry, consider the two extreme cases:
  a constant stream of valid data
  the circuit is turned off and receives a single parcel of valid data
For a constant stream of valid data, the key is to not incur a large overhead in design complexity,
area, or clock period when clocks will always be toggling.
For a single parcel of valid data, the key is to make sure that the clocks are toggling so that data
can percolate through circuit. Also, we want to turn off the clock as soon as possible after data
leaves.
6.6.4 Effectiveness of Clock Gating
We can measure the effectiveness of clock gating by comparing the percentage of clock cycles
when the clock is not toggling to the percentage of clock cycles that the circuit does not have
valid data (i.e. the clock does not need to toggle).
The most ineffective clock gating scheme is to never turn off the clock (let the clock always
toggle). The most effective clock gating scheme is to turn off the clock whenever the circuit is not
processing valid data.
Parameters to characterize effectiveness of clock gating:
Eff = effectiveness of clock gating
PctValid = percentage of clock cycles with valid data in the circuit (when the clock must be toggling)
PctClk = percentage of clock cycles in which the clock toggles

Effectiveness measures the percentage of clock cycles with invalid data in which the clock is
turned off. Equation for effectiveness of clock gating:

Eff = PctClkOff / PctInvalid = (1 - PctClk) / (1 - PctValid)
Question: What is the effectiveness if the clock toggles only when there is valid data?
Answer:
PctClk = PctValid and the effectiveness should be 1:

Eff = (1 - PctClk) / (1 - PctValid) = (1 - PctValid) / (1 - PctValid) = 1
Question: What is the effectiveness of a clock that always toggles?
Answer:
If the clock is always toggling, then PctClk = 100% and the effectiveness
should be 0.

Eff = (1 - PctClk) / (1 - PctValid) = (1 - 1) / (1 - PctValid) = 0
Question: What does it mean for a clock gating scheme to be 75% effective?
Answer:
75% of the time that there is invalid data, the clock is off.
Question: What happens if PctClk < PctValid?
Answer:
If PctClk < PctValid, then:

1 - PctClk > 1 - PctValid

so the effectiveness will be greater than 100%.
In some sense, it makes sense that the answer would be nonsense, because
a clock gating scheme that is more than 100% effective is too effective: it is
turning off the clock sometimes when it shouldn't!
We can see the effect of the effectiveness of a clock-gating scheme on the activity factor:

[Graph: the new activity factor A' versus effectiveness Eff, decreasing linearly from A at Eff = 0 to PctValid × A at Eff = 1.]

When the effectiveness is zero, the new activity factor is the same as the original activity factor.
For a 100% effective clock gating scheme, the activity factor is A × PctValid. Between 0% and
100% effectiveness, the activity factor decreases linearly.
The new activity factor with a clock gating scheme is:

A' = A - (1 - PctValid) × Eff × A
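A small Python sketch of this relationship (not from the notes); the activity factor and PctValid values in the example calls are made up.

def gated_act_fact(a: float, pct_valid: float, eff: float) -> float:
    """A' = A - (1 - PctValid) * Eff * A."""
    return a - (1.0 - pct_valid) * eff * a

print(gated_act_fact(0.2, 0.7, 0.0))   # 0.2  (0% effective: activity factor unchanged)
print(gated_act_fact(0.2, 0.7, 1.0))   # 0.14 (100% effective: PctValid * A)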
6.6.5 Example: Reduced Activity Factor with Clock Gating
Question: How much power will be saved in the following clock-gating scheme?
70% of the time the main circuit has valid data
clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock
is off)
clock gating circuit has 10% of the area of the main circuit
clock gating circuit has same activity factor as main circuit
neglect short-circuiting and leakage power
Answer:
The new power consumption is 83% of original power
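The Python sketch below (not from the notes) reproduces the arithmetic of this example, treating power as proportional to capacitance × activity factor (dynamic power only, same clock and supply for every circuit).

C_MAIN, C_FSM = 1.0, 0.1         # clock-gating FSM has 10% of the main circuit's capacitance
A = 1.0                          # original activity factor (arbitrary units)
PCT_VALID, EFF = 0.70, 0.90

a_gated = A - (1 - PCT_VALID) * EFF * A          # 0.73 * A
p_old = C_MAIN * A
p_new = C_MAIN * a_gated + C_FSM * A             # gated main circuit + always-on FSM
print(p_new / p_old)                             # 0.83 -> 83% of the original power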
6.6.6 Clock Gating with Valid-Bit Protocol
A common technique to determine when a circuit has valid data is to use a valid-bit protocol. In
section 6.6.6.1 we review the valid-bit protocol and then in section 6.6.6.3 we add clock-gating
circuitry to a circuit that uses the valid-bit protocol.
6.6.6.1 Valid-Bit Protocol
We need a mechanism to tell the circuit when to pay attention to the data inputs, e.g. when is it supposed
to decode and execute an instruction, or write data to a memory array?

[Schematic and waveforms: a module with inputs i_valid and i_data and outputs o_valid and o_data, clocked by clk.]

i_valid: high when i_data has valid data; signifies whether the circuit should pay attention to
or ignore the data.
o_valid: high when o_data has valid data; signifies whether the environment should
pay attention to the output of the circuit.
For more on circuit protocols, see section 2.8.
Microscopic Analysis
Which clock edges are needed?

[Waveforms: clk, i_valid, and o_valid, with the needed clock edges marked.]
6.6.6.2 How Many Clock Cycles for Module?
Given a module with latency Lat , if the module receives a stream of NumPcls consecutive valid
parcels, how many clock cycles must the clock-enable signal be asserted?
t_i1     time of the first i_valid
t_o1     time of the first o_valid
t_ik     time of the last i_valid
t_ok     time of the last o_valid
t_first  first clock cycle with the clock enabled
t_last   last clock cycle with the clock enabled

Initial equations to describe the relationships between the different points in time:

t_o1    = t_i1 + Lat
t_ok    = t_o1 + NumPcls - 1
t_first ≤ t_i1 + 1
t_last  ≥ t_ok + 1

To understand the -1 in the equation for t_ok, examine the situation when NumPcls = 1. With just
one parcel going through the system, t_o1 = t_ok, so we have: t_ok = t_o1 + 1 - 1.
In the equation for t_last, we need the +1 to clear the last valid bit.
Solve for the length of time that the clock must be enabled. The +1 at the end of this equation is
because if t_last = t_first, we would have the clock enabled for 1 clock cycle.

ClkEnLen = t_last - t_first + 1
         = t_ok + 1 - (t_i1 + 1) + 1
         = t_ok - t_i1 + 1
         = t_o1 + NumPcls - 1 - t_i1 + 1
         = t_o1 + NumPcls - t_i1
         = t_i1 + Lat + NumPcls - t_i1
         = Lat + NumPcls
We are left with the formula that the number of clock cycles for which the module's clock must be
enabled is the latency through the module plus the number of consecutive parcels.
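As a sanity check (not part of the notes), the Python sketch below models the module as a shift register of valid bits with latency Lat, feeds it NumPcls consecutive valid parcels, and counts the cycles during which the clock must toggle to take in or flush parcels; the count matches Lat + NumPcls.

def clock_enable_cycles(lat: int, num_pcls: int) -> int:
    pipe = [0] * lat                             # valid bits currently inside the module
    enabled = 0
    for cycle in range(num_pcls + lat + 5):      # a few extra idle cycles at the end
        i_valid = 1 if cycle < num_pcls else 0
        if i_valid or any(pipe):                 # clock must toggle to take in or flush parcels
            pipe = [i_valid] + pipe[:-1]
            enabled += 1
    return enabled

print(clock_enable_cycles(5, 3))     # 8  == Lat + NumPcls
print(clock_enable_cycles(10, 80))   # 90 == Lat + NumPcls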
6.6.6.3 Adding Clock-Gating Circuitry
Before Clock Gating

[Schematic and waveforms before clock gating: a module with data_in, valid_in, data_out, and valid_out, clocked by clk. The waveform legend marks don't-care and uninitialized values.]
After Clock Gating: Circuitry

[Schematic after clock gating: a Clock Enable State Machine, driven by hot_clk and wakeup_in, produces clk_en; clk_en gates hot_clk to produce cool_clk, which clocks the module (data_in, valid_in, data_out, valid_out, wakeup_out).]

hot_clk: clock that always toggles
cool_clk: gated clock; sometimes toggles, sometimes stays low
wakeup: alerts the circuit that valid data will be arriving soon
clk_en: turns on cool_clk
After Clock Gating: New Signals

[Waveforms after clock gating: hot_clk, data_in, valid_in, wakeup_in, clk_en, cool_clk, data_out, valid_out, and wakeup_out.]
6.6.7 Example: Pipelined Circuit with Clock-Gating
Design a clock enable state machine for the pipelined component described below.
  capacitance of the pipelined component = 200
  latency varies from 5 to 10 clock cycles, with an even distribution of latencies
  contains a maximum of 6 instructions (parcels of data)
  60% of incoming parcels are valid
  average length of a continuous sequence of valid parcels is 80
  use input and output valid bits for wakeup
  leakage current is negligible
  short-circuit current is negligible

Capacitance of building blocks (per bit) for the state machine:
  eq comparator                   2
  increment                       3
  increment / reset               4
  increment / decrement           5
  le, lt, eq comparator           5
  increment / decrement / reset   6
  flip-flop                       4
The two factors affecting power are activity factor and capacitance.
1. Scenario: turned off and get one parcel.
(a) Need to turn on and stay on until parcel departs
(b) idea #1 (parcel count):
count number of parcels inside module
keep clocks toggling if have non-zero parcels.
(c) idea #2 (cycle count):
count number of clock cycles since last valid parcel entered module
once hit 10 clock cycles without any valid parcels entering, know that all parcels
have exited.
keep clocks toggling if counter is less than 10
2. Scenario: constant stream of parcels
(a) parcel count would require looking at input and output stream and conditionally
incrementing or decrementing counter
(b) cycle count would keep resetting counter
Waveforms

[Waveforms over clock cycles 1 to 24 comparing the two schemes: i_valid, o_valid, parcel_count, and parcel_clk_en for the parcel-count scheme; i_valid, o_valid, cycle_count (counting up to 10 and resetting on valid input), and cycle_clk_en for the cycle-count scheme.]
Outline:
1. sketch out circuitry for parcel count and cycle count state machine
2. estimate capacitance of each state machine
3. estimate activity factor of main circuit, based on behaviour
Parcel Count Design
Need to count (0..6) parcels, therefore need 3 bits for counter.
Counter must be able to increment and decrement.
Equations for counter action (increment/decrement/no-change):
i_valid   o_valid   action
0         0         no change
0         1         decrement
1         0         increment
1         1         no change
To keep the clock enabled for an additional clock cycle to clear the valid bit, add an extra flop to hold a
delayed version of o_valid. Use this delayed o_valid to decrement the counter.
In addition to the increment/decrement counter, we need an equality test of parcel_count against zero,
so that we know whether the clock should be on or off.
Each bit of the counter needs:
component   cap
flip-flop    4
inc/dec      5
eq-comp      2
total       11

Total capacitance is 3 × 11 + 4 = 37.
Cycle Count Design
Latency is 10 clock cycles. We need to keep the clock enabled for 11 clock cycles, so that we can
clear the valid bit. We will count from 0 to 11; when the counter reaches 11, we saturate the
counter and turn off the clock. Need to count (0..11), therefore need 4 bits for counter.
Counter must be able to increment, saturate, and reset.
i_valid   saturated   action
0         0           increment
0         1           no change
1         0           reset
1         1           reset
Use an equality comparator to detect when saturated. Clock is enabled whenever the counter is
not saturated, so can use a single comparator for both detecting saturation and enabling the clock.
Each bit of the counter needs:
component   cap
flip-flop    4
inc/reset    4
eq-comp      2
total       10

Total capacitance is 4 × 10 = 40.
Capacitance result:
parcel count : capacitance = 37
cycle count : capacitance = 40
Behavioural Analysis
Question: Without further detailed analysis, can we determine which design is the better
option?
Answer:
If a parcel leaves after 5 clock cycles, the cycle count scheme will continue to power the circuit
for another 5 cycles (wasting power!). So, parcel count wins on both
capacitance and activity factor.
If we needed only to determine which option was better, we could stop now.
This analysis has approximated that the activity factors for the clock enabling
circuit will be the same for both options. For these state machines and
implementation technology, estimating the activity factor would be very
complicated.
Question: Which design option has lower power and how much lower is it?
Answer:
Goal: determine what percentage of the time cool_clk is toggling for each of
the two design options.
1. Assume that all three of the circuits in question (main circuit without
clock gating, and the two clock enable state machines) have the same
activity factor.
2. Construct average waveform for cool clock.
(a) 60% of incoming data are valid
(b) average length of valid data is 80 instructions
(c) length of window for average data is:
WindowLength = ValidLength / PctValid = 80 / 0.6 = 133 cycles

[Sketch: an average window of 133 clock cycles containing 80 valid parcels.]
3. Calculate percentage of clock cycles that parcel count circuit is powered.
(a) Clock will run for:
      80 clock cycles
    + average latency - 1
    + 1 cycle to clear out the last parcel
    The first clock cycle of the last parcel is counted in the 80 clock
    cycles; hence, we take the average latency - 1, rather than the
    average latency.
    The last clock cycle clears out the last valid parcel by flopping in an
    invalid parcel. See section 6.6.6.1.
(b) Minimum latency is 5, max is 10, and the distribution is even. Therefore
    the average latency is 7.5.
(c) Clock will run for: 80 + (7.5 - 1) + 1 = 87.5 cycles.
(d) Percentage clocking is 87.5 / 133 = 65.8%
4. Calculate percentage of clock cycles that cycle count circuit is powered.
(a) Clock will run for:
80 clock cycles
+ 10 - 1 for powering last parcel
+ 1 cycle to clear out last parcel
= 90.0 clock cycles
(b) Percentage clocking is 90.0/133 = 67.7%
5. Total power consumption
                      Parcel Count   Cycle Count
Main capacitance       200            200
Fsm capacitance         37             40
Percentage clocking    65.8%          67.7%

Use A for the activity factor without clock gating.

P_tot = P_main + P_fsm
      = PctClk × ActFact × C_main + ActFact × C_fsm

P_pcl = 65.8% × A × 200 + A × 37 = 168.6 A
P_cyc = 67.7% × A × 200 + A × 40 = 175.4 A

Parcel count consumes less power.
6. How much more power does the cycle count design consume?
% more power = (CycPwr - PclPwr) / PclPwr
             = (175.4 - 168.6) / 168.6
             = 4%
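A Python sketch (not from the notes) reproducing this comparison; it uses the unrounded 133.3-cycle window, so the totals differ slightly from the hand-rounded 168.6 and 175.4 above, but the roughly 4% difference is the same.

WINDOW = 80 / 0.6                         # ~133.3 clock cycles per average window
C_MAIN = 200

def power(pct_clk, c_fsm, a=1.0):
    # Main circuit clocked pct_clk of the time; the clock-enable FSM is always on.
    return pct_clk * a * C_MAIN + a * c_fsm

p_parcel = power((80 + (7.5 - 1) + 1) / WINDOW, 37)   # parcel-count FSM, capacitance 37
p_cycle  = power((80 + (10  - 1) + 1) / WINDOW, 40)   # cycle-count FSM, capacitance 40
print(p_parcel, p_cycle, (p_cycle - p_parcel) / p_parcel)   # ~168, ~175, ~4%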
6.7 Power Problems
P6.1 Short Answers
P6.1.1 Power and Temperature
As temperature increases, does the power consumed by a typical combinational circuit increase,
stay the same, or decrease?
P6.1.2 Leakage Power
The new vice president of your company has set up a contest for ideas to reduce leakage power in
the next generation of chips that the company fabricates. The prize for the person who submits
the suggestion that makes the best tradeoff between leakage power and other design goals is to
have a door installed on their cube. What is your door-winning idea, and what tradeoffs will your
idea require in order to achieve the reduction in leakage power?
P6.1.3 Clock Gating
In what situations could adding clock-gating to a circuit increase power consumption?
P6.1.4 Gray Coding
What are the tradeoffs in implementing a program counter for a microprocessor using Gray
coding?
P6.2 VLSI Gurus
The VLSI gurus at your company have come up with a way to decrease the average rise and fall
time (0-to-1 and 1-to-0 transitions) for signals. The current value is 1 ns. With their fabrication
tweaks, they can decrease this to 0.85 ns.
P6.2.1 Effect on Power
If you implement their suggestions, and make no other changes, what effect will this have on
power? (NOTE: Based on the information given, be as specific as possible.)
P6.2.2 Critique
A group of wannabe performance gurus claims that the above optimization can be used to improve
performance by at least 15%. Briefly outline what their plan probably is, critique the merits of
their plan, and describe any effect their performance optimization will have on power.
P6.3 Advertising Ratios
One day you are strolling the hallways in search of inspiration, when you bump into a person
from the marketing department. The marketing department has been out surfing the web and has
noticed that companies are advertising the MIPs/mm^2, MIPs/Watt, and Watts/cm^3 of their
products. This wide variety of different metrics has confused them.
Explain whether each metric is a reasonable metric for customers to use when choosing a system.
If the metric is reasonable, say whether bigger is better (e.g. 500 MIPs/mm^2 is better than 20
MIPs/mm^2) or smaller is better (e.g. 20 MIPs/mm^2 is better than 500 MIPs/mm^2), and say which
one type of product (cell phone, desktop computer, or compute server) the metric is most relevant
to.

MIPs/mm^2

MIPs/Watt

Watts/cm^3
P6.4 Vary Supply Voltage
As the supply voltage is scaled down (reduced in value), the maximum clock speed that the circuit
can run at decreases.
The scaling down of supply voltage is a popular technique for minimizing power. The maximum
clock speed is related to the supply voltage by the following equation:
MaxClockSpeed ∝ (VoltSup - VoltThresh)^2 / VoltSup
Where VoltSup is supply voltage and VoltThresh is threshold voltage.
With a supply voltage of 3V and a threshold voltage of 0.8V, the maximum clock speed is
measured to be 200MHz. What will the maximum clock speed be with a supply voltage of 1.5V?
P6.5 Clock Speed Increase Without Power Increase
The following are given:
You need to increase the clock speed of a chip by 10%
You must not increase its dynamic power consumption
The only design parameter you can change is supply voltage
Assume that short-circuiting current is negligible
P6.5.1 Supply Voltage
How much do you need to decrease the supply voltage by to achieve this goal?
P6.5.2 Supply Voltage
What problems will you encounter if you continue to decrease the supply voltage?
P6.6 Power Reduction Strategies
For each low-power approach described below, identify which component(s) of the power equation
is (are) being minimized and/or maximized:
P6.6.1 Supply Voltage
Designers scaled down the supply voltage of their ASIC
P6.6.2 Transistor Sizing
The transistors were made larger.
P6.6.3 Adding Registers to Inputs
All inputs to functional units are registered
P6.6.4 Gray Coding
Gray coding of signals is used for address signals.
P6.7 Power Consumption on New Chip
While you are eating lunch at your regular table in the company cafeteria, a vice president sits
down and starts to talk about the difficulties with a new chip.
The chip is a slight modification of an existing design that has been ported to a new fabrication
process. Earlier that day, the first sample chips came back from fabrication. The good news is that
the chips appear to function correctly. The bad news is that they consume about 10% more power
than had been predicted.
The vice president explains that the extra power consumption is a very serious problem, because
power is the most important design metric for this chip.
The vice president asks you if you have any idea of what might cause the chips to consume more
power than predicted.
P6.7.1 Hypothesis
Hypothesize a likely cause for the surprisingly large power consumption, and justify why your
hypothesis is likely to be correct.
P6.7.2 Experiment
Briey describe how to determine if your hypothesized cause is the real cause of the surprisingly
large power consumption.
P6.7.3 Reality
The vice president wants to get the chips out to market quickly and asks you if you have any ideas
for reducing their power without changing the design or fabrication process. Describe your ideas,
or explain why her suggestion is infeasible.
Chapter 7
Fault Testing and Testability
7.1 Faults and Testing
7.1.1 Overview of Faults and Testing
7.1.1.1 Faults (Smith 14.3)
During manufacturing, faults can occur that make the physical product behave incorrectly.
Definition: A fault is a manufacturing defect that causes a wire, poly, diffusion, or via to either
break or connect to something it shouldn't.

[Figure: good wires, shorted wires, and an open wire.]
7.1.1.2 Causes of Faults (Smith 14.3)
Fabrication process (initial construction is bad)
chemical mix
impurities
dust
Manufacturing process (damage during construction)
handling
probing
cutting
mounting
materials
corrosion
adhesion failure
cracking
peeling
7.1.1.3 Testing (Smith 14)
Definition: Testing is the process of checking that the manufactured wafer/chip/board/system has
the same functionality as the simulations.
7.1.1.4 Burn In (Smith 14.3.1)
Some chips that come off the manufacturing line will work for a short period of time and then fail.
Definition: Burn-in is the process of subjecting chips to extreme conditions (high and low temperatures,
high and low voltages, high and low clock speeds) before and during testing.
The purpose is to cause (and catch) failures in chips that would pass a normal test, but fail in early
use by customers.

[Figure: a soon-to-break wire.]

The hope is that the extreme conditions will cause chips to break that would otherwise have
broken in the customer's system soon after arrival.
The trick is to create conditions that are extreme enough that bad chips will break, but not so
extreme as to cause good chips to break.
7.1.1.5 Bin Sorting (Smith 5.1.6)
Each chip (or wafer) is run at a variety of clock speeds. The chips are grouped and labeled
(binned) by the maximum clock frequency at which they will work reliably.
For example, chips coming off of the same production line might be labelled as 800MHz,
900MHz, and 1000MHz.
Overclocking is taking a chip rated at n MHz and running it at 1.x × n MHz. (Sure, your computer
often crashes and loses your assignment, but just think how much more productive you are when it
is working...)
7.1.1.6 Testing Techniques (Smith 14)
Scan Testing or Boundary Scan Testing (BST, JTAG) (Smith 14.2, 14.6):
Load test vector from tester into chip
Run chip on test data
Unload result data from chip to tester
Compare results from chip against those produced by simulation
If results are different, then chip was not manufactured correctly
Built In Self Test (BIST) (Smith 14.7):
Build circuitry on chip that generates tests and compares actual and expected results
IDDQ Testing : (Smith 14.3.6)
Measure the quiescent current between VDD and GND.
Variations from expected values indicate faults.
Challenges
The challenges in testing:
test circuitry consumes chip area
test circuitry reduces performance
decrease fault escapee rate of product that ships while having minimal impact on production
cost and chip performance
external tester can only look at I/O pins
ratio of internal signals to I/O pins is increasing
some faults will only manifest themselves at high-clock frequencies
"The crux of testing is to use yesterday's technology to find faults in tomorrow's chips." (Agilent
engineer at ARVLSI 2001)
7.1.1.7 Design for Testability (DFT) (Smith 14.6)
Scan testing and self-testing require adding extra circuitry to chips.
Design for test is the process of adding this circuitry in a disciplined and correct manner.
A hot area of research, that is becoming mainstream practice, is developing synthesis tools to
automatically add the testing circuitry.
7.1.2 Example Problem: Economics of Testing (Smith 14.1)
Given information:
The ACHIP costs $10 without any testing
Each board uses one ACHIP (plus lots of other chips that we dont care about)
68% of the manufactured ACHIPS do not have any faults
For the ACHIP, it costs $1 per chip to catch half of the faults
Each 50% reduction in fault escapees doubles cost of testing (intuition: doubles number of tests
that are run)
If board-level testing detects a bad ACHIP, it costs $200 to replace the ACHIP
Board-level testing will detect 100% of the faults in an ACHIP
Question: What escapee fault rate will minimize cost of the ACHIP?
Answer:
TotCost = NoTestCost + TestCost + EscapeeProb × ReplaceCost

NoTestCost   TestCost   EscapeeProb   EscapeeProb × ReplaceCost   TotCost
$10          $0         32%           (200 × 0.32 = $64)          $74
$10          $1         16%           (200 × 0.16 = $32)          $43
$10          $2         8%            (200 × 0.08 = $16)          $28
$10          $4         4%            (200 × 0.04 = $8)           $22
$10          $8         2%            (200 × 0.02 = $4)           $22
$10          $16        1%            (200 × 0.01 = $2)           $28
$10          $32        0.5%          (200 × 0.005 = $1)          $43

The lowest total cost is $22. There are two options with a total cost of $22: $4 of
testing and $8 of testing. Economically, we can choose either option.
For high-volume, small-area chips, testing can consume more than 50% of the total cost.
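The Python sketch below (not from the notes) reproduces the cost table, assuming, as stated above, that each doubling of the test cost halves the escapee probability.

NO_TEST_COST, REPLACE_COST = 10.0, 200.0

def total_cost(test_cost: float, escapee_prob: float) -> float:
    return NO_TEST_COST + test_cost + escapee_prob * REPLACE_COST

test_cost, escapee = 0.0, 0.32            # no testing ships 32% bad ACHIPs
for _ in range(7):
    print(f"${test_cost:5.2f} test, {escapee:.1%} escapees -> ${total_cost(test_cost, escapee):.2f}")
    test_cost = 1.0 if test_cost == 0 else test_cost * 2   # double the test spend
    escapee /= 2                                           # halve the escapee rate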
7.1.3 Physical Faults (Smith 14.3.3)
7.1.3.1 Types of Physical Faults
[Figures: a good circuit (signals a, b, c, d) and several bad circuits: an open wire; a wired-AND bridging short; a wired-OR bridging short; a stronger-wins bridging short (b is stronger); a short to VDD; and a short to GND.]
7.1.3.2 Locations of Faults
Each segment of wire, poly, diffusion, via, etc is a potential fault location.
Different segments affect different gates in the fanout.
A potential fault location is a segment or segments where a fault at any position affects the same
set of gates in the same way.
[Figure: the fanout of signal b, showing three different locations for potential faults and which gates each location affects (OK vs BAD).]
When working with faults, we work with wire segments, not signals. In the circuit below, there
are 8 different wire segments (L1 to L8). Each wire segment corresponds to a logically distinct fault
location. All physical faults on a segment affect the same set of signals, so they are grouped
together into a logical fault. If a signal has a fanout of 1, then there is one wire segment. A
signal with a fanout of n, where n > 1, has at least n + 1 wire segments: one for the source
signal and one for each gate of fanout. As shown in section 7.1.3.3, the layout of the circuit can
have more than n + 1 segments.
[Circuit: inputs a, b, c and output z, with wire segments L1 through L8 labelled.]
7.1.3.3 Layout Affects Locations
[Figures: a schematic with signals a through i, and two possible layouts of signal b's fanout to gates e, g, and h, one with four wire segments (L1 to L4) and one with five (L1 to L5).]
For the signal b in the schematic above, we can have either four or five different locations for
potential faults, depending upon how the circuit is laid out.
7.1.3.4 Naming Fault Locations
Two ways to name a fault location:
pin-fault model: faults are modelled as occurring on the input and output pins of gates.
net-fault model: faults are modelled as occurring on segments of wires.
In E&CE 427, we'll use the net-fault model, because it is simpler to work with and is closer to
what actually happens in hardware.
7.1.4 Detecting a Fault
To detect a fault, we compare the actual output of the circuit against the expected value.
To find a test vector that will detect a fault:
1. build a Boolean equation (or Karnaugh map) of the correct circuit
2. build a Boolean equation (or Karnaugh map) of the faulty circuit
3. compare the equations (or Karnaugh maps); regions of difference represent test vectors that
will detect the fault
7.1.4.1 Which Test Vectors will Detect a Fault?
Question: For the good circuit and faulty circuit shown below, which test vectors will
detect the fault?
[Schematics: the good circuit and the faulty circuit, each with inputs a, b, c and internal signals d, e.]
Answer:
a b c   good   faulty
0 0 0    0      0
0 0 1    1      1
0 1 0    0      0
0 1 1    1      1
1 0 0    0      0
1 0 1    1      1
1 1 0    1      0
1 1 1    1      1

The only test vector that will detect the fault in the circuit is 110.
Sometimes multiple test vectors will catch the same fault.
Sometimes a single test vector can catch multiple faults.
[Schematic: another fault in the same circuit (inputs a, b, c, internal signals d, e).]

a b c   good   faulty
1 1 0    1      0

The test vector 110 can catch both this fault and the previous one.
With testing, we are primarily concerned with determining whether a circuit works correctly or
not (detecting whether there is a fault). If the circuit has a fault, we usually do not care where
the fault is (diagnosing the fault). To detect the two faults above, the test vector 110 is
sufficient, because if either of the two faults is present, 110 will detect that the circuit does not
work correctly.
Note: Detect vs. diagnose. Testing detects faults; it does not diagnose
which fault occurred.
If we have a higher-than-expected failure rate for a chip, we might want to investigate the cause of
the failures, and so would need to diagnose the faults. In this case, we might do more exhaustive
analysis to see which test vectors pass and which fail. We might also need to examine the chip
physically with probes to test a few individual wires or transistors. This is done by removing the
top layers of the chip and using very small and very sensitive probes, analogous to how we use a
multimeter to test a circuit on a breadboard.
7.1.5 Mathematical Models of Faults (Smith 14.3.4)
Goal: develop reliable and predictable technique for detecting faults in circuits.
Observations:
The possible faults in a circuit are dependent upon the physical layout of the circuit.
A very wide variety of possible faults
A single test vector can catch many different faults
Need: a mathematical model for faults that is abstracted from complexities of circuit layout and
plethora of possible faults, yet still detects most or all possible faults.
7.1.5.1 Single Stuck-At Fault Model
Although there are many different bad behaviours that faults can lead to, the simple model of
single stuck-at faults has proven very capable of finding real faults in real circuits.
Two simplifying assumptions:
1. A maximum of one fault per tested circuit (hence single)
2. All faults are either:
(a) stuck-at 1: short to VDD
(b) stuck-at 0: short to GND
hence, stuck at
Example of Stuck-At Faults

[Circuit: signals a, b, c, d, and i, with 12 wire segments L1 through L12 labelled.]
12 fault locations × 2 types of faults = 24 possible faults.
If we restrict ourselves to the single stuck-at fault model, then we have 24 faulty circuits to consider.
If we allowed multiple faults, then the circuit above could have up to 12 different faults. How many
faulty circuits would need to be considered?
Each of the 12 locations has three possible values: good, stuck-at-1, stuck-at-0. Therefore,
3^12 ≈ 5.3 × 10^5 different circuits would need to be considered!
If we allowed multiple faults of 4 different types at 12 different locations, then we would have
5^12 - 1 ≈ 2.4 × 10^8 different faulty circuits to consider!
There are 2^(2^4) ≈ 6.6 × 10^4 different Boolean functions of four inputs (a K-map of four variables is
a grid of 2^4 squares; each square is either 0 or 1, which gives 2^(2^4) different combinations). So there
are 6.6 × 10^4 possible equations for circuits with four inputs and one output. This is much less
than the number of faulty circuit models that would be generated by the
simultaneous-faults-at-every-location models. So both of the
simultaneous-faults-at-every-location models are too extreme.
7.1.6 Generate Test Vector to Find a Mathematical Fault (Smith 14.4)
Faults are detected by stimulating circuits (real, manufactured circuit, not a simulation!) with
test-vectors and checking that the real circuit gives the correct output.
Standard practice in testing is to test circuits for single stuck-at faults. Mathematics and empirical
evidence demonstrate that if a circuit appears to be free of single stuck-at faults, then it is probably
also free of other types of faults. That is, testing a circuit for single stuck-at faults will also detect
many other types of faults and will often detect multiple faults.
7.1.6.1 Algorithm
1. compute Karnaugh map for correct circuit
2. compute Karnaugh map for faulty circuit
3. find the region of disagreement
4. any assignment in the region of disagreement is a test vector that will detect the fault
5. any assignment outside of the region of disagreement will result in the same output on both the
correct and faulty circuits
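For small circuits, the same procedure can be carried out on truth tables instead of Karnaugh maps. The Python sketch below (not from the notes) enumerates all input assignments and reports the disagreements as test vectors; the good and faulty functions reproduce the truth table from the answer in section 7.1.4.1, so the only test vector reported is 110.

from itertools import product

def find_test_vectors(good, faulty, n_inputs):
    # Every input assignment where the two functions disagree is a test vector.
    return [bits for bits in product([0, 1], repeat=n_inputs)
            if good(*bits) != faulty(*bits)]

good   = lambda a, b, c: c or (a and b)    # truth table of the good circuit in 7.1.4.1
faulty = lambda a, b, c: c                 # truth table of the faulty circuit in 7.1.4.1

print(find_test_vectors(good, faulty, 3))  # [(1, 1, 0)] -- matches the earlier answer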
7.1.6.2 Example of Finding a Test Vector
[Worked example: schematics and Karnaugh maps (over a, b, c) for the good circuit, the faulty circuit, and the difference between the good and faulty circuits; the internal signals are d and e.]
7.1.7 Undetectable Faults
Not all faults are detectable.
1. If a circuit is irredundant then all single stuck-at faults can be detected.
A redundant circuit is one where one or more gates can be removed without
affecting the functional behaviour.
2. If you are not trying to find all of the faults in a circuit, then a fault that you aren't looking for can
mask a fault that you are looking for.
7.1.7.1 Redundant Circuitry
Some faults are undetectable. Undetectable stuck-at faults are located in redundant parts of a
circuit.
Timing Hazards

[Waveforms: a static hazard and a dynamic hazard.]

Timing hazards are often removed by adding redundant circuitry.
Redundant Circuitry

[Figures: an irredundant sum-of-products circuit with inputs a, b, c, internal signals d, e, f, and output g, annotated with signal values; and waveforms illustrating the timing hazard on g.]
Glitch on g is caused because the AND gate for e turns off before f turns on.
Question: Add one or more gates to the circuit so that the static hazard is guaranteed
to be prevented, independent of the delay values through the gates
In this sum-of-products style circuit, each AND gate corresponds to a cube in the Karnaugh map.
[Karnaugh map of the circuit, showing the transition that causes the glitch.]
We can prevent this transition from causing a glitch by adding a cube that covers the two squares of the transition from 111 to 101. This cube is 1-1, which is the black cube in the Karnaugh map below and the signal h in the redundant circuit below.
[Figure: the Karnaugh map with the added cube in black, and the redundant circuit with the extra AND gate output h (fault location L1); waveforms show that there are no more timing hazards.]
Question: Has the redundant circuitry introduced any undetectable faults? If so,
identify an undetectable fault.
L1@0 is undetectable.
Correct circuit:  ab + bc
Faulty circuit:   ab + bc + ac
With L1@0, the term ac becomes 0:
  ab + bc + 0  =  ab + bc
which is the same equation as the correct circuit.
A stuck-at fault in redundant circuitry will not affect the steady state behaviour of the circuit, but
could allow timing glitches to occur.
7.1.7.2 Curious Circuitry and Fault Detection
The two circuits below have the same steady-state behaviour.
[Figure: two circuits with inputs a, b, c and output z that have the same steady-state behaviour; the first contains XOR gates with fault locations L1, L2, L3.]
Because the two circuits have the same behaviour, it might appear that the leftmost two XOR gates
are redundant. However, these gates are not redundant. In the test for redundancy, when we
remove a gate, we delete it; we do not replace it with other circuitry.
Curiously, the stuck-at fault at L1 is undetectable, but faults at either L2 or L3 are detectable.
[Table: for faults L2@0 and L2@1, the faulty-circuit equation, its Karnaugh map, and the difference with the correct circuit.]
7.2 Test Generation
7.2.1 A Small Example
Throughout this section we will use the circuit below:
[Figure: example circuit with inputs a, b, c, output z = ab+bc, fault locations L2, L4, L5 marked, and its Karnaugh map.]
At first, we will consider only the following faults: L2@1, L4@1, L5@1.
fault       eqn      test vectors
1) L2@1     a+c      101, 001, 100
2) L4@1     a+bc     101, 100
3) L5@1     ab+c     101, 001
(For each fault, the notes also show the Karnaugh map of the faulty circuit and its difference with the correct circuit.)
Choose Test Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
If we choose 101, we can detect all three faults. Choosing either 001
or 100 will miss one of the three faults.
[Karnaugh map showing the chosen test vector 101.]
7.2.2 Choosing Test Vectors
The goal of test vector generation is to find the smallest set of test vectors that will detect the faults of interest.
Test vector generation requires analyzing the faults.
We can simplify the task of fault analysis by reducing the number of faults that we have to
analyze.
Smith has examples of this in Figures 14.13 and 14.14.
7.2.2.1 Fault Domination
fault       eqn     test vectors
1) L5@1     ab+c    101, 001
2) L6@1     1       101, 001, 100, 010, 000
Any test vector that detects L5@1 will also detect L6@1: L5@1 is detected by 101 and 001, each of which will detect L6@1. L6@1 does not dominate L5@1, because there is at least one test vector that detects L6@1 but does not detect L5@1 (e.g. each of 100, 010, 000 detects L6@1 but not L5@1).
Definition dominates: f1 dominates f2: any test vector that detects f1 will also detect f2.
When choosing test vectors, we can ignore the dominated fault, but must keep the dominant fault.
L5@1 dominates L6@1.
When choosing test vectors we can ignore L6@1 and just include L5@1.
Question: To detect both L5@1 and L6@1, can we ignore one of the faults?
Answer:
We can ignore L6@1, because L5@1 dominates L6@1: each test vector
that detects L5@1 also detects L6@1.
Question: What would happen if we ignored the wrong fault?
Answer:
If we ignore L5@1, but keep L6@1, we can choose any of 5 test vectors that
detect L6@1. If we chose 100, 010, or 000 as our test vector to detect L6@1,
then we would not detect L5@1.
7.2.2.2 Fault Equivalence
fault       eqn
1) L1@1     b
2) L3@1     b
(The Karnaugh maps and differences with the correct circuit are identical for the two faults.)
The two faults above are equivalent.
Definition fault equivalence: f1 is equivalent to f2: f1 and f2 are detected by exactly the same set of test vectors. That is, all of the test vectors that detect f1 will also detect f2, and vice versa.
When choosing test vectors we can ignore one of the faults and just include the other.
7.2.2.3 Gate Collapsing
A controlling value on an input to a gate forces the output to be the controlled value. If a stuck-at
fault on the input causes the input to have a controlling value, then that fault is equivalent to the
output having a stuck-at fault of being at the controlled value.
For example, a 1 on the input to an OR gate will force the output to be 1. So, a stuck-at-1 fault on
either input to an OR gate is equivalent to a stuck-at-1 fault on the output of the gate, and is
equivalent to a stuck-at-1 fault on any other input to the OR gate.
A stuck-at-1 fault on the input to an OR gate is equivalent to a stuck-at-1 fault on the output of the
OR gate.
Definition Gate collapsing: The technique of looking at the functionality of a gate and finding equivalent faults between inputs and outputs.
Sets of collapsible faults for common gates:
AND: the @0 faults on the inputs and on the output collapse together.
OR:  the @1 faults on the inputs and on the output collapse together.
Question: What is the set of collapsible faults for a NAND gate?
7.2.2.4 Node Collapsing
Note: Node collapsing is relevant only for the pin-fault model.
When two segments affect the same set of gates (ignoring any gates between the two segments), then faults on the two segments can be collapsed.
With an inverter or buffer, the segment on the input affects the same gates as the output. Therefore, faults on the input and output segments are equivalent.
Sets of collapsible faults for nodes:
[Figure: for an inverter, a stuck-at-1 fault on one side collapses with the stuck-at-0 fault on the other side.]
With the net-fault model, which is the one we are using in E&CE 427, inverters and buffers are the only gates where node collapsing is relevant.
With the pin-fault model, where faults are modelled as occurring on the pins of gates, there are other instances where node collapsing can be used.
7.2.2.5 Fault Collapsing Summary
When calculating the test-vectors to detect a set of faults, apply the fault collapsing techniques of:
gate collapsing
node collapsing (if using pin-fault model)
general fault equivalence (intelligent collapsing)
fault domination
to reduce the number of faults that you must examine.
Fault collapsing is an optimization. If you skip this step, you will still get the correct answer; it will just take more work to get the correct answer, because in each step you will analyze a greater number of faults than if you had done fault collapsing.
7.2.3 Fault Coverage
Denition Fault coverage: percentage of detectable faults that are detected by a set of test vectors.
FaultCoverage = DetectedFaults / DetectableFaults
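For example (the numbers here are hypothetical, not from the notes): if a circuit has 16 possible single stuck-at faults, 2 of which are undetectable because they lie in redundant circuitry, and a test suite detects 12 of the remaining 14, then FaultCoverage = 12 / 14 = 86%.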
Some people's definition of fault coverage has a denominator of AllPossibleFaults, not just those that are detectable.
If the denominator is AllPossibleFaults, then, if a circuit has 100% single stuck-at fault coverage with a suite of test vectors, then each stuck-at fault in the circuit can be detected by one or more vectors in the suite. This also means that the circuit has no undetectable faults, and hence, no redundant circuitry.
Even if the denominator is AllPossibleFaults, it is possible that achieving 100% coverage for single stuck-at faults will allow defective chips to pass if they have faults that are not stuck-at-1 or stuck-at-0.
I think, but haven't seen a proof, that achieving 100% single stuck-at coverage will detect all combinations of multiple stuck-at faults. But, if you do not achieve 100% coverage, then a stuck-at fault that you aren't testing for can mask (hide) a fault that you are testing for.
NOTE: In Smith's book, undetectable faults don't hurt your coverage. This is not universally true.
7.2.4 Test Vector Generation and Fault Detection
There are two ways to generate vectors and check results: built-in tests and scan testing.
Both require:
generate test vectors
override the normal datapath to send test vectors, rather than normal inputs, as inputs to flops
compare outputs of flops to expected result
7.2.5 Generate Test Vectors for 100% Coverage
In this section we will find the test vectors to achieve 100% coverage of single stuck-at faults for the circuit of the day.
We will use a simple algorithm; there are much more sophisticated algorithms that are more efficient.
The problem of test vector generation is often called Automatic Test Pattern Generation (ATPG)
and continues to be an active area of research.
A trendy idea is to use Genetic Algorithms (inspired by how DNA works) to generate test vectors
that catch the maximum number of faults.
The classic algorithm is the D algorithm invented by Roth in 1966 (Smith 14.5.1, 14.5.2).
An enhanced version is the Path-Oriented D Algorithm (PODEM), which supports reconvergent
fanout and was developed by Goel in 1981 (Smith 14.5.3).
[Figure 7.1: Example circuit with inputs a, b, c, output z = ab+bc, fault locations L1 through L8, and its Karnaugh map.]
7.2.5.1 Collapse the Faults
Initial circuit with potential faults: L1@0,1; L2@0,1; L3@0,1; L4@0,1; L5@0,1; L6@0,1; L7@0,1; L8@0,1.
Gate collapsing: each AND gate collapses the @0 faults on its inputs and output, and the OR gate collapses the @1 faults on its inputs and output:
  L1@0, L4@0, L6@0 collapse together
  L3@0, L5@0, L7@0 collapse together
  L6@1, L7@1, L8@1 collapse together
Node Collapsing
Node collapsing: none applicable (no inverters or buffers).
Remaining faults: L1@1; L2@0,1; L3@1; L4@1; L5@1; L6@0; L7@0; L8@0,1.
Intelligent Collapsing
Sometimes, after the regular forms of fault collapsing have been done, there will still be some sets of equivalent faults in the circuit. It is usually beneficial to quickly look for patterns or symmetries in the circuit that will indicate a set of potentially equivalent faults.
Intelligent collapsing on this circuit:
  L2@0 and L8@0 are equivalent: both result in the equation 0.
  L1@1 and L3@1 are equivalent: both result in the equation b.
Remaining faults: L2@1; L3@1; L4@1; L5@1; L6@0; L7@0; L8@0,1.
7.2.5.2 Check for Fault Domination
fault       eqn      notes
1) L2@1     a+c      dominated by L4@1, L5@1
2) L3@1     b
3) L4@1     a+bc
4) L5@1     ab+c
5) L6@0     bc
6) L7@0     ab
7) L8@0     0        dominated by L6@0, L7@0
8) L8@1     1        dominated by L2@1, L3@1, L4@1, L5@1
(For each fault, the notes also show the Karnaugh map of the faulty circuit and its difference with the correct circuit.)
Remove dominated faults
Current faults: L2@1; L3@1; L4@1; L5@1; L6@0; L7@0; L8@0,1.
Dominated faults (removed): L2@1, L8@0, L8@1.
The faults remaining to analyze:
fault       eqn
1) L3@1     b
2) L4@1     a+bc
3) L5@1     ab+c
4) L6@0     bc
5) L7@0     ab
7.2.5.3 Required Test Vectors
If we have any faults that are detected by just one test-vector, then we
must include that test vector in our suite.
Denition required test vector: A test vector tv is required
if there is a fault for which tv is the only test vector that
will detect the fault.
Required vectors
L3@1 010
L6@0 110
L7@0 011
7.2.5.4 Faults Not Covered by Required Test Vectors
fault       eqn
1) L4@1     a+bc
2) L5@1     ab+c
The intersection of the two difference regions is 101.
Choosing 101 detects both L4@1 and L5@1.
Add 101 to the suite of test vectors.
The final set of test vectors is: 010, 110, 011, 101.
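A small VHDL testbench sketch that applies these four vectors is shown below. This is only a sketch: the entity and port names (ckt, a, b, c, z) are assumptions for illustration, not from the notes, and in practice the vectors are applied to the manufactured chip by a tester, not by a simulation.

library ieee;
use ieee.std_logic_1164.all;

entity tb_ckt is
end tb_ckt;

architecture main of tb_ckt is
  component ckt
    port ( a, b, c : in std_logic; z : out std_logic );
  end component;
  signal a, b, c, z : std_logic;
  type vec_arr is array (0 to 3) of std_logic_vector(2 downto 0);
  constant test_vecs : vec_arr := ("010", "110", "011", "101");
begin
  uut : ckt port map (a => a, b => b, c => c, z => z);
  process begin
    for i in test_vecs'range loop
      a <= test_vecs(i)(2);
      b <= test_vecs(i)(1);
      c <= test_vecs(i)(0);
      wait for 10 ns;
      -- compare z against the output of the known-good circuit here;
      -- the expected values depend on the correct equation for z
      report "vector applied";
    end loop;
    wait;
  end process;
end main;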
7.2.5.5 Order to Run Test Vectors
The order in which the test vectors are run is important because it can affect how long a faulty chip stays in the tester before the chip's fault is detected.
The first vector to run should be the one that detects the most faults.
Build a table for which faults each test vector will detect.
[Table: for each of the sixteen faults L1@0 through L8@1, a mark under each of the test vectors 110, 010, 011, 101 that detects it.]
Faults detected: 110 detects 5, 010 detects 5, 011 detects 5, and 101 detects 6.
101 detects the most faults, so we should run it first.
This reduces the faults found by 010 from 5 to 2 (because L6@1, L7@1, and L8@1 will be found by 101).
This leaves 110 and 011 with 5 faults each; we can run them in either order, then run 010.
We settle on a final order for our test suite of: 101, 011, 110, 010.
7.2.5.6 Summary of Technique to Find and Order Test Vectors
1. identify all possible faults
2. gate collapsing
3. node collapsing
4. intelligent collapsing
5. fault domination
6. determine required test vectors
7. choose minimal set of test vectors to detect remaining faults
8. order test vectors based on number of faults detected (NOTE: when iterating through this
step, need to take into account faults detected by earlier test vectors)
7.2.5.7 Complete Analysis
In case you don't trust the fault collapsing analysis, here's the complete analysis.
fault        eqn      notes
1)  L1@0     bc
2)  L1@1     b
3)  L2@0     0        dominated by 1, 5
4)  L2@1     a+c      dominated by 8, 10
5)  L3@0     ab
6)  L3@1     b        same as 2
7)  L4@0     bc       same as 1
8)  L4@1     a+bc
9)  L5@0     ab       same as 5
10) L5@1     ab+c
11) L6@0     bc       same as 1
12) L6@1     1        dominated by 8, 10
13) L7@0     ab       same as 5
14) L7@1     1        same as 12
15) L8@0     0        same as 3
16) L8@1     1        same as 12
(For each distinct fault, the notes also show the Karnaugh map of the faulty circuit and its difference with the correct circuit.)
7.2.6 One Fault Hiding Another
[Figure: the example circuit with inputs a, b, c, output z, and fault locations L1 through L8.]
Assume that we are not trying to detect all faults: L1 is viewed as not being at risk for faults, but L3 is at risk for faults.
[Figure: the circuit with only fault locations L1 and L3 marked.]
Problem: If L1 is stuck-at 1, the test vectors that normally detect L3@0 will not detect L3@0.
In the presence of other faults, the set of test vectors to detect a fault will change.
fault(s)          eqn
L3@0              ab
L1@1, L3@0        b
(The Karnaugh maps show that the difference region changes when L1@1 is also present.)
7.3 Scan Testing in General
Scan testing is based on the techniques described in section 7.2.5. The generation of test vectors and the checking of the result are done off-chip. In comparison, built-in self test (section 7.5) does test-vector generation and result checking on chip. Scan testing has the advantage of flexibility and reduced on-chip hardware, but increases the length of time required to run a test. In scan testing, we want to individually drive and read every flop in the circuit.
Even without using any I/O pins for testing purposes, chips are already I/O bound, so scan testing must be very frugal in its use of pins. Flops are connected together in scan chains with one input pin and one output pin.
7.3.1 Structure and Behaviour of Scan Testing
[Figure: the circuit under test with inputs data_in(3..0) and zeta_in(3..0), sitting between "another circuit #0" and "another circuit #1" (Normal Circuit); and the same circuit with two scan chains added, each with its own mode, scan_in, and scan_out signals (Circuit with Scan Chains Added).]
7.3.2 Scan Chains
7.3.2.1 Circuitry in Normal and Scan Mode
[Figure: the scan flops around the circuit under test in Normal Mode, driven from data_in(3..0) and zeta_in(3..0), and in Scan Mode, driven from scan_in0/scan_in1 and shifting toward scan_out0/scan_out1, selected by mode0 and mode1.]
7.3.2.2 Scan in Operation
[Figure: circuit under test with scan chains 0 and 1, connected to the neighbouring circuits, with mode, scan_in, and scan_out signals.]
Sequence of load; test; unload:
1. Load test vector (1 cycle per bit)
2. Run test vector through circuit
3. Unload result (1 cycle per bit)
Unload and Load at the Same Time
1. Unload previous result and load current test vector (1 cycle per bit)
2. Run current test vector through circuit
3. Unload current result and load next test vector (1 cycle per bit)
[Waveform: clk, mode0, mode1, scan_in0, scan_in1, scan_out0, scan_out1, showing the next test vector being shifted in on each chain while the previous results are shifted out (sequence of load; run; unload).]
7.3.2.3 Scan in Operation with Example Circuit
[Figures: the example circuit under test (inputs a, b, c, d; outputs z, y); the same circuit with scan test circuitry added (mode0, mode1, scan_in0, scan_in1, scan_out0, scan_out1); and a sequence of snapshots with clk and mode0 waveforms showing: start loading the test vector; load (one bit per cycle); run the test vector; test values propagate; flop in the result and start (un)loading the next test vector; continue (un)loading; finish (un)loading; run the next test vector.]
7.3.3 Summary of Scan Testing
Adding scan circuitry
1. Registers around the circuit to be tested are grouped into scan chains
2. Replace each flop with a mux + flop (a VHDL sketch of such a scan cell follows this summary)
3. Flops and muxes are wired together into scan chains
4. Each scan chain is connected to dedicated I/O pins for loading and unloading test vectors
Running test vectors
1. Put scan chain in scan mode
2. Load in test vector (one element of vector per clock cycle)
3. Put scan chain in normal mode
4. Run circuit for one clock cycle to load the result of the test into the flops
5. Unload results of current test vector while simultaneously loading in the next test vector (one element of vector per clock cycle)
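A hedged VHDL sketch of a single scan cell (the mux + flop from step 2 of "Adding scan circuitry"); the entity and port names are assumptions for illustration, not from the notes:

library ieee;
use ieee.std_logic_1164.all;

entity scan_flop is
  port (
    clk     : in  std_logic;
    mode    : in  std_logic;  -- '0' = normal mode, '1' = scan mode
    d       : in  std_logic;  -- normal datapath input
    scan_in : in  std_logic;  -- from the previous flop in the scan chain
    q       : out std_logic   -- drives the datapath and the next flop's scan_in
  );
end scan_flop;

architecture main of scan_flop is
begin
  process begin
    wait until rising_edge(clk);
    if mode = '1' then
      q <= scan_in;   -- shift the scan chain
    else
      q <= d;         -- normal operation
    end if;
  end process;
end main;

Connecting q of one cell to scan_in of the next forms a scan chain; the first cell's scan_in and the last cell's q become the chain's dedicated scan_in and scan_out pins.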
7.3.4 Time to Test a Chip
If the length (number of flops) of a scan chain is n, then it takes 2n+1 clock cycles to run a single test: n clock cycles to scan in the test vector, 1 clock cycle to execute the test vector, and n cycles to scan out the results. Once the results are scanned out, they can be compared to the expected results for a correctly working circuit.
If we run 2 or more tests (and chips generally are subjected to hundreds of thousands of tests), then we speed things up by scanning in the next test vector while we scan out the previous result.
ScanLength = number of flip-flops in a scan chain
NumVectors = number of test vectors in test suite
TimeScan   = number of clock cycles to run test suite
           = NumVectors * (ScanLength + 1) + ScanLength
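One way to see where this formula comes from (same assumptions as above): loading the first vector takes ScanLength cycles; each vector then takes 1 cycle to run plus ScanLength cycles during which its result is unloaded while the next vector is loaded (for the last vector, those cycles just unload the result). The total is ScanLength + NumVectors * (1 + ScanLength), which is the same as NumVectors * (ScanLength + 1) + ScanLength.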
7.3.4.1 Example: Time to Test a Chip
A 800MHz chip has scan chains of length 20,000 bits, 18,000 bits, 21,000 bits, 22,000 bits, and
two of 15,000 bits.
500,000 test vectors are used for each scan chain.
The tests are run at 80% of full speed.
Question: Calculate the total test time.
Answer:
We can load and unload all of the scan chains at the same time, so time will
be limited by the longest (22,000 bits).
For the first test vector, we have to load it in, run the circuit for one clock cycle, then unload the result.
Loading the second test vector is done while unloading the first.
TimeTot = ClockPeriod * (MaxLengthVec + NumVecs * (MaxLengthVec + 1))
        = (1 / (0.80 * 800 * 10^6)) * (22,000 + 500,000 * (22,000 + 1))
        = 17 seconds
7.4 Boundary Scan and JTAG
Boundary scan originated as a technique to test wires on printed circuit boards (PCBs).
The goal was to replace bed-of-nails style testing with a technique that would work for high-density PCBs (lots of small wires close together).
It is now used to test both boards and chip internals.
It is used both on boundaries (I/O pins) and internal flops.
Boundary Scan with JTAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Standardized by IEEE (1149) and previously by JTAG:
4 required signals (Scan Pins: TDI, TDO, TCK, TMS)
1 optional signal (Scan Pin: TRST)
protocol to connect circuit under test to tester and other circuits
state machine to drive test circuitry on chip
Boundary Scan Description Language (BSDL): structural language used to describe which
features of JTAG a circuit supports
JTAG circuitry now commonly built-into FPGAs and ASICS, or part of a cell-library. Rarely is a
JTAG circuit custom-built as part of a larger part. So, youll probably be choosing and using
JTAG circuits, not constructing new ones.
Using JTAG circuitry is usually done by giving a description of your printed circuit board (PCB)
and the JTAG components on each chip (in BSDL) to test generation software. The software then
generates a sequence of JTAG commands and data that can be used to test the wires on the circuit
board for opens and shorts.
7.4.1 Boundary Scan History
1985 JETAG: Joint European Test Action Group
1986 JTAG (North American companies joined)
1990 JTAG 2.0 formed the basis for IEEE 1149.1 "Test access port and boundary scan architecture"
7.4.2 JTAG Scan Pins
TDI  test data input: input test vector to chip
TDO  test data output: output result of test
TCK  test clock: clock signal that the test runs on
TMS  test mode select: controls the scan state machine
TRST test reset (optional): resets the scan state machine
[Figures: a high-level view of the scan registers between TDI and TDO, wrapped around the circuit under test, with TCK and TMS driving the control block and the normal input and output pins unchanged; and a detailed view showing the boundary scan register (BSR) of boundary scan cells (BSCs), the bypass register (BR), the IDCODE register, the instruction register (IR) and its cells (IRC), the instruction decoder, and the TAP controller.]
7.4.3 Scan Registers and Cells
Basic Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
TDR Test data register
The boundary scan registers on a chip
DR Fig 14.2 Data register cell
Often used as a Boundary scan cell (BSC)
JTAG Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Fig 14.8 Top level diagram
BSR Fig 14.5 Boundary scan register
A chain of boundary scan cells (BSCs)
BSC Fig 14.2 Boundary scan cell
Connects external input and scan signal to internal circuit. Acts as
wire between external input and internal circuit in normal mode.
BR Fig 14.3 Bypass-register cell
Allows direct connection from TDI to TDO. Acts as a wire when
executing BYPASS instruction.
IDCODE Device identication register
data register to hold manufacturers name and chip identier. Used
in IDCODE instruction.
IR cell Fig 14.4 Instruction register cell
Cells are combined together as a shift register to form an instruction
register (IR)
IR Fig 14.6 Instruction register
Two or more IR cells in a row. Holds data that is shifted in on TDI,
sends this data in parallel to instruction decoder.
IDecode Table 14.4 Instruction decoder
Reads instruction stored in instruction register (IR) and sends control
signals to bypass register (BR) and boundary scan register (BSR)
Fig 14.7 TAP Controller
State machine that, together with instruction decoder, controls the
scan circuitry.
7.4.4 Scan Instructions
This is the set of required instructions; other instructions are optional.
EXTEST Test board-level interconnect. Drive output pins of chip with hard-
coded test vector. Sample results on inputs.
SAMPLE Sample result data
PRELOAD Load test vector
BYPASS Directly connect TDI to TDO. This is used when several chips are
daisy chained together to skip loading data into some chips.
IDCODE Output manufacturer and part number
7.4.5 TAP Controller
The TAP controller is required to have 16 states and obey the state machine shown in Fig 14.7 of
Smith.
7.4.6 Other descriptions of JTAG/IEEE 1149.1
Texas Instruments introductory seminar on IEEE 1149.1
http://www.ti.com/sc/docs/jtag/seminar1.pdf
Texas Instruments intermediate seminar on IEEE 1149.1
http://www.ti.com/sc/docs/jtag/seminar2.pdf
Sun microSPARC-IIep scan-testing documentation
http://www.sun.com/microelectronics/whitepapers/wpr-0018-01/
Intellitech JTAG overview:
http://www.intellitech.com/resources/technology.html
Actel's JTAG description:
http://www.actel.com/appnotes/97s05d15.pdf
Description of JTAG support on the Motorola ColdFire microprocessor:
http://e-www.motorola.com/collateral/MCF5307TR-JTAG.pdf
7.5 Built In Self Test
With built-in self test, the circuit tests itself. Both test vector generation and checking are done
using linear feedback shift registers (LFSRs).
7.5.1 Block Diagram
[Figure: BIST block diagram. A test-generator LFSR replaces the inputs data_in(0..3) of the circuit under test when mode selects test; the circuit outputs d_out(0..3) feed signature analyzers 0..3 via diz(0..3); the result checker combines the analyzers' ok(0..3) outputs into all_ok.]
7.5.1.1 Components
There is one test generator per group of inputs (or internal flops) that drive the same circuit to be tested.
There is one signature analyzer per output (or internal flop).
Note (MISR): an exception to the above rule is a multiple input signature register (MISR), which can be used to analyze several outputs of the circuit under test.
The test generator and signature analyzer are both built with linear-feedback shift registers.
Test Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
generates a pseudo-random set of test vectors
for n output bits, generates all vectors from 1 to 2^n - 1 in a pseudo-random order
built with a linear-feedback shift register (the shift-register portion is the input flops)
The figure below shows an LFSR that generates all possible 3-bit vectors except 000. (An n-bit LFSR that generates 2^n - 1 different vectors is called a maximal-length LFSR.)
Assume that reset initializes the circuit to 111. The sequence that is generated is: 111, 011, 001, 100, 010, 101, 110. This sequence is repeated, so the number after 110 is 111.
[Figure: 3-bit LFSR with outputs q2, q1, q0.]
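A hedged VHDL sketch of a 3-bit maximal-length test generator is shown below. It uses the polynomial p(x) = x^3 + x + 1; the figure in the notes may use the other maximal-length 3-flop polynomial (x^3 + x^2 + 1) or a different bit ordering, so the exact order of the sequence may differ, but all seven non-zero states are visited.

library ieee;
use ieee.std_logic_1164.all;

entity test_gen is
  port (
    clk, reset : in  std_logic;
    q          : out std_logic_vector(2 downto 0)
  );
end test_gen;

architecture main of test_gen is
  signal q_i : std_logic_vector(2 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    if reset = '1' then
      q_i <= "111";                 -- must not start in "000"
    else
      q_i(0) <= q_i(2);             -- feedback from the last flop
      q_i(1) <= q_i(0) xor q_i(2);  -- internal XOR tap (the x term)
      q_i(2) <= q_i(1);             -- plain shift
    end if;
  end process;
  q <= q_i;
end main;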
Question: Why not just use a counter to generate 1 .. 2^n - 1?
Signature Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Checking is done by building one signature analyzer circuit for each signal tested. The circuit returns true if the signal generates the correct sequence of outputs for the test vectors. Doing this with complete accuracy would require storing 2^n bits of information for each output for a circuit with n inputs. This would be as expensive as the original circuit. So, BIST uses mathematics similar to error correction/detection to approximate whether the outputs are correct. This technique is called signature analysis and originated with Hewlett-Packard in the 1970s.
The checking is done with an LFSR, similar to the BIST generation circuit. The checking circuit is designed to output a 1 at the end of the sequence of 2^n - 1 test results if the sequence of results matches the correct circuit. We could do this with an LFSR of 2^n - 1 flops, but as said before, this would be at least as expensive as duplicating the original circuit.
The checking LFSR is designed similarly to a hashing function or parity checking circuit. If it returns 0, then we know that there is a fault in the circuit. If it returns a 1, then there is probably not a fault in the circuit, but we can't say for sure.
There is a tradeoff between the accuracy of the analyzer and its area. The more accurate it is, the more flip-flops are required.
Summary: the signature analyzer:
checks that the output it is examining has the correct results for the complete set of tests that are run
only has a meaningful result at the end of the entire test sequence
is built with a linear-feedback shift register
is similar to a hash function or a lossy compression function
if there are no faults, the signature analyzer will definitely say ok (no false negatives)
if there is a fault, the signature analyzer might say ok or might say bad (false positives are possible)
design tradeoff: more accurate signature analyzers require more hardware
Result Checker
signature analyzers output ok/bad on every clock cycle, but the result is only meaningful at the end of running the complete set of test vectors
the result checker looks at the test vector inputs to detect the end of the test suite and outputs all_ok if all signature analyzers report ok at that moment
implemented as an AND gate
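A hedged VHDL sketch of the result checker, using the ok(0..3) and all_ok names from the block diagram above (how the end of the test suite is detected is left out; the output is only meaningful on the final cycle):

library ieee;
use ieee.std_logic_1164.all;

entity result_checker is
  port (
    ok     : in  std_logic_vector(3 downto 0);  -- one bit per signature analyzer
    all_ok : out std_logic
  );
end result_checker;

architecture main of result_checker is
begin
  -- AND together the per-analyzer ok signals
  all_ok <= ok(0) and ok(1) and ok(2) and ok(3);
end main;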
7.5.1.2 Linear Feedback Shift Register (LFSR)
Basically, a shift register (sequence of flip-flops) with the output of the last flip-flop fed back into some of the earlier flip-flops with XOR gates.
Design parameters:
number of flip-flops
external or internal XOR
feedback taps (coefficients)
external-input or self-contained
reset or set
Example LFSRs
[Figures: four example LFSRs built from three flip-flops (d0/q0, d1/q1, d2/q2): External-XOR with input and reset; External-XOR with no input and set; Internal-XOR with input and set; Internal-XOR with input and reset.]
In E&CE 427, we use internal-XOR LFSRs, because the circuitry matches the mathematics of Galois fields.
External-XOR LFSRs work just fine, but they are more difficult to analyze, because their behaviour can't be treated as Galois fields.
7.5.1.3 Maximal-Length LFSR
Definition maximal-length linear feedback shift register: An LFSR that outputs a pseudo-random sequence of all representable bit-vectors except 0...00.
Definition pseudo-random: The same elements in the same order every time, but the relationship between consecutive elements is apparently random.
Maximal-length linear feedback shift registers are used to generate test vectors for built-in self test.
Maximal-Length LFSR Circuits
The figures below illustrate the two maximal-length internal-XOR linear feedback shift registers that can be constructed with 3 flops.
[Figure: maximal-length internal-XOR LFSR with flops d0/q0, d1/q1, d2/q2 and set.]
7.5.2 Arithmetic over Binary Fields
Galois Fields!
Two operations: + and *
Two values: 0 and 1
+ represents XOR:
  expression    result
  0 + 0         0
  0 + 1         1
  1 + 0         1
  1 + 1         0
  x + x         0
* represents concatenating shift registers:
  expression    result
  x^4 * 1       x^4
  x^2 * x^3     x^5
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Calculate (x^3 + x^2 + 1) * (x^2 + x):
  x^2 * (x^3 + x^2 + 1)  =  x^5 + x^4       + x^2
  x   * (x^3 + x^2 + 1)  =        x^4 + x^3       + x
  sum                    =  x^5       + x^3 + x^2 + x
7.5.3 Shift Registers and Characteristic Polynomials
Each linear feedback shift register has a characteristic polynomial that corresponds to the behaviour of the signal that is the output of the last flip-flop in the shift register.
The exponents in the polynomial correspond to the delay: x^0 is the input to the shift register, x^1 is the output of the first flip-flop, x^2 is the output of the second, etc. The coefficient is 1 if the feedback line feeds back into that flip-flop. Usually (internal flops, or input flops with an external input), the feedback is done via an XOR gate. For input flops without an external input signal, the feedback is done directly, with a wire. The non-existent external input is equivalent to a 0, and 0 XOR a simplifies to a, which is a wire.
From polynomials to hardware:
The maximum exponent denotes the number of flops
The other exponents denote the flops that tap off of the feedback line from the last flop
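As a hedged VHDL sketch of these rules (entity and signal names are assumptions, not from the notes), the internal-XOR LFSR with input i for p(x) = x^3 + x + 1 has three flops; the feedback from the last flop is XORed in at x^0 (with the input) and at x^1 (the x term), and the x^2 stage is a plain shift:

library ieee;
use ieee.std_logic_1164.all;

entity lfsr_x3_x1 is
  port (
    clk, reset : in  std_logic;
    i          : in  std_logic;
    q          : out std_logic_vector(2 downto 0)
  );
end lfsr_x3_x1;

architecture main of lfsr_x3_x1 is
  signal q_i : std_logic_vector(2 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    if reset = '1' then
      q_i <= "000";
    else
      q_i(0) <= i      xor q_i(2);  -- x^0: input XORed with feedback
      q_i(1) <= q_i(0) xor q_i(2);  -- x^1: coefficient of x is 1, so a tap here
      q_i(2) <= q_i(1);             -- x^2: no tap, plain shift
    end if;
  end process;
  q <= q_i;
end main;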
[Figures: six LFSR schematics built from flops d0/q0, d1/q1, d2/q2 (and d3/q3 for the last one), each labelled with its characteristic polynomial:]
  p(x) = x^3
  p(x) = x^3 + x
  p(x) = x^3 + 1
  p(x) = x^3 + x + 1
  p(x) = x^3 + x^2 + x + 1
  p(x) = x^4 + x^3 + x + 1
7.5.3.1 Circuit Multiplication
Redoing the multiplication example (x^2 + x) * (x^3 + x^2 + 1) as circuits:
[Figure: the LFSR for x^2 + x feeding the LFSR for x^3 + x^2 + 1.]
  (x^2 + x) * (x^3 + x^2 + 1)
    =  x   * (x^3 + x^2 + 1)
     + x^2 * (x^3 + x^2 + 1)
    =  x^5 + x^3 + x^2 + x
The flop for the most-significant bit is represented by a coefficient of 1 for the maximum exponent in the polynomial. Hence, the MSB of the first partial product cancels the x^4 of the second partial product, resulting in a coefficient of 0 for x^4 in the answer.
7.5.4 Bit Streams and Characteristic Polynomials
A bit stream, or bit sequence, can be represented as a polynomial.
The oldest (first) bit in a sequence of n bits is represented by x^(n-1) and the youngest (last) bit is x^0.
The bit sequence 1010011 can be represented as x^6 + x^4 + x + 1:
  1 0 1 0 0 1 1
  = 1*x^6 + 0*x^5 + 1*x^4 + 0*x^3 + 0*x^2 + 1*x^1 + 1*x^0
  = x^6 + x^4 + x + 1
7.5.5 Division
With rules for multiplication and addition, we can define division.
A fundamental theorem of division defines q and r to be the quotient and remainder, respectively, of m / p iff:
  m(x) = q(x) * p(x) + r(x)
In Galois fields, we do division just as with long division in elementary school.
Given:
  m(x) = x^6 + x^4 + x^3
  p(x) = x^4 + x
Calculate the quotient q(x) and remainder r(x) for m(x) / p(x):
                       x^2 + 1
  x^4 + x  )  x^6 + 0x^5 + 1x^4 + 1x^3 + 0x^2 + 0x^1 + 0x^0
              x^6               + 1x^3
                           1x^4
                           1x^4                 +  x
                                                   x
Quotient  q(x) = x^2 + 1
Remainder r(x) = x
Check the result:
  m(x) = q(x) * p(x) + r(x)
       = (x^2 + 1) * (x^4 + x) + x
       = x^6 + x^3 + x^4 + x + x
       = x^6 + x^4 + x^3
7.5.6 Signature Analysis: Math and Circuits
The input to the signature analyzer is a message, m(x), which is a sequence of n bits represented as a polynomial.
After n shifts through an LFSR with l flops:
  The sequence of output bits forms a quotient, q(x), of length n - l
  The flops in the analyzer form a remainder, r(x), of length l
  m(x) = q(x) * p(x) + r(x)
The remainder is the signature.
The mathematics for an LFSR without an input i:
  same polynomial as if the circuit had an input
  input sequence is all 0s
An input stream with an error can be represented as m(x) + e(x):
  e(x) is the error polynomial
  bits in the message that are flipped have a coefficient of 1 in e(x)
  m(x) + e(x) = q'(x) * p(x) + r'(x)
The error e(x) will be detected if it results in a different signature (remainder).
m(x) and m(x) + e(x) will have the same remainder iff
  e(x) mod p(x) = 0
That is, e(x) must be a multiple of p(x).
The larger p(x) is, the smaller the chances that e(x) will be a multiple of p(x).
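A small illustrative example (not from the notes): with p(x) = x^3 + x + 1, an error pattern that flips exactly the message bits given by p(x) itself (e(x) = x^3 + x + 1) is a multiple of p(x), so the remainder is unchanged and the error goes undetected. An error that flips a single bit, e(x) = x^k, can never be a multiple of p(x) (x^k factors only into powers of x, and p(x) is not a power of x), so any single-bit error is always detected.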
7.5.7 Summary
Adding test circuitry
1. Pick number of flops for generator
2. Build generator (maximal-length linear feedback shift register)
3. Pick number of flops for signature analysis
4. Pick coefficients (feedback taps) for analyzer
5. Based on generator, circuit under test, and signature analyzer, determine expected output of analyzer
6. Based on expected output of analyzer, build result checker
Running test vectors
1. Put circuit in test mode
2. Set reset = 1
3. Run one clock cycle, set reset = 0
4. Run one clock cycle for each test vector
5. At end of test sequence, all ok signals should be 1
6. To run n test vectors requires n+1 clock cycles.
7.6 Scan vs Self Test
Scan
less hardware
slower
well dened coverage
test vectors are easy to modify
Self Test
more hardware
faster
ill dened coverage
test vectors are hard to modify
7.7 Problems on Faults, Testing, and Testability
P7.1 Based on Smith q14.9: Testing Cost
A modern (circa 1995) production tester costs US$5-10 million. This cost is depreciated over the life of the tester (usually five years in the States due to tax guidelines).
1. Neglecting all operating expenses other than depreciation, if the tester is in use 24 hours a day, 365 days per year, how much does one second of test time cost?
2. A new tester sits idle for 6 months, because the design of the chips that it is to test is behind schedule. After the chips begin shipping, the tester is used 100% of the time. What is the cost of testing the chips relative to the cost if the chips had been completed on time?
3. The dimensions of the die to be tested are 20 mm x 10 mm. The wafers are 200 mm in diameter. Fabricating a wafer with die costs $3000. The yield is 70%. Assume that the number of die per wafer is equal to wafer area divided by chip area.
What percentage of the fabrication + test cost is for test if the chip is on schedule and requires 1 minute to test?
P7.2 Testing Cost and Total Cost
Given information:
The ACHIP costs $10 without any testing
Each board uses two ACHIPs (plus lots of other chips that we dont care about)
68% of the manufactured ACHIPS do not have any faults
For the ACHIP, it costs $1 per chip to catch half of the faults
Each 50% reduction in fault escapees doubles cost of testing (intuition: doubles number of
tests that are run)
If board-level testing detects faults in either one or both ACHIPs, it costs $200 to replace
the ACHIP(s) (This is an approximation, based on the fact that the cost of the chip is much
less than the total cost of $200).
Board-level testing will detect 100% of the faults in an ACHIP
What fault escapee rate will result in the lowest total cost for ACHIPs?
P7.3 Minimum Number of Faults
In a circuit with i inputs, o outputs, and g gates with an average fanout of fo (fo > 1), and average
fanin of , what is the minimum number of faults that must be considered when using a
single-stuck-at fault model?
P7.4 Smith q14.10: Fault Collapsing
Draw the set of faults that collapse for AND, OR, NAND, and NOR gates, and a two-input mux.
P7.5 Mathematical Models and Reality
Given a correct circuit and a non-stuck-at fault (e.g. bridging AND), will a single stuck-at fault model detect the fault? If so, identify a single stuck-at fault that will detect it; if not, explain why it can't be detected.
P7.6 Undetectable Faults
Identify one of the undetectable single stuck-at faults in the circuit below, or say NONE if all single stuck-at faults are detectable.
[Figure: circuit with inputs a, b, c, output z, and fault locations L1 through L8.]
P7.7 Test Vector Generation
Your task is to generate test vectors to detect faults in the circuit shown below.
Your manager has said that manufacturing only has time to run three test vectors on the circuit.
[Figure: circuit with inputs a, b, c and fault locations L1 through L8.]
P7.7.1 Choice of Test Vectors
Which test vectors should you run and in what order should you run them?
P7.7.2 Number of Test Vectors
Write a brief statement (justied by data) to support either staying with three test vectors or
increasing the test suite to four vectors.
P7.8 Time to do a Scan Test
A 1.2GHz chip has scan chains of length 30,000 bits, 20,000 bits, 24,000 bits, 25,000 bits, and
two of 12,000 bits.
500,000 test vectors are used for each scan chain.
The tests are run at 50% of full speed.
Calculate the total test time.
P7.9 BIST
In this problem, we will revisit the circuit from section 7.2.5, which is shown below. But this time we'll use BIST to test the circuit, rather than analyzing the faults and then choosing test vectors to catch the potential faults.
[Figure: circuit with inputs a, b, c, output z, and fault locations L1 through L8.]
P7.9.1 Characteristic Polynomials
Derive the characteristic polynomials for the linear feedback shift registers shown below:
[Figures: two 3-flop LFSRs (d0/q0, d1/q1, d2/q2, with set) whose schematics are given in the notes.]
P7.9.2 Test Generation
Do either of the circuits generate a maximal-length non-repeating sequence?
P7.9.3 Signature Analyzer
Given a signature analyzer equation of x^2 + x + 1, find the expected value of the flops in the signature analyzer at the end of the test sequence. Also, design the hardware for the signature analyzer and result checker.
P7.9.4 Probability of Catching a Fault
Find the approximate probability of a fault not being detected.
P7.9.5 Probability of Catching a Fault
If we increase the size of the signature analyzer by one flip-flop, by how much do we change the approximate probability of a fault not being detected?
P7.9.6 Detecting a Specific Fault
Determine if an L7@0 fault is detectable.
P7.9.7 Time to Run Test
Find the number of clock cycles to run the test
P7.10 Power and BIST
You add a BIST circuit to a chip. This causes the chip to exceed the power envelope that marketing has dictated is needed. What can you do to reduce the power consumption of the chip without negatively affecting performance or incurring significant design effort?
P7.11 Timing Hazards and Testability
This question deals with the following circuit:
[Figure: circuit with inputs a, b, c, output z, and fault locations L1 through L15.]
1. Does the circuit have any untestable single-stuck-at faults? If so, identify them.
2. Does the circuit have any static timing hazards?
3. Add any circuitry needed to prevent static timing hazards in the circuit below, then identify
any untestable single-stuck-at faults in the resulting circuit.
P7.12 Testing Short Answer
P7.12.1 Are there any physical faults that are detectable by scan testing but not by built-in
self testing?
If not, explain why. If so, describe such a fault.
P7.12.2 Are there any physical faults that are detectable by built-in self testing but not by
scan testing?
If not, explain why. If so, describe such a fault.
P7.13 Fault Testing
In this question, you will design and analyze built-in self test circuitry for the circuit-under-test
shown below.
P7.13.1 Design test generator
Draw the schematic for a 2-bit maximal-length linear feedback shift register and demonstrate that
it is maximal length.
P7.13.2 Design signature analyzer
Design a signature analyzer circuit for a characteristic polynomial of x +1.
P7.13.3 Determine if a fault is detectable
Is a stuck-at-1 fault on the output of the inverter detectable with the circuitry that you've designed?
P7.13.4 Testing time
How many clock cycles does your BIST circuitry require to test the circuit under test? Explain
how each clock cycle is used.
Chapter 8
Review
This chapter lists the major topics of the term. The Topics List section for each major area is
meant to be relatively complete.
8.1 Overview of the Term
The purely digital world
VHDL
design and optimization methods
functional verication
performance analysis
Analog effects in the digital world
timing analysis
power
faults and testing
459
460 CHAPTER 8. REVIEW
8.2 VHDL
8.2.1 VHDL Topics
simple syntax and semantics things that you should know simply by having done the labs
and project
behavioural semantics of VHDL
synthesis semantics of VHDL
synthesizable and unsynthesizable code
8.2.2 VHDL Example Problems
identify whether a particular signal will be the output of combinational circuitry or a op
identify whether a particular process is combinational or clocked
legal, synthesizable, and good code
perform delta-cycle simulation of VHDL
perform RTL simulation of VHDL
identify whether two VHDL fragments have same behaviour
match VHDL code with waveforms
match VHDL code with hardware
choose the VHDL fragment that generates smaller or faster hardware
8.3. RTL DESIGN TECHNIQUES 461
8.3 RTL Design Techniques
8.3.1 Design Topics
coding guidelines
generic FPGA hardware
area estimation
nite state machines
implicit
explicit-current
explicit-current+next
from algorithm to hardware
dependency graph
dataow diagram
scheduling
input/output allocation
register allocation
datapath allocation
hardware block diagram
state machine
memory dependencies
memory arrays and dataow diagrams
8.3.2 Design Example Problems
choose design guidelines to follow in different situations
estimate area to implement a circuit in an FPGA
calculate resource usage for a dataow diagram
calculate performance data for a dataow diagram
given an algorithm, design a dataow diagram
given a dataow diagram, design the datapath and nite state machine
optimize a dataow diagram to improve performance or reduce resource usage
given a dataow diagram, calculate the clock period that will result in the maximum
performance
462 CHAPTER 8. REVIEW
8.4 Functional Verification
8.4.1 Verification Topics
test cases
measuring coverage
time for verification
test benches
assertions
coverage monitors
relational specification
functional specification
boundary conditions / corner cases
8.4.2 Verification Example Problems
choose first cases to test
identify corner cases
choose technique to detect bug (test case, assertion/test bench)
determine whether a code change will cause a bug
identify a test case and either assertion or test bench to catch a bug
8.5. PERFORMANCE ANALYSIS AND OPTIMIZATION 463
8.5 Performance Analysis and Optimization
8.5.1 Performance Topics
time to execute a program
denition of performance
speedup
n% faster
calculating performance of different different tasks and of average task
choosing which task to optimize to best improve overall performance
cpi calculations
performance increase over time
design tradeoffs (CPI vs NumInsts vs ClockSpeed vs time-to-market)
CPI calculations
MIPs calculations
Clock speed vs. performance
Optimality performance / area tradeoffs
8.5.2 Performance Example Problems
calculate performance / area tradeoffs
calculate performance / time tradeoffs
compare performance data between products
evaluate performance criteria
464 CHAPTER 8. REVIEW
8.6 Timing Analysis
8.6.1 Timing Topics
circuit parameters that affect delay
clock period
clock skew
clock jitter
propagation delay
load delay
setup time
hold time
clock-to-Q time
timing analysis of latch
timing analysis of master-slave flip-flop
timing analysis of hierarchical storage device
critical path and false path
algorithm to find critical path
algorithm to determine if path is false or critical
signal assignment to exercise critical path
Elmore timing model
derating factors
8.6.2 Timing Example Problems
timing parameters for minimum clock period
timing parameters for hold constraint
find the critical path and assignment to exercise it
compute Elmore delay constant
compare accuracy of different timing models
determine if a storage device will work correctly
compute timing parameters of storage device
identify timing violation, suggest remedy
suggest design change to increase clock speed
8.7. POWER 465
8.7 Power
8.7.1 Power Topics
power vs energy
equations for power
dynamic power
static power
switching power
short circuit power
leakage power
activity factor
leakage current
threshold voltage
supply voltage
analog power reduction techniques
rtl power reduction techniques
data encoding
clock gating
8.7.2 Power Example Problems
predict effect of new fabrication process on power
predict effect of environment change (temp, supply voltage, etc) on power consumption
predict effect of design change on power consumption (capacitance, activity factor)
design data-encoding scheme for a circuit, predict effect on power consumption
design clock gating scheme for a circuit, predict effect on power consumption
asses validity of various power- or energy-consumption metrics
466 CHAPTER 8. REVIEW
8.8 Testing
8.8.1 Testing Topics
causes of faults
locations of faults
physical faults
single stuck-at fault model
testable / untestable fault
economics of testing
fault coverage
test vector generation
order test vectors to reduce test time
behaviour of a scan chain
time to run a scan test
JTAG
built-in self-test
linear feedback shift register
signature analyzer
Galois elds
process and time to run a BIST test
8.8.2 Testing Example Problems
compute optimal amount of testing to maximize prots
compute coverage for a given set of test vectors
nd test vectors to catch a set of faults, choose order to run test vectors
determine if a fault is detectable
choose an LFSR to use for BIST test generation
choose an LFSR to use for BIST signature analysis
determine if a given BIST will catch a given fault
determine probability that a given BIST technique will report that a faulty circuit is correct
determine if a given fault-testing scheme will detect a physical fault
match LFSR to characteristic polynomial
match BIST hardware to Galois mathematics
perform Galois eld mathematics, compare to waveforms
8.9. FORMULAS TO BE GIVEN ON FINAL EXAM 467
8.9 Formulas to be Given on Final Exam
  T  = (Ins * CPI) / F
  Pf = W / T
  S  = T1 / T2
  M  = (F / 10^6) / (sum over i = 0..n of PI_i * CPI_i)
  P  = 1/2 * (A * C_L * V^2 * F) + (A * V * I_sh * F) + (V * I_L)
  q  = 1.60218 x 10^-19 C
  k  = 1.38066 x 10^-23 J/K
  F  is proportional to (V - V_Th)^2 / V
  I_L is proportional to e^(-q * V_Th / (k * T))
Part II
Solutions to Assignment Problems
Chapter 5
VHDL Problems
P5.1 IEEE 1164
For each of the values in the list below, answer whether or not it is dened in the
ieee.std_logic_1164 library. If it is part of the library, write a 23 word description of the
value.
Values: -, #, 0, 1, A, h, H, L, Q, X, Z.
Answer:
         In std_logic_1164?
value    yes/no    description
-        yes       don't care
#        no
0        yes       strong 0
1        yes       strong 1
A        no
h        no
H        yes       weak 1
L        yes       weak 0
Q        no
X        yes       strong unknown
Z        yes       high impedance
NOTE: 'h' is not in the package, because characters are case sensitive. For example, 'a' is different from 'A'.
P5.2 VHDL Syntax
Answer whether each of the VHDL code fragments q2a through q2f is legal VHDL code.
NOTES: 1) ... represents a fragment of legal VHDL code.
2) For full marks, if the code is illegal, you must explain why.
3) The code has been written so that, if it is illegal, then it is illegal for both
simulation and synthesis.
q2a
architecture main of anchiceratops is
signal a, b, c : std_logic;
begin
process begin
wait until rising_edge(c);
a <= if (b = 1) then
...
else
...
end if;
end process;
end main;
ILLEGAL: if-then-else is a state-
ment, not an expression, you cant put it
if-then-else on right-hand-side of as-
signment since it doesnt produce a value
to assign to signal a.
q2b
architecture main of tulerpeton is
begin
lab: for i in 15 downto 0 loop
...
end loop;
end main;
ILLEGAL: loop statements are sequen-
tial, while architecture bodies contain
concurrent statements.
P5.2. VHDL SYNTAX 5
q2c
architecture main of metaxygnathus is
signal a : std_logic;
begin
lab: if (a = 1) generate
...
end generate;
end main;
ILLEGAL: condition for if-generate
statements must be statically determined;
testing the value of a signal is dynamic.
q2d
architecture main of temnospondyl is
component compa
port (
a : in std_logic;
b : out std_logic
);
end component;
signal p, q : std_logic;
begin
coma_1 : compa
port map (a => p, b => q);
...
end main;
LEGAL
6 CHAPTER 5. VHDL PROBLEMS
q2e
architecture main of pachyderm is
function inv(a : std_logic)
return std_logic is
begin
return(NOT a);
end inv;
signal p, b : std_logic;
begin
p <= inv(b => a);
...
end main;
ILLEGAL: the argument to inv should be (a => b). In function calls and component instantiations, when using named parameter association (as opposed to positional association), the syntax is formal => actual. In the problem, the function definition is inv(a : std_logic) and the function call is inv(b => a). Here a is the formal argument and b is the actual argument, so the correct function call would be inv(a => b).
q2f
architecture main of apatosaurus is
type state_ty is (S0, S1, S2);
signal st : state_ty;
signal p : std_logic;
begin
case st is
when S0 | S1 => p <= 0;
when others => p <= 1;
end case;
end main;
ILLEGAL: case statements are sequential;
but the body of an architecture contains
concurrent statements.
P5.3 Flops, Latches, and Combinational Circuitry
For each of the signals p...z in the architecture main of montevido, answer whether the signal
is a latch, combinational gate, or ip-op.
entity montevido is
port (
a, b0, b1, c0, c1, d0, d1, e0, e1 : in std_logic;
l : in std_logic_vector (1 downto 0);
p, q, r, s, t, u, v, w, x, y, z : out std_logic
);
end montevido;
architecture main of montevido is
signal i, j : std_logic;
begin
i <= c0 XOR c1;
j <= c0 XOR c1;
process (a, i, j) begin
if (a = 1) then
p <= i AND j;
else
p <= NOT i;
end if;
end process;
process (a, b0, b1) begin
if rising_edge(a) then
q <= b0 AND b1;
end if;
end process;
process
(a, c0, c1, d0, d1, e0, e1)
begin
if (a = 1) then
r <= c0 OR c1;
s <= d0 AND d1;
else
r <= e0 XOR e1;
end if;
end process;
process begin
wait until rising_edge(a);
t <= b0 XOR b1;
u <= NOT t;
v <= NOT x;
end process;
process begin
case l is
when "00" =>
wait until rising_edge(a);
w <= b0 AND b1;
x <= 0;
when "01" =>
wait until rising_edge(a);
w <= -;
x <= 1;
when "1-" =>
wait until rising_edge(a);
w <= c0 XOR c1;
x <= -;
end case;
end process;
y <= c0 XOR c1;
z <= x XOR w;
end main;
8 CHAPTER 5. VHDL PROBLEMS
Answer:
Latch Combinational Flip-op
p X
q X
r X
s X
t X
u X
v X
w X
x X
y X
z X
Explanation of why e, which is the output of a ip-op, have a value at 5ns,
which is before the rst rising edge of the clock.
Before the rst rising edge of the clock, the following assignments will all
happen:
a <= 0;
b <= 0;
...
----------- end of delta cycle
d <= 0
...
----------- end of delta cycle
e <= d;
If you were to implement VHDL code in hardware, e would be the output of a
op, and as such would remain as U until the rst rising edge of the clock.
This is a situation where simulating the VHDL code will have slightly different
results than simulating the hardware. Most questions in ece427 that ask you
to compare the behaviour of VHDL code with the behaviour of a circuit will
say to focus on the steady-state behaviour and ignore any differences in the
rst few clock cycles.
P5.4. COUNTING CLOCK CYCLES 9
P5.4 Counting Clock Cycles
This question refers to the VHDL code shown below.
NOTES:
1. ... represents a legal fragment of VHDL code
2. assume all signals are properly declared
3. the VHDL code is intended to be legal, synthesizable code
4. all signals are initially U
10 CHAPTER 5. VHDL PROBLEMS
entity bigckt is
port (
a, b : in std_logic;
c : out std_logic
);
end bigckt;
architecture main of bigckt is
begin
process (a, b)
begin
if (a = 0) then
c <= 0;
else
if (b = 1) then
c <= '1';
else
c <= 0;
end if;
end if;
end process;
end main;
entity tinyckt is
port (
clk : in std_logic;
i : in std_logic;
o : out std_logic
);
end tinyckt;
architecture main of tinyckt is
component bigckt ( ... );
signal ... : std_logic;
begin
p0 : process begin
wait until rising_edge(clk);
p0_a <= i;
wait until rising_edge(clk);
end process;
p1 : process begin
wait until rising_edge(clk);
p1_b <= p1_d;
p1_c <= p1_b;
p1_d <= s2_k;
end process;
p2 : process (p1_c, p3_h, p4_i, clk) begin
if rising_edge(clk) then
p2_e <= p3_h;
p2_f <= p1_c = p4_i;
end if;
end process;
p3 : process (i, s4_m) begin
p3_g <= i;
p3_h <= s4_m;
end process;
p4 : process (clk, i) begin
if (clk = 1) then
p4_i <= i;
else
p4_i <= 0;
end if;
end process;
huge : bigckt
port map (a => p2_e, b => p1_d, c => h_y);
s1_j <= s3_l;
s2_k <= p1_b XOR i;
s3_l <= p2_f;
s4_m <= p2_f;
end main;
For each of the pairs of signals below, what is the minimum length of time between when a
change occurs on the source signal and when that change affects the destination signal?
P5.5. ARITHMETIC OVERFLOW 11
Answer:
[Figure: timing sketches of the signals clk, i, s2_k, p1_d, p1_b, p1_c, p2_f, s4_m, p3_h, p2_e, and p4_i, illustrating when changes propagate.]
NOTE: i doesn't affect the value of p4_i just before a rising edge of clock, so i doesn't affect p2_e at all along the path that goes through p4_i.
src     dst     Num clock cycles
i       p0_a    1 clock cycle
i       p1_b    2 clock cycles
i       p1_c    3 clock cycles
i       p2_e    5 clock cycles
i       p3_g    same clock cycle
i       p4_i    same clock cycle
s4_m    h_y     1 clock cycle
p1_b    p1_d    1 clock cycle
p2_f    s1_j    same clock cycle
p2_f    s2_k    no connection
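For example, the entry for i to p1_c is 3 clock cycles: a change on i propagates
combinationally to s2_k, is captured into p1_d at the first rising edge of clk, into
p1_b at the second rising edge, and into p1_c at the third.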
P5.5 Arithmetic Overflow
Implement a circuit to detect overow in 8-bit signed addition.
An overflow in addition happens when the carry into the most significant bit is different from the
carry out of the most significant bit.
When performing addition, for overflow to happen, both operands must have the same sign.
Positive overflow occurs when adding two positive operands results in a negative sum. Negative
overflow occurs when adding two negative operands results in a positive sum.
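For example, in 8-bit signed arithmetic 100 + 100 = 200 is not representable: both operands have
sign bit 0, the carry into the sign bit is 1 while the carry out of it is 0, and the 8-bit sum wraps to
-56, so two positive operands produce a negative sum.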
Answer:
We use xor to check if two bits are not-equal.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity overflow is
port (
num1,
num2 : in signed(7 downto 0);
cin : in std_logic;
overflow : out std_logic
);
end overflow;
architecture main of overflow is
signal result : signed(7 downto 0);
begin
result <= num1 + num2 + ("0000000" & cin);
overflow <= not (num1(7) xor num2(7))
and ( num1(7) xor result(7) );
end main;
P5.6 Delta-Cycle Simulation: Pong
Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram.
INSTRUCTIONS:
1. The simulation is to be done at the granularity of simulation-steps.
2. Show all changes to process modes and signal values.
3. Each column of the timing diagram corresponds to a simulation step that changes a signal
or process.
4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation
round by writing in the appropriate row a B at the beginning and an E at the end of the cycle
or round.
5. End your simulation just before 20 ns.
architecture main of pong_machine is
signal ping_i, ping_n, pong_i, pong_n : std_logic;
begin
reset_proc: process begin
reset <= '1';
wait for 10 ns;
reset <= '0';
wait for 100 ns;
end process;
clk_proc: process begin
clk <= '0';
wait for 10 ns;
clk <= '1';
wait for 10 ns;
end process;
next_proc: process (clk)
begin
if rising_edge(clk) then
ping_n <= ping_i;
pong_n <= pong_i;
end if;
end process;
comb_proc: process (pong_n, ping_n, reset)
begin
if (reset = '1') then
ping_i <= '1';
pong_i <= '0';
else
ping_i <= pong_n;
pong_i <= ping_n;
end if;
end process;
end main;
Answer:
(Waveform answer diagram: columns are simulation steps from t = 0 ns up to just before 20 ns;
rows show reset, clk, ping_i, ping_n, pong_i, pong_n, the modes (A/S/P) of next_proc, comb_proc,
clk_proc, and reset_proc, and B/E markers for each delta cycle, simulation cycle, and simulation
round.)
P5.7 Delta-Cycle Simulation: Baku
Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram.
INSTRUCTIONS:
1. The simulation is to be done at the granularity of simulation-steps.
2. Show all changes to process modes and signal values.
3. Each column of the timing diagram corresponds to a simulation step.
4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation
round by writing in the appropriate row a B at the beginning and an E at the end of the cycle
or round.
5. Write t=5ns and t=10ns at the top of columns where time advances to 5 ns and 10 ns.
6. Begin your simulation at 5 ns (i.e. after the initial simulation cycles that initialize the
signals have completed).
7. End your simulation just before 15 ns;
entity baku is
port (
clk, a, b : in std_logic;
f : out std_logic
);
end baku;
architecture main of baku is
signal c, d, e : std_logic;
begin
proc_clk: process
begin
clk <= '0';
wait for 10 ns;
clk <= '1';
wait for 10 ns;
end process;
proc_extern : process
begin
a <= '0';
b <= '0';
wait for 5 ns;
a <= '1';
b <= '1';
wait for 15 ns;
end process;
proc_1 : process (a, b, c)
begin
c <= a and b;
d <= a xor c;
end process;
proc_2 : process
begin
e <= d;
wait until rising_edge(clk);
end process;
proc_3 : process (c, e) begin
f <= c xor e;
end process;
end main;
Answer:
(Waveform answer diagram: columns are simulation steps; rows show clk, a, b, c, d, e, f, the
modes (A/S/P) of proc_clk, proc_extern, proc_1, proc_2, and proc_3, and B/E markers for each
delta cycle, simulation cycle, and simulation round, from t = 5 ns through t = 10 ns.)
Note: the instruction to end just before 15 ns simply causes us to stop the simulation
just before 15 ns. The values on the signals at the end of 10 ns will remain until the
next event, which happens at 20 ns.
P5.8 Clock-Cycle Simulation
Given the VHDL code for anapurna and waveform diagram below, answer what the values of
the signals y, z, and p will be at the given times.
entity anapurna is
port (
clk, reset, sel : in std_logic;
a, b : in unsigned(15 downto 0);
p : out unsigned(15 downto 0)
);
end anapurna;
architecture main of anapurna is
type state_ty is (mango, guava, durian, papaya);
signal y, z : unsigned(15 downto 0);
signal state : state_ty;
begin
proc_herzog: process
begin
top_loop: loop
wait until (rising_edge(clk));
next top_loop when (reset = '1');
state <= durian;
wait until (rising_edge(clk));
state <= papaya;
while y < z loop
wait until (rising_edge(clk));
if sel = '1' then
wait until (rising_edge(clk));
next top_loop when (reset = '1');
state <= mango;
end if;
state <= papaya;
end loop;
end loop;
end process;
proc_hillary: process (clk)
begin
if rising_edge(clk) then
if (state = durian) then
z <= a;
else
z <= z + 2;
end if;
end if;
end process;
y <= b;
p <= y + z;
end main;
Answer:
(Waveform diagram, time axis 0 to 200 ns: inputs reset, clk, sel; a cycles through the values
01, 0E, 02, 0C, 0A, 06, 08, 0E and b through 0F, 03, 0D, 05, 0B, 07, 01; the answer fills in
y, z, state, and p.)
     55ns   107ns   147ns   195ns
y    7      5       F       7
z    U      2       6       A
p    U      7       15      11
P5.9 VHDL VHDL Behavioural Comparison: Teradactyl
For each of the VHDL architectures q3a through q3c, does the signal v have the same behaviour
as it does in the main architecture of teradactyl?
NOTES: 1) For full marks, if the code has different behaviour, you must explain
why.
2) Ignore any differences in behaviour in the first few clock cycles that are
caused by initialization of flip-flops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.
entity teradactyl is
port (
a : in std_logic;
v : out std_logic
);
end teradactyl;
architecture main of teradactyl is
signal m : std_logic;
begin
m <= a;
v <= m;
end main;
architecture q3a of teradactyl is
signal b, c, d : std_logic;
begin
b <= a;
c <= b;
d <= c;
v <= d;
end q3a;
SAME - Intermediate signals are optimized out.
architecture q3b of teradactyl is
signal m : std_logic;
begin
process (a, m) begin
v <= m;
m <= a;
end process;
end q3b;
SAME - Putting it in a process doesn't matter.
architecture q3c of teradactyl is
signal m : std_logic;
begin
process (a) begin
m <= a;
end process;
process (m) begin
v <= m;
end process;
end q3c;
SAME - Putting it in a separate process doesn't
matter due to the parallel nature of VHDL.
P5.10 VHDL VHDL Behavioural Comparison: Ichtyostega
For each of the VHDL architectures q4a through q4c, does the signal v have the same behaviour
as it does in the main architecture of ichthyostega?
NOTES: 1) For full marks, if the code has different behaviour, you must explain
why.
2) Ignore any differences in behaviour in the first few clock cycles that are
caused by initialization of flip-flops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.
entity ichthyostega is
port (
clk : in std_logic;
b, c : in signed(3 downto 0);
v : out signed(3 downto 0)
);
end ichthyostega;
architecture main of ichthyostega is
signal bx, cx : signed(3 downto 0);
begin
process begin
wait until (rising_edge(clk));
bx <= b;
cx <= c;
end process;
process begin
wait until (rising_edge(clk));
if (cx > 0) then
v <= bx;
else
v <= to_signed(-1, 4);
end if;
end process;
end main;
architecture q4a of ichthyostega is
signal bx, cx : signed(3 downto 0);
begin
process begin
wait until (rising_edge(clk));
bx <= b;
cx <= c;
end process;
process begin
if (cx > 0) then
wait until (rising_edge(clk));
v <= bx;
else
wait until (rising_edge(clk));
v <= to_signed(-1, 4);
end if;
end process;
end q4a;
DIFFERENT: evaluations of cx > 0 and
v <= bx are separated by a clock cycle.
architecture q4b of ichthyostega is
signal bx, cx : signed(3 downto 0);
begin
process begin
wait until (rising_edge(clk));
bx <= b;
cx <= c;
wait until (rising_edge(clk));
if (cx > 0) then
v <= bx;
else
v <= to_signed(-1, 4);
end if;
end process;
end q4b;
DIFFERENT: each assignment statement (e.g. bx <= b) will execute every
other clock cycle, rather than every clock cycle.
architecture q4c of ichthyostega is
signal bx, cx, dx : signed(3 downto 0);
begin
process begin
wait until (rising_edge(clk));
bx <= b;
cx <= c;
end process;
process begin
wait until (rising_edge(clk));
v <= dx;
end process;
dx <= bx when (cx > 0)
else to_signed(-1, 4);
end q4c;
SAME
P5.11 Waveform VHDL Behavioural Comparison
Answer whether each of the VHDL code fragments q3a through q3d has the same behaviour as
the timing diagram.
NOTES: 1) Same behaviour means that the signals a, b, and c have the same values at
the end of each clock cycle in steady-state simulation (ignore any irregularities
in the first few clock cycles).
2) For full marks, if the code does not match, you must explain why.
3) Assume that all signals, constants, variables, types, etc are properly defined
and declared.
4) All of the code fragments are legal, synthesizable VHDL code.
clk
a
b
c
q3a
architecture q3a of q3 is
begin
process begin
a <= '1';
loop
wait until rising_edge(clk);
a <= NOT a;
end loop;
end process;
b <= NOT a;
c <= NOT b;
end q3a;
SAME
q3b
architecture q3b of q3 is
begin
process begin
b <= '0';
a <= '1';
wait until rising_edge(clk);
a <= b;
b <= a;
wait until rising_edge(clk);
end process;
c <= a;
end q3b;
SAME
q3c
architecture q3c of q3 is
begin
process begin
a <= '0';
b <= '1';
wait until rising_edge(clk);
b <= a;
a <= b;
wait until rising_edge(clk);
end process;
c <= NOT b;
end q3c;
SAME
q3d
architecture q3d of q3 is
begin
process (b, clk) begin
a <= NOT b;
end process;
process (a, clk) begin
b <= NOT a;
end process;
c <= NOT b;
end q3d;
DIFFERENT: this code has combinational
loops
q3e
architecture q3e of q3 is
begin
process
begin
b <= '0';
a <= '1';
wait until rising_edge(clk);
a <= c;
b <= a;
wait until rising_edge(clk);
end process;
c <= not b;
end q3e;
DIFFERENT: a is a constant 1
q3f
architecture q3f of q3 is
begin
process begin
a <= '1';
b <= '0';
c <= '1';
wait until rising_edge(clk);
a <= c;
b <= a;
c <= NOT b;
wait until rising_edge(clk);
end process;
end q3f;
DIFFERENT: a and c are constant 1
P5.12 Hardware VHDL Comparison
For each of the circuits q2a-q2d, answer
whether the signal d has the same behaviour
as it does in the main architecture of q2.
entity q2 is
port (
a, clk, reset : in std_logic;
d : out std_logic
);
end q2;
architecture main of q2 is
signal b, c : std_logic;
begin
b <= '0' when (reset = '1')
else a;
process (clk) begin
if rising_edge(clk) then
c <= b;
d <= c;
end if;
end process;
end main;
(q2a: circuit schematic with inputs clk, a, reset and output d.)
(q2b: circuit schematic with inputs clk, a, reset and output d.)
(q2c: circuit schematic with inputs clk, a, reset and output d.)
(q2d: circuit schematic with inputs clk, a, reset and output d.)
Answer:
q2a: a shouldn't be flopped. q2b: One too many FFs. q2c: Correct
operation. q2d: This will work (i.e. it has the same input-output
characteristics) but the internal description is different.
P5.13 8-Bit Register
Implement an 8-bit register that has:
clock signal clk
input data vector d
output data vector q
synchronous active-high input reset
synchronous active-high input enable
Answer:
library ieee;
use ieee.std_logic_1164.all;
entity reg_8 is
port (
clk,
reset,
enable : in std_logic;
d : in std_logic_vector (7 downto 0);
q : out std_logic_vector (7 downto 0)
);
end reg_8;
architecture main of reg_8 is
begin
reg: process
begin
wait until (rising_edge(clk));
if reset = '1' then
q <= (others => '0');
elsif enable = '1' then
q <= d;
end if;
end process reg;
end main;
P5.13.1 Asynchronous Reset
Modify your design so that the reset signal is asynchronous, rather than synchronous.
Answer:
reg : process(clk, reset)
begin
if reset = '1' then
q <= (others => '0');
elsif rising_edge(clk) then
if enable = '1' then
q <= d;
end if;
end if;
end process reg;
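Note that reset now appears in the sensitivity list along with clk: the process must wake up
whenever reset changes, so the flop responds to reset immediately rather than waiting for a
clock edge.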
P5.13.2 Discussion
Describe the tradeoffs in using synchronous versus asynchronous reset in a circuit implemented on
an FPGA.
Answer:
Synchronous resets lead to more robust designs. With an asynchronous
reset, a flop is reset whenever the reset signal arrives. Due to wire delays,
signals will arrive at different flops at different times. If an asynchronous reset
occurs at about the same time as a clock edge, some flops might be reset in
one clock cycle and some in the next. This can lead to glitches and/or illegal
values on internal state signals.
The tradeoff is that asynchronous reset is often easier to code in VHDL and
requires less hardware to implement.
P5.13.3 Testbench for Register
Write a test bench to validate the functionality of the 8-bit register with synchronous reset.
Answer:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity reg_8_tb is
end reg_8_tb;
architecture main of reg_8_tb is
component reg_8 is
port (
clk : in std_logic;
reset : in std_logic;
enable : in std_logic;
d : in std_logic_vector (7 downto 0);
q : out std_logic_vector (7 downto 0)
);
end component;
signal clk, reset, enable : std_logic;
signal d, q : std_logic_vector(7 downto 0);
begin
uut : reg_8 port map
( clk => clk,
reset => reset,
enable => enable,
d => d,
q => q
);
process begin
clk <= '1'; reset <= '0';
wait for 20 ns; -- time=20 ns
clk <= '0'; reset <= '1'; enable <= '1'; d <= "10101011";
wait for 20 ns; -- time=40 ns
clk <= '1'; reset <= '0';
wait for 20 ns; -- time=60 ns
clk <= '0'; enable <= '0'; d <= "00001011";
wait for 20 ns; -- time=80 ns
clk <= '1';
wait for 20 ns; -- time=100 ns
clk <= '0'; enable <= '1';
wait for 20 ns; -- time=120 ns
clk <= '1';
wait for 20 ns; -- time=140 ns
end process;
end main;
P5.14 Synthesizable VHDL and Hardware
For each of the fragments of VHDL q4a...q4f, answer whether the code is synthesizable. If the
code is synthesizable, draw the circuit most likely to be generated by synthesizing the datapath of
the code. If the code is not synthesizable, explain why.
q4a
process begin
wait until rising_edge(a);
e <= d;
wait until rising_edge(b);
e <= NOT d;
end process;
Answer:
Unsynthesizable: different
conditions in wait statements in
same process. This would lead
to a single flip-flop requiring
multiple clock signals.
q4b
process begin
while (c /= '1') loop
if (b = '1') then
wait until rising_edge(a);
e <= d;
else
e <= NOT d;
end if;
end loop;
e <= b;
end process;
Answer:
unsynthesizable: while loop
around code where some paths
have wait statements and some
do not. Even having a while
loop with a dynamic condition
around code without a wait
statement would be
unsynthesizable, because it
would lead to combinational
loops in the hardware.
q4c
process (a, d) begin
e <= d;
end process;
process (a, e) begin
if rising_edge(a) then
f <= NOT e;
end if;
end process;
Answer:
Flop with inverter on input
(Circuit: d drives e combinationally; an inverter on e feeds a D flip-flop clocked by a,
whose output is f.)
q4d
process (a) begin
if rising_edge(a) then
if b = '1' then
e <= '0';
else
e <= d;
end if;
end if;
end process;
Answer:
Synchronous reset (AND with
bubble). The Reset pin on a
flip-flop is generally
asynchronous, so a flop with a
reset pin would be incorrect.
(Circuit: d ANDed with NOT b feeds a D flip-flop clocked by a, output e.)
q4e
process (a,b,c,d) begin
if rising_edge(a) then
e <= c;
else
if (b = '1') then
e <= d;
end if;
end if;
end process;
Answer:
Unsynthesizable: An if
rising edge with else clause is
unsynthesizable because it
requires a signal (the select
signal for the multiplexer) to
detect a rising edge.
q4f
process (a,b,c) begin
if (b = '1') then
e <= '0';
else
if rising_edge(a) then
e <= c;
end if;
end if;
end process;
Answer:
Flop with asynchronous reset.
(Circuit: c feeds a D flip-flop clocked by a, with asynchronous reset R driven by b,
output e.)
P5.15 Datapath Design
Each of the three VHDL fragments q4a-q4c is intended to be the datapath for the same circuit.
The circuit is intended to perform the following sequence of operations (not all operations are
required to use a clock cycle):
read in source and destination addresses from i_src1, i_src2, i_dst
read operands op1 and op2 from memory
compute sum of operands sum
write sum to memory at destination address dst
write sum to output o_result
(Block diagram: inputs i_src1, i_src2, i_dst, and clk; output o_result.)
P5.15.1 Correct Implementation?
For each of the three fragments of VHDL q4a-q4c, answer whether it is a correct implementation
of the datapath. If the datapath is not correct, explain why. If the datapath is correct, answer in
which cycle you need load='1'.
NOTES:
1. You may choose the number of clock cycles required to execute the sequence of operations.
2. The cycle in which the addresses are on i_src1, i_src2, and i_dst is cycle #0.
3. The control circuitry that controls the datapath will output a signal load, which will be
'1' when the sum is to be written into memory.
4. The code fragment with the signal declarations, connections for inputs and outputs, and
the instantiation of memory is to be used for all three code fragments q4a-q4c.
5. The memory has registered inputs and combinational (unregistered) outputs.
6. All of the VHDL is legal, synthesizable code.
-- This code is to be used for
-- all three code fragments q4a--q4c.
signal state : std_logic_vector(3 downto 0);
signal src1, src2, dst, op1, op2, sum,
mem_in_a, mem_out_a, mem_out_b,
mem_addr_a, mem_addr_b
: unsigned(7 downto 0);
...
process (clk)
begin
if rising_edge(clk) then
src1 <= i_src1;
src2 <= i_src2;
dst <= i_dst;
o_result <= sum;
end if;
end process;
mem : ram256x16d
port map (clk => clk,
i_addr_a => mem_addr_a,
i_addr_b => mem_addr_b,
i_we_a => mem_we,
i_data_a => mem_in_a,
o_data_a => mem_out_a,
o_data_b => mem_out_b);
q4a
op1 <= mem_out_a when state = "0010"
else (others => '0');
op2 <= mem_out_b when state = "0010"
else (others => '0');
sum <= op1 + op2 when state = "0100"
else (others => '0');
mem_in_a <= sum when state = "1000"
else (others => '0');
mem_addr_a <= dst when state = "1000"
else src1;
mem_we <= '1' when state = "1000"
else '0';
mem_addr_b <= src2;
process (clk)
begin
if rising_edge(clk) then
if (load = '1') then
state <= "1000";
else
-- rotate state vector one bit to left
state <= state(2 downto 0) & state(3);
end if;
end if;
end process;
Answer:
The circuit is not correct: all of the signals are combinational.
Also, there could be initialization problems with state.
q4b
process (clk) begin
if rising_edge(clk) then
op1 <= mem_out_a;
op2 <= mem_out_b;
end if;
end process;
sum <= op1 + op2;
mem_in_a <= sum;
mem_we <= load;
mem_addr_a <= dst when load = '1'
else src1;
mem_addr_b <= src2;
Answer:
The circuit is correct.
load = '1' in clock cycle 3
0. inputs available
1. src1; mem_addr_a,b
2. mem_out_a,b
3. op1,2; sum; mem_in_a; load
q4c
process
begin
wait until rising_edge(clk);
op1 <= mem_out_a;
op2 <= mem_out_b;
sum <= op1 + op2;
mem_in_a <= sum;
end process;
process (load, dst, src1) begin
if load = '1' then
mem_addr_a <= dst;
else
mem_addr_a <= src1;
end if;
end process;
mem_addr_b <= src2;
Answer:
If the code is taken exactly as is:
the circuit is incorrect, because mem_we is missing.
If we assume that mem_we is added:
The circuit is correct.
Need load = '1' in cycle 5.
0. inputs available
1. src1; mem_addr_a,b
2. mem_out_a,b
3. op1,2
4. sum
5. mem_in_a; load
P5.15.2 Smallest Area
Of all of the circuits (q4a-q4c), including both correct and incorrect circuits, predict which will
have the smallest area.
If you don't have sufficient information to predict the relative areas, explain what additional
information you would need to predict the area prior to synthesizing the designs.
Answer:
Assuming that q4c includes mem_we:
All of the circuits have an adder, memory, input flops, output flops, and a mux
for mem_addr_a. The differences are in the flops and misc circuitry:
For q4a, each of the signals op1, op2, sum, mem_in_a, and mem_we is
assigned either zero or the value of another signal, depending on the state.
Because one of the inputs is a constant (0), we can implement this with an
AND gate rather than a mux. Each bit of each signal requires one AND gate.
We have five signals of eight bits each, therefore we need 5*8 = 40 AND
gates.
        q4a    q4b    q4c
flops   1*4    2*8    4*8
ANDs    5*8    0      0
From this analysis, q4a has the smallest area. There is the implicit
assumption that an AND gate is much smaller than a FF.
P5.15.3 Shortest Clock Period
Of all of the circuits (q4a-q4c), including both correct and incorrect circuits, predict which will
have the shortest clock period.
If you don't have sufficient information to predict the relative periods, explain what additional
information you would need to predict the period prior to performing any synthesis or timing
analysis of the designs.
Answer:
Assuming that the memory is not on the critical path, q4c has the shortest
clock period, because it does the least amount of computation between
flip-flops: all of the signals are flopped.
Chapter 6
Design Problems
P6.1 Synthesis
This question is about using VHDL to implement memory structures on FPGAs.
P6.1.1 Data Structures
If you have to write your own code (i.e. you do not have a library of memory components or a
special component generation tool such as LogiBlox or CoreGen), what data structures in VHDL
would you use when creating a register file?
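One common answer, sketched below, is to declare a user-defined array type of
std_logic_vector and index it with the read and write addresses; the synthesis tool can then
map the array onto the FPGA's register or memory resources. The entity name, widths, depth,
and port names in this sketch are illustrative assumptions, not part of the question.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity regfile is
  port (
    clk    : in  std_logic;
    we     : in  std_logic;
    w_addr : in  unsigned(3 downto 0);
    w_data : in  std_logic_vector(7 downto 0);
    r_addr : in  unsigned(3 downto 0);
    r_data : out std_logic_vector(7 downto 0)
  );
end regfile;
architecture main of regfile is
  -- the register file itself: an array of sixteen 8-bit words
  type regfile_ty is array (0 to 15) of std_logic_vector(7 downto 0);
  signal regs : regfile_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      if we = '1' then
        regs(to_integer(w_addr)) <= w_data;  -- synchronous write
      end if;
    end if;
  end process;
  r_data <= regs(to_integer(r_addr));        -- asynchronous read
end main;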
P6.1.2 Own Code vs Libraries
When using VHDL for an FPGA, under what circumstances is it better to write your own VHDL
code for memory, rather than instantiate memory components from a library?
P6.2 Design Guidelines
While you are grocery shopping you encounter your co-op supervisor from last year. She's now
forming a startup company in Waterloo that will build digital circuits. She's writing up the design
guidelines that all of their projects will follow. She asks for your advice on some potential
guidelines.
What is your response to each question?
What is your justication for your answer?
What are the tradeoffs between the two options?
0. (Sample) Should all projects use silicon chips, or should all use biological chips, or should
each project choose its own technique?
Answer: All projects should use silicon-based chips, because biological chips don't
exist yet. The tradeoff is that if biological chips existed, they would probably
consume less power than silicon chips.
1. Should all projects use an asynchronous reset signal, or should all use a synchronous
reset signal, or should each project choose its own technique?
Answer:
Synchronous reset: Synchronous reset leads to more robust designs.
With asynchronous reset, a flop is reset whenever the reset signal
arrives. Due to wire delays, signals will arrive at different flops at different
times. If an asynchronous reset occurs at about the same time as a clock
edge, some flops might be reset in one clock cycle and others in the next.
This can lead to glitches and/or illegal values on internal state signals.
The tradeoff is that asynchronous reset is often easier to code in VHDL
and requires less hardware to implement.
2. Should all projects use latches, or should all projects use flip-flops, or should each project
choose its own technique?
Answer:
Flops: Flip-flops lead to more robust designs than latches. Latches are
level sensitive and act as wires when enabled. For a latch-based design
to work correctly, there cannot be any overlap in the time when a
consecutive pair of latches are enabled. If this happens, the value on a
signal will leak through the latch and arrive at the next set of latches one
clock phase too early. Thus, latch-based designs are more sensitive to
the timing of clock signals. Another disadvantage of latches is that some
FPGAs and cell libraries do not support them. In comparison, D-type
flip-flops are almost always supported.
The tradeoff is that latches are smaller and faster than flip-flops. A
common implementation of a flip-flop is a pair of latches in a master/slave
combination.
3. Should all chips have registers on the inputs and outputs or should chips have the inputs
and outputs directly connected to combinational circuitry, or should each project
choose its own technique? By register we mean either flip-flops or latches, based upon
your answer to the previous question. If your answer is different for inputs and outputs,
explain why.
Answer:
Flops on outputs and inputs: Putting flops on inputs and outputs will
make the clock speed of the chip less dependent on the propagation
delay between chips. Flops can also be used to isolate the internals of
the chip from glitches and other anomalous behaviour that can occur on
the boards.
The tradeoff is that flops consume area and will increase the latency
through the chip.
4. Should all circuit modules on all chips have flip-flops on the inputs and outputs or should
chips have the inputs and outputs directly connected to combinational circuitry, or
should each project choose its own technique? By register we mean either flip-flops or
latches, based upon your answer to the previous question. If your answer is different for
inputs and outputs, explain why.
Answer:
Each project should adopt a convention of either using flops on inputs of
modules or outputs of modules. It is rarely necessary to put flops on both
inputs and outputs of modules on the same chip. This is because the
wire delay between modules is usually less than a clock period. Putting
flops on either the inputs or outputs is advantageous because it provides
a standard design convention that makes it easier to glue modules
together without violating timing constraints. If modules were allowed to
have combinational circuitry on both inputs and outputs, the maximum
clock speed of the design could not be determined until all of the modules
were glued together.
The tradeoff is that flops add area and latency. Sometimes there will be
two modules where the combinational circuitry on the outputs of one can
be combined with the combinational circuitry on the inputs of the second
without violating timing constraints. This discipline prevents that
optimization.
Aside: Sometimes, to meet performance targets, in situations such as
this, a project will remove or move the flops between modules and do
clock borrowing to fit the maximum amount of circuitry into a clock
period. This is a rather low-level optimization that happens late in the
design cycle. It can cause big headaches for functional validation and
equivalence verification, because the specifications for modules are no
longer clean and the boundaries between modules on the low-level
design might be different from the boundaries in the high-level design.
5. Should all projects use tri-state buffers, or should all projects use multiplexors, or should
each project choose its own technique?
Answer:
Multiplexors: Multiplexors lead to more robust designs. Tri-state buffers
rely on analog characteristics of devices to work correctly, and can work
incorrectly in the presence of voltage fluctuations or fabrication process
variations. Multiplexors work on a purely Boolean level and as such are
less sensitive to changes in voltages or fabrication processes.
The tradeoff is that tri-state buffers are smaller and faster than multiplexors. It
should be noted that some designs require tri-state buffers, especially
circuits that use a shared bus among many devices. Bi-directional pins,
shared busses, and semiconductor memories all need tri-state buffers to
work correctly.
P6.3 Dataflow Diagram Optimization
Use the dataflow diagram below to answer problems P6.3.1 and P6.3.2.
(Dataflow diagram: inputs a, b, c, d, e feeding f and g components through registers.)
P6.3.1 Resource Usage
List the number of items for each resource used in the dataflow diagram.
Answer:
input ports 3
output ports 1
registers 4
f components 2
g components 1
P6.3.2 Optimization
Draw an optimized dataflow diagram that improves the performance and produces the same
output values. Or, if the performance cannot be improved, describe the limiting factor on the
performance.
NOTES:
you may change the times when signals are read from the environment
you may not increase the resource usage (input ports, registers, output ports, f components,
g components)
you may not increase the clock period
Answer:
(Optimized dataflow diagram: inputs a, b, c, d, e with the same f and g components.)
P6.4 Dataflow Diagram Design
Your manager has given you the task of implementing the following pseudocode in an FPGA:
if is_odd(a + d)
p = (a + d)*2 + ((b + c) - 1)/4;
else
p = (b + c)*2 + d;
NOTES: 1) You must use registers on all input and output ports.
2) p, a, b, c, and d are to be implemented as 8-bit signed signals.
3) A 2-input 8-bit ALU that supports both addition and subtraction takes 1
clock cycle.
4) A 2-input 8-bit multiplier or divider takes 4 clock cycles.
5) A small amount of additional circuitry (e.g. a NOT gate, an AND gate, or a
MUX) can be squeezed into the same clock cycle(s) as an ALU operation,
multiply, or divide.
6) You can require that the environment provides the inputs in any order and
that it holds the input signals at the same value for multiple clock cycles.
P6.4.1 Maximum Performance
What is the minimum number of clock cycles needed to implement the pseudocode with a circuit
that has two input ports?
Answer:
Optimizations:
Multiplication by a constant power of 2 can be done without hardware, just
connect the wires between the signals. For example, if we have
a <= b*2;, we can do this with a(1) <= b(0); a(2) <= b(1); etc., and a(0) <= '0'.
Testing if a signal is odd or even can be done simply by extracting the least
significant bit of the signal.
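As an illustration, a minimal synthesizable sketch of these two tricks is shown below; the entity
and signal names are made up for this example and are not part of the original problem.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity shift_tricks is
  port (
    b        : in  signed(7 downto 0);
    b_times2 : out signed(7 downto 0);
    b_is_odd : out std_logic
  );
end shift_tricks;
architecture main of shift_tricks is
begin
  -- multiply by 2 using wiring only: shift left by one bit, fill bit 0 with '0'
  b_times2 <= b(6 downto 0) & '0';
  -- a two's-complement number is odd exactly when its least-significant bit is '1'
  b_is_odd <= b(0);
end main;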
(Dataflow diagram for the odd case: inputs b, c, d, a and the constant 1.)
(Dataflow diagram for the even case: inputs b, c, d.)
The even flow requires 4 clock cycles (3 cycles in the datapath plus one more
because we have to have flops on both inputs and outputs). Therefore the total
design will require 4 clock cycles.
What is the minimum number of ALUs, multipliers, and dividers needed to achieve the minimum
number of clock cycles that you just calculated?
Answer:
(Dataflow diagram for the entire circuit: inputs b, c, d, a and the constant -1; XOR and AND
gates select between the odd and even cases; the multiply-by-2 is a wired shift left by 1.)
4 clock cycles
2 ALUs
0 dividers
0 multipliers
P6.4.2 Minimum area
What is the minimum number of datapath storage registers (8, 6, 4, and 1 bit) and clock cycles
needed to implement the pseudocode if the circuit can have at most one ALU, one multiplier, and
one divider?
Answer:
(Dataflow diagram: inputs b, c, d, a and the constant -1, with the multiply-by-2 done as a wired
shift left by 1.)
5 clock cycles
3 8-bit registers
0 6-bit registers
0 4-bit registers
0 1-bit registers
P6.5 Michener: Design and Optimization
Design a circuit named michener that performs the following operation: z = (a+d) + ((b -
c) - 1)
NOTES:
1. Optimize your design for area.
2. You may schedule the inputs to arrive at any time.
3. You may do algebraic transformations of the specification.
Answer:
(Data-dependency graph: z = (a + d) + ((b - c) - 1).)
(Dataflow diagram: inputs a, d, b, c and the constant 1.)
P6.6 Dataflow Diagrams with Memory Arrays
Component Delay
Register 5 ns
Adder 25 ns
Subtracter 30 ns
ALU with +, , >, =, , AND, XOR 40 ns
Memory read 60 ns
Memory write 60 ns
Multiplication 65 ns
2:1 Multiplexor 5 ns
NOTES:
1. The inputs of the algorithms are a and b.
2. The outputs of the algorithms are p and q.
3. You must register both your inputs and outputs.
4. You may choose to read your input data values at any time and produce your outputs at any
time. For your inputs, you may read each value only once (i.e. the environment will not
send multiple copies of the same value).
5. Execution time is measured from when you read your first input until the later of producing
your last output or the completion of writing a result to memory.
6. M is an internal memory array, which must be implemented as dual-ported memory with
one read/write port and one read port.
P6.6.1 Algorithm 1 47
7. M supports synchronous write and asynchronous read.
8. Assume all memory address and other arithmetic calculations are within the range of
representable numbers (i.e. no overflows occur).
9. If you need a circuit not on the list above, assume that its delay is 30 ns.
10. You may sacrifice area efficiency to achieve high performance, but marks will be deducted
for extra hardware that does not contribute to performance.
P6.6.1 Algorithm 1
Algorithm
q = M[b];
M[a] = b;
p = M[b+1] * a;
Assuming a ≤ b, draw a dataflow diagram that is optimized for the fastest overall execution
time.
Answer:
1. a ≤ b means that the addresses (a and b+1) are not equal to each
other, which allows writing to M[a] to be done in parallel with reading
from M[b+1]. We must read from M[b] before we write to M[a],
because it could be that b and a are the same address.
2. Initial dataflow diagram:
(Dataflow diagram: read M[b] to produce q; write b into M[a]; compute b+1, read M[b+1],
and multiply by a to produce p.)
3. Find the critical path
(The same diagram annotated with component delays: 25 ns adder, 60 ns memory reads and
write, 65 ns multiplier.)
Critical path is from b to p: 150 ns.
4. Explore performance with different clock periods
(Two pipelined versions of the diagram, with 5 ns registers inserted at different points:)
period 70 ns: latency 4 cycles, time 280 ns
period 90 ns: latency 3 cycles, time 270 ns
5. Minimum latency is 3 clock cycles because we can't do all memory
operations in parallel and we need registers on both inputs and outputs.
6. Best performance is with a clock period of 90 ns.
7. Resource usage:
Component Quantity
Input 1
Output 1
Register 5 (including mem array)
Adder 1
Memory read 2
Memory write 1
Multiplication 1
Clock Period 90 ns
Latency 3 cycles
Execution Time 270 ns
P6.6.2 Algorithm 2
q = M[b];
M[a] = q;
p = (M[b-1] * b) + M[b];
Assuming a > b, draw a dataflow diagram that is optimized for the fastest overall execution
time.
Answer:
1. a > b means that a ≠ b and a ≠ b-1, so there are no memory
address conflicts to create dependencies. There is a data-dependency
through q from M[b] to M[a]. The resource constraint of the dual-port
memory array also prevents us from doing all three memory operations
in parallel.
2. Explore performance with different clock periods
(Two pipelined dataflow diagrams for Algorithm 2:)
period 70 ns: latency 5 cycles, time 350 ns
period 95 ns: latency 3 cycles, time 285 ns
3. Area optimization: change b - 1 to b + (-1).
(Area-optimized dataflow diagram: the subtracter for b - 1 is replaced by an adder with the
constant -1.)
4. Resource usage:
Component Quantity
Input 1
Output 1
Register 5 (including mem array)
Adder 1
Memory read 2
Memory write 1
Multiplication 1
Clock Period 95 ns
Latency 3 cycles
Execution Time 285 ns
P6.7 2-bit adder
This question compares an FPGA and a generic-gates implementation of a 2-bit full adder.
P6.7.1 Generic Gates
Show the implementation of a 2 bit adder using NAND, NOR, and NOT gates.
P6.7.2 FPGA
Show the implementation of a 2 bit adder using generic FPGA cells; show the equations for the
lookup tables.
(FPGA implementation: two generic FPGA cells, each a lookup table (comb) driving a flip-flop
with CE, S, R, D, and Q pins; inputs a[0], b[0], a[1], b[1], c_in; outputs sum[0], sum[1], c_out;
internal signal carry_1.)
P6.8 Sketches of Problems
1. calculate resource usage for a dataflow diagram (input ports, output ports, registers,
datapath components)
2. calculate performance data for a dataflow diagram (clock period and number of cycles to
execute (CPI))
3. given a dataflow diagram, calculate the clock period that will result in the optimum
performance
4. given an algorithm, design a dataflow diagram
5. given a dataflow diagram, design the datapath and finite state machine
6. optimize a dataflow diagram to improve performance or reduce resource usage
7. given an FSM diagram, pick the VHDL code that best implements the diagram (correct
behaviour; simple, fast hardware) or critique hardware
Chapter 7
Functional Verification Problems
P7.1 Carry Save Adder
1. Functionality Briefly describe the functionality of a carry-save adder.
2. Testbench Write a testbench for a 16-bit combinational carry save adder.
3. Testbench Maintenance Modify your testbench so that it is easy to change the width of the
adder and the latency of the computation.
NOTES:
(a) You do not need to support pipelined adders.
(b) VHDL generics might be useful.
P7.2 Traffic Light Controller
P7.2.1 Functionality
Briefly describe the functionality of a traffic-light controller that has sensors to detect the
presence of cars.
Answer:
Given a normal traffic light, which spends a constant amount of time as green
in each direction, add the following two transitions to the system:
1. If the less-busy road does not have any cars present for t1 minutes, then
transition the traffic light to make the busier of the two roads green.
2. If the busy road has a car waiting for t2 minutes, then transition the traffic
light to make the busier of the two roads green.
P7.2.2 Boundary Conditions
Make a list of boundary conditions to check for your traffic light controller.
Answer:
1. A car arrives at the intersection and triggers the sensor, but makes a
right turn before the light turns green in its direction. Should the light turn
to green in the direction of the now vacant road, or stay green in the
current direction?
2. Same as 1, but the car makes a right turn after the other road already has a
yellow light. Should the light turn to green in the direction of the now
vacant road, or transition from yellow back to green, or very briey stay
green in the vacant direction?
3. If the less-busy road is yellow, there's no car at the busy road, and a car
arrives at the less busy road. Same questions as the rst two situations.
P7.2.3 Assertions
Make a list of assertions to check for your traffic light controller.
Answer:
1. if a light is green, the next colour will be yellow
2. if a light is yellow, the next colour will be red
3. if a light is red, the next colour will be green
4. if no car has been at the less-busy road for at least t1 minutes, then the
less-busy road is red.
5. if the car sensor has been continuously on for the busy road for at least
t2 minutes, then the busy road is green.
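As an illustration of how such assertions can be checked in a testbench, the sketch below monitors
assertion 1 (a green light is always followed by yellow). The light encoding, signal names, and the
choice to sample on the clock edge are assumptions for the example, not part of the problem
statement.
library ieee;
use ieee.std_logic_1164.all;
entity light_monitor is
  port (
    clk      : in std_logic;
    ns_light : in std_logic_vector(1 downto 0)  -- "00" red, "01" green, "10" yellow
  );
end light_monitor;
architecture main of light_monitor is
begin
  process (clk)
    variable prev : std_logic_vector(1 downto 0) := "00";
  begin
    if rising_edge(clk) then
      -- when the light leaves green, the new colour must be yellow
      if prev = "01" and ns_light /= "01" then
        assert ns_light = "10"
          report "green was not followed by yellow" severity error;
      end if;
      prev := ns_light;
    end if;
  end process;
end main;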
P7.3 State Machines and Verification
P7.3.1 Three Different State Machines
(Figure 7.1: A very simple machine, with four states s0, s1, s2, s3.)
(Figure 7.2: A very big machine, with ten states s0 through s9.)
(Figure 7.3: A concurrent machine: a three-state machine s0, s1, s2 running in parallel with a
five-state machine q0 through q4.)
(Figure 7.4: Legend: edges are labelled input/output; * = don't care.)
Answer each of the following questions for the three state machines in figures 7.1-7.3.
P7.3.1.1 Number of Test Scenarios
How many test scenarios (sequences of test vectors) would you need to fully validate the
behaviour of the state machine?
P7.3.1.2 Length of Test Scenario
What is the maximum length (number of test vectors) in a test scenario for the state machine?
P7.3.1.3 Number of Flip Flops
Assuming that neither the inputs nor the outputs are registered, what is the minimum number of
flip-flops needed to implement the state machine?
Answer:
             scenarios                                          max len   min flops
Figure 7.1   sequence        expected behaviour
             1)  000         s0, s2, s3, s0
             2)  001         s0, s2, s3, s0
             3)  010         s0, s2, s3, s0
             4)  011         s0, s2, s3, s0
             5)  1000        s0, s1, s2, s3, s0
             6)  1001        s0, s1, s2, s3, s0
             ...
             12) 1111        s0, s1, s2, s3, s0                 4         2
Figure 7.2   sequence        expected behaviour
             1)    0000000000  s0, s1, s2, ..., s9, s0
             2)    0000000001  s0, s1, s2, ..., s9, s0
             1024) 1111111111  s0, s1, s2, ..., s9, s0          10        4
Figure 7.3   sequence        expected behaviour
             1)    0...00    (s0,q0), (s1,q1), (s2,q2), (s0,q3),
                             (s1,q4), (s2,q0), (s0,q1), (s1,q2),
                             (s2,q3), (s0,q4), (s1,q0), (s2,q1),
                             (s0,q2), (s1,q3), (s2,q4), (s0,q0)
             2)    0...01    same behaviour
             2^15) 1...11    same behaviour                     15        5 or 4
For Figure 7.3, if we implement each machine separately we need 5 flops, 2 for
the S machine and 3 for the Q machine. If we merge the state machines, we
need ceil(log2(3*5)) = 4 flops.
One of the purposes of this exercise is to illustrate how many test vectors it
requires to exhaustively test the behaviour of even simple circuits. Also, this
demonstrates how the structure of a circuit affects the number of test vectors
needed. Size alone is not the determining factor.
P7.3.2 State Machines in General
If a circuit has i signals of 1-bit each that are inputs, f 1-bit signals that are outputs of flip-flops
and c 1-bit signals that are the outputs of combinational circuitry, what is the maximum number
of states that the circuit can have?
Answer:
The maximum number of states for a circuit with i inputs and f ops is 2
i+f
.
The values of combinational signals are determined by the ops and the
inputs, and so they dont contribute to the total number of states. Each output
is either a combinational signal or the output of a ip op, so the outputs are
subsumed by the combinational and opped signals.
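For example, a circuit with i = 2 inputs and f = 3 flip-flops can be in at most
2^(2+3) = 32 distinct states, no matter how much combinational logic it contains.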
P7.4 Test Plan Creation
You're on the functional verification team for a chip that will control a simple portable CD-player.
Your task is to create a plan for the functional verification for the signals in the entity
cd_digital.
You've been told that the player behaves just like all of the other CD players out there. If your
test plan requires knowledge about any potential non-standard features or behaviour, you'll need
to document your assumptions.
(Front panel: pwr, prev, next, stop, play buttons; display shows track, min, sec.)
entity cd_digital is
port (
----------------------------------------------------
-- buttons
prev,
stop,
play,
next,
pwr : in std_logic;
----------------------------------------------------
-- detect if player door is open
open : in std_logic;
----------------------------------------------------
-- output display information
track : out std_logic_vector(3 downto 0);
min : out unsigned(6 downto 0);
sec : out unsigned(5 downto 0)
);
end cd_digital;
P7.4.1 Early Tests
Describe five tests that you would run as soon as the VHDL code is simulatable. For each test,
describe your specification, stimulus, and check. Summarize why your collection of
tests should be the first tests that are run.
Answer:
test1
specification when power is turned on, the display will show the number
of tracks on the CD, and the minutes and seconds will show the total
length of the CD.
stimulus power=0; wait; power=1, all other signals are 0.
check display outputs of circuit match specication
test2
specification when power is on, play starts the CD playing, display for
track=1, min and sec show remaining time for song and start
decrementing.
stimulus power=1; play=0; wait; play=1, all other
signals are 0.
check display outputs of circuit match specication
test3
specification when power is on and a CD is playing, next starts the next song.
Display for track increments, min and sec show remaining time for
next song and start decrementing.
stimulus power=1; play=0; next=0; wait;
play=1; wait; next=1, all other signals are 0.
check display outputs of circuit match specication
test4
specification when power is on and a CD is playing, prev starts the previous
song. Display for track decrements, min and sec show remaining
time for previous song and start decrementing.
stimulus power=1; play=0; prev=0; wait;
play=1; wait; prev=1, all other signals are 0.
check display outputs of circuit match specication
test5
specification when power is on and a CD is playing, stop causes the CD to
stop.
stimulus power=1; play=0; stop=0; wait;
play=1; wait; stop=1, all other signals are 0.
check display outputs of circuit match specication
justification for choices
These cases test the basic operations of the CD player. Each test
focusses on a different aspect of the player's behaviour.
P7.4.2 Corner Cases
Describe five corner-cases or boundary conditions, and explain the role of corner cases and
boundary conditions in functional verification.
NOTES:
1. You may reference your answer for problem P3.4.1 in this question.
2. If you do not know what a corner case or boundary condition is, you may earn partial
credit by: checking this box and explaining five things that you would do in functional
verification.
Answer:
case 1 : press both prev and next while a CD is playing
case 2 : open the case while a CD is playing
case 3 : press play and stop at the same time
case 4 : press any button other than power when the player is off
case 5 : press next repeatedly until track counter wraps around
role of corner cases : The purpose of corner cases is to test unusual
situations that designers might not have thought of, and so are more
likely to contain bugs than normal behaviour.
P7.5 Sketches of Problems
1. Given a circuit, VHDL code, or circuit size info; calculate simulation run time to achieve
n% coverage.
2. Given a fragment of VHDL code, list things to do to make it more robust e.g. illegal data
and states go to initial state.
3. Smith Problem 13.29
Chapter 8
Performance Analysis and Optimization
Problems
P8.1 Farmer
A farmer is trying to decide which of his two trucks to use to transport his apples from his orchard
to the market.
Facts:
                     capacity of truck   speed when loaded with apples   speed when unloaded (no apples)
big truck            12 tonnes           15 kph                          38 kph
small truck          6 tonnes            30 kph                          70 kph
distance to market   120 km
amount of apples     85 tonnes
NOTES:
1. All of the loads of apples must be carried using the same truck
2. Elapsed time is counted from beginning to deliver the first load to returning to the orchard after
the last load
3. Ignore time spent loading and unloading apples, coffee breaks, refueling, etc.
4. For each trip, a truck travels either its fully loaded or empty speed.
Question: Which truck will take the least amount of time and what percentage faster
will the truck be?
Answer:
TimeTot = NumTrips × (TimeLoaded + TimeUnloaded)
NumTrips = ceil(Harvest / Capacity)
All trips are for the same distance, so distance cancels out of the equations:
Time ∝ 1/Speed
TimeTotBig ∝ ceil(85/12) × (1/15 + 1/38) = 8 × 0.0930 = 0.7439
TimeTotSmall ∝ ceil(85/6) × (1/30 + 1/70) = 15 × 0.0477 = 0.7143
The small truck will take less time.
PctFaster = (TimeSlow - TimeFast) / TimeFast
          = (TimeTotBig - TimeTotSmall) / TimeTotSmall
          = (0.7439 - 0.7143) / 0.7143
          = 4.15%
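Plugging the 120 km distance back in as a check: the big truck makes ceil(85/12) = 8 round
trips of 120/15 + 120/38, about 11.2 hours each, roughly 89 hours total; the small truck makes
15 round trips of 120/30 + 120/70, about 5.7 hours each, roughly 86 hours total, confirming
that the small truck is about 4% faster.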
Question: In planning ahead for next year, is there anything the farmer could do to
decrease his delivery time with little or no additional expense? If so, what is it, if not,
explain.
Answer:
Use two drivers
Use a combination of the small truck and large truck to improve his utilization.
P8.2 Network and Router
In this question there is a network that runs a protocol called BigLan. You are designing a router
called the DataChopper that routes packets over the network running BigLan (i.e. they're BigLan
packets).
The BigLan network protocol runs at a data rate of 160 Mbps (Mega bits per second). Each
BigLan packet contains 100 Bytes of routing information and 1000 Bytes of data.
You are working on the DataChopper router, which has the following performance numbers:
75 MHz   clock speed
4        cycles for a byte of either data or header
500      number of additional clock cycles to process the routing information for a packet
P8.2.1 Maximum Throughput
Which has a higher maximum throughput (as measured in data bits per second; that is, only the
payload bits count as useful work), the network or your router, and how much faster is it?
Answer:
Data throughput can be thought of as useful data / time. So, often in these
types of questions you will have to do the following:
total data/time * useful data/total data.
The maximum data throughput of the two technologies in terms of bits can be
calculated as follows:
1. BigLan Network Protocol
Maximum data throughput = 160 Mbps * (8000 useful data bits per packet / 8800 total data bits per packet)
= 145.45 Mbps
2. DataChopper Router
Time required for a packet = 500 clock cycles
+ 0.5 CPI per data bit * 8800 packet bits
= 500 clock cycles + 4400 clock cycles
= 4900 clock cycles
= 4900 clock cycles * 13.33 ns per cycle
= 65333 ns per packet
Time required for a data bit = 65333 ns per packet / 8000 data bits
= 8.167 ns per data bit
Maximum data throughput = 1 / 8.167 ns per data bit
= 122.46 Mbps
You could also use the previous method: = cycles/sec * total bytes/cycle * useful bytes/total bytes
The network has a higher maximum throughput.
What percentage higher?
n% higher performance = (perf high - perf low) / perf low
= (145 - 122)/122
= 19%
The network has 19% higher maximum performance. Therefore, the router
can't keep up with the network.
P8.2.2 Packet Size and Performance
Explain the effect of an increase in packet length on the performance of the DataChopper (as
measured in the maximum number of bits per second that it can process) assuming the header
remains constant at 100 bytes.
Answer:
As packet size increases, the overhead associated with the constant routing
delay will become less significant.
The data rate of the router will slowly approach that of the network but it will
never surpass the network throughput. If there was not any overhead for
routing, the peak data rate for the router would be 150 Mbps compared to 160
Mbps of the network.
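As a check of the 150 Mbps figure: with n data bytes per packet and the 100-byte header, one
packet takes 500 + 4 × (100 + n) cycles, so the router's data throughput is
8n × 75 MHz / (900 + 4n) bits per second, which approaches 8 × 75 MHz / 4 = 150 Mbps as n grows.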
It should be noted that even though a giant packet size would seem like an
ideal solution in this question, in reality lost packets, latency, and small data
sizes would make this impractical. For example, if each packet was 1 GB and
the network was transmitting a cell-phone conversation, you would have to
wait a very long time for the first packet to arrive before you could hear the
other person. Also, if a packet was lost, you'd have to wait a long time to see
if the other person is still on the phone.
P8.3 Performance Short Answer
If performance doubles every two years, by what percentage does performance go up every
month? This question is similar to compound growth from your economics class.
Answer:
P = 2^(t/24)    (where t is measured in months)
For one month: P = 2^(1/24) = 1.029
Therefore, performance goes up by 2.9% each month.
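(As a check: 1.029^24 ≈ 2.0, so a 2.9% gain each month compounds to a doubling over 24 months.)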
P8.4 Microprocessors
The Yme microprocessor is very small and inexpensive. One performance sacrifice the designers
have made is to not include a multiply instruction. Multiplies must be written in software using
loops of shifts and adds.
The Yme currently ships at a clock frequency of 200MHz and has an average CPI of 4.
A competitor sells the Y!v1 microprocessor, which supports exactly the same instructions as the
Yme. The Y!v1 runs at 150MHz, and the average program is 10% faster on the Yme than it is on
the Y!v1.
P8.4.1 Average CPI
Question: What is the average CPI for the Y!v1? If you don't have enough
information to answer this question, explain what additional information you need
and how you would use it?
Answer:
Use the following subscripts: Yme = 1, Y!v1 = 2, Y!u2 = 3.
The Yme is 10% faster than the Y!v1.
NumInst_2 = NumInst_1
ClockSpeed_1 = 200 MHz
ClockSpeed_2 = 150 MHz
CPI_1 = 4
Solve for CPI_2.
Time = NumInst × CPI / ClockSpeed
(Time_2 - Time_1) / Time_1 = 0.10
Time_2 / Time_1 = 1.10
Time_2 = 1.10 × Time_1
NumInst_2 × CPI_2 / ClockSpeed_2 = 1.10 × NumInst_1 × CPI_1 / ClockSpeed_1
CPI_2 = 1.10 × ClockSpeed_2 × NumInst_1 × CPI_1 / (NumInst_2 × ClockSpeed_1)
      = 1.10 × ClockSpeed_2 × CPI_1 / ClockSpeed_1
      = 1.10 × 150 MHz × 4 / 200 MHz
      = 3.3
Common mistakes:
Swapping performance of Yme and Y!v1.
A new version of the Y!, the Y!u2 has just been announced. The Y!u2 includes a multiply
instruction and runs at 180MHz. The Y!u2 publicity brochures claim that using their multiply
instruction, rather than shift/add loops, can eliminate 10% of the instructions in the average
program. The brochures also claim that the average performance of Y!u2 is 30% better than that
of the Y!v1.
P8.4.2 Why not you too?
Question: Assuming the advertising claims are true, what is the average CPI for the
Y!u2? If you don't have enough information to answer this question, explain what
additional information you need and how you would use it?
Answer:
1.3 × Time_3 = Time_2
1.3 × NumInst_3 × CPI_3 / ClockSpeed_3 = NumInst_2 × CPI_2 / ClockSpeed_2
Solve for CPI_3:
CPI_3 = ClockSpeed_3 × NumInst_2 × CPI_2 / (1.3 × NumInst_3 × ClockSpeed_2)
      = 180 MHz × 3.3 / (1.3 × 0.9 × 150 MHz)
      = 3.38
Common mistakes:
Comparing performance of Y!u2 to Yme, rather than Y!v1.
Saying that time for Y!u2 is 70% of Y!v1.
Forgetting to take into account the reduced number of instructions.
P8.4.3 Analysis
Question: Which of the following do you think is most likely and why.
1. the Y!u2 is basically the same as the Y!v1 except for the multiply
2. the Y!u2 designers made performance sacrices in their design in order to include a
multiply instruction
3. the Y!u2 designers performed other signicant optimizations in addition to creating a
multiply instruction
Answer:
The most likely analysis is that the Y!u2 is basically the same as the Y!v1
except for the multiply. This is because the Y!u2 has a slightly larger CPI
than the Y!v1, which is in keeping with the addition of a multiply instruction. A
multiply instruction probably has a larger-than-average CPI.
The increase in clock speed likely comes from a new fabrication process, and
would not have required signicant changes to the design of the chip.
P8.5 Dataflow Diagram Optimization
Draw an optimized dataflow diagram that improves the performance and produces the same
output values. Or, if the performance cannot be improved, describe the limiting factor on the
performance.
NOTES:
you may change the times when signals are read from the environment
you may not increase the resource usage (input ports, registers, output ports, f components, g
components)
you may not increase the clock period
(Dataflow diagram before optimization: inputs a, b, c, d, e; f and g components with registers.)
(Dataflow diagram after optimization: the same inputs and f and g components.)
P8.6 Performance Optimization with Memory Arrays
This question deals with the implementation and optimization for the algorithm and library of
circuit components shown below.
Algorithm
q = M[b];
if (a > b) then
M[a] = b;
p = (M[b-1] * b) + M[b];
else
M[a] = b;
p = M[b+1] * a;
end;
Component Delay
Register 5 ns
Adder 25 ns
Subtracter 30 ns
ALU with +, , >, =, , AND, XOR 40 ns
Memory read 60 ns
Memory write 60 ns
Multiplication 65 ns
2:1 Multiplexor 5 ns
NOTES:
1. 25% of the time, a > b
2. The inputs of the algorithm are a and b.
3. The outputs of the algorithm are p and q.
4. You must register both your inputs and outputs.
5. You may choose to read your input data values at any time and produce your outputs at any
time. For your inputs, you may read each value only once (i.e. the environment will not
send multiple copies of the same value).
6. Execution time is measured from when you read your first input until the later of producing
your last output or the completion of writing a result to memory.
7. M is an internal memory array, which must be implemented as dual-ported memory with
one read/write port and one read port.
8. Assume all memory address and other arithmetic calculations are within the range of
representable numbers (i.e. no overflows occur).
9. If you need a circuit not on the list above, assume that its delay is 30 ns.
10. Your dataflow diagram must include circuitry for computing a > b and using the result to
choose the value for p
Draw a dataflow diagram for each operation that is optimized for the fastest overall execution
time.
NOTE: You may sacrifice area efficiency to achieve high performance, but marks will be
deducted for extra hardware that does not contribute to performance.
Answer:
a > b (25%)
q = M[b];
M[a] = b;
p = (M[b-1] * b) + M[b];
a ≤ b (75%)
q = M[b];
M[a] = b;
p = M[b+1] * a;
1. a ≤ b happens 75% of the time, so initially focus on the common case.
(a) a ≤ b means that a ≠ b+1, therefore we can do the M[b+1] read in
parallel with the M[a] write or with the M[b] read.
(b) But, we could have a = b, so we can't do the M[a] write in parallel with the M[b]
read.
[Figure: unpipelined dataflow diagram for the a ≤ b case, with memory reads/writes (60 ns), subtract (25 ns), and multiply (65 ns); combinational path of 150 ns from b to p]
(c) Critical path is from b to p: 150ns + 5ns for mux on p = 155ns.
(d) Longest operation in diagram is multiplication: 65ns.
(e) Minimum clock period is 65ns + 5ns for register = 70ns.
[Figure: three dataflow diagrams for the a ≤ b case, scheduled in 5, 4, and 3 clock cycles (registers add 5 ns per stage)]
    period     70 ns      75 ns      90 ns
    latency    5 cycles   4 cycles   3 cycles
    time       350 ns     300 ns     270 ns
(f) Minimum latency is 3 clock cycles, because we can't do all memory operations in parallel and we need registers on both inputs and outputs.
(g) Best overall performance for the a ≤ b case is with a clock period of 90 ns.
2. Now try a > b with 90 ns clock period.
(a) a > b means that a ≠ b and a ≠ b-1, so there are no memory address conflicts to create dependencies and complications.
[Figure: two dataflow diagrams for the a > b case, scheduled in 4 and 3 clock cycles]
    period     90 ns      95 ns
    latency    4 cycles   3 cycles
    time       360 ns     285 ns
(b) Without going to a triple-ported memory, we can't reduce the latency below 3.
(c) Best performance for the a > b case is with a clock period of 95 ns.
3. Choose 95 ns clock period, which gives a latency of 3 clock cycles for
both options.
4. Optimize the dataflow diagrams to reduce area without sacrificing performance.
[Figure: area-optimized dataflow diagrams for the a ≤ b and a > b cases (period 95 ns)]
5. Merge the dataflow diagrams.
[Figure: merged dataflow diagram, optimal performance (period = 95 ns)]
Component Quantity
Input 2
Output 2
Register 5
Adder 1
Subtracter 1
ALU 0
Memory read 2
Memory write 1
Multiplication 1
2:1 Multiplexor 2
Clock Period 95 ns
Average Latency 3 cycles
Average Execution Time 285 ns
[Figure: alternative merged dataflow diagram, suboptimal area (two multipliers)]
[Figure: alternative merged dataflow diagram, suboptimal performance (period = 100 ns)]
P8.7 Multiply Instruction
You are part of the design team for a microprocessor implemented on an FPGA. You currently
implement your multiply instruction completely on the FPGA. You are considering using a
specialized multiply chip to do the multiplication. Your task is to evaluate the performance and
optimality tradeoffs between keeping the multiply circuitry on the FPGA or using the external
multiplier chip.
If you use the multiplier chip, it will reduce the CPI of the multiply instruction, but will not
change the CPI of any other instruction. Using the multiplier chip will also force the FPGA to
run at a slower clock speed.
                                        FPGA option    FPGA+MULT option
    average CPI                         5              ???
    % of instrs that are multiplies     10%            10%
    CPI of multiply                     20             6
    Clock speed                         200 MHz        160 MHz
P8.7.1 Highest Performance
Which option, FPGA or FPGA+MULT, gives the higher performance (as measured in MIPs), and
what percentage faster is the higher-performance option?
Answer:
MIPs for FPGA option:

    MIPs_FPGA = MHz_FPGA / C_FPGA = 200 / 5 = 40

Find MIPs for the FPGA+MULT option:

    MIPs_FM = MHz_FM / C_FM

Find the CPI for the FPGA+MULT option:

    C_FM = PI_mult × C_mult + PI_other × C_other

Find the CPI for non-multiply (other) instructions. The key insight is that the CPI for non-multiply instructions is the same for both the FPGA and FPGA+MULT.

    C_FPGA  = PI_mult × C_mult + PI_other × C_other
    C_other = (C_FPGA - PI_mult × C_mult) / PI_other
            = (5 - 0.1 × 20) / 0.9
            = 3.333

    C_FM = PI_mult × C_mult + PI_other × C_other
         = 0.1 × 6 + 0.9 × 3.333
         = 3.6

    MIPs_FM = MHz_FM / C_FM = 160 / 3.6 = 44.4

MIPs_FM > MIPs_FPGA, therefore the FPGA+MULT is the higher performance option.

    n = (Pf_FM - Pf_FPGA) / Pf_FPGA = (44.4 - 40) / 40 = 11.1%

The FPGA+MULT option is 11% faster than the FPGA option.
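The arithmetic in this answer can be sanity-checked with a few lines of Python. This is only an illustrative check; the variable names are chosen for readability and are not part of the original problem statement.

    cpi_fpga      = 5.0       # average CPI for the FPGA-only option
    pi_mult       = 0.10      # fraction of instructions that are multiplies
    cpi_mult_fpga = 20.0      # CPI of a multiply on the FPGA
    cpi_mult_fm   = 6.0       # CPI of a multiply with the external multiplier
    mhz_fpga, mhz_fm = 200.0, 160.0

    # CPI of the non-multiply instructions is the same for both options.
    cpi_other = (cpi_fpga - pi_mult * cpi_mult_fpga) / (1 - pi_mult)   # 3.333
    cpi_fm    = pi_mult * cpi_mult_fm + (1 - pi_mult) * cpi_other      # 3.6

    mips_fpga = mhz_fpga / cpi_fpga        # 40.0
    mips_fm   = mhz_fm / cpi_fm            # 44.4
    print(mips_fpga, mips_fm, (mips_fm - mips_fpga) / mips_fpga)       # 40.0 44.4 0.111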
P8.7.2 Performance Metrics
Explain whether MIPs is a good choice for the performance metric when making this decision.
Answer:
MIPs is a good metric for this example, because we are comparing two
microprocessors that use the same instruction set and will be used in the
same environment.
In general, the disadvantage of MIPs is that it doesn't take into account that
different instructions accomplish different amounts of work. This causes
problems when comparing microprocessors that use different instruction
sets (e.g. one with a cosine instruction and one without).
On an exam, you need to explain whether or not MIPS is a good choice to
use based on what is stated in the question. For example, if a question
states that two processors have the same instruction set and are running the
same program, and gives some information relating CPI and clock speed, then
you can state that MIPS would be an okay comparison for the stated reasons.
Chapter 9
Timing Analysis Problems
P9.1 Terminology
For each of the terms: clock skew, clock period, setup time, hold time, and clock-to-q, answer
which time periods (one or more of t1–t9, or NONE) are examples of the term.
NOTES:
1. The timing diagram shows the limits of the allowed times (either minimum or maximum).
2. All timing parameters are non-negative.
3. The signal a is the input to a rising-edge flop and b is the output. The clock is clk1.
[Timing diagram: clk1, clk2, a, b, with labelled intervals t1–t11; shading indicates when each signal may change or is stable]
Answer:
    clock skew      t3
    clock period    t7
    setup time      t1
    hold time       t2
P9.2 Hold Time Violations
P9.2.1 Cause
What is the cause of a hold time violation?
Answer:
The cause of a hold time violation is that new data reaches the gate that
enables the input to affect the output before the gate is turned off.
P9.2.2 Behaviour
What is the bad behaviour that results if a hold time violation occurs?
Answer:
The bad behaviour that results from a hold time violation is that the new data
will corrupt the contents of the storage loop, which is trying to store the
previous data. If the new data arrives early enough to satisfy the setup
constraint, then the new data will overwrite the previous data and will slip
through the latch or flop.
P9.2.3 Rectification
If a circuit has a hold time violation, how would you correct the problem with minimal effort?
Answer:
A hold time violation can be corrected by adding a delay (buffer) to the data
path before the input gate.
P9.3 Latch Analysis
Does the circuit below behave like a latch? If not, explain why not. If so, calculate the clock-to-Q,
setup, and hold times; and answer whether it is active-high or active-low.
Gate Delays
AND 4
OR 2
NOT 1
[Schematic: candidate latch circuit with inputs en and d, and output q]
Answer:
[Annotated schematics: the circuit in load mode (en = 1) and in store mode (en = 0)]
From the mode diagrams, if the circuit is a latch, it is active high, because
latch is in load mode when en=1.
Now check if timing of circuit is correct. The critical transition is from load
mode to store mode.
[Schematic with node labels en, d, s1, cn, l1, q, and timing diagram for the transition from load mode to store mode]
Clock-to-Q: 6 (1 AND and 1 OR gate from d to q.)
Setup: 6 (1 AND and 1 OR from d to controlling gate for storage loop, 0 gates
from enable to controlling gate for storage loop.)
Hold:
Hold time constraint must prevent new value arriving at d before en sets l1
to 1.
Delay along data path is 0.
Delay along clock path is 1.
Hold time is 1.
P9.4 Critical Path and False Path
Find the critical path through the following circuit:
[Schematic: combinational circuit with inputs a, b, c, d and internal signals e through m]
P9.5 Critical Path
[Schematic: combinational circuit with primary inputs a, b, c, d and internal signals e through m]
gate delay
NOT 2
AND 4
OR 4
XOR 6
Assume all delay and timing factors other than combinational logic delay are negligible.
P9.5.1 Longest Path
List the signals in the longest path through this circuit.
Answer:
[Annotated schematic: cumulative delay at each gate output; the largest cumulative delay is 18]
Longest path is: b, e, g, j
P9.5.2 Delay
What is the combinational delay along the longest path?
Delay: 18
P9.5.3 Missing Factors
What factors that affect the maximum clock speed does your analysis for parts 1 and 2 not take
into account?
Answer:
false paths
wire delay
clock skew
clock jitter
P9.5.4 Critical Path or False Path?
Is the longest path that you found a real critical path, or a false path? If it is a false path, find the
real critical path. If it is a critical path, find a set of assignments to the primary inputs that
exercises the critical path.
P9.6 Timing Models
In your next job, you have been told to use a fanout timing model, which states that the delay
through a gate increases linearly with the number of gates in the immediate fanout. You dimly
recall that a long time ago you learned about a timing model named Elmo, Elmwood, Elmore,
El-Morre, or something like that.
For the circuit shown below as a schematic and as a layout, answer whether the fanout timing
model closely matches the delay values predicted by the Elmore delay model.
[Figure: schematic and layout of gate G1 driving G2–G5 through antifuse interconnect; the symbol table lists gate capacitance Cg, interconnect capacitances Cx and Cy (interconnect levels 1 and 2), and antifuse resistance R]
Assumptions:
The capacitance of a node on a wire is independent of where the node is located on the wire.
Answer:
Equivalent Circuit:
[Figure: equivalent RC network from G1 to the inputs of G2–G5, used for the Elmore delay calculation]
    t_DG2 = R·Cy + 2R·Cx + 3R·Cy + 4R·Cg + 2R(Cy + Cg) + 2R(Cy + Cg) + 2R(Cy + Cg)
          = 2R·Cx + 4R·Cy + 4R·Cg + 6R(Cy + Cg)

In general, for a similar circuit with fanout n:

    t_DGn = 2R·Cx + 4R·Cy + 4R·Cg + 2(n-1)·R(Cy + Cg)
          = 2R·Cx + 2(n+1)·R(Cy + Cg)

There are two components in the delay equation:
1. A fixed component that is not a function of the fanout: 2R·Cx.
2. A component that varies linearly with the fanout: 2(n+1)·R(Cy + Cg).
Yes, the fanout model closely matches the timing predicted by the Elmore
model.
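As a rough sanity check (not part of the original solution), the expression derived above can be evaluated for several fanouts using placeholder unit values for R and the capacitances; the constant difference between successive delays shows the linear dependence on fanout.

    R, Cx, Cy, Cg = 1.0, 1.0, 1.0, 1.0      # placeholder unit values, not from the problem

    def t_dgn(n):
        # Elmore-style delay derived above: a fixed term plus a term linear in fanout n
        return 2 * R * Cx + 2 * (n + 1) * R * (Cy + Cg)

    delays = [t_dgn(n) for n in range(1, 6)]
    print(delays)                                        # [10.0, 14.0, 18.0, 22.0, 26.0]
    print([b - a for a, b in zip(delays, delays[1:])])   # constant step of 4.0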
P9.7 Short Answer
P9.7.1 Wires in FPGAs
In an FPGA today, what percentage of the clock period is typically consumed by wire delay?
Answer:
40–60%
P9.7.2 Age and Time
If you were to compare a typical digital circuit from 5 years ago with a typical digital circuit
today, would you find that the percentage of the total clock period consumed by capacitive load
has increased, stayed the same, or decreased?
Answer:
Decreased.
Justification:
Transistors have gotten smaller, die size has remained roughly the same or even increased, and clock speeds are increasing.
Signals are travelling roughly the same distance as before, but driving smaller capacitive loads. Thus, wire delay is not decreasing much, but capacitive load is decreasing.
The clock period is decreasing, so the wire delay is taking up a larger
percentage of the clock period and capacitive load delay is taking up a
smaller percentage.
P9.7.3 Temperature and Delay
As temperature increases, does the delay through a typical combinational circuit increase, stay
the same, or decrease?
Answer:
Increase.
Justification:
As temperature increases, atoms vibrate more, and so have a greater
probability of colliding with electrons flowing with the current.
This increases resistivity, which increases delay.
P9.8 Worst Case Conditions and Derating Factor
Assume that we have a Std speed grade Actel A1415 (an ACT 3 part) Logic Module that drives
4 other Logic Modules:
P9.8.1 Worst-Case Commercial
Estimate the delay under worst-case commercial conditions (assume that the junction temperature
is the same as the ambient temperature)
Answer:
For worst-case commercial conditions, assuming that TA = TJ, the Logic Module
delay, tPD, for ACT 3 Std with a fanout of 4 is 5.7 ns (see Smith Table 5.2).
Assume this is the slowest path; then the estimated critical path delay between
registers, tCRIT (worst-case commercial), is:

    tCRIT = tPD + tSUD + tCO
          = 5.7 ns + 0.8 ns + 3.0 ns
          = 9.5 ns
P9.8.2 Worst-Case Industrial
Find the derating factor for worst-case industrial conditions and calculate the delay (assume that
the junction temperature is the same as the ambient temperature).
Answer:
For worst-case industrial conditions, assuming that TA = TJ, the derating
factor is 1.07 (see Table 5.3). Hence the delay tCRIT (worst-case industrial)
is 7% greater than the worst-case commercial delay: 1.07 × 9.5 ns = 10.2 ns.
P9.8.3 Worst-Case Industrial, Non-Ambient Junction Temperature
Estimate the delay under worst-case industrial conditions (assuming that the junction
temperature is 105 °C).
Answer:
For worst-case industrial conditions, the derating factor at 105 °C is found by
linear interpolation between the values for 85 °C (1.07) and 125 °C (1.17).
The interpolated derating factor is 1.12. Hence the delay is: tCRIT
(worst-case industrial, TJ = 105 °C) = 1.12 × 9.5 ns = 10.6 ns.
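A quick check of the interpolation (a straight-line interpolation between the two table entries is assumed):

    t_lo, d_lo = 85.0, 1.07        # derating factor at 85 C
    t_hi, d_hi = 125.0, 1.17       # derating factor at 125 C
    tj = 105.0

    derate = d_lo + (tj - t_lo) / (t_hi - t_lo) * (d_hi - d_lo)
    print(derate, derate * 9.5)    # 1.12 and about 10.6 (ns)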
Chapter 10
Power Problems
P10.1 Short Answers
P10.1.1 Power and Temperature
As temperature increases, does the power consumed by a typical combinational circuit increase,
stay the same, or decrease?
Answer:
Power will increase.
Justification:
Leakage power will increase, because the equation for the leakage power
is:
    IL ∝ e^( -q·VTh / (k·T) )

where T is temperature.
Short circuiting power will increase because:
As temperature increases, atoms vibrate more, and so have a greater probability of colliding with electrons flowing with the current. This increases resistivity, which increases delay.
Signals will rise and fall more slowly, which will increase the short circuiting time, and hence increase the short circuiting power.
P10.1.2 Leakage Power
The new vice president of your company has set up a contest for ideas to reduce leakage power in
the next generation of chips that the company fabricates. The prize for the person who submits
the suggestion that makes the best tradeoff between leakage power and other design goals is to
have a door installed on their cube. What is your door-winning idea, and what tradeoffs will your
idea require in order to achieve the reduction in leakage power?
Answer:
Increase transistor size so as to increase threshold voltage. This will require
an increase in supply voltage, which will likely increase total power.
Alternative: when increasing transistor size, keep the supply voltage the
same, but decrease performance.
Alternative: change fabrication process and materials to reduce leakage
current. This will likely be expensive.
Alternative: Use dual-Vt fabrication process.
P10.1.3 Clock Gating
In what situations could adding clock-gating to a circuit increase power consumption?
Answer:
If the circuitry has a high utilization rate, then the power consumed by the
clock gating circuit could be more than that saved in the main circuit.
Alternative: Even if the utilization rate is low, the utilization pattern could
prevent the clock gating circuitry from turning off the clock to main circuit. For
example, if the circuit receives new data every other clock cycle, it would have
a utilization rate of 50%, but might need to be powered up 100% of the time.
P10.1.4 Gray Coding
What are the tradeoffs in implementing a program counter for a microprocessor using Gray
coding?
Answer:
Gray coding is designed to reduce power, because only one bit changes
when incrementing or decrementing.
Program counters usually increment, rather than jump to completely
different values. So, using gray coding should reduce power consumption.
The downside is that the memory system probably doesn't use gray-coded
addresses, so additional circuitry would be needed to convert between gray
and binary codes. This will increase area and likely decrease performance.
Additionally, the extra circuitry to do the translation might require more
power than is saved by using gray coding.
P10.2 VLSI Gurus
The VLSI gurus at your company have come up with a way to decrease the average rise and fall
time (0-to-1 and 1-to-0 transitions) for signals. The current value is 1 ns. With their fabrication
tweaks, they can decrease this to 0.85 ns.
P10.2.1 Effect on Power
If you implement their suggestions, and make no other changes, what effect will this have on
power? (NOTE: Based on the information given, be as specific as possible.)
Answer:
Reducing the rise/fall time from 1 ns to 0.85 ns reduces the short-circuiting
time. Hence, the new short circuit power is 85% of the original.
P10.2.2 Critique
A group of wannabe performance gurus claim that the above optimization can be used to improve
performance by at least 15%. Briefly outline what their plan probably is, critique the merits of
their plan, and describe any effect their performance optimization will have on power.
Answer:
The plan was probably to increase the clock speed by 15%. However, reducing
Tshort by 0.15 ns can decrease the clock period by at most 2 × 0.15 = 0.30 ns,
while the clock period is much greater than 1 ns. Therefore, the plan does not work.
P10.3 Advertising Ratios
One day you are strolling the hallways in search of inspiration, when you bump into a person
from the marketing department. The marketing department has been out surfing the web and has
noticed that companies are advertising the MIPs/mm², MIPs/Watt, and Watts/cm³ of their
products. This wide variety of different metrics has confused them.
Explain whether each metric is a reasonable metric for customers to use when choosing a system.
If the metric is reasonable, say whether bigger is better (e.g. 500 MIPs/mm² is better than
20 MIPs/mm²) or smaller is better (e.g. 20 MIPs/mm² is better than 500 MIPs/mm²), and which
one type of product (cell phone, desktop computer, or compute server) the metric is most relevant to.
MIPs/mm²
Answer:
Unreasonable: with performance we care about the volume of a system,
not its area.
MIPs/Watt
Answer:
reasonable: bigger is better; e.g. cell-phone
Watts/cm³
Answer:
reasonable; smaller is better; server farm
P10.4 Vary Supply Voltage
As the supply voltage is scaled down (reduced in value), the maximum clock speed that the circuit
can run at decreases.
The scaling down of supply voltage is a popular technique for minimizing power. The maximum
clock speed is related to the supply voltage by the following equation:
    MaxClockSpeed ∝ (V - VTh)^2 / V

where V is the supply voltage and VTh is the threshold voltage.
With a supply voltage of 3V and a threshold voltage of 0.8V, the maximum clock speed is
measured to be 200MHz. What will the maximum clock speed be with a supply voltage of 1.5V?
Answer:
    MaxClockSpeed ∝ (V - VTh)^2 / V

    MaxClockSpeed_1 / MaxClockSpeed_2 = [ (V_1 - VTh)^2 / V_1 ] × [ V_2 / (V_2 - VTh)^2 ]

    MaxClockSpeed_1 = MaxClockSpeed_2 × [ (V_1 - VTh)^2 / V_1 ] × [ V_2 / (V_2 - VTh)^2 ]
                    = 200 MHz × [ (1.5 V - 0.8 V)^2 / 1.5 V ] × [ 3 V / (3 V - 0.8 V)^2 ]
                    = 40 MHz
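A short check of the scaling arithmetic (the function name and variables below are only for this sketch):

    def speed_factor(v, vth=0.8):
        # MaxClockSpeed is proportional to (V - VTh)^2 / V
        return (v - vth) ** 2 / v

    f_measured, v_measured = 200e6, 3.0
    v_new = 1.5
    f_new = f_measured * speed_factor(v_new) / speed_factor(v_measured)
    print(f_new / 1e6)             # about 40.5 MHz, i.e. roughly 40 MHz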
P10.5 Clock Speed Increase Without Power Increase
The following are given:
You need to increase the clock speed of a chip by 10%
You must not increase its dynamic power consumption
The only design parameter you can change is supply voltage
Assume that short-circuiting current is negligible
P10.5.1 Supply Voltage
How much do you need to decrease the supply voltage by to achieve this goal?
Answer:
Total power:
    Power = ( AF × (1/2) × CL × V^2 ) + ( AF × ISh × V ) + ( IL × V )
Only need to reduce dynamic power, therefore neglect static (leakage) power.
Neglect short circuiting current.
    Power = A × F × (1/2) × CL × V^2

    F' = 1.1 × F
    Power' = Power

    A × F' × (1/2) × CL × (V')^2 = A × F × (1/2) × CL × V^2
    F' × (V')^2 = F × V^2
    1.1 × F × (V')^2 = F × V^2
    V' = sqrt( (F × V^2) / (1.1 × F) )
    V' = 0.95 × V
We need to decrease the supply voltage to be 95.3% of its original value.
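The 95.3% figure comes from V' = V / sqrt(1.1), which a one-line check confirms:

    import math
    print(1 / math.sqrt(1.1))      # 0.9535, i.e. about 95.3% of the original supply voltage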
P10.5.2 Supply Voltage
What problems will you encounter if you continue to decrease the supply voltage?
Answer:
Decreasing the supply voltage will bring it closer to the threshold voltage. As
the difference between the supply and threshold voltage decreases, it will
limit the maximum frequency that the circuit can run at.
This then leads to decreasing the threshold voltage, which will then increase
the leakage current and raise the static power dissipation.
P10.6 Power Reduction Strategies
In each low power approach described below identify which component(s) of the power equation
is (are) being minimized and/or maximized:
P10.6.1 Supply Voltage
Designers scaled down the supply voltage of their ASIC
Answer:
Scaling the supply voltage (V) reduces all components of the power
equation. Switching power is proportional to the square of supply voltage, so
switching power reduces the most.
P10.6.2 Transistor Sizing
The transistors were made larger.
Answer:
Resizing transistor to increase the width to length ratio decreases the
resistance of the transistor, which makes it faster. This means that the supply
voltage can be reduced to save power while maintaining performance.
However, increasing the width to length ratio increases the capacitance. After
a certain point, the capacitance increase becomes more significant than the
reduction in supply voltage, causing power to increase.
Therefore, resizing is adjusting supply voltage and load capacitance to
minimize their product in the switching power component.
P10.6.3 Adding Registers to Inputs
All inputs to functional units are registered
Answer:
When inputs are registered, the activity factor is decreased, which decreases
the dynamic power.
P10.6.4 Gray Coding
Gray coding of signals is used for address signals.
Answer:
Gray coding reduces the activity factor on signals that typically change by 1
or a small amount. Address signals have this behaviour, in contrast to data
signals, where consecutive values are often completely different.
Reducing the activity factor will reduce the dynamic power.
However, as noted in problem P6.1.4, there are complications in using gray
coding. For example, the compiler would need to use gray coding when
calculating addresses, and the computer's datapath would need a gray-code
adder to compute address values.
Overall, the added design complexity of gray coding is probably too great a
cost to pay except in situations where the encoding is entirely internal to the
system being designed.
P10.7 Power Consumption on New Chip
While you are eating lunch at your regular table in the company cafeteria, a vice president sits
down and starts to talk about the difficulties with a new chip.
The chip is a slight modification of an existing design that has been ported to a new fabrication
process. Earlier that day, the first sample chips came back from fabrication. The good news is that
the chips appear to function correctly. The bad news is that they consume about 10% more power
than had been predicted.
The vice president explains that the extra power consumption is a very serious problem, because
power is the most important design metric for this chip.
The vice president asks you if you have any idea of what might cause the chips to consume more
power than predicted.
P10.7.1 Hypothesis
Hypothesize a likely cause for the surprisingly large power consumption, and justify why your
hypothesis is likely to be correct.
Answer:
Leakage current because same design, but different fabrication process
resulted in power change.
P10.7.2 Experiment
Briey describe how to determine if your hypothesized cause is the real cause of the surprisingly
large power consumption.
Answer:
Measure power consumption of circuit with clock turned off, if higher than
expected, then leakage current is the problem.
P10.7.3 Reality
The vice president wants to get the chips out to market quickly and asks you if you have any ideas
for reducing their power without changing the design or fabrication process. Describe your ideas,
or explain why her suggestion is infeasible.
Answer:
Reduce the clock frequency.
Chapter 11
Problems on Faults, Testing, and Testability
P11.1 Based on Smith q14.9: Testing Cost
A modern (circa 1995) production tester costs US$5–10 million. This cost is depreciated over the
life of the tester (usually five years in the States due to tax guidelines).
1. Neglecting all operating expenses other than depreciation, if the tester is in use 24 hours a
day, 365 days per year, how much does one second of test time cost?
Answer:
    CostPerSecond = PurchaseCost / Lifespan
                  = (5–10 × 10^6) / (5 × 365 × 24 × 60 × 60)
                  = $0.031 for a US$5 million tester
                  = $0.062 for a US$10 million tester
2. A new tester sits idle for 6 months, because the design of the chips that it is to test is behind
schedule. After the chips begin shipping, the tester is used 100% of the time. What is the
cost of testing the chips relative to the cost if the chips had been completed on time?
Answer:
6 months is 10% of a 5 year lifespan
Therefore the tester will test 90% of the total number of chips that it would
normally test.
The cost per chip for testing will be:
    NewTestCost = (1 / 0.90) × OrigTestCost = 111% × OrigTestCost
3. The dimensions of the die to be tested are 20 mm × 10 mm. The wafers are 200 mm in
diameter. Fabricating a wafer with die costs $3000. The yield is 70%. Assume that the
number of die per wafer is equal to wafer area divided by chip area.
What percentage of the fabrication + test cost is for test if the chip is on schedule and
requires 1 minute to test?
Answer:
    DiePerWafer = floor( WaferArea / DieArea )
                = floor( π × (200/2)^2 / (10 × 20) )
                = 157

    DieFabCost = WaferFabCost / DiePerWafer
               = $3000 / 157
               = $19.10

    DieTestCost = TestCostPerSec × TestTime
                = $0.062 × 60
                = $3.72

    TestCostPct = DieTestCost / (DieTestCost + DieFabCost)
                = $3.72 / ($3.72 + $19.10)
                = 16.3%
Note: the 70% information is extraneous.
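The three calculations above can be reproduced with a short script (an illustrative sketch; the printed values differ slightly from the text because the text rounds the per-second cost before multiplying):

    import math

    seconds_per_life = 5 * 365 * 24 * 60 * 60            # 5-year depreciation, 24/7 use
    for tester_cost in (5e6, 10e6):
        print(tester_cost / seconds_per_life)             # about $0.03 and $0.06 per second

    print(1 / 0.90)                                       # 1.11, i.e. 111% of the on-time test cost

    die_per_wafer = math.floor(math.pi * (200 / 2) ** 2 / (10 * 20))   # 157
    die_fab_cost  = 3000 / die_per_wafer                               # about $19.10
    die_test_cost = (10e6 / seconds_per_life) * 60                     # about $3.80 for 1 minute
    print(die_test_cost / (die_test_cost + die_fab_cost))              # about 0.166, roughly 16%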
P11.2 Testing Cost and Total Cost
Given information:
The ACHIP costs $10 without any testing
Each board uses two ACHIPs (plus lots of other chips that we don't care about)
68% of the manufactured ACHIPS do not have any faults
For the ACHIP, it costs $1 per chip to catch half of the faults
Each 50% reduction in fault escapees doubles cost of testing (intuition: doubles number of
tests that are run)
If board-level testing detects faults in either one or both ACHIPs, it costs $200 to replace
the ACHIP(s) (This is an approximation, based on the fact that the cost of the chip is much
less than the total cost of $200).
Board-level testing will detect 100% of the faults in an ACHIP
What fault escapee rate will result in the lowest total cost for ACHIPs?
Answer:
From section 7.1.2:

    TotCost = NoTestCost + TestCost + EscapeeProb × ReplaceCost

However, here we have two ACHIPs per board, so we need to use the
escapee probability to compute the probability of the board needing to be
replaced. The revised equation for total cost is:

    TotCost = NoTestCost + TestCost + ReplaceProb × ReplaceCost

The testing cost doubles, because we have two ACHIPs per board to test.
The probability of a board having at least one bad ACHIP (and therefore
needing to be replaced) is 1 minus the probability that both ACHIPs are good:

    ReplaceProb = 1 - (1 - EscapeeProb)^2
    NoTestCost   TestCost         EscapeeProb   ReplaceProb   AvgReplaceCost   TotCost
    $10          $0               32%           54%           $108             $118
    $10          2 × $1  = $2     16%           29%           $58              $70
    $10          2 × $2  = $4      8%           15%           $30              $44
    $10          2 × $4  = $8      4%            8%           $16              $34
    $10          2 × $8  = $16     2%            4%           $8               $34
    $10          2 × $16 = $32     1%            2%           $4               $46
    $10          2 × $32 = $64     0.5%          1%           $2               $76
The chips will have a lowest cost if either $8 or $16 is spent on testing and
they have a fault escapee rate of 4% or 2%. We choose to spend $16 on
testing, because that has a lower escapee rate for the same total cost. The
lower escapee rate will improve our reputation for quality.
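A small script can rebuild the table above. The structure (costs and probabilities) follows the equations in this answer; the loop bounds are chosen only to reproduce the seven rows.

    # Each 50% cut in the escapee rate doubles the per-chip test cost, and a board
    # is replaced ($200) if either of its two ACHIPs has a fault that escaped testing.
    no_test_cost = 10       # per-chip cost without testing, as used in the table
    replace_cost = 200

    rows = [(0, 0.32)]                      # (test cost per chip, escapee rate)
    cost, esc = 1, 0.32
    while esc > 0.005:
        esc /= 2
        rows.append((cost, esc))
        cost *= 2

    for per_chip, esc in rows:
        test_cost    = 2 * per_chip                   # two ACHIPs per board
        replace_prob = 1 - (1 - esc) ** 2
        total = no_test_cost + test_cost + replace_prob * replace_cost
        print(f"${test_cost:>3}  {esc:6.1%}  {replace_prob:5.1%}  ${total:6.2f}")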
P11.3 Minimum Number of Faults
In a circuit with i inputs, o outputs, and g gates with an average fanout of fo (fo > 1) and an
average fanin of fi, what is the minimum number of faults that must be considered when using a
single-stuck-at fault model?
Answer:
The minimum number of wire segments to connect a gate or input to fo other
gates or outputs is fo + 1 (assuming fo > 1). If fo = 1, then the minimum
number of wire segments is 1.
With i inputs and g gates, this results in (i + g) × (fo + 1) wire segments.
Each wire segment has two possible faults (stuck-at-1 and stuck-at-0),
therefore there are 2 × (i + g) × (fo + 1) potential single-stuck-at faults that must be
considered.
NOTE: the fanin degree does not directly factor into this equation. However,
there is a relationship between the number of gates g, the number of inputs i,
the depth of the circuitry, the fanout degree fo, and the fanin degree fi. For
example, the maximum number of gates whose inputs are all primary inputs
is i × fo / fi.
P11.4 Smith q14.10: Fault Collapsing
Draw the set of faults that collapse for AND, OR, NAND, and NOR gates, and a two-input mux.
Answer:
[Figure: equivalent (collapsed) stuck-at faults, marked @0 and @1 on the inputs and outputs of the AND, OR, NAND, and NOR gates]
A two-input mux does not have any controlling inputs, so it does not have any
collapsible faults.
P11.5 Mathematical Models and Reality
Given a correct circuit and a non-stuck-at fault (e.g. a bridging-AND fault), will a single-stuck-at fault
model detect the fault? If so, identify a single-stuck-at fault that will detect it, or explain why it can't
be detected.
P11.6 Undetectable Faults
Identify one of the undetectable single stuck-at faults in the circuit below, or say NONE if all
single stuck-at faults are detectable.
[Schematic: circuit with inputs a, b, c, output z, and wires labelled L1–L8]
P11.7 Test Vector Generation
Your task is to generate test vectors to detect faults in the circuit shown below.
Your manager has said that manufacturing only has time to run three test vectors on the circuit.
[Schematic: circuit with inputs a, b, c and wires labelled L1–L8]
P11.7.1 Choice of Test Vectors
Which test vectors should you run and in what order should you run them?
P11.7.2 Number of Test Vectors
Write a brief statement (justified by data) to support either staying with three test vectors or
increasing the test suite to four vectors.
P11.8 Time to do a Scan Test
A 1.2GHz chip has scan chains of length 30,000 bits, 20,000 bits, 24,000 bits, 25,000 bits, and
two of 12,000 bits.
500,000 test vectors are used for each scan chain.
The tests are run at 50% of full speed.
Calculate the total test time.
Answer:
We can load and unload all of the scan chains at the same time, so the time will
be limited by the longest chain (30,000 bits).
For the first test vector, we have to load it in, run the circuit for one clock
cycle, then unload the result.
Loading the second test vector is done while unloading the first.
    Clock Cycles   Vector 1   Vector 2   Vector 3   ...
    30,000         Load
    1              Run
    30,000         Dump       Load
    1                         Run
    30,000                    Dump       Load
    ...            ...        ...        ...
    TimeTot = ClockPeriod × ( MaxLengthVec + NumVecs × (MaxLengthVec + 1) )
            = ( 1 / (0.50 × 1.2 × 10^9) ) × ( 30,000 + 500,000 × (30,000 + 1) )
            = 25.0 secs
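The same calculation in code form (it simply mirrors the formula above):

    clock_hz  = 0.50 * 1.2e9                              # tests run at 50% of full speed
    chain_len = max(30_000, 20_000, 24_000, 25_000, 12_000, 12_000)
    num_vecs  = 500_000

    cycles = chain_len + num_vecs * (chain_len + 1)       # first load, then overlapped dump/load
    print(cycles / clock_hz)                              # about 25.0 seconds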
P11.9 BIST
In this problem, we will revisit the circuit from section 7.2.5, which is shown below. But this
time we'll use BIST to test the circuit, rather than analyzing the faults and then choosing test
vectors to catch the potential faults.
[Schematic: circuit under test with inputs a, b, c, output z, and wires labelled L1–L8]
P11.9.1 Characteristic Polynomials
Derive the characteristic polynomials for the linear feedback shift registers shown below:
[Schematics: two 3-flop linear feedback shift registers (flops d0/q0, d1/q1, d2/q2 with a set input) that differ in their feedback taps]
Answer:
Both circuits have three flops, so their maximum exponent is x^3.
A feedback tap on signal di corresponds to a coefficient of 1 on x^i
in the characteristic polynomial.
The first circuit has feedback taps for d0, d1, and d2. This gives a
characteristic polynomial of:

    x^3 + x^2 + x + 1

The second circuit has taps on d0 and d1, but not one on d2:

    x^3 + x + 1
P11.9.2 Test Generation
Do either of the circuits generate a maximal-length non-repeating sequence?
Answer:
For an LFSR with n flops, the length of a maximal-length non-repeating
sequence is 2^n - 1. Both of the LFSRs under consideration have 3 flops, so
we are looking for a sequence of 7 non-repeating values.
We will first simulate the circuits to see their values, and then demonstrate
how characteristic polynomials and division over Galois fields can be used to
accomplish the same thing.
    x^3 + x^2 + x + 1:
        d0 q0 d1 q1 d2 q2
     1)  1  1  0  1  0  1
     2)  0  1  1  0  0  0
     3)  0  0  0  1  1  0
     4)  1  0  1  0  1  1
         1  1  0  1  0  1   same as 1)

    x^3 + x + 1:
        d0 q0 d1 q1 q2
     1)  1  1  0  1  1
     2)  1  1  0  0  1
     3)  0  1  1  0  0
     4)  0  0  0  1  0
     5)  1  0  1  0  1
     6)  0  1  1  1  0
     7)  1  0  1  1  1
For x^3 + x^2 + x + 1, we see that it repeats after 4 values.
For x^3 + x + 1, we see that it generates a sequence of 7 different values before
repeating. The circuit has three flops, so the maximum length sequence of
non-repeating values it can generate is 2^3 - 1, which is 7. Thus, x^3 + x + 1 is a
maximal-length linear feedback shift register.
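The two state sequences can also be checked with a small simulation. This is an illustrative sketch, not course-supplied code; it models the LFSR structure described above, where q2 is fed back into d0 and XORed into each tapped d input.

    def lfsr_sequence(taps, nbits=3):
        state = [1] * nbits                     # reset value: all flops set to 1
        seen, seq = set(), []
        while tuple(state) not in seen:
            seen.add(tuple(state))
            seq.append(tuple(state))
            fb = state[-1]                      # q2 is fed back
            nxt = [fb] + state[:-1]             # shift: d0 <- q2, d1 <- q0, d2 <- q1
            for t in taps:                      # taps on d1/d2 XOR the feedback in
                nxt[t] ^= fb
            state = nxt
        return seq

    print(len(lfsr_sequence(taps=[1, 2])))      # x^3 + x^2 + x + 1: repeats after 4 states
    print(len(lfsr_sequence(taps=[1])))         # x^3 + x + 1: 7 states (maximal length)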
Format for division:

           quotient
    lfsr ) message
           ...
           remainder
For an LFSR with no external input and n flops, the first n coefficients of the
message are the reset values of the LFSR, and all of the other remaining
coefficients are 0. For a test vector generator LFSR, the reset values are all 1s.
We hope to have a sequence of 7 unique remainders. With the three initial
values in the LFSR flops, we require a message polynomial of 3 + 7 = 10
values.
The message polynomial is then:
    1x^9 + 1x^8 + 1x^7 + 0x^6 + 0x^5 + 0x^4 + 0x^3 + 0x^2 + 0x^1 + 0x^0
Carry out the division:

                              1x^6 + 1x^5 + 0x^4 + 0x^3 + 1x^2 + 0x^1 + 1x^0
                             ------------------------------------------------------------------------
    1x^3+0x^2+1x^1+1x^0  )    1x^9 + 1x^8 + 1x^7 + 0x^6 + 0x^5 + 0x^4 + 0x^3 + 0x^2 + 0x^1 + 0x^0
                              1x^9 + 0x^8 + 1x^7 + 1x^6
                              ---------------------------
                                     1x^8 + 0x^7 + 1x^6 + 0x^5
                                     1x^8 + 0x^7 + 1x^6 + 1x^5
                                     ---------------------------
                                            0x^7 + 0x^6 + 1x^5 + 0x^4
                                            0x^7 + 0x^6 + 0x^5 + 0x^4
                                            ---------------------------
                                                   0x^6 + 1x^5 + 0x^4 + 0x^3
                                                   0x^6 + 0x^5 + 0x^4 + 0x^3
                                                   ---------------------------
                                                          1x^5 + 0x^4 + 0x^3 + 0x^2
                                                          1x^5 + 0x^4 + 1x^3 + 1x^2
                                                          ---------------------------
                                                                 0x^4 + 1x^3 + 1x^2 + 0x^1
                                                                 0x^4 + 0x^3 + 0x^2 + 0x^1
                                                                 ---------------------------
                                                                        1x^3 + 1x^2 + 0x^1 + 0x^0
                                                                        1x^3 + 0x^2 + 1x^1 + 1x^0
                                                                        ---------------------------
                                                                               1x^2 + 1x^1 + 1x^0

    Quotient:  1x^6 + 1x^5 + 1x^2 + 1x^0
    Remainder: 1x^2 + 1x^1 + 1x^0
The values on the flip flops inside an LFSR with n flops show up as the
n most-significant coefficients on the polynomials immediately below the
subtraction lines in the long division. For example, after the second
subtraction, the polynomial is 0x^7 + 0x^6 + 1x^5 + 0x^4. The three most
significant coefficients are 001, and the value on (q2, q1, q0) after two steps
of execution is also 001.
P11.9.3 Signature Analyzer
Given a signature analyzer equation of x^2 + x + 1, find the expected value of the flops in the
signature analyzer at the end of the test sequence. Also, design the hardware for the signature
analyzer and result checker.
Answer:
[Schematic: test generator connected to the circuit under test, with flops q0, q1, q2, a mode input, signals i_d(0), i_d(1), i_d(2), and output z]

Expected sequence of values from the circuit:

        q0 q1 q2 | z
     1)  1  1  1 | 1    x^6
     2)  1  0  1 | 0    x^5
     3)  1  0  0 | 0    x^4
     4)  0  1  0 | 0    x^3
     5)  0  0  1 | 0    x^2
     6)  1  1  0 | 1    x^1
     7)  0  1  1 | 1    x^0

Polynomial for the output sequence of the circuit under test:

    x^6 + x + 1
The remainder of the result sequence divided by the signature analyzer polynomial is
the value left in the flops of the signature analyzer at the end of the test sequence.

    m(x)   message (output of circuit under test)   x^6 + x + 1
    p(x)   polynomial of signature analyzer         x^2 + x + 1
    q(x)   quotient
    r(x)   remainder
Format for division:

                          quotient
    signature analyzer ) circuit under test
                          ...
                          remainder
Carry out the division:

                          1x^4 + 1x^3 + 0x^2 + 1x^1 + 1x^0
                         ------------------------------------------------------
    1x^2+1x^1+1x^0  )     1x^6 + 0x^5 + 0x^4 + 0x^3 + 0x^2 + 1x^1 + 1x^0
                          1x^6 + 1x^5 + 1x^4
                          --------------------
                                 1x^5 + 1x^4 + 0x^3
                                 1x^5 + 1x^4 + 1x^3
                                 --------------------
                                        0x^4 + 1x^3 + 0x^2
                                        0x^4 + 0x^3 + 0x^2
                                        --------------------
                                               1x^3 + 0x^2 + 1x^1
                                               1x^3 + 1x^2 + 1x^1
                                               --------------------
                                                      1x^2 + 0x^1 + 1x^0
                                                      1x^2 + 1x^1 + 1x^0
                                                      --------------------
                                                             1x^1 + 0x^0

    Quotient:  1x^4 + 1x^3 + 1x^1 + 1x^0
    Remainder: 1x^1 + 0x^0
Check division:

    m(x) = ( q(x) × p(x) ) + r(x)
    1x^6 + x + 1 = ( (1x^4 + 1x^3 + 1x^1 + 1x^0) × (1x^2 + 1x^1 + 1x^0) ) + x^1
                 = ( x^6 + 1 ) + x
                 = 1x^6 + x + 1
Division was done correctly.
The final value on the two flops in the signature analyzer will be the
remainder: 1x^1 + 0x^0 = 10.
NOTE: When looking at the remainder (signature), we look at the outputs of
the flops, representing the flop nearest the input as x^0.
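The long divisions in this problem can be cross-checked with a generic GF(2) polynomial division routine. This is an illustrative sketch, not part of the course material; polynomials are encoded as integers whose bit i is the coefficient of x^i.

    def gf2_divmod(dividend, divisor):
        # Long division over GF(2): subtract (XOR) shifted copies of the divisor
        # until the remainder has lower degree than the divisor.
        q, r = 0, dividend
        while r.bit_length() >= divisor.bit_length():
            shift = r.bit_length() - divisor.bit_length()
            q ^= 1 << shift
            r ^= divisor << shift
        return q, r

    # Test generator: (x^9 + x^8 + x^7) / (x^3 + x + 1)
    q, r = gf2_divmod(0b1110000000, 0b1011)
    print(bin(q), bin(r))    # 0b1100101 0b111 -> quotient x^6+x^5+x^2+1, remainder x^2+x+1

    # Signature analyzer: (x^6 + x + 1) / (x^2 + x + 1)
    q, r = gf2_divmod(0b1000011, 0b111)
    print(bin(q), bin(r))    # 0b11011 0b10 -> quotient x^4+x^3+x+1, remainder x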
Using hardware:
[Figure: signature analyzer schematic and timing diagram, showing clk, i, d0, d1, q0, q1 over the test sequence, with the quotient and remainder bits marked]
The quotients and the remainder calculated using long division match the
ones that were calculated using the circuit. The values on the flops in the
signature analyzer match, cycle by cycle, the two most significant coefficients
on the intermediate remainders calculated during long division. The
intermediate remainders are the polynomials below the subtraction lines.
(When looking at the circuit, remember that for an LFSR with n flops, it takes
n clock cycles for the circuit to become primed with the input sequence and
match the long-division arithmetic.)
The ok circuit for this signature analyzer is just a 2-input AND gate, because
the remainder is 11.
[Figure: signature analyzer with ok circuit]
The result checker should check the ok signal one cycle after the last test
vector.
The last test vector in the sequence is 110. We can either look for 110 and
delay by one clock cycle, or we can look for the first test vector (111) in the
second iteration of the sequence. To make sure that we are looking at the
second iteration of the sequence, and not the first, we look at reset.
[Figure: result checker circuit option 1, in which a maximal-length LFSR drives the circuit under test and the signature analyzer, and q0–q2 and ok are combined into all_ok]
[Figure: result checker circuit option 2, as option 1 but using reset to identify the second iteration of the test sequence]
P11.9.4 Probability of Catching a Fault
Find the approximate probability of a fault not being detected
Answer:
We have a sequence of 7 bits coming from the circuit under test.
This gives us 2^7 = 128 possible sequences. Of these, 1 is the good sequence
and 127 are faulty sequences.
The signature analyzer stores 2 bits of data, which gives us 4 possible values.
Thus, on average 128/4 = 32 different result sequences will map to the same
2-bit signature.
Of these 32 vectors, 1 is the good sequence and 31 are faulty sequences.
Assume that each result sequence is equally likely to occur.
(NOTE: this is a poor assumption, a full analysis would make each stuck-at
fault equally likely, then compute the result vector for each fault.)
With this assumption, there is a 31/127 = 24% chance that a faulty sequence
will result in the same signature as the good sequence.
There is approximately a 24% chance that a faulty circuit will not be detected.
P11.9.5 Probability of Catching a Fault
If we increase the size of the signature analyzer by one flip flop, by how much do we change the
approximate probability of a fault not being detected?
Answer:
A signature analyzer with 3 bits of data gives us 8 possible values.
Thus, on average 128/8 = 16 different result sequences will map to the same
3-bit signature.
Assuming that each result sequence is equally likely to occur, there is a
15/127 = 11.8% chance that a faulty sequence will result in the same
signature as the good sequence.
There is approximately a 12% chance that a faulty circuit will not be detected.
Thus, we have decreased the probability of a faulty circuit not being detected
from 24% to 12%.
P11.9.6 Detecting a Specific Fault
Determine if L7@0 is detectable.
Answer:
Equation for faulty circuit: a AND b.
Faulty sequence of values from circuit:
    a b c | z
    1 1 1 | 1    x^6
    1 0 1 | 0    x^5
    1 0 0 | 0    x^4
    0 1 0 | 0    x^3
    0 0 1 | 0    x^2
    1 1 0 | 1    x^1
    0 1 1 | 0    x^0

Polynomial for the result sequence: x^6 + x
Compute the remainder:

                          1x^4 + 1x^3 + 0x^2 + 1x^1 + 1x^0
                         ------------------------------------------------------
    1x^2+1x^1+1x^0  )     1x^6 + 0x^5 + 0x^4 + 0x^3 + 0x^2 + 1x^1 + 0x^0
                          1x^6 + 1x^5 + 1x^4
                          --------------------
                                 1x^5 + 1x^4 + 0x^3
                                 1x^5 + 1x^4 + 1x^3
                                 --------------------
                                        0x^4 + 1x^3 + 0x^2
                                        0x^4 + 0x^3 + 0x^2
                                        --------------------
                                               1x^3 + 0x^2 + 1x^1
                                               1x^3 + 1x^2 + 1x^1
                                               --------------------
                                                      1x^2 + 0x^1 + 0x^0
                                                      1x^2 + 1x^1 + 1x^0
                                                      --------------------
                                                             1x^1 + 1x^0

    Quotient:  1x^4 + 1x^3 + 1x^1 + 1x^0
    Remainder: 1x^1 + 1x^0
This remainder is different from the remainder for the correct circuit, thus the
fault will be detected.
In hardware:
[Figure: timing diagram of the signature analyzer processing the faulty output sequence, showing clk, i, d0, d1, q0, q1 with the quotient and remainder bits marked]
P11.9.7 Time to Run Test
Find the number of clock cycles to run the test
Answer:
For a maximal-length LFSR of n bits, it takes 1 clock cycle to reset the circuit,
2^n - 1 clock cycles to generate the 2^n - 1 test vectors, and finally one cycle at
the end to flop the results. This gives a total of 2^n + 1 clock cycles, which in
our case is 9.
P11.10 Power and BIST
You add a BIST circuit to a chip. This causes the chip to exceed the power envelope that marketing
has dictated is needed. What can you do to reduce the power consumption of the chip without
negatively affecting performance or incurring significant design effort?
Answer:
When in test mode, run the clock at a lower frequency so that the chip will
consume less power.
Add clock gating to signature analyzer so that it is turned off when the chip is
in normal mode.
P11.11 Timing Hazards and Testability
This question deals with the following circuit:
[Schematic: two-level AND-OR circuit with inputs a, b, c, output z, and wires labelled L1–L15]
1. Does the circuit have any untestable single-stuck-at faults? If so, identify them.
Answer:
[Karnaugh map for z in terms of a, b, c]
None of the minterms are completely covered by other minterms, so the
circuit is irredundant and does not have undetectable faults.
The two minterms ac and ab overlap, but neither is completely covered by
other minterms. So, if one of them was stuck at 0, there would be at least
one set of input values that would cause the faulty circuit to differ from the
correct circuit.
2. Does the circuit have any static timing hazards?
Answer:
Moving between two adjacent minterms that are covered by different product terms
(a single input change) can cause a glitch. Thus, there is a potential timing hazard.
[Figure: timing diagram of a, b, c, z showing the potential glitch (static hazard) on z]
3. Add any circuitry needed to prevent static timing hazards in the circuit below, then identify
any untestable single-stuck-at faults in the resulting circuit.
Answer:
[Schematic: hazard-free version of the circuit with added consensus terms; wires labelled L1–L19, with the faults L9@0, L13@0, L16@0, L17@0, L18@0, and L19@0 marked]
The minterms ab and bc are both completely covered by other minterms.
Thus, these minterms are redundant and are sources of undetectable
faults.
This gives us L13@0 and L19@0 as undetectable single stuck-at faults.
Using gate collapsing, we see that the following faults are equivalent to
L13@0: L9@0, L16@0.
And the following are equivalent to L19@0: L17@0, L18@0.
NOTE: although both L16@0 and L17@0 are undetectable, this does not
mean that L2@0 is undetectable. L2@0 is equivalent to having both
L16@0 and L17@0 at the same time. Check the Boolean equations if you
are in doubt about this.
P11.12 Testing Short Answer
P11.12.1 Are there any physical faults that are detectable by scan testing
but not by built-in self testing?
If not, explain why. If so, describe such a fault.
Answer:
Yes.
A fault that is only detectable with the test vector 000 will be detectable by scan testing but
not by built-in self test.
A fault that results in the same signature as the correct circuit will be
detectable by scan testing but not by built-in self test.
P11.12.2 Are there any physical faults that are detectable by built-in self
testing but not by scan testing?
If not, explain why. If so, describe such a fault.
Answer:
No.
Any fault that is detectable by built-in self testing can be detected by scan
testing, where the test vector that we scan in is the BIST test vector that
triggers the fault.
If scan testing is interpreted as boundary scan testing and built-in self test
is allowed inside a chip, then there are faults that are detectable by built-in
self test but not by boundary scan testing. These faults would be inside
redundant sequential circuitry. But, this scenario was not intended to be part
of this question.
P11.13 Fault Testing
In this question, you will design and analyze built-in self test circuitry for the circuit-under-test
shown below.
P11.13.1 Design test generator
Draw the schematic for a 2-bit maximal-length linear feedback shift register and demonstrate that
it is maximal length.
Answer:
[Schematic and timing diagram: 2-bit maximal-length LFSR (flops d0/q0 and d1/q1) cycling through the state sequence 3, 1, 2, 3]
P11.13.2 Design signature analyzer
Design a signature analyzer circuit for a characteristic polynomial of x +1.
Answer:
[Schematic: single-flop signature analyzer for the characteristic polynomial x + 1]
P11.13.3 Determine if a fault is detectable
Is a stuck-at-1 fault on the output of the inverter detectable with the circuitry that you've
designed?
Answer:
1. Equation for correct circuit-under-test is ab.

       a b | output
       1 1 |   0
       1 0 |   1
       0 1 |   1

2. Simulating correct output sequence 011 through the signature analyzer:

       i  | 0 1 1
       d0 | 0 1 0
       q0 | 0 0 1 0

3. Equation for faulty circuit-under-test is ab+ab.

       a b | output
       1 1 |   1
       1 0 |   0
       0 1 |   0

4. Simulating faulty output sequence 100 through the signature analyzer:

       i  | 1 0 0
       d0 | 1 1 1
       q0 | 0 1 1 1

5. Output of the signature analyzer is different from the correct circuit, so the fault
will be detected.
P11.13.4 Testing time
How many clock cycles does your BIST circuitry require to test the circuit under test? Explain
how each clock cycle is used.
Answer:
1. reset circuit
2. run first of three test vectors
3. run second of three test vectors
4. run third of three test vectors
5. flop result from circuit under test into signature analyzer
5 clock cycles.