
Power and Speed Trade-offs in Data Path Structures


Array Subsystems
Introduction
 Data path circuits have so far been discussed in terms of area
and speed.
 Power is now added as the third dimension in the design
exploration space.
 Digital designs can be
 Latency constrained
 Throughput constrained
Design style
 Latency constrained – the design has to finish its
computation by a given deadline.
 Example – interactive communication and gaming
 Throughput constrained – the design has to maintain a
constant throughput.
 Pipelining and parallelism work effectively under
throughput constraints (but are not good for latency-
constrained designs).
 With a fixed data path architecture, speed, area, and
power can be traded off through the choice of
supply voltage, transistor thresholds, and device sizes.
Power optimization techniques
 Implementation (enabling) time
 Design time – transistor widths and lengths are fixed
 Run time – supply voltage and transistor thresholds are varied
 Idle time – power dissipation should be minimal
 Targeted dissipation source
 Active (dynamic) power
 Leakage (static) power – lowering VDD reduces both the energy
consumed per transition and the leakage current
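The two dissipation sources above follow the standard first-order models, P_dynamic = α·C·VDD²·f and P_leakage = VDD·I_leak. A minimal sketch with illustrative values (α, C, and the leakage current below are assumptions, not process data):

```python
# First-order CMOS power model, a sketch with assumed numbers.
# P_dynamic = alpha * C * VDD^2 * f   (switching power)
# P_leakage = VDD * I_leak            (static power)

def dynamic_power(alpha, c_farads, vdd, f_hz):
    """Average switching power of a node toggling with activity `alpha`."""
    return alpha * c_farads * vdd**2 * f_hz

def leakage_power(vdd, i_leak_amps):
    """Static power burned by subthreshold and gate leakage."""
    return vdd * i_leak_amps

# Halving VDD cuts dynamic power 4x (quadratic dependence).
p_hi = dynamic_power(0.1, 1e-12, 1.0, 1e9)
p_lo = dynamic_power(0.1, 1e-12, 0.5, 1e9)
assert abs(p_hi / p_lo - 4.0) < 1e-9
```

The quadratic VDD term is why supply reduction dominates all the techniques that follow.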
Classification of Design Techniques
              Design time                Sleep time           Run time
Active        • Reducing VDD             • Clock gating       • Dynamic voltage
power         • Multiple supply voltages                        scaling
              • Transistor sizing
              • Logic optimization
              • Path balancing
Leakage       • Multi-Vth                • Sleep transistors  • Variable Vth
power         • Active techniques        • Variable Vth       • Active techniques
Design time power reduction
techniques
 Reducing VDD
 Multiple Supply voltages
 Module level voltage selection – level converters
 Multiple supplies inside a block
 Multiple device threshold
 Transistor sizing
 Logic and Architecture optimization
 Path balancing
 Multi Vth
Reducing the supply voltage
 Reducing VDD gives quadratic savings in power.
 Negative – the delay of CMOS gates increases as the
supply voltage drops, but this can be compensated by logic
optimizations.
 Adder example – a faster adder (carry look-ahead) can
run at a lower supply voltage for the same
performance.
 Parallelism and pipelining can also be used to compensate
for the loss in performance.
 If power dissipation is a prime concern, reduce VDD
at the cost of an increase in area.
Parallelism
 Effective capacitance
increases by a factor of 2.15
 Total power is reduced
to 0.36 of the original due to
parallelism
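The 2.15 and 0.36 figures are consistent with the classic two-way-parallel analysis under P ≈ C·V²·f: duplicate the hardware, halve each unit's clock, and use the relaxed timing to lower VDD. The 0.58 supply ratio below is an assumed value chosen to match the slide's numbers, not given in the slides:

```python
# Two-way parallel datapath: a sketch reproducing the 2.15 / 0.36 figures.
c_ratio = 2.15   # effective capacitance increase (duplication + muxing)
f_ratio = 0.5    # each parallel unit clocks at half rate
v_ratio = 0.58   # assumed supply scaling enabled by the timing slack

power_ratio = c_ratio * v_ratio**2 * f_ratio   # P ~ C * V^2 * f
print(round(power_ratio, 2))  # total power drops to ~36% of the original
```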
Multiple supply voltages
 Reducing the supply voltage of all gates reduces power but
uniformly increases their delay.
 Better approach – selectively decrease the supply
voltage on some of the gates:
 Gates on fast paths that finish computation early
 Gates that drive large capacitances
 Requires more than one supply – a common method:
 One supply for I/O – transistors with thicker gate
oxides to sustain the higher voltage
 Every module can select the most appropriate
voltage from two or more supply options.
Module-level voltage selection
 Data path – critical path 10 ns – VDD = 2.5 V
 Control path – critical path 4 ns – VDD = 2.5 V
 The data path has fixed latency and throughput with no architectural
transformation.
 Since the control path finishes first (timing slack), lowering VDD for
the control path reduces power consumption.
 This exploits the discrepancy in critical path lengths.
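How far the control-path supply can drop can be estimated with a delay model. The sketch below uses the alpha-power model t_d ∝ VDD/(VDD − Vt)^α; the Vt = 0.5 V and α = 1.3 values are assumed for illustration, not taken from the slides:

```python
# Find the lowest control-path VDD whose 4 ns delay still fits the 10 ns
# clock set by the data path. Alpha-power model with assumed Vt and alpha.
def rel_delay(vdd, vt=0.5, alpha=1.3):
    return vdd / (vdd - vt) ** alpha

budget = 10e-9          # data path critical path sets the clock period
t_ctrl_nominal = 4e-9   # control path delay at VDD = 2.5 V
scale_needed = budget / t_ctrl_nominal  # control may slow down 2.5x

# Scan supplies below 2.5 V for the lowest one still meeting the budget.
vdd = 2.5
while vdd > 0.6:
    slowdown = rel_delay(vdd) / rel_delay(2.5)
    if slowdown > scale_needed:
        vdd += 0.01     # last step went too far; back off
        break
    vdd -= 0.01

energy_ratio = (vdd / 2.5) ** 2  # E ~ VDD^2
print(f"control VDD ~ {vdd:.2f} V, energy ratio ~ {energy_ratio:.2f}")
```

Under these assumed parameters the control path can run near 1 V, cutting its energy per operation by well over 5x.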
Level converters
 Level converters are required when a module at the lower supply
(VDDL) has to drive a gate at the higher supply (VDDH).
 If a VDDL signal drives a VDDH gate directly, the PMOS never turns
off completely – static current flows.
 Level conversion – cross-coupled PMOS pair using positive feedback.
 The (large) NMOS transistors, driven at the reduced voltage, must be
strong enough to overcome the pull of the positive feedback.
 A very low VDDL leads to long delay.
 Sensitive to variations in supply voltage and transistor sizing.
Multiple supplies inside a block
 The same approach can be applied at much smaller granularity
by individually setting VDD for each cell inside a
block.
 Shorter paths waste energy – there is no reward for
finishing early – so their VDD can be lowered.
 Minimum energy would be achieved if all paths were
critical.
 The practical limit for multiple supplies is two – the clustered
voltage scaling technique.
Clustered scaling technique
 Each path starts at VDDH and switches to VDDL once
a delay slack occurs.
 Level conversion at the end of paths (only where such a
path exists).
 When a large fraction of paths are shorter than the critical one,
the second VDD brings their delays up toward the target value.
Savings are largest when large capacitances are concentrated
toward the end of the block – e.g., a bus.
 Multiple VDDs on a die complicate the design.
 If necessary, the second VDD can be generated on-die
using a DC-DC converter.
Using multiple device threshold
 Multiple Vth – in a 0.25 μm CMOS technology, thresholds can be
varied by about 100 mV.
 Advantages
 No need for level converters
 No need for clustering
 Careful assignment – 80–90% reduction in leakage power
 Small reduction in active power (~4%)
 Reduced gate-channel capacitance in the off state
 Small reduction in signal swing
Higher threshold                                   Lower threshold
Leakage current lower than a low-Vth transistor;   Higher leakage current
~30% reduction in active current
Used everywhere else                               Used in timing-critical paths
Options to reduce switching capacitance
 Input capacitance is proportional to transistor size, which
also determines speed.
 Transistor sizing
 Logic and Architecture optimization
Transistor sizing
 The optimal tapering factor for an inverter chain is ≈ 4.
 The size of a cell S_i is the geometric mean of its adjacent
cells: S_i = sqrt(S_(i-1) · S_(i+1)).
 If a delay smaller than the desired delay is achievable, sizing
can be cast as an optimization problem:
 Minimize switching capacitance under delay
constraints
 Solution:
 Reduce the number of stages N and increase the tapering factor
 Adjust the tapering factor
 Downsize the largest transistors first – but in reconvergent
circuits, downsizing may affect other paths too.
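The geometric-mean rule is equivalent to a constant tapering factor f = (C_L/C_in)^(1/N). A small sketch (sizes in arbitrary capacitance units):

```python
import math

# Inverter-chain sizing sketch: with N stages driving load C_L from input
# capacitance C_in, a constant tapering factor f = (C_L/C_in)^(1/N)
# makes every internal stage the geometric mean of its neighbors.
def stage_sizes(c_in, c_load, n_stages):
    f = (c_load / c_in) ** (1.0 / n_stages)
    return [c_in * f**i for i in range(n_stages + 1)], f

sizes, f = stage_sizes(c_in=1.0, c_load=256.0, n_stages=4)
print(f"taper = {f:.2f}")   # 256^(1/4) = 4, the classic optimum
for i in range(1, len(sizes) - 1):
    # each internal size is the geometric mean of its neighbors
    assert math.isclose(sizes[i], math.sqrt(sizes[i - 1] * sizes[i + 1]))
```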
Logic and Architecture optimization
 Effective capacitance is the product of physical
capacitance and switching activity.
 Minimize both to obtain minimum effective capacitance.
 Resource allocation
 Multiplexing has a negative effect on power consumption:
 Increases physical capacitance
 Increases switching activity
 Separate counters are superior – a multiplexed counter sees
more random signals, which increases switching
activity.
 Avoid excessive reuse of resources.
Reduce Glitching through path balancing
 A main concern in adders and multipliers.
 A wide discrepancy in the lengths of signal paths can cause
spurious transitions.
 Example: all inputs change simultaneously from 0 to 1 –
a brief glitch ('1') appears at the outputs.
 Glitches become more pronounced as they travel down a
chain – balance path lengths (tree structures).
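The chain-versus-tree effect can be seen with a unit-gate-delay simulation. The sketch below builds the parity of 8 inputs both ways and counts node transitions after all inputs flip; it is illustrative only, since real glitching depends on actual gate and wire delays:

```python
# Glitch illustration: 8-input parity as an XOR chain vs a balanced XOR
# tree, simulated with a synchronous unit gate delay.
def simulate(netlist, n_inputs, steps=20):
    """netlist: list of (a, b) input-node indices per gate, in order.
    Nodes 0..n_inputs-1 are primary inputs; all flip 0->1 at t=0.
    Returns total transitions on all nodes (useful + spurious)."""
    n = n_inputs + len(netlist)
    vals = [0] * n                      # settled state with all inputs 0
    vals[:n_inputs] = [1] * n_inputs    # all inputs flip simultaneously
    transitions = 0
    for _ in range(steps):
        new = vals[:]
        for g, (a, b) in enumerate(netlist):
            new[n_inputs + g] = vals[a] ^ vals[b]  # unit-delay XOR
        transitions += sum(x != y for x, y in zip(vals, new))
        vals = new
    return transitions

chain = [(0, 1)] + [(7 + k, k + 1) for k in range(1, 7)]   # serial chain
tree = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9), (10, 11), (12, 13)]
print("chain:", simulate(chain, 8), "tree:", simulate(tree, 8))
```

The unbalanced chain churns through many spurious transitions before settling; the balanced tree, whose inputs arrive together at every gate, produces none here.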
Run time power reduction techniques
 A static reduction in the power supply is not good for latency-
constrained designs.
 Since peak performance is not continuously required, power
reduction is still possible.
 Computational functions can be classified as
 Compute-intensive tasks – need full throughput – e.g., video decompression
 Low-speed functions – text processing, data entry
 Idle-mode operation – portable processors
 The computational throughput and latency expected from a
processor vary drastically over time.
 Dynamic Voltage Scaling (DVS)
 Dynamic Threshold Scaling (DTS)
Dynamic Voltage Scaling (DVS)
 Lowering only the clock frequency f reduces power, but the same
total energy is consumed.
 If VDD and f are reduced together, energy can be reduced.
 To maintain throughput and reduce energy consumption,
VDD and f must be dynamically varied – DVS.
 Basic principle – a function should be operated at the lowest
supply voltage that meets its timing constraints.
 DVS relies on the observation that the delay of CMOS
circuits tracks well across a range of supply voltages
(NMOS-only pass-transistor logic is an exception).
DVS
 A DVS implementation contains the following
components:
 A processor that can operate over a wide range of
supply voltages
 A supply regulation loop that sets the minimum
voltage necessary for operation at a desired frequency
 An operating system that calculates the required
frequency to meet requested throughputs and task-
completion deadlines
DVS system
 A ring oscillator whose frequency matches the critical path of the
processor is used in the power supply loop; it provides the translation
between clock frequency and supply voltage.
 The operating system sets the frequency digitally.
 The feedback error is the difference between the current and desired frequency.
 The supply voltage is adjusted so that the error goes to zero.
 A scheduler determines the frequency requirements of all
computational tasks dynamically.
 In a general-purpose processor, each task should supply a completion
deadline or a desired frequency.
 The voltage scheduler calculates the time required for completing
each task and decides the optimal frequency.
 For varying performance requirements, a queue
can be used to determine the computational load and adjust
the voltage accordingly.
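The regulation loop above can be sketched behaviorally: step the supply until the modeled ring-oscillator frequency matches the requested one. The oscillator model f ≈ k·(VDD − Vt)/VDD and all constants below are assumed first-order values, not a measured characteristic:

```python
# DVS supply-regulation loop, behavioral sketch with assumed constants.
VT = 0.4  # assumed threshold voltage (V)

def osc_freq(vdd, k=2.0e9):
    """Model of the critical-path-matched ring oscillator (Hz)."""
    return k * (vdd - VT) / vdd

def regulate(f_target, vdd=1.2, step=0.001):
    """Move VDD until the oscillator frequency error is ~zero."""
    for _ in range(1000):
        err = f_target - osc_freq(vdd)
        if abs(err) < 5e6:          # close enough: error driven to ~0
            break
        vdd += step if err > 0 else -step
    return vdd

v_full = regulate(1.2e9)   # full throughput requested by the scheduler
v_idle = regulate(0.4e9)   # light load: scheduler requests a lower f
print(f"{v_full:.2f} V vs {v_idle:.2f} V")
assert v_idle < v_full     # lower requested frequency -> lower supply
```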
DTS – Dynamic Threshold Scaling
 The threshold voltage is varied dynamically.
 For low-latency computation, the threshold should be minimal.
 For low speed, the threshold can be high.
 Standby mode – highest threshold – to minimize leakage
current.
 Substrate bias is the control knob for changing the threshold.
 Transistors are used as 4-terminal devices (best with triple-well
processes, where the body voltage can be controlled).
 Substrate biasing can be done per chip, per block, per cell, etc.
– layout cost increases with finer granularity.
 DTS has a smaller overhead than DVS (the currents involved are
smaller).
Leakage monitor using DTS
 Digital module + leakage control
monitor + substrate-bias charge
pump
 M1 and M2 operate in the subthreshold region.
 NMOS drain current in the subthreshold
region, where S is the subthreshold slope:
I_D = (I_S / W_0) · W · 10^(V_GS / S)
 Output of the bias generator:
V_b = S · log(W_2 / W_1)
 The leakage monitor should be as small
as possible.
 A conventional charge pump moves current into and out of the
substrate.
 The control circuit monitors the leakage current.
 If the leakage is above the preset value, the charge pump increases
the back bias, reducing the leakage to the target value.
 Junction leakage may increase the leakage current – the
feedback loop then runs again.
 Disadvantage:
 The effectiveness of body bias diminishes with technology scaling.
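The two monitor relations above can be checked numerically. The I_S, W_0, and subthreshold-slope values below are assumed illustrative numbers (an 85 mV/decade slope), not from the slides:

```python
import math

# Leakage-monitor relations in code, with assumed device constants.
I_S, W_0, S = 1e-9, 1.0, 0.085   # S: subthreshold slope, V/decade

def i_sub(w, vgs):
    """Subthreshold drain current: I_D = (I_S / W_0) * W * 10^(Vgs / S)."""
    return (I_S / W_0) * w * 10 ** (vgs / S)

def bias_out(w1, w2):
    """Bias generator output: Vb = S * log10(W2 / W1)."""
    return S * math.log10(w2 / w1)

w1, w2 = 1.0, 10.0
vb = bias_out(w1, w2)            # one decade of width ratio -> Vb = S
# Shifting Vgs down by Vb on the wider device reproduces M1's current,
# which is exactly what lets the monitor translate width ratio to voltage:
assert math.isclose(i_sub(w1, 0.0), i_sub(w2, -vb))
print(f"Vb = {vb * 1000:.0f} mV")
```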
Sleep time power reduction techniques
 Extreme case of dynamic power reduction – no active switching.
 Power dissipation is then due to leakage only.
 Solutions:
 DTS
 Sleep transistors
 Sleep transistors switch off the power rails when the circuit is in
sleep mode (they reduce leakage but increase circuit complexity).
 In normal operation, the SLEEP signal is '1' and the transistor must
present a small resistance (this R causes noise on the supply rails).
 To produce less noise, Ron should be small – i.e., a wide transistor.
 A wide transistor leads to a layout penalty.
 A high-threshold sleep transistor gives better leakage reduction (but
must be larger to achieve the same Ron).
Sleep transistors can be placed on one power rail or on both rails.
 Adding a sleep transistor increases the transistor stack – leakage
reduction on the order of 100x (low Vth) to 1000x (high Vth).
 In general, switching off the power supply erases the state of
registers – acceptable in cases where a new task is initiated after idle
mode.
 But it is a difficulty when the state needs to be preserved.
 This issue can be rectified by
 Tying the sensitive registers to non-gated power supply
rails
 Using the OS to save their state in nonvolatile memory
Array Subsystems

SRAM
DRAM
ROM
Array Subsystems
 Memory arrays account for the majority of transistors
in a CMOS system-on-chip.
 Random access memory is accessed with an address
and has a latency independent of the address.
 Serial access memories are accessed sequentially so
no address is necessary.
 Content addressable memories determine which
address(es) contain data that matches a specified key.
Classification
 Random access memory is divided into read-only memory (ROM)
and read/write memory (RAM) – nonvolatile vs. volatile.
 Volatile (RAM) memory retains its data only as long as
power is applied, while nonvolatile (ROM) memory
holds data indefinitely.
 Volatile memory is further divided into static structures and dynamic
structures.
Static cells                           Dynamic cells
Use feedback to maintain their state   Use charge stored on a capacitor
                                       through an access transistor
Faster and less troublesome;           Must be periodically read and rewritten
more area than dynamic RAM             to refresh their state (charge leaks
                                       while the access transistor is off)
 Among nonvolatile memories, mask ROMs are hardwired during
fabrication and cannot be changed. (Many other nonvolatile memories
can be changed, but slowly.)
 A programmable ROM (PROM) can be programmed once after
fabrication by blowing on-chip fuses with a special high
programming voltage.
 An erasable programmable ROM (EPROM) is programmed by
storing charge on a floating gate. It is erased by exposure to
ultraviolet (UV) light for several minutes, which knocks the charge off the
gate.
 Electrically erasable programmable ROMs (EEPROMs) are
similar, but can be erased in microseconds with on-chip circuitry.
 Flash memories are a variant of EEPROM that erases entire blocks
rather than individual bits. Because of their good density (blocks share
erase circuitry) and easy in-system reprogrammability, Flash
memories have replaced the other nonvolatile types in modern designs.
Memory array structure
 A memory array contains 2^n words of 2^m bits each. Each bit is
stored in a memory cell.
 Simplest organization: one row per word and one column per
bit.
 The row decoder uses the address to
activate one of the rows by asserting its
wordline.
 During a read operation, the cells on this
wordline drive the bitlines, which may
have been conditioned to a known value
in advance of the memory access.
 The column circuitry may contain
amplifiers or buffers to sense the data.
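The address split between the row decoder and the column circuitry can be sketched directly. The particular 8/2 bit split below is illustrative, not a specific chip's layout:

```python
# Address split for a 2^n-word array with column multiplexing: high
# address bits select the row, low bits select a column group. A sketch.
def split_address(addr, row_bits, col_bits):
    """Return (row, column) indices for a folded array."""
    row = addr >> col_bits              # high bits drive the row decoder
    col = addr & ((1 << col_bits) - 1)  # low bits drive the column mux
    return row, col

# 1K words: 8 row bits + 2 column-mux bits -> 256 rows, 4-way column mux
assert split_address(0b1011001110, 8, 2) == (0b10110011, 0b10)
```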
SRAM
 Static RAMs use a memory cell with internal
feedback that retains its value as long as power is
applied. It has the following attractive properties:
 Denser than flip-flops
 Compatible with standard CMOS processes
 Faster than DRAM
 Easier to use than DRAM

 For these reasons, SRAMs are widely used in
applications from caches to register files to tables to
scratchpad buffers. An SRAM consists of an array of
memory cells along with the row and column
circuitry.
SRAM cell
 An SRAM cell needs to be able to read and write data
and to hold the data as long as power is applied.
 A flip-flop could be used, but it is large.
 The 6-transistor (6T) SRAM cell is preferred.
Working - SRAM
 The 6T cell achieves its compactness at the expense of
more complex peripheral circuitry for reading and
writing the cells.
 The small cell size also offers shorter wires and hence
lower dynamic power consumption.
 The 6T SRAM cell contains a pair of weak cross-
coupled inverters holding the state and a pair of access
transistors to read or write the state.
 The positive feedback corrects disturbances caused by
leakage or noise.
 The cell is written by driving the desired value and its
complement onto the bitlines, bit and bit_b, then
raising the wordline, word.
 The new data overpowers the cross-coupled inverters.
 It is read by precharging the two bitlines high, then
allowing them to float.
 When word is raised, bit or bit_b pulls down,
indicating the data value.
 The central challenges in SRAM design are
minimizing its size and ensuring that the circuitry
holding the state is weak enough to be
overpowered during a write, yet strong enough
not to be disturbed during a read.
Read operation - SRAM
 The bitlines are both initially
floating high.
 Assume Q is initially 0 and Q_b is 1.
 Q_b and bit_b should both remain 1.
 When the wordline is raised, bit
should be pulled down through the
driver and access transistors D1 and A1.
 While bit is being pulled down, node Q
tends to rise. Q is held low by D1 but
raised by current flowing in from A1, so
D1 must be stronger than A1.
 The transistors must be ratioed such that
node Q remains below the switching
threshold of the P2/D2 inverter. This
constraint is called read stability.
Write operation - SRAM
 Assume Q is initially 0 and Q_b is 1,
and the bit to be written is 1.
 bit is precharged high and left
floating; bit_b is pulled low by a
write driver.
 The cell must be written by forcing
Q_b low through A2.
 P2 opposes this operation, so P2 must
be weaker than A2 so that Q_b can be
pulled low enough. This constraint is
called writability.
 Once Q_b falls low, D1 turns OFF
and P1 turns ON, pulling Q high as
desired.
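The read/write protocol above can be captured as a behavioral sketch. This models only the logic-level bitline behavior; the read-stability and writability ratio constraints themselves require circuit-level simulation:

```python
# Behavioral sketch of the 6T SRAM cell's read/write protocol.
class Sram6T:
    def __init__(self):
        self.q = 0                  # state held by the cross-coupled pair

    def write(self, value):
        # Drive bit/bit_b to the desired value and raise the wordline:
        # the write drivers overpower the cross-coupled inverters.
        self.q = 1 if value else 0

    def read(self):
        # Precharge both bitlines high, float them, raise the wordline:
        # the side storing 0 pulls its bitline low.
        bit, bit_b = 1, 1
        if self.q == 0:
            bit = 0                 # pulled down through D1 and A1
        else:
            bit_b = 0               # pulled down through D2 and A2
        return bit, bit_b           # sensed value is `bit`

cell = Sram6T()
cell.write(1)
assert cell.read() == (1, 0)
cell.write(0)
assert cell.read() == (0, 1)
```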
Cell stability
 To ensure both read stability and writability, the transistors must
satisfy ratio constraints.
 The nMOS pulldown transistor in the cross-coupled
inverters must be strongest.
 The access transistors are of intermediate strength
 The pMOS pullup transistors must be weak.
 To achieve good layout density, all of the transistors must be
relatively small.
 For example, the pulldowns could be 8/2 λ, the access transistors
4/2, and the pullups 3/3.
 Cells must operate correctly at all voltages and temperatures
despite process variation.
Physical Design
 Cells share VDD and GND lines
between adjacent cells along the
cell boundary
 A single diffusion contact to the bit
line is shared between a pair of
cells (this halves the diffusion
capacitance and reduces the
discharge delay)
 The wordline is run in both metal1 and
polysilicon

For more information, watch https://youtu.be/k5VBJcUcaWU
DRAM
 Dynamic RAMs (DRAMs) store their contents as
charge on a capacitor rather than in a feedback
loop.
 Thus, the basic cell is substantially smaller than
SRAM, but the cell must be periodically read and
refreshed so that its contents do not leak away.
 They offer a factor of 10–20 greater density than
high-performance SRAM built in a standard logic
process
 1-transistor DRAM cell – a transistor and a capacitor.
 The cell is accessed by asserting the wordline to
connect the capacitor to the bitline.
Working
 On a read, the bitline is first precharged to VDD/2.
 When the wordline rises, the capacitor shares its charge with the
bitline, causing a voltage change ΔV that can be sensed.
 The read disturbs the cell contents at x, so the cell must be
rewritten after each read.
 On a write, the bitline is driven high or low and the voltage is
forced onto the capacitor. Some DRAMs drive the wordline to
VDDP = VDD + Vt to avoid a degraded level when writing a ‘1.’
 The DRAM capacitor Ccell must be as physically small as
possible to achieve good density.
 The bitline is contacted to many DRAM cells and has a relatively
large capacitance Cbit. Therefore, the cell capacitance is typically
much smaller than the bitline capacitance.
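The read swing follows from charge sharing between C_cell and C_bit: the precharged bitline moves by ΔV = (V_cell − VDD/2)·C_cell/(C_cell + C_bit). A sketch with assumed capacitance values (not from any specific process):

```python
# DRAM read: bitline swing from charge sharing, with assumed values.
def read_swing(v_cell, vdd, c_cell, c_bit):
    """Bitline moves from its VDD/2 precharge toward the cell voltage."""
    return (v_cell - vdd / 2) * c_cell / (c_cell + c_bit)

VDD = 1.8
C_CELL, C_BIT = 30e-15, 300e-15      # assumed 10:1 bitline-to-cell ratio
dv1 = read_swing(VDD, VDD, C_CELL, C_BIT)   # cell stored a '1'
dv0 = read_swing(0.0, VDD, C_CELL, C_BIT)   # cell stored a '0'
print(f"dV = {dv1 * 1000:+.0f} mV / {dv0 * 1000:+.0f} mV")
```

With these numbers the sense amplifier must resolve swings of well under 100 mV, which is why the low-swing bitlines are so noise-sensitive.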
DRAM subarray
 The subarray size represents a trade-off between density and
performance. Larger subarrays amortize the decoders and
sense amplifiers across more cells and thus achieve better
array efficiency. But they also are slow and have small bitline
swings because of the high wordline and bitline capacitance.
 A typical subarray has an order of magnitude higher
capacitance on the bitline than in the cell, so the bitline voltage
swing ΔV during a read is tiny.
 The array uses a sense amplifier to compare the bitline voltage
to that of an idle bitline (precharged to VDD/2). The sense
amplifier must also be compact to fit the tight pitch of the
array.
 The low-swing bitlines are sensitive to noise.
 Three bitline architectures, open, folded, and twisted, offer
different compromises between noise and area.

For more information, see https://youtu.be/wNNtz_My2ps
Dynamic vs static RAM
 In practice, dynamic RAM is used for a computer’s main memory,
since it’s cheap and you can pack a lot of storage into a small
space.
 The disadvantage of dynamic RAM is its speed.
 Transfer rates are 800 MHz at best, which can be much slower
than the processor itself.
 You also have to consider latency, or the time it takes data to
travel from RAM to the processor.
 Real systems augment dynamic memory with small but fast
sections of static memory called caches.
 Typical processor caches range in size from 128KB to 320KB.
 That’s small compared to a 128MB main memory, but it’s
enough to significantly increase a computer’s overall speed.
Read Only Memory (ROM)
 Read-Only Memory (ROM) cells can be built with only
one transistor per bit of storage.
 A ROM is a nonvolatile memory structure in that the state
is retained indefinitely—even without power.
 A ROM array is commonly implemented as a single-
ended NOR array.
 Commercial ROMs are normally dynamic, although
pseudo-nMOS is simple and suffices for small structures.
 As in SRAM cells and other footless dynamic gates, the
wordline input must be low during precharge on dynamic
NOR gates.
 If DC power dissipation is acceptable and the speed is
sufficient, the pseudo-nMOS ROM is the easiest to
design, requiring no timing.
 The DC power dissipation can be significantly
reduced in multiplexed ROMs by placing the pullup
transistors after the column multiplexer.
 The contents of the ROM can be symbolically
represented with a dot diagram in which dots indicate
the presence of 1s.
 The dots correspond to nMOS pulldown transistors
connected to the bit lines, but the outputs are inverted.
ROM structure
 Mask-programmed ROMs can be configured by the presence
or absence of a transistor or contact, or by a threshold implant
that turns a transistor permanently OFF where it is not needed.
 Omitting transistors has the advantage of reducing capacitance
on the wordlines and power consumption.
 Programming with metal contacts was once popular because
such ROMs could be completely manufactured except for the
metal layer, and then programmed according to customer
requirements through a metallization step.
 The advent of EEPROM and Flash memory chips has reduced
demand for such mask-programmed ROMs.
 ROMs are actually combinational devices,
not sequential ones!
 You can store arbitrary data into a
ROM, and the same address will always
contain the same data.
 You can think of a ROM as a
combinational circuit that takes an
address as input and produces some
data as the output.
 A ROM table is basically just a truth table.
 The table shows what data is stored at
each ROM address.
 Data can be combinationally generated,
using the address as the input.

Address   Data
A2A1A0    V2V1V0
000       000
001       100
010       110
011       100
100       101
101       000
110       011
111       011
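Because a ROM is purely combinational, its contents behave exactly like a lookup table; the table above maps directly to a list indexed by address:

```python
# The ROM truth table as a lookup: same address always yields same data.
ROM = [0b000, 0b100, 0b110, 0b100, 0b101, 0b000, 0b011, 0b011]

def rom_read(address):
    """Combinational read: just index by the 3-bit address."""
    return ROM[address]

assert rom_read(0b010) == 0b110
assert rom_read(0b111) == 0b011
```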