
Efficient Methods and Hardware for

Deep Learning

Song Han
Stanford University
May 25, 2017
Intro

Song Han: PhD Candidate, Stanford
Bill Dally: Chief Scientist, NVIDIA; Professor, Stanford
Deep Learning is Changing Our Lives
Self-Driving Car, Machine Translation, AlphaGo, Smart Robots
Models are Getting Larger

IMAGE RECOGNITION: 16X Model
2012, AlexNet: 8 layers, 1.4 GFLOP, ~16% Error
2015, ResNet (Microsoft): 152 layers, 22.6 GFLOP, ~3.5% Error

SPEECH RECOGNITION: 10X Training Ops
2014, Deep Speech 1: 80 GFLOP, 7,000 hrs of Data, ~8% Error
2015, Deep Speech 2 (Baidu): 465 GFLOP, 12,000 hrs of Data, ~5% Error

Dally, NIPS2016 workshop on Efficient Methods for Deep Neural Networks

4
The first Challenge: Model Size

Hard to distribute large models through over-the-air update


5
The Second Challenge: Speed

            Error rate   Training time
ResNet18:   10.76%       2.5 days
ResNet50:   7.02%        5 days
ResNet101:  6.21%        1 week
ResNet152:  6.16%        1.5 weeks

Such long training times limit ML researchers' productivity

Training time benchmarked with fb.resnet.torch using four M40 GPUs

6
The Third Challenge: Energy Efficiency

AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game

on mobile: drains battery

on data-center: increases TCO

7
Where is the Energy Consumed?

larger model => more memory reference => more energy

8
Relative Energy Cost

Operation              Energy [pJ]   Relative Cost
32 bit int ADD         0.1           1
32 bit float ADD       0.9           9
32 bit Register File   1             10
32 bit int MULT        3.1           31
32 bit float MULT      3.7           37
32 bit SRAM Cache      5             50
32 bit DRAM Memory     640           6400

Figure 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude more energy expensive than simple arithmetic.

To achieve this goal, we present a method to prune network connections in a manner that preserves the original accuracy. After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer.
How to make deep learning more efficient?
Improve the Efficiency of Deep Learning
by Algorithm-Hardware Co-Design

11
Application as a Black Box

Algorithm: Spec 2006
Hardware: CPU
12
Open the Box before Hardware Design

Algorithm: ?
Hardware: ?PU

Breaks the boundary between algorithm and hardware

13
Agenda

             Inference                             Training
Algorithm    Algorithms for Efficient Inference    Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference      Hardware for Efficient Training
Hardware 101: the Family

Hardware
  General Purpose*
    CPU: latency oriented
    GPU: throughput oriented
  Specialized HW
    FPGA: programmable logic
    ASIC: fixed logic

* including GPGPU
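Hardware 101: Number Representation

$(-1)^{S} \times (1.M) \times 2^{E}$

Format        Bit fields         Range               Accuracy
FP32          S:1  E:8  M:23     10^-38 - 10^38      .000006%
FP16          S:1  E:5  M:10     6x10^-5 - 6x10^4    .05%
Int32         S:1  M:31          0 - 2x10^9
Int16         S:1  M:15          0 - 6x10^4
Int8          S:1  M:7           0 - 127
Fixed point   S, I, F            -                   -

(Fixed point: the radix point sits between the integer bits I and the fraction bits F.)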

Dally, High Performance Hardware for Machine Learning, NIPS2015
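
The sign / exponent / mantissa split above can be checked on a real FP32 value. The following is a minimal sketch using only the Python standard library; the helper names `decode_fp32` and `rebuild_fp32` are illustrative, and the exponent bias of 127 comes from IEEE 754 single precision rather than from the slide. Only normal numbers are handled.

```python
import struct


def decode_fp32(x):
    """Split a float into the FP32 fields (S: 1 bit, E: 8 bits, M: 23 bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa


def rebuild_fp32(sign, exponent, mantissa):
    """Evaluate (-1)^S x (1.M) x 2^E for a normal number (E stored with bias 127)."""
    return (-1) ** sign * (1 + mantissa / 2 ** 23) * 2.0 ** (exponent - 127)


fields = decode_fp32(-2.5)      # -2.5 = -1.25 x 2^1
value = rebuild_fp32(*fields)
```

Zeros, subnormals, infinities and NaNs use reserved exponent values and are deliberately left out of this sketch.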


Hardware 101: Number Representation

Operation: Energy (pJ) Area (µm²)


8b Add 0.03 36
16b Add 0.05 67
32b Add 0.1 137
16b FP Add 0.4 1360
32b FP Add 0.9 4184
8b Mult 0.2 282
32b Mult 3.1 3495
16b FP Mult 1.1 1640
32b FP Mult 3.7 7700
32b SRAM Read (8KB) 5 N/A
32b DRAM Read 640 N/A

Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014
Area numbers are from synthesized result using Design Compiler under TSMC 45nm tech node. FP units used DesignWare Library.
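
To make the gap in the table concrete, here is a small sketch that adds up the energy of one 32-bit floating-point multiply-accumulate when its weight comes from DRAM versus from SRAM. The energy figures are the ones in the table above; the accounting itself (one weight fetch per MAC, activation fetches ignored) is a simplifying assumption for illustration, not a model from the talk.

```python
# Energy per operation in pJ, copied from the table above.
ENERGY_PJ = {
    "32b FP Add": 0.9,
    "32b FP Mult": 3.7,
    "32b SRAM Read (8KB)": 5.0,
    "32b DRAM Read": 640.0,
}


def mac_energy_pj(weight_source):
    """Energy of one FP32 multiply-accumulate plus one weight fetch (assumed model)."""
    return (ENERGY_PJ[weight_source]
            + ENERGY_PJ["32b FP Mult"]
            + ENERGY_PJ["32b FP Add"])


dram_mac = mac_energy_pj("32b DRAM Read")
sram_mac = mac_energy_pj("32b SRAM Read (8KB)")
ratio = dram_mac / sram_mac
```

Under this assumption the memory access, not the arithmetic, dominates the cost of each MAC, which is the motivation for keeping the model small enough to stay on chip.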
Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
Pruning Neural Networks

[Lecun et al. NIPS89]


[Han et al. NIPS15]

Pruning Trained Quantization Huffman Coding 24


[Han et al. NIPS15]

Pruning Neural Networks

$-0.01x^2 + x + 1$

60 Million -> 6M: 10x fewer connections

Pruning Trained Quantization Huffman Coding 25


[Han et al. NIPS15]

Pruning Neural Networks
Retrain to Recover Accuracy
Iteratively Retrain to Recover Accuracy

[Plot: Accuracy Loss (0.5% to -4.5%) vs. Parameters Pruned Away (40% to 100%). Curves, added one slide at a time: Pruning; Pruning+Retraining; Iterative Pruning and Retraining]

Pruning Trained Quantization Huffman Coding 29
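
The pruning step behind these curves removes every connection whose weight magnitude is below a threshold. A minimal sketch of that step in plain Python follows; the function names and the way the threshold is derived from a target sparsity are illustrative assumptions, and the retraining pass (which would update only the weights where the mask is true) is not shown.

```python
def threshold_for_sparsity(weights, fraction):
    """Return a magnitude threshold that prunes roughly `fraction` of the weights."""
    ranked = sorted(abs(w) for w in weights)
    k = int(len(ranked) * fraction)
    return ranked[k] if k < len(ranked) else float("inf")


def magnitude_prune(weights, threshold):
    """Zero out weights whose magnitude is below the threshold; return weights and mask."""
    mask = [abs(w) >= threshold for w in weights]
    pruned = [w if keep else 0.0 for w, keep in zip(weights, mask)]
    return pruned, mask


weights = [0.5, -0.01, 0.2, 0.03, -0.8, 0.002, 0.1, -0.04, 0.6, 0.05]
threshold = threshold_for_sparsity(weights, 0.5)
pruned, mask = magnitude_prune(weights, threshold)
```

Iterative pruning would repeat this with a gradually larger fraction, retraining the surviving weights between rounds.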


[Han et al. NIPS15]

Pruning RNN and LSTM

Image Captioning

[Figure: image captioning model. *Karpathy et al, "Deep Visual-Semantic Alignments for Generating Image Descriptions", 2015. Figure copyright IEEE, 2015; reproduced for educational purposes.]

Pruning Trained Quantization Huffman Coding 30


[Han et al. NIPS15]

Pruning RNN and LSTM


Original: a basketball player in a white uniform is playing with a ball
Pruned 90%: a basketball player in a white uniform is playing with a basketball

Original: a brown dog is running through a grassy field
Pruned 90%: a brown dog is running through a grassy area

Original: a man is riding a surfboard on a wave
Pruned 90%: a man in a wetsuit is riding a wave on a beach

Original: a soccer player in red is running in the field
Pruned 95%: a man in a red shirt and black and white black shirt is running through a field

Pruning Trained Quantization Huffman Coding 31


Pruning Happens in Human Brain

Newborn: 50 Trillion Synapses
1 year old: 1000 Trillion Synapses
Adolescent: 500 Trillion Synapses

Christopher A. Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172, 2013.

Pruning Trained Quantization Huffman Coding 32


[Han et al. NIPS15]

Pruning Changes Weight Distribution

Before Pruning After Pruning After Retraining

Conv5 layer of Alexnet. Representative for other network layers as well.

Pruning Trained Quantization Huffman Coding 33


Part 1: Algorithms for Efficient Inference

2. Weight Sharing
[Han et al. ICLR16]

Trained Quantization

2.09, 2.12, 1.92, 1.87

2.0

Pruning Trained Quantization Huffman Coding 35


[Han et al. ICLR16]

Trained Quantization

2.09, 2.12, 1.92, 1.87 => 2.0

32 bit => 4 bit: 8x less memory footprint
Pruning Trained Quantization Huffman Coding 36
[Han et al. ICLR16]

Trained Quantization

weights (32 bit float)         cluster index (2 bit uint)
 2.09  -0.98   1.48   0.09     3  0  2  1
 0.05  -0.14  -1.08   2.12     1  1  0  3
-0.91   1.92   0     -1.03     0  3  1  0
 1.87   0      1.53   1.49     3  1  2  2

cluster   centroids   fine-tuned centroids
3:         2.00         1.96
2:         1.50         1.48
1:         0.00        -0.04
0:        -1.00        -0.97

gradient
-0.03  -0.01   0.03   0.02
-0.01   0.01  -0.02   0.12
-0.01   0.02   0.04   0.01
-0.07  -0.02   0.01  -0.02

group by cluster index                  reduce
3:  -0.03   0.12   0.02  -0.07           0.04
2:   0.03   0.01  -0.02                  0.02
1:   0.02  -0.01   0.01   0.04  -0.02    0.04
0:  -0.01  -0.02  -0.01   0.01          -0.03

fine-tuned centroid = centroid - lr x reduced gradient

Figure 3: Weight sharing by scalar quantization (top) and centroids fine-tuning (bottom)


Pruning Trained Quantization Huffman Coding 37
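
The centroid fine-tuning in Figure 3 can be replayed in a few lines: gradients are grouped by cluster index, summed, and subtracted from the shared centroid. The sketch below uses the numbers from the figure; the function name is illustrative, and `lr = 1.0` is an assumption inferred from the figure's values (for example 2.00 - 0.04 = 1.96), not a recommended learning rate.

```python
# Values copied from Figure 3.
cluster_index = [
    [3, 0, 2, 1],
    [1, 1, 0, 3],
    [0, 3, 1, 0],
    [3, 1, 2, 2],
]
gradient = [
    [-0.03, -0.01, 0.03, 0.02],
    [-0.01, 0.01, -0.02, 0.12],
    [-0.01, 0.02, 0.04, 0.01],
    [-0.07, -0.02, 0.01, -0.02],
]
centroids = [-1.00, 0.00, 1.50, 2.00]  # position in the list = cluster index


def fine_tune_centroids(centroids, cluster_index, gradient, lr):
    """Group gradients by cluster, reduce by summation, and update each centroid."""
    reduced = [0.0] * len(centroids)
    for index_row, grad_row in zip(cluster_index, gradient):
        for k, g in zip(index_row, grad_row):
            reduced[k] += g
    return [c - lr * r for c, r in zip(centroids, reduced)]


updated = fine_tune_centroids(centroids, cluster_index, gradient, lr=1.0)
rounded = [round(c, 2) for c in updated]
```

Because every weight in a cluster shares one centroid, only the small code book is trained, while the 2-bit indices stay fixed.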
[Han et al. ICLR16]

Before Trained Quantization: Continuous Weight
[Histogram: Count vs. Weight Value]

After Trained Quantization: Discrete Weight
[Histogram: Count vs. Weight Value]

After Trained Quantization: Discrete Weight after Training
[Histogram: Count vs. Weight Value]
Pruning Trained Quantization Huffman Coding 46
[Han et al. ICLR16]

How Many Bits do We Need?

Pruning + Trained Quantization Work Together

AlexNet on ImageNet

Pruning Trained Quantization Huffman Coding 50


[Han et al. ICLR16]

Huffman Coding

In-frequent weights: use more bits to represent

Frequent weights: use less bits to represent
Pruning Trained Quantization Huffman Coding 51
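
The idea that frequent weights get short codes and infrequent weights get long ones can be illustrated with a textbook Huffman construction. This is a generic sketch using Python's `heapq`, not the encoder from the Deep Compression paper; the symbol stream (16 two-bit cluster indices) is made up for the example.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {symbol: 1 for symbol in counts}
    # Heap entries: (count, unique tiebreak, {symbol: depth so far}).
    heap = [(count, i, {symbol: 0}) for i, (symbol, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        count_a, _, depths_a = heapq.heappop(heap)
        count_b, _, depths_b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (count_a + count_b, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


# Hypothetical stream of quantized weight indices with a skewed distribution.
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
lengths = huffman_code_lengths(indices)
huffman_bits = sum(lengths[s] for s in indices)
fixed_bits = 2 * len(indices)
```

The saving grows with the skew of the distribution, which is why it is applied after pruning and quantization have made the index distribution non-uniform.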
[Han et al. ICLR16]

Summary of Deep Compression

Published as a conference paper at ICLR 2016

original network, original size

Pruning: less number of weights
  Train Connectivity -> Prune Connections -> Train Weights
  same accuracy, 9x-13x reduction

Quantization: less bits per weight
  Cluster the Weights -> Generate Code Book -> Quantize the Weights with Code Book -> Retrain Code Book
  same accuracy, 27x-31x reduction

Huffman Encoding
  Encode Weights -> Encode Index
  same accuracy, 35x-49x reduction

Figure 1: The three stage compression pipeline: pruning, quantization and Huffman coding. Pruning reduces the number of weights by 10x, while quantization further improves the compression rate: between 27x and 31x. Huffman coding gives more compression: between 35x and 49x. The compression rate already included the meta-data for sparse representation. The compression scheme doesn't incur any accuracy loss.

Pruning Trained Quantization Huffman Coding 52
[Han et al. ICLR16]

Results: Compression Ratio

Network     Original Size   Compressed Size   Compression Ratio   Original Accuracy   Compressed Accuracy
LeNet-300   1070KB          27KB              40x                 98.36%              98.42%
LeNet-5     1720KB          44KB              39x                 99.20%              99.26%
AlexNet     240MB           6.9MB             35x                 80.27%              80.30%
VGGNet      550MB           11.3MB            49x                 88.68%              89.09%
GoogleNet   28MB            2.8MB             10x                 88.90%              88.92%
ResNet-18   44.6MB          4.0MB             11x                 89.24%              89.28%

Can we make compact models to begin with?

Compression Acceleration Regularization 53


SqueezeNet

Vanilla Fire module:

Input (64) -> 1x1 Conv Squeeze (16) -> [1x1 Conv Expand (64), 3x3 Conv Expand (64)] -> Concat/Eltwise -> Output (128)


Iandola et al, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv 2016
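
To see why the Fire module is compact, the sketch below counts its weights using the channel sizes shown on the slide (64 in, squeeze to 16, two expand branches of 64 each). The comparison against a single plain 3x3 convolution from 64 to 128 channels is an illustrative baseline chosen here, not a figure from the SqueezeNet paper; biases are ignored.

```python
def conv_params(c_in, c_out, kernel):
    """Weight count of a kernel x kernel convolution, ignoring biases."""
    return c_in * c_out * kernel * kernel


def fire_params(c_in, squeeze, expand_1x1, expand_3x3):
    """Weight count of a Fire module: 1x1 squeeze, then parallel 1x1 and 3x3 expands."""
    return (conv_params(c_in, squeeze, 1)
            + conv_params(squeeze, expand_1x1, 1)
            + conv_params(squeeze, expand_3x3, 3))


fire = fire_params(64, 16, 64, 64)
plain = conv_params(64, 128, 3)   # hypothetical baseline: one 3x3 conv, 64 -> 128
ratio = plain / fire
```

Most of the saving comes from the squeeze layer, which cuts the number of input channels seen by the expensive 3x3 filters from 64 to 16.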

Compression Acceleration Regularization 54


Compressing SqueezeNet

Network      Approach           Size     Ratio   Top-1 Accuracy   Top-5 Accuracy
AlexNet      -                  240MB    1x      57.2%            80.3%
AlexNet      SVD                48MB     5x      56.0%            79.4%
AlexNet      Deep Compression   6.9MB    35x     57.2%            80.3%
SqueezeNet   -                  4.8MB    50x     57.5%            80.3%
SqueezeNet   Deep Compression   0.47MB   510x    57.5%            80.3%

Iandola et al, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv 2016

Compression Acceleration Regularization 55


Results: Speedup

[Bar chart: speedup on CPU, GPU and mGPU, with an Average group]

Compression Acceleration Regularization 56


Results: Energy Efficiency

[Bar chart: energy efficiency on CPU, GPU and mGPU, with an Average group]

Compression Acceleration Regularization 57


Deep Compression Applied to Industry

Deep
Compression

F8

Compression Acceleration Regularization 58


Part 1: Algorithms for Efficient Inference

3. Quantization
Quantizing the Weight and Activation

Train with float

Quantizing the weight and activation:
- Gather the statistics for weight and activation
- Choose proper radix point position

Fine-tune in float format

Convert to fixed-point format

Qiu et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA16
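
The "choose proper radix point position" and "convert to fixed-point" steps can be sketched as follows. This is one simple policy (pick the largest number of fraction bits that still fits the largest observed magnitude), written for illustration; the function names and the 8-bit default are assumptions, and the procedure in Qiu et al. may differ in detail.

```python
def choose_frac_bits(max_abs, total_bits=8):
    """Largest fraction-bit count such that max_abs still fits in a signed word."""
    q_max = 2 ** (total_bits - 1) - 1
    frac_bits = total_bits - 1
    while max_abs * 2.0 ** frac_bits > q_max:
        frac_bits -= 1
    return frac_bits


def to_fixed(x, frac_bits, total_bits=8):
    """Round x to a signed fixed-point integer, saturating at the word limits."""
    q_min = -(2 ** (total_bits - 1))
    q_max = 2 ** (total_bits - 1) - 1
    return max(q_min, min(q_max, round(x * 2.0 ** frac_bits)))


def to_float(q, frac_bits):
    """Recover the real value represented by fixed-point integer q."""
    return q / 2.0 ** frac_bits


frac_bits = choose_frac_bits(1.5)        # statistics step: largest |value| seen is 1.5
q = to_fixed(0.3, frac_bits)
approx = to_float(q, frac_bits)
```

Weights and activations of each layer can use different radix point positions, which is why the statistics are gathered first.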
Quantization Result

Qiu et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA16
Part 1: Algorithms for Efficient Inference

4. Low Rank Approximation
Low Rank Approximation for Conv

Zhang et al., Efficient and Accurate Approximations of Nonlinear Convolutional Networks, CVPR15


Low Rank Approximation for FC

Novikov et al., Tensorizing Neural Networks, NIPS15


Part 1: Algorithms for Efficient Inference

5. Binary / Ternary Net
Binary / Ternary Net: Motivation

[Figure: weight histograms (Count, up to 6400, vs. Weight Value, about -0.05 to 0.05), and the step Normalized Full Precision Weight => Quantize => Intermediate Ternary Weight with values -1, 0, 1 and thresholds -t, t]
Trained Ternary Quantization

Full Precision Weight -> Normalize -> Normalized Full Precision Weight (-1 to 1) -> Quantize (thresholds -t, t) -> Intermediate Ternary Weight (-1, 0, 1) -> Trained Quantization (Scale by Wn, Wp) -> Final Ternary Weight (-Wn, 0, Wp)

Legend: Feed Forward; Back Propagate (gradient1, gradient2, from the Loss); Inference Time

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR17

Pruning Trained Quantization Huffman Coding


Weight Evolution during Training

$$\Delta_l = t \times \max(|w|) \qquad (9)$$

and 2) maintain a constant sparsity r for all layers throughout training. By adjusting the hyper-parameter r we are able to obtain ternary weight networks with various sparsities. We use the first method and set t to 0.05 in experiments on CIFAR-10 and ImageNet dataset and use the second one to explore a wider range of sparsities in section 5.1.1.

[Plot, top row: Ternary Weight Value (-3 to 3) vs. Epochs (0 to 150) for res1.0/conv1/Wn, res1.0/conv1/Wp, res3.2/conv2/Wn, res3.2/conv2/Wp, linear/Wn, linear/Wp. Bottom row: Ternary Weight Percentage (0% to 100%) of Negatives, Zeros and Positives vs. Epochs]

Figure 2: Ternary weights value (above) and distribution (below) with iterations for different layers of ResNet-20 on CIFAR-10.

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR17
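
The forward quantization step of the first method (a threshold equal to t times the largest weight magnitude, as in equation (9)) can be sketched in plain Python. The function name and the example values of `w_p` and `w_n` are illustrative; in TTQ those two scaling factors are learned by back-propagation, which is not shown here.

```python
def ternarize(weights, t, w_p, w_n):
    """Map each weight to w_p, 0 or -w_n using the threshold t * max(|w|)."""
    delta = t * max(abs(w) for w in weights)
    ternary = []
    for w in weights:
        if w > delta:
            ternary.append(w_p)
        elif w < -delta:
            ternary.append(-w_n)
        else:
            ternary.append(0.0)
    return ternary


layer = [0.9, -0.02, 0.03, -1.0, 0.5]                 # hypothetical full precision weights
quantized = ternarize(layer, t=0.05, w_p=1.2, w_n=0.8)  # w_p, w_n: hypothetical learned scales
```

At inference time only the ternary pattern and the two scales per layer need to be stored.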
Visualization of the TTQ Kernels

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR17

Pruning Trained Quantization Huffman Coding


Error Rate on ImageNet

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR17

Pruning Trained Quantization Huffman Coding


Part 1: Algorithms for Efficient Inference

6. Winograd Transformation
3x3 DIRECT Convolutions
Compute Bound

9xC FMAs/Output: Math Intensive

9xK FMAs/Input: Good Data Reuse

[Figure: a 3x3 Image Tensor (elements 0 to 8) and a 3x3 Filter (elements 8 to 0), with the element-wise products that form one output]


Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs

Julien Demouth, Convolution OPTIMIZATION: Winograd, NVIDIA

73
3x3 WINOGRAD Convolutions
Transform Data to Reduce Math Intensity

[Figure: 4x4 Tile -> Data Transform; Filter -> Filter Transform; Point-wise multiplication, summed over C -> Output Transform]

Direct convolution: we need 9xCx4 = 36xC FMAs for 4 outputs


Winograd convolution: we need 16xC FMAs for 4 outputs: 2.25x fewer FMAs

See A. Lavin & S. Gray, Fast Algorithms for Convolutional Neural Networks
Julien Demouth, Convolution OPTIMIZATION: Winograd, NVIDIA
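
The saving is easiest to see in one dimension. The sketch below implements the 1D Winograd algorithm F(2,3) described by Lavin & Gray: two outputs of a 3-tap filter with 4 multiplications instead of 6. Nesting it in two dimensions gives the 4x4-tile case on the slide (16 instead of 36 multiplications, hence 2.25x). The function names are illustrative, and the code is a plain-Python check, not a tuned kernel.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs with 4 multiplications (filter transform can be precomputed)."""
    # Filter transform.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Data transform followed by point-wise multiplication.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


data = [1.0, 2.0, 3.0, 4.0]
taps = [0.5, 1.0, 2.0]
expected = direct_f23(data, taps)
fast = winograd_f23(data, taps)
```

The filter transform is done once per filter, so at inference time only the data transform, the point-wise products and the output transform remain.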

74
Speedup of Winograd Convolution
VGG16, Batch Size 1: Relative Performance

[Bar chart: cuDNN 3 vs. cuDNN 5, relative performance (0.00 to 2.00) for layers conv 1.1, conv 1.2, conv 2.1, conv 2.2, conv 3.1, conv 3.2, conv 4.1, conv 4.2, conv 5.0]

Measured on Maxwell TITAN X

Julien Demouth, Convolution OPTIMIZATION: Winograd, NVIDIA

75
Hardware for Efficient Inference
a common goal: minimize memory access

Eyeriss (MIT): RS Dataflow
DaDiannao (CAS): eDRAM
TPU (Google): 8-bit Integer
    This unit is designed for dense matrices. Sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs
EIE (Stanford): Compression/Sparsity

[Background: excerpt from the EIE paper]

Sparse Matrix Read Unit. The sparse-matrix read unit uses pointers $p_j$ and $p_{j+1}$ to read the non-zero elements (if any) of this PE's slice of column $I_j$ from the sparse-matrix SRAM. Each entry in the SRAM is 8-bits in length and contains one 4-bit element of v and one 4-bit element of x. For efficiency (see Section VI) the PE's slice of encoded sparse matrix I is stored in a 64-bit-wide SRAM. Thus eight entries are fetched on each SRAM read. The high 13 bits of the current pointer p selects an SRAM row, and the low 3-bits select one of the eight entries in that row. A single (v, x) entry is provided to the arithmetic unit each cycle.

Arithmetic Unit. The arithmetic unit receives a (v, x) entry from the sparse matrix read unit and performs the multiply-accumulate operation $b_x = b_x + v \times a_j$. Index x is used to index an accumulator array (the destination activation registers) while v is multiplied by the activation value at the head of the activation queue. Because v is stored in 4-bit encoded form, it is first expanded to a 16-bit fixed-point number via a table look up. A bypass path is provided to route the output of the adder to its input if the same accumulator is selected on two adjacent cycles.

Activation Read/Write. The Activation Read/Write Unit contains two activation register files that accommodate the source and destination activation values respectively during a single round of FC layer computation. The source and destination register files exchange their role for next layer. Thus no additional data transfer is needed to support multi-layer feed-forward computation.

Each activation register file holds 64 16-bit activations. This is sufficient to accommodate 4K activation vectors across 64 PEs. Longer activation vectors can be accommodated with the 2KB activation SRAM. When the activation vector has a length greater than 4K, the M×V will be completed in several batches, where each batch is of length ...

Central Unit: I/O and Computing. In the I/O mode, all of the PEs are idle while the activations and weights in every PE can be accessed by a DMA connected with the Central Unit. This is one time cost. In the Computing mode, the CCU repeatedly collects a non-zero value from the LN... quadtree and broadcasts this value to all PEs. This proc...

Figure 5. Layout of one PE in EIE under TSMC 45nm process (blocks: SpMat, Act_0, Act_1, Ptr_Even, Arithm, Ptr_Odd, SpMat)

Table II
The implementation results of one PE in EIE and the breakdown by component type (line 3-7), by module (line 8-13). The critical path of EIE is 1.15 ns

                Power (mW)   (%)         Area (µm²)   (%)
Total           9.157                    638,024
memory          5.416        (59.15%)    594,786      (93.22%)
clock network   1.874        (20.46%)    866          (0.14%)
register        1.026        (11.20%)    9,465        (1.48%)
combinational   0.841        (9.18%)     8,946        (1.40%)
filler cell                              23,961       (3.76%)
Act queue       0.112        (1.23%)     758          (0.12%)
PtrRead         1.807        (19.73%)    121,849      (19.10%)
SpmatRead       4.955        (54.11%)    469,412      (73.57%)
ArithmUnit      1.162        (12.68%)    3,110        (0.49%)
ActRW           1.122        (12.25%)    18,934       (2.97%)
filler cell                              23,961       (3.76%)

Compression Acceleration Regularization 77
Google TPU

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

78
Google TPU
The Matrix Unit: 65,536 (256x256) 8-bit multiply-accumulate units
700 MHz clock rate
Peak: 92T operations/second (65,536 * 2 * 700M)
>25X as many MACs vs GPU
>100X as many MACs vs CPU
4 MiB of on-chip Accumulator memory
24 MiB of on-chip Unified Buffer (activation memory)
3.5X as much on-chip memory vs GPU
Two 2133MHz DDR3 DRAM channels
8 GiB of off-chip weight DRAM memory

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit
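
The peak figure quoted above follows directly from the array size and clock rate, counting each multiply-accumulate as two operations. A short check of that arithmetic (variable names are illustrative):

```python
MAC_UNITS = 256 * 256          # 65,536 multiply-accumulate units
OPS_PER_MAC = 2                # one multiply plus one add
CLOCK_HZ = 700_000_000         # 700 MHz

peak_ops_per_second = MAC_UNITS * OPS_PER_MAC * CLOCK_HZ
peak_tera_ops = peak_ops_per_second / 1e12
```

The result is about 91.75T operations per second, which the slide rounds to 92T.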

79
Google TPU

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

80
Inference Datacenter Workload

Name    LOC    Layers                           Nonlinear       Weights   TPU Ops /     TPU Batch
               FC   Conv   Vector  Pool  Total  function                  Weight Byte   Size
MLP0    0.1k   5                         5      ReLU            20M       200           200
MLP1    1k     4                         4      ReLU            5M        168           168
LSTM0   1k     24          34            58     sigmoid, tanh   52M       64            64
LSTM1   1.5k   37          19            56     sigmoid, tanh   34M       96            96
CNN0    1k          16                   16     ReLU            8M        2888          8
CNN1    1k     4    72             13    89     ReLU            100M      1750          32

% Deployed: MLP0 and MLP1 together 61%; LSTM0 and LSTM1 together 29%; CNN0 and CNN1 together 5%

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

81
Roofline Model: Identify Performance Bottleneck

nsightful visual
CM 52.4 (2009): 65-76.
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

82
TPU Roofline

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

83
Log Rooflines for CPU, GPU, TPU

Star = TPU
Triangle = GPU
Circle = CPU

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

84
Linear Rooflines for CPU, GPU, TPU

Star = TPU
Triangle = GPU
Circle = CPU

David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit

85
Why so far below Rooflines?
Low latency requirement => can't batch more => low ops/byte

How to Solve this?


less memory footprint => need to compress the model

Challenge:
Hardware that can infer on compressed model
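
The link between ops/byte and achieved performance is the roofline bound: attainable throughput is the smaller of the compute peak and memory bandwidth times operational intensity. The sketch below applies it to two workloads from the table, using the peak derived from the Matrix Unit figures. The weight-memory bandwidth of 34 GB/s is an assumption introduced for this example; it is not stated on these slides.

```python
def attainable_ops_per_s(ops_per_byte, peak_ops_per_s, bytes_per_s):
    """Roofline bound: min(compute peak, bandwidth x operational intensity)."""
    return min(peak_ops_per_s, ops_per_byte * bytes_per_s)


PEAK_OPS_PER_S = 65_536 * 2 * 700_000_000   # from the Matrix Unit figures
ASSUMED_BYTES_PER_S = 34e9                  # assumption: weight-memory bandwidth

mlp0 = attainable_ops_per_s(200, PEAK_OPS_PER_S, ASSUMED_BYTES_PER_S)    # MLP0: 200 ops/byte
cnn0 = attainable_ops_per_s(2888, PEAK_OPS_PER_S, ASSUMED_BYTES_PER_S)   # CNN0: 2888 ops/byte
```

Under this assumption MLP0 is memory bound at a small fraction of the peak while CNN0 reaches the compute roof, which is why a smaller memory footprint per operation matters for the MLP and LSTM workloads.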

86
[Han et al. ISCA16]
EIE: the First DNN Accelerator for
Sparse, Compressed Model

Sparse Weight (0*A=0): 90% static sparsity; 10x less computation; 5x less memory footprint

Sparse Activation (W*0=0): 70% dynamic sparsity; 3x less computation

Weight Sharing (2.09, 1.92 => 2): 4-bit weights; 8x less memory footprint

Compression Acceleration Regularization 88


EIE: Reduce Memory Access by Compression

logically:

$$
\vec{a} = \begin{pmatrix} 0 & a_1 & 0 & a_3 \end{pmatrix}
$$

$$
\begin{array}{r} PE0 \\ PE1 \\ PE2 \\ PE3 \\ {} \\ {} \\ {} \\ {} \end{array}
\begin{pmatrix}
w_{0,0} & w_{0,1} & 0 & w_{0,3} \\
0 & 0 & w_{1,2} & 0 \\
0 & w_{2,1} & 0 & w_{2,3} \\
0 & 0 & 0 & 0 \\
0 & 0 & w_{4,2} & w_{4,3} \\
w_{5,0} & 0 & 0 & 0 \\
0 & 0 & 0 & w_{6,3} \\
0 & w_{7,1} & 0 & 0
\end{pmatrix}
\vec{a}^{\,T}
=
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{pmatrix}
\overset{\mathrm{ReLU}}{\Rightarrow}
\begin{pmatrix} b_0 \\ b_1 \\ 0 \\ b_3 \\ 0 \\ b_5 \\ b_6 \\ 0 \end{pmatrix}
$$

physically:

Virtual Weight    W0,0   W0,1   W4,2   W0,3   W4,3
Relative Index    0      1      2      0      0
Column Pointer    0      1      2      3

Han et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016, Hotchips 2016
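
A software analogue of this storage scheme is a compressed sparse column (CSC) matrix-vector product that skips zero activations (W*0=0) and never stores zero weights (0*A=0). The sketch below is a simplification: it stores absolute row indices rather than EIE's 4-bit relative indices, runs on one "PE", and uses made-up numeric values laid out in the same sparsity pattern as the matrix above.

```python
def to_csc(dense):
    """Convert a dense row-major matrix to (values, row_idx, col_ptr)."""
    values, row_idx, col_ptr = [], [], [0]
    for j in range(len(dense[0])):
        for i in range(len(dense)):
            if dense[i][j] != 0:
                values.append(dense[i][j])
                row_idx.append(i)
        col_ptr.append(len(values))
    return values, row_idx, col_ptr


def sparse_matvec_relu(values, row_idx, col_ptr, a, n_rows):
    """b = ReLU(W a), visiting only non-zero activations and non-zero weights."""
    b = [0.0] * n_rows
    for j, a_j in enumerate(a):
        if a_j == 0:
            continue                      # skip the whole column: W * 0 = 0
        for p in range(col_ptr[j], col_ptr[j + 1]):
            b[row_idx[p]] += values[p] * a_j
    return [max(0.0, x) for x in b]


# Hypothetical values in the sparsity pattern of the 8x4 matrix above.
W = [
    [1, 2, 0, 3],
    [0, 0, 4, 0],
    [0, -1, 0, -2],
    [0, 0, 0, 0],
    [0, 0, 5, 6],
    [7, 0, 0, 0],
    [0, 0, 0, -3],
    [0, 1, 0, 0],
]
a = [0, 2, 0, 1]                          # a_0 and a_2 are zero, as on the slide
values, row_idx, col_ptr = to_csc(W)
b = sparse_matvec_relu(values, row_idx, col_ptr, a, n_rows=len(W))
```

Only columns 1 and 3 are touched, so 6 of the 11 stored weights are read; in EIE the rows are additionally interleaved across PEs so that each PE works on its own slice of the broadcast column.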
[Han et al. ISCA16]

Dataflow

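$$
\vec{a} = \begin{pmatrix} 0 & a_1 & 0 & a_3 \end{pmatrix}
$$

$$
\begin{array}{r} PE0 \\ PE1 \\ PE2 \\ PE3 \\ {} \\ {} \\ {} \\ {} \end{array}
\begin{pmatrix}
w_{0,0} & w_{0,1} & 0 & w_{0,3} \\
0 & 0 & w_{1,2} & 0 \\
0 & w_{2,1} & 0 & w_{2,3} \\
0 & 0 & 0 & 0 \\
0 & 0 & w_{4,2} & w_{4,3} \\
w_{5,0} & 0 & 0 & 0 \\
0 & 0 & 0 & w_{6,3} \\
0 & w_{7,1} & 0 & 0
\end{pmatrix}
\vec{a}^{\,T}
=
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{pmatrix}
\overset{\mathrm{ReLU}}{\Rightarrow}
\begin{pmatrix} b_0 \\ b_1 \\ 0 \\ b_3 \\ 0 \\ b_5 \\ b_6 \\ 0 \end{pmatrix}
$$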

rule of thumb:
0*A=0 W*0=0
Compression Acceleration Regularization 90
[Han et al. ISCA16]

Dataflow

~
a 0 a1 0 a3
~b
0 1 0 1 0 1
P E0 w0,0 w0,1 0 w0,3 b0 b0
B C B C B C
P E1 B 0 0 w1,2 0 C B b1 C B b1 C
B C B C B C
P E2 B 0 w2,1 0 w2,3 C
B B
C B b2 C
C
B
B 0 C
C
P E3 B 0 0 0 0 C
B B
C B b3 C
C ReLU
B
B b3 C
C
B C=B C ) B C
B 0 0 w4,2 w4,3 C B b4 C B 0 C
B C B C B C
Bw5,0 0 0 0 C B b5 C B b5 C
B C B C B C
B C B C B C
@ 0 0 0 w 6,3 A @ b6 A @ b6 A
0 w7,1 0 0 b7 0

rule of thumb:
0*A=0 W*0=0
Compression Acceleration Regularization 95
[Han et al. ISCA16]

Dataflow

~
a 0 a1 0 a3
~b
0 1 0 1 0 1
P E0 w0,0 w0,1 0 w0,3 b0 b0
B C B C B C
P E1 B 0 0 w1,2 0 C B b1 C B b1 C
B C B C B C
P E2 B 0 w2,1 0 w2,3 C
B B
C B b2 C
C
B
B 0 C
C
P E3 B 0 0 0 0 C
B B
C B b3 C
C ReLU
B
B b3 C
C
B C=B C ) B C
B 0 0 w4,2 w4,3 C B b4 C B 0 C
B C B C B C
Bw5,0 0 0 0 C B b5 C B b5 C
B C B C B C
B C B C B C
@ 0 0 0 w 6,3 A @ b6 A @ b6 A
0 w7,1 0 0 b7 0

rule of thumb:
0*A=0 W*0=0
Compression Acceleration Regularization 96
[Han et al. ISCA16]

Dataflow

~
a 0 a1 0 a3
~b
0 1 0 1 0 1
P E0 w0,0 w0,1 0 w0,3 b0 b0
B C B C B C
P E1 B 0 0 w1,2 0 C B b1 C B b1 C
B C B C B C
P E2 B 0 w2,1 0 w2,3 C
B B
C B b2 C
C
B
B 0 C
C
P E3 B 0 0 0 0 C
B B
C B b3 C
C ReLU
B
B b3 C
C
B C=B C ) B C
B 0 0 w4,2 w4,3 C B b4 C B 0 C
B C B C B C
Bw5,0 0 0 0 C B b5 C B b5 C
B C B C B C
B C B C B C
@ 0 0 0 w 6,3 A @ b6 A @ b6 A
0 w7,1 0 0 b7 0

rule of thumb:
0*A=0 W*0=0
Compression Acceleration Regularization 97
[Han et al. ISCA16]

Dataflow

~
a 0 a1 0 a3
~b
0 1 0 1 0 1
P E0 w0,0 w0,1 0 w0,3 b0 b0
B C B C B C
P E1 B 0 0 w1,2 0 C B b1 C B b1 C
B C B C B C
P E2 B 0 w2,1 0 w2,3 C
B B
C B b2 C
C
B
B 0 C
C
P E3 B 0 0 0 0 C
B B
C B b3 C
C ReLU
B
B b3 C
C
B C=B C ) B C
B 0 0 w4,2 w4,3 C B b4 C B 0 C
B C B C B C
Bw5,0 0 0 0 C B b5 C B b5 C
B C B C B C
B C B C B C
@ 0 0 0 w 6,3 A @ b6 A @ b6 A
0 w7,1 0 0 b7 0

rule of thumb:
0*A=0 W*0=0
Compression Acceleration Regularization 98
[Han et al. ISCA16]

Dataflow

~
a 0 a1 0 a3
~b
0 1 0 1 0 1
P E0 w0,0 w0,1 0 w0,3 b0 b0
B C B C B C
P E1 B 0 0 w1,2 0 C B b1 C B b1 C
B C B C B C
P E2 B 0 w2,1 0 w2,3 C
B B
C B b2 C
C
B
B 0 C
C
P E3 B 0 0 0 0 C
B B
C B b3 C
C ReLU
B
B b3 C
C
B C=B C ) B C
B 0 0 w4,2 w4,3 C B b4 C B 0 C
B C B C B C
Bw5,0 0 0 0 C B b5 C B b5 C
B C B C B C
B C B C B C
@ 0 0 0 w 6,3 A @ b6 A @ b6 A
0 w7,1 0 0 b7 0

rule of thumb:
0*A=0 W*0=0
Compression Acceleration Regularization 99
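
The dataflow can be illustrated with a small Python sketch. It is a sketch under stated assumptions rather than the hardware's exact schedule: rows are assumed to be interleaved over the PEs (row i goes to PE i mod 4, consistent with PE0 storing rows 0 and 4), only non-zero activations are broadcast, and each PE skips its zero weights. The name `eie_layer` is hypothetical.

```python
def eie_layer(W, a, n_pe=4):
    """Compute ReLU(W a), counting only the multiply-accumulates actually done."""
    n_rows = len(W)
    b = [0.0] * n_rows
    macs = 0
    for j, a_j in enumerate(a):
        if a_j == 0:
            continue  # W * 0 = 0: a zero activation is never broadcast
        for pe in range(n_pe):  # every PE receives (j, a_j)
            for i in range(pe, n_rows, n_pe):  # rows owned by this PE
                w = W[i][j]
                if w == 0:
                    continue  # 0 * A = 0: pruned weights cost nothing
                b[i] += w * a_j
                macs += 1
    return [max(0.0, x) for x in b], macs


# Sparsity pattern of the slide's 8x4 matrix, with 1 standing in for each w.
W = [[1, 1, 0, 1],
     [0, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 0, 0],
     [0, 0, 1, 1],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 1, 0, 0]]
b, macs = eie_layer(W, [0, 2, 0, 3])
```

For this input only 7 multiply-accumulates are performed, against 32 for the dense product.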
[Han et al. ISCA16]

EIE Architecture

Weight decode:      Compressed DNN Model (Encoded Weight, Relative Index, Sparse Format) -> 4-bit Virtual weight -> Weight Look-up -> 16-bit Real weight -> ALU
Address accumulate: 4-bit Relative Index -> Index Accum -> 16-bit Absolute Index -> Mem
Input Image -> ALU -> Mem -> Prediction Result

Figure 1. Efficient inference engine that works on the compressed deep neural network model for machine learning applications.

rule of thumb: 0 * A = 0, W * 0 = 0, 2.09, 1.92 => 2
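
The two decode paths can be sketched as follows. This is an illustrative sketch only: the codebook values are made up, the function names are hypothetical, and the real design uses fixed-point hardware rather than Python floats. It shows weight look-up (a 4-bit code selects a shared real weight) and address accumulation (relative indices are summed into absolute ones).

```python
def nearest_code(weight, codebook):
    """Weight sharing: pick the index of the closest shared value."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - weight))


def decode_weights(codes, rel_index, codebook):
    """Expand (code, relative index) pairs into (absolute index, real weight)."""
    if len(codebook) > 16:
        raise ValueError("a 4-bit code can address at most 16 shared weights")
    out, pos = [], -1
    for code, rel in zip(codes, rel_index):
        pos += rel + 1                      # address accumulate
        out.append((pos, codebook[code]))   # weight look-up
    return out


codebook = [-1.0, 0.0, 1.0, 2.0]  # hypothetical shared weights
code_a = nearest_code(2.09, codebook)  # both map to the shared value 2.0
code_b = nearest_code(1.92, codebook)
decoded = decode_weights([code_a, code_b, 0], [0, 1, 2], codebook)
```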
[Han et al. ISCA16]

Micro Architecture for each PE

[Figure: PE pipeline, left to right]
Act Queue: Act Value, Act Index (from Leading NZero Detect)
Pointer Read: Even Ptr SRAM Bank, Odd Ptr SRAM Bank -> Col Start/End Addr
Sparse Matrix Access: Sparse Matrix SRAM -> Encoded Weight, Relative Index
Arithmetic Unit: Weight Decoder, Address Accum (Absolute Address), Bypass
Act R/W: Act SRAM, Dest Act Regs, Src Act Regs, ReLU
Legend: SRAM, Regs, Comb



[Han et al. ISCA16]

Speedup on EIE

[Figure 6. Speedups of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases.
 Y axis: Speedup (log scale). X axis: Alex-6, Alex-7, Alex-8, VGG-6, VGG-7, VGG-8, NT-We, NT-Wd, NT-LSTM, Geo Mean.
 Series: CPU Dense (Baseline), CPU Compressed, GPU Dense, GPU Compressed, mGPU Dense, mGPU Compressed, EIE.]

[Han et al. ISCA16]

Energy Efficiency on EIE

[Figure 7. Energy efficiency of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases.
 Y axis: Energy Efficiency (log scale). X axis: Alex-6, Alex-7, Alex-8, VGG-6, VGG-7, VGG-8, NT-We, NT-Wd, NT-LSTM, Geo Mean.
 Series: CPU Dense (Baseline), CPU Compressed, GPU Dense, GPU Compressed, mGPU Dense, mGPU Compressed, EIE.]

[Han et al. ISCA16]

Comparison: Throughput

[Figure: Throughput (Layers/s in log scale), axis from 1E+00 to 1E+06, one bar per platform]

Platform        Process   Type
Core-i7 5930k   22nm      CPU
TitanX          28nm      GPU
Tegra K1        28nm      mGPU
A-Eye           28nm      FPGA
DaDianNao       28nm      ASIC
TrueNorth       28nm      ASIC
EIE             45nm      ASIC, 64PEs
EIE             28nm      ASIC, 256PEs



[Han et al. ISCA16]

Comparison: Energy Efficiency

[Figure: Energy Efficiency (Layers/J in log scale), axis from 1E+00 to 1E+06, one bar per platform]

Platform        Process   Type
Core-i7 5930k   22nm      CPU
TitanX          28nm      GPU
Tegra K1        28nm      mGPU
A-Eye           28nm      FPGA
DaDianNao       28nm      ASIC
TrueNorth       28nm      ASIC
EIE             45nm      ASIC, 64PEs
EIE             28nm      ASIC, 256PEs



Agenda

             Inference                             Training
Algorithm    Algorithms for Efficient Inference    Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference      Hardware for Efficient Training

Part 3: Efficient Training Algorithms

1. Parallelization
2. Mixed Precision with FP16 and FP32
3. Model Distillation
4. DSD: Dense-Sparse-Dense Training
Moore's law made CPUs 300x faster than in 1990
But it's over

C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011


Data Parallel: Run multiple inputs in parallel

Doesn't affect latency for one input
Requires P-fold larger batch size
For training requires coordinated weight update

Dally, High Performance Hardware for Machine Learning, NIPS2015
Parameter Update
One method to achieve scale is parallelization

Parameter Server: p' = p + Δp
Model Workers send Δp to the Parameter Server and receive the updated p'
Each Model Worker trains on its own Data Shard

Large Scale Distributed Deep Networks, Jeff Dean et al., NIPS 2012
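
A toy version of the parameter-server loop might look like the sketch below. Several things here are assumptions made for illustration: the model is a two-parameter line fit, the workers run one after another instead of in parallel, and updates are applied synchronously by averaging the workers' Δp (the cited paper also describes asynchronous updates). All class and function names are hypothetical.

```python
class ParameterServer:
    def __init__(self, p):
        self.p = list(p)

    def fetch(self):
        return list(self.p)

    def push(self, delta):
        # p' = p + delta_p
        self.p = [pi + di for pi, di in zip(self.p, delta)]


def worker_delta(p, shard, lr):
    """delta_p = -lr * gradient of the mean squared error of y = p[0]*x + p[1]."""
    g0 = g1 = 0.0
    for x, y in shard:
        err = p[0] * x + p[1] - y
        g0 += 2 * err * x
        g1 += 2 * err
    n = len(shard)
    return [-lr * g0 / n, -lr * g1 / n]


def train(shards, steps=200, lr=0.05):
    server = ParameterServer([0.0, 0.0])
    for _ in range(steps):
        p = server.fetch()
        # One delta per model worker; on real hardware these run in parallel.
        deltas = [worker_delta(p, shard, lr) for shard in shards]
        avg = [sum(d[k] for d in deltas) / len(deltas) for k in range(len(p))]
        server.push(avg)
    return server.p


# Two data shards sampled from y = 3x + 1.
shards = [[(x, 3 * x + 1) for x in (-1.5, -1.0, -0.5)],
          [(x, 3 * x + 1) for x in (0.5, 1.0, 1.5)]]
p_final = train(shards)
```

With equally sized shards, averaging the per-shard deltas equals one full-batch gradient step, which is why the coordinated update matters.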
Model Parallel
Split up the model, i.e. the network

Dally, High Performance Hardware for Machine Learning, NIPS2015


Model-Parallel Convolution: by output region (x,y)

6D Loop
Forall output map j
  For each input map k
    For each pixel x,y
      For each kernel element u,v
        Bxyj += A(x-u)(y-v)k x Kuvkj

[Figure: Input maps Axyk, Kernels (multiple 3D) Kuvkj, Output maps Bxyj; the output maps Bxyj are divided into (x,y) regions]

Dally, High Performance Hardware for Machine Learning, NIPS2015


Model-Parallel Convolution: by output map j (filter)

6D Loop
Forall output map j
  For each input map k
    For each pixel x,y
      For each kernel element u,v
        Bxyj += A(x-u)(y-v)k x Kuvkj

[Figure: Input maps Axyk, Kernels (multiple 3D) Kuvkj, Output maps Bxyj]

Dally, High Performance Hardware for Machine Learning, NIPS2015
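
The 6D loop can be written out directly in Python. This is a plain, unoptimized sketch: it assumes nested lists indexed as A[x][y][k], K[u][v][k][j] and B[x][y][j], and it treats inputs outside the image as zero, which the slide does not specify. The optional `out_maps` argument is a hypothetical way to show the split by output map j.

```python
def conv_maps(A, K, out_maps=None):
    """B[x][y][j] += A[x-u][y-v][k] * K[u][v][k][j] over all six indices."""
    X, Y, n_in = len(A), len(A[0]), len(A[0][0])
    U, V, n_out = len(K), len(K[0]), len(K[0][0][0])
    js = range(n_out) if out_maps is None else out_maps
    B = [[[0.0] * n_out for _ in range(Y)] for _ in range(X)]
    for j in js:                                  # output map (one filter)
        for k in range(n_in):                     # input map
            for x in range(X):                    # pixel x
                for y in range(Y):                # pixel y
                    for u in range(U):            # kernel element u
                        for v in range(V):        # kernel element v
                            if x - u >= 0 and y - v >= 0:
                                B[x][y][j] += A[x - u][y - v][k] * K[u][v][k][j]
    return B


A = [[[1.0] * 2 for _ in range(3)] for _ in range(3)]                        # 3x3 pixels, 2 input maps
K = [[[[1.0] * 2 for _ in range(2)] for _ in range(2)] for _ in range(2)]   # 2x2 kernels, 2 in, 2 out
full = conv_maps(A, K)
worker0 = conv_maps(A, K, out_maps=[0])   # one processor per output map
worker1 = conv_maps(A, K, out_maps=[1])
```

Because different output maps never write the same B entry, the two workers need no communication until their results are combined.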


Model Parallel Fully-Connected Layer (M x V)

bi = Wij x aj

bi:  Output activations
Wij: weight matrix
aj:  Input activations

Each processor holds a slice of the rows of Wij and computes the matching slice of bi.

Dally, High Performance Hardware for Machine Learning, NIPS2015
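
A minimal sketch of the row split, assuming contiguous blocks of output activations per processor (the slides do not say how the rows are assigned) and a hypothetical function name:

```python
def fc_model_parallel(W, a, n_proc):
    """b = W a, with the rows of W (output activations) split over n_proc processors."""
    m = len(W)
    chunk = (m + n_proc - 1) // n_proc
    b = []
    for p in range(n_proc):                    # each iteration stands for one processor
        rows = W[p * chunk:(p + 1) * chunk]    # this processor's slice of Wij
        b.extend(sum(w * x for w, x in zip(row, a)) for row in rows)
    return b


b = fc_model_parallel([[1, 2], [3, 4], [5, 6]], [1, 1], n_proc=2)

# A 4096 x 4096 layer (an assumed size) has 4096 * 4096 independent multiplies,
# i.e. about 16M.
multiplies = 4096 * 4096
```

Every processor needs the full input vector aj but only its own rows of the weight matrix, so the weights are never duplicated.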


Hyper-Parameter Parallel
Try many alternative networks in parallel

Dally, High Performance Hardware for Machine Learning, NIPS2015


Summary of Parallelism
Lots of parallelism in DNNs
  16M independent multiplies in one FC layer
  Limited by overhead to exploit a fraction of this

Data parallel
  Run multiple training examples in parallel
  Limited by batch size

Model parallel
  Split model over multiple processors
  By layer
  Conv layers by map region
  Fully connected layers by output activation

Easy to get 16-64 GPUs training one model in parallel


Dally, High Performance Hardware for Machine Learning, NIPS2015
Part 3: Efficient Training Algorithms

1. Parallelization
2. Mixed Precision with FP16 and FP32
3. Model Distillation
4. DSD: Dense-Sparse-Dense Training
Mixed Precision

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Mixed Precision Training

VOLTA TRAINING METHOD

FWD:            W (F16), Actv (F16)               -> Actv (F16)
BWD-A:          W (F16), Actv Grad (F16)          -> Actv Grad (F16)
BWD-W:          Actv (F16), Actv Grad (F16)       -> W Grad (F16)
Weight Update:  Master-W (F32), W Grad (F16)      -> Updated Master-W (F32)

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, Training with mixed precision, NVIDIA GTC 2017
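
The training method can be traced on a single scalar weight. The sketch below is a simplification made for illustration, not NVIDIA's implementation: it uses Python's `struct` module to round values to half and single precision, a one-weight model pred = w * x with squared-error loss, and a hypothetical `loss_scale` argument. It shows why a small gradient vanishes in F16 without loss scaling and why the master weight is kept in F32.

```python
import struct


def to_f16(x):
    """Round to the nearest IEEE half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]


def to_f32(x):
    """Round to the nearest IEEE single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]


def mixed_precision_step(master_w, x, y, lr, loss_scale=1.0):
    """One SGD step for pred = w * x with loss 0.5 * (pred - y) ** 2."""
    w16 = to_f16(master_w)                               # F16 copy of the F32 master weight
    act = to_f16(w16 * to_f16(x))                        # FWD in F16
    act_grad = to_f16((act - to_f16(y)) * loss_scale)    # BWD-A on the scaled loss
    w_grad = to_f16(act_grad * to_f16(x))                # BWD-W in F16
    # Weight update: unscale the gradient and apply it to the F32 master copy.
    return to_f32(master_w - lr * (w_grad / loss_scale))


w, x = 2.0 ** -10, 2.0 ** -8
no_scale = mixed_precision_step(w, x, 0.0, lr=1.0, loss_scale=1.0)
scaled = mixed_precision_step(w, x, 0.0, lr=1.0, loss_scale=1024.0)
```

In this example the unscaled weight gradient underflows to zero in F16, so `no_scale` equals `w`; with the loss scaled by 1024 the update survives, and rounding `scaled` back to F16 would lose it again.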
Inception V1

[Figure: INCEPTION V1]

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, Training with mixed precision, NVIDIA GTC 2017
ResNet

[Figure: RESNET50]

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, Training with mixed precision, NVIDIA GTC 2017
AlexNet

Mode                                             Top1 accuracy, %   Top5 accuracy, %
Fp32                                             58.62              81.25
Mixed precision training                         58.12              80.71
FP16 training (no scale of loss function)        54.89              78.12
FP16 training, loss scale = 1000                 57.76              80.76

Inception V3

Mode                                             Top1 accuracy, %   Top5 accuracy, %
Fp32                                             71.75              90.52
Mixed precision training                         71.17              90.10
FP16 training, loss scale = 1                    71.17              90.33
FP16 training, loss scale = 1,
  FP16 master weight storage                     70.53              90.14

ResNet-50 (scale loss function by 100x)

Mode                                             Top1 accuracy, %   Top5 accuracy, %
Fp32                                             73.85              91.44
Mixed precision training                         73.6               91.11
FP16 training                                    71.36              90.84
FP16 training, loss scale = 100                  74.13              91.51

Setup: Nvcaffe-0.16, DGX-1, SGD with momentum, 100 epochs, no augmentation, 1 crop, 1 model; the source slides quote batch=1024 and batch=512.

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, Training with mixed precision, NVIDIA GTC 2017



Part 3: Efficient Training Algorithms

1. Parallelization
2. Mixed Precision with FP16 and FP32
3. Model Distillation
4. DSD: Dense-Sparse-Dense Training
Model Distillation

Teacher model 1 (Googlenet) -> Knowledge -> student model
Teacher model 2 (Vggnet)    -> Knowledge -> student model
Teacher model 3 (Resnet)    -> Knowledge -> student model

student model has much smaller model size


Softened outputs reveal the dark knowledge

Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network


Softened outputs reveal the dark knowledge

Method: Divide score by a temperature to get a much softer distribution

Result: Start with a trained model that classifies 58.9% of the test frames correctly. The new model converges to 57.0% correct even when it is only trained on 3% of the data

Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network
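
Dividing the scores by a temperature before the softmax can be sketched as below. The scores and the temperature of 10 are made-up values for illustration, and the function name is hypothetical.

```python
import math


def softmax_with_temperature(scores, T=1.0):
    """Softmax of scores / T; a larger T gives a softer distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / T) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


teacher_scores = [10.0, 5.0, 1.0]   # hypothetical teacher logits
hard = softmax_with_temperature(teacher_scores, T=1.0)
soft = softmax_with_temperature(teacher_scores, T=10.0)
```

At T = 1 nearly all probability sits on the top class; at T = 10 the smaller classes receive visible mass, which is the extra signal the student model trains on.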


Part 3: Efficient Training Algorithms

1. Parallelization
2. Mixed Precision with FP16 and FP32
3. Model Distillation
4. DSD: Dense-Sparse-Dense Training
DSD: Dense Sparse Dense Training

Dense -> (Pruning: Sparsity Constraint) -> Sparse -> (Re-Dense: Increase Model Capacity) -> Dense

Figure 1: Dense-Sparse-Dense Training Flow. The sparse training regularizes the model, and the final dense training restores the pruned weights (red), increasing the model capacity without overfitting.

DSD produces same model architecture but can find better optimization solution, arrives at better local minima, and achieves higher prediction accuracy across a wide range of deep neural networks on CNNs / RNNs / LSTMs.

Han et al. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017
DSD: Intuition

learn the trunk first then learn the leaves

Han et al. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017
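
The three phases can be sketched on a tiny linear model. This is an illustrative sketch under assumptions, not the paper's training recipe: plain gradient descent on squared error, magnitude-based pruning, zero initial weights, pruned weights restarting from zero, and a 10x smaller learning rate in the final dense phase are all choices made here. The function names are hypothetical.

```python
def gd(w, data, steps, lr, mask=None):
    """Gradient descent on mean squared error; mask keeps pruned weights at zero."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2 * err * xi / len(data)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        if mask is not None:
            w = [wi if keep else 0.0 for wi, keep in zip(w, mask)]  # sparsity constraint
    return w


def magnitude_mask(w, sparsity):
    """True for weights to keep; the smallest-magnitude fraction is pruned."""
    n_prune = int(len(w) * sparsity)
    order = sorted(range(len(w)), key=lambda i: abs(w[i]))
    pruned = set(order[:n_prune])
    return [i not in pruned for i in range(len(w))]


def dsd(data, n_weights, sparsity=0.5, steps=200, lr=0.1):
    dense = gd([0.0] * n_weights, data, steps, lr)                  # Dense
    mask = magnitude_mask(dense, sparsity)                          # Pruning
    start = [wi if keep else 0.0 for wi, keep in zip(dense, mask)]
    sparse = gd(start, data, steps, lr, mask)                       # Sparse
    redense = gd(sparse, data, steps, lr * 0.1)                     # Re-Dense
    return dense, sparse, redense, mask


# Toy data from y = 2 * x0 + 0.01 * x1: a "trunk" weight and a "leaf" weight.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), 0.01), ((1.0, 1.0), 2.01), ((1.0, -1.0), 1.99)]
dense, sparse, redense, mask = dsd(data, n_weights=2)
```

The sparse phase keeps only the large weight; the re-dense phase frees the pruned weight again so it can be learned on top of the retained one.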
[Han et al. ICLR 2017]

DSD is General Purpose: Vision, Speech, Natural Language

Network        Domain    Dataset     Type   Baseline   DSD      Abs. Imp.   Rel. Imp.
GoogleNet      Vision    ImageNet    CNN    31.1%      30.0%    1.1%        3.6%
VGG-16         Vision    ImageNet    CNN    31.5%      27.2%    4.3%        13.7%
ResNet-18      Vision    ImageNet    CNN    30.4%      29.3%    1.1%        3.7%
ResNet-50      Vision    ImageNet    CNN    24.0%      23.2%    0.9%        3.5%
NeuralTalk     Caption   Flickr-8K   LSTM   16.8       18.5     1.7         10.1%
DeepSpeech     Speech    WSJ93       RNN    33.6%      31.6%    2.0%        5.8%
DeepSpeech-2   Speech    WSJ93       RNN    14.5%      13.4%    1.1%        7.4%

Open Sourced DSD Model Zoo: https://songhan.github.io/DSD

The baseline results of AlexNet, VGG16, GoogleNet, SqueezeNet are from Caffe Model Zoo. ResNet18, ResNet50 are from fb.resnet.torch.



https://songhan.github.io/DSD
DSD on Caption Generation

Image 1
  Baseline: a boy in a red shirt is climbing a rock wall.
  Sparse: a young girl is jumping off a tree.
  DSD: a young girl in a pink shirt is swinging on a swing.
Image 2
  Baseline: a basketball player in a red uniform is playing with a ball.
  Sparse: a basketball player in a blue uniform is jumping over the goal.
  DSD: a basketball player in a white uniform is trying to make a shot.
Image 3
  Baseline: two dogs are playing together in a field.
  Sparse: two dogs are playing in a field.
  DSD: two dogs are playing in the grass.
Image 4
  Baseline: a man and a woman are sitting on a bench.
  Sparse: a man is sitting on a bench with his hands in the air.
  DSD: a man is sitting on a bench with his arms folded.
Image 5
  Baseline: a person in a red jacket is riding a bike through the woods.
  Sparse: a car drives through a mud puddle.
  DSD: a car drives through a forest.

Figure 3: Visualization of DSD training improves the performance of image captioning.

Baseline model: Andrej Karpathy, Neural Talk model zoo.
Han et al. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017

DSD on Caption Generation

Supplementary Material: More Examples of DSD Training Improves the Performance of NeuralTalk Auto-Caption System (Images from Flickr-8K Test Set)

Image 1
  Baseline: a boy is swimming in a pool.
  Sparse: a small black dog is jumping into a pool.
  DSD: a black and white dog is swimming in a pool.
Image 2
  Baseline: a group of people are standing in front of a building.
  Sparse: a group of people are standing in front of a building.
  DSD: a group of people are walking in a park.
Image 3
  Baseline: two girls in bathing suits are playing in the water.
  Sparse: two children are playing in the sand.
  DSD: two children are playing in the sand.
Image 4
  Baseline: a man in a red shirt and jeans is riding a bicycle down a street.
  Sparse: a man in a red shirt and a woman in a wheelchair.
  DSD: a man and a woman are riding on a street.
Image 5
  Baseline: a group of people sit on a bench in front of a building.
  Sparse: a group of people are standing in front of a building.
  DSD: a group of people are standing in a fountain.
Image 6
  Baseline: a man in a black jacket and a black jacket is smiling.
  Sparse: a man and a woman are standing in front of a mountain.
  DSD: a man in a black jacket is standing next to a man in a black shirt.
Image 7
  Baseline: a group of football players in red uniforms.
  Sparse: a group of football players in a field.
  DSD: a group of football players in red and white uniforms.
Image 8
  Baseline: a dog runs through the grass.
  Sparse: a dog runs through the grass.
  DSD: a white and brown dog is running through the grass.

Baseline model: Andrej Karpathy, Neural Talk model zoo.
Agenda

             Inference                             Training
Algorithm    Algorithms for Efficient Inference    Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference      Hardware for Efficient Training

CPUs for Training
CPUs Are Targeting Deep Learning

Intel Knights Landing (2016)
7 TFLOPS FP32
16GB MCDRAM, 400 GB/s
245W TDP
29 GFLOPS/W (FP32)
14nm process

Knights Mill: next gen Xeon Phi optimized for deep learning
Intel announced the addition of new vector instructions for deep learning (AVX512-4VNNIW and AVX512-4FMAPS), October 2016

Slide Source: Sze et al., Survey of DNN Hardware, MICRO'16 Tutorial.
Image Source: Intel, Data Source: Next Platform
GPUs for Training
GPUs Are Targeting Deep Learning

Nvidia PASCAL GP100 (2016)
10/20 TFLOPS FP32/FP16
16GB HBM, 750 GB/s
300W TDP
67 GFLOPS/W (FP16)
16nm process
160GB/s NVLink

Slide Source: Sze et al., Survey of DNN Hardware, MICRO'16 Tutorial.
Data Source: NVIDIA
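
The GFLOPS/W figures on the CPU and GPU slides follow from peak throughput divided by TDP. A small check, with a hypothetical helper name:

```python
def gflops_per_watt(tflops, tdp_watts):
    """Peak throughput in GFLOPS divided by the thermal design power in watts."""
    return tflops * 1000.0 / tdp_watts


knights_landing = gflops_per_watt(7, 245)    # FP32, Intel Knights Landing
pascal_gp100 = gflops_per_watt(20, 300)      # FP16, Nvidia PASCAL GP100
```

Rounded to whole numbers these give the 29 GFLOPS/W and 67 GFLOPS/W quoted on the slides.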
GPUs for Training

Nvidia Volta GV100 (2017)

15 FP32 TFLOPS
120 Tensor TFLOPS
16GB HBM2 @ 900GB/s
300W TDP
12nm process
21B Transistors
die size: 815 mm²
300GB/s NVLink

Data Source: NVIDIA


What's new in Volta: Tensor Core

[Figure: TENSOR CORE 4X4X4 MATRIX-MULTIPLY ACC]

a new instruction that performs 4x4x4 FMA mixed-precision operations per clock
12X increase in throughput for the Volta V100 compared to the Pascal P100

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
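
The arithmetic of one Tensor Core operation, D = A x B + C on 4x4 matrices, can be emulated in plain Python. This is only a numerical sketch with hypothetical names: A and B are rounded to half precision, products are accumulated at higher precision, and the result is rounded to single precision; it says nothing about how the hardware schedules the work.

```python
import struct

FMAS_PER_OP = 4 * 4 * 4  # one multiply-add per (i, j, k) triple


def to_f16(x):
    """Round to the nearest IEEE half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]


def to_f32(x):
    """Round to the nearest IEEE single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]


def tensor_core_fma(A, B, C):
    """D = A x B + C with half-precision inputs A, B and single-precision result."""
    n = 4
    A16 = [[to_f16(v) for v in row] for row in A]
    B16 = [[to_f16(v) for v in row] for row in B]
    return [[to_f32(C[i][j] + sum(A16[i][k] * B16[k][j] for k in range(n)))
             for j in range(n)]
            for i in range(n)]


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
A = [row[:] for row in identity]
A[0][0] = 0.1                       # not exactly representable in half precision
D = tensor_core_fma(A, identity, zeros)
```

The input 0.1 is rounded to the nearest half-precision value before the multiply, while the accumulator C keeps its wider format.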
Pascal vs. Volta

Tesla V100 Tensor Cores and CUDA 9 deliver up to 9x higher performance for GEMM operations.

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Pascal vs. Volta

Left: Tesla V100 trains the ResNet-50 deep neural network 2.4x faster than Tesla P100.
Right: Given a target latency per image of 7ms, Tesla V100 is able to perform inference using the
ResNet-50 deep neural network 3.7x faster than Tesla P100.

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
The GV100 SM is partitioned into four processing blocks, each with:

8 FP64 Cores
16 FP32 Cores
16 INT32 Cores
two of the new mixed-precision Tensor Cores for deep learning
a new L0 instruction cache
one warp scheduler
one dispatch unit
a 64 KB Register File

https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Tesla Product               Tesla K40        Tesla M40         Tesla P100       Tesla V100
GPU                         GK110 (Kepler)   GM200 (Maxwell)   GP100 (Pascal)   GV100 (Volta)
GPU Boost Clock             810/875 MHz      1114 MHz          1480 MHz         1455 MHz
Peak FP32 TFLOP/s*          5.04             6.8               10.6             15
Peak Tensor Core TFLOP/s*   -                -                 -                120
Memory Interface            384-bit GDDR5    384-bit GDDR5     4096-bit HBM2    4096-bit HBM2
Memory Size                 Up to 12 GB      Up to 24 GB       16 GB            16 GB
TDP                         235 Watts        250 Watts         300 Watts        300 Watts
Transistors                 7.1 billion      8 billion         15.3 billion     21.1 billion
GPU Die Size                551 mm²          601 mm²           610 mm²          815 mm²
Manufacturing Process       28 nm            28 nm             16 nm FinFET+    12 nm FFN
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
GPU / TPU

https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/

Google Cloud TPU

Cloud TPU delivers up to 180 teraflops to train and run machine learning models.

source: Google Blog

Google Cloud TPU

A TPU pod built with 64 second-generation TPUs delivers up to 11.5 petaflops of machine learning acceleration.

One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs; now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod.

source: Google Blog, Introducing Cloud TPUs
Wrap-Up

             Inference                             Training
Algorithm    Algorithms for Efficient Inference    Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference      Hardware for Efficient Training

Future

Smart Low Latency Privacy Mobility Energy-Efficient

Outlook: the Focus for Computation

PC Era:            Computing
Mobile-First Era:  Mobile Computing
AI-First Era:      Brain-Inspired Cognitive Computing

Sundar Pichai, Google IO, 2016

Thank you!
stanford.edu/~songhan