
Get Out of the Valley: Power-Efficient Address Mapping for GPUs


The 45th International Symposium on Computer Architecture (ISCA)
Monday June 4th, 2018

Yuxi Liu (Ghent & Peking), Xia Zhao (Ghent), Magnus Jahre (NTNU), Zhenlin
Wang (MTU), Xiaolin Wang (Peking), Yingwei Luo (Peking),
and Lieven Eeckhout (Ghent)
GPU Memory Systems

GPUs require high bandwidth memory systems to support efficient execution of 100s to 1000s of concurrent threads

[Figure: Streaming Multiprocessors (SMs) connect through a Network on Chip (NoC) to four LLC slices, each paired with a DRAM channel (Channel 0 to Channel 3) that contains DRAM banks.]

Achieving high bandwidth requires effectively utilizing the parallel units in the
memory system

2
Entropy Valley

Bank and channel bits must be highly variable to ensure even distribution of memory requests across LLC slices, channels and banks

Entropy is a measure of the information content of each address bit

[Figure: entropy per memory address bit, from most significant bit to least significant bit, across the address fields Row, Channel, Bank, Column and Block, with one curve for CPUs and one for GPUs. The dip in the curve is labeled "Entropy Valley".]

Entropy valleys create significant resource imbalance in GPU memory systems - leading to
poor performance and low power-efficiency
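
As a rough illustration of per-bit entropy, the sketch below computes plain Shannon entropy for each address bit over a whole trace. This is a simplification and not the window-based metric defined later in the talk. The trace (64 requests at a 256-byte stride) and the function name `bit_entropy` are hypothetical.

```python
import math

def bit_entropy(addresses, bit):
    """Shannon entropy (0.0 to 1.0) of one address bit over a list of addresses."""
    ones = sum((a >> bit) & 1 for a in addresses)
    p = ones / len(addresses)
    if p in (0.0, 1.0):
        return 0.0  # the bit never changes, so it carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Hypothetical strided trace: bits 0-7 are always 0, bits 8-13 toggle evenly.
trace = [i * 256 for i in range(64)]
profile = {bit: bit_entropy(trace, bit) for bit in range(16)}

for bit in sorted(profile, reverse=True):
    print(f"bit {bit:2d}: entropy {profile[bit]:.2f}")
```

If the channel or bank bits fell in the zero-entropy range of such a profile, every request would select the same channel or bank.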

3
Why Do Entropy Valleys Exist?
Column-Major 1D Thread Block (TB) Allocation (X-dimension 0 to 7, Y-dimension 0 to 7)

Memory Addresses and Requests (the last two bits shown are the channel bits)

[y,x]    Address          Request
[0,0]    … 0000 00 …      Request [0,0]
[1,0]    … 0010 00 …      Request [1,0]
[2,0]    … 0100 00 …      Request [2,0]
[3,0]    … 0110 00 …      Request [3,0]
[4,0]    … 1000 00 …      Request [4,0]
[5,0]    … 1010 00 …      Request [5,0]
[6,0]    … 1100 00 …      Request [6,0]
[7,0]    … 1110 00 …      Request [7,0]

DRAM Channels: Channel 0, Channel 1, Channel 2, Channel 3

4
Why Do Entropy Valleys Exist?
Column-Major 1D Thread Block (TB) Allocation (X-dimension 0 to 7, Y-dimension 0 to 7)

Memory Addresses and Requests (the last two bits shown are the channel bits)

[y,x]    Address
[0,0]    … 0000 00 …
[1,0]    … 0010 00 …
[2,0]    … 0100 00 …
[3,0]    … 0110 00 …
[4,0]    … 1000 00 …
[5,0]    … 1010 00 …
[6,0]    … 1100 00 …
[7,0]    … 1110 00 …

DRAM Channels
Channel 0: Request [0,0], Request [1,0], Request [2,0], Request [3,0], Request [4,0], Request [5,0], Request [6,0], Request [7,0]
Channel 1, Channel 2, Channel 3: no requests

All requests end up in Channel 0

Entropy valleys are caused by dimension-related array indexing

Our solution: BIM-based address mapping

5
Getting Out of the Entropy Valley
Column-Major 1D Thread Block (TB) Allocation (X-dimension 0 to 7, Y-dimension 0 to 7)

BIM-based Address Mapping: Binary Invertible Matrix (BIM) x Input Addr. = Output Addr.

Memory Addresses and Requests (the last two bits shown are the channel bits)

[y,x]    Input address    Output address    Request
[0,0]    … 0000 00 …      … 0000 00 …       Request [0,0]
[1,0]    … 0010 00 …      … 0010 11 …       Request [1,0]
[2,0]    … 0100 00 …      … 0100 01 …       Request [2,0]
[3,0]    … 0110 00 …      … 0110 10 …       Request [3,0]
[4,0]    … 1000 00 …      … 1000 11 …       Request [4,0]
[5,0]    … 1010 00 …      … 1010 00 …       Request [5,0]
[6,0]    … 1100 00 …      … 1100 10 …       Request [6,0]
[7,0]    … 1110 00 …      … 1110 01 …       Request [7,0]

DRAM Channels: Channel 0, Channel 1, Channel 2, Channel 3

6
Getting Out of the Entropy Valley
Column-Major 1D Thread Block (TB) Allocation (X-dimension 0 to 7, Y-dimension 0 to 7)

BIM-based Address Mapping: Binary Invertible Matrix (BIM) x Input Addr. = Output Addr.

Memory Addresses and Requests (the last two bits shown are the channel bits)

[y,x]    Input address    Output address
[0,0]    … 0000 00 …      … 0000 00 …
[1,0]    … 0010 00 …      … 0010 11 …
[2,0]    … 0100 00 …      … 0100 01 …
[3,0]    … 0110 00 …      … 0110 10 …
[4,0]    … 1000 00 …      … 1000 11 …
[5,0]    … 1010 00 …      … 1010 00 …
[6,0]    … 1100 00 …      … 1100 10 …
[7,0]    … 1110 00 …      … 1110 01 …

DRAM Channels
Channel 0: Request [0,0], Request [5,0]
Channel 1: Request [2,0], Request [7,0]
Channel 2: Request [3,0], Request [6,0]
Channel 3: Request [1,0], Request [4,0]

Perfect channel utilization!
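
The remapped channel bits on this slide can be reproduced with two XOR expressions. The taps used below are inferred from the slide's address table (the slide does not show the matrix itself), and `remap_channel` is a hypothetical helper name. Here `y` stands for the three varying address bits of request [y,0], and the input channel bits are 00 as in the example.

```python
def remap_channel(y):
    """Return the 2-bit channel index for request [y,0], y in 0..7."""
    y0, y1, y2 = y & 1, (y >> 1) & 1, (y >> 2) & 1
    c1 = y0 ^ y2        # high channel bit (taps inferred from the slide's table)
    c0 = y0 ^ y1 ^ y2   # low channel bit (taps inferred from the slide's table)
    return (c1 << 1) | c0

# Group the eight requests by the channel they land in.
channels = {}
for y in range(8):
    channels.setdefault(remap_channel(y), []).append(y)

for ch in sorted(channels):
    print(f"Channel {ch}: requests {channels[ch]}")
```

Each channel receives exactly two of the eight requests, matching the distribution on the slide.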

7
Outline

1. Introduction

2. Window-based memory address entropy

3. Binary Invertible Matrix (BIM) address mapping

4. Results

5. Conclusion

8
Window-based Entropy

We need an entropy metric without memory request ordering assumptions

Intra-TB Entropy
Thread Block (TB) 1: …100…, …001…, …101…, …000…    Bit Value Ratio (BVR) 0
Thread Block (TB) 2: …110…, …011…, …111…, …010…    Bit Value Ratio (BVR) 1

Inter-TB Entropy
TB     TB1  TB2  TB3  TB4
BVR    0    1    0    1

Window: The TBs that are likely to issue requests that coexist in the memory system
Compute Shannon’s entropy function over the BVR probabilities within each window
Overall entropy = Mean of window entropies

With Greedy-Then-Oldest (GTO) warp scheduling, we heuristically set the window size to the number of Streaming Multiprocessors (SMs)
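
A minimal sketch of this metric follows, under two assumptions that the slide does not spell out: windows slide over the TB sequence one TB at a time, and the mean BVR inside a window is used as the probability fed to Shannon's binary entropy function. The function names are hypothetical. The address patterns are the three-bit snippets read from the slide's figure, and the BVR is taken on their middle bit.

```python
import math

def bvr(addresses, bit):
    """Bit Value Ratio: fraction of a TB's requests that have `bit` set to 1."""
    return sum((a >> bit) & 1 for a in addresses) / len(addresses)

def binary_entropy(p):
    """Shannon entropy of a binary variable with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def window_entropy(bvrs, window):
    """Mean entropy over all sliding windows of `window` consecutive TBs.

    Assumption: each window's probability is the mean of its BVRs.
    """
    spans = [bvrs[i:i + window] for i in range(len(bvrs) - window + 1)]
    return sum(binary_entropy(sum(s) / len(s)) for s in spans) / len(spans)

tb1 = [0b100, 0b001, 0b101, 0b000]  # middle bit is always 0
tb2 = [0b110, 0b011, 0b111, 0b010]  # middle bit is always 1
print(bvr(tb1, 1), bvr(tb2, 1))

per_tb_bvr = [0, 1, 0, 1]           # TB1..TB4, as on the slide
print(window_entropy(per_tb_bvr, window=1))
print(window_entropy(per_tb_bvr, window=2))
```

With a window of one TB the bit looks constant, while a window of two TBs exposes the variation across TBs. That is the reason for measuring entropy over co-running TBs rather than within a single TB.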

9
Entropy Profile Examples
[Figure: six entropy profiles, one per workload (MT, LU, GS, NW, LPS, NN), each plotting entropy (0.0 to 1.0) against memory address bit (axis ticks at 29, 18 and 6). Annotations mark "Two channel bits and one bank bit" and "Three bank bits". The NN panel is labeled "no valley".]
All workloads have low entropy bits, and their location is highly application-dependent

GPU address mapping schemes must harvest entropy across broad address bit ranges

10
The Binary Invertible Matrix (BIM)

Binary Invertible Matrix (BIM) x Input Addr. = Output Addr.

The BIM can represent all possible address mapping schemes that consist of AND and XOR operations
• Matrix covers all possible transformations
• Invertibility criterion ensures that all possible one-to-one relations are considered

The BIM has low hardware overhead
• Can be implemented with a tree of XOR-gates
• Mapping can be performed in a single clock cycle

Example Memory Map
• Remap (RMP): Single 1 per row
• Permutation-based mapping (PM), Zhang et al. [MICRO’00]: Two 1s in bank and channel rows
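
The matrix-vector product can be sketched as follows: each output bit is the XOR of the input bits selected (ANDed) by one matrix row, which is multiplication over GF(2). The 4-bit address layout [row1, row0, bank1, bank0] and the matrix `pm` are hypothetical toy values in the style of a permutation-based mapping, not a configuration taken from the talk.

```python
def apply_bim(bim, addr_bits):
    """Multiply a binary matrix by an address bit vector over GF(2)."""
    return [sum(m & a for m, a in zip(row, addr_bits)) % 2 for row in bim]

def is_invertible(bim):
    """Gaussian elimination over GF(2); True if the square matrix has full rank."""
    n = len(bim)
    rows = [int("".join(map(str, r)), 2) for r in bim]  # pack each row into an int
    rank = 0
    for col in reversed(range(n)):
        pivot = next((i for i in range(rank, n) if (rows[i] >> col) & 1), None)
        if pivot is None:
            return False  # no pivot in this column, so the matrix is singular
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(n):
            if i != rank and (rows[i] >> col) & 1:
                rows[i] ^= rows[rank]
        rank += 1
    return True

# Hypothetical layout: [row1, row0, bank1, bank0].
# Row bits pass through; each bank bit is XORed with one row bit (two 1s per bank row).
pm = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
print(is_invertible(pm))
print(apply_bim(pm, [1, 0, 0, 0]))
```

Because the matrix is invertible, no two input addresses map to the same output address, so the remapping loses no memory capacity.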
12
Our Mapping Schemes

Entropy analysis shows that a GPU address mapping policy needs to harvest entropy across broad address bit ranges
• We call this the broad mapping strategy
• Covers many possible mapping schemes

We define three sub-strategies that differ in which memory address fields can be used as input and output in the BIM
• Page Address Entropy (PAE)
• Full Address Entropy (FAE)
• All

We randomly generate BIMs that match the input and output restrictions of each sub-strategy

[Figure: "Broad mapping strategy" shows a matrix with multiple 1s for each bank and channel row. "Broad sub-strategies" shows the input address fields (Row, Channel, Bank, Column, Block) feeding the Binary Invertible Matrix (BIM), which produces the output address fields (Row, Channel, Bank, Column, Block). Labels PAE, FAE and All mark which fields each sub-strategy uses.]
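
One way to generate such matrices is rejection sampling: start from the identity, fill the rows of the output bits with random 0/1 values restricted to the permitted input bits, and retry until the matrix is invertible. This is a sketch under assumptions. The slide does not give the generation procedure, and the 8-bit layout and field sets below are hypothetical placeholders rather than the actual PAE, FAE or All definitions.

```python
import random

def gf2_rank(matrix):
    """Rank over GF(2) of a matrix given as a list of 0/1 lists."""
    width = len(matrix[0])
    rows = [int("".join(map(str, r)), 2) for r in matrix]
    rank = 0
    for col in reversed(range(width)):
        pivot = next((i for i in range(rank, len(rows)) if (rows[i] >> col) & 1), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and (rows[i] >> col) & 1:
                rows[i] ^= rows[rank]
        rank += 1
    return rank

def random_bim(n_bits, target_rows, allowed_inputs, rng):
    """Random invertible matrix: identity except for `target_rows`,
    whose 1s may only appear in `allowed_inputs` columns."""
    if not set(target_rows) <= set(allowed_inputs):
        # Otherwise the target columns would be all zero and no draw could be invertible.
        raise ValueError("target rows must be among the allowed inputs")
    while True:
        bim = [[int(i == j) for j in range(n_bits)] for i in range(n_bits)]
        for r in target_rows:
            bim[r] = [rng.randint(0, 1) if j in allowed_inputs else 0
                      for j in range(n_bits)]
        if gf2_rank(bim) == n_bits:
            return bim

# Hypothetical layout, most significant bit first:
# bits 0-2 row, bit 3 channel, bits 4-5 bank, bits 6-7 column.
example = random_bim(8, target_rows=[3, 4, 5],
                     allowed_inputs=set(range(6)), rng=random.Random(0))
for row in example:
    print(row)
```

Widening `allowed_inputs` lets the channel and bank rows draw on more address bits, and widening `target_rows` lets more output fields be rewritten.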

13
Entropy Impact of Address Mapping
Schemes for the MT Benchmark
[Figure: six entropy profiles for the MT benchmark, one per mapping scheme (Baseline, Remap, PM, PAE, FAE, All), each plotting entropy (0.0 to 1.0) against memory address bit (axis ticks at 29, 18 and 6).]

PAE, FAE, and All remove the entropy valleys – the other mapping schemes do not

14
Execution Time vs. DRAM Power

[Figure: scatter plot of average execution time normalized to BASE (y-axis, 0 to 1.2) against average DRAM power consumption normalized to BASE (x-axis, 0.8 to 1.5) for BASE, PM, RMP, PAE, FAE and ALL. Annotations: -1.51X and +1.30X.]

16
Performance
[Figure: bar chart of speed-up relative to BASE (y-axis, 0 to 8) for BASE, PM, RMP, PAE, FAE and ALL on MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS and SP, plus HMEAN. Bar labels: +7.5X, +6.7X, +4.0X, +1.9X, +2.0X, +1.5X, +1.4X, +1.4X, +1.3X, +1.1X, +1.0X, +1.0X.]

PAE improves performance by 1.31X on average compared to PM

17
Performance per Watt
[Figure: bar chart of performance per Watt (y-axis, 0 to 4.5) for BASE, PM, RMP, PAE, FAE and ALL on MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS and SP, plus HMEAN. Bar labels: +3.9X and +1.4X.]

PAE improves performance per Watt by 1.25X on average compared to PM

18
Why is PAE Most Power-Efficient?
[Figure: stacked bar chart of the DRAM power breakdown in W (y-axis, 0 to 60), split into background, activate, read and write power, for BASE, PM, RMP, PAE, FAE and ALL on MT, LU, GS, NW, LPS, SC, SRAD2, DWT2D, HS and SP, plus AVG.]

FAE and ALL tend to distribute requests with good DRAM page locality to different banks,
which increases the number of DRAM page activations

PAE saves power by keeping these requests in the same bank

19
Conclusion

Window-Based Entropy
• A novel entropy metric tailored for the highly concurrent memory
behavior of GPU compute workloads
Binary Invertible Matrix (BIM) address mapping
• A unified representation of address mapping schemes that use
AND and XOR operations
Page Address Entropy (PAE) address mapping
• PAE improves performance by 1.31X and performance per Watt by
1.25X compared to the state-of-the-art permutation-based
address mapping scheme

21
Thank You!

22
