TECHNOLOGY IMPACTS FROM THE NEW WAVE OF ARCHITECTURES FOR MEDIA-RICH WORKLOADS
Samuel Naffziger AMD Corporate Fellow
June 14th, 2011
Outline
Introduction
The new workloads and demands on computation
Characteristics of serial and parallel computation
The Accelerated Processing Unit (APU) architecture
Mid 1990s: 4:3 @ 0.5 megapixel; Email, film & scanners; Text and low res photos; CD-ROM; Mouse & keyboard; 1-2 Hours
Mid 2000s: 4:3 @ 1.2 megapixels; Digital cameras, SD webcams (1-5 MB files); WWW and streaming SD video; DVDs; Mouse & keyboard; 3-4 Hours
Now (Parallel/Data-Dense): 16:9 @ 7 megapixels; HD video flipcams, phones, webcams (1GB)
[Table row labels: Online, Multimedia, Interface, Battery Life*, Form Factors, Workloads]
Standard-definition Internet
New Experiences
[Figure: bar chart, 0% to 100%, by activity: Personal finances, Taking notes, Online web-based games, Social networking, Calendar management, Locally installed games, Educational apps, Video editing, Internet phone]
Immersive Gaming
Visual Perception
Pictures and video are processed 400 to 2000 times faster
Communicating: IM, Email, Facebook; Video Chat, NetMeeting
Using video: DVD, BLU-RAY, HD; Search, Recognition, Labeling; Advanced Editing & Mixing
Gaming: Mainstream Games; 3D games
Music: Listening and Sharing; Editing and Mixing; Composing and compositing
ViVu Desktop Telepresence; Nuvixa Be Present; CyberLink PowerDirector 9; CyberLink MediaEspresso 6; ArcSoft TotalMedia Theatre 5; ArcSoft MediaConverter 7
VLSI Technology Symposium | June 2011 | Public
Codemasters F1 2010
Viewdle Uploader
24 hours of video uploaded to YouTube every minute
Approximately 9 billion video files owned are high-definition
50 million + added to personal content libraries every day
1000 images every second
Serial Code
Conditional branches
Loop nest over a 2D array representing a very large dataset:
  i,j=0
  i++
  j++
  load x(i,j)
  fmul
  store
  cmp j (100000)
  bc
  cmp i (100000)
  bc
[Figure: GFLOPs Trend, GPU vs. CPU; x-axis: Years (2005 to 2014)]
[Figure: Single-thread Performance (SpecINT), Frequency (MHz) and Typical Power (Watts). Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten]
Weak performance gains through density; maximize speed with fast devices
Nearly linear performance gains with density; maximize density with cool devices
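The contrast between the serial loop nest and a data-parallel formulation can be sketched in a few lines of Python. This is an illustrative sketch only: the multiplier `SCALE`, the function names, and the worker count are assumptions, not taken from the slides, and Python threads show the structure of the decomposition rather than real GPU-style speedup.

```python
from concurrent.futures import ThreadPoolExecutor

SCALE = 2.0  # hypothetical multiplier standing in for the slide's "fmul"


def scale_serial(x):
    """Serial form: one element per iteration, with a compare-and-branch per loop."""
    out = [[0.0] * len(row) for row in x]
    i = 0
    while i < len(x):                    # cmp i / bc
        j = 0
        while j < len(x[i]):             # cmp j / bc
            out[i][j] = x[i][j] * SCALE  # load, fmul, store
            j += 1
        i += 1
    return out


def scale_row(row):
    """Work on one row; no row depends on any other row."""
    return [v * SCALE for v in row]


def scale_parallel(x, workers=4):
    """Data-parallel form: independent rows are mapped onto a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scale_row, x))
```

Because every element is computed independently, adding workers (or, on a GPU, adding lanes) scales the second form in a way the branch-bound serial loop cannot match.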
Multi-Core Era
Enabled by: Moore's Law; desire for throughput; 20 years of SMP architecture
[Figure: performance-over-time panels, each marked "we are here". Vertical axes: Single-thread Performance, Throughput Performance. Horizontal axes: Time, Time (# of Processors), Time (Data-parallel exploitation)]
[Figure: block diagrams. Discrete system: CPU Chip (CPU Cores, UNB / MC), ~7 GB/sec link, GPU with UVD, FCH Chip (SB Functions). Integrated system: APU Chip (CPU Cores, UNB, MC, GPU, UVD), optional PCIe GPU]
Bandwidth pinch points and latency hold back the GPU capabilities
Integration Provides Improvement
Eliminate power and latency of extra chip crossing
3X bandwidth between GPU and Memory!
Same sized GPU is substantially more effective
Power efficient, advanced technology for both CPU and GPU
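To give a feel for what a bandwidth pinch point costs, the sketch below computes the time to move a dataset across a link. It is a back-of-the-envelope illustration under stated assumptions: the 7 GB/s figure is the "~7 GB/sec" label from the diagram, the 1 GB payload echoes the file size quoted earlier in the deck, and applying the "3X" gain to this particular link is an assumption made purely for illustration. Latency and protocol overhead are ignored.

```python
def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Ideal time to move size_gb of data over a link, ignoring latency and overhead."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


LINK_GB_PER_S = 7.0    # "~7 GB/sec" label from the diagram
FILE_GB = 1.0          # assumed payload, matching the 1GB file size quoted earlier
BANDWIDTH_GAIN = 3.0   # "3X bandwidth"; applying it to this link is an assumption

before = transfer_seconds(FILE_GB, LINK_GB_PER_S)
after = transfer_seconds(FILE_GB, LINK_GB_PER_S * BANDWIDTH_GAIN)
print(f"before: {before * 1000:.0f} ms, after: {after * 1000:.0f} ms")
```

Under these assumptions the same GPU spends roughly a third as long waiting on each transfer, which is one way to read "same sized GPU is substantially more effective".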
[Figure: Resistivity (uohm-cm), 1.5 to 2, vs. Line Width (um), 0 to 1]
Density
With the 20nm node, even local metal will see a large RC increase, making compromises more difficult
Add metal layers? Thin, dense layers for the GPU; thick, low resistance layers for the CPU
Cost issues? Via resistance?
Technology improvements in BEOL are required
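The link between line geometry and wire resistance can be illustrated with the standard relation R = rho * L / (W * T). The sketch below is a simplified illustration: the function name and the example dimensions are hypothetical, and it holds resistivity constant, whereas the resistivity plot indicates that resistivity itself also depends on line width.

```python
def wire_resistance_ohms(rho_uohm_cm, length_um, width_um, thickness_um):
    """Resistance of a rectangular wire: R = rho * L / (W * T)."""
    rho_ohm_m = rho_uohm_cm * 1e-8                         # 1 uohm-cm = 1e-8 ohm-m
    area_m2 = (width_um * 1e-6) * (thickness_um * 1e-6)    # cross-section in m^2
    return rho_ohm_m * (length_um * 1e-6) / area_m2


# Hypothetical dimensions, chosen only to show the trend.
wide = wire_resistance_ohms(2.0, 1.0, 0.1, 0.1)
narrow = wire_resistance_ohms(2.0, 1.0, 0.05, 0.1)
print(f"wide: {wide:.2f} ohm/um, narrow: {narrow:.2f} ohm/um")
```

Halving the width doubles the resistance even before any rise in resistivity, which is why thin, dense layers and thick, low resistance layers pull the metal stack in opposite directions.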
Device Optimization
[Figure: APU Vt Mix for GPU and CPU; Device Ioff vs. Performance for LC-HVT, HVT, RVT and LC-RVT devices]
To achieve breakthrough APU performance, the Llano GPU has ~5X the flops and ~5X the device count of the CPUs
[Figure: FPG RO speed and Llano CPU Vt mix for LVT, RVT, LC-RVT and HVT devices]
Power Transfers
Voltage range is critical to enabling the efficient power transfers that make for compelling APU performance
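One way to see why voltage range matters is the standard CMOS dynamic power relation P = C * V^2 * f. The slides do not state a power model, so this relation, the function names, and the example numbers are assumptions used for illustration; leakage is ignored.

```python
def dynamic_power(c_eff_farads, voltage, freq_hz):
    """Dynamic switching power, P = C * V^2 * f (leakage ignored)."""
    return c_eff_farads * voltage ** 2 * freq_hz


def relative_power(v_new, v_ref):
    """Dynamic power at v_new relative to v_ref, at the same frequency and capacitance."""
    return (v_new / v_ref) ** 2


# Hypothetical operating points inside the 0.7V to 1.3V range shown on the slide.
saving = 1.0 - relative_power(0.8, 1.0)
print(f"dropping 1.0V -> 0.8V frees about {saving:.0%} of dynamic power")
```

Under this model, a unit that can drop from 1.0V to 0.8V frees about a third of its dynamic power, which can then be transferred to whichever of the CPU or GPU is the bottleneck.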
[Figure: voltage axis 0.7V to 1.3V; labeled points 1.000V, 0.950V, 0.805V, 0.800V, 0.750V; 1000MHz]
[Figure: Nominal Voltage by technology node: 40nm, 28nm, 20nm, 14nm]
Many barriers to maintaining both high and low voltage as technology scales
TDDB vs. SCE control
ULK breakdown vs. denser pitches
Variation control
[Figure labels: Poly, Fin, BOX]
[Figure: 3D die stack. Labels: Microbumps, DRAM, GPU Die, Metal Layers, CPU Die, Metal Layers, Package Substrate, South Bridge]
3D Integration Challenges
Economical 3D stacking in high volume manufacturing presents many challenges
Benefits must exceed the additional costs of TSVs and yield fallout
Logistics of testing and assembling die from multiple sources can be immense
Countless mechanical and thermal issues to solve in high volume manufacturing
[Figure: 3D die stack. Labels: Heat Sink, TIM (Thermal Interface Material), DRAM, Metal Layers, Analog Die (SB, Power), Metal Layers, GPU Die, Metal Layers, CPU Die, Metal Layers, Package Substrate]
Clearly 3D provides compelling solutions to many problems, but the barriers to entry mean heavy R&D $$ and partnerships are required
Summary
Insatiable demand for high bandwidth computation
A broad device suite is necessary that operates efficiently at low voltage while enabling high speed for response time
3D integration offers a promising long-term solution