
VLSI TECHNOLOGY SYMPOSIUM 2011

TECHNOLOGY IMPACTS FROM THE NEW WAVE OF ARCHITECTURES FOR MEDIA-RICH WORKLOADS
Samuel Naffziger, AMD Corporate Fellow
June 14th, 2011

1 VLSI Technology Symposium | June 2011 | Public

Outline
- Introduction
- The new workloads and demands on computation
- Characteristics of serial and parallel computation
- The Accelerated Processing Unit (APU) architecture
- APU architecture implications for technology
- Summary


The Big Experience/Small Form Factor Paradox

Mid 1990s (early Internet and multimedia experiences):
- Display: 4:3 @ 0.5 megapixel
- Content: email, film & scanners; text and low-res photos
- Online multimedia: CD-ROM
- Interface: mouse & keyboard
- Battery life*: 1-2 hours

Mid 2000s (early Internet and multimedia experiences):
- Display: 4:3 @ 1.2 megapixels
- Content: digital cameras, SD webcams (1-5 MB files)
- Online multimedia: WWW and streaming SD video; DVDs
- Interface: mouse & keyboard
- Battery life*: 3-4 hours

Now (parallel/data-dense):
- Display: 16:9 @ 7 megapixels
- Content: HD video flipcams, phones, webcams (1 GB files)
- Online multimedia: 3D Internet apps and HD video online; social networking with HD files; 3D Blu-ray HD
- Interface: multi-touch, facial/gesture/voice recognition + mouse & keyboard
- Battery life*: all-day computing (8+ hours)

*Resting battery life as measured with industry-standard tests.


Workloads: Focusing on the Experiences that Matter

Consumer PC usage (source: IDC's 2009 Consumer PC Buyer Survey; share of users, 0% to 100%): email, web browsing, office productivity, listening to music, online chat, watching online video, photo editing, personal finances, taking notes, online web-based games, social networking, calendar management, locally installed games, educational apps, video editing, Internet phone.

These workloads are shifting from standard-definition Internet toward immersive and interactive performance, grouped into three new experiences:
- Accelerated Internet and HD video
- Simplified content management
- Immersive gaming

People Prefer Visual Communications

- Verbal perception: words are processed at only 150 words per minute
- Visual perception: pictures and video are processed 400 to 2000 times faster

Augmenting today's content: rich visual experiences, multiple content sources, multi-display, stereo 3D.

The Emerging World of New Data-Rich Applications

The ultimate visual experience: fast, rich Web content; favorite HD movies; games with realistic graphics.

- Using photos: viewing & sharing; search, recognition, labeling; advanced editing
- Communicating: IM, email, Facebook; video chat, NetMeeting; desktop telepresence
- Using video: DVD, Blu-ray, HD; search, recognition, labeling; advanced editing & mixing
- Gaming: mainstream games; 3D games
- Music: listening and sharing; editing and mixing; composing and compositing

Example applications shown: ViVu, Nuvixa Be Present desktop telepresence, CyberLink PowerDirector 9, CyberLink MediaEspresso 6, ArcSoft TotalMedia Theatre 5, ArcSoft MediaConverter 7, Microsoft Internet Explorer 9, Microsoft PowerPoint 2010, Corel Digital Studio 2010, Corel VideoStudio Pro, Windows Live Essentials, Codemasters F1 2010, Viewdle Uploader.

New Workload Examples: Changing Consumer Behavior

- 24 hours of video are uploaded to YouTube every minute
- Approximately 9 billion high-definition video files are owned
- 50 million+ digital media files are added to personal content libraries every day
- 1000 images are uploaded to Facebook every second

What Are the Implications for Computation?

- Insatiable demand for high-bandwidth processing: visual image processing, natural user interfaces, massive data mining for associative searches and recognition
- Some of these compute needs can be offloaded to servers; some must be done on the mobile device
- Similar compute needs and massive growth in both spaces

How must CPU architecture change to deal with these trends?

Parallel and Serial Computation

Serial code, with conditional branches:
  i=0; i++; load x(i); fmul; store; cmp i (16); bc

Data-parallel code, looping 1M times over 1M pieces of data:
  i=0; i++; load x(i); fmul; store; cmp i (1000000); bc

Data-parallel code with loops, branches, and conditional evaluation over a 2D array representing a very large dataset:
  i,j=0; i++; j++; load x(i,j); fmul; store; cmp j (100000); bc; cmp i (100000); bc

[Chart: "35 Years of Microprocessor Trend Data"; transistors (thousands), single-thread performance (SpecINT), frequency (MHz), typical power (watts), and number of cores vs. year. Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten.]

[Chart: "GFLOPs Trend"; peak GFlops (SPFP) for GPU vs. CPU, 2005 through 2014, 0 to 8000 GFlops; AMD projections.]
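The serial and data-parallel patterns above can be sketched in Python (an illustration of the contrast, not the slide's ISA-level pseudocode): the fmul is identical in both forms, but the serial version carries a compare-and-branch per element, while the data-parallel version exposes all elements at once, which is exactly the structure a GPU spreads across SIMD lanes.

```python
# Serial form: one multiply per iteration, with an explicit
# compare-and-branch each time (the slide's "cmp i (N); bc").
def fmul_serial(x, c):
    y = []
    i = 0
    while i < len(x):
        y.append(x[i] * c)   # load x(i); fmul; store
        i += 1               # i++
    return y

# Data-parallel form: the same multiply applied across all elements,
# with no per-element control flow for the hardware to predict.
def fmul_parallel(x, c):
    return [xi * c for xi in x]

data = list(range(1_000_000))
assert fmul_serial(data, 2.0) == fmul_parallel(data, 2.0)
```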

GPU/CPU Design Differences

CPU (serial compute):
- Lots of instructions, little data
- Out-of-order execution, branch prediction
- Few hardware threads
- Weak performance gains through density
- Maximize speed with fast devices

GPU (parallel compute):
- Few instructions, lots of data
- Single Instruction Multiple Data (SIMD)
- Extensive fine-threading capability
- Nearly linear performance gains with density
- Maximize density with cool devices

Three Eras of Processor Performance

Single-Core Era (single-thread performance vs. time):
- Enabled by: Moore's Law, voltage & process scaling, microarchitecture
- Constrained by: power, complexity

Multi-Core Era (throughput performance vs. time and number of processors):
- Enabled by: Moore's Law, desire for throughput, 20 years of SMP architecture
- Constrained by: power, parallel SW availability, scalability

Heterogeneous Systems Era (targeted application performance vs. time and data-parallel exploitation):
- Enabled by: Moore's Law, abundant data parallelism, power-efficient GPUs
- Temporarily constrained by: programming models, communication overheads, workloads

Each era's performance curve carries a "we are here" marker at its current point.

Heterogeneous Computing with an APU Architecture

2010 IGP-based (Danube) platform: the CPU chip (CPU cores + memory controller) connects to DDR3 DIMM memory at ~17 GB/sec, while the GPU and UVD sit with the UNB in a separate chip behind a ~7 GB/sec chip-to-chip link; SB functions live in an FCH chip, with an optional PCIe discrete GPU. Graphics requires memory bandwidth to bring full capabilities to life, and these bandwidth pinch points and the added latency hold back the GPU's capabilities.

2011 APU-based (Llano) platform: a single APU chip integrates the CPU cores, UNB/MC, GPU, and UVD, all sharing DDR3 DIMM memory at ~27 GB/sec; SB functions remain in the FCH chip, and a PCIe discrete GPU is still optional.

Integration provides improvement:
- Eliminates the power and latency of an extra chip crossing
- 3X bandwidth between GPU and memory
- The same-sized GPU is substantially more effective
- Power-efficient, advanced technology for both CPU and GPU
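The bandwidth argument can be framed roofline-style (my framing, with a hypothetical peak-FLOPs figure; only the ~7 and ~27 GB/sec numbers come from the slide): a bandwidth-bound kernel's attainable throughput is min(peak, bandwidth x arithmetic intensity), so for the same GPU, widening the memory path multiplies delivered performance nearly one-for-one.

```python
# Roofline-style sketch: attainable throughput for a kernel with a
# given arithmetic intensity (flops per byte of memory traffic).
def attainable_gflops(peak_gflops, bw_gb_s, flops_per_byte):
    return min(peak_gflops, bw_gb_s * flops_per_byte)

PEAK = 480.0  # hypothetical GPU peak, GFlops (not a slide number)

# A kernel at 1 flop/byte is bandwidth-bound in both configurations:
over_chipset_link = attainable_gflops(PEAK, 7.0, 1.0)    # ~7 GB/sec link
shared_ddr3 = attainable_gflops(PEAK, 27.0, 1.0)         # ~27 GB/sec DDR3

assert shared_ddr3 / over_chipset_link > 3.0  # >3x usable throughput
```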

The Challenges of Integration

CPU and GPU pull the technology in opposite directions:
- Performance (CPU): thick, fast metal; big devices
- Density (GPU): dense, thin metal; small devices

On Llano, the flop (flip-flop) count for the four CPU cores is 0.66M at a relative flop area of 2.14, while the GPU holds 3.5M flops at a relative flop area of 1.0.
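A quick back-of-the-envelope from the numbers above (my arithmetic, treating "flop area" as area per flip-flop normalized to the GPU's) shows how far apart the two design points sit:

```python
# Slide numbers: flop counts and relative area per flop (GPU = 1.0).
cpu_flops, cpu_area_per_flop = 0.66e6, 2.14
gpu_flops, gpu_area_per_flop = 3.5e6, 1.0

flop_ratio = gpu_flops / cpu_flops
area_ratio = (gpu_flops * gpu_area_per_flop) / (cpu_flops * cpu_area_per_flop)

assert 5.2 < flop_ratio < 5.4   # GPU holds ~5.3x as many flops...
assert 2.4 < area_ratio < 2.5   # ...in only ~2.5x the total flop area
```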

How to Balance the Metal Stack?

[Chart: Cu resistivity (uohm-cm, 1.5 to 2.5) vs. line width (0 to 1 um), with and without barrier; resistivity climbs steeply as line width shrinks.]

- With the 20nm node, even local metal will see a large RC increase, making the performance/density compromises more difficult
- Add metal layers? Thin, dense layers for the GPU; thick, low-resistance layers for the CPU
- Cost issues? Via resistance?
- Technology improvements in the BEOL are required

Device Optimization

To achieve breakthrough APU performance, the Llano GPU has ~5X the flops and ~5X the device count of the CPUs, so a broader span of devices is required than either a CPU or a GPU would need alone.

[Chart: APU Vt mix; fraction of each device type (LVT, RVT, LC-RVT, HVT, LC-HVT) used in the Llano CPU vs. the Llano GPU.]

[Chart: speed vs. leakage; FPG ring-oscillator speed vs. Ioff at room temperature (LVT ~175 nA/um, RVT ~50, LC-RVT ~20, HVT ~4.3, LC-HVT ~2.7), with the desired device range marked spanning the Llano CPU and Llano GPU operating points.]

A broader device suite is required.

Power Transfers

[Chart: power shifting between CPU and GPU across three cases: a CPU-centric serial workload, a balanced workload, and a GPU-centric data-parallel workload.]

Voltage range is critical to enabling the efficient power transfers that make for compelling APU performance.

Operating Voltage Range

Operating voltage requirements:
- Low voltage is necessary for power efficiency
- High voltage is necessary for a snappy user experience, enabled by turbo mode

[Chart: energy per operation vs. supply voltage, 0.7 V to 1.3 V; E/op rises steeply with voltage.]
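The shape of the E/op curve follows from first-order dynamic switching energy, roughly E = C * V^2 per operation (my simplification; it ignores leakage and any frequency effects). Normalizing to the 0.7 V low end of the chart's range:

```python
# First-order model: dynamic energy per operation scales as V^2
# (E = C * V^2 per switched node; C normalized to 1 here).
def energy_per_op(v, v_ref=0.7):
    return (v / v_ref) ** 2   # normalized to 1.0 at the 0.7 V low end

low = energy_per_op(0.7)    # efficient low-voltage operation
high = energy_per_op(1.3)   # turbo-mode high end

assert low == 1.0
assert 3.4 < high < 3.5     # ~3.45x the energy per op at 1.3 V
```

This is why the same silicon must serve both ends: running low saves roughly 3x energy per operation, while running high buys the burst speed that turbo mode needs.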

Operating Voltage Challenges

To maintain cost-effective performance growth from technology node to node, the GPU must:
- Hold power density constant
- Exploit density gains to add compute units

This necessarily drives the operating voltage down, which would be good for energy efficiency, except that variation impacts are much greater at low voltage.

[Chart: power-density-limited GPU scaling from 40nm to 14nm, plotting frequency, power density, and nominal voltage per node; the nominal voltage falls from 0.952 V at 40nm to 0.915 V (28nm), 0.886 V (20nm), and 0.805 V (14nm).]

[Chart: 40nm Juniper GPU frequency data for 18 parts, 0 to 1000 MHz over 0.85 V to 1.15 V; the part-to-part frequency spread increases at low voltage.]
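Why holding power density constant forces voltage down can be sketched with a first-order model (my own model with assumed numbers, not AMD's data): dynamic power density scales roughly as cap-per-area * V^2 * f, so if a shrink packs k times more switched capacitance into the same area at the same frequency, V must fall by sqrt(k).

```python
import math

# Toy model: dynamic power density ~ cap_per_area * V^2 * f.
# Holding power density constant across a shrink that multiplies
# switched capacitance per unit area by cap_growth (frequency flat)
# requires dropping the voltage by sqrt(cap_growth * freq_ratio).
def voltage_for_constant_power_density(v_prev, cap_growth, freq_ratio=1.0):
    return v_prev / math.sqrt(cap_growth * freq_ratio)

# Starting from the slide's 0.952 V point at 40nm, with an assumed
# (illustrative) 15% capacitance-per-area growth per node:
v = 0.952
for node in ("28nm", "20nm", "14nm"):
    v = voltage_for_constant_power_density(v, cap_growth=1.15)
    print(node, round(v, 3))
```

The 15%-per-node growth is purely illustrative; the slide's measured points (0.915 V, 0.886 V, 0.805 V) imply node-specific factors, but the monotonic downward trend is the same.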

The Operating Voltage Challenge

Many barriers stand in the way of maintaining both high and low voltage as technology scales:
- TDDB vs. short-channel-effect (SCE) control
- ULK breakdown vs. denser pitches
- Variation control

Fully depleted (FD) devices should enable maintaining the functional range for a generation or two. Will turbo modes be too compromised? What's next?

[Diagram: fully depleted fin device; poly gate over a fin on a BOX (buried oxide) layer.]

3D Integration to the Rescue?

Stacking offers many attractive benefits:
- Higher bandwidth to local DRAM
- Enables the parallel and serial compute die to be in their own separately optimized technologies: interconnect speed vs. density, device optimization, etc.
- Allows IO and southbridge content to remain in older, more analog-friendly technology

[Diagram: 3D stack, top to bottom: heat sink; TIM (thermal interface material); DRAM die; analog die (SB, power); GPU die; CPU die, each with its metal layers; microbumps and through-silicon vias (TSVs) connecting the die; package substrate, with the south bridge alongside.]

3D Integration Challenges

Economical 3D stacking in high-volume manufacturing presents many challenges:
- Benefits must exceed the additional costs of TSVs and the yield fallout
- The logistics of testing and assembling die from multiple sources can be immense
- Countless mechanical and thermal issues must be solved in high-volume manufacturing

[Diagram: the same stack as the previous slide, with the die-to-die vias and through-silicon vias (TSVs) labeled.]

Clearly 3D provides compelling solutions to many problems, but the barriers to entry mean heavy R&D investment and partnerships are required.

Summary
- Insatiable demand for high-bandwidth computation: visual image processing, natural user interfaces, and massive data mining for associative searches and recognition
- Some of these compute needs can be offloaded to servers; some must be done on the mobile device
- Similar compute needs and massive growth in both spaces
- Combined serial and parallel computation architectures are key in both spaces
- Huge technology challenges stand between us and this opportunity:
  - Interconnect scaling is hitting a wall that must be overcome
  - A broad device suite is necessary that operates efficiently at low voltage while enabling high speed for response time
  - 3D integration offers a promising long-term solution
