
Tile Processors: Many-Core for Embedded and Cloud Computing

Richard Schooler
VP Software Engineering, Tilera Corporation
rschooler@tilera.com

Exploiting Natural Parallelism

High-performance applications have lots of parallelism!


Embedded apps:
Networking: packets, flows
Media: streams, images, functional & data parallelism

Cloud apps:
Many clients: network sessions
Data mining: distributed data & computation

Lots of different levels:


SIMD (fine-grain data parallelism)
Thread/process (medium-grain task parallelism)
Distributed system (coarse-grain job parallelism)
HPEC, 15 September 2010 2010 Copyright Tilera Corporation. All Rights Reserved.

Everyone is going many-core, but can the architecture scale?


The computing world is ready for radical change
[Figure: number of cores (#Cores) and performance plotted against time (2005, 2006, 2010, 2014, 2020). Data points are labeled Intel, Sun, IBM Cell, and Larrabee; an annotation marks the "Performance & Performance/W Gap".]


Key Many-Core Challenges: The 3 Ps

Performance challenge
How to scale from 1 to 1000 cores: the number of cores is the new megahertz

Power efficiency challenge


Performance per watt is the new metric: systems are often constrained by power & cooling

Programming challenge
How to provide a converged many-core solution in a standard programming environment

Problems cannot be solved by the same level of thinking that created them.

Current technologies fail to deliver


Incremental performance increase
High power
Low level of integration
Increasingly bigger cores

We need new thinking to get


10x performance
10x performance per watt
Converged computing
Standard programming models


Stepping Back: How Did We Get Here?

Moore's Conundrum: More devices =>? More performance

Old answers: More complex cores; bigger caches


But power-hungry

New answers: More cores


But do conventional approaches scale?

Diminishing returns!

The Old Challenge: CPU-on-a-chip

20 MIPS CPU in 1987

Few thousand gates



The Opportunity: Billions of Transistors

Old CPU:

What to do with all those transistors?



Take Inspiration from ASICs


[Figure: ASIC block diagram with several on-chip memories (mem).]

ASICs have high performance and low power
Custom-routed, short wires
Lots of ALUs, registers, memories: huge on-chip parallelism

But how to build a programmable chip?



Replace Long Wires with Routed Interconnect


[Figure: interconnect diagram with a control block (Ctrl).]

[IEEE Computer 97]



From Centralized Clump of CPUs

[Figure: ALUs clustered around a centralized register file (RF).]


[Figure: ALUs joined by a bypass network (Bypass Net).]

To Distributed ALUs, Routed Bypass Network



Scalar Operand Network (SON) [TPDS 2005]



From a Large Centralized Cache


To a Distributed Shared Cache


[Figure: grid of tiles, each with a cache ($), a router (R), and an ALU.]


[ISCA 1999]

Distributed Everything + Routed Interconnect => Tiled Multicore


[Figure: grid of tiles, each with a cache ($), a router (R), and an ALU.]

Each tile is a processor, so programmable



Tiled Multicore Captures ASIC Benefits and is Programmable


Scales to large numbers of cores
Modular: design and verify 1 tile
Power efficient:
Short wires plus locality optimize CV^2f
Chandrakasan effect: more cores at lower frequency and voltage reduce CV^2f
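
As a rough illustration of the CV^2f argument, the sketch below compares one core at nominal voltage and frequency with two cores at half the frequency and a reduced voltage. The 0.7 voltage scaling factor and the unit capacitance are illustrative assumptions, not Tilera figures, and the comparison assumes the workload parallelizes perfectly across both cores.

```python
def dynamic_power(c, v, f):
    """Dynamic switching power, P = C * V^2 * f (arbitrary units)."""
    return c * v * v * f

# Baseline: one core at nominal capacitance, voltage, and frequency.
one_core = dynamic_power(c=1.0, v=1.0, f=1.0)

# Alternative: two cores, each at half the frequency.
# Running slower permits a lower supply voltage; 0.7 is an assumed value.
two_cores = 2 * dynamic_power(c=1.0, v=0.7, f=0.5)

print(f"one core:  {one_core:.2f}")
print(f"two cores: {two_cores:.2f}")
```

Under these assumptions the two slower cores deliver the same aggregate cycles per second for roughly half the dynamic power, because power falls with the square of the voltage.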

Core + Switch = Tile
[Figure labels: Processor Core; S (switch)]


[Figure label: Current Bus Architecture]

Tilera processor portfolio


Demonstrating the scale of many-core

[Figure: product roadmap, 2007 to 2010 and beyond]
TILE64
TILEPro36
TILEPro64
Gx16 & Gx36: 2x the performance
Gx64 & Gx100: up to 8x performance


TILE-Gx100:
Complete System-on-a-Chip with 100 64-bit cores
1.2 GHz, 1.5 GHz
32 MBytes total cache
546 Gbps peak memory bandwidth
200 Tbps iMesh bandwidth

80-120 Gbps packet I/O
8 ports XAUI / 2 XAUI
2 40Gb Interlaken
32 ports 1GbE (SGMII)

80 Gbps PCIe I/O
3 StreamIO ports (20Gb)

Wire-speed packet engine
120 Mpps

MiCA engines:
40 Gbps crypto
Compress & decompress

[Block diagram: four DDR3 memory controllers; two MiCA engines; mPIPE; Flexible I/O (UART x2, USB x2, JTAG, I2C, SPI); SerDes blocks carrying 10 GbE XAUI, 4x GbE SGMII, and Interlaken ports; PCIe 2.0 (two 8-lane, one 4-lane).]
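
The 120 Mpps figure is consistent with minimum-size Ethernet frames at 80 Gbps. The check below is a sketch under one assumption the slide does not state: 64-byte frames, each carrying the standard 20 bytes of preamble and inter-frame gap on the wire.

```python
def wire_speed_mpps(link_gbps, frame_bytes=64, overhead_bytes=20):
    """Packets per second (in millions) at line rate.

    overhead_bytes covers the 8-byte preamble/SFD and the 12-byte
    inter-frame gap that accompany every Ethernet frame on the wire.
    """
    bits_per_packet = (frame_bytes + overhead_bytes) * 8
    return link_gbps * 1e9 / bits_per_packet / 1e6

gx100 = wire_speed_mpps(80)   # 8 XAUI ports at 10 Gbps each
print(f"{gx100:.1f} Mpps")
```

This gives about 119 Mpps, which rounds to the quoted 120 Mpps.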


TILE-Gx36:
Scaling to a broad range of applications
36 processor cores
866 MHz, 1.2 GHz, 1.5 GHz clock
12 MBytes total cache

40 Gbps total packet I/O
4 ports 10GbE (XAUI)
16 ports 1GbE (SGMII)

48 Gbps PCIe I/O
2 16Gbps Stream IO ports

Wire-speed packet engine
60 Mpps

MiCA engine:
20 Gbps crypto
Compress & decompress

[Block diagram: two DDR3 memory controllers; one MiCA engine; mPIPE; Flexible I/O (UART x2, USB x2, JTAG, I2C, SPI); SerDes blocks carrying 10 GbE XAUI and 4x GbE SGMII ports; PCIe 2.0 (one 8-lane, two 4-lane).]


Full-Featured General Converged Cores

Processor
Each core is a complete computer
3-way VLIW CPU
SIMD instructions: 32, 16, and 8-bit ops
Instructions for video (e.g., SAD) and networking
Protection and interrupts

Memory
L1 cache and L2 cache
Virtual and physical address space
Instruction and data TLBs
Cache-integrated 2D DMA engine

[Tile diagram: Core (register file, three execution pipelines); Cache (16K L1-I, 8K L1-D, I-TLB, D-TLB, 64K L2, 2D DMA).]

Runs SMP Linux
Runs off-the-shelf C/C++ programs
Signal processing and general apps
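
For readers unfamiliar with the SAD instruction mentioned above: SAD (sum of absolute differences) is a block-matching metric used in video motion estimation. The plain-Python reference below shows what such an operation computes; the sample pixel values are made up, and this is not a model of the actual Tile instruction.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-length pixel sequences."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same length")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

# Hypothetical 8-bit pixel rows from a reference frame and a candidate match.
reference = bytes([10, 20, 30, 40])
candidate = bytes([12, 18, 30, 45])
print(sad(reference, candidate))
```

A lower result means a closer match, which is why a hardware instruction for this inner loop helps video encoders.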

Terabit Switch

Software must complement the hardware


Enable re-use of existing code-bases

Standards-based development environment


e.g. gcc, C, C++, Java, Linux
Comprehensive command-line & GUI-based tools

Support multiple OS models


One OS running SMP
Multiple virtualized OSs with protection
Bare metal or zero-overhead with background OS environment

Support range of parallel programming styles


Threaded programming (pThreads, TBB)
Run-to-completion with load-balancing
Decomposition & pipelining
Higher-level frameworks (Erlang, OpenMP, Hadoop etc.)
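
As a minimal sketch of the run-to-completion style with load balancing, the example below has a pool of workers pull items from a shared queue and process each one from start to finish. The function name `run_to_completion`, the worker count, and the checksum workload are all hypothetical; on a Tile processor this pattern would more likely be written in C with pthreads.

```python
import queue
import threading

def run_to_completion(items, handler, num_workers=4):
    """Process items with workers that each pull the next item from a
    shared queue and handle it start to finish before taking another."""
    work = queue.Queue()
    for index, item in enumerate(items):
        work.put((index, item))
    results = [None] * len(items)

    def worker():
        while True:
            try:
                index, item = work.get_nowait()
            except queue.Empty:
                return  # no work left, so this worker exits
            results[index] = handler(item)  # each slot is written by one worker only

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Hypothetical workload: "packets" are byte strings, the handler is a checksum.
packets = [bytes([n]) * (n + 1) for n in range(16)]
checksums = run_to_completion(packets, lambda p: sum(p) % 256)
print(checksums)
```

Load balancing falls out of the shared queue: a worker that finishes a short item immediately takes the next one, so no worker sits idle while work remains.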


Software Roadmap

Standards & open source integration


Compiler: gcc, g++ 4.4+
Linux:
Kernel: Tile architecture integrated into 2.6.36
User-space: glibc, broader set of standard packages

Extended programming and runtime environments


Java: porting OpenJDK
Virtualization: porting KVM

Tile architecture:
The future of many-core computing

Multicore is the way forward


But we need the right architecture to utilize it

The Tile architecture addresses the challenges


Scales to 100s of cores
Delivers very low power
Runs your existing code

Standards-based software
Familiar tools
Full range of standard programming environments

Thank you! Questions?

Research Vision to Commercial Product


[Figure: timeline with marked years 1996, 1997, 2002, 2007, 2010, and 2018. Labels: "A blank slate"; "The opportunity"; MIT Raw, 16 cores; Tile Processor, 64 cores; TILE-Gx100, 100 cores; "The future?"; 1B transistors in 2007; 100B transistors. Insets show a CPU with memory (Mem) and the TILE-Gx100 block diagram.]

Standard tools and programming model


Multicore Development Environment
Standards-based tools

Standard programming
SMP Linux 2.6
ANSI C/C++
Java, PHP

Integrated tools
GCC compiler
Standard gdb, gprof
Eclipse IDE

Innovative tools
Multicore debug
Multicore profile

Standard application stack

Application layer
Open source apps
Standard C/C++ libs

Operating System layer
64-way SMP Linux
Zero Overhead Linux
Bare metal environment

Hypervisor layer
Virtualizes hardware
I/O device drivers

[Stack diagram, top to bottom: Applications (applications, libraries); Operating System (kernel, drivers); Hypervisor (virtualization and high speed I/O drivers); Tile Processor (tiles).]


Standard Software Stack

Management Protocols: IPMI 2.0
Infrastructure Apps: Network Monitoring, Transcoding
Language Support: Perl
Compiler, OS, Hypervisor: Commercial Linux Distribution, gcc & g++


High single core performance


Comparable to Atom & ARM Cortex-A9 cores

[Figure: Single-core, single-thread CoreMark comparison. Bar chart of CoreMark score (axis ticks 500 to 3,500) for Tilera TILEPro64 866 MHz, Tilera TILE-Gx36 1.25 GHz, ARM Cortex-A9 1 GHz, and Intel Atom N270 1600 MHz.]

- Data for TILEPro, ARM Cortex-A9, Atom N270 is available on the CoreMark website http://coremark.org/home.php
- TILE-Gx and single thread Atom results were measured in Tilera labs
- Single core, single thread result for ARM is calculated based on chip scores


Significant value across multiple markets


Networking
Classification
L4-7 Services
Load Balancing
Monitoring/QoS
Security

Multimedia
Video Conferencing
Media Streaming
Transcoding

Wireless
Base Station
Media Gateway
Service Nodes
Test Equipment

Cloud
Apache
Memcached
Web Applications
LAMP stack

High Performance
Low Power
Standard Programming

Over 100 customers
Over 40 customers going into production
Tier 1 customers in all target markets

Targeting markets with highly parallel applications

Web
Web Serving
In Memory Cache
Data Mining

Media delivery
Transcoding
Video delivery
Wireless media

Government
Lawful interception
Surveillance
Other

Common Themes
Hundreds and thousands of servers running each application
Thousands of parallel transactions
All need better performance and power efficiency

The Tile Processor Architecture


Mesh interconnect, power-optimized cores

Core + Switch = Tile
[Figure labels: Processor Core; S (switch); Traditional Bus/Ring Architecture]

2-dimensional mesh network
200 Tbps on-chip bandwidth

Scales to large numbers of cores
Modular: design-and-verify 1 tile
Power efficient:
Short wires & locality optimize CV^2f
Chandrakasan effect: more cores at lower frequency and voltage reduce CV^2f
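
To make the mesh scaling argument concrete, the sketch below counts hops between tiles on a 2D mesh. It assumes dimension-ordered (X then Y) routing and row-major tile numbering; both are illustrative assumptions rather than details stated on the slide.

```python
def tile_coords(tile_id, width):
    """Row-major tile numbering: returns (x, y)."""
    return tile_id % width, tile_id // width

def mesh_hops(src, dst, width):
    """Hops between two tiles under X-then-Y routing (Manhattan distance)."""
    sx, sy = tile_coords(src, width)
    dx, dy = tile_coords(dst, width)
    return abs(sx - dx) + abs(sy - dy)

def average_hops(width, height):
    """Mean hop count over all ordered pairs of distinct tiles."""
    n = width * height
    total = sum(mesh_hops(a, b, width)
                for a in range(n) for b in range(n) if a != b)
    return total / (n * (n - 1))

print(mesh_hops(0, 63, width=8))       # corner to corner on an 8x8 mesh
print(round(average_hops(8, 8), 2))    # average over all tile pairs
```

In this model the worst-case distance grows with the side length of the mesh rather than with the number of tiles, and neighbouring tiles are always one short hop apart, which is the locality the slide refers to.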


Distributed everything
Cache, memory management, connectivity

Big centralized caches don't scale
Contention
Long latency
High power

[Figure: distributed caches]

Distributed caches have numerous benefits
Lower power (less logic lit-up per access)
Exploit locality (local L1 & L2)
Can exploit various cache placement mechanisms to enhance performance

Completely HW Coherent Cache System

Global 64-bit address space


Highest compute density

2U form factor
4 hot pluggable modules

8 Tilera TILEPro processors


512 general purpose cores
1.3 trillion operations/sec

Power efficient and eco-friendly server

10,000 cores in an 8 kilowatt rack
35-50 watts max per node
Server power of 400 watts
90%+ efficient power supplies
Shared fans and power supplies
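
These figures are mutually consistent if each 400-watt server is the 2U system with 8 TILEPro processors and 512 cores described under "Highest compute density". That pairing, and the reading of "node" as one processor, are assumptions; the slide does not say so explicitly.

```python
CORES_PER_SERVER = 8 * 64      # 8 processors x 64 cores each = 512 (assumed pairing)
SERVER_WATTS = 400
RACK_WATTS = 8000

servers_per_rack = RACK_WATTS // SERVER_WATTS
cores_per_rack = servers_per_rack * CORES_PER_SERVER
watts_per_core = RACK_WATTS / cores_per_rack
watts_per_node = SERVER_WATTS / 8   # one node = one processor (assumed)

print(servers_per_rack, cores_per_rack, round(watts_per_core, 2), watts_per_node)
```

Twenty such servers fill the 8 kW budget with 10,240 cores, about 0.78 W per core, and 50 W per processor, which sits at the top of the quoted 35-50 W range.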


Coherent distributed cache system

Globally Shared Physical Address Space


Full Hardware Cache Coherence
Standard shared memory programming model

Distributed cache
Each tile has local L1 and L2 caches
Aggregate of L2 serves as a globally shared L3
Any cache block can be replicated locally
Hardware tracks sharers, invalidates stale copies

Dynamic Distributed Cache (DDC)
Memory pages distributed across all cores or homed by allocating core

Coherent I/O
Hardware maintains coherence
I/O reads/writes coherent with tile caches
Reads/writes delivered by HW to home cache
Header/packet delivered directly to tile caches

Completely Coherent System

[Block diagram: DDR2 Controllers 0 to 3; PCIe 0 and PCIe 1; XAUI 0 and XAUI 1; GbE 0 and GbE 1; Flexible I/O; UART, JTAG, SPI, I2C; SerDes.]
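
One way to picture the two DDC homing policies is the toy model below. The modulo "hash", the 4 KB page size, the 64-byte cache line, and the 64-tile count are all illustrative assumptions; the slides do not describe the actual hardware mapping.

```python
NUM_TILES = 64      # assumed tile count
LINE_BYTES = 64     # assumed cache-line size
PAGE_BYTES = 4096   # assumed page size

def home_tile_hashed(addr):
    """'Distributed across all cores': spread a page's cache lines over tiles."""
    line = addr // LINE_BYTES
    return line % NUM_TILES   # stand-in for whatever hash the hardware uses

class PageTable:
    """'Homed by allocating core': every line in a page shares one home tile."""
    def __init__(self):
        self.home_of_page = {}

    def allocate(self, addr, allocating_tile):
        self.home_of_page[addr // PAGE_BYTES] = allocating_tile

    def home_tile(self, addr):
        return self.home_of_page[addr // PAGE_BYTES]

table = PageTable()
table.allocate(0x10000, allocating_tile=5)

lines_in_page = range(0x10000, 0x10000 + PAGE_BYTES, LINE_BYTES)
hashed_homes = {home_tile_hashed(a) for a in lines_in_page}
local_homes = {table.home_tile(a) for a in lines_in_page}
print(len(hashed_homes))   # number of distinct home tiles when hashed
print(local_homes)         # single home tile when homed by the allocating core
```

In this model, hashing spreads one page's traffic across every tile's L2, while local homing keeps a thread's private data in the cache nearest the core that allocated it.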
