
Rasterization on Larrabee:
A First Look at the Larrabee New Instructions (LRBni) in Action *

* Warning: this is a pretty technical talk, so if you have no interest in processor architecture, instruction sets, or programming and were trying to decide between this talk and something about game design or the production pipeline, you might want to reconsider!

Michael Abrash
GDC, March 2009
I never did get that lerp instruction!
Larrabee
 What is Larrabee?
 Better yet, why is Larrabee?
 For decades, processors have gotten faster
 Higher clock speeds
 Smaller (== faster and more) gates
 Bigger caches
 Hardware that extracts more work per clock
Why Larrabee?
 That process certainly hasn’t stopped
 But it’s getting harder
 Low-hanging fruit already taken
 Running up against power budgets
Why Larrabee?
 More recently, developments along another axis
 Vector processing
 And another
 Multiple hardware threads
 And yet another
 Multiple cores
 Performance can scale linearly with gates
 More work per clock by moving burden of
extracting parallelism to software
What is Larrabee?
 Larrabee is the logical conclusion of these trends
 Lots of power-efficient cores
 In-order pipeline
 Clocked at the power/performance sweet spot
 4 threads per core
 16-wide vector units
 Streaming support
What is Larrabee?
 Very powerful
 Very power-efficient
 Very highly parallel
 Very dependent on software to make use of that
parallelism
What is Larrabee?
 Gets as much work out of each watt and each
square millimeter as possible
 Scales well far into the future
 Massive potential performance; 1 teraflop & up
 My favorite part
 Excellent software control of performance
 Relies heavily on software to get it to live up to its
potential
Larrabee hardware architecture
 Obligatory vague architecture slide

[Figure: block diagram. Many multi-threaded wide-SIMD cores, each with its own I$ and D$, arranged around a shared L2 cache. Other labeled blocks: fixed-function texture logic, memory controllers, display interface, system interface.]
Larrabee hardware architecture
 Lots of enhanced in-order x86 cores
 Fully capable of running an OS and apps
 Great flexibility in graphics pipeline design
 Can support a wide variety of software
 4 threads per core
 Hide pipeline latency, cache misses
 Coherent caches, connected by a fast ring
Larrabee hardware architecture
 Tried to maximize general usability of Larrabee hardware
 Fixed-function texture sampler units
 Also a per-core cache-management unit
 No other fixed-function units
Larrabee hardware architecture
 These features boost performance via thread-
level parallelism
 Key element of Larrabee performance, but it’s not
unique to Larrabee, so I’m not going to talk about it
further today
Larrabee hardware architecture
 I’m going to focus on per-thread (data-level,
SIMD) parallelism
 16-wide vector unit
 Why 16-wide?
 The wider the better – if it gets used
LRBni
 Larrabee New Instructions
 >100 new instructions
 Mostly vector instructions
 Architected in close collaboration with developers
 Design philosophy
 It’s hard to leverage data parallelism without all the right
pieces in the instruction set
 Enable generally-usable extraction of data-level parallelism
The fundamentals of LRBni
 32 vector registers, each 512 bits wide
 v0-v31
 Full complement of vector instructions
 Operate on int32, float32, float64
 Mul, add, sub, adc, sbb, subr, and, or, xor, multiply-
add, multiply-sub
 Vector compares
 Aligned and unaligned store/load
 Gather/scatter
 Bit manipulation: insert field, interleave, shuffle
The fundamentals of LRBni
 Ternary (three-argument) operations
 Load-operand – can read one arg from memory
 No-cost type conversions on load/store
 All math is 32- or 64-bit wide
 Smaller data in memory to save bw and footprint
 Common upconversions on load-operand
 Upconversion and/or broadcast on memory load
 Downconversion and/or selection on memory store
 All common DX/OpenGL types including float16,
unorm8, etc.
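As a rough illustration of what conversion and broadcast on load mean, the Python sketch below decodes two of the in-memory formats named above and models a {1to16} broadcast. The function names are hypothetical, and on Larrabee this happens as part of the load itself; the sketch only shows the arithmetic, assuming the usual unorm8 convention (byte / 255) and IEEE half-precision float16.

```python
import struct

def upconvert_unorm8(data):
    """Map each byte 0..255 to a float in [0.0, 1.0] (usual unorm8 meaning)."""
    return [b / 255.0 for b in data]

def upconvert_float16(data):
    """Decode little-endian IEEE half-precision values into Python floats."""
    count = len(data) // 2
    return list(struct.unpack("<%de" % count, data))

def broadcast_1to16(value):
    """Model of {1to16}: one element in memory feeds all 16 lanes."""
    return [value] * 16
```

The point of keeping small types in memory is bandwidth and footprint: a unorm8 vector occupies a quarter of the bytes of the float32 vector the math actually operates on.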
The fundamentals of LRBni
 8 16-bit mask registers
 Every vector instruction can do no-cost predication
 Most often set by vector compares
 Can be copied from scalar registers (eax, ebx, …)
 Set of logical instructions that operate on masks
 Mask tests allow real branches and loops
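A minimal Python model of predication may help here. It is a sketch, not LRBni semantics in full: `compare_lt` and `masked_add` are hypothetical helpers, a 16-bit integer stands in for a mask register, and a compare issued under a write mask is assumed to leave disabled lanes at 0 (which is the behavior you get when the destination mask and the write mask are the same register).

```python
LANES = 16

def compare_lt(a, b, write_mask=0xFFFF):
    """Return a 16-bit mask: bit i is set where lane i is enabled and a[i] < b[i]."""
    result = 0
    for i in range(LANES):
        if (write_mask >> i) & 1 and a[i] < b[i]:
            result |= 1 << i
    return result

def masked_add(dest, mask, a, b):
    """Write a[i] + b[i] only into lanes whose mask bit is 1; other lanes keep dest[i]."""
    return [a[i] + b[i] if (mask >> i) & 1 else dest[i] for i in range(LANES)]

values = list(range(LANES))
k1 = compare_lt(values, [8] * LANES)        # lanes 0..7 pass
k2 = compare_lt([3] * LANES, values, k1)    # of those, lanes with value > 3
result = masked_add([0] * LANES, k2, values, [100] * LANES)
```

Chaining compares under a mask is how several conditions get ANDed together without any branches; a single test of the final mask then drives a real branch or loop.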


The fundamentals of LRBni
 Parallel <=> serial conversion
 Pack-store; load-unpack
 Gather; scatter
 Bit scan initialized
 Streaming support
 Prefetching
 Cache control
Tim Sweeney on Larrabee
 Quotes from Tim Sweeney on Larrabee:
 Short version: Larrabee rocks! 
 Larrabee instructions are “vector complete”
 More precisely: Any loop written in a traditional
programming language can be vectorized, to execute
16 iterations of the loop in parallel on Larrabee
vector units, provided the loop body meets the
following criteria:
 Its call graph is statically known.
 There are no data dependencies between iterations.
Michael Abrash on Larrabee
 “Tim’s absolutely right, but I’ll bet there’s still a
lot of performance to be had from mucking
around under the hood ”
Sample LRBni code
kxnor k2, k2
vorpi v0, v2, v2
vorpi v1, v3, v3
vxorpi v4, v4, v4
mov ebx, 256
loop:
cmp ebx,0
jl endloop
dec ebx
vmulps v21 {k2}, v0, v1
vaddps v21 {k2}, v21, v21
vmadd213ps v0 {k2}, v0, v2
vmsub231ps v0 {k2}, v1, v1
vaddps v1 {k2}, v21, v3
vaddps v4 {k2}, v4, [ConstantFloatOne] {1to16}
vmulps v25 {k2}, v0, v0
vmadd231ps v25 {k2}, v1, v1
vcmpleps k2 {k2}, v25, [ConstantFloatOne] {1to16}
kortest k2, k2
jnz loop
endloop:

(this happens to be a Mandelbrot-set generator)


(thanks Dean for fixing this!)
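For readers who do not read LRBni yet, here is the same idea as a Python sketch: a Mandelbrot escape-time loop in which a per-lane "active" flag plays the role of the mask register. It is not a line-for-line translation of the listing above. The function name, the escape threshold of 4.0, and starting each lane at z = c are assumptions made for the sketch.

```python
def mandelbrot_counts(cr, ci, max_iter=256, escape_sq=4.0):
    """Iterate z = z*z + c per lane; count iterations while the lane stays active."""
    n = len(cr)
    x, y = list(cr), list(ci)        # assumed starting point: z = c
    count = [0] * n
    active = [True] * n              # stands in for the mask register
    for _ in range(max_iter):
        if not any(active):          # like testing the mask and leaving the loop
            break
        for i in range(n):
            if not active[i]:
                continue             # predicated off: lane state is left untouched
            x_new = x[i] * x[i] - y[i] * y[i] + cr[i]
            y[i] = 2.0 * x[i] * y[i] + ci[i]
            x[i] = x_new
            count[i] += 1
            active[i] = x[i] * x[i] + y[i] * y[i] <= escape_sq
    return count
```

The inner `for i` loop is what the vector unit does in one instruction per operation; the lanes that have escaped simply stop being written.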
LRBni examples
 Ternary: start with a simple vector multiply
 vmulps v0, v5, v6 ; v0 = v5 * v6
LRBni examples
 Multiply-add: destination is also third source
 vmadd231ps v0, v5, v6 ; v0 = v5 * v6 + v0
 Operand 2 times operand 3 plus operand 1
LRBni examples
 Predication: mask the writing of the elements
 vmadd231ps v0 {k1}, v5, v6
LRBni examples
 Load-operand: src2 is the memory location
specified by rbx+rcx*4
 vmadd231ps v0 {k1}, v5, [rbx+rcx*4]
LRBni examples
 The operands can be plugged in differently
 vmadd213ps v0 {k1}, v5, [rbx+rcx*4]
 Operand 2 times operand 1 plus operand 3
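The digit suffix can be read as a recipe: the first two digits name the operands that are multiplied, the third names the operand that is added. The Python sketch below models just that reading. It is inferred from the two forms shown in these slides (231 and 213), so treat the generalization to other digit orders as an assumption; `vmadd` here is a hypothetical helper, not an API.

```python
def vmadd(order, op1, op2, op3):
    """Model of vmaddNNNps: order '231' gives op2*op3 + op1, '213' gives op2*op1 + op3."""
    a, b, c = (int(ch) for ch in order)
    ops = {1: op1, 2: op2, 3: op3}
    return [x * y + z for x, y, z in zip(ops[a], ops[b], ops[c])]

# Operand 1 is also the destination in the real instruction.
v0, v5, mem = [1, 1], [2, 3], [4, 5]
as_231 = vmadd("231", v0, v5, mem)   # v5 * mem + v0
as_213 = vmadd("213", v0, v5, mem)   # v5 * v0 + mem
```

Having both forms matters because only one operand can come from memory, and the destination is always operand 1; the digit order lets you choose which role the memory operand plays.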
LRBni examples
 Broadcast: expand 4 (or 1) elements in memory
 vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {4to16}
LRBni examples
 Conversion: upconvert from float16 format
 vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {float16}
LRBni examples
 Pack-store
 vcompressd [rbx] {k1}, v0
LRBni examples
 Gather
 vgatherd v1 {k1}, [rbx + v2*4]
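The last few examples (broadcast, pack-store, gather) are easier to picture with data. The Python sketch below models them on plain lists. The helper names are hypothetical, memory is modeled as a list of 32-bit elements rather than bytes, and {4to16} is assumed to repeat the four loaded elements four times.

```python
def broadcast_4to16(four):
    """Model of {4to16}: four elements from memory repeated to fill 16 lanes."""
    return list(four) * 4

def pack_store(vector, mask):
    """Model of pack-store: mask-enabled elements written contiguously, in lane order."""
    return [v for i, v in enumerate(vector) if (mask >> i) & 1]

def gather(memory, indices, mask, dest):
    """Model of gather: lane i loads memory[indices[i]] if enabled, else keeps dest[i]."""
    return [memory[idx] if (mask >> i) & 1 else d
            for i, (idx, d) in enumerate(zip(indices, dest))]
```

Pack-store is the parallel-to-serial direction (16 lanes in, a variable-length run out) and gather is the reverse, which is why both show up under parallel <=> serial conversion.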
The fundamentals of LRBni
 All these instructions run at full speed!
 I know this has been way too fast
 There just isn’t time for an in-depth look
 LRBni in more detail:
 Tom Forsyth’s talk (10:30: Room 3002, West Hall)
 Dr. Dobb’s Journal article in April
 These slides (complete with notes)
LRBni in Action
 Everything you need to run fully-parallel code well
 General code running on a CPU
 Can run anything
 How well can it run less-than-perfectly-parallelizable code?
 RAD has been working on the Larrabee
D3D/OpenGL graphics pipeline
 Pipeline is not an ideal candidate for parallelization
 Retirement must be in order
 Rasterization is not easy to vectorize efficiently
 I’ll look at rasterization today
 The process of determining which pixels are inside a triangle
Applying LRBni to the graphics
pipeline
 For the most part, this is easy
 Z/stencil buffering, pixel shading, blending all
fall out of processing 4x4 blocks
 16 vertices can be shaded and cached in parallel
 Vertex usage tends to be localized
 Triangle set-up lends itself pretty well to
vectorization
 Some efficiency cost from culling
Rasterization is not easy to vectorize
 At least not with usable performance
 In fact, we were sure it couldn’t be done!
 Forced to reexamine assumptions
 So irritated at being asked for the hundredth time if
it was possible that we sat down to prove it wasn’t
 We failed 
Dedicated hardware can do any
given task more efficiently
 In general, dedicated hardware will be able to do any
given graphics task more efficiently per square
millimeter than software
 However, CPU flexibility can gain some or all of that
back by applying square millimeters as needed
 Hardware needs worst-case capacity for each component
 Often partly or entirely idle
 When little rasterization is needed (long shaders, say), CPUs
can just use cycles for other purposes; ALUs are never idle
 Can even have higher peak rates in many cases, because the
whole chip can work on a single task if necessary
A quick refresher
 Three edges per triangle, each defined by an
equation Bx+Cy relative to any point on edge
 Sign indicates in/out
 X and y snapped to 15.8 fixed point
 Range [-16K,+16K]
 Tested at pixel/sample centers
 Fill rules must be observed
 Must be exact (discrete math)
 Must support multisampled antialiasing (MSAA)
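To make the refresher concrete, here is a small Python sketch of edge-equation testing in fixed point. It assumes 8 fractional bits (the 15.8 format above), a vertex order that makes the inside negative, and it ignores fill rules entirely, so a pixel center exactly on an edge counts as outside. All function names are hypothetical.

```python
FRAC_BITS = 8  # 15.8 fixed point: 8 fractional bits

def to_fixed(v):
    return int(round(v * (1 << FRAC_BITS)))

def make_edge(x0, y0, x1, y1):
    """Return (B, C, x0, y0) so that B*(x-x0) + C*(y-y0) is zero along the edge."""
    return (y1 - y0, -(x1 - x0), x0, y0)

def edge_value(edge, x, y):
    b, c, x0, y0 = edge
    return b * (x - x0) + c * (y - y0)   # integer math, so the result is exact

def make_triangle(points):
    fixed = [(to_fixed(x), to_fixed(y)) for x, y in points]
    return [make_edge(*fixed[i], *fixed[(i + 1) % 3]) for i in range(3)]

def pixel_inside(edges, px, py):
    """Test the pixel center (px + 0.5, py + 0.5) against all three edges."""
    cx, cy = to_fixed(px + 0.5), to_fixed(py + 0.5)
    return all(edge_value(e, cx, cy) < 0 for e in edges)

triangle = make_triangle([(0, 0), (4, 0), (0, 4)])
```

Because everything is integer arithmetic on snapped coordinates, two triangles sharing an edge compute exactly opposite values along it, which is what lets a real fill rule assign each boundary pixel to exactly one of them.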
Pixel rasterization: example
[Figure: a triangle with the positive (+) and negative (-) side of each of its three edges marked.]
Pixomatic 1 rasterization
 Pixo 1 decomposed triangles into 1 or 2
trapezoids
 Stepped down 2 edges at a time on pixel centers
 Could do with only 1 branch, for loop
 Branch misprediction is very expensive
 Rasterization code tends to predict poorly
Problems with Pixo 1 rasterization
 Required expensive IMUL per edge
 Poorly suited to small triangles
 Not well suited to MSAA sample jittering
 Never could figure out how to vectorize it
efficiently
Sweep rasterization
 Generally well suited to vectorization
 Problems for CPUs
 Lots of badly-predicted branching
 Significant work to figure out where to descend

 Example of an approach that dedicated hardware can perform much more efficiently
Larrabee rasterization
CPU smarts are useful for
rasterization
 Not-rasterizing
 The Larrabee renderer uses chunking (binning, tiling)
 For chunking, separate per-tile and intra-tile rasterization
 Per-tile rasterization (triangle->tile assignment)
 Bounding box tests for triangles up to 1 tile in size
 Sweep or just walk bounding box for larger triangles
 Tile-assignment time is insignificant for larger triangles
 CPUs make it easy to do the 90% case well, difficult 10% case
adequately
 If large-triangle tile assignment important, could use a form
of the approach used for intra-tile (discussed next)
A tiled render target
[Figure: a 256x256 render target, from (0,0) to (256,256), divided into four tiles: Tile 0 and Tile 1 on top, Tile 2 and Tile 3 below.]
Tile assignment test – trivial reject
 Calculate value of each edge equation at its
trivial reject tile corner
 If any edge is non-negative, triangle does not
intersect tile (<0 == inside so we can just test sign
bit)
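The trivial reject test can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: edges are (B, C, x0, y0) tuples in whole-pixel units with negative meaning inside, tiles are square, and the tile is treated as the closed box from (x_min, y_min) to (x_min + size, y_min + size) rather than a set of pixel centers. The function names are hypothetical.

```python
def trivial_reject_corner(b, c, x_min, y_min, size):
    """Tile corner where B*x + C*y is most negative, i.e. most likely to be inside."""
    x = x_min if b >= 0 else x_min + size
    y = y_min if c >= 0 else y_min + size
    return x, y

def tile_trivially_rejected(edges, x_min, y_min, size):
    """True if some edge has its best corner outside: no point of the tile can be inside."""
    for b, c, x0, y0 in edges:
        x, y = trivial_reject_corner(b, c, x_min, y_min, size)
        if b * (x - x0) + c * (y - y0) >= 0:   # sign bit clear: not inside
            return True
    return False

# Triangle (10,10), (100,10), (10,100) as negative-inside edges (B, C, x0, y0).
EDGES = [(0, -90, 10, 10), (90, 90, 100, 10), (-90, 0, 10, 100)]
```

With the 256x256 target split into four 128x128 tiles, this triangle survives the test only for the tile at (0,0); the other three tiles are rejected by the diagonal edge.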
Trivial reject: example
[Figure: the 256x256 render target with tiles 0-3 and a black triangle edge, with its more positive (+) and more negative (-) sides marked. The trivial reject corner of each tile for the black edge is highlighted. Callout for tile 0: if this point isn't inside the black edge, no point in the tile can be inside the black edge.]
Tile assignment test – trivial accept
 For each edge, sum trivial reject corner value
with the equation step to opposite corner
 If an edge is negative at its trivial accept corner, the whole tile is trivially
accepted for that edge
 No need to consider it when rasterizing within tile
 In general, only relevant edges need to be considered
 Scissors, user clip
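Putting reject and accept together gives a three-way classification per tile. The Python sketch below is one way to write it, under the same assumptions as before: (B, C, x0, y0) edges in pixel units, negative means inside, square tiles treated as closed boxes. `classify_tile` is a hypothetical name. The step to the opposite corner is (|B| + |C|) * size, because moving from the most negative corner to the most positive one always increases both terms.

```python
def classify_tile(edges, x_min, y_min, size):
    """Return 'reject', 'accept' or 'partial' for a square tile."""
    all_accepted = True
    for b, c, x0, y0 in edges:
        rx = x_min if b >= 0 else x_min + size
        ry = y_min if c >= 0 else y_min + size
        reject_value = b * (rx - x0) + c * (ry - y0)
        if reject_value >= 0:
            return "reject"                     # tile is entirely outside this edge
        step_to_opposite = (abs(b) + abs(c)) * size
        if reject_value + step_to_opposite >= 0:
            all_accepted = False                # edge still matters inside this tile
    return "accept" if all_accepted else "partial"

SMALL = [(0, -90, 10, 10), (90, 90, 100, 10), (-90, 0, 10, 100)]
BIG = [(0, -4000, -1000, -1000), (4000, 4000, 3000, -1000), (-4000, 0, -1000, 3000)]
```

A fuller version would return the set of edges that are still relevant, so that intra-tile rasterization can skip the ones that were trivially accepted.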
Trivial accept: example
[Figure: the 256x256 render target with tiles 0-3 and a black triangle edge, with its + and - sides marked. The trivial accept corner of each tile for the black edge is highlighted. Callout for tile 0: if this point is inside the black edge, all points in the tile must be inside the black edge. Later builds of the slide add the other two triangle edges.]
Not-rasterizing
 If all edges are negative at their trivial accept
corners, the whole tile is inside the triangle
 No further rasterization is needed
 Can just store a “draw-whole-tile” command in bin
 Back end can then effectively do two nested loops around shaders
 Full-screen triangle rasterization speed ~= infinity 
Intra-tile rasterization
 Same principle, but vectorized
 Assume tile is 64x64 pixels and vector width is 16
 Vector-calculate the 16 trivial-reject and trivial-
accept corners of the 16x16 blocks as a delta
from the tile trivial-reject corner
 Just two adds per edge, using tables generated at
triangle set-up time
 AND edge results together
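Here is a Python sketch of that vectorized step for a 64x64 tile. It is an illustration, not the renderer's code: lists of 16 values stand in for vector registers, 16-bit integers stand in for mask registers, blocks are numbered row by row (bit = row * 4 + column), and each edge arrives as (B, C, value at the tile trivial-reject corner). All names are hypothetical.

```python
TILE, BLOCK = 64, 16

def block_step_tables(b, c):
    """Set-up time: steps from the tile trivial-reject corner to each block's corners."""
    n = TILE // BLOCK
    diag = (abs(b) + abs(c)) * BLOCK       # reject corner -> accept corner of a block
    reject, accept = [], []
    for j in range(n):
        for i in range(n):
            dx = BLOCK * i if b >= 0 else BLOCK * (i + 1) - TILE
            dy = BLOCK * j if c >= 0 else BLOCK * (j + 1) - TILE
            step = b * dx + c * dy
            reject.append(step)
            accept.append(step + diag)
    return reject, accept

def sign_mask(values):
    """Bit i is set where values[i] is negative (inside)."""
    return sum(1 << i for i, v in enumerate(values) if v < 0)

def classify_blocks(edges):
    """Return (trivially accepted mask, partially covered mask) for the 16 blocks."""
    not_rejected, accepted = 0xFFFF, 0xFFFF
    for b, c, tile_value in edges:
        reject, accept = block_step_tables(b, c)
        not_rejected &= sign_mask([tile_value + s for s in reject])   # add #1
        accepted &= sign_mask([tile_value + s for s in accept])       # add #2
    return accepted, not_rejected & ~accepted
```

For the single edge x < 40 on a tile at the origin (B = 1, C = 0, corner value -40), the two leftmost block columns come out trivially accepted and the third column partial. With B < 0 the steps run from the opposite side of the tile, which is where offsets like -48 come from.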
Vector rasterization to 16x16s: trivial reject example
[Figure: a 64x64 tile divided into sixteen 16x16 blocks, crossed by a black edge with its + and - sides marked. White dots are the trivial reject corners of the 16x16s for the black edge; gray 16x16s are trivially rejected by the black edge. Each block's value is stepped from the value at the tile trivial reject corner for the black edge: for example, the corner at offset (-48, -48) is reached by stepping B(-48) + C(-48), and the corner at offset (-48, 0) by stepping B(-48) + C(0).]
Vector rasterization to 16x16s: trivial accept example
[Figure: a 64x64 tile divided into sixteen 16x16 blocks, crossed by a black edge with its + and - sides marked. White dots are the trivial accept corners of the 16x16s for the black edge; pink 16x16s are trivially accepted by the black edge; gray 16x16s are trivially rejected by it.]
Intra-tile rasterization
 Vector-test sign of edge equations at trivial
accept and trivial reject corners and AND
together
 Bit-scan through resulting masks to find trivially
and partially accepted 16x16 blocks
 Each trivial accept becomes a draw-block command
 Again, no further rasterization needed for those pixels
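The bit-scan loop is simple enough to sketch directly. In the Python below, `set_bits` walks the 1-bits of a mask from the lowest up, and `draw_block_commands` turns each into a command tuple. Both names, the command format, and the row-by-row block numbering are assumptions for illustration. The mask used is the composite trivial accept mask from this talk's worked example, whose set bits are blocks #0 and #4.

```python
def set_bits(mask):
    """Yield the indices of the 1-bits in mask, lowest first (a bit-scan-forward loop)."""
    while mask:
        low = mask & -mask            # isolate the lowest set bit
        yield low.bit_length() - 1
        mask ^= low                   # clear it and scan again

def draw_block_commands(accept_mask, tile_x, tile_y, block=16, blocks_per_row=4):
    """One ('draw_block', x, y) command per trivially accepted block."""
    return [("draw_block",
             tile_x + (i % blocks_per_row) * block,
             tile_y + (i // blocks_per_row) * block)
            for i in set_bits(accept_mask)]

composite = 0b0000000000010001
commands = draw_block_commands(composite, 64, 0)
```

The loop runs once per set bit rather than once per block, so a tile with few accepted blocks costs almost nothing here.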
Intra-tile rasterization
[Figure: vector add and compare for edge #1, one lane per 16x16 block.]
  Edge #1 equation value at tile trivial accept corner: 1 (broadcast to all 16 lanes)
+ Edge #1 equation step values from tile trivial accept corner to trivial accept corners of 16x16s:
    3  2  1  0  2  1  0 -1  1  0 -1 -2  0 -1 -2 -3
= Edge #1 equation values at trivial accept corners of 16x16s:
    4  3  2  1  3  2  1  0  2  1  0 -1  1  0 -1 -2
< Preset zeroes, in a register or broadcast from memory:
    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
= Bit mask, in mask register, for edge #1 trivial accept 16x16s:
    0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  1
Intra-tile rasterization
[Figure: ANDing the three per-edge masks, one bit per 16x16 block.]
  Bit mask, in mask register, for edge #1 trivial accept 16x16s:
    0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1
& Bit mask, in mask register, for edge #2 trivial accept 16x16s:
    0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1
= Intermediate result:
    0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1
& Bit mask, in mask register, for edge #3 trivial accept 16x16s:
    0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 1
= Composite trivial accept mask for 16x16 blocks:
    0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
Block #0 is found by the first bit-scan and block #4 by the second; the bit-scan then completes.
Trivial accept for 16x16 blocks
; Step to edge values at 16 16x16 block trivial accept corners
vaddpi v0, v3, [rsi+Edge0TileTrivialAcceptCornerValue]{1to16}
vaddpi v1, v4, [rsi+Edge1TileTrivialAcceptCornerValue]{1to16}
vaddpi v2, v5, [rsi+Edge2TileTrivialAcceptCornerValue]{1to16}
; See if each trivial accept corner is inside all three edges
vcmplepi k1, v0, [rsi+ConstantZero]
vcmplepi k1 {k1}, v1, [rsi+ConstantZero]
vcmplepi k1 {k1}, v2, [rsi+ConstantZero]
; Get the mask; 1-bits are trivial accept corners that are
; inside all three edges
kmov eax, k1
; Loop through 1-bits, issuing a draw-16x16-block command
; for each trivially-accepted 16x16 block
bsf ecx, eax
jz TrivialAcceptDone ; bsf sets ZF when eax is 0: no blocks to draw
TrivialAcceptLoop:
; <Store draw-16x16-block command, along with (x,y) location>
bsfi ecx, eax
jnz TrivialAcceptLoop
TrivialAcceptDone:
Trivial accept for 16x16 blocks
; Step to edge values at 16 16x16 block trivial accept corners &
; see if each trivial accept corner is inside all three edges
vaddsetspi v0, v3, [rsi+Edge0TileTrivialAcceptCornerValue]{1to16}
vaddsetspi v1, v4, [rsi+Edge1TileTrivialAcceptCornerValue]{1to16}
vaddsetspi v2, v5, [rsi+Edge2TileTrivialAcceptCornerValue]{1to16}
; Get the mask; 1-bits are trivial accept corners that are
; inside all three edges
kmov eax, k1
; Loop through 1-bits, issuing a draw-16x16-block command
; for each trivially-accepted 16x16 block
bsf ecx, eax
jz TrivialAcceptDone ; bsf sets ZF when eax is 0: no blocks to draw
TrivialAcceptLoop:
; <Store draw-16x16-block command, along with (x,y) location>
bsfi ecx, eax
jnz TrivialAcceptLoop
TrivialAcceptDone:
Partially accepted 16x16 blocks
 For each partial 16x16, descend again to 4x4s
 Put trivially accepted 4x4s in bins
 Partially accepted 4x4s need to be processed
into pixel masks
 Vector add of equation step for each edge
 AND signs together to form pixel mask
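The pixel-mask step follows the same pattern one level down. The Python sketch below is a simplified model: each edge is given as (B, C, value at the center of the 4x4's top-left pixel) in units where one pixel step changes the value by B or C, whereas the renderer steps from the trivial reject corner using tables built at set-up time. `pixel_mask_4x4` is a hypothetical name, and pixels are numbered row by row.

```python
def pixel_mask_4x4(edges):
    """Return a 16-bit mask: bit (row * 4 + col) is set if that pixel center is inside."""
    mask = 0xFFFF
    for b, c, origin_value in edges:
        steps = [b * i + c * j for j in range(4) for i in range(4)]   # set-up time table
        values = [origin_value + s for s in steps]                    # one vector add
        mask &= sum(1 << n for n, v in enumerate(values) if v < 0)    # AND of sign bits
    return mask

# Two edges in half-pixel units: x < 2 (value 2x - 4) and y < 3 (value 2y - 6),
# both evaluated at the first pixel center (0.5, 0.5).
example = pixel_mask_4x4([(2, 0, -3), (0, 2, -5)])
```

The result is exactly what gets stored in the bin for a partially covered 4x4: 16 bits, one per pixel, ready to be used as a write mask by the shading code.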
Pixel mask for partial 4x4: example
[Figure: a 4x4 block of pixels crossed by a black edge with its + and - sides marked. White dots are pixel centers; blue pixels are inside the black edge; grey pixels are outside it. The trivial reject corner for the black edge for the 4x4 is marked. Resulting mask register: 1110111011001100]
Partially accepted 4x4 blocks
; Store values at corners of 16 4x4s in 16x16 for indexing into
vstored [rsi+Edge0TrivialRejectCornerValues4x4], v0
vstored [rsi+Edge1TrivialRejectCornerValues4x4], v1
vstored [rsi+Edge2TrivialRejectCornerValues4x4], v2
; Load step tables from corners of 4x4s to pixel centers
vloadd v3, [rsi+Edge0PixelCenterTable]
vloadd v4, [rsi+Edge1PixelCenterTable]
vloadd v5, [rsi+Edge2PixelCenterTable]
; Loop through 1-bits from trivial reject test on 16x16 block,
; descending to rasterize each partially-accepted 4x4
kmov eax, k1
bsf ecx, eax
jz Partial4x4Done ; bsf sets ZF when eax is 0: no partial 4x4s
Partial4x4Loop:
; See if each of 16 pixel centers is inside all three edges
vcmpgtpi k2, v3, [rsi+Edge0TrivialRejectCornerValues4x4+rcx*4] {1to16}
vcmpgtpi k2 {k2}, v4, [rsi+Edge1TrivialRejectCornerValues4x4+rcx*4] {1to16}
vcmpgtpi k2 {k2}, v5, [rsi+Edge2TrivialRejectCornerValues4x4+rcx*4] {1to16}
; Store the mask
kmov edx, k2
mov [rbx], dx
; <Store the (x,y) location and advance rbx>
bsfi ecx, eax
jnz Partial4x4Loop
Partial4x4Done:
Partially accepted 4x4 blocks - MSAA
[Figure: a 4x4 block of pixels crossed by a black edge with its + and - sides marked. Yellow dots are sample 2 centers; blue samples are inside the black edge; grey samples are outside it. The trivial reject corner for the black edge for the 4x4 is marked.]
Adaptive rasterization
 Don’t have the luxury of custom data and ALU
sizes
 Do have the luxury of adapting to input data
 Edge equations have to be evaluated with 48 bits
in the worst case
 We have to use 64 bits
 Can use 32 bits for triangles that fit in 128x128
bounding boxes
 90+% of triangles
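The adaptive choice itself is just a bounding-box check at set-up time. A minimal Python sketch, assuming vertices in pixel units and using the 128x128 limit quoted above (the function name is hypothetical, and the real set-up code would work on the snapped fixed-point coordinates):

```python
def edge_eval_bits(vertices, small_box=128):
    """Pick 32-bit edge evaluation when the triangle's bounding box fits in small_box."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    fits = (max(xs) - min(xs) <= small_box) and (max(ys) - min(ys) <= small_box)
    return 32 if fits else 64

small = edge_eval_bits([(0, 0), (100, 20), (50, 127)])
large = edge_eval_bits([(0, 0), (300, 20), (50, 10)])
```

Dedicated hardware would size its ALUs for the worst case once; software gets to make this decision per triangle and take the cheap path most of the time.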
Adaptive rasterization (cont.)
 When we do have to do 64-bit edge evaluation
 64-bit only required for tile assignment
 Any tile up to 128x128 that’s not trivially accepted or
rejected can be rasterized using 32 bits
 Adaptive intra-tile rasterization
 Triangles that fit in a 16x16 bounding box
 One less level of descent, less set-up, no trivial accept test
 Direct mask stamping for 4x4, 4x8, 8x4 bounding
boxes
 Non-rasterization-based z for small triangles
 Easy to try things out
Implementation
 Will still not match dedicated hardware peak
rates per square millimeter on average
 Efficient enough, and avoids dedicating area and
design effort for a narrow purpose
 Generality improves overall perf for a wide range of
tasks
 For example, can bring more mm^2 to bear – the
whole chip!
Implementation
 Texture sampling and filtering remain as
significant challenges for software
 Apart from that, everything can be implemented
in software
 Not always obvious how at first, but surprisingly
often doable
 Still evolving
 A whole new way to think about optimization
Further Larrabee Information
 Tom Forsyth’s talk
 10:30: Room 3002, West Hall
 SIGGRAPH paper
 “Larrabee: A Many-Core x86 Architecture for Visual
Computing,” Seiler et al
 Just search on “Larrabee SIGGRAPH paper”
 Dr. Dobb’s Journal article in April
 www.intel.com/software/graphics
 GDC Larrabee talks
 C++ Larrabee Prototype Library
