Michael Abrash
GDC, March 2009
I never did get that lerp instruction!
Larrabee
What is Larrabee?
Better yet, why is Larrabee?
For decades, processors have gotten faster
Higher clock speeds
Smaller (== faster and more) gates
Bigger caches
Hardware that extracts more work per clock
Why Larrabee?
That process certainly hasn’t stopped
But it’s getting harder
Low-hanging fruit already taken
[Figure: Larrabee block diagram. Multi-threaded wide SIMD cores, each with its own I$ and D$, share an L2 cache alongside memory controllers, fixed-function texture logic, a display interface, and a system interface.]
Larrabee hardware architecture
Lots of enhanced in-order x86 cores
Fully capable of running an OS and apps
Great flexibility in graphics pipeline design
Streaming support
Prefetching
Cache control
Tim Sweeney on Larrabee
Quotes from Tim Sweeney on Larrabee:
Short version: Larrabee rocks!
Larrabee instructions are “vector complete”
More precisely: Any loop written in a traditional
programming language can be vectorized, to execute
16 iterations of the loop in parallel on Larrabee
vector units, provided the loop body meets the
following criteria:
Its call graph is statically known.
There are no data dependencies between iterations.
Michael Abrash on Larrabee
“Tim’s absolutely right, but I’ll bet there’s still a
lot of performance to be had from mucking
around under the hood”
Sample LRBni code
kxnor k2, k2
vorpi v0, v2, v2
vorpi v1, v3, v3
vxorpi v4, v4, v4
mov ebx, 256
loop:
cmp ebx,0
jl endloop
dec ebx
vmulps v21 {k2}, v0, v1
vaddps v21 {k2}, v21, v21
vmadd213ps v0 {k2}, v0, v2
vmsub231ps v0 {k2}, v1, v1
vaddps v1 {k2}, v21, v3
vaddps v4 {k2}, v4, [ConstantFloatOne] {1to16}
vmulps v25 {k2}, v0, v0
vmadd231ps v25 {k2}, v1, v1
vcmpleps k2 {k2}, v25, [ConstantFloatOne] {1to16}
kortest k2, k2
jnz loop
endloop:
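The role of the k2 write mask in the sample can be sketched in scalar Python. This is a rough illustration and not a line-by-line translation of the listing: the function name, the Mandelbrot-style update, and the `limit` parameter are assumptions made for the example. Each lane runs its own loop iteration, lanes that fail the test are masked off and keep their last values, and the loop exits once no lane is still active.

```python
def masked_escape_counts(cx, cy, max_iters=256, limit=4.0):
    """Run one escape-time loop per lane; a per-lane mask retires finished lanes."""
    lanes = len(cx)
    x, y = list(cx), list(cy)
    count = [0] * lanes
    mask = [True] * lanes              # all lanes active, like kxnor k2, k2
    for _ in range(max_iters):
        if not any(mask):              # like kortest k2, k2 followed by jnz loop
            break
        for i in range(lanes):
            if not mask[i]:
                continue               # masked-off lanes keep their old values
            new_x = x[i] * x[i] - y[i] * y[i] + cx[i]
            new_y = 2.0 * x[i] * y[i] + cy[i]
            x[i], y[i] = new_x, new_y
            count[i] += 1
            # Compare under the mask: once a lane drops out it stays out.
            mask[i] = (x[i] * x[i] + y[i] * y[i]) <= limit
    return count
```

Lanes with different trip counts share one loop, which is the property the masked compare and kortest provide in the vector version.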
Pixomatic 1 rasterization
Pixo 1 decomposed triangles into 1 or 2
trapezoids
Stepped down 2 edges at a time on pixel centers
Could do with only 1 branch, for loop
Branch misprediction is very expensive
Rasterization code tends to predict poorly
Pixomatic 1 rasterization
Problems with Pixo 1 rasterization
Required expensive IMUL per edge
Poorly suited to small triangles
Not well suited to MSAA sample jittering
Never could figure out how to vectorize it
efficiently
Sweep rasterization
Generally well suited to vectorization
Problems for CPUs
Lots of badly-predicted branching
Significant work to figure out where to descend
[Figure: render target divided into four tiles, Tile 0 through Tile 3, extending to (256,256)]
Tile assignment test – trivial reject
Calculate value of each edge equation at its
trivial reject tile corner
If any edge is non-negative at its trivial reject corner, the triangle does not
intersect the tile (<0 == inside, so we can just test the sign bit)
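A minimal sketch of the trivial reject test, assuming each edge is stored as coefficients (A, B, C) of E(x, y) = A*x + B*y + C with negative meaning inside, and each tile as (x0, y0, x1, y1). The function names and the tuple layouts are hypothetical, and pixel-center and fill-rule details are ignored. The trivial reject corner is taken to be the tile corner where the edge equation is most negative, so if even that corner is not inside, nothing in the tile is.

```python
def edge_value(edge, x, y):
    """Evaluate E(x, y) = A*x + B*y + C; negative means inside the edge."""
    a, b, c = edge
    return a * x + b * y + c

def trivial_reject_corner(edge, tile):
    """Pick the tile corner where this edge's equation is most negative."""
    a, b, _ = edge
    x0, y0, x1, y1 = tile
    return (x0 if a >= 0 else x1, y0 if b >= 0 else y1)

def tile_trivially_rejected(edges, tile):
    """True if any edge is non-negative at its trivial reject corner."""
    return any(
        edge_value(e, *trivial_reject_corner(e, tile)) >= 0 for e in edges
    )

# Triangle (10,10), (100,10), (10,100) written as three edge equations.
triangle = [(-1, 0, 10), (0, -1, 10), (1, 1, -110)]
tiles = {
    0: (0, 0, 128, 128),
    1: (128, 0, 256, 128),
    2: (0, 128, 128, 256),
    3: (128, 128, 256, 256),
}
rejected = {n: tile_trivially_rejected(triangle, t) for n, t in tiles.items()}
```

With this triangle only tile 0 survives the test; tiles 1, 2, and 3 are rejected by the diagonal edge.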
Trivial reject: example
[Figure: 256x256 render target, from (0,0) to (256,256), divided into Tile 0 through Tile 3 and crossed by a black edge with its + (more positive) and - sides marked. The trivial reject corner for the black edge is labeled on each tile.]
Trivial reject corner of tile 0 for black edge: if this point isn’t inside the black edge, no point in the tile can be inside the black edge
Tile assignment test – trivial accept
For each edge, sum trivial reject corner value
with the equation step to opposite corner
If an edge is negative at its trivial accept corner, the whole tile is trivially
accepted for that edge
No need to consider it when rasterizing within tile
In general, only relevant edges need to be considered
Scissors, user clip
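The step from the trivial reject corner to the opposite corner can be sketched as follows. Edges are assumed to be (A, B, C) coefficients with negative meaning inside, tiles are (x0, y0, x1, y1), and the helper names are hypothetical. Because the reject corner is where the equation is smallest, the step to the opposite corner is |A| times the tile width plus |B| times the tile height.

```python
def corner_values(edge, tile):
    """Return (reject_value, accept_value) for one edge and one tile."""
    a, b, c = edge
    x0, y0, x1, y1 = tile
    reject_x = x0 if a >= 0 else x1
    reject_y = y0 if b >= 0 else y1
    reject_value = a * reject_x + b * reject_y + c
    # Stepping to the opposite corner always increases the edge value.
    step_to_opposite = abs(a) * (x1 - x0) + abs(b) * (y1 - y0)
    return reject_value, reject_value + step_to_opposite

def relevant_edges(edges, tile):
    """Edges not trivially accepted, so they still matter inside the tile."""
    return [e for e in edges if corner_values(e, tile)[1] >= 0]

triangle = [(-1, 0, -10), (0, -1, -10), (1, 1, -200)]
tile = (64, 64, 128, 128)
needed = relevant_edges(triangle, tile)
```

For this tile only the diagonal edge remains relevant; the other two are negative at their trivial accept corners and can be dropped from the intra-tile work.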
Trivial accept: example
[Figure: 256x256 render target, from (0,0) to (256,256), divided into Tile 0 through Tile 3 and crossed by a black edge with its + and - sides marked. The trivial accept corner for the black edge is labeled on each tile.]
Trivial accept corner of tile 0 for black edge: if this point is inside the black edge, all points in the tile must be inside the black edge
Not-rasterizing
If all edges are negative at their trivial accept
corners, the whole tile is inside the triangle
No further rasterization is needed
Can just store a “draw-whole-tile” command in bin
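Putting the two corner tests together, tile assignment can be sketched as a three-way classification. The command strings, the function name, and the (A, B, C) and (x0, y0, x1, y1) layouts are assumptions for illustration; the actual bin format is not described here.

```python
def classify_tile(edges, tile):
    """Classify a tile as 'reject', 'draw-whole-tile', or 'rasterize'."""
    x0, y0, x1, y1 = tile
    all_accepted = True
    for a, b, c in edges:
        reject_x = x0 if a >= 0 else x1
        reject_y = y0 if b >= 0 else y1
        reject_value = a * reject_x + b * reject_y + c
        if reject_value >= 0:
            return "reject"                 # triangle cannot touch this tile
        accept_value = reject_value + abs(a) * (x1 - x0) + abs(b) * (y1 - y0)
        if accept_value >= 0:
            all_accepted = False            # this edge crosses the tile
    return "draw-whole-tile" if all_accepted else "rasterize"

triangle = [(-1, 0, -10), (0, -1, -10), (1, 1, -200)]
bins = {
    tile: classify_tile(triangle, tile)
    for tile in [(0, 0, 64, 64), (64, 64, 128, 128), (128, 128, 192, 192)]
}
```

Only tiles classified as "rasterize" need any per-block or per-pixel work; a "draw-whole-tile" entry can go straight into the bin.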
Vector rasterization to 16x16s: trivial reject example
[Figure: 64x64 tile divided into 16x16 blocks and crossed by a black edge with its + and - sides marked. White dots are the trivial reject corners of the 16x16s for the black edge; gray 16x16s are trivially rejected by the black edge. The tile trivial reject corner for the black edge is also labeled.]
Intra-tile rasterization
Bit masks, in mask registers, of trivially accepted 16x16 blocks (block #15 on the left, block #0 on the right):
Edge #1 trivial accept 16x16s:   0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1
Edge #2 trivial accept 16x16s:   0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1
Intermediate result (#1 & #2):   0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1
Edge #3 trivial accept 16x16s:   0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 1
Composite trivial accept mask:   0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
Block #0 found by first bit-scan
Block #4 found by second bit-scan
Bit-scan completed
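The mask combination and bit-scan from this example can be reproduced in a few lines. The function names are hypothetical; the three mask values are the ones shown on the slide, with bit 0 as the rightmost digit.

```python
def composite_mask(edge_masks):
    """AND the per-edge 16-bit masks into one composite mask."""
    mask = 0xFFFF
    for m in edge_masks:
        mask &= m
    return mask

def scan_set_bits(mask):
    """Return indices of 1-bits from bit 0 upward, like repeated bit-scans."""
    found = []
    index = 0
    while mask:
        if mask & 1:
            found.append(index)
        mask >>= 1
        index += 1
    return found

edge_masks = [
    0b0000000000010011,   # edge #1 trivial accept 16x16s
    0b0111011101110111,   # edge #2 trivial accept 16x16s
    0b0011001100110001,   # edge #3 trivial accept 16x16s
]
accepted = composite_mask(edge_masks)
blocks = scan_set_bits(accepted)
```

Here `accepted` is 0b0000000000010001 and `blocks` is [0, 4], matching the two blocks found by the bit-scans.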
Trivial accept for 16x16 blocks
; Step to edge values at 16 16x16 block trivial accept corners
vaddpi v0, v3, [rsi+Edge0TileTrivialAcceptCornerValue]{1to16}
vaddpi v1, v4, [rsi+Edge1TileTrivialAcceptCornerValue]{1to16}
vaddpi v2, v5, [rsi+Edge2TileTrivialAcceptCornerValue]{1to16}
; See if each trivial accept corner is inside all three edges
vcmplepi k1, v0, [rsi+ConstantZero]
vcmplepi k1 {k1}, v1, [rsi+ConstantZero]
vcmplepi k1 {k1}, v2, [rsi+ConstantZero]
; Get the mask; 1-bits are trivial accept corners that are
; inside all three edges
kmov eax, k1
; Loop through 1-bits, issuing a draw-16x16-block command
; for each trivially-accepted 16x16 block
bsf ecx, eax
jz TrivialAcceptDone ; bsf sets ZF when eax has no 1-bits
TrivialAcceptLoop:
; <Store draw-16x16-block command, along with (x,y) location>
bsfi ecx, eax
jnz TrivialAcceptLoop
TrivialAcceptDone:
Trivial accept for 16x16 blocks
; Step to edge values at 16 16x16 block trivial accept corners &
; see if each trivial accept corner is inside all three edges
vaddsetspi v0, v3, [rsi+Edge0TileTrivialAcceptCornerValue]{1to16}
vaddsetspi v1, v4, [rsi+Edge1TileTrivialAcceptCornerValue]{1to16}
vaddsetspi v2, v5, [rsi+Edge2TileTrivialAcceptCornerValue]{1to16}
; Get the mask; 1-bits are trivial accept corners that are
; inside all three edges
kmov eax, k1
; Loop through 1-bits, issuing a draw-16x16-block command
; for each trivially-accepted 16x16 block
bsf ecx, eax
jz TrivialAcceptDone ; bsf sets ZF when eax has no 1-bits
TrivialAcceptLoop:
; <Store draw-16x16-block command, along with (x,y) location>
bsfi ecx, eax
jnz TrivialAcceptLoop
TrivialAcceptDone:
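What the vector add and sign test compute can be sketched lane by lane. Several things here are assumptions rather than details from the listing: a 64x64 tile split into sixteen 16x16 blocks, block i sitting at column i % 4 and row i // 4, edges given as (A, B, value at the tile's trivial accept corner), and all function names. Each edge has a 16-entry step table from the tile's trivial accept corner to each block's trivial accept corner; adding the broadcast corner value and keeping the lanes whose result is negative gives that edge's mask.

```python
BLOCK = 16
TILE = 64

def accept_step_table(a, b):
    """Steps from the tile's accept corner to each block's accept corner."""
    table = []
    for i in range(16):
        col, row = i % 4, i // 4
        # The accept corner is the far side when the coefficient is >= 0,
        # the near side otherwise, for the tile and for each block alike.
        dx = (col + 1) * BLOCK - TILE if a >= 0 else col * BLOCK
        dy = (row + 1) * BLOCK - TILE if b >= 0 else row * BLOCK
        table.append(a * dx + b * dy)
    return table

def trivial_accept_mask(edges):
    """Bit i is 1 when block i is inside every edge at its accept corner."""
    mask = 0xFFFF
    for a, b, tile_accept_value in edges:
        for i, step in enumerate(accept_step_table(a, b)):
            if tile_accept_value + step >= 0:   # sign bit clear: not inside
                mask &= ~(1 << i)
    return mask

# Tile at the origin; edges x > -10, y > -10, and x + y < 72.
edges = [(-1, 0, -10), (0, -1, -10), (1, 1, 56)]
mask = trivial_accept_mask(edges)
```

For this triangle the mask is 0x137: blocks 0, 1, 2, 4, 5, and 8 are trivially accepted.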
Partially accepted 16x16 blocks
For each partial 16x16, descend again to 4x4s
Put trivially accepted 4x4s in bins
Partially accepted 4x4s need to be processed
into pixel masks
Vector add of equation step for each edge
AND signs together to form pixel mask
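A minimal sketch of forming the pixel mask for one partial 4x4. As a simplification it steps from the 4x4's origin corner rather than from each edge's trivial reject corner, and it assumes (A, B, C) edges with negative meaning inside, pixel j at column j % 4 and row j // 4, and pixel centers at half-integer offsets. The function names are hypothetical.

```python
def pixel_center_table(a, b):
    """Edge equation steps from the 4x4's origin to its 16 pixel centers."""
    return [a * (j % 4 + 0.5) + b * (j // 4 + 0.5) for j in range(16)]

def pixel_mask_4x4(edges, x0, y0):
    """Bit j is 1 when pixel j's center is inside all edges."""
    mask = 0xFFFF
    for a, b, c in edges:
        corner_value = a * x0 + b * y0 + c
        for j, step in enumerate(pixel_center_table(a, b)):
            if corner_value + step >= 0:     # non-negative: outside this edge
                mask &= ~(1 << j)
    return mask

# Edges x < 2 and y < 3, for the 4x4 whose origin is (0, 0).
one_edge = pixel_mask_4x4([(1, 0, -2)], 0, 0)
two_edges = pixel_mask_4x4([(1, 0, -2), (0, 1, -3)], 0, 0)
```

The first edge alone keeps the two left columns (0x3333); adding the second edge clears the bottom row as well (0x0333).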
Pixel mask for partial 4x4: example
[Figure: 4x4 pixels crossed by a black edge with its + and - sides marked. White dots are pixel centers; blue pixels are inside the black edge and grey pixels are outside it. The trivial reject corner for the black edge for the 4x4 is labeled.]
Resulting mask register: 1110111011001100
Partially accepted 4x4 blocks
; Store values at corners of 16 4x4s in 16x16 for indexing into
vstored [rsi+Edge0TrivialRejectCornerValues4x4], v0
vstored [rsi+Edge1TrivialRejectCornerValues4x4], v1
vstored [rsi+Edge2TrivialRejectCornerValues4x4], v2
; Load step tables from corners of 4x4s to pixel centers
vloadd v3, [rsi+Edge0PixelCenterTable]
vloadd v4, [rsi+Edge1PixelCenterTable]
vloadd v5, [rsi+Edge2PixelCenterTable]
; Loop through 1-bits from trivial reject test on 16x16 block,
; descending to rasterize each partially-accepted 4x4
kmov eax, k1
bsf ecx, eax
jz Partial4x4Done ; bsf sets ZF when eax has no 1-bits
Partial4x4Loop:
; See if each of 16 pixel centers is inside all three edges
vcmpgtpi k2, v3, [rsi+Edge0TrivialRejectCornerValues4x4+rcx*4] {1to16}
vcmpgtpi k2 {k2}, v4, [rsi+Edge1TrivialRejectCornerValues4x4+rcx*4] {1to16}
vcmpgtpi k2 {k2}, v5, [rsi+Edge2TrivialRejectCornerValues4x4+rcx*4] {1to16}
; Store the mask
kmov edx, k2
mov [rbx], dx
; <Store the (x,y) location and advance rbx>
bsfi ecx, eax
jnz Partial4x4Loop
Partial4x4Done:
Partially accepted 4x4 blocks - MSAA
[Figure: 4x4 pixels crossed by a black edge with its + and - sides marked. Yellow dots are sample 2 centers; blue samples are inside the black edge and grey samples are outside it. The trivial reject corner for the black edge for the 4x4 is labeled.]
Larrabee rasterization
Adaptive rasterization
Don’t have the luxury of custom data and ALU
sizes
Do have the luxury of adapting to input data
Edge equations have to be evaluated with 48 bits
in the worst case
We have to use 64 bits
Can use 32 bits for triangles that fit in 128x128
bounding boxes
90+% of triangles
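The precision choice can be sketched as a bounding-box check. The function name is hypothetical, and measuring the box as max minus min of the vertex coordinates in pixels is an assumption; the real set-up code may measure it differently.

```python
def edge_eval_bits(vertices, limit=128):
    """Pick 32-bit edge evaluation when the bounding box fits in limit x limit."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    return 32 if width <= limit and height <= limit else 64

small = edge_eval_bits([(0, 0), (100, 0), (0, 100)])
large = edge_eval_bits([(0, 0), (300, 0), (0, 10)])
```

The small triangle takes the 32-bit path and the wide one falls back to 64 bits.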
Adaptive rasterization (cont.)
When we do have to do 64-bit edge evaluation
64-bit only required for tile assignment
Any tile up to 128x128 that’s not trivially accepted or
rejected can be rasterized using 32 bits
Adaptive intra-tile rasterization
Triangles that fit in a 16x16 bounding box
One less level of descent, less set-up, no trivial accept test
Direct mask stamping for 4x4, 4x8, 8x4 bounding
boxes
Non-rasterization-based z for small triangles