
Anatomy of in-memory processing in Spark
A deep-dive into custom memory management in Spark

Shashank L
Big data consultant and trainer at datamantra.io
www.shashankgowda.com
https://github.com/shashankgowdal/introduction_to_dataset
Agenda
Era of in-memory processing
Big data frameworks on JVM
JVM memory model
Custom memory management
Allocation
Serialization
Processing
Benefits of these on Spark
Era of in-memory processing
After Spark, in-memory processing has become a de facto standard for big data workloads
Advancements in hardware are pushing more frameworks in that direction
Memory management is tightly coupled with the runtime of the framework
Using memory efficiently in a big data workload is a challenging task
Why the JVM is a prominent runtime for big data workloads
Managed runtime
Portable
Hadoop was built on the JVM
Rich ecosystem
Big data frameworks on JVM
Many frameworks run on the JVM today
Spark
Flink
Hadoop
etc.
Organising data in memory
In-memory processing
In-memory caching of intermediate results
Memory management influences
Resource efficiency
Performance
Straight-forward approach: the JVM memory model
Store a collection of objects and perform any processing directly on the collection.
Advantages
Eases the development cycle
Built-in safety checks before any memory is modified
Reduces complexity
Built-in JVM garbage collection
JVM memory model - Disadvantages
Predicting memory consumption is hard
If the prediction is wrong, an OutOfMemoryError kills the JVM
High garbage collection overhead
Easily 50% of the time can be spent in GC
Objects have space overhead
JVM objects don't take the same amount of memory as we might think
Java object overhead
Consider the string "abcd" as a JVM object. By looking at it, it should take up 4 bytes (one per character) of memory. In reality, it occupies roughly 48 bytes once the String object's header and hash field, and the backing char array's header, 2-bytes-per-character storage, and padding are counted.
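A quick way to verify this overhead is with the OpenJDK Object Layout (JOL) tool. The sketch below assumes the jol-core dependency (org.openjdk.jol:jol-core) is on the classpath and simply prints the real footprint of the string and its backing array.

    import org.openjdk.jol.info.GraphLayout

    object StringOverhead {
      def main(args: Array[String]): Unit = {
        val s = new String("abcd")
        // Walks the object graph (the String plus its backing char array) and
        // prints the actual footprint, which is far larger than 4 bytes.
        println(GraphLayout.parseInstance(s).toFootprint)
      }
    }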
Garbage collection challenges
Many big data workloads create objects in a way that is unfriendly to regular Java GC
Young-generation garbage collection is frequent
Objects created in big data workloads tend to live and die in the young generation itself, because they are only used a few times
Generality has a cost, so semantics and schema should be used to exploit specificity instead
Custom memory management
Allocation - allocate a fixed number of memory segments upfront
Serialization - serialize data objects into the memory segments
Processing - implement algorithms on the binary representation
Allocation
Managing memory on our own
sun.misc.Unsafe
Directly manipulates memory without safety checks (hence, it's unsafe)
This API is used to build off-heap data structures in Spark
sun.misc.Unsafe
Unsafe is one of the gateways to low-level programming in Java
Exposes C-style memory access
Explicit allocation, deallocation, and pointer arithmetic
Unsafe methods are JVM intrinsics
Hands-on
DirectIntArray
MemoryCorruption
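The DirectIntArray exercise above lives in the hands-on repository; the sketch below is an independent, simplified illustration of the same idea: an int array backed by memory that Unsafe allocates outside the heap, with no bounds checks and manual deallocation.

    import sun.misc.Unsafe

    object UnsafeAccess {
      // Unsafe.getUnsafe() rejects application classes, so the usual trick is
      // to read the private static "theUnsafe" field reflectively.
      val unsafe: Unsafe = {
        val field = classOf[Unsafe].getDeclaredField("theUnsafe")
        field.setAccessible(true)
        field.get(null).asInstanceOf[Unsafe]
      }
    }

    class DirectIntArray(size: Long) {
      private val bytes = size * 4                                // 4 bytes per Int
      private val address = UnsafeAccess.unsafe.allocateMemory(bytes)
      UnsafeAccess.unsafe.setMemory(address, bytes, 0.toByte)     // zero the segment

      def set(index: Long, value: Int): Unit =
        UnsafeAccess.unsafe.putInt(address + index * 4, value)    // no bounds check

      def get(index: Long): Int =
        UnsafeAccess.unsafe.getInt(address + index * 4)

      def free(): Unit = UnsafeAccess.unsafe.freeMemory(address)  // manual deallocation
    }

Indexing past the end silently corrupts neighbouring memory, which is the kind of failure the MemoryCorruption exercise is about.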
Custom memory management in Spark
On heap
Stores data inside an array of type Long
A single long array can hold up to 16 GB
Bytes are encoded into longs and stored there
Off heap
Allocates memory assigned to the JVM outside the heap
Uses the Unsafe API
Stores bytes directly
Encoding memory addresses
Off heap: addresses are raw memory pointers
On heap: addresses are (base object, offset) pairs
Spark uses its own page-table abstraction to enable a more compact encoding of on-heap addresses
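A minimal sketch of the two addressing modes above, reusing the illustrative UnsafeAccess helper from the earlier sketch: on heap the base is an object and the offset is relative to it, off heap the base is null and the offset is a raw pointer. Spark's real implementation lives in its Platform and memory-manager classes.

    object AddressingDemo {
      import UnsafeAccess.unsafe

      def main(args: Array[String]): Unit = {
        // On heap: the base object is a long[], the offset is relative to the array start.
        val onHeap = new Array[Long](8)
        val arrayBase = unsafe.arrayBaseOffset(classOf[Array[Long]]).toLong
        unsafe.putLong(onHeap, arrayBase, 42L)

        // Off heap: the base object is null, the offset is a raw memory pointer.
        val offHeapAddress = unsafe.allocateMemory(64)
        unsafe.putLong(null, offHeapAddress, 42L)

        println(unsafe.getLong(onHeap, arrayBase))       // 42
        println(unsafe.getLong(null, offHeapAddress))    // 42
        unsafe.freeMemory(offHeapAddress)
      }
    }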
Serialization
Data structures prominently used in big data
Sequence
Key-value pair
Java object-based row representation
3 fields of type (int, string, string) with values (123, "data", "mantra")
5+ objects
High space overhead
Expensive hashCode()
Tungsten's UnsafeRow format
A bitset for tracking null values
Every column appears in the fixed-length value region
Fixed-length values are inlined
For variable-length values, a relative offset into the variable-length data section is stored
Rows are always 8-byte aligned
Equality comparison can be done on the raw bytes
Example of an UnsafeRow
(123, "data", "mantra")
[Figure: the null-tracking bitmap, followed by the fixed-length value region and the variable-length data section]
Hands-on
UnsafeRowCreator
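The UnsafeRowCreator exercise builds the row with Spark's own classes. As a hand-rolled illustration only, the sketch below lays out (123, "data", "mantra") in a plain ByteBuffer following the format described above: an 8-byte null-tracking bitset, one 8-byte slot per field, and an 8-byte-aligned variable-length section addressed by (offset, length) pairs.

    import java.nio.{ByteBuffer, ByteOrder}

    object UnsafeRowLayout {
      def main(args: Array[String]): Unit = {
        val data = "data".getBytes("UTF-8")
        val mantra = "mantra".getBytes("UTF-8")

        def padded(n: Int) = (n + 7) / 8 * 8             // 8-byte alignment
        val fixedRegion = 8 + 3 * 8                      // null bitset + 3 field slots
        val buf = ByteBuffer
          .allocate(fixedRegion + padded(data.length) + padded(mantra.length))
          .order(ByteOrder.LITTLE_ENDIAN)

        buf.putLong(0, 0L)                               // null-tracking bitset: no nulls
        buf.putLong(8, 123L)                             // int field, inlined in its 8-byte slot

        // Variable-length fields store (offset << 32 | length) in the fixed region
        // and the actual bytes in the variable-length section.
        var varOffset = fixedRegion
        buf.putLong(16, (varOffset.toLong << 32) | data.length)
        buf.position(varOffset); buf.put(data)
        varOffset += padded(data.length)
        buf.putLong(24, (varOffset.toLong << 32) | mantra.length)
        buf.position(varOffset); buf.put(mantra)
      }
    }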
java.util.HashMap
Huge object overheads
Poor memory locality
Size estimation is hard
Tungsten BytesToBytesMap
Low overheads
Good memory locality, especially for scans
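The sketch below is a conceptual illustration of the BytesToBytesMap idea, not Spark's implementation: key/value records are packed contiguously into one byte buffer and the lookup table stores only offsets into it, so there are no per-entry wrapper objects and scans walk sequential memory. Hash-collision handling and records longer than 127 bytes are deliberately left out.

    import java.nio.charset.StandardCharsets.UTF_8
    import scala.collection.mutable

    class BytesMapSketch {
      private val data = mutable.ArrayBuffer[Byte]()      // packed key/value bytes
      private val offsets = mutable.HashMap[Int, Int]()   // key hash -> record offset

      def put(key: String, value: String): Unit = {
        val k = key.getBytes(UTF_8); val v = value.getBytes(UTF_8)
        offsets(key.hashCode) = data.length               // remember where the record starts
        data += k.length.toByte; data ++= k               // [kLen][key bytes]
        data += v.length.toByte; data ++= v               // [vLen][value bytes]
      }

      def get(key: String): Option[String] =
        offsets.get(key.hashCode).map { off =>
          val kLen = data(off)                            // skip over the key
          val vOff = off + 1 + kLen
          val vLen = data(vOff)
          new String(data.slice(vOff + 1, vOff + 1 + vLen).toArray, UTF_8)
        }
    }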
Processing
Many big data workloads are now compute-bound
Network optimizations can only reduce job completion time by a median of at most 2%
Optimizing or eliminating disk accesses can only reduce job completion time by a median of at most 19% [1]
Hardware trends
Why is the CPU the new bottleneck?
Hardware has improved
1 Gbps to 10 Gbps network links
High-bandwidth SSDs or striped HDD arrays
Spark IO has been optimized
Many workloads now avoid significant disk IO by pruning data that is not needed in a given job
New shuffle and network layer implementations
Data formats have improved
Parquet and other binary data formats
Serialization and hashing are CPU-bound bottlenecks
Code generation
Generic evaluation of expression logic is very expensive on the JVM
Virtual function calls
Branches based on expression type
Object creation due to primitive boxing
Memory consumption by boxed primitive objects
Instead, Spark generates code that directly applies the expression logic to the serialized data
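The generated code can be inspected directly. The short sketch below assumes a Spark 2.x or later session where the org.apache.spark.sql.execution.debug helpers are available; it builds a trivial expression and dumps the Java source Spark generates for it.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.debug._

    object CodegenDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("codegen").getOrCreate()
        import spark.implicits._

        val df = Seq((123, "data", "mantra")).toDF("id", "a", "b")
          .filter($"id" > 100)
          .select($"id" + 1)

        df.explain(true)    // shows the Catalyst logical and physical plans
        df.debugCodegen()   // dumps the Java source generated for the physical plan
      }
    }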
Which Spark APIs can benefit?
Spark DataFrames
Spark SQL
RDDs
Why do only DataFrames benefit?
[Diagram: the Python, Java/Scala, and R DataFrame APIs and the SQL API all produce a logical plan that is fed through the Catalyst optimizer before physical execution, whereas the Spark RDD API goes directly to physical execution and bypasses the optimizer.]
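A brief sketch, assuming a local SparkSession, of the split the diagram describes: the RDD version runs opaque user lambdas on JVM objects and never reaches Catalyst, while the DataFrame version is expressed with Catalyst expressions and executes on Tungsten's binary row format.

    import org.apache.spark.sql.SparkSession

    object RddVsDataFrame {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("rdd-vs-df").getOrCreate()
        import spark.implicits._

        val nums = 1 to 1000

        // RDD API: opaque closures over JVM objects, no Catalyst optimization.
        val rddSum = spark.sparkContext.parallelize(nums).map(_ * 2).sum()

        // DataFrame API: the same logic as Catalyst expressions; the optimizer can
        // rewrite it, and execution uses Tungsten's binary row format.
        val dfSum = nums.toDF("n").selectExpr("sum(n * 2)").first().getLong(0)

        println(s"$rddSum vs $dfSum")
      }
    }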
Runtime bytecode generation
DataFrame code → Catalyst expressions → low-level bytecode (via code generation)
Aggregation optimization in DataFrames
[Diagram: scan the input, convert each input row into an UnsafeRow, project the grouping key, probe the BytesToBytesMap with that key, update the aggregate value in place, and finally iterate over the map to produce the results.]
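A short sketch, assuming a local SparkSession, of a DataFrame aggregation that follows the pipeline above; explain() shows the HashAggregate operators that internally probe and update the hash map over UnsafeRows.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    object AggregationDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("agg").getOrCreate()
        import spark.implicits._

        val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
        val agg = df.groupBy("key").agg(sum("value"))

        agg.explain()   // the physical plan shows the HashAggregate stages
        agg.show()
      }
    }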
Performance results with optimizations (run time)
Performance results with optimizations (GC time)
References
[1] https://www.eecs.berkeley.edu/~keo/publications/nsdi15-final147.pdf
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
https://spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/
https://gist.github.com/raphw/7935844
http://www.bigsynapse.com/addressing-big-data-performance
