processing in Spark
A deep-dive into custom memory management in Spark
https://github.com/shashankgowdal/introduction_to_dataset
Shashank L
www.shashankgowda.com
Agenda
Era of in-memory processing
Big data frameworks on JVM
JVM memory model
Custom memory management
Allocation
Serialization
Processing
Benefits of these on Spark
Era of in-memory processing
Since Spark, in-memory processing has become the de facto standard for big data workloads
Advances in hardware are pushing more frameworks in that direction
Memory management is tightly coupled to the runtime of the framework
Using memory efficiently in a big data workload is a challenging task
Why the JVM is a prominent runtime for big data workloads
Managed runtime
Portable
Off heap
Allocates memory outside the Java heap, in native memory owned by the JVM process
Uses the sun.misc.Unsafe API
Stores bytes directly
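The off-heap bullets above can be sketched in a few lines of Java. This is a minimal illustration, not Spark's own allocator: the class name OffHeapSketch and the method roundTrip are made up for the example, and sun.misc.Unsafe is an unsupported internal API that newer JDKs deprecate and may warn about at run time.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical example class; not part of Spark.
final class OffHeapSketch {
    // sun.misc.Unsafe has no public constructor; the usual workaround is
    // to read its private static "theUnsafe" field reflectively.
    private static final Unsafe UNSAFE = loadUnsafe();

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }

    /** Writes a long and an int into native memory, reads them back, then frees the block. */
    static long[] roundTrip(long first, int second) {
        // 8 bytes for the long + 4 bytes for the int, allocated outside the Java heap.
        long address = UNSAFE.allocateMemory(12);
        try {
            UNSAFE.putLong(address, first);       // raw bytes, no object header
            UNSAFE.putInt(address + 8, second);
            return new long[] { UNSAFE.getLong(address), UNSAFE.getInt(address + 8) };
        } finally {
            // Native memory is invisible to the garbage collector and must be freed by hand.
            UNSAFE.freeMemory(address);
        }
    }

    public static void main(String[] args) {
        long[] values = roundTrip(123L, 456);
        System.out.println(values[0] + ", " + values[1]);
    }
}
```

The address returned by allocateMemory is a raw pointer, so a wrong offset corrupts memory or crashes the JVM instead of throwing an exception; that is the price of bypassing the managed heap.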
Encoding memory addresses
Off heap: addresses are raw memory pointers
On heap: addresses are base object + offset pairs
Spark uses its own page table abstraction to enable more compact encoding of on-heap addresses
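One way to picture the page table idea is to pack a page number and an offset into a single 64-bit value, then resolve the page number through a table of base objects. The sketch below assumes a split of 13 bits for the page number and 51 bits for the offset, which is how I recall Spark's TaskMemoryManager doing it; treat that split, the class name PageAddressSketch, and the use of long[] pages indexed by word as assumptions made for illustration.

```java
// Hypothetical example class; a simplification, not Spark's TaskMemoryManager.
final class PageAddressSketch {
    static final int PAGE_NUMBER_BITS = 13;             // assumed split
    static final int OFFSET_BITS = 64 - PAGE_NUMBER_BITS;
    static final long OFFSET_MASK = (1L << OFFSET_BITS) - 1;

    // Page table: slot i holds the base object of page i (here a plain long[]).
    private final long[][] pageTable = new long[1 << PAGE_NUMBER_BITS][];

    /** Packs (page number, offset in page) into one 64-bit address. */
    static long encode(int pageNumber, long offsetInPage) {
        return ((long) pageNumber << OFFSET_BITS) | (offsetInPage & OFFSET_MASK);
    }

    static int pageNumber(long address) {
        return (int) (address >>> OFFSET_BITS);         // unsigned shift keeps the value positive
    }

    static long offsetInPage(long address) {
        return address & OFFSET_MASK;
    }

    /** Registers a new on-heap page of the given size in 8-byte words. */
    void allocatePage(int pageNumber, int words) {
        pageTable[pageNumber] = new long[words];
    }

    // In this sketch the offset is a word index into the page, not a byte offset.
    void putLong(long address, long value) {
        pageTable[pageNumber(address)][(int) offsetInPage(address)] = value;
    }

    long getLong(long address) {
        return pageTable[pageNumber(address)][(int) offsetInPage(address)];
    }

    public static void main(String[] args) {
        PageAddressSketch memory = new PageAddressSketch();
        memory.allocatePage(5, 16);
        long address = encode(5, 3);
        memory.putLong(address, 42L);
        System.out.println(memory.getLong(address));
    }
}
```

Because the address holds a page number instead of an object reference, it stays valid when the garbage collector moves the underlying array, and it fits in one long that can be stored inside other off-heap or on-heap structures.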
Serialization
Data structures prominently used in big data
Sequence
Key-Value pair
Java object-based row notation
3 fields of type (int, string, string) with values (123, "data", "mantra")
5+ objects
high space overhead
expensive hashCode()
Tungsten's unsafe row format
array
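To contrast with the object-based row, here is a rough sketch of packing the same (int, string, string) row into one byte array. The layout is my understanding of the unsafe row idea: a null bitset word, one 8-byte slot per field, then variable-length data padded to 8 bytes, with each string slot holding an offset and a length. The class name RowLayoutSketch, the helper names, and the big-endian ByteBuffer encoding are assumptions for illustration and are not Spark's exact format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; approximates the layout, not Spark's UnsafeRow.
final class RowLayoutSketch {
    static final int WORD = 8;

    /** Packs (int, string, string) into a single byte array. */
    static byte[] pack(int a, String b, String c) {
        byte[] bBytes = b.getBytes(StandardCharsets.UTF_8);
        byte[] cBytes = c.getBytes(StandardCharsets.UTF_8);
        int fixed = WORD + 3 * WORD;                 // null bitset + three 8-byte slots
        int total = fixed + pad(bBytes.length) + pad(cBytes.length);
        ByteBuffer buf = ByteBuffer.allocate(total);
        buf.putLong(0, 0L);                          // null bitset: no field is null
        buf.putLong(WORD, a);                        // field 0: the int is stored inline
        int cursor = fixed;                          // where variable-length data starts
        cursor = putVar(buf, 2 * WORD, cursor, bBytes);
        putVar(buf, 3 * WORD, cursor, cBytes);
        return buf.array();
    }

    // Writes (offset << 32 | length) into the field slot and the bytes at the cursor.
    private static int putVar(ByteBuffer buf, int slot, int cursor, byte[] data) {
        buf.putLong(slot, ((long) cursor << 32) | data.length);
        for (int i = 0; i < data.length; i++) {
            buf.put(cursor + i, data[i]);
        }
        return cursor + pad(data.length);
    }

    static int getInt(byte[] row, int field) {
        return (int) ByteBuffer.wrap(row).getLong(WORD + field * WORD);
    }

    static String getString(byte[] row, int field) {
        long slot = ByteBuffer.wrap(row).getLong(WORD + field * WORD);
        int offset = (int) (slot >>> 32);            // upper 32 bits: offset from row start
        int length = (int) slot;                     // lower 32 bits: length in bytes
        return new String(row, offset, length, StandardCharsets.UTF_8);
    }

    // Rounds up to the next multiple of 8 so every field stays word-aligned.
    private static int pad(int n) {
        return (n + 7) & ~7;
    }

    public static void main(String[] args) {
        byte[] row = pack(123, "data", "mantra");
        System.out.println(row.length + " bytes: " + getInt(row, 0) + ", "
                + getString(row, 1) + ", " + getString(row, 2));
    }
}
```

The whole row is one object as far as the garbage collector is concerned, and equality or hashing can run over the raw bytes without visiting separate Integer and String objects.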
Tungsten BytesToBytesMap
...
Low overheads
Good memory locality, especially for scans
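A byte-oriented hash map of this kind can be approximated with an open-addressing table of longs that points into a contiguous data page. The sketch below is a heavily simplified stand-in, not Spark's BytesToBytesMap: the class name BytesMapSketch is made up, values are fixed 8-byte longs, the capacity must be a power of two, and there is no resizing, so the caller must keep the number of keys below the capacity and the records within the page size.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical example class; a simplification of the idea, not Spark's implementation.
final class BytesMapSketch {
    private final int capacity;      // number of slots, must be a power of two
    private final long[] slots;      // per slot: [record offset + 1, key hash]; 0 means empty
    private final ByteBuffer page;   // records: [4-byte key length][key bytes][8-byte value]
    private int pageCursor = 0;

    BytesMapSketch(int capacity, int pageBytes) {
        this.capacity = capacity;
        this.slots = new long[2 * capacity];
        this.page = ByteBuffer.allocate(pageBytes);
    }

    /** Adds delta to the value stored for key, inserting the key with value 0 if absent. */
    void addTo(byte[] key, long delta) {
        int valueOffset = locate(key, true);
        page.putLong(valueOffset, page.getLong(valueOffset) + delta);   // update in place
    }

    /** Returns the value for key, or 0 if the key is absent. */
    long get(byte[] key) {
        int valueOffset = locate(key, false);
        return valueOffset < 0 ? 0L : page.getLong(valueOffset);
    }

    // Linear probing; returns the offset of the value inside the page, or -1 if not found.
    private int locate(byte[] key, boolean insert) {
        int hash = Arrays.hashCode(key);
        int slot = hash & (capacity - 1);
        while (true) {
            long stored = slots[2 * slot];
            if (stored == 0) {
                if (!insert) return -1;
                int recordOffset = pageCursor;       // append the record to the page
                page.putInt(recordOffset, key.length);
                for (int i = 0; i < key.length; i++) {
                    page.put(recordOffset + 4 + i, key[i]);
                }
                page.putLong(recordOffset + 4 + key.length, 0L);
                pageCursor += 4 + key.length + 8;
                slots[2 * slot] = recordOffset + 1L;
                slots[2 * slot + 1] = hash;
                return recordOffset + 4 + key.length;
            }
            int recordOffset = (int) (stored - 1);
            // Compare the cached hash first so most mismatches never touch the page.
            if (slots[2 * slot + 1] == hash && keyEquals(recordOffset, key)) {
                return recordOffset + 4 + key.length;
            }
            slot = (slot + 1) & (capacity - 1);
        }
    }

    private boolean keyEquals(int recordOffset, byte[] key) {
        if (page.getInt(recordOffset) != key.length) return false;
        for (int i = 0; i < key.length; i++) {
            if (page.get(recordOffset + 4 + i) != key[i]) return false;
        }
        return true;
    }
}
```

Scanning all entries means walking the page from start to pageCursor, which touches memory sequentially; that is the locality benefit the bullets mention, and the whole map consists of two Java objects regardless of how many keys it holds.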
Processing
Many big data workloads are now compute bound
Spark DataFrames
Spark SQL
RDD
Why do only DataFrames benefit?
Logical plan -> Catalyst optimizer -> Physical execution
Runtime bytecode generation
DataFrame code
Catalyst expressions
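The motivation for runtime code generation can be shown by comparing an interpreted expression tree with the straight-line code a generator would aim to produce. The sketch below is an analogy, not Catalyst: the Expr interface and the class name CodegenSketch are invented, rows are plain long arrays, and the generated method is written by hand to stand in for code that would be compiled at run time.

```java
// Hypothetical example class; an analogy for expression code generation, not Catalyst itself.
final class CodegenSketch {
    /** A tiny expression tree, evaluated by walking its nodes for every row. */
    interface Expr {
        long eval(long[] row);
    }

    static Expr column(int index) {
        return row -> row[index];
    }

    static Expr literal(long value) {
        return row -> value;
    }

    static Expr add(Expr left, Expr right) {
        return row -> left.eval(row) + right.eval(row);
    }

    static Expr multiply(Expr left, Expr right) {
        return row -> left.eval(row) * right.eval(row);
    }

    // Interpreted form of (col0 + 1) * col1: one interface call per tree node per row.
    static final Expr INTERPRETED =
            multiply(add(column(0), literal(1)), column(1));

    // Hand-written stand-in for generated code: the same expression with the tree
    // flattened into straight-line arithmetic and no per-node dispatch.
    static long generated(long[] row) {
        return (row[0] + 1) * row[1];
    }

    public static void main(String[] args) {
        long[] row = {2, 5};
        System.out.println(INTERPRETED.eval(row) + " == " + generated(row));
    }
}
```

Both forms compute the same result; the difference is the per-row cost of walking the tree, which matters once a workload is compute bound.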
Probe
Scan
Iterate BytesToBytesMap
Update in place
Performance results with optimizations (run time)
Performance results with optimizations (GC time)
References
https://www.eecs.berkeley.edu/~keo/publications/nsdi15-final147.pdf
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
https://spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/
https://gist.github.com/raphw/7935844
http://www.bigsynapse.com/addressing-big-data-performance