15 SparkRDDPersistence

Spark
RDD Persistence
Chapter 15
201509
Course Chapters
Course Introduction
1 Introduction
Introduction to Hadoop
2 Introduction to Hadoop and the Hadoop Ecosystem
3 Hadoop Architecture and HDFS
Importing and Modeling Structured Data
4 Importing Relational Data with Apache Sqoop
5 Introduction to Impala and Hive
6 Modeling and Managing Data with Impala and Hive
7 Data Formats
8 Data File Partitioning
Ingesting Streaming Data
9 Capturing Data with Apache Flume
Distributed Data Processing with Spark
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames
Course Conclusion
18 Conclusion
Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Spark RDD Persistence
In this chapter you will learn

How Spark uses an RDD's lineage in operations
How to persist RDDs to improve performance
Chapter Topics
Distributed Data Processing with Spark
RDD Lineage
RDD Persistence Overview
Distributed Persistence
Conclusion
Homework: Persist an RDD
Lineage Example (1)
Each transformation operation creates a new child RDD
File: purplecow.txt
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
Lineage Example (2)
File: purplecow.txt
MappedRDD[1] (mydata)
> mydata = sc.textFile("purplecow.txt")
Lineage Example (3)
File: purplecow.txt

> myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))
MappedRDD[2]
FilteredRDD[3]: (myrdd)
Lineage Example (4)
File: purplecow.txt
Spark keeps track of the parent RDD for each new RDD
Child RDDs depend on their parents
[Diagram: MappedRDD[1] (mydata) -> MappedRDD[2] -> FilteredRDD[3] (myrdd)]
Lineage Example (5)
File: purplecow.txt
Action operations execute the parent transformations
> myrdd.count()
3
[Diagram: the action runs MappedRDD[1] (mydata), MappedRDD[2] and FilteredRDD[3] (myrdd) in order; MappedRDD[2] holds the four upper-cased lines]
Lineage Example (6)
File: purplecow.txt
By default, each action re-executes the lineage transformations starting with the base
> myrdd.count()
3
> myrdd.count()
Lineage Example (7)
File: purplecow.txt
By default, each action re-executes the lineage transformations starting with the base
> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
.filter(lambda s:s.startswith('I'))
> myrdd.count()
3
> myrdd.count()
3
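The re-execution described above can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Spark code: ToyRDD and to_upper are hypothetical names, and the class only imitates the idea that an RDD stores its parent and a function rather than data.

```python
class ToyRDD:
    """Toy stand-in for an RDD: holds a parent and a function, not data."""

    def __init__(self, parent=None, compute=None, source=None):
        self.parent = parent      # parent in the lineage (None for the base)
        self.compute = compute    # function applied to the parent's records
        self.source = source      # only the base RDD has source lines

    def map(self, f):
        return ToyRDD(parent=self, compute=lambda rows: [f(r) for r in rows])

    def filter(self, f):
        return ToyRDD(parent=self, compute=lambda rows: [r for r in rows if f(r)])

    def _evaluate(self):
        # Walk back to the base and re-run every transformation.
        if self.parent is None:
            return list(self.source)
        return self.compute(self.parent._evaluate())

    def count(self):
        # An "action": triggers evaluation of the whole lineage.
        return len(self._evaluate())


lines = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

upper_calls = []  # records every invocation of the map function


def to_upper(s):
    upper_calls.append(s)
    return s.upper()


mydata = ToyRDD(source=lines)
myrdd = mydata.map(to_upper).filter(lambda s: s.startswith('I'))

first = myrdd.count()
second = myrdd.count()
print(first, second, len(upper_calls))
```

Both counts give the same result, but the map function runs once per line for each action, which is the cost that persistence is meant to avoid.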
RDD Persistence
File: purplecow.txt
Persisting an RDD saves the data (by default in memory)
RDD Persistence
File: purplecow.txt
> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
[Diagram: RDD[1] (mydata) -> RDD[2] (myrdd1)]
RDD Persistence
File: purplecow.txt
> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
> myrdd1.persist()
[Diagram: RDD[1] (mydata) -> RDD[2] (myrdd1)]
RDD Persistence
File: purplecow.txt
> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda s: s.startswith('I'))
[Diagram: RDD[1] (mydata) -> RDD[2] (myrdd1) -> RDD[3] (myrdd2)]
RDD Persistence
File: purplecow.txt
> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda s: s.startswith('I'))
> myrdd2.count()
3
[Diagram: RDD[1] (mydata) -> RDD[2] (myrdd1) -> RDD[3] (myrdd2)]
RDD Persistence
File: purplecow.txt
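Subsequent operations use saved data
> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda s: s.startswith('I'))
> myrdd2.count()
3
> myrdd2.count()
[Diagram: RDD[1] (mydata) -> RDD[2] (myrdd1) -> RDD[3] (myrdd2)]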
RDD Persistence
File: purplecow.txt
Subsequent operations use saved data
> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda s: s.startswith('I'))
> myrdd2.count()
3
> myrdd2.count()
3
[Diagram: RDD[1] (mydata) -> RDD[2] (myrdd1, persisted) -> RDD[3] (myrdd2)]
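The effect of persist on repeated actions can be shown with a small plain-Python model. This is a sketch under stated assumptions, not Spark's implementation: ToyRDD, its saved attribute, and to_upper are hypothetical names chosen for illustration.

```python
class ToyRDD:
    """Toy stand-in for an RDD that can optionally keep its computed data."""

    def __init__(self, parent=None, compute=None, source=None):
        self.parent = parent
        self.compute = compute
        self.source = source
        self.persisted = False   # set by persist()
        self.saved = None        # filled in the first time the data is computed

    def map(self, f):
        return ToyRDD(parent=self, compute=lambda rows: [f(r) for r in rows])

    def filter(self, f):
        return ToyRDD(parent=self, compute=lambda rows: [r for r in rows if f(r)])

    def persist(self):
        # Only marks the RDD; nothing is computed until an action runs.
        self.persisted = True
        return self

    def _evaluate(self):
        if self.saved is not None:
            return self.saved            # reuse saved data, skip the lineage
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.compute(self.parent._evaluate())
        if self.persisted:
            self.saved = rows
        return rows

    def count(self):
        return len(self._evaluate())


lines = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

upper_calls = []  # records every invocation of the map function


def to_upper(s):
    upper_calls.append(s)
    return s.upper()


mydata = ToyRDD(source=lines)
myrdd1 = mydata.map(to_upper)
myrdd1.persist()
myrdd2 = myrdd1.filter(lambda s: s.startswith('I'))

calls_before_action = len(upper_calls)
first = myrdd2.count()
second = myrdd2.count()
print(calls_before_action, first, second, len(upper_calls))
```

In this model the map function runs four times in total: the first count computes and saves myrdd1, and the second count starts from the saved data instead of the base.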
Memory Persistence
In-memory persistence is a suggestion to Spark
If not enough memory is available, persisted partitions will be cleared from memory
Least recently used partitions cleared first
Transformations will be re-executed using the lineage when needed
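The clearing behavior above can be sketched as a small least-recently-used cache. This is an illustrative plain-Python model only: ToyPartitionCache, its capacity measured in partitions, and the recompute function are assumptions, and Spark's actual memory management is more involved.

```python
from collections import OrderedDict


class ToyPartitionCache:
    """Keeps at most `capacity` partitions; evicts the least recently used."""

    def __init__(self, capacity, recompute):
        self.capacity = capacity      # max number of partitions held in memory
        self.recompute = recompute    # partition id -> data (stands in for the lineage)
        self.store = OrderedDict()    # ordered from least to most recently used
        self.recomputed = []          # partition ids that had to be (re)computed

    def get(self, pid):
        if pid in self.store:
            self.store.move_to_end(pid)      # mark as most recently used
            return self.store[pid]
        data = self.recompute(pid)           # not in memory: run the lineage
        self.recomputed.append(pid)
        self.store[pid] = data
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # clear the least recently used partition
        return data


cache = ToyPartitionCache(capacity=2,
                          recompute=lambda pid: [pid * 10 + i for i in range(3)])
cache.get(0)
cache.get(1)
cache.get(0)   # partition 0 becomes the most recently used
cache.get(2)   # memory is full, so partition 1 is cleared
cache.get(1)   # partition 1 is recomputed when it is needed again
print(cache.recomputed, list(cache.store))
```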
Persistence and Fault-Tolerance
RDD = Resilient Distributed Dataset
Resiliency is a product of tracking lineage
RDDs can always be recomputed from their base if needed
RDD partitions are distributed across a cluster
By default, partitions are persisted in memory in Executor JVMs
[Diagram: a driver and three executors; tasks on two of the executors hold persisted partitions rdd_1_0 and rdd_1_1]
RDD Fault-Tolerance (1)
What happens if a partition persisted in memory becomes unavailable?
[Diagram: driver and executors; only partition rdd_1_0 is still held in memory]
RDD Fault-Tolerance (2)
The driver starts a new task to recompute the partition on a different node
Lineage is preserved, data is never lost
[Diagram: driver and executors; rdd_1_0 stays where it was and rdd_1_1 is recomputed by a task on another executor]
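The recovery step can be sketched in plain Python. This is a toy model, not Spark's scheduler: the executor names, compute_partition, and read_partition are hypothetical, and upper-casing stands in for whatever the real lineage computes.

```python
def compute_partition(pid, base_partitions):
    """Stands in for re-running the lineage for one partition."""
    return [s.upper() for s in base_partitions[pid]]


# The base data, split into two partitions.
base_partitions = {
    0: ["I've never seen a purple cow.", "I never hope to see one;"],
    1: ["But I can tell you, anyhow,", "I'd rather see than be one."],
}

# executor name -> {partition id: persisted data}
executors = {"exec-a": {}, "exec-b": {}, "exec-c": {}}
executors["exec-a"][0] = compute_partition(0, base_partitions)
executors["exec-b"][1] = compute_partition(1, base_partitions)


def read_partition(pid):
    """Return (executor, data), recomputing from the base if the partition is gone."""
    for name, held in executors.items():
        if pid in held:
            return name, held[pid]
    # No surviving copy: start a new task on a remaining executor.
    name = sorted(executors)[0]
    executors[name][pid] = compute_partition(pid, base_partitions)
    return name, executors[name][pid]


before = read_partition(1)[1]
del executors["exec-b"]            # simulate losing the node that held partition 1
where, after = read_partition(1)
print(where, after == before)
```

The recomputed partition is identical to the lost one because it is derived from the same base data by the same transformations.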
Persistence Levels
By default, the persist method stores data in memory only
The cache method is a synonym for default (memory) persist
The persist method offers other options called Storage Levels
Storage Levels let you control
Storage location
Format in memory
Partition replication
Persistence Levels: Storage Location
Storage location: where is the data stored?
MEMORY_ONLY (default): same as cache
MEMORY_AND_DISK: store partitions on disk if they do not fit in memory
Called spilling
DISK_ONLY: store all partitions on disk
Python
> from pyspark import StorageLevel
> myrdd.persist(StorageLevel.DISK_ONLY)
Scala
> import org.apache.spark.storage.StorageLevel
> myrdd.persist(StorageLevel.DISK_ONLY)
Persistence Levels: Memory Format
Serialization: you can choose to serialize the data in memory
MEMORY_ONLY_SER and MEMORY_AND_DISK_SER
Much more space efficient
Less time efficient
If using Java or Scala, choose a fast serialization library (e.g. Kryo)
Persistence Levels: Partition Replication
Replication: store partitions on two nodes
MEMORY_ONLY_2
MEMORY_AND_DISK_2
DISK_ONLY_2
MEMORY_AND_DISK_SER_2
You can also define custom storage levels
Changing Persistence Options
To stop persisting and remove from memory and disk
rdd.unpersist()
To change an RDD to a different persistence level
Unpersist first
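The "unpersist first" rule can be modeled with a few lines of plain Python. This is a sketch only: ToyPersistable is a hypothetical class, and the ValueError here is an assumption standing in for whatever error Spark itself reports when a level is changed without unpersisting.

```python
class ToyPersistable:
    """Toy object that remembers one persistence level at a time."""

    def __init__(self):
        self.level = None   # None means "not persisted"

    def persist(self, level="MEMORY_ONLY"):
        # Refuse to switch levels while a different level is already assigned.
        if self.level is not None and self.level != level:
            raise ValueError("already persisted as %s; unpersist first" % self.level)
        self.level = level
        return self

    def unpersist(self):
        self.level = None
        return self


rdd = ToyPersistable().persist()      # default level in this model: MEMORY_ONLY

try:
    rdd.persist("DISK_ONLY")          # attempt to change the level directly
    changed_directly = True
except ValueError:
    changed_directly = False

rdd.unpersist().persist("DISK_ONLY")  # unpersist first, then choose the new level
print(changed_directly, rdd.level)
```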
Disk Persistence
Disk-persisted partitions are stored in local files
[Diagram: a client, a driver and three executors; tasks hold partitions rdd_0 and rdd_1, and rdd_1 is also stored on its executor's local disk]
Disk Persistence with Replication (1)
Persistence replication makes recomputation less likely to be necessary
[Diagram: a client, a driver and three executors; partition rdd_1 is stored on the local disks of two executors]
Disk Persistence with Replication (2)
Replicated data on disk will be used to recreate the partition if possible
Will be recomputed if the data is unavailable
e.g., the node is down
[Diagram: a client, a driver and the remaining executors; the task for rdd_1 runs on the executor that holds the replicated copy of rdd_1]
When and Where to Persist
When should you persist a dataset?

When a dataset is likely to be re-used
e.g., iterative algorithms, machine learning
How to choose a persistence level
Memory only: when possible, best performance
Save space by saving as serialized objects in memory if necessary
Disk: choose when recomputation is more expensive than disk read
e.g., expensive functions or filtering large datasets
Replication: choose when recomputation is more expensive than memory
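The rules of thumb above can be written down as a small decision helper. This is one possible reading, sketched in plain Python: suggest_storage_level and its parameters are hypothetical, the function returns storage level names as strings rather than touching Spark, and mapping the "Disk" rule to MEMORY_AND_DISK rather than DISK_ONLY is an assumption.

```python
def suggest_storage_level(fits_in_memory,
                          fits_if_serialized,
                          recompute_costlier_than_disk_read,
                          replicate=False):
    """Return the name of a storage level following the rules of thumb."""
    if fits_in_memory:
        level = "MEMORY_ONLY"          # best performance when the data fits
    elif fits_if_serialized:
        level = "MEMORY_ONLY_SER"      # trade CPU time for space
    elif recompute_costlier_than_disk_read:
        level = "MEMORY_AND_DISK"      # spill what does not fit (assumed reading)
    else:
        level = "MEMORY_ONLY"          # cheap to recompute: let partitions be cleared
    if replicate:
        level += "_2"                  # keep each partition on two nodes
    return level


small = suggest_storage_level(True, True, False)
tight = suggest_storage_level(False, True, False)
costly = suggest_storage_level(False, False, True)
critical = suggest_storage_level(False, False, True, replicate=True)
print(small, tight, costly, critical)
```

In a PySpark session the returned name could then be looked up on the StorageLevel class before calling persist.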
Essential Points
Spark keeps track of each RDD's lineage
Provides fault tolerance
By default, every RDD operation executes the entire lineage
If an RDD will be used multiple times, persist it to avoid re-computation
Persistence options
Location: memory only, memory and disk, disk only
Format: in-memory data can be serialized to save memory (but at the cost of performance)
Replication: saves data on multiple nodes in case a node goes down, for job recovery without recomputation
In this homework assignment you will

Persist an RDD before reusing it
Use the Spark Application UI to see how an RDD is persisted
Please refer to the Homework description
