Spark

RDD Persistence
Chapter 15

201509
Course Chapters

Course Introduction
1 Introduction

Introduction to Hadoop
2 Introduction to Hadoop and the Hadoop Ecosystem
3 Hadoop Architecture and HDFS

Importing and Modeling Structured Data
4 Importing Relational Data with Apache Sqoop
5 Introduction to Impala and Hive
6 Modeling and Managing Data with Impala and Hive
7 Data Formats
8 Data File Partitioning

Ingesting Streaming Data
9 Capturing Data with Apache Flume

Distributed Data Processing with Spark
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames

Course Conclusion
18 Conclusion

Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.

Spark RDD Persistence

In this chapter you will learn
  How Spark uses an RDD's lineage in operations
  How to persist RDDs to improve performance

Chapter Topics

Spark RDD Persistence
Distributed Data Processing with Spark

RDD Lineage
RDD Persistence Overview
Distributed Persistence
Conclusion
Homework: Persist an RDD

Lineage Example (3)

Each transformation operation creates a new child RDD

File: purplecow.txt
  I've never seen a purple cow.
  I never hope to see one;
  But I can tell you, anyhow,
  I'd rather see than be one.

> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
    .filter(lambda s: s.startswith('I'))

MappedRDD[1] (mydata)
MappedRDD[2]
FilteredRDD[3] (myrdd)

Lineage Example (4)

Spark keeps track of the parent RDD for each new RDD
Child RDDs depend on their parents

File: purplecow.txt
  I've never seen a purple cow.
  I never hope to see one;
  But I can tell you, anyhow,
  I'd rather see than be one.

> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
    .filter(lambda s: s.startswith('I'))

MappedRDD[1] (mydata)
MappedRDD[2]
FilteredRDD[3] (myrdd)

Lineage Example (5)

Action operations execute the parent transformations

File: purplecow.txt
  I've never seen a purple cow.
  I never hope to see one;
  But I can tell you, anyhow,
  I'd rather see than be one.

> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
    .filter(lambda s: s.startswith('I'))
> myrdd.count()
3

MappedRDD[1] (mydata)
  I've never seen a purple cow.
  I never hope to see one;
  But I can tell you, anyhow,
  I'd rather see than be one.
MappedRDD[2]
  I'VE NEVER SEEN A PURPLE COW.
  I NEVER HOPE TO SEE ONE;
  BUT I CAN TELL YOU, ANYHOW,
  I'D RATHER SEE THAN BE ONE.
FilteredRDD[3] (myrdd)
  I'VE NEVER SEEN A PURPLE COW.
  I NEVER HOPE TO SEE ONE;
  I'D RATHER SEE THAN BE ONE.

Lineage Example (7)

By default, each action re-executes the lineage transformations, starting with the base

File: purplecow.txt
  I've never seen a purple cow.
  I never hope to see one;
  But I can tell you, anyhow,
  I'd rather see than be one.

> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
    .filter(lambda s: s.startswith('I'))
> myrdd.count()
3
> myrdd.count()
3

MappedRDD[1] (mydata)
  I've never seen a purple cow.
  I never hope to see one;
  But I can tell you, anyhow,
  I'd rather see than be one.
MappedRDD[2]
  I'VE NEVER SEEN A PURPLE COW.
  I NEVER HOPE TO SEE ONE;
  BUT I CAN TELL YOU, ANYHOW,
  I'D RATHER SEE THAN BE ONE.
FilteredRDD[3] (myrdd)
  I'VE NEVER SEEN A PURPLE COW.
  I NEVER HOPE TO SEE ONE;
  I'D RATHER SEE THAN BE ONE.

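The re-execution shown in the lineage slides can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: ToyRDD and its methods are hypothetical stand-ins written for this example, not the Spark API, and the lines of purplecow.txt are held in a list rather than read from a file. Transformations only record a child with a pointer to its parent; each action walks the lineage back to the base data.

```python
# Toy model of RDD lineage (NOT the Spark API; ToyRDD is a hypothetical class).
LINES = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]


class ToyRDD:
    def __init__(self, parent=None, transform=None, source=None):
        self.parent = parent        # lineage: the RDD this one was derived from
        self.transform = transform  # function applied to the parent's records
        self.source = source        # base data, set only on the root RDD

    def map(self, fn):
        # Transformations are lazy: they only record a new child RDD.
        return ToyRDD(parent=self, transform=lambda recs: (fn(r) for r in recs))

    def filter(self, pred):
        return ToyRDD(parent=self,
                      transform=lambda recs: (r for r in recs if pred(r)))

    def _compute(self):
        # Walk the lineage back to the base and re-apply every transformation.
        if self.parent is None:
            return iter(self.source)
        return self.transform(self.parent._compute())

    def count(self):
        # An action triggers execution of the whole lineage.
        return sum(1 for _ in self._compute())


map_calls = []  # records every invocation of the map function


def upper(s):
    map_calls.append(s)
    return s.upper()


mydata = ToyRDD(source=LINES)
myrdd = mydata.map(upper).filter(lambda s: s.startswith("I"))
calls_before_action = len(map_calls)  # nothing has run yet

first = myrdd.count()
second = myrdd.count()
print(first, second, len(map_calls))
```

In this model the map function runs once per line for every action, so two calls to count() process the four input lines twice. That repeated work is what persistence is meant to avoid.
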
RDD Persistence

Persisting an RDD saves the data (by default in memory)

File: purplecow.txt
  I've never seen a purple cow.
  I never hope to see one;
  But I can tell you, anyhow,
  I'd rather see than be one.

> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda s: s.startswith('I'))
> myrdd2.count()
3

RDD[1] (mydata)
  I've never seen a purple cow.
  I never hope to see one;
  But I can tell you, anyhow,
  I'd rather see than be one.
RDD[2] (myrdd1)
  I'VE NEVER SEEN A PURPLE COW.
  I NEVER HOPE TO SEE ONE;
  BUT I CAN TELL YOU, ANYHOW,
  I'D RATHER SEE THAN BE ONE.
RDD[3] (myrdd2)
  I'VE NEVER SEEN A PURPLE COW.
  I NEVER HOPE TO SEE ONE;
  I'D RATHER SEE THAN BE ONE.

RDD Persistence

Subsequent operations use saved data

File: purplecow.txt
  I've never seen a purple cow.
  I never hope to see one;
  But I can tell you, anyhow,
  I'd rather see than be one.

> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda s: s.startswith('I'))
> myrdd2.count()
3
> myrdd2.count()
3

RDD[1] (mydata)
RDD[2] (myrdd1)
  I'VE NEVER SEEN A PURPLE COW.
  I NEVER HOPE TO SEE ONE;
  BUT I CAN TELL YOU, ANYHOW,
  I'D RATHER SEE THAN BE ONE.
RDD[3] (myrdd2)
  I'VE NEVER SEEN A PURPLE COW.
  I NEVER HOPE TO SEE ONE;
  I'D RATHER SEE THAN BE ONE.

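The effect of persist() in the slides above can be sketched with a small plain-Python model. As before, this is a hedged illustration and not Spark itself: ToyRDD is a hypothetical class, and "saving" simply means keeping a list in a Python attribute. Note that persist() here only marks the RDD, and the data is saved the first time an action computes it, which matches the lazy behaviour of persist() in Spark.

```python
# Toy model of persist() (NOT the Spark API; ToyRDD is a hypothetical class).
LINES = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]


class ToyRDD:
    def __init__(self, compute):
        self._compute = compute  # zero-argument function producing the records
        self._persist = False
        self._saved = None

    def map(self, fn):
        return ToyRDD(lambda: [fn(r) for r in self.collect()])

    def filter(self, pred):
        return ToyRDD(lambda: [r for r in self.collect() if pred(r)])

    def persist(self):
        # Only marks the RDD; nothing is saved until an action computes it.
        self._persist = True
        return self

    def collect(self):
        if self._saved is not None:
            return self._saved          # reuse saved data, skip the lineage
        records = self._compute()       # otherwise recompute from the parent
        if self._persist:
            self._saved = records
        return records

    def count(self):
        return len(self.collect())


map_calls = []  # records every invocation of the map function


def upper(s):
    map_calls.append(s)
    return s.upper()


mydata = ToyRDD(lambda: list(LINES))
myrdd1 = mydata.map(upper)
myrdd1.persist()
myrdd2 = myrdd1.filter(lambda s: s.startswith("I"))

first = myrdd2.count()
calls_after_first = len(map_calls)
second = myrdd2.count()
calls_after_second = len(map_calls)
print(first, second, calls_after_first, calls_after_second)
```

The second count() still re-runs the filter, because myrdd2 itself is not persisted, but it starts from the saved contents of myrdd1 instead of going back to the base data.
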
Memory Persistence

In-memory persistence is a suggestion to Spark
  If not enough memory is available, persisted partitions will be cleared from memory
    Least recently used partitions cleared first
  Transformations will be re-executed using the lineage when needed

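The eviction behaviour described on this slide can be sketched as a least-recently-used cache in plain Python. This is a simplified illustration, not Spark's memory manager: ToyPartitionCache is a hypothetical class, the capacity is counted in partitions rather than bytes, and the recompute function stands in for re-executing the lineage.

```python
from collections import OrderedDict


class ToyPartitionCache:
    """Toy in-memory store that evicts the least recently used partition."""

    def __init__(self, capacity, recompute):
        self.capacity = capacity    # max partitions held (hypothetical limit)
        self.recompute = recompute  # partition id -> records (stands in for lineage)
        self.store = OrderedDict()  # ordered from least to most recently used
        self.recomputed = []        # partition ids that had to be (re)computed

    def get(self, pid):
        if pid in self.store:
            self.store.move_to_end(pid)  # mark as most recently used
            return self.store[pid]
        # Not in memory: re-execute the lineage for this partition.
        self.recomputed.append(pid)
        records = self.recompute(pid)
        self.store[pid] = records
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used
        return records


cache = ToyPartitionCache(
    capacity=2,
    recompute=lambda pid: [pid * 10 + i for i in range(3)],
)
cache.get(0)  # computed and stored
cache.get(1)  # computed and stored
cache.get(0)  # served from memory; partition 1 is now least recently used
cache.get(2)  # computed; memory is full, so partition 1 is cleared
cache.get(1)  # no longer in memory, so it is recomputed
print(cache.recomputed, list(cache.store))
```

Partition 1 is computed twice in this run because it was cleared to make room for partition 2, which is the cost of treating in-memory persistence as a suggestion rather than a guarantee.
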
Persistence and Fault-Tolerance

RDD = Resilient Distributed Dataset


Resiliency is a product of tracking lineage
RDDs can always be recomputed from their base if needed

Distributed Persistence

RDD partitions are distributed across a cluster
  By default, partitions are persisted in memory in Executor JVMs

[Diagram: the driver and three executors; two executors each run a task and hold one persisted partition of the RDD (rdd_1_0 and rdd_1_1)]

RDD Fault-Tolerance (1)

What happens if a partition persisted in memory becomes unavailable?

[Diagram: the driver and two executors; only partition rdd_1_0 is shown, held by an executor running a task]

RDD Fault-Tolerance (2)

The driver starts a new task to recompute the partition on a different node
Lineage is preserved, data is never lost

[Diagram: the driver and two executors; partitions rdd_1_0 and rdd_1_1 are each held by an executor running a task]

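The recovery step on these slides can be sketched in plain Python. Everything here is hypothetical and chosen for illustration: the executor names, the split of purplecow.txt into two partitions, and the helper functions are assumptions, not Spark internals. The point is only that a lost partition can be rebuilt from the base data plus the recorded transformation.

```python
# Toy model of partition recovery (NOT Spark internals; all names are hypothetical).
BASE = {
    0: ["I've never seen a purple cow.", "I never hope to see one;"],
    1: ["But I can tell you, anyhow,", "I'd rather see than be one."],
}


def lineage(pid):
    # Recompute one partition from the base data: upper-case every line.
    return [line.upper() for line in BASE[pid]]


# Each executor holds the partitions it has computed, keyed by partition id.
executors = {"exec-a": {}, "exec-b": {}, "exec-c": {}}


def run_task(executor, pid):
    executors[executor][pid] = lineage(pid)


def read_partition(pid):
    # Use a persisted copy if any live executor still holds one.
    for name, store in executors.items():
        if pid in store:
            return name, store[pid]
    # Otherwise start a new task on an available executor to recompute it.
    target = next(iter(executors))
    run_task(target, pid)
    return target, executors[target][pid]


run_task("exec-a", 0)
run_task("exec-b", 1)
expected = list(executors["exec-b"][1])

del executors["exec-b"]  # the executor holding partition 1 becomes unavailable
where, records = read_partition(1)
print(where, records)
```

After the failure, partition 1 is recomputed on a surviving executor and has the same contents as before, because the lineage rather than the persisted copy is the source of truth.
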
Persistence Levels

By default, the persist method stores data in memory only
  The cache method is a synonym for default (memory) persist
The persist method offers other options called Storage Levels
Storage Levels let you control
  Storage location
  Format in memory
  Partition replication

Persistence Levels: Storage Location

Storage location: where is the data stored?
  MEMORY_ONLY (default): same as cache
  MEMORY_AND_DISK: store partitions on disk if they do not fit in memory
    Called spilling
  DISK_ONLY: store all partitions on disk

Python
> from pyspark import StorageLevel
> myrdd.persist(StorageLevel.DISK_ONLY)

Scala
> import org.apache.spark.storage.StorageLevel
> myrdd.persist(StorageLevel.DISK_ONLY)

Persistence Levels: Memory Format

Serialization: you can choose to serialize the data in memory
  MEMORY_ONLY_SER and MEMORY_AND_DISK_SER
  Much more space efficient
  Less time efficient
  If using Java or Scala, choose a fast serialization library (e.g. Kryo)

Persistence Levels: Partition Replication

Replication: store partitions on two nodes
  MEMORY_ONLY_2
  MEMORY_AND_DISK_2
  DISK_ONLY_2
  MEMORY_AND_DISK_SER_2
  MEMORY_ONLY_SER_2
You can also define custom storage levels

Changing Persistence Options

To stop persisting and remove from memory and disk
  rdd.unpersist()
To change an RDD to a different persistence level
  Unpersist first

Disk Persistence

Disk-persisted partitions are stored in local files

[Diagram: a client, the driver, and three executors; tasks for partitions rdd_0 and rdd_1 run on two of the executors, and rdd_1 is also shown stored on its executor]

Disk Persistence with Replication (1)

Persistence replication makes recomputation less likely to be necessary

[Diagram: a client, the driver, and three executors; tasks for partitions rdd_0 and rdd_1 run on two of the executors, and a second copy of rdd_1 is stored on the third executor]

Disk Persistence with Replication (2)

Replicated data on disk will be used to recreate the partition if possible
  Will be recomputed if the data is unavailable
    e.g., the node is down

[Diagram: a client, the driver, and two executors; one executor runs the task for rdd_0, the other holds rdd_1 and runs its task]

When and Where to Persist

When should you persist a dataset?
  When a dataset is likely to be re-used
    e.g., iterative algorithms, machine learning
How to choose a persistence level
  Memory only: when possible, best performance
    Save space by saving as serialized objects in memory if necessary
  Disk: choose when recomputation is more expensive than disk read
    e.g., expensive functions or filtering large datasets
  Replication: choose when recomputation is more expensive than memory

Essential Points

Spark keeps track of each RDD's lineage
  Provides fault tolerance
By default, every RDD operation executes the entire lineage
If an RDD will be used multiple times, persist it to avoid re-computation
Persistence options
  Location: memory only, memory and disk, disk only
  Format: in-memory data can be serialized to save memory (but at the cost of performance)
  Replication: saves data on multiple nodes in case a node goes down, for job recovery without recomputation

Homework: Persist an RDD

In this homework assignment you will
  Persist an RDD before reusing it
  Use the Spark Application UI to see how an RDD is persisted
Please refer to the Homework description
