RDD Persistence
Chapter 15
201509
Course Chapters
Distributed Data Processing with Spark
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames

Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.

Spark RDD Persistence

Chapter Topics
RDD Lineage
RDD Persistence Overview
Distributed Persistence
Conclusion
Homework: Persist an RDD

Lineage Example (1)
Each transformation operation creates a new child RDD.
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

Lineage Example (2)
Each transformation operation creates a new child RDD.
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
MappedRDD[1] (mydata)

Lineage Example (3)
Each transformation operation creates a new child RDD.
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
MappedRDD[1] (mydata)
MappedRDD[2]
FilteredRDD[3] (myrdd)

Lineage Example (4)
Spark keeps track of the parent RDD for each new RDD.
Child RDDs depend on their parents.
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
MappedRDD[1] (mydata)
MappedRDD[2]
FilteredRDD[3] (myrdd)

Lineage Example (5)
Action operations execute the parent transformations.
> mydata = sc.textFile("purplecow.txt")
> myrdd = mydata.map(lambda s: s.upper())\
    .filter(lambda s: s.startswith('I'))
> myrdd.count()
3
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
MappedRDD[1] (mydata)
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
MappedRDD[2]
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    BUT I CAN TELL YOU, ANYHOW,
    I'D RATHER SEE THAN BE ONE.
FilteredRDD[3] (myrdd)
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    I'D RATHER SEE THAN BE ONE.

Lineage Example (6)
Each action re-executes the lineage transformations starting with the base.
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
FilteredRDD[3] (myrdd)

Lineage Example (7)
Each action re-executes the lineage transformations starting with the base.
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
FilteredRDD[3] (myrdd)
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    I'D RATHER SEE THAN BE ONE.

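The re-execution described in the lineage slides can be illustrated with a toy model. The sketch below is plain Python, not Spark: `ToyRDD`, `to_upper`, and the `calls` counter are hypothetical names invented for this illustration, and the poem is held in a list rather than read from purplecow.txt. Each toy RDD stores only its parent and a function, so every action walks the chain again from the base data.

```python
# Toy model of lineage (an assumption-laden sketch, not Spark's implementation).
POEM = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

calls = {"upper": 0}  # counts how many times the map function runs


def to_upper(line):
    calls["upper"] += 1
    return line.upper()


class ToyRDD:
    """Holds a parent and a function, never the computed rows."""

    def __init__(self, source=None, parent=None, compute=None):
        self.source = source
        self.parent = parent
        self.compute = compute

    def map(self, func):
        # A transformation only records a new child; nothing runs yet.
        return ToyRDD(parent=self, compute=lambda rows: [func(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, compute=lambda rows: [r for r in rows if pred(r)])

    def evaluate(self):
        # Walk the lineage back to the base data and replay every step.
        if self.parent is None:
            return list(self.source)
        return self.compute(self.parent.evaluate())

    def count(self):
        # An action: triggers evaluation of the whole chain.
        return len(self.evaluate())


mydata = ToyRDD(source=POEM)
myrdd = mydata.map(to_upper).filter(lambda s: s.startswith("I"))

first = myrdd.count()
second = myrdd.count()
print(first, second, calls["upper"])
```

In this toy, the map function runs once per line for each action, so two calls to `count()` apply it eight times in total.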
RDD Persistence
Persisting an RDD saves the data (by default in memory).
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

RDD Persistence
Persisting an RDD saves the data (by default in memory).
> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
RDD[1] (mydata)
RDD[2] (myrdd1)

RDD Persistence
Persisting an RDD saves the data (by default in memory).
> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
> myrdd1.persist()
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
RDD[1] (mydata)
RDD[2] (myrdd1)

RDD Persistence
Persisting an RDD saves the data (by default in memory).
> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda s: s.startswith('I'))
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
RDD[1] (mydata)
RDD[2] (myrdd1)
RDD[3] (myrdd2)

RDD Persistence
Persisting an RDD saves the data (by default in memory).
> mydata = sc.textFile("purplecow.txt")
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
RDD[1] (mydata)
    I've never seen a purple cow.
RDD[3] (myrdd2)
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    I'D RATHER SEE THAN BE ONE.

RDD Persistence
Subsequent operations use saved data.
> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda s: s.startswith('I'))
> myrdd2.count()
3
> myrdd2.count()
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
RDD[1] (mydata)
RDD[2] (myrdd1)
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    BUT I CAN TELL YOU, ANYHOW,
    I'D RATHER SEE THAN BE ONE.
RDD[3] (myrdd2)

RDD Persistence
Subsequent operations use saved data.
> mydata = sc.textFile("purplecow.txt")
> myrdd1 = mydata.map(lambda s: s.upper())
> myrdd1.persist()
> myrdd2 = myrdd1.filter(lambda s: s.startswith('I'))
> myrdd2.count()
3
> myrdd2.count()
3
File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.
RDD[1] (mydata)
RDD[2] (myrdd1)
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    BUT I CAN TELL YOU, ANYHOW,
    I'D RATHER SEE THAN BE ONE.
RDD[3] (myrdd2)
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    I'D RATHER SEE THAN BE ONE.

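The effect of persisting the intermediate RDD can be sketched with a toy model in plain Python. This is not Spark code: `ToyRDD`, `to_upper`, and the `calls` counter are hypothetical names made up for the illustration, and the poem is an in-memory list rather than purplecow.txt. The toy `persist()` only marks the RDD; the rows are saved the first time an action computes them, and later actions reuse the saved copy instead of replaying the parent transformations.

```python
# Toy model of persistence (a sketch under stated assumptions, not Spark itself).
POEM = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

calls = {"upper": 0}  # counts how many times the map function runs


def to_upper(line):
    calls["upper"] += 1
    return line.upper()


class ToyRDD:
    """Toy stand-in for an RDD with an optional saved copy of its rows."""

    def __init__(self, source=None, parent=None, compute=None):
        self.source = source
        self.parent = parent
        self.compute = compute
        self.persist_requested = False
        self.saved = None

    def map(self, func):
        return ToyRDD(parent=self, compute=lambda rows: [func(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, compute=lambda rows: [r for r in rows if pred(r)])

    def persist(self):
        # Only marks the RDD; nothing is computed or saved until an action runs.
        self.persist_requested = True
        return self

    def evaluate(self):
        if self.saved is not None:
            return self.saved  # reuse saved rows, skipping the parents
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.compute(self.parent.evaluate())
        if self.persist_requested:
            self.saved = rows
        return rows

    def count(self):
        return len(self.evaluate())


mydata = ToyRDD(source=POEM)
myrdd1 = mydata.map(to_upper)
myrdd1.persist()
myrdd2 = myrdd1.filter(lambda s: s.startswith("I"))

first = myrdd2.count()   # computes myrdd1 and saves its rows
second = myrdd2.count()  # starts from the saved rows of myrdd1
print(first, second, calls["upper"])
```

Here the map function runs only four times across both actions, because the second `count()` begins at the saved rows of `myrdd1`.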
Memory Persistence

Persistence and Fault-Tolerance

Distributed Persistence
[Diagram: Driver; Executor (task, rdd_1_0); Executor (task, rdd_1_1); Executor]

RDD Fault-Tolerance (1)
[Diagram: RDD; Driver; Executor (task, rdd_1_0); Executor]

RDD Fault-Tolerance (2)
The driver starts a new task to recompute the partition on a different node.
Lineage is preserved, data is never lost.
[Diagram: RDD; Driver; Executor (task, rdd_1_0); Executor (task, rdd_1_1)]

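The recovery step described above can be sketched as a toy in plain Python. Nothing here is Spark API: the executor names, `compute_partition`, and `get_partition` are hypothetical, and the two-partition split of the poem is an assumption made for the illustration. The point is only that a lost saved partition can be rebuilt from the base data because the lineage (here, "uppercase each line") is still known.

```python
# Toy model of recomputing a lost partition from lineage (not Spark itself).

# Base data, split into two partitions (an assumed split for illustration).
SOURCE_PARTITIONS = {
    0: ["I've never seen a purple cow.", "I never hope to see one;"],
    1: ["But I can tell you, anyhow,", "I'd rather see than be one."],
}


def compute_partition(index):
    """Rebuild one partition from the base data by replaying the lineage."""
    return [line.upper() for line in SOURCE_PARTITIONS[index]]


# Hypothetical executors; two of them hold a saved partition in memory.
executors = {
    "executor-1": {0: compute_partition(0)},
    "executor-2": {1: compute_partition(1)},
    "executor-3": {},
}


def get_partition(index):
    """Return (rows, how): use a saved copy if one exists, else recompute."""
    for name, blocks in executors.items():
        if index in blocks:
            return blocks[index], "saved on " + name
    # No saved copy survives: recompute on an executor that is still alive.
    name = next(iter(executors))
    executors[name][index] = compute_partition(index)
    return executors[name][index], "recomputed on " + name


before, how_before = get_partition(1)
del executors["executor-2"]  # simulate losing the node that held partition 1
after, how_after = get_partition(1)

print(how_before)
print(how_after)
print(after == before)
```

The rows returned after the simulated failure are identical to the ones saved before it; only the place where they live has changed.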
Persistence Levels

Persistence Levels: Storage Location

Persistence Levels: Memory Format

Persistence Levels: Partition Replication

Changing Persistence Options

Disk Persistence
[Diagram labels: RDD; Client; Driver; Executor (task, rdd_0); Executor (rdd_1, task, rdd_1); Executor]

Disk Persistence with Replication (1)
[Diagram labels: RDD; Client; Driver; Executor (task, rdd_0); Executor (rdd_1, task, rdd_1); Executor (rdd_1)]

Disk Persistence with Replication (2)
[Diagram labels: Executor (task, rdd_1, rdd_1)]

When and Where to Persist

Essential Points

Homework: Persist an RDD