MapReduce
Heinz Analytics Club Seminar
Abhinav Maurya
Carnegie Mellon University
www.andrew.cmu.edu/user/amaurya/docs/spark_talk
Outline
Spark Installation
Introduction to Spark
Spark Basics
Hello WordCount!
Recommending Movies
Performing SQL Queries
Advanced Tips
Spark Installation
https://spark.apache.org/downloads.html
Spark Installation
Extract compressed folder spark-2.0.1-bin-hadoop2.7
From terminal, go to spark-2.0.1-bin-hadoop2.7/bin
Run pyspark
Run rdd = sc.parallelize([1, 2, 3])
Run rdd.map(lambda x: x*x).collect()
Get result [1, 4, 9]
It's that easy!
Spark Installation
Steps may not work for Windows users
Two choices:
Build from source
Download a bigger zip (1.01 GB) from http://training.databricks.com/workshop/usb.zip
Outline
Spark Installation
Introduction to Spark
Spark Basics
Hello WordCount!
Recommending Movies
Performing SQL Queries
Advanced Tips
[Diagram: each MapReduce job reads its input from HDFS and writes its output back to HDFS; Spark instead keeps the chain of intermediate RDDs in memory.]
[Plot: execution time (s) vs. % of working set in cache — roughly 69 s with the cache disabled, falling through 58 s, 41 s, and 30 s at 25%, 50%, and 75% cached, down to 12 s when fully cached.]
Spark SQL
MLlib
GraphX
SparkR
Outline
Spark Installation
Introduction to Spark
Spark Basics
Hello WordCount!
Recommending Movies
Performing SQL Queries
Advanced Tips
[Diagram: transformations build new RDDs from existing ones (e.g., R2, R3, R4, partitioned across workers); an action returns a value to the driver.]
Operators on RDDs form a directed acyclic graph (DAG)
If any partition on a dead worker is lost, it can be recomputed by retracing the operator DAG
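The lineage idea above can be sketched in plain Python (an illustration only; Spark tracks lineage for real RDD partitions internally, and the helper name here is made up):

```python
# Plain-Python sketch of lineage-based recovery: a "partition" is fully
# described by its source data plus the chain of operators applied to it,
# so a lost copy can always be recomputed from scratch.

def make_lineage(source, ops):
    """Record the source and operator chain; return a recompute function."""
    def recompute():
        data = list(source)
        for op in ops:            # replay the operator DAG (here, a simple chain)
            data = [op(x) for x in data]
        return data
    return recompute

# R1 -> map(x*x) -> map(x+1): the recorded ops are the lineage.
recompute_partition = make_lineage([1, 2, 3], [lambda x: x * x, lambda x: x + 1])

# If the in-memory copy on a worker is lost, just rerun the chain.
print(recompute_partition())  # [2, 5, 10]
```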
Spark Context
Your link to the YARN/Mesos cluster
Helps you bring RDDs to life
>>>rdd = sc.textFile(filepath)
>>>rdd = sc.parallelize(my_list)
Pair RDDs
(Alice, 23)
(Alice, 27)
(Bob, 17)
(Alice, 50)
(Bob, 17)
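A plain-Python analogue of reduceByKey on the pairs above (a sketch; in PySpark you would call sc.parallelize(pairs).reduceByKey(op)):

```python
# Aggregate (key, value) pairs with a binary op, like Spark's reduceByKey.
pairs = [("Alice", 23), ("Alice", 27), ("Bob", 17), ("Alice", 50), ("Bob", 17)]

def reduce_by_key(pairs, op):
    out = {}
    for k, v in pairs:
        out[k] = op(out[k], v) if k in out else v
    return sorted(out.items())

print(reduce_by_key(pairs, lambda x, y: x + y))  # [('Alice', 100), ('Bob', 34)]
```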
RDD Actions
rdd.collect(): Return RDD contents as a list
rdd = sc.parallelize([0, 1, 2], 3)
rdd2 = rdd.map(lambda x: x*x)
rdd2.collect(): [0, 1, 4]
rdd.glom().collect(): Return RDD partition contents as a list of lists
rdd = sc.parallelize([0, 1, 2], 3)
rdd2 = rdd.map(lambda x: x*x)
rdd2.glom().collect(): [[0], [1], [4]]
Outline
Spark Installation
Introduction to Spark
Spark Basics
Hello WordCount!
Recommending Movies
Performing SQL Queries
Advanced Tips
Hello WordCount!
Hello WordCount!
>>>import fileinput
>>>lines = []
>>>filepath = 'pnas_titles.txt'
>>>for line in fileinput.input(filepath):
...     lines.append(line)
<WordCount Code in Spark>
Hello WordCount!
>>>rdd_lines = sc.parallelize(lines, num_workers)
>>>rdd_result = rdd_lines.flatMap(lambda x: x.split()) \
...     .map(lambda x: (x, 1)) \
...     .reduceByKey(lambda x, y: x + y)
>>>rdd_result = rdd_result.sortBy(lambda x: x[1], False).keys()
>>>rdd_result.collect()[0:5]
['of', 'the', 'in', 'and', 'a']
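The same count can be checked without Spark using collections.Counter — handy for sanity-checking the Spark result on a small input (toy lines here, not the pnas_titles.txt data):

```python
# Word count in plain Python: flatMap ~ nested comprehension,
# map + reduceByKey ~ Counter, sortBy descending ~ most_common.
from collections import Counter

lines = ["the cat sat on the mat", "the dog sat"]
counts = Counter(word for line in lines for word in line.split())
top = [w for w, _ in counts.most_common(3)]
print(top)  # 'the' and 'sat' lead the list
```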
Outline
Spark Installation
Introduction to Spark
Spark Basics
Hello WordCount!
Recommending Movies
Performing SQL Queries
Advanced Tips
Movie Recommendation
Download movielens_ratings.txt from
http://www.andrew.cmu.edu/user/amaurya/docs/spark_talk/
Data format: user::movie::rating::time
We need only the first three fields
Around 6000 users, 6000 movies, and a million ratings
(users × factors) × (factors × movies) = (users × movies)
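The low-rank idea behind ALS can be sketched with a tiny matrix product (numbers are made up for illustration; real factors are learned from the ratings):

```python
# A (users x factors) matrix times a (factors x movies) matrix
# approximates the full (users x movies) rating matrix.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

U = [[1.0, 0.5],              # 3 users x 2 latent factors
     [0.2, 1.0],
     [0.9, 0.1]]
M = [[4.0, 1.0, 3.0, 2.0],    # 2 factors x 4 movies
     [1.0, 5.0, 0.0, 2.0]]

predicted = matmul(U, M)      # 3 users x 4 movies of predicted ratings
print(predicted[0])  # user 0's predicted ratings: [4.5, 3.5, 3.0, 3.0]
```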
Movie Recommendation
### Import the required packages
>>>from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
>>>from pyspark import SparkContext, SparkConf
### Load and parse the data
>>>conf = SparkConf().setAppName('movielensrecommendation').setMaster('local')
>>>sc = SparkContext(conf=conf)
>>>data = sc.textFile('movielens_ratings.txt')
Movie Recommendation
>>>ratings = data.map(lambda l: l.split('::')) \
...     .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
### Build the recommendation model using Alternating Least Squares
>>>rank = 10
>>>numIterations = 20
>>>model = ALS.train(ratings, rank, numIterations)
Movie Recommendation
### Evaluate the model on training data
>>>testdata = ratings.map(lambda p: (p[0], p[1]))
>>>predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
>>>ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
>>>MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2) \
...     .reduce(lambda x, y: x + y) / ratesAndPreds.count()
>>>print("Mean Squared Error = " + str(MSE))
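The MSE formula above is just the mean of squared (actual − predicted) differences over the joined pairs; a plain-Python check with toy numbers:

```python
# Each element mirrors the joined RDD: ((user, movie), (actual, predicted)).
rates_and_preds = [((1, 10), (4.0, 3.5)),
                   ((1, 11), (2.0, 2.5)),
                   ((2, 10), (5.0, 4.0))]

# Same computation as the map/reduce above, without Spark.
mse = sum((actual - pred) ** 2
          for _, (actual, pred) in rates_and_preds) / len(rates_and_preds)
print(mse)  # (0.25 + 0.25 + 1.0) / 3 = 0.5
```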
Outline
Spark Installation
Introduction to Spark
Spark Basics
Hello WordCount!
Recommending Movies
Performing SQL Queries
Advanced Tips
Advanced Tips
Run program in Spark local mode and debug
Spark UI is at http://localhost:4040
Identify slow-running tasks
See RDD sizes
Be robust to stragglers
Set spark.speculation = true in the Spark configuration file
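In conf/spark-defaults.conf the speculation settings look like this (a sketch; the multiplier and quantile lines show Spark's defaults, which you may tune):

```
spark.speculation            true
spark.speculation.multiplier 1.5
spark.speculation.quantile   0.75
```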
Thanks!
Questions?