
Real-time Analytics with Cassandra, Spark and Shark

Tuesday, June 18, 13

Who is this guy?

Staff Engineer, Compute and Data Services, Ooyala

Building multiple web-scale real-time systems on top of C*, Kafka, Storm, etc.

Scala/Akka guy

Very excited by open source, big data projects - share some today

@evanfchan

Agenda
Ooyala and Cassandra
What problem are we trying to solve?
Spark and Shark
Our Spark/Cassandra Architecture
Demo


Cassandra at Ooyala
Who is Ooyala, and how we use Cassandra


OOYALA
Powering personalized video
experiences across all screens.

CONFIDENTIAL - DO NOT DISTRIBUTE


COMPANY OVERVIEW
Founded in 2007
Commercially launched in 2009
230+ employees in Silicon Valley, LA, NYC,
London, Paris, Tokyo, Sydney & Guadalajara
Global footprint, 200M unique users,
110+ countries, and more than 6,000 websites
Over 1 billion videos played per month
and 2 billion analytic events per day
25% of U.S. online viewers watch video
powered by Ooyala


TRUSTED VIDEO PARTNER

(Logo slide: customers and strategic partners.)

We are a large Cassandra user

11 clusters ranging in size from 3 to 36 nodes

Total of 28TB of data managed over ~85 nodes

Over 2 billion C* column writes per day

Powers all of our analytics infrastructure

Much much bigger cluster coming soon


What problem are we trying to solve?
Lots of data, complex queries, answered really quickly... but how??


From mountains of useless data...

To nuggets of truth...
Quickly
Painlessly
At scale?

Today: Precomputed aggregates

Video metrics computed along several high cardinality dimensions

Very fast lookups, but inflexible, and hard to change

Most computed aggregates are never read

What if we need more dynamic queries?

Top content for mobile users in France

Engagement curves for users who watched recommendations

Data mining, trends, machine learning

The static - dynamic continuum

100% Precomputation
Super fast lookups
Inflexible, wasteful
Best for 80% most common queries

100% Dynamic
Always compute results from raw data
Flexible but slow

Where we want to be
Partly dynamic
Pre-aggregate most common queries
Flexible, fast dynamic queries
Easily generate many materialized views

Industry Trends

Fast execution frameworks

Impala

In-memory databases

VoltDB, Druid

Streaming and real-time

Higher-level, productive data frameworks

Cascading, Hive, Pig


Why Spark and Shark?


Lightning-fast in-memory cluster computing

Introduction to Spark
In-memory distributed computing framework
Created by UC Berkeley AMP Lab in 2010
Targeted problems that MR is bad at:
Iterative algorithms (machine learning)
Interactive data mining
More general purpose than Hadoop MR
Active contributions from ~ 15 companies
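Spark's win on iterative workloads comes from keeping the working set in memory between passes. A rough local-Python sketch of the idea (illustrative only, no Spark involved): without caching, every pass re-reads the source; with caching, the source is scanned once.

```python
passes_over_source = 0

def read_source():
    """Stand-in for an expensive scan (HDFS read, C* range query)."""
    global passes_over_source
    passes_over_source += 1
    return [1.0, 2.0, 3.0, 4.0]

# MapReduce-style: each iteration goes back to the source
for _ in range(5):
    total = sum(x * 2 for x in read_source())
assert passes_over_source == 5

# Spark-style: cache the dataset once, iterate in memory
passes_over_source = 0
cached = read_source()          # analogous to rdd.cache()
for _ in range(5):
    total = sum(x * 2 for x in cached)
assert passes_over_source == 1
```

Iterative machine learning and interactive data mining both follow this pattern: many passes over the same dataset, so the one-time cost of caching amortizes quickly.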

(Diagram: Hadoop MapReduce chains Map -> Reduce -> HDFS -> Map -> Reduce, persisting intermediate output to HDFS between stages. Spark composes map(), join(), and other transforms directly over one or more data sources, and cache() keeps intermediate datasets in memory.)
Throughput: Memory is king

(Bar chart, scale 0 to 150,000: throughput for C* cold cache, C* warm cache, and Spark RDD. 6-node C*/DSE 1.1.9 cluster, Spark 0.7.0.)

Developers love it
I wrote my first aggregation job in 30 minutes
High level distributed collections API
No Hadoop cruft
Full power of Scala, Java, Python
Interactive REPL shell
EASY testing!!
Low latency - quick development cycles


Spark word count example


(For comparison, the same job in Hadoop MapReduce:)

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

Spark word count example

file = spark.textFile("hdfs://...")

file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
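The same flatMap / map / reduceByKey pipeline can be mimicked with plain Python collections (a local analogue for illustration, not the PySpark API):

```python
from collections import Counter

lines = ["to be or not to be"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split(" ")]

# map + reduceByKey: emit (word, 1) pairs and sum per key
counts = Counter(words)

assert counts["to"] == 2 and counts["be"] == 2 and counts["or"] == 1
```

The whole job is a short chain of collection transforms, which is exactly why the Spark version fits on one slide.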


The Spark Ecosystem

Bagel - Pregel on Spark
HIVE on Spark
Spark Streaming - discretized stream processing
Spark
Tachyon - in-memory caching DFS

Shark - HIVE on Spark


100% HiveQL compatible
10-100x faster than HIVE, answers in seconds
Reuse UDFs, SerDes, StorageHandlers
Can use DSE / CassandraFS for Metastore

Tuesday, June 18, 13

Shark - HIVE on Spark


100% HiveQL compatible
10-100x faster than HIVE, answers in seconds
Reuse UDFs, SerDes, StorageHandlers
Can use DSE / CassandraFS for Metastore
Easy Scala/Java integration via Spark - easier than writing UDFs


Our new analytics architecture


How we integrate Cassandra and Spark/Shark

From raw events to fast queries

(Diagram: raw events flow through an ingestion layer into a C* event store. Spark jobs materialize View 1, View 2, and View 3 from the store; Spark answers predefined queries over the views, and Shark answers ad-hoc HiveQL.)

Our Spark/Shark/Cassandra Stack

(Diagram: on each of Node1, Node2, and Node3 the stack is, top to bottom: Shark, Spark Worker, SerDe, InputFormat, Cassandra. A Spark Master and a Job Server sit alongside the cluster.)

Event Store Cassandra schema

Event CF
row key: 2013-04-05T00:00Z#id1
columns: t0 -> {event0: a0} | t1 -> {event1: a1} | t2 -> {event2: a2} | t3 -> {event3: a3} | t4 -> {event4: a4}

EventAttr CF
row key: 2013-04-05T00:00Z#id1
columns: providerId:500:t0 | ipaddr:10.20.30.40:t1 | videoId:45678:t1

Unpacking raw events

Raw C* rows:
2013-04-05T00:00Z#id1: t0 -> {video: 10, type: 5} | t1 -> {video: 11, type: 1}
2013-04-05T00:00Z#id2: t0 -> {video: 20, type: 5} | t1 -> {video: 25, type: 9}

Unpacked into a flat table:
UserID | Video | Type
id1    | 10    | 5
id1    | 11    | 1
id2    | 20    | 5
id2    | 25    | 9
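The unpacking step is a natural flatMap: each wide row yields one flat record per event column. A local Python sketch using the slide's data (not the actual Spark job):

```python
rows = {
    "2013-04-05T00:00Z#id1": [{"video": 10, "type": 5}, {"video": 11, "type": 1}],
    "2013-04-05T00:00Z#id2": [{"video": 20, "type": 5}, {"video": 25, "type": 9}],
}

def unpack(row_key, events):
    """Yield one (UserID, Video, Type) record per event column."""
    user_id = row_key.split("#")[1]   # row key is <time bucket>#<user id>
    for ev in events:
        yield (user_id, ev["video"], ev["type"])

table = [rec for key, evs in sorted(rows.items()) for rec in unpack(key, evs)]
assert table == [("id1", 10, 5), ("id1", 11, 1), ("id2", 20, 5), ("id2", 25, 9)]
```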

Tips for InputFormat Development

Know which target platforms you are developing for

Which API to write against? New? Old? Both?

Be prepared to spend time tuning your split computation

Low latency jobs require fast splits

Consider sorting row keys by token for data locality

Implement predicate pushdown for HIVE SerDes

Use your indexes to reduce size of dataset


Example: OLAP processing

(Diagram: C* event rows such as 2013-04-05T00:00Z#id1 -> {video: 10, type: 5} and 2013-04-05T00:00Z#id2 -> {video: 20, type: 5} feed parallel Spark jobs that build OLAP Aggregates as cached materialized views. A Union over the views serves Query 1: Plays by Provider and Query 2: Top content for mobile.)
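The query side can be sketched as a union of per-partition aggregates that is re-aggregated per query. A local Python analogue (hypothetical provider/device names; in the real system these are cached Spark RDDs):

```python
from collections import Counter

# Three cached materialized views of OLAP aggregates, e.g. one per Spark job
view1 = Counter({("provider_a", "mobile"): 10, ("provider_b", "web"): 4})
view2 = Counter({("provider_a", "web"): 6})
view3 = Counter({("provider_b", "mobile"): 3})

union = view1 + view2 + view3          # union of the cached views

# Query 1: plays by provider
by_provider = Counter()
for (provider, device), plays in union.items():
    by_provider[provider] += plays
assert by_provider["provider_a"] == 16

# Query 2: top content for mobile users
mobile = {p: n for (p, d), n in union.items() if d == "mobile"}
assert max(mobile, key=mobile.get) == "provider_a"
```

Both queries touch only the small aggregated views, never the raw events, which is why they come back in milliseconds.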

Performance numbers

Spark: C* -> OLAP aggregates, cold cache, 1.4 million events: 130 seconds
Spark: C* -> OLAP aggregates, warmed cache: 20-30 seconds
OLAP aggregate query via Spark (56k records): 60 ms

6-node C*/DSE 1.1.9 cluster, Spark 0.7.0

OLAP WorkFlow

(Diagram: a REST Job Server submits an Aggregation Job to the Spark executors, which read events from Cassandra and cache the aggregate Dataset. Subsequent Query Jobs run against the cached Dataset, and results flow back through the Job Server.)

Fault Tolerance

Cached dataset lives in Java Heap only - what if process dies?

Spark lineage - automatic recomputation from source, but this is expensive!

Can also replicate cached dataset to survive single node failures

Persist materialized views back to C*, then load into cache -- now
recovery path is much faster

Tuesday, June 18, 13

Fault Tolerance

Cached dataset lives in Java Heap only - what if process dies?

Spark lineage - automatic recomputation from source, but this is


expensive!

Can also replicate cached dataset to survive single node failures

Persist materialized views back to C*, then load into cache -- now
recovery path is much faster

Persistence also enables multiple processes to hold cached dataset
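The recovery trade-off can be sketched in a few lines of Python: recomputing from lineage replays the expensive source scan, while reloading a persisted materialized view is one cheap read (cost units are illustrative, loosely echoing the 130-second cold-cache number above):

```python
cost = {"scan_raw_events": 130, "load_saved_view": 1}

saved_views = {}                      # stands in for views persisted back to C*
work_done = []

def compute_view_from_source():
    """Slow path: rebuild the view by scanning raw events (lineage replay)."""
    work_done.append(cost["scan_raw_events"])
    return {"plays_by_provider": {"p1": 42}}

def recover(view_name):
    if view_name in saved_views:      # fast path: reload the persisted view
        work_done.append(cost["load_saved_view"])
        return saved_views[view_name]
    return compute_view_from_source()

view = compute_view_from_source()
saved_views["daily"] = view           # persist the materialized view

work_done.clear()
assert recover("daily") == view and sum(work_done) == 1
assert recover("missing") == view and 130 in work_done
```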


Demo time


Shark Demo

Local shark node, 1 core, MBP

How to create a table from C* using our inputformat

Creating a cached Shark table

Running fast queries


Creating a Shark Table from InputFormat


Creating a cached table


Querying cached table


THANK YOU

@evanfchan

ev@ooyala.com

WE ARE HIRING!!

Spark: Under the hood

(Diagram: a Driver program coordinates Map and Reduce stages over cached Datasets on the workers; one executor process per node.)