
Apache Spark

Crash Course - DataWorks Summit – Berlin 2018

Robert Hryniewicz
Data Evangelist
@RobHryniewicz
Data

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Data Sources
• Internet of Things (IoT)
– Wind Turbines, Oil Rigs
– Beacons, Wearables
– Smart Cars

• User Generated Content (Social, Web & Mobile)
– Twitter, Facebook, Snapchat
– Clickstream
– PayPal, Venmo


Data Growth in Zettabytes (ZB)
[Chart: data volume by year, 2006–2020, on a 0–70 ZB axis; callout: 50+ ZB in 2021]


The “Big Data” Problem

Problem
• A single machine cannot process or even store all the data!

Solution
• Distribute data over large clusters

Difficulty
• How to split work across machines?
• Moving data over the network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?


Apache Spark



What Is Apache Spark?

• Apache open source project originally developed at AMPLab (University of California, Berkeley)
• Unified, general data processing engine that operates across varied data workloads and platforms


Why Apache Spark?

• Elegant Developer APIs
– Single environment for data munging, data wrangling, and Machine Learning (ML)
• In-memory computation model: fast!
– Effective for iterative computations and ML
• Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark MLlib)
– External libraries via open & commercial projects (e.g. H2O's Sparkling Water)


Spark SQL: Structured Data
Spark Streaming: Near Real-time
Spark MLlib: Machine Learning
GraphX: Graph Analysis


Spark SQL



More Flexible /// Better Storage and Performance



Spark SQL Overview

• Spark module for structured data processing (e.g. ORC, Parquet, Avro, MySQL)
• Two ways to manipulate data:
– DataFrame/Dataset API
– SQL query


SparkSession

What is it?
• Main entry point for Spark functionality
• Allows programming with DataFrame and Dataset APIs
• Available as the variable spark, auto-initialized in notebook environments (Zeppelin or Jupyter)


DataFrames
• Distributed collection of data organized into named columns
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R

[Diagram: a DataFrame drawn as a grid of rows and columns Col1, Col2, …, ColN]

Data is described as a DataFrame with rows, columns, and a schema


Sources

[Diagram: Avro, CSV, JSON, and Hive sources feed Spark SQL, which exposes them as a DataFrame of rows and columns Col1, Col2, …, ColN]


Create a DataFrame

Example
// Path to the input file (by default Spark expects one JSON record per line)
val path = "examples/flights.json"
// Read the file into a DataFrame; the schema is inferred from the data
val flights = spark.read.json(path)


Register a Temporary View (SQL API)

Example
flights.createOrReplaceTempView("flightsView")



Two API Examples: DataFrame and SQL APIs

DataFrame API
flights.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)

SQL API
SELECT Origin, Dest, DepDelay
FROM flightsView
WHERE DepDelay > 15 LIMIT 5

Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+


Spark Streaming



What is Stream Processing?

Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics

Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action

Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics


Modern Data Applications approach to Insights

Traditional Analytics            Next Generation Analytics
Structured & Repeatable          Iterative & Exploratory
Structure built to store data    Data is the structure
Start with hypothesis            Data leads the way
Test against selected data       Explore all data, identify correlations
Analyze after landing…           Analyze in motion…

Spark Streaming

Overview
• Extension of Spark Core API
• Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant

[Diagram note: ZeroMQ and MQTT are no longer supported in Spark 2.x]


Spark Streaming

Discretized Streams (DStreams)
• High-level abstraction representing a continuous stream of data
• Internally represented as a sequence of RDDs
• An operation applied on a DStream translates to operations on the underlying RDDs


Spark Streaming

Example: flatMap operation
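
The slide's figure did not survive extraction. As a rough stand-in, the sketch below models a DStream as a plain Python list of micro-batches, each batch playing the role of one RDD, and applies a flatMap-style function to every batch independently. This is an illustration in plain Python, not the Spark API; the names flat_map_stream, lines_stream and words_stream are made up for the example.

```python
from typing import Callable, Iterable, List

# A DStream is modeled as a list of micro-batches; each batch stands in
# for one RDD. Illustration only, not the Spark Streaming API.
Batch = List[str]


def flat_map_stream(batches: List[Batch],
                    fn: Callable[[str], Iterable[str]]) -> List[Batch]:
    """Apply a flatMap-style function to every micro-batch independently."""
    return [[out for record in batch for out in fn(record)]
            for batch in batches]


# Hypothetical input: two micro-batches of text lines.
lines_stream = [["to be or", "not to be"], ["that is the question"]]

# Each line yields several words, and the per-line results are flattened
# into a single list per batch.
words_stream = flat_map_stream(lines_stream, lambda line: line.split(" "))

for i, batch in enumerate(words_stream):
    print(f"batch {i}: {batch}")
```

Batch boundaries are preserved: the operation never mixes records from different micro-batches, which mirrors how a DStream operation maps onto its underlying RDDs.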



Spark Streaming

Window Operations
• Apply transformations over a sliding window of data, e.g. rolling average
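
A rolling average over a sliding window can be sketched in plain Python as follows. The window is described by two values, both counted in micro-batches: how many batches it spans and how far it advances each step. The function and parameter names (windowed_average, window_length, slide_interval) and the sample readings are assumptions for illustration; this is not Spark code.

```python
from typing import List


def windowed_average(batches: List[List[float]],
                     window_length: int,
                     slide_interval: int) -> List[float]:
    """Average all values in each window of `window_length` batches,
    advancing the window by `slide_interval` batches each time."""
    averages = []
    # The first window closes once `window_length` batches have arrived.
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        values = [v for batch in window for v in batch]
        averages.append(sum(values) / len(values) if values else float("nan"))
    return averages


# Hypothetical sensor readings, one inner list per micro-batch.
readings = [[10.0, 20.0], [30.0], [40.0, 60.0], [50.0], [70.0]]

# Window of 3 batches that slides forward 2 batches at a time.
rolling = windowed_average(readings, window_length=3, slide_interval=2)
print(rolling)  # [32.0, 55.0]
```

Because the slide interval is smaller than the window length, consecutive windows overlap: the third batch contributes to both averages.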



Challenges in Streaming Data

• Consistency
• Fault tolerance
• Out-of-order data


Structured Streaming

• High-Level APIs: DataFrames, Datasets and SQL. Same in streaming and in batch
• Event-time Processing: native support for working with out-of-order and late data
• End-to-end Exactly Once: transactional both in processing and output


Structured Streaming: Basics



Structured Streaming: Model



Handling late arriving data
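
The figure for this slide is missing, so here is a simplified plain-Python sketch of the general idea behind watermarking: count events per event-time window, accept late events while their window is still within an allowed delay, and drop them once the watermark has passed the end of their window. The per-event watermark update, the names (count_with_watermark, window_size, max_delay) and the sample events are simplifying assumptions; Structured Streaming's actual bookkeeping differs in detail.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # (event time in seconds, payload)


def count_with_watermark(events: Iterable[Event],
                         window_size: int,
                         max_delay: int) -> Tuple[Dict[int, int], List[Event]]:
    """Count events per tumbling event-time window, dropping events whose
    window ended at or before the watermark (max event time - max_delay)."""
    counts: Dict[int, int] = defaultdict(int)
    dropped: List[Event] = []
    max_event_time = 0
    for event_time, payload in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - max_delay
        window_start = (event_time // window_size) * window_size
        if window_start + window_size <= watermark:
            dropped.append((event_time, payload))  # too late, state is gone
        else:
            counts[window_start] += 1              # on time or tolerably late
    return dict(counts), dropped


# Events in arrival order; (7, "d") and (8, "f") arrive out of order.
events = [(1, "a"), (4, "b"), (12, "c"), (7, "d"), (25, "e"), (8, "f")]
counts, dropped = count_with_watermark(events, window_size=10, max_delay=10)
print(counts)   # {0: 3, 10: 1, 20: 1}
print(dropped)  # [(8, 'f')]
```

Event (7, "d") is late but still counted because the watermark is only at 2 when it arrives; event (8, "f") arrives after the watermark has advanced to 15, past the end of window [0, 10), so it is discarded.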



Spark MLlib



Spark ML Pipeline

• fit() is for training
• transform() is for prediction

Train:   Input DataFrame (TRAIN) → Pipeline → fit() → Pipeline Model
Predict: Input DataFrame (TEST) → Pipeline Model → transform() → Output DataFrame (PREDICTIONS)
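
The fit()/transform() split can be illustrated without Spark. In the plain-Python sketch below, an estimator learns parameters from training rows in fit() and returns a separate model object, whose transform() adds predictions to new rows. The classes LineEstimator and LineModel and the sample data are hypothetical; they mimic the pattern only, not the Spark ML classes.

```python
from typing import List, Tuple


class LineModel:
    """Fitted model: holds learned parameters and makes predictions."""

    def __init__(self, slope: float, intercept: float):
        self.slope = slope
        self.intercept = intercept

    def transform(self, xs: List[float]) -> List[Tuple[float, float]]:
        # Return each input paired with its prediction.
        return [(x, self.slope * x + self.intercept) for x in xs]


class LineEstimator:
    """Estimator: fit() learns a least-squares line from (x, y) rows."""

    def fit(self, rows: List[Tuple[float, float]]) -> LineModel:
        n = len(rows)
        mean_x = sum(x for x, _ in rows) / n
        mean_y = sum(y for _, y in rows) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
        var = sum((x - mean_x) ** 2 for x, _ in rows)
        slope = cov / var
        return LineModel(slope, mean_y - slope * mean_x)


train = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]   # lies on y = 2x + 1
model = LineEstimator().fit(train)             # training
predictions = model.transform([4.0, 5.0])      # prediction on unseen inputs
print(predictions)  # [(4.0, 9.0), (5.0, 11.0)]
```

Keeping the estimator and the fitted model as separate objects is what lets a trained model be reused on test data or exported, without rerunning training.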



Spark ML Pipeline

Pipeline stages: Feature transform 1 → Feature transform 2 → Combine features → Linear Regression

Train:   Input DataFrame → Pipeline → Pipeline Model (Export Model)
Predict: Input DataFrame → Pipeline Model → Output DataFrame


Sample Spark ML Pipeline

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# Feature stages (definitions elided on the slide)
indexer = …
parser = …
hashingTF = …
vecAssembler = …

rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])

model = pipe.fit(trainData)          # Train model
results = model.transform(testData)  # Test model


Exporting ML Models - PMML
• Predictive Model Markup Language (PMML)
– XML-based predictive model interchange format
• Supported models
– K-Means
– Linear Regression
– Ridge Regression
– Lasso
– SVM
– Binary


Spark GraphX



• PageRank
• Topic Modeling (LDA)
• Community Detection

Source: ampcamp.berkeley.edu


GraphX Algorithms

• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count
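
To give a feel for the first algorithm in the list, here is a small plain-Python sketch of PageRank in its normalized textbook form (ranks sum to 1, damping factor 0.85). It is not GraphX's implementation, which runs distributed and uses a different normalization; the function name, parameters and the four-node edge list are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def pagerank(edges: List[Tuple[str, str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Iteratively compute normalized PageRank for a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        # Each node passes its rank evenly along its outgoing edges.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical graph: "c" is linked to by every other node.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

Node "c" ends up with the highest rank because it collects links from all other nodes, while "d", which nothing links to, keeps only the baseline share.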



Sample GraphX Code in Scala

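// Build a property graph from vertex and edge RDDs
val graph = Graph(vertices, edges)

// Load messages; joinVertices expects an RDD of (VertexId, value) pairs,
// so the raw lines must be parsed into pairs first (parsing elided here)
val messages = spark.sparkContext.textFile("hdfs://...")
val graph2 = graph.joinVertices(messages) {
  (id, vertex, msg) => ...
}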


Apache Zeppelin



What’s Apache Zeppelin?

Web-based notebook that enables interactive data analytics.

You can make beautiful data-driven, interactive and collaborative documents with SQL, Python, Scala and more.


Apache Zeppelin with HDP 2.6+
Web-based Notebook for interactive analytics

Features
• Ad-hoc experimentation
• Deeply integrated with Spark + Hadoop
• Supports multiple language backends
• Incubating at Apache

Use Case
• Data exploration and discovery
• Visualization
• Interactive snippet-at-a-time experience
• “Modern Data Science Studio”


How does Zeppelin work?

[Diagram: a notebook author and collaborators/report viewers work through Zeppelin, which runs against the cluster: Spark | Hive | HBase, or any of 30+ back ends]


Big Data Lifecycle

[Diagram: lifecycle stages Collect, ETL / Process, Analysis, Report, and Data Product; roles shown: Data Engineer, Data Scientist, Business user, Customer]


Zeppelin Multitenancy



Livy

• Livy is the open source REST interface for interacting with Apache Spark from anywhere
• Installed as Spark Ambari Service

[Diagram: Livy Client → HTTP → Livy Server → HTTP (RPC) → Spark Interactive Session (SparkContext) and Spark Batch Session (SparkContext)]
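
As a rough sketch of what a Livy client sends, the snippet below builds two of Livy's REST calls with the Python standard library: creating an interactive session (POST /sessions) and submitting a statement to it (POST /sessions/{id}/statements). The host name is a placeholder, the session id 0 is assumed, and the requests are only constructed, not sent. A Kerberized cluster would also need SPNEGO authentication, which is not shown.

```python
import json
import urllib.request

# Hypothetical Livy endpoint; 8998 is Livy's default port.
LIVY_URL = "http://livy-host.example.com:8998"


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to the Livy server."""
    return urllib.request.Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# 1. Create an interactive session that runs PySpark code.
create_session = build_post("/sessions", {"kind": "pyspark"})

# 2. Submit a statement to a session. The real id comes from the reply
#    to step 1; 0 is assumed here for illustration.
run_statement = build_post("/sessions/0/statements", {"code": "1 + 1"})

# Sending is left to the caller, e.g. urllib.request.urlopen(create_session).
print(create_session.get_method(), create_session.full_url)
print(run_statement.get_method(), run_statement.full_url)
```

Because the interface is plain HTTP with JSON bodies, any client that can issue such requests (a notebook, a script, another service) can drive Spark without a local Spark installation.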



Security Across Zeppelin-Livy-Spark

[Diagram: Zeppelin (Shiro, LDAP) with the Ispark Group Interpreter calls the Livy APIs on the Livy Server over SPNego: Kerberos; the Livy Server reaches the Driver in Spark on YARN over Kerberos]


Reasons to Integrate with Livy

• Bring Sessions to Apache Zeppelin
– Isolation
– Session sharing

• Enable efficient cluster resource utilization
– Default Spark interpreter keeps YARN/Spark job running forever
– Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)

• Identity Propagation
– Send user identity from Zeppelin > Livy > Spark on YARN


SparkSession Sharing

[Diagram: Clients 1, 2, and 3 connect through the Livy Server to Session-1 (SparkSession-1, SparkContext) and Session-2 (SparkSession-2, SparkContext)]


Apache Zeppelin + Livy End-to-End Security

[Diagram: user Tommy Callahan signs in to Zeppelin (LDAP); the Ispark Group Interpreter calls the Livy APIs on the Livy Server over SPNego: Kerberos; the Livy Server reaches Spark on YARN over Kerberos/RPC; the job runs as Tommy Callahan]


HDP Basics



• Zeppelin → Interactive notebook
• Spark
– APIs: Scala, Java, Python, R
– Libraries: MLlib, Spark SQL, Spark Streaming, GraphX
– Spark Core Engine
• YARN → Resource Management
• HDFS → Distributed Storage Layer (4M files)
– Future: Ozone object store

[Diagram: cluster nodes 1 … N, each running YARN on top of HDFS]


Hortonworks Data Platform



Sample Architecture



Managed Dataflow

[Diagram: SOURCES, REGIONAL INFRASTRUCTURE, CORE INFRASTRUCTURE]


High-Level Overview

[Diagram: components shown are IoT Devices, IoT Edge (single node), NiFi Hub, Data Broker, Live Dashboard, Data Store (HDFS/Ozone), and Column DB (HBase/Cassandra), with the hub, broker and stores in the Data Center (on prem/cloud)]

Spark 2.x & HDP 2.x
What’s New?



What’s New

• Future HDP / Spark 2.3
– Spark Structured Streaming latency in single-digit milliseconds in continuous mode (instead of the 100 ms we'd normally see with micro-batching)
– Stream-to-stream joins
– PySpark performance boost with pandas UDFs
– Native support for running Apache Spark applications on Kubernetes clusters
• HDP 2.6.4 / Spark 2.2
– Structured Streaming GA
– Yahoo! Benchmark: 65M rec/s
– ORC feature & performance improvements → Parquet parity
• HDP 2.6.3 / Spark 2.1
– Spark SQL Ranger integration for row and column security
– Dataset API GA
– GraphX GA

Spark 2.3



DSX + HDP



Data Science Experience (DSX) Local
Enterprise Data Science platform for teams

[Diagram: DSX connects through the Livy REST interface to the Hortonworks Data Platform (HDP)]

HDP: Enterprise compute (Spark/Hive) & storage (HDFS/Ozone)


Lab



Hortonworks Community Connection
community.hortonworks.com

• Full Q&A Platform (like StackOverflow)

• Knowledge Base Articles

• Code Samples and Repositories



Community Engagement
community.hortonworks.com

• 20k+ Registered Users
• 45k+ Answers
• 100k+ Technical Assets


Future of Data Meetups



Thanks!
Robert Hryniewicz
@RobHryniewicz
