In this exercise you will practice using RDDs in the Spark Shell.

Use Spark to explore the web server logs you captured in Flume earlier.
1. Using the HDFS command line or Hue File Browser, review one of the files in the HDFS /loudacre/weblogs directory, e.g. FlumeData.1423586038966. Note the format of the lines, e.g.:

116.180.70.237 - 128 [15/Sep/2013:23:59:53 +0100] "GET /KBDOC-00031.html HTTP/1.0" 200 1388 "http://www.loudacre.com" "Loudacre CSR Browser"

In this line, the IP address is 116.180.70.237, the User ID is 128, and the Request is "GET /KBDOC-00031.html HTTP/1.0".
2. Set a variable for the data file so you do not have to retype it each time.
pyspark> logfile="/loudacre/weblogs/FlumeData.*"
3. Create an RDD from the data file.

pyspark> logs = sc.textFile(logfile)
4. Create an RDD containing only those lines that are requests for JPG files.
5. View the first 10 lines of the data using take.
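The filter logic for these steps can be sketched in pure Python, using the sample line from step 1 plus a hypothetical JPG-request line (the names `sample_lines`, `is_jpg_request`, and `jpg_lines` are illustrative, not part of the exercise); in the pyspark shell you would apply the same test with `logs.filter(...)` and view results with `take(10)`:

```python
# Pure-Python sketch of the step 4 filter. The second sample line is
# hypothetical; in pyspark: jpglogs = logs.filter(lambda line: ".jpg" in line)
sample_lines = [
    '116.180.70.237 - 128 [15/Sep/2013:23:59:53 +0100] '
    '"GET /KBDOC-00031.html HTTP/1.0" 200 1388 '
    '"http://www.loudacre.com" "Loudacre CSR Browser"',
    '175.223.172.207 - 4115 [15/Sep/2013:23:59:54 +0100] '
    '"GET /theme.jpg HTTP/1.0" 200 7328 '
    '"http://www.loudacre.com" "Loudacre CSR Browser"',
]

def is_jpg_request(line):
    # keep only lines whose request is for a JPG file
    return ".jpg" in line

jpg_lines = [line for line in sample_lines if is_jpg_request(line)]
print(len(jpg_lines))  # 1 -- only the second sample line requests a JPG
```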
6. Sometimes you do not need to store intermediate objects in a variable, in which case you can combine the steps into a single line of code. Combine the previous commands into a single command to count the number of JPG requests.
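The chaining idea can be sketched as follows (pure Python over hypothetical sample lines; in pyspark the equivalent one-liner would be `sc.textFile(logfile).filter(lambda line: ".jpg" in line).count()`):

```python
# Sketch of chaining filter and count without an intermediate variable.
# The log lines below are hypothetical stand-ins for the real data.
sample_lines = [
    '10.0.0.1 - 1 [15/Sep/2013:23:59:53 +0100] "GET /a.jpg HTTP/1.0" 200 10',
    '10.0.0.2 - 2 [15/Sep/2013:23:59:54 +0100] "GET /b.html HTTP/1.0" 200 20',
    '10.0.0.3 - 3 [15/Sep/2013:23:59:55 +0100] "GET /c.jpg HTTP/1.0" 200 30',
]
jpg_count = len([line for line in sample_lines if ".jpg" in line])
print(jpg_count)  # 2
```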
7. Now try using the map function to define a new RDD. Start with a simple map that returns the length of each line in the log file. This will produce an array of five integers corresponding to the first five lines in the file.
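The transformation itself is just line → len(line); a pure-Python sketch over hypothetical stand-in strings (in pyspark: `logs.map(lambda line: len(line)).take(5)`):

```python
# Sketch of the step 7 map: each line maps to its character count.
sample_lines = ["alpha", "bravo line", "x"]  # hypothetical stand-in data
lengths = [len(line) for line in sample_lines]
print(lengths)  # [5, 10, 1]
```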
8. That is not very useful. Instead, try mapping to an array of words for each line.
This time, you will print out five arrays, each containing the words in the corresponding log file line.
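A sketch of the word-splitting map, applied to a shortened fragment of the sample line from step 1 (in pyspark: `logs.map(lambda line: line.split()).take(5)`):

```python
# Sketch of the step 8 map: each line maps to its whitespace-separated words.
sample = '116.180.70.237 - 128 [15/Sep/2013:23:59:53 +0100]'
words = sample.split()
print(words)  # ['116.180.70.237', '-', '128', '[15/Sep/2013:23:59:53', '+0100]']
```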
9. Now that you know how map works, define a new RDD containing just the IP addresses from each line in the log file. (The IP address is the first "word" in each line.)
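Since the IP address is the first whitespace-separated field, the map is line → line.split()[0]; a sketch using a fragment of the step 1 sample line (in pyspark: `ips = logs.map(lambda line: line.split()[0])`):

```python
# Sketch of the step 9 map: extract the first field (the IP address).
sample = ('116.180.70.237 - 128 [15/Sep/2013:23:59:53 +0100] '
          '"GET /KBDOC-00031.html HTTP/1.0" 200 1388')
ip = sample.split()[0]
print(ip)  # 116.180.70.237
```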
10. Although take and collect are useful ways to look at data in an RDD, their output is not very readable. Fortunately, they return arrays (lists, in Python) which you can iterate through:

pyspark> for ip in ips.take(10): print ip
11. Finally, save the list of IP addresses as a text file to: /loudacre/iplist
12. In a terminal window or the Hue file browser, list the contents of the /loudacre/iplist folder. You should see multiple files, including several part-xxxxx files, which are the files containing the output data. (Each “part” file holds the output of one partition of the RDD.)
Optional Homework
Use RDD transformations to create a dataset consisting of the IP address and corresponding user ID for each request for an HTML file. (Disregard requests for other file types.) The user ID is the third field in each log file line. The result should consist of entries in the form ipaddress/userid, e.g.:
165.32.101.206/8
100.219.90.44/102
182.4.148.56/173
246.241.6.175/45395
175.223.172.207/4115
…
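One way to sketch the homework transformation, applied to the sample line from step 1 (pure Python; in pyspark this could be a filter for ".html" followed by a map such as `line.split()[0] + "/" + line.split()[2]` — the names `sample_lines` and `html_pairs` are illustrative):

```python
# Sketch of the homework: keep HTML requests, then map each line to
# "ipaddress/userid" where the IP is field 0 and the user ID is field 2.
sample_lines = [
    '116.180.70.237 - 128 [15/Sep/2013:23:59:53 +0100] '
    '"GET /KBDOC-00031.html HTTP/1.0" 200 1388 '
    '"http://www.loudacre.com" "Loudacre CSR Browser"',
]
html_pairs = [
    line.split()[0] + "/" + line.split()[2]
    for line in sample_lines
    if ".html" in line
]
print(html_pairs)  # ['116.180.70.237/128']
```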