
The shared location is: \\192.168.1.17\Hadoop\BasicStuff
UserName: Rivatech
Password: rivatech@123
===================================>
Training Modules
1) Big Data Developer --> Level 0
2) Hadoop Admin
3) Hadoop Analyst -->
New Course --> Hadoop Spark & Storm.
====================================>
Housekeeping
==> Start by 10:00
Tea Break: 11:30
Lunch Break: 1:30
Tea Break: 3:30
Wrap up --> 5:30
Post that: 45 min ~ 1 hour of reading / watching
=====================================>
--> Agile & DevOps
Pain Points
1) Takes time for DB links
2) Refreshing takes time.
--> Variety of Data --> different types of data
--> Velocity of Data --> the rate at which the data comes in
--> Real-time requirements.
Distributions of Hadoop
--> Cloudera
--> Hortonworks
--> Pivotal
--> IBM BigInsights

============================> Working with Vanilla Hadoop --> ASF - Apache Software Foundation
Bucky Java--> https://www.youtube.com/results?search_query=bucky+java
Visualization tools
1) Tableau
2) QlikView
Reporting Tools
1) Cognos
2) MicroStrategy
3) SAP BO
4) Pentaho
5) Talend
https://marketplace.informatica.com/community/collections/big_data_&_analytics
===================================> Requirements
1) 4 GB RAM, preferably 8 GB. For high-end systems we need 16 GB.
2) Windows 7, 8, or 10
3) Virtualization software --> the ability to run another OS within an OS
   a) VMware --> VMware Workstation
   b) VirtualBox
   c) Hyper-V
   d) KVM
=============================>
What is Big Data --> the 3 V's of Big Data
1) Sheer Volume
2) Velocity of data --> the rate at which the data arrival is happening
3) Variety
                        2015        2017
a) Structured            85%         30%
b) Semi-Structured       10%         35%
c) UnStructured           5%         35%
[ overall data growth from 2015 to 2017: ~250% ]
Total Cost of Ownership
                                   MPP / DWH        Hadoop / Open Source
a) License Cost                    Yes              No
b) Appliance / Hardware Cost       Yes              No
c) Support Cost                    Yes              Yes, only if needed

OLTP                            OLAP
Transactional Processing        Analytical Processing
Real Time / NRT                 Batch Processing
Databases                       Data Warehouses
Transactions                    Visualization / Reporting
NoSQL                           Hadoop

NoSQL -->
Facebook Messages --> HBase
Flipkart --> HBase
Snapdeal --> MongoDB
Netflix --> Cassandra
Why Spark?
Processing engine --> the data storage is still in HDFS
--> In-memory [ although not everything is really in memory ]
==================================>
After the copying, ensure that you extract the Cloudera and Ubuntu images --> Right Click - Extract To ...
3 V's of Big Data -->
We now want to work with the complete population and not with samples. --> Analytics


Big data is a broad term for data sets so large or complex that traditional data processing applications [ RDBMS & DWH ] are inadequate.
Hadoop: framework for Big Data storage and analysis. Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
The password [ if required for the PDF files ] --> hdp
HDFS --> Storage
Map Reduce --> Processing
In Hadoop --> the code goes to the data and the processing happens locally on the systems that have the data.
Commodity Hardware --> systems which are not high-end [ no dual power supply or RAID, and less memory (4 - 8 GB) ]
http://business.time.com/2013/08/06/big-data-is-my-copilot-auto-insurers-push-devices-that-track-driving-habits/
Structured Data: fixed rows and columns
Semi-Structured Data: textual data --> xml, json, email, blogs, comments, feedback, logs
UnStructured Data: audio, video, images
Which tools handle which type:
Structured: Hive [ SQL ], Pig [ Scripting ], MR [ Programming ]
Semi-Structured: Pig, MR
UnStructured: MR
2 things to note before powering on a Virtual Machine
1) Memory --> always give it half of your system memory
2) Network adapter --> should be NAT [ if we need an individual IP for the image ], Bridged when we have multiple systems working together.
The username and password in Cloudera --> cloudera / cloudera
In all the vendor VMs the hadoop daemons [ processes ] get started automatically when the system boots up, so practically hadoop is ready as soon as we log in.
hadoop dfs and hadoop fs are synonymous.
To get a listing of all the linux-style commands possible in HDFS --> run hadoop fs with no arguments.
Note: the local filesystem and HDFS are completely different file systems.
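A quick illustration of the last two notes (standard HDFS shell usage on the Cloudera VM; the paths are only examples):
    hadoop fs                 # with no arguments, prints the list of supported file system commands
    hadoop fs -ls /           # lists the root of HDFS
    ls /                      # compare: lists the root of the local linux file system - a different file system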
========================> HDFS Architecture
1) Master - Slave [ Master - server grade, Slaves - commodity grade ]
2) Block --> the block size in linux is 4K, but in hadoop it is 64 MB in Gen1 and 128 MB in Gen2.
**http://stackoverflow.com/questions/19473772/data-block-size-in-hdfs-why-64mb
This block size is configurable at the file-level granularity.
3) Replication: by default there will always be a minimum of 3 copies of the data, and on different systems.
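A rough sketch of that per-file granularity (Gen1 property names; the file and directory names here are just placeholders):
    hadoop fs -D dfs.block.size=134217728 -D dfs.replication=2 -copyFromLocal bigfile /
    hadoop fs -setrep 3 /bigfile          # change the replication factor of a file already in HDFS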
Cluster --> a system of machines acting as a single logical unit. --> 9-node cluster

Client A: wants to write a file called sample - 100 MB - into the cluster.
[ Diagram: Master plus slave nodes 1, 2, 5, 7, 8, 9; the 64 MB block goes to nodes 1, 5, 8 and the 36 MB block to nodes 2, 7, 9 ]

1) The client will communicate with the master. Is it necessary for the client to have hadoop on its side? YES [ only the client component ].
2) What will be the response of the master? The nodes where the data will be stored.
3) How many IPs will be given by the master as a part of the response? [ 6 ]
4) How are the 6 IPs chosen?
      64 MB block --> [1, 5, 8]        36 MB block --> [2, 7, 9]        [ write pipeline - lease ]
      The first node for every block is decided based on the proximity between the client and the slaves. The data will be distributed among the nodes in a horizontal fashion.
      The other IPs will be decided based on availability and not on the basis of proximity. If multiple nodes are up and available, then the master will choose among them randomly.
5) Is the client going to write to all the nodes in the pipeline? No. It will only write to the first node for every block.
6) Is the write from the client going to be parallel? YES.
7) Who will break the file into blocks? Hadoop on the client side.
8) When is the replication going to start - after a block is written, or byte by byte? It is byte by byte. So when the client writes to node 1, node 1 writes byte by byte to node 5, and node 5 writes to node 8.
9) When is the whole write finished? Answer: when all the replications are done.
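The arithmetic behind the example, as a small shell sketch (illustration only, nothing hadoop-specific):
    FILE_MB=100; BLOCK_MB=64; REPLICATION=3
    FULL=$(( FILE_MB / BLOCK_MB ))                    # 1 full block of 64 MB
    REM=$(( FILE_MB % BLOCK_MB ))                     # a final block of 36 MB
    BLOCKS=$(( REM > 0 ? FULL + 1 : FULL ))           # 2 blocks in total
    echo $(( BLOCKS * REPLICATION ))                  # 6 block copies --> the 6 IPs in the master's response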
==================================>
[ Diagram: the same cluster - Master plus slave nodes 1, 2, 5, 7, 8, 9 holding the 64 MB block on 1, 5, 8 and the 36 MB block on 2, 7, 9 ]

Client B: reads sample --> the master points it at [ 8, 9 ] (one node per block).
After the write is finished, all of the copies are originals and there is no concept of replicated blocks.
HDFS is a virtual file system.
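To see the read path and the block placement for yourself (assuming a file called /sample already exists in HDFS):
    hadoop fs -cat /sample                            # reads the file back through HDFS
    hadoop fsck /sample -files -blocks -locations     # shows which datanodes hold each block replica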
Failures:
[ Diagram: the same cluster - Master plus slave nodes 1, 2, 5, 7, 8, 9 holding the 64 MB block on 1, 5, 8 and the 36 MB block on 2, 7, 9 ]

1] While writing
   a) On primary nodes [ 1 & 2 ]: node 1 goes down after 60 MB of data is written.
      1) Will the client be aware of the failure? Yes.
      2) What will happen to the data already written [ 60 MB on 3 machines ] and the 36 MB written on the other 3 machines? All of it is zombied data and the Admin will have to remove it manually.
      3) Should the client write the whole 100 MB again? YES.
   b) On a replicated node: node 5 goes down after 60 MB of data is written.
      1) Will the client be aware of the failure? No. The new node info need not be given back to the client.
      2) What will happen to the data already written [ 60 MB on 3 machines ] and the 36 MB written on the other 3 machines? Only the 64 MB block has to be re-replicated by the framework; the client is unaware of any failure, and the 36 MB is valid.
      3) Should the client write the whole 100 MB again? No. Only the block which was being written will be copied again by the framework, and the client is not involved in this process.
2] After writing
   Any node fails: should the Admin get involved? No. The framework will ensure that it replicates all the blocks that were present on the failed node onto other nodes in the cluster, by picking them up from the other copies.
==========================================> Daemons
                HDFS                 MR
Master          NameNode             JobTracker
Slave           DataNode             TaskTracker
                SecondaryNameNode [ checkpoint / backup node for the NameNode ]
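On a Gen1 single-node setup (or any of the vendor VMs) the running daemons can be checked with jps; a typical listing looks roughly like this (process ids will differ):
    jps
        NameNode
        SecondaryNameNode
        JobTracker
        DataNode
        TaskTracker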

Embedded Web Server in Hadoop: Apache Jetty


Embedded database in Hadoop: Derby
By default, there will be 1 block when HDFS starts up. The size of this block will be 4K.
The metadata will be stored in 2 files [ fsimage and edits ] and the location for the same would be
/var/lib/hadoop-0.20/cache/hadoop/dfs/name/current --> ls -l
fsimage -->
edits --> 4K [ empty file ]
fsimage is a snapshot of the file system at a point in time.
edits is like the redo logs in oracle - it contains info about the changes in the file system.
When any modification [ e.g. a file insert ] happens in the file system, the metadata is stored in edits.
Then periodically the edits are applied to the fsimage and the size of edits goes back to 4K.
/var/lib/hadoop-0.20/cache/hdfs/dfs/data/current --> Note: we should log in as the hdfs user to view the blocks --> sudo su hdfs
For every block there is:
1 blk file
1 meta file --> checksum file
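To walk through both locations on the Cloudera VM (paths as given above):
    ls -l /var/lib/hadoop-0.20/cache/hadoop/dfs/name/current     # fsimage + edits [ the metadata ]
    sudo su hdfs                                                 # the block files belong to the hdfs user
    ls -l /var/lib/hadoop-0.20/cache/hdfs/dfs/data/current       # blk_* files and their .meta checksum files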
Exercise on file ingestion -->
1) Create a new file called sample in /home/cloudera --> gedit sample
2) Put some dummy content into that file.
3) The client puts the file into HDFS --> hadoop fs -copyFromLocal sample /
4) Check if the file exists --> hadoop fs -ls /   or browse the file system.
5) What would be the changes in the metadata location? --> the edits size would grow to 1.1 MB and the fsimage size would stay the same. Even if we add another file, the size of edits stays at 1.1 MB until the whole 1.1 MB is filled with pointers.
6) What would be the changes in the data location? --> a new blk file will be created, plus a .meta file for the new blk file.
7) cat the blk file [ on the node which has the data ].


--------------------> Metadata
1) The metadata is present in the main memory of the NN, plus
2) a persisted copy is present in the fsimage and edits files. Periodically the edits get applied to the fsimage and the fsimage size will increase.
VM - Power - Shutdown Guest in the Cloudera VM.
Then we will load the Ubuntu VM --> File - Open - navigate to the place where ubuntu is extracted; you should see a single vmx file - open it.
============================================>
We will resume the ubuntu image as the image is in a suspended state. Then we will power off the image [ so that we can change the network adapter and the memory ].
Then power on the system again.
Open hadoop_setup.pdf from the Setup Documents folder.
Install WinSCP so that we can edit the files over there instead of doing vi.
============================================>
1) Extracted hadoop in lab/software
2) Made changes in .bash_profile and then sourced it
3) Confirmed that the java and hadoop commands are running
------------> How to remove the "HADOOP_HOME is deprecated" error
1) Comment out the HADOOP_HOME environment variable with #
2) Open a duplicate terminal window to ensure that .bash_profile gets re-read; if we comment out a variable, it will still be in the cache even after sourcing the file, hence the duplicate window.
   hadoop version [ you will see that the "HADOOP_HOME is deprecated" warning no longer comes ]
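A minimal sketch of the .bash_profile entries, assuming hadoop and the JDK were extracted under lab/software (the exact paths and version names on your system may differ):
    export JAVA_HOME=$HOME/lab/software/jdk                # placeholder path
    # export HADOOP_HOME=$HOME/lab/software/hadoop         # commented out to silence the
                                                           # "HADOOP_HOME is deprecated" warning
    export PATH=$PATH:$HOME/lab/software/hadoop/bin:$JAVA_HOME/bin
    # then open a fresh terminal and confirm:
    #   java -version
    #   hadoop version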
=============================================>
Masters { NN, SNN, JT }
Slaves  { DN+TT, DN+TT, DN+TT, DN+TT }   [ 4 nodes ]
3 components for Configuration in Hadoop --> the conf directory
(the shipped defaults are in src/<relevant directory>/XXX-default.xml)
1) core   --> core-site.xml
      IP ADDRESS of the NN system
      Directory of the checkpoint
2) hdfs   --> hdfs-site.xml
      Directory of the metadata location
      Directory of the data location
      replication
3) mapred --> mapred-site.xml
      IP ADDRESS of the JT system
      Location of the local.dir
      Location of the system.dir
      max.map.tasks
      max.reduce.tasks
      child.java.opts
<final>true</final> means that the value cannot be changed after the server is started.
masters --> IP ADDRESS of the SNN
slaves  --> IP ADDRESSes of the slaves, separated by new lines.
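As a rough sketch of what those three files end up looking like on a Gen1 (0.20 / 1.x) install - the hosts, ports, paths and numbers below are placeholders, not the exact lab values:
    core-site.xml
      <property><name>fs.default.name</name><value>hdfs://<NN-ip>:9000</value></property>
      <property><name>fs.checkpoint.dir</name><value>/path/for/checkpoint</value></property>
    hdfs-site.xml
      <property><name>dfs.name.dir</name><value>/path/for/metadata</value><final>true</final></property>
      <property><name>dfs.data.dir</name><value>/path/for/data</value></property>
      <property><name>dfs.replication</name><value>3</value></property>
    mapred-site.xml
      <property><name>mapred.job.tracker</name><value><JT-ip>:9001</value></property>
      <property><name>mapred.local.dir</name><value>/path/for/local</value></property>
      <property><name>mapred.system.dir</name><value>/path/for/system</value></property>
      <property><name>mapred.tasktracker.map.tasks.maximum</name><value>3</value></property>
      <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>3</value></property>
      <property><name>mapred.child.java.opts</name><value>-Xmx400m</value></property>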
How many JVMs will there be in a default single-node hadoop system?
Daemons: NN, DN, SNN, JT, TT
Plus the child task JVMs: 3 map --> 400M, 3 reduce --> 400M
The heap size of each daemon JVM is 1000 MB / 1 GB [ hadoop-env.sh ]
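The daemon heap is controlled in hadoop-env.sh; the child task heap comes from mapred-site.xml (the values below just mirror the numbers discussed above, not prescriptions):
    # hadoop-env.sh
    export HADOOP_HEAPSIZE=1000          # MB per daemon JVM; 1000 is also the default when left unset
    # mapred-site.xml: mapred.child.java.opts=-Xmx400m sizes each map/reduce child JVM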
===================================> Once all the 5 daemons are ready
stop-all.sh --> confirm that with a jps, which should show no daemons
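For reference, the shutdown check in command form (standard Gen1 control scripts):
    stop-all.sh       # stops the HDFS and MR daemons
    jps               # should now list no hadoop daemons, only Jps itself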
VM - Power Shutdown the Ubuntu Image.
===================================>
2 things for tomorrow
1) Watch the 10-minute video on Map Reduce with playing cards
2) Watch the 25-minute interview with Sanjay Ghemawat --> GFS white paper.
