17\Hadoop\BasicStuff
UserName: Rivatech
Password: rivatech@123
===================================>
Training Modules
1) Big Data Developer --> Level 0
2) Hadoop Admin
3) Hadoop Analyst
New Course --> Hadoop Spark & Storm.
====================================>
Housekeeping
--> Start by 10:00
Tea Break: 11:30
Lunch Break: 1:30
Tea Break: 3:30
Wrap-up --> 5:30
Post that: 45 min to 1 hour of reading / watching
=====================================>
--> Agile & DevOps
Pain Points
1) Setting up a DB link takes time
2) Refreshing takes time
--> Variety of data --> different types of data
--> Velocity of data --> the rate at which the data comes in
--> Real-time requirement
Distributions of Hadoop
--> Cloudera
--> Hortonworks
--> Pivotal
--> IBM BigInsights
Reporting / BI and integration tools
1) Cognos
2) MicroStrategy
3) SAP BO
4) Pentaho
5) Talend
https://marketplace.informatica.com/community/collections/big_data_&_analytics
===================================> Requirements
1) RAM: 4 GB minimum, 8 GB preferred. For high-end systems, 16 GB.
2) Windows 7, 8, or 10
3) Virtualization software
--> Ability to run another OS within an OS
a) VMware --> VMware Workstation
b) VirtualBox
c) Hyper-V
d) KVM
=============================>
What is Big Data --> the 3 V's of Big Data
1) Volume --> the sheer amount of data
2) Velocity --> the rate at which the data arrives
3) Variety --> the different types of data
Data volume grew roughly 250% from 2015 to 2017, and its composition shifted:
                      2015    2017
a) Structured          85%     30%
b) Semi-structured     10%     35%
c) Unstructured         5%     35%
Total Cost of Ownership (TCO) --> MPP / DWH
a) License cost
b) Appliance / hardware cost
c) Support cost
OLTP                            OLAP
Transactional processing        Analytical processing
Real time / NRT                 Batch processing
Databases                       Data warehouses
Transactions                    Visualization / Reporting
NoSQL                           Hadoop
NoSQL --> examples in production:
Facebook Messages --> HBase
Flipkart --> HBase
Snapdeal --> MongoDB
Netflix --> Cassandra
Why Spark?
Processing engine --> data storage is in HDFS
--> In-memory [although not everything is really in memory]
==================================>
After the copying, ensure that you extract the Cloudera and Ubuntu images --> Right-click and say "Extract To ...".
3 V's of Big Data --> We now want to work with the complete population, not with a sample of it.
Client A: wants to write a file called "sample" (100 MB) into the cluster.
With a 64 MB block size, the file becomes two blocks: 64 MB and 36 MB.
Master
Node 1: 64 MB    Node 2: 36 MB
Node 5: 64 MB    Node 7: 36 MB
Node 8: 64 MB    Node 9: 36 MB
1) Will the client communicate with the master? Is it necessary for the client to
   have Hadoop on its side? YES [only the client component].
2) What will be the response of the master? The nodes where the data will be stored.
3) How many IPs will be given by the master as part of the response? [6]
4) How are the 6 IPs chosen?
   64 MB block --> [1, 5, 8]     (write pipeline - lease)
   36 MB block --> [2, 7, 9]
   The first node for every block is decided based on the proximity between the
   client and the slaves. The data will be distributed among the nodes in a
   horizontal fashion. The other IPs are decided based on availability, not on
   proximity; if multiple nodes are up and available, the master chooses among
   them randomly.
5) Is the client going to write to all the nodes in the pipeline? No. It will only
   write to the first node of every block.
6) Is the write from the client going to be parallel? YES.
7) Who will break the file into blocks? Hadoop on the client side.
8) When is the replication going to start - after a block is written, or byte by
   byte? It is byte by byte. So when the client writes to 1, node 1 writes byte by
   byte to 5, and 5 writes to 8.
9) When is the whole write finished? When all the replications are done.
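To see this placement in practice, a minimal sketch (the path /user/training is a placeholder; the fsck flags are standard Hadoop 1.x):
    # copy the 100 MB file into HDFS; Hadoop on the client side breaks it into blocks
    hadoop fs -put sample /user/training/sample
    # list every block of the file and the datanodes holding each replica
    hadoop fsck /user/training/sample -files -blocks -locations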
==================================>
Master
Node 1: 64 MB    Node 2: 36 MB
Node 5: 64 MB    Node 7: 36 MB
Node 8: 64 MB    Node 9: 36 MB
Client B: read "sample" --> [8, 9]
After the write is finished, all copies are originals; there is no concept of
replicated blocks.
HDFS is a virtual file system.
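On the read side, the client asks the master for block locations and then pulls each block from one replica; a minimal sketch (same placeholder path as above):
    # stream the file back; one replica per block is read (e.g. nodes 8 and 9)
    hadoop fs -cat /user/training/sample | head
    # the "virtual" part: one file path, physically stored as blocks across nodes
    hadoop fs -ls /user/training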
Failures:
Master
Node 1: 64 MB    Node 2: 36 MB
Node 5: 64 MB    Node 7: 36 MB
Node 8: 64 MB    Node 9: 36 MB
1] While writing
   a) On primary nodes [1 & 2]: Node 1 goes down after 60 MB of data is written.
      1) Will the client be aware of the failure? Yes.
      2) What will happen to the data already written [60 MB on 3 machines] and
         the 36 MB written on the other 3 machines? All of it is zombied data, and
         the admin will have to manually remove it.
      3) Should the client write the whole 100 MB again? YES.
   b) On a replicated node: Node 5 goes down after 60 MB of data is written.
      1) Will the client be aware of the failure? No. The new node info need not
         be given back to the client.
      2) What will happen to the data already written [60 MB on 3 machines] and
         the 36 MB written on the other 3 machines? Only the 64 MB block should be
         replicated by the framework; the client is unaware of any failure, and
         the 36 MB is valid.
      3) Should the client write the whole 100 MB again? No. Only the block which
         was being written will be copied again by the framework; the client is
         not involved in this process.
2] After writing
   Any node fails: should the admin get involved? No. The framework will ensure
   that it replicates all the blocks that were present on the failed node onto
   other nodes in the cluster, by picking them from the other copies.
==========================================> daemons
            HDFS          MR
Master      NameNode      JobTracker
Slave       DataNode      TaskTracker
SecondaryNameNode [checkpoint / backup node for the NameNode]
A typical cluster layout:
NN     SNN     JT
DN+TT  DN+TT  DN+TT  DN+TT
Defaults live in src/relevant directory/XXX-default.xml; overrides go in the site files:
1) core --> core-site.xml
      IP address of the NN system
      Directory of the checkpoint
2) hdfs --> hdfs-site.xml
      Directory of the metadata location
      Directory of the data location
      Replication
3) mapred --> mapred-site.xml
      IP address of the JT system
      Location of the local.dir
      Location of the system.dir
      max.map.tasks
      max.reduce.tasks
      child.java.opts
<final>true</final> means that the value cannot be overridden later (for example, by a job or client configuration).
masters --> IP address of the SNN
slaves --> IP addresses of the slaves, separated by new lines.
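A minimal sketch of the three site files (the property names are the standard Hadoop 1.x ones for the entries above; hostnames, ports, and directories are placeholders):
    # core-site.xml: NN address and checkpoint directory
    cat > conf/core-site.xml <<'EOF'
    <configuration>
      <property><name>fs.default.name</name><value>hdfs://namenode-host:9000</value></property>
      <property><name>fs.checkpoint.dir</name><value>/data/checkpoint</value></property>
    </configuration>
    EOF
    # hdfs-site.xml: metadata dir (locked with <final>), data dir, replication
    cat > conf/hdfs-site.xml <<'EOF'
    <configuration>
      <property><name>dfs.name.dir</name><value>/data/nn</value><final>true</final></property>
      <property><name>dfs.data.dir</name><value>/data/dn</value></property>
      <property><name>dfs.replication</name><value>3</value></property>
    </configuration>
    EOF
    # mapred-site.xml: JT address, slot counts, child JVM size
    cat > conf/mapred-site.xml <<'EOF'
    <configuration>
      <property><name>mapred.job.tracker</name><value>jobtracker-host:9001</value></property>
      <property><name>mapred.tasktracker.map.tasks.maximum</name><value>3</value></property>
      <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>3</value></property>
      <property><name>mapred.child.java.opts</name><value>-Xmx400m</value></property>
    </configuration>
    EOF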
How many JVMs will there be in a default single-node Hadoop system?
NN, DN, SNN, JT, TT --> the 5 daemons
3 map slots --> 400 MB each
3 reduce slots --> 400 MB each
The size of each daemon JVM is 1000 MB / 1 GB [hadoop-env.sh].
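The per-daemon heap comes from hadoop-env.sh; a sketch of setting it and checking the JVM count:
    # in conf/hadoop-env.sh: heap, in MB, given to each daemon JVM (default 1000)
    export HADOOP_HEAPSIZE=1000
    # one JVM per daemon; on a single node expect all five:
    jps   # NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker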
===================================> Once all the 5 daemons are ready
stop-all.sh --> confirm that with a jps, which should show no daemons
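As commands, the wrap-up sequence is simply:
    stop-all.sh   # stops the HDFS and MapReduce daemons
    jps           # should now list nothing but "Jps" itself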
VM - Power Shutdown the Ubuntu Image.
===================================>
2 things for tomorrow
1) Watch the 10-minute video on MapReduce with playing cards
2) Watch the 25-minute interview with Sanjay Ghemawat --> GFS white paper