Module 1: Introduction and Basics
Why Hadoop?
Processing and querying of vast amounts of data.
Efficient.
Security.
Automatic distribution of data and work across machines.
Open source
Parallel Processing
History of Hadoop
Traditional architecture: an application server connects to a database over the network, pulling data, manipulating it, updating it, and keeping it synchronized.
Changing Mindset
Should the application be moved to the data, or the data to the application?
Parallel Processing
Apart from data storage, performance becomes a major concern, which leads to parallel processing. Traditional approaches include:
Multithreading
OpenMP
MPI (Message Passing Interface)
Vertical Scaling
Adding extra hardware (CPU, RAM, etc.) to a single machine
Horizontal Scaling
Adding more nodes to the cluster
Distributed Framework
Data Localization
Moving the application to where the data resides
Data Availability
When data is stored across nodes, it should be available and accessible to all other nodes. Even if nodes fail, data should not be lost.
Data Consistency
Data should be consistent at all times.
Data Reliability
Assignments - Prerequisites
Linux OS
Sun JDK 6 (>= 1.6)
Hadoop 1.0.3
Eclipse
Apache Maven 3.0.4
Agenda
Hadoop Installation
Running a Sample MapReduce Program
HDFS Commands
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
hadoop-env.sh
Specify the JAVA_HOME path
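For example, a single export line in conf/hadoop-env.sh is enough; the JDK path below is only an illustration and will differ per system:

```shell
# conf/hadoop-env.sh
# Point Hadoop at the JDK installation (example path; adjust to your system)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```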
The output should show that the NameNode was successfully formatted. Run bin/start-all.sh to start the Hadoop cluster. Execute ps -ef | grep hadoop; it should show all five of the processes below running:
NameNode
DataNode
Secondary NameNode
TaskTracker
JobTracker
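The startup steps above, as shell commands run from the Hadoop installation directory (a sketch for a standard Hadoop 1.0.3 single-node setup; the format step is needed only once, on a fresh install):

```shell
# Format the HDFS filesystem via the NameNode (first run only -- this erases HDFS data)
bin/hadoop namenode -format

# Start all five daemons: NameNode, DataNode, Secondary NameNode, JobTracker, TaskTracker
bin/start-all.sh

# Verify that the daemons are running
ps -ef | grep hadoop
```

These commands assume an installed, configured Hadoop 1.0.3 and so are not runnable outside such an environment.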
Job Tracker UI
http://localhost:50030
NameNode UI
http://localhost:50070
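The remaining agenda items, HDFS commands and running a sample MapReduce program, can be sketched as below. The file names and HDFS paths are illustrative; the examples jar name is an assumption based on a stock Hadoop 1.0.3 download:

```shell
# Basic HDFS commands
bin/hadoop fs -mkdir /user/hadoop/input              # create a directory in HDFS
bin/hadoop fs -put localfile.txt /user/hadoop/input  # copy a local file into HDFS
bin/hadoop fs -ls /user/hadoop/input                 # list directory contents
bin/hadoop fs -cat /user/hadoop/input/localfile.txt  # print a file's contents

# Run the bundled WordCount example on the input directory
bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hadoop/input /user/hadoop/output

# Inspect the job output
bin/hadoop fs -cat /user/hadoop/output/part-r-00000
```

As above, these commands require a running Hadoop 1.0.3 cluster and are shown as a setup sketch rather than a standalone script.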