Module 1: Introduction and Basics
Why Hadoop?
Processing and querying of vast amounts of data.
Efficient.
Security.
Automatic distribution of data and work across machines.
Open source
Parallel Processing
History of Hadoop
Traditional architecture: an application server connects to a database over the network, pulling data, manipulating it, updating it, and keeping it synchronized.
Changing Mindset
Should the application be moved to the data, or the data to the application?
Parallel Processing
Apart from data storage, performance becomes a major concern, which leads to parallel processing. Traditional approaches include:
Multithreading
OpenMP
MPI (Message Passing Interface)
Vertical Scaling
Adding extra hardware (CPU, RAM, etc.) to a single machine
Horizontal Scaling
Adding more nodes to the cluster
Distributed Framework
Data Localization
Moving the application to where the data resides
Data Availability
When data is stored across nodes, it should be available and accessible to all other nodes. Even if nodes fail, data should not be lost.
Data Consistency
Data should be consistent at all times.
Data Reliability
Assignments - Prerequisites
Linux OS
Sun JDK 6 (>= 1.6)
Hadoop 1.0.3
Eclipse
Apache Maven 3.0.4
Agenda
Hadoop Installation
Running a Sample MapReduce Program
HDFS Commands
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
hadoop-env.sh
Specify the JAVA_HOME path
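For example, a single export line in conf/hadoop-env.sh is enough; the JDK path below is only an illustration and will differ per system:

```shell
# conf/hadoop-env.sh
# Point Hadoop at the JDK installation (example path; adjust to your system)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```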
The output should show that the NameNode was successfully formatted. Run bin/start-all.sh to start the Hadoop cluster. Execute ps -ef | grep hadoop; it should show all five of the processes below running:
NameNode
DataNode
Secondary NameNode
TaskTracker
JobTracker
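The startup steps above, as shell commands run from the Hadoop installation directory (a sketch for a standard Hadoop 1.0.3 single-node setup; the format step is needed only once, on a fresh install):

```shell
# Format the HDFS filesystem via the NameNode (first run only -- this erases HDFS data)
bin/hadoop namenode -format

# Start all five daemons: NameNode, DataNode, Secondary NameNode, JobTracker, TaskTracker
bin/start-all.sh

# Verify that the daemons are running
ps -ef | grep hadoop
```

These commands assume an installed, configured Hadoop 1.0.3 and so are not runnable outside such an environment.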
Job Tracker UI
http://localhost:50030
NameNode UI
http://localhost:50070
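The remaining agenda items, HDFS commands and running a sample MapReduce program, can be sketched as below. The file names and HDFS paths are illustrative; the examples jar name is an assumption based on a stock Hadoop 1.0.3 download:

```shell
# Basic HDFS commands
bin/hadoop fs -mkdir /user/hadoop/input              # create a directory in HDFS
bin/hadoop fs -put localfile.txt /user/hadoop/input  # copy a local file into HDFS
bin/hadoop fs -ls /user/hadoop/input                 # list directory contents
bin/hadoop fs -cat /user/hadoop/input/localfile.txt  # print a file's contents

# Run the bundled WordCount example on the input directory
bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hadoop/input /user/hadoop/output

# Inspect the job output
bin/hadoop fs -cat /user/hadoop/output/part-r-00000
```

As above, these commands require a running Hadoop 1.0.3 cluster and are shown as a setup sketch rather than a standalone script.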