50 Must-Read Hadoop Interview Questions & Answers
Whizlabs | Dec 29th, 2017 | Big Data
Are you planning to land a job in big data and data analytics? Are you worried about cracking the
Hadoop job interview? We have put together a list of Hadoop interview questions that will come in
handy. You might have sound knowledge of the software framework, but not all of it can be
tested in a short 15-minute interview session. So the interviewer will ask you some specific
questions they think are apt to judge your knowledge of the subject matter.
Basic Hadoop Interview Questions
To start off the list, we will be focusing on the common and basic Hadoop interview questions that
people come across when applying for a Hadoop-related job, irrespective of position.
1. What are the concepts used in the Hadoop Framework?
Answer: The Hadoop framework functions on two core concepts:

● HDFS: An abbreviation for Hadoop Distributed File System, it is a Java-based file system
for scalable and reliable storage of large datasets. HDFS itself works on a
master-slave architecture and stores all its data in the form of blocks.
● MapReduce: This is the programming model and the associated implementation for
processing and generating large data sets. Hadoop jobs are basically divided into
two different tasks. The map job breaks down the data set into key-value pairs, or
tuples. The reduce job then takes the output of the map job and combines the data
tuples into a smaller set of tuples.
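The two phases described above can be sketched in plain Python (a conceptual illustration only, not Hadoop API code — a word-count example with invented input):

```python
# Toy illustration of the two MapReduce phases: the map step emits
# key-value pairs, and the reduce step combines values sharing a key
# into a smaller set of tuples.
from collections import defaultdict

def map_phase(lines):
    """Break each line of input into (word, 1) key-value pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Combine the map output: sum all values that share a key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

result = reduce_phase(map_phase(["hadoop is fast", "hadoop is scalable"]))
```

In a real Hadoop job these phases run on different machines, with the framework shuffling the map output so that all pairs for one key reach the same reducer.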
2. What is Hadoop? Name the Main Components of a Hadoop Application.
Answer: Hadoop evolved as the solution to the "Big Data" problem. Hadoop is described as
a framework that offers a number of tools and services in order to store and process Big Data. It
also plays an important role in the analysis of big data and in making efficient business decisions
when doing so with traditional methods is difficult.

Hadoop offers a vast toolset that makes it possible to store and process data very easily. Here are the
main components of Hadoop:
● Hadoop Common
● HDFS
● Hadoop MapReduce
● YARN
● Pig and Hive – the Data Access Components
● HBase – for Data Storage
● Apache Flume, Sqoop, Chukwa – the Data Integration Components
● Ambari, Oozie and ZooKeeper – the Data Management and Monitoring Components
● Thrift and Avro – the Data Serialization Components
● Apache Mahout and Drill – the Data Intelligence Components
3. How many Input Formats are there in Hadoop? Explain.
Answer: There are the following three input formats in Hadoop –

1. Text Input Format: The text input format is the default input format in Hadoop.
2. Sequence File Input Format: This input format is used to read files in sequence.
3. Key Value Input Format: This input format is used for plain text files, where each line is divided into a key part and a value part.
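For the Key Value Input Format, each line of a plain text file is split into a key and a value at a separator (a tab character by default in Hadoop's KeyValueTextInputFormat). A minimal Python sketch of that parsing logic, with invented sample data:

```python
# Conceptual sketch of how the Key Value Input Format derives a key
# and a value from one line of a plain text file. The tab separator
# mirrors Hadoop's default; everything before it becomes the key.
def parse_key_value(line, separator="\t"):
    """Split a line into (key, value) at the first separator."""
    if separator in line:
        key, _, value = line.partition(separator)
        return key, value
    return line, ""  # no separator: the whole line is the key

pair = parse_key_value("user42\tclicked")
```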
4. What do you know about YARN?
5. Why are nodes added and removed frequently in a Hadoop
cluster?

Answer: The following features of the Hadoop framework make a Hadoop administrator add
(commission) and remove (decommission) DataNodes in a Hadoop cluster –

1. The Hadoop framework utilizes commodity hardware, and this is one of its important
features. It results in frequent DataNode crashes in a Hadoop
cluster.
2. Ease of scaling is yet another important feature of the Hadoop framework, exercised
according to the rapid growth of data volume.
6. What do you understand by “Rack Awareness”?
Answer: In Hadoop, Rack Awareness is defined as the algorithm through which the NameNode
determines how blocks and their replicas are stored in the Hadoop cluster. This is done via rack
definitions that keep traffic between DataNodes within the same rack where possible, minimizing traffic between racks. Let's take an example
– we know that the default value of the replication factor is 3. According to the "Replica Placement
Policy", two replicas of every block of data will be stored in a single rack, whereas the third
copy is stored in a different rack.
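The replica placement described above can be modeled in a few lines of Python (rack names are invented for illustration; this is a simplification of the real policy, which also picks nodes within racks):

```python
# Simplified model of the default replica placement policy with
# replication factor 3: two replicas share one rack, the third
# lives in a different rack.
def place_replicas(racks, replication=3):
    """Return one rack name per replica, using the first two racks."""
    local, remote = racks[0], racks[1]
    placement = [local]          # replica 1: the writer's rack (assumed)
    placement += [remote] * 2    # replicas 2 and 3: together on a remote rack
    return placement[:replication]

placement = place_replicas(["rack-A", "rack-B"])
```

Note the key property: the three replicas span exactly two racks, so losing a whole rack never loses all copies of a block.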
7. What do you know about the Speculative Execution?
8. State some of the important features of Hadoop.
Answer: The important features of Hadoop are –

● The Hadoop framework is designed on the basis of Google's MapReduce, which in turn
builds on Google's distributed file system (GFS).
● The Hadoop framework can solve many questions efficiently for Big Data analysis.
9. Do you know some companies that are using Hadoop?
Answer: Yes, I know some popular names that are using Hadoop.

Yahoo – using Hadoop
Facebook – developed Hive for analysis
Amazon, Adobe, Spotify, Netflix, eBay, and Twitter are some other well-known and established
companies that are using Hadoop.
10. How can you differentiate RDBMS and Hadoop?
Answer: The key points that differentiate RDBMS and Hadoop are –

1. RDBMS is made to store structured data, whereas Hadoop can store any kind of data,
i.e. unstructured, structured, or semi-structured.
2. RDBMS follows a "schema on write" policy, while Hadoop is based on a "schema on read"
policy.
3. In RDBMS the schema of the data is already known, which makes reads fast; in
HDFS no schema validation happens during a write, so writes are fast.
4. RDBMS is licensed software, so one needs to pay for it, whereas Hadoop is open-source
software, so it is free of cost.
5. RDBMS is used for Online Transaction Processing (OLTP) systems, whereas Hadoop is
used for data analytics, data discovery, and OLAP systems as well.
Hadoop Architecture Interview Questions
Up next we have some Hadoop interview questions based on Hadoop architecture. Knowing and
understanding the Hadoop architecture helps a Hadoop professional answer all the Hadoop
interview questions correctly.
11. What are the differences between Hadoop 1 and Hadoop 2?
Answer: The following points explain the difference between Hadoop 1 and Hadoop 2:

In Hadoop 1.x, there is a single NameNode, which is thus the single point of failure, whereas in
Hadoop 2.x there are Active and Passive NameNodes. In case the Active NameNode fails, the
Passive NameNode replaces it and takes charge. As a result, high
availability is there in Hadoop 2.x.
12. What do you know about active and passive NameNodes?
Answer: In a high-availability Hadoop architecture, two NameNodes are present.

Active NameNode – The NameNode that runs in the Hadoop cluster is the Active NameNode.
Passive NameNode – The standby NameNode that stores the same data as the Active
NameNode is the Passive NameNode.
13. What are the Components of Apache HBase?
Answer: Apache HBase consists of the following main components:

● Region Server: A table can be divided into several regions. A group of these regions
gets served to the clients by a Region Server.
● HMaster: This coordinates and manages the Region Server.
● ZooKeeper: This acts as a coordinator inside the HBase distributed environment. It
functions by maintaining server state inside the cluster via communication in sessions.
14. How is the DataNode failure handled by NameNode?
Answer: The NameNode continuously receives a signal (heartbeat) from every DataNode in the Hadoop
cluster that indicates the DataNode is functioning properly. The list of all the blocks present on a
DataNode is stored in a block report. If a DataNode fails to send the signal to the NameNode,
it is marked dead after a specific time period. The NameNode then replicates/copies the blocks of
the dead node to other DataNodes using the earlier created replicas.
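The dead-node detection above amounts to a timeout check over last-seen heartbeat times. A toy Python model (node names and the timeout value are illustrative, not Hadoop's actual defaults):

```python
# Toy model of DataNode failure detection: any node whose last
# heartbeat is older than the timeout is marked dead, and its
# blocks would then be re-replicated from surviving copies.
def find_dead_nodes(last_heartbeat, now, timeout=600):
    """Return nodes whose last heartbeat (in seconds) is too old."""
    return [node for node, ts in last_heartbeat.items()
            if now - ts > timeout]

# dn1 reported recently; dn2 has been silent for 950 seconds.
dead = find_dead_nodes({"dn1": 1000, "dn2": 100}, now=1050)
```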
15. Explain the NameNode recovery process.
Step 1: To start a new NameNode, utilize the file system metadata replica (FsImage).
Step 2: Configure the clients and DataNodes to acknowledge the new NameNode.
Step 3: Once the new NameNode completes loading the last checkpoint FsImage and receives block
reports from the DataNodes, the new NameNode starts serving clients.
16. What are the different schedulers available in Hadoop?
Answer: The different schedulers available in Hadoop are –

COSHH – It makes scheduling decisions by considering the cluster, the workload, and heterogeneity.
FIFO Scheduler – It orders the jobs on the basis of their arrival time in a queue, without using
heterogeneity.
17. Can DataNode and NameNode be commodity hardware?
18. What are the Hadoop daemons? Explain their roles.
Answer: The Hadoop daemons are NameNode, Secondary NameNode, DataNode, NodeManager,
ResourceManager, and JobHistoryServer. The roles of the different Hadoop daemons are –

NameNode – The master node, responsible for metadata storage for all directories and files, is
known as the NameNode. It also contains metadata information about each block of a file and its
allocation in the Hadoop cluster.

Secondary NameNode – This daemon is responsible for merging and storing the modified filesystem
image into permanent storage. It is used in case the NameNode fails.

DataNode – The slave node containing the actual data is the DataNode.

NodeManager – Running on the slave machines, the NodeManager handles the launch of
application containers, monitoring resource usage and reporting the same to the ResourceManager.

ResourceManager – It is the main authority responsible for managing resources and scheduling
applications running on top of YARN.

JobHistoryServer – It is responsible for maintaining all information about the MapReduce jobs after
the ApplicationMaster stops working (terminates).
19. Define "Checkpointing". What is its benefit?

Answer: Checkpointing is the process of merging the edit log with the FsImage to produce an up-to-date
filesystem image, typically performed by the Secondary NameNode.

Benefit of Checkpointing

Checkpointing is a highly efficient process and decreases the startup time of the NameNode.
Hadoop Administrator Interview Questions
The Hadoop administrator is responsible for ensuring that the Hadoop cluster is running smoothly. To crack
the Hadoop administrator job interview, you need to go through Hadoop interview questions related
to the Hadoop environment, cluster, etc. The common Hadoop administrator interview questions are as
follows:
20. When deploying Hadoop in a production environment, what are the
Important Hardware considerations?
Answer: Memory System’s memory requirements : This will vary between the worker services and
management s ervices b
ased o
n t he a
pplication.
Operating System: A 64bit OS is preferred as it avoids any such restrictions on the amount of
memory t hat c an b
e u
sed o
n t he w
orker n
odes.
Storage : A Hadoop Platform should be designed by moving the computing activities to data and
thus a
chieving s calability a
nd h
igh p
erformance.
Capacity : L
arge F
orm F
actor d
isks w
ill c ost l ess a
nd a
llow f or m
ore s torage.
Network: T
wo T
OR s witches p
er r ack i s i deal t o a
void a
ny c hances f or r edundancy.
21. What should you consider while deploying a secondary NameNode?
Answer: A secondary NameNode should always be deployed on a separate Standalone system.
This prevents it from interfering with the operations of the primary node.
22. Name the modes in which Hadoop code can be run.
Answer: There are different modes to run Hadoop code –

1. Fully-distributed mode
2. Pseudo-distributed mode
3. Standalone mode
23. Name the operating systems supported by Hadoop deployment.
Answer: Linux is the main operating system used for Hadoop. However, it can also be deployed
on the Windows operating system with the help of some additional software.
24. Why is HDFS used for the applications with large data sets, and not for
the multiple small files?
Answer: HDFS is more efficient for a large number of data sets maintained in a single file, as
compared to small chunks of data stored in multiple files. As the NameNode stores the
metadata for the file system in RAM, the amount of memory limits the number of files in an HDFS file
system. In simple words, more files will generate more metadata, which will in turn require more
memory (RAM). As a rule of thumb, the metadata of a block, file, or directory takes about 150 bytes.
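The memory argument above can be made concrete with back-of-the-envelope arithmetic (the 150-byte figure comes from the text; the file counts are invented for illustration):

```python
# Rough NameNode heap estimate: roughly one ~150-byte metadata object
# per file plus one per block. Many small files cost far more memory
# than one large file holding the same amount of data.
BYTES_PER_OBJECT = 150

def namenode_bytes(num_files, blocks_per_file):
    """Estimate NameNode heap used by file and block objects."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

one_large_file = namenode_bytes(1, 8)        # one file stored in 8 blocks
many_small_files = namenode_bytes(1000, 1)   # 1000 tiny one-block files
```

Even though both layouts might hold the same data, the thousand small files need hundreds of times more NameNode memory.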
25. What are the important properties of hdfs-site.xml?

Answer: There are three important properties of hdfs-site.xml:

● dfs.data.dir – identifies the location where DataNodes store their data.
● dfs.name.dir – identifies the location of the NameNode's metadata storage, and specifies whether DFS is located
on disk or on a remote location.
● fs.checkpoint.dir – the directory used by the Secondary NameNode.
26. What are the essential Hadoop tools that enhance the performance of
Big Data?
Answer: Some of the essential Hadoop tools that enhance the performance of Big Data are –

Hive, HDFS, HBase, Avro, SQL, NoSQL, Oozie, Clouds, Flume, Solr/Lucene, and ZooKeeper.
27. What do you know about SequenceFile?
Answer: A SequenceFile is defined as a flat file that contains binary key-value pairs. It is mainly
used in the input/output formats of MapReduce. The map outputs are stored internally as
SequenceFiles.

The different formats of SequenceFile are –

Record-compressed key/value records – In this format, only the values are compressed.
Block-compressed key/value records – In this format, both the values and keys are separately
stored in blocks and then compressed.
Uncompressed key/value records – In this format, neither values nor keys are compressed.
28. Explain the functions of Job Tracker.
Answer: In Hadoop, the JobTracker performs various functions, which are the following –

1. It manages resources, tracks the availability of resources, and manages the life cycle of
tasks.
2. It is responsible for identifying the location of data by communicating with the NameNode.
3. It executes the tasks on given nodes by finding the best TaskTracker node.
4. The JobTracker monitors all TaskTrackers individually and then submits the
overall job status to the client.
5. It is responsible for tracking the execution of MapReduce workloads from the local
node to the slave nodes.
Hadoop HDFS Interview Questions
29. How is HDFS different from NAS?

Answer: The following points differentiate HDFS from NAS –

● Hadoop Distributed File System (HDFS) is a distributed file system that stores data using
commodity hardware, whereas Network Attached Storage (NAS) is just a file-level server
for data storage, connected to a computer network.
● HDFS stores data blocks in a distributed manner on all the machines present in a
cluster, whereas NAS stores data on dedicated hardware.
● HDFS stores data using commodity hardware, which makes it cost-effective, while NAS
stores data on high-end devices that involve high expenses.
● HDFS works with the MapReduce paradigm, while NAS does not work with MapReduce, as
data and computation are stored separately.
30. Is HDFS faulttolerant? If yes, how?
31. Differentiate HDFS Block and Input Split.
Answer: The main difference between the HDFS Block and the Input Split is that the HDFS Block is
the physical division of the data, whereas the Input Split is the logical division
of the data. For processing, HDFS first divides data into blocks and then stores all the blocks
together, while MapReduce first divides the data into input splits and then assigns each input split to
a mapper function.
32. What happens when two clients try to access the same file in HDFS?
Answer: When the first client contacts the NameNode to open the file for writing, the NameNode grants a
lease to the client to create this file. When the second client sends a request to open that same file
for writing, the NameNode finds that the lease for that file has already been given to another client, and
thus rejects the second client's request.
33. What is the block in HDFS?
Answer: The smallest site, or say location, on the hard drive that is available to store data is known
as the block. The data in HDFS is stored as blocks and then distributed over the Hadoop cluster.
The whole file is first divided into small blocks and then stored as separate units.
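The division into blocks is plain arithmetic, sketched below in Python (128 MB is the Hadoop 2 default block size; the 300 MB file size is an invented example):

```python
# Arithmetic sketch of HDFS block division: a file becomes a series
# of full-size blocks plus one (possibly smaller) final block.
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return the byte sizes of the blocks a file occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
```

A 300 MB file thus occupies two full 128 MB blocks and one 44 MB block; the last block does not waste the remaining space, unlike a fixed-size disk partition.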
Hadoop Developer Interview Questions
34. What is Apache Yarn?
Answer: YARN stands for Yet Another Resource Negotiator. It is the Hadoop cluster resource
management system. It was introduced in Hadoop 2 to help MapReduce and is the next-generation
computation and resource management framework in Hadoop. It allows Hadoop to support more
varied processing approaches and a broader array of applications.
35. What is Node Manager?
Answer: Node Manager is the YARN equivalent of the Tasktracker. It takes in instructions from the
ResourceManager and manages resources available on a single node. It is responsible for
containers and also monitors and reports their resource usage to the ResourceManager. Every
single container process that runs on a slave node is initially provisioned, monitored, and tracked
by the NodeManager daemon corresponding to that slave node.
36. What is the RecordReader in Hadoop used for?
Answer: In Hadoop, the RecordReader is used to read the split data into a single record. It is important to
combine data because Hadoop splits data into various blocks. For example, if the input data is split like
–

Row 1: Welcome to
Row 2: the Hadoop world

Using the RecordReader, it will be read as "Welcome to the Hadoop world".
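Conceptually, the reader reassembles the pieces of one logical record that a split cut apart before handing it to the mapper. A minimal Python sketch of that idea (not the actual Hadoop RecordReader API):

```python
# Conceptual sketch: a logical record split across two physical
# chunks is joined back into one record before it reaches the mapper.
def read_record(chunks):
    """Join the pieces of one record that a split boundary cut apart."""
    return " ".join(chunk.strip() for chunk in chunks)

record = read_record(["Welcome to", "the Hadoop world"])
```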
37. What is the procedure to compress mapper output without affecting
reducer output?
In order to compress the mapper output without affecting the reducer output, set the following:

conf.setBoolean("mapreduce.map.output.compress", true);
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);
38. Explain different methods of a Reducer.
The different methods of a Reducer are as follows:

1. setup() – It is used to configure different parameters, such as input data size.
Syntax: public void setup(Context context)
2. cleanup() – It is used for cleaning up all the temporary files at the end of the task.
Syntax: public void cleanup(Context context)
3. reduce() – This method is known as the heart of the reducer. It is called once
per key with the associated reduce task.
Syntax: public void reduce(Key key, Iterable<Value> values, Context context)
39. How can you configure the replication factor in HDFS?
For the configuration of HDFS, the hdfs-site.xml file is used. In order to change the default value of the
replication factor for all the files stored in HDFS, the following property is changed in hdfs-site.xml:

dfs.replication
40. What is the use of “jps” command?
The "jps" command is used to check whether the Hadoop daemons are in a running state. This
command will list all the Hadoop daemons running on the machine, i.e. NameNode, NodeManager,
ResourceManager, DataNode, etc.
41. What is the procedure to restart “NameNode” or all other daemons in
Hadoop?
There are different methods to restart the NameNode and all other daemons in Hadoop –

Method to restart the NameNode: First, stop the NameNode using the command
./sbin/hadoop-daemon.sh stop namenode, and then start the NameNode again using the command
./sbin/hadoop-daemon.sh start namenode.

Method to restart all the daemons: Use the command ./sbin/stop-all.sh to stop all the daemons at one
time, and then use the command ./sbin/start-all.sh to start all the stopped daemons at the same time.
42. What is the query to transfer data from Hive to HDFS?
The query to transfer data from Hive to HDFS is –

hive> insert overwrite directory '/' select * from emp;

The output of this query will be stored in part files at the specified HDFS path.
43. What are the common Hadoop shell commands used for Copy
operation?
The common Hadoop shell commands for the copy operation are –

fs -copyToLocal
fs -put
fs -copyFromLocal
Scenario-Based Hadoop Interview Questions
Often you will be asked some tricky questions regarding particular scenarios and how you would handle
them. The reason for asking such Hadoop interview questions is to check your Hadoop skills. These
Hadoop interview questions test how you apply your Hadoop knowledge and approach to
solving a given big data problem. The following scenario-based Hadoop interview questions will give you an
idea.
44. You have a directory XYZ that has the following files –
Hadoop123Training.txt, _Spark123Training.txt, #DataScience123Training.txt,
.Salesforce123Training.txt. If you pass the XYZ directory to the Hadoop
MapReduce jobs, how many files are likely to be processed?
Answer: Hadoop123Training.txt and #DataScience123Training.txt are the only files that will be
processed by MapReduce jobs. This happens because, when processing a file in Hadoop using a
FileInputFormat, any file with a hidden-file prefix such as "_" or "." is skipped. The
MapReduce FileInputFormat uses the HiddenFileFilter class by default to ignore all such files.
However, we can create our own custom filter to change this criterion.
45. We have a Hive partitioned table where the country is the partition
column. We have 10 partitions and data is available for just one country. If
we want to copy the data for other 9 partitions, will it be reflected with a
command or manually?
Answer: In the above case, the data will only be available for all the other partitions when the data
is loaded through a command, instead of copying it manually.
46. What is the difference between the -put, -copyToLocal, and
-copyFromLocal commands?
-put: This command is used to copy a file from a source to a destination.
-copyToLocal: This command is used to copy a file from the Hadoop file system to the local file system.
-copyFromLocal: This command is used to copy a file from the local file system to the Hadoop
file system.
47. What is the difference between Left Semi Join and Inner Join?

Answer: The Left Semi Join will return tuples only from the left-hand table, while the Inner Join will return
the common tuples from both tables (i.e. the left-hand and right-hand tables) depending on the given
condition.
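The distinction can be shown with small lists of (key, value) rows in Python (the data is invented; real Hive joins operate on tables):

```python
# Inner join: matched rows from BOTH sides, once per matching pair.
# Left semi join: rows from the LEFT side that have at least one
# match on the right, each returned at most once.
def inner_join(left, right):
    return [(lk, lv, rv)
            for lk, lv in left
            for rk, rv in right
            if lk == rk]

def left_semi_join(left, right):
    right_keys = {rk for rk, _ in right}
    return [(lk, lv) for lk, lv in left if lk in right_keys]

left = [(1, "a"), (2, "b")]
right = [(1, "x"), (1, "y")]
```

Note how key 1 appears twice in the inner join (once per right-side match) but only once in the semi join, and key 2 appears in neither.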
48. What are the default block sizes in Hadoop 1 and Hadoop 2? Can they be changed?

Answer: The default value of the block size in Hadoop 1 is 64 MB.
The default value of the block size in Hadoop 2 is 128 MB.

Yes, it is possible to change the block size from the default value. The following parameter is used in
the hdfs-site.xml file to change and set the block size in Hadoop –

dfs.block.size
49. How will you check whether the NameNode is working or not?

Answer: The NameNode can be checked with the jps command, which lists the running Hadoop daemons. Alternatively, the following status command can be used –

/etc/init.d/hadoop-0.20-namenode status
50. MapReduce jobs are getting failed on a recently restarted cluster while
these jobs were working well before the restart. What can be the reason for
this failure?
Conclusion
If you want any other information about Hadoop, just leave a comment below and our Hadoop expert
will get in touch with you. In case you are looking for Big Data certification (HDPCA/HDPCD) online
training, click here.
© Copyright 2017. Whizlabs Software Pvt. Ltd. All Rights Reserved