
Cloudera CCD-470 Exam

QUESTION NO: 1
When is the earliest point at which the reduce method of a given Reducer can be called?
A. As soon as at least one mapper has finished processing its input split.
B. As soon as a mapper has emitted at least one record.
C. Not until all mappers have finished processing all records.
D. It depends on the InputFormat used for the job.
Answer: C
Explanation: In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.
Note: The reduce phase has 3 steps: shuffle, sort, reduce. Shuffle is where the data is collected by
the reducer from each mapper. This can happen while mappers are generating data since it is only
a data transfer. On the other hand, sort and reduce can only start once all the mappers are done.
Why is starting the reducers early a good thing? Because it spreads out the data transfer from the
mappers to the reducers over time, which is a good thing if your network is the bottleneck.
Why is starting the reducers early a bad thing? Because they "hog up" reduce slots while only
copying data. Another job that starts later that will actually use the reduce slots now can't use
them.
You can customize when the reducers start up by changing the default value of
mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 will wait for all the
mappers to finish before starting the reducers. A value of 0.0 will start the reducers right away. A
value of 0.5 will start the reducers when half of the mappers are complete. You can also change
mapred.reduce.slowstart.completed.maps on a job-by-job basis.
Typically, keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has
multiple jobs running at once. This way the job doesn't hog up reducers when they aren't doing
anything but copying data. If you only ever have one job running at a time, doing 0.1 would
probably be appropriate.
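As a hedged illustration (the values, jar name and driver class below are examples only, and the per-job override assumes the driver implements the Tool interface so that generic options are parsed), the cluster-wide default would go in mapred-site.xml and a single job could override it on the command line:
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>
hadoop jar myjob.jar MyDriver -D mapred.reduce.slowstart.completed.maps=0.95 input output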
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, When are the reducers started in a MapReduce job?
QUESTION NO: 2
Which describes how a client reads a file from HDFS?
A. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested data
responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then queries the
DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode
redirects the client to the DataNode that holds the requested data block(s). The client then reads
the data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts the
DataNode that holds the requested data block. Data is transferred from the DataNode to the
NameNode, and then from the NameNode to the client.
Answer: A
Explanation: Client communication with HDFS happens using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives; the client then reads the data directly from those DataNodes, and the NameNode itself never serves the file data.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How the Client
communicates with HDFS?
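For illustration, a minimal client-side read using the FileSystem API might look like the sketch below (the HDFS path is hypothetical). The open() call obtains the block locations from the NameNode, and the subsequent reads stream the bytes directly from the DataNodes:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    // open() asks the NameNode for the block locations of the file
    FSDataInputStream in = fs.open(new Path("/user/example/data.txt"));
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    // The actual bytes are read directly from the DataNodes holding the blocks
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
    fs.close();
  }
}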
QUESTION NO: 3
You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text
keys, IntWritable values. Which interface should your class implement?
A. Combiner <Text, IntWritable, Text, IntWritable>
B. Mapper <Text, IntWritable, Text, IntWritable>
C. Reducer <Text, Text, IntWritable, IntWritable>
D. Reducer <Text, IntWritable, Text, IntWritable>
E. Combiner <Text, Text, IntWritable, IntWritable>
Answer: D
Explanation: In the old (mapred) API a combiner is implemented as a Reducer whose input key/value types and output key/value types both match the map output types, so the class should implement Reducer<Text, IntWritable, Text, IntWritable>. There is no separate Combiner interface.
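A minimal sketch of such a combiner for the old (mapred) API is shown below; the class name SumCombiner and the summing logic are illustrative assumptions, not part of the question:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumCombiner extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Partially aggregate the values for this key on the map side,
    // emitting the same (Text, IntWritable) types that were received.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
The combiner would then be registered in the driver with conf.setCombinerClass(SumCombiner.class).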
QUESTION NO: 4
Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.
A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming
E. mapred
Answer: D
Explanation: Hadoop streaming is a utility that comes with the Hadoop distribution. The utility
allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or
the reducer.
Reference: http://hadoop.apache.org/common/docs/r0.20.1/streaming.html (Hadoop Streaming, second sentence)
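As a hedged example (the script names, input/output paths and the exact location of the streaming JAR are assumptions that vary by distribution), a streaming job with Python scripts for the mapper and reducer could be launched like this:
$ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/example/input \
    -output /user/example/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py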
QUESTION NO: 5
How are keys and values presented and passed to the reducers during a standard sort and shuffle
phase of MapReduce?
A. Keys are presented to reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to reducer in sorted order; values for a given key are sorted in ascending
order.
C. Keys are presented to a reducer in random order; values for a given key are not sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in
ascending order.
Answer: A
Explanation: Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the
same key).
The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.
Secondary Sort
To achieve a secondary sort on the values returned by the value iterator, the application should
extend the key with the secondary key and define a grouping comparator. The keys will be sorted
using the entire key, but will be grouped using the grouping comparator to decide which keys and
values are sent in the same call to reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of
values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter via
TaskInputOutputContext.write(Object, Object).
The output of the Reducer is not re-sorted.
Reference: org.apache.hadoop.mapreduce, Class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
QUESTION NO: 6
Assuming default settings, which best describes the order of data provided to a reducer's reduce method:
A. The keys given to a reducer aren't in a predictable order, but the values associated with those keys always are.
B. Both the keys and values passed to a reducer always appear in sorted order.
C. Neither keys nor values are in any predictable order.
D. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order.
Answer: D
Explanation: During the shuffle and sort phases the framework merge-sorts the reducer inputs by key, so keys arrive at the reduce method in sorted order. The values for a given key, however, are in no predictable order; obtaining sorted values requires a secondary sort (extending the key with the secondary key and defining a grouping comparator), as described in the previous question.
Reference: org.apache.hadoop.mapreduce, Class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
QUESTION NO: 7
You wrote a map function that throws a runtime exception when it encounters a control character in input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters.
Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:
A. You will have forty-eight failed task attempts
B. You will have seventeen failed task attempts
C. You will have five failed task attempts
D. You will have twelve failed task attempts
E. You will have twenty failed task attempts
Answer: E
Explanation: There will be four failed task attempts for each of the five file splits, giving 5 x 4 = 20 failed attempts in total.
Note:
When the JobTracker is notified of a task attempt that has failed (by the TaskTracker's heartbeat call), it will reschedule execution of the task. The JobTracker will try to avoid rescheduling the task on a TaskTracker where it has previously failed. Furthermore, if a task fails four times (or more), it will not be retried further. This value is configurable: the maximum number of attempts to run a task is controlled by the mapred.map.max.attempts property for map tasks and mapred.reduce.max.attempts for reduce tasks. By default, if any task fails four times (or whatever the maximum number of attempts is configured to), the whole job fails.
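For illustration (the jar and driver names are placeholders, and the driver is assumed to implement the Tool interface so generic options are honored), the retry limit could be overridden for a single job like this:
hadoop jar myjob.jar MyDriver -D mapred.map.max.attempts=4 -D mapred.reduce.max.attempts=4 input output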
QUESTION NO: 8
You want to populate an associative array in order to perform a map-side join. You've decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed.
Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array?
A. combine
B. map
C. init
D. configure
Answer: D
Explanation: See 3) below.
Here is an illustrative example of how to use the DistributedCache:
// Setting up the cache for the application
1. Copy the requisite files to the FileSystem:
$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
2. Set up the application's JobConf:
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
3. Use the cached files in the Mapper or Reducer:
public static class MapClass extends MapReduceBase
    implements Mapper<K, V, K, V> {

  private Path[] localArchives;
  private Path[] localFiles;

  public void configure(JobConf job) {
    // configure() runs once per task, before any calls to map(), so it is the
    // right place to read the cached file and build the associative array.
    localArchives = DistributedCache.getLocalCacheArchives(job);
    localFiles = DistributedCache.getLocalCacheFiles(job);
  }

  public void map(K key, V value,
                  OutputCollector<K, V> output, Reporter reporter)
      throws IOException {
    // Use data from the cached archives/files here
    // ...
    output.collect(k, v);
  }
}
Reference: org.apache.hadoop.filecache , Class DistributedCache
QUESTION NO: 9
You've written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?
A. Partitioner
B. OutputFormat
C. WritableComparable
D. Writable
E. InputFormat
F. Combiner
Answer: F
Explanation: Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on each mapper node, which reduces the amount of data that needs to be transferred across the network to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative.
QUESTION NO: 10
Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume
that the two tables are formatted as comma-separated files in HDFS.
A. Yes.
B. Yes, but only if one of the tables fits into memory
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.
Answer: A
Explanation: Note:
* Join Algorithms in MapReduce

A) Reduce-side join
B) Map-side join
C) In-memory join
/ Striped variant
/ Memcached variant
* Which join to use?
/ In-memory join > map-side join > reduce-side join
/ Limitations of each?
In-memory join: memory
Map-side join: sort order and partitioning
Reduce-side join: general purpose
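To make the reduce-side join concrete, here is a minimal sketch using the new API. The file names, the comma-separated layout and the assumption that the join key is the first column are all illustrative: each mapper output value is tagged with the file it came from, and the reducer emits the cross product of the matching rows.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoin {

  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Tag each record with the name of the file it came from (e.g. customers.csv
      // or orders.csv) so the reducer can tell the two tables apart.
      String source = ((FileSplit) context.getInputSplit()).getPath().getName();
      String[] fields = value.toString().split(",", 2);
      if (fields.length < 2) return;           // skip malformed lines
      outKey.set(fields[0]);                   // join key = first CSV column
      outValue.set(source + "\t" + fields[1]); // tag + remaining columns
      context.write(outKey, outValue);
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> left = new ArrayList<String>();
      List<String> right = new ArrayList<String>();
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if (parts[0].startsWith("customers")) left.add(parts[1]);
        else right.add(parts[1]);
      }
      // Emit the cross product of matching rows: the relational join for this key.
      for (String l : left) {
        for (String r : right) {
          context.write(key, new Text(l + "," + r));
        }
      }
    }
  }
}
Both input directories are simply added with FileInputFormat.addInputPath(); buffering all values for one key in the reducer is the memory limitation that the striped and Memcached variants above are meant to address.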
QUESTION NO: 11
You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper's map method?
A. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer.
E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.
Answer: C
Explanation: The mapper output (intermediate data) is stored on the local file system (not HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, Where is the Mapper output (intermediate key-value data) stored?
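As a hedged illustration, on an MRv1 cluster the directories used for this intermediate data are typically governed by the mapred.local.dir property in mapred-site.xml; the paths below are examples only:
<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local</value>
</property>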
QUESTION NO: 12
You want to understand more about how users browse your public website, such as which pages
they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How
will you gather this data for your analysis?
A. Ingest the server web logs into HDFS using Flume.
B. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
C. Import all users' clicks from your OLTP databases into Hadoop using Sqoop.
D. Channel these clickstreams into Hadoop using Hadoop Streaming.
E. Sample the weblogs from the web servers, copying them into Hadoop using curl.
Answer: A
Explanation: Flume is designed to collect and aggregate log data from a large number of sources, such as a farm of 200 web servers, and to push it into HDFS, where it can then be analyzed.
Note: Hadoop MapReduce for parsing weblogs. Here are the steps for parsing a log file using Hadoop MapReduce:
Load log files into the HDFS location using this Hadoop command:
hadoop fs -put <local file path of weblogs> <hadoop HDFS location>
The opencsv 2.3 JAR is used for parsing log records.
Below is the Mapper program for parsing the log file from the HDFS location.
public static class ParseMapper
    extends Mapper<Object, Text, NullWritable, Text> {

  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the space-separated, quote-delimited log line into fields.
    CSVParser parser = new CSVParser(' ', '\"');
    String[] sp = parser.parseLine(value.toString());
    int spSize = sp.length;
    StringBuffer rec = new StringBuffer();
    for (int i = 0; i < spSize; i++) {
      rec.append(sp[i]);
      if (i != (spSize - 1))
        rec.append(",");
    }
    // Re-emit the fields as a single comma-separated record.
    word.set(rec.toString());
    context.write(NullWritable.get(), word);
  }
}
The command below is the Hadoop-based log parse execution. The MapReduce program is attached in this article. You can add extra parsing methods in the class. Be sure to create a new JAR with any change and move it to the Hadoop distributed JobTracker system.
hadoop jar <path of logparse jar> <hadoop HDFS logfile path> <output path of parsed log file>
The output file is stored in the HDFS location, and the output file name starts with "part-".
QUESTION NO: 13
MapReduce v2 (MRv2/YARN) is designed to address which two issues?
A. Single point of failure in the NameNode.
B. Resource pressure on the JobTracker.
C. HDFS latency.
D. Ability to run frameworks other than MapReduce, such as MPI.
E. Reduce complexity of the MapReduce APIs.
F. Standardize on a single MapReduce API.
Answer: B,D
Explanation: YARN (Yet Another Resource Negotiator), as an aspect of Hadoop, has two major
kinds of benefits:
* (D) The ability to use programming frameworks other than MapReduce.
/ MPI (Message Passing Interface) was mentioned as a paradigmatic example of a MapReduce alternative.
* Scalability, no matter what programming framework you use.
Note:
* The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker,
resource management and job scheduling/monitoring, into separate daemons. The idea is to have
a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is
either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.
* (B) The central goal of YARN is to clearly separate two things that are unfortunately smushed
together in current Hadoop, specifically in (mainly) JobTracker:
/ Monitoring the status of the cluster with respect to which nodes have which resources available.
Under YARN, this will be global.
/ Managing the parallel execution of any specific job. Under YARN, this will be done separately for each job.
The current Hadoop MapReduce system is fairly scalable: Yahoo! runs 5,000 Hadoop jobs, truly concurrently, on a single cluster, for a total of 1.5 to 2 million jobs per cluster per month. Still, YARN will remove scalability bottlenecks.
Reference: Apache Hadoop YARN Concepts & Applications
QUESTION NO: 14
You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you've decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface.
Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop?
A. hadoop mapred.job.name=Example MyDriver input output
B. hadoop MyDriver mapred.job.name=Example input output
C. hadoop MyDriver -D mapred.job.name=Example input output
D. hadoop setproperty mapred.job.name=Example MyDriver input output
E. hadoop setproperty (mapred.job.name=Example) MyDriver input output
Answer: C
Explanation: Because the driver is run through ToolRunner, generic options are parsed before the driver's own arguments, so you configure the property using the -D key=value notation:
-D mapred.job.name='My Job'
You can list the available options by calling the streaming jar with just the -info argument.
Reference: Python hadoop streaming: Setting a job name
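A minimal sketch of such a Tool-based driver is shown below (the class name and the old-API job setup are illustrative). ToolRunner hands the command line to GenericOptionsParser, so -D properties such as mapred.job.name=Example are already applied to the configuration returned by getConf() before run() is called:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D properties parsed by ToolRunner.
    JobConf conf = new JobConf(getConf(), MyDriver.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyDriver(), args));
  }
}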
QUESTION NO: 15
You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text).
Identify what determines the data types used by the Mapper for a given job.
A. The key and value types specified in the JobConf.setMapInputKeyClass and
JobConf.setMapInputValuesClass methods
B. The data types specified in HADOOP_MAP_DATATYPES environment variable
C. The mapper-specification.xml file submitted with the job determines the mapper's input key and value types.
D. The InputFormat used by the job determines the mapper's input key and value types.
Answer: D
Explanation: The input types fed to the mapper are controlled by the InputFormat used. The
default input format, "TextInputFormat," will load data in as (LongWritable, Text) pairs. The long
value is the byte offset of the line in the file. The Text object holds the string contents of the line of
the file.
Note: The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass(). By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the setMapOutputKeyClass() and setMapOutputValueClass() methods of the JobConf class will override these.
Reference: Yahoo! Hadoop Tutorial, THE DRIVER METHOD
QUESTION NO: 16
Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application
containers and monitoring application resource usage?
A. ResourceManager
B. NodeManager
C. ApplicationMaster
D. ApplicationMasterService
E. TaskTracker
F. JobTracker
Answer: B
Explanation: The fundamental idea of MRv2 (YARN) is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. The NodeManager is the per-machine agent responsible for launching the applications' containers and monitoring their resource usage (CPU, memory, disk, network), reporting this back to the ResourceManager.
Reference: Apache Hadoop YARN Concepts & Applications
QUESTION NO: 17
Which best describes how TextInputFormat processes input files and line breaks?
A. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader
of the split that contains the beginning of the broken line.
B. Input file splits may cross line breaks. A line that crosses file splits is read by the
RecordReaders of both splits containing the broken line.
C. The input file is split exactly at the line breaks, so each RecordReader will read a series of
complete lines.
D. Input file splits may cross line breaks. A line that crosses file splits is ignored.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader
of the split that contains the end of the broken line.
Answer: A
Explanation: As the Map operation is parallelized, the input file set is first split into several pieces called FileSplits. If an individual file is so large that it would affect seek time, it is split into several splits. The splitting does not know anything about the input file's internal logical structure; for example, line-oriented text files are split on arbitrary byte boundaries. A new map task is then created per FileSplit.
When an individual map task starts it will open a new output writer per configured reduce task. It
will then proceed to read its FileSplit using the RecordReader it gets from the specified
InputFormat. InputFormat parses the input and generates key-value pairs. InputFormat must also
handle records that may be split on the FileSplit boundary. For example TextInputFormat will read
the last line of the FileSplit past the split boundary and, when reading other than the first FileSplit,
TextInputFormat ignores the content up to the first newline.
Reference: How Map and Reduce operations are actually carried out
QUESTION NO: 18

For each input key-value pair, mappers can emit:


A. As many intermediate key-value pairs as designed. There are no restrictions on the types of
those key-value pairs (i.e., they can be heterogeneous).
B. As many intermediate key-value pairs as designed, but they cannot be of the same type as the
input key-value pair.
C. One intermediate key-value pair, of a different type.
D. One intermediate key-value pair, but of the same type.
E. As many intermediate key-value pairs as designed, as long as all the keys have the same types
and all the values have the same type.
Answer: E
Explanation: Mapper maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input records into intermediate records. The
transformed intermediate records do not need to be of the same type as the input records. A given
input pair may map to zero or many output pairs.
Reference: Hadoop Map-Reduce Tutorial
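As an illustration of this, the word-count style mapper sketched below (the type choices are just an example) may emit zero, one, or many (Text, IntWritable) pairs per input record, but every emitted key has the same type and every emitted value has the same type:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // One input line can produce any number of (word, 1) pairs, including none.
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}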
QUESTION NO: 19
You have the following key-value pairs as output from your Map task:
(the, 1)
(fox, 1)
(faster, 1)
(than, 1)
(the, 1)
(dog, 1)
How many keys will be passed to the Reducer's reduce method?
A. Six
B. Five
C. Four
D. Two
E. One
F. Three
Answer: B
Explanation: The two (the, 1) pairs share the same key, so they are grouped into a single call to reduce. The reduce method therefore sees five distinct keys: the, fox, faster, than, and dog.
QUESTION NO: 20
You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How will you obtain these user records?
A. HDFS command
B. Pig LOAD command
C. Sqoop import
D. Hive LOAD DATA command
E. Ingest with Flume agents
F. Ingest with Hadoop Streaming
Answer: C
Explanation: Sqoop is designed to import records from a relational (OLTP) database into HDFS. The web logs are already in the Hadoop file system, so only the user profile records need to be brought across; once both data sets are in HDFS they can be joined and analyzed, for example with Pig.
Apache Hadoop and Pig provide excellent tools for extracting and analyzing data from very large Web logs. We use Pig scripts for sifting through the data and to extract useful information from the Web logs. We load the log file into Pig using the LOAD command.
raw_logs = LOAD 'apacheLog.log' USING TextLoader AS (line:chararray);
Note 1:
Data Flow and Components
* Content will be created by multiple Web servers and logged in local hard discs. This content will
then be pushed to HDFS using FLUME framework. FLUME has agents running on Web servers;
these are machines that collect data intermediately using collectors and finally push that data to
HDFS.
* Pig Scripts are scheduled to run using a job scheduler (could be cron or any sophisticated batch
job solution). These scripts actually analyze the logs on various dimensions and extract the
results. Results from Pig are by default inserted into HDFS, but we can use storage
implementation for other repositories also such as HBase, MongoDB, etc. We have also tried the
solution with HBase (please see the implementation section). Pig Scripts can either push this data
to HDFS and then MR jobs will be required to read and push this data into HBase, or Pig scripts
can push this data into HBase directly. In this article, we use scripts to push data onto HDFS, as we are showcasing the Pig framework applicability for log analysis at large scale.
* The database HBase will have the data processed by Pig scripts ready for reporting and further
slicing and dicing.
* The data-access Web service is a REST-based service that eases the access and integrations
with data clients. The client can be in any language to access REST-based API. These clients
could be BI- or UI-based clients.
Note 2:
The Log Analysis Software Stack
* Hadoop is an open source framework that allows users to process very large data in parallel. It's
based on the framework that supports Google search engine. The Hadoop core is mainly divided
into two modules:
1. HDFS is the Hadoop Distributed File System. It allows you to store large amounts of data using
multiple commodity servers connected in a cluster.
2. Map-Reduce (MR) is a framework for parallel processing of large data sets. The default
implementation is bonded with HDFS.
* The database can be a NoSQL database such as HBase. The advantage of a NoSQL database
is that it provides scalability for the reporting module as well, as we can keep historical processed
data for reporting purposes. HBase is an open source columnar DB or NoSQL DB, which uses HDFS. It can also use MR jobs to process data. It gives real-time, random read/write access to very large data sets -- HBase can save very large tables having millions of rows. It's a distributed
database and can also keep multiple versions of a single row.
* The Pig framework is an open source platform for analyzing large data sets and is implemented
as a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of
developers who write code in the Map-Reduce format, since code in Map-Reduce format needs to
be written in Java. In contrast, Pig enables users to write code in a scripting language.
* Flume is a distributed, reliable and available service for collecting, aggregating and moving a
large amount of log data (src flume-wiki). It was built to push large logs into Hadoop-HDFS for
further processing. It's a data flow solution, where there is an originator and destination for each
node and is divided into Agent and Collector tiers for collecting logs and pushing them to
destination storage.
Reference: Hadoop and Pig for Large-Scale Web Log Analysis
QUESTION NO: 21
What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?
A. You will not be able to compress the intermediate data.
B. You will no longer be able to take advantage of a Combiner.
C. By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.
D. There are no concerns with this approach. It is always advisable to use multiple reducers.
Answer: C
Explanation: Multiple reducers and total ordering:
If your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or because you've used the -r option to specify the number of reducers on the command line), then by default Hadoop will use the HashPartitioner to distribute records across the reducers. Use of the HashPartitioner means that you can't concatenate your output files to create a single sorted output file. To do this you'll need total ordering, for example with a TotalOrderPartitioner.
Reference: Sorting text files with MapReduce
QUESTION NO: 22
Given a directory of files with the following structure: line number, tab character, string:
Example:
1abialkjfjkaoasdfjksdlkjhqweroij
2kadfjhuwqounahagtnbvaswslmnbfgy
3kjfteiomndscxeqalkzhtopedkfsikj
You want to send each line as one record to your Mapper. Which InputFormat should you use to
complete the line: conf.setInputFormat (____.class) ; ?
A. SequenceFileAsTextInputFormat
B. SequenceFileInputFormat
C. KeyValueFileInputFormat
D. BDBInputFormat

Answer: B
Explanation: Note:
The output format for your first MR job should be SequenceFileOutputFormat - this will store the
Key/Values output from the reducer in a binary format, that can then be read back in, in your
second MR job using SequenceFileInputFormat.
Reference: How to parse CustomWritable from text in Hadoop
http://stackoverflow.com/questions/9721754/how-to-parse-customwritable-from-text-in-hadoop (see answer 1 and then see comment #1 for it)
QUESTION NO: 23
You need to perform statistical analysis in your MapReduce job and would like to call methods in the Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file. Which is the best way to make this library available to your MapReduce job at runtime?
A. Have your system administrator copy the JAR to all nodes in the cluster and set its location in
the HADOOP_CLASSPATH environment variable before you submit your job.
B. Have your system administrator place the JAR file on a Web server accessible to all cluster
nodes and then set the HTTP_JAR_URL environment variable to its location.
C. When submitting the job on the command line, specify the -libjars option followed by the JAR file path.
D. Package your code and the Apache Commons Math library into a zip file named JobJar.zip
Answer: C
Explanation: The usage of the jar command is like this:
Usage: hadoop jar <jar> [mainClass] args...
If you want the commons-math3 JAR to be available to all the tasks you can do any one of these:
1. Copy the JAR file into the $HADOOP_HOME/lib directory on every node, or
2. Use the generic option -libjars.
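For example (the jar names, driver class and paths are placeholders, and the driver is assumed to implement the Tool interface so that generic options are honored):
hadoop jar myjob.jar MyDriver -libjars commons-math.jar input output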
QUESTION NO: 24
The Hadoop framework provides a mechanism for coping with machine issues such as faulty
configuration or impending hardware failure. MapReduce detects that one or a number of
machines are performing poorly and starts more copies of a map or reduce task. All the tasks run simultaneously, and whichever task finishes first is used. This is called:
A. Combine
B. IdentityMapper
