
A Seminar-II Report On

Pipelined-MapReduce: An improved MapReduce Parallel programming model


submitted in partial fulfilment of the requirements for the award of the degree of First Year Master of Engineering in Computer Engineering by Prof. Deshmukh Sachin B. under the guidance of

Prof. B.L.Gunjal

DEPARTMENT OF COMPUTER ENGINEERING

AMRUTVAHINI COLLEGE OF ENGINEERING, SANGAMNER


A/P-GHULEWADI-422608, TAL. SANGAMNER, DIST. AHMEDNAGAR (MS) INDIA 2013-2014

DEPARTMENT OF COMPUTER ENGINEERING AMRUTVAHINI

COLLEGE OF ENGINEERING, SANGAMNER A/P-GHULEWADI-422608, TAL. SANGAMNER, DIST. AHMEDNAGAR (MS) INDIA 2013-14

CERTIFICATE
This is to certify that the Seminar-II report entitled

Pipelined-MapReduce: An improved MapReduce Parallel programming model


is submitted in partial fulfilment of the curriculum of First Year M.E. in Computer Engineering by

Prof. Deshmukh Sachin B.

Prof. B.L.Gunjal (Seminar Guide)

Prof. B.L.Gunjal (Seminar Coordinator)

Prof. R.L.Paikrao (H.O.D.)

Prof. Dr.G.J. Vikhe Patil (Principal)

University of Pune

CERTIFICATE
This is to certify that

Prof. Deshmukh Sachin B.
Student of First Year M.E. (Computer Engineering), was examined on the Dissertation Report entitled

Pipelined-MapReduce: An improved MapReduce Parallel programming model


on / /2013 at
Department of Computer Engineering,

Amrutvahini College of Engineering,


Sangamner - 422608

(. . . . . . . . . . . . . . . . . . . . .) Prof. B.L.Gunjal Internal Examiner

(. . . . . . . . . . . . . . . . . . . . .) Prof. A.N.Nawathe External Examiner

ACKNOWLEDGEMENT
As a wise person said about the nobility of the teaching profession, "Catch a fish and you feed a man his dinner, but teach a man how to catch a fish, and you feed him for life." In this context, I would like to thank my helpful seminar guide, Prof. B. L. Gunjal, who has been an incessant source of inspiration and help. Not only did she inspire me to undertake this assignment, she also advised me throughout its course and helped me during my times of trouble. I would like to thank the H.O.D. of the Department of Computer Engineering, Prof. Paikrao R. L., and the M.E. coordinator, Prof. Gunjal B. L., for motivating me. I would also like to thank the other members of the Computer Engineering department who helped me handle this assignment efficiently, and who were always ready to go out of their way to help us in our times of need.

Prof. Deshmukh Sachin B. M.E. Computer Engineering

ABSTRACT
Cloud MapReduce (CMR) is a framework for processing large batch data sets in the cloud. The Map and Reduce phases run sequentially, one after another. This leads to: 1. compulsory batch processing, 2. no parallelization of the Map and Reduce phases, and 3. increased delays. The current implementation is not suited for processing streaming data. We propose a novel architecture to support streaming data as input by using pipelining between the Map and Reduce phases in CMR, ensuring that the output of the Map phase is made available to the Reduce phase as soon as it is produced. This Pipelined MapReduce approach leads to increased parallelism between the Map and Reduce phases, thereby 1. supporting streaming data as input, 2. reducing delays, 3. enabling the user to take snapshots of the approximate output generated in a stipulated time frame, and 4. supporting cascaded MapReduce jobs. This cloud implementation is light-weight and inherently scalable.


Contents

Acknowledgement

Abstract

List of Figures

1 INTRODUCTION
1.1 Need of system
1.2 Map Reduce

2 LITERATURE SURVEY
2.1 Map Reduce
2.1.1 Mappers and Reducers
2.2 Hadoop
2.2.1 Introduction to Hadoop
2.2.2 Design of HDFS
2.2.3 Basic concepts of HDFS
2.3 Cloud Map Reduce
2.4 Online Map Reduce (Hadoop Online Prototype)

3 PROPOSED ARCHITECTURE
3.1 Architecture of Pipelined CMR
3.1.1 Input
3.1.2 Mapper Operation
3.1.3 Reduce Phase
3.2 The first design option
3.2.1 Handling Reducer Failures
3.3 The second design option
3.4 Hybrid Approach
3.5 Pipelined Architecture

4 Hadoop and HDFS
4.1 Hadoop System Architecture
4.1.1 Operating a Server Cluster
4.2 Data Storage and Analysis
4.3 Comparison with Other Systems
4.3.1 RDBMS
4.4 Hadoop Distributed File System (HDFS)
4.4.1 HDFS Federation
4.4.2 HDFS High-Availability
4.5 Algorithm
4.5.1 Algorithm For Word Count

5 ADVANTAGES, FEATURES AND APPLICATIONS
5.1 Advantages
5.2 Features
5.2.1 Time windows
5.2.2 Snapshots
5.2.3 Cascaded MapReduce Jobs

Conclusion

References

List of Figures

2.1 Map Reduce Process
2.2 Architecture Of Complete Hadoop Structure
2.3 Architecture Of Hadoop Distributed File System
3.1 Architecture of Pipelined Map Reduce
3.2 Hadoop Data Flow for Batch and Pipelined Map Reduce
4.1 Hadoop System Architecture
4.2 Operation Of Hadoop Cluster

Chapter 1
INTRODUCTION

1.1

Need of system

Cloud MapReduce (CMR) is gaining popularity among small companies for processing large data sets in cloud environments. The current implementation of CMR is designed for batch processing of data. Today, streaming data constitutes an important portion of web data in the form of continuous click-streams, feeds, micro-blogging, stock quotes, news and other such data sources. To process streaming data in Cloud MapReduce, significant changes must be made to the existing CMR architecture. We use pipelining between the Map and Reduce phases as an approach to support stream data processing, in contrast to the current implementation where the Reducers do not start working until all Mappers have finished. Thus, in our architecture, the Reduce phase too gets a continuous stream of data and can produce continuous output. This new architecture is explained in subsequent sections. Compute- and data-intensive processing is increasingly prevalent. In the near future, it is expected that the data volumes processed by applications will cross the peta-scale threshold, which will increase the computational requirements accordingly.

1.2

Map Reduce

Google's MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a


map function that processes input key/value pairs to produce a set of intermediate key/value pairs, and a reduce function that merges all of the intermediate values associated with the same intermediate key. MapReduce allows program developers to think in a data-centric way: they focus on applying transformations to sets of data records, while the MapReduce framework handles the details of distributed processing, network communication, coordination and fault tolerance. The MapReduce model is usually applied to large batch computations where the main concern is total completion time. The Google MapReduce framework and the open-source Hadoop framework reinforce this batch model through their implementation strategy: the complete output of each map and reduce stage is materialized to stable storage before it is consumed by the next stage or before the final output is produced. Materializing each stage in this way permits a simple and practical checkpoint/restart fault-tolerance mechanism, which is critical for large deployments. Pipelined-MapReduce is a modified implementation in which intermediate data is transferred between operators through a pipeline, while retaining the original MapReduce programming interface and fault-tolerance model. Pipelined-MapReduce has several advantages: a downstream element can begin consuming data before its producer has finished, which increases the opportunities for parallelism, improves utilization and reduces response time. Because reducers begin processing data as soon as mappers produce it, they can generate and refine an approximation of the final result while the job is executing. Pipelined-MapReduce thus broadens the domain of problems to which MapReduce can be applied.


Chapter 2
LITERATURE SURVEY

2.1

Map Reduce

MapReduce is a programming model developed by Google for processing large data sets. Initially, the data is divided into smaller splits that are independent of each other and can hence be processed in a parallel fashion. Each split consists of a set of (key, value) pairs that form records. The splits are distributed among a set of computing nodes. The model consists of two phases: a Map phase and a Reduce phase. The splits of data are assigned to independent computing nodes called Mappers. Mappers map the input (key, value) pairs into intermediate (key_i, value_i) pairs. The following stage consists of Reducers, which are computing nodes that take the intermediate (key_i, value_i) pairs generated by the Mappers. The Reducers combine the set of values associated with a particular key_i obtained from all the Mappers to produce a final output in the form of (key_i, value). The often cited word count example elegantly illustrates the computing phases of MapReduce. A large document is to be scanned and the number of occurrences of each word is to be determined. The solution using MapReduce proceeds as follows: 1. The document is divided into several splits (ensuring that no split results in splitting a whole word). The ID of a split is the input key, and the actual split (or a pointer to it) is the value corresponding to that key. Thus, the document is divided into a number of splits, each containing a set of (key, value) pairs. 2. There is a set of m Mappers. Each Mapper is assigned an input split based on some scheduling policy (FIFO, fair scheduling etc.); if a Mapper consumes its split, it asks for the next one. Each Mapper processes each input (key, value) pair in the split assigned to it according to some user-defined Map function to produce intermediate key/value pairs.


In this case, a Mapper that sees 2 occurrences of the word cake and 3 occurrences of the word pastry outputs the following: (Cake, 1) (Cake, 1) (Pastry, 1) (Pastry, 1) (Pastry, 1). Here, each intermediate (key, value) pair indicates one occurrence of the key. 3. The set of intermediate (key, value) pairs is then pulled by the Reducer workers from the Mappers. 4. The Reducers then work to combine all the values associated with a particular intermediate key (key_i) according to a user-defined Reduce function. Thus, in the above example, a Reducer worker, on processing the set of intermediate (key, value) pairs of the word count, would output: (Cake, 2) (Pastry, 3). 5. This output is then written to disk as the final output of the MapReduce job, which in this example gives the count of all words in the document. An optional Combiner phase is usually included at the Mapper worker to combine the local result of the Mapper before sending it over to the Reducer. This reduces network traffic between the Mapper and the Reducer.

2.1.1

Mappers and Reducers:

The Map Reduce framework operates exclusively on (key, value) pairs, that is, the framework views the input to the job as a set of (key, value) pairs and produces a set of (key, value) pairs as the output of the job, conceivably of different types. Key-value pairs form the basic data structure in Map Reduce. Keys and values may be primitives such as integers, floating point values, strings, and raw bytes, or they may be arbitrarily complex structures (lists, tuples, associative arrays, etc.). In Map Reduce, the programmer defines a mapper and a reducer with the following signatures:

Map: (k1, v1) -> [(k2, v2)]
Reduce: (k2, [v2]) -> [(k3, v3)]

Hence, the complete Hadoop cluster, i.e. HDFS together with Map Reduce, looks as shown in the following figure. As the figure suggests, the programmer can specify that the system sort each bin's contents according to any provided criterion. This customization can occasionally be useful. The programmer can also specify partitioning methods to more coarsely or more finely bin the map results.


Figure 2.1: Map Reduce Process

Figure 2.2: Architecture Of Complete Hadoop Structure

Although every job must specify both a mapper and a reducer, either of them can be the identity, or an empty stub that creates output records equal to the input records. In particular, an algorithm frequently calls for input categorization or ordering over many successive stages, suggesting a sequence of reducers without the need for intermediate mapping; identity mappers fill in to satisfy this architectural need.
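As a concrete illustration of such an identity stage, the following is a minimal sketch of an identity mapper written against the old Hadoop "mapred" API used elsewhere in this report; the class name is chosen here for illustration only. It forwards every record unchanged, which lets a job that only needs sorting and reducing still satisfy the requirement of having a mapper.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PassThroughMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {
  public void map(LongWritable key, Text value,
                  OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    // Emit the record exactly as it was received; all real work happens
    // in the reducer of this stage.
    output.collect(key, value);
  }
}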

2.2

Hadoop

Hadoop is an implementation of the MapReduce programming model developed by Apache. The Hadoop framework is used for batch processing of large data sets on a physical cluster of machines. It incorporates a distributed file system called the Hadoop Distributed File System (HDFS), a common set of commands, a scheduler, and the MapReduce execution framework.



As the entire complexity of cluster management, redundancy in HDFS, and consistency and reliability in case of node failure is included in the framework itself, the code base is huge: around 300,000 lines of code (LOC). Hadoop is popular for processing huge data sets, especially in social networking, targeted advertising, internet log processing, etc.

2.2.1

Introduction to Hadoop:

Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called Map Reduce. Hadoop runs on a collection of commodity, shared-nothing servers.

2.2.2

Design of HDFS:

Very large files: Very large in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access: HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.
Commodity hardware: Hadoop doesn't require expensive, highly reliable hardware to run on. It is designed to run on clusters of commodity hardware.

2.2.3
Basic concepts of HDFS:

Blocks:

HDFS has the concept of a block, but it is a much larger unit: 64 MB by default. As in a file system for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units.


Unlike a file system for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.
Name Node: The name node manages the file system namespace. It maintains the file system tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The name node also knows the data nodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from the data nodes when the system starts. Without the name node, the file system cannot be used.
Data Node: Data nodes are the workhorses of the file system. They store and retrieve blocks when they are told to, and they report back to the name node periodically with lists of the blocks that they are storing. The following diagram shows the basic HDFS architecture.

Figure 2.3: Architecture Of Hadoop Distributed File System

2.3

Cloud Map Reduce

Cloud MapReduce is a light-weight implementation of the MapReduce programming model on top of the Amazon cloud OS, using Amazon EC2 instances.


It is very compact (around 3,000 LOC) compared to Hadoop. It also gives significant performance improvements (in one case, over 60x) over traditional Hadoop. The architecture of CMR, as described in [2], consists of one input queue, multiple reduce queues which act as staging areas for holding the intermediate (key, value) pairs produced by the Mappers, a master reduce queue that holds the pointers to the reduce queues, and an output queue that holds the final results. In addition, the S3 file system is used to store the data to be processed, and SimpleDB is used to communicate the status of the worker nodes, which is then used to ensure consistency and reliability. Initially all the queues are set up. The user then puts the data (key, value) pairs to be processed in the input SQS queue. Typically, the key is a unique ID and the value is a pointer into S3. The Mappers poll the queue for work whenever they are free. Then, they dequeue one message from the queue and process it according to the user-defined Map function. The intermediate (key, value) pairs are pushed by the Mappers to the intermediate reduce queues as soon as they are produced. The choice of a reduce queue is made by hashing on the intermediate key generated, to ensure even load balancing and also to ensure that records with the same key land in the same reduce queue. After the Map phase is complete, each Reducer dequeues a pointer to one of the reduce queues from the master reduce queue and processes the reduce queue associated with that pointer. The Reducers apply the user-defined Reduce function by sorting the queue by intermediate key, merging the values associated with a key, and then writing the final output to the output queue. This is again batch processing, as the user submits a batch of data to be processed. Also, the Reducer workers don't start working until all the Mappers have finished all their Map tasks. Thus, it is not suited for processing streaming data.

2.4

Online Map Reduce (Hadoop Online Prototype)

It is a modification of the traditional Hadoop framework that incorporates pipelining between the Map and Reduce phases, thereby supporting parallelism between these phases and providing support for processing streaming data. The output of the Mappers is made available to the Reducers as soon as it is produced. A downstream dataflow element can begin processing before an upstream producer finishes. It carries out online aggregation of data to produce incrementally correct output. This also supports continuous queries.



This can also be used with batch data, and gives approximately 30% better throughput and response time because of the parallelism between the Map and Reduce phases. It also supports snapshots of output data, where the user can get a view of the output produced till some time instant instead of waiting for the entire batch of data to finish processing. Also, cascaded MapReduce jobs with pipelining are supported in this implementation, whereby the output of one job is fed to the input of another job as soon as it is produced. This leads to parallelism within a job, as well as between jobs. The architecture is particularly suitable for processing streaming data, as it is modelled as a dataflow architecture with windowing and online aggregation.


Chapter 3
PROPOSED ARCHITECTURE
The above implementations suffer from the following drawbacks: 1. HOP is unsuitable for the cloud, as Hadoop is a framework for distributed computing on physical clusters; hence, it lacks the inherent scalability and flexibility of the cloud. 2. In HOP, the code for handling HDFS, reliability, scheduling etc. is a part of the Hadoop framework itself, which makes it large and heavy-weight (around 300,000 LOC). 3. Cloud MapReduce does not support stream data processing, which is an important use case. Our proposal aims at bridging this gap between the heavyweight HOP and the light-weight, scalable Cloud MapReduce implementation, by providing support for processing stream data in Cloud MapReduce. This architecture is easier to build, maintain and run, as the underlying complexity of managing resources in the cloud is handled by the Amazon cloud OS, in contrast to a Hadoop-like framework where the issues of handling the file system, scheduling, availability and reliability are dealt with in the framework itself. The challenges involved in the implementation include: 1. providing support for streaming data at the input, 2. a novel design for output aggregation, 3. handling Reducer failures, and 4. handling windows based on timestamps. Currently, no open-source implementation exists for processing streaming data using MapReduce on top of the cloud. To the best of our knowledge, this is the first such attempt to integrate stream data processing capability with MapReduce on Amazon Web Services using EC2 instances. Such an implementation will provide significant value to stream processing applications such as the ones outlined in Chapter 5. We now describe the architecture of the Pipelined CMR approach.



3.1

Architecture of Pipelined CMR

3.1.1

Input

A drop-box concept can be used, where a folder on S3 (for example) is used to hold the data that is to be processed by Cloud MapReduce. The user is responsible for providing data in the drop-box, from which it will be sent to the input SQS queue. That way, the user can control the data that is sent to the drop-box and can use the implementation in either a batch or a stream processing manner. The data provided by the user must be in (key, value) format. An Input Handler function running on one EC2 instance polls the drop-box for new data. When a new record in (key, value) format is available in the drop-box, the Input Handler appends a unique MapID, a Timestamp, and a Unique Tag to the record and generates a record of a new form: (Metadata, Key, Value). The addition of the Timestamp is important for taking snapshots of the output (described later) and for preserving the order of data processing in case of stragglers. The Input Handler then pushes this generated record into SQS for processing by the Mapper.
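The record wrapping performed by the Input Handler can be sketched as below. This is only an illustration: the field layout, the delimiter and the class name are assumptions, and a real implementation would push the wrapped record to SQS through the AWS SDK rather than return it as a string.

import java.util.UUID;

public class InputHandlerSketch {
    private long nextMapId = 0;

    // Wrap a user-supplied (key, value) record with a MapID, a Timestamp and a
    // unique tag, producing the (Metadata, Key, Value) form described above.
    public String wrapRecord(String key, String value) {
        long mapId = nextMapId++;
        long timestamp = System.currentTimeMillis();   // used later for snapshots and windows
        String tag = UUID.randomUUID().toString();     // makes the record uniquely identifiable
        return mapId + "|" + timestamp + "|" + tag + "|" + key + "|" + value;
    }
}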

3.1.2

Mapper Operation

The Input Queue-Mapper interface is unchanged from the CMR implementation [2]. The Mapper, whenever it is free, pops one message from the input SQS queue, thereby removing the message from the queue for a visibility timeout, and processes it according to the user-defined Map function. If the Mapper fails, the message reappears on the input queue after the visibility timeout. If the Mapper is successful, it deletes the message from the queue and writes its status to SimpleDB. The Mapper generates a set of intermediate (key, value) pairs for each input split processed.
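The order of operations matters here: the message is deleted only after the Map function has succeeded, so a failed Mapper simply lets the message reappear and be retried. A minimal sketch of that loop is shown below; Message, InputQueue and StatusStore are hypothetical local interfaces standing in for the SQS and SimpleDB clients, not part of any real SDK.

interface Message { String id(); String body(); }
interface InputQueue { Message receive(); void delete(Message m); }
interface StatusStore { void markDone(String workerId, String messageId); }

class MapperWorkerSketch {
    private final InputQueue queue;
    private final StatusStore status;
    private final String workerId;

    MapperWorkerSketch(InputQueue queue, StatusStore status, String workerId) {
        this.queue = queue;
        this.status = status;
        this.workerId = workerId;
    }

    void runOnce() {
        Message msg = queue.receive();      // message becomes invisible for the timeout
        if (msg == null) return;            // nothing to process yet
        try {
            applyUserMapFunction(msg.body());
            queue.delete(msg);              // success: remove the message permanently
            status.markDone(workerId, msg.id());
        } catch (RuntimeException e) {
            // no delete: the message reappears after the visibility timeout
        }
    }

    void applyUserMapFunction(String record) {
        // user-defined Map function producing intermediate (key, value) pairs
    }
}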

3.1.3

Reduce Phase

The Mapper writes the intermediate records produced to ReduceQueues. ReduceQueues are intermediate staging queues implemented using SQS for holding the Mapper output. Reducers take their input from these queues. The Reduce phase as implemented in CMR will have to be modified to process stream data. For this, we considered two design options:



Figure 3.1: Architecture of Pipelined Map Reduce

3.2

The first design option:

As in CMR, the Mapper pushes each intermediate (key, value) pair to one of the reduce queues based on the hash value of the intermediate key. The hash function can be user-defined, or a default function is provided, as in CMR. The number of reduce queues must be at least equal to the number of Reducers, and preferably much larger to balance load better. Unlike the CMR implementation, where a Reducer can process any of the reduce queues by dequeuing one message from the master reduce queue (which holds pointers to all the ReduceQueues), in this case the Reducer will have to be statically bound to one or more reduce queues, so that records with a particular intermediate key are always processed by the same Reducer. This is essential for aggregating the values associated with the same key in the Reduce phase. Also, each Reducer will be statically bound to an output split associated with that Reducer, so that during online aggregation of values associated with a particular reduce key, a particular key does not land in two different output splits, giving incorrect aggregation. Thus, each Reducer will be bound statically to one or more reduce queues and one output split. This can be achieved by maintaining a database containing (ReducerID, ReduceQueuePointers, OutputQueuePointers, Status, timeLastUpdated), or an equivalent data structure in a non-relational database like SimpleDB. In this static implementation, ReduceQueuePointers hold the pointers to the reduce queues associated with a particular Reducer identified by the ReducerID, and OutputQueuePointers hold the pointers to the output queues (or rather output file pointers) associated with that ReducerID.
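The following is a minimal sketch of the key-to-queue hash and of one binding row kept per Reducer; the class and field names are illustrative assumptions, and in the actual design the binding rows would live in SimpleDB rather than in plain Java objects.

import java.util.List;

public class ReducePartitioningSketch {

    // Default partitioner: records with the same intermediate key always land
    // in the same reduce queue. A user-defined hash function may replace this.
    public static int reduceQueueFor(String intermediateKey, int numReduceQueues) {
        return (intermediateKey.hashCode() & Integer.MAX_VALUE) % numReduceQueues;
    }

    // One row of the binding table: the Reducer, the reduce queues it owns,
    // the output split(s) it owns, its liveness status and its last heartbeat.
    public static class ReducerBinding {
        String reducerId;
        List<String> reduceQueuePointers;
        List<String> outputQueuePointers;
        String status;          // "Live", "Dead" or "Idle"
        long timeLastUpdated;   // epoch milliseconds of the last status update
    }
}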

Pipelined-MapReduce: An improved MapReduce Parallel programing model P1212

Chapter3

3. PROPOSED ARCHITECTURE

3.2.1

Handling Reducer Failures

The Status field is used for handling Reducer failures. The Status can be one of Live, Dead or Idle. Each Reducer periodically updates the isAlive bit in its Status field to indicate that it has not failed (similar to the heartbeat signal in traditional Hadoop). A thread running in an EC2 instance monitors whether any Reducer has failed to update its status in the past TIMEOUT seconds. If it finds a Reducer which has not updated its status in that time, it sets its Status to Dead, searches for a Reducer with Idle status, and assigns to that Reducer the ReduceQueuePointers and OutputPointers previously held by the failed Reducer. When the visibility timeout expires, the messages processed by the failed Reducer again become visible on the ReduceQueues. As a Reducer does not update its status to Done in SimpleDB unless it has finished processing a given window of the input and removed the corresponding messages from the ReduceQueues, it is guaranteed that if the Reducer fails, the messages will reappear on the ReduceQueues, because SQS requires that messages be specifically deleted or else they reappear after the visibility timeout. Thus, Idle Reducers are used to handle the failures of other Reducers. Also, if, on taking over the job of some other failed Reducer, the original queues of the Idle Reducer start filling up, it will also process those queues, as the queues associated with the failed Reducer are added to the original list of queues associated with that Reducer, without replacement. The issue of which output split to write to when a record is read from a reduce queue (because it may be read from the original queue associated with a Reducer, or from a new queue assigned because of some other Reducer's failure) can be resolved by holding another database table that associates a particular reduce queue with a particular output split. This is essential for correct aggregation. The Reducer then writes the output generated to the correct output split, checking for duplicates, and if the key already exists, aggregating the new output with the output already available. This is called online aggregation. Thus, the following sequence of steps occurs for each intermediate (key, value) pair generated by the Mapper (ignoring here the issues involved in handling Reducer failures that have been discussed above): 1. The Mapper pushes the intermediate (key, value) pair to one of the reduce queues based on the hash value of the key. 2. Each Reducer is statically bound to one or more ReduceQueues, and polls only those queues for work. 3. It then writes the output to the output split assigned to it.
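Below is a minimal sketch of that monitor, reusing the hypothetical ReducerBinding rows from the earlier sketch; the method names and the in-memory list are illustrative stand-ins for the SimpleDB table the real monitor would query.

import java.util.List;

public class FailureMonitorSketch {

    // Called periodically by a thread running on an EC2 instance.
    public void checkForDeadReducers(List<ReducePartitioningSketch.ReducerBinding> bindings,
                                     long timeoutMillis) {
        long now = System.currentTimeMillis();
        for (ReducePartitioningSketch.ReducerBinding b : bindings) {
            if ("Live".equals(b.status) && now - b.timeLastUpdated > timeoutMillis) {
                b.status = "Dead";
                ReducePartitioningSketch.ReducerBinding idle = findIdle(bindings);
                if (idle != null) {
                    // The dead Reducer's queues and output splits are added to the
                    // idle Reducer's lists (not replacing them), so it keeps
                    // serving its own queues as well.
                    idle.reduceQueuePointers.addAll(b.reduceQueuePointers);
                    idle.outputQueuePointers.addAll(b.outputQueuePointers);
                    idle.status = "Live";
                }
            }
        }
    }

    private ReducePartitioningSketch.ReducerBinding findIdle(
            List<ReducePartitioningSketch.ReducerBinding> bindings) {
        for (ReducePartitioningSketch.ReducerBinding b : bindings) {
            if ("Idle".equals(b.status)) return b;
        }
        return null;
    }
}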



3.3

The second design option:

Alternatively, we can have a single queue between the Mappers and the Reducers, with all the intermediate (key, value) pairs generated by all the Mappers pushed to this intermediate queue. Reducers poll this IntermediateQueue for records. Aggregation is carried out as follows. There is a fixed number of output splits. Whenever a Reducer reads a (key, value) record from the IntermediateQueue, it applies a user-defined Reduce function to the record to produce an output (key, value) pair. It then selects an output split by hashing on the output key produced. If there already exists a record for that key in the output split, the new value is merged with the existing value; otherwise a new record is created with (key, value) as the output record. An issue with this approach is that, when aggregating the records directly in S3, the latency of S3 must be taken into account. If the delays are too long, the first design option would be better.
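A minimal in-memory sketch of that aggregation is given below, assuming word-count-style values that merge by addition; in the real design the output splits live in S3, which is exactly where the latency concern above comes from, and the class and method names here are illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OutputAggregatorSketch {
    private final List<Map<String, Long>> outputSplits = new ArrayList<>();

    public OutputAggregatorSketch(int numSplits) {
        for (int i = 0; i < numSplits; i++) {
            outputSplits.add(new HashMap<String, Long>());
        }
    }

    // Hash the output key to pick a split; merge with an existing record for
    // that key if one is present, otherwise create a new record.
    public void aggregate(String outputKey, long outputValue) {
        int split = (outputKey.hashCode() & Integer.MAX_VALUE) % outputSplits.size();
        outputSplits.get(split).merge(outputKey, outputValue, Long::sum);
    }
}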

3.4

Hybrid Approach

Both the above approaches could be combined as follows: Have multiple ReduceQueues, each linked statically to a particular Reducer, but instead of linking the output splits to the Reducer statically, use hashing on the output key of the (key, value) pair generated by the Reducer to select an output split. This will require fewer changes to the existing CMR architecture, but will involve static linking of Reducers to ReduceQueues. This is the approach that we prefer.

3.5

Pipelined Architecture

The figure below depicts the dataflow of two MapReduce implementations. The dataflow on the left corresponds to the output materialization approach used by Hadoop; the dataflow on the right allows pipelining, and we call it Pipelined-MapReduce. In the remainder of this section, we present our design and implementation of the Pipelined-MapReduce dataflow. In standard Hadoop, reduce tasks issue HTTP requests to pull their input from each TaskTracker. This means that map task execution is completely decoupled from reduce task execution. To support pipelining, we modified the map task to instead push data to reducers as it is produced. To give an intuition for how this works, we begin by describing a straightforward pipelined design, and then discuss the changes we had to make to achieve good performance.



Figure 3.2: Hadoop Data Flow for Batch and Pipelined Map Reduce

In our Pipelined-MapReduce implementation, we modified Hadoop to send data directly from map to reduce tasks. When a client submits a new job to Hadoop, the JobTracker assigns the map and reduce tasks associated with the job to the available TaskTracker slots. For purposes of discussion, we assume that there are enough free slots to assign all the tasks for each job. We modified Hadoop so that each reduce task contacts every map task upon initiation of the job and opens a TCP socket which will be used to pipeline the output of the map function. As each map output record is produced, the mapper determines which partition (reduce task) the record should be sent to, and immediately sends it via the appropriate socket. A reduce task accepts the pipelined data it receives from each map task and stores it in an in-memory buffer, spilling sorted runs of the buffer to disk as needed. Once the reduce task learns that every map task has completed, it performs a final merge of all the sorted runs and applies the user-defined reduce function as normal, writing the output to HDFS.
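The per-record push can be pictured with the rough sketch below. The class is a stand-alone illustration with assumed names; the actual modification lives inside Hadoop's map task, which keeps one socket per reduce task open for the lifetime of the job.

import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;
import java.util.List;

public class PipelinedMapOutputSketch {
    private final List<Socket> reducerSockets;   // one open socket per reduce task

    public PipelinedMapOutputSketch(List<Socket> reducerSockets) {
        this.reducerSockets = reducerSockets;
    }

    // Called for every map output record: pick the partition (reduce task) by
    // hashing the key and push the record immediately, instead of materializing
    // it to local disk as batch Hadoop does.
    public void emit(String key, String value) throws IOException {
        int partition = (key.hashCode() & Integer.MAX_VALUE) % reducerSockets.size();
        DataOutputStream out =
                new DataOutputStream(reducerSockets.get(partition).getOutputStream());
        out.writeUTF(key);
        out.writeUTF(value);
        out.flush();   // make the record visible to the reducer right away
    }
}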


The data we used for the experiment is enwiki-20100904-pages-articles.xml [6], an XML dataset of around 6 GB that contains the full content of Wikipedia. We implemented the Naive Bayes algorithm [7] in the Hadoop environment [8] to classify the XML dataset. First, the dataset is divided into chunks, and the chunks are written into HDFS. The Naive Bayes classifier consists of two processes: learning which features of the documents are related to which categories, and then using this model to predict the categories of new documents whose categories are not yet known. The first step is called training: it builds a model by looking at the contents of samples that have already been classified, and records the probability with which each word is associated with a particular category. The second step is called classification: it uses the trained model and the content of a new document, combined with Bayes' theorem, to predict the category of the new document. We set up training and test datasets, trained the model on the training data, and then used the model to classify the test documents. In order to evaluate the performance of the MapReduce technologies, we measured, for the different implementations [9], how the task execution time grows with the amount of data. The results show that Hadoop and Pipelined-MapReduce have almost similar performance, possibly limited by I/O bandwidth, and that the overhead added by the Pipelined-MapReduce implementation to the overall completion time is small.


Chapter 4
Hadoop and HDFS

4.1

Hadoop System Architecture:

Each cluster has one master node with multiple slave nodes. The master node runs Name Node and Job Tracker functions and coordinates with the slave nodes to get the job done. The slaves run Task Tracker, HDFS to store data, and map and reduce functions for data computation. The basic stack includes Hive and Pig for language and compilers, HBase for NoSQL database management, and Scribe and Flume for log collection. ZooKeeper provides centralized coordination for the stack.

Figure 4.1: Hadoop System Architecture



4.1.1

Operating a Server Cluster:

A client submits a job to the master node, which orchestrates the work with the slaves in the cluster. The JobTracker controls the MapReduce job, and the TaskTrackers report their progress to it. In the event of a failure, the JobTracker reschedules the task on the same or a different slave node, whichever is most efficient. HDFS is location-aware or rack-aware and manages data within the cluster, replicating the data on various nodes for data reliability. If one of the data replicas on HDFS is corrupted, the JobTracker, aware of where the other replicas are located, can reschedule the task right where a replica resides, decreasing the need to move data back from one node to another. This saves network bandwidth and keeps performance and availability high. Once the job is mapped, the output is sorted and divided into several groups, which are distributed to reducers. Reducers may be located on the same node as the mappers or on another node.

Figure 4.2: Operation Of Hadoop Cluster

4.2

Data Storage and Analysis

The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from the drives) have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes.


Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. This is a long time to read all the data on a single drive, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in less than two minutes. Using only one hundredth of a disk may seem wasteful, but we can store one hundred datasets, each of which is one terabyte, and provide shared access to them. We can imagine that the users of such a system would be happy to share access in return for shorter analysis times and, statistically, that their analysis jobs would be likely to be spread over time, so they wouldn't interfere with each other too much. There is more to being able to read and write data in parallel to or from multiple disks, though. The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure there is another copy available. This is how RAID works, for instance, although Hadoop's file system, the Hadoop Distributed File System (HDFS), takes a slightly different approach, as you shall see later. The second problem is that most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. Map Reduce provides a programming model that abstracts the problem away from disk reads and writes, transforming it into a computation over sets of keys and values. This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and the analysis by Map Reduce. There are other parts to Hadoop, but these capabilities are its kernel.
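The drive-scan figures quoted above follow from simple arithmetic; the small program below just reproduces that calculation using the approximate numbers from the text.

public class DriveScanTimes {
    public static void main(String[] args) {
        double mb1990 = 1370, rate1990 = 4.4;      // 1990 drive: MB stored and MB/s transfer
        double mbNow = 1_000_000, rateNow = 100;   // 1 TB drive at roughly 100 MB/s

        System.out.printf("1990 drive: about %.1f minutes%n", mb1990 / rate1990 / 60);
        System.out.printf("1 TB drive: about %.1f hours%n", mbNow / rateNow / 3600);
        // With 100 drives read in parallel, each holds one hundredth of the data.
        System.out.printf("100 drives in parallel: about %.1f minutes%n",
                          (mbNow / 100) / rateNow / 60);
    }
}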

4.3

Comparison with Other Systems:


The approach taken by Map Reduce may seem like a brute-force approach. The premise is that the entire dataset, or at least a good portion of it, is processed for each query. But this is its power. Map Reduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. It changes the way you think about data, and unlocks data that was previously archived on tape or disk.



It gives people the opportunity to innovate with data. Questions that took too long to get answered before can now be answered, which in turn leads to new questions and new insights. For example, Mailtrust, Rackspace's mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. In their words: "This data was so useful that we've scheduled the Map Reduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow." By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had and, furthermore, they were able to use what they had learned to improve the service for their customers.

4.3.1

RDBMS:

Why can't we use databases with lots of disks to do large-scale batch analysis? Why is Map Reduce needed? The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than to stream through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than Map Reduce, which uses Sort/Merge to rebuild the database. In many ways, Map Reduce can be seen as a complement to an RDBMS. Map Reduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times for a relatively small amount of data. Map Reduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated. Another difference between Map Reduce and an RDBMS is the amount of structure in the datasets that they operate on. Structured data is data that is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema.


This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser, and though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data. Map Reduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for Map Reduce are not an intrinsic property of the data; they are chosen by the person analyzing the data. Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for Map Reduce, since it makes reading a record a non-local operation, and one of the central assumptions that Map Reduce makes is that it is possible to perform (high-speed) streaming reads and writes. A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that log files of all kinds are particularly well suited to analysis with Map Reduce. Map Reduce is a linearly scalable programming model. The programmer writes two functions, a map function and a reduce function, each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one. More importantly, if you double the size of the input data, a job will run twice as slowly; but if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries. Over time, however, the differences between relational databases and Map Reduce systems are likely to blur, both as relational databases start incorporating some of the ideas from Map Reduce (such as Aster Data's and Greenplum's databases) and, from the other direction, as higher-level query languages built on Map Reduce (such as Pig and Hive) make Map Reduce systems more approachable to traditional database programmers.

4.4

Hadoop Distributed File System (HDFS):

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. File systems that manage the storage across a network of machines are called distributed file systems.



Since they are network-based, all the complications of network programming kick in, making distributed file systems more complex than regular disk file systems. For example, one of the biggest challenges is making the file system tolerate node failure without suffering data loss. Hadoop comes with a distributed file system called HDFS, which stands for Hadoop Distributed File System. HDFS is Hadoop's flagship file system and is the focus of this chapter, but Hadoop actually has a general-purpose file system abstraction, so we will see along the way how Hadoop integrates with other storage systems (such as the local file system).

4.4.1

HDFS Federation:

The name node keeps a reference to every file and block in the file system in memory, which means that on very large clusters with many files, memory becomes the limiting factor for scaling. HDFS Federation addresses this by adding name nodes, each of which manages a portion of the file system namespace. For example, one name node might manage all the files rooted under /user, say, and a second name node might handle files under /share. Under federation, each name node manages a namespace volume, which is made up of the metadata for the namespace and a block pool containing all the blocks for the files in the namespace. Namespace volumes are independent of each other, which means that name nodes do not communicate with one another, and furthermore the failure of one name node does not affect the availability of the namespaces managed by other name nodes. Block pool storage is not partitioned, however, so data nodes register with each name node in the cluster and store blocks from multiple block pools. To access a federated HDFS cluster, clients use client-side mount tables to map file paths to name nodes. This is managed in configuration using ViewFileSystem and viewfs:// URIs.

4.4.2

HDFS High-Availability

The combination of replicating the name node metadata on multiple file systems and using the secondary name node to create checkpoints protects against data loss, but it does not provide high availability of the file system. The name node is still a single point of failure (SPOF): if it did fail, all clients, including Map Reduce jobs, would be unable to read or write files, since the name node holds the metadata and the file-to-block mapping. In such an event the whole Hadoop system would effectively be out of service until a new name node could be brought online.



To recover from a failed name node in this situation, an administrator starts a new primary name node with one of the file system metadata replicas and configures data nodes and clients to use this new name node. The new name node is not able to serve requests until it has i) loaded its namespace image into memory, ii) replayed its edit log, and iii) received enough block reports from the data nodes to leave safe mode. On large clusters with many files and blocks, the time it takes for a name node to start from cold can be 30 minutes or more. The long recovery time is a problem for routine maintenance too. In fact, since unexpected failure of the name node is so rare, the case for planned downtime is actually more important in practice. The 0.23 release series of Hadoop remedies this situation by adding support for HDFS high availability (HA). In this implementation there is a pair of name nodes in an active-standby configuration. In the event of the failure of the active name node, the standby takes over its duties to continue servicing client requests without a significant interruption. A few architectural changes are needed to allow this to happen: The name nodes must use highly available shared storage to share the edit log. (In the initial implementation of HA this requires an NFS filer, but in future releases more options will be provided, such as a BookKeeper-based system built on ZooKeeper.) When a standby name node comes up, it reads up to the end of the shared edit log to synchronize its state with the active name node, and then continues to read new entries as they are written by the active name node. Data nodes must send block reports to both name nodes, since the block mappings are stored in a name node's memory and not on disk. Clients must be configured to handle name node failover, which uses a mechanism that is transparent to users. If the active name node fails, the standby can take over very quickly (in a few tens of seconds) since it has the latest state available in memory: both the latest edit log entries and an up-to-date block mapping. The actual observed failover time will be longer in practice (around a minute or so), since the system needs to be conservative in deciding that the active name node has failed. In the unlikely event of the standby being down when the active fails, the administrator can still start the standby from cold. This is no worse than the non-HA case, and from an operational point of view it is an improvement, since the process is a standard operational procedure built into Hadoop.



4.5

Algorithm:

4.5.1

Algorithm For Word Count:

Example: Wordcount Mapper

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);   // emit (word, 1) for every token in the line
    }
  }
}

Example: Wordcount Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();  // add up the 1s emitted for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}
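For completeness, a typical driver for this word count job is sketched below, using the same old "mapred" API; it is not part of the original report, the enclosing class name WordCount is assumed, and the usual org.apache.hadoop.mapred and org.apache.hadoop.fs imports are taken as given.

public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);      // WordCount holds MapClass and Reduce
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);               // types of the final (key, value) output
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);              // optional local aggregation at the mapper
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);                           // submit the job and wait for completion
}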


Chapter 5
ADVANTAGES, FEATURES AND APPLICATIONS

5.1

Advantages

The design has the following advantages compared to the existing implementations surveyed above: 1. Either design allows Reducers to start processing as soon as data is made available by the Mappers. This allows parallelism between the Map and Reduce phases. 2. A downstream processing element can start processing as soon as some data is available from an upstream element. 3. The network is better utilized, as data is continuously pushed from one phase to the next. 4. The final output is computed incrementally. 5. Introduction of a pipeline between the Reduce phase of one job and the Map phase of the next job will support cascaded MapReduce jobs and reduce the delays experienced between the jobs. A module that pushes data produced by one job into the InputQueue of the next job will pipeline and parallelise the operation of the two jobs. A typical application is database queries, where, for example, the output of a join operation may be required as the input to a grouping operation.

5.2

Features

5.2.1

Time windows

The user can define a time window for processing, where the user specifies the range of time values over which he wishes to calculate the output for the job.


For example, the user could specify that he wishes to run some query X on the click-stream data from 24 hours ago till the present time. This will use the timestamp metadata added to the input record and process only those records that lie within the timestamp range.
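A minimal sketch of such a filter is given below, assuming the timestamp added by the Input Handler is available on each record; the class and method names are illustrative only.

public class TimeWindowFilterSketch {
    private final long windowStartMillis;
    private final long windowEndMillis;

    public TimeWindowFilterSketch(long windowStartMillis, long windowEndMillis) {
        this.windowStartMillis = windowStartMillis;
        this.windowEndMillis = windowEndMillis;
    }

    // Only records whose metadata timestamp falls inside the user-specified
    // window are passed on for processing; all others are skipped.
    public boolean accept(long recordTimestampMillis) {
        return recordTimestampMillis >= windowStartMillis
            && recordTimestampMillis <= windowEndMillis;
    }
}

For the 24-hour example above, the filter would be constructed as new TimeWindowFilterSketch(now - 24L * 3600 * 1000, now).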

5.2.2

Snapshots

Incremental processing of data allows users to take snapshots of the partial output generated, based on the job running time and/or the percentage of the job completed. For example, in the word count example, the user could request a snapshot of the data processed till time t = 15 minutes from the start of the job. A snapshot will essentially be a snapshot of the output folder of S3 where the output is being collected. Snapshots allow users to get an approximate idea of the output without having to wait for all processing to be done before the final output is available.

5.2.3

Cascaded MapReduceJobs

Cascaded MapReduce jobs are those where the output of one MapReduce job is to be used by the next MapReduce job. Pipelining between jobs further increases the throughput while decreasing the delays between the jobs, which benefits popular and typical stream processing applications. With accurate delay guarantees, this design can also be used to process real-time data.
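The following is a rough sketch of such an inter-job bridge module, assuming hypothetical OutputSplit and InputQueue abstractions in place of the S3 and SQS clients; it simply pushes every newly aggregated record of the upstream job into the input queue of the downstream job so that the two jobs run in a pipeline.

import java.util.Map;

public class CascadeBridgeSketch {
    public interface OutputSplit { Map<String, String> newRecordsSince(long timestampMillis); }
    public interface InputQueue { void push(String key, String value); }

    private final OutputSplit upstreamOutput;
    private final InputQueue downstreamInput;
    private long lastPollMillis = 0;

    public CascadeBridgeSketch(OutputSplit upstreamOutput, InputQueue downstreamInput) {
        this.upstreamOutput = upstreamOutput;
        this.downstreamInput = downstreamInput;
    }

    // Called periodically: forward every record the upstream job has produced
    // since the last poll to the downstream job's input queue.
    public void pollOnce() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, String> record :
                upstreamOutput.newRecordsSince(lastPollMillis).entrySet()) {
            downstreamInput.push(record.getKey(), record.getValue());
        }
        lastPollMillis = now;
    }
}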



Conclusion
As described above, the design fulfills a real need of processing streaming data using MapReduce. It is also inherently scalable, as it is cloud-based. This also gives it a light-weight nature, as the handling of distributed resources is done by the cloud OS. The gains due to parallelism are expected to be similar to, or better than, those obtained in [3]. Currently, the authors are working on implementing the said design. Future work will include supporting rolling windows for obtaining outputs over arbitrary time intervals of the input stream. Further work can be done in maintaining intermediate output information for supporting rolling windows. Also, reducing the delays to levels acceptable for real-time data is an important use case that needs to be supported. Future scope also includes designing a generic system that is portable across several cloud operating systems.


Bibliography
[1] Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004.
[2] Huan Liu and Dan Orban: Cloud MapReduce: A MapReduce Implementation on top of a Cloud Operating System. Accenture Technology Labs; in Cluster, Cloud and Grid Computing (CCGrid), 2011 11th IEEE/ACM International Symposium.
[3] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy and Russell Sears: MapReduce Online. In 7th USENIX Conference on Networked Systems Design and Implementation, 2010.
[4] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica and M. Zaharia: Above the Clouds: A Berkeley View of Cloud Computing. Technical Report UCB/EECS-2009-28, EECS Department, University of California, Berkeley, Feb 2009.
[5] Daniel Warneke and Odej Kao: Exploiting Dynamic Resource Allocation for Efficient Parallel Data Processing in the Cloud. In IEEE Transactions on Parallel and Distributed Systems, 2011.
[6] Shrideep Pallickara, Jaliya Ekanayake and Geoffrey Fox: Granules: A Lightweight, Streaming Runtime for Cloud Computing with Support for MapReduce. In Cluster Computing and Workshops, CLUSTER '09, IEEE International Conference.
[7] Floreen P., Przybilski M., Nurmi P., Koolwaaij J., Tarlano A., Wagner M., Luther M., Bataille F., Boussard M., Mrohs B., et al. (2005): Towards a Context Management Framework for MobiLife. 14th IST Mobile Wireless Summit.
[8] Nathan Backman, Karthik Pattabiraman and Ugur Cetintemel: C-MR: A Continuous MapReduce Processing Model for Low-Latency Stream Processing on Multi-Core Architectures. Department of Computer Science, Brown University.


[9] Zikai Wang: A Distributed Implementation of Continuous-MapReduce Stream Processing Framework. Department of Computer Science, Brown University.
[10] Huan Liu: Cutting MapReduce Cost with Spot Market. In Proc. of USENIX HotCloud, 2011.
[11] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz and Ion Stoica: Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation.
[12] Michael Stonebraker, Ugur Cetintemel and Stan Zdonik: The 8 Requirements of Real-Time Stream Processing. In ACM SIGMOD Record, Volume 34, Issue 4, 2005.
[13] Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J. and Fox, G.: Twister: A Runtime for Iterative MapReduce. In The First International Workshop on MapReduce and its Applications, 2010.
[14] Hadoop. http://hadoop.apache.org/
[15] Amazon Elastic MapReduce. aws.amazon.com/elasticmapreduce/
[16] Gunho Lee, Byung-Gon Chun and Randy H. Katz: Heterogeneity-Aware Resource Allocation and Scheduling in the Cloud. In USENIX HotCloud, 2011.
[17] Amazon EC2. http://aws.amazon.com/ec2
[18] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker and I. Stoica: Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In USENIX Symposium on Networked Systems Design and Implementation, 2011.
[19] J. Polo, D. Carrera, Y. Becerra, V. Beltran, J. Torres and E. Ayguade: Performance Management of Accelerated MapReduce Workloads in Heterogeneous Clusters. In 39th International Conference on Parallel Processing (ICPP 2010), 2010.
[20] Pramod Bhatotia, Alexander Wieder, Istemi Ekin Akkus, Rodrigo Rodrigues and Umut A. Acar: Large-Scale Incremental Data Processing with Change Propagation. Max Planck Institute for Software Systems (MPI-SWS).



[21] D. Logothetis, C. Olston, B. Reed, K. C. Webb and K. Yocum: Stateful Bulk Processing for Incremental Analytics. In Proc. 1st Symposium on Cloud Computing (SoCC '10).
[22] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani and Jennifer Widom: Models and Issues in Data Stream Systems. Department of Computer Science, Stanford University.

