The best place for students to learn Applied Engineering http://www.insofe.edu.in
Engineering Big Data: Online Batch
Session 9: Map Reduce 2
Dr. Sreerama KV Murthy, CEO, Teqnium Consultancy Services
September 25, 2013

Refresher: What is MapReduce?
MapReduce is a programming model that Google has used successfully in processing its big-data sets (more than 20 petabytes, i.e. about 20,000 terabytes, per day).
Users specify the computation in terms of a map and a reduce function.
The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines.
The underlying system also handles machine failures, efficient communications, and performance issues.
Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
CCSCNE 2009, Plattsburgh, April 24, 2009. B. Ramamurthy & K. Madurai

The Five MapReduce Daemons
1. NameNode: holds the metadata for HDFS.
2. Secondary NameNode: performs housekeeping functions for the NameNode. It is not a backup or hot standby for the NameNode.
3. DataNode: stores actual HDFS data blocks.
4. JobTracker: manages MapReduce jobs, distributes individual tasks to machines, etc.
5. TaskTracker: instantiates and monitors individual Map and Reduce tasks.
Master nodes in the cluster run one of the master daemons (NameNode, Secondary NameNode, JobTracker; shown in blue on the original slide).
Slave nodes run both of the remaining daemons (DataNode and TaskTracker).
Each daemon runs in its own Java virtual machine.

Five Daemons of MapReduce (contd.)
YARN (MR2)

CDH-3 Map Reduce Daemons

MapReduce NextGen aka YARN aka MRv2
Divides the two major functions of the JobTracker, resource management and job life-cycle management, into separate components.
Released in Hadoop-0.23.
The new ResourceManager manages the global assignment of compute resources to applications.
The ResourceManager has two main components: Scheduler and ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, etc.

MRv2 aka YARN: JobTracker Redefined

YARN aka MRv2 (contd.)
The Scheduler performs no tracking of status for the application, and offers no guarantees about restarting failed tasks.
The per-application ApplicationMaster manages the application's scheduling and coordination.
The per-machine NodeManager daemon manages the user processes on that machine.
An application is either a single MR job or a DAG of such jobs.
The ApplicationMaster negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor tasks.
CDH4 continues to support the original MapReduce framework (i.e. the JobTracker and TaskTrackers). The old framework is referred to as MRv1.
A COUPLE OF USE CASES
Map-Reduce

Yahoo: Running Production WebMap
Search needs a graph of the known web
  Invert edges, compute link text, whole graph heuristics
Periodic batch job using Map/Reduce
  Uses a chain of ~100 map/reduce jobs
Scale
  1 trillion edges in graph
  Largest shuffle is 450 TB
  Final output is 300 TB compressed
  Runs on 10,000 cores
  Raw disk used 5 PB
Written mostly using Hadoop's C++ interface

Yahoo Research Clusters
Mostly data mining/machine learning jobs
Most research jobs are not Java:
  42% Streaming: uses Unix text processing to define map and reduce
  28% Pig: higher level dataflow scripting language
  28% Java
  2% C++

NY Times
Needed offline conversion of public domain articles from 1851-1922.
Used Hadoop to convert scanned images to PDF
Ran 100 Amazon EC2 instances for around 24 hours
4 TB of input
1.5 TB of output
(Image: published 1892, copyright New York Times)

Terabyte Sort Benchmark
Started by Jim Gray at Microsoft in 1998
Sorting 10 billion 100 byte records
Hadoop won the general category in 209 seconds
  910 nodes
  2 quad-core Xeons @ 2.0 GHz / node
  4 SATA disks / node
  8 GB RAM / node
  1 Gb Ethernet / node
  40 nodes / rack
  8 Gb Ethernet uplink / rack
Previous record was 297 seconds
Only hard parts were:
  Getting a total order
  Converting the data generator to map/reduce

NOW... THE SHAKE-UP QUIZ!!
KEY-VALUE PAIRS

MapReduce
Programmers specify two functions:
  map (k, v) -> <k, v>*
  reduce (k, v) -> <k, v>*
All values with the same key are reduced together.

Keys and Values

MapReduce In more detail

MR: Logical Execution
[Figure: logical execution of a MapReduce job]
Input pairs: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
Outputs of the four map tasks: (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 8)
Shuffle and Sort: aggregate values by keys
  a -> 1, 5
  b -> 2, 7
  c -> 2, 3, 6, 8
Outputs of the three reduce tasks: (r1, s1) (r2, s2) (r3, s3)

SIMPLE MAPPERS & REDUCERS

Mappers
Mappers run on nodes which hold their portion of the data locally, to avoid network traffic.
Multiple Mappers run in parallel, each processing a portion of the input data.
A Mapper reads data in the form of key/value pairs.
A Mapper may use, or completely ignore, the input key.
  E.g., a standard pattern is to read a line of a file at a time.
  The key is then the byte offset into the file at which the line starts.
  The value is the contents of the line itself.
  Typically the key is considered irrelevant.
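The flow described above (line records keyed by byte offset, a map step, shuffle and sort by key, then a reduce step) can be sketched in a few lines of plain Python. This is only a single-process illustration under stated assumptions, not Hadoop's API: the names `read_records`, `word_count_map`, `sum_reduce`, and `run_job` are hypothetical, and the sample input is chosen so that the key counts resemble the a/b/c keys in the logical-execution figure.

```python
from collections import defaultdict


def read_records(text):
    """Yield (byte_offset, line) pairs, one per input line (hypothetical reader)."""
    offset = 0
    for raw in text.encode("utf-8").splitlines(keepends=True):
        yield offset, raw.decode("utf-8").rstrip("\r\n")
        offset += len(raw)  # the next line starts after this line's bytes


def word_count_map(key, value):
    """Ignore the byte-offset key; emit (word, 1) for each word in the line."""
    for word in value.split():
        yield word.lower(), 1


def sum_reduce(key, values):
    """Emit a single (key, total) pair for all values sharing this key."""
    yield key, sum(values)


def run_job(text, mapper, reducer):
    """Run map, then shuffle and sort, then reduce, all in one process."""
    # Map phase: each input record may produce zero or more pairs.
    intermediate = []
    for k, v in read_records(text):
        intermediate.extend(mapper(k, v))

    # Shuffle and sort: aggregate values by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce phase: keys are visited in sorted order.
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output


sample = "a b\nc c\na c\nb c\n"
records = list(read_records(sample))
result = run_job(sample, word_count_map, sum_reduce)
print(records)
print(result)
```

In a real cluster the map and reduce calls would run as separate tasks on different machines; here they run sequentially so that the data flow is easy to follow.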
The Mapper outputs zero or more key/value pairs. For example, an upper-casing mapper:
  let map(k, v) = emit(k.toUpper(), v.toUpper())
  ('foo', 'bar') -> ('FOO', 'BAR')

Other Examples: Explode mapper

Example: Filter mapper

Example: Changing Keyspaces

International School of Engineering
Plot No 63/A, 1st Floor, Road No 13, Film Nagar, Jubilee Hills, Hyderabad - 500033
For Individuals: +91-9502334561/63
For Corporates: +91-9618483483
Web: http://www.insofe.edu.in
Facebook: https://www.facebook.com/Insofe
Twitter: https://twitter.com/INSOFEedu
YouTube: http://www.youtube.com/InsofeVideos
SlideShare: http://www.slideshare.net/INSOFE

This presentation may contain references to findings of various reports available in the public domain. INSOFE makes no representation as to their accuracy or that the organization subscribes to those findings.
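The mapper patterns named in the slides above (upper-casing, explode, filter, changing keyspaces) can be sketched as Python generator functions. Only the upper-casing example is spelled out in the slide text. The other three bodies are assumptions based on what the pattern names usually mean, and the helper `run_mapper` and the prime-number predicate are hypothetical choices made for illustration.

```python
def upper_mapper(k, v):
    """Upper-case both key and value (the example given in the slides)."""
    yield k.upper(), v.upper()


def explode_mapper(k, v):
    """Assumed 'explode' form: emit one pair per character of the value."""
    for ch in v:
        yield k, ch


def is_prime(n):
    """Hypothetical predicate used by the filter example."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def filter_mapper(k, v):
    """Assumed 'filter' form: emit the pair only if the value passes a test."""
    if is_prime(v):
        yield k, v


def keyspace_mapper(k, v):
    """Assumed 'changing keyspaces' form: the output key is derived from the value."""
    yield len(v), v


def run_mapper(mapper, pairs):
    """Apply a mapper to each input pair and collect everything it emits."""
    out = []
    for k, v in pairs:
        out.extend(mapper(k, v))
    return out


print(run_mapper(upper_mapper, [("foo", "bar")]))
print(run_mapper(explode_mapper, [("foo", "bar")]))
print(run_mapper(filter_mapper, [("foo", 7), ("baz", 10)]))
print(run_mapper(keyspace_mapper, [("foo", "bar"), ("baz", "other")]))
```

Writing each mapper as a generator matches the "zero or more key/value pairs" contract: a filter mapper may emit nothing, while an explode mapper may emit many pairs for one input.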