After we've introduced you to the tools, it's time to learn how to use them efficiently. Algorithms designed for running on MapReduce look a little different than those you've written before. We'll introduce you to some widely-used algorithms, common idioms to use when designing your own, and techniques for implementing these in Java MapReduce and scripting languages via Hadoop Streaming.
Check http://www.cloudera.com/hadoop-training-basic for training videos.
Copyright: Attribution Non-Commercial (BY-NC)
MapReduce Jobs
• Tend to be very short, code-wise
  – IdentityReducer is very common
• "Utility" jobs can be composed
• Represent a data flow, more so than a procedure
Sort: The Trick
• By default, (key, value) pairs from mappers are sent to a particular reducer based on hash(key)
• To get a totally ordered result, replace the hash with an order-preserving partition function for your data: k1 < k2 => partition(k1) <= partition(k2)
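Since the deck also targets scripting languages via Hadoop Streaming, here is a minimal Python sketch of an order-preserving (range) partition function. The split points are made-up stand-ins; in practice Hadoop derives them by sampling the input (as its TotalOrderPartitioner does).

```python
# Hypothetical split points for string keys; real ones come from sampling.
SPLIT_POINTS = ["g", "n", "t"]  # defines 4 reducer partitions

def partition(key):
    """Order-preserving partition: k1 < k2 implies
    partition(k1) <= partition(k2), so concatenating the
    reducers' sorted outputs yields a fully sorted result."""
    for i, split in enumerate(SPLIT_POINTS):
        if key < split:
            return i
    return len(SPLIT_POINTS)
```

Because the mapping is monotone rather than a hash, each reducer receives a contiguous key range and the per-reducer sort produces a globally sorted output.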
Search Algorithm
• Mapper:
  – Given (filename, some text) and "pattern", if the text matches the pattern, output (filename, _)
• Reducer:
  – Identity function
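A Streaming-style Python sketch of the search mapper; the function names and the `None` placeholder value are illustrative:

```python
import re

def search_mapper(filename, text, pattern):
    """Emit (filename, _) when the text matches the pattern;
    the reducer is just the identity function."""
    if re.search(pattern, text):
        yield (filename, None)

def identity_reducer(key, values):
    # pass matching filenames through unchanged
    yield key
```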
Search: An Optimization
• Once a file is found to be interesting, we only need to mark it that way once
• Use a Combiner function to fold redundant (filename, _) pairs into a single one
  – Reduces network I/O
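The combiner for this job is tiny; a sketch of the folding step (names illustrative):

```python
def search_combiner(filename, markers):
    """Collapse any number of redundant (filename, _) pairs
    into a single pair before they cross the network."""
    list(markers)           # drain the redundant markers
    yield (filename, None)  # forward exactly one
```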
An Aside: Word Count
• Word count was described in module I
• The Word Count mapper emits (word, 1) for each word in the input line
  – Strikingly similar to the inverted index mapper
  – Common theme: reuse/modify existing mappers
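In Streaming-style Python, the pair of functions might look like this (names are illustrative):

```python
def wc_mapper(key, line):
    """The input key (e.g. a byte offset) is discarded --
    the same pattern the inverted-index mapper uses."""
    for word in line.split():
        yield (word, 1)

def wc_reducer(word, counts):
    # sum the 1's emitted for each word
    yield (word, sum(counts))
```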
Bayesian Classification
• Existing toolkits perform Bayes classification on instances
  – E.g., WEKA, already in Java!
• Another example of discarding the input key
Joining
• Common problem: two data types, one of which includes references to elements of the other; we would like to incorporate the referenced data by value, not by reference
Join Mapper
• Read in all values of the joiner and joinee classes
• Emit to a reducer based on the primary key of the joinee (i.e., the reference in the joiner, or the joinee's own identity)
Join Reducer
• Joinee objects are emitted as-is
• Joiner objects have additional fields populated from the joinee that arrives at the same reducer
  – Requires a secondary sort so the reducer reads the joinee before emitting any objects that join onto it
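A Python sketch of this reduce-side join. The record schema (`type`, `id`, `ref` fields) is hypothetical, and the explicit sort stands in for the secondary sort Hadoop would perform before the reducer runs:

```python
def join_mapper(record):
    """Key every record by the joinee's primary key: the joinee by
    its own id, the joiner by the reference it holds."""
    if record["type"] == "joinee":
        yield (record["id"], ("joinee", record))
    else:
        yield (record["ref"], ("joiner", record))

def join_reducer(key, tagged_values):
    # Stand-in for the secondary sort: joinee records come first.
    ordered = sorted(tagged_values, key=lambda tv: tv[0] != "joinee")
    joinee = None
    for tag, rec in ordered:
        if tag == "joinee":
            joinee = rec
            yield rec                      # joinee emitted as-is
        else:
            joined = dict(rec)
            joined["joinee"] = joinee      # populate fields by value
            yield joined
```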
Information We Need
• Number of times term X appears in a given document
• Number of terms in each document
• Number of documents X appears in
• Total number of documents
Job 1: Word Frequency in Doc
• Mapper
  – Input: (docname, contents)
  – Output: ((word, docname), 1)
• Reducer
  – Sums counts for each word in a document
  – Outputs ((word, docname), n)
• Combiner is the same as the Reducer
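A Streaming-style Python sketch of Job 1 (function names are illustrative):

```python
def job1_mapper(docname, contents):
    # emit a 1 for every occurrence of a word in this document
    for word in contents.split():
        yield ((word, docname), 1)

def job1_reducer(key, counts):
    # n = number of times the word appears in the document;
    # summing is associative, so this also serves as the combiner
    yield (key, sum(counts))
```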
Job 2: Word Counts For Docs
• Mapper
  – Input: ((word, docname), n)
  – Output: (docname, (word, n))
• Reducer
  – Sums the n's in the same doc to get N, the document's total term count
  – Feeds the original (word, n) data through
  – Outputs ((word, docname), (n, N))
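A Python sketch of Job 2; note the reducer must buffer the per-word counts so it can feed them back through alongside N:

```python
def job2_mapper(key, n):
    # re-key by document so all of a doc's words meet in one reducer
    word, docname = key
    yield (docname, (word, n))

def job2_reducer(docname, word_counts):
    word_counts = list(word_counts)     # buffer to feed data through
    N = sum(n for _, n in word_counts)  # total terms in this document
    for word, n in word_counts:
        yield ((word, docname), (n, N))
```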
Job 3: Word Frequency In Corpus
• Mapper
  – Input: ((word, docname), (n, N))
  – Output: (word, (docname, n, N, 1))
• Reducer
  – Sums the 1's to get m, the number of documents the word appears in
  – Outputs ((word, docname), (n, N, m))
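A Python sketch of Job 3, which buffers per-document records while counting documents (the memory concern discussed under "Working At Scale"):

```python
def job3_mapper(key, value):
    # re-key by word so all documents containing it meet in one reducer
    word, docname = key
    n, N = value
    yield (word, (docname, n, N, 1))

def job3_reducer(word, values):
    values = list(values)                # buffered: may be large for common words
    m = sum(c for _, _, _, c in values)  # number of docs containing the word
    for docname, n, N, _ in values:
        yield ((word, docname), (n, N, m))
```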
Job 4: Calculate TF-IDF
• Mapper
  – Input: ((word, docname), (n, N, m))
  – Assume D, the total number of documents, is known (or an easy MR job finds it)
  – Output: ((word, docname), TF×IDF)
• Reducer
  – Just the identity function
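A Python sketch of the Job 4 mapper, using the standard TF = n/N and IDF = log(D/m) formulation (the deck does not pin down an exact IDF variant, so this is one common choice):

```python
import math

def job4_mapper(key, value, D):
    """D: total number of documents, assumed known from a prior pass."""
    word, docname = key
    n, N, m = value
    tf = n / N              # term frequency within the document
    idf = math.log(D / m)   # inverse document frequency across the corpus
    yield ((word, docname), tf * idf)
```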
Working At Scale
• Buffering (doc, n, N) counts while summing 1's into m may not fit in memory
  – How many documents does the word "the" occur in?
• Possible solutions
  – Ignore very-high-frequency words
  – Write out intermediate data to a file
  – Use another MR pass
Final Thoughts on TF-IDF
• Several small jobs add up to the full algorithm
• Lots of code reuse is possible
  – Stock classes exist for aggregation and identity
• Jobs 3 and 4 can actually be done at once in the same reducer, saving a write/read cycle
• Medium-large scale is very easy to handle, but take care to ensure flat memory usage at the largest scale