
Nutch in a Nutshell

Presented by
Liew Guo Min
Zhao Jin
 Recap
 Special features
 Running Nutch in a distributed environment (with demo)
 Q&A
 Discussion
 Complete web search engine
 Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)

 Java based, open source

 Features:
 Customizable
 Extensible
 Distributed
Nutch as a crawler

[Diagram: the crawl loop. The Injector injects the initial URLs into the CrawlDB; the Generator generates fetch lists (segments) from the CrawlDB; the Fetcher gets the webpages/files from the web and writes them to the segment; the Parser reads and parses the fetched content; the CrawlDBTool then updates the CrawlDB.]
Special Features
 Extensible (Plugin system)
 Most of the essential functionalities of Nutch are implemented as plugins
 Three layers
 Extension points
 What can be extended: Protocol, Parser, ScoringFilter, etc.
 Extensions
 The interfaces to be implemented for the extension points
 Plugins
 The actual implementation
Special Features
 Extensible (Plugin system)
 Anyone can write a plugin
 Write the code
 Prepare the metadata files
 plugin.xml: what has been extended by what
 build.xml: how ant can build your source code
 Ask Nutch to include your plugin in conf/nutch-site.xml
 Tell ant to build your plugin in src/plugin/build.xml
 More details @ http://
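As a rough sketch, a minimal plugin.xml might look like the following (the plugin id, class names, and extension-point id are made-up examples following the Nutch plugin convention, not taken from the slides):

```xml
<!-- Hypothetical example: a plugin "myplugin" extending the Parser extension
     point. All ids and class names below are illustrative placeholders. -->
<plugin id="myplugin" name="My Example Plugin" version="1.0.0">
   <runtime>
      <library name="myplugin.jar">
         <export name="*"/>
      </library>
   </runtime>
   <extension id="org.example.myplugin" name="My Parser"
              point="org.apache.nutch.parse.Parser">
      <implementation id="MyParser" class="org.example.MyParser"/>
   </extension>
</plugin>
```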
Special Features
 Extensible (Plugin system)
 To use a plugin
 Make sure you have modified nutch-site.xml to include the plugin
 Then, either
 Nutch will call it automatically when needed, or
 You can write something that calls it by its class name and then use it
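For instance, the relevant nutch-site.xml entry might look like this (assuming the standard plugin.includes property; the "myplugin" id is a made-up example):

```xml
<!-- Hypothetical nutch-site.xml fragment: the value is a regular expression
     matching the ids of the plugins to load; "myplugin" is a placeholder. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|myplugin</value>
</property>
```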
Special Features
 Distributed (Hadoop)
 Map-Reduce (Diagram)
 A framework for distributed programming
 Map -- process the splits of data to get intermediate results and the keys to indicate what should be put together later
 Reduce -- process the intermediate results with the same key and output the final result

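The map/sort/reduce flow can be sketched in plain Java (a toy single-machine simulation, not Hadoop code; the splits and words are made-up examples):

```java
import java.util.*;
import java.util.stream.*;

// Toy simulation of the map/sort-group/reduce flow for counting words.
public class MapReduceSketch {
    // Map: each split emits (word, 1) pairs; the key says what is grouped later.
    static List<Map.Entry<String, Integer>> map(String split) {
        return Arrays.stream(split.split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce: all values sharing a key are combined into one final result.
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> splits = List.of("cat dog cat", "dog cat");

        // Sort/Group: collect intermediate pairs by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String split : splits)
            for (Map.Entry<String, Integer> e : map(split))
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .add(e.getValue());

        // Reduce each key group independently (this is what parallelizes).
        grouped.forEach((k, v) -> System.out.println(k + ":" + reduce(k, v)));
    }
}
```

Running it prints `cat:3` and `dog:2`; in a real cluster, each map and each reduce call could run on a different machine.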
Special Features
 Distributed (Hadoop)
 MapReduce in Nutch
 Example 1: Parsing
 Input: <url, content> files from the fetch
 Map(url, content) → <url, parse> by calling the parser plugins
 Reduce is identity

 Example 2: Dumping a segment
 Input: <url, CrawlDatum>, <url, ParseText> etc. files from the segment
 Map is identity
 Reduce(url, value*) → <url, ConcatenatedValue> by simply concatenating the text representation of the values
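Example 2 can be sketched as a toy reduce in plain Java (not the actual Nutch/Hadoop code; the record strings below are hypothetical stand-ins for CrawlDatum and ParseText entries):

```java
import java.util.*;

// Sketch of the segment-dump reduce: map is the identity, and reduce simply
// concatenates the text form of every value stored under one URL.
public class SegmentDumpSketch {
    // Reduce(url, value*) -> <url, ConcatenatedValue>
    static String reduce(String url, List<Object> values) {
        StringBuilder sb = new StringBuilder();
        for (Object v : values)
            sb.append(v.toString()).append("\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical records for one URL: a CrawlDatum-like status line
        // and a ParseText-like body.
        List<Object> values = List.of("CrawlDatum: status=fetched",
                                      "ParseText: Hello Nutch");
        System.out.print(reduce("http://example.com/", values));
    }
}
```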
Special Features
 Distributed (Hadoop)
 Distributed File System
 Write-once-read-many coherence model
 High throughput
 Master/slave
 Simple architecture
 Single point of failure
 Transparent
 Access via Java API
 More info @
Running Nutch in a distributed environment
 MapReduce
 In hadoop-site.xml
 Specify job tracker host & port
 mapred.job.tracker
 Specify task numbers
 mapred.reduce.tasks
 Specify the location for temporary files
 mapred.local.dir
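Put together, the hadoop-site.xml entries named above might look like this (the host name, port, and paths are placeholder assumptions, not values prescribed by the slides):

```xml
<!-- Sketch of the MapReduce-related hadoop-site.xml entries; the host,
     port, and directory values below are illustrative placeholders. -->
<property>
  <name>mapred.job.tracker</name>
  <value>master.example.com:9001</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/tmp/hadoop/mapred/local</value>
</property>
```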
Running Nutch in a distributed environment
 In hadoop-site.xml
 Specify namenode host, port & directory
 Specify the location for files on each datanode
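Assuming the slide refers to the classic Hadoop properties for these settings (the property names here are my assumption, and the host, port, and paths are placeholders), the corresponding hadoop-site.xml fragment might look like:

```xml
<!-- Hypothetical DFS entries in hadoop-site.xml; property names assumed,
     host/port and directory paths are illustrative placeholders. -->
<property>
  <name>fs.default.name</name>
  <value>master.example.com:9000</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop/data</value>
</property>
```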

Demo time!
 Hands-on exercises
 Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI
 Repeat the crawling process without using the crawl command
 Modify your configuration to perform each of the following crawl jobs, and think about when each would be useful:
 To crawl only webpages and PDFs but not anything else
 To crawl the files on your hard disk
 To crawl but not to parse
 (Challenging) Modify Nutch such that you can unpack the crawled files in the segments back into their original state
 -- Information on Nutch
 -- Hadoop homepage
 -- Hadoop Wiki
 "MapReduce in Nutch"
 data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
 Updated tutorial on setting up Nutch, Hadoop and Lucene together
Excursion: MapReduce
 Problem
 Find the number of occurrences of “cat” in a very large file
 What if the file is 20GB large?
 Why not do it with more computers?
 Solution

[Diagram: the file is split in two; PC1 counts 200 occurrences in split 1, PC2 counts 300 in split 2, and the partial counts are summed on PC1 to give 500.]
Excursion: MapReduce
 Problem
 Find the number of occurrences of both “cat” and “dog” in a very large file
 Solution

[Diagram: Map emits per-split counts for each word; Sort/Group brings the counts for the same word together; Reduce sums them, giving cat: 500 on PC1 and dog: 500 on PC2.]

Map → Sort/Group → Reduce
Input files → Intermediate files → Output files
Excursion: MapReduce
 Generalized Framework

[Diagram: input splits are processed by Map workers into intermediate key/value pairs (e.g. k1:v3, k2:v4,v5, k3:v2, k4:v6); Sort/Group collects the pairs with the same key; Reduce workers turn each key group into the output files.]

Map → Sort/Group → Reduce
Input files → Intermediate files → Output files