
Nutch in a Nutshell

Presented by
Liew Guo Min
Zhao Jin
Outline
 Recap
 Special features
 Running Nutch in a distributed environment
(with demo)
 Q&A
 Discussion
Recap
 Complete web search engine
 Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)

 Java based, open source

 Features:
 Customizable
 Extensible
 Distributed
Nutch as a crawler

  Initial URLs --> Injector --(inject)--> CrawlDB
  CrawlDB --(generate)--> Generator --> Segment
  Fetcher --(get)--> Web (webpages/files), read/write Segment
  Parser: read/write Segment
  CrawlDBTool: reads Segment, updates CrawlDB
Special Features
 Extensible (Plugin system)
 Most of the essential functionalities of Nutch are implemented as plugins
 Three layers
  Extension points: what can be extended (Protocol, Parser, ScoringFilter, etc.)
  Extensions: the interfaces to be implemented for the extension points
  Plugins: the actual implementations
Special Features
 Extensible (Plugin system)
 Anyone can write a plugin
  Write the code
  Prepare metadata files
   plugin.xml: what has been extended by what
   build.xml: how ant can build your source code
  Ask Nutch to include your plugin in conf/nutch-site.xml
  Tell ant to build your plugin in src/plugin/build.xml
 More details @ http://wiki.apache.org/nutch/PluginCentral
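The plugin.xml metadata file above can be sketched as follows. This is a hypothetical parser plugin: the names parse-foo, org.example.parse.FooParser, and application/x-foo are invented for illustration, while the extension point org.apache.nutch.parse.Parser is one of Nutch's real extension points.

```xml
<!-- plugin.xml: declares what has been extended (the extension point)
     and by what (the implementation class). All "foo" names are hypothetical. -->
<plugin id="parse-foo" name="Foo Parser" version="1.0.0" provider-name="example.org">
  <runtime>
    <library name="parse-foo.jar">
      <export name="*"/>
    </library>
  </runtime>
  <extension id="org.example.parse.foo" name="FooParser"
             point="org.apache.nutch.parse.Parser">
    <implementation id="FooParser" class="org.example.parse.FooParser">
      <!-- parse plugins are typically bound to a content type -->
      <parameter name="contentType" value="application/x-foo"/>
    </implementation>
  </extension>
</plugin>
```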
Special Features
 Extensible (Plugin system)
 To use a plugin
  Make sure you have modified conf/nutch-site.xml to include the plugin
  Then, either
   Nutch will automatically call it when needed, or
   You can instantiate it yourself by its class name and then use it
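Including a plugin is done through the plugin.includes property in conf/nutch-site.xml, a regular expression matched against plugin directory names. A sketch, where the value mirrors commonly-seen defaults and parse-foo stands in for your own (hypothetical) plugin directory:

```xml
<!-- conf/nutch-site.xml: only plugins whose directory name matches
     this regex are loaded; append your plugin's directory name. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|parse-foo</value>
</property>
```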
Special Features
 Distributed (Hadoop)
 Map-Reduce (Diagram)
  A framework for distributed programming
  Map -- process the splits of data to get intermediate results and the keys to indicate what should be put together later
  Reduce -- process the intermediate results with the same key and output the final result
Special Features
 Distributed (Hadoop)
 MapReduce in Nutch
 Example 1: Parsing
  Input: <url, content> files from fetch
  Map(url, content) -> <url, parse> by calling parser plugins
  Reduce is the identity
 Example 2: Dumping a segment
  Input: <url, CrawlDatum>, <url, ParseText>, etc. files from the segment
  Map is the identity
  Reduce(url, value*) -> <url, ConcatenatedValue> by simply concatenating the text representation of the values
Special Features
 Distributed (Hadoop)
 Distributed File system
 Write-once-read-many coherence model
 High throughput
 Master/slave
 Simple architecture
 Single point of failure
 Transparent
 Access via Java API
 More info @ http://lucene.apache.org/hadoop/hdfs_design.html
Running Nutch in a distributed environment
 MapReduce
  In hadoop-site.xml
   Specify job tracker host & port
    mapred.job.tracker
   Specify task numbers
    mapred.map.tasks
    mapred.reduce.tasks
   Specify location for temporary files
    mapred.local.dir
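Put together, the MapReduce side of hadoop-site.xml might look like the fragment below. The host name, port, task counts, and path are placeholders, not recommended values.

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- placeholder host:port of the job tracker -->
    <value>master.example.org:9001</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <!-- placeholder path for temporary map/reduce files -->
    <value>/tmp/hadoop/mapred/local</value>
  </property>
</configuration>
```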
Running Nutch in a distributed environment
 DFS
  In hadoop-site.xml
   Specify namenode host, port & directory
    fs.default.name
    dfs.name.dir
   Specify location for files on each datanode
    dfs.data.dir
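The DFS side of hadoop-site.xml can be sketched the same way; again, the host name, port, and directories below are placeholders.

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- placeholder host:port of the namenode -->
    <value>master.example.org:9000</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <!-- placeholder directory for namenode metadata -->
    <value>/home/nutch/filesystem/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <!-- placeholder directory for block storage on each datanode -->
    <value>/home/nutch/filesystem/data</value>
  </property>
</configuration>
```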
Demo time!
Q&A
Discussion
Exercises
 Hands-on exercises
  Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI
  Repeat the crawling process without using the crawl command
  Modify your configuration to perform each of the following crawl jobs, and think about when they would be useful:
   To crawl only webpages and PDFs but nothing else
   To crawl the files on your hard disk
   To crawl but not to parse
  (Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state
Reference
 http://wiki.apache.org/nutch/PluginCentral -- Information on Nutch plugins
 http://lucene.apache.org/hadoop/ -- Hadoop homepage
 http://wiki.apache.org/lucene-hadoop/ -- Hadoop Wiki
 http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/map -- "MapReduce in Nutch"
 http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/oscon05.pdf -- "Scalable Computing with MapReduce"
 http://www.mail-archive.com/nutch-commits@lucene.apache.org/msg01951.html -- Updated tutorial on setting up Nutch, Hadoop and Lucene together
Excursion: MapReduce
 Problem
  Find the number of occurrences of “cat” in a file
  What if the file is 20GB large? Why not do it with more computers?
 Solution

  File -> Split 1 -> PC1 -> 200 \
                                 +-> PC1 -> 500
  File -> Split 2 -> PC2 -> 300 /
Excursion: MapReduce
 Problem
  Find the number of occurrences of both “cat” and “dog” in a very large file
 Solution

  File -> Split 1 -> PC1 -> cat: 200, dog: 250 \   cat: 200, 300 -> PC1 -> cat: 500
  File -> Split 2 -> PC2 -> cat: 300, dog: 250 /   dog: 250, 250 -> PC2 -> dog: 500

       Map                  Sort/Group                    Reduce
  Input Files          Intermediate files             Output files
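The cat/dog example above can be sketched in plain Java. No Hadoop is involved; this just mimics the three phases (map over splits, sort/group by key, reduce per key) in a single process, with much smaller counts than the slide's.

```java
import java.util.*;

public class MapReduceSketch {
    // Map phase: for one split, emit a (word, 1) pair per occurrence of a tracked word.
    static List<Map.Entry<String, Integer>> map(String split, Set<String> tracked) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : split.split("\\s+")) {
            if (tracked.contains(token)) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Sort/group the intermediate pairs by key, then reduce each key by summing.
    static Map<String, Integer> run(List<String> splits, Set<String> tracked) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String split : splits) {   // each split could run on a different worker
            for (Map.Entry<String, Integer> kv : map(split, tracked)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;   // reduce: sum values sharing a key
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("cat dog cat bird", "dog cat dog");
        Map<String, Integer> counts =
            run(splits, new HashSet<>(Arrays.asList("cat", "dog")));
        System.out.println(counts);   // {cat=3, dog=3}
    }
}
```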


Excursion: MapReduce
 Generalized Framework

                          Master

  Split 1 -> Worker -> k1:v1, k3:v2      k1:v1,v2 -> Worker -> Output 1
  Split 2 -> Worker -> k1:v3             k2:v4,v5 -> Worker -> Output 2
  Split 3 -> Worker -> k2:v4, k3:v2      k3:v2
  Split 4 -> Worker -> k2:v5, k4:v6      k4:v6    -> Worker -> Output 3

       Map                 Sort/Group                  Reduce
  Input Files          Intermediate files           Output files


back