Introduction
Types of Search Engines
Components of a Search Engine
Semantics and Relevancy
Search Engine Optimization
Engine
something that supplies the driving force or energy to a
movement, system, or trend
Search Engine
a computer program that searches for particular keywords
and returns a list of documents in which they were found,
especially a commercial service that scans documents on
the Internet
www.dmoz.org
Websites classified into a taxonomy
Websites are arranged by category
Searching vs Navigation
Instead of Query, you Click and navigate
Accurate search always! (if data is
available)
Problem: Mostly manually created
Copyleft (ɔ) 2009 Sudarsun Santhiappan
How does a Search Engine work?
[Diagram: a crawler fetches pages (URL1 to URL4) into the search engine's database. The user's browser sends the query "Eggs?" to the search engine, which returns scored matches (for example "Eggs" 90%, "Eggo" 81%, "Ego" 40%, down to 10%).]
in addition:
many national search engines
own coverage, orientation, governance
many specialized or domain search engines
own coverage geared to subject of interest
many comprehensive sources independent of search engines
some have compilations of evaluated web sources
searching differences
substantial differences among search
engines in searching, retrieval, display
need to know how they work & differ with respect to
defaults in searching a query
searching of phrases, case sensitivity, categories
searching of different fields, formats, types of resources
advanced search capabilities and features
possibilities for refinement, using relevance feedback
display options
personalization options
Crawlers
Indexers
Searching
Semantics
Ranking
[Diagram: the search engine server creates an inverted index; a user query is run against the inverted index and the results are shown to the user.]
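The query path in the diagram (user query, inverted index lookup, results back to the user) can be sketched roughly as follows. This is only an illustration: the three sample documents, the helper names `build_index` and `search`, and the choice of AND semantics for multi-word queries are assumptions, not something the slides specify.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return sorted IDs of documents that contain every query term (AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    # .get() avoids creating empty entries for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))

# Hypothetical documents, used only for illustration.
docs = {1: "all about eggs", 2: "eggo waffles", 3: "green eggs and ham"}
index = build_index(docs)

print(search(index, "eggs"))        # [1, 3]
print(search(index, "green eggs"))  # [3]
```

A real engine would also rank the matching documents; this sketch only retrieves them.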
Typical Search Engine
What is Crawling?
How does Crawling happen?
Have you tried “wget -r <url>” in Linux?
Have you tried “DAP” to download an entire
site?
Page Walk
Spidering & Crawlbots
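A page walk of this kind can be sketched as a breadth-first traversal of links. To keep the example runnable without network access, the "site" below is a hypothetical in-memory dictionary on the made-up host `example.test`; the names `LinkExtractor`, `SITE`, and `crawl` are illustrative assumptions, and a real crawler would fetch pages over HTTP instead.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical three-page site held in memory (no network needed).
SITE = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/b.html">B</a>',
    "http://example.test/b.html": '<a href="/">home</a> <a href="/missing.html">x</a>',
}

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first page walk; returns URLs in the order they were fetched."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page could not be fetched; skip it
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:        # never queue a URL twice
                seen.add(link)
                queue.append(link)
    return visited

pages = crawl("http://example.test/", SITE.get)
print(pages)
```

The `seen` set is what stops the walk from looping forever when pages link back to each other.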
Spidering the Web
Index: Tells the spider/bot that it’s OK to index this page.
Noindex: Tells the spider/bot not to index any of the content on this page.
Follow: Tells the spider/bot that it’s OK to follow the links found on this page.
Nofollow: Tells the spider/bot not to follow any of the links on this page.
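As a rough illustration of how a spider might read these four directives from a page's robots meta tag, here is a minimal sketch. The class and function names are hypothetical, and it assumes the common convention that a page with no directive may be both indexed and followed.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )

def robots_policy(html):
    """Return (may_index, may_follow); both default to True (assumption)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return ("noindex" not in found, "nofollow" not in found)

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(robots_policy(page))  # (False, True)
```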
How Inverted Files are Created
Periodically rebuilt, static otherwise.
Documents are parsed into terms, and each term is saved with its Document ID.
Doc 1 terms: now, is, the, time, for, all, good, men, to, come, to, the, aid, of, their, country
Doc 2 terms: it, was, a, dark, and, stormy, night, in, the, country, manor, the, time, was, past, midnight

How Inverted Files are Created
After all documents have been parsed the inverted file is sorted alphabetically.
Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2

How Inverted Files are Created
Within-document term frequency information is compiled.
Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2
How Inverted Files are
Created
Finally, the file can be split into
A Dictionary or Lexicon file
and
A Postings file
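The four steps (parse, sort, merge with frequencies, split) can be sketched as below. The two sample documents are an assumption reconstructed from the slides' term lists, and the dictionary layout (document count plus an offset into the postings list) is one possible choice rather than the layout the slides prescribe.

```python
from itertools import groupby

# Sample documents: an assumption, reconstructed from the term lists.
DOCS = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor "
       "the time was past midnight",
}

def build_inverted_file(docs):
    # Step 1: parse each document into (term, doc_id) pairs.
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in text.split()]
    # Step 2: sort alphabetically by term, then by document ID.
    pairs.sort()
    # Step 3: merge identical pairs, keeping within-document frequency.
    merged = [(term, doc_id, len(list(group)))
              for (term, doc_id), group in groupby(pairs)]
    # Step 4: split into a dictionary (lexicon) and a postings list.
    dictionary = {}   # term -> (number of documents, offset into postings)
    postings = []     # flat list of (doc_id, frequency)
    for term, doc_id, freq in merged:
        if term not in dictionary:
            dictionary[term] = (0, len(postings))
        n_docs, offset = dictionary[term]
        dictionary[term] = (n_docs + 1, offset)
        postings.append((doc_id, freq))
    return dictionary, postings

def lookup(term, dictionary, postings):
    """Return the (doc_id, frequency) postings for a term."""
    if term not in dictionary:
        return []
    n_docs, offset = dictionary[term]
    return postings[offset:offset + n_docs]

dictionary, postings = build_inverted_file(DOCS)
print(lookup("the", dictionary, postings))      # [(1, 2), (2, 2)]
print(lookup("country", dictionary, postings))  # [(1, 1), (2, 1)]
```

Keeping the dictionary small and separate from the postings lets a term be located quickly before the longer postings data is read.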
You may wish to have web pages that are not indexed
(for example, test pages).
It is also possible to hide web content from robots,
using the robots.txt file and the robots meta tag.
Not all crawlers will obey this, so this is not foolproof.
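A well-behaved crawler checks robots.txt before fetching a page. The sketch below uses Python's standard `urllib.robotparser`; the rules, the `/test/` path, the host `example.test`, and the agent name `ExampleBot` are made up for illustration, and the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that hides test pages from all robots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /test/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="ExampleBot"):
    """True if the parsed rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

print(allowed("http://example.test/index.html"))       # True
print(allowed("http://example.test/test/draft.html"))  # False
```

This only models a crawler that chooses to comply; nothing in robots.txt can force compliance.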
Instead of               Do this
gold.ac.uk/science/      science.gold.ac.uk
gold.ac.uk/english/      english.gold.ac.uk
gold.ac.uk/admin/        admin.gold.ac.uk
Yahoo
The Open Directory
(Netscape, Lycos, AOL Search, others)
LookSmart
UK Plus
Snap
Indexing features
Search features
Results display
Costs, licensing and registration requirements
Unique features (if any)