Search

Web search engines
Rooted in Information Retrieval (IR) systems

Prepare a keyword index for corpus
Respond to keyword queries with a ranked list of
documents.
ARCHIE
Earliest application of rudimentary IR systems to
the Internet
Title search across sites serving files over FTP
Mining the Web, Chakrabarti and Ramakrishnan
Boolean queries: Examples
Simple queries involving relationships
between terms and documents
Documents containing the word Java
Documents containing the word Java but not
the word coffee
Proximity queries
Documents containing the phrase Java beans
or the term API
Documents where Java and island occur in
the same sentence
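A minimal sketch of how the Boolean examples above map onto set operations. The toy corpus and the helper name `containing` are assumptions made for illustration; a real engine would consult an index rather than scan the text.

```python
# Toy corpus: document ID -> text (illustrative data, not from the source).
docs = {
    1: "java beans are a component model",
    2: "java coffee is grown on the island of java",
    3: "the java api is large",
}

def containing(term):
    """Return the set of document IDs whose text contains the term."""
    return {did for did, text in docs.items() if term in text.split()}

# Documents containing the word java
java_only = containing("java")
# Documents containing java but not coffee: set difference
java_not_coffee = containing("java") - containing("coffee")
# Documents containing the phrase "java beans" or the term api: set union
phrase_or_api = {d for d, t in docs.items() if "java beans" in t} | containing("api")
```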
Document preprocessing
Tokenization
Filtering away tags
Tokens regarded as nonempty sequences of
characters excluding spaces and
punctuation.
Token represented by a suitable integer, tid,
typically 32 bits
Optional: stemming/conflation of words
Result: document (did) transformed into a
sequence of integers (tid, pos)
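The preprocessing steps can be sketched as follows. The regular expressions and the growing `vocab` dictionary are assumptions chosen for brevity; stemming is left out.

```python
import re

def tokenize(html, vocab):
    """Strip tags, split into tokens, and return a list of (tid, pos) pairs.

    vocab maps token text to an integer tid and grows as new tokens appear.
    """
    text = re.sub(r"<[^>]*>", " ", html)                 # filter away tags
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())   # drop spaces and punctuation
    result = []
    for pos, tok in enumerate(tokens):
        tid = vocab.setdefault(tok, len(vocab))          # next free tid for a new token
        result.append((tid, pos))
    return result
```

For example, `tokenize("<p>Java beans, Java API!</p>", {})` maps both occurrences of "java" to the same tid at positions 0 and 2.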
Storing tokens
Straightforward implementation using a
relational database
Example figure
Space scales to almost 10 times
Accesses to table show common pattern
Reduce the storage by mapping tids to a
lexicographically sorted buffer of (did, pos)
tuples.
Indexing = transposing document-term matrix

Two variants of the inverted index data structure, usually stored on disk. The simpler
version in the middle does not store term offset information; the version to the right stores
term offsets. The mapping from terms to documents and positions (written as
document/position) may be implemented using a B-tree or a hash-table.
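A small in-memory sketch of the variant that stores term offsets. A Python dictionary stands in for the B-tree or hash table, and the function names are illustrative assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted list of (did, pos) postings."""
    index = defaultdict(list)
    for did in sorted(docs):
        for pos, term in enumerate(docs[did].lower().split()):
            index[term].append((did, pos))
    return index

def phrase_match(index, first, second):
    """Return documents where `second` occurs immediately after `first`."""
    following = set(index.get(second, []))
    hits = {did for did, pos in index.get(first, []) if (did, pos + 1) in following}
    return sorted(hits)
```

Storing positions is what makes the phrase lookup possible; the simpler variant without offsets could only report that both terms occur in a document.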
Storage
For dynamic corpora
Berkeley DB2 storage manager
Can frequently add, modify and delete
documents
For static collections
Index compression techniques (to be
discussed)

Stopwords
Function words and connectives
Appear in a large number of documents and are of little
use in pinpointing documents
Indexing stopwords
Stopwords not indexed
For reducing index space and improving performance
Replace stopwords with a placeholder (to remember
the offset)
Issues
Queries containing only stopwords ruled out
Polysemous words that are stopwords in one sense
but not in others
E.g.: can as a verb vs. can as a noun
Stemming
Conflating words to help match a query term with a
morphological variant in the corpus.
Remove inflections that convey parts of speech, tense
and number
E.g.: university and universal both stem to universe.
Techniques
morphological analysis (e.g., Porter's algorithm)
dictionary lookup (e.g., WordNet).
Stemming may increase recall but at the price of
precision
Abbreviations, polysemy and names coined in the technical and
commercial sectors
E.g.: Stemming ides to IDE, SOCKS to sock, gated to
gate, may be bad!
Batch indexing and updates
Incremental indexing
Time-consuming due to random disk IO
High level of disk block fragmentation
Simple sort-merges.
To replace the indexed update of variable-length
postings
For a dynamic collection
single document-level change may need to
update hundreds to thousands of records.
Solution: create an additional stop-press
index.
Maintaining indices over dynamic collections.
Stop-press index
Collection of documents in flux
Model document modification as deletion followed by insertion
Documents in flux represented by a signed record (d,t,s)
s specifies if d has been deleted or inserted.
Getting the final answer to a query
Main index returns a document set D0.
Stop-press index returns two document sets
D+ : documents not yet indexed in D0 matching the query
D- : documents matching the query removed from the collection
since D0 was constructed.
Stop-press index getting too large
Rebuild the main index
Signed (d, t, s) records are sorted in (t, d, s) order and
merge-purged into the master (t, d) records
Stop-press index can be emptied out.
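A hedged sketch of answering a single-term query from a main index plus a stop-press index. The data layout is an assumption: the main index maps a term to document IDs, and the stop-press index is a list of signed (d, t, s) records in arrival order, with s being "+" for insertion and "-" for deletion. Replaying the records in order keeps a modified document (deleted, then reinserted) in the answer.

```python
def answer(main_index, stop_press, term):
    """Combine the main-index result D0 with stop-press insertions and deletions."""
    result = set(main_index.get(term, ()))    # D0
    for d, t, s in stop_press:                # records assumed to be in arrival order
        if t != term:
            continue
        if s == "+":
            result.add(d)                     # contributes to D+
        else:
            result.discard(d)                 # contributes to D-
    return result
```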
Index compression techniques
Compressing the index so that much of it
can be held in memory
Required for high-performance IR installations
(as with Web search engines)
Redundancy in index storage
Storage of document IDs.
Delta encoding
Sort Doc IDs in increasing order
Store the first ID in full
Subsequently store only difference (gap) from
previous ID
Encoding gaps
Small gap must cost far fewer bits than a
document ID.
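Delta encoding as described above can be sketched in a few lines. The function names are illustrative assumptions.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Sort the IDs, keep the first in full, then store successive differences."""
    ids = sorted(doc_ids)
    if not ids:
        return []
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def from_gaps(gaps):
    """Recover the sorted document IDs by running sums over the gaps."""
    return list(accumulate(gaps))
```

For the posting list 7, 20, 21, 33 the stored values are 7, 13, 1, 12, which are smaller on average than the IDs themselves.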
Binary encoding
Optimal when all symbols are equally likely
Unary code
optimal if probability of large gaps decays
exponentially
Encoding gaps
Gamma code
Represent gap $x$ as
Unary code for $1 + \lfloor \log_2 x \rfloor$ followed by
$x - 2^{\lfloor \log_2 x \rfloor}$ represented in binary ($\lfloor \log_2 x \rfloor$ bits)
Golomb codes
Further enhancement
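A sketch of the gamma code as a bit string. One assumption is the unary convention: the number n is written as n - 1 ones followed by a zero. Other texts use the complementary convention.

```python
def gamma_encode(x):
    """Gamma-code a gap x >= 1 as a string of '0'/'1' characters."""
    if x < 1:
        raise ValueError("gamma code is defined for x >= 1")
    n = x.bit_length() - 1                    # floor(log2 x)
    unary = "1" * n + "0"                     # unary code for 1 + floor(log2 x)
    binary = format(x - (1 << n), "b").zfill(n) if n else ""
    return unary + binary

def gamma_decode(bits):
    """Decode a single gamma-coded value from the start of a bit string."""
    n = bits.index("0")                       # number of leading ones
    rest = bits[n + 1 : n + 1 + n]            # the n binary bits that follow
    return (1 << n) + (int(rest, 2) if n else 0)
```

A gap of 1 costs a single bit and a gap of 5 costs five bits, so small gaps are far cheaper than a fixed-width document ID.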
Lossy compression mechanisms
Trading off space for time
collect documents into buckets
Construct inverted index from terms to bucket
IDs
Document IDs shrink to half their size.
Cost: time overheads
For each query, all documents in that bucket
need to be scanned
Solution: index documents in each bucket
separately
E.g.: Glimpse (http://webglimpse.org/)
General dilemmas
Messy updates vs. High compression rate
Storage allocation vs. Random I/Os
Random I/O vs. large scale
implementation
Relevance ranking
Keyword queries
In natural language
Not precise, unlike SQL
Boolean decision for response unacceptable
Solution
Rate each document for how likely it is to satisfy the user's
information need
Sort in decreasing order of the score
Present results in a ranked list.
No algorithmic way of ensuring that the ranking
strategy always favors the information need
Query: only a part of the user's information need
Responding to queries
Set-valued response
Response set may be very large
(E.g., by recent estimates, over 12 million Web
pages contain the word java.)
Demanding selective query from user
Guessing user's information need and
ranking responses
Evaluating rankings
Evaluating procedure
Given benchmark
Corpus of n documents D
A set of queries Q
For each query $q \in Q$, an exhaustive set of
relevant documents $D_q \subseteq D$ identified
manually
Query submitted to the system
Ranked list of documents $(d_1, d_2, \ldots, d_n)$
retrieved
Compute a 0/1 relevance list $(r_1, r_2, \ldots, r_n)$
$r_i = 1$ iff $d_i \in D_q$
$r_i = 0$ otherwise.
Recall and precision
Recall at rank $k \ge 1$
Fraction of all relevant documents included in
$(d_1, d_2, \ldots, d_k)$.
$\mathrm{recall}(k) = \frac{1}{|D_q|} \sum_{1 \le i \le k} r_i$
Precision at rank $k$
Fraction of the top k responses that are
actually relevant.
$\mathrm{precision}(k) = \frac{1}{k} \sum_{1 \le i \le k} r_i$
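Both measures follow directly from the 0/1 relevance list. In this sketch `r` is that list in rank order and `num_relevant` is the size of the manually identified relevant set for the query; both names are assumptions.

```python
def recall_at(r, k, num_relevant):
    """Fraction of all relevant documents found in the top k responses."""
    return sum(r[:k]) / num_relevant

def precision_at(r, k):
    """Fraction of the top k responses that are relevant."""
    return sum(r[:k]) / k
```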
Other measures
Average precision
Sum of precision at each relevant hit position in the
response list, divided by the total number of relevant
documents
$\mathrm{avg.precision} = \frac{1}{|D_q|} \sum_{1 \le k \le |D|} r_k \cdot \mathrm{precision}(k)$
avg.precision = 1 iff engine retrieves all relevant
documents and ranks them ahead of any irrelevant
document
Interpolated precision
To combine precision values from multiple queries
Gives precision-vs.-recall curve for the benchmark.
For each query, take the maximum precision obtained for the
query for any recall greater than or equal to a given recall level $\rho$
Average them together for all queries
Others like measures of authority, prestige etc.
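A sketch of average precision and of interpolated precision for one query. The names `r`, `num_relevant`, and `rho` are assumptions standing for the 0/1 relevance list, the number of relevant documents for the query, and the recall level.

```python
def average_precision(r, num_relevant):
    """Sum precision(k) over ranks k holding a relevant hit, divided by num_relevant."""
    hits, total = 0, 0.0
    for k, rk in enumerate(r, start=1):
        if rk:
            hits += 1
            total += hits / k                 # precision at this hit position
    return total / num_relevant

def interpolated_precision(r, num_relevant, rho):
    """Maximum precision over all ranks whose recall is at least rho."""
    best, hits = 0.0, 0
    for k, rk in enumerate(r, start=1):
        hits += rk
        if hits / num_relevant >= rho:
            best = max(best, hits / k)
    return best
```

With the relevance list 0, 1, 1 and two relevant documents, recall 0.5 is first reached at rank 2 with precision 0.5, but the interpolated value is 2/3 because precision is higher at rank 3.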
Precision-Recall tradeoff
Interpolated precision cannot increase with
recall
Interpolated precision at recall level 0 may be less
than 1
At level k = 0
Precision (by convention) = 1, Recall = 0
Inspecting more documents
Can increase recall
Precision may decrease
we will start encountering more and more irrelevant
documents
Search engine with a good ranking function will
generally show a negative relation between
recall and precision.
Higher the curve, better the engine
Precision and interpolated precision plotted against recall for the given relevance vector.
Missing $r_k$ are zeroes.
The vector space model
Documents represented as vectors in a
multi-dimensional Euclidean space
Each axis = a term (token)
Coordinate of document d in direction of
term t determined by:
Term frequency TF(d,t)
number of times term t occurs in document d,
scaled in a variety of ways to normalize document
length
Inverse document frequency IDF(t)
to scale down the coordinates of terms that occur
in many documents
Term frequency
$TF(d,t) = \frac{n(d,t)}{\sum_{\tau} n(d,\tau)}$
$TF(d,t) = \frac{n(d,t)}{\max_{\tau} n(d,\tau)}$
Cornell SMART system uses a smoothed
version
$TF(d,t) = 0$ if $n(d,t) = 0$
$TF(d,t) = 1 + \log(1 + n(d,t))$ otherwise
Inverse document frequency
Given
D is the document collection and $D_t$ is the set
of documents containing t
Formulae
mostly dampened functions of $|D| / |D_t|$
SMART
$IDF(t) = \log \frac{1 + |D|}{|D_t|}$
Vector space model
Coordinate of document d in axis t
$d_t = TF(d,t)\, IDF(t)$.
Transformed to $\vec{d}$ in the TFIDF-space
Query q
Interpreted as a document
Transformed to $\vec{q}$ in the same TFIDF-space
as d
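A sketch of TFIDF vectors using the SMART-style smoothed term frequency and inverse document frequency given in this section. Whitespace tokenization and the sparse dictionary representation are simplifying assumptions.

```python
import math
from collections import Counter

def tf(n):
    """Smoothed term frequency: 0 if the term is absent, else 1 + log(1 + n)."""
    return 0.0 if n == 0 else 1.0 + math.log(1.0 + n)

def tfidf_vectors(docs):
    """Return ({did: {term: weight}}, {term: idf}) for a dict of did -> text."""
    counts = {did: Counter(text.lower().split()) for did, text in docs.items()}
    df = Counter(t for c in counts.values() for t in c)      # |D_t| for each term
    idf = {t: math.log((1 + len(docs)) / df[t]) for t in df}
    vectors = {did: {t: tf(n) * idf[t] for t, n in c.items()}
               for did, c in counts.items()}
    return vectors, idf
```

A query can be passed through the same `tf` and `idf` weights to place it in the same space as the documents.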
Measures of proximity
Distance measure
Magnitude of the vector difference
$|\vec{d} - \vec{q}|$.
Document vectors must be normalized to unit
($L_1$ or $L_2$) length
Else shorter documents dominate (since queries
are short)
Cosine similarity
cosine of the angle between $\vec{d}$ and $\vec{q}$
Shorter documents are penalized
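Cosine similarity over sparse vectors can be sketched as below. Representing a vector as a dictionary from term to weight is an assumption.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as term -> weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0                            # convention for an empty vector
    return dot / (norm_u * norm_v)
```

Because the dot product is divided by both lengths, scaling a vector does not change its score.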
Relevance feedback
Users learning how to modify queries
Response list must have at least some relevant
documents
Relevance feedback
'correcting' the ranks to the user's taste
automates the query refinement process
Rocchio's method
Folding-in user feedback
To query vector $\vec{q}$
Add a weighted sum of vectors for relevant documents D+
Subtract a weighted sum of the irrelevant documents D-
$\vec{q}\,' = \alpha \vec{q} + \beta \sum_{d \in D^+} \vec{d} - \gamma \sum_{d \in D^-} \vec{d}$.
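A sketch of the Rocchio update on sparse vectors. The default weights are arbitrary illustrative values, not values taken from the source.

```python
def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the query scaled by alpha, plus beta times the sum of relevant
    vectors, minus gamma times the sum of irrelevant vectors.

    All vectors are term -> weight dicts; the default weights are illustrative.
    """
    new_q = {t: alpha * w for t, w in q.items()}
    for d in relevant:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w
    for d in irrelevant:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) - gamma * w
    return new_q
```

Setting `gamma=0` ignores the irrelevant set entirely.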
Relevance feedback (contd.)
Pseudo-relevance feedback
D+ and D- generated automatically
E.g.: Cornell SMART system
top 10 documents reported by the first round of
query execution are included in D+
$\gamma$ (the weight on D-) typically set to 0; D- not used
Not a commonly available feature
Web users want instant gratification
System complexity
Executing the second round query slower and
expensive for major search engines
Ranking by odds ratio
R: Boolean random variable which
represents the relevance of document d
w.r.t. query q.
Ranking documents by their odds ratio for
relevance
$\frac{\Pr(R \mid q,d)}{\Pr(\bar{R} \mid q,d)} = \frac{\Pr(R,q,d)/\Pr(q,d)}{\Pr(\bar{R},q,d)/\Pr(q,d)} = \frac{\Pr(R \mid q)\,\Pr(d \mid R,q)}{\Pr(\bar{R} \mid q)\,\Pr(d \mid \bar{R},q)}$
Approximating probability of d by product
of the probabilities of individual terms in d
$\frac{\Pr(d \mid R,q)}{\Pr(d \mid \bar{R},q)} \approx \prod_{t} \frac{\Pr(x_t \mid R,q)}{\Pr(x_t \mid \bar{R},q)}$
Approximately
$\frac{\Pr(R \mid q,d)}{\Pr(\bar{R} \mid q,d)} \propto \prod_{t \in q \cap d} \frac{a_{t,q}\,(1 - b_{t,q})}{b_{t,q}\,(1 - a_{t,q})}$
where $x_t$ indicates whether term t occurs in d, $a_{t,q} = \Pr(x_t = 1 \mid R,q)$ and $b_{t,q} = \Pr(x_t = 1 \mid \bar{R},q)$
Bayesian Inferencing
Bayesian inference network for relevance ranking. A
document is relevant to the extent that setting its
corresponding belief node to true lets us assign a high
degree of belief in the node corresponding to the query.
Manual specification of
mappings between terms
to approximate concepts.
Bayesian Inferencing (contd.)
Four layers
1. Document layer
2. Representation layer
3. Query concept layer
4. Query
Each node is associated with a random
Boolean variable, reflecting belief
Directed arcs signify that the belief of a
node is a function of the belief of its
immediate parents (and so on..)
Bayesian Inferencing systems
Layers 2 and 3 are the same for basic vector-space IR
systems
Verity's Search97
Allows administrators and users to define
hierarchies of concepts in files
Estimation of relevance of a document d
w.r.t. the query q
Set the belief of the corresponding node to 1
Set all other document beliefs to 0
Compute the belief of the query
Rank documents in decreasing order of belief
that they induce in the query
Other issues
Spamming
Adding popular query terms to a page unrelated to
those terms
E.g.: Adding Hawaii vacation rental to a page about
Internet gambling
Spamming has been set back somewhat by hyperlink-based ranking
Titles, headings, meta tags and anchor-text
TFIDF framework treats all terms the same
Meta search engines:
Assign weights to text occurring in tags and meta-tags
Using anchor-text on pages u which link to v
Anchor-text on u offers valuable editorial judgment about v as
well.
Other issues (contd..)
Including phrases to rank complex queries
Operators to specify word inclusions and
exclusions
With operators and phrases
queries/documents can no longer be treated
as ordinary points in vector space
Dictionary of phrases
Could be cataloged manually
Could be derived from the corpus itself using
statistical techniques
Two separate indices:
one for single terms and another for phrases
Corpus derived phrase dictionary
Two terms $t_1$ and $t_2$
Null hypothesis = occurrences of $t_1$ and $t_2$ are
independent
To the extent the pair violates the null hypothesis, it is
likely to be a phrase
Measuring violation with likelihood ratio of the
hypothesis
Pick phrases that violate the null hypothesis
with large confidence
Contingency table built from statistics
$k_{00} = k(\bar{t}_1, \bar{t}_2)$    $k_{01} = k(\bar{t}_1, t_2)$
$k_{10} = k(t_1, \bar{t}_2)$    $k_{11} = k(t_1, t_2)$
Corpus derived phrase dictionary
Hypotheses
Null hypothesis
$H(p_1, p_2; k_{00}, k_{01}, k_{10}, k_{11}) = ((1-p_1)(1-p_2))^{k_{00}} \, ((1-p_1)\,p_2)^{k_{01}} \, (p_1(1-p_2))^{k_{10}} \, (p_1 p_2)^{k_{11}}$
Alternative hypothesis
$H(p_{00}, p_{01}, p_{10}, p_{11}; k_{00}, k_{01}, k_{10}, k_{11}) = p_{00}^{k_{00}} \, p_{01}^{k_{01}} \, p_{10}^{k_{10}} \, p_{11}^{k_{11}}$
Likelihood ratio
$\lambda = \frac{\max_{p \in \Pi_0} H(p; k)}{\max_{p \in \Pi} H(p; k)}$
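A sketch of the log likelihood ratio for a pair of terms, computed from the four contingency counts. It assumes the maximizing parameters are the usual maximum-likelihood estimates: marginal frequencies of each term under the null hypothesis, and cell frequencies under the alternative. Working in logarithms avoids underflow for large counts.

```python
import math

def _klogp(k, p):
    """k * log(p), with the convention that a zero count contributes nothing."""
    return 0.0 if k == 0 else k * math.log(p)

def log_likelihood_ratio(k00, k01, k10, k11):
    """log(lambda): values near 0 suggest independence, very negative values a phrase."""
    n = k00 + k01 + k10 + k11
    p1 = (k10 + k11) / n                      # estimated probability of the first term
    p2 = (k01 + k11) / n                      # estimated probability of the second term
    null = (_klogp(k00, (1 - p1) * (1 - p2)) + _klogp(k01, (1 - p1) * p2)
            + _klogp(k10, p1 * (1 - p2)) + _klogp(k11, p1 * p2))
    alt = sum(_klogp(k, k / n) for k in (k00, k01, k10, k11))
    return null - alt
```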
Approximate string matching
Non-uniformity of word spellings
dialects of English
transliteration from other languages
Two ways to reduce this problem.
1. Aggressive conflation mechanism to
collapse variant spellings into the same
token
2. Decompose terms into a sequence of q-grams
or sequences of q characters
Approximate string matching
1. Aggressive conflation mechanism to collapse
variant spellings into the same token
E.g.: Soundex : takes phonetics and pronunciation details
into account
used with great success in indexing and searching last
names in census and telephone directory data.
2. Decompose terms into a sequence of q-grams
or sequences of q characters ($2 \le q \le 4$)
Check for similarity in the grams
Looking up the inverted index: a two-stage affair:
Smaller index of q-grams consulted to expand each query
term into a set of slightly distorted query terms
These terms are submitted to the regular index
Used by Google for spelling correction
Idea also adopted for eliminating near-duplicate pages
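A sketch of the first stage of the two-stage lookup: a small q-gram index used to expand a query term into similarly spelled vocabulary terms. Scoring candidates by the overlap of their q-gram sets and the threshold `min_overlap` are assumptions; the source does not specify the similarity check.

```python
from collections import Counter, defaultdict

def qgrams(term, q=3):
    """Set of all substrings of length q in the term."""
    return {term[i:i + q] for i in range(len(term) - q + 1)}

def build_qgram_index(vocabulary, q=3):
    """Map each q-gram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for g in qgrams(term, q):
            index[g].add(term)
    return index

def expand(term, index, q=3, min_overlap=0.3):
    """Return vocabulary terms whose q-gram sets overlap enough with the query term."""
    grams = qgrams(term, q)
    shared = Counter(c for g in grams for c in index.get(g, ()))
    return sorted(c for c, n in shared.items()
                  if n / len(grams | qgrams(c, q)) >= min_overlap)
```

The expanded terms would then be submitted to the regular inverted index.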
Meta-search systems
Take the search engine to the document
Forward queries to many geographically distributed
repositories
Each has its own search service
Consolidate their responses.
Advantages
Perform non-trivial query rewriting
Suit a single user query to many search engines with
different query syntax
Surprisingly small overlap between crawls
Consolidating responses
Function goes beyond just eliminating duplicates
Search services do not provide standard ranks which
can be combined meaningfully
Similarity search
Cluster hypothesis
Documents similar to relevant documents are
also likely to be relevant
Handling find similar queries
Replication or duplication of pages
Mirroring of sites
Document similarity
Jaccard coefficient of similarity between
document $d_1$ and $d_2$
T(d) = set of tokens in document d
$r'(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$.
Symmetric, reflexive, not a metric
Forgives any number of occurrences and any
permutations of the terms.
$1 - r'(d_1, d_2)$ is a metric
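The Jaccard coefficient on token sets is a one-liner in practice. Whitespace tokenization and treating two empty documents as identical are assumptions of this sketch.

```python
def jaccard(d1, d2):
    """Jaccard coefficient between the token sets of two text documents."""
    t1, t2 = set(d1.lower().split()), set(d2.lower().split())
    if not t1 and not t2:
        return 1.0                            # convention for two empty documents
    return len(t1 & t2) / len(t1 | t2)
```

Repeated tokens and word order have no effect, which is the sense in which the measure forgives occurrences and permutations.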
Estimating Jaccard coefficient with
random permutations
1. Generate a set of m random
permutations $\pi$
2. for each $\pi$ do
3. compute $\pi(d_1)$ and $\pi(d_2)$
4. check if $\min \pi(T(d_1)) = \min \pi(T(d_2))$
5. end for
6. if equality was observed in k cases,
estimate $r'(d_1, d_2) = k/m$.
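A sketch of the estimation procedure with explicit random permutations. Permuting only the union of the two token sets is an assumption that keeps the example small; it gives the same comparison as permuting the whole vocabulary. Both sets must be non-empty.

```python
import random

def estimate_jaccard(t1, t2, m=200, seed=0):
    """Estimate the Jaccard coefficient of two non-empty token sets from m permutations."""
    universe = sorted(t1 | t2)
    rng = random.Random(seed)
    k = 0
    for _ in range(m):
        order = universe[:]
        rng.shuffle(order)                    # one random permutation
        rank = {tok: i for i, tok in enumerate(order)}
        # Compare the minimum permuted position seen in each set.
        if min(rank[t] for t in t1) == min(rank[t] for t in t2):
            k += 1
    return k / m
```

The estimate converges to the true coefficient as m grows, since the two minima coincide exactly when the first element of the permuted union lies in the intersection.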
Fast similarity search with random
permutations
1. for each random permutation $\pi$ do
2. create a file $f_\pi$
3. for each document d do
4. write out $\langle s = \min \pi(T(d)), d \rangle$ to $f_\pi$
5. end for
6. sort $f_\pi$ using key s; this results in contiguous blocks with fixed
s containing all associated d
7. create a file $g_\pi$
8. for each pair $(d_1, d_2)$ within a run of $f_\pi$ having a given s do
9. write out a document-pair record $(d_1, d_2)$ to $g_\pi$
10. end for
11. sort $g_\pi$ on key $(d_1, d_2)$
12. end for
13. merge $g_\pi$ for all $\pi$ in $(d_1, d_2)$ order, counting the number of
$(d_1, d_2)$ entries
Eliminating near-duplicates via shingling
Find-similar algorithm reports all duplicate/near-
duplicate pages
Eliminating duplicates
Maintain a checksum with every page in the corpus
Eliminating near-duplicates
Represent each document as a set T(d) of q-grams (shingles)
Find Jaccard similarity $r(d_1, d_2)$ between $d_1$ and $d_2$
Eliminate the pair from step 9 if it has similarity above a
threshold
Detecting locally similar sub-graphs of the
Web
Similarity search and duplicate elimination on the
graph structure of the web
To improve quality of hyperlink-assisted ranking
Detecting mirrored sites
Approach 1 [Bottom-up Approach]
1. Start process with textual duplicate detection
cleaned URLs are listed and sorted to find duplicates/near-duplicates
each set of equivalent URLs is assigned a unique token ID
each page is stripped of all text, and represented as a sequence
of outlink IDs
2. Continue using link sequence representation
3. Until no further collapse of multiple URLs is possible
Approach 2 [Bottom-up Approach]
1. identify single nodes which are near duplicates (using text-shingling)
2. extend single-node mirrors to two-node mirrors
3. continue on to larger and larger graphs which are likely mirrors of
one another
Detecting mirrored sites (contd.)
Approach 3 [Step before fetching all pages]
Uses regularity in URL strings to identify host-pairs which are
mirrors
Preprocessing
Hosts are represented as sets of positional bigrams
Convert host and path to all lowercase characters
Let any punctuation or digit sequence be a token separator
Tokenize the URL into a sequence of tokens, (e.g.,
www6.infoseek.com gives www, infoseek, com)
Eliminate stop terms such as htm, html, txt, main, index, home,
bin, cgi
Form positional bigrams from the token sequence
Two hosts are said to be mirrors if
A large fraction of paths are valid on both web sites
These common paths link to pages that are near-duplicates.
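The preprocessing steps for Approach 3 can be sketched as below. The source does not define a positional bigram precisely; this sketch assumes it is a pair of adjacent tokens tagged with the position of the first token. The regular expression treats every non-letter (punctuation or digits) as a separator.

```python
import re

# Stop terms listed in the preprocessing steps.
STOP_TERMS = {"htm", "html", "txt", "main", "index", "home", "bin", "cgi"}

def positional_bigrams(url):
    """Lowercase, tokenize, drop stop terms, and form (position, token, next_token) triples."""
    tokens = [t for t in re.split(r"[^a-z]+", url.lower())
              if t and t not in STOP_TERMS]
    return {(i, a, b) for i, (a, b) in enumerate(zip(tokens, tokens[1:]))}
```

Two hosts whose bigram sets overlap heavily would be candidate mirrors, to be confirmed by checking shared paths and near-duplicate content.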
