
Retrieval Models and Ranked Retrieval
Recap and Preview
• What we have done so far
• Acquired data
• Preprocessed text
• Created an index
• Applied compression techniques
• What now?
• Retrieve documents for a query
• Rank the retrieved documents on the basis of relevance to the query
• Different retrieval models can be used to do this.
Ranked Retrieval
• Ranking algorithms based on good retrieval models will retrieve relevant
documents near the top of the ranking.
• One factor in ranking is relevance to the query.
• Relevance is from the user’s perspective.
• Good models should produce outputs that correlate well with human decisions on relevance.
• Relevance is a complex concept
• Need understanding of how language is represented and processed in the human brain
• Topical and user relevance
• Binary or Multivalued
• Solution:
• Work with theories about relevance in the form of mathematical retrieval models and test
those theories by comparing them to human actions
Retrieval Models
• Boolean retrieval model
• Vector space models
• Probabilistic approaches
Boolean retrieval model
• Used by the earliest search engines and is still in use today.
• AKA exact-match retrieval.
• Documents are retrieved if they exactly match the query specification, and otherwise
are not retrieved.
• Not exactly a ranking algorithm
• Because: for a given query, each document gets a score of either 0 or 1 (hence the
name Boolean), depending on whether it is non-relevant or relevant, respectively.
• Pros:
• Results are predictable, and exact
• Cons:
• Difficult to use; queries tend to produce a feast-or-famine situation (too many results or too few).
Example of Boolean retrieval on a Query
• Consider the query

• president AND Lincoln

• This query will retrieve a set of documents that contain both words,
occurring anywhere in the document. For example

• Ford Motor Company today announced that Darryl Hazel will succeed Brian
Kelley as president of Lincoln Mercury.
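
The AND query above can be sketched as a set intersection over an inverted index. This is a minimal illustration, not a production search engine: the helper names (`tokenize`, `build_index`, `boolean_and`) and documents 2 and 3 are hypothetical, and only document 1 comes from the slide.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Return ids of documents that contain every query term (exact match)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Ford Motor Company today announced that Darryl Hazel will succeed "
       "Brian Kelley as president of Lincoln Mercury.",
    # Documents 2 and 3 are made up for illustration.
    2: "Abraham Lincoln was the sixteenth president of the United States.",
    3: "The company president visited the new factory.",
}

index = build_index(docs)
result = boolean_and(index, "president", "Lincoln")
print(sorted(result))
```

Each document either matches or does not, so the result is an unordered set. Document 1 matches even though it is about a car company rather than the U.S. president, which shows why exact matching alone gives no ranking by relevance.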
Vector Space Model
• Queries and documents are considered as t-dimensional vectors
• where t is the number of terms in the index
• Each feature of vector is given a weight
• Binary occurrence
• Term Count
• Term Frequency (TF)
• Term Frequency-Inverse Document Frequency (tf-idf)
• A similarity measure such as cosine similarity between query vector
and document vector is used as measure of relevance.
Vector Space Model
• For example, if Di is a document, its feature vector will be

• $D_i = (d_{i1}, d_{i2}, \ldots, d_{it})$

• where $d_{ij}$ represents the weight of the jth term

• Queries are represented the same way as documents. For example, if
Q is a query

• $Q = (q_1, q_2, \ldots, q_t)$

• where $q_j$ is the weight of the jth term in the query
Vector Space Model
• The figure shows the vector of each
document. The weight of each feature
is the term count.
• If the query is “Tropical Fish”, its
vector will be
• (0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1)

• $\mathrm{Cosine}(D_1, q) = \dfrac{V(D_1) \cdot V(q)}{|V(D_1)|\,|V(q)|} = \dfrac{2}{\sqrt{2}\,\sqrt{4}} \approx 0.71$

• The documents will then be ranked based on
the similarity score.
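
The cosine computation can be checked with a few lines of Python. This is a sketch under one stated assumption: the query vector is the one given on the slide, but the document vector `d1` is hypothetical, chosen to have four term counts of 1 (two of them shared with the query) so that it reproduces the dot product of 2 and the norms of sqrt(4) and sqrt(2) used in the slide's arithmetic.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction; treat as no similarity
    return dot / (norm_u * norm_v)

# Query vector for "Tropical Fish", as given on the slide.
query = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1)

# Assumed term-count vector for D1 (not taken from the slide's figure).
d1 = (1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1)

score = cosine(d1, query)
print(round(score, 2))
```

Ranking a collection then amounts to computing this score for every document and sorting in decreasing order.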
Vector Space Model
• The vector space model was the basis for most of the research in information
retrieval in the 1960s and 1970s
• Clearly based on the assumption that relevance is related to the similarity of query
and document vectors.
• Pros:
• One of the appealing aspects of the vector space model
is the use of simple diagrams to visualize the
documents and queries.
• Helpful in teaching, but not in reality.
• Cons:
• It depends on the assumption above, that vector similarity reflects relevance.
Computing Vector Scores
• How will you compute the score of a document for a query using (only) the
index?
• You cannot store document vectors as shown in the figure on slide 9
• You can store some additional information in the postings.
• What additional information will you need to store in the postings?
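
One possible answer, sketched in Python, is to keep the term count alongside each document id in the postings and to precompute each document's vector length. The postings data, the function names (`doc_norms`, `rank`), and the choice of term-count weights with a binary query vector are all assumptions made for illustration, not the course's prescribed solution.

```python
import math
from collections import defaultdict

# Hypothetical postings: term -> list of (doc_id, term count in that document).
postings = {
    "tropical": [(1, 1), (2, 2)],
    "fish": [(1, 1), (2, 1), (3, 3)],
    "tank": [(3, 1)],
}

def doc_norms(postings):
    """Precompute the Euclidean length of every document vector from the postings."""
    squares = defaultdict(float)
    for plist in postings.values():
        for doc_id, count in plist:
            squares[doc_id] += count * count
    return {doc_id: math.sqrt(s) for doc_id, s in squares.items()}

def rank(query_terms, postings, norms):
    """Term-at-a-time cosine scoring: only postings of query terms are read."""
    terms = set(query_terms)
    acc = defaultdict(float)  # doc_id -> accumulated dot product
    for term in terms:
        for doc_id, count in postings.get(term, []):
            acc[doc_id] += count  # each query term has weight 1
    q_norm = math.sqrt(len(terms))
    scored = [(doc_id, dot / (norms[doc_id] * q_norm)) for doc_id, dot in acc.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

norms = doc_norms(postings)
ranked = rank(["tropical", "fish"], postings, norms)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Documents that contain none of the query terms never enter the accumulator, so full document vectors are never materialized.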
Weighting techniques
• Different term weighting can be used in the vector space model.
• Binary Occurrence
• 1 if term is present 0 if not
• Term Count
• Number of times a term is present in document
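• Term Frequency ($\mathrm{tf}_{t,d}$)
• $\mathrm{tf}_{t,d} = \dfrac{\mathrm{count}(t, d)}{\sum_{t'} \mathrm{count}(t', d)}$, the term count of t in d divided by the sum of all term counts in d
• Term Frequency-Inverse Document Frequency ($\text{tf-idf}_{t,d}$)
• $\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$
• $\mathrm{idf}_t = \log \dfrac{N}{\mathrm{df}_t}$
• N is the total number of documents and $\mathrm{df}_t$ is the document frequency, defined to be the
number of documents in the collection that contain the term t.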
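
The tf-idf weighting above can be sketched as follows. The toy collection and the function name `tf_idf` are hypothetical, and the natural logarithm is an assumption, since the slide does not state a log base.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: doc_id -> list of tokens. Returns doc_id -> {term: tf-idf weight}."""
    n_docs = len(docs)
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    weights = {}
    for doc_id, term_counts in counts.items():
        total = sum(term_counts.values())  # sum of all term counts in the document
        weights[doc_id] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in term_counts.items()
        }
    return weights

# Made-up, already tokenized collection.
docs = {
    1: ["tropical", "fish", "tank"],
    2: ["fish", "bowl"],
    3: ["fish", "care", "fish"],
}

weights = tf_idf(docs)
for term, w in sorted(weights[1].items()):
    print(term, round(w, 4))
```

In this toy collection "fish" appears in every document, so its idf is log(3/3) = 0 and it contributes nothing to any score, while rarer terms such as "tropical" receive a positive weight.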
Probabilistic Models
• Estimate the probability that a given document is relevant to the query.
• P(R = 1|d, q)
• Results are ranked on this probability.
• This is the basis of the probability ranking principle (PRP)
• If a reference retrieval system’s response to each request is a ranking of the
documents in the collection in order of decreasing probability of relevance to
the user who submitted the request, where the probabilities are estimated as
accurately as possible on the basis of whatever data have been made
available to the system for this purpose, the overall effectiveness of the
system to its user will be the best that is obtainable on the basis of those
data.
(van Rijsbergen 1979, 113–114)
Before next Class
• Revise Probability concepts, Manning 11.1
References:
• Bruce Croft Chapter 7
• Manning 6.2 and 6.3
• Optional Reading
• Manning 6.1
