and
Rank Retrieval
Recap and Preview
• What we have done so far
• Acquired data
• Preprocessed Text
• Created index
• Compression techniques
• What now?
• Retrieve documents in response to a query
• Rank the retrieved documents on the basis of relevance to the query
• Different retrieval models can be used to do this.
Ranked Retrieval
• Ranking algorithms based on good retrieval models will retrieve relevant
documents near the top of the ranking.
• One ranking factor is relevance to the query.
• Relevance is judged from the user’s perspective.
• Good models should produce outputs that correlate well with human decisions on relevance.
• Relevance is a complex concept
• Need understanding of how language is represented and processed in the human brain
• Topical and user relevance
• Binary or Multivalued
• Solution:
• Work with theories about relevance in the form of mathematical retrieval models and test
those theories by comparing them to human actions
Retrieval Models
• Boolean retrieval model
• Vector space models
• Probabilistic approaches
Boolean retrieval model
• Used by the earliest search engines and is still in use today.
• AKA exact-match retrieval.
• Documents are retrieved if they exactly match the query specification, and otherwise
are not retrieved.
• Not exactly a ranking algorithm
• Because: for a given query, each document gets a score of either 0 or 1 (hence the
name Boolean), if it is non-relevant or relevant, respectively.
• Pros:
• Results are predictable, and exact
• Cons:
• Difficult to use: a feast-or-famine situation, where a query tends to return either too many documents or too few.
Example of Boolean retrieval on a Query
• Consider the query
• This query will retrieve a set of documents that contain both words,
occurring anywhere in the document. For example
• Ford Motor Company today announced that Darryl Hazel will succeed Brian
Kelley as president of Lincoln Mercury.
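The exact-match behaviour described above can be sketched with a small inverted index and a set intersection. This is only an illustrative sketch: the query text is not shown above, so the two query terms used here ("president" and "lincoln") are assumed stand-ins, documents 2 and 3 are made-up examples, and the helper names `tokenize`, `build_index`, and `boolean_and` are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def boolean_and(index, terms):
    """Return ids of documents that contain every query term (exact match)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Ford Motor Company today announced that Darryl Hazel will succeed "
       "Brian Kelley as president of Lincoln Mercury.",
    2: "The company announced a new president today.",  # made-up document
    3: "Lincoln is a city in Nebraska.",                # made-up document
}

index = build_index(docs)
# Assumed query terms; the original query is not shown in the slide text.
matches = boolean_and(index, ["president", "lincoln"])

# Each document gets a score of 1 (match) or 0 (no match); there is no ranking.
scores = {doc_id: int(doc_id in matches) for doc_id in docs}
print(scores)  # {1: 1, 2: 0, 3: 0}
```

Only document 1 contains both terms, so it is the only one retrieved. Documents 2 and 3 each contain one of the terms but still score 0, which illustrates why Boolean retrieval is not a ranking algorithm.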
Vector Space Model
• Queries and documents are considered as t-dimensional vectors
• Where t is the number of terms in the index
• Each component of the vector is given a weight, for example:
• Binary occurrence
• Term Count
• Term Frequency (TF)
• Term frequency-inverse document frequency (TF-IDF)
• A similarity measure, such as cosine similarity between the query vector
and the document vector, is used as the measure of relevance.
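As a rough illustration of the points above, the sketch below builds TF-IDF vectors over a toy collection and ranks documents by cosine similarity to a query. Several details are assumptions rather than anything stated in the slides: whitespace tokenization, term frequency as count divided by document length, the idf variant log(N / df), and the hypothetical helper names `tf_idf_vectors`, `query_vector`, and `cosine`. Other weighting variants are common.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return the vocabulary, idf weights, and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed idf variant: log(N / df).
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = max(len(tokens), 1)
        vectors.append([counts[t] / length * idf[t] for t in vocab])
    return vocab, idf, vectors

def query_vector(query, vocab, idf):
    """Project the query into the same t-dimensional space as the documents."""
    tokens = [t for t in query.lower().split() if t in idf]
    counts = Counter(tokens)
    length = max(len(tokens), 1)
    return [counts[t] / length * idf[t] for t in vocab]

def cosine(u, v):
    """Cosine similarity; defined here as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy collection, made up for illustration.
docs = ["tropical fish tank", "fish and chips", "tropical weather report"]
vocab, idf, doc_vectors = tf_idf_vectors(docs)
q = query_vector("tropical fish", vocab, idf)

scores = [cosine(q, d) for d in doc_vectors]
# Rank document positions by decreasing similarity to the query.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

Unlike the Boolean model, every document gets a graded score here, so a document that matches both query terms ranks above documents that match only one.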
Vector Space Model
• For example, if Di is a document, its feature vector will be