
Deep Web

Under the guidance of


Prof. Pushpak Bhattacharyya
Presented by Jayanta Das
(11305R012)
Souvik Pal
(113059003)

Introduction
What is Deep Web

Introduction: What is Deep Web
Modern Internet: the most effective
source of information.
Most popular search engine: Google
In 2008, Google's link-processing systems reached
the trillionth (10^12) unique web link!
Its index stores several billion documents!
Despite this, we are often not
satisfied with the search results.
43% of users report dissatisfaction with the
results

Real Life Example

Motivation: Why Deep Web
Then why does Google fail?
Most of the Web's information is buried far
down on dynamically generated sites.
Traditional web crawlers cannot reach it.
A large portion of the data is literally unexplored.
The quest to explore the unknown is a human instinct.

Need for more specific information stored in databases
It can only be obtained if we have access to the
database containing the information.

Evolution of Deep Web


Early days: static HTML pages that crawlers
can easily reach.
Mid-90s: introduction of dynamic
pages that are generated as the result of a
query.
1994: Jill Ellsworth used the term
"Invisible Web" to refer to these websites.
2001: Bergman coined the term "Deep
Web".

Measuring the Deep Web (1)


"When you can measure what you
are speaking about, and express it in
numbers, you know something about
it." (Lord Kelvin)
First attempt: Bergman (2000)
Size of the Surface Web is around 19 TB
Size of the Deep Web is around 7500 TB
The Deep Web is nearly 400 times larger than
the Surface Web
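
As a quick check of the "nearly 400 times" figure, the ratio can be recomputed from the two sizes quoted on this slide. The variable names are illustrative only.

```python
# Sizes as quoted on the slide (Bergman's estimate), in terabytes.
surface_web_tb = 19
deep_web_tb = 7500

# How many times larger the Deep Web is than the Surface Web.
ratio = deep_web_tb / surface_web_tb
print(f"Deep Web / Surface Web = {ratio:.0f}x")
```

The result is a little under 400, which matches the rounded claim above.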

Measuring the Deep Web (2)


In 2004, Mitesh et al. classified the deep
web more accurately.
Most of the HTML forms are found
either on the first hop or the
second hop from the home page.
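
A minimal sketch of how the "hop" of a form can be measured, assuming the site is already available as an in-memory link graph. The graph, the page names, and the `form_hops` helper are hypothetical and are not taken from the 2004 study.

```python
from collections import deque

def form_hops(links, has_form, home, max_hops=2):
    """Breadth-first walk from `home`; return {page: hop} for pages holding a form."""
    seen = {home}
    queue = deque([(home, 0)])
    found = {}
    while queue:
        page, hop = queue.popleft()
        if page in has_form:
            found[page] = hop
        if hop == max_hops:
            continue  # do not follow links beyond the hop limit
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hop + 1))
    return found

# Made-up site: "/deep" sits three hops away and is therefore not reached.
links = {"/": ["/search", "/about"], "/about": ["/contact"], "/contact": ["/deep"]}
has_form = {"/search", "/contact", "/deep"}
result = form_hops(links, has_form, "/")
print(result)
```

Limiting the walk to two hops reflects the observation above that most forms sit close to the home page.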

Measuring the Deep Web (3)


Unstructured: data objects as
unstructured media (text, images,
audio, video)
e.g. www.cnn.com

Structured: data objects as structured
relational records with
attribute-value pairs.

Deep Resources
Dynamic Web Pages
returned in response to a submitted query or accessed only
through a form

Unlinked Contents
Pages without any backlinks

Private Web
sites requiring registration and login (password-protected
resources)

Limited Access Web
Sites with CAPTCHAs or no-cache Pragma HTTP headers

Scripted Pages
Pages produced by JavaScript, Flash, AJAX, etc.

Non-HTML Contents
Multimedia files, e.g. images or videos

Approach towards
crawling
Deep Web

Timeline: How it all started!


2001: Raghavan et al. -> Hidden Web
Exposer
domain-specific, human-assisted crawler

2002: StumbleUpon used human crawlers
human crawlers can find relevant links that
algorithmic crawlers miss.

2003: Bergman introduced LexiBot
used for quantifying the deep web

2004: Yahoo! Content Acquisition Program
paid inclusion for webmasters

Timeline contd.
2005: Yahoo! Subscriptions
Yahoo started searching subscription-only sites,
e.g. WSJ

2005: Ntoulas et al. -> Hidden Web Crawler
automatically generated meaningful queries
to issue against search forms

2005: Google Sitemaps
Allows webmasters to inform search engines
about URLs on their websites that are
available for crawling.
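
To make the sitemap idea concrete, here is a minimal sketch that writes a URL list in the XML layout of the sitemaps.org protocol, which later standardized this mechanism. The example URLs and the `build_sitemap` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org 0.9 protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap document listing each URL in a <url><loc> entry."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical deep pages that a crawler would not find by following links.
sitemap_xml = build_sitemap([
    "https://example.org/catalog?id=1",
    "https://example.org/catalog?id=2",
])
print(sitemap_xml)
```

The webmaster, not the crawler, enumerates the pages, which is why this helps with content that has no inbound links.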

Present Deep Web Search Scenario
Federated Search
Google's surfacing

Federated Search
Federated search is the process of
performing a real-time search of multiple
diverse and distributed sources from a
single search page, with the federated
search engine acting as intermediary.
Why federated?
Content from different sources is combined
instead of searching the sources one at a
time.

Federated Search: Properties (1)
Real Time
Federated search occurs live and results are
current.

Diverse and Distributed Sources
Multiple sources located in different
places on the web are searched.
Sources are diverse in nature, containing
text, documents, PDFs, PPTs, etc.

Federated Search: Properties (2)
Single Search Page
Federated search engines provide a single point of
searching.

Federated Search Engine Acts as Intermediary
The user does not communicate directly with the
content sources when performing searches.
The search engine does it on the user's
behalf.

Federated Search Method


Works by filling out forms on web pages.
The search engine is programmed with
the knowledge of each form that it has
to search.
It knows how to fill out the form, press
the submit button and retrieve the
results.
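
The form-filling step can be sketched as follows, assuming each source exposes a GET search form. The `FORM_KNOWLEDGE` table, its field names, and the example hosts are hypothetical. A real engine would also send the request and parse the result page.

```python
from urllib.parse import urlencode

# Hypothetical per-source knowledge: where each form submits, which field
# carries the user's query, and which fields always take a fixed value.
FORM_KNOWLEDGE = {
    "library": {
        "action": "https://library.example/search",
        "query_field": "q",
        "fixed": {"type": "book"},
    },
    "patents": {
        "action": "https://patents.example/find",
        "query_field": "term",
        "fixed": {},
    },
}

def build_request_url(source, user_query):
    """Fill out the source's form and return the URL a submit would request."""
    form = FORM_KNOWLEDGE[source]
    params = dict(form["fixed"])
    params[form["query_field"]] = user_query
    return form["action"] + "?" + urlencode(params)

for source in FORM_KNOWLEDGE:
    print(build_request_url(source, "deep web"))
```

Each new source needs its own entry in the table, which is the manual programming effort the slide refers to.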

Web Form example

A web form that a normal search engine cannot crawl. This involves [...]
in the textbox, clicking search and retrieving the results.

Federated search example

WorldWideScience.org: searches science content from all over the world [...]
government agencies, research and academic organizations.

Fed Search In Ac

Incremental search: federated search engines do not wait for results from [...]
To improve response time, results are displayed in chunks while the search [...]
in the background. When a new result set is available the user is prompted [...]
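
A minimal sketch of incremental result delivery, assuming each source can be queried independently. The sources here are simulated with `time.sleep`, and the `query_source` and `incremental_search` names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def query_source(name, delay, query):
    """Stand-in for a remote source that answers after `delay` seconds."""
    time.sleep(delay)
    return name, [f"{name}: result for {query!r}"]

def incremental_search(query, sources):
    """Yield each source's result set as soon as it arrives, fastest first."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = [
            pool.submit(query_source, name, delay, query)
            for name, delay in sources.items()
        ]
        for future in as_completed(futures):
            yield future.result()

arrival_order = []
for name, results in incremental_search("deep web", {"slow": 0.3, "fast": 0.01}):
    arrival_order.append(name)
    print(name, results)  # a UI would show this chunk immediately
```

The caller sees the fast source's chunk without waiting for the slow one, so overall latency is no longer set by the slowest source.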

Metasearch vs Fed Search


Metasearch is similar to federated search.
Here the search engine searches other
search engines in real time.
Even though the underlying search engines
are queried in real time, they may not have
the most current information, as they
themselves rely on crawlers.
It is NOT a Deep Web search!
People often confuse metasearch
with federated search.

Metasearch example

Federated Search (Advantages)
Efficiency, Time Savings
Instead of the user querying many search engines
one at a time, the federated search
engine does it on the user's behalf.
Quality of Results
Searches only authoritative sources, since
it has been programmed to do so.
Most Current Content
Searches in real time.

Federated Search (Challenges)
Aggregation
The process of combining search results
from different sources in some helpful
way,
e.g. sorting by date, title, or author

Ranking
Displaying the results most relevant to the search first

De-duplication
A federated search engine may retrieve
the same result from multiple sources
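
Aggregation and de-duplication can be sketched together, assuming every source returns records with `title`, `author`, and `date` fields. The records below are made up, and matching on normalized title plus author is one simple heuristic among many. Sorting by a field is shown in place of true relevance ranking, which needs a scoring model.

```python
def aggregate(result_lists, sort_key="date"):
    """Merge result lists, drop duplicates, and sort by the given field."""
    seen = set()
    merged = []
    for results in result_lists:
        for item in results:
            # Normalize so the same record from two sources compares equal.
            key = (item["title"].strip().lower(), item["author"].strip().lower())
            if key in seen:
                continue
            seen.add(key)
            merged.append(item)
    return sorted(merged, key=lambda item: item[sort_key])

# Made-up results from two sources; "Alpha" appears in both.
source_a = [
    {"title": "Beta", "author": "Roe", "date": "2009-03-01"},
    {"title": "Alpha", "author": "Doe", "date": "2008-07-15"},
]
source_b = [
    {"title": "alpha ", "author": "DOE", "date": "2008-07-15"},
    {"title": "Gamma", "author": "Poe", "date": "2007-01-20"},
]
merged = aggregate([source_a, source_b])
print([item["title"] for item in merged])
```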

Google's reasons to move away from Fed Search
Federated search works quite well
when it is restricted to one domain.
For general search involving
multiple domains it is not as effective:
The number of domains is extremely large.
Defining the boundary of a domain is difficult.
Mapping a query to a domain is difficult.
It depends on the latency of the deep web
sources.

Case Study: Google's Crawling

Case Study: Google's crawling (1)
Two approaches for Deep Web
crawling:
Virtual Integration
Surfacing

Case Study: Google's crawling (2)
Virtual Integration (domain
specific)
A mediator form is created for
each domain.
Semantic mappings are defined between
the individual data sources and the
mediator form.
Queries are performed in real time.
Drawbacks:
Cost of building the mediator form
and the mappings.
Identifying the relevant queries for a
particular domain.

[Figure: a mediated form connected through semantic mappings to deep-web sources]
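
The semantic mapping step can be sketched as a field-name translation from the mediator form to each source's own form. The domain (used cars), the source names, and every field name below are hypothetical.

```python
# Hypothetical mappings: mediator-form field -> field name used by each source.
SEMANTIC_MAPPINGS = {
    "cars-site-a": {"make": "brand", "max_price": "price_to"},
    "cars-site-b": {"make": "manufacturer", "max_price": "maxprice"},
}

def translate(mediated_query, source):
    """Rewrite a query on the mediator form into the source's own field names."""
    mapping = SEMANTIC_MAPPINGS[source]
    return {
        mapping[field]: value
        for field, value in mediated_query.items()
        if field in mapping  # fields the source cannot express are dropped
    }

query = {"make": "Honda", "max_price": 5000}
for source in SEMANTIC_MAPPINGS:
    print(source, translate(query, source))
```

Every new source needs a hand-built mapping like these, which is the cost listed as a drawback of this approach.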

Case Study: Google's crawling (3)
Surfacing:
Precomputes the most relevant form values for
interesting HTML forms.
The resulting URLs are generated offline and
indexed.
Helps retain the existing infrastructure
while including Deep Web content.
Covers the maximum number of web pages while bounding
the total number of web form submissions.
GET vs POST method
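
A minimal sketch of offline URL generation with a submission budget. It assumes a GET form, whose submission is fully described by a URL and can therefore be indexed like any other page. The form, its candidate values, and the `surface_urls` helper are hypothetical, and cutting off at the first N combinations is a naive stand-in for choosing the most relevant values.

```python
from itertools import islice, product
from urllib.parse import urlencode

def surface_urls(action, candidate_values, max_submissions):
    """Yield at most `max_submissions` GET URLs built from the candidate values."""
    names = list(candidate_values)
    combos = product(*(candidate_values[name] for name in names))
    for combo in islice(combos, max_submissions):
        yield action + "?" + urlencode(dict(zip(names, combo)))

# Hypothetical form with two inputs: 6 combinations, budget of 4.
urls = list(surface_urls(
    "https://cars.example/search",
    {"make": ["honda", "ford"], "state": ["ca", "ny", "tx"]},
    max_submissions=4,
))
for url in urls:
    print(url)
```

The generated URLs can be handed to an ordinary crawler, which is how surfacing reuses the existing indexing infrastructure.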

Case Study: Google's crawling (4)
Challenges:
Which form inputs to fill
Appropriate values for those inputs

Google's approach:
Selecting wildcards for form submission
Some fields are mandatory

Query templates
Testing with all possible values in a select
menu
Predicting form values from data types
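
One way to picture query templates: a template names the inputs that receive concrete values, while every other input is left at its default (wildcard). This sketch only enumerates templates and their submissions for select menus. The form and helper names are hypothetical, and it does not show how informative templates are chosen.

```python
from itertools import combinations, product

def query_templates(select_inputs, max_binding=2):
    """Yield templates: tuples of the inputs that will be bound to values."""
    names = sorted(select_inputs)
    for size in range(1, max_binding + 1):
        yield from combinations(names, size)

def submissions(template, select_inputs):
    """Yield one submission per combination of select-menu values in the template."""
    for combo in product(*(select_inputs[name] for name in template)):
        yield dict(zip(template, combo))

# Hypothetical form with three select menus.
select_inputs = {"make": ["honda", "ford"], "state": ["ca", "ny"], "color": ["red"]}
templates = list(query_templates(select_inputs))
print(templates)
print(list(submissions(("make", "state"), select_inputs)))
```

Capping the number of bound inputs keeps the number of submissions per form manageable, since the count grows multiplicatively with each extra input.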

Subconscious Mind and Deep Web
Inspiration behind exploration of the
deep web
Analogy
Iceberg example
Real life example

References (1)
1. Wikipedia, "Deep web",
http://en.wikipedia.org/wiki/Deep_web
2. Bergman, Michael K., "The Deep Web: Surfacing Hidden Value". The
Journal of Electronic Publishing, August 2001
3. Alex Wright, "Exploring a 'Deep Web' That Google Can't Grasp". The
New York Times, Feb 23, 2009.
http://www.nytimes.com/2009/02/23/technology/internet/23search.html?th&emc=th
4. Jesse Alpert & Nissan Hajaj, "We knew the web was big", 2008,
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
5. He, Bin; Patel, Mitesh; Zhang, Zhen; Chang, Kevin Chen-Chuan, "Accessing the Deep Web: A Survey". Communications of the
ACM (CACM), May 2007

References (2)
6. Madhavan, Jayant; David Ko; Łucja Kot; Vignesh Ganapathy; Alex
Rasmussen; Alon Halevy, "Google's Deep-Web Crawl", 2008
7. Maureen Flynn-Burhoe, "Timeline of events related to the Deep Web",
2008, http://papergirls.wordpress.com/2008/10/07/timeline-deepweb/
8. Darcy Pedersen, "Federated Search Finds Content that Google Can't
Reach Part I of III", 2009,
http://deepwebtechblog.com/federated-search-finds-content-thatgoogle-cant-reach-part-i-of-iii/
9. Darcy Pedersen, "A Federated Search Primer Part II of III", 2009,
http://deepwebtechblog.com/a-federated-search-primer-part-ii-of-iii/
10. Darcy Pedersen, "A Federated Search Primer Part III of III", 2009,
http://deepwebtechblog.com/a-federated-search-primer-part-iii-of-iii/

THANK YOU