
WEB STRUCTURE MINING

SUBMITTED BY:
BLESSY JOHN
R7A
ROLL NO:18
INTRODUCTION
 Web mining is the application of data
mining techniques to Web data, as used in
search engines.
 Data mining - process of discovering
useful knowledge from data sources
 Web mining automatically discovers and
extracts information from Web documents.
 Web structure mining discovers useful
data from hyperlinks.
WEB MINING
 Extraction of useful patterns from WWW
resources
 The WWW is a widely distributed, global
information service centre that
constitutes a rich source for data
mining
 Employs techniques from data mining,
information retrieval, etc.
NEED FOR WEB MINING
 Aims at finding and extracting relevant
information that is hidden in web-related
data.
 The challenge is to bring back the
semantics of hypertext documents
 To turn web data into web knowledge
CLASSIFICATION

WEB MINING
 WEB CONTENT MINING
 WEB STRUCTURE MINING
 WEB USAGE MINING
WEB STRUCTURE MINING
 Generates a structural summary of the
Web site and its Web pages
 Uses graph theory to analyse the node and
connection structure of a web site
 Analyses the link structure of the
web; its purpose is to identify
preferable documents
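
The graph view described above can be sketched as follows. This is only an illustration: the four-page site and the helper name `degree_summary` are hypothetical, and counting incoming and outgoing links is just one simple kind of structural summary.

```python
from collections import defaultdict

# Hypothetical site: each page (node) maps to the pages it links to (edges).
site_links = {
    "/index": ["/about", "/products"],
    "/about": ["/index"],
    "/products": ["/index", "/products/a"],
    "/products/a": ["/products"],
}

def degree_summary(links):
    """Return {page: (in_degree, out_degree)} for a directed link graph."""
    in_degree = defaultdict(int)
    for source, targets in links.items():
        for target in targets:
            in_degree[target] += 1
    pages = set(links) | set(in_degree)
    return {p: (in_degree[p], len(links.get(p, []))) for p in sorted(pages)}

summary = degree_summary(site_links)
for page, (n_in, n_out) in summary.items():
    print(f"{page}: {n_in} incoming, {n_out} outgoing")
```

Pages with many incoming links are candidates for the "more preferable" documents that link analysis tries to identify.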
WEB STRUCTURE MINING (continued)
 Discovers the nature of the hierarchy
of hyperlinks in the website and its
structure
 A hyperlink signals the author’s
endorsement of the other web page
 Retrieves information about the
relevance and the quality of the web
page.
Page Layout and Link Analysis for Web Images
WEB BASICS
 The Web is a huge collection of documents
linked together by references.
 References from one document to another
are based on hypertext and embedded in
HTML
 HTML describes how the document
should be displayed in the browser window
 Web document has a web address
called URL that identifies it uniquely.
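
To make the idea of references embedded in HTML concrete, here is a minimal sketch that pulls the hyperlinks out of a page using Python's standard `html.parser` and resolves them to full URLs with `urljoin`. The class name `LinkExtractor`, the sample markup and the example.com addresses are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the targets of <a href="..."> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative references against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

sample = ('<p>See <a href="/about.html">about</a> and '
          '<a href="http://example.org/">this site</a>.</p>')
links = extract_links(sample, "http://example.com/index.html")
print(links)
```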
WEB CRAWLERS
 Collect “all” web documents by
browsing the Web systematically and
exhaustively
 The region of the web to be crawled can be
specified by using the URL structure.
 Used by a search engine to provide
local access to the most recent versions
of possibly all web pages
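
A crawler of this kind can be sketched as a breadth-first traversal that only follows URLs starting with a given prefix. The sketch below is an assumption-laden toy: `FAKE_WEB` is a hypothetical in-memory stand-in for real HTTP requests, the `fetch` parameter is a placeholder for a download function, and the regular expression is a deliberately crude way to find links.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="http://other.org/">X</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a> <a href="/">home</a>',
    "http://example.com/b.html": "no links here",
    "http://other.org/": '<a href="http://example.com/">back</a>',
}

HREF = re.compile(r'href="([^"]+)"')  # crude link finder, for illustration only

def crawl(seed, prefix, fetch):
    """Breadth-first crawl from seed, visiting only URLs that start with prefix."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved
        pages[url] = html  # local copy of the document
        for href in HREF.findall(html):
            target = urljoin(url, href)
            if target.startswith(prefix) and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages

pages = crawl("http://example.com/", "http://example.com/", FAKE_WEB.get)
print(sorted(pages))
```

Restricting the crawl by URL prefix is one simple reading of "specified by using the URL structure"; a real crawler would also need politeness delays, robots.txt handling and error recovery.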
INDEXING AND KEYWORD SEARCH
 There are two types of data:
structured and unstructured
 Structured data have keys associated
with each data item that reflect its
content
 Content-based access to unstructured
data without considering the meaning is
the keyword search approach
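
Keyword search over unstructured text is commonly supported by an inverted index that maps each word to the documents containing it. The following is a minimal sketch under that assumption; the three sample documents and the function names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict

# Hypothetical document collection: id -> unstructured text.
docs = {
    1: "web mining uses data mining techniques",
    2: "hyperlinks connect web documents",
    3: "data sources contain useful knowledge",
}

def build_index(documents):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query keyword."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = build_index(docs)
print(search(index, "web"))
print(search(index, "data mining"))
```

The match is purely on word occurrence, which is what "without considering the meaning" amounts to in practice.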
DOCUMENT REPRESENTATION
 To facilitate the process of matching
keywords and documents, some
preprocessing steps are taken first:
 Documents are tokenized
 Characters are converted to upper or
lower case
 Words are reduced to canonical form
 Stopwords are usually removed
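
The preprocessing steps above can be sketched as a small pipeline. Everything specific here is an assumption for illustration: the stopword list is tiny, and `crude_stem` is a rough suffix stripper standing in for a real stemming algorithm.

```python
import re

# Tiny illustrative stopword list; real systems use much longer ones.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        # Only strip when at least four characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[A-Za-z0-9]+", text)          # tokenize
    tokens = [t.lower() for t in tokens]                 # convert to lower case
    tokens = [crude_stem(t) for t in tokens]             # reduce to canonical form
    return [t for t in tokens if t not in STOPWORDS]     # remove stopwords

print(preprocess("The Crawlers are mining Web pages"))
```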
ALGORITHMS
 There are two main algorithms used in
web structure mining
1. HITS (Hypertext-Induced Topic
Search)
2. PageRank algorithm
HITS (Hypertext-Induced Topic Search)

 Link analysis algorithm
 Rates web pages
 Developed by Jon Kleinberg
 Determines two values for a page
 Authority: estimates the value of the
content of the page
 Hub: estimates the value of its links to
other pages
Hubs and Authorities
 Hub pages contain interesting links to
authorities (relevant pages)
 Authorities are targets of hub pages
HITS (continued)
 Authority and hub values are defined in
terms of one another in a mutual
recursion
 It is executed at query time, with the
associated hit on performance
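
The mutual recursion between hub and authority values can be sketched as an iterative update. This is a minimal illustration, not a reference implementation: the fixed iteration count, the normalisation to unit length, and the four-page example graph are assumptions.

```python
import math

def hits(links, iterations=50):
    """Iteratively compute (authority, hub) scores for a directed link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}
        # Hub score of a page: sum of the authority scores of pages it links to.
        new_hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}
    return auth, hub

# Hypothetical graph: h1 and h2 act as hubs, a1 and a2 as authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
auth, hub = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Because the scores are recomputed for the pages relevant to each query, this loop is the part that runs at query time.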
PageRank
 Link analysis algorithm
 Assigns a numerical weight to each
element of a hyperlinked set of
documents
 The weight of an element E is denoted by PR(E)
 Relies on the uniquely democratic nature
of the web
 A link from page A to page B is a vote,
by page A, for page B
PageRank (continued)
 If A is itself important, its vote weighs
more and helps to make B important
 Also a probability distribution:
represents the probability that a person
randomly clicking on links arrives at any
particular page
 A PageRank of 0.5 means a 50% chance that a
person clicking on a random link will be
directed to the document with the 0.5 PageRank
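
The voting and probability interpretation can be illustrated with a short power-iteration sketch. Note the assumptions: the damping factor of 0.85 is a commonly used value that does not appear in these slides, pages without outgoing links are handled by spreading their rank evenly, and the three-page graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PR(E) for every page E by repeated redistribution of rank."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each link is a vote: the page splits its rank among its targets.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (assumption): spread its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
ranks = pagerank(graph)
for page, value in ranks.items():
    print(page, round(value, 3))
```

The values sum to 1, so each one can be read as the probability of arriving at that page; B ranks highest here because it receives votes from both A and C.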
APPLICATIONS
 Information retrieval in social networks.
 To find out the relevancy of each Web
page
 Measuring completeness of the Web
sites
 Used in search engines to find out
relevant information
CONCLUSION
 Search engines use web structure
mining to find information.
 We can create new knowledge out of
the available information
 Web content mining can be added to it
to enhance the performance of search
engines.
Thank You!
Questions?