
CHAPTER 1 INTRODUCTION

1.1 OVERVIEW:
1.1.1 THE WORLD WIDE WEB:

The World Wide Web (WWW) has impacted almost every aspect of our lives. It is the biggest and most widely known information source that is easily accessible and searchable. It consists of billions of interconnected documents (called Web pages) which are authored by millions of people. Before the Web, information seeking was by means of asking a friend or an expert, or by reading a book. With the Web, however, everything is only a few clicks away from the comfort of our homes or offices. The Web also provides a convenient means to communicate and discuss with people from anywhere in the world.

The operation of the Web relies on the structure of its hypertext documents. Hypertext allows Web page authors to link their documents to other related documents residing on computers anywhere in the world. To view these documents, one simply follows the links (called hyperlinks). The Web, being the largest publicly accessible data source in the world, has many unique characteristics, which make mining useful information and knowledge a fascinating and challenging task. Some of these characteristics are: The amount of data/information on the Web is huge and still growing. Coverage of the information is also wide and diverse. Data of all types exist on the Web. Information on the Web is heterogeneous, i.e. Web pages have diverse authorship and hence may present information in their own individual way.

Pages on the Web are linked using hyperlinks. Links help in navigating within a website or in moving from one site to another. The information on the Web is noisy. The noise comes from two main sources. First, a typical Web page contains many pieces of information, e.g. the main content of the page, navigation links, advertisements, copyright notices, privacy policies, etc. For a particular application, only part of this information is useful; the rest is considered noise. To perform data mining and Web content analysis, the noise has to be removed. Second, because the Web has no quality control of information, a large amount of information on the Web is of low quality, erroneous or misleading. The Web is also dynamic: information on the Web changes constantly.

These characteristics of the Web present both challenges and opportunities for mining and discovery of information and knowledge from the Web.

1.1.2 WEB DATA MINING

Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data. Web mining tasks can be categorized into three types:

Web structure mining: Web structure mining discovers useful knowledge from hyperlinks, which represent the structure of the Web.

Web content mining: Web content mining extracts or mines useful information or knowledge from Web page content.

Web usage mining: Web usage mining refers to the discovery of user access patterns from Web usage logs, which record every click made by each user.

1.2 OBJECTIVE:

One of the problems faced in Web data mining is data collection. For Web mining, data collection can be a substantial task, especially for Web structure and content mining, which involves crawling a large number of target Web pages. Another problem with Web mining is that the content on Web pages is noisy. Web pages typically contain the main content, advertisements, navigation links, privacy policies, etc. Parts of the Web page that are not useful to the user are labeled as noise. Such noise affects the quality of Web mining and hence has to be eliminated. Our project aims to make use of the power of a Web crawler to extract the main content from web sites while eliminating the noisy data. Here we concentrate on review sites and blogs which provide information regarding products. As a sample case we have taken review sites of various cars for our project.

CHAPTER 2 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

Most data cleaning research has been done on cleaning databases and data warehouses. Web page cleaning is different: it deals with semi-structured data, so different methods are needed in the context of the Web. Some existing methods for cleaning web pages include:

Use of raw regular expressions. Regular expressions allow for quick retrieval of content, but they are complex and confusing to analyze.

Use of machine learning. The machine is taught the general template of the page, and extraction occurs based on this template. This requires a high level of expertise and an understanding of AI concepts.

Development of wrappers. Wrappers make use of a set of highly accurate rules that extract a particular page's content. Wrappers handle highly structured collections of web pages but fail when the pages are less structured. Wrappers exploit shallow natural language knowledge and can be applied to less structured text, but at a lower efficiency.

Existing content extraction systems do not integrate crawling with content extraction. To extract relevant pages from a website we can make use of a web crawler.

2.2 PROPOSED SYSTEM:

We propose a technique using a web crawler to crawl through the Seed URL and obtain all the links relevant to the Seed page. From these relevant links only, we extract the content using a combination of regular expressions and the DOM (Document Object Model) tree structure. This technique is based on the analysis of both the layouts and the actual content of the Web pages. The Web crawler is custom designed to extract only those links that are relevant to a particular website. Once the relevant links are extracted, we obtain the properly structured HTML page from the Web and cache it on the system. The cached page is made free of HTML errors and discrepancies during the caching process. The cached pages are represented as DOM trees. Analysis of the DOM tree and the HTML pages is done in order to determine the location of the main content, and based on this analysis the content is extracted. In order to make the content suitable for data analysis and mining we make use of text mining cleaning methodologies like stemming and stop words removal.

Stop words removal:

Stop words are common words that carry less important meaning than the actual keywords. Usually search engines remove stop words from a keyword phrase to return the most relevant result. A high occurrence of stop words makes content look less important to search engines.
Stemming:

In linguistic morphology, stemming is the process of reducing inflected words to their stem, base or root form. This process is used in search engines and other natural language processing problems.

2.3 ADVANTAGES OF PROPOSED SYSTEM:

Our web crawler extracts only relevant links of a particular website. Since almost all pages within a web site follow the same structure (looking at review sites), content extraction can be generalized to each web site. The current system does not extract the unwanted links, which are classified as noise. All links that do not follow the domain of the Seed URL are classified as unwanted. Content extraction provides only the main content and hence improves the efficiency of the web data mining process.

2.4 DISADVANTAGES:

The web crawler does not accept a list of Seed URLs. The efficiency of the web crawler can be improved by making it a multithreaded web crawler. Analyzing the DOM tree requires some manual work. Other structuring techniques, in the form of XML tags, can be included in the extracted content to assist in Web data mining.

CHAPTER 3 LITERATURE REVIEW

3.1 SYSTEM STUDY:

3.1.1 WEB CRAWLER:

Web crawlers, also known as spiders or robots, are programs that automatically download Web pages. Since information on the Web is scattered among billions of pages served by millions of servers around the globe, users who browse the Web can follow hyperlinks to access information, virtually moving from one page to the next. A crawler can visit many sites to collect information that can be analyzed and mined in a central location, either online (as it is downloaded) or offline (after it is stored).

Were the Web a static collection of pages, we would have little

long term use for crawling. Once all the pages are fetched and saved in a repository, we are done. However, the Web is a dynamic entity evolving at rapid rates. Hence there is a continuous need for crawlers to help applications stay current as pages and links are added, deleted, moved or modified.

Some of the applications of Web crawlers are:

They are used in business intelligence, whereby organizations collect information about their competitors and potential collaborators.

They can be used to monitor Web sites and pages of interest.

The most widespread use of crawlers is in support of search engines for collecting pages to build their indexes.

There are two types of Crawlers:

Universal Crawlers: These crawlers crawl all pages irrespective of their content.

Preferential (or) Focused Crawlers: These crawlers are more targeted in the web sites that they crawl.

3.1.2 A BASIC CRAWLER ALGORITHM:

A crawler starts from a set of seed pages (input URLs) and then

uses the links within them to fetch other pages. The links in these pages are, in turn, extracted and the corresponding pages are visited. This process repeats until a sufficient number of pages are visited or some objective is achieved.

The following flowchart depicts the working of a sequential crawler. Such a crawler fetches one page at a time, making inefficient use of its resources.

The crawler maintains a list of unvisited URLs called the frontier. The list is initialized with seed URLs which may be provided by the user or another program.


In each iteration of its main loop, the crawler picks the next URL

from the frontier, fetches the page corresponding to the URL through HTTP, parses the retrieved page to extract its URLs, adds newly discovered URLs to the frontier, and stores the page (or other extracted information, possibly index terms) in a local disk repository.

The crawling process may be terminated when a certain number of

pages have been crawled. The crawler may also be forced to stop if the frontier becomes empty, although this rarely happens in practice due to the high average number of links.
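The loop just described can be sketched as follows. This is an illustrative outline only, not the project's actual crawler (which is listed in Appendix A); fetch, store and extractLinks are placeholders for real implementations.

    import java.net.URL;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    // Minimal sketch of a sequential crawler loop.
    public class SequentialCrawler {
        public static void crawl(String seedUrl, int maxPages) throws Exception {
            Queue<String> frontier = new ArrayDeque<>();   // unvisited URLs
            Set<String> visited = new HashSet<>();         // already fetched URLs
            frontier.add(seedUrl);
            while (!frontier.isEmpty() && visited.size() < maxPages) {
                String next = frontier.poll();
                if (!visited.add(next)) continue;          // skip already visited URLs
                String html = fetch(new URL(next));        // fetch the page over HTTP
                store(next, html);                         // save to the local repository
                for (String link : extractLinks(html)) {   // parse out hyperlinks
                    if (!visited.contains(link)) frontier.add(link);
                }
            }
        }
        // Placeholders for the real fetching, storing and parsing steps.
        static String fetch(URL url) { return ""; }
        static void store(String url, String html) { }
        static Iterable<String> extractLinks(String html) { return new HashSet<>(); }
    }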


Fig 3.1 Science of Crawling

3.1.3 PAGE CACHING:

Web page caching is used to store pages onto the system in order

to reduce the bandwidth traffic and the server load.

There are different types of caches: a forward cache sits close to users and accelerates their requests to the Internet; a reverse cache sits in front of one or more Web servers and web applications, accelerating requests from the Internet.

Caching is also used to maintain Web archives. A Web archive is

a collection of portions of the WWW kept for preservation purposes and for future use by scientists and analysts.

3.1.4 DOM TREE:

DOM stands for Document Object Model. The DOM is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents.

The Document Object Model (DOM) is an application programming interface (API) for valid HTML and well-formed XML documents. It defines the logical structure of documents and the way a document is accessed and manipulated. With the Document Object Model, programmers can build documents, navigate their structure, and add, modify, or delete elements and content. Anything found in an HTML or XML document can be accessed, changed, deleted, or added using the Document Object Model.

<TABLE>
  <TBODY>
    <TR>
      <TD>Shady Grove</TD>
      <TD>Aeolian</TD>
    </TR>
    <TR>
      <TD>Over the River, Charlie</TD>
      <TD>Dorian</TD>
    </TR>
  </TBODY>
</TABLE>

The graphical representation of the DOM of the example table is:

Fig 3.2 DOM Tree

In the DOM, documents have a logical structure which is very

much like a tree; to be more precise, which is like a "forest" or "grove", which can contain more than one tree. Each document contains zero or one doctype nodes, one root element node, and zero or more comments or processing instructions; the root element serves as the root of the element tree for the document.
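The project itself represents pages with HtmlCleaner's tree of TagNode objects (see Appendix A); purely for illustration, the following minimal sketch uses the standard Java XML DOM API to parse the table above and walk its nodes.

    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class DomWalk {
        public static void main(String[] args) throws Exception {
            String xml = "<TABLE><TBODY>"
                + "<TR><TD>Shady Grove</TD><TD>Aeolian</TD></TR>"
                + "<TR><TD>Over the River, Charlie</TD><TD>Dorian</TD></TR>"
                + "</TBODY></TABLE>";
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // The root element (TABLE) is the root of the element tree.
            System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
            // Visit every TD node and print its text content.
            NodeList cells = doc.getElementsByTagName("TD");
            for (int i = 0; i < cells.getLength(); i++) {
                System.out.println("Cell " + i + ": " + cells.item(i).getTextContent());
            }
        }
    }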

3.1.5 STOP WORDS:

Some extremely common words which would appear to be of little

value in helping select documents matching a user's need are excluded from the vocabulary entirely. These words are called stop words.

The general strategy for determining a stop list is to sort the terms

by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing.

An example of a stop list is shown below:

Fig 3.3 Stop list
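As a concrete illustration of the strategy just described, the following minimal sketch counts collection frequencies and returns the most frequent terms as stop-word candidates. It is our own illustration, not part of the project code; the cut-off k and any hand filtering are left to the user.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class StopListBuilder {
        // Returns the k most frequent terms in the collection as stop-word candidates.
        public static List<String> topKTerms(List<String> tokens, int k) {
            Map<String, Integer> freq = new HashMap<>();
            for (String t : tokens) {
                freq.merge(t.toLowerCase(), 1, Integer::sum);   // collection frequency
            }
            List<String> terms = new ArrayList<>(freq.keySet());
            terms.sort((a, b) -> freq.get(b) - freq.get(a));    // most frequent first
            // The resulting list would normally be hand-filtered before use.
            return terms.subList(0, Math.min(k, terms.size()));
        }
    }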

The removal of these stop words improves the quality of the results produced.

3.1.6 STEMMING AND LEMMATIZATION:

For grammatical reasons, documents are going to use different

forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization.

In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

am, are, is -> be
car, cars, car's, cars' -> car

The result of this mapping of text will be something like:

the boy's cars are different colors -> the boy car be differ color

Stemming usually refers to a crude heuristic process that chops off

the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

Lemmatization usually refers to doing things properly with the use

of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

If confronted with the token saw, stemming might return just s,

whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.

The most common algorithm for stemming English, and one that

has repeatedly been shown to be empirically very effective, is Porter's algorithm

Porter's algorithm consists of 5 phases of word reductions, applied

sequentially. Within each phase there are various conventions to select


rules, such as selecting the rule from each rule group that applies to the longest suffix.

In the first phase, this convention is used with the following rule

group:

Fig 3.4 Stemming Rules

Many of the later rules use a concept of the measure of a word,

which loosely checks the number of syllables to see whether a word is long enough that it is reasonable to regard the matching portion of a rule as a suffix rather than as part of the stem of a word. For example, the rule:
(m>1) EMENT ->

would map replacement to replac, but not cement to c.
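To make the notion of the measure concrete, the following minimal sketch shows how the (m>1) EMENT rule could be checked. It is an illustrative simplification of our own: Porter's special handling of 'y' and the other rule phases are omitted, and the method names are not part of any library.

    public class PorterMeasure {
        static boolean isVowel(char ch) { return "aeiou".indexOf(ch) >= 0; }

        // m counts the number of vowel-to-consonant transitions in the stem.
        static int measure(String stem) {
            int m = 0;
            boolean prevVowel = false;
            for (char ch : stem.toCharArray()) {
                boolean vowel = isVowel(ch);
                if (prevVowel && !vowel) m++;      // one VC pair completed
                prevVowel = vowel;
            }
            return m;
        }

        // (m>1) EMENT -> : drop the suffix only when the remaining stem is long enough.
        static String applyEmentRule(String word) {
            if (word.endsWith("ement")) {
                String stem = word.substring(0, word.length() - 5);
                if (measure(stem) > 1) return stem;
            }
            return word;
        }

        public static void main(String[] args) {
            System.out.println(applyEmentRule("replacement"));   // replac
            System.out.println(applyEmentRule("cement"));        // cement (measure too small)
        }
    }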

Here are a few types of stemming techniques:


Fig 3.5 Stemming Types

3.2 SYSTEM ARCHITECTURE:


Fig 3.6 System Architecture (pipeline: Seed URL -> Web crawler with frontier, over the WWW -> list of relevant extracted links -> page cacher -> cached pages -> DOM based content extractor -> extracted content -> stop word removal and stemming -> cleaned content)

CHAPTER 4 SYSTEM REQUIREMENTS


4.1 HARDWARE:

Processor : Intel Pentium IV or later
Speed     : 1.6 GHz or above
RAM       : 512 MB or above
Hard disk : 40 GB or above

4.2 SOFTWARE:

OS Platform    : Windows XP or later, Linux
Languages      : Java (JDK 1.5 or above)
IDE            : Eclipse
Tool kit / API : HTMLCleaner, MorphAdorner

4.3 SOFTWARE DESCRIPTION:

4.3.1 Eclipse

Eclipse is a multi-language software development environment comprising an integrated development environment (IDE) and an extensible plug-in system. It is written primarily in Java and can be used to develop applications in Java and, by means of various plug-ins, in other languages as well, including C, C++, COBOL, Python, Perl, PHP, and others.

An integrated development environment (IDE) also known as integrated design environment or integrated debugging environment is a software application that provides comprehensive facilities to computer programmers for software development.

An IDE normally consists of:



o a source code editor
o a compiler and/or an interpreter
o build automation tools
o a debugger

4.3.2 HTMLCleaner API:

HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most web browsers use in order to create the Document Object Model.

Here is a typical example - improperly structured HTML containing unclosed tags and missing quotes:

<table id=table1 cellspacing=2px
<h1>CONTENT</h1>
<td><a href=index.html>1 -> Home Page</a>
<td><a href=intro.html>2 -> Introduction</a>

After putting it through HtmlCleaner, XML similar to the following comes out:

<?xml version="1.0" encoding="UTF-8"?>
<html>
  <head />
  <body>
    <h1>CONTENT</h1>
    <table id="table1" cellspacing="2px">
      <tbody>
        <tr>
          <td>
            <a href="index.html">1 -&gt; Home Page</a>
          </td>
          <td>
            <a href="intro.html">2 -&gt; Introduction</a>
          </td>
        </tr>
      </tbody>
    </table>
  </body>
</html>

HtmlCleaner also supports methods for manipulating the DOM.

4.3.3 MORPHADORNER

MorphAdorner is a Java command-line program which acts as a pipeline manager for processes performing morphological adornment of words in a text. Currently MorphAdorner provides methods for adorning text with standard spellings, parts of speech and lemmata. MorphAdorner also provides facilities for tokenizing text, recognizing sentence boundaries, and extracting names and places. MorphAdorner can also be used for stemming purposes. Two widely used stemmers are included in MorphAdorner:

o The Porter stemmer, created by Martin Porter

o The Lancaster stemmer, created by Chris Paice and Gareth Husk
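The following minimal sketch shows how the two libraries fit into this project's pipeline. It mirrors the calls used in Appendix A (HtmlCleaner's clean() and PrettyXmlSerializer, MorphAdorner's LancasterStemmer and PorterStopWords); the output file name and the sample word are illustrative assumptions.

    import java.net.URL;
    import org.htmlcleaner.CleanerProperties;
    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.PrettyXmlSerializer;
    import org.htmlcleaner.TagNode;
    import edu.northwestern.at.utils.corpuslinguistics.stemmer.LancasterStemmer;
    import edu.northwestern.at.utils.corpuslinguistics.stopwords.PorterStopWords;

    public class CleanAndStemDemo {
        public static void main(String[] args) throws Exception {
            // 1. Clean a (possibly ill-formed) HTML page and cache it as well-formed XML.
            HtmlCleaner cleaner = new HtmlCleaner();
            CleanerProperties props = cleaner.getProperties();
            props.setPruneTags("script");                        // drop scripts, as in the project code
            TagNode root = cleaner.clean(new URL("http://www.autoblog.com/"));
            new PrettyXmlSerializer(props).writeXmlToFile(root, "cached_page.xml");

            // 2. Stem a word and test for stop words with MorphAdorner.
            LancasterStemmer stemmer = new LancasterStemmer();
            PorterStopWords stopWords = new PorterStopWords();
            String word = "reviews";
            if (!stopWords.isStopWord(word)) {
                System.out.println(word + " -> " + stemmer.stem(word));
            }
        }
    }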

CHAPTER 5 FEATURE REQUIREMENTS


5.1 MODULE 1: WEB CRAWLING

5.1.1 Input: Seed URL

5.1.2 Output: List of links relevant to the Seed URL

5.1.3 Description: The Web crawler is a program that browses the WWW in a methodical and systematic manner. This custom-built Web crawler extracts links that are relevant to the Seed URL. The extracted links are added to the frontier and the visited links are stored in a file. The crawler uses a focused breadth-first search strategy.

5.1.4 Data Flow Diagram
Fig 5.1 DFD for Module 1: Web crawling (Seed URL -> frontier -> fetch page from the WWW -> extract relevant URLs -> store visited links -> output file with extracted links)


5.2 MODULE 2: CONTENT EXTRACTION (consists of 2 parts)

5.2.1 WEB PAGE CACHING:

5.2.1.1 Input: List of relevant links

5.2.1.2 Output: HTML source of each Web page

5.2.1.3 Description: Each link is opened using an HTTP connection and its HTML source is cached in a file for future use. While caching, the HTML source is checked for errors and corrected using the HTMLCleaner API.

5.2.1.4 Data Flow Diagram

Fig 5.2 DFD for Module 2, Part 1: Web page caching (list of relevant links -> fetch HTML source from the WWW -> clean HTML source -> store page -> cached page)


5.2.2 CONTENT EXTRACTION:

5.2.2.1 Input: HTML source of each web page

5.2.2.2 Output: Main content from each page

5.2.2.3 Description: Content from each HTML page is extracted by representing the page as a DOM tree. The use of the DOM tree makes extraction an easy task.

5.2.2.4 Data Flow Diagram:
Fig 5.3 DFD for Module 2, Part 2: Content extraction (cached pages + DOM rules -> DOM based content extraction -> extracted content)


5.3 MODULE 3: CONTENT CLEANING

5.3.1 Input: Extracted content in the form of files

5.3.2 Output: Cleaned content

5.3.3 Description: The extracted content is cleaned using information retrieval cleaning techniques like stemming and stop words removal.

5.3.4 Data Flow Diagram:

Fig 5.4 DFD for Module 3: Content cleaning (extracted content + list of stop words + stemming rules -> stemming and stop word removal -> cleaned content)


CHAPTER 6 IMPLEMENTATION


6.1 MODULES:

WEB CRAWLER
CONTENT EXTRACTION
CONTENT CLEANING

6.2 MODULE 1: WEB CRAWLER

6.2.1 CRAWLER:

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. This process is called Web crawling or spidering.

Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches.

Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code.

A Web crawler is one type of bot, or software agent.

In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.

URLs from the frontier are recursively visited according to a set of policies.

6.2.2 TYPE OF CRAWLER USED:

The importance of a page for a crawler can also be expressed as a function of the similarity of the page to a given query. Web crawlers that attempt to download pages that are similar to each other are called focused or topical crawlers. Most crawlers today are universal crawlers, which extract all links irrespective of their relevancy to the site or the specified domain name. The other type of crawler is the focused or topical crawler. These crawlers do not crawl all sites but only sites that are relevant to a particular topic or preference. In our project we make use of a focused crawler to extract links that are relevant. The relevancy of a link is determined by comparing the host names of the


extracted link and the Seed URL. We extract only those links that belong to the Web site of the Seed URL. All other links are discarded. For example, we extract links from review sites and require the links to adhere to the site's domain name.

6.2.3 INPUT SITES (SEED URL):

As we are concentrating on review sites, we have chosen the following as our sample input sites or seed URLs:

http://www.caranddriver.com/ http://www.edmunds.com/ http://www.motortrend.com/index.html/ http://www.roadandtrack.com/ http://www.automobilemag.com/index.html/ http://www.autoblog.com/ http://www.cnet.com/ http://www.epinions.com/ http://usnews.rankingsandreviews.com/cars-trucks/ http://www.thecarconnection.com/ http://autos.msn.com/ http://www.jdpower.com/autos/powersteering http://www.consumerreports.org/cro/cars/new-cars/index.htm


6.2.4 CHARACTERISTICS OF THE WEB CRAWLER:

Our Web crawler is a focused sequential crawler. It extracts only links that are relevant to the Seed URL. The relevancy of a link is determined by comparing the host names of the Seed URL and the extracted link. We make use of the breadth-first crawling technique: the crawler first starts extracting links from the Seed URL, and as links are encountered in a page they are added to the end of the list of unvisited links called the frontier. After parsing one URL, the next URL is taken from the front of the frontier. This process continues in an iterative manner, with each iteration processing a new web page.
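The relevance test therefore reduces to a host-name comparison, as the following minimal sketch shows; the seed and link values are illustrative, and the same getHost() comparison appears in linkextract.java in Appendix A.

    import java.net.URL;

    public class RelevanceCheck {
        // A link is considered relevant if it lives on the same host as the seed URL.
        public static boolean isRelevant(URL seed, URL link) {
            return seed.getHost().equals(link.getHost());
        }

        public static void main(String[] args) throws Exception {
            URL seed = new URL("http://www.edmunds.com/");
            System.out.println(isRelevant(seed, new URL("http://www.edmunds.com/acura/review.html"))); // true
            System.out.println(isRelevant(seed, new URL("http://www.example.com/ad.html")));           // false
        }
    }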

6.2.5 ISSUES IN CRAWLER:

6.2.5.1 ABSOLUTE LINKS:

A link is said to be absolute if the URL or file name can be found from anywhere on the Web, not just from a single Web site.

An absolute link specifies a fully-qualified URL; the protocol must be present in addition to a domain name, and often a file name must be included as well. For instance:

<a href="http://www.autoblog.com/">

Absolute links are extracted by checking if the extracted link follows the http protocol.

6.2.5.2 RELATIVE LINKS:

A relative link specifies the name of the file to be linked to only as it relates to the current document. For example, if all the files in your Web site are contained within the same directory (or folder), and you want to establish a link from page1.html to page2.html, the code on page1.html will be:

<a href="page2.html">Go to page 2</a>

This link will be valid only from within the same directory that page2.html is saved in. Relative links are converted into absolute links before adding them to the frontier. If an extracted link does not follow the http protocol, the domain name of the seed URL, including the protocol, is appended to the beginning of the relative link.
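A minimal sketch of this conversion follows. It is a simplification of our own: document-relative links are resolved against the site root rather than the current page's directory (java.net.URL's two-argument constructor could be used for full resolution), and the example values are illustrative.

    import java.net.URL;

    public class RelativeLinkResolver {
        // Convert a relative href into an absolute URL before adding it to the frontier.
        public static String toAbsolute(URL seed, String href) {
            if (href.startsWith("http")) {
                return href;                                   // already absolute
            }
            if (href.startsWith("/")) {
                return "http://" + seed.getHost() + href;      // root-relative link
            }
            return "http://" + seed.getHost() + "/" + href;    // document-relative link (simplified)
        }

        public static void main(String[] args) throws Exception {
            URL seed = new URL("http://www.autoblog.com/");
            System.out.println(toAbsolute(seed, "/category/reviews/"));  // root-relative
            System.out.println(toAbsolute(seed, "page2.html"));          // document-relative
        }
    }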

6.2.5.3 HIDDEN LINKS:

The hidden Web or deep Web refers to World Wide Web content that is not part of the surface Web, which is indexed by standard search engines. Deep Web resources may be classified into one or more of the following categories:

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form.

Unlinked content: pages which are not linked to by other pages.

Private Web: sites that require registration and login.

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from web servers via Flash or AJAX.

The links that are part of the hidden Web can be extracted only by selecting a small number of input combinations that generate URLs suitable for inclusion into the Web search index. Hidden links are not extracted by our crawler.

6.2.5.4 REDUNDANT LINKS:

Redundant links are links that point to pages that have already been reached through other links, providing more than one way to get to the same destination. Although such links can be useful for navigation and act as a fallback, they serve little purpose in the extraction process.

So we have to make sure that redundant links are not extracted during the crawling process, because they simply increase the number of links in the frontier without adding any useful information.


6.2.5.5 DEAD LINKS:

A dead link (also called a broken link or dangling link) is a link in the World Wide Web that points to a web page or server that is permanently unavailable.

The most common result of a dead link is a 404 error, which indicates that the web server responded but the specific page could not be found. The browser may also return a DNS error indicating that a web server could not be found at that domain name. A link might also be dead because of some form of blocking such as content filters or firewalls. Another type of dead link is a URL that points to a site unrelated to the content sought. This can sometimes occur when a domain name is allowed to lapse and is subsequently re-registered by another party. Dead links should not be extracted during the crawling process.

6.2.6 ALGORITHM:

Get base URL
Maintain two lists: visited links list and not visited links list (frontier)
Add base URL to not visited list (frontier)
while (not visited list not empty)
    Get URL from front of list (URL i)
    Check if protocol is http
    Open URL connection
    Extract relevant links from web page
    if (link not in not visited list (frontier))
        Add to not visited list
    else
        Discard
    Add (URL i) to list of visited links

6.3 MODULE 2: CONTENT EXTRACTION

6.3.1 Part 1 - Web Page Cache:

Definition: Web caching is the caching of web documents like HTML pages, images, etc. onto the system. While caching the web page we make sure that the HTML source does not contain any errors. This is checked by making use of the rules in the HtmlCleaner API.

Caching of Web Page:

Get each link from the list of extracted links
while (input list file not empty)
    Open URL connection for each link L(i)
    Open new file F(i)
    Read HTML source from URL connection
    Parse the HTML page using HtmlCleaner
    Write the cleaned HTML source to file F(i)
    Close file F(i)
    Repeat for each input

Use of Web Page Caching:
We get a cleaned HTML page, making content extraction easy.
Caching the web page can help in future content extraction.
It reduces bandwidth.

6.3.2 Part 2 - Content Extraction (Document Object Model):

Content Extraction:
o Get each file containing the cleaned HTML source
o Represent the HTML as a DOM tree
o Make use of DOM rules to extract the content
o Repeat for each cleaned HTML source file
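A minimal sketch of the DOM-rule extraction step above, using the HtmlCleaner calls that appear in ContentExtract.java in Appendix A; the cached file name, output file name and class-attribute patterns are illustrative assumptions.

    import java.io.File;
    import java.io.FileWriter;
    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    public class DomRuleExtractor {
        public static void main(String[] args) throws Exception {
            HtmlCleaner cleaner = new HtmlCleaner();
            // Build the DOM tree of a cached page.
            TagNode root = cleaner.clean(new File("cache/file1.xml"));
            TagNode body = root.findElementByName("body", true);
            // DOM rule: keep elements whose class attribute looks like main content.
            TagNode[] candidates = body.getElementsHavingAttribute("class", true);
            try (FileWriter out = new FileWriter("content1.txt")) {
                for (TagNode node : candidates) {
                    String cls = node.getAttributeByName("class");
                    if (cls != null && (cls.matches(".*content.*") || cls.matches(".*entry.*"))) {
                        out.write(node.getText().toString());   // write the extracted text
                    }
                }
            }
        }
    }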

6.4 MODULE 3: CONTENT CLEANING

6.4.1 Preprocessing:

Data preprocessing describes any type of processing performed on raw data to prepare it for another processing procedure. Commonly used as a preliminary data mining practice, data preprocessing transforms the data into a format that will be more easily and effectively processed for the purpose of the user.


6.4.2 Removing Stop Words:

Stop words are common words that carry less important meaning than keywords. Usually search engines remove stop words from a keyword phrase to return the most relevant result. i.e. stop words drive much less traffic than keywords.

Stop words are a natural part of human language, but a high stop word density can make content look less important to search engines.



FIGURE 6.1: SYSTEM FLOW DIAGRAM - STOP WORD REMOVAL (start -> get documents -> conversion to text -> compare with predefined stop words -> remove stop words -> print list of words free from stop words -> stop)
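A minimal sketch of the removal step in the flow above. The stop list here is a tiny illustrative subset, whereas the project itself uses MorphAdorner's PorterStopWords class (see Token_Lancaster.java in Appendix A).

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class StopWordRemover {
        // Illustrative subset of a stop list; a real list is much longer.
        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("a", "an", "the", "is", "are", "of", "and", "to"));

        public static List<String> removeStopWords(String text) {
            List<String> kept = new ArrayList<>();
            for (String token : text.toLowerCase().split("\\s+")) {
                if (!STOP_WORDS.contains(token)) {
                    kept.add(token);     // keep only non-stop words
                }
            }
            return kept;
        }

        public static void main(String[] args) {
            System.out.println(removeStopWords("The ride of the new sedan is smooth and quiet"));
        }
    }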


6.4.4 Stemming:

In linguistic morphology, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Stemming has been a long-standing problem in computer science; the first paper on the subject was published in 1968. The process of stemming, often called conflation, is useful in search engines for query expansion or indexing and in other natural language processing problems. A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty", etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word "fish".

6.4.4.5 Different Stemming Algorithms:

Lancaster Algorithm (by Paice and Husk)
Porter Stemming Algorithm
Lovins Stemming Algorithm
Dawson Stemming Algorithm
Krovetz Stemming Algorithm


6.4.4.5.1 Lancaster stemming:

Introduction:

The Paice/Husk Stemmer was developed in the Computing Department of Lancaster University in the late 1980s. It was designed by Chris Paice with the assistance of Gareth Husk, and was first implemented in the Pascal programming language. Due to the declining popularity of Pascal, further implementations have been made in ANSI C and Java. A Perl version has also been implemented by Mary Taffet at the Center for Natural Language Processing at Syracuse University. The Paice/Husk Stemmer consists of a stemming algorithm and a separate set of stemming rules. The standard set of rules provides a rather 'strong' or 'heavy' stemmer which is quite aggressive in the conflation of words. Stemmer strength is a quality that can be extremely advantageous for index compression. A heavy stemmer, however, tends to produce a rather large number of overstemming errors relative to the number of understemming errors. Users who would prefer a lighter stemmer can develop their own rule sets. The Paice/Husk Stemmer is a simple iterative stemmer, in that endings are removed piecemeal in an indefinite number of stages.

Lancaster stemming algorithm:

The stemmer is a conflation-based iterative stemmer. The stemmer, although remaining efficient and easily implemented, is known to be very strong and aggressive. It utilizes a single table of rules, each of which may specify the removal or replacement of an ending. This technique of replacement is used to avoid the problem of spelling exceptions described earlier; by replacing endings rather than simply removing them, the stemmer manages to do without a separate


stage in the stemming process, i.e. no recoding or partial matching is required. This helps to maintain the efficiency of the algorithm, whilst still being effective.

The Rule Format:

In the following, the term form refers to any word or part-word which is being considered for stemming. The original word, before any changes have been made, is said to be intact. Each line in the rule table holds a separate stemming rule. Braces {...} enclose comments describing the action of each rule. The rules in the table are grouped into sections, each containing all those rules relating to a particular final letter (known as the section letter). In the current Pascal implementation the rules are held in an array. Each rule is represented as a string of characters (nine characters suffice for all rules at the present time). The procedure which reads in the rules constructs an index array allowing fast access to the relevant section. Each rule has five components, two of which are optional:

a) an ending of one or more characters, held in reverse order;
b) an optional intact flag "*";
c) a digit specifying the remove total (may be zero);
d) an optional append string of one or more characters;
e) a continuation symbol, ">" or ".".

Within each section, the order of the rules is significant.

Example 1: the rule "sei3y>" means: if the word ends in "-ies" then replace the last three letters by "-y" and then apply the stemmer again to the truncated form.

Example 2: the rule "mu*2." means: if the word ends in "-um" and the word is intact, then remove the last two letters and terminate. This converts "maximum" to "maxim" but leaves "presum" (from "presumably" etc.) unchanged.


Example 3: the rule "ylp0." means: if the word ends in "-ply" then leave it unchanged and terminate. This ensures that the subsequent rule "yl2>" does not remove the "-ly" from "multiply".

Example 4: the rule "nois4j>" causes "-sion" endings to be replaced by "-j". This acts as a dummy, causing activation of the "j" section of the rules (q.v.). Hence "provision" is converted first to "provij" and then to "provid".

The Algorithm:

1. Select relevant section:

Inspect the final letter of the form; if there is no section corresponding to that letter, then terminate; otherwise, consider the first rule in the relevant section.

2. Check applicability of rule: If the final letters of the form do not match the reversed ending in the rule, then goto 4; if the ending matches, and the intact flag is set, and the form is not intact, then goto 4; if the acceptability conditions (see below) are not satisfied, then goto 4. 3. Apply rule:

Delete from the right end of the form the number of characters specified by the remove total; if there is an append string, then append it to the form; if the continuation symbol is "." then terminate; otherwise (if the continuation symbol is ">") then goto 1


4. Look for another rule:

Move to the next rule in the table; if the section letter has changed then terminate; otherwise goto 2.

Acceptability Conditions:

If these conditions were not present, the words "rent", "rant", "rice", "rage", "rise", "rate", "ration", and "river" would all be reduced to "r" by the rules shown. The conditions used are:

a) if the form starts with a vowel then at least two letters must remain after stemming (e.g., "owed"/"owing" -> "ow", but not "ear" -> "e");

b) if the form starts with a consonant then at least three letters must remain after stemming and at least one of these must be a vowel or "y" (e.g., "saying" -> "say" and "crying" -> "cry", but not "string" -> "str", "meant" -> "me" or "cement" -> "ce").

These conditions wrongly prevent the stemming of various short-rooted words (e.g., "doing", "dying", "being"); it is probably best to deal with these separately by lexical lookup.
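Rather than re-implementing the rule table, the project calls MorphAdorner's LancasterStemmer (see Token_Lancaster.java in Appendix A). A minimal usage sketch, with input words taken from the rule examples above, looks like this; the exact outputs depend on the rule set loaded.

    import edu.northwestern.at.utils.corpuslinguistics.stemmer.LancasterStemmer;

    public class LancasterDemo {
        public static void main(String[] args) throws Exception {
            LancasterStemmer stemmer = new LancasterStemmer();
            // Words discussed in the rule examples above.
            String[] words = { "maximum", "provision", "multiply", "crying", "saying" };
            for (String w : words) {
                System.out.println(w + " -> " + stemmer.stem(w));
            }
        }
    }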


Flow chart:

Fig 6.2 Flow Chart Stemming



CHAPTER 7 CONCLUSION


The Web, being the largest publicly accessible data source in the world, is a very valuable resource for knowledge discovery. Due to the unstructured and heterogeneous characteristics of the Web, it is quite difficult to apply basic data mining techniques directly. Web data mining, though derived from data mining, approaches knowledge discovery in a different manner. Web data warehouses have not come into play because the Web is dynamic, ever growing and tough to maintain. Our project takes an initiative in the direction of cleaning web content and maintaining web knowledge bases. Since the Web is unstructured, extraction of content is a difficult problem. The main content must be separated from noisy data (navigation links, advertisements, privacy policies, etc.). The noise-free content is stored in knowledge bases for information retrieval. Web-crawler-assisted content extraction can be used for data mining and other applications such as preparing content for PDAs and other devices.


APPENDIX - A SAMPLE CODING


CODING

8.1 MODULE 1: WEB CRAWLING

linkextract.java

package crawler;
import java.io.*;
import java.net.*;
import java.util.*;
import java.io.InputStreamReader;
import java.net.URLConnection;
import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTML.Attribute;
import javax.swing.text.MutableAttributeSet;

public class linkextract { public static String newline=System.getProperty("line.separator"); public static String address1=""; public static int count=0,count1=0; public static boolean flag=false; public static String str1=""; public static String contype=""; public static String abs="";

public static URL add; public int i=0; public static URL base,hostcheck;

static ArrayList<String> notvisited=new ArrayList<String>(); static ArrayList<String> visited=new ArrayList<String>(); static ArrayList<String>list=new ArrayList<String>(); public static void main(String args[])throws IOException, MalformedURLException { String file=args[1]; try{ address1=args[0]; base=new URL(address1); notvisited.add(address1); for(int i=0;i<notvisited.size();i++) { String str=notvisited.get(i); flag=false; System.out.println("Address obtained from list"); count1=0; System.out.println(str); if(str.matches("http.*"))//&&str.matches(address+".*"))

{ System.out.println("Checked absolute"); URL myurl=new URL(str); abs=myurl.getHost(); if(!(str.matches(".*#comments"))) { boolean motortrend=((str.matches(".*roadtests/[az].*"))&&((!str.matches(".*/[0-9][0-9]/.*"))|| ((str.matches(".*sedans/[0-9]*.*"))|| (str.matches(".*luxury/[0-9]*.*"))|| (str.matches(".*coupes/[0-9]*.*"))|| (str.matches(".*convertibles/[0-9]*.*"))|| (str.matches(".*alternatives/[0-9]*.*"))|| (str.matches(".*hatchbacks/[0-9]*.*"))|| (str.matches(".*wagons/[0-9]*.*"))|| (str.matches(".*minivans_vans/[0-9]*.*"))|| (str.matches(".*trucks/[0-9]*.*"))|| (str.matches(".*suvs/[0-9]*.*"))|| (str.matches(".*oneyear/[0-9]*.*"))|| (str.matches(".*virtual/[0-9]*.*"))))); boolean edmunds=(str.matches(".*review.html")); boolean caranddriver= (str.matches(".*reviews"))&&(str.matches(".*car.*"))&& (!str.matches(".*gallery.*")); boolean roadandtrack=(str.matches(".*blog.*")); boolean automobilemag=(str.matches(".*rumors.*"));


boolean autoblog=(str.matches(".*make.*"))|| (str.matches(".*category.*")); boolean rankings=(str.matches(".*cars-trucks.*")); boolean thecarconn=(str.matches(".*review/[0-9]*.*")); boolean autobytel=(str.matches(".*articles/templates.*")); boolean jdpower=(str.matches(".*overview")); if(base.getHost().equals(abs)&&((!str.matches(".*gallery.*"))&&(! str.matches(".*photo.*")))&&((str.matches(".*review.*"))||motortrend|| edmunds||caranddriver||roadandtrack||automobilemag||autoblog||rankings|| thecarconn||autobytel||jdpower)) { System.out.println("Checked relevance "); String s=myurl.getProtocol(); if(s.equals("http")) { System.out.println("Checked protocol"); try{ myurl.openStream(); } catch(IOException e){ e.printStackTrace(); continue;} BufferedReader br=new BufferedReader(new InputStreamReader(myurl.openStream()));

System.out.println("scanning source html"); ParserDelegator parserDelegator = new ParserDelegator(); ParserCallback parserCallback = new ParserCallback() { public void handleText(final char[] data, final int pos) { } public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) { if (tag == Tag.A) { String address = (String) attribute.getAttribute(Attribute.HREF); if(address!=null) { if(address.matches("/.*")) address="http://"+abs+address; String[] splt=address.split("/"); if(address.matches("http.*")) { if(list.contains(address)); else if(base.getHost().equals(splt[2])) { notvisited.add(address); list.add(address); }

} } } } public void handleEndTag(Tag t, final int pos) { } public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { } public void handleComment(final char[] data, final int pos) { } public void handleError(final java.lang.String errMsg, final int pos) {} }; parserDelegator.parse(br, parserCallback, true); File fl=new File(file); BufferedWriter bw=new BufferedWriter(new FileWriter(fl,true)); bw.write(newline); bw.write(str); bw.close(); visited.add(str); } } } } notvisited.remove(i); System.out.println(notvisited.size()+" "+ i);

} } catch(Exception e){ } } }

8.2 MODULE 2: PART 1 - WEB PAGE CACHE

Cache.java

package crawler;
import org.htmlcleaner.*;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.JDOMException;
import org.jdom.output.XMLOutputter;
import org.jdom.xpath.XPath;
import java.io.*;
import javax.xml.transform.stream.StreamResult;
import java.util.List;
import java.util.Map;

import java.util.Iterator;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.File;
import java.net.URL;
import java.net.*;

public class Cache {
public static void main(String[] args) throws IOException, JDOMException, XPatherException, ConnectException {
long start = System.currentTimeMillis();
/* Create instance of HtmlCleaner and configure its properties */
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setUseCdataForScriptAndStyle(true);
props.setRecognizeUnicodeChars(true);
props.setUseEmptyElementTags(true);
props.setAdvancedXmlEscape(true);
props.setTranslateSpecialEntities(true);
props.isRecognizeUnicodeChars();
props.setRecognizeUnicodeChars(true);
props.setBooleanAttributeValues("empty");
props.setPruneTags("script");
props.setPruneTags("img");
props.setPruneTags("style");
props.setPruneTags("form");

props.setOmitComments(true); /* Transformations */ CleanerTransformations transformations = new CleanerTransformations();

/* input HTML file */ String src=args[0]; String folder=args[1]; File fl=new File(src); String path=""; int uniq=0; BufferedReader br=new BufferedReader(new FileReader(fl)); while(br.ready()){ path=br.readLine(); if(path.matches("http.*")){

uniq++; File file=new File(folder+"file"+uniq+".xml"); URL myurl=new URL(path); TagNode node; try{ node=cleaner.clean(myurl); } catch(Exception e) { System.out.println(e.getMessage()); continue; } System.out.println("vreme: " + (System.currentTimeMillis() - start)); /* output XML file */ new PrettyXmlSerializer(props).writeXmlToFile(node, folder+"file"+uniq+".xml"); System.out.println("vreme: " + (System.currentTimeMillis() - start)); } } System.out.println(br.readLine());

} }


8.3 MODULE 2: PART 2 - CONTENT EXTRACTION

ContentExtract.java

package crawler;
import java.io.*;
import java.net.ConnectException;
import java.util.Vector;
import org.htmlcleaner.*;
import org.jdom.JDOMException;
import org.w3c.tidy.Tidy;

public class ContentExtract {
public static void main(String args[]) throws IOException, JDOMException, XPatherException, ConnectException {
String EMPTY="";
HtmlCleaner cleaner=new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setUseCdataForScriptAndStyle(true);
props.setRecognizeUnicodeChars(true);
props.setUseEmptyElementTags(true);
props.setAdvancedXmlEscape(true);
props.setTranslateSpecialEntities(true);
props.isRecognizeUnicodeChars();

props.setRecognizeUnicodeChars(true); props.setBooleanAttributeValues("empty"); props.setPruneTags("script"); props.setPruneTags("img"); props.setPruneTags("style"); props.setPruneTags("form");

props.setOmitComments(true);
String folderpath;
folderpath=args[0];
Vector htmlpages=ls(folderpath);
String outputfolder="C:\\eclipse projects\\webcrawler\\";
outputfolder = outputfolder.concat(args[1]);   // concat returns a new string, so assign it back
int uniq1=0;
for(int i=0;i<htmlpages.size();i++)
{
uniq1++;
String page=(String)htmlpages.get(i);
File fl=new File(page);
System.out.println(page);
TagNode tg=cleaner.clean(fl);
TagNode findtg=tg.findElementByName("body", true);


FileWriter fw=new FileWriter(new File(outputfolder+"content"+uniq1+".txt")); TagNode[] listclass=findtg.getElementsHavingAttribute("class", true); TagNode[] listrevclass=findtg.getElementsHavingAttribute("class",true); for(int k=0;k<listrevclass.length;k++) { String str; TagNode str1=listrevclass[k]; boolean automobilemag=str1.getAttributeByName("class").matches(".*body.*"); boolean autoblog=str1.getAttributeByName("class").matches(".*postbody.*"); boolean automobilemag=str1.getAttributeByName("class").matches(".*body.*");

if(listrevclass[k].getAttributeByName("class").matches(".*byline.*")||
listrevclass[k].getAttributeByName("class").matches(".*content.*")||
listrevclass[k].getAttributeByName("class").matches(".*item.*")||
listrevclass[k].getAttributeByName("class").matches(".*entry.*"))//||
listrevclass[k].getAttributeByName("class").matches(".*title.*"))
{
System.out.println(listrevclass[k].getText());


fw.write(listrevclass[k].getText().toString()); System.out.println(listrevclass[k].getText().toString().length()); }

} fw.close(); }

public static Vector ls(String folderpath) { Vector v=new Vector(); ls(new File(folderpath),v); return v; } public static void ls(File file,Vector v) { File list[]=file.listFiles();

for(int i=0;i<list.length;i++)
{
if(list[i].isDirectory())
ls(list[i],v);
else
v.add(list[i].getAbsolutePath());
}
}
}

8.4 MODULE 3: CONTENT CLEANING

Cleaner.java

package stopwordsstemmerproj;
import java.io.*;
import java.util.*;

public class Cleaner {
public static void main(String args[]) throws IOException {
String output=args[1];
String folderpath=args[0];
Vector ve=ls(folderpath);
Token_Lancaster tl= new Token_Lancaster();
for (int i=0;i<ve.size();i++)
{


String path=(String)ve.get(i);

try { tl.tokenize(path,output); } catch (Exception e) { e.printStackTrace(); } } } public static Vector ls(String folderPath) { Vector v = new Vector(); ls(new File(folderPath),v); return v; } public static void ls(File file,Vector v) { File[] list = file.listFiles(); for(int i=0;i<list.length;i++) { File li = list[i]; if(li.isDirectory()) { ls(li,v); } else

v.add(li.getAbsolutePath());
}
}
}

Token_Lancaster.java

package stopwordsstemmerproj;
import java.io.*;
import java.util.*;
import edu.northwestern.at.utils.corpuslinguistics.stemmer.LancasterStemmer;
import edu.northwestern.at.utils.corpuslinguistics.stopwords.*;
import edu.northwestern.at.utils.corpuslinguistics.tokenizer.*;

public class Token_Lancaster {
public void tokenize(String ifname,String opname) throws Exception {
String outputfolder=opname;
File file=new File(ifname);
String inputfilename=file.getName();
Scanner s=null;
try

{ s=new Scanner(file); } catch(Exception e) { System.out.println(e.getMessage()); } LancasterStemmer stem_word= new LancasterStemmer(); PorterStopWords stop_word = new PorterStopWords(); String output_fname=outputfolder.concat(inputfilename); while(s.hasNextLine()){ String text = s.nextLine();

String comma = text.toString().replaceAll("\\,",""); String exclamation = comma.toString().replaceAll("\\!",""); String hyphen = exclamation.toString().replaceAll("\\-"," "); String quotes = hyphen.toString().replaceAll("\\'",""); String colon = quotes.toString().replaceAll("\\:",""); String semicolon = colon.toString().replaceAll("\\;",""); String braces = semicolon.toString().replaceAll("\\(",""); String braces1 = braces.toString().replaceAll("\\)",""); String stop = braces1.toString().replaceAll("\\.","");


System.out.println(stop); List<String> input_string = new LinkedList<String>(); StringTokenizer st = new StringTokenizer(stop); while (st.hasMoreTokens()) { input_string.add(st.nextToken()); } LinkedList<String> output_string = new LinkedList<String>(); for(int i=0;i<input_string.size() ;i++) { if(!(stop_word.isStopWord(input_string.get(i))) && input_string.get(i) != null && !(input_string.get(i).indexOf("_")>0)) { try{

System.out.println(stem_word.stem(input_string.get(i))); output_string.add(stem_word.stem(input_string.get(i))); } catch(Exception e){ output_string.add(stem_word.stem(input_string.get(i))); }



} else{ output_string.add(input_string.get(i)); } }

try{ boolean append = true; FileOutputStream out = new FileOutputStream(output_fname,append); PrintStream p = new PrintStream(out); for(int i=0;i < output_string.size();i++){ p.print ((String)output_string.get(i)); p.print (" "); } p.println(); p.close(); } catch (IOException ioe) { System.err.println ("Error writing to file "+ioe); } } }}

8.5 SCREENSHOTS:

8.5.1 MODULE 1: WEB CRAWLER:

Fig 8.1 Input Seed URL and Output file name is given to the system


Figure 8.2 Output file containing the list of extracted links

8.5.2 MODULE 2: PART 1 - PAGE CACHING

Figure 8.3 Input file and output folder is given to the Page cacher


Figure 8.4 Cached files

8.5.3 MODULE 2: PART 2 - CONTENT EXTRACTION

Figure 8.5 Input folder and Output folder is given to the content extractor

Figure 8.6 Output of Content Extractor

8.5.4 MODULE 3 - CONTENT CLEANING:

Figure 8.7 Extracted content folder is given as input and a location to store cleaned files is also given

Figure 8.8 Cleaned content


APPENDIX - B REFERENCES


REFERENCES

[1] L. Yi, B. Liu, and X. Li, "Eliminating Noisy Information in Web Pages for Data Mining", ACM, 2003.

[2] L. Yi and B. Liu, "Web Page Cleaning for Web Mining through Feature Weighting", WWW, 2003.

[3] S. Gupta, G. Kaiser, D. Neistadt, and P. Grimm, "DOM-based Content Extraction of HTML Documents", WWW, 2004.

[4] Y. Weissig and T. Gottron, "Combinations of Content Extraction Algorithms", WWW, 2009.

[5] B. Liu, Web Data Mining: Exploring Hyperlinks, Contents and Usage Data, Springer.

[6] C. D. Manning, P. Raghavan and H. Schutze, Introduction to Information Retrieval, Cambridge University Press, 2008.

