S. Rajesh
M.Tech (CSE), V.R. Siddhartha Engineering College, Vijayawada
Associate Professor, V.R. Siddhartha Engineering College, Vijayawada
ABSTRACT:
A new web content structure based on visual representation is proposed in this paper. Many web applications, such as information retrieval, information extraction, and automatic page adaptation, can benefit from this structure. The approach extracts each record from a data region and classifies it as flat or nested using visual information: the area covered and the number of data items present in each record. The data items are then extracted from these records and transferred into a database. This paper presents an automatic, top-down, tag-tree-independent approach to detecting web content structure. It simulates how a user understands web layout based on visual perception. Compared with other existing techniques, our approach is independent of the underlying document representation, such as HTML, and works well even when the HTML structure differs greatly from the layout structure.
I INTRODUCTION
The World Wide Web has become one of the most important information sources today. Most data on the web are available as pages encoded in markup languages such as HTML, intended for visual browsers. As the amount of data on the web grows, locating desired content accurately and accessing it conveniently become pressing requirements. Technologies such as web search engines and adaptive content delivery [1] are being developed to meet these requirements. However, web pages are normally composed for viewing in visual web browsers and lack information about their semantic structure. These days most companies manage their business through web sites and use them to advertise their products and services. This dynamic data needs to be collected and organized so that, after extracting information from it, one can build many value-added applications. For example, in order to collate and compare the prices and features of products available from various web sites, we need tools to extract the attribute descriptions of each product (called a data object) within a specific region (called a data region) of a web page. If one examines a web page, many irrelevant components are intertwined with the data objects. In many web pages, more than one data object is intertwined in a data region, which makes it difficult to discover the attributes of each object. Furthermore, since
the raw source of the web page depicting the objects is non-contiguous, the problem becomes more difficult. In real applications, users require descriptions of individual data objects from complex web pages, derived by partitioning the data region. There are different approaches in practice, due to Hammer, Garcia-Molina, Cho, and Crespo [1], Kushmerick [2], Chang and Lui [3], Crescenzi, Mecca, and Merialdo [4], and Zhao, Meng, Wu, and Raghavan [5], which address the problem of web data extraction through wrapper generation techniques. To extract these structures, document wrappers are commonly used. Building wrappers, however, is not a trivial task. Normally, wrappers are built for specific web pages by having people examine those pages and figure out rules that can separate the chunks of interest. Based on these special rules, we can write a wrapper to extract information from pages that belong to exactly the same class. Many wrappers are just lexical analyzers, as discussed in [8]. Methods like [9] make some improvements by using heuristics in addition to lexical analysis. There are also approaches that try to derive semantic structures directly: the approach presented in [1] discusses a concept discovery and confirmation method based on heuristics, and another [11] introduces a method to find the relationships between labeled semi-structured data. The methods listed above are somewhat limited because the detection of content chunks is still done by humans.
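A hand-written lexical wrapper of the kind described above can be sketched as follows. The page fragment and its `name`/`price` markup are hypothetical, standing in for one specific site's layout; a real wrapper's rules would be derived by a person inspecting that site's pages:

```python
import re

# Hypothetical product-listing fragment from one site; a real wrapper is
# tuned to exactly this class of pages and breaks if the markup changes.
html = """
<div class="item"><span class="name">USB cable</span><span class="price">$4.99</span></div>
<div class="item"><span class="name">HDMI adapter</span><span class="price">$12.50</span></div>
"""

# Hand-crafted rule: each chunk of interest is an adjacent name/price pair.
pattern = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>'
    r'<span class="price">\$(?P<price>[\d.]+)</span>'
)

def extract(page: str):
    """Apply the site-specific rule and return (name, price) records."""
    return [(m.group("name"), float(m.group("price")))
            for m in pattern.finditer(page)]

print(extract(html))  # [('USB cable', 4.99), ('HDMI adapter', 12.5)]
```

The brittleness is the point: the wrapper extracts perfectly from pages of "exactly the same class" and nothing else, which is why automatic approaches are preferred at scale.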
II BACKGROUND AND RELATED WORK

Existing methods are not feasible when a large number and variety of web pages must be processed; automatic or semi-automatic methods are much more effective in this situation. Only recently have several proposals discussed ways of performing the analysis automatically. In [4], a method to parse HTML data tables and generate a hierarchical representation is discussed; the approach assumes that the authors of the tables have provided enough information to interpret them. The authors of [3] introduce a method that detects chunk boundaries by combining multiple independent heuristics. With a specific field of interest, wrappers can also be implemented based on semantic rules; the approach discussed in [2] is such an idea. HTML, introduced with web technology itself, is the most commonly used standard for current web pages. However, it lacks the ability to represent semantically related content. For some reasons, it was designed to take both
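As a rough sketch of what turning an HTML data table into a hierarchical representation involves (this is illustrative only, not the method of [4]; the table content is made up):

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect an HTML table as a list of rows, each row a list of cell texts."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = []

    def handle_data(self, data):
        # Only record text that falls inside a cell.
        if self.cell is not None:
            self.cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.row is not None:
            self.row.append(" ".join(t for t in self.cell if t))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

p = TableParser()
p.feed("<table><tr><th>Item</th><th>Price</th></tr>"
       "<tr><td>Pen</td><td>1.20</td></tr></table>")

# Interpret the first row as the header, as a table author would intend:
header, *body = p.rows
records = [dict(zip(header, r)) for r in body]
print(records)  # [{'Item': 'Pen', 'Price': '1.20'}]
```

The hard part, which [4] addresses, is that real tables only interpret correctly when their authors supplied enough cues (headers, spans, nesting) to recover this structure.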
III EXISTING APPROACH

The VER algorithm for the proposed technique is as follows:

Algorithm VER (HTML document)
Begin
1. Accept the input web page.
2. Preprocess the web page to filter out useless nodes.
3. Segment the web page into semantic blocks using the VIPS algorithm.
4. Cluster the blocks by the visual-appearance similarity of the web page.
5. Identify the data regions and align the data records.
6. Output the extracted data records.
End.

CE [11] treats all detail pages of a website as pages of the same class. It runs a learning phase with two or more pages as input, finds the blocks whose pattern repeats across the input pages, marks them as non-informative blocks, and stores them. These non-informative blocks are mostly copyright notices, headers, footers, sidebars, and navigation links. When the CE algorithm is used in practice, it first eliminates the non-informative patterns from the structure of its input pages, based on the patterns stored for that class of pages. Finally, from the remaining blocks on the page, it returns the text of the block with the greatest text length. Because CE needs a learning phase, it cannot extract the main content from a single arbitrary input page.

FE [11] extracts the text content of the block with the highest probability of containing text, so it works well on pages where the text of the main content dominates other content types. In addition, FE can return only one block of the main content, so [11] also proposed K-FE, which returns the k blocks with the highest probability of containing the main content. The steps of K-FE and FE are the same except for the final part: K-FE sorts the blocks by their probability, then applies k-means clustering and takes the high-probability clusters.

Procedure ExtractDataRecord(dataRegion)
{
  THeight = 0
  For each child of dataRegion
  Begin
    THeight += height of the bounding rectangle of child
  End
  AHeight = THeight / number of children of dataRegion
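Using text length as a stand-in for the text-probability score (a deliberate simplification of FE/K-FE as described in [11], whose real scoring and k-means clustering step are omitted), the selection logic can be sketched as:

```python
def fe(blocks):
    """FE-style selection: return the single block with the greatest text length."""
    return max(blocks, key=len)

def k_fe(blocks, k):
    """K-FE-style selection (simplified): the top-k blocks by text length,
    standing in for the probability-sorting and clustering in the paper."""
    return sorted(blocks, key=len, reverse=True)[:k]

# Hypothetical text blocks produced by page segmentation.
blocks = [
    "Copyright 2013",
    "Home | Products | About",
    "The main article body with the longest run of text on the page.",
    "Related links",
]
print(fe(blocks))        # the long article block
print(k_fe(blocks, 2))   # the article block plus the next-longest block
```

This also makes CE's advantage concrete: FE alone would happily return a long navigation block, whereas CE first removes blocks whose pattern repeats across pages of the same class.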
IV. Conclusion
In this paper we studied a more effective technique for automatic extraction of flat and nested data records from web pages. Given a web page, our method first identifies and extracts the data records based on visual-clue information. It then counts the number of data items in each record and classifies the record as either flat or nested. The extracted data fields are then stored in a file. Although the problem has been studied by several researchers, existing techniques are either inaccurate or make many strong assumptions. VCED is a purely visual-clue-based extraction of flat and nested data records, with the following limitations:
Framework limitations:
1) The above algorithms fail to identify records in very large web pages.
2) They work only on offline analysis.
3) They retrieve duplicate record structures.
4) They do not retrieve records matching a given pattern.
5) They do not support subtree page-structure generation.
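The flat-versus-nested decision described above can be sketched with a simple item-count heuristic. The threshold and counts below are assumptions for illustration; the method in the paper also weighs the visual area covered by each record:

```python
def classify_records(item_counts, base_items):
    """Label each record flat when it carries the typical number of data
    items, nested when it carries more (a simplified stand-in for the
    paper's area-plus-item-count test).

    item_counts: number of data items found in each extracted record.
    base_items:  item count of a typical flat record (assumed known here).
    """
    return ["flat" if n <= base_items else "nested" for n in item_counts]

# Hypothetical counts of data items in five extracted records.
print(classify_records([4, 4, 8, 4, 12], base_items=4))
# ['flat', 'flat', 'nested', 'flat', 'nested']
```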
REFERENCES:
[1] Baeza-Yates, R. Algorithms for string matching: A survey. ACM SIGIR Forum, 23(3-4):34-58, 1989.
[2] Hammer, J., Garcia-Molina, H., Cho, J., and Crespo, A. Extracting semistructured information from the Web. In Proc. of the Workshop on the Management of Semistructured Data, 1997.
[3] Kushmerick, N. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118:15-68, 2000.
[4] Chang, C.-H., Lui, S.-L. IEPAD: Information extraction based on pattern discovery. WWW-01, 2001.
[5] Freitag, D. Information extraction from HTML: Application of a general learning approach. In Proc. of the Fifteenth Conference on Artificial Intelligence (AAAI-98), 1998.
[6] Adelberg, B. NoDoSE: A tool for semi-automatically extracting structured and semi-structured data from text documents. SIGMOD Record, 27(2):283-294, 1998.
[7] Arocena, G. O. and Mendelzon, A. O. WebOQL: Restructuring documents, databases, and Webs. In Proc. of the 14th IEEE International Conference on Data Engineering (ICDE), Orlando, Florida, pp. 24-33, 1998.