
International Journal of Computer Trends and Technology- volume3Issue5- 2012

EXTRACTING SEMI-STRUCTURED INFORMATION BASED ON SUBTREES


B. Swapna Kumari#1, S. Rajesh#2
1 M.Tech (CSE), V.R. Siddhartha Engineering College, Vijayawada
2 Associate Professor, V.R. Siddhartha Engineering College, Vijayawada

ABSTRACT:
A new web content structure based on visual representation is proposed in this paper. Many web applications, such as information retrieval, information extraction, and automatic page adaptation, can benefit from this structure. The approach extracts each record from a data region and determines whether it is a flat or a nested record based on visual information: the area the record covers and the number of data items it contains. The next step is to extract the data items from these records and transfer them into a database. This paper presents an automatic, top-down, tag-tree-independent approach to detecting web content structure. It simulates how a user understands web layout based on visual perception. Compared with other existing techniques, our approach is independent of the underlying document representation, such as HTML, and works well even when the HTML structure differs greatly from the layout structure.

I INTRODUCTION
The World Wide Web has become one of the most important information sources today. Most data on the web are available as pages encoded in markup languages such as HTML, intended for visual browsers. As the amount of data on the web grows, locating desired content accurately and accessing it conveniently become pressing requirements. Technologies such as web search engines and adaptive content delivery [1] are being developed to meet these requirements. However, web pages are normally composed for viewing in visual web browsers and lack information on semantic structure. These days most companies manage their business through web sites and use these sites to advertise their products and services. These dynamic data need to be collected and organized so that, after extracting information from them, one can build many value-added applications. For example, in order to collate and compare the prices and features of products available from various web sites, we need tools to extract the attribute descriptions of each product (called a data object) within a specific region (called a data region) of a web page. If one examines a web page, there are many irrelevant components intertwined with the data objects. In many web pages, there is normally more than one data object intertwined together in a data region, which makes it difficult to discover the attributes for each object. Furthermore, since

the raw source of the web page depicting the objects is non-contiguous, the problem becomes more difficult. In real applications, users require descriptions of individual data objects from complex web pages, derived by partitioning the data region. There are different approaches in practice, due to Hammer, Garcia-Molina, Cho, and Crespo [1], Kushmerick [2], Chang and Lui [3], Crescenzi, Mecca, and Merialdo [4], and Zhao, Meng, Wu, and Raghavan [5], which address the problem of web data extraction through wrapper generation techniques. To extract these structures, document wrappers are commonly used. Building wrappers, however, is not a trivial task. Normally, wrappers are built for specific web pages by having people examine those pages and then figure out rules that can separate the chunks of interest. Based on these special rules, we can write a wrapper to extract information from pages that belong to exactly the same class. Many wrappers are just lexical analyzers, as discussed in [8]. Methods like [9] make some improvements by using heuristics in addition to lexical analysis. There are also approaches that try to derive semantic structures directly. The approach presented in [1] discusses a concept discovery and confirmation method based on heuristics. Another [11] introduces a method to find the relationships between labeled semi-structured data. As we can see, the methods listed above are limited because the detection of content chunks is actually done by humans.
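To make concrete how such a hand-written wrapper works, here is a toy sketch in Python. The page snippet, CSS class names, and regular expression below are hypothetical, invented for illustration, not taken from any of the cited systems; the point is that a wrapper of this kind is essentially a lexical rule tied to one page class and breaks as soon as the markup changes.

```python
import re

# A hand-written wrapper is just a lexical rule tied to one page class.
# The HTML structure and class names here are hypothetical.
PRODUCT_RULE = re.compile(
    r'<div class="product">\s*<span class="name">(?P<name>.*?)</span>\s*'
    r'<span class="price">(?P<price>.*?)</span>',
    re.DOTALL,
)

def wrap(html: str) -> list[dict]:
    """Extract (name, price) records from pages of exactly this class."""
    return [m.groupdict() for m in PRODUCT_RULE.finditer(html)]

page = ('<div class="product"><span class="name">Camera</span>'
        '<span class="price">$199</span></div>')
print(wrap(page))  # [{'name': 'Camera', 'price': '$199'}]
```

Any page whose markup deviates from this exact pattern yields nothing, which is why the paper argues that manually maintained rules do not scale.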

II BACKGROUND AND RELATED WORK

Existing methods are not feasible if a large number and variety of web pages are to be processed. Automatic or semi-automatic methods are much more effective in this situation. Only recently have several proposals discussed ways of automatic analysis. In [4], a method to parse HTML data tables and generate a hierarchical representation is discussed. The approach assumes that the authors of tables have provided enough information to interpret them. The authors of [3] introduce a method that detects chunk boundaries by combining multiple independent heuristics. For specific fields of interest, wrappers can also be implemented based on semantic rules; the approach discussed in [2] is such an idea. HTML, as it was introduced with web technology, is the most commonly used standard for current web pages. However, it lacks the ability to represent semantically related content. For historical reasons, it was designed to take both

ISSN: 2231-2803 http://www.internationaljournalssrg.org



structural and presentational capabilities in mind, and the two were not clearly separated. (In the first version of HTML most tags were structural, but many layout and presentation tags were stuffed into later versions and are widely used today; some of this history can be found in [5].) Further widespread misuse of structural HTML tags for layout purposes makes the situation even worse. Cascading Style Sheets (CSS) [2] were later developed as a remedy, but only recently have several popular browsers begun to support CSS well [1]. The W3C recommendation of XML provides a better way to organize data and represent its semantic structure; however, most web content is still authored in HTML. Because of these common misuses, we consider HTML tags to be unstable features for analyzing the structure of HTML documents. For approaches based on semantic rules, the limited fields of interest and the difficulty of learning new rules automatically restrict their feasibility on general web pages. The amount of web information has been increasing rapidly, especially with the emergence of Web 2.0 environments, where users are encouraged to contribute rich content. Much web information is presented in the form of web records, which exist in both detail and list pages. The task of web information extraction (WIE), or retrieval of records from web pages, is usually implemented by programs called wrappers. Automatic methods aim to find patterns or grammars in the web pages and then use them to extract data. Examples of automatic systems are IEPAD [3], ROADRUNNER [5], MDR [1], DEPTA [10], and VIPS [2]. Some of these systems make use of the Patricia (PAT) tree for discovering record boundaries automatically and a pattern-based extraction rule to extract the web data. This method performs poorly due to various limitations of the PAT tree. ROADRUNNER [5] extracts a template by analyzing a pair of web pages of the same class at a time.
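The template-matching idea can be sketched very roughly as follows. This is a deliberately simplified stand-in, not the actual ROADRUNNER algorithm (which also infers optional fields and iterators): here, tokens that agree across two same-class pages are kept as fixed template text, while mismatching positions become wildcard slots holding the page-specific data. Token lists and titles are invented for illustration, and the two pages are assumed to have the same token length.

```python
# Simplified template inference from two pages of the same class:
# matching tokens stay fixed, mismatches become wildcard data slots.
# Assumes both token sequences have equal length (a toy restriction).
def infer_template(page_a: list[str], page_b: list[str]) -> list[str]:
    return [a if a == b else "#WILDCARD#"
            for a, b in zip(page_a, page_b)]

def extract(template: list[str], page: list[str]) -> list[str]:
    """Return the tokens of `page` that fall in the wildcard slots."""
    return [tok for slot, tok in zip(template, page)
            if slot == "#WILDCARD#"]

a = ["<html>", "<b>", "Title:", "Casablanca", "</b>", "</html>"]
b = ["<html>", "<b>", "Title:", "Vertigo", "</b>", "</html>"]
tmpl = infer_template(a, b)
print(extract(tmpl, a))  # ['Casablanca']
```

Once the template is inferred from one pair of pages, it can be reused to pull the variable fields out of any further page of the same class.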
It uses one page to derive an initial template and then tries to match the second page against that template. The major limitation of this approach is that the initial template is basically derived manually.

III. EXISTING FRAMEWORK

Some of the proposed algorithms use a method of identifying data records, but they cannot extract data items from the records and do not handle nested data records. DEPTA by Zhai and Liu [8] is able to align and extract data items from data records but does not handle nested data records. NET by Bing Liu and Y. Zhai [9] (Nested data Extraction using Tree matching), the latest and most widely used at present, works in two main steps. (i) Building a tag tree of the page: due to the numerous erroneous and unbalanced tags in the HTML code of a page, building a correct tag tree is a complex task, and a visual-based method is used to deal with this problem. (ii) Identifying data records and extracting data from them: the algorithm performs a post-order traversal of the tag tree to identify data records at different levels, which ensures that nested data records are found; a tree edit distance algorithm and visual cues are used to perform these tasks. Although this technique is able to extract flat and nested data records, the construction of the tag tree and its post-order traversal are considered an overhead. A further approach is based on visual features and semantic information in the web page. That system relies on an existing page segmentation algorithm (VIPS) to analyze and partition a web page into a set of visual blocks, and then groups related blocks by the appearance similarity of the data records. The VIPS algorithm cannot determine data regions or data record boundaries by itself, but the VIPS block tree provides important semantic partition information about the web page. Finally, data records are extracted from the identified data region.
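Step (ii) of NET can be sketched as a post-order traversal that flags a node as a data region when its children are pairwise-similar subtrees. This is a hedged illustration only: the real NET uses tree edit distance and visual cues, whereas here a crude string-serialization similarity (an assumption of this sketch, not the paper's measure) stands in for both, and the `Node` class and threshold are invented.

```python
# Post-order traversal of a tag tree, marking a node as a data region
# when its children are pairwise-similar subtrees. A serialization-based
# similarity stands in for NET's tree edit distance and visual cues.
from difflib import SequenceMatcher

class Node:
    def __init__(self, tag, children=None):
        self.tag, self.children = tag, children or []

def serialize(node):
    return node.tag + "".join(serialize(c) for c in node.children)

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, serialize(a), serialize(b)).ratio() >= threshold

def find_regions(node, regions):
    for child in node.children:          # visit children first (post-order),
        find_regions(child, regions)     # so nested data regions are found too
    kids = node.children
    if len(kids) >= 2 and all(similar(kids[i], kids[i + 1])
                              for i in range(len(kids) - 1)):
        regions.append(node)
    return regions

row = lambda: Node("tr", [Node("td")])
table = Node("table", [row(), row(), row()])
print([r.tag for r in find_regions(table, [])])  # ['table']
```

Because children are processed before their parent, a region nested inside a larger record is appended to the result list before the enclosing region, mirroring the paper's point that post-order traversal is what makes nested records discoverable.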

Existing approach: The VER algorithm for the proposed technique is as follows:

Algorithm VER (HTML document)
Begin
  1. Accept the input web page.
  2. Preprocess the web page to filter out useless nodes.
  3. Segment the web page into semantic blocks using the VIPS algorithm.
  4. Cluster the blocks using the visual appearance similarity of the web page.
  5. Identify the data region and align the data records.
  6. Show the extracted data records.
End.

CE [11] considers all detail pages of a website as pages of the same class. It runs a learning phase with two or more pages as its input, finds the blocks whose pattern repeats across the input pages, marks them as non-informative blocks, and stores them. These non-informative blocks are mostly copyright information, headers, footers, sidebars, and navigation links. When the CE algorithm is then used in the real world, it first eliminates non-informative patterns from the structure of its input pages, based on the stored patterns for the specific class of input pages. Finally, from the remaining blocks in the page, it returns the text of the block with the greatest text length. CE needs a learning phase, so it cannot extract the main content from a single random input web page. FE [11] extracts the text content of the block that has the highest probability of containing text, so it works well on web pages where the textual part of the main content dominates other types of content. In addition, FE can return only one block of the main content, so [11] proposed K-FE, which returns the k blocks with the highest probability of containing the main content. The algorithm steps of K-FE and FE are the same except for the last part: in K-FE, the final section of the algorithm sorts the blocks by their probability, then uses k-means clustering and takes the high-probability clusters.

Procedure ExtractDataRecord(dataRegion)
{
  THeight = 0
  For each child of dataRegion
  BEGIN
    THeight += height of the bounding rectangle of the child
  END
  AHeight = THeight / number of children of dataRegion
  For each child of dataRegion
  BEGIN
    IF height of the child's bounding rectangle > AHeight
    BEGIN
      dataRecord = child
    END
  END
}

Procedure IdentifyNestedData(dataRecord[I], dataRecord[I+1])
{
  noOfFields[I] = 0
  For I = 1 to number of records
  BEGIN
    noOfFields[I] = noOfFields[I] + number of fields in record[I]
  END
  DO
    For I = 1 to number of records
    BEGIN
      For dataRecord[I], dataRecord[I+1]
        IF the number of fields in the (I+1)th record >= 40% of the number of fields in the Ith record
          The (I+1)th record is a nested data record
        ELSE
          The Ith record is a nested data record
    END
  WHILE (not EOF)
}

Extraction of data fields from the extracted records: once a record has been extracted and identified, the next step is to extract the data fields from the data records. The data fields are extracted using the following algorithms.

Procedure ExtractNesteddatafields()
{
  Extract the nested records from the flat-data file.
  For I from the start of the file to the end of the file
  BEGIN
    Extract the data fields row by row
  END
  Store the data fields in a file.
}

The above algorithm explains how data fields are extracted from nested records. First, the file in which the nested data records are stored is located, using its absolute path. The file is then read line by line until the end of the file, and the data fields are extracted row by row. Each data field has a bounding rectangle associated with it, and the data fields are extracted using these bounding rectangles: when a bounding rectangle is recognized, the corresponding data field is extracted and stored in a file.

Procedure ExtractFlatdatafields()
{
  Extract the flat records from the flat-data file.
  For I from the start of the file to the end of the file
  BEGIN
    Extract the data fields row by row
  END
  Store the data fields in a file.
}

The above algorithm explains the extraction of data fields from the extracted and identified flat records. The procedure for extracting data fields from flat records is the same as that described above for nested records.
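The two core procedures can be sketched in Python as follows. This is a minimal sketch assuming the bounding-rectangle heights and per-record field counts are already available as plain lists; in the real system they would come from the rendered page, so the numbers below are illustrative only.

```python
# Sketch of ExtractDataRecord and IdentifyNestedData. Heights and field
# counts are passed in directly; the sample values are hypothetical.
def extract_data_records(child_heights: list[float]) -> list[int]:
    """Return indices of children taller than the average child height,
    mirroring the AHeight comparison in ExtractDataRecord."""
    avg = sum(child_heights) / len(child_heights)
    return [i for i, h in enumerate(child_heights) if h > avg]

def identify_nested(field_counts: list[int]) -> list[int]:
    """Mirror the pseudocode's pairwise 40% rule: compare each record's
    field count with its successor's and flag one of the pair as nested."""
    nested = []
    for i in range(len(field_counts) - 1):
        if field_counts[i + 1] >= 0.4 * field_counts[i]:
            nested.append(i + 1)   # successor flagged as nested
        else:
            nested.append(i)       # current record flagged as nested
    return nested

heights = [120.0, 30.0, 125.0, 118.0]   # hypothetical child heights (px)
print(extract_data_records(heights))     # [0, 2, 3]
```

In this example the second child (height 30) falls below the average of 98.25 and is filtered out as a separator or noise block, while the three taller children are kept as candidate data records.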

IV. Conclusion
In this paper we have studied a more effective technique to perform automatic extraction of flat and nested data records from web pages. Given a web page, our method first identifies and extracts the data records based on visual cue information. It then counts the number of data items in each record and classifies the record as either flat or nested. The extracted data fields are then stored in a file. Although the problem has been studied by several researchers, existing techniques are either inaccurate or make many strong assumptions. VCED is a purely visual-cue-based extraction of flat and nested data records, with the following limitations:

Framework limitations:
1) The above algorithms fail to identify records in huge web pages.
2) It works only for offline analysis.
3) It retrieves duplicate record structures.
4) It does not retrieve records matching a given pattern.
5) It does not support subtree page structure generation.

REFERENCES:
[1] Baeza-Yates, R. Algorithms for string matching: a survey. ACM SIGIR Forum, 23(3-4):34-58, 1989.
[2] J. Hammer, H. Garcia-Molina, J. Cho, and A. Crespo. Extracting semi-structured information from the web. In Proc. of the Workshop on the Management of Semistructured Data, 1997.
[3] Kushmerick, N. Wrapper induction: efficiency and expressiveness. Artificial Intelligence, 118:15-68, 2000.
[4] Chang, C-H., Lui, S-L. IEPAD: information extraction based on pattern discovery. WWW-01, 2001.
[5] Freitag, D. Information extraction from HTML: application of a general learning approach. Proceedings of the Fifteenth Conference on Artificial Intelligence (AAAI-98), 1998.
[6] Adelberg, B. NoDoSE: a tool for semi-automatically extracting structured and semi-structured data from text documents. SIGMOD Record, 27(2):283-294, 1998.
[7] Arocena, G. O. and Mendelzon, A. O. WebOQL: restructuring documents, databases, and Webs. Proceedings of the 14th IEEE International Conference on Data Engineering (ICDE), Orlando, Florida, pp. 24-33, 1998.



[8] Muslea, I., Minton, S., and Knoblock, C. A hierarchical approach to wrapper induction. Proceedings of the Third International Conference on Autonomous Agents (AA-99), 1999.
[9] Liu, L., Pu, C., and Han, W. XWRAP: an XML-enabled wrapper construction system for web information sources. Proceedings of the 16th IEEE International Conference on Data Engineering (ICDE), San Diego, California, pp. 611-621, 2000.
[10] M. Álvarez, A. Pan, J. Raposo, F. Bellas, and F. Cacheda. Finding and extracting data records from web pages. Journal of Signal Processing Systems, 2008.
[11] Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, New York, NY, USA: ACM, 2005, pp. 76-85.
[12] Y. Kim, J. Park, T. Kim, and J. Choi. Web information extraction by HTML tree edit distance matching. In ICCIT '07: Proceedings of the 2007 International Conference on Convergence Information Technology, Washington, DC, USA: IEEE Computer Society, 2007, pp. 2455-2460.
[13] B. Liu and Y. Zhai. NET: a system for extracting web data from flat and nested data records. In Proceedings of the 6th International Conference on Web Information Systems Engineering (WISE-05), 2005.
[14] S. P. Algur and P. S. Hiremath. Extraction of flat and nested data records from web pages. In AusDM '06: Proceedings of the Fifth Australasian Conference on Data Mining and Analytics, Darlinghurst, Australia: Australian Computer Society, Inc., 2006, pp. 163-168.
