
Learning to Adapt Web Information Extraction

Knowledge and Discovering New Attributes


via a Bayesian Approach
Tak-Lam Wong and Wai Lam, Senior Member, IEEE
Abstract: This paper presents a Bayesian learning framework for adapting information extraction wrappers with new attribute
discovery, reducing human effort in extracting precise information from unseen Web sites. Our approach aims at automatically
adapting the information extraction knowledge previously learned from a source Web site to a new unseen site while, at the same time,
discovering previously unseen attributes. Two kinds of text-related clues from the source Web site are considered. The first kind of clue
is obtained from the extraction pattern contained in the previously learned wrapper. The second kind of clue is derived from the
previously extracted or collected items. A generative model for the generation of the site-independent content information and the site-
dependent layout format of the text fragments related to attribute values contained in a Web page is designed to harness the
uncertainty involved. Bayesian learning and expectation-maximization (EM) techniques are developed under the proposed generative
model for identifying new training data for learning the new wrapper for new unseen sites. Previously unseen attributes together with
their semantic labels can also be discovered via another EM-based Bayesian learning based on the generative model. We have
conducted extensive experiments on more than 30 real-world Web sites in three different domains to demonstrate the effectiveness
of our framework.
Index Terms: Wrapper adaptation, Web mining, text mining, machine learning.

1 INTRODUCTION
INFORMATION extraction systems aim at automatically
extracting precise and exact text fragments from docu-
ments. They can also transform largely unstructured in-
formation to structured data for further intelligent processing
[11], [33]. A common information extraction technique for
semistructured documents such as Web pages is known as
the wrapper. A wrapper normally consists of a set of extraction
rules which were typically manually constructed by human
experts in the past. Recently, several wrapper learning
approaches have been proposed for automatically learning
wrappers from training examples [22], [26], [28], [34]. For
instance, consider a Web page shown in Fig. 1 collected from
a Web site (http://www.barnesandnobel.com) in the book
catalog domain. To learn the wrapper
for automatically extracting information from this Web site,
one can manually provide some training examples. For
example, a user may label the text fragment Java How to
Program as the book title, and the fragments Harvey M.
Deitel and Paul J. Deitel as the corresponding authors. A
wrapper learning method can then automatically learn the
wrapper based on the text patterns embedded in training
examples, as well as the text patterns related to the layout
format embodied in the HTML document. The learned
wrapper can be applied to other Web pages of the same Web
site to extract information.
Wrapper learning systems can significantly reduce the
amount of human effort in constructing wrappers. Though
many existing wrapper learning methods can effectively
extract information from the same Web site and achieve very
good performance, one restriction of a learned wrapper is
that it cannot be applied to previously unseen Web sites, even
in the same domain. To construct a wrapper for an unseen
site, a separate human effort for the preparation of training
examples is required. To cope with this problem, this paper
investigates wrapper adaptation which aims at automati-
cally adapting a previously learned wrapper from one Web
site, known as a source Web site, to new unseen sites in the
same domain. For example, the wrapper previously learned
from the source Web site shown in Fig. 1 can be adapted to
the new unseen site shown in Fig. 2. The adapted wrapper
can then be applied to Web pages of this new site for
extracting data records. Consequently, it can significantly
reduce the human effort in preparing training examples for
learning wrappers for different sites.
Another shortcoming of existing wrapper learning tech-
niques is that attributes extracted by the learned wrapper are
limited to those defined in the training process. As a result,
they can only handle prespecified attributes. For example, if
the previously learned wrapper only contains extraction
patterns for the attributes title, author, and price from the
source Web site shown in Fig. 1, the adapted wrapper can at
best extract these attributes from new unseen sites. However,
a new unseen site may contain some new additional
attributes that do not appear in the source Web site. For
instance, book records in Fig. 2 contain the attribute ISBN that
does not exist in Fig. 1. The ISBN of the book records cannot
be extracted. This observation leads to another objective of
this paper. We investigate the problem of new attribute
discovery which aims at extracting the unspecified attributes
from new unseen sites. New attribute discovery can
effectively deliver more useful information to users.
1.1 Our Contributions
This paper presents a novel framework for solving the
wrapper adaptation with new attribute discovery via
Bayesian learning. One objective of our approach is to
automatically adapt a previously learned wrapper from a
source Web site to a new unseen site. The second objective
is to tackle the problem of new attribute discovery which is
to find new attributes that are not specified in the learned or
adapted wrapper. It is also able to discover semantic labels
for the new attributes discovered. A semantic label refers to
the text fragment on the Web page indicating the name of
the attribute. For example, "B&N Price" and "Our Price"
are examples of semantic labels for the attribute price in the
Web sites as shown in Figs. 1 and 2, respectively. By
discovering semantic labels in the new site, users are able to
understand the meaning of newly discovered attributes.
This problem is challenging because it needs to consider
the interaction among the text fragments corresponding to
the attribute value and the layout format of the record
attributes collected from different Web sites. Some informa-
tion is generally different across different sites. For instance,
the layout formats of the source site and the target site are
typically different. Therefore, the wrapper learned and the
items collected or extracted from the source Web site are not
directly applicable in the new site. Only partial information
about the content of attributes can be used to infer items in the
new site. Moreover, although the Web sites are in the same
domain, they may not have any common record. Even though
they have common records, attributes may be displayed in
different formats. For example, the attribute values of the
attribute author may be shown as "Kruglinski, David J." and
"David J. Kruglinski" in two different Web sites.
Two kinds of text-related clues from the source Web site
are considered in our framework. The first kind of clue is
derived from the extraction pattern contained in the
previously learned wrapper. Another clue is embodied in
the items previously collected or extracted from the source
Web site. A generative model for the generation of text
fragments about attributes and layout formats in a Web
page is designed to harness the uncertainty involved in the
learning process of wrapper adaptation and new attribute
discovery. In the new unseen site, the DOM structure
representation of the Web page is analyzed to identify a set
of useful text fragments. The proposed generative model is
utilized to conduct inference over the set of useful text
fragments to infer training examples in the new site.
Bayesian learning and expectation-maximization (EM)
techniques are employed for the proposed generative
model in the inference process. The set of inferred training
examples is then utilized to learn the wrapper, known as
the adapted wrapper, tailored to the new unseen site. This
adapted wrapper can be applied to Web pages of the new
site to extract items. Previously unseen attributes together
with their semantic labels in the new site can also be
discovered via another EM-based Bayesian inference pro-
cess based on the generative model.
Domain adaptation, which is a subproblem of transfer
learning [1], [8], shares certain resemblance with our
wrapper adaptation and has been applied to text-related
problems. Referring to our problem, the source Web site
and the target Web site can be treated as the source domain
and the target domain, respectively. Most existing works
assume that the data distribution of the target domain is
different from, but is highly related to that of the source
domain [9]. However, these methods are not applicable to
wrapper adaptation since the layout features are site-
dependent and completely different across different sites,
making the data distribution of the target Web site largely
different from the source Web site.
Graphical models have been applied in different research
areas related to text. For example, Ray and Craven have
applied Hidden Markov Models (HMMs) to discover the
relation in grammatical free-text sentences [24]. Latent
Dirichlet Allocation (LDA) is a language model for modeling
the topics of words in a probabilistic manner [4]. One major
characteristic of our model is that we consider site-
dependent format features and site-independent content
features during the learning process, leveraging the in-
formation to increase the degree of confidence in both
wrapper adaptation and new attribute discovery tasks.
Fig. 1. An example of a Web page about a book catalog.

Fig. 2. An example of a Web page about a book catalog collected from a
Web site different from the one shown in Fig. 1.

We have conducted extensive experiments on more
than 30 real-world Web sites in three different domains,
demonstrating very promising results. For example, in the
book domain, our approach can automatically adapt the
wrapper previously learned from a source Web site to
nine new unseen sites achieving an average precision of
91.3 percent and an average recall of 81.8 percent without
any human intervention. It can also automatically dis-
cover new attributes that are not specified in the wrapper,
with an average precision and recall of 89.0 and
77.5 percent, respectively. Moreover, we demonstrate that
the performance of our framework is superior to
several existing approaches.
We have previously developed a wrapper adaptation
method called WrapMA [29]. WrapMA requires human
effort to scrutinize the intermediate discovered data in the
adaptation process. We have also previously developed
another method which attempts to tackle the wrapper
adaptation problem in a fully automatic manner [31], [32],
[30]. In contrast, the wrapper adaptation approach pro-
posed in this paper is designed based on a more principled
Bayesian learning framework.
2 PROBLEM DEFINITION
A summary of the notation used in this paper is shown in
Table 1. Consider a domain $D$. For example, the book
domain may contain a number of books, while the job
domain may contain a number of job advertisements. There
exists a set of records $R = \{r_1, r_2, \ldots\}$. Each record is
characterized by a set of attributes $\{\bar{a}\} \cup \{a_1, a_2, \ldots\}$,
where $\bar{a}$ is a special element called not-an-attribute and the
others are regular attributes. For a particular record $r_i$, the
value of the attribute $a_j$ is represented by $v_j(r_i)$. For
example, in the book domain, each book record contains the
attributes title, author, ISBN, price, etc. The values of the
attributes title and author of the first book record shown in
Fig. 1 are "Shrink Yourself: The Ultimate Program to End
Emotional Eating" and "Roger Gould", respectively. In the
domain $D$, suppose there exists a Web site $s$ containing a
set of pages denoted by $P^s = \{p^s_1, p^s_2, \ldots\}$, each of which
consists of one or more records. The HTML document
corresponding to the Web page $p$ can be treated as a
sequence of tokens $t^p_1, \ldots, t^p_{|p|}$, where $|p|$ refers to the
number of tokens in the page $p$ after preprocessing. Each
token $t^p_j$ is associated with a label $y^p_j$ representing the
attribute to which the token is related. Referring to Fig. 1,
the tokens contained in the text fragment "Shrink Yourself:
The Ultimate Program to End Emotional Eating" are
labeled as the attribute title, while the tokens in the text
fragment "SEARCH RESULTS" are labeled as
not-an-attribute. We define a text fragment $x$ to be a
continuous token subsequence whose contained tokens
are labeled as the same attribute. Each text fragment can be
represented by a three-field tuple $\langle Y, F, C \rangle$, where $Y$ is the
label of the text fragment referring to the attribute to which
the contained tokens belong, and $F$ and $C$ refer to the
formatting feature and the content feature of the text
fragment, respectively. The content feature represents the
characteristics of an attribute, such as the orthographic
information and the number of words starting with a
capital letter in the text fragment. The formatting feature $F$
represents the layout format information. For example, the
text fragment "Shrink Yourself: The Ultimate Program to
End Emotional Eating" is underlined, while the text
fragment "SEARCH RESULTS" is in capital letters in Fig. 1.
Note that $C$ and $F$ are observable on Web pages, whereas
$Y$ is unobservable.
Information extraction wrapper. Let $\mathcal{A} = \mathcal{A}_{wrap} \cup \mathcal{A}_{nonwrap} \cup \{\bar{a}\}$,
where $\mathcal{A}_{wrap}$, $\mathcal{A}_{nonwrap}$, and $\{\bar{a}\}$ are three
disjoint nonempty sets as described in Table 1. Given a set
of Web pages $P^s$ collected from the Web site $s$, the goal of
an information extraction wrapper, denoted by $Wrap^s$, is
to identify the text fragments from the pages in $P^s$ such that
each text fragment refers to the value of one of the attributes
in $\mathcal{A}_{wrap}$ for a record, with an error rate less than $\epsilon$. For
example, an information extraction wrapper for the Web
site shown in Fig. 1 can extract the attributes title, author,
and price of the book records in Web pages. A wrapper
normally consists of a set of site-dependent rules, each of
which corresponds to an attribute in $\mathcal{A}_{wrap}$, to identify the
text fragments based on the layout format and the content
of the tokens.

TABLE 1
A Summary of the Notation Used in the Paper
Wrapper learning. Consider a set of Web pages $P^s$
collected from the Web site $s$ and the attribute set $\mathcal{A}_{wrap}$,
together with a set of training examples consisting of text
fragments extracted from the pages in $P^s$ and correctly
labeled with the attributes in $\mathcal{A}_{wrap}$. Wrapper learning is an
algorithm which can automatically learn an information
extraction wrapper $Wrap^s$ from the training examples such
that the learned wrapper $Wrap^s$ is able to identify the text
fragments belonging to the attributes in $\mathcal{A}_{wrap}$ from the
remaining pages in site $s$. Note that the learned wrapper
$Wrap^s$ cannot be applied to another site $s'$, where $s \neq s'$,
to extract information, since wrappers are site-dependent.
Moreover, $Wrap^s$ can only extract text fragments belonging
to those attributes which are specified in the training
examples. In contrast, this paper investigates the wrapper
adaptation problem and the new attribute discovery
problem.
Wrapper adaptation. Given a wrapper $Wrap^s$ which can
extract text fragments belonging to the attributes in $\mathcal{A}_{wrap}$
from the Web site $s$, wrapper adaptation aims at
automatically learning a wrapper $Wrap^{s'}$ for a Web site $s'$,
where $s \neq s'$, without any training example from $s'$, such
that the adapted wrapper $Wrap^{s'}$ can extract text fragments
belonging to the attributes in $\mathcal{A}_{wrap}$. For example, suppose
we have a wrapper which can extract the attributes title,
author, and price of the book records in the Web site shown
in Fig. 1. Wrapper adaptation can automatically learn a new
wrapper which can extract the same set of attributes of the
book records from the Web site shown in Fig. 2. Note that
wrapper adaptation is independent of any wrapper learning
method.
New attribute discovery. Given a wrapper $Wrap^s$ which
can extract text fragments belonging to the attributes in
$\mathcal{A}_{wrap}$ from the Web site $s$, new attribute discovery aims at
automatically identifying attributes contained in $\mathcal{A}_{nonwrap}$
but not contained in $\mathcal{A}_{wrap}$. For instance, suppose we have
a wrapper which can extract the attributes title, author, and
price of the book records in the Web site shown in Fig. 2.
New attribute discovery can identify the text fragments
referring to previously unseen attributes such as ISBN in
the same Web site. Notice that $\mathcal{A}_{nonwrap}$ is not predefined
since the new attributes in the new unseen site are
unknown. New attribute discovery is independent of any
wrapper learning method.
3 OVERVIEW OF OUR FRAMEWORK
3.1 A Generative Model
We develop a Bayesian learning framework to tackle the
wrapper adaptation and new attribute discovery problem
based on a generative model for the generation of the text
fragments related to attributes. Fig. 3 depicts the graphical
representation of our model. Shaded and unshaded nodes
represent observable and unobservable variables, respec-
tively. We describe our proposed framework using a
running example. In each domain, there is an attribute
generation variable denoted by $\alpha$ which controls the label $Y$
of each text fragment. Consider the book domain and the
Web site, as shown in Fig. 1. This site consists of several
Web pages and each of them contains a number of records.
The attributes contained in each record are title, author, price,
publishing date, etc. For the Web sites, such as the ones
shown in Figs. 1 and 2, collected from the book catalog
domain, records are normally composed of a similar set of
book-related attributes. As a result, $\alpha$ basically does not
change drastically for different Web sites given a domain.
Suppose a Web site contains $M$ pages and the $m$th page
contains $N_m$ text fragments. The label of each text fragment
is generated according to $P(Y | \alpha)$. Based on the label
generated, the content feature $C$ is then generated from
the distribution $P(C | Y)$. In essence, $C$ is a feature vector
which characterizes the content of the text fragment. For
example, an element of $C$ refers to the number of
occurrences of a particular token in the text fragment. $C$ is
dependent on $Y$ and, as a consequence, in turn
dependent on $\alpha$. Therefore, $C$ remains largely unchanged
for different Web sites. For instance, in the book catalog
domain, the book titles collected from different Web sites
show similar characteristics and orthographic information.
Within a particular Web site, there is a formatting feature
generation variable denoted by $\beta$. The formatting feature $F$
of attributes represents the formatting information such as
the font color or location of a text fragment. Similar to $C$, $F$
is a feature vector characterizing the layout format of the
text fragment. Attributes of records from different Web sites
are normally presented in different formats or style. For
example, the title of each book record in Fig. 1 is in boldface
and underlined, whereas that in Fig. 2 is only underlined.
Within a Web site, an attribute of a record can be
associated with a semantic label. A semantic label basically
is a text fragment showing the semantic meaning of a text
fragment. For example, in the Web page shown in Fig. 1, the
semantic label for the attribute publication date is Pub.
Date:. The semantic labels commonly show certain reg-
ularity. For example, the semantic labels for the attributes
are located on the left-hand side and end with a colon.
Such semantic label characteristics are normally different
across different Web sites. To model this observation, the
semantic label feature variable, denoted by $S$, is considered
in our generative model, representing the characteristics of
the surrounding text fragments of a text fragment related to
an attribute. Essentially, $S$ is a feature vector depending on
$H$ and generated according to $P(S | H)$. $H$, which is a binary
variable, indicates whether the feature $S$ is related to
semantic labels of attributes. $H$ is, in turn, controlled by
the semantic label generation variable denoted by $\gamma$. Since
semantic label features are generally different across
different sites, each Web site has its specific $\gamma$. As a result,
$S$ is also generally different across different sites.

Fig. 3. The generative model for text fragment generation. Shaded
nodes represent observable variables and unshaded nodes represent
unobservable variables.
Consequently, the joint probability over the variables in
Fig. 3 can be expressed as

$$P(\mathbf{C}, \mathbf{F}, \mathbf{Y}, \mathbf{H}, \mathbf{S}, \alpha, \beta, \gamma) = P(\alpha) \prod_{m=1}^{M} \Bigg\{ P(\beta_m) P(\gamma_m) \prod_{n=1}^{N_m} \Big[ P(C_{m,n} | Y_{m,n}) \, P(F_{m,n} | Y_{m,n}; \beta_m) \, P(S_{m,n} | H_{m,n}) \, P(H_{m,n} | Y_{m,n}; \gamma_m) \, P(Y_{m,n} | \alpha) \Big] \Bigg\}, \quad (1)$$

where $C_{m,n}$, $F_{m,n}$, $Y_{m,n}$, $H_{m,n}$, and $S_{m,n}$ denote the content
feature, formatting feature, label, semantic label indicator,
and semantic label feature of the $n$th text fragment in the
$m$th page, respectively; $\beta_m$ and $\gamma_m$ are the formatting
feature generation variable and semantic label generation
variable of the $m$th page, respectively.
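To make the factorization in (1) concrete, here is a minimal Python sketch (ours, with invented toy distributions; not code from the paper) that evaluates the log of the per-page inner product for a couple of fragments:

```python
import math

# Illustrative discrete distributions for a two-attribute book domain.
alpha = {"title": 0.5, "author": 0.5}                  # P(Y | alpha)
p_content = {"title": {"caps": 0.8, "name": 0.2},      # P(C | Y)
             "author": {"caps": 0.1, "name": 0.9}}
p_format_m = {"title": {"bold": 0.9, "plain": 0.1},    # P(F | Y; beta_m), page-specific
              "author": {"bold": 0.2, "plain": 0.8}}
p_sem = {0: {"none": 0.9, "colon": 0.1},               # P(S | H)
         1: {"none": 0.2, "colon": 0.8}}
p_h_m = {"title": {0: 0.7, 1: 0.3},                    # P(H | Y; gamma_m), page-specific
         "author": {0: 0.4, 1: 0.6}}

def log_joint_page(fragments):
    """Inner factor of (1) for one page: each fragment (c, f, y, h, s)
    contributes P(C|Y) P(F|Y;beta_m) P(S|H) P(H|Y;gamma_m) P(Y|alpha)."""
    lp = 0.0
    for c, f, y, h, s in fragments:
        lp += (math.log(p_content[y][c]) + math.log(p_format_m[y][f])
               + math.log(p_sem[h][s]) + math.log(p_h_m[y][h])
               + math.log(alpha[y]))
    return lp

page = [("caps", "bold", "title", 1, "colon"),
        ("name", "plain", "author", 0, "none")]
print(log_joint_page(page))
```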
Our generative model can support two goals. The first
goal is to tackle the wrapper adaptation problemwhich aims
at automatically adapting a previously learned wrapper
fromthe source Web site to a newunseen site. To achieve this
goal, it first makes use of the site-independent information
contained in the source Web site to infer a set of training
examples in the new unseen site. Next, the inferred training
examples are utilized to train a wrapper tailored to the new
site for information extraction. The second goal supported by
our generative model is to discover new or previously
unseen attributes. The inferred training examples from the
wrapper adaptation process contain the formatting feature
of attributes in the unseen site. The formatting feature can be
exploited to discover previously unseen attributes and their
associated semantic labels.
3.2 Overview of Wrapper Adaptation
As described in Section 2, the objective of wrapper
adaptation is to learn a wrapper $Wrap^{s'}$ for the unseen
site $s'$ given the learned wrapper $Wrap^s$ of the source site
$s$. Recall that if $Wrap^s$ can only extract attributes in $\mathcal{A}_{wrap}$,
$Wrap^{s'}$ can at best extract the same set of attributes. To
tackle the wrapper adaptation problem, our approach first
identifies a set of useful text fragments from the new unseen
site $s'$ by considering the Document Object Model (DOM,
http://www.w3.org/DOM/) representation of Web pages
using an information-theoretic method developed in our
previous work [32].
The idea of our useful text fragment identification
approach is that each Web page can be considered as a
DOM structure, which is a tree-like structure. The internal
nodes are labeled as the HTML tags, while the leaf nodes
refer to the text fragments displayed on the browser. Hence,
each text fragment is associated with a root-to-leaf path,
which is the concatenation of the HTML tags. Suppose we
have two Web pages of the same site containing different
records. The text fragments related to the attributes of a
record are likely to be different, while text fragments related
to the irrelevant information such as advertisements or
copyright statements are likely to be similar. We make use of
entropy, which can measure the uniformity of the distribu-
tion of tokens contained in the text fragments identified by a
root-to-leaf path. If a root-to-leaf path can locate text
fragments which vary largely in different Web pages, these
text fragments are likely to be related to attributes of a record
and considered to be the useful text fragments.
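As an illustration of this entropy heuristic, the following sketch (ours; the actual method in [32] is more involved, and the path and tokenization details here are our assumptions) computes a token-distribution entropy per root-to-leaf path across pages of the same site:

```python
import math
from collections import Counter, defaultdict

def path_entropy(pages):
    """pages: list of dicts mapping a root-to-leaf tag path, e.g.
    'html/body/table/tr/td', to the text fragment found there.
    High entropy suggests record content; low entropy suggests
    boilerplate repeated across pages."""
    tokens_by_path = defaultdict(list)
    for page in pages:
        for path, text in page.items():
            tokens_by_path[path].extend(text.lower().split())
    entropy = {}
    for path, tokens in tokens_by_path.items():
        counts = Counter(tokens)
        total = sum(counts.values())
        entropy[path] = -sum((n / total) * math.log2(n / total)
                             for n in counts.values())
    return entropy

pages = [{"html/body/b": "Java How to Program", "html/body/i": "Copyright 2010"},
         {"html/body/b": "Cocoa Programming", "html/body/i": "Copyright 2010"}]
print(path_entropy(pages))  # the 'b' path varies across pages, the 'i' path does not
```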
Each useful text fragment is represented by $\langle C, F, Y \rangle$,
where $C$ and $F$ are observable, whereas $Y$ is unobservable.
Next, our generative model is applied to conduct inference
over this set of useful text fragments. The objective of the
inference is to identify certain text fragments to be included
in a set of training examples for learning the wrapper
$Wrap^{s'}$ for the new site $s'$. In essence, we consider the
probability $P(Y = a | C, F)$ for all $a \in \mathcal{A}_{wrap}$. It represents
the probability that the useful text fragment observed as
$(C, F)$ belongs to the attribute $a$. This inference problem
shares some resemblance with an ordinary classification
problem. However, typical classification methods cannot
solve our problem because we do not have complete
training examples for the unseen Web site, as explained in
the next paragraph. Consequently, we cannot learn a
classifier in advance, and hence, common classifier learning
methods are not applicable.
Two kinds of text-related clues from the source Web site
are considered when inferring training examples in the
target Web site. The first kind of clue is the extraction pattern
contained in the previously learned wrapper, which contains
the pattern about the semantic content of the attributes.
Many wrapper systems such as [19] will capture the pattern
of the data to be extracted. For example, one pattern may
capture the observation that text fragments related to the
attribute title likely contain alphabetical tokens and start
with a capital letter. Such extraction patterns can be useful to
identify the corresponding text fragments in new unseen
sites. The second kind of clue comes from the items
previously extracted or collected from the source Web site
since these items embody the characteristics of the content of
an attribute. Shrink Yourself: The Ultimate Program to End
Emotional Eating and Java How to Program are samples
of extracted or collected items for the attribute title in Fig. 1.
These items can be used for inferring training examples from
the unseen site because they contain information about C,
which is likely to remain similar in different sites. However,
they are different from ordinary training examples because
they do not contain any information about $F$ of the unseen
site and the formatting features of the attributes cannot be
used in the new unseen sites for extraction. Using the
formatting features in the source site may even lead to wrong
extraction. For example, the author of a book may be in italic
font in the source site, while in the new site the author is not in
italic font but the attribute publisher is. Using the formatting
features of the source site may incorrectly predict text
fragments actually belonging to publisher as author.
From an unseen site, some useful text fragments are first
identified. As the content feature C of the attributes
contained in the source site and the new unseen site are
similar, our approach exploits the items previously ex-
tracted or collected from the source site for conducting
inference over the set of useful text fragments. Bayesian
learning and EM techniques are employed in the inference
process. For instance, in Fig. 2, our approach can discover
that the text fragments "Cocoa Programming for Mac OS X,
Third Edition" and "Aaron Hillegass" likely belong to the
attributes title and author, respectively. This inference
corresponds to the generative model depicted in the upper
part of Fig. 3. The identified training examples can then be
used to train a wrapper for the new site. In our framework,
a wrapper learning module is needed. In our experiments,
we employ the wrapper learning module previously
developed and presented in [19]. In principle, any wrapper
learning modules or other nonwrapper-based information
extraction approaches can also be employed.
3.3 Overview of New Attribute Discovery
The adapted wrapper $Wrap^{s'}$ can effectively extract
attributes in $\mathcal{A}_{wrap}$ from the new unseen Web site $s'$.
However, since the previously unseen attributes in $\mathcal{A}_{nonwrap}$
are not specified in the adapted wrapper, text fragments
related to these attributes cannot be extracted from the new
unseen site. For example, text fragments related to the
attribute ISBN, such as 0321503619, as shown in Fig. 2,
cannot be extracted. As mentioned in Section 3.1, the second
goal of our framework is to discover new or previously
unseen attributes, and at the same time, identify the
corresponding semantic labels. The idea of our approach
is to analyze those useful text fragments which are not
associated with any known attributes. Recall that a set of
useful text fragments is collected during the wrapper
adaptation process in our framework. Some of the collected
useful text fragments belong to known attributes, and are to
be identified in wrapper adaptation. Among the remaining
useful text fragments, some are likely to be associated with
previously unseen attributes. We develop another inference
method based on Bayesian learning to infer such text
fragments and their semantic labels. Our method first
identifies a set of semantic label candidates for the useful
text fragments. It then exploits the relationship between the
useful text fragments and the semantic label candidates to
discover new attributes and their semantic labels. It
corresponds to the generative model depicted in the lower
part of Fig. 3.
We provide an example to further illustrate the idea.
After the wrapper adaptation process, "0321503619" and
"$31.50" are two sample useful text fragments collected in
the Web page shown in Fig. 2. Some useful text
fragments like "$31.50" are surrounded by "Our Price"
and "You Save" in the corresponding HTML document. On
the other hand, some text fragments like "0321503619" are
surrounded by "ISBN" and "Our Price". Therefore, "Our
Price", "You Save", and "ISBN" are considered as semantic
label candidates. From the results of the wrapper adapta-
tion process, we know that text fragments including
$31.50 in the first record and $21.95 in the second
record of Fig. 2 belong to the attribute price. Since Our
Price is likely to be the semantic label for the attribute price,
we can then discover the relationship between attribute
values and semantic labels in the Web page from their
features including their relative position. Specifically, the
semantic label is likely to be in the same row and on the
left-hand side of an attribute value. As a result, it can be
inferred that it is highly likely that ISBN is the semantic
label for the text fragment 0321503619.
To formalize this idea, we consider the conditional
probability $P(H | C, F, S)$. Essentially, it specifies how likely
it is that the semantic label candidate denoted by $S$ is the
semantic label of the useful text fragment observed as $(C, F)$.
4 DESCRIPTION FOR WRAPPER ADAPTATION
MODEL
Recall that a useful text fragment can be represented by the
three-field tuple $\langle C, F, Y \rangle$ in which $C$ and $F$ are observable
and $Y$ is unobservable. Given a text fragment observed as
$(C, F)$ and collected from the page $p$, the goal of wrapper
adaptation is to predict the label $Y$. In essence, we consider
the probability $P(Y = a | C, F)$, where $a \in \mathcal{A}_{wrap}$.
$P(Y = a | C, F)$ can be expressed as follows using Bayes'
theorem:

$$\begin{aligned} P(Y = a | C, F) &\propto P(C, F | Y = a) P(Y = a) \\ &= P(C | Y = a) P(F | Y = a) P(Y = a) \\ &= \int\!\!\int P(C | Y = a) \, P(F | Y = a, \beta_p) \, P(Y = a | \alpha) \, P(\alpha) \, P(\beta_p) \, d\beta_p \, d\alpha, \end{aligned} \quad (2)$$

where $\beta_p$ denotes the formatting feature generation variable
of the page $p$. However, computing the above formula is
intractable. One may use the conjugate priors of $P(Y | \alpha)$ and
$P(F | Y, \beta_p)$ for $P(\alpha)$ and $P(\beta_p)$, respectively, and apply
sampling methods such as Gibbs sampling to estimate the
solution. However, such sampling methods normally
require a long time to converge in the inference process.
Instead, we develop a point estimation method for
computing $\alpha$ and $\beta_p$ based on the EM technique.

Given a particular set of $\alpha$ and $\beta_p$, the probability of
generating a particular text fragment $\langle C, F, Y \rangle$ in $p$ can be
expressed as follows:

$$P(C, F, Y | \alpha, \beta_p) = P(C | Y) \, P(F | Y; \beta_p) \, P(Y | \alpha). \quad (3)$$
Let $X$ be the set of all text fragments in Web pages and $N$ be
the total number of text fragments. The probability of
generating $X$ can be expressed as follows:

$$P(X | \alpha, \beta) = \prod_{i=1}^{N} P(C_i, F_i, Y_i | \alpha, \beta_{p(i)}), \quad (4)$$

where $C_i$, $F_i$, and $Y_i$ refer to the content feature, the
formatting feature, and the label of the $i$th text fragment in
$X$, respectively; $p(i)$ refers to the page where the $i$th text
fragment comes from. Combining (3) and (4), we can obtain
the following log-likelihood function $L_X(\alpha, \beta)$:

$$L_X(\alpha, \beta) = \sum_{i=1}^{N} \log \sum_{a \in \mathcal{A}} P(C_i | Y_i = a) \, P(F_i | Y_i = a; \beta_{p(i)}) \, P(Y_i = a | \alpha). \quad (5)$$
When $Y_i$ is known, $P(Y_i = a | \alpha) = 1$ if $Y_i = a$ and
$P(Y_i = a | \alpha) = 0$ otherwise. Then, (5) becomes the ordinary
log-likelihood function in typical supervised learning
problems. However, since $Y_i$ is unknown for the useful text
fragments collected from the unseen site, $P(Y_i = a | \alpha)$ lies in
the range 0-1, and hence direct maximization of $L_X(\alpha, \beta)$
is intractable. We derive the following expected log-
likelihood function $L'_X(\alpha, \beta)$:

$$L'_X(\alpha, \beta) = \sum_{i=1}^{N} \sum_{a \in \mathcal{A}} P(Y_i = a | \alpha) \log \Big[ P(C_i | Y_i = a) \, P(F_i | Y_i = a; \beta_{p(i)}) \Big]. \quad (6)$$
By Jensen's inequality and the concavity of the logarithmic
function, it can be proved that $L_X(\alpha, \beta)$ is bounded below
by $L'_X(\alpha, \beta)$ [21]. We design an EM algorithm that
approximates the parameters in (6) without knowing the
actual value of the attributes of the useful text fragments in
the new unseen site. Next, the learned parameters can be
used to conduct inference. Each useful text fragment in
the new unseen site is classified into an appropriate
attribute by estimating $P(Y = a | C, F; \alpha, \beta)$ for all $a \in \mathcal{A}$.

One characteristic of the EM algorithm is that it increases
$L'_X(\alpha, \beta)$ iteratively until convergence so that an approx-
imation of $L_X(\alpha, \beta)$ can be obtained. The E-Step and M-Step
in the $t$th iteration are described as follows:

E-Step.

$$P(Y_i = a | C_i, F_i; \alpha^{(t)}, \beta^{(t)}_{p(i)}) = \frac{P(C_i, F_i, Y_i = a | \alpha^{(t)}, \beta^{(t)}_{p(i)})}{\sum_{a' \in \mathcal{A}} P(C_i, F_i, Y_i = a' | \alpha^{(t)}, \beta^{(t)}_{p(i)})} \propto P(C_i | Y_i = a) \, P(F_i | Y_i = a; \beta^{(t)}_{p(i)}) \, P(Y_i = a | \alpha^{(t)}).$$

M-Step.

$$(\alpha^{(t+1)}, \beta^{(t+1)}) = \arg\max_{\alpha, \beta} L'_X(\alpha, \beta \,|\, \alpha^{(t)}, \beta^{(t)}).$$
Let $P(Y | \alpha)$ be a multinomial distribution where each
mixture component refers to an attribute $a \in \mathcal{A}$. Then, $\alpha$
refers to the parameter of the distribution, which denotes
the proportion of each component in the mixture model.
Recall that in wrapper adaptation, we aim at extracting text
fragments belonging to the attributes in $\mathcal{A}_{wrap}$, and treat
the attributes in $\mathcal{A}_{nonwrap}$ and $\{\bar{a}\}$ as a single component
denoted by $\tilde{a}$. Precisely, we estimate $\alpha^{(t+1)}$ as follows:

$$\alpha^{(t+1)}_a = \frac{\sum_{i=1}^{N} P(Y_i = a | C_i, F_i; \alpha^{(t)}, \beta^{(t)}_{p(i)})}{N}, \qquad P(Y_i = a | \alpha^{(t+1)}) = \alpha^{(t+1)}_a, \quad (7)$$

for all $a \in \mathcal{A}_{wrap} \cup \{\tilde{a}\}$, where $\alpha^{(t)}_a$ represents the proportion
of the component representing the attribute $a$ in the mixture
at the $t$th iteration. On the other hand, $\beta_{p(i)}$, together with
$Y_i$, controls the formatting feature through $P(F_i | Y_i, \beta_{p(i)})$,
which is an attribute-dependent and page-dependent
distribution. Hence, $\beta_{p(i)}$ refers to the parameter of the
distribution, namely the proportion of the useful text
fragments that belong to $a \in \mathcal{A}_{wrap} \cup \{\tilde{a}\}$ in page $p(i)$
and are characterized by the formatting feature $F_i$. Precisely,
we decompose the formatting features into a set of binary
formatting feature functions and assume that they are
independent. Hence, $P(F_i | Y_i = a)$ can be expanded as
follows:

$$P(F_i | Y_i = a) = \prod_{k=1}^{Z_f} P(f^f_k(F_i) | Y_i = a), \quad (8)$$

where $Z_f$ is the number of formatting features and $f^f_k(F_i)$ is
the $k$th binary feature function of the layout format $F_i$.
For a particular formatting feature $f^f_k$, we can estimate
$\beta^{(t+1)}_p$ for the Web page $p$ as follows:

$$\beta^{(t+1)}_{p,a,k} = \frac{1}{\sum_{i=1}^{N} \delta(p, i)} \sum_{i=1}^{N} P(f^f_k(F_i) = 1 | Y_i = a; \beta^{(t)}) \, P(Y_i = a | C_i, f^f_k(F_i) = 1; \alpha^{(t+1)}, \beta^{(t)}) \, \delta(p, i), \qquad P(f^f_k(F_i) = 1 | Y_i = a; \beta^{(t+1)}) = \beta^{(t+1)}_{p(i),a,k}, \quad (9)$$

for all $a \in \mathcal{A}_{wrap} \cup \{\tilde{a}\}$, where $\beta^{(t+1)}_{p,a,k}$ refers to the
proportion of the useful text fragments which belong to the
attribute $a$ in page $p$ and have the formatting feature $f^f_k$ at
the $(t+1)$th iteration; $\delta(p, i) = 1$ if $p(i) = p$ and 0 otherwise.
Consequently, the EM algorithm can be revised as follows:

E-Step.

$$P(Y_i = a | w_i, C_i, F_i; \alpha^{(t)}, \beta^{(t)}) = \frac{P(w_i, C_i, F_i, Y_i = a | \alpha^{(t)}, \beta^{(t)}_{p(i)})}{\sum_{a' \in \mathcal{A}_{wrap} \cup \{\tilde{a}\}} P(w_i, C_i, F_i, Y_i = a' | \alpha^{(t)}, \beta^{(t)}_{p(i)})} \propto P(w_i | Y_i = a) \, P(C_i | Y_i = a) \, P(F_i | Y_i = a; \beta^{(t)}_{p(i)}) \, P(Y_i = a | \alpha^{(t)}),$$

where $w_i$ denotes the character sequence of the $i$th useful
text fragment.

M-Step.

$$\alpha^{(t+1)}_a = \frac{\sum_{i=1}^{N} P(Y_i = a | C_i, F_i; \alpha^{(t)}, \beta^{(t)}_{p(i)})}{N},$$

$$\beta^{(t+1)}_{p,a,k} = \frac{1}{\sum_{i=1}^{N} \delta(p, i)} \sum_{i=1}^{N} P(f^f_k(F_i) = 1 | Y_i = a; \beta^{(t)}_{p(i)}) \, P(Y_i = a | C_i, f^f_k(F_i) = 1; \alpha^{(t+1)}, \beta^{(t)}_{p(i)}) \, \delta(p, i).$$
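The following self-contained Python sketch distills this EM loop under strong simplifying assumptions (a single page, binary formatting features, and a fixed content model standing in for $P(C|Y)$ and $P(w|Y)$); it is our illustration, not the authors' implementation:

```python
def run_em(fragments, attrs, p_content, n_iters=20, eps=1e-6):
    """fragments: list of (content, fmt) where fmt is a dict of binary
    formatting feature values, e.g. {"bold": 1, "underline": 0}.
    p_content(content, a) stands in for the fixed content model."""
    feat_names = sorted(fragments[0][1])
    alpha = {a: 1.0 / len(attrs) for a in attrs}               # uniform init
    beta = {a: {k: 0.5 for k in feat_names} for a in attrs}    # beta_{p,a,k}
    for _ in range(n_iters):
        # E-step: P(Y_i=a | C_i, F_i) ∝ P(C_i|a) * prod_k P(f_k|a;beta) * alpha_a
        post = []
        for content, fmt in fragments:
            scores = {}
            for a in attrs:
                s = p_content(content, a) * alpha[a]
                for k in feat_names:
                    s *= beta[a][k] if fmt[k] else (1.0 - beta[a][k])
                scores[a] = s
            z = sum(scores.values()) or eps
            post.append({a: scores[a] / z for a in attrs})
        # M-step: alpha from (7); beta as the expected fraction of fragments
        # of attribute a showing feature k (cf. (9) restricted to one page)
        n = len(fragments)
        alpha = {a: sum(q[a] for q in post) / n for a in attrs}
        for a in attrs:
            denom = sum(q[a] for q in post) or eps
            for k in feat_names:
                beta[a][k] = sum(q[a] * fragments[i][1][k]
                                 for i, q in enumerate(post)) / denom
            # avoid degenerate 0/1 probabilities
            beta[a] = {k: min(max(v, eps), 1 - eps) for k, v in beta[a].items()}
    return alpha, beta, post

frags = [("Java How to Program", {"bold": 1}), ("Deitel", {"bold": 0})]
alpha, beta, post = run_em(
    frags, ["title", "author"],
    lambda c, a: 0.8 if (a == "title") == (len(c.split()) > 2) else 0.2)
print(post)
```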
To initialize the EM algorithm, we have to first estimate
$P(Y = a | C, F; \alpha^{(0)}, \beta^{(0)})$ for all $a \in \mathcal{A}_{wrap} \cup \{\tilde{a}\}$ for each
useful text fragment. Recall that $C$ is largely unchanged for
the attributes in the Web pages collected from the same
domain. Therefore, we utilize the items previously
extracted or collected from the source Web site. A two-level
edit distance algorithm is invoked to compute an initial
approximation of this probability [31]. We define $item^k_a$ as
the $k$th item belonging to the attribute $a$ collected or
extracted from the source site, and $E(w_i, item^k_a)$ as the
distance between the $i$th useful text fragment with
character sequence $w_i$ and $item^k_a$. Then, we approximate
$P(Y_i = a | C_i, F_i; \alpha^{(0)}, \beta^{(0)})$ for the $i$th useful text fragment by

$$P(Y_i = a | C_i, F_i; \alpha^{(0)}, \beta^{(0)}) \approx \frac{1}{Z} \max_k E'(w_i, item^k_a), \quad (10)$$

where $E'(w_i, item^k_a) = 1 - E(w_i, item^k_a)$ and

$$Z = \sum_{a' \in \mathcal{A}_{wrap} \cup \{\tilde{a}\}} \max_k E'(w_i, item^k_{a'})$$

is a normalization factor.
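A minimal sketch of this initialization, substituting a plain normalized Levenshtein distance for the two-level edit distance of [31]:

```python
def lev(a, b):
    """Levenshtein distance normalized to [0, 1]."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return prev[n] / max(m, n, 1)

def init_posterior(w, items):
    """items: dict mapping attribute a -> list of items previously collected
    from the source site. Returns the approximation in (10): similarity to
    the closest item of each attribute, normalized over attributes."""
    sim = {a: max(1.0 - lev(w, item) for item in lst)
           for a, lst in items.items()}
    z = sum(sim.values())
    return {a: s / z for a, s in sim.items()}

items = {"title": ["Java How to Program"], "author": ["Harvey M. Deitel"]}
print(init_posterior("Java How to Program, 7/e", items))
```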
For $P(C_i | Y_i = a)$, similar to the formatting feature, we
express it with probabilities over a set of content feature
functions. We assume that these features are independent
of each other. Hence, $P(C_i | Y_i = a)$ can be expanded as
follows:

$$P(C_i | Y_i = a) = \prod_{k=1}^{Z_c} P(f^c_k(C_i) | Y_i = a), \quad (11)$$

where $Z_c$ is the number of content features and $f^c_k(C_i)$ is the
$k$th feature function of the content $C_i$. These feature
functions map the content $C_i$ to a real value. For example,
one of the feature functions may be the number of digits
contained.
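For instance, a hypothetical content feature-function set might look like the following (our illustration; the paper does not list its exact features):

```python
def content_features(text):
    """A few f^c_k-style feature functions mapping a fragment to real values."""
    tokens = text.split()
    return {
        "num_tokens": len(tokens),
        "num_digits": sum(ch.isdigit() for ch in text),
        "num_capitalized": sum(t[:1].isupper() for t in tokens),
        "has_currency": int("$" in text),
    }

print(content_features("Java How to Program"))   # 4 tokens, 3 capitalized
print(content_features("$31.50"))                # digits and a currency sign
```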
After obtaining the parameters, we can calculate the
probability $P(Y | C, F; \alpha, \beta)$ for each useful text fragment
and determine the attribute by the following formulas:

$$P(Y = a | C, F; \alpha, \beta) = \frac{P(C, F | Y = a; \beta) \, P(Y = a | \alpha)}{\sum_{a' \in \mathcal{A}} P(C, F | Y = a'; \beta) \, P(Y = a' | \alpha)}, \quad (12)$$

$$\hat{Y} = \arg\max_{a' \in \mathcal{A}} P(w, C, F | Y = a'; \beta) \, P(Y = a' | \alpha). \quad (13)$$

For each attribute, the top $K$ useful text fragments whose
probability of belonging to this attribute is higher than a
certain threshold $\tau$ will be selected as the training examples
for learning the new wrapper for the new unseen Web site.
Users could optionally scrutinize the discovered training
examples to improve the quality. However, in our experi-
ments, we did not conduct any manual intervention and
the adaptation was conducted in a fully automatic manner.
To learn the new wrapper for the new unseen site, we
employ the same wrapper learning method [19] used in the
source site.
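In code, the selection rule reduces to something like this sketch (the names $K$ and $\tau$ follow the text; the data layout is our assumption):

```python
def select_training_examples(posteriors, K=5, tau=1e-3):
    """posteriors: list of (fragment, {attribute: probability}).
    For each attribute, keep the top-K fragments whose posterior
    exceeds tau, to be used as training examples for the new wrapper."""
    selected = {}
    for a in posteriors[0][1]:
        ranked = sorted(posteriors, key=lambda fp: fp[1][a], reverse=True)
        selected[a] = [(frag, p[a]) for frag, p in ranked[:K] if p[a] > tau]
    return selected
```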
5 DESCRIPTION FOR NEW ATTRIBUTE DISCOVERY
MODEL
The overview of our new attribute discovery model is
described in Section 3.3. The goal is to discover previously
unseen attributes together with their semantic labels in the
unseen sites. This section presents the details of the model.
Our model considers $P(H | C, F, S)$, which states how likely
it is that the semantic label candidate denoted by $S$ is the
semantic label of a useful text fragment. We develop a
Bayesian learning model based on our generative model to
tackle this task.
According to the graphical model depicted in Fig. 3, we
can obtain the following conditional probability given a
particular set of $(\alpha, \beta, \gamma)$:

$$\begin{aligned} P(C, F, S, H, Y | \alpha, \beta, \gamma) &= P(C, F | Y; \beta) \, P(S | H) \, P(Y | \alpha) \, P(H | Y; \gamma) \\ &= P(C | Y) \, P(F | Y; \beta) \, P(S | H) \, P(H | Y; \gamma) \, P(Y | \alpha). \end{aligned} \quad (14)$$

By Bayes' theorem,

$$P(C, F | Y; \beta) \, P(Y | \alpha) = P(Y | C, F; \alpha, \beta) \, P(C, F).$$

Hence, we can get the following expression:

$$\begin{aligned} P(H | C, F, S; \alpha, \beta, \gamma) &= \frac{\sum_{a \in \mathcal{A}} P(Y = a | C, F; \alpha, \beta) \, P(S | H) \, P(H | Y = a; \gamma) \, P(C, F)}{\sum_{h \in \{0,1\}} \sum_{a \in \mathcal{A}} P(Y = a | C, F; \alpha, \beta) \, P(S | H = h) \, P(H = h | Y = a; \gamma) \, P(C, F)} \\ &\propto \sum_{a \in \mathcal{A}} P(Y = a | C, F; \alpha, \beta) \, P(S | H) \, P(H | Y = a; \gamma). \end{aligned} \quad (15)$$
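Evaluated directly, (15) is a normalized sum over attributes. A sketch with stand-in component distributions (all probabilities invented for illustration):

```python
def p_h_given_cfs(h, attrs, p_y_given_cf, p_s_given_h, p_h_given_y):
    """Numerator of (15) for H = h; normalizing over h in {0, 1} yields the
    posterior, since P(C, F) cancels in the normalization."""
    return sum(p_y_given_cf(a) * p_s_given_h(h) * p_h_given_y(h, a)
               for a in attrs)

attrs = ["title", "author"]
num = {h: p_h_given_cfs(h, attrs,
                        lambda a: 0.5,                     # P(Y=a | C, F)
                        lambda hh: 0.8 if hh else 0.2,     # P(S | H=hh)
                        lambda hh, a: 0.6 if hh else 0.4)  # P(H=hh | Y=a)
       for h in (0, 1)}
z = num[0] + num[1]
print({h: v / z for h, v in num.items()})
```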
Since $H$ and $Y$ are both unobservable, we derive the
following expected log-likelihood function:

$$L''_X(\alpha, \beta, \gamma) = \sum_{i=1}^{N} \sum_{h \in \{0,1\}} \sum_{a \in \mathcal{A}} \Big\{ P(Y_i = a | \alpha) \, P(H_i = h | Y_i = a; \gamma_{p(i)}) \log \Big[ P(C_i | Y_i = a) \, P(F_i | Y_i = a; \beta_{p(i)}) \, P(S_i | H_i = h) \Big] \Big\}. \quad (16)$$

In the wrapper adaptation process described in Section 4,
we have already determined the value of
$P(Y = a | w, C, F; \alpha, \beta)$ in (16) for the existing attributes
$a \in \mathcal{A}_{wrap}$. An EM algorithm is designed to estimate the
parameters in (16) without knowing the actual value of $H$.
Next, the new attributes and the associated semantic labels
can be discovered based on the estimated parameters. The
E-Step and M-Step at the $t$th iteration are described as
follows:
E-Step.

$$P(H_i | C_i, F_i, S_i; \alpha, \beta, \gamma^{(t)}) \propto \sum_{a \in \mathcal{A}} P(Y_i = a | C_i, F_i; \alpha, \beta) \, P(S_i | H_i) \, P(H_i | Y_i = a; \gamma^{(t)}_{p(i)}).$$

M-Step.

$$\gamma^{(t+1)} = \arg\max_{\gamma} L''_X(\alpha, \beta, \gamma \,|\, \gamma^{(t)}),$$

where $\alpha$ and $\beta$ are fixed at the values obtained in the
wrapper adaptation process as described in Section 4. Let $H$
be represented as a binomial distribution. $\gamma$ then refers to
the proportion of true pairs of useful text fragments
observed as $(C, F)$ and semantic labels denoted by $S$ among
all pairs of useful text fragments and semantic label
candidates. As a result, $\gamma^{(t+1)}_p$ for page $p$ can be computed
as follows:

$$\gamma^{(t+1)}_p = \frac{1}{B} \sum_{i=1}^{N} \sum_{k=1}^{|S_i|} \sum_{a \in \mathcal{A}} P(Y_i = a | C_i, F_i; \alpha, \beta) \, P(S_{i,k} | H_i) \, P(H_i | Y_i = a; \gamma^{(t)}_{p(i)}) \, \delta(p, i), \quad (17)$$

where $S_{i,k}$ refers to the $k$th semantic label candidate for the
$i$th useful text fragment; $|S_i|$ refers to the number of
semantic label candidates for the $i$th useful text fragment;
and $B$ refers to the total number of pairs of useful text
fragments and semantic label candidates. As a consequence,
we have the following revised EM algorithm:
E-Step.

$$P(H_i | C_i, F_i, S_i; \alpha, \beta, \gamma^{(t)}_{p(i)}) \propto \sum_{a \in \mathcal{A}} P(Y_i = a | C_i, F_i; \alpha, \beta) \, P(S_i | H_i) \, P(H_i | Y_i = a; \gamma^{(t)}_{p(i)}).$$

M-Step.

$$\gamma^{(t+1)}_p = \frac{1}{B} \sum_{i=1}^{N} \sum_{k=1}^{|S_i|} \sum_{a \in \mathcal{A}} P(Y_i = a | C_i, F_i; \alpha, \beta) \, P(S_{i,k} | H_i) \, P(H_i | Y_i = a; \gamma^{(t)}_{p(i)}) \, \delta(p, i).$$
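The following compact sketch implements this EM for a single page, collapsing $\gamma$ to a scalar prior over true (fragment, candidate) pairs; the feature model for $P(S|H)$ is our own invention:

```python
def semantic_label_em(pairs, p_s_given_h, n_iters=20):
    """pairs: list of candidate feature dicts S, one per (fragment, candidate)
    pair. Returns (gamma, responsibilities), where responsibilities[i]
    estimates P(H=1 | C, F, S) for pair i."""
    gamma = 0.5
    for _ in range(n_iters):
        # E-step: P(H=1 | S) ∝ P(S | H=1) * gamma, cf. (15) with the
        # attribute mixture collapsed into the scalar gamma
        resp = []
        for s in pairs:
            on = p_s_given_h(s, 1) * gamma
            off = p_s_given_h(s, 0) * (1.0 - gamma)
            resp.append(on / (on + off))
        # M-step, cf. (17): gamma is the expected fraction of true pairs
        gamma = sum(resp) / len(resp)
    return gamma, resp

def p_s_given_h(s, h):
    # true labels tend to sit in the same row, to the left, and end with ':'
    p = 1.0
    for feat in ("same_row", "left_of_value", "ends_with_colon"):
        theta = 0.8 if h == 1 else 0.3
        p *= theta if s[feat] else (1.0 - theta)
    return p

pairs = [{"same_row": 1, "left_of_value": 1, "ends_with_colon": 1},  # "ISBN:"
         {"same_row": 0, "left_of_value": 0, "ends_with_colon": 0}]  # random text
print(semantic_label_em(pairs, p_s_given_h))
```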
After the parameters are estimated, the semantic label
can be predicted by the following reasoning. Recall that
some useful text fragments are classified as the existing
attributes as described in Section 4. The remaining un-
classified useful text fragments may belong to some
previously unseen attributes. We assume that the terms
$P(Y = a' | C, F; \alpha, \beta)$ are equal for all $a' \in \mathcal{A}_{nonwrap}$ for the
unclassified useful text fragments. Then, we can derive the
following expression regarding those unclassified useful
text fragments in page $p$:

$$P(H | C, F, S; \alpha, \beta, \gamma_p) \propto P(H | Y = a'; \gamma_p) \, P(S | H). \quad (18)$$

Next, we make use of the text fragments classified in the
wrapper adaptation process to determine $P(H | Y = a'; \gamma_p)$
in (18). Since $H$ is binomially distributed, being either 0 or 1,
let $P(H = 1 | Y = a'; \gamma_p)$ be the proportion of classified
useful text fragments which are associated with a semantic
label in page $p$. Essentially, as we do not have information
about the unseen attributes $a' \in \mathcal{A}_{nonwrap}$, we determine the
value of $P(H = 1 | Y = a'; \gamma_p)$ based on the information
obtained from
the known attributes. By representing $S$ with a set of
features related to the semantic labels, $f^s_k(S)$, such as the
relative position of $S$ with respect to the useful text
fragment, the number of characters of $S$, etc., and assuming
that these features are independent, we can derive the
following expression:

$$P(H | C, F, S; \alpha, \beta, \gamma_p) \propto P(H | Y = a; \gamma_p) \prod_{k=1}^{Z_s} P(f^s_k(S) | H), \quad (19)$$

where $Z_s$ is the number of semantic label features.
Equation (19) can be used for estimating the probability
that the semantic label candidate $S$ is the semantic label
of the useful text fragment observed as $(C, F)$.
We define two types of semantic label candidates in our
method. The first type refers to those text fragments located
just before, after, above, or below the useful text fragments.
For example, "Pub. Date" is an example of a semantic label
candidate for the useful text fragment "April 2007"
discovered in Fig. 1. This type of semantic label candidate
can handle the layout format in which the semantic labels
are located around the attributes. The second type refers to
those text fragments located in the first row of a table in
Web pages. This type of semantic label candidate can
handle the layout format in which the records are organized
in a table and the first row of the table contains the column
headings. Usually, the semantic labels in Web pages
originating from the same Web site appear repeatedly.
Therefore, only those semantic label candidates which
appear more than once in Web pages are considered. The
semantic label candidates (denoted by $S$ in (19)) and the
unclassified useful text fragments (denoted by $(C, F)$ in (19))
are enumerated to form pairs. The conditional probability
$P(H | C, F, S; \alpha, \beta, \gamma)$ for each pair is computed to discover
the new attributes as well as the semantic labels.
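A sketch of enumerating both candidate types over a simplified grid rendering of a page (the cell layout and heuristics are our assumptions, not the paper's exact procedure):

```python
from collections import Counter

def semantic_label_candidates(pages_cells, useful_positions):
    """pages_cells: list of {(row, col): text}; useful_positions: list of
    (page_index, row, col) of useful text fragments."""
    cands = Counter()
    for pi, r, c in useful_positions:
        cells = pages_cells[pi]
        # type 1: text just before/after/above/below a useful fragment
        for dr, dc in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            if (r + dr, c + dc) in cells:
                cands[cells[(r + dr, c + dc)]] += 1
    for cells in pages_cells:
        # type 2: column headings in the first row of a table
        for (row, _), text in cells.items():
            if row == 0:
                cands[text] += 1
    # keep only candidates that repeat across the site
    return {text for text, n in cands.items() if n > 1}

cells = {(0, 0): "ISBN", (0, 1): "Our Price",
         (1, 0): "0321503619", (1, 1): "$31.50"}
print(semantic_label_candidates([cells, cells], [(0, 1, 0), (1, 1, 1)]))
```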
6 EXPERIMENTAL RESULTS
We conducted extensive experiments on more than 30 real-
world Web sites collected from three domains, namely, the
book domain, the electronics appliance domain, and the job
advertisement domain, to evaluate the performance of our
framework. Table 2 depicts the Web sites used in our
experiment. The Web sites labeled with T1, T2, and S1-S10
were collected from the book domain. T1 and T2 were used
for parameter tuning, while S1-S10 were used for testing.
Similarly, T3, T4, and S11-S20 were collected from the
electronics appliance domain; T5, T6, and S21-S30 were
collected from the job advertisement domain. In each site,
the Web pages collected were divided into training set and
testing set. The training set contains the Web pages from
which training examples for training a wrapper are drawn. The testing
set contains Web pages that the trained wrapper or the
adapted wrapper is applied on for evaluation. For example,
when adapting the wrapper from the source Web site S1 to
the target Web site S2, the source wrapper is trained from
the training examples manually annotated on the training
set of S1. Wrapper adaptation is conducted to automatically
identify training examples in the training set of S2. Next, the
identified training examples are used to train the wrapper
for S2.

TABLE 2
Web Sites Collected for Experiments
(rec. # and pp. # refer to the number of records and the number of
pages collected.)

The performance of the adapted wrapper is
evaluated by applying the adapted wrapper to the testing
set of the target site. In our experiments, one page from each
Web site is randomly selected and used to form the training
set, while the testing set contains the remaining Web pages.
In our framework, the parameters include $K$ and $\tau$, used
in selecting training examples for learning a new wrapper
for the new unseen site in wrapper adaptation, as described
in Section 4. The parameter tuning process was conducted
as follows: The Web sites T1, T2, T3, T4, T5, and T6 were
randomly chosen for parameter tuning. In each domain, we
used different sets of parameters for adapting the wrapper
from one of Web sites to another and vice versa. For
example, in the book domain, we first learned a wrapper by
making use of the training set of T1 and this learned
wrapper was adapted to the testing set of T2 using our
framework. The extraction performance of the adapted
wrapper in T2 was recorded. Next, we learned the wrapper
from the training set of T2 and adapted the learned wrapper
to the testing set of T1. The extraction performance of the
adapted wrapper in T1 was also recorded. Different sets of
parameters were used and the parameter set producing the
best average extraction performance was chosen for con-
ducting experiments on the testing data. Specifically, $K$ is
set to 5 and 10 for book title and author, respectively, in the
book domain; 5 and 10 for model number and description,
respectively, in the electronics appliance domain; and 5 for
both job title and company in the job advertisement domain.
Since the wrapper learning method we employ requires
only a few training examples for learning a wrapper, $\tau$ is set
to a very small value and the number of training examples
selected is mainly controlled by $K$.
6.1 Evaluation on Wrapper Adaptation
We conducted five sets of experiments in each domain to
evaluate our framework for wrapper adaptation. In the first
set of experiment, we provided training examples in each
Web site for learning a wrapper using our wrapper learning
approach [19]. After obtaining the learned wrapper for each
Web site, we simply applied the learned wrapper from one
particular Web site, known as the source site, without
wrapper adaptation to all remaining sites that are treated as
new unseen sites, for information extraction. For example, the
wrapper learned from S1 is directly applied to S2-S10 to
extract items. The second set of experiments is to make use of
an existing wrapper induction system called WIEN [16]
(available at http://www.kushmerick.org/nick/research/
wrappers/wien/) to
perform the same extraction task. These two sets of
experiments can be treated as baselines for our adaptation
approach. The third set of experiments, called ROAD-
RUNNER++, is a variant of ROADRUNNER (available at
http://www.dia.uniroma3.it/db/roadRunner/software.html).
The reason for
comparing our wrapper adaptation approach with ROAD-
RUNNER++ is that ROADRUNNER achieves a good data
extraction performance in Web pages [7]. In this set of
experiments, we first applied ROADRUNNER to new
unseen sites to extract items. ROADRUNNER is an unsu-
pervised data extraction approach and cannot label the items
extracted. ROADRUNNER++ conducts a postprocessing
phase that can label the items extracted by ROADRUNNER
as follows: For each item extracted by ROADRUNNER, the
edit distance between each of the previously extracted or
collected items from the source Web site is first calculated.
Each extracted item is then labeled with the label of the item
in the source site if the edit distance between them is the
smallest and below a certain threshold. In our experiments,
this threshold is set to 0.5 which leads to the best extraction
performance for ROADRUNNER++. The fourth set of
experiments is to adapt the learned wrapper from the source
Web site with our wrapper adaptation approach, but without
the EM procedure, to all new unseen sites. This approach
considers both content features and layout format features
and is initialized as described in Section 4. However, it does
not have the EM procedure to update the parameters. This set
of experiments can show the effect of our EM procedure. We
call this method our approach without EM. The fifth set of
experiments is to adapt the learned wrapper from the source
Web site with our full wrapper adaptation approach to all
new unseen sites. It takes around 15 minutes, on average, to
obtain the adapted wrapper for the target Web site using a
Sun Microsystems Sun Blade 2000/1000 with a 1200 MHz
CPU and 2 GB of memory. In all five sets of experiments,
the applied methods terminate when no more data can be extracted.
The extraction performance is evaluated by two com-
monly used metrics, namely, precision and recall. Precision
is defined as the number of items that the system
correctly identified divided by the total number of items it
extracts. Recall is defined as the number of items that
the system correctly identified divided by the total number
of actual items. F1-measure, which is defined as the
harmonic mean of recall and precision, is also used in
the evaluation.
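For reference, a minimal implementation of these metrics over sets of extracted items:

```python
def prf(extracted, actual):
    """Precision, recall, and F1 as defined above."""
    extracted, actual = set(extracted), set(actual)
    correct = len(extracted & actual)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(prf({"Java How to Program", "Noise"}, {"Java How to Program", "Cocoa"}))
# -> (0.5, 0.5, 0.5)
```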
Attributes of interest in the book domain are title and
author. In the first set of experiments, wrappers learned
from source Web sites cannot extract any items in new
unseen sites without adaptation in all cases because the
formats of the Web pages from different sites are different.
Extraction patterns learned from a particular Web site
cannot be applied to other sites to extract information.
Similar results are obtained in the second set of experiments
in which the wrapper learned by WIEN for a particular Web
site also cannot extract items in other Web sites. Table 3
shows the average extraction performance for the experi-
ments using ROADRUNNER++, our approach without EM,
and our wrapper adaptation approach. The first column
shows the Web sites treated as the source sites. Each row
summarizes results obtained by applying the wrapper
learned from the source Web site to all other sites for
extracting items.
The results indicate that after applying our full wrapper
adaptation approach, the wrapper learned from a particular
Web site can be adapted to other sites. Our wrapper
adaptation approach achieves better performance compared
with ROADRUNNER++. We have also conducted the
significance test (t-test) on the F1-measure to compare our
approach with ROADRUNNER++ as well as our approach
without EM, showing that the performance of our approach
is significantly better than both approaches. In particular, it
can be observed that ROADRUNNER++ obtains a similar
level of recall, but substantially less accurate extraction
performance. The reason is that ROADRUNNER extracts a
large amount of data from unseen sites and the extracted
data are labeled by only considering the edit distance. It does
not consider other useful information such as the layout
format of the new site resulting in poor precision.
In addition, we compare the extraction performance of
our wrapper adaptation approach with the approach
relying on manual training data collection typically used
in existing wrapper learning methods. Specifically, training
examples were obtained manually from the training set for
learning the wrapper in each Web site. The learned
wrapper was then applied to the same site to extract items.
The average precision, recall, and F1-measure for such
wrapper learning with manual training data are 97.6, 98.0,
and 97.8 percent, respectively, for title, and 94.2, 88.8, and
91.4 percent, respectively, for author. Therefore, our
wrapper adaptation can achieve an extraction performance
comparable to manual training data collection. Favorably,
our approach greatly reduces the human work in preparing
training examples.
We also conducted experiments in the electronics
appliance and job advertisement domains similar to the
ones carried out in the book domain. Attributes of interest
are model number and description in the electronics appliance
domain; title and company in the job advertisement domain.
Tables 4a and 4b show the extraction performance of our
wrapper adaptation approach in the electronics appliance
and job advertisement domains, respectively. Compared
with ROADRUNNER++, the extraction performance of our
approach is very promising. The attribute description from
different Web sites varies substantially. For example, the
description collected from S11 consists of an average of
90 tokens, while the average number of tokens for the
description collected from S17 is around 12. This imposes
additional difficulty in the adaptation task. Such difference
also diminishes the usefulness of edit distance and leads to
poor performance in ROADRUNNER++. However, our
approach can make use of additional information such as
the layout format of Web pages to improve the quality of
the extracted data. The performance of our approach
without EM is nearly the same as that of our approach in
extracting the attribute company in the job domain. The
major reason is that the names of the companies vary
greatly and it is seldom that a company has job advertise-
ments in different Web sites collected. As a result, the text
fragments related to this attribute vary greatly in different
sites. This poses difficulties in adapting the wrapper,
reducing the effectiveness of the EM procedure. To
conclude, our wrapper adaptation approach can signifi-
cantly reduce the human effort in constructing wrappers for
new unseen sites, at the same time, achieve a very
satisfactory performance.
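A simple bound (our own observation, not part of the experiments) makes the difficulty with description concrete: turning an n-token sequence into an m-token sequence requires at least m - n edit operations, so for the average lengths reported above, the normalized token-level edit distance between an S17 description and an S11 description is at least
\[
\frac{m-n}{m} = \frac{90-12}{90} \approx 0.87,
\]
meaning two descriptions of the same kind of item look almost maximally dissimilar under edit distance alone, whatever content they share.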
TABLE 3
Average Extraction Performance on Title and Author for the Book Domain in
Experiments When Training Examples of One Particular Web Site are Provided
P, R, and F refer to precision, recall, and F1-measure (in percentages), respectively. Ave. and t-st. refer to the average performance and the
t-statistic value in the significance test, respectively.
TABLE 4
Average Extraction Performance (F1-Measure) in Experiments
of Wrapper Adaptation When Training Examples of
One Particular Web Site are Provided: (a) Electronics
Appliance Domain and (b) Job Advertisement Domain
ROAD., No EM, and Ours refer to ROADRUNNER++, Our Approach
without EM, and Our Approach, respectively.
6.2 Evaluation on New Attribute Discovery
Each Web site shown in Table 2 contains some other
attributes apart from those specified in the wrapper learned
from the source Web site. For example, S1 contains
attributes such as publisher, publishing date, etc. We
conducted experiments in all three domains to evaluate
our new attribute discovery approach for discovering those
previously unseen attributes. In each domain, we first adapt
the wrapper learned from one Web site to the remaining sites
using our wrapper adaptation approach. Next, our new
attribute discovery approach is applied to each new unseen
site to discover new attributes. For example, the wrapper
learned from S1 is first adapted to S2-S10. Next, our new
attribute discovery approach is applied to discover new
attributes contained in S2-S10. The F1-measure (NewF) is used
to evaluate the performance against the ground truth, which
consists of all new attributes in the Web sites. Table 5 shows
the results of our experiments. Each row depicts the average
performance in discovering new attributes from the Web
site shown in the first subcolumn when different source
Web sites are provided. For example, the first row of the
book domain shows the average performance for discovering
new attributes in the nine runs of experiments in which
wrappers learned from other Web sites are adapted to S1.
The results show that our approach can effectively identify
new attributes in the unseen site. The results in the job
advertisement domain are not as satisfactory as those in the
other domains. The major reason is that some text fragments
that do not belong to any attribute are very similar to the
text fragments related to the attributes of records. For
example, job advertisement Web sites commonly have a menu
containing hyperlinks for different locations and the post
dates of jobs. These hyperlink text fragments are very
similar to the text fragments related to the new attributes
of the job records. As a result, these text fragments are
also extracted, degrading the precision.
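To make the metric concrete, NewF can be computed as set-level precision, recall, and F1 over attribute names. The sketch below, with invented attribute sets, also illustrates how spurious menu fragments depress precision.

```python
def new_attribute_f1(discovered, ground_truth):
    """F1-measure (NewF) of discovered new attributes against the ground
    truth, with both sides represented as sets of attribute names."""
    if not discovered or not ground_truth:
        return 0.0
    tp = len(discovered & ground_truth)  # correctly discovered attributes
    precision = tp / len(discovered)
    recall = tp / len(ground_truth)
    if tp == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical job-site example: two genuine new attributes are found,
# but two menu hyperlinks are mistaken for attributes.
truth = {"location", "post date", "salary"}
found = {"location", "post date", "menu link 1", "menu link 2"}
print(round(new_attribute_f1(found, truth), 3))  # 0.571
```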
Table 6 depicts the new or previously unseen attributes
that are discovered in the Web sites in the book domain. For
example, the extracted data from S1 contain publishing date,
reading level, online price, and you save, which are regarded
as new attributes, with their semantic labels correctly
identified. In most cases, our approach can correctly identify
semantic labels for new attributes from Web pages. In some
situations, attributes, such as publisher in S1, are not
associated with any semantic label in Web sites. However, our
approach can still discover such attributes from Web pages.
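The semantic-label identification just described can be pictured with a far cruder heuristic than our Bayesian model: treat a fragment ending in a colon as the label of the fragment that follows it. The sketch below is only for intuition (the page fragments are invented), but it shows both why labels such as online price are recoverable from the page and why unlabeled attributes such as publisher require the content clues captured by the generative model.

```python
import re

def pair_labels_with_values(fragments):
    """Pair each fragment that looks like a semantic label (letters ending
    with ':') with the fragment immediately following it in page order."""
    pairs = []
    for label, value in zip(fragments, fragments[1:]):
        if re.fullmatch(r"[A-Za-z ]+:", label.strip()):
            pairs.append((label.strip().rstrip(":"), value.strip()))
    return pairs

# Invented text fragments from a book page, in document order; note that
# the publisher value carries no label at all.
frags = ["Online Price:", "$53.99", "You Save:", "$13.50",
         "Prentice Hall", "Reading Level:", "Professional"]
print(pair_labels_with_values(frags))
# [('Online Price', '$53.99'), ('You Save', '$13.50'),
#  ('Reading Level', 'Professional')]
```

A heuristic like this misses publisher entirely; the generative model, by contrast, also exploits the site-independent content of previously extracted items, which is how such unlabeled attributes are still discovered.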
7 RELATED WORK
A large body of research aiming at extracting information
from various types of documents, such as free texts and
semistructured documents, has been conducted [3], [12],
[17], [27]. One major limitation of most existing Web
information extraction methods is that they employ
supervised learning approaches requiring manually prepared
training examples for each Web site. Moreover, the
wrapper learned from a particular Web site cannot be
applied to other sites for extracting information.
Some methods attempt to alleviate this limitation by
investigating wrapper adaptation. Cohen and Fan designed
a method for learning page-independent heuristics for
extracting items from Web pages [5]. However, one major
disadvantage of this method is that training examples from
several Web sites must be collected to learn such heuristic
rules. Golgher and da Silva [13] proposed to solve the
wrapper adaptation problem by applying a bootstrapping
technique and a query-like approach. This approach
assumes that the seed words, which refer to the elements
in the source repository in their framework, must appear in
the unseen Web page. However, exact matching of items
from different Web sites is generally not feasible. ADEL is
able to extract records from Web sites and semantically
label the attributes in new unseen sites [18]. It applies a
heuristic method to induce a grammar for the DOM trees of
the pages to accomplish the extraction task. The extracted
data are organized in a table format. Each column of the
table is labeled by matching the entries in the column against
the patterns learned in the source site. One limitation of
ADEL is that only a single attribute is determined for the
entire column, which may consist of inconsistent or
incorrectly extracted data. Consequently, these incorrectly
extracted entries will be assigned a wrong attribute label.
Several techniques such as bootstrapping [25] and active
learning [15], [14] have been developed for reducing the
human effort in preparing training examples. However, a
TABLE 5
Experimental Results for New Attribute Discovery in Book,
Electronics Appliance, and Job Advertisement Domains
NewF refers to the F1-measure for discovering all new attributes in Web
sites (in percentages). Ave. denotes the average performance.
TABLE 6
New Attributes Discovered and Not Discovered Using Our New
Attribute Discovery Approach in the Book Domain
Attributes marked in the table are not associated with any
semantic label in the Web site.
substantial amount of human work is still required when
learning the wrappers for multiple Web sites.
Another common shortcoming of all of the above
existing methods is that the attributes to be extracted by
the adapted wrapper are prespecified. They cannot discover
previously unseen attributes in new unseen sites. Recently,
unsupervised information extraction methods such as
ROADRUNNER [6] and MDR [20] have been developed.
However, they suffer from a major drawback: they cannot
differentiate the type and the meaning of the extracted
information. Hence, the extracted items require human effort
to interpret their meaning. KNOWITALL [10] and TEXTRUNNER [2]
are two systems aiming at extracting relations between
entities from documents in an unsupervised manner. One common
limitation of these systems is that they cannot be applied to
extract attributes and attribute values associated with a
particular record in a Web page. For instance, they cannot be
applied to extract the selling price of a book from a Web
page, since the same book may be sold at different prices on
different Web sites.
Our new attribute discovery approach can also identify the
semantic labels of attributes. Probst et al. developed a
system aiming at extracting attribute-value pairs from
product descriptions contained in Web documents [23].
One limitation of this approach is that the input of the
system is mainly short phrases or sentences. As a result, a
preprocessing step is needed to extract the description
phrases from Web documents.
8 CONCLUSIONS
We have developed a framework for adapting information
extraction wrappers with new attribute discovery via
Bayesian learning. Our approach can automatically adapt
the information extraction patterns previously learned in a
source Web site to new unseen sites and, at the same time,
discover new attributes together with their semantic labels. A
generative model for the generation of text fragments
related to the attributes and layout format in Web pages
is designed to harness the uncertainty. Bayesian learning
and EM techniques are employed in our framework for
tackling the wrapper adaptation and new attribute discovery
tasks. Extensive experiments on more than 30 real-world Web
sites in three different domains were conducted,
and the results demonstrate that our framework achieves a
very promising performance.
ACKNOWLEDGMENTS
The work described in this paper is substantially supported
by grants from the Research Grants Council of the Hong
Kong Special Administrative Region, China (Project No:
CUHK4128/07) and the Direct Grant of the Faculty of
Engineering, CUHK (Project Codes: 2050391 and 2050442).
This work is also affiliated with the Microsoft-CUHK Joint
Laboratory for Human-Centric Computing and Interface
Technologies.
REFERENCES
[1] A. Arnold, R. Nallapati, and W. Cohen, "Exploiting Feature Hierarchy for Transfer Learning in Named Entity Recognition," Proc. 46th Ann. Meeting of the Assoc. for Computational Linguistics: Human Language Technologies (ACL-HLT), pp. 245-253, 2008.
[2] M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, "Open Information Extraction from the Web," Proc. 20th Int'l Joint Conf. Artificial Intelligence (IJCAI), pp. 2670-2676, 2007.
[3] D. Blei, J. Bagnell, and A. McCallum, "Learning with Scope, with Application to Information Extraction and Classification," Proc. 18th Conf. Uncertainty in Artificial Intelligence (UAI), pp. 53-60, 2002.
[4] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet Allocation," J. Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[5] W. Cohen and W. Fan, "Learning Page-Independent Heuristics for Extracting Data from Web Pages," Computer Networks, vol. 31, nos. 11-16, pp. 1641-1652, 1999.
[6] V. Crescenzi and G. Mecca, "Automatic Information Extraction from Large Websites," J. ACM, vol. 51, no. 5, pp. 731-779, 2004.
[7] V. Crescenzi, G. Mecca, and P. Merialdo, "ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites," Proc. 27th Very Large Databases Conf. (VLDB), pp. 109-118, 2001.
[8] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu, "Boosting for Transfer Learning," Proc. 24th Int'l Conf. Machine Learning (ICML), pp. 193-200, 2007.
[9] H. Daumé III and D. Marcu, "Domain Adaptation for Statistical Classifiers," J. Artificial Intelligence Research, vol. 26, pp. 101-126, 2006.
[10] O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates, "Unsupervised Named-Entity Extraction from the Web: An Experimental Study," Artificial Intelligence, vol. 165, pp. 91-134, 2005.
[11] O. Etzioni, C. Knoblock, R. Tuchinda, and A. Yates, "To Buy or Not to Buy: Mining Airfare Data to Minimize Ticket Purchase Price," Proc. Ninth ACM SIGKDD, pp. 119-128, 2003.
[12] D. Freitag and A. McCallum, "Information Extraction with HMMs and Shrinkage," Proc. AAAI-99 Workshop Machine Learning for Information Extraction, pp. 31-36, 1999.
[13] P. Golgher and A. da Silva, "Bootstrapping for Example-Based Data Extraction," Proc. 10th ACM Int'l Conf. Information and Knowledge Management (CIKM), pp. 371-378, 2001.
[14] U. Irmak and T. Suel, "Interactive Wrapper Generation with Minimal User Effort," Proc. 15th Int'l World Wide Web Conf. (WWW), pp. 553-563, 2006.
[15] T. Kristjansson, A. Culotta, P. Viola, and A. McCallum, "Interactive Information Extraction with Constrained Conditional Random Fields," Proc. 19th Nat'l Conf. Artificial Intelligence (AAAI), pp. 412-418, 2004.
[16] N. Kushmerick, "Wrapper Induction: Efficiency and Expressiveness," Artificial Intelligence, vol. 118, nos. 1/2, pp. 15-68, 2000.
[17] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," Proc. 18th Int'l Conf. Machine Learning (ICML), pp. 282-289, 2001.
[18] K. Lerman, C. Gazen, S. Minton, and C. Knoblock, "Populating the Semantic Web," Proc. AAAI Workshop Advances in Text Extraction and Mining, 2004.
[19] W.Y. Lin and W. Lam, "Learning to Extract Hierarchical Information from Semi-Structured Documents," Proc. Ninth Int'l Conf. Information and Knowledge Management (CIKM), pp. 250-257, 2000.
[20] B. Liu, R. Grossman, and Y. Zhai, "Mining Data Records in Web Pages," Proc. Ninth ACM SIGKDD, pp. 601-606, 2003.
[21] G.J. McLachlan and T. Krishnan, The EM Algorithm and Extensions. John Wiley & Sons, Inc., 1997.
[22] M. Michelson and C. Knoblock, "Semantic Annotation of Unstructured and Ungrammatical Text," Proc. 19th Int'l Joint Conf. Artificial Intelligence (IJCAI), pp. 1092-1098, 2005.
[23] K. Probst, R. Ghani, M. Krema, and A. Fano, "Semi-Supervised Learning of Attribute-Value Pairs from Product Descriptions," Proc. 20th Int'l Joint Conf. Artificial Intelligence (IJCAI), pp. 2838-2843, 2007.
[24] S. Ray and M. Craven, "Representing Sentence Structure in Hidden Markov Models for Information Extraction," Proc. 17th Int'l Joint Conf. Artificial Intelligence (IJCAI), pp. 1273-1279, 2001.
[25] E. Riloff and R. Jones, "Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping," Proc. 16th Nat'l Conf. Artificial Intelligence (AAAI), pp. 1044-1049, 1999.
[26] G. Sigletos, G. Paliouras, C. Spyropoulos, and M. Hatzopoulos, "Combining Information Extraction Systems Using Voting and Stacked Generalization," J. Machine Learning Research, vol. 6, pp. 1751-1782, 2005.
[27] J. Turmo, A. Ageno, and N. Catala, "Adaptive Information Extraction," ACM Computing Surveys, vol. 38, no. 2, article no. 4, 2006.
[28] P. Viola and M. Narasimhan, "Learning to Extract Information from Semi-Structured Text Using a Discriminative Context Free Grammar," Proc. 43rd Ann. Meeting of the Assoc. for Computational Linguistics, pp. 371-378, 2005.
[29] T.L. Wong and W. Lam, "Adapting Information Extraction Knowledge for Unseen Web Sites," Proc. IEEE Int'l Conf. Data Mining (ICDM), pp. 506-513, 2002.
[30] T.L. Wong and W. Lam, "A Probabilistic Approach for Adapting Information Extraction Wrappers and Discovering New Attributes," Proc. IEEE Int'l Conf. Data Mining (ICDM), pp. 257-264, 2004.
[31] T.L. Wong and W. Lam, "Text Mining from Site Invariant and Dependent Features for Information Extraction Knowledge Adaptation," Proc. SIAM Int'l Conf. Data Mining (SDM), pp. 45-56, 2004.
[32] T.L. Wong and W. Lam, "Adapting Web Information Extraction Knowledge via Mining Site Invariant and Site Dependent Features," ACM Trans. Internet Technology, vol. 7, no. 1, article no. 6, 2007.
[33] J. Yang, H. Seo, and J. Choi, "MORPHEUS: A More Scalable Comparison-Shopping Agent," Proc. Fifth Int'l Conf. Autonomous Agents (AGENTS), pp. 63-64, 2001.
[34] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and H.-W. Hon, "Webpage Understanding: An Integrated Approach," Proc. 13th ACM SIGKDD, pp. 903-912, 2007.
Tak-Lam Wong received the BEng, MPhil, and PhD degrees from The Chinese University of Hong Kong in 2001, 2003, and 2006, respectively. He is currently with The Chinese University of Hong Kong. His research interests lie in the areas of Web mining, data mining, information extraction, machine learning, and knowledge management.
Wai Lam received the BSc and MPhil degrees from The Chinese University of Hong Kong, and the PhD degree in computer science from the University of Waterloo. After completing his PhD degree, he conducted research at Indiana University, Purdue University Indianapolis (IUPUI), and the University of Iowa. He joined The Chinese University of Hong Kong, where he is currently a professor. His research interests include text mining, intelligent information retrieval, digital library, machine learning, and knowledge-based systems. He has published articles in the IEEE Transactions on Pattern Analysis and Machine Intelligence, the IEEE Transactions on Knowledge and Data Engineering, the ACM Transactions on Information Systems, etc. His research projects have been funded by the Hong Kong Government RGC Earmarked Grant and the US Defense Advanced Research Projects Agency (DARPA). He has also managed industrial projects funded by the Innovation and Technology Fund (industrial grant) and IT companies. He is a senior member of the IEEE.