
Electronic Commerce Research and Applications 12 (2013) 289–296


Concept extraction and e-commerce applications


Yongzheng Zhang *, Rajyashree Mukherjee, Benny Soetarman
eBay Inc., 2065 Hamilton Ave., San Jose, CA 95125, USA

Article history:
Received 11 November 2012
Received in revised form 3 March 2013
Accepted 18 March 2013
Available online 11 April 2013

Keywords:
Concept extraction
Automatic keyphrase extraction
e-Commerce
Product matching
Topic-based opinion mining

Abstract

Concept extraction is the technique of mining the most important topic of a document. In the e-commerce context, concept extraction can be used to identify what a shopping related Web page is talking about. This is practically useful in applications like search relevance and product matching. In this paper, we investigate two concept extraction methods: Automatic Concept Extractor (ACE) and Automatic Keyphrase Extraction (KEA). ACE is an unsupervised method that looks at both text and HTML tags. We upgrade ACE into Improved Concept Extractor (ICE) with significant improvements. KEA is a supervised learning system. We evaluate the methods by comparing automatically generated concepts to a gold standard. The experimental results demonstrate that ICE significantly outperforms ACE and also outperforms KEA in concept extraction. To demonstrate the practical use of concept extraction in the e-commerce context, we use ICE and KEA to showcase two e-commerce applications, i.e. product matching and topic-based opinion mining.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Online shopping has never been more popular than it is today. Many factors contribute to the prominence of online shopping. Customers can shop online in minutes, which is much more efficient than conventional shopping in local stores. There are often good deals online with additional benefits such as free shipping and no tax. Moreover, online shopping is becoming more secure as more security policies are being enforced by many online marketplaces.

Generally speaking, online shopping starts with research, and shopping research starts with search. Online shoppers often use mainstream search engines such as Google to find products in the search results and sponsored advertisements. More often than not, returning buyers bookmark their favorite shopping sites such as eBay [1] and BestBuy [2] and shop there directly.

In order to provide a more streamlined user experience in shopping related research, it is critical for e-commerce sites to design a platform of technologies that delivers structured content and highly relevant results to the user. The ideal platform includes search, organization, and summarization of any "shopping-related" [3] pages on the World Wide Web. One essential component is to identify what a shopping-related page is talking about. This is the motivation behind the concept extraction work in the e-commerce context.

Note that there is a subtle distinction between "talking about" and "mentioning". We are more interested in finding out what a web page is "mainly about" instead of "also mentioning". By "mainly about" we mean the most essential topic of a Web page. For example, consider a review page about the Dell Latitude D420 [4]. The review also compares this model against other laptops such as the IBM ThinkPad X41. However, the main topic of this page is "Dell Latitude D420 review". Other laptops are "mentioned" but they are not the main focus. This is how we define the concept of a shopping-related Web page. We aim to identify such most essential concepts using concept extraction techniques.

The technique of extracting concepts from Web pages is derived from key phrase extraction on traditional text documents in the information retrieval and text mining fields. A phrase can be a single word or a multi-word term, which is also known as an n-gram. Concept extraction is very useful in many text-related applications, such as search, classification, clustering, and visualization.

In this paper, we investigate two existing key phrase extraction tools, Automatic Concept Extractor (ACE) (Ramirez and Mattmann 2004) and Automatic Keyphrase Extraction (KEA) (Witten et al. 1999), in the concept extraction task.
* Corresponding author.
E-mail addresses: ytzhang@ebay.com (Y. Zhang), rmukherjee@ebay.com (R. Mukherjee), bsoetarman@ebay.com (B. Soetarman).
[1] http://www.ebay.com.
[2] http://www.bestbuy.com.
[3] A shopping-related Web page is one that contains information about deals, product specifications, professional and consumer reviews, FAQs, etc.
[4] http://www.notebookreview.com/default.asp?newsID=3101, last viewed on March 9, 2012.

http://dx.doi.org/10.1016/j.elerap.2013.03.008

ACE is an unsupervised method that looks at both text and HTML tags. The Term Frequency (TF) scorer calculates the popularity of a term in the text body, and the HTML scorer assigns significance weights to a pre-defined set of HTML tags, so that terms within a pair of tags receive a significance weight. Finally, the Concept Miner combines the above two parts using given weights to rank candidate concepts.

KEA is a supervised learning system. It first builds a Naive Bayes model from training documents where concepts are manually assigned. Two features are used in training the model: Term Frequency-Inverse Document Frequency (TFIDF) and first appearance, which is the normalized distance, in number of words, to the beginning of the document. The trained model is then used to find concepts in new documents.

Our focus in this paper is to perform various improvements to the basic ACE to obtain the Improved Concept Extractor (ICE). We use ACE and KEA for benchmarking in our experiments. In order to evaluate the three systems, we create a collection of 100 Web pages from leading brand sites such as Dell, HP, and Canon. We create a gold standard by manually assigning concepts to each page in the collection. We tune different parameters of ACE, ICE and KEA to generate concepts, and we use precision, recall and F1 to evaluate the concepts. The experimental results indicate that ICE performs significantly better than ACE and better than KEA in concept extraction.

To further demonstrate the practical use of concept extraction in the e-commerce context, we use ICE and KEA as examples to discuss two applications: product matching and topic-based opinion mining. In product matching, we use concept extraction methods to identify the most important topic of a product review Web page and decide whether the top concept matches the manually verified product. This can be very useful for search indexing, relevance ranking, advertising, and many other tasks. In topic-based opinion mining, we find the topics of a collection of documents for which opinions on each topic can be mined.

The main contributions of our paper are twofold: (1) we provide an extensive analysis of concept extraction tools, and specifically upgrade an existing tool with key improvements and demonstrate its performance by running experiments; (2) we showcase two key applications of concept extraction in the e-commerce field. To our knowledge this is the first study that analyzes how concept extraction methods can be used in e-commerce applications.

The rest of the paper is organized as follows. Section 2 reviews concept extraction tools in the literature and Section 3 explains in detail how different concept extraction systems work. In Section 4, we describe the experiments and present the evaluation results. Then we discuss two practical e-commerce applications of concept extraction in Section 5. Finally, Section 6 concludes our work and describes future research directions.

2. Related work

Concept extraction is derived from key phrase extraction in the information retrieval and text mining fields. Key phrase extraction methods often analyze a document to determine the significance of a phrase, which can be a single word or a multi-word term. The significance of a phrase is measured by modelling statistical features such as frequency of occurrence and linguistic features such as part-of-speech. The phrases above a certain threshold are often referred to as key phrases or concepts.

Many key phrase extraction methods have been proposed in the literature. TFIDF is a popular method that is widely used in the information retrieval and machine learning areas. The intuition behind this method is that phrases which appear frequently in one document but rarely in the whole document collection often have high discrimination power between documents. TFIDF requires a collection of documents to compute the significance score. Thus, it has been widely used with the bag-of-words representation in text-based applications, such as document categorization and document clustering.

Krulwich and Burkey (1996) use a pre-defined set of heuristic rules to extract key phrases from a document. The rules are based on lexical features such as the use of acronyms and visual clues such as the use of italics. The extracted key phrases are used as features in an automatic document classification task. Turney (2000) proposes GenEx, a key phrase extraction system that is based on rule learning using a genetic program. More specifically, the system consists of a pre-defined set of parameterized heuristic rules that are tuned to the training documents by the genetic program. The learned optimized rules are then applied to new documents to extract key phrases. However, these two methods depend heavily on pre-defining and tuning heuristic rules, which is very expensive when moving to new applications.

Ramirez and Mattmann (2004) propose a system, ACE, which is specifically designed for concept extraction from Web pages. ACE analyzes both the text body of a page and visual clues in various HTML tags to extract concepts from a single Web page. In Section 3, we will discuss why we choose ACE as a benchmarking method and perform key improvements for the concept extraction task.

Witten et al. (1999) develop a key phrase extraction system called Automatic Keyphrase Extraction (KEA). KEA builds a Naïve Bayes learning model using training documents with known key phrases, and then uses the model to find key phrases in new documents. Song et al. (2003) introduce a method which uses the information gain measure to rank candidate key phrases based on the TFIDF and distance features, which were first proposed in KEA (Witten et al. 1999).

The above methods are designed for key phrase extraction from individual documents. Frantzi et al. (2000) propose a method named C-value/NC-value (CNC), which combines linguistic and statistical analysis to extract key phrases automatically. It is designed for key phrase extraction from an entire document collection.

More recently, Parameswaran et al. (2010) propose a system that extracts concepts from a large dataset such as user tags (e.g. del.icio.us) or query logs of search engines (e.g. AOL). The system uses techniques similar to association rule mining in market basket analysis and aims towards building a web of concepts. Features such as frequency of occurrence and popularity among users are used to extract core concepts, while sub-concepts and super-concepts of the core concepts are pruned. The authors claim that the system can be applied to any large data set. However, if Web pages are used, a lot of additional processing is needed to identify the popular concepts.

Table 1 summarizes the concept extraction methods by the features they have.

The first feature, supervised learning, is the process of training a model with examples and then applying the learnt model to new documents. It means that there is a human effort to provide training examples. In the concept extraction context, a supervised method analyzes training documents with manual concepts to find rules and patterns, which can then be used to find concepts in new documents. For example, do the manual concepts have something in common, such as often appearing in the title or headings, or appearing more frequently than phrases that are not concepts? The second feature is whether the concept extraction tool requires a document collection. Requiring a document collection means that there is a crawling effort to collect training documents, which is often expensive. The third feature is whether the tool works on raw text only, on visual clues found in HTML tags, or both.

Table 1
Summary of concept extraction methods by various features.

Method   Supervised?   Collection?   Text/HTML
TFIDF    No            Yes           Text
GenEx    Yes           Yes           Text
ACE      No            No            Both
KEA      Yes           Yes           Text
CNC      No            Yes           Text

Concept extraction has many practical applications, including Web search (e.g. Parameswaran et al. 2010), education (e.g. Villalon and Calvo 2009), and the digital economy (e.g. Piao et al. 2010), to name a few. In this work, we focus on e-commerce applications. We will explain why we choose ACE and KEA and describe how they work in the next section.

3. Concept extraction

In this section, we discuss how ACE and KEA extract concepts from Web pages. These two methods have been extensively studied in the related work and we use them for benchmarking in our experiments. We will focus on how to improve ACE to obtain ICE.

3.1. ACE

Ramirez and Mattmann (2004) propose the Automatic Concept Extractor. We choose ACE because: (1) it is specifically designed for extracting concepts from HTML pages; (2) it is an unsupervised method, which means no training is needed; (3) it does not require a document corpus, and it is open source.

For a given Web page, ACE finds the concepts in the following steps.

3.1.1. Tokenization
For a given web page, ACE first tokenizes the text body of the page using blank spaces. This is the simplest tokenization policy; we will discuss more sophisticated tokenization in ICE.

3.1.2. TF scorer
The TF scorer measures the popularity of a token in a given Web page, normalized by the maximum term frequency in the document, as shown in Eq. (1):

S_TF(t_i) = f(t_i) / max_{j=1..n} f(t_j)    (1)

where f(t_i) is the frequency of occurrence of token i in the target Web page.
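To make Eq. (1) concrete, here is a minimal sketch of the TF scorer in Python; the function name and data layout are our own illustration, not ACE's actual code.

```python
from collections import Counter

def tf_scores(tokens):
    """Score each distinct token by its frequency, normalized by the
    maximum term frequency in the document, as in Eq. (1)."""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {token: freq / max_freq for token, freq in counts.items()}

# The most frequent token always gets a score of 1.0.
print(tf_scores(["dell", "laptop", "battery", "dell", "laptop", "dell"]))
# {'dell': 1.0, 'laptop': 0.666..., 'battery': 0.333...}
```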
tag boundaries. For example, given the following text: <a href
3.1.3. HTML scorer = ‘‘http://buy.ebay.com/cell-phone’’ > cell phone < /a> <a href
The HTML scorer measures the importance of a token within a = ‘‘http://www.ebay.com/’’ >Home < /a > , ACE will extract
pair of HTML tags. All the HTML tags are assigned a pre-defined ‘‘cell phone home’’ as a candidate concept, which is not true.
weight. For example, text in the title and headings has more signif- In ICE, we respect tag boundaries, i.e. we do not combine tokens
icance than text in a regular paragraph. Examples of HTML tags and from different HTML tags, so the candidate concepts become
corresponding significance weights are summarized in Table 2. ‘‘cell phone’’, and ‘‘home’’. Similarly, we detect sentence
boundaries so tokens across two neighboring sentences are
3.1.4. Concept miner not taken as candidates.
The concept miner is a combined scorer to estimate the concept 3. The third improvement is about how to score derived con-
likelihood of each token. It assigns the weighted sum of the TF cepts. The basic ACE uses only the TF scorer to score derived
score and the HTML score to the tokens. Let STF and SHTML be the concepts and normalize to the highest TF score of a derived
TF score and the HTML score, respectively. Also let S be the com- concept. For instance, if the parent concept is A with a score
bined score, which is calculated by S = k  STF + (1  k)  SHTML. Note of 0.5 and it has three derived concepts: AB, DA, and ABC (A,
that k is a number between 0 and 1 inclusive, representing the rel- B, C, D are different single words), where DA happens to have
ative importance of the TF score versus the HTML score. Tokens the highest TF out of the three, then DA would get a score of
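The combined scorer can be sketched as follows, under our reading of the description above; the tag weights are a subset of Table 2, and the names are illustrative rather than ACE's implementation.

```python
# S = lam * S_TF + (1 - lam) * S_HTML, with a qualifying threshold.
TAG_WEIGHTS = {"title": 1.0, "h1": 1.0, "h2": 0.85, "b": 0.8, "em": 0.3}

def qualifying_tokens(tf, html, lam=0.5, threshold=0.5):
    """Combine the TF and HTML scores per token and keep the tokens
    whose combined score reaches the threshold."""
    combined = {t: lam * tf[t] + (1 - lam) * html.get(t, 0.0) for t in tf}
    return {t: s for t, s in combined.items() if s >= threshold}

tf = {"laptop": 1.0, "review": 0.4}
html = {"laptop": TAG_WEIGHTS["h1"], "review": TAG_WEIGHTS["em"]}
print(qualifying_tokens(tf, html))  # {'laptop': 1.0}
```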

3.1.5. Concept derivation
ACE also derives n-grams on both sides of each qualifying token. For example, suppose there is a qualifying token "laptop" appearing in a piece of text "dell laptop battery". ACE derives n-grams on both sides of the token "laptop". When n = 2 is used, two derived n-grams, "dell laptop" and "laptop battery", are found. In ACE, all derived concepts are scored using only the TF scorer. Thus, derived concepts that are not frequent enough will be dropped. In our improved ICE, we use the combined scorer instead. Hence, derived concepts that are not frequent enough but have a high HTML score may still surface as key concepts.
3.2. Improved ACE (ICE)

The basic ACE is a simple and straightforward framework that analyses both text and HTML visual clues, but it does not utilize linguistic and lexical features embedded in the text. We upgrade ACE with a few key improvements, which take advantage of rich linguistic and lexical information. The improvements are summarized in Table 3.

Table 3
Summary of improvements to the basic ACE.

Improvement                 ACE            ICE
Tokenization                Blank spaces   Non-alphanumeric characters
Tag and sentence boundary   No             Yes
Scoring derived concepts    TF scorer      Combined scorer
Emphasis scorer             No             Yes
Porter stemming             No             Yes

1. The first improvement is the tokenization policy. In the basic ACE, text is tokenized by blank spaces only. In the improved ACE, we tokenize text using all non-alphanumeric characters.
2. The second improvement is detecting HTML tag boundaries. The basic implementation of ACE disregards tag boundaries. For example, given the text <a href="http://buy.ebay.com/cell-phone">cell phone</a> <a href="http://www.ebay.com/">Home</a>, ACE will extract "cell phone home" as a candidate concept, which is not a true phrase. In ICE, we respect tag boundaries, i.e. we do not combine tokens from different HTML tags, so the candidate concepts become "cell phone" and "home" (see the sketch after this list). Similarly, we detect sentence boundaries, so tokens across two neighboring sentences are not taken as candidates.
3. The third improvement is how derived concepts are scored. The basic ACE uses only the TF scorer to score derived concepts and normalizes by the highest TF score of a derived concept. For instance, if the parent concept is A with a score of 0.5 and it has three derived concepts AB, DA, and ABC (A, B, C, D are different single words), where DA happens to have the highest TF of the three, then DA would get a score of 1.0. This is often not reasonable. In ICE, we use the combined scorer instead and normalize the derived concept with regard to the parent's score. Thus, in the above example, DA has a score of 1.0 × 0.5 = 0.5.
4. The fourth improvement is a new scorer, the emphasis scorer, added to the basic ACE. We observe that there is often some overlap between the title and the emphasized text, and such overlapping text often represents the concept of a page very well. The emphasis scorer detects whether there is such an overlap. Each sequence of overlapping words is considered a candidate concept, whose score is the weight of the HTML tags within which it appears. The emphasis scorer extracts concepts irrespective of TF and HTML weights; only when the emphasis scorer fails to extract any concept does the concept miner come into play.
5. The fifth improvement is to apply Porter stemming to extracted concepts, so that concepts sharing the same stem are considered the same. Stemming is a common practice in text mining applications.

Collectively, these five key improvements raise the accuracy of ACE by 50%. Formal evaluation results are presented in Section 4.
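To make improvements 1 and 2 concrete, the following sketch tokenizes on non-alphanumeric characters and generates candidate phrases only within a single tag's text. This is our own simplified rendering, using Python's standard html.parser; it is not ICE's code.

```python
import re
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text of each HTML element separately, so that
    candidate phrases never cross a tag boundary (improvement 2)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data)

def candidate_concepts(html, max_words=3):
    """Generate candidate phrases per text chunk, tokenizing on all
    non-alphanumeric characters (improvement 1)."""
    parser = TagTextCollector()
    parser.feed(html)
    phrases = set()
    for chunk in parser.chunks:
        tokens = [t for t in re.split(r"[^A-Za-z0-9]+", chunk.lower()) if t]
        for n in range(1, max_words + 1):
            for i in range(len(tokens) - n + 1):
                phrases.add(" ".join(tokens[i:i + n]))
    return phrases

page = ('<a href="http://buy.ebay.com/cell-phone">cell phone</a> '
        '<a href="http://www.ebay.com/">Home</a>')
print(candidate_concepts(page))
# {'cell', 'phone', 'cell phone', 'home'} -- but never 'cell phone home'
```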
3.3. KEA

KEA stands for Automatic Keyphrase Extraction. It is an efficient and practical algorithm for extracting key phrases, developed in the information retrieval and text mining fields. KEA is now part of the machine learning toolkit WEKA [5] and has been proven successful in many text-based applications, e.g. Frank et al. (1999), Gutwin et al. (1999), Turney (2002, 2003), and Zhang et al. (2004, 2005, 2007).

KEA is a supervised method using the Naive Bayes approach, a probabilistic model that applies Bayes' theorem with a feature independence assumption. In other words, it estimates the probability of a pre-defined class given a feature set, based on the prior probability of each class and the conditional probability of the feature set given the target class. All classes are then ranked by this probability value.

KEA consists of two stages: training and extraction. In both stages, n-grams (n ≥ 1) are extracted as candidate concepts. For each candidate, two feature values, TFIDF and first occurrence, are calculated. TFIDF is a standard feature often used in text mining tasks; it favors terms that appear frequently in one document but rarely in the document collection. First occurrence is calculated as the number of words that precede the candidate's first appearance, divided by the number of words in the document.
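As an illustration, the following sketch computes the two feature values for one candidate phrase; it is our simplified rendering of the features described above, not KEA's Java implementation.

```python
import math

def kea_features(phrase, doc_tokens, corpus_docs):
    """Compute KEA's two features for a candidate phrase: TFIDF and
    first occurrence (fraction of the document preceding the phrase)."""
    n = len(phrase.split())
    grams = [" ".join(doc_tokens[i:i + n]) for i in range(len(doc_tokens) - n + 1)]
    tf = grams.count(phrase) / max(len(grams), 1)
    df = sum(1 for doc in corpus_docs if phrase in " ".join(doc))
    idf = math.log(len(corpus_docs) / (1 + df))
    # Words preceding the first appearance, over the document length;
    # a phrase that never occurs is treated as appearing at the end.
    first = grams.index(phrase) / len(doc_tokens) if phrase in grams else 1.0
    return tf * idf, first

doc = ["dell", "laptop", "review", "dell", "laptop"]
corpus = [["dell", "laptop"], ["canon", "camera"], ["hp", "printer"]]
print(kea_features("dell laptop", doc, corpus))  # (~0.20, 0.0)
```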
3.3.1. KEA training
In the training stage, each document is manually assigned a few concepts. Then a Naïve Bayes model is built based on TFIDF and first appearance. More explicitly, KEA chooses a set of candidate key phrases from the input documents. Those candidates that happen to be human-authored key phrases are the positive examples in the KEA model construction.

KEA is a domain-independent method, as shown in Witten et al. (1999), which means a KEA model trained on one domain (e.g. computer science) performs well on another domain (e.g. biology). The training set bundled with the Java-based KEA package (Version 2.0) [6] is used to train a CSTR KEA learning model. This data set contains 80 abstracts of Computer Science Technical Reports (CSTR) from the New Zealand Digital Library project [7]. Each abstract has five human-authored key phrases. The input to the Java program consists of text files with the corresponding key phrases. Research in Witten et al. (1999) shows that a training set of 25 or more documents can achieve good performance.

3.3.2. KEA extraction
In the extraction stage, KEA uses the model to find the best set of (by default 5) key phrases in new documents. More explicitly, KEA chooses a set of candidate key phrases from new documents and calculates their two feature values as above. Each candidate is then assigned a score, which is the overall probability that the candidate is a key phrase. Candidates with a score above a certain threshold are taken as the concepts.

[5] http://www.cs.waikato.ac.nz/ml/weka.
[6] http://www.nzdl.org/Kea.
[7] http://www.nzdl.org.

4. Experiments and evaluation

In this section, we discuss how we construct a data corpus to experiment with ACE, ICE and KEA. We evaluate these algorithms using standard metrics: precision and recall.

4.1. Data corpus

Evaluation of automatically generated concepts often proceeds either in intrinsic mode, where concepts are compared against a gold standard created by human subjects, or in extrinsic mode, which measures the utility of concepts in performing a particular task, e.g. site indexing (Zhang et al. 2007; McKeown et al. 2005). In this task, we take the intrinsic approach. We aim to investigate how well these concepts reveal the main topic of a given Web page. In other words, we are interested in the correctness and completeness of the automatically generated concepts.

We ask human subjects to create a gold standard. More specifically, we first construct a data corpus of 100 shopping related Web pages that are randomly selected from our database of leading brand sites (ranked by the number of incoming links) such as Dell, HP, and Canon. Second, we ask two experts in the e-commerce field to manually review each page and pick the best up to 10 concepts as the gold standard in a double-blinded fashion, i.e. they work separately and do not interfere with each other's decisions. We observe that most pages have 3-5 concepts. The agreement on extracted concepts between the two reviewers is 83%. For concepts that are picked by only one reviewer, a second round of review is conducted, so that both reviewers agree on the concepts for all 100 pages.

Once the gold standard is created, we run ACE, ICE, and KEA to generate concepts with different configurations. Evaluation proceeds by comparing the automatically generated concepts with the human authored concepts.

4.2. Evaluation measures

We use precision, recall and F1 to evaluate the quality of concept extraction. In the context of concept extraction, precision measures the fraction of extracted concepts that are also human authored concepts, and recall measures the fraction of human authored
concepts that are captured by the concept extraction methods. Usually there is a tradeoff between the two, i.e. it is hard to achieve both high precision and high recall at the same time. F1 is a measure that balances the two.

Table 4
Confusion matrix of manual concepts and extracted concepts.

                 Manual               Non-manual
Extracted        A: true positive     B: false positive
Non-extracted    C: false negative    D: true negative

Given the confusion matrix shown in Table 4, precision P, recall R, and F1 can be formally defined as follows:

P = A / (A + B),   R = A / (A + C),   F1 = 2PR / (P + R)    (2)

where A is the number of overlapping concepts between the human-authored concepts (the gold standard) and the program-generated concepts, B is the number of extracted concepts that are not truly human authored concepts, and C is the number of human authored concepts that are missed by the concept extraction methods.
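Stated as code, the per-page evaluation in Eq. (2) reduces to simple set arithmetic (a sketch with made-up concepts):

```python
def evaluate(extracted, gold):
    """Compute precision, recall and F1 of extracted concepts against
    the human-authored gold standard for one page, as in Eq. (2)."""
    extracted, gold = set(extracted), set(gold)
    a = len(extracted & gold)                      # true positives
    p = a / len(extracted) if extracted else 0.0
    r = a / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(evaluate({"dell latitude d420", "laptop review"},
               {"dell latitude d420", "d420 review", "latitude"}))
# (0.5, 0.333..., 0.4)
```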
Alternatively, we could use the acceptable percentage measure proposed by Turney (2000) to evaluate automatically extracted concepts. This usually proceeds with a formal user study where people assign 1-to-5 scores to concepts, and a gold standard is often not required (if the method does not require any) for evaluation purposes.

4.3. Evaluation results

4.3.1. Evaluation results of ACE
First, we present the evaluation results of the basic ACE. We have 100 pages with manually tagged concepts, and all 100 pages are used to test ACE. We tune the following three parameters in order to obtain concepts with the best quality.

- T: threshold on the final concept score, which is a weighted combination of the TF score and the HTML score; a threshold of 0.5 is used by default. We experiment with five threshold values from 0.5 to 0.9 in increasing steps of 0.1.
- B: maximum breadth of a concept, i.e. the maximum number of single words allowed in a concept. A default value of 5 is used in ACE. We experiment with values from 3 to 6.
- λ: significance weight of the TF scorer (0.5 by default). We experiment with five weights from 0.3 to 0.7 in increasing steps of 0.1. Note that the weight of the HTML scorer is 1 − λ.

When a threshold on the final concept score is chosen (e.g. T = 0.6), we experiment with each configuration of B and λ (denoted [B, λ]). This leads to a total of 20 configurations.

For each Web page, we calculate the precision, recall and F1 values. The quality of ACE is measured by the average precision, recall and F1 over the 100 Web pages. We aim to find, for each threshold T, which of the 20 configurations C yields the highest average F1. Due to space limits, we report only the evaluation results of ACE using the best configuration. The results are summarized in Table 5.

Table 5
Evaluation results of basic ACE with the best configuration.

T    0.5        0.6        0.7        0.8        0.9
C    [6, 0.4]   [6, 0.4]   [6, 0.3]   [6, 0.6]   [4, 0.3]
P    0.2496     0.2429     0.2258     0.2046     0.1987
R    0.2700     0.2450     0.2350     0.2017     0.1750
F1   0.2594     0.2440     0.2303     0.2031     0.1861

As we can see in Table 5, ACE tends to favor long concepts but achieves low precision and recall in all cases. Also, as the threshold goes up, the overall quality decreases. As an example of a sub-optimal case, for T = 0.5 with the configuration [4, 0.5], the precision, recall, and F1 are 0.2245, 0.2583, and 0.2402, respectively.

4.3.2. Evaluation results of ICE
The main focus of our paper is to investigate how to improve ACE and demonstrate its performance compared with ACE and KEA. We experiment with ICE using the same configurations as for ACE. We again report the results of ICE using the best configuration. The results are summarized in Table 6.

Table 6
Evaluation results of ICE with the best configuration.

T    0.5        0.6        0.7        0.8        0.9
C    [3, 0.3]   [3, 0.3]   [3, 0.3]   [3, 0.3]   [3, 0.3]
P    0.7227     0.7286     0.7396     0.7323     0.7391
R    0.9815     0.9613     0.9360     0.8889     0.8822
F1   0.8325     0.8289     0.8263     0.8030     0.8043

As shown in Table 6, ICE significantly outperforms ACE in precision, recall, and F1. In most cases, more than 90% of true concepts are captured by ICE and more than 70% of extracted concepts are truly desired. This indicates that the improvements listed in Table 3 are greatly beneficial: they capture the essential signals, i.e. the linguistic and lexical patterns, that identify the most essential topic of a shopping related Web page. This is very encouraging and promising as we look toward much larger data sets in the future. We also observe that ICE tends to favor shorter concepts than ACE. The best quality, with F1 = 0.8325, is achieved when a threshold of 0.5 is used on concepts with up to 3 words and a 30% significance weight on the TF score.

In addition to the improvements summarized in Table 3, we also investigate whether applying CSS (Cascading Style Sheets [8]) to the Web pages impacts concept extraction. We repeat the same experiments with CSS applied. The results are summarized in Table 7, which indicates that the quality of extracted concepts is actually worse than without CSS. This implies that style sheets tend to redefine features such as color and font, which consequently dilutes the signals given by the HTML markup that is critical to the ICE algorithm.

Table 7
Evaluation results of ICE with CSS applied.

T    0.5        0.6        0.7        0.8        0.9
C    [3, 0.3]   [3, 0.3]   [3, 0.3]   [3, 0.3]   [3, 0.3]
P    0.6880     0.6930     0.6897     0.6925     0.6992
R    0.8767     0.8667     0.8517     0.8350     0.8283
F1   0.7710     0.7702     0.7622     0.7571     0.7583

[8] http://en.wikipedia.org/wiki/Cascading_Style_Sheets, last viewed on March 9, 2012.
[9] In order to evaluate KEA's sensitivity and stability across different training examples, we conduct a fivefold cross validation on the 50 training examples, i.e. the data set is evenly split into 5 segments and each time a different segment is left aside for testing using the model trained on the other four segments. Our fivefold cross validation using specific configurations (e.g. T = 0.3, B = 4) shows that KEA is not sensitive to different training examples, i.e. the average F1 values across different folds using the same configuration are very close to each other.

4.3.3. Evaluation results of KEA
For KEA, we proceed differently in the sense that we use the first 50 Web pages for training and the remaining 50 for testing [9]. More specifically, for training we use two sets of concepts: (1) the ICE concepts extracted using the best configuration (0.5 for the threshold on concept score, up to three words allowed in a concept, 30% weight for the TF score and 70% for the HTML score); (2) the manual concepts. Our interest resides in how good ICE concepts are compared to manual concepts in the task of training a KEA model.

We tune two parameters in KEA: one is the breadth, i.e. the maximum number of words allowed in a KEA concept; the other is the threshold on the concept score. We observe that most KEA concepts carry a score less than 0.5, so we use a range of thresholds on the concept score from 0.2 to 0.5. The evaluation results are summarized in Table 8.

Table 8
Evaluation results of KEA using ICE concepts and manual concepts for training.

Training   ICE concepts         Manual concepts
T          0.2, 0.3, 0.4, 0.5   0.2, 0.3, 0.4, 0.5
B          3                    6
P          0.6583               0.7383
R          0.7600               0.8300
F1         0.7055               0.7815

When using the ICE concepts of the first 50 Web pages for training a KEA model, the evaluation results on the remaining 50 Web pages are the same (P = 0.6583, R = 0.7600, F1 = 0.7056, and B = 3) when a threshold of 0.2, 0.3, 0.4, and 0.5 is separately used. This can be explained by the fact that there is most often a huge gap between the top KEA concepts (usually good ones) and the bottom ones (usually not good). For example: Cisco, 0.5504; Cisco AS5350XM, 0.5504; Universal Gateway, 0.5504; DSP, 0.0399; voice, 0.0399; IP, 0.0399; network, 0.0399. Consequently, applying a lower threshold has minor impact on the quality.

Next, we use the manual concepts of the first 50 Web pages for training a KEA model. Again, the testing results are the same (P = 0.7383, R = 0.8300, F1 = 0.7815, and B = 6) when a threshold of 0.2, 0.3, 0.4, and 0.5 is separately used. We observe that using manual concepts for training KEA improves the quality of concept extraction by 8%. However, this model prefers longer concepts. It is also observed that ICE outperforms KEA by 5% (see Table 6).

5. E-commerce applications

We have discussed the concept extraction algorithms in Section 3 and the evaluation results in Section 4. To showcase the practical use of concept extraction in the e-commerce context, we present two applications in this section: product matching and topic-based opinion mining. Since ICE proves to be significantly better than ACE (see Section 4), we use only ICE and KEA in these two applications.

5.1. Product matching

Product matching is the process of identifying whether a shopping related Web page is about a certain product (or some products), e.g. iPhone 4s. This can be useful for search indexing and relevance ranking, as well as advertising and merchandizing. Due to the scope of this paper, we focus on measuring how well ICE and KEA concepts match products for product review pages.

In our work, we use consumer reviews and professional reviews of digital cameras as examples to investigate the product matching problem. Consumer reviews and professional reviews are mostly about a particular product, although they might mention other similar products too. For example, a Canon digital camera review page might mention similar products by Sony and Samsung. We need the capability of detecting what the page is about, as opposed to what it merely mentions, using the power of concept extraction. Moreover, in professional review pages, very often multiple products of the same brand are reviewed for comparison purposes. Other page types, such as Frequently Asked Questions (FAQs), product specs, and forums, are good candidates for studying the product matching problem too. Different from the above five types are buying guides, which often focus on suggestions for a category, such as electronics, sports gear, and media, rather than on particular products. Thus, buying guides have less impact on the product matching problem.

To this end, we manually label 100 consumer review pages from epinions.com and 100 professional review pages from cnet.com, respectively. They are all about digital cameras of various brands and models. For each page, an exact product name, i.e. brand and model, is given manually. We observe that many product names have as many as four words, so we allow ICE to extract concepts with up to four words, with a threshold of 0.5 on the concept score and a 50% significance weight on the TF score and 50% on the HTML score. For KEA, we use the manual concepts from Section 4 to train a model and also allow it to produce concepts with up to four words. Both ICE and KEA produce a few concepts for each page, but we use only the top one as the extracted product name.

We use accuracy to evaluate ICE and KEA with respect to their ability to extract the brand and model from product review pages. We calculate accuracy for brand and model, respectively. For example, if the top ICE concept contains the manually given brand 70 times out of 100 consumer reviews, then the accuracy of ICE in brand matching on consumer reviews is 70%.
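Stated as code, the matching rule is a containment test on the top concept; the helper and page records below are hypothetical illustrations of the procedure, not our production pipeline.

```python
def matching_accuracy(pages, field="brand"):
    """Fraction of pages whose top extracted concept contains the
    manually labeled brand (or model)."""
    hits = sum(1 for page in pages
               if page[field].lower() in page["top_concept"].lower())
    return hits / len(pages)

pages = [
    {"brand": "Canon", "model": "EOS 60D", "top_concept": "canon eos 60d review"},
    {"brand": "Nikon", "model": "D7000", "top_concept": "best dslr cameras"},
]
print(matching_accuracy(pages, "brand"))  # 0.5
print(matching_accuracy(pages, "model"))  # 0.5
```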
The results are summarized in Table 9. We believe the accuracy achieved by both methods is reasonable considering that only the top concept is used for matching the brand and model.

Table 9
Accuracy of ICE and KEA in product matching.

Accuracy               Brand (ICE)   Brand (KEA)   Model (ICE)   Model (KEA)
Professional reviews   0.65          0.60          0.71          0.67
Consumer reviews       0.73          0.69          0.80          0.77

As we see in Table 9, ICE does better than KEA at extracting brand and model from both professional and consumer review pages. This is consistent with the experimental results presented in Section 4. We also observe that both ICE and KEA perform much better on consumer reviews than on professional reviews. This may be explained by the fact that consumer reviewers mostly focus on the target product, and hence it is easier for concept extraction methods to find the correct product, whereas professional reviewers often compare the target product to other brands and models in addition to reviewing the target product. The other interesting finding is that both ICE and KEA perform much better on model matching than on brand matching in both professional and consumer reviews. This indicates that the reviews in our test mention similar brands more often than similar models of the same brand, and hence it is more difficult to correctly match brands than models. However, we believe it is more important to correctly find the product model than the brand, because the model name (or number) is fairly unique in most cases and can identify the product accurately (even without knowing the brand).

5.2. Topic-based opinion mining

We have shown how ICE and KEA can be useful in an important e-commerce application, product matching. In this subsection, we present how concept extraction can be helpful in a text mining problem, topic-based opinion mining. Opinion mining, also known as sentiment analysis, is a research area which aims to identify and extract sentiments and emotions from text documents (Liu 2012). Opinion mining has been very popular in the recent decade and has been used in many domains such as marketing, customer analysis and political campaigns. For example, businesses and organizations want to understand customers' voices and improve their products, and consumers want to compare products from multiple vendors based on product reviews before they make a purchase decision. One of the main sub-areas of opinion mining is topic-based opinion mining, where the document collection is large or has multiple topics. In such a case, one may wish to first identify the essential topics within the corpus and then apply opinion mining separately to each essential topic. This technique has

been well studied and applied to attribute-based opinion mining on product reviews and other domains (Liu 2012, Moghaddam and Ester 2012).

In the e-commerce context, a topic-based opinion mining system can be very useful. We describe such a system in Zhang et al. (2011), which first mines essential topics from discussion boards on eBay forums [10] and then mines opinions for each topic. Such a system can quickly gauge the community's feedback and sentiment on our inventory, products and services, and many other interesting things. It is also capable of detecting the trends of various topics. Thus, in this work we focus on the first stage of the topic-based opinion mining task, i.e. topic extraction. For each discussion board on eBay forums, we gather threads that have ten or more posts and index all the posts. We mine concepts from a set of posts at a certain level, which can be the entire board, a certain thread in the board, or a set of posts related to a certain topic, e.g. fish oil.

[10] http://forums.ebay.com.

In this task, we use both ICE and KEA for topic extraction. We notice that many posts from our users have few HTML tags, so we favor the TF score over the HTML score (λ = 0.7). We train a KEA model with the manual concepts from Section 4 and allow it to produce concepts with up to four words per post.

Both ICE and KEA are aimed at extracting concepts from individual documents in a document collection rather than from the whole collection. Hence, in order to generate a concept list for an entire document corpus (in this case multiple posts), the output concepts from all posts should be combined properly. We do the following (a small sketch of this aggregation appears after the list):

1. Unite the top 5 concepts from each post j into a single list. Each concept i has a weight w_{i,j}.
2. Record f_i, the number of posts in which concept i appears. Let W_i be the overall weight of concept i in the corpus and A_i its average weight, so W_i = Σ_j w_{i,j} and A_i = W_i / f_i.
3. Now three features, W_i, A_i, and f_i, can be used to re-rank the list in order to select the top concepts for the entire corpus. Preliminary tests show that, in terms of acceptable percentage (the fraction of concepts with a score of 3 or above, where a 1-to-5 score is assigned to each concept by human subjects), f_i is the best feature.

For example, the top 5 ICE concepts for the Cook's Nook board are: lemon juice, sour cream, black pepper, chicken wings, and green beans. As another example, the top 5 KEA concepts for search results of "fish oil" in the Health & Beauty board are: multi vitamin, coconut oil, fish oil supplements, lower cholesterol, and omega fish oil.
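Here is the small sketch of the aggregation scheme referenced above, assuming each post contributes a mapping from concept to weight (names are illustrative):

```python
from collections import defaultdict

def aggregate(per_post_concepts):
    """Combine per-post concept lists into corpus-level features:
    W_i (total weight), f_i (post frequency), A_i (average weight)."""
    total, freq = defaultdict(float), defaultdict(int)
    for concepts in per_post_concepts:      # one dict per post: concept -> weight
        for concept, weight in concepts.items():
            total[concept] += weight        # W_i = sum over posts of w_ij
            freq[concept] += 1              # f_i
    # Rank by f_i, the best-performing feature in our preliminary tests.
    return sorted(((freq[c], total[c] / freq[c], c) for c in total), reverse=True)

posts = [{"fish oil": 0.9, "coconut oil": 0.6},
         {"fish oil": 0.7, "multi vitamin": 0.8},
         {"fish oil": 0.8, "coconut oil": 0.5}]
for f_i, a_i, concept in aggregate(posts):
    print(concept, f_i, round(a_i, 2))
# fish oil 3 0.8 / coconut oil 2 0.55 / multi vitamin 1 0.8
```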
In terms of evaluation, it is difficult to define the concepts for a collection of posts, even for humans, so popular measures like precision, recall, and accuracy do not apply. However, we can ask human subjects to rate the concepts in extrinsic mode. We show the concepts of discussion boards to our product managers, who are experts in the given domains and continuously monitor the forums, and ask them to rate the concepts with a 1-to-5 score, with 1 being poor and 5 being excellent. Related research in Turney (2003) and Zhang et al. (2007) defines acceptable concepts as those that are rated excellent, good, or fair by human subjects. In our work, acceptable concepts are those that receive a score of 3, 4, or 5. These concepts are reasonably related to the target discussion board. In other words, they correctly and completely reveal the main contents of the discussion board. The percentage, p, is then formally defined as:

p = (n_3 + n_4 + n_5) / Σ_{i=1}^{5} n_i    (3)

where n_s is the number of concepts that receive a score of s.

The average acceptable percentage of aggregated ICE concepts for each discussion board is 81.3%, and 82.0% for aggregated KEA concepts. It is expected that ICE is only on par with or even worse than KEA in this task (instead of outperforming KEA), because we notice that many forum posts have only a few HTML tags, which may limit ICE's performance on the topic extraction task. In other words, if there were more HTML tags in the forum posts, ICE would do even better.

6. Conclusion and future work

We investigate two concept extraction methods, ACE and KEA, in the e-commerce context. We discuss how to upgrade ACE with major improvements into ICE. We construct a gold standard to evaluate the methods using precision and recall. The evaluation results demonstrate that ICE significantly outperforms ACE and that it also outperforms the state-of-the-art method KEA. We also showcase two important applications of concept extraction in e-commerce: product matching and topic-based opinion mining. The main contribution of our work is the application of concept extraction in the e-commerce context. The major improvements we propose are tailored to shopping related Web pages and hence make ICE an innovative approach to concept extraction in e-commerce applications.

Directions of future research include but are not limited to the following: (1) perform experiments on much larger data sets; (2) perform part-of-speech tagging and study linguistic filters (part-of-speech patterns) in concept extraction; (3) investigate anchor text in the concept extraction task.

Acknowledgments

We are thankful to the authors of the ACE and KEA methods for sharing with us the source code.

References

Frank, E., Paynter, G., Witten, I., Gutwin, C., and Nevill-Manning, C. Domain-specific keyphrase extraction. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1999, 668–673.
Frantzi, K., Ananiadou, S., and Mima, H. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3, 2000, 115–130.
Gutwin, C., Paynter, G., Witten, I., Nevill-Manning, C., and Frank, E. Improving browsing in digital libraries with keyphrase indexes. Journal of Decision Support Systems, 27, 1999, 81–104.
Krulwich, B., and Burkey, C. Learning user information interests through the extraction of semantically significant phrases. In: AAAI Spring Symposium Technical Report SS-96-05: Machine Learning in Information Access, 1996, 110–112.

Liu, B. Sentiment Analysis and Opinion Mining (Introduction and Survey). Morgan & Claypool Publishers, Cambridge, MA, USA, 2012.
McKeown, K., Passonneau, R., Elson, D., Nenkova, A., and Hirschberg, J. Do summaries help? A task-based evaluation of multi-document summarization. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 2005, 210–217.
Moghaddam, S., and Ester, M. Tutorial: aspect-based opinion mining from product reviews. In: 2012 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Portland, Oregon, 2012.
Parameswaran, A., Rajaraman, A., and Garcia-Molina, H. Towards the Web of concepts: extracting concepts from large datasets. In: Proceedings of the 2010 VLDB Endowment, 2010, 566–577.
Piao, S., Forth, J., Gacitua, R., Whittle, J., and Wiggins, G. Evaluating tools for automatic concept extraction: a case study from the musicology domain. In: Proceedings of the Digital Economy All Hands Meeting - Digital Futures 2010 Conference, Nottingham, UK, 2010.
Ramirez, P., and Mattmann, C. ACE: improving search engines via automatic concept extraction. In: Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration, Las Vegas, NV, USA, 2004, 229–234.
Song, M., Song, I., and Hu, X. KPSpotter: a flexible information gain-based keyphrase extraction system. In: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management, 2003, 50–53.
Turney, P. Learning algorithms for keyphrase extraction. Information Retrieval, 2, 2000, 303–336.
Turney, P. Mining the Web for Lexical Knowledge to Improve Keyphrase Extraction: Learning from Labeled and Unlabeled Data. Technical Report ERB-1096 (NRC-44947). Institute for Information Technology, National Research Council of Canada, Ottawa, ON, Canada, 2002.
Turney, P. Coherent keyphrase extraction via web mining. In: Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 2003, 434–439.
Villalon, J., and Calvo, R. A. Concept extraction from student essays, towards concept map mining. In: Proceedings of the Ninth IEEE International Conference on Advanced Learning Technologies, Washington, DC, USA, 2009, 221–225.
Witten, I., Paynter, G., Frank, E., Gutwin, C., and Nevill-Manning, C. KEA: practical automatic keyphrase extraction. In: Proceedings of the Fourth ACM Conference on Digital Libraries, Berkeley, CA, USA, 1999, 254–255.
Zhang, Y., Milios, E., and Zincir-Heywood, N. A Comparison of Keyword- and Keyterm-Based Methods for Automatic Web Site Summarization. Technical Report CS-2004-11. Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada, 2004.
Zhang, Y., Zincir-Heywood, N., and Milios, E. Narrative text classification for automatic key phrase extraction in web document corpora. In: Proceedings of the Seventh ACM International Workshop on Web Information and Data Management, Bremen, Germany, 2005, 51–58.
Zhang, Y., Milios, E., and Zincir-Heywood, N. A comparative study on key phrase extraction methods in automatic web site summarization. Journal of Digital Information Management, Special Issue on Web Information Retrieval, 5, 2007, 323–332.
Zhang, Y., Shen, D., and Baudin, C. Tutorial: sentiment analysis in practice. In: 2011 IEEE International Conference on Data Mining (ICDM'11), Vancouver, BC, Canada, 2011.
