a r t i c l e   i n f o

Article history:
Received 11 November 2012
Received in revised form 3 March 2013
Accepted 18 March 2013
Available online 11 April 2013

Keywords:
Concept extraction
Automatic keyphrase extraction
e-Commerce
Product matching
Topic-based opinion mining

a b s t r a c t

Concept extraction is the technique of mining the most important topic of a document. In the e-commerce context, concept extraction can be used to identify what a shopping-related Web page is talking about. This is practically useful in applications such as search relevance and product matching. In this paper, we investigate two concept extraction methods: Automatic Concept Extractor (ACE) and Automatic Keyphrase Extraction (KEA). ACE is an unsupervised method that looks at both text and HTML tags. We upgrade ACE into the Improved Concept Extractor (ICE) with significant improvements. KEA is a supervised learning system. We evaluate the methods by comparing automatically generated concepts to a gold standard. The experimental results demonstrate that ICE significantly outperforms ACE and also outperforms KEA in concept extraction. To demonstrate the practical use of concept extraction in the e-commerce context, we use ICE and KEA to showcase two e-commerce applications, i.e. product matching and topic-based opinion mining.

© 2013 Elsevier B.V. All rights reserved.

1567-4223/$ - see front matter
http://dx.doi.org/10.1016/j.elerap.2013.03.008
290 Y. Zhang et al. / Electronic Commerce Research and Applications 12 (2013) 289–296
popularity of a term in the text body, and the HTML scorer assigns significance weights to a pre-defined set of HTML tags so that terms in a pair of tags will be given a significance weight. Finally, the Concept Miner combines the above two parts using given weights to rank candidate concepts.

KEA is a supervised learning system. It first builds a Naive Bayes model from training documents where concepts are manually assigned. Two features are used in training the model: Term Frequency – Inverse Document Frequency (TFIDF) and first appearance, which is the normalized distance, in number of words, to the beginning of the document. The trained model is then used to find concepts in new documents.

Our focus in this paper is to perform various improvements to the basic ACE to obtain the Improved Concept Extractor (ICE). We use ACE and KEA as benchmarks in our experiments. In order to evaluate the three systems, we create a collection of 100 Web pages from leading brand sites such as Dell, HP, and Canon. We create a gold standard by manually assigning concepts to each page in the collection. We tune different parameters of ACE, ICE and KEA to generate concepts, and we use precision, recall and F1 to evaluate the concepts. The experimental results indicate that ICE performs significantly better than ACE and better than KEA in concept extraction.

To further demonstrate the practical use of concept extraction in the e-commerce context, we use ICE and KEA as examples to discuss two applications: product matching and topic-based opinion mining. In product matching, we use concept extraction methods to identify the most important topic of a product review Web page and decide if the top concept matches the manually verified product. This can be very useful for search indexing, relevance ranking, advertising, and many others. In topic-based opinion mining, we find the topics of a collection of documents so that opinions on each topic can be mined.

The main contributions of our paper are twofold: (1) We provide an extensive analysis of concept extraction tools. We specifically upgrade an existing tool with key improvements and demonstrate its performance by running experiments. (2) We showcase two key applications of concept extraction in the e-commerce field. To our knowledge, this is the first study that analyzes how concept extraction methods can be used in e-commerce applications.

The rest of the paper is organized as follows. Section 2 reviews concept extraction tools in the literature and Section 3 explains in detail how different concept extraction systems work. In Section 4, we describe the experiments and present the evaluation results. Then we discuss two practical e-commerce applications of concept extraction in Section 5. Finally, Section 6 concludes our work and describes future research directions.

2. Related work

Concept extraction is derived from key phrase extraction in the information retrieval and text mining fields. Key phrase extraction methods often analyze a document to determine the significance of a phrase, which can be a single word or a multi-word term. The significance of a phrase is measured by modelling statistical features such as frequency of occurrence and linguistic features such as part-of-speech. The phrases above a certain threshold are often referred to as key phrases or concepts.

Many key phrase extraction methods have been proposed in the literature. TFIDF is a popular method that is widely used in the information retrieval and machine learning areas. The intuition behind this method is that phrases which appear frequently in one document but rarely in the whole document collection often have high discrimination power between documents. TFIDF requires a collection of documents to compute the significance score. Thus, it has been widely used with the bag-of-words representation in text-based applications, such as document categorization and document clustering.

Krulwich and Burkey (1996) use a pre-defined set of heuristic rules to extract key phrases from a document. The rules are based on lexical features such as the use of acronyms and visual clues such as the use of italics. The extracted key phrases are used as features in an automatic document classification task. Turney (2000) proposes GenEx, a key phrase extraction system that is based on rule learning using a genetic program. More specifically, the system consists of a pre-defined set of parameterized heuristic rules that are tuned to the training documents by the genetic program. The learned optimized rules are then applied to new documents to extract key phrases. However, the above two methods heavily depend on pre-defining and tuning heuristic rules, which is very expensive when moving to new applications.

Ramirez and Mattmann (2004) propose a system, ACE, which is specifically designed for concept extraction from Web pages. ACE analyzes both the text body of a page and visual clues in various HTML tags to extract concepts from a single Web page. In Section 3, we will discuss why we choose ACE as a benchmarking method and perform key improvements for the concept extraction task.

Witten et al. (1999) develop a key phrase extraction system called Automatic Keyphrase Extraction (KEA). KEA builds a Naïve Bayes learning model using training documents with known key phrases, and then uses the model to find key phrases in new documents. Song et al. (2003) introduce a method which uses the information gain measure to rank the candidate key phrases based on the TF-IDF and distance features, which were first proposed in KEA (Witten et al. 1999).

The above methods are designed for key phrase extraction from individual documents. Frantzi et al. (2000) propose a method named C-value/NC-value (CNC), which consists of both linguistic and statistical analysis to extract key phrases automatically. It is designed for key phrase extraction from an entire document collection.

More recently, Parameswaran et al. (2010) propose a system that extracts concepts from a large dataset such as user tags (e.g. del.icio.us) or query logs of search engines (e.g. AOL). The system uses techniques similar to association rule mining in market basket analysis and aims towards building a web of concepts. Features such as frequency of occurrence and popularity among users are used to extract core concepts, while sub-concepts and super-concepts of the core concepts are pruned. The authors claim that the system can be applied to any large data set. However, if Web pages are used, a lot of additional processing is needed to identify the popular concepts.

Table 1 summarizes the concept extraction methods by the features they have.

The first feature, supervised learning, is the process of training a model with examples and then applying the learnt model to new documents. It means that there is a human effort to provide training examples. In the concept extraction context, a supervised method analyzes training documents with manual concepts to find rules and patterns which can be used to find concepts in new documents. For example, do the manual concepts have something in common, such as often appearing in the title or headings, or appearing more frequently than phrases that are not concepts? The second feature is whether the concept extraction tool requires a document collection. Requiring a document collection means that there is a crawling effort to collect training documents, which is often expensive. The third feature is whether the tool works on raw text only, on visual clues found in HTML tags, or both.
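The TFIDF intuition described above can be sketched in a few lines. This is an illustrative implementation only, not the exact scoring used by any of the systems discussed; the toy corpus and tokenization are hypothetical:

```python
import math

def tfidf(term, doc, corpus):
    """Score a term highly if it is frequent in `doc` but rare across `corpus`."""
    tf = doc.count(term) / len(doc)               # term frequency in this document
    df = sum(1 for d in corpus if term in d)      # number of documents containing the term
    idf = math.log(len(corpus) / df)              # inverse document frequency
    return tf * idf

# Toy corpus: each document is a list of tokens (hypothetical data).
corpus = [
    ["canon", "camera", "review", "camera"],
    ["sony", "camera", "review"],
    ["shipping", "policy", "returns"],
]
doc = corpus[0]
# "camera" is frequent in doc but common corpus-wide; "canon" is rarer
# across the collection, so it discriminates better.
print(tfidf("canon", doc, corpus) > tfidf("camera", doc, corpus))  # True
```

Note the simple unsmoothed IDF here; practical implementations usually smooth the IDF term to avoid division by zero for unseen terms.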
Table 1
Summary of concept extraction methods by various features.

Table 2
A list of example HTML tags and corresponding significance weights.
3. Concept extraction
high precision and high recall at the same time. F1 is such a measure that balances the two.

Given the confusion matrix shown in Table 4, precision P, recall R, and F1 can be formally defined as follows:

P = A / (A + B)    (1)

R = A / (A + C)    (2)

F1 = 2PR / (P + R)    (3)

where A is the number of overlapped concepts between human-authored concepts (gold standard) and program-generated concepts, B is the number of extracted concepts that are not truly human-authored concepts, and C is the number of human-authored concepts that are missed by the concept extraction methods.

Table 4
Confusion matrix for evaluating extracted concepts.

                 Manual               Non-manual
Extracted        A: true positive     B: false positive
Non-extracted    C: false negative    D: true negative

Alternatively, we can use the acceptable percentage measure proposed by Turney (2000) to evaluate automatically extracted concepts. This usually proceeds with a formal user study where people assign 1-to-5 scores to concepts, and often a gold standard is not required (if the method does not require any) for evaluation purposes.

4.3. Evaluation results

4.3.1. Evaluation results of ACE

First of all, we present the evaluation results of basic ACE. We have 100 pages with manually tagged concepts. All 100 pages are used to test ACE. We tune the following three parameters in order to obtain concepts with the best quality.

T: threshold on the final concept score, which is a weighted combination of the TF score and the HTML score; a threshold of 0.5 is used by default. We experiment with five different threshold values from 0.5 to 0.9 in increasing steps of 0.1.

B: maximum breadth of a concept, i.e. the maximum number of single words allowed in a concept. A default value of 5 is used in ACE. We experiment with values from 3 to 6.

k: significance weight of the TF scorer (0.5 by default). We experiment with five different weights from 0.3 to 0.7 in increasing steps of 0.1. Note that the weight of the HTML scorer is 1 - k.

When a threshold on the final concept score is chosen (e.g. T = 0.6), we experiment with each configuration of B and k (denoted as [B, k]). This leads to a total of 20 configurations.

For each Web page, we calculate the precision, recall and F1 values. The quality of ACE is measured by the average precision, recall and F1 over the 100 Web pages. We aim to find out, for each threshold T, which of the 20 configurations C yields the highest average F1. Due to space limits, we report only the evaluation results of ACE using the best configuration. The results are summarized in Table 5.

As we can see in Table 5, ACE tends to favor long concepts but achieves low precision and recall in all cases. Also, as the threshold goes up, the overall quality decreases. As an example, in the sub-optimal case for T = 0.5, when the configuration [4, 0.5] is used, the precision, recall, and F1 are 0.2245, 0.2583, and 0.2402, respectively.

4.3.2. Evaluation results of ICE

The main focus of our paper is to investigate how to improve ACE and demonstrate its performance compared with ACE and KEA. We experiment with ICE using the same configurations as for ACE. We also report the results of ICE using the best configuration. The results are summarized in Table 6.

As shown in Table 6, ICE significantly outperforms ACE in precision, recall, and F1. In most cases, more than 90% of true concepts are captured by ICE and more than 70% of extracted concepts are truly desired. This indicates that the improvements applied in Table 3 are greatly beneficial. They capture the essential signals, i.e. the linguistic and lexical patterns, that can identify the most essential topic in a shopping-related Web page. This is very encouraging and promising as we look into much larger data sets in the future. We also observe that ICE tends to favor shorter concepts than ACE. The best quality, with F1 = 0.8325, is achieved when a threshold of 0.5 is used on concepts with up to 3 words and a 30% significance weight on the TF score.

In addition to the improvements summarized in Table 3, we also investigate whether applying CSS (Cascading Style Sheets^8) to the Web pages will impact concept extraction. We repeat the same experiments with the addition of applying CSS. The results are summarized in Table 7, which indicates that the quality of extracted concepts is actually worse than without CSS. This implies that style sheets tend to redefine features such as color and font, which consequently dilutes the signals given by HTML markup that are critical to the ICE algorithm.

4.3.3. Evaluation results of KEA

For KEA, we proceed differently in the sense that we use the first 50 Web pages for training and the remaining 50 for testing.^9 More specifically, for training we use two sets of concepts: (1) the ICE concepts extracted using the best configuration (0.5 for the threshold on the concept score, up to three words allowed in a concept, 30% weight for the TF score and 70% for the HTML score); (2) the manual concepts. Our interest resides in how good ICE concepts are compared to manual concepts in the task of training a KEA model.

We tune two parameters in KEA: one is the breadth, i.e. the maximum number of words allowed in a KEA concept; the other is the threshold on the concept score. We observe that most KEA concepts carry a score less than 0.5, so we use a range of thresholds on the concept score from 0.2 to 0.5. The evaluation results are summarized in Table 8.

When using ICE concepts of the first 50 Web pages for training a KEA model, the evaluation results on the remaining 50 Web pages are the same (P = 0.6583, R = 0.7600, F1 = 0.7056, and B = 3) when a threshold of 0.2, 0.3, 0.4, and 0.5 is separately used. This can be explained by the fact that most often there is a huge gap between the top KEA concepts (usually good ones) and the bottom ones (usually not good). For example: Cisco, 0.5504; Cisco AS5350XM, 0.5504; Universal Gateway, 0.5504; DSP, 0.0399; voice, 0.0399; IP, 0.0399; network, 0.0399. Consequently, applying a lower threshold has minor impact on the quality.

^8 http://en.wikipedia.org/wiki/Cascading_Style_Sheets, last viewed on March 9, 2012.

^9 In order to evaluate KEA's sensitivity and stability across different training examples, we conduct a fivefold cross validation on the 50 training examples, i.e. the data set is evenly split into 5 segments and each time a different segment is left aside for testing using the model trained on the remaining four segments. Our fivefold cross validation using specific configurations (e.g. T = 0.3, B = 4) shows that KEA is not sensitive to different training examples, i.e. the average F1 values across different folds using the same configuration are very close to each other.
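Equations (1)-(3) translate directly into code. The sketch below assumes concepts are compared by exact string match after lowercasing, which is a simplification of the actual matching procedure:

```python
def evaluate(extracted, gold):
    """Precision, recall and F1 per Eqs. (1)-(3):
    A = overlap, B = extracted but not gold, C = gold but missed."""
    extracted = {c.lower() for c in extracted}
    gold = {c.lower() for c in gold}
    a = len(extracted & gold)      # true positives
    b = len(extracted - gold)      # false positives
    c = len(gold - extracted)      # false negatives
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical page: gold concepts vs. concepts produced by an extractor.
gold = ["canon powershot", "digital camera"]
extracted = ["Canon PowerShot", "zoom lens"]
p, r, f1 = evaluate(extracted, gold)
print(p, r, f1)  # 0.5 0.5 0.5
```

Averaging these per-page values over the 100 pages gives the figures reported in the tables below.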
Table 5
Evaluation results of basic ACE with the best configuration.

T    0.5       0.6       0.7       0.8       0.9
C    [6, 0.4]  [6, 0.4]  [6, 0.3]  [6, 0.6]  [4, 0.3]
P    0.2496    0.2429    0.2258    0.2046    0.1987
R    0.2700    0.2450    0.2350    0.2017    0.1750
F1   0.2594    0.2440    0.2303    0.2031    0.1861

Table 6
Evaluation results of ICE with the best configuration.

T    0.5       0.6       0.7       0.8       0.9
C    [3, 0.3]  [3, 0.3]  [3, 0.3]  [3, 0.3]  [3, 0.3]
P    0.7227    0.7286    0.7396    0.7323    0.7391
R    0.9815    0.9613    0.9360    0.8889    0.8822
F1   0.8325    0.8289    0.8263    0.8030    0.8043
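The tuning procedure behind Tables 5 and 6, where for each threshold T we select the [B, k] configuration with the highest average F1, can be sketched as a simple grid search. The `run_extractor` callback stands in for a real ACE/ICE evaluation run and is purely hypothetical:

```python
from itertools import product

def best_configuration(thresholds, breadths, weights, run_extractor):
    """For each threshold T, try every [B, k] pair and keep the pair
    whose average F1 over all pages is highest."""
    best = {}
    for t in thresholds:
        scores = {(b, k): run_extractor(t, b, k)
                  for b, k in product(breadths, weights)}
        best[t] = max(scores, key=scores.get)
    return best

# Hypothetical average-F1 function that favors short concepts and a low TF
# weight, loosely mimicking the trend reported for ICE.
fake_avg_f1 = lambda t, b, k: 1.0 - 0.05 * b - 0.2 * k - 0.1 * t

cfg = best_configuration([0.5, 0.6], [3, 4, 5, 6],
                         [0.3, 0.4, 0.5, 0.6, 0.7], fake_avg_f1)
print(cfg[0.5])  # (3, 0.3)
```

With 4 breadths and 5 weights this is exactly the 20 configurations per threshold described in Section 4.3.1.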
Next, we use the manual concepts of the top 50 Web pages for training a KEA model. Again, the testing results are the same (P = 0.7383, R = 0.8300, F1 = 0.7815, and B = 6) when a threshold of 0.2, 0.3, 0.4, and 0.5 is separately used. We observe that using manual concepts for training KEA greatly improves the quality of concept extraction, by 8%. However, this model prefers longer concepts. It is observed that ICE outperforms KEA by 5% (see Table 6).

Table 7
Evaluation results of ICE when applying CSS.

T    0.5       0.6       0.7       0.8       0.9
C    [3, 0.3]  [3, 0.3]  [3, 0.3]  [3, 0.3]  [3, 0.3]
P    0.6880    0.6930    0.6897    0.6925    0.6992
R    0.8767    0.8667    0.8517    0.8350    0.8283
F1   0.7710    0.7702    0.7622    0.7571    0.7583
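The observation in Section 4.3.3, that a large score gap separates the top KEA concepts from the bottom ones so that any threshold inside the gap selects the same set, is easy to verify with the example scores quoted there:

```python
def select_concepts(scored, threshold):
    """Keep only concepts whose score meets the threshold."""
    return [name for name, score in scored if score >= threshold]

# Example concept scores quoted in Section 4.3.3.
scored = [
    ("Cisco", 0.5504), ("Cisco AS5350XM", 0.5504), ("Universal Gateway", 0.5504),
    ("DSP", 0.0399), ("voice", 0.0399), ("IP", 0.0399), ("network", 0.0399),
]

# Every threshold between 0.0399 and 0.5504 yields the same top three concepts.
for t in (0.2, 0.3, 0.4, 0.5):
    assert select_concepts(scored, t) == ["Cisco", "Cisco AS5350XM", "Universal Gateway"]
print("identical selection for thresholds 0.2-0.5")
```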
5. E-commerce applications
We have discussed the concept extraction algorithms in Section 3 and evaluation results in Section 4, respectively. To showcase the practical use of concept extraction in the e-commerce context, we present two applications in this section, i.e. product matching and topic-based opinion mining. Since ICE proves to be significantly better than ACE (see Section 4), we use only ICE and KEA in these two applications.

5.1. Product matching

Product matching is the process of identifying if a shopping-related Web page is about a certain product (or some products), e.g. iPhone 4s. This can be useful for search indexing and relevance ranking, as well as advertising and merchandizing. Due to the scope of this paper, we focus on measuring how well ICE and KEA concepts match products on product review pages.

In our work, we use consumer reviews and professional reviews of digital cameras as examples to investigate the product matching problem. Consumer reviews and professional reviews are mostly about a particular product, although they might mention other similar products too. For example, a Canon digital camera review page might mention similar products by Sony and Samsung. We need the capability of detecting what this page is about, rather than what it merely mentions, using the power of concept extraction. On the other hand, in professional review pages, very often multiple products with the same brand are reviewed for comparison purposes. Other page types such as Frequently Asked Questions (FAQs), product specs, and forums are good candidates for studying the product matching problem too. Different from the above five types are buying guides, which often focus on suggestions for a category, such as electronics, sports gear, and media, rather than particular products. Thus, buying guides have less impact on the product matching problem.

To this end, we manually label 100 consumer review pages from epinions.com and 100 professional review pages from cnet.com, respectively. They are all about digital cameras with various brands and models. For each page, an exact product name, i.e. brand and model, is given manually. We observe that many product names have as many as four words, so we allow ICE to extract concepts with up to four words, with a threshold of 0.5 on the concept score, and a 50% significance weight on the TF score and 50% on the HTML score. For KEA, we use the manual concepts in Section 4 to train a model and also allow it to produce concepts with up to four words. Both ICE and KEA produce a few concepts for each page, but we use only the top one as the extracted product name.

We use accuracy to evaluate ICE and KEA with respect to their ability to extract brand and model from product review pages. We calculate accuracy for brand and model, respectively. For example, if the top one ICE concept contains the manually given brand 70 times out of 100 consumer reviews, then the accuracy of ICE in brand matching on consumer reviews will be 70%. The results are summarized in Table 9. We believe the accuracy achieved by both methods is reasonable considering that only the top one concept is used for matching brand and model.

As we see in Table 9, ICE does better than KEA in extracting brand and model from both professional and consumer review pages. This is consistent with the experiment results presented in Section 4. We also observe that both ICE and KEA perform much better on consumer reviews than on professional reviews. This may be explained by the fact that consumer reviewers mostly focus on the target product, and hence it is easier for concept extraction methods to find the correct product, whereas professional reviewers often compare the target product to other brands and models in addition to reviewing the target product. The other interesting finding is that both ICE and KEA perform much better on model matching than on brand matching in both professional reviews and consumer reviews. This indicates that the reviews in our test mention similar brands more often than similar models with the same brand, and hence it is more difficult to correctly match brands than models. However, we believe it is more important to correctly find the product model than the brand. That is because the model name (or number) is pretty unique in most cases and it can identify the product accurately (even without knowing the brand).

5.2. Topic-based opinion mining

We have shown how ICE and KEA can be useful in an important e-commerce application, i.e. product matching. In this subsection, we present how concept extraction can be helpful in a text mining problem, i.e. topic-based opinion mining. Opinion mining, also known as sentiment analysis, is a research area which aims to identify and extract sentiments and emotions from text documents (Liu 2012). Opinion mining has been very popular in the recent decade and it has been used in many domains such as marketing, customer analysis and political campaigns. For example, businesses and organizations want to understand customers' voices and improve their products, and consumers want to compare products from multiple vendors based on product reviews before they make a purchase decision. One of the main sub-areas in opinion mining is topic-based opinion mining, where the document collection is large or has multiple topics. In such a case, one may wish to first identify the essential topics within the corpus and then apply opinion mining separately on each essential topic. This technique has
References

Liu, B. Sentiment Analysis and Opinion Mining (Introduction and Survey). Morgan & Claypool Publishers, Cambridge, MA, USA, 2012.

McKeown, K., Passonneau, R., Elson, D., Nenkova, A., and Hirschberg, J. Do summaries help? A task-based evaluation of multi-document summarization. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 2005, 210–217.

Moghaddam, S., and Ester, M. Tutorial: aspect-based opinion mining from product reviews. In: 2012 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Portland, Oregon, 2012.

Parameswaran, A., Rajaraman, A., and Garcia-Molina, H. Towards the Web of concepts: extracting concepts from large datasets. In: Proceedings of the 2010 VLDB Endowment, 2010, 566–577.

Piao, S., Forth, J., Gacitua, R., Whittle, J., and Wiggins, G. Evaluating tools for automatic concept extraction: a case study from the musicology domain. In: Proceedings of the Digital Economy All Hands Meeting – Digital Futures 2010 Conference, Nottingham, UK, 2010.

Ramirez, P., and Mattmann, C. ACE: improving search engines via automatic concept extraction. In: Proceedings of the 2004 IEEE International Conference on Information Reuse and Integration, Las Vegas, NV, USA, 2004, 229–234.

Song, M., Song, I., and Hu, X. KPSpotter: a flexible information gain-based keyphrase extraction system. In: Proceedings of the Fifth ACM International Workshop on Web Information and Data Management, 2003, 50–53.

Turney, P. Learning algorithms for keyphrase extraction. Information Retrieval, 2, 2000, 303–336.

Turney, P. Mining the Web for Lexical Knowledge to Improve Keyphrase Extraction: Learning from Labeled and Unlabeled Data. Technical Report ERB-1096 (NRC-44947). Institute for Information Technology, National Research Council of Canada, Ottawa, ON, Canada, 2002.

Turney, P. Coherent keyphrase extraction via web mining. In: Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 2003, 434–439.

Villalon, J., and Calvo, R. A. Concept extraction from student essays, towards concept map mining. In: Proceedings of the Ninth IEEE International Conference on Advanced Learning Technologies, Washington, DC, USA, 2009, 221–225.

Witten, I., Paynter, G., Frank, E., Gutwin, C., and Nevill-Manning, C. KEA: practical automatic keyphrase extraction. In: Proceedings of the Fourth ACM Conference on Digital Libraries, Berkeley, CA, USA, 1999, 254–255.

Zhang, Y., Milios, E., and Zincir-Heywood, N. A Comparison of Keyword- and Keyterm-Based Methods for Automatic Web Site Summarization. Technical Report CS-2004-11. Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada, 2004.

Zhang, Y., Zincir-Heywood, N., and Milios, E. Narrative text classification for automatic key phrase extraction in web document corpora. In: Proceedings of the Seventh ACM International Workshop on Web Information and Data Management, Bremen, Germany, 2005, 51–58.

Zhang, Y., Milios, E., and Zincir-Heywood, N. A comparative study on key phrase extraction methods in automatic web site summarization. Journal of Digital Information Management, Special Issue on Web Information Retrieval, 5, 2007, 323–332.

Zhang, Y., Shen, D., and Baudin, C. Tutorial: sentiment analysis in practice. In: 2011 IEEE International Conference on Data Mining (ICDM'11), Vancouver, BC, Canada, 2011.