WWW 2008 / Refereed Track: Web Engineering - Applications April 21-25, 2008 · Beijing, China

names. Such segmentation and indexing allow end-users to perform fuzzy searches for chemical names, including substring search and similarity search. Because the space of possible substrings of all chemical names appearing in text is extremely large, an intelligent selection needs to be made as to which substrings should be indexed. We propose to mine independent frequent patterns, use information about those patterns for chemical name segmentation, and then index sub-terms. Our empirical evaluation shows that this indexing scheme results in substantial memory savings while producing comparable query results in reasonable time. Similarly, index pruning is also applied to chemical formulae. Finally, user interaction is provided so that scientists can choose the relevant query expansions. Our experiments also show that the chemical entity search engine outperforms general-purpose search engines (as expected).

Constructing an accurate domain-specific search engine is a hard problem. For example, tagging chemical entities is difficult due to noise from text recognition errors introduced by OCR, PDF transformation, etc. Chemical lexicons are often incomplete, especially with respect to newly discovered molecules. Often they do not contain all synonyms, or do not contain all the variations of a chemical formula. Understanding the natural language context of the text to determine whether an ambiguous string is a chemical entity (e.g., “He” may actually refer to “Helium” as opposed to the pronoun) is also a hard problem to solve with very high precision, even if dictionaries of chemical elements and formulae are used. Moreover, many chemical names are long phrases composed of several terms, so general tokenizers for natural languages may tokenize a chemical entity into several tokens. Finally, different representations of a chemical molecule (e.g., “acetic acid” is written as “CH3COOH” or “C2H4O2”) have to be taken into account while indexing (or handled by other strategies like query expansion) to provide accurate and relevant results with good coverage.

Our search engine has two stages: 1) mining textual chemical molecule information, and 2) indexing and searching textual chemical molecule information. In the first stage, chemical entities are tagged in text and analyzed for indexing. Machine-learning-based approaches utilizing domain knowledge perform well, because they can mine implicit rules as well as utilize prior knowledge. In the second stage, each chemical entity is indexed. In order to answer user queries with different search semantics, we design ranking functions that enable these searches. Chemical entities and documents are ranked using our ranking schemes to enable fast searches in response to user queries. Synonyms or similar chemical entities are suggested to the user, and if the user clicks on any of these alternatives, the search engine returns the corresponding documents.

Our prior work addressed chemical formulae in text [25], while this paper focuses on approaches to handle chemical names. We show how a domain-specific chemical entity search engine can be built that includes the tasks of entity tagging, mining frequent subsequences, text segmentation, index pruning, and supporting different query models. Some previous work appears in [25]. We propose several novel methods for chemical name tagging and indexing in this paper. The major contributions of this paper are as follows:

• We propose a model of hierarchical conditional random fields for entity tagging that can consider long-term dependencies at different levels of documents.

• We introduce a new concept, independent frequent subsequences, and propose an algorithm to find those subsequences, with the goal of discovering sub-terms with corresponding probabilities in chemical names.

• We propose an unsupervised method of hierarchical text segmentation and use it for chemical name index pruning. This approach can be extended to index other kinds of sequences, such as DNA.

• We present various query models for chemical name searches with corresponding ranking functions.

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 presents approaches for mining chemical names and formulas based on CRFs and HCRFs. We also propose algorithms for independent frequent subsequence mining and hierarchical text segmentation. Section 4 describes the index schemes, query models, and ranking functions for name and formula searches. Section 5 presents experiments and results. Conclusions and future directions are discussed in Section 6. A web service of our system is provided online.¹

¹ http://chemxseer.ist.psu.edu/

2. RELATED WORK
Banville has provided a high-level overview of mining chemical structure information from the literature [12]. Hidden Markov Models (HMMs) [9] are one of the common methods for entity tagging. HMMs have strong independence assumptions and suffer from the label-bias problem. Other methods such as Maximum Entropy Markov Models (MEMMs) [16] and Conditional Random Fields (CRFs) [11] have been proposed. CRFs are undirected graph models that can avoid the label-bias problem and relax the independence assumption. CRFs have been used in many applications, such as detecting names [17], proteins [22], genes [18], and chemical formulae [25]. Other methods based on lexicons and Bayesian classification have been used to recognize chemical nomenclature [26]. The most common method used to search for a chemical molecule is substructure search [27], which retrieves all molecules with the query substructure. However, users require sufficient knowledge to select substructures that characterize the desired molecules for substring search, so similarity search [27, 29, 23, 21] is desired by users to bypass the substructure selection. Other related work involves mining for frequent substructures [8, 10] or for frequent sub-patterns [28, 30]. Some previous work has addressed how to handle entity-oriented search [6, 15]. Chemical markup languages have been proposed [20] but are still not in wide use. None of the previous work focuses on the same problem as ours; it is mainly about biological entity tagging or chemical structure search. We have shown how to extract and search for chemical formulae in text [25], but here we propose several new methods to extract and search for chemical names in text documents.

3. MOLECULE INFORMATION MINING
In this section, we discuss three mining issues that precede index construction: how to tag chemical entities in text, how to mine frequent sub-terms from chemical names, and how to hierarchically segment chemical names into sub-terms that can be used for indexing.
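Returning to the multiple-representation problem raised in the introduction (“acetic acid” vs. “CH3COOH” vs. “C2H4O2”), the query-expansion strategy mentioned there can be sketched as a lookup over synonym groups. The table below is a hypothetical stand-in for the synonym data such a system would mine; it is not the paper's actual index.

```python
# Sketch of synonym-aware query expansion for molecules with multiple
# textual representations. SYNONYM_GROUPS is a hypothetical placeholder
# for mined synonym data, not the paper's actual index.
SYNONYM_GROUPS = [
    {"acetic acid", "CH3COOH", "C2H4O2"},  # example from the text
    {"Helium", "He"},
]

def expand_query(term):
    """Return every known representation of the query term."""
    for group in SYNONYM_GROUPS:
        if term in group:
            return set(group)
    return {term}

# A search for any one representation can then retrieve documents
# indexed under the others.
print(sorted(expand_query("CH3COOH")))  # -> ['C2H4O2', 'CH3COOH', 'acetic acid']
```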
Feature Set
For entity tagging, our algorithm extracts two categories of state features from sequences of terms: single-term features, and overlapping features from adjacent terms. Single-term features are of two types: surficial features and advanced features. Surficial features can be observed directly from the term, e.g., InitialCapital, AllCapitals, HasDigit, HasDash, HasDot, HasBrackets, HasSuperscripts, IsTermLong. Advanced features are generated by complex domain knowledge or other data mining methods, e.g., POS tags learned by natural language parsers, or whether a term follows the syntax of chemical formulae. For chemical name tagging, we use a lexicon of chemical names collected online by ourselves and WordNet [4]. We support fuzzy matching based on the Levenshtein distance [13]. Furthermore, we check if a term has sub-terms learned from the chemical lexicon based on the method in Section 3.5. For sentence tagging, feature examples are: ContainTermInList, ContainTermPattern, ContainSequentialTermPattern. For example, a list of journals, a list of names, and string patterns in references are used for sentence tagging. Overlapping features of adjacent terms are also extracted. We used -1, 0, 1 as the window of tokens, so that for each token, all overlapping features of the previous token and the next token are included in the feature set. For example, for “He” in “... . He is ...”, feature(term_{n−1} = “.” ∧ term_n = initialCapital) = true, and feature(term_n = initialCapital ∧ term_{n+1} = isPOSTagVBZ) = true, where isPOSTagVBZ means that “is” is a “present tense verb, 3rd person singular”. This “He” is likely to be an English word instead of Helium.

3.2 Independent Frequent Subsequence Mining
In the last subsection, we used sub-terms as features for chemical name tagging. Moreover, our system supports searches for keywords that occur as parts of terms in documents; e.g., documents with the term “methylethyl” should be returned in response to a user search for “methyl”. Instead of having domain experts manually discover those chemical sub-terms, we propose unsupervised algorithms to find them automatically from a chemical lexicon and documents. To support such searches, our system maintains an index of sub-terms. Before it can index sub-terms, the system must segment terms into meaningful sub-terms. Indexing all possible subsequences of all chemical names would make the index too large, resulting in slower query processing. Furthermore, users will not query using all possible sub-terms; e.g., a query using the keyword “hyl” is highly unlikely if not completely improbable, while searches for “methyl” can be expected. So, indexing “hyl” is a waste. Therefore, “methyl” should be in the index, while “hyl” should not. However, our problem has additional subtlety. While “hyl” should not be indexed, another sub-term “ethyl” that occurs independently of “methyl” should be indexed. Also, when the user searches for “ethyl”, we should not return the documents that contain “methyl”. Due to this subtlety, simply using the concepts of closed frequent subsequences and maximal frequent subsequences as defined in [28, 30] is not enough. We introduce the important concept of the independent frequent subsequence. Our system attempts to identify independent frequent subsequences and index them.

In the rest of this paper, we use “sub-term” to refer to a substring of a term that appears frequently. Before our system can index the sub-terms, it must segment chemical names like “methylethyl” automatically into “methyl” and “ethyl”. The independent frequent sub-terms and their frequencies can be used in hierarchical segmentation of chemical names (explained in detail in the next subsection).

First, we introduce some notation:

Definition 1. Sequence, Subsequence, Occurrence: A sequence s = ⟨t1, t2, ..., tm⟩ is an ordered list, where each token ti is an item, a pair, or another sequence. L_s is the length of s. A subsequence s′ ≼ s is an adjacent part of s, ⟨ti, ti+1, ..., tj⟩, 1 ≤ i ≤ j ≤ m. An occurrence of s′ in s, i.e., Occur_{s′≼s,i,j} = ⟨ti, ..., tj⟩, is an instance of s′ in s. We say that in s, Occur_{s′≼s,i,j} and Occur_{s″≼s,i′,j′} overlap, i.e., Occur_{s′≼s,i,j} ∩ Occur_{s″≼s,i′,j′} ≠ ∅, iff ∃n, i ≤ n ≤ j ∧ i′ ≤ n ≤ j′. Unique occurrences are those without overlapping.

Definition 2. Support: Given a data set D of sequences s, D_{s′} is the support of subsequence s′, which is the set of all sequences s containing s′, i.e., s′ ≼ s. |D_s| is the number of sequences in D_s.

Definition 3. Frequent Subsequence: Freq_{s′≼s} is the frequency of s′ in s, i.e., the count of all unique Occur_{s′≼s}. A subsequence s′ is in the set of frequent subsequences FS, i.e., s′ ∈ FS, if Σ_{s∈D_{s′}} Freq_{s′≼s} ≥ Freq_min, where Freq_min is a threshold of minimal frequency.

However, while a sub-term is usually a frequent subsequence, the inverse does not hold, since all subsequences of a frequent subsequence are frequent; e.g., “methyl” (-CH3) is a meaningful sub-term, but “methy” is not, although it is frequent too. Thus, mining frequent subsequences results in much redundant information. We extend two concepts from previous work [28, 30] to remove redundant information:

Definition 4. Closed Frequent Subsequence: A frequent subsequence s is in the set of closed frequent subsequences CS, iff there does not exist a frequent super-sequence s′ of s such that the frequency of s in s″ is the same as that of s′ in s″ for all s″ ∈ D. That is, s ∈ CS = {s | s ∈ FS and ∄s′ ∈ FS such that s ≺ s′ ∧ ∀s″ ∈ D, Freq_{s′≼s″} = Freq_{s≼s″}}.

Definition 5. Maximal Frequent Subsequence: A frequent subsequence s is in the set of maximal frequent subsequences MS, iff it has no frequent super-sequences, i.e., s ∈ MS = {s | s ∈ FS and ∄s′ ∈ FS such that s ≺ s′}.

The example in Figure 3 demonstrates the sets of Closed Frequent Subsequences, CS, and Maximal Frequent Subsequences, MS. Given the collection D = {abcde, abcdf, aba, abd, bca} and Freq_min = 2, the supports are D_abcd = {abcde, abcdf}, D_ab = {abcde, abcdf, aba, abd}, etc. The set of frequent subsequences FS = {abcd, abc, bcd, ab, bc, cd} has much redundant information. CS = {abcd, ab, bc} removes some redundant information; e.g., abc/bcd/cd only appear in abcd. MS = {abcd} removes all redundant information as well as useful information; e.g., ab and bc have occurrences excluding those in abcd. Thus, we need to determine whether ab and bc are still frequent excluding occurrences in abcd. For a frequent sequence s′, its subsequence s ≺ s′ is also independently frequent only if the number of occurrences of s not in any occurrence of s′ is larger than Freq_min. If a subsequence s has more than one frequent super-sequence, then all the occurrences of those super-sequences are excluded when counting the independent frequency of s, IFreq_s.
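Definitions 3–5 can be checked mechanically on the running example (D = {abcde, abcdf, aba, abd, bca}, Freq_min = 2, subsequences of length ≥ 2). The following sketch counts contiguous-subsequence occurrences and reproduces the FS, CS, and MS sets stated in the text:

```python
# Reproduce Definitions 3-5 on the example collection from the text.
D = ["abcde", "abcdf", "aba", "abd", "bca"]
FREQ_MIN, L_MIN = 2, 2

def freq(sub):
    # str.count counts non-overlapping occurrences, which matches the
    # "unique occurrences" of Definition 1 for these examples.
    return sum(s.count(sub) for s in D)

# All contiguous subsequences of length >= L_MIN.
subs = {s[i:j] for s in D for i in range(len(s))
        for j in range(i + L_MIN, len(s) + 1)}

# Definition 3: frequent subsequences.
FS = {t for t in subs if freq(t) >= FREQ_MIN}

# Definition 4: closed frequent subsequences (no frequent super-sequence
# with identical per-sequence frequencies).
CS = {t for t in FS
      if not any(t != u and t in u and
                 all(s.count(t) == s.count(u) for s in D)
                 for u in FS)}

# Definition 5: maximal frequent subsequences (no frequent super-sequence).
MS = {t for t in FS if not any(t != u and t in u for u in FS)}

print(sorted(FS))  # -> ['ab', 'abc', 'abcd', 'bc', 'bcd', 'cd']
print(sorted(CS))  # -> ['ab', 'abcd', 'bc']
print(sorted(MS))  # -> ['abcd']
```

The printed sets agree with the FS, CS, and MS given for this collection in the text.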
Algorithm: IFSM(D, D_∀s, O_∀s, Freq_min, L_min):
Input:
  candidate set of sequences D,
  set D_∀s including the support D_s for each subsequence s,
  set O_∀s including all Occur_{s∈D},
  minimal threshold value of frequency Freq_min, and
  minimal length of subsequence L_min.
Output:
  set of independent frequent subsequences IS, and
  independent frequency IFreq_s of each s ∈ IS.
1. Initialization: IS = ∅, length l = max_s(L_s).
2. while l ≥ L_min, do
3.   put all s ∈ D with L_s = l and Σ_{s′∈D_s} Freq_{s≼s′} ≥ Freq_min into set S;
4.   while ∃s ∈ S, Σ_{s′∈D_s} Freq_{s≼s′} ≥ Freq_min, do
5.     move s with the largest Σ_{s′∈D_s} Freq_{s≼s′} from S to IS;
6.     for each s′ ∈ D_s
7.       for each Occur_{s″≼s′} ∩ (∪_{s≼s′} Occur_{s≼s′})* ≠ ∅,
8.         Freq_{s″≼s′} −−;
9.   l −−;
10. return IS and IFreq_{s∈IS} = Freq_s;
* ∪_{s≼s′} Occur_{s≼s′} is the range of all Occur_{s≼s′} in s′, except Occur_{s≼s′} ∩ Occur_{t≼s′} ≠ ∅ ∧ t ∈ IS.

[...] super-sequence “methyl”. In CS, “ethyl” occurs 180 times, since for each occurrence of “methyl”, an “ethyl” occurs. Thus, CS over-estimates the probability of “ethyl”. If a bias exists while estimating the probabilities of sub-terms, we cannot expect a good text segmentation result using those probabilities (next section).

Based on these observations, we propose an algorithm for Independent Frequent Subsequence Mining (IFSM) from a collection of sequences in Figure 2, with an example in Figure 3. The algorithm considers sequences from the longest to the shortest, checking whether each s is frequent. If Freq_s ≥ Freq_min, it puts s in IS, removes all occurrences of subsequences of s that lie within any occurrence of s, and removes all occurrences overlapping with any occurrence of s. If the remaining occurrences of a subsequence s′ still make s′ frequent, then s′ is put into IS. After mining independent frequent sub-terms, the discovered sub-terms can be used as features in CRFs, and the independent frequencies IFreq can be used to estimate their probabilities for hierarchical text segmentation in the next section.
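The scan described above can be sketched as follows. This is a simplified illustration, not the paper's Figure 2 verbatim: the D_∀s/O_∀s bookkeeping is replaced by direct scans over the collection, and an occurrence is discounted once it overlaps an occurrence of an already-accepted independent subsequence.

```python
# Simplified sketch of Independent Frequent Subsequence Mining (IFSM):
# scan lengths longest-first; only occurrences that do not overlap an
# occurrence of an already-accepted subsequence count toward frequency.
FREQ_MIN, L_MIN = 2, 2

def occs(sub, s):
    """All occurrence intervals [i, j) of sub in s."""
    return [(i, i + len(sub)) for i in range(len(s) - len(sub) + 1)
            if s[i:i + len(sub)] == sub]

def ifsm(D, freq_min=FREQ_MIN, l_min=L_MIN):
    taken = {s: [] for s in D}  # intervals claimed by accepted subsequences
    IS = {}

    def free(sub, s):
        # occurrences of sub in s not overlapping any claimed interval
        return [(i, j) for (i, j) in occs(sub, s)
                if all(j <= a or b <= i for (a, b) in taken[s])]

    def ifreq(sub):
        return sum(len(free(sub, s)) for s in D)

    for l in range(max(map(len, D)), l_min - 1, -1):
        cands = {s[i:i + l] for s in D for i in range(len(s) - l + 1)}
        for sub in sorted(cands, key=ifreq, reverse=True):
            f = ifreq(sub)  # recomputed: earlier acceptances discount it
            if f >= freq_min:
                IS[sub] = f
                for s in D:
                    taken[s].extend(free(sub, s))
    return IS

print(ifsm(["abcde", "abcdf", "aba", "abd", "bca"]))
```

On the toy collection from the text, “abcd” and “ab” survive with independent frequency 2 (ab still occurs in aba and abd), while “bc” does not (only its occurrence in bca remains), matching the discussion following Definition 5.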
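The hierarchical text segmentation method itself is not preserved in this excerpt, but the role the independent frequencies play in it can be illustrated with a standard maximum-probability (Viterbi-style) segmentation. The counts below are hypothetical, and this sketch is only an illustration of frequency-based segmentation, not the paper's hierarchical algorithm.

```python
import math

# Illustration only: max-probability segmentation of a chemical name
# using independent frequencies (IFreq) as unigram scores.
# The counts are hypothetical placeholders.
IFREQ = {"methyl": 420, "ethyl": 180}

def segment(term, ifreq):
    total = float(sum(ifreq.values()))
    best = {0: (0.0, [])}  # chars consumed -> (log-prob, segmentation)
    for j in range(1, len(term) + 1):
        cands = [(best[i][0] + math.log(ifreq[term[i:j]] / total),
                  best[i][1] + [term[i:j]])
                 for i in range(j) if i in best and term[i:j] in ifreq]
        if cands:
            best[j] = max(cands)
    # fall back to the whole term if no full segmentation exists
    return best.get(len(term), (0.0, [term]))[1]

print(segment("methylethyl", IFREQ))  # -> ['methyl', 'ethyl']
```

This recovers the “methylethyl” → “methyl” + “ethyl” split used as the running example in Section 3.2.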
[Figure (plots omitted): number of substrings vs. length of substring (left) and vs. sample size of terms (right), for Freq_min = 10, 20, 40]
Figure 9 (plots omitted): Chemical formula tagging using different values of feature boosting parameter θ — (a) avg precision vs. recall, (b) average precision, (c) average recall, (d) average F-measure
Figure 10 (plots omitted): Chemical name tagging using different values of feature boosting parameter θ, comparing feature sets (all features, no subterm, no lexicon, no subterm+lexicon) — (a) avg precision vs. recall, (b) average precision, (c) average recall, (d) average F-measure
Figure 11 (plot omitted): Ratio of after vs. before index pruning, as a function of Freq_min (10 to 160)
Figure 12 (plots omitted): Correlation of name search results before and after index pruning vs. top n retrieved chemical names, for Freq_min = 10, 20, 40, 80, 160 — (a) similarity name search, (b) substring name search
For example, “C” is usually a name abbreviation in the author and reference parts, but usually refers to “Carbon” in the body of a document. Thus, HCRFs are evaluated for chemical formula tagging to check whether long-distance dependency information at the sentence level is useful or not. HCRFs are not applied for chemical name tagging, because we found that the major source of errors in chemical name tagging is not term ambiguity, but limitations of the chemical dictionary and incorrect text tokenization.

5.3 Chemical Name Indexing
For chemical name indexing and search, we use segmentation-based index construction and pruning. We compare our approach with the method using all possible substrings for indexing. We use the same collection of chemical names as in Section 5.1. We split the collection into two subsets. One is for index construction (37,656 chemical names), while the query names are randomly selected from [...]

[...] index pruning, respectively. Results are presented in Figure 12. We can observe that for similarity search, when more results are retrieved, the correlation curves decrease, while for substring search, the correlation curves increase. When Freq_min is larger, the correlation curves decrease, especially for substring search.

5.5 Term Disambiguation
To test the ability of our approach for term disambiguation in documents, we index 5325 PDF documents crawled online. We then design 15 queries of chemical formulae. We categorize them into three levels based on their ambiguity: 1) hard (He, As, I, Fe, Cu), 2) medium (CH4, H2O, O2, OH, NH4), and 3) easy (Fe2O3, CH3COOH, NaOH, CO2, SO2). We compare our approach with the traditional approach by analyzing the precision of the returned top-20 documents. Precision is defined as the percentage of returned documents really containing the query formula.
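The precision measure defined for the disambiguation experiment is simply precision-at-k over judged documents. A minimal sketch (the relevance judgments below are hypothetical placeholders):

```python
# Precision of the top-k returned documents, per Section 5.5's definition:
# the fraction of returned documents that really contain the query formula.
def precision_at_k(ranked_docs, contains_formula, k=20):
    top = ranked_docs[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if contains_formula(d)) / len(top)

# Hypothetical judgments for an ambiguous query such as "He".
truly_relevant = {"d1", "d3", "d4"}
ranking = ["d1", "d2", "d3", "d4", "d5"]
print(precision_at_k(ranking, lambda d: d in truly_relevant, k=5))  # -> 0.6
```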
[Figure 13 (plots omitted): Precision of the top n returned documents for formula queries — our approach (Formula), Lucene, Google, Yahoo, and MSN]
Lucene [1] is used for the traditional approach. We also evaluate generic search engines, Google, Yahoo, and MSN, using those queries to show the ambiguity of terms. Since the indexed data are different, they are not comparable with the results of our approach and Lucene. Their results illustrate that ambiguity exists and that domain-specific search engines are needed. From Figure 13, we can observe: 1) the ambiguity of terms is very serious for short chemical formulae, 2) results of Google and Yahoo are more diversified than those of MSN, so that chemical web pages are included, and 3) our approach outperforms the traditional approach based on Lucene, especially for short formulae.

6. CONCLUSIONS AND FUTURE WORK
We evaluated CRFs for chemical entity extraction and proposed an extended model, HCRFs. Experiments show that CRFs perform well and HCRFs perform better. Experiments illustrate that most of the sub-terms discovered in chemical names using our IFSM algorithm have semantic meanings. Examples show that our HTS method works well. Experiments also show that our schemes of index construction and pruning can reduce the number of indexed tokens as well as the index size significantly. Moreover, the response time of similarity name search is considerably reduced. Retrieved ranked results of similarity and substring name search before and after segmentation-based index pruning are highly correlated. We also introduced several query models for name searches with corresponding ranking functions. Experiments show that the heuristics of the new ranking functions work well. In the future, a detailed user study is required to evaluate the results. Entity fuzzy matching and query expansion among synonyms will also be considered.

7. ACKNOWLEDGMENTS
We acknowledge partial support from NSF grant 0535656.

8. REFERENCES
[1] Apache Lucene. http://lucene.apache.org/.
[2] NIST Chemistry WebBook. http://webbook.nist.gov/chemistry.
[3] Royal Society of Chemistry. http://www.rsc.org.
[4] WordNet. http://wordnet.princeton.edu/.
[5] World Wide Molecular Matrix. http://wwmm.ch.cam.ac.uk/wikis/wwmm.
[6] S. Chakrabarti. Dynamic personalized PageRank in entity-relation graphs. In Proc. WWW, 2007.
[7] E. S. de Moura, C. F. dos Santos, D. R. Fernandes, A. S. Silva, P. Calado, and M. A. Nascimento. Improving web search efficiency via a locality based static pruning method. In Proc. WWW, 2005.
[8] L. Dehaspe, H. Toivonen, and R. D. King. Finding frequent substructures in chemical compounds. In Proc. SIGKDD, 1998.
[9] D. Freitag and A. McCallum. Information extraction using HMMs and shrinkage. In AAAI Workshop on Machine Learning for Information Extraction, 1999.
[10] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proc. ICDM, 2001.
[11] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, 2001.
[12] D. L. Banville. Mining chemical structural information from the drug literature. Drug Discovery Today, 11(1-2):35–42, 2006.
[13] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 1966.
[14] W. Li and A. McCallum. Semi-supervised sequence modeling with syntactic topic models. In Proc. AAAI, 2005.
[15] G. Luo, C. Tang, and Y.-L. Tian. Answering relationship queries on the web. In Proc. WWW, 2007.
[16] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proc. ICML, 2000.
[17] A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proc. CoNLL, 2003.
[18] R. McDonald and F. Pereira. Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics, 6(Suppl 1):S6, 2005.
[19] P. Murray-Rust, H. S. Rzepa, S. Tyrrell, and Y. Zhang. Representation and use of chemistry in the global electronic age. Org. Biomol. Chem., 2004.
[20] P. Murray-Rust, H. S. Rzepa, and M. Wright. Development of chemical markup language (CML) as a system for handling complex chemical content. New J. Chem., 2001.
[21] J. W. Raymond, E. J. Gardiner, and P. Willett. RASCAL: Calculation of graph similarity using maximum common edge subgraphs. The Computer Journal, 45(6):631–644, 2002.
[22] B. Settles. ABNER: An open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 21(14):3191–3192, 2005.
[23] D. Shasha, J. T. L. Wang, and R. Giugno. Algorithmics and applications of tree and graph searching. In Proc. PODS, 2002.
[24] B. Sun, P. Mitra, H. Zha, C. L. Giles, and J. Yen. Topic segmentation with shared topic detection and alignment of multiple documents. In Proc. SIGIR, 2007.
[25] B. Sun, Q. Tan, P. Mitra, and C. L. Giles. Extraction and search of chemical formulae in text documents on the web. In Proc. WWW, 2007.
[26] W. J. Wilbur, G. F. Hazard, G. Divita, J. G. Mork, A. R. Aronson, and A. C. Browne. Analysis of biomedical text for chemical names: A comparison of three methods. In Proc. AMIA Symp., 1999.
[27] P. Willett, J. M. Barnard, and G. M. Downs. Chemical similarity searching. J. Chem. Inf. Comput. Sci., 38(6):983–996, 1998.
[28] X. Yan and J. Han. CloseGraph: Mining closed frequent graph patterns. In Proc. SIGKDD, 2003.
[29] X. Yan, F. Zhu, P. S. Yu, and J. Han. Feature-based substructure similarity search. ACM Transactions on Database Systems, 2006.
[30] G. Yang. The complexity of mining maximal frequent itemsets and maximal frequent patterns. In Proc. SIGKDD, 2004.
[31] Y. Zhao and G. Karypis. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 2005.