WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China
THEOREM 3. A sequence S = e1 e2 ... en is a generator if and only if there is no i, 1 ≤ i ≤ n, such that sup(S) = sup(S^(i)).

PROOF. Easily derived from the definition of a generator and the well-known downward closure property of sequence support.

[Figure 1: Runtime Efficiency Comparison. a) Gazelle; b) ProgramTrace. Both x-axes show the minimum support threshold (in %).]
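To make the check of Theorem 3 concrete, the following is a minimal Python sketch. The helper names and the flat item-sequence representation are our own assumptions (the paper works with general sequence databases); `s[:i] + s[i+1:]` plays the role of S^(i), the sequence with its i-th item removed.

```python
def is_subsequence(s, t):
    """True if s occurs in t as a (not necessarily contiguous) subsequence."""
    it = iter(t)
    # membership tests consume the iterator, preserving item order
    return all(item in it for item in s)

def support(s, sdb):
    """Number of sequences in the database sdb that contain s."""
    return sum(1 for t in sdb if is_subsequence(s, t))

def is_generator(s, sdb):
    """Theorem 3: s is a generator iff no one-item deletion S^(i)
    has the same support as s; downward closure makes these immediate
    subsequences the only ones that need checking."""
    sup_s = support(s, sdb)
    return all(support(s[:i] + s[i+1:], sdb) != sup_s
               for i in range(len(s)))
```

For example, in a database where every sequence containing ⟨a⟩ also contains ⟨a, c⟩'s items, dropping the item `a` leaves a subsequence with equal support, so ⟨a, c⟩ fails the check.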
2.4 Algorithm
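As a reference point for this section, the unpruned pattern-growth skeleton that FEAT builds on can be sketched in Python. This is a hypothetical sketch with assumed helper names, not the paper's implementation: the forward and backward pruning steps of Algorithm 1 are omitted, items are single symbols rather than itemsets, and `min_sup` is an absolute count.

```python
def _contains(s, t):
    """True if s is a (not necessarily contiguous) subsequence of t."""
    it = iter(t)
    return all(x in it for x in s)

def _support(s, sdb):
    """Number of sequences in sdb containing s."""
    return sum(1 for t in sdb if _contains(s, t))

def mine_generators(sdb, min_sup):
    """Unpruned baseline: grow prefixes with locally frequent items,
    keeping those that pass the Theorem 3 generator check."""
    fgs = []

    def grow(prefix, projected):
        # count locally frequent items in the projected suffixes
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, sup in sorted(counts.items()):
            if sup < min_sup:
                continue
            new_prefix = prefix + [item]
            # Theorem 3: generator iff no one-item deletion keeps support
            if all(_support(new_prefix[:i] + new_prefix[i+1:], sdb) != sup
                   for i in range(len(new_prefix))):
                fgs.append(new_prefix)
            # project each suffix past its first occurrence of item
            grow(new_prefix, [sfx[sfx.index(item) + 1:]
                              for sfx in projected if item in sfx])

    grow([], sdb)
    return fgs
```

Algorithm 1 below improves on this baseline by pruning unpromising prefixes instead of enumerating every frequent sequential pattern.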
By integrating the preceding pruning methods and generator-checking scheme with a traditional pattern-growth framework [5], we can easily derive the FEAT algorithm shown in Algorithm 1. Given a prefix sequence Sp, FEAT first finds all of Sp's locally frequent items, uses each of them to grow Sp, and builds the projected database for the new prefix (lines 2-4). It applies both the forward and backward pruning techniques to prune unpromising parts of the search space (lines 8 and 11), and uses the generator-checking scheme to decide whether the new prefix is a generator (lines 7, 9, 11, and 12). Finally, if the new prefix cannot be pruned, FEAT recursively calls itself with the new prefix as input (lines 14 and 15).

Algorithm 1: FEAT(Sp, SDB_Sp, min_sup, FGS)
Input: prefix sequence Sp, Sp's projected database SDB_Sp, minimum support min_sup, and result set FGS
1  begin
2    foreach i in localFrequentItems(SDB_Sp, min_sup) do
3      Sp^i ← <Sp, i>;
4      SDB_Sp^i ← projectedDatabase(SDB_Sp, Sp^i);
5      canPrune ← false;
6      isGenerator ← true;
7      if sup(SDB_Sp) = sup(SDB_Sp^i) then
8        canPrune ← ForwardPrune(Sp, SDB_Sp, Sp^i, SDB_Sp^i);
9        isGenerator ← false;
10     if not canPrune then
11       BackwardPrune(Sp^i, SDB_Sp^i, canPrune, isGenerator);
12     if isGenerator then
13       FGS ← FGS ∪ {Sp^i};
14     if not canPrune then
15       FEAT(Sp^i, SDB_Sp^i, min_sup, FGS);
16 end

3. PERFORMANCE EVALUATION
We conducted an extensive performance study to evaluate the FEAT algorithm on a computer with an Intel Core 2 Duo E6550 CPU and 2GB of memory. Due to limited space, we report results only for some real datasets. The first dataset, Gazelle, is a Web click-stream dataset containing 29,369 sequences of Web page views. The second dataset, ProgramTrace, is a program trace dataset. The third dataset, Office07Review, contains 320 consumer reviews of Office 2007 collected from Amazon.com, of which 240 are labeled positive and 80 negative.

Figure 1 shows the runtime comparison between FEAT and PrefixSpan, a state-of-the-art algorithm for mining all sequential patterns. Figure 1a) shows that on the sparse dataset Gazelle, FEAT is slightly slower than PrefixSpan when the minimum support threshold is high; however, for minimum support thresholds below 0.026%, FEAT is significantly faster than PrefixSpan. This also validates that our pruning techniques are very effective: without pruning, FEAT would need to generate the same set of sequential patterns as PrefixSpan and then perform generator checking to remove the non-generators, so it could be no faster than PrefixSpan if the pruning techniques were not applied. Figure 1b) shows that on the dense dataset ProgramTrace, FEAT is significantly faster than PrefixSpan at every minimum support. For example, PrefixSpan needed nearly 200,000 seconds to finish even at a minimum support of 100%, while FEAT took less than 0.02 seconds.

We also used generators and sequential patterns as features to build SVM and Naïve Bayesian classifiers. The results on the Office07Review dataset show that the generator-based and sequential-pattern-based models achieve almost the same accuracy. For example, with a minimum support of 2% and a minimum confidence of 75%, both the generator-based and the sequential-pattern-based Naïve Bayesian classifiers achieve the same best accuracy of 80.6%. As mining generators is cheaper, the generator-based approach has an edge over the sequential-pattern-based approach in terms of efficiency.

4. CONCLUSIONS
In this paper we studied the problem of mining sequence generators, which, to the best of our knowledge, had not been explored previously. We proposed two novel pruning methods and an efficient generator-checking scheme, and devised a frequent generator mining algorithm, FEAT. An extensive performance study shows that FEAT is more efficient than the state-of-the-art sequential pattern mining algorithm PrefixSpan, and is very effective for classifying Web product reviews. In the future we will further explore its applications in Web page classification and click-stream data analysis.

5. ACKNOWLEDGEMENTS
This work was partly supported by the 973 Program under Grant No. 2006CB303103, and by the Program for New Century Excellent Talents in University under Grant No. NCET-07-0491, State Education Ministry of China.

6. REFERENCES
[1] R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95.
[2] J. Chen and T. Cook. Mining contiguous sequential patterns from Web logs. WWW'07 (Posters track).
[3] N. Jindal and B. Liu. Identifying comparative sentences in text documents. SIGIR'06.
[4] J. Li, et al. Minimum description length principle: Generators are preferable to closed patterns. AAAI'06.
[5] J. Pei, J. Han, et al. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01.
[6] J. Wang and J. Han. BIDE: Efficient mining of frequent closed sequences. ICDE'04.
[7] X. Yan, J. Han, and R. Afshar. CloSpan: Mining closed sequential patterns in large datasets. SDM'03.