I Forest 2003

Motivation
Research activity in association rule mining has been initially focused on defining efficient
algorithms to perform the computationally intensive knowledge extraction task (i.e.,
frequent itemset mining). The data to be analyzed was (possibly) extracted from the
DBMS and stored into binary files (i.e., flat files). Many algorithms, both memorybased and disk-based, are focused on specialized data structures and buffer
management strategies to efficiently extract the desired type of knowledge from a flat
dataset. However, no attempt to achieve complete integration into a true relational
DBMS of these external structures has been proposed so far. DBMSs exploit indices to
improve the performance on complex queries. The intuition that the same approach
could be ``exported'' to the data mining domain is the driving force of this project. A
true integration of novel data mining indices into the PostgreSQL? open source
DBMS is proposed and advantages and disadvantages of the proposed disk-based
data structures are highlighted. For each index, DBMS kernel primitives have been
designed and implemented to materialize the index structure on disk and to perform
many frequent itemset extraction sessions.
I-Mine index
The I-Mine index [1,4,7] is a general and compact structure which provides tight
integration of itemset extraction in a relational DBMS. Since no constraint is enforced
during the index creation phase, I-Mine provides a complete representation of the
original database. To reduce the I/O cost, data accessed together during the same
extraction phase are clustered on the same disk block. The I-Mine index structure can
be efficiently exploited by different itemset extraction algorithms. In particular, IMine data access methods currently support the FP-growth and LCM v.2 algorithms,
but they can straightforwardly support the enforcement of various constraint
categories. The I-Mine index has been integrated into the PostgreSQL? DBMS and
exploits its physical level access methods.
Experiments, run for both sparse and dense data distributions, show the efficiency of the
proposed index and its linear scalability also for large datasets. Itemset mining
supported by the I-Mine index shows performance always comparable with, and
often (especially for low supports) better than, state of the art algorithms accessing
data on flat file.
I-Forest index
A technique for the incremental update of the index, suitable for association rule extraction
on evolving databases, has been proposed and implemented. The content of evolving
databases (e.g., transactional data from large retail chains, web server logs, call-detail
records) is periodically updated through insertion (or deletion) of data blocks. Data
can be described as a sequence of incoming data blocks, where new blocks arrive
periodically. Different kinds of analysis could be performed over such data like (i)
mining all available data, (ii) mining only the most recent data (e.g., last month data),
(iii) mining periodical data (e.g., quarterly data) and (iv) mining selected data blocks
(e.g., data related to the first month of last year and the first month of this year).
Itemset mining on sequences of incoming data blocks may require a significant
amount of main memory during the extraction process. By means of the proposed
index [2, 3, 5, 6], called I-Forest, incoming data blocks are stored on disk in
appropriate (compact) structures. During the mining phase, only the data required by
the current mining process is actually loaded in main memory. To allow different
kinds of analysis and easy incremental insertion of new blocks (or deletion of obsolete
ones), each data block is represented separately and independently of all others. The
proposed structure is a covering index that represents transactional blocks in a
succinct form. During the creation phase no support constraint is enforced (i.e., the
index provides a complete representation of the evolving data). Hence, item, support
and time constraints may be enforced during the extraction phase. Experiments have
been run for both sparse and dense data distributions. Experimental results show
that frequent itemset mining based on the proposed index is efficient both in terms of
extraction time and memory usage. The performance of the proposed algorithm
exploiting the index structure is always comparable with and for low support
threshold faster than the best implementation algorithm accessing static data on flat
file.
Publications
[1] Baralis E., Cerquitelli T., Chiusano S., IMine: Index Support for Itemset Mining. IEEE
Transactions on Knowledge and Data Engineering. Accepted on August 2008. In
press.
[2] E. Baralis, T. Cerquitelli, S. Chiusano. Constrained itemset mining on a sequence of
incoming data blocks. Special Issue on International Journal of Intelligent Systems,
John Wiley & Sons Inc. Accepted on November 2007. In press.
[3] E. Baralis, T. Cerquitelli, S. Chiusano. Indexing Evolving Databases for Itemset Mining.
In: ''Intelligent Techniques and Tools for Novel System Architectures'' in the series of
''Studies in Computational Intelligence'', Chountas Panagiotis, Petrounias Ilias,
Kacprzyk Janusz, Springer Verlag. ISBN: 978-3-540-77621-5
[4] Baralis E., Cerquitelli T., Chiusano S., Index Support for Frequent Itemset Mining in a
Relational DBMS., In: ICDE 2005: The 21th IEEE International Conference on Data
Engineering, Tokyo 05-08 April 2005, pp. 754-765, 2005, ISBN: 0-7695-2285-8
[5] Baralis E., Cerquitelli T., Chiusano S., Itemset Mining on Indexed Data Blocks, In: The
3rd International IEEE Conference Intelligent Systems, 4-6 September 2006, pp. 820825, 2006, ISBN: 1-4244-0196-8
[6] Baralis E., Cerquitelli T., Chiusano S., Mostile D., Index Support for Mining Data
Streams in a Relational DBMS, In: SEBD 2005: The Thirteenth Italian Symposium
on Advanced Database Systems, SEBD 2005 19-22 June 2005, pp. 240-247, 2005,
ISBN: 88-548-0122-4
[7] Baralis E., Cerquitelli T., Chiusano S., Lottini A., Pansa L., Frequent Itemset Mining in
a Relational DBMS, In: SEBD 2004: The Twelfth Italian Symposium on Advanced
Database Systems, SEBD 21-23 June 2004, pp. 158-165, 2004, ISBN: 88-901409-1-7

I Forest 2003

Diunggah oleh

Informasi Dokumen

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

I Forest 2003

Diunggah oleh

Hak Cipta:

Format Tersedia

Motivation

Anda mungkin juga menyukai