Scientific Simulations & experiments
• Climate Modeling
• Astrophysics
• Genomics and Proteomics
• High Energy Physics
Storage: Tapes, Disks (Terabytes)
Data Manipulation:
• Getting files from storage
• Extracting subset of data from files
• Reformatting data
• Getting data from heterogeneous, distributed systems
• Moving data over the network
SDM requirements
•Optimizing shared access from storage systems
•Metadata
•High-dimensional indexing
•Adaptive file caching
•Parallel File Systems
•Runtime libraries
•Datamining
Current: Data Manipulation ~80% time, Scientific Analysis & Discovery ~20% time
Goal: Data Manipulation ~20% time, Scientific Analysis & Discovery ~80% time
Goals
Optimize and simplify:
• access to very large datasets
• access to distributed data
• access to heterogeneous data
• data mining of very large datasets
@ANC NGDM
Oct 10, 2007
Challenges in Scientific Knowledge Discovery
Scientific Data Management
•Data management
•Query of Scientific DB
•Performance optimizations
Knowledge Discovery
•High-level interface
•proactive
•What not How?
High-Performance I/O
Analytics and Mining
•In-place and on-line analytics
•Customized acceleration
•Scalable Mining
Extinction and reignition in a CO/H2 jet flame
Understanding extinction/reignition in non-premixed combustion is key to
flame stability and emission control in aircraft and power-producing
gas turbines
Discovered: the dominant reignition mode is due to engulfment of product
gases, not flame propagation
[Figure: scalar dissipation rate, with regions labeled Burning and Extinguished]
Example - Mining-based Data Reduction for Multigrid Simulation
Based on PCA of contiguous field blocks
Astrophysics supernova simulation: 16 to 200 times reduction per time step
Ack: Nagiza Samatova, ORNL
[Figure: Timestep 390]
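A minimal sketch of what "PCA of contiguous field blocks" could look like, assuming a 2-D field that tiles evenly into square blocks. The function names (`compress_blocks`, `decompress_blocks`, `reduction_factor`), the block size, and the number of retained components `k` are illustrative assumptions, not the method used in the supernova study, and the reduction factors quoted above are not reproduced here.

```python
import numpy as np

def compress_blocks(field, block, k):
    """Tile a 2-D field into contiguous block x block tiles; keep k PCA coefficients per tile."""
    ny, nx = field.shape
    assert ny % block == 0 and nx % block == 0, "sketch assumes the field tiles evenly"
    # Each row of `tiles` is one flattened contiguous block.
    tiles = (field.reshape(ny // block, block, nx // block, block)
                  .swapaxes(1, 2)
                  .reshape(-1, block * block))
    mean = tiles.mean(axis=0)
    # Principal directions are the right singular vectors of the centered tiles.
    _, _, vt = np.linalg.svd(tiles - mean, full_matrices=False)
    basis = vt[:k]                        # shape (k, block*block)
    coeffs = (tiles - mean) @ basis.T     # shape (n_tiles, k)
    return mean, basis, coeffs

def decompress_blocks(mean, basis, coeffs, shape, block):
    """Rebuild an approximation of the field from the stored mean, basis and coefficients."""
    ny, nx = shape
    tiles = coeffs @ basis + mean
    return (tiles.reshape(ny // block, nx // block, block, block)
                 .swapaxes(1, 2)
                 .reshape(ny, nx))

def reduction_factor(field, mean, basis, coeffs):
    """Ratio of original values to stored values (mean + basis + coefficients)."""
    return field.size / (mean.size + basis.size + coeffs.size)
```

The achievable reduction depends on how smooth the field is within blocks: the fewer components needed for an acceptable reconstruction error, the larger the factor.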
Fusion: Using image processing/mining to
analyze blob formation
Simulation Data Sets Dynamically Change
High-dimensional
Existing methods do not scale up with the number of dimensions
Dynamic
Existing methods work w/ static data. Changes lead to complete re-computation
Supernova Explosion:
1-D simulation: 2GB
2-D simulation: 1TB
3-D simulation: 50TB
[Figure: stream of climate simulation data]
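To illustrate the contrast between static and dynamic data, here is a minimal sketch of a statistic that is updated as new simulation values arrive instead of being recomputed over the full data set. It uses Welford's online algorithm for mean and variance, chosen here purely as a simple example; the class name `RunningStats` is hypothetical and the slides do not name a specific incremental method.

```python
from dataclasses import dataclass

@dataclass
class RunningStats:
    """Mean and population variance maintained incrementally (Welford's algorithm)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0   # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Constant work per new value; earlier values are never revisited.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.n if self.n else 0.0
```

Each `update` call costs the same regardless of how much data has already been seen, which is the property a mining method needs if a stream of time steps is not to trigger complete re-computation.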
Scientific Work-Flow
In-Place On-Line Scalable Mining
[Figure: software stack]
Application
MPI/MPI-IO Library
MPI-based analytics functions
Accelerating and Computing in the Storage
[Figure: two panels labeled "Application execution" and "Simulation", with an Active Storage System]
Data mining: Speedup over a 2.4GHz AMD Opteron
[Figure panels: Distance kernel in Clustering; PCA]
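For readers unfamiliar with the term, the distance kernel in clustering is typically the inner loop that computes point-to-centroid distances. The sketch below assumes a k-means style assignment step with squared Euclidean distance; the slides do not state which clustering algorithm or distance measure was accelerated, and the function names are illustrative.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, c))

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        # This inner distance evaluation runs once per (point, centroid) pair.
        dists = [squared_distance(p, c) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels
```

Because the distance evaluation runs once for every point and every centroid in each iteration, it tends to dominate clustering run time, which makes it a natural target for customized acceleration.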
from other application
domains?
25-dimensional performance and characterization data.
Mining used to cluster the applications
NU MINEBENCH
http://cucis.ece.northwestern.edu/projects/DMS/MineBench.html
[Figure: per-application chart, vertical axis 0 to 8. Applications: gcc, Q17, bzip2, rawcaudio, apriori, scalparc, Q3, Q4, Q6, gzip, birch, semphy, bayesian, hop, mcf, rsearch, wupwise, snp, swim, mesa, mpeg2, mgrid, encode, pegwit, fuzzy, eclat, svm-rfe, kMeans, vortex, parser, gs, apsi, epic, genenet, lucas, equake, cjpeg, vpr, twolf, art, toast]
Community Resource:
MineBench Project Homepage
http://cucis.ece.northwestern.edu/projects/DMS