
Large-Scale Scientific Knowledge Discovery:
Problems and Potential Approach

Alok Choudhary, Professor


Director: Center for Ultra-Scale Computing and Security
Dept. of Electrical Engineering and Computer Science
And Kellogg School of Management
Northwestern University
choudhar@ece.northwestern.edu
Acknowledgements:
DOE (SciDAC)
NSF (HECURA, CRI, Fellowships)
Students: Kenin Coloma, Avery Ching, Ramanathan, Berkin, Jianwei Li (now on Wall Street), Ying Liu (now faculty at Chinese Academy of Sciences), Joe Zambreno (now faculty at Iowa State), Wei-Keng Liao (Research Prof. at NWU), G. Memik (Asst. Prof. at NWU)
Scientific Data Management and Analysis:
Productivity and Performance

Scientific simulations & experiments produce terabytes (disks) to petabytes (tapes) of data:
• Climate Modeling
• Astrophysics
• Genomics and Proteomics
• High Energy Physics

Data Manipulation:
• Getting files from storage
• Extracting subset of data from files
• Reformatting data
• Getting data from heterogeneous, distributed systems
• Moving data over the network

SDM requirements:
• Optimizing shared access from storage systems
• Metadata
• High-dimensional indexing
• Adaptive file caching
• Parallel File Systems
• Runtime libraries
• Data mining

Current: Data Manipulation ~80% of time; Scientific Analysis & Discovery ~20% of time
Goal: Data Manipulation ~20% of time; Scientific Analysis & Discovery ~80% of time

Goals: optimize and simplify
• access to very large datasets
• access to distributed data
• access to heterogeneous data
• data mining of very large datasets
@ANC NGDM
Oct 10, 2007
Challenges in Scientific Knowledge Discovery

Scientific Data Management
• Data management
• Query of Scientific DB
• Performance optimizations

Knowledge Discovery
• High-level interface
• Proactive
• What, not How?

Analytics and Mining
• In-place and on-line analytics
• Customized acceleration
• Scalable Mining

High-Performance I/O

Extinction and reignition in a CO/H2 jet flame

Understanding extinction/reignition in non-premixed combustion is key to flame stability and emission control in aircraft and power-producing gas turbines.

Discovered: the dominant reignition mode is due to engulfment of product gases, not flame propagation.

Figure: scalar dissipation rate, with burning and extinguished regions labeled.

The largest-ever simulations of combustion have been performed to advance this goal:
− 500 million grid points
− 11 species and 21 reactions
− 16 DOF per grid point
− 512 Cray X1E processors
− 30 TB raw data
− 2.5M hours on IBM SP at NERSC (INCITE); 400K hours on Cray X1E (ORNL)

Hawkes, Sankaran, Sutherland, Chen, 2006. DOE INCITE 2005, early user LCF/ORNL
Combustion understanding and modeling:
Detection and tracking of autoignition features on-line
Direct simulation of a 3D turbulent flame with detailed chemistry
(200 million grid points, 12 species, 5 TB raw data, 5 TB derived data)

ACK: Jacqueline Chen, SNL

Example - Mining-based Data Reduction for Multigrid Simulation
• Based on PCA of contiguous field blocks
• Astrophysics supernova simulation
• 16 to 200 times reduction per time step
Ack: Nagiza Samatova, ORNL

Figure: Timestep 390
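
The idea of reducing data through PCA of contiguous field blocks can be illustrated with a small sketch. This is not the method used for the supernova simulation; it only assumes that each block is flattened into one row of a matrix and that a few principal components capture most of the variation. The function names, the synthetic field, and the choice of k = 3 are all hypothetical.

```python
import numpy as np

def pca_compress(blocks, k):
    """Project each row of `blocks` (one flattened field block) onto k principal components."""
    mean = blocks.mean(axis=0)
    centered = blocks - mean
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                # shape (k, block_size)
    coeffs = centered @ basis.T   # shape (n_blocks, k)
    return mean, basis, coeffs

def pca_decompress(mean, basis, coeffs):
    """Rebuild an approximation of the original blocks."""
    return coeffs @ basis + mean

def reduction_factor(blocks, k):
    """Ratio of original values to stored values (mean + basis + coefficients)."""
    n_blocks, block_size = blocks.shape
    stored = block_size + k * block_size + n_blocks * k
    return blocks.size / stored

# Synthetic stand-in for one time step: 256 blocks of 64 values,
# built from 3 latent patterns plus a little noise (assumed, not real data).
rng = np.random.default_rng(0)
patterns = rng.normal(size=(3, 64))
weights = rng.normal(size=(256, 3))
field_blocks = weights @ patterns + 1e-3 * rng.normal(size=(256, 64))

mean, basis, coeffs = pca_compress(field_blocks, k=3)
restored = pca_decompress(mean, basis, coeffs)
rel_error = np.linalg.norm(restored - field_blocks) / np.linalg.norm(field_blocks)
factor = reduction_factor(field_blocks, k=3)
print(f"reduction factor: {factor:.1f}, relative error: {rel_error:.2e}")
```

The achievable reduction depends on how many components the data needs; the synthetic field here is low-rank by construction, which real simulation output may not be.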
Fusion: Using image processing/mining to analyze blob formation

First, identify well-defined blobs using image analysis.
Second, track blobs back to their source in the “sea of turbulence”.

Fundamental question: Why does turbulence produce coherent structures such as blobs?
Ack: Scott Klasky
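
The first step, identifying well-defined blobs, can be sketched as thresholding followed by connected-component labeling. The slide does not say which image-analysis technique was used, so this is only one plausible approach; the threshold, the 4-connectivity rule, and the tiny `frame` grid are assumptions made for illustration.

```python
from collections import deque

def label_blobs(image, threshold):
    """Label 4-connected regions whose values exceed `threshold`.

    Returns (labels, count), where labels[r][c] is 0 for background
    or a blob id in 1..count.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] <= threshold or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one blob
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] > threshold
                            and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count

def blob_centroids(labels, count):
    """Mean (row, col) position of each labeled blob."""
    sums = {k: [0, 0, 0] for k in range(1, count + 1)}
    for r, row in enumerate(labels):
        for c, k in enumerate(row):
            if k:
                sums[k][0] += r
                sums[k][1] += c
                sums[k][2] += 1
    return {k: (s[0] / s[2], s[1] / s[2]) for k, s in sums.items()}

# Hypothetical 4x5 intensity frame containing two bright regions.
frame = [
    [0.1, 0.9, 0.8, 0.0, 0.0],
    [0.0, 0.7, 0.1, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.9],
    [0.2, 0.0, 0.0, 0.0, 0.8],
]
labels, n_blobs = label_blobs(frame, threshold=0.5)
centers = blob_centroids(labels, n_blobs)
print(n_blobs, centers)
```

Centroids computed per frame give one possible handle for the second step, since blobs in consecutive frames could be matched by position to follow them back in time.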
Cosmology
ENZO: simulates the formation of galaxies from
the beginning of the universe to the present day

Figure panels: Data set 1 and Data set 2

Each data set contains 491,520 particles

Simulation Data Sets Dynamically Change

Figure: stream of climate simulation data (t=t0, new t=t1, new t=t2), with incremental update via fusion.

• Scientific simulations (e.g., climate modeling and supernova explosion) typically run for days to months and produce data sets on the order of one to ten terabytes per simulation.
• Effectively and efficiently analyzing these streams of data is a challenge:
  • Static analysis techniques are not sufficient. Any change requires complete re-computation.

Computations MUST be able to efficiently analyze streams of data while they are being produced, rather than wait until they are produced.
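
As a small illustration of analyzing a stream without re-computation, the sketch below keeps a running mean and variance and merges in a summary of each new time step as it arrives. It uses the well-known Welford update and pairwise merge formulas, which are a stand-in here and not necessarily the fusion technique the slide refers to; the class name and the sample time steps are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RunningStats:
    """Count, mean, and sum of squared deviations, updatable incrementally."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x):
        """Fold in a single new value (Welford's update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Fuse another summary into this one without revisiting raw data."""
        if other.n == 0:
            return
        total = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mean += delta * other.n / total
        self.n = total

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self.m2 / self.n if self.n else 0.0

def summarize(values):
    """Build a summary for one newly produced time step."""
    stats = RunningStats()
    for v in values:
        stats.update(v)
    return stats

# Hypothetical output of three consecutive time steps t0, t1, t2.
timesteps = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

global_stats = RunningStats()
for step in timesteps:
    global_stats.merge(summarize(step))  # incremental update, no re-scan

print(global_stats.n, global_stats.mean, global_stats.variance)
```

Each merge costs a constant amount of work regardless of how much data has already been seen, which is the property a static, recompute-everything analysis lacks.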
Simulation-Produced Data Sets Require Mining

Challenge: Develop effective & efficient methods for mining scientific data sets

• Tera & Petabytes: Existing methods do not scale in terms of time and storage.
• Distributed: Existing methods work on a single centralized dataset. Data transfer is prohibitive.
• High-dimensional: Existing methods do not scale up with the number of dimensions.
• Dynamic: Existing methods work with static data. Changes lead to complete re-computation.

Supernova Explosion:
• 1-D simulation: 2 GB
• 2-D simulation: 1 TB
• 3-D simulation: 50 TB
Scientific Work-Flow

In-Place On-Line Scalable Mining

Figure: layered software stack, shown as two columns (left | right):
Application
MPI/MPI-IO Library | MPI-based analytics functions
Parallel File System / Storage Functions | Active storage functions / Mining & Analytics library
Traditional Storage & I/O nodes | Active Storage & Analytics Nodes

Accelerating and Computing in the Storage

Figure: two application-execution workflows, side by side. Both contain the stages Simulation, Problem setup/decomposition, Measure, I/O and Storage access, Analyze, Archive, and Manage. One diagram marks Analyze as on-line.

Active Storage System
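
The benefit of computing in the storage can be illustrated with a toy model. The sketch below is not an active storage API; `StorageNode`, its methods, and the data are all hypothetical. It only contrasts shipping every value to the client with running a small function where the data resides and shipping back the result.

```python
class StorageNode:
    """Hypothetical storage node holding one block of simulation output."""

    def __init__(self, values):
        self.values = list(values)

    def read_all(self):
        # Traditional path: every stored value travels to the client.
        return list(self.values)

    def run(self, func):
        # Active-storage path: analysis runs next to the data,
        # and only the (small) result travels to the client.
        return func(self.values)

def partial_sum_count(values):
    """Per-node summary that is enough to compute a global mean."""
    return (sum(values), len(values))

# Four nodes, each holding 1000 values (assumed layout).
nodes = [StorageNode(range(i * 1000, (i + 1) * 1000)) for i in range(4)]

# Traditional: move the data, then analyze on the client.
moved_traditional = 0
total, count = 0, 0
for node in nodes:
    data = node.read_all()
    moved_traditional += len(data)
    total += sum(data)
    count += len(data)
mean_traditional = total / count

# Active storage: analyze in place, move only the partial results.
partials = [node.run(partial_sum_count) for node in nodes]
moved_active = sum(len(p) for p in partials)
mean_active = sum(s for s, _ in partials) / sum(c for _, c in partials)

print(mean_traditional, mean_active, moved_traditional, moved_active)
```

Both paths produce the same mean; the difference is in how many values cross the network, which is the cost that in-place analysis aims to reduce.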
Distance kernel in Clustering data mining: Speedup over a 2.4GHz AMD Opteron

PCA
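
For readers unfamiliar with the distance kernel in clustering, the sketch below shows the kind of computation involved: finding, for every point, the closest cluster center by squared Euclidean distance. This is a plain software illustration, not the accelerated kernel measured on the slide, and the function names and sample data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_to_nearest(points, cluster_centers):
    """Return, for each point, the index of its closest cluster center."""
    assignments = []
    for p in points:
        dists = [squared_distance(p, c) for c in cluster_centers]
        assignments.append(dists.index(min(dists)))
    return assignments

# Hypothetical 2-D points and two cluster centers.
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.9, 5.0)]
cluster_centers = [(0.0, 0.0), (5.0, 5.0)]
assignments = assign_to_nearest(points, cluster_centers)
print(assignments)
```

This inner loop runs once per point per center on every clustering iteration, which is why it is a natural target for customized acceleration.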
from other application
domains?
• 25-dimensional performance and characterization data. Mining used to cluster.
• NU MINEBENCH
• http://cucis.ece.northwestern.edu/projects/DMS/MineBench.html

Figure: cluster number (y-axis, 0 to 11) for each benchmark, by suite (SPEC INT, SPEC FP, MediaBench, TPC-H, MineBench). Benchmarks on the x-axis, in extraction order: gcc, Q17, bzip2, rawcaudio, apriori, scalparc, Q3, Q4, Q6, gzip, birch, semphy, bayesian, hop, mcf, rsearch, wupwise, snp, swim, mesa, mpeg2, mgrid, encode, pegwit, fuzzy, eclat, svm-rfe, kMeans, vortex, parser, gs, apsi, epic, genenet, lucas, equake, cjpeg, vpr, twolf, art, toast.

Community Resource:
MineBench Project Homepage
http://cucis.ece.northwestern.edu/projects/DMS

