
Copyright 2011, Oracle and/or its affiliates. All rights reserved.

Capture, Organize, and Analyze Big Data for Research

Agenda:

Discovering the undiscovered. Applying analytical methods to Big Data in Space Domains
Salim Ansari, Head of Technical Services, European Space Agency

Big Data & Security in Healthcare
Ron Hutchins, CTO & Associate Vice Provost for Research & Technology, Georgia Tech

Exadata for the exome. Big Data for medics & life scientists
Jerven Bolleman, Senior Scientist, SIB Swiss Institute of Bioinformatics

Supporting Researchers and the Large Hadron Collider with Oracle
Tony Cass, Database Services Head, CERN

Discovering the undiscovered. Applying analytical methods to Big Data in Space Domains

Salim Ansari
Oracle OpenWorld 2013

We have enough data in the archives to keep us busy for the next 20 years.

However, we will need 20 years to analyze all the data if we continue to use conventional analytical methods.

If only we had, in the scientific domains, one drop of the inspiration that pours into the business world...

The case of the Astronomical Archives

Based on outdated technologies
Have very little intuitive intelligence
Do not have basic analytical capabilities
Based on basic paradigms (search by object, search by attribute)

What if...

We created a powerful analytical engine to apply to the data?
We revolutionized the way in which we searched for data in the archives?
Applied marketing methods to scientific data?
Created the domain of Science Intelligence...!?

The Amazon.com example

Select an item
Be offered several other interesting options (see the recommender sketch below)

Multiwavelength Astronomy
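To make the Amazon analogy concrete, the following is a minimal sketch of an item-to-item, "users who retrieved this also retrieved" recommender built over archive access logs; the sessions, dataset names, and co-occurrence scoring are illustrative assumptions, not a description of any existing archive service.

```python
from collections import defaultdict
from itertools import permutations

# Illustrative access log: each session is the set of archive products one
# user retrieved together. Dataset names are made up for this example.
sessions = [
    {"M31_optical", "M31_xray", "M31_radio"},
    {"M31_optical", "M31_xray", "Crab_xray"},
    {"M31_xray", "M31_radio"},
]

# Count how often two products are retrieved in the same session.
co_counts = defaultdict(int)
for session in sessions:
    for a, b in permutations(session, 2):
        co_counts[(a, b)] += 1

def also_retrieved(item, k=3):
    """Return the k products most often retrieved alongside `item`."""
    scored = [(other, n) for (first, other), n in co_counts.items() if first == item]
    return [other for other, n in sorted(scored, key=lambda t: -t[1])[:k]]

# Selecting the optical image of M31 suggests the same object at other wavelengths.
print(also_retrieved("M31_optical"))
```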

The Oracle Challenge

Bring inspiration back by using methods to stimulate the mind.
Regular updates from the archives to remind you of what you are working on.
Push intelligence-induced messages about interesting topics to help focus attention on the problem at hand.

Big Data & Security in Healthcare

Ron Hutchins, CTO
Georgia Institute of Technology

Quick and dirty outline:

Motivation: why big data in healthcare?
Personalized medicine: each of us is different.
Issues around big data
Policy and security (and morality?)
Possible aids to solving issues.

Summary of Health Care in the US

Despite its high cost, the US healthcare system produces relatively short life spans, and is wasteful, inefficient and has serious safety and quality issues. ... Our approaches to care delivery and financial incentives were designed for a bygone era. Beyond that, the technology offered to practitioners has often been overly expensive, poorly designed, overly proprietary, hard to implement and difficult to use. Spurred by a unique, one-time Federal stimulus and the new mobile, wireless and cloud technologies now available, this landscape is rapidly changing. We need to understand the new driving forces and have a basic understanding of contemporary clinical...
Watson is Big Data for diagnosing: it depends on good data!

Diagnostic capabilities are based on previous diagnoses and outcomes: clinical knowledge, existing molecular and genomic data, and a vast repository of cancer case histories, in order to create an outcome- and evidence-based decision support system.*
*http://www-03.ibm.com/press/us/en/pressrelease/37235.wss

WellPoint trained Watson with 18,000 historical cases. Now Watson uses hypothesis generation and evidence-based learning to generate confidence-scored recommendations that help nurses make decisions about utilization management.*
*http://www03.ibm.com/innovation/us/watson/pdf/WellPoint_Case_Study_IMC14792.pdf

Longer-term view: collecting your personal data in the cloud?

Jawbone UP
Fitbit
Nike+ Fuelband
Basis
Lark
RunKeeper
MapMyFitness

Chronic Disease Management

Changes in healthcare laws (meaningful use) change how providers are paid and ultimately impact our ability to better manage overall healthcare costs.
Application of analytics, mobility, process improvement, etc. to healthcare is a hot area with respect to research, commercialization, and education -- it is also inspiring stuff...
But:
A health record sells on the black market for $50, an SSN for $0.50, raising the stakes for fraud.
Why have we not been hacked? We're just lucky.

The Problems with Data

Much of the data today is falling on the floor after being sampled; we don't know how to store it or use it cost-effectively.
Data is messy: many current data standards are proprietary at best in practice, and those on the horizon are cumbersome.
Anonymization/de-identification helps some for research (a sketch follows below); all of this is disruptive to industry incumbents.
Pipelines for data analysis/research vary greatly, and most of it starts without knowing what specific data is needed.
We live in a mobile/remote-access world, and space is at a premium.
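One possible shape for the anonymization/de-identification mentioned above, sketched minimally: drop direct identifiers and replace the record key with a keyed pseudonym so records can still be linked for research. The field names, the HMAC-SHA-256 choice, and the separately stored key are assumptions for illustration, not a description of any institution's actual practice.

```python
import hashlib
import hmac

# Assumption: the pseudonymisation key is stored separately from the data.
PSEUDONYM_KEY = b"site-specific secret"
DIRECT_IDENTIFIERS = {"patient_id", "name", "ssn", "address"}

def deidentify(record):
    """Drop direct identifiers and replace the patient id with a keyed pseudonym."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["pseudonym"] = hmac.new(
        PSEUDONYM_KEY, record["patient_id"].encode(), hashlib.sha256
    ).hexdigest()[:16]
    return cleaned

print(deidentify({"patient_id": "12345", "name": "J. Doe", "ssn": "000-00-0000",
                  "address": "1 Main St", "hba1c": 6.1}))
```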

What are the research issues today?

How can we architect systems and networks to efficiently scale given the constraints?

Networks: private networks
Security: privacy and protection
Identity management: access control, auth-z
Computation: how do we analyze/mine?
Storage and databases: scaling and securing
Tools and services: monitoring/management/alerting, auditing access

So, your personal data in the cloud?

According to an NIH assistant director, the greatest risk of data loss in healthcare is unencrypted data on an unlocked laptop that gets lost.
Multi-tenancy: hosting multiple datasets on the same hardware is complicated.
Multi-use of a dataset: sharing the same dataset across multiple research projects is against policy at times.
Mash-ups: combining protected data with other datasets (e.g. weather) is against some policy.
Policy differs based on who creates it: are there risks of confusion and human error in data hosting?
Cloud-based apps: data loss and exposure? (e.g. the Dropbox breach)

How are we securing the data: worst case?

HIPAA is all about protecting the identity of the patient, but this is also about protecting against fraud and protecting reputation.
Two-factor physical access to a controlled environment (dedicated).
Two-factor data access to a controlled and dedicated set of resources; no network access beyond resources dedicated to the project.
Data (including derivative data) must be encrypted at rest and in flight, read-only to researchers, with compliant destruction, oversight of removal of the results, and not stored or mixed with data sets which are not explicitly allowed (an encryption sketch follows this list).
Training, IRB- and data-source-specific Data Management Plans and procedures, logging, and auditing of access and activity; this includes researchers and system/network administration.
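A minimal sketch of what "encrypted at rest" can look like in practice, using the Fernet recipe from the widely used Python cryptography package; the file name and key handling are illustrative assumptions, not the controls mandated by any specific data-use agreement.

```python
from cryptography.fernet import Fernet

# Assumption: the key is generated once and kept in a separate key store,
# never alongside the encrypted data.
key = Fernet.generate_key()
fernet = Fernet(key)

# Encrypt derivative results before they touch disk...
ciphertext = fernet.encrypt(b"derived cohort statistics, not for plain-text storage")
with open("results.enc", "wb") as fh:
    fh.write(ciphertext)

# ...and decrypt only inside the controlled environment.
with open("results.enc", "rb") as fh:
    plaintext = fernet.decrypt(fh.read())
```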

And then there is...

Moral issues around big data? The NSA has caused a stir.
So, what about metadata? Is it important?
Do we protect metadata the same way as personal data?

Exadata for the exome: Big Data for medics & life scientists

Jerven Bolleman
Developer
SIB Swiss Institute of Bioinformatics

46 groups
650 collaborators

2013 SIB Swiss Institute of Bioinformatics, Oracle Open World 2013

Strategic Goal
To provide key competencies & research support to the medical & life science community

Volume: 3 PB storage
Velocity: 24 TB per week
Value
Variety: 100s of sources
Veracity: experimental uncertainty

Exadata meets Exome

DNA =~ genes: the menu in the restaurant
RNA: what the waiter writes down
Proteins: the meal delivered
Exome =~ the list of all ordered meals

M Murtaza et al. Nature 000, 1-5 (2013) doi:10.1038/nature12065


150 GB of information per person
CycliX data
Ensembl data

Exadata compared to current database hardware

1/4 Exadata: CPU bandwidth is the limiting factor, no IO wait
Current hardware (8 core / 24 GB RAM, SAN FC storage): IO wait

Compared to the best result of our current hardware.

VALUE
Secure: access, backups
Upgradable
Maintainable
Fast data ingestion
Works within the week

SQL/SPARQL
Low developer costs
Set-based analytics
Query language: good for questions
Adaptable: new question, new query (a query sketch follows below)
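To illustrate the "new question, new query" point, here is a minimal sketch that issues a SPARQL query from Python with the SPARQLWrapper package. The endpoint URL and the query itself (counting reviewed human protein entries) are assumptions chosen for the example, not the queries used in the talk.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Assumption: querying a public SPARQL endpoint such as UniProt's.
endpoint = SPARQLWrapper("https://sparql.uniprot.org/sparql")
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
SELECT (COUNT(?protein) AS ?n)
WHERE {
  ?protein a up:Protein ;
           up:organism taxon:9606 ;
           up:reviewed true .
}
""")

# A "new question" is just a new query string; the data stays where it is.
result = endpoint.query().convert()
print(result["results"]["bindings"][0]["n"]["value"])
```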

jervenbolleman
jerven.bolleman@isb-sib.ch
dmitry.kuznetsov@isb-sib.ch

This is Dmitry, who did the hard work.

Supporting Researchers
and the Large Hadron Collider
with Oracle
Tony Cass
Head of Database Services


Oracle Parallel Query vs Hadoop

Graphics courtesy of Zbigniew Baranowski


Exadata test results

With HCC I could run the benchmark analysis on 2 TB of data in 210 seconds, with I/O reads up to ~10 GB/s: 27% faster than without Hybrid Columnar Compression (HCC).
Hybrid Columnar Compression reduced my data volume by a factor of 3.2.
HCC made electron-selection faster.
HCC made jet-selection slower (no offloading due to the external function used for b-tagging).
Query activity without HCC: scan electron-table, scan jet-table.
Query activity with HCC: scan electron-table, scan jet-table.

Slide courtesy of Maaike Limper

Maybe another Exadata test in future:
With more complex analysis
In-memory columnar beta?

Hadoop vs Oracle

Hadoop version of the Z+H benchmark analysis:
Physics data stored as comma-delimited text files in the Hadoop filesystem (HDFS)
Reproduce the Z+H benchmark analysis with MapReduce code (Java!)
Mappers: one mapper per object to select muons, electrons, etc.
Reduce: select events with 2 good leptons and 2 b-jets, calculate the invariant mass (a sketch of this step follows below)

Hadoop: 179 seconds (limited by CPU)
Oracle (parallel 40): 150 seconds (limited by iowait)

Slide courtesy of Maaike Limper

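A minimal sketch of the reduce step described above, assuming each mapper has already emitted (event_id, object) pairs with pt/eta/phi/E fields; the selection cuts, the b-tag threshold, and the choice of the dilepton invariant mass are illustrative assumptions, not the actual Z+H benchmark code.

```python
import math
from collections import defaultdict

def invariant_mass(p1, p2):
    """Invariant mass of two objects given pt, eta, phi, E (natural units)."""
    px = lambda o: o["pt"] * math.cos(o["phi"])
    py = lambda o: o["pt"] * math.sin(o["phi"])
    pz = lambda o: o["pt"] * math.sinh(o["eta"])
    e = p1["E"] + p2["E"]
    x, y, z = px(p1) + px(p2), py(p1) + py(p2), pz(p1) + pz(p2)
    return math.sqrt(max(e * e - x * x - y * y - z * z, 0.0))

def reduce_event(objects):
    """Keep events with 2 good leptons and 2 b-jets; return the dilepton mass."""
    leptons = [o for o in objects if o["type"] in ("e", "mu") and o["pt"] > 20.0]
    bjets = [o for o in objects if o["type"] == "jet" and o.get("btag", 0.0) > 0.8]
    if len(leptons) >= 2 and len(bjets) >= 2:
        leptons.sort(key=lambda o: -o["pt"])
        return invariant_mass(leptons[0], leptons[1])
    return None

# Group mapper output by event id, then reduce per event (toy values).
mapper_output = [
    (1, {"type": "e", "pt": 45.0, "eta": 0.3, "phi": 1.2, "E": 47.0}),
    (1, {"type": "mu", "pt": 38.0, "eta": -0.8, "phi": -2.0, "E": 51.0}),
    (1, {"type": "jet", "pt": 60.0, "eta": 1.1, "phi": 0.4, "E": 100.0, "btag": 0.9}),
    (1, {"type": "jet", "pt": 55.0, "eta": -1.5, "phi": 2.9, "E": 130.0, "btag": 0.95}),
]
events = defaultdict(list)
for event_id, obj in mapper_output:
    events[event_id].append(obj)
print({eid: reduce_event(objs) for eid, objs in events.items()})
```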

Hadoop vs Oracle

I/O read speed comparison for the Z+H benchmark:
Hadoop: up to 1600 MB/s
Oracle DB: up to 2100 MB/s

Slide courtesy of Maaike Limper

Comparison with root

Run the root file-based analysis in parallel on all 5 nodes:
127 ntuples copied and divided over the 5 x 12 disks in our cluster
Use a set of scripts to run the ntuple analysis on all nodes:
Copy the root macro to each node
12 root jobs per node simultaneously, each job reading from a single disk
Merge the histograms produced by the sub-jobs to obtain the final result
Like a miniature version of running jobs on the grid (a dispatch-and-merge sketch follows below)

root-ntuple analysis: 86 seconds
root-ntuple IO: up to 500 MB/s

Note that root uses column-based storage! The amount of data read is less (the benchmark uses 45 out of 4000 stored variables).

Slide courtesy of Maaike Limper
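A minimal sketch of the dispatch-and-merge pattern described above: run several jobs per node in parallel, one per disk, and merge their per-job histograms at the end. The host names, the analyse_ntuples command, and the dict-based histogram merge are hypothetical placeholders; the real workflow runs root macros and merges root histograms.

```python
import subprocess
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NODES = ["node1", "node2", "node3", "node4", "node5"]   # hypothetical host names
DISKS = [f"/data{i}" for i in range(1, 13)]             # 12 disks per node

def run_job(node, disk):
    """Run one analysis job on `node` reading ntuples from `disk` (illustrative command)."""
    cmd = ["ssh", node, "analyse_ntuples", disk]        # placeholder executable
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # Assume each job prints "bin count" pairs; parse them into a histogram.
    histogram = Counter()
    for line in out.splitlines():
        bin_label, count = line.split()
        histogram[bin_label] += int(count)
    return histogram

# 12 jobs per node, each reading from a single disk, merged into one result.
with ThreadPoolExecutor(max_workers=len(NODES) * len(DISKS)) as pool:
    partial = pool.map(lambda args: run_job(*args),
                       [(n, d) for n in NODES for d in DISKS])

final_histogram = Counter()
for h in partial:
    final_histogram.update(h)
```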
