
Copyright 2011, Oracle and/or its affiliates. All rights reserved.

Capture, Organize, and Analyze Big Data for Research

Agenda:

Discovering the undiscovered. Applying analytical methods to Big Data in Space Domains
Salim Ansari, Head of Technical Services, European Space Agency

Big Data & Security in Healthcare
Ron Hutchins, CTO & Associate Vice Provost for Research & Technology, Georgia Tech

Exadata for the exome. Big Data for medics & life scientists
Jerven Bolleman, Senior Scientist, SIB Swiss Institute of Bioinformatics

Supporting Researchers and the Large Hadron Collider with Oracle
Tony Cass, Database Services Head, CERN

Discovering the undiscovered. Applying analytical methods to Big Data in Space Domains

Salim Ansari
Oracle OpenWorld 2013

We have enough data in the archives to keep us busy for the next 20 years.

However, we will need 20 years to analyze all the data if we continue to use conventional analytical methods.

If only we had, in the scientific domains, one drop of the inspiration that pours into the business world...

The case of the Astronomical Archives

Based on outdated technologies
Have very little intuitive intelligence
Do not have basic analytical capabilities
Based on basic paradigms (search by object, search by attribute)

What if...

We created a powerful analytical engine to apply to the data?
We revolutionized the way in which we searched for data in the archives?
Applied marketing methods to scientific data?
Created the domain of Science Intelligence...!?

The Amazon.com example

Select an item
Be offered several other interesting options (see the recommender sketch below)

Multiwavelength Astronomy
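To make the Amazon analogy concrete, the following is a minimal sketch of an item-to-item, "users who retrieved this also retrieved" recommender built over archive access logs; the sessions, dataset names, and co-occurrence scoring are illustrative assumptions, not a description of any existing archive service.

```python
from collections import defaultdict
from itertools import permutations

# Illustrative access log: each session is the set of archive products one
# user retrieved together. Dataset names are made up for this example.
sessions = [
    {"M31_optical", "M31_xray", "M31_radio"},
    {"M31_optical", "M31_xray", "Crab_xray"},
    {"M31_xray", "M31_radio"},
]

# Count how often two products are retrieved in the same session.
co_counts = defaultdict(int)
for session in sessions:
    for a, b in permutations(session, 2):
        co_counts[(a, b)] += 1

def also_retrieved(item, k=3):
    """Return the k products most often retrieved alongside `item`."""
    scored = [(other, n) for (first, other), n in co_counts.items() if first == item]
    return [other for other, n in sorted(scored, key=lambda t: -t[1])[:k]]

# Selecting the optical image of M31 suggests the same object at other wavelengths.
print(also_retrieved("M31_optical"))
```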

The Oracle Challenge

Bring inspiration back by using methods to stimulate the mind.
Regular updates from the archives to remind you of what you are working on.
Push intelligence-induced messages about interesting topics to help focus attention on the problem at hand.

Big Data & Security in Healthcare

Ron Hutchins, CTO
Georgia Institute of Technology

Quick and dirty outline:

Motivation: why big data in healthcare?
Personalized medicine: each of us is different.
Issues around big data
Policy and security (and morality?)
Possible aids to solving issues.

Summary of Health Care in the US

Despite its high cost, the US healthcare system produces relatively short life spans, and is wasteful, inefficient and has serious safety and quality issues. ... Our approaches to care delivery and financial incentives were designed for a bygone era. Beyond that, the technology offered to practitioners has often been overly expensive, poorly designed, overly proprietary, hard to implement and difficult to use. Spurred by a unique, one-time Federal stimulus and the new mobile, wireless and cloud technologies now available, this landscape is rapidly changing. We need to understand the new driving forces and have a basic understanding of contemporary clinical...
Watson is Big Data for diagnosing: it depends on good data!

Diagnostic capabilities are based on previous diagnoses and outcomes: clinical knowledge, existing molecular and genomic data, and a vast repository of cancer case histories, in order to create an outcome- and evidence-based decision support system.*
*http://www-03.ibm.com/press/us/en/pressrelease/37235.wss

WellPoint trained Watson with 18,000 historical cases. Now Watson uses hypothesis generation and evidence-based learning to generate confidence-scored recommendations that help nurses make decisions about utilization management.*
*http://www03.ibm.com/innovation/us/watson/pdf/WellPoint_Case_Study_IMC14792.pdf

Longer-term view: collecting your personal data in the cloud?

Jawbone UP
Fitbit
Nike+ Fuelband
Basis
Lark
RunKeeper
MapMyFitness

Chronic Disease Management

Changes in healthcare laws (meaningful use) change how providers are paid and ultimately impact our ability to better manage overall healthcare costs.
Application of analytics, mobility, process improvement, etc. to healthcare is a hot area with respect to research, commercialization, and education -- it is also inspiring stuff...
But:
A health record sells on the black market for $50, an SSN for $0.50, raising the stakes for fraud.
Why have we not been hacked? We're just lucky.

The Problems with Data

Much of the data today is falling on the floor after being sampled; we don't know how to store it or use it cost-effectively.
Data is messy: many current data standards are proprietary at best in practice, and those on the horizon are cumbersome.
Anonymization/de-identification helps some for research (a sketch follows below); all of this is disruptive to industry incumbents.
Pipelines for data analysis/research vary greatly, and most of it starts without knowing what specific data is needed.
We live in a mobile/remote-access world, and space is at a premium.
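One possible shape for the anonymization/de-identification mentioned above, sketched minimally: drop direct identifiers and replace the record key with a keyed pseudonym so records can still be linked for research. The field names, the HMAC-SHA-256 choice, and the separately stored key are assumptions for illustration, not a description of any institution's actual practice.

```python
import hashlib
import hmac

# Assumption: the pseudonymisation key is stored separately from the data.
PSEUDONYM_KEY = b"site-specific secret"
DIRECT_IDENTIFIERS = {"patient_id", "name", "ssn", "address"}

def deidentify(record):
    """Drop direct identifiers and replace the patient id with a keyed pseudonym."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["pseudonym"] = hmac.new(
        PSEUDONYM_KEY, record["patient_id"].encode(), hashlib.sha256
    ).hexdigest()[:16]
    return cleaned

print(deidentify({"patient_id": "12345", "name": "J. Doe", "ssn": "000-00-0000",
                  "address": "1 Main St", "hba1c": 6.1}))
```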

What are the research issues today?

How can we architect systems and networks to efficiently scale given the constraints?

Networks: private networks
Security: privacy and protection
Identity management: access control, auth-z
Computation: how do we analyze/mine?
Storage and databases: scaling and securing
Tools and services: monitoring/management/alerting, auditing access

So, your personal data in the cloud?

According to an NIH assistant director, the greatest risk of data loss in healthcare is unencrypted data on an unlocked laptop that gets lost.
Multi-tenancy: hosting multiple datasets on the same hardware is complicated.
Multi-use of a dataset: sharing the same dataset across multiple research projects is against policy at times.
Mash-ups: combining protected data with other datasets (e.g. weather) is against some policy.
Policy differs based on who creates it: are there risks of confusion and human error in data hosting?
Cloud-based apps: data loss and exposure? (e.g. the Dropbox breach)

How are we securing the data: worst case?

HIPAA is all about protecting the identity of the patient, but this is also about protecting against fraud and protecting reputation.
Two-factor physical access to a controlled environment (dedicated).
Two-factor data access to a controlled and dedicated set of resources; no network access beyond resources dedicated to the project.
Data (including derivative data) must be encrypted at rest and in flight, read-only to researchers, with compliant destruction, oversight of removal of the results, and not stored or mixed with data sets which are not explicitly allowed (an encryption sketch follows this list).
Training, IRB- and data-source-specific Data Management Plans and procedures, logging, and auditing of access and activity; this includes researchers and system/network administration.
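A minimal sketch of what "encrypted at rest" can look like in practice, using the Fernet recipe from the widely used Python cryptography package; the file name and key handling are illustrative assumptions, not the controls mandated by any specific data-use agreement.

```python
from cryptography.fernet import Fernet

# Assumption: the key is generated once and kept in a separate key store,
# never alongside the encrypted data.
key = Fernet.generate_key()
fernet = Fernet(key)

# Encrypt derivative results before they touch disk...
ciphertext = fernet.encrypt(b"derived cohort statistics, not for plain-text storage")
with open("results.enc", "wb") as fh:
    fh.write(ciphertext)

# ...and decrypt only inside the controlled environment.
with open("results.enc", "rb") as fh:
    plaintext = fernet.decrypt(fh.read())
```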

And then there is...

Moral issues around big data? The NSA has caused a stir.
So, what about metadata? Is it important?
Do we protect metadata the same way as personal data?

Exadata for the exome: Big Data for medics & life scientists

Jerven Bolleman
Developer
SIB Swiss Institute of Bioinformatics

46 groups
650 collaborators

2013 SIB Swiss Institute of Bioinformatics, Oracle Open World 2013

Strategic Goal
To provide key competencies & research support to the medical & life science community

Volume: 3 PB storage
Velocity: 24 TB per week
Value
Variety: 100s of sources
Veracity: experimental uncertainty

Exadata meets Exome

DNA =~ genes: the menu in the restaurant
RNA: what the waiter writes down
Proteins: the meal delivered
Exome =~ the list of all ordered meals

M Murtaza et al. Nature 000, 1-5 (2013) doi:10.1038/nature12065


150 GB of information per person
CycliX data
Ensembl data

Exadata compared to current database hardware

1/4 Exadata: CPU bandwidth is the limiting factor, no IO wait
Current hardware (8 core / 24 GB RAM, SAN FC storage): IO wait

Compared to the best result of our current hardware.

VALUE
Secure: access, backups
Upgradable
Maintainable
Fast data ingestion
Works within the week

SQL/SPARQL
Low developer costs
Set-based analytics
Query language: good for questions
Adaptable: new question, new query (a query sketch follows below)
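To illustrate the "new question, new query" point, here is a minimal sketch that issues a SPARQL query from Python with the SPARQLWrapper package. The endpoint URL and the query itself (counting reviewed human protein entries) are assumptions chosen for the example, not the queries used in the talk.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Assumption: querying a public SPARQL endpoint such as UniProt's.
endpoint = SPARQLWrapper("https://sparql.uniprot.org/sparql")
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
SELECT (COUNT(?protein) AS ?n)
WHERE {
  ?protein a up:Protein ;
           up:organism taxon:9606 ;
           up:reviewed true .
}
""")

# A "new question" is just a new query string; the data stays where it is.
result = endpoint.query().convert()
print(result["results"]["bindings"][0]["n"]["value"])
```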

jervenbolleman
jerven.bolleman@isb-sib.ch
dmitry.kuznetsov@isb-sib.ch

This is Dmitry, who did the hard work.

Supporting Researchers
and the Large Hadron Collider
with Oracle
Tony Cass
Head of Database Services


Oracle Parallel Query vs Hadoop

Graphics courtesy of Zbigniew Baranowski


Exadata test results

With HCC I could run the benchmark analysis on 2 TB of data in 210 seconds, with I/O reads up to ~10 GB/s: 27% faster than without Hybrid Columnar Compression (HCC).
Hybrid Columnar Compression reduced my data volume by a factor of 3.2.
HCC made electron-selection faster.
HCC made jet-selection slower (no offloading due to the external function used for b-tagging).
Query activity without HCC: scan electron-table, scan jet-table.
Query activity with HCC: scan electron-table, scan jet-table.

Slide courtesy of Maaike Limper

Maybe another Exadata test in future:
With more complex analysis
In-memory columnar beta?

Hadoop vs Oracle

Hadoop version of the Z+H benchmark analysis:
Physics data stored as comma-delimited text files in the Hadoop filesystem (HDFS)
Reproduce the Z+H benchmark analysis with MapReduce code (Java!)
Mappers: one mapper per object to select muons, electrons, etc.
Reduce: select events with 2 good leptons and 2 b-jets, calculate the invariant mass (a sketch of this step follows below)

Hadoop: 179 seconds (limited by CPU)
Oracle (parallel 40): 150 seconds (limited by iowait)

Slide courtesy of Maaike Limper

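A minimal sketch of the reduce step described above, assuming each mapper has already emitted (event_id, object) pairs with pt/eta/phi/E fields; the selection cuts, the b-tag threshold, and the choice of the dilepton invariant mass are illustrative assumptions, not the actual Z+H benchmark code.

```python
import math
from collections import defaultdict

def invariant_mass(p1, p2):
    """Invariant mass of two objects given pt, eta, phi, E (natural units)."""
    px = lambda o: o["pt"] * math.cos(o["phi"])
    py = lambda o: o["pt"] * math.sin(o["phi"])
    pz = lambda o: o["pt"] * math.sinh(o["eta"])
    e = p1["E"] + p2["E"]
    x, y, z = px(p1) + px(p2), py(p1) + py(p2), pz(p1) + pz(p2)
    return math.sqrt(max(e * e - x * x - y * y - z * z, 0.0))

def reduce_event(objects):
    """Keep events with 2 good leptons and 2 b-jets; return the dilepton mass."""
    leptons = [o for o in objects if o["type"] in ("e", "mu") and o["pt"] > 20.0]
    bjets = [o for o in objects if o["type"] == "jet" and o.get("btag", 0.0) > 0.8]
    if len(leptons) >= 2 and len(bjets) >= 2:
        leptons.sort(key=lambda o: -o["pt"])
        return invariant_mass(leptons[0], leptons[1])
    return None

# Group mapper output by event id, then reduce per event (toy values).
mapper_output = [
    (1, {"type": "e", "pt": 45.0, "eta": 0.3, "phi": 1.2, "E": 47.0}),
    (1, {"type": "mu", "pt": 38.0, "eta": -0.8, "phi": -2.0, "E": 51.0}),
    (1, {"type": "jet", "pt": 60.0, "eta": 1.1, "phi": 0.4, "E": 100.0, "btag": 0.9}),
    (1, {"type": "jet", "pt": 55.0, "eta": -1.5, "phi": 2.9, "E": 130.0, "btag": 0.95}),
]
events = defaultdict(list)
for event_id, obj in mapper_output:
    events[event_id].append(obj)
print({eid: reduce_event(objs) for eid, objs in events.items()})
```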

Hadoop vs Oracle

I/O read speed comparison for the Z+H benchmark:
Hadoop: up to 1600 MB/s
Oracle DB: up to 2100 MB/s

Slide courtesy of Maaike Limper

Comparison with root

Run the root file-based analysis in parallel on all 5 nodes:
127 ntuples copied and divided over the 5 x 12 disks in our cluster
Use a set of scripts to run the ntuple analysis on all nodes:
Copy the root macro to each node
12 root jobs per node simultaneously, each job reading from a single disk
Merge the histograms produced by the sub-jobs to obtain the final result
Like a miniature version of running jobs on the grid (a dispatch-and-merge sketch follows below)

root-ntuple analysis: 86 seconds
root-ntuple IO: up to 500 MB/s

Note that root uses column-based storage! The amount of data read is less (the benchmark uses 45 out of 4000 stored variables).

Slide courtesy of Maaike Limper
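A minimal sketch of the dispatch-and-merge pattern described above: run several jobs per node in parallel, one per disk, and merge their per-job histograms at the end. The host names, the analyse_ntuples command, and the dict-based histogram merge are hypothetical placeholders; the real workflow runs root macros and merges root histograms.

```python
import subprocess
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NODES = ["node1", "node2", "node3", "node4", "node5"]   # hypothetical host names
DISKS = [f"/data{i}" for i in range(1, 13)]             # 12 disks per node

def run_job(node, disk):
    """Run one analysis job on `node` reading ntuples from `disk` (illustrative command)."""
    cmd = ["ssh", node, "analyse_ntuples", disk]        # placeholder executable
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # Assume each job prints "bin count" pairs; parse them into a histogram.
    histogram = Counter()
    for line in out.splitlines():
        bin_label, count = line.split()
        histogram[bin_label] += int(count)
    return histogram

# 12 jobs per node, each reading from a single disk, merged into one result.
with ThreadPoolExecutor(max_workers=len(NODES) * len(DISKS)) as pool:
    partial = pool.map(lambda args: run_job(*args),
                       [(n, d) for n in NODES for d in DISKS])

final_histogram = Counter()
for h in partial:
    final_histogram.update(h)
```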
