
Building A Big Data

Data Warehouse
Integrating Structured and Unstructured Data

DAMA IOWA October 2013

Krish Krishnan
Founder Sixth Sense Advisors Inc
Discussion Focus
! Big data and the data warehouse: the new landscape
! Technology overview: Hadoop, NoSQL, Cassandra,
BigQuery, Drill, Redshift, AWS (S3, EC2); programming
with MapReduce; understanding analytical requirements,
self-service discovery platforms
! The challenges of data processing: Workloads; data
management; infrastructure limitations
! Next-generation data warehouse: Solution architectures; the
three Ss: scalability, sustainability, and stability
2 @2013 Copyright Sixth Sense Advisors
A New Landscape
A Growing Trend
Requirement | Expectations | Reality
Speed | Speed of the Internet | Speed = Infra + Arch + Design
Accessibility | Accessibility of a smartphone | BI tool licenses & security
Usability | iPad - mobility | Web-enabled BI tool
Availability | Google Search | Data & report metadata
Delivery | Speed of questions | Methodology & signoff
Data | Access to everything | Structured data
Scalability | Cloud (Amazon) | Existing infrastructure
Cost | Cell phone or free WiFi | Millions
Expectations for BI are changing without anyone telling us
State of Data Today
Data Growth Trends
Facebook has an average of 30 billion pieces of content added every month
YouTube receives 24 hours of video every minute
15 billion mobile phones are predicted to be in use by 2015
A leading retailer in the UK collects 1.5 billion pieces of information to adjust prices and promotions
Amazon.com: 30% of sales come from its recommendation engine
A Boeing jet engine produces 20TB/hour for engineers to examine in real time to make improvements
CERN's Large Hadron Collider produces 15PB of data for each cycle of execution.
Decision Support = #Fail?
! Decision support platforms of today are not satisfying the
needs of the business user
! Decisions being driven in the organization are not based on
360 degree views of the organization and its performance
! Business transformations are not completely successful due
to the lack of information presented in the Business
Intelligence Architecture
! Analytics and Key Performance Indicators are not available
in a timely manner and the data that is presented is not
sufficient to complete any business decisions with utmost
confidence
State of the Data
Warehouse
What We Have Built
Business Thinking
New Data Increasing
Complexity
Increase
Quality of
Service
Increase
Agility
Digital
Intelligence
Customer Centric
Cost driven
TCO
Opportunity Cost
Competitive Cost
Digital
Connected
Mobile
Metrics Driven
Big Data
Social Media
Corporate Data
New Data
Smarter
Consumer
Global
Competition
Cost
CIO Thinking
Flexibility
Reliability
Simplicity
Scalability
Complexity
Architects Thinking
Users Needs
Every Data, All Shapes, Sizes and Formats Are Needed By The Users
Why The Database
Alone Cannot Be The
Platform
The Limitations of Databases
The Disappointment
! Distributed
! Transactional Databases
! Data Warehouses
! Datamarts
! Analytical Databases
! CRM Databases
! SCM Databases
! ERP Databases
! Redundant
! Weak Metadata
! Weak Integration
Base Graph Courtesy Dr. Richard Hackathorn
Why The Data Warehouse Fails
Action time or Action distance
Time
Business Value
Data Latency
Analysis Latency
Decision Latency
Business Situation
Data is ready
Information is available
Decision is made
Lost Value = Sum(Latencies) + Opportunity Cost
Data Warehouse Computing Today
Diagram: transactional systems and ODSs feed the enterprise data warehouse through data transformation; datamarts and analytical databases then serve reports, dashboards, analytic models, and other applications.
The Bottom Line
! We have designed, architected, and deployed systems built on
architectures that were never intended for complex processing and
compute requirements
! The real issue is that the architectures designed for the RDBMS
platform differ widely in their ability to handle diverse types of
workloads
! In order to design and manage complex workloads, architects need to
understand the underlying platform's capabilities in relation to the
type of workload being designed
Shared Everything Architecture
! Resources are distributed
and shared
! CPUs are shared across the
databases
! Memory is shared across
CPUs and databases
! Disk architecture is shared
across CPUs
! The big disadvantage is that
sharing resources limits
scalability
! Adding resources increases
cost without a corresponding
linear gain in scalability and
performance
Issues
! Shared Everything architecture cannot scale and handle
workloads effectively
! You cannot achieve 100% linear scalability in a shared
architecture environment
! Compute and store happen in disparate environments
! Infrastructure limitations create more latencies in the overall
system
! Data governance is a complex subject area that adds to the
weakness of the architecture
Big Data Example
To: Bob.Collins@bankwithus.com

Dear Mr. Collins,

This email is in reference to my bank account which has been efficiently
handled by your bank for more than five years. There has been no problem
till date until last week the situation went out of the hand.

I have deposited one of my high amount cheque to my bank account no:
65656512 which was to be credited same day but due to your staff
carelessness it wasn't done and because of this negligence my reputation
in the market has been tarnished. Furthermore I had issued one payment
cheque to the party which was showing bounced due to Insufficient
balance just because my cheque didn't make on time.

My relationship with your bank has matured with the time and it's a
shame to tell you about this kind of services are not acceptable when it is
question of somebody's reputation. I hope you got my point and I am
attaching a copy of the same for further rapid procedures and remit into
my account in a day.

Yours sincerely

Daniel Carter

Ph: 564-009-2311
Big Data Example
! We will often imply additional information in spoken language by the way we
place stress on words.
! The sentence "I never said she stole my money" demonstrates the importance
stress can play in a sentence, and thus the inherent difficulty a natural language
processor can have in parsing it.
! "*I* never said she stole my money" - Someone else said it, but I didn't.
! "I *never* said she stole my money" - I simply didn't ever say it.
! "I never *said* she stole my money" - I might have implied it in some way, but I
never explicitly said it.
! "I never said *she* stole my money" - I said someone took it, I didn't say it was
she.
! "I never said she *stole* my money" - I just said she probably borrowed it.
! "I never said she stole *my* money" - I said she stole someone else's money.
! "I never said she stole my *money*" - I said she stole something, but not my
money.
! Depending on which word the speaker places the stress, this sentence could have
several distinct meanings.
Example source: Wikipedia
The Normal Way Results In
Impact on Data Warehouse
New Data Types
New volume
New analytics
New workload
New metadata
POOR
Performance
Failed
Programs
Scalability; Sharding; ACID;
Why Big Data Can Fail?
ACID is Not Good All The Time
! Atomic: all of the work in a transaction completes
(commit) or none of it completes
! Consistent: a transaction transforms the database
from one consistent state to another consistent state.
Consistency is defined in terms of constraints.
! Isolated: the results of any changes made during a
transaction are not visible until the transaction has
committed.
! Durable: the results of a committed transaction
survive failures
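As a contrast to what BASE-style systems later relax, the atomicity and isolation properties above can be sketched in a few lines. This is a toy model, not a real database engine:

```python
# Toy illustration of "atomic" and "isolated": staged writes are
# invisible until commit, and commit applies all of them at once.
class Transaction:
    def __init__(self, db):
        self.db = db
        self.pending = {}

    def write(self, key, value):
        self.pending[key] = value      # staged, not yet visible (isolation)

    def commit(self):
        self.db.update(self.pending)   # all of the work completes at once
        self.pending = {}

    def rollback(self):
        self.pending = {}              # none of it completes

db = {"balance": 100}
t = Transaction(db)
t.write("balance", 50)
t.write("fee", 5)
print(db)        # {'balance': 100} -- changes invisible until commit
t.commit()
print(db)        # {'balance': 50, 'fee': 5}
```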
Where Do We Go?
Diagram: tools, instructions, and data
Next Generation
Technologies
Integrating Big Data
Innovations
Category New Frontiers
Infrastructure Big Data and Data Warehouse Appliances
In-Memory Technologies
SSD Storage
Fast Networks
Cloud
Mobile Technologies
Software In-memory Databases
Hadoop, Cassandra & NoSQL Ecosystems
Columnar DBMS
Improved ETL-Hadoop integration (Informatica, Talend)
Algorithms Mahout
Pre-Configured
Architectures
IBM, Teradata, Kognitio, EMC, Cloudera, HortonWorks,
Cirro, Intel, Cisco UCS, Pivotal, Oracle, MapR
BIG Data - Infrastructure
Requirements
! Scalable platform
! Database independent
! Fault tolerant
! Low cost of acquisition
! Scalable and Reliable Storage
! Supported by standard toolsets
! Datacenter Ready
Big Data Workload Demands
! Process dynamic data content
! Process unstructured data
! Systems that can scale up with high volume data
! Systems that can scale out with high volume of users
! Perform complex operations within reasonable response
time
Parallel databases
! Shared-nothing MPP architecture (a collection of
independent machines, each with local hard disk and main
memory, connected together on high-speed network)
! Machines are cheaper, lower-end, commodity hardware
! Scales well up to a point, tens of nodes
! Good performance
! Poor fault tolerance
! Problems with heterogeneous environment (machines must
be equal in performance)
! Good support for flexible query interface
Data Warehouse Appliance
High Availability
Standard SQL Interface
Advanced Compression
MPP
Leverages existing BI, ETL and OLTP investments
Hadoop & MapReduce Interface / Embedded
Minimal disk I/O bottleneck; simultaneously load & query
Auto Database Management
A Data Warehouse (DW)
Appliance is an integrated set of
servers, storage, OS, database
and interconnect specifically
preconfigured and tuned for the
rigors of data warehousing.
DW appliances offer an
attractive price / performance
value proposition and are
frequently a fraction of the cost
of traditional data warehouse
solutions.
Hadoop Evolution
Hadoop
Why Hadoop
! Commodity HW
! Built on inexpensive servers
! Storage servers and their disks are not assumed to be highly reliable and available
! Modular expansion
! Metadata-data oriented design
! Namenode maintains metadata
! Datanodes manage data placement and store
! Computation happens close to data
! Servers have dual goals: data storage and computation
! Single store-and-compute cluster vs. separate clusters
! File-System Architecture
! Focus is mostly sequential access
! Single writers
! No file locking features
Hadoop Architecture
HDFS
! Hadoop Distributed File System
! A scalable, fault-tolerant, high-performance distributed file system
! Asynchronous replication
! Write-once and read-many (WORM)
! No RAID required
! Access from C, Java, Thrift
! NameNode holds filesystem
metadata
! Files are broken up and spread
over the DataNodes
HDFS Splits & Replication
! Data is organized into files and
directories
! Files are divided into uniform sized
blocks and distributed across cluster
nodes
! Blocks are replicated to handle
hardware failure
! Filesystem keeps checksums of data
for corruption detection and
recovery
! HDFS exposes block placement so
that computation can be migrated to
data
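The splits-and-replication behavior above can be sketched as follows. This is an illustrative model with invented node and rack names, not HDFS source code:

```python
# Sketch: divide a file into fixed-size blocks and assign rack-aware
# replicas, in the spirit of the HDFS policy described on this slide.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the whole file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(nodes_by_rack, local_rack, replication=3):
    """First replica on the local rack, the remaining replicas on a
    different rack (a simplification of the classic placement policy)."""
    first = nodes_by_rack[local_rack][0]
    remote_rack = next(r for r in nodes_by_rack if r != local_rack)
    remotes = nodes_by_rack[remote_rack][:replication - 1]
    return [first] + remotes

blocks = split_into_blocks(200 * 1024 * 1024)           # a 200 MB file
racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}  # invented names
print(len(blocks))                       # 4 blocks: 64 + 64 + 64 + 8 MB
print(place_replicas(racks, "rack1"))    # ['n1', 'n3', 'n4']
```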
HDFS
! Data Node
! Stores data in HDFS
! Can be found in multiples
! Data is replicated across data
nodes
! File size
! A typical block size is 64MB (or
even 128 MB).
! A file is chopped into 64MB
chunks and stored.
! Name Node
! The Name Node is the heartbeat of an HDFS file system.
! It keeps the directory of all files in the file system, and tracks how
each file's data is distributed across the cluster.
! It does not store the data of these files itself.
! Cluster configuration management
! Transaction Log management
! Features
! HDFS provides Java API for
application to use.
! Python access is also used in
many applications.
! A C language wrapper for Java
API is also available.
! An HTTP browser can be used
to browse the files of a HDFS
instance.
HDFS Features

Data Correctness
- File creation: the client computes a checksum per 512 bytes; the DataNode stores the checksum
- File access: the client retrieves the data and checksum from the DataNode
- If validation fails, the client tries other replicas

Data Pipeline
- Client retrieves a list of DataNodes on which to place replicas of a block
- Client writes the block to the first DataNode
- The first DataNode forwards the data to the next DataNode in the pipeline
- When all replicas are written, the client moves on to write the next block in the file

Rebalancer
- Usually run when new DataNodes are added
- Cluster stays online while the Rebalancer is active
- Rebalancer is throttled to avoid network congestion
- Command line tool

Block Placement
- First replica on a node in the local rack
- Second replica on a different rack
- Third replica on the same rack as the second replica
- Clients read from the nearest replica

Heartbeats
- DataNodes send a heartbeat to the NameNode once every 3 seconds
- The NameNode uses heartbeats to detect DataNode failure

Replication Engine
- Chooses new DataNodes for new replicas
- Balances disk usage
- Balances communication traffic to DataNodes
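The per-512-byte checksum scheme described under Data Correctness can be mimicked in a few lines. This sketch uses CRC32; HDFS's actual checksum plumbing differs:

```python
import zlib

CHUNK = 512  # checksum granularity described on this slide

def checksum_chunks(data, chunk=CHUNK):
    """One CRC32 per 512-byte chunk, as a client might compute on file creation."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def validate(data, checksums, chunk=CHUNK):
    """Re-verify on read; a mismatch would send the client to another replica."""
    return checksum_chunks(data, chunk) == checksums

payload = bytes(range(256)) * 8          # 2048 bytes -> 4 chunks
sums = checksum_chunks(payload)
print(len(sums), validate(payload, sums))   # 4 True

corrupted = b"X" + payload[1:]           # flip the first byte
print(validate(corrupted, sums))            # False -> try another replica
```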
HBASE
! Clone of BigTable (Google)
! Implemented in Java (clients: Java, C++, Ruby, ...)
! Column-oriented data store
! Distributed over many servers
! Tolerant of machine failure
! Layered over HDFS
! Strong consistency
! It's not a relational database (no joins)
! Sparse data: nulls are stored for free
! Supports semi-structured and unstructured data
! Versioned data storage capability
! Extremely scalable: goal of billions of rows x millions of columns
! HBase provides storage for the Hadoop Distributed Computing Environment.
! Data is logically organized into tables, rows and columns.
Hive
! Data summarization and ad-hoc query
interface on top of Hadoop
! MapReduce for Execution & HDFS for
storage
! Hive Query Language
! Basic SQL : Select, From, Join, Group By
! Equi-Join, Multi-Table Insert, Multi-Group-By
! Batch query
! MetaStore
! Table/Partitions properties
! Thrift API: current clients in PHP (web interface), Python interface
to Hive, Java (query engine and CLI)
! Metadata stored in any SQL backend
Image: Cloudera Hive Tutorial
HBase-Hive Integration
Diagram: a Hive table definition points to an existing HBase table and to its columns, optionally under different names.
Pig
! Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data
analysis programs
! Pig generates and compiles Map/Reduce programs on the fly
! Abstracts you from specific details
! Focus on data processing
! Data flow
! Built For data manipulation
! Pig is workflow driven and is easy to maintain
Sqoop
! Sqoop is a tool designed to help users import existing relational
databases into their Hadoop clusters
! Automatic data import
! SQL to Hadoop
! Easy import of data from many databases to Hadoop
! Generates code for use in MapReduce applications
! Integrates with Hive
Zookeeper
! All servers store a copy of the data
! A leader is elected at startup
! Followers service clients; all updates go through the leader
! Update responses are sent when a majority of servers have persisted the
change
AVRO
! A data serialization system that provides dynamic integration
with scripting languages
! Avro Data
! Expressive
! Smaller and faster
! Dynamic
! Schema stored with data
! APIs permit reading and creating
! Includes a file format and a textual encoding
! Generates JSON metadata automatically
AVRO
! Avro RPC
! Leverages versioning support
! Provides cross-language access to Hadoop services
Chukwa
A data collection system for managing large distributed systems
Built on HDFS and MapReduce
Toolkit for displaying, monitoring, and analyzing log files
Flume
! Flume is:
! A scalable, configurable, extensible and manageable distributed
data collection service
! Developed as open source
! One-stop solution for data collection of all formats
! Flexible reliability guarantees allow careful performance tuning
! Enables quick iteration on new collection strategies
Oozie
! Workflow Engine in Hadoop
! HTTP and command line interface + Web console
! Used to
! Execute and monitor workflows in Hadoop
! Periodic scheduling of workflows
! Trigger execution by data availability
Hadoop Differentiator

Schema-on-Write: RDBMS
! Schema must be created before data is loaded.
! An explicit load operation has to take place which transforms the data
to the internal structure of the database.
! New columns must be added explicitly before data for such columns can
be loaded into the database.
! Read is fast.
! Standards/governance.

Schema-on-Read: Hadoop
! Data is simply copied to the file store; no special transformation is
needed.
! A SerDe (Serializer/Deserializer) is applied during read time to
extract the required columns.
! New data can start flowing anytime and will appear retroactively once
the SerDe is updated to parse it.
! Load is fast.
! Evolving schemas/agility.
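The schema-on-read idea can be made concrete with a tiny sketch. Field names and records here are invented for illustration, not taken from any real Hive table:

```python
# Raw lines are stored untouched; a SerDe-like function imposes
# structure only when the data is read.
raw_store = [
    "2013-10-01,click,42",
    "2013-10-01,view,7",
]

def serde(line):
    """Deserialize one raw record into named columns at read time."""
    date, event, count = line.split(",")
    return {"date": date, "event": event, "count": int(count)}

# "Load" is just an append -- fast, no transformation:
raw_store.append("2013-10-02,click,3")

# Structure appears only on read; changing serde() re-shapes old rows too:
rows = [serde(line) for line in raw_store]
clicks = sum(r["count"] for r in rows if r["event"] == "click")
print(clicks)  # 45
```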
HadoopDB
! A recent study at Yale University's database research department
! Hybrid architecture of parallel databases and MapReduce
system
! The idea is to combine the best qualities of both
technologies
! Multiple single-node databases are connected using Hadoop
as the task coordinator and network communication layer
! Queries are distributed across the nodes by MapReduce
framework, but as much work as possible is done in the
database node
Slide courtesy: Dr. Daniel Abadi
HadoopDB architecture
Slide courtesy: Dr. Daniel Abadi
Hadoop Limitations
! Write-once model
! A namespace with an extremely large number of files exceeds the
NameNode's capacity to maintain it
! Cannot be mounted by existing OS
! Getting data in and out is tedious
! Virtual File System can solve problem
! HDFS does not implement / support
! User quotas
! Access permissions
! Hard or soft links
! Data balancing schemes
! No periodic checkpoints
Hadoop Tips
! Hadoop is useful
! When you must process lots of unstructured
data
! When running batch jobs is acceptable
! When you have access to lots of cheap
hardware
! Hadoop is not useful
! For intense calculations with little or
no data
! When your data is not self-contained
! When you need interactive results
Implementation
Think big, start small
Build on agile cycles
Focus on the data, as you will always
develop schema on read.
Available Optimizations
Input to Maps
Map only jobs
Combiner
Compression
Speculation
Fault Tolerance
Buffer Size
Parallelism (threads)
Partitioner
Reporter
DistributedCache
Task child environment
settings
Hadoop Tips
! Performance Tuning
! Increase the memory/buffer
allocated to the tasks
! Increase the number of tasks that
can be run in parallel
! Increase the number of threads
that serve the map outputs
! Disable unnecessary logging
! Turn on speculation
! Run reducers in one wave as they
tend to get expensive
! Tune the usage of
DistributedCache, it can increase
efficiency
! Troubleshooting
! Are your partitions uniform?
! Can you combine records at the
map side?
! Are maps reading off a DFS block
worth of data?
! Are you running a single reduce
wave (unless the data size per
reducers is too big) ?
! Have you tried compressing
intermediate data & final data?
! Are there buffer size issues?
! Do you see unexplained long tails?
! Are your CPU cores busy?
! Is at least one system resource
being loaded?
MapReduce
! Developed for processing large data sets.
! Contains Map and Reduce functions.
! Runs on a large cluster of machines.
! Goals
! Use machines across the data center
! Elastic scaling
! Finite programming model
Input | Map() | Copy/Sort | Reduce() | Output

Map Phase
Raw data analyzed and
converted to name/value
pair
Shuffle Phase
All name/value pairs are
sorted and grouped by their
keys
Reduce Phase
All values associated with a
key are processed for results
MapReduce
Programming model
! Input & Output: each a set of key/value pairs
! Programmer specifies two functions:
! map (in_key, in_value) -> list(out_key, intermediate_value)
! Processes input key/value pair
! Produces set of intermediate pairs
! reduce (out_key, list(intermediate_value)) -> list(out_value)
! Combines all intermediate values for a particular key
! Produces a set of merged output values (usually just
one)
Example
! Page 1: DAMA Conference is good
! Page 2: There are good ideas presented at DAMA
! Page 3: I like DAMA because of its variety of topics.
Map output
! Worker 1:
! (DAMA 1), (Conference 1), (is 1), (good 1).
! Worker 2:
! (There 1), (are 1), (good 1), (ideas 1), (presented 1), (at 1), (DAMA
1).
! Worker 3:
! (I 1), (Like 1), (DAMA 1), (Because 1), (of 1), (its 1), (variety 1), (of
1), (topics 1).
Reduce Input
! Worker 1:
! (DAMA 1), (DAMA 1), (DAMA
1)
! Worker 2:
! (is 1)
! Worker 3:
! (good 1), (good 1)
! Worker 4:
! (There 1)
! Worker 5:
! (ideas 1)
! Worker 6:
! (presented 1)
! Worker 7:
! (I 1)
! Worker 8:
! (like 1)
! Worker 9:
! (its 1)
! Worker 10:
! (variety 1)
! Worker 11:
! (Topics 1)
Reduce Output
! Worker 1:
! (DAMA 3)
! Worker 2:
! (is 1)
! Worker 3:
! (good 2)
! Worker 4:
! (There 1)
! Worker 5:
! (ideas 1)
! Worker 6:
! (presented 1)
! Worker 7:
! (I 1)
! Worker 8:
! (like 1)
! Worker 9:
! (its 1)
! Worker 10:
! (variety 1)
! Worker 11:
! (Topics 1)
@2013 Copyright Sixth Sense Advisors 64
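The whole word-count flow over the three pages can be simulated in a few lines. This single-process sketch normalizes case and strips trailing periods (unlike the slides, which keep original casing), so only the counts match:

```python
from collections import defaultdict

def map_fn(page):
    """map(in_key, in_value) -> list of (word, 1) intermediate pairs."""
    return [(word.strip(".").lower(), 1) for word in page.split()]

def reduce_fn(word, counts):
    """reduce(out_key, values) -> merged count for that word."""
    return (word, sum(counts))

pages = [
    "DAMA Conference is good",
    "There are good ideas presented at DAMA",
    "I like DAMA because of its variety of topics.",
]

# Shuffle phase: group intermediate pairs by key.
groups = defaultdict(list)
for page in pages:
    for word, one in map_fn(page):
        groups[word].append(one)

result = dict(reduce_fn(w, c) for w, c in groups.items())
print(result["dama"], result["good"], result["topics"])  # 3 2 1
```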
MapReduce Strengths
! Tunable
! Fine grained Map and Reduce tasks
! Improved load balancing
! Faster recovery from failed tasks
! Good fault tolerance
! Can scale to thousands of nodes
! Supports heterogeneous environments
! Automatic re-execution on failure
! Localized execution
! With large data, eliminates bandwidth problem by scheduling execution close to
location of data when possible
! Map-Reduce + HDFS is a very effective solution for scaling in a distributed
geographical environment
NoSQL
! Stands for Not Only SQL
! Based on CAP Theorem
! Usually do not require a fixed table schema nor do they use the
concept of joins
! All NoSQL offerings relax one or more of the ACID properties
! NoSQL databases come in a variety of flavors
! XML (myXMLDB, Tamino, Sedna)
! Wide column (Cassandra, HBase, BigTable)
! Key/value (Redis, Memcached with BerkeleyDB)
! Graph (Neo4j, InfoGrid)
! Document store (CouchDB, MongoDB)
NoSQL
Chart: NoSQL systems positioned by data size vs. complexity - Amazon Dynamo, Google BigTable, Cassandra, Lotus Notes, HBase, Voldemort, and graph-theory-based stores.
Approaches to CAP
! Eric Brewer stated in 2000 at
PODC that
! You have to give up one
of the following in a
distributed system :
! Consistency of data
! Availability
! Partition tolerance
! BASE
! No ACID, use a single version of DB,
reconcile later
! Defer transaction commit
! Until partitions fixed and replicate can run
! Eventual consistency (e.g., Amazon Dynamo)
! Eventually, all copies of an object converge
! Restrict transactions (e.g., sharded MySQL)
! Single-machine transactions: objects in a transaction are on the
same machine
! Single-object transactions: a transaction can only read/write one
object
! Object timelines (PNUTS)
Consistency Model
! If copies are asynchronously updated, what can we say
about stale copies?
! ACID guarantees require synchronous updates
! Eventual consistency: Copies can drift apart, but will
eventually converge if the system is allowed to quiesce
! To what value will copies converge?
! Do systems ever quiesce?
! Is there any middle ground?
Consistency Techniques
! Per-record mastering
! Each record is assigned a master region
! May differ between records
! Updates to the record forwarded to the master region
! Ensures consistent ordering of updates
! Tablet-level mastering
! Each tablet is assigned a master region
! Inserts and deletes of records forwarded to the master region
! Master region decides tablet splits
! These details are hidden from the application
! Except for the latency impact!
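Per-record mastering can be illustrated with a toy forwarding table. Record keys and region names here are invented for illustration:

```python
# Each record has its own master region; every update is forwarded
# there, so updates to a record stay consistently ordered.
masters = {"user:1": "us-west", "user:2": "eu-central"}
logs = {"us-west": [], "eu-central": []}

def update(record, value):
    """Forward the write to the record's master region."""
    region = masters[record]
    logs[region].append((record, value))
    return region

update("user:1", "a")
update("user:2", "b")
update("user:1", "c")
print(logs["us-west"])  # [('user:1', 'a'), ('user:1', 'c')]
```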
HBASE
Architecture
Diagram: clients (Java client or REST API) connect through the HBaseMaster to HRegionServers, each backed by its own disk.
HRegion Server
! Records partitioned by column family into HStores
! Each HStore contains many MapFiles
! All writes to HStore applied to single memcache
! Reads consult MapFiles and memcache
! Memcaches flushed as MapFiles (HDFS files) when full
! Compactions limit number of MapFiles
Diagram: writes go to the HStore's memcache and are flushed to MapFiles on disk; reads consult both.
Pros and Cons
! Pros
! Log-based storage for high write throughput
! Elastic scaling
! Easy load balancing
! Column storage for OLAP workloads
! Cons
! Writes not immediately persisted to disk
! Reads cross multiple disk, memory locations
! No geo-replication
! Latency/bottleneck of HBaseMaster when using
REST
CASSANDRA
Architecture
! Facebook's storage system
! BigTable data model
! Dynamo partitioning and consistency model
! Peer-to-peer architecture
Diagram: clients connect to a peer-to-peer ring of Cassandra nodes, each with its own disk.
Routing
! Consistent hashing, like Dynamo or Chord
! Server position = hash(serverid)
! Content position = hash(contentid)
! Server responsible for all content in a hash interval
Diagram: each server is responsible for the hash interval that ends at its ring position.
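The consistent-hashing scheme above can be sketched as a minimal hash ring. This is illustrative only, not Cassandra's implementation, and the server names are invented:

```python
import bisect
import hashlib

def h(key):
    """Hash a string onto the ring (a large integer space)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        # Server position = hash(server id), kept sorted around the ring.
        self.points = sorted((h(s), s) for s in servers)

    def lookup(self, content_id):
        """Content position = hash(content id); owner is the first
        server clockwise from that position (wrapping around)."""
        keys = [p for p, _ in self.points]
        i = bisect.bisect(keys, h(content_id)) % len(self.points)
        return self.points[i][1]

ring = Ring(["server-a", "server-b", "server-c"])
owner = ring.lookup("some-content")
print(owner in {"server-a", "server-b", "server-c"})  # True
# Adding a server only remaps keys in one hash interval, not the whole space.
```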
Cassandra Server
! Writes go to log and memory table
! Periodically memory table merged with disk table
Diagram: an update goes to the commit log and the in-RAM memtable; the memtable is later merged into SSTable files on disk.
Pros and Cons
! Pros
! Elastic scalability
! Easy management
! Peer-to-peer configuration
! BigTable model is nice
! Flexible schema, column groups for partitioning, versioning, etc.
! Eventual consistency is scalable
! Cons
! Eventual consistency is hard to program against
! No built-in support for geo-replication
! Load balancing?
! System complexity
! P2P systems are complex; have complex corner cases
Cassandra Tips
Tunable memtable size
Can have large memtable flushed less frequently, or small memtable
flushed frequently
Tradeoff is throughput versus recovery time
Larger memtable will require fewer flushes, but will take a long time to
recover after a failure
With 1GB memtable: 45 mins to 1 hour to restart
Can turn off log flushing
Risk loss of durability
Replication is still synchronous with the write
Durable if updates propagated to other servers that don't fail
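The memtable-size tradeoff above can be seen in a toy model (an invented class, not Cassandra code): a larger flush threshold means fewer flushes, but a bigger commit log to replay on restart.

```python
class Memtable:
    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.table = {}
        self.log = []           # commit log entries since the last flush
        self.flushes = 0

    def write(self, key, value):
        self.log.append((key, value))   # durability first
        self.table[key] = value
        if len(self.table) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.table.clear()      # pretend we wrote an SSTable to disk
        self.log.clear()        # log no longer needed for these writes
        self.flushes += 1

small = Memtable(flush_threshold=2)
large = Memtable(flush_threshold=100)
for i in range(10):
    small.write(f"k{i}", i)
    large.write(f"k{i}", i)

# Small memtable: many flushes, short log to replay after a crash.
# Large memtable: no flushes yet, but 10 log entries to replay.
print(small.flushes, large.flushes, len(large.log))  # 5 0 10
```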
NoSQL
Best Practices
Design for data collection
Plan the data store
Organize by type and semantics
Partition for performance
Access and Query is run time
dependent
Horizontal scaling
Memory caching
Access and Query
RESTful interfaces (HTTP as an access API)
Query languages other than SQL
SPARQL - query language for the Semantic Web
Gremlin - the graph traversal language
Sones Graph Query Language
Data Manipulation / Query API
The Google BigTable DataStore API
The Neo4j Traversal API
Serialization Formats
JSON
Thrift
ProtoBuffers
RDF
Forest Rim Technology Textual ETL Engine (TETLE) is an integration tool for turning text into structured
data that can be analyzed by standard analytical tools
Textual ETL Engine
Textual ETL Engine provides a robust user
interface to define rules (or patterns / keywords) to
process unstructured or semi-structured data.
The rules engine encapsulates all the complexity
and lets the user define simple phrases and
keywords
Easy to implement and easy to realize ROI
Advantages
Simple to use
No MR or Coding required for text analysis
and mining
Extensible by Taxonomy integration
Works on standard and new databases
Produces a highly columnar key-value store,
ready for metadata integration
Disadvantages
Not integrated with Hadoop as a rules
interface
Currently uses Sqoop for metadata
interchange with Hadoop or NoSQL
interfaces
Current GA does not handle distributed
processing outside Windows platform
Amazon RedShift
! Goal 1 - Reduce I/O
! Direct-attached storage
! Large data block sizes
! Columnar storage
! The industry's first large-scale Data Warehouse as a Service
! Designed and architected for petabyte-scale deployment
! Goal 2 - Optimize hardware
! Optimized for I/O intensive workloads
! High disk density
! Runs on a fast network (HPC)
! Goal 3 - Extreme parallelism: increased speed and efficiency
! Loading
! Querying
! Backup
! Restore
RedShift Architecture
Diagram: SQL clients and BI tools connect to a leader node, which coordinates the compute nodes.
Picture: Amazon presentation on RedShift (Internet)
Deployment Options
! Can be hosted with RDBMS on-site and RedShift on the Cloud
Deployment Options
! Can be used as Live Archive on the Cloud
Deployment Options
! Can be used as ETL for Big Data on the Cloud
Big Data Technologies
! Apache Software Foundation
! Hadoop
! HBASE
! Zookeeper
! Oozie
! Avro
! Pig
! Sqoop
! Flume
! Cassandra
! Cloudera
! HortonWorks
! MongoDB
! IBM BigInsights
! EMC Pivotal
! Teradata Aster Big Data Appliance
! Oracle Big Data Appliance
! Intel Hadoop Distribution
! MapR
! Datastax
! Rainstor
! QueryIO

Workloads,
Architectures,
Computing
Workload
! Defined as the usage of resources (CPU, disk, and memory) by every
query: ETL, ELT, BI, and analytics
! Often misunderstood as a
Database capability
! Mostly touted by vendors as
a differentiator for their
platform
Workload
! Loading
! Continuous (near real-time)
! Batch
! Micro Batch
! Queries
! Tactical
! AdHoc
! Analytical
! Dashboard
MIXED Workload
What Are You Trying to Do?
Data Workloads
! OLTP (random access to a few records)
! OLAP (scan access to a large number of records)
! Combined (some OLTP and OLAP tasks)
! Access patterns: read-heavy, write-heavy, by rows, by columns, unstructured
Data Engineering vs.
Analysis/Warehousing
! Very different workloads, requirements
! Warehoused data for analysis includes
! Data from serving system
! Click log streams
! Syndicated feeds
! Trend towards scalable stores with
! Semi-structured data
! Map-reduce
! The result of analysis is stored in the Data
Warehouse
Workload Isolation
! Assigning the appropriate systems and processes to manage
workloads
! Creates an interchangeable infrastructure
! Provides for better scalability
! Creates a heterogeneous configuration, which can be deployed on a
homogenized platform if desired
Workload Isolation
Semi-Structured Data
Metadata
! The key to the castle in integrating Big Data is metadata
! Whatever the tool, technology and technique, if you do not
know your metadata, your integration will fail
! Semantic technologies and architectures will be the way to
process and integrate the Big Data.
! Business domain experts can identify large data patterns by
association relationships with small metadata.
The Big Data - Data
Warehouse
Multi-Tiered Workload
Application | Unstructured Data (File Based) | Semi-Structured Data (File / Digital) | Structured Data (Digital)
Social analytics, behavior analytics, recommendation engines, sentiment analytics, fraud detection | Hadoop / NoSQL | Hadoop / NoSQL | RDBMS
CRM, SalesForce, marketing | | | RDBMS
Data mining | Hadoop / NoSQL | Hadoop / NoSQL | RDBMS
System characteristics | Volume: large; Concurrency: low; Consolidation: app specific; Availability: high; Updated: near real time to monthly | Volume: large; Concurrency: medium; Consolidation/Integration: variable; Availability: medium; Updated: near real time | Volume: large; Concurrency: high; Consolidation/Integration: high; Availability: high; Updated: intra-day & daily
Reference Architecture
Which Tool
Application | Hadoop | NoSQL | Textual ETL
Machine learning | x | x |
Sentiments | x | x | x
Text processing | x | x | x
Image processing | x | x |
Video analytics | x | x |
Log parsing | x | x | x
Collaborative filtering | x | x | x
Context search | x | |
Email & content | | | x
Challenges
! Resource availability
! MR is hard to implement
! Speech to text
! Conversation context is often missing
! Quality of recording
! Accent issues
! Visual data tagging
! Images
! Text embedded within images
! Metadata is not available
! Data is not trusted
! Content management platform capabilities
! Ontology ambiguity
! Taxonomy Integration
Thank You
Krish Krishnan
rkrish1124@yahoo.com
Twitter Handle: @datagenius