
The Past, Present and Future of NoSQL

Alvin Richards, Technical Director, EMEA
alvin@10gen.com | @jonnyeight | alvinonmongodb.com

Your data needs started here...

http://bit.ly/OT71M4

...but soon you had to be here

http://bit.ly/Oxcsis

...probably using one of these

http://bit.ly/QDUIUF

Before NoSQL there was...

Since the dawn of the RDBMS


Main memory
  1970: Intel 1103, 1k bits
  2012: 4GB of RAM costs $25.99

Mass storage
  1970: IBM 3330 Model 1, 100 MB
  2012: 3TB Superspeed USB for $129

Microprocessor
  1970: Nearly (4004 being developed); 4 bits and 92,000 instructions per second
  2012: Westmere EX has 10 cores, 30MB L3 cache, runs at 2.4GHz

More recent changes


A decade ago -> Now
  Faster: Buy a bigger server -> Buy more servers
  Faster storage: A SAN with more spindles -> SSD
  More reliable storage: More expensive SAN -> More copies of local storage
  Deployed in: Your data center -> The cloud, private or public
  Large data set: Millions of rows -> Billions to trillions of rows
  Development: Waterfall -> Iterative

http://bit.ly/UmUnsU

http://bit.ly/cnP77L

http://bit.ly/ODoMhh

http://bit.ly/uW2nk

http://bit.ly/Qmg8YD

Challenges for Databases


Build a database for scaleout
  Run on clusters of 100s of commodity machines
that enables agile development
and is usable for a broad variety of applications

Is Scaleout Mission Impossible?


What about the CAP Theorem?

Brewer's theorem: Consistency, Availability, Partition Tolerance

It says that if a distributed system is partitioned, you can't accept updates everywhere and still have consistency

So, either allow inconsistency or limit where updates can be applied

Two choices for consistency


Eventual consistency
  Allow updates when a system has been partitioned
  Resolve conflicts later
  Example: CouchDB, Cassandra

Immediate consistency
  Limit the application of updates to a single master node for a given slice of data
  Another node can take over after a failure is detected
  Avoids the possibility of conflicts
  Example: MongoDB

Scaleout architecture
How do you distribute data among many servers?
Choices
  Hashes (Dynamo style) vs ranges (BigTable style)
    Tradeoff: set-and-forget vs optimizability
  Physical vs logical segments
    Very important with secondary indexes
    Tradeoff: cluster rebalancing ease vs performance optimization
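
The hash vs range choice above can be sketched in a few lines. This is only an illustration, not how Dynamo, BigTable or MongoDB implement placement: the shard names, the toy string hash and the range boundaries are all assumptions made for the example.

```javascript
// Hypothetical three-shard cluster; the names follow the diagrams in this deck.
const SHARDS = ["shard01", "shard02", "shard03"];

// Hash style: a toy (non-cryptographic) string hash spreads keys across
// shards, but neighbouring keys usually land on unrelated shards.
function hashShard(key) {
  let h = 0;
  for (const ch of key) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // keep h an unsigned 32-bit integer
  }
  return SHARDS[h % SHARDS.length];
}

// Range style: each shard owns a contiguous key range [min, max).
// The boundaries are assumed values for lowercase keys.
const RANGES = [
  { min: "a", max: "j", shard: "shard01" },
  { min: "j", max: "s", shard: "shard02" },
  { min: "s", max: "{", shard: "shard03" }, // "{" sorts just after "z"
];

function rangeShard(key) {
  const range = RANGES.find((r) => key >= r.min && key < r.max);
  return range ? range.shard : null; // null: key falls outside every range
}

console.log(hashShard("alvin"), rangeShard("alvin"));
```

With ranges, a scan over keys "a" to "c" touches one shard, which is where the optimizability comes from; with hashes the same scan would have to visit every shard.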

Why mess with the data model?


Relational minus joins and multi-statement transactions is much less useful
What about partial solutions to joins and multi-statement transactions?
  Hard to implement
  Complex for developers to understand
  Performance implications
Therefore alternatives are worth considering for distributed systems

NoSQL Alternatives
Key-Value: Redis, Voldemort, DynamoDB
Column: Cassandra
Document: MongoDB, CouchDB
Graph: Neo4j, InfiniteGraph

NoSQL and MongoDB

Tradeoff: Scale vs Functionality


[Figure: scalability & performance plotted against depth of functionality, with points for memcached, key/value and RDBMS]

What MongoDB solves


Agility
  Applications store complex data that is easier to model as documents
  Schemaless DB enables faster development cycles

Flexibility
  Relaxed transactional semantics enable easy scale out
  Auto Sharding for scale down and scale up

Cost
  Cost-effectively operationalize abundant data (clickstreams, logs, tweets, ...)

Sharding Data Distribution across nodes


Data location transparent to your code
Data distribution is automatic
Data re-distribution is automatic
Aggregate system resources horizontally
No code changes

Sharding - Range distribution


> sh.shardCollection("test.tweets", {_id: 1}, false)

Shards: shard01, shard02, shard03

Sharding - Range distribution

shard01: a-i
shard02: j-r
shard03: s-z

Sharding - Splits

shard01: a-i
shard02: ja-jz, k-r
shard03: s-z

Sharding - Splits

shard01: a-i
shard02: ja-ji, ji-js, js-jw, jz-r
shard03: s-z
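
A minimal sketch of the split step, assuming a chunk is split at its median key once it holds more than a fixed number of keys. The `splitChunk` function, the `{min, max, keys}` chunk shape and the `maxKeys` threshold are hypothetical names for this example; real splits are driven by chunk size, not by a key count.

```javascript
// Split a chunk {min, max, keys} at its median key once it holds more than
// maxKeys keys. Returns one chunk (no split) or two adjacent chunks.
function splitChunk(chunk, maxKeys) {
  if (chunk.keys.length <= maxKeys) return [chunk];
  const sorted = [...chunk.keys].sort();
  const mid = sorted[Math.floor(sorted.length / 2)]; // median key is the split point
  return [
    { min: chunk.min, max: mid, keys: sorted.filter((k) => k < mid) },
    { min: mid, max: chunk.max, keys: sorted.filter((k) => k >= mid) },
  ];
}

// A crowded chunk on shard02 covering the "j" to "r" range.
const crowded = { min: "j", max: "s", keys: ["kim", "jack", "joe", "jill"] };
console.log(splitChunk(crowded, 2));
```

The two resulting chunks still cover the original range between them, so routing stays correct and only the chunk map grows.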

Sharding - Auto Balancing

shard01: a-i, js-jw (migrating from shard02)
shard02: ja-ji, ji-js, js-jw, jz-r
shard03: s-z, jz-r (migrating from shard02)

Sharding - Auto Balancing

shard01: a-i, js-jw
shard02: ja-ji, ji-js
shard03: s-z, jz-r
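
The balancing step can be sketched as "move chunks from the fullest shard to the emptiest until the counts are within one of each other". This is an assumed, simplified policy for illustration, not MongoDB's actual balancer; the `balance` function and its input shape are hypothetical.

```javascript
// Greedy balancer sketch: repeatedly move one chunk from the shard with the
// most chunks to the shard with the fewest, until they differ by at most one.
function balance(chunksByShard) {
  const state = {};
  for (const [shard, chunks] of Object.entries(chunksByShard)) {
    state[shard] = [...chunks]; // copy so the caller's data is not mutated
  }
  const moves = [];
  for (;;) {
    const names = Object.keys(state).sort(
      (a, b) => state[a].length - state[b].length
    );
    const least = names[0];
    const most = names[names.length - 1];
    if (state[most].length - state[least].length <= 1) break;
    const chunk = state[most].pop();
    state[least].push(chunk);
    moves.push({ chunk, from: most, to: least });
  }
  return { state, moves };
}

// Layout after the splits: shard02 holds four chunks, the others one each.
const result = balance({
  shard01: ["a-i"],
  shard02: ["ja-ji", "ji-js", "js-jw", "jz-r"],
  shard03: ["s-z"],
});
console.log(result.state, result.moves);
```

The sketch ends with two chunks per shard, as in the slides, although which of the two moved chunks lands on which shard may differ from the diagram.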

Sharding - Routed Query


find({_id: "alvin"})

shard01: a-i, js-jw
shard02: ja-ji, ji-js
shard03: s-z, jz-r

Sharding - Scatter Gather


find({email: "alvin@10gen.com"})

shard01: a-i, js-jw
shard02: ja-ji, ji-js
shard03: s-z, jz-r
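
The difference between a routed query and a scatter gather comes down to whether the query contains the shard key. A rough sketch, assuming a simplified chunk table (the boundaries below are illustrative, not the exact ranges in the diagrams) and a hypothetical `targetShards` helper that only handles equality matches:

```javascript
// Simplified chunk table for a collection sharded on _id. The boundaries are
// assumed values for lowercase string keys; each chunk covers [min, max).
const CHUNKS = [
  { min: "a", max: "j", shard: "shard01" },
  { min: "j", max: "s", shard: "shard02" },
  { min: "s", max: "{", shard: "shard03" }, // "{" sorts just after "z"
];

// Decide which shards a query must visit.
function targetShards(query, shardKey = "_id") {
  if (Object.prototype.hasOwnProperty.call(query, shardKey)) {
    // Routed query: the shard key value pins the query to one chunk.
    const key = query[shardKey];
    const chunk = CHUNKS.find((c) => key >= c.min && key < c.max);
    return chunk ? [chunk.shard] : [];
  }
  // Scatter gather: without the shard key, every shard has to be asked.
  return [...new Set(CHUNKS.map((c) => c.shard))];
}

console.log(targetShards({ _id: "alvin" }));
console.log(targetShards({ email: "alvin@10gen.com" }));
```

The first call is answered by a single shard; the second fans out to all three and the results are merged.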

Sharding - Caching
shard01: 96 GB Mem, 300 GB Data (a-i, j-r, s-z), 3:1 Data/Mem

Aggregate Horizontal Resources


Total: 300 GB Data
shard01: 96 GB Mem, 100 GB Data (a-i), 1:1 Data/Mem
shard02: 96 GB Mem, 100 GB Data (j-r), 1:1 Data/Mem
shard03: 96 GB Mem, 100 GB Data (s-z), 1:1 Data/Mem
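
The ratios on these two slides follow from simple arithmetic, sketched below. The `dataToMemoryRatio` helper is a hypothetical name, and it assumes the data is spread evenly across shards.

```javascript
// Data-to-memory ratio per shard, assuming data is spread evenly.
function dataToMemoryRatio(dataGB, memPerShardGB, shardCount) {
  const dataPerShardGB = dataGB / shardCount;
  return { dataPerShardGB, ratio: dataPerShardGB / memPerShardGB };
}

// One shard: 300 GB of data against 96 GB of memory, roughly 3:1.
console.log(dataToMemoryRatio(300, 96, 1));
// Three shards: 100 GB each against 96 GB of memory, roughly 1:1.
console.log(dataToMemoryRatio(300, 96, 3));
```

Adding shards adds memory as well as disk, so a working set that was three times larger than RAM on one server comes close to fitting in RAM on three.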

Replica Sets Data Availability across nodes


Data Protection
  Multiple copies of the data
  Spread across Data Centers, AZs
High Availability
  Automated Failover
  Automated Recovery

Replica Sets

App -> Primary: Write, Read
App -> Secondary: Read
Primary -> Secondary, Secondary: Asynchronous Replication

Replica Sets

Automatic Election of new Primary

Replica Sets

Old primary: Recovering
New primary serves data (Write, Read)

Replica Sets

Nodes: Secondary, Primary, Secondary
App -> Primary: Write, Read
App -> Secondary: Read
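
A very rough sketch of the failover sequence above. It assumes two rules that the slides do not spell out: a new primary is only elected while a majority of the set's members is reachable, and among those the member with the most recent replicated operation wins. The `electPrimary` function and the `{name, up, optime}` member shape are hypothetical; MongoDB's real election protocol has more rules than this.

```javascript
// Pick a new primary from the reachable members of a replica set.
// Returns the winner's name, or null when no majority is reachable.
function electPrimary(members) {
  const reachable = members.filter((m) => m.up);
  if (reachable.length * 2 <= members.length) return null; // no majority
  // Prefer the member that has replicated the most recent operation.
  return reachable.reduce((best, m) => (m.optime > best.optime ? m : best)).name;
}

// The old primary (node1) has failed; node2 is the most up to date secondary.
const members = [
  { name: "node1", up: false, optime: 10 },
  { name: "node2", up: true, optime: 10 },
  { name: "node3", up: true, optime: 9 },
];
console.log(electPrimary(members));
```

Because replication is asynchronous, a secondary may lag the failed primary, which is why the sketch prefers the most up to date survivor.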

Challenges for Databases


Build a database for scaleout
  Run on clusters of 100s of commodity machines
that enables agile development
and is usable for a broad variety of applications

Data Model
Why JSON?
  Provides a simple, well understood encapsulation of data
  Maps simply to the object in your OO language
  Linking & Embedding to describe relationships

Schema Design - Relational Database

Schema Design - MongoDB

embedding, linking

Schemas in MongoDB
Design documents that simply map to your application
> post = {author: "Hergé",
          date: new Date(),
          text: "Destination Moon",
          tags: ["comic", "adventure"]}
> db.posts.save(post)

Embedding
> db.posts.find({author: "Hergé"})
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author : "Hergé",
  date : ISODate("2011-09-18T09:56:06.298Z"),
  text : "Destination Moon",
  tags : [ "comic", "adventure" ],
  comments : [
    { author : "Kyle",
      date : ISODate("2011-09-19T09:56:06.298Z"),
      text : "great book" }
  ]
}

JSON & Scaleout


Embedding removes need for
  Distributed Joins
  Two Phase commit
Enables data to be distributed across many nodes without penalty
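
Why embedding avoids a distributed join can be shown with plain in-memory objects standing in for collections. Everything here (the `Map`-based "collections", the `readEmbedded` and `readLinked` helpers, the `post1` id) is made up for the illustration and is not MongoDB API.

```javascript
// Embedded: the comments travel inside the post document.
const embeddedPosts = new Map([
  ["post1", {
    _id: "post1",
    author: "Hergé",
    text: "Destination Moon",
    comments: [{ author: "Kyle", text: "great book" }],
  }],
]);

// Linked: comments live in their own collection and point back at the post.
const linkedPosts = new Map([
  ["post1", { _id: "post1", author: "Hergé", text: "Destination Moon" }],
]);
const linkedComments = [
  { postId: "post1", author: "Kyle", text: "great book" },
];

// One lookup returns the post and its comments together.
function readEmbedded(postId) {
  return embeddedPosts.get(postId);
}

// Two lookups: the post, then its comments (an application-side join).
function readLinked(postId) {
  const post = linkedPosts.get(postId);
  const comments = linkedComments
    .filter((c) => c.postId === postId)
    .map(({ author, text }) => ({ author, text }));
  return { ...post, comments };
}

console.log(readEmbedded("post1"));
console.log(readLinked("post1"));
```

Both calls return the same data, but in a sharded cluster the linked version may have to visit two different shards, while the embedded document is read and written in one place.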

Challenges for Databases


Build a database for scaleout
  Run on clusters of 100s of commodity machines
that enables agile development
and is usable for a broad variety of applications

Big Data = MongoDB = Solved


Content Management
Operational Intelligence
E-Commerce
User Data Management
High Volume Data Feeds
Mobile

Location Based Service


Problem:
  Location based social networking service needs to scale to high number of users and check-ins

Solution:
  Used MongoDB deployed on EC2
  8 clusters, 40 machines, 15k QPS, 2.3 billion records
  Auto-sharding and geo-spatial indexing are key

Results:
  To date have scaled to 9m users, 3m check-ins per day, 750m total check-ins, 20m places, 400k merchants

How Telefónica uses MongoDB

London:
  O2 UK: Priority Moments location based offers service
  O2 UK: eCommerce Product Catalog

Madrid:
  M2M (machine to machine) event acquisition platform
  Personalization Server (Oracle migration)

How Telefónica uses MongoDB

M2M Event Acquisition
[Architecture diagram. Components: Apps; Event notification (Event Notifier, Portal API); Core (Event Storage, Mng Storage, Mng Platform, Mng); Event acquisition (Event Gateway); BOSS; Operator Network (MNO1, MNO2, MNOn)]

How Telefónica uses MongoDB


Product Catalog

Challenges for Databases


Build a database for scaleout
  Run on clusters of 100s of commodity machines
that enables agile development
and is usable for a broad variety of applications

10gen is the company behind MongoDB


Founded in 2007
  Dwight Merriman, Eliot Horowitz

$73M+ in funding
  Flybridge, Sequoia, Union Square, NEA

Set the direction & contribute code to MongoDB

Foster community & ecosystem

Worldwide Expanding Team
  170+ employees
  NY, CA, UK and Australia

Provide MongoDB cloud services

Provide MongoDB support services

MongoDB is the leading NoSQL solution.


#2 on Indeed's Fastest Growing Jobs

Jaspersoft BigData Index: Demand for MongoDB, the document-oriented NoSQL database, saw the biggest spike with over 200% growth in 2011.

Google Searches

451 Group: MongoDB increasing its dominance


The Evolution of MongoDB


1.8 March 11
  Journaling
  Sharding and Replica set enhancements
  Spherical geo search

2.0 Sept 11
  Index enhancements to improve size and performance
  Authentication with sharded clusters
  Replica Set Enhancements
  Concurrency improvements

2.2 Aug 12
  Aggregation Framework
  Multi-Data Center Deployments
  Improved Performance and Concurrency

2.4 winter 12

Future of NoSQL?

Future of the Data Center


Hardware
  More Cores
  More Memory
  More IOPs (SSD)
  More Capacity
  More bandwidth (100GbE)

"Auto Pilot"
  Zero human intervention

Future of NoSQL?
Real Time Analytics
  Can't wait for a batch process / ETL / DW

Ad-Hoc / Analytics
  Map/Reduce = Hammer

Greater Scale
  100s -> 1,000s of nodes
  Petabytes -> Exabytes
  Deeper history

Heterogeneous deployment
  Seamless integration with what you have
