MONGODB

MongoDB, CouchDB, MySQL Compare Grid

Data Model
  CouchDB: Document-Oriented (JSON)
  MongoDB: Document-Oriented (BSON)
  MySQL: Relational

Data Types
  CouchDB: string, number, boolean, array, object
  MongoDB: string, int, double, boolean, date, bytearray, object, array, others
  MySQL: link

Large Objects (Files)
  CouchDB: Yes (attachments)
  MongoDB: Yes (GridFS)
  MySQL: Blobs

Horizontal partitioning scheme
  CouchDB: CouchDB Lounge
  MongoDB: Auto-sharding
  MySQL: Partitioning

Replication
  CouchDB: Master-master (with developer-supplied conflict resolution)
  MongoDB: Master-slave and replica sets
  MySQL: Master-slave, multi-master, and circular replication

Object (row) Storage
  CouchDB: One large repository
  MongoDB: Collection-based
  MySQL: Table-based

Query Method
  CouchDB: Map/reduce of JavaScript functions to lazily build an index per query
  MongoDB: Dynamic; object-based query language
  MySQL: Dynamic; SQL

Secondary Indexes
  CouchDB: Yes
  MongoDB: Yes
  MySQL: Yes

Atomicity
  CouchDB: Single document
  MongoDB: Single document
  MySQL: Yes - advanced

Interface
  CouchDB: REST
  MongoDB: Native drivers; REST add-on
  MySQL: Native drivers

Server-side batch data manipulation
  CouchDB: ?
  MongoDB: Map/Reduce, server-side JavaScript
  MySQL: Yes (SQL)

Written in
  CouchDB: Erlang
  MongoDB: C++
  MySQL: C++

Concurrency Control
  CouchDB: MVCC
  MongoDB: Update in Place

Geospatial Indexes
  CouchDB: GeoCouch
  MongoDB: Yes
  MySQL: Spatial extensions

Distributed Consistency Model
  CouchDB: Eventually consistent (master-master replication with versioning and version reconciliation)
  MongoDB: Strong consistency. Eventually consistent reads from secondaries are available.
  MySQL: Strong consistency. Eventually consistent reads from secondaries are available.

Comparing Mongo DB and Couch DB
We are getting a lot of questions "how are mongo db and couch different?" It's a good question:
both are document-oriented databases with schemaless JSON-style object data storage. Both
products have their place -- we are big believers that databases are specializing and "one size fits
all" no longer applies.
We are not CouchDB gurus so please let us know in the forums if we have something wrong.
MVCC
One big difference is that CouchDB is MVCC based, and MongoDB is more of a traditional update-
in-place store. MVCC is very good for certain classes of problems: problems which need intense
versioning; problems with offline databases that resync later; problems where you want a large
amount of master-master replication happening. Along with MVCC comes some work too: first,
the database must be compacted periodically, if there are many updates. Second, when conflicts
occur on transactions, they must be handled by the programmer manually (unless the db also
does conventional locking -- although then master-master replication is likely lost).
MongoDB updates an object in-place when possible. Problems requiring high update rates of
objects are a great fit; compaction is not necessary. Mongo's replication works great but, without
the MVCC model, it is more oriented towards master/slave and auto failover configurations than
to complex master-master setups. With MongoDB you should see high write performance,
especially for updates.
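As a rough illustration of the update-in-place style, here is a minimal mongo shell sketch (the collection and field names are made up, not from the original article):
db.pages.insert( { _id : "home", views : 0 } );
db.pages.update( { _id : "home" }, { $inc : { views : 1 } } );   // modifies the stored document in place
db.pages.findOne( { _id : "home" } );                            // { _id : "home", views : 1 }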
Horizontal Scalability
One fundamental difference is that a number of Couch users use replication as a way to
scale. With Mongo, we tend to think of replication as a way to gain reliability/failover rather than
scalability. Mongo uses (auto) sharding as our path to scalability (sharding is GA as of 1.6). In this
sense MongoDB is more like Google BigTable. (We hear that Couch might one day add
partitioning too.)
Query Expression
Couch uses a clever index building scheme to generate indexes which support particular
queries. There is an elegance to the approach, although one must predeclare these structures for
each query one wants to execute. One can think of them as materialized views.

Mongo uses traditional dynamic queries. As with, say, MySQL, we can do queries where an index
does not exist, or where an index is helpful but only partially so. Mongo includes a query optimizer
which makes these determinations. We find this is very nice for inspecting the data administratively,
and this method is also good when we don't want an index: such as insert-intensive
collections. When an index corresponds perfectly to the query, the Couch and Mongo approaches
are then conceptually similar. We find expressing queries as JSON-style objects in MongoDB to be
quick and painless though. Update, Aug 2011: Couch is adding a new query language, "UNQL".
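To make the dynamic-query style concrete, here is a small sketch (the collection, fields, and index are hypothetical): the same find() runs with or without an index, and adding one simply lets the query optimizer use it.
db.articles.find( { author : "kate", published : true } ).sort( { posted_at : -1 } );   // runs even unindexed, just slower
db.articles.ensureIndex( { author : 1, posted_at : -1 } );                              // the optimizer can now use this index for the same query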
Atomicity
Both MongoDB and CouchDB support concurrent modifications of single documents. Both forego
complex transactions involving large numbers of objects.
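A small sketch of what single-document atomicity buys you in the mongo shell (the jobs collection and its fields are hypothetical): findAndModify reads and updates one document as a single atomic operation.
db.jobs.findAndModify( {
    query  : { state : "queued" },
    sort   : { priority : -1 },
    update : { $set : { state : "running" } }
} );   // no other client can claim the same job between the read and the write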
Durability
CouchDB is a "crash-only" design where the db can terminate at any time and remain consistent.
Previous versions of MongoDB used a storage engine that would require a repairDatabase()
operation when starting up after a hard crash (similar to MySQL's MyISAM). Version 1.7.5 and higher
offer durability via journaling; specify the --journal command line option.
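For example, assuming a 1.7.5+ build:
$ mongod --journal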
Map Reduce
Both CouchDB and MongoDB support map/reduce operations. For CouchDB map/reduce is inherent
to the building of all views. With MongoDB, map/reduce is only for data processing jobs but not for
traditional queries.
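A minimal map/reduce sketch in the mongo shell (the events collection and its type field are hypothetical), counting events per type as a data processing job:
var map = function() { emit( this.type, 1 ); };
var reduce = function( key, values ) { return Array.sum( values ); };
db.events.mapReduce( map, reduce, { out : "event_counts" } );   // runs server-side as a batch job
db.event_counts.find();                                         // one document per event type with its count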
Javascript
Both CouchDB and MongoDB make use of Javascript. CouchDB uses Javascript extensively, including
in the building of views.
MongoDB supports the use of Javascript but more as an adjunct. In MongoDB, query expressions
are typically expressed as JSON-style query objects; however one may also specify a javascript
expression as part of the query. MongoDB also supports running arbitrary javascript functions
server-side and uses javascript for map/reduce operations.
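For example, a JSON-style query can carry an embedded JavaScript expression via $where (the collection and fields here are hypothetical):
db.people.find( { age : { $gt : 21 }, $where : "this.credits > this.debits" } );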
REST
Couch uses REST as its interface to the database. With its focus on performance, MongoDB relies on
language-specific database drivers for access to the database over a custom binary protocol. Of
course, one could add a REST interface atop an existing MongoDB driver at any time -- that would be
a very nice community project. Some early stage REST implementations exist for MongoDB.


Performance
Philosophically, Mongo is very oriented toward performance, at the expense of features that would
impede performance. We see Mongo DB being useful for many problems where databases have not
been used in the past because databases are too "heavy". Features that give MongoDB good
performance are:
client driver per language: native socket protocol for client/server interface (not REST)
use of memory mapped files for data storage
collection-oriented storage (objects from the same collection are stored contiguously)
update-in-place (not MVCC)
written in C++
Use Cases
It may be helpful to look at some particular problems and consider how we could solve them.
if we were building Lotus Notes, we would use Couch as its programmer versioning
reconciliation/MVCC model fits perfectly. Any problem where data is offline for hours then back
online would fit this. In general, if we need several eventually consistent master-master replica
databases, geographically distributed, often offline, we would use Couch.
mobile
Couch is better as a mobile embedded database on phones, primarily because of its online/offline
replication/sync capabilities.
we like Mongo server-side; one reason is its geospatial indexes.
if we had very high performance requirements we would use Mongo. For example, web site user
profile object storage and caching of data from other sources.
for a problem with very high update rates, we would use Mongo as it is good at that because of its
"update-in-place" design. For example see updating real time analytics counters
in contrast to the above, couch is better when lots of snapshotting is a requirement because of its
MVCC design.
Generally, we find MongoDB to be a very good fit for building web infrastructure.
In his "What I've learned while using MongoDB for a
year" post, Simon Maynard recommends five metrics to
always monitor:
index sizes
current ops
index misses
replication lag
I/O performance
The first and the third are about making sure your entire
MongoDB working set (including indexes) fits in RAM.

I've used MongoDB for over a year at scale at both Heyzap and Bugsnag, and I've found it to be a very capable database. As with all
databases, there are some gotchas, and here is a summary of the things I wish someone had told me earlier.
Selective counts are slow even if indexed
For example, when paginating a user's feed of activity, you might see something like:
db.collection.count({username: "my_username"});
In MongoDB this count can take orders of magnitude longer than you would expect. There is an open ticket, currently slated for
2.4, so here's hoping they'll get it out. Until then you are left aggregating the data yourself. You could store the aggregated count in
Mongo itself, using the $inc operator when inserting a new document.
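A sketch of that aggregated-counter approach (the collection and field names here are mine, not from the post):
// when inserting a new activity document, also bump a per-user counter
db.activity.insert( { username : "my_username", action : "comment" } );
db.user_counts.update( { _id : "my_username" },
                       { $inc : { activity_count : 1 } },
                       true );   // upsert: create the counter document if it doesn't exist yet
// reading the count is now a single document lookup instead of a count()
db.user_counts.findOne( { _id : "my_username" } ).activity_count;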
Inconsistent reads in replica sets
When you start using replica sets to distribute your reads across a cluster, you can get yourself in a whole world of trouble. For
example, if you write data to the primary, a subsequent read may be routed to a secondary that has yet to have the data replicated to
it. This can be demonstrated by doing something like:
// Writes the object to the primary
db.collection.insert({_id: ObjectId("505bd76785ebb509fc183733"), key: "value"});
// This find is routed to a read-only secondary, and finds no results
db.collection.find({_id: ObjectId("505bd76785ebb509fc183733")});
This is compounded if you have performance issues that cause the replication lag between a primary and its secondaries to increase
to minutes or even hours in some cases.
You can control whether a query is run on secondaries, and also how many secondaries the data is replicated to during an insert, but this
will affect performance and could block forever in some cases!
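One hedged sketch of those controls in a 2.x-era shell (the exact options depend on your driver and version):
db.collection.insert( { _id : ObjectId("505bd76785ebb509fc183733"), key : "value" } );
// ask the server to acknowledge replication to a majority before continuing;
// wtimeout keeps this from blocking forever
db.runCommand( { getLastError : 1, w : "majority", wtimeout : 5000 } );
// reads from secondaries must be enabled explicitly on the connection
db.getMongo().setSlaveOk();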
Range queries are indexed differently
I have found that range queries use indexes slightly differently to other queries. Ordinarily you would have the key used for sorting as
the last element in a compound index. However, when using a range query like $in for example, Mongo applies the sort before it
applies the range. This can cause the sort to be done on the documents in memory, which is pretty slow!
// This doesn't use the last element in a compound index to sort
db.collection.find({_id: {$in: [
    ObjectId("505bd76785ebb509fc183733"),
    ObjectId("505bd76785ebb509fc183734"),
    ObjectId("505bd76785ebb509fc183735"),
    ObjectId("505bd76785ebb509fc183736")
]}}).sort({last_name: 1});
At Heyzap we worked around the problem by building a caching layer for the query in Redis, but you can also run the same query
twice if you only have two values in your $in statement or adjust your index if you have the RAM available.
You can read more about the issue or view a ticket.
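A sketch of the "run the query more than once" workaround mentioned above, merging and sorting client-side (pure shell JavaScript; the last_name field comes from the example query):
var ids = [ ObjectId("505bd76785ebb509fc183733"), ObjectId("505bd76785ebb509fc183734") ];
var docs = [];
ids.forEach( function( id ) {
    // each equality query is a fast indexed lookup
    db.collection.find( { _id : id } ).forEach( function( d ) { docs.push( d ); } );
} );
// the handful of results are sorted in the application instead of in the database
docs.sort( function( a, b ) { return a.last_name < b.last_name ? -1 : 1; } );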
Mongo's BSON ID is awesome
Mongo's BSON ID provides you with a load of useful functionality, but when I first started using Mongo, I didn't realize half the things
you can do with them. For example, the creation time of a BSON ID is stored in the ID. You can extract that time and you have a
created_at field for free!
// Will return the time the ObjectId was created
ObjectId("505bd76785ebb509fc183733").getTimestamp();
The BSON ID will also increment over time, so sorting by id will sort by creation date as well. The field is also indexed automatically, so
these queries are super fast. You can read more about it on the 10gen site.
Index all the queries
When I first started using Mongo, I would sometimes run queries on an ad-hoc basis or from a cron job. I initially left those queries
unindexed, as they weren't user facing and weren't run often. However, this caused performance problems for other indexed queries, as the
unindexed queries do a lot of disk reads, which impacted the retrieval of any documents that weren't cached. I decided to make sure the
queries are at least partially indexed to prevent things like this happening.
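For instance (the collection and field here are hypothetical), even a nightly cron query gets an index so it doesn't churn the disk:
db.events.ensureIndex( { created_at : 1 } );
db.events.find( { created_at : { $gt : new Date( Date.now() - 24*60*60*1000 ) } } );   // indexed, so it won't evict hot documents by scanning everything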
Always run explain on new queries
This may seem obvious, and will certainly be familiar if you've come from a relational background, but it is equally important with Mongo.
When adding a new query to an app, you should run the query on production data to check its speed. You can also ask Mongo to explain
what it's doing when running the query, so you can check things like which index it's using, etc.
db.collection.find(query).explain()
{
    // BasicCursor means no index used, BtreeCursor would mean this is an indexed query
    "cursor" : "BasicCursor",
    // The bounds of the index that were used, see how much of the index is being scanned
    "indexBounds" : [ ],
    // Number of documents or indexes scanned
    "nscanned" : 57594,
    // Number of documents scanned
    "nscannedObjects" : 57594,
    // The number of times the read/write lock was yielded
    "nYields" : 2,
    // Number of documents matched
    "n" : 3,
    // Duration in milliseconds
    "millis" : 108,
    // True if the results can be returned using only the index
    "indexOnly" : false,
    // If true, a multikey index was used
    "isMultiKey" : false
}
I've seen code deployed with new queries that would take the site down because of a slow query that hadn't been checked on production
data before deploy. It's relatively quick and easy to do so, so there is no real excuse not to!
Profiler
MongoDB comes with a very useful profiler. You can tune the profiler to only profile queries that take at least a certain amount of time
depending on your needs. I like to have it recording all queries that take over 100ms.
// Will profile all queries that take 100 ms
db.setProfilingLevel(1, 100);
// Will profile all queries
db.setProfilingLevel(2);
// Will disable the profiler
db.setProfilingLevel(0);
The profiler saves all the profile data into the capped collection system.profile. This is just like any other collection so you can run some
queries on it, for example
// Find the most recent profile entries
db.system.profile.find().sort({$natural: -1});
// Find all queries that took more than 5 ms
db.system.profile.find( { millis : { $gt : 5 } } );
// Find only the slowest queries
db.system.profile.find().sort({millis: -1});
You can also run the show profile helper to show some of the recent profiler output.
The profiler itself does add some overhead to each query, but in my opinion it is essential. Without it you are blind. I'd much rather add a small
overhead to the overall speed of the database to give me visibility of which queries are causing problems. Without it you may just be
blissfully unaware of how slow your queries actually are for a set of your users.
Useful Mongo commands
Here's a summary of useful commands you can run inside the mongo shell to get an idea of how your server is acting. These can be scripted,
so you can pull out some values and chart or monitor them if you want.
db.currentOp() - shows you all currently running operations
db.killOp(opid) - lets you kill long running queries
db.serverStatus() - shows you stats for the entire server, very useful for monitoring
db.stats() - shows you stats for the selected db
db.collection.stats() - stats for the specified collection


Monitoring
While monitoring production instances of Mongo over the last year or so, I've built up a list of key metrics that should be monitored.
Index sizes
Seeing as in MongoDB you really need your working set to fit in RAM, this is essential. At Heyzap, for example, we would need our
entire indexes to sit in memory, as we would quite often query our entire dataset when viewing older games or user profiles.
Charting the index size allowed Heyzap to accurately predict when we would need to scale the machine, drop an index, or deal with
growing index size in some other way. We were able to predict, to within a day or so, when we would start to have problems with
the current growth of the indexes.
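These numbers are available directly from the shell, which makes them easy to script and chart (the users collection here is just an example):
db.stats().indexSize;               // total index size for the current database, in bytes
db.users.stats().totalIndexSize;    // index size for one collection
db.users.stats().indexSizes;        // size of each individual index on that collection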
Current ops
Charting your current number of operations on your mongo database will show you when things start to back up. If you notice a spike in
currentOps, you can go and look at your other metrics to see what caused the problem. Was there a slow query at that time? An
increase in traffic? How can we mitigate this issue? When current ops spike, it quite often leads to replication lag if you are using a
replica set, so getting on top of this is essential to preventing inconsistent reads across the replica set.
Index misses
Index misses are when MongoDB has to hit the disk to load an index, which generally means your working set is starting to no longer fit
in memory. Ideally, this value is 0. Depending on your usage it may not be. Loading an index from disk occasionally may not adversely
affect performance too much. You should be keeping this number as low as you can, however.
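In 2.x-era builds this counter is exposed through serverStatus (the exact field layout varies by version and platform):
db.serverStatus().indexCounters;    // includes btree hits, misses and missRatio on supported builds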
Replication lag
If you use replication as a means of backup, or if you read from secondaries, you should monitor your replication lag. Having a
backup that is hours behind the primary node could be very damaging. Also, reading from a secondary that is many hours behind the
primary will likely cause your users confusion.
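Both of these shell helpers report how far behind each secondary is:
db.printSlaveReplicationInfo();     // lag per secondary, relative to the primary's oplog
rs.status();                        // full replica set state, including each member's optimeDate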
I/O performance
When you run your MongoDB instance in the cloud, using Amazon's EBS volumes for example, it's pretty useful to be able to see how
the drives are doing. You can get random drops of I/O performance, and you will need to correlate those with performance indicators
such as the number of current ops to explain the spikes. Monitoring something like iostat will give you all the information
you need to see what's going on with your disks.
Monitoring commands
There are some pretty cool utilities that come with Mongo for monitoring your instances.
mongotop - shows how much time was spent reading or writing each collection over the last second
mongostat - brilliant live debug tool, gives a view on all your connected MongoDB instances
Monitoring frontends
MMS - 10gen's hosted Mongo monitoring service. Good starting point.
Kibana - Logstash frontend. Trend analysis for Mongo logs. Pretty useful for good visibility.

Why Schemaless?

MongoDB is a JSON-style data store. The documents stored in the database can have varying sets of
fields, with different types for each field. One could have the following objects in a single collection:
{ name : "Joe", x : 3.3, y : [1,2,3] }
{ name : "Kate", x : "abc" }
{ q : 456 }
Of course, when using the database for real problems, the data does have a fairly consistent
structure. Something like the following would be more common:
{ name : "Joe", age : 30, interests : "football" }
{ name : "Kate", age : 25 }
Generally, there is a direct analogy between this schemaless style and dynamically typed
languages. Constructs such as those above are easy to represent in PHP, Python and Ruby. What we are
trying to do here is make this mapping to the database natural.
Note the database does have some structure. The system namespace contains explicit lists of our
collections and indexes. Collections may be implicitly or explicitly created, while indexes are explicitly
declared (except for the predefined _id index).
One of the great benefits of these dynamic objects is that schema migrations become very easy. With a
traditional RDBMS, releases of code might contain data migration scripts. Further, each release should
have a reverse migration script in case a rollback is necessary. ALTER TABLE operations can be very slow
and result in scheduled downtime.
With a schemaless database, 90% of the time adjustments to the database become transparent and
automatic. For example, if we wish to add GPA to the student objects, we add the attribute, resave, and
all is well; if we look up an existing student and reference GPA, we just get back null. Further, if we roll
back our code, the new GPA fields in the existing objects are unlikely to cause problems if our code was
well written.
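Continuing the student example in the mongo shell (the GPA value is made up):
db.students.insert( { name : "Joe", age : 30, interests : "football" } );
db.students.insert( { name : "Kate", age : 25 } );
// a later release starts recording GPA; no ALTER TABLE, no migration script
db.students.update( { name : "Joe" }, { $set : { gpa : 3.6 } } );
db.students.findOne( { name : "Kate" } ).gpa;   // older documents simply return null/undefined for the new field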

Databases and Predictability of Performance

A subject which perhaps doesn't get enough attention is whether the performance of a
database is predictable. What we are asking is: are there ever any surprises or gotchas in the
time it takes for a db operation to execute? For traditional database management systems,
the answer is yes.
For example, statistical query optimizers can be unpredictable: if the statistics for a table
change in production, the query plan may change. This could result in a big change in
performance (perhaps better, perhaps worse), but it certainly wasn't an expected
change. Query plans and performance profiles that were never tested in QA may go into
effect.
Another potential issue is locking. A lock from one transaction may cause another operation
that is normally very fast to be slow.
If a system is simple enough, it is predictable. memcached is very predictable in
performance: perhaps that is one reason it is so widely used. Yet we also need more
sophisticated tools, and as they become more advanced, predictability is hard. A goal of the
MongoDB project is to be reasonably predictable in performance. Note this is a goal: the
database is far from perfect in this regard today, but we think it certainly moves things in the
right direction.
For example, the MongoDB query optimizer utilizes concurrent query plan evaluation to
assure good worst-case performance on queries, at a slight expense to average query
time. Further, the lockless design eliminates unpredictability from locking. Other areas of
the system could still use improvement: particularly concurrent query execution. That said,
this is certainly considered an important area for the project and will only get better over
time.

Capped Collections

With MongoDB one may create collections of a predefined size, where old data automatically
ages out on a least recently inserted basis. This can be quite handy. In the mongo JavaScript
shell, it is as simple as this:
db.createCollection("mycoll", {capped: true, size:100000})
When capped, a MongoDB collection has a couple of interesting properties. First, the data
automatically ages out when the collection is full on a least recently inserted basis.
Second, for capped collections, MongoDB automatically keeps the objects in the collection in
their insertion order. This is great for logging-type problems where order should be
preserved. To retrieve items in their insertion order:
db.mycoll.find().sort( {$natural:1} ); // oldest to newest
db.mycoll.find().sort( {$natural:-1} ); // newest to oldest
The implementation of the above two properties in the database is done at a low level and is
very fast and efficient. We could simulate this behavior by using a timestamp column and
index, but with a significant speed penalty.
In fact, the capped collection performance is so good that MongoDB uses capped collections
as the storage mechanism for its own replication logs. One can inspect these logs with
standard MongoDB commands. For example, if you have a master MongoDB database
running, try this from the mongo shell:
use local
db.oplog.$main.find(); // show some replication log data
db.getReplicationInfo(); // distill from it some summary statistics
db.getReplicationInfo; // shows the js code for the getReplicationInfo function
Using MongoDB for Real-time Analytics

Some MongoDB developers use the database as a way to track real-time performance metrics for their websites (page views, uniques, etc.). Tools
like Google Analytics are great but not real-time; sometimes it is useful to build a secondary system that provides basic real-time stats.

Using the Mongo upsert and $inc features, we can efficiently solve the problem. When an app server renders a page, the app server can send one or
more updates to the database to update statistics.

We can do this efficiently for a few reasons. First, we send a single message to the server for the update. The message is an upsert: if the
object exists, we increment the counters; if it does not, the object is created. Second, we do not wait for a response; we simply send the
operation and immediately return to the other work at hand. As the data is simply page counters, we do not need to wait and see if the operation
completes (we wouldn't report such an error to our web site user anyway). Third, the special $inc operator lets us efficiently update an existing
object without requiring a much more expensive query/modify/update sequence.

The example below demonstrates this using the mongo shell syntax (analogous steps can be done in any programming language for which one has a
Mongo driver).
$ ./mongo
> c = db.uniques_by_hour;
> c.find();
> cur_hour = new Date("Mar 05 2009 10:00:00")
> c.ensureIndex( { hour : 1, site : 1 } );
> c.update( { hour : cur_hour, site : "abc" },
{ $inc : { uniques:1, pageviews: 1} },
{ upsert : true } )
> c.find();
{"_id" : "49aff5c62f47a38ee77aa5cf" ,
"hour" : "Thu Mar 05 2009 10:00:00 GMT-0500 (EST)" ,
"site" : "abc" , "uniques" : 1 ,
"pageviews" : 1}
> c.update( { hour : cur_hour, site : "abc" },
{ $inc : { uniques:1, pageviews: 1} },
{ upsert : true } )
> c.find();
{"_id" : "49aff5c62f47a38ee77aa5cf" ,
"hour" : "Thu Mar 05 2009 10:00:00 GMT-0500 (EST)" ,
"site" : "abc" , "uniques" : 2 , "pageviews" : 2}
> c.update( { hour : cur_hour, site : "abc" },
{ $inc : { uniques:0, pageviews: 1} },
{ upsert : true } )
> c.find();
{"_id" : "49aff5c62f47a38ee77aa5cf" ,
"hour" : "Thu Mar 05 2009 10:00:00 GMT-0500 (EST)" ,
"site" : "abc" , "uniques" : 2 , "pageviews" : 3}


http://square.github.com/cubism/
