
Introducing:

MongoDB
David J. C. Beach
Sunday, August 1, 2010
David Beach
Software Consultant (past 6 years)
Python since v1.4 (late 90s)
Design, Algorithms, Data Structures
Sometimes Database stuff
not a frameworks guy
Organizer: Front Range Pythoneers
Outline
Part I: Trends in Databases
Part II: Mongo Basic Usage
Part III: Advanced Features
Part I:
Trends in Databases
Database Trends
Past: Relational (RDBMS)
Data stored in Tables, Rows, Columns
Relationships designated by Primary, Foreign keys
Data is controlled & queried via SQL
WARNING: extreme oversimplification
Trends:
Criticisms of RDBMS
Rigid data model
Hard to scale / distribute
Slow (transactions, disk seeks)
SQL not well standardized
Awkward for modern/dynamic languages
Lots of disagreement over this
There are points & counterpoints from both sides
The debate is not over
Not here to deliver a verdict
POINT: This is why we see an explosion of new databases.
Trends:
Fragmentation
Relational with ORM (Hibernate, SQLAlchemy)
ODBMS / ORDBMS (push OO-concepts into database)
Key-Value Stores (MemcacheDB, Redis, Cassandra)
Graph (neo4j)
Document Oriented (Mongo, Couch, etc...)
categories are incomplete
some don't fit neatly into categories
As with so many things in technology, we're seeing... FRAGMENTATION!
some examples of DB categories
Where Mongo Fits
"The Best Features of Document Databases, Key-Value Stores, and RDBMSes."
Mongo's tagline (taken from website)
What is Mongo
Document-Oriented Database
Produced by 10gen / Implemented in C++
Source Code Available
Runs on Linux, Mac, Windows, Solaris
Database: GNU AGPL v3.0 License
Drivers: Apache License v2.0
Mongo
Advantages
json-style documents (dynamic schemas)
flexible indexing (B-Tree)
replication and high-availability (HA)
automatic sharding support (v1.6)*
easy-to-use API
fast queries (auto-tuning planner)
fast insert & deletes (sometimes trade-offs)
* sharding support available as of v1.6 (late July 2010)
many of these taken straight from home page
Mongo
Language Bindings
C, C++, Java
Python, Ruby, Perl
PHP, JavaScript
(many more community supported ones)
Mongo
Disadvantages
No Relational Model / SQL
No Explicit Transactions / ACID
Limited Query API
You can do a lot more with MapReduce and JavaScript!
Operations can only be atomic within single collection. (Generally)
Can mimic with foreign IDs, but referential integrity not enforced.
When to use Mongo
Rich semistructured records (Documents)
Transaction isolation not essential
Humongous amounts of data
Need for extreme speed
You hate schema migrations
My personal take on this...
Caveat: I've never used Mongo in Production!
Part II:
Mongo Basic Usage
BRIEFLY cover:
- Download, Install, Configure
- connection, creating DB, creating Collection
- CRUD operations (Insert, Query, Update, Delete)
Installing Mongo
Use a 64-bit OS (Linux, Mac, Windows)
Get Binaries: www.mongodb.org
Run mongod process
32-bit available; not for production
MongoDB uses memory-mapped files.
32 bits limits database to 2 GB!
Installing PyMongo
Download: http://pypi.python.org/pypi/pymongo/1.7
Build with setuptools
(includes C extension for speed)
# python setup.py install
# python setup.py --no-ext install
(to compile without extension)
Mongo Anatomy
Mongo Server > Database > Collection > Document
Getting a Connection
Connection required for using Mongo
>>> import pymongo
>>> connection = pymongo.Connection("localhost")
Finding a Database
Databases = logically separate stores
Navigation using properties
Will create DB if not found
>>> db = connection.mydatabase
Using a Collection
Collection is analogous to Table
Contains documents
Will create collection if not found
>>> blog = db.blog
Inserting
collection.insert(document) => document_id
>>> entry1 = {"title": "Mongo Tutorial",
...           "body": "Here's a document to insert."}
>>> blog.insert(entry1)
ObjectId('4c3a12eb1d41c82762000001')
(entry1: a document)
Inserting (contd.)
Documents must have _id field
Automatically generated unless assigned
12-byte unique binary value
>>> entry1
{'_id': ObjectId('4c3a12eb1d41c82762000001'),
 'body': "Here's a document to insert.",
 'title': 'Mongo Tutorial'}
You can also assign your own _id; it can be any unique value.
Mongo's IDs are designed to be unique...
...even if hundreds of thousands of documents are generated per second, on numerous clustered machines.
ID generated by driver. No waiting on DB.
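The 12-byte, driver-generated id described on this slide can be sketched in plain Python 3. This is an illustration only, not PyMongo's code: `make_object_id` is a hypothetical name, and the layout assumed here (4-byte timestamp, 3-byte machine hash, 2-byte process id, 3-byte counter) is the ObjectId layout documented around the time of this talk.

```python
import binascii
import hashlib
import itertools
import os
import socket
import struct
import time

# Hypothetical helper for illustration; real drivers ship their own ObjectId class.
# The counter starts at a random 3-byte value and increments per generated id.
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))

def make_object_id():
    """Build a 12-byte id on the client, with no round trip to the server."""
    timestamp = struct.pack(">I", int(time.time()))              # 4 bytes: seconds since epoch
    host = socket.gethostname().encode("utf-8")
    machine = hashlib.sha256(host).digest()[:3]                  # 3 bytes: hash of the host name
    pid = struct.pack(">H", os.getpid() & 0xFFFF)                # 2 bytes: process id
    counter = struct.pack(">I", next(_counter) % 0x1000000)[1:]  # 3 bytes: rolling counter
    return timestamp + machine + pid + counter

oid = make_object_id()
print(binascii.hexlify(oid).decode("ascii"))  # 24 hex characters
```

Because every part is available locally, two machines (or two processes) can generate ids at the same moment without coordinating, which is why the driver does not have to wait on the database.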
Inserting (contd.)
Documents may have different properties
Properties may be atomic, lists, dictionaries
>>> entry2 = {"title": "Another Post",
...           "body": "Mongo is powerful",
...           "author": "David",
...           "tags": ["Mongo", "Power"]}
>>> blog.insert(entry2)
ObjectId('4c3a1a501d41c82762000002')
(entry2: another document)
Indexing
May create index on any field
If field is list => index associates all values
>>> blog.ensure_index("author")   # index by single value
>>> blog.ensure_index("tags")     # by multiple values
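To make "index associates all values" concrete, here is a minimal sketch in plain Python 3. It is not how Mongo stores indexes (Mongo uses B-Trees; this uses a dictionary), and `build_index` is a hypothetical helper, but it shows the idea that a list field contributes one index entry per element.

```python
from collections import defaultdict

def build_index(docs, field):
    """Map each indexed value to the set of _ids of documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        if field not in doc:
            continue
        value = doc[field]
        # A list field contributes one index entry per element.
        keys = value if isinstance(value, list) else [value]
        for key in keys:
            index[key].add(doc["_id"])
    return index

# Made-up sample documents in the shape used on the slides.
docs = [
    {"_id": 1, "author": "David", "tags": ["Mongo", "Power"]},
    {"_id": 2, "author": "Robot", "tags": ["bulk", "Green"]},
    {"_id": 3, "author": "David", "tags": ["bulk", "Red"]},
]
by_author = build_index(docs, "author")
by_tag = build_index(docs, "tags")
print(sorted(by_author["David"]))  # [1, 3]
print(sorted(by_tag["bulk"]))      # [2, 3]
```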
Bulk Insert
Let's produce 100,000 fake posts
import random

bulk_entries = []
for i in range(100000):
    entry = {"title": "Bulk Entry #%i" % (i + 1),
             "body": "What Content!",
             "author": random.choice(["David", "Robot"]),
             "tags": ["bulk",
                      random.choice(["Red", "Blue", "Green"])]}
    bulk_entries.append(entry)
Bulk Insert (contd.)
collection.insert(list_of_documents)
>>> blog.insert(bulk_entries)
[ObjectId(...), ObjectId(...), ...]
Inserts 100,000 entries into blog
Returns in 2.11 seconds
Bulk Insert (contd.)
>>> blog.remove()  # clear everything
>>> blog.insert(bulk_entries, safe=True)
returns in 7.90 seconds (vs. 2.11 seconds)
driver returns early; DB is still working
...unless you specify safe=True
Querying
collection.find_one(spec) => document
spec = document of query parameters
>>> blog.find_one({"title": "Bulk Entry #12253"})
{u'_id': ObjectId('4c3a1e411d41c82762018a89'),
 u'author': u'Robot',
 u'body': u'What Content!',
 u'tags': [u'bulk', u'Green'],
 u'title': u'Bulk Entry #99999'}
returned in 0.04s - extremely fast
No index created for title!
presumably, need more entries to effectively test index performance...
Querying
(Specs)
Multiple conditions on document => AND
Value for tags is an ANY match
>>> blog.find_one({"title": "Bulk Entry #12253",
...                "tags": "Green"})
{u'_id': ObjectId('4c3a1e411d41c82762018a89'),
 u'author': u'Robot',
 u'body': u'What Content!',
 u'tags': [u'bulk', u'Green'],
 u'title': u'Bulk Entry #99999'}
presumably, need more entries to effectively test index performance...
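The two matching rules on this slide (conditions are ANDed; a scalar compared against a list field matches if ANY element is equal) can be emulated in plain Python 3. This is a sketch of the semantics only, with no server involved; `matches` is a hypothetical helper and the sample post is made up.

```python
def matches(doc, spec):
    """Return True if doc satisfies every key/value pair in spec (implicit AND)."""
    for field, wanted in spec.items():
        if field not in doc:
            return False
        actual = doc[field]
        if isinstance(actual, list) and not isinstance(wanted, list):
            # List field vs. scalar: true if ANY element equals the wanted value.
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True

post = {"title": "Bulk Entry #7", "author": "Robot", "tags": ["bulk", "Green"]}
print(matches(post, {"tags": "Green"}))                     # True
print(matches(post, {"tags": "Green", "author": "David"}))  # False
```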
Querying
(Multiple)
collection.find(spec) => cursor
new items are fetched in bulk (behind the scenes)
>>> green_items = []
>>> for item in blog.find({"tags": "Green"}):
...     green_items.append(item)
- or -
>>> green_items = list(blog.find({"tags": "Green"}))
Querying
(Counting)
Use the find() method + count()
Returns number of matches found
>>> blog.find({"tags": "Green"}).count()
16646
presumably, need more entries to effectively test index performance...
Updating
collection.update(spec, document)
updates single document matching spec
multi=True => updates all matching docs
>>> item = blog.find_one({"title": "Bulk Entry #12253"})
>>> item["tags"].append("New")
>>> blog.update({"_id": item["_id"]}, item)
Deleting
use remove(...)
it works like find(...)
>>> blog.remove({"author": "Robot"}, safe=True)
Example removed approximately 50% of records.
Took 2.48 seconds
Part III:
Advanced Features
Advanced Querying
Regular Expressions
{"tag": re.compile(r"^Green|Blue$")}
Nested Values {"foo.bar.x": 3}
$where Clause (JavaScript)
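Two details of these specs are easy to try in plain Python 3 without a server. The sketch below uses a hypothetical `get_path` helper to show what a dotted key such as "foo.bar.x" refers to, and then checks how Python's `re` module reads the slide's pattern. Note that the alternation binds loosely, so the pattern means "starts with Green" or "ends with Blue"; whether that is what the slide intended is not stated.

```python
import re

def get_path(doc, path):
    """Follow a dotted path such as "foo.bar.x" through nested dictionaries."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # the path does not exist in this document
        current = current[part]
    return current

doc = {"foo": {"bar": {"x": 3, "y": 4}}}
print(get_path(doc, "foo.bar.x"))  # 3
print(get_path(doc, "foo.baz.x"))  # None

pattern = re.compile(r"^Green|Blue$")
print(bool(pattern.search("Green")))     # True
print(bool(pattern.search("Greenish")))  # True: "^Green" only anchors the start
print(bool(pattern.search("Red")))       # False
```

If an exact match on either word is wanted, grouping the alternatives, as in `r"^(Green|Blue)$"`, expresses that.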
Advanced Querying
$lt, $gt, $lte, $gte, $ne
$in, $nin, $mod, $all, $size, $exists, $type
$or, $not
$elemMatch
>>> blog.find({"$or": [{"tags": "Green"}, {"tags": "Blue"}]})
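As a rough illustration of how operator documents are read, here is a small evaluator in plain Python 3 for a subset of the operators listed above ($lt, $lte, $gt, $gte, $ne, $in, $or). It is a sketch under simplifying assumptions, not Mongo's query engine: `check_field` and `matches` are hypothetical names, list-valued fields are not handled, and the sample documents (including the `votes` field) are made up.

```python
import operator

# Ordering comparisons handled by this sketch (a subset of the slide's list).
_COMPARE = {"$lt": operator.lt, "$lte": operator.le,
            "$gt": operator.gt, "$gte": operator.ge}

def check_field(value, condition):
    """Test one field value against a literal or an operator document."""
    if not isinstance(condition, dict):
        return value == condition
    for op, arg in condition.items():
        if op in _COMPARE:
            # A missing field (None here) never satisfies an ordering comparison.
            if value is None or not _COMPARE[op](value, arg):
                return False
        elif op == "$ne":
            if value == arg:
                return False
        elif op == "$in":
            if value not in arg:
                return False
        else:
            raise ValueError("operator not handled in this sketch: %s" % op)
    return True

def matches(doc, spec):
    """Implicit AND across fields; "$or" takes a list of alternative specs."""
    for key, condition in spec.items():
        if key == "$or":
            if not any(matches(doc, alt) for alt in condition):
                return False
        elif not check_field(doc.get(key), condition):
            return False
    return True

posts = [{"title": "a", "votes": 3}, {"title": "b", "votes": 12}, {"title": "c", "votes": 7}]

def titles(spec):
    return [p["title"] for p in posts if matches(p, spec)]

print(titles({"votes": {"$gt": 5}}))               # ['b', 'c']
print(titles({"votes": {"$gte": 3, "$lt": 10}}))   # ['a', 'c']
print(titles({"$or": [{"title": "a"}, {"votes": {"$in": [12]}}]}))  # ['a', 'b']
```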
Advanced Querying
collection.find(...)
sort("name") - sorting
limit(...) & skip(...) [like LIMIT & OFFSET]
distinct(...) [like SQL's DISTINCT]
collection.group(...) - like SQL's GROUP BY
>>> blog.find().limit(50)                 # find 50 articles
>>> blog.find().sort("title").limit(30)   # 30 titles
>>> blog.find().distinct("author")        # unique author names
won't be showing detailed examples of all these...
there are good tutorials online for all of this
let's move on to something even more interesting
Map/Reduce
collection.map_reduce(mapper, reducer)
ultimate in querying power
distribute across multiple nodes
Most powerful querying mechanism
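The shape of a map/reduce job can be emulated in a few lines of plain Python 3. This is a single-process sketch of the idea, not Mongo's implementation: in Mongo the mapper and reducer are JavaScript functions run by the server, and the names `map_reduce`, `tag_mapper`, and `count_reducer` below are hypothetical. The sample posts are made up.

```python
from collections import defaultdict

def map_reduce(docs, mapper, reducer):
    """Run mapper over every document, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):  # the mapper yields (key, value) pairs
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def tag_mapper(doc):
    # Emit one (tag, 1) pair per tag on the document.
    for tag in doc["tags"]:
        yield tag, 1

def count_reducer(key, values):
    # Combine all values emitted for one key into a single result.
    return sum(values)

posts = [
    {"tags": ["bulk", "Red"]},
    {"tags": ["bulk", "Green"]},
    {"tags": ["bulk", "Green"]},
]
counts = map_reduce(posts, tag_mapper, count_reducer)
print(counts["bulk"], counts["Green"], counts["Red"])  # 3 2 1
```

One design point worth keeping in mind: a real engine may call the reducer more than once for the same key, feeding it earlier partial results, so a reducer should return a value of the same shape as the values it receives. Summation satisfies this.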
Map/Reduce
Visualized
Java MapReduce
Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by an implementation of the Mapper interface, which declares a map() method. Example 2-3 shows the implementation of our map function.
Example 2-3. Mapper for maximum temperature example
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}
Figure 2-1. MapReduce logical data flow
Diagram Credit:
Hadoop: The Definitive Guide
by Tom White; O'Reilly Books
Chapter 2, page 20
also see:
Map/Reduce: A Visual Explanation
db.runCommand({
  mapreduce: "DenormAggCollection",
  query: {
    filter1: { '$in': [ 'A', 'B' ] },
    filter2: 'C',
    filter3: { '$gt': 123 }
  },
  map: function() { emit(
    { d1: this.Dim1, d2: this.Dim2 },
    { msum: this.measure1, recs: 1, mmin: this.measure1,
      mmax: this.measure2 < 100 ? this.measure2 : 0 }
  );},
  reduce: function(key, vals) {
    var ret = { msum: 0, recs: 0, mmin: 0, mmax: 0 };
    for(var i = 0; i < vals.length; i++) {
      ret.msum += vals[i].msum;
      ret.recs += vals[i].recs;
      if(vals[i].mmin < ret.mmin) ret.mmin = vals[i].mmin;
      if((vals[i].mmax < 100) && (vals[i].mmax > ret.mmax))
        ret.mmax = vals[i].mmax;
    }
    return ret;
  },
  finalize: function(key, val) {
    val.mavg = val.msum / val.recs;
    return val;
  },
  out: 'result1',
  verbose: true
});
db.result1.
  find({ mmin: { '$gt': 0 } }).
  sort({ recs: -1 }).
  skip(4).
  limit(8);
SELECT
  Dim1, Dim2,
  SUM(Measure1) AS MSum,
  COUNT(*) AS RecordCount,
  AVG(Measure2) AS MAvg,
  MIN(Measure1) AS MMin,
  MAX(CASE
    WHEN Measure2 < 100
    THEN Measure2
  END) AS MMax
FROM DenormAggTable
WHERE (Filter1 IN ('A','B'))
  AND (Filter2 = 'C')
  AND (Filter3 > 123)
GROUP BY Dim1, Dim2
HAVING (MMin > 0)
ORDER BY RecordCount DESC
LIMIT 4, 8
1. Grouped dimension columns are pulled out as keys in the map function, reducing the size of the working set.
2. Measures must be manually aggregated.
3. Aggregates depending on record counts must wait until finalization.
4. Measures can use procedural logic.
5. Filters have an ORM/ActiveRecord-looking style.
6. Aggregate filtering must be applied to the result set, not in the map/reduce.
7. Ascending: 1, Descending: -1
Revision 4, Created 2010-03-06
Rick Osborne, rickosborne.org
MySQL / MongoDB
http://rickosborne.org/download/SQL-to-MongoDB.pdf
Map/Reduce
Examples
This is me, playing with Map/Reduce
Health Clinic Example
Person registers with the Clinic
Weighs in on the scale
1 year => comes in 100 times
Health Clinic Example
person = {"name": "Bob",
          "weighings": [
              {"date": date(2009, 1, 15), "weight": 165.0},
              {"date": date(2009, 2, 12), "weight": 163.2},
              ... ]
          }
Map/Reduce
Insert Script
import datetime
import random

all_people = []
for i in range(N):  # N = number of people to generate
    person = {'name': 'person%04i' % i}
    weighings = person['weighings'] = []
    std_weight = random.uniform(100, 200)
    for w in range(100):
        date = (datetime.datetime(2009, 1, 1) +
                datetime.timedelta(
                    days=random.randint(0, 365)))
        weight = random.normalvariate(std_weight, 5.0)
        weighings.append({'date': date,
                          'weight': weight})
    weighings.sort(key=lambda x: x['date'])
    all_people.append(person)
Insert Data
Performance
[Chart: insert time vs. N, LOG-LOG scale]
N      Insert
1k     3.14s
10k    29.5s
100k   292s
Linear scaling
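The "linear scaling" reading can be checked from the three timings on this slide: on log-log axes, a straight line with slope 1 means time grows in proportion to N. A quick check in Python 3 (the helper name `loglog_slope` is hypothetical):

```python
import math

# Timings read off the slide: (N, seconds to insert).
timings = [(1000, 3.14), (10000, 29.5), (100000, 292.0)]

def loglog_slope(points):
    """Slope between the first and last point on log-log axes."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    return (math.log10(y1) - math.log10(y0)) / (math.log10(x1) - math.log10(x0))

slope = loglog_slope(timings)
print(round(slope, 2))  # close to 1.0, i.e. time grows roughly linearly with N
```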
Map/Reduce
Total Weight by Day
map_fn = Code("""function () {
    this.weighings.forEach(function(z) {
        emit(z.date, z.weight);
    });
}""")
reduce_fn = Code("""function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}""")
result = people.map_reduce(map_fn, reduce_fn)
Map/Reduce
Total Weight by Day
>>> for doc in result.find():
...     print doc
{u'_id': datetime.datetime(2009, 1, 1, 0, 0), u'value': 39136.600753163315}
{u'_id': datetime.datetime(2009, 1, 2, 0, 0), u'value': 41685.341024046182}
{u'_id': datetime.datetime(2009, 1, 3, 0, 0), u'value': 38232.326554504165}
... lots more ...
Total Weight by Day
Performance
[Chart: time vs. N]
N      MapReduce
1k     4.29s
10k    38.8s
100k   384s
Map/Reduce
Weight on Day
map_fn = Code("""function () {
    var target_date = new Date(2009, 9, 5);
    var pos = bsearch(this.weighings, "date",
                      target_date);
    var recent = this.weighings[pos];
    emit(this._id, { name: this.name,
                     date: recent.date,
                     weight: recent.weight });
};""")
reduce_fn = Code("""function (key, values) {
    return values[0];
};""")
result = people.map_reduce(map_fn, reduce_fn,
                           scope={"bsearch": bsearch})
Map/Reduce
bsearch() function
bsearch = Code("""function(array, prop, value) {
    var min, max, mid, midval;
    for(min = 0, max = array.length - 1; min <= max; ) {
        mid = min + Math.floor((max - min) / 2);
        midval = array[mid][prop];
        if(value === midval) {
            break;
        } else if(value > midval) {
            min = mid + 1;
        } else {
            max = mid - 1;
        }
    }
    return (midval > value) ? mid - 1 : mid;
};""")
Weight on Day
Performance
[Chart: time vs. N]
N      MapReduce
1k     1.23s
10k    10s
100k   108s
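Weight on Day
(Python Version)
import bisect
import datetime

target_date = datetime.datetime(2009, 10, 5)
for person in people.find():
    dates = [w['date'] for w in person['weighings']]
    # bisect_right returns the insertion point; the latest weighing <= target is one earlier
    pos = bisect.bisect_right(dates, target_date) - 1
    val = person['weighings'][pos]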
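The client-side lookup can be tried without a database. The sketch below, in Python 3, wraps the same bisect idea in a hypothetical `weight_on_day` helper and runs it on an in-memory record; Bob's first two weighings come from the earlier slide, and the third is made up for the example. `bisect_right` returns the insertion point, so the latest weighing on or before the target date sits one position earlier.

```python
import bisect
import datetime

def weight_on_day(person, target_date):
    """Return the latest weighing on or before target_date, or None if there is none."""
    dates = [w["date"] for w in person["weighings"]]   # assumed sorted by date
    pos = bisect.bisect_right(dates, target_date) - 1  # index of the last date <= target
    return person["weighings"][pos] if pos >= 0 else None

bob = {"name": "Bob", "weighings": [
    {"date": datetime.datetime(2009, 1, 15), "weight": 165.0},
    {"date": datetime.datetime(2009, 2, 12), "weight": 163.2},
    {"date": datetime.datetime(2009, 11, 3), "weight": 160.1},  # made-up sample value
]}
recent = weight_on_day(bob, datetime.datetime(2009, 10, 5))
print(recent["weight"])  # 163.2
```

Returning None when the target precedes every weighing avoids silently wrapping around to the last element through a negative index.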
Map/Reduce
Performance
[Chart: time vs. N, MapReduce compared with Python]
N      MapReduce   Python
1k     1.23s       0.37s
10k    10s         2.2s
100k   108s        26s
Summary
Resources
www.10gen.com
www.mongodb.org
MongoDB: The Definitive Guide (O'Reilly)
api.mongodb.org/python (PyMongo)
END OF SLIDES
Chalkboard
is not Comic Sans
This is Chalkboard, not Comic Sans.
This isn't Chalkboard, it's Comic Sans.
does it matter, anyway?
