Introducing MongoDB

Introducing:
MongoDB
David J. C. Beach
Sunday, August 1, 2010
David Beach
Software Consultant (past 6 years)
Python since v1.4 (late 90s)
Design, Algorithms, Data Structures
Sometimes Database stuff
not a frameworks guy
Organizer: Front Range Pythoneers
Outline
Part I: Trends in Databases
Part II: Mongo Basic Usage
Part III: Advanced Features
Part I:
Trends in Databases
Database Trends
Past: Relational (RDBMS)
Data stored in Tables, Rows, Columns
Relationships designated by Primary, Foreign
keys
Data is controlled & queried via SQL
WARNING: extreme
oversimplification
Trends:
Criticisms of RDBMS
Rigid data model
Hard to scale / distribute
Slow (transactions, disk seeks)
SQL not well standardized
Awkward for modern/dynamic languages
Lots of disagreement over this
There are points & counterpoints from
both sides
The debate is not over
Not here to deliver a verdict
POINT: This is why we see an explosion of
new databases.
Trends:
Fragmentation
Relational with ORM (Hibernate, SQLAlchemy)
ODBMS / ORDBMS (push OO-concepts into database)
Key-Value Stores (MemcacheDB, Redis, Cassandra)
Graph (neo4j)
Document Oriented (Mongo, Couch, etc...)
categories are
incomplete
some don't fit neatly into
categories
As with so many things in technology,
we're seeing... FRAGMENTATION!
some examples of DB categories
Where Mongo Fits
The Best Features of
Document Databases,
Key-Value Stores,
and RDBMSes.
Mongo's Tagline (taken from website)
What is Mongo
Document-Oriented Database
Produced by 10gen / Implemented in C++
Source Code Available
Runs on Linux, Mac, Windows, Solaris
Database: GNU AGPL v3.0 License
Drivers: Apache License v2.0
Mongo
Advantages
json-style documents
(dynamic schemas)
flexible indexing (B-Tree)
replication and high-
availability (HA)
automatic sharding
support (v1.6)*
easy-to-use API
fast queries (auto-tuning
planner)
fast insert & deletes
(sometimes trade-offs)
sharding support available as of
v1.6 (late July 2010)
many of these taken
straight from home page
Mongo
Language Bindings
C, C++, Java
Python, Ruby, Perl
PHP, JavaScript
(many more community supported ones)
Mongo
Disadvantages
No Relational Model / SQL
No Explicit Transactions / ACID
Limited Query API
You can do a lot more with MapReduce
and JavaScript!
Operations can only be atomic within single
collection. (Generally)
Can mimic with foreign IDs, but referential
integrity not enforced.
When to use Mongo
Rich semistructured records (Documents)
Transaction isolation not essential
Humongous amounts of data
Need for extreme speed
You hate schema migrations
My personal take on this...
Caveat: I've never used Mongo in Production!
Part II:
Mongo Basic Usage
BRIEFLY cover:
- Download, Install, Configure
- connection, creating DB, creating Collection
- CRUD operations (Insert, Query, Update, Delete)
Installing Mongo
Use a 64-bit OS (Linux, Mac, Windows)
Get Binaries: www.mongodb.org
Run mongod process
32-bit available; not for production
MongoDB uses memory-mapped files.
32-bit limits database to 2 GB!
Installing PyMongo
Download: http://pypi.python.org/pypi/pymongo/1.7
Build with setuptools
(includes C extension for speed)
# python setup.py install
# python setup.py --no-ext install
(to compile without extension)
Mongo Anatomy
Database
Collection
Document
Mongo Server
>>> import pymongo
>>> connection = pymongo.Connection("localhost")
Getting a Connection
Connection required for using Mongo
>>> db = connection.mydatabase
Finding a Database
Databases = logically separate stores
Navigation using properties
Will create DB if not found
>>> blog = db.blog
Using a Collection
Collection is analogous to Table
Contains documents
Will create collection if not found
>>> entry1 = {"title": "Mongo Tutorial",
              "body": "Here's a document to insert."}
>>> blog.insert(entry1)
ObjectId('4c3a12eb1d41c82762000001')
Inserting
collection.insert(document) => document_id
document
>>> entry1
{'_id': ObjectId('4c3a12eb1d41c82762000001'),
'body': "Here's a document to insert.",
'title': 'Mongo Tutorial'}
Inserting (contd.)
Documents must have _id field
Automatically generated unless assigned
12-byte unique binary value
You can also assign your own _id, can be
any unique value.
Mongo's IDs are designed to be unique...
...even if hundreds of thousands of
documents are generated per second, on
numerous clustered machines.
ID generated by driver. No waiting on DB.
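The 12-byte, driver-generated id can be sketched in plain Python. This is an illustration only: it assumes the layout documented for ObjectIds at the time (4-byte timestamp, 3-byte machine hash, 2-byte process id, 3-byte counter), and `make_object_id` is a hypothetical helper, not PyMongo's implementation.

```python
import hashlib
import itertools
import os
import socket
import struct
import time

# Hypothetical helper for illustration; PyMongo ships its own ObjectId class.
_counter = itertools.count(1)
_machine = hashlib.sha256(socket.gethostname().encode("utf-8")).digest()[:3]


def make_object_id(now=None):
    """Build a 12-byte id: timestamp(4) + machine(3) + pid(2) + counter(3)."""
    seconds = int(time.time() if now is None else now)
    timestamp = struct.pack(">I", seconds)
    pid = struct.pack(">H", os.getpid() % 0xFFFF)
    # Pack the counter as 4 bytes and keep the low 3 bytes.
    counter = struct.pack(">I", next(_counter) % 0xFFFFFF)[1:]
    return timestamp + _machine + pid + counter


first = make_object_id()
second = make_object_id()
```

Because every part is available on the client, no round trip to the database is needed to obtain an id.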
>>> entry2 = {"title": "Another Post",
"body": "Mongo is powerful",
"author": "David",
"tags": ["Mongo", "Power"]}
>>> blog.insert(entry2)
ObjectId('4c3a1a501d41c82762000002')
Inserting (contd.)
Documents may have different properties
Properties may be atomic, lists, dictionaries
another document
>>> blog.ensure_index("author")
>>> blog.ensure_index("tags")
Indexing
May create index on any field
If field is list => index associates all values
index by single value
by multiple values
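A rough picture of what indexing a list field means, as a pure-Python sketch. A dict stands in for Mongo's B-Tree here, and `build_index` is a hypothetical helper, not part of PyMongo.

```python
from collections import defaultdict


def build_index(documents, field):
    """Map each indexed value to the set of _ids of documents holding it."""
    index = defaultdict(set)
    for doc in documents:
        if field not in doc:
            continue
        value = doc[field]
        # A list field contributes one index entry per element.
        values = value if isinstance(value, list) else [value]
        for item in values:
            index[item].add(doc["_id"])
    return index


posts = [
    {"_id": 1, "title": "Mongo Tutorial"},
    {"_id": 2, "author": "David", "tags": ["Mongo", "Power"]},
    {"_id": 3, "author": "Robot", "tags": ["Mongo"]},
]
by_author = build_index(posts, "author")
by_tag = build_index(posts, "tags")
```

Looking up `by_tag["Mongo"]` finds both tagged posts, which is why a query on a single tag value can use the index on the list field.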
import random

bulk_entries = []
for i in range(100000):
    entry = {"title": "Bulk Entry #%i" % (i + 1),
             "body": "What Content!",
             "author": random.choice(["David", "Robot"]),
             "tags": ["bulk",
                      random.choice(["Red", "Blue", "Green"])]}
    bulk_entries.append(entry)
Bulk Insert
Let's produce 100,000 fake posts
>>> blog.insert(bulk_entries)
[ObjectId(...), ObjectId(...), ...]
Bulk Insert (contd.)
collection.insert(list_of_documents)
Inserts 100,000 entries into blog
Returns in 2.11 seconds
>>> blog.remove() # clear everything
>>> blog.insert(bulk_entries, safe=True)
Bulk Insert (contd.)
returns in 7.90 seconds (vs. 2.11 seconds)
driver returns early; DB is still working
...unless you specify safe=True
>>> blog.find_one({"title": "Bulk Entry #99999"})
{u'_id': ObjectId('4c3a1e411d41c82762018a89'),
u'author': u'Robot',
u'body': u'What Content!',
u'tags': [u'bulk', u'Green'],
u'title': u'Bulk Entry #99999'}
Querying
collection.find_one(spec) => document
spec = document of query parameters
presumably, need more entries to effectively test index performance...
returned in 0.04s - extremely fast
No index created for title!
>>> blog.find_one({"title": "Bulk Entry #99999",
                   "tags": "Green"})
{u'_id': ObjectId('4c3a1e411d41c82762018a89'),
u'author': u'Robot',
u'body': u'What Content!',
u'tags': [u'bulk', u'Green'],
u'title': u'Bulk Entry #99999'}
Querying
(Specs)
Multiple conditions on document => AND
Value for tags is an ANY match
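The two matching rules above (conditions are ANDed; a list field matches if any element matches) can be sketched in plain Python. `matches` is a hypothetical helper that only models equality specs, not the server's real query engine.

```python
def matches(document, spec):
    """Return True if every key in spec matches the document (implicit AND)."""
    for key, wanted in spec.items():
        if key not in document:
            return False
        actual = document[key]
        if isinstance(actual, list):
            # A list field matches if ANY element equals the wanted value
            # (or if the whole list is equal to it).
            if wanted not in actual and wanted != actual:
                return False
        elif actual != wanted:
            return False
    return True


post = {"title": "Bulk Entry #99999", "author": "Robot",
        "tags": ["bulk", "Green"]}
green = matches(post, {"tags": "Green"})
green_by_david = matches(post, {"author": "David", "tags": "Green"})
```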
>>> green_items = [ ]
>>> for item in blog.find({"tags": "Green"}):
...     green_items.append(item)
Querying
(Multiple)
collection.find(spec) => cursor
new items are fetched in bulk (behind the
scenes)
>>> green_items = list(blog.find({"tags": "Green"}))
- or -
>>> blog.find({"tags": "Green"}).count()
16646
Querying
(Counting)
Use the find() method + count()
Returns number of matches found
>>> item = blog.find_one({"title": "Bulk Entry #12253"})
>>> item["tags"].append("New")
>>> blog.update({"_id": item["_id"]}, item)
Updating
collection.update(spec, document)
updates single document matching spec
multi=True => updates all matching docs
>>> blog.remove({"author":"Robot"}, safe=True)
Deleting
use remove(...)
it works like find(...)
Example removed approximately 50% of records.
Took 2.48 seconds
Part III:
Advanced Features
Advanced Querying
Regular Expressions
{"tag": re.compile(r"^Green|Blue$")}
Nested Values {"foo.bar.x": 3}
$where Clause (JavaScript)
>>> blog.find({"$or": [{"tags": "Green"}, {"tags": "Blue"}]})
Advanced Querying
$lt, $gt, $lte, $gte, $ne
$in, $nin, $mod, $all, $size, $exists, $type
$or, $not
$elemMatch
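How an operator document such as {"$gt": 123} is read can be sketched in plain Python. This models only a handful of the operators listed above, and treats a missing field as a non-match, which is a simplification; `matches_spec` and `field_matches` are hypothetical names.

```python
import operator

# Only a few operators are modelled here, for illustration.
OPERATORS = {
    "$lt": operator.lt,
    "$gt": operator.gt,
    "$lte": operator.le,
    "$gte": operator.ge,
    "$ne": operator.ne,
    "$in": lambda actual, wanted: actual in wanted,
    "$nin": lambda actual, wanted: actual not in wanted,
}


def field_matches(actual, condition):
    """Check one field value against a plain value or an operator document."""
    if isinstance(condition, dict):
        return all(OPERATORS[name](actual, arg)
                   for name, arg in condition.items())
    return actual == condition


def matches_spec(document, spec):
    """All conditions in spec must hold (implicit AND)."""
    return all(key in document and field_matches(document[key], cond)
               for key, cond in spec.items())


record = {"filter1": "A", "filter2": "C", "filter3": 200}
spec = {"filter1": {"$in": ["A", "B"]}, "filter2": "C",
        "filter3": {"$gt": 123}}
is_match = matches_spec(record, spec)
```

Several operators in one operator document combine as a range, for example {"$gt": 100, "$lte": 200}.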
>>> blog.find().limit(50) # find 50 articles
>>> blog.find().sort("title").limit(30) # 30 titles
>>> blog.find().distinct("author") # unique author names
Advanced Querying
collection.find(...)
sort(name) - sorting
limit(...) & skip(...) [like LIMIT & OFFSET]
distinct(...) [like SQL's DISTINCT]
collection.group(...) - like SQL's GROUP BY
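For readers who think in list operations rather than SQL, the cursor modifiers above behave roughly like the following plain-Python sketch over an in-memory list. The sample posts are made up for illustration.

```python
posts = [{"title": "Post %02d" % n, "author": "David" if n % 2 else "Robot"}
         for n in range(1, 11)]

# sort("title"), then skip(4) and limit(3):
# like ORDER BY title LIMIT 3 OFFSET 4
ordered = sorted(posts, key=lambda post: post["title"])
page = ordered[4:4 + 3]

# distinct("author"): the set of values seen for one field
authors = sorted({post["author"] for post in posts})
```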
won't be showing detailed
examples of all these...
there are good tutorials online
for all of this
let's move on to something even
more interesting
Map/Reduce
collection.map_reduce(mapper, reducer)
ultimate in querying power
distribute across multiple nodes
Most powerful querying
mechanism
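The map/reduce idea itself fits in a few lines of plain Python. This is a single-process sketch, not how the server runs it (Mongo's mapper and reducer are JavaScript functions); `map_reduce`, `tag_mapper` and `count_reducer` are hypothetical names.

```python
from collections import defaultdict


def map_reduce(documents, mapper, reducer):
    """Run mapper over every document, group emitted pairs, reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def tag_mapper(post):
    # Emit one (tag, 1) pair per tag on the post.
    for tag in post.get("tags", []):
        yield tag, 1


def count_reducer(key, values):
    return sum(values)


posts = [
    {"title": "Another Post", "tags": ["Mongo", "Power"]},
    {"title": "Bulk Entry #1", "tags": ["bulk", "Green"]},
    {"title": "Bulk Entry #2", "tags": ["bulk", "Red"]},
]
tag_counts = map_reduce(posts, tag_mapper, count_reducer)
```

Since each document is mapped independently and each key is reduced independently, the work can be split across nodes.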
Map/Reduce
Visualized
Java MapReduce
Having run through how the MapReduce program works, the next step is to express it
in code. We need three things: a map function, a reduce function, and some code to
run the job. The map function is represented by an implementation of the Mapper
interface, which declares a map() method. Example 2-3 shows the implementation of
our map function.
Example 2-3. Mapper for maximum temperature example
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}
Figure 2-1. MapReduce logical data flow
20 | Chapter 2: MapReduce
Diagram Credit:
Hadoop: The Definitive Guide
by Tom White; O'Reilly Books
Chapter 2, page 20
also see:
Map/Reduce : A Visual Explanation
db.runCommand({
  mapreduce: "DenormAggCollection",
  query: {
    filter1: { '$in': [ 'A', 'B' ] },
    filter2: 'C',
    filter3: { '$gt': 123 }
  },
  map: function() { emit(
    { d1: this.Dim1, d2: this.Dim2 },
    { msum: this.measure1, recs: 1, mmin: this.measure1,
      mmax: this.measure2 < 100 ? this.measure2 : 0 }
  );},
  reduce: function(key, vals) {
    var ret = { msum: 0, recs: 0, mmin: 0, mmax: 0 };
    for(var i = 0; i < vals.length; i++) {
      ret.msum += vals[i].msum;
      ret.recs += vals[i].recs;
      if(vals[i].mmin < ret.mmin) ret.mmin = vals[i].mmin;
      if((vals[i].mmax < 100) && (vals[i].mmax > ret.mmax))
        ret.mmax = vals[i].mmax;
    }
    return ret;
  },
  finalize: function(key, val) {
    val.mavg = val.msum / val.recs;
    return val;
  },
  out: 'result1',
  verbose: true
});
db.result1.
  find({ mmin: { '$gt': 0 } }).
  sort({ recs: -1 }).
  skip(4).
  limit(8);
SELECT
  Dim1, Dim2,
  SUM(Measure1) AS MSum,
  COUNT(*) AS RecordCount,
  AVG(Measure2) AS MAvg,
  MIN(Measure1) AS MMin,
  MAX(CASE
    WHEN Measure2 < 100
    THEN Measure2
  END) AS MMax
FROM DenormAggTable
WHERE (Filter1 IN ('A','B'))
  AND (Filter2 = 'C')
  AND (Filter3 > 123)
GROUP BY Dim1, Dim2
HAVING (MMin > 0)
ORDER BY RecordCount DESC
LIMIT 4, 8
1 Grouped dimension columns are pulled out as keys in the map function,
  reducing the size of the working set.
2 Measures must be manually aggregated.
3 Aggregates depending on record counts must wait until finalization.
4 Measures can use procedural logic.
5 Filters have an ORM/ActiveRecord-looking style.
6 Aggregate filtering must be applied to the result set, not in the map/reduce.
7 Ascending: 1, Descending: -1
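Why an average has to wait for finalization can be shown with a small plain-Python sketch. It assumes, as MongoDB's map/reduce does, that the reduce step may be re-applied to its own partial results; the function names and numbers are made up for illustration.

```python
def reduce_partials(values):
    """Combine partial results; safe to apply to its own output."""
    return {"msum": sum(v["msum"] for v in values),
            "recs": sum(v["recs"] for v in values)}


def finalize(value):
    """Compute the average once, after all reducing is done."""
    result = dict(value)
    result["mavg"] = value["msum"] / float(value["recs"])
    return result


emitted = [{"msum": m, "recs": 1} for m in (10.0, 20.0, 60.0)]

# Reduce in two batches, then reduce the partial results again.
batch_a = reduce_partials(emitted[:2])
batch_b = reduce_partials(emitted[2:])
combined = reduce_partials([batch_a, batch_b])
final = finalize(combined)

# Averaging the batch averages instead would give a different answer.
naive = ((batch_a["msum"] / batch_a["recs"]) +
         (batch_b["msum"] / batch_b["recs"])) / 2
```

Sums and counts survive re-reducing unchanged, so the reducer carries those and the division happens once at the end.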
Revision 4, Created 2010-03-06
Rick Osborne, rickosborne.org
MySQL / MongoDB
http://rickosborne.org/download/SQL-to-MongoDB.pdf
Map/Reduce
Examples
This is me, playing with Map/Reduce
Health Clinic Example
Person registers with the Clinic
Weighs in on the scale
1 year => comes in 100 times
Health Clinic Example
person = {"name": "Bob",
          "weighings": [
              {"date": date(2009, 1, 15), "weight": 165.0},
              {"date": date(2009, 2, 12), "weight": 163.2},
              ... ]
          }
import datetime
import random

all_people = []
for i in range(N):  # N = number of people to generate
    person = {'name': 'person%04i' % i}
    weighings = person['weighings'] = []
    std_weight = random.uniform(100, 200)
    for w in range(100):
        date = (datetime.datetime(2009, 1, 1) +
                datetime.timedelta(
                    days=random.randint(0, 365)))
        weight = random.normalvariate(std_weight, 5.0)
        weighings.append({'date': date,
                          'weight': weight})
    weighings.sort(key=lambda x: x['date'])
    all_people.append(person)
Map/Reduce
Insert Script
Insert Data
Performance (LOG-LOG scale; linear scaling)
Insert: 1k = 3.14s, 10k = 29.5s, 100k = 292s
map_fn = Code("""function () {
    this.weighings.forEach(function(z) {
        emit(z.date, z.weight);
    });
}""")
reduce_fn = Code("""function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}""")
result = people.map_reduce(map_fn, reduce_fn)
Map/Reduce
Total Weight by Day
>>> for doc in result.find():
...     print doc
{u'_id': datetime.datetime(2009, 1, 1, 0, 0), u'value':
39136.600753163315}
41685.341024046182}
38232.326554504165}
... lots more ...
Map/Reduce
Total Weight by Day
Total Weight by Day
Performance
MapReduce: 1k = 4.29s, 10k = 38.8s, 100k = 384s
map_fn = Code("""function () {
    var target_date = new Date(2009, 9, 5);
    var pos = bsearch(this.weighings, "date",
                      target_date);
    var recent = this.weighings[pos];
    emit(this._id, { name: this.name,
                     date: recent.date,
                     weight: recent.weight });
};""")
reduce_fn = Code("""function (key, values) {
    return values[0];
};""")
result = people.map_reduce(map_fn, reduce_fn,
                           scope={"bsearch": bsearch})
Map/Reduce
Weight on Day
bsearch = Code("""function(array, prop, value) {
    var min, max, mid, midval;
    for(min = 0, max = array.length - 1; min <= max; ) {
        mid = min + Math.floor((max - min) / 2);
        midval = array[mid][prop];
        if(value === midval) {
            break;
        } else if(value > midval) {
            min = mid + 1;
        } else {
            max = mid - 1;
        }
    }
    return (midval > value) ? mid - 1 : mid;
};""")
Map/Reduce
bsearch() function
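To see what the JavaScript bsearch() returns, here is a line-by-line Python port checked against the standard bisect module. The port and the sample weighings are illustrative only; they are not part of the original benchmark.

```python
import bisect
import datetime


def bsearch(items, prop, value):
    """Index of the item equal to value, else of the last smaller item (-1 if none)."""
    low, high = 0, len(items) - 1
    mid, midval = -1, None
    while low <= high:
        mid = low + (high - low) // 2
        midval = items[mid][prop]
        if value == midval:
            break
        elif value > midval:
            low = mid + 1
        else:
            high = mid - 1
    # If the last probe overshot the value, step back by one.
    return mid - 1 if (midval is not None and midval > value) else mid


weighings = [{"date": datetime.date(2009, 1, day), "weight": 160.0 + day}
             for day in (3, 8, 15, 21, 28)]
dates = [w["date"] for w in weighings]

# With unique, sorted dates the result equals bisect_right(...) - 1.
agree = all(
    bsearch(weighings, "date", datetime.date(2009, 1, day))
    == bisect.bisect_right(dates, datetime.date(2009, 1, day)) - 1
    for day in range(1, 32))
```

A result of -1 means the target date is earlier than every weighing, a case the map function above does not guard against.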
Weight on Day
Performance
MapReduce: 1k = 1.23s, 10k = 10s, 100k = 108s
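import bisect
import datetime

target_date = datetime.datetime(2009, 10, 5)
for person in people.find():
    dates = [w['date'] for w in person['weighings']]
    pos = bisect.bisect_right(dates, target_date)
    # bisect_right gives the insertion point, so the most recent
    # weighing on or before target_date sits at pos - 1
    val = person['weighings'][pos - 1]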
Weight on Day
(Python Version)
Map/Reduce
Performance (MapReduce vs. Python)
Python: 1k = 0.37s, 10k = 2.2s, 100k = 26s
MapReduce: 1k = 1.23s, 10k = 10s, 100k = 108s
Summary
Resources
www.10gen.com
www.mongodb.org
MongoDB
The Definitive Guide
O'Reilly
api.mongodb.org/python
PyMongo
END OF SLIDES
Chalkboard
is not Comic Sans
This is Chalkboard, not Comic Sans.
This isn't Chalkboard, it's Comic Sans.
does it matter, anyway?
