
NoSQL, MongoDB, and the importance of Data Modelling

Our Thinking, Fri 29th May 2015

It's a brave new NoSQL-powered world out there, and companies are increasingly moving to
new technologies for data storage and reporting. And these solutions are indeed powerful,
removing database features that people don't always need (stored procedures, schema
enforcement by the DB, etc.) in exchange for features that we do want (really easy replication
and failover, easy horizontal scalability, better concurrency).
But these systems aren't a magic bullet, as evidenced by the number of people who have trouble
using them, and the number of people who actively complain about them.

This is probably true for almost all NoSQL solutions, but in my experience it seems particularly
true of MongoDB, probably for a few reasons:

- It's the NoSQL solution I have the most experience with, so it's the one I have the most
  opportunity to complain about.
- It's one of the most (if not the most) popular NoSQL DBs, so it's the solution most
  people try, often the first solution people try, and one of the solutions most likely to be
  used by fanboys and relatively inexperienced developers.
- It is a document database, which offers the widest array of modelling options, for
  better and for worse.

Data modelling is important no matter what persistence solution you use (yes, even in the SQL
world, where there are endless courses, books, and articles available about basic modelling
choices, like how best to represent a tree or graph [1]). But document databases provide maximum
flexibility in how you persist a given bunch of data, so with good data modelling techniques
you have the maximum ability to model your way out of trouble with performance and
concurrency issues, but with bad modelling you have a lot of rope to hang yourself with.
In my professional experience, I've run into a number of situations where a poor developer
experience with MongoDB was largely due to the way it was being used, and to suboptimal data
modelling.
Let's look at two of those cases.

Example 1: A reservation system


This is one of the more recent cases I've dealt with: a leading online restaurant reservation
system developed by our friends at bookatable.com. Bookatable provides real-time reservation
and marketing services for 13,000 restaurants across 12 countries, and they are very committed
to delivering up-to-date table availability information to users of their consumer web and native
apps. Their working data set is the set of available table reservations for each restaurant, and it
covers many days in the immediate past and near future, with many different tables for each
restaurant.
Originally, the data was modelled with one document per possible reservation, with a structure
something like this:
{
    _id: //a unique MongoDB ObjectId,
    restaurantId: //int or long,
    date: //a date,
    timeOfDay: //a time,
    ...
    //other details of the reservation
    //(table no, size, etc)
}
The developers at Bookatable are very sharp, had done a great job with general performance
tuning of their app, and were following most of the published MongoDB guidelines. Even so,
they were experiencing significant performance problems, to the extent that there was concern
from management that MongoDB might not be the correct tool for the job. In fact, it is quite
suitable for this kind of workload, and some relatively minor tweaks to the data model led to
some dramatic improvements.
Tiny documents that are too finely grained:
MongoDB documentation regularly mentions the maximum document size (as of 2015
it's 16MB), but there's very little discussion of the minimum effective document size. It varies
from setup to setup, and is best measured empirically, but in a discussion I had in a London pub
with Dwight Merriman once, his opinion was that the effective minimum size was about 4KB.
This makes sense when you think about how MongoDB loads documents from disk. MongoDB
files are organised into 4KB pages, and the DB always reads an entire page into memory at
once [2]. Some of the data in that page will be the document you're looking for. However,
MongoDB documents are not stored in any particular order (including _id order), and a page
contains random documents [3]. So it's usually best to work under the assumption that the rest of
the page contains data completely unrelated to your query, and that having been loaded into
memory, it will then be ignored.
You don't need to worry about a small number of your documents being slightly smaller than the
limit, but if you've got a lot of documents that are drastically smaller than the page size then you
are very likely throwing away a lot of the data your system spends precious IO retrieving.
Documents larger than the page size are no problem, as MongoDB does guarantee that a single
document is always stored contiguously on disk, which makes it easy to return. So between 4KB
and 16MB, smaller is better, but below 4KB smaller can be really bad.

A document per individual reservation meant that almost all of the documents in the database
were under the minimum effective document size, often by an order of magnitude. The client
was definitely experiencing the effects of all of that wasted IO.
Retrieving a large number of documents to satisfy common queries:
Related to the previous point, I generally use the number of documents that need to be loaded
as an easy-to-calculate approximation of "how expensive is this query?". It's far from exact, but
it's usually pretty easy to figure out, and it can go a long way with a little bit of napkin math.
For this client, a common query is "for a given restaurant, what are all of the times I can get a
free table for X people in the next two days?". Because of how the data was organised (one
document per potential reservation), one document had to be retrieved for each potential table
of the right size, times the number of timeslots. In a large restaurant (say 50 tables) with 20
timeslots each day (open 10 hours a day with 2 possible timeslots per hour per table), up to 2,000
documents need to be returned for a 2-day span, and that means (give or take) 2,000 potential
random IO operations. Consider the same operation across a set of restaurants in a city, and a
very large amount of IO is required to answer queries.
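To make that concrete, a query under the original one-document-per-reservation model would look something like the sketch below (collection and field names such as reservations, partySize and available are illustrative assumptions, not Bookatable's actual schema). One matching document comes back per free table per timeslot:

    //Hypothetical query shape under the original model: every free
    //table/timeslot combination is returned as its own tiny document.
    db.reservations.find({
        restaurantId: 1234,                         //the restaurant in question
        date: { $gte: ISODate("2015-05-29"),        //from today...
                $lt: ISODate("2015-05-31") },       //...to two days from now
        partySize: { $gte: 4 },                     //a table big enough for X people
        available: true
    })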
How data modelling fixed the problem:
Changing the data model so that a single document contained all available reservations per
restaurant per day solved almost all of these performance issues.
The documents got larger, but because they were previously well under the 4KB effective
minimum, this doesn't negatively impact the IO requirements of the system at all. There are also
far fewer documents now: just 1 document per restaurant per day. Satisfying common queries now
requires retrieving far fewer documents: the common query "for a given restaurant, what are all
of the times I can get a free table for X people in the next two days?" now requires exactly 2
documents to be loaded.
This also has an effect on the number of writes that are required: updating a restaurant's
inventory for a day (or adding data for a new day) requires just one document write.
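As a rough sketch (field and collection names here are illustrative, not the real schema), the reworked document and its two-document query could look like this:

    {
        _id: //an ObjectId,
        restaurantId: //int or long,
        date: //a date,
        slots: [
            //one entry per available table/timeslot for that restaurant and day, e.g.
            { time: "18:00", tableNo: 12, size: 4 },
            { time: "18:30", tableNo: 12, size: 4 },
            ...
        ]
    }

    //The two-day availability question now touches exactly two documents:
    db.availability.find({
        restaurantId: 1234,
        date: { $gte: ISODate("2015-05-29"), $lt: ISODate("2015-05-31") }
    })

Filtering the slots for a party of X then happens over the embedded array (in the query itself or in application code), but the IO cost is fixed at two document reads per restaurant.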

Example 2: Time series data & calculating catchments


This is a slightly older example, from a project Equal Experts worked on with Telefónica Digital
in 2012-2013, with a very interesting use case: calculating catchment areas over time.

Briefly, a catchment area is the area where the people who visit a certain location live. As traffic
varies a lot over the course of a year, the requirement was to calculate a catchment area over
time: over a 6-month period, where did the people who visited location X come from, and in
what concentrations?
The United Kingdom was divided into about 120,000 areas (locations), and for each day, traffic
movement was described as "X people who live in location Y visited location Z". The challenge
was to be able to retrieve and aggregate that information in a timely fashion to generate a report.
The product designer even provided a worst-case "nightmare query": calculate the catchment for
a collection of 80 destinations over a period of 6 months.
Another consultancy [4] took the data and modelled it in the most straightforward way they could
think of, where each document was a tuple:
{sourceLocationId, destinationLocationId, date, visitors}

This didn't work. In fact, it failed so badly that when we first got involved in tuning this
particular use case, the other consultancy had determined that MongoDB was completely unsuitable
for the task at hand, and shouldn't be used.
The problem is that this tuple is significantly less than 4KB in size, at about 128 bits (plus key
names) per tuple. And with 120,000 sources and destinations, up to 120k * 120k documents had
to be loaded into the database each day. That's 14.4 billion documents per day.
Trying to insert that many documents every day completely overloaded the MongoDB cluster.
The team involved tried everything they could think of to performance tune the cluster (more
shards, more IO, more mongos instances).
Worse, satisfying the nightmare query involved loading documents for 80 destinations *
120,000 potential sources * 180 days, or 1.7 billion documents, out of a potential
2.59 trillion documents in the database for the time period.
It was at this point that Equal Experts first became involved in this particular feature. Our very
first suggestion was to rethink the data model, and as it happened, changing the data model
completely solved this performance problem.

Our first modelling change was to use larger documents, where each document contained all of
the traffic for 1 destination over 1 day:
{
    _id: //an ObjectId,
    destinationLocationId: //int,
    date: //a date,
    sources: {
        sourceLocationId: visitors, //both ints
        ...
        //up to 120,000 of these
    }
}

This resulted in much larger documents, but still comfortably under the MongoDB document
size limit.
This immediately reduced the number of documents required each day from 14.4 billion to
120,000. And the number of documents required to satisfy the nightmare query was 80
destinations * 180 days = 14,400 documents.
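At this stage the nightmare query maps onto a simple range query, sketched below (the traffic collection name, the destinationIds array of 80 ids, and the dates are illustrative assumptions):

    //One document per destination per day, so the full 6-month,
    //80-destination query still loads roughly 14,400 documents.
    db.traffic.find({
        destinationLocationId: { $in: destinationIds },
        date: { $gte: ISODate("2013-01-01"), $lt: ISODate("2013-07-01") }
    })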
This was an improvement, but we were still unhappy with loading over 14 thousand documents
to satisfy a query, so we further optimised the model.
We stayed with a document per destination per day, but instead of just storing the visits for that
day, we stored the visits for that day and the total visits so far that year:
{
    _id: //an ObjectId,
    destinationLocationId: //int,
    date: //a date,
    sources: {
        sourceLocationId: {
            dailyVisits: //int,
            ytd: //int
        },
        ...
        //up to 120,000 of these
    }
}

At this point the documents were large enough that we were heavily optimising key names to
keep document sizes under control. I have omitted that here to keep things readable.
Now, to return the catchment for a date range, you query the document for the end of the range
and the document for the beginning of the range, subtract one set of totals from the other, and
presto: you've got the number of visits during the range.
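A minimal sketch of that calculation for a single destination, written as mongo shell JavaScript (the traffic collection name is an assumption; the field names follow the model above):

    //Catchment for one destination over a date range: two findOne calls
    //and a subtraction of the running totals.
    function catchmentForRange(destId, startDate, endDate) {
        var startDoc = db.traffic.findOne({ destinationLocationId: destId, date: startDate });
        var endDoc = db.traffic.findOne({ destinationLocationId: destId, date: endDate });
        var catchment = {};
        for (var sourceId in endDoc.sources) {
            var previousYtd = (startDoc && startDoc.sources[sourceId]) ? startDoc.sources[sourceId].ytd : 0;
            catchment[sourceId] = endDoc.sources[sourceId].ytd - previousYtd;
        }
        return catchment; //{sourceLocationId: visits during the range}
    }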
Inserting new data is still pretty easy: just load yesterday's data, add today's numbers to it, and
you've got the new totals for the year so far.
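Again as a rough sketch under the same assumptions (the collection name and the insertDay helper are made up for illustration, and todaysCounts is a plain {sourceLocationId: visits} map), the daily roll-up could look like this:

    //Build the new day's document: carry forward yesterday's running totals,
    //then add today's counts on top before inserting.
    function insertDay(destId, today, yesterday, todaysCounts) {
        var previous = db.traffic.findOne({ destinationLocationId: destId, date: yesterday });
        var sources = {};
        if (previous) {
            for (var carriedId in previous.sources) {
                sources[carriedId] = { dailyVisits: 0, ytd: previous.sources[carriedId].ytd };
            }
        }
        for (var sourceId in todaysCounts) {
            var entry = sources[sourceId] || { dailyVisits: 0, ytd: 0 };
            entry.dailyVisits = todaysCounts[sourceId];
            entry.ytd += todaysCounts[sourceId];
            sources[sourceId] = entry;
        }
        db.traffic.insert({ destinationLocationId: destId, date: today, sources: sources });
    }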
The number of documents that need to be loaded to satisfy the nightmare query is now 80
destinations * 2 days = 160 documents.
That's a total of 10.8 million times fewer documents that theoretically [5] would need to be
retrieved. It's highly doubtful that any amount of physical tuning could ever provide a similar
performance increase.

Conclusions
So, to recap:

- The number of documents that need to be read (or written) by a particular operation is
  often a useful approximation of how much IO needs to be performed.
- Modelling is important. Changes to a data model can impact performance as much as, or
  even more than, regular performance tuning, physical tuning, or configuration changes.
- Document databases give you maximum flexibility for modelling, and therefore a lot of
  potential power, but also a lot of rope with which to hang yourself.
- A good data model design at the beginning of a project can be worth its weight in gold.

And if you want some advice about data modelling (especially with NoSQL and MongoDB), we're
available.

Footnotes

[1] When you can't just use Neo4j.
[2] This IO strategy is also used by virtually every SQL database on the market.
[3] A single page will always contain documents from the same collection, but there are no other
guarantees at all about which documents they will be.
[4] In this case, names omitted to protect the very, very guilty, as will shortly become clear.
[5] "Theoretically" mostly because the other company's implementation was never able to
successfully answer a query.
