This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.
CONTENTS

1. Welcome
Introduction
Feedback
Credit
RethinkDB object
Durability
Atomicity
Drivers
Using drivers
Default database
Repl
Data Type
Basic data type
Composite data type
Sorting data
Selecting data
Select the whole table
Select a single document by its primary key
Select many documents by value of fields
r.row
Pagination data
Access Nested field
Wrap Up
4. Modifying data
Database
Create
Drop
Table
Create
List table
Drop table
System table
Document
Insert
Update
Replace
Delete
Sync
Wrap up
Index
Simple index
Compound index
Arbitrary expressions index
Checking index status
Wrap up
Wrap up
7. Aggregation
sum, average, and count
min and max
distinct
group
ungroup
reduce
Wrap up
8. Time
epoch
Wrap up
9. Conclusion
1. Welcome
Introduction
Welcome, readers, and thank you for your purchase. It helps me continue improving the book's
content.
Before we go into the technical details, I have something to say.
Firstly, I'm not a RethinkDB expert at all. I'm just an average guy who loves programming and
new technologies. To me, RethinkDB is a pleasure to use. However, because it is young, there are not
many books and documents about it compared to other database systems. While the RethinkDB
documentation and API are very good, it can be hard to know where to start. So this guide is for
everyone who is unsure about taking the plunge into RethinkDB as something totally new. I
hope it helps ease the learning process.
The purpose of this book is to organize the concepts of RethinkDB so that you can read
and understand the RethinkDB API directly. Upon finishing the book, you will have a foundation
on which to extend your knowledge with the many other RethinkDB videos and blog posts
on the Internet.
Secondly, I'm a fan of Mixu's writing style. I won't cover in depth things like installing RethinkDB,
fine-tuning, extra function parameters, and so on. Those topics are covered very well in RethinkDB's
own documentation. What I want you to take away from this book is a good grasp of RethinkDB usage
in practice, and how to apply commands in real scenarios.
Third, I'm not fluent in English. If you find any mistakes, you can report an issue on the repository or
email me directly.
Fourth, RethinkDB is changing so fast that things in this book may not reflect its current state. Once
again, I'd be very grateful for any errata you point out, via my email or GitHub. Since this is a
Leanpub book, once I update it you may download it again free of charge.
And finally, due to my limited knowledge of RethinkDB, I want to keep this book short and
straight to the point. Expect a book of around 200 pages. My goal is a book that you
can pick up, read on the train while riding to work, and after a week sit down and actually
start your first RethinkDB project without hesitation.
http://blog.mixu.net/
http://blog.mixu.net/2012/07/26/writing-about-technical-topics-like-its-2012/
https://github.com/kureikain/simplyrethink
Why learn RethinkDB?
RethinkDB is mind-blowing to me. I like the beauty and nature of ReQL, which is embedded in your
programming language. It is also very developer friendly, with its own administration UI. RethinkDB is very easy
to learn, because its query language matches how we think when constructing a query: we can
easily tell what ReQL will do and what the execution order of the query is.
Take a typical SQL query that selects rows with a WHERE clause, orders them, and limits the result.
The query is passed as a string, and occasionally you may forget the ordering or the syntax.
Does ORDER BY go before or after LIMIT? Where should the WHERE clause appear? We also
can't be certain whether an index will be used. Because SQL is a string, the order of execution is defined
by the syntax, and memorizing that syntax is essential.
Compare this with ReQL (RethinkDB Query Language), where the same query is written as a chain
of commands. We can immediately ascertain (or grok) what will result from such a query, and the order of
execution is clear to us, because the methods are chained, one after another, from left to
right. ReQL was designed with the intention of a very clear API, but without the ambiguity that
comes with an ORM.
We can also see directly in the query when an index (say, an index on **name**) is used to find data.
The way the query is constructed feels similar to jQuery, if you are a front-end developer who has never worked with databases.
Or if you are a functional programming person, you will probably see the similarity immediately.
If the above hasn't convinced you, then check this out:
1 SELECT *
2 FROM foods as f
3 INNER JOIN compounds_foods as c ON c.food_id=f.id
4 WHERE f.id IN (10, 20)
5 ORDER BY f.id DESC, c.id ASC
1 r.db("foodbase")
2 .table("foods")
3 .filter(function (food) {
4 return r.expr([10, 20]).contains(food("id"))
5 })
6 .eqJoin("id", r.db("foodbase").table("compound_foods"), {index: "food_id"})
Even if you are not completely familiar with the syntax, you can guess what is going to happen. In
ReQL, we take the foodbase database and its foods table, filter the documents, and join the
result with another table called compound_foods. Within filter, we pass an anonymous function
which determines whether the id field of a document is contained in the array [10, 20]. If it is either 10
or 20, we join the result with the compound_foods table based on the id field, using an index
to search efficiently. The query reads like a chain of API calls, and the order of execution is clear to
the reader.
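To make that execution order concrete, here is a plain-JavaScript sketch of the same pipeline over small in-memory arrays (the sample data and the join logic are invented for illustration; this is not the RethinkDB driver):

```javascript
// Toy stand-ins for the foods and compound_foods tables.
const foods = [
  { id: 10, name: "Apple" },
  { id: 20, name: "Cumin" },
  { id: 30, name: "Garlic" },
];
const compoundFoods = [
  { id: 1, food_id: 10, compound: "Quercetin" },
  { id: 2, food_id: 20, compound: "Cuminaldehyde" },
  { id: 3, food_id: 30, compound: "Allicin" },
];

// Step 1: filter, keeping only the foods whose id is in [10, 20].
const filtered = foods.filter((food) => [10, 20].includes(food.id));

// Step 2: eqJoin, matching each remaining food's id against the
// food_id field of compound_foods (the lookup an index would speed up).
const joined = filtered.map((food) => ({
  left: food,
  right: compoundFoods.find((c) => c.food_id === food.id),
}));

console.log(joined.map((row) => row.right.compound));
// → [ 'Quercetin', 'Cuminaldehyde' ]
```

Reading the chain top to bottom gives exactly the execution order: the filter runs first, then the join runs over the filtered rows.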
RethinkDB really makes me rethink how we work with a database. I don't have to write a query in
a language that I don't like, and I'm no longer forced to use a syntax I don't like because I
have no choice. Further, if something does go wrong, I don't have to slowly tear apart the entire
string to find out which clause has the issue. The error from a ReQL query allows me to
determine the cause much more precisely.
Furthermore, RethinkDB is explicit. Later on, you will learn that RethinkDB makes you
explicitly ask for some not-very-safe operations: when a non-atomic update is required, for example,
you clearly set a flag to allow it. By default, RethinkDB has sensible and conservative settings, as a
database should, to help you avoid shooting yourself in the foot.
In my opinion, RethinkDB forces us to understand what we are doing. Everything is exposed in
the query. No magic, no "why did this query fail on production but work as expected on my local
machine", no hidden surprises.
In Vietnamese culture, we usually follow a rule of three in demonstrations before we conclude. Being
Vietnamese, let me end by showing you this third example.
Do you understand the query below?
1 r
2 .db('foodbase')
3 .table('foods')
4 .filter(r.row('created_at').year().eq(2011))
This query finds all foods which were inserted in the year 2011. I cannot even provide an equivalent
SQL example, because it just cannot be as beautiful and concise as the above query.
Feedback
I appreciate all of your feedback to improve this book. Below are my handles on the Internet:
twitter: http://twitter.com/kureikain
email: kurei@axcoto.com
twitter book hashtag: #simplyrethinkdb
Credit
Sample dataset: foodb.ca/foods
Book cover: designed by my friend at aresta.co, who created the cover for this book
http://foodb.ca/foods
http://aresta.co/
2. Getting to know RethinkDB
Let's warm up with some RethinkDB concepts, ideas and tools. In this chapter, things may be a bit
confusing, because sometimes to understand concept A, you need to understand B. To understand B,
you need C, which is based on A. So please trust your intuition, and don't hesitate to do some quick
lookups in the official docs to clear things up a bit.
From now on, we will use the term ReQL to mean anything related to the RethinkDB query language, or
query API.
Getting Started
It's not uncommon to see an interactive in-browser shell built for evaluation purposes, such
as mongly or tryRedis. That isn't needed for RethinkDB, because it comes with an excellent editor
where you can type code and run it.
Install RethinkDB by downloading the package for your platform from http://rethinkdb.com/docs/install/,
then run it.
The Ports
By default, RethinkDB listens on 3 ports:

8080
this is the web user interface of RethinkDB, or the dashboard. You can query the data and check
performance and server status in this UI.

28015
this is the client driver port. All client drivers connect to RethinkDB through this port. (Later
in the book, we use a tcpdump command listening on this port to capture the data sent over it.)

29015
this is the intracluster port; different RethinkDB nodes in a cluster communicate with each other
via this port.
The dashboard
Open your browser at http://127.0.0.1:8080 and welcome to RethinkDB. You can play around to see
what you have.
Then navigate to the Explorer tab, where you can type commands. Let's start with:
1 r.dbList()
Durability

soft means the writes will be acknowledged by the server immediately and the data will be flushed to disk
in the background.

hard means the opposite of soft: the default behaviour is to acknowledge the write only after the data is written to disk.

Therefore, when you don't need the data to be fully durable, such as when writing a cache or an
unimportant log, you can set durability to soft in order to increase speed.
Atomicity
According to the RethinkDB docs [atomic], write atomicity is supported on a per-document basis: when
you write to a single document, either the write succeeds entirely or nothing occurs, instead of a
couple of fields being updated and your data left in a bad shape. Furthermore, RethinkDB guarantees that any
combination of operations executed on a single document is written atomically.

However, this comes with a limit. To quote the RethinkDB docs, "operations that cannot be proven
deterministic cannot update the document in an atomic way". That is, unpredictable
values are not written atomically: for example, random values, operations run as JavaScript expressions rather than
ReQL, or values fetched from somewhere else. RethinkDB will throw an error instead of
silently doing it or ignoring it, though you can choose to set a flag to write data in a non-atomic way.

Writes to multiple documents are not atomic.

[atomic] http://www.rethinkdb.com/docs/architecture/#how-does-the-atomicity-model-work
Command line tool
Besides the dashboard, RethinkDB gives us some command line utilities to interact with it. Some
of them are:
import
export
dump
restore
import
In the spirit of giving users the dashboard, RethinkDB also gives us some sample data. You can down-
load the data in file input_polls and country_stats at https://github.com/rethinkdb/rethinkdb/tree/next/demos/electio
and import them into test database
Notice the --table argument, we are passing the table name in format of database_name.*table_-
name. In our case, we import the data into two tables: input_polls and county_stats inside
database test.
Basically you can easily import any file contains a valid JSON document.
export
export exports your database into many JSON files, one file per table. The JSON files can be imported
using the import command above.
dump
dump exports the whole data set of a cluster; it's similar to an export command followed by
gzip to compress all the JSON output files. The syntax is just as easy.
The dump result is a gzip file whose name has the format rethinkdb_dump_{timestamp}.tar.gz. It's
very useful when you want to try something out and then get back your original data. Note it here, because
you will need it later.
restore
Once we have the dump file from the dump command, we can restore it with:
What is FooDB? "FooDB is the world's largest and most comprehensive resource on
food constituents, chemistry and biology. It provides information on both macronutrients
and micronutrients, including many of the constituents that give foods their flavor,
color, taste, texture and aroma."
I imported their data into RethinkDB and generated some sample tables, such as a users table. At the end,
I used the dump command to generate the sample data, which you can download using the link below:
https://www.dropbox.com/s/dy48el02j9p4b2g/simplyrethink_dump_2015-08-11T22%3A15%3A51.tar.gz?dl=0
Once you have downloaded it, you can import this sample dataset.
Once this processing is done, you should have a database called foodb, which contains the data we play
with throughout the book. At any point, if you mess up the data, you can always restore it from this sample
dump. I also encourage you to back up your data if you build interesting data sets of your own to experiment with.
http://foodb.ca/about
3. Reading Data Basic
If you are lazy (just like me) and skipped straight to this chapter, please go back to the end of the previous
chapter to import the sample dataset. Once you have done that, let's start. Oh, before we begin, let me tell you this:
if you see a "..." in query results, it means more data was returned than I can paste into
the book. I use "..." to denote that more data is available.
Getting to Know ReQL
RethinkDB uses a special language called ReQL to interact with the data. ReQL is chainable: you start
with a database, chain to a table, and chain to other API calls to get what you want, in a very natural way.
Type this into the data explorer:
1 r.db("foodb").table('flavors').filter({'flavor_group': 'fruity'})
Don't worry about the syntax; just look at it again, and even without any prior knowledge you can tell what
it does and easily remember it. One way I understand ReQL is that every command returns
an object, and those objects share APIs which we can call as if they were methods of the object.
ReQL bindings are particular to your language. Of course, they look familiar across
languages, to maintain a consistent look and feel, but they are different. The queries are
constructed by making function calls in your language, not by concatenating SQL strings, and not with a special
JSON object as in MongoDB. Therefore it feels very natural to write ReQL, as if the data we
manipulate were an object or data type in our own language. But everything comes with a trade-off. On the
downside, we have to accept differences in ReQL between languages: no matter how hard we
try, different languages have different syntax, especially when it comes to anonymous functions.
What is r? r is like a special namespace through which all of RethinkDB is exposed. It's just a normal
variable in your language, or a namespace, a package name, a module. Think of r like the $ of jQuery.
If you don't like r, you can assign it to another variable.
For now, we will call every method of r, or of any value returned from another method, a command.
Think of a command like a method in the jQuery world.
Here is an example. With this HTML structure:
1 <div class="db">
2 <div class="table" data-type="anime">Haru River</div>
3 <div class="table" data-type="anime">Bakasara</div>
4 <div class="table" data-type="movie">James Bond</div>
5 </div>
1 $('.db').find('.table').filter('[data-type="anime"]')
If we have a database called db and a table called table with 3 records, the equivalent ReQL is:
1 r.db('db').table('table').filter({type: 'anime'})
Notice how similar the structures are? Because of these shared concepts, I find ReQL easy to
learn: if you can write jQuery, you can write ReQL.
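The resemblance is no accident: both APIs work because every call returns an object exposing the same methods. A toy query object in plain JavaScript (invented for illustration; not the real driver) shows the mechanics:

```javascript
// A minimal chainable query object, in the spirit of ReQL.
class Query {
  constructor(rows) {
    this.rows = rows;
  }
  // filter returns a new Query, so further commands chain left to right.
  filter(spec) {
    return new Query(
      this.rows.filter((row) =>
        Object.entries(spec).every(([key, value]) => row[key] === value)
      )
    );
  }
}

// A fake in-memory "database" holding the records from the HTML example.
const data = {
  db: {
    table: [
      { name: "Haru River", type: "anime" },
      { name: "Bakasara", type: "anime" },
      { name: "James Bond", type: "movie" },
    ],
  },
};
const r = {
  db: (dbName) => ({ table: (tableName) => new Query(data[dbName][tableName]) }),
};

const result = r.db("db").table("table").filter({ type: "anime" }).rows;
console.log(result.map((doc) => doc.name)); // → [ 'Haru River', 'Bakasara' ]
```

Just like the jQuery chain, each step narrows the result, and the chain reads in execution order.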
Another way to understand ReQL is to think of it like a pipe on Linux: you select the data, then pass it
into another command. For example, take this query (in Ruby):
1 r.db("foodb").table("users").with_fields("address").run
On the wire, it's similar to a function call: we have the function name, followed by its arguments, and the last
item is an options object.
You can quickly sense a downside: each driver has a different API for constructing queries. When
you move to another language, it may feel very strange. The driver hides the real query behind its
API. It's kind of similar to how you use an ORM in the SQL world to avoid writing raw SQL strings, but
it's different, because an ORM usually has its own API that turns a query into a raw query string, which
in turn is sent to the server by another driver speaking the database's protocol. Here, we get the power
of an ORM at the driver level, because the RethinkDB protocol is a JSON protocol
which models a query like a function call: a command, its arguments, and then its options. In fact, ReQL is
modeled after functional languages like Lisp or Haskell.
If you would like to know more about ReQL at a lower level, you should read the official
documents.
Using drivers
RethinkDB supports 3 official drivers:
Ruby
NodeJS
Python
These support the full driver specification. Community drivers, such as those for Go or PHP, probably won't
support everything, so if you use a different language and find something isn't right, it is probably not
your fault.
All ReQL starts with r, the top-level module which exposes the public API. In NodeJS, we can use:
1 var r = require('rethinkdb')
or in Ruby:
1 require 'rethinkdb'
2 include RethinkDB::Shortcuts
3 puts r.inspect
or in Go:
1 import (
2 r "github.com/dancannon/gorethink"
3 )
Once we have constructed a query with r, we have to call the run method to execute it. The command will be
submitted over an active database connection, which can be established with connect.
http://rethinkdb.com/docs/writing-drivers/
http://rethinkdb.com/docs/driver-spec/
Drivers 20
1 var r = require('rethinkdb')
2 var connection = null
3 r.connect({
4 host: '127.0.0.1',
5 port: 28015,
6 db: 'test'
7 }, function (err, conn) {
8 connection = conn
9 r.db('db').table('table').filter({type: 'anime'}).run(conn)
10 })
When creating the connection with r.connect, you can pass a db parameter to specify a default
database to work on once the connection succeeds. It's similar to the current database in MySQL.
Without this parameter, RethinkDB assumes test as the default database.
To see how the API differs between languages, let's look at the Go driver:

var connection *r.Session
connection, err := r.Connect(r.ConnectOpts{
    Address:  "localhost:28015",
    Database: "test",
})
if err != nil {
    log.Fatalln(err.Error())
}

Notice that we don't have host or db parameters now; they are Address and Database in
the Go driver. So by using an unofficial driver, the API can be quite different from the
official API.
That is a beautiful thing, because each language has its own design philosophy. In Go, for example,
a lowercase struct field is not exported outside its package, so connection option names
such as host or db are impossible in Go.
Default database
Similar to MySQL, where you issue use database_name to switch to another database, in RethinkDB we can
switch by calling the use command on a connection object.
1 connection.use('another_db')
In this small book, most of the time we will use the Data Explorer. There we can use r without
initialization and without calling the run method; the Data Explorer does that for us. Just keep in mind that when you
write code, you have to connect, and explicitly call run to, obviously, run the query.
Note that you don't have to switch databases to access another database's tables; you can just call
r.db('another_db') when building the query.
Repl

Repl means read-eval-print loop. To avoid the burden of manually calling run and passing a connection
object, some drivers offer a repl command so that run can be called without any parameter; the Ruby
driver is one example.

The JavaScript driver doesn't have a repl. I think that is because we can already use the Data Explorer.

https://github.com/dancannon/gorethink
Data Type
Why do we have to discuss data types? We use dynamic languages, mostly Ruby and JavaScript in
this book, so why bother? Because understanding data types allows us to read the API
documentation better. It helps us understand why we can call r.table().insert but cannot call
r.table().filter().insert. Aren't we still selecting data from the table, so shouldn't we be able to
insert data into it?
Data types tell us which methods can be called on which values.
Each ReQL method can be called on one or many data types. Take the update command: when you
browse the API documentation, you see that the command can be invoked on a table, on a selection (e.g. the
first 30 elements of a table), or on a single selection (a document is an example of a single selection).
The behaviour may differ based on the data type, even though the command is the same.
In RethinkDB, we have several data types. Let's start with the basic ones:

* Number: any real number. RethinkDB uses double precision (64-bit) floating point numbers internally.
* String
* Time: this is the native RethinkDB date-time type. The driver converts these values automatically to the native date type of your language.
* Boolean: true/false.
* Null: depending on your language, it can be nil, null, and so on.
* Object: any valid JSON object. In JavaScript it is a normal object; in Ruby it can be a hash.
* Array: any valid JSON array.
The data type of a field can change: if you assign a number to a field, you can later assign a
value of a different data type to that same field. So we don't have a static schema for
tables.
We have a very useful command to get the type of any value: typeOf. For example:
1 r.db('foodb').table('foods')
2 .typeOf()
3 //=>
4 "TABLE"
5
6 r.db('foodb').table('foods')
7 .filter(r.row("name").match('^A'))
8 .typeOf()
9 //=>
10 "SELECTION<STREAM>"
Understanding data types may not seem very important at first, but I really hope you invest
some time in using typeOf frequently to learn the data type of the values you work with.
To give a story: in MariaDB 10.0/MySQL 5.6, when the data type doesn't match, an index may not be
used. Let's say you define a field name with type VARCHAR(255), then create
an index on that column. Querying that column with the exact data type makes the index kick in.
Suppose we insert some records and then query them: when we pass the string '9', the index is used;
when we pass the number 9, the index isn't used.
Likewise, if you have a datetime column and you pass the time as a string, the index won't kick in either.
The lesson here is that we absolutely should understand data types.
Streams
:are lists or arrays, but they're loaded in a lazy fashion. Instead of returning a whole array at once
(meaning all data read into memory), a cursor is returned. A cursor is a pointer into the result set.
We can loop over the cursor to read data as we need it. Imagine that instead of building an array
and looping over it, you iterate over the cursor to get the next value. This lets you iterate over a
data set without holding an entire array in memory; it's the equivalent of a PHP, Ruby, or JavaScript
iterator. A stream gives access to the current element and keeps track of the current position, so
we can call next() on the cursor to move to the next element; when we reach the end, it returns
nil and iteration stops. Because of this, we can work with large data sets: RethinkDB doesn't
need to load all the data and return it to the client. The nature of streams makes them read-only; you cannot
change the data while iterating over it.
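The cursor idea can be sketched with a JavaScript generator: values are produced one at a time as next() is called, instead of the whole array being built up front (a conceptual sketch, not the driver's actual cursor API):

```javascript
// Lazily yield documents one at a time, like a cursor over a result set.
function* cursor(totalDocs) {
  for (let id = 1; id <= totalDocs; id++) {
    // A real driver would fetch the next batch from the server here.
    yield { id };
  }
}

// No documents exist yet; c is just a pointer into the result set.
const c = cursor(1000000);

// Read only the first three documents; the other 999,997 are never built.
const firstThree = [c.next().value, c.next().value, c.next().value];
console.log(firstThree); // → [ { id: 1 }, { id: 2 }, { id: 3 } ]
```

Calling next() again continues from where we stopped, which is exactly the "current position" behaviour described above.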
Selections
:represent subsets of tables, for example, the return values of filter or get. There are two kinds of
selections, **Selection&lt;Object&gt;** and **Selection&lt;Stream&gt;**, depending on whether they wrap a single
document or a stream of documents.
Tables
:are RethinkDB database tables. They behave like selections; however, they're writable: you can
insert and delete documents in them. ReQL methods that use an index, like getAll, are only available
on tables, because indexes are created at the table level.
In short: you cannot modify streams; you can update or change values in a selection, but you cannot
remove existing documents or insert new ones; tables allow you to insert new documents and remove
existing ones.
Sequence
The RethinkDB documentation uses sequence in lots of places. It is not a separate data
type so much as a shorthand for all of the above: streams, tables, selections.
Again: data types may not seem very important, but you should understand them well, because they
help us judge the efficiency of a query. If a query returns an array, for example, it takes a lot of memory
to hold that array.
Sorting data
Since we are talking about data types, let's consider how RethinkDB sorts them. The exact order
matters less than the fact that sorting is well defined.
Understanding sorting is important in RethinkDB because it is schemaless. The primary key may
not be a numeric field, it can be a string. More than that, a field can hold any data type,
so how are we to compare an object to a string when sorting?
Here is sorting order:
Arrays (and strings) sort lexicographically. Objects are coerced to arrays before sorting. Strings are
sorted by UTF-8 codepoint and do not support Unicode collations.
Mixed sequences of data sort in the following order:
arrays
booleans
null
numbers
objects
binary objects
geometry objects
times
strings
That means arrays < booleans < null < numbers < objects < binary objects < geometry objects < times
< strings.
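That ordering can be sketched in plain JavaScript by ranking each value's type first and only falling back to a value comparison within the same type (a simplified sketch; real ReQL comparison of arrays, objects and the remaining types is more involved):

```javascript
// Rank of each data type, following RethinkDB's mixed-sequence order.
function typeRank(value) {
  if (Array.isArray(value)) return 0; // arrays
  if (typeof value === "boolean") return 1; // booleans
  if (value === null) return 2; // null
  if (typeof value === "number") return 3; // numbers
  if (typeof value === "object") return 4; // objects
  // binary, geometry and time types (ranks 5-7) omitted in this sketch
  return 8; // strings sort last
}

// Compare by type rank first; within a type, use the natural < / > order.
function compare(a, b) {
  const diff = typeRank(a) - typeRank(b);
  if (diff !== 0) return diff;
  return a < b ? -1 : a > b ? 1 : 0;
}

const mixed = ["banana", 3, null, true, [1, 2], { a: 1 }, 1];
mixed.sort(compare);
console.log(mixed);
// → [ [ 1, 2 ], true, null, 1, 3, { a: 1 }, 'banana' ]
```

Note how the two numbers still sort among themselves (1 before 3), but every number sorts after null and before any object, exactly as the list above says.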
Selecting data
In this section, we will learn how to get data out of RethinkDB. Most of the time, we choose a
database to work with and chain into the table command.
1 r.db('foodb').table('foods')
2 //=>
3
4 [{
5 "created_at": Wed Feb 09 2011 00:37:17 GMT-08:00,
6 "creator_id": null,
7 "description": null,
8 "food_group": "Herbs and Spices",
9 "food_subgroup": "Spices",
10 "food_type": "Type 1",
11 "id": 43,
12 "itis_id": "29610",
13 "legacy_id": 46,
14 "name": "Caraway",
15 "name_scientific": "Carum carvi",
16 "picture_content_type": "image/jpeg",
17 "picture_file_name": "43.jpg",
18 "picture_file_size": 59897,
19 "picture_updated_at": Fri Apr 20 2012 09:38:36 GMT-07:00,
20 "updated_at": Fri Apr 20 2012 16:38:37 GMT-07:00,
21 "updater_id": null,
22 "wikipedia_id": null
23 }, {
24 "created_at": Wed Feb 09 2011 00:37:18 GMT-08:00,
25 "creator_id": null,
26 "description": null,
27 "food_group": "Herbs and Spices",
28 "food_subgroup": "Spices",
29 "food_type": "Type 1",
30 "id": 67,
31 "itis_id": "501839",
32 "legacy_id": 73,
33 "name": "Cumin",
34 "name_scientific": "Cuminum cyminum",
35 "picture_content_type": "image/jpeg",
36 "picture_file_name": "67.jpg",
37 "picture_file_size": 73485,
38 "picture_updated_at": Fri Apr 20 2012 09:32:32 GMT-07:00,
39 "updated_at": Fri Apr 20 2012 16:32:33 GMT-07:00,
40 "updater_id": null,
41 "wikipedia_id": null
42 },
43 ...
44 ]
You should get back an array of JSON objects. By default, the data explorer will automatically
paginate the result and display part of the data.
Typing r.db(db_name) all the time is insane. We can drop it and use r.table() directly
if the table is in the currently selected database. Without any indication, the default database is test: in
the Data Explorer, without an r.db command, RethinkDB uses test as the default database. Unfortunately,
we cannot change the default database of the Data Explorer.
Counting
We can count the documents of a table, or of any sequence, by calling the count command.
1 r.db('foodb').table('foods').count()
2 //=>
3 863
To fetch a single document by its primary key, we use the get command:
1 r.db('foodb').table('foods')
2 .get(108)
3 //=>
4 {
5 "created_at": Wed Feb 09 2011 00: 37: 20 GMT - 08: 00,
6 "creator_id": null,
7 "description": null,
8 "food_group": "Herbs and Spices",
9 "food_subgroup": "Herbs",
10 "food_type": "Type 1",
11 "id": 108,
12 "itis_id": "32565",
13 "legacy_id": 115,
14 "name": "Lemon balm",
15 "name_scientific": "Melissa officinalis",
16 "picture_content_type": "image/jpeg",
17 "picture_file_name": "108.jpg",
18 "picture_file_size": 30057,
19 "picture_updated_at": Fri Apr 20 2012 09: 33: 54 GMT - 07: 00,
20 "updated_at": Fri Apr 20 2012 16: 33: 54 GMT - 07: 00,
21 "updater_id": null,
22 "wikipedia_id": null
23 }
Every document in RethinkDB includes a primary key field whose value is unique across the cluster
and is used to identify the document. The name of the primary key field is id by default. However,
when you create a table, you have the option to change the name of the primary key field. We will
learn more about it later; just keep a note here.
In RethinkDB, using an incremental primary key isn't recommended because that's hard to do in a
cluster environment: to guarantee the uniqueness of each new value, we would somehow have to
check every node. The RethinkDB team decided to use a universally unique identifier (UUID) instead
of an incremental value.
The get command returns the whole document. What if we want a single field, such as name?
RethinkDB has a command called bracket for that purpose. In Ruby it's [], and in JavaScript
it's ().
We can do this in JavaScript:
1 r.db('foodb').table('foods')
2 .get(108)("name")
3 //=>
4 "Lemon balm"
Or in Ruby:
1 r.connect.repl
2 r.db('foodb').table('foods').get(108)[:name].run
What is special about bracket is that it returns the bare value of the field. The result has the field's
own data type; it is not a sub-document. We can verify that with the typeOf command:
1 r.db('foodb').table('foods')
2 .get(108)
3 ("name")
4 .typeOf()
5 //=>
6 "STRING"
We can chain brackets to drill into nested fields:
1 r.db('foodb').table('test')
2 .get(108)("address")("country")
assuming the document has an address field that is an object containing a field named country.
If you don't like using bracket, you can use getField (JavaScript) or get_field (Ruby), which has
the same effect:
1 r.db('foodb').table('foods')
2 .get(108)
3 .getField('name')
4 //=>
5 "Lemon balm"
How about getting a subset of a document? We can use pluck like this:
1 r.db('foodb').table('foods')
2 .get(108)
3 .pluck("name", "id")
4 //=>
5 {
6 "id": 108 ,
7 "name": "Lemon balm"
8 }
pluck probably exists in the standard library of your favourite language. This example shows
you how friendly ReQL is.
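In fact, if your language does not ship one, a plain-JavaScript pluck is only a few lines (an analogy, not driver code):

```javascript
// Keep only the named keys of a document, like ReQL's pluck.
function pluck(doc, ...keys) {
  const out = {};
  for (const k of keys) {
    if (k in doc) out[k] = doc[k];
  }
  return out;
}

const food = { id: 108, name: "Lemon balm", food_group: "Herbs and Spices" };
console.log(pluck(food, "name", "id")); // { name: "Lemon balm", id: 108 }
```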
1 r.db('foodb').table('foods')
2 .filter(r.row('created_at').year().eq(2011))
3 //=>Executed in 59ms. 40 rows returned, 40 displayed, more available
4 [{
5 "created_at": Wed Feb 09 2011 00: 37: 17 GMT - 08: 00,
6 "creator_id": null,
7 "description": null,
8 "food_group": "Herbs and Spices",
9 "food_subgroup": "Spices",
10 "food_type": "Type 1",
11 "id": 43,
12 "itis_id": "29610",
13 "legacy_id": 46,
14 "name": "Caraway",
15 "name_scientific": "Carum carvi",
16 "picture_content_type": "image/jpeg",
17 "picture_file_name": "43.jpg",
18 "picture_file_size": 59897,
19 "picture_updated_at": Fri Apr 20 2012 09: 38: 36 GMT - 07: 00,
20 "updated_at": Fri Apr 20 2012 16: 38: 37 GMT - 07: 00,
21 "updater_id": null,
22 "wikipedia_id": null
23 }
24 ...
25 ]
r.row is new to you, but no worries: it just means the current document. We used r.row('created_at')
to get the value of the created_at field, similar to how we used bracket on get to fetch a single
value. Because created_at is a datetime value, I get its year with, well, the year command, then use
eq to compare it for equality with 2011. That sounds like a lot, but the above query is really simple
and explains itself. Sometimes explaining a query feels redundant, but I have to write this book anyway.
We can also pass a filter object to do a matching filter:
1 r.db('foodb').table('foods')
2 .filter({
3 food_type: 'Type 1',
4 food_group: 'Fruits'
5 })
6 //=>
7 [
8 {
9 "created_at": Wed Feb 09 2011 00:37:15 GMT-08:00 ,
10 "creator_id": null ,
11 "description": null ,
12 "food_group": "Fruits" ,
13 "food_subgroup": "Tropical fruits" ,
14 "food_type": "Type 1" ,
15 "id": 14 ,
16 "itis_id": "18099" ,
17 "legacy_id": 14 ,
18 "name": "Custard apple" ,
19 "name_scientific": "Annona reticulata" ,
20 "picture_content_type": "image/jpeg" ,
21 "picture_file_name": "14.jpg" ,
22 "picture_file_size": 29242 ,
23 "picture_updated_at": Fri Apr 20 2012 09:30:49 GMT-07:00 ,
24 "updated_at": Fri Apr 20 2012 16:30:49 GMT-07:00 ,
25 "updater_id": null ,
26 "wikipedia_id": null
27 },...
28 ]
Passing an object matches exactly those documents with the given fields and values. In other words,
passing an object is equivalent to combining multiple eq commands with and. The above query can be
rewritten using an expression:
1 r.db('foodb').table('foods')
2 .filter(
3 r.and(
4 r.row('food_type').eq('Type 1'),
5 r.row('food_group').eq('Fruits')
6 )
7 )
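In plain-JavaScript terms (an analogy, not ReQL), matching a filter object is the same as and-ing one equality check per key:

```javascript
// Build a predicate from a filter object: every key must match exactly.
function objectPredicate(filterObj) {
  return (doc) =>
    Object.entries(filterObj).every(([key, value]) => doc[key] === value);
}

const docs = [
  { id: 14, food_type: "Type 1", food_group: "Fruits" },
  { id: 43, food_type: "Type 1", food_group: "Herbs and Spices" },
];
const matches = docs.filter(
  objectPredicate({ food_type: "Type 1", food_group: "Fruits" })
);
console.log(matches.map((d) => d.id)); // [14]
```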
1 r.db('foodb').table('foods')
2 .filter({
3 food_type: 'Type 1',
4 food_group: 'Fruits'
5 })
6 .pluck('id', 'name', 'food_subgroup')
7 //=>Executed in 70ms. 40 rows returned, 40 displayed, more available
8 [
9 {
10 "food_subgroup": "Berries" ,
11 "id": 75 ,
12 "name": "Black crowberry"
13 }, {
14 "food_subgroup": "Tropical fruits" ,
15 "id": 150 ,
16 "name": "Guava"
17 }, {
18 "food_subgroup": "Tropical fruits" ,
19 "id": 151 ,
20 "name": "Pomegranate"
21 }, ...
22 ]
1 r.db('foodb').table('foods')
2 .filter({
3 food_type: 'Type 1',
4 food_group: 'Fruits'
5 })
6 .without("created_at", "picture_content_type", 'picture_file_name', 'picture_f\
7 ile_size', 'picture_updated_at')
8 //=> Executed in 52ms. 40 rows returned, 40 displayed, more available
9 [
10 {
11 "creator_id": null ,
12 "description": null ,
13 "food_group": "Fruits" ,
14 "food_subgroup": "Berries" ,
15 "food_type": "Type 1" ,
16 "id": 75 ,
17 "itis_id": "23743" ,
18 "legacy_id": 81 ,
19 "name": "Black crowberry" ,
20 "name_scientific": "Empetrum nigrum" ,
21 "updated_at": Fri Apr 20 2012 16:29:43 GMT-07:00 ,
22 "updater_id": null ,
23 "wikipedia_id": null
24 },...
25 ]
With simple filtering, we can easily pass a filter object as above. But what about more complex
searches, such as finding all foods whose names start with the letter N? As you saw at the beginning,
we used the r.row command to write slightly more complex queries.
1 r.db('foodb').table('foods')
2 .filter(r.row('created_at').year().eq(2011))
r.row
r.row is our swiss army knife. It refers to the currently visited document: literally, the document
RethinkDB is accessing at that moment. You can think of it like this in a JavaScript callback/iterator,
or like the current element in a loop. It's very handy because we can call other ReQL
commands on it to achieve our filtering.
It feels somewhat like jQuery's filtering. For instance, we write this in JavaScript to filter
all DOM elements whose data-type value is anime:
1 $('.db').find('.table').filter(function() {
2 return $(this).data('type')=='anime'
3 })
Similarly, these two ReQL queries are equivalent:
1 r.db('foodb').table('foods').filter({food_group: 'Fruits'})
1 r.db('foodb').table('foods').filter(r.row('food_group').eq('Fruits'))
r.row is a RethinkDB object on which we can keep calling methods to filter or manipulate data.
The expression we pass into filter is a normal ReQL expression, but one that evaluates to a boolean
result. RethinkDB runs it, and if the returned value is true, the document is included in the result set.
Ideally, any function that returns a boolean can be used with filter. Note that filter
expressions are evaluated on the RethinkDB server; therefore they have to be valid ReQL expressions,
not arbitrary expressions in your host language. You cannot write:
1 r.db('db').table('table').filter(r.row('type') == 'anime')
Each of the above commands can be called on different data types. For example, when you call add
on an array, it appends the elements to the array; when you call it on a string, it concatenates the
parameter to the original string; and calling it on numbers just does arithmetic.
Run these commands in the data explorer:
1 r.expr(["foo", "bar"]).add(['forbar'])
2 //=>
3 [
4 "foo" ,
5 "bar" ,
6 "forbar"
7 ]
8
9 r.expr(2).add(10)
10 //=>
11 12
12
13 r.expr('foo').add("bar")
14 //=>
15 "foobar"
Note that the reason we use r.expr is that we have to turn a native object (an array, number,
or string in our language) into a RethinkDB data type, so that we can call commands on it.
However, in Ruby it can be even shorter: r(["foo", "bar"]) + ["forbar"].
Basically, you have to remember that everything is evaluated on the server, and RethinkDB
commands are only callable on RethinkDB data types.
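The per-type behaviour of add mirrors how native JavaScript operations behave; a small sketch (plain JavaScript, not driver code) of the three r.expr(...).add(...) calls above:

```javascript
// add behaves per type: concat arrays, concat strings, sum numbers.
function add(a, b) {
  if (Array.isArray(a)) return a.concat(b);
  return a + b; // works for both numbers and strings
}

console.log(add(["foo", "bar"], ["forbar"])); // ["foo", "bar", "forbar"]
console.log(add(2, 10)); // 12
console.log(add("foo", "bar")); // "foobar"
```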
You can find more about those commands in the RethinkDB docs, in the Math and logic group.
Let's apply what we have learned by finding all foods whose name starts with the letter R and which
are tropical fruits.
1 r.db("foodb").table("foods")
2 .filter(
3 r.row("name").match("^R")
4 .and(
5 r.row("food_subgroup").eq('Tropical fruits')
6 )
7 )
8 //=>
9 {
10 "created_at": Wed Feb 09 2011 00:37:27 GMT-08:00 ,
11 "creator_id": null ,
12 "description": null ,
13 "food_group": "Fruits" ,
14 "food_subgroup": "Tropical fruits" ,
15 "food_type": "Type 1" ,
16 "id": 234 ,
17 "itis_id": "506073" ,
18 "legacy_id": 249 ,
19 "name": "Rambutan" ,
20 "name_scientific": "Nephelium lappaceum" ,
21 "picture_content_type": "image/jpeg" ,
22 "picture_file_name": "234.jpg" ,
23 "picture_file_size": 71055 ,
24 "picture_updated_at": Fri Apr 20 2012 09:43:04 GMT-07:00 ,
25 "updated_at": Fri Apr 20 2012 16:43:05 GMT-07:00 ,
26 "updater_id": null ,
27 "wikipedia_id": null
28 }
Here we are using match with the regular expression ^R, meaning any name that starts with R, and
using and to combine it with another boolean, which is the result of getting the food_subgroup field
and comparing it with 'Tropical fruits'.
filter seems handy but it's actually limited: filter doesn't leverage indexes. It scans and holds all
data in memory, which of course doesn't scale infinitely; only 100,000 records can be filtered. For
anything larger than that, we have to use getAll or between, which we will learn about in chapter 5.
Now, let's try to find all foods that have more than 10 food documents in their group. We might
think of a simple solution like this: for each document, we get its food_group and count how many
items have that same food group; if the result is greater than 10, we return true, so that the document
is included in the filter result. We may get duplicate results, but let's try this naive solution:
1 r.db('foodb').table('foods')
2 .filter(
3 r.db('foodb').table('foods')
4 .filter(
5 {food_group: r.row("food_group")}
6 )
7 .count()
8 .gt(10)
9 )
1 RqlCompileError: Cannot use r.row in nested queries. Use functions instead in:
2 r.db("foodb").table("foods").filter(r.db("foodb").table("foods").filter({food_gr\
3 oup:
4 r.row("food_group")}).count().gt(10))
Basically, we have a nested query here, and RethinkDB doesn't know which query r.row should
belong to: the parent query or the subquery? In those cases, we have to use filter with a function:
1 r.db('foodb').table('foods')
2 .filter(function (food) {
3 return r.db('foodb').table('foods').filter({food_group: food("food_group")}).\
4 count().gt(10)
5 })
Now we are no longer using r.row; we pass an anonymous function with a single parameter (which
we can name however we like). When iterating over the table, RethinkDB calls this function, passing
the current document as its first argument. By using a function, we can still access the current
document without r.row, and we clearly bind the current document to a variable, so that we can
access its value and avoid ambiguity. Here, we name our argument food. Instead of writing:
1 filter({food_group: r.row("food_group")})
We will write:
1 filter({food_group: food("food_group")})
And we use a boolean value, count().gt(10), as the result of the function. Filtering with a function
helps us write queries with complex logic.
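The same "groups with more than N members" idea can be sketched locally with Array.prototype.filter (a plain-JavaScript analogy; RethinkDB evaluates the function per document on the server):

```javascript
// For each document, count how many documents share its food_group;
// keep the document when that count exceeds n.
const foods = [
  { name: "Caraway", food_group: "Herbs and Spices" },
  { name: "Cumin", food_group: "Herbs and Spices" },
  { name: "Guava", food_group: "Fruits" },
];

function groupsLargerThan(docs, n) {
  return docs.filter(
    (food) =>
      docs.filter((other) => other.food_group === food.food_group).length > n
  );
}

console.log(groupsLargerThan(foods, 1).map((f) => f.name));
// ["Caraway", "Cumin"]
```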
Pagination data
We rarely want the whole sequence of documents; usually we care about a subset of the data, such as
a page of results. In this section, we go over the commands orderBy, limit and skip.
Order data
So far, we have only selected data and accepted the default ordering. Let's control how documents appear:
1 r.db('foodb').table('foods')
2 .filter(function (food) {
3 return r.db('foodb').table('foods').filter({food_group: food("food_group")})\
4 .count().gt(10)
5 })
6 .orderBy(r.desc("name"))
To order the whole table, we simply chain orderBy:
1 r.db('foodb').table('foods')
2 .orderBy(r.desc("name"))
We can also pass multiple ordering keys; documents are sorted by the first key, then ties are broken
by the next:
1 r.db('foodb').table('foods')
2 .orderBy(r.desc("name"), r.asc("created_at"))
One more thing to note is that RethinkDB doesn't order documents by insertion time by
default. Without explicitly setting an order, the ordering appears unpredictable. In MySQL,
for example, even without any index, the default order is exactly the order in which you
inserted the rows. In RethinkDB it isn't; I guess this is because it's distributed.
We can combine other document commands with orderBy too, such as pluck, to keep only a useful
set of fields:
1 r.db('foodb').table('foods')
2 .pluck("id", "name", "food_group")
3 .orderBy(r.desc("name"), r.asc("created_at"))
4 //=>Executed in 122ms. 863 rows returned
5 [
6 {
7 "food_group": "Milk and milk products",
8 "id": 634,
9 "name": "Yogurt"
10 },
11 {
12 "food_group": "Milk and milk products",
13 "id": 656,
14 "name": "Ymer"
15 },
16 {
17 "food_group": "Aquatic foods",
18 "id": 523,
19 "name": "Yellowtail amberjack"
20 },
21 ...
22 ]
Limiting data
Once we have an ordered sequence, we usually want to select a limited number of documents instead
of the whole sequence. We use the limit(n) command for this purpose. It takes n elements from the
sequence or array.
1 r.db('foodb').table('foods')
2 .pluck("id", "name", "food_group")
3 .orderBy(r.desc("name"), r.asc("created_at"))
4 .limit(4)
5 //=>Executed in 107ms. 4 rows returned
6 [{
7 "food_group": "Milk and milk products",
8 "id": 634,
9 "name": "Yogurt"
10 }, {
11 "food_group": "Milk and milk products",
12 "id": 656,
13 "name": "Ymer"
14 }, {
15 "food_group": "Aquatic foods",
16 "id": 523,
17 "name": "Yellowtail amberjack"
18 }, {
19 "food_group": "Aquatic foods",
20 "id": 522,
21 "name": "Yellowfin tuna"
22 }]
limit gets us the number of documents we want, but it always starts from the beginning of the
sequence. To start selecting data from a given position, we use skip.
Skip
As its name suggests, skip(n) ignores n elements from the head of the sequence.
1 r.db('foodb').table('foods')
2 .pluck("id", "name", "food_group")
3 .orderBy(r.desc("name"), r.asc("created_at"))
4 .skip(2)
5 .limit(2)
6 //=> Executed in 97ms. 2 rows returned
7 [{
8 "food_group": "Aquatic foods",
9 "id": 523,
10 "name": "Yellowtail amberjack"
11 }, {
12 "food_group": "Aquatic foods",
13 "id": 522,
14 "name": "Yellowfin tuna"
15 }]
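The skip/limit pair is the classic pagination idiom; in plain JavaScript (an analogy, not ReQL) it corresponds to a single slice call:

```javascript
// skip(page * perPage) then limit(perPage), expressed as one slice.
function paginate(rows, page, perPage) {
  const skip = page * perPage; // skip(page * perPage)
  return rows.slice(skip, skip + perPage); // then limit(perPage)
}

const names = ["Yogurt", "Ymer", "Yellowtail amberjack", "Yellowfin tuna"];
console.log(paginate(names, 1, 2)); // ["Yellowtail amberjack", "Yellowfin tuna"]
```

Here page is zero-based, so page 1 with 2 per page yields the third and fourth rows.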
Accessing a nested field is a common need, so let's create a books table and insert a couple of
documents with nested fields to play with:
1 r.tableCreate("books")
1 r.table("books")
2 .insert([
3 {
4 id: 1,
5 name: "Simply RethinkDB",
6 address: {
7 country: {
8 code: "USA",
9 name: "The United State of America"
10 }
11 },
12 contact: {
13 phone: {
14 work: "408-555-1212",
15 home: "408-555-1213",
16 cell: "408-555-1214"
17 },
18 email: {
19 work: "bob@smith.com",
20 home: "bobsmith@gmail.com",
21 other: "bobbys@moosecall.net"
22 },
23 im: {
24 skype: "Bob Smith",
25 aim: "bobmoose",
26 icq: "nobodyremembersicqnumbers"
27 }
28 }
29 },
30 {
31 id: 2,
32 name: "TKKG",
33 address: {
34 country: {
35 code: "GER",
36 name: "Germany"
37 }
38 },
39 contact: {
40 phone: {
41 work: "408-111-1212",
42 home: "408-111-1213",
43 cell: "408-111-1214"
44 },
45 email: {
46 work: "bob@gmail.com",
47 home: "bobsmith@axcoto.com",
48 other: "bobbys@axcoto.com"
49 },
50 im: {
51 skype: "Jon",
52 aim: "Jon",
53 icq: "nooneremembersicqnumbers"
54 }
55 }
56 }
57 ])
Depending on your language, you will usually have some way to access a nested field by following
the nested path. In the above example, let's say we want to access the skype im; the path is:
contact -> im -> skype
Using the JavaScript driver, we use brackets to access fields and sub-fields.
1 r.table('books').get(1)('contact')('im')('skype')
2 //=>
3 "Bob Smith"
Or in Ruby:
1 r.table('books').get(1)['contact']['im']['skype']
We can keep calling bracket to follow the path down to the final nested field. And not just on a
single document; we can use bracket at the table level too:
1 r.table('books')('address')('country')
2 [
3 {
4 "code": "GER" ,
5 "name": "Germany"
6 }, {
7 "code": "USA" ,
8 "name": "The United State of America"
9 }
10 ]
1 r.table('books')
2 .filter({id: 1})('address')('country')('name')
3 //=>
4 "The United State of America"
Besides the bracket command, we can also use getField if that feels more natural:
1 r.table('books')
2 .getField('contact')('email')
3 //=>
4 [
5 {
6 "home": "bobsmith@axcoto.com" ,
7 "other": "bobbys@axcoto.com" ,
8 "work": "bob@gmail.com"
9 }, {
10 "home": "bobsmith@gmail.com" ,
11 "other": "bobbys@moosecall.net" ,
12 "work": "bob@smith.com"
13 }]
At the end of the day, all you have to remember is to drill down the path with a chain of bracket
commands.
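The "drill down the path" idea can be sketched in plain JavaScript (an analogy to chaining bracket commands, not driver code):

```javascript
// Follow a path of keys into a nested document, like
// ('contact')('im')('skype') in ReQL.
function getPath(doc, path) {
  return path.reduce((value, key) => value[key], doc);
}

const book = {
  id: 1,
  name: "Simply RethinkDB",
  contact: { im: { skype: "Bob Smith" } },
};
console.log(getPath(book, ["contact", "im", "skype"])); // "Bob Smith"
```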
Wrap Up
We now have a basic understanding of selecting data. We will learn more about advanced queries in
other chapters. For now, let's move on and try to write some data into RethinkDB.
4. Modifying data
We know how to fetch data, but a database is useless without the ability to write data. In this
chapter, we will learn about writing data. We will address database commands, table commands, and
then document commands.
Database
All commands at the database level start from the top namespace r, since databases are the top-level
objects in any database system. Let's start our journey by creating a database. Remember, we need a
database to hold everything.
Create
Very simple. With an example, you will get it easily.
1 //Create database
2 r.dbCreate("db1")
3 #=>
4 {
5 "config_changes": [
6 {
7 "new_val": {
8 "id": "5e4a85fa-d867-4a93-aa01-2d08ed6f0b14" ,
9 "name": "db1"
10 } ,
11 "old_val": null
12 }
13 ] ,
14 "dbs_created": 1
15 }
If creation succeeds, we get back an object where dbs_created is 1. config_changes has a new_val
field holding the database's config values; old_val is always null because this is a new database.
The config value is the configuration for an individual database or table. What is this configuration?
Usually, when we create an object in RethinkDB (a database, a table) we can pass a list of options
to the creating command. Those options have to be stored somewhere, and we should be able to
read them back. For a database, the configuration is just its name and its id; that's why you see the
id and name returned in the above query. We will learn more about this configuration very shortly
in this chapter.
We can confirm by listing what we have:
1 r.dbList()
2 #=>
3 [
4 "foodb" ,
5 "rethinkdb" ,
6 "superheroes" ,
7 "test"
8 ]
Notice the special database called rethinkdb? This is a special database created by RethinkDB
to hold metadata and configuration. It's very similar to the mysql database in a MySQL server.
Remember the configuration from the dbCreate call? That configuration is stored in the table
db_config inside this rethinkdb database.
Drop
So we have the default test database and the db1 we just created. Since we don't use db1, let's delete
it to keep our database list clean.
1 r.dbDrop('db1')
2 #=>
3 {
4 "config_changes": [
5 {
6 "new_val": null ,
7 "old_val": {
8 "id": "5e4a85fa-d867-4a93-aa01-2d08ed6f0b14" ,
9 "name": "db1"
10 }
11 }
12 ] ,
13 "dbs_dropped": 1 ,
14 "tables_dropped": 0
15 }
Very similar to dbCreate, but in the opposite direction. Now new_val is null because the database no
longer exists, and old_val holds the id and name, that is, the old configuration, of the dropped database.
Table
Tables have to sit inside a database; therefore, all table commands are called on a database. When
you don't explicitly specify a database with r.db, the current default database is used for table
manipulation.
Create
The syntax to create a table is
1 db.tableCreate(tableName[, options])
The second parameter is optional. This is what we consider the configuration for a table. It's similar
to the database configuration, but table configuration is much richer. Some important options are:
primaryKey
the name of the primary key field. The default name is id. The value of the primary key field
is always indexed automatically and used as the primary key. Using this option, you can change
that default behaviour, for example using uuid as the primary key field. When a new document
is inserted, RethinkDB will then index the value of the uuid field instead of the id field.
durability
accepts a value of soft or hard. soft means writes are acknowledged by the server immediately,
and data is flushed to disk in the background; if that flushing fails, we may not know. The default
behaviour, hard, is to acknowledge only after data is written to disk. It's the default because
it's much safer. When we don't need the data to be durable, such as when writing a cache or an
unimportant log, we can set durability to soft to speed up writes. For any important, serious
data, keep the default.
RethinkDB stores the configuration of each table in a special table called table_config inside the
database rethinkdb.
Let's try to create a table, for example: r.db('test').tableCreate('users', {primaryKey: 'myid'}).
List table
To list the tables we have inside a database, we use the tableList command. It's similar to SHOW
TABLES in MySQL.
1 r.db("foodb").tableList()
2 //=>
3 [
4 "compound_synonyms" ,
5 "compounds" ,
6 "compounds_flavors" ,
7 "compounds_foods" ,
8 "compounds_health_effects" ,
9 "flavors" ,
10 "foods" ,
11 "health_effects" ,
12 "t1" ,
13 "users"
14 ]
Drop table
To get rid of a table, use the tableDrop command.
1 r.db("foodb").tableDrop("t1")
2 //=>
3 {
4 "config_changes": [{
5 "new_val": null,
6 "old_val": {
7 "db": "foodb",
8 "durability": "hard",
9 "id": "d20fe79e-9e90-4625-95f7-c9e1953bf773",
10 "name": "t1",
11 "primary_key": "id",
12 "shards": [{
13 "primary_replica": "SimplyRethinkDB",
14 "replicas": [
15 "SimplyRethinkDB"
16 ]
17 }],
18 "write_acks": "majority"
19 }
20 }],
21 "tables_dropped": 1
22 }
Very similar to dbDrop, we also get config_changes. new_val is always null because the table is
gone now, and old_val is the configuration of the removed table. The table configuration is usually
what we passed in when we created it with tableCreate.
We have seen that some db and table commands return config_changes. Let's discover where those
configs are stored.
System table
Usually a database server has to keep some metadata or configuration information somewhere.
In the case of RethinkDB, it stores those data in the rethinkdb database. Let's explore it:
1 r.db("rethinkdb").tableList()
2 [
3 "cluster_config" ,
4 "current_issues" ,
5 "db_config" ,
6 "jobs" ,
7 "logs" ,
8 "server_config" ,
9 "server_status" ,
10 "stats" ,
11 "table_config" ,
12 "table_status"
13 ]
The name of each table should suggest what it contains. Let's inspect server_config:
1 r.db("rethinkdb").table("server_config")
2 //=>
3 {
4 "cache_size_mb": "auto" ,
5 "id": "fdc5dade-2f0c-498f-8c4b-59ad0d976471" ,
6 "name": "Vinh_local_u27" ,
7 "tags": [
8 "default"
9 ]
10 }
By modifying those tables, we change the configuration of our server. For example, we can get the
RethinkDB version by querying server_status:
1 r.db("rethinkdb").table("server_status")("process")("version")
In other words, those system tables reflect information about how the system operates. We can
query them to fetch or modify system information.
We can also read back, for any table, the configuration that we set when creating it with tableCreate:
1 r.db("rethinkdb").table("table_config")
2 //=>
3 {
4 "db": "foodb" ,
5 "durability": "hard" ,
6 "id": "2e41fc0b-ea5e-4460-bd3b-5d33a5ec49af" ,
7 "name": "health_effects" ,
8 "primary_key": "id" ,
9 "shards": [
10 {
11 "primary_replica": "SimplyRethinkDB" ,
12 "replicas": [
13 "SimplyRethinkDB"
14 ]
15 }
16 ] ,
17 "write_acks": "majority"
18 } {
19 "db": "foodb" ,
20 "durability": "hard" ,
21 "id": "3fbf59ad-35df-445c-9fa9-be19071d38d7" ,
22 "name": "compounds_flavors" ,
23 "primary_key": "id" ,
24 "shards": [
25 {
26 "primary_replica": "SimplyRethinkDB" ,
27 "replicas": [
28 "SimplyRethinkDB"
29 ]
30 }
31 ] ,
32 "write_acks": "majority"
33 }
Looking at the above result, we can see that the table health_effects of database foodb has
primary_key id and write_acks majority.
You can have more fun, and gain a deeper understanding of what happens under the hood, by
inspecting those tables.
Document
After creating a database and a table, we can start inserting documents into the table.
Insert
As you can guess, we start from the database, chain the table, and use the insert command to insert
a document into the table. For example, here we don't set our own value for the primary key field
id; RethinkDB will generate one and return it to us via the result object:
1 r.db("foodb").table("users")
2 .insert({name: "foo", age: 12})
3 //=>
4 {
5 "deleted": 0 ,
6 "errors": 0 ,
7 "generated_keys": [
8 "b7e23aa4-e7d8-4d3e-9020-4c5b1ec413c6"
9 ] ,
10 "inserted": 1 ,
11 "replaced": 0 ,
12 "skipped": 0 ,
13 "unchanged": 0
14 }
inserted
the number of documents that were successfully inserted.
replaced
the number of documents that were updated when upsert is used.
unchanged
the number of documents that would have been modified, except that the new value was the
same as the old value when doing an upsert.
errors
the number of errors encountered while performing the insert.
first_error
If errors were encountered, contains the text of the first error.
deleted, skipped
0 for an insert operation.
generated_keys
a list of generated primary keys in case the primary keys for some documents were missing
(capped to 100000).
warnings
if the field generated_keys is truncated, you will get a "Too many generated keys" warning.
old_val
if returnVals is set to true, contains null.
new_val
if returnVals is set to true, contains the inserted/updated document.
Notice generated_keys: if we insert a document without setting a value for the primary key, whose
field name is id by default, RethinkDB generates a UUID and uses it as the value of the id field.
In our example, the primary key is b7e23aa4-e7d8-4d3e-9020-4c5b1ec413c6. We can retrieve
the document again.
1 r.db("foodb").table("users")
2 .get('b7e23aa4-e7d8-4d3e-9020-4c5b1ec413c6')
3 //=>
4 {
5 "age": 12 ,
6 "id": "b7e23aa4-e7d8-4d3e-9020-4c5b1ec413c6" ,
7 "name": "foo"
8 }
Multi insert
If we have a large array of data, we don't have to insert documents one by one; we can pass the
whole array to insert to do a batch insert, which is much more efficient than inserting one at a time.
Let's play with it a bit. Create a test table in our test database, this time using myid as the primary
key field instead of id. We can insert multiple documents at a time. Some documents have myid,
some don't. We will see how RethinkDB generates primary keys for those documents:
1 r.db("test").table("users")
2 .insert([
3 {
4 myid: 1,
5 name: 'Hydra'
6 },
7 {
8 name: 'Pluto'
9 },
10 {
11 name: 'Styx',
12 myid: 'abcxyz'
13 }
14 ])
15 //=>
16 {
17 "deleted": 0 ,
18 "errors": 0 ,
19 "generated_keys": [
20 "8c3a1d6c-2b7b-4a4f-91dc-d6855c5aed15"
21 ] ,
22 "inserted": 3 ,
23 "replaced": 0 ,
24 "skipped": 0 ,
25 "unchanged": 0
26 }
1 r.db("test").table("users")
2 //=>
3 {
4 "myid": 1,
5 "name": "Hydra"
6 } {
7 "myid": "abcxyz",
8 "name": "Styx"
9 } {
10 "myid": "8c3a1d6c-2b7b-4a4f-91dc-d6855c5aed15",
11 "name": "Pluto"
12 }
Yay, how cool is that? We used a custom primary key, we inserted multiple documents at a time,
and RethinkDB assigned a primary key where one was missing.
Let's check that the myid field is really used as the primary index. We can call get, because get
operates on the primary key:
1 r.db("test").table("users")
2 .get('abcxyz')
3 //=>
4 {
5 "myid": "abcxyz" ,
6 "name": "Styx"
7 }
Effect of durability
Lets see the difference of durability. We will insert a big document.
1 r.table('git').insert(
2 r.http('https://api.github.com/repos/rethinkdb/rethinkdb/stargazers'),
3 {durability: 'soft'}
4 )
5 Executed in 773ms. 1 row returned
1 r.table('git').insert(
2 r.http('https://api.github.com/repos/rethinkdb/rethinkdb/stargazers'),
3 {durability: 'hard'}
4 )
5 Executed in 1.18s. 1 row returned
So hard durability is slower because it takes time to write to the hard drive. You may not see the
effect if you have a very fast hard drive or SSD; I tried it on an external spinning drive :).
Update
To make it easier, you can think of updating as selecting data, then changing its value. We chain the
update method onto a selection to update its data. With that in mind, we can update one
or many documents at a time. Similar to MySQL, we can update a full table, or update only the rows
that satisfy a WHERE-like condition.
Think of modification as a transform process where you get a list of documents (one or many),
then transform them by adding fields or rewriting the values of some fields. By that definition, it
doesn't matter whether you update one document or many; as long as you have an array or a stream
of data, you can update them all.
For example, let's update an attribute of a single document. RethinkDB returns an object describing
the result of the update. We can look at the replaced field to see whether data was actually changed.
If we re-run the command below after it has already been applied, nothing is replaced and we get
1 unchanged:
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 age: 13,
5 gender: "f"
6 })
7 //=>
8 {
9 "deleted": 0 ,
10 "errors": 0 ,
11 "inserted": 0 ,
12 "replaced": 0 ,
13 "skipped": 0 ,
14 "unchanged": 1
15 }
That's just how awesome RethinkDB is: every query result is verbose and easy to understand.
In the above example, you can see that when we update age, we also add a new field gender. The
updating process can be understood as merging the return value of the update function or update
expression into the current document. Let's verify that the gender field is really there:
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 //=>
4 {
5 "age": 13 ,
6 "gender": "f" ,
7 "id": "user-foo1" ,
8 "name": "foo"
9 }
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 "address" : {
5 country: "USA",
6 state: "CA",
7 city: "Cuppertino",
8 street: "Infinite Loop",
9 number: "1",
10 ste: "1205"
11 }
12 })
13 //=>
14 {
15 "deleted": 0 ,
16 "errors": 0 ,
17 "inserted": 0 ,
18 "replaced": 1 ,
19 "skipped": 0 ,
20 "unchanged": 0
21 }
replaced is 1, which means we updated successfully. Now, let's say I moved; I can change the address:
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 "address" : {
5 ste: 880,
6 number: 11
7 }
8 })
9 //=>
10 {
11 "deleted": 0 ,
12 "errors": 0 ,
13 "inserted": 0 ,
14 "replaced": 1 ,
15 "skipped": 0 ,
16 "unchanged": 0
17 }
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 //=>
4 {
5 "address": {
6 "city": "Cuppertino" ,
7 "country": "USA" ,
8 "number": 11 ,
9 "state": "CA" ,
10 "ste": 880 ,
11 "street": "Infinite Loop"
12 } ,
13 "age": 13 ,
14 "gender": "f" ,
15 "id": "user-foo1" ,
16 "name": "foo"
17 }
As you can see in the above example, the fields we passed in received their new values, while the
rest of the nested address object was kept.
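Conceptually, update performs a recursive merge of the new object into the stored document, which is why city and street survived. A plain-JavaScript sketch of that merge (an analogy, not RethinkDB internals):

```javascript
// Recursive merge: nested plain objects are merged key by key,
// everything else is overwritten.
function deepMerge(stored, changes) {
  const out = { ...stored };
  for (const [key, value] of Object.entries(changes)) {
    const old = stored[key];
    if (
      value !== null && typeof value === "object" && !Array.isArray(value) &&
      old !== null && typeof old === "object" && !Array.isArray(old)
    ) {
      out[key] = deepMerge(old, value);
    } else {
      out[key] = value;
    }
  }
  return out;
}

const user = { address: { city: "Cuppertino", ste: "1205", number: "1" } };
console.log(deepMerge(user, { address: { ste: 880, number: 11 } }));
// { address: { city: "Cuppertino", ste: 880, number: 11 } }
```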
Update option
The update command accepts some options. In JavaScript, you pass an options object as the second
parameter; in Ruby, you use keyword options. For example, in JavaScript:
1 r.table("posts").get(1).update({
2 num_comments: r.js("Math.floor(Math.random()*100)")
3 }, {
4 nonAtomic: true
5 })
Here, {nonAtomic: true} is our option object. In Ruby, it's more elegant thanks to keyword arguments:
1 r.table("posts").get(1).update({
2 :num_comments => r.js("Math.floor(Math.random()*100)")
3 }, :non_atomic => true)
durability: possible values are hard and soft. You already know what it does; setting
it here overrides the table's default durability.
non_atomic: you should also know what this does. If not, go back to chapter 2.
So we know the options; let's move on. In this section we learned how to update the value of a field,
including nested values. What if the field contains an array? How can we append a new element, or
update a field with the result of another ReQL command? Let's move to the next section.
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 address: [r.row("address")]
5 })
Here we are using r.row, it allows access to current document we are referencing to. We turn address
into an array by wrap it in [] to turn it into array.
Now, we got address is an array with a single address, How do we add more data into that array.
The simple form, we can pass an value to the update. First, we get the value of current address,
append a new element using what our language offer. In JavaScript code, we may have:
1 addresses = r.db("foodb").table("users")
2 .get("user-foo1")("address")
3 addresses.push(new_address)
4 r.db("foodb").table("users")
5 .get("user-foo1")
6 .update({address: addresses})
That works, but it's very inefficient. For example, if we had to append a new element to an array
for 1,000 users, we would have to fetch the data, change it, and update by sending the whole new array back again.
A more important issue is update locking. We retrieve the data, alter it on the client side, and
push it back to the database. Between the moment we fetch the data and the moment we push it back, the server may
change the data; we weren't aware of that change in our first query, so when we push our version back,
we overwrite it. Imagine we have 2 admins on the site who are trying to edit the same
user's address at the same time.
Admin 1 retrieves the data and adds a new address A. Admin 2 retrieves the data
right after admin 1 did, adds a new address B, and pushes the update before
admin 1 pushes his. So when admin 1 finally pushes his data back, the change admin 2 made is overwritten.
It would be great if we could move the logic into RethinkDB and let RethinkDB handle the locking for
us, just like in SQL we can write UPDATE users SET login = login + 1 to
tell MySQL to increment the value of login by 1 instead of doing it ourselves. Luckily, we have that
in RethinkDB. Some of these commands fall under the Document manipulation section of the RethinkDB docs.
They allow us to apply logic to documents on the server.
Our example above can be rewritten using append. The append command adds a new element to the end of an array.
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 address: r.row("address")
5 .append({country: "Vietnam",
6 city: "Hue",
7 street: "Tran Phu",
8 number: "131"})
9 })
10 //=>
11 {
12 "deleted": 0 ,
13 "errors": 0 ,
14 "inserted": 0 ,
15 "replaced": 1 ,
16 "skipped": 0 ,
17 "unchanged": 0
18 }
What if a user does not have the field yet? Well, an error will be thrown. Let's try:
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 another_address_field: r.row("another_address_field")
5 .append({country: "Vietnam",
6 city: "Hue",
7 street: "Tran Phu",
8 number: "131"})
9 })
10 //=>
11 {
12 "deleted": 0,
13 "errors": 1,
14 "first_error": "No attribute `another_address_field` in object: {
15 "address": [{
16 "city": "Cuppertino",
17 "country": "USA",
18 "number": 11,
19 "state": "CA",
20 "ste": 880,
21 "street": "Infinite Loop"
22 }, {
23 "city": "Hue",
24 "country": "Vietnam",
25 "number": "131",
26 "street": "Tran Phu"
27 }],
28 "age": 13,
29 "gender": "f",
30 "id": "user-foo1",
31 "name": "foo"
32 }
33 " ,
34 "inserted": 0,
35 "replaced": 0,
36 "skipped": 0,
37 "unchanged": 0
38 }
To avoid that, we have to tell RethinkDB what value should be used when the field doesn't exist. For
an array, we can consider that value an empty array; for a string, an empty string; for a counter,
zero.
We use the default(default_value) command for this purpose. Let's try it:
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 another_address_field:
5 r.row("another_address_field")
6 .default([])
7 .append({country: "Vietnam",
8 city: "Hue",
9 street: "Tran Phu",
10 number: "131"})
11 })
12 //=>
13 {
14 "deleted": 0 ,
15 "errors": 0 ,
16 "inserted": 0 ,
17 "replaced": 1 ,
18 "skipped": 0 ,
19 "unchanged": 0
20 }
When we call the default command on a value or a sequence, it evaluates to the given default value
whenever a non-existence error occurs on that value. We can verify the document again:
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 //=>
4 {
5 "address": [{
6 "city": "Cuppertino",
7 "country": "USA",
8 "number": 11,
9 "state": "CA",
10 "ste": 880,
11 "street": "Infinite Loop"
12 }, {
13 "city": "Hue",
14 "country": "Vietnam",
15 "number": "131",
16 "street": "Tran Phu"
17 }],
18 "age": 13,
19 "another_address_field": [],
20 "gender": "f",
21 "id": "user-foo1",
22 "name": "foo"
23 }
So we have append to add an element at the end of an array; we can also use prepend, which adds a
new element at the front of the array.
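As a plain-JavaScript illustration of the two commands' semantics (append and prepend are server-side ReQL commands; the helpers here are just for illustration):

```javascript
// append puts the new element at the end, prepend at the front
// (plain-JavaScript illustration of the two ReQL commands).
const append = (arr, el) => arr.concat([el]);
const prepend = (arr, el) => [el].concat(arr);

const addresses = ["Cuppertino", "Hue"];
append(addresses, "Hanoi");  //=> ["Cuppertino", "Hue", "Hanoi"]
prepend(addresses, "Hanoi"); //=> ["Hanoi", "Cuppertino", "Hue"]
// Note: concat does not mutate, so addresses itself is unchanged.
```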
Take another example: we want to count how many likes a user has. We will use a field called like,
and we increase it by 1 whenever someone likes the user.
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({like: r.row("like").add(1)})
Notice that we have to use the add command instead of writing something like:
1 like: r.row("like") +1
That is because these expressions are evaluated on the RethinkDB server, not on the client. When I first learned
RethinkDB I somehow didn't understand this; I kept thinking the expressions ran on the client,
and it made my life harder. So I will remind you of this several times throughout the book, just in case
someone is confused like I was. In languages that support operator overloading, you may write
1 like: r.row("like") + 1
That works because the driver overloads the + operator to make expressions easy to write. They are still
serialized to ReQL syntax by the driver.
Now, with the above update command, we get an error, which is expected:
1 {
2 "deleted": 0,
3 "errors": 1,
4 "first_error": "No attribute `like` in object: {
5 "address": [{
6 "city": "Cuppertino",
7 "country": "USA",
8 "number": 11,
9 "state": "CA",
10 "ste": 880,
11 "street": "Infinite Loop"
12 }, {
13 "city": "Hue",
14 "country": "Vietnam",
15 "number": "131",
16 "street": "Tran Phu"
17 }],
18 "age": 13,
19 "another_address_field": [{
20 "city": "Hue",
21 "country": "Vietnam",
22 "number": "131",
23 "street": "Tran Phu"
24 }],
25 "gender": "f",
26 "id": "user-foo1",
27 "name": "foo"
28 }
29 " ,
30 "inserted": 0,
31 "replaced": 0,
32 "skipped": 0,
33 "unchanged": 0
34 }
Again, default comes to the rescue:
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({like: r.row("like").default(0).add(1)})
4 // Get back like
5 r.db("foodb").table("users").get("user-foo1")("like")
6 //=>
7 1
add is not limited to numeric data; it works on arrays and strings too. I will leave that part for you as an
exercise.
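As a head start on that exercise, here is a plain-JavaScript sketch of how add behaves per type (illustrative only; on the server this is the add command):

```javascript
// add() behaves differently per type: numbers add, strings and arrays
// concatenate (plain-JavaScript illustration of the server-side command).
function reqlAdd(a, b) {
  if (Array.isArray(a) && Array.isArray(b)) return a.concat(b);
  if (typeof a === "number" && typeof b === "number") return a + b;
  if (typeof a === "string" && typeof b === "string") return a + b;
  throw new Error("add: incompatible types");
}

reqlAdd(1, 2);          //=> 3
reqlAdd("food", "db");  //=> "fooddb"
reqlAdd([1, 2], [3]);   //=> [1, 2, 3]
```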
Now we know how to work with an array as the value of a field. Let's dive into working with an
object as the value of a field. Consider this:
1 r.db("foodb").table('users')
2 .get('user-foo1')
3 .update({
4 social: {twitter: "kureikain"}
5 })
6 //=>
7 {
8
9 "deleted": 0 ,
10 "errors": 0 ,
11 "inserted": 0 ,
12 "replaced": 1 ,
13 "skipped": 0 ,
14 "unchanged": 0
15 }
The social field is an object now. Say the user enters his Facebook username, so we want to
add a new field facebook to the social field to record the user's Facebook account. We cannot use
append or add on an object. For objects, we use merge to add or override a field.
1 r.db("foodb").table('users')
2 .get('user-foo1')
3 .update({
4 social: r.row('social').default({}).merge({facebook: "kureikain"})
5 })
6 //=>
7 {
8
9 "deleted": 0 ,
10 "errors": 0 ,
11 "inserted": 0 ,
12 "replaced": 1 ,
13 "skipped": 0 ,
14 "unchanged": 0
15
16 }
As with append, we have to set a default value to handle the non-existence error. Since we are
working with an object, we set the default value to the empty object: {}. merge overrides existing keys with
the new values from the object you pass in, and creates new keys for the ones that don't exist yet.
1 r.db("foodb").table('users')
2 .get('user-foo1')
3 .update({
4 social: r.row('social').default({}).merge({facebook: "kureikain2", twitter: \
5 "kureikain2"})
6 })
7 //=>
8 {
9
10 "deleted": 0 ,
11 "errors": 0 ,
12 "inserted": 0 ,
13 "replaced": 1 ,
14 "skipped": 0 ,
15 "unchanged": 0
16
17 }
18
19 // Select it back
20 r.db("foodb").table('users')
21 .get('user-foo1')("social")
22 //=>
23 {
24 "facebook": "kureikain2" ,
25 "twitter": "kureikain2"
26 }
Cool, so we know how to add new fields and override old ones. But notice that when a field
contains an object, its keys are really just nested fields, so we can easily do the update with our earlier
nested-field knowledge instead of the merge command:
1 r.db("foodb").table('users')
2 .get('user-foo1')
3 .update({
4 social: {facebook: "kureikain3", github: "kureikain"}
5 })
It's really up to you whether to use merge or the nested-field style. I usually use the nested-field style when
doing a simple update, and merge when I want to merge the document with the result of another ReQL
function. But that's just my opinion.
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({like: r.row("like").default(0).add(1)})
However, r.row has a limitation: it cannot be used in nested queries. Assume that we have a friends
table whose documents link users together, with a field friend2_id holding the id of the befriended user.
Our user with id user-foo1 has 2 friends there. It would not be very efficient to
count this over and over, so we are going to count it once and store the result in the users table in a field
friend_counts.
To count, we can get the sequence of friend2_id values and count how many of them equal the
current user's id, by passing a value to the count function: when we pass a value to count, it only counts
the documents equal to that value. Here, we try to use r.row to reference the current user:
1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .update({
4 friend_counts: r.db('foodb').table('friends')('friend2_id').count(r.row('id'\
5 ))
6 })
RqlCompileError: Cannot use r.row in nested queries. Use functions instead in:
r.table(foodb).get(user-foo1).update({friend_counts:
r.db(foodb).table(friends)(friend2_id).count(r.row(id))})
The reason for this error is that RethinkDB doesn't know which query to bind r.row to: the
main query on the users table, or the subquery on the friends table. Luckily, we can use an anonymous function
to solve this. A function still gives access to the current document, but it avoids the ambiguity of r.row because its
parameter is clearly bound to a single sequence.
Expression
Let's get some basic knowledge first; then we will come back to the previous example.
Besides passing an object into the update command, we can also pass an expression or a function which
returns an object. RethinkDB evaluates it and uses the resulting object for the update.
This comes in useful when the update involves some logic based on the document itself.
In the case of a function, the function receives the current document as its first parameter.
We can rewrite the previous example using a function:
1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .update(function (user) {
4 return {
5 friend_counts: r.db('foodb').table('friends')('friend2_id').count(user("id\
6 "))
7 }
8 })
1 RqlRuntimeError: Could not prove function deterministic. Maybe you want to use \
2 the non_atomic flag? in:
3 r.db("foodb").table("users").get("user-foo1").update(function(var_58) { return {\
4 friend_counts: r.db("foodb").table("friends")("friend2_id").count(var_58("id"))}\
5 ; })
non-atomic updates
A good way to remember what a non-atomic update is: it usually involves values which
cannot be predicted, such as random values or the result of another query.
To run a non-atomic update, we have to tell RethinkDB explicitly with the nonAtomic option:
1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .update(function (user) {
4 return {
5 friend_counts: r.db('foodb').table('friends')('friend2_id').count(user("id\
6 "))
7 }
8 }, {nonAtomic: true})
9 //=>
10 {
11 "deleted": 0 ,
12 "errors": 0 ,
13 "inserted": 0 ,
14 "replaced": 1 ,
15 "skipped": 0 ,
16 "unchanged": 0
17 }
1 r.db('foodb').table('users')
2 .get('user-foo1')('friend_counts')
3 //=>
4 2
So we can see that in RethinkDB we have to opt in to some features. Later on, we will see that
we also have to pass an index name explicitly in order to use an index. That may be a little verbose at first, but it
helps you understand the query and know exactly what you are doing.
Updating with a function is really similar to passing an object to the update command: we have to return a
JSON object with key-value pairs just like the JSON document we would pass directly to update.
We can name the function parameter whatever we like; the name isn't important. It works just like a
callback function: RethinkDB passes the current document as the argument when it invokes the
function. Almost anything we can do with r.row, we can also do with that parameter, such as getting the
value of a field. Let's change user to u and see that it still works:
1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .update(function (u) {
4 return {
5 friend_counts: r.db('foodb').table('friends')('friend2_id').count(u("id"))
6 }
7 }, {nonAtomic: true})
Let's do one more complex example. If a user has more than 10 friends, we set a field social_status
to extrovert; otherwise, to introvert.
1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .update(function (u) {
4 return {
5 social_status: r.branch(u('friend_counts').gt(10), 'extrovert', 'introvert\
6 ')
7 }
8 }, {nonAtomic: true})
Here we are using a new command, r.branch. It's like IF in MySQL: if the first argument is true,
the second argument is the return value; otherwise the third argument is. We use u('friend_counts')
to get the value of friend_counts, as you know, and we call the gt command on it. gt means greater than;
it returns true if the value is greater than what we pass to gt.
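The semantics of branch and gt can be sketched in plain JavaScript (illustrative helpers, not driver code):

```javascript
// branch(cond, ifTrue, ifFalse) returns the second argument when the
// condition holds, otherwise the third -- like IF() in MySQL.
const branch = (cond, ifTrue, ifFalse) => (cond ? ifTrue : ifFalse);
const gt = (a, b) => a > b; // greater than

branch(gt(12, 10), "extrovert", "introvert"); //=> "extrovert"
branch(gt(2, 10), "extrovert", "introvert");  //=> "introvert"
```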
When using a function, the parameter passed into the function is the current document.
Therefore you can use many document manipulation commands on it, such as pluck, without,
merge, append, and prepend. Just remember this, so you know what you can do with that parameter.
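As a plain-JavaScript sketch of two of those commands (in ReQL, pluck and without are commands called on the document itself):

```javascript
// pluck keeps only the listed fields; without drops them
// (plain-JavaScript illustration of the ReQL document commands).
function pluck(doc, ...fields) {
  const out = {};
  for (const f of fields) if (f in doc) out[f] = doc[f];
  return out;
}
function without(doc, ...fields) {
  const out = Object.assign({}, doc);
  for (const f of fields) delete out[f];
  return out;
}

const u = { id: "user-foo1", name: "foo", age: 13 };
pluck(u, "name", "age"); //=> { name: "foo", age: 13 }
without(u, "age");       //=> { id: "user-foo1", name: "foo" }
```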
Expr
expr is a normal function, but I think it is important and will help us achieve many crazy things,
so I cover it here.
What expr does is transform a native object from the host language into a ReQL object. For example,
even if a RethinkDB function can be called on an array or sequence, we cannot write something like [e1,
e2].nth(2); RethinkDB will throw an error on [e1, e2].
What we have to do is convert the array written in our native language into a RethinkDB
data type. To do that, we simply wrap it in expr.
A real example: while writing this book, I wanted to generate random fake data for the gender field of the users
table. I did this with:
1 r.db("foodb").table("users")
2 .update({
3 gender: r.expr(['m', 'f']).nth(r.random(0, 2))
4 }, {nonAtomic: true})
It means that for every document of the users table, I set gender to either m or f randomly.
I create a two-element array ['m', 'f'], turn it into a ReQL object with expr so that I can call nth on
it, and pass a random number that is either 0 or 1 (r.random(0, 2) generates an integer from 0 up to, but not including, 2).
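What that query computes for each document can be mimicked in plain JavaScript (illustrative only):

```javascript
// Pick a random element of a two-element array, mirroring
// r.expr(['m', 'f']).nth(r.random(0, 2)).
const genders = ["m", "f"];
const randomGender = () => genders[Math.floor(Math.random() * 2)];

const g = randomGender();
// g is always either "m" or "f"
```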
Let's try a more complex example to generate some data. For every user, we generate a random list of
eaten food names in eatenfoods, by selecting all food names and using the sample(number) command to pick a
number of random elements.
1 r.db("foodb").table("users")
2 .update({
3 eatenfoods: r.db("foodb").table("foods").sample(r.random(0,
1 r.db("foodb").table("users")
2 .update({
3 eateanorlike : r.add(r.row("eatenfoods"), [r.row("favfoods").nth(0)])
4 }, {nonAtomic: true})
By combining ReQL expressions, looking through the RethinkDB API, and finding the appropriate functions, we can
achieve what we want. In the above example, we want to concatenate two arrays: eatenfoods
and the first item of favfoods. We used r.add. We have to wrap r.row("favfoods").nth(0) in
[] because nth() returns a single element, whereas r.add expects arrays.
We also didn't have an age field on the users table. Let's generate some fake data for it so we can
play around later. Here we randomize age between 8 and 90:
1 r.db("foodb").table("users")
2 .update({
3 age : r.random(8, 90)
4 }, {nonAtomic: true})
5 #=>
6 {
7 "deleted": 0 ,
8 "errors": 0 ,
9 "inserted": 0 ,
10 "replaced": 152 ,
11 "skipped": 0 ,
12 "unchanged": 0
13 }
By using functions and/or expressions, we can update documents in complex ways. If we look carefully
through the RethinkDB API, we can usually find the function we want; if not, we can probably build the logic
inside a function ourselves.
Return Values
Sometimes it is useful to get back the updated document, so you can verify the result
without issuing a subsequent get command. We just need to set the returnChanges flag to true in the
option parameter of the update command. Same example:
1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .update(function (user) {
4 return {
5 _social_status: r.branch(user('friend_counts').gt(10), 'extrovert', 'intro\
6 vert')
7 }
8 }, {nonAtomic: true, returnChanges: true})
9 //=>
10 {
11 "changes": [{
12 "new_val": {
13 "_social_status": "introvert",
14 "address": [{
15 "city": "Cuppertino",
16 "country": "USA",
17 "number": 11,
18 "state": "CA",
19 "ste": 880,
20 "street": "Infinite Loop"
21 }, {
22 "city": "Hue",
23 "country": "Vietnam",
24 "number": "131",
25 "street": "Tran Phu"
26 }],
27 "age": 13,
28 "another_address_field": [{
29 "city": "Hue",
30 "country": "Vietnam",
31 "number": "131",
32 "street": "Tran Phu"
33 }],
34 "friend_counts": 2,
35 "gender": "f",
36 "id": "user-foo1",
37 "name": "foo",
38 "social": {
39 "facebook": "kureikain3",
40 "github": "kureikain",
41 "twitter": "kureikain2"
42 },
43 "social_status": "introvert"
44 },
45 "old_val": {
46 "address": [{
47 "city": "Cuppertino",
48 "country": "USA",
49 "number": 11,
50 "state": "CA",
51 "ste": 880,
52 "street": "Infinite Loop"
53 }, {
54 "city": "Hue",
55 "country": "Vietnam",
56 "number": "131",
57 "street": "Tran Phu"
58 }],
59 "age": 13,
60 "another_address_field": [{
61 "city": "Hue",
62 "country": "Vietnam",
63 "number": "131",
64 "street": "Tran Phu"
65 }],
66 "friend_counts": 2,
67 "gender": "f",
68 "id": "user-foo1",
69 "name": "foo",
70 "social": {
71 "facebook": "kureikain3",
72 "github": "kureikain",
73 "twitter": "kureikain2"
74 },
75 "social_status": "introvert"
76 }
77 }],
78 "deleted": 0,
79 "errors": 0,
80 "inserted": 0,
81 "replaced": 1,
82 "skipped": 0,
83 "unchanged": 0
84 }
The old and new values are returned under the keys old_val and new_val respectively.
As you can see, we have started to mess up our data. That's ok; let's learn some commands that destroy data. Let's
meet the replace command.
Replace
First, we want to remove eateanorlike.
To remove one or many fields from a document, we cannot use update anymore. We can set a field
to a null value (null or nil, depending on your language), but the key would still be in the
document, with a null value. In other words, update lets us overwrite fields, but not remove
them. That's why we have another command for removing fields: the replace command replaces the entire
document with a new document.
1 r.db("foodb").table("users").replace(r.row.without('eateanorlike'))
Here we use r.row to get the current document, then call without to remove the field.
without accepts a list of arguments and removes the fields with those names from the document.
For example:
1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .without("address", "another_address_field", "social")
4 //=>
5 {
6 "_social_status": "introvert" ,
7 "age": 13 ,
8 "friend_counts": 2 ,
9 "gender": "f" ,
10 "id": "user-foo1" ,
11 "name": "foo" ,
12 "social_status": "introvert"
13 }
without can also remove nested fields, such as removing the country field from address:
1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .without({address: "country"}, "another_address_field", "social")
4 //=>
5 {
6 "_social_status": "introvert",
7 "address": [{
8 "city": "Cuppertino",
9 "number": 11,
10 "state": "CA",
11 "ste": 880,
12 "street": "Infinite Loop"
13 }, {
14 "city": "Hue",
15 "number": "131",
16 "street": "Tran Phu"
17 }],
18 "age": 13,
19 "friend_counts": 2,
20 "gender": "f",
21 "id": "user-foo1",
22 "name": "foo",
23 "social_status": "introvert"
24 }
We can use the nested style to denote the fields we want to remove. If we want to remove many
fields, we wrap them in an array.
For example, we remove all fields of address except country:
1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .without({address: ["number","state", "city", "ste", "street"]}, "another_addr\
4 ess_field", "social")
5 //=>
6 {
7 "_social_status": "introvert",
8 "address": [{
9 "country": "USA"
10 }, {
11 "country": "Vietnam"
12 }],
13 "age": 13,
14 "friend_counts": 2,
15 "gender": "f",
16 "id": "user-foo1",
17 "name": "foo",
18 "social_status": "introvert"
19 }
Note that we can replace a document with an entirely new one; however, the primary key cannot be changed.
It has to be the same as the current primary key. An attempt to change the primary key will cause
the error Primary key id cannot be changed:
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .replace({id: 1})
1 {
2 "deleted": 0,
3 "errors": 1,
4 "first_error": "Primary key `id` cannot be changed (`{
5 "_social_status": "introvert",
6 "address": [{
7 "city": "Cuppertino",
8 "country": "USA",
9 "number": 11,
10 "state": "CA",
11 "ste": 880,
12 "street": "Infinite Loop"
13 }, {
14 "city": "Hue",
15 "country": "Vietnam",
16 "number": "131",
17 "street": "Tran Phu"
18 }],
19 "age": 13,
20 "friend_counts": 2,
21 "gender": "f",
22 "id": "user-foo1",
23 "name": "foo",
24 "social": {
25 "facebook": "kureikain3",
26 "github": "kureikain",
27 "twitter": "kureikain2"
28 },
29 "social_status": "introvert"
30 }
31 ` -> ` {
32 "id": 1
33 }
34 `)." ,
35 "inserted": 0 ,
36 "replaced": 0 ,
37 "skipped": 0 ,
38 "unchanged": 0
39 }
Of course, changing a primary key is really just removing the old document and inserting a new one.
Let's learn about removing data.
Delete
Delete is similar to update or replace: we select a sequence and call the delete command on it.
This deletes a single document, by using the primary key to select it with get, then calling delete on
that single document:
1 r.db("foodb").table("users")
2 .get("user-foo2")
3 .delete()
We can also clear a whole table or a selection. Let's play with it via a temporary table:
1 r.db("foodb").tableCreate("test1")
2 r.db("foodb").table("test1").insert({field: 'foo', field2: 'bar'})
3 r.db("foodb").table("test1").insert({field: 'foo2', field2: 'bar2'})
4 r.db("foodb").table("test1").insert({age: 10, name: 'abc'})
5 r.db("foodb").table("test1").insert({age: 12, name: 'abc2'})
1 r.db("foodb")
2 .table("test1")
3 .filter(r.row('age').lt(11))
4 .delete()
We use r.row to get the current document, then get the age field value and call lt to do a less-than
comparison.
Basically, we can call delete on any selection: it goes over the selection and removes the data.
So you can already guess the command to delete every document in a table:
1 r.db("foodb").table("test1").delete()
And finally, before we move on, let's remove user user-foo1, since we messed with it a bit:
1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .delete()
Sync
As you learned in the previous chapter, with a durability value of soft, a write isn't guaranteed
to be written to permanent storage. So after doing a bunch of soft-durability writes, you may
want to say "Hey, I am done with all my tasks, make sure those changes are persisted". You can do that by calling
sync.
Using the JavaScript driver:

r.table(t).sync().run(connection, function () {
  console.log("Syncing is done. All data is safe now")
})

The sync command blocks until all previous writes to the table are persisted. With that being said,
sync can only be called on a table.
It's a good idea to do a bunch of soft-durability writes and call sync at the end: that ensures data persistence
while avoiding blocking while the rest of your logic runs.
Wrap up
Some important concepts you should remember:
atomicity
sync
multiple insert
changing the primary key
updating with a function
using the nested-field style to remove nested fields
5. Reading Data Advanced
Understanding index
Index
Soon enough you will realize that filter is slow. If you have a table with more than 100,000 records,
filter stops working out of the box. All of that is because we haven't used an index yet. Without an index, we cannot
even order data:
1 r.db("foodb").table("compounds_foods").orderBy(r.desc("id"))
2 #->
3 RqlRuntimeError: Array over size limit `100000` in:
4 r.db("foodb").table("compounds_foods").orderBy(r.desc("id"))
5 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Without an index, RethinkDB holds all the data in memory and sorts or filters it there, so a limit has
to lie somewhere; 100,000 is the magic limit RethinkDB sets for reading data without an
index.
In order to fetch data properly, we have to create an index. In a real application we almost
always end up creating indexes to fetch data efficiently, just as we do in MySQL.
We have two kinds of indexes in RethinkDB:
primary index: our id key. This index is created automatically by RethinkDB. Coming
back to the above query, we can change it to use the primary index:

r.db("foodb").table("compounds_foods").orderBy({index: r.desc("id")})

secondary index: the index we create ourselves on one or many fields. A secondary index can be
simple, indexing the value of fields directly, or it can do some pre-calculation on the data before indexing.
While an index decreases read time, it increases write time and costs storage space.
It reduces write performance because whenever we insert a document, the index has to be calculated
and written into the database.
RethinkDB supports these kinds of indexes:
Simple: indexes based on the value of a single field.
Compound: indexes based on multiple fields.
Multi: indexes based on arrays of values.
Indexes based on arbitrary expressions.
So now you know what an index is. The sad news is that filter cannot use these secondary indexes.
For that purpose, we have to use other functions: getAll and between.
Creating index
Let's start with the simple index first.
Simple index
As its name suggests, a simple index indexes a single field. Let's say we want to find all compounds_foods
whose name contains banana. We cannot use filter here because this table has more than 100,000
items, and filter doesn't use indexes anyway. Let's meet getAll. getAll grabs all documents where a given
key matches the index we specify.
First, we create an index with this syntax:
1 r.db("foodb")
2 .table("compounds_foods")
3 .indexCreate("orig_food_common_name")
If we don't pass an index function, RethinkDB creates the index on the field that has the
same name as the requested index name.
Time to use it:
1 r.db("foodb")
2 .table("compounds_foods")
3 .getAll('Bananas', {index:'orig_food_common_name'})
4 #=> Executed in 10ms. No results returned.
No result. It seems strange that not a single document whose orig_food_common_name
contains banana was found. Why? Well, a simple index does an exact match: in other words, an
equality comparison. Let's try an exact match:
1 r.db("foodb")
2 .table("compounds_foods")
3 .getAll('Bananas, raw', {index:'orig_food_common_name'})
4 #=> Executed in 69ms. 40 rows returned, 40 displayed, more available
5 {
6 "citation": "USDA" ,
7 "citation_type": "DATABASE" ,
8 "compound_id": 2100 ,
9 "created_at": Tue Jan 03 2012 18:33:15 GMT-08:00 ,
10 "creator_id": null ,
11 "food_id": 208 ,
12 "id": 257686 ,
13 "orig_citation": null ,
14 "orig_compound_id": "262" ,
15 "orig_compound_name": "Caffeine" ,
16 "orig_content": "0.0" ,
17 "orig_food_common_name": "Bananas, raw" ,
18 "orig_food_id": "09040" ,
19 "orig_food_part": null ,
20 "orig_food_scientific_name": null ,
21 "orig_max": null ,
22 "orig_method": null ,
23 "orig_min": null ,
24 "orig_unit": "mg" ,
25 "orig_unit_expression": null ,
26 "updated_at": Tue Jan 03 2012 18:33:15 GMT-08:00 ,
27 "updater_id": null
28 }
We can pass multiple keys to getAll to get an OR effect, meaning RethinkDB returns documents
where the indexed value matches any of the values we pass:
1 r.db("foodb")
2 .table("compounds_foods")
3 .getAll('Bananas, raw', 'Yoghurt with pear and banana', 'Alfalfa seeds',{index\
4 :'orig_food_common_name'})
1 r.db("foodb")
2 .table("compounds_foods")
3 .getAll('Bananas, raw', 'Yoghurt with pear and banana', 'Alfalfa seeds',{index\
4 :'orig_food_common_name'})
5 .count()
6 #=> 256
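The OR behavior can be sketched in plain JavaScript (illustrative; the real lookup happens through the index on the server):

```javascript
// A multi-key getAll matches a row when its indexed value equals ANY
// of the supplied keys (plain-JavaScript illustration).
const names = ["Bananas, raw", "Alfalfa seeds", "Apples", "Bananas, raw"];
const getAll = (...keys) => names.filter((n) => keys.includes(n));

getAll("Bananas, raw", "Alfalfa seeds");
//=> ["Bananas, raw", "Alfalfa seeds", "Bananas, raw"]
```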
Indexes are not only for finding documents; they can be used for sorting as well. To sort, we call orderBy
and pass the index name:
1 r.db("foodb").table("compounds_foods")
2 .orderBy({index: "orig_food_common_name"})
When passing an index, we can wrap it in another expression to change the ordering:
1 r.db("foodb").table("compounds_foods")
2 .orderBy({index: r.desc("orig_food_common_name")})
3 .withFields("orig_food_common_name")
4 #=>
5 {
6 "orig_food_common_name": "Zwieback"
7 } {
8 "orig_food_common_name": "Zwieback"
9 } {
10 "orig_food_common_name": "Zwieback"
11 }
Using withFields, we can pass a list of fields to choose what we want to get back.
Because an index can be used to sort, we can also use it to find values within a range. In RethinkDB,
the between syntax is:
1 r.db("foodb").table("compounds_foods")
2 .between("Apple", "Banana", {index: 'orig_food_common_name'})
3 .orderBy({index: r.desc("orig_food_common_name")})
4 .withFields("orig_food_common_name")
Without a specified index, between operates on the primary index; many RethinkDB functions have the
same behaviour:
1 r.db("foodb").table("compounds_foods")
2 .between(1, 200)
3 .count()
4 #=> 198
This works when we want to find data based on a single field. How about finding values based on multiple
fields? Let's meet the compound index.
Compound index
A compound index is created from the values of multiple fields. It's very similar to a simple index in
syntax; the difference is just how many fields we pass when creating the index. Let's take a look at the compounds_-
foods table: it contains the relationship between foods and compounds. We will learn more about JOINs later.
For now, let's say we want to find all compounds_foods documents where compound_id is 3524
and food_id is 287. We are searching on two columns, so we need an index containing those 2 columns:
1 r.db("foodb").table("compounds_foods")
2 .indexCreate("compound_food_id", [r.row("compound_id"), r.row("food_id")])
The only difference from a simple index is that we have to pass an array of fields when creating the index. Let's
try it:
1 r.db("foodb").table("compounds_foods")
2 .getAll([354,287], {index: 'compound_food_id'})
3 #=>
4 RqlRuntimeError: Index `compound_food_id` on table `foodb.compounds_foods` was
5 accessed before its construction was finished in:
6 r.db("foodb").table("compounds_foods").getAll([354, 287], {index:
7 "compound_food_id"})
8 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\
9 ^^^^^^
We got an error. It looks like the index has not finished being created yet; this table is large, so building the index takes a while.
We can query the index status:
1 r.db("foodb").table("compounds_foods")
2 .indexStatus('compound_food_id')
3 #=>
4 [
5 {
6 "blocks_processed": 13192 ,
7 "blocks_total": 15437 ,
8 "function": <binary, 408 bytes, "24 72 65 71 6c 5f..."> ,
9 "geo": false ,
10 "index": "compound_food_id" ,
11 "multi": false ,
12 "outdated": false ,
13 "ready": false
14 }
15 ]
The field ready is false. We can only wait until it finishes. This table is very big. We can verify:
1 r.db("foodb").table("compounds_foods").count()
2 #=> 737089
Let's just wait for a bit, make a cup of coffee and come back :). When it's ready, you should see:
1 r.db("foodb").table("compounds_foods")
2 .indexStatus('compound_food_id')
3
4 [
5 {
6 "function": <binary, 408 bytes, "24 72 65 71 6c 5f..."> ,
7 "geo": false ,
8 "index": "compound_food_id" ,
9 "multi": false ,
10 "outdated": false ,
11 "ready": true
12 }
13 ]
1 r.db("foodb").table("compounds_foods")
2 .getAll([21477,899], {index: 'compound_food_id'})
3 #=> Executed in 7ms. 1 row returned
4 {
5 "citation": "DFC CODES" ,
6 "citation_type": "DATABASE" ,
7 "compound_id": 21477 ,
8 "created_at": Tue Sep 11 2012 16:12:30 GMT-07:00 ,
9 "creator_id": null ,
10 "food_id": 899 ,
11 "id": 740574 ,
12 "orig_citation": null ,
13 "orig_compound_id": null ,
14 "orig_compound_name": null ,
15 "orig_content": null ,
16 "orig_food_common_name": "Meats" ,
17 "orig_food_id": "WI8000" ,
18 "orig_food_part": null ,
19 "orig_food_scientific_name": null ,
20 "orig_max": null ,
21 "orig_method": null ,
22 "orig_min": null ,
23 "orig_unit": null ,
24 "orig_unit_expression": null ,
25 "updated_at": Tue Sep 11 2012 16:12:30 GMT-07:00 ,
26 "updater_id": null
27 }
With the above indexing approach, you may have noticed that you have to create a dedicated index for whatever you want to find. An index stores a single value per document: either the value of one field, or an ordered set of values in the case of a compound index. However, life is not that simple. Let's look at the users table. It contains a list of users and their favourite foods, stored in the field favfoods. That is a single field, but it holds many elements. Because of that, we cannot simply answer the question of who likes Mushrooms. Let's create an index and try:
1 r.db("foodb").table("users").indexCreate('favfoods')
1 r.db("foodb").table("users")
2 .getAll('Mushrooms', {index: 'favfoods'})
3 #=> Executed in 6ms. No results returned.
Why is that? Because we indexed the whole field as a single value, we have to match the entire value of the field:
1 r.db("foodb").table("users")
2 .getAll(["Edible shell" ,
3 "Clupeinae (Herring, Sardine, Sprat)" ,
4 "Deer" ,
5 "Perciformes (Perch-like fishes)" ,
6 "Bivalvia (Clam, Mussel, Oyster)"], {index: 'favfoods'})
7 #=>
8 {
9 "favfoods": [
10 "Edible shell" ,
11 "Clupeinae (Herring, Sardine, Sprat)" ,
12 "Deer" ,
13 "Perciformes (Perch-like fishes)" ,
14 "Bivalvia (Clam, Mussel, Oyster)"
15 ] ,
16 "id": "1dd8059c-82ca-4345-9d75-eaa0f8edbf48" ,
17 "name": "Arthur Hegmann"
18 }
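Conceptually, a plain index on an array field stores the entire array as one key. A small JavaScript sketch of the idea (an in-memory analogy, not RethinkDB internals):

```javascript
// A plain index on an array field stores the whole array as ONE key.
// Sketch: emulate the index with a Map keyed on the serialized value.
const favIndex = new Map();
favIndex.set(JSON.stringify(["Edible shell", "Deer"]), "user-1");

// Looking up a single element finds nothing:
console.log(favIndex.get(JSON.stringify("Mushrooms")));              // undefined
// Only the whole array value matches:
console.log(favIndex.get(JSON.stringify(["Edible shell", "Deer"]))); // "user-1"
```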
A multi index solves the above question: who likes Mushrooms? A multi index is built over multiple values, in other words, over an array of values. When RethinkDB sees a multi index, it indexes each element of the array separately, so getAll matches a document if any element of its array equals the search value.
To create a multi index, all we have to do is pass the option flag multi: true. The creation statement for the index used below looks like this:
1 r.db("foodb").table("users")
2 .indexCreate('favfoods_multi', r.row('favfoods'), {multi: true})
1 r.db("foodb").table("users")
2 .getAll('Mushrooms', {index: 'favfoods_multi'})
3 #=>
4 {
5 "favfoods": [
6 "Milk substitute" ,
7 "Mushrooms" ,
8 "Nuts" ,
9 "Hummus" ,
10 "Soft-necked garlic"
11 ] ,
12 "id": "47110a8f-3c2c-46b8-96d8-244747c1818b" ,
13 "name": "Annabelle Lindgren"
14 }
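Conceptually, a multi index instead creates one index entry per array element, each pointing back at its document. A JavaScript sketch of the idea (an in-memory analogy with a hypothetical helper, buildMultiIndex; not RethinkDB internals):

```javascript
// Sketch: every element of the array becomes its own index entry
// that points back at the document's id.
function buildMultiIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    for (const value of doc[field]) {
      if (!index.has(value)) index.set(value, []);
      index.get(value).push(doc.id);
    }
  }
  return index;
}

const users = [
  { id: "u1", favfoods: ["Mushrooms", "Nuts"] },
  { id: "u2", favfoods: ["Nuts", "Kiwi"] },
];
const idx = buildMultiIndex(users, "favfoods");
console.log(idx.get("Mushrooms")); // ["u1"]
console.log(idx.get("Nuts"));      // ["u1", "u2"]
```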
Notice that we had to pass r.row('favfoods') to create the index. Remember that to create an index whose name doesn't match any field, we have to pass an expression or an anonymous function to indexCreate to calculate its value. But we already defined a favfoods index before, so we cannot create another index with the same name. Let's go back and drop both indexes to clean things up and free the names:
1 r.db("foodb").table("users").indexDrop('favfoods')
2 #=>
3 { "dropped": 1 }
4 r.db("foodb").table("users").indexDrop('favfoods_multi')
5 #=>
6 { "dropped": 1 }
Now, let's re-create the index on the same field name, this time as a multi index:
1 r.db("foodb").table("users")
2 .indexCreate('favfoods', r.row('favfoods'), {multi: true})
Then another question comes up: can we find all users who like both Kiwi and Banana? We may try this:
1 r.db("foodb").table("users")
2 .getAll('Kiwi', 'Banana', {index: 'favfoods'})
However, that is an OR: RethinkDB returns documents whose index value matches either Kiwi or Banana.
Even more complex: we want to find users who like Kiwi most, meaning Kiwi has to be the first element in their favfoods array.
Here we are pushing business logic into RethinkDB. We have to represent that logic in ReQL, calculate the value, and index the returned value. Let's meet the arbitrary expression index.
1 r.db("foodb").table("users")
2 .indexCreate('most-favourite-food', function (user) {
3 return user("favfoods").nth(0)
4 })
Given an array, nth(n) returns the n-th element. Calling nth(0) on the favfoods array returns its first element, since arrays are zero-indexed.
1 r.db("foodb").table("users")
2 .getAll('Kiwi', {index: 'most-favourite-food'})
3 #=>
4 {
5 "favfoods": [
6 "Kiwi" ,
7 "Lemon" ,
8 "Lime" ,
9 "Coffee" ,
10 "Sweet orange"
11 ] ,
12 "id": "0b83164e-fb42-4273-8db1-ba12be6e580d" ,
13 "name": "Carl Achiban"
14 } {
15 "favfoods": [
16 "Kiwi" ,
17 "Banana" ,
18 "Peanut" ,
19 "Asparagus" ,
20 "Common cabbage"
21 ] ,
22 "id": "d10b51d7-d321-4b41-bd7d-1367ede0eb30" ,
23 "name": "Luma Ramses"
24 }
This kind of index is powerful because we can push more complex searches to the database engine. Let's say we want to find all users who like Kiwi most and are female.
1 r.db("foodb").table("users")
2 .indexCreate('most-favourite-food-gender', function (user) {
3 return [user("gender"), user("favfoods").nth(0)]
4 })
Here we are creating a non-multi index. The index value is an array of the gender and the most favourite food item. Now, let's try our index:
1 r.db("foodb").table("users")
2 .getAll(['f', 'Kiwi'], {index: 'most-favourite-food-gender'})
3 #=>
4 {
5 "favfoods": [
6 "Kiwi" ,
7 "Lemon" ,
8 "Lime" ,
9 "Coffee" ,
10 "Sweet orange"
11 ] ,
12 "gender": "f" ,
13 "id": "0b83164e-fb42-4273-8db1-ba12be6e580d" ,
14 "name": "Carl Achiban"
15 } {
16 "favfoods": [
17 "Kiwi" ,
18 "Banana" ,
19 "Peanut" ,
20 "Asparagus" ,
21 "Common cabbage"
22 ] ,
23 "gender": "f" ,
24 "id": "d10b51d7-d321-4b41-bd7d-1367ede0eb30" ,
25 "name": "Luma Ramses"
26 }
One thing I want to remind you: any function you pass into a ReQL expression is evaluated on the RethinkDB server, not in the client language. You cannot just write anything; you have to use ReQL expressions in the return value so that RethinkDB can evaluate them. In a previous chapter we learned about expr; you can use it to turn a native object into a RethinkDB object when needed.
Let's look at this example:
1 r.table("users").indexCreate("full_name2", function(user) {
2 return r.add(user("last_name"), "_", user("first_name"))
3 }).run(conn, callback)
We are creating an index by concatenating last_name and first_name. We cannot write a native expression like user("last_name") + "_" + user("first_name"), because RethinkDB won't understand it; we have to call the r.add function.
However, even if an expression like that appears to work with some drivers, that doesn't mean RethinkDB understands your native expression. It is actually your driver overloading the operator (the + operator in this case), because your host language happens to support operator overloading.
Notice that we didn't pass multi: true in any of the above examples. Can we combine a multi index with an arbitrary expression index? Yes, we can.
Let's say we want to find any users who like Kiwi or have eaten Kiwi before. We will create a multi index by concatenating the favfoods and eatenfoods arrays:
1 r.db("foodb").table("users")
2 .indexCreate(
3 'eateen-or-like-multi',
4 r.add(r.row("eatenfoods"), r.row("favfoods"))
5 , {multi: true})
1 r.db("foodb").table("users")
2 .getAll('Kiwi', {index:'eateen-or-like-multi'})
3 #=>
4 {
5 "eatenfoods": [
6 "Celery leaves" ,
7 "Kiwi" ,
8 "Rainbow trout" ,
9 "Chinese bayberry" ,
10 "Hyacinth bean" ,
11 "Other sandwich"
12 ] ,
13 "favfoods": [
14 "Honey" ,
15 "Cake" ,
16 "Butter substitute" ,
17 "Cream" ,
18 "Sugar"
19 ] ,
20 "gender": "m" ,
21 "id": "808cedd5-f2ac-4724-98bc-061ee84755c9" ,
22 "name": "Forrest Jacobs"
23 } {
24 "eatenfoods": [
25 "Jerusalem artichoke" ,
26 "Conch" ,
27 "Milk and milk products" ,
28 "Dumpling" ,
29 "Custard apple" ,
30 "Sacred lotus" ,
31 "Japanese walnut" ,
32 "Crab"
33 ] ,
34 "favfoods": [
35 "Kiwi" ,
36 "Banana" ,
37 "Peanut" ,
38 "Asparagus" ,
39 "Common cabbage"
40 ] ,
41 "gender": "f" ,
42 "id": "d10b51d7-d321-4b41-bd7d-1367ede0eb30" ,
43 "name": "Luma Ramses"
44 }
How about finding users who like Kiwi most or have eaten Kiwi? We just need to change the index function; this time we will use an anonymous function:
1 r.db("foodb").table("users")
2 .indexCreate(
3 'eatean-or-like-most',
4 function (user) {
5 return r.add(user("eatenfoods"), [user("favfoods").nth(0)])
6 }
7 , {multi: true})
Actually, a function is just a special case of an expression: the expression is the result of executing the function. Now we can run the query:
1 r.db("foodb").table("users")
2 .getAll('Kiwi', {index: 'eatean-or-like-most'})
3 #=>
4 {
5 "eatenfoods": [
6 "Shrimp",
7 "Other fish product",
8 "Sweet orange",
9 "Unclassified food or beverage"
10 ],
11 "favfoods": [
12 "Kiwi",
13 "Lemon",
14 "Lime",
15 "Coffee",
16 "Sweet orange"
17 ],
18 "gender": "f",
19 "id": "0b83164e-fb42-4273-8db1-ba12be6e580d",
20 "name": "Carl Achiban"
21 } {
22 "eatenfoods": [
23 "Celery leaves",
24 "Kiwi",
25 "Rainbow trout",
26 "Chinese bayberry",
27 "Hyacinth bean",
28 "Other sandwich"
29 ],
30 "favfoods": [
31 "Honey",
32 "Cake",
33 "Butter substitute",
34 "Cream",
35 "Sugar"
36 ],
37 "gender": "m",
38 "id": "808cedd5-f2ac-4724-98bc-061ee84755c9",
39 "name": "Forrest Jacobs"
40 } {
41 "eatenfoods": [
42 "Jerusalem artichoke",
43 "Conch",
44 "Milk and milk products",
45 "Dumpling",
46 "Custard apple",
47 "Sacred lotus",
48 "Japanese walnut",
49 "Crab"
50 ],
51 "favfoods": [
52 "Kiwi",
53 "Banana",
54 "Peanut",
55 "Asparagus",
56 "Common cabbage"
57 ],
58 "gender": "f",
59 "id": "d10b51d7-d321-4b41-bd7d-1367ede0eb30",
60 "name": "Luma Ramses"
61 }
We can check the build status of an index at any time with indexStatus. Its general form is:
1 r.table(tableName).indexStatus(indexName)
Such as:
1 r.db("foodb").table("compounds_foods").indexStatus("food_id")
2 #=>
3 {
4 "blocks_processed": 656 ,
5 "blocks_total": 11331 ,
6 "function": <binary, 181 bytes, "24 72 65 71 6c 5f..."> ,
7 "geo": false ,
8 "index": "food_id" ,
9 "multi": false ,
10 "outdated": false ,
11 "ready": false
12 }
Instead of polling indexStatus, we can block until an index has finished building with indexWait:
1 r.table(tableName).indexWait(indexName)
Using index
Ordering
Sorting with orderBy without an index is limited to 100,000 documents; always consider using an index. We already learned about orderBy in an earlier chapter, but we didn't use an index at that time. On a table, that is, on the result of a table command, you can pass an index to sort by. For example, say we want to sort the compounds_foods table by name. If we don't use an index, we get:
1 r.db("foodb")
2 .table("compounds_foods")
3 .orderBy("name")
4 //=>
5 RqlRuntimeError: Array over size limit `100000` in:
6 r.db("foodb").table("compounds_foods").orderBy("name")
7 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Let's create an index on the name field:
1 r.db("foodb")
2 .table("compounds_foods")
3 .indexCreate("foodname", r.row("name"))
Once the index is ready, the original query works if we tell it which index to use:
1 r.db("foodb")
2 .table("compounds_foods")
3 .orderBy("name", {index: "foodname"})
By default, the ordering is ascending. To make it descending, we simply wrap the index name in r.desc; r.asc makes the ascending order explicit:
1 r.db("foodb")
2 .table("compounds_foods")
3 .orderBy("name", {index: r.desc("foodname")})
Pagination
To paginate data, we use a combination of skip, limit and slice. We already learned about them in Chapter 3, but now we can combine them with orderBy on an index to make pagination more efficient.
skip(n)
Skips n elements from the beginning of the sequence or array.
limit(n)
Ends the sequence after reading up to n elements.
slice
Instead of paginating manually with skip and limit, we can simply tell RethinkDB that we want the data from one position to another, similar to how a slice function slices an array.
Let's use our foods table:
1 r.db("foodb")
2 .table("foods")
3 .orderBy(r.desc("name"))
4 .slice(10, 12)
This returns two rows, at positions 10 and 11; the left bound is included and the right bound excluded by default. We can therefore calculate slice bounds for pagination. For example, with 10 items per page, page 7 (counting pages from zero) holds items 70 through 79, and we can do:
1 r.db("foodb")
2 .table("foods")
3 .orderBy(r.desc("name"))
4 .slice(70, 80)
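The page-to-bounds arithmetic above can be captured in a small helper (a sketch with a hypothetical name, sliceBounds; pages counted from zero):

```javascript
// Hypothetical helper: compute slice() bounds for a zero-based page number.
function sliceBounds(page, perPage) {
  const start = page * perPage; // index of the first item on the page
  const end = start + perPage;  // exclusive end bound, as slice() expects
  return [start, end];
}

console.log(sliceBounds(7, 10)); // [70, 80] -> items 70 through 79
```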
Transform data
So far we have always taken the value returned from RethinkDB and worked with it in our application. In any real application, you will probably want to transform that data. Doing it at the application level makes sense for complex things, but for simple things we may waste an extra loop. Sometimes we want to do the work at the RethinkDB level, or use the transformed data in another ReQL expression.
Examples are the functions nth and count: we call them on a sequence or an array, and they transform the original data into a different piece of data. But nth and count are simple functions without any complex logic inside them. Some transformations do need complex logic, and for that logic we need some kind of if...else command, or a loop.
To help us, RethinkDB has control structure functions such as branch (similar to if), forEach, and do. RethinkDB shines in these areas because the database engine effectively has an embedded language in it.
Let's cover more of those functions. Along the way, we will learn some control structure commands, starting with our next function, map.
Map
Let's say we want to divide our users into three groups: under 18 years old is a teenager, between 18 and 50 is an adult, and over 50 is older. There is a pattern here: for each document in the table, we want to calculate new data that depends on its existing data. In RethinkDB, we use the map function.
Map applies a function to each document, and the return value of the function is what the query returns. For our example, in a normal programming language such as Ruby, we could write:
1 users.map do |user|
2 if user["age"] < 18
3 "teenager"
4 elsif user["age"] > 50
5 "older"
6 else
7 "adult"
8 end
9 end
To run this on the server, we have to represent the if in a RethinkDB function. We do that with branch:
1 r.db("foodb").table("users").map(function (user) {
2 return r.branch(
3 user("age").lt(18),
4 "teenager",
5 r.branch(
6 user("age").gt(50),
7 "older",
8 "adult"
9 )
10 )
11 })
12 #=>
13 "older" "older" "older" "adult" "older" "older" "older" "adult" "older"
14 "older" "older" "adult" "teenager" "adult" "adult" "adult" "older" "adult"
15 "older" "older" "older" "older" "adult" "teenager" "adult" "older" "older"
16 "older" "older" "adult" "adult" "older" "adult" "older" "older" "older" "adult"
17 "older" "older" "older"
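The nested branch calls correspond to a nested conditional; as a plain-JavaScript sketch of the same grouping logic (an analogy, not ReQL):

```javascript
// branch(test, trueVal, falseVal) behaves like the ternary test ? trueVal : falseVal.
function ageGroup(age) {
  return age < 18 ? "teenager" : age > 50 ? "older" : "adult";
}

console.log(ageGroup(12)); // "teenager"
console.log(ageGroup(30)); // "adult"
console.log(ageGroup(70)); // "older"
```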
Yay, we get what we want. But the query only returns the value from the function; we don't know who is who. Well, that is map's job: it transforms the whole document into the return value. How about returning an object with the original name field and our group field, like this:
1 r.db("foodb").table("users").map(function (user) {
2 return {
3 name: user("name"),
4 group: r.branch(
5 user("age").lt(18),
6 "teenager",
7 r.branch(
8 user("age").gt(50),
9 "older",
10 "adult"
11 )
12 )}
13 })
14 #=>
15 {
16 "group": "adult" ,
17 "name": "Arthur Hegmann"
18 } {
19 "group": "older" ,
20 "name": "Ricky Quigley Sr."
21 } {
22 "group": "older" ,
23 "name": "Jazmyne Brakus"
24 }
25 ....
Great. But if we want to return the whole document with an extra group field, do we have to write out every field by hand? No. Let's meet merge.
1 r.db("foodb").table("users").map(function (user) {
2 return user.merge({
3 group: r.branch(
4 user("age").lt(18),
5 "teenager",
6 r.branch(
7 user("age").gt(50),
8 "older",
9 "adult"
10 )
11 )})
12 })
If you are wondering whether we can use r.row in a map expression instead of a function: yes, we can. But we cannot do that in the above example, because r.row does not work in nested queries, and there we nested r.branch inside merge and inside another branch.
Let's see an example where we can use r.row: counting how many foods each user has eaten:
1 r.db("foodb").table("users").map({
2 name: r.row("name"),
3 total_eaten: r.row("eatenfoods").count()
4 })
5 #=>
6 {
7 "name": "Arthur Hegmann" ,
8 "total_eaten": 6
9 } {
10 "name": "Ricky Quigley Sr." ,
11 "total_eaten": 7
12 }
Inside the map function, we can use any arbitrary ReQL command to fetch data. In the following example, we use getAll and an index to count data. Let's find how many flavors each compound has. The compounds table has an associated table, compounds_flavors, which stores the relation between compounds and flavors using two fields: compound_id and flavor_id. By counting how many items exist for a given compound_id in compounds_flavors, we get the total flavor count of a compound.
1 r.db('foodb')
2 .table('compounds')
3 .map(function(doc) {
4 return {
5 compound_id: doc('id'),
6 name: doc('name'),
7 flavor_total: r.db('foodb').table('compounds_flavors').getAll(doc('id'), {\
8 index: 'compound_id'}).count()
9 }
10 })
11 .orderBy(r.desc('flavor_total'))
12 //=>
13 [
14 {
15 "compound_id": 3266,
16 "flavor_total": 22,
17 "name": "Ethyl methyl sulfide"
18 },
19 {
20 "compound_id": 930,
21 "flavor_total": 19,
22 "name": "3-Ethylpyridine"
23 },...
24 ]
In this example, each compound is passed into the map function, which counts how many flavors it has by querying the compounds_flavors table. We use getAll with an index to make it fast, and finally call count() to count the elements of the sequence. In the map function, we explicitly return an object that we construct ourselves. This sometimes confuses people into thinking the function runs on the client side, which is wrong: the map function is executed on the RethinkDB server. If we don't want to return a JSON object directly as above, we can use pluck to get only name and compound_id and merge in the extra flavor_total field, like this:
1 r.db("foodb")
2 .table('compounds')
3 .map(function(doc) {
4 return doc.pluck('name', 'compound_id').merge({
5 flavor_total: r.db('foodb').table('compounds_flavors').getAll(doc('id'), {\
6 index: 'compound_id'}).count()
7 })
8 })
9 .orderBy(r.desc('flavor_total'))
Using map, we can pre-calculate data with either a function or an expression. Let's meet another kind of mapping function.
concatMap
concatMap is very similar to map: it applies a function to every element of a sequence. The difference is that concatMap also flattens the result, concatenating all sub-sequences or arrays into a single final sequence.
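In plain JavaScript terms (an in-memory analogy, not ReQL), map behaves like Array.prototype.map and concatMap like Array.prototype.flatMap:

```javascript
// map keeps one result per element; concatMap flattens one level of nesting.
const users = [
  { name: "a", favfoods: ["Kiwi", "Lemon"] },
  { name: "b", favfoods: ["Kiwi", "Banana"] },
];

const mapped = users.map(u => u.favfoods);
// [["Kiwi", "Lemon"], ["Kiwi", "Banana"]]  -- an array of arrays

const flattened = users.flatMap(u => u.favfoods);
// ["Kiwi", "Lemon", "Kiwi", "Banana"]      -- one flat sequence

console.log(mapped, flattened);
```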
When is it useful? At first glance it may look pointless. Well, let's look at the favfoods field. If we want a list of all favfoods across the entire system, with map we can do:
1 r.db("foodb").table("users").map(
2 r.row("favfoods")
3 )
4 #=>
5 [
6 "Garden tomato (var.)" ,
7 "Linden" ,
8 "Lowbush blueberry" ,
9 "American cranberry" ,
10 "Vanilla"
11 ],
12 [
13 "Swiss chard" ,
14 "Chicory roots" ,
15 "Grapefruit" ,
16 "Jostaberry" ,
17 "Spirit"
18 ]
With concatMap, we get a single flat list instead:
1 r.db("foodb").table("users").concatMap(
2 r.row("favfoods")
3 )
4 #=>
5 Pikeperch
6 Pacific ocean perch
7 True seal
8 Columbidae (Dove, Pigeon)
9 Conch
10 Kiwi
11 Lemon
12 Lime
13 Coffee
14 Sweet orange
15 Kiwi
16 True seal
17 Salmonidae (Salmon, Trout)
18 ...
Notice that we have duplicate data. That is natural, because many users may like the same foods. To get distinct values, call distinct on the sequence:
1 r.db("foodb").table("users").concatMap(
2 r.row("favfoods")
3 ).distinct()
4 #=>
5 [
6 "Abalone" ,
7 "Abiyuch" ,
8 "Acerola" ,
9 "Acorn" ,
10 "Adobo" ,
11 "Adzuki bean" ,
12 ]
distinct can also accept an index when called on a table, which is more efficient. Called on a plain sequence, as here, it simply removes duplicate elements, which is good enough in our case. Sometimes you may want to create an extra index for a field and call distinct with that index.
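In the same in-memory analogy (not ReQL), distinct on a plain sequence behaves like deduplicating with a Set:

```javascript
// Deduplicate a flattened list while preserving first-seen order.
const allFavfoods = ["Kiwi", "Lemon", "Kiwi", "Abalone", "Lemon"];
const distinctFoods = [...new Set(allFavfoods)];
console.log(distinctFoods); // ["Kiwi", "Lemon", "Abalone"]
```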
Let's dig into a more complex example: for each food, let's find all of its compounds.
Let's create an index first:
1 r.db("foodb").table("compounds_foods").indexCreate("food_id")
1 r.db("foodb")
2 .table("foods")
3 .withFields("id", "name")
4 .concatMap(function (food) {
5 return r.db("foodb").table("compounds_foods")
7 .getAll(food("id"), {index: "food_id"})
8 .withFields("id", "food_id", "compound_id", "orig_food_common_name",
9 "orig_compound_name")
10 .map(function(compound_food) {
11 return food.merge({compound: compound_food})
12 })
13 })
14 #=>
15 {
16 "compound": {
17 "compound_id": 21594 ,
18 "food_id": 2 ,
19 "id": 15609 ,
20 "orig_compound_name": "Fatty acids, total saturated" ,
21 "orig_food_common_name": "Cabbage, savoy, raw"
22 } ,
23 "id": 2 ,
24 "name": "Savoy cabbage"
25 } {
26 "compound": {
27 "compound_id": 21595 ,
28 "food_id": 2 ,
29 "id": 15610 ,
30 "orig_compound_name": "Fatty acids, total mono-unsaturated" ,
31 "orig_food_common_name": "Cabbage, savoy, raw"
32 } ,
33 "id": 2 ,
34 "name": "Savoy cabbage"
35 }
Note that we use the withFields command to limit the fields included in the returned documents.
The above example won't work with map, because the return value of the function isn't a single value but another sequence. If you attempt to use map, you will see the error clearly:
1 r.db("foodb")
2 .table("foods")
3 .withFields("id", "name")
4 .map(function (food) {
5 return r.db("foodb").table("compounds_foods")
7 .getAll(food("id"), {index: "food_id"})
8 .withFields("id", "food_id", "compound_id", "orig_food_common_name",
9 "orig_compound_name")
10 .map(function(compound_food) {
concatMap can operate on a sequence value returned from the function and flattens it for us. Let's go back and break down how it works:
1 r.db("foodb")
2 .table("foods")
3 .withFields("id", "name")
4 .concatMap(function (food) {
5 return r.db("foodb").table("compounds_foods")
7 .getAll(food("id"), {index: "food_id"})
8 .withFields("id", "food_id", "compound_id", "orig_food_common_name",
9 "orig_compound_name")
10 .map(function(compound_food) {
11 return food.merge({compound: compound_food})
12 })
13 })
For every document of foods, the function transforms it into a sequence of documents similar to this:
1 [
2 {id: id1, name: name1, compound: compound_food_document1},
3 {id: id1, name: name1, compound: compound_food_document2},
4 {id: id1, name: name1, compound: compound_food_documentn},...
5 ]
Applied to every food, the function therefore produces one such array per food:
1 [
2 {id: id1, name: name1, compound: compound_food_document1},
3 {id: id1, name: name1, compound: compound_food_document2},
4 {id: id1, name: name1, compound: compound_food_documentn},...
5 ],
6 [
7 {id: id2, name: name2, compound: compound_food_document1},
8 {id: id2, name: name2, compound: compound_food_document2},
9 {id: id2, name: name2, compound: compound_food_documentn},...
10 ],
11 ...
concatMap flattens all of those per-food arrays into one sequence. That is the power of concatMap: we can use it to achieve a join effect. But one problem remains: as you can see, we get many documents for the same food, one per compound. We can instead compile the compounds into a single array per food, like this:
1 r.db("foodb")
2 .table("foods")
3 .withFields("id", "name")
4 .concatMap(function (food) {
5 return [
6 food.merge({compound:
7 r.db("foodb").table("compounds_foods")
8 .getAll(food("id"), {index: "food_id"})
9 .withFields("id", "food_id", "compound_id", "orig_food_common_name",
10 "orig_compound_name")
11 .limit(10)
12 })
13 ]
14 })
First off, notice that we wrap food.merge in an array. Why? Because concatMap expects the function to return a sequence, whereas map expects it to return a DATUM. We also call limit(10) on the compounds_foods sequence to keep only the first 10 results, sorted by the primary key, the id field of compounds_foods in this case.
Instead of mapping over the compounds_foods sequence to create a document per compound, we simply merge the whole compounds_foods result into the food document. Looks good; run it, and here comes the error:
Well, merge expects a DATUM. A datum is a single value, such as a number, an object, or an array. However, we are passing a SEQUENCE: we expected a concrete value but passed a cursor, like expecting an array in Ruby but passing an enumerator. In RethinkDB, to make this kind of merge work, we have to explicitly convert the sequence to an array using coerceTo with the parameter 'array'.
1 r.db("foodb")
2 .table("foods")
3 .withFields("id", "name")
4 .concatMap(function (food) {
5 return [
6 food.merge({compound:
7 r.db("foodb").table("compounds_foods")
8 .getAll(food("id"), {index: "food_id"})
9 .withFields("id", "food_id", "compound_id", "orig_food_common_name",
10 "orig_compound_name")
11 .limit(10)
12 .coerceTo('array')
13 })
14 ]
15 })
16 #=>
17 {
18 "compound": [ ... ] ,
19 "id": 2 ,
20 "name": "Savoy cabbage"
21 } {
22 "compound": [ ... ] ,
23 "id": 15 ,
24 "name": "Wild celery"
25 }
Now, with the coerceTo function, we know we can convert a sequence into an array. So can we achieve this with map instead of concatMap? Yes, and it is even simpler:
1 r.db("foodb")
2 .table("foods")
3 .withFields("id", "name")
4 .map(function (food) {
5 return food.merge({compound:
7 r.db("foodb").table("compounds_foods")
8 .getAll(food("id"), {index: "food_id"})
9 .withFields("id", "food_id", "compound_id", "orig_food_common_name",
10 "orig_compound_name")
11 .limit(10)
12 .coerceTo('array')
13 })
14 })
We no longer have to wrap the result in [] because map works with an OBJECT. With this example, we see that concatMap and map can sometimes be used interchangeably, depending on how we want to model the data.
concatMap is also useful inside index functions. Suppose we want to look up users by any pair of their favourite foods; we can build a multi index of all such pairs:
1 r.db("foodb")
2 .table("users")
3 .indexCreate("food-test-idx-1", function (food) {
4 return food("favfoods")
6 .concatMap(function (favfood) {
7 return food("favfoods").map(function (favfood2) {
9 return [favfood, favfood2]
10 })
11 })
12 }, {multi: true})
We do a map: with each element of favfoods, we create a new two-element array by pairing it with every element of favfoods itself. We use concatMap to flatten the nested arrays, and set multi: true because this is a multi index: we are returning an array of values from the index function.
Now we can use that index, passing a pair of values:
1 r.db("foodb")
2 .table("users")
3 .getAll(['Spirit', 'Grapefruit'], {index: "food-test-idx-1"})
4 .withFields("name", "favfoods")
5 #=>Executed in 9ms. 2 rows returned
6 {
7 "favfoods": [
8 "Swiss chard" ,
9 "Chicory roots" ,
10 "Grapefruit" ,
11 "Jostaberry" ,
12 "Spirit"
13 ] ,
14 "name": "Wilburn Price"
15 } {
16 "favfoods": [
17 "Chicory roots" ,
18 "Grapefruit" ,
19 "Jostaberry" ,
20 "Spirit" ,
21 "Abiyuch"
22 ] ,
23 "name": "Audie Muller"
24 }
The order isn't important here, because we pair every element with every element, including itself. For the array [a,b,c], the index function produces:
[a,a], [a,b], [a,c], [b,a], [b,b], [b,c], [c,a], [c,b], [c,c]
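A plain-JavaScript sketch of what that index function computes for one document (an analogy with a hypothetical helper, allPairs; not ReQL):

```javascript
// Pair every element with every element (including itself), flattened one level.
function allPairs(favfoods) {
  return favfoods.flatMap(a => favfoods.map(b => [a, b]));
}

console.log(allPairs(["a", "b", "c"]));
// [["a","a"],["a","b"],["a","c"],
//  ["b","a"],["b","b"],["b","c"],
//  ["c","a"],["c","b"],["c","c"]]
```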
We can eliminate [a,a], [b,b], [c,c] because they are useless. We just need to wrap the pair in a branch command:
1 r.db("foodb")
2 .table("users")
3 .indexCreate("food-test-idx-2", function (food) {
4 return food("favfoods").concatMap(function (favfood) {
5 return food("favfoods").map(function (favfood2) {
6 return r.branch(favfood.eq(favfood2),
7 [],
8 [favfood, favfood2]
9 )
10 })
11 })
12 }, {multi: true})
For each pair, if the two elements have the same value we return an empty array; otherwise we return the pair.
This runs fast, but it has its own limitation: we cannot have more than 256 entries in a multi index. Another approach, not as fast but more scalable, is double finding: first we find all documents matching one value, then from that result we find all documents matching the second value. In other words, getAll using an index, then filter.
Let's try it. First create an index:
1 r.db("foodb")
2 .table("users")
3 .indexCreate("food-test-idx-3", r.row('favfoods'), {multi: true})
4
5 // equivalently, with a function:
6 r.db("foodb")
7 .table("users")
8 .indexCreate("food-test-idx-3", function (user) {
9 return user("favfoods")
10 }, {multi: true})
Now, we first use that index to find all users who like the first food, then use filter to keep only users who also like the second food:
1 r.db("foodb")
2 .table("users")
3 .getAll("Spirit", {index: "food-test-idx-3"})
4 .filter(r.row("favfoods").contains("Grapefruit"))
5 .withFields("name", "favfoods")
6 //=>Executed in 9ms. 2 rows returned
7 [
8 {
9 "favfoods": [
10 "Swiss chard" ,
11 "Chicory roots" ,
12 "Grapefruit" ,
13 "Jostaberry" ,
14 "Spirit"
15 ] ,
16 "name": "Wilburn Price"
17 } {
18 "favfoods": [
19 "Chicory roots" ,
20 "Grapefruit" ,
21 "Jostaberry" ,
22 "Spirit" ,
23 "Abiyuch"
24 ] ,
25 "name": "Audie Muller"
26 }
27 ]
On the result of getAll, we run a filter. If the result of getAll contains more than 100,000 elements, this method won't work. And we cannot chain another getAll with an index onto the first getAll, because getAll returns a SELECTION:
1 r.db("foodb")
2 .table("users")
3 .getAll(['Spirit', 'Grapefruit'], {index: "food-test-idx-1"})
4 .typeOf()
5 //=>
6 "SELECTION<STREAM>"
The only downside of filter is slowness, which is understandable. Therefore, we usually want to use getAll to leverage an index. But getAll only queries data based on the value of an index, which doesn't give the flexibility of filter: an index value has to be pre-calculated by the index function, whereas filter is evaluated dynamically.
Consider this example. Assume we have an orders table containing orders, and an orderItems table containing the items of each order.
Let's create some data:
1 r.tableCreate("orders")
2 r.tableCreate("orderItems")
3
4 r.table("orders").insert([
5 {id:1, shipped: 1}, {id:2, shipped: 0}, {id:3, shipped: 1},
6 ])
7
8 r.table("orderItems").insert([
9 {id:1, name: "f1", orderId: 1},
10 {id:2, name: "f2", orderId: 1},
11 {id:3, name: "a1", orderId: 2},
12 {id:4, name: "f2", orderId: 2},
13 {id:5, name: "a3", orderId: 2},
14 {id:6, name: "b3", orderId: 3},
15 {id:7, name: "b4", orderId: 3},
16 ])
Question: find all items of every shipped order. With filter, we can quickly do this:
1 r.table('orderItems')
2 .filter(function(orderItem) {
3 return r.table('orders').get(orderItem('orderId'))('shipped').default(0).gt(0)
4 })
However, this is inefficient because of filter's limitations, so ideally we would use getAll. But getAll only returns data from the table whose index it uses. In our case, the shipped status is stored in the orders table, whereas we want to fetch data from orderItems. In other words, getAll alone doesn't give us the data we want; we have to transform it into what we want using map/concatMap.
Let's add two indexes:
1 r.table("orderItems").indexCreate("orderId")
2 r.table("orders").indexCreate("shipStatus", r.row("shipped").default(0).gt(0))
Now we use concatMap to transform each order into its associated orderItems:
1 r.table("orders")
2 .getAll(true, {index: "shipStatus"})
3 .concatMap(function(order) {
4 return r.table("orderItems").getAll(order("id"), {index: "orderId"}).coerc\
5 eTo("array")
6 })
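To see the shape of this query, here is an in-memory JavaScript analogy of the same data and join (an illustration, not ReQL):

```javascript
// In-memory analogy of the getAll + concatMap join above.
const orders = [
  { id: 1, shipped: 1 }, { id: 2, shipped: 0 }, { id: 3, shipped: 1 },
];
const orderItems = [
  { id: 1, name: "f1", orderId: 1 },
  { id: 2, name: "f2", orderId: 1 },
  { id: 3, name: "a1", orderId: 2 },
  { id: 4, name: "f2", orderId: 2 },
  { id: 5, name: "a3", orderId: 2 },
  { id: 6, name: "b3", orderId: 3 },
  { id: 7, name: "b4", orderId: 3 },
];

// Start from the table holding the condition (orders), then "join" outward.
const itemsOfShipped = orders
  .filter(o => o.shipped > 0)                                // getAll on shipStatus
  .flatMap(o => orderItems.filter(i => i.orderId === o.id)); // concatMap + getAll

console.log(itemsOfShipped.map(i => i.name)); // ["f1", "f2", "b3", "b4"]
```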
So now we basically have the power of filter: we can query whatever data we like inside the concatMap function (just be sure to use a proper index). The way of thinking is the reverse of filter. With filter, we start the query on the table we want data from. With getAll/concatMap, we start the query on the table containing the condition, then use concatMap to join the data across tables.
map and concatMap are really important, and you should take a bit of time to play around and master them.
Wrap up
This chapter is quite long. So far weve learned about index. We know how to create:
Simple index
Compound index
Multi value index
Arbitrary expression index
We've also learned how to leverage indexes to sort and filter data. You also learned two important transform functions: map and concatMap.
And finally, by leveraging map, you can easily create a multi index.
6. Data Modeling With JOIN
Join is a joy to work with, in my opinion. It makes the data model easier to design. Without join, we have to either embed documents or join data in our application code, instead of letting the database take care of it for us. With embedded documents, we will eventually hit a document size limit, because a document is usually loaded into memory as a whole. Embedding has its own advantages, such as simple queries, but this section focuses on data modeling with JOIN.
Using Join
eqJoin
In RethinkDB, JOIN is automatically distributed, meaning that if you run it on a cluster, the data will be combined from many servers and the final result presented to you.
In SQL, you can join almost anything by making sure the records in the two tables match a condition. An example:
1 SELECT post.*
2 FROM post
3 JOIN comment ON comment.post_id=post.id
4
5 # OR
6
7 SELECT post.*
8 FROM post
9 JOIN comment ON comment.author_id=post.author_id
You don't even need to care about the index. The database is usually smart enough to figure out which index to use, or will scan the full table for you.
Join is a bit different in RethinkDB. Similar to how we have a primary index and secondary indexes, we usually need an index to join in RethinkDB. Generally, we use these techniques in a JOIN command:
primary keys
secondary indexes
sub queries
It tries to find the documents in table rightTable whose index value matches the leftField value, or the return value of a function. It's similar to a normal JOIN in MySQL:
1 SELECT *
2 FROM sequence
3 JOIN rightTable
4 ON sequence.leftField = rightTable.id
In plain English, eqJoin tries to find pairs of documents from the left table (the sequence) and rightTable, where the value of the right table's index (by default the primary index) matches the value of leftField on the left table, or the return value of the function we pass into eqJoin.
So, to find all compounds and their synonyms, we can do:
1 r.db("foodb")
2 .table("compound_synonyms")
3 .eqJoin("compound_id", r.db("foodb").table("compounds"))
1 "left": {
2 "compound_id": 82 ,
3 "created_at": Fri Apr 09 2010 17:40:05 GMT-07:00 ,
4 "id": 832 ,
5 "source": "db_source" ,
6 "synonym": "3,4,2',4'-Tetrahydroxychalcone" ,
7 "updated_at": Fri Apr 09 2010 17:40:05 GMT-07:00
8 } ,
9 "right": {
10 "annotation_quality": "low" ,
11 "assigned_to_id": null ,
12 "bigg_id": null ,
13 "boiling_point": null ,
14 "boiling_point_reference": null ,
15 "cas_number": null ,
16 "charge": null ,
17 "charge_reference": null ,
18 "chebi_id": null ,
19 "comments": null ,
20 "compound_source": "PHENOLEXPLORER" ,
21 "created_at": Thu Apr 08 2010 22:04:26 GMT-07:00 ,
22 "creator_id": null ,
23 "density": null ,
24 "density_reference": null ,
25 "wikipedia_id": null,
26 //...lot of other fields
27 ...
28 }
29 }
We get back a sequence of documents in which the elements of both tables match our condition. We can see that the item on the left has a compound_id matching the id field of the one on the right. However, this result with left and right fields is not very useful. It would be more useful to merge both sides into a single document. To do that, we use zip:
1 r.db("foodb")
2 .table("compound_synonyms")
3 .eqJoin("compound_id", r.db("foodb").table("compounds"))
4 .zip()
5 //=>
6 {
7 "annotation_quality": "low" ,
8 "assigned_to_id": null ,
9 "bigg_id": null ,
10 "boiling_point": "Bp14 72" ,
11 "boiling_point_reference": "DFC" ,
12 "cas_number": "15707-34-3" ,
13 "charge": null ,
14 "charge_reference": null ,
15 "chebi_id": null ,
16 "comments": null ,
17 "compound_id": 923 ,
18 //...lot of other fields
19 },
20 //other document here as well
What zip does is merge the right document into the left document and return the merged document, instead of a document with separate left and right fields.
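The merge behavior can be sketched in plain JavaScript (an illustrative model, not driver code; the sample fields are trimmed from the output above): Object.assign mirrors what zip does to each joined row.

```javascript
// One joined row, as eqJoin produces it (fields trimmed for brevity)
const row = {
  left:  {compound_id: 82, synonym: "3,4,2',4'-Tetrahydroxychalcone"},
  right: {id: 82, name: "Butein", annotation_quality: "low"},
};

// zip: merge right into left; on a name clash, the right-hand value wins
const zipped = Object.assign({}, row.left, row.right);
```

The result is one flat document carrying fields from both sides.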
zip is not very flexible, because it simply merges all the fields. We can use a transform function to turn the document into a more readable one, since we only care about the name and its synonym:
1 r.db("foodb")
2 .table("compound_synonyms")
3 .eqJoin(
4 "compound_id",
5 r.db("foodb").table("compounds")
6 )
7 .map(function (doc) {
8 return {synonym: doc("left")("synonym"), name: doc("right")("name")}
9 })
10 //=>
11 {
12 "name": "Butein" ,
13 "synonym": "Acrylophenone, 2',4'-dihydroxy-3-(3,4-dihydroxyphenyl)-"
14 },
15 {
16 "name": "3,4-Dimethoxybenzoic acid" ,
17 "synonym": "Benzoic acid, 3,4-dimethoxy-"
18 }
Much cleaner now! The important thing is that the joined data is just another stream or array, and we can run transformations on it.
As you may have noticed, we didn't specify an index in the above query. When we don't specify an index, RethinkDB uses the primary index of the right table. In this case, that is the id field of the compounds table. Next, let's create a secondary index and join from the compounds side instead:
1 r.db("foodb").table("compound_synonyms")
2   .indexCreate("compound_id")
1 r.db("foodb").table("compound_synonyms").indexStatus()
1 [
2 {
3 "function": <binary, 185 bytes, "24 72 65 71 6c 5f..."> ,
4 "geo": false ,
5 "index": "compound_id" ,
6 "multi": false ,
7 "outdated": false ,
8 "ready": true
9 }
10 ]
1 r.db("foodb")
2 .table("compounds")
3 .eqJoin("id", r.db("foodb").table("compound_synonyms"), {index: 'compound_id'})
4 .map(function (doc) {
5 return {synonym: doc("right")("synonym"), name: doc("left")("name")}
6 })
7 //=>
8 {
9 "name": "Butein" ,
10 "synonym": "3-(3,4-Dihydroxy-phenyl)-1-(2,4-dihydroxy-phenyl)-propenone"
11 } {
12 "name": "Butein" ,
13 "synonym": "2',3,4,4'-Tetrahydroxychalcone"
14 }
With a proper index, the query looks cleaner and more natural. The order in which we use eqJoin is important: try to narrow down the data first, if possible, so the join does less work.
Also, instead of passing a field name to eqJoin, we can pass a function or use the row command to access the value of a nested field. In that case, the return value of the function, or the value of the field accessed with row, is matched against the value of the index on the right table. This is especially useful with structured data in a field.
Do you remember that we have a users table with a data structure that looks like this?
1 r.db('foodb').table('users')
2 {
3 "age": 40 ,
4 "eatenfoods": [
5 "True sole" ,
6 "Jerusalem artichoke" ,
7 "Ascidians" ,
8 "Pineappple sage" ,
9 "Lotus" ,
10 "Coffee and coffee products"
11 ] ,
12 "favfoods": [
13 "Edible shell" ,
14 "Clupeinae (Herring, Sardine, Sprat)" ,
15 "Deer" ,
16 "Perciformes (Perch-like fishes)" ,
17 "Bivalvia (Clam, Mussel, Oyster)"
18 ] ,
19 "gender": "m" ,
20 "id": "1dd8059c-82ca-4345-9d75-eaa0f8edbf48" ,
21 "name": "Arthur Hegmann"
22 ...
23 }
Let's try to find more information about each user's most favourite food.
First, let's create an index on the food name:
1 r.db('foodb').table('foods').indexCreate('name')
1 r.db('foodb').table('users')
2   .eqJoin(r.row('favfoods').nth(0), r.db('foodb').table('foods'),
3     {index: 'name'})
4 //=>
5 {
6 "left": {
7 "age": 40,
8 "eatenfoods": [
9 "True sole",
10 "Jerusalem artichoke",
11 "Ascidians",
12 "Pineappple sage",
13 "Lotus",
14 "Coffee and coffee products"
15 ],
16 "favfoods": [
17 "Edible shell",
18 "Clupeinae (Herring, Sardine, Sprat)",
19 "Deer",
20 "Perciformes (Perch-like fishes)",
21 "Bivalvia (Clam, Mussel, Oyster)"
22 ],
23 "gender": "m",
24 "id": "1dd8059c-82ca-4345-9d75-eaa0f8edbf48",
25 "name": "Arthur Hegmann"
26 },
27 "right": {
28 "created_at": Wed Dec 21 2011 02:40:48 GMT-08:00,
29 "creator_id": 2,
30 "description": null,
31 "food_group": "Baking goods",
32 "food_subgroup": "Wrappers",
33 "food_type": "Type 2",
34 "id": 868,
35 "itis_id": null,
36 "legacy_id": null,
37 "name": "Edible shell",
38 "name_scientific": null,
39 "picture_content_type": "image/jpeg",
40 "picture_file_name": "868.jpg",
41 "picture_file_size": 51634,
42 "picture_updated_at": Fri Apr 20 2012 09:39:05 GMT-07:00,
43 "updated_at": Fri Apr 20 2012 16:39:06 GMT-07:00,
44 "updater_id": 2,
45 "wikipedia_id": null
46 }
47 }
48 //....
1 r.db('foodb').table('users')
2 .eqJoin(function(user) { return user('favfoods').nth(0) },
3 r.db('foodb').table('foods'),
4 {index: 'name'})
So basically, passing a field name is just a shortcut for r.row(field_name). Using row or a function gives us much more flexibility. Also remember that the row command cannot be used in sub queries; in those cases, use a function instead.
So far, joins have been constructed by matching documents between two tables based on the value of an index. But how can we join data across two tables based on two fields? In real life, we may have even more complex join conditions. For example, in MySQL we can join with basically any arbitrary condition:
1 SELECT *
2 FROM table1 as t1
3 JOIN table2 as t2 ON t1.field1=t2.field1 AND t1.foo=t2.bar
Let's think of an example. I want to join the compounds table with compound_synonyms where the source is biospider and the synonym was created after 2013. Obviously we cannot use a single field with eqJoin here.
Luckily, we have another way of joining data: sub queries with concatMap and getAll. However, since these are not the eqJoin command, we will learn about sub queries later in this chapter. For now, let's move on to the other join commands.
To join, we usually need an index. But can we join two arbitrary sequences without any index? Even if it isn't very efficient, it would be useful to have. The answer is yes: we have innerJoin and outerJoin.
innerJoin
innerJoin returns the intersection of two sequences: each row of the first sequence is paired with each row of the second sequence, and a predicate function is evaluated to find the pairs for which it returns true. The syntax of innerJoin is:
1 sequence.innerJoin(other_sequence, predicate_function) → stream/array
The predicate function accepts two parameters: a row from the first sequence and a row from the second. If the first sequence has M rows and the second sequence has N rows, innerJoin will loop M x N times, passing each pair of rows into the predicate function. Say we have two sequences:
1 [2,5,8,12,15,20,21,24,25]
2 [2,3,4]
We want to find all pairs where the first element modulo the second element equals zero. We can write this:
1 r.expr([2,5,8,12,15,20,21,24,25])
2 .innerJoin(
3 r.expr([2,3,4]),
4 function (left, right) {
5 return left.mod(right).eq(0)
6 }
7 )
8 //=>
9 [
10 {
11 "left": 2 ,
12 "right": 2
13 } ,
14 {
15 "left": 8 ,
16 "right": 2
17 } ,
18 {
19 "left": 8 ,
20 "right": 4
21 } ,
22 {
23 "left": 12 ,
24 "right": 2
25 } ,
26 {
27 "left": 12 ,
28 "right": 3
29 } ,
30 {
31 "left": 12 ,
32 "right": 4
33 } ,
34 {
35 "left": 15 ,
36 "right": 3
37 } ,
38 {
39 "left": 20 ,
40 "right": 2
41 } ,
42 {
43 "left": 20 ,
44 "right": 4
45 } ,
46 {
47 "left": 21 ,
48 "right": 3
49 } ,
50 {
51 "left": 24 ,
52 "right": 2
53 } ,
54 {
55 "left": 24 ,
56 "right": 3
57 } ,
58 {
59 "left": 24 ,
60 "right": 4
61 }
62 ]
RethinkDB will loop 27 times (9 x 3) and evaluate the function to find matching rows. Because of this evaluation, and because no index is involved, this function is slow.
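The semantics are easy to model in plain JavaScript (an illustrative sketch, not the driver): a nested loop over both sequences, keeping the pairs where the predicate holds.

```javascript
// Minimal model of innerJoin: M x N predicate evaluations
function innerJoin(leftSeq, rightSeq, pred) {
  const out = [];
  for (const left of leftSeq) {
    for (const right of rightSeq) {
      // Every pair is tested; only matching pairs are kept
      if (pred(left, right)) out.push({left, right});
    }
  }
  return out;
}

const pairs = innerJoin(
  [2, 5, 8, 12, 15, 20, 21, 24, 25],
  [2, 3, 4],
  (l, r) => l % r === 0
);
```

Running this reproduces the 13 pairs shown above, with 5 and 25 absent because they match nothing.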
Here is another example with real data. Let's find all foods and their compound_foods:
1 r.db("foodb")
2 .table("foods")
3 .innerJoin(
4 r.db("foodb").table("compounds_foods"),
5 function(food, compound_food) {
6 return food("id").eq(compound_food("food_id"))
7 }
8 )
9 //=>
10 {
11 "left": {
54 }
55 } {
56 "left": {
57 "created_at": Wed Feb 09 2011 00:37:15 GMT-08:00 ,
58 "creator_id": null ,
59 "description": null ,
60 "food_group": "Vegetables" ,
61 "food_subgroup": "Cabbages" ,
62 "food_type": "Type 1" ,
63 "id": 2 ,
64 "itis_id": null ,
65 "legacy_id": 2 ,
66 "name": "Savoy cabbage" ,
67 "name_scientific": "Brassica oleracea var. sabauda" ,
68 "picture_content_type": "image/jpeg" ,
69 "picture_file_name": "2.jpg" ,
70 "picture_file_size": 155178 ,
71 "picture_updated_at": Fri Apr 20 2012 09:39:54 GMT-07:00 ,
72 "updated_at": Fri Apr 20 2012 16:39:55 GMT-07:00 ,
73 "updater_id": null ,
74 "wikipedia_id": null
75 } ,
76 "right": {
77 "citation": "DTU" ,
78 "citation_type": "DATABASE" ,
79 "compound_id": 1014 ,
80 "created_at": Tue Dec 13 2011 18:54:33 GMT-08:00 ,
81 "creator_id": null ,
82 "food_id": 2 ,
83 "id": 15630 ,
84 "orig_citation": null ,
85 "orig_compound_id": "0038" ,
86 "orig_compound_name": "Niacin, total" ,
87 "orig_content": "0.522E0" ,
88 "orig_food_common_name": "Cabbage, savoy, raw" ,
89 "orig_food_id": "0674" ,
90 "orig_food_part": null ,
91 "orig_food_scientific_name": null ,
92 "orig_max": null ,
93 "orig_method": null ,
94 "orig_min": null ,
95 "orig_unit": "NE" ,
96 "orig_unit_expression": null ,
97 "updated_at": Tue Dec 13 2011 18:54:33 GMT-08:00 ,
98 "updater_id": null
99 }
100 }
It takes almost forever, because we have 888 documents in the foods table and 10,959 documents in the compounds_foods table. It has to run the predicate function 888 * 10,959 = 9,731,592 times. On my laptop, it runs in more than 2 minutes.
Basically, innerJoin is the equivalent of a full table scan in MySQL. We should avoid using it on any significant amount of data.
outerJoin
innerJoin produces an intersection of two sequences where a pair of documents satisfies a condition. How about something similar to a LEFT JOIN in SQL? Let's meet outerJoin.
outerJoin returns all documents of the left sequence. For each document, it tries to match against every document on the right-hand side. If a pair satisfies the predicate function, the pair is returned. If not, the left document alone is returned. At the very least, the final sequence will include every document of the left sequence. Using the same data set, but with outerJoin:
1 r.expr([2,5,8,12,15,20,21,24,25])
2 .outerJoin(
3 r.expr([2,3,4]),
4 function (left, right) {
5 return left.mod(right).eq(0)
6 }
7 )
8 //=>
9 [
10 {
11 "left": 2 ,
12 "right": 2
13 } ,
14 {
15 "left": 5
16 } ,
17 {
18 "left": 8 ,
19 "right": 2
20 } ,
21 {
22 "left": 8 ,
23 "right": 4
24 } ,
25 {
26 "left": 12 ,
27 "right": 2
28 } ,
29 {
30 "left": 12 ,
31 "right": 3
32 } ,
33 {
34 "left": 12 ,
35 "right": 4
36 } ,
37 {
38 "left": 15 ,
39 "right": 3
40 } ,
41 {
42 "left": 20 ,
43 "right": 2
44 } ,
45 {
46 "left": 20 ,
47 "right": 4
48 } ,
49 {
50 "left": 21 ,
51 "right": 3
52 } ,
53 {
54 "left": 24 ,
55 "right": 2
56 } ,
57 {
58 "left": 24 ,
59 "right": 3
60 } ,
61 {
62 "left": 24 ,
63 "right": 4
64 } ,
65 {
66 "left": 25
67 }
68 ]
5 and 25 are not divisible by any of the numbers on the right-hand side; therefore their returned documents contain only the left field.
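The same plain-JavaScript modeling works for outerJoin (an illustrative sketch, not the driver): the only difference from innerJoin is that an unmatched left row is still emitted, alone.

```javascript
// Minimal model of outerJoin: every left row survives
function outerJoin(leftSeq, rightSeq, pred) {
  const out = [];
  for (const left of leftSeq) {
    let matched = false;
    for (const right of rightSeq) {
      if (pred(left, right)) { out.push({left, right}); matched = true; }
    }
    // No match at all: keep the left row by itself
    if (!matched) out.push({left});
  }
  return out;
}

const rows = outerJoin(
  [2, 5, 8, 12, 15, 20, 21, 24, 25],
  [2, 3, 4],
  (l, r) => l % r === 0
);
```

This yields the 13 matched pairs plus the two lonely rows for 5 and 25, 15 documents in total.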
Name conflict
In the SQL world, we can alias columns to avoid conflicts when joining. What does RethinkDB give us when we use the zip command to merge the documents? We lose the conflicting fields of the left sequence, because fields from the right document overwrite fields with the same name on the left. We have several ways to address this.
First, if we still want to use zip, we can drop or rename the conflicting fields on one side before zipping.
Second, we don't have to use zip at all: we can merge the documents ourselves with map and keep only what we want.
However, these are just work-arounds; they don't fully address the issue. Luckily, the RethinkDB team is aware of this and working on it.
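The work-around can be sketched in plain JavaScript (illustrative only; the field values here are invented for the sketch): rename the conflicting fields on one side before the zip-style merge, so both values survive.

```javascript
// Both sides carry an "id" and a "name" field
const left  = {id: 832, name: "synonym entry", compound_id: 82};
const right = {id: 82, name: "Butein"};

// Naive zip-style merge: the right side silently wins the conflicts
const naive = Object.assign({}, left, right);

// Work-around: rename the conflicting fields on the left first
const {id: synonym_id, name: synonym_name, ...rest} = left;
const safe = Object.assign({synonym_id, synonym_name}, rest, right);
```

In the safe version, both ids and both names remain addressable in the merged document.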
With that index, let's build our query step by step. First we select Kiwi, whose ID is 4, then call the merge command:
1 r.db("foodb")
2 .table("foods")
3 .get(4)
4 .merge(function (food) {
5
6 return {
7 flavors: //flavor array here
8 }
9 })
Let's see what we will fill into the flavors array. We want to grab all of Kiwi's compounds: that is, all documents of the compounds_foods table whose food_id equals Kiwi's ID.
1 r.db("foodb")
2 .table("foods")
3 .get(4)
4 .merge(function (food) {
5 return {
6 flavors:
7 r.db("foodb").table("compounds_foods")
8 .getAll(food("id"),{index: "food_id"})
9 .concatMap(function(compound_food) {
10 //Return something flavor of compound here
11 })
12 .coerceTo("array")
13 }
14 })
Notice that we used concatMap so that it flattens the arrays for us. We also used coerceTo to convert the selection result into an array for the merge command. For each document of compounds_foods, we can get all of its flavors as follows:
1 r.db("foodb").table("compounds_flavors")
2   .getAll(compound_food("compound_id"), {index: "compound_id"})
3   .concatMap(function(compounds_flavor) {
4     return r.db("foodb").table("flavors")
5       .getAll(compounds_flavor("flavor_id"))
6       .map(function (flavor) {
7         return flavor("name")
8       })
9       .coerceTo("array")
10  })
11  .coerceTo("array")
1 r.db("foodb")
2   .table("foods")
3   .get(4)
4   .merge(function (food) {
5     return {
6       flavors:
7         r.db("foodb").table("compounds_foods")
8           .getAll(food("id"), {index: "food_id"})
9           .concatMap(function(compound_food) {
10            return r.db("foodb").table("compounds_flavors")
11              .getAll(compound_food("compound_id"), {index: "compound_id"})
12              .concatMap(function(compounds_flavor) {
13                return r.db("foodb").table("flavors")
14                  .getAll(compounds_flavor("flavor_id"))
15                  .map(function (flavor) {
16                    return flavor("name")
17                  })
18                  .coerceTo("array")
19              })
20              .coerceTo("array")
21          })
22          .distinct()
23          .coerceTo("array")
24    }
25  })
Before the final coerceTo, we also call distinct to eliminate duplicates. And we get this result:
1 {
2 "created_at": Wed Feb 09 2011 00:37:15 GMT-08:00,
3 "creator_id": null,
4 "description": null,
5 "flavors": [
6 "alcoholic",
7 "baked",
8 "bay oil",
9 "bitter",
10 "bland",
11 "bread",
12 "cheese",
13 "cheesy",
14 "citrus",
15 "coconut",
16 "ethereal",
17 "faint",
18 "fat",
19 "fatty",
20 "medical",
21 "metal",
22 "mild",
23 "odorless",
24 "rancid",
25 "slightly waxy",
26 "soapy",
27 "sour",
28 "strong",
29 "sweat",
30 "sweet",
31 "unpleasant",
32 "waxy",
33 "yeast"
34 ],
35 "food_group": "Fruits",
36 "food_subgroup": "Tropical fruits",
37 "food_type": "Type 1",
38 "id": 4,
39 "itis_id": "506775",
40 "legacy_id": 4,
41 "name": "Kiwi",
42 "name_scientific": "Actinidia chinensis",
43 "picture_content_type": "image/jpeg",
44 "picture_file_name": "4.jpg",
45 "picture_file_size": 110661,
46 "picture_updated_at": Fri Apr 20 2012 09:32:21 GMT-07:00,
47 "updated_at": Fri Apr 20 2012 16:32:22 GMT-07:00,
48 "updater_id": null,
49 "wikipedia_id": null
50 }
While the query looks giant and complex, the way to write it is to drill down into one table at a time, using map/concatMap to transform the data:
1 r.db('foodb').table('foods')
2 .map(function (food) {
3 return food.merge({
4 "compound_foods":
5 r.db('foodb').table('compounds_foods')
6 .getAll(food("id"), {index: 'food_id'})
7 .coerceTo('array')
8 })
9 })
Much better than using innerJoin: compared to the more than 2 minutes before, this is a major improvement. To really see how fast or slow a query without an index is, you may want to put the data on a spinning disk (an external hard drive, for example), because an SSD is usually fast enough that you may not notice. All of the above examples ran on the SSD of a MacBook Pro (Retina, 13-inch, Mid 2014) with an Intel Core i5 2.8GHz and 16GB RAM.
The key point is to make sure we fetch data using an index. For each document, we join data by running another query inside a map/concatMap function to merge in extra data. The merged document is returned instead of the original document.
The main difference when using sub queries is that we get nested documents instead of the left and right fields of a JOIN. However, with map and other transform commands, we can turn them into whatever shape we can imagine.
Why map/concatMap is important
In SQL, we can join basically anything. In RethinkDB, join is in fact just syntactic sugar on top of getAll and concatMap. As you learned in Chapter 5, map/concatMap lets you transform data together with its related data in an associated table, by querying the extra data inside the map function.
I said before that they are important, and I repeat it now because they are everything. getAll is like SELECT in MySQL in terms of how much you will use it. And getAll is not very useful without a map.
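The "syntactic sugar" claim can be sketched in plain JavaScript. This is an illustrative model, not RethinkDB internals: a secondary index is modeled as a Map from value to rows, getAll is a lookup in that Map, and eqJoin is concatMap over the lookup.

```javascript
// Build a toy secondary "index": field value -> array of rows
function buildIndex(rows, field) {
  const idx = new Map();
  for (const row of rows) {
    if (!idx.has(row[field])) idx.set(row[field], []);
    idx.get(row[field]).push(row);
  }
  return idx;
}

// eqJoin(leftField, rightTable, {index}) modeled as concatMap + getAll
function eqJoin(leftSeq, leftField, rightIndex) {
  return leftSeq.flatMap(left =>
    (rightIndex.get(left[leftField]) || []).map(right => ({left, right}))
  );
}

const compounds = [{id: 82, name: "Butein"}];
const synonyms  = [{id: 832, compound_id: 82, synonym: "2',3,4,4'-Tetrahydroxychalcone"}];

const byId   = buildIndex(compounds, "id");
const joined = eqJoin(synonyms, "compound_id", byId);
```

The output has the same left/right shape that eqJoin produces before zip.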
Wrap up
At the end of this chapter, we should know how to join using these concepts: primary keys, secondary indexes, and sub queries.
A very common task for a database is to compute some kind of aggregate over a given sequence of data. We will learn those kinds of functions in this chapter.
1 r.db('foodb').table('foods')('food_type').count('Type 1')
2 //=>
3 627
Here we use the bracket syntax to fetch the food_type field, then count how many values match Type 1.
Or count how many users are 18 years old:
1 r.db("foodb")
2 .table("users")("age")
3 .count(18)
We can also pass a ReQL expression or a function. Let's count all foods whose name starts with L:
1 r.db('foodb').table('foods').count(r.row('name').match('^L'))
2 //=>
3 31
When we pass a value or a function to count, RethinkDB only counts the documents that match the value, or for which the predicate function returns true.
1 r.db('foodb').table('foods')
2 .count(function(food){
3 return food('name').match('^L')
4 })
5 //=>
6 31
RethinkDB is very flexible in how we do things. Counting, fundamentally, is just counting the elements of a sequence. With some smart combinations, we can count the same thing in different ways:
1 r.db('foodb').table('foods').map(r.row('name').split('').nth(0)).count('L')
Basically, we take the food name, split it into characters with the split command, and call nth(0) to get the first character. Using map, we transform the foods table into a stream of the first characters of food names, then count how many elements of that stream equal L.
In a sense, passing a function to count is a shortcut for filtering with that function and counting the resulting sequence.
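The equivalence is easy to check in plain JavaScript (an illustrative model with made-up users, not ReQL): counting with a predicate gives the same answer as filtering and then counting.

```javascript
// Made-up sample data for the sketch
const users = [
  {name: "Lara", age: 23},
  {name: "Kim",  age: 23},
  {name: "Liam", age: 30},
];
const pred = u => u.age === 23 && /^L/.test(u.name);

// count(predicate): tally matches in one pass
const viaCount  = users.reduce((n, u) => n + (pred(u) ? 1 : 0), 0);

// filter(predicate).count(): build the sub-sequence, then count it
const viaFilter = users.filter(pred).length;
```

Both forms count exactly one user here, just as the two ReQL queries below return the same result.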
The example below counts users who are 23 years old and whose name starts with L:
1 r.db("foodb")
2 .table("users")
3 .count(function(user) {
4 return user("age").eq(23).and(user("name").match("^L"))
5 })
6 //=>
7 1
1 r.db("foodb")
2 .table("users")
3 .filter(function(user) {
4 return user("age").eq(23).and(user("name").match("^L"))
5 })
6 .count()
7 //=>
8 1
1 r.db("foodb")
2 .table("users")
3 .count(r.row("age").eq(23).and(r.row("name").match("^L")))
4 //=>
5 1
sum and avg are similar to count in how you use them; they just differ in what they give you: sum gives you the sum of a sequence, and avg the average. Let's find out how many bytes of storage are needed to store the food images. Each document in the foods table has a picture_file_size field, stored in bytes.
1 r.db('foodb').table('foods').sum('picture_file_size')
2 //=>
3 123463051
Or, equivalently, using bracket to extract the field first:
1 r.db('foodb').table('foods')('picture_file_size').sum()
The key thing is to understand how these functions operate. By default, they operate on the whole document. That's why we have to use sum('picture_file_size') when we call sum directly on the table. However, when we have already used bracket to get the field, we can simply call sum() without any parameters.
You can probably guess the average file size; let's find out:
1 r.db('foodb').table('foods')('picture_file_size').avg()
2 //=>
3 147155.0071513707
We can also pass a function to sum or avg. In that case, RethinkDB calls the function on every document and uses the results for the sum or average. Let's say we are only interested in file sizes bigger than 4MB:
1 r.db('foodb').table('foods').sum(function(food) {
2 return r.branch(
3 food('picture_file_size').gt(1024*1024*4),
4 food('picture_file_size'),
5 0)
6 })
7 //=>
8 9666379
In a way, passing a function to sum gives us a simple filter effect. Using filter, we can write the above query again:
1 r.db('foodb').table('foods').filter(function(food) {
2 return food('picture_file_size').gt(1024 * 1024 * 4)
3 }).sum('picture_file_size')
4 //=>
5 9666379
In this case, we first find and return only the documents whose picture_file_size is greater than 4MB, then simply sum the picture_file_size field of those documents.
Basically, by passing a function into sum or avg, we can transform each document into the value we want to sum or average.
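Both forms can be modeled in plain JavaScript. This is an illustrative sketch: only Meatball's size appears in the text; the other sample sizes are invented so that the large files add up to the 9666379 from the query above.

```javascript
const foods = [
  {name: "Meatball", picture_file_size: 5102677},  // from the text
  {name: "Kiwi",     picture_file_size: 110661},   // invented sample
  {name: "Cabbage",  picture_file_size: 4563702},  // invented sample
];
const FOUR_MB = 1024 * 1024 * 4;

// sum(function): map each doc to a number (the r.branch analogue), then add
const viaBranch = foods.reduce(
  (acc, f) => acc + (f.picture_file_size > FOUR_MB ? f.picture_file_size : 0), 0);

// filter + sum('picture_file_size')
const viaFilter = foods
  .filter(f => f.picture_file_size > FOUR_MB)
  .reduce((acc, f) => acc + f.picture_file_size, 0);
```

Both give the same total, which is why the two ReQL queries above agree.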
Doing calculations is fun, but what if we want to know which food has the smallest or biggest picture file? Let's move on to min and max:
1 r.db('foodb').table('foods').max('picture_file_size')
2   .pluck('name', 'picture_file_size')
3 //=>
4 {
5 "name": "Meatball" ,
6 "picture_file_size": 5102677
7 }
We can also pass an expression, whose value is used for the comparison: for example, to find the compound with the most health effects.
As you can guess, RethinkDB often runs faster when we pass a field name, because no extra processing is needed. With a function, the function has to be executed for every document; getting the max file size of foods in group Type 1, for instance, requires a function.
min and max return the full document, but use a value for the comparison. That hints that min and max may also accept an index as the comparison value.
Take this example: try to find the compound with the biggest msds_file_size.
1 r.db('foodb').table('compounds').max('msds_file_size')
2 //=>
3 1 row returned in 217ms.
Note that if you run max again without an index, the second time the query runs faster because part of the data was cached by RethinkDB. The size of this cache is defined by the formula (available_mem - 1024 MB) / 2, where available_mem is the memory available when RethinkDB starts.
This runs on an SSD. Pretty slow. Now, see how fast it is with an index:
1 r.db('foodb').table('compounds').indexCreate('msds_file_size')
1 r.db('foodb').table('compounds').max({index: 'msds_file_size'})
2 //=>
3 1 row returned in 8ms.
By passing a secondary index to the min or max function, the index value is used for the comparison; the query runs much faster and more efficiently.
For complex logic, we can even pass a function to min or max, and the return values are used for the comparison. Let's find the food that has the most compounds. The compounds of a food are stored in the compounds_foods table.
1 // First, let's create an index; skip this if you created it before.
2 r.db('foodb').table('compounds_foods').indexCreate('food_id')
3
4 r.db('foodb').table('foods')
5 .max(function(food) {
6 return r.db('foodb').table('compounds_foods')
7 .getAll(food('id'), {index: 'food_id'})
8 .count()
9 })
10 //=>
11 1 row returned in 1min 8.2s.
Yes, it runs in 1 minute and 8.2 seconds. Super slow, because the function has to be run on every document. That being said, while complex functions may be slow, they are useful when we need them.
distinct
Given a sequence, distinct removes duplicates from it. When given an index, duplicates are detected by the value of the index. Its syntax is:
1 sequence.distinct() → array
2 table.distinct([{index: <indexname>}]) → stream
As you can tell, whenever we return an array, we will run into the 100,000-element issue if the returned array has more than 100,000 elements. Keep that in mind, and try to call distinct with a proper index, which we will learn about shortly.
Let's start with this simple example:
1 r.expr([1, 2, 3, 4, 1]).distinct()
2 //=> 4 rows returned
3 [
4 1 ,
5 2 ,
6 3 ,
7 4
8 ]
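On an in-memory sequence, distinct behaves like deduplicating an array while keeping one copy of each value, as this plain-JavaScript sketch (not ReQL) shows:

```javascript
// Model of r.expr([...]).distinct(): a Set drops duplicates and
// preserves first-seen order
const distinct = seq => [...new Set(seq)];

const result = distinct([1, 2, 3, 4, 1]);
```

The duplicate 1 collapses, leaving four elements, matching the query above.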
1 r.db("foodb")
2 .table("users")
3 .withFields("name")
4 .distinct()
5 //=>Executed in 30ms. 152 rows returned
6 [
7 {
8 "name": "Abe Willms"
9 } ,
10 {
11 "name": "Adela Klein V"
12 } ,...]
1 r.db("foodb")
2 .table("users")
3 .withFields("age")
4 .distinct()
We can also use distinct with an arbitrary expression index. Let's create an index that classifies users into age groups:
1 r.db("foodb")
2   .table("users")
3   .indexCreate("age-group", function (user) {
4     return r.branch(
5       user("age").lt(18),
6       "teenager",
7       r.branch(user("age").gt(50),
8         "older",
9         "adult"
10       )
11     )
12   })
With that index, we can quickly list the age groups of our users by calling distinct on the table and passing the index:
1 r.db("foodb")
2 .table("users")
3 .distinct({index: "age-group"})
4 //=> 3 rows return
5 "adult"
6 "older"
7 "teenager"
When we pass an index, the value of the index is returned, and therefore that value is used to detect duplicates.
Let's try it on a big table: list all orig_food_common_name values of compounds_foods.
1 r.db('foodb').table('compounds_foods')('orig_food_common_name').distinct()
2 //=>
3 9492 rows returned in 41.44s.
Here we use bracket to return only the orig_food_common_name field, then remove duplicates with distinct. The query runs in 41.44 seconds. It also returns a whole array of 9,492 rows, meaning all the data has to be put into memory and transferred over the network. To make it faster and more efficient, we can use an index, and the result will be a stream.
Let's create an index for that field:
1 r.db('foodb').table('compounds_foods').indexCreate('orig_food_common_name')
1 r.db('foodb').table('compounds_foods').distinct({index: 'orig_food_common_name'})
2 //=>40 rows returned in 161ms. Displaying rows 1-40, more available
We optimized it from 41.44s to 161ms! There are two reasons for this. Without an index, RethinkDB has to scan the whole table, which is slow in two ways: it must load and examine every document, and it must hold the whole result array in memory and send it over the network. When we pass an index, RethinkDB picks up the value of the index; it doesn't have to load or care about the whole document. The result is a stream, so the client receives a cursor and fetches data lazily.
As you can see, the commands we learned in this chapter operate on the whole sequence. However, aggregation usually comes with grouping: we want to divide a sequence into many groups and run aggregations on those groups. To do that, let's learn about group.
group
Yes, group is everywhere. The group command groups data into many sub-sequences, and we can continue to run aggregations on those sub-sequences. For example, instead of counting the whole sequence, we may want to count how many documents are in group A, how many in group B, and so on, where a group is a set of documents sharing the same particular value.
Let's see how this is handled in RethinkDB.
In a nutshell, given a sequence, RethinkDB groups documents by the value of a field, or by the return value of a function: documents with the same value end up in the same group.
Looking at the flavors table, let's group it by the flavor_group field:
1 r.db("foodb")
2 .table("flavors")
3 .group('flavor_group')
4 //=>
5 [{
6 "group": "animal",
7 "reduction": [
8 {
9 "category": "odor",
10 "created_at": {
11 "$reql_type$": "TIME",
12 "epoch_time": 1317561018,
13 "timezone": "-07:00"
14 },
15 "creator_id": null,
16 "flavor_group": "animal",
17 "id": 112,
18 "name": "animal",
19 "updated_at": {
20 "$reql_type$": "TIME",
21 "epoch_time": 1317561018,
22 "timezone": "-07:00"
23 },
24 "updater_id": null
25 }
26 ]
27 },
28 {
29 "group": "balsamic",
30 "reduction": [
31 {
32 "category": "odor",
33 "created_at": {
34 "$reql_type$": "TIME",
35 "epoch_time": 1317561011,
36 "timezone": "-07:00"
37 },
38 "creator_id": null,
39 "flavor_group": "balsamic",
40 "id": 43,
41 "name": "others",
42 "updated_at": {
43 "$reql_type$": "TIME",
44 "epoch_time": 1317561011,
45 "timezone": "-07:00"
46 },
47 "updater_id": null
48 },
49 {
50 "category": "odor",
51 "created_at": {
52 "$reql_type$": "TIME",
53 "epoch_time": 1317561010,
54 "timezone": "-07:00"
55 },
56 "creator_id": null,
57 "flavor_group": "balsamic",
58 "id": 40,
59 "name": "chocolate",
60 "updated_at": {
61 "$reql_type$": "TIME",
62 "epoch_time": 1317561010,
63 "timezone": "-07:00"
64 },
65 "updater_id": null
66 }
67 ]
68 },
69 {
70 "group": "camphoraceous",
71 "reduction": [
72 {
73 "category": "odor",
74 "created_at": {
75 "$reql_type$": "TIME",
76 "epoch_time": 1317561017,
77 "timezone": "-07:00"
78 },
79 "creator_id": null,
80 "flavor_group": "camphoraceous",
81 "id": 101,
82 "name": "camphoraceous",
83 "updated_at": {
84 "$reql_type$": "TIME",
85 "epoch_time": 1317561017,
86 "timezone": "-07:00"
87 },
88 "updater_id": null
89 }
90 ]
91 },
92 ...]
When we chain a function after group, the function operates on each reduction array, and its result replaces the value of the reduction array.
For example, we can count how many elements each reduction has:
1 r.db("foodb")
2 .table("flavors")
3 .group('flavor_group')
4 .count()
5 //=>
6 [
7 {
8 "group": null ,
9 "reduction": 743
10 } ,
11 {
12 "group": "animal" ,
13 "reduction": 1
14 } ,
15 {
16 "group": "balsamic" ,
17 "reduction": 10
18 } ,
19 {
20 "group": "camphoraceous" ,
21 "reduction": 1
22 } ,
23 ...
24 ]
So the reduction field no longer contains an array of documents; it now holds the number of documents that were in the original reduction array.
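As a plain-JavaScript sketch (an analogy, not ReQL; the flavor documents here are invented for illustration), group followed by count amounts to bucketing rows by a key and replacing each bucket with its size:

```javascript
// Invented stand-ins for a few flavors documents.
const flavors = [
  { id: 112, name: 'animal',    flavor_group: 'animal' },
  { id: 43,  name: 'others',    flavor_group: 'balsamic' },
  { id: 40,  name: 'chocolate', flavor_group: 'balsamic' },
];

// group('flavor_group'): bucket documents sharing the same key value.
function group(rows, key) {
  const buckets = new Map();
  for (const row of rows) {
    const k = row[key];
    if (!buckets.has(k)) buckets.set(k, []);
    buckets.get(k).push(row);
  }
  return [...buckets].map(([g, reduction]) => ({ group: g, reduction }));
}

// Chaining count() after group() replaces each reduction array by its length.
const counted = group(flavors, 'flavor_group')
  .map(({ group: g, reduction }) => ({ group: g, reduction: reduction.length }));

console.log(counted);
// [ { group: 'animal', reduction: 1 }, { group: 'balsamic', reduction: 2 } ]
```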
Similarly, instead of counting, say we care about the first document only:
1 r.db("foodb")
2 .table("flavors")
3 .group('flavor_group')
4 .nth(0)
5 //=>
6 [
7 {
8 "group": null ,
9 "reduction": {
10 "category": "odor" ,
11 "created_at": Sun Oct 02 2011 06:12:18 GMT-07:00 ,
12 "creator_id": null ,
13 "flavor_group": null ,
14 "id": 148 ,
15 "name": "cotton candy" ,
16 "updated_at": Sun Oct 02 2011 06:12:18 GMT-07:00 ,
17 "updater_id": null
18 }
19 } ,
20 {
21 "group": "animal" ,
22 "reduction": {
23 "category": "odor" ,
24 "created_at": Sun Oct 02 2011 06:10:18 GMT-07:00 ,
25 "creator_id": null ,
26 "flavor_group": "animal" ,
27 "id": 112 ,
28 "name": "animal" ,
29 "updated_at": Sun Oct 02 2011 06:10:18 GMT-07:00 ,
30 "updater_id": null
31 }
32 } ,
33 {
34 "group": "balsamic" ,
35 "reduction": {
36 "category": "odor" ,
37 "created_at": Sun Oct 02 2011 06:10:10 GMT-07:00 ,
38 "creator_id": null ,
39 "flavor_group": "balsamic" ,
40 "id": 40 ,
41 "name": "chocolate" ,
42 "updated_at": Sun Oct 02 2011 06:10:10 GMT-07:00 ,
43 "updater_id": null
44 }
45 } ,
46 ...
47 ]
Here, nth(0) is called on the reduction array, returns its first element, and the result is re-assigned to the reduction field.
Note that group has a limitation when the grouped data exceeds 100,000 elements. For example, let's group compounds_foods by orig_food_common_name:
1 r.db("foodb")
2 .table("compounds_foods")
3 .group('orig_food_common_name')
1 RqlRuntimeError: Grouped data over size limit `100000`. Try putting a reduction\
2 (like `.reduce` or `.count`) on the end in:
3 r.db("foodb").table("compounds_foods").group("orig_food_common_name")
Why? Because when we end the chain with group, the whole grouped array is loaded into memory, and our sequence has more than 100,000 elements: we have around 668K documents. However, when we call reduce or count on it, the number of documents shrinks, so RethinkDB doesn't have to keep them all in memory and the query works.
Let's try what the error message suggests:
1 r.db("foodb")
2 .table("compounds_foods")
3 .group('orig_food_common_name')
4 .count()
5 #=>
6 //Executed in 45.03s. 9492 rows returned
7 [
8 {
9 "group": null,
10 "reduction": 4313
11 },
12 {
13 "group": "AMARANTH FLAKES",
14 "reduction": 68
15 },
16 ...
17 ]
This result itself may not be what we were after, but why does the query run now? Because when we call count, the reduction array for each group becomes a single value instead of an array of grouped documents, which makes the final grouped data much smaller. As you can see, 9492 rows are returned, and all of that fits into memory.
So keep in mind that group has this limitation. Notice that 9492 is the same number we get when running distinct on the orig_food_common_name field:
1 r.db('foodb').table('compounds_foods')('orig_food_common_name').distinct()
2 //=>
3 9492 rows returned in 41.02s.
They return the same number of documents because, while the two commands are different, they share the same concept of equality. Look at their definitions again:
group: groups the many documents that share the same value of a field (or function result) into a single document.
distinct: eliminates duplicates, based on the value of a field or function result.
While they return different data, they return the same quantity of documents: distinct handles equal values by removing all but one, while group handles them by merging them into one.
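A tiny plain-JavaScript sketch (not ReQL, with invented sample rows) shows why the two counts must agree: both operations are driven by the same equality on the grouped value.

```javascript
// Four invented flavor rows, with three distinct flavor_group values.
const rows = [
  { name: 'cotton candy', flavor_group: null },
  { name: 'animal',       flavor_group: 'animal' },
  { name: 'others',       flavor_group: 'balsamic' },
  { name: 'chocolate',    flavor_group: 'balsamic' },
];

// distinct: keep one representative per equal value.
const distinctGroups = new Set(rows.map(r => r.flavor_group));

// group: merge equal values into one bucket.
const buckets = new Map();
for (const r of rows) {
  buckets.set(r.flavor_group, (buckets.get(r.flavor_group) || []).concat(r));
}

// One bucket per distinct value, so the cardinalities are always equal.
console.log(distinctGroups.size, buckets.size); // 3 3
```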
The result above confirms that our query works properly. Sometimes it's fun to go back and try a different query as a way of validating our queries.
ungroup
As you can see, anything that follows group operates on the sub-stream, that is, the reduction array. Can we make the following function run on the sequence returned by group itself? Say we want to sort by the value of the reduction field. Let's try sorting the flavor groups by how many flavors each flavor_group holds:
1 r.db("foodb")
2 .table("flavors")
3 .group('flavor_group')
4 .count()
5 .ungroup()
6 .orderBy(r.desc('reduction'))
7 //=>
8 [
9 {
10 "group": null ,
11 "reduction": 743
12 } ,
13 {
14 "group": "fruity" ,
15 "reduction": 24
16 } ,
17 {
18 "group": "floral" ,
19 "reduction": 14
20 } ,
21 {
22 "group": "balsamic" ,
23 "reduction": 10
24 },...
25 ]
1 r.db("foodb")
2 .table("flavors")
3 .group('flavor_group')
4 .count()
5 .orderBy(r.desc('reduction'))
6 //=>
7 e: Cannot convert NUMBER to SEQUENCE in:
8 r.db("foodb").table("flavors").group("flavor_group").count().orderBy(r.desc("red\
9 uction"))
The error occurs because without ungroup, orderBy is called on the reduction value, which in this case is a single number (the quantity of documents in the reduction array), and orderBy cannot work on a single number.
So ungroup turns the value returned by group into a sequence of objects, each containing two fields, group and reduction, and lets any subsequent command operate on the whole sequence instead of on the sub-sequences produced by group. That's why it is called ungroup: the value of the reduction field is no longer treated as a sub-sequence to operate on; the reduction array is now just an ordinary array inside a normal document of the sequence.
Let's dive into an example to learn more. Say we have an array of numbers and we want the sum of the odd numbers and the sum of the even numbers. Using expr we can easily represent an array in RethinkDB, and to separate odd numbers from even ones we can group them by the result of mod 2:
12 "reduction": [
13 3,
14 5
15 ]
16 }]
Nothing is new here; we know what group and sum do. Group 0 has reduction = [4, 2], so its sum is 6.
If we put an ungroup first:
15 5
16 ]
17 }]
The output is the same, but the context has changed, as typeOf can verify: after ungroup the result becomes an ARRAY instead of a GROUPED_STREAM. Now if we call sum, it works on the whole documents, meaning documents with the two fields group and reduction, rather than on the reduction sub-stream.
We can confirm that after ungroup we really operate on the whole documents: for each document we take the sum of the reduction field, which is an array, then add those sums together. The final result is (4 + 2) + (3 + 5) = 14.
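The whole pipeline can be sketched in plain JavaScript (an analogy, not ReQL), using the array [4, 2, 3, 5] from this example:

```javascript
const numbers = [4, 2, 3, 5];

// group by num % 2: even numbers under key 0, odd numbers under key 1.
const grouped = new Map();
for (const n of numbers) {
  const k = n % 2;
  grouped.set(k, (grouped.get(k) || []).concat(n));
}

// ungroup: turn the grouped stream into a plain array of
// { group, reduction } documents.
const ungrouped = [...grouped].map(([g, reduction]) => ({ group: g, reduction }));

// After ungroup, summing sees whole documents: sum each reduction array,
// then add the per-group sums: (4 + 2) + (3 + 5) = 14.
const total = ungrouped
  .map(doc => doc.reduction.reduce((a, b) => a + b, 0))
  .reduce((a, b) => a + b, 0);

console.log(total); // 14
```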
Or say we want to show the sum of the odd numbers first; right now the sum of the even numbers shows up first:
Because the reduction array has no group field for orderBy to work on, sorting fails. If we use ungroup here, we get access to the whole documents as usual.
The idea of ungroup confused me a bit at the beginning. If you grasped it right away, congratulations; otherwise, just re-read this section and try a few simple examples yourself and you will get it.
Now let's move on to an even more confusing function: reduce.
reduce
I think at this point you already know what count() does: from a sequence or an array, it returns a single number, how many items the sequence holds. It transforms a whole array into a single value, unlike map, which turns each element of a sequence into another value and returns a new sequence of all the values returned by the map function.
count is an example of a reduction. reduce accepts a function, let's call it reduce_function, and produces a single value by repeatedly calling that function with inputs that may be previous outputs of reduce_function. reduce_function can be called with: two elements of the sequence; one element of the sequence and the result of a previous reduce_function call; or the results of two previous reduce_function calls.
In the simplest case, on the first execution the first two elements of the sequence are passed into the reduce function; on the second execution, one parameter is the third element of the sequence and the other is the result of the reduce function applied to the first and second elements; and so on for the fourth, fifth, and later executions.
But why can there be two results of previous reductions? Because the reduce function can run in parallel across shards and CPU cores, or even across computers in a cluster. The final result of the reduce function on each shard or computer is then passed to the reduce function again to produce the final result.
What happens if the sequence has a single element? We don't have enough input for the reduce function. That is a special case: RethinkDB simply returns that element as the result of the reduction.
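A sketch of that pairwise evaluation in plain JavaScript (a hypothetical helper, not part of any driver): an associative reduce function lets the server split the sequence, reduce the halves independently, then combine the two partial results, and a single element is returned as-is.

```javascript
// Recursive pairwise reduce: split the sequence, reduce each half
// (conceptually "per shard"), then combine the two partial results.
function pairwiseReduce(seq, fn) {
  if (seq.length === 1) return seq[0]; // single element: returned as-is
  const mid = Math.floor(seq.length / 2);
  return fn(pairwiseReduce(seq.slice(0, mid), fn),
            pairwiseReduce(seq.slice(mid), fn));
}

const add = (left, right) => left + right;
console.log(pairwiseReduce([1, 2, 7, 8], add)); // 18
console.log(pairwiseReduce([9], add));          // 9
```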
Usually, reduce is used together with map to transform documents into values that can be aggregated. As you have seen, the parameters of the reduce function can be elements of the sequence or results of previous reduce calls. Therefore we need a transformation so that the parameter types and the result type of the reduce function are the same; if we skip that transformation, the reduce function has to be able to deal with multiple data types.
Take this example:
1 r.expr([1, 2, 7, 8])
2 .reduce(function(left, right) {
3 return left.add(right)
4 })
5 //=>
6 18
That's the sum of the array. The reducing process is similar to these steps:
1. The reduce function is called with the first two elements of the sequence: left = 1, right = 2, returning 1.add(2) = 3.
2. The reduce function is called again with the third element, 7, and the result of the previous call: left = 3, right = 7, so the result is 10.
3. The reduce function is called again with the last element of the array: left = 10, right = 8, so the result is 18.
4. With no more elements, the value of the last call, 18, is returned.
So reduce is a kind of recursion. As another example, a reduce function that returns the smaller of its two input values, using branch as an if and lt for a less-than comparison, computes the minimum of the sequence.
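In plain JavaScript (an analogy, not ReQL), that minimum-finding reduction looks like:

```javascript
// Fold pairs, keeping the smaller value each time:
// (1, 2) -> 1, then (1, 7) -> 1, then (1, 8) -> 1.
const min = [1, 2, 7, 8].reduce(
  (left, right) => (left < right ? left : right)
);

console.log(min); // 1
```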
Notice that the reduce function returns the same data type as its input values. Let's try to count how many documents we have, reduce style. You might guess we would write a reduce function that adds 1 as we iterate over the array; but since we are writing reduce as a recursive pairwise function, we add the left and right values instead.
1 r.db("foodb")
2 .table("flavors")
3 .reduce(function(left, right) {
4 return r
5 .branch(left.typeOf().ne('NUMBER'), 1, left)
6 .add(
7 r.branch(right.typeOf().ne('NUMBER'), 1, right)
8 )
9 })
10 //=>
11 855
That's the total number of documents in the flavors table. Let's look at our reduce function again:
1 function(left, right) {
2 return r
3 .branch(left.typeOf().ne('NUMBER'), 1, left)
4 .add(
5 r.branch(right.typeOf().ne('NUMBER'), 1, right)
6 )
7 }
left and right can each be either a document of the flavors table, with all its fields, or a number produced by the add command. We use typeOf to detect the type: if it is not a NUMBER, the value is a document, which we count as one item, so we return 1; if it is already a number, we use it as-is. Then we add the two numbers. It's like seeing an item, taking 1, and adding it to the result of the previous call; repeating this across the whole sequence gives us its count.
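A plain-JavaScript analogy of this typeOf trick (the documents are invented): left or right is either a document, counted as 1, or a number carried over from a previous call.

```javascript
// Three invented documents to count.
const docs = [{ id: 40 }, { id: 43 }, { id: 112 }];

// Treat a document as 1; pass a number through unchanged.
const asNumber = v => (typeof v === 'number' ? v : 1);

// First call: (doc, doc) -> 1 + 1 = 2; second call: (2, doc) -> 2 + 1 = 3.
const count = docs.reduce((left, right) => asNumber(left) + asNumber(right));

console.log(count); // 3
```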
So we have to use the branch command to turn each document into a number, for both left and right. That job is a transformation, which sounds like a job for map. Rewriting it, we can make it cleaner:
1 r.db("foodb")
2 .table("flavors")
3 .map(function(doc) {
4 return 1
5 })
6 .reduce(function(left, right) {
7 return left.add(right)
8 })
9 //=>
10 855
Now we map each document to the single number 1, and the reduce function simply sums the array: take the first two elements and return their sum; take the previous sum and add the third element; and so on.
Usually we have a map step before reduce to turn each document into a type compatible with the result of the reduce function. That's why this process is sometimes called map-reduce.
Executing a reduce function is like recursion, but because the result of the previous run is passed back into the function itself, we don't need a stack to store the values of previous calls. In other words, the function encapsulates its data: it doesn't access any outside variables, and all the data it needs is passed in through the left and right parameters. Note that these are just name bindings; we can name them whatever we like, but the function must be able to deal with both possible data types: the type of a sequence element and the type of a previous result.
Map Reduce
The map-reduce process shines when used with group. When we call group, subsequent commands operate on the sub-streams, and we can take advantage of that to run reduce on each sub-stream and implement our own aggregation logic by writing a reduce function.
Let's count how many compounds each of the first five foods has, using map-reduce style instead of the built-in count command. A food has many compounds, and a compound has many health effects, as in the diagram below:
1 r.db("foodb")
2 .table("foods")
3 .limit(5)
4 .concatMap(function(food) {
5 return r.db("foodb")
6 .table("compounds_foods")
7 .getAll(food('id'), {index: 'food_id'})
8 .pluck('food_id', 'compound_id')
9 .merge({name: food('name')})
10 })
11 .group(function(doc) {
12 return doc.pluck('food_id', 'name')
13 })
14 .reduce(function(left, right) {
15 return r
16 .branch(left.typeOf().eq("NUMBER"), left, 1)
17 .add(
18 r.branch(right.typeOf().eq("NUMBER"), right, 1)
19 )
20 })
21 //=>5 rows returned in 254ms
22 [{
23 "group": {
24 "food_id": 4,
25 "name": "Kiwi"
26 },
27 "reduction": 378
28 }, {
29 "group": {
30 "food_id": 20,
31 "name": "Mugwort"
32 },
33 "reduction": 103
34 }, {
35 "group": {
36 "food_id": 25,
37 "name": "Common beet"
38 },
39 "reduction": 942
40 }, {
41 "group": {
42 "food_id": 26,
43 "name": "Borage"
44 },
45 "reduction": 225
46 }, {
47 "group": {
48 "food_id": 30,
49 "name": "Common cabbage"
50 },
51 "reduction": 1826
52 }]
1 r.db("foodb")
2 .table("foods")
3 .limit(5)
4 .concatMap(function(food) {
5 return r.db("foodb")
6 .table("compounds_foods")
7 .getAll(food('id'), {index: 'food_id'})
8 .pluck('food_id', 'compound_id')
9 .merge({name: food('name')})
10 })
11 .group(function(doc) {
12 return doc.pluck('food_id', 'name')
13 })
14 .map(function(doc) {
15 return 1
16 })
17 .reduce(function(left, right) {
18 return left
19 .add(right)
20 })
Here we group by food_id and food name, then map each document of the sub-stream to 1. Because we are counting, we only care about each document as a whole rather than its individual fields. The reduce function simply adds left and right and returns the sum. The map step helps clean up the reduce function, because it is easier to take numbers as input and return a number too.
Let's take a more complex example: calculating at the same time how many flavors and health effects a food has.
First, let's create the necessary indexes:
1 r.db("foodb").table('compounds_health_effects').indexCreate('compound_id')
2 r.db("foodb").table('compounds_flavors').indexCreate('compound_id')
3 r.db("foodb").table('compounds_foods').indexCreate('compound_id')
With these indexes, we can easily take each compound and count how many flavors and health effects are associated with it:
1 r.db('foodb')
2 .table('compounds')
3 .concatMap(function(doc) {
4 return r.db('foodb')
5 .table('compounds_foods')
6 .getAll(doc('id'), {index: 'compound_id'})
7 .pluck('food_id')
8 .merge({
9 compound_id: doc('id'),
10 flavor_total: r.db('foodb').table('compounds_flavors').getAll(doc('id'\
11 ), {index: 'compound_id'}).count(),
12 health_effect_total: r.db('foodb').table('compounds_health_effects').g\
13 etAll(doc('id'), {index: 'compound_id'}).count()
14 })
15 })
16 //=>400 rows returned in 1min 15.09s. Displaying rows 1-400, more available
17 {
18 "compound_id": 4 ,
19 "flavor_total": 0 ,
20 "food_id": 191 ,
21 "health_effect_total": 0
22 }
23 {
24 "compound_id": 4 ,
25 "flavor_total": 0 ,
26 "food_id": 189 ,
27 "health_effect_total": 0
28 }
In the query above, we first fetch the compounds table; for each compound, we fetch its food_id values by querying the compounds_foods table. A compound can appear in many foods, hence we use concatMap to flatten the returned arrays. We pluck the food_id field from compounds_foods because that is the only field we care about, instead of returning whole documents.
The query above gives us each compound with its flavor count and health effect count. But we want to count the flavors and health effects of a food. A food contains many compounds, so its total health effect count is the sum of its compounds' health effect counts.
Therefore, we can group by the food_id field and run a reduce function on each reduction group to get the total counts:
1 r.db('foodb')
2 .table('compounds')
3 .concatMap(function(doc) {
4 return r.db('foodb')
5 .table('compounds_foods')
6 .getAll(doc('id'), {index: 'compound_id'})
7 .pluck('food_id')
8 .merge({
9 compound_id: doc('id'),
10 flavor_total: r.db('foodb').table('compounds_flavors').getAll(doc('id'\
11 ), {index: 'compound_id'}).count(),
12 health_effect_total: r.db('foodb').table('compounds_health_effects').g\
13 etAll(doc('id'), {index: 'compound_id'}).count()
14 })
15 })
16 .group('food_id')
17 .reduce(function(left, right) {
18 return {
19 flavor_total: left('flavor_total').add(right('flavor_total')),
20 health_effect_total: left('health_effect_total').add(right('health_eff\
21 ect_total')),
22 }
23 })
24 //=> 832 rows returned in 3min 33.23s.
25 {
26 "group": 2,
27 "reduction": {
28 "flavor_total": 16,
29 "health_effect_total": 517
30 }
31 },
32 {
33 "group": 3,
34 "reduction": {
35 "flavor_total": 0,
36 "health_effect_total": 112
37 }
38 }
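The per-group reduction above can be mimicked in plain JavaScript (the rows here are invented to reproduce the totals shown for food ids 2 and 3):

```javascript
// Invented per-compound rows, as the concatMap/merge step would emit them.
const rows = [
  { food_id: 2, flavor_total: 10, health_effect_total: 500 },
  { food_id: 2, flavor_total: 6,  health_effect_total: 17 },
  { food_id: 3, flavor_total: 0,  health_effect_total: 50 },
  { food_id: 3, flavor_total: 0,  health_effect_total: 62 },
];

// group('food_id'): bucket the rows per food.
const buckets = new Map();
for (const r of rows) {
  buckets.set(r.food_id, (buckets.get(r.food_id) || []).concat(r));
}

// Reduce each bucket by adding the two counter fields pairwise.
const result = [...buckets].map(([g, reduction]) => ({
  group: g,
  reduction: reduction.reduce((left, right) => ({
    flavor_total: left.flavor_total + right.flavor_total,
    health_effect_total: left.health_effect_total + right.health_effect_total,
  })),
}));

console.log(result);
// [ { group: 2, reduction: { flavor_total: 16, health_effect_total: 517 } },
//   { group: 3, reduction: { flavor_total: 0, health_effect_total: 112 } } ]
```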
This gives us a list of food ids, each with its total flavors and health effects. Let's do one more thing: return the food name too, and remap the fields to make the documents more readable. To map the documents, so that we can change the field names from group and reduction, we have to call ungroup first. The final query is:
1 r.db('foodb')
2 .table('compounds')
3 .concatMap(function(doc) {
4 return r.db('foodb')
5 .table('compounds_foods')
6 .getAll(doc('id'), {index: 'compound_id'})
7 .pluck('food_id')
8 .merge({
9 compound_id: doc('id'),
10 flavor_total: r.db('foodb').table('compounds_flavors').getAll(doc('id'\
11 ), {index: 'compound_id'}).count(),
12 health_effect_total: r.db('foodb').table('compounds_health_effects').g\
13 etAll(doc('id'), {index: 'compound_id'}).count()
14 })
15 })
16 .group('food_id')
17 .reduce(function(left, right) {
18 return {
19 flavor_total: left('flavor_total').add(right('flavor_total')),
20 health_effect_total: left('health_effect_total').add(right('health_eff\
21 ect_total')),
22 }
23 })
24 .ungroup()
25 .map(function(doc) {
26 return doc('reduction').merge({
27 food: r.db('foodb').table('foods').get(doc('group')).default({}).pluck('id\
28 ', 'name'),
29 })
30 })
You can sit back and watch the nice graphs on the RethinkDB admin dashboard; the query takes a few minutes depending on your CPU and disk speed. And the result:
A very important point here is that we start from the compounds table instead of the foods table, and only later group by food_id. This is the reverse of fetching foods first and then walking down to all their compounds. It makes the query shorter and easier to follow, because starting directly from the compounds table eliminates one level of nesting. Picking the right table to start with, and the right order, matters.
The reason: if we start from the foods table, we have to join with compounds_foods to find each food's compounds, then join with compounds, then continue joining with two more tables, compounds_flavors and compounds_health_effects. That's three levels deep.
When we start right at compounds, we just need to join with compounds_foods, compounds_flavors, and compounds_health_effects at the same time, because we already have the compound_id. That's only a single level deep, and at that point we are essentially done, since we have enough information (the food_id field) for grouping. In the final step we join with the foods table to fetch the food name, and that stays readable because the map there is like going up one level, keeping the query easy to follow.
Sometimes you may not need reduce at all; with map, concatMap, and the built-in commands you can already do a lot. But when you do need reduce, it really helps.
Wrap up
Finishing this chapter, you should know how to aggregate data: how to group it, count it, and call functions on grouped data. Some key things:
1 r.db("foodb")
2 .table("users")
3 .get("03f5479c-403e-4dfa-995f-5aea85c25982")
4 .update({
5 birthday: r.time(1987, 5,5, 'Z')
6 })
The timezone can be Z, meaning UTC, or a string in the format ±[hh]:[mm] as an offset from UTC. UTC is 7 or 8 hours ahead of Pacific time, depending on the season.
When you read the time back, it is again converted into a native time object/data type of your language. This saves you a bunch of time dealing with time formatting and timezones.
Internally, RethinkDB stores an epoch time and an associated timezone. Epoch time is the number of seconds since the Unix epoch, that is, since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds.
The associated timezone is a minute-precision offset from UTC; PST, for example, is -08:00.
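As a quick plain-JavaScript sanity check, we can reproduce the epoch value used later in this chapter (548734200, for 1987-05-23 10:10:00 at +08:00) by subtracting the offset and counting seconds from the Unix epoch:

```javascript
// 1987-05-23 10:10:00 at UTC+08:00 is 1987-05-23 02:10:00 UTC.
// Date.UTC counts milliseconds since the Unix epoch (months are 0-based).
const ms = Date.UTC(1987, 4, 23, 10 - 8, 10, 0);
const epochSeconds = ms / 1000;

console.log(epochSeconds); // 548734200
```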
Timezone
When you save a native time object into a RethinkDB document, if the object includes a timezone value, RethinkDB picks it up and uses it; otherwise it defaults to UTC. Let's say I was born in Vietnam on 1987/05/23 at 10:10 AM. Taking Vietnam's timezone as UTC+08:00, I will write:
https://en.wikipedia.org/wiki/Unix_time
1 r.db("foodb")
2 .table("users")
3 .insert({
4 name: 'Vinh',
5 age: 30,
6 eatenfoods: ['Frybread', 'Yogurt'],
7 favfoods: ['Avocado', 'Jellyfish', 'Vanilla', 'Sacred lotus', 'Banana'],
8 birthday: r.time(1987, 5, 23, 10, 10, 0, '+08:00')
9 })
I can then find out my birthday's timezone using the timezone command:
1 r.db("foodb")
2 .table("users")
3 .get('12063f5f-4289-4a4b-b668-0e4a90861575')('birthday').timezone()
4 //=>
5 "+08:00"
Now I have moved to the USA and people ask when I was born; I'm speechless. Knowing that the US West Coast is on PST, which is -08:00 compared to UTC, I turn to RethinkDB:
1 r.db("foodb")
2 .table("users")
3 .get('12063f5f-4289-4a4b-b668-0e4a90861575')('birthday')
4 .inTimezone('-08:00')
5 //=>
6 Fri May 22 1987 18:10:00 GMT-08:00
epoch
The number of seconds since the Unix epoch is very important and is supported by almost every language. In RethinkDB we can get that number using toEpochTime:
1 r.db("foodb")
2 .table("users")
3 .get('12063f5f-4289-4a4b-b668-0e4a90861575')('birthday')
4 .toEpochTime()
5 //=>
6 548734200
1 r.epochTime(548734200)
2 //=>
3 Sat May 23 1987 02:10:00 GMT+00:00
The original time I inserted was 1987/05/23 at 10:10 AM in +08:00. We read it back, convert it to an epoch time, then convert that back into a time object, and we get:
Sat May 23 1987 02:10:00 GMT+00:00
That is exactly Sat May 23 1987 10:10:00 AM in GMT+08:00.
Wrap up
At this point you should be confident working with dates and times in RethinkDB. Here is a recap:
I call RethinkDB a database for programmers, not for database administrators, because it takes minimal effort to understand and pick up. ReQL is very clear to write: another developer can look at a query and know exactly what is going to happen. In the SQL world we have to profile and explain a query to know whether an index will be used; in RethinkDB, we tell it which index to use. ReQL is a wonderful way to think about a database.
Then there are changefeeds, which I didn't cover in this book because you can learn and use them after five minutes with the API documentation. RethinkDB also offers automatic failover in a cluster, which I also didn't cover because I don't have experience using it. To me, everything coming up for RethinkDB is a good sign that learning it is a worthwhile investment. Be prepared and go ahead, by learning and using RethinkDB today.
While writing this book, I learned more about RethinkDB myself; had I not written it, I probably wouldn't have dived as deeply. It was a chance for me to study carefully and improve. I hope this little book helps clear things up and makes you confident using RethinkDB.