
Simply RethinkDB

A database with a query language for humans

Vinh Quốc Nguyễn


This book is for sale at http://leanpub.com/simplyrethinkdb

This version was published on 2015-10-12

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.

© 2014 - 2015 Vinh Quốc Nguyễn


Tweet This Book!
Please help Vinh Quốc Nguyễn by spreading the word about this book on Twitter!
The suggested tweet for this book is:
Simply RethinkDB: Starting with RethinkDB in a week
The suggested hashtag for this book is #simplyrethinkdb.
Find out what other people are saying about the book by clicking on this link to search for this
hashtag on Twitter:
https://twitter.com/search?q=#simplyrethinkdb
This book was written as an effort to learn more about RethinkDB. At some points, I almost
abandoned it because it was a lot of work.
Along the way, I started to write Relang, a RethinkDB driver for Erlang, which gave me a deeper
understanding of RethinkDB and made me want to finish this book.
None of this would have been published without my wife's support. She is an awesome girl, and
without her I would not have been able to finish.
I also thank you so much for your purchase. If you happen to live in San Jose, please visit and see us in
person. I want to say thank you in person.
This book marks an important period in my life, when I got married. The word Rethink also reminds
me to rethink whenever we have trouble in our relationship in the future. Rethinking and
understanding.
Contents

1. Welcome
   Introduction
   Why learn RethinkDB?
   Feedback
   Credit

2. Getting to know RethinkDB
   Getting Started
     The Ports
     The dashboard
   RethinkDB object
   Durability
   Atomicity
   Command line tool
     import
     export
     dump
     restore
   Import sample data

3. Reading Data Basic
   Getting to Know ReQL
   Drivers
     Using drivers
     Default database
     Repl
   Data Type
     Basic data type
     Composite data type
     Sorting data
   Selecting data
     Select the whole table
     Select a single document by its primary key
     Select many documents by value of fields
     r.row
     Pagination data
     Access Nested field
   Wrap Up

4. Modifying data
   Database
     Create
     Drop
   Table
     Create
     List table
     Drop table
   System table
   Document
     Insert
     Update
     Replace
     Delete
     Sync
   Wrap up

5. Reading Data Advanced
   Understanding index
     Index
     Simple index
     Compound index
     Arbitrary expressions index
     Checking index status
   Using index
     Ordering
     Pagination
     Transform data
     Index and Map
     The important map/concatMap
   Wrap up

6. Data Modeling With JOIN
   Using Join
     eqJoin
     innerJoin
     outerJoin
     Name conflict
     Using sub queries
   Why map/concatMap is important
   Wrap up

7. Aggregation
   sum, average, and count
   min and max
   distinct
   group
   ungroup
   reduce
   Wrap up

8. Time
   Accessing time
     Timezone
     epoch
   WRAP UP

9. Conclusion
1. Welcome
Introduction
Welcome to my readers. I appreciate your purchase. It will help me continue improving the book's
content.
Before we go into the technical details, I have something to say.
Firstly, I'm not a RethinkDB expert at all. I'm just an average guy who loves programming and
new technologies. To me, RethinkDB is a pleasure to use. However, due to its age, there are not
many books and documents about it compared to other database systems. While the RethinkDB
documentation and API are very good, it can be hard to know where to start. So this guide is written
with all those in mind who are unsure about taking the plunge into RethinkDB as something totally new. I
hope it helps them ease into the learning process.
The purpose of this book is to organize the concepts of RethinkDB in order to help you read
and understand the RethinkDB API directly. Upon finishing the book, you will have a foundation
on which to extend your knowledge with the many other RethinkDB videos and blog posts
out on the Internet.
Secondly, I'm a fan of Mixu's writing style. I won't cover in depth things like installing RethinkDB,
fine-tuning, extra function parameters, and so on. Those topics are covered very well in RethinkDB's
documentation itself. What I want you to take away from this book is a good grasp of RethinkDB usage
in practice and how to apply commands in real scenarios.
Third, I'm not fluent in English. If you find any mistakes, you can report an issue on the repository or
email me directly.
Fourth, RethinkDB is changing so fast that things in this book may not reflect its current state. Once
again, I'd be very grateful for any errata you may point out, via my email or GitHub. Since this is a
LeanPub book, once I update it you may download it again free of charge.
And finally, due to my limited knowledge of RethinkDB, I want to keep this book short and
straight to the point. Expect a book of around 200 pages. My goal is for this to be a book that you
can pick up, read on the train while riding to work, and after a week you can sit down and actually
start your first RethinkDB project without hesitation.
http://blog.mixu.net/
http://blog.mixu.net/2012/07/26/writing-about-technical-topics-like-its-2012/
https://github.com/kureikain/simplyrethink
Why learn RethinkDB?
RethinkDB is mind-blowing to me. I like the beauty and nature of ReQL, which is built into the
language. It is also very developer-friendly, with its own administration UI. RethinkDB is very easy
to learn, because its query language is natural to how we think when constructing a query. We can
easily tell what ReQL will do and what the execution order of the query is.
Take this SQL query:

1 SELECT * FROM users WHERE name="Vinh" ORDER BY id DESC LIMIT 10,100

This query is passed as a string, and occasionally you may forget the ordering or syntax.
Do we put **ORDER** before or after **LIMIT**? Where should the WHERE clause appear? We also
can't be certain if an index will be used. Because SQL is a string, the order of execution is defined
by the syntax. Memorizing that syntax is essential.
Compare this with ReQL (RethinkDB Query Language):

1 r.table('users').getAll('vinh', {index: 'name'}).orderBy(r.desc('id')).limit(10)

We can easily ascertain (or grok) immediately what will result from this query, and the order of
execution is clear to us. This is because the methods are chained, one after another, from left to
right. ReQL was designed with the intention of a very clear API, but without the ambiguity that
comes with an ORM.
We can also see that it will use an index **name** when finding data. The way the query is
constructed feels similar to jQuery, if you are a front-end developer who never works with databases.
Or if you are a functional programming person, you probably see the similarity immediately.
If the above example hasn't convinced you, then check this out:

1 SELECT *
2 FROM foods as f
3 INNER JOIN compounds_foods as c ON c.food_id=f.id
4 WHERE f.id IN (10, 20)
5 ORDER By f.id DESC, c.id ASC

The same query represented as ReQL would look like this:


https://en.wikipedia.org/wiki/Grok

1 r.db("foodbase")
2 .table("foods")
3 .filter(function (food) {
4 return r.expr([10, 20]).contains(food("id"))
5 })
6 .eqJoin("id", r.db("foodbase").table("compound_foods"), {index: "food_id"})

Even if you are not completely familiar with the syntax, you can guess what is going to happen. In
ReQL, we are taking the foodbase database and the foods table, filtering them, and then joining the
result with another table called compound_foods. Within the filter, we pass an anonymous function
which determines if the id field of a document is contained in the array [10, 20]. If it is either 10
or 20, then we join the results with the compound_foods table based on the id field and use an index
to search efficiently. The query looks like a chain of API calls, and the order of execution is clear to
the reader.
RethinkDB really makes me rethink how we work with databases. I don't have to write a query in
a language that I don't like. As well, I'm no longer forced to use a syntax that I don't like because I
have no choice. And further, if something does go wrong, I don't have to slowly tear apart the entire
string to find out which clause has the issue. The resulting error from a ReQL query allows me to
more precisely determine the cause of the error.
Furthermore, RethinkDB is explicit. Later on, you will also learn that in RethinkDB you have to
explicitly tell it to do some not-very-safe operations. For example, when a non-atomic update is required,
you clearly set a flag to do it. RethinkDB by default has sensible and conservative settings, as a
database should, to help you avoid shooting yourself in the foot.
In my opinion, RethinkDB forces us to understand what we are doing. Everything is exposed in
the query. No magic, no "why did this query fail on production but work as expected on my local
machine", no hidden surprises.
In Vietnamese culture, we usually follow a rule of three in demonstrations before we conclude. Being
Vietnamese, let me end by showing you this third example.
Do you understand the query below?

1 r
2 .db('foodbase')
3 .table('foods')
4 .filter(r.row('created_at').year().eq(2011))

This query finds all foods which were inserted in the year 2011. I cannot even provide an equivalent
SQL example, because it just cannot be as beautiful and concise as the above query.
Feedback
I appreciate all of your feedback to improve this book. Below are my handles on the internet:

twitter: http://twitter.com/kureikain
email: kurei@axcoto.com
twitter book hashtag: #simplyrethinkdb

http://twitter.com/kureikain
kurei@axcoto.com
Credit
Sample dataset: foodb.ca/foods
Book cover: designed by my friend at aresta.co, who helped to create the cover for this book.

http://foodb.ca/foods
http://aresta.co/
2. Getting to know RethinkDB

Let's warm up with some RethinkDB concepts, ideas and tools. In this chapter, things may be a bit
confusing because sometimes to understand concept A, you need to understand B. To understand B,
you need C, which is based on A. So please use your intuition and don't hesitate to do some quick
lookups in the official docs to clear things up a bit.
From now on, we will use the term ReQL to mean anything related to the RethinkDB query language, or
query API.
Getting Started
Its not uncommon to see someone write an interactive shell in browser for evaluation purpose such
as mongly, tryRedis. This isnt applied for RethinkDB because it comes with an excellent editor
where you can type code and run it.
Install RethinkDB by downloading package for your platform http://rethinkdb.com/docs/install/.
Run it after installing.
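For example, a minimal sketch on macOS with Homebrew (the package name and start command here are assumptions; use the install page above for your own platform):

# install and start a server with the default ports described below
brew install rethinkdb
rethinkdb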

The Ports
By default, RethinkDB listens on 3 ports:

8080
this is the web user interface of RethinkDB, or the dashboard. You can query the data, check
performance and server status on that UI.

28015
this is the client driver port. All client drivers connect to RethinkDB through this port. Later,
in the Drivers section, we will use a tcpdump command to listen on this port and capture the
data sent over it.

29015
this is the intracluster port; different RethinkDB nodes in a cluster communicate with each other
via this port.

The dashboard
Open your browser at http://127.0.0.1:8080 and welcome to RethinkDB. You can play around to see
what you have.
Navigate to the Data Explorer tab, where you can type commands. Let's start with:

1 r.dbList()

Run it and you will see a list of databases.


RethinkDB object
Similar to traditional database systems, we also have databases in RethinkDB. A database contains
many tables. Each table contains your JSON documents. Those JSON documents can contain any
fields. A table doesn't force a schema on those fields.
A JSON document is similar to a row in MySQL. Each field in the document is similar to a column
in MySQL. When I say JSON document, I mean a JSON object with fields, not a single number, an
array or a string. However, each field can contain any JSON data type.
More than that, the same field can accept any data type. In the same table, two documents can
contain different data types for the same field, as the sketch below shows.
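As a minimal sketch (assuming a table named scratch already exists in the current database), the same field can hold a number in one document and a string in another:

r.table('scratch').insert([
  {name: 'first document',  score: 42},
  {name: 'second document', score: 'forty two'}
])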
Durability
You will see an option/argument called durability appear in many ReQL options.
Because it's so common and very important, I want to address it here. Durability accepts a value
of soft or hard.

soft means the writes will be acknowledged by the server immediately and the data will be flushed to disk
in the background.

hard is the opposite of soft. The default behaviour is to acknowledge a write only after the data is written
to disk. Therefore, when you can afford to lose the data, such as when writing a cache or a
non-critical log, you can set durability to soft in order to increase speed.
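For example, a hedged sketch (assuming a logs table exists) of trading durability for speed on a single insert:

r.table('logs').insert(
  {event: 'page_view', at: r.now()},
  {durability: 'soft'}   // acknowledged before the write reaches disk
)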
Atomicity
According to the RethinkDB docs [atomic], write atomicity is supported on a per-document basis. So
when you write to a single document, either it succeeds completely or nothing occurs, instead of updating
a couple of fields and leaving your data in a bad shape. Furthermore, RethinkDB guarantees that any
combination of operations that can be executed on a single document is written atomically.
However, this does come with a limit. To quote the RethinkDB docs, "Operations that cannot be proven
deterministic cannot update the document in an atomic way." That is, unpredictable values won't be
written atomically: e.g. random values, operations run using JavaScript expressions rather than ReQL, or
values which are fetched from somewhere else. RethinkDB will throw an error instead of
silently doing it or ignoring it. You can choose to set a flag to write data in a non-atomic way.
Writes to multiple documents are not atomic.
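A sketch of that flag (the users table and token field here are assumptions): because r.js() is not provably deterministic, the update below is rejected unless we explicitly allow a non-atomic write:

r.table('users').get(1).update(
  {token: r.js('Math.random().toString()')},  // non-deterministic value
  {nonAtomic: true}                           // explicitly allow a non-atomic update
)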
[atomic] http://www.rethinkdb.com/docs/architecture/#how-does-the-atomicity-model-work
Command line tool
Besides the dashboard, RethinkDB gives us some command line utilities to interact with it. Some
of them are:

import
export
dump
restore

import
In the spirit of giving users the dashboard, RethinkDB also gives us some sample data. You can download
the data files input_polls and county_stats at https://github.com/rethinkdb/rethinkdb/tree/next/demos/electio
and import them into the test database:

1 rethinkdb import -c localhost:28015 --table test.input_polls --pkey uuid -f input_polls.json --format json
2 rethinkdb import -c localhost:28015 --table test.county_stats --pkey uuid -f county_stats.json --format json

Notice the --table argument: we are passing the table name in the format database_name.table_name.
In our case, we import the data into two tables: input_polls and county_stats inside the
database test.
Basically, you can easily import any file that contains a valid JSON document.

export
export exports your database into many JSON files, one file per table. The JSON files can be imported
using the import command above.
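For example, a sketch of exporting everything, or only a single table with the -e flag (check rethinkdb export --help on your version for the exact options):

rethinkdb export -c localhost:28015
rethinkdb export -c localhost:28015 -e foodb.foods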

dump
dump will export the whole data of a cluster; it's similar to an export command followed by
gzip to compress all the JSON output files. The syntax is as easy as:

1 rethinkdb dump -c 127.0.0.1:28015

Here is an example output when I run this command:

1 rethinkdb dump -c 127.0.0.1:28015


2 NOTE: 'rethinkdb-dump' saves data and secondary indexes, but does *not* save
3 cluster metadata. You will need to recreate your cluster setup yourself after
4 you run 'rethinkdb-restore'.
5 Exporting to directory...
6 [========================================] 100%
7 764509 rows exported from 9 tables, with 21 secondary indexes
8 Done (157 seconds)
9 Zipping export directory...
10 Done (5 seconds)

The dump result is a gzipped file whose name is in the format rethinkdb_dump_{timestamp}.tar.gz. It's
very useful when you want to try something out and then get back your original data. Note it here because
you will need it later.

restore
Once we have the dump file from the dump command, we can restore it with:

1 rethinkdb restore -c 127.0.0.1:28015 rethinkdb_dump_DATE_TIME.tar.gz


Import sample data
It's much nicer to work with real and fun data than boring data. I found a very useful dataset called
FooDB. It's data about food constituents, chemistry and biology. To quote their about page:

What is FooDB? FooDB is the world's largest and most comprehensive resource on
food constituents, chemistry and biology. It provides information on both macronutrients
and micronutrients, including many of the constituents that give foods their flavor,
color, taste, texture and aroma.

I imported their data into RethinkDB and generated some sample tables, such as a users table. At the end,
I used the dump command to generate sample data which you can download using the link below:
https://www.dropbox.com/s/dy48el02j9p4b2g/simplyrethink_dump_2015-08-11T22%3A15%3A51.tar.gz?dl=0.
Once you download it, you can import this sample dataset:

1 rethinkdb restore -c 127.0.0.1:28015 simplyrethink_dump_2015-08-11T22:15:51.tar.gz

The output looks like this:

1 Unzipping archive file...


2 Done (1 seconds)
3 Importing from directory...
4 [ ] 0%
5 [ ] 0%
6 [ ] 0%
7 [ ] 0%
8 [========================================] 100%
9 764509 rows imported in 9 tables
10 Done (2166 seconds)

Once this processing is done, you should have a database called foodb which contains the data we play
with throughout the book. At any point, if you mess up the data, you can always restore from this sample
dump. Also, I encourage you to back up your data if you build interesting datasets to experiment with yourself.
http://foodb.ca/about
https://www.dropbox.com/s/dy48el02j9p4b2g/simplyrethink_dump_2015-08-11T22%3A15%3A51.tar.gz?dl=0
3. Reading Data Basic

If you are lazy (just like me) and skipped straight to this chapter, please go back to the end of the previous
chapter to import the sample dataset. Once you have done that, let's start. Oh, before we start, let me tell you
this: sometimes if you see an "..." it means we have more data returned, but I cannot paste it all into
the book. I use "..." to denote that more data is available.
Getting to Know ReQL
RethinkDB uses a special syntax called ReQL to interact with the data. ReQL is chainable: you start
with a database, chain to a table, and chain to other API calls to get what you want, in a very natural way.
Type this into the Data Explorer:

1 r.db("foodb").table('flavors').filter({'flavor_group': 'fruity'})

You should see some interesting data now.

Result of filter command by flavor_group

Don't worry about the syntax; just look at it again, and even without any prior knowledge you know what
it does and can easily remember it. A way for me to understand ReQL is that every command returns
an object which shares some API, and we can call that API as if it were a method of the object.
ReQL has a particular binding to your language. Though, of course, the bindings look familiar across
different languages to maintain a consistent look and feel, they are different. Queries are
constructed by making function calls in your language, not by concatenating SQL strings, and not by
building a special JSON object as in MongoDB. Therefore, it feels very natural to write ReQL, as if the data
we manipulate were an object or data type in our language. But everything comes with a trade-off. On the
downside, we have to accept differences in ReQL between languages. No matter how hard we
try, different languages have different syntax, especially when it comes to anonymous functions.

What is r? r is like a special namespace through which all of RethinkDB is exposed. It's just a normal
variable in your language, or a namespace, a package name, a module. Think of r like the $ of jQuery.
If you don't like r, you can assign it to another variable.
For now, we will call every method of r, or any method of a value returned from another method, a
command. Think of it like a method in the jQuery world.
Here is an example. With this HTML structure:

1 <div class="db">
2 <div class="table" data-type="anime">Haru River</div>
3 <div class="table" data-type="anime">Bakasara</div>
4 <div class="table" data-type="movie">James Bond</div>
5 </div>

To select only the anime entries, we can use this jQuery:

1 $('.db').find('.table').filter('[data-type="anime"]')

If we have a database called db, and a table called table, with 3 records:

1 {type: 'anime', title: 'Haru River'}


2 {type: 'anime', title: 'Bakasara'}
3 {type: 'movie', title: 'James Bond'}

The equivalent ReQL used to find only anime is:

1 r.db('db').table('table').filter({type: 'anime'})

Notice how similar the structures are? Because of these concepts, I find ReQL easy to
learn. If you can write jQuery, you can write ReQL.
Another way to understand it is to consider ReQL like a pipe on Linux: you select the data and pass it
into another command:

1 $ cd db; ls -la table/* | grep 'type: anime'


Drivers
This section goes a bit deeper into how the drivers work; you can skip it if you are not interested in how
a RethinkDB driver works at a low level. But I really hope you keep reading it. Let's start.
ReQL is bound to your language. Therefore, the API is implemented entirely by the driver itself. You
won't work directly with the RethinkDB server. You write the query using the driver's API; the driver
builds it into a real query to send to the server, receives data, and parses it to return the data as native data
of your language.
Internally, every client driver turns the query that you write in the driver language into an AST,
then serializes it as JSON and sends it to the server.
If you are curious, you can fire up tcpdump and watch the raw query in JSON:

1 tcpdump -nl -w - -i lo0 -c 500 port 28015|strings

Here is an example of what the above tcpdump returns when I run this command (in Ruby):

1 r.db("foodb").table("users").with_fields("address").run

Once I ran this command, I saw this via tcpdump:


[1,[96,[[15,[[14,["foodb"]],"users"]],"address"]],{}]

So basically, the whole query is turned into a special JSON array by the client driver. If you would like
to dig deeper, the above query actually translates into this:

[QUERY_START, [WITH_FIELDS, [[TABLE, [[DB, ["foodb"]],"users"]],"address"]],{}]

Each of the numbers is equivalent to a command in RethinkDB. Those numbers are predefined in
RethinkDB. So basically, whatever you write using the client driver API will be turned into a JSON
array. Each element often takes this form:

1 COMMAND, [Argument Array], {Option object}

It's similar to a function call where we have the function name, followed by its arguments, and last an
options object.
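As a concrete illustration, reusing only the term numbers shown above (so treat the exact encoding as an approximation), dropping with_fields from the earlier query leaves just the TABLE and DB terms:

r.db("foodb").table("users")
// would be serialized roughly as:
// [1, [15, [[14, ["foodb"]], "users"]], {}]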
You can quickly sense a downside: each driver has a different API to construct queries. When
you come to another language, it may feel very strange. The driver hides the real query behind its
API. It's kind of similar to how you use an ORM in the SQL world to avoid writing raw SQL strings. But

it's different because the ORM usually has its own API to turn the query into a raw query string, which
in turn is sent to the server using another driver speaking that database's protocol. Here, we have the power
of an ORM, but it happens at the driver level, because the RethinkDB protocol is a powerful JSON protocol
that models queries like function calls, with arguments followed by parameters. In fact, ReQL is
modeled after functional languages like Lisp or Haskell.
If you would like to know more about ReQL at a lower level, you should read more in the official
documents.

Using drivers
RethinkDB supports 3 official drivers:

Ruby
NodeJS
Python

These support the whole driver specification. The community drivers, such as Go or PHP, probably won't
support it all, so if you use a different language and find something isn't right, it is probably not
your fault.
All ReQL starts with r; it's the top-level module that exposes the public API. In NodeJS, we can use:

1 var r = require('rethinkdb')

or Ruby

1 require 'rethinkdb'
2 include RethinkDB::Shortcuts
3 puts r.inspect

or in Go Lang

1 import (
2 r "github.com/dancannon/gorethink"
3 )

Once we have constructed ReQL with r, we have to call the run method to execute it. The command will be
submitted over an active database connection. The database connection can be established with connect.
http://rethinkdb.com/docs/writing-drivers/
http://rethinkdb.com/docs/driver-spec/

1 var r = require('rethinkdb')
2 r.connect({
3   host: '127.0.0.1',
4   port: 28015,
5   db: 'test'
6 }, function (err, conn) {
7   if (err) throw err
8   r.db('db').table('table').filter({type: 'anime'}).run(conn, console.log)
9 })

When creating the connection with r.connect, you can pass a db parameter to specify a default
database to work on once connected successfully. It's similar to the current database in MySQL.
Without setting this parameter, RethinkDB assumes test as the default database.
To understand more about the differences between APIs in different languages, let's look at the Go driver:

var connection *r.Session
connection, err := r.Connect(r.ConnectOpts{
    Address:  "localhost:28015",
    Database: "test",
})
if err != nil {
    log.Fatalln(err.Error())
}

Notice that we don't have a host or db parameter now; they are Address and Database in the Go
driver. Therefore, by using an unofficial driver, the API can be quite different from the official API.
That's how beautiful it is, because each language has its own design philosophy. In Go, for example,
we cannot have a lower-case field of a struct and expect it to be publicly available to the outside. Using
names such as host or db for connection options is impossible in Go.
https://github.com/dancannon/gorethink

Default database

Similar to MySQL, where you can issue use database_name to switch to another database, we can
do the same in RethinkDB by calling the use command on a connection object.

1 connection.use('another_db')

In this small book, most of the time we will use the Data Explorer. Therefore we can use r without
initialization and without calling the run method; the Data Explorer will do that for us. Just keep in mind
that when you write code, you have to connect, and explicitly call run to, obviously, run the query.
Note that you don't have to switch to another database to access its tables; you can just call
r.db('another_db') when building the query.

Repl

Repl means read-eval-print loop. To avoid the burden of manually calling run and passing a connection
object, some drivers offer a repl to help call run without any parameter.
Such as in Ruby:

1 r.connect(:db => 'marvel').repl


2 r.table("test").run

JavaScript doesn't have a repl, I think because we can already use the Data Explorer.
Data Type
Why do we have to discuss data types? We use dynamic languages, and we mostly talk
about Ruby and JavaScript. But understanding data types allows us to read the API
documentation better. It helps us understand why we can call r.table().insert() but we cannot call
r.table().filter().insert(). Aren't we still selecting data from the table, so shouldn't we be able to
insert data into it?
Data types help us know that we can only call certain methods on certain data types.
Each ReQL method can be called on one or many data types. Take the update command: when you
browse the API documentation, you see

1 table.update(json | expr[, {durability: "hard", returnVals: false, nonAtomic: false}]) → object
2
3 selection.update(json | expr[, {durability: "hard", returnVals: false, nonAtomic: false}]) → object
4
5 singleSelection.update(json | expr[, {durability: "hard", returnVals: false, nonAtomic: false}]) → object

It means the command can be invoked on a table, a selection (e.g. the first 30 elements of a table), or a
single selection (a document is an example of a single selection). The behaviour may be different based
on the data type, even when the command is the same.
In RethinkDB, we have several data types. We will focus on these two kinds for now:

Basic data type


Composite data type

Basic data type


These are usually native data types in your language too:

* Number: any real number. RethinkDB uses double precision (64-bit) floating point numbers internally.
* String
* Time: this is the native RethinkDB date/time type. However, values will be converted automatically to the native date type of your language by the driver.
* Boolean: true/false.
* Null: depending on your language, it can be nil, null, ...
* Object: any valid JSON object. In JavaScript, it will be a normal object. In Ruby, it can be a hash.
* Array: any valid JSON array.

The data type of a field or column can change. If you assign a number to a field, you can still
assign a value with a different data type to that same field. So we don't have a static schema for
tables.
We have a very useful command to get the type of any value: typeOf. Example:

1 r.db('foodb').table('foods')
2 .typeOf()
3 //=>
4 "TABLE"
5
6 r.db('foodb').table('foods')
7 .filter(r.row("name").match('^A'))
8 .typeOf()
9 //=>
10 "SELECTION<STREAM>"

It may not seem very important to understand data types at first, but I really hope you invest
some time and use typeOf frequently to understand the data type of a value.
To give a story: in MariaDB 10.0/MySQL 5.6, when the data type doesn't match, an index may not be
used. Let's say you have a field name defined with type VARCHAR(255), and then you create
an index on that column. Querying that column with the exact data type will make the index kick in.
Let's go back to MySQL for a bit.
First I insert the record below:

1 INSERT INTO foods(name) VALUES("100");


2 Query OK, 1 row affected, 1 warning (0.00 sec)

The query below will use the index:



1 MariaDB [food]> EXPLAIN SELECT * FROM foods WHERE name="100";
2 +------+-------------+-------+-------+---------------------+---------------------+---------+-------+------+-------+
3 | id   | select_type | table | type  | possible_keys       | key                 | key_len | ref   | rows | Extra |
4 +------+-------------+-------+-------+---------------------+---------------------+---------+-------+------+-------+
5 |    1 | SIMPLE      | foods | const | index_foods_on_name | index_foods_on_name | 257     | const |    1 |       |
6 +------+-------------+-------+-------+---------------------+---------------------+---------+-------+------+-------+

But this query won't:

1 MariaDB [food]> EXPLAIN SELECT * FROM foods WHERE name=100;
2 +------+-------------+-------+------+---------------------+------+---------+------+------+-------------+
3 | id   | select_type | table | type | possible_keys       | key  | key_len | ref  | rows | Extra       |
4 +------+-------------+-------+------+---------------------+------+---------+------+------+-------------+
5 |    1 | SIMPLE      | foods | ALL  | index_foods_on_name | NULL | NULL    | NULL |  890 | Using where |
6 +------+-------------+-------+------+---------------------+------+---------+------+------+-------------+
7 1 row in set (0.00 sec)

When we pass the value as a string ("100"), the index is used. When we pass it as a number (100), the index isn't used.
Or if you have a datetime column and you pass the time as a string, the index won't kick in either.
The lesson here is that we absolutely should understand data types.

Composite data type


We have 3 composite data types.

Streams

:are lists or arrays, but they're loaded in a lazy fashion. Instead of returning a whole array at once,
meaning all data is read into memory, a cursor is returned. A cursor is a pointer into the result set.
We can loop over the cursor to read data when we need it. Imagine that instead of getting an array
and looping over it, you iterate over the cursor to get the next value. It allows you to iterate over a
data set without building an entire array in memory. It's equivalent to a PHP iterator, a Ruby iterator,
or a JavaScript iterator. A stream lets us access the current element and keeps track of the current
position, so that we can call next() on a cursor to move to the next element until we reach the end of
the result set, at which point it returns nil and the iteration can stop. Because of that, we can work with
large data sets, because RethinkDB doesn't need to load all of the data and return it to the client. The
nature of a stream makes it read-only; you cannot change the data while iterating over it. (A small
sketch of consuming a cursor follows at the end of this section.)

Selections

:represent subsets of tables, for example, the return values of filter or get. There are two kinds of
selections, **Selection<Object>** and **Selection<Stream>**, which behave like a single document and
a stream of documents respectively.

Tables

:are RethinkDB database tables. They behave like selections. However, they're writable: you can
insert and delete documents in them. ReQL methods that use an index, like getAll, are only available
on tables, because indexes are created at the table level.
In short, you cannot modify streams; you can update or change values of a selection, but you cannot
remove existing documents or insert new ones. Tables allow you to insert new documents or remove
existing ones.

Sequence
The RethinkDB documentation uses sequence in lots of places. It's not a particular data type; you can
think of it as a shorthand for all of: streams, tables, selections.

Remember, data types may not seem very important, but you should understand them well because they
help us understand the efficiency of a query. If a query returns an array, it consumes a lot of memory
to hold that array.
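Here is a small sketch of what consuming a cursor looks like with the official JavaScript driver (assuming an open connection conn): each() walks the stream lazily instead of materializing the whole table in memory.

r.db('foodb').table('foods').run(conn, function (err, cursor) {
  if (err) throw err
  // called once per document as it is fetched from the stream
  cursor.each(function (err, food) {
    if (err) throw err
    console.log(food.name)
  })
})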

Sorting data
When talking about data types, let's think about how we sort them. The order itself matters less than
knowing how sorting between different data types is defined.
Understanding sorting is important in RethinkDB because it is schemaless. The primary key may
not be a numeric field; it can be a string. More than that, a field can hold any data type, so
how are we going to compare an object to a string when sorting?
Here is the sort order:

Arrays (and strings) sort lexicographically. Objects are coerced to arrays before sorting. Strings are
sorted by UTF-8 codepoint and do not support Unicode collations.
Mixed sequences of data sort in the following order:

arrays
booleans
null
numbers
objects
binary objects
geometry objects
times
strings

That means arrays < booleans < null < numbers < objects < binary objects < geometry objects < times
< strings.
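A quick sketch you can paste into the Data Explorer to see this cross-type ordering (the values are arbitrary): booleans sort before numbers, and numbers before strings.

r.expr(["banana", 3, true, 1, "apple"])
  .orderBy(function (value) { return value })
//=> [true, 1, 3, "apple", "banana"]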
Selecting data
In this section, we will learn how to get data out of RethinkDB. Most of the time, we will choose a
db to work with and chain into the table command.

Select the whole table


Let's find all foods. This is the same as SELECT * FROM foods in SQL.

1 r.db('foodb').table('foods')
2 //=>
3
4 [{
5 "created_at": Wed Feb 09 2011 00: 37: 17 GMT - 08: 00,
6 "creator_id": null,
7 "description": null,
8 "food_group": "Herbs and Spices",
9 "food_subgroup": "Spices",
10 "food_type": "Type 1",
11 "id": 43,
12 "itis_id": "29610",
13 "legacy_id": 46,
14 "name": "Caraway",
15 "name_scientific": "Carum carvi",
16 "picture_content_type": "image/jpeg",
17 "picture_file_name": "43.jpg",
18 "picture_file_size": 59897,
19 "picture_updated_at": Fri Apr 20 2012 09: 38: 36 GMT - 07: 00,
20 "updated_at": Fri Apr 20 2012 16: 38: 37 GMT - 07: 00,
21 "updater_id": null,
22 "wikipedia_id": null
23 }, {
24 "created_at": Wed Feb 09 2011 00: 37: 18 GMT - 08: 00,
25 "creator_id": null,
26 "description": null,
27 "food_group": "Herbs and Spices",
28 "food_subgroup": "Spices",
29 "food_type": "Type 1",

30 "id": 67,
31 "itis_id": "501839",
32 "legacy_id": 73,
33 "name": "Cumin",
34 "name_scientific": "Cuminum cyminum",
35 "picture_content_type": "image/jpeg",
36 "picture_file_name": "67.jpg",
37 "picture_file_size": 73485,
38 "picture_updated_at": Fri Apr 20 2012 09: 32: 32 GMT - 07: 00,
39 "updated_at": Fri Apr 20 2012 16: 32: 33 GMT - 07: 00,
40 "updater_id": null,
41 "wikipedia_id": null
42 },
43 ...
44 ]

You should get back an array of JSON objects. By default, the Data Explorer will automatically
paginate it and display part of the data.
Typing r.db(db_name) all the time is insane. We can drop it and use r.table() without calling r.db()
if the table is in the currently selected database. Without any indication, the default database is test. In the
Data Explorer, without an r.db command, RethinkDB will use test as the default database. Unfortunately,
we cannot set a default database in the Data Explorer.

Counting
We can also count the table, or any sequence, by calling the count command.

1 r.db('foodb').table('foods').count()
2 //=>
3 863

Select a single document by its primary key


To select a single element, we call get on a table, passing its primary key value.

https://github.com/rethinkdb/rethinkdb/issues/829

1 r.db('foodb').table('foods')
2 .get(108)
3 //=>
4 {
5 "created_at": Wed Feb 09 2011 00: 37: 20 GMT - 08: 00,
6 "creator_id": null,
7 "description": null,
8 "food_group": "Herbs and Spices",
9 "food_subgroup": "Herbs",
10 "food_type": "Type 1",
11 "id": 108,
12 "itis_id": "32565",
13 "legacy_id": 115,
14 "name": "Lemon balm",
15 "name_scientific": "Melissa officinalis",
16 "picture_content_type": "image/jpeg",
17 "picture_file_name": "108.jpg",
18 "picture_file_size": 30057,
19 "picture_updated_at": Fri Apr 20 2012 09: 33: 54 GMT - 07: 00,
20 "updated_at": Fri Apr 20 2012 16: 33: 54 GMT - 07: 00,
21 "updater_id": null,
22 "wikipedia_id": null
23 }

Every document in RethinkDB includes a primary key field; its value is unique across the cluster and
is used to identify the document. The name of the primary key field is id by default. However, when you
create a table, you have the option to change the name of the primary key field. We will learn more about it
later; just keep a note of it here.
In RethinkDB, using an incremental primary key isn't recommended because that's hard in a cluster
environment. To ensure the uniqueness of a new value, we would have to check every node in the cluster
somehow. The RethinkDB team decided to use a universally unique id instead of an incremental value.
The get command returns the whole document. What if we want just a single field, such as when we only
care about name? RethinkDB has a command called bracket for that purpose. In Ruby it's [], and in
JavaScript it's ().
We can do this in JavaScript:

http://stackoverflow.com/questions/21020823/unique-integer-counter-in-rethinkdb
http://en.wikipedia.org/wiki/Universally_unique_identifier

1 r.db('foodb').table('foods')
2 .get(108)("name")
3 //=>
4 "Lemon balm"

Or in Ruby

1 r.connect.repl
2 r.db('foodb').table('foods').get(108)[:name].run

What is special about bracket is that it returns the single value of the field. The type of the returned value
is the value's own type, not a subset of the document. We can verify that with the typeOf command:

1 r.db('foodb').table('foods')
2 .get(108)
3 ("name")
4 .typeOf()
5 //=>
6 "STRING"

You can even get a nested field with bracket:

1 r.db('foodb').table('test')
2 .get(108)("address")("country")

with the assumption that the document has an address field which is an object containing a field named country.
If you don't like using bracket, you can use getField (JavaScript) or get_field (Ruby), which
have the same effect:

1 r.db('foodb').table('foods')
2 .get(108)
3 .getField('name')
4 //=>
5 "Lemon balm"

How about getting a subset of a document? We can use pluck like this:

1 r.db('foodb').table('foods')
2 .get(108)
3 .pluck("name", "id")
4 //=>
5 {
6 "id": 108 ,
7 "name": "Lemon balm"
8 }

pluck probably exists in the standard library of your favourite language. This example shows
you how friendly ReQL is.

Select many documents by value of fields


To select many documents based on the value of a field, we use the filter method, passing an object
with the expected values.
Let's find all foods that were inserted into the database in 2011, the year I came to the US.

1 r.db('foodb').table('foods')
2 .filter(r.row('created_at').year().eq(2011))
3 //=>Executed in 59ms. 40 rows returned, 40 displayed, more available
4 [{
5 "created_at": Wed Feb 09 2011 00: 37: 17 GMT - 08: 00,
6 "creator_id": null,
7 "description": null,
8 "food_group": "Herbs and Spices",
9 "food_subgroup": "Spices",
10 "food_type": "Type 1",
11 "id": 43,
12 "itis_id": "29610",
13 "legacy_id": 46,
14 "name": "Caraway",
15 "name_scientific": "Carum carvi",
16 "picture_content_type": "image/jpeg",
17 "picture_file_name": "43.jpg",
18 "picture_file_size": 59897,
19 "picture_updated_at": Fri Apr 20 2012 09: 38: 36 GMT - 07: 00,
20 "updated_at": Fri Apr 20 2012 16: 38: 37 GMT - 07: 00,
21 "updater_id": null,
22 "wikipedia_id": null
23 }

24 ...
25 ]

r.row is new to you, but no worries, it just means the current document. We used r.row('created_at')
to get the value of the created_at field, similar to how we used bracket on the get command to get a single
value. Because created_at is a datetime value, I get its year with, well, the year command, then use
eq to do an equality comparison with 2011. It sounds like a lot, but the above query is really simple and
explains itself. Sometimes I feel it is redundant to explain a query, but I have to write this book anyway.
We can also pass a filter object to do a matching filter:

1 r.db('foodb').table('foods')
2 .filter({
3 food_type: 'Type 1',
4 food_group: 'Fruits'
5 })
6 //=>
7 [
8 {
9 "created_at": Wed Feb 09 2011 00:37:15 GMT-08:00 ,
10 "creator_id": null ,
11 "description": null ,
12 "food_group": "Fruits" ,
13 "food_subgroup": "Tropical fruits" ,
14 "food_type": "Type 1" ,
15 "id": 14 ,
16 "itis_id": "18099" ,
17 "legacy_id": 14 ,
18 "name": "Custard apple" ,
19 "name_scientific": "Annona reticulata" ,
20 "picture_content_type": "image/jpeg" ,
21 "picture_file_name": "14.jpg" ,
22 "picture_file_size": 29242 ,
23 "picture_updated_at": Fri Apr 20 2012 09:30:49 GMT-07:00 ,
24 "updated_at": Fri Apr 20 2012 16:30:49 GMT-07:00 ,
25 "updater_id": null ,
26 "wikipedia_id": null
27 },...
28 ]

Passing an object will match exactly those documents with the given fields and values. In other words,
passing an object is equivalent to passing multiple eq commands combined with the and command. The
above query can be rewritten using an expression:

1 r.db('foodb').table('foods')
2 .filter(
3 r.and(
4 r.row('food_type').eq('Type 1'),
5 r.row('food_group').eq('Fruits')
6 )
7 )

The object notation is much cleaner in this case.

From a selection of documents, we can use pluck to get a subset of the documents' fields instead of
returning the whole documents, similar to how we used bracket to get a particular field:

1 r.db('foodb').table('foods')
2 .filter({
3 food_type: 'Type 1',
4 food_group: 'Fruits'
5 })
6 .pluck('id', 'name', 'food_subgroup')
7 //=>Executed in 70ms. 40 rows returned, 40 displayed, more available
8 [
9 {
10 "food_subgroup": "Berries" ,
11 "id": 75 ,
12 "name": "Black crowberry"
13 }, {
14 "food_subgroup": "Tropical fruits" ,
15 "id": 150 ,
16 "name": "Guava"
17 }, {
18 "food_subgroup": "Tropical fruits" ,
19 "id": 151 ,
20 "name": "Pomegranate"
21 }, ...
22 ]

By passing a list of fields to pluck, we get only those fields.

The opposite of pluck is without: we pass a list of fields, and it removes those fields from the document.

1 r.db('foodb').table('foods')
2 .filter({
3 food_type: 'Type 1',
4 food_group: 'Fruits'
5 })
6 .without("created_at", "picture_content_type", 'picture_file_name', 'picture_file_size', 'picture_updated_at')
8 //=> Executed in 52ms. 40 rows returned, 40 displayed, more available
9 [
10 {
11 "creator_id": null ,
12 "description": null ,
13 "food_group": "Fruits" ,
14 "food_subgroup": "Berries" ,
15 "food_type": "Type 1" ,
16 "id": 75 ,
17 "itis_id": "23743" ,
18 "legacy_id": 81 ,
19 "name": "Black crowberry" ,
20 "name_scientific": "Empetrum nigrum" ,
21 "updated_at": Fri Apr 20 2012 16:29:43 GMT-07:00 ,
22 "updater_id": null ,
23 "wikipedia_id": null
24 },...
25 ]

With simple filtering, we can easily pass a filter object as above. But what about complex searches,
such as finding all foods whose name starts with the character N? As you saw at the beginning,
we used the r.row command to do a slightly more complex query.

1 r.db('foodb').table('foods')
2 .filter(r.row('created_at').year().eq(2011))

Let's dive more into it.

Counting filter result

By calling count at the end of filter, we can count the result set of the sequence:

r.db('foodb').table('foods')
  .filter({food_type: 'Type 1', food_group: 'Fruits'})
  .count()
//=> 122

r.row
r.row is our Swiss Army knife. It refers to the currently visited document. Literally, it's the document
that RethinkDB is accessing. You can think of it like this in a JavaScript callback/iterator, or
think of it like the current element in an iterator loop. It's very handy because we can call other ReQL
commands on it to achieve our filtering.
It somehow feels like a jQuery filtering command. For instance, we write this in JavaScript to filter
all DOM elements whose data-type value is anime:

1 $('.db').find('.table').filter(function() {
2 return $(this).data('type')=='anime'
3 })

In ReQL, using filter with a filter object:

1 r.db('foodb').table('foods').filter({food_group: 'Fruits'})

We can rewrite it with r.row:

1 r.db('foodb').table('foods').filter(r.row('food_group').eq('Fruits'))

Breaking it down (using the earlier anime example) we have:

r.row: the current document

('type'): get the value of the field type
.eq('anime'): return true if the value is equal to the argument, 'anime' in this case

r.row is a RethinkDB object on which we can keep calling many methods to filter or manipulate it.
The expression that we pass into filter is a normal ReQL expression, but it evaluates to a boolean
result. RethinkDB runs it, and if the returned value is true, the document is included in the result set.
Ideally, any function that returns a boolean result can be used with filter. Note that the evaluation of
the filter expression runs on the RethinkDB server; therefore it has to be a valid ReQL expression, it
cannot be an arbitrary expression of your language. You cannot write:

1 r.db('db').table('table').filter(r.row('type') == 'anime')

In a filter, we usually execute a comparison or check some condition to be matched.
RethinkDB gives us several methods of that kind; you should refer to its API for the full list of commands.
Usually, we can use r.row in combination with the pluck, without, or bracket commands to narrow down
the data before comparing. Below are some functions for that purpose:

eq(value) check equal to value. similar to ==.


ne(value) check not equal to value. similar to !=.
ge(value) check greater than or equal value. similar to >=.
gt(value) check greater than value. similar to >.
le(value) check less than or equal value. similar to <=.
lt(value) check less than value. similar to <.
add(value) Sum two numbers, concatenate two strings, or concatenate 2 arrays.
sub() Subtract two numbers.

Each of the above commands can be called on different data types. E.g., when you call add on an array, it
will append the elements to the array; when you call it on a string, it concatenates the parameter to the
original string; and when you call it on numbers, it just does the arithmetic operation.
Run these commands in the Data Explorer:

1 r.expr(["foo", "bar"]).add(['forbar'])
2 //=>
3 [
4 "foo" ,
5 "bar" ,
6 "forbar"
7 ]
8
9 r.expr(2).add(10)
10 //=>
11 12
12
13 r.expr('foo').add("bar")
14 //=>
15 "foobar"

Note that the reason we use r.expr is that we have to turn the native object (an array, number, or
string in our language) into a RethinkDB data type, so that we can call commands on it.
However, in Ruby it can be even shorter, with r(['foo', 'bar']) + ['foobar'].
Basically, you have to remember that everything is evaluated on the server, and RethinkDB
commands are only callable on RethinkDB data types.

You can find more about these commands in the RethinkDB docs, in the Math and logic group.
Let's apply what we have learned by finding all foods whose name starts with the character R and which
are tropical fruits.
http://rethinkdb.com/api/javascript/#mod

1 r.db("foodb").table("foods")
2 .filter(
3 r.row("name").match("^R")
4 .and(
5 r.row("food_subgroup").eq('Tropical fruits')
6 )
7 )
8 //=>
9 {
10 "created_at": Wed Feb 09 2011 00:37:27 GMT-08:00 ,
11 "creator_id": null ,
12 "description": null ,
13 "food_group": "Fruits" ,
14 "food_subgroup": "Tropical fruits" ,
15 "food_type": "Type 1" ,
16 "id": 234 ,
17 "itis_id": "506073" ,
18 "legacy_id": 249 ,
19 "name": "Rambutan" ,
20 "name_scientific": "Nephelium lappaceum" ,
21 "picture_content_type": "image/jpeg" ,
22 "picture_file_name": "234.jpg" ,
23 "picture_file_size": 71055 ,
24 "picture_updated_at": Fri Apr 20 2012 09:43:04 GMT-07:00 ,
25 "updated_at": Fri Apr 20 2012 16:43:05 GMT-07:00 ,
26 "updater_id": null ,
27 "wikipedia_id": null
28 }

Here we are using match with a regular expression: ^R means any name that starts with R. We use
and to combine it with another boolean, which is the result of getting the field food_subgroup
and comparing it with 'Tropical fruits'.
filter seems handy but it's actually limited: filter doesn't leverage indexes. It scans and holds all the data
in memory. Of course, this doesn't scale infinitely; only up to 100,000 records can be filtered this way. For
anything larger than that, we have to use getAll or between, which we will learn about in chapter 5.
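As a small preview of chapter 5 (this assumes a secondary index named food_group has already been created on the foods table), the index-backed way to select all fruits looks like this:

r.db('foodb').table('foods')
  .getAll('Fruits', {index: 'food_group'})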
Now, let's try to find all foods which have more than 10 food documents in their group. We probably
think of a simple solution like this: for each document, we get its food_group and count how
many items have that same food group; if the result is greater than 10, we return true, so that the document
will be included in the filter result. We may have duplicate results, but let's try this naive solution:

1 r.db('foodb').table('foods')
2 .filter(
3 r.db('foodb').table('foods')
4 .filter(
5 {food_group: r.row("food_group")}
6 )
7 .count()
8 .gt(10)
9 )

The query looks good, but when we run it, we get this:

1 RqlCompileError: Cannot use r.row in nested queries. Use functions instead in:
2 r.db("foodb").table("foods").filter(r.db("foodb").table("foods").filter({food_group: r.row("food_group")}).count().gt(10))

Basically, we have a nested query here, and RethinkDB doesn't know which query r.row should
belong to: the parent query, or the sub-query? In such cases, we have to use filter with a function.
Let's move to the next section.

Filter with function


Besides passing a ReQL expression, we can also use a function which returns true or false to filter. Let's retry the previous example.

1 r.db('foodb').table('foods')
2 .filter(function (food) {
3 return r.db('foodb').table('foods').filter({food_group: food("food_group")}).\
4 count().gt(10)
5 })

Now we are no longer using r.row; we pass an anonymous function with a single parameter (which we can name whatever we like). When iterating over the table, RethinkDB calls this function, passing the current document as its first argument. By using a function, we can still access the current document without using r.row, and we clearly bind the current document to a variable, so that we can access its value and avoid conflicts. Here, we name our argument food, so instead of writing:

1 filter({food_group: r.row("food_group")})

We will write:

1 filter({food_group: food("food_group")})

And we use the boolean value count().gt(10) here as the result of the function. Filtering with a function helps us write queries with complex logic.
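As a further illustration, here is a minimal sketch (reusing the foods table from above) that combines two conditions inside one function, keeping only tropical fruits that have a scientific name recorded:

r.db('foodb').table('foods')
  .filter(function (food) {
    // both conditions are evaluated on the server for each document
    return food('food_subgroup').eq('Tropical fruits')
      .and(food('name_scientific').default(null).ne(null))
  })
  .pluck('id', 'name', 'name_scientific')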

Pagination data
We rarely want a whole sequence of documents; usually we care about a subset of the data, such as a page of results. In this section, we go over the commands orderBy, limit and skip.

Order data

So far, we have only selected data and accepted the default ordering. Let's control how documents appear:

1 r.db('foodb').table('foods')
2   .filter(function (food) {
3     return r.db('foodb').table('foods').filter({food_group: food("food_group")}).count().gt(10)
4   })
5   .orderBy("name")
//=> Executed in 5.69s. 821 rows returned
[
  {
    "created_at": { "$reql_type$": "TIME", "epoch_time": 1297240650, "timezone": "-08:00" },
    "creator_id": null,
    "description": null,
    "food_group": "Aquatic foods",
    "food_subgroup": "Mollusks",
    "food_type": "Type 1",
    "id": 280,
    "itis_id": "69493",
    "legacy_id": 307,
    "name": "Abalone",
    "name_scientific": "Haliotis",
    "picture_content_type": "image/jpeg",
    "picture_file_name": "280.jpg",
    "picture_file_size": 99231,
    "picture_updated_at": { "$reql_type$": "TIME", "epoch_time": 1334940073, "timezone": "-07:00" },
    "updated_at": { "$reql_type$": "TIME", "epoch_time": 1334965273, "timezone": "-07:00" },
    "updater_id": null,
    "wikipedia_id": null
  },
  ...
]
We re-used the filter query from above, but appended orderBy("name"). If you notice, the above command runs quite long (Executed in 5.69s. 821 rows returned) and all rows are returned at once instead of as a stream as usual. When we call orderBy without specifying an index, it loads all the data into memory to sort, which is both slow and inefficient. We will learn more about ordering with an index in chapter 5. For now, let's continue with this method because, well, it is easy to use, at first :D
We can reverse the order by applying the r.desc command:

1 r.db('foodb').table('foods')
2 .filter(function (food) {
3 return r.db('foodb').table('foods').filter({food_group: food("food_group")})\
4 .count().gt(10)
5 })
6 .orderBy(r.desc("name"))

We can call orderBy on a table too, not just on a filtered selection:

1 r.db('foodb').table('foods')
2 .orderBy(r.desc("name"))

We can order by multiple fields at a time:

1 r.db('foodb').table('foods')
2 .orderBy(r.desc("name"), r.asc("created_at"))

We order in descending order on the field name and ascending on the field created_at.

One more thing to note is that RethinkDB does not order documents based on the time they were inserted by default. The order seems unpredictable without explicitly setting an order. In MySQL, for example, even without any index, the default order will be exactly the same as the order in which you inserted the documents. In RethinkDB it isn't. I guess this is because it's distributed.
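If you do care about insertion order, one approach (a sketch, not the only way) is to store a timestamp on each document yourself and order by that field explicitly; the foods table we imported already carries a created_at field we can use:

r.db('foodb').table('foods')
  .orderBy(r.desc('created_at'))
  .pluck('id', 'name', 'created_at')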

We can combine some document commands with orderBy too, such as pluck to keep only a useful set of fields:

1 r.db('foodb').table('foods')
2 .pluck("id", "name", "food_group")
3 .orderBy(r.desc("name"), r.asc("created_at"))
4 //=>Executed in 122ms. 863 rows returned
5 [
6 {
7 "food_group": "Milk and milk products",
8 "id": 634,
9 "name": "Yogurt"
10 },
11 {
12 "food_group": "Milk and milk products",
13 "id": 656,

14 "name": "Ymer"
15 },
16 {
17 "food_group": "Aquatic foods",
18 "id": 523,
19 "name": "Yellowtail amberjack"
20 },
21 ...
22 ]

Limiting data

Once we have an ordered sequence, we usually want to select a limited number of documents instead of the whole sequence. We use the command limit(n) for this purpose. It gets n elements from the sequence or array.

1 r.db('foodb').table('foods')
2 .pluck("id", "name", "food_group")
3 .orderBy(r.desc("name"), r.asc("created_at"))
4 .limit(4)
5 //=>Executed in 107ms. 4 rows returned
6 [{
7 "food_group": "Milk and milk products",
8 "id": 634,
9 "name": "Yogurt"
10 }, {
11 "food_group": "Milk and milk products",
12 "id": 656,
13 "name": "Ymer"
14 }, {
15 "food_group": "Aquatic foods",
16 "id": 523,
17 "name": "Yellowtail amberjack"
18 }, {
19 "food_group": "Aquatic foods",
20 "id": 522,
21 "name": "Yellowfin tuna"
22 }]

limit gets us the number of documents that we want, but it always starts from the beginning of the sequence. To start selecting data from a given position, we use skip.

Skip

As its name suggests, skip(n) ignores n elements from the head of the sequence.

1 r.db('foodb').table('foods')
2 .pluck("id", "name", "food_group")
3 .orderBy(r.desc("name"), r.asc("created_at"))
4 .skip(2)
5 .limit(2)
6 //=> Executed in 97ms. 2 rows returned
7 [{
8 "food_group": "Aquatic foods",
9 "id": 523,
10 "name": "Yellowtail amberjack"
11 }, {
12 "food_group": "Aquatic foods",
13 "id": 522,
14 "name": "Yellowfin tuna"
15 }]
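Putting orderBy, skip and limit together gives classic page-based navigation. Here is a minimal sketch (the page and perPage variables are hypothetical client-side values, not part of ReQL):

var page = 3;      // 1-based page number chosen by the client
var perPage = 25;  // documents per page

r.db('foodb').table('foods')
  .orderBy('name')
  .skip((page - 1) * perPage)
  .limit(perPage)
  .pluck('id', 'name')

Remember that, as noted above, without an index this ordering happens in memory; chapter 5 shows how to back the same pattern with orderBy({index: ...}).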

Access Nested field


As you know, a RethinkDB document is a JSON object. Very likely we have two or more levels of data structure. So how can we access those nested fields, or drill down into the fields?
Let's begin this section by creating some sample data. Just copy and paste, and ignore the syntax for now, because we save that for chapter 4.
First, create a table on the test db.

1 r.tableCreate("books")

Then, insert sample data:



1 r.table("books")
2 .insert([
3 {
4 id: 1,
5 name: "Simply RethinkDB",
6 address: {
7 country: {
8 code: "USA",
9 name: "The United State of America"
10 }
11 },
12 contact: {
13 phone: {
14 work: "408-555-1212",
15 home: "408-555-1213",
16 cell: "408-555-1214"
17 },
18 email: {
19 work: "bob@smith.com",
20 home: "bobsmith@gmail.com",
21 other: "bobbys@moosecall.net"
22 },
23 im: {
24 skype: "Bob Smith",
25 aim: "bobmoose",
26 icq: "nobodyremembersicqnumbers"
27 }
28 }
29 },
30 {
31 id: 2,
32 name: "TKKG",
33 address: {
34 country: {
35 code: "GER",
36 name: "Germany"
37 }
38 },
39 contact: {
40 phone: {
41 work: "408-111-1212",
42 home: "408-111-1213",

43 cell: "408-111-1214"
44 },
45 email: {
46 work: "bob@gmail.com",
47 home: "bobsmith@axcoto.com",
48 other: "bobbys@axcoto.com"
49 },
50 im: {
51 skype: "Jon",
52 aim: "Jon",
53 icq: "nooneremembersicqnumbers"
54 }
55 }
56 }
57 ])

Depending on your language, you will usually have some way to access a nested field by following the nested path. In the above example, let's say we want to access the skype IM field; the path is:
contact -> im -> skype
Using the JavaScript driver, we use brackets to access a field and its sub fields.

1 r.table('books').get(1)('contact')('im')('skype')
2 //=>
3 "Bob Smith"

Whereas in the Ruby driver, the bracket notation is [field]:

1 r.table('books').get(1)['contact']['im']['skype']

We can keep calling brackets to reach the final nested field by following the path. And not just on a single document; we can use brackets at the table level too:

1 r.table('books')('address')('country')
2 [
3 {
4 "code": "GER" ,
5 "name": "Germany"
6 }, {
7 "code": "USA" ,
8 "name": "The United State of America"
9 }
10 ]

Or use it in combination with filter, on a selection:

1 r.table('books')
2 .filter({id: 1})('address')('country')('name')
3 //=>
4 "The United State of America"

Besides the bracket command, we can also use getField if that feels more natural:

1 r.table('books')
2 .getField('contact')('email')
3 //=>
4 [
5 {
6 "home": "bobsmith@axcoto.com",
7 "other": "bobbys@axcoto.com",
8 "work": "bob@gmail.com"
9 }, {
10 "home": "bobsmith@gmail.com",
11 "other": "bobbys@moosecall.net",
12 "work": "bob@smith.com"
13 }]

At the end of the day, all you have to remember is to drill down the path with a chain of bracket commands.
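The same bracket chaining works inside other commands. As a small sketch, here is how we could filter the books table we just created by a nested field, drilling down the address -> country -> code path inside the predicate:

r.table('books')
  .filter(r.row('address')('country')('code').eq('USA'))
  .pluck('id', 'name')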
Wrap Up
We now have some basic understanding:

1 1. ReQL always starts with `r`.
2 2. ReQL is tied to your language by the language driver.
3 3. The default database.
4 4. Finding a document by its primary key.
5 5. Accessing table data and filtering data by some condition.
6 6. Accessing nested fields.

We will learn more about advanced queries in other chapters. For now, let's move on and try to write some data into RethinkDB.
4. Modifying data

We know how to fetch data. But a database is useless without the ability to write data. In this chapter, we will learn about writing data. We will address database commands, table commands, and then document commands.
Database
All commands at the database level start at the top namespace r, since databases are like the genesis item in any database system. Let's start our journey by creating a database. Remember, we need a database to hold everything.

Create
Very simple. With an example, you will get it easily.

1 //Create database
2 r.dbCreate("db1")
3 #=>
4 {
5 "config_changes": [
6 {
7 "new_val": {
8 "id": "5e4a85fa-d867-4a93-aa01-2d08ed6f0b14" ,
9 "name": "db1"
10 } ,
11 "old_val": null
12 }
13 ] ,
14 "dbs_created": 1
15 }

If creation succeeds, we get back an object where dbs_created is 1. config_changes will have a new_val field holding the database's config value. old_val is always null because this is a new database.
The config value is the configuration for an individual database or table. What is the configuration? Usually, when we create any object in RethinkDB (a database, a table) we can pass a list of options to that creating command. Those options have to be stored somewhere and we should have the ability to read them back. For a database, the configuration is just its name and its id. That's why you see the id and name returned in the above query. We will learn more about this configuration very soon in this chapter.
We can confirm by listing what we have:

1 r.dbList()
2 #=>
3 [
4 "foodb" ,
5 "rethinkdb" ,
6 "superheroes" ,
7 "test"
8 ]

Notice a special database called rethinkdb? This is a special database created by RethinkDB to hold metadata and configuration. It's very similar to the mysql database in a MySQL server. Remember the configuration from the dbCreate function? That configuration is stored in the table db_config inside this rethinkdb database.
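We can peek at it directly; db_config is a real system table, so this small sketch simply reads it back:

// read back the stored configuration of every database
r.db("rethinkdb").table("db_config")

// or only the one we just created (assuming db1 still exists)
r.db("rethinkdb").table("db_config").filter({name: "db1"})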

Drop
So we have the default test database and db1 which we just created. Since we don't use db1, let's delete it to keep our server clean.

1 r.dbDrop('db1')
2 #=>
3 {
4 "config_changes": [
5 {
6 "new_val": null ,
7 "old_val": {
8 "id": "5e4a85fa-d867-4a93-aa01-2d08ed6f0b14" ,
9 "name": "db1"
10 }
11 }
12 ] ,
13 "dbs_dropped": 1 ,
14 "tables_dropped": 0
15 }

Very similar to dbCreate but in the opposite way. Now new_val is null because the database no longer exists. old_val is the id and name of the old database, i.e. the old configuration of the database.
Table
Tables have to sit inside a database; therefore, all table commands have to be called on a database. When you don't explicitly specify a database to run on with r.db, the current database will be the base for table manipulation.

Create
The syntax to create a table is

1 db.tableCreate(tableName[, options])

The second parameter is optional. This is what we consider the configuration for a table. It's similar to the database configuration, but table configuration is much richer. Some important options are:

primaryKey
the name of the primary key. The default name of the primary key is id. The value of the id field is always indexed automatically and used as the primary key. Using this option, you can change that default behaviour, for example using uuid as the primary key. When a new document is inserted, RethinkDB will fetch the value of the uuid field to build the index instead of the field id.

durability
accepts a value of soft or hard. soft means the writes will be acknowledged by the server immediately, and data will be flushed to disk in the background. If that flushing fails, we may not know. The default behaviour is to acknowledge only after the data is written to disk; that means hard. It's the default because it's much safer. When we don't need the data to be that durable, such as when writing a cache or an unimportant log, we can set durability to soft to speed up the writing. However, for any important, serious data, keep the default.

RethinkDB stores the configuration of each table in a special table called table_config inside the database rethinkdb.
Let's try to create a table.

1 r.db("foodb").tableCreate("t1", {primaryKey: 'uuid'})

List table
To list the tables we have inside a database, we use the tableList command. It's similar to SHOW TABLES in MySQL.

1 r.db("foodb").tableList()
2 //=>
3 [
4 "compound_synonyms" ,
5 "compounds" ,
6 "compounds_flavors" ,
7 "compounds_foods" ,
8 "compounds_health_effects" ,
9 "flavors" ,
10 "foods" ,
11 "health_effects" ,
12 "t1" ,
13 "users"
14 ]

Drop table
To get rid of the table, use tableDrop command.

1 r.db("foodb").tableDrop("t1")
2 //=>
3 {
4 "config_changes": [{
5 "new_val": null,
6 "old_val": {
7 "db": "foodb",
8 "durability": "hard",
9 "id": "d20fe79e-9e90-4625-95f7-c9e1953bf773",
10 "name": "t1",
11 "primary_key": "id",
12 "shards": [{
13 "primary_replica": "SimplyRethinkDB",
14 "replicas": [
15 "SimplyRethinkDB"
16 ]
17 }],
18 "write_acks": "majority"
19 }
20 }],
21 "tables_dropped": 1
22 }

Very similar to dbDrop, we also have config_changes. new_val is always null because the table is gone now. old_val is the configuration of the removed table. Table configuration is usually what we passed in when we created it with tableCreate.
We see that some db and table commands return config_changes. Let's discover where those configs are stored.
System table
Usually a database server has to keep some metadata or configuration information somewhere. In the case of RethinkDB, it stores those data in the rethinkdb database. Let's explore this database:

1 r.db("rethinkdb").tableList()
2 [
3 "cluster_config" ,
4 "current_issues" ,
5 "db_config" ,
6 "jobs" ,
7 "logs" ,
8 "server_config" ,
9 "server_status" ,
10 "stats" ,
11 "table_config" ,
12 "table_status"
13 ]

The name of each table should suggest what it contains. Let's inspect server_config:

1 r.db("rethinkdb").table("server_config")
2 //=>
3 {
4 "cache_size_mb": "auto" ,
5 "id": "fdc5dade-2f0c-498f-8c4b-59ad0d976471" ,
6 "name": "Vinh_local_u27" ,
7 "tags": [
8 "default"
9 ]
10 }

Let's change our server name:


r.db("rethinkdb").table("server_config")
  .get("fdc5dade-2f0c-498f-8c4b-59ad0d976471")
  .update({name: "SimplyRethinkDB"})
You will notice that the Admin UI will change the server name:

Server name changes to SimplyRethinkDB

By modifying those tables, we change the configuration of our server. We can get the RethinkDB version by querying server_status.

1 r.db("rethinkdb").table("server_status")("process")("version")

In other words, those system tables reflect information related to how the system operates. We can query them to fetch or modify system information.
We can get the configuration that we set with tableCreate for any table:
1 r.db("rethinkdb").table("table_config")
2 //=>
3 {
4 "db": "foodb" ,
5 "durability": "hard" ,
6 "id": "2e41fc0b-ea5e-4460-bd3b-5d33a5ec49af" ,
7 "name": "health_effects" ,
8 "primary_key": "id" ,
9 "shards": [
10 {
11 "primary_replica": "SimplyRethinkDB" ,
12 "replicas": [
13 "SimplyRethinkDB"
14 ]
15 }
16 ] ,
17 "write_acks": "majority"
18 } {
19 "db": "foodb" ,
20 "durability": "hard" ,
21 "id": "3fbf59ad-35df-445c-9fa9-be19071d38d7" ,
22 "name": "compounds_flavors" ,

23 "primary_key": "id" ,
24 "shards": [
25 {
26 "primary_replica": "SimplyRethinkDB" ,
27 "replicas": [
28 "SimplyRethinkDB"
29 ]
30 }
31 ] ,
32 "write_acks": "majority"
33 }

Looking at the above result, we can see that the table health_effects of database foodb has primary_key id, and write_acks is majority.
You can have more fun and gain a deeper understanding of what happens under the hood by inspecting those tables.
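For example, to look at the stored configuration of just one table, we can filter table_config by database and table name (a small sketch using the foods table we imported earlier):

r.db("rethinkdb").table("table_config")
  .filter({db: "foodb", name: "foods"})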
Document
After creating a database and a table, we can start inserting documents into the table.

Insert
As you can guess, we start from the database, chain the table, and use the insert command to insert a document into the table. E.g.

1 // Let create a fake user


2 r.db("foodb").table("users")
3 .insert({id: "user-foo1", name: "foo", age: 12})
4 //=>
5 {
6 "deleted": 0 ,
7 "errors": 0 ,
8 "inserted": 1 ,
9 "replaced": 0 ,
10 "skipped": 0 ,
11 "unchanged": 0
12 }

Here, we set our own primary key in the id field. If we don't set it, RethinkDB will generate one and return it to us via the result object.

1 r.db("foodb").table("users")
2 .insert({name: "foo", age: 12})
3 //=>
4 {
5 "deleted": 0 ,
6 "errors": 0 ,
7 "generated_keys": [
8 "b7e23aa4-e7d8-4d3e-9020-4c5b1ec413c6"
9 ] ,
10 "inserted": 1 ,
11 "replaced": 0 ,
12 "skipped": 0 ,
13 "unchanged": 0
14 }

The return object contains the following attributes:

inserted
the number of documents that were successfully inserted.

replaced
the number of documents that were updated when upsert is used.

unchanged
the number of documents that would have been modified, except that the new value was the
same as the old value when doing an upsert.

errors
the number of errors encountered while performing the insert.

first_error
If errors were encountered, contains the text of the first error.

deleted, skipped
0 for an insert operation.

generated_keys
a list of generated primary keys in case the primary keys for some documents were missing
(capped to 100000).

warnings
if the field generated_keys is truncated, you will get a warning that too many keys were generated.

old_val
if returnVals is set to true, contains null (there was no previous document for an insert).

new_val
if returnVals is set to true, contains the inserted/updated document.

Notice the generated_keys. If we insert a document without setting a value for the primary key, whose field name is id by default, RethinkDB will generate a UUID for it and use it as the value of the id field. In our example, the primary key is b7e23aa4-e7d8-4d3e-9020-4c5b1ec413c6. We can retrieve the document again.


1 r.db("foodb").table("users")
2 .get('b7e23aa4-e7d8-4d3e-9020-4c5b1ec413c6')
3 //=>
4 {
5 "age": 12 ,
6 "id": "b7e23aa4-e7d8-4d3e-9020-4c5b1ec413c6" ,
7 "name": "foo"
8 }

Multi insert
If we have a large array of data, we don't have to insert documents one by one; we can pass the whole array to insert to do a batch insert, which is much more efficient than inserting one by one.
Let's play with it a bit. Create a test table on our test db.

1 r.db("test").tableCreate("users", {primaryKey: 'myid'})

We also use the field myid as the primary key instead of using id. We can insert multiple documents at a time. Some documents have myid, some of them don't. We will see how RethinkDB generates primary keys for those documents:

1 r.db("test").table("users")
2 .insert([
3 {
4 myid: 1,
5 name: 'Hydra'
6 },
7 {
8 name: 'Pluto'
9 },
10 {
11 name: 'Styx',
12 myid: 'abcxyz'
13 }
14 ])
15 //=>
16 {
17 "deleted": 0 ,
18 "errors": 0 ,
19 "generated_keys": [
20 "8c3a1d6c-2b7b-4a4f-91dc-d6855c5aed15"

21 ] ,
22 "inserted": 3 ,
23 "replaced": 0 ,
24 "skipped": 0 ,
25 "unchanged": 0
26 }

Here, generated_keys contains a single element, 8c3a1d6c-2b7b-4a4f-91dc-d6855c5aed15. The reason is that only the second inserted element, {name: 'Pluto'}, doesn't have a myid field; the other two we set manually, so RethinkDB just uses them.
Let's verify:

1 r.db("test").table("users")
2 //=>
3 {
4 "myid": 1,
5 "name": "Hydra"
6 } {
7 "myid": "abcxyz",
8 "name": "Styx"
9 } {
10 "myid": "8c3a1d6c-2b7b-4a4f-91dc-d6855c5aed15",
11 "name": "Pluto"
12 }

Yay, how cool is that? We used a custom primary key, we inserted multiple documents at a time, and RethinkDB assigned a primary key where needed.
Let's see if the myid field is really used as the primary index. We can call the get command, because get operates on the primary key:

1 r.db("test").table("users")
2 .get('abcxyz')
3 //=>
4 {
5 "myid": "abcxyz" ,
6 "name": "Styx"
7 }

Effect of durability
Let's see the difference durability makes. We will insert a big document.

First, I will create a temporary table with r.tableCreate("git").


Then insert data, to have a big amount of data. I use r.http, which is a command that fetches external JSON data and is really useful for dealing with external APIs. We can treat the result of r.http as a normal JSON document. r.http takes care of fetching the data via HTTP and turning the response into valid JSON.

1 r.table('git').insert(
2   r.http('https://api.github.com/repos/rethinkdb/rethinkdb/stargazers'),
3   {durability: 'soft'}
4 )
5 Executed in 773ms. 1 row returned

Now, if I use the default hard durability:

1 r.table('git').insert(
2   r.http('https://api.github.com/repos/rethinkdb/rethinkdb/stargazers'),
3   {durability: 'hard'}
4 )
5 Executed in 1.18s. 1 row returned

So it's slower, because it takes time to write to the hard drive. You may not see the effect if you have a very fast hard drive or an SSD. I tried it on an external spinning drive :).

Update
To make it easier, you can think of updating as selecting data, then changing its value. We chain the update method from a selection to update its data. With that being said, we can update one or many documents at a time. Similar to MySQL, we can update a full table, or update only rows that satisfy a WHERE condition.
Think of modification as a transform process where you get a list of documents (one or many), then transform them by adding fields or rewriting the value of some fields. By that definition, it doesn't matter if you update one document or many documents. As long as you have an array or a stream of data, you can update them all.
For example, to update an attribute of a single document:

1 // Let update age and add a new field


2 r.db("foodb").table("users")
3 .get("user-foo1")
4 .update({
5 age: 13,
6 gender: "f"
7 })
8 //=>
9 {
10 "deleted": 0 ,
11 "errors": 0 ,
12 "inserted": 0 ,
13 "replaced": 1 ,
14 "skipped": 0 ,
15 "unchanged": 0
16 }

RethinkDB returns an object describing the result of the update. We can look at the replaced field to see if the data was actually updated. If we re-run the above command, nothing is replaced and we get 1 unchanged.

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 age: 13,
5 gender: "f"
6 })
7 //=>
8 {
9 "deleted": 0 ,
10 "errors": 0 ,
11 "inserted": 0 ,
12 "replaced": 0 ,
13 "skipped": 0 ,
14 "unchanged": 1
15 }

That's just how nice RethinkDB is: every query result is very verbose and easy to understand.
In the above example, you can see that when we update age, we also add a new field gender. The updating process can be understood as a merge of the return value of the update function or update expression into the currently existing document. Let's verify that the gender field is really there:

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 //=>
4 {
5 "age": 13 ,
6 "gender": "f" ,
7 "id": "user-foo1" ,
8 "name": "foo"
9 }

We can also update nested fields. Let's add an address field:

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 "address" : {
5 country: "USA",
6 state: "CA",
7 city: "Cuppertino",
8 street: "Infinite Loop",
9 number: "1",
10 ste: "1205"
11 }
12 })
13 //=>
14 {
15 "deleted": 0 ,
16 "errors": 0 ,
17 "inserted": 0 ,
18 "replaced": 1 ,
19 "skipped": 0 ,
20 "unchanged": 0
21 }

replaced is 1, which means we updated successfully. Now, let's say I moved; I can change the address:

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 "address" : {
5 ste: 880,
6 number: 11
7 }
8 })
9 //=>
10 {
11 "deleted": 0 ,
12 "errors": 0 ,
13 "inserted": 0 ,
14 "replaced": 1 ,
15 "skipped": 0 ,
16 "unchanged": 0
17 }

Here, we are updating the field ste inside the field address.

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 //=>
4 {
5 "address": {
6 "city": "Cuppertino" ,
7 "country": "USA" ,
8 "number": 11 ,
9 "state": "CA" ,
10 "ste": 880 ,
11 "street": "Infinite Loop"
12 } ,
13 "age": 13 ,
14 "gender": "f" ,
15 "id": "user-foo1" ,
16 "name": "foo"
17 }

As you can see in the result above, only the nested fields we passed were changed; the other fields of address are preserved, because update merges objects rather than replacing them.

Update option

The update command accepts some options that you can pass. In JavaScript, you pass an options object as the second parameter. In Ruby, you use optional keyword arguments. For example, in JavaScript:

1 r.table("posts").get(1).update({
2 num_comments: r.js("Math.floor(Math.random()*100)")
3 }, {
4 nonAtomic: true
5 })

Here, {nonAtomic: true} is our options object. In Ruby, it's more elegant thanks to keyword arguments:

1 r.table("posts").get(1).update({
2 :num_comments => r.js("Math.floor(Math.random()*100)")
3 }, :non_atomic => true)

We have three option parameters (the third one, returnChanges, is covered later in this chapter):

durability: possible values are hard and soft. You already know what it does. However, setting it here overrides the durability default of the table.
non_atomic: you should also know what it does. If not, go back to chapter 2.

So we know the options. Let's move on. In this section, we have learned how to update the value of a field, including nested values. What if the field contains an array? How can we append a new element? Or how do we update a value which is the result of another ReQL command? Let's move to the next section.

Update data for complex field


First, we realize that a user can have many addresses. Right now, our address field is a single object. Let's make it an array so it can hold many addresses, using the previous address that we created.

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 address: [r.row("address")]
5 })

Here we are using r.row; it gives access to the current document we are referencing. We turn address into an array by wrapping it in [].
Now that address is an array with a single address, how do we add more data into that array? In the simplest form, we can pass a plain value to update: first we get the value of the current address, then append a new element using whatever our language offers. In JavaScript code, we may have something like this (newAddress is a hypothetical object holding the address to add):

1 addresses = r.db("foodb").table("users")
2 .get("user-foo1")("address")
3 addresses.push(new_add_ress)
4 r.db("foodb").table("users")
5 .get("user-foo1")
6 .update({address: addresses})

That works, but it's very inefficient. For example, imagine we have to append a new element to an array for 1000 users: we have to fetch the data, change it, and update by sending the whole new array again.
A more important issue is the lack of locking. While we retrieve the data, alter it on the client side, and push it back to the database, the server may have changed the data in between, and we weren't aware of it when we ran the first query; when we push our version back, we overwrite those new changes. Imagine we have two admins on the site who are trying to edit the same user at the same time to update the user's address.
Admin 1 retrieves the data and adds a new address. Right after admin 1 retrieves the data, admin 2 retrieves it too, adds another address, and pushes it back before admin 1 does. So when admin 1 pushes the data back, the change admin 2 made is overwritten.
It would be great if we could move the logic into RethinkDB and let RethinkDB handle the locking for us, just like how in SQL we can do:

1 UPDATE user SET login = login + 1

We tell MySQL to increment the value of login by 1 instead of doing that ourselves. Luckily, we have the same thing in RethinkDB. Some of these commands fall under the Document manipulation section of the RethinkDB docs. They allow us to apply some logic to the document.
Our example above can be written using append. The append command adds a new element to an array.

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 address: r.row("address")
5 .append({country: "Vietnam",
6 city: "Hue",
7 street: "Tran Phu",
8 number: "131"})
9 })
10 //=>
11 {


12 "deleted": 0 ,
13 "errors": 0 ,
14 "inserted": 0 ,
15 "replaced": 1 ,
16 "skipped": 0 ,
17 "unchanged": 0
18 }

What if a user doesn't have that field on it yet? Well, an error will be thrown. Let's try:

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 another_address_field: r.row("another_address_field")
5 .append({country: "Vietnamnam",
6 city: "Hue",
7 street: "Tran Phu",
8 number: "131"})
9 })
10 //=>
11 {
12 "deleted": 0,
13 "errors": 1,
14 "first_error": "No attribute `another_address_field` in object: {
15 "address": [{
16 "city": "Cuppertino",
17 "country": "USA",
18 "number": 11,
19 "state": "CA",
20 "ste": 880,
21 "street": "Infinite Loop"
22 }, {
23 "city": "Hue",
24 "country": "Vietnam",
25 "number": "131",
26 "street": "Tran Phu"
27 }],
28 "age": 13,
29 "gender": "f",
30 "id": "user-foo1",
31 "name": "foo"
32 }

33 " ,
34 "inserted": 0,
35 "replaced": 0,
36 "skipped": 0,
37 "unchanged": 0
38 }

To avoid that, we have to tell RethinkDB what value should be used when the field doesn't exist. For an array, we can consider that value to be an empty array; for a string, an empty string; for a counter, zero.
We use the default(default_value) command for this purpose. Let's try it:

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({
4 another_address_field:
5 r.row("another_address_fieldess_field")
6 .default([])
7 .append({country: "Vietnamnamam",
8 city: "Hue",
9 street: "Tran Phu",
10 number: "131"})
11 })
12 //=>
13 {
14 "deleted": 0 ,
15 "errors": 0 ,
16 "inserted": 0 ,
17 "replaced": 1 ,
18 "skipped": 0 ,
19 "unchanged": 0
20 }

When we call the default command on a value or a sequence, it will evaluate to the default value in case of a non-existence error for the value. We can verify the address again:

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 //=>
4 {
5 "address": [{
6 "city": "Cuppertino",
7 "country": "USA",
8 "number": 11,
9 "state": "CA",
10 "ste": 880,
11 "street": "Infinite Loop"
12 }, {
13 "city": "Hue",
14 "country": "Vietnam",
15 "number": "131",
16 "street": "Tran Phu"
17 }],
18 "age": 13,
19 "another_address_field": [],
20 "gender": "f",
21 "id": "user-foo1",
22 "name": "foo"
23 }

So we have append to add an element at the end of an array; we can also use prepend, which adds a new element to an array but at the front.
Take another example: we want to count how many likes a user has. We will use a field called like and increase it by 1 whenever someone likes the user.

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({like: r.row("like").add(1)})

You will notice that we have to use the add command instead of writing:

1 like: r.row("like") +1

That's because these expressions are evaluated on the RethinkDB server, not on the client. When I first learned RethinkDB, somehow I didn't understand this. I kept thinking those ran on the client, and it made my life harder. So I'm repeating this several times through the book, just in case someone is confused like I was. In some languages that support operator overloading, you may write

1 like: r.row("like") + 1

That's because the driver overrides the + operator to make it easy to write expressions. They are still serialized to ReQL syntax by the driver.
Now, with the above update command, we get an error, which is expected:

1 {
2 "deleted": 0,
3 "errors": 1,
4 "first_error": "No attribute `like` in object: {
5 "address": [{
6 "city": "Cuppertino",
7 "country": "USA",
8 "number": 11,
9 "state": "CA",
10 "ste": 880,
11 "street": "Infinite Loop"
12 }, {
13 "city": "Hue",
14 "country": "Vietnam",
15 "number": "131",
16 "street": "Tran Phu"
17 }],
18 "age": 13,
19 "another_address_field": [{
20 "city": "Hue",
21 "country": "Vietnam",
22 "number": "131",
23 "street": "Tran Phu"
24 }],
25 "gender": "f",
26 "id": "user-foo1",
27 "name": "foo"
28 }
29 " ,
30 "inserted": 0,
31 "replaced": 0,
32 "skipped": 0,
33 "unchanged": 0
34 }

We have to set a default value for it. Let's default it to 0.



1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({like: r.row("like").default(0).add(1)})
4 // Get back like
5 r.db("foodb").table("users").get("user-foo1")("like")
6 //=>
7 1

add is not limited to numeric data; it works on arrays and strings too. I will leave that part for you as an exercise.
Now we know how to work with an array as the value of a field. Let's dive into how to work with an object as the value of a field. Consider that we have this:

1 r.db("foodb").table('users')
2 .get('user-foo1')
3 .update({
4 social: {twitter: "kureikain"}
5 })
6 //=>
7 {
8
9 "deleted": 0 ,
10 "errors": 0 ,
11 "inserted": 0 ,
12 "replaced": 1 ,
13 "skipped": 0 ,
14 "unchanged": 0
15 }

The social field is an object now. The user enters his Facebook username, so we want to add a new field facebook to the social field to hold the Facebook account of the user. We cannot use append or add on an object. For objects, we use merge to add or override a field.

1 r.db("foodb").table('users')
2 .get('user-foo1')
3 .update({
4 social: r.row('social').default({}).merge({facebook: "kureikain"})
5 })
6 //=>
7 {
8
9 "deleted": 0 ,
10 "errors": 0 ,
11 "inserted": 0 ,
12 "replaced": 1 ,
13 "skipped": 0 ,
14 "unchanged": 0
15
16 }

Same as with append, we also have to set a default value to handle the non-existence error. Since we are working with an object, we set its default value to an empty object: {}. merge overrides existing keys with the new values in the object you pass, or creates new keys from the passed object.

1 r.db("foodb").table('users')
2 .get('user-foo1')
3 .update({
4 social: r.row('social').default({}).merge({facebook: "kureikain2", twitter: \
5 "kureikain2"})
6 })
7 //=>
8 {
9
10 "deleted": 0 ,
11 "errors": 0 ,
12 "inserted": 0 ,
13 "replaced": 1 ,
14 "skipped": 0 ,
15 "unchanged": 0
16
17 }
18
19 // Select it back
20 r.db("foodb").table('users')
21 .get('user-foo1')("social")

22 //=>
23 {
24 "facebook": "kureikain2" ,
25 "twitter": "kureikain2"
26 }

Cool, so we know how to add new fields or override old fields. But do you notice that when a field contains an object, they are really just nested fields? So we can easily update them using the nested field knowledge from before, instead of using the merge command:

1 r.db("foodb").table('users')
2 .get('user-foo1')
3 .update({
4 social: {facebook: "kureikain3", github: "kureikain"}
5 })

It's really up to you whether to use merge or the nested field style. I usually use the nested field style when doing a simple update, and merge when I want to merge the document with a result from another ReQL function. But that's just an opinion.

Update multiple documents


Instead of selecting a single document and updating one by one, you can update a bunch of documents by calling update on a table, a stream, or a selection. It all works the same as above, but instead of applying the update to a single document, it updates every element in the stream. For example (the table name days below is hypothetical; imagine any table whose documents carry a numeric Day field):

1 r.table("days").filter(r.row('Day').gt(1).and(r.row('Day').lt(90)))
2 .update({quarter: 1})
3 //=>
4 {
5
6 "deleted": 0 ,
7 "errors": 0 ,
8 "inserted": 0 ,
9 "replaced": 55 ,
10 "skipped": 0 ,
11 "unchanged": 0
12 }

ReQL inside the updated object


As you may have noticed, we can not only pass plain values into the updated object, but also ReQL expressions, as long as they can be evaluated, like when we increased like by 1 with:

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .update({like: r.row("like").default(0).add(1)})

However, r.row has a limit: it cannot be used in nested queries. Assume that we have a friends table that we can create with the query below:

1 // First, creata table


2 r.db("foodb").tableCreate('friends')
3
4 // Insert some faked data
5 r.db("foodb").table("friends")
6 .insert([{
7 friend1_id: '12063f5f-4289-4a4b-b668-0e4a90861575',
8 friend2_id: 'user-foo1'
9 },
10 {
11 friend1_id: '8d4bcd47-3f7f-4670-a31c-2b807e3f7caf',
12 friend2_id: 'user-foo1'
13 }])

So we can see that our user with id user-foo1 has 2 friends. It will not be very efficient if we have to count this over and over, so we are going to count it once and update the users table with a field friend_counts.
To count, we can get a sequence of friend2_id values and count how many items have the same value as the current user id, by passing a value to the count function. When we pass a value to count, it only counts the documents equal to that value. Here, we try to use r.row to reference the current user.

1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .update({
4 friend_counts: r.db('foodb').table('friends')('friend2_id').count(r.row('id'\
5 ))
6 })

If we run the above query we will get this error:

RqlCompileError: Cannot use r.row in nested queries. Use functions instead in:
r.table("foodb").get("user-foo1").update({friend_counts:
r.db("foodb").table("friends")("friend2_id").count(r.row("id"))})

The reason for this error is that RethinkDB doesn't know which query to base r.row on: is it the main query on table users, or the sub query on table friends? Luckily, we can use an anonymous function to solve this. A function still gives access to the current document, but it solves the problem of r.row because it clearly binds to one sequence.

Expression
Let's get some basic knowledge, then we will come back to the previous example.
Besides passing an object into the update command, we can also pass an expression or a function which returns an object. RethinkDB will evaluate it, get the resulting object, and use that value for the update command. This comes in useful when you have some logic on your document related to the update. In the case of a function, the function receives the current document as its first parameter.
With the previous example, we can re-write it using a function:

1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .update(function (user) {
4 return {
5 friend_counts: r.db('foodb').table('friends')('friend2_id').count(user("id\
6 "))
7 }
8 })

Then, we got an error:

1 RqlRuntimeError: Could not prove function deterministic. Maybe you want to use \
2 the non_atomic flag? in:
3 r.db("foodb").table("users").get("user-foo1").update(function(var_58) { return {\
4 friend_counts: r.db("foodb").table("friends")("friend2_id").count(var_58("id"))}\
5 ; })

Well, this is because the update isn't atomic.

An atomic update means that an update to a document either succeeds completely, or fails and no change is made. Say we update two fields of a document: atomicity guarantees we can never end up with the first field updated and the second one not; either both fields get their new values or neither does. Any update that happens on a single JSON document is guaranteed atomicity. So what is a non-atomic update? A non-atomic update is one that sets a value to the result of executing JavaScript code, to a random value, or to a value obtained as the result of a subquery.
http://rethinkdb.com/docs/architecture/#query-execution

non-atomic updates
A good way to remember what a non-atomic update is: it usually involves values which cannot be predicted, such as random values or the result of another query.

To run a non-atomic update, we have to explicitly tell RethinkDB with the nonAtomic option:

1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .update(function (user) {
4 return {
5 friend_counts: r.db('foodb').table('friends')('friend2_id').count(user("id\
6 "))
7 }
8 }, {nonAtomic: true})
9 //=>
10 {
11 "deleted": 0 ,
12 "errors": 0 ,
13 "inserted": 0 ,
14 "replaced": 1 ,
15 "skipped": 0 ,
16 "unchanged": 0
17 }

We can verify that the update really succeeded:

1 r.db('foodb').table('users')
2 .get('user-foo1')('friend_counts')
3 //=>
4 2

So we can see that in RethinkDB, we have to opt in to some features. Later on, we will see that we also have to manually pass an index name to use an index. That may be a little bit verbose at first, but it helps you understand the query and know exactly what you are doing.
Updating with a function is really similar to passing an object to the update command. We have to return a JSON object with key-values similar to the JSON document that we would pass directly to update. We can name the parameter of the function whatever we want; the name isn't important. It is just like a callback function, into which RethinkDB passes the value of the current document when it invokes the function. What we can do with r.row we can mostly do with that parameter, such as getting the value of a field. Let's change user to u and see if it works:

1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .update(function (u) {
4 return {
5 friend_counts: r.db('foodb').table('friends')('friend2_id').count(u("id"))
6 }
7 }, {nonAtomic: true})

Let's do one more complex example. If a user has more than 10 friends, we set a field social_status to extrovert; otherwise, it's introvert.

1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .update(function (u) {
4 return {
5 social_status: r.branch(u('friend_counts').gt(10), 'extrovert', 'introvert\
6 ')
7 }
8 }, {nonAtomic: true})

Here we are using a new command, r.branch. It's like an IF in MySQL: if the first argument is TRUE, the second argument is the return value, otherwise the third argument is. We use u('friend_counts') to get the value of friend_counts as you know, and we call the gt command on it. gt means greater; it returns TRUE if the value is greater than what we pass to gt.
When using a function, the parameter passed into the function is the currently visited document. Therefore, you can use many document manipulation commands on it, such as pluck, without, merge, append, prepend. Just remember this, so you know what you can do with that parameter.
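As a small sketch of that idea, the function argument behaves like a document, so document commands such as pluck can be applied to it directly (the field name contact_card is made up for this example):

r.db('foodb').table('users')
  .get('user-foo1')
  .update(function (u) {
    // copy a trimmed-down view of the document into a new field
    return { contact_card: u.pluck('name', 'age', 'gender') }
  })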

Expr

expr is a normal function but I think they are important and will help us achive many crazy things
so I cover them here.
What expr does is tranform a native object from host language into ReQL object. For example, if
a RethinkDb funciton can be call on array or sequence, we cannot write something like this: [e1,
e2].nth(2), RethinkDB will throw an error on [e1, e2]

1 ["e1","e2"].nth is not a function

What we have to do is somehow convert the array that we wrote in our native language into a RethinkDB data type. To do that, we simply wrap it in expr.
A real example from when I was writing this book: I wanted to randomly generate fake data for the gender field of the users table. I did this with:

1 r.db("foodb").table("users")
2 .update({
3 gender: r.expr(['m', 'f']).nth(r.random(0, 2))
4 }, {nonAtomic: true})

It means that for every document of the users table, I want to set gender to either m or f randomly. I create a two-element array ['m', 'f'], turn it into a ReQL object with expr so that I can call nth on it, and pass a random number of either 0 or 1.
Let's try a more complex example to generate some data. For every user, we generate a list of eaten food names randomly, by selecting from all food names and using the sample(number) command to pick a number of random elements.

1 r.db("foodb").table("users")
2 .update({
3 eatenfoods: r.db("foodb").table("foods").sample(r.random(0,

10)).getField(name) }, {nonAtomic: true} )


We have two randoms here. First, we randomize how many foods we want to get, from 0 to 10, by calling r.random with a range. Then we use the sample command to get that number of random documents.
Now we have an eatenfoods field. Let's say we want to create a field containing the foods that a user has eaten, plus his or her most favourite food (the first element of the favfoods field).

1 r.db("foodb").table("users")
2 .update({
3 eateanorlike : r.add(r.row("eatenfoods"), [r.row("favfoods").nth(0)])
4 }, {nonAtomic: true})

By combining ReQL expressions, looking at the RethinkDB API and finding the appropriate function, we can achieve what we want. In the above example, we knew we wanted to concatenate two arrays: eatenfoods and the first item of favfoods. We used r.add. We have to wrap r.row("favfoods").nth(0) in [] because nth() returns a single document, whereas r.add expects an array, so we wrap it in [].

We also don't have an age field on most of those user documents. Let's generate some fake data for it so we can play around later. Here we randomize age between 8 and 90.

1 r.db("foodb").table("users")
2 .update({
3 age : r.random(8, 90)
4 }, {nonAtomic: true})
5 #=>
6 {
7 "deleted": 0 ,
8 "errors": 0 ,
9 "inserted": 0 ,
10 "replaced": 152 ,
11 "skipped": 0 ,
12 "unchanged": 0
13 }

By using a function and/or an expression, we can update documents in complex ways. If we carefully look up the RethinkDB API, we can usually find the function we want; if not, we can probably whip up some logic inside a function.

Return Values
Sometimes it can be useful to get back the updated document. This way you can verify the result without issuing a subsequent get command. We just need to set the returnChanges flag to true in the options parameter of the update command. Same example:

1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .update(function (user) {
4 return {
5 _social_status: r.branch(user('friend_counts').gt(10), 'extrovert', 'intro\
6 vert')
7 }
8 }, {nonAtomic: true, returnChanges: true})
9 //=>
10 {
11 "changes": [{
12 "new_val": {
13 "_social_status": "introvert",
14 "address": [{
15 "city": "Cuppertino",
16 "country": "USA",
17 "number": 11,
18 "state": "CA",

19 "ste": 880,
20 "street": "Infinite Loop"
21 }, {
22 "city": "Hue",
23 "country": "Vietnam",
24 "number": "131",
25 "street": "Tran Phu"
26 }],
27 "age": 13,
28 "another_address_field": [{
29 "city": "Hue",
30 "country": "Vietnam",
31 "number": "131",
32 "street": "Tran Phu"
33 }],
34 "friend_counts": 2,
35 "gender": "f",
36 "id": "user-foo1",
37 "name": "foo",
38 "social": {
39 "facebook": "kureikain3",
40 "github": "kureikain",
41 "twitter": "kureikain2"
42 },
43 "social_status": "introvert"
44 },
45 "old_val": {
46 "address": [{
47 "city": "Cuppertino",
48 "country": "USA",
49 "number": 11,
50 "state": "CA",
51 "ste": 880,
52 "street": "Infinite Loop"
53 }, {
54 "city": "Hue",
55 "country": "Vietnam",
56 "number": "131",
57 "street": "Tran Phu"
58 }],
59 "age": 13,
60 "another_address_field": [{

61 "city": "Hue",
62 "country": "Vietnam",
63 "number": "131",
64 "street": "Tran Phu"
65 }],
66 "friend_counts": 2,
67 "gender": "f",
68 "id": "user-foo1",
69 "name": "foo",
70 "social": {
71 "facebook": "kureikain3",
72 "github": "kureikain",
73 "twitter": "kureikain2"
74 },
75 "social_status": "introvert"
76 }
77 }],
78 "deleted": 0,
79 "errors": 0,
80 "inserted": 0,
81 "replaced": 1,
82 "skipped": 0,
83 "unchanged": 0
84 }

The old value and the new value are returned in the keys old_val and new_val respectively.
As you see, we have started to mess up our data. It's OK; let's learn some commands that destroy data. Let's meet the replace command.

Replace
First, we want to remove eateanorlike.
To remove one or many fields from a document, we cannot use update anymore. We can set a field to a null value (null, nil, depending on your language), but then the key is still in the document with a null value. In other words, update lets us overwrite fields, but not remove them. That's why we have another command for removing fields: the replace command, which replaces the entire document with a new document.

1 r.db("foodb").table("users").replace(r.row.without('eateanorlike'))

Here we use r.row to get the current document, then call without to remove the field.
without accepts a list of arguments and removes the fields with those names from the document. For example:

1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .without("address", "another_fieldaddress_field", "social")
4 //=>
5 {
6 "_social_status": "introvert" ,
7 "age": 13 ,
8 "friend_counts": 2 ,
9 "gender": "f" ,
10 "id": "user-foo1" ,
11 "name": "foo" ,
12 "social_status": "introvert"
13 }

without can also remove nested fields, such as removing the country field from address:

1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .without({address: "country"}, "another_address_field", "social")
4 //=>
5 {
6 "_social_status": "introvert",
7 "address": [{
8 "city": "Cuppertino",
9 "number": 11,
10 "state": "CA",
11 "ste": 880,
12 "street": "Infinite Loop"
13 }, {
14 "city": "Hue",
15 "number": "131",
16 "street": "Tran Phu"
17 }],
18 "age": 13,
19 "friend_counts": 2,
20 "gender": "f",
21 "id": "user-foo1",
22 "name": "foo",

23 "social_status": "introvert"
24 }

We can use the nested style to denote the field that we want to remove. If we want to remove many fields, we wrap them in an array.
For example, we remove all fields of address except country:

1 r.db('foodb').table('users')
2 .get('user-foo1')
3 .without({address: ["number","state", "city", "ste", "street"]}, "another_addr\
4 ess_field", "social")
5 //=>
6 {
7 "_social_status": "introvert",
8 "address": [{
9 "country": "USA"
10 }, {
11 "country": "Vietnam"
12 }],
13 "age": 13,
14 "friend_counts": 2,
15 "gender": "f",
16 "id": "user-foo1",
17 "name": "foo",
18 "social_status": "introvert"
19 }

Note that we can replace a document with an entirely new document; however, the primary key cannot be changed. It has to be the same as the current primary key. An attempt to change the primary key will cause an error: Primary key id cannot be changed.

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .replace({id: 1})

We will get an error:



1 {
2 "deleted": 0,
3 "errors": 1,
4 "first_error": "Primary key `id` cannot be changed (`{
5 "_social_status": "introvert",
6 "address": [{
7 "city": "Cuppertino",
8 "country": "USA",
9 "number": 11,
10 "state": "CA",
11 "ste": 880,
12 "street": "Infinite Loop"
13 }, {
14 "city": "Hue",
15 "country": "Vietnam",
16 "number": "131",
17 "street": "Tran Phu"
18 }],
19 "age": 13,
20 "friend_counts": 2,
21 "gender": "f",
22 "id": "user-foo1",
23 "name": "foo",
24 "social": {
25 "facebook": "kureikain3",
26 "github": "kureikain",
27 "twitter": "kureikain2"
28 },
29 "social_status": "introvert"
30 }
31 ` -> ` {
32 "id": 1
33 }
34 `)." ,
35 "inserted": 0 ,
36 "replaced": 0 ,
37 "skipped": 0 ,
38 "unchanged": 0
39 }

Of course, changing the primary key is effectively just removing the old document and inserting a new one. Let's learn about removing data.

Delete
Delete is similar to update or replace: we select a sequence and call the delete command on it. This deletes a single document when we use the primary key to select it with get, then call delete on that single document.

1 r.db("foodb").table("users")
2 .get("user-foo2")
3 .delete()

We can also clear a whole table or a selection. Let's play with it via some temporary table:

1 r.db("foodb").tableCreate("test1")
2 r.db("foodb").table("test1").insert({field: 'foo', field2: 'bar'})
3 r.db("foodb").table("test1").insert({field: 'foo2', field2: 'bar2'})
4 r.db("foodb").table("test1").insert({age: 10, name: 'abc'})
5 r.db("foodb").table("test1").insert({age: 12, name: 'abc2'})

Let's remove users who are under the age of 11.

1 r.db("foodb")
2 .table("test1")
3 .filter(r.row('age').lt(11))
4 .delete()

We use r.row to get the current document, then get the age field value, and call lt to do a less-than comparison.
Basically, on any selection we can call delete; it goes over the selection and removes the data.
So you can already guess the command to delete a whole table:

1 r.db("foodb").table("test1").delete()

Like update, the delete method accepts an optional object with:

durability: hard or soft. The default is hard.

returnChanges: false or true. The default is false. When set to true, a changes array will be returned with old_val and new_val for each document that is successfully removed.

And finally, before we move on, let's remove the user user-foo1 since we messed with it a bit.

1 r.db("foodb").table("users")
2 .get("user-foo1")
3 .delete()

Sync
As you know from the previous chapter, with a durability value of soft the write isn't guaranteed to be written to permanent storage. So after doing a bunch of those soft-durability writes, you may want to say "Hey, I am done with all tasks, let's make sure you write those changes"; for that, you can call sync.
Using the JavaScript driver:

r.table('t').sync().run(connection, function () {
  console.log('Syncing is done. All data is safe now')
})
The sync command blocks until all previous writes to the table are persisted. With that said, sync can only be called on a table.

It's a good idea to do a bunch of soft-durability writes and call sync at the end, to ensure the data is persisted while still avoiding blocking during the execution of other logic.
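Here is a minimal sketch of that pattern, reusing the git table from earlier (the docs array and connection are placeholders for whatever your application provides):

var docs = [{event: 'a'}, {event: 'b'}, {event: 'c'}]  // hypothetical payload

r.table('git').insert(docs, {durability: 'soft'}).run(connection, function (err) {
  // fast: acknowledged before the data hits disk
  r.table('git').sync().run(connection, function (err) {
    console.log('all soft writes are now on disk')
  })
})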
Wrap up
Some important concepts you should remember:

atomicity
sync
multiple insert
changing the primary field
updating with a function
using the nested field style to remove nested fields
5. Reading Data Advanced
Understanding index
Index
Soon enough you will realize that filter is slow. If you have a table with more than 100,000 records, filter stops working. All of that is because we haven't used an index yet. Without an index, we cannot even order the data.

1 r.db("foodb").table("compounds_foods").orderBy(r.desc("id"))
2 #->
3 RqlRuntimeError: Array over size limit `100000` in:
4 r.db("foodb").table("compounds_foods").orderBy(r.desc("id"))
5 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Without an index, RethinkDB holds all the data in memory and sorts or filters it there. A limit has to lie somewhere: 100,000 is the magic number RethinkDB sets for reading data without an index.
In order to fetch data properly, we have to create an index, just as in a real application we almost always end up creating indexes in MySQL to fetch data efficiently.
We have two kinds of indexes in RethinkDB

primary index: our id key is this. This index is created automatically by RethinkDB. Coming back to the above query, if we change it to use the primary index:

r.db("foodb").table("compounds_foods").orderBy({index: r.desc("id")})

The id field is always indexed automatically.

secondary index:

A secondary index is an index we create ourselves on one or many fields. A secondary index can be simple, indexing the value of a field directly, or it can do some pre-calculation on the data before indexing.
While an index helps decrease reading time, it increases writing time and also costs storage space. It reduces write performance because whenever we insert a document, the index has to be calculated and written into the database.
RethinkDB supports those kinds of index:

Simple: indexes based on the value of a single field.
Compound: indexes based on multiple fields.
Multi: indexes based on arrays of values.
Arbitrary: indexes based on arbitrary expressions.

So now you know what an index is. But the sad news is that filter cannot use those secondary indexes. For that purpose, we have to use other functions: getAll and between.

Creating index
Lets start with simple index first.

Simple index
As its name suggests, a simple index is simply an index on a single field. Let's say we want to find all compounds_foods whose name contains banana. We cannot use filter here, because this table has more than 100,000 items and filter doesn't use an index. Let's meet getAll. getAll grabs all documents where a given key matches the index we specify.
First, we create the index following this syntax:

1 table.indexCreate(indexName[, indexFunction][, {multi: false}]) → object

Applying it to our case, for a single field:

1 r.db("foodb")
2 .table("compounds_foods")
3 .indexCreate("match_orig_food_common_name", r.row.match)

If we don't pass an index function, RethinkDB creates the index on the field that has the same name as the requested index name.
Time to use it:

1 r.db("foodb")
2 .table("compounds_foods")
3 .getAll('Bananas', {index:'orig_food_common_name'})
4 #=> Executed in 10ms. No results returned.

No result. It's strange that we have no document whose orig_food_common_name contains banana. Why so? Well, the simple index does an exact match; in other words, it's an equality comparison. Let's try an exact match:

1 r.db("foodb")
2 .table("compounds_foods")
3 .getAll('Bananas, raw', {index:'orig_food_common_name'})
4 #=> Executed in 69ms. 40 rows returned, 40 displayed, more available
5 {
6 "citation": "USDA" ,
7 "citation_type": "DATABASE" ,
8 "compound_id": 2100 ,
9 "created_at": Tue Jan 03 2012 18:33:15 GMT-08:00 ,
10 "creator_id": null ,
11 "food_id": 208 ,
12 "id": 257686 ,
13 "orig_citation": null ,
14 "orig_compound_id": "262" ,
15 "orig_compound_name": "Caffeine" ,
16 "orig_content": "0.0" ,
17 "orig_food_common_name": "Bananas, raw" ,
18 "orig_food_id": "09040" ,
19 "orig_food_part": null ,
20 "orig_food_scientific_name": null ,
21 "orig_max": null ,
22 "orig_method": null ,
23 "orig_min": null ,
24 "orig_unit": "mg" ,
25 "orig_unit_expression": null ,
26 "updated_at": Tue Jan 03 2012 18:33:15 GMT-08:00 ,
27 "updater_id": null
28 }

We can pass multiple keys to getAll to get an OR effect, meaning RethinkDB returns documents where the index matches any of the values that we pass.

1 r.db("foodb")
2 .table("compounds_foods")
3 .getAll('Bananas, raw', 'Yoghurt with pear and banana', 'Alfalfa seeds',{index\
4 :'orig_food_common_name'})

We can chain count to count how many documents we have:


Understanding index 91

1 r.db("foodb")
2 .table("compounds_foods")
3 .getAll('Bananas, raw', 'Yoghurt with pear and banana', 'Alfalfa seeds',{index\
4 :'orig_food_common_name'})
5 .count()
6 #=> 256

Indexes are not only used to find documents; they can be used for sorting as well. To sort, we call orderBy
and pass the index name.

1 r.db("foodb").table("compounds_foods")
2 .orderBy({index: "orig_food_common_name"})

When passing the index, we can wrap it in another expression to change the ordering:

1 r.db("foodb").table("compounds_foods")
2 .orderBy({index: r.desc("orig_food_common_name")})
3 .withFields("orig_food_common_name")
4 #=>
5 {
6 "orig_food_common_name": "Zwieback"
7 } {
8 "orig_food_common_name": "Zwieback"
9 } {
10 "orig_food_common_name": "Zwieback"
11 }

Using withFields, we can pass a list of fields to choose what we want to get back.
Because an index can be used to sort, we can also use it to find values within a range. In RethinkDB, the
between syntax is:

1 table.between(lowerKey, upperKey[, {index: 'id', leftBound: 'closed',
2 rightBound: 'open'}]) selection

As you can see, we can only use between on a table. Using this, we can find
the documents in the range between Apple and Banana.

1 r.db("foodb").table("compounds_foods")
2 .between("Apple", "Banana", {index: 'orig_food_common_name'})
3 .orderBy({index: r.desc("orig_food_common_name")})
4 .withFields("orig_food_common_name")

Without specifying an index, between operates on the primary index; many RethinkDB functions have the
same behaviour.

1 r.db("foodb").table("compounds_foods")
2 .between(1, 200)
3 .count()
4 #=> 198

This works when we want to find data based on a single field. How about finding values based on multiple
fields? Let's meet the compound index.

Compound index
A compound index is created using the values of multiple fields. It's very similar to a simple index in
syntax, differing only in how many fields we pass to index creation. Let's take a look at the compounds_foods
table; it contains the relationship of foods and compounds. We will learn more about JOIN later.
For now, let's say we want to find all compounds_foods documents where compound_id is 354
and food_id is 287. We are finding on two fields, so we need an index containing those two fields:

1 r.db("foodb").table("compounds_foods")
2 .indexCreate("compound_food_id", [r.row("compound_id"), r.row("food_id")])

The only difference with a simple index is that we have to pass an array of fields to create the index. Let's
try it:

1 r.db("foodb").table("compounds_foods")
2 .getAll([354,287], {index: 'compound_food_id'})
3 #=>
4 RqlRuntimeError: Index `compound_food_id` on table `foodb.compounds_foods` was
5 accessed before its construction was finished in:
6 r.db("foodb").table("compounds_foods").getAll([354, 287], {index:
7 "compound_food_id"})
8 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\
9 ^^^^^^

We got an error. Looks like the index creation has not finished yet. The table is big, so building the
index takes a while. We can query the index status:

1 r.db("foodb").table("compounds_foods")
2 .indexStatus('compound_food_id')
3 #=>
4 [
5 {
6 "blocks_processed": 13192 ,
7 "blocks_total": 15437 ,
8 "function": <binary, 408 bytes, "24 72 65 71 6c 5f..."> ,
9 "geo": false ,
10 "index": "compound_food_id" ,
11 "multi": false ,
12 "outdated": false ,
13 "ready": false
14 }
15 ]

The ready field is false. We can only wait until it finishes. This table is very big; we can verify:

1 r.db("foodb").table("compounds_foods").count()
2 #=> 737089

Let's just wait for a bit, make a cup of coffee, and come back :). When it's ready, you should see:

1 r.db("foodb").table("compounds_foods")
2 .indexStatus('compound_food_id')
3
4 [
5 {
6 "function": <binary, 408 bytes, "24 72 65 71 6c 5f..."> ,
7 "geo": false ,
8 "index": "compound_food_id" ,
9 "multi": false ,
10 "outdated": false ,
11 "ready": true
12 }
13 ]

Now, try it:



1 r.db("foodb").table("compounds_foods")
2 .getAll([21477,899], {index: 'compound_food_id'})
3 #=> Executed in 7ms. 1 row returned
4 {
5 "citation": "DFC CODES" ,
6 "citation_type": "DATABASE" ,
7 "compound_id": 21477 ,
8 "created_at": Tue Sep 11 2012 16:12:30 GMT-07:00 ,
9 "creator_id": null ,
10 "food_id": 899 ,
11 "id": 740574 ,
12 "orig_citation": null ,
13 "orig_compound_id": null ,
14 "orig_compound_name": null ,
15 "orig_content": null ,
16 "orig_food_common_name": "Meats" ,
17 "orig_food_id": "WI8000" ,
18 "orig_food_part": null ,
19 "orig_food_scientific_name": null ,
20 "orig_max": null ,
21 "orig_method": null ,
22 "orig_min": null ,
23 "orig_unit": null ,
24 "orig_unit_expression": null ,
25 "updated_at": Tue Sep 11 2012 16:12:30 GMT-07:00 ,
26 "updater_id": null
27 }

With the above indexing approach, you may have noticed that you have to create a dedicated index for whatever
you want to find. An index contains a single value for a document: either a single field value, or
an ordered set of values in the case of a compound index. However, life is not that simple. Let's look at the
users table. It contains a list of users and their favourite foods, stored in the field favfoods.
That's a single field, but it holds many elements. Because of that, we cannot simply answer the
question of who likes Mushrooms.
Let's create an index:

1 r.db("foodb").table("users").indexCreate('favfoods')

Now try to find all users who like Mushrooms.



1 r.db("foodb").table("users")
2 .getAll('Mushrooms', {index: 'favfoods'})
3 #=> Executed in 6ms. No results returned.

Why so? Because we indexed the whole field as a single value, we have to match the whole value
of the field:

1 r.db("foodb").table("users")
2 .getAll(["Edible shell" ,
3 "Clupeinae (Herring, Sardine, Sprat)" ,
4 "Deer" ,
5 "Perciformes (Perch-like fishes)" ,
6 "Bivalvia (Clam, Mussel, Oyster)"], {index: 'favfoods'})
7 #=>
8 {
9 "favfoods": [
10 "Edible shell" ,
11 "Clupeinae (Herring, Sardine, Sprat)" ,
12 "Deer" ,
13 "Perciformes (Perch-like fishes)" ,
14 "Bivalvia (Clam, Mussel, Oyster)"
15 ] ,
16 "id": "1dd8059c-82ca-4345-9d75-eaa0f8edbf48" ,
17 "name": "Arthur Hegmann"
18 }

A multi index solves the above question: who likes Mushrooms. A multi index is built on multiple
values, in other words, an array of values. When RethinkDB sees that we want to use a multi index,
it loops over all values in the indexed array and matches against each element of the array.
To create a multi index, all we have to do is pass the option flag multi: true:

1 r.db("foodb").table("users").indexCreate('favfoods_multi', r.row("favfoods"), {m\


2 ulti: true})

Now, lets try it:



1 r.db("foodb").table("users")
2 .getAll('Mushrooms', {index: 'favfoods_multi'})
3 #=>
4 {
5 "favfoods": [
6 "Milk substitute" ,
7 "Mushrooms" ,
8 "Nuts" ,
9 "Hummus" ,
10 "Soft-necked garlic"
11 ] ,
12 "id": "47110a8f-3c2c-46b8-96d8-244747c1818b" ,
13 "name": "Annabelle Lindgren"
14 }

If you notice, we had to pass r.row("favfoods") to create the index. Remember that in order to create
an index whose name doesn't match any field, we have to pass an expression or an anonymous
function to indexCreate to calculate its value. But we defined the favfoods index before, so we
cannot create another index with the same name. We can go back and delete the indexes to clean things up and
save our namespace:

1 r.db("foodb").table("users").indexDrop('favfoods')
2 #=>
3 { "dropped": 1 }
4 r.db("foodb").table("users").indexDrop('favfoods_multi')
5 #=>
6 { "dropped": 1 }

Now, let's create a multi index with the same field name:

1 r.db("foodb").table("users").indexCreate('favfoods', {multi: true})


2 r.db("foodb").table("users").getAll('Mushrooms', {index: 'favfoods'})
3 #=>
4 {
5 "favfoods": [
6 "Milk substitute" ,
7 "Mushrooms" ,
8 "Nuts" ,
9 "Hummus" ,
10 "Soft-necked garlic"
11 ] ,
12 "id": "47110a8f-3c2c-46b8-96d8-244747c1818b" ,

13 "name": "Annabelle Lindgren"


14 }

Then another question comes up: can we find all users who like both Kiwi and Banana? We may try
this:

1 r.db("foodb").table("users")
2 .getAll('Kiwi', 'Banana', {index: 'favfoods'})

However, that's an or: RethinkDB will return documents whose index value matches either Kiwi
or Banana.
Even more complex, we may want to find users who like Kiwi most, meaning Kiwi has to be the first element
in their favfoods array.
Here we are pushing business logic into RethinkDB. We have to somehow represent
that logic in RethinkDB, calculate a value, and index the returned value. Let's meet the arbitrary
expression index.

Arbitrary expression index


As its name suggests, the returned value of an expression is used to calculate the index.
Unfortunately, at this moment we cannot simply answer who likes both Banana and Kiwi. We will
save that for a later chapter when we learn about the map function. For now, let's find users who like Kiwi the
most, with the assumption that the first element of the favfoods array is what a user likes most.

1 r.db("foodb").table("users")
2 .indexCreate('most-favourite-food', function (user) {
3 return user("favfoods").nth(0)
4 })

Given an array, nth(n) returns the n-th element. We call nth(0) on the favfoods array, which means the first
element of the array, since arrays are zero-based.
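As a tiny illustration of nth that you can run in the Data Explorer (the array values here are made up):

r.expr(["Kiwi", "Lemon", "Lime"]).nth(0)
//=> "Kiwi"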

1 r.db("foodb").table("users")
2 .getAll('Kiwi', {index: 'most-favourite-food'})
3 #=>
4 {
5 "favfoods": [
6 "Kiwi" ,
7 "Lemon" ,
8 "Lime" ,
9 "Coffee" ,
10 "Sweet orange"
11 ] ,
12 "id": "0b83164e-fb42-4273-8db1-ba12be6e580d" ,
13 "name": "Carl Achiban"
14 } {
15 "favfoods": [
16 "Kiwi" ,
17 "Banana" ,
18 "Peanut" ,
19 "Asparagus" ,
20 "Common cabbage"
21 ] ,
22 "id": "d10b51d7-d321-4b41-bd7d-1367ede0eb30" ,
23 "name": "Luma Ramses"
24 }

This kind of index is powerful because we can push more complex searching to the database engine. Let's say
we want to find all users who like Kiwi most and are female.

1 r.db("foodb").table("users")
2 .indexCreate('most-favourite-food-gender', function (user) {
3 return [user("gender"), user("favfoods").nth(0)]
4 })

Here, we are creating a non-multi index. The index value is an array of the gender and the most favourite
food item. Now, let's try our index:

1 r.db("foodb").table("users")
2 .getAll(['f', 'Kiwi'], {index: 'most-favourite-food-gender'})
3 #=>
4 {
5 "favfoods": [
6 "Kiwi" ,
7 "Lemon" ,
8 "Lime" ,
9 "Coffee" ,
10 "Sweet orange"
11 ] ,
12 "gender": "f" ,
13 "id": "0b83164e-fb42-4273-8db1-ba12be6e580d" ,
14 "name": "Carl Achiban"
15 } {
16 "favfoods": [
17 "Kiwi" ,
18 "Banana" ,
19 "Peanut" ,
20 "Asparagus" ,
21 "Common cabbage"
22 ] ,
23 "gender": "f" ,
24 "id": "d10b51d7-d321-4b41-bd7d-1367ede0eb30" ,
25 "name": "Luma Ramses"
26 }

One thing I want to remind you is that the function we pass to any RethinkDB expression is evaluated
on the RethinkDB server, not in the client language. You cannot just write anything: you have to use RethinkDB
expressions in the return value so that RethinkDB can calculate it. In a previous chapter, we learned about
expr; you can use that to turn a native object into a RethinkDB object when needed.
Let's look at this example:

1 r.table("users").indexCreate("full_name2", function(user) {
2 return r.add(user("last_name"), "_", user("first_name"))
3 }).run(conn, callback)

We are trying to create an index by concatenating last_name and first_name. We cannot write

1 return user("last_name") + user("first_name")

because RethinkDB won't understand that expression. We have to call the r.add function.
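As far as I know, add can also be chained as a method on a term, so the index above could be sketched like this (full_name2 and the field names are just the ones from the example):

r.table("users").indexCreate("full_name2", function(user) {
  return user("last_name").add("_", user("first_name"))
}).run(conn, callback)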
However, even if you see something like

1 return user("last_name") + user("first_name")

working in some drivers, that doesn't mean RethinkDB understands your native expression. It's actually your driver
overloading the + operator in this case, because your host language happens to support operator
overloading.
If you notice, we didn't pass multi: true in any of the above examples. Can we use a multi index
with an arbitrary expression index? Yes, we can.
Let's say we want to find any users who like Kiwi or have eaten Kiwi before. We will create a
multi index by concatenating the arrays favfoods and eatenfoods:

1 r.db("foodb").table("users")
2 .indexCreate(
3 'eateen-or-like-multi',
4 r.add(r.row("eatenfoods"), r.row("favfoods"))
5 , {multi: true})

Now, we can use that index:

1 r.db("foodb").table("users")
2 .getAll('Kiwi', {index:'eateen-or-like-multi'})
3 #=>
4 {
5 "eatenfoods": [
6 "Celery leaves" ,
7 "Kiwi" ,
8 "Rainbow trout" ,
9 "Chinese bayberry" ,
10 "Hyacinth bean" ,
11 "Other sandwich"
12 ] ,
13 "favfoods": [
14 "Honey" ,
15 "Cake" ,
16 "Butter substitute" ,
17 "Cream" ,
18 "Sugar"
19 ] ,
20 "gender": "m" ,
21 "id": "808cedd5-f2ac-4724-98bc-061ee84755c9" ,
22 "name": "Forrest Jacobs"
23 } {

24 "eatenfoods": [
25 "Jerusalem artichoke" ,
26 "Conch" ,
27 "Milk and milk products" ,
28 "Dumpling" ,
29 "Custard apple" ,
30 "Sacred lotus" ,
31 "Japanese walnut" ,
32 "Crab"
33 ] ,
34 "favfoods": [
35 "Kiwi" ,
36 "Banana" ,
37 "Peanut" ,
38 "Asparagus" ,
39 "Common cabbage"
40 ] ,
41 "gender": "f" ,
42 "id": "d10b51d7-d321-4b41-bd7d-1367ede0eb30" ,
43 "name": "Luma Ramses"
44 }

How about finding users who like Kiwi most and have eaten Kiwi? We just need to change the
index function; this time we will use an anonymous function:

1 r.db("foodb").table("users")
2 .indexCreate(
3 'eatean-or-like-most',
4 function (user) {
5 return r.add(user("eatenfoods"), [user("favfoods").nth(0)])
6 }
7 , {multi: true})

Actually, a function is just a special case of an expression, where the expression is the result of executing the
function. Now, we can try to find:

1 r.db("foodb").table("users")
2 .getAll('Kiwi', {index: 'eatean-or-like-most'})
3 #=>
4 {
5 "eatenfoods": [
6 "Shrimp",
7 "Other fish product",
8 "Sweet orange",
9 "Unclassified food or beverage"
10 ],
11 "favfoods": [
12 "Kiwi",
13 "Lemon",
14 "Lime",
15 "Coffee",
16 "Sweet orange"
17 ],
18 "gender": "f",
19 "id": "0b83164e-fb42-4273-8db1-ba12be6e580d",
20 "name": "Carl Achiban"
21 } {
22 "eatenfoods": [
23 "Celery leaves",
24 "Kiwi",
25 "Rainbow trout",
26 "Chinese bayberry",
27 "Hyacinth bean",
28 "Other sandwich"
29 ],
30 "favfoods": [
31 "Honey",
32 "Cake",
33 "Butter substitute",
34 "Cream",
35 "Sugar"
36 ],
37 "gender": "m",
38 "id": "808cedd5-f2ac-4724-98bc-061ee84755c9",
39 "name": "Forrest Jacobs"
40 } {
41 "eatenfoods": [
42 "Jerusalem artichoke",

43 "Conch",
44 "Milk and milk products",
45 "Dumpling",
46 "Custard apple",
47 "Sacred lotus",
48 "Japanese walnut",
49 "Crab"
50 ],
51 "favfoods": [
52 "Kiwi",
53 "Banana",
54 "Peanut",
55 "Asparagus",
56 "Common cabbage"
57 ],
58 "gender": "f",
59 "id": "d10b51d7-d321-4b41-bd7d-1367ede0eb30",
60 "name": "Luma Ramses"
61 }

Checking index status


As I've said, indexing reduces write performance, and it also takes time to build after we issue the
creation command. Depending on the table size and how many records we have, we have to wait for some
time before using it. We can check the status of an index to see if it's ready to use:

1 table.indexStatus(indexName)

Such as:

1 r.db("foodb").table("compounds_foods").indexStatus("food_id")
2 #=>
3 {
4 "blocks_processed": 656 ,
5 "blocks_total": 11331 ,
6 "function": <binary, 181 bytes, "24 72 65 71 6c 5f..."> ,
7 "geo": false ,
8 "index": "food_id" ,
9 "multi": false ,
10 "outdated": false ,
11 "ready": false
12 }

The ready field indicates if the index is ready to use.


Sometimes we just want to say, when the index is ready, please run this:

1 table.indexWait(indexName)
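For example, to block until the compound index we created earlier in this chapter is ready, a sketch would be:

r.db("foodb").table("compounds_foods")
  .indexWait("compound_food_id")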
Using index
Ordering
Sorting with orderBy without an index is limited by the 100,000-element array limit, so always consider
using an index. We already learned about orderBy in an earlier chapter, but we didn't use an index at that time. It takes this
form of syntax:

1 table.orderBy([key1...], {index: index_name}) selection<stream>
2 selection.orderBy(key1, [key2...]) selection<array>
3 sequence.orderBy(key1, [key2...]) array

On a table, that is, the result of a table command, you can pass an index for sorting. For example,
we want to sort the table compounds_foods by name.
If we don't use an index we will get:

1 r.db("foodb")
2 .table("compounds_foods")
3 .orderBy("name")
4 //=>
5 RqlRuntimeError: Array over size limit `100000` in:
6 r.db("foodb").table("compounds_foods").orderBy("name")
7 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Therefore, we have to create an index:

1 r.db("foodb")
2 .table("compounds_foods")
3 .indexCreate("foodname", r.row("name"))

When the index is ready, the original query works if we tell it what index to use:

1 r.db("foodb")
2 .table("compounds_foods")
3 .orderBy("name", {index: "foodname"})

By default, the ordering is ascending. To change to descending, we simply wrap the index name in
r.desc (or r.asc for ascending explicitly).

1 r.db("foodb")
2 .table("compounds_foods")
3 .orderBy("name", {index: r.desc("foodname")})

Pagination
To paginate data, we will use a combination of skip, limit and slice. We already learned about them
in Chapter 3, but now we can use them in combination with orderBy and an index to make pagination more
efficient.

skip(n)
Skips a number of elements from the beginning of the sequence or array.
limit(n)
Ends the sequence after reading up to the given number of elements.
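For instance, here is a small sketch that combines them with orderBy to fetch 10 items for one page of our foods table; the offset 70 is just an illustrative value:

r.db("foodb")
  .table("foods")
  .orderBy(r.desc("name"))
  .skip(70)
  .limit(10)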

slice
Instead of manually doing pagination by using skip and limit, we can simply tell RethinkDB that
we want the data from one position to another position, similarly to how a slice function slices an array.
Let's take our foods table.

1 r.db("foodb")
2 .table("foods")
3 .orderBy(r.desc("name"))
4 .slice(10, 12)

This will return two rows, from positions 10 and 11. So we can calculate the slice positions for pagination.
For example, assume we have 10 items per page; for the page holding items 70 to 79 we can
do:

1 r.db("foodb")
2 .table("foods")
3 .orderBy(r.desc("name"))
4 .slice(70, 80)

Where else can I call skip, limit, slice?

These commands can be called on a selection, an array or a stream, so you can call
them in almost any situation.

Transform data
So far, we have always taken the value returned from RethinkDB and worked with it in our application. In any real
application, you will probably want to do some transformation on it. Doing it at the application level makes sense for
complex things, but for simple things we might waste an extra loop. Sometimes
we want to do things at the RethinkDB level, or we want to use the transformed data in another ReQL expression.
Examples are the functions nth and count. We call them on a sequence, or an array of data, and they
return different data: they transform the original data into a different piece of data. But nth and count
are simple functions; they don't have any complex logic inside them. Some transformations need
complex logic, and for that we have to use some kind of if...else command, or a loop.
To help us with that, RethinkDB has control structure functions such as branch (similar to
if), forEach and do. RethinkDB shines in these areas because the database engine essentially has an
embedded language in it.
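As a quick taste, branch works like an inline if/else that is evaluated on the server; this tiny sketch uses a made-up literal value:

r.branch(r.expr(25).gt(18), "adult", "teenager")
//=> "adult"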
Let's cover more of those functions. Along the way, we will learn some control structure commands.
Now, let's move on to our next function, map.

Map
Let's say we want to divide our users into 3 groups: those under 18 years old are teenagers, those between
18 and 50 are adults, and those over 50 years old are older. We have a pattern here: for each document in the
table, we want to calculate new data depending on its existing data. In RethinkDB, we use the map
function.
Map applies a function to each document, and the return value of the function is returned from the query. With our
example, in a normal programming language such as Ruby, we could write:

1 users.map do |user|
2   if user["age"] < 18
3     "teenager"
4   elsif user["age"] <= 50
5     "adult"
6   else
7     "older"
8   end
9 end

In RethinkDB, the format of map is:



1 sequence1.map([sequence2, ...], mappingFunction) stream
2 array1.map([array2, ...], mappingFunction) array
3 r.map(sequence1[, sequence2, ...], mappingFunction) stream
4 r.map(array1[, array2, ...], mappingFunction) array

For this example, we have to represent if as a RethinkDB function. We do that with branch:

1 r.branch(test, true_branch, false_branch) any

Time to write our function:

1 r.db("foodb").table("users").map(function (user) {
2 return r.branch(
3 user("age").lt(18),
4 "teenager",
5 r.branch(
6 user("age").gt(50),
7 "older",
8 "adult"
9 )
10 )
11 })
12 #=>
13 "older" "older" "older" "adult" "older" "older" "older" "adult" "older"
14 "older" "older" "adult" "teenager" "adult" "adult" "adult" "older" "adult"
15 "older" "older" "older" "older" "adult" "teenager" "adult" "older" "older"
16 "older" "older" "adult" "adult" "older" "adult" "older" "older" "older" "adult"
17 "older" "older" "older"

Yay, we get what we want. But it only returns the value from the function; we don't know who is who.
Well, that's map's job: it transforms the whole document into the return value. How about we return
an object with the original name field and our group field, like this:

1 r.db("foodb").table("users").map(function (user) {
2 return {
3 name: user("name"),
4 group: r.branch(
5 user("age").lt(18),
6 "teenager",
7 r.branch(
8 user("age").gt(50),
9 "older",
10 "adult"
11 )
12 )}
13 })
14 #=>
15 {
16 "group": "adult" ,
17 "name": "Arthur Hegmann"
18 } {
19 "group": "older" ,
20 "name": "Ricky Quigley Sr."
21 } {
22 "group": "older" ,
23 "name": "Jazmyne Brakus"
24 }
25 ....

Great. But if we want to return the whole document with an extra group field, do we have to repeatedly
write every field? No; let's meet merge.

1 r.db("foodb").table("users").map(function (user) {
2 return user.merge({
3 group: r.branch(
4 user("age").lt(18),
5 "teenager",
6 r.branch(
7 user("age").gt(50),
8 "older",
9 "adult"
10 )
11 )})
12 })

If you are wondering whether we can use row in the map expression instead of using a function, I would say:
yes, we can. But we cannot do that in the above example, because row doesn't work in nested queries.
In the above example, we nested r.branch inside merge, and inside another branch.
Let's see an example where we can use r.row: counting how many foods each user has eaten:

1 r.db("foodb").table("users").map({
2 name: r.row("name"),
3 total_eaten: r.row("eatenfoods").count()
4 })
5 #=>
6 {
7 "name": "Arthur Hegmann" ,
8 "total_eaten": 6
9 } {
10 "name": "Ricky Quigley Sr." ,
11 "total_eaten": 7
12 }

Inside the map function, we can use any arbitrary ReQL command to fetch data. In the following example, we
use getAll and an index to count data. Let's find how many flavors a compound has.
The table compounds has an associated table compounds_flavors which stores the relation between compounds
and flavors using two fields: compound_id and flavor_id. By counting how many items exist for a given
compound_id in the table compounds_flavors, we can get the total flavor count of a compound.

1 r.db('foodb')
2 .table('compounds')
3 .map(function(doc) {
4 return {
5 compound_id: doc('id'),
6 name: doc('name'),
7 flavor_total: r.db('foodb').table('compounds_flavors').getAll(doc('id'), {\
8 index: 'compound_id'}).count()
9 }
10 })
11 .orderBy(r.desc('flavor_total'))
12 //=>
13 [
14 {
15 "compound_id": 3266,
16 "flavor_total": 22,
17 "name": "Ethyl methyl sulfide"
18 },

19 {
20 "compound_id": 930,
21 "flavor_total": 19,
22 "name": "3-Ethylpyridine"
23 },...
24 ]

In this example, each compound is passed into the map function. The map function counts how
many flavors it has by querying the table compounds_flavors. We are using getAll with an index to make
it run fast, and we finally call .count() to count the total elements of the sequence. In the map function, we
explicitly return an object which we construct ourselves. This sometimes confuses people
into thinking that the function runs on the client side, which is wrong: the map function is executed on the RethinkDB
server. If we don't want to return a JSON object directly as above, we can use pluck to get only name
and compound_id and merge it with the extra flavor_total field like this:

1 r.db("foodb")
2 .table('compounds')
3 .map(function(doc) {
4 return doc.pluck('name', 'compound_id').merge({
5 flavor_total: r.db('foodb').table('compounds_flavors').getAll(doc('id'), {\
6 index: 'compound_id'}).count()
7 })
8 })
9 .orderBy(r.desc('flavor_total'))

Using map, we can pre-calculate some data, either using a function or an expression. Let's meet another
kind of mapping function.

concatMap
concatMap is very similar to map: it applies a function to every element of a sequence. But it's different
in that it also tries to flatten, or concat, all elements into a single sequence.

1 r.expr([1, 2, 3]).map(function(x) { return [x, x.mul(2)] })


2 #=> [[1, 2], [2, 4], [3, 6]]

However, concatMap will concat all sub sequence/array into final return data.

1 r.expr([1, 2, 3]).concatMap(function(x) { return [x, x.mul(2)] })


2 #=> [1, 2, 2, 4, 3, 6]

When is this useful? It looks rather useless, right? Well, let's look at the favfoods field. If we want to
get a list of favfoods across the entire system, we can do:

1 r.db("foodb").table("users").map(
2 r.row("favfoods")
3 )
4 #=>
5 [
6 "Garden tomato (var.)" ,
7 "Linden" ,
8 "Lowbush blueberry" ,
9 "American cranberry" ,
10 "Vanilla"
11 ],
12 [
13 "Swiss chard" ,
14 "Chicory roots" ,
15 "Grapefruit" ,
16 "Jostaberry" ,
17 "Spirit"
18 ]

As expected, we get an array of arrays. That's where concatMap shines:

1 r.db("foodb").table("users").concatMap(
2 r.row("favfoods")
3 )
4 #=>
5 Pikeperch
6 Pacific ocean perch
7 True seal
8 Columbidae (Dove, Pigeon)
9 Conch
10 Kiwi
11 Lemon
12 Lime
13 Coffee
14 Sweet orange
15 Kiwi

16 True seal
17 Salmonidae (Salmon, Trout)
18 ...

If you notice, we have duplicate data. That's natural, because many users may like the same foods.
To get distinct values, you can call distinct on the sequence:

1 r.db("foodb").table("users").concatMap(
2 r.row("favfoods")
3 ).distinct()
4 #=>
5 [
6 "Abalone" ,
7 "Abiyuch" ,
8 "Acerola" ,
9 "Acorn" ,
10 "Adobo" ,
11 "Adzuki bean" ,
12 ]

distinct accepts an index and uses it to tell documents apart. Without any specified index, it uses
the primary index, i.e. the id value, which is good enough in our case because we don't have foods with
the same name. Sometimes you may want to create an extra index for the field and call distinct using
that index, as sketched below.
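A hedged sketch of that: assuming a secondary index called name exists on the foods table (we create one like it in the JOIN chapter), distinct can use it directly:

r.db("foodb").table("foods").distinct({index: "name"})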
Let's dig into a more complex example: for each food, let's find all of its compounds.
Let's create an index first:

1 r.db("foodb").table("compounds_foods").indexCreate("food_id")

With that index, we can try our query:

1 r.db("foodb")
2 .table("foods")
3 .withFields("id", "name")
4 .concatMap(function (food) {
5 return
6 r.db("foodb").table("compounds_foods")
7 .getAll(food("id"), {index: "food_id"})
8 .withFields("id", "food_id", "compound_id", "orig_food_common_name",
9 "orig_compound_name")

10 .map(function(compound_food) {
11 return food.merge({compound: compound_food})
12 })
13 })
14 #=>
15 {
16 "compound": {
17 "compound_id": 21594 ,
18 "food_id": 2 ,
19 "id": 15609 ,
20 "orig_compound_name": "Fatty acids, total saturated" ,
21 "orig_food_common_name": "Cabbage, savoy, raw"
22 } ,
23 "id": 2 ,
24 "name": "Savoy cabbage"
25 } {
26 "compound": {
27 "compound_id": 21595 ,
28 "food_id": 2 ,
29 "id": 15610 ,
30 "orig_compound_name": "Fatty acids, total mono-unsaturated" ,
31 "orig_food_common_name": "Cabbage, savoy, raw"
32 } ,
33 "id": 2 ,
34 "name": "Savoy cabbage"
35 }

Note that we use the withFields command to limit the fields that we want to include in the returned
documents.
The above example won't work with map, because the return value from the function isn't a single value
but another sequence. If you attempt to use map, you can see the error clearly:

1 r.db("foodb")
2 .table("foods")
3 .withFields("id", "name")
4 .map(function (food) {
5 return
6 r.db("foodb").table("compounds_foods")
7 .getAll(food("id"), {index: "food_id"})
8 .withFields("id", "food_id", "compound_id", "orig_food_common_name",
9 "orig_compound_name")
10 .map(function(compound_food) {

11 return food.merge({compound: compound_food})


12 })
13 })
14 #=> RqlRuntimeError: Expected type DATUM but found SEQUENCE:

concatMap can operate on a sequence value returned from the function, and tries to flatten it for us. Let's get back
to this query and break down how it works:

1 r.db("foodb")
2 .table("foods")
3 .withFields("id", "name")
4 .concatMap(function (food) {
5 return
6 r.db("foodb").table("compounds_foods")
7 .getAll(food("id"), {index: "food_id"})
8 .withFields("id", "food_id", "compound_id", "orig_food_common_name",
9 "orig_compound_name")
10 .map(function(compound_food) {
11 return food.merge({compound: compound_food})
12 })
13 })

Every document of foods is transformed into documents similar to this:

1 [
2 {id: id1, name: name1, compound: compound_food_document1},
3 {id: id1, name: name1, compound: compound_food_document2},
4 {id: id1, name: name1, compound: compound_food_documentn},...
5 ]

That means, with a sequence of foods, we will have this (without concatMap):

1 [
2 {id: id1, name: name1, compound: compound_food_document1},
3 {id: id1, name: name1, compound: compound_food_document2},
4 {id: id1, name: name1, compound: compound_food_documentn},...
5 ],
6 [
7 {id: id2, name: name2, compound: compound_food_document1},
8 {id: id2, name: name2, compound: compound_food_document2},
9 {id: id2, name: name2, compound: compound_food_documentn},...
10 ],
11 ...

But concatMap will flatten the array and we have this:



1 {id: id1, name: name1, compound: compound_food_document1},
2 {id: id1, name: name1, compound: compound_food_document2},
3 {id: id1, name: name1, compound: compound_food_documentn},
4 ...
5 {id: id2, name: name2, compound: compound_food_document1},
6 {id: id2, name: name2, compound: compound_food_document2},
7 {id: id2, name: name2, compound: compound_food_documentn},
8 ...

That's the power of concatMap: we can use it to achieve a join effect. But one problem remains: as you can
see, we have many documents with the same food but a different compound of that food. We would like to
compile them into an array like this:

1 {id: id1, name: name1,


2 compound: [compound_food_document1, compound_food_document2, ...]
3 },
4 {id: id2, name: name2,
5 compound: [compound_food_documentn, compound_food_documentn, ...]
6 },...

Lets tweak our concatMap a bit.

1 r.db("foodb")
2 .table("foods")
3 .withFields("id", "name")
4 .concatMap(function (food) {
5 return [
6 food.merge({compound:
7 r.db("foodb").table("compounds_foods")
8 .getAll(food("id"), {index: "food_id"})
9 .withFields("id", "food_id", "compound_id", "orig_food_common_name",
10 "orig_compound_name")
11 .limit(10)
12 })
13 ]
14 })

First off, you notice that we wrapped food.merge in an array. Why so? Because concatMap expects the return
value from the function to be a sequence, whereas map expects the return value to be a DATUM. We
also call limit(10) on the compounds_foods sequence to limit it to the first 10 results, sorted by the primary key,
the id field of the compounds_foods table in this case.
Instead of creating a map to loop over the compounds_foods sequence and creating a document, we
simply bring the whole compounds_foods array to merge into the food document. Looks good; run this,
and here comes the error:

1 RqlRuntimeError: Expected type DATUM but found SEQUENCE:

Well, merge expects a DATUM. A datum is a single value such as a number or an array.
However, we are passing a SEQUENCE. You can think of it as expecting a primitive value
but passing a cursor; like in Ruby, when we expect an array but pass an enumerator. In
RethinkDB, to make this kind of merge work, we have to explicitly convert the sequence to an array, using
coerceTo with the parameter 'array'.

1 r.db("foodb")
2 .table("foods")
3 .withFields("id", "name")
4 .concatMap(function (food) {
5 return [
6 food.merge({compound:
7 r.db("foodb").table("compounds_foods")
8 .getAll(food("id"), {index: "food_id"})
9 .withFields("id", "food_id", "compound_id", "orig_food_common_name",
10 "orig_compound_name")
11 .limit(10)
12 .coerceTo('array')
13 })
14 ]
15 })
16 #=>
17 {
18 "compound": [ ... ] ,
19 "id": 2 ,
20 "name": "Savoy cabbage"
21 } {
22 "compound": [ ... ] ,
23 "id": 15 ,
24 "name": "Wild celery"
25 }

Now, with the coerceTo function, we know that we can convert a sequence into an array. So can we achieve
this with map instead of concatMap? Yes, and it's even simpler:

1 r.db("foodb")
2 .table("foods")
3 .withFields("id", "name")
4 .map(function (food) {
5 return
6 food.merge({compound:
7 r.db("foodb").table("compounds_foods")
8 .getAll(food("id"), {index: "food_id"})
9 .withFields("id", "food_id", "compound_id", "orig_food_common_name",
10 "orig_compound_name")
11 .limit(10)
12 .coerceTo('array')
13 })
14 })

We no longer have to wrap it in [] because map can work with an OBJECT. With this example, we see
that concatMap and map can sometimes be used interchangeably, depending on how we want to model
the data.

Index and Map


I promised you an answer to the question of who likes both Kiwi and Banana. As you can guess, we
have to create a multi index where each value is a pair of favourite foods. It's very hard to do this
without map; we are effectively thinking of two nested loops. With map, we can do this:

1 r.db("foodb")
2 .table("users")
3 .indexCreate("food-test-idx-1", function (food) {
4 return
5 food("favfoods")
6 .concatMap(function (favfood) {
7 return
8 food("favfoods").map(function (favfood2) {
9 return [favfood, favfood2]
10 })
11 })
12 }, {multi: true})

We do a map: for each element of favfoods, we create a new array of
two elements by combining that element with each element of favfoods itself. We
use concatMap to flatten the array, and set multi: true because this is a multi index: we are
returning an array from the index function.
And now, we can use that index by passing a pair of values:

1 r.db("foodb")
2 .table("users")
3 .getAll(['Spirit', 'Grapefruit'], {index: "food-test-idx-1"})
4 .withFields("name", "favfoods")
5 #=>Executed in 9ms. 2 rows returned
6 {
7 "favfoods": [
8 "Swiss chard" ,
9 "Chicory roots" ,
10 "Grapefruit" ,
11 "Jostaberry" ,
12 "Spirit"
13 ] ,
14 "name": "Wilburn Price"
15 } {
16 "favfoods": [
17 "Chicory roots" ,
18 "Grapefruit" ,
19 "Jostaberry" ,
20 "Spirit" ,
21 "Abiyuch"
22 ] ,
23 "name": "Audie Muller"
24 }

Using withFields to quickly pluck some fields

I often use withFields to pluck only the fields that I care about. That way the result set is easier
to read.

The order isn't important here, because we create the pair for every element against every element, including itself.
For example, with the array [a,b,c], we will get these pairs via map:
[a,a], [a,b], [a,c], [b,a], [b,b], [b,c], [c,a], [c,b], [c,c]
We can eliminate [a,a], [b,b], [c,c] because they are useless. We just need to put them in a branch
command:

1 r.db("foodb")
2 .table("users")
3 .indexCreate("food-test-idx-2", function (food) {
4 return food("favfoods").concatMap(function (favfood) {
5 return food("favfoods").map(function (favfood2) {
6 return r.branch(favfood.eq(favfood2),
7 [],
8 [favfood, favfood2]
9 )
10 })
11 })
12 }, {multi: true})

For each pair of elements, if the two values are the same, we return an empty array;
otherwise, we return the array combining them.
This runs fast, but it has its own limitation: we cannot create more than 256 entries in a multi index.
Another approach, not as fast but scaling better, is to use a double lookup. First, we find all documents
matching one value; from that result, we find all documents matching the second value. In other words, first we
getAll using an index, then filter the result.
Let's try it. First create an index:

1 r.db("foodb")
2 .table("users")
3 .indexCreate("food-test-idx-3", r.row('favfoods'), {multi: true})
4
5 // or equivalently, using an anonymous function:
6 r.db("foodb")
7 .table("users")
8 .indexCreate("food-test-idx-3", function (user) {
9 return user("favfoods")
10 }, {multi: true})

Now, we first use that index to find all users who like the first fruit, then use filter to keep only users
who also like the second fruit:

1 r.db("foodb")
2 .table("users")
3 .getAll("Spirit", {index: "food-test-idx-3"})
4 .filter(r.row("favfoods").contains("Grapefruit"))
5 .withFields("name", "favfoods")
6 //=>Executed in 9ms. 2 rows returned
7 [
8 {
9 "favfoods": [
10 "Swiss chard" ,
11 "Chicory roots" ,
12 "Grapefruit" ,
13 "Jostaberry" ,
14 "Spirit"
15 ] ,
16 "name": "Wilburn Price"
17 } {
18 "favfoods": [
19 "Chicory roots" ,
20 "Grapefruit" ,
21 "Jostaberry" ,
22 "Spirit" ,
23 "Abiyuch"
24 ] ,
25 "name": "Audie Muller"
26 }
27 ]

On the result of getAll, we run a filter. If the result of getAll contains more than 100,000
elements, this method won't work. We also cannot chain another getAll with an index onto the result of a
getAll, because getAll returns a SELECTION, not a table:

1 r.db("foodb")
2 .table("users")
3 .getAll(['Spirit', 'Grapefruit'], {index: "food-test-idx-1"})
4 .typeOf()
5 //=>
6 "SELECTION<STREAM>"

The important map/concatMap


On the surface, filter is powerful because it accepts a function and evaluates that function to filter
data based on its result. It's very similar to the array filter that we usually have in a language's standard
library.
The only downside is slowness, which is understandable. Therefore, we usually have to use getAll to
leverage an index. But getAll only queries data based on the value of an index, which doesn't give the
flexibility of filter, because we have to calculate an index function ahead of time, whereas filter is evaluated
dynamically.
Consider this example. Assume we have an orders table containing orders, and another orderItems
table which contains the items of an order.
We can create some data:

1 r.tableCreate("orders")
2 r.tableCreate("orderItems")
3
4 r.table("orders").insert([
5 {id:1, shipped: 1}, {id:2, shipped: 0}, {id:3, shipped: 1},
6 ])
7
8 r.table("orderItems").insert([
9 {id:1, name: "f1", orderId: 1},
10 {id:2, name: "f2", orderId: 1},
11 {id:3, name: "a1", orderId: 2},
12 {id:4, name: "f2", orderId: 2},
13 {id:5, name: "a3", orderId: 2},
14 {id:6, name: "b3", orderId: 3},
15 {id:7, name: "b4", orderId: 3},
16 ])

Question: find all items of every shipped order. With filter, we can quickly do this:

1 r.table('orderItems')
2 .filter(function(orderItem) {
3 return r.table('orders').get(orderItem('orderId'))('shipped').default(0).gt(0)
4 })

However, this is bad because of filter's limitations. So ideally we only have the getAll choice. But getAll
only returns data from the table whose index it is using. In our case, the shipped status is stored in the orders
table, whereas we want to fetch data from orderItems. In other words, getAll doesn't give us the data we
want; we have to transform it into what we want by using map/concatMap.
Let's add two indexes:

1 r.table("orderItems").indexCreate("orderId")
2 r.table("orders").indexCreate("shipStatus", r.row("shipped").default(0).gt(0))

With those indexes, we can find all shipped orders:

1 r.table("orders").getAll(true, {index: "shipStatus"})

Now, we will use concatMap to transform each order into its orderItems:

1 r.table("orders")
2 .getAll(true, {index: "shipStatus"})
3 .concatMap(function(order) {
4 return r.table("orderItems").getAll(order("id"), {index: "orderId"}).coerc\
5 eTo("array")
6 })

So now we basically have the power of filter: we can query whatever data we need inside the concatMap
function (just be sure to use a proper index). The way of thinking is the reverse of filter.
With filter, we start the query on the table we want data from. With getAll/concatMap, we start the query on
the table that contains the condition, then use concatMap to join the data across
tables. An alternative sketch using eqJoin is shown below.
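As an aside, the same result can also be sketched with eqJoin, which we only cover in the next chapter; this assumes the shipStatus and orderId indexes from above:

r.table("orders")
  .getAll(true, {index: "shipStatus"})
  .eqJoin("id", r.table("orderItems"), {index: "orderId"})
  .map(function (pair) { return pair("right") })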
map and concatMap are really important, and you should take a bit of time to play around and master
them.
Wrap up
This chapter is quite long. So far we've learned about indexes. We know how to create:

* Simple indexes
* Compound indexes
* Multi indexes
* Arbitrary expression indexes

We've also learned how to leverage an index to sort and filter data. You also learned two important transform
functions:

* map: transforms source data into another value
* concatMap: like map, but it flattens the returned arrays

And finally, by leveraging map, you can easily create multi indexes.
6. Data Modeling With JOIN

Join is a joy to work with, in my opinion. It makes the data model easier to design. Without join, we
have to either embed documents or join data in our application code instead of the database taking care of it for us.
With embedded documents, we will hit a limit on document size, because a document is usually
loaded entirely into memory. Embedding documents has its own advantages, such as simple queries, but this
chapter will focus on data modeling with JOIN.
Using Join
eqJoin
In RethinkDB, JOIN is automatically distributed, meaning that if you run on a cluster, the data will be
combined from many servers and the final result presented to you.
In SQL, you can join almost anything by making sure that the records in the two tables match
a condition. An example:

1 SELECT post.*
2 FROM post
3 JOIN comment ON comment.post_id=post.id
4
5 # OR
6
7 SELECT post.*
8 FROM post
9 JOIN comment ON comment.author_id=post.author_id

You don't even need to care about the index: the database is usually smart enough to figure out
what index to use, or will scan the full table for you.
Join is a bit different in RethinkDB. Similar to how we have primary and secondary indexes, we
usually need an index to join in RethinkDB. Generally, we use the techniques below in a JOIN command:

* primary keys
* secondary indexes
* sub queries

Let's go over them one by one.

Join with primary index


Let's start with a one-to-many relationship: find all compounds and their synonyms. First, let's see the
eqJoin syntax, the basic command for joining data:

1 sequence.eqJoin(leftField, rightTable[, {index:'id'}]) sequence
2 sequence.eqJoin(function, rightTable[, {index:'id'}]) sequence

It tries to find the documents in rightTable whose index value matches the leftField value or the
return value of the function. It's similar to a normal JOIN in MySQL:

1 SELECT
2 FROM sequence
3 JOIN rightTable
4 ON sequence.leftField = rightTable.id

In plain English, eqJoin tries to find pairs of documents on the left table (sequence) and rightTable where the
value of the right table's index (by default, the primary index) matches the value of leftField on the left table,
or the return value of the function we passed into eqJoin.
So, to find all compounds and their synonyms, we can do:

1 r.db("foodb")
2 .table("compound_synonyms")
3 .eqJoin("compound_id", r.db("foodb").table("compounds"))

And we get this:

1 "left": {
2 "compound_id": 82 ,
3 "created_at": Fri Apr 09 2010 17:40:05 GMT-07:00 ,
4 "id": 832 ,
5 "source": "db_source" ,
6 "synonym": "3,4,2',4'-Tetrahydroxychalcone" ,
7 "updated_at": Fri Apr 09 2010 17:40:05 GMT-07:00
8 } ,
9 "right": {
10 "annotation_quality": "low" ,
11 "assigned_to_id": null ,
12 "bigg_id": null ,
13 "boiling_point": null ,
14 "boiling_point_reference": null ,
15 "cas_number": null ,
16 "charge": null ,
17 "charge_reference": null ,
18 "chebi_id": null ,
19 "comments": null ,

20 "compound_source": "PHENOLEXPLORER" ,
21 "created_at": Thu Apr 08 2010 22:04:26 GMT-07:00 ,
22 "creator_id": null ,
23 "density": null ,
24 "density_reference": null ,
25 "wikipedia_id": null,
26 //...lot of other fields
27 ...
28 }
29 }

We get back a sequence where the elements of both tables match our condition. We can see that the item on
the left has its compound_id matching the id field of the one on the right. However, the above result with
left and right fields is not very useful. It would be more useful if we could merge both sides into a single document.
To do that, we use zip:

1 r.db("foodb")
2 .table("compound_synonyms")
3 .eqJoin("compound_id", r.db("foodb").table("compounds"))
4 .zip()
5 //=>
6 {
7 "annotation_quality": "low" ,
8 "assigned_to_id": null ,
9 "bigg_id": null ,
10 "boiling_point": "Bp14 72" ,
11 "boiling_point_reference": "DFC" ,
12 "cas_number": "15707-34-3" ,
13 "charge": null ,
14 "charge_reference": null ,
15 "chebi_id": null ,
16 "comments": null ,
17 "compound_id": 923 ,
18 //...lot of other fields
19 },
20 //other document here as well

What zip does is merge the right document into the left document and return that document,
instead of a document with two left and right fields.
zip is not very flexible because it simply merges all the fields. We can use a transform function
to turn the document into a more readable one, since we only care about the name and its
synonym:

1 r.db("foodb")
2 .table("compound_synonyms")
3 .eqJoin(
4 "compound_id",
5 r.db("foodb").table("compounds")
6 )
7 .map(function (doc) {
8 return {synonym: doc("left")("synonym"), name: doc("right")("name")}
9 })
10 //=>
11 {
12 "name": "Butein" ,
13 "synonym": "Acrylophenone, 2',4'-dihydroxy-3-(3,4-dihydroxyphenyl)-"
14 },
15 {
16 "name": "3,4-Dimethoxybenzoic acid" ,
17 "synonym": "Benzoic acid, 3,4-dimethoxy-"
18 }

Much cleaner now! The important thing is that the joined data is just another stream or array, and we
can do transformations on it.
As you may have seen, we didn't specify an index in the above query. When we don't specify an index, RethinkDB
uses the primary index of the table. In this case, the primary key is the id field of the table compounds.

Join with secondary index


Now, looking at the previous query, it seems a bit awkward because the table compound_synonyms comes
first. We can make it more natural, following the syntax above: for each document in compounds,
fetch all documents in compound_synonyms whose compound_id field matches the id of the
document in the compounds table. To do that, we have to have an index on the compound_synonyms table
for the compound_id field. Let's create an index for it:

1 r.db("foodb").table("compounds")
2 .indexCreate("compound_id")

Note that we can always query index status by using indexStatus

1 r.db("foodb").table("compound_synonyms").indexStatus()

We wait until we get status ready (this table is really big, by the way):



1 [
2 {
3 "function": <binary, 185 bytes, "24 72 65 71 6c 5f..."> ,
4 "geo": false ,
5 "index": "compound_id" ,
6 "multi": false ,
7 "outdated": false ,
8 "ready": true
9 }
10 ]

Lets try it out:

1 r.db("foodb")
2 .table("compounds")
3 .eqJoin("id", r.db("foodb").table("compound_synonyms"), {index: 'compound_id'})
4 .map(function (doc) {
5 return {synonym: doc("right")("synonym"), name: doc("left")("name")}
6 })
7 //=>
8 {
9 "name": "Butein" ,
10 "synonym": "3-(3,4-Dihydroxy-phenyl)-1-(2,4-dihydroxy-phenyl)-propenone"
11 } {
12 "name": "Butein" ,
13 "synonym": "2',3,4,4'-Tetrahydroxychalcone"
14 }

With a proper index, the query looks cleaner and more natural. The order of how we use eqJoin is
important: try to narrow down the data first if possible, to make the join do less work.
Also, instead of passing a field name to eqJoin, we can pass a function or use the row command
to get the value of a nested field. In this case, the return value of the function, or the value of the field accessed with
row, will be used to match against the value of the index on the right table. These are especially useful with
structured data in a field.
Do you remember that we have a users table with a data structure that looks like this:

1 r.db('foodb').table('users')
2 {
3 "age": 40 ,
4 "eatenfoods": [
5 "True sole" ,
6 "Jerusalem artichoke" ,
7 "Ascidians" ,
8 "Pineappple sage" ,
9 "Lotus" ,
10 "Coffee and coffee products"
11 ] ,
12 "favfoods": [
13 "Edible shell" ,
14 "Clupeinae (Herring, Sardine, Sprat)" ,
15 "Deer" ,
16 "Perciformes (Perch-like fishes)" ,
17 "Bivalvia (Clam, Mussel, Oyster)"
18 ] ,
19 "gender": "m" ,
20 "id": "1dd8059c-82ca-4345-9d75-eaa0f8edbf48" ,
21 "name": "Arthur Hegmann"
22 ...
23 }

Let's try to find more information about each user's most favourite food.
First, let's create an index on the food name:

1 r.db('foodb').table('foods').indexCreate('name')

With that index, we can join data:

1 r.db('foodb').table('users')
2 .eqJoin(r.row('favfoods').nth(0), r.db('foodb').table('foods'), {index: 'name'\
3 })
4 //=>
5 {
6 "left": {
7 "age": 40,
8 "eatenfoods": [
9 "True sole",
10 "Jerusalem artichoke",

11 "Ascidians",
12 "Pineappple sage",
13 "Lotus",
14 "Coffee and coffee products"
15 ],
16 "favfoods": [
17 "Edible shell",
18 "Clupeinae (Herring, Sardine, Sprat)",
19 "Deer",
20 "Perciformes (Perch-like fishes)",
21 "Bivalvia (Clam, Mussel, Oyster)"
22 ],
23 "gender": "m",
24 "id": "1dd8059c-82ca-4345-9d75-eaa0f8edbf48",
25 "name": "Arthur Hegmann"
26 },
27 "right": {
28 "created_at": Wed Dec 21 2011 02: 40: 48 GMT - 08: 00,
29 "creator_id": 2,
30 "description": null,
31 "food_group": "Baking goods",
32 "food_subgroup": "Wrappers",
33 "food_type": "Type 2",
34 "id": 868,
35 "itis_id": null,
36 "legacy_id": null,
37 "name": "Edible shell",
38 "name_scientific": null,
39 "picture_content_type": "image/jpeg",
40 "picture_file_name": "868.jpg",
41 "picture_file_size": 51634,
42 "picture_updated_at": Fri Apr 20 2012 09: 39: 05 GMT - 07: 00,
43 "updated_at": Fri Apr 20 2012 16: 39: 06 GMT - 07: 00,
44 "updater_id": 2,
45 "wikipedia_id": null
46 }
47 }
48 //....

We can get the same result using the function syntax:



1 r.db('foodb').table('users')
2 .eqJoin(function(user) { return user('favfoods').nth(0) },
3 r.db('foodb').table('foods'),
4 {index: 'name'})

So basically, passing a field name is just a shortcut for using r.row(fieldName). Using row or a
function gives us much more flexibility. Also remember that the row command cannot be used in
sub queries.
Since the beginning, the way a join is constructed is to match documents between two tables based
on the value and matching of an index. But how can we join data across two tables based on two
fields? In real life, we may have even more complex join conditions. For example, in MySQL we can join
with basically any arbitrary condition, like this:

1 SELECT *
2 FROM table1 as t1
3 JOIN table2 as t2 ON t1.field1=t2.field1 AND t1.foo=t2.bar

Let's think of an example. Suppose I want to join data from the table compounds and its compound_synonyms
where the source is biospider and the record was created after 2013. Obviously we cannot use a single
field here with eqJoin.
Luckily, we have another way of joining data: using sub queries with concatMap and getAll.
However, since these are not the eqJoin command, we will learn about sub queries later in this chapter.
For now, let's move on to the other join commands.
To join, we usually need an index. But can we join data without using any index, over two arbitrary
sequences? Even if it's not very efficient, it would be useful to have. The answer is yes: we can do inner joins
and outer joins.

innerJoin
innerJoin returns an intersection of two sequences: each row of the first sequence is put
together with each row of the second sequence, then a predicate function is evaluated to find the pairs of rows for
which the predicate function returns true. The syntax of innerJoin is:

1 sequence.innerJoin(otherSequence, predicate) stream
2 array.innerJoin(otherSequence, predicate) array

The predicate function accepts two parameters: a row of the first sequence and a row of the second sequence.
If the first sequence has M rows and the second sequence has N rows, innerJoin will loop M
x N times and pass each pair of rows into the predicate function. Let's say we have two sequences:

1 [2,5,8,12,15,20,21,24,25]
2 [2,3,4]

And we want to find all pairs where the first element modulo the second element equals zero.
We can write this:

1 r.expr([2,5,8,12,15,20,21,24,25])
2 .innerJoin(
3 r.expr([2,3,4]),
4 function (left, right) {
5 return left.mod(right).eq(0)
6 }
7 )
8 //=>
9 [
10 {
11 "left": 2 ,
12 "right": 2
13 } ,
14 {
15 "left": 8 ,
16 "right": 2
17 } ,
18 {
19 "left": 8 ,
20 "right": 4
21 } ,
22 {
23 "left": 12 ,
24 "right": 2
25 } ,
26 {
27 "left": 12 ,
28 "right": 3
29 } ,
30 {
31 "left": 12 ,
32 "right": 4
33 } ,
34 {
35 "left": 15 ,
36 "right": 3

37 } ,
38 {
39 "left": 20 ,
40 "right": 2
41 } ,
42 {
43 "left": 20 ,
44 "right": 4
45 } ,
46 {
47 "left": 21 ,
48 "right": 3
49 } ,
50 {
51 "left": 24 ,
52 "right": 2
53 } ,
54 {
55 "left": 24 ,
56 "right": 3
57 } ,
58 {
59 "left": 24 ,
60 "right": 4
61 }
62 ]

RethinkDB will loop 27 times (9x3) and evaluate the function to find matching rows. Because of this evaluation,
and because no index is involved, this function is slow.
Here is another example with real data. Let's find all foods and their compounds_foods:

1 r.db("foodb")
2 .table("foods")
3 .innerJoin(
4 r.db("foodb").table("compounds_foods"),
5 function(food, compound_food) {
6 return food("id").eq(compound_food("food_id"))
7 }
8 )
9 //=>
10 {
11 "left": {

12 "created_at": Wed Feb 09 2011 00:37:15 GMT-08:00 ,


13 "creator_id": null ,
14 "description": null ,
15 "food_group": "Vegetables" ,
16 "food_subgroup": "Cabbages" ,
17 "food_type": "Type 1" ,
18 "id": 2 ,
19 "itis_id": null ,
20 "legacy_id": 2 ,
21 "name": "Savoy cabbage" ,
22 "name_scientific": "Brassica oleracea var. sabauda" ,
23 "picture_content_type": "image/jpeg" ,
24 "picture_file_name": "2.jpg" ,
25 "picture_file_size": 155178 ,
26 "picture_updated_at": Fri Apr 20 2012 09:39:54 GMT-07:00 ,
27 "updated_at": Fri Apr 20 2012 16:39:55 GMT-07:00 ,
28 "updater_id": null ,
29 "wikipedia_id": null
30 } ,
31 "right": {
32 "citation": "DTU" ,
33 "citation_type": "DATABASE" ,
34 "compound_id": 13831 ,
35 "created_at": Tue Dec 13 2011 18:54:33 GMT-08:00 ,
36 "creator_id": null ,
37 "food_id": 2 ,
38 "id": 15619 ,
39 "orig_citation": null ,
40 "orig_compound_id": "0014" ,
41 "orig_compound_name": "Vitamin A, total" ,
42 "orig_content": "0.5E2" ,
43 "orig_food_common_name": "Cabbage, savoy, raw" ,
44 "orig_food_id": "0674" ,
45 "orig_food_part": null ,
46 "orig_food_scientific_name": null ,
47 "orig_max": null ,
48 "orig_method": null ,
49 "orig_min": null ,
50 "orig_unit": "RE" ,
51 "orig_unit_expression": null ,
52 "updated_at": Tue Dec 13 2011 18:54:33 GMT-08:00 ,
53 "updater_id": null

54 }
55 } {
56 "left": {
57 "created_at": Wed Feb 09 2011 00:37:15 GMT-08:00 ,
58 "creator_id": null ,
59 "description": null ,
60 "food_group": "Vegetables" ,
61 "food_subgroup": "Cabbages" ,
62 "food_type": "Type 1" ,
63 "id": 2 ,
64 "itis_id": null ,
65 "legacy_id": 2 ,
66 "name": "Savoy cabbage" ,
67 "name_scientific": "Brassica oleracea var. sabauda" ,
68 "picture_content_type": "image/jpeg" ,
69 "picture_file_name": "2.jpg" ,
70 "picture_file_size": 155178 ,
71 "picture_updated_at": Fri Apr 20 2012 09:39:54 GMT-07:00 ,
72 "updated_at": Fri Apr 20 2012 16:39:55 GMT-07:00 ,
73 "updater_id": null ,
74 "wikipedia_id": null
75 } ,
76 "right": {
77 "citation": "DTU" ,
78 "citation_type": "DATABASE" ,
79 "compound_id": 1014 ,
80 "created_at": Tue Dec 13 2011 18:54:33 GMT-08:00 ,
81 "creator_id": null ,
82 "food_id": 2 ,
83 "id": 15630 ,
84 "orig_citation": null ,
85 "orig_compound_id": "0038" ,
86 "orig_compound_name": "Niacin, total" ,
87 "orig_content": "0.522E0" ,
88 "orig_food_common_name": "Cabbage, savoy, raw" ,
89 "orig_food_id": "0674" ,
90 "orig_food_part": null ,
91 "orig_food_scientific_name": null ,
92 "orig_max": null ,
93 "orig_method": null ,
94 "orig_min": null ,
95 "orig_unit": "NE" ,

96 "orig_unit_expression": null ,
97 "updated_at": Tue Dec 13 2011 18:54:33 GMT-08:00 ,
98 "updater_id": null
99 }
100 }

It takes a very long time, because we have 888 documents in the foods table and 10,959 documents in the
compounds_foods table. It has to run the predicate function 888 * 10,959 = 9,731,592 times. On my laptop, it runs
in:

Executed in 2min 25.86s. 40 rows returned, 40 displayed, more available

Basically, innerJoin is the equivalent of a table scan in MySQL. We should avoid using it on any
significant amount of data.

outerJoin
innerJoin is an intersection of two sequences where a pair of documents satisfies a condition. How
about something similar to a left join in SQL? Let's meet outerJoin.
outerJoin returns all documents of the left sequence. For each document, it tries to match it
with every document on the right-hand side. If a pair satisfies the predicate function, the pair is returned. If
not, only the document from the left sequence is returned. At the very least, the final sequence will include
all documents of the left sequence. Using the same data set, but with outerJoin:

1 r.expr([2,5,8,12,15,20,21,24,25])
2 .outerJoin(
3 r.expr([2,3,4]),
4 function (left, right) {
5 return left.mod(right).eq(0)
6 }
7 )
8 //=>
9 [
10 {
11 "left": 2 ,
12 "right": 2
13 } ,
14 {
15 "left": 5
16 } ,

17 {
18 "left": 8 ,
19 "right": 2
20 } ,
21 {
22 "left": 8 ,
23 "right": 4
24 } ,
25 {
26 "left": 12 ,
27 "right": 2
28 } ,
29 {
30 "left": 12 ,
31 "right": 3
32 } ,
33 {
34 "left": 12 ,
35 "right": 4
36 } ,
37 {
38 "left": 15 ,
39 "right": 3
40 } ,
41 {
42 "left": 20 ,
43 "right": 2
44 } ,
45 {
46 "left": 20 ,
47 "right": 4
48 } ,
49 {
50 "left": 21 ,
51 "right": 3
52 } ,
53 {
54 "left": 24 ,
55 "right": 2
56 } ,
57 {
58 "left": 24 ,

59 "right": 3
60 } ,
61 {
62 "left": 24 ,
63 "right": 4
64 } ,
65 {
66 "left": 25
67 }
68 ]

5 and 25 are not divisible by any of the numbers on the right-hand side. Therefore, those returned documents contain only the left-hand document.

Name conflict
In the SQL world, we can alias columns to avoid conflicts when joining. What does RethinkDB give us when we use the zip command to merge the documents? We lose the conflicting fields from the left sequence. We have several ways to address this.
Firstly, if we want to use zip:

* Removing conflicting fields: by simply removing the fields we don't want before zip, we can get what we want.
* What if we want to keep both fields? We can rename one of them, using `map` instead of zip.

Secondly, we don't have to use zip at all; we can merge the documents ourselves with map and keep only what we want, as sketched below.
However, these are still just work-arounds rather than a real fix. Luckily, the RethinkDB team is aware of this and is working on it (see https://github.com/rethinkdb/rethinkdb/issues/1855).
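For illustration, here is a rough sketch of both workarounds on an eqJoin between compounds_foods and foods. The exact fields to drop are just an assumption; pick whichever conflicting fields you do not need.

// Workaround 1: drop the conflicting fields from the right side before zip
r.db('foodb').table('compounds_foods')
  .eqJoin('food_id', r.db('foodb').table('foods'))
  .without({right: {id: true, created_at: true, updated_at: true}})
  .zip()

// Workaround 2: skip zip and rename while merging with map
r.db('foodb').table('compounds_foods')
  .eqJoin('food_id', r.db('foodb').table('foods'))
  .map(function (row) {
    return row('left').merge({food_name: row('right')('name')})
  })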

Using sub queries

As I promised before, in many complex cases eqJoin won't work for us, and we can use concatMap and getAll instead. Now is the time to learn them. We will slowly go over each kind of relation again.
One to many relation

Almost all of the JOIN commands above operate on two sequences. What if we have exactly one document, which we got by using get, and we want to join some data onto it? If we think a bit, we can see that in RethinkDB we can query whatever extra data we need inside an anonymous function. What if we query the extra data inside such a function and then merge it with the parent document? That has the same effect as a JOIN. Let's do it. Say we want to know the flavors of Kiwi, a fruit I have never eaten. We have a table compounds_flavors which contains associations between a compound and a flavor, and a table compounds_foods which contains associations between a compound and a food. So basically, we can do this:

* Given a food, we know its ID.
* Find all of its compounds using the compounds_foods table.
* For each compound, find its flavors by querying compounds_flavors for associations with the flavors table. In other words, find the flavor_id values for that compound.

First of all, we need some indexes:

1 // We need this index to look up by food_id on `compounds_foods`


2 r.db('foodb').table('compounds_foods').indexCreate('food_id')
3
4 // We need this index to find on `compounds_flavors` by compound_id
5 r.db("foodb").table("compounds_flavors").indexCreate('compound_id')

With those indexes, let's build our query step by step. First, we select Kiwi; its ID is 4. Then we call the merge command.

1 r.db("foodb")
2 .table("foods")
3 .get(4)
4 .merge(function (food) {
5
6 return {
7 flavors: //flavor array here
8 }
9 })

Let's see what we will fill into the flavors array. We will try to grab all of the food's compounds. That means all documents of the compounds_foods table whose food_id is equal to the ID of Kiwi.
1 r.db("foodb")
2 .table("foods")
3 .get(4)
4 .merge(function (food) {
5 return {
6 flavors:
7 r.db("foodb").table("compounds_foods")
8 .getAll(food("id"),{index: "food_id"})
9 .concatMap(function(compound_food) {
10 //Return something flavor of compound here
11 })
12 .coerceTo("array")
13 }
14 })

Notice that we used concatMap so that it flattens the array for us. We also used coerceTo to convert the selection result into an array for the merge command. For each document of compounds_foods we can basically get all of its flavors as follows:

1 r.db("foodb").table("compounds_flavors")
2 .getAll(compound_food("compound_id"), {index: "compound_id"})
3 .concatMap(function(compounds_flavor) {
4 return
5 r.db("foodb").table("flavors").getAll(compounds_flavor("fl\
6 avor_id"))
7 .map(function (flavor) {
8 return flavor("name")
9 })
10 .coerceTo("array")
11 })
12 .coerceTo("array")

Putting it all together, we have this final giant, scary query:


1 r.db("foodb")
2 .table("foods")
3 .get(4)
4 .merge(function (food) {
5
6 return {
7 flavors:
8 r.db("foodb").table("compounds_foods")
9 .getAll(food("id"),{index: "food_id"})
10 .concatMap(function(compound_food) {
11 return
12 r.db("foodb").table("compounds_flavors")
13 .getAll(compound_food("compound_id"), {index: "compound_id"})
14 .concatMap(function(compounds_flavor) {
15 return
16 r.db("foodb").table("flavors").getAll(compounds_flavor("\
17 flavor_id"))
18 .map(function (flavor) {
19 return flavor("name")
20 })
21 .coerceTo("array")
22 })
23 .coerceTo("array")
24
25
26 })
27 .distinct()
28 .coerceTo("array")
29 }
30
31 })

Before the final coerceTo, we also call distinct to eliminate duplicates. And we get this result:
1 {
2 "created_at": Wed Feb 09 2011 00: 37: 15 GMT - 08: 00,
3 "creator_id": null,
4 "description": null,
5 "flavors": [
6 "alcoholic",
7 "baked",
8 "bay oil",
9 "bitter",
10 "bland",
11 "bread",
12 "cheese",
13 "cheesy",
14 "citrus",
15 "coconut",
16 "ethereal",
17 "faint",
18 "fat",
19 "fatty",
20 "medical",
21 "metal",
22 "mild",
23 "odorless",
24 "rancid",
25 "slightly waxy",
26 "soapy",
27 "sour",
28 "strong",
29 "sweat",
30 "sweet",
31 "unpleasant",
32 "waxy",
33 "yeast"
34 ],
35 "food_group": "Fruits",
36 "food_subgroup": "Tropical fruits",
37 "food_type": "Type 1",
38 "id": 4,
39 "itis_id": "506775",
40 "legacy_id": 4,
41 "name": "Kiwi",
42 "name_scientific": "Actinidia chinensis",
43 "picture_content_type": "image/jpeg",
44 "picture_file_name": "4.jpg",
45 "picture_file_size": 110661,
46 "picture_updated_at": Fri Apr 20 2012 09: 32: 21 GMT - 07: 00,
47 "updated_at": Fri Apr 20 2012 16: 32: 22 GMT - 07: 00,
48 "updater_id": null,
49 "wikipedia_id": null
50 }

While the query looks giant and complex, the way to write it is to drill down one table at a time, using map/concatMap to transform the data.

Many to many relation

Using outerJoin and innerJoin is not efficient because they do not use an index at all. They are useful and powerful, like filter, but just not very fast. As in Chapter 5, we used concatMap with getAll to query data across tables, and that is exactly the idea of a JOIN. We should avoid outerJoin and innerJoin whenever possible and adopt getAll with concatMap instead.
In the inner join section, it took more than 2 minutes to run our query. Let's try it again and find all foods with their compounds_foods documents, this time leveraging an index with getAll and joining the data inside map:

1 r.db('foodb').table('foods')
2 .map(function (food) {
3 return food.merge({
4 "compound_foods":
5 r.db('foodb').table('compounds_foods')
6 .getAll(food("id"), {index: 'food_id'})
7 .coerceTo('array')
8 })
9 })

And how fast it runs:

Executed in 1.52s. 40 rows returned, 40 displayed, more available

Much better than using innerJoin. Compared to more than 2 minutes before, this is a major improvement. To really see how fast or slow a query is without an index, you may want to put the data on a spinning disk (an external hard drive, for example), because an SSD is usually fast enough that you may not notice. All of my examples above ran on the SSD of a MacBook Pro (Retina, 13-inch, Mid 2014), with an Intel Core i5 2.8 GHz processor and 16 GB of RAM.
The key point is to ensure we fetch data using an index. For each document, we join data by running another query inside a map/concatMap function to merge or transform the extra data. The merged document is returned instead of the original document.
The main difference when using a sub query is that we get a nested document instead of the left and right fields of a JOIN. However, with some map or transform commands we can reshape it into whatever we can imagine.
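For example, here is a rough sketch of flattening the sub-query result so that each compounds_foods document simply carries a couple of fields from its food, much like what zip would give us. The choice of fields to copy is just an assumption.

r.db('foodb').table('foods')
  .concatMap(function (food) {
    return r.db('foodb').table('compounds_foods')
      .getAll(food('id'), {index: 'food_id'})
      .map(function (compound_food) {
        // copy just the food fields we care about onto each joined document
        return compound_food.merge(food.pluck('name', 'food_group'))
      })
  })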
Why map/concatMap is important
In SQL, we can basically join whatever we want. In RethinkDB, join is in fact just syntactic sugar on top of getAll and concatMap. As you learned in Chapter 5, map/concatMap allows you to transform a document together with its related data from an associated table, by querying the extra data inside the map function.
I once said that they are important, and now I repeat it because they are everything. getAll is just like SELECT in MySQL in terms of how much you have to use it. And getAll is not very useful without a map.
Wrap up
At the end of this chapter, we should know how to join based on these concepts:

* Primary key: joining on the ID field of a document
* Secondary index: joining using a secondary index
* Sub queries: using merge and map/concatMap to join data
7. Aggregation

A very common task for a database is to compute some kind of calculation over a given sequence of data. We will learn those kinds of functions in this chapter.

sum, average, and count


You already know count. It also has an extra, useful form: passing a value or a function. When passing a value, it counts the documents whose value matches that value, such as counting users who are 18 years old.
For example, let's count how many Type 1 foods we have:

1 r.db('foodb').table('foods')('food_type').count('Type 1')
2 //=>
3 627

Here we are using the nested field syntax to fetch the food_type field and count how many values match Type 1.
Or count how many users are 18 years old:

1 r.db("foodb")
2 .table("users")("age")
3 .count(18)

We can also pass a ReQL expression or a function. Let's count all foods whose name starts with L:

1 r.db('foodb').table('foods').count(r.row('name').match('^L'))
2 //=>
3 31

In other words, we pass a value or a function to count so that RethinkDB only counts the documents that match the value or for which the predicate function returns true.
1 r.db('foodb').table('foods')
2 .count(function(food){
3 return food('name').match('^L')
4 })
5 //=>
6 31

RethinkDB is very flexible in how we do things. Counting, fundamentally, is just counting the elements of an array. With some smart combinations we can count the same thing in different ways:

1 r.db('foodb').table('foods').map(r.row('name').split('').nth(0)).count('L')

Basically, we take the food name, split it character by character using the split command, then call nth(0) to get the first character. Using map we transform the foods table into a stream of the first character of each food name, then we count how many elements of this stream equal L.
In a sense, passing a function is like a shortcut for filtering with that function and counting the returned sequence.
In the example below, we count users who are 23 years old and whose name starts with an L:

1 r.db("foodb")
2 .table("users")
3 .count(function(user) {
4 return user("age").eq(23).and(user("name").match("^L"))
5 })
6 //=>
7 1

If we run filter, before count:

1 r.db("foodb")
2 .table("users")
3 .filter(function(user) {
4 return user("age").eq(23).and(user("name").match("^L"))
5 })
6 .count()
7 //=>
8 1

We get the same result, but it feels redundant.


We can also pass a ReQL expression which evaluates to true or false:
1 r.db("foodb")
2 .table("users")
3 .count(r.row("age").eq(23).and(r.row("name").match("^L")))
4 //=>
5 1

sum and average are similar to count in how you use them; they just differ in what they give you. sum gives you the sum of the sequence, and avg gives the average. Let's find out how many bytes of storage are needed to store the food images. Each document in the foods table has a picture_file_size field, stored in bytes.
1 r.db('foodb').table('foods').sum('picture_file_size')
2 //=>
3 123463051

We can also sum directly on the values of a stream:

1 r.db('foodb').table('foods')('picture_file_size').sum()

The key thing is to understand how these functions operate. By default, they operate on the whole document. That's why we have to use sum('picture_file_size') when we call sum directly on the table. However, when we have already used the bracket syntax to get the field, we can simply call sum() without any parameters.

1 r.db('foodb').table('foods')('picture_file_size').sum()

You can probably guess how to get the average file size; let's find out:

1 r.db('foodb').table('foods')('picture_file_size').avg()
2 //=>
3 147155.0071513707

We can also pass a function to sum or avg. In that case, RethinkDB calls the function on every document, then uses the results for the sum or average.
Let's say we are only interested in file sizes bigger than 4 MB:
1 r.db('foodb').table('foods').sum(function(food) {
2 return r.branch(
3 food('picture_file_size').gt(1024*1024*4),
4 food('picture_file_size'),
5 0)
6 })
7 //=>
8 9666379

Here, we are using branch as a normal if else block:

1 if food('picture_file_size') > 1024 * 1024 * 4


2 return food('picture_file_size')
3 else
4 0
5 end

In a way, by passing a function to sum, we get a simple filter effect. If we use filter, we can write the above query again:

1 r.db('foodb').table('foods').filter(function(food) {
2 return food('picture_file_size').gt(1024 * 1024 * 4)
3 }).sum('picture_file_size')
4 //=>
5 966379

In this case, we first find and return only the documents whose picture_file_size is greater than 4 MB. Then we simply sum the picture_file_size field of those documents.
Basically, by passing a function into sum or avg, we can transform each document into the value we want to sum or average.
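The same idea works for avg. A small sketch, converting the size to megabytes inside the function; the default(0) call is an assumption to guard against documents that might miss the field:

r.db('foodb').table('foods')
  .avg(function (food) {
    // average picture size expressed in megabytes instead of bytes
    return food('picture_file_size').default(0).div(1024 * 1024)
  })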
Doing some calculation is fun, but what if we want to know which food has the smallest or biggest picture file size? Let's move on to min and max.

min and max

As their names suggest, given a sequence they find the minimum or maximum document. But how do we compare JSON documents? We have to pass a field name to min or max; the value of that field is used to compare the documents. If we pass a function, the function is called on every document and its return value is used for the comparison.

1 r.db('foodb').table('foods').max('picture_file_size').pluck('name', 'picture_fil\
2 e_size')
3 //=>
4 {
5 "name": "Meatball" ,
6 "picture_file_size": 5102677
7 }

We can also pass an expression; the value of the expression is used for the comparison. As you can guess, RethinkDB often runs faster if we pass a plain field name, because no extra processing is needed. In the case of a function or expression, it has to be evaluated for every document.


min and max return the full document, but use a single value for the comparison. That hints that min and max may also accept an index as the comparison value.

Take this example: try to find the compound with the biggest msds_file_size.

1 r.db('foodb').table('compounds').max('msds_file_size')
2 //=>
3 1 row returned in 217ms.

Note that if you try max again without an index, the second time the query runs faster because part of the data was cached by RethinkDB. The size of this cache is defined by the formula (available_mem - 1024 MB) / 2, where available_mem is the memory available when RethinkDB starts.
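For example, on the 16 GB machine mentioned earlier, that formula works out to roughly:

(16384 MB - 1024 MB) / 2 = 7680 MB of cache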

That run was on an SSD, and it is still pretty slow. Now let's see how fast it is with an index.

1 r.db('foodb').table('compounds').indexCreate('msds_file_size')

Using this index to query, we can see it is much faster.


1 r.db('foodb').table('compounds').max({index: 'msds_file_size'})
2 //=>
3 1 row returned in 8ms.

By passing a secondary index to the min or max function, the index value is used for the comparison, and the query runs much faster and more efficiently.
For complex logic, we can even pass a function to min or max; the return values are used for the comparison. Let's find the food that has the most compounds. The compounds of a food are stored in the compounds_foods table.

1 //First, let create an index, you can ignore if you created index before.
2 r.db('foodb').table('compounds_foods').indexCreate('food_id')
3
4 r.db('foodb').table('foods')
5 .max(function(food) {
6 return r.db('foodb').table('compounds_foods')
7 .getAll(food('id'), {index: 'food_id'})
8 .count()
9 })
10 //=>
11 1 row returned in 1min 8.2

Yay, it runs in 1 minute and 8.2 seconds. Super slow, because the function has to be run on every document. That being said, complex functions may be slow, but they are useful when we need them.

distinct
Given a sequence, distinct removes duplicates from it. When given an index, duplication is detected by the value of the index. Its syntax is:

1 sequence.distinct() → array
2 table.distinct({index: <indexname>}) → stream

As you can tell, whenever we return an array we will run into the 100,000 element limit if the returned array has more than 100,000 elements. So keep that in mind and try to call distinct with a proper index, which we will learn about shortly.
Let's start with this simple example:
1 r.expr([1, 2, 3, 4, 1]).distinct()
2 //=> 4 rows returned
3 [
4 1 ,
5 2 ,
6 3 ,
7 4
8 ]

Let's get a list of our user names, without duplicates:

1 r.db("foodb")
2 .table("users")
3 .withFields("name")
4 .distinct()
5 //=>Executed in 30ms. 152 rows returned
6 [
7 {
8 "name": "Abe Willms"
9 } ,
10 {
11 "name": "Adela Klein V"
12 } ,...]

Let's see what ages our users have:

1 r.db("foodb")
2 .table("users")
3 .withFields("age")
4 .distinct()

//=> 71 rows returned: [ { age: 9 } , { age: 10 } , { age: 13 } , { age: 14 } , ...]


So we have 152 unique user names and 71 unique ages.
Sometimes, the way to detect duplication is not by comparing the value of a single field but by the result of some logic. We can build an index based on that logic, then run distinct on the value of that index. Imagine we want to divide users into 3 groups, depending on their age:
1 r.db("foodb")
2 .table("users")
3 .indexCreate("age-group", function (user) {
4 return
5 r.branch(
6 user("age").lt(18),
7 "teenager",
8 r.branch(user("age").gt(50),
9 "older",
10 "adult"
11 )
12 )
13 })

With that index, we can quickly list the age groups of our users by calling distinct on the table, passing that index:

1 r.db("foodb")
2 .table("users")
3 .distinct({index: "age-group"})
4 //=> 3 rows return
5 "adult"
6 "older"
7 "teenager"

When passing an index, the value of the index is returned, and therefore that value is used to detect duplication.
Let's try it on a bigger table: list all orig_food_common_name values of compounds_foods.

1 r.db('foodb').table('compounds_foods')('orig_food_common_name').distinct()
2 //=>
3 9492 rows returned in 41.44s.

Here we use the bracket syntax to return only the orig_food_common_name field, then remove duplicates with distinct. The query runs in 41.44 seconds. It also returns the whole array, with 9492 rows, which means all the data has to be put into memory and transferred over the network. To make it faster and more efficient, we can use an index, and the result will be a stream.
Let's create an index for that field.
1 r.db('foodb').table('compounds_foods').indexCreate('orig_food_common_name')

We pass an {index: index_name} argument to distinct:

1 r.db('foodb').table('compounds_foods').distinct({index: 'orig_food_common_name'})
2 //=>40 rows returned in 161ms. Displaying rows 1-40, more available

We optimized it from 41.44s to 161ms!!! So fast. There are two reasons why this is fast:

1. We use the index value for distinct to detect duplicates.
2. A stream is returned, so the whole array doesn't have to be loaded into memory and transferred to the client.

Basically, without an index, RethinkDB has to scan the whole table. It is slow in two ways:

* slow to fetch data: it reads the whole table, no index is used
* slow to return data: a big amount of data is transferred over the network from RethinkDB to the client

When we pass an index, RethinkDB picks up the value of the index; it doesn't have to care about or load the whole document. The returned result is a stream, so the client receives a cursor and fetches data lazily.
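To make the lazy cursor concrete, here is a sketch of consuming that stream from the official JavaScript driver; it assumes an already opened connection named conn:

r.db('foodb').table('compounds_foods')
  .distinct({index: 'orig_food_common_name'})
  .run(conn, function (err, cursor) {
    if (err) throw err;
    // each() pulls rows from the server lazily instead of loading them all at once
    cursor.each(function (err, name) {
      if (err) throw err;
      console.log(name);
    });
  });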
As you can see, the commands we have learned in this chapter operate on the whole sequence. However, aggregation usually comes with grouping: we want to divide a sequence into many groups and run aggregations on those groups. To do that, let's learn about group.

group
Yeah, group is everywhere. The group command groups data into many sub-sequences, and we can continue to run aggregations on those sub-sequences. For example, instead of counting the whole sequence, we may want to count how many documents are in group A, how many are in group B, and so on, where a group is a set of documents that share the same particular value.
Let's see how this is handled in RethinkDB.

1 sequence.group(fieldOrFunction..., [{index: "indexName", multi: false}]) → grouped_stream

In a nutshell, given a sequence, RethinkDB groups the documents that share the same value of a field (or the same return value of a function) into a group.
Looking at the flavors table, let's group its documents by their flavor_group field:
1 r.db("foodb")
2 .table("flavors")
3 .group('flavor_group')
4 #=>
5 [{
6 "group": "animal",
7 "reduction": [
8 {
9 "category": "odor",
10 "created_at": {
11 "$reql_type$": "TIME",
12 "epoch_time": 1317561018,
13 "timezone": "-07:00"
14 },
15 "creator_id": null,
16 "flavor_group": "animal",
17 "id": 112,
18 "name": "animal",
19 "updated_at": {
20 "$reql_type$": "TIME",
21 "epoch_time": 1317561018,
22 "timezone": "-07:00"
23 },
24 "updater_id": null
25 }
26 ]
27 },
28 {
29 "group": "balsamic",
30 "reduction": [
31 {
32 "category": "odor",
33 "created_at": {
34 "$reql_type$": "TIME",
35 "epoch_time": 1317561011,
36 "timezone": "-07:00"
37 },
38 "creator_id": null,
39 "flavor_group": "balsamic",
40 "id": 43,
41 "name": "others",
42 "updated_at": {
43 "$reql_type$": "TIME",
44 "epoch_time": 1317561011,
45 "timezone": "-07:00"
46 },
47 "updater_id": null
48 },
49 {
50 "category": "odor",
51 "created_at": {
52 "$reql_type$": "TIME",
53 "epoch_time": 1317561010,
54 "timezone": "-07:00"
55 },
56 "creator_id": null,
57 "flavor_group": "balsamic",
58 "id": 40,
59 "name": "chocolate",
60 "updated_at": {
61 "$reql_type$": "TIME",
62 "epoch_time": 1317561010,
63 "timezone": "-07:00"
64 },
65 "updater_id": null
66 }
67 ]
68 },
69 {
70 "group": "camphoraceous",
71 "reduction": [
72 {
73 "category": "odor",
74 "created_at": {
75 "$reql_type$": "TIME",
76 "epoch_time": 1317561017,
77 "timezone": "-07:00"
78 },
79 "creator_id": null,
80 "flavor_group": "camphoraceous",
81 "id": 101,
82 "name": "camphoraceous",
83 "updated_at": {
84 "$reql_type$": "TIME",
85 "epoch_time": 1317561017,
86 "timezone": "-07:00"
87 },
88 "updater_id": null
89 }
90 ]
91 },
92 ...]

Each element of the returned array includes two fields:

* group: the value we grouped by, in our case the value of the flavor_group field
* reduction: an array containing all documents that have the same value for the flavor_group field

When we continue to chain functions after group, each function operates on the reduction array, and its result replaces the value of the reduction array.
For example, we can count how many element of reduction:

1 r.db("foodb")
2 .table("flavors")
3 .group('flavor_group')
4 .count()
5 //=>
6 [
7 {
8 "group": null ,
9 "reduction": 743
10 } ,
11 {
12 "group": "animal" ,
13 "reduction": 1
14 } ,
15 {
16 "group": "balsamic" ,
17 "reduction": 10
18 } ,
19 {
20 "group": "camphoraceous" ,
21 "reduction": 1
22 } ,
23 ...
24 ]
So, the reduction field no longer contains an array of documents, but the count of how many documents were in the original reduction array.

Commands chained after group run on the grouped array
It's important to understand that group makes the next function call operate on its reduction field.

Similarly, instead of counting, say we only care about the first document:

1 r.db("foodb")
2 .table("flavors")
3 .group('flavor_group')
4 .nth(0)
5 //=>
6 [
7 {
8 "group": null ,
9 "reduction": {
10 "category": "odor" ,
11 "created_at": Sun Oct 02 2011 06:12:18 GMT-07:00 ,
12 "creator_id": null ,
13 "flavor_group": null ,
14 "id": 148 ,
15 "name": "cotton candy" ,
16 "updated_at": Sun Oct 02 2011 06:12:18 GMT-07:00 ,
17 "updater_id": null
18 }
19 } ,
20 {
21 "group": "animal" ,
22 "reduction": {
23 "category": "odor" ,
24 "created_at": Sun Oct 02 2011 06:10:18 GMT-07:00 ,
25 "creator_id": null ,
26 "flavor_group": "animal" ,
27 "id": 112 ,
28 "name": "animal" ,
29 "updated_at": Sun Oct 02 2011 06:10:18 GMT-07:00 ,
30 "updater_id": null
31 }
32 } ,
33 {
34 "group": "balsamic" ,
35 "reduction": {
36 "category": "odor" ,
37 "created_at": Sun Oct 02 2011 06:10:10 GMT-07:00 ,
38 "creator_id": null ,
39 "flavor_group": "balsamic" ,
40 "id": 40 ,
41 "name": "chocolate" ,
42 "updated_at": Sun Oct 02 2011 06:10:10 GMT-07:00 ,
43 "updater_id": null
44 }
45 } ,
46 ...
47 ]

Here, nth(0) is called on the reduction array, returns its first element, and re-assigns the result to the reduction field.

Why a null group?

Why do we have a null value here? It's because some documents don't have any value for flavor_group, in other words, a NULL value. They are all put into the same NULL group.

Note that group has a limitation when the grouped data exceeds 100,000 elements. For example, let's group compounds_foods by their orig_food_common_name:

1 r.db("foodb")
2 .table("compounds_foods")
3 .group('orig_food_common_name')

And we got this:

1 RqlRuntimeError: Grouped data over size limit `100000`. Try putting a reduction\
2 (like `.reduce` or `.count`) on the end in:
3 r.db("foodb").table("compounds_foods").group("orig_food_common_name")

Why so? Because when we end the chain with group, the whole array is loaded into memory, and our sequence has more than 100,000 elements; we have around 668K documents. However, when we call reduce or count on it, the amount of data is reduced, RethinkDB doesn't have to keep it all in memory, and the query works.
Let's try what it suggests:
1 r.db("foodb")
2 .table("compounds_foods")
3 .group('orig_food_common_name')
4 .count()
5 #=>
6 //Executed in 45.03s. 9492 rows returned
7 [
8 {
9 "group": null,
10 "reduction": 4313
11 },
12 {
13 "group": "AMARANTH FLAKES",
14 "reduction": 68
15 },
16 ...
17 ]

Now, this result is of course not the full grouped data we originally asked for, so why does it run? It is because when we call count, the whole reduction array becomes a single value instead of an array of grouped documents, which makes the final grouped data much smaller. As you can see, 9492 rows are returned, and that all fits into memory.
So keep in mind that we have some limitations with group. If you notice, 9492 is the same number we got when we ran distinct on the orig_food_common_name field.

1 r.db('foodb').table('compounds_foods')('orig_food_common_name').distinct()
2 //=>
3 9492 rows returned in 41.02s.

They return the same number of documents because, while they are different commands, they share the same concept of equality. Look at their definitions again:

* group: merges the documents that share the same value of a field (or function result) into a single group.
* distinct: eliminates duplicates, based on the value of a field or function result.

While they return different data, they return the same quantity of documents. distinct eliminates equal values by removing duplicates and keeping one; group eliminates equal values by merging them into one group.
The result above confirms that our query works properly. Sometimes it's fun to go back and try a different query as a way to validate our queries.
ungroup
As you can see, anything that follows group operates on the sub-stream, that is, the reduction array. Can we make the following function run on the sequence returned by group itself? For example, we may want to sort by the value of the reduction field. Let's try to sort flavors by how many documents each flavor_group has:

1 r.db("foodb")
2 .table("flavors")
3 .group('flavor_group')
4 .count()
5 .ungroup()
6 .orderBy(r.desc('reduction'))
7 //=>
8 [
9 {
10 "group": null ,
11 "reduction": 743
12 } ,
13 {
14 "group": "fruity" ,
15 "reduction": 24
16 } ,
17 {
18 "group": "floral" ,
19 "reduction": 14
20 } ,
21 {
22 "group": "balsamic" ,
23 "reduction": 10
24 },...
25 ]

Without ungroup, we will get an error:


1 r.db("foodb")
2 .table("flavors")
3 .group('flavor_group')
4 .count()
5 .orderBy(r.desc('reduction'))
6 //=>
7 e: Cannot convert NUMBER to SEQUENCE in:
8 r.db("foodb").table("flavors").group("flavor_group").count().orderBy(r.desc("red\
9 uction"))

The error occurs because, without ungroup, orderBy is called on the reduction array; in this case that is a single number (the quantity of documents in the reduction array), and orderBy cannot work on a single number.
So ungroup turns the array returned by group into a sequence of objects, where each object includes 2 fields:

* group: the value used for grouping
* reduction: all documents that share the same `group` value

and it lets any subsequent command operate on the whole sequence instead of on the sub-sequences from group. That's why it is called ungroup: it no longer treats the value of the reduction field as a sub-sequence to work on; the whole reduction array is now just a normal field of a document in the sequence.
Let's dive into this example to learn more about it. Say we have an array of values and we want to get the sum of the odd numbers and the sum of the even numbers. Using expr we can easily represent an array in RethinkDB:

1 r.expr([1, 2, 14, 4, 3, 1, 7, 12, 10, 9, 3, 5])

To separate odd and even numbers, we can group them by the result of mod 2:

1 r.expr([4, 3,5, 2])


2 .group(r.row.mod(2))
3 //=>
4 [{
5 "group": 0,
6 "reduction": [
7 4,
8 2
9 ]
10 }, {
11 "group": 1,
12 "reduction": [
13 3,
14 5
15 ]
16 }]

Now, a sum command that follows will work on the value of reduction.

1 r.expr([4, 3,5, 2])


2 .group(r.row.mod(2))
3 .sum()
4 //=>
5 [
6 {
7 "group": 0 ,
8 "reduction": 6
9 } ,
10 {
11 "group": 1 ,
12 "reduction": 8
13 }
14 ]

Nothing new here. We know what group and sum do. Group 0 has reduction = [4,2], so the sum is 6.
If we put an ungroup first:

1 r.expr([4, 3,5, 2])


2 .group(r.row.mod(2))
3 .ungroup()
4 //=>
5 [{
6 "group": 0,
7 "reduction": [
8 4,
9 2
10 ]
11 }, {
12 "group": 1,
13 "reduction": [
14 3,
15 5
16 ]
17 }]

The output is the same, but the context has changed. Using typeOf we can verify this:

1 r.expr([4, 3,5, 2])


2 .group(r.row.mod(2))
3 .ungroup().typeOf()
4 //=>
5 ARRAY
6
7 r.expr([4, 3,5, 2])
8 .group(r.row.mod(2))
9 .typeOf()
10 //=>
11 GROUPED_STREAM

So after ungroup, the result becomes an ARRAY instead of a GROUPED_STREAM. Now, if we call sum, it will work on the whole documents, that is, documents with the two fields group and reduction, instead of on the reduction sub-stream:

1 r.expr([4, 3,5, 2])


2 .group(r.row.mod(2))
3 .ungroup()
4 .sum()
5 //=>
6 e: Expected type NUMBER but found OBJECT in:
7 r([4, 3, 5, 2]).group(r.row.mod(2)).ungroup().sum()

We can confirm that after ungroup we really operate on the whole document:

1 r.expr([4, 3,5, 2])


2 .group(r.row.mod(2))
3 .ungroup()
4 .sum(r.row('reduction').sum())
5 //=>
6 14

So for each document, we take the sum of the reduction field, which is an array, and then sum those results. The final result is: (4+2) + (3+5) = 14.
Or say we want to show the sum of the odd numbers first; right now, the sum of the even numbers shows up first:
1 r.expr([4, 3,5, 2])


2 .group(r.row.mod(2))
3 //=>
4 [{
5 "group": 0,
6 "reduction": [
7 4,
8 2
9 ]
10 }, {
11 "group": 1,
12 "reduction": [
13 3,
14 5
15 ]
16 }]

Using orderBy as below won't work:

1 r.expr([4, 3,5, 2])


2 .group(r.row.mod(2))
3 .orderBy(r.desc('group'))

It fails because the reduction array has no group field for orderBy to work on. If we use ungroup here, we get access to the whole document as usual:

1 r.expr([4, 3,5, 2])


2 .group(r.row.mod(2))
3 .ungroup()
4 .orderBy(r.desc('group'))

The idea of ungroup confused me a bit at the beginning. If you got it right away, congratulations. Otherwise, just re-read this section and try some simple examples yourself and you will get it.
Now, let's move on to an even more confusing function: reduce.

reduce
I think at this point you already know what count() does. From a sequence or an array, it returns a single number: how many items the sequence holds. It transforms a whole array into a single value, unlike map, which turns each element of the sequence into another value and returns a new sequence of all the return values of the map function.
count is an example of a reduction. reduce accepts a function, let's call it reduce_function, and produces a single value by repeatedly calling that function, feeding previous outputs of reduce_function back in. The reduce_function can be called with these parameters:

* two elements of the sequence
* one element of the sequence and one result of a previous reduce_function execution
* two results of previous reductions

We can say that on the first execution, the first two elements of the sequence are passed into the reduce function; on the second execution, one parameter is the third element of the sequence and the other parameter is the result of the reduce function called on the first and second elements; and so on for the 4th, 5th executions.
But why can we have two results of previous reductions? It's because the reduce function can run in parallel across shards and CPU cores, or even across computers in a cluster. The final results of the reduce function on each shard or each computer are then passed to the reduce function again, to create the final result.
What happens if the sequence has a single element? We don't have enough input for the reduce function. That is a special case, and RethinkDB simply returns the value of that element as the result of the reduction.
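A tiny sketch of that special case:

r.expr([7])
  .reduce(function (left, right) {
    return left.add(right)
  })
//=> 7, the reduce function is never called because there is only one element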
Usually, reduce is used together with map to transform documents into values that can be aggregated. As you have seen, the reduce function's parameters can be elements of the sequence or results of previous reduce calls. Therefore, we need some transformation so that the type of the parameters and the type of the result of the reduce function are the same; if we don't do that transformation, the reduce function has to be able to deal with multiple data types.
Take this example:

1 r.expr([1, 2, 7, 8])
2 .reduce(function(left, right) {
3 return left.add(right)
4 })
5 //=>
6 18

That's the sum of the array. The process of reducing is similar to these steps:

1. The reduce function is called with the first two elements of the sequence: left=1, right=2, and returns 1.add(2)=3. Result = 3
2. The reduce function is called again, with the third element 7 and the result of the previous call: left=3, right=7 -> result = 10
3. The reduce function is called again, with the last element of the array: left=10, right=8 -> result = 18
4. No more elements; the single value of the last function call, 18, is returned

Order of the reduce function

The steps above are just to illustrate how it works. The real order may be different, because the data can be computed across shards, CPU cores or even computers, and the result of each part is then fed into the reduce function again. Never assume that the reduce function runs in order from left to right.

So reduce is kind of like recursion. For the example above, written in plain English it looks like this:

sum(a) = a[0]                                 if a contains a single element
sum(a) = a[0] + a[1]                          if a has two elements
sum(a) = sum([a[0], a[1]]) + sum(a[2..last])  if a has more than two elements

Let's try another example: find the minimum value of an array.

1 r.expr([10, 12, 4, 9])


2 .reduce(function(left, right) {
3 return r.branch(left.lt(right), left, right)
4 })
5 //=>
6 4

In this example, the reduce function returns the smaller of its two input values. Here we use branch as an if, and lt to compare (less than). Here is how it runs:

1. call function with left=10, right=12, return 10


2. call function with left=10(previous result), right=4 , return 4
3. call function with left=4(previous result), right=9(last element), return 4
4. final value 4 is returned

Here, notice that the reduce function returns the same data type as its input values. Let's try to count how many documents we have, reduce-style. You can already guess that we want something that increases by 1 as we iterate over the array. But since we are writing reduce as a recursive-looking function, we will add the left and right values.
1 r.db("foodb")
2 .table("flavors")
3 .reduce(function(left, right) {
4 return
5 r.branch(left.typeOf().ne('NUMBER'), 1, left)
6 .add(
7 r.branch(right.typeOf().ne('NUMBER'), 1, right)
8 )
9 })
10 //=>
11 855

That is the total number of documents in the flavors table. Let's look at our reduce function again:

1 function(left, right) {
2 return
3 r.branch(left.typeOf().ne('NUMBER'), 1, left)
4 .add(
5 r.branch(right.typeOf().ne('NUMBER'), 1, right)
6 )
7 }

left and right can be a document of flavors table with its whole fields, or a number from the result
of add command. We use typeOf to detect type, if its not a NUMBER, that means it is a document,
we consider that is a 1 item, and return 1 for counting. If its already a number, we used it, then add
both number. Its just like seeing an item, take 1, add with previous function call. Repeat this process
for whole sequence, we have a count of it.
So you see that we have to deal with branch command to turn the document into a number, both
for left and right. That job is a transformation, and sounds like a job of map. Rewrite it we can make
it cleanrer:

1 r.db("foodb")
2 .table("flavors")
3 .map(function(doc) {
4 return 1
5 })
6 .reduce(function(left, right) {
7 return left.add(right)
8 })
9 //=>
10 855
Now we map each document to the single number 1. Then the reduce function works like a sum of the array: take the first two elements and return the sum; take the previous sum, add the third element, and so on.
Usually we will have a map step before reduce to turn each document into a type that is compatible with the result of the reduce function. That's why this process is sometimes called map-reduce.
The execution of the reduce function is like recursion, but by passing the result of the previous run back into the function, we don't have to keep a stack to store the values of previous calls. In other words, the function encapsulates its data; it doesn't access any outside variables. All the data it needs is passed to it as the left and right parameters. Note that these are just name bindings, we can name them whatever we like, and they have to be able to deal with different data types: the type of the sequence elements and the type of the reduce function's result.
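As another sketch of this map-then-reduce pattern, here is one way to compute an average by hand; the default(0) call is an assumption about documents that might miss the field, and the built-in avg is of course simpler:

r.db('foodb').table('foods')
  .map(function (food) {
    // map every document to the same shape the reduce function returns
    return {total: food('picture_file_size').default(0), count: 1}
  })
  .reduce(function (left, right) {
    return {
      total: left('total').add(right('total')),
      count: left('count').add(right('count'))
    }
  })
  .do(function (result) {
    return result('total').div(result('count'))
  })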

Map Reduce
The map-reduce process shines when used with group. When calling group, the subsequent commands operate on the sub-streams; we can take advantage of that to run reduce on each sub-stream and implement our own aggregation logic by writing a reduce function.
Let's try to count how many compounds each of the first five foods has, map-reduce style, instead of using the built-in count command.
A food has many compounds, and a compound has many health effects, as in the diagram below:

1 +------------+ +----------------+ +------------------+


2 | food | | compounds_food| | compounds_flavors|
3 +------------+ +----------------+ +------------------+
4 | id +--->+ food_id | | |
5 | | | | | |
6 +------------+ +----------------+ | |
7 | compound_id +------>+ compound_id |
8 +---+------------+ +------------------+
9 |
10 +--------+
11 |
12 | +--------------------------+
13 | | compounds_health_effects |
14 | +--------------------------+
15 +--->+ compound_id +
16 +--------------------------+
1 r.db("foodb")
2 .table("foods")
3 .limit(5)
4 .concatMap(function(food) {
5 return
6 r.db("foodb").table("compounds_foods")
7 .getAll(food('id'), {index: 'food_id'})
8 .pluck('food_id', 'compound_id')
9 .merge({name: food('name')})
10 })
11 .group(function(doc) {
12 return doc.pluck('food_id', 'name')
13 })
14 .reduce(function(left, right) {
15 return
16 r.branch(left.typeOf().eq("NUMBER"), left, 1)
17 .add(
18 r.branch(right.typeOf().eq("NUMBER"), right, 1)
19 )
20 })
21 //=>5 rows returned in 254ms
22 [{
23 "group": {
24 "food_id": 4,
25 "name": "Kiwi"
26 },
27 "reduction": 378
28 }, {
29 "group": {
30 "food_id": 20,
31 "name": "Mugwort"
32 },
33 "reduction": 103
34 }, {
35 "group": {
36 "food_id": 25,
37 "name": "Common beet"
38 },
39 "reduction": 942
40 }, {
41 "group": {
42 "food_id": 26,
43 "name": "Borage"
44 },
45 "reduction": 225
46 }, {
47 "group": {
48 "food_id": 30,
49 "name": "Common cabbage"
50 },
51 "reduction": 1826
52 }]

Let's improve it to remove the branch command, which looks a bit odd here:

1 r.db("foodb")
2 .table("foods")
3 .limit(5)
4 .concatMap(function(food) {
5 return
6 r.db("foodb").table("compounds_foods")
7 .getAll(food('id'), {index: 'food_id'})
8 .pluck('food_id', 'compound_id')
9 .merge({name: food('name')})
10 })
11 .group(function(doc) {
12 return doc.pluck('food_id', 'name')
13 })
14 .map(function(doc) {
15 return 1
16 })
17 .reduce(function(left, right) {
18 return
19 left.add(right)
20 })

Here, we group by food_id and food name, then map each document of the sub-stream to 1. Because we are counting, we only care about each document as a whole instead of its individual fields. The reduce function simply sums left and right and returns the result. The map step helps us clean up the reduce function, because it is easier to take numbers as input and return a number too.
Let's take a more complex example: calculate, at the same time, how many flavors and health effects a food has.
First, let's create the necessary indexes:
1 r.db("foodb").table('compounds_health_effects').indexCreate('compound_id')
2 r.db("foodb").table('compounds_flavors').indexCreate('compound_id')
3 r.db("foodb").table('compounds_foods').indexCreate('compound_id')

With these indexes, we can easily get all compounds and count how many flavors and health effects are associated with each compound.

1 r.db('foodb')
2 .table('compounds')
3 .concatMap(function(doc) {
4 return
5 r.db('foodb').table('compounds_foods')
6 .getAll(doc('id'), {index: 'compound_id'})
7 .pluck('food_id')
8 .merge({
9 compound_id: doc('id'),
10 flavor_total: r.db('foodb').table('compounds_flavors').getAll(doc('id'\
11 ), {index: 'compound_id'}).count(),
12 health_effect_total: r.db('foodb').table('compounds_health_effects').g\
13 etAll(doc('id'), {index: 'compound_id'}).count()
14 })
15 })
16 //=>400 rows returned in 1min 15.09s. Displaying rows 1-400, more available
17 {
18 "compound_id": 4 ,
19 "flavor_total": 0 ,
20 "food_id": 191 ,
21 "health_effect_total": 0
22 }
23 {
24 "compound_id": 4 ,
25 "flavor_total": 0 ,
26 "food_id": 189 ,
27 "health_effect_total": 0
28 }

In the above query, we first fetch the compounds table; for a given compound, we fetch its food_id by querying the compounds_foods table. A compound can be in many foods, hence we use concatMap to flatten the returned array. We pluck the food_id field from compounds_foods because that is all we care about, instead of returning the whole document.
OK, the above query gives us each compound with its flavor count and health effect count. But we want to count the flavors and health effects of a food. Well, a food contains many compounds, so the food's total health effect count is the sum over all of its compounds.
Therefore, we can group by the food_id field and run a reduce function on each reduction group to get the total counts:

1 r.db('foodb')
2 .table('compounds')
3 .concatMap(function(doc) {
4 return
5 r.db('foodb').table('compounds_foods')
6 .getAll(doc('id'), {index: 'compound_id'})
7 .pluck('food_id')
8 .merge({
9 compound_id: doc('id'),
10 flavor_total: r.db('foodb').table('compounds_flavors').getAll(doc('id'\
11 ), {index: 'compound_id'}).count(),
12 health_effect_total: r.db('foodb').table('compounds_health_effects').g\
13 etAll(doc('id'), {index: 'compound_id'}).count()
14 })
15 })
16 .group('food_id')
17 .reduce(function(left, right) {
18 return {
19 flavor_total: left('flavor_total').add(right('flavor_total')),
20 health_effect_total: left('health_effect_total').add(right('health_eff\
21 ect_total')),
22 }
23 })
24 //=> 832 rows returned in 3min 33.23s.
25 {
26 "group": 2,
27 "reduction": {
28 "flavor_total": 16,
29 "health_effect_total": 517
30 }
31 },
32 {
33 "group": 3,
34 "reduction": {
35 "flavor_total": 0,
36 "health_effect_total": 112
37 }
38 }

This gives us a list of food ids with their total flavor and health effect counts. Let's do one more thing:
return the food name too, and remap the fields to make the document more readable. To map the document, so that we can change the field names from group and reduction, we have to call ungroup first. The final query is:

1 r.db('foodb')
2 .table('compounds')
3 .concatMap(function(doc) {
4 return
5 r.db('foodb').table('compounds_foods')
6 .getAll(doc('id'), {index: 'compound_id'})
7 .pluck('food_id')
8 .merge({
9 compound_id: doc('id'),
10 flavor_total: r.db('foodb').table('compounds_flavors').getAll(doc('id'\
11 ), {index: 'compound_id'}).count(),
12 health_effect_total: r.db('foodb').table('compounds_health_effects').g\
13 etAll(doc('id'), {index: 'compound_id'}).count()
14 })
15 })
16 .group('food_id')
17 .reduce(function(left, right) {
18 return {
19 flavor_total: left('flavor_total').add(right('flavor_total')),
20 health_effect_total: left('health_effect_total').add(right('health_eff\
21 ect_total')),
22 }
23 })
24 .ungroup()
25 .map(function(doc) {
26 return doc('reduction').merge({
27 food: r.db('foodb').table('foods').get(doc('group')).default({}).pluck('id\
28 ', 'name'),
29 })
30 })

You can sit back and watch the nice graphs of the RethinkDB admin dashboard. It takes a few minutes, depending on your CPU and disk speed. And the result:
1 //=>832 rows returned in 3min 49.60s.


2 [
3 {
4 "flavor_total": 16,
5 "food": {
6 "id": 2,
7 "name": "Savoy cabbage"
8 },
9 "health_effect_total": 517
10 },
11 {
12 "flavor_total": 0,
13 "food": {
14 "id": 3,
15 "name": "Silver linden"
16 },
17 "health_effect_total": 112
18 },...]

A very important point here is that we start with the compounds table instead of the foods table, and only later group by food_id. This is the reverse of fetching foods first, then finding all of their compounds and going down all the way. It makes the query shorter and easier to follow, because we eliminate one level of nesting by starting directly at the compounds table. It's important to pick the right table to start with, and the right order.
The reason is that if we start with the foods table, we have to join/map with the compounds_foods table to find its compounds, then join with the compounds table, then continue to join with two more tables, compounds_flavors and compounds_health_effects. That's three levels.
When we start right at compounds, we just need to join with compounds_foods, compounds_flavors and compounds_health_effects at the same time, because we already have the compound_id. That is only a single level of depth. We are essentially done right there, because we have enough information (the food_id field for grouping). Then, in the final step, we do a join with the foods table to fetch the food name; it stays readable because that map is like going up one level, making the query easier to follow.
Sometimes you may not need reduce at all: with map, concatMap and the built-in commands you can already do a lot. But when you do need reduce, it really helps.
Wrap up
When finishing this chapter, you should know how to aggregate, how to group data, how to count, and how to call functions on grouped data. Some key things:

* try to use an index if possible with min, max and distinct
* without ungroup, any chained command works on the sub-streams of group
* know how to use map-reduce, and remember that the order of the reduce function is not left to right
8. Time
Accessing time
RethinkDB supports date-time data with millisecond precision and timezones. It is integrated with the drivers, meaning you work with time data in your own language; you don't have to convert it to a special format. For example, in JavaScript you can just insert a timestamp like:

1 r.db("foodb")
2 .table("users")
3 .get("03f5479c-403e-4dfa-995f-5aea85c25982")
4 .update({
5 birthday: r.time(1987, 5,5, 'Z')
6 })

The syntax of the time function is:

1 r.time(year, month, day[, hour, minute, second], timezone) → time

timezone can be Z, meaning UTC, or a string with the format +-[hh]:[mm] as an offset from UTC. UTC is 7 or 8 hours ahead of Pacific time, depending on the season.
When you read the time back, again, it is converted into a native time object/data type of your language. This saves you a bunch of time dealing with time formatting and timezones.
Internally, RethinkDB stores the epoch time and an associated timezone. Epoch time is the number of seconds since the epoch, or more precisely since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds (see https://en.wikipedia.org/wiki/Unix_time).
The associated timezone is a minute-precision offset from UTC. That means PST is [-08]:[00].
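A few small sketches of building time values (the literal dates and offsets here are just illustrations):

r.now()                                   // the current time on the server
r.time(2015, 10, 12, 'Z')                 // a date in UTC
r.time(2015, 10, 12, 8, 30, 0, '-08:00')  // a date and time with a PST offset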

Timezone
When you set a native time object on a RethinkDB document, if the object includes a timezone value, RethinkDB picks it up and uses it; otherwise, it defaults to UTC. Let's say I was born in Vietnam on 1987/05/23 at 10:10 AM. With a UTC+08:00 timezone, I will write:

1 r.db("foodb")
2 .table("users")
3 .insert({
4 name: 'Vinh',
5 age: 30,
6 eatenfoods: ['Frybread', 'Yogurt'],
7 favfoods: ['Avocado', 'Jellyfish', 'Vanilla', 'Sacred lotus', 'Banana'],
8 birthday: r.time(1987, 5, 23, 10, 10, 0, '+08:00')
9 })

I can then find out the timezone of my birthday, using the timezone command.

1 r.db("foodb")
2 .table("users")
3 .get('12063f5f-4289-4a4b-b668-0e4a90861575')('birthday').timezone()
4 //=>
5 "+08:00"

Now I have moved to the USA, and people ask when I was born. I'm speechless. Knowing that the US West Coast is on PST, which is -08:00 compared to UTC, I turn to RethinkDB:

1 r.db("foodb")
2 .table("users")
3 .get('12063f5f-4289-4a4b-b668-0e4a90861575')('birthday')
4 .inTimezone('-08:00')
5 //=>
6 Fri May 22 1987 18:10:00 GMT-08:00

inTimezone helps us convert a time into another timezone.
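Subtracting two times returns the number of seconds between them, so a rough sketch of computing an age from that birthday could look like this (365.25 is just an approximation of a year):

r.db("foodb")
  .table("users")
  .get('12063f5f-4289-4a4b-b668-0e4a90861575')('birthday')
  .do(function (birthday) {
    // seconds between now and the birthday, converted to years
    return r.now().sub(birthday).div(365.25 * 24 * 3600)
  })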

epoch
The number of seconds since the Unix epoch is very important and is supported by almost every language. In RethinkDB we can get that number by using toEpochTime:
Accessing time 183

1 r.db("foodb")
2 .table("users")
3 .get('12063f5f-4289-4a4b-b668-0e4a90861575')('birthday')
4 .toEpochTime()
5 //=>
6 548734200

We can confirm it: r.expr(548734200).div(3600 * 24 * 365) returns 17.400247336377472, and 1987 is indeed around 17 years after 1970.
Likewise, if we have an epoch time we can convert it into a time object with epochTime:

1 r.epochTime(548734200)
2 //=>
3 Sat May 23 1987 02:10:00 GMT+00:00

The original time that I inserted was 1987/05/23, 10:10 AM (+08:00). We read it back, converted it to epoch time, then converted it to a time object again and got:
Sat May 23 1987 02:10:00 GMT+00:00
That's exactly Sat May 23 1987 10:10:00 AM in GMT+8.
Wrap up
At this point, you should be confident working with date/time in RethinkDB. Here is a recap:

* `now`: use `now` to get the current time
* `time`: pass year, month, day[, hour, minute, second], timezone to create a time object
* `inTimezone`: convert a time to another timezone
* `timezone`: detect the timezone when we don't know what it is
9. Conclusion

I call RethinkDB a database for programmers, not for database administrators, because it takes very little effort to understand and pick it up. The way we write ReQL is very clear: another developer can look at a query and know exactly what is going to happen. In the SQL world, we have to profile and explain a query to know whether an index will be used. In RethinkDB, we tell it to use an index. And the ReQL language is a wonderful way to think about a database.
Then come changefeeds, which I didn't cover in this book because you can quickly learn and use them after reading the API documentation for 5 minutes.
RethinkDB also offers automatic failover in a cluster, which I also didn't cover because I don't have experience using it. To me, everything coming up for RethinkDB is a good sign that it is worth investing in learning it. Be prepared and go ahead, by learning and using RethinkDB today.
While writing this book, I learned more about RethinkDB. If I hadn't written it, I probably wouldn't have dived as deeply. It was a chance for me to study carefully and improve myself. So I hope that my little book helps clear things up and makes you confident using RethinkDB.
