Timothy G. Griffin
Computer Laboratory
University of Cambridge, UK
Lent 2017
date topics
1 17/2 What is a Database Management System (DBMS)?
2 20/2 Entity-Relationship (ER) diagrams
3 22/2 Relational Databases ...
4 24/2 ... and SQL
5 27/2 HyperSQL Practical Lecture
6 1/3 Some limitations of SQL ...
7 3/3 ... that can be solved with Graph Database
8 6/3 Neo4j Practical Lecture
9 8/3 No Lecture!
10 17/2 Document-oriented Database
11 17/2 DoctorWho Practical Lecture
12 17/2 Cloud computing and distributed databases
3 models
Relational Model: Data is stored in tables. SQL is the main query
language.
Graph-oriented Model: Data is stored as a graph (nodes and edges).
Query languages tend to have path-oriented
capabilities.
Aggregate-oriented Model: Also called document-oriented database.
Optimised for read-oriented databases.
The relational model has been the industry mainstay for the last 35
years. The other two models are representatives of an ongoing
revolution in database systems often described under the (misleading)
NoSQL banner.
We have processed raw data available from IMDb (plain text data
files at http://www.imdb.com/interfaces)
Small instance: 100 movies and 7771 people (actors, directors,
etc)
Large instance: 887,747 movies and 3,424,147 people
For each of our three database systems we have generated a
small and large database instance.
Practical work should be done on the small instances. Large
instances are for your entertainment and enlightenment.
We will see that there is no one system that nicely solves all data
problems.
There are several fundamental trade-offs faced by application
designers and system designers.
A database engine might be optimised for a particular class of
queries.
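[Figures: ER diagrams. Entity Movie has attributes id, title and year; entity Person has attributes id, name and gender. A relationship between them carries the attributes position and character. Further diagrams use the names Role, RequiresRole, Description, Title, Year, FirstName and LastName, and the attributes id, title, country, date, month and year.]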
[Figure: ER diagram of a relationship R between entities S and T.]
The relation R is
one-to-many: Every member of T is related to at most one member of
S.
many-to-one: Every member of S is related to at most one member of
T.
one-to-one: R is both many-to-one and one-to-many.
many-to-many: No constraint.
A binary relation R between sets S and T is a subset $R \subseteq S \times T$.
Database parlance
S and T are referred to as domains.
We are interested in finite relations R that can be stored!
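More generally, given sets $S_1, S_2, \ldots, S_n$, an n-ary relation is a subset
$R \subseteq S_1 \times S_2 \times \cdots \times S_n = \{(s_1, s_2, \ldots, s_n) \mid s_i \in S_i\}$
Tabular presentation
1    2    ...  n
x    y    ...  w
u    v    ...  s
.    .    ...  .
n    m    ...  k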
$R_1, R_2, \ldots, R_k \implies Q(R_1, R_2, \ldots, R_k)$
R                    Q(R)
A   B   C   D        A   B   C   D
20  10  0   55   =>  20  10  0   55
11  10  0   7        77  25  4   0
4   99  17  2
77  25  4   0

Q
RA   $\sigma_{A > 12}(R)$
SQL  SELECT DISTINCT * FROM R WHERE R.A > 12
R                    Q(R)
A   B   C   D        B   C
20  10  0   55   =>  10  0
11  10  0   7        99  17
4   99  17  2        25  4
77  25  4   0

Q
RA   $\pi_{B,C}(R)$
SQL  SELECT DISTINCT B, C FROM R
R                    Q(R)
A   B   C   D        A   E   C   F
20  10  0   55   =>  20  10  0   55
11  10  0   7        11  10  0   7
4   99  17  2        4   99  17  2
77  25  4   0        77  25  4   0

Q
RA   $\rho_{\{B \mapsto E,\ D \mapsto F\}}(R)$
SQL  SELECT A, B AS E, C, D AS F FROM R
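The three unary operations above (selection, projection, renaming) can be tried out with a small script. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for the course's database system, and the in-memory table is built from the example rows of R.

```python
import sqlite3

# In-memory database holding the example relation R(A, B, C, D).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B INTEGER, C INTEGER, D INTEGER)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?, ?)",
    [(20, 10, 0, 55), (11, 10, 0, 7), (4, 99, 17, 2), (77, 25, 4, 0)],
)

# Selection: keep only the rows with A > 12.
selection = conn.execute("SELECT DISTINCT * FROM R WHERE R.A > 12").fetchall()

# Projection: keep only columns B and C; DISTINCT removes the duplicate (10, 0).
projection = conn.execute("SELECT DISTINCT B, C FROM R").fetchall()

# Renaming: same rows, but columns B and D are now called E and F.
cursor = conn.execute("SELECT A, B AS E, C, D AS F FROM R")
renamed_columns = [col[0] for col in cursor.description]

print(sorted(selection))
print(sorted(projection))
print(renamed_columns)
```

The results are sorted before printing because SQL gives no ordering guarantee without ORDER BY.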
R          S            Q(R, S)
A   B      A   B        A   B
20  10     20  10   =>  20  10
11  10     77  1000     11  10
4   99                  4   99
                        77  1000

Q
RA   $R \cup S$
SQL  (SELECT * FROM R) UNION (SELECT * FROM S)
R          S            Q(R, S)
A   B      A   B        A   B
20  10     20  10   =>  20  10
11  10     77  1000
4   99

Q
RA   $R \cap S$
SQL  (SELECT * FROM R) INTERSECT (SELECT * FROM S)
R          S            Q(R, S)
A   B      A   B        A   B
20  10     20  10   =>  11  10
11  10     77  1000     4   99
4   99

Q
RA   $R - S$
SQL  (SELECT * FROM R) EXCEPT (SELECT * FROM S)
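The union, intersection and difference examples can be reproduced with the sketch below. It assumes Python's built-in sqlite3 module rather than the course's database system. SQLite does not accept parentheses around the operands of a compound SELECT, so the queries are written without them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B INTEGER)")
conn.execute("CREATE TABLE S (A INTEGER, B INTEGER)")
conn.executemany("INSERT INTO R VALUES (?, ?)", [(20, 10), (11, 10), (4, 99)])
conn.executemany("INSERT INTO S VALUES (?, ?)", [(20, 10), (77, 1000)])


def run(op):
    """Combine R and S with the given set operator and return a set of rows."""
    query = f"SELECT * FROM R {op} SELECT * FROM S"
    return set(conn.execute(query).fetchall())


union = run("UNION")
intersection = run("INTERSECT")
difference = run("EXCEPT")

print(sorted(union))
print(sorted(intersection))
print(sorted(difference))
```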
Q
RA   $R \times S$
SQL  SELECT A, B, C, D FROM R CROSS JOIN S
SQL  SELECT A, B, C, D FROM R, S
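Natural Join
Given R(A, B) and S(B, C), we define the natural join, denoted
$R \bowtie S$, as a relation over attributes A, B, C defined as
$R \bowtie S \equiv \{t \mid \exists u \in R,\ v \in S,\ u.[B] = v.[B] \wedge t = u.[A] \cup u.[B] \cup v.[C]\}$
In terms of the other operations:
$R \bowtie S = \pi_{A,B,C}(\sigma_{B = B'}(R \times \rho_{\vec{B} \mapsto \vec{B'}}(S)))$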
Students                  Colleges
name    sid   cid         cid  cname
Fatima  fm21  cl          k    Kings
Eva     ev77  k           cl   Clare
James   jj25  cl          q    Queens

=> Students $\bowtie$ Colleges
name    sid   cid  cname
Fatima  fm21  cl   Clare
Eva     ev77  k    Kings
James   jj25  cl   Clare
(We will look at joins in SQL very soon ...)
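The set-builder definition of the natural join translates almost directly into code. The sketch below is illustrative only: it represents each record as a Python dict from attribute names to values (a choice made here, not something taken from the course software) and joins the Students and Colleges example.

```python
def natural_join(r, s):
    """Natural join of two relations given as lists of dicts (attribute -> value)."""
    result = []
    for u in r:
        for v in s:
            shared = u.keys() & v.keys()
            # Combine the two records only if they agree on every shared attribute.
            if all(u[a] == v[a] for a in shared):
                joined = {**u, **v}
                if joined not in result:  # relations are sets: no duplicate records
                    result.append(joined)
    return result


students = [
    {"name": "Fatima", "sid": "fm21", "cid": "cl"},
    {"name": "Eva", "sid": "ev77", "cid": "k"},
    {"name": "James", "sid": "jj25", "cid": "cl"},
]
colleges = [
    {"cid": "k", "cname": "Kings"},
    {"cid": "cl", "cname": "Clare"},
    {"cid": "q", "cname": "Queens"},
]

for row in natural_join(students, colleges):
    print(row["name"], row["sid"], row["cid"], row["cname"])
```

Queens has no students, so it does not appear in the result: a natural join silently drops records that have no partner.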
[Figure: ER diagram with the attributes title and year on one entity and name and gender on the other.]
DirectsComplete
mid title year pid name gender
2544956 12 Years a Slave (2013) 2013 1390487 McQueen, Steve (III) male
2552840 4 luni, 3 saptamni si 2 zile (2007) 2007 1486556 Mungiu, Cristian male
2589088 Afghan Star (2009) 2009 3097222 Marking, Havana female
2607939 American Hustle (2013) 2013 1835629 Russell, David O. male
2611256 An Education (2009) 2009 3404232 Scherfig, Lone female
2622261 Anvil: The Story of Anvil (2008) 2008 751015 Gervasi, Sacha male
2626541 Argo (2012) 2012 16507 Affleck, Ben male
2629853 Aruitemo aruitemo (2008) 2008 1133907 Koreeda, Hirokazu male
... ... ... ... ... ...
People
id name gender
1390487 McQueen, Steve (III) male
1486556 Mungiu, Cristian male
3097222 Marking, Havana female
1835629 Russell, David O. male
3404232 Scherfig, Lone female
751015 Gervasi, Sacha male
16507 Affleck, Ben male
1133907 Koreeda, Hirokazu male
... ... ...
Directs
movie_id person_id
2544956 1390487
2552840 1486556
2589088 3097222
2607939 1835629
2611256 3404232
2622261 751015
2626541 16507
2629853 1133907
... ...
(No, this relation does not actually exist in our IMDb databases;
more on that later ...)
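Relational Key
Suppose R(X) is a relational schema with $Z \subseteq X$. If for any records u
and v in any instance of R we have
$u.[Z] = v.[Z] \implies u.[X] = v.[X]$,
then Z is a superkey for R. If no proper subset of Z is a superkey, then Z
is a key for R. We write R(Z, Y) to indicate that Z is a key for $R(Z \cup Y)$.
Note that this is a semantic assertion, and that a relation can have
multiple keys.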
Foreign Key
Suppose we have R(Z, Y). Furthermore, let S(W) be a relational
schema with $Z \subseteq W$. We say that Z represents a Foreign Key in S for R
if for any instance we have $\pi_Z(S) \subseteq \pi_Z(R)$. Think of these as (logical)
pointers!
Referential integrity
A database is said to have referential integrity when all foreign key
constraints are satisfied.
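A database system can enforce referential integrity for us. The sketch below assumes Python's built-in sqlite3 module (not the course's database system) and a simplified pair of tables: People and Directs as in the example above, with the movies table left out for brevity. The person id 999 is a made-up value that matches no row in People.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.execute("CREATE TABLE People (id INTEGER PRIMARY KEY, name TEXT, gender TEXT)")
conn.execute(
    "CREATE TABLE Directs ("
    " movie_id INTEGER,"
    " person_id INTEGER REFERENCES People(id))"
)
conn.execute("INSERT INTO People VALUES (16507, 'Affleck, Ben', 'male')")

# This row points at an existing person, so it is accepted.
conn.execute("INSERT INTO Directs VALUES (2626541, 16507)")

# This row points at a person that does not exist, so it is rejected.
try:
    conn.execute("INSERT INTO Directs VALUES (2626541, 999)")
    violation_detected = False
except sqlite3.IntegrityError:
    violation_detected = True

directs_rows = conn.execute("SELECT count(*) FROM Directs").fetchone()[0]
print(violation_detected, directs_rows)
```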
A relational schema
ActsIn(movie_id, person_id)
With referential integrity constraints
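[Figures: ER diagrams of a relationship R (attribute U) between an entity S (attributes Z, W) and an entity T (attributes X, Y), shown in a table with the columns "Relation R is" and "Schema".]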
R(X, Z, U)
Q(X, Z, V)
RQ(X, Z, type, U, V)
using a tag domain(type) = {r, q} (for some constant values r and q).
represent an R-record (x, z, u) as an RQ-record (x, z, r, u, NULL)
represent a Q-record (x, z, v) as an RQ-record (x, z, q, NULL, v)
Redundancy alert!
If we know the value of the type column, we can compute the value of
either the U column or the V column!
PRIZES!!!
// scan R
for each (a, b) in R {
// don't scan S, use an index
for each s in S-INDEX-ON-B(b) {
create (a, b, s.c) ...
}
}
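A runnable version of the loop above might look as follows. It is a sketch under assumptions: the index on S.B is modelled as an in-memory hash table (a Python dict), and the sample rows of R and S are invented for illustration.

```python
from collections import defaultdict


def index_join(r, s):
    """Join R(A, B) with S(B, C) on B, using a hash index on S.B."""
    # Build the index once: B value -> list of C values.
    s_index_on_b = defaultdict(list)
    for b, c in s:
        s_index_on_b[b].append(c)

    result = []
    for a, b in r:  # scan R
        for c in s_index_on_b.get(b, []):  # index lookup instead of scanning S
            result.append((a, b, c))
    return result


R = [(1, "x"), (2, "y"), (3, "x")]       # hypothetical rows (a, b)
S = [("x", 10), ("x", 20), ("z", 30)]    # hypothetical rows (b, c)
print(index_join(R, S))
```

Without the index, each row of R would need a full scan of S. With it, each row of R needs one lookup.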
There are many types of database indices and the commands for
creating them can be complex.
Index creation is not defined in the SQL standards.
While an index can speed up reads, it will slow down
updates. This is one more illustration of a fundamental
database tradeoff.
The tuning of database performance using indices is a fine art.
In some cases it is better to store read-oriented data in a separate
database optimised for that purpose.
select B, C from R
R                    Q(R)
A   B   C   D        B   C
20  10  0   55   =>  10  0    ???
11  10  0   7        10  0    ???
4   99  17  2        99  17
77  25  4   0        25  4
marks
sid    course     mark
ev77   databases  92
ev77   spelling   99
tgg22  spelling   3
tgg22  databases  100
fm21   databases  92
fm21   spelling   100
jj25   databases  88
jj25   spelling   92

group by =>

course     mark          course     mark
spelling   99            databases  92
spelling   3             databases  100
spelling   100           databases  92
spelling   92            databases  88

min(mark) =>

course     min(mark)
spelling   3
databases  88
select course,
min(mark),
max(mark),
avg(mark)
from marks
group by course;
+-----------+-----------+-----------+-----------+
| course | min(mark) | max(mark) | avg(mark) |
+-----------+-----------+-----------+-----------+
| databases | 88 | 100 | 93.0000 |
| spelling | 3 | 100 | 73.5000 |
+-----------+-----------+-----------+-----------+
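The two phases of a GROUP BY query, first partitioning the rows and then aggregating each partition, can be imitated in a few lines. This is only a sketch in plain Python of what the query computes, not of how a database engine implements it.

```python
from collections import defaultdict

marks = [
    ("ev77", "databases", 92),
    ("ev77", "spelling", 99),
    ("tgg22", "spelling", 3),
    ("tgg22", "databases", 100),
    ("fm21", "databases", 92),
    ("fm21", "spelling", 100),
    ("jj25", "databases", 88),
    ("jj25", "spelling", 92),
]

# Step 1 (group by): partition the marks by course.
groups = defaultdict(list)
for sid, course, mark in marks:
    groups[course].append(mark)

# Step 2 (aggregate): reduce each group to a single (min, max, avg) row.
summary = {
    course: (min(ms), max(ms), sum(ms) / len(ms))
    for course, ms in groups.items()
}

for course in sorted(summary):
    print(course, *summary[course])
```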
∧ | T  F  ?        ∨ | T  F  ?        v | ¬v
--+---------       --+---------       --+---
T | T  F  ?        T | T  T  T        T | F
F | F  F  F        F | T  F  ?        F | T
? | ?  F  ?        ? | T  ?  ?        ? | ?
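The truth tables for three-valued logic can be written as three small functions. In this sketch the unknown value "?" is represented by Python's None. That encoding is an assumption made for illustration.

```python
def and3(a, b):
    """Three-valued AND: False dominates, then unknown (None)."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a, b):
    """Three-valued OR: True dominates, then unknown (None)."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def not3(a):
    """Three-valued NOT: unknown stays unknown."""
    return None if a is None else (not a)


# Print all three tables, writing None as "?".
show = {True: "T", False: "F", None: "?"}
for a in (True, False, None):
    print(show[a],
          [show[and3(a, b)] for b in (True, False, None)],
          [show[or3(a, b)] for b in (True, False, None)],
          show[not3(a)])
```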
note total
---- -----
NULL 4232
This seems to mean that NULL is equal to NULL. But recall that
NULL = NULL returns NULL!
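The apparent contradiction can be observed directly. The sketch below assumes Python's built-in sqlite3 module and a hypothetical one-column table t with made-up values. Other SQL systems should behave the same way on these three queries, but that is not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (note TEXT)")  # hypothetical table
conn.executemany("INSERT INTO t VALUES (?)", [(None,), (None,), ("(uncredited)",)])

# Comparing NULL with NULL yields NULL (Python None), not true.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# Yet GROUP BY places all NULLs in a single group.
groups = set(conn.execute("SELECT note, count(*) FROM t GROUP BY note").fetchall())

# A WHERE clause keeps only rows whose condition is true, so this matches nothing.
matches = conn.execute("SELECT count(*) FROM t WHERE note = NULL").fetchone()[0]

print(null_equals_null, groups, matches)
```

To test for NULL in a WHERE clause, SQL provides the separate predicate IS NULL.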
$R \subseteq S \times T$
$Q \subseteq T \times U$
A directed graph
V = {A, B, C, D}
A = {(A, B), (A, D), (B, C), (C, C)}
[Figure: drawing of the directed graph on the nodes A, B, C, D.]
Observation
If G = (V, A) is a directed graph, and $(u, v) \in A^k$, then there is at least
one path in G from u to v of length k. Such paths may contain loops.
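The observation can be checked on the example graph. The sketch assumes that $A^k$ denotes the k-fold composition of the arc relation A with itself, and represents a relation as a Python set of pairs.

```python
def compose(r, q):
    """Relational composition: all (x, z) with (x, y) in r and (y, z) in q."""
    return {(x, z) for (x, y1) in r for (y2, z) in q if y1 == y2}


def power(arcs, k):
    """A^k for k >= 1: pairs joined by at least one path of length k."""
    result = set(arcs)
    for _ in range(k - 1):
        result = compose(result, arcs)
    return result


arcs = {("A", "B"), ("A", "D"), ("B", "C"), ("C", "C")}

print(sorted(power(arcs, 2)))
print(sorted(power(arcs, 3)))
```

Because of the loop (C, C), every node that can reach C has paths of every greater length to C, so the powers stop changing from k = 2 onwards.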
2 3 4
HyperSQL 3 9 OOM
OOM = Out Of Memory.
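The transitive closure $R^+$ of R is the smallest relation containing R such that
$(x, y) \in R^+ \wedge (y, z) \in R^+ \rightarrow (x, z) \in R^+$.
Then
$R^+ = \bigcup_{n \in \{1, 2, \ldots\}} R^n$.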
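For a finite relation the union stops growing after finitely many steps, so the transitive closure can be computed by iterating until nothing new appears. The sketch below uses this naive fixed-point iteration on a set of pairs. It is an illustration, not the method used by any particular database system.

```python
def transitive_closure(r):
    """Smallest transitive relation containing r (r is a set of pairs)."""
    closure = set(r)
    while True:
        # Extend every known path by one more step of r.
        new_pairs = {(x, z) for (x, y1) in closure for (y2, z) in r if y1 == y2}
        if new_pairs <= closure:  # fixed point reached: nothing new
            return closure
        closure |= new_pairs


arcs = {("A", "B"), ("A", "D"), ("B", "C"), ("C", "C")}
print(sorted(transitive_closure(arcs)))
```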
BIG IDEA
Is it possible to design and implement a database system that is
optimised for transitive closure and related path-oriented
queries?
A fundamental tradeoff
Introducing data redundancy can speed up read-oriented transactions
at the expense of slowing down write-oriented transactions.
{"menu": {
"id": "file",
"value": "File",
"popup": {
"menuitem": [
{"value": "New", "onclick": "CreateNewDoc()"},
{"value": "Open", "onclick": "OpenDoc()"},
{"value": "Close", "onclick": "CloseDoc()"}
]
}
}}
From http://json.org/example.html.
[Figure: ER diagram of entities S (key A) and T (key B) linked by a relationship R.]
A database instance
S           R               T
A   X       A   B   Y       B   Z
a1  x1      a1  b1  y1      b1  z1
a2  x2      a1  b2  y2      b2  z2
a3  x3      a1  b3  y3      b3  z3
            a2  b1  y4      b4  z4
            a2  b3  y5
Since we don't update this data we will not encounter the problems
associated with redundancy.
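[Figure: ER diagram of entities S (attributes A, X) and T (attributes B, Z) linked by a relationship Q (attribute W).]
A database instance
S           Q               T
A   X       A   B   W       B   Z
a1  x1      a1  b4  w1      b1  z1
a2  x2      a3  b2  w2      b2  z2
a3  x3      a3  b3  w3      b3  z3
                            b4  z4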
Potential Solution
Represent the data using semi-structured objects.
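One way to picture this is to precompute, for each S-record, a nested object that already contains everything related to it. The sketch below does this for the S, R, T instance shown earlier. The field names and the nesting layout are assumptions chosen for illustration, not the format of the course's document database.

```python
import json

S = {"a1": "x1", "a2": "x2", "a3": "x3"}               # A -> X
T = {"b1": "z1", "b2": "z2", "b3": "z3", "b4": "z4"}   # B -> Z
R = [("a1", "b1", "y1"), ("a1", "b2", "y2"), ("a1", "b3", "y3"),
     ("a2", "b1", "y4"), ("a2", "b3", "y5")]           # (A, B, Y)


def build_documents(s, r, t):
    """One document per S-record, embedding its R-links and the linked T data."""
    docs = {}
    for a, x in s.items():
        docs[a] = {
            "A": a,
            "X": x,
            "R": [{"B": b, "Z": t[b], "Y": y} for (a2, b, y) in r if a2 == a],
        }
    return docs


documents = build_documents(S, R, T)
print(json.dumps(documents["a1"], indent=2))
```

Reading everything about a1 is now a single lookup, but the value z1 is stored in both the a1 and the a2 documents. This is the redundancy that the fundamental tradeoff refers to.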
"writer_in": [
{ "line_order": 1, "group_order": 1, "subgroup_order": 1,
"movie_id": 3256273, "title": "No Country for Old Men (2007)",
"note": "(screenplay)" },
{ "line_order": 1, "group_order": 1, "subgroup_order": 1,
"movie_id": 3667358, "title": "True Grit (2010)",
"note": "(screenplay)"
},
{ "line_order": 1, "group_order": 1, "subgroup_order": 1,
"movie_id": 3026002, "title": "Inside Llewyn Davis (2013)",
"note": "(written by)"
}]
}
import uk.ac.cam.cl.databases.moviedb.MovieDB;
import uk.ac.cam.cl.databases.moviedb.model.*;
For quite a few years now, the received wisdom has been that
social data is not relational, and that if you store it in a relational
database, you're doing it wrong.
Diaspora chose MongoDB for their social data in this zeitgeist. It
was not an unreasonable choice at the time, given the information
they had.
You can see why this is attractive: all the data you need is already
located where you need it.
Note
The blog describes implementation decisions made in 2010 for the
development of a social networking platform called Diaspora. This was
before mature graph-oriented databases were available.
You can also see why this is dangerous. Updating a user's data
means walking through all the activity streams that they appear in
to change the data in all those different places. This is very
error-prone, and often leads to inconsistent data and mysterious
errors, particularly when dealing with deletions.
If your data looks like that, you've got documents.
Congratulations! It's a good use case for Mongo. But if there's
value in the links between documents, then you don't actually
have documents. MongoDB is not the right solution for you. It's
certainly not the right solution for social data, where links between
documents are actually the most critical data in the system.
                     OLAP         OLTP
Supports             analysis     day-to-day operations
Data is              historical   current
Transactions mostly  reads        updates
Optimized for        reads        updates
Data redundancy      high         low
Database size        humongous    large
Extract
Flat tables are great for processing, but hard for people to read
and understand.
Pivot tables and cross tabulations (spreadsheet terminology) are
very useful for presenting data in ways that people can
understand.
Note that some table values become column or row names!
Standard SQL does not handle pivot tables and cross tabulations
well.
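A cross tabulation is easy to produce outside SQL. The sketch below pivots a flat (sid, course, mark) table so that the course values become column names. The student marks are reused from the earlier GROUP BY example, and the plain-Python approach is an illustrative choice.

```python
marks = [
    ("ev77", "databases", 92), ("ev77", "spelling", 99),
    ("tgg22", "spelling", 3), ("tgg22", "databases", 100),
    ("fm21", "databases", 92), ("fm21", "spelling", 100),
    ("jj25", "databases", 88), ("jj25", "spelling", 92),
]


def pivot(rows):
    """Turn flat (sid, course, mark) rows into {sid: {course: mark}}."""
    table = {}
    for sid, course, mark in rows:
        table.setdefault(sid, {})[course] = mark
    return table


table = pivot(marks)
courses = sorted({course for _, course, _ in marks})  # values become column names

print("sid".ljust(8) + "".join(c.ljust(12) for c in courses))
for sid in sorted(table):
    cells = "".join(str(table[sid].get(c, "")).ljust(12) for c in courses)
    print(sid.ljust(8) + cells)
```

The set of columns depends on the data, not on the schema, which is part of why a fixed-schema language such as standard SQL handles pivots awkwardly.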
From VLDB 2009 Tutorial: Column-Oriented Database Systems, by Stavros Harizopoulos, Daniel Abadi, Peter Boncz.
CAP concepts
Consistency. All reads return data that is up-to-date.
Availability. All clients can find some replica of the data.
Partition tolerance. The system continues to operate despite
arbitrary message loss or failure of part of the system.
(http://xkcd.com/327)