Discussion Session Week 7: Database Indices

Example 1
Assume the table created by the following statement:
CREATE TABLE customer (
    id     SERIAL PRIMARY KEY,
    SSN    INTEGER NOT NULL UNIQUE,
    gender VARCHAR(6) NOT NULL CHECK (gender = 'MALE' OR gender = 'FEMALE'),
    city   TEXT NOT NULL
);
Example 1
Assume the following information for the customer table:
Id    SSN   Gender   City
4M    4M    2        40K
The number below each field is the number of distinct values for that attribute. Since id is the primary key, the table itself has 4M tuples.
Example 1a
Consider the following prepared query:
SELECT *
FROM customer
WHERE SSN = ?
And the following information we have about the system (DA = disk access):
Disk page size: 4 KB
Assume each tuple is 100 bytes (40 tuples per page)
Index lookup cost: 0 DA (index in main memory)
Page access cost: 1 DA
Assume tuples are not clustered
An index on id always exists
Should we use an index on the SSN attribute to answer this query more efficiently?
Example 1a
Cost without index:
Only one tuple is in the answer, but we need to scan the entire relation to find it.
4M tuples / 40 tuples per page = 100k page accesses = 100k DA
Cost with index:
The index lookup finds the correct page right away.
T(customer)/V(customer, SSN) = 4M/4M = 1 DA
Index beneficial: Yes.
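The two costs above can be reproduced with a few lines of arithmetic. This is only a sketch of the slides' cost model: the function names `scan_cost` and `unclustered_index_cost` are hypothetical, and the model assumes one disk access per page read and a free index lookup, as stated in the system description.

```python
# Sketch of the slides' cost model; function names are hypothetical.
TUPLES = 4_000_000      # T(customer)
TUPLES_PER_PAGE = 40    # figure assumed in the slides for 100-byte tuples

def scan_cost(tuples: int, tuples_per_page: int = TUPLES_PER_PAGE) -> int:
    """Disk accesses for a full scan: one per page (ceiling division)."""
    return -(-tuples // tuples_per_page)

def unclustered_index_cost(tuples: int, distinct_values: int) -> float:
    """Expected disk accesses with an in-memory index on unclustered
    tuples: T/V matching tuples, one page access per match."""
    return tuples / distinct_values

print(scan_cost(TUPLES))                       # 100000
print(unclustered_index_cost(TUPLES, TUPLES))  # 1.0 (SSN is unique)
```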
Example 1b
What about this prepared query:
SELECT *
FROM customer
WHERE gender = ?
Assume the same information about the system.
Should we use an index on the gender attribute to answer this query more efficiently?
Example 1b
Cost without index:
Still need to scan the entire relation.
100k page accesses = 100k DA
Cost with index (unclustered tuples):
Half the tuples are in the answer (50% chance for each tuple), and each matching tuple may cost one page access.
T(customer)/V(customer, gender) = 4M/2 = 2M DA
Cost with index (clustered tuples):
Each page is accessed at most once, so the cost is at most #pages = 100k DA.
Index beneficial: No, in all cases.
Example 1c
What about this prepared query:
SELECT *
FROM customer
WHERE city = ?
Assume the same information about the system.
Should we use an index on the city attribute to answer this query more efficiently?
Example 1c
Cost without index:
Scan the entire relation.
100k page accesses = 100k DA
Cost with index:
On average 100 tuples are in the answer (1 in 40k, i.e. 0.0025% of the tuples).
T(customer)/V(customer, city) = 4M/40k = 100 DA
Index beneficial: Yes.
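Examples 1a to 1c apply the same rule to three attributes, so they can be checked side by side. The sketch below assumes the unclustered cost model from the slides (T/V page accesses through the index versus a 100k-page scan); the names `DISTINCT` and `index_helps` are hypothetical.

```python
# Sketch comparing index vs. scan for each attribute; names are hypothetical.
T = 4_000_000            # tuples in customer
PAGES = T // 40          # 100k pages, 40 tuples per page as in the slides

# Distinct-value counts V(customer, attr) taken from the slides.
DISTINCT = {"SSN": 4_000_000, "gender": 2, "city": 40_000}

def index_helps(attr: str) -> bool:
    """True when T/V unclustered page accesses beat a full scan."""
    return T / DISTINCT[attr] < PAGES

for attr in DISTINCT:
    print(attr, T // DISTINCT[attr], "DA with index;",
          "use index" if index_helps(attr) else "scan instead")
```

Under this model the break-even point is V = 40: with fewer than 40 distinct values, an unclustered index costs more than the scan.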
Example 2
Assume we now have another relation:
CREATE TABLE sales (
    id          SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    product     TEXT NOT NULL,
    amount      INTEGER NOT NULL CHECK (amount > 0 AND amount <= 4000)
);
Assume the company does not allow sales of more than 4000 products at a time.
Example 2
Assume the following information for the sales table:
id    customer_id   product   amount
40M   4M            4K        4K
Again, the number below each field is the number of distinct values for that attribute. Since customer_id is a foreign key, there cannot be more than 4M distinct values.
There are 40M tuples in this table.
Example 2
Consider the following prepared query:
SELECT *
FROM customer AS C, sales AS S
WHERE C.id = S.customer_id
Recall the information we have about the system (DA = disk access):
Disk page size: 4 KB
Assume each tuple is 100 bytes for both relations (40 tuples per page)
Index lookup cost: 0 DA (index in main memory)
Page access cost: 1 DA
Assume tuples are not clustered
An index on id always exists
Page write cost: 1 DA
Should we use an index on the customer_id attribute of sales or the id attribute of customer to answer this query more efficiently?
Example 2
Recall the best alternative, the sort-merge join:
From the lecture notes: given a join of tables R and S, sort-merge join takes only 2 reads of R and S and one write of the equivalent amount of data.
Recall: #pagesInCustomer = 100k, #pagesInSales = 1M.
Cost of sort-merge join:
#pageAccesses = (2 reads + 1 write) x (#pagesInCustomer + #pagesInSales)
= 3 x (100k + 1M)
= 3300k DA
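The sort-merge figure follows directly from the page counts. A minimal sketch, assuming the slides' simplified model of two reads and one write per page of each input (the helper names `pages` and `sort_merge_cost` are hypothetical):

```python
# Sketch of the slides' simplified sort-merge cost; names are hypothetical.
def pages(tuples: int, tuples_per_page: int = 40) -> int:
    """Number of pages needed to hold the tuples (ceiling division)."""
    return -(-tuples // tuples_per_page)

def sort_merge_cost(pages_r: int, pages_s: int,
                    reads: int = 2, writes: int = 1) -> int:
    """(reads + writes) passes over every page of both relations."""
    return (reads + writes) * (pages_r + pages_s)

pages_customer = pages(4_000_000)    # 100k pages
pages_sales = pages(40_000_000)      # 1M pages
print(sort_merge_cost(pages_customer, pages_sales))  # 3300000
```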
Example 2
Assume the index is on sales.customer_id:
For each tuple in customer, we join with the tuples from sales that match the customer tuple's id.
#pageAccesses = T(Customer) x (T(Sales)/V(Sales, customer_id))
= 4M x (40M / 4M) = 40M DA
The index on sales.customer_id is much worse than sort-merge join.
Example 2
Assume the index is on customer.id:
For each tuple in sales, we join with the tuple from customer that is referred to by the sales tuple's customer_id.
#pageAccesses = T(Sales) x (T(Customer)/V(Customer, id))
= 40M x (4M / 4M) = 40M DA
The index on customer.id is also worse than sort-merge join.
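Both index-based plans use the same formula with the roles of the two relations swapped. The sketch below assumes the slides' model, where each probe of the in-memory index is free and each matching inner tuple costs one page access; `index_join_cost` is a hypothetical name.

```python
# Sketch of the index-based join cost from the slides; name is hypothetical.
def index_join_cost(outer_tuples: int, inner_tuples: int,
                    inner_distinct: int) -> float:
    """One index probe per outer tuple; each probe fetches
    T(inner)/V(inner, join attribute) tuples, one page access each."""
    return outer_tuples * (inner_tuples / inner_distinct)

T_CUSTOMER, T_SALES = 4_000_000, 40_000_000

# Index on sales.customer_id: customer is the outer relation.
print(index_join_cost(T_CUSTOMER, T_SALES, 4_000_000))     # 40000000.0
# Index on customer.id: sales is the outer relation.
print(index_join_cost(T_SALES, T_CUSTOMER, 4_000_000))     # 40000000.0
```

Either way the result is 40M DA, roughly twelve times the 3300k DA of the sort-merge join.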
Example 3
Now assume we slightly change the previous query by adding a selection:
SELECT *
FROM customer AS C, sales AS S
WHERE C.id = S.customer_id AND
      C.id = 12345  -- Yannis's ID
Important to know: the selection will happen first on most database systems.
Should we use an index on the customer_id attribute of sales or the id attribute of customer to answer this query more efficiently?
Example 3
If we use the sort-merge join:
Recall: #pagesInSales = 1M.
Since only one tuple from customer is selected and the selection happens before the join, we have #pagesInCustomer = 1.
#pageAccesses = (2 reads + 1 write) x (#pagesInCustomer + #pagesInSales)
= 3 x (1 + 1M)
≈ 3M DA
Example 3
Assume the index is on sales.customer_id:
There is only one tuple in customer after the selection.
#pageAccesses = T(Customer) x (T(Sales)/V(Sales, customer_id))
= 1 x (40M / 4M) = 10 DA
The index on sales.customer_id is much, much better than sort-merge join when the selection is very selective.
Example 3
Assume the index is on customer.id:
There is no selection over sales.
#pageAccesses = T(Sales) x (T(Customer)/V(Customer, id))
= 40M x (4M / 4M) = 40M DA
The cost of using the index on customer.id is not affected by the selection on the customer table.
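The three plans for Example 3 can be compared in one place. This is a sketch under the slides' assumptions (selection applied before the join, in-memory index, one DA per page); the dictionary keys and variable names are hypothetical labels.

```python
# Sketch comparing the three Example 3 plans; labels are hypothetical.
PAGES_SALES = 1_000_000      # 40M tuples / 40 tuples per page
T_SALES = 40_000_000
V_SALES_CUSTOMER_ID = 4_000_000
T_CUSTOMER = 4_000_000
SELECTED_CUSTOMERS = 1       # C.id = 12345 matches one tuple (id is the key)

costs = {
    # (2 reads + 1 write) over the one selected customer page plus all of sales.
    "sort-merge": 3 * (1 + PAGES_SALES),
    # One probe for the selected customer, about 10 matching sales tuples.
    "index on sales.customer_id":
        SELECTED_CUSTOMERS * (T_SALES / V_SALES_CUSTOMER_ID),
    # Every sales tuple still probes customer, so the selection does not help.
    "index on customer.id": T_SALES * (T_CUSTOMER / T_CUSTOMER),
}

best = min(costs, key=costs.get)
for plan, cost in costs.items():
    print(f"{plan}: {cost:,.0f} DA")
print("cheapest:", best)
```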
