
Discussion Session Week 7

Database Indices

Example 1
Assume the table created by the following statement:
CREATE TABLE customer (
    id     SERIAL PRIMARY KEY,
    SSN    INTEGER NOT NULL UNIQUE,
    gender VARCHAR(6) NOT NULL CHECK (gender = 'MALE' OR gender = 'FEMALE'),
    city   TEXT NOT NULL
);

Example 1
Assume the following information for the customer table:

ID    SSN    Gender    City
4M    4M     2         40K

The number below each field is the number of distinct values for that
attribute. Since ID is the primary key, the table itself has 4M tuples.

Example 1a
Consider the following prepared query:

SELECT *
FROM customer
WHERE SSN = ?

And the following information we have about the system (DA = disk access):
Disk page size: 4KB
Assume each tuple is 100 bytes (40 tuples per page)
Index lookup cost: 0 DA (index in main memory)
Page access cost: 1 DA
Assume tuples are not clustered
An index on ID always exists

Should we use an index on the SSN attribute to answer this query more
efficiently?

Example 1a
Cost without index:
Only one tuple is in the answer, but we need to scan the entire relation
to find it.
4M tuples / 40 tuples per page = 100k page accesses = 100k DA

Cost with index:
The index lookup finds the correct page right away.
T(customer)/V(customer, SSN) = 4M/4M = 1 DA

Index beneficial: Yes.
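
The two costs above can be reproduced with a small sketch of the cost model. The function names and constants here are illustrative assumptions, not part of the lecture material; the formulas are the ones used on this slide (a full scan reads every page, and an unclustered in-memory index costs one page access per matching tuple).

```python
import math

PAGE_SIZE = 4 * 1024                       # bytes per disk page (4KB)
TUPLE_SIZE = 100                           # bytes per tuple
TUPLES_PER_PAGE = PAGE_SIZE // TUPLE_SIZE  # 40 tuples per page


def scan_cost(num_tuples):
    """Disk accesses for a full scan: every page is read once."""
    return math.ceil(num_tuples / TUPLES_PER_PAGE)


def unclustered_index_cost(num_tuples, num_distinct):
    """Disk accesses for an equality lookup through an in-memory index
    on unclustered tuples: T(R) / V(R, A), one page per matching tuple."""
    return math.ceil(num_tuples / num_distinct)


T_CUSTOMER = 4_000_000  # tuples in customer
V_SSN = 4_000_000       # distinct SSN values

print(scan_cost(T_CUSTOMER))                      # full scan
print(unclustered_index_cost(T_CUSTOMER, V_SSN))  # index on SSN
```

Under these assumptions the scan costs 100k DA and the index lookup costs 1 DA.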

Example 1b
What about this prepared query:

SELECT *
FROM customer
WHERE Gender = ?

Assume the same information about the system.
Should we use an index on the Gender attribute to answer this query more
efficiently?

Example 1b
Cost without index :
Still need to scan over the entire relation
100k page access = 100k DA

Cost with index (unclustered tuples) :


Half the tuples are in the answer (50% chance for
each tuple).
T(customer)/V(customer,gender) = 4M/2= 2M DA

Cost with index (clustered tuples) :


Pages are accessed at most once, therefore
#pageAccessed = 100k DA.

Index beneficial : No, in all cases.

Example 1c
What about this prepared query:

SELECT *
FROM customer
WHERE City = ?

Assume the same information about the system.
Should we use an index on the city attribute to answer this query more
efficiently?

Example 1c
Cost without index :
Scan over the entire relation
100k page access = 100k DA

Cost with index :


About 100 tuples (1 in 40k, or 0.0025% of the tuples) are in the answer.
T(customer)/V(customer, city) = 4M/40k = 100 DA

Index beneficial : Yes.
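
The three selections in Example 1 can be compared side by side. The sketch below is only an illustration under the slide assumptions (unclustered tuples, in-memory index, 40 tuples per page); the names are hypothetical. In this model the index wins whenever T/V is below T/40, that is, whenever the attribute has more than 40 distinct values.

```python
TUPLES = 4_000_000                      # T(customer)
TUPLES_PER_PAGE = 40
SCAN_COST = TUPLES // TUPLES_PER_PAGE   # 100k DA for a full scan

# Distinct-value counts V(customer, A) taken from Example 1.
distinct_values = {"SSN": 4_000_000, "gender": 2, "city": 40_000}


def index_beneficial(num_distinct):
    """True if the unclustered index lookup (T/V page accesses)
    is cheaper than scanning the whole relation."""
    index_cost = TUPLES / num_distinct
    return index_cost < SCAN_COST


results = {attr: index_beneficial(v) for attr, v in distinct_values.items()}
print(results)  # {'SSN': True, 'gender': False, 'city': True}
```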

Example 2
Assume we now have another relation:
CREATE TABLE sales (
    id          SERIAL PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(id) NOT NULL,
    product     TEXT NOT NULL,
    amount      INTEGER NOT NULL CHECK (amount > 0 AND amount <= 4000)
);
Assume the company doesn't allow sales of more than 4000 products at a
time.

Example 2
Assume the following information for the sales table:

ID     Customer_id    Product    Amount
40M    4M             4K         4K

Again, the number below each field is the number of distinct values for
that attribute. Since customer_id is a foreign key, there cannot be more
than 4M distinct values. There are 40M tuples in this table.

Example 2
Consider the following prepared query:

SELECT *
FROM customer AS C, sales AS S
WHERE C.id = S.customer_id

Recall the information we have about the system (DA = disk access):
Disk page size: 4KB
Assume each tuple is 100 bytes for both relations (40 tuples per page)
Index lookup cost: 0 DA (index in main memory)
Page access cost: 1 DA
Assume tuples are not clustered
An index on ID always exists
Page write cost: 1 DA

Should we use an index on the customer_id attribute of sales or the id
attribute of customer to answer this query more efficiently?

Example 2
Recall the best alternative, the sort-merge join:
From the lecture notes: given a join of tables R and S, sort-merge join
takes only 2 reads of R and S and one write of the equivalent amount of
data.
Recall: #pagesInCustomer = 100k, #pagesInSales = 1M.
Cost of sort-merge join:
#pageAccesses = (2 reads + 1 write) x (#pagesInCustomer + #pagesInSales)
= 3 x (100k + 1M)
= 3300k DA
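
As a quick check of the arithmetic, the sort-merge cost can be written as a one-line function. This is a sketch of the simplified cost formula quoted from the lecture notes (2 reads plus 1 write of both relations), not a model of a real sort-merge implementation; the function and parameter names are assumptions.

```python
def sort_merge_join_cost(pages_r, pages_s, reads=2, writes=1):
    """Disk accesses for sort-merge join under the simplified model:
    each page of both relations is read `reads` times and written
    `writes` times."""
    return (reads + writes) * (pages_r + pages_s)


TUPLES_PER_PAGE = 40
pages_in_customer = 4_000_000 // TUPLES_PER_PAGE   # 100k pages
pages_in_sales = 40_000_000 // TUPLES_PER_PAGE     # 1M pages

print(sort_merge_join_cost(pages_in_customer, pages_in_sales))  # 3300000
```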

Example 2
Assume the index is on sales.customer_id:
For each tuple in customer, we join with the tuples from sales that match
the customer tuple's id.
#pageAccesses = T(Customer) x (T(Sales)/V(Sales, customer_id))
= 4M x (40M / 4M) = 40M DA
Index on sales.customer_id is much worse than sort-merge join.

Example 2
Assume the index is on customer.id:
For each tuple in sales, we join with the tuple from customer that the
sales tuple's customer_id refers to.
#pageAccesses = T(Sales) x (T(Customer)/V(Customer, id))
= 40M x (4M / 4M) = 40M DA
Index on customer.id is also worse than sort-merge join.
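
Both index-based plans in Example 2 follow the same pattern: one probe per outer tuple, each costing T(inner)/V(inner, join attribute) page accesses. The sketch below makes that explicit; the function name and constants are illustrative assumptions, and the cost of reading the outer relation itself is ignored, as on the slides.

```python
def index_join_cost(outer_tuples, inner_tuples, inner_distinct):
    """Disk accesses for an index-based join on unclustered tuples:
    each outer tuple fetches T(inner) / V(inner, attr) matching tuples,
    one page access each."""
    return outer_tuples * (inner_tuples // inner_distinct)


T_CUSTOMER = 4_000_000
T_SALES = 40_000_000
V_SALES_CUSTOMER_ID = 4_000_000   # V(Sales, customer_id)
V_CUSTOMER_ID = 4_000_000         # V(Customer, id)
SORT_MERGE_COST = 3_300_000       # from the sort-merge slide

# Index on sales.customer_id: customer is the outer relation.
cost_index_on_sales = index_join_cost(T_CUSTOMER, T_SALES, V_SALES_CUSTOMER_ID)
# Index on customer.id: sales is the outer relation.
cost_index_on_customer = index_join_cost(T_SALES, T_CUSTOMER, V_CUSTOMER_ID)

print(cost_index_on_sales, cost_index_on_customer)  # 40000000 40000000
print(min(cost_index_on_sales, cost_index_on_customer) > SORT_MERGE_COST)  # True
```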

Example 3
Now assume we slightly change the previous query by adding a selection:

SELECT *
FROM customer AS C, sales AS S
WHERE C.id = S.customer_id AND
      C.id = 12345 -- Yannis's ID

Important to know: the selection happens first on most database systems.
Should we use an index on the customer_id attribute of sales or the id
attribute of customer to answer this query more efficiently?

Example 3
If we use the sort-merge join:
Recall: #pagesInSales = 1M.
Since only one tuple from customer is selected and the selection happens
before the join, we have #pagesInCustomer = 1.
#pageAccesses = (2 reads + 1 write) x (#pagesInCustomer + #pagesInSales)
= 3 x (1 + 1M)
≈ 3M DA

Example 3
Assume the index is on sales.customer_id:
There is only one tuple in customer after the selection.
#pageAccesses = T(Customer) x (T(Sales)/V(Sales, customer_id))
= 1 x (40M / 4M) = 10 DA
Index on sales.customer_id is much, much better than sort-merge join if
we have a very selective selection.

Example 3
Assume the index is on customer.id:
There is no selection over sales.
#pageAccesses = T(Sales) x (T(Customer)/V(Customer, id))
= 40M x (4M / 4M) = 40M DA
Index on customer.id does not benefit from the selection on the customer
table.
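
To summarize Example 3, the three plans can be put in one table and the cheapest picked. This is a sketch under the slide assumptions (the selection runs first and leaves a single customer tuple); the plan labels and variable names are hypothetical.

```python
TUPLES_PER_PAGE = 40
T_SALES = 40_000_000
T_CUSTOMER = 4_000_000
V_SALES_CUSTOMER_ID = 4_000_000   # V(Sales, customer_id)
V_CUSTOMER_ID = 4_000_000         # V(Customer, id)
SELECTED_CUSTOMERS = 1            # C.id = 12345 matches one tuple

pages_in_sales = T_SALES // TUPLES_PER_PAGE   # 1M pages
pages_in_selected_customer = 1                # one tuple fits in one page

plans = {
    # (2 reads + 1 write) over both inputs
    "sort-merge join": 3 * (pages_in_selected_customer + pages_in_sales),
    # one probe for the single selected customer tuple
    "index on sales.customer_id":
        SELECTED_CUSTOMERS * (T_SALES // V_SALES_CUSTOMER_ID),
    # sales is still scanned tuple by tuple; the selection does not help
    "index on customer.id": T_SALES * (T_CUSTOMER // V_CUSTOMER_ID),
}

best_plan = min(plans, key=plans.get)
print(best_plan, plans[best_plan])  # index on sales.customer_id 10
```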
