Anda di halaman 1dari 29

CS 345:

Topics in Data Warehousing


Thursday, October 21, 2004
Review of Tuesday’s Class
• Database System Architecture
– Memory management
– Secondary storage (disk)
– Query planning process
• Joins
– Nested Loop Join
– Merge Join
– Hash Join
• Grouping
– Sort vs. Hash
Outline of Today’s Class
• Indexes
– B-Tree and Hash Indexes
– Clustered vs. Non-Clustered
– Covering Indexes
• Using Indexes in Query Plans
• Bitmap Indexes
– Index intersection plans
– Bitmap compression
Indexes
• Provide efficient access to relevant records
– Based on values of particular attribute(s)
• Same idea as index in back of a book
• “fact tables 16, 17, 49”
– Information about fact tables on pages 16, 17, and 49
– No information about fact tables on other pages
– Without an index, we’d have to look through the
whole book page by page
Typical Index Structure
• Indexes organized based on some search key
– Column (or set of columns) whose values are used to
access the index
– Organization can be sorting or hashing
• Index is built for some relation
– One index entry per record in the relation
• Index consists of <Value, RID> pairs
– Value = value of the search key for this record
– RID = record identifier
• Tells the DBMS where the record is stored
• Usually (page number, offset in page)
Sorted Index
• Index entries usually much smaller than records
– Record has many attributes besides search key
• Build search tree on top of index entries
– Allows particular value to be located quickly

2 5

2 4 4 5 7 8
B-Tree Index
• By far the most common type of index
• Sorted index with search tree
• Good for point queries and range queries
– Point query: A = 5
– Range query: A BETWEEN 5 AND 10
• Search tree nodes are page-sized
– Contain <Value, Pointer> pairs
– Each Pointer is to a node of the level below
• Trade-off in choosing index page sizes
– Larger pages → fewer search tree levels → fewer
page reads
– Larger pages → each page read takes longer
Hash Indexes
• Useful for point queries
– Slightly better performance than B-Trees
– Not useful for range queries
• Less widely supported than B-Trees
Alternate B-Tree Organization
• Many records with same search key causes
redundancy
– <Stanford,RID1>,<Stanford,RID2>,
<Stanford,RID3>,<Stanford,RID4>
• Can store RID-lists instead
– <Stanford, (RID1,RID2,RID3,RID4)>
– Each value occurs once in the index
– Index entry is <Value,RID-list> instead of
<Value,RID>
– Saves space when search key has many repeated
values
Clustered Indexes
• An index is clustered (or “clustering”) if records in the
relation are organized based on index search key
• Clustered indexes are good because:
– Records satisfying a range query are packed onto a small number
of consecutive pages
• In unclustered indexes, by contrast:
– Records satisfying a range query are spread across a large
number of random pages
– Commingled with other records that do not satisfy the query
• Only one clustered index allowed per relation
– A relation can’t be simultaneously sorted by 2 different attributes
– (Unless there are multiple copies of the relation)
Clustered vs. Unclustered
Clustered Sequential
2 5 Reads

2 4 4 5 7 8

2 4 4

5 7 8

Unclustered 2 5
Random
Reads

2 4 4 5 7 8

4 7 5

2 4 8
Comparing Access Plans
• Consider query “SELECT * FROM R WHERE A=5”
• Three query plans:
– Scan relation R
• Sequential read of all pages in R
• Regardless of how many tuples have A=5
– Use clustered index on A
• Sequential read of relevant pages in R
• Num. relevant pages = (# of tuples with A=5) / (# of tuples per page)
• Plus overhead of accessing index pages
– Use unclustered index on A
• Random read of relevant pages in R
• Number of relevant pages = (# of tuples with A=5)
– Less if A is highly correlated with sort order of relation
• Plus overhead of accessing index pages
Comparing Access Plans
• Clustered index is always best
– Unless all tuples are being returned (then use scan)
– But clustered index may not be available
• Unclustered index beats scan when fraction of
tuples returned is small
– Depends on these factors:
• % of tuples being returned
• Cost ratio of random I/O vs. sequential I/O
• # of tuples per page
– Query returns >10% of rows → scan is almost
certainly faster
Covering Indexes
• Example using index in a book:
– “What does this book say about fact tables?”
• Look up “fact tables” in the index
• Turn to each page that is listed
• Read that page and see what it says
– “Which of these topics are discussed in this
book: fact tables, bridge tables, B-trees?”
• Look up the three topics in the index
• See how many of them appear
• Don’t need to read any of the actual book
Covering Indexes
• Sometimes an index has all the data you need
– Allows index-only query plan
– Not necessary to access the actual tuples
– Such an index is called a covering index
• SELECT COUNT(*) FROM R WHERE A=5
– Use index on A
– Count number of <5,RID> entries
– No need to look up records referenced by RIDs
• An index is a “thin” copy of a relation
– Not all columns from the relation are included
– The index is sorted in a particular way
Multi-Column Indexes
• Multi-column indexes are very useful in data
warehousing
– We say such an index has a composite key
• Example: B-Tree index on (A,B)
– Search key is (A,B) combination
– Index entries sorted by A value
– Entries with same A value are sorted by B value
– Called a lexicographic sort
• SELECT SUM(B) FROM R WHERE A=5
– Our (A,B) index covers this query!
• Coverage vs. size trade-off
– More attributes in search key → index covers more queries
– More attributes in search key → index takes up more disk space
Fact and Dimension Indexes
• Dimension table index
• Narrow version of table with
only frequently-queried
attributes
• Always include dimension key!
• Improve performance on large
dimension tables

• Fact table index


• Narrow version of fact that
omits certain dimensions /
measures
• Useful for queries that
exclusively reference indexed
dimensions / measures
Order of Composite Key
• Index on (A,B) ≠ Index on (B,A)
– Can efficiently search based on leading terms
– No efficient search for trailing terms
• SELECT SUM(B) FROM R WHERE A=5
– Index on (A,B) is sorted by A
• Search for records where A=5
• Scan only the relevant portion of the index
– Index on (B,A) is sorted by B
• Records with A=5 are scattered throughout index
• Need to scan the entire index
• Or else do one search for each distinct value of B
– Oracle’s “index skip scans”
– Index on (A,B) is better for this query
– Either index is much faster than accessing relation!
Index Summary
• Indexes are useful in two ways:
– Indexes allow efficient search on some attributes due
to the way they are organized
– Index-only plans use small indexes in place of large
relations
• For OLAP queries, the second use is generally
more important
– Search via non-covering, non-clustered index leads to
random I/O
– Analysis queries typically aggregate lots of tuples
– Doing one random I/O per tuple can be costly
Example
• Sales(Date, Store, Product, Promotion,
TransactionId, Quantity, DollarAmt)
– Index on (Date, Store, Quantity, DollarAmt)
– Index on (Date, Promotion, Product, Quantity,
DollarAmt)
– Index on (Product, Date, Store, Quantity, DollarAmt)
• Store
– Index on (Name, District, StoreKey)
• Product
– Index on (Name, Brand, Dept, ProductKey)
– Index on (Brand, Dept, ProductKey)
Example Query
Product:
Sales:
Brand
DollarAmt

SELECT Brand, SUM(DollarAmt)


FROM Sales, Product, Store
WHERE Sales.ProductKey = Product.ProductKey
AND Sales.StoreKey = Store.StoreKey
AND Store.Name = 'Crystal Springs Safeway‘
GROUP BY Brand
Store:
Name
Selecting Indexes
Lacks
• Sales(Date, Store, Product, Promotion, Product
TransactionId, Quantity, DollarAmt)
– Index on (Date, Store, Quantity, DollarAmt)
– Index on (Date, Promotion, Product, Quantity,
DollarAmt)
– Index on (Product, Date, Store, Quantity, DollarAmt)
• Store Lacks
– Index on (Name, District, StoreKey) Store
• Product
– Index on (Name, Brand, Dept, ProductKey)
– Index on (Brand, Dept, ProductKey)
Wider
Than
Needed
Query Plan
• Search Store(Name, District, StoreKey) index for
Name=‘Crystal Springs Safeway’
• Nested Loop Join
– Outer = Sales(Product,Date,Store,Quantity,DollarAmt) index
– Inner = Qualifying Store index entries
– Output preserves sort order of Sales index
• Sort Product(Brand,Dept,ProductKey) index entries by
ProductKey
• Merge Join
– Result of Nested Loop Join (already sorted by ProductKey)
– Product(Brand,Dept,ProductKey)
• Hash resulting tuples on Brand (for GROUP BY)
– Compute SUM(DollarAmt) for each Brand
Index Intersection
• Suppose we have table R(A,B,C,D,E)
– B-Tree index on A
– B-Tree index on B
– No multi-column indexes
• SELECT COUNT(*) FROM R WHERE A=5 AND B < 10
• Use an index intersection plan
– Search A index for A=5
• Index entries have <A,RID>
• Think of the index as a 2-column table with schema I1(A,RID)
– Search B index for B<10
• Index entries have <B,RID>
• Think of the index as a 2-column table with schema I2(B,RID)
– Join qualifying index entries on I1.RID = I2.RID
Index Intersection
• Index intersection works well for conjunction of
multiple, moderately selective filters
– SELECT SUM(C) FROM R WHERE A=5 AND B<10
– 5% of rows have A=5
– 5% of rows have B<10
– 5% * 5% = 0.25% of rows have A=5 AND B<10
– Retrieving rows matching A index alone, or B index
alone, would be slow
– Only a few rows match both indexes
• Intersect indexes and retrieve rows that match both
– Overhead of joining indexes often small relative to
cost of retrieving matching records from relation
Bitmap Indexes
• Earlier idea: use RID-lists in place of RIDs
– Save space when attribute values repeat
• Bitmap indexes take this one step further
– Use Bitmap in place of RID-list
– Each RID in the entire relation is represented by 1 bit
• 1 = RID is present in RID-list
• 0 = RID is absent from RID-list
– Bitmaps are usually compressed
• E.g using run-length encoding
Bitmap Index Example
• Bitmap index looks ID Name Sex
like this: 1 Fred M
<M,10100011>
2 Jill F
<F,01011100>
3 Joe M
4 Fran F
5 Ellen F
6 Kate F
7 Matt M
8 Bob M
Why Bitmap Indexes?
• Index intersection plans with bitmap indexes are fast
– Just perform bitwise AND!
– Index intersection with B-Trees requires a join
• SELECT COUNT(*) FROM R WHERE A=5 AND B < 10
– Bitmap index on A
– Bitmap index on B
– OR together bitmaps for B values that are < 10
– AND the result with the bitmap for A=5
– Can be computed very quickly
• Assuming not too many distinct B values that are < 10
• Save space for low-cardinality attributes
– As compared to a B-Tree or Hash index
– Particularly if compression is used
• Most useful for attributes with low or medium cardinality
– Not good for something like LastName
Compressing Bitmaps
• Consider a bitmap index on an attribute with 20 distinct values
• Each row has 1 value for that attribute
• 20 different bitmaps
– ith bit is set to 1 in one bitmap
– ith is set to 0 in 19 bitmaps
• Bitmaps consist mostly of zeros (95% of bits are zero)
– Good opportunity for compression
• Compression via run length encoding
– Just record number of zeros between adjacent ones
– 00000001000010000000000001100000
– Store this as “7,4,12,0,5”
• Compression Pros and Cons
– Reduce storage space → reduce number of I/Os required
– Need to compress/uncompress → increase CPU work required
– Each compression scheme negotiates this trade-off differently
– Operate directly on compressed bitmap → improved performance

Anda mungkin juga menyukai