CS 345: Topics in Data Warehousing: Thursday, October 21, 2004

Review of Tuesday’s Class
• Database System Architecture
– Memory management
– Secondary storage (disk)
– Query planning process
• Joins
– Nested Loop Join
– Merge Join
– Hash Join
• Grouping
– Sort vs. Hash
Outline of Today’s Class
• Indexes
– B-Tree and Hash Indexes
– Clustered vs. Non-Clustered
– Covering Indexes
• Using Indexes in Query Plans
• Bitmap Indexes
– Index intersection plans
– Bitmap compression
Indexes
• Provide efficient access to relevant records
– Based on values of particular attribute(s)
• Same idea as index in back of a book
• “fact tables 16, 17, 49”
– Information about fact tables on pages 16, 17, and 49
– No information about fact tables on other pages
– Without an index, we’d have to look through the
whole book page by page
Typical Index Structure
• Indexes organized based on some search key
– Column (or set of columns) whose values are used to
access the index
– Organization can be sorting or hashing
• Index is built for some relation
– One index entry per record in the relation
• Index consists of <Value, RID> pairs
– Value = value of the search key for this record
– RID = record identifier
• Tells the DBMS where the record is stored
• Usually (page number, offset in page)
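As a rough illustration of the <Value, RID> structure described above, here is a minimal Python sketch. The `RID` type, the `pages` layout, and the helper names are assumptions made for this example, not part of any particular DBMS.

```python
from typing import NamedTuple

class RID(NamedTuple):
    page: int    # page number within the relation's file
    offset: int  # slot position within that page

# Hypothetical relation: a list of pages, each page a list of records.
pages = [
    [{"A": 4, "name": "w"}, {"A": 2, "name": "x"}],
    [{"A": 7, "name": "y"}, {"A": 4, "name": "z"}],
]

def build_index(pages, key):
    """Return one <Value, RID> entry per record, sorted by search-key value."""
    entries = [
        (record[key], RID(page_no, offset))
        for page_no, page in enumerate(pages)
        for offset, record in enumerate(page)
    ]
    return sorted(entries)

def fetch(rid):
    """Follow a RID back to the stored record."""
    return pages[rid.page][rid.offset]

index_on_a = build_index(pages, "A")
print(index_on_a[0])             # smallest search-key value and its RID
print(fetch(index_on_a[0][1]))   # the record that RID points to
```

The index holds only the search key and a pointer, which is why it is usually much smaller than the relation itself.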
Sorted Index
• Index entries usually much smaller than records
– Record has many attributes besides search key
• Build search tree on top of index entries
– Allows particular value to be located quickly
[Figure: search tree with keys 2 and 5 built on top of the sorted index entries 2, 4, 4, 5, 7, 8]
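To see why sorted entries can be located quickly, the sketch below uses binary search from Python's `bisect` module as a stand-in for descending a search tree. The key values come from the figure; the RID strings are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Sorted search-key values from the figure; the RIDs are hypothetical.
values = [2, 4, 4, 5, 7, 8]
rids = ["r0", "r1", "r2", "r3", "r4", "r5"]

def point_lookup(v):
    """RIDs of all entries whose value equals v."""
    lo = bisect_left(values, v)
    hi = bisect_right(values, v)
    return rids[lo:hi]

def range_lookup(low, high):
    """RIDs of all entries with low <= value <= high."""
    lo = bisect_left(values, low)
    hi = bisect_right(values, high)
    return rids[lo:hi]

print(point_lookup(4))      # ['r1', 'r2']
print(range_lookup(5, 10))  # ['r3', 'r4', 'r5']
```

Both lookups touch a logarithmic number of entries to find the boundaries, then read a contiguous slice.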
B-Tree Index
• By far the most common type of index
• Sorted index with search tree
• Good for point queries and range queries
– Point query: A = 5
– Range query: A BETWEEN 5 AND 10
• Search tree nodes are page-sized
– Contain <Value, Pointer> pairs
– Each Pointer is to a node of the level below
• Trade-off in choosing index page sizes
– Larger pages → fewer search tree levels → fewer
page reads
– Larger pages → each page read takes longer
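The page-size trade-off can be made concrete with a little arithmetic. The sketch below assumes every index page is completely full and holds `entries_per_page` entries, which is a simplification; the function name and the numbers are illustrative only.

```python
def btree_levels(num_entries, entries_per_page):
    """Number of tree levels needed to reach num_entries leaf-level entries,
    assuming every page is full (a simplifying assumption)."""
    levels = 1
    capacity = entries_per_page
    while capacity < num_entries:
        capacity *= entries_per_page
        levels += 1
    return levels

# One million index entries with two hypothetical page sizes.
print(btree_levels(1_000_000, 100))   # 3 levels with 100 entries per page
print(btree_levels(1_000_000, 1000))  # 2 levels with 1000 entries per page
```

Larger pages save one page read per lookup here, but each of the remaining reads transfers ten times as many entries.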
Hash Indexes
• Useful for point queries
– Slightly better performance than B-Trees
– Not useful for range queries
• Less widely supported than B-Trees
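A minimal sketch of why hashing suits point queries but not range queries, using a Python dictionary as a stand-in for a hash index. The data and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical <Value, RID> entries.
entries = [(2, "r0"), (4, "r1"), (4, "r2"), (5, "r3"), (7, "r4"), (8, "r5")]

hash_index = defaultdict(list)
for value, rid in entries:
    hash_index[value].append(rid)

def hash_point_query(v):
    """One bucket lookup, regardless of index size."""
    return hash_index.get(v, [])

def hash_range_query(low, high):
    """Buckets are unordered, so every distinct key must be examined."""
    return sorted(
        rid
        for value, bucket in hash_index.items()
        if low <= value <= high
        for rid in bucket
    )

print(hash_point_query(4))      # ['r1', 'r2']
print(hash_range_query(5, 10))  # ['r3', 'r4', 'r5'], found only by visiting all keys
```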
Alternate B-Tree Organization
• Many records with same search key causes
redundancy
– <Stanford,RID1>,<Stanford,RID2>,
<Stanford,RID3>,<Stanford,RID4>
• Can store RID-lists instead
– <Stanford, (RID1,RID2,RID3,RID4)>
– Each value occurs once in the index
– Index entry is <Value,RID-list> instead of
<Value,RID>
– Saves space when search key has many repeated
values
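The conversion from <Value,RID> entries to <Value,RID-list> entries can be sketched as a single grouping pass over the sorted index. The "Berkeley" entry is added here purely for illustration.

```python
from itertools import groupby

# Sorted <Value, RID> entries; "Berkeley" is a made-up extra value.
entries = [
    ("Berkeley", "RID5"),
    ("Stanford", "RID1"),
    ("Stanford", "RID2"),
    ("Stanford", "RID3"),
    ("Stanford", "RID4"),
]

def to_rid_lists(sorted_entries):
    """Collapse repeated values so each value appears once with a RID-list."""
    return [
        (value, [rid for _, rid in group])
        for value, group in groupby(sorted_entries, key=lambda e: e[0])
    ]

print(to_rid_lists(entries))
# [('Berkeley', ['RID5']), ('Stanford', ['RID1', 'RID2', 'RID3', 'RID4'])]
```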
Clustered Indexes
• An index is clustered (or “clustering”) if records in the
relation are organized based on index search key
• Clustered indexes are good because:
– Records satisfying a range query are packed onto a small number
of consecutive pages
• In unclustered indexes, by contrast:
– Records satisfying a range query are spread across a large
number of random pages
– Commingled with other records that do not satisfy the query
• Only one clustered index allowed per relation
– A relation can’t be simultaneously sorted by 2 different attributes
– (Unless there are multiple copies of the relation)
Clustered vs. Unclustered
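[Figure: Clustered index. Search tree keys 2, 5 over index entries 2, 4, 4, 5, 7, 8; data pages hold (2, 4, 4) and (5, 7, 8), so matching records are fetched with sequential reads.]
[Figure: Unclustered index. Same search tree and index entries; data pages hold (4, 7, 5) and (2, 4, 8), so matching records are fetched with random reads.]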
Comparing Access Plans
• Consider query “SELECT * FROM R WHERE A=5”
• Three query plans:
– Scan relation R
• Sequential read of all pages in R
• Regardless of how many tuples have A=5
– Use clustered index on A
• Sequential read of relevant pages in R
• Num. relevant pages = (# of tuples with A=5) / (# of tuples per page)
• Plus overhead of accessing index pages
– Use unclustered index on A
• Random read of relevant pages in R
• Number of relevant pages ≤ (# of tuples with A=5), one page per matching tuple in the worst case
– Less if A is highly correlated with sort order of relation
• Plus overhead of accessing index pages
Comparing Access Plans
• Clustered index is always best
– Unless all tuples are being returned (then use scan)
– But clustered index may not be available
• Unclustered index beats scan when fraction of
tuples returned is small
– Depends on these factors:
• % of tuples being returned
• Cost ratio of random I/O vs. sequential I/O
• # of tuples per page
– Query returns >10% of rows → scan is almost
certainly faster
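The comparison of the three access plans can be sketched as a toy cost model. All constants below (the 10:1 random-to-sequential cost ratio, three index pages per lookup, table size) are assumptions chosen for illustration, not measurements.

```python
import math

SEQ_COST = 1.0                   # assumed cost of one sequential page read
RAND_COST = 10.0                 # assumed cost of one random page read
INDEX_OVERHEAD = 3 * RAND_COST   # assumed: three index pages read per lookup

def scan_cost(total_tuples, tuples_per_page):
    """Read every page of the relation sequentially."""
    return math.ceil(total_tuples / tuples_per_page) * SEQ_COST

def clustered_cost(matching, tuples_per_page):
    """Matching tuples are packed onto consecutive pages."""
    return INDEX_OVERHEAD + math.ceil(matching / tuples_per_page) * SEQ_COST

def unclustered_cost(matching):
    """Worst case: one random page read per matching tuple."""
    return INDEX_OVERHEAD + matching * RAND_COST

total, per_page = 1_000_000, 100  # hypothetical relation
for matching in (100, 100_000, 1_000_000):
    print(
        matching,
        scan_cost(total, per_page),
        clustered_cost(matching, per_page),
        unclustered_cost(matching),
    )
```

Under these assumptions the unclustered index wins only for very small result sets, and the scan overtakes even the clustered index once every tuple is returned.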
Covering Indexes
• Example using index in a book:
– “What does this book say about fact tables?”
• Look up “fact tables” in the index
• Turn to each page that is listed
• Read that page and see what it says
– “Which of these topics are discussed in this
book: fact tables, bridge tables, B-trees?”
• Look up the three topics in the index
• See how many of them appear
• Don’t need to read any of the actual book
Covering Indexes
• Sometimes an index has all the data you need
– Allows index-only query plan
– Not necessary to access the actual tuples
– Such an index is called a covering index
• SELECT COUNT(*) FROM R WHERE A=5
– Use index on A
– Count number of <5,RID> entries
– No need to look up records referenced by RIDs
• An index is a “thin” copy of a relation
– Not all columns from the relation are included
– The index is sorted in a particular way
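The index-only plan for the COUNT(*) query can be sketched as follows. The index contents are hypothetical; the point is that the RIDs are never dereferenced.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index on A: sorted <Value, RID> entries.
index_on_a = [(2, "r0"), (4, "r1"), (5, "r2"), (5, "r3"), (5, "r4"), (8, "r5")]
keys = [value for value, _rid in index_on_a]

def count_where_a_equals(v):
    """SELECT COUNT(*) FROM R WHERE A = v, answered from the index alone."""
    return bisect_right(keys, v) - bisect_left(keys, v)

print(count_where_a_equals(5))  # 3
```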
Multi-Column Indexes
• Multi-column indexes are very useful in data
warehousing
– We say such an index has a composite key
• Example: B-Tree index on (A,B)
– Search key is (A,B) combination
– Index entries sorted by A value
– Entries with same A value are sorted by B value
– Called a lexicographic sort
• SELECT SUM(B) FROM R WHERE A=5
– Our (A,B) index covers this query!
• Coverage vs. size trade-off
– More attributes in search key → index covers more queries
– More attributes in search key → index takes up more disk space
Fact and Dimension Indexes
• Dimension table index
– Narrow version of table with only frequently-queried attributes
– Always include dimension key!
– Improve performance on large dimension tables
• Fact table index
– Narrow version of fact table that omits certain dimensions / measures
– Useful for queries that exclusively reference indexed dimensions / measures
Order of Composite Key
• Index on (A,B) ≠ Index on (B,A)
– Can efficiently search based on leading terms
– No efficient search for trailing terms
• SELECT SUM(B) FROM R WHERE A=5
– Index on (A,B) is sorted by A
• Search for records where A=5
• Scan only the relevant portion of the index
– Index on (B,A) is sorted by B
• Records with A=5 are scattered throughout index
• Need to scan the entire index
• Or else do one search for each distinct value of B
– Oracle’s “index skip scans”
– Index on (A,B) is better for this query
– Either index is much faster than accessing relation!
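To illustrate how key order changes the work done, the sketch below answers SELECT SUM(B) FROM R WHERE A=5 from both an (A,B) index and a (B,A) index and counts the index entries read. The 100-row relation is invented for the example.

```python
from bisect import bisect_left

# Hypothetical relation with 100 rows: every (A, B) pair for A, B in 1..10.
rows = [(a, b) for a in range(1, 11) for b in range(1, 11)]
index_ab = sorted(rows)                       # lexicographic order on (A, B)
index_ba = sorted((b, a) for a, b in rows)    # lexicographic order on (B, A)

def sum_b_using_ab(a):
    """Seek to the first entry with A = a, then read until A changes."""
    total = examined = 0
    start = bisect_left(index_ab, (a,))
    for entry_a, entry_b in index_ab[start:]:
        examined += 1
        if entry_a != a:
            break
        total += entry_b
    return total, examined

def sum_b_using_ba(a):
    """A is the trailing column, so the whole index must be read."""
    total = examined = 0
    for entry_b, entry_a in index_ba:
        examined += 1
        if entry_a == a:
            total += entry_b
    return total, examined

print(sum_b_using_ab(5))  # (55, 11)
print(sum_b_using_ba(5))  # (55, 100)
```

Both plans avoid touching the relation, but only the (A,B) index limits the read to the relevant portion.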
Index Summary
• Indexes are useful in two ways:
– Indexes allow efficient search on some attributes due
to the way they are organized
– Index-only plans use small indexes in place of large
relations
• For OLAP queries, the second use is generally
more important
– Search via non-covering, non-clustered index leads to
random I/O
– Analysis queries typically aggregate lots of tuples
– Doing one random I/O per tuple can be costly
Example
• Sales(Date, Store, Product, Promotion,
TransactionId, Quantity, DollarAmt)
– Index on (Date, Store, Quantity, DollarAmt)
– Index on (Date, Promotion, Product, Quantity,
DollarAmt)
– Index on (Product, Date, Store, Quantity, DollarAmt)
• Store
– Index on (Name, District, StoreKey)
• Product
– Index on (Name, Brand, Dept, ProductKey)
– Index on (Brand, Dept, ProductKey)
Example Query
SELECT Brand, SUM(DollarAmt)
FROM Sales, Product, Store
WHERE Sales.ProductKey = Product.ProductKey
AND Sales.StoreKey = Store.StoreKey
AND Store.Name = 'Crystal Springs Safeway'
GROUP BY Brand
• Non-key columns used from each table:
– Sales: DollarAmt
– Product: Brand
– Store: Name
Selecting Indexes
• Sales(Date, Store, Product, Promotion, TransactionId, Quantity, DollarAmt)
– Index on (Date, Store, Quantity, DollarAmt): lacks Product
– Index on (Date, Promotion, Product, Quantity, DollarAmt): lacks Store
– Index on (Product, Date, Store, Quantity, DollarAmt)
• Store
– Index on (Name, District, StoreKey)
• Product
– Index on (Name, Brand, Dept, ProductKey): wider than needed
– Index on (Brand, Dept, ProductKey)
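The selection reasoning can be sketched as a coverage check: keep the indexes that contain every column the query needs from that table, then prefer the narrowest. Using column count as the measure of width is an assumption of this sketch, as is the `pick_index` helper; the Sales columns Product and Store are taken to hold the join keys.

```python
def pick_index(indexes, needed):
    """Return the narrowest index containing every needed column, or None."""
    covering = [cols for cols in indexes if set(needed) <= set(cols)]
    return min(covering, key=len) if covering else None

sales_indexes = [
    ("Date", "Store", "Quantity", "DollarAmt"),                # lacks Product
    ("Date", "Promotion", "Product", "Quantity", "DollarAmt"), # lacks Store
    ("Product", "Date", "Store", "Quantity", "DollarAmt"),
]
store_indexes = [("Name", "District", "StoreKey")]
product_indexes = [
    ("Name", "Brand", "Dept", "ProductKey"),                   # wider than needed
    ("Brand", "Dept", "ProductKey"),
]

sales_choice = pick_index(sales_indexes, {"Product", "Store", "DollarAmt"})
store_choice = pick_index(store_indexes, {"Name", "StoreKey"})
product_choice = pick_index(product_indexes, {"Brand", "ProductKey"})

print(sales_choice)
print(store_choice)
print(product_choice)
```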
Query Plan
• Search Store(Name, District, StoreKey) index for
Name=‘Crystal Springs Safeway’
• Nested Loop Join
– Outer = Sales(Product,Date,Store,Quantity,DollarAmt) index
– Inner = Qualifying Store index entries
– Output preserves sort order of Sales index
• Sort Product(Brand,Dept,ProductKey) index entries by
ProductKey
• Merge Join
– Result of Nested Loop Join (already sorted by ProductKey)
– Product(Brand,Dept,ProductKey)
• Hash resulting tuples on Brand (for GROUP BY)
– Compute SUM(DollarAmt) for each Brand
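The plan above can be traced on toy data. Everything below is a simplified sketch: the index contents, key values, and second store are invented, lists of tuples stand in for index pages, and the store index search is written as a simple filter rather than a real tree descent.

```python
from collections import defaultdict

# Hypothetical index entries (tuples in index column order).
store_index = [  # (Name, District, StoreKey)
    ("Crystal Springs Safeway", "Peninsula", 1),
    ("Menlo Park Safeway", "Peninsula", 2),
]
sales_index = sorted([  # (Product, Date, Store, Quantity, DollarAmt)
    (10, "2004-10-01", 1, 2, 5.0),
    (10, "2004-10-01", 2, 1, 2.5),
    (11, "2004-10-02", 1, 1, 4.0),
    (12, "2004-10-02", 1, 3, 9.0),
    (12, "2004-10-03", 2, 1, 3.0),
])
product_index = [  # (Brand, Dept, ProductKey)
    ("Acme", "Grocery", 10),
    ("Acme", "Grocery", 12),
    ("Bolt", "Hardware", 11),
]

# Step 1: search the Store index for the named store.
store_keys = [key for name, _district, key in store_index
              if name == "Crystal Springs Safeway"]

# Step 2: nested loop join. Outer = Sales index (sorted by Product),
# inner = qualifying Store entries. Output keeps the Sales index order.
joined = [(product, amt)
          for product, _date, store, _qty, amt in sales_index
          for key in store_keys if store == key]

# Step 3: sort Product index entries by ProductKey.
products_by_key = sorted((key, brand) for brand, _dept, key in product_index)

# Step 4: merge join on ProductKey (both inputs are sorted by it).
merged, j = [], 0
for product_key, amt in joined:
    while j < len(products_by_key) and products_by_key[j][0] < product_key:
        j += 1
    if j < len(products_by_key) and products_by_key[j][0] == product_key:
        merged.append((products_by_key[j][1], amt))

# Step 5: hash on Brand and accumulate SUM(DollarAmt).
totals = defaultdict(float)
for brand, amt in merged:
    totals[brand] += amt

print(dict(totals))  # {'Acme': 14.0, 'Bolt': 4.0}
```

No step reads the Sales, Store, or Product relations themselves; the whole query is answered from indexes.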
Index Intersection
• Suppose we have table R(A,B,C,D,E)
– B-Tree index on A
– B-Tree index on B
– No multi-column indexes
• SELECT COUNT(*) FROM R WHERE A=5 AND B < 10
• Use an index intersection plan
– Search A index for A=5
• Index entries have <A,RID>
• Think of the index as a 2-column table with schema I1(A,RID)
– Search B index for B<10
• Index entries have <B,RID>
• Think of the index as a 2-column table with schema I2(B,RID)
– Join qualifying index entries on I1.RID = I2.RID
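The intersection plan can be sketched by treating each index as a two-column table and joining on RID. The rows are hypothetical, and a Python set intersection stands in for the join.

```python
# Hypothetical relation R: RID -> (A, B).
relation = {0: (5, 3), 1: (5, 12), 2: (4, 7), 3: (5, 9), 4: (6, 1)}

# Each single-column index viewed as a table of (value, RID) pairs.
i1 = sorted((a, rid) for rid, (a, _b) in relation.items())  # I1(A, RID)
i2 = sorted((b, rid) for rid, (_a, b) in relation.items())  # I2(B, RID)

rids_a = {rid for a, rid in i1 if a == 5}    # search A index for A = 5
rids_b = {rid for b, rid in i2 if b < 10}    # search B index for B < 10

matching = rids_a & rids_b                   # join on I1.RID = I2.RID
print(len(matching))                         # COUNT(*) = 2
```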
Index Intersection
• Index intersection works well for conjunction of
multiple, moderately selective filters
– SELECT SUM(C) FROM R WHERE A=5 AND B<10
– 5% of rows have A=5
– 5% of rows have B<10
– 5% * 5% = 0.25% of rows have A=5 AND B<10 (assuming the two filters are independent)
– Retrieving rows matching A index alone, or B index
alone, would be slow
– Only a few rows match both indexes
• Intersect indexes and retrieve rows that match both
– Overhead of joining indexes often small relative to
cost of retrieving matching records from relation
Bitmap Indexes
• Earlier idea: use RID-lists in place of RIDs
– Save space when attribute values repeat
• Bitmap indexes take this one step further
– Use Bitmap in place of RID-list
– Each RID in the entire relation is represented by 1 bit
• 1 = RID is present in RID-list
• 0 = RID is absent from RID-list
– Bitmaps are usually compressed
• E.g., using run-length encoding
Bitmap Index Example
• Example relation:
ID  Name   Sex
1   Fred   M
2   Jill   F
3   Joe    M
4   Fran   F
5   Ellen  F
6   Kate   F
7   Matt   M
8   Bob    M
• Bitmap index looks like this:
– <M,10100011>
– <F,01011100>
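The bitmaps in this example can be derived mechanically from the Sex column. The sketch below is one possible way to do it; the `build_bitmaps` helper is a name invented for this illustration.

```python
rows = [
    (1, "Fred", "M"), (2, "Jill", "F"), (3, "Joe", "M"), (4, "Fran", "F"),
    (5, "Ellen", "F"), (6, "Kate", "F"), (7, "Matt", "M"), (8, "Bob", "M"),
]

def build_bitmaps(rows, column):
    """One bit string per distinct value; bit i is 1 when row i has that value."""
    bitmaps = {}
    for position, row in enumerate(rows):
        bits = bitmaps.setdefault(row[column], ["0"] * len(rows))
        bits[position] = "1"
    return {value: "".join(bits) for value, bits in bitmaps.items()}

print(build_bitmaps(rows, 2))  # {'M': '10100011', 'F': '01011100'}
```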
Why Bitmap Indexes?
• Index intersection plans with bitmap indexes are fast
– Just perform bitwise AND!
– Index intersection with B-Trees requires a join
• SELECT COUNT(*) FROM R WHERE A=5 AND B < 10
– Bitmap index on A
– Bitmap index on B
– OR together bitmaps for B values that are < 10
– AND the result with the bitmap for A=5
– Can be computed very quickly
• Assuming not too many distinct B values that are < 10
• Save space for low-cardinality attributes
– As compared to a B-Tree or Hash index
– Particularly if compression is used
• Most useful for attributes with low or medium cardinality
– Not good for something like LastName
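The OR-then-AND evaluation described above can be sketched with Python integers used as bitmaps, where bit i corresponds to row i. The eight rows are hypothetical.

```python
# Hypothetical relation R: row i holds (A, B).
rows = [(5, 3), (5, 12), (4, 7), (5, 9), (6, 1), (5, 10), (4, 2), (5, 0)]

def build_bitmaps(values):
    """Map each distinct value to an int whose bit i is set when row i has it."""
    bitmaps = {}
    for i, v in enumerate(values):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
    return bitmaps

bitmap_a = build_bitmaps(a for a, _b in rows)
bitmap_b = build_bitmaps(b for _a, b in rows)

# OR together the bitmaps for every B value that is < 10.
b_lt_10 = 0
for value, bits in bitmap_b.items():
    if value < 10:
        b_lt_10 |= bits

# AND with the bitmap for A = 5, then count the set bits.
matches = bitmap_a.get(5, 0) & b_lt_10
count = bin(matches).count("1")
print(count)  # 3 (rows 0, 3, and 7)
```

The OR loop runs once per distinct B value below 10, which is why the plan stays fast only when there are not too many such values.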
Compressing Bitmaps
• Consider a bitmap index on an attribute with 20 distinct values
• Each row has 1 value for that attribute
• 20 different bitmaps
– ith bit is set to 1 in one bitmap
– ith bit is set to 0 in the other 19 bitmaps
• Bitmaps consist mostly of zeros (95% of bits are zero)
– Good opportunity for compression
• Compression via run length encoding
– Just record number of zeros between adjacent ones
– 00000001000010000000000001100000
– Store this as “7,4,12,0,5”
• Compression Pros and Cons
– Reduce storage space → reduce number of I/Os required
– Need to compress/uncompress → increase CPU work required
– Each compression scheme negotiates this trade-off differently
– Operate directly on compressed bitmap → improved performance
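The run-length encoding in the example can be sketched as below. The convention that the final number counts the zeros after the last one is inferred from the "7,4,12,0,5" example; real bitmap compression schemes differ in their details.

```python
def rle_encode(bits):
    """Zero-run length before each 1, followed by the count of trailing zeros."""
    runs = []
    zeros = 0
    for bit in bits:
        if bit == "0":
            zeros += 1
        else:
            runs.append(zeros)
            zeros = 0
    runs.append(zeros)  # zeros after the last 1
    return runs

def rle_decode(runs):
    """Inverse of rle_encode."""
    *before_ones, trailing = runs
    return "".join("0" * n + "1" for n in before_ones) + "0" * trailing

# The 32-bit bitmap from the example, written in groups of four bits.
bitmap = "0000" "0001" "0000" "1000" "0000" "0000" "0110" "0000"

encoded = rle_encode(bitmap)
print(encoded)  # [7, 4, 12, 0, 5]
```

Five small integers replace 32 bits here; the saving grows as the bitmap becomes sparser.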
