E0 261
Jayant Haritsa
Computer Science and Automation
Indian Institute of Science
JAN 2010 R-TREES Slide 1
Spatial Applications
Computer Aided Design (CAD)
Find all p-wells that are within 1 micron of a clock line
Find all rivers that go through Karnataka (Line-Region intersection)
Find all forests that lie within Karnataka (Region-Region Overlap)
Find the ten nearest cities to Bangalore (Nearest Neighbor)
Find all villages within 10 miles of a metropolis (Spatial Join)
Low Dim
Biological Databases
Protein Folding
Sequence clustering
Solution Techniques
Multi-dimensional SAM index
R-tree, R*-tree, R+-tree, P-tree, s-k-d tree, ...
R(egion)-Tree
Balanced (similar to B+ tree).
I is an n-dimensional rectangle of the form (I0, I1, ..., In-1), where Ii is a range [a,b].
Leaf node index entries: (I, tuple_id)
Non-leaf node entry: (I, child_ptr)
M is the maximum number of entries per node.
m <= M/2 is the minimum number of entries per node.
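The node layout above can be sketched in Python as follows. This is a minimal illustration, not the layout of any particular implementation: the names `Rect`, `Entry`, `Node` and `mbr`, and the value M = 4, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A rectangle I = (I0, ..., In-1): one closed range (a, b) per dimension.
Rect = Tuple[Tuple[float, float], ...]

@dataclass
class Entry:
    rect: Rect
    ref: object  # tuple_id in a leaf entry, child Node in a non-leaf entry

@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)

def mbr(rects):
    """Smallest rectangle that encloses all the given rectangles."""
    dims = range(len(rects[0]))
    return tuple((min(r[d][0] for r in rects), max(r[d][1] for r in rects))
                 for d in dims)

M = 4        # maximum entries per node (value assumed for the example)
m = M // 2   # minimum entries per node, satisfying m <= M/2

# A leaf with two data rectangles, and a root whose single entry covers it.
leaf = Node(True, [Entry(((0, 1), (0, 1)), "t1"),
                   Entry(((2, 3), (1, 4)), "t2")])
root = Node(False, [Entry(mbr([e.rect for e in leaf.entries]), leaf)])
print(root.entries[0].rect)  # ((0, 3), (0, 4))
```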
Invariants
Every leaf (non-leaf) has between m and M records (children), except for the root.
Root has at least two children unless it is a leaf.
For each leaf (non-leaf) entry, I is the smallest rectangle that contains the data objects (children).
All leaves appear at the same level.
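One way to make these invariants concrete is a small checker that walks a tree and fails on the first violation. The sketch below is illustrative only: the `Node` class, the `(rect, ref)` entry tuples and the function names are assumptions for the example, and the tightness of leaf entries is not checked because the data objects themselves are not modelled.

```python
class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries  # list of (rect, tuple_id) or (rect, child Node)

def mbr(rects):
    """Smallest rectangle that encloses all the given rectangles."""
    return tuple((min(r[d][0] for r in rects), max(r[d][1] for r in rects))
                 for d in range(len(rects[0])))

def require(cond, msg):
    if not cond:
        raise ValueError(msg)

def check(node, m, M, is_root=True):
    """Return the height of the subtree; raise ValueError on a violation."""
    n = len(node.entries)
    if is_root:
        require(n <= M, "root overflows")
        require(node.is_leaf or n >= 2, "non-leaf root needs two children")
    else:
        require(m <= n <= M, "node fill outside [m, M]")
    if node.is_leaf:
        return 1
    heights = set()
    for rect, child in node.entries:
        # A non-leaf entry must tightly bound the rectangles of its child.
        require(rect == mbr([r for r, _ in child.entries]), "loose rectangle")
        heights.add(check(child, m, M, is_root=False))
    require(len(heights) == 1, "leaves at different levels")
    return heights.pop() + 1

leaf1 = Node(True, [(((0, 1), (0, 1)), "a"), (((2, 3), (2, 3)), "b")])
leaf2 = Node(True, [(((5, 6), (5, 6)), "c"), (((7, 8), (7, 8)), "d")])
root = Node(False, [(((0, 3), (0, 3)), leaf1), (((5, 8), (5, 8)), leaf2)])
print(check(root, m=2, M=4))  # 2
```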
AdjustTree during Insert has to go from the leaf to the root even when no split occurs, because the covering rectangles along the path may need to be enlarged.
Searching (Intersection)
Given search rectangle S, find objects in DB that intersect S.
Given search rectangle S, find index records whose MBRs intersect S.
Start at root and locate all child nodes whose rectangle I intersects S (via linear search).
Search the subtrees of those child nodes.
Search strategy?
When you get to the leaves, return entries whose rectangles intersect S.
Search may require inspecting several paths.
Worst case running time is not so good ...
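The search procedure can be sketched as a short recursive function. This is an illustration under assumptions: a node is modelled as a `(is_leaf, entries)` tuple, entries are `(rect, ref)` pairs, and rectangles that merely touch are counted as intersecting. The small tree is made up so that the query has to follow both paths from the root.

```python
def intersects(a, b):
    """True if rectangles a and b overlap in every dimension (touching counts)."""
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

def search(node, s):
    """Return the tuple_ids of all leaf entries whose rectangle intersects s."""
    is_leaf, entries = node
    hits = []
    for rect, ref in entries:          # linear scan of the node
        if intersects(rect, s):
            if is_leaf:
                hits.append(ref)       # ref is a tuple_id
            else:
                hits.extend(search(ref, s))  # ref is a child node
    return hits

leaf1 = (True, [(((0, 1), (0, 1)), "a"), (((2, 3), (2, 3)), "b")])
leaf2 = (True, [(((5, 6), (5, 6)), "c"), (((7, 8), (0, 1)), "d")])
root = (False, [(((0, 3), (0, 3)), leaf1), (((5, 8), (0, 6)), leaf2)])

# The query rectangle overlaps both root entries, so both subtrees are visited.
print(search(root, ((2.5, 5.5), (2.5, 5.5))))  # ['b', 'c']
```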
S = R16 (Intersection)
Insertion
Insertion is done at the leaves.
Where to put new index entry E with rectangle R?
Start at root.
Go down the tree by choosing the child whose rectangle needs the least enlargement to include R.
If there is room in the correct leaf node, insert it.
Otherwise split the node.
Adjust the tree.
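The descent step ("least enlargement") can be sketched as below. This is a minimal illustration: nodes are modelled as `(is_leaf, entries)` tuples, the helper names are made up, and breaking ties by the smaller existing area is an assumption that the slide does not state.

```python
def area(r):
    a = 1.0
    for lo, hi in r:
        a *= hi - lo
    return a

def union(a, b):
    """Smallest rectangle covering both a and b."""
    return tuple((min(l1, l2), max(h1, h2))
                 for (l1, h1), (l2, h2) in zip(a, b))

def enlargement(rect, r):
    """Extra area needed for rect to also cover r."""
    return area(union(rect, r)) - area(rect)

def choose_leaf(node, r):
    """Descend from node to the leaf that should receive rectangle r."""
    while not node[0]:
        # Pick the child entry needing the least enlargement;
        # ties go to the entry with the smaller area (assumed rule).
        _, node = min(node[1],
                      key=lambda e: (enlargement(e[0], r), area(e[0])))
    return node

leaf_a = (True, [(((0, 1), (0, 1)), "a")])
leaf_b = (True, [(((8, 9), (8, 9)), "b")])
root = (False, [(((0, 2), (0, 2)), leaf_a), (((7, 10), (7, 10)), leaf_b)])

# Covering (2..3, 2..3) costs 5 area units on the left entry and 55 on the right.
print(choose_leaf(root, ((2, 3), (2, 3))) is leaf_a)  # True
```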
Deletion
1. Find the entry to delete and remove it from leaf L.
2. Set N = L and Q = ∅. (Q is the set of eliminated nodes.)
3. Let P be N's parent and E_N be the entry in P that points to N.
4. If N has fewer than m entries, delete E_N from P and add N to Q.
5. If N has at least m entries, set the rectangle of E_N to tightly enclose N.
6. Set N = P and repeat from step 3 until N is the root.
7. Reinsert entries from eliminated leaves. Insert non-leaf entries higher up so that all leaves are at the same level.
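The upward pass of deletion (setting up Q, then walking from the leaf to the root) can be sketched as follows. This is an illustration under assumptions: the `Node` class with a `parent` pointer, the `[rect, ref]` entry lists and the function name `condense` are hypothetical, and the final reinsertion of the entries held in Q is left out because it needs a full insert routine.

```python
class Node:
    def __init__(self, is_leaf, entries, parent=None):
        self.is_leaf = is_leaf
        self.entries = entries  # list of [rect, ref]; ref is a tuple_id or a child Node
        self.parent = parent

def mbr(rects):
    return tuple((min(r[d][0] for r in rects), max(r[d][1] for r in rects))
                 for d in range(len(rects[0])))

def condense(leaf, m):
    """Walk from leaf to root after a removal; return Q, the eliminated nodes."""
    n, q = leaf, []
    while n.parent is not None:
        p = n.parent
        en = next(e for e in p.entries if e[1] is n)  # entry in P pointing to N
        if len(n.entries) < m:
            p.entries = [e for e in p.entries if e is not en]  # drop E_N from P
            q.append(n)
        else:
            en[0] = mbr([e[0] for e in n.entries])  # tighten E_N around N
        n = p
    return q

leaf1 = Node(True, [[((0, 1), (0, 1)), "a"], [((2, 3), (2, 3)), "b"]])
leaf2 = Node(True, [[((5, 6), (5, 6)), "c"], [((7, 8), (7, 8)), "d"],
                    [((9, 10), (9, 10)), "e"]])
root = Node(False, [[((0, 3), (0, 3)), leaf1], [((5, 10), (5, 10)), leaf2]])
leaf1.parent = leaf2.parent = root

leaf2.entries.pop()              # delete "e"; leaf2 still holds m = 2 entries
eliminated = condense(leaf2, m=2)
print(eliminated, root.entries[1][0])  # [] ((5, 8), (5, 8))
```

With m = 2, removing an entry from `leaf1` instead would leave it underfull, so `condense` would return `[leaf1]` and its remaining entry would have to be reinserted.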
Why Reinsert?
Nodes can be merged with the sibling whose area will increase the least, or entries can be redistributed. In any case, nodes may need to be split.
Reinsertion is easier to implement.
Reinsertion refines the spatial structure of the tree and reduces the effect of skew.
Entries to be reinserted are likely to be in memory because their pages are visited during the search to find the index entry to delete.
Splitting Nodes
Problem: Divide M+1 entries among two nodes so that it is unlikely that the nodes are needlessly examined during a search.
Solution: Minimize total area of the covering rectangles for both nodes.
Exponential algorithm.
Quadratic algorithm.
Linear time algorithm.
Exhaustive Search
Try all possible combinations
${}^{M+1}C_{m} \cdot {}^{M+1-m}C_{m} \cdot 2^{M+1-2m}$
Includes repetitions. Can you come up with the correct formula?
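A brute-force version of the exhaustive split is easy to write for small M. The sketch below is an illustration, not a reference implementation: the function names are made up, and pinning entry 0 to the first group is a choice made here so that each unordered split is examined only once.

```python
from itertools import combinations

def area(r):
    a = 1.0
    for lo, hi in r:
        a *= hi - lo
    return a

def mbr(rects):
    return tuple((min(r[d][0] for r in rects), max(r[d][1] for r in rects))
                 for d in range(len(rects[0])))

def exhaustive_split(rects, m):
    """Try every split with at least m entries per group; minimize total area."""
    idx = range(len(rects))
    best = None
    for k in range(m, len(rects) - m + 1):
        for g1 in combinations(idx, k):
            if 0 not in g1:
                continue  # entry 0 is pinned to group 1 to skip mirrored splits
            g2 = [i for i in idx if i not in g1]
            cost = (area(mbr([rects[i] for i in g1])) +
                    area(mbr([rects[i] for i in g2])))
            if best is None or cost < best[0]:
                best = (cost, list(g1), g2)
    return best

# M = 4, so a split has to divide M + 1 = 5 entries; m = 2.
rects = [((0, 1), (0, 1)), ((1, 2), (0, 1)), ((0, 1), (1, 2)),
         ((10, 11), (10, 11)), ((11, 12), (10, 11))]
print(exhaustive_split(rects, m=2))  # (6.0, [0, 1, 2], [3, 4])
```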
Quadratic Algorithm
1. Find the pair of entries E1 and E2 that maximizes area(J) - area(E1) - area(E2), where J is the covering rectangle of E1 and E2 (i.e. maximizes wasted area).
2. Put E1 in one group, E2 in the other.
3. If one group has M-m+1 entries, put the remaining entries into the other group and stop. If all entries have been distributed then stop.
4. For each entry E, calculate d1 and d2, where di is the area increase in the covering rectangle of Group i when E is added.
5. Find E with maximum |d1 - d2| and add E to the group whose area will increase the least (i.e. maximum affinity).
6. Repeat starting with step 3.
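The quadratic split steps can be sketched as below. This is an illustration under assumptions: entries are identified by their index in `rects`, the function names are made up, and a tie between d1 and d2 is sent to the first group, a rule the slide does not specify.

```python
def area(r):
    a = 1.0
    for lo, hi in r:
        a *= hi - lo
    return a

def union(a, b):
    return tuple((min(l1, l2), max(h1, h2))
                 for (l1, h1), (l2, h2) in zip(a, b))

def quadratic_split(rects, m):
    """Split M+1 rectangles into two groups of entry indices."""
    n = len(rects)
    M = n - 1
    # Seed selection: the pair that wastes the most area if grouped together.
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    s1, s2 = max(pairs, key=lambda p: area(union(rects[p[0]], rects[p[1]]))
                 - area(rects[p[0]]) - area(rects[p[1]]))
    groups = [[s1], [s2]]
    covers = [rects[s1], rects[s2]]
    rest = [i for i in range(n) if i not in (s1, s2)]

    def d(i, g):
        """Area increase of group g's covering rectangle if entry i joins it."""
        return area(union(covers[g], rects[i])) - area(covers[g])

    while rest:
        # Stop early if one group is so full that the other needs all the rest.
        for g in (0, 1):
            if len(groups[g]) == M - m + 1:
                groups[1 - g].extend(rest)
                return groups
        # Next entry: the one with the strongest preference for one group.
        e = max(rest, key=lambda i: abs(d(i, 0) - d(i, 1)))
        g = 0 if d(e, 0) <= d(e, 1) else 1  # tie goes to group 0 (assumed)
        groups[g].append(e)
        covers[g] = union(covers[g], rects[e])
        rest.remove(e)
    return groups

rects = [((0, 1), (0, 1)), ((1, 2), (0, 1)), ((0, 1), (1, 2)),
         ((10, 11), (10, 11)), ((11, 12), (10, 11))]
print(quadratic_split(rects, m=2))  # [[0, 1, 2], [4, 3]]
```

On this input the seeds are entries 0 and 4, and the result matches the minimum-area split that an exhaustive search would find, although that is not guaranteed in general.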
Quadratic (contd)
Algorithm is quadratic in M.
Linear in number of dimensions.
But not optimal.
Linear Algorithm
For each dimension, choose the pair of rectangles that have the maximum distance between them (w.r.t. edges).
Normalize by dividing the distance by the width of the entire set along that dimension.
Put the two entries with the largest normalized separation (along any dimension) into different groups.
Randomly choose the next non-assigned entry and put it in the group with the lesser area increase.
Algorithm is linear, with almost no attempt at optimality.
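The seed-selection part of the linear algorithm can be sketched as follows. This is an illustration under assumptions: "distance w.r.t. edges" is taken to mean the gap between the highest low edge and the lowest high edge along a dimension, the function name is made up, and the random assignment of the remaining entries is not shown.

```python
def linear_pick_seeds(rects):
    """Return the indices of the two seed entries (one per group)."""
    n = len(rects)
    best = None
    for d in range(len(rects[0])):
        # Entry whose low edge is furthest along this dimension.
        hi_low = max(range(n), key=lambda i: rects[i][d][0])
        # Among the other entries, the one whose high edge is lowest.
        lo_high = min((i for i in range(n) if i != hi_low),
                      key=lambda i: rects[i][d][1])
        # Width of the entire set along this dimension, used to normalize.
        width = max(r[d][1] for r in rects) - min(r[d][0] for r in rects)
        gap = rects[hi_low][d][0] - rects[lo_high][d][1]
        sep = gap / width if width else 0.0
        if best is None or sep > best[0]:
            best = (sep, lo_high, hi_low)
    return best[1], best[2]

rects = [((0, 1), (0, 1)), ((1, 2), (0, 1)), ((0, 1), (1, 2)),
         ((10, 11), (10, 11)), ((11, 12), (10, 11))]
# Dimension 0 gives (11 - 1) / 12, dimension 1 gives (10 - 1) / 11.
print(linear_pick_seeds(rects))  # (0, 4)
```

Each dimension needs only a constant number of passes over the entries, which is where the linear running time comes from.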
Performance Tests
CENTRAL circuit cell (1057 rectangles).
Measure performance on the last 10% of inserts.
Search used randomly generated rectangles that match about 5% of the data.
Delete every 10th data item.
Performance
Conclusions
Linear time splitting algorithm is almost as good as the others.
Low node-fill requirement reduces space utilization but is not significantly worse than stricter node-fill requirements.
R-tree can be added to relational databases.
Questions
Why choose a rectangle as the bounding structure? Why not some other object, for example a sphere?
Why isn't E always the best in Figures 4.4-4.6 (search performance)?