《數(shù)據(jù)庫系統(tǒng)》教學(xué)課件
《數(shù)據(jù)庫系統(tǒng)》教學(xué)課件,數(shù)據(jù)庫系統(tǒng),數(shù)據(jù)庫,系統(tǒng),教學(xué),課件
Tree-Structured IndexesJianlin FengSchool of SoftwareSUN YAT-SEN UNIVERSITYcourtesy of Joe Hellerstein for some slidesReview:Files,Pages,RecordsnAbstraction of stored data is“files”with“pages”of“records”.qRecords live on pagesqPhysical Record ID(RID)=qRecords can have fixed length or variable length.nFiles can be unordered(heap),sorted,or kind of sorted(i.e.,“clustered”)on a search key.nIndexes can be used to speed up many kinds of accesses.(i.e.,“access paths”)Tree-Structured Indexes:IntroductionnSelections of form:field constantnEquality selections(op is=)qEither“tree”or“hash”indexes help here.nRange selections(op is one of,=,BETWEEN)q“Hash”indexes dont work for these.nMore complex selections(e.g.spatial containment)qThere are fancier trees that can do thisnTree-structured indexing techniques support both range selections and equality selections.qISAM:static structure;early index technology.qB+tree:dynamic,adjusts gracefully under inserts and deletes.Range SearchesnFind all students with gpa 3.0qIf data is in sorted file,do binary search to find first such student,then scan to find others.qCost of binary search in a database can be quite high.nWhy?nSimple idea:Create an index file.*Can do binary search on(smaller)index file!Page 1Page 2Page NPage 3Data Filek2kNk1Index FileISAMnIndex file may still be quite large.But we can apply the idea repeatedly!*Leaf pages contain data entries.index entryNon-leafPagesPagesPrimary pagesLeafP0K1P1K2P2KmPmOverflow pageExample ISAM TreenIndex entries:,they direct search for data entries in leaves.nExample where each node can hold 2 entries;10*15*20*27*33*37*40*46*51*55*63*97*2033516340RootISAM is a STATIC StructurenFile creation:qLeaf(data)pages allocated sequentially,sorted by search keyqthen index pagesqthen overflow pgs.nSearch:Start at root;use key comparisons to go to leaf.nCost=log F N qF=#entries/page(i.e.,fanout)qN=#leaf pgsq no need for next-leaf-page pointers.(Why?)nInsert:Find leaf that data entry belongs to,and put it there.Overflow page if necessary.nDelete:Seek and destroy!If deleting a tuple empties an overflow page,de-allocate it and remove from linked-list.Static tree structure:inserts/deletes affect only leaf pages.Data PagesIndex PagesOverflow pagesPage Number48*Example:Insert 23*,48*,41*,42*10*15*20*27*33*37*40*46*51*55*63*97*2033516340RootOverflowPagesLeafIndexPagesPagesPrimary23*41*42*48*10*15*20*27*33*37*40*46*51*55*63*97*2033516340RootOverflowPagesLeafIndexPagesPagesPrimary23*41*42*.then Deleting 42*,51*,97*Note that 51*appears in index levels,but not in leaf!B+Tree StructurenEach node contains d=m =2d entries(index or data)qThe parameter d is called the order of the tree.qEach internal node contains m index entries:.qEach leaf node contains m data entries:nThe ROOT node contains between 1 and 2d index entries.qIt is a leaf or has at least two children.nEach path from the ROOT to any leaf has the same length.nSupports equality and range-searches efficiently.Index EntriesData Entries(Sequence set)(Direct search)B+Tree Equality SearchnSearch begins at root,and key comparisons direct it to a leaf.nSearch for 15*Based on the search for 15*,we know it is not in the tree!Root1724302*3*5*7*14*16*19*20*22*24*27*29*33*34*38*39*13B+Tree Range SearchnSearch all records whose ages are in 15,28.qEquality search 15*.qFollow sibling pointers.Root1724302*3*5*7*14*16*19*20*22*24*27*29*33*34*38*39*13B+Trees in PracticenTypical order:100.Typical fill-factor:67%.qaverage fanout=133nCan often hold top levels in buffer pool:qLevel 1=1 page =8 KBqLevel 2=133 pages=1 MBqLevel 3=17,689 pages=145 MB qLevel 4=2,352,637 pages=19 GB vWith 1 MB buffer,can locate one record in 19 GB(or 0.3 billion records)in two I/Os!Inserting a Data Entry into a B+TreenFind correct leaf L.nPut data entry onto L.qIf L has enough space,done!qElse,must split L(into L and a new node L2)nRedistribute entries evenly,copy up middle key.nInsert index entry pointing to L2 into parent of L.nThis can happen recursivelyqTo split index node,redistribute entries evenly,but push up middle key.(Contrast with leaf splits.)nSplits“grow”tree;root split increases height.qTree growth:gets wider or one level taller at top.Example B+Tree Inserting 8*Root1724302*3*5*7*14*16*19*20*22*24*27*29*33*34*38*39*13Animation:Insert 8*Root1724302*3*5*7*14*16*19*20*22*24*27*29*33*34*38*39*138*145Final B+Tree-Inserting 8*v Notice that root was split,leading to increase in height.v In this example,we can avoid split by re-distributing entries;however,this is usually not done in practice.Root1724302*3*5*7*14*16*19*20*22*24*27*29*33*34*38*39*132*3*Root17243014*16*19*20*22*24*27*29*33*34*38*39*1357*5*8*Data vs.Index Page Split(from previous example of inserting“8*”)nObserve how minimum occupancy is guaranteed in both leaf and index pg splits.nNote difference between copy-up and push-up;be sure you understand the reasons for this.2*3*5*7*5Entry to be inserted in parent node.(Note that 5 iscontinues to appear in the leaf.)s copied up and2*3*5*7*8*Data Page Split8*5243013appears once in the index.Contrast17Entry to be inserted in parent node.(Note that 17 is pushed up and onlythis with a leaf split.)17243013Index Page Split5Deleting a Data Entry from a B+TreenStart at root,find leaf L where entry belongs.nRemove the entry.qIf L is at least half-full,done!qIf L has only d-1 entries,nTry to re-distribute,borrowing from sibling(adjacent node with same parent as L).nIf re-distribution fails,merge L and sibling.nIf merge occurred,must delete entry(pointing to L or sibling)from parent of L.nMerge could propagate to root,decreasing height.Root1724302*3*5*7*14*16*19*20*22*24*27*29*33*34*38*39*132*3*Root17243014*16*19*20*22*24*27*29*33*34*38*39*1357*5*8*Example Tree(including 8*)Delete 19*and 20*.Example Tree(including 8*)Delete 19*and 20*.nDeleting 19*is easy.nDeleting 20*is done with re-distribution.Notice how middle key is copied up.2*3*Root17243014*16*19*20*22*24*27*29*33*34*38*39*1357*5*8*2*3*Root173014*16*33*34*38*39*1357*5*8*22*24*2727*29*.And Then Deleting 24*nMust merge.nObserve toss of index entry(on right),and pull down of index entry(below).3022*27*29*33*34*38*39*2*3*7*14*16*22*27*29*33*34*38*39*5*8*Root3013517Example of Non-leaf Re-distributionnTree is shown below during deletion of 24*.(What could be a possible initial tree?)nIn contrast to previous example,can re-distribute entry from left child of root to right child.Root1351720223014*16*17*18*20*33*34*38*39*22*27*29*21*7*5*8*3*2*After Re-distributionnIntuitively,entries are re-distributed by pushing through the splitting entry in the parent node.nIt suffices to re-distribute index entry with key 20;weve re-distributed 17 as well for illustration.14*16*33*34*38*39*22*27*29*17*18*20*21*7*5*8*2*3*Root13517302022Bulk Loading of a B+TreenGiven:large collection of recordsnDesire:B+tree on some fieldnBad idea:repeatedly insert recordsqSlow,and poor leaf space utilization.Why?nBulk Loading can be done much more efficiently.nInitialization:Sort all data entries,insert pointer to first(leaf)page in a new(root)page.3*4*6*9*10*11*12*13*20*22*23*31*35*36*38*41*44*Sorted pages of data entries;not yet in B+treeRootBulk Loading(Contd.)nIndex entries for leaf pages always entered into right-most index page just above leaf level.When this fills up,it splits.(Split may go up right-most path to the root.)nMuch faster than repeated inserts.3*4*6*9*10*11*12*13*20*22*23*31*35*36*38*41*44*RootData entry pages not yet in B+tree352312610203*4*6*9*10*11*12*13*20*22*23*31*35*36*38*41*44*6Root101223203538not yet in B+treeData entry pages Summary of Bulk LoadingnOption 1:multiple inserts.qSlow.qDoes not give sequential storage of leaves.nOption 2:Bulk Loading qFewer I/Os during build.qLeaves will be stored sequentially(and linked,of course).qCan control“fill factor”on pages.A Note on OrdernOrder(d)makes little sense with variable-length entriesnUse a physical criterion in practice(at least half-full).qIndex pages often hold many more entries than leaf pages.qVariable sized records and search keys:ndifferent nodes have different numbers of entries.qEven with fixed length fields,Alternative(3)gives variable lengthnMany real systems are even sloppier than this-only reclaim space when a page is completely empty.SummarynTree-structured indexes are ideal for range-searches,also good for equality searches.nISAM is a static structure.qOnly leaf pages modified;overflow pages needed.qOverflow chains can degrade performance unless size of data set and data distribution stay constant.nB+tree is a dynamic structure.qInserts/deletes leave tree height-balanced;log F N cost.qHigh fanout(F)means depth rarely more than 3 or 4.qTypically,67%occupancy on average.qUsually preferable to ISAM;adjusts to growth gracefully.qIf data entries are data records,splits can change rids!Summary(Contd.)nKey compression increases fanout,reduces height.nBulk loading can be much faster than repeated inserts for creating a B+tree on a large data set.nB+tree widely used because of its versatility.qOne of the most optimized components of a DBMS.
收藏