❖ Varieties of Selection |
Selection: select * from R where C
where R is a relation and C is a boolean condition on its attributes.
❖ Varieties of Selection (cont) |
Examples of different selection types:
one:  select * from R where id = 1234
pmr:  select * from R where age=65
      select * from R where age=65 and gender='m'
rng:  select * from R where age≥18 and age≤21
      select * from R where age between 18 and 21
                        and height between 160 and 190
(one = at most one match, pmr = partial-match retrieval, rng = range query)
❖ Implementing Select Efficiently |
Two basic approaches:
* sequential scan of all pages, checking every tuple (always possible)
* use the file organisation or an auxiliary structure (sorting, hashing, indexing) to limit the tuples examined
Our analyses assume: 1 input buffer available for each relation.
If more buffers are available, most methods benefit.
❖ Heap Files |
❖ Selection in Heaps |
For all selection queries, the only possible strategy is:
// select * from R where C
for each page P in file of relation R {
for each tuple t in page P {
if (t satisfies C)
add tuple t to result set
}
}
i.e. linear scan through file searching for matching tuples
❖ Selection in Heaps (cont) |
The heap is scanned from the first to the last page:
Cost_range = Cost_pmr = b
If we know that only one tuple matches the query (one query),
a simple optimisation is to stop the scan once that tuple is found.
Cost_one : Best = 1, Average = b/2, Worst = b
❖ Insertion in Heaps |
Insertion: new tuple is appended to file (in last page).
rel = openRelation("R", READ|WRITE);
pid = nPages(rel)-1;
get_page(rel, pid, buf);
if (size(newTup) > size(buf)) {
    deal with oversize tuple
} else {
    if (!hasSpace(buf,newTup)) {
        pid++; nPages(rel)++;
        clear(buf);
    }
    insert_record(buf,newTup);
    put_page(rel, pid, buf);
}
Cost_insert = 1r + 1w
Plus possible extra writes for oversize tuples, e.g. PostgreSQL's TOAST
❖ Insertion in Heaps (cont) |
Alternative strategy:
* find any page of R with enough free space for the new tuple
* requires free-space management: knowing which pages of R have space
For how PostgreSQL does this, see backend/access/heap/{heapam.c,hio.c}
❖ Insertion in Heaps (cont) |
PostgreSQL's tuple insertion:
heap_insert(Relation relation,   // relation desc
            HeapTuple newtup,    // new tuple data
            CommandId cid, ...)  // SQL statement
finds a page with enough free space for newtup, inserts newtup there,
and sets the tuple's xmin to the inserting transaction's ID (for MVCC).
❖ Deletion in Heaps |
SQL: delete from R where C
Implementation of deletion:
rel = openRelation("R",READ|WRITE);
for (p = 0; p < nPages(rel); p++) {
    get_page(rel, p, buf);
    ndels = 0;
    for (i = 0; i < nTuples(buf); i++) {
        tup = get_record(buf,i);
        if (tup satisfies Condition)
            { ndels++; delete_record(buf,i); }
    }
    if (ndels > 0) put_page(rel, p, buf);
    if (ndels > 0 && unique) break;
}
❖ Deletion in PostgreSQL |
PostgreSQL tuple deletion:
heap_delete(Relation relation,    // relation desc
            ItemPointer tid, ..., // tupleID
            CommandId cid, ...)   // SQL statement
marks the tuple as deleted by setting its xmax to the deleting
transaction's ID; the tuple is physically reclaimed later (by vacuum).
❖ Updates in Heaps |
SQL: update R set A = expr where C
Analysis for updates is similar to that for deletion
Complication: new tuple larger than old version (too big for page)
Solution: delete, re-organise free space, then insert
❖ Updates in Heaps (cont) |
PostgreSQL tuple update:
heap_update(Relation relation,     // relation desc
            ItemPointer otid,      // old tupleID
            HeapTuple newtup, ..., // new tuple data
            CommandId cid, ...)    // SQL statement
essentially does delete(otid) followed by insert(newtup);
the old tuple's ctid field is set to point to the new tuple version.
❖ Heaps in PostgreSQL |
PostgreSQL stores all table data in heap files (by default).
Typically there are also associated index files.
If the data would be more useful in some other form, that is achieved
via indexes, e.g.
create index ... using hash
❖ Heaps in PostgreSQL (cont) |
PostgreSQL "heap file" may use multiple physical files:
* OID ....... main data file, named by the table's OID
* OID.1 ..... extra data file(s), added once a file reaches 1GB (OID.2, ...)
* OID_fsm ... free space map (tracks pages with room for insertions)
* OID_vm .... visibility map (tracks pages whose tuples are all visible)
❖ Sorted Files |
Records stored in file in order of some field k (the sort key).
Makes searching more efficient; makes insertion less efficient
E.g. consider a sorted file with page capacity c = 4
❖ Sorted Files (cont) |
In order to mitigate insertion costs, use overflow blocks.
Total number of overflow blocks = b_ov.
Average overflow chain length: Ov = b_ov / b.
Bucket = data page + its overflow page(s)
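The bucket-search code on the following slides assumes each page records the index of the next page in its overflow chain. A minimal sketch of such a page layout in C (field names and page size are assumptions, not from the source):

#include <stdint.h>

#define PAGE_SIZE 4096
#define NO_PAGE   (-1)    // marks the end of an overflow chain

// Hypothetical fixed-size page: ovflow is the link that searchBucket()
// (below) follows to visit every page in a bucket.
typedef struct {
    int32_t ovflow;   // page index of next overflow page, or NO_PAGE
    int32_t nTuples;  // number of tuples currently stored in this page
    char    data[PAGE_SIZE - 2 * sizeof(int32_t)];  // tuple storage
} Page;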
❖ Selection in Sorted Files |
For one queries on sort key, use binary search.
// select * from R where k = val  (sorted on R.k)
lo = 0; hi = b-1
while (lo <= hi) {
    mid = (lo+hi) / 2;  // int division with truncation
    (tup,loVal,hiVal) = searchBucket(f,mid,k,val);
    if (tup != NULL) return tup;
    else if (val < loVal) hi = mid - 1;
    else if (val > hiVal) lo = mid + 1;
    else return NOT_FOUND;
}
return NOT_FOUND;
where f is the relation's data file; lo, mid, hi are page indexes;
k is the sort key attribute; and val, loVal, hiVal are values of k
(the value sought, and the min/max k values in the current bucket).
❖ Selection in Sorted Files (cont) |
Search a page and its overflow chain for a key value
searchBucket(f,pid,k,val)
{
    buf = getPage(f,pid);
    (tup,min,max) = searchPage(buf,k,val,+INF,-INF)
    if (tup != NULL) return (tup,min,max);
    ovf = openOvFile(f);
    ovp = ovflow(buf);
    while (tup == NULL && ovp != NO_PAGE) {
        buf = getPage(ovf,ovp);
        (tup,min,max) = searchPage(buf,k,val,min,max)
        ovp = ovflow(buf);
    }
    return (tup,min,max);
}
Assumes each page contains index of next page in Ov chain
getPage(f,pid) = { read_page(f,pid,buf); return buf; }
❖ Selection in Sorted Files (cont) |
Search within a page for key; also find min/max key values
searchPage(buf,k,val,min,max)
{
    res = NULL;
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (tup.k == val) res = tup;
        if (tup.k < min) min = tup.k;
        if (tup.k > max) max = tup.k;
    }
    return (res,min,max);
}
❖ Selection in Sorted Files (cont) |
The above method treats each bucket like a single large page.
Cases:
* best: matching tuple found in the first data page examined
* worst: key not present; binary search examines log2 b data pages,
  plus (potentially) all of their overflow pages
Cost_one : Best = 1, Worst = log2 b + b_ov
Average case cost analysis needs assumptions (e.g. data distribution)
❖ Search with pmr and range Queries |
For pmr query, on non-unique attribute k, where file is sorted on k
select * from R where k = 2
Begin by locating a page p containing k = val (as for one queries).
Scan backwards and forwards from p to find all matches.
Thus, Cost_pmr = Cost_one + (b_q-1).(1+Ov)
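For example (values assumed for illustration): if b_q = 3 pages contain matches and Ov = 0.5, the scan adds (3-1)×(1+0.5) = 3 page reads on top of Cost_one.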
❖ Search with pmr and range Queries (cont) |
For range queries on unique sort key (e.g. primary key):
select * from R where k >= 5 and k <= 13
Cost_range = Cost_one + (b_q-1).(1+Ov)
❖ Search with pmr and range Queries (cont) |
For range queries on non-unique sort key, similar method to pmr.
select * from R where k >= 2 and k <= 6
Cost_range = Cost_one + (b_q-1).(1+Ov)
❖ Search with pmr and range Queries (cont) |
So far, have assumed query condition involves sort key k.
But what about:  select * from R where j = 100.0
If the condition involves an attribute j other than the sort key,
the ordering gives no help, so
Cost_one, Cost_range, Cost_pmr are as for heap files.
❖ Insertion into Sorted Files |
Insertion approach:
* locate the appropriate page for the new tuple (as for one queries)
* if the page has space, insert the tuple there
* otherwise, insert it into an overflow block for that page
Consider insertions of tuples with k=33, k=25, k=99 into such a file.
❖ Deletion from Sorted Files |
E.g. delete from R where k = 2
Deletion strategy:
* locate the matching tuple(s), as for selection
* mark them as deleted, and write the affected pages back
Recall: selectivity determines b_q (# pages with matches)
Thus, Cost_delete = Cost_select + b_q.w
❖ Hashing |
Basic idea: use key value to compute page address of tuple.
e.g. a tuple with key = v is stored in page i, where i = h(v)
Requires: a hash function h(v) that maps KeyDomain → [0..b-1].
❖ Hashing (cont) |
PostgreSQL hash function (simplified):
Datum hash_any(unsigned char *k, register int keylen)
{
    register uint32 a, b, c, len;
    register const uint32 *ka = (const uint32 *) k;

    /* Set up the internal state */
    len = keylen;
    a = b = c = 0x9e3779b9 + len + 3923095;

    /* handle most of the key */
    while (len >= 12) {
        a += ka[0]; b += ka[1]; c += ka[2];
        mix(a, b, c);
        ka += 3; len -= 12;
    }
    /* collect any data from last 11 bytes into a,b,c */
    mix(a, b, c);
    return UInt32GetDatum(c);
}
See backend/access/hash/hashfunc.c for the full version;
mix(a,b,c) scrambles the bits of its three arguments.
❖ Hashing (cont) |
hash_any() returns a uint32 value.
Two ways to map the raw hash value into a page address:

// if the file has b = 2^k pages, use the k lower-order bits
uint32 hashToPageNum(uint32 hval) {
    uint32 mask = 0xFFFFFFFF;
    return (hval & (mask >> (32-k)));
}

// for an arbitrary number of pages b, reduce the hash modulo b
uint32 hashToPageNum(uint32 hval) {
    return (hval % b);
}
❖ Hashing Performance |
Aims:
* distribute tuples as evenly as possible over the buckets
* keep each overflow chain as short as possible
Best case: every bucket contains same number of tuples.
Worst case: every tuple hashes to same bucket.
Average case: some buckets have more tuples than others.
Use overflow pages to handle "overfull" buckets (cf. sorted files)
All tuples in each bucket must have same hash value.
❖ Hashing Performance (cont) |
Two important measures for hash files:
* load factor:  L = r / (b.c)
* average overflow chain length:  Ov = b_ov / b

Case    |  L  |     Ov
--------+-----+------------
Best    | ≅ 1 | 0
Worst   | ≫ 1 | **
Average | < 1 | 0 < Ov < 1

(** performance becomes the same as a Heap File)
To achieve the average case, aim for 0.75 ≤ L ≤ 0.9.
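As a minimal sketch, the two measures can be computed from the file's statistics (struct and function names here are assumptions for illustration):

// Hypothetical statistics for a hash file.
typedef struct {
    long r;    // number of tuples
    long b;    // number of main data pages
    long c;    // tuples per page (capacity)
    long bov;  // total number of overflow pages
} HashFileStats;

// Load factor L = r / (b*c); for good average-case behaviour,
// aim for 0.75 <= L <= 0.9.
double loadFactor(HashFileStats s) { return (double)s.r / ((double)s.b * s.c); }

// Average overflow chain length Ov = b_ov / b.
double avgOvChain(HashFileStats s) { return (double)s.bov / (double)s.b; }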
❖ Selection with Hashing |
Select via hashing on unique key k (one)
// select * from R where k = val
(pid,P) = getPageViaHash(val,R)
for each tuple t in page P {
if (t.k == val) return t
}
for each overflow page Q of P {
for each tuple t in page Q {
if (t.k == val) return t
} }
Cost_one : Best = 1, Avg = 1 + Ov/2, Worst = 1 + max(OvLen)
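For example (Ov value assumed): with Ov = 0.5, an average one query costs 1 + 0.5/2 = 1.25 page reads — compare log2 b for a sorted file, or b/2 for a heap.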
❖ Selection with Hashing (cont) |
Select via hashing on non-unique hash key nk (pmr)
// select * from R where nk = val
(pid,P) = getPageViaHash(val,R)
for each tuple t in page P {
if (t.nk == val) add t to results
}
for each overflow page Q of P {
for each tuple t in page Q {
if (t.nk == val) add t to results
} }
return results
Cost_pmr = 1 + Ov
❖ Selection with Hashing (cont) |
Hashing does not help with range queries** ...
Cost_range = b + b_ov
Selection on attribute j which is not hash key ...
Cost_one, Cost_range, Cost_pmr = b + b_ov
** unless the hash function is order-preserving (and most aren't)
❖ Insertion with Hashing |
Insertion uses similar process to one queries.
// insert tuple t with key=val into rel R
(pid,P) = getPageViaHash(val,R)
if room in page P {
insert t into P; return
}
for each overflow page Q of P {
if room in page Q {
insert t into Q; return
} }
add new overflow page Q
link Q to previous page
insert t into Q
Cost_insert : Best: 1r + 1w, Worst: (1 + max(OvLen))r + 2w
❖ Deletion with Hashing |
Similar performance to select on non-unique key:
// delete from R where k = val
// f = data file ... ovf = ovflow file
(pid,P) = getPageViaHash(val,R)
ndel = delTuples(P,k,val)
if (ndel > 0) putPage(f,P,pid)
for each overflow page qid,Q of P {
    ndel = delTuples(Q,k,val)
    if (ndel > 0) putPage(ovf,Q,qid)
}
Extra cost over select is cost of writing back modified blocks.
Method works for both unique and non-unique hash keys.
❖ Problem with Hashing... |
So far, discussion of hashing has assumed a fixed file size (b).
What file size b should we use? Too small ⇒ long overflow chains;
too large ⇒ many under-filled pages.
Methods for hashing with dynamic files (b changes as the file grows):
* extendible hashing and dynamic hashing (use a directory, no overflow pages)
* linear hashing (no directory; uses overflow pages)
❖ Problem with Hashing... (cont) |
All flexible hashing methods:
* use a hash function producing a 32-bit bit-string:
      uint32 hash(unsigned char *val)
* require a function to extract d bits from the bit-string:
      uint32 bits(int d, uint32 val)
* use the result of bits() as a page address.
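A minimal sketch of bits() in C, assuming (as the comment in the later search code says) that the d lower-order bits are used:

#include <stdint.h>

// Extract the d lower-order bits of a 32-bit hash value.
// Assumes 0 < d < 32.  E.g. bits(3, 0xB5) == 0x5 (binary 101).
uint32_t bits(int d, uint32_t val)
{
    return val & ((1u << d) - 1);   // mask off all but the low d bits
}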
❖ Page Splitting |
Important concept for flexible hashing: splitting
When a page is split, its tuples are redistributed using one extra bit
of hash value. E.g. with d=3, all tuples whose hash ends in 101 live in
one page; when it is split (d=4), tuples whose hash ends in 0101 stay,
while those ending in 1101 move to the newly added page.
❖ Page Splitting (cont) |
Example of splitting:
Tuples only show key value; assume h(val) = val
❖ Linear Hashing |
File organisation:
* a file of b main data pages
* plus overflow pages, for tuples whose main page is full
* plus registers holding the split pointer sp and depth d
Disadvantage: requires overflow pages (don't split on full pages)
❖ Linear Hashing (cont) |
File grows linearly (one block at a time, at regular intervals).
Has "phases" of expansion; over each phase, b doubles.
❖ Selection with Lin.Hashing |
If b = 2^d, the file behaves exactly like standard hashing.
Use d bits of the hash to compute the page address.
// select * from R where k = val
h = hash(val);
pid = bits(d,h);  // lower-order bits
P = getPage(f, pid)
for each tuple t in page P
         and its overflow pages {
    if (t.k == val) add t to Result;
}
Average Cost_one = 1 + Ov
❖ Selection with Lin.Hashing (cont) |
If b != 2^d, treat different parts of the file differently.
Parts A and C are treated as if part of a file of size 2^(d+1).
Part B is treated as if part of a file of size 2^d.
Part D does not yet exist (tuples in B may eventually move into it).
❖ Selection with Lin.Hashing (cont) |
Modified search algorithm:
// select * from R where k = val
h = hash(val);
pid = bits(d,h);
if (pid < sp) { pid = bits(d+1,h); }
P = getPage(f, pid)
for each tuple t in page P
and its overflow blocks {
if (t.k == val) add t to Result;
}
❖ Insertion with Lin.Hashing |
Abstract view:
pid = bits(d,hash(val));
if (pid < sp) pid = bits(d+1,hash(val));
// bucket[pid] = page pid + its overflow pages
for each page P in bucket[pid] {
if (space in P) {
insert tuple into P
break
}
}
if (no insertion) {
add new ovflow page to bucket[pid]
insert tuple into new page
}
if (need to split) {
partition tuples from bucket[sp]
into bucket[sp] and bucket[sp+2^d]
sp++;
if (sp == 2^d) { d++; sp = 0; }
}
❖ Splitting |
How to decide that we "need to split"?
Two approaches to triggering a split:
* split whenever an insertion causes a new overflow page to be added
* split at regular intervals, e.g. after every k insertions
Note: we always split block sp, even if it is not full and not the
block that triggered the split.
Systematic splitting like this eventually visits every block, keeping
overflow chains (and hence average search cost) short.
❖ Exercise: Insertion into Linear Hashed File |
Consider a file with b=4, c=3, d=2, sp=0, hash(x) as above
Insert tuples in alphabetical order, with the following keys and hashes:
 k | hash(k) |  k | hash(k) |  k | hash(k) |  k | hash(k)
---+---------+----+---------+----+---------+----+--------
 a | 10001   |  g | 00000   |  m | 11001   |  s | 01110
 b | 11010   |  h | 00000   |  n | 01000   |  t | 10011
 c | 01111   |  i | 10010   |  o | 00110   |  u | 00010
 d | 01111   |  j | 10110   |  p | 11101   |  v | 11111
 e | 01100   |  k | 00101   |  q | 00010   |  w | 10000
 f | 00010   |  l | 00101   |  r | 00000   |  x | 00111
The hash values are the 5 lower-order bits from the full 32-bit hash.
❖ Exercise: Insertion into Linear Hashed File (cont) |
Splitting algorithm:
// partition tuples between two buckets
newp = sp + 2^d; oldp = sp;
for all tuples t in bucket[oldp] {
pid = bits(d+1,hash(t.k));
if (pid == newp)
add tuple t to bucket[newp]
else
add tuple t to bucket[oldp]
}
sp++;
if (sp == 2^d) { d++; sp = 0; }
❖ Insertion Cost |
If no split required, cost same as for standard hashing:
Cost_insert : Best: 1r + 1w, Avg: (1+Ov)r + 1w, Worst: (1+max(Ov))r + 2w
If a split occurs, incur Cost_insert plus the cost of splitting:
* read bucket sp, including its overflow pages: (1+Ov)r
* write back bucket sp and write the new bucket sp+2^d
❖ Deletion with Lin.Hashing |
Deletion is similar to ordinary static hash file.
But we might wish to contract the file once enough tuples are removed.
Rationale: r shrinks while b stays large ⇒ wasted space.
Method: reverse the splitting process ...
remove the last main data page, re-inserting its tuples into the
"buddy" page it was split from, and step sp (and, at a phase
boundary, d) back by one.
❖ Hash Files in PostgreSQL |
PostgreSQL uses linear hashing on tables which have been:
create index Ix on R using hash (k);
Hash file implementation: backend/access/hash
* hashfunc.c ..... hash functions for the various data types
* hashinsert.c ... inserts entries into the hash index
* hashpage.c ..... hash page/bucket management (including splitting)
* hashsearch.c ... searches the hash index
❖ Hash Files in PostgreSQL (cont) |
PostgreSQL uses a slightly different file organisation: main pages and
overflow pages live in the same file, with bitmap pages recording which
overflow pages are currently in use.
❖ Indexes |
A 1-d index is based on the value of a single attribute A.
Some possible properties of A:
* primary ...... index on unique field; file may be sorted on A
* clustering ... index on non-unique field; file sorted on A
* secondary .... file not sorted on A
A given table may have indexes on several attributes.
❖ Indexes (cont) |
Indexes themselves may be structured in several ways:
* dense .......... every tuple is referenced by an entry in the index file
* sparse ......... only some tuples are referenced by index file entries
* single-level ... tuples are accessed directly from the index file
* multi-level .... may need to access several index pages to reach tuple

Index file has total i pages (where typically i ≪ b)
Index file has page capacity c_i (where typically c_i ≫ c)
Dense index: i = ceil( r/c_i )    Sparse index: i = ceil( b/c_i )
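For example (values assumed for illustration): with r = 100,000 tuples in b = 5,000 data pages and c_i = 1,000 index entries per page, a dense index needs ceil(100000/1000) = 100 pages, while a sparse index needs only ceil(5000/1000) = 5 pages.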
❖ Exercise: Index Storage Overheads |
Consider a relation with the following storage parameters:
How many pages are needed to hold a sparse index?
❖ Selection with Primary Index |
For one queries:
ix = binary search index for entry with key K
if nothing found { return NotFound }
P = getPage(pageOf(ix.tid))
t = getTuple(P,offsetOf(ix.tid))
-- may require reading overflow pages
return t
Worst case: read log2 i index pages + read 1+Ov data pages.
Thus, Cost_one,prim = log2 i + 1 + Ov
Assume: index pages are same size as data pages ⇒ same reading cost
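For example (values assumed): with i = 1,000 index pages and Ov = 0.2, Cost_one,prim ≈ log2 1000 + 1 + 0.2 ≈ 11 page reads, versus b/2 on average for an unindexed heap scan.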
❖ Selection with Primary Index (cont) |
For range queries on the primary key:
* use the index to locate the lower bound lo
* scan index entries sequentially, collecting the data pages to visit,
  until the upper bound hi is passed
* fetch each collected data page (plus overflow chain) and extract the
  matching tuples (method below)
❖ Selection with Primary Index (cont) |
Method for range queries
// e.g. select * from R where a between lo and hi
pages = {}   results = {}
ixPage = findIndexPage(R.ixf,lo)
while (ixTup = getNextIndexTuple(R.ixf)) {
    if (ixTup.key > hi) break;
    pages = pages ∪ pageOf(ixTup.tid)
}
foreach pid in pages {
    // scan data page plus ovflow chain
    while (buf = getPage(R.datf,pid)) {
        foreach tuple T in buf {
            if (lo <= T.a && T.a <= hi)
                results = results ∪ T
        }
    }
}
❖ Insertion with Primary Index |
Overview:
tid = insert tuple into page P at position p
find location for new entry in index file
insert new index entry (k,tid) into index file
Problem: order of index entries must be maintained
Reorganisation requires, on average, reading/writing half of the index file:
Cost_insert,prim = (log2 i)r + i/2.(1r + 1w) + (1+Ov)r + (1+δ)w
❖ Deletion with Primary Index |
Overview:
find tuple using index
mark tuple as deleted
delete index entry for tuple
If we delete index entries by marking them (rather than physically
removing them and reorganising the index), the expensive reorganisation
step is avoided: the index page is simply written back.
❖ Clustering Index |
Data file sorted; one index entry for each key value
Cost penalty: maintaining both index and data file as sorted
(Note: can't mark index entry for value X until all X tuples are deleted)
❖ Secondary Index |
Data file not sorted; want one index entry for each key value
Cost_pmr = (log2 i_1 + a + b_q.(1+Ov))r
(search the first index file, read a pages of matching tuple-ids from
the second, then fetch the b_q data pages and their overflow chains)
❖ Multi-level Indexes |
The Secondary Index above used two index files (Ix1, Ix2) to speed up search.
The idea can be taken further:
* since Ix2 is ordered, Ix1 need only be sparse (one entry per Ix2 page)
* if Ix1 grows too large, add a sparse index Ix3 on top of Ix1
* ... and so on, level by level
Ultimately, reduce the top level of the index hierarchy to one page.
❖ Multi-level Indexes (cont) |
Example data file with three-levels of index:
Assume: not primary key, c = 100, c_i = 3
❖ Select with Multi-level Index |
For one query on indexed key field:
xpid = top level index page
for level = 1 to d {
read index page xpid
search index page for J'th entry
where index[J].key <= K < index[J+1].key
if (J == -1) { return NotFound }
xpid = index[J].page
}
pid = xpid // pid is data page index
search page pid and its overflow pages
Cost_one,mli = (d + 1 + Ov)r
(Note that d = ceil( log_ci r ), and c_i is large because index entries are small)
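For example (values assumed): r = 10^6 tuples with c_i = 100 gives d = ceil( log_100 10^6 ) = 3, so Cost_one,mli = (3 + 1 + Ov) page reads.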
❖ B-Trees |
B-trees are MSTs (multi-way search trees) with the properties:
* all leaf nodes are the same distance from the root (balanced)
* every node is at least half full (except, possibly, the root)
❖ B-Trees (cont) |
Example B-tree (depth=3, n=3)
(Note: in DBs, nodes are pages ⇒ large branching factor, e.g. n=500)
❖ B-Tree Depth |
Depth depends on effective branching factor (i.e. how full nodes are).
Simulation studies show typical B-tree nodes are 69% full.
Gives load L_i = 0.69 × c_i and tree depth ≈ ceil( log_Li r ).
Example: c_i = 128, L_i = 88
Level | #nodes | #keys
------+--------+----------
root  |      1 |       87
1     |     88 |     7656
2     |   7744 |   673728
3     | 681472 | 59288064
Note: c_i is generally larger than 128 for a real B-tree.
❖ Insertion into B-Trees |
Overview of the method:
1. find the leaf node where the new key belongs
2. if the leaf has space, insert the entry there
3. otherwise, split the leaf into two half-full nodes and promote the
   middle key to the parent (splits may propagate up to the root,
   increasing the tree's depth)
❖ Example: B-tree Insertion |
Starting from this tree:
insert the following keys in the given order 12 15 30 10
❖ B-Tree Insertion Cost |
Insertion cost = Cost_treeSearch + Cost_treeInsert + Cost_dataInsert
Best case: write one page (most of the time)
Common case: 3 node writes (rearrange 2 leaves + parent)
❖ B-Tree Insertion Cost (cont) |
Worst case: 2D-1 node writes (propagate to root)
❖ Selection with B-Trees |
For one queries:
N = B-tree root node
while (N is not a leaf node)
    N = scanToFindChild(N,K)
tid = scanToFindEntry(N,K)
access tuple T using tid
Cost_one = (D + 1)r
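For example, using the simulation figures above (c_i = 128, L_i = 88): a depth-3 B-tree can index tens of millions of tuples, so any one query costs only 4 page reads.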
❖ Selection with B-Trees (cont) |
For range queries (assume sorted on index attribute):
search index to find leaf node for Lo
for each leaf node entry until Hi found {
    access tuple T using tid from entry
}
Cost_range = (D + b_i + b_q)r
❖ B-trees in PostgreSQL |
PostgreSQL implements ≅ Lehman/Yao-style B-trees
See backend/access/nbtree:
* README ........ design notes on the implementation
* nbtree.c ...... interface functions
* nbtsearch.c ... B-tree search
* nbtinsert.c ... B-tree insertion
❖ B-trees in PostgreSQL (cont) |
Interface functions for B-trees
// build Btree index on relation
Datum btbuild(rel,index,...)
// insert index entry into Btree
Datum btinsert(rel,key,tupleid,index,...)
// start scan on Btree index
Datum btbeginscan(rel,key,scandesc,...)
// get next tuple in a scan
Datum btgettuple(scandesc,scandir,...)
// close down a scan
Datum btendscan(scandesc)
Produced: 30 Apr 2024