Selection on One Attribute


Selection


Varieties of Selection2/164

Selection is a relational algebra operation that ...

We consider three distinct styles of selection; each style can be implemented with several possible file structures/techniques.


... Varieties of Selection3/164

We can view a relation as defining a tuple space

E.g. if all n attributes are specified (N = n), we are checking for the existence of a tuple at a single point in this space


One-dimensional Selection


Operations for 1d Select5/164

One-dimensional select queries = condition on single attribute.


... Operations for 1d Select6/164

[Diagram:Pics/select/qspace-small.png]


... Operations for 1d Select7/164

Other operations on relations:


... Operations for 1d Select8/164

In the algorithms below, we assume:


Implementing Select Efficiently9/164

Two basic approaches:

Our analyses assume: 1 input buffer available for each relation.

If more buffers are available, most methods benefit.


Heap Files

Note: this is not "heap" as in the heap data structure (the partially-ordered tree used for priority queues).
It means simply an unordered collection of tuples in a file.


Heap File Structure11/164

The simplest possible file organisation.

New tuples inserted at end of file; tuples deleted by marking.

[Diagram:Pics/file-struct/heap-small.png]


Selection in Heaps12/164

For all selection queries, the only possible strategy is:

// select * from R where C
for each page P in file of relation R {
    for each tuple t in page P {
        if (t satisfies C)
            add tuple t to result set
    }
}

i.e. linear scan through file searching for matching tuples


... Selection in Heaps13/164

The heap is scanned from the first to the last page:

[Diagram:Pics/file-struct/heapscan-small.png]

Costrange  =  Costpmr  =  b

If we know that only one tuple matches the query (one query),
a simple optimisation is to stop the scan once that tuple is found.

Costone :      Best = 1      Average = b/2      Worst = b


Insertion in Heaps14/164

Insertion: new tuple is appended to file (in last page).

rel = openRelation("R", READ|WRITE);
pid = nPages(rel)-1;
get_page(rel, pid, buf);
if (size(newTup) > size(buf))
   { deal with oversize tuple }
else {
   if (!hasSpace(buf,newTup)) 
      { pid++; nPages(rel)++; clear(buf); }
   insert_record(buf,newTup);
   put_page(rel, pid, buf);
}

Costinsert  =  1r + 1w         (plus possible extra writes for oversize recs)


... Insertion in Heaps15/164

Alternative strategy:

PostgreSQL's strategy:


... Insertion in Heaps16/164

PostgreSQL's tuple insertion:

heap_insert(Relation relation,    // relation desc
            HeapTuple newtup,     // new tuple data
            CommandId cid, ...)   // SQL statement


Deletion in Heaps17/164

Deletion for one-type conditions (at most one matching tuple), e.g.

delete from Employees where id = 12345

Method:

Costdelete1:   best: 1r+1w   avg: (b/2)r+1w   worst: br+1w


... Deletion in Heaps18/164

Deletion for pmr or range conditions, e.g.

delete from Employees where dept = 'Marketing'
delete from Employees where 40 <= age and age < 50

Method:

Costdelete  =  br + bqw     (where bq is #pages with matches)


... Deletion in Heaps19/164

Implementation of deletion:

// Deletion in heaps ... delete from R where C
rel = openRelation("R",READ|WRITE);
for (p = 0; p < nPages(rel); p++) {
    get_page(rel, p, buf);
    ndels = 0;
    for (i = 0; i < nTuples(buf); i++) {
        tup = get_record(buf,i);
        if (tup satisfies C)
            { ndels++; delete_record(buf,i); }
    }
    if (ndels > 0) put_page(rel, p, buf);
    if (ndels > 0 && unique) break;
}


... Deletion in Heaps20/164

How to deal with free slots:

Insertion with free slots:

rel = openRelation("R",READ|WRITE);
for (p = 0; p < nPages(rel); p++) {
    get_page(rel, p, buf);
    if (hasSpace(buf,newTup))
        { insert_record(buf,newTup); break; }
}
if (p == nPages(rel)) // no page had space; append new page
    { clear(buf); insert_record(buf,newTup); }
put_page(rel, p, buf);


... Deletion in Heaps21/164

PostgreSQL tuple deletion:

heap_delete(Relation relation,    // relation desc
            ItemPointer tid, ..., // tupleID
            CommandId cid, ...)   // SQL statement


Updates in Heaps22/164

Example:   update R set F = val where Condition

Analysis for updates is similar to that for deletion

Costupdate  =  br + bqw

A complication: new version of tuple may not fit in page.

Solution:   delete and then insert
(Cost harder to analyse; depends on how many "oversize" tuples are produced)


... Updates in Heaps23/164


// Update in heaps ... update R set F = val where C
rel = openRelation("R",READ|WRITE);
for (p = 0; p < nPages(rel); p++) {
    get_page(rel, p, buf);
    nmods = 0;
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (tup satisfies C) {
            nmods++;
            newTup = modTuple(tup,F,val);
            delTuple(buf,i);
            InsertTupleInHeap(rel,newTup);
    }   }
    if (nmods > 0) put_page(rel, p, buf);
}


... Updates in Heaps24/164

PostgreSQL tuple update:

heap_update(Relation relation,     // relation desc
            ItemPointer otid,      // old tupleID
            HeapTuple newtup, ..., // new tuple data
            CommandId cid, ...)    // SQL statement


Heaps in PostgreSQL25/164

PostgreSQL stores all table data in heap files (by default).

Typically there are also associated index files.

If a file is more useful in some other form:

Heap file implementation:   src/backend/access/heap


... Heaps in PostgreSQL26/164

PostgreSQL "heap files" use 3 physical files


... Heaps in PostgreSQL27/164

Free space map

Visibility map


Sorted Files


Sorted Files29/164

Records stored in file in order of some field k (the sort key).

Makes searching more efficient; makes insertion less efficient

[Diagram:Pics/file-struct/insert-in-sorted-small.png]


... Sorted Files30/164

In order to simplify insertion, use overflow blocks.

[Diagram:Pics/file-struct/sorted-file-small.png]

Large-scale file re-arrangement occurs less frequently.


... Sorted Files31/164

Conceptual view of sorted file structure:

[Diagram:Pics/file-struct/sfile1-small.png]

Total number of overflow blocks = bov.

Average overflow chain length = Ov = bov/b.

Bucket = data page + its overflow page(s)
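
For example (figures assumed for illustration): a file with b = 1000 data pages and bov = 250 overflow pages has Ov = 250/1000 = 0.25, so scanning an average bucket costs 1 + Ov = 1.25 page reads.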


Selection in Sorted Files32/164

For one queries on sort key, use binary search.

// select * from R where k = val  (sorted on R.k)
lo = 0; hi = b-1;
while (lo <= hi) {
    mid = (lo+hi) div 2;
    (tup,loVal,hiVal) = searchBucket(f,mid,k,val);
    if (tup != null) return tup;
    else if (val < loVal) hi = mid - 1;
    else if (val > hiVal) lo = mid + 1;
    else return NOT_FOUND;
}
return NOT_FOUND;


... Selection in Sorted Files33/164

Search a page and its overflow chain for a key value

searchBucket(f,p,k,val)
{
    buf = getPage(f,p);
    (tup,min,max) = searchPage(buf,k,val,+INF,-INF)
    if (tup != null) return(tup,min,max);
    ovf = openOvFile(f);
    ovp = ovflow(buf);
    while (tup == NULL && ovp != NO_PAGE) {
        buf = getPage(ovf,ovp);
        (tup,min,max) = searchPage(buf,k,val,min,max)
        ovp = ovflow(buf);
    }     
    return (tup,min,max);
}

Assumes each page stores the page id of the next page in its overflow chain

Note: getPage(f,pid) = { read_page(relOf(f),pid,buf); return buf; }


... Selection in Sorted Files34/164

Search within a page for key; also find min/max values

searchPage(buf,k,val,min,max)
{
    res = null;
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (tup.k == val) res = tup;
        if (tup.k < min) min = tup.k;
        if (tup.k > max) max = tup.k;
    }
    return (res,min,max);
}


... Selection in Sorted Files35/164

The above method treats each bucket as a single large page.

Cases:

Costone :     Best = 1      Worst = log2 b + bov


... Selection in Sorted Files36/164

Average Costone is messier to analyse:

Costone :    Average  ≅  ((log2b)-1)(1+Ov)

To be more "precise"

In general, average case is difficult to analyse since dependent on other factors such as data distribution.
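
One rough argument for the formula above, assuming the matching tuple is equally likely to lie in any bucket: binary search first reaches 1 bucket after 1 probe, 2 more after 2 probes, 4 more after 3 probes, and so on, so the expected number of probes is ∑k k.2k-1/b ≅ log2b − 1; multiplying by the average bucket cost 1+Ov gives the approximation shown.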


... Selection in Sorted Files37/164

For pmr query, on non-unique attribute k

[Diagram:Pics/file-struct/sfile-pmr-small.png]

Begin by locating a page p containing k=val   (as for one query).

Scan backwards and forwards from p to find matches.

Thus, Costpmr  =  Costone + (bq-1).(1+Ov)


... Selection in Sorted Files38/164

For range queries on unique sort key (e.g. primary key):

Costrange  =  Costone + (bq-1).(1+Ov)

If secondary key, similar method to pmr.

[Diagram:Pics/file-struct/sfile-range-small.png]


... Selection in Sorted Files39/164

So far, have assumed query condition involves sort key k.

If condition contains attribute j, not the sort key

Costone,   Costrange,   Costpmr   as for heap files


Insertion in Sorted Files40/164

Insertion is more complex than for Heaps.

Thus, Costinsert  =  Costone + δw

where δ  =  1 or 2, depending on whether we need to create a new overflow block


Deletion in Sorted Files41/164

Deletion involves:

Cost depends on selection criteria   (i.e. on how many matching tuples there are)

Thus, Costdelete  =  Costselect + bqw


Hashed Files


Hashing43/164

Basic idea: use key value to compute page address of tuple.

[Diagram:Pics/file-struct/hash-small.png]

e.g. tuple with key = v is stored in page i, where i = h(v)

Requires: hash function h(v) that maps KeyDomain → [0..b-1].


Hash Functions44/164

Points to note:


... Hash Functions45/164

Usual approach in hash function:

May need separate hash function for each data type

Or, convert each data value to a string and hash that.
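
A minimal sketch of the "convert to a string and hash that" approach, using an FNV-1a-style mix for illustration (this is not PostgreSQL's hash_any):

#include <stdint.h>
#include <stdio.h>

/* Illustration only: hash the bytes of a value's string representation,
 * folding them into a 32-bit result (FNV-1a). */
uint32_t hash_string(const char *s)
{
    uint32_t h = 2166136261u;          /* FNV-1a offset basis */
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}

int main(void)
{
    char buf[32];
    int key = 12345;
    snprintf(buf, sizeof(buf), "%d", key);   /* convert value to a string */
    printf("hash = %u\n", hash_string(buf));
    return 0;
}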


... Hash Functions46/164

PostgreSQL hash function (simplified):


Datum hash_any(unsigned char *k, register int keylen)
{
   register uint32 a, b, c, len;
   register uint32 *ka = (uint32 *) k;   /* view key as an array of words */
   /* Set up the internal state */
   len = keylen;  a = b = c = 0x9e3779b9 + len + 3923095;
   /* handle most of the key */
   while (len >= 12) {
      a += ka[0]; b += ka[1]; c += ka[2];
      mix(a, b, c);
      ka += 3;  len -= 12;
   }
   /* collect any data from last 11 bytes into a,b,c */
   mix(a, b, c);
   return UInt32GetDatum(c);
}


... Hash Functions47/164

Where mix is defined as:


#define mix(a,b,c) \
{ \
  a -= b; a -= c; a ^= (c>>13); \
  b -= c; b -= a; b ^= (a<<8);  \
  c -= a; c -= b; c ^= (b>>13); \
  a -= b; a -= c; a ^= (c>>12); \
  b -= c; b -= a; b ^= (a<<16); \
  c -= a; c -= b; c ^= (b>>5);  \
  a -= b; a -= c; a ^= (c>>3);  \
  b -= c; b -= a; b ^= (a<<10); \
  c -= a; c -= b; c ^= (b>>15); \
}

i.e. scrambles all of the bits from the bytes of the key value

See backend/access/hash/hashfunc.c for details.


... Hash Functions48/164

Above functions give hash value as 32-bit quantity (uint32).

Two ways to map raw hash value into a page address:
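
For illustration (function names assumed, not PostgreSQL's), the two usual mappings are a modulo reduction, which works for any b, and keeping the low-order d bits, which requires b = 2^d:

#include <stdint.h>
#include <assert.h>

/* Two common ways to reduce a 32-bit hash to a page number in [0..b-1]. */
uint32_t page_by_mod(uint32_t h, uint32_t b)  { return h % b; }

uint32_t page_by_bits(uint32_t h, uint32_t d) /* requires b = 2^d */
{
    assert(d > 0 && d < 32);
    return h & ((1u << d) - 1);       /* keep low-order d bits */
}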


Hashing Performance49/164

Aims:

Note: if data distribution not uniform, block address distribution cannot be uniform.

Best case: every block contains same number of tuples.

Worst case: every tuple hashes to same block.

Average case: some blocks have more tuples than others.


... Hashing Performance50/164

With non-uniform distribution or too many tuples, some blocks will fill.

Use overflow blocks to contain "excess" tuples; one chain of overflow blocks for each primary block   (containing all tuples with same hash value).

[Diagram:Pics/file-struct/hash-file-small.png]


... Hashing Performance51/164

Trade-off (assuming reasonably uniform hash function):

Want to minimise overflow chain length, yet keep blocks as full as possible.


... Hashing Performance52/164

Two important measures for hash files:

Three cases for distribution of tuples in a hashed file:

Case       L        Ov
Best       ≅ 1      0
Worst      >> 1     **
Average    < 1      0 < Ov < 1

(** performance is same as Heap File)

To achieve average case, aim for   0.75  ≤  L  ≤  0.9.
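
For example (figures assumed): with r = 100,000 tuples, capacity c = 100 tuples/page and a target load L = 0.8, we would allocate roughly b = r/(c×L) = 100000/80 = 1250 pages.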


Selection with Hashing53/164

Best performance occurs for one queries on hash key field.

Basic strategy:

Best Costone  =  1    (find in data page)

Average Costone  =  1+Ov/2    (scan half of ovflow chain)

Worst Costone  =  1+max(OvLen)    (find in last page of ovflow chain)


... Selection with Hashing54/164

Select via hashing on unique hash key k (one)

Overview:

// select * from R where k = val
pid,P = hash(val) % nPages(R);
for each tuple t in page P {
    if (t.k == val) return t
}
for each overflow page Q of P {
    for each tuple t in page Q {
        if (t.k == val) return t
}   }


... Selection with Hashing55/164

Details of select via hashing on unique key k:

// select * from R where k = val
f = openFile(relName("R"),READ);
ovf = openFile(ovFileName("R"),READ);
pid = hash(val) % nPages(f);
buf = getPage(f, pid);
for (i = 0; i < nTuples(buf); i++) {
    tup = getTuple(buf,i);
    if (tup.k == val) return tup;
}
ovp = ovflow(buf);
while (ovp != NO_PAGE) {
    buf = getPage(ovf,ovp);
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (tup.k == val) return tup;
    }
    ovp = ovflow(buf);
}
return NOT_FOUND;


... Selection with Hashing56/164

Select via hashing on non-unique hash key nk (pmr)

// select * from R where nk = val
pid,P = hash(val) % nPages(R);
for each tuple t in page P {
    if (t.nk == val) add t to results
}
for each overflow page Q of P {
    for each tuple t in page Q {
        if (t.nk == val) add t to results
}   }
return results

Costpmr  =  1 + Ov


... Selection with Hashing57/164

Details of selection via hashing on non-unique hash key nk (pmr)

// select * from R where nk = val
f = openFile(relName("R"),READ);
ovf = openFile(ovFileName("R"),READ);
pid = hash(val) % nPages(f);
buf = getPage(f, pid);
for (i = 0; i < nTuples(buf); i++) {
    tup = getTuple(buf,i);
    if (tup.nk == val) append tup to results
}
ovp = ovflow(buf);
while (ovp != NO_PAGE) {
    buf = getPage(ovf,ovp);
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (tup.nk == val) append tup to results
    }
    ovp = ovflow(buf);
}
return results

Costpmr  =  1 + Ov


... Selection with Hashing58/164

Unless the hash function is order-preserving (and most aren't) hashing does not help with range queries.

Costrange = b + bov

Selection on attribute j which is not hash key ...

Costone,   Costrange,   Costpmr  =  b + bov


Insertion with Hashing59/164

Insertion uses similar process to one queries.

Overview:

pid,P = hash(val) % nPages(R);
if room in page P {
    insert t into P; return
}
for each overflow page Q of P {
    if room in page Q {
        insert t into Q; return
}   }
add new overflow page Q
link Q to previous page
insert t into Q


... Insertion with Hashing60/164

Details of insertion into hashed file:

f = openFile(fileName("R"),READ|WRITE);
p = hash(tup.k) % nPages(f);
buf = getPage(f,p);
if (!isFull(buf) && addTuple(buf,tup) >= 0)
    { writePage(f, p, buf); return; }
ovf = openFile(ovFileName("R"),READ|WRITE);
lastp = p;  hasOvFlow = FALSE;
p = ovflow(buf);
while (p != NO_PAGE) {
    buf = getPage(ovf,p);
    if (!isFull(buf) && addTuple(buf,tup) >= 0)
        { writePage(ovf, p, buf); return; }
    lastp = p;  hasOvFlow = TRUE;
    p = ovflow(buf);
}
// no space in bucket; link a new ovflow page to the last page in the chain
ovflow(buf) = nPages(ovf);
writePage((hasOvFlow ? ovf : f), lastp, buf);
clear(buf);
addTuple(buf, tup);
writePage(ovf, nPages(ovf), buf);


... Insertion with Hashing61/164

Variation in costs determined by iteration on overflow chain.

Best Costinsert  =  1r + 1w

Average Costinsert  =  (1+Ov)r + 1w

Worst Costinsert  =  (1+max(OvLen))r + 2w


Deletion with Hashing62/164

Similar performance to select on non-unique key:

// delete from R where k = val
pid,P = hash(val) % nPages(R);
ndel = delTuples(P,k,val)
if (ndel > 0) putPage(f,buf,p)
for each overflow page qid,Q of P {
    ndel = delTuples(Q,k,val)
    if (ndel > 0) putPage(ovf,Q,qid)
}

Extra cost over select is cost of writing back modified blocks.

Method works for both unique and non-unique hash keys.


... Deletion with Hashing63/164

More details on deletion with hashing

// delete from R where k = val
// f = data file ... ovf = ovflow file
pid,P = getPageViaHash(val,R);
delTuples(f, pid, P, k, val);
pid = ovflow(P);
while (pid != NO_PAGE) {
    buf = getPage(ovf, pid);
    delTuples(ovf, pid, buf, k, val);
    pid = ovflow(buf);
}


... Deletion with Hashing64/164

Function for deleting all matches in a buffer:

delTuples(outf, pid, buf, k, val)
{
    int i;
    int ndels = 0;
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (tup.k == val)
            { ndels++; delTuple(buf,i); }
    }
    if (ndels > 0)
        writePage(outf, pid, buf);
    return ndels;
}

Realistic deletion would also handle free-list of empty pages.


Another Problem with Hashing...65/164

So far, discussion of hashing has assumed a fixed file size (fixed b).

This is known as static hashing.

What size file to use?


... Another Problem with Hashing...66/164

If we change the file size, by adding more blocks to accommodate more tuples, then we are effectively changing the hash function.

In order to be able to use the new hash function to find existing tuples, all tuples need to be removed and re-inserted.

In other words, this requires complete re-organisation of the file.

Obvious disadvantages:


Flexible Hashing67/164

Several hashing schemes have been proposed to deal with dynamic files.

Two methods access blocks via a directory:

A third method (linear hashing) expands the file systematically, so needs no directory.

All methods:


... Flexible Hashing68/164

Require a function to act on hash values to give d bits

#define HashSize 32
typedef unsigned int HashVal;

// return low-order d bits
HashVal bits(int d, HashVal h)
{
    HashVal mask;
    assert(d > 0 && d <= HashSize);
    mask = 0xffffffff >> (HashSize-d);
    return (h & mask);
}

// return high-order d bits
HashVal bits'(int d, HashVal h)
{
    assert(d > 0 && d <= HashSize);
    return (h >> (HashSize-d));
}
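
E.g. for h = 0x0000002D (low-order bits ...101101) with HashSize = 32: bits(3,h) keeps the low-order three bits, giving 101 = 5, while bits'(3,h) takes the top three bits of the 32-bit value, giving 0.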


... Flexible Hashing69/164

Important concept for flexible hashing: splitting

Aim: expandable data file, requiring minimal large reorganisation


... Flexible Hashing70/164

Example of splitting:

[Diagram:Pics/file-struct/split-small.png]

Tuples only show key value; assume h(val) = val


Extendible Hashing71/164

File organisation:

Directory could live in page 0 and be cached in memory buffer.


... Extendible Hashing72/164

[Diagram:Pics/file-struct/exhash1-small.png]


Selection with Ext.Hashing73/164

Size of directory = 2d; d is called depth.

Hash function h produces a (e.g. 32-bit) bit-string.

Use first d bits of it to access directory to obtain block address.

Selection method for one queries:

p = directory[bits'(d,h(val))];
buf = getPage(f,p);
for (i = 0; i < nTuples(buf); i++) {
    tup = getTuple(buf,i);
    if (tup.k == val) return tup;
}

No overflow blocks, so Costone  =  1

Assume: a copy of the directory is pinned in DBMS buffer pool.


Insertion with Ext.Hashing74/164

For insertion, use selection method to determine block.

If selected block S not full, simply insert tuple.

What to do if selected block full?

The partitioning process is called splitting.


... Insertion with Ext.Hashing75/164

Basis for splitting?

All tuples r in this block have same bits'(d,hash(r.k))

So ... use d+1 bits of hash (i.e. bits'(d+1,hash(r.k)))

This gives twice as many possible hash values.

Since hash values index directory, directory size is doubled.

E.g. d=4 gives indexes 0..15, d=5 gives indexes 0..31


... Insertion with Ext.Hashing76/164

Doubling directory size and adding one block creates many empty directory slots.

What to do with these slots?   Make them point to existing (buddy) pages.

[Diagram:Pics/file-struct/exhash2-small.png]

The idea: some parts are hashed effectively using d bits, others using d+1 bits.
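
A minimal sketch of the directory-doubling step, assuming the directory is held as an in-memory array of page ids indexed by the top d bits of the hash (representation and names assumed for illustration):

#include <stdlib.h>

/* Each old slot i becomes two slots 2i and 2i+1, both pointing at the
 * same (buddy) page; only the bucket being split is later given a new
 * page of its own. */
int *double_directory(int *dir, int d)
{
    int i, oldSize = 1 << d;
    int *newdir = malloc(sizeof(int) * 2 * oldSize);
    if (newdir == NULL) return NULL;
    for (i = 0; i < oldSize; i++) {
        newdir[2*i]     = dir[i];
        newdir[2*i + 1] = dir[i];
    }
    free(dir);
    return newdir;   /* caller then increments d */
}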


... Insertion with Ext.Hashing77/164

If we split a block with two pointers to it, no need to make directory larger.

[Diagram:Pics/file-struct/exhash3-small.png]


Ext.Hashing Example78/164

Consider the following set of tuples describing bank deposits:

Branch Acct.No Name Amount
Brighton 217 Green 750
Downtown 101 Johnson 500
Mianus 215 Smith 700
Perryridge 102 Hayes 400
Redwood 222 Lindsay 700
Round Hill 305 Turner 350
Clearview 117 Throggs 295


... Ext.Hashing Example79/164

We hash on the branch name, with the following hash function:

Branch Hash Value
Brighton 0010 1101 1111 1011
Clearview 1101 0101 1101 1110
Downtown 1010 0011 1010 0000
Mianus 1000 0111 1110 1101
Perryridge 1111 0001 0010 0100
Redwood 1011 0101 1010 0110
Round Hill 0101 1000 0011 1111


... Ext.Hashing Example80/164

Assume we have a file with   c=2.

Start with an initially empty file:

[Diagram:Pics/file-struct/exhashex0-small.png]

Add Brighton... tuple (hash=0010...).
Add Downtown... tuple (hash=1010...).

[Diagram:Pics/file-struct/exhashex1-small.png]


... Ext.Hashing Example81/164

Add Mianus... tuple (hash=1000...).

[Diagram:Pics/file-struct/exhashex2-small.png]


... Ext.Hashing Example82/164

Add Perryridge... tuple (hash=1111...).

[Diagram:Pics/file-struct/exhashex3-small.png]


... Ext.Hashing Example83/164

Add Redwood... tuple (hash=1011...).

[Diagram:Pics/file-struct/exhashex4-small.png]


... Ext.Hashing Example84/164

Add Round Hill... tuple (hash=0101...).

[Diagram:Pics/file-struct/exhashex5-small.png]


... Ext.Hashing Example85/164

Add Clearview... tuple (hash=1101...).

[Diagram:Pics/file-struct/exhashex6-small.png]


... Ext.Hashing Example86/164

Add another Perryridge... tuple (hash=1111...).

[Diagram:Pics/file-struct/exhashex7-small.png]


... Ext.Hashing Example87/164

Add another Round Hill... tuple (hash=0101...).

[Diagram:Pics/file-struct/exhashex8-small.png]


... Ext.Hashing Example88/164

Assumption: directory is small enough to be stored in memory.

Most of the time we read and write exactly one block.

If we need to split, we write just one extra block.

Conditions to minimise splitting:

On average, Costinsert  =  1r + (1+δ)w

(δ is a small value depending on b, load, hash function,...)


... Ext.Hashing Example89/164

A potential problem:

A degenerate case:


Deletion with Ext.Hashing90/164

Similar to ordinary static hash file - mark tuple as removed.

However, might also wish to reclaim space:

Both cases:


... Deletion with Ext.Hashing91/164

What to do with empty blocks from file perspective?

Empty block might create hole in middle of file.

Two methods to deal with this:


Linear Hashing92/164

File organisation:

Advantages over other approaches:

Uses a systematic method of growing the data file ...


... Linear Hashing93/164

File grows linearly (one block at a time, at regular intervals).

Can be viewed as having "phases" of expansion; during each phase, b doubles.

[Diagram:Pics/file-struct/linhash1-small.png]


Selection with Lin.Hashing94/164

If b = 2^d, the file behaves exactly like standard hashing.

Use d bits of hash to compute block address.

// select * from R where k = val
p = bits(d,hash(val));  --least sig bits
P = getPage(f,p)
for each tuple t in page P
         and its overflow pages {
    if (t.k == val) return t;
}

Average Costone  =  1+Ov


... Selection with Lin.Hashing95/164

If b != 2^d, treat different parts of the file differently.

[Diagram:Pics/file-struct/linhash2-small.png]

Parts A and C are treated as if part of a file of size 2^(d+1).

Part B is treated as if part of a file of size 2^d.

Part D does not yet exist (B expands into it).


... Selection with Lin.Hashing96/164

Modified algorithm:

// select * from R where k = val
h = hash(val);
pid = bits(d,h);
if (pid < sp) { pid = bits(d+1,h); }
P = getPage(f,pid);
// now proceed as before
for each tuple t in page P
         and its overflow pages {
    if (t.k == val) return t;
}


Insertion with Lin.Hashing97/164

To see why selection works, need to look at how file expands.

[Diagram:Pics/file-struct/linhash3-small.png]


Lin.Hashing Example98/164

Assume we have a file with: c=2.

Start with an initially empty file:

[Diagram:Pics/file-struct/linhashex0-small.png]

Add Brighton... tuple (hash=...1011).

Add Downtown... tuple (hash=...0000).

[Diagram:Pics/file-struct/linhashex1-small.png]


... Lin.Hashing Example99/164

Add Mianus... tuple (hash=...1101).

[Diagram:Pics/file-struct/linhashex2-small.png]

Add Perryridge... tuple (hash=...0100).

[Diagram:Pics/file-struct/linhashex3-small.png]


... Lin.Hashing Example100/164

Add Redwood... tuple (hash=...0110).

[Diagram:Pics/file-struct/linhashex4-small.png]


... Lin.Hashing Example101/164

Add Round Hill... tuple (hash=...1111).

[Diagram:Pics/file-struct/linhashex5-small.png]


... Lin.Hashing Example102/164

Add Clearview... tuple (hash=...1110).

[Diagram:Pics/file-struct/linhashex6-small.png]


... Lin.Hashing Example103/164

Add another Clearview... tuple.

[Diagram:Pics/file-struct/linhashex7-small.png]


Linear Hashing Insertion104/164

Abstract view:

p = bits(d,hash(tup.k));
if (p < sp) p = bits(d+1,hash(tup.k));
// bucket p = page p + its ovflow pages
for each page Q in bucket p {
    if (space in Q) {
        insert tuple into Q
        break
    }
}
if (no insertion) {
    add new ovflow page to bucket p
    insert tuple into new page
}
if (need to split) {
    partition tuples from bucket sp
          into buckets sp and sp+2^d
    sp++;
    if (sp == 2^d) { d++; sp = 0; }
}


... Linear Hashing Insertion105/164

Detailed algorithm:

h = hash(tup.k);
p = bits(d,h);
if (p < sp) p = bits(d+1,h);
//start insertion into p bucket
inserted = 0; inOvflow = 0;
buf = getPage(f,p);
if (isFull(buf))
    { lastp = p;  p = ovflow(buf); }
else {
    // have space in data page
    addTuple(buf,tup); // assume it fits
    writePage(f, p, buf);
    inserted = 1;
}
...


... Linear Hashing Insertion106/164

...
// attempt insertion into ovflow pages
// (lastp tracks the id of the page currently in buf)
while (!inserted && p != NO_PAGE) {
    inOvflow = 1;
    buf = getPage(ovf,p);
    if (!isFull(buf)) {
        addTuple(buf,tup);
        writePage(ovf, p, buf);
        inserted = 1;
    }
    if (!inserted)
        { lastp = p;  p = ovflow(buf); }
}
// add a new ovflow page, linked from the last page in the chain
if (!inserted) {
    ovflow(buf) = nPages(ovf);
    outf = inOvflow ? ovf : f;
    writePage(outf, lastp, buf);
    clear(buf);
    addTuple(buf, tup);
    writePage(ovf, nPages(ovf), buf);
}
// tuple inserted


Splitting107/164

How to decide that we "need to split"?

In extendible hashing, blocks were split on overflow.

In linear hashing, we always split block sp.

Two approaches to triggering a split:
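
As one illustration (the names and the 0.8 threshold are assumed here, not prescribed above), a load-factor-based trigger could be checked after each insertion:

#define MAX_LOAD 0.8   /* target average occupancy (tuples per slot) */

/* r = #tuples in file, b = #data pages, c = tuple capacity per page */
int needToSplit(long r, long b, long c)
{
    return (double)r / (double)(b * c) > MAX_LOAD;
}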


... Splitting108/164

How do we accomplish partitioning?

All tuples in sp plus its overflow blocks have same hash (e.g. 01).

New block is at address sp+2^d.

We consider d+1 bits of hash (giving e.g. 001, 101).

This gives each tuple an address of either sp or sp+2^d.

Place tuples according to d+1 hash bits.


... Splitting109/164

Partitioning process for block sp=01:

[Diagram:Pics/file-struct/linhash4-small.png]


... Splitting110/164

Some observations on splitting:

Simplest if we maintain free list of overflow pages.


... Splitting111/164

Detailed splitting algorithm:


// partitions tuples between two buckets
newBucket = sp + 2^d;
oldp = sp;  newp = newBucket;
buf = getPage(f,sp);
clear(oldBuf);  clear(newBuf);
for (i = 0; i < nTuples(buf); i++) {
    tup = getTuple(buf,i);
    if (bits(d+1,hash(tup.k)) == newBucket)
        addTuple(newBuf,tup);
    else
        addTuple(oldBuf,tup);
}
p = ovflow(buf);  oldOv = newOv = 0;
while (p != NO_PAGE) {
    ovbuf = getPage(ovf,p);
    for (i = 0; i < nTuples(ovbuf); i++) {
        tup = getTuple(ovbuf,i);
        if (bits(d+1,hash(tup.k)) == newBucket) {
            if (isFull(newBuf)) {
                nextp = nextFree(ovf);
                ovflow(newBuf) = nextp;
                outf = newOv ? ovf : f;
                writePage(outf, newp, newBuf);
                newOv++;  newp = nextp;  clear(newBuf);
            }
            addTuple(newBuf, tup);
        }
        else {
            if (isFull(oldBuf)) {
                nextp = nextFree(ovf);
                ovflow(oldBuf) = nextp;
                outf = oldOv ? ovf : f;
                writePage(outf, oldp, oldBuf);
                oldOv++;  oldp = nextp;  clear(oldBuf);
            }
            addTuple(oldBuf, tup);
        }
    }
    prevp = p;  p = ovflow(ovbuf);
    addToFreeList(ovf, prevp);
}
// write out the final (partially-filled) page of each bucket
writePage((oldOv ? ovf : f), oldp, oldBuf);
writePage((newOv ? ovf : f), newp, newBuf);
sp++;
if (sp == 2^d) { d++; sp = 0; }


Insertion Cost112/164

If no split required, cost same as for standard hashing:

Best Costinsert  =  1r + 1w

Average Costinsert  =  (1+Ov)r + 1w

Worst Costinsert  =  (1+max(Ov))r + 2w


... Insertion Cost113/164

If split occurs, incur Costinsert plus cost of partitioning.

Partitioning cost involves:

On average, Costpartition  =  (1+Ov)r + (2+Ov)w


Deletion with Lin.Hashing114/164

Deletion is similar to ordinary static hash file.

But might wish to contract file when enough tuples removed.

Rationale: if r shrinks while b stays large, space is wasted.

Method: remove last block in file (contracts linearly).

Involves a coalesce procedure which is an inverse split.


Hash Files in PostgreSQL115/164

PostgreSQL uses linear hashing on tables which have been:

create index Ix on R using hash (k);

Hash file implementation: backend/access/hash

Note: "bucket" = data page + associated ovflow pages


... Hash Files in PostgreSQL116/164

PostgreSQL uses slightly different file organisation ...

If overflow pages become empty, add to free list and re-use.


... Hash Files in PostgreSQL117/164

PostgreSQL hash file structure:

[Diagram:Pics/file-struct/pgsql-hashfile-small.png]


... Hash Files in PostgreSQL118/164

Converting bucket # to page address:

// which page is primary page of bucket
uint bucket_to_page(headerp, B) {
   uint *splits = headerp->hashm_spares;
   uint chunk, base, offset, lg2(uint);
   chunk = (B<2) ? 0 : lg2(B+1)-1;
   base = splits[chunk];
   offset = (B<2) ? B : B-(1<<chunk);
   return (base + offset);
}
// returns ceil(log_2(n))
int lg2(uint n) {
    int i, v;
    for (i = 0, v = 1; v < n; v <<= 1) i++;
    return i;
}
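
Working through the code above (metapage details ignored), hashm_spares[chunk] gives the page number at which that chunk of buckets (0-1, 2-3, 4-7, 8-15, ...) starts. E.g. with a hypothetical spares array {0, 2, 4, ...} and no overflow pages yet, bucket 3 falls in chunk 1 (base 2, offset 1), i.e. page 3, and bucket 5 falls in chunk 2 (base 4, offset 1), i.e. page 5; overflow pages allocated between chunks simply push the later base values up.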


Indexes


Selection via Trees120/164

Tree-based file access methods


Indexing121/164

The basic idea behind an index is as for books.

[Diagram:Pics/file-struct/book-index-small.png]

A table of (key,usage) pairs:


Indexing File Structure122/164

An index is a file of (keyVal,tupleID) pairs, e.g.

[Diagram:Pics/file-struct/index-small.png]


Indexes123/164

A 1-d index is based on the value of a single attribute A.

Some possible properties of A:

Taxonomy of index types, based on properties of index attribute:

primary      index on unique field, file sorted on this field
clustering   index on non-unique field, file sorted on this field
secondary    file not sorted on this field


... Indexes124/164

Indexes themselves may be structured in several ways:

dense          every tuple is referenced by an entry in the index file
sparse         only some tuples are referenced by index file entries
single-level   tuples are accessed directly from the main index file
multi-level    may need to access several index pages to reach tuple

Index structures aim to minimise storage and search costs.


... Indexes125/164

A relation/file may have:

Indexes are typically stored in separate files to data   (possibly cached in memory).


Primary Index126/164

Primary index: ordered file whose "tuples" have two fields:

Dense: one index tuple per data tuple   (no requirement on data file ordering)

Sparse: one index tuple per data page     (requires primary key sorted data file)

Index file has blocking factor ci   (where typically ci ≫ c)

Index has total i blocks:    Dense: i = r/ci    Sparse: i = b/ci
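
For example (figures assumed): with r = 100,000 tuples, b = 5,000 data pages and ci = 200 index entries per page, a dense index needs i = 100000/200 = 500 pages, while a sparse index needs only i = 5000/200 = 25 pages.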


... Primary Index127/164

One possible implementation for index blocks:

typedef struct {
    Datum key;
    int   pageNum;
    int   tupleNum;
} IndexEntry;

typedef struct {
    IndexHeader info;
    IndexEntry  entries[];   // as many entries as fit in the block
} IndexBlock;

Note that pageNum+tupleNum is equivalent to TupleId.


Dense Primary Index128/164

[Diagram:Pics/file-struct/prim-index-small.png]


Sparse Primary Index129/164

[Diagram:Pics/file-struct/prim-index1-small.png]


Index Overheads130/164

Consider a relation and index with the following parameters:

(continued)


... Index Overheads131/164

For each page in the index file ...

Size of index file:


Selection with Prim.Index132/164

For one queries:

ix = binary search index for entry with key K
if nothing found { return NotFound }
b = getPage(ix.pageNum)
t = getTuple(b,ix.tupleNum)
   -- may require reading overflow pages
return t

Worst cost:   read log2i index pages  +  read 1+Ov data pages.

Thus, Costone,prim  =  log2 i + 1 + Ov


... Selection with Prim.Index133/164

For range queries on primary key:

For pmr queries involving primary key:

For space queries involving primary key:

For queries not involving primary key, index gives no help.


... Selection with Prim.Index134/164

Method for range queries (when data file is not sorted)

// e.g. select * from R where a between lo and hi
pages = {}   results = {}
ixPage = findIndexPage(R.ixf,lo)
while (ixTup = getNextIndexTuple(R.ixf)) {
   if (ixTup.key > hi) break;
   pages += pageOf(ixTup.tid)
}
foreach pid in pages {
   // scan data page plus ovflow chain
   while (buf = getPage(R.datf,pid)) {
      foreach tuple T in buf {
         if (lo<=T.a && T.a<=hi) results += T
}  }  }


Insertion with Prim.Index135/164

Overview:

tid = insert tuple into page P at position p
find location for new entry in index file
insert new index entry (k,tid) into index file

Problem: order of index entries must be maintained.

Reorganisation requires, on average, reading/writing half of the index file.

Costinsert,prim  =  (log2 i)r + (i/2).(1r+1w) + (1+Ov)r + (1+δ)w


Deletion with Prim.Index136/164

Overview:

find tuple using index
mark tuple as deleted
remove index entry for tuple

If we delete index entries by marking ...

If we delete index entry by index file reorganisation ...


Clustering Index137/164

Index on non-unique ordering attribute Ac.

Usually a sparse index; one pointer to first tuple containing value.

Assists with:


... Clustering Index138/164

[Diagram:Pics/file-struct/clus-index-small.png]


... Clustering Index139/164

Insertions are expensive: rearrange index file and data file.

This applies whether new value or new tuple containing known value.

One "optimisation": reserve whole block for one particular value.

Deletions relatively cheap (similar to primary index).

(Note: can't mark index entry for value X until all X tuples are deleted)


Secondary Index140/164

Generally, dense index on non-unique attribute As

Problem: multiple tuples with same value for As.

Three possible solutions:


... Secondary Index141/164

First style of secondary index:

[Diagram:Pics/file-struct/sec-index-small.png]


... Secondary Index142/164

Third (favoured) style of secondary index:

[Diagram:Pics/file-struct/sec-index1-small.png]


Selection with Sec.Index143/164

Since non-unique attribute, there are no (explicit) one queries.

For pmr queries involving As:

binary search for key K in index file 1
for each entry for K in index file 2 {
    b = getPage(ix.pageNum)
    search for tuple within buffer b
}

Assume that answering the query q requires:

Costpmr  =  (log2i + aq + bq.(1 + Ov))r


... Selection with Sec.Index144/164

For range queries, e.g. Lo ≤ As ≤ Hi:

binary search for key Lo in index file 1
FOR each entry in index file 2 until Hi DO
    access tuple R via TupleID from index
END

Analysis almost same as for pmr ...

Costrange  =  (log2i + aq + ∑k=Lo..Hi bk.(1 + Ov))r

Note: may read some blocks multiple times   (alleviate with buffer pool)


Insertion/Deletion with Sec.Index145/164

Insertion:

As for primary index:

Deletion:

Can use mark-style (tombstone) deletion for tuples.

Caution: the index entry for a key k can only be marked as deleted once no entries for k remain in the block-address (second-level) blocks.


Multi-level Indexes146/164

In indexes described so far, search begins with binary search of index.

Requires log2 i index block accesses.

We can improve this by putting a second index on the first index.

Second index is sparse; (key,pointer) to first entry in first index block.

If there are i first-level index blocks, there will be i2 = i/ci second-level index blocks.

If second-level index too big, add third level ...

Maximum levels:   t  =  logci r     (dense index)

Top-level of index is always just a single block.
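
For example (figures assumed): with ci = 100 and r = 10^6 tuples, a dense first-level index has i = 10,000 blocks; binary search would need log2 i ≈ 14 block reads, whereas a multi-level index needs at most t = log100 10^6 = 3.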


... Multi-level Indexes147/164

[Diagram:Pics/file-struct/ml-index-small.png]


Select with ML.Index148/164

For one query on indexed key field:

xpid = top-level index page
for level = 1 to t {
    read index page xpid
    search index page for J'th entry
        where index[J].key <= K < index[J+1].key
    if (J == -1) { return NotFound }
    xpid = index[J].page
}
pid = xpid  // xpid now refers to a data page
search page pid and its overflow pages

Read t index pages and 1+Ov data pages.

Thus, Costone,mli  =  (t + 1 + Ov)r

(Recall that t = logci r, whereas a single-level index needs log2 i block reads)


B-Trees149/164

B-trees are multi-way search trees (MSTs) with the properties:

B-tree insertion and deletion methods Advantages of B-trees over general MSTs


... B-Trees150/164

Example B-tree (depth=3, n=3):

[Diagram:Pics/file-struct/btree0-small.png]


B-Tree Depth151/164

Depth depends on effective branching factor  (i.e. how full nodes are).

Simulation studies (random insertions and deletions) show typical B-tree nodes are 69% full.

Gives   load Li = 0.69 × ci   and   depth of tree ~ logLi r .

Example: ci=128,    Li=88

Level #nodes #keys
root 1 87
1 88 7656
2 7744 673728
3 681472 59288064

Note: ci is generally larger than 128 for a real B-tree.


B+Trees152/164

In database context, nodes are index blocks.

Two possible ways to structure them:

Higher levels of a B+ tree have greater ci  ⇒  the B+ tree may have fewer levels.


B-Tree Index153/164

[Diagram:Pics/file-struct/btree-small.png]


B+Tree Index154/164

[Diagram:Pics/file-struct/b+tree-small.png]


Insertion into B-Trees155/164

Overview of the method:

  1. find leaf node and position in node where entry would be stored
  2. if node is not full, insert entry into appropriate spot
  3. if node is full, split node into two half-full nodes and promote the middle entry to the parent
  4. if parent full, split and promote

Note: if duplicates not allowed and key is found, may stop after step 1.


... Insertion into B-Trees156/164

When splitting a node and promoting the middle element, we may find that the parent node is full.

In this case we split the parent in the same way as the child and promote its middle element.

This splitting may propagate all the way to root.

In this case we create a new root containing a single entry.

This is the only exception to the half-full rule.


B-Tree Insertion Cost157/164

Insertion cost = CosttreeSearch + CosttreeInsert + CostdataInsert

Best case: write one page (most of time)

Costinsert  =  Dr + 1w + 1r + 1w

Common case: 3 node writes (rearrange 2 leaves + parent)

Costinsert  =  Dr + 1r + 1w + 3w


... B-Tree Insertion Cost158/164

Worst case: 2D-1 node writes (propagate to root)

Costinsert  =  Dr + (2D-1)w + 1r + 1w


Deletion from B-Trees159/164

Overview of the method:

  1. find entry in node nd
  2. if nd is a non-leaf node:
    1. remove entry
    2. promote its immediate successor from child node to take its place
    3. continue as if operation were deletion of promoted child
  3. if nd is a leaf node
    1. remove entry
    2. if still more than half full, compact
    3. if < half full, rearrange entries between parent and siblings to make half full


Selection with B+Trees160/164

For one queries:

N = B-tree root node
while (N is not a leaf node) {
   scan N to find reference to next node
   N = next node
}
scan leaf node N to find entry for K
access tuple t using TupleId from N

Costone  =  (D + 1)r

Generally, D is around 2 or 3, even for very large data files.

A further optimisation is to buffer the B-tree root page   (⇒ total D page reads)


... Selection with B+Trees161/164

For range queries (assume sorted on index attribute):

search index to find leaf node for Lo
for each leaf node entry until Hi found {
	access tuple t using TupleId from entry
}

Costrange  =  (D + bi + bq)r

(If the root block is held in memory ⇒ read D-1 index blocks)


... Selection with B+Trees162/164

For pmr, need index on ≥ 1 query attribute.

Could have indexes on several attributes:

Pages = {}
for each Ai = k in query {
    find pages P containing Ai = k
    Pages = Pages ∩ P
}
for each P in Pages {
    scan P for matching tuples
}

If q mentions ai, aj, ak, then

Costpmr  =  (Di + Dj + Dk + |bqi ∩ bqj ∩ bqk|)r

For space queries, treat as a combination of pmr and range.


B-trees in PostgreSQL163/164

PostgreSQL implements ≅ Lehman/Yao-style B-trees

B-tree implementation: backend/access/nbtree

Notes:


... B-trees in PostgreSQL164/164

Interface functions for B-trees

// build Btree index on relation
Datum btbuild(rel,index,...)
// insert index entry into Btree
Datum btinsert(rel,key,tupleid,index,...)
// start scan on Btree index
Datum btbeginscan(rel,key,scandesc,...)
// get next tuple in a scan
Datum btgettuple(scandesc,scandir,...)
// close down a scan
Datum btendscan(scandesc)


Produced: 9 Jul 2019