Selection on One Attribute
Selection |
Varieties of Selection | 2/164 |
Selection is a relational algebra operation that ...
... Varieties of Selection | 3/164 |
We can view a relation as defining a tuple space
One-dimensional Selection |
Operations for 1d Select | 5/164 |
One-dimensional select queries = condition on single attribute.
select * from Employees where id = 15567;
select * from Employees where age = 25;
select * from Employees where age>20 and age<50;
... Operations for 1d Select | 6/164 |
... Operations for 1d Select | 7/164 |
Other operations on relations:
insert into Employees(id,name,age,dept,salary) values (12345, 'John', 21, 'Admin', 55000.00);
delete from Employees where age > 55;
update Employees set salary = salary * 1.10 where dept = 'Admin';
... Operations for 1d Select | 8/164 |
In the algorithms below, we assume the following query types:

one:
   select * from R where k = val
   (k is unique, so at most one tuple matches)

pmr:
   select * from R where k = val
   (k is not unique, so possibly many tuples match)

range:
   select * from R where k between lo and hi
   (all tuples with lo ≤ R.k ≤ hi)

where k is an attribute of relation R.
Implementing Select Efficiently | 9/164 |
Two basic approaches:
Our analyses assume: 1 input buffer available for each relation.
If more buffers are available, most methods benefit.
Heap Files |
Heap File Structure | 11/164 |
The simplest possible file organisation.
New tuples inserted at end of file; tuples deleted by marking.
Selection in Heaps | 12/164 |
For all selection queries, the only possible strategy is:
// select * from R where C
for each page P in file of relation R {
   for each tuple t in page P {
      if (t satisfies C)
         add tuple t to result set
   }
}
i.e. linear scan through file searching for matching tuples
... Selection in Heaps | 13/164 |
The heap is scanned from the first to the last page:
Costrange = Costpmr = b
If we know that only one tuple matches the query (one query),
a simple optimisation is to stop the scan once that tuple is found.
Costone : Best = 1 Average = b/2 Worst = b
Insertion in Heaps | 14/164 |
Insertion: new tuple is appended to file (in last page).
rel = openRelation("R", READ|WRITE);
pid = nPages(rel)-1;
get_page(rel, pid, buf);
if (size(newTup) > size(buf)) {
   deal with oversize tuple
}
else {
   if (!hasSpace(buf,newTup)) {
      pid++; nPages(rel)++;
      clear(buf);
   }
   insert_record(buf,newTup);
   put_page(rel, pid, buf);
}
Costinsert = 1r + 1w (plus possible extra writes for oversize recs)
... Insertion in Heaps | 15/164 |
Alternative strategy:
- find any page in R with enough free space for the new tuple
- requires the DBMS to maintain free-space information for R's pages
PostgreSQL's heap file implementation takes this approach: see backend/access/heap/{heapam.c,hio.c}
... Insertion in Heaps | 16/164 |
PostgreSQL's tuple insertion:
heap_insert(Relation relation,   // relation desc
            HeapTuple newtup,    // new tuple data
            CommandId cid, ...)  // SQL statement

where newtup holds the data for the tuple being inserted.
Deletion in Heaps | 17/164 |
Deletion for one-style conditions (on a unique attribute), e.g.
delete from Employees where id = 12345
Method:
... Deletion in Heaps | 18/164 |
Deletion for pmr or range conditions, e.g.
delete from Employees where dept = 'Marketing'
delete from Employees where 40 <= age and age < 50
Method:
... Deletion in Heaps | 19/164 |
Implementation of deletion:
// Deletion in heaps ... delete from R where C
rel = openRelation("R",READ|WRITE);
for (p = 0; p < nPages(rel); p++) {
   get_page(rel, p, buf);
   ndels = 0;
   for (i = 0; i < nTuples(buf); i++) {
      tup = get_record(buf,i);
      if (tup satisfies C)
         { ndels++; delete_record(buf,i); }
   }
   if (ndels > 0) put_page(rel, p, buf);
   if (ndels > 0 && unique) break;
}
... Deletion in Heaps | 20/164 |
How to deal with free slots:
Insertion with free slots:
rel = openRelation("R",READ|WRITE);
for (p = 0; p < nPages(rel); p++) {
   get_page(rel, p, buf);
   if (hasSpace(buf,newTup)) {
      insert_record(buf,newTup);
      break;
   }
}
if (p == nPages(rel)) {
   // no existing page had space; append a new page
   clear(buf);
   insert_record(buf,newTup);
}
put_page(rel, p, buf);
... Deletion in Heaps | 21/164 |
PostgreSQL tuple deletion:
heap_delete(Relation relation,    // relation desc
            ItemPointer tid, ..., // tupleID
            CommandId cid, ...)   // SQL statement

The tuple is not removed immediately; its xmax system field is set to mark it as deleted.
Updates in Heaps | 22/164 |
Example: update Employees set salary = salary * 1.10 where dept = 'Admin'
Analysis for updates is similar to that for deletion
A complication: new version of tuple may not fit in page.
Solution: delete and then insert
(Cost harder to analyse; depends on how many "oversize" tuples are produced)
... Updates in Heaps | 23/164 |
// Update in heaps ... update R set F = val where C
rel = openRelation("R",READ|WRITE);
for (p = 0; p < nPages(rel); p++) {
   get_page(rel, p, buf);
   nmods = 0;
   for (i = 0; i < nTuples(buf); i++) {
      tup = getTuple(buf,i);
      if (tup satisfies C) {
         nmods++;
         newTup = modTuple(tup,F,val);
         delTuple(buf,i);
         // re-insert; handles case where new version doesn't fit
         InsertTupleInHeap(rel,newTup);
      }
   }
   if (nmods > 0) put_page(rel, p, buf);
}
... Updates in Heaps | 24/164 |
PostgreSQL tuple update:
heap_update(Relation relation,     // relation desc
            ItemPointer otid,      // old tupleID
            HeapTuple newtup, ..., // new tuple data
            CommandId cid, ...)    // SQL statement

Essentially does delete(otid), then insert(newtup); the old tuple's ctid field is set to reference the new tuple version.
Heaps in PostgreSQL | 25/164 |
PostgreSQL stores all table data in heap files (by default).
Typically there are also associated index files.
If the data would be more useful in some other organisation, this can be requested explicitly, e.g.
create index...using hash
... Heaps in PostgreSQL | 26/164 |
PostgreSQL "heap files" use ≥ 3 physical files per table:
- OID ... main data file (OID.1, OID.2, ... for each extra 1GB segment)
- OID_fsm ... free space map
- OID_vm ... visibility map
... Heaps in PostgreSQL | 27/164 |
Free space map (the OID_fsm fork): records how much free space each data page has, to guide insertions.
Sorted Files |
Sorted Files | 29/164 |
Records stored in file in order of some field k (the sort key).
Makes searching more efficient; makes insertion less efficient
... Sorted Files | 30/164 |
In order to simplify insertion, use overflow blocks.
Large-scale file re-arrangement occurs less frequently.
... Sorted Files | 31/164 |
Conceptual view of sorted file structure:
Total number of overflow blocks = bov.
Average overflow chain length = Ov = bov/b.
Bucket = data page + its overflow page(s)
Selection in Sorted Files | 32/164 |
For one queries on sort key, use binary search.
// select * from R where k = val   (file sorted on R.k)
lo = 0; hi = b-1
while (lo <= hi) {
   mid = (lo+hi) div 2;
   // searchBucket reads the page itself
   (tup,loVal,hiVal) = searchBucket(f,mid,k,val);
   if (tup != null) return tup;
   else if (val < loVal) hi = mid - 1;
   else if (val > hiVal) lo = mid + 1;
   else return NOT_FOUND;
}
return NOT_FOUND;
... Selection in Sorted Files | 33/164 |
Search a page and its overflow chain for a key value
searchBucket(f,p,k,val)
{
   buf = getPage(f,p);
   (tup,min,max) = searchPage(buf,k,val,+INF,-INF)
   if (tup != null) return (tup,min,max);
   ovf = openOvFile(f);
   ovp = ovflow(buf);
   while (tup == NULL && ovp != NO_PAGE) {
      buf = getPage(ovf,ovp);
      (tup,min,max) = searchPage(buf,k,val,min,max)
      ovp = ovflow(buf);
   }
   return (tup,min,max);
}
Assumes each page contains index of next page in Ov chain
Note: getPage(f,pid) = { read_page(relOf(f),pid,buf); return buf; }
... Selection in Sorted Files | 34/164 |
Search within a page for key; also find min/max values
searchPage(buf,k,val,min,max)
{
   res = null;
   for (i = 0; i < nTuples(buf); i++) {
      tup = getTuple(buf,i);
      if (tup.k == val) res = tup;
      if (tup.k < min) min = tup.k;
      if (tup.k > max) max = tup.k;
   }
   return (res,min,max);
}
... Selection in Sorted Files | 35/164 |
The above method treats each bucket as a single large page.
Cases:
Costone : Best = 1 Worst = log2 b + bov
... Selection in Sorted Files | 36/164 |
Average Costone is messier to analyse:
To be more "precise"
... Selection in Sorted Files | 37/164 |
For pmr query, on non-unique attribute k
Begin by locating a page p containing k = val (as for a one query).
Scan backwards and forwards from p to find matches.
Thus, Costpmr = Costone + (bq-1).(1+Ov)
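As a quick arithmetic sketch of this formula (helper name is illustrative, not from any standard API):

```c
/* Cost of a pmr query on a sorted file: cost_one pages to find the
 * first matching bucket, then (bq-1) further buckets, each costing
 * 1 data page + Ov overflow pages on average. */
double cost_pmr(double cost_one, double bq, double ov)
{
    return cost_one + (bq - 1.0) * (1.0 + ov);
}
```

e.g. with Costone = 5, bq = 3 matching buckets and Ov = 0.5, Costpmr = 5 + 2×1.5 = 8 page reads.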
... Selection in Sorted Files | 38/164 |
For range queries on unique sort key (e.g. primary key):
If secondary key, similar method to pmr.
... Selection in Sorted Files | 39/164 |
So far, have assumed query condition involves sort key k.
If the condition involves an attribute j other than the sort key, the sort order gives no help:
Costone, Costrange, Costpmr as for heap files
Insertion in Sorted Files | 40/164 |
Insertion is more complex than for Heaps.
where δ = 1 or 2, depending on whether we need to create a new overflow block
Deletion in Sorted Files | 41/164 |
Deletion involves:
Thus, Costdelete = Costselect + bqw
Hashed Files |
Hashing | 43/164 |
Basic idea: use key value to compute page address of tuple.
e.g. tuple with key = v is stored in page i = h(v)
Requires: hash function h(v) that maps KeyDomain → [0..b-1].
Hash Functions | 44/164 |
Points to note:
- the hash function converts a key value to an integer
- the integer is then mapped into the page range [0..b-1], typically via mod b
... Hash Functions | 45/164 |
Usual approach in hash function:
Or, convert each data value to a string and hash that.
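A minimal sketch of the convert-to-string-then-hash approach (helper names are hypothetical; real systems use stronger mixing, e.g. hash_any below):

```c
#include <stdint.h>

/* Fold the bytes of a string key into a 32-bit value (FNV-1a style),
 * then map into the page range [0..b-1] with mod. Illustrative only. */
uint32_t string_hash(const char *key)
{
    uint32_t h = 2166136261u;               /* FNV-1a offset basis */
    for (const unsigned char *p = (const unsigned char *)key; *p; p++) {
        h ^= *p;
        h *= 16777619u;                     /* FNV-1a prime */
    }
    return h;
}

uint32_t hash_to_page(const char *key, uint32_t b)
{
    return string_hash(key) % b;            /* page address in [0..b-1] */
}
```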
... Hash Functions | 46/164 |
PostgreSQL hash function (simplified):
Datum hash_any(unsigned char *k, register int keylen)
{
   register uint32 a, b, c, len;
   register uint32 *ka = (uint32 *)k;  // word-wise view of key (simplified)
   /* Set up the internal state */
   len = keylen;
   a = b = c = 0x9e3779b9 + len + 3923095;
   /* handle most of the key */
   while (len >= 12) {
      a += ka[0]; b += ka[1]; c += ka[2];
      mix(a, b, c);
      ka += 3; len -= 12;
   }
   /* collect any data from last 11 bytes into a,b,c */
   mix(a, b, c);
   return UInt32GetDatum(c);
}
... Hash Functions | 47/164 |
Where mix is defined as:
#define mix(a,b,c) \
{ \
  a -= b; a -= c; a ^= (c>>13); \
  b -= c; b -= a; b ^= (a<<8);  \
  c -= a; c -= b; c ^= (b>>13); \
  a -= b; a -= c; a ^= (c>>12); \
  b -= c; b -= a; b ^= (a<<16); \
  c -= a; c -= b; c ^= (b>>5);  \
  a -= b; a -= c; a ^= (c>>3);  \
  b -= c; b -= a; b ^= (a<<10); \
  c -= a; c -= b; c ^= (b>>15); \
}
i.e. scrambles all of the bits from the bytes of the key value
See backend/access/hash/hashfunc.c
... Hash Functions | 48/164 |
Above functions give the hash value as a 32-bit quantity (uint32).
Two ways to map raw hash value into a page address:
// use when b = 2^k: take the k low-order bits
uint32 hashToPageNum(uint32 hval)
{
   uint32 mask = 0xFFFFFFFF;
   return (hval & (mask >> (32-k)));
}
// use for arbitrary b
uint32 hashToPageNum(uint32 hval)
{
   return (hval % b);
}
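The two mappings agree whenever b is a power of two (b = 2^k), which is why the cheaper bit-mask form is usable; a small sketch (names illustrative):

```c
#include <stdint.h>

/* Map a 32-bit hash value to a page number two ways.
 * When b = 2^k the results are identical, but masking
 * avoids the division. Requires 1 <= k <= 31. */
uint32_t page_by_mask(uint32_t hval, int k)
{
    uint32_t mask = 0xFFFFFFFFu >> (32 - k);
    return hval & mask;
}

uint32_t page_by_mod(uint32_t hval, uint32_t b)
{
    return hval % b;
}
```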
Hashing Performance | 49/164 |
Aims:
Best case: every block contains same number of tuples.
Worst case: every tuple hashes to same block.
Average case: some blocks have more tuples than others.
... Hashing Performance | 50/164 |
With non-uniform distribution or too many tuples, some blocks will fill.
Use overflow blocks to contain "excess" tuples; one chain of overflow blocks for each primary block (containing all tuples with same hash value).
... Hashing Performance | 51/164 |
Trade-off (assuming reasonably uniform hash function):
effect: reduces long overflow chains which cause query-time overhead
effect: half-full blocks result in storage overhead
... Hashing Performance | 52/164 |
Two important measures for hash files:
Case | L | Ov |
Best | ≅ 1 | 0 |
Worst | >> 1 | ** |
Average | < 1 | 0<Ov<1 |
(** performance is same as Heap File)
To achieve average case, aim for 0.75 ≤ L ≤ 0.9.
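To hit such a target load factor we can choose b from r and c; a small integer-arithmetic sketch (illustrative, treating the target load as a fraction num/den):

```c
/* Choose the number of pages b so that the load factor
 * L = r/(b*c) is at most num/den (e.g. 3/4 for L <= 0.75).
 * Computes b = ceil(r*den / (num*c)) in integer arithmetic. */
unsigned pages_for_load(unsigned r, unsigned c,
                        unsigned num, unsigned den)
{
    unsigned denom = num * c;
    return (r * den + denom - 1) / denom;
}
```

For r=900 tuples, c=4 tuples/page and target L = 3/4, this gives b=300 pages (L = 900/(300×4) = 0.75).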
Selection with Hashing | 53/164 |
Best performance occurs for one queries on hash key field.
Basic strategy:
Average Costone = 1+Ov/2 (scan half of ovflow chain)
Worst Costone = 1+max(OvLen) (find in last page of ovflow chain)
... Selection with Hashing | 54/164 |
Select via hashing on unique hash key k (one)
Overview:
// select * from R where k = val
pid,P = hash(val) % nPages(R);
for each tuple t in page P {
   if (t.k == val) return t
}
for each overflow page Q of P {
   for each tuple t in page Q {
      if (t.k == val) return t
   }
}
... Selection with Hashing | 55/164 |
Details of select via hashing on unique key k:
// select * from R where k = val
f = openFile(relName("R"),READ);
ovf = openOvFile(relName("R"),READ);
pid = hash(val) % nPages(f);
buf = getPage(f, pid);
for (i = 0; i < nTuples(buf); i++) {
   tup = getTuple(buf,i);
   if (tup.k == val) return tup;
}
ovp = ovflow(buf);
while (ovp != NO_PAGE) {
   buf = getPage(ovf,ovp);
   for (i = 0; i < nTuples(buf); i++) {
      tup = getTuple(buf,i);
      if (tup.k == val) return tup;
   }
   ovp = ovflow(buf);   // advance along overflow chain
}
return NOT_FOUND;
... Selection with Hashing | 56/164 |
Select via hashing on non-unique hash key nk (pmr)
// select * from R where nk = val pid,P = hash(val) % nPages(R); for each tuple t in page P { if (t.nk == val) add t to results } for each overflow page Q of P { for each tuple t in page Q { if (t.nk == val) add t to results } } return results
Costpmr = 1 + Ov
... Selection with Hashing | 57/164 |
Details of selection via hashing on non-unique hash key nk (pmr)
// select * from R where nk = val
f = openFile(relName("R"),READ);
ovf = openOvFile(relName("R"),READ);
pid = hash(val) % nPages(f);
buf = getPage(f, pid);
for (i = 0; i < nTuples(buf); i++) {
   tup = getTuple(buf,i);
   if (tup.nk == val) append tup to results
}
ovp = ovflow(buf);
while (ovp != NO_PAGE) {
   buf = getPage(ovf,ovp);
   for (i = 0; i < nTuples(buf); i++) {
      tup = getTuple(buf,i);
      if (tup.nk == val) append tup to results
   }
   ovp = ovflow(buf);
}
return results;
Costpmr = 1 + Ov
... Selection with Hashing | 58/164 |
Unless the hash function is order-preserving (and most aren't), hashing does not help with range queries.
Costrange = b + bov
Selection on attribute j which is not hash key ...
Costone, Costrange, Costpmr = b + bov
Insertion with Hashing | 59/164 |
Insertion uses similar process to one queries.
Overview:
pid,P = hash(t.k) % nPages(R);
if room in page P
   { insert t into P; return }
for each overflow page Q of P {
   if room in page Q
      { insert t into Q; return }
}
add new overflow page Q
link Q to previous page
insert t into Q
... Insertion with Hashing | 60/164 |
Details of insertion into hashed file:
f = openFile(fileName("R"),READ|WRITE);
p = hash(tup.k) % nPages(f);
buf = getPage(f,p);
if (!isFull(buf) && addTuple(buf,tup) >= 0) {
   writePage(f, p, buf);
   return;
}
ovf = openFile(ovFileName("R"),READ|WRITE);
ovp = ovflow(buf);
hasOvFlow = (ovp != NO_PAGE);
while (ovp != NO_PAGE) {
   p = ovp;   // p = id of the page currently in buf
   buf = getPage(ovf,p);
   if (!isFull(buf) && addTuple(buf,tup) >= 0) {
      writePage(ovf, p, buf);
      return;
   }
   ovp = ovflow(buf);
}
// no space anywhere; link a new overflow page to the last page
ovflow(buf) = nPages(ovf);
writePage((hasOvFlow ? ovf : f), p, buf);
clear(buf);
addTuple(buf, tup);
writePage(ovf, nPages(ovf), buf);
... Insertion with Hashing | 61/164 |
Variation in costs determined by iteration on overflow chain.
Best Costinsert = 1r + 1w
Average Costinsert = (1+Ov)r + 1w
Worst Costinsert = (1+max(OvLen))r + 2w
Deletion with Hashing | 62/164 |
Similar performance to select on non-unique key:
// delete from R where k = val
pid,P = hash(val) % nPages(R);
ndel = delTuples(P,k,val)
if (ndel > 0) putPage(f,P,pid)
for each overflow page qid,Q of P {
   ndel = delTuples(Q,k,val)
   if (ndel > 0) putPage(ovf,Q,qid)
}
Extra cost over select is cost of writing back modified blocks.
Method works for both unique and non-unique hash keys.
... Deletion with Hashing | 63/164 |
More details on deletion with hashing
// delete from R where k = val
// f = data file ... ovf = ovflow file
pid,P = getPageViaHash(val,R)
ndel = delTuples(P,k,val);
if (ndel > 0) putPage(f,P,pid)
ovp = ovflow(P)
while (ovp != NO_PAGE) {
   Q = getPage(ovf,ovp);
   ndel = delTuples(Q,k,val);
   if (ndel > 0) putPage(ovf,Q,ovp)
   ovp = ovflow(Q);
}
... Deletion with Hashing | 64/164 |
Function for deleting all matches in a buffer:
delTuples(buf, k, val)
{
   int i;  int ndels = 0;
   for (i = 0; i < nTuples(buf); i++) {
      tup = getTuple(buf,i);
      if (tup.k == val)
         { ndels++; delTuple(buf,i); }
   }
   return ndels;   // caller writes the page if ndels > 0
}
Realistic deletion would also handle free-list of empty pages.
Another Problem with Hashing... | 65/164 |
So far, discussion of hashing has assumed a fixed file size (fixed b).
This is known as static hashing.
What size file to use?
... Another Problem with Hashing... | 66/164 |
If we change the file size, by adding more blocks to accommodate more tuples, then we are effectively changing the hash function.
In order to be able to use the new hash function to find existing tuples, all tuples need to be removed and re-inserted.
In other words, this requires complete re-organisation of the file.
Obvious disadvantages:
Flexible Hashing | 67/164 |
Several hashing schemes have been proposed to deal with dynamic files.
Two methods access blocks via a directory:
... Flexible Hashing | 68/164 |
Require functions to extract d bits from a hash value:

#define HashSize 32
typedef unsigned int HashVal;

// return low-order d bits
HashVal bits(d int, h HashVal)
{
   HashVal mask;
   assert(d > 0 && d <= HashSize);
   mask = 0xffffffff >> (HashSize-d);
   return (h & mask);
}

// return high-order d bits
HashVal bits'(d int, h HashVal)
{
   assert(d > 0 && d <= HashSize);
   return (h >> (HashSize-d));
}
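Since ' is not legal in C identifiers, a directly compilable version of these helpers might rename bits' to high_bits (a sketch, with types simplified):

```c
#include <assert.h>
#include <stdint.h>

#define HASH_SIZE 32
typedef uint32_t HashVal;

/* return low-order d bits of h */
HashVal low_bits(int d, HashVal h)
{
    assert(d > 0 && d <= HASH_SIZE);
    return h & (0xFFFFFFFFu >> (HASH_SIZE - d));
}

/* return high-order d bits of h */
HashVal high_bits(int d, HashVal h)
{
    assert(d > 0 && d <= HASH_SIZE);
    return h >> (HASH_SIZE - d);
}
```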
... Flexible Hashing | 69/164 |
Important concept for flexible hashing: splitting.
Consider tuples whose hash value has d bits 101.
If we consider one extra bit (d+1 bits), each such tuple's hash becomes either 0101 or 1101.
Splitting partitions the tuples of the 101 block between two blocks, 0101 and 1101, based on this extra bit.
... Flexible Hashing | 70/164 |
Example of splitting:
Tuples only show key value; assume h(val) = val
Extendible Hashing | 71/164 |
File organisation:
... Extendible Hashing | 72/164 |
Selection with Ext.Hashing | 73/164 |
Size of directory = 2^d; d is called the depth.
Hash function h produces a bit-string for the key value.
Use the first d bits of it to index the directory, obtaining a block address.
Selection method for one queries:
p = directory[bits'(d,h(val))];
buf = getPage(f,p);
for (i = 0; i < nTuples(buf); i++) {
   tup = getTuple(buf,i);
   if (tup.k == val) return tup;
}
return NOT_FOUND;
No overflow blocks, so Costone = 1
Assume: a copy of the directory is pinned in DBMS buffer pool.
Insertion with Ext.Hashing | 74/164 |
For insertion, use selection method to determine block.
If selected block S not full, simply insert tuple.
What to do if selected block full?
... Insertion with Ext.Hashing | 75/164 |
Basis for splitting?
All tuples r in a block have the same value of bits'(d,hash(r.k)).
So ... use d+1 bits of hash (i.e. bits'(d+1,hash(r.k))).
This gives twice as many possible hash values.
Since hash values index the directory, directory size is doubled.
E.g. d=4 gives indexes 0..15, d=5 gives indexes 0..31
... Insertion with Ext.Hashing | 76/164 |
Doubling directory size and adding one block creates many empty directory slots.
What to do with these slots? Make them point to existing (buddy) pages.
The idea: some parts are hashed effectively using d bits, others using d+1 bits.
... Insertion with Ext.Hashing | 77/164 |
If we split a block with two pointers to it, no need to make directory larger.
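A minimal sketch of directory doubling with buddy pointers, assuming the directory is indexed by the high-order hash bits (all names hypothetical):

```c
typedef int PageId;

/* Double an extendible-hashing directory of size 2^d.
 * With high-order-bit indexing, old slot i becomes the buddy
 * pair (2i, 2i+1); both initially point at the same data page,
 * so the doubling itself moves or splits no data pages. */
void double_directory(const PageId *oldDir, PageId *newDir, int d)
{
    int oldSize = 1 << d;
    for (int i = 0; i < oldSize; i++) {
        newDir[2*i]     = oldDir[i];
        newDir[2*i + 1] = oldDir[i];
    }
}
```

Only when one of the buddy slots' shared page is later split does a new page get allocated and one pointer updated.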
Ext.Hashing Example | 78/164 |
Consider the following set of tuples describing bank deposits:
Branch | Acct.No | Name | Amount |
Brighton | 217 | Green | 750 |
Downtown | 101 | Johnson | 500 |
Mianus | 215 | Smith | 700 |
Perryridge | 102 | Hayes | 400 |
Redwood | 222 | Lindsay | 700 |
Round Hill | 305 | Turner | 350 |
Clearview | 117 | Throggs | 295 |
... Ext.Hashing Example | 79/164 |
We hash on the branch name, with the following hash function:
Branch | Hash Value |
Brighton | 0010 1101 1111 1011 |
Clearview | 1101 0101 1101 1110 |
Downtown | 1010 0011 1010 0000 |
Mianus | 1000 0111 1110 1101 |
Perryridge | 1111 0001 0010 0100 |
Redwood | 1011 0101 1010 0110 |
Round Hill | 0101 1000 0011 1111 |
... Ext.Hashing Example | 80/164 |
Assume we have a file with page capacity c=2 (two tuples per page).
Start with an initially empty file:
Add Brighton...
0010...
Add Downtown...
1010...
... Ext.Hashing Example | 81/164 |
Add Mianus...
1000...
... Ext.Hashing Example | 82/164 |
Add Perryridge...
1111...
... Ext.Hashing Example | 83/164 |
Add Redwood...
1011...
... Ext.Hashing Example | 84/164 |
Add Round Hill...
0101...
... Ext.Hashing Example | 85/164 |
Add Clearview...
1101...
... Ext.Hashing Example | 86/164 |
Add another Perryridge...
1111...
... Ext.Hashing Example | 87/164 |
Add another Round Hill...
0101...
... Ext.Hashing Example | 88/164 |
Assumption: directory is small enough to be stored in memory.
Most of the time we read and write exactly one block.
If we need to split, we write just one extra block.
Conditions to minimise splitting:
(δ is a small value depending on b, load, hash function,...)
... Ext.Hashing Example | 89/164 |
A potential problem:
Deletion with Ext.Hashing | 90/164 |
Similar to ordinary static hash file - mark tuple as removed.
However, might also wish to reclaim space:
... Deletion with Ext.Hashing | 91/164 |
What to do with empty blocks from file perspective?
Empty block might create hole in middle of file.
Two methods to deal with this:
Linear Hashing | 92/164 |
File organisation:
... Linear Hashing | 93/164 |
File grows linearly (one block at a time, at regular intervals).
Can be viewed as having "phases" of expansion; during each phase, b doubles.
Selection with Lin.Hashing | 94/164 |
If b = 2^d, the file behaves exactly like standard hashing.
Use d bits of the hash to compute the block address.
// select * from R where k = val
p = bits(d,hash(val));   // least significant d bits
P = getPage(f,p)
for each tuple t in page P and its overflow pages {
   if (t.k == val) return t;
}
Average Costone = 1+Ov
... Selection with Lin.Hashing | 95/164 |
If b != 2^d, treat different parts of the file differently.
Parts A and C are treated as if part of a file of size 2^(d+1).
Part B is treated as if part of a file of size 2^d.
Part D does not yet exist (B expands into it).
... Selection with Lin.Hashing | 96/164 |
Modified algorithm:
// select * from R where k = val
h = hash(val);
pid = bits(d,h);
if (pid < sp) { pid = bits(d+1,h); }
P = getPage(f,pid);
// now proceed as before
for each tuple t in page P and its overflow pages {
   if (t.k == val) return t;
}
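A compilable sketch of just the page-address computation (assumes low-order-bit indexing as in bits() above, and d+1 < 32; names illustrative):

```c
#include <stdint.h>

/* Linear hashing: compute the page for hash value h when the file
 * has 2^d + sp pages. Buckets below the split pointer sp have
 * already been split, so they are addressed with d+1 bits. */
uint32_t lin_hash_page(uint32_t h, int d, uint32_t sp)
{
    uint32_t p = h & ((1u << d) - 1);       /* low-order d bits */
    if (p < sp)
        p = h & ((1u << (d + 1)) - 1);      /* low-order d+1 bits */
    return p;
}
```

e.g. with d=2 and sp=1 (pages 0..4), hash 0b100 maps to page 4 (bucket 0 is already split), while 0b111 maps to page 3.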
Insertion with Lin.Hashing | 97/164 |
To see why selection works, need to look at how file expands.
Lin.Hashing Example | 98/164 |
Assume we have a file with: c=2.
Start with an initially empty file:
Add Brighton...
...1011
Add Downtown...
...0000
... Lin.Hashing Example | 99/164 |
Add Mianus...
...1101
Add Perryridge...
...0100
... Lin.Hashing Example | 100/164 |
Add Redwood...
...0110
... Lin.Hashing Example | 101/164 |
Add Round Hill...
...1111
... Lin.Hashing Example | 102/164 |
Add Clearview...
...1110
... Lin.Hashing Example | 103/164 |
Add another Clearview...
Linear Hashing Insertion | 104/164 |
Abstract view:
p = bits(d,hash(tup.k));
if (p < sp) p = bits(d+1,hash(tup.k));
// bucket p = page p + its ovflow pages
for each page Q in bucket p {
   if (space in Q) {
      insert tuple into Q
      break
   }
}
if (no insertion) {
   add new ovflow page to bucket p
   insert tuple into new page
}
if (need to split) {
   partition tuples from bucket sp
      into buckets sp and sp+2^d
   sp++;
   if (sp == 2^d) { d++; sp = 0; }
}
... Linear Hashing Insertion | 105/164 |
Detailed algorithm:
h = hash(tup.k);
p = bits(d,h);
if (p < sp) p = bits(d+1,h);
// start insertion into bucket p
inserted = 0;  inOvflow = 0;
buf = getPage(f,p);
if (isFull(buf)) {
   p = ovflow(buf);
} else {
   // have space in data page
   addTuple(buf,tup);   // assume it fits
   writePage(f, p, buf);
   inserted = 1;
}
...
... Linear Hashing Insertion | 106/164 |
...
// attempt insertion into ovflow pages
while (!inserted && p != NO_PAGE) {
   inOvflow = 1;
   buf = getPage(ovf,p);
   if (!isFull(buf)) {
      addTuple(buf,tup);
      writePage(ovf, p, buf);
      inserted = 1;
   }
   if (!inserted) { p = ovflow(buf); }
}
// add a new ovflow page
if (!inserted) {
   ovflow(buf) = nPages(ovf);
   outf = inOvflow ? ovf : f;
   writePage(outf, p, buf);
   clear(buf);
   addTuple(buf, tup);
   writePage(ovf, nPages(ovf), buf);
}
// tuple inserted
Splitting | 107/164 |
How to decide that we "need to split"?
In extendible hashing, blocks were split on overflow.
In linear hashing, we always split block sp.
Two approaches to triggering a split:
... Splitting | 108/164 |
How do we accomplish partitioning?
All tuples in block sp plus its overflow blocks have the same d-bit hash (e.g. 01).
The new block is at address sp+2^d.
We consider d+1 bits of hash for each tuple (giving e.g. 001 or 101).
This gives an address for each tuple of either sp or sp+2^d.
Place tuples according to their d+1 hash bits.
... Splitting | 109/164 |
Partitioning process for block sp=01
... Splitting | 110/164 |
Some observations on splitting:
... Splitting | 111/164 |
Detailed splitting algorithm:
// partitions tuples between buckets sp and sp+2^d
newp = sp + 2^d;  oldp = sp;
buf = getPage(f,sp);
clear(oldBuf);  clear(newBuf);
for (i = 0; i < nTuples(buf); i++) {
   tup = getTuple(buf,i);
   if (bits(d+1,hash(tup.k)) == sp + 2^d)
      addTuple(newBuf,tup);
   else
      addTuple(oldBuf,tup);
}
ovp = ovflow(buf);
oldOv = newOv = 0;
while (ovp != NO_PAGE) {
   ovbuf = getPage(ovf,ovp);
   for (i = 0; i < nTuples(ovbuf); i++) {
      tup = getTuple(ovbuf,i);
      if (bits(d+1,hash(tup.k)) == sp + 2^d) {
         if (isFull(newBuf)) {
            nextp = nextFree(ovf);
            ovflow(newBuf) = nextp;
            outf = newOv ? ovf : f;   // first write is new primary page
            writePage(outf, newp, newBuf);
            newOv++;  newp = nextp;  clear(newBuf);
         }
         addTuple(newBuf, tup);
      } else {
         if (isFull(oldBuf)) {
            nextp = nextFree(ovf);
            ovflow(oldBuf) = nextp;
            outf = oldOv ? ovf : f;
            writePage(outf, oldp, oldBuf);
            oldOv++;  oldp = nextp;  clear(oldBuf);
         }
         addTuple(oldBuf, tup);
      }
   }
   addToFreeList(ovf,ovp);
   ovp = ovflow(ovbuf);
}
// write out the final (partly full) buffers
writePage(newOv ? ovf : f, newp, newBuf);
writePage(oldOv ? ovf : f, oldp, oldBuf);
sp++;
if (sp == 2^d) { d++; sp = 0; }
Insertion Cost | 112/164 |
If no split required, cost same as for standard hashing:
Best Costinsert = 1r + 1w
Average Costinsert = (1+Ov)r + 1w
Worst Costinsert = (1+max(Ov))r + 2w
... Insertion Cost | 113/164 |
If split occurs, incur Costinsert plus cost of partitioning.
Partitioning cost involves:
Deletion with Lin.Hashing | 114/164 |
Deletion is similar to ordinary static hash file.
But might wish to contract file when enough tuples removed.
Rationale: r shrinks, b stays large ⇒ wasted space.
Method: remove last block in file (contracts linearly).
Involves a coalesce procedure which is an inverse split.
Hash Files in PostgreSQL | 115/164 |
PostgreSQL uses linear hashing on tables which have been indexed via:
create index Ix on R using hash (k);
Hash file implementation: backend/access/hash
hashfunc.c
hashinsert.c
hashpage.c
hashsearch.c
... Hash Files in PostgreSQL | 116/164 |
PostgreSQL uses slightly different file organisation ...
... Hash Files in PostgreSQL | 117/164 |
PostgreSQL hash file structure:
... Hash Files in PostgreSQL | 118/164 |
Converting bucket # to page address:
// which page is primary page of bucket B?
uint bucket_to_page(headerp, B)
{
   uint *splits = headerp->hashm_spares;
   uint chunk, base, offset, lg2(uint);
   chunk = (B<2) ? 0 : lg2(B+1)-1;
   base = splits[chunk];
   offset = (B<2) ? B : B-(1<<chunk);
   return (base + offset);
}

// returns ceil(log_2(n))
int lg2(uint n)
{
   int i, v;
   for (i = 0, v = 1; v < n; v <<= 1) i++;
   return i;
}
Indexes |
Selection via Trees | 120/164 |
Tree-based file access methods
Indexing | 121/164 |
The basic idea behind an index is as for books.
A table of (key,usage) pairs:
Indexing File Structure | 122/164 |
An index is a file of (keyVal,tupleID) pairs, e.g.
Indexes | 123/164 |
A 1-d index is based on the value of a single attribute A.
Some possible properties of A:
- A unique, file sorted on A (primary index)
- A non-unique, file sorted on A (clustering index)
- file not sorted on A (secondary index)
... Indexes | 124/164 |
Indexes themselves may be structured in several ways:
- dense: every tuple is referenced by an entry in the index file
- sparse: only some tuples are referenced by index file entries
- single-level: tuples are accessed directly from the main index file
- multi-level: may need to access several index pages to reach tuple
Index structures aim to minimise storage and search costs.
... Indexes | 125/164 |
A relation/file may have:
Primary Index | 126/164 |
Primary index: ordered file whose "tuples" have two fields: a key value, and a reference (page/tuple number) into the data file.
Sparse: one index tuple per data page (requires primary key sorted data file)
Index file has blocking factor ci (where typically ci ≫ c)
Index has total i blocks: Dense: i = ⌈ r/ci ⌉ Sparse: i = ⌈ b/ci ⌉
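These block counts are simple ceilings; a sketch (parameter names illustrative):

```c
/* Pages needed by a primary index:
 *   dense:  one entry per tuple     -> ceil(r / ci)
 *   sparse: one entry per data page -> ceil(b / ci)
 * where ci = index entries per index page. */
unsigned index_pages_dense(unsigned r, unsigned ci)
{
    return (r + ci - 1) / ci;
}

unsigned index_pages_sparse(unsigned b, unsigned ci)
{
    return (b + ci - 1) / ci;
}
```

e.g. r=100000 tuples with ci=300 needs 334 dense-index pages; if those tuples occupy b=5000 data pages, a sparse index needs only 17 pages.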
... Primary Index | 127/164 |
One possible implementation for index blocks:
typedef struct {
   IndexHeader info;
   IndexEntry  entries[];
} IndexBlock;

typedef struct {
   Datum key;
   int   pageNum;
   int   tupleNum;
} IndexEntry;
Note that the (pageNum,tupleNum) pair is effectively a TupleId.
Dense Primary Index | 128/164 |
Sparse Primary Index | 129/164 |
Index Overheads | 130/164 |
Consider a relation and index with the following parameters:
... Index Overheads | 131/164 |
For each page in the index file ...
Selection with Prim.Index | 132/164 |
For one queries:
ix = binary search index for entry with key K
if nothing found { return NotFound }
b = getPage(ix.pageNum)
t = getTuple(b,ix.tupleNum)
      -- may require reading overflow pages
return t
Worst cost: read log2i index pages + read 1+Ov data pages.
Thus, Costone,prim = log2 i + 1 + Ov
... Selection with Prim.Index | 133/164 |
For range queries on primary key:
... Selection with Prim.Index | 134/164 |
Method for range queries
// e.g. select * from R where a between lo and hi
pages = {};  results = {}
ixPage = findIndexPage(R.ixf,lo)
while (ixTup = getNextIndexTuple(R.ixf)) {
   if (ixTup.key > hi) break;
   pages += pageOf(ixTup.tid)
}
foreach pid in pages {
   // scan data page plus ovflow chain
   while (buf = getPage(R.datf,pid)) {
      foreach tuple T in buf {
         if (lo<=T.a && T.a<=hi)
            results += T
      }
   }
}
Insertion with Prim.Index | 135/164 |
Overview:
tid = insert tuple into page P at position p
find location for new entry in index file
insert new index entry (k,tid) into index file
Problem: order of index entries must be maintained.
Reorganisation requires, on average, reading/writing half of the index file.
Costinsert,prim = log2ir + i/2.(1r+1w) + (1+Ov)r + (1+δ)w
Deletion with Prim.Index | 136/164 |
Overview:
find tuple using index
mark tuple as deleted
remove index entry for tuple
If we delete index entries by marking ...
Clustering Index | 137/164 |
Index on non-unique ordering attribute Ac.
Usually a sparse index; one pointer to first tuple containing value.
Assists with:
... Clustering Index | 138/164 |
... Clustering Index | 139/164 |
Insertions are expensive: rearrange index file and data file.
This applies whether new value or new tuple containing known value.
One "optimisation": reserve whole block for one particular value.
Deletions relatively cheap (similar to primary index).
(Note: can't mark index entry for value X until all X tuples are deleted)
Secondary Index | 140/164 |
Generally, dense index on non-unique attribute As
Three possible solutions:
... Secondary Index | 141/164 |
First style of secondary index:
... Secondary Index | 142/164 |
Third (favoured) style of secondary index:
Selection with Sec.Index | 143/164 |
Since non-unique attribute, there are no (explicit) one queries.
For pmr queries involving As:
binary search for key K in index file 1
for each entry for K in index file 2 {
   b = getPage(ix.pageNum)
   search for tuple within buffer b
}
Assume that answering the query q requires:
... Selection with Sec.Index | 144/164 |
For range queries, e.g. Lo ≤ As ≤ Hi:

binary search for key Lo in index file 1
FOR each entry in index file 2 until Hi DO
   access tuple t via TupleID from entry
END
Analysis almost same as for pmr ...
Costrange = (log2i + aq + ∑k=Lo..Hi bq,k)r
(aq index pages, plus bq,k data pages for each key value k in the range)
Note: may read some blocks multiple times (alleviate with buffer pool)
Insertion/Deletion with Sec.Index | 145/164 |
Insertion:
As for primary index:
Can use mark-style (tombstone) deletion for tuples.
Caution: can only mark index-entry for k when no more entries for k in block-address blocks.
Multi-level Indexes | 146/164 |
In indexes described so far, search begins with binary search of index.
Requires log2 i index block accesses.
We can improve this by putting a second index on the first index.
Second index is sparse; (key,pointer) to first entry in first index block.
If there are i first-level index blocks, there will be i2 = ⌈ i/ci ⌉ second-level index blocks.
If second-level index too big, add third level ...
Maximum levels: t = ⌈ logci r ⌉ (dense index)
Top-level of index is always just a single block.
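The level count t = ⌈ logci r ⌉ can be computed by repeated ceiling-division; a small sketch (names illustrative):

```c
/* Levels in a multi-level dense index: t = ceil(log_ci(r)).
 * Each level needs ceil(n/ci) pages to index n entries below it;
 * repeat until a single (top-level) page remains. */
unsigned index_levels(unsigned long r, unsigned ci)
{
    unsigned t = 0;
    unsigned long n = r;                 /* entries at current level */
    while (n > 1) {
        n = (n + ci - 1) / ci;           /* pages at next level up */
        t++;
    }
    return t;
}
```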
... Multi-level Indexes | 147/164 |
Select with ML.Index | 148/164 |
For one query on indexed key field:
xpid = top level index page
for level = 1 to t {
   read index page xpid
   search index page for J'th entry
      where index[J].key <= K < index[J+1].key
   if (J == -1) { return NotFound }
   xpid = index[J].page
}
pid = xpid   // pid is now a data page index
search page pid and its overflow pages
Read t index pages and 1+Ov data blocks.
Thus, Costone,mli = (t + 1 + Ov)r
(Recall that t = ⌈ logcir ⌉ and single-level index needs log2i)
B-Trees | 149/164 |
B-trees are multi-way search trees (MSTs) with the properties:
... B-Trees | 150/164 |
Example B-tree (depth=3, n=3):
B-Tree Depth | 151/164 |
Depth depends on effective branching factor (i.e. how full nodes are).
Simulation studies (random insertions and deletions) show typical B-tree nodes are 69% full.
Gives load Li = 0.69 × ci and depth of tree ~ ⌈ logLi r ⌉.
Example: ci=128, Li=88
Level | #nodes | #keys |
root | 1 | 87 |
1 | 88 | 7656 |
2 | 7744 | 673728 |
3 | 681472 | 59288064 |
Note: ci is generally larger than 128 for a real B-tree.
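The table's figures follow directly from Li = 88; a sketch reproducing them (helper names illustrative):

```c
/* Node and key counts per level of a B-tree whose nodes hold
 * Li entries on average (Li-1 keys + Li child pointers).
 * Level 0 is the root. */
unsigned long btree_nodes(unsigned long Li, int level)
{
    unsigned long n = 1;
    for (int i = 0; i < level; i++)
        n *= Li;                         /* Li children per node */
    return n;
}

unsigned long btree_keys(unsigned long Li, int level)
{
    return btree_nodes(Li, level) * (Li - 1);
}
```

e.g. btree_keys(88,3) gives 59,288,064: even at 69% occupancy, four levels reach tens of millions of tuples.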
B+Trees | 152/164 |
In database context, nodes are index blocks.
Two possible ways to structure them:
B-Tree Index | 153/164 |
B+Tree Index | 154/164 |
Insertion into B-Trees | 155/164 |
Overview of the method:
... Insertion into B-Trees | 156/164 |
When splitting a node and promoting the middle element, we may find that the parent node is full.
In this case we split the parent in the same way as the child and promote its middle element.
This splitting may propagate all the way to root.
In this case we create a new root containing a single entry.
This is the only exception to the half-full rule.
B-Tree Insertion Cost | 157/164 |
Insertion cost = CosttreeSearch + CosttreeInsert + CostdataInsert
Best case: write one page (most of time)
Common case: 3 node writes (rearrange 2 leaves + parent)
... B-Tree Insertion Cost | 158/164 |
Worst case: 2D-1 node writes (propagate to root)
Deletion from B-Trees | 159/164 |
Overview of the method:
Selection with B+Trees | 160/164 |
For one queries:
N = B-tree root node
while (N is not a leaf node) {
   scan N to find reference to next node
   N = next node
}
scan leaf node N to find entry for K
access tuple t using TupleId from N
Costone = (D + 1)r
Generally, D is around 2 or 3, even for very large data files.
A further optimisation is to buffer B-tree root page (⇒ total D page reads)
... Selection with B+Trees | 161/164 |
For range queries (assume sorted on index attribute):
search index to find leaf node for Lo
for each leaf node entry until Hi found {
   access tuple t using TupleId from entry
}
Costrange = (D + bi + bq)r
(If hold root block in memory ⇒ read D-1 index blocks)
... Selection with B+Trees | 162/164 |
For pmr, need index on ≥ 1 query attribute.
Could have indexes on several attributes:
Pages = all pages of R
for each Ai = k in query {
   find pages P containing Ai = k   // via index on Ai
   Pages = Pages ∩ P
}
for each P in Pages {
   scan P for matching tuples
}
If q mentions ai, aj, ak, then
Costpmr = (Di + Dj + Dk + |bqi ∩ bqj ∩ bqk|)r
For space queries, treat as a combination of pmr and range.
B-trees in PostgreSQL | 163/164 |
PostgreSQL implements ≅ Lehman/Yao-style B-trees
backend/access/nbtree
README
nbtree.c
nbtsearch.c
nbtinsert.c
... B-trees in PostgreSQL | 164/164 |
Interface functions for B-trees
// build Btree index on relation
Datum btbuild(rel,index,...)
// insert index entry into Btree
Datum btinsert(rel,key,tupleid,index,...)
// start scan on Btree index
Datum btbeginscan(rel,key,scandesc,...)
// get next tuple in a scan
Datum btgettuple(scandesc,scandir,...)
// close down a scan
Datum btendscan(scandesc)
Produced: 9 Jul 2019