❖ Varieties of Selection |
Selection: select * from R where C
where R is a relation and C is a boolean condition on its attributes.
❖ Varieties of Selection (cont) |
Examples of different selection types:
one:  select * from R where id = 1234
pmr:  select * from R where age=65
      select * from R where age=65 and gender='m'
rng:  select * from R where age≥18 and age≤21
      select * from R where age between 18 and 21
                        and height between 160 and 190
(one = at most one match, pmr = partial-match retrieval, rng = range query)
❖ Implementing Select Efficiently |
Two basic approaches:
* sequential scan of all pages, checking every tuple (always possible)
* use the file organisation or an auxiliary structure (sorting, hashing, indexing) to limit the tuples examined
Our analyses assume: 1 input buffer available for each relation.
If more buffers are available, most methods benefit.
❖ Heap Files |
❖ Selection in Heaps |
For all selection queries, the only possible strategy is:
// select * from R where C
for each page P in file of relation R {
for each tuple t in page P {
if (t satisfies C)
add tuple t to result set
}
}
i.e. linear scan through file searching for matching tuples
❖ Selection in Heaps (cont) |
The heap is scanned from the first to the last page:
Cost_range = Cost_pmr = b
If we know that only one tuple matches the query (one query),
a simple optimisation is to stop the scan once that tuple is found.
Cost_one : Best = 1, Average = b/2, Worst = b
❖ Insertion in Heaps |
Insertion: new tuple is appended to file (in last page).
rel = openRelation("R", READ|WRITE);
pid = nPages(rel)-1;
get_page(rel, pid, buf);
if (size(newTup) > size(buf)) {
    deal with oversize tuple
} else {
    if (!hasSpace(buf,newTup)) {
        pid++; nPages(rel)++;
        clear(buf);
    }
    insert_record(buf,newTup);
    put_page(rel, pid, buf);
}
Cost_insert = 1r + 1w
Plus possible extra writes for oversize tuples, e.g. PostgreSQL's TOAST
❖ Insertion in Heaps (cont) |
Alternative strategy:
* find any page of R with enough free space for the new tuple
* requires free-space management: knowing which pages of R have space
For how PostgreSQL does this, see backend/access/heap/{heapam.c,hio.c}
❖ Insertion in Heaps (cont) |
PostgreSQL's tuple insertion:
heap_insert(Relation relation,   // relation desc
            HeapTuple newtup,    // new tuple data
            CommandId cid, ...)  // SQL statement
finds a page with enough free space for newtup, inserts newtup there,
and sets the tuple's xmin to the inserting transaction's ID (for MVCC).
❖ Deletion in Heaps |
SQL: delete from R where C
Implementation of deletion:
rel = openRelation("R",READ|WRITE);
for (p = 0; p < nPages(rel); p++) {
    get_page(rel, p, buf);
    ndels = 0;
    for (i = 0; i < nTuples(buf); i++) {
        tup = get_record(buf,i);
        if (tup satisfies Condition)
            { ndels++; delete_record(buf,i); }
    }
    if (ndels > 0) put_page(rel, p, buf);
    if (ndels > 0 && unique) break;
}
❖ Deletion in PostgreSQL |
PostgreSQL tuple deletion:
heap_delete(Relation relation,    // relation desc
            ItemPointer tid, ..., // tupleID
            CommandId cid, ...)   // SQL statement
marks the tuple as deleted by setting its xmax to the deleting
transaction's ID; the tuple is physically reclaimed later (by vacuum).
❖ Updates in Heaps |
SQL: update R set A = expr where C
Analysis for updates is similar to that for deletion
Complication: new tuple larger than old version (too big for page)
Solution: delete, re-organise free space, then insert
❖ Updates in Heaps (cont) |
PostgreSQL tuple update:
heap_update(Relation relation,     // relation desc
            ItemPointer otid,      // old tupleID
            HeapTuple newtup, ..., // new tuple data
            CommandId cid, ...)    // SQL statement
essentially does delete(otid) followed by insert(newtup);
the old tuple's ctid field is set to point to the new tuple version.
❖ Heaps in PostgreSQL |
PostgreSQL stores all table data in heap files (by default).
Typically there are also associated index files.
If the data would be more useful in some other form, that is achieved
via indexes, e.g.
create index ... using hash
❖ Heaps in PostgreSQL (cont) |
PostgreSQL "heap file" may use multiple physical files:
* OID ....... main data file, named by the table's OID
* OID.1 ..... extra data file(s), added once a file reaches 1GB (OID.2, ...)
* OID_fsm ... free space map (tracks pages with room for insertions)
* OID_vm .... visibility map (tracks pages whose tuples are all visible)
❖ Sorted Files |
Records stored in file in order of some field k (the sort key).
Makes searching more efficient; makes insertion less efficient
E.g. consider a sorted file with page capacity c = 4
❖ Sorted Files (cont) |
In order to mitigate insertion costs, use overflow blocks.
Total number of overflow blocks = b_ov.
Average overflow chain length: Ov = b_ov / b.
Bucket = data page + its overflow page(s)
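The bucket-search code on the following slides assumes each page records the index of the next page in its overflow chain. A minimal sketch of such a page layout in C (field names and page size are assumptions, not from the source):

#include <stdint.h>

#define PAGE_SIZE 4096
#define NO_PAGE   (-1)    // marks the end of an overflow chain

// Hypothetical fixed-size page: ovflow is the link that searchBucket()
// (below) follows to visit every page in a bucket.
typedef struct {
    int32_t ovflow;   // page index of next overflow page, or NO_PAGE
    int32_t nTuples;  // number of tuples currently stored in this page
    char    data[PAGE_SIZE - 2 * sizeof(int32_t)];  // tuple storage
} Page;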
❖ Selection in Sorted Files |
For one queries on sort key, use binary search.
// select * from R where k = val  (sorted on R.k)
lo = 0; hi = b-1
while (lo <= hi) {
    mid = (lo+hi) / 2;  // int division with truncation
    (tup,loVal,hiVal) = searchBucket(f,mid,k,val);
    if (tup != NULL) return tup;
    else if (val < loVal) hi = mid - 1;
    else if (val > hiVal) lo = mid + 1;
    else return NOT_FOUND;
}
return NOT_FOUND;
where f is the relation's data file; lo, mid, hi are page indexes;
k is the sort key attribute; and val, loVal, hiVal are values of k
(the value sought, and the min/max k values in the current bucket).
❖ Selection in Sorted Files (cont) |
Search a page and its overflow chain for a key value
searchBucket(f,pid,k,val)
{
    buf = getPage(f,pid);
    (tup,min,max) = searchPage(buf,k,val,+INF,-INF)
    if (tup != NULL) return (tup,min,max);
    ovf = openOvFile(f);
    ovp = ovflow(buf);
    while (tup == NULL && ovp != NO_PAGE) {
        buf = getPage(ovf,ovp);
        (tup,min,max) = searchPage(buf,k,val,min,max)
        ovp = ovflow(buf);
    }
    return (tup,min,max);
}
Assumes each page contains index of next page in Ov chain
getPage(f,pid) = { read_page(f,pid,buf); return buf; }
❖ Selection in Sorted Files (cont) |
Search within a page for key; also find min/max key values
searchPage(buf,k,val,min,max)
{
    res = NULL;
    for (i = 0; i < nTuples(buf); i++) {
        tup = getTuple(buf,i);
        if (tup.k == val) res = tup;
        if (tup.k < min) min = tup.k;
        if (tup.k > max) max = tup.k;
    }
    return (res,min,max);
}
❖ Selection in Sorted Files (cont) |
The above method treats each bucket like a single large page.
Cases:
* best: matching tuple found in the first data page examined
* worst: key not present; binary search examines log2 b data pages,
  plus (potentially) all of their overflow pages
Cost_one : Best = 1, Worst = log2 b + b_ov
Average case cost analysis needs assumptions (e.g. data distribution)
❖ Search with pmr and range Queries |
For pmr query, on non-unique attribute k, where file is sorted on k
select * from R where k = 2
Begin by locating a page p containing k = val (as for one queries).
Scan backwards and forwards from p to find all matches.
Thus, Cost_pmr = Cost_one + (b_q-1).(1+Ov)
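For example (values assumed for illustration): if b_q = 3 pages contain matches and Ov = 0.5, the scan adds (3-1)×(1+0.5) = 3 page reads on top of Cost_one.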
❖ Search with pmr and range Queries (cont) |
For range queries on unique sort key (e.g. primary key):
select * from R where k >= 5 and k <= 13
Cost_range = Cost_one + (b_q-1).(1+Ov)
❖ Search with pmr and range Queries (cont) |
For range queries on non-unique sort key, similar method to pmr.
select * from R where k >= 2 and k <= 6
Cost_range = Cost_one + (b_q-1).(1+Ov)
❖ Search with pmr and range Queries (cont) |
So far, have assumed query condition involves sort key k.
But what about:  select * from R where j = 100.0
If the condition involves an attribute j other than the sort key,
the ordering gives no help, so
Cost_one, Cost_range, Cost_pmr are as for heap files.
❖ Insertion into Sorted Files |
Insertion approach:
* locate the appropriate page for the new tuple (as for one queries)
* if the page has space, insert the tuple there
* otherwise, insert it into an overflow block for that page
Consider insertions of tuples with k=33, k=25, k=99 into such a file.
❖ Deletion from Sorted Files |
E.g. delete from R where k = 2
Deletion strategy:
* locate the matching tuple(s), as for selection
* mark them as deleted, and write the affected pages back
Recall: selectivity determines b_q (# pages with matches)
Thus, Cost_delete = Cost_select + b_q.w
❖ Hashing |
Basic idea: use key value to compute page address of tuple.
e.g. a tuple with key = v is stored in page i, where i = h(v)
Requires: a hash function h(v) that maps KeyDomain → [0..b-1].
❖ Hashing (cont) |
PostgreSQL hash function (simplified):
Datum hash_any(unsigned char *k, register int keylen)
{
    register uint32 a, b, c, len;
    register const uint32 *ka = (const uint32 *) k;

    /* Set up the internal state */
    len = keylen;
    a = b = c = 0x9e3779b9 + len + 3923095;

    /* handle most of the key */
    while (len >= 12) {
        a += ka[0]; b += ka[1]; c += ka[2];
        mix(a, b, c);
        ka += 3; len -= 12;
    }
    /* collect any data from last 11 bytes into a,b,c */
    mix(a, b, c);
    return UInt32GetDatum(c);
}
See backend/access/hash/hashfunc.c for the full version;
mix(a,b,c) scrambles the bits of its three arguments.
❖ Hashing (cont) |
hash_any() returns a uint32 value.
Two ways to map the raw hash value into a page address:

// if the file has b = 2^k pages, use the k lower-order bits
uint32 hashToPageNum(uint32 hval) {
    uint32 mask = 0xFFFFFFFF;
    return (hval & (mask >> (32-k)));
}

// for an arbitrary number of pages b, reduce the hash modulo b
uint32 hashToPageNum(uint32 hval) {
    return (hval % b);
}
❖ Hashing Performance |
Aims:
* distribute tuples as evenly as possible over the buckets
* keep each overflow chain as short as possible
Best case: every bucket contains same number of tuples.
Worst case: every tuple hashes to same bucket.
Average case: some buckets have more tuples than others.
Use overflow pages to handle "overfull" buckets (cf. sorted files)
All tuples in each bucket must have same hash value.
❖ Hashing Performance (cont) |
Two important measures for hash files:
* load factor:  L = r / (b.c)
* average overflow chain length:  Ov = b_ov / b

Case    |  L  |     Ov
--------+-----+------------
Best    | ≅ 1 | 0
Worst   | ≫ 1 | **
Average | < 1 | 0 < Ov < 1

(** performance becomes the same as a Heap File)
To achieve the average case, aim for 0.75 ≤ L ≤ 0.9.
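As a minimal sketch, the two measures can be computed from the file's statistics (struct and function names here are assumptions for illustration):

// Hypothetical statistics for a hash file.
typedef struct {
    long r;    // number of tuples
    long b;    // number of main data pages
    long c;    // tuples per page (capacity)
    long bov;  // total number of overflow pages
} HashFileStats;

// Load factor L = r / (b*c); for good average-case behaviour,
// aim for 0.75 <= L <= 0.9.
double loadFactor(HashFileStats s) { return (double)s.r / ((double)s.b * s.c); }

// Average overflow chain length Ov = b_ov / b.
double avgOvChain(HashFileStats s) { return (double)s.bov / (double)s.b; }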
❖ Selection with Hashing |
Select via hashing on unique key k (one)
// select * from R where k = val
(pid,P) = getPageViaHash(val,R)
for each tuple t in page P {
if (t.k == val) return t
}
for each overflow page Q of P {
for each tuple t in page Q {
if (t.k == val) return t
} }
Cost_one : Best = 1, Avg = 1 + Ov/2, Worst = 1 + max(OvLen)
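For example (Ov value assumed): with Ov = 0.5, an average one query costs 1 + 0.5/2 = 1.25 page reads — compare log2 b for a sorted file, or b/2 for a heap.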
❖ Selection with Hashing (cont) |
Select via hashing on non-unique hash key nk (pmr)
// select * from R where nk = val
(pid,P) = getPageViaHash(val,R)
for each tuple t in page P {
if (t.nk == val) add t to results
}
for each overflow page Q of P {
for each tuple t in page Q {
if (t.nk == val) add t to results
} }
return results
Cost_pmr = 1 + Ov
❖ Selection with Hashing (cont) |
Hashing does not help with range queries** ...
Cost_range = b + b_ov
Selection on attribute j which is not hash key ...
Cost_one, Cost_range, Cost_pmr = b + b_ov
** unless the hash function is order-preserving (and most aren't)
❖ Insertion with Hashing |
Insertion uses similar process to one queries.
// insert tuple t with key=val into rel R
(pid,P) = getPageViaHash(val,R)
if room in page P {
insert t into P; return
}
for each overflow page Q of P {
if room in page Q {
insert t into Q; return
} }
add new overflow page Q
link Q to previous page
insert t into Q
Cost_insert : Best: 1r + 1w, Worst: (1 + max(OvLen))r + 2w
❖ Deletion with Hashing |
Similar performance to select on non-unique key:
// delete from R where k = val
// f = data file ... ovf = ovflow file
(pid,P) = getPageViaHash(val,R)
ndel = delTuples(P,k,val)
if (ndel > 0) putPage(f,P,pid)
for each overflow page qid,Q of P {
    ndel = delTuples(Q,k,val)
    if (ndel > 0) putPage(ovf,Q,qid)
}
Extra cost over select is cost of writing back modified blocks.
Method works for both unique and non-unique hash keys.
❖ Problem with Hashing... |
So far, discussion of hashing has assumed a fixed file size (b).
What file size b should we use? Too small ⇒ long overflow chains;
too large ⇒ many under-filled pages.
Methods for hashing with dynamic files (b changes as the file grows):
* extendible hashing and dynamic hashing (use a directory, no overflow pages)
* linear hashing (no directory; uses overflow pages)
❖ Problem with Hashing... (cont) |
All flexible hashing methods:
* use a hash function producing a 32-bit bit-string:
      uint32 hash(unsigned char *val)
* require a function to extract d bits from the bit-string:
      uint32 bits(int d, uint32 val)
* use the result of bits() as a page address.
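A minimal sketch of bits() in C, assuming (as the comment in the later search code says) that the d lower-order bits are used:

#include <stdint.h>

// Extract the d lower-order bits of a 32-bit hash value.
// Assumes 0 < d < 32.  E.g. bits(3, 0xB5) == 0x5 (binary 101).
uint32_t bits(int d, uint32_t val)
{
    return val & ((1u << d) - 1);   // mask off all but the low d bits
}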
❖ Page Splitting |
Important concept for flexible hashing: splitting
When a page is split, its tuples are redistributed using one extra bit
of hash value. E.g. with d=3, all tuples whose hash ends in 101 live in
one page; when it is split (d=4), tuples whose hash ends in 0101 stay,
while those ending in 1101 move to the newly added page.
❖ Page Splitting (cont) |
Example of splitting:
Tuples only show key value; assume h(val) = val
❖ Linear Hashing |
File organisation:
* a file of b main data pages
* plus overflow pages, for tuples whose main page is full
* plus registers holding the split pointer sp and depth d
Disadvantage: requires overflow pages (don't split on full pages)
❖ Linear Hashing (cont) |
File grows linearly (one block at a time, at regular intervals).
Has "phases" of expansion; over each phase, b doubles.
❖ Selection with Lin.Hashing |
If b = 2^d, the file behaves exactly like standard hashing.
Use d bits of the hash to compute the page address.
// select * from R where k = val
h = hash(val);
pid = bits(d,h);  // lower-order bits
P = getPage(f, pid)
for each tuple t in page P
         and its overflow pages {
    if (t.k == val) add t to Result;
}
Average Cost_one = 1 + Ov
❖ Selection with Lin.Hashing (cont) |
If b != 2^d, treat different parts of the file differently.
Parts A and C are treated as if part of a file of size 2^(d+1).
Part B is treated as if part of a file of size 2^d.
Part D does not yet exist (tuples in B may eventually move into it).
❖ Selection with Lin.Hashing (cont) |
Modified search algorithm:
// select * from R where k = val
h = hash(val);
pid = bits(d,h);
if (pid < sp) { pid = bits(d+1,h); }
P = getPage(f, pid)
for each tuple t in page P
and its overflow blocks {
if (t.k == val) add t to Result;
}
❖ Insertion with Lin.Hashing |
Abstract view:
pid = bits(d,hash(val));
if (pid < sp) pid = bits(d+1,hash(val));
// bucket[pid] = page pid + its overflow pages
for each page P in bucket[pid] {
if (space in P) {
insert tuple into P
break
}
}
if (no insertion) {
add new ovflow page to bucket[pid]
insert tuple into new page
}
if (need to split) {
partition tuples from bucket[sp]
into bucket[sp] and bucket[sp+2^d]
sp++;
if (sp == 2^d) { d++; sp = 0; }
}
❖ Splitting |
How to decide that we "need to split"?
Two approaches to triggering a split:
* split whenever an insertion causes a new overflow page to be added
* split at regular intervals, e.g. after every k insertions
Note: we always split block sp, even if it is not full and not the
block that triggered the split.
Systematic splitting like this eventually visits every block, keeping
overflow chains (and hence average search cost) short.
❖ Exercise: Insertion into Linear Hashed File |
Consider a file with b=4, c=3, d=2, sp=0, hash(x) as above
Insert tuples in alphabetical order, with the following keys and hashes:
 k | hash(k) |  k | hash(k) |  k | hash(k) |  k | hash(k)
---+---------+----+---------+----+---------+----+--------
 a | 10001   |  g | 00000   |  m | 11001   |  s | 01110
 b | 11010   |  h | 00000   |  n | 01000   |  t | 10011
 c | 01111   |  i | 10010   |  o | 00110   |  u | 00010
 d | 01111   |  j | 10110   |  p | 11101   |  v | 11111
 e | 01100   |  k | 00101   |  q | 00010   |  w | 10000
 f | 00010   |  l | 00101   |  r | 00000   |  x | 00111
The hash values are the 5 lower-order bits from the full 32-bit hash.
❖ Exercise: Insertion into Linear Hashed File (cont) |
Splitting algorithm:
// partition tuples between two buckets
newp = sp + 2^d; oldp = sp;
for all tuples t in bucket[oldp] {
pid = bits(d+1,hash(t.k));
if (pid == newp)
add tuple t to bucket[newp]
else
add tuple t to bucket[oldp]
}
sp++;
if (sp == 2^d) { d++; sp = 0; }
❖ Insertion Cost |
If no split required, cost same as for standard hashing:
Cost_insert : Best: 1r + 1w, Avg: (1+Ov)r + 1w, Worst: (1+max(Ov))r + 2w
If a split occurs, incur Cost_insert plus the cost of splitting:
* read bucket sp, including its overflow pages: (1+Ov)r
* write back bucket sp and write the new bucket sp+2^d
❖ Deletion with Lin.Hashing |
Deletion is similar to ordinary static hash file.
But we might wish to contract the file once enough tuples are removed.
Rationale: r shrinks while b stays large ⇒ wasted space.
Method: reverse the splitting process ...
remove the last main data page, re-inserting its tuples into the
"buddy" page it was split from, and step sp (and, at a phase
boundary, d) back by one.
❖ Hash Files in PostgreSQL |
PostgreSQL uses linear hashing on tables which have been:
create index Ix on R using hash (k);
Hash file implementation: backend/access/hash
* hashfunc.c ..... hash functions for the various data types
* hashinsert.c ... inserts entries into the hash index
* hashpage.c ..... hash page/bucket management (including splitting)
* hashsearch.c ... searches the hash index
❖ Hash Files in PostgreSQL (cont) |
PostgreSQL uses a slightly different file organisation: main pages and
overflow pages live in the same file, with bitmap pages recording which
overflow pages are currently in use.
❖ Indexes |
A 1-d index is based on the value of a single attribute A.
Some possible properties of A:
* primary ...... index on unique field; file may be sorted on A
* clustering ... index on non-unique field; file sorted on A
* secondary .... file not sorted on A
A given table may have indexes on several attributes.
❖ Indexes (cont) |
Indexes themselves may be structured in several ways:
* dense .......... every tuple is referenced by an entry in the index file
* sparse ......... only some tuples are referenced by index file entries
* single-level ... tuples are accessed directly from the index file
* multi-level .... may need to access several index pages to reach tuple

Index file has total i pages (where typically i ≪ b)
Index file has page capacity c_i (where typically c_i ≫ c)
Dense index: i = ceil( r/c_i )    Sparse index: i = ceil( b/c_i )
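For example (values assumed for illustration): with r = 100,000 tuples in b = 5,000 data pages and c_i = 1,000 index entries per page, a dense index needs ceil(100000/1000) = 100 pages, while a sparse index needs only ceil(5000/1000) = 5 pages.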
❖ Exercise: Index Storage Overheads |
Consider a relation with the following storage parameters:
How many pages are needed to hold a sparse index?
❖ Selection with Primary Index |
For one queries:
ix = binary search index for entry with key K
if nothing found { return NotFound }
P = getPage(pageOf(ix.tid))
t = getTuple(P,offsetOf(ix.tid))
-- may require reading overflow pages
return t
Worst case: read log2 i index pages + read 1+Ov data pages.
Thus, Cost_one,prim = log2 i + 1 + Ov
Assume: index pages are same size as data pages ⇒ same reading cost
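For example (values assumed): with i = 1,000 index pages and Ov = 0.2, Cost_one,prim ≈ log2 1000 + 1 + 0.2 ≈ 11 page reads, versus b/2 on average for an unindexed heap scan.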
❖ Selection with Primary Index (cont) |
For range queries on the primary key:
* use the index to locate the lower bound lo
* scan index entries sequentially, collecting the data pages to visit,
  until the upper bound hi is passed
* fetch each collected data page (plus overflow chain) and extract the
  matching tuples (method below)
❖ Selection with Primary Index (cont) |
Method for range queries
// e.g. select * from R where a between lo and hi
pages = {}   results = {}
ixPage = findIndexPage(R.ixf,lo)
while (ixTup = getNextIndexTuple(R.ixf)) {
    if (ixTup.key > hi) break;
    pages = pages ∪ pageOf(ixTup.tid)
}
foreach pid in pages {
    // scan data page plus ovflow chain
    while (buf = getPage(R.datf,pid)) {
        foreach tuple T in buf {
            if (lo <= T.a && T.a <= hi)
                results = results ∪ T
        }
    }
}
❖ Insertion with Primary Index |
Overview:
tid = insert tuple into page P at position p
find location for new entry in index file
insert new index entry (k,tid) into index file
Problem: order of index entries must be maintained
Reorganisation requires, on average, reading/writing half of the index file:
Cost_insert,prim = (log2 i)r + i/2.(1r + 1w) + (1+Ov)r + (1+δ)w
❖ Deletion with Primary Index |
Overview:
find tuple using index
mark tuple as deleted
delete index entry for tuple
If we delete index entries by marking them (rather than physically
removing them and reorganising the index), the expensive reorganisation
step is avoided: the index page is simply written back.
❖ Clustering Index |
Data file sorted; one index entry for each key value
Cost penalty: maintaining both index and data file as sorted
(Note: can't mark index entry for value X until all X tuples are deleted)
❖ Secondary Index |
Data file not sorted; want one index entry for each key value
Cost_pmr = (log2 i_1 + a + b_q.(1+Ov))r
(search the first index file, read a pages of matching tuple-ids from
the second, then fetch the b_q data pages and their overflow chains)
❖ Multi-level Indexes |
The Secondary Index above used two index files (Ix1, Ix2) to speed up search.
The idea can be taken further:
* since Ix2 is ordered, Ix1 need only be sparse (one entry per Ix2 page)
* if Ix1 grows too large, add a sparse index Ix3 on top of Ix1
* ... and so on, level by level
Ultimately, reduce the top level of the index hierarchy to one page.
❖ Multi-level Indexes (cont) |
Example data file with three-levels of index:
Assume: not primary key, c = 100, c_i = 3
❖ Select with Multi-level Index |
For one query on indexed key field:
xpid = top level index page
for level = 1 to d {
read index page xpid
search index page for J'th entry
where index[J].key <= K < index[J+1].key
if (J == -1) { return NotFound }
xpid = index[J].page
}
pid = xpid // pid is data page index
search page pid and its overflow pages
Cost_one,mli = (d + 1 + Ov)r
(Note that d = ceil( log_ci r ), and c_i is large because index entries are small)
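For example (values assumed): r = 10^6 tuples with c_i = 100 gives d = ceil( log_100 10^6 ) = 3, so Cost_one,mli = (3 + 1 + Ov) page reads.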
❖ B-Trees |
B-trees are MSTs (multi-way search trees) with the properties:
* all leaf nodes are the same distance from the root (balanced)
* every node is at least half full (except, possibly, the root)
❖ B-Trees (cont) |
Example B-tree (depth=3, n=3)
(Note: in DBs, nodes are pages ⇒ large branching factor, e.g. n=500)
❖ B-Tree Depth |
Depth depends on effective branching factor (i.e. how full nodes are).
Simulation studies show typical B-tree nodes are 69% full.
Gives load L_i = 0.69 × c_i and tree depth ≈ ceil( log_Li r ).
Example: c_i = 128, L_i = 88
Level | #nodes | #keys
------+--------+----------
root  |      1 |       87
1     |     88 |     7656
2     |   7744 |   673728
3     | 681472 | 59288064
Note: c_i is generally larger than 128 for a real B-tree.
❖ Insertion into B-Trees |
Overview of the method:
1. find the leaf node where the new key belongs
2. if the leaf has space, insert the entry there
3. otherwise, split the leaf into two half-full nodes and promote the
   middle key to the parent (splits may propagate up to the root,
   increasing the tree's depth)
❖ Example: B-tree Insertion |
Starting from this tree:
insert the following keys in the given order 12 15 30 10
❖ B-Tree Insertion Cost |
Insertion cost = Cost_treeSearch + Cost_treeInsert + Cost_dataInsert
Best case: write one page (most of the time)
Common case: 3 node writes (rearrange 2 leaves + parent)
❖ B-Tree Insertion Cost (cont) |
Worst case: 2D-1 node writes (propagate to root)
❖ Selection with B-Trees |
For one queries:
N = B-tree root node
while (N is not a leaf node)
    N = scanToFindChild(N,K)
tid = scanToFindEntry(N,K)
access tuple T using tid
Cost_one = (D + 1)r
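For example, using the simulation figures above (c_i = 128, L_i = 88): a depth-3 B-tree can index tens of millions of tuples, so any one query costs only 4 page reads.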
❖ Selection with B-Trees (cont) |
For range queries (assume sorted on index attribute):
search index to find leaf node for Lo
for each leaf node entry until Hi found {
    access tuple T using tid from entry
}
Cost_range = (D + b_i + b_q)r
❖ B-trees in PostgreSQL |
PostgreSQL implements ≅ Lehman/Yao-style B-trees
See backend/access/nbtree:
* README ........ design notes on the implementation
* nbtree.c ...... interface functions
* nbtsearch.c ... B-tree search
* nbtinsert.c ... B-tree insertion
❖ B-trees in PostgreSQL (cont) |
Interface functions for B-trees
// build Btree index on relation
Datum btbuild(rel,index,...)
// insert index entry into Btree
Datum btinsert(rel,key,tupleid,index,...)
// start scan on Btree index
Datum btbeginscan(rel,key,scandesc,...)
// get next tuple in a scan
Datum btgettuple(scandesc,scandir,...)
// close down a scan
Datum btendscan(scandesc)
Produced: 30 Apr 2024