❖ Things To Note |
❖ Debugging Assignment 1 |
Don't rely just on running ./run_test.py
Run some tests "manually"
... run your PostgreSQL server ...

$ cd /localstorage/$USER/testing
$ createdb xyz
$ psql xyz -f pname.sql          # load the data type
$ psql xyz -e -f tests/0_sanity-checks/queries1.sql
$ psql xyz -e -f tests/0_sanity-checks/queries2.sql
$ psql xyz -e -f tests/0_sanity-checks/queries3.sql
$ psql xyz -e -f tests/0_sanity-checks/queries4.sql
$ psql xyz -e -f tests/0_sanity-checks/queries5.sql
Note: the -e flag makes psql echo each query before displaying its output.
If this is not fine-grained enough, run the individual queries.
❖ Debugging Assignment 1 (cont) |
Tests in 0_sanity-checks exercise only the PersonName input/output functions.
Later tests attempt to store and retrieve PersonName values in tables.
... ensure your PostgreSQL server is running ...

$ cd /localstorage/$USER/testing
$ dropdb xyz
$ createdb xyz
$ psql xyz -f pname.sql          # load the data type
$ cd tests/1_users
$ psql xyz -f schema.sql
$ psql xyz -f data1.sql
$ psql xyz -e -f queries1.sql
$ psql xyz -e -f queries2.sql
$ psql xyz
...
xyz=# select count(*) from Users;
 count
-------
    20
❖ Debugging Assignment 1 (cont) |
Any implementation of PersonName based on pointers, such as

typedef struct PersonName {
    char *family;
    char *given;
} PersonName;

is incorrect ... PostgreSQL stores only the bytes of the struct itself (two pointer values), not the strings they point to, so values written to disk cannot be read back correctly.
❖ Debugging Assignment 1 (cont) |
This implementation of PersonName

typedef struct PersonName {
    char family[128];
    char given[128];
} PersonName;

keeps all of the data within the PersonName object, so values can be stored and retrieved, but it wastes space and imposes an arbitrary limit on name length.
❖ Debugging Assignment 1 (cont) |
Correct implementations of PersonName use a variable-length internal representation. From the PostgreSQL documentation:

If the internal representation of the data type is variable-length, the internal representation must follow the standard layout for variable-length data: the first four bytes must be a char[4] field which is never accessed directly (customarily named vl_len_). You must use the SET_VARSIZE() macro to store the total size of the datum (including the length field itself) in this field and VARSIZE() to retrieve it. (These macros exist because the length field may be encoded differently depending on platform.)
There are examples in postgresql-15.6/contrib
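A minimal sketch of such a layout (the field and variable names here are illustrative, not taken from the assignment spec):

#include "postgres.h"   /* int32, palloc, VARHDRSZ, SET_VARSIZE, FLEXIBLE_ARRAY_MEMBER */

typedef struct PersonName {
    int32 vl_len_;                     /* varlena header; set via SET_VARSIZE(), never accessed directly */
    char  name[FLEXIBLE_ARRAY_MEMBER]; /* e.g. "Family,Given" stored inline, NUL-terminated */
} PersonName;

/* allocating a value (len = length of the name string): */
PersonName *pn = (PersonName *) palloc(VARHDRSZ + len + 1);
SET_VARSIZE(pn, VARHDRSZ + len + 1);
memcpy(pn->name, str, len + 1);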
❖ Debugging Assignment 1 (cont) |
Debugging outside PostgreSQL:

Extract your core functions into a standalone C program and test them there first (see the sketch below):

typedef struct PersonName {...}
PersonName *pname_in(char *name)     -- parse an input string into a PersonName
char *pname_out(PersonName *pname)   -- render a PersonName as a string

plus the code underlying family(), given() and show().
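A minimal sketch of such a harness (assuming your pname_in()/pname_out() are compiled in separately, with palloc mapped to malloc so no server is needed):

/* test_pname.c -- round-trip test for the input/output functions */
#include <stdio.h>

typedef struct PersonName PersonName;      /* opaque; real definition lives in your pname.c */
extern PersonName *pname_in(char *name);   /* your input function */
extern char *pname_out(PersonName *pname); /* your output function */

int main(void)
{
    char *tests[] = { "Smith,John", "Smith, John", "O'Brien,Patrick Sean", NULL };
    for (int i = 0; tests[i] != NULL; i++)
        printf("in:  %s\nout: %s\n", tests[i], pname_out(pname_in(tests[i])));
    return 0;
}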
❖ Selection with Lin.Hashing |
If b ≠ 2^d, treat different parts of the file differently.
Parts A and C are treated as if part of a file of size 2^(d+1).
Part B is treated as if part of a file of size 2^d.
Part D does not yet exist (tuples in B may move into it).
❖ Selection with Lin.Hashing (cont) |
Modified search algorithm:
// select * from R where k = val
h = hash(val);
pid = bits(d,h);
if (pid < sp) { pid = bits(d+1,h); }
P = getPage(f, pid)
for each tuple t in page P
         and its overflow pages {
    if (t.k == val) return t;
}
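The bits(d,h) function simply extracts the d low-order bits of the hash. A sketch in C (pageFor is an illustrative helper name, not from the lecture pseudocode):

// extract the d low-order bits of hash value h
unsigned int bits(int d, unsigned int h)
{
    return h & ((1u << d) - 1);
}

// choose the page for hash value h, given depth d and split pointer sp
unsigned int pageFor(unsigned int h, int d, unsigned int sp)
{
    unsigned int pid = bits(d, h);
    if (pid < sp)                /* this part of the file has already been split */
        pid = bits(d + 1, h);
    return pid;
}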
❖ Insertion with Lin.Hashing |
Abstract view:
pid = bits(d,hash(val));
if (pid < sp) pid = bits(d+1,hash(val));
// bucket[pid] = page pid + its overflow pages
for each page P in bucket[pid] {
if (space in P) {
insert tuple into P
break
}
}
if (no insertion) {
add new ovflow page to bucket[pid]
insert tuple into new page
}
if (need to split) {
partition tuples from bucket[sp]
into bucket[sp] and bucket[sp+2^d]
sp++;
if (sp == 2^d) { d++; sp = 0; }
}
❖ Splitting |
How to decide that we "need to split"?
Two approaches to triggering a split:
- split whenever an insertion requires a new overflow page
- track the load factor r/(b·c) and split whenever it exceeds a chosen threshold
Note: always split page sp, even if it is not full and is not the page that overflowed.
Systematic splitting like this eventually visits every page, keeping overflow chains short as the file grows.
❖ Exercise: Insertion into Linear Hashed File |
Consider a file with b=4, c=3, d=2, sp=0, hash(x) as above
Insert tuples in alpha order with the following keys and hashes:
 k | hash(k)     k | hash(k)     k | hash(k)     k | hash(k)
 a | 10001       g | 00000       m | 11001       s | 01110
 b | 11010       h | 00000       n | 01000       t | 10011
 c | 01111       i | 10010       o | 00110       u | 00010
 d | 01111       j | 10110       p | 11101       v | 11111
 e | 01100       k | 00101       q | 00010       w | 10000
 f | 00010       l | 00101       r | 00000       x | 00111
The hash values are the 5 lower-order bits from the full 32-bit hash.
❖ Exercise: Insertion into Linear Hashed File (cont) |
Splitting algorithm:
// partition tuples between two buckets
newp = sp + 2^d; oldp = sp;
for all tuples t in bucket[oldp] {
pid = bits(d+1,hash(t.k));
if (pid == newp)
add tuple t to bucket[newp]
else
add tuple t to bucket[oldp]
}
sp++;
if (sp == 2^d) { d++; sp = 0; }
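A worked example (assumed values): with d=2 and sp=1, the split uses newp = 1 + 2^2 = 5, so tuples in bucket 1 whose three low-order hash bits are 101 move to bucket 5, while those with bits 001 remain in bucket 1; afterwards sp=2.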
❖ Insertion Cost |
If no split required, cost same as for standard hashing:
Cost_insert:  Best: 1r + 1w,  Avg: (1+Ov)r + 1w,  Worst: (1+max(Ov))r + 2w
If a split occurs, incur Cost_insert plus the cost of splitting:
read bucket sp (1+Ov pages), then write the re-partitioned bucket sp and the new bucket sp+2^d.
❖ Deletion with Lin.Hashing |
Deletion is similar to ordinary static hash file.
But might wish to contract file when enough tuples removed.
Rationale: r shrinks, b stays large ⇒ wasted space.
Method: essentially the reverse of splitting ...
merge the tuples from the last bucket back into the bucket it split from, remove the last bucket, and decrement sp (if sp < 0, decrement d and set sp = 2^d - 1).
❖ Hash Files in PostgreSQL |
PostgreSQL uses linear hashing on tables which have been indexed via the hash method:
create index Ix on R using hash (k);
Hash file implementation: backend/access/hash
hashfunc.c ... hash functions for the various data types
hashinsert.c ... insert new index entries
hashpage.c ... page allocation and management
hashsearch.c ... search for matching index entries
❖ Hash Files in PostgreSQL (cont) |
PostgreSQL uses slightly different file organisation ...
❖ Indexes |
A 1-d index is based on the value of a single attribute A.
Some possible properties of A:
primary ... index on unique field, file may be sorted on A
clustering ... index on non-unique field, file sorted on A
secondary ... file not sorted on A
A given table may have indexes on several attributes.
❖ Indexes (cont) |
Indexes themselves may be structured in several ways:
dense ... every tuple is referenced by an entry in the index file
sparse ... only some tuples are referenced by index file entries
single-level ... tuples are accessed directly from the index file
multi-level ... may need to access several index pages to reach tuple
Index file has a total of i pages (where typically i ≪ b)
Index file has page capacity c_i (where typically c_i ≫ c)
Dense index: i = ceil(r/c_i)    Sparse index: i = ceil(b/c_i)
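For example (assumed figures): with r = 100,000 tuples, b = 10,000 data pages and c_i = 1,000 entries per index page, a dense index needs i = ceil(100000/1000) = 100 pages, while a sparse index needs only i = ceil(10000/1000) = 10 pages.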
❖ Exercise: Index Storage Overheads |
Consider a relation with the following storage parameters:
How many pages are needed to hold a sparse index?
❖ Selection with Primary Index |
For one queries:
ix = binary search index for entry with key K
if nothing found { return NotFound }
P = getPage(pageOf(ix.tid))
t = getTuple(P,offsetOf(ix.tid))
-- may require reading overflow pages
return t
Worst case: read log2(i) index pages + read 1+Ov data pages.
Thus, Cost_one,prim = log2(i) + 1 + Ov
Assume: index pages are same size as data pages ⇒ same reading cost
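For example (assumed figures): with i = 1024 index pages and Ov = 0, Cost_one,prim = log2(1024) + 1 + 0 = 11 page reads.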
❖ Selection with Primary Index (cont) |
For range queries on primary key:
❖ Selection with Primary Index (cont) |
Method for range queries:

// e.g. select * from R where a between lo and hi
pages = {};  results = {}
ixPage = findIndexPage(R.ixf, lo)
while (ixTup = getNextIndexTuple(R.ixf)) {
    if (ixTup.key > hi) break;
    pages = pages ∪ pageOf(ixTup.tid)
}
foreach pid in pages {
    // scan data page plus ovflow chain
    while (buf = getPage(R.datf, pid)) {
        foreach tuple T in buf {
            if (lo <= T.a && T.a <= hi)
                results = results ∪ T
        }
    }
}
❖ Insertion with Primary Index |
Overview:
tid = insert tuple into page P at position p
find location for new entry in index file
insert new index entry (k,tid) into index file
Problem: order of index entries must be maintained
Reorganisation requires, on average, reading/writing half of the index file:
Cost_insert,prim = log2(i)·r + (i/2)·(1r+1w) + (1+Ov)r + (1+δ)w
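For example (assumed figure): with i = 1000 index pages, the (i/2) term alone averages 500 reads and 500 writes to shift index entries, dwarfing the log2(1000) ≈ 10 reads of the initial search; this is why sorted index files are expensive to maintain under insertion.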
❖ Deletion with Primary Index |
Overview:
find tuple using index
mark tuple as deleted
delete index entry for tuple
If we delete index entries by marking ...
❖ Clustering Index |
Data file sorted; one index entry for each key value
Cost penalty: maintaining both index and data file as sorted
(Note: can't mark index entry for value X until all X tuples are deleted)
❖ Secondary Index |
Data file not sorted; want one index entry for each key value
Cost_pmr = (log2(i1) + i2 + b_q)r
(binary search on the first index file, read the matching tuple-id pages in the second, then fetch the b_q data pages holding matching tuples)
❖ Multi-level Indexes |
The Secondary Index above used two index files (Ix1, Ix2) to speed up search.
[diagram: a hierarchy of index files, with Ix3 on top of Ix2 on top of Ix1, each level smaller than the one below]
Ultimately, reduce top-level of index hierarchy to one page.
❖ Multi-level Indexes (cont) |
Example data file with three levels of index:
Assume: not primary key, c = 100, c_i = 3
❖ Select with Multi-level Index |
For one query on indexed key field:
xpid = top level index page
for level = 1 to d {
read index page xpid
search index page for J'th entry
where index[J].key <= K < index[J+1].key
if (J == -1) { return NotFound }
xpid = index[J].page
}
pid = xpid // pid is data page index
search page pid and its overflow pages
Cost_one,mli = (d + 1 + Ov)r
(Note that d = ceil(log_ci(r)) and c_i is large because index entries are small)
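For example (assumed figures): with r = 10^6 tuples and c_i = 100, d = ceil(log_100(10^6)) = 3, so Cost_one,mli = (3 + 1 + Ov)r, i.e. 4 page reads when there are no overflow pages.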
❖ B-Trees |
B-trees are MSTs (multi-way search trees) with the properties:
- every node is at least half full
- all leaf nodes are at the same depth
❖ B-Trees (cont) |
Example B-tree (depth=3, n=3)
(Note: in DBs, nodes are pages ⇒ large branching factor, e.g. n=500)
❖ B-Tree Depth |
Depth depends on effective branching factor (i.e. how full nodes are).
Simulation studies show typical B-tree nodes are 69% full.
Gives load L_i = 0.69 × c_i and depth of tree ≈ ceil(log_Li(r)).
Example: c_i = 128, L_i = 88

Level | #nodes | #keys
root  |      1 |       87
1     |     88 |     7656
2     |   7744 |   673728
3     | 681472 | 59288064

Note: c_i is generally larger than 128 for a real B-tree.
❖ Insertion into B-Trees |
Overview of the method:
1. find the leaf node where the new key belongs
2. if it has space, insert the entry there
3. otherwise, split the node into two half-full nodes, promote the middle key to the parent, and repeat upwards (a root split adds a level to the tree)
❖ Example: B-tree Insertion |
Starting from this tree:
insert the following keys in the given order: 12, 15, 30, 10
❖ B-Tree Insertion Cost |
Insertion cost = Cost_treeSearch + Cost_treeInsert + Cost_dataInsert
Best case: write one page (most of the time)
Common case: 3 node writes (rearrange 2 leaves + parent)
❖ B-Tree Insertion Cost (cont) |
Worst case: 2D-1 node writes (propagate to root)
❖ Selection with B-Trees |
For one queries:
N = B-tree root node
while (N is not a leaf node)
    N = scanToFindChild(N,K)
tid = scanToFindEntry(N,K)
access tuple T using tid
Cost_one = (D + 1)r, where D = depth of the B-tree
❖ Selection with B-Trees (cont) |
For range queries (assume sorted on index attribute):
search index to find leaf node for Lo
for each leaf node entry until Hi found {
    access tuple T using tid from entry
}

Cost_range = (D + b_i + b_q)r, where b_i = leaf index pages scanned, b_q = data pages fetched
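For example (assumed figures): with depth D = 3, matching entries spanning b_i = 2 leaf pages, and matching tuples spread over b_q = 10 data pages, Cost_range = (3 + 2 + 10)r = 15 page reads.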
❖ B-trees in PostgreSQL |
PostgreSQL implements ≅ Lehman/Yao-style B-trees
Code is in backend/access/nbtree:
README ... design notes on the implementation
nbtree.c ... interface functions
nbtsearch.c ... tree traversal and within-node search
nbtinsert.c ... entry insertion and page splitting
❖ B-trees in PostgreSQL (cont) |
Interface functions for B-trees
// build Btree index on relation
Datum btbuild(rel,index,...)
// insert index entry into Btree
Datum btinsert(rel,key,tupleid,index,...)
// start scan on Btree index
Datum btbeginscan(rel,key,scandesc,...)
// get next tuple in a scan
Datum btgettuple(scandesc,scandir,...)
// close down a scan
Datum btendscan(scandesc)
Produced: 7 Mar 2024