❖ Things To Note |
❖ Debugging Assignment 1 |
Don't rely just on running ./run_test.py
Run some tests "manually"
... run your PostgreSQL server ...

$ cd /localstorage/$USER/testing
$ createdb xyz
$ psql xyz -f pname.sql          # load the data type
$ psql xyz -e -f tests/0_sanity-checks/queries1.sql
$ psql xyz -e -f tests/0_sanity-checks/queries2.sql
$ psql xyz -e -f tests/0_sanity-checks/queries3.sql
$ psql xyz -e -f tests/0_sanity-checks/queries4.sql
$ psql xyz -e -f tests/0_sanity-checks/queries5.sql
Note: the -e flag makes psql echo each query before displaying its output.
If this is not fine-grained enough, run the individual queries.
❖ Debugging Assignment 1 (cont) |
Tests in 0_sanity-checks exercise only the PersonName input/output functions.
Later tests attempt to store and retrieve PersonName values in tables.
... ensure your PostgreSQL server is running ...

$ cd /localstorage/$USER/testing
$ dropdb xyz
$ createdb xyz
$ psql xyz -f pname.sql          # load the data type
$ cd tests/1_users
$ psql xyz -f schema.sql
$ psql xyz -f data1.sql
$ psql xyz -e -f queries1.sql
$ psql xyz -e -f queries2.sql
$ psql xyz
...
xyz=# select count(*) from Users;
 count
-------
    20
❖ Debugging Assignment 1 (cont) |
Any implementation of PersonName based on pointers, such as

typedef struct PersonName {
    char *family;
    char *given;
} PersonName;

is incorrect ... PostgreSQL stores only the bytes of the struct itself (two pointer values), not the strings they point to, so values written to disk cannot be read back correctly.
❖ Debugging Assignment 1 (cont) |
This implementation of PersonName

typedef struct PersonName {
    char family[128];
    char given[128];
} PersonName;

keeps all of the data within the PersonName object, so values can be stored and retrieved, but it wastes space and imposes an arbitrary limit on name length.
❖ Debugging Assignment 1 (cont) |
Correct implementations of PersonName use a variable-length internal representation. From the PostgreSQL documentation:

If the internal representation of the data type is variable-length, the internal representation must follow the standard layout for variable-length data: the first four bytes must be a char[4] field which is never accessed directly (customarily named vl_len_). You must use the SET_VARSIZE() macro to store the total size of the datum (including the length field itself) in this field and VARSIZE() to retrieve it. (These macros exist because the length field may be encoded differently depending on platform.)
There are examples in postgresql-15.6/contrib
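A minimal sketch of such a layout (the field and variable names here are illustrative, not taken from the assignment spec):

#include "postgres.h"   /* int32, palloc, VARHDRSZ, SET_VARSIZE, FLEXIBLE_ARRAY_MEMBER */

typedef struct PersonName {
    int32 vl_len_;                     /* varlena header; set via SET_VARSIZE(), never accessed directly */
    char  name[FLEXIBLE_ARRAY_MEMBER]; /* e.g. "Family,Given" stored inline, NUL-terminated */
} PersonName;

/* allocating a value (len = length of the name string): */
PersonName *pn = (PersonName *) palloc(VARHDRSZ + len + 1);
SET_VARSIZE(pn, VARHDRSZ + len + 1);
memcpy(pn->name, str, len + 1);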
❖ Debugging Assignment 1 (cont) |
Debugging outside PostgreSQL:

Extract your core functions into a standalone C program and test them there first (see the sketch below):

typedef struct PersonName {...}
PersonName *pname_in(char *name)     -- parse an input string into a PersonName
char *pname_out(PersonName *pname)   -- render a PersonName as a string

plus the code underlying family(), given() and show().
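A minimal sketch of such a harness (assuming your pname_in()/pname_out() are compiled in separately, with palloc mapped to malloc so no server is needed):

/* test_pname.c -- round-trip test for the input/output functions */
#include <stdio.h>

typedef struct PersonName PersonName;      /* opaque; real definition lives in your pname.c */
extern PersonName *pname_in(char *name);   /* your input function */
extern char *pname_out(PersonName *pname); /* your output function */

int main(void)
{
    char *tests[] = { "Smith,John", "Smith, John", "O'Brien,Patrick Sean", NULL };
    for (int i = 0; tests[i] != NULL; i++)
        printf("in:  %s\nout: %s\n", tests[i], pname_out(pname_in(tests[i])));
    return 0;
}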
❖ Selection with Lin.Hashing |
If b ≠ 2^d, treat different parts of the file differently.
Parts A and C are treated as if part of a file of size 2^(d+1).
Part B is treated as if part of a file of size 2^d.
Part D does not yet exist (tuples in B may move into it).
❖ Selection with Lin.Hashing (cont) |
Modified search algorithm:
// select * from R where k = val
h = hash(val);
pid = bits(d,h);
if (pid < sp) { pid = bits(d+1,h); }
P = getPage(f, pid)
for each tuple t in page P
         and its overflow pages {
    if (t.k == val) return t;
}
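The bits(d,h) function simply extracts the d low-order bits of the hash. A sketch in C (pageFor is an illustrative helper name, not from the lecture pseudocode):

// extract the d low-order bits of hash value h
unsigned int bits(int d, unsigned int h)
{
    return h & ((1u << d) - 1);
}

// choose the page for hash value h, given depth d and split pointer sp
unsigned int pageFor(unsigned int h, int d, unsigned int sp)
{
    unsigned int pid = bits(d, h);
    if (pid < sp)                /* this part of the file has already been split */
        pid = bits(d + 1, h);
    return pid;
}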
❖ Insertion with Lin.Hashing |
Abstract view:
pid = bits(d,hash(val));
if (pid < sp) pid = bits(d+1,hash(val));
// bucket[pid] = page pid + its overflow pages
for each page P in bucket[pid] {
if (space in P) {
insert tuple into P
break
}
}
if (no insertion) {
add new ovflow page to bucket[pid]
insert tuple into new page
}
if (need to split) {
partition tuples from bucket[sp]
into bucket[sp] and bucket[sp+2^d]
sp++;
if (sp == 2^d) { d++; sp = 0; }
}
❖ Splitting |
How to decide that we "need to split"?
Two approaches to triggering a split:
- split whenever an insertion requires a new overflow page
- track the load factor r/(b·c) and split whenever it exceeds a chosen threshold
Note: always split page sp, even if it is not full and is not the page that overflowed.
Systematic splitting like this eventually visits every page, keeping overflow chains short as the file grows.
❖ Exercise: Insertion into Linear Hashed File |
Consider a file with b=4, c=3, d=2, sp=0, hash(x) as above
Insert tuples in alpha order with the following keys and hashes:
 k | hash(k)     k | hash(k)     k | hash(k)     k | hash(k)
 a | 10001       g | 00000       m | 11001       s | 01110
 b | 11010       h | 00000       n | 01000       t | 10011
 c | 01111       i | 10010       o | 00110       u | 00010
 d | 01111       j | 10110       p | 11101       v | 11111
 e | 01100       k | 00101       q | 00010       w | 10000
 f | 00010       l | 00101       r | 00000       x | 00111
The hash values are the 5 lower-order bits from the full 32-bit hash.
❖ Exercise: Insertion into Linear Hashed File (cont) |
Splitting algorithm:
// partition tuples between two buckets
newp = sp + 2^d; oldp = sp;
for all tuples t in bucket[oldp] {
pid = bits(d+1,hash(t.k));
if (pid == newp)
add tuple t to bucket[newp]
else
add tuple t to bucket[oldp]
}
sp++;
if (sp == 2^d) { d++; sp = 0; }
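A worked example (assumed values): with d=2 and sp=1, the split uses newp = 1 + 2^2 = 5, so tuples in bucket 1 whose three low-order hash bits are 101 move to bucket 5, while those with bits 001 remain in bucket 1; afterwards sp=2.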
❖ Insertion Cost |
If no split required, cost same as for standard hashing:
Cost_insert:  Best: 1r + 1w,  Avg: (1+Ov)r + 1w,  Worst: (1+max(Ov))r + 2w
If a split occurs, incur Cost_insert plus the cost of splitting:
read bucket sp (1+Ov pages), then write the re-partitioned bucket sp and the new bucket sp+2^d.
❖ Deletion with Lin.Hashing |
Deletion is similar to ordinary static hash file.
But might wish to contract file when enough tuples removed.
Rationale: r shrinks, b stays large ⇒ wasted space.
Method: essentially the reverse of splitting ...
merge the tuples from the last bucket back into the bucket it split from, remove the last bucket, and decrement sp (if sp < 0, decrement d and set sp = 2^d - 1).
❖ Hash Files in PostgreSQL |
PostgreSQL uses linear hashing on tables which have been indexed via the hash method:
create index Ix on R using hash (k);
Hash file implementation: backend/access/hash
hashfunc.c ... hash functions for the various data types
hashinsert.c ... insert new index entries
hashpage.c ... page allocation and management
hashsearch.c ... search for matching index entries
❖ Hash Files in PostgreSQL (cont) |
PostgreSQL uses slightly different file organisation ...
❖ Indexes |
A 1-d index is based on the value of a single attribute A.
Some possible properties of A:
primary ... index on unique field, file may be sorted on A
clustering ... index on non-unique field, file sorted on A
secondary ... file not sorted on A
A given table may have indexes on several attributes.
❖ Indexes (cont) |
Indexes themselves may be structured in several ways:
dense ... every tuple is referenced by an entry in the index file
sparse ... only some tuples are referenced by index file entries
single-level ... tuples are accessed directly from the index file
multi-level ... may need to access several index pages to reach tuple
Index file has a total of i pages (where typically i ≪ b)
Index file has page capacity c_i (where typically c_i ≫ c)
Dense index: i = ceil(r/c_i)    Sparse index: i = ceil(b/c_i)
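For example (assumed figures): with r = 100,000 tuples, b = 10,000 data pages and c_i = 1,000 entries per index page, a dense index needs i = ceil(100000/1000) = 100 pages, while a sparse index needs only i = ceil(10000/1000) = 10 pages.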
❖ Exercise: Index Storage Overheads |
Consider a relation with the following storage parameters:
How many pages are needed to hold a sparse index?
❖ Selection with Primary Index |
For one queries:
ix = binary search index for entry with key K
if nothing found { return NotFound }
P = getPage(pageOf(ix.tid))
t = getTuple(P,offsetOf(ix.tid))
-- may require reading overflow pages
return t
Worst case: read log2(i) index pages + read 1+Ov data pages.
Thus, Cost_one,prim = log2(i) + 1 + Ov
Assume: index pages are same size as data pages ⇒ same reading cost
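For example (assumed figures): with i = 1024 index pages and Ov = 0, Cost_one,prim = log2(1024) + 1 + 0 = 11 page reads.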
❖ Selection with Primary Index (cont) |
For range queries on primary key:
❖ Selection with Primary Index (cont) |
Method for range queries:

// e.g. select * from R where a between lo and hi
pages = {};  results = {}
ixPage = findIndexPage(R.ixf, lo)
while (ixTup = getNextIndexTuple(R.ixf)) {
    if (ixTup.key > hi) break;
    pages = pages ∪ pageOf(ixTup.tid)
}
foreach pid in pages {
    // scan data page plus ovflow chain
    while (buf = getPage(R.datf, pid)) {
        foreach tuple T in buf {
            if (lo <= T.a && T.a <= hi)
                results = results ∪ T
        }
    }
}
❖ Insertion with Primary Index |
Overview:
tid = insert tuple into page P at position p
find location for new entry in index file
insert new index entry (k,tid) into index file
Problem: order of index entries must be maintained
Reorganisation requires, on average, reading/writing half of the index file:
Cost_insert,prim = log2(i)·r + (i/2)·(1r+1w) + (1+Ov)r + (1+δ)w
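For example (assumed figure): with i = 1000 index pages, the (i/2) term alone averages 500 reads and 500 writes to shift index entries, dwarfing the log2(1000) ≈ 10 reads of the initial search; this is why sorted index files are expensive to maintain under insertion.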
❖ Deletion with Primary Index |
Overview:
find tuple using index
mark tuple as deleted
delete index entry for tuple
If we delete index entries by marking ...
❖ Clustering Index |
Data file sorted; one index entry for each key value
Cost penalty: maintaining both index and data file as sorted
(Note: can't mark index entry for value X until all X tuples are deleted)
❖ Secondary Index |
Data file not sorted; want one index entry for each key value
Cost_pmr = (log2(i1) + i2 + b_q)r
(binary search on the first index file, read the matching tuple-id pages in the second, then fetch the b_q data pages holding matching tuples)
❖ Multi-level Indexes |
The Secondary Index above used two index files (Ix1, Ix2) to speed up search.
[diagram: a hierarchy of index files, with Ix3 on top of Ix2 on top of Ix1, each level smaller than the one below]
Ultimately, reduce top-level of index hierarchy to one page.
❖ Multi-level Indexes (cont) |
Example data file with three levels of index:
Assume: not primary key, c = 100, c_i = 3
❖ Select with Multi-level Index |
For one query on indexed key field:
xpid = top level index page
for level = 1 to d {
read index page xpid
search index page for J'th entry
where index[J].key <= K < index[J+1].key
if (J == -1) { return NotFound }
xpid = index[J].page
}
pid = xpid // pid is data page index
search page pid and its overflow pages
Cost_one,mli = (d + 1 + Ov)r
(Note that d = ceil(log_ci(r)) and c_i is large because index entries are small)
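For example (assumed figures): with r = 10^6 tuples and c_i = 100, d = ceil(log_100(10^6)) = 3, so Cost_one,mli = (3 + 1 + Ov)r, i.e. 4 page reads when there are no overflow pages.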
❖ B-Trees |
B-trees are MSTs (multi-way search trees) with the properties:
- every node is at least half full
- all leaf nodes are at the same depth
❖ B-Trees (cont) |
Example B-tree (depth=3, n=3)
(Note: in DBs, nodes are pages ⇒ large branching factor, e.g. n=500)
❖ B-Tree Depth |
Depth depends on effective branching factor (i.e. how full nodes are).
Simulation studies show typical B-tree nodes are 69% full.
Gives load L_i = 0.69 × c_i and depth of tree ≈ ceil(log_Li(r)).
Example: c_i = 128, L_i = 88

Level | #nodes | #keys
root  |      1 |       87
1     |     88 |     7656
2     |   7744 |   673728
3     | 681472 | 59288064

Note: c_i is generally larger than 128 for a real B-tree.
❖ Insertion into B-Trees |
Overview of the method:
1. find the leaf node where the new key belongs
2. if it has space, insert the entry there
3. otherwise, split the node into two half-full nodes, promote the middle key to the parent, and repeat upwards (a root split adds a level to the tree)
❖ Example: B-tree Insertion |
Starting from this tree:
insert the following keys in the given order: 12, 15, 30, 10
❖ B-Tree Insertion Cost |
Insertion cost = Cost_treeSearch + Cost_treeInsert + Cost_dataInsert
Best case: write one page (most of the time)
Common case: 3 node writes (rearrange 2 leaves + parent)
❖ B-Tree Insertion Cost (cont) |
Worst case: 2D-1 node writes (propagate to root)
❖ Selection with B-Trees |
For one queries:
N = B-tree root node
while (N is not a leaf node)
    N = scanToFindChild(N,K)
tid = scanToFindEntry(N,K)
access tuple T using tid
Cost_one = (D + 1)r, where D = depth of the B-tree
❖ Selection with B-Trees (cont) |
For range queries (assume sorted on index attribute):
search index to find leaf node for Lo
for each leaf node entry until Hi found {
    access tuple T using tid from entry
}

Cost_range = (D + b_i + b_q)r, where b_i = leaf index pages scanned, b_q = data pages fetched
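For example (assumed figures): with depth D = 3, matching entries spanning b_i = 2 leaf pages, and matching tuples spread over b_q = 10 data pages, Cost_range = (3 + 2 + 10)r = 15 page reads.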
❖ B-trees in PostgreSQL |
PostgreSQL implements ≅ Lehman/Yao-style B-trees
Code is in backend/access/nbtree:
README ... design notes on the implementation
nbtree.c ... interface functions
nbtsearch.c ... tree traversal and within-node search
nbtinsert.c ... entry insertion and page splitting
❖ B-trees in PostgreSQL (cont) |
Interface functions for B-trees
// build Btree index on relation
Datum btbuild(rel,index,...)
// insert index entry into Btree
Datum btinsert(rel,key,tupleid,index,...)
// start scan on Btree index
Datum btbeginscan(rel,key,scandesc,...)
// get next tuple in a scan
Datum btgettuple(scandesc,scandir,...)
// close down a scan
Datum btendscan(scandesc)
Produced: 7 Mar 2024