COMP9315 Week 07 Monday Lecture

Things To Note
Assignment 2
Signature-based Selection
Indexing with Signatures
Signatures
Generating Codewords
Superimposed Codewords (SIMC)
SIMC Example
SIMC Queries
Example SIMC Query
SIMC Parameters
Query Cost for SIMC
Exercise: SIMC Query Cost
Page-level SIMC
Page-Level SIMC Files
Exercise: Page-level SIMC Query Cost
Bit-sliced SIMC
Exercise: Bit-sliced SIMC Query Cost
Implementing Join
Join
Join Example
Nested Loop Join
Block Nested Loop Join
Exercise: Nested Loop Join Cost
BNJ in Practice
Index Nested Loop Join
Exercise: Index Nested Loop Join Cost
Sort-Merge Join
Sort-Merge Join on Example
Exercise: Sort-merge Join Cost

❖ Things To Note

Assignment 1
- assignments submitted after midnight Sunday are worth ZERO
- unless you have an extension via Special Consid or ELS
Assignment 2
- now released ... due start week 10
Quiz 4
- released Monday week 8 ... due Friday week 8

❖ Assignment 2

An implementation of multi-attribute linear hashing

linear hashing ... grow data file one page at a time
- covered in Week 04 Lectures ... end Monday, start Thursday
multi-attribute hashing ... use bits from all attributes in hash
- covered in Week 05 Lectures ... Monday (slides 19-37)

Individual relations are built as MALH files

info file (1 page), data file (b pages), overflow file (b_Ov pages)

Does NOT involve PostgreSQL

❖ Assignment 2 (cont)

Implementation involves three main commands

create ... create an empty MALH relation
insert ... add tuples into an MALH relation
select ... find tuples in an MALH relation

plus three utility commands

stats ... print info about an MALH relation
dump ... list all tuples in an MALH relation
gendata ... generate random tuples

Each of the above is a C program containing a main() function.

❖ Assignment 2 (cont)

The commands are supported by a set of ADTs:

bits ... ADT for bit-strings
chvec ... ADT for choice vectors
hash ... PostgreSQL hash function
page ... ADT for data/overflow pages
query ... ADT for query scanners (incomplete)
reln ... ADT for relations (partly complete)
tuple ... ADT for tuples (partly complete)
util ... utility functions

Each ADT has a .h file (interface) and a .c file (implementation)

❖ Assignment 2 (cont)

As supplied, all code compiles

create ... creates an empty MALH relation
insert ... adds tuples into an MALH relation, BUT
- does not do MA hashing ... simply hashes first attribute
- does not do linear hashing ... data file has fixed number of pages
select ... does not do anything

The three utility commands are all complete

Your task: complete the ADTs so that all commands work.

❖ Assignment 2 (cont)

What to change:

change tuple.c to use multi-attribute hashing
change query.c to implement query evaluation
change reln.c to implement linear hashing

What to submit:

zip ass2.zip all changed files
give cs9315 ass2 ass2.zip

Do NOT change create.c, insert.c, select.c

❖ Signature-based Selection

❖ Indexing with Signatures

Signature-based indexing:

designed for pmr (partial-match retrieval) queries (conjunction of equalities)
does not try to achieve better than O(n) performance
attempts to provide an "efficient" linear scan

Each tuple is associated with a signature

a compact (lossy) descriptor for the tuple
formed by combining information from multiple attributes
stored in a signature file, parallel to data file

Instead of scanning/testing tuples, do pre-filtering via signatures.

❖ Indexing with Signatures (cont)

File organisation for signature indexing (two files)

One signature slot per tuple slot; unused signature slots are zeroed.

Signatures do not determine record placement ⇒ can use with other indexing.

❖ Signatures

A signature "summarises" the data in one tuple

A tuple consists of n attribute values A₁ .. A_n

A codeword cw(A_i) is

a bit-string, m bits long, where k bits are set to 1 (k ≪ m)
derived from the value of a single attribute A_i

A tuple descriptor (signature) is built by combining cw(A_i), i=1..n

could combine by overlaying (or concatenating) codewords
aim to have roughly half of the bits set to 1

❖ Generating Codewords

Generating a k-in-m codeword for attribute A_i

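bits codeword(char *attr_value, int m, int k)
{
   int  nbits = 0;   // count of set bits
   bits cword = 0;   // assuming bits is unsigned, m <= 32 and k <= m
   srandom(hash(attr_value));   // same value always gives same codeword
   while (nbits < k) {
      int i = random() % m;     // pick a bit position 0..m-1
      bits mask = (bits)1 << i; // unsigned shift; (1 << 31) overflows a signed int
      if ((mask & cword) == 0) {
         cword |= mask;
         nbits++;
      }
   }
   return cword;  // m-bits with k 1-bits and m-k 0-bits
}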

❖ Superimposed Codewords (SIMC)

In a superimposed codewords (simc) indexing scheme

a tuple descriptor is formed by overlaying attribute codewords

A tuple descriptor desc(r) is

a bit-string, m bits long, where j ≤ nk bits are set to 1
desc(r) = cw(A₁) OR cw(A₂) OR ... OR cw(A_n)

Method (assuming all n attributes are used in descriptor):

bits desc = 0;
for (i = 1; i <= n; i++) {
   bits cw = codeword(A[i], m, k);
   desc = desc | cw;
}
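
A self-contained sketch of the method above, for experimenting with m and k. The names str_hash, next_rand, tuple_desc and count_bits are not from the lecture code; str_hash (FNV-1a) stands in for hash(), and a small xorshift generator replaces srandom()/random() so that results are repeatable. It assumes m <= 32.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t bits;   /* descriptor type; this sketch assumes m <= 32 */

/* Hypothetical stand-in for the hash() used in the slides: 32-bit FNV-1a. */
static uint32_t str_hash(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s != '\0') {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Small deterministic generator (xorshift32), used in place of
 * srandom()/random() so the same value always yields the same codeword. */
static uint32_t next_rand(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* k-in-m codeword for one attribute value. */
bits codeword(const char *attr_value, int m, int k)
{
    assert(m >= 1 && m <= 32 && k >= 0 && k <= m);
    uint32_t state = str_hash(attr_value) | 1u;  /* xorshift state must be non-zero */
    bits cword = 0;
    int nbits = 0;
    while (nbits < k) {
        bits bit = (bits)1 << (next_rand(&state) % (uint32_t)m);
        if ((cword & bit) == 0) {
            cword |= bit;
            nbits++;
        }
    }
    return cword;
}

/* Tuple descriptor: OR together the codewords of all n attribute values. */
bits tuple_desc(const char *attrs[], int n, int m, int k)
{
    bits desc = 0;
    for (int i = 0; i < n; i++)
        desc |= codeword(attrs[i], m, k);
    return desc;
}

/* Number of 1-bits in a descriptor. */
int count_bits(bits b)
{
    int c = 0;
    for (; b != 0; b &= b - 1)
        c++;
    return c;
}
```

Because codewords may overlap, count_bits(tuple_desc(...)) lies anywhere between k and n.k. The actual bit patterns depend on the hash function, so they will not reproduce the codewords shown in the slides.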

❖ SIMC Example

Consider the following tuple (from bank deposit database)

Branch	AcctNo	Name	Amount
Perryridge	102	Hayes	400

It has the following codewords/descriptor (for m = 12, k = 2 )

A_i	cw(A_i)
Perryridge	`010000000001`
102	`000000000011`
Hayes	`000001000100`
400	`000010000100`
desc(r)	`010011000111`

❖ SIMC Queries

To answer query q in SIMC

first generate a query descriptor desc(q)
then use the query descriptor to search the signature file

desc(q) is formed by OR of codewords for known attributes.

E.g. consider the query (Perryridge, ?, ?, ?).

A_i	cw(A_i)
Perryridge	`010000000001`
`?`	`000000000000`
`?`	`000000000000`
`?`	`000000000000`
desc(q)	`010000000001`

❖ SIMC Queries (cont)

Once we have a query descriptor, we search the signature file:

pagesToCheck = {}
for each descriptor D[i] in signature file {
    if (matches(D[i],desc(q))) {
        pid = pageOf(tupleID(i))
        pagesToCheck = pagesToCheck ∪ {pid}
    }
}
for each P in pagesToCheck {
    Buf = getPage(f,P)
    check tuples in Buf for answers
}
// where ...
#define matches(rdesc,qdesc) \
               (((rdesc) & (qdesc)) == (qdesc))

❖ Example SIMC Query

Consider the query and the example database:

Signature	Deposit Record
`010000000001`	(Perryridge,?,?,?)
`100101001001`	(Brighton,217,Green,750)
`010011000111`	(Perryridge,102,Hayes,400)
`101001001001`	(Downtown,101,Johnson,512)
`101100000011`	(Mianus,215,Smith,700)
`010101010101`	(Clearview,117,Throggs,295)
`100101010011`	(Redwood,222,Lindsay,695)

Gives two matches: one true match, one false match.
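
The example above can be checked mechanically. A minimal sketch, assuming descriptors fit in 32 bits; parse_bits and scan_signatures are illustrative names, not part of the lecture code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t bits;

/* Parse a string of '0'/'1' characters (most significant bit first). */
bits parse_bits(const char *s)
{
    bits b = 0;
    for (; *s != '\0'; s++) {
        assert(*s == '0' || *s == '1');
        b = (b << 1) | (bits)(*s - '0');
    }
    return b;
}

/* A tuple descriptor matches if it has a 1 wherever the query descriptor does. */
int matches(bits rdesc, bits qdesc)
{
    return (rdesc & qdesc) == qdesc;
}

/* Scan nsigs signatures; store the indexes of matching ones in hits[]
 * (which must have room for nsigs entries) and return how many matched. */
size_t scan_signatures(const bits sigs[], size_t nsigs, bits qdesc, size_t hits[])
{
    size_t nhits = 0;
    for (size_t i = 0; i < nsigs; i++)
        if (matches(sigs[i], qdesc))
            hits[nhits++] = i;
    return nhits;
}
```

Loading the six record signatures in table order and scanning with parse_bits("010000000001") returns indexes 1 (Perryridge, the true match) and 4 (Clearview, the false match). The false match is only discovered when the data page is read and the tuple itself is checked.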

❖ SIMC Parameters

False match probability p_F = likelihood of a false match

How to reduce likelihood of false matches?

use different hash function for each attribute (h_i for A_i)
increase descriptor size (m)
choose k so that ≅ half of bits are set

Larger m means reading more descriptor data.

Having k too high ⇒ increased overlapping.
Having k too low ⇒ increased hash collisions.

❖ SIMC Parameters (cont)

How to determine "optimal" m and k?

start by choosing acceptable p_F
(e.g. p_F ≤ 10^-5 i.e. one false match in 100,000)
then choose m and k to achieve no more than this p_F.

Formulae to derive m and k given p_F and n:

k = (1 / log_e 2) . log_e(1/p_F)

m = (1 / log_e 2)² . n . log_e(1/p_F)
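
A worked instance of these formulae, for illustration only. The values n = 4 and p_F = 10^-3 are chosen here, not taken from an exercise; results are rounded up to whole bits.

```latex
\begin{align*}
k &= \frac{1}{\ln 2}\,\ln\frac{1}{p_F} = \log_2 10^{3} \approx 9.97
     \;\Rightarrow\; k = 10 \\
m &= \left(\frac{1}{\ln 2}\right)^{2} n\,\ln\frac{1}{p_F} = \frac{n\,k}{\ln 2}
     \approx \frac{4 \times 9.97}{0.693} \approx 57.5
     \;\Rightarrow\; m = 58 \text{ bits}
\end{align*}
```

The two formulae are linked by m = n.k / log_e 2, and the first one is equivalent to p_F = 2^-k. Assuming the n.k bits are placed independently and uniformly, the expected fraction of 1-bits is about 1 - e^(-n.k/m) = 1 - e^(-log_e 2) = 1/2, which is the "half of the bits set" target.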

❖ Query Cost for SIMC

Cost to answer pmr query: Cost_pmr = b_D + b_q

read r descriptors on b_D descriptor pages
then read b_q data pages and check for matches

b_D = ceil( r/c_D ) and c_D = floor(B/ceil(m/8))

E.g. m=64, B=8192, r=10⁴ ⇒ c_D = 1024, b_D=10

b_q includes pages with r_q matching tuples and r_F false matches

Expected false matches = r_F = (r - r_q).p_F ≅ r.p_F if r_q ≪ r

E.g. Worst b_q = r_q+r_F, Best b_q = 1, Avg b_q = ceil(b(r_q+r_F)/r)
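
The cost formulae above translate directly into integer arithmetic. A minimal sketch; the function names are made up for illustration.

```c
#include <assert.h>

/* Integer ceiling of a/b for positive b. */
long ceil_div(long a, long b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* c_D: descriptors per page, for page size B bytes and m-bit descriptors. */
long descs_per_page(long B, long m)
{
    return B / ceil_div(m, 8);
}

/* b_D: descriptor pages needed for r tuples. */
long desc_pages(long r, long B, long m)
{
    return ceil_div(r, descs_per_page(B, m));
}

/* Average-case b_q = ceil(b.(r_q + r_F) / r), as in the slide. */
long avg_data_pages(long b, long r, long r_q, long r_F)
{
    return ceil_div(b * (r_q + r_F), r);
}

/* Cost_pmr = b_D + b_q */
long simc_cost(long b_D, long b_q)
{
    return b_D + b_q;
}
```

With the slide's values, descs_per_page(8192, 64) gives 1024 and desc_pages(10000, 8192, 64) gives 10.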

❖ Exercise: SIMC Query Cost

Consider a SIMC-indexed database with the following properties

all pages are B = 8192 bytes
tuple descriptors have m = 64 bits ( = 8 bytes)
total records r = 102,400, records/page c = 100
false match probability p_F = 1/1000
answer set has 1000 tuples from 100 pages
90% of false matches occur on data pages with true match
10% of false matches are distributed 1 per page

Calculate the total number of pages read in answering the query.

❖ Page-level SIMC

SIMC has one descriptor per tuple ... potentially inefficient.

Alternative approach: one descriptor for each data page.

Every attribute of every tuple in page contributes to descriptor.

Size of page descriptor (PD) (clearly larger than tuple descriptor):

use above formulae but with c.n "attributes"

E.g. n = 4, c = 128, p_F = 10^-3 ⇒ m ≅ 7000bits ≅ 900bytes

Typically, pages are 1..8KB ⇒ 1..9 PD/page (N_PD).

❖ Page-Level SIMC Files

File organisation for page-level superimposed codeword index

❖ Exercise: Page-level SIMC Query Cost

Consider a SIMC-indexed database with the following properties

all pages are B = 8192 bytes
page descriptors have m = 4096 bits ( = 512 bytes)
total records r = 102,400, records/page c = 100
false match probability p_F = 1/1000
answer set has 1000 tuples from 100 pages
90% of false matches occur on data pages with true match
10% of false matches are distributed 1 per page

Calculate the total number of pages read in answering the query.

❖ Bit-sliced SIMC

Improvement: store b m-bit page descriptors as m b-bit "bit-slices"

❖ Bit-sliced SIMC (cont)

At query time

matches = ~0  //all ones
for each bit i set to 1 in desc(q) {
   slice = fetch bit-slice i
   matches = matches & slice
}
for each bit i set to 1 in matches {
   fetch page i
   scan page for matching records
}

Effective because desc(q) typically has fewer than half of its bits set to 1, so only those slices need to be fetched
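
A small in-memory sketch of the query-time loop. It assumes at most 64 data pages (one uint64_t per slice) and a fixed descriptor width M = 12; both limits, and the names build_slices and candidate_pages, are choices made for this illustration. Unlike the pseudocode, the bitmap starts with ones only for the b real pages, so unused high bits never show up as candidates.

```c
#include <assert.h>
#include <stdint.h>

#define M 12                 /* bits per page descriptor (hypothetical, small) */

typedef uint64_t slice_t;    /* one bit per data page, so this sketch needs b <= 64 */

/* Transpose b page descriptors (M bits each) into M bit-slices (b bits each).
 * Bit p of slices[i] is 1 iff bit i of pdesc[p] is 1. */
void build_slices(const uint32_t pdesc[], int b, slice_t slices[M])
{
    assert(b >= 0 && b <= 64);
    for (int i = 0; i < M; i++) {
        slices[i] = 0;
        for (int p = 0; p < b; p++)
            if ((pdesc[p] >> i) & 1u)
                slices[i] |= (slice_t)1 << p;
    }
}

/* AND together the slices selected by the 1-bits of qdesc.
 * Returns a bitmap of candidate pages; *nslices reports how many
 * slices had to be fetched. */
slice_t candidate_pages(const slice_t slices[M], int b, uint32_t qdesc, int *nslices)
{
    assert(b >= 1 && b <= 64);
    slice_t matches = (b == 64) ? ~(slice_t)0 : (((slice_t)1 << b) - 1);
    *nslices = 0;
    for (int i = 0; i < M; i++) {
        if ((qdesc >> i) & 1u) {
            matches &= slices[i];
            (*nslices)++;
        }
    }
    return matches;
}
```

Only the slices named by desc(q) are touched, so the number of slice fetches equals the number of 1-bits in the query descriptor, independent of m.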

❖ Exercise: Bit-sliced SIMC Query Cost

Consider a SIMC-indexed database with the following properties

all pages are B = 8192 bytes
r = 102,400, c = 100, b = 1024
page descriptors have m = 4096 bits ( = 512 bytes)
bit-slices have b = 1024 bits ( = 128 bytes)
false match probability p_F = 1/1000
query descriptor has k = 10 bits set to 1
answer set has 1000 tuples from 100 pages
90% of false matches occur on data pages with true match
10% of false matches are distributed 1 per page

Calculate the total number of pages read in answering the query.

❖ Implementing Join

❖ Join

DBMSs are engines to store, combine and filter information.

Join (⋈) is the primary means of combining information.

Join is important and potentially expensive

Most common join condition: equijoin, e.g. (R.pk = S.fk)

Join varieties (natural, inner, outer, semi, anti) all behave similarly.

We consider three strategies for implementing join

nested loop ... simple, widely applicable, inefficient without buffering
sort-merge ... works best if tables are sorted on join attributes
hash-based ... requires good hash function and sufficient buffering

❖ Join Example

Consider a university database with the schema:

create table Student(
   id     integer primary key,
   name   text,  ...
);
create table Enrolled(
   stude  integer references Student(id),
   subj   text references Subject(code),  ...
);
create table Subject(
   code   text primary key,
   title  text,  ...
);

❖ Join Example (cont)

List names of students in all subjects, arranged by subject.

SQL query to provide this information:

select E.subj, S.name
from   Student S, Enrolled E
where  S.id = E.stude
order  by E.subj, S.name;

And its relational algebra equivalent:

Sort[subj,name] ( Project[subj,name] ( Join[id=stude](Student,Enrolled) ) )

To simplify formulae, we denote Student by S and Enrolled by E

❖ Join Example (cont)

Some database statistics:

Sym	Meaning	Value
r_S	# student records	20,000
r_E	# enrollment records	80,000
c_S	`Student` records/page	20
c_E	`Enrolled` records/page	40
b_S	# data pages in `Student`	1,000
b_E	# data pages in `Enrolled`	2,000

Also, in cost analyses below, N = number of memory buffers.

❖ Join Example (cont)

Out = Student ⋈ Enrolled relation statistics:

Sym	Meaning	Value
r_Out	# tuples in result	80,000
C_Out	result records/page	80
b_Out	# data pages in result	1,000

Notes:

r_Out ... one result tuple for each Enrolled tuple
C_Out ... result tuples have only subj and name
in analyses, ignore cost of writing result ... same in all methods

❖ Nested Loop Join

Basic strategy (R.a ⋈ S.b):

Result = {}
for each page i in R {
   pageR = getPage(R,i)
   for each page j in S {
      pageS = getPage(S,j)
      for each pair of tuples t_R,t_S
                       from pageR,pageS {
         if (t_R.a == t_S.b)
            Result = Result ∪ (t_R:t_S)
}  }  }

Needs input buffers for R and S, output buffer for "joined" tuples

Terminology: R is outer relation, S is inner relation

Cost = b_R + b_R . b_S ... ouch!

❖ Block Nested Loop Join

Method (for N memory buffers):

read (N-2)-page chunk of R into memory buffers
for each S page
- check join condition on all (t_R,t_S) pairs in buffers
repeat for all (N-2)-page chunks of R

❖ Block Nested Loop Join (cont)

Best-case scenario: b_R ≤ N-2

read b_R pages of relation R into buffers
while whole R is buffered, read b_S pages of S

Cost = b_R + b_S

Typical-case scenario: b_R > N-2

read ceil(b_R/(N-2)) chunks of pages from R
for each chunk, read b_S pages of S

Cost = b_R + b_S . ceil(b_R/(N-2))

Note: always requires r_R.r_S checks of the join condition
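
To see where the formula comes from, here is a small simulation of block nested loop join over in-memory arrays of join keys, counting page reads and join-condition checks. It is a sketch under simplifying assumptions: "pages" are just index ranges of cR or cS keys, and bnl_cost and bnl_join are illustrative names.

```c
#include <assert.h>

/* Integer ceiling of a/b for positive b. */
static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* Page reads predicted by the formula: b_R + b_S . ceil(b_R/(N-2)). */
long bnl_cost(long bR, long bS, long N)
{
    assert(N >= 3);   /* at least one R buffer, one S buffer, one output buffer */
    return bR + bS * ceil_div(bR, N - 2);
}

/* Simulated block nested loop join.
 * R has rR keys with cR per page, S has rS keys with cS per page.
 * Returns the number of result tuples; counts page reads and condition checks. */
long bnl_join(const int R[], long rR, long cR,
              const int S[], long rS, long cS,
              long N, long *reads, long *checks)
{
    assert(N >= 3 && cR > 0 && cS > 0);
    long chunk = (N - 2) * cR;      /* R tuples held in the N-2 chunk buffers */
    long nres = 0;
    *reads = 0;
    *checks = 0;
    for (long lo = 0; lo < rR; lo += chunk) {
        long hi = (lo + chunk < rR) ? lo + chunk : rR;
        *reads += ceil_div(hi - lo, cR);          /* read one chunk of R */
        for (long p = 0; p < rS; p += cS) {
            long pend = (p + cS < rS) ? p + cS : rS;
            *reads += 1;                          /* read one page of S */
            for (long i = lo; i < hi; i++)
                for (long j = p; j < pend; j++) {
                    (*checks)++;
                    if (R[i] == S[j])
                        nres++;
                }
        }
    }
    return nres;
}
```

For hypothetical sizes b_R = 10, b_S = 20, N = 5, bnl_cost gives 10 + 20 . ceil(10/3) = 90 page reads. The number of checks counted by bnl_join is always r_R . r_S, whatever N is.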

❖ Exercise: Nested Loop Join Cost

Compute the cost (# pages fetched) of (S ⋈ E)

Sym	Meaning	Value
r_S	# student records	20,000
r_E	# enrollment records	80,000
c_S	`Student` records/page	20
c_E	`Enrolled` records/page	40
b_S	# data pages in `Student`	1,000
b_E	# data pages in `Enrolled`	2,000

for N = 22, 202, 2002 and different inner/outer combinations

❖ Exercise: Nested Loop Join Cost (cont)

If the query in the above example was:

select j.code, j.title, s.name
from   Student s
       join Enrolled e on (s.id=e.stude)
       join Subject j on (e.subj=j.code)

how would this change the previous analysis?

What join combinations are there?

Assume 2000 subjects, with c_J = 10

How large would the intermediate tuples be? What assumptions?

Compute the cost (# pages fetched, # pages written) for N = 202

❖ BNJ in Practice

Why block nested loop join is actually useful in practice ...

Many queries have the form

select * from R,S where r.i=s.j and r.x=k

This would typically be evaluated as

Tmp = Sel[r.x=k](R)
Res = Tmp Join[i=j] S

If Tmp is small ⇒ may fit in memory (in small #buffers)

❖ Index Nested Loop Join

A problem with nested-loop join:

needs repeated scans of entire inner relation S

If there is an index on S, we can avoid such repeated whole-of-S scanning.

Consider Join[i=j](R,S):

for each tuple r in relation R {
    use index to select tuples
        from S where s.j = r.i
    for each selected tuple s from S {
        add (r,s) to result
}   }

❖ Index Nested Loop Join (cont)

This method requires:

one scan of R relation (b_R)
- only one buffer needed, since we use R tuple-at-a-time
for each tuple in R (r_R), one index lookup on S
- cost depends on type of index and number of results
- best case is when each R.i matches few S tuples

Cost = b_R + r_R.Sel_S (Sel_S is the cost of performing a select on S).

Typical Sel_S = 1-2 (hashing) .. b_q (unclustered index)

Trade-off: r_R.Sel_S vs b_R.b_S, where b_R ≪ r_R and Sel_S ≪ b_S
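
A minimal in-memory sketch of index nested loop join. The "index" here is simply a binary search over S kept sorted on j, which is an assumption of this example; a real system would use a B-tree or hash index. inl_cost, lower_bound and inl_join are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Cost = b_R + r_R . Sel_S, with Sel_S given as pages read per lookup. */
long inl_cost(long bR, long rR, long selS)
{
    assert(bR >= 0 && rR >= 0 && selS >= 0);
    return bR + rR * selS;
}

/* Stand-in "index" on S.j: S is kept sorted on j, and a binary search
 * finds the first position whose key is >= the search key. */
static size_t lower_bound(const int S[], size_t nS, int key)
{
    size_t lo = 0, hi = nS;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (S[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* For each R tuple, do one index lookup on S and count all matches.
 * Returns the number of (r,s) pairs; *lookups counts index probes. */
long inl_join(const int R[], size_t nR, const int S[], size_t nS, long *lookups)
{
    long nres = 0;
    *lookups = 0;
    for (size_t i = 0; i < nR; i++) {
        (*lookups)++;
        for (size_t j = lower_bound(S, nS, R[i]); j < nS && S[j] == R[i]; j++)
            nres++;
    }
    return nres;
}
```

The number of probes is always r_R, so the total cost is driven by Sel_S. For hypothetical values b_R = 10, r_R = 200, Sel_S = 2, inl_cost gives 410.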

❖ Exercise: Index Nested Loop Join Cost

Consider executing Join[i=j](S,T) with the following parameters:

r_S = 1000, b_S = 50, r_T = 3000, b_T = 600
S.i is primary key, and T has index on T.j
T is sorted on T.j, each S tuple joins with 2 T tuples
DBMS has N = 12 buffers available for the join

Calculate the costs for evaluating the above join

using block nested loop join
using index nested loop join

Cost_r = # pages read and Cost_j = # join-condition checks

❖ Sort-Merge Join

Basic approach:

sort both relations on join attribute (reminder: Join [i=j] (R,S))
scan together using merge to form result (r,s) tuples

Advantages:

no need to deal with "entire" S relation for each r tuple
deal with runs of matching R and S tuples

Disadvantages:

cost of sorting both relations (already sorted on join key?)
some rescanning required when long runs of S tuples

❖ Sort-Merge Join (cont)

Method requires several cursors to scan sorted relations:

r = current record in R relation
s = start of current run in S relation
ss = current record in current run in S relation

❖ Sort-Merge Join (cont)

Algorithm using query iterators/scanners:

Query ri, si;  Tuple r,s;

ri = startScan("SortedR");
si = startScan("SortedS");
// fetch first tuples once; fetching in the loop test would skip tuples
r = nextTuple(ri);
s = nextTuple(si);
while (r != NULL && s != NULL) {
    // align cursors to start of next common run
    while (r != NULL && r.i < s.j)
           r = nextTuple(ri);
    if (r == NULL) break;
    while (s != NULL && r.i > s.j)
           s = nextTuple(si);
    if (s == NULL) break;
    // a common run starts here if (r.i == s.j)
...

❖ Sort-Merge Join (cont)

...
    // remember start of current run in S
    TupleID startRun = scanCurrent(si);
    // scan common run, generating result tuples
    while (r != NULL && r.i == s.j) {
        while (s != NULL && s.j == r.i) {
            addTuple(outbuf, combine(r,s));
            if (isFull(outbuf)) {
                writePage(outf, outp++, outbuf);
                clearBuf(outbuf);
            }
            s = nextTuple(si);
        }
        r = nextTuple(ri);
        // rewind S to start of run for the next R tuple
        // (assumes setScan makes nextTuple return the first tuple of the run)
        setScan(si, startRun);
        s = nextTuple(si);
    }
}
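
For comparison, a compact runnable version of the merge phase over two in-memory arrays of join keys, both assumed to be already sorted ascending. Array indexes play the role of the cursors r, s and ss; merge_join is an illustrative name, and results are only counted rather than written to an output buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Merge phase of sort-merge join on sorted key arrays R and S.
 * Returns the number of (r,s) pairs with equal keys. */
long merge_join(const int R[], size_t nR, const int S[], size_t nS)
{
    size_t r = 0;        /* current record in R */
    size_t s = 0;        /* start of current run in S */
    long nres = 0;
    while (r < nR && s < nS) {
        /* align cursors to the start of the next common run */
        if (R[r] < S[s]) { r++; continue; }
        if (R[r] > S[s]) { s++; continue; }
        /* here R[r] == S[s]; s marks the start of the run in S */
        int key = S[s];
        size_t ss = s;   /* current record within the run in S */
        while (r < nR && R[r] == key) {
            for (ss = s; ss < nS && S[ss] == key; ss++)
                nres++;              /* would output combine(R[r], S[ss]) */
            r++;                     /* next R tuple rescans the run from s */
        }
        s = ss;          /* run exhausted: continue after it */
    }
    return nres;
}
```

Each R tuple in a run rescans the matching run of S from its start, which is why long runs in S that do not fit in the buffers cost extra page reads.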

❖ Sort-Merge Join (cont)

Buffer requirements:

for sort phase:
- as many as possible (remembering that sort cost is O(b.log_N b))
- if insufficient buffers, sorting cost can dominate
for merge phase:
- one output buffer for result
- one input buffer for relation R
- (preferably) enough buffers for longest run in S

❖ Sort-Merge Join (cont)

Cost of sort-merge join.

Step 1: sort each relation (if not already sorted):

Cost = 2.b_R.(1 + ceil(log_{N-1}(b_R/N))) + 2.b_S.(1 + ceil(log_{N-1}(b_S/N)))
(where N = number of memory buffers)

Step 2: merge sorted relations:

if every run of values in S fits completely in buffers,
merge requires single scan, Cost = b_R + b_S
if some runs of values in S are larger than buffers,
need to re-scan run for each corresponding value from R
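
A sketch of the two cost steps in integer arithmetic. It assumes the number of merge passes is rounded up to a whole number, and covers only the case where every run in S fits in the buffers. The function names are made up for illustration.

```c
#include <assert.h>

/* Integer ceiling of a/b for positive b. */
static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* Number of merge passes: ceil(log_{N-1}(ceil(b/N))), computed with integers. */
long merge_passes(long b, long N)
{
    assert(b >= 1 && N >= 3);
    long runs = ceil_div(b, N);     /* sorted runs left after pass 0 */
    long passes = 0;
    while (runs > 1) {
        runs = ceil_div(runs, N - 1);
        passes++;
    }
    return passes;
}

/* External merge sort cost: 2.b.(1 + merge passes). */
long sort_cost(long b, long N)
{
    return 2 * b * (1 + merge_passes(b, N));
}

/* Sort-merge join cost when every run in S fits in the buffers.
 * r_sorted / s_sorted are non-zero if that relation is already sorted. */
long smj_cost(long bR, long bS, long N, int r_sorted, int s_sorted)
{
    long cost = bR + bS;            /* merge phase: single scan of each */
    if (!r_sorted) cost += sort_cost(bR, N);
    if (!s_sorted) cost += sort_cost(bS, N);
    return cost;
}
```

For example, sort_cost(1000, 32) is 2 × 1000 × (1 + 2) = 6000, since 1000 pages give 32 initial runs and two 31-way merge passes are needed to reduce them to one.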

❖ Sort-Merge Join on Example

Case 1: Join[id=stude](Student,Enrolled)

relations are not sorted on id#
memory buffers N=32; all runs are of length < 30

Cost	=	sort(S) + sort(E) + b_S + b_E
	=	2b_S(1+ceil(log₃₁(b_S/32))) + 2b_E(1+ceil(log₃₁(b_E/32))) + b_S + b_E
	=	2×1000×(1+2) + 2×2000×(1+2) + 1000 + 2000
	=	6000 + 12000 + 1000 + 2000
	=	21,000

❖ Sort-Merge Join on Example (cont)

Case 2: Join[id=stude](Student,Enrolled)

Student and Enrolled already sorted on id#
memory buffers N=4 (S input, 2 × E input, output)
5% of the "runs" in E span two pages
there are no "runs" in S, since id# is a primary key

For the above, no re-scans of E runs are ever needed

Cost = 2,000 + 1,000 = 3,000 (regardless of which relation is outer)

❖ Exercise: Sort-merge Join Cost

Consider executing Join[i=j](S,T) with the following parameters:

r_S = 1000, b_S = 50, r_T = 3000, b_T = 150
S.i is primary key, and T has index on T.j
T is sorted on T.j, each S tuple joins with 2 T tuples
DBMS has N = 42 buffers available for the join

Calculate the cost for evaluating the above join

using sort-merge join
compute #pages read/written
compute #join-condition checks performed