COMP9315 Week 07 Monday Lecture

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [0/50]
❖ Things To Note


COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [1/50]
❖ Assignment 2

An implementation of multi-attribute linear hashing

Individual relations are built as MALH files
Does NOT involve PostgreSQL
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [2/50]
❖ Assignment 2 (cont)

Implementation involves three main commands

plus three utility commands Each of the above is a C program containing a main() function.
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [3/50]
❖ Assignment 2 (cont)

The commands are supported by a set of ADTs:

Each ADT has a .h file (interface) and a .c file (implementation)
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [4/50]
❖ Assignment 2 (cont)

As supplied, all code compiles

The three utility commands are all complete

Your task: complete the ADTs so that all commands work.

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [5/50]
❖ Assignment 2 (cont)

What to change:

What to submit: Do NOT change create.c, insert.c, select.c
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [6/50]
❖ Signature-based Selection

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [7/50]
❖ Indexing with Signatures

Signature-based indexing:

Each tuple is associated with a signature

Instead of scanning/testing tuples, do pre-filtering via signatures.
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [8/50]
❖ Indexing with Signatures (cont)

File organisation for signature indexing (two files)

[Diagram:Pics/select/sigfile1.png]

One signature slot per tuple slot; unused signature slots are zeroed.

Signatures do not determine record placement ⇒ can use with other indexing.

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [9/50]
❖ Signatures

A signature "summarises" the data in one tuple

A tuple consists of N attribute values A1 .. An

A codeword cw(Ai) is

A tuple descriptor (signature) is built by combining cw(Ai), i=1..n
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [10/50]
❖ Generating Codewords

Generating a k-in-m codeword for attribute Ai

bits codeword(char *attr_value, int m, int k)
{
   int  nbits = 0;   // count of set bits
   bits cword = 0;   // assuming m <= 32 bits
   srandom(hash(attr_value));
   while (nbits < k) {
      int i = random() % m;
      if (((1 << i) & cword) == 0) {
         cword |= (1 << i);
         nbits++;
      }
   }
   return cword;  // m-bits with k 1-bits and m-k 0-bits
}

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [11/50]
❖ Superimposed Codewords (SIMC)

In a superimposed codewords (simc) indexing scheme

A tuple descriptor desc(r) is Method (assuming all n attributes are used in descriptor):

bits desc = 0 
for (i = 1; i <= n; i++) {
   bits cw = codeword(A[i])
   desc = desc | cw
}

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [12/50]
❖ SIMC Example

Consider the following tuple (from bank deposit database)

Branch AcctNo Name Amount
Perryridge 102 Hayes 400

It has the following codewords/descriptor (for m = 12,   k = 2 )

Ai cw(Ai)
Perryridge 010000000001
102 000000000011
Hayes 000001000100
400 000010000100
desc(r) 010011000111
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [13/50]
❖ SIMC Queries

To answer query q in SIMC

desc(q) is formed by OR of codewords for known attributes.

E.g. consider the query (Perryridge, ?, ?, ?).

Ai cw(Ai)
Perryridge 010000000001
? 000000000000
? 000000000000
? 000000000000
desc(q) 010000000001
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [14/50]
❖ SIMC Queries (cont)

Once we have a query descriptor, we search the signature file:

pagesToCheck = {}
for each descriptor D[i] in signature file {
    if (matches(D[i],desc(q))) {
        pid = pageOf(tupleID(i))
        pagesToCheck = pagesToCheck ∪ pid
    }
}
for each P in pagesToCheck {
    Buf = getPage(f,P)
    check tuples in Buf for answers
}
// where ...
#define matches(rdesc,qdesc)
               ((rdesc & qdesc) == qdesc)

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [15/50]
❖ Example SIMC Query

Consider the query and the example database:

Signature Deposit Record
010000000001 (Perryridge,?,?,?)
100101001001 (Brighton,217,Green,750)
010011000111 (Perryridge,102,Hayes,400)
101001001001 (Downtown,101,Johnshon,512)
101100000011 (Mianus,215,Smith,700)
010101010101 (Clearview,117,Throggs,295)
100101010011 (Redwood,222,Lindsay,695)

Gives two matches:  one true match,  one false match.

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [16/50]
❖ SIMC Parameters

False match probablity pF  =  likelihood of a false match

How to reduce likelihood of false matches?


Larger m means reading more descriptor data.

Having k too high    increased overlapping.
Having k too low    increased hash collisions.

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [17/50]
❖ SIMC Parameters (cont)

How to determine "optimal" m and k?

  1. start by choosing acceptable pF
      (e.g. pF ≤ 10-5 i.e. one false match in 10,000)
  2. then choose m and k to achieve no more than this pF.
Formulae to derive m and k given pF and n:

k  =  1/loge2 . loge ( 1/pF )

m  =  ( 1/loge 2 )2 . n . loge ( 1/pF )

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [18/50]
❖ Query Cost for SIMC

Cost to answer pmr query: Costpmr = bD + bq

bD = ceil( r/cD )  and  cD = floor(B/ceil(m/8))

E.g. m=64,   B=8192,   r=104    ⇒    cD = 1024,   bD=10

bq includes pages with rq matching tuples and rF false matches

Expected false matches = rF  =  (r - rq).pF  ≅  r.pF   if rq ≪ r

E.g. Worst bq = rq+rF,   Best bq = 1,   Avg bq = ceil(b(rq+rF)/r)

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [19/50]
❖ Exercise: SIMC Query Cost

Consider a SIMC-indexed database with the following properties

Calculate the total number of pages read in answering the query.
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [20/50]
❖ Page-level SIMC

SIMC has one descriptor per tuple ... potentially inefficient.

Alternative approach: one descriptor for each data page.

Every attribute of every tuple in page contributes to descriptor.

Size of page descriptor (PD) (clearly larger than tuple descriptor):

E.g. n = 4, c = 128, pF = 10-3   ⇒   m ≅ 7000bits ≅ 900bytes

Typically, pages are 1..8KB    1..9 PD/page (NPD).

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [21/50]
❖ Page-Level SIMC Files

File organisation for page-level superimposed codeword index

[Diagram:Pics/select/simc-2level.png]

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [22/50]
❖ Exercise: Page-level SIMC Query Cost

Consider a SIMC-indexed database with the following properties

Calculate the total number of pages read in answering the query.
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [23/50]
❖ Bit-sliced SIMC

Improvement: store b m-bit page descriptors as m b-bit "bit-slices"


[Diagram:Pics/select/bit-sliced.png]

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [24/50]
❖ Bit-sliced SIMC (cont)

At query time

matches = ~0  //all ones
for each bit i set to 1 in desc(q) {
   slice = fetch bit-slice i
   matches = matches & slice
}
for each bit i set to 1 in matches {
   fetch page i
   scan page for matching records
}


Effective because desc(q) typically has less than half bits set to 1

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [25/50]
❖ Exercise: Bit-sliced SIMC Query Cost

Consider a SIMC-indexed database with the following properties

Calculate the total number of pages read in answering the query.
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [26/50]
❖ Implementing Join

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [27/50]
❖ Join

DBMSs are engines to store, combine and filter information.

Join () is the primary means of combining information.

Join is important and potentially expensive

Most common join condition: equijoin, e.g. (R.pk = S.fk)

Join varieties (natural, inner, outer, semi, anti) all behave similarly.

We consider three strategies for implementing join

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [28/50]
❖ Join Example

Consider a university database with the schema:

create table Student(
   id     integer primary key,
   name   text,  ...
);
create table Enrolled(
   stude  integer references Student(id),
   subj   text references Subject(code),  ...
);
create table Subject(
   code   text primary key,
   title  text,  ...
);

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [29/50]
❖ Join Example (cont)

List names of students in all subjects, arranged by subject.

SQL query to provide this information:

select E.subj, S.name
from   Student S, Enrolled E
where  S.id = E.stude
order  by E.subj, S.name;

And its relational algebra equivalent:

Sort[subj] ( Project[subj,name] ( Join[id=stude](Student,Enrolled) ) )


To simplify formulae, we denote Student by S and Enrolled by E
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [30/50]
❖ Join Example (cont)

Some database statistics:

Sym Meaning Value
rS # student records 20,000
rE # enrollment records 80,000
cS Student records/page 20
cE Enrolled records/page 40
bS # data pages in Student 1,000
bE # data pages in Enrolled 2,000

Also, in cost analyses below, N = number of memory buffers.

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [31/50]
❖ Join Example (cont)

Out = Student ⋈ Enrolled relation statistics:

Sym Meaning Value
rOut # tuples in result 80,000
COut result records/page 80
bOut # data pages in result 1,000

Notes:

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [32/50]
❖ Nested Loop Join

Basic strategy (R.a ⋈ S.b):

Result = {}
for each page i in R {
   pageR = getPage(R,i)
   for each page j in S {
      pageS = getPage(S,j)
      for each pair of tuples tR,tS
                       from pageR,pageS {
         if (tR.a == tS.b)
            Result = Result ∪ (tR:tS)
}  }  }

Needs input buffers for R and S, output buffer for "joined" tuples

Terminology: R is outer relation, S is inner relation

Cost = bR . bS   ...   ouch!

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [33/50]
❖ Block Nested Loop Join

Method (for N memory buffers):

[Diagram:Pics/join/blk-nested-loop.png]

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [34/50]
❖ Block Nested Loop Join (cont)

Best-case scenario: bR ≤ N-2

Cost   =   bR + bS

Typical-case scenario: bR > N-2

Cost   =   bR + bS . ceil(bR/N-2)

Note: always requires rR.rS checks of the join condition

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [35/50]
❖ Exercise: Nested Loop Join Cost

Compute the cost (# pages fetched) of (S ⋈ E)

Sym Meaning Value
rS # student records 20,000
rE # enrollment records 80,000
cS Student records/page 20
cE Enrolled records/page 40
bS # data pages in Student 1,000
bE # data pages in Enrolled 2,000

for N = 22, 202, 2002 and different inner/outer combinations

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [36/50]
❖ Exercise: Nested Loop Join Cost (cont)

If the query in the above example was:

select j.code, j.title, s.name
from   Student s
       join Enrolled e on (s.id=e.student)
       join Subject j on (e.subj=j.code)

how would this change the previous analysis?

What join combinations are there?

Assume 2000 subjects, with cJ = 10

How large would the intermediate tuples be? What assumptions?

Compute the cost (# pages fetched, # pages written) for N = 202

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [37/50]
❖ BNJ in Practice

Why block nested loop join is actually useful in practice ...

Many queries have the form

select * from R,S where r.i=s.j and r.x=k

This would typically be evaluated as

Tmp = Sel[r.x=k](R)
Res = Tmp Join[i=j] S

If Tmp is small ⇒ may fit in memory (in small #buffers)

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [38/50]
❖ Index Nested Loop Join

A problem with nested-loop join:

If there is an index on S, we can avoid such repeated whole-of-S  scanning.

Consider Join[i=j](R,S):

for each tuple r in relation R {
    use index to select tuples
        from S where s.j = r.i
    for each selected tuple s from S {
        add (r,s) to result
}   }

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [39/50]
❖ Index Nested Loop Join (cont)

This method requires:

Cost   =   bR + rR.SelS    (SelS is the cost of performing a select on S).

Typical SelS  =  1-2 (hashing) .. bq (unclustered index)

Trade-off:   rR.SelS  vs  bR.bS,   where   bR ≪ rR and SelS ≪ bS

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [40/50]
❖ Exercise: Index Nested Loop Join Cost

Consider executing Join[i=j](S,T) with the following parameters:

Calculate the costs for evaluating the above join Costr = # pages read   and   Costj = # join-condition checks
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [41/50]
❖ Sort-Merge Join

Basic approach:

Advantages: Disadvantages:
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [42/50]
❖ Sort-Merge Join (cont)

Method requires several cursors to scan sorted relations:

[Diagram:Pics/join/sort-merge.png]

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [43/50]
❖ Sort-Merge Join (cont)

Algorithm using query iterators/scanners:

Query ri, si;  Tuple r,s;

ri = startScan("SortedR");
si = startScan("SortedS");
while ((r = nextTuple(ri)) != NULL
       && (s = nextTuple(si)) != NULL) {
    // align cursors to start of next common run
    while (r != NULL && r.i < s.j)
           r = nextTuple(ri);
    if (r == NULL) break;
    while (s != NULL && r.i > s.j)
           s = nextTuple(si);
    if (s == NULL) break;
    // must have (r.i == s.j) here
...

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [44/50]
❖ Sort-Merge Join (cont)

...
    // remember start of current run in S
    TupleID startRun = scanCurrent(si)
    // scan common run, generating result tuples
    while (r != NULL && r.i == s.j) {
        while (s != NULL and s.j == r.i) {
            addTuple(outbuf, combine(r,s));
            if (isFull(outbuf)) {
                writePage(outf, outp++, outbuf);
                clearBuf(outbuf);
            }
            s = nextTuple(si);
        }
        r = nextTuple(ri);
        setScan(si, startRun);
    }
}

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [45/50]
❖ Sort-Merge Join (cont)

Buffer requirements:

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [46/50]
❖ Sort-Merge Join (cont)

Cost of sort-merge join.

Step 1: sort each relation   (if not already sorted):

Step 2: merge sorted relations:
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [47/50]
❖ Sort-Merge Join on Example

Case 1:   Join[id=stude](Student,Enrolled)

 
Cost = sort(S) + sort(E) + bS + bE
= 2bS(1+log31(bS/32)) + 2bE(1+log31(bE/32)) + bS + bE
= 2×1000×(1+2) + 2×2000×(1+2) + 1000 + 2000
= 6000 + 12000 + 1000 + 2000
= 21,000
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [48/50]
❖ Sort-Merge Join on Example (cont)

Case 2:   Join[id=stude](Student,Enrolled)

For the above, no re-scans of E  runs are ever needed

Cost  =  2,000 + 1,000  =  3,000   (regardless of which relation is outer)

COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [49/50]
❖ Exercise: Sort-merge Join Cost

Consider executing Join[i=j](S,T) with the following parameters:

Calculate the cost for evaluating the above join
COMP9315 24T1 ♢ Week 7 Monday Lecture ♢ [50/50]


Produced: 25 Mar 2024