COMP9315 24T1 ♢ Lectures Part F ♢ [0/35]
DBMSs are engines to store, combine and filter information.
Join (⋈) is the primary means of combining information.
Join is important and potentially expensive
Most common join condition: equijoin, e.g. (R.pk = S.fk)
Join varieties (natural, inner, outer, semi, anti) all behave similarly.
We consider three strategies for implementing join
- nested loop ... simple, widely applicable, inefficient without buffering
- sort-merge ... works best if tables are sorted on join attributes
- hash-based ... requires good hash function and sufficient buffering
Consider a university database with the schema:
create table Student(
    id    integer primary key,
    name  text, ...
);
create table Enrolled(
    stude integer references Student(id),
    subj  text references Subject(code), ...
);
create table Subject(
    code  text primary key,
    title text, ...
);
List names of students in all subjects, arranged by subject.
SQL query to provide this information:
select E.subj, S.name
from Student S, Enrolled E
where S.id = E.stude
order by E.subj, S.name;
And its relational algebra equivalent:
Sort[subj] ( Project[subj,name] ( Join[id=stude](Student,Enrolled) ) )
To simplify formulae, we denote Student by S and Enrolled by E.
Some database statistics:
Sym | Meaning                  | Value
----+--------------------------+-------
rS  | # student records        | 20,000
rE  | # enrollment records     | 80,000
cS  | Student records/page     | 20
cE  | Enrolled records/page    | 40
bS  | # data pages in Student  | 1,000
bE  | # data pages in Enrolled | 2,000
Also, in cost analyses below, N = number of memory buffers.
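As a sanity check, the page counts follow directly from the record counts and records-per-page figures; a quick sketch (not part of the original slides):

```python
import math

# statistics from the table above
r_S, r_E = 20_000, 80_000   # record counts for Student, Enrolled
c_S, c_E = 20, 40           # records per page

# pages needed = ceil(records / records-per-page)
b_S = math.ceil(r_S / c_S)
b_E = math.ceil(r_E / c_E)

print(b_S, b_E)   # 1000 2000
```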
Out = Student ⋈ Enrolled relation statistics:

Sym  | Meaning                | Value
-----+------------------------+-------
rOut | # tuples in result     | 80,000
cOut | result records/page    | 80
bOut | # data pages in result | 1,000
Notes:
- rOut ... one result tuple for each Enrolled tuple
- cOut ... result tuples have only subj and name
- in analyses, ignore cost of writing result ... same in all methods
Basic strategy (R.a ⋈ S.b):
Result = {}
for each page i in R {
    pageR = getPage(R,i)
    for each page j in S {
        pageS = getPage(S,j)
        for each pair of tuples tR,tS from pageR,pageS {
            if (tR.a == tS.b)
                Result = Result ∪ (tR:tS)
        }
    }
}
Needs input buffers for R and S, output buffer for "joined" tuples
Terminology: R is outer relation, S is inner relation
Cost = bR.bS ... ouch!   (strictly bR + bR.bS page reads, but the bR.bS term dominates)
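Plugging the example statistics into this cost (a quick check, not from the slides):

```python
b_R, b_S = 1_000, 2_000   # pages in Student, Enrolled (from the stats slide)

# every page of S is read once for every page of R
cost = b_R * b_S
print(cost)   # 2000000 page reads
```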
Method (for N memory buffers):
- read N-2-page chunk of R into memory buffers
- for each S page, check join condition on all (tR,tS) pairs in buffers
- repeat for all N-2-page chunks of R
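The method above can be sketched in memory as follows, treating lists of tuples as "pages"; the function name and chunking logic are illustrative, not PostgreSQL code:

```python
def block_nested_loop_join(R_pages, S_pages, N, join_cond):
    """Join R with S, holding (N-2)-page chunks of R in memory at a time."""
    chunk = N - 2                      # buffers left after S input + output
    result = []
    for i in range(0, len(R_pages), chunk):
        # "read" an (N-2)-page chunk of R into memory buffers
        r_tuples = [t for page in R_pages[i:i+chunk] for t in page]
        for s_page in S_pages:         # one full scan of S per chunk
            for s in s_page:
                for r in r_tuples:
                    if join_cond(r, s):
                        result.append((r, s))
    return result

# tiny example: join on equality with N=3 buffers (chunk size 1 page)
print(block_nested_loop_join([[1, 2], [3, 4]], [[2, 3], [5]], 3,
                             lambda r, s: r == s))   # [(2, 2), (3, 3)]
```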
❖ Block Nested Loop Join (cont)
Best-case scenario: bR ≤ N-2
- read bR pages of relation R into buffers
- while whole R is buffered, read bS pages of S
Cost = bR + bS

Typical-case scenario: bR > N-2
- read ceil(bR/(N-2)) chunks of pages from R
- for each chunk, read bS pages of S
Cost = bR + bS.ceil(bR/(N-2))

Note: always requires rR.rS checks of the join condition
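For the example tables, assuming (say) N = 34 buffers (an illustrative value, not from the stats slide):

```python
import math

b_R, b_S, N = 1_000, 2_000, 34      # N = 34 is an assumed buffer count

chunks = math.ceil(b_R / (N - 2))   # 32 chunks of R
cost = b_R + b_S * chunks           # 1000 + 32*2000
print(chunks, cost)   # 32 65000
```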
Why block nested loop join is actually useful in practice ...
Many queries have the form
select * from R,S where r.i=s.j and r.x=k
This would typically be evaluated as
Tmp = Sel[r.x=k](R)
Res = Tmp Join[i=j] S
If Tmp is small ⇒ may fit in memory (in a small number of buffers)
A problem with nested-loop join:
- needs repeated scans of entire inner relation S
If there is an index on S, we can avoid such repeated whole-of-S scanning.
Consider Join[i=j](R,S):
for each tuple r in relation R {
    use index to select tuples from S where s.j = r.i
    for each selected tuple s from S {
        add (r,s) to result
    }
}
❖ Index Nested Loop Join (cont)
This method requires:
- one scan of R relation (bR)
- only one buffer needed, since we use R tuple-at-a-time
- for each tuple in R (rR), one index lookup on S
- cost depends on type of index and number of results
- best case is when each R.i matches few S tuples
Cost = bR + rR.SelS   (SelS is the cost of performing a select on S)
Typical SelS = 1-2 (hashing) .. bq (unclustered index)
Trade-off: rR.SelS vs bR.bS, where bR ≪ rR and SelS ≪ bS
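Plugging the example statistics into this formula, with Student as the outer relation; the SelS values are illustrative:

```python
b_R, r_R = 1_000, 20_000   # Student pages and records (outer relation)

# SelS = cost of one index-based select on Enrolled (illustrative values)
for Sel_S in (1, 2):
    print(Sel_S, b_R + r_R * Sel_S)   # 21000 and 41000, far below bR.bS
```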
Basic approach:
- sort both relations on join attribute (reminder: Join[i=j](R,S))
- scan together using merge to form result (r,s) tuples
Advantages:
- no need to deal with "entire" S relation for each r tuple
- deal with runs of matching R and S tuples
Disadvantages:
- cost of sorting both relations
(already sorted on join key?)
- some rescanning required when there are long runs of matching S tuples
Method requires several cursors to scan sorted relations:
- r  = current record in R relation
- s  = current record in current run in S relation
- ss = start of current run in S relation
Algorithm using query iterators/scanners:
Query ri, si;  Tuple r, s;

ri = startScan("SortedR");
si = startScan("SortedS");
while ((r = nextTuple(ri)) != NULL
       && (s = nextTuple(si)) != NULL) {
    // advance r while it is behind current s
    while (r != NULL && r.i < s.j)
        r = nextTuple(ri);
    if (r == NULL) break;
    // advance s while it is behind current r
    while (s != NULL && r.i > s.j)
        s = nextTuple(si);
    if (s == NULL) break;
    ...
    ...
    TupleID startRun = scanCurrent(si);   // remember start of run in S
    while (r != NULL && r.i == s.j) {
        while (s != NULL && s.j == r.i) {
            addTuple(outbuf, combine(r,s));
            if (isFull(outbuf)) {
                writePage(outf, outp++, outbuf);
                clearBuf(outbuf);
            }
            s = nextTuple(si);
        }
        r = nextTuple(ri);
        setScan(si, startRun);   // rewind S to start of run for next r
    }
}
Buffer requirements:
- for sort phase:
- as many as possible (sorting needs roughly 1 + logN-1(b/N) passes, so more buffers ⇒ fewer passes)
- if insufficient buffers, sorting cost can dominate
- for merge phase:
- one output buffer for result
- one input buffer for relation R
- (preferably) enough buffers for longest run in S
Cost of sort-merge join.
Step 1: sort each relation (if not already sorted):
- Cost = 2.bR(1 + logN-1(bR/N)) + 2.bS(1 + logN-1(bS/N))
  (where N = number of memory buffers)
Step 2: merge sorted relations:
- if every run of values in S fits completely in buffers,
merge requires single scan,
Cost = bR + bS
- if some runs of values in S are larger than the buffers,
  need to re-scan the run for each corresponding value from R
❖ Sort-Merge Join on Example
Case 1: Join[id=stude](Student,Enrolled)
- relations are not sorted on id#
- memory buffers N=32; all runs are of length < 30
Cost = sort(S) + sort(E) + bS + bE
     = 2bS(1+log31(bS/32)) + 2bE(1+log31(bE/32)) + bS + bE
     = 2×1000×(1+2) + 2×2000×(1+2) + 1000 + 2000
     = 6000 + 12000 + 1000 + 2000
     = 21,000
❖ Sort-Merge Join on Example (cont)
Case 2: Join[id=stude](Student,Enrolled)
- Student and Enrolled already sorted on id#
- memory buffers N=4 (S input, 2 × E input, output)
- 5% of the "runs" in E span two pages
- there are no "runs" in S, since id# is a primary key
For the above, no re-scans of E runs are ever needed.
Cost = 2,000 + 1,000 = 3,000 (regardless of which relation is outer)
Basic idea:
- use hashing as a technique to partition relations
- to avoid having to consider all pairs of tuples
Requires sufficient memory buffers
- to hold substantial portions of partitions
- (preferably) to hold largest partition of outer relation
Other issues:
- works only for equijoin R.i=S.j (but this is a common case)
- susceptible to data skew (or poor hash function)

Variations: simple, grace, hybrid.
Basic approach:
- hash part of outer relation R into memory buffers (build)
- scan inner relation S, using hash to search (probe)
- if R.i=S.j, then h(R.i)=h(S.j) (hash to same buffer)
- only need to check one memory buffer for each S tuple
- repeat until whole of R has been processed
No overflows allowed in in-memory hash table
- works best with uniform hash function
- can be adversely affected by data/hash skew
❖ Simple Hash Join (cont)
Data flow:
❖ Simple Hash Join (cont)
Algorithm for simple hash join Join[R.i=S.j](R,S):
for each tuple r in relation R {
    if (buffer[h(R.i)] is full) {
        for each tuple s in relation S {
            for each tuple rr in buffer[h(S.j)] {
                if ((rr,s) satisfies join condition)
                    add (rr,s) to result
            }
        }
        clear all hash table buffers
    }
    insert r into buffer[h(R.i)]
}
Best case: # join tests ≤ rS.cR   (cf. nested-loop rS.rR)
❖ Simple Hash Join (cont)
Cost for simple hash join ...
Best case: all tuples of R fit in the hash table
- Cost = bR + bS
- same page reads as block nested loop, but fewer join tests

Good case: refill hash table m times (where m ≥ ceil(bR/(N-2)))
- Cost = bR + m.bS
- more page reads than block nested loop, but fewer join tests
Worst case: everything hashes to same page
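Plugging the example statistics into these formulas; the refill count m here is an assumed value, not from the slides:

```python
b_R, b_S = 1_000, 2_000

best = b_R + b_S          # all of R fits in the hash table
m = 4                     # assumed number of hash-table refills
good = b_R + m * b_S
print(best, good)   # 3000 9000
```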
Basic approach (for R ⋈ S ):
- partition both relations on join attribute using hashing (h1)
- load each partition of R into N-buffer hash table (h2)
- scan through corresponding partition of S to form results
- repeat until all partitions exhausted
For best-case cost (O(bR + bS)):
- need ≥ √bR buffers to hold largest partition of outer relation

If < √bR buffers or poor hash distribution:
- need to scan some partitions of S multiple times
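A toy in-memory sketch of the partition and probe phases described above (purely illustrative: a real implementation writes partitions to disk, and the exact-key dictionary here plays the role of the second hash function h2):

```python
from collections import defaultdict

def grace_hash_join(R, S, key_r, key_s, n_parts=4):
    """Partition both inputs with h1, then join matching partitions."""
    h1 = lambda v: hash(v) % n_parts
    parts_R, parts_S = defaultdict(list), defaultdict(list)
    for r in R:                                  # partition phase for R
        parts_R[h1(key_r(r))].append(r)
    for s in S:                                  # partition phase for S
        parts_S[h1(key_s(s))].append(s)
    result = []
    for p in range(n_parts):                     # probe phase, per partition
        table = defaultdict(list)                # in-memory table (h2's role)
        for r in parts_R[p]:
            table[key_r(r)].append(r)
        for s in parts_S[p]:
            for r in table[key_s(s)]:
                result.append((r, s))
    return result

# tiny example joining on the first field of each tuple
R = [(1, 'a'), (2, 'b')]
S = [(1, 'x'), (1, 'y'), (3, 'z')]
print(grace_hash_join(R, S, lambda t: t[0], lambda t: t[0]))
```

Only tuples landing in the same partition are ever compared, which is why all pairs with equal join keys are still found.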
Partition phase (applied to both R and S):
Probe/join phase:
The second hash function (h2) simply speeds up the matching process.
Without it, would need to scan entire R partition for each record in S partition.
Cost of grace hash join:
- #pages in all partition files of Rel ≅ bRel (maybe slightly more)
- partition relation R ... Cost = bR.Tr + bR.Tw = 2bR
- partition relation S ... Cost = bS.Tr + bS.Tw = 2bS
- probe/join requires one scan of each (partitioned) relation
  Cost = bR + bS
- all hashing and comparison occurs in memory ⇒ ≅ 0 cost

Total Cost = 2bR + 2bS + bR + bS = 3(bR + bS)
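With the example statistics this total works out to:

```python
b_R, b_S = 1_000, 2_000

# 2bR + 2bS to partition, plus bR + bS to probe/join
cost = 3 * (b_R + b_S)
print(cost)   # 9000
```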
A variant of grace join if we have √bR < N < bR+2
- create k≪N partitions, m in memory, k-m on disk
- buffers: 1 input, k-m output, p = N-(k-m)-1 for in-memory partitions
When we come to scan and partition the S relation:
- any tuple with hash in range 0..m-1 can be resolved immediately
- other tuples are written to one of the k-m partition files for S

Final phase is the same as in grace join, but with only k-m partitions.
Comparison:
- grace hash join creates N-1 partitions on disk
- hybrid hash join creates m (memory) + k (disk) partitions
❖ Hybrid Hash Join (cont)
First phase of hybrid hash join with m=1 (partitioning R):
❖ Hybrid Hash Join (cont)
Next phase of hybrid hash join with m=1 (partitioning S):
❖ Hybrid Hash Join (cont)
Final phase of hybrid hash join with m=1 (finishing join):
❖ Hybrid Hash Join (cont)
Some observations:
- with k partitions, each partition has expected size bR/k
- holding m partitions in memory needs ceil(mbR/k) buffers
- trade-off between in-memory partition space and #partitions
Best-cost scenario:
- m = 1, k ≅ ceil(bR/N) (satisfying above constraint)
Other notes:
- if N = bR+2, using block nested loop join is simpler
- cost depends on N (but less than grace hash join)
No single join algorithm is superior in any overall sense.
Which algorithm is best for a given query depends on:
- sizes of relations being joined, size of buffer pool
- any indexing on relations, whether relations are sorted
- which attributes and operations are used in the query
- number of tuples in S matching each tuple in R
- distribution of data values (uniform, skew, ...)
Choosing the "best" join algorithm is critical because the
cost difference between best and worst case can be very large.
E.g. Join[id=stude](Student,Enrolled): 3,000 ... 2,000,000
Join implementations are under: src/backend/executor
PostgreSQL supports three kinds of join:
- nested loop join (nodeNestloop.c)
- sort-merge join (nodeMergejoin.c)
- hash join (nodeHashjoin.c) (hybrid hash join)
Query optimiser chooses appropriate join, by considering
- physical characteristics of tables being joined
- estimated selectivity (likely number of result tuples)
Produced: 30 Apr 2024