COMP9315 24T1 ♢ Lectures Part F ♢ [0/35]
DBMSs are engines to store, combine and filter information.
Join (⋈) is the primary means of combining information.
Join is important and potentially expensive
Most common join condition: equijoin, e.g. (R.pk = S.fk)
Join varieties (natural, inner, outer, semi, anti) all behave similarly.
We consider three strategies for implementing join
- nested loop ... simple, widely applicable, inefficient without buffering
- sort-merge ... works best if tables are sorted on join attributes
- hash-based ... requires good hash function and sufficient buffering
Consider a university database with the schema:
create table Student(
    id    integer primary key,
    name  text, ...
);
create table Enrolled(
    stude integer references Student(id),
    subj  text references Subject(code), ...
);
create table Subject(
    code  text primary key,
    title text, ...
);
List names of students in all subjects, arranged by subject.
SQL query to provide this information:
select E.subj, S.name
from Student S, Enrolled E
where S.id = E.stude
order by E.subj, S.name;
And its relational algebra equivalent:
Sort[subj] ( Project[subj,name] ( Join[id=stude](Student,Enrolled) ) )
To simplify formulae, we denote Student by S and Enrolled by E.
Some database statistics:
Sym | Meaning                  | Value
----+--------------------------+-------
rS  | # student records        | 20,000
rE  | # enrollment records     | 80,000
cS  | Student records/page     | 20
cE  | Enrolled records/page    | 40
bS  | # data pages in Student  | 1,000
bE  | # data pages in Enrolled | 2,000
Also, in cost analyses below, N = number of memory buffers.
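As a sanity check, the page counts follow directly from the record counts and records-per-page figures; a quick sketch (not part of the original slides):

```python
import math

# statistics from the table above
r_S, r_E = 20_000, 80_000   # record counts for Student, Enrolled
c_S, c_E = 20, 40           # records per page

# pages needed = ceil(records / records-per-page)
b_S = math.ceil(r_S / c_S)
b_E = math.ceil(r_E / c_E)

print(b_S, b_E)   # 1000 2000
```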
Out = Student ⋈ Enrolled relation statistics:

Sym  | Meaning                | Value
-----+------------------------+-------
rOut | # tuples in result     | 80,000
cOut | result records/page    | 80
bOut | # data pages in result | 1,000
Notes:
- rOut ... one result tuple for each Enrolled tuple
- cOut ... result tuples have only subj and name
- in analyses, ignore cost of writing result ... same in all methods
Basic strategy (R.a ⋈ S.b):
Result = {}
for each page i in R {
    pageR = getPage(R,i)
    for each page j in S {
        pageS = getPage(S,j)
        for each pair of tuples tR,tS from pageR,pageS {
            if (tR.a == tS.b)
                Result = Result ∪ (tR:tS)
        }
    }
}
Needs input buffers for R and S, output buffer for "joined" tuples
Terminology: R is outer relation, S is inner relation
Cost = bR.bS ... ouch!   (strictly bR + bR.bS page reads, but the bR.bS term dominates)
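Plugging the example statistics into this cost (a quick check, not from the slides):

```python
b_R, b_S = 1_000, 2_000   # pages in Student, Enrolled (from the stats slide)

# every page of S is read once for every page of R
cost = b_R * b_S
print(cost)   # 2000000 page reads
```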
Method (for N memory buffers):
- read N-2-page chunk of R into memory buffers
- for each S page, check join condition on all (tR,tS) pairs in buffers
- repeat for all N-2-page chunks of R
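The method above can be sketched in memory as follows, treating lists of tuples as "pages"; the function name and chunking logic are illustrative, not PostgreSQL code:

```python
def block_nested_loop_join(R_pages, S_pages, N, join_cond):
    """Join R with S, holding (N-2)-page chunks of R in memory at a time."""
    chunk = N - 2                      # buffers left after S input + output
    result = []
    for i in range(0, len(R_pages), chunk):
        # "read" an (N-2)-page chunk of R into memory buffers
        r_tuples = [t for page in R_pages[i:i+chunk] for t in page]
        for s_page in S_pages:         # one full scan of S per chunk
            for s in s_page:
                for r in r_tuples:
                    if join_cond(r, s):
                        result.append((r, s))
    return result

# tiny example: join on equality with N=3 buffers (chunk size 1 page)
print(block_nested_loop_join([[1, 2], [3, 4]], [[2, 3], [5]], 3,
                             lambda r, s: r == s))   # [(2, 2), (3, 3)]
```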
❖ Block Nested Loop Join (cont)
Best-case scenario: bR ≤ N-2
- read bR pages of relation R into buffers
- while whole R is buffered, read bS pages of S
Cost = bR + bS

Typical-case scenario: bR > N-2
- read ceil(bR/(N-2)) chunks of pages from R
- for each chunk, read bS pages of S
Cost = bR + bS.ceil(bR/(N-2))

Note: always requires rR.rS checks of the join condition
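For the example tables, assuming (say) N = 34 buffers (an illustrative value, not from the stats slide):

```python
import math

b_R, b_S, N = 1_000, 2_000, 34      # N = 34 is an assumed buffer count

chunks = math.ceil(b_R / (N - 2))   # 32 chunks of R
cost = b_R + b_S * chunks           # 1000 + 32*2000
print(chunks, cost)   # 32 65000
```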
Why block nested loop join is actually useful in practice ...
Many queries have the form
select * from R,S where r.i=s.j and r.x=k
This would typically be evaluated as
Tmp = Sel[r.x=k](R)
Res = Tmp Join[i=j] S
If Tmp is small ⇒ may fit in memory (in a small number of buffers)
A problem with nested-loop join:
- needs repeated scans of entire inner relation S
If there is an index on S, we can avoid such repeated whole-of-S scanning.
Consider Join[i=j](R,S):
for each tuple r in relation R {
    use index to select tuples from S where s.j = r.i
    for each selected tuple s from S {
        add (r,s) to result
    }
}
❖ Index Nested Loop Join (cont)
This method requires:
- one scan of R relation (bR)
- only one buffer needed, since we use R tuple-at-a-time
- for each tuple in R (rR), one index lookup on S
- cost depends on type of index and number of results
- best case is when each R.i matches few S tuples
Cost = bR + rR.SelS   (SelS is the cost of performing a select on S)
Typical SelS = 1-2 (hashing) .. bq (unclustered index)
Trade-off: rR.SelS vs bR.bS, where bR ≪ rR and SelS ≪ bS
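Plugging the example statistics into this formula, with Student as the outer relation; the SelS values are illustrative:

```python
b_R, r_R = 1_000, 20_000   # Student pages and records (outer relation)

# SelS = cost of one index-based select on Enrolled (illustrative values)
for Sel_S in (1, 2):
    print(Sel_S, b_R + r_R * Sel_S)   # 21000 and 41000, far below bR.bS
```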
Basic approach:
- sort both relations on join attribute (reminder: Join[i=j](R,S))
- scan together using merge to form result (r,s) tuples
Advantages:
- no need to deal with "entire" S relation for each r tuple
- deal with runs of matching R and S tuples
Disadvantages:
- cost of sorting both relations
(already sorted on join key?)
- some rescanning required when there are long runs of matching S tuples
Method requires several cursors to scan sorted relations:
- r  = current record in R relation
- s  = current record in current run in S relation
- ss = start of current run in S relation
Algorithm using query iterators/scanners:
Query ri, si;  Tuple r, s;

ri = startScan("SortedR");
si = startScan("SortedS");
while ((r = nextTuple(ri)) != NULL
       && (s = nextTuple(si)) != NULL) {
    // advance r while it is behind current s
    while (r != NULL && r.i < s.j)
        r = nextTuple(ri);
    if (r == NULL) break;
    // advance s while it is behind current r
    while (s != NULL && r.i > s.j)
        s = nextTuple(si);
    if (s == NULL) break;
    ...
    ...
    TupleID startRun = scanCurrent(si);   // remember start of run in S
    while (r != NULL && r.i == s.j) {
        while (s != NULL && s.j == r.i) {
            addTuple(outbuf, combine(r,s));
            if (isFull(outbuf)) {
                writePage(outf, outp++, outbuf);
                clearBuf(outbuf);
            }
            s = nextTuple(si);
        }
        r = nextTuple(ri);
        setScan(si, startRun);   // rewind S to start of run for next r
    }
}
Buffer requirements:
- for sort phase:
- as many as possible (sorting needs roughly 1 + logN-1(b/N) passes, so more buffers ⇒ fewer passes)
- if insufficient buffers, sorting cost can dominate
- for merge phase:
- one output buffer for result
- one input buffer for relation R
- (preferably) enough buffers for longest run in S
Cost of sort-merge join.
Step 1: sort each relation (if not already sorted):
- Cost = 2.bR(1 + logN-1(bR/N)) + 2.bS(1 + logN-1(bS/N))
  (where N = number of memory buffers)
Step 2: merge sorted relations:
- if every run of values in S fits completely in buffers,
merge requires single scan,
Cost = bR + bS
- if some runs of values in S are larger than the buffers,
  need to re-scan the run for each corresponding value from R
❖ Sort-Merge Join on Example
Case 1: Join[id=stude](Student,Enrolled)
- relations are not sorted on id#
- memory buffers N=32; all runs are of length < 30
Cost = sort(S) + sort(E) + bS + bE
     = 2bS(1+log31(bS/32)) + 2bE(1+log31(bE/32)) + bS + bE
     = 2×1000×(1+2) + 2×2000×(1+2) + 1000 + 2000
     = 6000 + 12000 + 1000 + 2000
     = 21,000
❖ Sort-Merge Join on Example (cont)
Case 2: Join[id=stude](Student,Enrolled)
- Student and Enrolled already sorted on id#
- memory buffers N=4 (S input, 2 × E input, output)
- 5% of the "runs" in E span two pages
- there are no "runs" in S, since id# is a primary key
For the above, no re-scans of E runs are ever needed.
Cost = 2,000 + 1,000 = 3,000 (regardless of which relation is outer)
Basic idea:
- use hashing as a technique to partition relations
- to avoid having to consider all pairs of tuples
Requires sufficient memory buffers
- to hold substantial portions of partitions
- (preferably) to hold largest partition of outer relation
Other issues:
- works only for equijoin R.i=S.j (but this is a common case)
- susceptible to data skew (or poor hash function)

Variations: simple, grace, hybrid.
Basic approach:
- hash part of outer relation R into memory buffers (build)
- scan inner relation S, using hash to search (probe)
- if R.i=S.j, then h(R.i)=h(S.j) (hash to same buffer)
- only need to check one memory buffer for each S tuple
- repeat until whole of R has been processed
No overflows allowed in in-memory hash table
- works best with uniform hash function
- can be adversely affected by data/hash skew
❖ Simple Hash Join (cont)
Data flow:
❖ Simple Hash Join (cont)
Algorithm for simple hash join Join[R.i=S.j](R,S):
for each tuple r in relation R {
    if (buffer[h(R.i)] is full) {
        for each tuple s in relation S {
            for each tuple rr in buffer[h(S.j)] {
                if ((rr,s) satisfies join condition)
                    add (rr,s) to result
            }
        }
        clear all hash table buffers
    }
    insert r into buffer[h(R.i)]
}
Best case: # join tests ≤ rS.cR   (cf. nested-loop rS.rR)
❖ Simple Hash Join (cont)
Cost for simple hash join ...
Best case: all tuples of R fit in the hash table
- Cost = bR + bS
- same page reads as block nested loop, but fewer join tests

Good case: refill hash table m times (where m ≥ ceil(bR/(N-2)))
- Cost = bR + m.bS
- more page reads than block nested loop, but fewer join tests
Worst case: everything hashes to same page
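Plugging the example statistics into these formulas; the refill count m here is an assumed value, not from the slides:

```python
b_R, b_S = 1_000, 2_000

best = b_R + b_S          # all of R fits in the hash table
m = 4                     # assumed number of hash-table refills
good = b_R + m * b_S
print(best, good)   # 3000 9000
```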
Basic approach (for R ⋈ S ):
- partition both relations on join attribute using hashing (h1)
- load each partition of R into N-buffer hash table (h2)
- scan through corresponding partition of S to form results
- repeat until all partitions exhausted
For best-case cost (O(bR + bS)):
- need ≥ √bR buffers to hold largest partition of outer relation

If < √bR buffers or poor hash distribution:
- need to scan some partitions of S multiple times
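A toy in-memory sketch of the partition and probe phases described above (purely illustrative: a real implementation writes partitions to disk, and the exact-key dictionary here plays the role of the second hash function h2):

```python
from collections import defaultdict

def grace_hash_join(R, S, key_r, key_s, n_parts=4):
    """Partition both inputs with h1, then join matching partitions."""
    h1 = lambda v: hash(v) % n_parts
    parts_R, parts_S = defaultdict(list), defaultdict(list)
    for r in R:                                  # partition phase for R
        parts_R[h1(key_r(r))].append(r)
    for s in S:                                  # partition phase for S
        parts_S[h1(key_s(s))].append(s)
    result = []
    for p in range(n_parts):                     # probe phase, per partition
        table = defaultdict(list)                # in-memory table (h2's role)
        for r in parts_R[p]:
            table[key_r(r)].append(r)
        for s in parts_S[p]:
            for r in table[key_s(s)]:
                result.append((r, s))
    return result

# tiny example joining on the first field of each tuple
R = [(1, 'a'), (2, 'b')]
S = [(1, 'x'), (1, 'y'), (3, 'z')]
print(grace_hash_join(R, S, lambda t: t[0], lambda t: t[0]))
```

Only tuples landing in the same partition are ever compared, which is why all pairs with equal join keys are still found.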
Partition phase (applied to both R and S):
Probe/join phase:
The second hash function (h2) simply speeds up the matching process.
Without it, would need to scan entire R partition for each record in S partition.
Cost of grace hash join:
- #pages in all partition files of Rel ≅ bRel (maybe slightly more)
- partition relation R ... Cost = bR.Tr + bR.Tw = 2bR
- partition relation S ... Cost = bS.Tr + bS.Tw = 2bS
- probe/join requires one scan of each (partitioned) relation
  Cost = bR + bS
- all hashing and comparison occurs in memory ⇒ ≅ 0 cost

Total Cost = 2bR + 2bS + bR + bS = 3(bR + bS)
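With the example statistics this total works out to:

```python
b_R, b_S = 1_000, 2_000

# 2bR + 2bS to partition, plus bR + bS to probe/join
cost = 3 * (b_R + b_S)
print(cost)   # 9000
```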
A variant of grace join if we have √bR < N < bR+2
- create k≪N partitions, m in memory, k-m on disk
- buffers: 1 input, k-m output, p = N-(k-m)-1 for in-memory partitions
When we come to scan and partition the S relation:
- any tuple with hash in range 0..m-1 can be resolved immediately
- other tuples are written to one of the k-m partition files for S

Final phase is the same as in grace join, but with only k-m partitions.
Comparison:
- grace hash join creates N-1 partitions on disk
- hybrid hash join creates m (memory) + k (disk) partitions
❖ Hybrid Hash Join (cont)
First phase of hybrid hash join with m=1 (partitioning R):
❖ Hybrid Hash Join (cont)
Next phase of hybrid hash join with m=1 (partitioning S):
❖ Hybrid Hash Join (cont)
Final phase of hybrid hash join with m=1 (finishing join):
❖ Hybrid Hash Join (cont)
Some observations:
- with k partitions, each partition has expected size bR/k
- holding m partitions in memory needs ceil(mbR/k) buffers
- trade-off between in-memory partition space and #partitions
Best-cost scenario:
- m = 1, k ≅ ceil(bR/N) (satisfying above constraint)
Other notes:
- if N = bR+2, using block nested loop join is simpler
- cost depends on N (but less than grace hash join)
No single join algorithm is superior in any overall sense.
Which algorithm is best for a given query depends on:
- sizes of relations being joined, size of buffer pool
- any indexing on relations, whether relations are sorted
- which attributes and operations are used in the query
- number of tuples in S matching each tuple in R
- distribution of data values (uniform, skew, ...)
Choosing the "best" join algorithm is critical because the
cost difference between best and worst case can be very large.
E.g. Join[id=stude](Student,Enrolled): 3,000 ... 2,000,000
Join implementations are under: src/backend/executor
PostgreSQL supports three kinds of join:
- nested loop join (nodeNestloop.c)
- sort-merge join (nodeMergejoin.c)
- hash join (nodeHashjoin.c) (hybrid hash join)
Query optimiser chooses appropriate join, by considering
- physical characteristics of tables being joined
- estimated selectivity (likely number of result tuples)
Produced: 30 Apr 2024