Implementing Join
Join | 1/87 |
DBMSs are engines to store, combine and filter information.
Filtering is achieved via selection and projection.
The join operation (⋈) is the primary means of combining information.
Because join is potentially very expensive to compute, a range of join algorithms has been developed, and their costs need to be considered carefully.
(We use a running example to compare costs of the various join processing methods)
... Join | 2/87 |
Types of join:
  equijoin on a single attribute, e.g.
    select * from R,S where R.i = S.j
  equijoin on multiple attributes, e.g.
    select * from R,S where R.a = S.b and R.c = S.d ...
  non-equijoin (theta-join), e.g.
    select * from R,S where R.a < S.b and R.c <> S.d ...
The most common case in practice is the foreign-key join R.pk=S.fk.
Join Example | 3/87 |
Consider a university database with the schema:
create table Student(
    id      integer primary key,
    name    text, ...
);
create table Enrolled(
    stude   integer references Student(id),
    subj    text references Subject(code), ...
);
create table Subject(
    code    text primary key,
    title   text, ...
);
And the following request on this database:
  "For each subject, list the names of the students enrolled in it, arranged by subject."
... Join Example | 4/87 |
The result of this request would look like:
subj      name
--------  -----------------
COMP1011  Chen Hwee Ling
COMP1011  John Smith
COMP1011  Ravi Shastri
...
COMP1021  David Jones
COMP1021  Stephen Mao
...
COMP3311  Dean Jones
COMP3311  Mark Taylor
COMP3311  Sashin Tendulkar
... Join Example | 5/87 |
An SQL query to provide this information:
select   E.subj, S.name
from     Student S, Enrolled E
where    S.id = E.stude
order by E.subj, S.name;
And its relational algebra equivalent:
  Sort[subj,name](Project[subj,name](Join[id=stude](Student,Enrolled)))
The core of the query is the join Join[id=stude](Student,Enrolled).
To simplify writing of formulae, S = Student, E = Enrolled.
... Join Example | 6/87 |
Some database statistics:
Sym | Meaning                   | Value
----+---------------------------+--------
rS  | # student records         | 20,000
rE  | # enrollment records      | 80,000
cS  | Student records per page  | 20
cE  | Enrolled records per page | 40
bS  | # data pages in Student   | 1,000
bE  | # data pages in Enrolled  | 2,000
Also, in cost analyses below, N = number of memory buffers.
... Join Example | 7/87 |
Output statistics:

Sym  | Meaning                | Value
-----+------------------------+--------
rOut | # tuples in result     | 80,000
cOut | result records/page    | 80
bOut | # data pages in result | 1,000

Notes:
  each Enrolled tuple joins with exactly one Student tuple, so rOut = rE
  result tuples contain only the subj and name attributes,
  so more of them fit on each page (cOut > cE)
Join via Cross-product | 8/87 |
Join can be defined as a cross-product followed by selection:
  Join[Cond](R,S) = Sel[Cond](R × S)
For the example query, we could implement the join as
  Sel[id=stude](Student × Enrolled)
The cross-product contains rS × rE = 20,000 × 80,000 = 1,600,000,000 tuples.
... Join via Cross-product | 9/87 |
For Temp = Student × Enrolled:
  Temp contains 1,600,000,000 tuples
  each Temp tuple combines one Student tuple and one Enrolled tuple,
  so Temp occupies vastly more pages than the two input relations combined
I/O costs:
  producing Temp: read all of Student and Enrolled, write out every combined tuple
  filtering Temp: read back every page of Temp to apply the selection
Either way, the cost is dominated by the enormous size of Temp.
... Join via Cross-product | 10/87 |
Because the cross-product is so large, it is never materialised in practice.
DBMSs implement only join, and provide cross-product as the special case of a join with no join condition:
  select * from R,S
Nested-Loop Join | 11/87 |
Nested Loop Join | 12/87 |
The simplest join algorithm:
for each tuple r in R {
    for each tuple s in S {
        if ((r,s) satisfies join condition) {
            add (r,s) to result
        }
    }
}
R is the outer relation; S is the inner relation.
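The tuple-level algorithm above can be sketched as an in-memory Python version (illustrative only, with hypothetical mini-relations; a real DBMS works on pages and buffers):

```python
def nested_loop_join(R, S, cond):
    """Combine every pair (r, s) from R x S that satisfies cond."""
    result = []
    for r in R:               # R is the outer relation
        for s in S:           # S is the inner relation
            if cond(r, s):
                result.append((r, s))
    return result

# Hypothetical mini-instances of Student(id, name) and Enrolled(stude, subj)
students = [(1, "Chen"), (2, "Smith")]
enrolled = [(1, "COMP1011"), (2, "COMP1011"), (1, "COMP3311")]
pairs = nested_loop_join(students, enrolled, lambda r, s: r[0] == s[0])
```

Every (r,s) pair is examined, which is exactly why the page-level cost analysis below is so unfavourable.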
... Nested Loop Join | 13/87 |
Requires (at least) three memory buffers (2 input, 1 output).
... Nested Loop Join | 14/87 |
Abstract algorithm for Join[Cond](R,S) (with 3 memory buffers):
for each page of relation R {
    read into buffer rBuf
    for each page of relation S {
        read into buffer sBuf
        for each record r in rBuf {
            for each record s in sBuf {
                if ((r,s) satisfies Cond) {
                    add combined(r,s) to OutBuf
                    write OutBuf when full
                }
            }
        }
    }
}
... Nested Loop Join | 15/87 |
Detailed algorithm for Join[Cond](R,S) (with 3 memory buffers):
// rf: file for R, sf: file for S, of: output file
outp = 0; clearBuf(oBuf);
for (rp = 0; rp < nPages(rf); rp++) {
    readPage(rf, rp, rBuf);
    for (sp = 0; sp < nPages(sf); sp++) {
        readPage(sf, sp, sBuf);
        for (i = 0; i < nTuples(rBuf); i++) {
            rTup = getTuple(rBuf, i);
            for (j = 0; j < nTuples(sBuf); j++) {
                sTup = getTuple(sBuf, j);
                if (satisfies(rTup,sTup,Cond)) {
                    rsTup = combine(rTup,sTup);
                    addTuple(oBuf, rsTup);
                    if (isFull(oBuf)) {
                        writePage(of, outp++, oBuf);
                        clearBuf(oBuf);
                    }
                }
            }
        }
    }
}
if (nTuples(oBuf) > 0)           // flush final partial output page
    writePage(of, outp++, oBuf);
... Nested Loop Join | 16/87 |
The three-memory-buffer nested loop join requires:
  Cost = bR + bR bS   (read R once; scan all of S for each page of R)
If we use S as the outer relation in the join:
  Cost = bS + bS bR
It is (slightly) better to use the smaller relation as the outer relation.
Nested Loop Join on Example | 17/87 |
If Student is the outer relation and Enrolled the inner:
  Cost = bS + bS bE
       = 1,000 + 1,000 × 2,000 = 2,001,000
If Enrolled is the outer relation and Student the inner:
  Cost = bE + bE bS
       = 2,000 + 2,000 × 1,000 = 2,002,000
Either way, the cost of nested-loop join is too high (around 5.5 hours, if Tr = 0.01 sec)
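The running-time claim is easy to sanity-check, assuming each page read costs Tr = 0.01 sec as stated:

```python
# Nested loop join with Student as the outer relation
bS, bE = 1_000, 2_000
cost = bS + bS * bE          # total page reads
seconds = cost * 0.01        # assumed Tr = 0.01 sec per page read
hours = seconds / 3600       # roughly 5.5 hours
```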
Implementing Join Better | 18/87 |
Aims of effective join computation:
  avoid materialising the cross-product
  minimise the number of page I/O operations
Range of costs for Join(R,S):
  best case:  bR + bS      (each relation read exactly once)
  worst case: bR + bR bS   (inner relation scanned once per page of outer)
Block Nested Loop Join | 19/87 |
If at least bR+2 memory buffers are available:
  read all of R into bR memory buffers
  scan S one page at a time, using 1 input buffer
  check every pair (r,s) against the join condition,
  writing matches via 1 output buffer
... Block Nested Loop Join | 20/87 |
Algorithm for nested loop join with bR+2 memory buffers:
read all of R's pages into memory buffers
for each page of relation S {
    read page into S's input buffer
    for each tuple s in S's buffer {
        for each tuple r in R's memory buffers {
            if ((r,s) satisfies JoinCond) {
                add (r,s) to output buffer
                write output buffer when full
            }
        }
    }
}
Note that R effectively becomes the inner relation in this scheme.
... Block Nested Loop Join | 21/87 |
This method requires:
  Cost = bR + bS   (each page of each relation is read exactly once)
Notes:
  this is the minimum possible cost for evaluating a join
  it needs bR+2 buffers: bR for R, 1 input buffer for S, 1 output buffer
... Block Nested Loop Join | 22/87 |
Further performance improvements:
Block Nested Loop Join on Example | 23/87 |
If ≥ 1002 memory buffers are available:
  hold all of Student (1,000 pages) in memory buffers
  scan Enrolled one page at a time

Cost = bS + bE
     = 1,000 + 2,000 = 3,000

This is considerably better than 2,001,000 (30 secs vs 5.5 hours).
But what if we have only N memory buffers, where N < bR , N < bS?
... Block Nested Loop Join on Example | 24/87 |
In the general case, read the outer relation in runs of N-2 pages:

for each run of N-2 pages from R {
    read N-2 of R's pages into memory buffers
    for each page of relation S {
        read page into S's input buffer
        for each tuple s in S's buffer {
            for each tuple r in R's memory buffers {
                if ((r,s) satisfies JoinCond) {
                    add (r,s) to output buffer
                    write output buffer when full
                }
            }
        }
    }
}
... Block Nested Loop Join on Example | 25/87 |
Block nested loop join requires:
  Cost = bR + bS ⌈bR/(N-2)⌉   (R outer, read in ⌈bR/(N-2)⌉ runs)
Notes:
  the outer relation is read exactly once;
  the inner relation is scanned once for each run of the outer
  more buffers ⇒ fewer, larger runs ⇒ fewer scans of the inner ⇒ lower cost
... Block Nested Loop Join on Example | 26/87 |
Costs for various buffer pool sizes:
N    | Outer    | Inner    | #runs | Cost
-----+----------+----------+-------+---------
22   | Student  | Enrolled | 50    | 101,000
52   | Student  | Enrolled | 20    | 41,000
102  | Student  | Enrolled | 10    | 21,000
1002 | Student  | Enrolled | 1     | 3,000
22   | Enrolled | Student  | 100   | 102,000
52   | Enrolled | Student  | 40    | 42,000
102  | Enrolled | Student  | 20    | 22,000
1002 | Enrolled | Student  | 2     | 4,000
Block Nested Loop Join in Practice | 27/87 |
Why block nested loop join is very useful in practice ...
Many queries have the form
select * from R,S where R.i=S.j and R.x=k
This would typically be evaluated as
  Join[i=j](Sel[x=k](R), S)
If |Sel[x=k](R)| is small, it may fit in memory (in a small number of buffers)
⇒ the inner relation S needs to be scanned only once.
Join Conditions and Methods | 28/87 |
Nested loop join makes no assumptions about join conditions.
for each pair of tuples (r,s) {
    check join condition on (r,s)
    if satisfied, add to results
}

To improve on this, we must avoid examining every pair of tuples,
which is only possible by exploiting the structure of the join condition.
Thus, a range of other join algorithms has been developed specifically for equality join conditions.
Index Nested Loop Join | 29/87 |
Most joins considered so far have a common problem:
  they examine every pair of tuples, even though only a tiny fraction of pairs can possibly join.
If an index exists on the join attribute of the inner relation, we can use it to fetch only the matching tuples.

Consider Join[R.i=S.j](R,S), where S has an index on attribute j:
for each tuple r in relation R {
    use index to select tuples from S where s.j = r.i
    for each selected tuple s from S {
        add (r,s) to result
    }
}
(For ordered indexes (e.g. B-tree), this approach also assists with join conditions like R.i<S.j.)
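The same idea in Python, with a dict-of-lists standing in for the index on S's join attribute (hypothetical data; page-level costs are not modelled):

```python
from collections import defaultdict

def index_nested_loop_join(R, S, r_key, s_key):
    index = defaultdict(list)       # stands in for an index on S's join attribute
    for s in S:
        index[s_key(s)].append(s)
    result = []
    for r in R:                     # scan the outer relation
        for s in index[r_key(r)]:   # index lookup replaces a full scan of S
            result.append((r, s))
    return result

students = [(1, "Chen"), (2, "Smith"), (3, "Shastri")]
enrolled = [(1, "COMP1011"), (3, "COMP3311")]
res = index_nested_loop_join(enrolled, students,
                             lambda e: e[0], lambda s: s[0])
```

Only tuples that actually match are ever touched during the probe, which is the source of the cost savings analysed below.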
... Index Nested Loop Join | 30/87 |
This method requires:
  one scan of the outer relation R
  one index-based selection on S for each tuple of R
  Cost = bR + rR SelCost
  (where SelCost is the average cost of an index-based selection on S)
... Index Nested Loop Join | 31/87 |
For index lookup, SelCost depends on the index type, e.g.
  hash index: roughly 1-2 page reads per probe (more if overflow chains are long)
  B-tree: depth of the tree + number of data pages holding matching tuples
Index Nested Loop Join on Example | 32/87 |
Case 1: Join[id=stude](Student,Enrolled)
  Student is the outer relation, scanned page by page
  Enrolled has a B-tree index (depth 3) on stude
  the matching Enrolled tuples for each student are clustered,
  so they can be fetched with ~1.01 page reads on average

Cost = bS + rS btreeE
     = 1,000 + 20,000 × (3+1.01) = 81,200
... Index Nested Loop Join on Example | 33/87 |
Case 2: Join[id=stude](Student,Enrolled)
  Student is the outer relation, scanned page by page
  Enrolled has a B-tree index (depth 3) on stude
  the ~4 matching Enrolled tuples for each student are not clustered,
  so each one may require its own page read

Cost = bS + rS btreeE
     = 1,000 + 20,000 × (3+4) = 141,000
... Index Nested Loop Join on Example | 34/87 |
Case 3: Join[id=stude](Student,Enrolled)
  Enrolled is the outer relation, scanned page by page
  Student has a hash index on id,
  costing ~1.1 page reads per probe on average

Cost = bE + rE hashS
     = 2,000 + 80,000 × 1.1 = 90,000
Optimised Index Nested Loop Join | 35/87 |
Consider the following scenario for Join[R.i=S.j](R,S):
  R is sorted (or at least clustered) on attribute R.i
  ⇒ successive R tuples are likely to repeat the same R.i value
  each repeated R.i value would trigger exactly the same index lookup on S
Optimisation: remember the previous R.i value and the S tuples it selected;
if the next R tuple has the same R.i value, reuse the buffered S tuples
instead of repeating the index-based selection on S.
... Optimised Index Nested Loop Join | 36/87 |
Abstract algorithm for optimised index nested loop join:
for each tuple r in relation R {
    if (prev == r.i)
        use selected tuples in buffer(s)
    else {
        use index to select tuples from S where s.j = r.i
        store selected tuples in buffer(s)
    }
    for each selected tuple s from S
        add (r,s) to result
    prev = r.i
}
Cost savings depend on repetition factor, #buffers, size of index scans
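A sketch of this optimisation: R is assumed sorted on the join attribute, prev/cached implement the buffered reuse, and a lookup counter is added purely to show the saving (hypothetical data):

```python
def opt_index_nested_loop_join(R, S_index, r_key):
    """R sorted on r_key; S_index maps join value -> list of S tuples."""
    result, cached = [], []
    prev = object()                  # sentinel: matches no real key
    lookups = 0
    for r in R:
        k = r_key(r)
        if k != prev:                # new key: perform the index selection
            cached = S_index.get(k, [])
            prev = k
            lookups += 1
        for s in cached:             # repeated key: reuse buffered tuples
            result.append((r, s))
    return result, lookups

R = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]       # sorted on first field
S_index = {1: [("x",)], 2: [("y",), ("z",)]}
res, lookups = opt_index_nested_loop_join(R, S_index, lambda r: r[0])
```

Here four outer tuples cause only two index lookups, because the run of three tuples with key 1 shares one lookup.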
Sort-Merge Join | 37/87 |
Sort-Merge Join | 38/87 |
Basic approach:
  sort both relations on the join attribute (R on R.i, S on S.j)
  merge-scan the two sorted relations
  for each pair (r,s) of tuples with matching join values, add the combined tuple to the result
... Sort-Merge Join | 39/87 |
Method requires several cursors to scan the sorted relations:
  r  - current tuple in R
  s  - start of the current run of matching tuples in S
  ss - current tuple within the current run in S
... Sort-Merge Join | 40/87 |
Abstract algorithm for merge phase of Join[R.i=S.j](R,S):
r = first tuple in R
s = first tuple in S
while (r != eof and s != eof) {
    // align cursors to start of next common run
    while (r != eof and r.i < s.j) { r = next tuple in R }
    while (s != eof and r.i > s.j) { s = next tuple in S }
    // scan common run, generating result tuples
    while (r != eof and r.i == s.j) {
        ss = s    // set to start of run
        while (ss != eof and ss.j == r.i) {
            add (r,ss) to result
            ss = next tuple in S
        }
        r = next tuple in R
    }
    s = ss    // start search for next run
}
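The merge phase can be sketched in Python over sorted in-memory lists; run_start plays the role of the ss cursor reset, so duplicate join values on both sides are handled (hypothetical data):

```python
def merge_join(R, S, r_key, s_key):
    """Both inputs must be sorted on their join keys."""
    result, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if r_key(R[i]) < s_key(S[j]):        # align cursors to next common run
            i += 1
        elif r_key(R[i]) > s_key(S[j]):
            j += 1
        else:                                 # start of a common run
            v, run_start = r_key(R[i]), j
            while i < len(R) and r_key(R[i]) == v:
                jj = run_start                # rescan the S run for each R tuple
                while jj < len(S) and s_key(S[jj]) == v:
                    result.append((R[i], S[jj]))
                    jj += 1
                i += 1
            j = jj                            # continue after the run
    return result

res = merge_join([1, 1, 2, 3], [1, 2, 2], lambda x: x, lambda x: x)
```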
Sidetrack: Iterators | 41/87 |
Sort-merge join implementation is simplified by use of iterators.
Iterator *iter;  Tuple tup;

iter = startScan("Rel", "i=5");
while ((tup = nextTuple(iter)) != NULL) {
    process(tup);
}
endScan(iter);
... Sidetrack: Iterators | 42/87 |
typedef struct {
    File inf;     // input file
    Buffer buf;   // buffer holding current page
    int curp;     // current page during scan
    int curr;     // index of current record in page
} Iterator;

// simple linear scan; no condition
Iterator *startScan(char *relName)
{
    Iterator *iter = malloc(sizeof(Iterator));
    iter->inf = openFile(fileName(relName), READ);
    iter->curp = 0;  iter->curr = -1;
    readPage(iter->inf, iter->curp, iter->buf);
    return iter;
}
... Sidetrack: Iterators | 43/87 |
Tuple nextTuple(Iterator *iter)
{
    // check if reached end of current page
    if (iter->curr == nTuples(iter->buf)-1) {
        // check if reached end of data file
        if (iter->curp == nPages(iter->inf)-1)
            return NULL;
        iter->curp++;
        readPage(iter->inf, iter->curp, iter->buf);
        iter->curr = -1;
    }
    iter->curr++;
    return getTuple(iter->buf, iter->curr);
}
// curp and curr hold indexes of most recently read page/record
... Sidetrack: Iterators | 44/87 |
TupleID scanCurrent(Iterator *iter)
{
    // form TupleID for current record
    // (simplified encoding of (page,record) as a single value)
    return iter->curp + iter->curr;
}

void setScan(Iterator *iter, int page, int rec)
{
    assert(page >= 0 && page < nPages(iter->inf));
    if (iter->curp != page) {
        iter->curp = page;
        readPage(iter->inf, iter->curp, iter->buf);
    }
    assert(rec >= 0 && rec < nTuples(iter->buf));
    iter->curr = rec;
}

void endScan(Iterator *iter)
{
    closeFile(iter->inf);
    free(iter);
}
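Python generators give essentially the same scan interface for free; here a relation is a list of pages (lists of tuples), and next() plays the role of nextTuple (an analogy to, not a translation of, the C interface above):

```python
def start_scan(pages):
    """Yield every tuple of every page, in storage order."""
    for page in pages:        # corresponds to readPage
        for tup in page:      # corresponds to getTuple
            yield tup

rel = [[("a", 1), ("b", 2)], [("c", 3)]]   # a 2-page "file"
it = start_scan(rel)
first = next(it)                            # nextTuple
rest = list(it)                             # drain the scan; endScan is implicit
```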
Sort-Merge Join | 45/87 |
Concrete algorithm using iterators:
Iterator *ri, *si;
Tuple rtup, stup;

ri = startScan("SortedR");
si = startScan("SortedS");
rtup = nextTuple(ri);
stup = nextTuple(si);
while (rtup != NULL && stup != NULL) {
    // align cursors to start of next common run
    while (rtup != NULL && rtup.i < stup.j)
        rtup = nextTuple(ri);
    if (rtup == NULL) break;
    while (stup != NULL && rtup.i > stup.j)
        stup = nextTuple(si);
    if (stup == NULL) break;
    // must have (rtup.i == stup.j) here ...
... Sort-Merge Join | 46/87 |
    ...
    // remember start of current run in S
    int runPage = si->curp, runRec = si->curr;
    // scan common run, generating result tuples
    while (rtup != NULL && rtup.i == stup.j) {
        while (stup != NULL && stup.j == rtup.i) {
            addTuple(outbuf, combine(rtup,stup));
            if (isFull(outbuf)) {
                writePage(outf, outp++, outbuf);
                clearBuf(outbuf);
            }
            stup = nextTuple(si);
        }
        rtup = nextTuple(ri);
        // reposition S's scan to the start of the run
        setScan(si, runPage, runRec);
        stup = getTuple(si->buf, si->curr);
    }
}
... Sort-Merge Join | 47/87 |
Buffer requirements:
  the merge phase itself needs only 3 buffers (one per input, one for output)
  the sort phase benefits from as many buffers as are available
... Sort-Merge Join | 48/87 |
Cost of sort-merge join:
Step 1: sort each relation (if not already sorted):
  cost = bR ⌈log_{N-1} bR⌉ + bS ⌈log_{N-1} bS⌉   (with N in-memory buffers)
Step 2: merge the sorted relations:
  cost = bR + bS, provided runs of duplicate join values do not force re-reading of pages
Sort-Merge Join on Example | 49/87 |
Case 1: Join[id=stude](Student,Enrolled)
  both relations already sorted on the join attribute
  ⇒ merge phase only:  Cost = bS + bE = 1,000 + 2,000 = 3,000
... Sort-Merge Join on Example | 50/87 |
Case 2: Join[id=stude](Student,Enrolled)
  neither relation sorted; sort both first (assume N = 31 buffers, i.e. merge fan-in 30)

Cost = sort(S) + sort(E) + bS + bE
     = bS ⌈log30 bS⌉ + bE ⌈log30 bE⌉ + bS + bE
     = 1,000 × 3 + 2,000 × 3 + 1,000 + 2,000
     = 12,000
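The Case 2 arithmetic can be checked directly, assuming the sort cost b·⌈log_fanin b⌉ with merge fan-in 30 (N = 31 buffers):

```python
from math import ceil, log

def sort_cost(b, fanin=30):
    """Sort cost under the notes' model: b pages times number of passes."""
    return b * ceil(log(b, fanin))

bS, bE = 1_000, 2_000
total = sort_cost(bS) + sort_cost(bE) + bS + bE   # sort both, then merge
```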
... Sort-Merge Join on Example | 51/87 |
Case 3: Join[id=stude](Student,Enrolled)
... Sort-Merge Join on Example | 52/87 |
Case 3 (continued) ...
If E is outer relation:
Sidetrack 2: More on Iterators | 53/87 |
The above description of iterators covers only a simple, unconditional relation scan.
Real systems use iterators uniformly for all plan operators (scan, select, sort, join, ...).
... Sidetrack 2: More on Iterators | 54/87 |
Requires a more general definition of execution state:
typedef struct {
    Oper op;       // operation (sel,sort,join,...)
    Reln r1;       // first relation
    Reln r2;       // second relation (if any)
    Buffer *bufs;  // buffers used by operation
    int curp1;     // index of current page for r1
    int curr1;     // index of current record in page
    int curp2;     // index of current page for r2
    int curr2;     // index of current record in page
    Cond cond;     // condition for choosing tuple(s)
} Iterator;
For PostgreSQL details, see include/nodes/execnodes.h
Hash Join | 55/87 |
Hash Join | 56/87 |
Basic idea:
  hash tuples on the join attribute; tuples that could satisfy R.i=S.j must hash to the same bucket
  ⇒ only compare tuples that fall in matching buckets
Applicable only for equality join conditions such as R.i=S.j.
Simple Hash Join | 57/87 |
Basic approach:
  build an in-memory hash table for R, hashing each tuple on R.i
  scan S; for each tuple s, probe bucket h(s.j) for matching R tuples
... Simple Hash Join | 58/87 |
Data flow:
... Simple Hash Join | 59/87 |
Algorithm for ideal simple hash join Join[R.i=S.j](R,S):
for each tuple r in relation R {
    insert r into buffer[h(r.i)]
}
for each tuple s in relation S {
    for each tuple r in buffer[h(s.j)] {
        if ((r,s) satisfies join condition) {
            add (r,s) to result
        }
    }
}
Cost = bR + bS (minimum possible cost)
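The ideal algorithm in Python, with a dict of lists standing in for the buffer[h(...)] array (assumes the whole table fits in memory, exactly as the ideal case does):

```python
from collections import defaultdict

def simple_hash_join(R, S, r_key, s_key):
    table = defaultdict(list)
    for r in R:                          # build: insert r into buffer[h(r.i)]
        table[r_key(r)].append(r)
    result = []
    for s in S:                          # probe: examine only buffer[h(s.j)]
        for r in table.get(s_key(s), []):
            result.append((r, s))
    return result

res = simple_hash_join([(1, "r1"), (2, "r2")],
                       [(2, "s1"), (2, "s2"), (9, "s3")],
                       lambda r: r[0], lambda s: s[0])
```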
... Simple Hash Join | 60/87 |
Consider that we have N buffers available:
  the ideal algorithm assumes the whole hash table for R fits in the in-memory buffers
  (if bR ≤ N-3 buffers, there is no need to hash at all; block nested loop suffices)
In practice, the size of the hash table bhR > bR
  (buckets are not uniformly filled, e.g. because of data skew)
  ⇒ the hash table for R is even less likely to fit in memory
This can be handled by a variation on the above algorithm:
  whenever a hash bucket fills, scan S against the partial hash table, then clear it and continue
... Simple Hash Join | 61/87 |
Algorithm for realistic simple hash join Join[R.i=S.j](R,S):
for each tuple r in relation R {
    if (buffer[h(r.i)] is full) {
        for each tuple s in relation S {
            for each tuple rr in buffer[h(s.j)] {
                if ((rr,s) satisfies join condition) {
                    add (rr,s) to result
                }
            }
        }
        clear all hash table buffers
    }
    insert r into buffer[h(r.i)]
}
// after the scan of R, one final pass over S joins the remaining buffered tuples
Note: requires multiple passes over the S relation.
... Simple Hash Join | 62/87 |
Cost depends on N and on properties of the data/hash function.
Best case:  no bucket ever fills ⇒ Cost = bR + bS
Worst case: a bucket fills after almost every page of R ⇒ Cost ≈ bR + bR bS
Grace Hash Join | 63/87 |
Basic approach:
  partition both relations on the join attribute, using the same hash function h
  join each partition of R with the corresponding partition of S
Similar in spirit to sort-merge join, except:
  partitioning (hashing) replaces sorting
  only tuples in corresponding partitions can possibly join
Requires enough buffer space to hold the largest partition of the inner relation.
... Grace Hash Join | 64/87 |
Partition phase (applied to each of R and S):
  scan the relation, hashing each tuple on the join attribute with h
  append each tuple to the partition file selected by its hash value
  (1 input buffer; N-1 output buffers, one per partition)
... Grace Hash Join | 65/87 |
Probe/join phase:
  for each partition p: build an in-memory hash table for R's partition p,
  using a second hash function h2; then scan S's partition p,
  probing the hash table for matching tuples
The second hash function (h2) speeds up the matching process within a partition.
Without it, we would need to scan the entire R partition for each record in the S partition.
... Grace Hash Join | 66/87 |
Abstract algorithm for Join[R.i=S.j](R,S):
// assume h(val) generates [0..N-2]
// assume h2(val) generates [0..N-3]

// Partition phase (each relation -> N-1 partitions)
// 1 input buffer, N-1 output buffers
for each tuple r in relation R
    add r to partition h(r.i) in output file R'
for each tuple s in relation S
    add s to partition h(s.j) in output file S'
...
... Grace Hash Join | 67/87 |
Abstract algorithm for Join[R.i=S.j](R,S) (cont.)
// Probe/join phase
// 1 input buffer for S, 1 output buffer
// N-2 buffers to build hash table for R partition
for each partition p = 0 .. N-2 {
    // Build in-memory hash table for partition p of R'
    for each tuple r in partition p of R'
        insert r into buffer h2(r.i)
    // Scan partition p of S', probing for matching tuples
    for each tuple s in partition p of S' {
        b = h2(s.j)
        for all matching tuples r in buffer b
            add (r,s) to result
    }
}
... Grace Hash Join | 68/87 |
Concrete algorithm for partitioning:
Buffer iBuf, oBuf[N-1];
File inf, outf[N-1];
char rel[100];
int i, r, h, ip, op[N-1];
Tuple tup;

for (i = 0; i < N-1; i++) {
    clearBuf(oBuf[i]);  op[i] = 0;
    sprintf(rel, "%s%d", "Rel", i);
    outf[i] = openFile(fileName(rel), WRITE);
}
inf = openFile(fileName("Rel"), READ);
for (ip = 0; ip < nPages(inf); ip++) {
    readPage(inf, ip, iBuf);
    for (r = 0; r < nTuples(iBuf); r++) {
        tup = getTuple(iBuf, r);
        h = hash(tup.i, N-1);
        addTuple(oBuf[h], tup);
        if (isFull(oBuf[h])) {
            writePage(outf[h], op[h]++, oBuf[h]);
            clearBuf(oBuf[h]);
        }
    }
}
// flush any partially-filled output buffers
for (i = 0; i < N-1; i++) {
    if (nTuples(oBuf[i]) > 0)
        writePage(outf[i], op[i]++, oBuf[i]);
}
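Both phases can be sketched in Python: h partitions the tuples, and a per-partition dict plays the role of the h2 hash table (integer keys keep the partitioning deterministic; this is a sketch of the idea, not of the buffer-level algorithm):

```python
def grace_hash_join(R, S, r_key, s_key, nparts=4):
    h = lambda v: v % nparts                 # partitioning hash (integer keys)
    Rp = [[] for _ in range(nparts)]
    Sp = [[] for _ in range(nparts)]
    for r in R:                              # partition phase
        Rp[h(r_key(r))].append(r)
    for s in S:
        Sp[h(s_key(s))].append(s)
    result = []
    for p in range(nparts):                  # probe/join phase, per partition
        table = {}                           # "h2" hash table for R's partition
        for r in Rp[p]:
            table.setdefault(r_key(r), []).append(r)
        for s in Sp[p]:
            for r in table.get(s_key(s), []):
                result.append((r, s))
    return result

res = grace_hash_join([(1, "r1"), (2, "r2"), (5, "r3")],
                      [(1, "s1"), (5, "s2"), (5, "s3")],
                      lambda r: r[0], lambda s: s[0])
```

Only tuples in corresponding partitions are ever compared, which is what bounds the cost at roughly three passes over each relation.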
... Grace Hash Join | 69/87 |
Cost of grace hash join:
  partition phase: read and write both relations ⇒ 2 (bR + bS)
  probe/join phase: read both partitioned relations ⇒ bR + bS
  Total Cost = 3 (bR + bS)
... Grace Hash Join | 70/87 |
The above cost analysis assumes:
  each partition of R fits into the in-memory hash table during the probe phase
  (roughly bR/(N-1) ≤ N-2, which requires N to be around √bR or larger)
  a reasonably uniform hash function, with no badly skewed partitions
... Grace Hash Join | 71/87 |
Possibilities for dealing with "over-long" partitions of R:
  recursively partition the oversized partition using a different hash function
  fall back to block nested loop join for that pair of partitions
Grace Hash Join on Example | 72/87 |
For the example Join[id=stude](Student,Enrolled):

Cost = 3 (bS + bE)
     = 3 × (1,000 + 2,000) = 9,000
Hybrid Hash Join | 73/87 |
An optimisation applicable if we have √bR < N < bR+2 buffers:
  keep the first partition of R entirely in memory,
  while writing the other partitions to disk as in grace hash join
  ⇒ tuples in the in-memory partition avoid the extra write and read
... Hybrid Hash Join | 74/87 |
Some observations:
  tuples in the in-memory partition of R incur no partition-file I/O at all
  S tuples hashing to that partition are joined immediately as S is scanned
  the larger the in-memory partition, the greater the saving
... Hybrid Hash Join | 75/87 |
Need to choose appropriate m and k to minimise cost, where
  m = number of buffers devoted to the in-memory partition
  k = number of partitions written to disk
subject to the available buffers (m + k + input/output buffers ≤ N).
... Hybrid Hash Join | 76/87 |
Data flow for hybrid hash join (partitioning R):
... Hybrid Hash Join | 77/87 |
Data flow for hybrid hash join (partitioning S):
After this, proceed as for grace hash join.
... Hybrid Hash Join | 78/87 |
Cost of hybrid hash join:
  pages hashing to the in-memory partition are neither written out nor re-read
  ⇒ Cost = 3 (bR + bS) minus 2 × (the number of pages falling in the in-memory partition)
How to determine k:
  make each disk partition just small enough to fit the in-memory hash table during its probe phase
  a smaller k leaves more buffers for the in-memory partition, increasing the saving
Hybrid Hash Join on Example | 79/87 |
Case 1: N = 100 buffers, bR = 1000
Pointer-based Join | 80/87 |
Conventional join algorithms set up R ↔ S connections via attribute values.
Join could be performed faster if direct connections already existed.
... Pointer-based Join | 81/87 |
The basic idea for pointer-based join is:
for each tuple r in relation R {
    for each rid associated with r {
        fetch tuple s from S via rid
        add (r,s) to result relation
    }
}
Often, each R tuple is associated with only one rid, so the inner loop is not needed.
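A sketch of rid-based fetching: S_store maps rids to S tuples, and each R tuple carries the rids of its matches (all names and data here are hypothetical):

```python
# rid -> S tuple (stands in for S's data file)
S_store = {101: ("COMP1011", "Databases"), 102: ("COMP3311", "DBMS")}

# each R tuple: (payload, list of rids of matching S tuples)
R = [("Chen", [101]), ("Smith", [101, 102])]

result = []
for payload, rids in R:
    for rid in rids:
        s = S_store[rid]          # direct fetch via rid: no value comparison
        result.append((payload, s))
```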
... Pointer-based Join | 82/87 |
The advantage over value-based joins:
  no value comparisons, hashing, or index searches are needed
  each matching S tuple is located by a direct fetch via its stored rid
The dominant cost is the fetches: up to one page read per rid.
General Join Conditions | 83/87 |
Above examples all used simple equijoin e.g. Join[i=j](R,S).
For theta-join, e.g. Join[i<j](R,S):
  hash-based methods are not applicable (no equality to hash on)
  sort-merge is of limited use; block nested loop join is usually the best choice
... General Join Conditions | 84/87 |
For multi-equality (pmr) join, e.g. Join[i=j ∧ k=l](R,S):
  hash join and sort-merge join still apply, treating the attribute combinations as composite join keys
  a multi-attribute index (if one exists) can support index nested loop join
Join Summary | 85/87 |
No single join algorithm is superior in some overall sense.
Which algorithm is best for a given query depends on:
  sizes of the relations being joined, and the number of available memory buffers
  availability of indexes, and whether the data is already sorted
  distribution of the join attribute values
E.g. Join[id=stude](Student,Enrolled): 3,000 ... 2,000,000
In some cases, it may be worth modifying access methods "on the fly" (e.g. add index) to enable an efficient join algorithm.
... Join Summary | 86/87 |
Comparison of join costs (from Zeller/Gray VLDB90, assumes bR = bS = b)
Join in PostgreSQL | 87/87 |
Join implementations are under: src/backend/executor
PostgreSQL supports three kinds of join:
  nested loop join  (nodeNestloop.c)
  sort-merge join   (nodeMergejoin.c)
  hash join         (nodeHashjoin.c)
Produced: 30 Apr 2020