COMP9315 Week 07 Thursday Lecture

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [0/61]
❖ Things To Note

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [1/61]
❖ Implementing Join

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [2/61]
❖ Join Methods

So far, we have looked at ...

Others to look at ...
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [3/61]
❖ Example Join Query

List names of students in all subjects, arranged by subject.

SQL query to provide this information:

select E.subj, S.name
from   Student S, Enrolled E
where  S.id = E.stude
order  by E.subj, S.name;

And its relational algebra equivalent:

Sort[subj] ( Project[subj,name] ( Join[id=stude](Student,Enrolled) ) )


To simplify formulae, we denote Student by S and Enrolled by E
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [4/61]
❖ Example Join Query (cont)

Some database statistics:

Sym Meaning Value
rS # student records 20,000
rE # enrollment records 80,000
cS Student records/page 20
cE Enrolled records/page 40
bS # data pages in Student 1,000
bE # data pages in Enrolled 2,000
rOut # tuples in result 80,000

Also, in cost analyses below, N = number of memory buffers.

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [5/61]
❖ Sort-Merge Join

Basic approach:

Advantages: Disadvantages:
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [6/61]
❖ Sort-Merge Join (cont)

Method requires several cursors to scan sorted relations:

[Diagram:Pics/join/sort-merge.png]

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [7/61]
❖ Sort-Merge Join (cont)

Algorithm using query iterators/scanners:

Query ri, si;  Tuple r,s;

ri = startScan("SortedR");
si = startScan("SortedS");
while ((r = nextTuple(ri)) != NULL
       && (s = nextTuple(si)) != NULL) {
    // align cursors to start of next common run
    while (r != NULL && r.i < s.j)
           r = nextTuple(ri);
    if (r == NULL) break;
    while (s != NULL && r.i > s.j)
           s = nextTuple(si);
    if (s == NULL) break;
    // must have (r.i == s.j) here
...

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [8/61]
❖ Sort-Merge Join (cont)

...
    // remember start of current run in S
    TupleID startRun = scanCurrent(si)
    // scan common run, generating result tuples
    while (r != NULL && r.i == s.j) {
        while (s != NULL and s.j == r.i) {
            addTuple(outbuf, combine(r,s));
            if (isFull(outbuf)) {
                writePage(outf, outp++, outbuf);
                clearBuf(outbuf);
            }
            s = nextTuple(si);
        }
        r = nextTuple(ri);
        setScan(si, startRun);
    }
}

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [9/61]
❖ Sort-Merge Join (cont)

Buffer requirements:

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [10/61]
❖ Sort-Merge Join (cont)

Cost of sort-merge join.

Step 1: sort each relation   (if not already sorted):

Step 2: merge sorted relations:
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [11/61]
❖ Sort-Merge Join on Example

Case 1:   Join[id=stude](Student,Enrolled)

 
Cost = sort(S) + sort(E) + bS + bE
= 2bS(1+log31(bS/32)) + 2bE(1+log31(bE/32)) + bS + bE
= 2×1000×(1+2) + 2×2000×(1+2) + 1000 + 2000
= 6000 + 12000 + 1000 + 2000
= 21,000
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [12/61]
❖ Sort-Merge Join on Example (cont)

Case 2:   Join[id=stude](Student,Enrolled)

For the above, no re-scans of E runs are ever needed

Cost  =  2,000 + 1,000  =  3,000   (regardless of which relation is outer)

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [13/61]
❖ Exercise: Sort-merge Join Cost

Consider executing Join[i=j](S,T)  with the following parameters:

Calculate the cost for evaluating the above join
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [14/61]
❖ Hash Join

Basic idea:

Requires sufficent memory buffers Other issues: Variations:   simple,   grace,   hybrid.
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [15/61]
❖ Simple Hash Join

Basic approach:

No overflows allowed in in-memory hash table
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [16/61]
❖ Simple Hash Join (cont)

Data flow:


[Diagram:Pics/join/simple-hash.png]

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [17/61]
❖ Simple Hash Join (cont)

Algorithm for simple hash join Join[R.i=S.j](R,S):

for each tuple r in relation R {
   if (buffer[h(R.i)] is full) {
      for each tuple s in relation S {
         for each tuple rr in buffer[h(S.j)] {
            if ((rr,s) satisfies join condition) {
               add (rr,s) to result
      }  }  }
      clear all hash table buffers
   }
   insert r into buffer[h(R.i)]
}

Best case:  # join tests  ≤  rS.cR    (cf. nested-loop  rS.rR)

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [18/61]
❖ Simple Hash Join (cont)

Cost for simple hash join ...

Best case: all tuples of R fit in the hash table

Good case: refill hash table m times (where m ≥ ceil(bR / (N-2)) ) Worst case: everything hashes to same page
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [19/61]
❖ Exercise: Simple Hash Join Cost

Consider executing Join[i=j](R,S) with the following parameters:

Calculate the cost for evaluating the above join
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [20/61]
❖ Grace Hash Join

Basic approach (for R ⋈ S ):

For best-case cost (O(bR + bS) ): If < √bR buffers or poor hash distribution
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [21/61]
❖ Grace Hash Join (cont)

Partition phase (applied to both R and S):

[Diagram:Pics/join/grace-hash1.png]

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [22/61]
❖ Grace Hash Join (cont)

Probe/join phase:

[Diagram:Pics/join/hash2.png]

The second hash function (h2) simply speeds up the matching process.
Without it, would need to scan entire R partition for each record in S partition.

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [23/61]
❖ Grace Hash Join (cont)

Cost of grace hash join:

Total Cost   =   2bR + 2bS + bR + bS   =   3 (bR + bS)
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [24/61]
❖ Exercise: Grace Hash Join Cost

Consider executing Join[i=j](R,S) with the following parameters:

Calculate the cost for evaluating the above join
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [25/61]
❖ Exercise: Grace Hash Join Cost

Consider executing Join[i=j](R,S) with the following parameters:

Calculate the cost for evaluating the above join
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [26/61]
❖ Hybrid Hash Join

A variant of grace join if we have √bR < N < bR+2

When we come to scan and partition S relation Final phase is same as grace join, but with only k partitions.

Comparison:

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [27/61]
❖ Hybrid Hash Join (cont)

First phase of hybrid hash join with m=1 (partitioning R):

[Diagram:Pics/join/hyb-hash1.png]

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [28/61]
❖ Hybrid Hash Join (cont)

Next phase of hybrid hash join with m=1 (partitioning S):

[Diagram:Pics/join/hyb-hash2.png]

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [29/61]
❖ Hybrid Hash Join (cont)

Final phase of hybrid hash join with m=1 (finishing join):

[Diagram:Pics/join/hyb-hash3.png]

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [30/61]
❖ Hybrid Hash Join (cont)

Some observations:

Best-cost scenario: Other notes:
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [31/61]
❖ Exercise: Hybrid Hash Join Cost

Consider executing Join[i=j](R,S) with the following parameters:

Calculate the cost for evaluating the above join
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [32/61]
❖ Join Summary

No single join algorithm is superior in some overall sense.

Which algorithm is best for a given query depends on:

Choosing the "best" join algorithm is critical because the cost difference between best and worst case can be very large.

E.g.   Join[id=stude](Student,Enrolled):   3,000 ... 2,000,000

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [33/61]
❖ Join in PostgreSQL

Join implementations are under: src/backend/executor

PostgreSQL suports three kinds of join:

Query optimiser chooses appropriate join, by considering
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [34/61]
❖ Exercise: Outer Join?

Above discussion was all in terms of theta inner-join.

How would the algorithms above adapt to outer join?

Consider the following ...

select *
from   R left outer join S on (R.i = S.j)

select *
from   R right outer join S on (R.i = S.j)

select *
from   R full outer join S on (R.i = S.j)

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [35/61]
❖ Query Evaluation

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [36/61]
❖ Query Evaluation


[Diagram:Pics/qproc/dbmsarch.png]

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [37/61]
❖ Query Evaluation (cont)

A query in SQL:

A query evaluator/processor :
Some DBMSs can save query plans for later re-use.
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [38/61]
❖ Query Evaluation (cont)

Internals of the query evaluation "black-box":

[Diagram:Pics/qproc/qproc0.png]

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [39/61]
❖ Query Evaluation (cont)

DBMSs provide several "flavours" of each RA operation.

For example:

Similarly, π  and  have versions to match specific query types.

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [40/61]
❖ Query Evaluation (cont)

We call these specialised version of RA operations RelOps.

One major task of the query processor:

Requires the query translator/optimiser to consider RelOps are realised at execution time
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [41/61]
❖ Terminology Variations

Relational algebra expression of SQL query

Execution plan as collection of RelOps Representation of RA operators and expressions
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [42/61]
❖ Query Translation

Query translation:   SQL statement text RA expression

[Diagram:Pics/qproc/qproc1.png]

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [43/61]
❖ Query Translation

Translation step:   SQL text → RA expression

Example:

SQL: select name from Students where id=7654321;
-- is translated to
RA:  Proj[name](Sel[id=7654321]Students)

Processes:  lexer/parser,  mapping rules,  rewriting rules.

Mapping from SQL to RA may include some optimisations, e.g.

select * from Students where id = 54321 and age > 50;
-- is translated to
Sel[age>50](Sel[id=54321]Students)
-- rather than ... because of index on id
Sel[id=54321&age>50](Students)

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [44/61]
❖ Parsing SQL

Parsing task is similar to that for programming languages.

Language elements:

PostgreSQL parser ...

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [45/61]
❖ SQL → RA

Remaining steps in processing an SQL statement

Cost-based optimisation:
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [46/61]
❖ Expression Rewriting Rules

Since RA is a well-defined formal system

Expression transformation based on such rules can be used
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [47/61]
❖ Relational Algebra Laws

Commutative and Associative Laws:

Selection splitting (where c and d are conditions):
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [48/61]
❖ Relational Algebra Laws (cont)

Selection pushing   ( σc(R ∪ S) and σc(R ∪ S) ):

Selection pushing with join ... If condition contains attributes from both R and S:
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [49/61]
❖ Relational Algebra Laws (cont)

Rewrite rules for projection ...

All but last projection can be ignored:

Projections can be pushed into joins:

where
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [50/61]
❖ Relational Algebra Laws (cont)

Subqueries ⇒ convert to a join

Example:   (on schema Courses(id,code,...), Enrolments(cid,sid,...), Students(id,name,...)

select c.code, count(*)
from   Courses c
where  c.id in (select cid from Enrolments)
group  by c.code

becomes

select c.code, count(*)
from   Courses c join Enrolments e on c.id = e.cid
group  by c.code

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [51/61]
❖ Relational Algebra Laws (cont)

But not all subqueries can be converted to join, e.g.

select e.sid as student_id, e.cid as course_id
from   Enrolments e
where  e.sid = (select max(id) from Students)

has to be evaluated as

Val = max[id]Students

Res = π(sid,cid)(σsid=ValEnrolments)

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [52/61]
❖ Query Optimisation

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [53/61]
❖ Query Optimisation

Query optimiser:   RA expression efficient evaluation plan

[Diagram:Pics/qproc/qproc2.png]

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [54/61]
❖ Query Optimisation (cont)

Query optimisation is a critical step in query evaluation.

The query optimiser

"Optimisation" is a misnomer since query optimisers
Observed Query Time = Planning time + Evaluation time
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [55/61]
❖ Query Optimisation (cont)

Why do we not generate optimal query execution plans?

Finding an optimal query plan ...

Even for relatively small query, search space is very large.

Compromise:

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [56/61]
❖ Approaches to Optimisation

Three main classes of techniques developed:

All driven by aim of minimising (or at least reducing) "cost".

Real query optimisers use a combination of algrebraic+physical.

Semantic QO is good idea, but expensive/difficult to implement.

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [57/61]
❖ Approaches to Optimisation (cont)

Example of optimisation transformations:

[Diagram:Pics/qproc/query-transform.png]


For join, may also consider sort/merge join and hash join.

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [58/61]
❖ Cost-based Query Optimiser

Approximate algorithm for cost-based optimisation:

translate SQL query to RAexp
for enough transformations RA' of RAexp {
   while (more choices for RelOps) {
      Plan = {};  i = 0;  cost = 0
      for each node e of RA' (recursively) {
         ROp = select RelOp method for e
         Plan = Plan ∪ ROp
         cost += Cost(ROp) // using child info
      }
      if (cost < MinCost)
         { MinCost = cost;  BestPlan = Plan }
   }
}

Heuristics: push selections down, consider only left-deep join trees.

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [59/61]
❖ Exercise: Alternative Join Plans

Consider the schema

Students(id,name,....)   Enrol(student,course,mark)
Staff(id,name,...)    Courses(id,code,term,lic,...)

the following query on this schema

select c.code, s.id, s.name
from   Students s, Enrol e, Courses c, Staff f
where  s.id=e.student and e.course=c.id
       and c.lic=f.id and c.term='11s2'
       and f.name='John Shepherd'

Show some possible evaluation orders for this query.

COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [60/61]
❖ Cost Models and Analysis

The cost of evaluating a query is determined by:

Analysis of costs involves estimating:
COMP9315 24T1 ♢ Week 7 Thursday Lecture ♢ [61/61]


Produced: 28 Mar 2024