❖ Week 08 Thursday |
❖ Assignment 2 |
Things to note:
db2
helpers.*
create view
❖ Relational Algebra |
Relational algebra (RA) can be viewed as ...
❖ Relational Algebra (cont) |
Core relational algebra operations:
❖ Relational Algebra (cont) |
Select, project, join provide a powerful set of operations for building relations and extracting interesting data from them.
Adding set operations and renaming makes RA complete.
❖ Notation |
Standard treatments of relational algebra use Greek symbols.
We use the following notation (because it is easier to reproduce):
Standard Notation |
Our Notation |
|||
σexpr(Rel) | Sel[expr](Rel) | |||
πA,B,C(Rel) | Proj[A,B,C](Rel) | |||
Rel1 ⋈expr Rel2 | Rel1 Join[expr] Rel2 | |||
ρschemaRel | Rename[schema](Rel) |
For other operations (e.g. set operations) we adopt the standard notation.
Except when typing in a text file, where *
+
❖ Describing RA Operations |
We define the semantics of RA operations using
❖ Describing RA Operations (cont) |
All RA operators return a result of type relation.
For convenience, we can name a result and use it later.
E.g.
Temp = R op1 S op2 T
Res = Temp op3 Z
-- which is equivalent to
Res = (R op1 S op2 T) op3 Z
Each result is a relation with a well-defined schema
❖ Rename |
Rename provides "schema mapping".
If expression E returns a relation R(A1, A2, ... An), then
gives a relation called S
The only thing that Rename changes is the schema.
❖ Rename (cont) |
Rename can be viewed as a "technical" apparatus of RA.
Can use implicit rename/project in sequences of RA operations, e.g.
-- R(a,b,c), S(c,d) Res = Rename[Res(b,c,d)](Project[b,c](Sel[a>5](R)) Join S) -- vs Tmp1 = Select[a>5](R) Tmp2 = Project[b,c](Tmp1) Tmp3 = Rename[Tmp3(cc,d)](S) Tmp4 = Tmp2 Join[c=cc] Tmp3 Res = Rename[Res(b,c,d)](Tmp4) -- vs Tmp1(b,c) = Select[a>5](R) Tmp2(cc,d) = S Res(b,c,d) = Tmp1 Join[c=cc] Tmp2
In SQL, can achieve a similar effect by defining a set of views
❖ Exercise: Rename |
Answer the following in SQL then relational algebra
addr
suburb
❖ Projection |
Projection returns a set of tuples containing a subset of the attributes in the original relation.
X specifies a subset of the attributes of R.
Note that removing key attributes can produce duplicates.
In RA, duplicates are removed from the result set.
Result size: |πX(r)| ≤ |r| Result schema: R'(X)
Algorithmic view:
result = {} for each tuple t in relation r result = result ∪ {t[X]}
❖ Exercise: Projection |
Answer the following in SQL then relational algebra
❖ Selection |
Selection returns a subset of the tuples in a relation r(R) that satisfy a specified condition C.
C is a boolean expression on attributes in R.
Result size: |σC(r)| ≤ |r|
Result schema: same as the schema of r (i.e. R)
Algorithmic view:
result = {} for each tuple t in relation r if (C(t)) { result = result ∪ {t} }
❖ Exercise: Selection |
Answer the following in SQL and relational algebra
❖ Union |
Union combines two compatible relations into a single relation via set union of sets of tuples.
Compatibility = both relations have the same schema
Result size: |r1 ∪ r2| ≤ |r1| + |r2| Result schema: R
Algorithmic view:
result = r1 for each tuple t in relation r2 result = result ∪ {t}
❖ Intersection |
Intersection combines two compatible relations into a single relation via set intersection of sets of tuples.
Uses same notion of relation compatibility as union.
Result size: |r1 ∪ r2| ≤ min(|r1|,|r2|) Result schema: R
Algorithmic view:
result = {} for each tuple t in relation r1 if (t ∈ r2) { result = result ∪ {t} }
❖ Exercise: Union/Intersection |
Answer the following in SQL then relational algebra
❖ Difference |
Difference finds the set of tuples that exist in one relation but do not occur in a second compatible relation.
Uses same notion of relation compatibility as union.
Note: tuples in r2 but not r1 do not appear in the result
result = {} for each tuple t in relation r1 if (!(t ∈ r2)) { result = result ∪ {t} }
❖ Difference (cont) |
Answer the following in SQL then relational algebra
❖ Product |
Product (Cartesian product) combines information from two relations pairwise on tuples.
Each tuple in the result contains all attributes from r and s, possibly with some fields renamed to avoid ambiguity.
If t1 = (A1...An) and t2 = (B1...Bn) then (t1:t2) = (A1...An,B1...Bn)
Note: relations do not have to be union-compatible.
Result size is large: |r × s| = |r|.|s| Schema: R∪S
Algorithmic view:
result = {} for each tuple t1 in relation r for each tuple t2 in relation s result = result ∪ {(t1:t2)}
❖ Natural Join |
Natural join is a specialised product:
The natural join of relations r(R) and s(S) is defined as:
where match = t1[K] = t2[K] ∧ t1[L] = t2[L] ∧ t1[M] = t2[M]
Algorithmic view:
result = {} for each tuple t1 in relation r for each tuple t2 in relation s if (matches(t1,t2)) result = result ∪ {combine(t1,t2)}
❖ Theta Join |
The theta join is a specialised product containing only pairs that match on a supplied condition C.
Examples: (r1 Join[B>E] r2) ... (r1 Join[E<D∧C=G] r2)
All attribute names are required to be distinct (cf natural join)
Can be defined in terms of other RA operations:
Note that r ⋈true s = r × s.
❖ Theta Join (cont) |
Example theta join:
(Theta join is the most frequently-used join in SQL queries)
❖ Theta Join (cont) |
Querying with relational algebra (join) ...
NewtownBars(nbar) = Sel[addr=Newtown](Bars) Tmp = Frequents Join[bar=nbar] NewtownBars Result(drinker) = Proj[drinker](Tmp)
CarltonBeers = Sel[manf=Carlton](Beers) Tmp = Likes Join[beer=name] CarltonBeers Result(drinker) = Proj[drinker)Tmp
Reminder: projection removes duplicates
❖ Exercise: Mapping SQL to RelAlg |
Give sequences of relational algebra operations to solve each of these:
❖ Division |
Consider two relation schemas R and S where S ⊂ R.
The division operation is defined on instances r(R), s(S) as:
where satisfy = ∀ ts ∈ S ( ∃ tr ∈ R ( tr[S] = ts ∧ tr[R-S] = t ) )
Operationally:
❖ Division (cont) |
Querying with relational algebra (division) ...
Division handles queries that include the notion "for all".
E.g. Which beers are sold in all bars?
We can answer this as follows:
❖ DBMS Architecture |
COMP3311 is not a course on DBMS Architecture
But knowing just a little about how DBMSs work can help
❖ DBMS Architecture (cont) |
Our view of the DBMS so far ...
A machine to process SQL queries.
❖ DBMS Architecture (cont) |
One view of DB engine: "relational algebra virtual machine"
projection (π) | join (⋈, ×) | |||
intersection (∩) | difference (-) | |||
insert | delete |
For each of these operations:
❖ Mapping SQL to RA |
Mapping SQL to relational algebra, e.g.
-- schema: R(a,b) S(c,d) select a as x from R join S on (b=c) where d = 100 -- could be mapped to Tmp1(a,b,c,d) = R Join[b=c] S Tmp2(a,b,c,d) = Sel[d=100](Tmp1) Tmp3(a) = Proj[a](Tmp2) Res(x) = Rename[Res(x)](Tmp3)
In general:
SELECT
WHERE
FROM
❖ Exercise: A Better SQL to RA Mapping |
On the previous slide, we translated an SQL query as follows:
-- schema: R(a,b) S(c,d) select a as x from R join S on (b=c) where d = 100 -- could be mapped to Tmp1(a,b,c,d) = R Join[b=c] S Tmp2(a,b,c,d) = Sel[d=100](Tmp1) Tmp3(a) = Proj[a](Tmp2) Res(x) = Rename[Res(x)](Tmp3)
Suggest a more efficient approach (based on likely size of intermediate results)
❖ Mapping Example |
The query: Courses with more than 100 students in them?
Can be expressed in SQL as
select s.id, s.code from Course c, Subject s, Enrolment e where c.id = e.course and c.subject = s.id group by s.id, s.code having count(*) > 100;
and might be compiled to
Result = Project[id,code]( GroupSelect[size>100] ( GroupBy[id,code] ( Subject Join[s.id=c.subject] (Course Join[c.id=e.course] Enrolment) ) ) )
❖ Exercise: Relational Operation Plan |
So far, we have been expressing RA evaluation as follows:
Tmp(a,b,c,...) = Op[x](R) or R Op[x] S
Each statement involves a single relational algebra operation.
Render the RA from the previous slide in this form.
Reminder:
Result = Project[id,code]( GroupSelect[size>100] ( GroupBy[id,code] ( Subject Join[s.id=c.subject] (Course Join[c.id=e.course] Enrolment) ) ) )
❖ Query Cost Estimation |
The cost of evaluating a query is determined by
❖ Query Cost Estimation (cont) |
An execution plan is a sequence of relational operations.
Consider execution plans for: σc (R ⋈d S ⋈e T)
tmp1 := hash_join[d](R,S) tmp2 := sort_merge_join[e](tmp1,T) result := binary_search[c](tmp2)
or
tmp1 := sort_merge_join[e](S,T) tmp2 := hash_join[d](R,tmp1) result := linear_search[c](tmp2)
or
tmp1 := btree_search[c](R) tmp2 := hash_join[d](tmp1,S) result := sort_merge_join[e](tmp2)
All produce same result, but have different costs.
❖ Implementations of RA Ops |
Sorting (quicksort, etc. are not applicable)
❖ Query Optimisation |
What is the "best" method for evaluating a query?
Generally, best = lowest cost = fastest evaluation time
Cost is measured in terms of pages read/written
❖ Query Optimisation (cont) |
A DBMS query optimiser works as follows:
Input: relational algebra expression Output: execution plan (sequence of RA ops) bestCost = INF; bestPlan = none while (more possible plans) { plan = produce a new evaluation plan cost = estimated_cost(plan) if (cost < bestCost) { bestCost = cost; bestPlan = plan } } return bestPlan
Typically, there are very many possible plans
❖ DB Application Performance |
In order to make DB applications efficient, it is useful to know:
(which queries, updates, inserts and how frequently is each one performed)
(data structures and algorithms for select, project, join, sort, ...)
(in terms of the amount of data transferred between memory and disk)
Achieve by using indexes and avoiding certain SQL query structures
❖ DB Application Performance (cont) |
Application programmer choices that affect query cost:
Produced: 6 Apr 2023