COMP3311 Week 8 Thursday Lecture

Week 08 Thursday
Assignment 2
Relational Algebra
Notation
Describing RA Operations
Example Database #1
Example Database #2
Rename
Exercise: Rename
Projection
Exercise: Projection
Selection
Exercise: Selection
Union
Intersection
Exercise: Union/Intersection
Difference
Product
Natural Join
Theta Join
Exercise: Mapping SQL to RelAlg
Division
DBMS Architecture
Query Evaluation
Mapping SQL to RA
Exercise: A Better SQL to RA Mapping
Mapping Example
Exercise: Relational Operation Plan
Query Cost Estimation
Implementations of RA Ops
Query Optimisation
DB Application Performance

❖ Week 08 Thursday

In today's lecture ...

Relational Algebra, Query Execution

Things to do ...

Quiz 5 due by 23:59 Friday 7 April
Assignment 2 due by 23:59 Friday 14 April
Must understand Python/Psycopg2 by end of week

Things to note ...

Testing script and submission pages now up and fixed(!)

❖ Assignment 2

Things to note:

XX submissions so far ... 550-XX to go
last minute work ... db2 overloaded ... jas/tutors overloaded
send code in email as attachment, not as a screenshot
make sure helpers.* load without error
don't create views within Python code
write queries to answer questions like "Are there any ...?"

❖ Relational Algebra

Relational algebra (RA) can be viewed as ...

mathematical system for manipulating relations, or
data manipulation language (DML) for the relational model

Relational algebra consists of:

operands: relations, or variables representing relations
operators that map relations to relations
rules for combining operands/operators into expressions
rules for evaluating such expressions

Why is it important?

because it forms the basis for DBMS implementation
relational algebra ops are like machine code for DBMSs

❖ Relational Algebra (cont)

Core relational algebra operations:

selection: choosing a subset of tuples/rows
projection: choosing a subset of attributes/columns
product, join: combining relations
union, intersection, difference: combining relations
rename: change names of relations/attributes

Common extensions include:

aggregation, projection++, division

❖ Relational Algebra (cont)

Select, project, join provide a powerful set of operations for building relations and extracting interesting data from them.

Adding set operations and renaming makes RA complete.

❖ Notation

Standard treatments of relational algebra use Greek symbols.

We use the following notation (because it is easier to reproduce):

Operation	Standard Notation	Our Notation
Selection	σ_expr(Rel)	Sel[expr](Rel)
Projection	π_A,B,C(Rel)	Proj[A,B,C](Rel)
Join	Rel₁ ⋈_expr Rel₂	Rel₁ Join[expr] Rel₂
Rename	ρ_schema(Rel)	Rename[schema](Rel)

For other operations (e.g. set operations) we adopt the standard notation.
Except when typing in a text file, where * = intersection, + = union

❖ Describing RA Operations

We define the semantics of RA operations using

"conditional set" expressions e.g. { X | condition on X }
tuple notations:
- t[AB] (extracts attributes A and B from tuple t)
- (x,y,z) (enumerated tuples; specify attribute values)
quantifiers, set operations, boolean operators

For each operation, we also describe it operationally:

give an algorithm to compute the result, tuple-by-tuple

❖ Describing RA Operations (cont)

All RA operators return a result of type relation.

For convenience, we can name a result and use it later.

E.g.

Temp = R op₁ S op₂ T
Res  = Temp op₃ Z
-- which is equivalent to
Res  = (R op₁ S op₂ T) op₃ Z

Each result is a relation with a well-defined schema

this applies equally to intermediate results and final results

❖ Example Database #1

❖ Example Database #2

❖ Rename

Rename provides "schema mapping".

If expression E returns a relation R(A₁, A₂, ... A_n), then

Rename[S(B₁, B₂, ... B_n)](E)

gives a relation called S

containing the same set of tuples as E
but with the name of each attribute changed from A_i to B_i

Rename is like the identity function on the contents of a relation

The only thing that Rename changes is the schema.
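
A minimal Python sketch of this idea, assuming a relation is modelled as a triple (name, attribute names, set of tuples). The function `rename` and the relation `R(a,b)` are illustrative and not part of the lecture material.

```python
def rename(new_name, new_attrs, relation):
    """Rename[S(B1..Bn)](E): keep the tuples, replace the schema."""
    name, attrs, rows = relation
    if len(new_attrs) != len(attrs):
        raise ValueError("Rename must supply one new name per attribute")
    return (new_name, tuple(new_attrs), rows)

# Hypothetical relation R(a, b) with two tuples
r = ("R", ("a", "b"), {(1, "x"), (2, "y")})
s = rename("S", ["c", "d"], r)

print(s[0], s[1])      # S ('c', 'd')
print(s[2] == r[2])    # True: the contents are unchanged
```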

❖ Rename (cont)

Rename can be viewed as a "technical" apparatus of RA.

Can use implicit rename/project in sequences of RA operations, e.g.

--  R(a,b,c),  S(c,d)
Res = Rename[Res(b,c,d)](Proj[b,c](Sel[a>5](R)) Join S)
-- vs
Tmp1 = Sel[a>5](R)
Tmp2 = Proj[b,c](Tmp1)
Tmp3 = Rename[Tmp3(cc,d)](S)
Tmp4 = Tmp2 Join[c=cc] Tmp3
Res  = Rename[Res(b,c,d)](Tmp4)
-- vs
Tmp1(b,c)  = Sel[a>5](R)
Tmp2(cc,d) = S
Res(b,c,d) = Tmp1 Join[c=cc] Tmp2

In SQL, can achieve a similar effect by defining a set of views

❖ Exercise: Rename

Answer the following in SQL then relational algebra

rename the columns in the Beers relation to (beer,brewer)
rename the addr column in Drinkers as suburb
change the name of the Bars relation to Pubs

❖ Projection

Projection returns a set of tuples containing a subset of the attributes in the original relation.

π_X(r) = Proj[X](r) = { t[X] | t ∈ r }, where r(R)

X specifies a subset of the attributes of R.

Note that removing key attributes can produce duplicates.

In RA, duplicates are removed from the result set.

Result size: |π_X(r)| ≤ |r|

Result schema: R'(X)

Algorithmic view:

result = {}
for each tuple t in relation r
    result = result ∪ {t[X]}
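
The algorithmic view above can be sketched in Python, assuming a relation is a set of tuples with a separate tuple of attribute names. The helper `project` and the sample data are made up for illustration.

```python
def project(attrs, schema, rows):
    """Proj[attrs](r): keep the named attributes; the set drops duplicates."""
    idx = [schema.index(a) for a in attrs]
    return {tuple(t[i] for i in idx) for t in rows}

schema = ("a", "b", "c")                           # a is the key
r = {(1, "x", 10), (2, "x", 10), (3, "y", 20)}
result = project(["b", "c"], schema, r)

print(sorted(result))    # [('x', 10), ('y', 20)]
```

Dropping the key attribute `a` makes two tuples identical, so the result has fewer tuples than `r`.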

❖ Projection (cont)

Examples of projection:

❖ Exercise: Projection

Answer the following in SQL then relational algebra

Names of all beers
What are all of the breweries?
Names of drinkers who live in Newtown

❖ Selection

Selection returns a subset of the tuples in a relation r(R) that satisfy a specified condition C.

σ_C(r) = Sel[C](r) = { t | t ∈ r ∧ C(t) }

C is a boolean expression on attributes in R.

Result size: |σ_C(r)| ≤ |r|

Result schema: same as the schema of r (i.e. R)

Algorithmic view:

result = {}
for each tuple t in relation r
    if (C(t)) { result = result ∪ {t} }
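
A Python sketch of the same loop, assuming tuples are plain Python tuples and the condition C is passed as a function. The helper `select` and the data are illustrative only.

```python
def select(cond, rows):
    """Sel[C](r): keep exactly the tuples for which cond(t) is true."""
    result = set()
    for t in rows:
        if cond(t):
            result.add(t)
    return result

# Hypothetical relation R(a, b, c)
r = {(1, "x", 10), (6, "x", 10), (7, "y", 20)}
result = select(lambda t: t[0] > 5, r)             # Sel[a>5](R)

print(sorted(result))    # [(6, 'x', 10), (7, 'y', 20)]
```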

❖ Selection (cont)

Examples of selection:

❖ Exercise: Selection

Answer the following in SQL and relational algebra

details of all bars in The Rocks
beers made by Sierra Nevada
beers sold in the Coogee Bay Hotel

❖ Union

Union combines two compatible relations into a single relation via set union of sets of tuples.

r₁ ∪ r₂ = { t | t ∈ r₁ ∨ t ∈ r₂ }, where r₁(R), r₂(R)

Compatibility = both relations have the same schema

Result size: |r₁ ∪ r₂| ≤ |r₁| + |r₂|

Result schema: R

Algorithmic view:

result = r₁
for each tuple t in relation r₂
    result = result ∪ {t}

❖ Intersection

Intersection combines two compatible relations into a single relation via set intersection of sets of tuples.

r₁ ∩ r₂ = { t | t ∈ r₁ ∧ t ∈ r₂ }, where r₁(R), r₂(R)

Uses same notion of relation compatibility as union.

Result size: |r₁ ∩ r₂| ≤ min(|r₁|,|r₂|)

Result schema: R

Algorithmic view:

result = {}
for each tuple t in relation r₁
    if (t ∈ r₂) { result = result ∪ {t} }

❖ Intersection (cont)

Examples of union and intersection:

❖ Exercise: Union/Intersection

Answer the following in SQL then relational algebra

Bars where either John or Gernot drinks
Bars where both John and Gernot drink

❖ Difference

Difference finds the set of tuples that exist in one relation but do not occur in a second compatible relation.

r₁ - r₂ = { t | t ∈ r₁ ∧ t ∉ r₂ }, where r₁(R), r₂(R)

Uses same notion of relation compatibility as union.

Note: tuples in r₂ but not r₁ do not appear in the result

i.e. set difference != complement of set intersection

Algorithmic view:

result = {}
for each tuple t in relation r₁
    if (!(t ∈ r₂)) { result = result ∪ {t} }
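
Since relations are sets of tuples, union, intersection and difference map directly onto Python's set operators. The sketch below uses two made-up single-attribute relations with the same schema; the bar names are placeholders.

```python
# Two compatible relations, each with schema (bar)
r1 = {("Bar A",), ("Bar B",), ("Bar C",)}
r2 = {("Bar B",), ("Bar C",), ("Bar D",)}

union = r1 | r2
intersection = r1 & r2
diff12 = r1 - r2      # in r1 but not in r2
diff21 = r2 - r1      # in r2 but not in r1

print(len(union), len(intersection))    # 4 2
print(diff12, diff21)                   # {('Bar A',)} {('Bar D',)}
```

The two differences are not equal, which is the asymmetry noted for this operation.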

❖ Difference (cont)

Examples of difference:

Clearly, difference is not symmetric.

❖ Difference (cont)

Answer the following in SQL then relational algebra

Bars where John drinks and Gernot doesn't
Bars that sell VB but not New

❖ Product

Product (Cartesian product) combines information from two relations pairwise on tuples.

r × s = { (t₁ : t₂) | t₁ ∈ r ∧ t₂ ∈ s }, where r(R), s(S)

Each tuple in the result contains all attributes from r and s, possibly with some fields renamed to avoid ambiguity.

If t₁ = (A₁...A_n) and t₂ = (B₁...B_m) then (t₁:t₂) = (A₁...A_n,B₁...B_m)

Note: relations do not have to be union-compatible.

Result size is large: |r × s| = |r|.|s|

Result schema: R∪S

Algorithmic view:

result = {}
for each tuple t₁ in relation r
    for each tuple t₂ in relation s
        result = result ∪ {(t₁:t₂)}
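
A Python sketch of the nested loop above, assuming tuples are Python tuples so that (t₁:t₂) is tuple concatenation. The relations are made up for illustration.

```python
def product(r, s):
    """r × s: pair every tuple of r with every tuple of s."""
    result = set()
    for t1 in r:
        for t2 in s:
            result.add(t1 + t2)
    return result

r = {(1, "a"), (2, "b")}            # 2 tuples
s = {("x",), ("y",), ("z",)}        # 3 tuples
result = product(r, s)

print(len(result))                  # 6 = |r| * |s|
```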

❖ Product (cont)

Example of product:

❖ Natural Join

Natural join is a specialised product:

containing only pairs that match on common attributes
with one of each pair of common attributes eliminated

Consider relation schemas R(ABC..JKLM), S(KLMN..XYZ).

The natural join of relations r(R) and s(S) is defined as:

r ⋈ s = r Join s =
{ (t₁[ABC..J] : t₂[K..XYZ]) | t₁ ∈ r ∧ t₂ ∈ s ∧ match }

where match = t₁[K] = t₂[K] ∧ t₁[L] = t₂[L] ∧ t₁[M] = t₂[M]

Algorithmic view:

result = {}
for each tuple t₁ in relation r
   for each tuple t₂ in relation s
      if (matches(t₁,t₂))
         result = result ∪ {combine(t₁,t₂)}
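
One possible reading of `matches` and `combine` in Python, assuming each relation comes with a tuple of attribute names and common attributes are found by name. The function `natural_join` and the data are illustrative only.

```python
def natural_join(r_schema, r, s_schema, s):
    """r Join s: match on shared attribute names, keep one copy of each."""
    common = [a for a in r_schema if a in s_schema]
    s_only = [a for a in s_schema if a not in common]
    result = set()
    for t1 in r:
        for t2 in s:
            if all(t1[r_schema.index(a)] == t2[s_schema.index(a)] for a in common):
                result.add(t1 + tuple(t2[s_schema.index(a)] for a in s_only))
    return tuple(r_schema) + tuple(s_only), result

R, S = ("a", "b", "c"), ("c", "d")
r = {(1, "x", 10), (2, "y", 20)}
s = {(10, "p"), (10, "q"), (30, "z")}
schema, rows = natural_join(R, r, S, s)

print(schema)          # ('a', 'b', 'c', 'd')
print(sorted(rows))    # [(1, 'x', 10, 'p'), (1, 'x', 10, 'q')]
```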

❖ Natural Join (cont)

Example of natural join:

❖ Theta Join

The theta join is a specialised product containing only pairs that match on a supplied condition C.

r ⋈_C s = { (t₁ : t₂) | t₁ ∈ r ∧ t₂ ∈ s ∧ C(t₁ : t₂) },
where r(R),s(S)

Examples: (r1 Join[B>E] r2) ... (r1 Join[E<D∧C=G] r2)

All attribute names are required to be distinct (cf natural join)

Can be defined in terms of other RA operations:

r ⋈_C s = r Join[C] s = Sel[C] ( r × s )

Note that r ⋈_true s = r × s.
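
The identity Sel[C](r × s) can be checked with a short Python sketch. The attribute names A, B, E, F and the data are assumptions made for the example, and the condition is given as a function over the combined tuple.

```python
def product(r, s):
    return {t1 + t2 for t1 in r for t2 in s}

def theta_join(cond, r, s):
    """r Join[C] s, computed literally as Sel[C](r × s)."""
    return {t for t in product(r, s) if cond(t)}

r1 = {(1, 5), (2, 9)}            # r1(A, B)
r2 = {(4, "u"), (7, "v")}        # r2(E, F)

# Join[B>E]: in the combined tuple (A, B, E, F), B is t[1] and E is t[2]
result = theta_join(lambda t: t[1] > t[2], r1, r2)
print(sorted(result))    # [(1, 5, 4, 'u'), (2, 9, 4, 'u'), (2, 9, 7, 'v')]
```

A real DBMS would not normally build the full product first; this form only illustrates the definition.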

❖ Theta Join (cont)

Example theta join:

(Theta join is the most frequently-used join in SQL queries)

❖ Theta Join (cont)

Querying with relational algebra (join) ...

Who drinks in Newtown bars?

NewtownBars(nbar) = Sel[addr=Newtown](Bars)
Tmp = Frequents Join[bar=nbar] NewtownBars
Result(drinker) = Proj[drinker](Tmp)

Who drinks beers made by Carlton?

CarltonBeers = Sel[manf=Carlton](Beers)
Tmp = Likes Join[beer=name] CarltonBeers
Result(drinker) = Proj[drinker](Tmp)

Reminder: projection removes duplicates
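
The first query can be traced step by step in Python. The sketch assumes Bars(name, addr) and Frequents(drinker, bar); the bar and drinker names are invented and are not the contents of the example database.

```python
bars = {("Bar A", "Newtown"), ("Bar B", "Newtown"), ("Bar C", "Coogee")}
frequents = {("Ann", "Bar A"), ("Ann", "Bar B"), ("Bob", "Bar C"), ("Cat", "Bar B")}

# NewtownBars(nbar) = Sel[addr=Newtown](Bars), with an implicit projection
newtown_bars = {(name,) for (name, addr) in bars if addr == "Newtown"}

# Tmp = Frequents Join[bar=nbar] NewtownBars
tmp = {(d, bar, nbar)
       for (d, bar) in frequents
       for (nbar,) in newtown_bars
       if bar == nbar}

# Result(drinker) = Proj[drinker](Tmp)
result = {(d,) for (d, bar, nbar) in tmp}

print(len(tmp))          # 3
print(sorted(result))    # [('Ann',), ('Cat',)]
```

Ann appears twice in Tmp but only once in the result, because projection removes duplicates.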

❖ Exercise: Mapping SQL to RelAlg

Give sequences of relational algebra operations to solve each of these:

Find bars that serve New at the same price as the Coogee Bay Hotel charges for VB.
What is the price of New at the Coogee Bay Hotel?
What beers are sold at the same price as CBH/New?
Which bar is most popular? (Most drinkers)
Price of cheapest beer at each bar?
Which beers are sold at all bars?

❖ Division

Consider two relation schemas R and S where S ⊂ R.

The division operation is defined on instances r(R), s(S) as:

r / s = r Div s = { t | t ∈ r[R-S] ∧ satisfy }

where satisfy = ∀ t_s ∈ s ( ∃ t_r ∈ r ( t_r[S] = t_s ∧ t_r[R-S] = t ) )

Operationally:

consider each subset of tuples in r that match on t[R-S]
for this subset of tuples, take the t[S] values from each
if this covers all tuples in s, then include t[R-S] in the result

❖ Division (cont)

Example of division:

❖ Division (cont)

Querying with relational algebra (division) ...

Division handles queries that include the notion "for all".

E.g. Which beers are sold in all bars?

We can answer this as follows:

generate a relation of beers and bars where they are sold
- r1 = Proj[beer,bar](Sold)
generate a relation of all bars
- r2 = Rename[r2(bar)](Proj[name](Bars))
find which beers appear in tuples with every bar
- res = r1 Div r2
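
The final step, r1 Div r2, can be sketched in Python for the common case where r1 has two attributes and r2 has one. The helper `divide` and the sample data are illustrative assumptions, not the contents of the example database.

```python
def divide(r, s):
    """r Div s for r(x, y) and s(y): the x values paired with every y in s."""
    required = {y for (y,) in s}
    seen = {}                            # x -> set of y values seen with x
    for x, y in r:
        seen.setdefault(x, set()).add(y)
    return {(x,) for x, ys in seen.items() if required <= ys}

# r1 = Proj[beer,bar](Sold), r2 = all bars (made-up data)
r1 = {("New", "Bar A"), ("New", "Bar B"), ("New", "Bar C"),
      ("VB", "Bar A"), ("VB", "Bar C")}
r2 = {("Bar A",), ("Bar B",), ("Bar C",)}
res = divide(r1, r2)

print(res)    # {('New',)}
```

VB is missing from Bar B, so only New is paired with every bar.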

❖ DBMS Architecture

COMP3311 is not a course on DBMS Architecture (that's COMP9315)

But knowing just a little about how DBMSs work can help

to avoid/fix inefficiencies in database applications
to ensure that there are no concurrency issues

DBMSs attempt to handle these issues in ...

query processing (QP) ... methods for evaluating queries
transaction processing (TxP) ... controlling concurrency

As a programmer, you give a lot of control to the DBMS, but can

use QP knowledge to make DB applications efficient
use TxP knowledge to make DB applications safe

❖ DBMS Architecture (cont)

Our view of the DBMS so far ...

A machine to process SQL queries.

❖ DBMS Architecture (cont)

One view of DB engine: "relational algebra virtual machine"

selection (σ)	projection (π)	join (⋈, ×)
union (∪)	intersection (∩)	difference (-)
sort	insert	delete

For each of these operations:

various data structures and algorithms are available
DBMSs may provide only one, or may provide a choice

❖ DBMS Architecture (cont)

Layers in a DB Engine (Ramakrishnan's View)

❖ Query Evaluation

The path of a query through its evaluation:

❖ Mapping SQL to RA

Mapping SQL to relational algebra, e.g.

-- schema: R(a,b) S(c,d)
select a as x
from   R join S on (b=c)
where  d = 100
-- could be mapped to
Tmp1(a,b,c,d) = R Join[b=c] S
Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
Tmp3(a)       = Proj[a](Tmp2)
Res(x)        = Rename[Res(x)](Tmp3)

In general:

SELECT clause becomes projection
WHERE condition becomes selection or join
FROM clause becomes join
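
To make the mapping concrete, the four RA steps can be evaluated in Python over small made-up instances of R(a,b) and S(c,d). This is only a sketch of what each step produces, not how a DBMS evaluates the query.

```python
R = {(1, 10), (2, 20), (3, 20)}              # R(a, b)
S = {(10, 100), (20, 100), (20, 5)}          # S(c, d)

# Tmp1(a,b,c,d) = R Join[b=c] S
tmp1 = {(a, b, c, d) for (a, b) in R for (c, d) in S if b == c}
# Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
tmp2 = {t for t in tmp1 if t[3] == 100}
# Tmp3(a) = Proj[a](Tmp2)
tmp3 = {(t[0],) for t in tmp2}
# Res(x) = Rename[Res(x)](Tmp3): same tuples, attribute a is now called x
res = tmp3

print(len(tmp1), len(tmp2))    # 5 3
print(sorted(res))             # [(1,), (2,), (3,)]
```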

❖ Exercise: A Better SQL to RA Mapping

On the previous slide, we translated an SQL query as follows:

-- schema: R(a,b) S(c,d)
select a as x
from   R join S on (b=c)
where  d = 100
-- could be mapped to
Tmp1(a,b,c,d) = R Join[b=c] S
Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
Tmp3(a)       = Proj[a](Tmp2)
Res(x)        = Rename[Res(x)](Tmp3)

Suggest a more efficient approach (based on likely size of intermediate results)

❖ Mapping Example

The query: Courses with more than 100 students in them?

Can be expressed in SQL as

select   s.id, s.code
from     Course c, Subject s, Enrolment e
where    c.id = e.course and c.subject = s.id
group by s.id, s.code
having   count(*) > 100;

and might be compiled to

Result =
Project[id,code](
   GroupSelect[size>100] (
      GroupBy[id,code] (
         Subject Join[s.id=c.subject]
         (Course Join[c.id=e.course] Enrolment)
)  )  )

❖ Exercise: Relational Operation Plan

So far, we have been expressing RA evaluation as follows:

Tmp(a,b,c,...) =  Op[x](R)  or  R Op[x] S

Each statement involves a single relational algebra operation.

Render the RA from the previous slide in this form.

Reminder:

Result =
Project[id,code](
   GroupSelect[size>100] (
      GroupBy[id,code] (
         Subject Join[s.id=c.subject]
         (Course Join[c.id=e.course] Enrolment)
)  )  )

❖ Query Cost Estimation

The cost of evaluating a query is determined by

the operations specified in the query execution plan
size of relations (database relations and temporary relations)
access mechanisms (indexing, hashing, sorting, join algorithms)
size/number of main memory buffers (and replacement strategy)

Analysis of costs involves estimating:

the size of intermediate results
then, based on this, cost of secondary storage accesses

Accessing data from disk is the dominant cost in query evaluation

❖ Query Cost Estimation (cont)

An execution plan is a sequence of relational operations.

Consider execution plans for: σ_c (R ⋈_d S ⋈_e T)

tmp1   :=  hash_join[d](R,S)
tmp2   :=  sort_merge_join[e](tmp1,T)
result :=  binary_search[c](tmp2)

or

tmp1   :=  sort_merge_join[e](S,T)
tmp2   :=  hash_join[d](R,tmp1)
result :=  linear_search[c](tmp2)

or

tmp1   :=  btree_search[c](R)
tmp2   :=  hash_join[d](tmp1,S)
result :=  sort_merge_join[e](tmp2,T)

All produce same result, but have different costs.

❖ Implementations of RA Ops

Sorting (in-memory methods such as quicksort are not applicable when the data does not fit in memory)

external merge sort (cost O(N log_B N) with B memory buffers)

Selection (different techniques developed for different query types)

sequential scan (worst case, cost O(N))
index-based (e.g. B-trees, cost O(logN), tree nodes are pages)
hash-based (O(1) best case, only works for equality tests)

Join (fast joins are critical to success of relational DBMSs)

nested-loop join (cost O(N.M), buffering can reduce to O(N+M))
sort-merge join (cost O(NlogN+MlogM))
hash-join (best case cost O(N+M.N/B), with B memory buffers)
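
The effect of buffering on nested-loop join can be illustrated with a simple page-count model. The formula below is the usual textbook block nested-loop estimate and is an assumption here, not taken from these slides: with B buffers, B-2 pages hold a chunk of the outer relation, and the inner relation is scanned once per chunk. The relation sizes are made up.

```python
import math

def pages(n_tuples, tuples_per_page):
    """Number of pages needed to hold n_tuples."""
    return math.ceil(n_tuples / tuples_per_page)

def nested_loop_join_cost(pages_r, pages_s, buffers):
    """Estimated page reads: read r once, scan s once per chunk of r."""
    chunks = math.ceil(pages_r / (buffers - 2))
    return pages_r + chunks * pages_s

n = pages(10_000, 100)     # outer relation: 100 pages
m = pages(50_000, 100)     # inner relation: 500 pages

small = nested_loop_join_cost(n, m, buffers=3)      # one page of r at a time
large = nested_loop_join_cost(n, m, buffers=102)    # all of r fits in memory

print(small)    # 50100, roughly N.M
print(large)    # 600, i.e. N+M
```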

❖ Query Optimisation

What is the "best" method for evaluating a query?

Generally, best = lowest cost = fastest evaluation time

Cost is measured in terms of pages read/written

data is stored in fixed-size blocks (e.g. 4KB)
data transferred disk↔memory in whole blocks
cost of disk↔memory transfer is highest cost in system
processing in memory is very fast compared to I/O

❖ Query Optimisation (cont)

A DBMS query optimiser works as follows:

Input: relational algebra expression
Output: execution plan (sequence of RA ops)

bestCost = INF; bestPlan = none
while (more possible plans) {
   plan = produce a new evaluation plan
   cost = estimated_cost(plan)
   if (cost < bestCost) {
      bestCost = cost; bestPlan = plan
   }
}
return bestPlan

Typically, there are very many possible plans

smarter optimisers generate only a likely subset of the possible plans
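
The search loop above can be written as runnable Python. The plan space (join orders of three relations) and the cost function (size of the first intermediate result, taken as a worst-case product) are toy assumptions chosen for illustration; real optimisers use far more detailed cost models.

```python
import math
from itertools import permutations

def best_plan(plans, estimated_cost):
    """Exhaustive search: return the cheapest plan and its estimated cost."""
    best_cost, best = math.inf, None
    for plan in plans:
        cost = estimated_cost(plan)
        if cost < best_cost:
            best_cost, best = cost, plan
    return best, best_cost

# Hypothetical relation sizes (number of tuples)
sizes = {"R": 1000, "S": 10, "T": 100}

def toy_cost(order):
    # Toy model: worst-case size of joining the first two relations
    return sizes[order[0]] * sizes[order[1]]

plan, cost = best_plan(permutations(sizes), toy_cost)
print(plan, cost)    # ('S', 'T', 'R') 1000
```

With n relations there are n! join orders even in this simplified space, which is why enumerating every plan quickly becomes impractical.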

❖ DB Application Performance

In order to make DB applications efficient, it is useful to know:

what operations on the data does the application require
(which queries, updates, inserts and how frequently is each one performed)
how these operations might be implemented in the DBMS
(data structures and algorithms for select, project, join, sort, ...)
how much each implementation will cost
(in terms of the amount of data transferred between memory and disk)

and then, "encourage" DBMS to use the most efficient methods

Achieve by using indexes and avoiding certain SQL query structures

❖ DB Application Performance (cont)

Application programmer choices that affect query cost:

how queries are expressed
- generally join is faster than subquery
- especially if subquery is correlated
- filter first, then join (avoids large intermediate tables)
- avoid applying functions in where/group-by clauses
creating indexes on tables
- index will speed-up filtering based on indexed attributes
- indexes generally only effective for equality, gt/lt
- mainly useful if filtering much more frequent than update
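
The effect of an index on an equality filter can be observed by asking the DBMS for its plan. The sketch below uses Python's built-in sqlite3 module as a stand-in so that it runs without a server; the course DBMS is PostgreSQL, where `explain` plays the same role. The table Sold(bar, beer, price), its contents and the index name are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Sold (bar text, beer text, price real)")
conn.executemany(
    "insert into Sold values (?, ?, ?)",
    [(f"bar{i % 50}", f"beer{i}", 3.0 + i % 5) for i in range(1000)],
)

query = "select bar, price from Sold where beer = 'beer42'"

def plan(sql):
    """Return SQLite's description of how it would evaluate sql."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return " ".join(row[-1] for row in rows)    # last column is the detail text

before = plan(query)                            # full scan of Sold
conn.execute("create index Sold_beer on Sold(beer)")
after = plan(query)                             # search via Sold_beer

print(before)
print(after)
```

The exact wording of the plan text differs between SQLite versions, so no expected output is shown.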