COMP3311 Week 8 Monday Lecture


Week 08 Monday
Assignment 2 Database
Assignment 2 Scripts
Normalisation
Relation Decomposition
Schema (Re)Design
Boyce-Codd Normal Form
BCNF Decomposition
Exercise: BCNF Normalisation (1)
Exercise: BCNF Normalisation (2)
Exercise: BCNF Normalisation (3)
Third Normal Form
Exercise: 3NF Normalisation (1)
Exercise: 3NF Normalisation (2)
Exercise: 3NF Normalisation (3)
Database Design Methodology
Relational Algebra
Notation
Describing RA Operations
Example Database #1
Example Database #2
Rename
Exercise: Rename
Projection
Exercise: Projection
Selection
Exercise: Selection
Union
Intersection
Exercise: Union/Intersection
Difference


❖ Week 08 Monday

In today's lecture ...

Relational Design Theory: Normalisation
Relational Algebra: Intro and basic operations

Things to do ...

Quiz 5 due by 23:59 Friday 3 November
Assignment 2 due by 23:59 Wednesday 15 November


❖ Assignment 2 Database

Things to note:

was a pain in the *** to generate
is a tiny subset of the real UNSW student DB
- so your favourite program or stream might be missing
- and most courses have tiny enrolments (< 10)
some data might not quite make sense (e.g. enrolled in program, but not in streams)
- work with the data as supplied; don't second-guess
some degrees/streams don't have requirements (in the DB)
focuses on Engineering degrees, but should work on any degree
some obvious NOT NULL constraints were not included
assumes program/stream requirements have not changed 2019-2023


❖ Assignment 2 Scripts

Write Python/Psycopg2 scripts to ...

q1.py: track proportion of international students over time
q2.py: track satisfaction with a given course over time
q3.py: print requirements for a stream/program
q4.py: produce a transcript, including WAM
q5.py: do progression check for current program
q6.py: check progression for a proposed program

Will merge Q5 and Q6 ... essentially the same script

python3 q5.py zID   vs   python3 q5.py zID Prog Stream

Since output is text, need to be extra careful with output format


❖ Normalisation

Normalisation: branch of relational theory providing design insights.

Makes use of schema normal forms to characterise the level of redundancy in a relational schema,

and normalisation algorithms which provide mechanisms for transforming schemas to remove redundancy.

Normalisation draws heavily on the theory of functional dependencies.


❖ Relation Decomposition

The standard transformation technique to remove redundancy:

decompose relation R into relations S and T

We accomplish decomposition by

selecting (overlapping) subsets of attributes
forming new relations based on attribute subsets

Properties: R = S ∪ T, S ∩ T ≠ ∅ and r(R) = s(S) ⋈ t(T)

May require several decompositions to achieve acceptable NF.

Normalisation algorithms tell us how to choose S and T.


❖ Schema (Re)Design

Consider the following relation for BankLoans:

branchName | branchCity | assets  | custName | loanNo | amount
-----------+------------+---------+----------+--------+-------
Downtown   | Brooklyn   | 9000000 | Jones    | L-17   | 1000
Redwood    | Palo Alto  | 2100000 | Smith    | L-23   | 2000
Perryridge | Horseneck  | 1700000 | Hayes    | L-15   | 1500
Downtown   | Brooklyn   | 9000000 | Jackson  | L-15   | 1500
Minus      | Horseneck  | 400000  | Jones    | L-93   | 500
Round Hill | Horseneck  | 8000000 | Turner   | L-11   | 900
North Town | Rye        | 3700000 | Hayes    | L-16   | 1300

This schema has all of the update anomalies mentioned earlier.


❖ Schema (Re)Design (cont)

To improve the design, decompose the BankLoans relation.

The following decomposition is not helpful:

Branch(branchName, branchCity, assets)
CustLoan(custName, loanNo, amount)

because we lose information (which branch is a loan held at?)

Another possible decomposition:

BranchCust(branchName, branchCity, assets, custName)
CustLoan(custName, loanNo, amount)


❖ Schema (Re)Design (cont)

The BranchCust relation instance:

branchName | branchCity | assets  | custName
-----------+------------+---------+---------
Downtown   | Brooklyn   | 9000000 | Jones
Redwood    | Palo Alto  | 2100000 | Smith
Perryridge | Horseneck  | 1700000 | Hayes
Downtown   | Brooklyn   | 9000000 | Jackson
Minus      | Horseneck  | 400000  | Jones
Round Hill | Horseneck  | 8000000 | Turner
North Town | Rye        | 3700000 | Hayes


❖ Schema (Re)Design (cont)

The CustLoan relation instance:

custName | loanNo | amount
---------+--------+-------
Jones    | L-17   | 1000
Smith    | L-23   | 2000
Hayes    | L-15   | 1500
Jackson  | L-15   | 1500
Jones    | L-93   | 500
Turner   | L-11   | 900
Hayes    | L-16   | 1300


❖ Schema (Re)Design (cont)

Now consider the result of (BranchCust Join CustLoan)

branchName | branchCity | assets  | custName | loanNo | amount
-----------+------------+---------+----------+--------+-------
Downtown   | Brooklyn   | 9000000 | Jones    | L-17   | 1000
Downtown   | Brooklyn   | 9000000 | Jones    | L-93   | 500
Redwood    | Palo Alto  | 2100000 | Smith    | L-23   | 2000
Perryridge | Horseneck  | 1700000 | Hayes    | L-15   | 1500
Perryridge | Horseneck  | 1700000 | Hayes    | L-16   | 1300
Downtown   | Brooklyn   | 9000000 | Jackson  | L-15   | 1500
Minus      | Horseneck  | 400000  | Jones    | L-93   | 500
Minus      | Horseneck  | 400000  | Jones    | L-17   | 1000
Round Hill | Horseneck  | 8000000 | Turner   | L-11   | 900
North Town | Rye        | 3700000 | Hayes    | L-16   | 1300
North Town | Rye        | 3700000 | Hayes    | L-15   | 1500


❖ Schema (Re)Design (cont)

This is clearly not a successful decomposition.

The fact that we ended up with extra tuples is symptomatic of losing some critical "connection" information during the decomposition.

Such a decomposition is called a lossy decomposition.

In a good decomposition, we should be able to reconstruct the original relation exactly:

if R is decomposed into S and T, then S ⋈ T = R

Such a decomposition is called a lossless join decomposition.
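
The lossy decomposition above can be reproduced with a short Python sketch. It models each relation as a set of tuples (an illustrative choice, not something the slides prescribe), projects BankLoans onto the BranchCust and CustLoan attribute sets, and rejoins them on custName:

```python
# Relations are modelled as sets of tuples; the column order follows the slides.
bank_loans = {
    ("Downtown", "Brooklyn", 9000000, "Jones", "L-17", 1000),
    ("Redwood", "Palo Alto", 2100000, "Smith", "L-23", 2000),
    ("Perryridge", "Horseneck", 1700000, "Hayes", "L-15", 1500),
    ("Downtown", "Brooklyn", 9000000, "Jackson", "L-15", 1500),
    ("Minus", "Horseneck", 400000, "Jones", "L-93", 500),
    ("Round Hill", "Horseneck", 8000000, "Turner", "L-11", 900),
    ("North Town", "Rye", 3700000, "Hayes", "L-16", 1300),
}

# Decompose by projecting onto overlapping attribute subsets (overlap: custName).
branch_cust = {(bn, bc, a, cn) for (bn, bc, a, cn, ln, amt) in bank_loans}
cust_loan = {(cn, ln, amt) for (bn, bc, a, cn, ln, amt) in bank_loans}

# Natural join on the shared attribute custName.
rejoined = {
    (bn, bc, a, cn, ln, amt)
    for (bn, bc, a, cn) in branch_cust
    for (cn2, ln, amt) in cust_loan
    if cn == cn2
}

spurious = rejoined - bank_loans
print(len(bank_loans), len(rejoined), len(spurious))  # 7 11 4
print(rejoined == bank_loans)  # False, so this decomposition is lossy
```

The four spurious tuples come from Jones and Hayes, who each appear with two branches and two loans, so custName alone cannot reconnect a loan to its branch.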


❖ Boyce-Codd Normal Form

A relation schema R is in BCNF w.r.t a set F of functional dependencies iff:

for all fds X → Y in F⁺

either X → Y is trivial (i.e. Y ⊆ X)
or X is a superkey (i.e. non-strict superset of attributes in key)

A DB schema is in BCNF if all of its relation schemas are in BCNF.

Observations:

any two-attribute relation is in BCNF

any relation with key K, other attributes Y, and K → Y is in BCNF
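
As a rough illustration of the definition, the sketch below tests a schema against the BCNF conditions using attribute closure. The function names `closure` and `bcnf_violations` are made up for this example, and fds are represented as pairs of frozensets. It only inspects the fds listed in F, which is enough to decide whether the original relation R is in BCNF but not enough for relations produced by a decomposition. The sample fds are the BankLoans ones used in the exercises in these notes.

```python
def closure(attrs, fds):
    """Return the set of attributes determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def bcnf_violations(schema, fds):
    """Return the fds in F that are non-trivial and whose left side is not a superkey."""
    schema = frozenset(schema)
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs and not schema <= closure(lhs, fds)
    ]


# BankLoans schema and fds, as given in the exercise slides.
bank_loans = {"branchName", "branchCity", "assets", "custName", "loanNo", "amount"}
F = [
    (frozenset({"branchName"}), frozenset({"assets", "branchCity"})),
    (frozenset({"loanNo"}), frozenset({"amount", "branchName"})),
]

violations = bcnf_violations(bank_loans, F)
print(len(violations))  # 2: neither branchName nor loanNo is a superkey
```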


❖ Boyce-Codd Normal Form (cont)

If we transform a schema into BCNF, we are guaranteed:

no update anomalies due to fd-based redundancy
lossless join decomposition

However, we are not guaranteed:

the new schema preserves all fds from the original schema

If we need to preserve dependencies, use 3NF.

A dependency X → Y is preserved if all of its attributes exist in one table.

Decomposition may e.g. place X and Y in separate tables


❖ BCNF Decomposition

The following algorithm converts an arbitrary schema to BCNF:

Inputs: schema R, set F of fds
Output: set Res of BCNF schemas (tables)

Res = {R};
while (any schema S ∈ Res is not in BCNF) {
    choose any fd X → Y on S that violates BCNF
    Res = (Res - {S}) ∪ {S-Y} ∪ {XY}
}

The last step means: make a table from XY; make a table from S without the attributes in Y; replace S by these two tables

The "choose any" step means that the algorithm is non-deterministic
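
One way to turn the algorithm into running code is sketched below. It is only one possible reading: the "choose any" step is made deterministic by trying left-hand sides in sorted order, violations on sub-schemas are found by brute force over attribute subsets (exponential, so suitable only for small examples), and the helper names are invented for the example.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return the set of attributes determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def find_violation(schema, fds):
    """Return (X, Y) where X -> Y holds on schema and violates BCNF, else None."""
    attrs = sorted(schema)
    for size in range(1, len(attrs)):
        for combo in combinations(attrs, size):
            x = frozenset(combo)
            x_plus = closure(x, fds) & schema  # attributes of schema determined by X
            y = x_plus - x
            if y and x_plus != schema:         # non-trivial, and X is not a superkey
                return x, y
    return None


def bcnf_decompose(schema, fds):
    res = {frozenset(schema)}
    while True:
        for s in sorted(res, key=sorted):
            violation = find_violation(s, fds)
            if violation is not None:
                break
        else:
            return res                         # every schema is in BCNF
        x, y = violation
        res = (res - {s}) | {s - y, x | y}     # replace S by S-Y and XY


F = [
    (frozenset({"branchName"}), frozenset({"assets", "branchCity"})),
    (frozenset({"loanNo"}), frozenset({"amount", "branchName"})),
]
R = {"branchName", "branchCity", "assets", "custName", "loanNo", "amount"}

for table in sorted(bcnf_decompose(R, F), key=sorted):
    print(sorted(table))
```

On the BankLoans schema this yields three tables: (branchName, branchCity, assets), (loanNo, amount, branchName) and (custName, loanNo). A different choice of violating fd at each step could give a different, equally valid BCNF result.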


❖ Exercise: BCNF Normalisation (1)

Recall the BankLoans schema:

BankLoans(branchName, branchCity, assets, custName, loanNo, amount)

Has functional dependencies F

branchName → assets,branchCity
loanNo → amount,branchName

The key for BankLoans is custName,loanNo (branchName is not needed, since loanNo → branchName)

Use the BCNF decomposition algorithm to produce a BCNF schema.


❖ Exercise: BCNF Normalisation (2)

Recall the following table (based on e.g. a spreadsheet):

P#  | When        | Address    | Notes         | S#   | Name  | CarReg
----+-------------+------------+---------------+------+-------+-------
PG4 | 03/06 15:15 | 55 High St | Bathroom leak | SG44 | Rob   | ABK754
PG1 | 04/06 11:10 | 47 High St | All ok        | SG44 | Rob   | ABK754
PG4 | 03/07 12:30 | 55 High St | All ok        | SG43 | Dave  | ATS123
PG1 | 05/07 15:00 | 47 High St | Broken window | SG44 | Rob   | ABK754
PG1 | 05/07 15:00 | 47 High St | Leaking tap   | SG44 | Rob   | ABK754
PG2 | 13/07 12:00 | 12 High St | All ok        | SG42 | Peter | ATS123
...

Recall the functional dependencies identified previously.

Use these to convert the table to a BCNF schema.


❖ Exercise: BCNF Normalisation (3)

Consider the schema R and set of fds F

R = ABCDEFGH

F = { ABH → C, A → DE, BGH → F, F → ADH, BH → GE }

Produce a BCNF decomposition of R.


❖ Third Normal Form

A relation schema R is in 3NF w.r.t a set F of functional dependencies iff:

for all fds X → Y in F⁺

either X → Y is trivial (i.e. Y ⊆ X)
or X is a superkey
or Y is a single attribute from a key

A DB schema is in 3NF if all relation schemas are in 3NF.

The extra condition represents a slight weakening of BCNF requirements.


❖ Third Normal Form (cont)

If we transform a schema into 3NF, we are guaranteed:

lossless join decomposition
the new schema preserves all of the fds from the original schema

However, we are not guaranteed:

no update anomalies due to fd-based redundancy

Whether to use BCNF or 3NF depends on overall design considerations.


❖ Third Normal Form (cont)

The following algorithm converts an arbitrary schema to 3NF:

Inputs: schema R, set F of fds
Output: set Res of 3NF schemas

let F_c be a minimal cover for F
Res = {}
for each fd X → Y in F_c {
    if (no schema S ∈ Res contains XY) {
        Res = Res ∪ {XY}
    }
}
if (no schema S ∈ Res contains a candidate key for R) {
    K = any candidate key for R
    Res = Res ∪ {K}
}
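
A minimal Python sketch of this synthesis loop follows. It assumes a minimal cover has already been computed and is passed in as a list of (lhs, rhs) frozenset pairs; the function names and the brute-force search for a candidate key are illustrative choices, not part of the slides.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return the set of attributes determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def candidate_key(schema, fds):
    """Return a smallest attribute set whose closure covers the whole schema."""
    attrs = sorted(schema)
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            if schema <= closure(combo, fds):
                return frozenset(combo)


def synthesise_3nf(schema, cover):
    """Build 3NF schemas from a minimal cover, following the loop above."""
    res = []
    for lhs, rhs in cover:
        xy = lhs | rhs
        if not any(xy <= s for s in res):                  # no schema contains XY
            res.append(xy)
    if not any(schema <= closure(s, cover) for s in res):  # no schema holds a key
        res.append(candidate_key(schema, cover))
    return res


R = frozenset({"branchName", "branchCity", "assets", "custName", "loanNo", "amount"})
# Minimal cover of the BankLoans fds: every right-hand side is a single attribute.
Fc = [
    (frozenset({"branchName"}), frozenset({"assets"})),
    (frozenset({"branchName"}), frozenset({"branchCity"})),
    (frozenset({"loanNo"}), frozenset({"amount"})),
    (frozenset({"loanNo"}), frozenset({"branchName"})),
]

for table in synthesise_3nf(R, Fc):
    print(sorted(table))
```

Followed literally, the loop produces one table per fd plus a key table (custName, loanNo). In practice, tables that share the same left-hand side are often merged afterwards, which here would give the same three tables as the BCNF decomposition.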


❖ Third Normal Form (cont)

Critical step is producing minimal cover F_c for F

A set F of fds is minimal if

every fd X → Y is simple (Y is a single attribute)
every fd X → Y is left-reduced (no Z ⊂ X such that Z → Y could replace X → Y in F and preserve F⁺)
every fd X → Y is necessary (X → Y cannot be removed without changing F⁺)

Algorithm: right-reduce, left-reduce, eliminate redundant fds
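
The three steps can be sketched in Python as follows. This is one straightforward way to implement them, not necessarily how it was done in the lecture; `minimal_cover` and `fd` are names invented for the example, and the sample fds are the ones from BCNF exercise (3). Minimal covers are not unique, so a different processing order may give a different but equivalent result.

```python
def closure(attrs, fds):
    """Return the set of attributes determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def minimal_cover(fds):
    # 1. Right-reduce: split X -> Y into X -> A for each attribute A in Y.
    cover = [(lhs, frozenset([a])) for lhs, rhs in fds for a in sorted(rhs)]

    # 2. Left-reduce: drop an attribute from X if the rest of X still determines A.
    reduced = []
    for lhs, rhs in cover:
        for attr in sorted(lhs):
            smaller = lhs - {attr}
            if rhs <= closure(smaller, cover):
                lhs = smaller
        reduced.append((lhs, rhs))

    # 3. Eliminate redundant fds: drop an fd if the remaining ones still imply it.
    result = list(dict.fromkeys(reduced))  # remove exact duplicates, keep order
    for dep in list(result):
        rest = [g for g in result if g != dep]
        if dep[1] <= closure(dep[0], rest):
            result = rest
    return result


def fd(lhs, rhs):
    """Build an fd from two strings of single-letter attribute names."""
    return frozenset(lhs), frozenset(rhs)


F = [fd("ABH", "C"), fd("A", "DE"), fd("BGH", "F"), fd("F", "ADH"), fd("BH", "GE")]

for lhs, rhs in minimal_cover(F):
    print("".join(sorted(lhs)), "->", "".join(sorted(rhs)))
```

For this F, left reduction shrinks ABH → C and BGH → F to BH → C and BH → F, and the elimination step removes F → D and BH → E, leaving seven fds.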


❖ Exercise: 3NF Normalisation (1)

Recall the BankLoans schema:

BankLoans(branchName, branchCity, assets, custName, loanNo, amount)

Has functional dependencies F

branchName → assets,branchCity
loanNo → amount,branchName

The key for BankLoans is custName,loanNo (branchName is not needed, since loanNo → branchName)

Use the 3NF decomposition algorithm to produce a 3NF schema.


❖ Exercise: 3NF Normalisation (2)

Recall the following table (based on e.g. a spreadsheet):


P#  | When        | Address    | Notes         | S#   | Name  | CarReg
----+-------------+------------+---------------+------+-------+-------
PG4 | 03/06 15:15 | 55 High St | Bathroom leak | SG44 | Rob   | ABK754
PG1 | 04/06 11:10 | 47 High St | All ok        | SG44 | Rob   | ABK754
PG4 | 03/07 12:30 | 55 High St | All ok        | SG43 | Dave  | ATS123
PG1 | 05/07 15:00 | 47 High St | Broken window | SG44 | Rob   | ABK754
PG1 | 05/07 15:00 | 47 High St | Leaking tap   | SG44 | Rob   | ABK754
PG2 | 13/07 12:00 | 12 High St | All ok        | SG42 | Peter | ATS123
...

Recall the functional dependencies identified previously.

Use these to convert the table to a 3NF schema.


❖ Exercise: 3NF Normalisation (3)

Consider the schema R and set of fds F

R = ABCDEFGH

F = F_c = { BH → C, A → D, C → E, F → A, E → F, BGH → E }

Produce a 3NF decomposition of R.


❖ Database Design Methodology

To achieve a "good" database design:

identify attributes, entities, relationships → ER design
map ER design to relational schema
identify constraints (including keys and functional dependencies)
apply BCNF/3NF algorithms to produce normalised schema

Note: may subsequently need to "denormalise" if the design yields inadequate performance.


❖ Relational Algebra

Relational algebra (RA) can be viewed as ...

mathematical system for manipulating relations, or
data manipulation language (DML) for the relational model

Relational algebra consists of:

operands: relations, or variables representing relations
operators that map relations to relations
rules for combining operands/operators into expressions
rules for evaluating such expressions

Why is it important?

because it forms the basis for DBMS implementation
relational algebra ops are like machine code for DBMSs


❖ Relational Algebra (cont)

Core relational algebra operations:

selection: choosing a subset of tuples/rows
projection: choosing a subset of attributes/columns
product, join: combining relations
union, intersection, difference: combining relations
rename: change names of relations/attributes

Common extensions include:

aggregation, projection++, division


❖ Relational Algebra (cont)

Select, project, join provide a powerful set of operations for building relations and extracting interesting data from them.

Adding set operations and renaming makes RA complete.


❖ Notation

Standard treatments of relational algebra use Greek symbols.

We use the following notation (because it is easier to reproduce):

Operation  | Standard Notation | Our Notation
-----------+-------------------+---------------------
Selection  | σ_expr(Rel)       | Sel[expr](Rel)
Projection | π_A,B,C(Rel)      | Proj[A,B,C](Rel)
Join       | Rel₁ ⋈_expr Rel₂  | Rel₁ Join[expr] Rel₂
Rename     | ρ_schema(Rel)     | Rename[schema](Rel)

For other operations (e.g. set operations) we adopt the standard notation, except when typing in a text file, where * = intersection and + = union.


❖ Describing RA Operations

We define the semantics of RA operations using

"conditional set" expressions e.g. { X | condition on X }
tuple notations:
- t[AB] (extracts attributes A and B from tuple t)
- (x,y,z) (enumerated tuples; specify attribute values)
quantifiers, set operations, boolean operators

For each operation, we also describe it operationally:

give an algorithm to compute the result, tuple-by-tuple


❖ Describing RA Operations (cont)

All RA operators return a result of type relation.

For convenience, we can name a result and use it later.

E.g.

Temp = R op₁ S op₂ T
Res  = Temp op₃ Z
-- which is equivalent to
Res  = (R op₁ S op₂ T) op₃ Z

Each result is a relation with a well-defined schema

this applies equally to intermediate results and final results


❖ Example Database #1


❖ Example Database #2


❖ Rename

Rename provides "schema mapping".

If expression E returns a relation R(A₁, A₂, ... A_n), then

Rename[S(B₁, B₂, ... B_n)](E)

gives a relation called S

containing the same set of tuples as E
but with the name of each attribute changed from A_i to B_i

Rename is like the identity function on the contents of a relation

The only thing that Rename changes is the schema.
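
To make the "identity on contents" point concrete, here is a small Python sketch. The `Relation` type (a name, an attribute tuple and a set of tuples) is an assumption made for illustration, and the sample relation S(c,d) with its tuples is invented.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    name: str
    attrs: tuple
    tuples: frozenset


def rename(new_name, new_attrs, rel):
    """Rename[new_name(new_attrs)](rel): same tuples, new schema."""
    if len(new_attrs) != len(rel.attrs):
        raise ValueError("Rename must keep the same number of attributes")
    return Relation(new_name, tuple(new_attrs), rel.tuples)


# Hypothetical relation S(c,d); the tuples are invented for illustration.
s = Relation("S", ("c", "d"), frozenset({(1, "x"), (2, "y")}))
t = rename("T", ("cc", "d"), s)

print(t.name, t.attrs)        # T ('cc', 'd')
print(t.tuples == s.tuples)   # True: only the schema changed
```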


❖ Rename (cont)

Rename can be viewed as a "technical" apparatus of RA.

Can use implicit rename/project in sequences of RA operations, e.g.

--  R(a,b,c),  S(c,d)
Res = Rename[Res(b,c,d)](Proj[b,c](Sel[a>5](R)) Join S)
-- vs
Tmp1 = Sel[a>5](R)
Tmp2 = Proj[b,c](Tmp1)
Tmp3 = Rename[Tmp3(cc,d)](S)
Tmp4 = Tmp2 Join[c=cc] Tmp3
Res  = Rename[Res(b,c,d)](Tmp4)
-- vs
Tmp1(b,c)  = Sel[a>5](R)
Tmp2(cc,d) = S
Res(b,c,d) = Tmp1 Join[c=cc] Tmp2

In SQL, can achieve a similar effect by defining a set of views


❖ Exercise: Rename

Answer the following in SQL then relational algebra

rename the columns in the Beers relation (beer,brewer)
rename the addr column in Drinkers as suburb
change the name of the Bars relation to Pubs


❖ Projection

Projection returns a set of tuples containing a subset of the attributes in the original relation.

π_X(r) = Proj[X](r) = { t[X] | t ∈ r }, where r(R)

X specifies a subset of the attributes of R.

Note that removing key attributes can produce duplicates.

In RA, duplicates are removed from the result set.

Result size: |π_X(r)| ≤ |r|

Result schema: R'(X)

Algorithmic view:

result = {}
for each tuple t in relation r
    result = result ∪ {t[X]}
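
The algorithmic view translates almost directly into Python. In this sketch a relation is a set of tuples plus a separate list of attribute names (an illustrative representation), and the data is the CustLoan instance shown earlier in these notes.

```python
def proj(attrs, schema, r):
    """Proj[attrs](r): keep only the named attributes; the set removes duplicates."""
    idx = [schema.index(a) for a in attrs]
    result = set()
    for t in r:
        result.add(tuple(t[i] for i in idx))
    return result


cust_loan_schema = ["custName", "loanNo", "amount"]
cust_loan = {
    ("Jones", "L-17", 1000),
    ("Smith", "L-23", 2000),
    ("Hayes", "L-15", 1500),
    ("Jackson", "L-15", 1500),
    ("Jones", "L-93", 500),
    ("Turner", "L-11", 900),
    ("Hayes", "L-16", 1300),
}

names = proj(["custName"], cust_loan_schema, cust_loan)
loans = proj(["loanNo", "amount"], cust_loan_schema, cust_loan)
print(len(cust_loan), len(names), len(loans))  # 7 5 6
```

Both projections return fewer than seven tuples because dropping an attribute makes some rows identical, and the set keeps only one copy.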


❖ Projection (cont)

Examples of projection:


❖ Exercise: Projection

Answer the following in SQL then relational algebra

Names of all beers
What are all of the breweries?
Names of drinkers who live in Newtown


❖ Selection

Selection returns a subset of the tuples in a relation r(R) that satisfy a specified condition C.

σ_C(r) = Sel[C](r) = { t | t ∈ r ∧ C(t) }

C is a boolean expression on attributes in R.

Result size: |σ_C(r)| ≤ |r|

Result schema: same as the schema of r (i.e. R)

Algorithmic view:

result = {}
for each tuple t in relation r
    if (C(t)) { result = result ∪ {t} }
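
A Python sketch of the same loop is given below. The condition C is passed as a function over a dict of attribute values, which is just one convenient way to express it; the data is again the CustLoan instance from earlier in these notes.

```python
def sel(cond, schema, r):
    """Sel[cond](r): keep the tuples for which cond holds; the schema is unchanged."""
    result = set()
    for t in r:
        if cond(dict(zip(schema, t))):
            result.add(t)
    return result


cust_loan_schema = ["custName", "loanNo", "amount"]
cust_loan = {
    ("Jones", "L-17", 1000),
    ("Smith", "L-23", 2000),
    ("Hayes", "L-15", 1500),
    ("Jackson", "L-15", 1500),
    ("Jones", "L-93", 500),
    ("Turner", "L-11", 900),
    ("Hayes", "L-16", 1300),
}

big = sel(lambda t: t["amount"] >= 1500, cust_loan_schema, cust_loan)
jones = sel(lambda t: t["custName"] == "Jones", cust_loan_schema, cust_loan)
print(len(big), len(jones))  # 3 2
```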


❖ Selection (cont)

Examples of selection:


❖ Exercise: Selection

Answer the following in SQL and relational algebra

details of all bars in The Rocks
beers made by Sierra Nevada
beers sold in the Coogee Bay Hotel


❖ Union

Union combines two compatible relations into a single relation via set union of sets of tuples.

r₁ ∪ r₂ = { t | t ∈ r₁ ∨ t ∈ r₂ }, where r₁(R), r₂(R)

Compatibility = both relations have the same schema

Result size: |r₁ ∪ r₂| ≤ |r₁| + |r₂|

Result schema: R

Algorithmic view:

result = r₁
for each tuple t in relation r₂
    result = result ∪ {t}
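
As a small illustration, the sketch below unions two single-attribute relations derived from the BankLoans instance earlier in these notes: customers with a loan at the Downtown branch, and customers with a loan at a branch located in Horseneck. The `union` helper and its explicit schema arguments are assumptions of this example.

```python
def union(schema1, r1, schema2, r2):
    """r1 ∪ r2 for compatible relations (same schema)."""
    if schema1 != schema2:
        raise ValueError("union requires compatible relations")
    result = set(r1)
    for t in r2:
        result.add(t)
    return result


schema = ["custName"]
downtown = {("Jones",), ("Jackson",)}
horseneck = {("Hayes",), ("Jones",), ("Turner",)}

either = union(schema, downtown, schema, horseneck)
print(sorted(either))  # [('Hayes',), ('Jackson',), ('Jones',), ('Turner',)]
```

The result has four tuples rather than five because Jones appears in both inputs, which is why the size bound is an inequality.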


❖ Intersection

Intersection combines two compatible relations into a single relation via set intersection of sets of tuples.

r₁ ∩ r₂ = { t | t ∈ r₁ ∧ t ∈ r₂ }, where r₁(R), r₂(R)

Uses same notion of relation compatibility as union.

Result size: |r₁ ∩ r₂| ≤ min(|r₁|,|r₂|)

Result schema: R

Algorithmic view:

result = {}
for each tuple t in relation r₁
    if (t ∈ r₂) { result = result ∪ {t} }
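
The same loop in Python, applied to two single-attribute relations derived from the BankLoans instance earlier in these notes (customers with a loan at Downtown, and customers with a loan at a Horseneck branch). The helper name and the explicit schema arguments are illustrative assumptions.

```python
def intersection(schema1, r1, schema2, r2):
    """r1 ∩ r2 for compatible relations (same schema)."""
    if schema1 != schema2:
        raise ValueError("intersection requires compatible relations")
    result = set()
    for t in r1:
        if t in r2:
            result.add(t)
    return result


schema = ["custName"]
downtown = {("Jones",), ("Jackson",)}
horseneck = {("Hayes",), ("Jones",), ("Turner",)}

both = intersection(schema, downtown, schema, horseneck)
print(both)  # {('Jones',)}
```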


❖ Intersection (cont)

Examples of union and intersection:


❖ Exercise: Union/Intersection

Answer the following in SQL then relational algebra

Bars where either John or Gernot drinks
Bars where both John and Gernot drink


❖ Difference

Difference finds the set of tuples that exist in one relation but do not occur in a second compatible relation.

r₁ - r₂ = { t | t ∈ r₁ ∧ t ∉ r₂ }, where r₁(R), r₂(R)

Uses same notion of relation compatibility as union.

Note: tuples in r₂ but not r₁ do not appear in the result

i.e. set difference != complement of set intersection

Algorithmic view:

result = {}
for each tuple t in relation r₁
    if (!(t ∈ r₂)) { result = result ∪ {t} }
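
A Python version of this loop is sketched below, using two single-attribute relations derived from the BankLoans instance earlier in these notes (customers with a loan at Downtown, and customers with a loan at a Horseneck branch). Swapping the arguments shows that the operation is not symmetric. The helper name and schema arguments are assumptions of the example.

```python
def difference(schema1, r1, schema2, r2):
    """r1 - r2 for compatible relations (same schema)."""
    if schema1 != schema2:
        raise ValueError("difference requires compatible relations")
    result = set()
    for t in r1:
        if t not in r2:
            result.add(t)
    return result


schema = ["custName"]
downtown = {("Jones",), ("Jackson",)}
horseneck = {("Hayes",), ("Jones",), ("Turner",)}

only_downtown = difference(schema, downtown, schema, horseneck)
only_horseneck = difference(schema, horseneck, schema, downtown)
print(only_downtown)           # {('Jackson',)}
print(sorted(only_horseneck))  # [('Hayes',), ('Turner',)]
```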


❖ Difference (cont)

Examples of difference:

Clearly, difference is not symmetric.


❖ Exercise: Difference

Answer the following in SQL then relational algebra

Bars where John drinks and Gernot doesn't
Bars that sell VB but not New
