Relational Design Theory


Relational Design Theory

As noted earlier, the relational model is:

Such properties tend to lead to useful mathematical theories.

One important theory developed for the relational model involves the notion of functional dependency (fd ).

Like constraints, assertions, etc. functional dependencies are drawn from the semantics of the application domain.

Essentially, fd 's describe how individual attributes are related.


Relational Design Theory (cont)

Functional dependencies

What we study here: The aim of studying this:

Relational Design and Redundancy

A good relational database design:

Minimal stored information no redundant data.

In database design, redundancy is generally a "bad thing":

However, it can sometimes lead to performance improvements

Relational Design and Redundancy (cont)

Consider the following relation defining bank accounts/branches:

accountNo balance customer branch address assets
A-101 500 1313131 Downtown Brooklyn 9000000
A-102 400 1313131 Perryridge Horseneck 1700000
A-113 600 9876543 Round Hill Horseneck 8000000
A-201 900 9876543 Brighton Brooklyn 7100000
A-215 700 1111111 Mianus Horseneck 400000
A-222 700 1111111 Redwood Palo Alto 2100000
A-305 350 1234567 Round Hill Horseneck 8000000

We need to be careful updating this data, otherwise we may introduce inconsistencies.


Relational Design and Redundancy (cont)

Insertion anomaly:

Update anomaly: Deletion anomaly: (If we do manage to avoid inconsistencies, the cost is additional updates)

Relational Design and Redundancy (cont)

Insertion anomaly example (insert account A-306 at Round Hill):

accountNo balance customer branch address assets
A-101 500 1313131 Downtown Brooklyn 9000000
A-102 400 1313131 Perryridge Horseneck 1700000
A-113 600 9876543 Round Hill Horseneck 8000000
A-201 900 9876543 Brighton Brooklyn 7100000
A-215 700 1111111 Mianus Horseneck 400000
A-222 700 1111111 Redwood Palo Alto 2100000
A-305 350 1234567 Round Hill Horseneck 8000000
A-306 800 1111111 Round Hill Horseneck 8000800


Relational Design and Redundancy (cont)

Update anomaly example (update Round Hill branch address):

accountNo balance customer branch address assets
A-101 500 1313131 Downtown Brooklyn 9000000
A-102 400 1313131 Perryridge Horseneck 1700000
A-113 600 9876543 Round Hill Palo Alto 8000000
A-201 900 9876543 Brighton Brooklyn 7100000
A-215 700 1111111 Mianus Horseneck 400000
A-222 700 1111111 Redwood Palo Alto 2100000
A-305 350 1234567 Round Hill Horseneck 8000000


Relational Design and Redundancy (cont)

Deletion anomaly example (remove account A-101):

accountNo balance customer branch address assets
A-101 500 1313131 Downtown Brooklyn 9000000
A-102 400 1313131 Perryridge Horseneck 1700000
A-113 600 9876543 Round Hill Horseneck 8000000
A-201 900 9876543 Brighton Brooklyn 7100000
A-215 700 1111111 Mianus Horseneck 400000
A-222 700 1111111 Redwood Palo Alto 2100000
A-305 350 1234567 Round Hill Horseneck 8000000

Where is the Downtown branch located? What are its assets?


Database Design (revisited)

To avoid these kinds of update problems:

Typically, each Ri contains information about one entity (e.g. branch, customer, ...)

This leads to a (bottom-up) database design procedure:


Database Design (revisited) (cont)

This contrasts with our earlier (top-down) design procedure:

It appears that ER-design-then-relational-mapping So why do we need a dependency theory and normalisation procedure to deal with redundancy?

Database Design (revisited) (cont)

Some reasons ...

  1. ER design does not guarantee minimal redundancy
    • dependency theory allows us to check designs for residual problems
  2. Normalisation can be viewed as (semi)automated design
    • determine all of the attributes in the problem domain
    • collect them all together in a "super-relation"   (with update anomalies)
    • provide information about how attributes are related
    • apply normalisation to decompose into non-redundant relations

Notation/Terminology

Most texts adopt the following terminology:

Relation
schemas
upper-case letters, denoting set of all attributes (e.g. R, S, P, Q )
Relation
instances
lower-case letter corresponding to schema (e.g. r(R), s(S), p(P), q(Q) )
Tuples lower-case letters   (e.g. t, t', t1, u, v )
Attributes upper-case letters from start of alphabet (e.g. A, B, C, D )
Sets of
attributes
simple concatenation of attribute names (e.g. X=ABCD, Y=EFG )
Attributes
in tuples
tuple[attrSet] (e.g. t[ABCD], t[X])


Functional Dependency

A relation instance r(R) satisfies a dependency X → Y if

In other words, if two tuples in R agree in their values for the set of attributes X, then they must also agree in their values for the set of attributes Y.

We say that "Y is functionally dependent on X".

Attribute sets X and Y may overlap; trivially true that X → X.

Notes:


Functional Dependency (cont)

The above definition talks about dependency within a relation instance r(R).

Much more important for design is the notion of dependency across all possible instances of the relation (i.e. a schema-based dependency).

This is a simple generalisation of the previous definition:

Useful because such dependencies reflect the semantics of the problem.

Functional Dependency (cont)

Consider the following instance r(R) of the relation schema R(ABCDE):

A       B       C       D       E      
a1 b1 c1 d1 e1
a2 b1 c2 d2 e1
a3 b2 c1 d1 e1
a4 b2 c2 d2 e1
a5 b3 c3 d1 e1

What kind of dependencies can we observe among the attributes in r(R)?


Functional Dependency (cont)

Since the values of A are unique, it follows from the fd definition that:

A → B,    A → C,    A → D,    A → E

It also follows that A → BC   (or any other subset of ABCDE).

This can be summarised as   A → BCDE

From our understanding of primary keys, A is a PK.


Functional Dependency (cont)

Since the values of E are always the same, it follows that:

A → E,    B → E,    C → E,    D → E

Note: cannot generally summarise above by   ABCD → E

(However, ABCD → E does happen to be true in this example)

In general,    A → Y,   B → Y       AB → Y


Functional Dependency (cont)

Other observations:

We could derive many other dependencies, e.g.   AE → BC, ...

In practice, choose a minimal set of fds (basis)


Functional Dependency (cont)

Can we generalise some ideas about functional dependency?

E.g. are there dependencies that hold for any relation?

Yes, but they're rather uninteresting ones such as:

t[ABC] = u[ABC]  ⇒  t[AB] = u[AB]   giving   ABC → AB

which generalises to   Y ⊂ X   ⇒   X → Y.

E.g. do some dependencies suggest the existence of others?

Yes, and this is much more interesting ... there are a number of rules of inference that allow us to derive dependencies.


Inference Rules

Armstrong's rules are complete, general rules of inference on fds.

F1. Reflexivity   e.g.   X → X

F2. Augmentation   e.g.   X → Y  ⇒  XZ → YZ F3. Transitivity   e.g.   X → Y, Y → Z  ⇒  X → Z

Inference Rules (cont)

While Armstrong's rules are complete, other useful rules exist:

F4. Additivity   e.g.   X → Y, X → Z   ⇒   X → YZ

F5. Projectivity   e.g.   X → YZ   ⇒   X → Y, X → Z F6. Pseudotransitivity   e.g.   X → Y, YZ → W   ⇒   XZ → W

Inference Rules (cont)

Using rules and a set F of given fds, we can determine what other fds hold.

Example (derivation of AB → GH):

R = ABCDEFGHIJ

F = { AB → E,   AG → J,   BE → I,   E → G,   GI → H }

1.   AB → E (given)
2.   AB → AB (using F1)
3.   AB → B (using F5 on 2)
4.   AB → BE (using F4 on 1,3)
5.   BE → I (given)


Inference Rules (cont)

Continuing the derivation ...

6.   AB → I (using F3 on 4,5)
7.   E → G (given)
8.   AB → G (using F3 on 1,7)
9.   AB → GI (using F4 on 6,8)
10. GI → H (given)
11. AB → H (using F3 on 6,8)
12. GI → GI (using F1)
13. GI → I (using F5 on 12)
14. AB → GH (using F4 on 8,11)

Closures

Given a set F of fds, how many new fds can we derive?

For a finite set of attributes, there must be a finite set of fds.

The largest collection of dependencies that can be derived from F is called the closure of F and is denoted F+.

Closures allow us to answer two interesting questions:


Closures (cont)

For the question "is X → Y derivable from F?" ...

For the question "are F and G equivalent?" ... Unfortunately, closures on even small sets of functional dependencies can be very large.

Algorithms based on F+ rapidly become infeasible.


Closures (cont)

Example (of fd closure):

R = ABC,     F = { AB → C,   C → B }
F+ = { A → A,   AB → A,   AC → A,   AB → B,   BC → B,   ABC → B,  
            C → C,   AC → C,   BC → C,   ABC → C,   AB → AB,   . . . . . . ,
            AB → ABC,   AB → ABC,   C → B,   C → BC,   AC → B,   AC → AB }

To solve this problem, use closures based on sets of attributes rather than sets of fds.

Given a set X of attributes and a set F of fds, the largest set of attributes that can be derived from X using F, is called the closure of X (denoted X+).

We can prove (using additivity) that    (X → Y) ∈ F+   iff   Y ⊂ X+.

For computation, | X+ | is bounded by the number of attributes.


Closures (cont)

For the question "is X → Y derivable from F?" ...

For the question "are F and G equivalent?" ...

Closure Algorithm

Inputs: set F of fds
        set X of attributes
Output: closure of X (i.e. X+)

X+ = X
stillChanging = true;
while (stillChanging) {
    stillChanging = false;
    for each W → Z in F {
        if (W ⊆ X+) and not (Z ⊆ X+) {
            X+ = X+ ∪ Z
            stillChanging = true;
        }
    }
}


Closure Algorithm (cont)

E.g. R = ABCDEF, F = { AB → C,  BC → AD,  D → E,  CF → B }

Does AB → D follow from F?     Solve by checking D ∈ AB+.

Computing AB+:

1.   AB+ = AB (initially)
2.   AB+ = ABC (using AB → C)
3.   AB+ = ABCD (using BC → AD)
4.   AB+ = ABCDE (using D → E)

Since D is in AB+, then AB → D does follow from F.


Closure Algorithm (cont)

E.g. R = ABCDEF, F = { AB → C,  BC → AD,  D → E,  CF → B }

Does D → A follow from F?     Solve by checking A ∈ D+.

Computing D+:

1.   D+ = D (initially)
2.   D+ = DE (using D → E)

Since A is not in D+, then D → A does not follow from F.


Closure Algorithm (cont)

E.g. R = ABCDEF, F = { AB → C,  BC → AD,  D → E,  CF → B }

What are the keys of R?

Solve by finding X ⊂ R such that X+ = R.

From previous examples, we know AB and D are not keys.

This also implies that A and B alone are not keys.

So how to find keys? Try all combinations of ABCDEF ...

E.g. maybe ACF is a key ...


Closure Algorithm (cont)

Computing ACF+:

1.   ACF+ = ACF (initially)
2.   ACF+ = ABCF (using CF → B)
3.   ACF+ = ABCDF (using BC → AD)
4.   ACF+ = ABCDEF (using D → E)

Since ACF+ = R,   ACF is a key    (as is ABF).


Minimal Covers

For a given application, we can define many different sets of fds with the same closure (e.g. F and G where F+ = G+)

Which one is best to "model" the application?

If we can ... Better still, can we derive the smallest complete set of fds?

Minimal Covers (cont)

Minimal cover Fc for a set F of fds:

An fd d is redundant if (F-{d})+ = F+

An attribute a is redundant if (F-{d}∪{d'})+ = F+
(where d' is the same as d but with attribute A removed)


Minimal Covers (cont)

Algorithm for computing minimal cover:


Inputs: set F of fds
Output: minimal cover Fc of F

Fc = F
Step 1:
    put f ∈ Fc into canonical form
Step 2:
    eliminate redundant attributes from f ∈ Fc
Step 3:
    eliminate redundant fds from Fc


Minimal Covers (cont)

Step 1: put fds into canonical form


for each f ∈ Fc like X → {A1,..,An}
    Fc = Fc - {f}
    for each a in {A1,..,An}
        Fc = Fc ∪ {X → a}
    end
end


Minimal Covers (cont)

Step 2: eliminate redundant attributes


for each f ∈ Fc like X → A
    for each b in X
        f' = (X-{b}) → A;
        G = Fc - {f} ∪ {f'}
        if (G+ == Fc+) Fc = G
    end
end


Minimal Covers (cont)

Step 3: eliminate redundant functional dependencies


for each f ∈ Fc
    G = Fc - {f}
    if (G+ == Fc+) Fc = G
end


Note: we often assume that any supplied F is minimal.


Minimal Covers (cont)

E.g. R = ABC, F = { A → BC,   B → C,   A → B,   AB → C }

Compute the minimal cover:

This gives the minimal cover   Fc = { A → B,   B → C }.

Normalization

Normalization: branch of relational theory providing design insights.

The goals of normalization:

Normalization draws heavily on the theory of functional dependencies.


Normal Forms

Normalization theory defines six normal forms (NFs).

Each normal form:

Higher normal forms have less redundancy less update problems.

Normal Forms (cont)

Must first decide which normal form rNF is "acceptable".

The normalization process:


Normal Forms (cont)

A brief history of normal forms:

NF hierarachy:   5NF 4NF BCNF 3NF 2NF 1NF

1NF allows most redundancy;   5NF allows least redundancy.


Normal Forms (cont)

1NF all attributes have atomic values
we assume this as part of relational model
2NF all non-key attributes fully depend on key
(i.e. no partial dependencies)
avoids much redundancy
3NF
BCNF
no attributes dependent on non-key attrs
(i.e. no transitive dependencies)
avoids remaining redundancy
4NF removes problems due to multivalued dependencies
5NF removes problems due to join dependencies


Normal Forms (cont)

In practice, BCNF and 3NF are the most important.
(these are generally the "acceptable normal forms" for relational design)

Boyce-Codd Normal Form (BCNF):

Third Normal Form (3NF):

Relation Decomposition

The standard transformation technique to remove redundancy:

We accomplish decomposition by

Properties:   R = S ∪ T,   S ∩ T ≠ {}   and ideally   r(R) = s(S) t(T)

We may require several decompositions to achieve acceptable NF.

Normalization algorithms tell us how to choose S and T.


Schema Design

Consider the following relation for BankLoans:

branchName branchCity assets custName loanNo amount
Downtown Brooklyn 9000000 Jones L-17 1000
Redwood Palo Alto 2100000 Smith L-23 2000
Perryridge Horseneck 1700000 Hayes L-15 1500
Downtown Brooklyn 9000000 Jackson L-15 1500
Mianus Horseneck 400000 Jones L-93 500
Round Hill Horseneck 8000000 Turner L-11 900
North Town Rye 3700000 Hayes L-16 1300


Schema Design (cont)

The BankLoans relation exhibits update anomalies (insert, update, delete).

The cause of these problems can be stated in terms of fds

In other words, some attributes are determined by branchName, while others are not.

This suggests that we have two separate notions (branch and loan) mixed up in a single relation


Schema Design (cont)

To improve the design, decompose the BankLoans relation.

The following decomposition is not helpful:

    Branch(branchName, branchCity, assets)
    CustLoan(custName, loanNo, amount)

because we lose information (which branch is a loan held at?)

Clearly, we need to leave some "connection" between the new relations, so that we can reconstruct the original information if needed.

Another possible decomposition:

    BranchCust(branchName, branchCity, assets, custName)
    CustLoan(custName, loanNo, amount)


Schema Design (cont)

The BranchCust relation instance:

branchName branchCity assets custName
Downtown Brooklyn 9000000 Jones
Redwood Palo Alto 2100000 Smith
Perryridge Horseneck 1700000 Hayes
Downtown Brooklyn 9000000 Jackson
Mianus Horseneck 400000 Jones
Round Hill Horseneck 8000000 Turner
North Town Rye 3700000 Hayes


Schema Design (cont)

The CustLoan relation instance:

custName loanNo amount
Jones L-17 1000
Smith L-23 2000
Hayes L-15 1500
Jackson L-15 1500
Jones L-93 500
Turner L-11 900
Hayes L-16 1300


Schema Design (cont)

The result:

BranchCust still has redundancy problems.

CustLoan doesn't, but there is potential confusion over L-15.

But even worse, when we put these relations back together to try to re-create the original relation, we get some extra tuples!

Not good.


Schema Design (cont)

The result of Join(BranchCust,CustLoan)

branchName branchCity assets custName loanNo amount
Downtown Brooklyn 9000000 Jones L-17 1000
Downtown Brooklyn 9000000 Jones L-93 500
Redwood Palo Alto 2100000 Smith L-23 2000
Perryridge Horseneck 1700000 Hayes L-15 1500
Perryridge Horseneck 1700000 Hayes L-16 1300
Downtown Brooklyn 9000000 Jackson L-15 1500
Mianus Horseneck 400000 Jones L-93 500
Mianus Horseneck 400000 Jones L-17 1000
Round Hill Horseneck 8000000 Turner L-11 900
North Town Rye 3700000 Hayes L-16 1300
North Town Rye 3700000 Hayes L-15 1500


Schema Design (cont)

This is clearly not a successful decomposition.

The fact that we ended up with extra tuples was symptomatic of losing some critical "connection" information during the decomposition.

Such a decomposition is called a lossy decomposition.

In a good decomposition, we should be able to reconstruct the original relation exactly:

if R is decomposed into S and T, then Join(S,T) = R

Such a decomposition is called lossless join decomposition.


Boyce-Codd Normal Form

A relation schema R is in BCNF w.r.t a set F of functional dependencies iff:

A DB schema is in BCNF if all relation schemas are in BCNF.

Observations:


Boyce-Codd Normal Form (cont)

If we transform a schema into BCNF, we are guaranteed:

However, we are not guaranteed:

This may be a problem if the fds contain significant semantic information about the problem domain.

If we need to preserve dependencies, use 3NF.


BCNF Decomposition

The following algorithm converts an arbitrary schema to BCNF:

Inputs: schema R, set F of fds
Output: set Res of BCNF schemas

Res = {R};
while (any schema S ∈ Res is not in BCNF) {
    choose an fd X → Y on S that violates BCNF
    Res = (Res-S) ∪ (S-Y) ∪ XY
}


BCNF Decomposition (cont)

Example (the BankLoans schema):

BankLoans(branchName, branchCity, assets, custName, loanNo, amount)

Has functional dependencies F

The key for BankLoans is branchName,custName,loanNo

BCNF Decomposition (cont)

Applying the BCNF algorithm:

(continued)

BCNF Decomposition (cont)

Applying the BCNF algorithm (cont):


Third Normal Form

A relation schema R is in 3NF w.r.t a set F of functional dependencies iff:

A DB schema is in 3NF if all relation schemas are in 3NF.

The extra condition represents a slight weakening of BCNF requirements.


Third Normal Form (cont)

If we transform a schema into 3NF, we are guaranteed:

However, we are not guaranteed:

Whether to use BCNF or 3NF depends on overall design considerations.


Third Normal Form (cont)

The following algorithm converts an arbitrary schema to 3NF:

Inputs: schema R, set F of fds
Output: set Res of 3NF schemas

let Fc be a minimal cover for F
Res = {}
for each fd X → Y in Fc {
    if (no schema S ∈ Res contains XY) {
        Res = Res ∪ XY
    }
}
if (no schema S ∈ Res contains a candidate key for R) {
    K = any candidate key for R
    Res = Res ∪ K
}


Database Design Methodology

To achieve a "good" database design:

Note: may subsequently need to "denormalise" if the design yields inadequate performance.


Produced: 13 Sep 2020