As noted earlier, the relational model is:
One important theory developed for the relational model involves the notion of functional dependency (fd ).
Like constraints, assertions, etc. functional dependencies are drawn from the semantics of the application domain.
Essentially, fd 's describe how individual attributes are related.
Functional dependencies
A good relational database design:
In database design, redundancy is generally a "bad thing":
Consider the following relation defining bank accounts/branches:
accountNo | balance | customer | branch | address | assets |
A-101 | 500 | 1313131 | Downtown | Brooklyn | 9000000 |
A-102 | 400 | 1313131 | Perryridge | Horseneck | 1700000 |
A-113 | 600 | 9876543 | Round Hill | Horseneck | 8000000 |
A-201 | 900 | 9876543 | Brighton | Brooklyn | 7100000 |
A-215 | 700 | 1111111 | Mianus | Horseneck | 400000 |
A-222 | 700 | 1111111 | Redwood | Palo Alto | 2100000 |
A-305 | 350 | 1234567 | Round Hill | Horseneck | 8000000 |
We need to be careful updating this data, otherwise we may introduce inconsistencies.
Insertion anomaly:
Insertion anomaly example (insert account A-306 at Round Hill):
accountNo | balance | customer | branch | address | assets |
A-101 | 500 | 1313131 | Downtown | Brooklyn | 9000000 |
A-102 | 400 | 1313131 | Perryridge | Horseneck | 1700000 |
A-113 | 600 | 9876543 | Round Hill | Horseneck | 8000000 |
A-201 | 900 | 9876543 | Brighton | Brooklyn | 7100000 |
A-215 | 700 | 1111111 | Mianus | Horseneck | 400000 |
A-222 | 700 | 1111111 | Redwood | Palo Alto | 2100000 |
A-305 | 350 | 1234567 | Round Hill | Horseneck | 8000000 |
A-306 | 800 | 1111111 | Round Hill | Horseneck | 8000800 |
Update anomaly example (update Round Hill branch address):
accountNo | balance | customer | branch | address | assets |
A-101 | 500 | 1313131 | Downtown | Brooklyn | 9000000 |
A-102 | 400 | 1313131 | Perryridge | Horseneck | 1700000 |
A-113 | 600 | 9876543 | Round Hill | Palo Alto | 8000000 |
A-201 | 900 | 9876543 | Brighton | Brooklyn | 7100000 |
A-215 | 700 | 1111111 | Mianus | Horseneck | 400000 |
A-222 | 700 | 1111111 | Redwood | Palo Alto | 2100000 |
A-305 | 350 | 1234567 | Round Hill | Horseneck | 8000000 |
Deletion anomaly example (remove account A-101):
accountNo | balance | customer | branch | address | assets |
A-101 | 500 | 1313131 | Downtown | Brooklyn | 9000000 |
A-102 | 400 | 1313131 | Perryridge | Horseneck | 1700000 |
A-113 | 600 | 9876543 | Round Hill | Horseneck | 8000000 |
A-201 | 900 | 9876543 | Brighton | Brooklyn | 7100000 |
A-215 | 700 | 1111111 | Mianus | Horseneck | 400000 |
A-222 | 700 | 1111111 | Redwood | Palo Alto | 2100000 |
A-305 | 350 | 1234567 | Round Hill | Horseneck | 8000000 |
Where is the Downtown branch located? What are its assets?
To avoid these kinds of update problems:
This leads to a (bottom-up) database design procedure:
This contrasts with our earlier (top-down) design procedure:
Some reasons ...
Most texts adopt the following terminology:
schemas |
upper-case letters, denoting set of all attributes (e.g. R, S, P, Q ) | |
instances |
lower-case letter corresponding to schema (e.g. r(R), s(S), p(P), q(Q) ) | |
lower-case letters (e.g. t, t', t1, u, v ) | ||
upper-case letters from start of alphabet (e.g. A, B, C, D ) | ||
attributes |
simple concatenation of attribute names (e.g. X=ABCD, Y=EFG ) | |
in tuples |
tuple[attrSet] (e.g. t[ABCD], t[X]) |
A relation instance r(R) satisfies a dependency X → Y if
We say that "Y is functionally dependent on X".
Attribute sets X and Y may overlap; trivially true that X → X.
Notes:
The above definition talks about dependency within a relation instance r(R).
Much more important for design is the notion of dependency across all possible instances of the relation (i.e. a schema-based dependency).
This is a simple generalisation of the previous definition:
Consider the following instance r(R) of the relation schema R(ABCDE):
A | B | C | D | E |
a1 | b1 | c1 | d1 | e1 |
a2 | b1 | c2 | d2 | e1 |
a3 | b2 | c1 | d1 | e1 |
a4 | b2 | c2 | d2 | e1 |
a5 | b3 | c3 | d1 | e1 |
What kind of dependencies can we observe among the attributes in r(R)?
Since the values of A are unique, it follows from the fd definition that:
It also follows that A → BC (or any other subset of ABCDE).
This can be summarised as A → BCDE
From our understanding of primary keys, A is a PK.
Since the values of E are always the same, it follows that:
Note: cannot generally summarise above by ABCD → E
(However, ABCD → E does happen to be true in this example)
In general, A → Y, B → Y
Other observations:
In practice, choose a minimal set of fds (basis)
Can we generalise some ideas about functional dependency?
E.g. are there dependencies that hold for any relation?
Yes, but they're rather uninteresting ones such as:
which generalises to Y ⊂ X ⇒ X → Y.
E.g. do some dependencies suggest the existence of others?
Yes, and this is much more interesting ... there are a number of rules of inference that allow us to derive dependencies.
Armstrong's rules are complete, general rules of inference on fds.
F1. Reflexivity e.g. X → X
F2. Augmentation e.g. X → Y ⇒ XZ → YZ
F3. Transitivity e.g. X → Y, Y → Z ⇒ X → Z
While Armstrong's rules are complete, other useful rules exist:
F4. Additivity e.g. X → Y, X → Z ⇒ X → YZ
F5. Projectivity e.g. X → YZ ⇒ X → Y, X → Z
F6. Pseudotransitivity e.g. X → Y, YZ → W ⇒ XZ → W
Using rules and a set F of given fds, we can determine what other fds hold.
Example (derivation of AB → GH):
F = { AB → E, AG → J, BE → I, E → G, GI → H }
AB → E | (given) | |||
AB → AB | (using F1) | |||
AB → B | (using F5 on 2) | |||
AB → BE | (using F4 on 1,3) | |||
BE → I | (given) |
Continuing the derivation ...
AB → I | (using F3 on 4,5) | |||
E → G | (given) | |||
AB → G | (using F3 on 1,7) | |||
AB → GI | (using F4 on 6,8) | |||
GI → H | (given) | |||
AB → H | (using F3 on 6,8) | |||
GI → GI | (using F1) | |||
GI → I | (using F5 on 12) | |||
AB → GH | (using F4 on 8,11) |
Given a set F of fds, how many new fds can we derive?
For a finite set of attributes, there must be a finite set of fds.
The largest collection of dependencies that can be derived from F is called the closure of F and is denoted F+.
Closures allow us to answer two interesting questions:
For the question "is X → Y derivable from F?" ...
Algorithms based on F+ rapidly become infeasible.
Example (of fd closure):
R = ABC, F = { AB → C, C → B }
F+ = { A → A, AB → A,
AC → A, AB → B,
BC → B, ABC → B,
C → C, AC → C,
BC → C, ABC → C,
AB → AB, . . . . . . ,
AB → ABC, AB → ABC,
C → B, C → BC,
AC → B, AC → AB }
To solve this problem, use closures based on sets of attributes rather than sets of fds.
Given a set X of attributes and a set F of fds, the largest set of attributes that can be derived from X using F, is called the closure of X (denoted X+).
We can prove (using additivity) that (X → Y) ∈ F+ iff Y ⊂ X+.
For computation, | X+ | is bounded by the number of attributes.
For the question "is X → Y derivable from F?" ...
Inputs: set F of fds set X of attributes Output: closure of X (i.e. X+) X+ = X stillChanging = true; while (stillChanging) { stillChanging = false; for each W → Z in F { if (W ⊆ X+) and not (Z ⊆ X+) { X+ = X+ ∪ Z stillChanging = true; } } }
E.g. R = ABCDEF, F = { AB → C, BC → AD, D → E, CF → B }
Does AB → D follow from F? Solve by checking D ∈ AB+.
Computing AB+:
AB+ = AB | (initially) | |||
AB+ = ABC | (using AB → C) | |||
AB+ = ABCD | (using BC → AD) | |||
AB+ = ABCDE | (using D → E) |
Since D is in AB+, then AB → D does follow from F.
E.g. R = ABCDEF, F = { AB → C, BC → AD, D → E, CF → B }
Does D → A follow from F? Solve by checking A ∈ D+.
Computing D+:
D+ = D | (initially) | |||
D+ = DE | (using D → E) |
Since A is not in D+, then D → A does not follow from F.
E.g. R = ABCDEF, F = { AB → C, BC → AD, D → E, CF → B }
What are the keys of R?
Solve by finding X ⊂ R such that X+ = R.
From previous examples, we know AB and D are not keys.
This also implies that A and B alone are not keys.
So how to find keys? Try all combinations of ABCDEF ...
E.g. maybe ACF is a key ...
Computing ACF+:
ACF+ = ACF | (initially) | |||
ACF+ = ABCF | (using CF → B) | |||
ACF+ = ABCDF | (using BC → AD) | |||
ACF+ = ABCDEF | (using D → E) |
Since ACF+ = R, ACF is a key (as is ABF).
For a given application, we can define many different sets of fds with the same closure (e.g. F and G where F+ = G+)
Which one is best to "model" the application?
Minimal cover Fc for a set F of fds:
An attribute a is redundant if (F-{d}∪{d'})+ = F+
(where d' is the same as d but with attribute A removed)
Algorithm for computing minimal cover:
Inputs: set F of fds Output: minimal cover Fc of F Fc = F Step 1: put f ∈ Fc into canonical form Step 2: eliminate redundant attributes from f ∈ Fc Step 3: eliminate redundant fds from Fc
Step 1: put fds into canonical form
for each f ∈ Fc like X → {A1,..,An} Fc = Fc - {f} for each a in {A1,..,An} Fc = Fc ∪ {X → a} end end
Step 2: eliminate redundant attributes
for each f ∈ Fc like X → A for each b in X f' = (X-{b}) → A; G = Fc - {f} ∪ {f'} if (G+ == Fc+) Fc = G end end
Step 3: eliminate redundant functional dependencies
for each f ∈ Fc G = Fc - {f} if (G+ == Fc+) Fc = G end
Note: we often assume that any supplied F is minimal.
E.g. R = ABC, F = { A → BC, B → C, A → B, AB → C }
Compute the minimal cover:
Normalization: branch of relational theory providing design insights.
The goals of normalization:
Normalization draws heavily on the theory of functional dependencies.
Normalization theory defines six normal forms (NFs).
Each normal form:
Must first decide which normal form rNF is "acceptable".
The normalization process:
A brief history of normal forms:
1NF allows most redundancy; 5NF allows least redundancy.
all attributes have atomic values we assume this as part of relational model |
||
all non-key attributes fully depend on key (i.e. no partial dependencies) avoids much redundancy |
||
BCNF |
no attributes dependent on non-key attrs (i.e. no transitive dependencies) avoids remaining redundancy |
|
removes problems due to multivalued dependencies | ||
removes problems due to join dependencies |
In practice, BCNF and 3NF are the most important.
(these are generally the "acceptable normal forms" for relational design)
Boyce-Codd Normal Form (BCNF):
The standard transformation technique to remove redundancy:
We accomplish decomposition by
We may require several decompositions to achieve acceptable NF.
Normalization algorithms tell us how to choose S and T.
Consider the following relation for BankLoans:
branchName | branchCity | assets | custName | loanNo | amount |
Downtown | Brooklyn | 9000000 | Jones | L-17 | 1000 |
Redwood | Palo Alto | 2100000 | Smith | L-23 | 2000 |
Perryridge | Horseneck | 1700000 | Hayes | L-15 | 1500 |
Downtown | Brooklyn | 9000000 | Jackson | L-15 | 1500 |
Mianus | Horseneck | 400000 | Jones | L-93 | 500 |
Round Hill | Horseneck | 8000000 | Turner | L-11 | 900 |
North Town | Rye | 3700000 | Hayes | L-16 | 1300 |
The BankLoans relation exhibits update anomalies (insert, update, delete).
The cause of these problems can be stated in terms of fds
This suggests that we have two separate notions (branch and loan) mixed up in a single relation
To improve the design, decompose the BankLoans relation.
The following decomposition is not helpful:
Branch(branchName, branchCity, assets)
CustLoan(custName, loanNo, amount)
because we lose information (which branch is a loan held at?)
Clearly, we need to leave some "connection" between the new relations, so that we can reconstruct the original information if needed.
Another possible decomposition:
BranchCust(branchName, branchCity, assets, custName)
CustLoan(custName, loanNo, amount)
The BranchCust relation instance:
branchName | branchCity | assets | custName |
Downtown | Brooklyn | 9000000 | Jones |
Redwood | Palo Alto | 2100000 | Smith |
Perryridge | Horseneck | 1700000 | Hayes |
Downtown | Brooklyn | 9000000 | Jackson |
Mianus | Horseneck | 400000 | Jones |
Round Hill | Horseneck | 8000000 | Turner |
North Town | Rye | 3700000 | Hayes |
The CustLoan relation instance:
custName | loanNo | amount |
Jones | L-17 | 1000 |
Smith | L-23 | 2000 |
Hayes | L-15 | 1500 |
Jackson | L-15 | 1500 |
Jones | L-93 | 500 |
Turner | L-11 | 900 |
Hayes | L-16 | 1300 |
The result:
BranchCust still has redundancy problems.
CustLoan doesn't, but there is potential confusion over L-15.
But even worse, when we put these relations back together to try to re-create the original relation, we get some extra tuples!
Not good.
The result of Join(BranchCust,CustLoan)
branchName | branchCity | assets | custName | loanNo | amount |
Downtown | Brooklyn | 9000000 | Jones | L-17 | 1000 |
Downtown | Brooklyn | 9000000 | Jones | L-93 | 500 |
Redwood | Palo Alto | 2100000 | Smith | L-23 | 2000 |
Perryridge | Horseneck | 1700000 | Hayes | L-15 | 1500 |
Perryridge | Horseneck | 1700000 | Hayes | L-16 | 1300 |
Downtown | Brooklyn | 9000000 | Jackson | L-15 | 1500 |
Mianus | Horseneck | 400000 | Jones | L-93 | 500 |
Mianus | Horseneck | 400000 | Jones | L-17 | 1000 |
Round Hill | Horseneck | 8000000 | Turner | L-11 | 900 |
North Town | Rye | 3700000 | Hayes | L-16 | 1300 |
North Town | Rye | 3700000 | Hayes | L-15 | 1500 |
This is clearly not a successful decomposition.
The fact that we ended up with extra tuples was symptomatic of losing some critical "connection" information during the decomposition.
Such a decomposition is called a lossy decomposition.
In a good decomposition, we should be able to reconstruct the original relation exactly:
Such a decomposition is called lossless join decomposition.
A relation schema R is in BCNF w.r.t a set F of functional dependencies iff:
Observations:
If we transform a schema into BCNF, we are guaranteed:
This may be a problem if the fds contain significant semantic information about the problem domain.
If we need to preserve dependencies, use 3NF.
The following algorithm converts an arbitrary schema to BCNF:
Inputs: schema R, set F of fds Output: set Res of BCNF schemas Res = {R}; while (any schema S ∈ Res is not in BCNF) { choose an fd X → Y on S that violates BCNF Res = (Res-S) ∪ (S-Y) ∪ XY }
Example (the BankLoans schema):
BankLoans(branchName, branchCity, assets, custName, loanNo, amount)
Has functional dependencies F
Applying the BCNF algorithm:
Branch(branchName, branchCity, assets)
LoanInfo(branchName, custName, loanNo, amount)
Applying the BCNF algorithm (cont):
Loan(branchName, loanNo, amount)
Borrower(custName, loanNo)
A relation schema R is in 3NF w.r.t a set F of functional dependencies iff:
The extra condition represents a slight weakening of BCNF requirements.
If we transform a schema into 3NF, we are guaranteed:
Whether to use BCNF or 3NF depends on overall design considerations.
The following algorithm converts an arbitrary schema to 3NF:
Inputs: schema R, set F of fds Output: set Res of 3NF schemas let Fc be a minimal cover for F Res = {} for each fd X → Y in Fc { if (no schema S ∈ Res contains XY) { Res = Res ∪ XY } } if (no schema S ∈ Res contains a candidate key for R) { K = any candidate key for R Res = Res ∪ K }
To achieve a "good" database design:
Produced: 13 Sep 2020