COMP3311 Week 7 Monday Lecture
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [0/28]
In today's lecture ...
- Assignment 2, Python/psycopg2, Relational Design Theory
Things to do ...
- Quiz 4 due by 23:59 Friday 27 October (Week 7)
- Assignment 2 due by 23:59 Friday 10 November (Week 9)
- Play with the
ass2
database and Python/psycopg2
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [1/28]
Plan for remainder of term:
Week 7 | |
Python, Psycopg2, PostgreSQL |
Week 7/8 | |
Relational Design, Normalization |
Week 8/9 | |
Relational Algebra, Query Execution |
Week 9/10 | |
Transactions, Concurrency Control |
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [2/28]
Database about UNSW, called MyMyUNSW ...
- subjects, courses, streams (specialisations), programs
- people, students, staff, orgunits (organisational units)
- course_enrolments, stream_enrolments, program_enrolments
- degree requirements (total UOC, core, electives, gened, ...)
Similar to SiMS, the database behind myUNSW
Goal: build a progression checker (no more waiting for the Nucleus)
Database schema info will eventually be in .../assignments/ass2/schema.php
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [3/28]
Partial ER diagram of the MyMyUNSW database:
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [4/28]
Differences/examples from diagram:
- students don't enrol directly in streams; via program enrolment
- OrgUnits exist in a hierarchy (school in faculty in UNSW)
- People.id is not the same as zID
- Requirements apply to streams and programs, e.g.
Requirements(id,name,type,min,max,acadObjects,stream/program)
Req(123,'CS core','core',42,42,'COMP1511,COMP1521,...',COMPA1)
Req(345,'Gen Ed','gened',12,12,'FREE####',3778)
Req(678,"CS Elective','elective',30,-,'COMP[3469]###',COMPA1)
Req(789,'CS Majors','stream',1,1,'COMPA1,COMPD1,...',3778)
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [5/28]
How we are building the database ...
- collected lots of enrolment/course/etc data
- wrote scripts to extract (some) raw data into schema form
- filtered out most of the enrolment data; "randomised" the data
- built database, via
createdb
, psql -f
, \copy
Database dump is > 20MB
(don't copy it to Vlab)
Database under pgsql/data/base
directory is > 200 MB
Your quota on /localstorage
is not infinite; manage space carefully
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [6/28]
What you get (as templates) and what you submit:
-
helpers.sql
... holds SQL views and PLpgSQL functions
-
helpers.py
... holds Python functions for general use
- script1 ... Python script for Q1
- script2 ... Python script for Q2
- etc. etc. etc.
Put these under your VLab directory, not under
/localstorage
Submit via give
or Webcms3 before 11:59 on Friday 10 November
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [7/28]
❖ Keep in mind for Assignment 2 | |
If 100's of people are finishing the assignment last minute ...
Beware:
/localstorage
is not backed up ... put
only your DB there
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [8/28]
Psycopg2
is a library for Python ↔ PostgreSQL interaction.
Where psycopg2
fits in the PL/DB architecture
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [9/28]
Common psycopg2
operations:
-
conn = psycopg2.connect(
DB_connection_string)
-
conn.close()
-
conn.commit()
-
cur = conn.cursor()
-
cur.execute(
SQL_statement,
Values)
-
cur.mogrify(
SQL_statement,
Values)
-
list = cur.fetchall()
-
tup = cur.fetchone()
-
list = cur.fetchmany(
nTuples)
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [10/28]
Typical database access pattern in Psycopg2
qry = 'select a,b,c from R where d = %s'
try:
db = psycopg2.connect('dbname=mydb')
cur = db.cursor()
cur.execute(qry, [6])
for tup in cur.fetchall():
x,y,z = tup
... do something with x, y, z ...
except Exception as err:
print("DB error: ", err)
finally:
if db:
db.close()
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [11/28]
Consider (a variation of) the MyMyUNSW database ...
People(id, zid, family, given, fullname, origin)
Students(id)
Staff(id, office, phone, employed, supervisor)
Orgunits(id, utype, name, unswid)
Terms(id, code, name, starting, ending)
Subjects(id, code, title, uoc, career, owner)
Courses(id, subject, term, homepage)
Streams(id, code, name, owner, stype)
Programs(id, code, name, uoc, owner, career, duration)
Program_enrolments(id, student, term, program, ...))
Stream_enrolments(part_of_prog_enr, stream)
Course_enrolments(student, course, mark, grade)
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [12/28]
❖ Exercise: Python/psycopg2 Scripts | |
Write a Python script that, given a partial student name
- shows the full names and zids of students matching the name
Usage examples:
./stu 'smith'
,
./stu 'nobody'
Write a Python script that, given a term code
- prints how many courses were offered in that term
Usage examples:
./crs 17s2
,
./crs 19T1
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [13/28]
❖ Exercise: Python/psycopg2 Scripts (cont) | |
Write a Python script that, given a student ID
- gives a list of what program(s) the student was enrolled in
Usage examples:
./progs 5124862
,
./progs 1234567
Write a Python script that, given a term code
- counts enrolments for every course in that term
Usage examples:
./enrs 17s2
,
./enrs 19T1
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [14/28]
❖ Relational Design Theory | |
We know how to express ER data models and SQL schemas
But how do we know that our models/schemas are "good"?
Properties of "good" models/schemas
- accurate ... represent scenario faithfully
- complete ... represent all aspects of a scenario
- minimal ... minimse stored data; no repetition
Relational design theory addresses the last issue
Relies on the notion of functional dependency
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [15/28]
❖ Relational Design Theory (cont) | |
Functional dependencies
- describe relationships between attributes within a relation
What we study here:
- basic theory and definition of functional dependencies
- methodology for improving schema designs (normalisation)
The aim of studying this:
- improve understanding of relationships among data
- gain enough formalism to assist practical database design
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [16/28]
❖ Relational Design and Redundancy | |
In database design, redundancy is generally a "bad thing":
- makes it difficult to maintain consistency after updates
Our goal is to reduce redundancy in stored data
- by ensuring that the schema is structured appropriately
But ... redundancy may have some advantages
- e.g. give performance improvements
(by avoiding joins to collect bits of related data together)
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [17/28]
❖ Relational Design and Redundancy (cont) | |
Consider the following relation defining bank accounts/branches:
accountNo |
balance |
customer |
branch |
address |
assets |
A-101 |
500 |
1313131 |
Downtown |
Brooklyn |
9000000 |
A-102 |
400 |
1313131 |
Perryridge |
Horseneck |
1700000 |
A-113 |
600 |
9876543 |
Round Hill |
Horseneck |
8000000 |
A-201 |
900 |
9876543 |
Brighton |
Brooklyn |
7100000 |
A-215 |
700 |
1111111 |
Minus |
Horseneck |
400000 |
A-222 |
700 |
1111111 |
Redwood |
Palo Alto |
2100000 |
A-305 |
350 |
1234567 |
Round Hill |
Horseneck |
8000000 |
Careless updating of this data may introduce inconsistencies.
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [18/28]
❖ Exercise: Update Anomalies | |
Show what happens when the following changes occur:
- add a new account (A-306,100,2424242,Round Hill)
- change the address of the Round Hill branch to San Jose
- close the account A-201
How to ensure that the database is updated appropriately
when a new account is added as above?
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [19/28]
Insertion anomaly:
- when we insert a new record, we need to check that branch data is consistent with existing tuples
Update anomaly:
- if a branch changes address, we need to update all tuples referring to that branch
Deletion anomaly:
- if we remove information about the last account at a branch, all of the branch information disappears
Insertion/update anomalies are avoidable, but need extra DBMS work
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [20/28]
❖ Avoiding Update Anomalies | |
Redundancy in stored data can lead to update anomalies
Functional dependencies allow us to
- reason about relationships between attributes
- identify redundancies (at schema level)
Normalization methods allow us to
- transform a schema to remove redundancy
Note: most algorithms for normalization are non-deterministic
- multiple different correct solutions are possible
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [21/28]
Most texts adopt the following terminology:
Relation schemas |
|
upper-case letters, denoting set of all attributes
(e.g. R, S, P, Q )
|
Relation instances |
|
lower-case letter corresponding to schema
(e.g. r(R), s(S), p(P), q(Q) )
|
Tuples |
|
lower-case letters (e.g. t, t', t1, u, v )
|
Attributes |
|
upper-case letters from start of alphabet
(e.g. A, B, C, D )
|
Sets of attributes |
|
simple concatenation of attribute names
(e.g. X=ABCD, Y=EFG )
|
Attributes in tuples |
|
tuple[attrSet] (e.g. t[ABCD], t[X])
|
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [22/28]
❖ Database Design (revisited) | |
To avoid update anomalies, normalization should give:
- a schema with "minimal overlap" between tables
- each table contains a "coherent" collection of data values
Normalization typically involves
- decompose a relation based on FDs
- R → R1 and R2, where R1 ∪ R2 = R and R1 ∩ R2 ≠ ∅
- repeat decomposition until no more decomposition is possible
Finally, each
Ri contains info about one entity (e.g. branch, customer, ...)
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [23/28]
❖ Database Design (revisited) (cont) | |
But we already have a design procedure (ER-then-relational)
- and it appears to produce schemas without redundancy
Why do we need another design procedure?
- ER → Relational does not guarantee no redundancy
- dependency theory allows us to check designs for residual problems
- the new procedure gives (semi)automated design
- determine all of the attributes in the problem domain
- collect them all together in a "universal relation"
- provide information about how attributes are related
- apply normalisation to decompose into non-redundant relations
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [24/28]
A relation instance r(R) satisfies a dependency X → Y if
- for any t, u ∈ r, t[X] = u[X] ⇒ t[Y] = u[Y]
In other words, if two tuples in
R agree in their values for the
set of attributes
X, then they must also agree in their values for
the set of attributes
Y.
We say that "Y is functionally dependent on X".
Attribute sets X and Y may overlap;
trivially true that X → X.
Notes:
- X → Y can also be read as "X determines Y"
- the single arrow → denotes functional dependency
- the double arrow ⇒ denotes logical implication
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [25/28]
❖ Identifying Functional Dependencies | |
Consider an instance r(R) of the relation schema R(ABCDE):
A |
B |
C |
D |
E |
a1 |
b1 |
c1 |
d1 |
e1 |
a2 |
b1 |
c2 |
d2 |
e1 |
a3 |
b2 |
c1 |
d1 |
e1 |
a4 |
b2 |
c2 |
d2 |
e1 |
a5 |
b3 |
c3 |
d1 |
e1 |
What dependencies can we observe among its attributes?
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [26/28]
❖ Identifying Functional Dependencies (cont) | |
Since A values are unique, the definition of fd gives:
- A → B, A → C, A → D, A → E
- A → BC, A → CD, ... A → BCDE
- can be summarised as A → BCDE
Since all
E values are the same, it follows that:
- A → E, B → E, C → E, D → E
- in general, cannot be summarised as ABCD → E
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [27/28]
❖ Identifying Functional Dependencies (cont) | |
Other observations:
- combinations of BC are unique, therefore BC → ADE
- combinations of BD are unique, therefore BD → ACE
- if C values match, so do D values, therefore C → D
- however, D values don't determine C values, so !(D → C)
We could derive many other dependencies, e.g.
AE → BC, ...
In practice, choose a minimal set of fds (basis)
- from which all other fds can be derived
- which captures useful problem-domain information
COMP3311 23T3 ♢ Week 7 Monday Lecture ♢ [28/28]
Produced: 24 Oct 2023