COMP3311 Week 7 Monday Lecture

>>

Week 07 Monday
Rest of Term
Assignment 2
Keep in mind for Assignment 2
Psycopg2 recap
Psycopg2 Examples
Exercise: Python/psycopg2 Scripts
Relational Design Theory
Relational Design and Redundancy
Exercise: Update Anomalies
Update Anomalies
Avoiding Update Anomalies
Notation/Terminology
Database Design (revisited)
Functional Dependency
Identifying Functional Dependencies

∧ >>

❖ Week 07 Monday

In today's lecture ...

Assignment 2, Python/psycopg2, Relational Design Theory

Things to do ...

Quiz 4 due by 23:59 Friday 27 October (Week 7)
Assignment 2 due by 23:59 Friday 10 November (Week 9)
Play with the ass2 database and Python/psycopg2

<< ∧ >>

❖ Rest of Term

Plan for remainder of term:

Week 7		Python, Psycopg2, PostgreSQL
Week 7/8		Relational Design, Normalization
Week 8/9		Relational Algebra, Query Execution
Week 9/10		Transactions, Concurrency Control

<< ∧ >>

❖ Assignment 2

Database about UNSW, called MyMyUNSW ...

subjects, courses, streams (specialisations), programs
people, students, staff, orgunits (organisational units)
course_enrolments, stream_enrolments, program_enrolments
degree requirements (total UOC, core, electives, gened, ...)

Similar to SiMS, the database behind myUNSW

Goal: build a progression checker (no more waiting for the Nucleus)

Database schema info will eventually be in .../assignments/ass2/schema.php

<< ∧ >>

❖ Assignment 2 (cont)

Partial ER diagram of the MyMyUNSW database:

<< ∧ >>

❖ Assignment 2 (cont)

Differences/examples from diagram:

students don't enrol directly in streams; via program enrolment
OrgUnits exist in a hierarchy (school in faculty in UNSW)
People.id is not the same as zID

Requirements apply to streams and programs, e.g.

Requirements(id,name,type,min,max,acadObjects,stream/program)
Req(123,'CS core','core',42,42,'COMP1511,COMP1521,...',COMPA1)
Req(345,'Gen Ed','gened',12,12,'FREE####',3778)
Req(678,"CS Elective','elective',30,-,'COMP[3469]###',COMPA1)
Req(789,'CS Majors','stream',1,1,'COMPA1,COMPD1,...',3778)

<< ∧ >>

❖ Assignment 2 (cont)

How we are building the database ...

collected lots of enrolment/course/etc data
wrote scripts to extract (some) raw data into schema form
filtered out most of the enrolment data; "randomised" the data
built database, via createdb, psql -f, \copy

Database dump is > 20MB (don't copy it to Vlab)

Database under pgsql/data/base directory is > 200 MB

Your quota on /localstorage is not infinite; manage space carefully

<< ∧ >>

❖ Assignment 2 (cont)

What you get (as templates) and what you submit:

helpers.sql ... holds SQL views and PLpgSQL functions
helpers.py ... holds Python functions for general use
script₁ ... Python script for Q1
script₂ ... Python script for Q2
etc. etc. etc.

Put these under your VLab directory, not under /localstorage

Submit via give or Webcms3 before 11:59 on Friday 10 November

<< ∧ >>

❖ Keep in mind for Assignment 2

If 100's of people are finishing the assignment last minute ...

we get bombarded with 100's of emails, forum posts, etc.
(it takes much longer to get your question answered)
vxdb2 has 100's of PG servers running simultaneously
(it runs s-l-o-w; queries take much longer to run)

Beware: /localstorage is not backed up ... put only your DB there

<< ∧ >>

❖ Psycopg2 recap

Psycopg2 is a library for Python ↔ PostgreSQL interaction.

Where psycopg2 fits in the PL/DB architecture

<< ∧ >>

❖ Psycopg2 recap (cont)

Common psycopg2 operations:

conn = psycopg2.connect(DB_connection_string)
conn.close()
conn.commit()
cur = conn.cursor()
cur.execute(SQL_statement, Values)
cur.mogrify(SQL_statement, Values)
list = cur.fetchall()
tup = cur.fetchone()
list = cur.fetchmany(nTuples)

<< ∧ >>

❖ Psycopg2 recap (cont)

Typical database access pattern in Psycopg2

qry = 'select a,b,c from R where d = %s'
try:
   db = psycopg2.connect('dbname=mydb')
   cur = db.cursor()
   cur.execute(qry, [6])
   for tup in cur.fetchall():
      x,y,z = tup  # extract a,b,c values into x,y,z
      ... do something with x, y, z ...

except Exception as err:
  print("DB error: ", err)

finally:
  if db:
    db.close()

<< ∧ >>

❖ Psycopg2 Examples

Consider (a variation of) the MyMyUNSW database ...

People(id, zid, family, given, fullname, origin)
Students(id)
Staff(id, office, phone, employed, supervisor)

Orgunits(id, utype, name, unswid)
Terms(id, code, name, starting, ending)

Subjects(id, code, title, uoc, career, owner)
Courses(id, subject, term, homepage)
Streams(id, code, name, owner, stype)
Programs(id, code, name, uoc, owner, career, duration)

Program_enrolments(id, student, term, program, ...))
Stream_enrolments(part_of_prog_enr, stream)
Course_enrolments(student, course, mark, grade)

<< ∧ >>

❖ Exercise: Python/psycopg2 Scripts

Write a Python script that, given a partial student name

shows the full names and zids of students matching the name

Usage examples: ./stu 'smith', ./stu 'nobody'

Write a Python script that, given a term code

prints how many courses were offered in that term

Usage examples: ./crs 17s2, ./crs 19T1

<< ∧ >>

❖ Exercise: Python/psycopg2 Scripts (cont)

Write a Python script that, given a student ID

gives a list of what program(s) the student was enrolled in

Usage examples: ./progs 5124862, ./progs 1234567

Write a Python script that, given a term code

counts enrolments for every course in that term

Usage examples: ./enrs 17s2, ./enrs 19T1

<< ∧ >>

❖ Relational Design Theory

We know how to express ER data models and SQL schemas

But how do we know that our models/schemas are "good"?

Properties of "good" models/schemas

accurate ... represent scenario faithfully
complete ... represent all aspects of a scenario
minimal ... minimse stored data; no repetition

Relational design theory addresses the last issue

Relies on the notion of functional dependency

<< ∧ >>

❖ Relational Design Theory (cont)

Functional dependencies

describe relationships between attributes within a relation

What we study here:

basic theory and definition of functional dependencies
methodology for improving schema designs (normalisation)

The aim of studying this:

improve understanding of relationships among data
gain enough formalism to assist practical database design

<< ∧ >>

❖ Relational Design and Redundancy

In database design, redundancy is generally a "bad thing":

makes it difficult to maintain consistency after updates

Our goal is to reduce redundancy in stored data

by ensuring that the schema is structured appropriately

But ... redundancy may have some advantages

e.g. give performance improvements
(by avoiding joins to collect bits of related data together)

<< ∧ >>

❖ Relational Design and Redundancy (cont)

Consider the following relation defining bank accounts/branches:

accountNo balance customer branch address assets

A-101 500 1313131 Downtown Brooklyn 9000000

A-102 400 1313131 Perryridge Horseneck 1700000

A-113 600 9876543 Round Hill Horseneck 8000000

A-201 900 9876543 Brighton Brooklyn 7100000

A-215 700 1111111 Minus Horseneck 400000

A-222 700 1111111 Redwood Palo Alto 2100000

A-305 350 1234567 Round Hill Horseneck 8000000

Careless updating of this data may introduce inconsistencies.

<< ∧ >>

❖ Exercise: Update Anomalies

Show what happens when the following changes occur:

add a new account (A-306,100,2424242,Round Hill)
change the address of the Round Hill branch to San Jose
close the account A-201

How to ensure that the database is updated appropriately when a new account is added as above?

<< ∧ >>

❖ Update Anomalies

Insertion anomaly:

when we insert a new record, we need to check that branch data is consistent with existing tuples

Update anomaly:

if a branch changes address, we need to update all tuples referring to that branch

Deletion anomaly:

if we remove information about the last account at a branch, all of the branch information disappears

Insertion/update anomalies are avoidable, but need extra DBMS work

<< ∧ >>

❖ Avoiding Update Anomalies

Redundancy in stored data can lead to update anomalies

Functional dependencies allow us to

reason about relationships between attributes
identify redundancies (at schema level)

Normalization methods allow us to

transform a schema to remove redundancy

Note: most algorithms for normalization are non-deterministic

multiple different correct solutions are possible

<< ∧ >>

❖ Notation/Terminology

Most texts adopt the following terminology:

Relation schemas upper-case letters, denoting set of all attributes (e.g. R, S, P, Q )

Relation instances lower-case letter corresponding to schema (e.g. r(R), s(S), p(P), q(Q) )

Tuples lower-case letters (e.g. t, t', t₁, u, v )

Attributes upper-case letters from start of alphabet (e.g. A, B, C, D )

Sets of attributes simple concatenation of attribute names (e.g. X=ABCD, Y=EFG )

Attributes in tuples tuple[attrSet] (e.g. t[ABCD], t[X])

<< ∧ >>

❖ Database Design (revisited)

To avoid update anomalies, normalization should give:

a schema with "minimal overlap" between tables
each table contains a "coherent" collection of data values

Normalization typically involves

decompose a relation based on FDs
R → R₁ and R₂, where R₁ ∪ R₂ = R and R₁ ∩ R₂ ≠ ∅
repeat decomposition until no more decomposition is possible

Finally, each R_i contains info about one entity (e.g. branch, customer, ...)

<< ∧ >>

❖ Database Design (revisited) (cont)

But we already have a design procedure (ER-then-relational)

and it appears to produce schemas without redundancy

Why do we need another design procedure?

ER → Relational does not guarantee no redundancy
- dependency theory allows us to check designs for residual problems
the new procedure gives (semi)automated design
- determine all of the attributes in the problem domain
- collect them all together in a "universal relation"
- provide information about how attributes are related
- apply normalisation to decompose into non-redundant relations

<< ∧ >>

❖ Functional Dependency

A relation instance r(R) satisfies a dependency X → Y if

for any t, u ∈ r, t[X] = u[X] ⇒ t[Y] = u[Y]

In other words, if two tuples in R agree in their values for the set of attributes X, then they must also agree in their values for the set of attributes Y.

We say that "Y is functionally dependent on X".

Attribute sets X and Y may overlap; trivially true that X → X.

Notes:

X → Y can also be read as "X determines Y"
the single arrow → denotes functional dependency
the double arrow ⇒ denotes logical implication

<< ∧ >>

❖ Identifying Functional Dependencies

Consider an instance r(R) of the relation schema R(ABCDE):

A B C D E

a₁ b₁ c₁ d₁ e₁

a₂ b₁ c₂ d₂ e₁

a₃ b₂ c₁ d₁ e₁

a₄ b₂ c₂ d₂ e₁

a₅ b₃ c₃ d₁ e₁

What dependencies can we observe among its attributes?

<< ∧ >>

❖ Identifying Functional Dependencies (cont)

Since A values are unique, the definition of fd gives:

A → B, A → C, A → D, A → E
A → BC, A → CD, ... A → BCDE
can be summarised as A → BCDE

Since all E values are the same, it follows that:

A → E, B → E, C → E, D → E
in general, cannot be summarised as ABCD → E

<< ∧

❖ Identifying Functional Dependencies (cont)

Other observations:

combinations of BC are unique, therefore BC → ADE
combinations of BD are unique, therefore BD → ACE
if C values match, so do D values, therefore C → D
however, D values don't determine C values, so !(D → C)

We could derive many other dependencies, e.g. AE → BC, ...

In practice, choose a minimal set of fds (basis)

from which all other fds can be derived
which captures useful problem-domain information

accountNo	balance	customer	branch	address	assets
A-101	500	1313131	Downtown	Brooklyn	9000000
A-102	400	1313131	Perryridge	Horseneck	1700000
A-113	600	9876543	Round Hill	Horseneck	8000000
A-201	900	9876543	Brighton	Brooklyn	7100000
A-215	700	1111111	Minus	Horseneck	400000
A-222	700	1111111	Redwood	Palo Alto	2100000
A-305	350	1234567	Round Hill	Horseneck	8000000

Relation schemas		upper-case letters, denoting set of all attributes (e.g. R, S, P, Q )
Relation instances		lower-case letter corresponding to schema (e.g. r(R), s(S), p(P), q(Q) )
Tuples		lower-case letters (e.g. t, t', t₁, u, v )
Attributes		upper-case letters from start of alphabet (e.g. A, B, C, D )
Sets of attributes		simple concatenation of attribute names (e.g. X=ABCD, Y=EFG )
Attributes in tuples		tuple[attrSet] (e.g. t[ABCD], t[X])

A	B	C	D	E
a₁	b₁	c₁	d₁	e₁
a₂	b₁	c₂	d₂	e₁
a₃	b₂	c₁	d₁	e₁
a₄	b₂	c₂	d₂	e₁
a₅	b₃	c₃	d₁	e₁