COMP3311 Week 10 Monday Lecture

>>

Week 10 Monday
Assignment 2
Transactions (recap)
Concurrency Control in SQL
Locking in PostgreSQL
Exercise: Locking
Locking and Performance
Multi-version Concurrency Control
A Brief History of Databases
Future of Database
Big Data
Object-relational Mapping
Information Retrieval
Multimedia Data
Uncertainty
Stream Management Systems
Graph-based Databases
Dispersed Databases
Main-memory Databases
Other DB Research
Beyond COMP3311
COMP3311 Course Aims
Syllabus Overview
Assessment Summary
Final Exam
Revision
Supplementary Exams
Assessment
Some Thoughts ...
Finally ...

∧ >>

❖ Week 10 Monday

In today's lecture ...

Concurency, Database History, Course Review, Exam Preview

Things to do ...

Assignment 2 now due midday Thursday (Nov 16)
(and no fiurther extensions based on you overloading vxdb2)
Quiz 6 due by 23:59 Saturday (Nov 18)
MyExperience course evaluation now open (7% response rate so far)

Coming Up ...

no lecture on Wednesday
will schedule pre-exam Help Sessions for coming weeks

<< ∧ >>

❖ Assignment 2

How to allocate courses to requirements:

overall order: stream core, prog core, stream elective, prog elective, gened, free
within groups, consider in order of Requirements.id
does it appear in a core requirement, add to that requirement
does it appear in an elective requirement (either directly or via a pattern)
and are the UOC already allocated < min_req, add to that requirement
is the allocated gened UOC < max_req, add to the gened requirement
if UOC already allocated to free < min_req, add to free requirement
otherwise, cannot be allocated

Make a single pass through the courses in the transcript; allocate greedily

<< ∧ >>

❖ Assignment 2 (cont)

Side-effects of this one-pass greedy allocation approach

courses which could potentially be allocated to free, aren't
some GEN##### course may end up in free

First problem could be alleviated by changing the allocation to free

allocate to free even if UOC for free ≥ min_req

Second problem is difficult to fix

don't want to hold off allocating non-GEN courses to gened
just about any course can be a gened these days

DO NOT implement these fixes for the assignment

<< ∧ >>

❖ Transactions (recap)

Transaction are application atomic operations, involing multiple DB ops

A transaction must always terminate, either:

successfully (COMMIT), with all changes preserved
unsuccessfully (ABORT), with database unchanged

<< ∧ >>

❖ Concurrency Control in SQL

Transactions in SQL are specified by

BEGIN ... start a transaction
COMMIT ... successfully complete a transaction
ROLLBACK ... undo changes made by transaction + abort

In PostgreSQL, some actions cause implicit rollback:

raise exception during execution of a function
returning null from a before trigger

PostgreSQL also provdes ABORT as a synonym for SQL standard ROLLBACK

<< ∧ >>

❖ Concurrency Control in SQL (cont)

Concurrent access can be controlled via SQL:

table-level locking: apply lock to entire table
row-level locking: apply lock to just some rows

LOCK TABLE explicitly acquires lock on an entire table.

Some SQL commands implicitly acquire appropriate locks, e.g.

ALTER TABLE acquires an exclusive lock on table
UPDATE, DELETE acquire locks on affected rows

All locks are released at end of transaction

<< ∧ >>

❖ Locking in PostgreSQL

Some locking in PG is implicit (e.g. changing schema)

Explicit locks are available:

lock table TableName [ in LockMode mode ]

Some possible LockModes:

SHARE ... simply reading the table
SHARE ROW EXCLUSIVE ... intend to update table

No UNLOCK ... all locks are released at end of transaction

Row-level locking: lock all selected rows for writing

select * from Table where Condition for update

<< ∧ >>

❖ Locking in PostgreSQL (cont)

Two examples of lock usage in PostgreSQL:

begin;
lock table films in share mode;
select id into _id_ from films
    where name = 'Star Wars: Episode I - The Phantom Menace';
-- Do rollback if no record was returned
insert into films_user_comments values
    (_id_, 'GREAT! I was waiting for it for so long!');
commit;

begin;
lock table films in share row exclusive mode;
delete from films_user_comments where id in
    (select id from films where rating < 5);
delete from films where rating < 5;
commit;

<< ∧ >>

❖ Exercise: Locking

Apply appropriate locking in the bank transfer transaction.

create or replace function
   transfer(N integer, Src text, Dest text)
   returns integer
declare
   sID integer; dID integer; avail integer;
begin
   select id,balance into sID,avail
   from   Accounts where name=Src;
...
   return nextval('tx_id_seq');
end;

<< ∧ >>

❖ Locking and Performance

Locking reduces concurrency ⇒ lower throughput.

Granularity of locking can impact performance:

+ lock a small item ⇒ more of database accessible

+ lock a small item ⇒ quick update ⇒ quick lock release

- lock small items ⇒ more locks ⇒ more lock management

Granularity levels: field, row (tuple), table, whole database

Many DBMSs support multiple lock-granularities.

<< ∧ >>

❖ Multi-version Concurrency Control

One approach to reducing the requirement for locks is to

provide multiple (consistent) versions of the database
give each transaction access to an "appropriate" version
(i.e. a version that maintains the serializability of the transaction)

This approach is called Multi-Version Concurrency Control.

Differences between MVCC and standard locking models:

writing never blocks reading (make new version of tuple)
reading never blocks writing (read old version of tuple)

PostgreSQL pioneered MVCC as a concurrency control mechanism

<< ∧ >>

❖ Multi-version Concurrency Control (cont)

PostgreSQL MVCC ...

each tuple is tagged with (time of) tx that created/deleted it
each tuple is linked to the next newer version of same tuple

Access to a tuple requires

check most recent version which existed when tx started; use that version

Periodic vacuum process deletes tuples that

are not accessible to any currently executing transactions

Time/space overheads in implementing MVCC

are justified by reduced requirement for locking (⇒ more concurrency)

<< ∧ >>

❖ A Brief History of Databases

Some important moments in the development of Databases ...

1960's ... large data stores based on linked network of nodes
1970 ... Ted Codd invents relational model
1975 ... System R from IBM ... first relational DBMS (incl.SQL)
1979 ... Oracle ships first version of its DBMS
1980's ... Stonebraker makes Postgres (some OO features)
1990's ... Object-oriented DBMSs, Access (DBMS?), PostgreSQL
1998 ... IBM releases SQL Server
2000's ... XML databases, then NoSQL databases
2010's ... DBMSs become more "distributed", also MongoDB
recently ... eventual consistency, column stores, graph databases

<< ∧ >>

❖ Future of Database

Core "database" goals:

deal with very large amounts of data (terabytes, petabyes, ...)
very-high-level languages (deal with big data in uniform ways)
query optimisation (if evaluation too slow ⇒ useless)

At the moment (and for the last 40 years) RDBMSs dominate ...

simple, clean data model, backed up by theory
high-level language for accessing data
50 years development work on RDB engine technology

RDBMSs work well in domains with uniform, structured data.

<< ∧ >>

❖ Future of Database (cont)

Limitations/pitfalls of RDBMSs:

NULL is ambiguous: unknown, not applicable, not supplied
"limited" support for constraints/integrity and rules
no support for uncertainty (data represents the state-of-the-world)
data model too simple (e.g. limited support for complex objects, but PG)
query model too rigid (e.g. no approximate match, but PG regexp)
cotinually changing data sources not well-handled
data must be "molded" to fit a single rigid schema
database systems must be manually "tuned"
do not scale well to some data sets (e.g. Google, Telco's)

<< ∧ >>

❖ Future of Database (cont)

How to overcome (some of) these limitations?

Extend the relational model ...

add new data types and query ops for new applications
non-1NF data e.g. arrays, JSON ... both supported by PostgreSQL
deal with uncertainty/inaccuracy/approximation in data

Replace the relational model ...

object-oriented DBMS ... OO programming with persistent objects
XML DBMS ... all data stored as XML documents, new query model
application-effective data model (e.g. (key,value) pairs)

Performance: DBMSs that "tune" themselves ...

<< ∧ >>

❖ Big Data

Some modern applications have massive data sets (e.g. Google)

far too large to store on a single machine/RDBMS
query demands far too high even if could store in DBMS

Approach to dealing with such data

distribute data over large collection of nodes (redundancy)
provide computational mechanisms for distributing computation

Often this data does not need full relational selection

represent data via (key,value) pairs
unique key values can be used for addressing data
values can be large objects (e.g. web pages, images, ...)

<< ∧ >>

❖ Big Data (cont)

Popular computational approach to Big Data: map/reduce

suitable for widely-distributed, very-large data
allows parallel computation on such data to be easily specified
distribute (map) parts of computation across network
compute in parallel (possibly with further map'ing)
merge (reduce) multiple results for delivery to requestor

Some Big Data proponents see no future need for SQL/relational ...

depends on application (e.g. hard integrity vs eventual consistency)

Humour: Parody of noSQL fans (strong language warning)

<< ∧ >>

❖ Object-relational Mapping

Pure OO databases came and went in the 90's (similarly XML databases)

A compromise:

data stored in relational format
"wrapper" gives the apperance of objects

Programmer works purely with objects; wrapper converts to SQL

Problems:

OO query mechanisms are no better than SQL
translation may lead to inefficient SQL

<< ∧ >>

❖ Information Retrieval

DBMSs generally do precise matching (although like/regexps)

Information retrieval systems do approximate matching.

E.g. documents containing these words (Google, etc.)

Also, introduce notion of "quality" of matching
(e.g. tuple T₁ is a better match than tuple T₂)

Quality also implies ranking of results.

Much activity in incorporating IR ideas into DBMS context.

Goal: support database exploration better.

<< ∧ >>

❖ Multimedia Data

Data which does not fit the "tabular model":

image, video, music, text, ... (and combinations of these)

Research problems:

how to specify queries on such data? (image₁ ≅ image₂)
how to "display" results? (synchronize components)

Solutions to the first problem typically:

extend notions of "matching"/indexes for querying
require sophisticated methods for capturing data features

Sample query: find other songs like this one?

<< ∧ >>

❖ Uncertainty

Multimedia/IR introduces approximate matching.

In some contexts, we have approximate/uncertain data.

E.g. witness statements in a crime-fighting database

"I think the getaway car was red ... or maybe orange ..."

"I am 75% sure that John carried out the crime"

Work by Jennifer Widom at Stanford on the Trio system

extends the relational model (ULDB)
extends the query language (TriQL)

<< ∧ >>

❖ Stream Management Systems

Makes one addition to the relational model

stream = infinite sequence of tuples, arriving one-at-a-time

Applications: news feeds, telecomms, monitoring web usage, ...

RDBMSs: run a variety of queries on (relatively) fixed data

StreamDBs: run fixed queries on changing data (stream)

Approaches:

window = relation formed from a stream via a rule
stream data type = build new stream-specific operations

<< ∧ >>

❖ Graph-based Databases

Use graphs rather than tables as basic data structure tool.

Applications: complex data representation, via "flexible" objects

Implementing graphs in RDBMSs is possible, but often inefficient.

Graph nature of data changes query model considerably.

different kinds of queries: kNN, neighbourhood, skyline

Research problem: query processing for large graphs

<< ∧ >>

❖ Dispersed Databases

Characteristics of dispersed databases:

very large numbers of small processing nodes
data is distributed/shared among nodes

Applications: environmental monitoring devices, "intelligent dust", ...

Research issues:

query/search strategies (how to organise query processing)
distribution of data (trade-off between centralised and diffused)

Less extreme versions of this already exist:

grid and cloud computing
database management for mobile devices

<< ∧ >>

❖ Main-memory Databases

Disk-based data is slow to access, and page-at-a-time

RBDMS techniques aim to minimise page read/writes
methods: buffering, indexes, join algotithms

Non-volatile non-disk storage is becoming larger and cheaper

many databases would fit in memory ... disk as a backup/archive

Research problems:

how to structure data on SSD-like devices
how to process queries in such an environment

<< ∧ >>

❖ Other DB Research

Open problems:

database integration/federation
- making a collection of heterogeneous databases work together
- schema reconciliation is a difficult problem (vocab, schemas)
replication
- maintain consistent copies of one database across multiple hosts
- how to propagate changes systematically?
column-based storage
- row = collection of values for a given attribute
- column = collection of attributes for tuple

<< ∧ >>

❖ Beyond COMP3311

COMP9312 Data Analytics for Graphs

manipulating large-scale graph structures

COMP9313 Big Data Management

handling data that won't fit in a DBMS on one machine

COMP9315 Database Systems Implementation

comprehensive study of DBMS internals

COMP9319 Web Data Compression and Search

compression and searching algorithms, and XML

COMP6714 Information Retrieval and Web Search

finding information in unstructured text

<< ∧ >>

❖ COMP3311 Course Aims

At the end of this course you should be able to:

develop accurate, non-redundant data models;
realise data models as relational database schemas;
formulate queries via the full range of SQL constructs;
use stored procedures and triggers to extend DBMS capabilities;
write applications in Python that interact effectively with databases;
understand the overall architecture of relational DBMSs;
analyze performance issues in relational database applications;
understand how queries are evaluated via relational algebra operations;
understand the concepts behind transactions and concurrency control

<< ∧ >>

❖ Syllabus Overview

Data modelling and database design
- Entity-relationship (ER) design, relational data model
- Relational theory (algebra, dependencies, normalisation)
Database application development
- SQL for querying, data definition and modification (PostgreSQL's version)
- extending SQL Queries, Functions, Aggregates, Triggers
- PostgreSQL, psql (an SQL shell), PLpgSQL (procedural SQL)
- SQLite, sqlite3 (an SQL shell)
- Python3, Psycopg2, accessing data programmatically
DBMS theory/technology
- relational algebra, functional dependencies, normalization
- performance tuning, catalogues, access control
- DBMS architecture, query processing, transaction processing

Things in gray will definitely not be examined.

<< ∧ >>

❖ Assessment Summary

Your final mark/grade will be determined as follows:

quizzes = mark for on-line quizzes   (out of 12)
ass1    = mark for assignment 1      (out of 12)
ass2    = mark for assignment 2      (out of 16)
exam    = mark for final exam        (out of 60)
okExam  = exam >= 25                 (after scaling)
mark    = quizzes + ass1 + ass2 + exam
grade   = HD|DN|CR|PS  if mark >= 50 && okExam
        = FL           if mark <  50 && okExam
        = UF           if !okExam

<< ∧ >>

❖ Final Exam

Mon 4 Dec, 3-hour exam, morning and afternoon sessions

Morning: arrive by 10:00am, Afternoon: arrive by 1:20pm

Exam environment:

invigilated, in the CSE labs, on CSE workstations
closed exam environment, no Webcms3, no email, no web
provided: questions, documentation, course notes
no BYO laptop

<< ∧ >>

❖ Final Exam (cont)

Questions ...

5 practical exercises, using PostgreSQL/Python3/Psycopg2
4 theory exercises, typed answers, like tute questions

What you have access to ...

the exam paper (html), templates for all questions
checking scripts for programming questions (like autotest)
course notes, documentation for PostgreSQL/Python/Psycopg2

What you do not have access to

the class directory, your home directory
Webcms3, email, messaging, etc.

<< ∧ >>

❖ Final Exam (cont)

When you login (zID/zPass) into the exam environment ...

we create some working directories
- including templates and testing scripts
we create a local PostgreSQL server for you
we load the exam database into your PG server
we start a web browser, pointing at the exam entry page

PostgreSQL server runs on your workstation

Database: will not be huge, will not be overly complex (e.g. < 15 tables)

You do not have access to VLab or /localstorage or /home/YOU

<< ∧ >>

❖ Final Exam (cont)

For the prac questions ...

files are contained in a q? directory
including a template, test cases and a check script

For the theory questions ...

each question has a text file template q?.txt

General suggestions ...

submit your answers using a submit script (e.g. submit q2)
give up on questions that are taking too long (e.g. > 30 mins for prac)
submit questions as you complete them; can submit multiple times
don't leave all submissions to the last minute; if late, not marked

<< ∧ >>

❖ Final Exam (cont)

Want to know what are the questions on the exam ... ?

<< ∧ >>

❖ Final Exam (cont)

Questions are as follows:

an SQL query
another SQL query
a PLpgSQL function
a Python3/Psycopg2 script
another Python3/Psycopg2 script
an analysis/synthesis question
another analysis/synthesis question
etc. etc. etc.

Note that the database for Q1-5 will be available in advance.

<< ∧ >>

❖ Revision

Sources for revision material:

Lecture Material, Prac/Tute Exercises, Assignments
Fundamentals of Database Systems, Elmasri/Navathe
Database System Concepts, Silberschatz/Korth/Sudarshan
Database Management Systems, Ramakrishnan/Gehrke
Database Systems: Complete Book, Garcia-Molina/Ullman/Widom
Database Systems: App-oriented, Kifer/Bernstein/Lewis
PostgreSQL/Python3/Psycopg2 Documentation (to some extent)

<< ∧ >>

❖ Supplementary Exams

Supplementary Exams are only available to people who

are absent from the Final Exam with good reason
(good = documented, serious, clearly affects ability to do exam)
have performed well during the rest of the semester

If you are awarded a Supp Exam ...