COMP3311 Week 9 Thursday Lecture
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [0/39]
In today's lecture ...
- Transactions, Concurrency
Things to do ...
- Assignment 2 due by 23:59 THIS Friday (Apr 14)
- Help Sessions: Thu 4-6, Fri 4-7
- MyExperience ... do it after submitting Ass2
Things to note ...
- Exam on Mon 1 May, two sessions: morning, afternoon
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [1/39]
❖ Transactions (review)
Transaction (tx) = application unit of work involving multiple DB updates
Want tx's to be: Atomic, Consistent, Isolated, Durable
Description of tx's normally abstracted to R(X), W(X), C, A operations (read, write, commit, abort)
Schedule = sequence of operations involving one or more tx's
Example serial schedule involving two transactions:
T1: R(X) W(X) R(Y) W(Y) R(Z) W(Z)
T2:                               R(Y) W(Y) R(X)
Example concurrent schedule involving same transactions:
T1: R(X) W(X)           R(Y) W(Y) R(Z) W(Z)
T2:           R(Y) W(Y)                     R(X)
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [2/39]
❖ Transactions (review) (cont)
All serial schedules (e.g. T1 then T2) are valid/safe
Not all concurrent schedules are valid (e.g. above example is not safe)
Serializable schedule = concurrent schedule equivalent to a serial schedule
Example serializable concurrent schedule:
T1: R(X)                W(X) R(Y) W(Y)
T2:      R(Y) W(Y) R(X)
T3:                                    R(X) W(X) W(Z)
Two notions of serializability: conflict serializability, view serializability
DBMSs aim to control concurrency via locking and rollback
- arrange order of R and W operations to produce serializable schedules
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [3/39]
❖ Conflict Serializability
A characterization of serializability based on conflicting operations
Consider two transactions T1 and T2 acting on data item X.
Possible orders for read/write operations by T1 and T2:
T1 first      T2 first      Equiv?
R1(X) R2(X)   R2(X) R1(X)   yes
R1(X) W2(X)   W2(X) R1(X)   no
W1(X) R2(X)   R2(X) W1(X)   no
W1(X) W2(X)   W2(X) W1(X)   no
Note: if T1 and T2 act on different data items
(e.g. R(X), W(Y)),
order does not matter.
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [4/39]
❖ Conflict Serializability (cont)
Two transactions have a potential conflict if
- they perform operations on the same data item
- at least one of the operations is a write operation
In such cases, the order of operations affects the result.
If no conflict, can swap order without affecting the result.
If we can transform a schedule
- by swapping the order of non-conflicting operations
- such that the result is a serial schedule
then we say that the schedule is conflict serializable.
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [5/39]
❖ Conflict Serializability (cont)
Example: transform a concurrent schedule to a serial schedule

T1: R(A) W(A)      R(B)      W(B)
T2:           R(A)      W(A)      R(B) W(B)
     swap R2(A) ↔ R1(B)   (different items, no conflict)
T1: R(A) W(A) R(B)           W(B)
T2:                R(A) W(A)      R(B) W(B)
     swap W2(A) ↔ W1(B)   (different items, no conflict)
T1: R(A) W(A) R(B)      W(B)
T2:                R(A)      W(A) R(B) W(B)
     swap R2(A) ↔ W1(B)   (different items, no conflict)
T1: R(A) W(A) R(B) W(B)
T2:                     R(A) W(A) R(B) W(B)
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [6/39]
❖ Conflict Serializability (cont)
Checking for conflict-serializability:
- show that the ordering of conflicting operations in the concurrent schedule cannot be achieved in any serial schedule
Method for doing this:
- build a precedence graph
  - nodes represent transactions
  - arcs represent order of actions on shared data
  - an arc T1 → T2 means T1 acts on X before T2 does
  - a cycle indicates the schedule is not conflict-serializable
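Since this is a database course, the cycle check itself can even be phrased in SQL. A minimal sketch, assuming a hypothetical table precedes(t_from,t_to) holding the arcs of the precedence graph:

-- e.g. the cyclic graph T1 -> T2 -> T1
-- create table precedes (t_from text, t_to text);
-- insert into precedes values ('T1','T2'), ('T2','T1');

-- walk all paths; flag any path that revisits a node
with recursive paths(t_start, t_node, visited, cycle) as (
   select t_from, t_to, array[t_from, t_to], t_to = t_from
   from   precedes
   union all
   select p.t_start, e.t_to, p.visited || e.t_to, e.t_to = any(p.visited)
   from   paths p
   join   precedes e on e.t_from = p.t_node
   where  not p.cycle
)
select exists (select 1 from paths where cycle) as has_cycle;

If has_cycle is true, the schedule that produced the graph is not conflict-serializable.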
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [7/39]
❖ Example Precedence Graph
Example schedule which is not conflict serializable:
T1: R(X)           R(Y) W(X) W(Y)
T2:      R(X) W(X)
T3:                               R(X) W(X)
Precedence graph for the above schedule:
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [8/39]
❖ Exercise: Precedence Graphs
Build precedence graphs for the following schedules:
-
T1: R(X) W(X) R(Y) W(Y)
T2: R(X) W(X)
-
T1: R(X) W(X) R(Y) W(Y)
T2: R(X) W(X)
For each schedule:
- is it conflict serializable?
- if so, show the sequence of swaps that produces a serial schedule.
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [9/39]
❖ View Serializability
View Serializability is
- an alternative formulation of serializability
- that is less conservative than conflict serializability (CS)
(some safe schedules that are view serializable are not conflict serializable)
As with CS, it is based on a notion of schedule equivalence
- a schedule is "safe" if view equivalent to a serial schedule
The idea: if across the two schedules ...
- they read the same version of a shared object
- they write the same final version of an object
then they are view equivalent
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [10/39]
❖ View Serializability (cont)
Two schedules S and S' on T1 .. Tn are view equivalent iff
- for each shared data item X
- if, in S, Tj reads the initial value of X,
then, in S', Tj also reads the initial value of X
- if, in S, Tj reads X written by Tk,
then, in S', Tj also reads the value of X written by Tk
- if, in S, Tj performs the final write of X,
then, in S', Tj also performs the final write of X
To check serializability of S ...
- find a serial schedule that is view equivalent to S
- from among the n! possible serial schedules
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [11/39]
❖ Exercise: Checking Serializability
Check whether each schedule is conflict/view serializable:
-
T1: R(X) W(X)
T2: R(X) W(X)
-
T1: W(X) R(Y)
T2: R(Y) R(X)
-
T1: R(X) W(X)
T2: W(X)
T3: W(X)
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [12/39]
❖ Concurrency Control
Serializability tests are useful theoretically ...
But don't provide a mechanism for organising schedules
- they can only be done "after the event"
- they are computationally very expensive: O(n!) serial schedules to compare against
What is required are methods that ...
- can be applied to each transaction individually
- guarantee that overall schedule is serializable
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [13/39]
❖ Concurrency Control (cont)
Approaches to ensuring ACID transactions:
- lock-based
Synchronise transaction execution via locks on some portion
of the database.
- version-based
Allow multiple consistent versions of the data to exist, and
allow each transaction exclusive access to one version.
- timestamp-based
Organise transaction execution in advance by assigning timestamps to
operations.
- validation-based (optimistic concurrency control)
Exploit typical execution-sequence properties of transactions to determine safety dynamically.
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [14/39]
❖ Lock-based Concurrency Control
Synchronise access to shared data items via following rules:
- before reading X, get shared (read) lock on X
- before writing X, get exclusive (write) lock on X
- an attempt to get a shared lock on X is blocked
if another transaction already has exclusive lock on X
- an attempt to get an exclusive lock on X is blocked
if another transaction has any kind of lock on X
These rules alone do not guarantee serializability ...
- but two-phase locking does
- acquire all needed locks before performing any unlocks
Locking also introduces potential for deadlock and starvation.
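These rules can be observed directly in PostgreSQL: select ... for share takes a shared row lock, while update (or select ... for update) needs an exclusive one. A small sketch, assuming the Accounts table used in the transfer exercise later in the lecture:

-- Session 1
begin;
select * from Accounts where id = 1 for share;  -- shared (read) lock on the row

-- Session 2, concurrently
begin;
select * from Accounts where id = 1 for share;  -- succeeds: shared locks are compatible
update Accounts set balance = 0 where id = 1;   -- blocks: exclusive lock must wait

-- Session 1
commit;                                         -- releases the shared lock

-- Session 2: the blocked update now proceeds
commit;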
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [15/39]
❖ Lock-based Concurrency Control (cont)
Examples of locking:
Schedule 1
T1: Lx(X) R(X) W(X) U(X)
T2: Lx(Y) R(Y) W(Y) U(Y)
Schedule 2
T1: Lx(X) R(X) W(X) U(X)
T2: Lx(X) .............. R(X) W(X) U(X)
New operations: Lx() = exclusive lock, Ls() = shared lock, U() = unlock
Note: in Schedule 2, locking effectively forces serial execution
This is not generally the case; there may be some concurrency
In general, locking reduces concurrency, but gains safety
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [16/39]
❖ Concurrency Control in SQL
Transactions in SQL are specified by
- BEGIN ... start a transaction
- COMMIT ... successfully complete a transaction
- ROLLBACK ... undo changes made by transaction + abort
In PostgreSQL, some actions cause implicit rollback:
- raise exception during execution of a function
- returning null from a before trigger
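A minimal sketch of these commands (the account names are hypothetical, but the Accounts table is the one used in the transfer exercise later in the lecture):

begin;
update Accounts set balance = balance - 100 where name = 'John';
update Accounts set balance = balance + 100 where name = 'Jane';
commit;       -- make both updates permanent and visible to others
-- or, in place of commit:
-- rollback;  -- undo both updates as if they never happened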
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [17/39]
❖ Concurrency Control in SQL (cont)
Concurrent access can be controlled via SQL:
- table-level locking: apply lock to entire table
- row-level locking: apply lock to just some rows
LOCK TABLE explicitly acquires a lock on an entire table.
Some SQL commands implicitly acquire appropriate locks, e.g.
- ALTER TABLE acquires an exclusive lock on the table
- UPDATE, DELETE acquire locks on affected rows
All locks are released at end of transaction (no explicit unlock)
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [18/39]
❖ Locking in PostgreSQL
Some locking in PG is implicit (e.g. changing schema)
Explicit locks are available:
lock table TableName [ in LockMode mode ]
Some possible LockModes:
- SHARE ... simply reading the table
- SHARE ROW EXCLUSIVE ... intend to update table
No UNLOCK ... all locks are released at end of transaction
Row-level locking: lock all selected rows for writing
select * from Table where Condition for update
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [19/39]
❖ Locking in PostgreSQL (cont)
Two examples of lock usage in PostgreSQL:
begin;
lock table films in share mode;
select id into _id_ from films
where name = 'Star Wars: Episode I - The Phantom Menace';
-- Do rollback if no record was returned
insert into films_user_comments values
(_id_, 'GREAT! I was waiting for it for so long!');
commit;
begin;
lock table films in share row exclusive mode;
delete from films_user_comments where id in
(select id from films where rating < 5);
delete from films where rating < 5;
commit;
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [20/39]
Apply appropriate locking in the bank transfer transaction.
create or replace function
    transfer(N integer, Src text, Dest text)
    returns integer
as $$
declare
    sID integer; dID integer; avail integer;
begin
    select id,balance into sID,avail
    from Accounts where name=Src;
    ...
    return nextval('tx_id_seq');
end;
$$ language plpgsql;
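One possible approach (a sketch, not the official solution): grab exclusive row locks on both accounts up-front with select ... for update, so the balance check and the updates cannot interleave with a concurrent transfer:

create or replace function
    transfer(N integer, Src text, Dest text)
    returns integer
as $$
declare
    sID integer; dID integer; avail integer;
begin
    select id, balance into sID, avail
    from   Accounts where name = Src for update;   -- lock source row
    select id into dID
    from   Accounts where name = Dest for update;  -- lock destination row
    if sID is null or dID is null then
        raise exception 'invalid account name';    -- implicit rollback
    end if;
    if avail < N then
        raise exception 'insufficient funds';      -- implicit rollback
    end if;
    update Accounts set balance = balance - N where id = sID;
    update Accounts set balance = balance + N where id = dID;
    return nextval('tx_id_seq');
end;
$$ language plpgsql;

Note: two opposite transfers locking rows in opposite orders can deadlock; locking the rows in a fixed order (e.g. by id) avoids this.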
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [21/39]
❖ Locking and Performance
Locking reduces concurrency ⇒ lower throughput.
Granularity of locking can impact performance:
+ lock a small item ⇒ more of database accessible
+ lock a small item ⇒ quick update ⇒ quick lock release
- lock small items ⇒ more locks ⇒ more lock management
Granularity levels: field, row (tuple), table, whole database
Many DBMSs support multiple lock-granularities.
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [22/39]
❖ Multi-version Concurrency Control
One approach to reducing the requirement for locks is to
- provide multiple (consistent) versions of the database
- give each transaction access to an "appropriate" version
(i.e. a version that maintains the serializability of the transaction)
This approach is called Multi-Version Concurrency Control (MVCC).
Differences between MVCC and standard locking models:
- writing never blocks reading (make new version of tuple)
- reading never blocks writing (read old version of tuple)
PostgreSQL was an early adopter of MVCC as its concurrency control mechanism
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [23/39]
❖ Multi-version Concurrency Control (cont)
PostgreSQL MVCC ...
- each tuple is tagged with (time of) tx that created/deleted it
- each tuple is linked to the next newer version of same tuple
Access to a tuple requires
- finding the most recent version that existed when the tx started, and using that version
Periodic vacuum process deletes tuples that are not accessible to any currently executing transactions
Time/space overheads in implementing MVCC
- are justified by reduced requirement for locking (⇒ more concurrency)
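The "writing never blocks reading" behaviour can be seen with two concurrent PostgreSQL sessions (a sketch, assuming the Accounts table from the transfer exercise):

-- Session A
begin transaction isolation level repeatable read;
select balance from Accounts where id = 1;    -- reads current version (say, 100)

-- Session B, concurrently
begin;
update Accounts set balance = 0 where id = 1; -- creates a new tuple version;
commit;                                       -- does not block Session A

-- Session A
select balance from Accounts where id = 1;    -- still sees 100: its snapshot version
commit;                                       -- old version now eligible for vacuum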
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [24/39]
Core "database" goals:
- deal with very large amounts of data (terabytes, petabyes, ...)
- very-high-level languages (deal with big data in uniform ways)
- query optimisation (if evaluation too slow ⇒ useless)
At the moment (and for the last 40 years) RDBMSs dominate ...
- simple, clean data model, backed up by theory
- high-level language for accessing data
- 50 years development work on RDB engine technology
RDBMSs work well in domains with uniform, structured data.
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [25/39]
❖ Future of Database (cont)
Limitations/pitfalls of RDBMSs:
- NULL is ambiguous: unknown, not applicable, not supplied
- "limited" support for constraints/integrity and rules
- no support for uncertainty (data represents the state-of-the-world)
- data model too simple (e.g. no support for complex objects)
- query model too rigid (e.g. no approximate match)
- continually changing data sources not well-handled
- data must be "molded" to fit a single rigid schema
- database systems must be manually "tuned"
- do not scale well to some data sets (e.g. Google, Telco's)
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [26/39]
❖ Future of Database (cont)
How to overcome (some of) these limitations?
Extend the relational model ...
- add new data types and query ops for new applications
- non-1NF data e.g. arrays, JSON ... both supported by PostgreSQL (sketch below)
- deal with uncertainty/inaccuracy/approximation in data
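The arrays/JSON point can be made concrete; a small PostgreSQL sketch (the people table and its columns are made up for illustration):

create table people (
    name  text,
    phone text[],    -- array-valued attribute (non-1NF)
    info  jsonb      -- semi-structured data
);
insert into people values
    ('John', array['0412 345 678'], '{"age": 25, "city": "Sydney"}');
select name, phone[1], info->>'city'
from   people
where  info @> '{"city": "Sydney"}';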
Replace the relational model ...
- object-oriented DBMS ... OO programming with persistent objects
- XML DBMS ... all data stored as XML documents, new query model
- application-effective data model (e.g. (key,value) pairs)
Performance: DBMSs that "tune" themselves ...
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [27/39]
Some modern applications have massive data sets (e.g. Google)
- far too large to store on a single machine/RDBMS
- query demands far too high even if could store in DBMS
Approach to dealing with such data
- distribute data over large collection of nodes (redundancy)
- provide computational mechanisms for distributing computation
Often this data does not need full relational selection
- represent data via (key,value) pairs
- unique key values can be used for addressing data
- values can be large objects (e.g. web pages, images, ...)
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [28/39]
Popular computational approach to Big Data: map/reduce
- suitable for widely-distributed, very-large data
- allows parallel computation on such data to be easily specified
- distribute (map) parts of computation across network
- compute in parallel (possibly with further map'ing)
- merge (reduce) multiple results for delivery to requestor
Some Big Data proponents see no future need for SQL/relational ...
- depends on application (e.g. hard integrity vs eventual consistency)
Humour: Parody of noSQL fans (strong language warning)
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [29/39]
❖ Object-relational Mapping
Pure OO databases came and went in the 1990s (as did XML databases)
A compromise:
- data stored in relational format
- "wrapper" gives the apperance of objects
Programmer works purely with objects; wrapper converts to SQL
Problems:
- OO query mechanisms are no better than SQL
- translation leads to inefficient SQL
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [30/39]
DBMSs generally do precise matching (apart from like and regular expressions)
Information retrieval systems do approximate matching.
E.g. documents containing these words (Google, etc.)
Also introduces notion of "quality" of matching
(e.g. tuple T1 is a better match than tuple T2)
Quality also implies ranking of results.
Much activity in incorporating IR ideas into DBMS context.
Goal: support database exploration better.
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [31/39]
Data which does not fit the "tabular model":
- image, video, music, text, ... (and combinations of these)
Research problems:
- how to specify queries on such data?
(image1 ≅ image2)
- how to "display" results?
(synchronize components)
Solutions to the first problem typically:
- extend notions of "matching"/indexes for querying
- require sophisticated methods for capturing data features
Sample query: find other songs like this one?
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [32/39]
Multimedia/IR introduces approximate matching.
In some contexts, we have approximate/uncertain data.
E.g. witness statements in a crime-fighting database
"I think the getaway car was red ... or maybe orange ..."
"I am 75% sure that John carried out the crime"
Work by Jennifer Widom at Stanford on the Trio system
- extends the relational model (ULDB)
- extends the query language (TriQL)
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [33/39]
❖ Stream Management Systems
Makes one addition to the relational model
- stream = infinite sequence of tuples, arriving one-at-a-time
Applications: news feeds, telecomms, monitoring web usage, ...
RDBMSs: run a variety of queries on (relatively) fixed data
StreamDBs: run fixed queries on changing data (stream)
Approaches:
- window = relation formed from a stream via a rule
- stream data type = build new stream-specific operations
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [34/39]
Uses graphs rather than tables as basic data structure tool.
Applications: complex data representation, via "flexible" objects
Implementing graphs in RDBMSs is possible, but often inefficient.
Graph nature of data changes query model considerably.
- different kinds of queries: kNN, neighbourhood, skyline
Research problem: query processing for large graphs
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [35/39]
Characteristics of dispersed databases:
- very large numbers of small processing nodes
- data is distributed/shared among nodes
Applications: environmental monitoring devices, "intelligent dust", ...
Research issues:
- query/search strategies
(how to organise query processing)
- distribution of data
(trade-off between centralised and diffused)
Less extreme versions of this already exist:
- grid and cloud computing
- database management for mobile devices
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [36/39]
Disk-based data is slow to access, and is transferred page-at-a-time
Non-volatile non-disk storage is becoming larger and cheaper
- many databases would fit in memory ... disk as a backup/archive
Research problems:
- how to structure data on SSD-like devices
- how to process queries in such an environment
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [37/39]
Open problems:
- database integration/federation
- making a collection of heterogeneous databases work together
- schema reconciliation is a difficult problem (vocab, schemas)
- replication
- maintain consistent copies of one database across multiple hosts
- how to propagate changes systematically?
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [38/39]
COMP9312 Data Analytics for Graphs
- manipulating large-scale graph structures
COMP9313 Big Data Management
- handling data that won't fit in a DBMS on one machine
COMP9315 Database Systems Implementation
- comprehensive study of DBMS internals
COMP9319 Web Data Compression and Search
- compression and searching algorithms, and XML
COMP6714 Information Retrieval and Web Search
- finding information in unstructured text
COMP3311 22T3 ♢ Week 9 Thursday Lecture ♢ [39/39]
Produced: 13 Apr 2023