H: Transactions, Concurrency, Recovery
COMP9315 24T1 ♢ Lectures Part H ♢ [0/92]
A transaction (tx) is ...
- a single application-level operation
- performed by a computation involving multiple DB operations
A transaction effects a state change on the DB
COMP9315 24T1 ♢ Lectures Part H ♢ [1/92]
❖ Transaction Processing (cont)
Transaction states:
- COMMIT ⇒ all changes preserved
- ABORT ⇒ database unchanged
COMP9315 24T1 ♢ Lectures Part H ♢ [2/92]
❖ Transaction Processing (cont)
Concurrent transactions are
- desirable, for improved performance (throughput)
- problematic, because of potential unwanted interactions
To ensure problem-free concurrent transactions:
- Atomic ... whole effect of tx, or nothing
- Consistent ... individual tx's are "correct" (wrt application)
- Isolated ... each tx behaves as if no concurrency
- Durable ... effects of committed tx's persist
COMP9315 24T1 ♢ Lectures Part H ♢ [3/92]
❖ Transaction Processing (cont)
Transaction processing:
- the study of techniques for realising ACID properties
Consistency is the property:
- a tx is correct with respect to its own specification
- a tx performs a mapping that maintains all DB constraints
Ensuring this must be left to application programmers.
Our discussion focusses on:
Atomicity, Durability, Isolation
COMP9315 24T1 ♢ Lectures Part H ♢ [4/92]
❖ Transaction Processing (cont)
Atomicity is handled by the commit and abort mechanisms
- commit ends tx and ensures all changes are saved/persisted
- abort ends tx and undoes changes "already made"
Durability is handled by implementing stable storage, via
- redundancy, to deal with hardware failures
- logging/checkpoint mechanisms, to recover state
Isolation is handled by concurrency control mechanisms
- possibilities: lock-based, timestamp-based, check-based
- various levels of isolation are possible (e.g. serializable)
COMP9315 24T1 ♢ Lectures Part H ♢ [5/92]
❖ Transaction Processing (cont)
Where transaction processing fits in the DBMS:
COMP9315 24T1 ♢ Lectures Part H ♢ [6/92]
❖ Transaction Terminology
To describe transaction effects, we consider:
- READ ... transfer data from "disk" to memory
- WRITE ... transfer data from memory to "disk"
- ABORT ... terminate the transaction, unsuccessfully
- COMMIT ... terminate the transaction, successfully
Relationship between the above operations and SQL:
- SELECT produces READ operations on the database
- UPDATE and DELETE produce READ then WRITE operations
- INSERT produces WRITE operations
COMP9315 24T1 ♢ Lectures Part H ♢ [7/92]
❖ Transaction Terminology (cont)
More on transactions and SQL:
- BEGIN starts a transaction
  - the begin keyword in PLpgSQL is not the same thing
- COMMIT commits and ends the current transaction
  - some DBMSs (e.g. PostgreSQL) also provide END as a synonym
  - the end keyword in PLpgSQL is not the same thing
- ROLLBACK aborts the current transaction, undoing any changes
  - some DBMSs (e.g. PostgreSQL) also provide ABORT as a synonym
In PostgreSQL, tx's cannot be defined inside functions (e.g. PLpgSQL)
COMP9315 24T1 ♢ Lectures Part H ♢ [8/92]
❖ Transaction Terminology (cont)
The READ, WRITE, ABORT, COMMIT operations:
- occur in the context of some transaction T
- involve manipulation of data items X, Y, ... (READ and WRITE)
The operations are typically denoted as:
RT(X) ... read item X in transaction T
WT(X) ... write item X in transaction T
AT    ... abort transaction T
CT    ... commit transaction T
COMP9315 24T1 ♢ Lectures Part H ♢ [9/92]
A schedule gives the sequence of operations from ≥ 1 tx
Serial schedule for a set of tx's T1 .. Tn
- all operations of Ti complete before Ti+1 begins
E.g. RT1(A) WT1(A) RT2(B) RT2(A) WT3(C) WT3(B)
Concurrent schedule for a set of tx's T1 .. Tn
- operations from individual Ti's are interleaved
E.g. RT1(A) RT2(B) WT1(A) WT3(C) WT3(B) RT2(A)
COMP9315 24T1 ♢ Lectures Part H ♢ [10/92]
Serial schedules guarantee database consistency
- each Ti commits before Ti+1
- prior to Ti database is consistent
- after Ti database is consistent (assuming Ti is correct)
- before Ti+1 database is consistent ...
Concurrent schedules interleave tx operations arbitrarily
- and may produce a database that is not consistent, even after all of the transactions have committed successfully
COMP9315 24T1 ♢ Lectures Part H ♢ [11/92]
What problems can occur with (uncontrolled) concurrent tx's?
The set of phenomena can be characterised broadly under:
- dirty read:
reading a data item written by a concurrent uncommitted tx
- nonrepeatable read:
re-reading a data item that has since been changed by another concurrent tx
- phantom read:
re-scanning a result set and finding it changed by another tx
COMP9315 24T1 ♢ Lectures Part H ♢ [12/92]
If a concurrent schedule on a set of tx's TT ...
- produces the same effect as a serial schedule on TT
- then we say that the schedule is serializable
Primary goal of isolation mechanisms (see later) is
- arrange execution of individual operations in tx's in TT
- to ensure that a serializable schedule is produced
Serializability is one property of a schedule, focusing on isolation
Other properties of schedules focus on recovering from failures
COMP9315 24T1 ♢ Lectures Part H ♢ [13/92]
So far, we have implicitly assumed that all transactions commit.
Additional problems can arise when transactions abort.
Consider the following schedule where transaction T1 fails:
T1: R(X) W(X) A
T2: R(X) W(X) C
Abort will roll back the changes to X, but ...
Consider three places where the rollback might occur:
T1: R(X) W(X) A [1] [2] [3]
T2: R(X) W(X) C
COMP9315 24T1 ♢ Lectures Part H ♢ [14/92]
❖ Transaction Failure (cont)
Abort / rollback scenarios:
T1: R(X) W(X) A [1] [2] [3]
T2: R(X) W(X) C
Case [1] is ok
- all effects of T1 vanish; final effect is simply from T2
Case [2] is problematic
- some of T1's effects persist, even though T1 aborted
Case [3] is also problematic
- T2's effects are lost, even though T2 committed
COMP9315 24T1 ♢ Lectures Part H ♢ [15/92]
Consider the serializable schedule:
T1: R(X) W(Y) C
T2: W(X) A
(where the final value of Y is dependent on the X value)
Notes:
- the final value of X is valid (change from T2 rolled back)
- T1 reads/uses an X value that is eventually rolled-back
- even though T2 is correctly aborted, it has produced an effect
Produces an invalid database state, even though serializable.
COMP9315 24T1 ♢ Lectures Part H ♢ [16/92]
Recoverable schedules avoid these kinds of problems.
For a schedule to be recoverable, we require additional constraints
- all tx's Ti that wrote values used by Tj must have committed before Tj commits
- and this property must hold for all transactions Tj
Note that recoverability does not prevent "dirty reads".
To make schedules recoverable in the presence of dirty reads and aborts, we may need to abort multiple transactions.
COMP9315 24T1 ♢ Lectures Part H ♢ [17/92]
Recall the earlier non-recoverable schedule:
T1: R(X) W(Y) C
T2: W(X) A
To make it recoverable requires:
- delaying T1's commit until T2 commits
- if T2 aborts, cannot allow T1 to commit
T1: R(X) W(Y) ... C? A!
T2: W(X) A
Known as cascading aborts (or cascading rollback).
COMP9315 24T1 ♢ Lectures Part H ♢ [18/92]
❖ Cascading Aborts (cont)
Example: T3 aborts, causing T2 to abort, causing T1 to abort
T1: R(Y) W(Z) A
T2: R(X) W(Y) A
T3: W(X) A
Even though T1 has no direct connection with T3 (i.e. no shared data).
This kind of problem ...
- can potentially affect very many concurrent transactions
- could have a significant impact on system throughput
COMP9315 24T1 ♢ Lectures Part H ♢ [19/92]
❖ Cascading Aborts (cont)
Cascading aborts can be avoided if tx's can only read values written by committed tx's.
Effectively: eliminate the possibility of reading dirty data.
Downside: reduces opportunity for concurrency.
These are called ACR (avoid cascading rollback) schedules.
All ACR schedules are also recoverable.
COMP9315 24T1 ♢ Lectures Part H ♢ [20/92]
Strict schedules also eliminate the chance of writing dirty data.
A schedule is strict if
- no tx can read values written by another uncommitted tx (ACR)
- no tx can write a data item written by another uncommitted tx
Strict schedules simplify the task of rolling back after aborts.
COMP9315 24T1 ♢ Lectures Part H ♢ [21/92]
Example: non-strict schedule
T1: W(X) A
T2: W(X) A
Problems with handling rollback after aborts:
- when T1 aborts, don't roll back (need to retain the value written by T2)
- when T2 aborts, need to roll back to pre-T1 (not just pre-T2)
COMP9315 24T1 ♢ Lectures Part H ♢ [22/92]
Relationship between various classes of schedules:
Schedules ought to be serializable and strict.
But more serializable/strict ⇒ less concurrency.
DBMSs allow users to trade off "safety" against performance.
COMP9315 24T1 ♢ Lectures Part H ♢ [23/92]
Simplest form of isolation: serial execution (T1 ; T2 ; T3 ; ...)
Problem: serial execution yields poor throughput.
Concurrency control schemes (CCSs) aim for "safe" concurrency
Abstract view of DBMS concurrency mechanisms:
COMP9315 24T1 ♢ Lectures Part H ♢ [24/92]
Consider two schedules S1 and S2 produced by
- executing the same set of transactions T1..Tn concurrently
- but with a non-serial interleaving of R/W operations
S1 and S2 are equivalent if StateAfter(S1) = StateAfter(S2)
- i.e. the final state yielded by S1 is the same as the final state yielded by S2
S is a serializable schedule (for a set of concurrent tx's T1..Tn) if
- S is equivalent to some serial schedule Ss of T1..Tn
Under these circumstances, consistency is guaranteed
(assuming no aborted transactions and no system failures)
COMP9315 24T1 ♢ Lectures Part H ♢ [25/92]
Two formulations of serializability:
- conflict serializability
  - i.e. conflicting R/W operations occur in the "right order"
  - checked via a precedence graph; look for absence of cycles
- view serializability
  - i.e. read operations see the correct version of data
  - checked via VS conditions on likely equivalent schedules
View serializability is strictly weaker than conflict serializability.
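To make the precedence-graph test concrete, here is a minimal sketch in C (not from the course materials): an edge Ti → Tj is recorded whenever an operation of Ti conflicts with a later operation of Tj, and the schedule is conflict-serializable iff the graph is acyclic.

#include <stdbool.h>
#include <stdio.h>

#define MAXTX 10

/* edge[i][j] is true if Ti must precede Tj
   (Ti has an op conflicting with a later op of Tj) */
static bool edge[MAXTX][MAXTX];
static int color[MAXTX];   /* 0 = unvisited, 1 = in progress, 2 = done */

static bool hasCycleFrom(int i, int n) {
    color[i] = 1;
    for (int j = 0; j < n; j++) {
        if (!edge[i][j]) continue;
        if (color[j] == 1) return true;            /* back edge => cycle */
        if (color[j] == 0 && hasCycleFrom(j, n)) return true;
    }
    color[i] = 2;
    return false;
}

/* a schedule is conflict-serializable iff its precedence graph is acyclic */
static bool conflictSerializable(int n) {
    for (int i = 0; i < n; i++) color[i] = 0;
    for (int i = 0; i < n; i++)
        if (color[i] == 0 && hasCycleFrom(i, n)) return false;
    return true;
}

int main(void) {
    /* schedule R1(X) W2(X) R2(Y) W1(Y): T1->T2 on X, T2->T1 on Y => cycle */
    edge[0][1] = true;
    edge[1][0] = true;
    printf("conflict-serializable? %s\n",
           conflictSerializable(2) ? "yes" : "no");
    return 0;
}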
COMP9315 24T1 ♢ Lectures Part H ♢ [26/92]
❖ Transaction Isolation Levels
SQL programmers' concurrency control mechanism ...
set transaction
    [ read only | read write ]
    [ isolation level
        { read uncommitted | read committed | repeatable read | serializable } ]
Applies to current tx only; affects how scheduler treats this tx.
COMP9315 24T1 ♢ Lectures Part H ♢ [27/92]
❖ Transaction Isolation Levels (cont)
Implication of transaction isolation levels:
Isolation Level    Dirty Read     Nonrepeatable Read   Phantom Read
Read Uncommitted   Possible       Possible             Possible
Read Committed     Not Possible   Possible             Possible
Repeatable Read    Not Possible   Not Possible         Possible
Serializable       Not Possible   Not Possible         Not Possible
COMP9315 24T1 ♢ Lectures Part H ♢ [28/92]
❖ Transaction Isolation Levels (cont)
For transaction isolation, PostgreSQL
- provides syntax for all four levels
- treats read uncommitted as read committed
- repeatable read behaves like serializable
- default level is read committed
Note: cannot implement read uncommitted because of MVCC
For more details, see PostgreSQL Documentation section 13.2
- extensive discussion of the semantics of UPDATE, INSERT, DELETE
COMP9315 24T1 ♢ Lectures Part H ♢ [29/92]
❖ Transaction Isolation Levels (cont)
A PostgreSQL tx consists of a sequence of SQL statements:
BEGIN S1; S2; ... Sn; COMMIT;
Isolation levels affect view of DB provided to each Si:
- in read committed ...
- each Si sees snapshot of DB at start of Si
- in repeatable read and serializable ...
- each Si sees snapshot of DB at start of tx
- serializable checks for extra conditions
Transactions fail if the system detects violation of isolation level.
COMP9315 24T1 ♢ Lectures Part H ♢ [30/92]
Approaches to concurrency control:
- Lock-based
- Synchronise tx execution via locks on relevant part of DB.
- Version-based (multi-version concurrency control)
- Allow multiple consistent versions of the data to exist.
Each tx has access only to version existing at start of tx.
- Validation-based (optimistic concurrency control)
- Execute all tx's; check for validity problems on commit.
- Timestamp-based
- Organise tx execution via timestamps assigned to actions.
COMP9315 24T1 ♢ Lectures Part H ♢ [31/92]
❖ Lock-based Concurrency Control
Locks introduce additional mechanisms in the DBMS:
The Lock Manager manages the locks requested by the scheduler
COMP9315 24T1 ♢ Lectures Part H ♢ [32/92]
❖ Lock-based Concurrency Control (cont)
Lock table entries contain:
- object being locked (DB, table, tuple, field)
- type of lock: read/shared, write/exclusive
- FIFO queue of tx's requesting this lock
- count of tx's currently holding lock (max 1 for write locks)
Lock and unlock operations must be atomic.
Lock upgrade:
- if a tx holds a read lock, and it is the only tx holding that lock
- then the lock can be converted into a write lock
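As an illustration only (the field names are assumptions, not PostgreSQL's actual lock manager), a lock table entry and its compatibility check might look like:

#include <stdbool.h>
#include <stdio.h>

typedef enum { READ_LOCK, WRITE_LOCK } LockMode;

typedef struct {
    int      objectId;       /* object being locked (DB, table, tuple, field) */
    LockMode mode;           /* read/shared or write/exclusive */
    int      holders;        /* # tx's currently holding (max 1 for write) */
    int      waitQueue[16];  /* FIFO queue of tx's requesting this lock */
    int      queueLen;
} LockEntry;

/* may a new request of the given mode be granted immediately? */
static bool compatible(const LockEntry *e, LockMode requested) {
    if (e->holders == 0) return true;                       /* lock is free */
    return e->mode == READ_LOCK && requested == READ_LOCK;  /* shared readers */
}

int main(void) {
    LockEntry e = { .objectId = 1, .mode = READ_LOCK, .holders = 2 };
    printf("read request:  %s\n", compatible(&e, READ_LOCK)  ? "granted" : "blocked");
    printf("write request: %s\n", compatible(&e, WRITE_LOCK) ? "granted" : "blocked");
    return 0;
}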
COMP9315 24T1 ♢ Lectures Part H ♢ [33/92]
❖ Lock-based Concurrency Control (cont)
Synchronise access to shared data items via following rules:
- before reading X, get read (shared) lock on X
- before writing X, get write (exclusive) lock on X
- a tx attempting to get a read lock on X is blocked if another tx already has a write lock on X
- a tx attempting to get a write lock on X is blocked if another tx has any kind of lock on X
These rules alone do not guarantee serializability.
COMP9315 24T1 ♢ Lectures Part H ♢ [34/92]
❖ Lock-based Concurrency Control (cont)
Consider the following schedule, using locks:
T1(a): Lr(Y) R(Y)
T2(a): Lr(X) R(X) U(X)
T1(b): U(Y) Lw(X) W(X) U(X)
T2(b): Lw(Y)....W(Y) U(Y)
(where Lr = read-lock, Lw = write-lock, U = unlock)
Locks correctly ensure controlled access to X and Y.
Despite this, the schedule is not serializable. (Ex: prove this)
COMP9315 24T1 ♢ Lectures Part H ♢ [35/92]
To guarantee serializability, we require an additional constraint:
- in every tx, all lock requests precede all unlock requests
Each transaction is then structured as:
- growing phase where locks are acquired
- action phase where "real work" is done
- shrinking phase where locks are released
Clearly, this reduces potential concurrency ...
COMP9315 24T1 ♢ Lectures Part H ♢ [36/92]
Appropriate locking can guarantee correctness.
However, it also introduces potential undesirable effects:
- Deadlock
- No transactions can proceed; each waiting on lock held by another.
- Starvation
- One transaction is permanently "frozen out" of access to data.
- Reduced performance
- Locking introduces delays while waiting for locks to be released.
COMP9315 24T1 ♢ Lectures Part H ♢ [37/92]
Deadlock occurs when two transactions are each waiting for a lock on an item held by the other.
Example:
T1: Lw(A) R(A) Lw(B) ......
T2: Lw(B) R(B) Lw(A) .....
How to deal with deadlock?
- prevent it happening in the first place
- let it happen, detect it, recover from it
COMP9315 24T1 ♢ Lectures Part H ♢ [38/92]
Handling deadlock involves forcing a transaction to "back off"
- select process to roll back
- choose on basis of how far tx has progressed, # locks held, ...
- roll back the selected process
- how far does it need to be rolled back?
- worst-case scenario: abort one transaction, then retry
- prevent starvation
- need methods to ensure that same tx isn't always chosen
COMP9315 24T1 ♢ Lectures Part H ♢ [39/92]
Methods for managing deadlock
- timeout : set max time limit for each tx
- waits-for graph : records that Tj is waiting on a lock held by Tk
  - prevent deadlock by checking for a new cycle ⇒ abort some tx in the cycle
  - detect deadlock by periodic check for cycles ⇒ abort some tx in the cycle
- timestamps : use tx start times as basis for priority
- scenario: Tj tries to get lock held by Tk ...
- wait-die: if Tj < Tk, then Tj waits; if Tj > Tk, then Tj rolls back
- wound-wait: if Tj < Tk, then Tk rolls back; if Tj > Tk, then Tj waits
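The two policies fit in a few lines of C; this sketch assumes each tx carries its start timestamp, with a smaller timestamp meaning an older, higher-priority tx:

#include <stdio.h>

typedef enum { WAIT, ROLLBACK_SELF, WOUND_HOLDER } Action;

/* Tj (start-time tsJ) requests a lock held by Tk (start-time tsK) */
static Action waitDie(long tsJ, long tsK) {
    return tsJ < tsK ? WAIT : ROLLBACK_SELF;   /* older waits, younger dies */
}

static Action woundWait(long tsJ, long tsK) {
    return tsJ < tsK ? WOUND_HOLDER : WAIT;    /* older wounds, younger waits */
}

int main(void) {
    /* an older tx (ts=1) requests a lock held by a younger tx (ts=2) */
    printf("wait-die:   %s\n", waitDie(1, 2)   == WAIT ? "wait" : "rollback self");
    printf("wound-wait: %s\n", woundWait(1, 2) == WOUND_HOLDER ? "wound holder" : "wait");
    return 0;
}

Both are deadlock-free because waiting is only ever permitted in one timestamp direction.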
COMP9315 24T1 ♢ Lectures Part H ♢ [40/92]
Properties of deadlock handling methods:
- both wait-die and wound-wait are fair
- wait-die tends to
- roll back tx's that have done little work
- but rolls back tx's more often
- wound-wait tends to
- roll back tx's that may have done significant work
- but rolls back tx's less often
- timestamps are easier to implement than a waits-for graph
- a waits-for graph minimises rollbacks caused by deadlock
COMP9315 24T1 ♢ Lectures Part H ♢ [41/92]
❖ Optimistic Concurrency Control
Locking is a pessimistic approach to concurrency control:
- limit concurrency to ensure that conflicts don't occur
Costs: lock management, deadlock handling, contention.
In scenarios where there are far more reads than writes ...
- don't lock (allow arbitrary interleaving of operations)
- check just before commit that no conflicts occurred
- if problems, roll back conflicting transactions
Optimistic concurrency control (OCC) is a strategy to realise this.
COMP9315 24T1 ♢ Lectures Part H ♢ [42/92]
❖ Optimistic Concurrency Control (cont)
Under OCC, transactions have three distinct phases:
- Reading: read from database, modify local copies of data
- Validation: check for conflicts in updates
- Writing: commit local copies of data to database
Timestamps are recorded at points S, V, F:
COMP9315 24T1 ♢ Lectures Part H ♢ [43/92]
❖ Optimistic Concurrency Control (cont)
Data structures needed for validation:
- S ... set of txs that are reading data and computing results
- V ... set of txs that have reached validation (not yet committed)
- F ... set of txs that have finished (committed data to storage)
- for each Ti, timestamps for when it reached S, V, F
- RS(Ti) set of all data items read by Ti
- WS(Ti) set of all data items to be written by Ti
Use the V timestamps as the ordering for transactions
- assume a serial tx order based on the ordering of the V(Ti)'s
COMP9315 24T1 ♢ Lectures Part H ♢ [44/92]
❖ Optimistic Concurrency Control (cont)
Two-transaction example:
- allow transactions T1 and T2 to run without any locking
- check that objects used by T2 are not being changed by T1
- if they are, we need to roll back T2 and retry
Case 0: serial execution ... no problem
COMP9315 24T1 ♢ Lectures Part H ♢ [45/92]
❖ Optimistic Concurrency Control (cont)
Case 1: reading overlaps validation/writing
- T2 starts while T1 is validating/writing
- if some X being read by T2 is in WS(T1)
- then T2 may not have read the updated version of X
- so, T2 must start again
COMP9315 24T1 ♢ Lectures Part H ♢ [46/92]
❖ Optimistic Concurrency Control (cont)
Case 2: reading/validation overlaps validation/writing
- T2 starts validating while T1 is validating/writing
- if some X being written by T2 is in WS(T1)
- then T2 may end up overwriting T1's update
- so, T2 must start again
COMP9315 24T1 ♢ Lectures Part H ♢ [47/92]
❖ Optimistic Concurrency Control (cont)
Validation check for transaction T
- for all transactions Ti ≠ T
  - if T∈S & Ti∈F, then ok
  - if T∉V & V(Ti) < S(T) < F(Ti), then check WS(Ti) ∩ RS(T) is empty
  - if T∈V & V(Ti) < V(T) < F(Ti), then check WS(Ti) ∩ WS(T) is empty
If this check fails for any Ti, then T is rolled back.
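A sketch of this validation test in C, mirroring the three conditions above; the bitmap representation of the read/write sets and the field names are assumptions for illustration:

#include <stdbool.h>
#include <stdio.h>

typedef struct Tx {
    long S, V, F;               /* timestamps; meaningful once phase reached */
    bool inS, inV, inF;         /* which phase set the tx belongs to */
    unsigned long readSet;      /* RS: bitmap of data items read */
    unsigned long writeSet;     /* WS: bitmap of data items to be written */
} Tx;

static bool intersects(unsigned long a, unsigned long b) {
    return (a & b) != 0;
}

/* returns true if T passes validation against all other tx's */
static bool validate(const Tx *T, const Tx others[], int n) {
    for (int i = 0; i < n; i++) {
        const Tx *Ti = &others[i];
        if (Ti == T) continue;
        if (T->inS && Ti->inF) continue;        /* Ti already finished: ok */
        if (!T->inV && Ti->V < T->S && T->S < Ti->F
            && intersects(Ti->writeSet, T->readSet))
            return false;                       /* T may have missed Ti's write */
        if (T->inV && Ti->V < T->V && T->V < Ti->F
            && intersects(Ti->writeSet, T->writeSet))
            return false;                       /* T may overwrite Ti's write */
    }
    return true;                                /* otherwise T may commit */
}

int main(void) {
    Tx T  = { .S = 1, .V = 5, .inV = true, .readSet = 0x1, .writeSet = 0x2 };
    Tx Ti = { .V = 2, .F = 6, .writeSet = 0x2 };   /* overlapping write set */
    Tx others[] = { Ti };
    printf("T validates? %s\n", validate(&T, others, 1) ? "yes" : "no");  /* no */
    return 0;
}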
COMP9315 24T1 ♢ Lectures Part H ♢ [48/92]
❖ Optimistic Concurrency Control (cont)
OCC prevents: T reading dirty data, T overwriting Ti's changes
Problems with OCC:
- increased roll backs**
- tendency to roll back "complete" tx's
- cost to maintain S,V,F sets
** "Roll back" is relatively cheap
- changes to data are purely local before Writing phase
- no requirement for logging info or undo/redo (see later)
COMP9315 24T1 ♢ Lectures Part H ♢ [49/92]
❖ Multi-version Concurrency Control
Multi-version concurrency control (MVCC) aims to
- retain benefits of locking, while getting more concurrency
- by providing multiple (consistent) versions of data items
Achieves this by
- readers access an "appropriate" version of each data item
- writers make new versions of the data items they modify
Main difference between MVCC and standard locking:
- read locks do not conflict with write locks ⇒ reading never blocks writing, writing never blocks reading
COMP9315 24T1 ♢ Lectures Part H ♢ [50/92]
❖ Multi-version Concurrency Control (cont)
WTS = timestamp of tx that wrote this data item
Chained tuple versions: tupoldest → tupolder → tupnewest
When a reader Ti is accessing the database
- ignore any data item D created after Ti started
- checked by: min(WTS(D)) > TS(Ti)
- use only newest version V accessible to Ti
- determined by: max(WTS(V)) < TS(Ti)
COMP9315 24T1 ♢ Lectures Part H ♢ [51/92]
❖ Multi-version Concurrency Control (cont)
When a writer Ti attempts to change a data item
- find newest version V satisfying WTS(V) < TS(Ti)
- if no later versions exist, create new version of data item
- if there are later versions, then abort Ti
Some MVCC versions also maintain RTS (TS of last reader)
- don't allow Ti to write D if RTS(D) > TS(Ti)
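A minimal sketch of these reader/writer rules over a chain of tuple versions (illustrative field names, not PostgreSQL's representation):

#include <stddef.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct Version {
    long wts;                  /* TS of the tx that wrote this version */
    int  value;
    struct Version *newer;     /* chain: oldest -> ... -> newest */
} Version;

/* reader: use the newest version written before Ti started */
static Version *readVersion(Version *oldest, long tsTi) {
    Version *best = NULL;
    for (Version *v = oldest; v != NULL; v = v->newer)
        if (v->wts < tsTi && (best == NULL || v->wts > best->wts))
            best = v;
    return best;               /* NULL => item created after Ti started */
}

/* writer: true if Ti may create a new version (no later version exists) */
static bool mayWrite(Version *oldest, long tsTi) {
    for (Version *v = oldest; v != NULL; v = v->newer)
        if (v->wts > tsTi) return false;      /* later version => abort Ti */
    return true;
}

int main(void) {
    Version v1 = { 5, 10, NULL };
    Version v0 = { 1, 8, &v1 };                         /* chain: v0 -> v1 */
    Version *r = readVersion(&v0, 3);                   /* reader started at ts 3 */
    printf("reader sees value %d\n", r ? r->value : -1);        /* 8 */
    printf("writer (ts 3) may write? %s\n", mayWrite(&v0, 3) ? "yes" : "no");  /* no */
    return 0;
}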
COMP9315 24T1 ♢ Lectures Part H ♢ [52/92]
❖ Multi-version Concurrency Control (cont)
Advantage of MVCC
- locking needed for serializability considerably reduced
Disadvantages of MVCC
- visibility-check overhead (on every tuple read/write)
- reading an item V causes an update of RTS(V) (if used)
- storage overhead for extra versions of data items
- overhead in removing out-of-date versions of data items
Despite apparent disadvantages, MVCC is very effective.
COMP9315 24T1 ♢ Lectures Part H ♢ [53/92]
❖ Multi-version Concurrency Control (cont)
Removing old versions:
- Vj and Vk are versions of same item
- WTS(Vj) and WTS(Vk) precede TS(Ti) for all Ti
- remove version with smaller WTS(Vx) value
When to make this check?
- every time a new version of a data item is added?
- periodically, with fast access to blocks of data
PostgreSQL uses the latter (vacuum).
COMP9315 24T1 ♢ Lectures Part H ♢ [54/92]
❖ Concurrency Control in PostgreSQL
PostgreSQL uses two styles of concurrency control:
- multi-version concurrency control (MVCC)
  (used in implementing SQL DML statements, e.g. select)
- two-phase locking (2PL)
  (used in implementing SQL DDL statements, e.g. create table)
From the SQL (PLpgSQL) level:
- can let the lock/MVCC system handle concurrency
- can handle it explicitly via LOCK statements
COMP9315 24T1 ♢ Lectures Part H ♢ [55/92]
❖ Concurrency Control in PostgreSQL (cont)
PostgreSQL provides read committed and serializable isolation levels.
Using the serializable isolation level, a select:
- sees only data committed before the transaction began
- never sees changes made by concurrent transactions
Using the serializable isolation level, an update fails:
- if it tries to modify an "active" data item
(active = affected by some other tx, either committed or uncommitted)
The transaction containing the update must then roll back and re-start.
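In practice the client drives this retry. A hedged sketch using libpq (the accounts table and connection settings are placeholders for this sketch; serialization failures are reported with SQLSTATE 40001):

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

int main(void) {
    /* connection parameters taken from the usual PG* environment variables */
    PGconn *conn = PQconnectdb("");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }
    for (;;) {
        PQclear(PQexec(conn, "BEGIN ISOLATION LEVEL SERIALIZABLE"));
        /* 'accounts' is a placeholder table for this sketch */
        PGresult *res = PQexec(conn,
            "UPDATE accounts SET bal = bal - 1 WHERE id = 42");
        if (PQresultStatus(res) == PGRES_COMMAND_OK) {
            PQclear(res);
            PQclear(PQexec(conn, "COMMIT"));
            break;                              /* success */
        }
        const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
        int retry = (sqlstate && strcmp(sqlstate, "40001") == 0);
        PQclear(res);
        PQclear(PQexec(conn, "ROLLBACK"));
        if (!retry) { fprintf(stderr, "update failed\n"); break; }
        /* serialization failure: loop and re-run the whole tx */
    }
    PQfinish(conn);
    return 0;
}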
COMP9315 24T1 ♢ Lectures Part H ♢ [56/92]
❖ Concurrency Control in PostgreSQL (cont)
Implementing MVCC in PostgreSQL requires:
- a log file to maintain current status of each Ti
- in every tuple:
  - xmin ... ID of the tx that created the tuple
  - xmax ... ID of the tx that replaced/deleted the tuple (if any)
  - xnew ... link to newer versions of the tuple (if any)
- for each transaction Ti :
- a transaction ID (timestamp)
- SnapshotData: list of active tx's when Ti started
COMP9315 24T1 ♢ Lectures Part H ♢ [57/92]
❖ Concurrency Control in PostgreSQL (cont)
Rules for a tuple to be visible to Ti :
- the xmin (creation transaction) value must
  - be committed in the log file
  - have started before Ti's start time
  - not be active at Ti's start time
- the xmax (delete/replace transaction) value must
  - be blank or refer to an aborted tx, or
  - have started after Ti's start time, or
  - have been active at SnapshotData time
For details, see: utils/time/tqual.c
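A much-simplified sketch of such a visibility test; the tx-log lookup and the collapsing of "active at start time" into a start-time comparison are assumptions for illustration (see tqual.c for the real logic):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef unsigned int TxId;

/* tiny stand-in for the tx log + SnapshotData (assumed for this sketch) */
typedef struct { TxId id; long start; bool committed; } TxInfo;

static TxInfo txLog[] = {
    { 100, 1, true  },        /* tuple's creator: committed, started at t=1 */
    { 200, 5, false },        /* tuple's deleter: still active */
};

static TxInfo *lookup(TxId x) {
    for (size_t i = 0; i < sizeof txLog / sizeof txLog[0]; i++)
        if (txLog[i].id == x) return &txLog[i];
    return NULL;
}

typedef struct { TxId xmin, xmax; } HeapTuple;   /* xmax == 0 => not deleted */

static bool visible(HeapTuple t, long myStart) {
    TxInfo *c = lookup(t.xmin);
    if (c == NULL || !c->committed || c->start >= myStart)
        return false;                    /* creator must commit & precede me */
    if (t.xmax == 0) return true;        /* never deleted/replaced */
    TxInfo *d = lookup(t.xmax);
    if (d == NULL || !d->committed) return true;   /* deleter aborted/active */
    return d->start >= myStart;          /* deleter started after me => visible */
}

int main(void) {
    HeapTuple t = { 100, 200 };
    printf("visible? %s\n", visible(t, 10) ? "yes" : "no");   /* yes */
    return 0;
}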
COMP9315 24T1 ♢ Lectures Part H ♢ [58/92]
❖ Concurrency Control in PostgreSQL (cont)
Tx's always see a consistent version of the database.
But may not see the "current" version of the database.
E.g. T1 does select, then concurrent T2 deletes some of T1's selected tuples
This is OK unless tx's communicate outside the database system.
E.g. T1 counts tuples; T2 deletes then counts; then counts are compared
Use locks if application needs every tx to see same current version
- LOCK TABLE locks an entire table
- SELECT FOR UPDATE locks only the selected rows
COMP9315 24T1 ♢ Lectures Part H ♢ [59/92]
Reminder: Transactions are atomic
- if a tx commits, all of its changes persist in DB
- if a tx aborts, none of its changes occur in DB
Transaction effects are durable
- if a tx commits, its effects persist (even in the event of subsequent (catastrophic) system failures)
Implementation of atomicity/durability is intertwined.
COMP9315 24T1 ♢ Lectures Part H ♢ [60/92]
What kinds of "system failures" do we need to deal with?
- single-bit inversion during transfer mem-to-disk
- decay of storage medium on disk (some data changed)
- failure of entire disk device (data no longer accessible)
- failure of DBMS processes (e.g. postgres crashes)
- operating system crash; power failure to computer room
- complete destruction of computer system running DBMS
The last requires off-site backup; all others should be locally recoverable.
COMP9315 24T1 ♢ Lectures Part H ♢ [61/92]
Consider the following scenario:
Desired behaviour after system restart:
- all effects of T1, T2 persist
- as if T3, T4 were aborted (no effects remain)
COMP9315 24T1 ♢ Lectures Part H ♢ [62/92]
Durability begins with a stable disk storage subsystem
- i.e. putPage() and getPage() always work as expected
We can prevent/minimise loss/corruption of data due to:
- mem/disk transfer corruption ⇒ parity checking
- sector failure ⇒ mark "bad" blocks
- disk failure ⇒ RAID (levels 4,5,6)
- destruction of computer system ⇒ off-site backups
COMP9315 24T1 ♢ Lectures Part H ♢ [63/92]
❖ Dealing with Transactions
The remaining "failure modes" that we need to consider:
- failure of DBMS processes or operating system
- failure of transactions (ABORT)
Standard technique for managing these:
- keep a log of changes made to database
- use this log to restore state in case of failures
COMP9315 24T1 ♢ Lectures Part H ♢ [64/92]
❖ Architecture for Atomicity/Durability
How does a DBMS provide for atomicity/durability?
COMP9315 24T1 ♢ Lectures Part H ♢ [65/92]
❖ Execution of Transactions
Transactions deal with three address spaces:
- stored data on the disk (representing global DB state)
- data in memory buffers (where held for sharing by tx's)
- data in their own local variables (where manipulated)
Each of these may hold a different "version" of a DB object.
PostgreSQL processes make heavy use of shared buffer pool
⇒ transactions do not deal with much local data.
COMP9315 24T1 ♢ Lectures Part H ♢ [66/92]
❖ Execution of Transactions (cont)
Operations available for data transfer:
- INPUT(X) ... read page containing X into a buffer
- READ(X,v) ... copy value of X from buffer to local var v
- WRITE(X,v) ... copy value of local var v to X in buffer
- OUTPUT(X) ... write buffer containing X to disk
READ/WRITE are issued by the transaction.
INPUT/OUTPUT are issued by the buffer manager (and log manager).
INPUT/OUTPUT correspond to getPage()/putPage() mentioned above.
COMP9315 24T1 ♢ Lectures Part H ♢ [67/92]
❖ Execution of Transactions (cont)
Example of transaction execution:
BEGIN
READ(A,v); v = v*2; WRITE(A,v);
READ(B,v); v = v+1; WRITE(B,v);
COMMIT
READ accesses the buffer manager and may cause INPUT.
COMMIT needs to ensure that buffer contents go to disk.
COMP9315 24T1 ♢ Lectures Part H ♢ [68/92]
❖ Execution of Transactions (cont)
States as the transaction executes:
t Action v Buf(A) Buf(B) Disk(A) Disk(B)
-----------------------------------------------------
(0) BEGIN . . . 8 5
(1) READ(A,v) 8 8 . 8 5
(2) v = v*2 16 8 . 8 5
(3) WRITE(A,v) 16 16 . 8 5
(4) READ(B,v) 5 16 5 8 5
(5) v = v+1 6 16 5 8 5
(6) WRITE(B,v) 6 16 6 8 5
(7) OUTPUT(A) 6 16 6 16 5
(8) OUTPUT(B) 6 16 6 16 6
After tx completes, we must have either
Disk(A)=8, Disk(B)=5 or Disk(A)=16, Disk(B)=6
If system crashes before (8), may need to undo disk changes.
If system crashes after (8), may need to redo disk changes.
COMP9315 24T1 ♢ Lectures Part H ♢ [69/92]
❖ Transactions and Buffer Pool
Two issues arise w.r.t. buffers:
- forcing ... OUTPUT buffer on each WRITE
- ensures durability; disk always consistent with buffer pool
- poor performance; defeats purpose of having buffer pool
- stealing ... replace buffers of uncommitted tx's
- if we don't, poor throughput (tx's blocked on buffers)
- if we do, seems to cause atomicity problems?
Ideally, we want stealing and not forcing.
COMP9315 24T1 ♢ Lectures Part H ♢ [70/92]
❖ Transactions and Buffer Pool (cont)
Handling stealing:
- transaction T loads page P and makes changes
- T2 needs a buffer, and P is the "victim"
- P is output to disk (it's dirty) and replaced
- if T aborts, some of its changes are already "committed"
- must log values changed by T in P at "steal-time"
- use these to UNDO changes in case of failure of T
COMP9315 24T1 ♢ Lectures Part H ♢ [71/92]
❖ Transactions and Buffer Pool (cont)
Handling no forcing:
- transaction T makes changes & commits, then system crashes
- but what if modified page P has not yet been output?
- must log values changed by T in P as soon as they change
- use these to support REDO to restore changes
Above scenario may be a problem, even if we are forcing
- e.g. system crashes immediately after requesting a WRITE()
COMP9315 24T1 ♢ Lectures Part H ♢ [72/92]
❖ Transactions (reminder)
A transaction is
- a single application-level operation
- implemented via multiple database operations
Transactions deal with three levels of storage
- data on disk, data in memory buffers, data in local variables
Transactions ...
- COMMIT : all changes are placed in persistent storage
- ABORT : none of the changes appear in persistent storage
COMP9315 24T1 ♢ Lectures Part H ♢ [73/92]
Three "styles" of logging
- undo ... removes changes by any uncommitted tx's
- redo ... repeats changes by any committed tx's
- undo/redo ... combines aspects of both
All approaches require:
- a sequential file of log records
- log records describe a change to a data item
- log records are written first
- actual changes to data are written later
Known as write-ahead logging (PostgreSQL uses WAL)
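For the recovery sketches that follow, one might represent these records along the following lines (an assumption for illustration, not PostgreSQL's WAL format):

/* one log record; kind selects which fields are meaningful */
typedef enum { LOG_START, LOG_COMMIT, LOG_ABORT, LOG_UPDATE } LogKind;

typedef struct {
    LogKind kind;
    int     tx;       /* transaction id T */
    int     item;     /* data item X (LOG_UPDATE only) */
    int     val;      /* old value v for undo logging; redo logging stores
                         the new value instead, undo/redo stores both */
} LogRec;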
COMP9315 24T1 ♢ Lectures Part H ♢ [74/92]
Simple form of logging which ensures atomicity.
Log file consists of a sequence of small records:
- <START T> ... transaction T begins
- <COMMIT T> ... transaction T completes successfully
- <ABORT T> ... transaction T fails (no changes)
- <T,X,v> ... transaction T changed the value of X from v
Notes:
- we refer to <T,X,v> generically as <UPDATE> log records
- an update log entry is created for each WRITE (not OUTPUT)
- an update log entry contains the old value (the new value is not recorded)
COMP9315 24T1 ♢ Lectures Part H ♢ [75/92]
Data must be written to disk in the following order:
- <START> transaction log record
- <UPDATE> log records indicating changes
- the changed data elements themselves
- <COMMIT> log record
Note: it is sufficient to have <T,X,v> output before X, for each X
COMP9315 24T1 ♢ Lectures Part H ♢ [76/92]
For the example transaction, we would get:
t Action v B(A) B(B) D(A) D(B) Log
--------------------------------------------------------
(0) BEGIN . . . 8 5 <START T>
(1) READ(A,v) 8 8 . 8 5
(2) v = v*2 16 8 . 8 5
(3) WRITE(A,v) 16 16 . 8 5 <T,A,8>
(4) READ(B,v) 5 16 5 8 5
(5) v = v+1 6 16 5 8 5
(6) WRITE(B,v) 6 16 6 8 5 <T,B,5>
(7) FlushLog
(8) StartCommit
(9) OUTPUT(A) 6 16 6 16 5
(10) OUTPUT(B) 6 16 6 16 6
(11) EndCommit <COMMIT T>
(12) FlushLog
Note that T is not regarded as committed until (12) completes.
COMP9315 24T1 ♢ Lectures Part H ♢ [77/92]
Simplified view of recovery using UNDO logging:
- scan backwards through log
  - if <COMMIT T>, mark T as committed
  - if <T,X,v> and T not committed, set X to v on disk
  - if <START T> and T not committed, add <ABORT T> to log
Assumes we scan entire log; use checkpoints to limit scan.
COMP9315 24T1 ♢ Lectures Part H ♢ [78/92]
Algorithmic view of recovery using UNDO logging:
committedTrans = abortedTrans = startedTrans = {}
for each log record from most recent to oldest {
   switch (log record) {
   <COMMIT T> : add T to committedTrans
   <ABORT T>  : add T to abortedTrans
   <START T>  : add T to startedTrans
   <T,X,v>    : if (T in committedTrans)
                   ignore
                else
                   { WRITE(X,v); OUTPUT(X) }
} }
for each T in startedTrans {
   if (T in committedTrans) ignore
   else if (T in abortedTrans) ignore
   else write <ABORT T> to log
}
flush log
COMP9315 24T1 ♢ Lectures Part H ♢ [79/92]
Simple view of recovery implies reading entire log file.
Since log file grows without bound, this is infeasible.
Eventually we can delete "old" section of log.
- i.e. where all prior transactions have completed
This point is called a checkpoint.
- all of log prior to checkpoint can be ignored for recovery
As described, we need to wait for all active tx's to complete
- known as quiescent checkpointing
- tx's taking a long time to complete block the system
COMP9315 24T1 ♢ Lectures Part H ♢ [80/92]
Problem: many concurrent/overlapping transactions.
How to know that all have finished?
- periodically, write log record <CHKPT (T1,..,Tk)>
  (contains references to all active transactions ⇒ active tx table)
- continue normal processing (e.g. new tx's can start)
- when all of T1,..,Tk have completed, write log record <END CHKPT> and flush log
Note: tx manager maintains chkpt and active tx information
These kinds of checkpoints are often written as <START CHKPT ...>
COMP9315 24T1 ♢ Lectures Part H ♢ [81/92]
Recovery: scan backwards through log file processing as before.
Determining where to stop depends on ...
- whether we meet <END CHKPT> or <CHKPT...> first
If we encounter <END CHKPT> first:
- we know that all incomplete tx's come after the previous <CHKPT...>
- thus, can stop the backward scan when we reach <CHKPT...>
If we encounter <CHKPT (T1,..,Tk)> first:
- the crash occurred during the checkpoint period
- any of T1,..,Tk that committed before the crash are ok
- for uncommitted tx's, need to continue the backward scan
COMP9315 24T1 ♢ Lectures Part H ♢ [82/92]
Problem with UNDO logging:
- all changed data must be output to disk before committing
- conflicts with optimal use of the buffer pool
Alternative approach is redo logging:
- allow changes to remain only in buffers after commit
- write records to indicate what changes are "pending"
- after a crash, can apply changes during recovery
COMP9315 24T1 ♢ Lectures Part H ♢ [83/92]
Requirement for redo logging: write-ahead rule.
Data must be written to disk as follows:
- write <START T> log record
- write <UPDATE...> log records indicating changes
- then write <COMMIT> log record (and flush log)
- then OUTPUT the changed data elements themselves
Note that update log records now contain <T,X,v'>, where v' is the new value for X.
COMP9315 24T1 ♢ Lectures Part H ♢ [84/92]
For the example transaction, we would get:
t Action v B(A) B(B) D(A) D(B) Log
--------------------------------------------------------
(0) BEGIN . . . 8 5 <START T>
(1) READ(A,v) 8 8 . 8 5
(2) v = v*2 16 8 . 8 5
(3) WRITE(A,v) 16 16 . 8 5 <T,A,16>
(4) READ(B,v) 5 16 5 8 5
(5) v = v+1 6 16 5 8 5
(6) WRITE(B,v) 6 16 6 8 5 <T,B,6>
(7) COMMIT <COMMIT T>
(8) FlushLog
(9) OUTPUT(A) 6 16 6 16 5
(10) OUTPUT(B) 6 16 6 16 6
Note that T is regarded as committed as soon as (8) completes.
COMP9315 24T1 ♢ Lectures Part H ♢ [85/92]
Simplified view of recovery using REDO logging:
- identify all committed tx's (backwards scan)
- scan forwards through log
  - if <T,X,v> and T is committed, set X to v on disk
  - if <START T> and T not committed, add <ABORT T> to log
Use checkpoints (like UNDO log checkpoints) to limit scan.
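Mirroring the algorithmic view given earlier for UNDO logging, a minimal runnable sketch of REDO recovery over an in-memory log (the record layout from the earlier LogRec sketch, and the array standing in for the disk, are assumptions):

#include <stdbool.h>
#include <stdio.h>

typedef enum { LOG_START, LOG_COMMIT, LOG_ABORT, LOG_UPDATE } LogKind;
typedef struct { LogKind kind; int tx; int item; int newVal; } LogRec;

#define NITEMS 4
static int disk[NITEMS];          /* simulated disk state */
static bool committed[16];        /* committed[T] after pass 1 */

static void redoRecover(LogRec log[], int n) {
    /* pass 1 (backwards): identify all committed tx's */
    for (int i = n - 1; i >= 0; i--)
        if (log[i].kind == LOG_COMMIT) committed[log[i].tx] = true;
    /* pass 2 (forwards): re-apply updates of committed tx's */
    for (int i = 0; i < n; i++)
        if (log[i].kind == LOG_UPDATE && committed[log[i].tx])
            disk[log[i].item] = log[i].newVal;   /* set X to v' on disk */
    /* a full version would also append <ABORT T> for uncommitted tx's */
}

int main(void) {
    disk[0] = 8; disk[1] = 5;
    LogRec log[] = {
        { LOG_START,  1, 0, 0 },
        { LOG_UPDATE, 1, 0, 16 },    /* <T1,A,16> */
        { LOG_UPDATE, 1, 1, 6 },     /* <T1,B,6>  */
        { LOG_COMMIT, 1, 0, 0 },
        { LOG_START,  2, 0, 0 },
        { LOG_UPDATE, 2, 0, 99 },    /* T2 never committed: not redone */
    };
    redoRecover(log, sizeof log / sizeof log[0]);
    printf("A=%d B=%d\n", disk[0], disk[1]);     /* A=16 B=6 */
    return 0;
}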
COMP9315 24T1 ♢ Lectures Part H ♢ [86/92]
UNDO logging and REDO logging are incompatible in
- the order of outputting <COMMIT T> and changed data
- how data in buffers is handled during checkpoints
Undo/Redo logging combines aspects of both
- requires a new kind of update log record: <T,X,v,v'> gives both the old and new values for X
- removes incompatibilities between output orders
As for previous cases, requires write-ahead of log records.
Undo/redo logging is common in practice (e.g. the ARIES algorithm).
COMP9315 24T1 ♢ Lectures Part H ♢ [87/92]
❖ Undo/Redo Logging (cont)
For the example transaction, we might get:
t Action v B(A) B(B) D(A) D(B) Log
--------------------------------------------------------
(0) BEGIN . . . 8 5 <START T>
(1) READ(A,v) 8 8 . 8 5
(2) v = v*2 16 8 . 8 5
(3) WRITE(A,v) 16 16 . 8 5 <T,A,8,16>
(4) READ(B,v) 5 16 5 8 5
(5) v = v+1 6 16 5 8 5
(6) WRITE(B,v) 6 16 6 8 5 <T,B,5,6>
(7) FlushLog
(8) StartCommit
(9) OUTPUT(A) 6 16 6 16 5
(10) <COMMIT T>
(11) OUTPUT(B) 6 16 6 16 6
Note that T is regarded as committed as soon as (10) completes.
COMP9315 24T1 ♢ Lectures Part H ♢ [88/92]
❖ Undo/Redo Logging (cont)
Simplified view of recovery using UNDO/REDO logging:
- scan log to determine committed/uncommitted tx's
- for each uncommitted tx T, add <ABORT T> to log
- scan backwards through log
  - if <T,X,v,w> and T is not committed, set X to v on disk
- scan forwards through log
  - if <T,X,v,w> and T is committed, set X to w on disk
Note: undo/redo logging requires dirty buffers to be flushed at <CHKPT...>
COMP9315 24T1 ♢ Lectures Part H ♢ [89/92]
❖ Undo/Redo Logging (cont)
The above description simplifies details of undo/redo logging.
ARIES is a complete algorithm for undo/redo logging.
Differences to what we have described:
- log records contain a sequence number (LSN)
- LSNs are used in tx and buffer managers, and stored in data pages
- an additional log record marks <END> (of commit or abort)
- <CHKPT> contains only a timestamp
- <END CHKPT..> contains tx and dirty-page info
COMP9315 24T1 ♢ Lectures Part H ♢ [90/92]
PostgreSQL uses write-ahead undo/redo style logging.
It also uses multi-version concurrency control, which
- tags each record with a tx and update timestamp
MVCC simplifies some aspects of undo/redo, e.g.
- some info required by logging is already held in each tuple
- no need to undo effects of aborted tx's; use older version
Recall: WAL entries in postgresql.conf
COMP9315 24T1 ♢ Lectures Part H ♢ [91/92]
❖ Recovery in PostgreSQL (cont)
Transaction/logging code is distributed throughout the backend.
Core transaction code is in src/backend/access/transam.
Transaction/logging data is written to files in PGDATA/pg_xlog
- a number of very large files containing log records
- old files are removed once all txs noted there are completed
- new files added when existing files reach their capacity (16MB)
- number of tx log files varies depending on tx activity
COMP9315 24T1 ♢ Lectures Part H ♢ [92/92]
Produced: 30 Apr 2024