Transaction Processing


Transaction Processing1/153

Where transaction processing fits in the DBMS:

[Diagram:Pics/txproc/dbmsarch-small.png]


Transactions2/153

A transaction is:

Transactions occur naturally in the context of DB applications.

In order to achieve satisfactory performance (throughput), transactions must be able to execute concurrently.

(Notation: we use the abbreviation "tx" as a synonym for "transaction")


... Transactions3/153

A transaction

E.g.   select ... update ... insert ... select ... insert ...

A transaction can end in two possible ways:

Above is essentially the atomicity requirement in ACID (see later).


... Transactions4/153

A transaction is a DB state-change operation.

[Diagram:Pics/txproc/tx-exec1-small.png]

Assume that the code of the transaction maps one consistent DB state to another consistent DB state.

Above is essentially the consistency requirement in ACID (see later).


... Transactions5/153

Transactions execute on a collection of data that is shared among users, persistent, and subject to integrity constraints.

Transactions need an environment that ensures their effects are not compromised by concurrency or by system failure.

Goal: data integrity should be maintained at all times (for each tx).


... Transactions6/153

If a transaction commits, must ensure that its changes persist.

Part-way through a transaction, must ensure that its partial effects are not visible to other tx's.

If a transaction aborts, must ensure that its effects are removed from the database.

If there is a system failure, must ensure that on restart the database is restored to a consistent state.


Concurrency in DBMSs7/153

Concurrency in a multi-user DBMS like PostgreSQL:

[Diagram:Pics/txproc/pg-runtime-small.png]


ACID Properties8/153

Data integrity is assured if transactions satisfy the following:

Atomicity: either all of a tx's operations are performed, or none are

Consistency: each tx takes the DB from one consistent state to another

Isolation: concurrent tx's do not interfere with one another

Durability: once a tx commits, its effects persist (even after failures)


... ACID Properties9/153

Atomicity can be represented by state-transitions:

[Diagram:Pics/txproc/trans-states-small.png]

COMMIT ⇒ all changes preserved,   ABORT ⇒ database unchanged


... ACID Properties10/153

Transaction processing:

Consistency is the property mentioned earlier: each tx maps one consistent DB state to another.

Ensuring this must be left to application programmers.

Our discussion thus focusses on: Atomicity, Durability, Isolation


... ACID Properties11/153

Atomicity is handled by the commit and abort mechanisms

Durability is handled by implementing stable storage, via

Isolation is handled by concurrency control mechanisms


Review of Transaction Terminology12/153

To describe transaction effects, we consider:

Relationship between the above operations and SQL:


... Review of Transaction Terminology13/153

More on transactions and SQL

In PostgreSQL, tx's cannot be defined inside stored procedures (e.g. PLpgSQL)


... Review of Transaction Terminology14/153

The READ, WRITE, ABORT, COMMIT operations:

The operations are typically denoted as:

RT(X) read item X in transaction T
WT(X) write item X in transaction T
AT abort transaction T
CT commit transaction T


Schedules15/153

A schedule gives the sequence of operations that occur

E.g.   RT1(A)   RT2(B)   WT1(A)   WT3(C)   RT2(A)   WT3(B)   ...

For a single tx, there is a single schedule   (its operations in order).

For a group of tx's, there are very many possible schedules.


... Schedules16/153

Consider a simple banking transaction, expressed in "PLpgSQL":


create function
   withdraw(Acct integer, Required float) returns text
as $$
declare
   Bal float;
begin
   select balance into Bal from Accounts where id=Acct;  R
   Bal := Bal - Required;
   update Accounts set balance=Bal where id=Acct;        W
   if (Bal < 0) then
      rollback; return 'Insufficient funds';             A
   else
      commit; return 'New balance: '||Bal::text;         C
   end if;
end;
$$ language plpgsql;

Notes:


... Schedules17/153

If tx T = withdraw(A,R), it has two possible schedules

If tx T = withdraw(A,R1) and tx S = withdraw(B,R2)
some possible schedules are:


... Schedules18/153

Serial schedules have no interleave of operations from different tx's.

Why serial schedules are good:

As would occur e.g. in a single-user database system.


... Schedules19/153

With different-ordered serial executions, tx's may get different results.

I.e. StateAfter(T1;T2) = StateAfter(T2;T1) is not generally true.

Consider the following two transactions:

T1 : select sum(salary)
     from Employee where dept='Sales'

T2 : insert into Employee
     values (....,'Sales',...)

If we execute T1 then T2 we get a smaller salary total than if we execute T2 then T1.

In both cases, however, the salary total is consistent with the state of the database at the time the transaction is executed.


... Schedules20/153

A serial execution of consistent transactions is always consistent.

If transactions execute under a concurrent (nonserial) schedule, the potential exists for conflict among their effects.

In the worst case, the effect of executing the transactions ...

So why don't we observe such problems in real DBMSs? ...


... Schedules21/153

Not all concurrent executions cause problems.

For example, the schedules

T1: R(X) W(X)           R(Y) W(Y)
T2:           R(X) W(X)

or

T1: R(X) W(X)      R(Y)      W(Y)
T2:           R(X)      W(X)

or ...

leave the database in a consistent state.


Example Transaction #122/153

Problem: Allocate a seat on a plane flight

Implement as a function returning success status:

function allocSeat(paxID    integer,
                   flightID integer,
                   seatNum  string)
         returning boolean
{
    check whether seat currently occupied
    if (already occupied)
        tell pax seat taken; return !ok
    else
        assign pax to seat; return ok
}

Assume schema:


Flight(flightID, flightNum, flightTime, ...)
SeatingAlloc(flightID, seatNum, paxID, ...)


... Example Transaction #123/153

PLpgSQL implementation for seat allocation:


create or replace function
    allocSeat(paxID int, fltID int, seat text) returns boolean
as $$
declare
    pid int;
begin
    select s.paxID into pid from SeatingAlloc s
    where  s.flightID = fltID and s.seatNum = seat;
    if (pid is not null) then
        return false;  -- someone else already has seat
    else
        update SeatingAlloc set paxID = allocSeat.paxID
        where  flightID = fltID and seatNum = seat;
        commit;
        return true;
    end if;
end;
$$ language plpgsql;


... Example Transaction #124/153

If customer Cust1 executes allocSeat(Cust1,f,23B), the seat is allocated to Cust1.

If customer Cust2 then executes allocSeat(Cust2,f,23B), Cust2 is told the seat is taken.

The system behaves as required ⇒ the tx is consistent.


... Example Transaction #125/153

Consider two customers trying allocSeat(?,f,23B) simultaneously.

A possible order of operations ...

Cause of problem: unfortunate interleaving of operations within concurrent transactions.

Serial execution (e.g. enforced by locks) could solve these kinds of problem.


Example Transaction #226/153

Problem: transfer funds between two accounts in same bank.

Implement as a function returning success status:

function transfer(sourceAcct integer,
                  destAcct   integer,
                  amount     real)
         returning boolean
{
    check whether sourceAcct is valid
    check whether destAcct is valid
    check whether sourceAcct has
          sufficient funds (>= amount)
    if (all ok) {
        withdraw money from sourceAcct
        deposit money into destAcct
}   }


... Example Transaction #227/153

PLpgSQL for funds transfer between accounts:


create or replace function
    transfer(source int, dest int, amount float)
    returns boolean
as $$
declare
    ok   boolean := true;
    acct Accounts%rowtype;
begin
    select * into acct
    from   Accounts where id=source;
    if (not found) then
        raise warning 'Invalid Withdrawal Account';
        ok := false;
    end if;
    perform id from Accounts where id=dest;
    if (not found) then
        raise warning 'Invalid Deposit Account';
        ok := false;
    end if;
    if (acct.balance < amount) then
        raise warning 'Insufficient funds';
        ok := false;
    end if;
    if (not ok) then rollback; return false; end if;
    update Accounts
    set    balance = balance - amount
    where  id = source;
    update Accounts
    set    balance = balance + amount
    where  id = dest;
    commit;
    return true;
end;
$$ language plpgsql;


... Example Transaction #228/153

If a customer transfers $1000 from Acct1 to Acct2, the transfer completes correctly.

But if Cust1 and Cust2 both transfer from Acct1 together, both may pass the balance check, even though the account ran out of money after Cust1 took their cash.

Similar to the earlier problem; could be fixed by serialization.


... Example Transaction #229/153

Consider customer transfers $1000 from Acct1 to Acct2

If transactions are not atomic/durable:


Transaction Anomalies30/153

What problems can occur with concurrent transactions?

The set of phenomena can be characterised broadly under:


... Transaction Anomalies31/153

Dirty read: a transaction reads data written by a concurrent uncommitted transaction

Example:

     Transaction T1       Transaction T2
(1)  update R set a=a+1
     where id=1
(2)                       select a into Y
                          from R where id=1
(3)  abort
(4)                       update R set a=Y+1
                          where id=1
(5)                       commit

Effect: T2's update is based on a value of R.a that T1 never committed.


... Transaction Anomalies32/153

Nonrepeatable read: a transaction re-reads data it has previously read and finds that data has been modified by another transaction (that committed since the initial read)

Example:

     Transaction T1    Transaction T2
(1)  select * from R
     where id=5
(2)                    update R set a=8
                       where id=5
(3)                    commit
(4)  select * from R
     where id=5

Effect: T1 runs same query twice; sees different data


... Transaction Anomalies33/153

Phantom read: a transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently-committed transaction

Example:

     Transaction T1    Transaction T2
(1)  select count(*)
     from R where a=5
(2)                    insert into R(id,a,b)
                              values (2,5,8)
(3)                    commit
(4)  select count(*)
     from R where a=5

Effect: T1 runs same query twice; sees different result set


Example of Transaction Failure34/153

The above examples generally assumed that transactions committed.

Additional problems can arise when transactions abort.

We give examples using the following two transactions:

T1: read(X)           T2: read(X)
    X := X + N            X := X + M
    write(X)              write(X)
    read(Y)
    Y := Y - N
    write(Y)

and initial DB state X=100, Y=50, N=5, M=8.


... Example of Transaction Failure35/153

Consider the following schedule where one transaction fails:

T1: R(X) W(X) A
T2:             R(X) W(X)

Transaction T1 aborts after writing X.

The abort will rollback the changes to X, but where the undo occurs can affect the results.

Consider three places where rollback might occur:

T1: R(X) W(X) A [1]     [2]     [3]
T2:                 R(X)    W(X)


Transaction Failure - Case 136/153

This scenario is ok.   T1's effects have been eliminated.

Database   Transaction T1       Transaction T2
---------  ------------------   --------------
X    Y               X    Y               X
100  50              ?    ?               ?
           read(X)   100
           X:=X+N    105
105        write(X)
           abort
100        rollback
                                read(X)   100
                                X:=X+M    108
108                             write(X)
---------
108  50


Transaction Failure - Case 237/153

In this scenario, some of T1's effects have been retained.

Database   Transaction T1       Transaction T2
---------  ------------------   --------------
X    Y               X    Y               X
100  50              ?    ?               ?
           read(X)   100
           X:=X+N    105
105        write(X)
           abort
                                read(X)   105
                                X:=X+M    113
100        rollback
113                             write(X)
---------
113  50


Transaction Failure - Case 338/153

In this scenario, T2's effects have been lost, even after commit.

Database   Transaction T1       Transaction T2
---------  ------------------   --------------
X    Y               X    Y               X
100  50              ?    ?               ?
           read(X)   100
           X:=X+N    105
105        write(X)
           abort
                                read(X)   105
                                X:=X+M    113
113                             write(X)
100        rollback
---------
100  50


Transaction Isolation


Transaction Isolation40/153

If transactions always run "single-user":

Simplest form of isolation: serial execution   (T1 ; T2 ; T3 ; ...)


... Transaction Isolation41/153

In practice, serial execution gives poor performance.

We need approaches that allow "safe" concurrency

The remainder of this discussion involves


DBMS Transaction Management42/153

Abstract view of DBMS concurrency mechanisms:

[Diagram:Pics/txproc/txproc1-small.png]

The Scheduler


Serializability43/153

Consider two schedules S1 and S2 produced by executing the same set of transactions.

S1 and S2 are equivalent if they produce the same final DB state.

A schedule S for a set of concurrent tx's T1..Tn is serializable if it is equivalent to some serial schedule of T1..Tn.

Under these circumstances, consistency is guaranteed.


... Serializability44/153

Two formulations of serializability:

View serializability is strictly weaker than conflict serializability.

i.e. there are VS schedules that are not CS, but not vice versa
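The conflict-serializability test can be sketched as a small simulation (a sketch only; the encoding of a schedule as (tx, op, item) tuples and the function names are our own):

```python
def precedence_graph(schedule):
    # Two operations conflict if they come from different tx's, touch the
    # same item, and at least one is a write. An edge Ti -> Tj is added
    # when a conflicting operation of Ti precedes one of Tj.
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for (t2, op2, x2) in schedule[i+1:]:
            if t1 != t2 and x1 == x2 and 'W' in (op1, op2):
                edges.add((t1, t2))
    return edges

def is_conflict_serializable(schedule):
    # The schedule is conflict-serializable iff the graph is acyclic;
    # checked here by repeatedly removing source nodes (Kahn's algorithm).
    edges = precedence_graph(schedule)
    nodes = {t for t, _, _ in schedule}
    while nodes:
        sources = [n for n in nodes
                   if not any(a in nodes and b == n for a, b in edges)]
        if not sources:
            return False      # cycle => not conflict-serializable
        nodes -= set(sources)
    return True

# The unserializable interleaving pattern: T1 and T2 each read what the
# other later writes, giving edges in both directions.
bad = [('T1','R','Y'), ('T2','R','X'), ('T2','W','Y'), ('T1','W','X')]
print(is_conflict_serializable(bad))
```

An acyclic precedence graph also yields an equivalent serial order directly: any topological order of the graph.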


Isolation and Concurrency Control45/153

It is not useful to test schedules for serializability after they have executed.

The goal is to devise concurrency control schemes that ensure only serializable schedules can occur.

Serializability tests are used in proving properties of these schemes.


Other Properties of Schedules46/153

Above discussion explicitly assumes that all transactions commit.

What happens if some transaction aborts?

Under serial execution, there is no problem:

With concurrent execution, there may be problems:


Recoverability47/153

Consider the serializable schedule:

T1:        R(X)  W(Y)  C
T2:  W(X)                 A

(where the final value of Y depends on the value of X)

Notes:

The schedule produces an invalid database state, even though serializable.


... Recoverability48/153

Recoverable schedules avoid these kinds of problems.

For a schedule to be recoverable, we require the additional constraint that each Tj commits only after every Ti from which it read has committed,

and this property must hold for all transactions Tj.

Note that recoverability does not prevent "dirty reads".

In order to make schedules recoverable in the presence of dirty reads and aborts, may need to abort multiple transactions.
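The recoverability condition can be checked mechanically. A sketch in Python, assuming schedules are encoded as (tx, op, item) tuples with op in {'R','W','C','A'} and item None for commit/abort records (our own convention):

```python
def is_recoverable(schedule):
    # Recoverable: each tx commits only after every tx it read from
    # has committed.
    last_writer = {}   # item -> tx of the most recent write
    reads_from = {}    # Tj -> set of Ti whose writes Tj has read
    commit_pos = {}    # tx -> position of its commit in the schedule
    for i, (t, op, x) in enumerate(schedule):
        if op == 'W':
            last_writer[x] = t
        elif op == 'R':
            w = last_writer.get(x)
            if w is not None and w != t:
                reads_from.setdefault(t, set()).add(w)
        elif op == 'C':
            commit_pos[t] = i
    for tj, sources in reads_from.items():
        if tj in commit_pos:
            for ti in sources:
                if ti not in commit_pos or commit_pos[ti] > commit_pos[tj]:
                    return False   # Tj committed before (or without) Ti
    return True

# The schedule above: T2 writes X, T1 reads it and commits, T2 aborts
bad = [('T2','W','X'), ('T1','R','X'), ('T1','W','Y'),
       ('T1','C',None), ('T2','A',None)]
print(is_recoverable(bad))
```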


Cascading Aborts49/153

Recall the earlier non-recoverable schedule:

T1:        R(X)  W(Y)  C
T2:  W(X)                 A

To make it recoverable requires:

T1:        R(X)  W(Y)  ...   A
T2:  W(X)                 A

Known as cascading aborts (or cascading rollback).


... Cascading Aborts50/153

Example: T3 aborts, causing T2 to abort, causing T1 to abort

T1:                    R(Y)  W(Z)        A
T2:        R(X)  W(Y)                 A
T3:  W(X)                          A

Even though T1 has no direct connection with T3
(i.e. no shared data).

This kind of problem ...


... Cascading Aborts51/153

Cascading aborts can be avoided if transactions can read only values written by already-committed transactions.

Effectively: eliminate the possibility of reading dirty data.

Downside: reduces opportunity for concurrency.

GUW call these ACR (avoid cascading rollback) schedules.

All ACR schedules are also recoverable.


Strictness52/153

Strict schedules also eliminate the chance of writing dirty data.

A schedule is strict if no tx can read or overwrite a value written by another tx until that tx has either committed or aborted.

Strict schedules simplify the task of rolling back after aborts.


... Strictness53/153

Example: non-strict schedule

T1:  W(X)        A
T2:        W(X)     A

Problems with handling rollback after aborts:
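Strictness can also be checked mechanically over a schedule, using the same (tx, op, item) tuple encoding as for the other schedule properties (our own convention; item is None for commit/abort records):

```python
def is_strict(schedule):
    # Strict: no tx reads or overwrites an item until the tx that
    # last wrote it has committed or aborted.
    pending = {}   # item -> tx with an unfinished write of that item
    for t, op, x in schedule:
        if op in ('C', 'A'):
            for k in [k for k, w in pending.items() if w == t]:
                del pending[k]     # t's writes are no longer "dirty"
        else:
            w = pending.get(x)
            if w is not None and w != t:
                return False       # touching another tx's dirty data
            if op == 'W':
                pending[x] = t
    return True

# The non-strict example above: T2 overwrites X before T1 aborts
print(is_strict([('T1','W','X'), ('T2','W','X'),
                 ('T1','A',None), ('T2','A',None)]))
```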


Schedule Properties54/153

Relationship between various classes of schedules:

[Diagram:Pics/txproc/schedules-small.png]

Schedules ought to be serializable and strict.

But more serializable/strict ⇒ less concurrency.

DBMSs allow users to trade off "safety" against performance.


Transaction Isolation Levels55/153

Previous examples showed:

Safest approach ... Other approaches are weaker ... Is a trade-off useful?


... Transaction Isolation Levels56/153

In many real applications, there is either

This leads to a trade-off between performance and isolation


... Transaction Isolation Levels57/153

SQL provides a mechanism for database programmers to specify how much isolation to apply to transactions

set transaction
    read only  -- so weaker isolation may be ok
    read write -- suggests stronger isolation needed
isolation level
    -- weakest isolation, maximum concurrency
    read uncommitted
    read committed
    repeatable read
    serializable
    -- strongest isolation, minimum concurrency


... Transaction Isolation Levels58/153

Meaning of transaction isolation levels:


Isolation          Dirty          Nonrepeatable   Phantom
Level              Read           Read            Read

Read uncommitted   Possible       Possible        Possible
 
Read committed     Not possible   Possible        Possible

Repeatable read    Not possible   Not possible    Possible

Serializable       Not possible   Not possible    Not possible


... Transaction Isolation Levels59/153

For transaction isolation, PostgreSQL

Note: cannot implement read uncommitted because of MVCC


... Transaction Isolation Levels60/153

A PostgreSQL tx consists of a sequence of SQL statements:

BEGIN S1; S2; ... Sn; COMMIT;

Isolation levels affect view of DB provided to each Si:


... Transaction Isolation Levels61/153

Using PostgreSQL's serializable isolation level, a select:

Using the serializable isolation level, an update fails if a concurrent tx has modified the data it depends on.

The transaction containing the update must then rollback and re-start.


... Transaction Isolation Levels62/153

Example of repeatable read vs serializable


Implementing Concurrency Control


Concurrency Control64/153

Approaches to concurrency control:


Lock-based Concurrency Control65/153

Locks introduce additional mechanisms in DBMS:

[Diagram:Pics/txproc/txproc2-small.png]

The Scheduler


... Lock-based Concurrency Control66/153

Lock table entries contain:

Lock and unlock operations must be atomic.

Lock upgrade:


... Lock-based Concurrency Control67/153

Synchronise access to shared data items via following rules:

These rules alone do not guarantee serializability.


... Lock-based Concurrency Control68/153

Consider the following schedule, using locks:

T1: Lr(Y)     R(Y)               U(Y)         Lw(X) W(X) U(X)
T2:      Lr(X)    R(X) U(X) Lw(Y)....W(Y) U(Y)

(where Lr = read-lock, Lw = write-lock, U = unlock)

Locks correctly ensure controlled access to shared objects (X, Y).

Despite this, the schedule is not serializable.


Two-Phase Locking69/153

To guarantee serializability, we require an additional constraint on how locks are applied: no tx can acquire a lock after it has released any lock.

Each transaction is then structured as: a growing phase, where locks are acquired, followed by a shrinking phase, where locks are released.


... Two-Phase Locking70/153

Consider the following two transactions:

T1: Lw(A) R(A) F1 W(A) Lw(B) U(A) R(B) G1 W(B) U(B)
T2: Lw(A) R(A) F2 W(A) Lw(B) U(A) R(B) G2 W(B) U(B)

They follow 2PL protocol, inducing a schedule like:

T1(a): Lw(A)      R(A) F1 W(A) Lw(B) U(A)
T2(a):      Lw(A) ...................... R(A) F2 W(A)

T1(b): R(B)      G1 W(B) U(B)
T2(b):     Lw(B) ............ U(A) R(B) G2 W(B) U(B)
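The 2PL rule can be enforced by the lock manager itself. A minimal sketch in Python (exclusive locks only; class and method names are our own, and a real scheduler would block the requester instead of returning False):

```python
class TwoPhaseLockManager:
    """Exclusive locks plus the 2PL rule: a tx may not acquire
    any lock after its first unlock."""
    def __init__(self):
        self.holder = {}        # item -> tx currently holding its lock
        self.shrinking = set()  # tx's that have released a lock

    def lock(self, tx, item):
        if tx in self.shrinking:
            raise RuntimeError(f"{tx} violates 2PL: lock after unlock")
        if self.holder.get(item) not in (None, tx):
            return False        # lock held by another tx: must wait
        self.holder[item] = tx
        return True

    def unlock(self, tx, item):
        assert self.holder.get(item) == tx
        del self.holder[item]
        self.shrinking.add(tx)  # tx has entered its shrinking phase

lm = TwoPhaseLockManager()
lm.lock('T1', 'A'); lm.lock('T1', 'B')   # growing phase
lm.unlock('T1', 'A')                     # shrinking phase begins
# lm.lock('T1', 'C') would now raise: forbidden under 2PL
```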


Problems with Locking71/153

Appropriate locking can guarantee correctness.

However, it also introduces potential undesirable effects:


Deadlock72/153

Deadlock occurs when two transactions each wait for a lock on an item held by the other.

Example:

T1: Lw(A) R(A)            Lw(B) ......
T2:            Lw(B) R(B)       Lw(A) .....

How to deal with deadlock?


... Deadlock73/153

Handling deadlock involves forcing a transaction to "back off".


... Deadlock74/153

Simple approach: timeout

Better approach: waits-for graph


... Deadlock75/153

A cycle in the waits-for graph indicates a deadlock.

Could prevent deadlock by requiring tx's to acquire all of their locks at once, or in a fixed global order.

Could detect deadlock by maintaining the waits-for graph and periodically checking it for cycles.
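Cycle detection in the waits-for graph is a standard depth-first search. A sketch (the dict encoding of the graph is our own):

```python
def has_deadlock(waits_for):
    # waits_for maps each tx to the set of tx's whose locks it awaits.
    nodes = set(waits_for) | {u for vs in waits_for.values() for u in vs}
    colour = {n: 0 for n in nodes}   # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(n):
        colour[n] = 1
        for u in waits_for.get(n, ()):
            if colour[u] == 1 or (colour[u] == 0 and dfs(u)):
                return True          # back edge => cycle => deadlock
        colour[n] = 2
        return False
    return any(colour[n] == 0 and dfs(n) for n in nodes)

# T1 waits for T2's lock and vice versa (as in the example above)
print(has_deadlock({'T1': {'T2'}, 'T2': {'T1'}}))
```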


... Deadlock76/153

Alternative deadlock handling: timestamps.

Each tx is permanently allocated a unique timestamp (e.g. start-time).

When Tj tries to get lock held by Tk

Both schemes prevent waits-for cycles by imposing order/priority on tx's.


... Deadlock77/153

Tj tries to get lock held by Tk ...

Wait-die scheme:

Wound-wait scheme:
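The two schemes reduce to a single comparison of timestamps, where "older" means a smaller start timestamp. A sketch using the standard definitions (the function names and return strings are our own):

```python
def wait_die(requester_ts, holder_ts):
    # Wait-die: an older requester waits; a younger requester "dies"
    # (aborts, and restarts later keeping its original timestamp).
    return 'wait' if requester_ts < holder_ts else 'abort requester'

def wound_wait(requester_ts, holder_ts):
    # Wound-wait: an older requester "wounds" (aborts) the holder;
    # a younger requester waits.
    return 'abort holder' if requester_ts < holder_ts else 'wait'
```

In both schemes the older tx always prevails eventually, so no waits-for cycle can form, and a restarted tx keeps its timestamp, so it cannot starve.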


... Deadlock78/153

Properties of deadlock handling methods:


Starvation79/153

Starvation occurs when one transaction waits indefinitely on a lock, while other transactions continue normally.

Whether it occurs depends on the lock wait/release strategy.

When multiple tx's are waiting for the same lock, need to decide which one to grant it to.

Solutions:


Locking Granularity80/153

Locking typically reduces concurrency ⇒ reduces throughput.

Granularity of locking can impact performance:

+ lock a small item ⇒ more of database accessible

+ lock a small item ⇒ quick update ⇒ quick lock release

- lock small items ⇒ more locks ⇒ more lock management

Granularity levels: field, row (tuple), table, whole database


... Locking Granularity81/153

Multiple-granularity locking protocol:

Example: T1 scans table R and updates some tuples.

Example: T2 uses an index to read part of R.


Locking in B-trees82/153

How to lock a B-tree leaf node?

One possibility:

If for searching (select), locks would be read locks.

If for insert/delete, locks would be write locks.

This approach gives poor performance (lock on root is bottleneck).


... Locking in B-trees83/153

Approach for searching (select) ...

Approach for insert/delete ...


Optimistic Concurrency Control84/153

Locking is a pessimistic approach to concurrency control:

Costs: lock management, deadlock handling, contention.

In systems where read:write ratio is very high


... Optimistic Concurrency Control85/153

Transactions have three distinct phases: a read phase, a validation phase, and a write phase.

Timestamps are recorded at points noted:

[Diagram:Pics/txproc/occ-phases-small.png]


... Optimistic Concurrency Control86/153

Data structures needed for validation:

Use the V timestamps as ordering for transactions


... Optimistic Concurrency Control87/153

Two-transaction example:

Case 0: serial execution ... no problem

[Diagram:Pics/txproc/occ0-small.png]


... Optimistic Concurrency Control88/153

Case 1: reading overlaps validation/writing

[Diagram:Pics/txproc/occ1-small.png]


... Optimistic Concurrency Control89/153

Case 2: reading/validation overlaps validation/writing

[Diagram:Pics/txproc/occ2-small.png]


... Optimistic Concurrency Control90/153

Validation check for transaction T

If this test fails for any Ti, then T is rolled back.

What this prevents ...
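The validation check can be sketched as backward validation in the Kung-Robinson style (a sketch only; the attribute names start/validate/finish and read_set/write_set are our own):

```python
from types import SimpleNamespace as Tx

def validate(T, earlier):
    """'earlier' holds every Ti validated before T. Returns True if T
    may proceed to its write phase; otherwise T must roll back."""
    for Ti in earlier:
        # Ti finished writing before T started reading: no overlap at all
        if Ti.finish is not None and Ti.finish < T.start:
            continue
        # Ti may have written items while T was still reading them
        if Ti.write_set & T.read_set:
            return False
        # write phases could overlap, so the writes might interleave
        if (Ti.finish is None or Ti.finish > T.validate) \
           and Ti.write_set & T.write_set:
            return False
    return True

# Ti wrote X while T was still reading X => T must roll back
T  = Tx(start=5, validate=9, finish=None, read_set={'X'}, write_set={'X'})
Ti = Tx(start=1, validate=4, finish=8,    read_set={'Y'}, write_set={'X'})
print(validate(T, [Ti]))
```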


... Optimistic Concurrency Control91/153

Problems with optimistic concurrency control:

Notes:


Multi-version Concurrency Control92/153

Multi-version concurrency control (MVCC) aims to retain the benefits of locking, while getting more concurrency.

Achieves this by maintaining multiple versions of each data item.

Main difference between MVCC and standard locking: readers never block writers, and writers never block readers.


... Multi-version Concurrency Control93/153

Each transaction is identified by a unique timestamp (tx id).

Each record in the database is tagged with the tx id that created it and (if relevant) the tx id that deleted it.


... Multi-version Concurrency Control94/153

When a reader Ti is accessing the database, it sees the version of each item that was current when Ti started; later versions are ignored.

When a writer Tj changes a data item, it makes a new version rather than overwriting the existing one.


... Multi-version Concurrency Control95/153

Advantage of MVCC: readers and writers never block each other.

Disadvantages of MVCC: extra storage for old versions, and the overhead of removing versions that are no longer needed.

Despite apparent disadvantages, MVCC is very effective.


... Multi-version Concurrency Control96/153

Removing old versions:

When to make this check: each time a version is accessed, or via a periodic background scan?

PostgreSQL uses the latter (vacuum).


Concurrency Control in SQL97/153

Transactions in SQL are specified by begin ... commit (successful end) or rollback (unsuccessful end).

In PostgreSQL, other actions that cause rollback:


... Concurrency Control in SQL98/153

More fine-grained control of "undo" via savepoints:

Example:

begin;
  insert into numbersTable values (1);
  savepoint my_savepoint;
  insert into numbersTable values (2);
  rollback to savepoint my_savepoint;
  insert into numbersTable values (3);
commit;

will insert 1 and 3 into the table, but not 2.


... Concurrency Control in SQL99/153

SQL standard defines four levels of transaction isolation.

The weakest level allows dirty reads, phantom reads, etc.

PostgreSQL implements: repeatable-read = serializable, read-uncommitted = read-committed


... Concurrency Control in SQL100/153

Using the serializable isolation level, a select:

Using the serializable isolation level, an update fails if a concurrent tx has modified the data it depends on.

The transaction containing the failed update will rollback and re-start.


... Concurrency Control in SQL101/153

Explicit control of concurrent access is available, e.g.

Table-level locking: LOCK TABLE

Row-level locking: SELECT FOR UPDATE, DELETE

All locks are released at end of transaction (no explicit unlock)


Concurrency Control in PostgreSQL102/153

PostgreSQL uses two styles of concurrency control:

From the SQL (PLpgSQL) level:


... Concurrency Control in PostgreSQL103/153

Implementing MVCC in PostgreSQL requires:


... Concurrency Control in PostgreSQL104/153

Rules for a tuple to be visible to Tj:
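The essence of the visibility test can be sketched as follows (a simplified sketch: the real PostgreSQL rules also handle in-progress tx's, subtransactions and hint bits; parameter names are our own):

```python
def visible(xmin, xmax, my_xid, committed_before_me):
    """xmin = tx that created the tuple, xmax = tx that deleted it
    (or None); committed_before_me = set of tx ids that committed
    before our tx started."""
    # visible if created by us, or by a tx committed before we started...
    created_ok = (xmin == my_xid) or (xmin in committed_before_me)
    # ...and not deleted by us or by a tx committed before we started
    deleted = xmax is not None and \
              ((xmax == my_xid) or (xmax in committed_before_me))
    return created_ok and not deleted

# Tuple created by committed tx 10, never deleted: visible to tx 15
print(visible(10, None, 15, {10}))
```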


... Concurrency Control in PostgreSQL105/153

Tx's always see a consistent version of the database.

But may not see the "current" version of the database.

E.g. T1 does select, then concurrent T2 deletes some of T1's selected tuples

This is OK unless tx's communicate outside the database system.

E.g. T1 counts tuples; T2 deletes then counts; then counts are compared

Use locks if application needs every tx to see same current version


Implementing Atomicity/Durability


Atomicity/Durability107/153

Reminder:

Transactions are atomic: all of a tx's changes happen, or none of them do.

Transaction effects are durable: if a tx commits, its effects persist, even after system failure.

Implementation of atomicity and durability is intertwined.


Durability108/153

What kinds of "system failures" do we need to deal with?

The last requires off-site backup; all others should be locally recoverable.


... Durability109/153

Consider following scenario:

[Diagram:Pics/txproc/crash-small.png]

Desired behaviour after system restart:


... Durability110/153

Durability begins with a stable disk storage subsystem.

Operations:

Each call to putBlock must ensure that the data reaches the disk surface.

A subsequent getBlock on the same disk sector must ensure that the most recently put data is returned.


... Durability111/153

Implementation of transaction operations R(V) and W(V)

Value R(Object V) {
    B = getBuf(blockContaining(V))
    return value of V from B
}
void W(Object V, value k) {
    B = getBuf(blockContaining(V))
    set value of V in B
}

Note:


... Durability112/153

getBuf() and putBuf() interface buffer pool with disk

Buffer getBuf(BlockAddr A) {
    if (!inBufferPool(A)) {
        B = availableBuffer(A);
        getBlock(B);
    }
    return bufferContaining(A);
}
void putBuf(BlockAddr A) {
    B = bufferContaining(A);
    putBlock(B);
}


Stable Store113/153

One simple strategy using redundancy: stable store.

Protects against all failures in a single disk sector.

Each logical disk page X is stored twice.

(Obvious disadvantage: disk capacity is halved)

X is stored in sector S on disk L and sector T on disk R

Assume that a sound parity-check is available

(i.e. can always recognise whether data has transferred memory ↔ disk correctly)


... Stable Store114/153

Low level sector i/o functions:

int writeSector(char *b, Disk d, Sector s) {
   int nTries = 0;
   do {
      nTries++;  write(d,s,b);
   } while (bad(parity) && nTries < maxTries);
   return nTries;
}
int readSector(char *b, Disk d, Sector s) {
   int nTries = 0;
   do {
      nTries++;  read(d,s,b);
   } while (bad(parity) && nTries < maxTries);
   return nTries;
}


... Stable Store115/153

Writing data to disk with stable store:

int writeBlock(Buffer *b, Disk d, Sector s) {
   int sec;
   for (;;) {
      sec = (s > 0) ? s : getUnusedSector(d);
      n = writeSector(b->data, d, sec);
      if (n == maxTries) {
         mark sec as unusable
         s = -1;  // retry with a fresh sector
      } else
         return sec;
   }
}


... Stable Store116/153

Reading data from disk with stable store:

int readBlock(Buffer *b) {
   int n = readSector(b->data, b->diskL, b->sectorL);
   if (n == maxTries) {
      n = readSector(b->data, b->diskR, b->sectorR);
      if (n == maxTries) return -1; // read failure
   }
   return 0; // successful read
}


... Stable Store117/153

Consider scenario where power fails during write operation:

Wish to restore the system to a state where:


... Stable Store118/153

How stable storage handles failure during writing:
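The recovery step can be sketched as follows, assuming writes always go to the left copy first, then the right, and that the parity check tells us whether each copy transferred correctly (a sketch; the function name and tuple interface are our own):

```python
def recover_pair(left, right, left_ok, right_ok):
    """Restore the two copies (L, R) of a stable-store page to a
    consistent state after a crash; returns the repaired (L, R)."""
    if left_ok:
        # L is good. If R differs or is bad, the crash hit between the
        # two writes (or during R's write): recopy L, the newer value.
        return (left, left)
    if right_ok:
        # L's write was interrupted: R still holds the old value.
        return (right, right)
    raise IOError("both copies bad: sector unrecoverable")

# Crash after writing L but before writing R: L ("new") wins
print(recover_pair('new', 'old', True, True))
```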


RAID119/153

RAID gives techniques to achieve

Requires: multiple disks and a suitable disk controller.

(See texts for further discussion on RAID)


Stable Storage Subsystem120/153

We can prevent/minimise loss/corruption of data due to:

If all of these implemented, assume stable storage subsystem.


Dealing with Transactions121/153

The remaining "failure modes" that we need to consider:

Standard technique for managing these:


Architecture for Atomicity/Durability122/153

How does a DBMS provide for atomicity/durability?

[Diagram:Pics/txproc/arch-small.png]


Execution of Transactions123/153

Transactions deal with three address spaces: stored data on the disk, data in memory buffers, data in the tx's local variables.

Each of these may hold a different "version" of a DB object.


... Execution of Transactions124/153

Operations available for data transfer:

READ/WRITE are issued by transaction.

INPUT/OUTPUT are issued by buffer manager (and log manager).


... Execution of Transactions125/153

Example of transaction execution:

-- implements A = A*2; B = B+1;
BEGIN
READ(A,v); v = v*2; WRITE(A,v);
READ(B,v); v = v+1; WRITE(B,v);
COMMIT

READ accesses the buffer manager and may cause INPUT.

COMMIT needs to ensure that buffer contents go to disk.


... Execution of Transactions126/153

States as the transaction executes:


t   Action        v  Buf(A)  Buf(B)  Disk(A)  Disk(B)
-----------------------------------------------------
(0) BEGIN         .      .       .        8        5
(1) READ(A,v)     8      8       .        8        5
(2) v = v*2      16      8       .        8        5
(3) WRITE(A,v)   16     16       .        8        5
(4) READ(B,v)     5     16       5        8        5
(5) v = v+1       6     16       5        8        5
(6) WRITE(B,v)    6     16       6        8        5
(7) OUTPUT(A)     6     16       6       16        5
(8) OUTPUT(B)     6     16       6       16        6

After tx completes, we must have either
Disk(A)=8,Disk(B)=5   or   Disk(A)=16,Disk(B)=6

If system crashes before (8), may need to undo disk changes.
If system crashes after (8), may need to redo disk changes.


Transactions and Buffer Pool127/153

Two issues arise w.r.t. buffers:

Ideally, we want stealing and not forcing.


... Transactions and Buffer Pool128/153

Handling stealing:


... Transactions and Buffer Pool129/153

Handling no forcing:


Above scenario may be a problem, even if we are forcing


Logging130/153

Three "styles" of logging

All approaches require:


Undo Logging131/153

Simple form of logging which ensures atomicity.

Log file consists of a sequence of small records:

Notes:


... Undo Logging132/153

Data must be written to disk in the following order:

  1. start transaction log record
  2. update log records indicating changes
  3. the changed data elements themselves
  4. the commit log record
Note: sufficient to have <T,X,v> output before X, for each X


... Undo Logging133/153

For the example transaction, we would get:


t    Action        v  B(A)  B(B)  D(A)  D(B)  Log
--------------------------------------------------------
(0)  BEGIN         .    .     .     8     5   <START T>
(1)  READ(A,v)     8    8     .     8     5
(2)  v = v*2      16    8     .     8     5
(3)  WRITE(A,v)   16   16     .     8     5   <T,A,8>
(4)  READ(B,v)     5   16     5     8     5
(5)  v = v+1       6   16     5     8     5
(6)  WRITE(B,v)    6   16     6     8     5   <T,B,5>
(7)  FlushLog
(8)  StartCommit
(9)  OUTPUT(A)     6   16     6    16     5
(10) OUTPUT(B)     6   16     6    16     6
(11) EndCommit                                <COMMIT T>
(12) FlushLog

Note that T is not regarded as committed until (11).


... Undo Logging134/153

Simplified view of recovery using UNDO logging:

committedTrans = abortedTrans = startedTrans = {}
for each log record from most recent to oldest {
    switch (log record) {
    <COMMIT T> : add T to committedTrans
    <ABORT T>  : add T to abortedTrans
    <START T>  : add T to startedTrans
    <T,X,v>    : if (T in committedTrans)
                       // don't undo committed changes
                 else
                     { WRITE(X,v); OUTPUT(X) }
}   }
for each T in startedTrans {
    if (T in committedTrans) ignore
    else if (T in abortedTrans) ignore
    else write <ABORT T> to log
}
flush log
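The pseudocode above can be rendered as runnable Python (a sketch: the log is a list of tuples, oldest first, and a dict stands in for the disk):

```python
# UNDO-log recovery sketch: restore old values for every update record
# whose transaction did not commit, then log an abort for each such tx.

def undo_recover(log, data):
    committed, aborted, started = set(), set(), set()
    for rec in reversed(log):              # most recent to oldest
        if rec[0] == "COMMIT":
            committed.add(rec[1])
        elif rec[0] == "ABORT":
            aborted.add(rec[1])
        elif rec[0] == "START":
            started.add(rec[1])
        else:                              # update record (T, X, v)
            t, x, v = rec
            if t not in committed:
                data[x] = v                # WRITE(X,v); OUTPUT(X)
    new_records = [("ABORT", t) for t in started
                   if t not in committed and t not in aborted]
    return data, new_records               # new_records are then flushed to the log
```

E.g. after a crash between (9) and (11) of the earlier trace, recovering from the log [\<START T\>, \<T,A,8\>, \<T,B,5\>] restores the disk to {A:8, B:5} and writes \<ABORT T\>.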


... Undo Logging135/153

Recall example transaction and consider effects of system crash at the following points:

Before (9) ... disk "restored" (unchanged); <ABORT T> written

(9)-(11) ... disk restored to original state; <ABORT T> written

After (12) ... A and B left unchanged; T treated as committed

"Disk restored" means

WRITE(B,5); OUTPUT(B); WRITE(A,8); OUTPUT(A);


Checkpointing136/153

Previous view of recovery implied reading entire log file.

Since log file grows without bound, this is infeasible.

Eventually we can delete "old" section of log.

This point is called a checkpoint.


... Checkpointing137/153

Problem: many concurrent/overlapping transactions.

How to know that all have finished?

Simplistic approach

  1. stop accepting new transactions (block them)
  2. wait until all active tx's either commit or abort
    (and have written a <COMMIT T> or <ABORT T> on the log)
  3. flush the log to disk
  4. write new log record <CHKPT>, and flush log again
  5. resume accepting new transactions
Known as quiescent checkpointing.


... Checkpointing138/153

Obvious problem with quiescent checkpointing

Better strategy: nonquiescent checkpointing
  1. write log record <CHKPT (T1,..,Tk)>
    (contains references to all active transactions, from the active tx table)
  2. continue normal processing (e.g. new tx's can start)
  3. when all of T1,..,Tk have completed,
    write log record <ENDCHKPT> and flush log


... Checkpointing139/153

Recovery: scan backwards through log file processing as before.

Determining where to stop depends on ...

If we encounter <ENDCHKPT> first:

If we encounter <CHKPT (T1,..,Tk)> first:


Redo Logging140/153

Problem with UNDO logging:

Alternative approach is redo logging:


... Redo Logging141/153

Requirement for redo logging: write-ahead rule.

Data must be written to disk as follows:

  1. start transaction log record
  2. update log records indicating changes
  3. the commit log record
  4. the changed data elements themselves
Note that update log records now contain <T,X,v'>,
where v' is the new value for X.


... Redo Logging142/153

For the example transaction, we would get:


t    Action        v  B(A)  B(B)  D(A)  D(B)  Log
--------------------------------------------------------
(0)  BEGIN         .    .     .     8     5   <START T>
(1)  READ(A,v)     8    8     .     8     5
(2)  v = v*2      16    8     .     8     5
(3)  WRITE(A,v)   16   16     .     8     5   <T,A,16>
(4)  READ(B,v)     5   16     5     8     5
(5)  v = v+1       6   16     5     8     5
(6)  WRITE(B,v)    6   16     6     8     5   <T,B,6>
(7)  COMMIT                                   <COMMIT T>
(8)  FlushLog
(9)  OUTPUT(A)     6   16     6    16     5
(10) OUTPUT(B)     6   16     6    16     6

Note that T is regarded as committed as soon as (8) completes.


... Redo Logging143/153

Simplified view of recovery using REDO logging:

set committedTrans // e.g. from tx table
for each log record from oldest to most recent {
    switch (log record) {
    <COMMIT T> : ignore // know these already
    <ABORT T>  : add T to abortedTrans
    <START T>  : add T to startedTrans
    <T,X,v'>   : if (T in committedTrans)
                       { WRITE(X,v'); OUTPUT(X) }
                 else
                     // nothing to do, no change on disk
}   }
for each T in startedTrans {
    if (T in committedTrans) ignore
    else if (T in abortedTrans) ignore
    else write <ABORT T> to log
}
flush log
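As for undo logging, the pseudocode can be rendered as runnable Python (a sketch; update records (T, X, v') carry the new value):

```python
# REDO-log recovery sketch: scan oldest-first, rewriting the new value
# for every update record whose transaction committed.

def redo_recover(log, data):
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    aborted, started = set(), set()
    for rec in log:                        # oldest to most recent
        if rec[0] == "COMMIT":
            pass                           # already collected above
        elif rec[0] == "ABORT":
            aborted.add(rec[1])
        elif rec[0] == "START":
            started.add(rec[1])
        else:                              # update record (T, X, v')
            t, x, v_new = rec
            if t in committed:
                data[x] = v_new            # WRITE(X,v'); OUTPUT(X)
    new_records = [("ABORT", t) for t in started
                   if t not in committed and t not in aborted]
    return data, new_records
```

E.g. after a crash between (8) and (9) of the trace above, recovering from [\<START T\>, \<T,A,16\>, \<T,B,6\>, \<COMMIT T\>] redoes both changes, giving disk {A:16, B:6}.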


... Redo Logging144/153

Data required for REDO logging checkpoints:


... Redo Logging145/153

Checkpoints in REDO logging use (as before):

Steps in REDO log checkpointing
  1. write <CHKPT (T1,..,Tk)> to log and flush log
  2. output to disk all changed objects in the buffer pool
    which have not yet been written to disk and
    whose tx's had committed before <CHKPT...>
  3. write <ENDCHKPT> to log and flush log
Note that other tx's may continue between steps 2 and 3.


... Redo Logging146/153

Recovery with checkpointed REDO log.

As for UNDO logging, two cases ...

Last checkpoint record before crash is <ENDCHKPT>

Last checkpoint record before crash is <CHKPT (T1,..,Tk)>


Undo/Redo Logging147/153

UNDO logging and REDO logging are incompatible in their rules for when changed data may be written to disk.

Undo/Redo logging combines aspects of both. As for the previous cases, it requires write-ahead of log records.

Undo/redo logging is common in practice; e.g. the ARIES algorithm.


... Undo/Redo Logging148/153

Requirement for undo/redo logging: write-ahead.

Data must be written to disk as follows:

  1. start transaction log record
  2. update log records indicating changes
  3. the changed data elements themselves
Note that this ordering does not specify when the <COMMIT T> record is written.


... Undo/Redo Logging149/153

For the example transaction, we might get:


t    Action        v  B(A)  B(B)  D(A)  D(B)  Log
--------------------------------------------------------
(0)  BEGIN         .    .     .     8     5   <START T>
(1)  READ(A,v)     8    8     .     8     5
(2)  v = v*2      16    8     .     8     5
(3)  WRITE(A,v)   16   16     .     8     5   <T,A,8,16>
(4)  READ(B,v)     5   16     5     8     5
(5)  v = v+1       6   16     5     8     5
(6)  WRITE(B,v)    6   16     6     8     5   <T,B,5,6>
(7)  FlushLog
(8)  StartCommit
(9)  OUTPUT(A)     6   16     6    16     5
(10)                                          <COMMIT T>
(11) OUTPUT(B)     6   16     6    16     6

Note that T is regarded as committed as soon as (10) completes.


... Undo/Redo Logging150/153

Recovery using undo/redo logging:

  1. redo all committed tx's from earliest to latest
  2. undo all incomplete tx's from latest to earliest
Consider effects of system crash on example

Before (10) ... treat as incomplete; undo all changes

After (10) ... treat as complete; redo all changes


... Undo/Redo Logging151/153

Steps in UNDO/REDO log checkpointing

  1. write <CHKPT (T1,..,Tk)> to log and flush log
  2. output to disk all dirty memory buffers
  3. write <ENDCHKPT> to log and flush log
Note that other tx's may continue between steps 2 and 3.

A consequence of the above:

(If we allowed this, a buffer may contain changes from both committed and aborted tx's)


... Undo/Redo Logging152/153

The above description simplifies details of undo/redo logging.

ARIES is a complete algorithm for undo/redo logging.

Differences to what we have described:

(For more details consult any text or COMP9315 05s1 lecture notes)


Recovery in PostgreSQL153/153

PostgreSQL uses write-ahead undo/redo style logging.

However, it also uses multi-version concurrency control, which leaves old versions of tuples in place and reduces the need for undo.

Transaction/logging code is distributed throughout backend source.

Core transaction code is in src/backend/access/transam.


Produced: 7 Aug 2019