COMP9315 Week 10 Monday Lecture
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [0/63]
- Assignment 2
- due 11:59pm TONIGHT (220/490 submissions so far)
- learn to use GDB (for the exam)
- don't pipe output through sort if it's not working
- Help Sessions
- Tuesday 16 April, 3pm-4pm
- MyExperience
- open now ... please give feedback ... closes April 25
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [1/63]
To debug the program:
$ ./select R '?,?,shoes'
Run GDB as follows
$ gdb ./select
...
(gdb) run R '?,?,shoes'
(gdb) where
(gdb) print pid
(gdb) quit
GDB has many other commands, e.g. break (program stops at a given point)
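For example, a minimal breakpoint session (the file name, line number and variable name here are hypothetical):
$ gdb ./select
(gdb) break select.c:42
(gdb) run R '?,?,shoes'
(gdb) print tuple
(gdb) next
(gdb) continue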
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [2/63]
❖ Transactions (reminder)
A transaction is
- a single application-level operation
- implemented via multiple database operations
Transactions deal with three levels of storage
- data on disk, data in memory buffers, data in local variables
Transactions ...
- COMMIT: all changes are placed in persistent storage
- ABORT: none of the changes appear in persistent storage
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [3/63]
❖ Exercise: Transaction Internals
Consider the following transaction:
BEGIN
READ(A,v); v = v*2; WRITE(A,v);
READ(B,v); v = v+1; WRITE(B,v);
COMMIT
Show how the following items change values after each statement
- v ... value of local variable
- Buf(A) ... value of A stored in memory buffer
- Buf(B) ... value of B stored in memory buffer
- Disk(A) ... value of A stored on disk
- Disk(B) ... value of B stored on disk
What is the final state after the COMMIT completes?
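A possible trace, assuming initial values Disk(A)=8, Disk(B)=5 (the same values used in the logging examples later):

t    Action       v  B(A) B(B) D(A) D(B)
-----------------------------------------
(0)  BEGIN        .   .    .    8    5
(1)  READ(A,v)    8   8    .    8    5
(2)  v = v*2     16   8    .    8    5
(3)  WRITE(A,v)  16  16    .    8    5
(4)  READ(B,v)    5  16    5    8    5
(5)  v = v+1      6  16    5    8    5
(6)  WRITE(B,v)   6  16    6    8    5
(7)  COMMIT       6  16    6   16    6

After the COMMIT completes (i.e. after OUTPUT(A) and OUTPUT(B)), Disk(A)=16 and Disk(B)=6.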
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [4/63]
❖ Exercise: Transaction Failure
Consider the previous transaction.
(0) BEGIN
(1) READ(A,v);
(2) v = v*2;
(3) WRITE(A,v);
(4) READ(B,v);
(5) v = v+1;
(6) WRITE(B,v);
(7) COMMIT
(8) OUTPUT(A)
(9) OUTPUT(B)
What happens if ...
- the tx is aborted before (4)
- the tx is aborted before (7)
- the system crashes before (8)
- the system crashes before (9)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [5/63]
Three "styles" of logging
- undo ... removes changes by any uncommitted tx's
- redo ... repeats changes by any committed tx's
- undo/redo ... combines aspects of both
All approaches require:
- a sequential file of log records
- log records describe a change to a data item
- log records are written first
- actual changes to data are written later
Known as write-ahead logging (PostgreSQL uses WAL)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [6/63]
Simple form of logging which ensures atomicity.
Log file consists of a sequence of small records:
- <START T> ... transaction T begins
- <COMMIT T> ... transaction T completes successfully
- <ABORT T> ... transaction T fails (no changes)
- <T,X,v> ... transaction T changed value of X from v
Notes:
- we refer to <T,X,v> generically as <UPDATE> log records
- update log entry created for each WRITE (not OUTPUT)
- update log entry contains old value (new value is not recorded)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [7/63]
Data must be written to disk in the following order:
- <START T> transaction log record
- <UPDATE> log records indicating changes
- the changed data elements themselves
- <COMMIT T> log record
Note: sufficient to have <T,X,v> output before X, for each X
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [8/63]
For the example transaction, we would get:
t Action v B(A) B(B) D(A) D(B) Log
--------------------------------------------------------
(0) BEGIN . . . 8 5 <START T>
(1) READ(A,v) 8 8 . 8 5
(2) v = v*2 16 8 . 8 5
(3) WRITE(A,v) 16 16 . 8 5 <T,A,8>
(4) READ(B,v) 5 16 5 8 5
(5) v = v+1 6 16 5 8 5
(6) WRITE(B,v) 6 16 6 8 5 <T,B,5>
(7) FlushLog
(8) StartCommit
(9) OUTPUT(A) 6 16 6 16 5
(10) OUTPUT(B) 6 16 6 16 6
(11) EndCommit <COMMIT T>
(12) FlushLog
Note that T is not regarded as committed until (12) completes.
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [9/63]
Simplified view of recovery using UNDO logging:
- scan backwards through log
- if <COMMIT T>, mark T as committed
- if <T,X,v> and T not committed, set X to v on disk
- if <START T> and T not committed, add <ABORT T> to log
Assumes we scan entire log; use checkpoints to limit scan.
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [10/63]
Algorithmic view of recovery using UNDO logging:
committedTrans = abortedTrans = startedTrans = {}
for each log record from most recent to oldest {
switch (log record) {
<COMMIT T> : add T to committedTrans
<ABORT T> : add T to abortedTrans
<START T> : add T to startedTrans
<T,X,v>    : if (T in committedTrans) ignore
             else { WRITE(X,v); OUTPUT(X) }
} }
for each T in startedTrans {
if (T in committedTrans) ignore
else if (T in abortedTrans) ignore
else write <ABORT T> to log
}
flush log
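A self-contained C sketch of this backward-scan algorithm, assuming a hypothetical in-memory log, and printing the WRITE/OUTPUT actions instead of performing them (a sketch only, not how a real engine processes its log file):

#include <stdbool.h>
#include <stdio.h>

typedef enum { START, COMMIT, ABORT, UPDATE } RecType;
typedef struct { RecType type; int tx; char item; int oldVal; } LogRec;

#define MAXTX 100

int main(void) {
    /* example log, oldest record first: T1 commits, T2 does not */
    LogRec log[] = {
        {START,1,0,0}, {UPDATE,1,'A',8}, {START,2,0,0},
        {UPDATE,2,'B',5}, {COMMIT,1,0,0}
    };
    int n = sizeof(log) / sizeof(log[0]);
    bool committed[MAXTX] = {false}, aborted[MAXTX] = {false},
         started[MAXTX] = {false};

    /* scan backwards, most recent record first */
    for (int i = n - 1; i >= 0; i--) {
        switch (log[i].type) {
        case COMMIT: committed[log[i].tx] = true; break;
        case ABORT:  aborted[log[i].tx]  = true; break;
        case START:  started[log[i].tx]  = true; break;
        case UPDATE:
            if (!committed[log[i].tx])   /* undo the uncommitted change */
                printf("WRITE(%c,%d); OUTPUT(%c)\n",
                       log[i].item, log[i].oldVal, log[i].item);
            break;
        }
    }
    /* abort any tx that started but neither committed nor aborted */
    for (int t = 0; t < MAXTX; t++)
        if (started[t] && !committed[t] && !aborted[t])
            printf("write <ABORT T%d> to log\n", t);
    printf("flush log\n");
    return 0;
}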
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [11/63]
❖ Exercise: Recovery with UNDO Log
Show how the UNDO log would be used if the previous tx
- failed at (5)
- failed at (10)
- failed after (12)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [12/63]
Simple view of recovery implies reading entire log file.
Since log file grows without bound, this is infeasible.
Eventually we can delete "old" section of log.
- i.e. where all prior transactions have completed
This point is called a checkpoint.
- all of log prior to checkpoint can be ignored for recovery
As described, we need to wait for all active tx's to complete
- known as quiescent checkpointing
- tx's taking a long time to complete block the system
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [13/63]
Problem: many concurrent/overlapping transactions.
How to know that all have finished?
- periodically, write log record <CHKPT (T1,..,Tk)>
  (contains references to all active transactions ⇒ active tx table)
- continue normal processing (e.g. new tx's can start)
- when all of T1,..,Tk have completed, write log record <END CHKPT> and flush log
Note: tx manager maintains chkpt and active tx information
These kinds of checkpoints are often written as <START CHKPT ...>
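A possible log fragment using such a checkpoint (transaction names hypothetical):

<START T1>
<T1,A,5>
<START T2>
<CHKPT (T1,T2)>   ... T1,T2 active when checkpoint starts
<T2,B,10>
<START T3>        ... new tx's may start during the checkpoint
<COMMIT T1>
<T3,C,15>
<COMMIT T2>
<END CHKPT>       ... written once T1 and T2 have completed

If a crash occurs after the <END CHKPT>, the backward recovery scan can stop at <CHKPT (T1,T2)>; if it occurs before, the scan may need to continue past it.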
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [14/63]
Recovery: scan backwards through log file processing as before.
Determining where to stop depends on whether we meet <END CHKPT> or <CHKPT...> first.
If we encounter <END CHKPT> first:
- we know that all incomplete tx's come after the previous <CHKPT...>
- thus, can stop backward scan when we reach <CHKPT...>
If we encounter <CHKPT (T1,..,Tk)> first:
- crash occurred during the checkpoint period
- any of T1,..,Tk that committed before crash are ok
- for uncommitted tx's, need to continue backward scan
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [15/63]
Problem with UNDO logging:
- all changed data must be output to disk before committing
- conflicts with optimal use of the buffer pool
Alternative approach is redo logging:
- allow changes to remain only in buffers after commit
- write records to indicate what changes are "pending"
- after a crash, can apply changes during recovery
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [16/63]
Requirement for redo logging: write-ahead rule.
Data must be written to disk as follows:
- write <START T> log record
- write <UPDATE...> log records indicating changes
- then write <COMMIT T> log record (and flush log)
- then OUTPUT changed data elements themselves
Note that update log records now contain <T,X,v'>, where v' is the new value for X.
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [17/63]
For the example transaction, we would get:
t Action v B(A) B(B) D(A) D(B) Log
--------------------------------------------------------
(0) BEGIN . . . 8 5 <START T>
(1) READ(A,v) 8 8 . 8 5
(2) v = v*2 16 8 . 8 5
(3) WRITE(A,v) 16 16 . 8 5 <T,A,16>
(4) READ(B,v) 5 16 5 8 5
(5) v = v+1 6 16 5 8 5
(6) WRITE(B,v) 6 16 6 8 5 <T,B,6>
(7) COMMIT <COMMIT T>
(8) FlushLog
(9) OUTPUT(A) 6 16 6 16 5
(10) OUTPUT(B) 6 16 6 16 6
Note that T is regarded as committed as soon as (8) completes.
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [18/63]
Simplified view of recovery using REDO logging:
- identify all committed tx's (backwards scan)
- scan forwards through log
- if <T,X,v> and T is committed, set X to v on disk
- if <START T> and T not committed, add <ABORT T> to log
Use checkpoints (like UNDO log checkpoints) to limit scan.
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [19/63]
UNDO logging and REDO logging are incompatible in
- order of outputting <COMMIT T> and changed data
- how data in buffers is handled during checkpoints
Undo/Redo logging combines aspects of both
- requires a new kind of update log record
  <T,X,v,v'> gives both old and new values for X
- removes incompatibilities between output orders
As for previous cases, requires write-ahead of log records.
Undo/redo logging is common in practice (e.g. the ARIES algorithm).
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [20/63]
❖ Undo/Redo Logging (cont)
For the example transaction, we might get:
t Action v B(A) B(B) D(A) D(B) Log
--------------------------------------------------------
(0) BEGIN . . . 8 5 <START T>
(1) READ(A,v) 8 8 . 8 5
(2) v = v*2 16 8 . 8 5
(3) WRITE(A,v) 16 16 . 8 5 <T,A,8,16>
(4) READ(B,v) 5 16 5 8 5
(5) v = v+1 6 16 5 8 5
(6) WRITE(B,v) 6 16 6 8 5 <T,B,5,6>
(7) FlushLog
(8) StartCommit
(9) OUTPUT(A) 6 16 6 16 5
(10) <COMMIT T>
(11) OUTPUT(B) 6 16 6 16 6
Note that T is regarded as committed as soon as (10) completes.
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [21/63]
❖ Undo/Redo Logging (cont)
Simplified view of recovery using UNDO/REDO logging:
- scan log to determine committed/uncommitted txs
- for each uncommitted tx T, add <ABORT T> to log
- scan backwards through log
  - if <T,X,v,w> and T is not committed, set X to v on disk
- scan forwards through log
  - if <T,X,v,w> and T is committed, set X to w on disk
Note: undo/redo logging requires dirty buffers to be flushed at <CHKPT...>
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [22/63]
❖ Undo/Redo Logging (cont)
The above description simplifies details of undo/redo logging.
ARIES is a complete algorithm for undo/redo logging.
Differences to what we have described:
- log records contain a sequence number (LSN)
- LSNs used in tx and buffer managers, and stored in data pages
- additional log record to mark <END> (of commit or abort)
- <CHKPT> contains only a timestamp
- <END CHKPT..> contains tx and dirty page info
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [23/63]
PostgreSQL uses write-ahead undo/redo style logging.
It also uses multi-version concurrency control (MVCC), which
- tags each record with a tx and update timestamp
MVCC simplifies some aspects of undo/redo, e.g.
- some info required by logging is already held in each tuple
- no need to undo effects of aborted tx's; use older version
Recall: WAL entries in postgresql.conf
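Some of the WAL-related settings (a sketch; the parameters are real, the values shown are just illustrative defaults):

wal_level = replica          # how much information is written to the WAL
fsync = on                   # force WAL writes out to disk
synchronous_commit = on      # wait for WAL flush before reporting commit success
wal_buffers = -1             # WAL buffer size (-1 = derive from shared_buffers)
checkpoint_timeout = 5min    # maximum time between automatic checkpoints
max_wal_size = 1GB           # WAL volume that triggers a checkpoint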
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [24/63]
❖ Recovery in PostgreSQL (cont)
Transaction/logging code is distributed throughout the backend.
Core transaction code is in src/backend/access/transam.
Transaction/logging data is written to files in PGDATA/pg_xlog (renamed to pg_wal in PostgreSQL 10)
- a number of very large files containing log records
- old files are removed once all txs noted there are completed
- new files added when existing files reach their capacity (16MB)
- number of tx log files varies depending on tx activity
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [25/63]
❖ Database Trends (overview)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [26/63]
Core "database" goals:
- deal with very large amounts of data (petabytes, exabytes, ...)
- very-high-level languages (deal with data in uniform ways)
- fast query execution (evaluation too slow ⇒ useless)
At the moment (and for the last 30 years) RDBMSs dominate ...
- simple/clean data model, backed up by theory
- high-level language for accessing data
- ~50 years development work on RDBMS engine technology
RDBMSs work well in domains with uniform, structured data.
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [27/63]
❖ Future of Database (cont)
Limitations/pitfalls of classical RDBMSs:
- NULL is ambiguous: unknown, not applicable, not supplied
- "limited" support for constraints/integrity and rules
- no support for uncertainty (data represents the state-of-the-world)
- data model too simple (e.g. no direct support for complex objects)
- query model too rigid (e.g. no approximate matching)
- continually changing data sources not well-handled
- data must be "molded" to fit a single rigid schema
- database systems must be manually "tuned"
- do not scale well to some data sets (e.g. Google, telcos)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [28/63]
❖ Future of Database (cont)
How to overcome (some) RDBMS limitations?
Extend the relational model ...
- add new data types and query ops for new applications
- deal with uncertainty/inaccuracy/approximation in data
Replace the relational model ...
- object-oriented DBMS ... OO programming with persistent objects
- XML DBMS ... all data stored as XML documents, new query model
- noSQL data stores (e.g. (key,value) pairs, JSON or RDF)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [29/63]
❖ Future of Database (cont)
How to overcome (some) RDBMS limitations?
Performance ...
- new query algorithms/data-structures for new types of queries
- parallel processing
- DBMSs that "tune" themselves
Scalability ...
- distribute data across (more and more) nodes
- techniques for handling streams of incoming data
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [30/63]
❖ Future of Database (cont)
An overview of the possibilities:
- "classical" RDBMS (e.g. PostgreSQL, Oracle, SQLite)
- parallel DBMS (e.g. XPRS)
- distributed DBMS (e.g. Cohera)
- deductive databases (e.g. Datalog)
- temporal databases (e.g. MariaDB)
- column stores (e.g. Vertica, Druid)
- object-oriented DBMS (e.g. ObjectStore)
- key-value stores (e.g. Redis, DynamoDB)
- wide column stores (e.g. Cassandra, Scylla, HBase)
- graph databases (e.g. Neo4J, Datastax)
- document stores (e.g. MongoDB, Couchbase)
- search engines (e.g. Google, Solr)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [31/63]
❖ Future of Database (cont)
Historical perspective
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [32/63]
Some modern applications have massive data sets (e.g. Google)
- far too large to store on a single machine/RDBMS
- query demands far too high even if could store in DBMS
Approach to dealing with such data
- distribute data over large collection of nodes (also, redundancy)
- provide computational mechanisms for distributing computation
Often this data does not need full relational selection
- represent data via (key,value) pairs
- unique keys can be used for addressing data
- values can be large objects (e.g. web pages, images, ...)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [33/63]
Popular computational approach to such data: map/reduce
- suitable for widely-distributed, very-large data
- allows parallel computation on such data to be easily specified
- distribute (map) parts of computation across network
- compute in parallel (possibly with further mapping)
- merge (reduce) multiple results for delivery to requestor
Some large data proponents see no future need for SQL/relational ...
- depends on application (e.g. hard integrity vs eventual consistency)
Humour: Parody of noSQL fans (strong language warning)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [34/63]
DBMSs generally do precise matching (although LIKE/regexps allow limited approximate matching)
Information retrieval systems do approximate matching.
E.g. documents containing a set of keywords (Google, etc.)
Also introduces notion of "quality" of matching
(e.g. tuple T1 is a better match than tuple T2)
Quality also implies ranking of results.
Ongoing research in incorporating IR ideas into DBMS context.
Goal: support database exploration better.
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [35/63]
Data which does not fit the "tabular model":
- image, video, music, text, ... (and combinations of these)
Research problems:
- how to specify queries on such data? (image1 ≅ image2)
- how to "display" results? (synchronize components)
Solutions to the first problem typically:
- extend notions of "matching"/indexes for querying
- require sophisticated methods for capturing data features
Sample query: find other songs like this one?
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [36/63]
Multimedia/IR introduces approximate matching.
In some contexts, we have approximate/uncertain data.
E.g. witness statements in a crime-fighting database
"I think the getaway car was red ... or maybe orange ..."
"I am 75% sure that John carried out the crime"
Work by Jennifer Widom at Stanford on the Trio system
- extends the relational model (ULDB)
- extends the query language (TriQL)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [37/63]
❖ Stream Data Management Systems
Makes one addition to the relational model
- stream = infinite sequence of tuples, arriving one-at-a-time
Applications: news feeds, telecomms, monitoring web usage, ...
RDBMSs: run a variety of queries on (relatively) fixed data
StreamDBs: run fixed queries on changing data (stream)
One approach: window = "relation" formed from a stream via a rule
E.g. StreamSQL
select avg(price)
from examplestream [size 10 advance 1 tuples]
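Here [size 10 advance 1 tuples] defines a sliding window over the 10 most recent tuples, advancing by one tuple at a time, so the query reports a continuously updated moving average (exact window syntax varies between stream systems).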
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [38/63]
Uses graphs rather than tables as basic data structure tool.
Applications: social networks, ecommerce purchases, interests, ...
Many real-world problems are modelled naturally by graphs
- can be represented in RDBMSs, but not processed efficiently
- e.g. recursive queries on Nodes, Properties, Edges tables
Graph data models: flexible, "schema-free", inter-linked
Typical modeling formalisms: XML, JSON, RDF
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [39/63]
Characteristics of dispersed databases:
- very large numbers of small processing nodes
- data is distributed/shared among nodes
Applications: environmental monitoring devices, "intelligent dust", ...
Research issues:
- query/search strategies (how to organise query processing)
- distribution of data (trade-off between centralised and diffused)
Less extreme versions of this already exist:
- grid and cloud computing
- database management for mobile devices
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [40/63]
❖ Parallelism in Databases
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [41/63]
RDBMS discussion so far has revolved around systems
- with a single or small number of processors
- accessing a single memory space
- getting data from one or more disk devices
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [42/63]
Why parallelism? ... Throughput!
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [43/63]
DBMSs are a success story in application of parallelism
- can process many data elements (tuples) at the same time
- can create pipelines of query evaluation steps
- don't require special hardware
- can hide parallelism within the query evaluator
- application programmers don't need to change habits
Compare this with effort to do parallel programming.
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [44/63]
Types: shared memory, shared disk, shared nothing
Example shared-nothing architecture:
Typically same room/LAN (data transfer cost ~ 100's of μsecs .. msecs)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [45/63]
❖ Distributed Architectures
Distributed architectures are ...
- effectively shared-nothing, on a global-scale network
Typically on the Internet (data transfer cost ~ secs)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [46/63]
❖ Parallel Databases (PDBs)
Parallel databases provide various forms of parallelism ...
- process parallelism can speed up query evaluation
- processor parallelism can assist in speeding up memory ops
- processor parallelism introduces cache coherence issues
- disk parallelism can assist in overcoming latency
- disk parallelism can be used to improve fault-tolerance (RAID)
- one limiting factor is congestion on communication bus
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [47/63]
❖ Parallel Databases (PDBs) (cont)
Types of parallelism
- pipeline parallelism
- multi-step process, each processor handles one step
- run in parallel and pipeline result from one to another
- partition parallelism
- many processors running in parallel
- each performs same task on a subset of the data
- results from processors need to be merged
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [48/63]
Assume that each table/relation consists of pages in a file
Can distribute data across multiple storage devices
- duplicate all pages from a relation (replication)
- store some pages on one store, some on others (partitioning)
Replication example:
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [49/63]
❖ Data Storage in PDBs (cont)
Data-partitioning example:
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [50/63]
❖ Data Storage in PDBs (cont)
A table is a collection of pages (aka blocks).
Page addressing on single processor/disk: (Table, File, Page)
- Table maps to a set of files (e.g. named by tableID)
- File distinguishes primary/overflow files
- Page maps to an offset in a specific file
If there are multiple nodes, then addressing depends on how the data is distributed
- partitioned: (Node, Table, File, Page)
- replicated: ({Nodes}, Table, File, Page)
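As a sketch, the partitioned-case address could be represented as a struct (names hypothetical):

/* hypothetical page address for partitioned data: (Node, Table, File, Page) */
typedef struct {
    int node;   /* node holding the page */
    int table;  /* table ID (maps to a set of files) */
    int file;   /* which file (e.g. primary vs overflow) */
    int page;   /* page offset within the chosen file */
} PageAddr;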
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [51/63]
❖ Data Storage in PDBs (cont)
Assume that partitioning is based on one attribute
Data-partitioning strategies for one table on n nodes:
- round-robin, hash-based, range-based
Round-robin partitioning
- cycle through nodes, new tuple added on "next" node
- e.g. the i-th tuple is placed on the (i mod n)-th node
- balances load on nodes; no help for querying
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [52/63]
❖ Data Storage in PDBs (cont)
Hash partitioning
- use hash value to determine which node and page
- e.g. i = hash(tuple), so the tuple is placed on the i-th node
- helpful for equality-based queries on hashing attribute
Range partitioning
- ranges of attr values are assigned to processors
- e.g. values 1-10 on node 0, 11-20 on node 1, ..., 91-100 on node n-1
- potentially helpful for range-based queries
In both cases, data skew may lead to unbalanced load
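A minimal C sketch of the three strategies, assuming a hypothetical Tuple type and hash function (not PostgreSQL code):

#include <stdint.h>

typedef struct { uint32_t key; } Tuple;   /* key = partitioning attribute */

uint32_t hash(uint32_t k) { return k * 2654435761u; }  /* simple multiplicative hash */

/* round-robin: i-th inserted tuple goes to node (i mod n) */
int rrNode(uint64_t i, int n) { return (int)(i % n); }

/* hash partitioning: node chosen by hash of the partitioning attribute */
int hashNode(Tuple *t, int n) { return (int)(hash(t->key) % (uint32_t)n); }

/* range partitioning: splits[] holds n-1 upper bounds, e.g. {10,20,...,90}
   gives values 1-10 -> node 0, ..., 91-100 -> node n-1 */
int rangeNode(Tuple *t, uint32_t splits[], int n) {
    for (int j = 0; j < n - 1; j++)
        if (t->key <= splits[j]) return j;
    return n - 1;
}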
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [53/63]
❖ Parallelism in DB Operations
Different types of parallelism in DBMS operations
- intra-operator parallelism
- get all machines working to compute a given operation
(scan, sort, join)
- inter-operator parallelism
- each operator runs concurrently on a different processor
(exploits pipelining)
- inter-query parallelism
- different queries run on different processors
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [54/63]
❖ Parallelism in DB Operations (cont)
Parallel scanning
- scan partitions in parallel and merge results
- maybe ignore some partitions
(e.g. range and hash partitioning)
- can build indexes on each partition
Effectiveness depends on query type vs partitioning type
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [55/63]
❖ Parallelism in DB Operations (cont)
Parallel sorting
- scan in parallel, range-partition during scan
- pipeline into local sort on each processor
- merge sorted partitions
Potential problem:
- data skew because of unfortunate choice of partition points
- resolve by initial data sampling to determine partitions
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [56/63]
❖ Parallelism in DB Operations (cont)
Parallel sort:
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [57/63]
❖ Parallelism in DB Operations (cont)
Parallel nested loop join
- each outer tuple needs to examine each inner tuple
- but only if it could potentially join
- range/hash partitioning reduce partitions to consider
Parallel sort-merge join
- as noted above, parallel sort gives range partitioning
- merging partitioned tables has no parallelism (but is fast)
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [58/63]
❖ Parallelism in DB Operations (cont)
Parallel hash join
- distribute partitions to different processors
- partition 0 of R goes to same node as partition 0 of S
- join phase can be done in parallel on each processor
- then results need to be merged
- very effective for equijoin
Fragment-and-replicate join
- outer relation R is partitioned (using any partition scheme)
- inner relation S is copied to all nodes
- each node computes join with R partition and S
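A toy single-process C sketch of the partitioning phase of a parallel hash join, using small in-memory key arrays (a real PDB would ship tuples across the network and hash-join locally on each node):

#include <stdio.h>

#define NNODES 4

/* both relations use the same function on the join attribute,
   so matching R and S tuples always land on the same node */
int joinNode(int joinAttr) { return (unsigned)joinAttr % NNODES; }

int main(void) {
    int Rkeys[] = {1, 5, 8, 12};
    int Skeys[] = {5, 8, 8, 20};
    for (int i = 0; i < 4; i++)
        printf("R tuple with key %d -> node %d\n", Rkeys[i], joinNode(Rkeys[i]));
    for (int i = 0; i < 4; i++)
        printf("S tuple with key %d -> node %d\n", Skeys[i], joinNode(Skeys[i]));
    /* join phase (not shown): each node hash-joins its local R and S
       partitions in parallel; per-node results are then merged */
    return 0;
}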
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [59/63]
❖ Parallelism in DB Operations (cont)
Parallel hash join:
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [60/63]
❖ PostgreSQL and Parallelism
PostgreSQL assumes
- shared memory space accessible to all back-ends
- files for one table are located on one disk
PostgreSQL allows
- data to be distributed across multiple disk devices
So could run on ...
- shared-memory, shared-disk architectures
- hierarchical architectures with distributed virtual memory
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [61/63]
❖ PostgreSQL and Parallelism (cont)
PostgreSQL can provide
- multiple servers running on separate nodes
- application #1: high availability
- "standby" server takes over if primary server fails
- application #2: load balancing
- several servers can be used to provide same data
- direct queries to least loaded server
Both need data synchronisation between servers.
PostgreSQL uses notion of master and slave servers.
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [62/63]
❖ PostgreSQL and Parallelism (cont)
High availability ...
- updates occur on master, recorded in tx log
- tx logs shipped/streamed from master to slave(s)
- slave uses tx logs to maintain current state
- configuration controls frequency of log shipping
- bringing slave up-to-date is "fast" (~1-2secs)
Note: small window exists for data loss (committed tx log records not sent)
Load balancing ...
- not provided built-in to PostgreSQL, 3rd-party tools exist
COMP9315 24T1 ♢ Week 10 Monday Lecture ♢ [63/63]
Produced: 15 Apr 2024