❖ Week 09 Tuesday |
❖ Mapping SQL to RA |
Mapping SQL to relational algebra, e.g.
-- schema: R(a,b) S(c,d)
select a as x
from R join S on (b=c)
where d=100
-- or, equivalently --
select a as x
from R, S
where b=c and d=100
In general:
SELECT clause → projection, FROM clause → join/product,
WHERE clause → selection, AS → renaming
❖ Mapping SQL to RA (cont) |
Example mapping using above:
select a as x
from R join S on (b=c)
where d=100
-- could be mapped to
Tmp1(a,b,c,d) = R Join[b=c] S
Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
Tmp3(a) = Proj[a](Tmp2)
Res(x) = Rename[Res(x)](Tmp3)
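The four-step plan above can be sketched directly in Python, with each relation a list of dicts (the toy data below is an assumption for illustration, not from the slides):

```python
# Toy instances of R(a,b) and S(c,d) -- assumed values
R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
S = [{"c": 10, "d": 100}, {"c": 20, "d": 999}]

# Tmp1(a,b,c,d) = R Join[b=c] S   (theta-join as a filtered product)
tmp1 = [{**r, **s} for r in R for s in S if r["b"] == s["c"]]

# Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
tmp2 = [t for t in tmp1 if t["d"] == 100]

# Tmp3(a) = Proj[a](Tmp2)
tmp3 = [{"a": t["a"]} for t in tmp2]

# Res(x) = Rename[Res(x)](Tmp3)
res = [{"x": t["a"]} for t in tmp3]

print(res)  # -> [{'x': 1}]
```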
❖ Exercise: A Better SQL to RA Mapping |
On the previous slide, we translated this SQL query as follows:
select a as x
from R join S on (b=c)
where d=100
-- could be mapped to
Tmp1(a,b,c,d) = R Join[b=c] S
Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
Tmp3(a) = Proj[a](Tmp2)
Res(x) = Rename[Res(x)](Tmp3)
Suggest a better approach (where "better" = "lower cost")
Assumptions: |R| = 2000, |S| = 1000, |R⋈S| = 1000, |Res| = 100
❖ Another Mapping Example |
The query: Courses with more than 100 students in them?
Can be expressed in SQL as
select s.id, s.code
from   Course c, Subject s, Enrolment e
where  c.id = e.course and c.subject = s.id
group  by s.id, s.code
having count(*) > 100;
and might be compiled to
Result =
Project[id,code](
GroupSelect[size>100] (
GroupBy[id,code] (
Subject Join[s.id=c.subject]
(Course Join[c.id=e.course] Enrolment)
) ) )
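The GroupBy/GroupSelect/Project steps above can be sketched in Python. The joined rows and the lowered threshold (2 rather than 100) are assumptions made up to keep the example small:

```python
from collections import defaultdict

# Assumed output of the two joins: one (subject id, code) row per enrolment
joined = [(1, "COMP3311"), (1, "COMP3311"), (1, "COMP3311"),
          (2, "MATH1081")]

# GroupBy[id,code]: partition rows on the grouping attributes
groups = defaultdict(list)
for row in joined:
    groups[row].append(row)

# GroupSelect[size>2]: keep only groups above the size threshold
big = {k: v for k, v in groups.items() if len(v) > 2}

# Project[id,code]: one output tuple per surviving group
result = sorted(big.keys())
print(result)  # -> [(1, 'COMP3311')]
```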
❖ Exercise: Relational Operation Plan |
So far, we have been expressing RA evaluation as follows:
Tmp(a,b,c,...) = Op[x](R) or R Op[x] S
Each statement involves a single relational algebra operation.
Render the RA from the previous slide in this form.
Result =
Project[id,code](
GroupSelect[size>100] (
GroupBy[id,code] (
Subject Join[s.id=c.subject]
(Course Join[c.id=e.course] Enrolment)
) ) )
❖ Implementations of RA Ops |
Sorting: standard in-memory algorithms (quicksort, etc.) are not applicable to relations larger than memory; DBMSs use external merge sort instead.
❖ Query Cost Estimation |
The cost of evaluating a query is determined by the sizes of the relations involved, the algorithms chosen to implement each RA operation, and the amount of data transferred between memory and disk.
❖ Query Cost Estimation (cont) |
An execution plan is a sequence of concrete relational operations.
Consider execution plans for: σc (R ⋈d S ⋈e T)
tmp1 := hash_join[d](R,S)
tmp2 := sort_merge_join[e](tmp1,T)
result := binary_search[c](tmp2)
or
tmp1 := btree_search[c](R)
tmp2 := hash_join[d](tmp1,S)
result := sort_merge_join[e](tmp2)
Both produce same result, but have different costs.
❖ Exercise: Query Cost Estimation |
Estimate the cost of each plan, where |R|=100, |S|=1000, |T|=10000
tmp1 := hash_join[d](R,S)
tmp2 := sort_merge_join[e](tmp1,T)
result := binary_search[c](tmp2)
or
tmp1 := sort_merge_join[e](S,T)
tmp2 := hash_join[d](R,tmp1)
result := linear_search[c](tmp2)
or
tmp1 := btree_search[c](R)
tmp2 := hash_join[d](tmp1,S)
result := sort_merge_join[e](tmp2)
❖ Query Optimisation |
A DBMS query optimiser works as follows:
Input: relational algebra expression
Output: execution plan (sequence of RA ops)
bestCost = INF; bestPlan = none
while (more possible plans) {
plan = produce a new evaluation plan
cost = estimated_cost(plan)
if (cost < bestCost) {
bestCost = cost; bestPlan = plan
}
}
return bestPlan
Typically, there are very, very many possible plans
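The generate-and-test loop above can be made runnable. The plan space and cost model below are assumptions invented for illustration: a "plan" is just a join order over three relations, and the cost model charges the size of each intermediate result, assuming |A join B| = |A|*|B| / 1000:

```python
import itertools

# Assumed relation sizes (made-up values)
sizes = {"R": 100, "S": 1000, "T": 10000}

def estimated_cost(order):
    # toy cost model: sum of intermediate result sizes, joining left to right
    cost, inter = 0, sizes[order[0]]
    for rel in order[1:]:
        inter = inter * sizes[rel] // 1000   # size of next intermediate
        cost += inter
    return cost

best_cost, best_plan = float("inf"), None
for plan in itertools.permutations(sizes):   # "while (more possible plans)"
    cost = estimated_cost(plan)
    if cost < best_cost:
        best_cost, best_plan = cost, plan

print(best_plan, best_cost)  # joining the two smaller relations first wins
```

Even in this toy setting the plan space grows factorially with the number of relations, which is why real optimisers prune it heuristically.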
❖ DB Application Performance |
In order to make DB applications efficient, it is useful to know:
- the operation mix (which queries, updates, inserts and how frequently each one is performed)
- how the DBMS executes queries (data structures and algorithms for select, project, join, sort, ...)
- how query cost is measured (in terms of the amount of data transferred between memory and disk)
Efficiency is achieved by using indexes and by avoiding certain SQL query structures.
❖ DB Application Performance (cont) |
Application programmer choices that affect query cost: how queries are expressed, and which indexes are defined on the tables involved.
❖ DB Application Performance (cont) |
Whatever you do as a DB application programmer, the DBMS converts your query into an execution plan itself; you have no direct control over the optimisation process.
❖ DB Application Performance (cont) |
Example: query to find sales people earning more than $50K
select name from Employee
where salary > 50000 and
empid in (select empid from WorksIn
where dept = 'Sales')
A query evaluator would have to use the strategy
SalesEmps = (select empid from WorksIn where dept='Sales')
foreach e in Employee {
if (e.empid in SalesEmps && e.salary > 50000)
add e to result set
}
Needs to examine all employees, even if not in Sales
This is not a good expression of the query.
❖ DB Application Performance (cont) |
A different expression of the same query:
select name
from Employee join WorksIn using (empid)
where Employee.salary > 50000 and
WorksIn.dept = 'Sales'
Query evaluator might use the strategy
SalesEmps = (select * from WorksIn where dept='Sales')
foreach e in (Employee join SalesEmps) {
if (e.salary > 50000)
add e to result set
}
Only examines Sales employees, and uses a simpler test
This is a good expression of the query.
❖ DB Application Performance (cont) |
A very poor expression of the query (correlated subquery):
select name from Employee e
where salary > 50000 and
'Sales' in (select dept from WorksIn where empid=e.empid)
A query evaluator would be forced to use the strategy:
foreach e in Employee {
Depts = (select dept from WorksIn where empid=e.empid)
if ('Sales' in Depts && e.salary > 50000)
add e to result set
}
Needs to run a query for every employee ...
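The cost difference between the correlated and uncorrelated forms can be sketched by counting how often the inner query runs (the employee data below is an assumed toy set):

```python
# Assumed toy tables: 10 employees, even-numbered ones work in Sales
employees = [{"empid": i, "name": f"E{i}", "salary": 40000 + i * 2000}
             for i in range(10)]
works_in = [{"empid": i, "dept": "Sales" if i % 2 == 0 else "HR"}
            for i in range(10)]

inner_queries = 0

def depts_of(empid):
    # the correlated subquery: re-evaluated once per employee
    global inner_queries
    inner_queries += 1
    return {w["dept"] for w in works_in if w["empid"] == empid}

# correlated strategy: one inner query per employee
result = [e["name"] for e in employees
          if "Sales" in depts_of(e["empid"]) and e["salary"] > 50000]

# uncorrelated strategy: one inner query total, then a single scan
sales_emps = {w["empid"] for w in works_in if w["dept"] == "Sales"}
result2 = [e["name"] for e in employees
           if e["empid"] in sales_emps and e["salary"] > 50000]

print(inner_queries, result == result2)  # 10 inner queries vs 1
```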
❖ Exercise: Comparing Query Costs |
Measure the time costs of the following queries:
select count(*) from Movies where substring(title,1,1) = 'D';
select count(*) from Movies where title like 'D%';
select count(*) from Movies where title ilike 'd%';
select count(*) from Movies where title ~ '^D.*';
select count(*) from Movies where title ~* '^d.*';
select count(*) from Movies where title = 'District 9';
select count(*) from Movies where id = 111113;
select max(id) from Movies;
select max(title) from Movies;
select count(*)
from Movies m join Principals p on (m.id = p.movie)
join People n on (n.id = p.person)
where p.ord = 1;
❖ PostgreSQL Query Costs |
PostgreSQL provides the explain
EXPLAIN [ANALYZE] Query
Without ANALYZE, EXPLAIN shows the query plan and its estimated costs, but does not execute the query.
With ANALYZE, EXPLAIN also executes the query and reports actual times and row counts.
Note that runtimes may show considerable variation due to buffering.
If you simply want to know whether the runtime is acceptable, psql's \timing may be enough.
❖ Indexes |
Indexes provide efficient content-based access to tuples.
Can build indexes on any (combination of) attributes.
Defining indexes:
CREATE INDEX name ON table ( attr1, attr2, ... )
attri can be an arbitrary expression (e.g. upper(name))
CREATE INDEX can be qualified with UNIQUE (attribute values are distinct) and USING method (e.g. btree, hash)
❖ Indexes (cont) |
Indexes can significantly improve query costs.
Considerations in applying indexes:
-- use hashing for (unique) attributes in equality tests, e.g.
select * from Employee where id = 12345
-- use B-tree for attributes in range tests, e.g.
select * from Employee where age > 60
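The distinction can be sketched in Python, where a dict plays the role of a hash index (equality lookups only) and a sorted list with bisect approximates a B-tree's ordered access (range scans). All data values below are assumptions:

```python
import bisect

# Assumed toy table: (id, age) pairs
employees = [(12345, 25), (20000, 61), (30303, 70), (40404, 59)]

# "hash index" on id: constant-time equality lookup
by_id = {eid: (eid, age) for eid, age in employees}
hit = by_id[12345]

# "B-tree index" on age: keep entries sorted, then range-scan age > 60
by_age = sorted((age, eid) for eid, age in employees)
ages = [a for a, _ in by_age]
start = bisect.bisect_right(ages, 60)   # first entry with age > 60
over_60 = [eid for _, eid in by_age[start:]]

print(hit, over_60)
```

A hash structure cannot answer the range query without scanning everything, which is why the index method should match the query type.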
❖ Exercise: Indexes and Query Types |
Consider a table Students(id,name,age,...)
If we had the following mix of queries, which indexes would be worth creating?

Query                                            | Freq
-------------------------------------------------+------
select * from Students where id = zID            |  70%
select * from Students where name = Name         |  10%
select * from Students where name like '%str%'   |  15%
select * from Students where age > Age           |   5%
❖ Query Tuning |
Sometimes, a query can be re-phrased to affect performance:
select name from Employee where salary/365 > 100
-- fix by re-phrasing condition to (salary > 36500)
select name from Employee where name like '%ith%'
select name from Employee where birthday is null
-- above two are difficult to "fix"
select name from Employee
where dept in (select id from Dept where ...)
-- fix by using Employee join Dept on (e.dept=d.id)
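The first rewrite above can be sanity-checked in Python (toy salaries, assumed values; note Python's / is true division, unlike SQL's integer division on integer columns). The point of the rewrite is that salary > 36500 compares the stored column directly, so an index on salary can be used, whereas salary/365 > 100 applies an expression to every row:

```python
# Assumed toy salary values, including boundary cases
salaries = [36400, 36500, 36501, 50000, 100]

slow = [s for s in salaries if s / 365 > 100]   # expression on the column
fast = [s for s in salaries if s > 36500]       # index-usable comparison

print(slow == fast)  # both predicates select the same rows
```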
❖ Query Tuning (cont) |
Other tricks in query tuning (effectiveness is DBMS-dependent)
avoid select distinct unless duplicates must be eliminated (distinct generally forces a sort)
prefer joins on single numeric attributes to joins on strings, e.g.
select ... Employee e join Customer c on (e.name = c.name)
vs
select ... Employee e join Customer c on (e.ssn = c.ssn)
an or condition may prevent index use; a union of selects can be faster, e.g.
select name from Employee where Dept=1 or Dept=2
vs
(select name from Employee where Dept=1)
union
(select name from Employee where Dept=2)
❖ Exercise: More Query Costs |
Check whether the optimisations mentioned above actually work
-- movies longer than 3 hours
select title from Movies where runtime/60 > 3
-- or --
select title from Movies where runtime > 180

-- people who were a principal in some movie
select name from People where id in (select person from Principals)
-- or --
select p.name from People p
where exists (select * from Principals where person=p.id)
-- or --
select n.name from People n join Principals p on (n.id=p.person)
❖ Exercise: More Query Costs (cont) |
Check whether the optimisations mentioned above actually work
-- movies made in Albania or Iceland
select title from Movies where origin = 'AL' or origin = 'IS'
-- or --
(select title from Movies where origin = 'AL')
union
(select title from Movies where origin = 'IS')
-- or --
select title from Movies where origin in ('AL','IS')
See how the above queries are affected by adding an index on Movies.origin
❖ EXPLAIN Examples |
Note that PostgreSQL builds a query evaluation tree, rather than a linear plan; the indented EXPLAIN output on the following slides reflects this tree structure.
❖ EXPLAIN Examples (cont) |
Example: Select on indexed attribute
db=# explain select * from Students where id=100250;
QUERY PLAN
-----------------------------------------------------------------
Index Scan using student_pkey on student
(cost=0.00..5.94 rows=1 width=17)
Index Cond: (id = 100250)
db=# explain analyze select * from Students where id=100250;
QUERY PLAN
-----------------------------------------------------------------
Index Scan using student_pkey on student
(cost=0.00..5.94 rows=1 width=17)
(actual time=3.209..3.212 rows=1 loops=1)
Index Cond: (id = 100250)
Total runtime: 3.252 ms
❖ EXPLAIN Examples (cont) |
Example: Select on non-indexed attribute
db=# explain select * from Students where stype='local';
QUERY PLAN
----------------------------------------------------------
Seq Scan on student (cost=0.00..70.33 rows=18 width=17)
Filter: ((stype)::text = 'local'::text)
db=# explain analyze select * from Students
db-# where stype='local';
QUERY PLAN
---------------------------------------------------------------
Seq Scan on student (cost=0.00..70.33 rows=18 width=17)
(actual time=0.061..7.784 rows=2512 loops=1)
Filter: ((stype)::text = 'local'::text)
Total runtime: 7.554 ms
❖ EXPLAIN Examples (cont) |
Example: Join on a primary key (indexed) attribute
db=# explain
db-# select s.sid,p.name
db-# from Students s join People p on s.id=p.id;
QUERY PLAN
-----------------------------------------------------------
Hash Join (cost=70.33..305.86 rows=3626 width=52)
Hash Cond: ("outer".id = "inner".id)
-> Seq Scan on person p
(cost=0.00..153.01 rows=3701 width=52)
-> Hash (cost=61.26..61.26 rows=3626 width=8)
-> Seq Scan on student s
(cost=0.00..61.26 rows=3626 width=8)
❖ EXPLAIN Examples (cont) |
Example: Join on a non-indexed attribute
db=# explain
db-# select s1.code, s2.code
db-# from Subjects s1 join Subjects s2 on s1.offerer=s2.offerer;
QUERY PLAN
----------------------------------------------------------------
Merge Join (cost=2744.12..18397.14 rows=1100342 width=18)
Merge Cond: (s1.offerer = s2.offerer)
-> Sort (cost=1372.06..1398.33 rows=10509 width=13)
Sort Key: s1.offerer
-> Seq Scan on subjects s1
(cost=0.00..670.09 rows=10509 width=13)
-> Sort (cost=1372.06..1398.33 rows=10509 width=13)
Sort Key: s2.offerer
-> Seq Scan on subjects s2
(cost=0.00..670.09 rows=10509 width=13)
❖ Transactions, Concurrency |
DBMSs provide valuable information resources in an environment that is shared by many concurrent users and subject to failures.
❖ Transactions, Concurrency (cont) |
Transaction processing: techniques for managing concurrent access to a database and for recovering from failures.
❖ Transactions |
A transaction is a single application-level operation, performed by a (possibly multi-statement) sequence of database operations.
To maintain integrity of the data, transactions must be:
Atomic, Consistent, Isolated, Durable (the ACID properties)
❖ Example Transaction |
Bank funds transfer
Accounts(id,name,balance,heldAt, ...)
Branches(id,name,address,assets, ...)
where Branches.assets should equal the sum of balances of accounts held at that branch
❖ Example Transaction (cont) |
Example function to implement bank transfer ...
create or replace function
transfer(N integer, Src text, Dest text)
returns integer
declare
sID integer; dID integer; avail integer;
begin
select id,balance into sID,avail
from Accounts where name=Src;
if (sID is null) then
raise exception 'Invalid source account %',Src;
end if;
select id into dID
from Accounts where name=Dest;
if (dID is null) then
raise exception 'Invalid dest account %',Dest;
end if;
...
❖ Example Transaction (cont) |
Example function to implement bank transfer
...
if (avail < N) then
raise exception 'Insufficient funds in %',Src;
end if;
-- total funds in system = NNNN
update Accounts set balance = balance-N
where id = sID;
-- funds temporarily "lost" from system
update Accounts set balance = balance+N
where id = dID;
-- funds restored to system; total funds = NNNN
return nextval('tx_id_seq');
end;
❖ Transaction Concepts |
A transaction must always terminate, either:
successfully (COMMIT), with all changes preserved, or unsuccessfully (ABORT), with the database left unchanged
❖ Transaction Concepts (cont) |
To describe transaction effects, we consider:
READ (transfer data from disk to memory), WRITE (transfer data from memory to disk), ABORT (terminate unsuccessfully), COMMIT (terminate successfully)
abbreviated as R(X), W(X), A, C
In SQL terms: SELECT performs a READ; INSERT performs a WRITE; UPDATE and DELETE perform a READ then a WRITE.
❖ Transaction Consistency |
Transactions typically have intermediate states that are invalid.
However, states before and after transaction must be valid.
Valid = consistent = satisfying all specified constraints on the data
❖ Transaction Consistency (cont) |
Transaction descriptions can be abstracted to sequences of reads and writes on shared data items.
❖ Serial Schedules |
Serial execution: T1 runs to completion before T2 starts, or vice versa, e.g.
T1: R(X) W(X) R(Y) W(Y)
T2:                     R(X) W(X)
or
T1:           R(X) W(X) R(Y) W(Y)
T2: R(X) W(X)
Serial execution guarantees a consistent final state if T1 and T2 are individually consistency-preserving.
❖ Concurrent Schedules |
Concurrent schedules interleave the operations of T1 and T2.
Some concurrent schedules are ok, e.g.
T1: R(X) W(X)           R(Y) W(Y)
T2:           R(X) W(X)
Other concurrent schedules cause anomalies, e.g.
T1: R(X)           W(X) R(Y) W(Y)
T2:      R(X) W(X)
Want the system to ensure that only valid schedules occur.
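The classic anomaly, a lost update, can be simulated in Python. The schedules and data values below are assumptions for illustration: each transaction reads X into a private copy, adds 10, and writes it back:

```python
def run(schedule):
    db = {"X": 100}                 # assumed initial value
    local = {}                      # each transaction's private copy of X
    for txn, op in schedule:
        if op == "R":
            local[txn] = db["X"]            # R(X)
        else:
            db["X"] = local[txn] + 10       # W(X): add 10 to the read value
    return db["X"]

serial      = [("T1","R"), ("T1","W"), ("T2","R"), ("T2","W")]
interleaved = [("T1","R"), ("T2","R"), ("T2","W"), ("T1","W")]

print(run(serial), run(interleaved))  # -> 120 110, one update is lost
```

In the interleaved schedule both transactions read X=100, so the final write overwrites the other transaction's update.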
❖ Serializability |
Serializable schedule: a concurrent schedule that produces the same final state as some serial schedule of the same transactions.
Abstracting this needs a notion of schedule equivalence.
Two common formulations of serializability: conflict serializability and view serializability.
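Conflict serializability can be tested mechanically: build a precedence graph with an edge Ti → Tj whenever an operation of Ti conflicts with a later operation of Tj (same item, at least one write, different transactions), and check the graph for cycles. A sketch with assumed example schedules:

```python
def conflict_serializable(schedule):
    # schedule: list of (transaction, operation, item) triples
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i+1:]:
            if t1 != t2 and x1 == x2 and "W" in (op1, op2):
                edges.add((t1, t2))         # precedence edge t1 -> t2
    # acyclic iff we can repeatedly remove nodes with no incoming edges
    nodes = {t for e in edges for t in e}
    while nodes:
        sources = {n for n in nodes if all(b != n for _, b in edges)}
        if not sources:
            return False                    # cycle: not conflict-serializable
        nodes -= sources
        edges = {(a, b) for (a, b) in edges if a in nodes and b in nodes}
    return True

good = [("T1","R","X"), ("T1","W","X"), ("T2","R","X"), ("T2","W","X")]
bad  = [("T1","R","X"), ("T2","R","X"), ("T2","W","X"), ("T1","W","X")]
print(conflict_serializable(good), conflict_serializable(bad))  # -> True False
```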
Produced: 12 Apr 2023