COMP3311 Week 9 Tuesday Lecture

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [0/45]
❖ Week 09 Tuesday

In today's lecture ...

Things to do ...

Things to note ...

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [1/45]
❖ Query Evaluation

The path of a query through its evaluation:


[Diagram:Pics/dbms/qryeval.png]

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [2/45]
❖ Mapping SQL to RA

Mapping SQL to relational algebra, e.g.

-- schema: R(a,b) S(c,d)
select a as x                   select a as x
from   R join S on (b=c)   OR   from   R, S
where  d=100                    where  b=c and d=100

In general:

Note: join conditions may appear either in FROM or WHERE
COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [3/45]
❖ Mapping SQL to RA (cont)


Example mapping using above:

select a as x
from   R join S on (b=c)
where  d=100

-- could be mapped to
Tmp1(a,b,c,d) = R Join[b=c] S
Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
Tmp3(a)       = Proj[a](Tmp2)
Res(x)        = Rename[Res(x)](Tmp3)

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [4/45]
❖ Exercise: A Better SQL to RA Mapping

On the previous slide, we translated this SQL query as follows:

select a as x
from   R join S on (b=c)
where  d=100

-- could be mapped to
Tmp1(a,b,c,d) = R Join[b=c] S
Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
Tmp3(a)       = Proj[a](Tmp2)
Res(x)        = Rename[Res(x)](Tmp3)

Suggest a better approach  (where "better" = "lower cost")

Assumptions:  |R| = 2000,  |S| = 1000,  |R⋈S| = 1000,  |Res| = 100

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [5/45]
❖ Another Mapping Example

The query: Courses with more than 100 students in them?

Can be expressed in SQL as

select   s.id, s.code
from     Course c, Subject s, Enrolment e
where    c.id = e.course and c.subject = s.id
group by s.id, s.code
having   count(*) > 100;

and might be compiled to

Result =
Project[id,code](
   GroupSelect[size>100] (
      GroupBy[id,code] (
         Subject Join[s.id=c.subject]
         (Course Join[c.id=e.course] Enrolment)
)  )  )

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [6/45]
❖ Exercise: Relational Operation Plan

So far, we have been expressing RA evaluation as follows:

Tmp(a,b,c,...) =  Op[x](R)  or  R Op[x] S

Each statement involves a single relational algebra operation.

Render the RA from the previous slide in this form.

Result =
Project[id,code](
   GroupSelect[size>100] (
      GroupBy[id,code] (
         Subject Join[s.id=c.subject]
         (Course Join[c.id=e.course] Enrolment)
)  )  )

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [7/45]
❖ Implementations of RA Ops

Sorting   (quicksort, etc. are not applicable)

Selection   (different techniques developed for different query types) Join   (fast joins are critical to success of relational DBMSs)
COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [8/45]
❖ Query Cost Estimation

The cost of evaluating a query is determined by

Analysis of costs involves estimating:
Accessing data from disk is the dominant cost in query evaluation
COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [9/45]
❖ Query Cost Estimation (cont)

An execution plan is a sequence of concrete relational operations.

Consider execution plans for:    σc (R  ⋈d  S  ⋈e  T)

tmp1   :=  hash_join[d](R,S)
tmp2   :=  sort_merge_join[e](tmp1,T)
result :=  binary_search[c](tmp2)

or

tmp1   :=  btree_search[c](R)
tmp2   :=  hash_join[d](tmp1,S)
result :=  sort_merge_join[e](tmp2)

Both produce same result, but have different costs.

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [10/45]
❖ Exercise: Query Cost Estimation

Estimate the cost of each plan, where  |R|=100,  |S|=1000,  |T|=10000

tmp1   :=  hash_join[d](R,S)
tmp2   :=  sort_merge_join[e](tmp1,T)
result :=  binary_search[c](tmp2)

or

tmp1   :=  sort_merge_join[e](S,T)
tmp2   :=  hash_join[d](R,tmp1)
result :=  linear_search[c](tmp2)

or

tmp1   :=  btree_search[c](R)
tmp2   :=  hash_join[d](tmp1,S)
result :=  sort_merge_join[e](tmp2)

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [11/45]
❖ Query Optimisation

A DBMS query optimiser works as follows:

Input: relational algebra expression
Output: execution plan (sequence of RA ops)

bestCost = INF; bestPlan = none
while (more possible plans) {
   plan = produce a new evaluation plan
   cost = estimated_cost(plan)
   if (cost < bestCost) {
      bestCost = cost; bestPlan = plan
   }
}
return bestPlan

Typically, there are very, very many possible plans

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [12/45]
❖ DB Application Performance

In order to make DB applications efficient, it is useful to know:

and then, "encourage" DBMS to use the most efficient methods

Achieve by using indexes and avoiding certain SQL query structures

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [13/45]
❖ DB Application Performance (cont)

Application programmer choices that affect query cost:

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [14/45]
❖ DB Application Performance (cont)

Whatever you do as a DB application programmer

Transformation is carried out by query optimiser

You have no control over the optimisation process

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [15/45]
❖ DB Application Performance (cont)

Example: query to find sales people earning more than $50K

select name from Employee
where  salary > 50000 and
       empid in (select empid from Worksin
                 where  dept = 'Sales')

A query evaluator would have to use the strategy

SalesEmps = (select empid from WorksIn where dept='Sales')
foreach e in Employee {
    if (e.empid in SalesEmps && e.salary > 50000)
        add e to result set
}

Needs to examine all employees, even if not in Sales

This is not a good expression of the query.

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [16/45]
❖ DB Application Performance (cont)

A different expression of the same query:

select name
from   Employee join WorksIn using (empid)
where  Employee.salary > 5000 and
       WorksIn.dept = 'Sales'

Query evaluator might use the strategy

SalesEmps = (select * from WorksIn where dept='Sales')
foreach e in (Employee join SalesEmps) {
    if (e.salary > 50000)
        add e to result set
}

Only examines Sales employees, and uses a simpler test

This is a good expression of the query.

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [17/45]
❖ DB Application Performance (cont)

A very poor expression of the query (correlated subquery):

select name from Employee e
where  salary > 50000 and
       'Sales' in (select dept from Worksin where empid=e.id)

A query evaluator would be forced to use the strategy:

foreach e in Employee {
    Depts = (select dept from WorksIn where empid=e.empid)
    if ('Sales' in Depts && e.salary > 50000)
        add e to result set
}

Needs to run a query for every employee ...

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [18/45]
❖ Exercise: Comparing Query Costs

Measure the time costs of the following queries:

select count(*) from Movies where substring(title,1,1) = 'D';
select count(*) from Movies where title like 'D%';
select count(*) from Movies where title ilike 'd%';
select count(*) from Movies where title ~ '^D.*';
select count(*) from Movies where title ~* '^d.*';

select count(*) from Movies where title = 'District 9';
select count(*) from Movies where id = 111113;

select max(id) from Movies;
select max(title) from Movies;

select count(*)
from   Movies m join Principals p on (m.id = p.movie)
       join People n on (n.id = p.person)
where  p.ord = 1;

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [19/45]
❖ PostgreSQL Query Costs

PostgreSQL provides the explain statement to

Usage:

EXPLAIN [ANALYZE] Query

Without ANALYZE, EXPLAIN shows plan with estimated costs.

With ANALYZE, EXPLAIN executes query and prints real costs.

Note that runtimes may show considerable variation due to buffering.

If simply want to know the runtime is ok, maybe \timing is good enough

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [20/45]
❖ Indexes

Indexes provide efficient content-based access to tuples.

Can build indexes on any (combination of) attributes.

Definining indexes:

CREATE INDEX name ON table ( attr1, attr2, ... )

attri  can be an arbitrary expression (e.g. upper(name)).

CREATE INDEX  also allows us to specify

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [21/45]
❖ Indexes (cont)

Indexes can significantly improve query costs.

Considerations in applying indexes:

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [22/45]
❖ Exercise: Indexes and Query Types

Consider a table Students(id,name,age,...)

If we had the following mix of queries

QueryFreq
select * from Students where id = zID 70%
select * from Students where name = Name 10%
select * from Students where name like '%str%' 15%
select * from Students where age > Age 5%

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [23/45]
❖ Query Tuning

Sometimes, a query can be re-phrased to affect performance:

Examples which may prevent optimiser from using indexes:

select name from Employee where salary/365 > 100
       -- fix by re-phrasing condition to (salary > 36500)
select name from Employee where name like '%ith%'
select name from Employee where birthday is null
       -- above two are difficult to "fix"
select name from Employee
where  dept in (select id from Dept where ...)
       -- fix by using Employee join Dept on (e.dept=d.id)

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [24/45]
❖ Query Tuning (cont)

Other tricks in query tuning (effectiveness is DBMS-dependent)

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [25/45]
❖ Exercise: More Query Costs

Check whether the optimisations mentioned above actually work

-- movies longer than 3 hours

select title from Movies where runtime/60 > 3
-- or --
select title from Movies where runtime > 180

-- people who were a principal in some movie

select name from People
where  id in (select person from Principals)
-- or --
select p.name from People p
where exists (select * from Principals where person=p.id)
-- or --
select n.name
from People n join Principals p on (n.id=p.person)

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [26/45]
❖ Exercise: More Query Costs (cont)


Check whether the optimisations mentioned above actually work

-- movies made in Albania or Iceland

select title from Movies where origin = 'AL' or origin = 'IS'
-- or --
(select title from Movies where origin = 'AL')
union
(select title from Movies where origin = 'IS')
-- or --
select title from Movies where origin in ('AL','IS')

See how the above queries are affected by adding an index on  Movies.origin

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [27/45]
❖ EXPLAIN Examples

Note that PostgreSQL builds a query evaluation tree, rather than a linear plan, e.g.

[Diagram:Pics/dbms/pg-plan.png]

EXPLAIN effectively shows a pre-order traversal of the plan tree

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [28/45]
❖ EXPLAIN Examples (cont)

Example: Select on indexed attribute

db=# explain select * from Students where id=100250;
                            QUERY PLAN
-----------------------------------------------------------------
 Index Scan using student_pkey on student
                  (cost=0.00..5.94 rows=1 width=17)
   Index Cond: (id = 100250)

db=# explain analyze select * from Students where id=100250;
                            QUERY PLAN
-----------------------------------------------------------------
 Index Scan using student_pkey on student
                  (cost=0.00..5.94 rows=1 width=17)
                  (actual time=3.209..3.212 rows=1 loops=1)
   Index Cond: (id = 100250)
 Total runtime: 3.252 ms

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [29/45]
❖ EXPLAIN Examples (cont)

Example: Select on non-indexed attribute

db=# explain select * from Students where stype='local';
                        QUERY PLAN
----------------------------------------------------------
 Seq Scan on student  (cost=0.00..70.33 rows=18 width=17)
   Filter: ((stype)::text = 'local'::text)

db=# explain analyze select * from Students
db-#                          where stype='local';
                         QUERY PLAN
---------------------------------------------------------------
 Seq Scan on student  (cost=0.00..70.33 rows=18 width=17)
             (actual time=0.061..7.784 rows=2512 loops=1)
   Filter: ((stype)::text = 'local'::text)
 Total runtime: 7.554 ms

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [30/45]
❖ EXPLAIN Examples (cont)

Example: Join on a primary key (indexed) attribute

db=# explain
db-# select s.sid,p.name
db-# from Students s join People p on s.id=p.id;
                        QUERY PLAN
-----------------------------------------------------------
 Hash Join  (cost=70.33..305.86 rows=3626 width=52)
   Hash Cond: ("outer".id = "inner".id)
   -> Seq Scan on person p
               (cost=0.00..153.01 rows=3701 width=52)
   -> Hash  (cost=61.26..61.26 rows=3626 width=8)
       -> Seq Scan on student s
                   (cost=0.00..61.26 rows=3626 width=8)

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [31/45]
❖ EXPLAIN Examples (cont)

Example: Join on a non-indexed attribute

db=# explain
db-# select s1.code, s2.code
db-# from Subjects s1 join Subjects s2 on s1.offerer=s2.offerer;
                        QUERY PLAN
----------------------------------------------------------------
 Merge Join  (cost=2744.12..18397.14 rows=1100342 width=18)
   Merge Cond: (s1.offerer = s2.offerer)
   ->  Sort  (cost=1372.06..1398.33 rows=10509 width=13)
         Sort Key: s1.offerer
         ->  Seq Scan on subjects s1
                      (cost=0.00..670.09 rows=10509 width=13)
   ->  Sort  (cost=1372.06..1398.33 rows=10509 width=13)
         Sort Key: s2.offerer
         ->  Seq Scan on subjects s2
                      (cost=0.00..670.09 rows=10509 width=13)

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [32/45]
❖ Transactions, Concurrency


DBMSs provide valuable information resources in an environment that is:

Each user should see the system as: Ultimate goal: data integrity is maintained at all times.
COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [33/45]
❖ Transactions, Concurrency (cont)

Transaction processing

Concurrency control Recovery mechanisms
In COMP3311, we consider only transactions and concurrency
COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [34/45]
❖ Transactions

A transaction is

Transactions happen in a multi-user, unreliable environment.

To maintain integrity of data, transactions must be:

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [35/45]
❖ Example Transaction

Bank funds transfer

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [36/45]
❖ Example Transaction (cont)

Example function to implement bank transfer ...

create or replace function
   transfer(N integer, Src text, Dest text)
   returns integer
declare
   sID integer; dID integer; avail integer;
begin
   select id,balance into sID,avail
   from   Accounts where name=Src;
   if (sID is null) then
      raise exception 'Invalid source account %',Src;
   end if;
   select id into dID
   from Accounts where name=Dest;
   if (dID is null) then
      raise exception 'Invalid dest account %',Dest;
   end if;
...

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [37/45]
❖ Example Transaction (cont)

Example function to implement bank transfer (cont)...

...
   if (avail < N) then
      raise exception 'Insufficient funds in %',Src;
   end if;
   -- total funds in system = NNNN
   update Accounts set balance = balance-N
   where  id = sID;
   -- funds temporarily "lost" from system
   update Accounts set balance = balance+N
   where  id = dID;
   -- funds restored to system; total funds = NNNN
   return nextval('tx_id_seq');
end;

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [38/45]
❖ Transaction Concepts

A transaction must always terminate, either:


[Diagram:Pics/xact/tx-states1.png]

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [39/45]
❖ Transaction Concepts (cont)

To describe transaction effects, we consider:

Normally abbreviated to R(X), W(X), A, C

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [40/45]
❖ Transaction Consistency

Transactions typically have intermediate states that are invalid.

However, states before and after transaction must be valid.


[Diagram:Pics/xact/tx-exec1.png]



Valid = consistent = satisfying all specified constraints on the data

COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [41/45]
❖ Transaction Consistency (cont)

Transaction descriptions can be abstracted

A schedule defines ... Abribtrary interleaving of operations causes anomalies, so that ...
  • two consistency-preserving transactions
  • produce a final state which is not consistent
  • COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [42/45]
    ❖ Serial Schedules

    Serial execution:   T1 then T2   or   T2 then T1

    T1: R(X) W(X) R(Y) W(Y)
    T2:                     R(X) W(X)
    

    or

    T1:           R(X) W(X) R(Y) W(Y)
    T2: R(X) W(X)
    

    Serial execution guarantees a consistent final state if

    COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [43/45]
    ❖ Concurrent Schedules

    Concurrent schedules interleave T1,T2,... operations

    Some concurrent schedules are ok, e.g.

    T1: R(X) W(X)      R(Y)      W(Y)
    T2:           R(X)      W(X)
    

    Other concurrent schedules cause anomalies, e.g.

    T1: R(X)      W(X)      R(Y) W(Y)
    T2:      R(X)      W(X)
    

    Want the system to ensure that only valid schedules occur.

    COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [44/45]
    ❖ Serializability

    Serializable schedule:

    Abstracting this needs a notion of schedule equivalence.

    Two common formulations of serializability:

    COMP3311 22T3 ♢ Week 9 Tuesday Lecture ♢ [45/45]


    Produced: 12 Apr 2023