COMP3311 Week 9 Monday Lecture

∧ >>

❖ Week 09 Monday

In today's lecture ...

Ass2/Q5, Query Evaluation, Transactions

Things to do ...

Assignment 2 due by 23:59 Wed 15 November

Things to note ...

Help Sessions ... Mon 12-2, Tue 2-4, Thu 10-12, Fri 10-12
Exam on Mon 4 December, two sessions: morning, afternoon
No early exams; Supp Exams in O-week Term 1 2024 (Feb 5)

<< ∧ >>

❖ Assignment 2 Database

A few "issues" ...

few programs have Requirements beyond Total UOC
few streams have Requirements beyond Total UOC
stream Total UOC's are typically too high
(because Requirements were moved from stream to program)
enrolment in many courses is ridiculously small
Requirements can be expressed in many different ways
some Requirements are out of date (e.g. COMP2041 in CS electives)

Work with the database as given (except for updates to stream UOCs)

<< ∧ >>

❖ Assignment 2, Q5

Q5 is challenging, much data to keep organised

Inputs:

student ID, program, stream
a transcript (list of courses completed)
set of requirements for program/stream

The Aim:

determine which requirement each course satisfies
print an annotated transcript
show any requirements that are not fully satisfied

<< ∧ >>

❖ Exercise: 5893146 | 3778 | COMPA1

Manually carry out a progression check on:

ACCT1501:HD, COMP1511:SY, MATH1141:SY, COMP1521:UF, ECON1101:HD
MATH1231:CR, COMP2521:HD, MATH1081:DN, COMP3311:HD, ECON1102:CR,
MGMT1001:DN, COMP1521:HD, PSYC1001:CR, COMP3331:PS, COMP1531:HD,
COMP3121:CR, COMP3411:CR, ARTS1270:AW, COMP2511:CR, COMP3421:CR,
COMP4920:DN, COMP3231:CR, COMP6080:DN, DDES1110:CR, COMP3900:HD,
COMP9313:DN

Program Requirements
 Total UOC, 144
 Comp Sci Majors, 1, 1
   COMPA1,COMPD1,COMPE1,COMPI1,COMPJ1,COMPN1,COMPS1,COMPY1
 Foundational Computing, all
   COMP1511,COMP1521,COMP1531,COMP2511,COMP2521
 Comp Sci Maths, all
   MATH1081,{MATH1131;MATH1141},{MATH1231;MATH1241
 Comp Sci Advanced Core, all
   {COMP3121;COMP3821},COMP3900,COMP4920
 General Education, 12, GEN#####

Stream Requirements
 Total UOC, min:66
 COMPA1 Computing Electives, 30
   ENGG2600,ENG3600,ENGG4600,COMP3###,COMP4###,COMP6###,COMP9###
 COMPA1 Free Electives, 36, FREE####

<< ∧ >>

❖ Exercise: 3893433 | 8543 | COMPCS

Manually carry out a progression check on:

COMP9020:HD, COMP9021:CR, COMP9024:CR, COMP9311:PS, COMP9032:CR,
COMP9331:DN, COMP9201:PS, COMP9315:PS, COMP9444:DN,

Program Requirements
 Total UOC, 96, x
 COMPCS Disciplinary Electives, 24, 24, x
  BINF6###,BINF9###,COMP4###,COMP6###,COMP9##,GSOE92###

Stream Requirements
 Total UOC, 96, x
 MIT Streams, 1, 1, x
   COMPAS,COMPBS,COMPCS,COMPDS,COMPES,COMPIS,COMPSS
 Project Management, all, x, GSOE9820
 PG Core Courses, all, x
   COMP9021,COMP9024,COMP9311,COMP9331
 MIT Project Courses, all, x, {COMP9900;COMP9991}
 ADK Courses, 36, 36, x
   COMP4121,COMP4161,COMP4418,COMP6714,COMP9153,COMP9242,COMP9243,
   COMP9315,COMP9318,COMP9319,COMP9323,COMP9334,COMP9336,COMP9417,
   COMP9418, COMP9434,COMP9444,COMP9517,COMP9992,COMP9993

<< ∧ >>

❖ Exercise: 5892582 | 3707 | SENGAH

Manually carry out a progression check on:

COMP1511:PS, ENGG1000:PS, MATH1131:PS, COMP1521:CR, MATH1081:PS,
MATH1231:PS, COMP2521:FL, SENG2011:PS, COMP1531:CR, COMP2521:CR,
COMP2041:CR, DESN2000:CR, MATH2859:FL, COMP2511:FL, COMP3311:PS,
COMP3331:CR, COMP2511:DN, SENG2021:PS, COMP3141:NF,

Program Requirements
Total UOC, 192, x
BE(Hons) Streams, 1, 1,
 AEROAH,BINFAH,CEICAH,CEICDH,COMPBH,CVENAH,CVENBH,ELECAH,ELECCH,
 GMATDH,MANFBH,MINEAH,MTRNAH,PETRAH,SENGAH,SOLAAH,SOLABH,TELEAH
Industrial Training, all, x, ENGG4999
General Education, 12, 12, x, GEN#####

Stream Requirements
Total UOC, 168, x
Foundational Computing, all, x
 COMP1511,COMP1521,COMP1531,COMP2511,COMP2521
SENGAH Maths, all, x
 MATH1081,{MATH1131;MATH1141},{MATH1231;MATH1241},MATH2400,MATH2859
SENGAH Workshops/Design, all, x
 {DESN1000;ENGG1000},DESN2000, SENG2011,SENG2021,SENG3011
SENGAH Advanced Core, all, x
 COMP2041,COMP3141,COMP3311,COMP3331,SENG4920,
 COMP4951,COMP4952,COMP4953
SENGAH Discipline Electives, 36, x 
 ENGG2600,ENGG3060,ENGG3600,ENGG4600,COMP3###,COMP4###,
 COMP6###,COMP9###,INFS3###,INFS4###,MATH3###,MATH4###,
 MATH6###,ELEC3###,ELEC4###,TELE3###,TELE4###
SENGAH Free Electives, 6, 6, x,FREE####

<< ∧ >>

❖ Query Evaluation

The path of a query through its evaluation:

<< ∧ >>

❖ Mapping SQL to RA

Mapping SQL to relational algebra, e.g.

-- schema: R(a,b) S(c,d)
select a as x                   select a as x
from   R join S on (b=c)   OR   from   R, S
where  d=100                    where  b=c and d=100

In general:

SELECT clause becomes projection
FROM clause becomes join
WHERE condition becomes selection
AS becomes rename

Note: join conditions may appear either in FROM or WHERE

<< ∧ >>

❖ Mapping SQL to RA (cont)

Example mapping using above:

-- schema: R(a,b) S(c,d)
select a as x
from   R join S on (b=c)
where  d=100

-- could be mapped to
Tmp1(a,b,c,d) = R Join[b=c] S
Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
Tmp3(a)       = Proj[a](Tmp2)
Res(x)        = Rename[Res(x)](Tmp3)

<< ∧ >>

❖ Exercise: A Better SQL to RA Mapping

On the previous slide, we translated this SQL query as follows:

select a as x
from   R join S on (b=c)
where  d=100

-- mapped to
Tmp1(a,b,c,d) = R Join[b=c] S
Tmp2(a,b,c,d) = Sel[d=100](Tmp1)
Tmp3(a)       = Proj[a](Tmp2)
Res(x)        = Rename[Res(x)](Tmp3)

Suggest a better approach (where "better" = "lower cost")

Assumptions: |R| = 2000, |S| = 1000, |R⋈S| = 1000, |Res| = 100

<< ∧ >>

❖ Another Mapping Example

The query: Courses with more than 100 students in them?

Can be expressed in SQL as

select   s.id, s.code
from     Course c, Subject s, Enrolment e
where    c.id = e.course and c.subject = s.id
group by s.id, s.code
having   count(*) > 100;

and might be compiled to

Result =
Project[id,code](
   GroupSelect[size>100] (
      GroupBy[id,code] (
         Subject Join[s.id=c.subject]
         (Course Join[c.id=e.course] Enrolment)
)  )  )

<< ∧ >>

❖ Exercise: Relational Operation Plan

So far, we have been expressing RA evaluation as follows:

Tmp(a,b,c,...) =  Op[x](R)  or  R Op[x] S

Each statement involves a single relational algebra operation.

Render the RA from the previous slide in this form.

Result =
Project[id,code](
   GroupSelect[size>100] (
      GroupBy[id,code] (
         Subject Join[s.id=c.subject]
         (Course Join[c.id=e.course] Enrolment)
)  )  )

<< ∧ >>

❖ Implementations of RA Ops

Sorting (quicksort, etc. are not applicable)

external merge sort (cost O(Nlog_BN) with B memory buffers)

Selection (different techniques developed for different query types)

sequential scan (worst case, cost O(N))
index-based (e.g. B-trees, cost O(logN), tree nodes are pages)
hash-based (O(1) best case, only works for equality tests)

Join (fast joins are critical to success of relational DBMSs)

nested-loop join (cost O(N.M), buffering can reduce to O(N+M))
sort-merge join (cost O(NlogN+MlogM))
hash-join (best case cost O(N+M.N/B), with B memory buffers)

<< ∧ >>

❖ Query Cost Estimation

The cost of evaluating a query is determined by

the operations specified in the query execution plan
size of relations (database relations and temporary relations)
access mechanisms (indexing, hashing, sorting, join algorithms)
size/number of main memory buffers (and replacement strategy)

Analysis of costs involves estimating:

the size of intermediate results
then, based on this, cost of secondary storage accesses

Accessing data from disk is the dominant cost in query evaluation

<< ∧ >>

❖ Query Cost Estimation (cont)

An execution plan is a sequence of concrete relational operations.

Consider execution plans for: σ_c (R ⋈_d S ⋈_e T)

tmp1   =  hash_join[d](R,S)
tmp2   =  sort_merge_join[e](tmp1,T)
result =  binary_search[c](tmp2)

tmp1   =  btree_search[c](R)
tmp2   =  hash_join[d](tmp1,S)
result =  sort_merge_join[e](tmp2)

Both produce same result, but have different costs.

<< ∧ >>

❖ Query Optimisation

A DBMS query optimiser works as follows:

Input: relational algebra expression
Output: execution plan (sequence of RA ops)

bestCost = INF; bestPlan = none
while (more possible plans) {
   plan = produce a new evaluation plan
   cost = estimated_cost(plan)
   if (cost < bestCost)
      { bestCost = cost; bestPlan = plan }
}
return bestPlan

Typically, there are very, very many possible plans

smarter optimisers generate "likely" subset of possible plans

<< ∧ >>

❖ DB Application Performance

In order to make DB applications efficient, it is useful to know:

what operations on the data does the application require
(which queries, updates, inserts and how frequently is each one performed)
how these operations might be implemented in the DBMS
(data structures and algorithms for select, project, join, sort, ...)
how much each implementation will cost
(in terms of the amount of data transferred between memory and disk)

and then, "encourage" DBMS to use the most efficient methods

Achieve by using indexes and avoiding certain SQL query structures

<< ∧ >>

❖ DB Application Performance (cont)

Application programmer choices that affect query cost:

how queries are expressed
- generally, join is faster than subquery
- especially if subquery is correlated
- filter first, then join (avoids large intermediate tables)
- avoid applying functions in where/group-by clasues
creating indexes on tables
- index will speed-up filtering based on indexed attributes
- indexes generally only effective for equality, gt/lt
- mainly useful if filtering much more frequent than update

<< ∧ >>

❖ Indexes

Indexes provide efficient content-based access to tuples.

Can build indexes on any (combination of) attributes.

Definining indexes:

CREATE INDEX name ON table ( attr₁, attr₂, ... )

attr_i can be an arbitrary expression (e.g. upper(name)).

CREATE INDEX also allows us to specify

that the index is on UNIQUE values
an access method (USING btree, hash, ...)

<< ∧ >>

❖ Indexes (cont)

Indexes can significantly improve query costs.

Considerations in applying indexes:

is an attribute used in frequent/expensive queries?
(note that some kinds of queries can be answered from index alone)
should we create an index on a collection of attributes?
(yes, if the collection is used in a frequent/expensive query)
is the table containing attribute frequently updated?

should we use B-tree or Hash index?

-- use hashing for (unique) attributes in equality tests, e.g.
select * from Employee where id = 12345
-- use B-tree for attributes in range tests, e.g.
select * from Employee where age > 60

<< ∧ >>

❖ Exercise: Indexes and Query Types

Consider a table Students(id,name,age,...)

If we had the following mix of queries

Query Freq

select * from Students where id = zID 70%

select * from Students where name = Name 10%

select * from Students where name like '%str%' 15%

select * from Students where age > Age 5%

suggest an index definition that might help each query type
indicate whether creating such an index might be justified

<< ∧ >>

❖ Expressing SQL Queries

Example: query to find sales people earning more than $50K

select name from Employee
where  salary > 50000 and
       empid in (select empid from Worksin
                 where  dept = 'Sales')

A query evaluator would have to use the strategy

SalesEmps = (select empid from WorksIn where dept='Sales')
foreach e in Employee {
    if (e.empid in SalesEmps && e.salary > 50000)
        add e to result set
}

Needs to examine all employees, even if not in Sales

This is not a good expression of the query.

<< ∧ >>

❖ Expressing SQL Queries (cont)

A different expression of the same query:

select name
from   Employee join WorksIn using (empid)
where  Employee.salary > 5000 and
       WorksIn.dept = 'Sales'

Query evaluator might use the strategy

SalesEmps = (select * from WorksIn where dept='Sales')
foreach e in (Employee join SalesEmps) {
    if (e.salary > 50000)
        add e to result set
}

Only examines Sales employees, and uses a simpler test

This is a good expression of the query.

<< ∧ >>

❖ Expressing SQL Queries (cont)

A very poor expression of the query (correlated subquery):

select name from Employee e
where  salary > 50000 and
       'Sales' in (select dept from Worksin where empid=e.id)

A query evaluator would be forced to use the strategy:

foreach e in Employee {
    Depts = (select dept from WorksIn where empid=e.empid)
    if ('Sales' in Depts && e.salary > 50000)
        add e to result set
}

Needs to run a query for every employee ...

<< ∧ >>

❖ Exercise: Comparing Query Costs

Measure the time costs of the following queries:

select count(*) from Subjects where substring(title,1,1) = 'D';
select count(*) from Subjects where title like 'D%';
select count(*) from Subjects where title ilike 'd%';
select count(*) from Subjects where title ~ '^D.*';
select count(*) from Subjects where title ~* '^d.*';

select count(*) from Subjects where title = 'Database Systems';
select count(*) from Subjects where id = 56571;

select max(id) from Subjects;
select max(title) from Subjects;

select count(*)
from   Subjects s join Courses c on s.id = c.subject
       join Course_enrolments e on c.id = e.course
where  code = 'COMP1511';

<< ∧ >>

❖ Query Tuning

Sometimes, a query can be re-phrased to affect performance:

by helping the optimiser to make use of indexes
by avoiding unnecessary/expensive operations

Examples which may prevent optimiser from using indexes:

select name from Employee where salary/365 > 100
       -- fix by re-phrasing condition to (salary > 36500)
select name from Employee where name like '%ith%'
select name from Employee where birthday is null
       -- above two are difficult to "fix"
select name from Employee
where  dept in (select id from Dept where ...)
       -- fix by using Employee join Dept on (e.dept=d.id)

<< ∧ >>

❖ Query Tuning (cont)

Other tricks in query tuning (effectiveness is DBMS-dependent)

select distinct typically requires a sort ...
is the distinct really necessary? (at this stage in the query?)

if multiple join conditions are available ...
choose join attributes that are indexed, avoid joins on strings

select ... Employee join Customer on (s.name = p.name)
vs
select ... Employee join Customer on (s.ssn = p.ssn)

sometimes or prevents index from being used ...
replace the or condition by a union of non-or clauses

select name from Employee where Dept=1 or Dept=2
vs
(select name from Employee where Dept=1)
union
(select name from Employee where Dept=2)

<< ∧

❖ Exercise: More Query Costs

Check whether the optimisations mentioned above actually work

-- names of course convenors
select full_name from People
where  id in (select convenor from Courses)
-- or --
select p.full_name from People p
where exists (select * from Courses where convenor=p.id)
-- or --
select p.full_name
from People p join Courses c on (p.id=c.convenor)

-- people from Colombia(48) or Mongolia(140)
select full_name from People where origin = 48 or origin = 140
-- or --
(select full_name from People where origin = 48)
union
(select full_name from People where origin = 140)
-- or --
select full_name from People where origin in (48,140)

Query	Freq
`select * from Students where id = zID`	70%
`select * from Students where name = Name`	10%
`select * from Students where name like '%str%'`	15%
`select * from Students where age > Age`	5%