Query Processing

A query in SQL:

states what answers are required
says little about how they should be computed

A query evaluator:

takes a declarative description of the query in SQL
parses the query into a relational algebra expression
determines a plan for answering the query
executes the plan via the database engine

Mapping SQL to Relational Algebra

A naive translation scheme from SQL to relational algebra:

SELECT clause → projection
FROM clause → cross-product
WHERE clause → selection

Example:

select s.name, e.course
from   Student s, Enrolment e
where  s.id = e.student and e.mark > 50;

is translated to

Project [name,course] ( Select [id=student ∧ mark>50] ( Student × Enrolment ) )

Mapping SQL to Relational Algebra (cont)

A better translation scheme would be something like:

SELECT clause → projection
WHERE clause on single reln → selection
WHERE clause on two relns → join

Example:

select s.name, e.course
from   Student s, Enrolment e
where  s.id = e.student and e.mark > 50;

is translated to

Project [name,course] ( Select [mark>50] ( Join [id=student] ( Student, Enrolment ) ) )
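
The difference between the two translation schemes can be seen in a minimal Python sketch. The sample tuples and the helper names naive_plan and join_plan are hypothetical and stand in for the RA expressions above; they are not part of any DBMS. Both plans return the same answer, but the naive one materialises every Student × Enrolment pair first.

```python
from itertools import product

# Hypothetical sample data; attribute names follow the example query.
students = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
]
enrolments = [
    {"student": 1, "course": "DB", "mark": 75},
    {"student": 2, "course": "DB", "mark": 40},
    {"student": 2, "course": "OS", "mark": 65},
]

def naive_plan(students, enrolments):
    """Project[name,course](Select[id=student ∧ mark>50](Student × Enrolment))."""
    cross = [{**s, **e} for s, e in product(students, enrolments)]
    selected = [t for t in cross if t["id"] == t["student"] and t["mark"] > 50]
    result = [(t["name"], t["course"]) for t in selected]
    return result, len(cross)  # also report the intermediate result size

def join_plan(students, enrolments):
    """Project[name,course](Select[mark>50](Join[id=student](Student, Enrolment)))."""
    by_id = {s["id"]: s for s in students}  # assumes Student.id is unique
    joined = [{**by_id[e["student"]], **e}
              for e in enrolments if e["student"] in by_id]
    selected = [t for t in joined if t["mark"] > 50]
    result = [(t["name"], t["course"]) for t in selected]
    return result, len(joined)

naive_result, naive_size = naive_plan(students, enrolments)
join_result, join_size = join_plan(students, enrolments)
print(naive_result, naive_size)
print(join_result, join_size)
```

On this sample the cross-product holds 6 intermediate tuples and the join holds 3; the gap grows with the product of the table sizes.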

Mapping SQL to Relational Algebra (cont)

Mapping other SQL constructs to RA operations ...

Aggregation operators (e.g. MAX, SUM, ...):

add new operators to extend RA (e.g. max(Project[age](..)) )

Duplicate elimination (DISTINCT):

incorporate into projection operator (e.g. Project')

Grouping (GROUP-BY, HAVING):

add new operators to extend RA (e.g. GroupBy, GroupSelect)

Sorting (ORDER-BY):

add sort operator to extend RA

Mapping Example

The query: Courses with more than 100 students in them?

Can be expressed in SQL as

select   distinct s.code
from     Course c, Subject s, Enrolment e
where    c.id = e.course and c.subject = s.id
group by s.id
having   count(*) > 100;

and might be compiled to

Result = Project'[s.code]( GroupSelect[size>100]( GroupBy[id] ( JoinRes ) ) )

where

JoinRes = Join[s.id=c.subject] ( Subject, Join[id=course]( Course, Enrolment ) )
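
As a rough illustration of what the extended operators do, the following Python sketch applies GroupBy, GroupSelect and Project' to a small, made-up JoinRes. The function names group_by, group_select and project_distinct are hypothetical, and the threshold is lowered from 100 to 2 so that the sample data stays tiny.

```python
from collections import defaultdict

def group_by(tuples, attr):
    """GroupBy[attr]: partition tuples by the value of attr."""
    groups = defaultdict(list)
    for t in tuples:
        groups[t[attr]].append(t)
    return groups

def group_select(groups, min_size):
    """GroupSelect[size>min_size]: keep only groups with more than min_size tuples."""
    return {key: g for key, g in groups.items() if len(g) > min_size}

def project_distinct(groups, attr):
    """Project'[attr]: project one attribute and eliminate duplicates."""
    return sorted({t[attr] for g in groups.values() for t in g})

# Hypothetical JoinRes: one tuple per enrolment, carrying the subject id and code.
join_res = (
    [{"id": 1, "code": "COMP3311"}] * 3
    + [{"id": 2, "code": "COMP9315"}] * 1
)

result = project_distinct(group_select(group_by(join_res, "id"), 2), "code")
print(result)
```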

Mapping Example (cont)

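The Join operations could be done (at least) two different ways:

Join[s.id=c.subject] ( Subject, Join[id=course] ( Course, Enrolment ) )
Join[id=course] ( Join[s.id=c.subject] ( Subject, Course ), Enrolment )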

Which is better? ... The query optimiser works this out.

Note: for a join involving N tables, there are O(N!) possible trees.
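
To get a feel for the growth, the sketch below enumerates join orders for a list of tables. It is restricted to left-deep trees and treats the operand order as significant, which gives exactly N! orders; these restrictions are assumptions made here for simplicity, and allowing bushy trees makes the count larger still. The function name left_deep_orders is hypothetical.

```python
from itertools import permutations
from math import factorial

def left_deep_orders(tables):
    """Yield each left-deep join order as a nested tuple, e.g. (("R", "S"), "T")."""
    for perm in permutations(tables):
        tree = perm[0]
        for t in perm[1:]:
            tree = (tree, t)  # join the result so far with the next table
        yield tree

orders = list(left_deep_orders(["Subject", "Course", "Enrolment"]))
for tree in orders:
    print(tree)
print(len(orders), factorial(3))
```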

Query Evaluation

The order of operations is important.

Equally important is the choice of concrete operations:

each RA operator has several implementation methods
DBMSs typically provide a range of choices
each implementation is effective under certain conditions

The DBMS query optimiser needs to:

choose concrete operations for each RA operation in query
by analysing the cost of potential concrete operations

Database Engine Operations

One view of DB engine - "relational algebra virtual machine":

selection (σ)	projection (π)	join (⋈, ×)
union (∪)	intersection (∩)	difference (-)
sort	insert	delete

For each of these operations:

various data structures and algorithms are available
DBMSs may provide only one, or may provide a choice
we need to be able to estimate the cost of each method

Cost analysis requires a model of DBMS internals ...

Query Optimisation Problem

Given:

a query Q, a database D, a database "engine" E

Determine a sequence of relational algebra operations that:

produces the answer to Q in D
executes Q efficiently on E (minimal I/O)

The term "query optimisation" is a misnomer:

not just for queries (e.g. also updates)
not necessarily optimal ("reasonably efficient")

(Finding the optimal plan is NP-hard; the cost of finding it may be higher than the cost of running the query.)

Query Optimisation Problem (cont)

The query optimiser starts with an RA expression, then

generates a set of equivalent expressions
generates possible execution plans for each
estimates cost of each plan, chooses cheapest

The cost of evaluating a query is determined by:

size of relations (database relations and temporary relations)
access mechanisms (indexing, hashing, sorting, join algorithms)
size/number of main memory buffers (and replacement strategy)

Analysis of costs involves estimating:

the size of intermediate results
then, based on this, cost of secondary storage accesses

Query Optimisation Problem (cont)

An execution plan is a sequence of relational operations.

Consider execution plans for: σ_c (R ⋈_d S ⋈_e T)

tmp1   :=  hash_join[d](R,S)
tmp2   :=  sort_merge_join[e](tmp1,T)
result :=  binary_search[c](tmp2)

tmp1   :=  sort_merge_join[e](S,T)
tmp2   :=  hash_join[d](R,tmp1)
result :=  linear_search[c](tmp2)

tmp1   :=  btree_search[c](R)
tmp2   :=  hash_join[d](tmp1,S)
result :=  sort_merge_join[e](tmp2,T)

All produce same result, but have different costs.

Implementations of RA Ops

Sorting (in-memory methods such as quicksort are not applicable when the data does not fit in memory)

external merge sort (cost O(N log_B N) with B memory buffers)
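
The logarithmic factor comes from the number of passes over the data. The sketch below counts passes using the common textbook model (an assumption here, not stated in these notes): the first pass sorts B pages at a time into runs, and each later pass merges B-1 runs, with every pass reading and writing all N pages.

```python
from math import ceil

def merge_sort_passes(n_pages, n_buffers):
    """Number of passes external merge sort makes over n_pages with n_buffers buffers."""
    if n_buffers < 3:
        raise ValueError("need at least 3 buffers (2 inputs + 1 output)")
    runs = ceil(n_pages / n_buffers)   # pass 0: sorted runs of B pages each
    passes = 1
    while runs > 1:
        runs = ceil(runs / (n_buffers - 1))  # each merge pass combines B-1 runs
        passes += 1
    return passes

def merge_sort_io(n_pages, n_buffers):
    """Total page I/Os: every pass reads and writes each page once."""
    return 2 * n_pages * merge_sort_passes(n_pages, n_buffers)

print(merge_sort_passes(108, 5), merge_sort_io(108, 5))
```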

Selection (different techniques developed for different query types)

sequential scan (worst case, cost O(N))
index-based (e.g. B-trees, cost O(log N), tree nodes are pages)
hash-based (O(1) best case, only works for equality tests)

Join (fast joins are critical to the success of relational DBMSs)

nested-loop join (cost O(N.M), buffering can reduce to O(N+M))
sort-merge join (cost O(N log N + M log M))
hash-join (best case cost O(N+M.N/B), with B memory buffers)
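
A minimal in-memory sketch of two of these methods is given below. It ignores pages and buffers entirely, so it only illustrates the logic, not the I/O costs quoted above; the function names and sample tuples are hypothetical.

```python
from collections import defaultdict

def nested_loop_join(r, s, r_attr, s_attr):
    """Compare every tuple of r with every tuple of s: O(N.M) comparisons."""
    return [{**tr, **ts} for tr in r for ts in s if tr[r_attr] == ts[s_attr]]

def hash_join(r, s, r_attr, s_attr):
    """Build a hash table on r, then probe it once per tuple of s (equality joins only)."""
    table = defaultdict(list)
    for tr in r:
        table[tr[r_attr]].append(tr)
    return [{**tr, **ts} for ts in s for tr in table.get(ts[s_attr], [])]

def canon(rows):
    """Order-independent form of a result, for comparing the two methods."""
    return sorted(tuple(sorted(t.items())) for t in rows)

students = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]
enrolments = [
    {"student": 1, "course": "DB"},
    {"student": 2, "course": "DB"},
    {"student": 2, "course": "OS"},
]

nl = nested_loop_join(students, enrolments, "id", "student")
hj = hash_join(students, enrolments, "id", "student")
print(len(nl), canon(nl) == canon(hj))
```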

Performance Tuning

Schema design:

devise data structures to represent application information

Performance tuning:

devise data structures to achieve good performance

Good performance may involve any/all of:

making applications run faster
lowering response time of queries/transactions
improving overall transaction throughput

Performance Tuning (cont)

Tuning requires us to consider the following:

which queries and transactions will be used?
(e.g. check balance for payment, display recent transaction history)
how frequently does each query/transaction occur?
(e.g. 99% of transactions are EFTPOS payments; 1% are print balance)
are there time constraints on queries/transactions?
(e.g. payment at EFTPOS terminals must be approved within 7 seconds)
are there uniqueness constraints on any attributes?
(therefore, define index on attributes to speed up insertion uniqueness check)
how frequently do updates occur?
(indexes slow down updates, because both the table and the index must be updated)

Performance Tuning (cont)

Performance can be considered at two times:

during schema design
- typically towards the end of schema design process
- requires schema transformations such as denormalisation
after schema design
- requires adding extra data structures such as indexes

Denormalisation

Normalisation structures data to minimise storage redundancy.

achieves this by "breaking up" the data into logical chunks
requires minimal "maintenance" to ensure data consistency

Problem: queries that need to put data back together.

need to use a (potentially expensive) join operation
if an expensive join is frequent, system performance suffers

Solution: store some data redundantly

benefit: queries needing expensive join are now cheap
trade-off: extra maintenance effort to maintain consistency
worthwhile if joins are frequent and updates are rare

Denormalisation (cont)

Example: Courses = Course ⋈ Subject ⋈ Term

If we frequently need to refer to a course's "standard" name:

add extra courseName column into Course table
cost: trigger before insert on Course to construct name
trade-off likely to be worthwhile: Course insertions infrequent


-- can now replace a query like:
select s.code||t.year||t.sess, e.grade, e.mark
from   Course c, CourseEnrolment e, Subject s, Term t
where  e.course = c.id and c.subject = s.id and c.term = t.id
-- by a query like:
select c.courseName, e.grade, e.mark
from   Course c, CourseEnrolment e
where  e.course = c.id
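
The maintenance trigger could look roughly like the following sketch, which uses SQLite through Python's sqlite3 module as a stand-in for the DBMS. Two things here are assumptions rather than part of the notes: the column types and sample rows are made up, and the trigger runs after insert and updates the new row, because SQLite triggers cannot modify the row being inserted the way a PostgreSQL before-insert trigger can.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Subject (id integer primary key, code text);
create table Term    (id integer primary key, year text, sess text);
create table Course  (id integer primary key,
                      subject integer references Subject(id),
                      term    integer references Term(id),
                      courseName text);

-- Fill in the redundant courseName whenever a Course row is inserted.
create trigger course_name after insert on Course
begin
    update Course
    set    courseName = (select s.code || t.year || t.sess
                         from   Subject s, Term t
                         where  s.id = new.subject and t.id = new.term)
    where  id = new.id;
end;

insert into Subject values (1, 'COMP3311');
insert into Term    values (1, '24', 'T1');
insert into Course (id, subject, term) values (10, 1, 1);
""")

# The name is now available without joining Subject and Term.
name = conn.execute("select courseName from Course where id = 10").fetchone()[0]
print(name)
```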

Indexes

Indexes provide efficient content-based access to tuples.

Can build indexes on any (combination of) attributes.

Defining indexes:

CREATE INDEX name ON table ( attr₁, attr₂, ... )

attr_i can be an arbitrary expression (e.g. upper(name)).

CREATE INDEX also allows us to specify

that the index is on UNIQUE values
an access method (USING btree, hash, rtree, or gist)

Indexes (cont)

Indexes can make a huge difference to query processing cost.

On the other hand, they introduce overheads (storage, updates).

Creating indexes to maximise performance benefits:

apply to attributes used in equality/range conditions, e.g.


select * from Employee where id = 12345
select * from Employee where age > 60
select * from Employee where salary between 10000 and 20000

but only in queries that are frequently used
and on tables that are not updated frequently
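
The effect of an index on the chosen plan can be observed with a small experiment. The sketch below uses SQLite via Python's sqlite3 module purely as a convenient stand-in; its explain query plan output differs from PostgreSQL's EXPLAIN, and the table contents and the index name emp_id_idx are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Employee (id integer, name text, age integer, salary integer)")
conn.executemany(
    "insert into Employee values (?, ?, ?, ?)",
    [(i, f"emp{i}", 20 + i % 50, 10000 + i) for i in range(20000)],
)

def plan(sql):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the plan text

query = "select * from Employee where id = 12345"
before = plan(query)                                  # no index yet: full table scan
conn.execute("create index emp_id_idx on Employee (id)")
after = plan(query)                                   # the planner can now use the index
print(before)
print(after)
```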

Indexes (cont)

Considerations in applying indexes:

is an attribute used in frequent/expensive queries?
(note that some kinds of queries can be answered from index alone)
should we create an index on a collection of attributes?
(yes, if the collection is used in a frequent/expensive query)
can we exploit a clustered index? (only one per table)

should we use B-tree or Hash index?


-- use hashing for (unique) attributes in equality tests, e.g.
select * from Employee where id = 12345
-- use B-tree for attributes in range tests, e.g.
select * from Employee where age > 60

Query Tuning

Sometimes, a query can be re-phrased to affect performance:

by helping the optimiser to make use of indexes
by avoiding (unnecessary) operations that are expensive

Examples which may prevent optimiser from using indexes:

select name from Employee where salary/365 > 10.0
       -- fix by re-phrasing condition to (salary > 3650)
select name from Employee where name like '%ith%'
select name from Employee where birthday is null
       -- above two are difficult to "fix"
select name from Employee
where  dept in (select id from Dept where ...)
       -- fix by using Employee join Dept on (e.dept=d.id)
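
The first example can be tried out in a small experiment. The sketch below again uses SQLite via Python's sqlite3 module as a stand-in, so the planner is not PostgreSQL's, and the index name emp_salary_idx is made up. The salary column is declared real so that salary/365 > 10.0 and salary > 3650 select the same rows; with integer division they would not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Employee (name text, salary real)")
conn.execute("create index emp_salary_idx on Employee (salary)")

def plan(sql):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("explain query plan " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# The indexed column is wrapped in an expression: the index on salary is not usable.
slow = plan("select name from Employee where salary/365 > 10.0")
# The condition is on the bare column: the index can drive a range search.
fast = plan("select name from Employee where salary > 3650")
print(slow)
print(fast)
```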

Query Tuning (cont)

Other factors to consider in query tuning:

select distinct requires a sort; is distinct necessary?

if multiple join conditions are available ...
choose join attributes that are indexed, avoid joins on strings


select ... Employee e join Customer c on (e.name = c.name)
vs
select ... Employee e join Customer c on (e.ssn = c.ssn)

sometimes an or in the condition prevents an index from being used ...
replace the or condition by a union of non-or clauses


select name from Employee where dept=1 or dept=2
vs
(select name from Employee where dept=1)
union
(select name from Employee where dept=2)
-- note: union also removes duplicate names; use union all to keep them

PostgreSQL Query Tuning

PostgreSQL provides the explain statement to

give a representation of the query execution plan
with information that may help to tune query performance

Usage:

EXPLAIN [ANALYZE] Query

Without ANALYZE, EXPLAIN shows plan with estimated costs.

With ANALYZE, EXPLAIN executes query and prints real costs.

Note that runtimes may show considerable variation due to buffering.

EXPLAIN Examples

Example: Select on indexed attribute


ass2=# explain select * from student where id=100250;
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Index Scan using student_pkey on student  (cost=0.00..5.94 rows=1 width=17)
   Index Cond: (id = 100250)

ass2=# explain analyze select * from student where id=100250;
                                 QUERY PLAN 
-----------------------------------------------------------------------------
 Index Scan using student_pkey on student  (cost=0.00..5.94 rows=1 width=17)
                                 (actual time=31.209..31.212 rows=1 loops=1)
   Index Cond: (id = 100250)
 Total runtime: 31.252 ms

EXPLAIN Examples (cont)

Example: Select on non-indexed attribute


ass2=# explain select * from student where stype='local';
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on student  (cost=0.00..70.33 rows=18 width=17)
   Filter: ((stype)::text = 'local'::text)

ass2=# explain analyze select * from student where stype='local';
                              QUERY PLAN 
--------------------------------------------------------------------
 Seq Scan on student  (cost=0.00..70.33 rows=18 width=17)
             (actual time=0.061..4.784 rows=2512 loops=1)
   Filter: ((stype)::text = 'local'::text)
 Total runtime: 7.554 ms
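
One useful habit when reading EXPLAIN ANALYZE output is to compare estimated and actual row counts. The Python sketch below pulls both out of the plan lines shown above with regular expressions; the patterns are written for this output format only and are an assumption, not a general EXPLAIN parser.

```python
import re

# Plan text copied from the EXPLAIN ANALYZE example above.
plan_text = """
 Seq Scan on student  (cost=0.00..70.33 rows=18 width=17)
             (actual time=0.061..4.784 rows=2512 loops=1)
   Filter: ((stype)::text = 'local'::text)
 Total runtime: 7.554 ms
"""

ESTIMATE_RE = re.compile(r"cost=(\d+\.\d+)\.\.(\d+\.\d+) rows=(\d+) width=(\d+)")
ACTUAL_RE = re.compile(r"actual time=(\d+\.\d+)\.\.(\d+\.\d+) rows=(\d+) loops=(\d+)")

estimate = ESTIMATE_RE.search(plan_text)
actual = ACTUAL_RE.search(plan_text)

est_rows = int(estimate.group(3))     # rows the optimiser expected
actual_rows = int(actual.group(3))    # rows the scan really produced
ratio = actual_rows / est_rows
print(est_rows, actual_rows, round(ratio))
```

Here the optimiser expected 18 rows but the scan returned 2512, roughly 140 times more. A gap of this size often points to missing or out-of-date table statistics, which in PostgreSQL are refreshed by running ANALYZE on the table.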

EXPLAIN Examples (cont)

Example: Join on a primary key (indexed) attribute


ass2=# explain
ass2-# select s.sid,p.name from Student s, Person p where s.id=p.id;
                               QUERY PLAN                                
-------------------------------------------------------------------------
 Hash Join  (cost=70.33..305.86 rows=3626 width=52)
   Hash Cond: ("outer".id = "inner".id)
   ->  Seq Scan on person p  (cost=0.00..153.01 rows=3701 width=52)
   ->  Hash  (cost=61.26..61.26 rows=3626 width=8)
         ->  Seq Scan on student s  (cost=0.00..61.26 rows=3626 width=8)

EXPLAIN Examples (cont)

Join on a primary key (indexed) attribute:


ass2=# explain analyze
ass2-# select s.sid,p.name from Student s, Person p where s.id=p.id;
                                QUERY PLAN
-------------------------------------------------------------------------
 Hash Join  (cost=70.33..305.86 rows=3626 width=52)
            (actual time=11.680..28.242 rows=3626 loops=1)
   Hash Cond: ("outer".id = "inner".id)
   ->  Seq Scan on person p  (cost=0.00..153.01 rows=3701 width=52)
                       (actual time=0.039..5.976 rows=3701 loops=1)
   ->  Hash  (cost=61.26..61.26 rows=3626 width=8)
             (actual time=11.615..11.615 rows=3626 loops=1)
         ->  Seq Scan on student s  (cost=0.00..61.26 rows=3626 width=8)
                            (actual time=0.005..5.731 rows=3626 loops=1)
 Total runtime: 32.374 ms