Query Processing


∮ Query Processing


Query Processing

A query in SQL:

A query evaluator :

Query Processing (cont)

Phase 1: mapping SQL to relational algebra (RA)

[Diagram:Pic/qproc/qryeval1.png]


Mapping SQL to Relational Algebra

A naive translation scheme from SQL to relational algebra:

Example:

select s.name, e.course
from   Student s, Enrolment e
where  s.id = e.student and e.mark > 50;

is translated to

Project [name,course] ( Select [id=student ∧ mark>50] ( Student × Enrolment ) )

Mapping SQL to Relational Algebra (cont)

A better translation scheme would be something like:

Example:

select s.name, e.course
from   Student s, Enrolment e
where  s.id = e.student and e.mark > 50;

is translated to

Project [name,course] ( Select [mark>50] ( Join [id=student] ( Student, Enrolment ) ) )

Mapping SQL to Relational Algebra (cont)

Mapping other RA operations ...

Aggregation operators (e.g. MAX, SUM, ...):

Duplicate elimination (DISTINCT): Grouping (GROUP-BY, HAVING): Sorting (ORDER-BY):

Mapping Example

The query: Courses with more than 100 students in them?

Can be expressed in SQL as

select   distinct s.code
from     Course c, Subject s, Enrolment e
where    c.id = e.course and c.subject = s.id
group by s.id
having   count(*) > 100;

and might be compiled to

Result = Project'[s.code]( GroupSelect[size>100]( GroupBy[id] ( JoinRes ) ) )

where

JoinRes = Join[s.id=c.subject] ( Subject, Join[id=course]( Course, Enrolment ) )


Mapping Example (cont)

The Join operations could be done (at least) two different ways:

[Diagram:Pic/qproc/join-trees.png]


Which is better? ... The query optimiser works this out.

Note: for a join involving N tables, there are O(N!) possible trees.


Query Evaluation

The order of operations is important.

Equally important is the choice of concrete operations:

The DBMS query optimiser needs to:

Database Engine Operations

One view of DB engine - "relational algebra virtual machine":

selection (σ) projection (π) join (, ×)
union () intersection () difference (-)
sort insert delete

For each of these operations:

Cost analysis requires a model of DBMS internals ...

Query Optimisation Problem

Given:

Determine a sequence of relational algebra operations that: The term "query optimisation" is a misnomer: (Finding the optimal query is NP-hard; the cost of finding it may be higher than the query cost).

Query Optimisation Problem (cont)

The query optimiser start with an RA expression, then

The cost of evaluating a query is determined by: Analysis of costs involves estimating:

Query Optimisation Problem (cont)

An execution plan is a sequence of relational operations.

Consider execution plans for:    σc (R  d  S  e  T)

tmp1   :=  hash_join[d](R,S)
tmp2   :=  sort_merge_join[e](tmp1,T)
result :=  binary_search[c](tmp2)

or

tmp1   :=  sort_merge_join[e](S,T)
tmp2   :=  hash_join[d](R,tmp1)
result :=  linear_search[c](tmp2)

or

tmp1   :=  btree_search[c](R)
tmp2   :=  hash_join[d](tmp1,S)
result :=  sort_merge_join[e](tmp2)

All produce same result, but have different costs.


Implementations of RA Ops

Sorting   (quicksort, etc. are not applicable)

Selection   (different techniques developed for different query types) Join   (fast joins are critical to success to erlational DBMSs)

∮ Performance Tuning


Performance Tuning

Schema design:

Performance tuning: Good performance may involve any/all of:

Performance Tuning (cont)

Tuning requires us to consider the following:


Performance Tuning (cont)

Performance can be considered at two times:


Denormalisation

Normalisation structures data to minimise storage redundancy.

Problem: queries that need to put data back together. Solution: store some data redundantly

Denormalisation (cont)

Example: Courses = Course Subject Term

If we frequently need to refer to course "standard" name


-- can now replace a query like:
select s.code||t.year||t.sess, e.grade, e.mark
from   Course c, CourseEnrolment e, Subject s, Term t
where  e.course = c.id and c.subject = s.id and c.term = t.id
-- by a query like:
select c.courseName, e.grade, e.mark
from   Course c, CourseEnrolment e
where  e.course = c.id 


Indexes

Indexes provide efficient content-based access to tuples.

Can build indexes on any (combination of) attributes.

Definining indexes:

CREATE INDEX name ON table ( attr1, attr2, ... )

attri can be an arbitrary expression (e.g. upper(name)).

CREATE INDEX also allows us to specify


Indexes (cont)

Indexes can make a huge difference to query processing cost.

On the other hand, they introduce overheads (storage, updates).

Creating indexes to maximise performance benefits:


Indexes (cont)

Considerations in applying indexes:


Query Tuning

Sometimes, a query can be re-phrased to affect performance:

Examples which may prevent optimiser from using indexes:

select name from Employee where salary/365 > 10.0
       -- fix by re-phrasing condition to (salary > 3650)
select name from Employee where name like '%ith%'
select name from Employee where birthday is null
       -- above two are difficult to "fix"
select name from Employee
where  dept in (select id from Dept where ...)
       -- fix by using Employee join Dept on (e.dept=d.id)


Query Tuning (cont)

Other factors to consider in query tuning:


PostgreSQL Query Tuning

PostgreSQL provides the explain statement to

Usage:

EXPLAIN [ANALYZE] Query

Without ANALYZE, EXPLAIN shows plan with estimated costs.

With ANALYZE, EXPLAIN executes query and prints real costs.

Note that runtimes may show considerable variation due to buffering.


EXPLAIN Examples

Example: Select on indexed attribute


ass2=# explain select * from student where id=100250;
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Index Scan using student_pkey on student  (cost=0.00..5.94 rows=1 width=17)
   Index Cond: (id = 100250)

ass2=# explain analyze select * from student where id=100250;
                                 QUERY PLAN 
-----------------------------------------------------------------------------
 Index Scan using student_pkey on student  (cost=0.00..5.94 rows=1 width=17)
                                 (actual time=31.209..31.212 rows=1 loops=1)
   Index Cond: (id = 100250)
 Total runtime: 31.252 ms


EXPLAIN Examples (cont)

Example: Select on non-indexed attribute


ass2=# explain select * from student where stype='local';
                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on student  (cost=0.00..70.33 rows=18 width=17)
   Filter: ((stype)::text = 'local'::text)

ass2=# explain analyze select * from student where stype='local';
                              QUERY PLAN 
--------------------------------------------------------------------
 Seq Scan on student  (cost=0.00..70.33 rows=18 width=17)
             (actual time=0.061..4.784 rows=2512 loops=1)
   Filter: ((stype)::text = 'local'::text)
 Total runtime: 7.554 ms


EXPLAIN Examples (cont)

Example: Join on a primary key (indexed) attribute


ass2=# explain
ass2-# select s.sid,p.name from Student s, Person p where s.id=p.id;
                               QUERY PLAN                                
-------------------------------------------------------------------------
 Hash Join  (cost=70.33..305.86 rows=3626 width=52)
   Hash Cond: ("outer".id = "inner".id)
   ->  Seq Scan on person p  (cost=0.00..153.01 rows=3701 width=52)
   ->  Hash  (cost=61.26..61.26 rows=3626 width=8)
         ->  Seq Scan on student s  (cost=0.00..61.26 rows=3626 width=8)


EXPLAIN Examples (cont)

Join on a primary key (indexed) attribute:


ass2=# explain anaylze
ass2-# select s.sid,p.name from Student s, Person p where s.id=p.id;
                                QUERY PLAN
-------------------------------------------------------------------------
 Hash Join  (cost=70.33..305.86 rows=3626 width=52)
            (actual time=11.680..28.242 rows=3626 loops=1)
   Hash Cond: ("outer".id = "inner".id)
   ->  Seq Scan on person p  (cost=0.00..153.01 rows=3701 width=52)
                       (actual time=0.039..5.976 rows=3701 loops=1)
   ->  Hash  (cost=61.26..61.26 rows=3626 width=8)
             (actual time=11.615..11.615 rows=3626 loops=1)
         ->  Seq Scan on student s  (cost=0.00..61.26 rows=3626 width=8)
                            (actual time=0.005..5.731 rows=3626 loops=1)
 Total runtime: 32.374 ms


Produced: 13 Sep 2020