A query in SQL:
Phase 1: mapping SQL to relational algebra (RA)
A naive translation scheme from SQL to relational algebra:
SELECT
FROM
WHERE
select s.name, e.course from Student s, Enrolment e where s.id = e.student and e.mark > 50;
is translated to
A better translation scheme would be something like:
SELECT
WHERE
WHERE
select s.name, e.course from Student s, Enrolment e where s.id = e.student and e.mark > 50;
is translated to
Mapping other RA operations ...
Aggregation operators (e.g. MAX
SUM
DISTINCT
GROUP-BY
HAVING
ORDER-BY
The query: Courses with more than 100 students in them?
Can be expressed in SQL as
select distinct s.code from Course c, Subject s, Enrolment e where c.id = e.course and c.subject = s.id group by s.id having count(*) > 100;
and might be compiled to
Result = Project'[s.code]( GroupSelect[size>100]( GroupBy[id] ( JoinRes ) ) )
where
JoinRes = Join[s.id=c.subject] ( Subject, Join[id=course]( Course, Enrolment ) )
The Join operations could be done (at least) two different ways:
Which is better? ... The query optimiser works this out.
Note: for a join involving N tables, there are O(N!) possible trees.
The order of operations is important.
Equally important is the choice of concrete operations:
One view of DB engine - "relational algebra virtual machine":
projection (π) | join (![]() |
|||
intersection (∩) | difference (-) | |||
insert | delete |
For each of these operations:
Given:
The query optimiser start with an RA expression, then
An execution plan is a sequence of relational operations.
Consider execution plans for:
σc (R d S
e T)
tmp1 := hash_join[d](R,S) tmp2 := sort_merge_join[e](tmp1,T) result := binary_search[c](tmp2)
or
tmp1 := sort_merge_join[e](S,T) tmp2 := hash_join[d](R,tmp1) result := linear_search[c](tmp2)
or
tmp1 := btree_search[c](R) tmp2 := hash_join[d](tmp1,S) result := sort_merge_join[e](tmp2)
All produce same result, but have different costs.
Sorting (quicksort, etc. are not applicable)
Schema design:
Tuning requires us to consider the following:
Performance can be considered at two times:
Normalisation structures data to minimise storage redundancy.
Example: Courses = Course Subject
Term
If we frequently need to refer to course "standard" name
courseName
Course
Course
Course
-- can now replace a query like: select s.code||t.year||t.sess, e.grade, e.mark from Course c, CourseEnrolment e, Subject s, Term t where e.course = c.id and c.subject = s.id and c.term = t.id -- by a query like: select c.courseName, e.grade, e.mark from Course c, CourseEnrolment e where e.course = c.id
Indexes provide efficient content-based access to tuples.
Can build indexes on any (combination of) attributes.
Definining indexes:
CREATE INDEX name ON table ( attr1, attr2, ... )
attri can be an arbitrary expression (e.g. upper(name)
CREATE INDEX
UNIQUE
USING
Indexes can make a huge difference to query processing cost.
On the other hand, they introduce overheads (storage, updates).
Creating indexes to maximise performance benefits:
select * from Employee where id = 12345 select * from Employee where age > 60 select * from Employee where salary between 10000 and 20000
Considerations in applying indexes:
-- use hashing for (unique) attributes in equality tests, e.g. select * from Employee where id = 12345 -- use B-tree for attributes in range tests, e.g. select * from Employee where age > 60
Sometimes, a query can be re-phrased to affect performance:
select name from Employee where salary/365 > 10.0 -- fix by re-phrasing condition to (salary > 3650) select name from Employee where name like '%ith%' select name from Employee where birthday is null -- above two are difficult to "fix" select name from Employee where dept in (select id from Dept where ...) -- fix by using Employee join Dept on (e.dept=d.id)
Other factors to consider in query tuning:
select distinct
distinct
select ... Employee join Customer on (s.name = p.name)
vs
select ... Employee join Customer on (s.ssn = p.ssn)
or
or
or
select name from Employee where dept=1 or dept=2
vs
(select name from Employee where dept=1)
union
(select name from Employee where dept=2)
PostgreSQL provides the explain
EXPLAIN [ANALYZE] Query
Without ANALYZE
EXPLAIN
With ANALYZE
EXPLAIN
Note that runtimes may show considerable variation due to buffering.
Example: Select on indexed attribute
ass2=# explain select * from student where id=100250; QUERY PLAN ----------------------------------------------------------------------------- Index Scan using student_pkey on student (cost=0.00..5.94 rows=1 width=17) Index Cond: (id = 100250) ass2=# explain analyze select * from student where id=100250; QUERY PLAN ----------------------------------------------------------------------------- Index Scan using student_pkey on student (cost=0.00..5.94 rows=1 width=17) (actual time=31.209..31.212 rows=1 loops=1) Index Cond: (id = 100250) Total runtime: 31.252 ms
Example: Select on non-indexed attribute
ass2=# explain select * from student where stype='local'; QUERY PLAN ---------------------------------------------------------- Seq Scan on student (cost=0.00..70.33 rows=18 width=17) Filter: ((stype)::text = 'local'::text) ass2=# explain analyze select * from student where stype='local'; QUERY PLAN -------------------------------------------------------------------- Seq Scan on student (cost=0.00..70.33 rows=18 width=17) (actual time=0.061..4.784 rows=2512 loops=1) Filter: ((stype)::text = 'local'::text) Total runtime: 7.554 ms
Example: Join on a primary key (indexed) attribute
ass2=# explain ass2-# select s.sid,p.name from Student s, Person p where s.id=p.id; QUERY PLAN ------------------------------------------------------------------------- Hash Join (cost=70.33..305.86 rows=3626 width=52) Hash Cond: ("outer".id = "inner".id) -> Seq Scan on person p (cost=0.00..153.01 rows=3701 width=52) -> Hash (cost=61.26..61.26 rows=3626 width=8) -> Seq Scan on student s (cost=0.00..61.26 rows=3626 width=8)
Join on a primary key (indexed) attribute:
ass2=# explain anaylze ass2-# select s.sid,p.name from Student s, Person p where s.id=p.id; QUERY PLAN ------------------------------------------------------------------------- Hash Join (cost=70.33..305.86 rows=3626 width=52) (actual time=11.680..28.242 rows=3626 loops=1) Hash Cond: ("outer".id = "inner".id) -> Seq Scan on person p (cost=0.00..153.01 rows=3701 width=52) (actual time=0.039..5.976 rows=3701 loops=1) -> Hash (cost=61.26..61.26 rows=3626 width=8) (actual time=11.615..11.615 rows=3626 loops=1) -> Seq Scan on student s (cost=0.00..61.26 rows=3626 width=8) (actual time=0.005..5.731 rows=3626 loops=1) Total runtime: 32.374 ms
Produced: 13 Sep 2020